![Designing Cloud Data Platforms](http://img.images-bn.com/static/redesign/srcs/images/grey-box.png?v11.9.4)
Overview
Summary
Centralized data warehouses, the long-time de facto standard for housing data for analytics, are rapidly giving way to multi-faceted cloud data platforms. Companies that embrace modern cloud data platforms benefit from an integrated view of their business using all of their data and can take advantage of advanced analytic practices to drive predictions and as-yet-unimagined data services. Designing Cloud Data Platforms is a hands-on guide to envisioning and designing a modern scalable data platform that takes full advantage of the flexibility of the cloud. As you read, you’ll learn the core components of a cloud data platform design, along with the role of key technologies like Spark and Kafka Streams. You’ll also explore setting up processes to manage cloud-based data, keep it secure, and analyze it with advanced analytic and BI tools.
Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.
About the technology
Well-designed pipelines, storage systems, and APIs eliminate the complicated scaling and maintenance required with on-prem data centers. Once you learn the patterns for designing cloud data platforms, you’ll maximize performance no matter which cloud vendor you use.
About the book
In Designing Cloud Data Platforms, Danil Zburivsky and Lynda Partner reveal a six-layer approach that increases flexibility and reduces costs. Discover patterns for ingesting data from a variety of sources, then learn to harness pre-built services provided by cloud vendors.
What's inside
Best practices for structured and unstructured data sets
Cloud-ready machine learning tools
Metadata and real-time analytics
Defensive architecture, access, and security
About the reader
For data professionals familiar with the basics of cloud computing and with Hadoop or Spark.
About the author
Danil Zburivsky has over 10 years of experience designing and supporting large-scale data infrastructure for enterprises across the globe. Lynda Partner is the VP of Analytics-as-a-Service at Pythian, and has been on the business side of data for over 20 years.
Table of Contents
1 Introducing the data platform
2 Why a data platform and not just a data warehouse
3 Getting bigger and leveraging the Big 3: Amazon, Microsoft Azure, and Google
4 Getting data into the platform
5 Organizing and processing data
6 Real-time data processing and analytics
7 Metadata layer architecture
8 Schema management
9 Data access and security
10 Fueling business value with data platforms
Product Details
ISBN-13: | 9781638350965 |
---|---|
Publisher: | Manning |
Publication date: | 03/17/2021 |
Sold by: | SIMON & SCHUSTER |
Format: | eBook |
Pages: | 336 |
File size: | 11 MB |
Note: | This product may take a few minutes to download. |
Table of Contents
Preface xi
Acknowledgments xiii
About this book xv
About the authors xviii
About the cover illustration xix
1 Introducing the data platform 1
1.1 The trends behind the change from data warehouses to data platforms 2
1.2 Data warehouses struggle with data variety, volume, and velocity 3
Variety 4
Volume 5
Velocity 5
All the V's at once 6
1.3 Data lakes to the rescue? 6
1.4 Along came the cloud 7
1.5 Cloud, data lakes, and data warehouses: The emergence of cloud data platforms 9
1.6 Building blocks of a cloud data platform 10
Ingestion layer 10
Storage layer 11
Processing layer 12
Serving layer 13
1.7 How the cloud data platform deals with the three V's 14
Variety 14
Volume 15
Velocity 15
Two more V's 16
1.8 Common use cases 16
2 Why a data platform and not just a data warehouse 18
2.1 Cloud data platforms and cloud data warehouses: The practical aspects 19
A closer look at the data sources 20
An example cloud data warehouse-only architecture 22
An example cloud data platform architecture 23
2.2 Ingesting data 24
Ingesting data directly into Azure Synapse 25
Ingesting data into an Azure data platform 26
Managing changes in upstream data sources 26
2.3 Processing data 28
Processing data in the warehouse 29
Processing data in the data platform 31
2.4 Accessing data 33
2.5 Cloud cost considerations 34
2.6 Exercise answers 36
3 Getting bigger and leveraging the Big 3: Amazon, Microsoft Azure, and Google 37
3.1 Cloud data platform layered architecture 38
Data ingestion layer 40
Fast and slow storage 44
Processing layer 46
Technical metadata layer 47
The serving layer and data consumers 49
Orchestration and ETL overlay layers 53
3.2 The importance of layers in a data platform architecture 59
3.3 Mapping cloud data platform layers to specific tools 60
AWS 62
Google Cloud 66
Azure 70
3.4 Open source and commercial alternatives 74
Batch data ingestion 74
Streaming data ingestion and real-time analytics 75
Orchestration layer 75
3.5 Exercise answers 77
4 Getting data into the platform 78
4.1 Databases, files, APIs, and streams 79
Relational databases 80
Files 81
SaaS data via API 82
Streams 82
4.2 Ingesting data from relational databases 83
Ingesting data from RDBMSs using a SQL interface 84
Full-table ingestion 86
Incremental table ingestion 91
Change data capture (CDC) 94
CDC vendors overview 98
Data type conversion 100
Ingesting data from NoSQL databases 103
Capturing important metadata for RDBMS or NoSQL ingestion pipelines 104
4.3 Ingesting data from files 107
Tracking ingested files 109
Capturing file ingestion metadata 112
4.4 Ingesting data from streams 114
Differences between batch and streaming ingestion 117
Capturing streaming pipeline metadata 119
4.5 Ingesting data from SaaS applications 120
No standard approach to API design 121
No standard way to deal with full vs. incremental data exports 122
Resulting data is typically highly nested JSON 122
4.6 Network and security considerations for data ingestion into the cloud 123
Connecting other networks to your cloud data platform 123
4.7 Exercise answers 126
5 Organizing and processing data 127
5.1 Processing as a separate layer in the data platform 129
5.2 Data processing stages 131
5.3 Organizing your cloud storage 132
Cloud storage containers and folders 134
5.4 Common data processing steps 140
File format conversion 140
Data deduplication 145
Data quality checks 150
5.5 Configurable pipelines 152
5.6 Exercise answers 155
6 Real-time data processing and analytics 156
6.1 Real-time ingestion vs. real-time processing 157
6.2 Use cases for real-time data processing 160
Retail use case: Real-time ingestion 160
Online gaming use case: Real-time ingestion and real-time processing 161
Summary of real-time ingestion vs. real-time processing 164
6.3 When should you use real-time ingestion and/or real-time processing? 164
6.4 Organizing data for real-time use 167
The anatomy of fast storage 167
How does fast storage scale? 170
Organizing data in the real-time storage 172
6.5 Common data transformations in real time 178
Causes of duplicates in real-time systems 178
Deduplicating data in real-time systems 181
Converting message formats in real-time pipelines 186
Real-time data quality checks 187
Combining batch and real-time data 188
6.6 Cloud services for real-time data processing 190
AWS real-time processing services 190
Google Cloud real-time processing services 192
Azure real-time processing services 193
6.7 Exercise answers 195
7 Metadata layer architecture 197
7.1 What we mean by metadata 198
Business metadata 198
Data platform internal metadata or "pipeline metadata" 199
7.2 Taking advantage of pipeline metadata 199
7.3 Metadata model 203
Metadata domains 204
7.4 Metadata layer implementation options 213
Metadata layer as a collection of configuration files 214
Metadata database 217
Metadata API 218
7.5 Overview of existing solutions 220
Cloud metadata services 221
Open source metadata layer implementations 223
7.6 Exercise answers 227
8 Schema management 228
8.1 Why schema management 229
Schema changes in a traditional data warehouse architecture 230
Schema-on-read approach 231
8.2 Schema-management approaches 232
Schema as a contract 233
Schema management in the data platform 235
Monitoring schema changes 241
8.3 Schema Registry Implementation 243
Apache Avro schemas 243
Existing Schema Registry implementations 245
Schema Registry as part of a Metadata layer 246
8.4 Schema evolution scenarios 248
Schema compatibility rules 249
Schema evolution and data transformation pipelines 251
8.5 Schema evolution and data warehouses 255
Schema-management features of cloud data warehouses 257
8.6 Exercise answers 260
9 Data access and security 261
9.1 Different types of data consumers 262
9.2 Cloud data warehouses 263
AWS Redshift 264
Azure Synapse 268
Google BigQuery 270
Choosing the right data warehouse 273
9.3 Application data access 274
Cloud relational databases 275
Cloud key/value data stores 276
Full-text search services 277
In-memory cache 278
9.4 Machine learning on the data platform 278
Machine learning model lifecycle on a cloud data platform 279
ML cloud collaboration tools 282
9.5 Business intelligence and reporting tools 283
Traditional BI tools and cloud data platform integration 283
Using Excel as a BI tool 284
BI tools that are external to the cloud provider 284
9.6 Data security 285
Users, groups, and roles 285
Credentials and configuration management 286
Data encryption 286
Network boundaries 287
9.7 Exercise answers 288
10 Fueling business value with data platforms 289
10.1 Why you need a data strategy 290
10.2 The analytics maturity journey 291
SEE: Getting insights from data 292
PREDICT: Using data to predict what to do 293
DO: Making your analytics actionable 294
CREATE: Going beyond analytics into products 295
10.3 The data platform: The engine that powers analytics maturity 296
10.4 Platform project stoppers 297
Time does indeed kill 297
User adoption 298
User trust and the need for data governance 299
Operating in a platform silo 300
The dollar dance 301
Index 304