Designing Cloud Data Platforms

eBook

$43.99

Available on Compatible NOOK devices, the free NOOK App and in My Digital Library.

Overview

In Designing Cloud Data Platforms, Danil Zburivsky and Lynda Partner reveal a six-layer approach that increases flexibility and reduces costs. Discover patterns for ingesting data from a variety of sources, then learn to harness pre-built services provided by cloud vendors.

Summary
Centralized data warehouses, the long-time de facto standard for housing data for analytics, are rapidly giving way to multi-faceted cloud data platforms. Companies that embrace modern cloud data platforms benefit from an integrated view of their business using all of their data and can take advantage of advanced analytic practices to drive predictions and as-yet-unimagined data services. Designing Cloud Data Platforms is a hands-on guide to envisioning and designing a modern scalable data platform that takes full advantage of the flexibility of the cloud. As you read, you’ll learn the core components of a cloud data platform design, along with the role of key technologies like Spark and Kafka Streams. You’ll also explore setting up processes to manage cloud-based data, keeping it secure, and using advanced analytic and BI tools to analyze it.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology
Well-designed pipelines, storage systems, and APIs eliminate the complicated scaling and maintenance required with on-prem data centers. Once you learn the patterns for designing cloud data platforms, you’ll maximize performance no matter which cloud vendor you use.

What's inside
    Best practices for structured and unstructured data sets
    Cloud-ready machine learning tools
    Metadata and real-time analytics
    Defensive architecture, access, and security

About the reader
For data professionals familiar with the basics of cloud computing, and Hadoop or Spark.

About the author
Danil Zburivsky has over 10 years of experience designing and supporting large-scale data infrastructure for enterprises across the globe. Lynda Partner is the VP of Analytics-as-a-Service at Pythian, and has been on the business side of data for over 20 years.

Table of Contents
1 Introducing the data platform
2 Why a data platform and not just a data warehouse
3 Getting bigger and leveraging the Big 3: Amazon, Microsoft Azure, and Google
4 Getting data into the platform
5 Organizing and processing data
6 Real-time data processing and analytics
7 Metadata layer architecture
8 Schema management
9 Data access and security
10 Fueling business value with data platforms

Product Details

ISBN-13:	9781638350965
Publisher:	Manning
Publication date:	03/17/2021
Sold by:	SIMON & SCHUSTER
Format:	eBook
Pages:	336
File size:	11 MB
Note:	This product may take a few minutes to download.

Preface

Acknowledgments

About this book

About the authors

About the cover illustration

1 Introducing the data platform

1.1 The trends behind the change from data warehouses to data platforms

1.2 Data warehouses struggle with data variety, volume, and velocity

Variety

Volume

Velocity

All the V's at once

1.3 Data lakes to the rescue?

1.4 Along came the cloud

1.5 Cloud, data lakes, and data warehouses: The emergence of cloud data platforms

1.6 Building blocks of a cloud data platform

Ingestion layer

Storage layer

Processing layer

Serving layer

1.7 How the cloud data platform deals with the three V's

Variety

Volume

Velocity

Two more V's

1.8 Common use cases

2 Why a data platform and not just a data warehouse

2.1 Cloud data platforms and cloud data warehouses: The practical aspects

A closer look at the data sources

An example cloud data warehouse-only architecture

An example cloud data platform architecture

2.2 Ingesting data

Ingesting data directly into Azure Synapse

Ingesting data into an Azure data platform

Managing changes in upstream data sources

2.3 Processing data

Processing data in the warehouse

Processing data in the data platform

2.4 Accessing data

2.5 Cloud cost considerations

2.6 Exercise answers

3 Getting bigger and leveraging the Big 3: Amazon, Microsoft Azure, and Google

3.1 Cloud data platform layered architecture

Data ingestion layer

Fast and slow storage

Processing layer

Technical metadata layer

The serving layer and data consumers

Orchestration and ETL overlay layers

3.2 The importance of layers in a data platform architecture

3.3 Mapping cloud data platform layers to specific tools

AWS

Google Cloud

Azure

3.4 Open source and commercial alternatives

Batch data ingestion

Streaming data ingestion and real-time analytics

Orchestration layer

3.5 Exercise answers

4 Getting data into the platform

4.1 Databases, files, APIs, and streams

Relational databases

Files

SaaS data via API

Streams

4.2 Ingesting data from relational databases

Ingesting data from RDBMSs using a SQL interface

Full-table ingestion

Incremental table ingestion

Change data capture (CDC)

CDC vendors overview

Data type conversion

Ingesting data from NoSQL databases

Capturing important metadata for RDBMS or NoSQL ingestion pipelines

4.3 Ingesting data from files

Tracking ingested files

Capturing file ingestion metadata

4.4 Ingesting data from streams

Differences between batch and streaming ingestion

Capturing streaming pipeline metadata

4.5 Ingesting data from SaaS applications

No standard approach to API design

No standard way to deal with full vs. incremental data exports

Resulting data is typically highly nested JSON

4.6 Network and security considerations for data ingestion into the cloud

Connecting other networks to your cloud data platform

4.7 Exercise answers

5 Organizing and processing data

5.1 Processing as a separate layer in the data platform

5.2 Data processing stages

5.3 Organizing your cloud storage

Cloud storage containers and folders

5.4 Common data processing steps

File format conversion

Data deduplication

Data quality checks

5.5 Configurable pipelines

5.6 Exercise answers

6 Real-time data processing and analytics

6.1 Real-time ingestion vs. real-time processing

6.2 Use cases for real-time data processing

Retail use case: Real-time ingestion

Online gaming use case: Real-time ingestion and real-time processing

Summary of real-time ingestion vs. real-time processing

6.3 When should you use real-time ingestion and/or real-time processing?

6.4 Organizing data for real-time use

The anatomy of fast storage

How does fast storage scale?

Organizing data in the real-time storage

6.5 Common data transformations in real time

Causes of duplicates in real-time systems

Deduplicating data in real-time systems

Converting message formats in real-time pipelines

Real-time data quality checks

Combining batch and real-time data

6.6 Cloud services for real-time data processing

AWS real-time processing services

Google Cloud real-time processing services

Azure real-time processing services

6.7 Exercise answers

7 Metadata layer architecture

7.1 What we mean by metadata

Business metadata

Data platform internal metadata or "pipeline metadata"

7.2 Taking advantage of pipeline metadata

7.3 Metadata model

Metadata domains

7.4 Metadata layer implementation options

Metadata layer as a collection of configuration files

Metadata database

Metadata API

7.5 Overview of existing solutions

Cloud metadata services

Open source metadata layer implementations

7.6 Exercise answers

8 Schema management

8.1 Why schema management

Schema changes in a traditional data warehouse architecture

Schema-on-read approach

8.2 Schema-management approaches

Schema as a contract

Schema management in the data platform

Monitoring schema changes

8.3 Schema Registry Implementation

Apache Avro schemas

Existing Schema Registry implementations

Schema Registry as part of a Metadata layer

8.4 Schema evolution scenarios

Schema compatibility rules

Schema evolution and data transformation pipelines

8.5 Schema evolution and data warehouses

Schema-management features of cloud data warehouses

8.6 Exercise answers

9 Data access and security

9.1 Different types of data consumers

9.2 Cloud data warehouses

AWS Redshift

Azure Synapse

Google BigQuery

Choosing the right data warehouse

9.3 Application data access

Cloud relational databases

Cloud key/value data stores

Full-text search services

In-memory cache

9.4 Machine learning on the data platform

Machine learning model lifecycle on a cloud data platform

ML cloud collaboration tools

9.5 Business intelligence and reporting tools

Traditional BI tools and cloud data platform integration

Using Excel as a BI tool

BI tools that are external to the cloud provider

9.6 Data security

Users, groups, and roles

Credentials and configuration management

Data encryption

Network boundaries

9.7 Exercise answers

10 Fueling business value with data platforms

10.1 Why you need a data strategy

10.2 The analytics maturity journey

SEE: Getting insights from data

PREDICT: Using data to predict what to do

DO: Making your analytics actionable

CREATE: Going beyond analytics into products

10.3 The data platform: The engine that powers analytics maturity

10.4 Platform project stoppers

Time does indeed kill

User adoption

User trust and the need for data governance

Operating in a platform silo

The dollar dance

Index