The Enterprise Big Data Lake: Delivering the Promise of Big Data and Data Science

The data lake is a daring new approach for harnessing the power of big data technology and providing convenient self-service capabilities. But is it right for your company? This book is based on discussions with practitioners and executives from more than a hundred organizations, ranging from data-driven companies such as Google, LinkedIn, and Facebook, to governments and traditional corporate enterprises. You’ll learn what a data lake is, why enterprises need one, and how to build one successfully with the best practices in this book.

Alex Gorelik, CTO and founder of Waterline Data, explains why old systems and processes can no longer support data needs in the enterprise. Then, in a collection of essays about data lake implementation, you’ll examine data lake initiatives, analytic projects, experiences, and best practices from data experts working in various industries.

  • Get a succinct introduction to data warehousing, big data, and data science
  • Learn various paths enterprises take to build a data lake
  • Explore how to build a self-service model and best practices for providing analysts access to the data
  • Use different methods for architecting your data lake
  • Discover ways to implement a data lake from experts in different industries

by Alex Gorelik


Product Details

ISBN-13: 9781491931509
Publisher: O'Reilly Media, Incorporated
Publication date: 02/21/2019
Sold by: Barnes & Noble
Format: eBook
Pages: 224
File size: 9 MB

About the Author

Alex Gorelik is CTO and founder of Waterline Data and the founder of three startups. At Informatica, he served as GM of the Data Quality Business Unit and, as SVP of R&D for Core Technology, managed a team of 400 engineers and product managers developing the company’s platform and data integration technology. Alex was also an IBM Distinguished Engineer and the co-founder, CTO, and VP of Engineering at Exeros and at Acta Technology (acquired by Business Objects and now marketed as SAP Business Objects Data Services). Prior to founding Acta, he managed development of Replication Server at Sybase and worked on Sybase’s strategy for enterprise application integration (EAI). Earlier, he developed the database kernel for Amdahl’s Design Automation group. Alex holds a B.S. in Computer Science from Columbia University School of Engineering and an M.S. in Computer Science from Stanford University.

Table of Contents

Preface

1 Introduction to Data Lakes

Data Lake Maturity

Data Puddles

Data Ponds

Creating a Successful Data Lake

The Right Platform

The Right Data

The Right Interface

The Data Swamp

Roadmap to Data Lake Success

Standing Up a Data Lake

Organizing the Data Lake

Setting Up the Data Lake for Self-Service

Data Lake Architectures

Data Lakes in the Public Cloud

Logical Data Lakes

Conclusion

2 Historical Perspective

The Drive for Self-Service Data: The Birth of Databases

The Analytics Imperative: The Birth of Data Warehousing

The Data Warehouse Ecosystem

Storing and Querying the Data

Loading the Data: Data Integration Tools

Organizing and Managing the Data

Consuming the Data

Conclusion

3 Introduction to Big Data and Data Science

Hadoop Leads the Historic Shift to Big Data

The Hadoop File System

How Processing and Storage Interact in a MapReduce Job

Schema on Read

Hadoop Projects

Data Science

What Should Your Analytics Organization Focus On?

Machine Learning

Explainability

Change Management

Conclusion

4 Starting a Data Lake

The What and Why of Hadoop

Preventing Proliferation of Data Puddles

Taking Advantage of Big Data

Leading with Data Science

Strategy 1: Offload Existing Functionality

Strategy 2: Data Lakes for New Projects

Strategy 3: Establish a Central Point of Governance

Which Way Is Right for You?

Conclusion

5 From Data Ponds/Big Data Warehouses to Data Lakes

Essential Functions of a Data Warehouse

Dimensional Modeling for Analytics

Integrating Data from Disparate Sources

Preserving History Using Slowly Changing Dimensions

Limitations of the Data Warehouse as a Historical Repository

Moving to a Data Pond

Keeping History in a Data Pond

Implementing Slowly Changing Dimensions in a Data Pond

Growing Data Ponds into a Data Lake: Loading Data That's Not in the Data Warehouse

Raw Data

External Data

Internet of Things (IoT) and Other Streaming Data

Real-Time Data Lakes

The Lambda Architecture

Data Transformations

Target Systems

Data Warehouses

Operational Data Stores

Real-Time Applications and Data Products

Conclusion

6 Optimizing for Self-Service

The Beginnings of Self-Service

Business Analysts

Finding and Understanding Data: Documenting the Enterprise

Establishing Trust

Provisioning

Preparing Data for Analysis

Data Wrangling in the Data Lake

Situating Data Preparation in Hadoop

Common Use Cases for Data Preparation

Analyzing and Visualizing

The New World of Self-Service Business Intelligence

The New Analytic Workflow

Gatekeepers to Shopkeepers

Governing Self-Service

Conclusion

7 Architecting the Data Lake

Organizing the Data Lake

Landing or Raw Zone

Gold Zone

Work Zone

Sensitive Zone

Multiple Data Lakes

Advantages of Keeping Data Lakes Separate

Advantages of Merging the Data Lakes

Cloud Data Lakes

Virtual Data Lakes

Data Federation

Big Data Visualization

Eliminating Redundancy

Conclusion

8 Cataloging the Data Lake

Organizing the Data

Technical Metadata

Business Metadata

Tagging

Automated Cataloging

Logical Data Management

Sensitive Data Management and Access Control

Data Quality

Relating Disparate Data

Establishing Lineage

Data Provisioning

Tools for Building a Catalog

Tool Comparison

The Data Ocean

Conclusion

9 Governing Data Access

Authorization or Access Control

Tag-Based Data Access Policies

Deidentifying Sensitive Data

Data Sovereignty and Regulatory Compliance

Self-Service Access Management

Provisioning Data

Conclusion

10 Industry-Specific Perspectives

Big Data in Financial Services

Consumers, Digitization, and Data Are Changing Finance as We Know It

Saving the Bank

New Opportunities Offered by New Data

Key Processes in Making Use of the Data Lake

Value Added by Data Lakes in Financial Services

Data Lakes in the Insurance Industry

Smart Cities

Big Data in Medicine

Index
