Table of Contents
Preface ix
1 Introduction to Data Lakes 1
Data Lake Maturity 3
Data Puddles 5
Data Ponds 6
Creating a Successful Data Lake 7
The Right Platform 7
The Right Data 8
The Right Interface 9
The Data Swamp 11
Roadmap to Data Lake Success 12
Standing Up a Data Lake 13
Organizing the Data Lake 14
Setting Up the Data Lake for Self-Service 15
Data Lake Architectures 20
Data Lakes in the Public Cloud 20
Logical Data Lakes 21
Conclusion 24
2 Historical Perspective 25
The Drive for Self-Service Data: The Birth of Databases 25
The Analytics Imperative: The Birth of Data Warehousing 28
The Data Warehouse Ecosystem 29
Storing and Querying the Data 31
Loading the Data: Data Integration Tools 37
Organizing and Managing the Data 41
Consuming the Data 46
Conclusion 47
3 Introduction to Big Data and Data Science 49
Hadoop Leads the Historic Shift to Big Data 50
The Hadoop File System 50
How Processing and Storage Interact in a MapReduce Job 51
Schema on Read 53
Hadoop Projects 53
Data Science 55
What Should Your Analytics Organization Focus On? 56
Machine Learning 59
Explainability 60
Change Management 61
Conclusion 62
4 Starting a Data Lake 63
The What and Why of Hadoop 63
Preventing Proliferation of Data Puddles 66
Taking Advantage of Big Data 67
Leading with Data Science 67
Strategy 1: Offload Existing Functionality 70
Strategy 2: Data Lakes for New Projects 71
Strategy 3: Establish a Central Point of Governance 72
Which Way Is Right for You? 73
Conclusion 74
5 From Data Ponds/Big Data Warehouses to Data Lakes 75
Essential Functions of a Data Warehouse 76
Dimensional Modeling for Analytics 77
Integrating Data from Disparate Sources 78
Preserving History Using Slowly Changing Dimensions 78
Limitations of the Data Warehouse as a Historical Repository 78
Moving to a Data Pond 79
Keeping History in a Data Pond 79
Implementing Slowly Changing Dimensions in a Data Pond 81
Growing Data Ponds into a Data Lake: Loading Data That's Not in the Data Warehouse 83
Raw Data 83
External Data 84
Internet of Things (IoT) and Other Streaming Data 86
Real-Time Data Lakes 87
The Lambda Architecture 89
Data Transformations 90
Target Systems 92
Data Warehouses 93
Operational Data Stores 93
Real-Time Applications and Data Products 93
Conclusion 95
6 Optimizing for Self-Service 97
The Beginnings of Self-Service 98
Business Analysts 100
Finding and Understanding Data: Documenting the Enterprise 101
Establishing Trust 103
Provisioning 110
Preparing Data for Analysis 112
Data Wrangling in the Data Lake 113
Situating Data Preparation in Hadoop 113
Common Use Cases for Data Preparation 114
Analyzing and Visualizing 116
The New World of Self-Service Business Intelligence 116
The New Analytic Workflow 117
Gatekeepers to Shopkeepers 118
Governing Self-Service 119
Conclusion 120
7 Architecting the Data Lake 121
Organizing the Data Lake 121
Landing or Raw Zone 123
Gold Zone 123
Work Zone 125
Sensitive Zone 125
Multiple Data Lakes 127
Advantages of Keeping Data Lakes Separate 127
Advantages of Merging the Data Lakes 128
Cloud Data Lakes 129
Virtual Data Lakes 131
Data Federation 131
Big Data Virtualization 132
Eliminating Redundancy 134
Conclusion 136
8 Cataloging the Data Lake 137
Organizing the Data 137
Technical Metadata 138
Business Metadata 143
Tagging 145
Automated Cataloging 146
Logical Data Management 147
Sensitive Data Management and Access Control 147
Data Quality 149
Relating Disparate Data 151
Establishing Lineage 152
Data Provisioning 153
Tools for Building a Catalog 154
Tool Comparison 155
The Data Ocean 156
Conclusion 156
9 Governing Data Access 157
Authorization or Access Control 158
Tag-Based Data Access Policies 159
Deidentifying Sensitive Data 162
Data Sovereignty and Regulatory Compliance 165
Self-Service Access Management 167
Provisioning Data 171
Conclusion 177
10 Industry-Specific Perspectives 179
Big Data in Financial Services 180
Consumers, Digitization, and Data Are Changing Finance as We Know It 180
Saving the Bank 182
New Opportunities Offered by New Data 185
Key Processes in Making Use of the Data Lake 188
Value Added by Data Lakes in Financial Services 190
Data Lakes in the Insurance Industry 192
Smart Cities 193
Big Data in Medicine 195
Index 197