Mastering Azure Analytics: Architecting in the Cloud with Azure Data Lake, HDInsight, and Spark

by Zoiner Tejada

eBook

$29.49 (original price $38.99; save 24%)

Overview

Microsoft Azure has over 20 platform-as-a-service (PaaS) offerings that can support a big data analytics solution. So which one is right for your project? This practical book helps you understand the breadth of Azure services by organizing them into a reference framework you can use when crafting your own big data analytics solution.

You’ll not only be able to determine which service best fits the job, but also learn how to implement a complete solution that scales, provides human fault tolerance, and supports future needs.

  • Understand the fundamental patterns of the data lake and lambda architecture
  • Recognize the canonical steps in the analytics data pipeline and learn how to use Azure Data Factory to orchestrate them
  • Implement data lakes and lambda architectures, using Azure Data Lake Store, Data Lake Analytics, HDInsight (including Spark), Stream Analytics, SQL Data Warehouse, and Event Hubs
  • Understand where Azure Machine Learning fits into your analytics pipeline
  • Gain experience using these services on real-world data that has real-world problems, with scenarios ranging from aviation to Internet of Things (IoT)

Product Details

ISBN-13: 9781491956601
Publisher: O'Reilly Media, Incorporated
Publication date: 04/06/2017
Sold by: Barnes & Noble
Format: eBook
Pages: 412
File size: 18 MB
Note: This product may take a few minutes to download.

About the Author

Zoiner Tejada has more than 17 years of experience consulting in the software industry as a software architect, CTO, and start-up CEO, with particular expertise in cloud computing, big data, analytics, and machine learning. He was among the first to receive a Microsoft Azure MVP (“Most Valuable Professional”) designation, has since been awarded the MVP for five consecutive years, and now holds a dual MVP in Microsoft Azure and Microsoft Data Platform. He received his BS in computer science from Stanford University.

Zoiner is the author of Mastering Azure Analytics, published by O'Reilly, which covers a broad range of analytics solutions, from real-time processing with Storm to interactive and batch processing with Spark, the application of machine learning, and many other data- and analytics-related Azure services. He is also co-author of Exam Ref 70-532: Programming Microsoft’s Clouds (the official exam study guide for developers seeking Azure certification), co-author of Developing Microsoft Azure Solutions, and creator of the “Google Analytics Fundamentals” course on Pluralsight.com.

Table of Contents

Foreword vii

Preface ix

1 Enterprise Analytics Fundamentals 1

The Analytics Data Pipeline 1

Data Lakes 2

Lambda Architecture 3

Kappa Architecture 5

Choosing Between Lambda and Kappa 6

The Azure Analytics Pipeline 6

Introducing the Analytics Scenarios 9

Example Code and Example Data Sets 11

What You Will Need 11

Broadband Internet Connectivity 11

Azure Subscription 11

Visual Studio 2015 with Update 1 11

Azure SDK 2.8 or Later 15

Summary 16

2 Getting Data into Azure 17

Ingest Loading Layer 17

Bulk Data Loading 19

Disk Shipping 19

End User Tools 35

Network-Oriented Approaches 52

Stream Loading 74

Stream Loading with Event Hubs 75

Summary 76

3 Storing Ingested Data in Azure 77

File-Oriented Storage 77

Blob Storage 79

Azure Data Lake Store 84

HDFS 90

Queue-Oriented Storage 94

Blue Yonder Scenario: Smart Buildings 95

Event Hubs 96

IoT Hub 111

Summary 122

4 Real-Time Processing in Azure 123

Stream Processing 123

Consuming Messages from Event Hubs 125

Tuple-at-a-Time Processing in Azure 129

Introducing HDInsight 129

Storm on HDInsight 129

EventProcessorHost 170

Azure Machine Learning 174

Summary 174

5 Real-Time Micro-Batch Processing in Azure 175

Micro-Batch Processing in Azure 175

Spark Streaming on HDInsight 175

Storm on HDInsight 192

Azure Stream Analytics 199

Summary 206

6 Batch Processing in Azure 207

Batch Processing with MapReduce on HDInsight 209

Apache Hadoop MapReduce 210

Batch Processing with Hive on HDInsight 213

Internal and External Tables 214

Partitioning Tables 214

Views 215

Indexes 215

Databases 216

Using Hive on HDInsight 216

Storage on HDInsight 218

Batch Processing Blue Yonder Airports Data 219

Creating an External Table 220

Creating an Internal Table 225

Batch Processing with Pig on HDInsight 228

Batch Processing with Spark on HDInsight 229

Batch Processing Blue Yonder Airports Data 232

Creating an External Table 233

Batch Processing with SQL Data Warehouse 237

Using SQL Data Warehouse 240

Batch Processing Blue Yonder Airports Data 240

Storing the Credentials to Azure Storage 241

Batch Processing with Data Lake Analytics 247

Using Data Lake Analytics 249

Batch Processing Blue Yonder Airports Data 250

Processing with U-SQL 250

Batch Processing with Azure Batch 258

Orchestrating Batch Processing Pipelines with Azure Data Factory 259

Summary 260

7 Interactive Querying in Azure 261

Interactive Querying with Azure SQL Data Warehouse 263

Partitions and Distributions 263

Indexes 265

Interactive Exploration of the Blue Yonder Airports Data 266

Interactive Querying with Hive and Tez 269

Indexes 271

Partitions 271

Interactive Exploration of the Blue Yonder Airports Data 271

Interactive Querying with Spark SQL 278

Indexes 278

Partitions 278

Interactive Exploration of the Blue Yonder Airports Data 279

Interactive Querying with U-SQL 283

Interactive Exploration of the Blue Yonder Airports Data 283

Summary 285

8 Hot and Cold Path Serving Layer in Azure 287

Azure Redis Cache 290

Redis in the Speed Serving Layer 291

Document DB 296

Document DB in the Speed Serving Layer 299

Document DB in the Batch Serving Layer 302

SQL Database 303

SQL Database in the Speed Serving Layer 305

SQL Database in the Batch Serving Layer 311

SQL Data Warehouse 311

HBase on HDInsight 312

Azure Search 317

Summary 318

9 Intelligence and Machine Learning 319

Azure Machine Learning 322

R Server on HDInsight 324

SQL R Services 325

Microsoft Cognitive Services 326

Summary 338

10 Managing Metadata in Azure 339

Managing Metadata with Azure Data Catalog 339

Data Catalog in the Blue Yonder Airports Scenario 342

Add an Azure Data Lake Store Asset 344

Add Azure Storage Blobs 347

Add a SQL Data Warehouse 352

Summary 355

11 Protecting Your Data in Azure 357

Identity and Access Management 357

Data Protection 359

Auditing 361

Summary 362

12 Performing Analytics 363

Analytics with Power BI 363

Real-Time Power BI in the Blue Yonder Scenario 365

Batch Analytics Reporting with Power BI in the Blue Yonder Scenario 374

A Look Ahead 378

Real Time 378

Lower Batch Latencies 379

IoT 379

Security 379

More Linux 379

Index 381