High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark

eBook

$29.99 (original price $39.99; save 25%)

Overview

Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.

Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing.

With this book, you’ll explore:

  • How Spark SQL’s new interfaces improve performance over Spark’s RDD data structure (see the first sketch after this list)
  • The choice between data joins in Core Spark and Spark SQL
  • Techniques for getting the most out of standard RDD transformations
  • How to work around performance issues in Spark’s key/value pair paradigm (see the second sketch after this list)
  • Writing high-performance Spark code without Scala or the JVM
  • How to test for functionality and performance when applying suggested improvements
  • Using Spark MLlib and Spark ML machine learning libraries
  • Spark’s Streaming components and external community packages
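
To ground the first bullet: DataFrames and Datasets give Spark a declarative plan that the Catalyst optimizer can rewrite and that Tungsten can execute over a compact binary format (see the Tungsten and Query Optimizer entries in Chapter 3's table of contents below), whereas RDD code is opaque lambdas over JVM objects. A minimal sketch, using toy data and hypothetical column names rather than anything from the book, of the same per-key average written both ways:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.avg

    val spark = SparkSession.builder()
      .appName("df-vs-rdd-sketch")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for a real table of (key, value) records.
    val df = Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)).toDF("key", "value")

    // RDD version: Spark sees only opaque functions over JVM objects.
    val rddAvg = df.rdd
      .map(row => (row.getString(0), (row.getDouble(1), 1L)))
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum / count }

    // DataFrame version: a declarative plan Catalyst can optimize end to end.
    val dfAvg = df.groupBy("key").agg(avg("value"))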
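
For the key/value bullet, the trade-off behind Chapter 6's "What's So Dangerous About the groupByKey Function" section is that groupByKey ships every value for a key to a single executor, while a combining aggregation such as aggregateByKey shuffles only one partial result per key per partition. A minimal sketch, reusing the toy SparkSession and data shapes from above:

    // Toy (key, value) pairs; in practice this would be a large dataset.
    val pairs = spark.sparkContext.parallelize(
      Seq(("a", 1.0), ("b", 2.0), ("a", 3.0), ("b", 4.0)))

    // Risky: buffers every value for a key in memory before averaging.
    val viaGroup = pairs.groupByKey().mapValues(vs => vs.sum / vs.size)

    // Safer: combines map-side, shuffling one (sum, count) pair per key
    // per partition instead of all the raw values.
    val viaAggregate = pairs
      .aggregateByKey((0.0, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1), // fold one value into the accumulator
        (l, r) => (l._1 + r._1, l._2 + r._2)) // merge per-partition accumulators
      .mapValues { case (sum, count) => sum / count }

Both versions return the same per-key averages; the difference is how much data crosses the network and sits in executor memory.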

Product Details

ISBN-13: 9781491943151
Publisher: O'Reilly Media, Incorporated
Publication date: 05/25/2017
Sold by: Barnes & Noble
Format: eBook
Pages: 358
File size: 4 MB

About the Author

Holden Karau is a transgender Canadian and an active open source contributor. When not in San Francisco working as a software development engineer at IBM's Spark Technology Center, Holden speaks internationally on Apache Spark and holds office hours at coffee shops at home and abroad. She is a Spark committer with frequent contributions, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software she enjoys playing with fire, welding, scooters, poutine, and dancing.

Rachel Warren is a data scientist and software engineer at Alpine Data Labs, where she uses Spark to address real-world data processing challenges. She has experience working as an analyst in both industry and academia. She graduated with a degree in Computer Science from Wesleyan University in Connecticut.

Table of Contents

Preface ix

1 Introduction to High Performance Spark 1
  What Is Spark and Why Performance Matters 1
  What You Can Expect to Get from This Book 2
  Spark Versions 3
  Why Scala? 3
  To Be a Spark Expert You Have to Learn a Little Scala Anyway 3
  The Spark Scala API Is Easier to Use Than the Java API 4
  Scala Is More Performant Than Python 4
  Why Not Scala? 4
  Learning Scala 5
  Conclusion 6

2 How Spark Works 7
  How Spark Fits into the Big Data Ecosystem 8
  Spark Components 8
  Spark Model of Parallel Computing: RDDs 10
  Lazy Evaluation 11
  In-Memory Persistence and Memory Management 13
  Immutability and the RDD Interface 14
  Types of RDDs 16
  Functions on RDDs: Transformations Versus Actions 17
  Wide Versus Narrow Dependencies 17
  Spark Job Scheduling 19
  Resource Allocation Across Applications 20
  The Spark Application 20
  The Anatomy of a Spark Job 22
  The DAG 22
  Jobs 23
  Stages 23
  Tasks 24
  Conclusion 26

3 DataFrames, Datasets, and Spark SQL 27
  Getting Started with the SparkSession (or HiveContext or SQLContext) 28
  Spark SQL Dependencies 30
  Managing Spark Dependencies 31
  Avoiding Hive JARs 32
  Basics of Schemas 33
  DataFrame API 36
  Transformations 36
  Multi-DataFrame Transformations 48
  Plain Old SQL Queries and Interacting with Hive Data 49
  Data Representation in DataFrames and Datasets 49
  Tungsten 50
  Data Loading and Saving Functions 51
  DataFrameWriter and DataFrameReader 51
  Formats 52
  Save Modes 61
  Partitions (Discovery and Writing) 62
  Datasets 62
  Interoperability with RDDs, DataFrames, and Local Collections 63
  Compile-Time Strong Typing 64
  Easier Functional (RDD "like") Transformations 65
  Relational Transformations 65
  Multi-Dataset Relational Transformations 65
  Grouped Operations on Datasets 66
  Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs) 67
  Query Optimizer 69
  Logical and Physical Plans 69
  Code Generation 70
  Large Query Plans and Iterative Algorithms 70
  Debugging Spark SQL Queries 70
  JDBC/ODBC Server 71
  Conclusion 72

4 Joins (SQL and Core) 75
  Core Spark Joins 75
  Choosing a Join Type 77
  Choosing an Execution Plan 78
  Spark SQL Joins 81
  DataFrame Joins 81
  Dataset Joins 85
  Conclusion 86

5 Effective Transformations 87
  Narrow Versus Wide Transformations 88
  Implications for Performance 90
  Implications for Fault Tolerance 91
  The Special Case of coalesce 91
  What Type of RDD Does Your Transformation Return? 92
  Minimizing Object Creation 94
  Reusing Existing Objects 94
  Using Smaller Data Structures 97
  Iterator-to-Iterator Transformations with mapPartitions 100
  What Is an Iterator-to-Iterator Transformation? 101
  Space and Time Advantages 102
  An Example 103
  Set Operations 106
  Reducing Setup Overhead 107
  Shared Variables 108
  Broadcast Variables 108
  Accumulators 109
  Reusing RDDs 114
  Cases for Reuse 114
  Deciding if Recompute Is Inexpensive Enough 117
  Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files 118
  Alluxio (nee Tachyon) 122
  LRU Caching 123
  Noisy Cluster Considerations 124
  Interaction with Accumulators 125
  Conclusion 126

6 Working with Key/Value Data 127
  The Goldilocks Example 129
  Goldilocks Version 0: Iterative Solution 130
  How to Use PairRDDFunctions and OrderedRDDFunctions 132
  Actions on Key/Value Pairs 133
  What's So Dangerous About the groupByKey Function 134
  Goldilocks Version 1: groupByKey Solution 134
  Choosing an Aggregation Operation 138
  Dictionary of Aggregation Operations with Performance Considerations 138
  Multiple RDD Operations 141
  Co-Grouping 141
  Partitioners and Key/Value Data 142
  Using the Spark Partitioner Object 144
  Hash Partitioning 144
  Range Partitioning 144
  Custom Partitioning 145
  Preserving Partitioning Information Across Transformations 146
  Leveraging Co-Located and Co-Partitioned RDDs 146
  Dictionary of Mapping and Partitioning Functions in PairRDDFunctions 148
  Dictionary of OrderedRDDOperations 149
  Sorting by Two Keys with sortByKey 151
  Secondary Sort and repartitionAndSortWithinPartitions 151
  Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function 152
  How Not to Sort by Two Orderings 155
  Goldilocks Version 2: Secondary Sort 156
  A Different Approach to Goldilocks 159
  Goldilocks Version 3: Sort on Cell Values 164
  Straggler Detection and Unbalanced Data 165
  Back to Goldilocks (Again) 167
  Goldilocks Version 4: Reduce to Distinct on Each Partition 167
  Conclusion 173

7 Going Beyond Scala 175
  Beyond Scala within the JVM 176
  Beyond Scala, and Beyond the JVM 180
  How PySpark Works 181
  How SparkR Works 189
  Spark.jl (Julia Spark) 191
  How Eclair JS Works 192
  Spark on the Common Language Runtime (CLR): C# and Friends 193
  Calling Other Languages from Spark 193
  Using Pipe and Friends 193
  JNI 195
  Java Native Access (JNA) 198
  Underneath Everything Is FORTRAN 199
  Getting to the GPU 200
  The Future 201
  Conclusion 201

8 Testing and Validation 203
  Unit Testing 203
  General Spark Unit Testing 204
  Mocking RDDs 208
  Getting Test Data 210
  Generating Large Datasets 210
  Sampling 211
  Property Checking with ScalaCheck 213
  Computing RDD Difference 213
  Integration Testing 216
  Choosing Your Integration Testing Environment 216
  Verifying Performance 217
  Spark Counters for Verifying Performance 217
  Projects for Verifying Performance 218
  Job Validation 219
  Conclusion 220

9 Spark MLlib and ML 221
  Choosing Between Spark MLlib and Spark ML 221
  Working with MLlib 222
  Getting Started with MLlib (Organization and Imports) 222
  MLlib Feature Encoding and Data Preparation 223
  Feature Scaling and Selection 228
  MLlib Model Training 228
  Predicting 229
  Serving and Persistence 230
  Model Evaluation 232
  Working with Spark ML 233
  Spark ML Organization and Imports 233
  Pipeline Stages 234
  Explain Params 235
  Data Encoding 236
  Data Cleaning 239
  Spark ML Models 239
  Putting It All Together in a Pipeline 240
  Training a Pipeline 241
  Accessing Individual Stages 241
  Data Persistence and Spark ML 242
  Extending Spark ML Pipelines with Your Own Algorithms 244
  Model and Pipeline Persistence and Serving with Spark ML 252
  General Serving Considerations 252
  Conclusion 253

10 Spark Components and Packages 255
  Stream Processing with Spark 257
  Sources and Sinks 257
  Batch Intervals 259
  Data Checkpoint Intervals 260
  Considerations for DStreams 261
  Considerations for Structured Streaming 262
  High Availability Mode (or Handling Driver Failure or Checkpointing) 270
  GraphX 271
  Using Community Packages and Libraries 271
  Creating a Spark Package 273
  Conclusion 274

A Tuning, Debugging, and Other Things Developers Like to Pretend Don't Exist 275

Index 325
