Effective Data Science Infrastructure: How to make data scientists productive

Simplify data science infrastructure to give data scientists an efficient path from prototype to production.

In Effective Data Science Infrastructure you will learn how to:

    Design data science infrastructure that boosts productivity
    Handle compute and orchestration in the cloud
    Deploy machine learning to production
    Monitor and manage performance and results
    Combine cloud-based tools into a cohesive data science environment
    Develop reproducible data science projects using Metaflow, Conda, and Docker
    Architect complex applications for multiple teams and large datasets
    Customize and grow data science infrastructure

Effective Data Science Infrastructure: How to make data scientists productive is a hands-on guide to assembling infrastructure for data science and machine learning applications. It reveals the processes used at Netflix and other data-driven companies to manage their cutting-edge data infrastructure. In it, you'll master scalable techniques for data storage, computation, experiment tracking, and orchestration that are relevant to companies of all shapes and sizes. You'll learn how to make data scientists more productive with your existing cloud infrastructure, a stack of open source software, and idiomatic Python.

The author is donating proceeds from this book to charities that support women and underrepresented groups in data science.

About the technology
Growing data science projects from prototype to production requires reliable infrastructure. Using the powerful new techniques and tooling in this book, you can stand up an infrastructure stack that will scale with any organization, from startups to the largest enterprises.

About the book
Effective Data Science Infrastructure teaches you to build data pipelines and project workflows that will supercharge data scientists and their projects. Based on the state-of-the-art tools and concepts that power data operations at Netflix, this book introduces a customizable cloud-based approach to model development and MLOps that you can easily adapt to your company's specific needs. As you roll out these practical processes, your teams will produce better results faster when applying data science and machine learning to a wide array of business problems.

What's inside

    Handle compute and orchestration in the cloud
    Combine cloud-based tools into a cohesive data science environment
    Develop reproducible data science projects using Metaflow, AWS, and the Python data ecosystem
    Architect complex applications that require large datasets and models, and a team of data scientists

About the reader
For infrastructure engineers and engineering-minded data scientists who are familiar with Python.

About the author
At Netflix, Ville Tuulos designed and built Metaflow, a full-stack framework for data science. Currently, he is the CEO of a startup focusing on data science infrastructure.

Table of Contents
1 Introducing data science infrastructure
2 The toolchain of data science
3 Introducing Metaflow
4 Scaling with the compute layer
5 Practicing scalability and performance
6 Going to production
7 Processing data
8 Using and operating models
9 Machine learning with the full stack

eBook

$43.99

Product Details

ISBN-13:	9781638350989
Publisher:	Manning
Publication date:	08/30/2022
Sold by:	SIMON & SCHUSTER
Format:	eBook
Pages:	352
File size:	9 MB

About the Author

Ville Tuulos has been developing tools and infrastructure for data science and machine learning for over two decades. At Netflix, he designed and built Metaflow, a full-stack framework for data science. Currently, he is the CEO of a startup focusing on data science infrastructure.

Foreword
Preface
Acknowledgments
About this book
About the author
About the cover illustration

1 Introducing data science infrastructure
    1.1 Why data science infrastructure?
        The life cycle of a data science project
    1.2 What is data science infrastructure?
        The infrastructure stack for data science
        Supporting the full life cycle of a data science project
        One size doesn't fit all
    1.3 Why good infrastructure matters
        Managing complexity
        Leveraging existing platforms
    1.4 Human-centric infrastructure
        Freedom and responsibility
        Data scientist autonomy

2 The toolchain of data science
    2.1 Setting up a development environment
        Cloud account
        Data science workstation
        Notebooks
        Putting everything together
    2.2 Introducing workflows
        The basics of workflows
        Executing workflows
        The world of workflow frameworks

3 Introducing Metaflow
    3.1 The basics of Metaflow
        Installing Metaflow
        Writing a basic workflow
        Managing dataflow in workflows
        Parameters
    3.2 Branching and merging
        Valid DAG structures
        Static branches
        Dynamic branches
        Controlling concurrency
    3.3 Metaflow in action
        Starting a new project
        Accessing results with the Client API
        Debugging failures
        Finishing touches

4 Scaling with the compute layer
    4.1 What is scalability?
        Scalability across the stack
        Culture of experimentation
    4.2 The compute layer
        Batch processing with containers
        Examples of compute layers
    4.3 The compute layer in Metaflow
        Configuring AWS Batch for Metaflow
        @batch and @resources decorators
    4.4 Handling failures
        Recovering from transient errors with @retry
        Killing zombies with @timeout
        The decorator of last resort: @catch

5 Practicing scalability and performance
    5.1 Starting simple: Vertical scalability
        Example: Clustering Yelp reviews
        Practicing vertical scalability
        Why vertical scalability?
    5.2 Practicing horizontal scalability
        Why horizontal scalability?
        Example: Hyperparameter search
    5.3 Practicing performance optimization
        Example: Computing a co-occurrence matrix
        Recipe for fast-enough workflows

6 Going to production
    6.1 Stable workflow scheduling
        Centralized metadata
        Using AWS Step Functions with Metaflow
        Scheduling runs with @schedule
    6.2 Stable execution environments
        How Metaflow packages flows
        Why dependency management matters
        Using the @conda decorator
    6.3 Stable operations
        Namespaces during prototyping
        Production namespaces
        Parallel deployments with @project

7 Processing data
    7.1 Foundations of fast data
        Loading data from S3
        Working with tabular data
        The in-memory data stack
    7.2 Interfacing with data infrastructure
        Modern data infrastructure
        Preparing datasets in SQL
        Distributed data processing
    7.3 From data to features
        Distinguishing facts and features
        Encoding features

8 Using and operating models
    8.1 Producing predictions
        Batch, streaming, and real-time predictions
        Example: Recommendation system
        Batch predictions
        Real-time predictions

9 Machine learning with the full stack
    9.1 Pluggable feature encoders and models
        Developing a framework for pluggable components
        Executing feature encoders
        Benchmarking models
    9.2 Deep regression model
        Encoding input tensors
        Defining a deep regression model
        Training a deep regression model
    9.3 Summarizing lessons learned

Appendix Installing Conda

Index
