Text as Data: A New Framework for Machine Learning and the Social Sciences

A guide for using computational text analysis to learn about the social world

From social media posts and text messages to digital government documents and archives, researchers are bombarded with a deluge of text reflecting the social world. This textual data gives unprecedented insights into fundamental questions in the social sciences, humanities, and industry. Meanwhile, new machine learning tools are rapidly transforming the way science and business are conducted. Text as Data shows how to combine new sources of data, machine learning tools, and social science research design to develop and evaluate new insights.

Text as Data is organized around the core tasks in research projects using text—representation, discovery, measurement, prediction, and causal inference. The authors offer a sequential, iterative, and inductive approach to research design. Each research task is presented complete with real-world applications, example methods, and a distinct style of task-focused research.

Bridging many divides—computer science and social science, the qualitative and the quantitative, and industry and academia—Text as Data is an ideal resource for anyone wanting to analyze large collections of text in an era when data is abundant and computation is cheap, but the enduring challenges of social science remain.

Overview of how to use text as data
Research design for a world of data deluge
Examples from across the social sciences and industry

Paperback

$45.00

Product Details

ISBN-13:	9780691207551
Publisher:	Princeton University Press
Publication date:	03/29/2022
Pages:	360
Sales rank:	708,564
Product dimensions:	7.00(w) x 10.00(h) x (d)

About the Author

Justin Grimmer is professor of political science and a senior fellow at the Hoover Institution at Stanford University. Twitter: @justingrimmer

Margaret E. Roberts is associate professor in political science and the Halıcıoğlu Data Science Institute at the University of California, San Diego. Twitter: @mollyeroberts

Brandon M. Stewart is assistant professor of sociology and Arthur H. Scribner Bicentennial Preceptor at Princeton University. Twitter: @b_m_stewart

Table of Contents

Preface xvii

Prerequisites and Notation xvii

Uses for This Book xviii

What This Book Is Not xix

Part I Preliminaries 1

Chapter 1 Introduction 3

1.1 How This Book Informs the Social Sciences 5

1.2 How This Book Informs the Digital Humanities 8

1.3 How This Book Informs Data Science in Industry and Government 9

1.4 A Guide to This Book 10

1.5 Conclusion 11

Chapter 2 Social Science Research and Text Analysis 13

2.1 Discovery 15

2.2 Measurement 16

2.3 Inference 17

2.4 Social Science as an Iterative and Cumulative Process 17

2.5 An Agnostic Approach to Text Analysis 18

2.6 Discovery, Measurement, and Causal Inference: How the Chinese Government Censors Social Media 20

2.7 Six Principles of Text Analysis 22

2.7.1 Social Science Theories and Substantive Knowledge are Essential for Research Design 22

2.7.2 Text Analysis does not Replace Humans-It Augments Them 24

2.7.3 Building, Refining, and Testing Social Science Theories Requires Iteration and Cumulation 26

2.7.4 Text Analysis Methods Distill Generalizations from Language 28

2.7.5 The Best Method Depends on the Task 29

2.7.6 Validations are Essential and Depend on the Theory and the Task 30

2.8 Conclusion: Text Data and Social Science 32

Part II Selection and Representation 33

Chapter 3 Principles of Selection and Representation 35

3.1 Principle 1: Question-Specific Corpus Construction 35

3.2 Principle 2: No Values-Free Corpus Construction 36

3.3 Principle 3: No Right Way to Represent Text 37

3.4 Principle 4: Validation 38

3.5 State of the Union Addresses 38

3.6 The Authorship of the Federalist Papers 39

3.7 Conclusion 40

Chapter 4 Selecting Documents 41

4.1 Populations and Quantities of Interest 42

4.2 Four Types of Bias 43

4.2.1 Resource Bias 43

4.2.2 Incentive Bias 44

4.2.3 Medium Bias 44

4.2.4 Retrieval Bias 45

4.3 Considerations of "Found Data" 46

4.4 Conclusion 46

Chapter 5 Bag of Words 48

5.1 The Bag of Words Model 48

5.2 Choose the Unit of Analysis 49

5.3 Tokenize 50

5.4 Reduce Complexity 52

5.4.1 Lowercase 52

5.4.2 Remove Punctuation 52

5.4.3 Remove Stop Words 53

5.4.4 Create Equivalence Classes (Lemmatize/Stem) 54

5.4.5 Filter by Frequency 55

5.5 Construct Document-Feature Matrix 55

5.6 Rethinking the Defaults 57

5.6.1 Authorship of the Federalist Papers 57

5.6.2 The Scale Argument against Preprocessing 58

5.7 Conclusion 59

Chapter 6 The Multinomial Language Model 60

6.1 Multinomial Distribution 61

6.2 Basic Language Modeling 63

6.3 Regularization and Smoothing 66

6.4 The Dirichlet Distribution 66

6.5 Conclusion 69

Chapter 7 The Vector Space Model and Similarity Metrics 70

7.1 Similarity Metrics 70

7.2 Distance Metrics 73

7.3 tf-idf Weighting 75

7.4 Conclusion 77

Chapter 8 Distributed Representations of Words 78

8.1 Why Word Embeddings 79

8.2 Estimating Word Embeddings 81

8.2.1 The Self-Supervision Insight 81

8.2.2 Design Choices in Word Embeddings 81

8.2.3 Latent Semantic Analysis 82

8.2.4 Neural Word Embeddings 82

8.2.5 Pretrained Embeddings 84

8.2.6 Rare Words 84

8.2.7 An Illustration 85

8.3 Aggregating Word Embeddings to the Document Level 86

8.4 Validation 87

8.5 Contextualized Word Embeddings 88

8.6 Conclusion 89

Chapter 9 Representations from Language Sequences 90

9.1 Reuse 90

9.2 Parts of Speech Tagging 91

9.2.1 Using Phrases to Improve Visualization 92

9.3 Named-Entity Recognition 94

9.4 Dependency Parsing 95

9.5 Broader Information Extraction Tasks 96

9.6 Conclusion 97

Part III Discovery 99

Chapter 10 Principles of Discovery 103

10.1 Principle 1: Context Relevance 103

10.2 Principle 2: No Ground Truth 104

10.3 Principle 3: Judge the Concept, Not the Method 105

10.4 Principle 4: Separate Data Is Best 106

10.5 Conceptualizing the US Congress 106

10.6 Conclusion 109

Chapter 11 Discriminating Words 111

11.1 Mutual Information 112

11.2 Fightin' Words 115

11.3 Fictitious Prediction Problems 117

11.3.1 Standardized Test Statistics as Measures of Separation 118

11.3.2 χ² Test Statistics 118

11.3.3 Multinomial Inverse Regression 121

11.4 Conclusion 121

Chapter 12 Clustering 123

12.1 An Initial Example Using k-Means Clustering 124

12.2 Representations for Clustering 127

12.3 Approaches to Clustering 127

12.3.1 Components of a Clustering Method 128

12.3.2 Styles of Clustering Methods 130

12.3.3 Probabilistic Clustering Models 132

12.3.4 Algorithmic Clustering Models 134

12.3.5 Connections between Probabilistic and Algorithmic Clustering 137

12.4 Making Choices 137

12.4.1 Model Selection 137

12.4.2 Careful Reading 140

12.4.3 Choosing the Number of Clusters 140

12.5 The Human Side of Clustering 144

12.5.1 Interpretation 144

12.5.2 Interactive Clustering 144

12.6 Conclusion 145

Chapter 13 Topic Models 147

13.1 Latent Dirichlet Allocation 147

13.1.1 Inference 149

13.1.2 Example: Discovering Credit Claiming for Fire Grants in Congressional Press Releases 149

13.2 Interpreting the Output of Topic Models 151

13.3 Incorporating Structure into LDA 153

13.3.1 Structure with Upstream, Known Prevalence Covariates 154

13.3.2 Structure with Upstream, Known Content Covariates 154

13.3.3 Structure with Downstream, Known Covariates 156

13.3.4 Additional Sources of Structure 157

13.4 Structural Topic Models 157

13.4.1 Example: Discovering the Components of Radical Discourse 159

13.5 Labeling Topic Models 159

13.6 Conclusion 160

Chapter 14 Low-Dimensional Document Embeddings 162

14.1 Principal Component Analysis 162

14.1.1 Automated Methods for Labeling Principal Components 163

14.1.2 Manual Methods for Labeling Principal Components 164

14.1.3 Principal Component Analysis of Senate Press Releases 164

14.1.4 Choosing the Number of Principal Components 165

14.2 Classical Multidimensional Scaling 167

14.2.1 Extensions of Classical MDS 168

14.2.2 Applying Classical MDS to Senate Press Releases 168

14.3 Conclusion 169

Part IV Measurement 171

Chapter 15 Principles of Measurement 173

15.1 From Concept to Measurement 174

15.2 What Makes a Good Measurement 174

15.2.1 Principle 1: Measures should have Clear Goals 175

15.2.2 Principle 2: Source Material should Always be Identified and Ideally Made Public 175

15.2.3 Principle 3: The Coding Process should be Explainable and Reproducible 175

15.2.4 Principle 4: The Measure should be Validated 175

15.2.5 Principle 5: Limitations should be Explored, Documented and Communicated to the Audience 176

15.3 Balancing Discovery and Measurement with Sample Splits 176

Chapter 16 Word Counting 178

16.1 Keyword Counting 178

16.2 Dictionary Methods 180

16.3 Limitations and Validations of Dictionary Methods 181

16.3.1 Moving Beyond Dictionaries: Wordscores 182

16.4 Conclusion 183

Chapter 17 An Overview of Supervised Classification 184

17.1 Example: Discursive Governance 185

17.2 Create a Training Set 186

17.3 Classify Documents with Supervised Learning 186

17.4 Check Performance 187

17.5 Using the Measure 187

17.6 Conclusion 188

Chapter 18 Coding a Training Set 189

18.1 Characteristics of a Good Training Set 190

18.2 Hand Coding 190

18.2.1 1: Decide on a Codebook 191

18.2.2 2: Select Coders 191

18.2.3 3: Select Documents to Code 191

18.2.4 4: Manage Coders 192

18.2.5 5: Check Reliability 192

18.2.6 Managing Drift 192

18.2.7 Example: Making the News 192

18.3 Crowdsourcing 193

18.4 Supervision with Found Data 195

18.5 Conclusion 196

Chapter 19 Classifying Documents with Supervised Learning 197

19.1 Naive Bayes 198

19.1.1 The Assumptions in Naive Bayes are Almost Certainly Wrong 200

19.1.2 Naive Bayes is a Generative Model 200

19.1.3 Naive Bayes is a Linear Classifier 201

19.2 Machine Learning 202

19.2.1 Fixed Basis Functions 203

19.2.2 Adaptive Basis Functions 205

19.2.3 Quantification 206

19.2.4 Concluding Thoughts on Supervised Learning with Random Samples 207

19.3 Example: Estimating Jihad Scores 207

19.4 Conclusion 210

Chapter 20 Checking Performance 211

20.1 Validation with Gold-Standard Data 211

20.1.1 Validation Set 212

20.1.2 Cross-Validation 213

20.1.3 The Importance of Gold-Standard Data 213

20.1.4 Ongoing Evaluations 214

20.2 Validation without Gold-Standard Data 214

20.2.1 Surrogate Labels 214

20.2.2 Partial Category Replication 215

20.2.3 Nonexpert Human Evaluation 215

20.2.4 Correspondence to External Information 215

20.3 Example: Validating Jihad Scores 216

20.4 Conclusion 217

Chapter 21 Repurposing Discovery Methods 219

21.1 Unsupervised Methods Tend to Measure Subject Better than Subtleties 219

21.2 Example: Scaling via Differential Word Rates 220

21.3 A Workflow for Repurposing Unsupervised Methods for Measurement 221

21.3.1 1: Split the Data 223

21.3.2 2: Fit the Model 223

21.3.3 3: Validate the Model 223

21.3.4 4: Fit to the Test Data and Revalidate 225

21.4 Concerns in Repurposing Unsupervised Methods for Measurement 225

21.4.1 Concern 1: The Method Always Returns a Result 226

21.4.2 Concern 2: Opaque Differences in Estimation Strategies 226

21.4.3 Concern 3: Sensitivity to Unintuitive Hyperparameters 227

21.4.4 Concern 4: Instability in Results 227

21.4.5 Rethinking Stability 228

21.5 Conclusion 229

Part V Inference 231

Chapter 22 Principles of Inference 233

22.1 Prediction 233

22.2 Causal Inference 234

22.2.1 Causal Inference Places Identification First 235

22.2.2 Prediction Is about Outcomes That Will Happen, Causal Inference is about Outcomes from Interventions 235

22.2.3 Prediction and Causal Inference Require Different Validations 236

22.2.4 Prediction and Causal Inference Use Features Differently 237

22.3 Comparing Prediction and Causal Inference 238

22.4 Partial and General Equilibrium in Prediction and Causal Inference 238

22.5 Conclusion 240

Chapter 23 Prediction 241

23.1 The Basic Task of Prediction 242

23.2 Similarities and Differences between Prediction and Measurement 243

23.3 Five Principles of Prediction 244

23.3.1 Predictive Features do not have to Cause the Outcome 244

23.3.2 Cross-Validation is not Always a Good Measure of Predictive Power 244

23.3.3 It's Not Always Better to be More Accurate on Average 246

23.3.4 There can be Practical Value in Interpreting Models for Prediction 247

23.3.5 It can be Difficult to Apply Prediction to Policymaking 247

23.4 Using Text as Data for Prediction: Examples 249

23.4.1 Source Prediction 249

23.4.2 Linguistic Prediction 253

23.4.3 Social Forecasting 254

23.4.4 Nowcasting 256

23.5 Conclusion 257

Chapter 24 Causal Inference 259

24.1 Introduction to Causal Inference 260

24.2 Similarities and Differences between Prediction and Measurement, and Causal Inference 263

24.3 Key Principles of Causal Inference with Text 263

24.3.1 The Core Problems of Causal Inference Remain, even when Working with Text 263

24.3.2 Our Conceptualization of the Treatment and Outcome Remains a Critical Component of Causal Inference with Text 264

24.3.3 The Challenges of Making Causal Inferences with Text Underscore the Need for Sequential Science 264

24.4 The Mapping Function 266

24.4.1 Causal Inference with g 267

24.4.2 Identification and Overfitting 268

24.5 Workflows for Making Causal Inferences with Text 269

24.5.1 Define g before Looking at the Documents 269

24.5.2 Use a Train/Test Split 269

24.5.3 Run Sequential Experiments 271

24.6 Conclusion 271

Chapter 25 Text as Outcome 272

25.1 An Experiment on Immigration 272

25.2 The Effect of Presidential Public Appeals 275

25.3 Conclusion 276

Chapter 26 Text as Treatment 277

26.1 An Experiment Using Trump's Tweets 279

26.2 A Candidate Biography Experiment 281

26.3 Conclusion 284

Chapter 27 Text as Confounder 285

27.1 Regression Adjustments for Text Confounders 287

27.2 Matching Adjustments for Text 290

27.3 Conclusion 292

Part VI Conclusion 295

Chapter 28 Conclusion 297

28.1 How to Use Text as Data in the Social Sciences 298

28.1.1 The Focus on Social Science Tasks 298

28.1.2 Iterative and Sequential Nature of the Social Sciences 298

28.1.3 Model Skepticism and the Application of Machine Learning to the Social Sciences 299

28.2 Applying Our Principles beyond Text Data 299

28.3 Avoiding the Cycle of Creation and Destruction in Social Science Methodology 300

Acknowledgments 303

Bibliography 307

Index 331

What People are Saying About This

From the Publisher

“This is the definitive guide for social scientists wishing to work with text-based data. Written by pioneers in the field, Text as Data provides a comprehensive overview of the state of the art. But the authors don’t stop there: they offer a fresh agenda for doing social science, showing how algorithms can augment our ability to develop theories of human behavior, rather than poorly attempting to replace us.”—Chris Bail, author of Breaking the Social Media Prism

“Text as Data is a long-awaited book by an all-star team of methodologists. The explosion of textual data provides unprecedented opportunities to learn about human behavior and society at a massive scale. Through this authoritative book, Grimmer, Roberts, and Stewart lay the foundation of text analysis for students and researchers.”—Kosuke Imai, author of Quantitative Social Science

“This book provides a clear and comprehensive introduction to the key computational techniques for analyzing text data. The technical material is contextualized within a broader research philosophy that will drive exciting new applications in computational social science, the digital humanities, and commercial data science. I highly recommend it.”—Jacob Eisenstein, author of Introduction to Natural Language Processing

“Beyond offering an engaging survey of text analysis methods, this book is a vital guide to social science research design. Diverse applications from detecting Chinese censorship to classifying jihadist texts bring text analysis to life for readers of all methodological backgrounds. My students praised Text as Data as one of the best textbooks they have encountered.”—Alexandra Siegel, University of Colorado, Boulder

"This book fills acute gaps in the theory components of text as data. Accessible to advanced undergraduates and graduate students with some background in social science terminology and methodology, this volume draws together aspects of text-as-data approaches that are often discussed and applied separately, and brings them into a coherent framework."—Sarah Bouchat, Northwestern University

"Written by leaders in the discipline, Text as Data is an excellent book. Comprehensive in its scope, this work is a perfect introduction for social science graduate students and faculty getting into the field."—Arthur Spirling, New York University

"There is a clear lack of relevant textbooks in the social science text-as-data area. Thorough and manageable, Text as Data presents a good conceptual overview and frames issues at the right level."—David Mimno, Cornell University
