Text Mining with R: A Tidy Approach

Much of the data available today is unstructured and text-heavy, making it challenging for analysts to apply their usual data wrangling and visualization tools. With this practical book, you’ll explore text-mining techniques with tidytext, a package that authors Julia Silge and David Robinson developed using the tidy principles behind R packages like ggraph and dplyr. You’ll learn how tidytext and other tidy tools in R can make text analysis easier and more effective.

The authors demonstrate how treating text as data frames enables you to manipulate, summarize, and visualize characteristics of text. You’ll also learn how to integrate natural language processing (NLP) into effective workflows. Practical code examples and data explorations will help you generate real insights from literature, news, and social media.

  • Learn how to apply the tidy text format to NLP
  • Use sentiment analysis to mine the emotional content of text
  • Identify a document’s most important terms with frequency measurements
  • Explore relationships and connections between words with the ggraph and widyr packages
  • Convert back and forth between R’s tidy and non-tidy text formats
  • Use topic modeling to classify document collections into natural groups
  • Examine case studies that compare Twitter archives, dig into NASA metadata, and analyze thousands of Usenet messages

eBook

$25.49 (original price $33.99; save 25%)

Product Details

ISBN-13: 9781491981603
Publisher: O'Reilly Media, Incorporated
Publication date: 06/12/2017
Sold by: Barnes & Noble
Format: eBook
Pages: 194
File size: 11 MB
Note: This product may take a few minutes to download.

About the Author

Julia Silge is a data scientist at Stack Overflow; her work involves analyzing complex datasets and communicating about technical topics with diverse audiences. She has a PhD in astrophysics and loves Jane Austen and making beautiful charts. Julia worked in academia and ed tech before moving into data science and discovering the statistical programming language R.


David Robinson is a data scientist at Stack Overflow with a PhD in Quantitative and Computational Biology from Princeton University. He enjoys developing open source R packages, including broom, gganimate, fuzzyjoin, and widyr, and writes about statistics, R, and text mining on his blog, Variance Explained.

Table of Contents

Preface

1 The Tidy Text Format

Contrasting Tidy Text with Other Data Structures

The unnest_tokens Function

Tidying the Works of Jane Austen

The gutenbergr Package

Word Frequencies

Summary

2 Sentiment Analysis with Tidy Data

The sentiments Dataset

Sentiment Analysis with Inner Join

Comparing the Three Sentiment Dictionaries

Most Common Positive and Negative Words

Wordclouds

Looking at Units Beyond Just Words

Summary

3 Analyzing Word and Document Frequency: tf-idf

Term Frequency in Jane Austen's Novels

Zipf's Law

The bind_tf_idf Function

A Corpus of Physics Texts

Summary

4 Relationships Between Words: N-grams and Correlations

Tokenizing by N-gram

Counting and Filtering N-grams

Analyzing Bigrams

Using Bigrams to Provide Context in Sentiment Analysis

Visualizing a Network of Bigrams with ggraph

Visualizing Bigrams in Other Texts

Counting and Correlating Pairs of Words with the widyr Package

Counting and Correlating Among Sections

Examining Pairwise Correlation

Summary

5 Converting to and from Nontidy Formats

Tidying a Document-Term Matrix

Tidying DocumentTermMatrix Objects

Tidying dfm Objects

Casting Tidy Text Data into a Matrix

Tidying Corpus Objects with Metadata

Example: Mining Financial Articles

Summary

6 Topic Modeling

Latent Dirichlet Allocation

Word-Topic Probabilities

Document-Topic Probabilities

Example: The Great Library Heist

LDA on Chapters

Per-Document Classification

By-Word Assignments: augment

Alternative LDA Implementations

Summary

7 Case Study: Comparing Twitter Archives

Getting the Data and Distribution of Tweets

Word Frequencies

Comparing Word Usage

Changes in Word Use

Favorites and Retweets

Summary

8 Case Study: Mining NASA Metadata

How Data Is Organized at NASA

Wrangling and Tidying the Data

Some Initial Simple Exploration

Word Co-occurrences and Correlations

Networks of Description and Title Words

Networks of Keywords

Calculating tf-idf for the Description Fields

What Is tf-idf for the Description Field Words?

Connecting Description Fields to Keywords

Topic Modeling

Casting to a Document-Term Matrix

Ready for Topic Modeling

Interpreting the Topic Model

Connecting Topic Modeling with Keywords

Summary

9 Case Study: Analyzing Usenet Text

Preprocessing

Preprocessing Text

Words in Newsgroups

Finding tf-idf Within Newsgroups

Topic Modeling

Sentiment Analysis

Sentiment Analysis by Word

Sentiment Analysis by Message

N-gram Analysis

Summary

Bibliography

Index
