Anonymizing Health Data: Case Studies and Methods to Get You Started

by Khaled El Emam, Luk Arbuckle


Updated as of August 2014, this practical book will demonstrate proven methods for anonymizing health data to help your organization share meaningful datasets without exposing patient identity. Leading experts Khaled El Emam and Luk Arbuckle walk you through a risk-based methodology, using case studies from their efforts to de-identify hundreds of datasets.

Clinical data is valuable for research and other types of analytics, but making it anonymous without compromising data quality is tricky. This book demonstrates techniques for handling different data types, based on the authors’ experiences with a maternal-child registry, inpatient discharge abstracts, health insurance claims, electronic medical record databases, and the World Trade Center disaster registry, among others.

  • Understand different methods for working with cross-sectional and longitudinal datasets
  • Assess the risk of adversaries who attempt to re-identify patients in anonymized datasets
  • Reduce the size and complexity of massive datasets without losing key information or jeopardizing privacy
  • Use methods to anonymize unstructured free-form text data
  • Minimize the risks inherent in geospatial data, without omitting critical location-based health information
  • Look at ways to anonymize coding information in health data
  • Learn the challenge of anonymously linking related datasets

Product Details

ISBN-13: 9781449363079
Publisher: O'Reilly Media, Incorporated
Publication date: 12/27/2013
Edition description: Reprint
Pages: 228
Product dimensions: 7.00(w) x 9.10(h) x 0.60(d)

About the Author

Dr. Khaled El Emam is an Associate Professor at the University of Ottawa, Faculty of Medicine, a senior investigator at the Children's Hospital of Eastern Ontario Research Institute, and a Canada Research Chair in Electronic Health Information at the University of Ottawa. He is also the Founder and CEO of Privacy Analytics, Inc. His main area of research is developing techniques for health data de-identification/anonymization and secure computation protocols for health research and public health purposes. He has made many contributions to the health privacy area.

Luk Arbuckle has been crunching numbers for a decade. He originally plied his trade in the area of image processing and analysis, and then in the area of applied statistics. Since joining the Electronic Health Information Laboratory (EHIL) at the CHEO Research Institute he has worked on methods to de-identify health data, participated in the development and evaluation of secure computation protocols, and provided all manner of statistical support. As a consultant with Privacy Analytics, he has also been heavily involved in conducting risk analyses on the re-identification of patients in health data.

Table of Contents

Preface ix

1 Introduction 1

To Anonymize or Not to Anonymize 1

Consent, or Anonymization? 2

Penny Pinching 3

People Are Private 4

The Two Pillars of Anonymization 4

Masking Standards 5

De-Identification Standards 5

Anonymization in the Wild 8

Organizational Readiness 8

Making It Practical 9

Use Cases 10

Stigmatizing Analytics 12

Anonymization in Other Domains 13

About This Book 15

2 A Risk-Based De-Identification Methodology 19

Basic Principles 19

Steps in the De-Identification Methodology 21

Step 1: Selecting Direct and Indirect Identifiers 21

Step 2: Setting the Threshold 22

Step 3: Examining Plausible Attacks 23

Step 4: De-Identifying the Data 25

Step 5: Documenting the Process 26

Measuring Risk Under Plausible Attacks 26

T1: Deliberate Attempt at Re-Identification 26

T2: Inadvertent Attempt at Re-Identification 28

T3: Data Breach 29

T4: Public Data 30

Measuring Re-Identification Risk 30

Probability Metrics 30

Information Loss Metrics 32

Risk Thresholds 35

Choosing Thresholds 35

Meeting Thresholds 38

Risky Business 39

3 Cross-Sectional Data: Research Registries 43

Process Overview 43

Secondary Uses and Disclosures 43

Getting the Data 46

Formulating the Protocol 47

Negotiating with the Data Access Committee 48

BORN Ontario 49

BORN Data Set 50

Risk Assessment 51

Threat Modeling 51

Results 52

Year on Year: Reusing Risk Analyses 53

Final Thoughts 54

4 Longitudinal Discharge Abstract Data: State Inpatient Databases 57

Longitudinal Data 58

Don't Treat It Like Cross-Sectional Data 60

De-Identifying Under Complete Knowledge 61

Approximate Complete Knowledge 63

Exact Complete Knowledge 64

Implementation 65

Generalization Under Complete Knowledge 65

The State Inpatient Database (SID) of California 66

The SID of California and Open Data 66

Risk Assessment 68

Threat Modeling 68

Results 68

Final Thoughts 69

5 Dates, Long Tails, and Correlation: Insurance Claims Data 71

The Heritage Health Prize 71

Date Generalization 72

Randomizing Dates Independently of One Another 72

Shifting the Sequence, Ignoring the Intervals 73

Generalizing Intervals to Maintain Order 74

Dates and Intervals and Back Again 76

A Different Anchor 77

Other Quasi-Identifiers 77

Connected Dates 78

Long Tails 78

The Risk from Long Tails 79

Threat Modeling 80

Number of Claims to Truncate 80

Which Claims to Truncate 82

Correlation of Related Items 83

Expert Opinions 84

Predictive Models 85

Implications for De-Identifying Data Sets 85

Final Thoughts 86

6 Longitudinal Events Data: A Disaster Registry 89

Adversary Power 90

Keeping Power in Check 90

Power in Practice 91

A Sample of Power 92

The WTC Disaster Registry 94

Capturing Events 94

The WTC Data Set 95

The Power of Events 96

Risk Assessment 98

Threat Modeling 99

Results 99

Final Thoughts 99

7 Data Reduction: Research Registry Revisited 101

The Subsampling Limbo 101

How Low Can We Go? 102

Not for All Types of Risk 102

BORN to Limbo! 103

Many Quasi-Identifiers 104

Subsets of Quasi-Identifiers 105

Covering Designs 106

Covering BORN 108

Final Thoughts 109

8 Free-Form Text: Electronic Medical Records 111

Not So Regular Expressions 111

General Approaches to Text Anonymization 112

Ways to Mark the Text as Anonymized 114

Evaluation Is Key 115

Appropriate Metrics, Strict but Fair 117

Standards for Recall, and a Risk-Based Approach 118

Standards for Precision 119

Anonymization Rules 120

Informatics for Integrating Biology and the Bedside (i2b2) 121

i2b2 Text Data Set 121

Risk Assessment 123

Threat Modeling 123

A Rule-Based System 124

Results 124

Final Thoughts 126

9 Geospatial Aggregation: Dissemination Areas and ZIP Codes 129

Where the Wild Things Are 130

Being Good Neighbors 131

Distance Between Neighbors 131

Circle of Neighbors 132

Round Earth 134

Flat Earth 135

Clustering Neighbors 136

We All Have Boundaries 137

Fast Nearest Neighbor 138

Too Close to Home 140

Levels of Geoproxy Attacks 141

Measuring Geoproxy Risk 142

Final Thoughts 144

10 Medical Codes: A Hackathon 147

Codes in Practice 148

Generalization 149

The Digits of Diseases 149

The Digits of Procedures 151

The (Alpha)Digits of Drugs 151

Suppression 152

Shuffling 153

Final Thoughts 156

11 Masking: Oncology Databases 159

Schema Shmema 159

Data in Disguise 160

Field Suppression 160

Randomization 161

Pseudonymization 163

Frequency of Pseudonyms 164

Masking On the Fly 165

Final Thoughts 166

12 Secure Linking 167

Let's Link Up 167

Doing It Securely 170

Don't Try This at Home 170

The Third-Party Problem 172

Basic Layout for Linking Up 173

The Nitty-Gritty Protocol for Linking Up 174

Bringing Paillier to the Parties 174

Matching on the Unknown 175

Scaling Up 177

Cuckoo Hashing 178

How Fast Does a Cuckoo Run? 179

Final Thoughts 179

13 De-Identification and Data Quality 181

Useful Data from Useful De-Identification 181

Degrees of Loss 182

Workload-Aware De-Identification 183

Questions to Improve Data Utility 185

Final Thoughts 187

Index 191
