Anonymizing Health Data: Case Studies and Methods to Get You Started
Overview
Updated as of August 2014, this practical book demonstrates proven methods for anonymizing health data to help your organization share meaningful datasets without exposing patient identity. Leading experts Khaled El Emam and Luk Arbuckle walk you through a risk-based methodology, using case studies from their efforts to de-identify hundreds of datasets.
Clinical data is valuable for research and other types of analytics, but making it anonymous without compromising data quality is tricky. This book demonstrates techniques for handling different data types, based on the authors’ experiences with a maternal-child registry, inpatient discharge abstracts, health insurance claims, electronic medical record databases, and the World Trade Center disaster registry, among others.
- Understand different methods for working with cross-sectional and longitudinal datasets
- Assess the risk of adversaries who attempt to re-identify patients in anonymized datasets
- Reduce the size and complexity of massive datasets without losing key information or jeopardizing privacy
- Use methods to anonymize unstructured free-form text data
- Minimize the risks inherent in geospatial data, without omitting critical location-based health information
- Look at ways to anonymize coding information in health data
- Learn the challenge of anonymously linking related datasets
Product Details
ISBN-13: | 9781449363031 |
---|---|
Publisher: | O'Reilly Media, Incorporated |
Publication date: | 12/11/2013 |
Sold by: | Barnes & Noble |
Format: | eBook |
Pages: | 228 |
File size: | 4 MB |
About the Author
Dr. Khaled El Emam is an Associate Professor at the University of Ottawa, Faculty of Medicine, a senior investigator at the Children's Hospital of Eastern Ontario Research Institute, and a Canada Research Chair in Electronic Health Information at the University of Ottawa. He is also the Founder and CEO of Privacy Analytics, Inc. His main area of research is developing techniques for health data de-identification/anonymization and secure computation protocols for health research and public health purposes. He has made many contributions to the health privacy area.
Luk Arbuckle has been crunching numbers for a decade. He originally plied his trade in the area of image processing and analysis, and then in the area of applied statistics. Since joining the Electronic Health Information Laboratory (EHIL) at the CHEO Research Institute he has worked on methods to de-identify health data, participated in the development and evaluation of secure computation protocols, and provided all manner of statistical support. As a consultant with Privacy Analytics, he has also been heavily involved in conducting risk analyses on the re-identification of patients in health data.
Table of Contents
Preface ix
1 Introduction 1
To Anonymize or Not to Anonymize 1
Consent, or Anonymization? 2
Penny Pinching 3
People Are Private 4
The Two Pillars of Anonymization 4
Masking Standards 5
De-Identification Standards 5
Anonymization in the Wild 8
Organizational Readiness 8
Making It Practical 9
Use Cases 10
Stigmatizing Analytics 12
Anonymization in Other Domains 13
About This Book 15
2 A Risk-Based De-Identification Methodology 19
Basic Principles 19
Steps in the De-Identification Methodology 21
Step 1: Selecting Direct and Indirect Identifiers 21
Step 2: Setting the Threshold 22
Step 3: Examining Plausible Attacks 23
Step 4: De-Identifying the Data 25
Step 5: Documenting the Process 26
Measuring Risk Under Plausible Attacks 26
T1: Deliberate Attempt at Re-Identification 26
T2: Inadvertent Attempt at Re-Identification 28
T3: Data Breach 29
T4: Public Data 30
Measuring Re-Identification Risk 30
Probability Metrics 30
Information Loss Metrics 32
Risk Thresholds 35
Choosing Thresholds 35
Meeting Thresholds 38
Risky Business 39
3 Cross-Sectional Data: Research Registries 43
Process Overview 43
Secondary Uses and Disclosures 43
Getting the Data 46
Formulating the Protocol 47
Negotiating with the Data Access Committee 48
BORN Ontario 49
BORN Data Set 50
Risk Assessment 51
Threat Modeling 51
Results 52
Year on Year: Reusing Risk Analyses 53
Final Thoughts 54
4 Longitudinal Discharge Abstract Data: State Inpatient Databases 57
Longitudinal Data 58
Don't Treat It Like Cross-Sectional Data 60
De-Identifying Under Complete Knowledge 61
Approximate Complete Knowledge 63
Exact Complete Knowledge 64
Implementation 65
Generalization Under Complete Knowledge 65
The State Inpatient Database (SID) of California 66
The SID of California and Open Data 66
Risk Assessment 68
Threat Modeling 68
Results 68
Final Thoughts 69
5 Dates, Long Tails, and Correlation: Insurance Claims Data 71
The Heritage Health Prize 71
Date Generalization 72
Randomizing Dates Independently of One Another 72
Shifting the Sequence, Ignoring the Intervals 73
Generalizing Intervals to Maintain Order 74
Dates and Intervals and Back Again 76
A Different Anchor 77
Other Quasi-Identifiers 77
Connected Dates 78
Long Tails 78
The Risk from Long Tails 79
Threat Modeling 80
Number of Claims to Truncate 80
Which Claims to Truncate 82
Correlation of Related Items 83
Expert Opinions 84
Predictive Models 85
Implications for De-Identifying Data Sets 85
Final Thoughts 86
6 Longitudinal Events Data: A Disaster Registry 89
Adversary Power 90
Keeping Power in Check 90
Power in Practice 91
A Sample of Power 92
The WTC Disaster Registry 94
Capturing Events 94
The WTC Data Set 95
The Power of Events 96
Risk Assessment 98
Threat Modeling 99
Results 99
Final Thoughts 99
7 Data Reduction: Research Registry Revisited 101
The Subsampling Limbo 101
How Low Can We Go? 102
Not for All Types of Risk 102
BORN to Limbo! 103
Many Quasi-Identifiers 104
Subsets of Quasi-Identifiers 105
Covering Designs 106
Covering BORN 108
Final Thoughts 109
8 Free-Form Text: Electronic Medical Records 111
Not So Regular Expressions 111
General Approaches to Text Anonymization 112
Ways to Mark the Text as Anonymized 114
Evaluation Is Key 115
Appropriate Metrics, Strict but Fair 117
Standards for Recall, and a Risk-Based Approach 118
Standards for Precision 119
Anonymization Rules 120
Informatics for Integrating Biology and the Bedside (i2b2) 121
i2b2 Text Data Set 121
Risk Assessment 123
Threat Modeling 123
A Rule-Based System 124
Results 124
Final Thoughts 126
9 Geospatial Aggregation: Dissemination Areas and ZIP Codes 129
Where the Wild Things Are 130
Being Good Neighbors 131
Distance Between Neighbors 131
Circle of Neighbors 132
Round Earth 134
Flat Earth 135
Clustering Neighbors 136
We All Have Boundaries 137
Fast Nearest Neighbor 138
Too Close to Home 140
Levels of Geoproxy Attacks 141
Measuring Geoproxy Risk 142
Final Thoughts 144
10 Medical Codes: A Hackathon 147
Codes in Practice 148
Generalization 149
The Digits of Diseases 149
The Digits of Procedures 151
The (Alpha)Digits of Drugs 151
Suppression 152
Shuffling 153
Final Thoughts 156
11 Masking: Oncology Databases 159
Schema Shmema 159
Data in Disguise 160
Field Suppression 160
Randomization 161
Pseudonymization 163
Frequency of Pseudonyms 164
Masking On the Fly 165
Final Thoughts 166
12 Secure Linking 167
Let's Link Up 167
Doing It Securely 170
Don't Try This at Home 170
The Third-Party Problem 172
Basic Layout for Linking Up 173
The Nitty-Gritty Protocol for Linking Up 174
Bringing Paillier to the Parties 174
Matching on the Unknown 175
Scaling Up 177
Cuckoo Hashing 178
How Fast Does a Cuckoo Run? 179
Final Thoughts 179
13 De-Identification and Data Quality 181
Useful Data from Useful De-Identification 181
Degrees of Loss 182
Workload-Aware De-Identification 183
Questions to Improve Data Utility 185
Final Thoughts 187
Index 191