Responsible Data Science (ISBN 1119741750, 9781119741756)

This is the first book on ethical data science that provides hands-on, practical technical steps for practitioners and managers.


Language: English · Pages: 300 [304] · Year: 2021

Table of Contents:
Cover
Title Page
Copyright Page
About the Authors
About the Technical Editor
Acknowledgments
Contents at a Glance
Contents
Introduction
What This Book Covers
Who Will Benefit Most from This Book
Looking Ahead in This Book
Special Features
Code Repository
Part 1 Motivation for Ethical Data Science and Background Knowledge
Chapter 1 Responsible Data Science
The Optum Disaster
Jekyll and Hyde
Eugenics
Galton, Pearson, and Fisher
Ties between Eugenics and Statistics
Ethical Problems in Data Science Today
Predictive Models
From Explaining to Predicting
Predictive Modeling
Setting the Stage for Ethical Issues to Arise
Classic Statistical Models
Black-Box Methods
Important Concepts in Predictive Modeling
Feature Selection
Model-Centric vs. Data-Centric Models
Holdout Sample and Cross-Validation
Overfitting
Unsupervised Learning
The Ethical Challenge of Black Boxes
Two Opposing Forces
Pressure for More Powerful AI
Public Resistance and Anxiety
Summary
Chapter 2 Background: Modeling and the Black-Box Algorithm
Assessing Model Performance
Predicting Class Membership
The Rare Class Problem
Lift and Gains
Area Under the Curve
AUC vs. Lift (Gains)
Predicting Numeric Values
Goodness-of-Fit
Holdout Sets and Cross-Validation
Optimization and Loss Functions
Intrinsically Interpretable Models vs. Black-Box Models
Ethical Challenges with Interpretable Models
Black-Box Models
Ensembles
Nearest Neighbors
Clustering
Association Rules
Collaborative Filters
Artificial Neural Nets and Deep Neural Nets
Problems with Black-Box Predictive Models
Problems with Unsupervised Algorithms
Summary
Chapter 3 The Ways AI Goes Wrong, and the Legal Implications
AI and Intentional Consequences by Design
Deepfakes
Supporting State Surveillance and Suppression
Behavioral Manipulation
Automated Testing to Fine-Tune Targeting
AI and Unintended Consequences
Healthcare
Finance
Law Enforcement
Technology
The Legal and Regulatory Landscape around AI
Ignorance Is No Defense: AI in the Context of Existing Law and Policy
A Finger in the Dam: Data Rights, Data Privacy, and Consumer Protection Regulations
Trends in Emerging Law and Policy Related to AI
Summary
Part 2 The Ethical Data Science Process
Chapter 4 The Responsible Data Science Framework
Why We Keep Building Harmful AI
Misguided Need for Cutting-Edge Models
Excessive Focus on Predictive Performance
Ease of Access and the Curse of Simplicity
The Common Cause
The Face Thieves
An Anatomy of Modeling Harms
The World: Context Matters for Modeling
The Data: Representation Is Everything
The Model: Garbage In, Danger Out
Model Interpretability: Human Understanding for Superhuman Models
Efforts Toward a More Responsible Data Science
Principles Are the Focus
Nonmaleficence
Fairness
Transparency
Accountability
Privacy
Bridging the Gap Between Principles and Practice with the Responsible Data Science (RDS) Framework
Justification
Compilation
Preparation
Modeling
Auditing
Summary
Chapter 5 Model Interpretability: The What and the Why
The Sexist Résumé Screener
The Necessity of Model Interpretability
Connections Between Predictive Performance and Interpretability
Uniting (High) Model Performance and Model Interpretability
Categories of Interpretability Methods
Global Methods
Local Methods
Real-World Successes of Interpretability Methods
Facilitating Debugging and Audit
Leveraging the Improved Performance of Black-Box Models
Acquiring New Knowledge
Addressing Critiques of Interpretability Methods
Explanations Generated by Interpretability Methods Are Not Robust
Explanations Generated by Interpretability Methods Are Low Fidelity
The Forking Paths of Model Interpretability
The Four-Measure Baseline
Building Our Own Credit Scoring Model
Using Train-Test Splits
Feature Selection and Feature Engineering
Baseline Models
The Importance of Making Your Code Work for Everyone
Execution Variability
Addressing Execution Variability with Functionalized Code
Stochastic Variability
Addressing Stochastic Variability via Resampling
Summary
Part 3 EDS in Practice
Chapter 6 Beginning a Responsible Data Science Project
How the Responsible Data Science Framework Addresses the Common Cause
Datasets Used
Regression Datasets—Communities and Crime
Classification Datasets—COMPAS
Common Elements Across Our Analyses
Project Structure and Documentation
Project Structure for the Responsible Data Science Framework: Everything in Its Place
Documentation: The Responsible Thing to Do
Beginning a Responsible Data Science Project
Communities and Crime (Regression)
Justification
Compilation
Identifying Protected Classes
Preparation—Data Splitting and Feature Engineering
Datasheets
COMPAS (Classification)
Justification
Identifying Protected Classes
Preparation
Summary
Chapter 7 Auditing a Responsible Data Science Project
Fairness and Data Science in Practice
The Many Different Conceptions of Fairness
Different Forms of Fairness Are Trade-Offs with Each Other
Quantifying Predictive Fairness Within a Data Science Project
Mitigating Bias to Improve Fairness
Preprocessing
In-processing
Postprocessing
Classification Example: COMPAS
Prework: Code Practices, Modeling, and Auditing
Justification, Compilation, and Preparation Review
Modeling
Auditing
Per-Group Metrics: Overall
Per-Group Metrics: Error
Fairness Metrics
Interpreting Our Models: Why Are They Unfair?
Analysis for Different Groups
Bias Mitigation
Preprocessing: Oversampling
Postprocessing: Optimizing Thresholds Automatically
Postprocessing: Optimizing Thresholds Manually
Summary
Chapter 8 Auditing for Neural Networks
Why Neural Networks Merit Their Own Chapter
Neural Networks Vary Greatly in Structure
Neural Networks Treat Features Differently
Neural Networks Repeat Themselves
A More Impenetrable Black Box
Baseline Methods
Representation Methods
Distillation Methods
Intrinsic Methods
Beginning a Responsible Neural Network Project
Justification
Moving Forward
Compilation
Tracking Experiments
Preparation
Modeling
Auditing
Per-Group Metrics: Overall
Per-Group Metrics: Unusual Definitions of “False Positive”
Fairness Metrics
Interpreting Our Models: Why Are They Unfair?
Bias Mitigation
Wrap-Up
Auditing Neural Networks for Natural Language Processing
Identifying and Addressing Sources of Bias in NLP
The Real World
Data
Models
Model Interpretability
Summary
Chapter 9 Conclusion
How Can We Do Better?
The Responsible Data Science Framework
Doing Better As Managers
Doing Better As Practitioners
A Better Future If We Can Keep It
Index
EULA
