Learning Data Science: Data Wrangling, Exploration, Visualization, and Modeling with Python [1 ed.] 1098113004, 9781098113001

As an aspiring data scientist, you appreciate why organizations rely on data for important decisions—whether it's f

Language: English | Pages: 594 [597] | Year: 2023

Commentary: Publisher's PDF | Published: September 2023 | Revision History: 2023-09-15: First Release

Table of contents:
Cover
Copyright
Table of Contents
Preface
Expected Background Knowledge
Organization of the Book
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
Part I. The Data Science Lifecycle
Chapter 1. The Data Science Lifecycle
The Stages of the Lifecycle
Examples of the Lifecycle
Summary
Chapter 2. Questions and Data Scope
Big Data and New Opportunities
Example: Google Flu Trends
Target Population, Access Frame, and Sample
Example: What Makes Members of an Online Community Active?
Example: Who Will Win the Election?
Example: How Do Environmental Hazards Relate to an Individual’s Health?
Instruments and Protocols
Measuring Natural Phenomena
Example: What Is the Level of CO2 in the Air?
Accuracy
Types of Bias
Types of Variation
Summary
Chapter 3. Simulation and Data Design
The Urn Model
Sampling Designs
Sampling Distribution of a Statistic
Simulating the Sampling Distribution
Simulation with the Hypergeometric Distribution
Example: Simulating Election Poll Bias and Variance
The Pennsylvania Urn Model
An Urn Model with Bias
Conducting Larger Polls
Example: Simulating a Randomized Trial for a Vaccine
Scope
The Urn Model for Random Assignment
Example: Measuring Air Quality
Summary
Chapter 4. Modeling with Summary Statistics
The Constant Model
Minimizing Loss
Mean Absolute Error
Mean Squared Error
Choosing Loss Functions
Summary
Chapter 5. Case Study: Why Is My Bus Always Late?
Question and Scope
Data Wrangling
Exploring Bus Times
Modeling Wait Times
Summary
Part II. Rectangular Data
Chapter 6. Working with Dataframes Using pandas
Subsetting
Data Scope and Question
Dataframes and Indices
Slicing
Filtering Rows
Example: How Recently Has Luna Become a Popular Name?
Aggregating
Basic Group-Aggregate
Grouping on Multiple Columns
Custom Aggregation Functions
Pivoting
Joining
Inner Joins
Left, Right, and Outer Joins
Example: Popularity of NYT Name Categories
Transforming
Apply
Example: Popularity of “L” Names
The Price of Apply
How Are Dataframes Different from Other Data Representations?
Dataframes and Spreadsheets
Dataframes and Matrices
Dataframes and Relations
Summary
Chapter 7. Working with Relations Using SQL
Subsetting
SQL Basics: SELECT and FROM
What’s a Relation?
Slicing
Filtering Rows
Example: How Recently Has Luna Become a Popular Name?
Aggregating
Basic Group-Aggregate Using GROUP BY
Grouping on Multiple Columns
Other Aggregation Functions
Joining
Inner Joins
Left and Right Joins
Example: Popularity of NYT Name Categories
Transforming and Common Table Expressions
SQL Functions
Multistep Queries Using a WITH Clause
Example: Popularity of “L” Names
Summary
Part III. Understanding the Data
Chapter 8. Wrangling Files
Data Source Examples
Drug Abuse Warning Network (DAWN) Survey
San Francisco Restaurant Food Safety
File Formats
Delimited Format
Fixed-Width Format
Hierarchical Formats
Loosely Formatted Text
File Encoding
File Size
The Shell and Command-Line Tools
Table Shape and Granularity
Granularity of Restaurant Inspections and Violations
DAWN Survey Shape and Granularity
Summary
Chapter 9. Wrangling Dataframes
Example: Wrangling CO2 Measurements from the Mauna Loa Observatory
Quality Checks
Addressing Missing Data
Reshaping the Data Table
Quality Checks
Quality Based on Scope
Quality of Measurements and Recorded Values
Quality Across Related Features
Quality for Analysis
Fixing the Data or Not
Missing Values and Records
Transformations and Timestamps
Transforming Timestamps
Piping for Transformations
Modifying Structure
Example: Wrangling Restaurant Safety Violations
Narrowing the Focus
Aggregating Violations
Extracting Information from Violation Descriptions
Summary
Chapter 10. Exploratory Data Analysis
Feature Types
Example: Dog Breeds
Transforming Qualitative Features
The Importance of Feature Types
What to Look For in a Distribution
What to Look For in a Relationship
Two Quantitative Features
One Qualitative and One Quantitative Variable
Two Qualitative Features
Comparisons in Multivariate Settings
Guidelines for Exploration
Example: Sale Prices for Houses
Understanding Price
What Next?
Examining Other Features
Delving Deeper into Relationships
Fixing Location
EDA Discoveries
Summary
Chapter 11. Data Visualization
Choosing Scale to Reveal Structure
Filling the Data Region
Including Zero
Revealing Shape Through Transformations
Banking to Decipher Relationships
Revealing Relationships Through Straightening
Smoothing and Aggregating Data
Smoothing Techniques to Uncover Shape
Smoothing Techniques to Uncover Relationships and Trends
Smoothing Techniques Need Tuning
Reducing Distributions to Quantiles
When Not to Smooth
Facilitating Meaningful Comparisons
Emphasize the Important Difference
Ordering Groups
Avoid Stacking
Selecting a Color Palette
Guidelines for Comparisons in Plots
Incorporating the Data Design
Data Collected Over Time
Observational Studies
Unequal Sampling
Geographic Data
Adding Context
Example: 100m Sprint Times
Creating Plots Using plotly
Figure and Trace Objects
Modifying Layout
Plotting Functions
Annotations
Other Tools for Visualization
matplotlib
Grammar of Graphics
Summary
Chapter 12. Case Study: How Accurate Are Air Quality Measurements?
Question, Design, and Scope
Finding Collocated Sensors
Wrangling the List of AQS Sites
Wrangling the List of PurpleAir Sites
Matching AQS and PurpleAir Sensors
Wrangling and Cleaning AQS Sensor Data
Checking Granularity
Removing Unneeded Columns
Checking the Validity of Dates
Checking the Quality of PM2.5 Measurements
Wrangling PurpleAir Sensor Data
Checking the Granularity
Handling Missing Values
Exploring PurpleAir and AQS Measurements
Creating a Model to Correct PurpleAir Measurements
Summary
Part IV. Other Data Sources
Chapter 13. Working with Text
Examples of Text and Tasks
Convert Text into a Standard Format
Extract a Piece of Text to Create a Feature
Transform Text into Features
Text Analysis
String Manipulation
Converting Text to a Standard Format with Python String Methods
String Methods in pandas
Splitting Strings to Extract Pieces of Text
Regular Expressions
Concatenation of Literals
Quantifiers
Alternation and Grouping to Create Features
Reference Tables
Text Analysis
Summary
Chapter 14. Data Exchange
NetCDF Data
JSON Data
HTTP
REST
XML, HTML, and XPath
Example: Scraping Race Times from Wikipedia
XPath
Example: Accessing Exchange Rates from the ECB
Summary
Part V. Linear Modeling
Chapter 15. Linear Models
Simple Linear Model
Example: A Simple Linear Model for Air Quality
Interpreting Linear Models
Assessing the Fit
Fitting the Simple Linear Model
Multiple Linear Model
Fitting the Multiple Linear Model
Example: Where Is the Land of Opportunity?
Explaining Upward Mobility Using Commute Time
Relating Upward Mobility Using Multiple Variables
Feature Engineering for Numeric Measurements
Feature Engineering for Categorical Measurements
Summary
Chapter 16. Model Selection
Overfitting
Example: Energy Consumption
Train-Test Split
Cross-Validation
Regularization
Model Bias and Variance
Summary
Chapter 17. Theory for Inference and Prediction
Distributions: Population, Empirical, Sampling
Basics of Hypothesis Testing
Example: A Rank Test to Compare Productivity of Wikipedia Contributors
Example: A Test of Proportions for Vaccine Efficacy
Bootstrapping for Inference
Basics of Confidence Intervals
Basics of Prediction Intervals
Example: Predicting Bus Lateness
Example: Predicting Crab Size
Example: Predicting the Incremental Growth of a Crab
Probability for Inference and Prediction
Formalizing the Theory for Average Rank Statistics
General Properties of Random Variables
Probability Behind Testing and Intervals
Probability Behind Model Selection
Summary
Chapter 18. Case Study: How to Weigh a Donkey
Donkey Study Question and Scope
Wrangling and Transforming
Exploring
Modeling a Donkey’s Weight
A Loss Function for Prescribing Anesthetics
Fitting a Simple Linear Model
Fitting a Multiple Linear Model
Bringing Qualitative Features into the Model
Model Assessment
Summary
Part VI. Classification
Chapter 19. Classification
Example: Wind-Damaged Trees
Modeling and Classification
A Constant Model
Examining the Relationship Between Size and Windthrow
Modeling Proportions (and Probabilities)
A Logistic Model
Log Odds
Using a Logistic Curve
A Loss Function for the Logistic Model
From Probabilities to Classification
The Confusion Matrix
Precision Versus Recall
Summary
Chapter 20. Numerical Optimization
Gradient Descent Basics
Minimizing Huber Loss
Convex and Differentiable Loss Functions
Variants of Gradient Descent
Stochastic Gradient Descent
Mini-Batch Gradient Descent
Newton’s Method
Summary
Chapter 21. Case Study: Detecting Fake News
Question and Scope
Obtaining and Wrangling the Data
Exploring the Data
Exploring the Publishers
Exploring Publication Date
Exploring Words in Articles
Modeling
A Single-Word Model
Multiple-Word Model
Predicting with the tf-idf Transform
Summary
Bibliography
Data Sources
Index
About the Authors
Colophon
