Data Smart: Using Data Science to Transform Information into Insight [2 ed.] 111993138X, 9781119931386

A straightforward and engaging approach to data science that skips the jargon and focuses on the essentials. In the newl

English Pages 448 [445] Year 2023

Table of contents:
Cover Page
Title Page
Copyright Page
About the Author
About the Technical Editors
Acknowledgments
Contents
Introduction
What Am I Doing Here?
What Is Data Science?
Do Data Scientists Actually Use Excel?
Conventions
Let’s Get Going
Chapter 1 Everything You Ever Needed to Know About Spreadsheets but Were Too Afraid to Ask
Some Sample Data
Accessing Quick Descriptive Statistics
Excel Tables
Filtering and Sorting
Table Formatting
Structured References
Adding Table Columns
Lookup Formulas
VLOOKUP
INDEX/MATCH
XLOOKUP
PivotTables
Using Array Formulas
Solving Stuff with Solver
Chapter 2 Set It and Forget It: An Introduction to Power Query
What Is Power Query?
Sample Data
Starting Power Query
Filtering Rows
Removing Columns
Find & Replace
Close & Load to. . .Table
Chapter 3 Naïve Bayes and the Incredible Lightness of Being an Idiot
The World’s Fastest Intro to Probability Theory
Totaling Conditional Probabilities
Joint Probability, the Chain Rule, and Independence
What Happens in a Dependent Situation?
Bayes Rule
Separating the Signal and the Noise
Using the Bayes Rule to Create an AI Model
High-Level Class Probabilities Are Often Assumed to Be Equal
A Couple More Odds and Ends
Let’s Get This Excel Party Started
Cleaning the Data with Power Query
Splitting on Spaces: Giving Each Word Its Due
Counting Tokens and Calculating Probabilities
We Have a Model! Let’s Use It
Chapter 4 Cluster Analysis Part I: Using K-Means to Segment Your Customer Base
Dances at Summer Camp
Getting Real: K-Means Clustering Subscribers in Email Marketing
The Initial Dataset
Determining What to Measure
Start with Four Clusters
Euclidean Distance: Measuring Distances as the Crow Flies
Solving for the Cluster Centers
Making Sense of the Results
Getting the Top Deals by Cluster
The Silhouette: A Good Way to Let Different K Values Duke It Out
How About Five Clusters?
Solving for Five Clusters
Getting the Top Deals for All Five Clusters
Computing the Silhouette for 5-Means Clustering
K-Medians Clustering and Asymmetric Distance Measurements
Using K-Medians Clustering
Getting a More Appropriate Distance Metric
Putting It All in Excel
The Top Deals for the 5-Medians Clusters
Chapter 5 Cluster Analysis Part II: Network Graphs and Community Detection
What Is a Network Graph?
Visualizing a Simple Graph
Beyond GiGraph and Adjacency Lists
Building a Graph from the Wholesale Wine Data
Creating a Cosine Similarity Matrix
Producing an R-Neighborhood Graph
Introduction to Gephi
Creating a Static Adjacency Matrix
Bringing in Your R-Neighborhood Adjacency Matrix into Gephi
Node Degree
Touching the Graph Data
How Much Is an Edge Worth? Points and Penalties in Graph Modularity
What’s a Point, and What’s a Penalty?
Setting Up the Score Sheet
Let’s Get Clustering!
Split Number 1
Split 2: Electric Boogaloo
And. . .Split 3: Split with a Vengeance
Encoding and Analyzing the Communities
There and Back Again: A Gephi Tale
Chapter 6 Regression: The Granddaddy of Supervised Artificial Intelligence
Predicting Pregnant Customers at RetailMart Using Linear Regression
The Feature Set
Assembling the Training Data
Creating Dummy Variables
Let’s Bake Our Own Linear Regression
Linear Regression Statistics: R-Squared, F-Tests, t-Tests
Making Predictions on Some New Data and Measuring Performance
Predicting Pregnant Customers at RetailMart Using Logistic Regression
First You Need a Link Function
Hooking Up the Logistic Function and Reoptimizing
Baking an Actual Logistic Regression
Chapter 7 Ensemble Models: A Whole Lot of Bad Pizza
Getting Started Using the Data from Chapter 6
Bagging: Randomize, Train, Repeat
Decision Stump Is Another Name for a Weak Learner
Doesn’t Seem So Weak to Me!
You Need More Power!
Let’s Train It
Evaluating the Bagged Model
Boosting: If You Get It Wrong, Just Boost and Try Again
Training the Model—Every Feature Gets a Shot
Evaluating the Boosted Model
Chapter 8 Forecasting: Breathe Easy: You Can’t Win
The Sword Trade Is Hopping
Getting Acquainted with Time-Series Data
Starting Slow with Simple Exponential Smoothing
Setting Up the Simple Exponential Smoothing Forecast
You Might Have a Trend
Holt’s Trend-Corrected Exponential Smoothing
Setting Up Holt’s Trend-Corrected Smoothing in a Spreadsheet
So Are You Done? Looking at Autocorrelations
Multiplicative Holt-Winters Exponential Smoothing
Setting the Initial Values for Level, Trend, and Seasonality
Getting Rolling on the Forecast
And. . .Optimize!
Putting a Prediction Interval Around the Forecast
Creating a Fan Chart for Effect
Forecast Sheets in Excel
Chapter 9 Optimization Modeling: Because That “Fresh-Squeezed” Orange Juice Ain’t Gonna Blend Itself
Wait. . .Is This Data Science?
Starting with a Simple Trade-Off
Representing the Problem as a Polytope
Solving by Sliding the Level Set
The Simplex Method: Rooting Around the Corners
Working in Excel
Fresh from the Grove to Your Glass. . .with a Pit Stop Through a Blending Model
Let’s Start with Some Specs
Coming Back to Consistency
Putting the Data into Excel
Setting Up the Problem in Solver
Lowering Your Standards
Dead Squirrel Removal: the Minimax Formulation
If-Then and the “Big M” Constraint
Multiplying Variables: Cranking Up the Volume to 11,000
Modeling Risk
Normally Distributed Data
Chapter 10 Outlier Detection: Just Because They’re Odd Doesn’t Mean They’re Unimportant
Outliers Are (Bad?) People, Too
The Fascinating Case of Hadlum v. Hadlum
Tukey’s Fences
Applying Tukey’s Fences in a Spreadsheet
The Limitations of This Simple Approach
Terrible at Nothing, Bad at Everything
Preparing Data for Graphing
Creating a Graph
Getting the k-Nearest Neighbors
Graph Outlier Detection Method 1: Just Use the Indegree
Graph Outlier Detection Method 2: Getting Nuanced with k-Distance
Graph Outlier Detection Method 3: Local Outlier Factors Are Where It’s At
Chapter 11 Moving on From Spreadsheets
Getting Up and Running with R
A Crash Course in R-ing
Show Me the Numbers! Vector Math and Factoring
The Best Data Type of Them All: the Dataframe
How to Ask for Help in R
It Gets Even Better. . .Beyond Base R
Doing Some Actual Data Science
Reading Data into R
Spherical K-Means on Wine Data in Just a Few Lines
Building AI Models on the Pregnancy Data
Forecasting in R
Looking at Outlier Detection
Chapter 12 Conclusion
Where Am I? What Just Happened?
Before You Go-Go
Get to Know the Problem
We Need More Translators
Beware the Three-Headed Geek-Monster: Tools, Performance, and Mathematical Perfection
You Are Not the Most Important Function of Your Organization
Get Creative and Keep in Touch!
Index
EULA