
Advanced Data Analysis from an Elementary Point of View

Cosma Rohilla Shalizi

Spring 2013
Last LaTeX'd Tuesday 16th April, 2013

Contents

Introduction
   To the Reader (Updates)
   Concepts You Should Know

I  Regression and Its Generalizations

1  Regression Basics
   1.1  Statistics, Data Analysis, Regression
   1.2  Guessing the Value of a Random Variable
        1.2.1  Estimating the Expected Value
   1.3  The Regression Function
        1.3.1  Some Disclaimers
   1.4  Estimating the Regression Function
        1.4.1  The Bias-Variance Tradeoff
        1.4.2  The Bias-Variance Trade-Off in Action
        1.4.3  Ordinary Least Squares Linear Regression as Smoothing
   1.5  Linear Smoothers
        1.5.1  k-Nearest-Neighbor Regression
        1.5.2  Kernel Smoothers
   1.6  Exercises

2  The Truth about Linear Regression
   2.1  Optimal Linear Prediction: Multiple Variables
        2.1.1  Collinearity
        2.1.2  The Prediction and Its Error
        2.1.3  Estimating the Optimal Linear Predictor (Unbiasedness and Variance of Ordinary Least Squares Estimates)
   2.2  Shifting Distributions, Omitted Variables, and Transformations
        2.2.1  Changing Slopes (R²: Distraction or Nuisance?)
        2.2.2  Omitted Variables and Shifting Distributions
        2.2.3  Errors in Variables
        2.2.4  Transformation
   2.3  Adding Probabilistic Assumptions
        2.3.1  Examine the Residuals
        2.3.2  On Significant Coefficients
   2.4  Linear Regression Is Not the Philosopher's Stone
   2.5  Exercises

3  Model Evaluation
   3.1  What Are Statistical Models For?
   3.2  Errors, In and Out of Sample
   3.3  Over-Fitting and Model Selection
   3.4  Cross-Validation
        3.4.1  Data-set Splitting
        3.4.2  k-Fold Cross-Validation (CV)
        3.4.3  Leave-one-out Cross-Validation
   3.5  Warnings
        3.5.1  Parameter Interpretation
   3.6  Exercises

4  Smoothing in Regression
   4.1  How Much Should We Smooth?
   4.2  Adapting to Unknown Roughness
        4.2.1  Bandwidth Selection by Cross-Validation
        4.2.2  Convergence of Kernel Smoothing and Bandwidth Scaling
        4.2.3  Summary on Kernel Smoothing
   4.3  Kernel Regression with Multiple Inputs
   4.4  Interpreting Smoothers: Plots
   4.5  Average Predictive Comparisons
   4.6  Exercises

5  Simulation
   5.1  What Do We Mean by "Simulation"?
   5.2  How Do We Simulate Stochastic Models?
        5.2.1  Chaining Together Random Variables
        5.2.2  Random Variable Generation (Built-in Random Number Generators, Transformations, Quantile Method, Rejection Method, The Metropolis Algorithm and Markov Chain Monte Carlo, Generating Uniform Random Numbers)
        5.2.3  Sampling (Sampling Rows from Data Frames, Multinomials and Multinoullis, Probabilities of Observation)
        5.2.4  Repeating Simulations
   5.3  Why Simulate?
        5.3.1  Understanding the Model; Monte Carlo
        5.3.2  Checking the Model
        5.3.3  Sensitivity Analysis
   5.4  The Method of Simulated Moments
        5.4.1  The Method of Moments
        5.4.2  Adding in the Simulation
        5.4.3  An Example: Moving Average Models and the Stock Market
   5.5  Exercises
   5.6  Appendix: Some Design Notes on the Method of Moments Code

6  The Bootstrap
   6.1  Stochastic Models, Uncertainty, Sampling Distributions
   6.2  The Bootstrap Principle
        6.2.1  Variances and Standard Errors
        6.2.2  Bias Correction
        6.2.3  Confidence Intervals (Other Bootstrap Confidence Intervals)
        6.2.4  Hypothesis Testing (Double bootstrap hypothesis testing)
        6.2.5  Parametric Bootstrapping Example: Pareto's Law of Wealth Inequality
   6.3  Non-parametric Bootstrapping
        6.3.1  Parametric vs. Nonparametric Bootstrapping
   6.4  Bootstrapping Regression Models
        6.4.1  Re-sampling Points: Parametric Example
        6.4.2  Re-sampling Points: Non-parametric Example
        6.4.3  Re-sampling Residuals: Example
   6.5  Bootstrap with Dependent Data
   6.6  Things Bootstrapping Does Poorly
   6.7  Further Reading
   6.8  Exercises

7  Weighting and Variance
   7.1  Weighted Least Squares
   7.2  Heteroskedasticity
        7.2.1  Weighted Least Squares as a Solution to Heteroskedasticity
        7.2.2  Some Explanations for Weighted Least Squares
        7.2.3  Finding the Variance and Weights
   7.3  Conditional Variance Function Estimation
        7.3.1  Iterative Refinement of Mean and Variance: An Example
        7.3.2  Real Data Example: Old Heteroskedastic
   7.4  Re-sampling Residuals with Heteroskedasticity
   7.5  Local Linear Regression
        7.5.1  Advantages and Disadvantages of Locally Linear Regression
        7.5.2  Lowess
   7.6  Exercises

8  Splines
   8.1  Smoothing by Directly Penalizing Curve Flexibility
        8.1.1  The Meaning of the Splines
   8.2  Computational Example: Splines for Stock Returns
        8.2.1  Confidence Bands for Splines
   8.3  Basis Functions and Degrees of Freedom
        8.3.1  Basis Functions
        8.3.2  Degrees of Freedom
   8.4  Splines in Multiple Dimensions
   8.5  Smoothing Splines versus Kernel Regression
   8.6  Further Reading
   8.7  Exercises

9  Additive Models
   9.1  Partial Residuals and Back-fitting for Linear Models
   9.2  Additive Models
   9.3  The Curse of Dimensionality
   9.4  Example: California House Prices Revisited
   9.5  Closing Modeling Advice
   9.6  Further Reading

10  Testing Regression Specifications
    10.1  Testing Functional Forms
          10.1.1  Examples of Testing a Parametric Model
          10.1.2  Remarks (Other Nonparametric Regressions, Additive Alternatives, Testing E[ε̂|X] = 0, Stabilizing the Sampling Distribution of the Test Statistic)
    10.2  Why Use Parametric Models At All?
    10.3  Why We Sometimes Want Mis-Specified Parametric Models

11  More about Hypothesis Testing

12  Logistic Regression
    12.1  Modeling Conditional Probabilities
    12.2  Logistic Regression
          12.2.1  Likelihood Function for Logistic Regression
          12.2.2  Logistic Regression with More Than Two Classes
    12.3  Newton's Method for Numerical Optimization
          12.3.1  Newton's Method in More than One Dimension
          12.3.2  Iteratively Re-Weighted Least Squares
    12.4  Generalized Linear Models and Generalized Additive Models
          12.4.1  Generalized Additive Models
          12.4.2  An Example (Including Model Checking)
    12.5  Exercises

13  GLMs and GAMs
    13.1  Generalized Linear Models and Iterative Least Squares
          13.1.1  GLMs in General
          13.1.2  Examples of GLMs (Vanilla Linear Models, Binomial Regression, Poisson Regression)
          13.1.3  Uncertainty
    13.2  Generalized Additive Models
    13.3  Weather Forecasting in Snoqualmie Falls
    13.4  Exercises

II  Multivariate Data, Distributions, and Latent Structure

14  Multivariate Distributions
    14.1  Review of Definitions
    14.2  Multivariate Gaussians
          14.2.1  Linear Algebra and the Covariance Matrix
          14.2.2  Conditional Distributions and Least Squares
          14.2.3  Projections of Multivariate Gaussians
          14.2.4  Computing with Multivariate Gaussians
    14.3  Inference with Multivariate Distributions
          14.3.1  Estimation
          14.3.2  Model Comparison
          14.3.3  Goodness-of-Fit
    14.4  Exercises

15  Density Estimation
    15.1  Histograms Revisited
    15.2  "The Fundamental Theorem of Statistics"
    15.3  Error for Density Estimates
          15.3.1  Error Analysis for Histogram Density Estimates
    15.4  Kernel Density Estimates
          15.4.1  Analysis of Kernel Density Estimates
          15.4.2  Joint Density Estimates
          15.4.3  Categorical and Ordered Variables
          15.4.4  Practicalities
          15.4.5  Kernel Density Estimation in R: An Economic Example
    15.5  Conditional Density Estimation
          15.5.1  Practicalities and a Second Example
    15.6  More on the Expected Log-Likelihood Ratio
    15.7  Simulating from Density Estimates
          15.7.1  Simulating from Kernel Density Estimates (Sampling from a Kernel Joint Density Estimate, Sampling from Kernel Conditional Density Estimates)
          15.7.2  Sampling from Histogram Estimates
          15.7.3  Examples of Simulating from Kernel Density Estimates
    15.8  Exercises

16  Relative Distributions and Smooth Tests
    16.1  Smooth Tests of Goodness of Fit
          16.1.1  From Continuous CDFs to Uniform Distributions
          16.1.2  Testing Uniformity
          16.1.3  Neyman's Smooth Test (Choice of Function Basis, Choice of Number of Basis Functions, Application: Combining p-Values, Density Estimation by Series Expansion)
          16.1.4  Smooth Tests of Non-Uniform Parametric Families (Estimated Parameters)
          16.1.5  Implementation in R (Some Examples)
          16.1.6  Conditional Distributions and Calibration
    16.2  Relative Distributions
          16.2.1  Estimating the Relative Distribution
          16.2.2  R Implementation and Examples (Example: Conservative versus Liberal Brains, Example: Economic Growth Rates)
          16.2.3  Adjusting for Covariates (Example: Adjusting Growth Rates)
    16.3  Further Reading
    16.4  Exercises

17  Principal Components Analysis
    17.1  Mathematics of Principal Components
          17.1.1  Minimizing Projection Residuals
          17.1.2  Maximizing Variance
          17.1.3  More Geometry; Back to the Residuals
          17.1.4  Statistical Inference, or Not
    17.2  Example: Cars
    17.3  Latent Semantic Analysis
          17.3.1  Principal Components of the New York Times
    17.4  PCA for Visualization
    17.5  PCA Cautions
    17.6  Exercises

18  Factor Analysis
    18.1  From PCA to Factor Analysis
          18.1.1  Preserving correlations
    18.2  The Graphical Model
          18.2.1  Observables Are Correlated Through the Factors
          18.2.2  Geometry: Approximation by Hyper-planes
    18.3  Roots of Factor Analysis in Causal Discovery
    18.4  Estimation
          18.4.1  Degrees of Freedom (more unknowns than equations; more equations than unknowns)
          18.4.2  A Clue from Spearman's One-Factor Model
          18.4.3  Estimating Factor Loadings and Specific Variances
    18.5  Maximum Likelihood Estimation
          18.5.1  Alternative Approaches
          18.5.2  Estimating Factor Scores
    18.6  The Rotation Problem
    18.7  Factor Analysis as a Predictive Model
          18.7.1  How Many Factors? (R² and Goodness of Fit)
    18.8  Reification, and Alternatives to Factor Models
          18.8.1  The Rotation Problem Again
          18.8.2  Factors or Mixtures?
          18.8.3  The Thomson Sampling Model

19  Mixture Models
    19.1  Two Routes to Mixture Models
          19.1.1  From Factor Analysis to Mixture Models
          19.1.2  From Kernel Density Estimates to Mixture Models
          19.1.3  Mixture Models
          19.1.4  Geometry
          19.1.5  Identifiability
          19.1.6  Probabilistic Clustering
          19.1.7  Simulation
    19.2  Estimating Parametric Mixture Models
          19.2.1  More about the EM Algorithm
          19.2.2  Further Reading on and Applications of EM
          19.2.3  Topic Models and Probabilistic LSA
    19.3  Non-parametric Mixture Modeling
    19.4  Worked Computational Example
          19.4.1  Mixture Models in R
          19.4.2  Fitting a Mixture of Gaussians to Real Data
          19.4.3  Calibration-checking for the Mixture
          19.4.4  Selecting the Number of Components by Cross-Validation
          19.4.5  Interpreting the Mixture Components, or Not
          19.4.6  Hypothesis Testing for Mixture-Model Selection
    19.5  Exercises

20  Graphical Models
    20.1  Conditional Independence and Factor Models
    20.2  Directed Acyclic Graph (DAG) Models
          20.2.1  Conditional Independence and the Markov Property
    20.3  Examples of DAG Models and Their Uses
          20.3.1  Missing Variables
    20.4  Non-DAG Graphical Models
          20.4.1  Undirected Graphs (Further reading)
          20.4.2  Directed but Cyclic Graphs
    20.5  Further Reading

III  Causal Inference

21  Graphical Causal Models
    21.1  Causation and Counterfactuals
    21.2  Causal Graphical Models
          21.2.1  Calculating the "effects of causes"
          21.2.2  Back to Teeth
    21.3  Conditional Independence and d-Separation
          21.3.1  D-Separation Illustrated
          21.3.2  Linear Graphical Models and Path Coefficients
          21.3.3  Positive and Negative Associations
    21.4  Independence and Information
    21.5  Further Reading
    21.6  Exercises

22  Identifying Causal Effects
    22.1  Causal Effects, Interventions and Experiments
          22.1.1  The Special Role of Experiment
    22.2  Identification and Confounding
    22.3  Identification Strategies
          22.3.1  The Back-Door Criterion: Identification by Conditioning (The Entner Rules)
          22.3.2  The Front-Door Criterion: Identification by Mechanisms (The Front-Door Criterion and Mechanistic Explanation)
          22.3.3  Instrumental Variables (Some Invalid Instruments, Critique of Instrumental Variables)
          22.3.4  Failures of Identification
    22.4  Summary
          22.4.1  Further Reading
    22.5  Exercises

23  Estimating Causal Effects
    23.1  Estimators in the Back- and Front-Door Criteria
          23.1.1  Estimating Average Causal Effects
          23.1.2  Avoiding Estimating Marginal Distributions
          23.1.3  Propensity Scores
          23.1.4  Matching and Propensity Scores
    23.2  Instrumental-Variables Estimates
    23.3  Uncertainty and Inference
    23.4  Recommendations
    23.5  Further Reading
    23.6  Exercises

24  Discovering Causal Structure
    24.1  Testing DAGs
    24.2  Testing Conditional Independence
    24.3  Faithfulness and Equivalence
          24.3.1  Partial Identification of Effects
    24.4  Causal Discovery with Known Variables
          24.4.1  The PC Algorithm
          24.4.2  Causal Discovery with Hidden Variables (Partial identification of effects)
          24.4.3  On Conditional Independence Tests
    24.5  Software and Examples
    24.6  Limitations on Consistency of Causal Discovery
    24.7  Further Reading
    24.8  Exercises

IV  Dependent Data

25  Time Series
    25.1  Time Series, What They Are (Other kinds of time series, Notation)
    25.2  Stationarity
          25.2.1  Autocorrelation
          25.2.2  The Ergodic Theorem (The World's Simplest Ergodic Theorem, Rate of Convergence, Why Ergodicity Matters)
    25.3  Markov Models
          25.3.1  Meaning of the Markov Property
    25.4  Autoregressive Models
          25.4.1  Autoregressions with Covariates
          25.4.2  Additive Autoregressions (Example: The lynx)
          25.4.3  Linear Autoregression ("Unit Roots" and Stationary Solutions)
          25.4.4  Conditional Variance (Example: lynx)
          25.4.5  Regression with Correlated Noise; Generalized Least Squares
    25.5  Bootstrapping Time Series
          25.5.1  Parametric or Model-Based Bootstrap
          25.5.2  Block Bootstraps
          25.5.3  Sieve Bootstrap
    25.6  Trends and De-Trending
          25.6.1  Forecasting Trends
          25.6.2  Seasonal Components
          25.6.3  Detrending by Differencing
    25.7  Further Reading
    25.8  Exercises

26  Time Series with Latent Variables

27  Longitudinal, Spatial and Network Data

Appendices

A  Programming
   A.1  Functions
   A.2  First Example: Pareto Quantiles
   A.3  Functions Which Call Functions
        A.3.1  Sanity-Checking Arguments
   A.4  Layering Functions and Debugging
        A.4.1  More on Debugging
   A.5  Automating Repetition and Passing Arguments
   A.6  Avoiding Iteration: Manipulating Objects
        A.6.1  ifelse and which
        A.6.2  apply and Its Variants
   A.7  More Complicated Return Values
   A.8  Re-Writing Your Code: An Extended Example
   A.9  General Advice on Programming
        A.9.1  Comment your code
        A.9.2  Use meaningful names
        A.9.3  Check whether your program works
        A.9.4  Avoid writing the same thing twice
        A.9.5  Start from the beginning and break it down
        A.9.6  Break your code into many short, meaningful functions
   A.10  Further Reading

B  Big O and Little o Notation

C  χ² and the Likelihood Ratio Test

D  Proof of the Gauss-Markov Theorem

E  Constrained and Penalized Optimization
   E.1  Constrained Optimization
   E.2  Lagrange Multipliers
   E.3  Penalized Optimization
   E.4  Mini-Example: Constrained Linear Regression
        E.4.1  Statistical Remark: "Ridge Regression" and "The Lasso"

F  Rudimentary Graph Theory

G  Pseudo-code for the SGS Algorithm
   G.1  Pseudo-code for the SGS Algorithm
   G.2  Pseudo-code for the PC Algorithm

Introduction

To the Reader

These are the notes for 36-402, Advanced Data Analysis, at Carnegie Mellon. If you are not enrolled in the class, you should know that it's the methodological capstone of the core statistics sequence taken by our undergraduate majors (usually in their third year), and by students from a range of other departments. By this point, they have taken classes in introductory statistics and data analysis, probability theory, mathematical statistics, and modern linear regression ("401"). This class does not presume that you have learned but forgotten the material from the pre-requisites; it presumes that you know that material and can go beyond it. The class also presumes a firm grasp on linear algebra and multivariable calculus, and that you can read and write simple functions in R. If you are lacking in any of these areas, now would be an excellent time to leave.

36-402 is a class in statistical methodology: its aim is to get students to understand something of the range of modern [1] methods of data analysis, and of the considerations which go into choosing the right method for the job at hand (rather than distorting the problem to fit the methods the student happens to know). Statistical theory is kept to a minimum, and largely introduced as needed. Since 36-402 is also a class in data analysis, there are assignments in which, nearly every week, a new, often large, data set is analyzed with new methods. (I reserve the right to re-use data sets, and even to fake data, but will do so sparingly.) Assignments and data will be on the class web-page.

There is no way to cover every important topic for data analysis in just a semester. Much of what's not here — sampling, experimental design, advanced multivariate methods, hierarchical models, the intricacies of categorical data, graphics, data mining — gets covered by our other undergraduate classes. Other important areas, like dependent data, inverse problems, model selection or robust estimation, have to wait for graduate school.

The mathematical level of these notes is deliberately low; nothing should be beyond a competent second-year student. But every subject covered here can be profitably studied using vastly more sophisticated techniques; that's why this is advanced data analysis from an elementary point of view. If reading these pages inspires anyone to study the same material from an advanced point of view, I will consider my troubles to have been amply repaid.

A final word. At this stage in your statistical education, you have gained two kinds of knowledge — a few general statistical principles, and many more specific procedures, tests, recipes, etc. If you are a typical ADA student, you are much more comfortable with the specifics than the generalities. But the truth is that while none of your recipes are wrong, they are tied to assumptions which hardly ever hold. Learning more flexible and powerful methods, which have a much better hope of being reliable, will demand a lot of hard thinking and hard work. Those of you who succeed, however, will have done something you can be proud of.

[1] Just as an undergraduate "modern physics" course aims to bring the student up to about 1930 (more specifically, to 1926), this class aims to bring the student up to about 1990.

Updates

The page for this book is http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/. The latest version will live there. It will eventually be published by Cambridge University Press, at which point there will still be a free next-to-final draft at that URL, and errata. I plan to incorporate the data-analysis problem sets into the text; in the meanwhile, they can be found at the page for the class, http://www.stat.cmu.edu/~cshalizi/uADA/.

Concepts You Should Know

If more than a handful of these are unfamiliar, it is very unlikely that you are ready for this course.

Random variable; population, sample. Cumulative distribution function, probability mass function, probability density function. Specific distributions: Bernoulli, binomial, Poisson, geometric, Gaussian, exponential, t, Gamma. Expectation value. Variance, standard deviation. Sample mean, sample variance. Median, mode. Quartile, percentile, quantile. Inter-quartile range. Histograms.

Joint distribution functions. Conditional distributions; conditional expectations and variances. Statistical independence and dependence. Covariance and correlation; why dependence is not the same thing as correlation. Rules for arithmetic with expectations, variances and covariances. Laws of total probability, total expectation, total variation. Contingency tables; odds ratio, log odds ratio.

Sequences of random variables. Stochastic process. Law of large numbers. Central limit theorem.

Parameters; estimator functions and point estimates. Sampling distribution. Bias of an estimator. Standard error of an estimate; standard error of the mean; how and why the standard error of the mean differs from the standard deviation. Confidence intervals and interval estimates.

Hypothesis tests. Tests for differences in means and in proportions; Z and t tests; degrees of freedom. Size, significance, power. Relation between hypothesis tests and confidence intervals. The χ² test of independence for contingency tables; degrees of freedom. KS test for goodness-of-fit to distributions.

Linear regression. Meaning of the linear regression function. Fitted values and residuals of a regression. Interpretation of regression coefficients. Least-squares estimate of coefficients. Matrix formula for estimating the coefficients; the hat matrix. R²; why adding more predictor variables never reduces R². The t-test for the significance of individual coefficients given other coefficients. The F-test and partial F-test for the significance of regression models. Degrees of freedom for residuals. Examination of residuals. Confidence intervals for parameters. Confidence intervals for fitted values. Prediction intervals.

Likelihood. Likelihood functions. Maximum likelihood estimates. Relation between maximum likelihood, least squares, and Gaussian distributions. Relation between confidence intervals and the likelihood function. Likelihood ratio test.

Part I

Regression and Its Generalizations


Chapter 1

Regression: Predicting and Relating Quantitative Features

1.1 Statistics, Data Analysis, Regression

Statistics is the branch of mathematical engineering which designs and analyses methods for drawing reliable inferences from imperfect data.

The subject of most sciences is some aspect of the world around us, or within us. Psychology studies minds; geology studies the Earth's composition and form; economics studies production, distribution and exchange; mycology studies mushrooms. Statistics does not study the world, but some of the ways we try to understand the world — some of the intellectual tools of the other sciences. Its utility comes indirectly, through helping those other sciences.

This utility is very great, because all the sciences have to deal with imperfect data. Data may be imperfect because we can only observe and record a small fraction of what is relevant; or because we can only observe indirect signs of what is truly relevant; or because, no matter how carefully we try, our data always contain an element of noise. Over the last two centuries, statistics has come to handle all such imperfections by modeling them as random processes, and probability has become so central to statistics that we introduce random events deliberately (as in sample surveys) [1].

Statistics, then, uses probability to model inference from data. We try to mathematically understand the properties of different procedures for drawing inferences: Under what conditions are they reliable? What sorts of errors do they make, and how often? What can they tell us when they work? What are signs that something has gone wrong? Like other branches of engineering, statistics aims not just at understanding but also at improvement: we want to analyze data better, more reliably, with fewer and smaller errors, under broader conditions, faster, and with less mental effort. Sometimes some of these goals conflict — a fast, simple method might be very error-prone, or only reliable under a narrow range of circumstances.

One of the things that people most often want to know about the world is how different variables are related to each other, and one of the central tools statistics has for learning about relationships is regression [2]. In your linear regression class, you learned about how it could be used in data analysis, and learned about its properties. In this class, we will build on that foundation, extending beyond basic linear regression in many directions, to answer many questions about how variables are related to each other.

This is intimately related to prediction. Being able to make predictions isn't the only reason we want to understand relations between variables, but prediction tests our knowledge of relations. (If we misunderstand, we might still be able to predict, but it's hard to see how we could understand and not be able to predict.) So before we go beyond linear regression, we will first look at prediction, and how to predict one variable from nothing at all. Then we will look at predictive relationships between variables, and see how linear regression is just one member of a big family of smoothing methods, all of which are available to us.

[1] Two excellent, but very different, histories of how statistics came to this understanding are Hacking (1990) and Porter (1986).

[2] The origin of the name is instructive. It comes from 19th century investigations into the relationship between the attributes of parents and their children. People who are taller (heavier, faster, ...) than average tend to have children who are also taller than average, but not quite as tall. Likewise, the children of unusually short parents also tend to be closer to the average, and similarly for other traits. This came to be called "regression towards the mean", or even "regression towards mediocrity"; hence the line relating the average height (or whatever) of children to that of their parents was "the regression line", and the word stuck.

1.2 Guessing the Value of a Random Variable

We have a quantitative, numerical variable, which we'll imaginatively call Y. We'll suppose that it's a random variable, and try to predict it by guessing a single value for it. (Other kinds of predictions are possible — we might guess whether Y will fall within certain limits, or the probability that it does so, or even the whole probability distribution of Y. But some lessons we'll learn here will apply to these other kinds of predictions as well.) What is the best value to guess? More formally, what is the optimal point forecast for Y?

To answer this question, we need to pick a function to be optimized, which should measure how good our guesses are — or equivalently how bad they are, how big an error we're making. A reasonable starting point is the mean squared error:

\[
\mathrm{MSE}(a) \equiv \mathbf{E}\left[ (Y - a)^2 \right] \qquad (1.1)
\]

So we'd like to find the value r where MSE(a) is smallest.

\[
\begin{aligned}
\mathrm{MSE}(a) &= \mathbf{E}\left[ (Y - a)^2 \right] && (1.2)\\
                &= (\mathbf{E}[Y - a])^2 + \mathrm{Var}[Y - a] && (1.3)\\
                &= (\mathbf{E}[Y - a])^2 + \mathrm{Var}[Y] && (1.4)\\
                &= (\mathbf{E}[Y] - a)^2 + \mathrm{Var}[Y] && (1.5)\\
\frac{d\,\mathrm{MSE}}{da} &= -2\,(\mathbf{E}[Y] - a) + 0 && (1.6)\\
2\,(\mathbf{E}[Y] - r) &= 0 && (1.7)\\
r &= \mathbf{E}[Y] && (1.8)
\end{aligned}
\]

(Here we take the derivative with respect to a, set it to zero, and call the minimizing value r.)

So, if we gauge the quality of our prediction by mean-squared error, the best prediction to make is the expected value.
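As a quick numerical illustration (a minimal sketch of my own, not from the notes; the distribution, sample size, and the name emp.mse are all made up), we can check that the empirical mean squared error of a constant guess is minimized at (approximately) the sample mean:

y <- rnorm(1000, mean = 3, sd = 2)            # hypothetical sample of Y
emp.mse <- function(a) { mean((y - a)^2) }    # empirical MSE of the constant guess a
best <- optimize(emp.mse, interval = range(y))
c(optimum = best$minimum, sample.mean = mean(y))  # the two should agree closely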

1.2.1 Estimating the Expected Value

Of course, to make the prediction E[Y] we would have to know the expected value of Y. Typically, we do not. However, if we have sampled values, y_1, y_2, ..., y_n, we can estimate the expectation from the sample mean:

\[
\hat{r} \equiv \frac{1}{n} \sum_{i=1}^{n} y_i \qquad (1.9)
\]

If the samples are independent and identically distributed (IID), then the law of large numbers tells us that

\[
\hat{r} \to \mathbf{E}[Y] = r \qquad (1.10)
\]

and the central limit theorem tells us something about how fast the convergence is (namely the squared error will typically be about Var[Y]/n). Of course the assumption that the y_i come from IID samples is a strong one, but we can assert pretty much the same thing if they're just uncorrelated with a common expected value. Even if they are correlated, but the correlations decay fast enough, all that changes is the rate of convergence. So "sit, wait, and average" is a pretty reliable way of estimating the expectation value.
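A small simulation sketch of the rate claim (mine, with made-up numbers): the squared error of the sample mean shrinks like Var[Y]/n.

var.y <- 4                                       # hypothetical Var[Y]
mse.of.mean <- function(n, reps = 5000) {
  means <- replicate(reps, mean(rnorm(n, mean = 0, sd = sqrt(var.y))))
  mean(means^2)                                  # squared error of the estimate of E[Y] = 0
}
n.values <- c(10, 100, 1000)
cbind(n = n.values,
      simulated = sapply(n.values, mse.of.mean),
      theory = var.y / n.values)                 # the two columns should roughly match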

1.3 The Regression Function

Of course, it's not very useful to predict just one number for a variable. Typically, we have lots of variables in our data, and we believe they are related somehow. For example, suppose that we have data on two variables, X and Y, which might look like Figure 1.1. The feature Y is what we are trying to predict, a.k.a. the dependent variable or output or response, and X is the predictor or independent variable or covariate or input. Y might be something like the profitability of a customer and X their credit rating, or, if you want a less mercenary example, Y could be some measure of improvement in blood cholesterol and X the dose taken of a drug. Typically we won't have just one input feature X but rather many of them, but that gets harder to draw and doesn't change the points of principle.

Figure 1.2 shows the same data as Figure 1.1, only with the sample mean added on. This clearly tells us something about the data, but also it seems like we should be able to do better — to reduce the average error — by using X, rather than by ignoring it.

Let's say that we want our prediction to be a function of X, namely f(X). What should that function be, if we still use mean squared error? We can work this out by using the law of total expectation, i.e., the fact that E[U] = E[E[U|V]] for any random variables U and V:

\[
\begin{aligned}
\mathrm{MSE}(f(X)) &= \mathbf{E}\left[ (Y - f(X))^2 \right] && (1.11)\\
&= \mathbf{E}\left[ \mathbf{E}\left[ (Y - f(X))^2 \mid X \right] \right] && (1.12)\\
&= \mathbf{E}\left[ \mathrm{Var}[Y \mid X] + \left( \mathbf{E}[Y - f(X) \mid X] \right)^2 \right] && (1.13)
\end{aligned}
\]

When we want to minimize this, the first term inside the expectation doesn't depend on our prediction, and the second term looks just like our previous optimization, only with all expectations conditional on X, so for our optimal function r(x) we get

\[
r(x) = \mathbf{E}[Y \mid X = x] \qquad (1.14)
\]

In other words, the (mean-squared) optimal conditional prediction is just the conditional expected value. The function r(x) is called the regression function. This is what we would like to know when we want to predict Y.
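To see what the regression function buys us, here is a minimal sketch (not from the text; the sinusoidal model is invented for illustration) comparing the best constant prediction with predictions from the true conditional expectation:

x <- runif(10000)
y <- sin(2 * pi * x) + rnorm(length(x), sd = 0.3)    # made-up model: r(x) = sin(2*pi*x)
mse.constant <- mean((y - mean(y))^2)                 # best single-number prediction, E[Y]
mse.conditional <- mean((y - sin(2 * pi * x))^2)      # prediction using the true r(x)
c(constant = mse.constant, conditional = mse.conditional)  # conditional MSE is much smaller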

1.3.1 Some Disclaimers

It's important to be clear on what is and is not being assumed here. Talking about X as the "independent variable" and Y as the "dependent" one suggests a causal model, which we might write

\[
Y \leftarrow r(X) + \epsilon \qquad (1.15)
\]

where the direction of the arrow, ←, indicates the flow from causes to effects, and ε is some noise variable. If the gods of inference are very, very kind, then ε would have a fixed distribution, independent of X, and we could without loss of generality take it to have mean zero. ("Without loss of generality" because if it has a non-zero mean, we can incorporate that into r(X) as an additive constant.) However, no such assumption is required to get Eq. 1.14. It works when predicting effects from causes, or the other way around when predicting (or "retrodicting") causes from effects, or indeed when there is no causal relationship whatsoever between X and Y [3]. It is always true that

\[
Y \mid X = r(X) + \eta(X) \qquad (1.16)
\]

where η(X) is a noise variable with mean zero, but as the notation indicates the distribution of the noise generally depends on X.

[3] We will cover causal inference in considerable detail in Part III.
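A tiny simulation sketch of Eq. 1.16 (my own, with an invented r(x) and noise law): the noise η(X) is mean-zero at every x, but its spread changes with x.

x <- runif(5000)
r <- function(x) { 2 * x }                          # hypothetical regression function
eta <- rnorm(length(x), mean = 0, sd = 0.1 + x)     # noise whose spread grows with x
y <- r(x) + eta
bins <- cut(x, breaks = seq(0, 1, by = 0.25))       # group the data by ranges of x
rbind(cond.mean = tapply(y, bins, mean),            # conditional means track r(x)
      cond.sd = tapply(y, bins, sd))                # but the conditional spread is not constant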

plot(all.x, all.y, xlab="x", ylab="y")
rug(all.x, side=1, col="grey")
rug(all.y, side=2, col="grey")

Figure 1.1: Scatterplot of the example data. (These are made up.) The rug commands add horizontal and vertical ticks to the axes to mark the location of the data (in grey so they're less strong than the main tick-marks). This isn't necessary but is often helpful. The data are in the example.dat file.

abline(h=mean(all.y), lty=3)

Figure 1.2: Data from Figure 1.1, with a horizontal line showing the sample mean of Y.


It's also important to be clear that when we find the regression function is a constant, r(x) = r_0 for all x, this does not mean that X and Y are statistically independent. If they are independent, then the regression function is a constant, but turning this around is the logical fallacy of "affirming the consequent" [4].

[4] As in combining the fact that all human beings are featherless bipeds, and the observation that a cooked turkey is a featherless biped, to conclude that cooked turkeys are human beings. An econometrician stops there; an econometrician who wants to be famous writes a best-selling book about how this proves that Thanksgiving is really about cannibalism.

1.4 Estimating the Regression Function

We want to find the regression function r(x) = E[Y|X = x], and what we've got is a big set of training examples, of pairs (x_1, y_1), (x_2, y_2), ... (x_n, y_n). How should we proceed? If X takes on only a finite set of values, then a simple strategy is to use the conditional sample means:

\[
\hat{r}(x) = \frac{1}{\#\{i : x_i = x\}} \sum_{i : x_i = x} y_i \qquad (1.17)
\]

By the same kind of law-of-large-numbers reasoning as before, we can be confident that r̂(x) → E[Y|X = x].

Unfortunately, this only works if X has only a finite set of values. If X is continuous, then in general the probability of our getting a sample at any particular value is zero, as is the probability of getting multiple samples at exactly the same value of x. This is a basic issue with estimating any kind of function from data — the function will always be undersampled, and we need to fill in between the values we see. We also need to somehow take into account the fact that each y_i is a sample from the conditional distribution of Y|X = x_i, and so is not generally equal to E[Y|X = x_i]. So any kind of function estimation is going to involve interpolation, extrapolation, and smoothing.

Different methods of estimating the regression function — different regression methods, for short — involve different choices about how we interpolate, extrapolate and smooth. This involves our making a choice about how to approximate r(x) by a limited class of functions which we know (or at least hope) we can estimate. There is no guarantee that our choice leads to a good approximation in the case at hand, though it is sometimes possible to say that the approximation error will shrink as we get more and more data. This is an extremely important topic and deserves an extended discussion, coming next.
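For a discrete X, Eq. 1.17 is just a group-wise average. A minimal sketch (not from the text; the data are simulated) using tapply:

x <- sample(1:4, 500, replace = TRUE)        # hypothetical discrete predictor
y <- 2 * x + rnorm(length(x))                # hypothetical response, true r(x) = 2x
r.hat <- tapply(y, x, mean)                  # conditional sample means, one per value of x
rbind(estimate = r.hat, truth = 2 * (1:4))   # estimates should sit near the true values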

1.4.1 The Bias-Variance Tradeoff

Suppose that the true regression function is r(x), but we use the function r̂ to make our predictions. Let's look at the mean squared error at X = x in a slightly different way than before, which will make it clearer what happens when we can't use r to make predictions. We'll begin by expanding (Y − r̂(x))², since the MSE at x is just the expectation of this:

\[
\begin{aligned}
(Y - \hat{r}(x))^2 &= (Y - r(x) + r(x) - \hat{r}(x))^2 && (1.18)\\
&= (Y - r(x))^2 + 2(Y - r(x))(r(x) - \hat{r}(x)) + (r(x) - \hat{r}(x))^2 && (1.19)
\end{aligned}
\]

We saw above (Eq. 1.16) that Y − r(x) = η, a random variable which has expectation zero (and is uncorrelated with x). When we take the expectation of Eq. 1.19, nothing happens to the last term (since it doesn't involve any random quantities); the middle term goes to zero (because E[Y − r(x)] = E[η] = 0), and the first term becomes the variance of η. This depends on x, in general, so let's call it σ²_x. We have

\[
\mathrm{MSE}(\hat{r}(x)) = \sigma_x^2 + (r(x) - \hat{r}(x))^2 \qquad (1.20)
\]

The σ_x² term doesn't depend on our prediction function, just on how hard it is, intrinsically, to predict Y at X = x. The second term, though, is the extra error we get from not knowing r. (Unsurprisingly, ignorance of r cannot improve our predictions.) This is our first bias-variance decomposition: the total MSE at x is decomposed into a (squared) bias r(x) − r̂(x), the amount by which our predictions are systematically off, and a variance σ_x², the unpredictable, "statistical" fluctuation around even the best prediction.

All of the above assumes that r̂ is a single fixed function. In practice, of course, r̂ is something we estimate from earlier data. But if those data are random, the exact regression function we get is random too; let's call this random function R̂_n, where the subscript reminds us of the finite amount of data we used to estimate it. What we have analyzed is really MSE(R̂_n(x) | R̂_n = r̂), the mean squared error conditional on a particular estimated regression function. What can we say about the prediction error of the method, averaging over all the possible training data sets?

MSE(R̂_n(x)) = E[(Y − R̂_n(X))² | X = x]                                             (1.21)
            = E[ E[(Y − R̂_n(X))² | X = x, R̂_n = r̂] | X = x ]                       (1.22)
            = E[ σ_x² + (r(x) − R̂_n(x))² | X = x ]                                  (1.23)
            = σ_x² + E[ (r(x) − R̂_n(x))² | X = x ]                                  (1.24)
            = σ_x² + E[ (r(x) − E[R̂_n(x)] + E[R̂_n(x)] − R̂_n(x))² ]                 (1.25)
            = σ_x² + (r(x) − E[R̂_n(x)])² + Var[R̂_n(x)]                             (1.26)

This is our second bias-variance decomposition — I pulled the same trick as before, adding and subtracting a mean inside the square. The first term is just the variance of the process; we've seen that before and it isn't, for the moment, of any concern. The second term is the bias in using R̂_n to estimate r — the approximation bias or approximation error. The third term, though, is the variance in our estimate of the regression function. Even if we have an unbiased method (r(x) = E[R̂_n(x)]), if there is a lot of variance in our estimates, we can expect to make large errors.

The approximation bias has to depend on the true regression function. For example, if E[R̂_n(x)] = 42 + 37x, the error of approximation will be zero if r(x) = 42 + 37x, but it will be larger and x-dependent if r(x) = 0. However, there are flexible methods of estimation which will have small approximation biases for all r in a broad range of regression functions. The catch is that, at least past a certain point, decreasing the approximation bias can only come through increasing the estimation variance. This is the bias-variance trade-off. However, nothing says that the trade-off has to be one-for-one. Sometimes we can lower the total error by introducing some bias, since it gets rid of more variance than it adds approximation error. The next section gives an example.

In general, both the approximation bias and the estimation variance depend on n. A method is consistent⁵ when both of these go to zero as n → ∞ — that is, if we recover the true regression function as we get more and more data.⁶ Again, consistency depends on how well the method matches the actual data-generating process, not just on the method, and again, there is a bias-variance trade-off. There can be multiple consistent methods for the same problem, and their biases and variances don't have to go to zero at the same rates.

⁵ To be precise, consistent for r, or consistent for conditional expectations. More generally, an estimator of any property of the data, or of the whole distribution, is consistent if it converges on the truth.
⁶ You might worry about this claim, especially if you've taken more probability theory — aren't we just saying something about average performance of the R̂, rather than any particular estimated regression function? But notice that if the estimation variance goes to zero, then by Chebyshev's inequality, Pr(|X − E[X]| ≥ η) ≤ Var[X]/η², each R̂_n(x) comes arbitrarily close to E[R̂_n(x)] with arbitrarily high probability. If the approximation bias goes to zero, therefore, the estimated regression functions converge in probability on the true regression function, not just in mean.
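Eq. 1.26 can be checked by simulation: fix a true regression function, draw many independent training sets, apply the same estimator to each, and look at the spread of the estimates at a single point. The sketch below is not from the text — true.r, one.fit and fits are names invented for this illustration, and the deliberately crude estimator (a constant fit) is just an example.

# Bias-variance decomposition at x = 0.5, by simulation, for a constant-fit
# estimator when the truth is r(x) = x^2 and the noise has s.d. 0.1.
true.r = function(x) { x^2 }
one.fit = function(n) {
  x = runif(n)
  y = true.r(x) + rnorm(n, 0, 0.1)
  mean(y)                     # the estimator: approximate r by the constant mean(y)
}
fits = replicate(1000, one.fit(n = 30))   # 1000 independent estimates of r(0.5)
(true.r(0.5) - mean(fits))^2              # (squared) approximation bias at x = 0.5
var(fits)                                 # estimation variance
# Adding these two numbers to sigma_x^2 = 0.1^2 approximates the total MSE of Eq. 1.26.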

1.4.2 The Bias-Variance Trade-Off in Action

Let's take an extreme example: we could decide to approximate r(x) by a constant r_0. The implicit smoothing here is very strong, but sometimes appropriate. For instance, it's appropriate when r(x) really is a constant! Then trying to estimate any additional structure in the regression function is just so much wasted effort. Alternately, if r(x) is nearly constant, we may still be better off approximating it as one. For instance, suppose the true r(x) = r_0 + a sin(νx), where a ≪ 1 and ν ≫ 1 (Figure 1.3 shows an example). With limited data, we can actually get better predictions by estimating a constant regression function than one with the correct functional form.

1.4.3 Ordinary Least Squares Linear Regression as Smoothing

Let's revisit ordinary least-squares linear regression from this point of view.



ugly.func = function(x) {1 + 0.01*sin(100*x)}
r = runif(100); y = ugly.func(r) + rnorm(length(r),0,0.5)
plot(r,y,xlab="x",ylab="y"); curve(ugly.func,add=TRUE)
abline(h=mean(y),col="red")
sine.fit = lm(y ~ 1+ sin(100*r))
curve(sine.fit$coefficients[1]+sine.fit$coefficients[2]*sin(100*x),
  col="blue",add=TRUE)

Figure 1.3: A rapidly-varying but nearly-constant regression function; Y = 1 + 0.01 sin(100x) + ε, with ε mean-zero Gaussian noise of standard deviation 0.5, as in the code. (The x values are uniformly distributed between 0 and 1.) Red: constant line at the sample mean. Blue: estimated function of the same form as the true regression function, i.e., r_0 + a sin(100x). If the data set is small enough, the constant actually generalizes better — the bias of using the wrong functional form is smaller than the additional variance from the extra degrees of freedom. Here, the root-mean-square (RMS) error of the constant on new data is 0.50, while that of the estimated sine function is 0.51 — using the right function actually hurts us!
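The RMS figures quoted in the caption come from evaluating both fits on fresh data from the same process. Here is a minimal sketch of how one might do that, assuming ugly.func, y and sine.fit from the figure code above are still in the workspace; new.x, new.y and rms are names made up here, and the exact numbers change from run to run.

# Draw new data from the same model and compare out-of-sample RMS errors.
new.x = runif(1000)
new.y = ugly.func(new.x) + rnorm(length(new.x), 0, 0.5)
rms = function(err) { sqrt(mean(err^2)) }
rms(new.y - mean(y))                                     # constant fit (red line)
rms(new.y - predict(sine.fit, data.frame(r = new.x)))    # fitted sine model (blue line)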


Let's assume that the independent variable X is one-dimensional, and that both X and Y are centered (i.e. have mean zero) — neither of these assumptions is really necessary, but they reduce the book-keeping. We choose to approximate r(x) by α + βx, and ask for the best values a, b of those constants. These will be the ones which minimize the mean-squared error.

MSE(α, β) = E[(Y − α − βX)²]                                                        (1.27)
          = E[ E[(Y − α − βX)² | X] ]                                               (1.28)
          = E[ Var[Y|X] + (E[Y − α − βX | X])² ]                                    (1.29)
          = E[Var[Y|X]] + E[(E[Y − α − βX | X])²]                                   (1.30)

The first term doesn't depend on α or β, so we can drop it for purposes of optimization. Taking derivatives, and then bringing them inside the expectations,

∂MSE/∂α = E[2(Y − α − βX)(−1)]                                                      (1.31)
E[Y − a − bX] = 0                                                                   (1.32)
a = E[Y] − b E[X] = 0                                                               (1.33)

using the fact that X and Y are centered; and,

∂MSE/∂β = E[2(Y − α − βX)(−X)]                                                      (1.34)
E[XY] − b E[X²] = 0                                                                 (1.35)
b = Cov[X, Y] / Var[X]                                                              (1.36)

again using the centering of X and Y. That is, the mean-squared optimal linear prediction is

r(x) = x Cov[X, Y] / Var[X]                                                         (1.37)

Now, if we try to estimate this from data, there are (at least) two approaches. One is to replace the true population values of the covariance and the variance with their sample values, respectively

(1/n) Σ_i y_i x_i                                                                   (1.38)

and

(1/n) Σ_i x_i²                                                                      (1.39)

(again, assuming centering). The other is to minimize the residual sum of squares,

RSS(α, β) ≡ Σ_i (y_i − α − βx_i)²                                                   (1.40)


You may or may not find it surprising that both approaches lead to the same answer:

â = 0                                                                               (1.41)
b̂ = Σ_i y_i x_i / Σ_i x_i²                                                          (1.42)

Provided that Var[X] > 0, this will converge with IID samples, so we have a consistent estimator.⁷

We are now in a position to see how the least-squares linear regression model is really a smoothing of the data. Let's write the estimated regression function explicitly in terms of the training data points.

r̂(x) = b̂ x                                                                         (1.43)
     = (Σ_i y_i x_i / Σ_i x_i²) x                                                   (1.44)
     = Σ_i y_i (x_i / Σ_j x_j²) x                                                   (1.45)
     = Σ_i y_i (x_i / (n s_X²)) x                                                   (1.46)

where s_X² is the sample variance of X. In words, our prediction is a weighted average of the observed values y_i of the dependent variable, where the weights are proportional to how far x_i is from the center (relative to the variance), and proportional to the magnitude of x. If x_i is on the same side of the center as x, it gets a positive weight, and if it's on the opposite side it gets a negative weight.

Figure 1.4 shows the data from Figure 1.1 with the least-squares regression line added. It will not escape your notice that this is very, very slightly different from the constant regression function; the coefficient on X is 6.3 × 10⁻³. Visually, the problem is that there should be a positive slope in the left-hand half of the data, and a negative slope in the right, but the slopes and the densities are balanced so that the best single slope is zero.⁸ Mathematically, the problem arises from the peculiar way in which least-squares linear regression smoothes the data. As I said, the weight of a data point depends on how far it is from the center of the data, not how far it is from the point at which we are trying to predict. This works when r(x) really is a straight line, but otherwise — e.g., here — it's a recipe for trouble. However, it does suggest that if we could somehow just tweak the way we smooth the data, we could do better than linear regression.

⁷ Eq. 1.41 may look funny, but remember that we're assuming X and Y have been centered. Centering doesn't change the slope of the least-squares line but does change the intercept; if we go back to the uncentered variables the intercept becomes Ȳ − b̂X̄, where the bar denotes the sample mean.
⁸ The standard test of whether this coefficient is zero is about as far from rejecting the null hypothesis as you will ever see, p = 0.95. Remember this the next time you look at regression output.
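Because everything above is just arithmetic on the sample, it is easy to check numerically. Here is a minimal sketch, assuming the vectors all.x and all.y from Figure 1.1 are in the workspace; xc, yc, b.hat, x0 and w are names made up for this illustration.

xc = all.x - mean(all.x); yc = all.y - mean(all.y)   # center both variables
b.hat = sum(yc * xc) / sum(xc^2)                     # Eq. 1.42
coef(lm(yc ~ xc))                                    # slope agrees with b.hat; intercept is 0
x0 = 0.9                                             # a point at which to predict
w = (xc / sum(xc^2)) * x0                            # the weights of Eqs. 1.45-1.46
sum(w * yc)                                          # same prediction as b.hat * x0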



fit.all = lm(all.y~all.x)
abline(fit.all)

Figure 1.4: Data from Figure 1.1, with a horizontal line at the mean (dotted) and the ordinary least squares regression line (solid). If you zoom in online you will see that there really are two lines there. (The abline function adds a line to the current plot with intercept a and slope b; it's set up to take the appropriate coefficients from the output of lm.)
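To see the numbers quoted in the text and in footnote 8, one can inspect the fitted object created in the figure code above. A minimal sketch, assuming fit.all exists; the particular values (a slope of about 6.3 × 10⁻³ and p ≈ 0.95) are specific to the data of Figure 1.1.

coef(fit.all)                    # the slope on all.x is tiny
summary(fit.all)$coefficients    # last column: p-values of the t-tests for each coefficient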

1.5 Linear Smoothers

The sample mean and the linear regression line are both special cases of linear smoothers, which are estimates of the regression function with the following form:

r̂(x) = Σ_i y_i ŵ(x_i, x)                                                           (1.47)

The sample mean is the special case where ŵ(x_i, x) = 1/n, regardless of what x_i and x are. Ordinary linear regression is the special case where ŵ(x_i, x) = (x_i / (n s_X²)) x. Both of these, as remarked, ignore how far x_i is from x.
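Eq. 1.47 translates almost directly into code. The following is a minimal sketch, not from the text; linear.smoother, mean.weights and ols.weights are invented names, and the OLS weights assume centered data, as in Eq. 1.46.

# A linear smoother: the prediction at x0 is a weighted sum of the training y's.
linear.smoother = function(x0, x, y, weight.fn) {
  sum(weight.fn(x, x0) * y)
}
mean.weights = function(x, x0) { rep(1/length(x), length(x)) }   # the sample mean
ols.weights  = function(x, x0) { (x / sum(x^2)) * x0 }           # OLS smoothing weights (Eq. 1.46)

With the centered xc and yc from the earlier sketch, linear.smoother(0.9, xc, yc, ols.weights) reproduces the least-squares prediction at x = 0.9.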

1.5.1 k-Nearest-Neighbor Regression

At the other extreme, we could do nearest-neighbor regression:

ŵ(x_i, x) = { 1   if x_i is the nearest neighbor of x
              0   otherwise                                                         (1.48)

This is very sensitive to the distance between x_i and x. If r(x) does not change too rapidly, and X is pretty thoroughly sampled, then the nearest neighbor of x among the x_i is probably close to x, so that r(x_i) is probably close to r(x). However, y_i = r(x_i) + noise, so nearest-neighbor regression will include the noise into its prediction. We might instead do k-nearest neighbor regression,

ŵ(x_i, x) = { 1/k  if x_i is one of the k nearest neighbors of x
              0    otherwise                                                        (1.49)

Again, with enough samples all the k nearest neighbors of x are probably close to x, so their regression functions there are going to be close to the regression function at x. But because we average their values of y_i, the noise terms should tend to cancel each other out. As we increase k, we get smoother functions — in the limit k = n and we just get back the constant. Figure 1.5 illustrates this for our running example data.⁹ To use k-nearest-neighbors regression, we need to pick k somehow. This means we need to decide how much smoothing to do, and this is not trivial. We will return to this point.

Because k-nearest-neighbors averages over only a fixed number of neighbors, each of which is a noisy sample, it always has some noise in its prediction, and is generally not consistent. This may not matter very much with moderately-large data (especially once we have a good way of picking k). However, it is sometimes useful to let k systematically grow with n, but not too fast, so as to avoid just doing a global average; say k ∝ √n. Such schemes can be consistent.

⁹ The code uses the k-nearest neighbor function provided by the package knnflex (available from CRAN). This requires one to pre-compute a matrix of the distances between all the points of interest, i.e., training data and testing data (using knn.dist); the knn.predict function then needs to be told which rows of that matrix come from training data and which from testing data. See help(knnflex.predict) for more, including examples.
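If installing a package is more trouble than it is worth, the weights in Eq. 1.49 are short enough to code directly. A minimal sketch, not from the text — knn.reg.simple is an invented name, no attempt is made at efficiency, and ties in distance are broken arbitrarily.

# k-nearest-neighbor regression: predict at each point of x0 by averaging the
# y values of the k training points with the closest x values (Eq. 1.49).
knn.reg.simple = function(x0, x, y, k) {
  sapply(x0, function(pt) {
    nearest = order(abs(x - pt))[1:k]   # indices of the k nearest training points
    mean(y[nearest])
  })
}
# e.g. lines(seq(0,1,length.out=100),
#            knn.reg.simple(seq(0,1,length.out=100), all.x, all.y, k=5), col="green")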



library(knnflex)
all.dist = knn.dist(c(all.x,seq(from=0,to=1,length.out=100)))
all.nn1.predict = knn.predict(1:110,111:210,all.y,all.dist,k=1)
abline(h=mean(all.y),lty=2)
lines(seq(from=0,to=1,length.out=100),all.nn1.predict,col="blue")
all.nn3.predict = knn.predict(1:110,111:210,all.y,all.dist,k=3)
lines(seq(from=0,to=1,length.out=100),all.nn3.predict,col="red")
all.nn5.predict = knn.predict(1:110,111:210,all.y,all.dist,k=5)
lines(seq(from=0,to=1,length.out=100),all.nn5.predict,col="green")
all.nn20.predict = knn.predict(1:110,111:210,all.y,all.dist,k=20)
lines(seq(from=0,to=1,length.out=100),all.nn20.predict,col="purple")

Figure 1.5: Data points from Figure 1.1 with horizontal dashed line at the mean and the k-nearest-neighbor regression curves for k = 1 (blue), k = 3 (red), k = 5 (green) and k = 20 (purple). Note how increasing k smoothes out the regression line, and pulls it back towards the mean. (k = 100 would give us back the dashed horizontal line.)

1.5.2 Kernel Smoothers

Changing k in a k-nearest-neighbors regression lets us change how much smoothing we're doing on our data, but it's a bit awkward to express this in terms of a number of data points. It feels like it would be more natural to talk about a range in the independent variable over which we smooth or average. Another problem with k-NN regression is that each testing point is predicted using information from only a few of the training data points, unlike linear regression or the sample mean, which always uses all the training data. If we could somehow use all the training data, but in a location-sensitive way, that would be nice.

There are several ways to do this, as we'll see, but a particularly useful one is to use a kernel smoother, a.k.a. kernel regression or Nadaraya-Watson regression. To begin with, we need to pick a kernel function¹⁰ K(x_i, x) which satisfies the following properties:

1. K(x_i, x) ≥ 0
2. K(x_i, x) depends only on the distance x_i − x, not the individual arguments
3. ∫ x K(0, x) dx = 0
4. 0 < ∫ x² K(0, x) dx < ∞

These conditions together (especially the last one) imply that K(x_i, x) → 0 as |x_i − x| → ∞. Two examples of such functions are the density of the Unif(−h/2, h/2) distribution, and the density of the standard Gaussian N(0, √h) distribution. Here h can be any positive number, and is called the bandwidth. The Nadaraya-Watson estimate of the regression function is

r̂(x) = Σ_i y_i K(x_i, x) / Σ_j K(x_j, x)                                           (1.50)

i.e., in terms of Eq. 1.47,

ŵ(x_i, x) = K(x_i, x) / Σ_j K(x_j, x)                                              (1.51)

(Notice that here, as in k-NN regression, the sum of the weights is always 1. Why?)¹¹

What does this achieve? Well, K(x_i, x) is large if x_i is close to x, so this will place a lot of weight on the training data points close to the point where we are trying to predict. More distant training points will have smaller weights, falling off towards zero. If we try to predict at a point x which is very far from any of the training data points, the value of K(x_i, x) will be small for all x_i, but it will typically be much, much smaller for all the x_i which are not the nearest neighbor of x, so ŵ(x_i, x) ≈ 1 for the nearest neighbor and ≈ 0 for all the others.¹² That is, far from the training data, our predictions will tend towards nearest neighbors, rather than going off to ±∞, as linear regression's predictions do. Whether this is good or bad of course depends on the true r(x) — and how often we have to predict what will happen very far from the training data.

Figure 1.6 shows our running example data, together with kernel regression estimates formed by combining the uniform-density, or box, and Gaussian kernels with different bandwidths. The box kernel simply takes a region of width h around the point x and averages the training data points it finds there. The Gaussian kernel gives reasonably large weights to points within h of x, smaller ones to points within 2h, tiny ones to points within 3h, and so on, shrinking like e^(−(x−x_i)²/2h²). As promised, the bandwidth h controls the degree of smoothing. As h → ∞, we revert to taking the global mean. As h → 0, we tend to get spikier functions — with the Gaussian kernel at least it tends towards the nearest-neighbor regression.

If we want to use kernel regression, we need to choose both which kernel to use, and the bandwidth to use with it. Experience, like Figure 1.6, suggests that the bandwidth usually matters a lot more than the kernel. This puts us back to roughly where we were with k-NN regression, needing to control the degree of smoothing, without knowing how smooth r(x) really is. Similarly again, with a fixed bandwidth h, kernel regression is generally not consistent. However, if h → 0 as n → ∞, but doesn't shrink too fast, then we can get consistency.

In Chapter 2, we'll look more at the limits of linear regression and some extensions; Chapter 3 will cover some key aspects of evaluating statistical models, including regression models; and then Chapter 4 will come back to kernel regression.

¹⁰ There are many other mathematical objects which are also called "kernels". Some of these meanings are related, but not all of them. (Cf. "normal".)
¹¹ What do we do if K(x_i, x) is zero for some x_i? Nothing; they just get zero weight in the average. What do we do if all the K(x_i, x) are zero? Different people adopt different conventions; popular ones are to return the global, unweighted mean of the y_i, to do some sort of interpolation from regions where the weights are defined, and to throw up our hands and refuse to make any predictions (computationally, return NA).
¹² Take a Gaussian kernel in one dimension, for instance, so K(x_i, x) ∝ e^(−(x_i−x)²/2h²). Say x_i is the nearest neighbor, and |x_i − x| = L, with L ≫ h. So K(x_i, x) ∝ e^(−L²/2h²), a small number. But now for any other x_j, K(x_j, x) ∝ e^(−L²/2h²) e^(−(x_j−x_i)L/h²) e^(−(x_j−x_i)²/2h²) ≪ e^(−L²/2h²). — This assumes that we're using a kernel like the Gaussian, which never quite goes to zero, unlike the box kernel.


plot(all.x,all.y,xlab="x",ylab="y")
lines(ksmooth(all.x, all.y, "box", bandwidth=2),col="blue")
lines(ksmooth(all.x, all.y, "box", bandwidth=1),col="red")
lines(ksmooth(all.x, all.y, "box", bandwidth=0.1),col="green")
lines(ksmooth(all.x, all.y, "normal", bandwidth=2),col="blue",lty=2)
lines(ksmooth(all.x, all.y, "normal", bandwidth=1),col="red",lty=2)
lines(ksmooth(all.x, all.y, "normal", bandwidth=0.1),col="green",lty=2)

Figure 1.6: Data from Figure 1.1 together with kernel regression lines. Solid colored lines are box-kernel estimates, dashed colored lines Gaussian-kernel estimates. Blue, h = 2; red, h = 1; green, h = 0.1 (per the definition of bandwidth in the ksmooth function). Note the abrupt jump around x = 0.75 in the box-kernel/h = 0.1 (solid green) line — with a small bandwidth the box kernel is unable to interpolate smoothly across the break in the training data, while the Gaussian kernel can.
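R's built-in ksmooth function, used in the figure above, does all of this for us, but Eq. 1.50 is short enough to code by hand, which makes the formula concrete. A minimal sketch, not from the text — gauss.kernel and nw.smooth are invented names, and ksmooth scales its bandwidth argument differently, so the resulting curve will not line up exactly with the figure.

# Nadaraya-Watson regression with a Gaussian kernel of bandwidth (s.d.) h.
gauss.kernel = function(xi, x, h) { exp(-(xi - x)^2 / (2 * h^2)) }
nw.smooth = function(x0, x, y, h) {
  sapply(x0, function(pt) {
    k = gauss.kernel(x, pt, h)   # kernel weights for every training point
    sum(k * y) / sum(k)          # Eq. 1.50; NaN if all weights underflow (cf. footnote 11)
  })
}
# e.g. lines(seq(0,1,length.out=100),
#            nw.smooth(seq(0,1,length.out=100), all.x, all.y, h=0.1), lty=3)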


1.6 Exercises

Not to hand in.

1. Suppose we use the mean absolute error instead of the mean squared error:

MAE(a) = E[|Y − a|]                                                                 (1.52)

Is this also minimized by taking a = E[Y]? If not, what value r̃ minimizes the MAE? Should we use MSE or MAE to measure error?

2. Derive Eqs. 1.41 and 1.42 by minimizing Eq. 1.40.

3. What does it mean for Gaussian kernel regression to approach nearest-neighbor regression as h → 0? Why does it do so? Is this true for all kinds of kernel regression?

4. COMPUTING The file ckm.csv on the class website¹³ contains data from a famous study on the "diffusion of innovations", in this case, the adoption of tetracycline, a then-new antibiotic, among doctors in four cities in Illinois in the 1950s. In particular, the column adoption_date records how many months after the beginning of the study each doctor surveyed began prescribing tetracycline. Note that some doctors did not do so before the study ended (these have an adoption date of Inf, infinity), and this information is not available for others (NA).

(a) Load the data as a data-frame called ckm.
(b) What does adopters