Using R for Introductory Statistics 9781466590748, 1466590742

The second edition of a bestselling textbook, Using R for Introductory Statistics guides students through the basics of

644 173 17MB

English Pages 518 [515] Year 2014

Report DMCA / Copyright

DOWNLOAD FILE

Polecaj historie

Using R for Introductory Statistics
 9781466590748, 1466590742

Table of contents :
Front Cover
Contents
Preface
Chapter 1 - Getting started
Chapter 2 - Univariate data
Chapter 3 - Bivariate data
Chapter 4 - Multivariate data
Chapter 5 - Multivariate graphics
Chapter 6 - Populations
Chapter 7 - Statistical inference
Chapter 8 - Confidence intervals
Chapter 9 - Significance tests
Chapter 10 - Goodness of fit
Chapter 11 - Linear regression
Chapter 12 - Analysis of variance
Chapter 13 - Extensions of the linear model
Appendix A - Programming
Bibliography
Back Cover

Citation preview

Statistics

“… Without hesitation I would use it for an introductory statistics course or an introduction to R for a general audience. Indeed, Verzani’s book may prove a useful travel guide through the sometimes exasperating territory of statistical computing.” —E. Andres Houseman (Harvard School of Public Health), Statistics in Medicine, Vol. 26, 2007 “This book sets out to kill two birds with one stone—introducing R and statistics at the same time. The author accomplishes his twin goals by presenting an easy-tofollow narrative mixed with R codes, formulae, and graphs … contains a cornucopia of information for beginners in statistics who want to learn a computer language that is positioned to take the statistics world by storm.” —Significance, September 2005 “Anyone who has struggled to produce his or her own notes to help students use R will appreciate this thorough, careful, and complete guide aimed at beginning students.” —Journal of Statistical Software, November 2005 “This is an ideal text for integrating the study of statistics with a powerful computation tool.” —Zentralblatt MATH

K20484

K20484_cover.indd 1

Verzani

See What’s New in the Second Edition: • Increased emphasis on more idiomatic R provides a grounding in the functionality of base R • Discussions of the use of RStudio help new R users avoid as many pitfalls as possible • Use of knitr package makes code easier to read and therefore easier to reason about • Additional information on computer-intensive approaches motivates the traditional approach • Updated examples and data make the information current and topical

Second Edition

Using R for Introductory Statistics

Praise for the First Edition: “… One mistake most authors of similar texts make is to assume some basic level of familiarity, either with the subject to be taught, or the tool (the software package) to be used in teaching the subject. This book does not fall into either trap. … the examples and exercises are well chosen …” —MAA Reviews, October 2010

The R Series

Using R for Introductory Statistics Second Edition

John Verzani

5/15/14 9:24 AM

Using R for Introductory Statistics Second Edition

Chapman & Hall/CRC The R Series Series Editors John M. Chambers Department of Statistics Stanford University Stanford, California, USA

Torsten Hothorn Division of Biostatistics University of Zurich Switzerland

Duncan Temple Lang Department of Statistics University of California, Davis Davis, California, USA

Hadley Wickham RStudio Boston, Massachusetts, USA

Aims and Scope This book series reflects the recent rapid growth in the development and application of R, the programming language and software environment for statistical computing and graphics. R is now widely used in academic research, education, and industry. It is constantly growing, with new versions of the core software released regularly and more than 5,000 packages available. It is difficult for the documentation to keep pace with the expansion of the software, and this vital book series provides a forum for the publication of books covering many aspects of the development and application of R. The scope of the series is wide, covering three main threads: • Applications of R to specific disciplines such as biology, epidemiology, genetics, engineering, finance, and the social sciences. • Using R for the study of topics of statistical methodology, such as linear and mixed modeling, time series, Bayesian methods, and missing data. • The development of R, including programming, building packages, and graphics. The books will appeal to programmers and developers of R software, as well as applied statisticians and data analysts in many fields. The books will feature detailed worked examples and R code fully integrated into the text, ensuring their usefulness to researchers, practitioners and students.

Published Titles

Using R for Numerical Analysis in Science and Engineering, Victor A. Bloomfield Event History Analysis with R, Göran Broström Computational Actuarial Science with R, Arthur Charpentier Statistical Computing in C++ and R, Randall L. Eubank and Ana Kupresanin Reproducible Research with R and RStudio, Christopher Gandrud Introduction to Scientific Programming and Simulation Using R, Second Edition, Owen Jones, Robert Maillardet, and Andrew Robinson Displaying Time Series, Spatial, and Space-Time Data with R, Oscar Perpiñán Lamigueiro Programming Graphical User Interfaces with R, Michael F. Lawrence and John Verzani Analyzing Baseball Data with R, Max Marchi and Jim Albert Growth Curve Analysis and Visualization Using R, Daniel Mirman R Graphics, Second Edition, Paul Murrell Multiple Factor Analysis by Example Using R, Jérôme Pagès Customer and Business Analytics: Applied Data Mining for Business Decision Making Using R, Daniel S. Putler and Robert E. Krider Implementing Reproducible Research, Victoria Stodden, Friedrich Leisch, and Roger D. Peng Using R for Introductory Statistics, Second Edition, John Verzani Dynamic Documents with R and knitr, Yihui Xie

Using R for Introductory Statistics Second Edition

John Verzani CUNY/College of Staten Island New York, USA

CRC Press Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2014 by Taylor & Francis Group, LLC CRC Press is an imprint of Taylor & Francis Group, an Informa business No claim to original U.S. Government works Version Date: 20140514 International Standard Book Number-13: 978-1-4665-9074-8 (eBook - PDF) This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Contents

Preface 1

2

Getting started 1.1 What is data? . . . . . 1.2 Getting started with R Installing R . . . . . . . Installing RStudio . . R’s command line . . . Variables . . . . Functions . . . The workspace . . . . External packages . . . Data sets . . . . . . . . Problems . . . . . . . .

xv . . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

1 1 3 3 4 5 7 8 12 15 16 18

Univariate data 2.1 Data vectors . . . . . . . . . . . . Structured data . . . . . . . . . . Indexing . . . . . . . . . . . . . . Data types . . . . . . . . . . . . . Numeric data types . . . Categorical data types . . Date and time types . . . Logical data . . . . . . . . Problems . . . . . . . . . . . . . . 2.2 Functions . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . 2.3 Numeric summaries . . . . . . . Center . . . . . . . . . . . . . . . The sample mean . . . . . The sample median . . . Measures of position . . . Other measures of center

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

20 22 28 29 33 33 34 39 41 45 48 50 50 51 51 55 56 59

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

vii

CONTENTS

viii . . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

59 60 65 66 70 81 85 87

Bivariate data 3.1 Independent samples . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Data manipulation basics . . . . . . . . . . . . . . . . Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Data frames . . . . . . . . . . . . . . . . . . . . . . . . Model formulas . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Paired data . . . . . . . . . . . . . . . . . . . . . . . . . Correlation . . . . . . . . . . . . . . . . . . . . . . . . . Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . Transformations . . . . . . . . . . . . . . . . . . . . . . Alternative trend lines . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 Bivariate categorical data . . . . . . . . . . . . . . . . Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . Two-way tables from summarized data . . . . . . . . Two-way tables from unsummarized data . . . . . . . Marginal distributions of two-way tables . . . . . . . Conditional distributions of two-way tables . . . . . . The xtabs function . . . . . . . . . . . . . . . . . . . . Graphical summaries of two-way contingency tables Mosaic plots . . . . . . . . . . . . . . . . . . . . Measures of association for categorical data . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . .

88 88 93 94 94 96 97 101 102 105 115 120 123 128 132 132 132 134 135 136 137 140 141 143 149

Multivariate data 4.1 Data structures in R . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . 4.2 Working with data frames . . . . . . . Problems . . . . . . . . . . . . . . . . . 4.3 Applying a function over a collection Map . . . . . . . . . . . . . . . . . . . . Filter . . . . . . . . . . . . . . . . . . . Reduce . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . 4.4 Using external data . . . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

150 150 154 155 166 167 168 177 177 179 181

2.4 3

4

Spread . . . . . . . . . . . . . . . . . . . . . . The variance and standard deviation The IQR . . . . . . . . . . . . . . . . . Shape . . . . . . . . . . . . . . . . . . . . . . . Viewing the shape of a data set . . . . Problems . . . . . . . . . . . . . . . . . . . . . Categorical data . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . .

. . . . . . . . . .

. . . . . . . .

. . . . . . . . . .

. . . . . . . .

. . . . . . . . . .

. . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

CONTENTS

ix

Spreadsheet data . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 Web-based data sets . . . . . . . . . . . . . . . . . . . . . . . . . 182 5

6

Multivariate graphics 5.1 Base graphics . . . . . . . Problems . . . . . . . . . . 5.2 Lattice graphics . . . . . . Problems . . . . . . . . . . 5.3 The ggplot2 package . . . Geoms . . . . . . . . . . . Grouping . . . . . . . . . . Statistical transformations Faceting . . . . . . . . . . Problems . . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

189 189 196 197 200 200 201 203 204 207 210

Populations 6.1 Populations . . . . . . . . . . . . . . . . . . . . . . . . . Discrete random variables . . . . . . . . . . . . . . . . . Using sample to generate random values . . . . . . . . The mean and standard deviation . . . . . . . . Continuous random variables . . . . . . . . . . . . . . . The p.d.f. and c.d.f. . . . . . . . . . . . . . . . . . The mean and standard deviation . . . . . . . . Quantiles . . . . . . . . . . . . . . . . . . . . . . Sampling from a population . . . . . . . . . . . . . . . . Random samples generated by sample . . . . . Sampling distributions . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Families of distributions . . . . . . . . . . . . . . . . . . The d, p, q, and r functions . . . . . . . . . . . . . . . . Binomial, normal, and some other named distributions Bernoulli random variables . . . . . . . . . . . . Binomial random variables . . . . . . . . . . . . Normal random variables . . . . . . . . . . . . . Popular distributions to describe populations . . . . . Uniform distribution . . . . . . . . . . . . . . . . Exponential distribution . . . . . . . . . . . . . . Lognormal distribution . . . . . . . . . . . . . . Sampling distributions . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3 The central limit theorem . . . . . . . . . . . . . . . . . Normal parent population . . . . . . . . . . . . . . . . . Nonnormal parent population . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . .

211 211 213 214 215 216 218 218 218 219 219 220 221 222 222 224 224 225 227 231 231 232 233 233 234 236 237 238 240

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

CONTENTS

x 7

8

9

Statistical inference 7.1 Simulation . . . . . . . . . . . . . Repeating a simulation easily . . Problems . . . . . . . . . . . . . . 7.2 Significance tests . . . . . . . . . 7.3 Estimation, confidence intervals The basic bootstrap . . . . . . . . 7.4 Bayesian analysis . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

242 244 244 252 252 255 258 259

Confidence intervals 8.1 Confidence intervals for a population proportion, p . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2 Confidence intervals for the population mean . . . . . . One-sided confidence intervals . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.3 Other confidence intervals . . . . . . . . . . . . . . . . . . Confidence interval for σ2 . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.4 Confidence intervals for differences . . . . . . . . . . . . Difference of proportions . . . . . . . . . . . . . . . . . . Difference of means . . . . . . . . . . . . . . . . . . . . . . Matched samples . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.5 Confidence intervals for the median . . . . . . . . . . . . Confidence intervals based on the binomial distribution Confidence intervals based on signed-rank statistic . . . Confidence intervals based on the rank-sum statistic . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

262 264 269 271 274 276 278 278 280 281 282 283 286 287 288 288 289 290 292

Significance tests 9.1 Significance test for a population proportion . . . . . Using prop.test to compute p-values . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 9.2 Significance test for the mean (t-tests) . . . . . . . . . Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 9.3 Significance tests and confidence intervals . . . . . . 9.4 Significance tests for the median . . . . . . . . . . . . The sign test . . . . . . . . . . . . . . . . . . . . . . . . The signed-rank test . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 9.5 Two-sample tests of proportion . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 9.6 Two-sample tests of center . . . . . . . . . . . . . . . . Two-sample tests of center with normal populations Matched samples . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

294 299 301 302 304 307 309 310 312 312 313 315 316 319 321 322 325

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

CONTENTS

xi

The Wilcoxon rank-sum test for equality of center . . . . . . . . 328 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331 10 Goodness of fit 10.1 The chi-squared goodness-of-fit test . . . . . . . . The multinomial distribution . . . . . . . . . . . . Pearson’s χ2 -statistic . . . . . . . . . . . . . . . . . Partially specified null hypotheses . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . 10.2 The chi-squared test of independence . . . . . . . The chi-squared test of homogeneity . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . 10.3 Goodness-of-fit tests for continuous distributions Kolmogorov-Smirnov test . . . . . . . . . . . . . . The Shapiro-Wilk test for normality . . . . . . . . Finding parameter values using fitdistr . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

334 334 334 336 339 341 344 348 350 352 352 357 359 362

11 Linear regression 11.1 The simple linear regression model . . . . . . . . . . Estimating the parameters in simple linear regression Using lm to find the estimates . . . . . . . . . . . . . . Extractor functions for lm . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Statistical inference for simple linear regression . . . Statistical inferences . . . . . . . . . . . . . . . . . . . Marginal t-tests . . . . . . . . . . . . . . . . . . The F-test . . . . . . . . . . . . . . . . . . . . . R2 —the coefficient of determination . . . . . . Using lm to find values for a regression model . . . . Confidence intervals . . . . . . . . . . . . . . . Standard error . . . . . . . . . . . . . . . . . . Significance tests . . . . . . . . . . . . . . . . . Finding b σ 2 , R2 . . . . . . . . . . . . . . . . . . F-test for β 1 = 0 . . . . . . . . . . . . . . . . . . Predicting the response with predict . . . . . Testing the model assumptions . . . . . . . . . . . . . Assessing the linear model for the mean . . . Assessing the residuals . . . . . . . . . . . . . Influential points . . . . . . . . . . . . . . . . . Prediction intervals . . . . . . . . . . . . . . . . Confidence intervals for µy| x . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3 Multiple linear regression . . . . . . . . . . . . . . . . Types of models . . . . . . . . . . . . . . . . . . . . . . Fitting the multiple regression model using lm . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . .

364 364 365 366 367 368 369 370 370 371 373 374 374 374 376 376 377 377 378 379 380 381 382 385 386 390 390 392

. . . . . . . . . . . . .

CONTENTS

xii Using update with model formulas Interpreting the regression parameters . . . Statistical inferences . . . . . . . . . . . . . Model selection . . . . . . . . . . . . . . . . Partial F-test . . . . . . . . . . . . . . The Akaike information criterion . . Problems . . . . . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

394 395 396 397 398 400 402

12 Analysis of variance 12.1 One-way ANOVA . . . . . . . . . . . . . . . . . . . . . Using R’s model formulas to specify ANOVA models Using oneway.test to perform ANOVA . . . . . . . . Using aov for ANOVA . . . . . . . . . . . . . . . . . . The nonparametric Kruskal–Wallis test . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 12.2 Using lm for ANOVA . . . . . . . . . . . . . . . . . . . Treatment coding for analysis of variance . . . . . . . Comparing multiple differences . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3 ANCOVA . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 12.4 Two-way ANOVA . . . . . . . . . . . . . . . . . . . . . Interaction plots . . . . . . . . . . . . . . . . . Fitting a two-way ANOVA . . . . . . . . . . . . . . . . Blocking variables . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

404 404 408 408 409 411 414 416 418 421 424 425 428 429 430 431 435 437

13 Extensions of the linear model 13.1 Logistic regression . . . . . . . . . Generalized linear models . . . . . Fitting the model using glm . . . . 13.2 Nonlinear models . . . . . . . . . . Fitting nonlinear models with nls Problems . . . . . . . . . . . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

440 440 443 443 448 449 455

A Programming A.1 Functions . . . . . . . . . Function names . . . . . Arguments . . . . . . . . Body . . . . . . . . . . . Control flow . . . . . . . Variable scope . . . . . . Closures . . . . . A.2 Generic functions . . . . S3 methods . . . . . . . . S4 classes and methods .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

458 458 458 462 467 467 472 474 475 475 479

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

CONTENTS

xiii

Reference classes . . . . . . . . . . . . . . . . . . . . . . . . . . . 479 Bibliography

489

Index

494

Preface About this book This is a second edition of a book that introduces R alongside the introductory statistics curriculum. The first edition found its niche with individuals looking to get started with both areas outside of a classroom environment. It is the hope, that this second edition will be even more useful for that task. The book was first published in 2004, when R was at version 2.0.0. Now, as of writing, R is past version 3.0.0 (3.1.0 and climbing). In that time so much has changed. For example: • The number of R users has grown enormously. A recent survey ranked R the 15th most used programming language. • The number of add-on packages for R has grown four- or five-fold to over 5, 000. The depth and range of applications has grown considerably. • The number of books including material on R has grown at least tenfold.1 • The internet has developed many additional R communities beyond the initial mailing list. Two key additions are the question and answer site stackoverflow.com which has nearly 50, 000 questions tagged with “r” and the blog aggregator r-bloggers.com which has over 13, 000 blog entries related to R. Basically, the amount of material out there related to learning and using R is now enormous. This book doesn’t try to canvas even a sliver, rather it tries to guide the reader through the learning of the basics of R so that it is possible to take advantage of the contributions made by the R community. Though R—like other programming languages—has a reputation of having 1 For example, there are many other texts introducing R, as this one does, that can be chosen to learn from. For example, [15], [64], [13], [14], [36], [12], [56], and http://www.openintro.org/ stat/.

xv

xvi

PREFACE

a steep learning curve, we try to break this down into small, task-oriented steps. In this edition we place a greater emphasis on more idiomatic R. For a small example, despite the greater familiarity of using = for the assignment operator, we now use the for a ready state and + for a continuation state.2 These are not typeset in the text, as they can be distracting while reading. But be warned, the + prompt is indicating the previous command was not complete. If you thought it was, likely you are missing a closing parentheses. Errors Of course, there are times where we type in a command that does not make sense to R’s interpreter. This can happen, for example, when we misspell a command name or make some syntax error. Here we have two ˆ symbols, one too many for R’s taste: 2 ^^ 2 Error: unexpected ’^’ in "2 ^^" The error messages generated by R are usually quite informative, though may seem cryptically written to the new R user. The above one is pretty clear. R may also generate warnings, which are similar to an error, but do not stop the flow of a function.3 Command history After one issues a command, it is recorded in R’s history. Most command lines allow for scrolling through the previous commands using the up- and down-arrow keys. This can be used to edit and re-execute a previous command. RStudio has a history pane (Figure 1.3) showing the past commands. One can double click on a command to send it back to the command line. Selecting more than one and then pressing the “To Console” toolbar item will 2 These

can be changed through the prompt and continue options, cf. ?options. “sourcing” in commands from a script in RStudio, the error message will conveniently contain a line number. 3 If

1.2. GETTING STARTED WITH R

7

Figure 1.3: RStudio’s history pane showing its recording of previously issued commands.

send the collection of commands back to the console. As the history stack can grow quite large, the search panel in the history pane allows one to search through past commands. When the desired one is located, it can be viewed in its context by clicking on the small arrow on the right. Variables R can be used like a calculator, as above. But it really is an environment for statistical computing and graphics. The power of R goes well beyond that of a graphing calculator. One immediate difference is the ability to assign names to values.4 In R this is done with an assignment operator. We use the left arrow for assignment. In RStudio, there is a keyboard shortcut to insert the two-character