Introduction to Machine Learning with R: Rigorous Mathematical Analysis 1491976446, 9781491976449

Machine learning is an intimidating subject until you know the fundamentals. If you understand basic coding concepts, th


English Pages 226 [217] Year 2018


Table of contents :
Contents
Preface
What Is a Model?
Algorithms Versus Models: What’s the Difference?
A Note on Terminology
Modeling Limitations
Statistics and Computation in Modeling
Data Training
Cross-Validation
Why Use R?
The Good
The Bad
Summary
Supervised & Unsupervised ML
Supervised Models
Regression
Training and Testing of Data
Classification
Mixed Methods
Unsupervised Learning
Unsupervised Clustering Methods
Summary
Sampling Statistics & Model Training
Bias
Sampling in R
Training and Testing
Cross-Validation
Summary
Regression in a Nutshell
Linear Regression
Polynomial Regression
Goodness of Fit with Data—The Perils of Overfitting
Logistic Regression
Summary
Neural Networks in a Nutshell
Single-Layer Neural Networks
Building a Simple Neural Network by Using R
Multilayer Neural Networks
Neural Networks for Regression
Neural Networks for Classification
Neural Networks with caret
Summary
Tree-based Methods
A Simple Tree Model
Deciding How to Split Trees
Pros and Cons of Decision Trees
Conditional Inference Trees
Random Forests
Summary
Other Advanced Methods
Naive Bayes Classification
Principal Component Analysis
Support Vector Machines
k-Nearest Neighbors
Summary
Machine Learning with the caret Package
The Titanic Dataset
caret Unleashed
Summary
Encyclopedia of ML Models in caret
Index


Introduction to Machine Learning with R Rigorous Mathematical Analysis

Scott V. Burger

Introduction to Machine Learning with R
by Scott V. Burger

Copyright © 2018 Scott Burger. Printed in the United States of America.

March 2018: First Edition

Revision History for the First Edition
2018-03-08: First Release

http://oreilly.com/catalog/errata.csp?isbn=9781491976449 978-1-491-97644-9 [LSI]

Contents

Preface

1. What Is a Model?
   Algorithms Versus Models: What’s the Difference?
   A Note on Terminology
   Modeling Limitations
   Statistics and Computation in Modeling
   Data Training
   Cross-Validation
   Why Use R?
   The Good
      R and Machine Learning
   The Bad
   Summary

2. Supervised and Unsupervised Machine Learning
   Supervised Models
   Regression
   Training and Testing of Data
   Classification
      Logistic Regression
      Supervised Clustering Methods
   Mixed Methods
      Tree-Based Models
      Random Forests
      Neural Networks
      Support Vector Machines
   Unsupervised Learning
   Unsupervised Clustering Methods
   Summary

3. Sampling Statistics and Model Training in R
   Bias
   Sampling in R
   Training and Testing
      Roles of Training and Test Sets
      Why Make a Test Set?
      Training and Test Sets: Regression Modeling
      Training and Test Sets: Classification Modeling
   Cross-Validation
      k-Fold Cross-Validation
   Summary

4. Regression in a Nutshell
   Linear Regression
      Multivariate Regression
      Regularization
   Polynomial Regression
   Goodness of Fit with Data—The Perils of Overfitting
      Root-Mean-Square Error
      Model Simplicity and Goodness of Fit
   Logistic Regression
      The Motivation for Classification
      The Decision Boundary
      The Sigmoid Function
      Binary Classification
      Multiclass Classification
      Logistic Regression with Caret
   Summary
      Linear Regression
      Logistic Regression

5. Neural Networks in a Nutshell
   Single-Layer Neural Networks
   Building a Simple Neural Network by Using R
      Multiple Compute Outputs
      Hidden Compute Nodes
   Multilayer Neural Networks
   Neural Networks for Regression
   Neural Networks for Classification
   Neural Networks with caret
      Regression
      Classification
   Summary

6. Tree-Based Methods
   A Simple Tree Model
   Deciding How to Split Trees
      Tree Entropy and Information Gain
   Pros and Cons of Decision Trees
      Tree Overfitting
      Pruning Trees
      Decision Trees for Regression
      Decision Trees for Classification
   Conditional Inference Trees
      Conditional Inference Tree Regression
      Conditional Inference Tree Classification
   Random Forests
      Random Forest Regression
      Random Forest Classification
   Summary

7. Other Advanced Methods
   Naive Bayes Classification
      Bayesian Statistics in a Nutshell
      Application of Naive Bayes
   Principal Component Analysis
      Linear Discriminant Analysis
   Support Vector Machines
   k-Nearest Neighbors
      Regression Using kNN
      Classification Using kNN
   Summary

8. Machine Learning with the caret Package
   The Titanic Dataset
      Data Wrangling
   caret Unleashed
      Imputation
      Data Splitting
      caret Under the Hood
      Model Training
      Comparing Multiple caret Models
   Summary

A. Encyclopedia of Machine Learning Models in caret

Index

Preface

In this short introduction, I tackle a few key points.

Who Should Read This Book?

This book is ideally suited for people who have some working knowledge of the R programming language. If you don’t have any knowledge of R, it’s an easy enough language to pick up, and the code is readable enough that you can pretty much get the gist of the code examples herein.

Scope of the Book

This book is an introductory text, so we don’t dive deeply into the mathematical underpinnings of every algorithm covered. Presented here are enough of the details for you to discern the difference between a neural network and, say, a random forest at a high level.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
   Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
   Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
   Shows commands or other text that should be typed literally by the user.

Constant width italic
   Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

CHAPTER 1

What Is a Model?

There was a time in my undergraduate physics studies that I was excited to learn what a model was. I remember the scene pretty well. We were in a Stars and Galaxies class, getting ready to learn about atmospheric models that could be applied not only to the Earth, but to other planets in the solar system as well. I knew enough about climate models to know they were complicated, so I braced myself for an onslaught of math that would take me weeks to parse. When we finally got to the meat of the subject, I was kind of let down: I had already dealt with data models in the past and hadn’t even realized!

Because models are a fundamental aspect of machine learning, perhaps it’s not surprising that this story mirrors how I learned to understand the field of machine learning. During my graduate studies, I was on the fence about going into the financial industry. I had heard that machine learning was being used extensively in that world, and, as a lowly physics major, I felt like I would need to be more of a computational engineer to compete. I came to a similar realization that not only was machine learning not as scary of a subject as I originally thought, but I had indeed been using it before. Since before high school, even!

Models are helpful because unlike dashboards, which offer a static picture of what the data shows currently (or at a particular slice in time), models can go further and help you understand the future. For example, someone who is working on a sales team might only be familiar with reports that show a static picture. Maybe their screen is always up to date with what the daily sales are. There have been countless dashboards that I’ve seen and built that simply say “this is how many assets are in right now.” Or, “this is what our key performance indicator is for today.” A report is a static entity that doesn’t offer an intuition as to how it evolves over time. Figure 1-1 shows what a report might look like:


op