Statistical Learning With Math And R: 100 Exercises For Building Logic [1st Edition] 9811575673, 9789811575679, 9789811575686

The most crucial ability for machine learning and data science is mathematical logic for grasping their essence rather t

9,252 1,334 4MB

English Pages 226 Year 2020

Report DMCA / Copyright

DOWNLOAD FILE

Polecaj historie

Statistical Learning With Math And R: 100 Exercises For Building Logic [1st Edition]
 9811575673, 9789811575679, 9789811575686

Table of contents :
Preface......Page 5
Contents......Page 9
1.1 Inverse Matrix......Page 12
1.2 Determinant......Page 14
1.3 Linear Independence......Page 17
1.4 Vector Spaces and Their Dimensions......Page 19
1.5 Eigenvalues and Eigenvectors......Page 21
1.6 Orthonormal Bases and Orthogonal Matrix......Page 23
1.7 Diagonalization of Symmetric Matrices......Page 24
2.1 Least Squares Method......Page 28
2.2 Multiple Regression......Page 31
2.3 Distribution of......Page 33
2.4 Distribution of the RSS Values......Page 35
2.5 Hypothesis Testing for j=0......Page 37
2.6 Coefficient of Determination and the Detection of Collinearity......Page 44
2.7 Confidence and Prediction Intervals......Page 46
3.1 Logistic Regression......Page 58
3.2 Newton–Raphson Method......Page 60
3.3 Linear and Quadratic Discrimination......Page 65
3.4 k-Nearest Neighbor Method......Page 68
3.5 ROC Curves......Page 69
4.1 Cross-Validation......Page 77
4.2 CV Formula for Linear Regression......Page 81
4.3 Bootstrapping......Page 84
5.1 Information Criteria......Page 93
5.2 Efficient Estimation and the Fisher Information Matrix......Page 97
5.3 Kullback-Leibler Divergence......Page 100
5.4 Derivation of Akaike's Information Criterion......Page 102
6.1 Ridge......Page 111
6.2 Subderivative......Page 113
6.3 Lasso......Page 116
6.4 Comparing Ridge and Lasso......Page 119
6.5 Setting the λ Value......Page 121
7.1 Polynomial Regression......Page 126
7.2 Spline Regression......Page 129
7.3 Natural Spline Regression......Page 131
7.4 Smoothing Spline......Page 135
7.5 Local Regression......Page 139
7.6 Generalized Additive Models......Page 143
8.1 Decision Trees for Regression......Page 156
8.2 Decision Tree for Classification......Page 165
8.4 Random Forest......Page 169
8.5 Boosting......Page 172
9.1 Optimum Boarder......Page 180
9.2 Theory of Optimization......Page 183
9.3 The Solution of Support Vector Machines......Page 186
9.4 Extension of Support Vector Machines Using a Kernel......Page 189
10.1 K-means Clustering......Page 202
10.2 Hierarchical Clustering......Page 206
10.3 Principle Component Analysis......Page 213
Index......Page 224

Citation preview

Joe Suzuki

Statistical Learning with Math and R 100 Exercises for Building Logic

Statistical Learning with Math and R

Joe Suzuki

Statistical Learning with Math and R 100 Exercises for Building Logic

123

Joe Suzuki Graduate School of Engineering Science Osaka University Toyonaka, Osaka, Japan

ISBN 978-981-15-7567-9 ISBN 978-981-15-7568-6 https://doi.org/10.1007/978-981-15-7568-6

(eBook)

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

I am currently with the Statistics Laboratory at Osaka University, Japan. I often meet with data scientists who are engaged in machine learning and statistical analyses for research collaborations and introducing my students to them. I recently found out that almost all of them think that (mathematical) logic rather than knowledge and experience is the most crucial ability for grasping the essence in their jobs. Our necessary knowledge is changing every day and can be obtained when needed. However, logic allows us to examine whether each item on the Internet is correct and follow any changes; without it, we might miss even chances. In 2016, I started teaching statistical machine learning to the undergraduate students of the Mathematics Department. In the beginning, I was mainly teaching them what (statistical) machine learning (ML) is and how to use it. I explained the procedures of ML, such as logistic regression, support vector machines, k-means clustering, etc., by showing figures and providing intuitive explanations. At the same time, the students tried to understand ML by guessing the details. I also showed the students how to execute the ready-made functions in several R packages without showing the procedural details; at the same time, they understood how to use the R packages as black boxes. However, as time went by, I felt that this manner of teaching should be changed. In other non-ML classes, I focus on making the students consider extending the ideas. I realized that they needed to understand the essence of the subject by mathematically considering problems and building programs. I am both a mathematician and an R/Python programmer and notice the importance of instilling logic inside each student. The basic idea is that the students see that both theory and practice meet and that using logic is necessary. I was motivated to write this book because I could not find any other book that was inspired by the idea of “instilling logic” in the field of ML. The closest comparison is “Introduction to Statistical Learning: with Application in R” (ISLR) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (Springer), which is the most popular book in this field. I like this book and have used it for the aforementioned class. In particular, the presentation in the book is splendid (abundant figures and intuitive explanations). I followed this style v

vi

Preface

when writing this book. However, ISLR is intended for a beginner audience. Compared to ISLR, this book (SLMR) focuses more on mathematics and R programming, although the contents are similar: linear regression, classification, information criteria, regularizations, decision trees, support vector machine, and unsupervised learning. Another similar book is “The Elements of Statistical Learning” (ESL) by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (Springer), which provides the most reliable knowledge on statistical learning. I often use it when preparing for my classes. However, the volume of information in ESL is large, and it takes at least 500–1000 h to read it through, although I do recommend reading the book. My book, SLMR, on the other hand, takes at most 100 h, depending on the reader’s baseline ability, and it does not assume the reader has any knowledge of ML. After reading SLMR, it takes at most 300–500 h to read through ESL because the reader will have enough logic to easily understand ESL. ESL contains many equations and procedures but no programming codes. In this sense, SLMR focuses on both mathematics and R programming more than ISLR and ESL. I sincerely wish that the reader of SLMR will develop both logic and statistical learning knowledge.

What Makes SLMR Unique? I have summarized the features of this book as follows. 1. Developing logic To grasp the essence of the subject, we mathematically formulate and solve each ML problem and build those programs. The SLMR instills “logic” in the minds of the readers. The reader will acquire both the knowledge and ideas of ML so that even if new technology emerges, they will be able to follow the changes smoothly. After solving the 100 problems, most of the students would say “I learned a lot.” 2. Not just a story If programming codes are available, you can immediately take action. It is unfortunate when an ML book does not offer the source codes. Even if a package is available, if we cannot see the inner workings of the programs, all we can do is input data into those programs. In SLMR, the program codes are available for most of the procedures. In cases where the reader does not understand the math, the codes will help them understand what it means. 3. Not just how to book: an academic book written by a university professor This book explains how to use the package and provides examples of executions for those who are not familiar with them. Still, because only the inputs and outputs are visible, we can only see the procedure as a black box. In this sense, the reader will have limited satisfaction because they will not be able to obtain

Preface

4.

5.

6.

7.

vii

the essence of the subject. SLMR intends to show the reader the heart of ML and is more of a full-fledged academic book. Solve 100 exercises: problems are improved with feedback from university students The exercises in this book have been used in university lectures and have been refined based on feedback from students. The best 100 problems were selected. Each chapter (except the exercises) explains the solutions, and you can solve all of the exercises by reading the book. Self-contained All of us have been discouraged by phrases such as “for the details, please refer to the literature XX.” Unless you are an enthusiastic reader or researcher, nobody will seek out those references. In this book, we have presented the material in such a way that consulting external references is not required. Additionally, the proofs are simple derivations, and the complicated proofs are given in the appendices at the end of each chapter. SLMR completes all discussions, including the appendices. Readers’ pages: questions, discussion, and program files The reader can ask any question on the book Facebook page (The URL will be set here). Additionally, all of the programs and data can be downloaded from http://bitbucket.org/prof-joe (thus, you do not have to copy the programs from the book). Linear algebra One of the bottlenecks in learning ML and statistics is linear algebra. Except for books for researchers, few books assume the reader has knowledge of linear algebra, and most books cannot go into the details of this subject. Therefore, SLMR contains a summary of linear algebra. This summary is only 14 pages and is not just an example, but it provides all the proofs. If you already know linear algebra, then you can skip it. However, if you are not confident in the subject, you can read in only one day.

How to Use This Book Each chapter consists of problems, their explanation (body), and an appendix (proof, program). You can start reading the body and solve the problem. Alternatively, you might want to solve the 100 exercises first and consult the body if necessary. Please read through the entire book until the end. When used in a lecture, I recommend that the teacher organizes the class into 12-, 90-min lectures (or a 1000-min course) as follows: three lectures for Chaps. 1, 2 lectures for Chaps. 6, and 1 lecture for each of the other chapters. You may ask the

viii

Preface

students to complete the 100 exercises. If you read the text carefully, you will be able to answer any of their questions. I think that the entire book can be fully read in about 12 lectures total. Osaka, Japan July 2020

Joe Suzuki

Acknowledgments The author wishes to thank Yuske Inaoka, Tianle Yang, Ryosuke Shinmura, and Kazuya Morishita for checking the manuscript. This English book is largely based on the Japanese book published by Kyoritsu Shuppan Co., Ltd. in 2020. The author would like to thank Kyoritsu Shuppan Co., Ltd. for their generosity. The author also appreciates Ms. Mio Sugino, Springer, for preparing the publication and providing advice on the manuscript.

Contents

1

Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Inverse Matrix . . . . . . . . . . . . . . . . . . . . . 1.2 Determinant . . . . . . . . . . . . . . . . . . . . . . . 1.3 Linear Independence . . . . . . . . . . . . . . . . . 1.4 Vector Spaces and Their Dimensions . . . . . 1.5 Eigenvalues and Eigenvectors . . . . . . . . . . 1.6 Orthonormal Bases and Orthogonal Matrix . 1.7 Diagonalization of Symmetric Matrices . . . Appendix: Proof of Propositions . . . . . . . . . . . . .

2

Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Least Squares Method . . . . . . . . . . . . . . . . . . . 2.2 Multiple Regression . . . . . . . . . . . . . . . . . . . . ^. . . . . . . . . . . . . . . . . . . . . . . 2.3 Distribution of b 2.4 Distribution of the RSS Values . . . . . . . . . . . . ^ 6¼ 0 . . . . . . . . . . . . . 2.5 Hypothesis Testing for b j 2.6 Coefficient of Determination and the Detection of Collinearity . . . . . . . . . . . . . . . . . . . . . . . . 2.7 Confidence and Prediction Intervals . . . . . . . . . Appendix: Proof of Propositions . . . . . . . . . . . . . . . . Exercises 1–18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Classification . . . . . . . . . . . . . . . . . . . . . . 3.1 Logistic Regression . . . . . . . . . . . . . 3.2 Newton–Raphson Method . . . . . . . . 3.3 Linear and Quadratic Discrimination 3.4 k-Nearest Neighbor Method . . . . . . . 3.5 ROC Curves . . . . . . . . . . . . . . . . . . Exercises 19–31 . . . . . . . . . . . . . . . . . . . .

3

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . . . .

. . . . . . .

. . . . . . . . .

. . . . . . .

. . . . . . . . .

. . . . . . .

. . . . . . . . .

1 1 3 6 8 10 12 13 14

............ ............ ............ ............ ............

17 17 20 22 24

............

26

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

33 35 38 39

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

47 47 49 54 57 58 60

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

ix

x

Contents

4

Resampling . . . . . . . . . . . . . . . . . . . . . . 4.1 Cross-Validation . . . . . . . . . . . . . . 4.2 CV Formula for Linear Regression 4.3 Bootstrapping . . . . . . . . . . . . . . . . Appendix: Proof of Propositions . . . . . . . Exercises 32–39 . . . . . . . . . . . . . . . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

67 67 71 74 78 79

5

Information Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1 Information Criteria . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 Efficient Estimation and the Fisher Information Matrix 5.3 Kullback-Leibler Divergence . . . . . . . . . . . . . . . . . . . 5.4 Derivation of Akaike’s Information Criterion . . . . . . . Appendix: Proof of Propositions . . . . . . . . . . . . . . . . . . . . . Exercises 40–48 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

83 83 87 90 92 94 96

6

Regularization . . . . . . . . . . 6.1 Ridge . . . . . . . . . . . . 6.2 Subderivative . . . . . . 6.3 Lasso . . . . . . . . . . . . 6.4 Comparing Ridge and 6.5 Setting the k Value . . Exercise 49–56 . . . . . . . . . .

. . . . Lasso . ...... ......

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

101 101 103 106 109 111 112

7

Nonlinear Regression . . . . . . . . . . 7.1 Polynomial Regression . . . . . 7.2 Spline Regression . . . . . . . . . 7.3 Natural Spline Regression . . . 7.4 Smoothing Spline . . . . . . . . . 7.5 Local Regression . . . . . . . . . 7.6 Generalized Additive Models . Appendix: Proof of Propositions . . . Exercises 57–68 . . . . . . . . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

117 117 120 122 126 130 134 136 140

8

Decision Trees . . . . . . . . . . . . . . . . . . 8.1 Decision Trees for Regression . . 8.2 Decision Tree for Classification . 8.3 Bagging . . . . . . . . . . . . . . . . . . 8.4 Random Forest . . . . . . . . . . . . . 8.5 Boosting . . . . . . . . . . . . . . . . . . Exercises 69–74 . . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

147 147 156 160 160 163 166

9

Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.1 Optimum Boarder . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.2 Theory of Optimization . . . . . . . . . . . . . . . . . . . . . . . . 9.3 The Solution of Support Vector Machines . . . . . . . . . . 9.4 Extension of Support Vector Machines Using a Kernel .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

171 171 174 177 180

. . . .

. . . .

. . . .

. . . .

. . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

Contents

xi

Appendix: Proof of Propositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 Exercises 75–87 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 10 Unsupervised Learning . . . . . . . . . . 10.1 K-means Clustering . . . . . . . . 10.2 Hierarchical Clustering . . . . . . 10.3 Principle Component Analysis . Appendix: Program . . . . . . . . . . . . . . Exercises 88–100 . . . . . . . . . . . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

193 193 197 204 209 210

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

Chapter 1

Linear Algebra

Abstract Linear algebra is the basis of logic constructions in any science. In this chapter, we learn about inverse matrices, determinants, linear independence, vector spaces and their dimensions, eigenvalues and eigenvectors, orthonormal bases and orthogonal matrices, and diagonalizing symmetric matrices. In this book, to understand the essence concisely, we define ranks and determinants based on the notion of Gaussian elimination and consider linear spaces and their inner products within the range of the Euclidean space and the standard inner product. By reading this chapter, the readers should solve the reasons why.

1.1 Inverse Matrix First, we consider solving the problem Ax = b w.r.t. x ∈ Rn for A ∈ Rm×n , b ∈ Rm . We refer to A ∈ Rm×n and [A|b] ∈ Rm×(n+1) as a coefficient matrix and an extended coefficient matrix, respectively. We write A ∼ B when A can be transformed into B ∈ Rm×n via the three elementary row operations below: Operation 1 Operation 2 Operation 3

divide one whole row by a nonzero constant exchange two rows add one row multiplied by a constant to another row       1 0 1 2x + 3y = 8 x =1 2 3 8 ∼ . Example 1 ⇐⇒ is equivalent to 1 2 5 0 1 2 x + 2y = 5 y=2 ⎧ ⎤ ⎡  2 −1 5 −1 ⎨ 2x − y + 5z = −1 x + 3z = 1 y+z =3 Example 2 ⇐⇒ is equivalent to ⎣ 0 1 1 3 ⎦ ∼ y+z =3 ⎩ 1 0 3 1 x + 3z = 1 ⎡ ⎤ 1 0 3 1 ⎣ 0 1 1 3 ⎦. 0 0 0 0 ⎧ ⎡ ⎤ ⎡ ⎤  2 −1 5 1 0 3 ⎨ 2x − y + 5z = 0 x + 3z = 0 y+z =0 Example 3 ⇐⇒ is equivalent to ⎣ 0 1 1 ⎦ ∼ ⎣ 0 1 1 ⎦. y+z =0 ⎩ 1 0 3 0 0 0 x + 3z = 0

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 J. Suzuki, Statistical Learning with Math and R, https://doi.org/10.1007/978-981-15-7568-6_1

1

2

1 Linear Algebra

We refer to the first nonzero element in each of the nonzero rows as the main element of the row and the matrix satisfying the following conditions as the canonical form. • • • •

All zero rows are positioned in the lowest rows. The main element is one unless the row is all zeros. The lower the row, the closer to the right the position of the main element is. For each column, all but the main elements are zero.

Example 4 The following main element. ⎡ ⎤ ⎡ 1 3 0 2 1 0   ⎣0 0 0  1 1 ⎦, ⎣ 0 0 0 0 0 0 0 ⎡ ⎤ 1 0 2 0 0 0  ⎣0 0 0 0 0  1⎦ 0 0 0 0 0 0

1 is the matrices are in the canonical form in which  ⎤ ⎡ ⎤ 1 0 0 2 3 0 1 4 0 −1 0  1 7 −4 0 1 ⎦, ⎣ 0 0 0 0 0 0 ⎦,  1 3 0 0 0  0 0 0 0 0 0

For an arbitrary A ∈ Rm×n , the canonical form is unique, which will be proven at the end of Sect. 1.3. We refer to any procedure computing the canonical form based on the three above operations as Gaussian elimination. In particular, we refer to the number of main elements in matrix A as the rank of A (i.e., rank(A)). From the definition, the rank of A does not exceed the minimum of m, n. If matrix A ∈ Rn×n is square and its canonical form is the unit matrix I ∈ Rn×n , then we say that A is nonsingular. For square matrices A, B ∈ Rn×n , if [A|I ] ∼ [I |B], then the extended coefficient matrix [I |B] is the canonical form of [A|I ]. In such a case, we say that either A or B is the inverse matrix of the other, and write A−1 = B, B −1 = A. The relation [A|I ] ∼ [I |B] implies AX = I ⇐⇒ B = X . In fact, if we write X = [x1 , . . . , xn ], B = [b1 , . . . , bn ] ∈ Rn×n , then the relation implies that Axi = ei ⇐⇒ bi = xi for i = 1, . . . , n, where ei ∈ Rn is the unit vector in which the ith element is one and the other elements are zero. ⎡ ⎤ 1 2 1 Example 5 For matrix A = ⎣ 2 3 1 ⎦, we have 1 2 2 ⎡

⎤ ⎡ ⎤ 1 2 1 1 0 0 1 0 0 −4 2 1 ⎣ 2 3 1 0 1 0 ⎦ ∼ ⎣ 0 1 0 3 −1 −1 ⎦ . 1 2 2 0 0 1 0 0 1 −1 0 1 ⎡

⎤ ⎡ ⎤ 1 2 1 1 0 0 If we look at the left half on both sides, we can see that ⎣ 2 3 1 ⎦ ∼ ⎣ 0 1 0 ⎦, 1 2 2 0 0 1 which means that A is nonsingular. Therefore, we can write

1.1 Inverse Matrix

3

⎤ ⎤−1 ⎡ −4 2 1 1 2 1 ⎣ 2 3 1 ⎦ = ⎣ 3 −1 −1 ⎦ −1 0 1 1 2 2 ⎡



⎤−1 ⎡ ⎤ −4 2 1 1 2 1 ⎣ 3 −1 −1 ⎦ = ⎣ 2 3 1 ⎦ −1 0 1 1 2 2 and we have the following relation: ⎡

⎤⎡ ⎤ ⎡ ⎤ 1 2 1 −4 2 1 1 0 0 ⎣ 2 3 1 ⎦ ⎣ 3 −1 −1 ⎦ = ⎣ 0 1 0 ⎦ 0 0 1 1 2 2 −1 0 1 On the other hand, although the solution x = x ∗ of Ax = b with x ∈ Rn and b ∈ Rn can be obtained by A−1 b, we may obtain via [A|b] ∼ [I |x ∗ ], which means that x = x ∗ , without computing A−1 .

1.2 Determinant We define the determinant det(A) for a square matrix A as follows. If the canonical form is not the unit matrix, which means that A is singular, we set det(A) := 0. If A is the unit matrix I , then we set det(A) = 1. Suppose that A is nonsingular and is not the unit matrix. If we repeatedly apply the third elementary row operation, we obtain a matrix such that each row and each column contain exactly one nonzero element. Then, if we repeat the second operation, we obtain a diagonal matrix. Finally, if we repeat the third operation, then we obtain the unit matrix. When computing the determinant det(A), we execute the reverse procedure from the unit matrix A = I to the original A: Step 1 Multiply det(A) by αi if the ith row is multiplied by αi . Step 2 Multiply det(A) by −1 if the i = j rows are exchanged. Step 3 Do not change det(A) if the jth row multiplied by β j is subtracted from the ith row. We define det(A) as the final obtained value after executing the three steps. Let m be how many times we implement Step 2 (multiplying by −1). Then, we have n αi . det(A) = (−1)m i=1 Proposition 1 For a matrix A ∈ Rn×n , A is nonsingular ⇐⇒ rank(A) = n ⇐⇒ det(A) = 0.

4

1 Linear Algebra

Example 6 If an all-zero row appears, which means that the determinant is zero, we may terminate the procedure. For the matrix below, the determinant is six because we exchange the rows once at the beginning. ⎡

⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ 1 0 4 1 0 4 1 0 0 0 1 1 ⎣ 0 −1 5 ⎦ ∼ ⎣ 0 −1 5 ⎦ ∼ ⎣ 0 −1 5 ⎦ ∼ ⎣ 0 −1 0 ⎦ 0 1 1 0 0 6 0 0 6 1 0 4 In the following example, since an all-zero row appears during Gaussian elimination, the determinant is zero. ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ 2 −1 5 2 −1 5 2 0 6 ⎣0 1 1⎦ ∼ ⎣0 1 1 ⎦ ∼ ⎣0 1 1⎦ 1 0 3 0 1/2 1/2 0 0 0 In general, for a 2 × 2 matrix, if a = 0, we have



b

a b

a



c d =

0 d − bc

= ad − bc a and even if a = 0, the determinant is ad − bc.





0 b

= − c d = −bc

0 b

c d Therefore, ad − bc = 0 is the condition for A to be nonsingular, and from 

we have

     1 a b d −b 1 0 · = , c d 0 1 ad − bc −c a 

a b c d

−1

=

1 ad − bc



d −b −c a

 .

On the other hand, for 3 × 3 matrices, if a = 0 and ae = bd, we have

a b c

d e f =

g h i

a b

0 e − bd

a

bg

0 h − a



cd

f − a

cg

i−

a c

1.2 Determinant

5

 

b cd

a

0 c− · f −

e − bd/a a



cd bd

f − = 0 e −

a a

 

h − bg/a cd cg

0 − f − 0 i−

a e − bd/a a

a 0 0

ae − bd

0 = 0 a

aei + b f g + cdh − ceg − bdi − a f h

0

0 ae − bd = aei + b f g + cdh − ceg − bdi − a f h.









Even if either a = 0 or ae = bd holds, we can see that the determinant is aei + b f g + cdh − ceg − bdi − a f h. Proposition 2 For square matrices A and B of the same size, we have det(AB) = det(A) det(B) and det(A T ) = det(A). For the proof, see the Appendix at the end of this chapter. The equation det(A T ) = det(A) in Proposition 2 means that we may apply the following rules to obtain the determinant det(A) Step 2’ Multiply det(A) by −1 if the i = j columns are exchanged. Step 3’ Do not change det(A) if the jth column multiplied by β j is subtracted from the ith columns. Example 7 (Vandermonde’s determinant)

1 a1 . . . a n−1 1

.. .. . . .. = (−1)n(n−1)/2  (a − a ).

. . i j . .

1≤i< j≤n

1 an . . . a n−1 n

(1.1)

In fact, if n = 1, both sides are one, and the claim holds. If we assume the claim for n = k − 1, then for n = k, the left-hand side of (1.1) is

1

0

..

.

0

1

0

= .

..

0

a1 ... a1k−2 a1k−1

a2 − a1 . . . akk−2 − a1k−2 akk−1 − a1k−1

.. .. .. ..

. . . .

k−2 k−2 k−1 k−1 ak − a1 ak − a1 . . . ak − a1

0 ... 0 0

a2 − a1 . . . (a2 − a1 )a2k−2 (a2 − a1 )a2k−1

.. .. .. ..

. . . .

k−2 k−1 (ak − a1 )ak ak − a1 . . . (ak − a1 )ak

6

1 Linear Algebra



a2 − a1 . . . (a2 − a1 )a2k−2

1 a2 . . . a2k−2



.. .. . .

.. .. .. .. = (a = − a ) . . . (a − a )

2 1 k 1 . . . . .

. .

ak − a1 . . . (ak − a1 )a k−2

1 ak . . . a k−2 k k  k−1 (k−1)(k−2)/2 = (−1) (a1 − a2 ) . . . (a1 − ak ) · (−1) (ai − a j ) 2≤i< j≤k

where the left hand side is obtained by subtracting the first row from the other rows. The first equation is obtained by subtracting the ( j − 1)th column multiplied by a1 from the jth column for j = k, k − 1, . . . , 2. The third equation is obtained by dividing the rows by constants and multiplying the determinant by the same constants. The last transformation is due to the assumption of induction, and this value coincides with the right-hand side of (1.1). Thus, from induction, we have (1.1).

1.3 Linear Independence For a matrix A ∈ Rm×n with column vectors a1 , . . . , an ∈ Rm , if the solution of Ax = 0 is only x = 0 ∈ Rn , we say that a1 , . . . , an are linearly independent. Otherwise, we say they are linearly dependent. Given a set of vectors, we refer to any instance of linear independence or dependence as a linear relation. If A ∼ B, we can see that the linear relations among the column vectors in A and B are equivalent. Example 8 For A = [a1 , a2 , a3 , a4 , a5 ] and B = [b1 , b2 , b3 , b4 , b5 ], A ∼ B means Ax = 0 ⇐⇒ Bx = 0. ⎡ ⎤ ⎡ ⎡ ⎡ ⎡ ⎤ ⎤ ⎤ ⎤ 1 1 1 −2 −1 ⎢1⎥ ⎢ 2 ⎥ ⎢ 3 ⎥ ⎢ −4 ⎥ ⎢ −4 ⎥ ⎢ ⎢ ⎢ ⎢ ⎥ ⎥ ⎥ ⎥ ⎥ a1 = ⎢ ⎣ 3 ⎦ , a2 = ⎣ 0 ⎦ , a3 = ⎣ −3 ⎦ , a4 = ⎣ 1 ⎦ , a5 = ⎣ 7 ⎦ , 0 −1 2 −1 0 ⎡

⎤ ⎡ 1 1 1 −2 −1 1 ⎢1 2 ⎥ ⎢0 3 −4 −4 ⎢ ⎥∼⎢ ⎣ 3 0 −3 1 7 ⎦ ⎣0 0 −1 −2 −1 0 0

⎤ 0 −1 0 2 1 2 0 −1 ⎥ ⎥, 0 0 1 1 ⎦ 0 0 0 0

⎫ ⎧ a1 , a2 , a4 are linearly independent ⎬ ⎨ b1 , b2 , b4 are linearly independent a3 = −a1 + 2a2 ⇐⇒ b3 = −b1 + 2b2 ⎭ ⎩ a5 = 2a1 − a2 + a4 b5 = 2b1 − b2 + b4 We interpret the rank as the maximum number of linearly independent columns in the matrix.

1.3 Linear Independence

7

If a1 , . . . , an ∈ Rm are linearly independent, none of them can be expressed by  any linear combination of the others. If we can express them as ai = j=i x j a j n for some i, we would have Ax = i=1 xi ai = 0, which means that there exists they are linearly dependent, such an x ∈ Rn such that xi = 0. On the other hand, if  xi = 0 that Ax = 0 exists, and we write ai = j=i (−x j /xi )a j . Moreover, even if we define a vector ar +1 by a linear combination a1 , . . . , ar , then a1 , . . . , ar , ar +1 are linearly dependent. Thus, if we right-multiply A ∈ Rm×n by a matrix B ∈ Rn×l , where then we obtain a matrix AB whose column vectors are right, n B is on the n a b , . . . , a b i=1 i i,1 i=1 i i,l , which means that the rank (the number of linearly independent vectors) does not exceed the rank of A, i.e., rank(AB) ≤ rank(A). When a matrix B is obtained from elementary row operations applied to matrix A, the number of linearly independent row vectors in B does not exceed that in A. Similarly, because A can be obtained from B via elementary row operations, the numbers of linearly independent row vectors are equal, which holds even when B is the canonical form of A. On the other hand, all nonzero rows in the canonical form are linearly independent, and the number of such vectors is the same as that of the main elements. Therefore, the rank is the number of linearly independent row vectors in A as well. Thus, A and its transpose A T share the same rank. Moreover, the matrix B A obtained by multiplying B ∈ Rl×m by A from right, has the same rank as (B A)T = A T B T , which means that rank(B A) does not exceed rank(A T ), which is equal to rank(A). We summarize the above discussion as follows. Proposition 3 For A ∈ Rm×n , B ∈ Rn×l , we have rank(AB) ≤ min{rank(A), rank(B)} rank(A T ) = rank(A) ≤ min{m, n}. 

           2 3 1 0 1 2 1 0 5 10 1 0 ∼ ,B= ∼ , AB = ∼ , the 1 2 0 1 1 2 0 0 3 6 0 0 ranks of A, B, and AB are 2, 1, and 1, respectively.

Example 9 From A =



⎤ 1 3 0 2 0  1 1 ⎦ is two and does not exceed three and Example 10 The rank of ⎣ 0 0 0  0 0 0 0 0 five.

Finally, we show that the canonical form is unique. Suppose that A ∼ B and that the ith columns of A and B are ai and bi , respectively. Since a linear relation that is true in A is true in B as well, if a j is linearly independent of the vectors, so is b j . Suppose further that B is in canonical form. If the number of independent vectors on the left is k − 1, i.e., b j is the kth row, then bk should be ek , the column vector such that the kth element is one, and the other elements are zero. Otherwise, the kth row of the canonical form is a zero vector, or a column vector that is right from b j becomes ek , which contradicts that B is in canonical form. On  the other hand,  if a j can be written as i< j ri ai , then b j should be written as i< j ri bi = i< j ri ei , which means that b j is a column vector whose ith element is the coefficient ri in a j . In any case, given A, the canonical form B is unique.

8

1 Linear Algebra

1.4 Vector Spaces and Their Dimensions We refer to any subset V of Rn such that 

x, y ∈ V =⇒ x + y ∈ V a ∈ R, x ∈ V =⇒ ax ∈ V

(1.2)

as a linear subspace of Rn . We may similarly define a subspace of V . Example 11 Let V be the subset of Rn such that the last element of x ∈ V is equal to the sum of the other elements. Since we can see that x, y ∈ V =⇒ xn =

n−1 

xi , yn =

n−1 

i=1

and x ∈ V =⇒ xn =

yi =⇒ xn + yn =

i=1

n−1  i=1

xi =⇒ axn =

n−1  (xi + yi ) =⇒ x + y ∈ V i=1

n−1 

axi =⇒ ax ∈ V ,

i=1

V satisfies (1.2) and is a subspace of Rn . For example, for a subset W of V such that the first element is zero, W := {[x1 , . . . , xn ]T ∈ V |x1 = 0}, satisfies (1.2), and W is a subspace of V . In the following, we refer to any subspace of Rn solely as a vector space.1 Any vector in the vector space can be expressed as a linear combination of a finite number of vectors. For example, an arbitrary x = [x1 , x2 , x3 ] ∈ R3 can be written as x =  3 T T T i=1 x i ei using e1 := [1, 0, 0] , e2 := [0, 1, 0] , e3 := [0, 0, 1] . We refer to a linearly independent subset {a1 , . . . , ar } of V such that any element in V that can be expressed as a linear combination of a1 , . . . , ar as a basis of V , and the number r of elements in {a1 , . . . , ar } to as the dimension of V . Although the basis of the V is not unique, the dimension of any basis is equal. Thus, the dimension of V is unique. Example 12 The set of vectors V that are linear combinations of a1 , . . . , a5 in Example 8 satisfies (1.2) and constitutes a vector space. Since a1 , a2 , and a4 are linearly independent and a3 and a5 can be expressed by linear combinations of these vectors, each element v in V can be expressed by specifying x1 , x2 , x4 ∈ R in x1 a1 + x2 a2 + x4 a4 , but there exists a v ∈ V that cannot be expressed by specifying x1 , x2 ∈ R and x2 , x4 ∈ R in x1 a1 + x2 a2 and x2 a2 + x4 a4 , respectively. On the other hand, if we specify x1 , x2 , x3 , x4 ∈ R in x1 a1 + x2 a2 + a3 x3 + x4 a4 , then from x1 a1 + x2 a2 + a3 x3 + x4 a4 = x1 a1 + x2 a2 + x3 (−a1 + 2a2 ) + x4 a4 = (x1 − x3 )a1 + (x2 + 2x3 )a2 + x4 a4 1 In

general, any subset V that satisfies (1.2) R is said to be a vector space with scalars in R.

1.4 Vector Spaces and Their Dimensions

9

there is more than one way to express v = a2 , such as (x1 , x2 , x3 , x4 ) = (0, 1, 0, 0), (1, −1, 1, 0). Therefore, {a1 , a2 , a4 } is a basis, and the dimension of the vector space is three. In addition, {a1 , a2 , a4 }, such that a1 = a1 + a2 , a2 = a1 − a2 , a4 , is a basis as well. In fact, because of ⎡ ⎤ ⎡ ⎤ 2 0 −2 1 0 0 ⎢ 3 −1 −4 ⎥ ⎢ 0 1 0 ⎥ ⎥∼⎢ ⎥, [a1 , a2 , a4 ] = ⎢ ⎣ 3 3 1 ⎦ ⎣0 0 1⎦ −1 1 −1 0 0 0 they are linearly independent. Let V and W be subspaces of Rn and Rm , respectively. We refer to any map V x → Ax ∈ W as the linear map2 w.r.t. A ∈ Rm×n . For example, the image {Ax | x ∈ V } and the kernel {x ∈ V | Ax = 0} are subspaces W and V , respectively. On the other hand, the image can be expressed as a linear combination of the columns in A and its dimension coincides with the rank of A (i.e., the number of linearly independent vectors in A). Example 13 For the matrix A in Example 8 and vector space V , each element of which can be expressed by a linear combination of a1 , . . . , a5 , the vectors in the image can be expressed by a linear combination of a1 , a2 , and a4 . ⎡

⎤ x1 ⎡ ⎤ ⎥ 0 1 0 −1 0 2 ⎢ ⎢ x2 ⎥ ⎥ = ⎣0⎦ x Ax = 0 ⇐⇒ ⎣ 0 1 2 0 −1 ⎦ ⎢ 3 ⎢ ⎥ 0 0 0 0 1 1 ⎣ x4 ⎦ x5 ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ x1 x3 − 2x5 1 −2 ⎢ x2 ⎥ ⎢ −2x3 + x5 ⎥ ⎢ −2 ⎥ ⎢ 1 ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎢ ⎥ ⎥ ⎢ ⎥ ⎢ ⎥ x3 ⇐⇒ ⎢ x3 ⎥ = ⎢ ⎥ = x3 ⎢ 1 ⎥ + x5 ⎢ 0 ⎥ ⎣ x4 ⎦ ⎣ −x5 ⎦ ⎣ 0 ⎦ ⎣ −1 ⎦ 0 1 x5 x5 ⎡



The and kernel are ⎫{c1 a1 + c2 a2 + c4 a4 | c1 , c2 , c4 ∈ R} and ⎧ ⎡ image ⎤ ⎡ ⎤ 1 −2 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎥ ⎢ 1 ⎥ ⎨ ⎢ ⎬ ⎢ −2 ⎥ ⎢ ⎥ ⎥ + c5 ⎢ 0 ⎥ | c3 , c5 ∈ R , respectively, and they are the subspaces of 1 c3 ⎢ ⎢ ⎥ ⎢ ⎥ ⎪ ⎪ ⎪ ⎪ ⎣ 0 ⎦ ⎣ −1 ⎦ ⎪ ⎪ ⎪ ⎪ ⎩ ⎭ 0 1 V (three dimensions) and W = R5 (two dimensions), respectively.

general, for vector spaces V and W , we say that f : V → W is a linear map if f (x + y) = f (x) + f (y), where x, y ∈ V, f (ax) = a f (x), a ∈ R and x ∈ V .

2 In

10

1 Linear Algebra

Proposition 4 Let V and W be subspaces of Rn and Rm , respectively. The image and kernel of the linear map V → W w.r.t. A ∈ Rm×n are subspaces of W and V , respectively, and the sum of the dimensions is n. The dimension of the image coincides with the rank of A. For the proof, see the Appendix at the end of this chapter.

1.5 Eigenvalues and Eigenvectors For a matrix A ∈ Rn×n , if there exist 0 = x ∈ Cn and λ ∈ C such that Ax = λx, we refer to x = 0 as the eigenvector of eigenvalue λ. In general, The solution of (A − λI )x = 0 is only x = 0 ⇐⇒ det(A − λI ) = 0 Combined with Proposition 1, we have the following proposition: Proposition 5 λ is an eigenvalue of A ⇐⇒ det(A − λI ) = 0 In this book, we only consider matrices for which all the eigenvalues are real. In general, if the eigenvalues of A ∈ Rn×n are λ1 , . . . , λn , they are the solutions of the eigenpolynomial det(A − t I ) = (λ1 − t) . . . (λn − t) = 0, and if we substitute in t = 0, we have det(A) = λ1 . . . λn . Proposition 6 The determinant of a square matrix is the product of its eigenvalues. In general, for each λ ∈ R, the subset Vλ := {x ∈ Rn | Ax = λx} constitutes a subspace of Rn (the eigenspace of λ): x, y ∈ Vλ =⇒ Ax = λx, Ay = λy =⇒ A(x + y) = λ(x + y) =⇒ x + y ∈ Vλ x ∈ Vλ , a ∈ R =⇒ Ax = λx, a ∈ R =⇒ A(ax) = λ(ax) =⇒ ax ∈ Vλ . ⎡

⎤ 7 12 0 Example 14 A = ⎣ −2 −3 0 ⎦, from det(A − t I ) = 0, 2 4 1 3) = 0 ⎡ ⎤ ⎡ 6 12 0 1 When t = 1, we have A − t I = ⎣ −2 −4 0 ⎦ ∼ ⎣ 0 2 4 0 0 ⎡ ⎤ ⎡ ⎤ 2 0 its kernel consists of ⎣ −1 ⎦, ⎣ 0 ⎦ 0 1

we have (t − 1)2 (t − ⎤ 2 0 0 0 ⎦, and a basis of 0 0

1.5 Eigenvalues and Eigenvectors

11



⎤ ⎡ ⎤ 4 12 0 1 3 0 When t = 3, we have A − t I = ⎣ −2 −6 0 ⎦ ∼ ⎣ 1 2 −1 ⎦ and a basis 2 4 −2 0 0 0 ⎡ ⎤ 3 of its kernel consists of ⎣ −1 ⎦. Hence, we have 1 ⎧ ⎨

⎫ ⎧ ⎡ ⎫ ⎤ ⎡ ⎤ ⎤ 2 0 3 ⎬ ⎨ ⎬ W1 = c1 ⎣ −1 ⎦ + c2 ⎣ 0 ⎦ | c1 , c2 ∈ R , W3 = c3 ⎣ −1 ⎦ | c3 ∈ R . ⎩ ⎭ ⎩ ⎭ 0 1 1 ⎡



⎤ 1 3 2 Example 15 For A = ⎣ 0 −1 0 ⎦, from det(A − t I ) = 0, we have (t + 1)2 (t − 1 2 0 ⎡ ⎤ ⎡ ⎤ 2 3 2 1 0 1 2) = 0. When t = −1, we have A − t I = ⎣ 0 0 0 ⎦ ∼ ⎣ 0 1 0 ⎦, and a basis of 1 2 1 0 0 0 ⎡ ⎤ ⎡ ⎤ −1 −1 3 2 its kernel consists of ⎣ 0 ⎦. When t = 2, we have A − t I = ⎣ 0 −3 0 ⎦ ∼ 1 1 2 −2 ⎡ ⎤ ⎡ ⎤ 1 0 −2 2 ⎣ 0 1 0 ⎦, and a basis of its kernel consists of ⎣ 0 ⎦. Hence, we have 0 0 0 1

W−1

⎧ ⎡ ⎫ ⎧ ⎡ ⎤ ⎫ ⎤ −1 2 ⎨ ⎬ ⎨ ⎬ = c1 ⎣ 0 ⎦ | c1 ∈ R , W2 = c2 ⎣ 0 ⎦ | c2 ∈ R . ⎩ ⎭ ⎩ ⎭ 1 1

If we obtain a diagonal matrix by multiplying a square matrix A ∈ Rn×n by a nonsingular matrix and its inverse from left and right, respectively, then we say that A is diagonalizable. Example write the matrix that arranges in Example 14 as ⎡ 16 If we ⎤ ⎡the eigenvectors ⎤ 2 0 3 1 0 0 P = ⎣ −1 0 −1 ⎦, then we have P −1 A P = ⎣ 0 1 0 ⎦. 0 1 1 0 0 3 As in Example 14, if the sum of the dimensions of the eigenspaces is n, we can diagonalize matrix A. On the other hand, as in Example 15, we cannot diagonalize A. In fact, each column vector of P should be an eigenvector. If the sum of the dimensions of the eigenspaces is less than n, we cannot choose linearly independent columns of P.

12

1 Linear Algebra

1.6 Orthonormal Bases and Orthogonal Matrix n 3 T We define √ the inner product and norm of a vector space V as u v = i=1 u i vi and u = u T u, respectively, for u, v ∈ V . If a basis u 1 , . . . , u n of V is orthogonal (the inner product of each pair is zero), the norms are ones; we say that they constitute an orthonormal basis. For an arbitrary linearly independent v1 , . . . , vn ∈ V , we construct an orthonormal basis u 1 , . . . , u n of V such that the subspaces that contain u 1 , . . . , u i and v1 , . . . , vi coincide for i = 1, . . . , n. Example 17 (Gram-Schmidt Orthonormal Basis) We construct orthonormal basis u 1 , . . . , u i such that {α1 v1 + · · · + αi vi |α1 , . . . , αi ∈ R} = {β1 u 1 + · · · + βi u i |β1 , . . . , βi ∈ R} ⎡ ⎤ ⎡ ⎤ 1 1 for each i = 1, . . . , n: Suppose we are given v1 = ⎣ 1 ⎦ , v2 = ⎣ 3 ⎦ , and v3 = 0 1 ⎡ ⎤ ⎡ ⎤ 2 1 ⎣ −1 ⎦. Then the orthonormal basis consists of u 1 = 1 = √1 ⎣ 1 ⎦, v1  2 0 1

⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎡ ⎤ −1 1 1 −1 v2

4 1 ⎣ ⎦ ⎣ 1 ⎣ ⎦ ⎣ ⎦ 1 , u2 = = √ 1 = 1 ⎦ = v2 − (v2 , u 1 )u 1 = 3 − √ · √ v2  2 2 0 3 1 1 1 ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ 2 1 1 −1 1 1 1 −2 5 v3 = v3 − (v3 , u 1 )u 1 − (v3 , u 2 )u 2 = ⎣ −1 ⎦ − √ · √ ⎣ 1 ⎦ − √ · √ ⎣ 1 ⎦ = ⎣ −1 ⎦, 6 2 2 0 3 3 1 2 1 ⎡ ⎤ 1 1 and u 3 = √ ⎣ −1 ⎦. 6 2 v2

We say a square matrix such that the columns are orthonormal. For an orthogonal matrix P ∈ Rn×n , P T P is the unit matrix. Therefore, P T = P −1 .

(1.3)

If we take the determinants on both sides, then det(P T ) det(P) = 1. From det(P) = det(P T ), we have det(P) = ±1. On the other hand, we refer to the linear map V x → P x ∈ V as an orthogonal map. Since (P x)T (P y) = x T P T P y = x T y, x, y ∈ V , an orthogonal map does not change the inner product of any pairs in V .

general, we say that the map (·, ·) is an inner product of V if (u + u , v) = (u, v) + (u , v), (cu, v) = c(u, v), (u, v) = (u , v), and u = 0 =⇒ (u, u) > 0 for u, v ∈ V , where c ∈ R.

3 In

1.7 Diagonalization of Symmetric Matrices

13

1.7 Diagonalization of Symmetric Matrices In this section, we assume that square matrix A ∈ Rn×n is symmetric. We say that a square matrix is an upper-triangular matrix if the (i, j)th elements are zero for all i > j. Then, we have the following proposition. Proposition 7 For any square matrix A, we can obtain an upper-triangular matrix P −1 A P by multiplying an orthogonal matrix P and its inverseP −1 from right and left respectively. For the proof, see the Appendix at the end of this chapter. If we note that P −1 A P is symmetric because (P −1 A P)T = P T A T (P −1 )T = −1 P A P, from (1.3), Proposition 7 claims that diagonalization as well as triangulation are obtained using the orthogonal matrix P. In the following, we claim a stronger statement. To this end, we note the following proposition. Proposition 8 For a symmetric matrix, any eigenvectors in different eigenspaces are orthogonal. In fact, for eigenvalues λ, μ ∈ R of A, where x ∈ Vλ and y ∈ Vμ , we have λx T y = (λx)T y = (Ax)T y = x T A T y = x T Ay = x T (μy) = μx T y . In addition, because λ = μ, we have x T y = 0. As we have seen before, a matrix A being diagonalizable is equivalent to the sum n of the dimensions of the eigenspaces. Thus, if we choose the basis of each eigenspace to be orthogonal, all n vectors will be orthogonal. Proposition 9 For a symmetric matrix A, using an orthogonal matrix P, the matrix P −1 A P is diagonal with diagonal elements equal to the eigenvalues of A. ⎧ ⎡ ⎤ ⎫ ⎤ ⎡ ⎤ 1 2 −1 2 −1 ⎨ ⎬ ⎣ 2 −2 2 ⎦ are c1 ⎣ 1 ⎦ + c2 ⎣ 0 ⎦ | c1 , c2 ∈ R ⎩ ⎭ −1 2 1 0 1 ⎡

Example 18 The eigenspaces of ⎧ ⎡ ⎫ ⎤ 1 ⎨ ⎬ and c3 ⎣ −2 ⎦ | c3 ∈ R . ⎩ ⎭ 1 Then, we orthogonalize eigenspace. For P = √ √the ⎤basis of the two-dimensional ⎡ ⎤ ⎡ √ 2/√5 −1/√ 30 1/ √6 2 0 0 ⎣ 1/ 5 2/ 30 −2/ 6 ⎦, we have P −1 A P = ⎣ 0 2 0 ⎦. √ √ 0 0 −4 0 5/ 30 1/ 6 In addition, from the discussion thus far, we have the following proposition: Proposition 10 For a symmetric matrix A of size n, the three conditions below are equivalent: 1. A matrix B ∈ Rm×n exists such that A = B T B.

14

1 Linear Algebra

2. x T Ax ≥ 0 for an arbitrary x ∈ Rn . 3. All the eigenvalues of A are nonnegative. In fact, 1. =⇒ 2. because A = B T B =⇒ x T Ax = x T B T Bx = Bx2 , 2. =⇒ 3. 2 and√ 3. √ =⇒ 1. because x T Ax ≥ 0 =⇒ 0 ≤ x T Ax = x T λx = λx √ √, −1 T . . . , λn ≥ 0 =⇒ A = P D P = P D D P = ( D P)T D P, because λ1 ,√ where D √ and D are the diagonal matrix whose elements are λ1 , . . . , λn and √ λ1 , . . . , λn , respectively. In this book, we refer to the matrices that satisfy the three equivalent conditions in Proposition 10 and the ones whose eigenvalues are positive as to nonnegative definite and positive definite matrices, respectively.

Appendix: Proof of Propositions Proposition 2 For square matrices A and B of the same size, we have det(AB) = det(A) det(B) and det(A T ) = det(A). Proof For steps 1, 2, and 3, we multiply the following matrices from left: Vi (α): a unit matrix where the (i, i)th element has been replaced with α Ui, j : a unit matrix where the (i, i), ( j, j)th and (i, j), ( j, i)th elements have been replaced by zero and one, respectively. Wi, j (β): a unit matrix where the (i, j)th zero (i = j) has been replaced by −β. Then, for B ∈ Rn×n , det(Vi (α)B) = α det(B), det(Ui, j B) = − det(B), det(Wi, j (β)B) = det(B). (1.4) Since (1.5) det(Vi (α)) = α, det(Ui, j ) = −1, det(Wi, j (β)) = 1 holds, if we write matrix A as the product E 1 , . . . , Er of matrices of the three types, then we have det(A) = det(E 1 ) . . . det(Er ) . det(AB) = det(E 1 · E 2 . . . Er B) = det(E 1 ) det(E 2 . . . Er B) = . . . = det(E 1 ) . . . det(Er ) det(B) = det(A) det(B). On the other hand, since matrices Vi (α) and Ui, j are symmetric and Wi, j (β)T = W j,i (β), we have a similar equation to (1.4) and (1.5). Hence, we have det(A T ) = det(ErT . . . E 1T ) = det(ErT ) . . . det(E 1T ) = det(E 1 ) . . . det(Er ) = det(A) .

Appendix: Proof of Propositions

15

Proposition 4 Let V and W be subspaces of Rn and Rm , respectively. The image and kernel of the linear map V → W w.r.t. A ∈ Rm×n are subspaces of W and V , respectively, and the sum of the dimensions is n. The dimension of the image coincides with the rank of A. Proof Let r and x1 , . . . , xr ∈ V be the dimension and basis of the kernel, respectively. We add xr +1 , . . . , xn , which are linearly independent of them, so that x1 , . . . , xr , xr +1 , . . . , xn are the bases of V . It is sufficient to show that Axr +1 , . . . , Axn are the bases of the image. , xr are vectors in the kernel, we have Ax1 = · · · = Axr = 0. First, since x1 , . . . For an arbitrary x = nj=1 b j x j with br +1 , . . . , bn ∈ R, the image can be expressed  as follows: Ax = nj=r +1 b j Ax j , which is a linear combination of Axr +1 , . . . , Axn . Then, our goal is to show that n 

bi Axi = 0 =⇒ br +1 , . . . , bn = 0.

(1.6)

i=r +1

n n If A i=r = 0, then i=r bi xi is in the kernel. Therefore, +1 bi x i  +1 n there exist n bi xi = 0. b1 , . . . , br such that i=r +1 bi xi = − ri=1 bi xi , which means that i=1 However, we assumed that x1 , . . . , xn are linearly independent, which means that b1 = . . . = bn = 0 and Proposition (1.6) is obtained. Proposition 7 For any square matrix A, we can obtain an upper-triangular matrix P −1 A P by multiplying an orthogonal matrix P and its inverse P − 1 from right and left, respectively. Proof: We prove the proposition by induction. For n = 1, since the matrix is scalar, the claim holds. From the assumption of induction, for an arbitrary B˜ ∈ R(n−1)×(n−1) , there exists an orthogonal matrix Q˜ such that ⎡ ⎢ Q˜ −1 B˜ Q˜ = ⎣

λ˜ 2

..



∗ .

0

⎥ ⎦ ,

λ˜ n

˜ where ∗ represents the nonzero elements and λ˜ 2 , . . . , λ˜ n are the eigenvalues of B. For a nonsingular matrix A ∈ Rn×n with eigenvalues λ1 , . . . , λn , allowing multiplicity, let u 1 be an eigenvector of eigenvalue λ1 , and R an orthogonal matrix such that the first column is u 1 . Then, we have Re1 = u 1 and Au 1 = λ1 u 1 , where e1 := [1, 0, . . . , 0]T ∈ Rn . Hence, we have R −1 A Re1 = R −1 Au 1 = λ1 R −1 u 1 = λ1 R −1 Re1 = λ1 e1 and we may express R

−1



λ1 b AR = 0 B

 ,

16

1 Linear Algebra

where b ∈ R1×(n−1) and 0 ∈ R(n−1)×1  . Note that R and A are nonsingular, so is B. 1 0 We claim that P = R is an orthogonal matrix, where Q is a orthogonal 0 Q (n−1)×(n−1) . In fact, Q T Q is a unit matrix, so is P T P = matrix that  B ∈R   diagonalizes 1 0 1 0 . Note that the eigenvalues of B are λ2 , . . . , λn of A: RT R 0 Q 0 Q n 

(λi − λ) = det(A − λIn ) = det(R −1 A R − λIn ) = (λ1 − λ) det(B − λIn−1 ) ,

i=1

where In is a unit matrix of size n. Finally, we claim that A is diagonalized by multiplying P −1 and P from left and right, respectively.        1 0 λ1 b 1 0 1 0 1 0 −1 R A R = 0 B 0 Q 0 Q −1 0 Q 0 Q −1 ⎡ ⎤ λ1 ∗  ⎢  ⎥ λ2 bQ λ1 ⎢ ⎥ =⎢ = ⎥ . −1 . 0 Q BQ ⎣ ⎦ . λn

P −1 A P =



which completes the proof.

Chapter 2

Linear Regression

Abstract Fitting covariate and response data to a line is referred to as linear regression. In this chapter, we introduce the least squares method for a single covariate (single regression) first and extend it to multiple covariates (multiple regression) later. Then, based on the statistical notion of estimating parameters from data, we find the distribution of the coefficients (estimates) obtained via the least squares method. Thus, we present a method for estimating a confidence interval of the estimates and for testing whether each of the true coefficients is zero. Moreover, we present a method for finding redundant covariates that may be removed. Finally, we consider obtaining a confidence interval of the response of new data outside of the data set used for the estimation. The problem of linear regression is a basis of consideration in various issues and plays a significant role in machine learning.

2.1 Least Squares Method Let N be a positive integer. For given data (x1 , y1 ), . . . , (x N , y N ) ∈ R × R, we obtain the intercept β0 and slopeβ1 via the least squares method. More preN (yi − β0 − β1 xi )2 of the squared distances cisely, we minimize the sum L := i=1 2 (yi − β0 − β1 xi ) between (xi , yi ) and (xi , β0 + xi β1 ) over i = 1, . . . , N (Fig. 2.1). Then, by partially differentiating L by β0 , β1 and letting them be zero, we obtain the following equations: N  ∂L = −2 (yi − β0 − β1 xi ) = 0 ∂β0 i=1

(2.1)

N  ∂L = −2 xi (yi − β0 − β1 xi ) = 0 ∂β1 i=1

(2.2)

where the partial derivative is calculated by differentiating each variable and regarding the other variables as constants. In this case, β0 and β1 are regarded as constants when differentiating L by β1 and β0 , respectively. © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2020 J. Suzuki, Statistical Learning with Math and R, https://doi.org/10.1007/978-981-15-7568-6_2

17

18

2 Linear Regression

y Line y = β1 x + β0

(xi , β1 xi + β0 )

Distance |yi − β1 xi − β0 | (xi , yi )

x

0 Fig. 2.1 Obtain β0 and β1 that minimize

n

i=1 (yi

By solving the Eqs. (2.1), (2.2) when

− β1 xi − β0 )2 via the least squares method

N 

(xi − x) ¯ 2 = 0, i.e.,

i=1

x1 = · · · = x N is not true

(2.3)

we obtain N 

βˆ1 =

(xi − x)(y ¯ i − y¯ )

i=1 N 

(2.4) (xi − x) ¯

2

i=1

βˆ0 = y¯ − βˆ1 x¯ ,

(2.5)

N N 1  1  xi and y¯ := yi . We used variables βˆ0 and βˆ1 instead of β0 N i=1 N i=1 and β1 , which means that they are not the true values but rather estimates obtained from data. If we divide both sides of Eq. (2.1) by −2N , we obtain (2.5). To show (2.4), we center the data as follows

where x¯ :=

¯ . . . , x˜ N := x N − x, ¯ y˜1 := y1 − y¯ , . . . , y˜ N := y N − y¯ , x˜1 := x1 − x, ¯ y¯ ) in the directions and obtain the slope (βˆ1 ) first. Even if we shift all the points by (x, of X and Y , the slope remains the same, but the line goes through the origin. Note that once x1 , . . . , x N , y1 , . . . , y N are centered, then we have

2.1 Least Squares Method

19 N N 1  1  x˜i = y˜i = 0 N i=1 N i=1

and

N N 1  1  y˜i − β1 x˜i = 0 , N i=1 N i=1

which means that the intercept becomes zero with the new coordinates. From the centered x1 , . . . , x N and y1 , . . . , y N , we obtain βˆ1 ; if we substitute β0 = 0 into (2.2), unless x˜1 = · · · = x˜ N = 0, we obtain N 

βˆ1 =

x˜i y˜i

i=1 N 

.

(2.6)

x˜i2

i=1

The estimate (2.6) is obtained after centering w.r.t. x1 , . . . , x N and y1 , . . . , y N , and if we return to the values before the centering by ¯ . . . , x N := x˜ N + x, ¯ y1 := y˜1 + y¯ , . . . , y N := y˜ N + y¯ , x1 := x˜1 + x, we obtain (2.4). Finally, from βˆ1 and the relation in (2.5), we obtain the intercept ¯ βˆ0 = y¯ − βˆ1 x. Example 19 Figure 2.2 shows the two lines l and l  generated via the R program below. l is obtained from the N pairs of data and the least squares method, and l  obtained by shifting l so that it goes through the origin. min.sq=function(x,y){ # min.sq, a function obtaining intercept and slope via least squares x.bar=mean(x); y.bar=mean(y) beta.1=sum((x−x.bar)*(y−y.bar))/sum((x−x.bar)^2); beta.0=y.bar−beta.1*x. bar return(list(a=beta.0, b=beta.1)) } a=rnorm(1); b=rnorm(1); # randomly generate the coefficients of the line N=100; x=rnorm(N); y=a*x+b+rnorm(N) # randomly generate the points surrounding the line plot(x,y); abline(h=0); abline(v=0) # plots of the points abline(min.sq(x,y)$a, min.sq(x,y)$b,col="red") # the line before centering x=x−mean(x); y=y−mean(y) # centering abline(min.sq(x,y)$a,min.sq(x,y)$b,col="blue") # the line after centering legend("topleft",c("BEFORE","AFTER"),lty=1, col=c("red","blue")) # legend

In the program, the function abline outputs the line with intercept a and slope b, the X-axis y = 0, and the Y-axis x = 0 by the commands abline(a,b), abline(h=0), and abline(v=0), respectively. The function min.sq defined

20

2 Linear Regression

BEFORE AFTER

1 -2

-1

0

y

2

3

4

Fig. 2.2 Instead of (2.4), we center the data at the beginning and obtain the slope via (2.6) first and obtain the intercept later via the arithmetic means x, ¯ y¯ and the relation in (2.5)

-2

-1

0

1

2

x

in the program returns the intercept beta.0 and slope beta.1 from the least squares methods as a list with attributes a and b such as min.sq(x,y)$a and min.sq(x,y)$b for input (x,y).

2.2 Multiple Regression We extend the regression problem for a single covariate ( p = 1) to the one for multiple covariates ( p ≥ 1). To this end, we formulate the least squares method for single regression with matrices. If we define ⎤ ⎡ ⎤ 1 x1 y1

β0 ⎢ ⎥ ⎢ ⎥ , y := ⎣ ... ⎦ , X := ⎣ ... ... ⎦ , β := β1 yN 1 xN ⎡

(2.7)

N  then for L := (yi − β0 − xi β1 )2 , we have i=1

L = y − X β2 and

⎤ ∂L ⎢ ∂β0 ⎥ T ⎥ ∇ L := ⎢ ⎣ ∂ L ⎦ = −2X (y − X β) . ∂β1 ⎡

(2.8)

2.2 Multiple Regression

21

By examining (2.8), we see that the elements on the right-hand side of (2.8) are ⎡



N 

(yi − β0 − β1 xi ) ⎥ ⎢ −2 ⎢ ⎥ i=1 ⎢ ⎥ , N ⎢ ⎥  ⎣ ⎦ −2 xi (yi − β0 − β1 xi )

(2.9)

i=1

which means that (2.9) expresses (2.1) and (2.2). For multiple regression ( p ≥ 1), we may extend the formulation in (2.7) to the one below: ⎡ ⎤ ⎤ ⎡ ⎡ ⎤ β0 1 x1,1 · · · x1, p y1 ⎢ β1 ⎥ ⎥ ⎢ ⎢ ⎥ ⎢ ⎥ .. .. ⎥ y := ⎣ ... ⎦ , X := ⎢ , β := ⎢ . ⎥ , .. ⎣ ... . ⎣ .. ⎦ . . ⎦ yN 1 x N ,1 · · · x N , p βp Even if we extend this formulation, (2.8) still holds. In fact, if we let xi,0 = 1, i = 1, . . . , N , (2.9) is extended to ⎡

⎤ p N   (yi − β j xi, j ) ⎥ ⎢ −2 ⎢ ⎥ i=1 j=0 ⎢ ⎥ ⎢ ⎥ p N   ⎢ ⎥ ⎢ −2 xi,1 (yi − β j xi, j ) ⎥ ⎢ ⎥ T − 2X (y − X β) = ⎢ ⎥ . i=1 j=0 ⎢ ⎥ .. ⎢ ⎥ ⎢ ⎥ . ⎢ ⎥ p N ⎢ ⎥   ⎣ −2 xi, p (yi − β j xi, j ) ⎦ i=1

(2.10)

j=0

Since (2.10) being zero means that X T X β = X T y, we have the following statement: Proposition 11 When a matrix X T X ∈ R( p+1)×( p+1) is invertible, we have βˆ = (X T X )−1 X T y .

(2.11)

Example 20 The following is an R program that estimates the intercept and slope via the least squares method: based on (2.11) for N = 100 random data points with β0 = 1, β1 = 2, β2 = 3, and i ∼ N (0, 1). n=100; beta=c(1,2,3) x=matrix(rnorm(n*p),nrow=n,ncol=p) y=beta[1]+beta[2]*x[,1]+beta[3]*x[,2]+rnorm(n) # noise following the standard Gaussian distribution

22

2 Linear Regression

X=cbind(1,x) # adding the all one vector in the leftmost column

>solve(t(X)%*%X)%*%t(X)%*%y [,1] [1,] 0.8684384 [2,] 1.9712187 [3,] 3.1783430

# estimate the beta

We may notice that the matrix X T X is not invertible under each of the following conditions: 1. N < p + 1 2. Two columns in X coincide. In fact, when N < p + 1, from Proposition 3, we have rank(X T X ) ≤ rank(X ) ≤ min{N , p + 1} = N < p + 1 , which means from Proposition 1 that X T X is not invertible. On the other hand, when two columns in X coincide, from Proposition 3, we have rank(X T X ) ≤ rank(X ) < p + 1 , which means that, from Proposition 1, X T X is not invertible as well. Moreover, we see that the ranks of X T X and X coincide. In fact, for an arbitrary z ∈ R p+1 , X T X z = 0 =⇒ z T X T X z = 0 =⇒ X z2 = 0 =⇒ X z = 0 and X z = 0 =⇒ X T X z = 0, which means that the kernels of X T X and X coincide. Since the numbers of columns of X T X and X are equal, so are the dimensions of their images (see Proposition 4). On the other hand, from Proposition 4, since the image dimensions are the ranks of the matrices, the ranks of X T X and X T are equal. In the following section, we assume that the rank of X ∈ R N ×( p+1) is p + 1. In particular, if p = 1, the condition in (2.3) is equivalent to rank(X ) = 1 < 2 = p + 1.

2.3 Distribution of β̂

We assume that the responses y ∈ R^N have been obtained from the covariates X ∈ R^{N×(p+1)} multiplied by the (true) coefficients β ∈ R^{p+1} plus some noise ε ∈ R^N, which means that y fluctuates only because of the randomness in ε. Thus, we let


y = Xβ + ε,  (2.12)

where the true β is unknown and different from the estimate β̂. We have estimated β̂ via the least squares method from the N pairs of data (x_1, y_1), …, (x_N, y_N) ∈ R^p × R, where x_i ∈ R^p is the row vector consisting of the p values excluding the leftmost one in the ith row of X. Moreover, we assume that each element ε_1, …, ε_N in the random variable ε is independent of the others and follows the Gaussian distribution with mean zero and variance σ². Therefore, the density function is

f_i(ε_i) = (1/√(2πσ²)) exp(−ε_i²/(2σ²))

for i = 1, …, N, which we write as ε_i ∼ N(0, σ²). We may express the distribution of ε_1, …, ε_N by

f(ε) = Π_{i=1}^N f_i(ε_i) = (1/(2πσ²)^{N/2}) exp(−ε^T ε/(2σ²)),

which we write as ε ∼ N(0, σ² I), where I is the unit matrix of size N. In general, we have the following statement:

Proposition 12 Two Gaussian random variables are independent if and only if their covariance is zero.

For the proof, see the Appendix at the end of this chapter. If we substitute (2.12) into (2.11), we have

β̂ = (X^T X)^{-1} X^T (Xβ + ε) = β + (X^T X)^{-1} X^T ε.  (2.13)

The estimate β̂ of β depends on the value of ε because the N pairs of data (x_1, y_1), …, (x_N, y_N) randomly occur. In fact, for the same x_1, …, x_N, if we again generate (y_1, …, y_N) randomly according to (2.12), only the fluctuations (ε_1, …, ε_N) are different. The estimate β̂ is obtained based on the N pairs of randomly generated data points. On the other hand, since the average of ε ∈ R^N is zero, the average of ε multiplied from the left by the constant matrix (X^T X)^{-1} X^T is zero. Therefore, from (2.13), we have

E[β̂] = β.  (2.14)

In general, we say that an estimate is unbiased if its average coincides with the true value. Moreover, both β̂ and its average β consist of p + 1 values. In this case, in addition to each variance V(β̂_i) = E(β̂_i − β_i)², i = 0, 1, …, p, the covariance σ_{i,j} := E(β̂_i − β_i)(β̂_j − β_j) can be defined for each pair i ≠ j. We refer to the


matrix consisting of σ_{i,j} in the ith row and jth column as the covariance matrix of β̂, which can be computed as follows. From (2.13), we have

E[(β̂ − β)(β̂ − β)^T], whose (i, j)th element is E(β̂_i − β_i)(β̂_j − β_j),
  = E[(X^T X)^{-1} X^T ε {(X^T X)^{-1} X^T ε}^T]
  = (X^T X)^{-1} X^T E[εε^T] X (X^T X)^{-1}
  = σ² (X^T X)^{-1},

where we have used the fact that the covariance matrix of ε is E[εε^T] = σ² I. Hence, we have

β̂ ∼ N(β, σ² (X^T X)^{-1}).  (2.15)
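A small simulation (not from the book) that checks (2.15) numerically: the sample covariance matrix of repeatedly computed estimates should be close to σ²(X^T X)^{-1}.

set.seed(1)
N=100; p=2; sigma=1.5
X=cbind(1,matrix(rnorm(N*p),ncol=p)); beta=c(1,2,3)
est=NULL
for(r in 1:2000){
  y=X%*%beta+rnorm(N,sd=sigma)
  est=rbind(est,as.vector(solve(t(X)%*%X)%*%t(X)%*%y))
}
cov(est) # empirical covariance matrix of the estimates
sigma^2*solve(t(X)%*%X) # theoretical covariance matrix in (2.15)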

2.4 Distribution of the RSS Values

In this section, we derive the distribution of the squared loss obtained by substituting β = β̂ into L = ‖y − Xβ‖² when we fit the data to a line. To this end, we explore the properties of the matrix H := X (X^T X)^{-1} X^T ∈ R^{N×N}, which we often refer to as the hat matrix. The following are easy to derive but useful in the later part of this book:

H² = X (X^T X)^{-1} X^T · X (X^T X)^{-1} X^T = X (X^T X)^{-1} X^T = H
(I − H)² = I − 2H + H² = I − H
HX = X (X^T X)^{-1} X^T · X = X.

Moreover, if we set ŷ := Xβ̂, then from (2.11), we have ŷ = Xβ̂ = X (X^T X)^{-1} X^T y = Hy, and

y − ŷ = (I − H)y = (I − H)(Xβ + ε) = (X − HX)β + (I − H)ε = (I − H)ε.  (2.16)

We define


RSS := ‖y − ŷ‖² = {(I − H)ε}^T (I − H)ε = ε^T (I − H)² ε = ε^T (I − H) ε.  (2.17)

The following proposition is useful for deriving the distribution of the RSS values.

Proposition 13 The eigenvalues of H and I − H are only zeros and ones. The dimensions of the eigenspaces of H and I − H with eigenvalues one and zero, respectively, are both p + 1, while the dimensions of the eigenspaces of H and I − H with eigenvalues zero and one, respectively, are both N − p − 1.

For the proof, see the Appendix at the end of this chapter. Since I − H is a real symmetric matrix, from Proposition 9, we can diagonalize it by an orthogonal matrix P to obtain the diagonal matrix P(I − H)P^T. Additionally, since N − p − 1 and p + 1 of the N eigenvalues are ones and zeros, respectively, without loss of generality we may put the ones in the first N − p − 1 diagonal positions:

P(I − H)P^T = diag(1, …, 1, 0, …, 0)  (N − p − 1 ones followed by p + 1 zeros).

Thus, if we define v = Pε ∈ R^N, then from ε = P^T v and (2.17), we have

RSS = ε^T (I − H) ε = (P^T v)^T (I − H) P^T v = v^T P(I − H)P^T v = Σ_{i=1}^{N−p−1} v_i²

for v = [v_1, …, v_N]^T. Let w ∈ R^{N−p−1} be the first N − p − 1 elements of v. Then, since the average of v is E[Pε] = 0, we have E[w] = 0; moreover, E[vv^T] = E[Pε(Pε)^T] = P E[εε^T] P^T = P σ² I P^T = σ² I. Hence, the covariance matrix is E[ww^T] = σ² I, where I is the unit matrix of size N − p − 1. For Gaussian distributions, the independence of variables is equivalent to the covariance matrix being diagonal (Proposition 12); therefore, we have

RSS/σ² ∼ χ²_{N−p−1},  (2.18)

where χ²_m denotes the χ² distribution with m degrees of freedom, i.e., the distribution of the squared sum of m independent standard Gaussian random variables.
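A short simulation (not from the book) that checks (2.18): the values of RSS/σ² should follow the χ² distribution with N − p − 1 degrees of freedom.

set.seed(1)
N=50; p=2; sigma=2
X=cbind(1,matrix(rnorm(N*p),ncol=p)); beta=c(1,-1,0.5)
H=X%*%solve(t(X)%*%X)%*%t(X)
stat=replicate(2000,{y=X%*%beta+rnorm(N,sd=sigma); sum(((diag(N)-H)%*%y)^2)/sigma^2})
mean(stat) # close to N-p-1 = 47
hist(stat,breaks=40,probability=TRUE); curve(dchisq(x,N-p-1),add=TRUE,col="red")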


Fig. 2.3 χ2 distributions with 1–10 degrees of freedom

Example 21 For each number of degrees of freedom from 1 to 10, we depict the probability density function of the χ² distribution in Fig. 2.3.

i=1; curve(dchisq(x,i), 0, 20, col=i)
for(i in 2:10)curve(dchisq(x,i), 0, 20,col=i,add=TRUE,ann=FALSE)
legend("topright",legend=1:10,lty=1, col=1:10)

2.5 Hypothesis Testing for β̂_j ≠ 0

In this section, we consider whether each of the β_j, j = 0, 1, …, p, is zero or not based on the data. Without loss of generality, we assume that the values of x_1, …, x_N ∈ R^p (row vectors) and β ∈ R^{p+1} are fixed. However, due to fluctuations in the N random variables ε_1, …, ε_N, we may regard that the values

y_1 = β_0 + x_1 [β_1, …, β_p]^T + ε_1, …, y_N = β_0 + x_N [β_1, …, β_p]^T + ε_N

occurred by chance (Fig. 2.4).

Fig. 2.4 We fix p = 1, N = 100, and x_1, …, x_N ∼ N(2, 1), generate ε_1, …, ε_N ∼ N(0, 1), and estimate the intercept β_0 and slope β_1 from x_1, …, x_N and y_1 = x_1 + 1 + ε_1, …, y_N = x_N + 1 + ε_N. We repeat the procedure one hundred times and find that the (β̂_0, β̂_1) values are different

In fact, if we observe y_1, …, y_N again, since the randomly occurring ε_1, …, ε_N are not the same, the y_1, …, y_N are different from the previous ones. In the following, although the value of β_j is unknown for each j = 0, 1, …, p, we construct a test statistic T that follows a t distribution with N − p − 1 degrees of freedom, as defined below, when we assume β_j = 0. If the actual value of T is rare under the assumption β_j = 0, we decide that the hypothesis β_j = 0 should be rejected. A t distribution with m degrees of freedom is the distribution of the random variable T := U/√(V/m) such that U ∼ N(0, 1), V ∼ χ²_m (the χ² distribution with m degrees of freedom), and U and V are independent. For each number of degrees of freedom up to m = 10, we depict the graph of the probability density function of the t distribution in Fig. 2.5. The t distribution is symmetric, its center is at zero, and it approaches the standard Gaussian distribution as the number of degrees of freedom m grows.

Example 22 We allow the degrees of freedom of the t distribution to vary and compare these distributions with the standard Gaussian distribution.

curve(dnorm(x), -10,10, ann=FALSE, ylim=c(0,0.5), lwd=5)
for(i in 1:10)curve(dt(x,df=i), -10, 10, col=i, add=TRUE, ann=FALSE)
legend("topright",legend=1:10,lty=1, col=1:10)
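A quick numerical check (not from the book) of this definition: sampling U ∼ N(0, 1) and V ∼ χ²_m independently and forming U/√(V/m) reproduces the density dt(·, m).

set.seed(1)
m=5; n=10000
U=rnorm(n); V=rchisq(n,df=m); T=U/sqrt(V/m)
hist(T,breaks=100,probability=TRUE,xlim=c(-6,6))
curve(dt(x,df=m),add=TRUE,col="red")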

The hypothesis test constructed here is to set the significance level (e.g., α = 0.01, 0.05) and to reject the null hypothesis if the value of T is outside of the range that occurs with probability 1 − α, as in Fig. 2.6. More precisely, if T is either so large or so small that it falls within probability α/2 of either extreme, we reject the null hypothesis β_j = 0. If β_j = 0 is true, then since T ∼ t_{N−p−1}, it is rare that T will be far from the center.

Fig. 2.5 t distributions with 1 to 10 degrees of freedom. The thick line shows the standard Gaussian distribution

Fig. 2.6 Acceptance and rejection regions for hypothesis testing: the central range of probability 1 − α is accepted, and the two tails of probability α/2 each are rejected

We estimate σ in (2.12) and the standard deviation of β̂_j by

σ̂ := √(RSS/(N − p − 1))

and

SE(β̂_j) := σ̂ √B_j,

respectively, where B_j is the jth diagonal element of (X^T X)^{-1}.


Example 23 For p = 1, since

X^T X = [ 1 ⋯ 1 ; x_1 ⋯ x_N ] [ 1 x_1 ; ⋮ ⋮ ; 1 x_N ] = N [ 1, x̄ ; x̄, (1/N) Σ_{i=1}^N x_i² ],

the inverse is

(X^T X)^{-1} = (1/Σ_{i=1}^N (x_i − x̄)²) [ (1/N) Σ_{i=1}^N x_i², −x̄ ; −x̄, 1 ],

which means that

B_0 = {(1/N) Σ_{i=1}^N x_i²} / Σ_{i=1}^N (x_i − x̄)²  and  B_1 = 1 / Σ_{i=1}^N (x_i − x̄)².

For B = (X^T X)^{-1}, Bσ² is the covariance matrix of β̂, and B_j σ² is the variance of β̂_j. Thus, we may regard B_j σ̂² as an estimate of B_j σ². For β_0 = 1 and β_1 = 1, we estimate β̂_0 and β̂_1 from N = 100 data points. We repeated the process 100 times and plotted the estimates in Fig. 2.4.

n=100; x=rnorm(n)+2
plot(1,1,xlim=c(0.5,1.5),ylim=c(0.5,1.5),xlab="beta.0",ylab="beta.1")
for(i in 1:100){
  y=1+x+rnorm(n); z=cbind(1,x); beta.est=solve(t(z)%*%z)%*%t(z)%*%y
  points(beta.est[1],beta.est[2],col=i)
}
abline(v=1); abline(h=1)

> sum(x)/n
[1] 2.21178
> sum(x^2)/n
[1] 5.706088

Because x̄ is positive, the correlation between β̂_0 and β̂_1 is negative. In the following, we show that

(β̂_j − β_j)/SE(β̂_j) ∼ t_{N−p−1}.  (2.19)


To this end, from the definition of the t distribution, we have

(β̂_j − β_j)/SE(β̂_j) = [(β̂_j − β_j)/(√B_j σ)] / √{(RSS/σ²)/(N − p − 1)}.

Thus, from (2.15)–(2.18), we have

U := (β̂_j − β_j)/(√B_j σ) ∼ N(0, 1)  and  V := RSS/σ² ∼ χ²_{N−p−1}.

Hence, it remains to be shown that U and V are independent. In particular, since RSS depends only on y − ŷ, it is sufficient to show that y − ŷ and β̂ − β are independent. To this end, if we note that

(β̂ − β)(y − ŷ)^T = (X^T X)^{-1} X^T εε^T (I − H),

then from E[εε^T] = σ² I and HX = X, we have E(β̂ − β)(y − ŷ)^T = 0. Since both y − ŷ = (I − H)ε and β̂ − β follow Gaussian distributions, zero covariance between them means that they are independent (Proposition 12), which completes the proof.

Example 24 We wish to perform a hypothesis test for the null hypothesis H_0: β_j = 0 against its alternative H_1: β_j ≠ 0. For p = 1 and using

t = (β̂_j − 0)/SE(β̂_j) ∼ t_{N−p−1}

under H_0, we construct the following procedure, in which the function pt(x,m) returns ∫_{−∞}^x f_m(t)dt, where f_m is the probability density function of a t distribution with m degrees of freedom. We compare the output with the output obtained via the lm function in the R environment.

N=100; x=rnorm(N); y=rnorm(N)
x.bar=mean(x); y.bar=mean(y)
beta.0=sum(y.bar*sum(x^2)-x.bar*sum(x*y))/sum((x-x.bar)^2)
beta.1=sum((x-x.bar)*(y-y.bar))/sum((x-x.bar)^2)
RSS=sum((y-beta.0-beta.1*x)^2); RSE=sqrt(RSS/(N-1-1))
B.0=sum(x^2)/N/sum((x-x.bar)^2); B.1=1/sum((x-x.bar)^2)
se.0=RSE*sqrt(B.0); se.1=RSE*sqrt(B.1)
t.0=beta.0/se.0; t.1=beta.1/se.1
p.0=2*(1-pt(abs(t.0),N-2)) # two-sided p-value: probability of being outside |t|
p.1=2*(1-pt(abs(t.1),N-2))


> beta.0; se.0; t.0; p.0
[1] -0.03719302
[1] 0.1073593
[1] -0.3464351
[1] 0.7297585
> beta.1; se.1; t.1; p.1
[1] -0.1617122
[1] 0.1081125
[1] -1.495778
[1] 0.1379251
> lm(y~x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
   -0.03719     -0.16171

> summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-2.92822 -0.59266  0.00184  0.68805  2.19163

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03719    0.10736  -0.346    0.730
x           -0.16171    0.10811  -1.496    0.138

Residual standard error: 1.072 on 98 degrees of freedom
Multiple R-squared:  0.02232, Adjusted R-squared:  0.01234
F-statistic: 2.237 on 1 and 98 DF,  p-value: 0.1379

Here, the residual standard error is 1.072 (df = N − p − 1 = 98), and the coefficient of determination is 0.02232. The definition of the adjusted coefficient of determination is considered in Sect. 2.6.

Example 25 We repeat the estimation of β̂_1 in Example 24 1000 times (r = 1000) to construct the histogram of β̂_1/SE(β̂_1). In the following procedure, we compute the

Fig. 2.7 Distribution of β̂_1/SE(β̂_1) under the null hypothesis β_1 = 0 (left) and under β_1 = 0.1 (right)

quantity beta.1/se.1, and accumulate them as a vector of size r in T. First, we generate the data that follow the null hypothesis β1 = 0 (Fig. 2.7, left). N=100; r=1000 T=NULL for(i in 1:r){ x=rnorm(N); y=rnorm(N); x.bar=mean(x); y.bar=mean(y) fit=lm(y~x);beta=fit$coefficients RSS=sum((y−fit$fitted.values)^2); RSE=sqrt(RSS/(N−1−1)) B.1=1/sum((x−x.bar)^2); se.1=RSE*sqrt(B.1) T=c(T,beta[2]/se.1) } hist(T,breaks=sqrt(r),probability=TRUE, xlab="t",ylab="Probability Density Function", main="Histogram of the value of $t$ and its theoretical distribution in red") curve(dt(x, N−2),−3,3,type="l", col="red",add=TRUE)

Next, we generate data that do not follow the null hypothesis (β1 = 0.1) and estimate the model with them, replacing y=rnorm(N) with y=0.1*x+rnorm(N) (Fig. 2.7, Right).
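As a small follow-up (not from the book), the fraction of the accumulated |T| values exceeding the two-sided 5% critical point estimates the rejection rate in each setting; this assumes the vector T and the sample size N from the procedure above are still in the workspace.

mean(abs(T)>qt(0.975,N-2)) # close to 0.05 under the null, larger under beta1 = 0.1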


2.6 Coefficient of Determination and the Detection of Collinearity

In the following, we define a matrix W ∈ R^{N×N} such that all of its elements are 1/N. Thus, all the elements of Wy ∈ R^N are ȳ = (1/N) Σ_{i=1}^N y_i for y_1, …, y_N ∈ R. As we have defined the residual sum of squares

RSS = ‖ŷ − y‖² = ‖(I − H)ε‖² = ‖(I − H)y‖²,

we define the explained sum of squares

ESS := ‖ŷ − ȳ‖² = ‖ŷ − Wy‖² = ‖(H − W)y‖²

and the total sum of squares

TSS := ‖y − ȳ‖² = ‖(I − W)y‖².

If RSS is much less than TSS, we may regard linear regression as suitable for the data. For the three measures, we have the relation

TSS = RSS + ESS.  (2.20)

Because HX = X and the elements in the leftmost column of X are all ones, any constant multiple of the all-ones vector is an eigenvector of H with eigenvalue one, which means that HW = W. Thus, we have

(I − H)(H − W) = 0.  (2.21)

If we square both sides of (I − W)y = (I − H)y + (H − W)y, then from (2.21), we have ‖(I − W)y‖² = ‖(I − H)y‖² + ‖(H − W)y‖². Moreover, we can show that RSS and ESS are independent. To this end, we notice that the covariance matrix between (I − H)ε and (H − W)y = (H − W)Xβ + (H − W)ε is equal to that of (I − H)ε and (H − W)ε. In fact, (H − W)Xβ ∈ R^N does not fluctuate and is not random; thus, we may remove it when we compute the covariance matrix. Then, from (2.21), the covariance matrix E[(I − H)εε^T(H − W)] is a zero matrix. Because RSS and ESS follow Gaussian distributions, they are independent (Proposition 12). We refer to

R² = ESS/TSS = 1 − RSS/TSS

as the coefficient of determination. As we will see later, for single regression (p = 1), the value of R² coincides with the square of the sample-based correlation coefficient


ρ̂ := Σ_{i=1}^N (x_i − x̄)(y_i − ȳ) / {√(Σ_{i=1}^N (x_i − x̄)²) √(Σ_{i=1}^N (y_i − ȳ)²)}.

In this sense, the coefficient of determination expresses the (nonnegative) correlation between the covariates and the response. In fact, for p = 1, from ŷ = β̂_0 + β̂_1 x and (2.4), (2.5), we have ŷ − ȳ = β̂_1 (x − x̄). Hence, from ‖x − x̄‖² = Σ_{i=1}^N (x_i − x̄)² and ‖y − ȳ‖² = Σ_{i=1}^N (y_i − ȳ)², we have

ESS/TSS = β̂_1² ‖x − x̄‖² / ‖y − ȳ‖²
        = {Σ_{i=1}^N (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^N (x_i − x̄)²}² Σ_{i=1}^N (x_i − x̄)² / Σ_{i=1}^N (y_i − ȳ)²
        = {Σ_{i=1}^N (x_i − x̄)(y_i − ȳ)}² / {Σ_{i=1}^N (x_i − x̄)² Σ_{i=1}^N (y_i − ȳ)²}
        = ρ̂².
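A numerical check (not from the book) of the identity TSS = RSS + ESS and, for p = 1, of R² being the squared sample correlation coefficient.

set.seed(1)
N=100; x=rnorm(N); y=1+2*x+rnorm(N)
fit=lm(y~x); y.hat=fit$fitted.values; y.bar=mean(y)
RSS=sum((y-y.hat)^2); ESS=sum((y.hat-y.bar)^2); TSS=sum((y-y.bar)^2)
c(TSS,RSS+ESS) # the two values coincide
c(1-RSS/TSS,cor(x,y)^2) # R^2 equals the squared correlation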

We sometimes use a variant of the coefficient of determination (the adjusted coefficient of determination) in which RSS and TSS are divided by N − p − 1 and N − 1, respectively:

1 − {RSS/(N − p − 1)} / {TSS/(N − 1)}.  (2.22)

If p is large, the adjusted coefficient of determination is smaller than the non-adjusted counterpart. For the regular coefficient of determination, the larger the number of covariates, the better the line fits the data. The adjustment, however, penalizes unnecessary covariates that are not removed.

Example 26 We construct a function to obtain the coefficient of determination and calculate it for actual data.

R2=function(x,y){
  y.hat=lm(y~x)$fitted.values; y.bar=mean(y)
  RSS=sum((y-y.hat)^2); TSS=sum((y-y.bar)^2)
  return(1-RSS/TSS)
}


> N=100; m=2; x=matrix(rnorm(m*N),ncol=m); y=rnorm(N); R2(x,y) [1] 0.002332905 > N=100; m=1; x=matrix(rnorm(m*N),ncol=m); y=rnorm(N); R2(x,y); cor(x,y)ˆ2 [1] 0.004391959 [,1] [1,] 0.004391959

The coefficient of determination expresses how well the covariates explain the response variable and takes a maximum value of one. We also use VIFs (variance inflation factors), which measure the redundancy of each covariate when the other covariates are present:

VIF := 1 / (1 − R²_{X_j | X_{−j}}),

where R²_{X_j | X_{−j}} is the coefficient of determination when the jth variable is the response and the other p − 1 variables are the covariates in X ∈ R^{N×p} (y ∈ R^N is not used when the VIF is computed). The larger the VIF, the better the covariate is explained by the other covariates, which means that the jth covariate is redundant. The minimum value of the VIF is one, and we say that the collinearity of a covariate is strong when its VIF value is large.

Example 27 We installed the R package MASS and computed the VIF for the Boston data set.

vif=function(x){
  p=ncol(x); values=array(dim=p)
  for(j in 1:p)values[j]=1/(1-R2(x[,-j],x[,j]))
  return(values)
}

> library(MASS); x=as.matrix(Boston); vif(x) [1] 1.831537 2.352186 3.992503 1.095223 4.586920 2.260374 3.100843 [8] 4.396007 7.808198 9.205542 1.993016 1.381463 3.581585 3.855684
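As a cross-check (not in the book's code), the first value printed above should equal 1/(1 − R²) when the first Boston column is regressed on the remaining ones with lm.

library(MASS)
x=as.matrix(Boston)
r2=summary(lm(x[,1]~x[,-1]))$r.squared
1/(1-r2) # compare with the first element of vif(x) above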

2.7 Confidence and Prediction Intervals

Thus far, we have shown how to obtain the estimate β̂ of β ∈ R^{p+1}. In other words, from (2.19), we obtain the confidence interval of β̂ as follows:

β_i = β̂_i ± t_{N−p−1}(α/2) SE(β̂_i)

(we write ξ̂ − γ ≤ ξ ≤ ξ̂ + γ as ξ = ξ̂ ± γ, where ξ̂ is an unbiased estimator of ξ)


for i = 0, 1, …, p, where t_{N−p−1}(α/2) is the t-statistic such that α/2 = ∫_t^∞ f(u)du for the probability density function f of the t distribution with N − p − 1 degrees of freedom. In this section, we also wish to obtain the confidence interval of x_*β̂ for another point x_* ∈ R^{p+1} (a row vector whose first element is one), which is different from the x_1, …, x_N used for estimation. Then the average and variance of x_*β̂ are E[x_*β̂] = x_* E[β̂] and

V[x_*β̂] = x_* V(β̂) x_*^T = σ² x_* (X^T X)^{-1} x_*^T,

respectively, where σ² is the variance of ε_i, i = 1, …, N. As we derived before, if we define

σ̂ := √(RSS/(N − p − 1)),  SE(x_*β̂) := σ̂ √(x_* (X^T X)^{-1} x_*^T),

then we can show that

C := (x_*β̂ − x_*β)/SE(x_*β̂) = (x_*β̂ − x_*β)/{σ̂ √(x_* (X^T X)^{-1} x_*^T)} = [(x_*β̂ − x_*β)/{σ √(x_* (X^T X)^{-1} x_*^T)}] / √{(RSS/σ²)/(N − p − 1)}

follows a t distribution with N − p − 1 degrees of freedom. In fact, the numerator follows the N(0, 1) distribution, and RSS/σ² follows a χ²_{N−p−1} distribution. Moreover, as we derived before, RSS and β̂ − β are independent. Thus, the proof of C ∼ t_{N−p−1} is completed. On the other hand, if we need to consider the noise ε as well as the estimated x_*β̂ in the evaluation, we consider the variance of the difference between x_*β̂ and y_* := x_*β + ε:

V[x_*β̂ − (x_*β + ε)] = V[x_*(β̂ − β)] + V[ε] = σ² x_* (X^T X)^{-1} x_*^T + σ².

Similarly, we can derive

P := (x_*β̂ − y_*)/SE(x_*β̂ − y_*) = [(x_*β̂ − y_*)/{σ √(1 + x_* (X^T X)^{-1} x_*^T)}] / √{(RSS/σ²)/(N − p − 1)} ∼ t_{N−p−1}.

Hence, with probability 1 − α, we obtain the confidence and prediction intervals, respectively, as follows:

x_*β = x_*β̂ ± t_{N−p−1}(α/2) σ̂ √(x_* (X^T X)^{-1} x_*^T)
y_* = x_*β̂ ± t_{N−p−1}(α/2) σ̂ √(1 + x_* (X^T X)^{-1} x_*^T)

Fig. 2.8 The confidence and prediction intervals are obtained based on the fact that those intervals are the ranges with probability 1 − α, excluding the tails, where we set α = 0.01. The curve is the t distribution with N − p − 1 degrees of freedom, which both C and P follow

Fig. 2.9 The line obtained via the least squares method. The confidence and prediction intervals are shown by the solid and dashed lines, respectively. In general, the prediction interval lies outside of the confidence interval

Example 28 We do not just fit the points to a line via the least squares method; we also draw the confidence interval that surrounds the line and the prediction interval that surrounds both the fitted line and the confidence interval (Figs. 2.8 and 2.9).

# Generate data
N=100; p=1; X=matrix(rnorm(N*p),ncol=p); X=cbind(rep(1,N),X)
beta=c(1,1); epsilon=rnorm(N); y=X%*%beta+epsilon
# Define the function f(x, a); U is the inverse of t(X)%*%X
U=solve(t(X)%*%X); beta.hat=U%*%t(X)%*%y
RSS=sum((y-X%*%beta.hat)^2); RSE=sqrt(RSS/(N-p-1)); alpha=0.05
f=function(x, a){ # a=0 gives the confidence interval, a=1 the prediction interval
  x=cbind(1,x); range=qt(df=N-p-1,1-alpha/2)*RSE*sqrt(a+x%*%U%*%t(x))
  return(list(lower=x%*%beta.hat-range,upper=x%*%beta.hat+range))
}
x.seq=seq(-10,10,0.1)
# Show the graph of the confidence interval
lower.seq=NULL; for(x in x.seq)lower.seq=c(lower.seq, f(x,0)$lower)
upper.seq=NULL; for(x in x.seq)upper.seq=c(upper.seq, f(x,0)$upper)
x.lim=c(min(x.seq),max(x.seq)); y.lim=c(min(lower.seq),max(upper.seq))
plot(x.seq, lower.seq, col="blue",xlim=x.lim, ylim=y.lim, xlab="x", ylab="y", type="l")
par(new=TRUE)
plot(x.seq, upper.seq,col="red", xlim=x.lim, ylim=y.lim, xlab="",ylab="", type="l", axes=FALSE)
par(new=TRUE)
# Show the graph of the prediction interval
lower.seq=NULL; for(x in x.seq)lower.seq=c(lower.seq, f(x,1)$lower)
upper.seq=NULL; for(x in x.seq)upper.seq=c(upper.seq, f(x,1)$upper)
x.lim=c(min(x.seq),max(x.seq)); y.lim=c(min(lower.seq),max(upper.seq))
plot(x.seq, lower.seq, col="blue",xlim=x.lim, ylim=y.lim, xlab="",ylab="", type="l", lty=4, axes=FALSE)
par(new=TRUE)
plot(x.seq, upper.seq, col="red", xlim=x.lim, ylim=y.lim, xlab="",ylab="", type="l", lty=4, axes=FALSE)
abline(beta.hat[1],beta.hat[2])
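As a cross-check (not in the book's code), R's built-in predict() reproduces these intervals; this minimal sketch assumes the objects X and y from Example 28 are still in the workspace.

df=data.frame(y=as.vector(y), x=X[,2]); fit=lm(y~x, data=df)
predict(fit, newdata=data.frame(x=c(-5,0,5)), interval="confidence", level=0.95)
predict(fit, newdata=data.frame(x=c(-5,0,5)), interval="prediction", level=0.95)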

Appendix: Proof of Propositions

Proposition 12 Two Gaussian random variables are independent if and only if their covariance is zero.

Proof Let X ∼ N(μ_X, σ_X²) and Y ∼ N(μ_Y, σ_Y²), and let E[·] be the expectation operation. If we let

ρ := E(X − μ_X)(Y − μ_Y) / {√(E(X − μ_X)²) √(E(Y − μ_Y)²)}  (2.23)

and define the independence of X and Y by the property f_X(x) f_Y(y) = f_{XY}(x, y) for all x, y ∈ R, where

f_X(x) = (1/(√(2π) σ_X)) exp{−(x − μ_X)²/(2σ_X²)},
f_Y(y) = (1/(√(2π) σ_Y)) exp{−(y − μ_Y)²/(2σ_Y²)},
f_{XY}(x, y) = (1/(2π σ_X σ_Y √(1 − ρ²))) exp[ −(1/(2(1 − ρ²))) { ((x − μ_X)/σ_X)² − 2ρ ((x − μ_X)/σ_X)((y − μ_Y)/σ_Y) + ((y − μ_Y)/σ_Y)² } ],


then ρ = 0 ⟹ f_{XY}(x, y) = f_X(x) f_Y(y). On the other hand, if f_{XY}(x, y) = f_X(x) f_Y(y), then we can write the numerator of ρ in (2.23) as follows:

∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − μ_X)(y − μ_Y) f_{XY}(x, y) dx dy = ∫_{−∞}^{∞} (x − μ_X) f_X(x) dx ∫_{−∞}^{∞} (y − μ_Y) f_Y(y) dy = 0,

which means that ρ = 0 ⟸ f_{XY}(x, y) = f_X(x) f_Y(y).

Proposition 13 The eigenvalues of H and I − H are only zeros and ones. The dimensions of the eigenspaces of H and I − H with eigenvalues one and zero, respectively, are both p + 1, while the dimensions of the eigenspaces of H and I − H with eigenvalues zero and one, respectively, are both N − p − 1.

Proof Using Proposition 4, from H = X(X^T X)^{-1} X^T and rank(X) = p + 1, we have

rank(H) ≤ min{rank(X(X^T X)^{-1}), rank(X)} ≤ rank(X) = p + 1.

On the other hand, from Proposition 4 and HX = X, rank(X) = p + 1, we have

rank(H) ≥ rank(HX) = rank(X) = p + 1.

Therefore, we have rank(H) = p + 1. Moreover, from HX = X, the columns of X form a basis of the image of H and are eigenvectors of H for the eigenvalue one. Since the dimension of the image of H is p + 1, the dimension of the kernel (the eigenspace for the eigenvalue zero) is N − p − 1. Moreover, for an arbitrary x ∈ R^N, we have (I − H)x = 0 ⟺ Hx = x and (I − H)x = x ⟺ Hx = 0, which means that the eigenspaces of H and I − H for eigenvalues of zero and one are the same as the eigenspaces of I − H and H for eigenvalues one and zero, respectively.
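A numerical illustration (not from the book) of Proposition 13: the eigenvalues of H computed for a random design matrix are p + 1 ones and N − p − 1 zeros, up to rounding error.

set.seed(1)
N=10; p=2; X=cbind(1,matrix(rnorm(N*p),ncol=p))
H=X%*%solve(t(X)%*%X)%*%t(X)
round(eigen(H)$values,10) # three ones and seven zeros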

Exercises 1–18

1. For given x_1, …, x_N, y_1, …, y_N ∈ R, let β̂_0, β̂_1 be the β_0, β_1 ∈ R that minimize L := Σ_{i=1}^N (y_i − β_0 − β_1 x_i)². Show the following equations, where x̄ and ȳ are defined by (1/N) Σ_{i=1}^N x_i and (1/N) Σ_{i=1}^N y_i.
(a) β̂_0 + β̂_1 x̄ = ȳ
(b) Unless x_1 = ⋯ = x_N,
β̂_1 = Σ_{i=1}^N (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^N (x_i − x̄)².
Hint: Item (a) is obtained from ∂L/∂β_0 = 0. For (b), substitute (a) into ∂L/∂β_1 = −2 Σ_{i=1}^N x_i (y_i − β_0 − β_1 x_i) = 0 and eliminate β_0. Then, solve it w.r.t. β_1 first and obtain β_0 later.

2. We consider the line l with the intercept β̂_0 and slope β̂_1 obtained in Problem 1. Find the intercept and slope of the shifted line l′ obtained from the data x_1 − x̄, …, x_N − x̄ and y_1 − ȳ, …, y_N − ȳ. How do we obtain the intercept and slope of l from those of the shifted line l′?

3. We wish to visualize the relation between the lines l, l′ in Problem 2. Fill in Blanks (1) and (2) below and draw the graph.

N=100 # sample size N
a=rnorm(1); b=rnorm(1) # generate the coefficients
x=rnorm(N); y=a*x+b+rnorm(N) # generate the points
plot(y~x) # plot the points
abline(lm(y~x),col="red") # fit the points to a line
abline(h=0); abline(v=0) # X-axis and Y-axis
x=x- # Blank (1) #; y=y- # Blank (2) # shift the line
abline(lm(y~x),col="blue") # fit the points to a line
abline(h=0); abline(v=0) # X-axis and Y-axis
legend("topleft",c("BEFORE","AFTER"),lty=1, col=c("red","blue"))

4. Let m, n be positive integers. Suppose that the matrix A ∈ R^{m×m} can be written as A = B^T B for some B ∈ R^{n×m}.
(a) Show that Az = 0 ⟺ Bz = 0 for arbitrary z ∈ R^m. Hint: Use Az = 0 ⟹ z^T B^T B z = 0 ⟹ ‖Bz‖² = 0.
(b) Show that the ranks of A and B are equal. Hint: Because the kernels of A and B are equal, so are the dimensions (ranks) of the images.

In the following, the leftmost column of X ∈ R^{N×(p+1)} consists of all ones.

5. For each of the following cases, show that X^T X is not invertible:
(a) N < p + 1;
(b) N ≥ p + 1 and two different columns of X are equal.

In the following, the rank of X ∈ R^{N×(p+1)} is p + 1.


6. We wish to obtain the β ∈ R^{p+1} that minimizes L := ‖y − Xβ‖² from X ∈ R^{N×(p+1)} and y ∈ R^N, where ‖·‖ denotes √(Σ_{i=1}^N z_i²) for z = [z_1, …, z_N]^T.
(a) Let x_{i,j} be the (i, j)th element of X. Show that the partial derivative of (1/2)L = (1/2) Σ_{i=1}^N (y_i − Σ_{j=0}^p x_{i,j} β_j)² w.r.t. β_j is the jth element of −X^T y + X^T X β. Hint: The jth element of X^T y is Σ_{i=1}^N x_{i,j} y_i, the (j, k)th element of X^T X is Σ_{i=1}^N x_{i,j} x_{i,k}, and the jth element of X^T X β is Σ_{k=0}^p Σ_{i=1}^N x_{i,j} x_{i,k} β_k.
(b) Find the β ∈ R^{p+1} such that ∂L/∂β = 0. In the sequel, we write this value as β̂.

7. Suppose that the random variable βˆ is obtained via the procedure in Problem 6, where we assume that X ∈ R N ×( p+1) is given and y ∈ R N is generated by X β +  with unknown constants β ∈ R p+1 and σ 2 > 0 and random variable  ∼ N (0, σ 2 I ). (a) Show βˆ = β + (X T X )−1 X T . (b) shows that the average of βˆ coincides with β, i.e., βˆ is an unbiased estimator. (c) shows that the covariance matrix of βˆ is E(βˆ − β)(βˆ − β)T = σ 2 (X T X )−1 . ˆ Show the following equations. 8. Let H := X (X T X )−1 X T ∈ R N ×N and yˆ := X β. (a) (b) (c) (d) (e) (f)

H2 = H (I − H )2 = I − H HX = X yˆ = H y y − yˆ = (I − H ) y − yˆ 2 = T (I − H )

9. Prove the following statements. (a) The dimension of the image, rank, of H is p + 1. Hint: We assume that the rank of X is p + 1. (b) H has eigenspaces of eigenvalues of zero and one, and their dimensions are N − p − 1 and p + 1, respectively. Hint: The number of columns N in H is the sum of the dimensions of the image and kernel. (c) I − H has eigenspaces of eigenvalues of zero and one, and their dimensions are p + 1 and N − p − 1, respectively. Hint: For an arbitrary x ∈ R p+1 , we have (I − H )x = 0 ⇐⇒ H x = x and (I − H )x = x ⇐⇒ H x = 0.

42

2 Linear Regression

10. Using the fact that P(I − H )P T becomes a diagonal matrix such that the first N − p − 1 and last p + 1 diagonal elements are ones and zeros, respectively, for an orthogonal P, show the following.  N − p−1 2 (a) RSS := T (I − H ) = i=1 vi , where v := P. Hint: Because P is orthogonal, we have P T P = I . Substitute  = P −1 v = P T v into the definition of RSS and find that the diagonal elements of P T (I − H )P are the N eigenvalues. In particular, I − H has N − p − 1 and p + 1 eigenvalues of zero and one, respectively. (b) Evv T = σ 2 I˜. Hint: Use Evv T = P(ET )P T . (c) RSS/σ 2 ∼ χ2N − p−1 (χ2 distribution with N − p − 1 degrees of freedom). Hint: Find the statistical properties from (a) and (b). Use the fact that the independence of Gaussian random variables is equivalent to the covariance matrix of them being diagonal, without proving it. 11. (a) Show that E(βˆ − β)(y − yˆ )T = 0. Hint: Use (βˆ − β)(y − yˆ )T = (X T X )−1 X T T (I − H ) and ET = σ 2 I . T −1 ˆ (b) Let B√ 0 , . . . , B p be the diagonal elements of (X X ) . Show that (βi − 2 βi )/( Bi σ) and RSS/σ are independent for i = 0, 1, . . . , p. Hint: Since RSS is a function of y − yˆ , the problem reduces to independence between y − yˆ and βˆ − β. Because they are Gaussian, it is sufficient to show that the covarianceis zero. RSS (the residual standard error, an estimate of σ), and (c) Let σˆ := N − p−1 √ S E(βˆi ) := σˆ Bi (an estimate of the standard error of βˆi ). Show that βˆi − βi ∼ t N − p−1 , i = 0, 1, . . . , p S E(βˆi ) (t distribution with N − p − 1 degrees of freedom). Hint: Derive " " RSS βˆi − βi βˆi − βi (N − p − 1) = √ σ2 σ Bi S E(βˆi ) and show that the right-hand side follows a t distribution. (d) When p = 1, find B0 and B1 , letting (x1,1 , . . . , x N ,1 ) = (x1 , . . . , x N ). Hint: Derive ⎡ ⎤ N 1  2 1 xi −x¯ ⎥ ⎢ (X T X )−1 = N ⎣N ⎦ i=1  2 −x¯ 1 (xi − x) ¯ i=1

Appendix: Proof of Propositions

43

Use the fact that independence of Gaussian random variables U1 , . . . , Um , V1 , . . . , VN is equivalent to a covariance matrix of size m × n being a diagonal matrix, without proving it. 12. We wish to test the null hypothesis H0 : βi = 0 versus its alternative H1 : βi = 0. For p = 1, we construct the following procedure using the fact that under H0 , t=

βˆi − 0 ∼ t N − p−1 , S E(βˆi ) 



where the function pt(x,m) returns the value of

f m (t)dt, where f m is the

x

probability density function of a t distribution with m degrees of freedom. N=100; x=rnorm(N); y=rnorm(N) x.bar=mean(x); y.bar=mean(y) beta.0=sum(y.bar*sum(x^2)−x.bar*sum(x*y))/sum((x−x.bar)^2) beta.1=sum((x−x.bar)*(y−y.bar))/sum((x−x.bar)^2) RSS=sum((y−beta.0−beta.1*x)^2); RSE=sqrt(RSS/(N−1−1)) B.0=sum(x^2)/N/sum((x−x.bar)^2); B.1=1/sum((x−x.bar)^2) se.0=RSE*sqrt(B.0); se.1=RSE*sqrt(B.1) t.0=beta.0/se.0; t.1=beta.1/se.1 p.0=2*(1−pt(abs(t.0),N−2)) # probability of being outside of it value p.1=2*(1−pt(abs(t.1),N−2))

Examine the outputs using the lm function in the R language. lm(y~x) summary(lm(y~x))

13. The following procedure repeats estimating βˆ1 1000 times (r = 1000) and draws a histogram of βˆ1 /S E(β1 ), where beta.1/se.1 is computed each time from the data, and they are accumulated in the vector T of size r . N=100; r=1000 T=NULL for(i in 1:r){ x=rnorm(N); y=rnorm(N); x.bar=mean(x); y.bar=mean(y) fit=lm(y~x);beta=fit$coefficients RSS=sum((y−fit$fitted.values)^2); RSE=sqrt(RSS/(N−1−1)) B.1=1/sum((x−x.bar)^2); se.1=RSE*sqrt(B.1) T=c(T,beta[2]/se.1) } hist(T,breaks=sqrt(r),probability=TRUE, xlab="t",ylab="Probability Density Function", main="Histogram of the value of $t$ and its theoretical distribution in red") curve(dt(x, N−2),−3,3,type="l", col="red",add=TRUE)

Replace y=rnorm(N) with y=0.1*x+rnorm(N) and execute it. Furthermore, explain the difference between the two graphs.

44

2 Linear Regression

14. Suppose that each element of W ∈ R N ×N is 1/N , thus y¯ = y = [y1 , . . . , y N ]T .

N 1  yi = W y for N i=1

(a) Show that H W = W and (I − H )(H − W ) = 0. Hint: Because each column of W is an eigenvector of eigenvalue one in H , we have H W = W . (b) Show that E SS :=  yˆ − y¯ 2 = (H − W )y2 and T SS := y − y¯ 2 = (I − W )y2 . (c) Show that RSS = (I − H )2 = (I − H )y2 and E SS are independent Hint: The covariance matrix of (I − H ) and (H − W )y is that of (I − H ) and (H − W ). Evaluate the covariance matrix E(I − H )T (H − W ). Then, use (a). (d) Show that (I − W )y2 = (I − H )y2 + (H − W )y2 , i.e., T SS = RSS + E SS. Hint: (I − W )y = (I − H )y + (H − W )y. In the following, we assume that X ∈ R N × p does not contain a vector of size N of all ones in the leftmost column. 15. Given X ∈ R N × p and y ∈ R N , we refer to R2 =

RSS E SS =1− T SS T SS

as to the coefficient of determination. For p = 1, suppose that we are given x = [x1 , . . . , x N ]T . (a) Show that yˆ − y¯ = βˆ1 (x − x). ¯ Hint: Use yˆi = βˆ0 + βˆ1 xi and Problem 1(a). 2 ˆ ¯ 2 β x − x (b) Show that R 2 = 1 . y − y¯ 2 (c) For p = 1, show that the value of R 2 coincides with the square of the corN  (xi − x) ¯ 2 and Problem 1(b). relation coefficient. Hint: Use x − x ¯ 2= i=1

(d) The following function computes the coefficient of determination. R2=function(x,y){ y.hat=lm(y~x)$fitted.values; y.bar=mean(y) RSS=sum((y−y.hat)^2); TSS=sum((y−y.bar)^2) return(1−RSS/TSS) } N=100; m=2; x=matrix(rnorm(m*N),ncol=m); y=rnorm(N); R2(x,y)

Let N=100 and m=1, and execute x=matrix(rnorm(m*N),ncol=m); y=rnorm(N); R2(x,y); cor(x,y)ˆ2. 16. The coefficient of determination expresses how well the covariates explain the response variable, and its maximum value is one. When we evaluate how redundant a covariate is given the other covariates, we often use VIFs (variance inflation factors)

Appendix: Proof of Propositions

45

V I F :=

1 , 1 − R 2X j |X − j

where R 2X j |X − j is the coefficient of determination of the jth covariate in X ∈ R N × p given the other p − 1 covariates (y ∈ R N is not used). The larger the VIF value, the better the covariate is explained by the other covariates (the minimum value is one), which means that the collinearity is strong. Install the R package MASS and compute the VIF values for each variable in the Boston data set by filling the blank. (Simply execute the following). library(MASS); X=as.matrix(Boston); p=ncol(X) T=NULL; for(j in 1:p)T=c(T, # Blank #); T

17. We can compute the prediction value x∗ βˆ for each x∗ ∈ R p+1 (the row vector ˆ whose first value is one), using the estimate β. ˆ = (a) Show that the variance of x∗ βˆ is σ 2 x∗ (X T X )−1 x∗T . Hint: Use V (β) σ 2 (X T X )−1 .  ˆ := σˆ x∗ (X T X )−1 x T , show that (b) If we let S E(x∗T β) ∗ x∗ βˆ − x∗ β ∼ t N − p−1 , ˆ S E(x∗ β)  where σˆ = RSS/(N − p − 1). (c) The actual value of y can be expressed by y∗ := x∗ β + . Thus, the variance of y∗ − x∗ βˆ is σ 2 larger. Show that 

x∗ βˆ − y∗

σˆ 1 + x∗ (X T X )−1 x∗T

∼ t N − p−1 .

18. From Problem 17, we have ! x∗T βˆ ± t N − p−1 (α/2)σˆ x∗T (X T X )−1 x∗ ! y∗ ± t N − p−1 (α/2)σˆ 1 + x∗T (X T X )−1 x∗ (the confidence and prediction intervals, respectively), where f is the t distribution with N − p − 1 degrees of freedom. t N − p−1 (α/2) is the t-statistic such that ∞ α/2 = t f (u)du. Suppose that p = 1. We wish to draw the confidence and prediction intervals in red and blue, respectively, for x∗ ∈ R. For the confidence interval, we expressed the upper and lower limits by red and blue solid lines, respectively, executing the procedure below. For the prediction interval, define the function g(x) and overlay the upper and lower dotted lines in red and blue on the same graph.

46

2 Linear Regression # Generate Data N=100; p=1; X=matrix(rnorm(N*p),ncol=p); X=cbind(rep(1,N),X); beta=rnorm( p+1); epsilon=rnorm(N); y=X%*%beta+epsilon # Define function f(x). U is the inverse of t(X)%*%X U=solve(t(X)%*%X); beta.hat=U%*%t(X)%*%y; RSS=sum((y−X%*%beta.hat)^2); RSE=sqrt(RSS/(N−p−1)); alpha=0.05 f=function(x){ x=cbind(1,x); range=qt(df=N−p−1,1−alpha/2)*RSE*sqrt(x%*%U%*%t(x)); return(list(lower=x%*%beta.hat−range,upper=x%*%beta.hat+range)) } # Draw the confidence interval x.seq=seq(−10,10,0.1) lower.seq=NULL; for(x in x.seq)lower.seq=c(lower.seq, f(x)$lower) upper.seq=NULL; for(x in x.seq)upper.seq=c(upper.seq, f(x)$upper) x.lim=c(min(x.seq),max(x.seq)); y.lim=c(min(lower.seq),max(upper.seq)) plot(x.seq,lower.seq,col="blue",xlim=x.lim, ylim=y.lim, xlab="x",ylab ="y", type="l") par(new=TRUE); plot(x.seq,upper.seq,col="red", xlim=x.lim, ylim=y.lim, xlab="",ylab=" ", type="l", axes=FALSE) abline(beta.hat[1],beta.hat[2])

Chapter 3

Classification

Abstract In this chapter, we consider constructing a classification rule from covariates to a response that takes values from a finite set such as ±1, figures 0, 1, . . . , 9. For example, we wish to classify a postal code from handwritten characters and to make a rule between them. First, we consider logistic regression to minimize the error rate in the test data after constructing a classifier based on the training data. The second approach is to draw borders that separate the regions of the responses with linear and quadratic discriminators and the k-nearest neighbor algorithm. The linear and quadratic discriminations draw linear and quadratic borders, respectively, and both introduce the notion of prior probability to minimize the average error probability. The k-nearest neighbor method searches the border more flexibly than the linear and quadratic discriminators. On the other hand, we take into account the balance of two risks, such as classifying a sick person as healthy and classifying a healthy person as unhealthy. In particular, we consider an alternative approach beyond minimizing the average error probability. The regression method in the previous chapter and the classification method in this chapter are two significant issues in the field of machine learning.

3.1 Logistic Regression We wish to determine a decision rule from p covariates to a response that takes two values. More precisely, we derive the map x ∈ R p → y ∈ {−1, 1} from the data (x1 , y1 ), . . . , (x N , y N ) ∈ R p × {−1, 1} that minimizes the error probability. In this section, we assume that for x ∈ R p (row vector), the probabilities of y = 1 eβ0 +xβ 1 and , respectively, for some and y = −1 are expressed by1 1 + eβ0 +xβ 1 + eβ0 +xβ β0 ∈ R and β ∈ R p and write the probability of y ∈ {−1, 1} as 1 1 + e−y(β0 +xβ) 1 In

this chapter, instead of β ∈ R p+1 , we separate the slope β ∈ R p and the intercept β0 ∈ R.



48


Fig. 3.1 As the value of β increases, the probability of y = 1 increases monotonically and changes greatly from approximately 0 to approximately 1 near x =0

1.0

Logistic Curve

0.6 0.4 0.0

0.2

P (Y = 1|x)

0.8

0 0.2 0.5 1 2 10

-10

-5

0

5

10

x

(logistic regression). To roughly explain the function (the sigmoid function), we draw the graph for p = 1, β0 = 0, β > 0, and y = 1: f (x) =

1 1+

e−(β0 +xβ)

, x ∈R.

Example 29 We ran the following program, and the graph is shown in Fig. 3.1. f=function(x)exp(beta.0+beta*x)/(1+exp(beta.0+beta*x)) beta.0=0; beta.seq=c(0,0.2,0.5,1,2,10); m=length(beta.seq); beta=beta.seq[1] plot(f,xlim=c(−10,10),ylim=c(0,1),xlab="x",ylab="P(Y=1|x)", col=1, main="Logistic Curve") for(i in 2:m){ beta=beta.seq[i]; par(new=TRUE); plot(f,xlim=c(−10,10),ylim=c(0,1),xlab="", ylab="", axes=FALSE, col=i) } legend("topleft", legend=beta.seq, col=1:length(beta.seq), lwd=2, cex=.8)

From e−(β0 +xβ) ≥0 (1 + e−(β0 +xβ) )2 e−(β0 +xβ) [1 − e−(β0 +xβ) ] f  (x) = −β 2 , (1 + e−(β0 +xβ) )3 f  (x) = β

we see that f (x) is increasing monotonically and is convex and concave when x < −β0 /β and x > −β0 /β, respectively; they change at x=0, when β0 = 0.

3.1 Logistic Regression

49

In the following, from the observations (x1 , y1 ), . . . , (x N , y N ) ∈ R p × {−1, 1}, N  1 (maximum likelihood), or miniby maximizing the likelihood −yi (β0 +xi β) 1 + e i=1 mizing the negative log-likelihood: l(β0 , β) =

N 

log(1 + vi ), vi = e−yi (β0 +xi β) , i = 1, . . . , N ,

i=1

we obtain the estimate β0 ∈ R, β ∈ R p . Example 30 If the observations are i xi yi

1 71.2 −1

2 29.3 −1

3 42.3 1

··· ··· ···

25 25.8 1

( p = 1, N = 25), the likelihood to be maximized is 1 1 1 1 · · ··· . 1 + exp(β0 + 71.2β1 ) 1 + exp(β0 + 29.3β1 ) 1 + exp(−β0 − 42.3β1 ) 1 + exp(−β0 − 25.8β1 )

Note that the observations are known, and we determine β0 , β1 so that the likelihood is maximized. However, for logistic regression, unlike for linear regression, no formula to obtain the estimates of the coefficients exists.

3.2 Newton–Raphson Method When we solve the equations such as the partial derivatives of l(β0 , β) being zero, the Newton–Raphson method is often used. To understand the essence, briefly, we consider the purest example of the use of the Newton–Raphson method. Suppose that we solve f (x) = 0 with f (x) = x 2 − 1 . We set an initial value x = x0 and draw the tangent that goes through the point (x0 , f (x0 )). If the tangent crosses the x-axis (y = 0) at x = x1 , then we again draw the tangent that intersects the point (x1 , f (x1 )). If we repeat the process, the sequence x0 , x1 , x2 , . . . approaches the solution of f (x) = 0. In general, because the tangent line is y − f (xi ) = f  (xi )(x − xi ), the intersection with y = 0 is xi+1 := xi −

f (xi ) f  (xi )

(3.1)

3 Classification

10 0

5

f (x)

15

20

50

-1

0

1

2

3

4

5

x Fig. 3.2 The Newton–Raphson method: starting from x0 = 4, the tangent that goes through (x0 , f (x0 ) and crosses the x-axis at x1 , and the tangent that goes through (x1 , f (x1 ) and crosses the x-axis at x2 , and so on. The sequence is obtained by the recursion x1 = x0 − f (x0 )/ f  (x0 ), x2 = x1 − f (x1 )/ f  (x1 ), . . .. The points in the sequence are marked in red

for i = 0, 1, 2, . . .. If more than one solution exists, the solution obtained by the convergence may depend on the initial value of x0 . In the current case, if we set x0 = −2, the solution converges to x = −1. In addition, we need to decide when the cycle should be terminated based on some conditions, such as the size of |xi+1 − xi | and the number of repetitions. Example 31 For x0 = 4, we run the following R program to obtain the graph in Fig. 3.2. The program repeats the cycle ten times. f=function(x) x^2−1; f.=function(x) 2*x curve(f(x),−1,5); abline(h=0,col="blue") x=4 for(i in 1:10){ X=x; Y=f(x); x=x−f(x)/f.(x); y=f(x) segments(X,Y,x,0); segments(X,Y,X,0, lty=3) points(x,0,col="red",pch=16) }

The Newton–Raphson method can even be applied to two variables and two equations: for  f (x, y) = 0 , g(x, y) = 0 we can see that (3.1) is extended to ⎡ ∂ f (x, y)     ⎢ x x ∂x ← −⎢ ⎣ ∂g(x, y) y y ∂x

⎤ ∂ f (x, y) −1   ⎥ f (x, y) ∂y ⎥ , g(x, y) ∂g(x, y) ⎦ ∂y

(3.2)

3.2 Newton–Raphson Method



∂ f (x, y) ⎢ ∂x where the matrix ⎢ ⎣ ∂g(x, y) ∂x

51

⎤ ∂ f (x, y) ⎥ ∂y ⎥ is called a Jacobian matrix. ∂g(x, y) ⎦ ∂y

Example 32 For f (x, y) = x 2 + y 2 − 1 and g(x, y) = x + y, if we start searching the solution from (x, y) = (3, 4), the execution is as follows. f=function(z)z[1]^2+z[2]^2−1; f.x=function(z) 2*z[1]; f.y=function(z)2*z[2]; g=function(z)z[1]+z[2]; g.x=function(z) 1; g.y=function(z) 1; z=c(3,4) for(i in 1:10){ z=z−solve(matrix(c(f.x(z),f.y(z),g.x(z),g.y(z)),ncol=2,byrow=TRUE))%*%c(f(z),g( z)) }

> z [,1] [1,] -0.7071068 [2,] 0.7071068

Then, we apply the same method to the problem of finding β0 ∈ R and β ∈ R p such that ∇l(β0 , β) = 0: (β0 , β) ← (β0 , β) − {∇ 2 l(β0 , β)}−1 ∇l(β0 , β) , ∂f , and ∇ 2 f (v) ∈ ∂vi ∂2 f R( p+1)×( p+1) is a square matrix such that the (i, j)th element is . In the ∂vi ∂v j following, for ease of notation, we write (β0 , β) ∈ R × R p as β ∈ R p+1 . If we differentiate the negative log-likelihood l(β0 , β) and if we let vi = e−yi (β0 +xi β) , i = 1, . . . , N , the vector ∇l(β0 , β) ∈ R p+1 such that the jth element is ∂l(β0 , β) , j = 0, 1, . . . , p, can be expressed by ∇l(β0 , β) = −X T u with ∂β j

where ∇ f (v) ∈ R p+1 is a vector such that the ith element is

⎡ yv 1 1 ⎢ 1 + v1 ⎢ .. u=⎢ . ⎢ ⎣ yN vN 1 + vN

⎤ ⎥ ⎥ ⎥ , ⎥ ⎦

where β0 is regarded as the zeroth element, and the ith row of X is [1, xi ] ∈ R p+1 . If we note yi = ±1, i.e., yi2 = 1, the matrix ∇ 2 l(β0 , β) such that the ( j, k)th element ∂ 2 l(β0 , β) is , j, k = 0, 1, . . . , p, can be expressed by ∇ 2 l(β0 , β) = X T W X with ∂β j βk

52

3 Classification



v1 ⎢ (1 + v1 )2 ⎢ .. W =⎢ . ⎢ ⎣ 0

···

⎤ 0

.. . . vN ··· (1 + v N )2 ..

⎥ ⎥ ⎥ . ⎥ ⎦

Using such W and u, the update rule can be written as follows: β ← β + (X T W X )−1 X T u . In addition, if we introduce the variable z := X β + W −1 u ∈ R N , the formula becomes simpler: β ← (X T W X )−1 X T W z . Example 33 We wrote an R program that solves ∇l(β0 , β) = 0 and executed it for the following data. ## Data Generation N=1000; p=2; X=matrix(rnorm(N*p),ncol=p); X=cbind(rep(1,N),X) beta=rnorm(p+1); y=array(N); s=as.vector(X%*%beta); prob=1/(1+exp(s)); for(i in 1:N)if(runif(1)>prob[i])y[i]=1 else y[i]=−1

>beta [1] 0.3872739 -1.6038466 -1.1695330 ## ML Estimation beta=Inf; gamma=rnorm(p+1) while(sum((beta-gamma)ˆ2)>0.001){ beta=gamma s=as.vector(X%*%beta); v=exp(-s*y); u= y*v/(1+v) w= v/(1+v)ˆ2; W=diag(w); z= s+u/w gamma=as.vector(solve(t(X)%*%W%*%X)%*%t(X)%*%W%*%z) print(gamma) } [1] 0.2479060 -1.0826825 -0.6997795 [1] 0.3790175 -1.5436907 -1.0067709 [1] 0.4205884 -1.7001886 -1.1111159 [1] 0.4242678 -1.7141541 -1.1204503

We found that the results were almost correct. For some cases, the maximum likelihood solution cannot be obtained even if we apply the Newton–Raphson method. For example, if the observations satisfy yi (β0 + xi β) ≥ 0, (xi , yi ) ∈ R p × R, i = 1, . . . , N , then the maximum likelihood estimate of logistic regression cannot be obtained. In fact, the terms in the exponent part of

3.2 Newton–Raphson Method

53 N  i=1

1 1 + exp{−yi (β0 + xi β)}

can be all negative, which means that the exponent can diverge to −∞ if we multiply β0 and β by 2. Thus, the likelihood can approach one by choosing some β0 and β. Even if we do not meet such conditions if p is large compared to N , the possibility of the parameter being infinitely large increases. Example 34 For p = 1, We estimated the coefficients βˆ0 , βˆ1 of logistic regression using the training data with N /2 samples and predicted the response of the covariate values in the N /2 test data. n=100; x=c(rnorm(n)+1, rnorm(n)−1); y=c(rep(1,n),rep(−1,n)) train=sample(1:(2*n),n,replace=FALSE); df=data.frame(x,y) x=as.matrix(df[train,1]); y=as.vector(df[train,2]) p=1; X=cbind(1,x); beta=0; gamma=rnorm(p+1),

while(sum((beta-gamma)ˆ2)>0.001){ beta=gamma s=as.vector(X%*%beta); v=exp(-s*y); u= y*v/(1+v) w= v/(1+v)ˆ2; W=diag(w); z= s+u/w gamma=as.vector(solve(t(X)%*%W%*%X)%*%t(X)%*%W%*%z) print(gamma) } [1] 0.2865112 1.2735841 [1] 0.2273836 1.7364786 [1] 0.2401321 1.9284290 [1] 0.2439243 1.9522814

x=as.matrix(df[−train,1]); y=as.vector(df[−train,2]) ## y contains the correct one in the test data z=2*as.integer(beta[1]+x*beta[2]>0)−1 ## z contains the exponents (positive/negative)

> table(y,z) ## The figures on the diagonal are the # of correct answers (total 100) z y -1 1 -1 38 10 1 12 40

We set up a data frame with the pairs of covariate and response values and divided the N = 2n data into training and test sets of size n. The finally obtained values of y are the correct values, and we predicted each of the y values based on the estimates of β0 and β1 and whether each of the zs is positive or negative. The table expresses the numbers of correct and incorrect answers, and the correct rate in this experiment was (38 + 40)/100 = 0.78.

54

3 Classification

3.3 Linear and Quadratic Discrimination As before, we find the map x ∈ R p → y ∈ {−1, 1} to minimize the error probability, given the observations x1 , . . . , x N ∈ R p , y1 , . . . , y N ∈ {−1, 1}. In this section, we assume that the distributions of x ∈ R p given y = ±1 are N (μ±1 , ±1 ) and write the probability density functions by 

1 1 T −1 exp − (x − μ±1 ) ±1 (x − μ±1 ) . f ±1 (x) = √ 2 (2π) p det 

(3.3)

In addition, we introduce the notion of prior probabilities of events: we assume that the probabilities of responses y = ±1 are known before seeing the covariates x, which we term the prior probability. For example, we may estimate the probability of the response being π±1 from the ratio of the two from y1 , . . . , y N in the training data. On the other hand, we refer to π±1 f ±1 (x) π1 f 1 (x) + π−1 f −1 (x) as the posterior probability of y = ±1 given x. We can minimize the error probability by estimating y = 1 if π1 f 1 (x) π−1 f −1 (x) ≥ , π1 f 1 (x) + π−1 f −1 (x) π1 f 1 (x) + π−1 f −1 (x) which is equivalent to π1 f 1 (x) ≥ π−1 f −1 (x) ,

(3.4)

and y = −1 otherwise. The procedure assumes that f ±1 follows a Gaussian distribution and that the expectation μ±1 and covariance matrix ±1 are known, and that π±1 is known. For actual situations, we need to estimate these entities from the training data. The principle of maximizing the posterior probability is applied not only to the binary case (K = 2) but also to the general case K ≥ 2, where K is the number of values that the response takes. The probability that response y = k given covariates ˆ then the probability of the x is P(y = k|x) for k = 1, . . . , K . If we estimate y = k, ˆ Thus, choosing estimate being correct is 1 − k =kˆ P(y = k|x) = 1 − P(y = k|x). ˆ ˆ a k that maximizes the posterior probability P(y = k|x) as k minimizes the average error probability when the prior probability is known. In the following, assuming K = 2 for simplicity, we see the properties at the border between y = ±1 when we maximize the posterior probability: −1 −(x − μ1 )T 1−1 (x − μ1 ) + (x − μ−1 )T −1 (x − μ−1 ) = log

det 1 π1 − 2 log , det −1 π−1

3.3 Linear and Quadratic Discrimination

55

where the equation is obtained from (3.3) and (3.4). In general, the border is a function −1 x of x (quadratic discrimination). of the quadratic forms x T 1−1 x and x T −1 In particular, when 1 = −1 , if we write them as , the border becomes a surface (a line when p = 2), which we call linear discrimination. In fact, the terms −1 x are canceled out, and the border becomes x T 1−1 x = x T −1 2(μ1 − μ−1 )T  −1 x − (μ1T  −1 μ1 − μ−1  −1 μ−1 ) = −2 log

π1 , π−1

or more simply, (μ1 − μ−1 )T  −1 (x −

π1 μ1 + μ−1 ) = − log . 2 π−1

μ1 + μ−1 . Thus, if π1 = π−1 , then the border is x = 2 If π±1 and f ±1 are unknown, we need to estimate them from the training data. Example 35 For artificially generated data, we estimated the averages and covariances of covariates x for a response y = ±1, and drew the border. mu.1=c(2,2); sigma.1=2; sigma.2=2; rho.1=0 mu.2=c(−3,−3); sigma.3=1; sigma.4=1; rho.2=−0.8 n=100 u=rnorm(n); v=rnorm(n); x.1=sigma.1*u+mu.1[1]; y.1=(rho.1*u+sqrt(1−rho.1^2)*v)*sigma.2+mu.1[2] u=rnorm(n); v=rnorm(n); x.2=sigma.3*u+mu.2[1]; y.2=(rho.2*u+sqrt(1−rho.2^2)*v)*sigma.4+mu.2[2] f=function(x,mu,inv,de)drop(−0.5*t(x−mu)%*%inv%*%(x−mu)−0.5*log(de)) mu.1=mean(c(x.1,y.1)); mu.2=mean(c(x.2,y.2)); df=data.frame(x.1,y.1); mat=cov(df); inv.1=solve(mat); de.1=det(mat) # df=data.frame(x.2,y.2); mat=cov(df); inv.2=solve(mat); de.2=det(mat) # f.1=function(u,v)f(c(u,v),mu.1,inv.1,de.1); f.2=function(u,v)f(c(u,v),mu.2,inv.2,de.2) pi.1=0.5; pi.2=0.5 u = v = seq(−6, 6, length=50); m=length(u); w=array(dim=c(m,m)) for(i in 1:m)for(j in 1:m)w[i,j]=log(pi.1)+f.1(u[i],v[j])−log(pi.2)−f.2(u[i],v[j]) # plot contour(u,v,w,level=0) points(x.1,y.1,col="red"); points(x.2,y.2,col="blue")

We show the covariates for each response and the generated border in Fig. 3.3 (Right). If the covariance matrices are equal, we change the lines marked with “#” as follows: df=data.frame(c(x.1,y.1)−mu.1, c(x.2,y.2)−mu.2); inv.1=solve(mat); de.1=det(mat) inv.2=inv.1; de.2=de.1

We show the output in Fig. 3.3 (Left).

3 Classification

-6 -4 -2 0

-6 -4 -2 0

2

2

4

4

6

6

56

1

-6

-4

-2

0

2

4

6

1

1

-6

-4

-2

0

2

4

6

Fig. 3.3 Linear Discrimination (Left) and Quadratic Discrimination (Right): The border is a line if the covariance matrices are equal; otherwise, it is a quadratic (elliptic) curve. In the former case, if the prior probabilities and the covariance matrices are equal, then the border is the vertical bisector of the line connecting the centers

Example 36 (Fisher’s Iris data set) Even when the response takes more than two values, we can choose the response with the maximum posterior probability. Fisher’s Iris data set contains four covariates (the petal length, petal width, sepal length, and sepal width), and the response variable can be three species of irises (Iris setosa, Iris virginica, and Iris versicolor). Each of the three species contains 50 samples (N = 150, p = 4). We construct the classifier via quadratic discrimination and evaluate it using the test data set that is different from the training data. #iris data f=function(w,mu,inv,de)−0.5*(w−mu)%*%inv%*%t(w−mu)−0.5*log(de) df=iris; df[[5]]=c(rep(1,50),rep(2,50),rep(3,50)) n=nrow(df); train=sample(1:n,n/2,replace=FALSE); test=setdiff(1:n,train) mat=as.matrix(df[train,]) mu=list(); covv=list() for(j in 1:3){ x=mat[mat[,5]==j,1:4]; mu[[j]]=c(mean(x[,1]),mean(x[,2]),mean(x[,3]),mean(x[,4])) covv[[j]]=cov(x) } g=function(v,j)f(v,mu[[j]],solve(covv[[j]]),det(covv[[j]])) z=array(dim=n/2) for(i in test){ u=as.matrix(df[i,1:4]); a=g(u,1);b=g(u,2); c=g(u,3) if(atheta)/N.0 # treat health as sick v=sum(pnorm(y,mu.1,var.1)/pnorm(y,mu.0,var.0)>theta)/N.1 # consider sick as healthy U=c(U,u); V=c(V,v) } lines(U,V,col="blue") M=length(theta.seq)−1; AUC=0; for(i in 1:M)AUC=AUC+abs(U[i+1]−U[i])*V[i] text(0.5,0.5,paste("AUC=",AUC),col="red")

60

3 Classification

Fig. 3.4 The ROC curve shows all the performances of the test for acceptable false positives

0.6 0.4

AUC= 0.9304883

0.0

0.2

True Positive

0.8

1.0

ROC Curve

0.0

0.2

0.4

0.6

0.8

1.0

False Positive

Exercises 19–31 19. We assume that there exist β0 ∈ R and β ∈ R p such that for x ∈ R p , the prob1 eβ0 +xβ and , respectively. abilities of Y = 1 and Y = −1 are β +xβ 0 1+e 1 + eβ0 +xβ 1 Show that the probability of Y = y ∈ {−1, 1} can be written as . 1 + e−y(β0 +xβ) 1 20. For p = 1 and β > 0, show that the function f (x) = is mono1 + e−(β0 +xβ) tonically increasing for x ∈ R and convex and concave in x < −β0 /β and x > −β0 /β, respectively. How does the function change as β increases? Execute the following to answer this question. f=function(x)exp(beta.0+beta*x)/(1+exp(beta.0+beta*x)) beta.0=0; beta.seq=c(0,0.2,0.5,1,2,10); m=length(beta.seq) beta=beta.seq[1] plot(f,xlim=c(−10,10),ylim=c(0,1),xlab="x",ylab="y", col=1, main=" Logistic Curve") for(i in 2:m){ beta=beta.seq[i]; par(new=TRUE) plot(f,xlim=c(−10,10),ylim=c(0,1),xlab="", ylab="", axes=FALSE,,col=i) } legend("topleft",legend=beta.seq,col=1:length(beta.seq),lwd=2,cex=.8) par(new=FALSE)

21. We wish to obtain the estimates of β0 ∈ R and β ∈ R^p by maximizing the likelihood ∏_{i=1}^N 1/(1+e^{−y_i(β0+x_iβ)}), or equivalently, by minimizing its negated logarithm

l(β0, β) = Σ_{i=1}^N log(1+v_i) ,  v_i = e^{−y_i(β0+x_iβ)}

from observations (x1, y1), . . . , (xN, yN) ∈ R^p × {−1, 1} (maximum likelihood). Show that l(β0, β) is convex by obtaining the derivative ∇l(β0, β) and the second derivative ∇²l(β0, β). Hint: Let ∇l(β0, β) and ∇²l(β0, β) be the column vector of size p + 1 whose jth element is ∂l/∂β_j and the matrix of size (p + 1) × (p + 1) whose (j, k)th element is ∂²l/∂β_j∂β_k, respectively. Simply show that the matrix is nonnegative definite. To this end, show that ∇²l(β0, β) = XᵀWX. If W is diagonal, then it can be written as W = UᵀU, where the diagonal elements of U are the square roots of those of W, which means ∇²l(β0, β) = (UX)ᵀUX.
22. Solve the following equations via the Newton–Raphson method by constructing an R program.
(a) For f(x) = x² − 1, set x = 2 and repeat the recursion x ← x − f(x)/f′(x) 100 times.
(b) For f(x, y) = x² + y² − 1, g(x, y) = x + y, set (x, y) = (1, 2) and repeat the following recursion 100 times:

(x, y)ᵀ ← (x, y)ᵀ − [ ∂f/∂x  ∂f/∂y ; ∂g/∂x  ∂g/∂y ]⁻¹ (f(x, y), g(x, y))ᵀ

Hint: Define the procedure and repeat it 100 times. f=function(z)z[1]^2+z[2]^2−1 f.x=function(z) 2*z[1]; f.y=function(z)2*z[2]; g=function(z)z[1]+z[2]; g.x=function(z) 1; g.y=function(z) 1; z=c(1,2)

23. We wish to solve ∇l(β0, β) = 0, (β0, β) ∈ R × R^p in Problem 21 via the Newton–Raphson method using the recursion (β0, β) ← (β0, β) − {∇²l(β0, β)}⁻¹∇l(β0, β), where ∇f(v) ∈ R^{p+1} and ∇²f(v) ∈ R^{(p+1)×(p+1)} are the vector whose ith element is ∂f/∂v_i and the square matrix whose (i, j)th element is ∂²f/∂v_i∂v_j, respectively. In the following, for ease of notation, we write (β0, β) ∈ R × R^p as β ∈ R^{p+1}. Show that the update rule can be written as follows: β_new ← (XᵀWX)⁻¹XᵀWz ,

(3.5)


where u ∈ R^N is such that ∇l(β_old) = −Xᵀu, W ∈ R^{N×N} is such that ∇²l(β_old) = XᵀWX, z ∈ R^N is defined by z := Xβ_old + W⁻¹u, and XᵀWX is assumed to be nonsingular. Hint: The update rule can be written as β_new ← β_old + (XᵀWX)⁻¹Xᵀu.
24. We construct a procedure to solve Problem 23. Fill in the blanks (1), (2), (3), and examine whether the procedure works.

## Data Generation ##
N=1000; p=2
X=matrix(rnorm(N*p),ncol=p); X=cbind(rep(1,N),X)
beta=rnorm(p+1); y=array(N); s=as.vector(X%*%beta); prob=1/(1+exp(s))
for(i in 1:N)if(runif(1)>prob[i])y[i]=1 else y[i]=-1
beta
## Maximum Likelihood
beta=Inf; gamma=rnorm(p+1)
while(sum((beta-gamma)^2)>0.001){
  beta=gamma
  s=as.vector(X%*%beta)
  v=exp(-s*y)
  u= ## Blank (1) ##
  w= ## Blank (2) ##
  W=diag(w)
  z= ## Blank (3) ##
  gamma=as.vector(solve(t(X)%*%W%*%X)%*%t(X)%*%W%*%z)
  print(gamma)
}

25. If the condition y_i(β0 + x_iβ) ≥ 0, (x_i, y_i) ∈ R^p × R, i = 1, . . . , N, is met, we cannot obtain the parameters of logistic regression via maximum likelihood. Why?
26. For p = 1, we wish to estimate the parameters of logistic regression from N/2 training data and to predict the responses of the N/2 test data that are not used as the training data. Fill in the blanks and execute the program.

n=100; x=c(rnorm(n)+1, rnorm(n)-1); y=c(rep(1,n),rep(-1,n))
train=sample(1:(2*n),n,replace=FALSE); df=data.frame(x,y)
x=as.matrix(df[train,1]); y=as.vector(df[train,2])
p=1; X=cbind(1,x); beta=0; gamma=rnorm(p+1)
while(sum((beta-gamma)^2)>0.001){
  beta=gamma
  s=as.vector(X%*%beta); v=exp(-s*y); u=y*v/(1+v); w=v/(1+v)^2; W=diag(w); z=s+u/w
  gamma=as.vector(solve(t(X)%*%W%*%X)%*%t(X)%*%W%*%z)
  print(gamma)
}
x= ## Blank (1) ##
y=as.vector(df[-train,2])
z= ## Blank (2) ##
table(y,z)

Hint: For prediction, see whether β0 + xβ1 is positive or negative. 27. In linear discrimination, let πk be the prior probability of Y = k for k = 1, . . . , m (m ≥ 2), and let f k (x) be the probability density function of the p covariates


x ∈ R^p given response Y = k with mean μ_k ∈ R^p and covariance matrix Σ_k ∈ R^{p×p}. We consider the set S_{k,l} of x ∈ R^p such that

π_k f_k(x) / Σ_{j=1}^m π_j f_j(x) = π_l f_l(x) / Σ_{j=1}^m π_j f_j(x)

for k, l = 1, . . . , m, k ≠ l.
(a) Show that when π_k = π_l, S_{k,l} is the set of x ∈ R^p on the quadratic surface

−(x − μ_k)ᵀΣ_k⁻¹(x − μ_k) + (x − μ_l)ᵀΣ_l⁻¹(x − μ_l) = log(det Σ_k / det Σ_l) .

(b) Show that when k = l (= ), Sk,l is the set of x ∈ R p on the surface a T x + b = 0 with a ∈ R p and b ∈ R) and express a, b using μk , μl , , πk , πl . (c) When πk = πl and k = l , show that the surface of (b) is x = (μk + μl )/2. 28. In the following, we wish to estimate distributions from two classes and draw a boundary line that determines the maximum posterior probability. If the covariance matrices are assumed to be equal, how do the boundaries change? Modify the program. ## Data Generation mu.1=c(2,2); sigma.1=2; sigma.2=2; rho.1=0 mu.2=c(−3,−3); sigma.3=1; sigma.4=1; rho.2=−0.8 n=100 u=rnorm(n); v=rnorm(n); x.1=sigma.1*u+mu.1[1]; y.1=(rho.1*u+sqrt(1−rho.1^2)*v)*sigma.2+mu.1[2] u=rnorm(n); v=rnorm(n); x.2=sigma.3*u+mu.2[1]; y.2=(rho.2*u+sqrt(1−rho.2^2)*v)*sigma.4+mu.2[2] ## Estimate the distribution and draw the border f=function(x,mu,inv,de)drop(−0.5*t(x−mu)%*%inv%*%(x−mu)−0.5*log(de)) mu.1=mean(c(x.1,y.1)); mu.2=mean(c(x.2,y.2)); df=data.frame(x.1,y.1); mat=cov(df); inv.1=solve(mat); de.1=det(mat) # df=data.frame(x.2,y.2); mat=cov(df); inv.2=solve(mat); de.2=det(mat) # f.1=function(u,v)f(c(u,v),mu.1,inv.1,de.1); f.2=function(u,v)f(c(u,v),mu.2,inv.2,de.2) pi.1=0.5; pi.2=0.5 u = v = seq(−6, 6, length=50); m=length(u); w=array(dim=c(m,m)) for(i in 1:m)for(j in 1:m)w[i,j]=log(pi.1)+f.1(u[i],v[j])−log(pi.2)−f.2(u[i],v [j]) ## plot contour(u,v,w,level=0) points(x.1,y.1,col="red"); points(x.2,y.2,col="blue")

Hint: Modify the lines marked with #. 29. Even in the case of three or more values, we can select the class that maximizes the posterior probability. From four covariates (length of sepals, width of sepals,


length of petals, width of petals) of Fisher's iris data, we wish to identify the three types of irises (Setosa, Versicolor, and Virginica) via quadratic discrimination. Specifically, we learn rules from training data and evaluate them with test data. Assuming N = 150 and p = 4, each of the three irises contains 50 samples, and the prior probability is expected to be equal to 1/3. If we find that the prior probabilities of the Setosa, Versicolor, and Virginica irises are 0.5, 0.25, 0.25, respectively, how should the program be changed to determine the maximum posterior probability?

f=function(w,mu,inv,de)-0.5*(w-mu)%*%inv%*%t(w-mu)-0.5*log(de)
df=iris; df[[5]]=c(rep(1,50),rep(2,50),rep(3,50))
n=nrow(df); train=sample(1:n,n/2,replace=FALSE); test=setdiff(1:n,train)
mat=as.matrix(df[train,])
mu=list(); covv=list()
for(j in 1:3){
  x=mat[mat[,5]==j,1:4]
  mu[[j]]=c(mean(x[,1]),mean(x[,2]),mean(x[,3]),mean(x[,4]))
  covv[[j]]=cov(x)
}
g=function(v,j)f(v,mu[[j]],solve(covv[[j]]),det(covv[[j]]))
z=array(dim=n/2)
for(i in test){
  u=as.matrix(df[i,1:4]); a=g(u,1); b=g(u,2); c=g(u,3)
  if(a<b){if(b<c)z[i]=3 else z[i]=2}else{if(a<c)z[i]=3 else z[i]=1}  # the class with the largest value among a, b, c
}

> func.1=function(data,index){
+   X=data$X[index]; Y=data$Y[index]
+   return((var(Y)-var(X))/(var(X)+var(Y)-2*cov(X,Y)))
+ }
> library(ISLR)
> bt(Portfolio,func.1,1000)
$original
[1] 0.1516641
$bias
[1] 0.01398203
$stderr
[1] 0.1823519
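The function bt used in this output is the bootstrap routine of this chapter; a minimal sketch consistent with how it is called here (returning the original estimate, the bias, and the standard error over r resamples) is the following.

## Sketch of bt(df, func, r): resample the rows of df with replacement r times and apply func
bt=function(df, func, r){
  m=nrow(df); org=func(df, 1:m)            # estimate from the original data
  u=array(dim=r)
  for(j in 1:r){
    index=sample(1:m, m, replace=TRUE)     # bootstrap sample of the row indices
    u[j]=func(df, index)
  }
  return(list(original=org, bias=mean(u)-org, stderr=sd(u)))
}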

If a method for evaluating the estimation error is available, we do not have to assess it by bootstrapping, but for our purposes, let us compare the two to see how correctly bootstrapping performs.
Example 44 We estimated the intercept and slope in the file crime.txt many times by bootstrap, evaluated the dispersion of the estimated values, and compared them with the theoretical values calculated by the lm function. In the bootstrap estimation, func.2 estimates the intercept and the two slopes when regressing the first variable on the third and fourth variables (j = 1, 2, 3) and evaluates their standard deviations.

> df=read.table("crime.txt")
> for(j in 1:3){
+   func.2=function(data,index)coef(lm(V1~V3+V4,data=data,subset=index))[j]
+   print(bt(df,func.2,1000))
+ }


$original
(Intercept)
    621.426
$bias
(Intercept)
   34.21078
$stderr
[1] 221.9134

$original
      V3
11.85833
$bias
        V3
-0.5269679
$stderr
[1] 3.585458

$original
       V4
-5.973412
$bias
       V4
-0.204651
$stderr
[1] 3.326094

> summary(lm(V1~V3+V4,data=df))

Call:
lm(formula = V1 ~ V3 + V4, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-356.90 -162.92  -60.86  100.69  784.30

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  621.426    222.685   2.791  0.00758 **
V3            11.858      2.568   4.618 3.02e-05 ***
V4            -5.973      3.561  -1.677  0.10013
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 246.6 on 47 degrees of freedom
Multiple R-squared:  0.3247, Adjusted R-squared:  0.296
F-statistic:  11.3 on 2 and 47 DF,  p-value: 9.838e-05

The function func.2 finds the intercept and the slopes of the third and fourth variables for j = 1, 2, 3, respectively. In this case, the standard deviations of the intercept and of the slopes of the two variables almost match the theoretical values obtained


as the output of the lm function. Even if it is a linear regression problem, if the noise does not follow a Gaussian distribution or is not independent, bootstrapping is still useful.
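As a small illustration of this point, the following sketch (with artificial data that does not appear in the text) compares the standard error reported by lm, which assumes Gaussian noise, with the bootstrap standard error under heavy-tailed noise; it uses the bt function of this chapter.

## Sketch: lm standard error vs. bootstrap standard error under t-distributed noise
set.seed(1)
n=100; x=rnorm(n); y=1+2*x+rt(n, df=3)        # heavy-tailed noise
df=data.frame(x=x, y=y)
func.slope=function(data, index)coef(lm(y~x, data=data, subset=index))[2]
bt(df, func.slope, 1000)$stderr                # bootstrap standard error of the slope
summary(lm(y~x, data=df))$coefficients[2,2]    # standard error assuming Gaussian noise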

Appendix: Proof of Propositions Proposition 17 (Sherman–Morrison–Woodbury) For m, n ≥ 1 and a matrix A ∈ Rn×n , U ∈ Rn×m , C ∈ Rm×m , V ∈ Rm×n , we have (A + U C V )−1 = A−1 − A−1 U (C −1 + V A−1 U )−1 V A−1

(4.5)

Proof The derivation is due to the following:
(A + UCV)(A⁻¹ − A⁻¹U(C⁻¹ + VA⁻¹U)⁻¹VA⁻¹)
= I + UCVA⁻¹ − U(C⁻¹ + VA⁻¹U)⁻¹VA⁻¹ − UCVA⁻¹U(C⁻¹ + VA⁻¹U)⁻¹VA⁻¹
= I + UCVA⁻¹ − UC · C⁻¹ · (C⁻¹ + VA⁻¹U)⁻¹VA⁻¹ − UC · VA⁻¹U · (C⁻¹ + VA⁻¹U)⁻¹VA⁻¹
= I + UCVA⁻¹ − UC(C⁻¹ + VA⁻¹U)(C⁻¹ + VA⁻¹U)⁻¹VA⁻¹ = I
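The identity (4.5) can also be confirmed numerically for randomly generated matrices; the following is a sketch with arbitrary dimensions.

## Numerical check of the Sherman-Morrison-Woodbury formula (4.5)
set.seed(1)
n=5; m=2
A=crossprod(matrix(rnorm(n*n), n, n))+diag(n)   # a nonsingular A
U=matrix(rnorm(n*m), n, m); C=diag(m); V=matrix(rnorm(m*n), m, n)
lhs=solve(A+U%*%C%*%V)
rhs=solve(A)-solve(A)%*%U%*%solve(solve(C)+V%*%solve(A)%*%U)%*%V%*%solve(A)
max(abs(lhs-rhs))                                # close to zero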

Proposition 18 Suppose that XᵀX is a nonsingular matrix. For each S ⊂ {1, . . . , N}, if X₋SᵀX₋S is a nonsingular matrix, so is I − H_S.
Proof For m, n ≥ 1, U ∈ R^{m×n}, and V ∈ R^{n×m}, we have

[ I  0 ] [ I+UV  U ] [  I  0 ]   [ I+UV     U   ] [  I  0 ]   [ I    U   ]
[ V  I ] [  0    I ] [ −V  I ] = [ V+VUV  VU+I  ] [ −V  I ] = [ 0  I+VU  ] .

Combined with Proposition 2, we have det(I + U V ) = det(I + V U ) .

(4.6)

Therefore, from Proposition 2, we have

det(X₋SᵀX₋S) = det(XᵀX − X_SᵀX_S)
            = det(XᵀX) det(I − (XᵀX)⁻¹X_SᵀX_S)
            = det(XᵀX) det(I − X_S(XᵀX)⁻¹X_Sᵀ) ,

where the last transformation is due to (4.6). Hence, from Proposition 1, if X₋SᵀX₋S and XᵀX are nonsingular, so is I − H_S.
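The identity (4.6) used above can be checked numerically as well; the following sketch uses random rectangular matrices of arbitrary sizes.

## Numerical check of det(I+UV)=det(I+VU) in (4.6)
set.seed(1)
m=3; n=5
U=matrix(rnorm(m*n), m, n); V=matrix(rnorm(n*m), n, m)
det(diag(m)+U%*%V)   # m x m determinant
det(diag(n)+V%*%U)   # n x n determinant; the two values coincide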


Exercises 32–39 32. Let m, n ≥ 1. Show that for matrix A ∈ Rn×n , U ∈ Rn×m , C ∈ Rm×m , V ∈ Rm×n , (4.7) (A + U C V )−1 = A−1 − A−1 U (C −1 + V A−1 U )−1 V A−1 (Sherman-Morrison-Woodbury). Hint: Continue the following: (A + U C V )(A−1 − A−1 U (C −1 + V A−1 U )−1 V A−1 ) = I + U C V A−1 − U (C −1 + V A−1 U )−1 V A−1 − U C V A−1 U (C −1 + V A−1 U )−1 V A−1 = I + U C V A−1 − U C · (C −1 ) · (C −1 + V A−1 U )−1 V A−1 −U C · V A−1 U · (C −1 + V A−1 U )−1 V A−1 .

33. Let S be a subset of {1, . . . , N } and write the matrices X ∈ R(N −r )×( p+1) that consist of the rows in S and the rows not in S as X S ∈ Rr ×( p+1) and X −S ∈ R(N −r )×( p+1) , respectively, where r is the number of elements in S. Similarly, we divide y ∈ R N into yS and y−S . (a) Show T (X −S X −S )−1 = (X T X )−1 + (X T X )−1 X ST (I − HS )−1 X S (X T X )−1 ,

where HS := X S (X T X )−1 X ST is the matrix that consists of the rows and columns in S of H = X (X T X )−1 X T . Hint: Apply n = p + 1, m = r , A = X T X , C = I , U = X ST , V = −X S to (4.3). (b) For e S := yS − yˆ S with yˆ S = X S βˆS , show the equation βˆ−S = βˆ − (X T X )−1 X ST (I − HS )−1 e S . T T X −S and X T y = X ST yS + X −S y−S , Hint: From X T X = X ST X S + X −S

βˆ−S = {(X T X )−1 + (X T X )−1 X ST (I − HS )−1 X S (X T X )−1 }(X T y − X ST yS ) = βˆ − (X T X )−1 X ST (I − HS )−1 (X S βˆ − HS yS ) = βˆ − (X T X )−1 X ST (I − HS )−1 {(I − HS )yS − X S βˆ + HS yS } 34. By showing yS − X S βˆ−S = (I − HS )−1 e S , prove that the squared sum of the  groups in CV is (I − HS )−1 e S 2 , where a2 denotes the squared sum of S

the elements in a ∈ R N . 35. Fill in the blanks below and execute the procedure in Problem 34. Observe that the squared sum obtained by the formula and by the general cross-validation method coincide. n=1000; p=10; x=matrix(rnorm(n*p),nrow=n,ncol=p); x=cbind(rep(1,n),x); beta=rnorm(p+1); y=x%*%beta+rnorm(n)*0.2

## Using the general CV
cv.linear=function(X,y,k){
  n=length(y); m=n/k; S=0
  for(j in 1:k){
    test=((j-1)*m+1):(j*m)
    beta= ## Blank(1) ##
    e=y[test]-X[test,]%*%beta; S=S+drop(t(e)%*%e)
  }
  return(S/n)
}
## Using the formula
cv.fast=function(X,y,k){
  n=length(y); m=n/k; H=X%*%solve(t(X)%*%X)%*%t(X)
  I=diag(rep(1,n)); e=(I-H)%*%y; I=diag(rep(1,m))
  sum=0
  for(j in 1:k){
    test=((j-1)*m+1):(j*m); sum=sum+ ## Blank(2) ##
  }
  return(sum/n)
}

Moreover, we wish to compare the speeds of the functions cv.linear and cv.fast. Fill in the blanks below to complete the procedure and draw the graph. plot(0,0,xlab="k",ylab="Execution Time", xlim=c(2,n),ylim=c(0,0.5),type ="n") U=NULL; V=NULL for(k in 10:n)if(n%%k==0){ t=proc.time()[3]; cv.fast(x,y,k); U=c(U,k); V=c(V, (proc.time()[3]−t)) } ## Blank Several Lines ## legend("topleft",legend=c("cv.linear","cv.fast"),col=c("red","blue "), lty=1)

36. How much the prediction error differs with k in the k-fold CV depends on the data. Fill in the blanks and draw the graph that shows how the CV error changes with k. You may use either the function cv.linear or cv.fast. ## Data Generation ## n=100; p=5 plot(0,0,xlab="k",ylab="CV ", xlim=c(2,n),ylim=c(0.3,1.5),type="n") for(j in 2:11){ X=matrix(rnorm(n*p),ncol=p); X=cbind(1,X); beta=rnorm(p+1); eps=rnorm(n); y=X%*%beta+eps U=NULL; V=NULL; for(k in 2:n)if(n%%k==0){ ## Blank ## }; lines(U,V, col=j) }

37. We wish to know how the error rate changes with K in the K -nearest neighbor method when ten-fold CV is applied for the Fisher’s iris data set. Fill in the blanks, execute the procedure, and draw the graph.


df=iris; df=df[sample(1:150,150,replace=FALSE),] n=nrow(df); U=NULL; V=NULL for(k in 1:10){ top.seq=1+seq(0,135,10); S=0 for(top in top.seq){ index= ## Blank(1) ## knn.ans=knn(df[−index,1:4],df[−index,5],df[index,1:4],k) ans= ## Blank(2) ## S=S+sum(knn.ans!=ans) } S=S/n; U=c(U,k);V=c(V,S) } plot(0,0,type="n", xlab="k", ylab="Error Rate", xlim=c(1,10),ylim=c (0,0.1), main="Evaluation of Error Rate via CV") lines(U,V,col="red")

38. We wish to estimate the standard deviation w.r.t. X, Y, based on N data, of the quantity

(v_y − v_x)/(v_x + v_y − 2v_xy) ,

where
v_x := (1/(N−1)) [ Σ_{i=1}^N X_i² − (1/N)(Σ_{i=1}^N X_i)² ]
v_y := (1/(N−1)) [ Σ_{i=1}^N Y_i² − (1/N)(Σ_{i=1}^N Y_i)² ]
v_xy := (1/(N−1)) [ Σ_{i=1}^N X_iY_i − (1/N)(Σ_{i=1}^N X_i)(Σ_{i=1}^N Y_i) ] .

To this end, allowing duplication, we randomly choose N data in the data frame r times and estimate the standard deviation (Bootstrap). Fill in the blanks (1), (2) to complete the procedure and observe that it estimates the standard deviation.

func.1=function(data,index){
  X=data$X[index]; Y=data$Y[index]
  return((var(Y)-var(X))/(var(X)+var(Y)-2*cov(X,Y)))
}
bt=function(df,func,r){
  m=nrow(df); org= ## Blank(1)
  u=array(dim=r)
  for(j in 1:r){
    index=sample( ## Blank(2)
    u[j]=func.1(df,index)
  }
  return(list(original=org, bias=mean(u)-org, stderr=sd(u)))
}
library(ISLR); bt(Portfolio,func.1,1000)   ## Execution Example

39. For linear regression, if we assume that the noise follows a Gaussian distribution, we can compute the theoretical value of the standard deviation. We wish to compare the value with the one obtained by bootstrap. Fill in the blanks and execute the procedure. What are the three kinds of data that appear first?

df=read.table("crime.txt")
for(j in 1:3){
  func.2=function(data,index)coef(lm(V1~V3+V4,data= ## Blank(1) ##, subset= ## Blank(2) ##))[j]
  print(bt(df,func.2,1000))
}
summary(lm(V1~V3+V4,data=df))

Chapter 5

Information Criteria

Abstract Until now, from the observed data, we have considered the following cases:
• Build a statistical model and estimate the parameters contained in it
• Estimate the statistical model itself
In this chapter, we consider the latter for linear regression. The act of finding rules from observational data is not limited to data science and statistics; many scientific discoveries are born through such processes. For example, the theory of elliptical orbits, the law of constant areal velocity, and the rule of harmony in the theory of planetary motion published by Kepler in 1596 marked the transition from the then-dominant theory to the modern theory of planetary motion. While the earlier explanations rested on countless arguments based on philosophy and thought, Kepler's laws solved most of the questions at the time with only three laws. In other words, as long as it is a law of science, it must not only be able to explain phenomena (fitness) but must also be simple (simplicity). In this chapter, we will learn how to derive and apply the AIC and BIC, which evaluate statistical models of data and balance fitness and simplicity.

5.1 Information Criteria

An information criterion is generally defined as an index for evaluating the validity of a statistical model from observed data. Akaike's information criterion (AIC) and the Bayesian information criterion (BIC) are well known. An information criterion typically evaluates both how well the statistical model explains the data (fitness) and how simple the statistical model is (simplicity); the AIC and BIC are standard and differ only in how the two are balanced. The same can be done with the cross-validation approach discussed in Chap. 4, which is superior to the information criteria in versatility but does not explicitly control the balance between fitness and simplicity.


One of the most important problems in linear regression is to select some of the p covariates based on N observations (x1, y1), . . . , (xN, yN) ∈ R^p × R. The reason why there should not be too many covariates is that they overfit the data and try to explain the noise fluctuation by other covariates. Thus, we need to recognize the exact subset. However, it is not easy to choose S ⊆ {1, . . . , p} from the 2^p subsets {}, {1}, . . . , {p}, {1, 2}, . . . , {1, . . . , p} when p is large because 2^p increases exponentially with p. We express the fitness and simplicity by the RSS value RSS(S) based on the subset S and the cardinality¹ k(S) := |S| of S. Then, we have that

S ⊆ S′ ⟹ RSS(S) ≥ RSS(S′) and k(S) ≤ k(S′) ,

which means that the larger k = k(S) is, the smaller RSS_k := min_{k(S)=k} RSS(S) is. The AIC and BIC are defined by

AIC := N log σ̂_k² + 2k      (5.1)
BIC := N log σ̂_k² + k log N ,      (5.2)

where σ̂_k² := RSS_k/N, and the coefficient of determination

1 − RSS_k/TSS

increases monotonically with k and reaches its maximum value at k = p. However, the AIC and BIC values decrease before reaching a minimum at some 0 ≤ k ≤ p and increase beyond that point, where the k at which the values are minimized is generally different between the AIC and BIC. The adjusted coefficient of determination maximizes

1 − {RSS_k/(N − k − 1)} / {TSS/(N − 1)}

at some 0 ≤ k ≤ p, which is often much larger than the k chosen by the AIC and BIC.

¹ By |S|, we mean the cardinality of the set S.
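Given RSS_k, the three quantities above can be computed in a few lines; the following helper is a minimal sketch, assuming that RSS_k, TSS, the sample size N, and the number of covariates k are already available.

## Sketch: AIC (5.1), BIC (5.2), and the adjusted coefficient of determination from RSS.k
criteria=function(RSS.k, TSS, N, k){
  sigma2=RSS.k/N
  list(AIC=N*log(sigma2)+2*k,
       BIC=N*log(sigma2)+k*log(N),
       adj.R2=1-(RSS.k/(N-k-1))/(TSS/(N-1)))
}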


Example 45 The following data fields are from the Boston data set in the R MASS package. We assume that the first 13 variables are the covariates and the last variable is the response.

Column #  Variable  Meaning of the Variable
 1  CRIM     Per capita crime rate by town
 2  ZN       Proportion of residential land zoned for lots over 25,000 sq. ft
 3  INDUS    Proportion of nonretail business acres per town
 4  CHAS     Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
 5  NOX      Nitric oxide concentration (parts per 10 million)
 6  RM       Average number of rooms per dwelling
 7  AGE      Proportion of owner-occupied units built prior to 1940
 8  DIS      Weighted distances to five Boston employment centers
 9  RAD      Index of accessibility to radial highways
10  TAX      Full-value property tax rate per $10,000
11  PTRATIO  Student-teacher ratio by town
12  B        1000(Bk − 0.63)², where Bk is the proportion of black people by town
13  LSTAT    % lower status of the population
14  MEDV     Median value of owner-occupied homes in $1000s

We construct the following procedure to find the set of covariates that minimizes the AIC. In particular, we execute combn(1:p,k) to obtain a matrix of size k × C(p,k), one subset of {1, · · · , p} of size k per column, and find the minimum value of σ̂_k² over the S such that |S| = k.

RSS.min=function(X,y,T){
  m=ncol(T); S.min=Inf
  for(j in 1:m){
    q=T[,j]
    S=sum((lm(y~X[,q])$fitted.values-y)^2)/n
    if(S<S.min){S.min=S; set.min=q}
  }
  return(list(value=S.min, set=set.min))   # the names value/set match how RSS.min is used below
}

> set.min

[1]  1  3  4  6  8  9 10
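For reference, the enumeration over k that yields this output can be sketched as follows; this is an illustrative reconstruction using RSS.min, assuming X, y, n, and p have been prepared as in Example 46 below, and the line marked with ## is the one referred to in the next paragraph.

## Sketch of the enumeration over k=1,...,p minimizing the AIC via RSS.min
IC.min=Inf
for(k in 1:p){
  T=combn(1:p,k); res=RSS.min(X,y,T)
  S.min=res$value
  IC=n*log(S.min)+2*k            ##
  if(IC<IC.min){IC.min=IC; set.min=res$set}
}
set.min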

If we replace the line n*log(S.min)+2*k marked by ## with n*log(S.min)+k*log(n), then the quantity becomes the BIC. To maximize the adjusted coefficient of determination, we may update it as follows:

y.bar=mean(y); TSS=sum((y-y.bar)^2); D.max=-Inf
for(k in 1:p){
  T=combn(1:p,k); res=RSS.min(X,y,T)
  D=1-(res$value*n/(n-k-1))/(TSS/(n-1))   # adjusted coefficient of determination (res$value = RSS/n)
  if(D>D.max){D.max=D; set.max=res$set}
}
D.max
set.max

For each k = 0, 1, . . . , p, we find the S that minimizes RSS(S) such that |S| = k for each k = 0, 1, . . . , p, and compute the Adjusted coefficient of determination from RSSk . Then we obtain the maximum value of the adjusted coefficients of determination over k = 0, 1, . . . , p. On the other hand, the BIC (5.2) is used as often as the AIC (5.1). The difference is only in the balance between Fitness log σˆ k2 and Simplicity k. We see that the schools with 200 and 100 points for English and math, respectively, on the entrance examination, choose different applicants from the schools with 100 and 200 points for English and math, respectively. Similarly, the statistical models selected by the AIC and BIC are different. Since the BIC has a more significant penalty for the Simplicity k, the selected k is smaller, and the chosen model is simpler than that chosen by the AIC. More importantly, the BIC converges to the correct model when the number of samples N is large (consistency), but the AIC does not. The AIC was developed to minimize the prediction error (Sect. 5.4). Even if the statistical model selected is incorrect, the squared error in the test data may be small for the finite number of samples N , which is an advantage of the AIC. It is essential to use the specific information criteria according to their intended purposes, and it is meaningless to discuss which one is superior to the other (Fig. 5.1). Example 46 library(MASS) df=Boston; X=as.matrix(df[,c(1,3,5,6,7,8,10,11,12,13)]); y=df[[14]]; n=nrow(X); p=ncol(X) IC=function(k){ T=combn(1:p,k); res=RSS.min(X,y,T) AIC= n*log(res$value/n)+2*k; BIC= n*log(res$value/n)+k*log(n) return(list(AIC=AIC,BIC=BIC)) } AIC.seq=NULL; BIC.seq=NULL; for(k in 1:p){AIC.seq=c(AIC.seq, IC(k)$AIC); BIC.seq=c(BIC.seq, IC(k)$BIC)} plot(1:p, ylim=c(min(AIC.seq),max(BIC.seq)), type="n",xlab="# of variables", ylab="IC values")


Fig. 5.1 Changes of AIC/BIC with the number of covariates: the BIC is larger than the AIC, but the BIC chooses a simpler model with fewer variables than the AIC

lines(AIC.seq,col="red"); lines(BIC.seq,col="blue")
legend("topright",legend=c("AIC","BIC"), col=c("red","blue"), lwd=1, cex=.8)

5.2 Efficient Estimation and the Fisher Information Matrix

Next, as preparation for deriving the AIC, we show that the estimate of linear regression obtained by the least squares method is a so-called efficient estimator, i.e., one that minimizes the variance among the unbiased estimators. For this purpose, we define the Fisher information matrix and derive the Cramér-Rao inequality. Suppose that the observations x1, . . . , xN ∈ R^{p+1} (row vectors) and y1, . . . , yN ∈ R have been generated as the realizations y_i = x_iβ + β0 + e_i, i = 1, . . . , N, with random variables e_1, . . . , e_N ∼ N(0, σ²) and unknown constants β0 ∈ R, β ∈ R^p, which we write together as β ∈ R^{p+1}. In other words, the probability density function can be written as follows:

f(y|x, β) := (1/√(2πσ²)) exp{−(y − xβ)²/(2σ²)}


X = [x_1; . . . ; x_N] ∈ R^{N×(p+1)} ,  y = (y_1, . . . , y_N)ᵀ ∈ R^N ,  β = (β_0, β_1, . . . , β_p)ᵀ ∈ R^{p+1}

In the least squares method, we estimated β by β̂ = (XᵀX)⁻¹Xᵀy if XᵀX is nonsingular (Proposition 11). We claim that β̂ coincides with the β ∈ R^{p+1} that maximizes the likelihood

L := ∏_{i=1}^N f(y_i|x_i, β) .

In fact, the log-likelihood is written as

l := log L = −(N/2) log(2πσ²) − (1/(2σ²)) ‖y − Xβ‖² .

If σ² > 0 is fixed, maximizing this value is equivalent to minimizing ‖y − Xβ‖². Moreover, if we partially differentiate l w.r.t. σ², we have that

∂l/∂σ² = −N/(2σ²) + ‖y − Xβ‖²/(2(σ²)²) = 0 .

Thus, using β̂ = (XᵀX)⁻¹Xᵀy, we find that

σ̂² := (1/N) ‖y − Xβ̂‖² = RSS/N

is the maximum likelihood estimate of σ². In Chap. 2, we derived β̂ ∼ N(β, σ²(XᵀX)⁻¹), which means that β̂ is an unbiased estimator with covariance matrix σ²(XᵀX)⁻¹. In general, an estimator whose variance is minimized among the unbiased estimators is called an efficient estimator. In the following, we show that the estimate β̂ is an efficient estimator.
Let ∇l be the vector consisting of ∂l/∂β_j, j = 0, 1, · · · , p. We refer to the covariance matrix J of ∇l divided by N as the Fisher information matrix. For f^N(y|x, β) := ∏_{i=1}^N f(y_i|x_i, β), we have

∇l = ∇f^N(y|x, β) / f^N(y|x, β) .


Suppose that the order between the derivative w.r.t. β and the integral w.r.t. y can be switched.² If we partially differentiate both sides of ∫ f^N(y|x, β)dy = 1 w.r.t. β, we have that ∫ ∇f^N(y|x, β)dy = 0. On the other hand, we have that

E∇l = ∫ {∇f^N(y|x, β)/f^N(y|x, β)} f^N(y|x, β)dy = ∫ ∇f^N(y|x, β)dy = 0      (5.3)

and

0 = ∇ ⊗ [E∇l] = ∇ ⊗ ∫ (∇l) f^N(y|x, β)dy = ∫ (∇²l) f^N(y|x, β)dy + ∫ (∇l){∇f^N(y|x, β)}ᵀdy = E[∇²l] + E[(∇l)²] .      (5.4)

In particular, (5.4) implies that

J = (1/N) E[(∇l)²] = −(1/N) E[∇²l] .      (5.5)

Example 47 For linear regression, we examine the equation in (5.5):

∇l = (1/σ²) Σ_{i=1}^N x_iᵀ(y_i − x_iβ) ,   ∇²l = −(1/σ²) Σ_{i=1}^N x_iᵀx_i = −(1/σ²) XᵀX ,
E[∇l] = (1/σ²) Σ_{i=1}^N x_iᵀ E(y_i − x_iβ) = 0 ,
E[(∇l)²] = (1/(σ²)²) E[{Σ_{i=1}^N x_iᵀ(y_i − x_iβ)}{Σ_{j=1}^N x_jᵀ(y_j − x_jβ)}ᵀ]
         = (1/(σ²)²) Σ_{i=1}^N x_iᵀ E[(y_i − x_iβ)²] x_i = (1/(σ²)²) Σ_{i=1}^N x_iᵀ σ² x_i = (1/σ²) XᵀX ,
V[∇l] = E[(∇l)²] = (1/σ²) XᵀX ,

so that J = (1/N)E[(∇l)²] = −(1/N)E[∇²l] = XᵀX/(Nσ²).

In general, we have the following statement.
Proposition 17 (Cramér-Rao inequality) The covariance matrix V(β̃) ∈ R^{(p+1)×(p+1)} of any unbiased estimate β̃ is bounded from below by the inverse of the Fisher information matrix:

V(β̃) ≥ (NJ)⁻¹ ,

² In many practical situations, including linear regression, no problem occurs.


where an inequality between matrices means that the difference is nonnegative definite. Note that the least squares estimate satisfies the equality part of the inequality.
To this end, if we partially differentiate both sides of

∫ β̃_i f^N(y|x, β)dy = β_i

w.r.t. β_j, we have the following equation:

∫ β̃_i (∂/∂β_j) f^N(y|x, β)dy = 1 if i = j, and 0 if i ≠ j .

If we write this equation in terms of its covariance matrix, we have that E[β̃(∇l)ᵀ] = I, where I is the unit matrix of size (p + 1). Moreover, from E[∇l] = 0 (5.3), we rewrite the above equation as follows:

E[(β̃ − β)(∇l)ᵀ] = I .      (5.6)

Then, the covariance matrix of the vector of size 2(p + 1) that consists of β̃ − β and ∇l is

[ V(β̃)   I  ]
[  I     NJ  ] .

Note that because both V(β̃) and J are covariance matrices, they are nonnegative definite. Finally, we claim that both sides of

[ V(β̃) − (NJ)⁻¹   0  ]   [ I  −(NJ)⁻¹ ] [ V(β̃)   I  ] [    I      0 ]
[       0         NJ  ] = [ 0     I    ] [  I     NJ  ] [ −(NJ)⁻¹   I ]

are nonnegative definite. In fact, for an arbitrary x ∈ R^n, if xᵀAx ≥ 0, then for an arbitrary B ∈ R^{n×m} and y ∈ R^m, we have that yᵀBᵀABy ≥ 0, which means that V(β̃) − (NJ)⁻¹ is nonnegative definite (for x, y ∈ R^{p+1}, the inequality xᵀ{V(β̃) − (NJ)⁻¹}x + yᵀNJy ≥ 0 should hold even if y = 0). This completes the proof of Proposition 17.
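The equality case can be examined by simulation: for least squares, the sample covariance of β̂ over repeated data sets approaches σ²(XᵀX)⁻¹ = (NJ)⁻¹. The following sketch uses artificial data with arbitrary parameter values.

## Sketch: empirical covariance of the least-squares estimate vs. sigma^2 (X^T X)^{-1}
set.seed(1)
N=100; p=2; sigma=1.5
X=cbind(1, matrix(rnorm(N*p), N, p)); beta=c(1, 2, -1)
est=replicate(2000, {
  y=X%*%beta+rnorm(N, sd=sigma)
  drop(solve(t(X)%*%X)%*%t(X)%*%y)
})
max(abs(cov(t(est))-sigma^2*solve(t(X)%*%X)))   # close to zero for many repetitions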

5.3 Kullback-Leibler Divergence

For probability density functions f, g over R, we refer to

D(f‖g) := ∫_{−∞}^{∞} f(x) log{f(x)/g(x)} dx


Fig. 5.2 y = x − 1 is above y = log x except at x = 1

as the Kullback-Leibler (KL) divergence, which is defined if the following condition is met:

∫_S f(x)dx = 0 ⟹ ∫_S g(x)dx = 0

for an arbitrary S ⊆ R. In general, since D(f‖g) and D(g‖f) do not coincide, the KL divergence is not a distance. However, D(f‖g) ≥ 0, with equality if and only if f and g coincide. In fact, from log x ≤ x − 1, x > 0 (Fig. 5.2), we have that

∫_{−∞}^{∞} f(x) log{f(x)/g(x)} dx = −∫_{−∞}^{∞} f(x) log{g(x)/f(x)} dx
  ≥ −∫_{−∞}^{∞} f(x) {g(x)/f(x) − 1} dx = −∫_{−∞}^{∞} (g(x) − f(x)) dx = 1 − 1 = 0 .

(5.7)

for an arbitrary β ∈ R p+1 . For the proof, see the appendix at the end of this chapter. We assume that z 1 , . . . , z N has been generated by f (z 1 |x1 , β), . . . , f (z N |x N , β), where the true parameter is β. Then, the average of (5.7) w.r.t. Z 1 = z 1 , . . . , Z N = z N is N  N 1 log(2πσ 2 e) + 2 X (γ − β) 2 log f (Z i |xi , γ) = (5.8) − EZ 2 2σ i=1

92

5 Information Criteria

since the averages of z − X β and z − X β 2 are 0 and N σ 2 , respectively. Moreover, the value of (5.8) can be written as follows: −

N   i=1

Since

N   i=1



−∞



−∞

f (z|xi , β) log f (z|xi , γ)dz .

f (z|xi , β) log f (z|xi , β)dz is a constant that does not depend on γ,

we only need to choose a parameter γ so that the sum of Kullback-Leibler divergence EZ

N 

log

i=1

N  1 f (z|xi , β)  ∞ f (z|xi , β) = dz = f (z|xi , β) log X (γ − β) 2 f (z|xi , γ) f (z|xi , γ) 2σ 2 −∞ i=1

(5.9) is minimized.

5.4 Derivation of Akaike’s Information Criterion In general, the true parameters β are unknown and should be estimated. In the following, our goal is to choose a γ among the Unbiased estimate so that (5.9) is minimized on average. In general, for random variables U, V ∈ R N , Schwarz’s inequality we have that {E[U T V ]}2 ≤ E[ U 2 ]E[ V 2 ]

() .

In fact, in the quadratic equation w.r.t. t E(tU + V )2 = t 2 E[ U 2 ] + 2t E[U T V ] + E[ V 2 ] = 0 at most one solution exists, so the determinant is not positive. If we let U = X (X T X )−1 ∇l and V = X (β˜ − β), then we have {E[(β˜ − β)T ∇l]}2 ≤ E X (X T X )−1 ∇l 2 E X (β˜ − β) 2 .

(5.10)

In the following, we use the fact that for matrices A = (ai, j ) and B = (bi, j ), if the products AB and B A are defined, then both traces are i j ai, j b j,i and coincide. Now, the traces of the left-hand and right-hand sides of (5.6) are trace{E[(β˜ − β)(∇l)T ]} = trace{E[(∇l)T (β˜ − β)]} = E[(β˜ − β)T (∇l)] and p + 1, which means that

5.4 Derivation of Akaike’s Information Criterion

93

E[(β˜ − β)T (∇l)] = p + 1 .

(5.11)

Moreover, we have that    E X (X T X )−1 ∇l 2 = E trace (∇l)T (X T X )−1 X T X (X T X )−1 ∇l = trace{(X T X )−1 E(∇l)(∇l)T } = trace{(X T X )−1 σ −2 X T X } = trace{σ −2 I } = ( p + 1)/σ 2 .

(5.12)

Thus, from (5.10), (5.11), and (5.12), we have that E{ X (β˜ − β) 2 } ≥ ( p + 1)σ 2 On the other hand, if we apply the least squares method: βˆ = (X T X )−1 X T y, we have that E X (βˆ − β) 2 = E[trace(βˆ − β)T X T X (βˆ − β)] ˆ T X ) = trace(σ 2 I ) = ( p + 1)σ 2 = trace(V [β]X and the equality holds. The goal of AIC is to minimize the quantity 1 N log 2πσ 2 + ( p + 1) 2 2 obtained by replacing the second term of (5.8) with its average. In particular, for the problem of Variable selection, the number of the covariates is not p, but any 0 ≤ k ≤ p. Hence, we choose the k that minimizes N log σk2 + k .

(5.13)

Note that the value of σk2 := min σ 2 (S) is unknown. For a subset S ⊆ {1, . . . , p} k(S)=k

of covariates, some might replace σ 2 (S) with σˆ 2 (S). However, the value of log σˆ 2 (S) is smaller on average than log σ 2 (S). In fact, we have the following proposition. Proposition 19 Let k(S) be the cardinality of S. Then we have that3 E[log σˆ 2 (S)] = log σ 2 (S) −

k(S) + 2 +O N



For the proof, see the appendix at the end of this chapter. Since, up to O(N −2 ), we have   k+2 = log σk2 , E log σˆ k2 + N 3 By

O( f (N )), we denote a function such that g(N )/ f (N ) is bounded.

1 N2

 .

94

5 Information Criteria

the AIC replaces log σk2 in (5.13) with log σˆ k2 +

k and chooses the k that minimizes N

N log σˆ k2 + 2k .

(5.14)

Appendix: Proof of Propositions Proposition 18 For covariates x1 , . . . , x N , if the responses are z 1 , . . . , z N , the likeN  log f (z i |xi , γ) of γ ∈ R p+1 is lihood − i=1

N 1 1 1 log 2πσ 2 + 2 z − X β 2 − 2 (γ − β)T X T (z − X β) + 2 (γ − β)T X T X (γ − β) 2 2σ σ 2σ

(5.15)

for an arbitrary β ∈ R p+1 . Proof In fact, for u ∈ R and x ∈ R p+1 , we have that 1 1 log f (u|x, γ) = − log 2πσ 2 − 2 (u − xγ)2 2 2σ (u − xγ)2 = {(u − xβ) − x(γ − β)}2 = (u − xβ)2 − 2(γ − β)T x T (u − xβ) + (γ − β)T x T x(γ − β) 1 1 log f (u|x, γ) = − log 2πσ 2 − 2 (u − xβ)2 2 2σ 1 1 T T + 2 (γ − β) x (u − xβ) − 2 (γ − β)T x T x(γ − β) σ 2σ and, if we sum over (x, u) = (x1 , z 1 ), . . . , (xn , z n ), we can write −

N  i=1

log f (z i |xi , γ) =

N 1 log 2πσ 2 + 2 z − X β 2 2 2σ −

1 1 (γ − β)T X T (z − X β) + 2 (γ − β)T X T X (γ − β) , σ2 2σ

where we have used z = [z 1 , . . . , z N ]T and z − X β 2 =

N  (z i − xi β)2 , X T X = i=1

N  i=1

xiT xi , X T (z − X β) =

N  i=1

xiT (z i − xi β).

Appendix: Proof of Propositions

95

Proposition 19 Let k(S) be the cardinality of S. Then, we have4 k(S) + 2 +O E[log σˆ (S)] = log σ (S) − N 2



2

1 N2

 .

Proof Let m ≥ 1, U ∼ χ2m , V1 , . . . , Vm ∼ N (0, 1). For i = 1, . . . , m, we have that 2 Eet Vi =

EetU =

 ∞ −∞

 ∞

−∞

2 1 etvi √



2 e−vi /2 dvi =

2 2 1 et (v1 +···+vm ) √

  (1 − 2t)vi2 1 dvi = (1 − 2t)−1/2 √ exp − 2 −∞ 2π

 ∞

 ∞

2π −∞

e−(v1 +···+vm )/2 dv1 · · · dvm = (1 − 2t)−m/2 . 2

2

which means that for n = 1, 2, . . .,  d n EetU  EU = = m(m + 2) · · · (m + 2n − 2) , dt n t=0 n

(5.16)

t2 where EetU = 1 + t E[U ] + E[U 2 ] + · · · has been used. Moreover, from the Tay2 lor expansion, we have that U E[log ] = E m



  2 U U 1 −1 − E − 1 + ··· . m 2 m

(5.17)

If we let (5.16) for n = 1, 2, where EU = m and EU 2 = m(m + 2), the first and second terms of (5.17) are zero and −

1 1 1 (EU 2 − 2m EU + m 2 ) = − 2 {m(m + 2) − 2m 2 + m 2 } = − , 2m 2 2m m

rspectively. Next, we show that each term in (5.17) for n ≥ 3 is at most O(1/m 2 ). From the binomial theorem and (5.16), we have that E(U − m)n =

n    n j=0

j

EU j (−m)n− j =

n  j=0

(−1)n− j

  n m n− j m(m + 2) · · · (m + 2 j − 2) . j

(5.18) If we regard m n− j m(m + 2) · · · (m + 2 j − 2)

4 By

O( f (N )), we denote a function such that g(N )/ f (N ) is bounded.

96

5 Information Criteria

as a polynmial w.r.t. m, the coefficients of the highest and (n − 1)-th terms are one and 2{1 + 2 + · · · + ( j − 1)} = j ( j − 1), respectively. Hence, the coefficients of the n-th and (n − 1)-th terms in (5.18) are n    n j=0

j

(−1) j =

n    n j=0

j

(−1) j 1n− j = (−1 + 1)n = 0

and n    n j=0

j

(−1) j j ( j − 1) =

n  j=2

n−2 

 n! (−1) j−2 = n(n − 1) (n − j)!( j − 2)! i=0

n−2 i

 (−1)i = 0 ,

respectively. Thus, we have shown that for n ≥ 3,  E

n   1 U −1 = O . m m2

RSS(S) N σˆ 2 (S) = ∼ χ2N −k(S)−1 and (5.17), if we apply m = N − σ 2 (S) σ 2 (S) k(S) − 1, then we have that Finally, from

log  E log



k(S) + 1 1 N = + O(( )2 ) N − k(S) − 1 N − k(S) − 1 N − k(S) − 1

σˆ 2 (S) N − k(S) − 1

   2  1 σ (S) 1 1 1 = − + O( 2 ) +O =− N N − k(S) − 1 N N2 N

and       1 k(S) + 2 k(S) + 1 1 1 σˆ 2 (S) = − = − . − + O + O E log σ2 N N N2 N N2

Exercises 40–48 In the following, we define ⎤ ⎡ ⎡ ⎡ ⎤ ⎤ ⎤ β0 x1 y1 z1 ⎢ β1 ⎥ ⎥ ⎢ ⎢ ⎢ ⎢ ⎥ ⎥ ⎥ X = ⎣ ... ⎦ ∈ R N ×( p+1) , y = ⎣ ... ⎦ ∈ R N , z = ⎣ ... ⎦ ∈ R N , β = ⎢ . ⎥ ∈ R p+1 , ⎣ .. ⎦ xN yN zN βp ⎡

Appendix: Proof of Propositions

97

where x1 , . . . , x N are row vectors. We assume that X T X has an inverse matrix and denote by E[·] the expectation w.r.t.   y − xβ 2 . exp − f (y|x, β) := √ 2σ 2 2πσ 2 1

40. For X ∈ R N ×( p+1) , y ∈ R N , show each of the following. (a) If the variance σ 2 > 0 is known, the β ∈ R p+1 that maxmizes l :=

N 

f (yi |xi , β) coincides with the least squares solution. Hint: l=−

log

i=1

N 1 log(2πσ 2 ) − 2 y − X β 2 2 2σ

(b) If both β ∈ R p+1 and σ 2 > 0 are unknown, the maximum likelihood estimate of σ 2 is given by 1 ˆ 2 σˆ 2 = y − X β N . Hint: If we partially differentiate l with respect to σ 2 , we have N y − X β 2 ∂l 2 = − =0 ∂σ 2 2σ 2 2(σ 2 )2 (c) For probabilistic density functions f, g over R, the Kullback-Leibler divergence is nonnegative, i.e.,  D( f g) := 41. Let f N (y|x, β) :=

N i=1



−∞

f (x) log

f (x) dx ≥ 0 g(x)

f (yi |xi , β). By showing (a) through (d), prove J=

1 1 E(∇l)2 = − E∇ 2 l N N

∇ f N (y|x, β) (a) ∇l = f N (y|x, β)  (b) ∇ f N (y|x, β)dy = 0 (c) E∇l = 0 (d) ∇ E[∇l] = E[∇ 2 l] + E[(∇l)2 ] 42. Let β˜ ∈ R p+1 be an arbitrary unbiased estimate β. By showing (a) through (c), prove Cramer-Rao’s inequality

98

5 Information Criteria

˜ ≥ (N J )−1 V (β) (a) E[(β˜ − β)(∇l)T ] = I (b) The covariance matrix of the vector combining β˜ − β and ∇l of size 2( p + 1)   ˜ V (β) I I NJ (c) Both sides of 

˜ − (N J )−1 0 V (β) 0 NJ



 =

I −(N J )−1 0 I



˜ V (β) I I NJ



I 0 −(N J )−1 I



are nonnegative definite. 43. By showing (a) through (c), prove E X (β˜ − β) 2 ≥ σ 2 ( p + 1). (a) E[(β˜ − β)T ∇l] = p + 1 (b) E X (X T X )−1 ∇l 2 = ( p + 1)/σ 2 (c) {E(β˜ − β)T ∇l}2 ≤ E X (X T X )−1 ∇l 2 E X (β˜ − β) 2 Hint: For random variables U, V ∈ Rm (m ≥ 1), prove {E[U T V ]}2 ≤ E[ U 2 ]E[ V 2 ] (Schwarz’s inequality). 44. Prove the following statements. (a) For covariates x1 , . . . , x N , if we obtain the responses z 1 , . . . , z N , then the N  log f (z i |xi , γ) of the parameter γ ∈ R p+1 is likelihood − i=1

N 1 1 1 log 2πσ 2 + 2 z − X β 2 − 2 (γ − β)T X T (z − X β) + 2 (γ − β)T X T X (γ − β) 2 2σ σ 2σ

for an arbitrary β ∈ R p+1 . (b) If we take the expectation of (a) w.r.t. z 1 , . . . , z N , it is N 1 log(2πσ 2 e) + 2 X (γ − β) 2 . 2 2σ (c) If we estimate β and choose an estimate γ of β, the minimum value of (b) on average is N 1 log(2πσ 2 e) + ( p + 1) 2 2 and the minimum value is realized by the least squares method. (d) Instead of choosing all the p covariates, we choose 0 ≤ k ≤ p covariates from p. Minimizing

Appendix: Proof of Propositions

99

N 1 log(2πσk2 e) + (k + 1) 2 2 w.r.t. k is equivalent to minimizing N log σk2 + k w.r.t. k, where σk2 is the minimum variance when we choose k covariates. 45. By showing (a) through (f), prove E log

k(S) + 1 σˆ 2 (S) 1 +O =− − σ2 N N



1 N2

 =−

k(S) + 2 +O N



1 N2

 .

Use the fact that the moment of U ∼ χ2m is EU n = m(m + 2) · · · (m + 2n − 2) without proving it.    2 U 1 U U =E −1 − E − 1 + ··· (a) E log m m 2 m    2 2 U U (b) E − 1 = 0 and E −1 = m m m   n  n (c) (−1)n− j =0 j j=0   n  n n n− j m n− j m(m + 2) · · · (m + 2 j − (−1) (d) if we regard E(U − m) = j j=0

2) as a polynomial of degree m, the sum of the terms of degree n is zero. Hint: Use (c). (e) The sum of the terms of degree n − 1 is zero. Hint: Derive that the coefficient of degree n − 1 is 2{1 + 2 + · · · + ( j − 1)} = j ( j − 1) for each j and that n    n (−1) j j ( j − 1) = 0. j j=0  2    σ 1 1 σˆ 2 (S) =− +O (f) E log N − k(S) − 1 N N N2 46. The following procedure produces the AIC value. Fill in the blanks and execute the procedure. RSS.min=function(X,y,T){ m=ncol(T); S.min=Inf for(j in 1:m){ q=T[,j]; S=sum((lm(y~X[,q])$fitted.values−y)^2)/n if(S 0 . Moreover, from Proposition 6, all the eigenvalues being positive means that the product det(X T X + N λI ) is positive and that the matrix X T X + N λI is nonsingular (Proposition 1), which is true for any p and N . If N < p, the rank of X T X ∈ R p× p is at most N (Proposition 3), and it is not nonsingular (Proposition 1). Therefore, we have λ > 0 ⇐⇒ X T X + N λI is nonsingular For ridge, we implement the following procedure. ridge=function(X, y, lambda=0){ X=as.matrix(X); p=ncol(X); n=length(y); X.bar=array(dim=p); s=array(dim=p) for(j in 1:p){X.bar[j]=mean(X[,j]);X[,j]=X[,j]−X.bar[j];}; for(j in 1:p){s[j]=sd(X[,j]);X[,j]=X[,j]/s[j]}; y.bar=mean(y); y=y−y.bar beta=drop(solve(t(X)%*%X+n*lambda*diag(p))%*%t(X)%*%y) for(j in 1:p)beta[j]=beta[j]/s[j] beta.0= y.bar−sum(X.bar*beta) return(list(beta=beta, beta.0=beta.0)) }

Example 48 We store the data set (US Crime Data) https://web.stanford.edu/ ~hastie/StatLearnSparsity/data.html as a text file crime.txt and apply ridge to find the relation between the response and covariates. 1 2 3 4 5 6 7

Response Covariate Covariates Covariates Covariates Covariates

Total overall reported crime rate per 1 million residents NA Annual police funding in $/resident % of people 25 years+ with 4 yrs. of high school % of 16–19 year olds not in highschool and not highschool graduates % of 18–24 year olds in college % of 18–24 year olds in college

We execute the function ridge via the following procedure. df=read.table("crime.txt"); x=df[,3:7]; y=df[,1]; p=ncol(x); lambda.seq=seq(0,100,0.1); coef.seq=lambda.seq plot(lambda.seq, coef.seq, xlim=c(0,100), ylim=c(−40,40), xlab="lambda",ylab="beta",main="The coefficients for each lambda", type="n", col="red") for(j in 1:p){

Fig. 6.1 Execution of Example 48. The coefficients β obtained via ridge shrink as λ increases

coef.seq=NULL; for(lambda in lambda.seq)coef.seq=c(coef.seq,ridge(x,y, lambda)$beta[j]) par(new=TRUE); lines(lambda.seq,coef.seq, col=j) } legend("topright",legend= c("annual police funding in $resident","% of people 25 years+ with 4 yrs. of high school", "% of 16 to 19 year-olds not in highschool and not highschool graduates","% of 18 to 24 year-olds in college", "% of 18 to 24 year-olds in college"), col=1:p, lwd=2, cex =.8)

We illustrate how the coefficients change as λ increases in Fig. 6.1.
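A simple sanity check of the ridge function (a sketch, not from the text): with λ = 0, it should reproduce the ordinary least-squares coefficients on the same data.

## Sketch: for lambda=0, ridge() agrees with lm()
df=read.table("crime.txt"); x=df[,3:7]; y=df[,1]
ridge(x,y,0)$beta
coef(lm(y~., data=data.frame(x,y)))[-1]    # least-squares slopes, for comparison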

6.2 Subderivative We consider optimizing functions that cannot be differentiated. For example, when we find the points x at which a variable function f such as f (x) = x 3 − 2x + 1 is maximal and minimal, by differentiating f with respect to x, we can solve equation f  (x) = 0. However, what if the absolute function is contained as in f (x) = x 2 + x + 2|x|? To this end, we extend the notion of differentiation.


To begin, we assume that f is convex. In general, if f (αx + (1 − α)y) ≤ α f (x) + (1 − α) f (y) for an arbitrary 0 < α < 1 and x, y ∈ R, we say that f is convex.1 For example, f (x) = |x| is because |αx + (1 − α)y| ≤ α|x| + (1 − α)|y| . In fact, because both sides are nonnegative, if we subtract the square of the left from the that of the right, we have 2α(1 − α)(|x y| − x y) ≥ 0. If a f : R → R and x0 ∈ R satisfy f (x) ≥ f (x0 ) + z(x − x0 )

(6.2)

for x ∈ R, we say that the set of such z ∈ R is the of f at x0 . If a convex f is differentiable at x0 , then z consists of one element2 f  (x0 ), which can be shown as follows. When a convex function f is differentiable at x0 , we have f (x) ≥ f (x0 ) + f  (x0 )(x − x0 ). In fact, we see that the inequality f (αx + (1 − α)x0 ) ≤ α f (x) + (1 − α) f (x0 ) is equivalent to f (x) − f (x0 ) ≥

f (x0 + α(x − x0 )) − f (x0 ) (x − x0 ) . α(x − x0 )

Then, regardless of x < x0 and x > x0 , f (x0 + α(x − x0 )) − f (x0 ) α(x − x0 ) approaches the same f  (x0 ) as α → 0. However, we can show that when the f is differentiable at x0 , the z that satisfies (6.2) does not exist, except f  (x0 ). In fact, f (x) − f (x0 ) ≥ z and in order for (6.2) to hold for x > x0 and x < x0 , we require x − x0 f (x) − f (x0 ) ≤ z, respectively, which means that z is no less than the left derivative x − x0 at x0 and no more than the right derivative at x = x0 . Since f is differentiable at x0 , these values need to coincide. In this book, we consider only the case of f (x) = |x| and x0 = 0, and (6.2) becomes |x| ≥ zx for an arbitrary x ∈ R. Then, we can show that the is the interval [−1, 1]: |x| ≥ zx , x ∈ R ⇐⇒ |z| ≤ 1 . 1 In 2 In

this book, convexity always means convex below and does not mean concave (convex above). such a case, we do not express the subderivative as { f  (x0 )} but as f  (x0 ).

6.2 Subderivative

105

Fig. 6.2 f (x) = |x| cannot be differentiated at the origin. The coefficients from both sides do not match

f (x) = |x| f (x) = −1

f (x) = 1

0 x cannot be differentiated

To demonstrate this result, suppose that |x| ≥ zx for an arbitrary x ∈ R. If the claim is true for x > 0 and for x < 0, we require z ≤ 1 and z ≥ −1, respectively. On the other hand, if −1 ≤ z ≤ 1, we have zx ≤ |zx| ≤ |x| for any x ∈ R (Fig. 6.2). Example 49 For the cases x < 0, x = 0, and x > 0, we obtain points x such that f (x) = x 2 − 3x + |x| and f (x) = x 2 + x + 2|x| are minimal. For x = 0, we can differentiate the functions. Note that the of f (x) = |x| at x = 0 is [−1, 1]. For the first case  2  2 x − 2x, x ≥ 0 x − 3x + x, x ≥ 0 2 = f (x) = x − 3x + |x| = x 2 − 3x − x, x < 0 x 2 − 4x, x < 0 ⎧ x >0 ⎨ 2x − 2, f  (x) = 2x − 3 + [−1, 1] = −3 + [−1, 1] = [−4, −2] 0, x = 0 ⎩ 2x − 4 < 0, x 0 ⎨ 2x + 3 > 0, f  (x) = 2x + 1 + 2[−1, 1] = 1 + 2[−1, 1] = [−1, 3] 0, x = 0 ⎩ 2x − 1 < 0, x 0 ⎨ 1, βj < 0 + λ −1, ⎩ [−1, 1], β j = 0

because the subderivative of |x| at x = 0 is [-1,1]. Since we have

(6.5)

6.3 Lasso

107

Fig. 6.4 The shape of Sλ (x) for λ = 5

4 2

λ=5

-4 -2 0

soft.th(5, x)

soft.th(lambda,x)

-10

-5

0

5

10

x ⎧ βj > 0 ⎨ −s j + β j + λ, βj < 0 , 0 ∈ −s j + β j − λ, ⎩ −s j + β j + λ[−1, 1], β j = 0 we may write the solution as follows: ⎧ ⎨ s j − λ, s j > λ β j = s j + λ, s j < −λ , ⎩ 0, −λ ≤ s j ≤ λ where the right-hand side can be expressed as follows: as β j = S λ (s j ) with the function ⎧ ⎨ x − λ, x > λ Sλ (x) = x + λ, x < −λ . ⎩ 0, −λ ≤ x ≤ λ We present the shape of Sλ (·) for λ = 5 in Fig. 6.4, where we execute the following code. soft.th=function(lambda,x)sign(x)*pmax(abs(x)−lambda,0) curve(soft.th(5,x),−10,10, main="soft.th(lambda,x)") segments(−5,−4,−5,4, lty=5, col="blue"); segments(5,−4,5,4, lty=5, col="blue ") text(−0.2,1,"lambda=5",cex=1.5)

Finally, we remove assumption (6.4).

pHowever, relation (6.5) does not hold for this case. To this end, we replace yi − j=1 xi, j β j in (6.5) by ri, j − xi, j β j with the N

1  residue ri, j := yi − k = j xi,k βk and by s j := ri, j xi, j in N i=1 ⎧ n βj > 0 ⎨ 1,  1 βj < 0 . 0∈− xi, j (ri, j − xi, j β j ) + λ −1, ⎩ N i=1 [−1, 1], β j = 0

108

6 Regularization

Then, for fixed β j , we update βk for k = j and repeat the process for j = 1, · · · , p. We further repeat the cycle until convergence. For example, we can construct the following procedure. lasso=function(X, y, lambda=0){ X=as.matrix(X); X=scale(X); p=ncol(X); n=length(y); X.bar=array(dim=p); for(j in 1:p){X.bar[j]=mean(X[,j]);X[,j]=X[,j]−X.bar[j];}; y.bar=mean(y); y=y− y.bar eps=1; beta=array(0, dim=p); beta.old=array(0, dim=p) while(eps>0.001){ for(j in 1:p){ r= y−X[,−j]%*%beta[−j] beta[j]= soft.th(lambda,sum(r*X[,j])/n) } eps=max(abs(beta−beta.old)); beta.old=beta } beta.0= y.bar−sum(X.bar*beta) return(list(beta=beta, beta.0=beta.0)) }

Example 50 We apply the data in Example 48 to lasso. df=read.table("crime.txt"); x=df[,3:7]; y=df[,1]; p=ncol(x); lambda.seq=seq(0,100,0.1); coef.seq=lambda.seq plot(lambda.seq, coef.seq, xlim=c(0,100), ylim=c(−40,40),

Lasso: The Coeffieicnts for each λ

0 -40

-20

β

20

40

annual police funding in $/resident % of people 25 years+ with 4 yrs. of · · · % of 16 to 19 year-olds not in · · · % of 18 to 24 year-olds in college % of 18 to 24 year-olds in college

0

20

40

60

80

100

λ Fig. 6.5 Execution of Example 50. Although the coefficients decrease as λ increases for , each coefficient becomes zero for large λ, and the timing for the coefficients differs

6.3 Lasso

109

xlab="lambda",ylab="beta",main="The coefficients for each lambda", type="n", col="red") for(j in 1:p){ coef.seq=NULL; for(lambda in lambda.seq)coef.seq=c(coef.seq,lasso(x,y, lambda)$beta[j]) par(new=TRUE); lines(lambda.seq,coef.seq, col=j) } legend("topright",legend= c("annual police funding in $ resident","% of people 25 years+ with 4 yrs. of high school", "% of 16 to 19 year-olds not in highschool and not highschool graduates","% of 18 to 24 year-olds in college", "% of 18 to 24 year-olds in college"), col=1:p, lwd=2, cex =.8)

As shown in Fig. 6.5, the larger λ is, the smaller the absolute value of the coefficients. We observe that each coefficient becomes zero when λ exceeds a threshold that depends on the coefficient and that the sets of nonzero coefficients depend on the value of λ. The larger λ is, the smaller the set of nonzero coefficients.

6.4 Comparing Ridge and Lasso If we compare Figs. 6.1 and 6.5, we find that the absolute values of ridge and lasso decrease as λ increases and that the values approach zero. However, in lasso, each of the coefficients diminishes when λ exceeds a value that depends on the coefficient, and the timing also depends on the coefficients; therefore, we must consider this property for model selection. Thus far, we have mathematically analyzed ridge and lasso. Additionally, we may wish to intuitively understand the geometrical meaning. Images such as those in Fig. 6.6 are often used to explain the difference between lasso and ridge. Suppose that p = 2 and that X ∈ R N × p consists of two columns xi,1 , xi,2 , i = 1, . . . , N . In the least squares method, we obtain the β1 , β2 that minimize S := N  (yi − β1 xi,1 − β2 xi,2 )2 . Let βˆ1 and βˆ2 be the estimates. Since i=1 N 

xi,1 (yi − yˆi ) =

i=1

N 

xi,2 (yi − yˆi ) = 0

i=1

with yˆi = βˆ1 xi1 + βˆ2 xi2 and yi − β1 xi,1 − β2 xi,2 = yi − yˆi − (β1 − βˆ1 )xi,1 − (β2 − βˆ2 )xi,2 for arbitrary β1 , β2 , the

N  i=1

(yi − β1 xi,1 − β2 xi,2 )2 can be expressed as follows:

110

6 Regularization

Ridge

5

4

Lasso 1

y 1

2

y

2

3

3

4

1

11 1

-1

-1

0

0

1

1

-1

0

1

2

3

-1

0

1

2

3

4

x

x

Fig. 6.6 The contours that share the center (βˆ 1 , βˆ 2 ) and the square error (6.6), where the rhombus and circle are the constraints of the |β1 | + |β2 | ≤ C  and the β12 + β22 ≤ C, respectively

(β1 − βˆ 1 )2

N  i=1

2 xi,1 + 2(β1 − βˆ 1 )(β2 − βˆ 2 )

N  i=1

xi,1 xi,2 + (β2 − βˆ 2 )2

N  i=1

2 xi,2 +

N 

(yi − yˆi )2 .

(6.6)

i=1

If we let (β1 , β2 ) := (βˆ1 , βˆ2 ), then we obtain the minimum value (= RSS). However, when p = 2, we may regard solving (6.1) and (6.3) in ridge and lasso as obtaining (β1 , β2 ) that minimize (6.6) w.r.t. β12 + β22 ≤ C, |β1 | + |β2 | ≤ C  for constants C, C  > 0, respectively, where the larger the C, C  , the smaller λ is, where we regard xi1 , xi2 , yi , yˆi , i = 1, · · · , N , and βˆ1 , βˆ2 as constants. The elliptic curve in Fig. 6.6 Left has center (βˆ1 , βˆ2 ), and each of the contours shares the same value of (6.6). If we expand the contours, then we eventually obtain a rhombus at some (β1 , β2 ). Such a pair (β1 , β2 ) is the solution of lasso. If the rhombus is smaller (if λ is larger), the elliptic curve is more likely to reach one of the four corners of the rhombus, which means that one of β1 and β2 becomes zero. However, as shown in Fig. 6.6 Right, if we replace the rhombus with a circle (Ridge), it is unlikely that one of β1 and β2 becomes zero. For simplicity, we consider a circle rather than an elliptic curve. In this case, if the solution (βˆ1 , βˆ2 ) of the least squares method is located somewhere in the green region in Fig. 6.7, either β1 = 0 or β2 = 0 is the solution. Specifically, if the rhombus is small (λ is large), even if (βˆ1 , βˆ2 ) remain the same, the area of the green region becomes large.

6.5 Setting the λ Value

111

6.5 Setting the λ Value When we apply lasso in actual situations, CRAN glmnet is available. Thus far, we have constructed procedures from scratch to understand the principle. We may use the existing package in real applications. To set the λ value, we usually apply the Cross-validation (CV) method in Chap. 3. Suppose that the CV is ten-fold. For example, for each λ, we estimate β using nine groups and test the estimate using one group, and we execute this process ten times, changing the groups to evaluate λ. We evaluate all λ values and choose the best. If

6

Fig. 6.7 In the green area, the solution satisfies either β1 = 0 or β2 = 0 when the center is (βˆ 1 , βˆ 2 )

1

-2

0

2

4

1

-2

0

2

4

6

90000 70000 50000

Mean-Squared Error

110000

5 5 5 5 5 5 5 5 4 4 3 3 3 3 3 3 2 1 1 1

0

1

2

3

4

5

log λ Fig. 6.8 Using the function cv.glmnet, we obtain the evaluation for each λ (the squared error for the test data), marked as a red point. The vertical segments are the s of the true coefficient values. log λmin = 3. (The optimum value is approximately λmin = 20). The numbers on the top of the figure 5, . . . , 5, 4, . . . , 4, 3, . . . , 3, 2, 2, 1, . . . , 1 indicate how many variables are nonzero

112

6 Regularization

we input the covariates and response data to the function cv.glmnet, the package evaluates various values of λ and outputs the best one. Example 51 We apply the data in Examples 48 and 50 to the functions cv.glmnet to obtain the best λ. Then, for the best λ, we apply the usual lasso procedure to obtain β. The package outputs the evaluation (the squared error for the test data) and confidence interval for each λ (Fig. 6.8). The numbers on the top of the figure express how many variables are nonzero for the λ value. > library(glmnet) > df=read.table("crime.txt"); X=as.matrix(df[,3:7]) > cv.fit=cv.glmnet(X,y); plot(cv.fit) > lambda.min=cv.fit$lambda.min; lambda.min [1] 20.03869 > fit=glmnet(X,y,lambda=lambda.min); fit$beta 5 x 1 sparse Matrix of class "dgCMatrix" s0 V3 9.656911 V4 -2.527286 V5 3.229431 V6 . V7 .

Exercise 49–56 49. Let N , p ≥ 1. For X ∈ R N × p and y ∈ R N , λ ≥ 0, we wish to obtain β ∈ R p that minimizes 1 y − X β2 + λβ22 , N 

p 2 where for β = (β1 , . . . , β p ), we denote β2 := j=1 β j . Suppose N < p. Show that such a solution always exists and that it is equivalent to λ > 0. Hint: In order to show a necessary and sufficient condition, both directions should be proved. 50. (a) Suppose that a function f : R → R is convex and differentiable at x = x0 . Show that a z exists for an arbitrary x ∈ R such that f (x) ≥ f (x0 ) + z(x − x0 ) (subderivative) and that it coincides with the differential coefficient f  (x0 ) at x = x0 . (b) Show that −1 ≤ z ≤ 1 is equivalent to zx ≤ |x| for all x ∈ R. (c) Find the set of z defined in (a) for function f (x) = |x| and x0 ∈ R. Hint: Consider the cases x0 > 0, x0 < 0, and x0 = 0.

Exercise 49–56

113

(d) Compute the subderivatives of f (x) = x 2 − 3x + |x| and f (x) = x 2 + x + 2|x| for each point, and find the maximal and minimal values for each of the two functions. 51. Write an R program soft.th(lambda,x) of the function Sλ (x), λ > 0, x ∈ R defined by ⎧ ⎨ x − λ, x > λ |x| ≤ λ Sλ (x) := 0, ⎩ x + λ, x < −λ and execute the following. curve(soft.th(5,x),−10,10)

Hint: Use pmax rather than max. 52. We wish to find the β ∈ R that minimizes N 1  (yi − xi β)2 + λ|β| L= 2N i=1

. , N , λ > 0, where we assume that x1 , . . . , given (xi , yi ) ∈ R × R, i = 1, . .

N xi2 = 1. Express the solution by z := x N have been scaled so that N1 i=1 1 N i=1 x i yi and function Sλ (·). N 53. For p > 1 and λ > 0, we estimate the coefficients β0 ∈ R and β ∈ R p as follows: Initially, give the coefficients β ∈ R p . Then, we update  N we randomly   xi, j ri, j xi, j β j . We repeat this process β j by Sλ , where ri, j := yi − N i=1 k = j for j = 1, . . . , p and repeat the cycle until convergence. The function lasso below is used to scale the sample-based variance to one for each of the p variables before estimation of (β0 , β). Fill in the blanks and execute the procedure.

lasso=function(X, y, lambda=0){
  X=as.matrix(X); p=ncol(X); n=length(y)
  X.bar=array(dim=p); s=array(dim=p)
  for(j in 1:p){X.bar[j]=mean(X[,j]); X[,j]=X[,j]-X.bar[j]}
  for(j in 1:p){s[j]=sd(X[,j]); X[,j]=X[,j]/s[j]}
  y.bar=mean(y); y=y-y.bar
  eps=1; beta=array(0, dim=p); beta.old=array(0, dim=p)
  while(eps>0.001){
    for(j in 1:p){
      r= ## Blank (1) ##
      beta[j]= ## Blank (2) ##
    }
    eps=max(abs(beta-beta.old)); beta.old=beta
  }
  for(j in 1:p)beta[j]=beta[j]/s[j]
  beta.0= ## Blank (3) ##
  return(list(beta=beta, beta.0=beta.0))
}

df=read.table("crime.txt"); x=df[,3:7]; y=df[,1]; p=ncol(x)
lambda.seq=seq(0,100,0.1); coef.seq=lambda.seq
plot(lambda.seq, coef.seq, xlim=c(0,100), ylim=c(-40,40), xlab="lambda", ylab="beta",
  main="The coefficients for each lambda", type="n", col="red")
for(j in 1:p){
  coef.seq=NULL
  for(lambda in lambda.seq)coef.seq=c(coef.seq, ## Blank (4) ## )
  par(new=TRUE); lines(lambda.seq, coef.seq, col=j)
}

54. Transform Problem 53 (lasso) into the setting of Problem 49 (ridge) and execute it. Hint: Replace the eps line and the while loop in the function lasso by
beta=drop(solve(t(X)%*%X+n*lambda*diag(p))%*%t(X)%*%y)

and change the function name to ridge. Blank (4) should be ridge rather than lasso.

55. Look up the meanings of glmnet and cv.glmnet, and find the optimal λ and β for the data below. Which variables are selected among the five variables?
library(glmnet)
df=read.table("crime.txt"); X=as.matrix(df[,3:7]); y=df[,1]
cv.fit=cv.glmnet(X,y)
lambda.min=cv.fit$lambda.min
fit=glmnet(X,y,lambda=lambda.min)

Hint: The coefficients are displayed via fit$beta. If a coefficient is nonzero, we consider it to be selected.

56. Given x_{i,1}, x_{i,2}, y_i ∈ R, i = 1, …, N, let β̂_1, β̂_2 be the β_1, β_2 that minimize
S := \sum_{i=1}^N (y_i − β_1 x_{i,1} − β_2 x_{i,2})^2 ,
and write ŷ_i := β̂_1 x_{i,1} + β̂_2 x_{i,2}, i = 1, …, N.
(a) Show the following three equations:
\sum_{i=1}^N x_{i,1}(y_i − ŷ_i) = \sum_{i=1}^N x_{i,2}(y_i − ŷ_i) = 0 .
For arbitrary β_1, β_2,
y_i − β_1 x_{i,1} − β_2 x_{i,2} = y_i − ŷ_i − (β_1 − β̂_1) x_{i,1} − (β_2 − β̂_2) x_{i,2} .
For arbitrary β_1, β_2, the sum \sum_{i=1}^N (y_i − β_1 x_{i,1} − β_2 x_{i,2})^2 can be expressed as
(β_1 − β̂_1)^2 \sum_{i=1}^N x_{i,1}^2 + 2(β_1 − β̂_1)(β_2 − β̂_2) \sum_{i=1}^N x_{i,1} x_{i,2} + (β_2 − β̂_2)^2 \sum_{i=1}^N x_{i,2}^2 + \sum_{i=1}^N (y_i − ŷ_i)^2 .
(b) We consider the case \sum_{i=1}^N x_{i,1}^2 = \sum_{i=1}^N x_{i,2}^2 = 1 and \sum_{i=1}^N x_{i,1} x_{i,2} = 0. In the standard least squares method, we choose the coefficients as β_1 = β̂_1, β_2 = β̂_2. However, under the constraint that |β_1| + |β_2| is less than a constant, we choose the (β_1, β_2) at which the circle with center (β̂_1, β̂_2) and the smallest radius comes into contact with the rhombus. Suppose that we grow the radius of the circle with center (β̂_1, β̂_2) until it comes into contact with the rhombus that connects (1, 0), (0, 1), (−1, 0), (0, −1). Show the region of centers (β̂_1, β̂_2) for which one of the coordinates β_1, β_2 of the contact point is zero.
(c) What if the rhombus in (b) is replaced by the unit circle?

Chapter 7

Nonlinear Regression

Abstract For regression, until now we have focused on only linear regression, but in this chapter, we will consider the nonlinear case where the relationship between the covariates and response is not linear. In the case of linear regression in Chap. 2, if there are p variables, we calculate p + 1 coefficients of the basis that consists of p + 1 functions 1, x1 , . . . , x p . This chapter addresses regression when the basis is general. For example, if the response is expressed as a polynomial of the covariate x, the basis consists of 1, x, . . . , x p . We also consider spline regression and find a basis. In that case, the coefficients can be found in the same manner as for linear regression. Moreover, we consider local regression for which the response cannot be expressed by a finite number of basis functions. Finally, we consider a unified framework (generalized additive model) and back-fitting.

7.1 Polynomial Regression
We consider fitting the relation between the covariate and the response to a polynomial from observed data (x_1, y_1), …, (x_N, y_N) ∈ R × R. By a polynomial, we mean a function f : R → R determined by specifying the coefficients β_0, β_1, …, β_p in β_0 + β_1 x + ⋯ + β_p x^p for p ≥ 1, such as f(x) = 1 + 2x − 4x^3. As in the least squares method, we choose the coefficients β_0, …, β_p that minimize
\sum_{i=1}^N (y_i − β_0 − β_1 x_i − ⋯ − β_p x_i^p)^2 .

Writing x_{i,j} = x_i^j, if the matrix X^T X is nonsingular with
X = \begin{bmatrix} 1 & x_1 & \cdots & x_1^p \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_N & \cdots & x_N^p \end{bmatrix} ,


Fig. 7.1 We generated the data by adding standard Gaussian random values to a sine curve and fit the data to polynomials of orders p = 3, 5, 7 (horizontal axis: x; legend: p = 3, 5, 7 and f(x))

we can check that β̂ = (X^T X)^{−1} X^T y is the solution. As in the case of linear regression, where f̂(x) = β̂_0 + β̂_1 x_1 + ⋯ + β̂_p x_p, from the obtained β̂_0, …, β̂_p we construct the estimated function
f̂(x) = β̂_0 + β̂_1 x + ⋯ + β̂_p x^p .

Example 52 We generate N = 100 observations by adding standard Gaussian random values to the sine function and fit them to polynomials of orders p = 3, 5, 7. We show the results in Fig. 7.1. The fits are computed via the following code.

n=100; x=rnorm(n); y=sin(x)+rnorm(n) ## Data Generation
m=3; p.set=c(3,5,7); col.set=c("red","blue","green")
g=function(beta,u){S=beta[1]; for(j in 1:p)S=S+beta[j+1]*u^j; return(S)}
for(i in 1:m){
  p=p.set[i]; X=rep(1,n); for(j in 1:p)X=cbind(X,x^j)
  beta=drop(solve(t(X)%*%X)%*%t(X)%*%y); f=function(u)g(beta,u)
  curve(f(x),-3,3, col=col.set[i], yaxt="n"); par(new=TRUE)
}
legend("topleft", lty=1, paste0("p=",p.set), col=col.set); points(x,y)
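The same polynomial fit can also be obtained with base R's lm and poly; a minimal sketch, assuming the x and y generated above (with raw=TRUE, the coefficients agree with the explicit p = 3 computation up to numerical error):

fit=lm(y~poly(x,3,raw=TRUE)) ## raw (non-orthogonal) basis 1, x, x^2, x^3
coef(fit)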

We can show that if at least p + 1 of x_1, …, x_N are distinct, the matrix X^T X is nonsingular. To see this, since the ranks of X^T X and X are equal (see Sect. 2.2), it suffices to show that the square matrix formed by p + 1 rows of X ∈ R^{N×(p+1)} corresponding to p + 1 distinct sample values has a nonzero determinant, which follows from the fact shown in Example 7 that the n × n Vandermonde determinant is nonzero when a_1, …, a_n are distinct.
Polynomial regression can be applied to more general settings. For f_0 = 1 and f_1, …, f_p : R → R, we can compute β̂ = (X^T X)^{−1} X^T y as long as the columns of
X = \begin{bmatrix} 1 & f_1(x_1) & \cdots & f_p(x_1) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & f_1(x_N) & \cdots & f_p(x_N) \end{bmatrix}
are linearly independent.
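For illustration, the Vandermonde fact used above can be checked directly for n = 3:
\det\begin{pmatrix} 1 & a_1 & a_1^2 \\ 1 & a_2 & a_2^2 \\ 1 & a_3 & a_3^2 \end{pmatrix} = (a_2 − a_1)(a_3 − a_1)(a_3 − a_2) ,
which is nonzero exactly when a_1, a_2, a_3 are distinct; in general, the determinant equals the product of all pairwise differences a_j − a_i, i < j.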

Fig. 7.2 The graph of the function obtained by removing the noise in (7.1). It is an even and cyclic function

Fig. 7.3 We generated data such that whether y is close to −1 or 1 depends on whether x truncated to an integer is even or odd. Note that (7.1) is a cyclic and even function when the noise ε is removed (Fig. 7.2). We observe that cos nx, n = 1, 2, …, fit better than sin nx, n = 1, 2, …

From the obtained β̂_0, …, β̂_p, we can construct
f̂(x) = β̂_0 f_0(x) + β̂_1 f_1(x) + ⋯ + β̂_p f_p(x) ,
where we often assume f_0(x) = 1.

Example 53 We next generate x ∼ N(0, π^2) and
y = \begin{cases} −1 + ε, & 2m − 1 ≤ |x| < 2m \\ 1 + ε, & 2m − 2 ≤ |x| < 2m − 1 \end{cases} , \quad m = 1, 2, … \qquad (7.1)
(Fig. 7.2), where ε ∼ N(0, 0.2^2). We observe that the even functions f_1(x) = 1, f_2(x) = cos x, f_3(x) = cos 2x, f_4(x) = cos 3x fit better than the odd functions f_1(x) = 1, f_2(x) = sin x, f_3(x) = sin 2x, f_4(x) = sin 3x (Fig. 7.3), because we generated the observed data according to an even function with added noise. The procedure is implemented via the following code.

## Generation of data close to an even function
n=100; x=rnorm(n)*pi; y=ceiling(x)%%2*2-1+rnorm(n)*0.2
plot(x,y,xaxt="n",yaxt="n",ann=FALSE, main="Follow random numbers via sin and cos")
## The following function f chooses 1, cos x, cos 2x, cos 3x as the basis
X=cbind(1,cos(x),cos(2*x),cos(3*x)); beta=solve(t(X)%*%X)%*%t(X)%*%y
f=function(x)beta[1]+beta[2]*cos(x)+beta[3]*cos(2*x)+beta[4]*cos(3*x)
par(new=TRUE); curve(f(x),-5,5, col="red",yaxt="n",ann=FALSE)
## The following function g chooses 1, sin x, sin 2x, sin 3x as the basis
X=cbind(1,sin(x),sin(2*x),sin(3*x)); beta=solve(t(X)%*%X)%*%t(X)%*%y


g=function(x)beta[1]+beta[2]*sin(x)+beta[3]*sin(2*x)+beta[4]*sin(3*x)
par(new=TRUE); curve(g(x),-5,5,col="blue",yaxt="n",ann=FALSE)

7.2 Spline Regression
In this section, we restrict the polynomials to those of order at most three, such as x^3 + x^2 − 7 and −8x^3 − 2x + 1. We first note that if polynomials f and g of order p = 3 coincide up to the second derivative at a point x_* ∈ R, i.e., f^{(j)}(x_*) = g^{(j)}(x_*), j = 0, 1, 2, in
f(x) = \sum_{j=0}^3 β_j (x − x_*)^j , \qquad g(x) = \sum_{j=0}^3 γ_j (x − x_*)^j ,
then we have β_j = γ_j, j = 0, 1, 2. In fact, f(x_*) = g(x_*), f′(x_*) = g′(x_*), and f″(x_*) = g″(x_*) imply β_0 = γ_0, β_1 = γ_1, and 2β_2 = 2γ_2, respectively. Hence, we have f(x) − g(x) = (β_3 − γ_3)(x − x_*)^3.
In the following, for K ≥ 1, we divide the line R at the knots −∞ = α_0 < α_1 < ⋯ < α_K < α_{K+1} = ∞ and express the function f(x) as a polynomial f_i(x) on each interval α_i ≤ x ≤ α_{i+1}, where we assume that these K + 1 polynomials are continuous up to the second derivative at the K knots:
f_{i−1}^{(j)}(α_i) = f_i^{(j)}(α_i), \quad j = 0, 1, 2, \quad i = 1, …, K \qquad (7.2)
(spline function). Note that there exists a constant γ_i such that f_i(x) = f_{i−1}(x) + γ_i (x − α_i)^3 for each i = 1, 2, …, K. In (7.2), there are 3K linear constraints on the 4(K + 1) coefficients of the K + 1 cubic polynomials, each of which contains four coefficients, which means that K + 4 degrees of freedom remain. We first arbitrarily determine the values of β_0, β_1, β_2, β_3 in f_0(x) = β_0 + β_1 x + β_2 x^2 + β_3 x^3 for α_0 ≤ x ≤ α_1. Next, noting that for each i = 1, 2, …, K, the difference between f_i and f_{i−1} is (x − α_i)^3 multiplied by a constant β_{i+3}, all the polynomials are determined by specifying β_{i+3}, i = 1, 2, …, K. We express the function f as follows.

f(x) = \begin{cases} β_0 + β_1 x + β_2 x^2 + β_3 x^3, & α_0 ≤ x ≤ α_1 \\ β_0 + β_1 x + β_2 x^2 + β_3 x^3 + β_4 (x − α_1)^3, & α_1 ≤ x ≤ α_2 \\ β_0 + β_1 x + β_2 x^2 + β_3 x^3 + β_4 (x − α_1)^3 + β_5 (x − α_2)^3, & α_2 ≤ x ≤ α_3 \\ \quad \vdots & \quad \vdots \\ β_0 + β_1 x + β_2 x^2 + β_3 x^3 + β_4 (x − α_1)^3 + β_5 (x − α_2)^3 + ⋯ + β_{K+3} (x − α_K)^3, & α_K ≤ x ≤ α_{K+1} \end{cases}
= β_0 + β_1 x + β_2 x^2 + β_3 x^3 + \sum_{i=1}^K β_{i+3} (x − α_i)_+^3 ,

where (x − α_i)_+ takes the value x − α_i for x > α_i and zero for x ≤ α_i. The method for choosing the coefficients β_0, …, β_{K+3} is similar to the one we use for linear regression. Suppose we have observations (x_1, y_1), …, (x_N, y_N); the sample points x_1, …, x_N and the knots α_1, …, α_K should not be confused. For the matrix
X = \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 & (x_1 − α_1)_+^3 & (x_1 − α_2)_+^3 & \cdots & (x_1 − α_K)_+^3 \\ 1 & x_2 & x_2^2 & x_2^3 & (x_2 − α_1)_+^3 & (x_2 − α_2)_+^3 & \cdots & (x_2 − α_K)_+^3 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_N & x_N^2 & x_N^3 & (x_N − α_1)_+^3 & (x_N − α_2)_+^3 & \cdots & (x_N − α_K)_+^3 \end{bmatrix} ,
we determine the β = [β_0, …, β_{K+3}]^T that minimizes
\sum_{i=1}^N \{ y_i − β_0 − x_i β_1 − x_i^2 β_2 − x_i^3 β_3 − (x_i − α_1)_+^3 β_4 − (x_i − α_2)_+^3 β_5 − ⋯ − (x_i − α_K)_+^3 β_{K+3} \}^2 .

If the rank is K + 4, i.e., the K + 4 columns of X are linearly independent, then X^T X is nonsingular, and we obtain the solution β̂ = (X^T X)^{−1} X^T y (Fig. 7.4).

Fig. 7.4 (Spline Curve) In spline functions, the value and the first and second derivatives should coincide on the left and right of each knot: f_{i−1}(α_i) = f_i(α_i), f′_{i−1}(α_i) = f′_i(α_i), and f″_{i−1}(α_i) = f″_i(α_i) for i = 1, …, K

Example 54 After generating data, we execute spline regression with K = 5, 7, 9 knots. We present the results in Fig. 7.6.

n=100; x=rnorm(n)*2*pi; y=sin(x)+0.2*rnorm(n) ## Data Generation
col.set=c("red","green","blue"); K.set=c(5,7,9) ## Knots
for(k in 1:3){
  K=K.set[k]; knots=seq(-2*pi,2*pi,length=K)
  X=matrix(nrow=n,ncol=K+4)
  for(i in 1:n){
    X[i,1]=1; X[i,2]=x[i]; X[i,3]=x[i]^2; X[i,4]=x[i]^3
    for(j in 1:K)X[i,j+4]=max((x[i]-knots[j])^3,0)
  }
  beta=solve(t(X)%*%X)%*%t(X)%*%y ## Estimation of beta
  f=function(x){ ## Function f
    S=beta[1]+beta[2]*x+beta[3]*x^2+beta[4]*x^3
    for(j in 1:K)S=S+beta[j+4]*max((x-knots[j])^3,0)
    return(S)
  }
  u.seq=seq(-5,5,0.02); v.seq=NULL; for(u in u.seq)v.seq=c(v.seq,f(u))
  plot(u.seq,v.seq,type="l",col=col.set[k], yaxt="n", xlab="x", ylab="f(x)")

  par(new=TRUE)
}
legend(-2.2,1,paste0("K=",K.set), lty=1, col=col.set); points(x,y)
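The same space of cubic splines can also be fit with the bs function of the splines package, which uses a B-spline basis (numerically better conditioned than the truncated power basis for large K); a minimal sketch, assuming the x, y, and knots of the last iteration of Example 54 remain in the workspace:

library(splines)
fit=lm(y~bs(x,knots=knots,degree=3)) ## spans the same function space as the truncated power basis
u.seq=seq(-5,5,0.02)
lines(u.seq,predict(fit,data.frame(x=u.seq)),col="orange") ## overlay on the previous plot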

7.3 Natural Spline Regression
In this section, we modify spline regression by replacing the cubic curves with lines on the two outer intervals x ≤ α_1 and α_K ≤ x (natural spline curve). Suppose we write the function f for x ≤ α_K as follows:
f(x) = \begin{cases} β_1 + β_2 x, & α_0 ≤ x ≤ α_1 \\ β_1 + β_2 x + β_3 (x − α_1)^3, & α_1 ≤ x ≤ α_2 \\ \quad \vdots & \quad \vdots \\ β_1 + β_2 x + β_3 (x − α_1)^3 + ⋯ + β_K (x − α_{K−2})^3, & α_{K−2} ≤ x ≤ α_{K−1} \\ β_1 + β_2 x + β_3 (x − α_1)^3 + ⋯ + β_K (x − α_{K−2})^3 + β_{K+1} (x − α_{K−1})^3, & α_{K−1} ≤ x ≤ α_K \end{cases}
Since the second derivative at x = α_K is zero, we have
6 \sum_{j=3}^{K+1} β_j (α_K − α_{j−2}) = 0 ,
and we obtain

β_{K+1} = − \sum_{j=3}^K β_j \frac{α_K − α_{j−2}}{α_K − α_{K−1}} . \qquad (7.3)

Fig. 7.5 (Natural Spline Curve) In natural spline curves, we choose the slope and intercept of the line for x ≤ α_1 (two degrees of freedom) and choose the coefficients for α_i ≤ x ≤ α_{i+1} (one degree of freedom for each i = 1, 2, …, K − 2). However, no degrees of freedom are left for α_{K−1} ≤ x ≤ α_K because f″(α_K) = 0. Moreover, for α_K ≤ x, the slope and intercept are determined from the values of f(α_K) and f′(α_K), and no degrees of freedom are left there either

Then, if we find the values of β_1, …, β_K, we obtain the values of f(α_K) and f′(α_K) and hence the line y = f′(α_K)(x − α_K) + f(α_K) for x ≥ α_K (Fig. 7.5). Thus, the function f is obtained by specifying β_1, …, β_K (Fig. 7.6).

Proposition 20 The function f(x) has the K functions h_1(x) = 1, h_2(x) = x, h_{j+2}(x) = d_j(x) − d_{K−1}(x), j = 1, …, K − 2, as a basis, and if we define γ_1 := β_1, γ_2 := β_2, γ_3 := (α_K − α_1)β_3, …, γ_K := (α_K − α_{K−2})β_K for each β_1, …, β_K, then we can express f by
f(x) = \sum_{j=1}^K γ_j h_j(x) , \quad where \quad d_j(x) = \frac{(x − α_j)_+^3 − (x − α_K)_+^3}{α_K − α_j} , \quad j = 1, …, K − 1 .

For the proof, see the appendix at the end of this chapter. We can construct the corresponding R code as follows.

Fig. 7.6 Spline regression with K = 5, 7, 9 knots (Example 54)

d=function(j,x,knots){
  K=length(knots)
  (max((x-knots[j])^3,0)-max((x-knots[K])^3,0))/(knots[K]-knots[j])
}
h=function(j,x,knots){
  K=length(knots)
  if(j==1)return(1) else if(j==2)return(x) else return(d(j-2,x,knots)-d(K-1,x,knots))
}

If we are given observations (x_1, y_1), …, (x_N, y_N), then we wish to determine the γ that minimizes ‖y − Xγ‖^2 with
X = \begin{bmatrix} h_1(x_1) = 1 & h_2(x_1) & \cdots & h_K(x_1) \\ h_1(x_2) = 1 & h_2(x_2) & \cdots & h_K(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ h_1(x_N) = 1 & h_2(x_N) & \cdots & h_K(x_N) \end{bmatrix} . \qquad (7.4)

If the rank is K, i.e., the K columns of X are linearly independent, the matrix X^T X is nonsingular, and we obtain the solution γ̂ = (X^T X)^{−1} X^T y.

Example 55 If K = 4, then we have h_1(x) = 1, h_2(x) = x,
h_3(x) = d_1(x) − d_3(x) = \begin{cases} 0, & x ≤ α_1 \\ \dfrac{(x − α_1)^3}{α_4 − α_1}, & α_1 ≤ x ≤ α_3 \\ \dfrac{(x − α_1)^3}{α_4 − α_1} − \dfrac{(x − α_3)^3}{α_4 − α_3}, & α_3 ≤ x ≤ α_4 \\ (α_3 − α_1)(3x − α_1 − α_3 − α_4), & α_4 ≤ x \end{cases}
h_4(x) = d_2(x) − d_3(x) = \begin{cases} 0, & x ≤ α_2 \\ \dfrac{(x − α_2)^3}{α_4 − α_2}, & α_2 ≤ x ≤ α_3 \\ \dfrac{(x − α_2)^3}{α_4 − α_2} − \dfrac{(x − α_3)^3}{α_4 − α_3}, & α_3 ≤ x ≤ α_4 \\ (α_3 − α_2)(3x − α_2 − α_3 − α_4), & α_4 ≤ x \end{cases}

Hence, the lines for x ≤ α_1 and x ≥ α_4 are
f(x) = γ_1 + γ_2 x, \quad x ≤ α_1
f(x) = γ_1 + γ_2 x + γ_3 (α_3 − α_1)(3x − α_1 − α_3 − α_4) + γ_4 (α_3 − α_2)(3x − α_2 − α_3 − α_4), \quad x ≥ α_4 .

Example 56 We compare the ordinary and natural spline curves (Fig. 7.7). By definition, the natural spline becomes a line at both ends, and considerable differences between the two curves are observed near the points α_1 and α_K. The procedure is implemented via the following code.

n=100; x=rnorm(n)*2*pi; y=sin(x)+0.2*rnorm(n) ## Data Generation
K=11; knots=seq(-5,5,length=K); X=matrix(nrow=n,ncol=K+4)
for(i in 1:n){
  X[i,1]=1; X[i,2]=x[i]; X[i,3]=x[i]^2; X[i,4]=x[i]^3
  for(j in 1:K)X[i,j+4]=max((x[i]-knots[j])^3,0)
}
beta=solve(t(X)%*%X)%*%t(X)%*%y
f=function(x){ ## Spline Function
  S=beta[1]+beta[2]*x+beta[3]*x^2+beta[4]*x^3
  for(j in 1:K)S=S+beta[j+4]*max((x-knots[j])^3,0)
  return(S)
}
X=matrix(nrow=n,ncol=K); X[,1]=1
for(j in 2:K)for(i in 1:n)X[i,j]=h(j,x[i],knots)
gamma=solve(t(X)%*%X)%*%t(X)%*%y
g=function(x){ ## Natural Spline
  S=gamma[1]; for(j in 2:K)S=S+gamma[j]*h(j,x,knots); return(S)
}
u.seq=seq(-6,6,0.02) ## Drawing the graphs below
v.seq=NULL; for(u in u.seq)v.seq=c(v.seq,f(u))
plot(u.seq,v.seq,type="l",col="blue", yaxt="n", xlab="x",ylab="f(x),g(x)")
par(new=TRUE); w.seq=NULL
for(u in u.seq)w.seq=c(w.seq,g(u))
plot(u.seq,w.seq,type="l",col="red", yaxt="n", xlab="",ylab="")
par(new=TRUE)
legend(-3.7,1.1,c("Spline","Natural Spline"), lty=1, col=c("blue","red"))
points(x,y); abline(v=knots,lty=3); abline(v=c(-5,5),lwd=2); title("K=11")
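Similarly, a natural cubic spline basis is available as ns in the splines package; a minimal sketch, assuming the x, y, K, and knots of Example 56 remain in the workspace (interior knots knots[2:(K-1)], boundary knots knots[1] and knots[K]):

library(splines)
fit=lm(y~ns(x,knots=knots[2:(K-1)],Boundary.knots=c(knots[1],knots[K])))
u.seq=seq(-6,6,0.02)
lines(u.seq,predict(fit,data.frame(x=u.seq)),col="green") ## linear beyond the boundary knots, as above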

Fig. 7.7 Comparison of the ordinary (blue) and natural (red) splines when K = 6 (left) and K = 11 (right) in Example 56. While the natural spline becomes a line at each of the two ends, the two curves do not coincide inside the region, in particular near the borders

7.4 Smoothing Spline
Given observed data (x_1, y_1), …, (x_N, y_N), we wish to obtain the f : R → R that minimizes
L(f) := \sum_{i=1}^N (y_i − f(x_i))^2 + λ \int_{−∞}^{∞} \{ f″(x) \}^2 dx \qquad (7.5)
(smoothing spline), where λ ≥ 0 is a constant determined a priori. Suppose x_1 < ⋯ < x_N. The second term in (7.5) penalizes the complexity of the function f, and {f″(x)}^2 intuitively expresses how nonsmooth the function is at x. If f is linear, this value is zero. If λ is small, although the curve meanders, the curve is easier to


fit to the observed data. On the other hand, if λ is large, although the curve does not follow the observed data closely, the curve is smoother. First, we show that the optimal f is realized by a natural spline with knots x_1, …, x_N.

Proposition 21 (Green and Silverman, 1994) The natural spline f with knots x_1, …, x_N minimizes L(f).

See the appendix at the end of this chapter for the proof. Next, we obtain the coefficients γ_1, …, γ_N of such a natural spline f(x) = \sum_{i=1}^N γ_i h_i(x). Let G = (g_{i,j}) be the matrix with elements
g_{i,j} := \int_{−∞}^{∞} h_i″(x) h_j″(x) dx . \qquad (7.6)

Then, the second term of L(g) becomes
λ \int_{−∞}^{∞} \{g″(x)\}^2 dx = λ \int_{−∞}^{∞} \left( \sum_{i=1}^N γ_i h_i″(x) \right) \left( \sum_{j=1}^N γ_j h_j″(x) \right) dx
 = λ \sum_{i=1}^N \sum_{j=1}^N γ_i γ_j \int_{−∞}^{∞} h_i″(x) h_j″(x) dx = λ γ^T G γ .
Thus, by differentiating L(g) with respect to γ, as done to obtain the coefficients of ridge regression in Chap. 6, we find that the solution of
−X^T (y − Xγ) + λGγ = 0
is given by
γ̂ = (X^T X + λG)^{−1} X^T y .

Because the proof of the following proposition is complicated, it is provided in the appendix at the end of this chapter.

Proposition 22 The elements g_{i,j} defined in (7.6) are given by
g_{i,j} = \frac{(12 x_{N−1} + 6 x_{j−2} − 18 x_{i−2})(x_{N−1} − x_{j−2})^2 + 12 (x_{N−1} − x_{i−2})(x_{N−1} − x_{j−2})(x_N − x_{N−1})}{(x_N − x_{i−2})(x_N − x_{j−2})}
for x_i ≤ x_j, where g_{i,j} = 0 if either i ≤ 2 or j ≤ 2.

For example, by means of the following procedure, we can obtain the matrix G from the knots x_1 < ⋯ < x_N.

Fig. 7.8 (Smoothing Spline, N = 100) In the smoothing spline, we specify a parameter λ that expresses the smoothness instead of knots. For λ = 40, 400, 1000, we observe that the larger λ is, the more difficult it is to fit the curve g(x) to the observed data

G=function(x){ ## each x is assumed to be in ascending order
  n=length(x); g=matrix(0, nrow=n, ncol=n)
  for(i in 3:n)for(j in i:n){
    g[i,j]=12*(x[n]-x[n-1])*(x[n-1]-x[j-2])*(x[n-1]-x[i-2])/(x[n]-x[i-2])/(x[n]-x[j-2])+
      (12*x[n-1]+6*x[j-2]-18*x[i-2])*(x[n-1]-x[j-2])^2/(x[n]-x[i-2])/(x[n]-x[j-2])
    g[j,i]=g[i,j]
  }
  return(g)
}

Example 57 Computing the matrix G and γ̂ for each λ, we draw the smoothing spline curve. We observe that the larger λ is, the smoother the curve (Fig. 7.8). The procedure is implemented via the following code.

n=100; x=runif(n,-5,5); y=x+sin(x)*2+rnorm(n) ## Data Generation
index=order(x); x=x[index]; y=y[index]
X=matrix(nrow=n,ncol=n); X[,1]=1
for(j in 2:n)for(i in 1:n)X[i,j]=h(j,x[i],x) ## Generation of X
GG=G(x) ## Generation of G
lambda.set=c(40,400,1000); col.set=c("red","blue","green")
for(i in 1:3){
  lambda=lambda.set[i]
  gamma=solve(t(X)%*%X+lambda*GG)%*%t(X)%*%y
  g=function(u){S=gamma[1]; for(j in 2:n)S=S+gamma[j]*h(j,u,x); return(S)}
  u.seq=seq(-8,8,0.02); v.seq=NULL; for(u in u.seq)v.seq=c(v.seq,g(u))
  plot(u.seq,v.seq,type="l",yaxt="n", xlab="x",ylab="g(x)",ylim=c(-8,8), col=col.set[i])
  par(new=TRUE)
}
points(x,y); legend("topleft", paste0("lambda=",lambda.set), col=col.set, lty=1)
title("Smoothing Spline (n=100)")
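Base R's smooth.spline fits a smoothing spline directly and, by default, chooses the amount of smoothing by generalized cross-validation; a minimal sketch, assuming the x and y of Example 57 (its internal penalty parameterization differs from the λ used above, so the values are not directly comparable):

fit=smooth.spline(x,y) ## smoothing parameter selected by GCV
lines(fit,col="purple") ## overlay the fitted curve on the current plot
fit$df ## effective degrees of freedom of the selected fit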

Fig. 7.9 (Trace of H[λ] vs CV[λ]; horizontal axis: effective degrees of freedom, vertical axis: predictive error of CV) The larger λ is, the smaller the effective degrees of freedom. Even if the effective degrees of freedom are large, the predictive error of CV may increase

In ridge regression, we obtain a matrix of size (p + 1) × (p + 1). However, for the current problem, we must compute the inverse of a matrix of size N × N, so an approximation is needed for large N because the computation becomes expensive. If N is not large, however, the value of λ can be determined by cross-validation: Proposition 14 applies when the matrix X is given by (7.4) with K = N, and Proposition 15 applies when A is given by X^T X + λG. Thus, the predictive error of CV in Proposition 14 is given by
CV[λ] := \sum_S ‖ (I − H_S[λ])^{−1} e_S ‖^2 ,
where H_S[λ] := X_S (X^T X + λG)^{−1} X_S^T and the sum ranges over the test groups S. We construct the following procedure (Fig. 7.9).

cv.ss.fast=function(X,y,lambda,G,k){
  n=length(y); m=n/k
  H=X%*%solve(t(X)%*%X+lambda*G)%*%t(X); df=sum(diag(H))
  I=diag(rep(1,n)); e=(I-H)%*%y; I=diag(rep(1,m))
  S=0
  for(j in 1:k){
    test=((j-1)*m+1):(j*m)
    S=S+norm(solve(I-H[test,test])%*%e[test],"2")^2
  }
  return(list(score=S/n,df=df))
}

Note that if we set λ = 0, then the procedure is the same as cv.fast in Chap. 4. How much the value of λ affects the estimation of γ depends on several conditions, and we cannot compare λ values across different settings. Instead of λ, we often use the effective degrees of freedom, the trace of the matrix H[λ] := X (X^T X + λG)^{−1} X^T. The effective degrees of freedom express how well fitness and simplicity are balanced.


Example 58 For sample size N = 100 (for N > 100, we could not compute the inverse matrix; errors occurred due to memory shortage), changing the λ value from 1 to 50, we draw the graph of the effective degrees of freedom (the trace of H[λ]) against the predictive error of CV (CV[λ]). The execution is implemented via the following code.

## Data Generation
n=100; x=runif(n,-5,5); y=x-0.02*sin(x)-0.1*rnorm(n)
index=order(x); x=x[index]; y=y[index]
d=function(j,u)(max((u-x[j])^3,0)-max((u-x[n])^3,0))/(x[n]-x[j])
h=function(j,u)if(j==1)1 else if(j==2)u else d(j-2,u)-d(n-1,u)
X=matrix(nrow=n,ncol=n); X[,1]=1; for(j in 2:n)for(i in 1:n)X[i,j]=h(j,x[i])
GG=G(x)
## Plot of the effective degrees of freedom and the CV predictive error for each lambda
u=seq(1,50,1); v=NULL; w=NULL
for(lambda in u){
  result=cv.ss.fast(X,y,lambda,GG,n); v=c(v,result$df); w=c(w,result$score)
}
plot(v,w,type="l",col="red",xlab="Effective Degree of Freedom",ylab="The Predictive Error of CV")
title("The Effective Degree of Freedom and CV Predictive Error")

7.5 Local Regression
In this section, we consider the Nadaraya-Watson estimator and local linear regression. Let X be a set. We call a function k : X × X → R a kernel (in a strict sense) if
1. for any n ≥ 1 and x_1, …, x_n ∈ X, the matrix K ∈ R^{n×n} with K_{i,j} = k(x_i, x_j) is nonnegative definite (positive definiteness), and
2. for any x, y ∈ X, k(x, y) = k(y, x) (symmetry).
For example, if X is a vector space, its inner product is a kernel. In fact, from the definition of the inner product ⟨·,·⟩, i.e., ⟨x, y + z⟩ = ⟨x, y⟩ + ⟨x, z⟩, ⟨cx, y⟩ = c⟨x, y⟩, and ⟨x, x⟩ ≥ 0 for elements x, y, z of the vector space and a real number c, we have, for arbitrary a_1, …, a_n ∈ X and c_1, …, c_n ∈ R,
0 ≤ k\left( \sum_{i=1}^n c_i a_i , \sum_{j=1}^n c_j a_j \right) = \sum_i \sum_j c_i c_j k(a_i, a_j) = [c_1, …, c_n] \begin{bmatrix} k(a_1, a_1) & \cdots & k(a_1, a_n) \\ \vdots & \ddots & \vdots \\ k(a_n, a_1) & \cdots & k(a_n, a_n) \end{bmatrix} \begin{bmatrix} c_1 \\ \vdots \\ c_n \end{bmatrix} .
Kernels are used to express the similarity of two elements of the set X: the more similar x, y ∈ X are, the larger k(x, y) is.


Even if k : X × X → R does not satisfy positive definiteness, it can be used (we call such a kernel a kernel in a broader sense) if it accurately expresses the similarity.

Example 59 (Epanechnikov Kernel) The kernel k : X × X → R defined by
K_λ(x, y) = D\left( \frac{|x − y|}{λ} \right) , \qquad D(t) = \begin{cases} \frac{3}{4}(1 − t^2), & |t| ≤ 1 \\ 0, & otherwise \end{cases}
does not satisfy positive definiteness. In fact, when λ = 2, n = 3, x_1 = −1, x_2 = 0, x_3 = 1, the matrix with elements K_λ(x_i, x_j) can be expressed as
\begin{bmatrix} K_λ(x_1, x_1) & K_λ(x_1, x_2) & K_λ(x_1, x_3) \\ K_λ(x_2, x_1) & K_λ(x_2, x_2) & K_λ(x_2, x_3) \\ K_λ(x_3, x_1) & K_λ(x_3, x_2) & K_λ(x_3, x_3) \end{bmatrix} = \begin{bmatrix} 3/4 & 9/16 & 0 \\ 9/16 & 3/4 & 9/16 \\ 0 & 9/16 & 3/4 \end{bmatrix} .
We see that the determinant is 3^3/2^6 − 3^5/2^{10} − 3^5/2^{10} = −3^3/2^9. Since the determinant is equal to the product of the eigenvalues (Proposition 6), at least one of the three eigenvalues must be negative.

The Nadaraya-Watson estimator is constructed as
f̂(x) = \frac{\sum_{i=1}^N K(x, x_i) y_i}{\sum_{j=1}^N K(x, x_j)}
from observed data (x_1, y_1), …, (x_N, y_N) ∈ X × R, where X is a set and k : X × X → R is a kernel. Then, given a new data point x_* ∈ X, the estimator returns f̂(x_*), which weights y_1, …, y_N according to the ratios
\frac{K(x_*, x_1)}{\sum_{j=1}^N K(x_*, x_j)} , …, \frac{K(x_*, x_N)}{\sum_{j=1}^N K(x_*, x_j)} .
Since we assume that k(u, v) expresses the similarity between u, v ∈ X, the larger the weight on y_i, the more similar x_* and x_i are.

Example 60 We apply the Epanechnikov kernel to the Nadaraya-Watson estimator. The Nadaraya-Watson estimator executes successfully even for kernels that do not satisfy positive definiteness. For a given input x_* ∈ X, the weights are nonzero only for the y_i, i = 1, …, N, such that x_i − λ ≤ x_* ≤ x_i + λ. If the value of λ is small, the prediction is based on the (x_i, y_i) such that x_i lies in a small neighborhood of x_*. We present the results obtained by executing the following code in Fig. 7.10.
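As a quick numerical check of the claim in Example 59, we can form the 3 × 3 kernel matrix in R and inspect its determinant and eigenvalues; a minimal sketch (the definitions of D and K below mirror those used in the code of Example 60):

D=function(t)ifelse(abs(t)<=1,0.75*(1-t^2),0) ## Epanechnikov weight
K=function(x,y,lambda)D(abs(x-y)/lambda)
xs=c(-1,0,1); M=outer(xs,xs,K,lambda=2) ## kernel matrix of Example 59
det(M) ## equals -27/512 < 0
eigen(M)$values ## one eigenvalue is negative, so M is not nonnegative definite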


Fig. 7.10 (Nadaraya-Watson Estimator) We apply the Epanechnikov kernel to the Nadaraya-Watson estimator and draw curves for λ = 0.05, 0.25. Finally, we compute the optimal λ and draw the corresponding curve in the same graph (Example 60)

n=250; x=2*rnorm(n); y=sin(2*pi*x)+rnorm(n)/4 ## Data Generation D=function(t) max(0.75*(1−t^2),0) ## Function Definition D K=function(x,y,lambda) D(abs(x−y)/lambda) ## Function Definition K f=function(z,lambda){ ## Function definition f S=0; T=0; for(i in 1:n){S=S+K(x[i],z,lambda)*y[i]; T=T+K(x[i],z,lambda)} return(S/T) } plot(seq(−3,3,length=10),seq(−2,3,length=10),type="n",xlab="x", ylab="y"); points (x,y) xx=seq(−3,3,0.1) yy=NULL;for(zz in xx)yy=c(yy,f(zz,0.05)); lines(xx,yy,col="green") yy=NULL;for(zz in xx)yy=c(yy,f(zz,0.25)); lines(xx,yy,col="blue") ## Thus far, the curves for lambda=0.05, 0.25 have been shown m=n/10 lambda.seq=seq(0.05,1,0.01); SS.min=Inf for(lambda in lambda.seq){ SS=0 for(k in 1:10){ test=((k−1)*m+1):(k*m); train=setdiff(1:n,test) for(j in test){ u=0; v=0; for(i in train){ kk=K(x[i],x[j],lambda); u=u+kk*y[i]; v=v+kk } if(v==0){index=order(abs(x[j]−x[−j]))[1]; z=y[index]} else z=u/v SS=SS+(y[j]−z)^2 } } if(SS