Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond [1st ed.] 0-262-19475-9, 9780262194754


Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond

Bernhard Schölkopf and Alexander J. Smola

The MIT Press, Cambridge, Massachusetts; London, England


© 2002 Massachusetts Institute of Technology. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data
Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond / by Bernhard Schölkopf, Alexander J. Smola.
p. cm. Includes bibliographical references and index.
ISBN 0-262-19475-9 (alk. paper)
1. Machine learning. 2. Algorithms. 3. Kernel functions. I. Schölkopf, Bernhard. II. Smola, Alexander J.

Contents

Series Foreword
Preface

1 A Tutorial Introduction
  1.1 Data Representation and Similarity
  1.2 A Simple Pattern Recognition Algorithm
  1.3 Some Insights From Statistical Learning Theory
  1.4 Hyperplane Classifiers
  1.5 Support Vector Classification
  1.6 Support Vector Regression
  1.7 Kernel Principal Component Analysis
  1.8 Empirical Results and Implementations

I CONCEPTS AND TOOLS

2 Kernels
  2.1 Product Features
  2.2 The Representation of Similarities in Linear Spaces
  2.3 Examples and Properties of Kernels
  2.4 The Representation of Dissimilarities in Linear Spaces
  2.5 Summary
  2.6 Problems

3 Risk and Loss Functions
  3.1 Loss Functions
  3.2 Test Error and Expected Risk
  3.3 A Statistical Perspective
  3.4 Robust Estimators
  3.5 Summary
  3.6 Problems

4 Regularization
  4.1 The Regularized Risk Functional
  4.2 The Representer Theorem
  4.3 Regularization Operators
  4.4 Translation Invariant Kernels
  4.5 Translation Invariant Kernels in Higher Dimensions
  4.6 Dot Product Kernels
  4.7 Multi-Output Regularization
  4.8 Semiparametric Regularization
  4.9 Coefficient Based Regularization
  4.10 Summary
  4.11 Problems

5 Elements of Statistical Learning Theory
  5.1 Introduction
  5.2 The Law of Large Numbers
  5.3 When Does Learning Work: the Question of Consistency
  5.4 Uniform Convergence and Consistency
  5.5 How to Derive a VC Bound
  5.6 A Model Selection Example
  5.7 Summary
  5.8 Problems

6 Optimization
  6.1 Convex Optimization
  6.2 Unconstrained Problems
  6.3 Constrained Problems
  6.4 Interior Point Methods
  6.5 Maximum Search Problems
  6.6 Summary
  6.7 Problems

II SUPPORT VECTOR MACHINES

7 Pattern Recognition
  7.1 Separating Hyperplanes
  7.2 The Role of the Margin
  7.3 Optimal Margin Hyperplanes
  7.4 Nonlinear Support Vector Classifiers
  7.5 Soft Margin Hyperplanes
  7.6 Multi-Class Classification
  7.7 Variations on a Theme
  7.8 Experiments
  7.9 Summary
  7.10 Problems

8 Single-Class Problems: Quantile Estimation and Novelty Detection
  8.1 Introduction
  8.2 A Distribution's Support and Quantiles
  8.3 Algorithms
  8.4 Optimization
  8.5 Theory
  8.6 Discussion
  8.7 Experiments
  8.8 Summary
  8.9 Problems

9 Regression Estimation
  9.1 Linear Regression with Insensitive Loss Function
  9.2 Dual Problems
  9.3 ν-SV Regression
  9.4 Convex Combinations and ℓ1-Norms
  9.5 Parametric Insensitivity Models
  9.6 Applications
  9.7 Summary
  9.8 Problems

10 Implementation
  10.1 Tricks of the Trade
  10.2 Sparse Greedy Matrix Approximation
  10.3 Interior Point Algorithms
  10.4 Subset Selection Methods
  10.5 Sequential Minimal Optimization
  10.6 Iterative Methods
  10.7 Summary
  10.8 Problems

11 Incorporating Invariances
  11.1 Prior Knowledge
  11.2 Transformation Invariance
  11.3 The Virtual SV Method
  11.4 Constructing Invariance Kernels
  11.5 The Jittered SV Method
  11.6 Summary
  11.7 Problems

12 Learning Theory Revisited
  12.1 Concentration of Measure Inequalities
  12.2 Leave-One-Out Estimates
  12.3 PAC-Bayesian Bounds
  12.4 Operator-Theoretic Methods in Learning Theory
  12.5 Summary
  12.6 Problems

III KERNEL METHODS

13 Designing Kernels
  13.1 Tricks for Constructing Kernels
  13.2 String Kernels
  13.3 Locality-Improved Kernels
  13.4 Natural Kernels
  13.5 Summary
  13.6 Problems

14 Kernel Feature Extraction
  14.1 Introduction
  14.2 Kernel PCA
  14.3 Kernel PCA Experiments
  14.4 A Framework for Feature Extraction
  14.5 Algorithms for Sparse KFA
  14.6 KFA Experiments
  14.7 Summary
  14.8 Problems

15 Kernel Fisher Discriminant
  15.1 Introduction
  15.2 Fisher's Discriminant in Feature Space
  15.3 Efficient Training of Kernel Fisher Discriminants
  15.4 Probabilistic Outputs
  15.5 Experiments
  15.6 Summary
  15.7 Problems

16 Bayesian Kernel Methods
  16.1 Bayesics
  16.2 Inference Methods
  16.3 Gaussian Processes
  16.4 Implementation of Gaussian Processes
  16.5 Laplacian Processes
  16.6 Relevance Vector Machines
  16.7 Summary
  16.8 Problems

17 Regularized Principal Manifolds
  17.1 A Coding Framework
  17.2 A Regularized Quantization Functional
  17.3 An Algorithm for Minimizing R_reg[f]
  17.4 Connections to Other Algorithms
  17.5 Uniform Convergence Bounds
  17.6 Experiments
  17.7 Summary
  17.8 Problems

18 Pre-Images and Reduced Set Methods
  18.1 The Pre-Image Problem
  18.2 Finding Approximate Pre-Images
  18.3 Reduced Set Methods
  18.4 Reduced Set Selection Methods
  18.5 Reduced Set Construction Methods
  18.6 Sequential Evaluation of Reduced Set Expansions
  18.7 Summary
  18.8 Problems

A Addenda
  A.1 Data Sets
  A.2 Proofs

B Mathematical Prerequisites
  B.1 Probability
  B.2 Linear Algebra
  B.3 Functional Analysis

References
Index
Notation and Symbols

Series Foreword

The goal of building systems that can adapt to their environments and learn from their experience has attracted researchers from many fields, including computer science, engineering, mathematics, physics, neuroscience, and cognitive science. Out of this research has come a wide variety of learning techniques that have the potential to transform many scientific and industrial fields. Recently, several research communities have converged on a common set of issues surrounding supervised, unsupervised, and reinforcement learning problems. The MIT Press series on Adaptive Computation and Machine Learning seeks to unify the many diverse strands of machine learning research and to foster high quality research and innovative applications.

Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond is an excellent illustration of this convergence of ideas from many fields. The development of kernel-based learning methods has resulted from a combination of machine learning theory, optimization algorithms from operations research, and kernel techniques from mathematical analysis. These three ideas have spread far beyond the original support-vector machine algorithm: Virtually every learning algorithm has been redesigned to exploit the power of kernel methods. Bernhard Schölkopf and Alexander Smola have written a comprehensive, yet accessible, account of these developments. This volume includes all of the mathematical and algorithmic background needed not only to obtain a basic understanding of the material but to master it. Students and researchers who study this book will be able to apply kernel methods in creative ways to solve a wide range of problems in science and engineering.

Thomas Dietterich


Preface

One of the most fortunate situations a scientist can encounter is to enter a field in its infancy. There is a large choice of topics to work on, and many of the issues are conceptual rather than merely technical. Over the last seven years, we have had the privilege to be in this position with regard to the field of Support Vector Machines (SVMs). We began working on our respective doctoral dissertations in 1994 and 1996. Upon completion, we decided to combine our efforts and write a book about SVMs.

Since then, the field has developed impressively, and has to an extent been transformed. We set up a website that quickly became the central repository for the new community, and a number of workshops were organized by various researchers. The scope of the field has now widened significantly, both in terms of new algorithms, such as kernel methods other than SVMs, and in terms of a deeper theoretical understanding being gained. It has become clear that kernel methods provide a framework for tackling some rather profound issues in machine learning theory. At the same time, successful applications have demonstrated that SVMs not only have a more solid foundation than artificial neural networks, but are able to serve as a replacement for neural networks that perform as well or better, in a wide variety of fields. Standard neural network and pattern recognition textbooks have now started including chapters on SVMs and kernel PCA (for instance, [235, 153]).

While these developments took place, we were trying to strike a balance between pursuing exciting new research and making progress with the slowly growing manuscript of this book. In the two and a half years that we worked on the book, we learned a number of lessons that we suspect everyone writing a scientific monograph, or any other book, will encounter. First, writing a book is more work than you think, even with two authors sharing the work in equal parts. Second, our book got longer than planned. Once we exceeded the initially planned length of 500 pages, we got worried. In fact, the manuscript kept growing even after we stopped writing new chapters, and began polishing things and incorporating corrections suggested by colleagues. This was mainly due to the fact that the book deals with a fascinating new area, and researchers keep adding fresh material to the body of knowledge. We learned that there is no asymptotic regime in writing such a book; if one does not stop, it will grow beyond any bound, unless one starts cutting. We therefore had to take painful decisions to leave out material that we originally thought should be in the book. Sadly, and this is the third point, the book thus contains less material than originally planned, especially on the subject of theoretical developments.


We sincerely apologize to all researchers who feel that their contributions should have been included; the book is certainly biased towards our own work, and does not provide a fully comprehensive overview of the field. We did, however, aim to provide all the necessary concepts and ideas to enable a reader equipped with some basic mathematical knowledge to enter the engaging world of machine learning, using theoretically well-founded kernel algorithms, and to understand and apply the powerful algorithms that have been developed over the last few years.

The book is divided into three logical parts. Each part consists of a brief introduction and a number of technical chapters. In addition, we include two appendices containing addenda, technical details, and mathematical prerequisites. Each chapter begins with a short discussion outlining the contents and prerequisites; for some of the longer chapters, we include a graph that sketches the logical structure and dependencies between the sections. At the end of most chapters, we include a set of problems, ranging from simple exercises to hard ones, marked according to their difficulty; in addition, we describe open problems and questions for future research.¹ The latter often represent worthwhile projects for a research publication, or even a thesis. References are also included in some of the problems. These references contain the solutions to the associated problems, or at least significant parts thereof.

The overall structure of the book is perhaps somewhat unusual. Rather than presenting a logical progression of chapters building upon each other, we occasionally touch on a subject briefly, only to revisit it later in more detail. For readers who are used to reading scientific monographs and textbooks from cover to cover, this will amount to some redundancy. We hope, however, that some readers, who are more selective in their reading habits (or less generous with their time), and only look at those chapters that they are interested in, will benefit. Indeed, nobody is expected to read every chapter. Some chapters are fairly technical, and cover material included for reasons of completeness. Other chapters, which are more relevant to the central subjects of the book, are kept simpler, and should be accessible to undergraduate students. In a way, this book thus contains several books in one. For instance, the first chapter can be read as a standalone "executive summary" of Support Vector and kernel methods. This chapter should also provide a fast entry point for practitioners. Someone interested in applying SVMs to a pattern recognition problem might want to read Chapters 1 and 7 only. A reader thinking of building their own SVM implementation could additionally read Chapter 10, and parts of Chapter 6. Those who would like to get actively involved in research aspects of kernel methods, for example by "kernelizing" a new algorithm, should probably read at least Chapters 1 and 2.

A one-semester undergraduate course on learning with kernels could include the material of Chapters 1, 2.1-2.3, 3.1-3.2, 5.1-5.2, 6.1-6.3, 7.

1. We suggest that authors post their solutions on the book website, www.learning-with-kernels.org.


If there is more time, one of the Chapters 14, 16, or 17 can be added, or 4.1-4.2. A graduate course could additionally deal with the more advanced parts of Chapters 3, 4, and 5. The remaining chapters provide ample material for specialized courses and seminars.

As a general time-saving rule, we recommend reading the first chapter and then jumping directly to the chapter of particular interest to the reader. Chances are that this will lead to a chapter that contains references to the earlier ones, which can then be followed as desired. We hope that this way, readers will inadvertently be tempted to venture into some of the less frequented chapters and research areas. Explore this book; there is a lot to find, and much more is yet to be discovered in the field of learning with kernels.

We conclude the preface by thanking those who assisted us in the preparation of the book. Our first thanks go to our first readers. Chris Burges, Arthur Gretton, and Bob Williamson have read through various versions of the book, and made numerous suggestions that corrected or improved the material. A number of other researchers have proofread various chapters. We would like to thank Matt Beal, Daniel Berger, Olivier Bousquet, Ben Bradshaw, Nicolò Cesa-Bianchi, Olivier Chapelle, Dennis DeCoste, André Elisseeff, Anita Faul, Arnulf Graf, Isabelle Guyon, Ralf Herbrich, Simon Hill, Dominik Janzing, Michael Jordan, Sathiya Keerthi, Neil Lawrence, Ben O'Loghlin, Ulrike von Luxburg, Davide Mattera, Sebastian Mika, Natasa Milic-Frayling, Marta Milo, Klaus Müller, Dave Musicant, Fernando Pérez Cruz, Ingo Steinwart, Mike Tipping, and Chris Williams.

In addition, a large number of people have contributed to this book in one way or another, be it by sharing their insights with us in discussions, or by collaborating with us on some of the topics covered in the book. In many places, this strongly influenced the presentation of the material. We would like to thank Dimitris Achlioptas, Luís Almeida, Shun-Ichi Amari, Peter Bartlett, Jonathan Baxter, Tony Bell, Shai Ben-David, Kristin Bennett, Matthias Bethge, Chris Bishop, Andrew Blake, Volker Blanz, Léon Bottou, Paul Bradley, Chris Burges, Heinrich Bülthoff, Olivier Chapelle, Nello Cristianini, Corinna Cortes, Cameron Dawson, Tom Dietterich, André Elisseeff, Oscar de Feo, Federico Girosi, Thore Graepel, Isabelle Guyon, Patrick Haffner, Stefan Harmeling, Paul Hayton, Markus Hegland, Ralf Herbrich, Tommi Jaakkola, Michael Jordan, Jyrki Kivinen, Yann LeCun, Chi-Jen Lin, Gabor Lugosi, Olvi Mangasarian, Laurent Massoulie, Sebastian Mika, Sayan Mukherjee, Klaus Müller, Noboru Murata, Nuria Oliver, John Platt, Tomaso Poggio, Gunnar Rätsch, Sami Romdhani, Rainer von Sachs, Christoph Schnörr, Matthias Seeger, John Shawe-Taylor, Kristy Sim, Patrice Simard, Stephen Smale, Sara Solla, Lionel Tarassenko, Lily Tian, Mike Tipping, Alexander Tsybakov, Lou van den Dries, Santosh Venkatesh, Thomas Vetter, Chris Watkins, Jason Weston, Chris Williams, Bob Williamson, Andreas Ziehe, Alex Zien, and Tong Zhang.


Next, we would like to extend our thanks to the research institutes that allowed us to pursue our research interests and to dedicate the time necessary for writing the present book; these are AT&T / Bell Laboratories (Holmdel), the Australian National University (Canberra), Biowulf Technologies (New York), GMD FIRST (Berlin), the Max-Planck-Institute for Biological Cybernetics (Tübingen), and Microsoft Research (Cambridge). We are grateful to Doug Sery from MIT Press for continuing support and encouragement during the writing of this book. We are, moreover, indebted to funding from various sources; specifically, from the Studienstiftung des deutschen Volkes, the Deutsche Forschungsgemeinschaft, the Australian Research Council, and the European Union. Finally, special thanks go to Vladimir Vapnik, who introduced us to the fascinating world of statistical learning theory.

. . . the story of the sheep dog who was herding his sheep, and serendipitously invented both large margin classification and Sheep Vectors. . .

Illustration by Ana Martín Larrañaga


1 A Tutorial Introduction

Overview
This chapter describes the central ideas of Support Vector (SV) learning in a nutshell. Its goal is to provide an overview of the basic concepts. One such concept is that of a kernel. Rather than going immediately into mathematical detail, we introduce kernels informally as similarity measures that arise from a particular representation of patterns (Section 1.1), and describe a simple kernel algorithm for pattern recognition (Section 1.2). Following this, we report some basic insights from statistical learning theory, the mathematical theory that underlies SV learning (Section 1.3). Finally, we briefly review some of the main kernel algorithms, namely Support Vector Machines (SVMs) (Sections 1.4 to 1.6) and kernel principal component analysis (Section 1.7).

Prerequisites
We have aimed to keep this introductory chapter as basic as possible, whilst giving a fairly comprehensive overview of the main ideas that will be discussed in the present book. After reading it, readers should be able to place all the remaining material in the book in context and judge which of the following chapters is of particular interest to them. As a consequence of this aim, most of the claims in the chapter are not proven. Abundant references to later chapters will enable the interested reader to fill in the gaps at a later stage, without losing sight of the main ideas described presently.

1.1 Data Representation and Similarity

Training Data
One of the fundamental problems of learning theory is the following: suppose we are given two classes of objects. We are then faced with a new object, and we have to assign it to one of the two classes. This problem can be formalized as follows: we are given empirical data

(x₁, y₁), …, (x_m, y_m) ∈ 𝒳 × {±1}.   (1.1)

Here, 𝒳 is some nonempty set from which the patterns x_i (sometimes called cases, inputs, instances, or observations) are taken, usually referred to as the domain; the y_i are called labels, targets, outputs, or sometimes also observations.¹

1. Note that we use the term pattern to refer to individual observations. A (smaller) part of the existing literature reserves the term for a generic prototype which underlies the data. The latter is probably closer to the original meaning of the term; however, we decided to stick with the present usage, which is more common in the field of machine learning.


Note that there are only two classes of patterns. For the sake of mathematical convenience, they are labelled by +1 and −1, respectively. This is a particularly simple situation, referred to as (binary) pattern recognition or (binary) classification.

It should be emphasized that the patterns could be just about anything, and we have made no assumptions on 𝒳 other than it being a set. For instance, the task might be to categorize sheep into two classes, in which case the patterns x_i would simply be sheep.

In order to study the problem of learning, however, we need an additional type of structure. In learning, we want to be able to generalize to unseen data points. In the case of pattern recognition, this means that given some new pattern x ∈ 𝒳, we want to predict the corresponding y ∈ {±1}.² By this we mean, loosely speaking, that we choose y such that (x, y) is in some sense similar to the training examples (1.1). To this end, we need notions of similarity in 𝒳 and in {±1}. Characterizing the similarity of the outputs {±1} is easy: in binary classification, only two situations can occur: two labels can either be identical or different. The choice of the similarity measure for the inputs, on the other hand, is a deep question that lies at the core of the field of machine learning.

Let us consider a similarity measure of the form

k : 𝒳 × 𝒳 → ℝ, (x, x′) ↦ k(x, x′),   (1.2)

that is, a function that, given two patterns x and x′, returns a real number characterizing their similarity. Unless stated otherwise, we will assume that k is symmetric, that is, k(x, x′) = k(x′, x) for all x, x′ ∈ 𝒳. For reasons that will become clear later (cf. Remark 2.16), the function k is called a kernel [359, 4, 42, 62, 223].

General similarity measures of this form are rather difficult to study. Let us therefore start from a particularly simple case, and generalize it subsequently.

Dot Product
A simple type of similarity measure that is of particular mathematical appeal is a dot product. For instance, given two vectors x, x′ ∈ ℝᴺ, the canonical dot product is defined as

⟨x, x′⟩ := Σ_{i=1}^{N} [x]_i [x′]_i.   (1.3)

Here, [x]_i denotes the ith entry of x. Note that the dot product is also referred to as inner product or scalar product, and sometimes denoted with round brackets and a dot, as (x · x′); this is where the "dot" in the name comes from. In Section B.2, we give a general definition of dot products. Usually, however, it is sufficient to think of dot products as (1.3).
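As a small numerical aside (not part of the original text), the canonical dot product (1.3) and the geometric quantities derived from it in the following paragraphs can be computed in a few lines. This sketch assumes NumPy is available; the example vectors are arbitrary.

```python
import numpy as np

# Two example patterns represented as vectors in R^3 (values chosen arbitrarily).
x = np.array([1.0, 2.0, 0.5])
x_prime = np.array([0.5, 1.0, 2.0])

dot = np.dot(x, x_prime)                 # canonical dot product (1.3)
norm_x = np.sqrt(np.dot(x, x))           # length of x
norm_xp = np.sqrt(np.dot(x_prime, x_prime))
cosine = dot / (norm_x * norm_xp)        # cosine of the enclosed angle
distance = np.linalg.norm(x - x_prime)   # distance as length of the difference vector

print(dot, norm_x, cosine, distance)
```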

2. Doing this for every x ∈ 𝒳 amounts to estimating a function f : 𝒳 → {±1}.


The geometric interpretation of the canonical dot product is that it computes the cosine of the angle between the vectors x and x′, provided they are normalized to length 1. Moreover, it allows computation of the length (or norm) of a vector x as

Length
‖x‖ = √⟨x, x⟩.   (1.4)

Likewise, the distance between two vectors is computed as the length of the difference vector. Therefore, being able to compute dot products amounts to being able to carry out all geometric constructions that can be formulated in terms of angles, lengths and distances.

Note, however, that the dot product approach is not really sufficiently general to deal with many interesting problems. First, we have deliberately not made the assumption that the patterns actually exist in a dot product space. So far, they could be any kind of object. In order to be able to use a dot product as a similarity measure, we therefore first need to represent the patterns as vectors in some dot product space ℋ (which need not coincide with ℝᴺ). To this end, we use a map

Φ : 𝒳 → ℋ, x ↦ x := Φ(x).   (1.5)

Second, even if the original patterns exist in a dot product space, we may still want to consider more general similarity measures obtained by applying a map (1.5). In that case, Φ will typically be a nonlinear map. An example that we will consider in Chapter 2 is a map which computes products of entries of the input patterns.

Feature Space
In both the above cases, the space ℋ is called a feature space. Note that we have used a bold face x to denote the vectorial representation of x in the feature space. We will follow this convention throughout the book.

To summarize, embedding the data into ℋ via Φ has three benefits:

1. It lets us define a similarity measure from the dot product in ℋ,

   k(x, x′) := ⟨x, x′⟩ = ⟨Φ(x), Φ(x′)⟩.   (1.6)

2. It allows us to deal with the patterns geometrically, and thus lets us study learning algorithms using linear algebra and analytic geometry.

3. The freedom to choose the mapping Φ will enable us to design a large variety of similarity measures and learning algorithms. This also applies to the situation where the inputs x_i already exist in a dot product space. In that case, we might directly use the dot product as a similarity measure. However, nothing prevents us from first applying a possibly nonlinear map Φ to change the representation into one that is more suitable for a given problem.

This will be elaborated in Chapter 2, where the theory of kernels is developed in more detail.
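To make (1.5) and (1.6) concrete, here is a minimal sketch of a feature map and the corresponding kernel. It uses a degree-2 product-feature map of the kind alluded to above (products of entries of the input patterns); the function names phi and k, and the particular inputs, are illustrative choices rather than anything prescribed by the text. The sketch assumes NumPy.

```python
import numpy as np

def phi(x):
    """A degree-2 product-feature map for 2D inputs: a toy instance of (1.5).

    Maps x = (x1, x2) to (x1^2, sqrt(2)*x1*x2, x2^2), so that the dot product
    in feature space equals the squared dot product in input space.
    """
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def k(x, x_prime):
    """Polynomial kernel of degree 2, evaluated without mapping to feature space."""
    return np.dot(x, x_prime) ** 2

x = np.array([1.0, 2.0])
x_prime = np.array([3.0, -1.0])

lhs = k(x, x_prime)                 # kernel evaluated in input space
rhs = np.dot(phi(x), phi(x_prime))  # dot product after the feature map, cf. (1.6)
assert np.isclose(lhs, rhs)
print(lhs, rhs)                     # both equal (<x, x'>)^2 = 1
```

The point of the sketch is only that the two numbers agree: the kernel acts as a dot product in a feature space without the feature vectors ever having to be computed.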


1.2 A Simple Pattern Recognition Algorithm

We are now in the position to describe a pattern recognition learning algorithm that is arguably one of the simplest possible. We make use of the structure introduced in the previous section; that is, we assume that our data are embedded into a dot product space ℋ.³ Using the dot product, we can measure distances in this space. The basic idea of the algorithm is to assign a previously unseen pattern to the class with closer mean.

3. For the definition of a dot product space, see Section B.2.

We thus begin by computing the means of the two classes in feature space,

c₊ = (1/m₊) Σ_{i: yᵢ = +1} xᵢ,   (1.7)
c₋ = (1/m₋) Σ_{i: yᵢ = −1} xᵢ,   (1.8)

where m₊ and m₋ are the number of examples with positive and negative labels, respectively. We assume that both classes are non-empty, thus m₊, m₋ > 0. We assign a new point x to the class whose mean is closest (Figure 1.1). This geometric construction can be formulated in terms of the dot product ⟨·, ·⟩. Half way between c₊ and c₋ lies the point c := (c₊ + c₋)/2. We compute the class of x by checking whether the vector x − c connecting c to x encloses an angle smaller than π/2 with the vector w := c₊ − c₋ connecting the class means. This leads to

y = sgn ⟨(x − c), w⟩
  = sgn ⟨(x − (c₊ + c₋)/2), (c₊ − c₋)⟩
  = sgn (⟨x, c₊⟩ − ⟨x, c₋⟩ + b).   (1.9)

Here, we have defined the offset

b := ½ (‖c₋‖² − ‖c₊‖²),   (1.10)

with the norm ‖x‖ := √⟨x, x⟩. If the class means have the same distance to the origin, then b will vanish.

Note that (1.9) induces a decision boundary which has the form of a hyperplane (Figure 1.1); that is, a set of points that satisfy a constraint expressible as a linear equation.
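The construction (1.7)-(1.10) can be written down directly when the feature vectors are given explicitly. The following sketch (assuming NumPy; the data are made up) computes the class means, the offset b, and the decision rule (1.9) for a single test point.

```python
import numpy as np

# Feature-space vectors for a toy problem (values made up); labels are +1 / -1.
X = np.array([[1.0, 1.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([+1, +1, -1, -1])

c_plus = X[y == +1].mean(axis=0)    # class mean c_+, cf. (1.7)
c_minus = X[y == -1].mean(axis=0)   # class mean c_-, cf. (1.8)
w = c_plus - c_minus                # vector connecting the class means
b = 0.5 * (np.dot(c_minus, c_minus) - np.dot(c_plus, c_plus))   # offset (1.10)

x_new = np.array([0.5, 1.5])
label = np.sign(np.dot(x_new, c_plus) - np.dot(x_new, c_minus) + b)   # decision rule (1.9)
print(label)   # expected: +1, since x_new lies closer to the positive mean
```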


Figure 1.1  A simple geometric classification algorithm: given two classes of points (depicted by 'o' and '+'), compute their means c₊, c₋ and assign a test pattern x to the one whose mean is closer. This can be done by looking at the dot product between x − c (where c = (c₊ + c₋)/2) and w := c₊ − c₋, which changes sign as the enclosed angle passes through π/2. Note that the corresponding decision boundary is a hyperplane (the dotted line) orthogonal to w.

Decision Function
It is instructive to rewrite (1.9) in terms of the input patterns xᵢ, using the kernel k to compute the dot products. Note, however, that (1.6) only tells us how to compute the dot products between vectorial representations xᵢ of inputs xᵢ. We therefore need to express the vectors c₊, c₋ and w in terms of x₁, …, x_m. To this end, substitute (1.7) and (1.8) into (1.9) to get the decision function

y = sgn( (1/m₊) Σ_{i: yᵢ = +1} ⟨x, xᵢ⟩ − (1/m₋) Σ_{i: yᵢ = −1} ⟨x, xᵢ⟩ + b )
  = sgn( (1/m₊) Σ_{i: yᵢ = +1} k(x, xᵢ) − (1/m₋) Σ_{i: yᵢ = −1} k(x, xᵢ) + b ).   (1.11)

Similarly, the offset becomes

b := ½ ( (1/m₋²) Σ_{(i,j): yᵢ = yⱼ = −1} k(xᵢ, xⱼ) − (1/m₊²) Σ_{(i,j): yᵢ = yⱼ = +1} k(xᵢ, xⱼ) ).   (1.12)
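The kernelized form (1.11)-(1.12) needs only kernel evaluations, never the feature vectors themselves. Below is a minimal sketch of this mean-based classifier; it assumes NumPy, and the Gaussian kernel and toy data are illustrative choices, not something mandated by the text.

```python
import numpy as np

def gaussian_kernel(x, x_prime, sigma=1.0):
    # One possible choice of kernel k; any symmetric similarity measure would do.
    return np.exp(-np.linalg.norm(x - x_prime) ** 2 / (2 * sigma ** 2))

def train_mean_classifier(X, y, k):
    """Precompute the offset b of (1.12) for the simple mean-based classifier."""
    pos, neg = X[y == +1], X[y == -1]
    k_pos = np.mean([k(xi, xj) for xi in pos for xj in pos])  # (1/m_+^2) * sum over positive pairs
    k_neg = np.mean([k(xi, xj) for xi in neg for xj in neg])  # (1/m_-^2) * sum over negative pairs
    b = 0.5 * (k_neg - k_pos)
    return pos, neg, b

def predict(x, pos, neg, b, k):
    """Decision function (1.11): compare the mean similarity to each class."""
    s_pos = np.mean([k(x, xi) for xi in pos])
    s_neg = np.mean([k(x, xi) for xi in neg])
    return int(np.sign(s_pos - s_neg + b))

# Toy data: two clusters in the plane (made-up values).
X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]])
y = np.array([-1, -1, +1, +1])

pos, neg, b = train_mean_classifier(X, y, gaussian_kernel)
print(predict(np.array([1.8, 2.2]), pos, neg, b, gaussian_kernel))  # expected: +1
```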

Surprisingly, it turns out that this rather simple-minded approach contains a well-known statistical classification method as a special case. Assume that the class means have the same distance to the origin (hence b = 0, cf. (1.10)), and that k can be viewed as a probability density when one of its arguments is fixed. By this we mean that it is positive and has unit integral,⁴

∫_𝒳 k(x, x′) dx = 1 for all x′ ∈ 𝒳.   (1.13)

4. In order to state this assumption, we have to require that we can define an integral on 𝒳.


In this case, (1.11) takes the form of the so-called Bayes classifier separating the two classes, subject to the assumption that the two classes of patterns were generated by sampling from two probability distributions that are correctly estimated by the Parzen windows estimators of the two class densities,

Parzen Windows
p₊(x) := (1/m₊) Σ_{i: yᵢ = +1} k(x, xᵢ)  and  p₋(x) := (1/m₋) Σ_{i: yᵢ = −1} k(x, xᵢ),   (1.14)

where x ∈ 𝒳. Given some point x, the label is then simply computed by checking which of the two values p₊(x) or p₋(x) is larger, which leads directly to (1.11). Note that this decision is the best we can do if we have no prior information about the probabilities of the two classes.

The classifier (1.11) is quite close to the type of classifier that this book deals with in detail. Both take the form of kernel expansions on the input domain,

y = sgn( Σ_{i=1}^{m} αᵢ k(x, xᵢ) + b ).   (1.15)

In both cases, the expansions correspond to a separating hyperplane in a feature space. In this sense, the αᵢ can be considered a dual representation of the hyperplane's normal vector [223]. Both classifiers are example-based in the sense that the kernels are centered on the training patterns; that is, one of the two arguments of the kernel is always a training pattern. A test point is classified by comparing it to all the training points that appear in (1.15) with a nonzero weight.

More sophisticated classification techniques, to be discussed in the remainder of the book, deviate from (1.11) mainly in the selection of the patterns on which the kernels are centered and in the choice of weights αᵢ that are placed on the individual kernels in the decision function. It will no longer be the case that all training patterns appear in the kernel expansion, and the weights of the kernels in the expansion will no longer be uniform within the classes; recall that in the current example, cf. (1.11), the weights are either (1/m₊) or (−1/m₋), depending on the class to which the pattern belongs. In the feature space representation, this statement corresponds to saying that we will study normal vectors w of decision hyperplanes that can be represented as general linear combinations (i.e., with non-uniform coefficients) of the training patterns. For instance, we might want to remove the influence of patterns that are very far away from the decision boundary, either since we expect that they will not improve the generalization error of the decision function, or since we would like to reduce the computational cost of evaluating the decision function (cf. (1.11)). The hyperplane will then only depend on a subset of training patterns, called Support Vectors.
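The Parzen-windows connection can also be checked numerically. The following sketch (assuming NumPy; data and bandwidth are made up) uses a normalized Gaussian kernel, which has unit integral as required by (1.13), estimates the two class densities as in (1.14), and labels a test point by comparing them; with b = 0 this coincides with (1.11).

```python
import numpy as np

def normalized_gaussian(x, x_prime, sigma=1.0):
    # A kernel with unit integral over R^d (cf. (1.13)), so it can act as a density.
    d = x.shape[0]
    sq_dist = np.linalg.norm(x - x_prime) ** 2
    return np.exp(-sq_dist / (2 * sigma ** 2)) / ((2 * np.pi * sigma ** 2) ** (d / 2))

def parzen_density(x, points, k):
    # Parzen windows estimate (1.14): average kernel value over one class.
    return np.mean([k(x, xi) for xi in points])

# Made-up 2D training data for the two classes.
X_pos = np.array([[2.0, 2.0], [2.2, 1.8], [1.9, 2.1]])
X_neg = np.array([[0.0, 0.0], [-0.1, 0.2], [0.2, -0.2]])

x_test = np.array([1.5, 1.5])
p_plus = parzen_density(x_test, X_pos, normalized_gaussian)
p_minus = parzen_density(x_test, X_neg, normalized_gaussian)

# Label by comparing the two density estimates, as described above;
# with b = 0 this is exactly the decision function (1.11).
label = +1 if p_plus > p_minus else -1
print(p_plus, p_minus, label)
```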

1.3 Some Insights From Statistical Learning Theory

With the above example in mind, let us now consider the problem of pattern recognition in a slightly more formal setting [559, 152, 186]. This will allow us to indicate the factors affecting the design of "better" algorithms. Rather than just providing tools to come up with new algorithms, we also want to provide some insight in how to do it in a promising way.

In two-class pattern recognition, we seek to infer a function

f : 𝒳 → {±1}   (1.16)

from input-output training data (1.1). The training data are sometimes also called the sample.

Figure 1.2  2D toy example of binary classification, solved using three models (the decision boundaries are shown). The models vary in complexity, ranging from a simple one (left), which misclassifies a large number of points, to a complex one (right), which "trusts" each point and comes up with a solution that is consistent with all training points (but may not work well on new points). As an aside, the plots were generated using the so-called soft margin SVM to be explained in Chapter 7; cf. also Figure 7.10.

Figure 1.2 shows a simple 2D toy example of a pattern recognition problem. The task is to separate the solid dots from the circles by finding a function which takes the value +1 on the dots and −1 on the circles. Note that instead of plotting this function, we may plot the boundaries where it switches between +1 and −1. In the rightmost plot, we see a classification function which correctly separates all training points. From this picture, however, it is unclear whether the same would hold true for test points which stem from the same underlying regularity. For instance, what should happen to a test point which lies close to one of the two "outliers," sitting amidst points of the opposite class? Maybe the outliers should not be allowed to claim their own custom-made regions of the decision function. To avoid this, we could try to go for a simpler model which disregards these points. The leftmost picture shows an almost linear separation of the classes. This separation, however, not only misclassifies the above two outliers, but also a number of "easy" points which are so close to the decision boundary that the classifier really should be able to get them right. Finally, the central picture represents a compromise, by using a model with an intermediate complexity, which gets most points right, without putting too much trust in any individual point.

The goal of statistical learning theory is to place these intuitive arguments in a mathematical framework. To this end, it studies mathematical properties of learning machines.


These properties are usually properties of the function class that the learning machine can implement.

IID Data
We assume that the data are generated independently from some unknown (but fixed) probability distribution P(x, y).⁵ This is a standard assumption in learning theory; data generated this way is commonly referred to as iid (independent and identically distributed). Our goal is to find a function f that will correctly classify unseen examples (x, y), so that f(x) = y for examples (x, y) that are also generated from P(x, y).⁶

Loss Function
Correctness of the classification is measured by means of the zero-one loss function c(x, y, f(x)) := ½ |f(x) − y|. Note that the loss is 0 if (x, y) is classified correctly, and 1 otherwise.

Test Data
If we put no restriction on the set of functions from which we choose our estimated f, however, then even a function that does very well on the training data, e.g., by satisfying f(xᵢ) = yᵢ for all i = 1, …, m, might not generalize well to unseen examples. To see this, note that for each function f and any test set (x̄₁, ȳ₁), …, (x̄_m̄, ȳ_m̄) ∈ 𝒳 × {±1} satisfying {x̄₁, …, x̄_m̄} ∩ {x₁, …, x_m} = ∅, there exists another function f* such that f*(xᵢ) = f(xᵢ) for all i = 1, …, m, yet f*(x̄ᵢ) ≠ f(x̄ᵢ) for all i = 1, …, m̄ (cf. Figure 1.3). As we are only given the training data, we have no means of selecting which of the two functions (and hence which of the two different sets of test label predictions) is preferable.

Figure 1.3  A 1D classification problem, with a training set of three points (marked by circles), and three test inputs (marked on the x-axis). Classification is performed by thresholding real-valued functions g(x) according to sgn(g(x)). Note that both functions (dotted line, and solid line) perfectly explain the training data, but they give opposite predictions on the test inputs. Lacking any further information, the training data alone give us no means to tell which of the two functions is to be preferred.

Empirical Risk
We conclude that minimizing only the (average) training error (or empirical risk),

R_emp[f] = (1/m) Σ_{i=1}^{m} ½ |f(xᵢ) − yᵢ|,   (1.17)

Risk
does not imply a small test error (called risk), averaged over test examples drawn from the underlying distribution P(x, y),

R[f] = ∫ ½ |f(x) − y| dP(x, y).   (1.18)

5. For a definition of a probability distribution, see Section B.1.1.
6. We mostly use the term example to denote a pair consisting of a training pattern x and the corresponding target y.
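As an illustration of the difference between (1.17) and (1.18), the empirical risk of a fixed decision function can be compared with an estimate of its risk obtained from a large independent sample. The sketch below assumes NumPy; the distribution P(x, y), the decision function f, and the sample sizes are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # A made-up distribution P(x, y): class +1 centered at +1, class -1 at -1.
    y = rng.choice([-1, +1], size=n)
    x = y + rng.normal(scale=1.0, size=n)
    return x, y

def f(x):
    # A fixed decision function: threshold at zero.
    return np.where(x >= 0, +1, -1)

def zero_one_risk(x, y):
    # Average zero-one loss (1/2)|f(x) - y|, cf. (1.17).
    return np.mean(0.5 * np.abs(f(x) - y))

x_train, y_train = sample(20)
x_test, y_test = sample(200000)

print("empirical risk R_emp[f]:", zero_one_risk(x_train, y_train))
print("estimate of the risk R[f]:", zero_one_risk(x_test, y_test))
```

On a small training sample the two numbers can differ noticeably, which is precisely the point made in the text.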


The risk can be defined for any loss function, provided the integral exists. For the present zero-one loss function, the risk equals the probability of misclassification.⁷

VC Dimension, Shattering
Statistical learning theory (Chapter 5, [570, 559, 561, 136, 562, 14]), or VC (Vapnik-Chervonenkis) theory, shows that it is imperative to restrict the set of functions from which f is chosen to one that has a capacity suitable for the amount of available training data. VC theory provides bounds on the test error. The minimization of these bounds, which depend on both the empirical risk and the capacity of the function class, leads to the principle of structural risk minimization [559]. The best-known capacity concept of VC theory is the VC dimension, defined as follows: each function of the class separates the patterns in a certain way and thus induces a certain labelling of the patterns. Since the labels are in {±1}, there are at most 2^m different labellings for m patterns. A very rich function class might be able to realize all 2^m separations, in which case it is said to shatter the m points. However, a given class of functions might not be sufficiently rich to shatter the m points. The VC dimension is defined as the largest m such that there exists a set of m points which the class can shatter, and ∞ if no such m exists. It can be thought of as a one-number summary of a learning machine's capacity (for an example, see Figure 1.4). As such, it is necessarily somewhat crude. More accurate capacity measures are the annealed VC entropy or the growth function. These are usually considered to be harder to evaluate, but they play a fundamental role in the conceptual part of VC theory. Another interesting capacity measure, which can be thought of as a scale-sensitive version of the VC dimension, is the fat shattering dimension [286, 6]. For further details, cf. Chapters 5 and 12.

VC Bound, Capacity
Whilst it will be difficult for the non-expert to appreciate the results of VC theory in this chapter, we will nevertheless briefly describe an example of a VC bound:

7. The risk-based approach to machine learning has its roots in statistical decision theory [582, 166, 43]. In that context, f(x) is thought of as an action, and the loss function measures the loss incurred by taking action f(x) upon observing x when the true output (state of nature) is y. Like many fields of statistics, decision theory comes in two flavors. The present approach is a frequentist one. It considers the risk as a function of the distribution P and the decision function f. The Bayesian approach considers parametrized families P_Θ to model the distribution. Given a prior over Θ (which need not in general be a finite-dimensional vector), the Bayes risk of a decision function f is the expected frequentist risk, where the expectation is taken over the prior. Minimizing the Bayes risk (over decision functions) then leads to a Bayes decision function. Bayesians thus act as if the parameter Θ were actually a random variable whose distribution is known. Frequentists, who do not make this (somewhat bold) assumption, have to resort to other strategies for picking a decision function. Examples thereof are considerations like invariance and unbiasedness, both used to restrict the class of decision rules, and the minimax principle. A decision function is said to be minimax if it minimizes (over all decision functions) the maximal (over all distributions) risk. For a discussion of the relationship of these issues to VC theory, see Problem 5.9.


Figure 1.4  A simple VC dimension example. There are 2³ = 8 ways of assigning 3 points to two classes. For the displayed points in ℝ², all 8 possibilities can be realized using separating hyperplanes, in other words, the function class can shatter 3 points. This would not work if we were given 4 points, no matter how we placed them. Therefore, the VC dimension of the class of separating hyperplanes in ℝ² is 3.
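The claim of Figure 1.4 can be checked numerically: every one of the 2³ labellings of three non-collinear points in ℝ² is realized by some separating hyperplane. The sketch below assumes scikit-learn is available and uses a linear SVC with a very large C as a stand-in for a hard-margin separating hyperplane; the particular points are arbitrary.

```python
import itertools
import numpy as np
from sklearn.svm import SVC

# Three non-collinear points in R^2 (arbitrary choice).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

for labels in itertools.product([-1, +1], repeat=3):
    y = np.array(labels)
    if len(set(labels)) == 1:
        # A constant labelling is realized by any hyperplane leaving all points on one side.
        realized = True
    else:
        clf = SVC(kernel="linear", C=1e6).fit(X, y)
        realized = np.array_equal(clf.predict(X), y)
    print(labels, "realized:", realized)

# All 2^3 = 8 labellings come out as realized, i.e., hyperplanes shatter these 3 points.
```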

if h < m is the VC dimension of the class of functions that the learning machine can implement, then for all functions of that class, independent of the underlying distribution P generating the data, with a probability of at least 1 − δ over the drawing of the training sample,⁸ the bound

R[f] ≤ R_emp[f] + φ(h, m, δ)   (1.19)

holds, where the confidence term (or capacity term) φ is defined as

φ(h, m, δ) = √( (1/m) ( h (ln(2m/h) + 1) + ln(4/δ) ) ).   (1.20)
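To get a feeling for the behaviour of the confidence term, (1.20) can simply be evaluated for a few values of h and m. The sketch below only restates the formula; the sample values of h, m, and δ are arbitrary.

```python
import math

def confidence_term(h, m, delta):
    # Capacity/confidence term of the VC bound, cf. (1.20).
    return math.sqrt((h * (math.log(2 * m / h) + 1) + math.log(4 / delta)) / m)

delta = 0.05
for m in (100, 1000, 10000):
    for h in (10, 100):
        if h < m:
            print(f"m={m:6d}  h={h:4d}  phi={confidence_term(h, m, delta):.3f}")
```

The printed values grow with h and shrink with m, which is the qualitative behaviour discussed next.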

The bound (1.19) merits further explanation. Suppose we wanted to learn a "dependency" where patterns and labels are statistically independent, P(x, y) = P(x)P(y). In that case, the pattern x contains no information about the label y. If, moreover, the two classes +1 and −1 are equally likely, there is no way of making a good guess about the label of a test pattern. Nevertheless, given a training set of finite size, we can always come up with a learning machine which achieves zero training error (provided we have no examples contradicting each other, i.e., whenever two patterns are identical, then they must come with the same label). To reproduce the random labellings by correctly separating all training examples, however, this machine will necessarily require a large VC dimension h. Therefore, the confidence term (1.20), which increases monotonically with h, will be large, and the bound (1.19) will show that the small training error does not guarantee a small test error. This illustrates how the bound can apply independent of assumptions about the underlying distribution P(x, y): it always holds (provided that h < m), but it does not always make a nontrivial prediction.

In order to get nontrivial predictions from (1.19), the function class must be restricted such that its capacity (e.g., VC dimension) is small enough (in relation to the available amount of data). At the same time, the class should be large enough to provide functions that are able to model the dependencies hidden in P(x, y). The choice of the set of functions is thus crucial for learning from data. In the next section, we take a closer look at a class of functions which is particularly interesting for pattern recognition problems.

8. Recall that each training example is generated from P(x, y), and thus the training data are subject to randomness.

1.4 Hyperplane Classifiers

In the present section, we shall describe a hyperplane learning algorithm that can be performed in a dot product space (such as the feature space that we introduced earlier). As described in the previous section, to design learning algorithms whose statistical effectiveness can be controlled, one needs to come up with a class of functions whose capacity can be computed. Vapnik et al. [573, 566, 570] considered the class of hyperplanes in some dot product space ℋ,

⟨w, x⟩ + b = 0, where w ∈ ℋ, b ∈ ℝ,   (1.21)

corresponding to decision functions

f(x) = sgn(⟨w, x⟩ + b),   (1.22)

and proposed a learning algorithm for problems which are separable by hyperplanes (sometimes said to be linearly separable), termed the Generalized Portrait, for constructing f from empirical data. It is based on two facts.

Optimal Hyperplane
First (see Chapter 7), among all hyperplanes separating the data, there exists a unique optimal hyperplane, distinguished by the maximum margin of separation between any training point and the hyperplane. It is the solution of

maximize over w ∈ ℋ, b ∈ ℝ:  min{ ‖x − xᵢ‖ : x ∈ ℋ, ⟨w, x⟩ + b = 0, i = 1, …, m }.   (1.23)

Second (see Chapter 5), the capacity (as discussed in Section 1.3) of the class of separating hyperplanes decreases with increasing margin. Hence there are theoretical arguments supporting the good generalization performance of the optimal hyperplane, cf. Chapters 5, 7, 12. In addition, it is computationally attractive, since we will show below that it can be constructed by solving a quadratic programming problem for which efficient algorithms exist (see Chapters 6 and 10).

Note that the form of the decision function (1.22) is quite similar to our earlier example (1.9). The ways in which the classifiers are trained, however, are different. In the earlier example, the normal vector of the hyperplane was trivially computed from the class means as w = c₊ − c₋.
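Although the algorithms for constructing the optimal hyperplane are only developed in later chapters, its behaviour can be previewed with an off-the-shelf solver. The sketch below assumes scikit-learn and uses a linear SVC with a very large C as an approximation to the hard-margin solution of (1.23); the toy data are made up. It prints the normal vector w, the offset b, the resulting margin width 2/‖w‖, and the support vectors on which the solution depends.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data (made-up values).
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [0.0, 0.0], [0.5, -0.5], [-1.0, 0.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin optimal hyperplane of (1.23).
clf = SVC(kernel="linear", C=1e10).fit(X, y)

w = clf.coef_[0]          # normal vector of the hyperplane, cf. (1.21)
b = clf.intercept_[0]     # offset
margin = 2.0 / np.linalg.norm(w)

print("w =", w, " b =", b)
print("margin width 2/||w|| =", margin)
print("support vectors:", clf.support_vectors_)
print("prediction for (1.5, 1.0):", clf.predict([[1.5, 1.0]]))   # sgn(<w,x> + b), cf. (1.22)
```

The quantity 2/‖w‖ is the distance between the hyperplanes ⟨w, x⟩ + b = +1 and ⟨w, x⟩ + b = −1, as annotated in the figure below.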


[Figure: a toy binary classification problem in the plane (classes yᵢ = +1 and yᵢ = −1), showing the separating hyperplane with normal vector w, together with the two parallel hyperplanes {x | ⟨w, x⟩ + b = +1} and {x | ⟨w, x⟩ + b = −1} passing through the closest points of either class. Note: for points x₁, x₂ on these two hyperplanes, ⟨w, x₁⟩ + b = +1 and ⟨w, x₂⟩ + b = −1, so ⟨w, (x₁ − x₂)⟩ = 2 and hence ⟨w/‖w‖, (x₁ − x₂)⟩ = 2/‖w‖.]
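The computation annotated in the figure can be restated as a short derivation, using the canonical form of the margin hyperplanes:

```latex
% Margin width of the canonical separating hyperplane (cf. the figure above).
\begin{align*}
\langle \mathbf{w}, \mathbf{x}_1 \rangle + b &= +1, \qquad
\langle \mathbf{w}, \mathbf{x}_2 \rangle + b = -1 \\
\Rightarrow\quad \langle \mathbf{w}, (\mathbf{x}_1 - \mathbf{x}_2) \rangle &= 2 \\
\Rightarrow\quad \Big\langle \tfrac{\mathbf{w}}{\|\mathbf{w}\|},\ (\mathbf{x}_1 - \mathbf{x}_2) \Big\rangle &= \frac{2}{\|\mathbf{w}\|}.
\end{align*}
```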