Improved Classification Rates for Localized Algorithms under Margin Conditions [1 ed.] 3658295902, 9783658295905

Support vector machines (SVMs) are one of the most successful algorithms on small and medium-sized data sets, but on lar

342 75 1MB

English Pages 144 [134] Year 2020

Report DMCA / Copyright

DOWNLOAD FILE

Polecaj historie

Improved Classification Rates for Localized Algorithms under Margin Conditions [1 ed.]
 3658295902, 9783658295905

Table of contents :
Danksagung
Contents
Abbreviations
List of Figures
Summary
Kurzfassung
1. Introduction
2. Preliminaries
2.1. Introduction to Statistical Learning Theory
2.1.1. Losses and Risks
2.1.2. Learning Methods
2.2. From Global to Localized SVMs
2.2.1. Kernels and RKHSs
2.2.2. The Localized SVM Approach
2.3. Advanced Statistical Analysis
2.3.1. Margin Conditions
2.3.2. General Oracle Inequalities
3. Histogram Rule: Oracle Inequality and Learning Rates
3.1. Motivation
3.2. Statistical Refinement and Main Results
3.3. Comparison of Learning Rates
4. Localized SVMs: Oracle Inequalities and Learning Rates
4.1. Motivation
4.2. Local Statistical Analysis
4.2.1. Approximation Error Bounds
4.2.2. Entropy Bounds
4.2.3. Oracle Inequalities and Learning Rates
4.3. Main Results
4.3.1. Global Learning Rates
4.3.2. Adaptivity
4.4. Comparison of Learning Rates
5. Discussion
A. Appendix
Bibliography

Citation preview

Ingrid Karin Blaschzyk

Improved Classification Rates for Localized Algorithms under Margin Conditions

Improved Classification Rates for Localized Algorithms under Margin Conditions

Ingrid Karin Blaschzyk

Improved Classification Rates for Localized Algorithms under Margin Conditions

Ingrid Karin Blaschzyk Stuttgart, Germany Dissertation University of Stuttgart, 2019 D93

ISBN 978-3-658-29590-5 ISBN 978-3-658-29591-2  (eBook) https://doi.org/10.1007/978-3-658-29591-2 © Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer Spektrum imprint is published by the registered company Springer Fachmedien Wiesbaden GmbH part of Springer Nature. The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany

Danksagung An meinen Doktorvater Prof. Dr. Ingo Steinwart. Danke f¨ ur deine Zeit und das Vertrauen, dass du in mich gesetzt hast. Danke, dass du mir so Vieles w¨ahrend meiner Promotion erm¨oglicht hast. Ich hatte durch Workshops, Summer Schools und Konferenzen die M¨oglichkeit, das Forscher-Dasein zu erleben und konnte mich durch die Teilnahme an Mentoring-Programmen nicht nur fachlich weiterentwickeln. Danke auch f¨ ur die Unterst¨ utzung bei meinem Auslandsaufenthalt in Genua, Italien, durch den ich neuen Antrieb gewonnen habe. All dies ist nicht selbstverst¨andlich. An meine Gutachter Prof. Dr. Andreas Christmann und Prof. Dr. Philipp Hennig. Vielen Dank, dass Sie sich die Zeit genommen haben, um diese Arbeit zu lesen und zu bewerten. An die International Max Planck Research School for Intelligent Systems (IMPRS-IS). Danke f¨ ur die Aufnahme in diese Research School und die M¨oglichkeit, mich mit jungen Forschern aus unterschiedlichen Fachrichtungen auszutauschen. An meine (ehemaligen) Kollegen des ISA & friends. Danke f¨ ur diese wahnsinnig coole Zeit an der Uni und die gute Stimmung. F¨ ur zuk¨ unftige Kollegen habt ihr die Latte hoch angesetzt. Danke Simon und Thomas f¨ ur eure wertvollen Kommentare zu dieser Arbeit. An meine Freunde. Danke, dass ihr f¨ ur mich da seid, auf euch konnte ich mich immer verlassen. Danke Iris, Julia, Maria und Sabrina f¨ ur euer schnelles Korrekturlesen. Danke Jim f¨ ur deinen LATEX Support. An meine Familie. Papa, Mama, Opa, Oma, Cora und Oskar, danke f¨ ur eure unglaubliche Unterst¨ utzung. Ich hatte bei euch immer einen Ort zum Erden.

Contents 1. Introduction 2. Preliminaries 2.1. Introduction to Statistical Learning Theory 2.1.1. Losses and Risks . . . . . . . . . . . 2.1.2. Learning Methods . . . . . . . . . . 2.2. From Global to Localized SVMs . . . . . . 2.2.1. Kernels and RKHSs . . . . . . . . . 2.2.2. The Localized SVM Approach . . . 2.3. Advanced Statistical Analysis . . . . . . . . 2.3.1. Margin Conditions . . . . . . . . . . 2.3.2. General Oracle Inequalities . . . . .

1

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

5 5 5 9 12 12 15 18 18 29

3. Histogram Rule: Oracle Inequality and Learning Rates 35 3.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.2. Statistical Refinement and Main Results . . . . . . . . . . . . 40 3.3. Comparison of Learning Rates . . . . . . . . . . . . . . . . . 53 4. Localized SVMs: Oracle Inequalities and Learning 4.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . 4.2. Local Statistical Analysis . . . . . . . . . . . . . . . 4.2.1. Approximation Error Bounds . . . . . . . . . 4.2.2. Entropy Bounds . . . . . . . . . . . . . . . . 4.2.3. Oracle Inequalities and Learning Rates . . . . 4.3. Main Results . . . . . . . . . . . . . . . . . . . . . . 4.3.1. Global Learning Rates . . . . . . . . . . . . . 4.3.2. Adaptivity . . . . . . . . . . . . . . . . . . . 4.4. Comparison of Learning Rates . . . . . . . . . . . .

Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . .

59 59 63 63 73 77 91 92 99 107

5. Discussion

115

A. Appendix

117

Bibliography

123

Abbreviations ERM LC ME MNE NE RKHS SVM TV-SVM TV-HR UC

empirical risk minimization lower control margin exponent margin noise exponent noise exponent reproducing kernel Hilbert space support vector machine training validation support vector machine training validation histogram rule upper control

List of Figures 2.1. 2.2. 2.3. 2.4.

Example of distance ∆η to decision boundary . . . . . . . . . Examples of data drawn by P with different values of ME α. Examples of η with different values of LC ζ . . . . . . . . . . Examples of η with regions of “critically” large noise far away from the decision boundary. . . . . . . . . . . . . . . . . . . . 2.5. Geometrical assumptions on X0 from Lemma 2.36 . . . . . .

20 21 23 24 28

Summary Localized support vector machines (SVMs) solve SVMs on many spatially defined small chunks and one of their main characteristics besides the computational benefit compared to global SVMs is the freedom of choosing arbitrary kernel and regularization parameter on each cell. In the present work, we take advantage of this observation to derive global learning rates for localized SVMs with Gaussian kernels and hinge loss for classification. These rates outperform known classification rates for localized SVMs, for global SVMs, and other learning algorithms under suitable sets of assumptions. They are achieved under a mild geometric condition on the decision boundary and under a set of margin conditions that describe the behavior of the datagenerating distribution near the decision boundary, where no assumption on the existence of a density is made. It turns out that a margin condition that relates the location of noise to the distance to the decision boundary is crucial to obtain improved rates. The margin parameters appear in the chosen parameters for localized SVMs and in the learning rates. Nevertheless, we show that a training validation procedure learns with the same rates adaptively such that no prior knowledge of these parameters is necessary. The statistical analysis relies on a simple partitioning based technique, which analyzes the excess risk separately on sets that are close to the decision boundary and on sets that are sufficiently far away. These sets depend on a splitting parameter s > 0. To illustrate and to understand the mechanisms of that technique we first apply it to the simple histogram rule and derive even for this simple method learning rates, which outperform rates for global SVMs under suitable assumptions. For localized SVMs with Gaussian kernel and hinge loss, we derive local learning rates that demonstrate how kernel, regularization, and margin parameters affect the rates on the considered sets. For an appropriately chosen splitting parameter, we finally derive global learning rates for localized SVMs.

Kurzfassung Eine Modifizierung der Lernmethode der Support Vector Machines (SVMs) ist die Lernmethode der lokalisierten SVMs, bei welcher SVMs auf jeder Zelle einer Partition des Eingaberaums trainiert werden. Neben dem verbesserten Rechenaufwand profitieren diese lokal lernenden Verfahren davon, dass unterschiedliche Kern- und Regularisierungsparameter auf jeder betrachteten Zelle gew¨ahlt werden k¨onnen. In der vorliegenden Arbeit nutzen wir diese Eigenschaft, um globale Klassifikations-Lernraten f¨ ur lokalisierte SVMs mit Gausskernen und der hinge-Verlustfunktion zu erhalten. Die erzielten Raten sind unter geeigneten Voraussetzungen besser als die bereits bekannten Klassifikationsraten f¨ ur lokalisierte SVMs, globale SVMs oder andere betrachtete Klassifikationsverfahren. Dabei setzen wir eine schwache geometrische Bedingung an die Entscheidungslinie voraus und treffen u ¨bliche Annahmen u ahe ¨ber das Verhalten der datenerzeugenden Verteilung in der N¨ der Entscheidunglinie. Die Parameter, die letzteres beschreiben, spiegeln sich in den Parametern der lokalisieten SVMs, sowie in der Lernrate wider und wir zeigen, dass eine Trainings- und Validierungsmethode adaptiv die gleichen Raten erzielt. Die statistische Analyse beruht auf einer einfachen Technik, die den Eingaberaum in sich zwei u ¨berlappende Mengen paritioniert. Dabei wird einmal die Menge betrachtet, deren Zellen nahe zur Entscheindungslinie liegen, sowie die Menge, deren Zellen einen ausreichenden Abstand zur Entscheidungli¨ nie besitzen. Das Uberschussrisiko wird separat auf diesen Mengen, deren Trennung durch einen Parameter s > 0 beschrieben wird, abgesch¨ atzt. Um diese Analyse und ihre Einfl¨ usse zu verdeutlichen, wenden wir die Technik zun¨achst auf die einfache Lernmethode der Histogramm-Regel an und erhalten unter geeigneten Annahmen Lernraten, die sogar besser sind als die der komplexen SVMs. F¨ ur die lokalisierten SVMs zeigen die separaten Analysen, welchen Einfluss die unterschiedlichen Kern- und Regularisierungsparameter, sowie die Parameter welche die Verteilung beschreiben, auf das lokale Lernverhalten besitzen. Eine geeignete Wahl des Trennungsparameters s f¨ uhrt schließlich zu globalen Lernraten f¨ ur lokalisierte SVMs mit Gausskernen und der hinge-Verlustfunkion.

1. Introduction The notion of artificial intelligence has become more and more popular over the recent years and reached nowadays attention even beyond science into media or politics. Typically this notion is connected with learning from data that means to find patterns and regularities that allow to predict or conclude information from new unseen data. This is done by sophisticated algorithms and finds practical use for example in medicine, to determine diseases, in navigation, to avoid traffic jams, or in commerce, to analyze shopping behavior of customers. Another field of use is finance, where especially since the financial crisis in 2007 it is a big concern to predict creditworthiness of banks. This prediction can be used as an early warning of the credit failure of banks. Having suitable models and precise algorithms banks can be rescued in time, which could prevent a damage of hundred of millions of dollars. To be more concrete, assume to have a data set D := ((x1 , y1 ), . . . , (xn , yn )) of observations that consists for example of annual report and market data of banks before or during the financial crisis, described by xi ∈ X ⊂ Rd , where d is the number of the individual information that are captured of each bank, and which consists of the information whether each bank was credit-worthy or not, which is described by the possible outcomes yi ∈ Y := {−1, 1}. In this classification problem the aim of an algorithm is to compute a decision function fD : X → {−1, 1} such that for new data (x, y), we have fD (x) = y with high probability. That is, for new annual report and market data of a bank, a statement should be made whether the bank is credit-worthy or not. A function fD should not overfit the data, since in the worst case the model miss its target and alerts only in cases where it has seen again exactly the same input data, which is very unlikely for annual report and market data. On the other hand it should not be too complex, since creditworthiness would be wrongly detected too often. This leads to the question how to measure the quality of fD , which in particular makes it possible to compare different methods with each other. For our model, we assume that the data is randomly generated and generated independently and identically by an unfortunately unknown probability measure P . We measure the quality by a loss function L that penalizes wrong predictions. More generalized, we measure the expected loss RL,P (fD ) and its distance to the best possible © Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2020 I. K. Blaschzyk, Improved Classification Rates for Localized Algorithms under Margin Conditions, https://doi.org/10.1007/978-3-658-29591-2_1

2

1. Introduction

risk R∗L,P that is achieved by a function that makes the smallest prediction error, the Bayes decision function. This so-called excess risk should tend to zero as we provide more data to the algorithm. Since our data is randomly generated these bounds are stated in high probability. A quantity that describes how fast these bounds tend to zero is called learning rate. The faster a learning rate the better since the algorithm needs less data to learn well. However, data may be large in the sense that either d is large, or the number of observations n, or both. It is desirable to handle well such data sets under a computational perspective (storage costs and runtime). One strategy to tackle massive amount of data is the approach of data decomposition, which is well-known in literature, e.g., [BV92] proposed random chunking, which is possibly the most famous approach and still analyzed, see e.g, [ZDW15]. Recently proposed algorithms use matrix or kernel approximations, see e.g., [WS01, RR08, RCR15, RR17], or they fall into the category of distributed learning, see e.g., [LGZ17] or [MB18]. In this work we investigate localized support vector machines (SVMs), proposed by [MS16], which learn global SVMs on many spatially defined small chunks. Experimental results show that global SVMs handle well small- and medium-sized data sets in certain learning tasks, see [FDCBA14, MS16, TBMS17, KUMH17], and recently, it was shown that they even outperform self-normalizing neural-networks (SNNs) for such data sets, see [KUMH17]. However, for large data sets global SVMs and more generally kernel methods suffer from their computational complexity, which for SMVs is at least quadratically in space and time. Solving SVMs on spatially defined chunks leads to improved time and space complexities compared to global SVMs. In [TBMS17] experimental results with liquidSVM [ST17] showed that localized SVMs can tackle data sets with 32 million of training samples. For localized SVMs the underlying partition can base on clusters [CTJ07], decision trees [BB98], or k-nearest-neighbors [ZBMM06], but the previous examples are experimentally investigated. Furthermore, several theoretical results have been shown for localized SVMs. Based on possible overlapping regions or on a decomposition with k-nearest-neighbors universal consistency and/or robustness for those classifiers have been proven by [DC18] and [Hab13]. For Gaussian kernels and least-squares-loss [MS16] showed optimal learning rates under usual smoothness assumptions on the Bayes decision function, whereas [TBMS17] obtained learning rates for classification under margin conditions. In classification, margin conditions that describe the interplay between the marginal distribution PX and the conditional distribution of labels are commonly used to obtain learning rates for classifiers, see e.g., [MT99,

1. Introduction

3

KK07, SC08] and [BS18]. The most popular exponent, the Tsybakov noise exponent, was introduced in [MT99] and measures the amount of noise in the input space, where noise equals the probability of falsely labelling some given input x ∈ X. Under the assumption of Tsybakov’s noise exponent and some smoothness assumption on the regression function, fast rates for plug-in classifier are achieved in [AT07, KK07], and [BHM18], for tree-based classifiers in [BCDD14] or for a special case of SVMs in [LZC17]. Some of the mentioned authors additionally make assumptions on the density of the marginal distribution PX to improve their rates that were achieved without density assumptions, or to even find rates. However, it is well known that such assumptions together with a smoothness and noise exponent assumption limit the class of distributions, see e.g., [AT07, KK07] and [BCDD14]. Hence, density assumptions are not preferable. Without smoothness assumptions on the regression function but with a margin condition that also takes the amount of mass around the decision boundary into consideration, rates for SVMs are achieved in [SS07, SC08, LZC17] and [TBMS17]. In this work, we investigate the statistical properties of classifiers derived by localized SVMs using Gaussian kernels and hinge loss. We show that the achieved global learning rates outperform the rates of several learning algorithms mentioned in the previous paragraph under suitable assumptions. In order to establish finite sample bounds on the excess classification risk we apply a splitting technique that we derive in Chapter 3 in detail. The idea behind the splitting technique is to split the input space into two sets that depend on a splitting parameter s > 0, one that is close to the decision boundary and one that is sufficiently far away from the decision boundary, and to analyze the excess risk separately on these sets. To derive, in a first step, local finite sample bounds by a standard decomposition into stochastic and approximation error we make the observation that the approximation error has to be handled differently on cells that intersect the decision boundary and on those which do not. On cells with the latter property the assumption of a margin condition that relates the distance to the decision boundary to the amount of noise is crucial. Descriptively, it restricts the location of noise, that is, if we have high noise for some x ∈ X this x has to be close to the decision boundary. From these local finite sample bounds we derive rates by taking advantage of the great flexibility local SVMs enable us by definition, that is, that kernel and regularization parameters can be chosen on each cell individually. By choosing, in a final step, the splitting parameter s appropriately, we derive global learning rates that depend on the margin parameters. We show that training validation support vector machines (TV-SVMs) achieve the same learning rates adaptively,

4

1. Introduction

that is, without knowing these parameters. Moreover, we compare our rates with the rates achieved by methods mentioned above under suitable sets of assumptions. It turns out that we improve or match the rates of the compared methods and that these improvements result essentially from the above mentioned margin condition. This work is organized as follows. Chapter 2, in which we state basic definitions, results, and concepts, serves as fundamentals for the subsequent chapters. As a lead-in we analyze the histogram rule in Chapter 3 in which we explain and emphasize the main idea behind the refinement technique that leads to improved learning rates. Chapter 4 is dedicated to the statistical analysis of localized SVMs, where the techniques from the previous chapter are applied. We close this work with a summary and a discussion for possible future work in Chapter 5.

2. Preliminaries In Section 2.1, we formalize the notion of learning and state basic definitions and theorems. We give a concrete example of a learning method in Section 2.2 on which the analysis from Chapter 4 is based on. For this analysis high-level concepts are necessary that are introduced in Section 2.3. 2.1. Introduction to Statistical Learning Theory In Section 2.1.1, we introduce notions that describe the quality of a general predictor. Based on that we describe in Section 2.1.2 what we actually mean by learning (methods) and state a quantity that assesses methods and that makes them comparable. 2.1.1. Losses and Risks In the introduction, we made the example of a classification problem, where the aim is to label new unseen input values correctly. Apart from classification, one of the most considered problems in learning theory is regression. In contrast to classification one tries to make a forecast of the outcomes of new unseen input values. A typical application is the forecast of share prices. No matter which learning scenario we consider, we aim to decide whether the prediction we respectively the algorithm made is good or not. To this end, we introduce the notion of a loss function. Definition 2.1 (Loss Function, c.f. [SC08, Def. 2.1]). Let X be a measurable space and Y ⊂ R be a closed subset. Then, a function L : X × Y × R → [0, ∞) is called loss function if it is measurable. Note that the examples we present in this section are supervised loss functions ¯ y, t) 7→ L(y, t) and thus L : Y × R → [0, ∞), but we canonically identify L(x, Portions of Section 2 are based on: I. Blaschzyk and I. Steinwart, Improved Classification Rates under Refined Margin Conditions. Electron. J. Stat., 12:793–823, 2018. Portions of Section 2 are based on: P. Thomann, I. Blaschzyk, M. Meister, and I. Steinwart, Spatial Decompositions for Large Scale SVMs. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 54:1329-1337, 2017

© Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2020 I. K. Blaschzyk, Improved Classification Rates for Localized Algorithms under Margin Conditions, https://doi.org/10.1007/978-3-658-29591-2_2

6

2. Preliminaries

¯ and L. For the moment, every make no difference in our notation between L measurable function could serve as a loss function. To measure the quality in a meaningful way a loss should penalize only incorrect predictions. For classification this could mean that we compare the sign of prediction and true outcome, as shown in the subsequent example. Example 2.2 (Classification Loss). Let sign 0 := 1. The classification loss Lclass : Y × R → [0, ∞) is defined by Lclass (y, t) := 1(−∞,0] (y · sign t),

y, t ∈ R,

where 1(−∞,0] denotes the indicator function on (−∞, 0]. Clearly, the smaller the loss the better the predictions. We present later on more examples of loss functions and introduce the notion of the expected loss, called risk. Definition 2.3 (L-Risk, c.f. [SC08, Def. 2.2]). Let L : X × Y × R → [0, ∞) be a loss function and P be a probability measure on X × Y . Then, for a measurable function f : X → R, the L-risk is defined by Z RL,P (f ) := L(x, y, f (x)) dP (x, y). X×Y

The integral above is well-defined since the loss function is measurable and non-negative, but it is not necessarily finite. Clearly, the smaller the risk the better. Moreover, we remark that various characteristics of a loss function such as convexity directly transfer to the risk, see [SC08, Section 2.2]. Definition 2.4 (Bayes Risk, c.f. [SC08, Def. 2.3]). Let L : X × Y × R → [0, ∞) be a loss function and P be a probability measure on X × Y . Then, the minimal L-risk, R∗L,P := inf{ RL,P (f ) | f : X → R measurable }, is called the Bayes risk with respect to P and L. In addition, a measurable ∗ ∗ function fL,P : X → R with RL,P (fL,P ) = R∗L,P is called Bayes decision function. We call the discrepancy RL,P (f ) − R∗L,P excess risk. Note that in general ∗ neither the existence nor the uniqueness of the Bayes decision function fL,P can be assumed. Subsequently, we give an example of a Bayes decision function w.r.t. the classification loss.

2.1. Introduction to Statistical Learning Theory

7

Example 2.5 (Classification Loss). We define η(x) := P (y = 1|x), where P ( · |x) is a regular conditional probability on Y given x. Let sign 0 := 1. It can be shown that every measurable function f : X → R satisfying (2η(x) − 1) sign f (x) ≥ 0 for PX -almost all x ∈ X is a Bayes decision function w.r.t. Lclass and P , see [SC08, Example 2.4]. In particular, fL∗ class ,P (x) := sign(2η(x) − 1),

x ∈ X,

(2.1)

fulfils the previous condition. For certain classes of loss functions it suffices to make evaluations in a specific range without increasing the loss and thus the risk. We describe them in the subsequent definition. Definition 2.6 (Clipped Loss, c.f. [SC08, Def. 2.22]). A loss L : X × Y × R → [0, ∞) can be clipped at M > 0 if, for all (x, y, t) ∈ X × Y × R, we have L(x, y, Û t) ≤ L(x, y, t), where Û t denotes the clipped value of t, defined by Û t := max{−M, min{t, M }}. For convex losses that can be clipped at some M > 0 it can be shown that the function L(x, y, ·) : R → [0, ∞) has at least one global minimizer in [−M, M ] and vice versa, see [SC08, Lemma 2.23]. Next, we present a clippable loss function that is typically considered for regression problems and where the corresponding Bayes decision function has an interesting representation. Example 2.7 (Least Square Loss). The least square loss Lls : Y × R → [0, ∞), defined by Lls (y, t) := (y − t)2 ,

y ∈, t ∈ R,

is convex and a short calculation, e.g., [SC08, Example 2.6], shows that a function f : X → R is Bayes decision function w.r.t. the loss Lls if R∗L,P < ∞ and if it equals PX -almost surely the conditional expectation, that is, f (x) = EP (Y |x). If Y := {−1, 1} the loss can be clipped at M = 1, see [SC08, Example 2.26].

8

2. Preliminaries

Subsequently, we state a clippable loss function on which our analyis in Chapter 4 is based on. Example 2.8 (Hinge Loss). The hinge loss Lhinge : Y × R → [0, ∞), defined by Lhinge (y, t) := max{0, 1 − yt}

y ∈, t ∈ R,

is convex and can be clipped at M = 1. Moreover, the proof of [SC08, Thm. 2.31] shows that the Bayes classification function, defined in (2.1), is a Bayes decision function w.r.t. the loss Lhinge and P , that is, R∗Lhinge ,P = RLhinge ,P (fL∗ class ,P ). The hinge loss is convex compared to the classification loss. This is an advantage for applications and important for theory, see e.g., Section 2.2. The following result shows that by using the hinge loss instead of the classification loss we actually reduce the excess risk w.r.t. the classification loss and thus the hinge loss serves as a suitable surrogate. Theorem 2.9 (Zhang’s Equality, c.f. [SC08, Thm. 2.31]). Let P be a probability distribution on X × Y and define η(x) := P (Y = 1|x), x ∈ X. Let the Bayes classification function fL∗ class ,P be defined as in (2.1). Then, for all measurable f : X → R we have RLclass ,P (f ) − R∗Lclass ,P ≤ RLhinge ,P (f ) − R∗Lhinge ,P Moreover, for every measurable f : X → [−1, 1], we have Z RLhinge ,P (f ) − R∗Lhinge ,P = |f (x) − fL∗ class ,P (x)||2η(x) − 1|dPX (x). X

Finally, we introduce notions that make it possible to investigate local characteristics of predictors, which will be important for the analysis in Chapters 3 and 4. Definition 2.10 (Local Loss and Risk, c.f. [MS16, Section 2]). Let L : X × Y × R → [0, ∞) be a loss function and let A = (Aj )j=1,...,m be an arbitrary partition of X. For every j ∈ {1, . . . , m} we define the local loss Lj : X × Y × R → [0, ∞) w.r.t. a cell Aj by Lj (x, y, t) := 1Aj (x)L(x, y, t).

2.1. Introduction to Statistical Learning Theory

9

Moreover, we define for J ⊂ {1, . . . , m} the partition based set TJ by [ TJ := Aj . j∈J

Then, the local loss LTJ : X × Y × R → [0, ∞) is defined by LTJ (x, y, t) := 1TJ L(x, y, t) = 1S j∈J

Aj Lj (x, y, t).

Note that for some f : X → R and a loss function L : X × Y × R → [0, ∞) we define L ◦ f := L(x, y, f (x)). 2.1.2. Learning Methods Up to now we introduced basic concepts to measure the quality of a decision function, but non of the terms included the notion of learning. This changes with the subsequent definition. Definition 2.11 (Learning Method, c.f. [SC08, Def. 6.1]). Let X be a set and Y ⊂ R. A learning method L on X × Y maps every data set D ∈ (X × Y )n , n ≥ 1, to a function fD : X → R. In Sections 2.3.2, 3.2 and 4.2.3 we present finite sample bounds on the excess risk RL,P (fD ) − R∗L,P for certain learning methods fD . Clearly, this requires fD to be measurable. As all presented methods in this work are measurable, we always consider fD to be measurable. Moreover, we define L0 (X) := { f : X → R | f measurable }. A good learning method should be L-risk consistent, i.e., RL,P (fD )−R∗L,P should tend to zero with high probability for growing data set length n. However, these bounds in high probability w.r.t. the n-fold product measure of the probability measure P n of P , called oracle inequalities, do not describe how fast fD converges to the Bayes decision function. To this end, we present in the next definition the notion of learning rates. Definition 2.12 (Learning Rates, c.f. [SC08, Def. 6.5]). Let P be a probability measure on X × Y and L : X × Y × R → [0, ∞) be a loss function. Moreover, let L be a measurable learning method satisfying sup

RL,P (fD ) < ∞,

n ≥ 1,

D∈(X×Y )n

for its decision functions fD . Assume that there exists a constant cP > 0 and a decreasing sequence (εn ) ⊂ (0, 1] that converges to 0 such that for all

10

2. Preliminaries

τ ∈ (0, 1] there exists a constant cτ ∈ [1, ∞) only depending on τ such that, for all n ≥ 1 and τ ∈ (0, 1], we have  P n D ∈ (X × Y )n : RL,P (fD ) < R∗L,P + cP cτ εn ≥ 1 − τ. Then, L learns with rate (εn ) and confidence (cτ )τ ∈(0,1] . Under the considerations of the previous definition, [SC08, Lemma 6.5] shows that the following statements are equivalent: 1. L learns with some rate (εn ) and confidence (cτ )τ ∈(0,1] , 2. L is L-risk consistent for P . A well-known result, the no-free lunch theorem, see [DGL96, Thm. 7.2], shows that we have to make assumptions on the underlying probability measure P to obtain learning rates. In regression problems such assumptions are typically made on the smoothness of the Bayes decision function. For classification, we introduce in Section 2.3.1 suitable assumptions. Subsequently, we give an example for a learning method L. To this end, let D := ((x1 , y1 ), . . . , (xn , yn )) be a data set of observations drawn identically and independently from a probability measure P on X × Y . Having some loss function L describing our learning goal at hand we aim to find a function that minimizes the risk RL,P (·). Since we have no knowledge of P , we cannot directly minimize the risk and instead we consider an empirical counterpart of the L-risk, defined by n

RL,D (f ) :=

1X L(yi , f (xi )), n i=1

(2.2)

Pn where D := n1 i=1 δ(xi ,yi ) denotes the average of Dirac measures δ(xi ,yi ) at (xi , yi ). This seems quite natural since we know by the law of large numbers that lim RL,D (f ) = RL,P (f )

n→∞

for all f ∈ F with RL,P (f ) < ∞. However, minimizing the empirical risk over all measurable functions f : X → R would lead to the well-known phenomena of overfitting, which means that the predictor fD adapts to the data set D too much such that we miss our original goal to label new unseen data correctly. A possibility to avoid this problem is to restrict all measurable functions to a function class F ⊂ L0 (X). Learning methods that minimize the empirical risk over such a class are introduced in the next definition.

2.1. Introduction to Statistical Learning Theory

11

Definition 2.13 (Empirical Risk Minimizer, c.f. [SC08, Def. 6.16]). Let L : X ×Y ×R → [0, ∞) be a loss function and F ⊂ L0 (X) be a nonempty set. A learning method whose decision functions fD satisfy fD ∈ F and RL,D (fD ) = inf RL,D (f ) f ∈F

(2.3)

for all n ≥ 1 and D ∈ (X × Y )n is called empirical risk minimization (ERM) with respect to L and F . A decision function produced by (2.3) does not necessarily exist and is not uniquely determined. However, for finite F ⊂ L0 (X) ERM exists and [SC08, Lemma 6.17] provides measurability for these learning methods. We present in Section 2.3.2 an oracle inequality for ERMs and give subsequently a simple example. Example 2.14 (The Histogram Rule). Let A = (Aj )j≥1 be a partition of Rd into cubes of side length r ∈ (0, 1] and let P be a probability distribution on X × Y , where X ⊂ Rd . For x ∈ X we denote by A(x) the unique cell of A with x ∈ A(x). Furthermore, for a data set D sampled from P we write n

n

1X 1X fD,r (x) := 1{yi =+1} 1A(x) (xi ) − 1{yi =−1} 1A(x) (xi ) n i=1 n i=1 and define the empirical histogram by hD,r := sign fD,r , see [DGL96, Chs. 6, 9]. Descriptively, the empirical histogram rule labels x ∈ A(x) by making a majority vote over the labels yi of corresponding xi that fall into the same cell A(x). The empirical histogram rule hD,r is an ERM with respect to Lclass and the function set ( ) X F := cj 1Aj : cj ∈ {−1, 1} . Aj ∩X6=∅

For cubes with decaying side length rn and nrnd → ∞ the histogram rule is consistent, see [DGL96, Thm. 9.4]. Moreover, we call the map hP,r : X → Y , defined by ( −1 if fP,r (x) < 0, hP,r (x) := (2.4) 1 if fP,r (x) ≥ 0,

12

2. Preliminaries

where fP,r (x) := P (A(x) × {1}) − P (A(x) × {−1}), infinite sample histogram rule. 2.2. From Global to Localized SVMs In the previous section, we introduced empirical risk minimizer that minimize the empirical risk, given in (2.2), over some function class F . As the class F may be very rich a modification to regularize this occurrence can be done by adding a penalty term on the empirical risk. To be more precise, the modified learning methods minimize arg min λΥ(f ) + RL,D (f ), f ∈F

for some regularization parameter λ > 0 and a penalty term Υ(·). For example, if L = Lls and Υ(·) = k · k2 , the method above is called Tikhonov regularization. In this work, however, we study regularized ERMs as presented above that consider the minimization over certain Hilbert spaces H := F and impose a penalty term via the corresponding Hilbert norm Υ(·) = k · k2H . We call this method support vector machine, but before giving a precise definition in Section 2.2.2, we introduce in the following section selected Hilbert spaces: so-called reproducing Hilbert spaces (RKHSs). 2.2.1. Kernels and RKHSs First, we need to introduce definitions of some functions. Definition 2.15 (Kernel, c.f. [SC08, Def. 4.1]). Let X be a non-empty set. Then, a function k : X × X → R is called a kernel on X if there exists a R-Hilbert space H and a map Φ : X → H such that for all x, x0 ∈ X we have k(x, x0 ) = hΦ(x), Φ(x0 )i. We call Φ a feature map and H a feature space of k. Descriptively, the feature map maps data by Φ : X → H into some Hilbert space H. In fact, knowing the kernel k allows to evaluate scalar products in H without knowing the feature map Φ. In the literature this is called kernel trick, see [SSM98] or [SC08, Section 4]. The notion of kernel can be defined for C-valued kernels but we restrict ourselves to R-valued kernels. Kernels obtain various properties, e.g., sums of kernels or kernels multiplied with a

2.2. From Global to Localized SVMs

13

scalar s ≥ 0 remain kernels and even restrictions of kernels are kernels, see [SC08, Section 4.1]. We give two simple examples of kernels and define later on a more advanced one. Example 2.16 (Polynomial Kernel, c.f. [SC08, Lemma 4.7]). Let the integers m ≥ 0 and d ≥ 1 be given and let c ≥ 0 be a real number. Then k defined by k(x, x0 ) := (hx, x0 iRd + c)m is a kernel on Rd and called polynomial kernel. For c = 0 and m = 1 its called linear kernel. Example 2.17 (Exponential Kernel, c.f. [SC08, Example 4.9]). For d ∈ N and x, x0 ∈ Rd the function k defined by k(x, x0 ) := exp(hx, x0 iRd ) is called exponential kernel. We remark that a feature map (and feature space) related to a kernel is not unique. To determine a canonical feature space H, we introduce the following Hilbert space that can be seen as the smallest feature space related to the kernel k. Definition 2.18 (RKHS, c.f. [SC08, Def. 4.18]). Let X 6= ∅ and H be a R-Hilbert function space over X, i.e., a R-Hilbert space that consists of functions mapping from X into R. i) A function k : X × X → R is called a reproducing kernel of H if we have k(·, x) ∈ H for all x ∈ X and the reproducing property f (x) = hf, k(·, x)iH holds for all f ∈ H and x ∈ X. ii) The space H is called a reproducing kernel Hilbert space (RKHS) over X if for all x ∈ X the Dirac functional δx : H → R defined by δx (f ) := f (x),

f ∈ H,

is continuous. According to [SC08, Lemma 4.19, Thms. 4.20 and 4.21] there is a one-to-one relation between RKHSs and kernels. That means every kernel has a unique

14

2. Preliminaries

RKHS and vice versa. As a consequence, many characteristics of kernels k transfer to RKHSs H. For example, k is bounded if and only if every f ∈ H is bounded, see [SC08, Lemma 4.3]. Next, we present a kernel that possesses this property. Its characteristics are necessary for the investigations in Chapter 4. Example 2.19 (Gaussian Kernel and RKHS). For some γ > 0 we call the function  kγ (x, x0 ) := exp −γ −2 kx − x0 k22 ,

x, x0 ∈ Rd ,

(2.5)

Gaussian RBF kernel with width γ. Clearly, the Gaussian kernel is bounded and continuous. For X ⊂ Rd with X 6= ∅ and 0 < γ < ∞ it is shown in [SC08, Lemma 4.45] that the map Φγ : X → L2 (Rd ), defined by  Φγ (x) :=

2

d/2

π 1/2 γ

e−2γ

−2

kx− ·k22

,

x ∈ X,

is a feature map of kγ and L2 (Rd ) is a feature space. We denote by Hγ (X) the corresponding Gaussian RKHS over X, which is separable due to [SC08, Lemma 4.33]. Note that Gaussian RKHSs do not contain constant functions, see [SC08, Cor. 4.44]. In particular, the function space can be described by functions in L2 (Rd ) as due to [SC08, Prop. 4.46], the map Vγ : L2 (Rd ) → Hγ (X), defined by  Vγ (x) =

2 π 1/2 γ

d/2 Z

e−2γ

−2

kx−yk22

g(y) dy,

g ∈ L2 (Rd ), x ∈ X

Rd

˚L (Rd ) = B ˚H (X) , where B ˚ denotes the is a metric surjection, that is Vγ B γ 2 corresponding open unit ball. We discussed above that there exists a one-to-one relation between RKHSs and kernels. The RKHS Hγ (X) is affected by the kernel parameter γ in that way that scaling the kernel parameter has the same influence as scaling the input space X. To be more precisely, we define for some s > 0 the function f : sX → R and τs f (x) := f (sX) for x ∈ X. Then, [SC08, Prop. 4.37] shows that for all f ∈ Hsγ (sX) we have τs f ∈ Hγ (X), and that the corresponding operator τs : Hsγ (sX) → Hγ (X) is an isometric isomorphism.

2.2. From Global to Localized SVMs

15

2.2.2. The Localized SVM Approach As mentioned at the beginning of Section 2.2 given some data set D ∈ (X × Y )n support vector machines (SVMs) are decision functions fD,λ : X → R that satisfy fD,λ = arg min λkf k2H + RL,D (f ),

(2.6)

f ∈H

for λ > 0, where H is an RKHS over X with reproducing kernel k : X × X → R and where L : X ×Y ×R → [0, ∞) is a convex loss function. The convexity of the loss function ensures that the decision functions fD,λ ∈ H exist and are unique, see [SC08, Thm. 5.5]. Moreover, the map D 7→ fD,λ is measurable if H is separable, see [SC08, Lemma 6.23]. Due to the so-called representer theorem, see [SC08, Lemma 5.5], the decision function fD,λ can be computed solely in terms of kernel evaluations, that is, there exist α1 , . . . , αn ∈ R such that fD,λ (x) =

n X

αi k(x, xi ),

x ∈ X.

i=1

However, global SVMs suffer from their computational complexities. For example, the calculation of the kernel matrix K i,l = k(xi , xl ) for i, l ∈ {1, . . . , n} scales as O(n2 · d) in space. This equals the test time of the solver and it is well-known that the time complexity of the solver scales as O(nυ ) with υ ∈ [2, 3]. Subsequently, we introduce localized versions of SVMs that reduce all previously mentioned complexities. These so-called localized SVMs produce decision functions that satisfy (2.6) on many spatially defined small chunks. To this end, let (Aj )j=1,...,m be an arbitrary partition of the input space ˚j 6= ∅ for every j ∈ {1, . . . , m}. Then, for every j ∈ X 6= ∅ such that A {1, . . . , m} we denote by H(Aj ) or simply Hj the (local) RKHS over Aj with corresponding kernel kj : Aj × Aj → R. Furthermore, we define the index set Ij := {i ∈ {1, . . . , n} : xi ∈ Aj } Pm

with j=1 |Ij | = n, that indicates the samples of D contained in Aj and we define the corresponding local data set Dj by Dj := ((xi , yi ) : i ∈ Ij ).

(2.7)

16

2. Preliminaries

Optimizing the right-hand side of (2.6) w.r.t. Hj and Dj obviously yields a function fDj ,λ : Aj → R although a function over X is sought. To this end, we denote by fˆ the zero-extension of f ∈ H(Aj ) to X defined by ( f (x) , for x ∈ A , fˆ(x) := 0, for x ∈ X\A . The subsequent definition describes how to derive a global estimator from individual SVMs on each cell. Definition 2.20 (Localized SVMs, c.f. [MS16, Section 2]). Let (Aj )j=1,...,m of X be a partition that has cells with non-empty interior. For every j ∈ {1, . . . , m} let Hj be a RKHS over Aj and define λ := (λ1 , . . . , λm ) ∈ (0, ∞)m . A function fD,λ : X → R, defined by fD,λ (x) :=

m X

fˆDj ,λ (x),

(2.8)

j=1

where the individual decision function fDj ,λj ∈ H(Aj ) satisfies fDj ,λj = arg min λj kf k2Hj + RL,Dj (f )

(2.9)

f ∈Hj

is called localized support vector machine (SVM). By definition of the local loss Lj , see Def. 2.10, we have RL,Dj (f ) = RLj ,D (f ). Moreover, for every j ∈ {1, . . . , m} we have that fDj ,λj (xi ) = 0 if xi ∈ / Aj , i ∈ {1, . . . , n}. To illustrate the computational advantages of this approach, let us assume that the size of the data sets Dj is nearly the same for all j ∈ {1, . . . , m}, that means |Dj | ≈ n/m. For example, if the data is uniformly distributed, it is not hard to show that the latter assumption holds with high probability, and if the data lies uniformly on a low-dimensional manifold, the same is true, since empty cells can be completely ignored. Now, the computation of the kernel matrices Kji,l = kj (xi , xl ) for xi , xl ∈ Dj scales as   2   n 2  n ·d O m· ·d =O . m m The time complexity of the solver reduces to   n 1+υ    n ν  O m· =O n· m m

2.2. From Global to Localized SVMs

17

for υ ∈ [2, 3], and again the test time scales as the kernel computation. That is, the localized SVM approach has in all three phases an improvement by 1/m. In the following, we investigate the decision function fD,λ in (2.8) in more detail. To this end, we introduce the following lemma that shows that the space consisting of zero-extended functions of some RKHS is again an RKHS. Lemma 2.21 (Extended RKHS, c.f. [MS16, Lemma 2]). Let A ⊂ X and H(A) be an RKHS on A with corresponding kernel k : ˆ A×A → R. Then, the space H(A) := {fˆ : f ∈ H(A)} with the corresponding ˆ norm kf kH(A) := kf kH(A) is an RKHS on X and its reproducing kernel is ˆ given by ( k(x, x0 ) , if x, x0 ∈ A , 0 ˆ k(x, x ) := 0, else. Based on the lemma above, the next lemma shows that composing the individual extended RKHSs builts a new RKHS. Lemma 2.22 (Joint RKHS, c.f. [MS16, Lemma 3]). Let (Aj )j=1,...,m of X be a partition that has cells with non-empty interior and denote for every j ∈ {1, . . . , m} by Hj an RKHS over Aj . Then, for an arbitrary index set J ⊂ {1, . . . , m} and some λj > 0 for j ∈ J the direct sum     M X ˆj = f = ˆ j for all j ∈ J HJ := H fj : fj ∈ H (2.10)   j∈J

j∈J

is again an RKHS equipped with the norm X kf k2HJ = λj kfj k2Hˆ . j

j∈J

The results above show that by decomposing an RKHS H into individual RKHSs we obtain by re-composition a new RKHS HJ . Note that we have H ( HJ for J = {1, . . . , m}. A look at the definition of the localized ˆ j ) for every SVM estimator in (2.8) shows immediately that fˆD,λj ∈ H(A j ∈ {1, . . . , m} such that fD,λ ∈ HJ .

18

2. Preliminaries

Note that fD,λ exists and is unique since the local estimators in (2.9) are of the form described in (2.6) using Hj , Dj and the regularization parameter ˜ j = n λj . λ |Ij | We briefly describe that there exists a connection between localized SVMs and SVMs from the global SVM approach in (2.6). We denote by fÛD,λ the clipped localized SVM estimator that consists of the sum of extended clipped individual SVMs fÛD,λj for every j ∈ {1, . . . , m}. Moreover, let J := {1, . . . , m}. Then, for any f ∈ HJ [MS16, Section 3] pointed out that kfD,λ k2HJ + RL,D (fÛD,λ ) ≤ kf k2HJ + RL,D (f ). That means the localized SVM estimator fD,λ is a decision function of a ˜ = 1. global SVM estimator using HJ , the loss L, and λ 2.3. Advanced Statistical Analysis To analyze a learning method fD we are interested in bounds on the excess risk RL,P (fD ) − R∗L,P and the resulting learning rates. As mentioned in Section 2.1.2 we have to make assumptions on the probability measure P to obtain rates. In Section 2.3.1, we introduce various conditions on P that are typically considered in classification. In Section 2.3.2, we present fundamental oracle inequalities for the methods we analyse in Chapters 3 and 4, and discuss various concepts leading to the presented bounds. 2.3.1. Margin Conditions Recall that in classification the aim is to label new unseen data x correctly by y = 1 or y = −1 with high probability. For some measurable, nonempty set X and Y := {−1, 1} we denote by P an unknown probability distribution on X × Y that describes the relation between the input and output values. Then, the uncertainty of labeling input values is reflected in a conditional probability P (y = 1|x) =: η(x). Clearly, if we have η(x) = 0 resp. η(x) = 1 for x ∈ X we observe the label y = −1 resp. y = 1 with probability 1. Otherwise, if, e.g., η(x) ∈ (1/2, 1) we observe the label y = −1 with the probability 1 − η(x) ∈ (0, 1/2) and we call the latter probability noise. Obviously, in the worst case both labels have equal probability 1/2 and we define the set of critical noise containing those x ∈ X by X0 := { x ∈ X : η(x) = 1/2 }.

(2.11)

2.3. Advanced Statistical Analysis

19

Descriptively, there is a dependence between the existence of noise and learning. For distributions, for which the set {x ∈ X : |η(x) − 1/2| < t} is empty for some t > 0, learning is easy since η is bounded away from the critical region 1/2 and thus we have fewer uncertainty in labelling some x ∈ X. The possible most known margin exponent, the Tsybakov noise exponent, which goes back to [MT99], describes this characteristic of a probability distribution P . Definition 2.23 (Noise Exponent, c.f. [MT99]). A probability distribution P has Tsybakov noise exponent (NE) q ∈ [0, ∞] if there exists a constant cNE > 0 such that PX ({x ∈ X : |2η(x) − 1| < t}) ≤ (cNE · t)q

(2.12)

for all t > 0. The previous exponent is known as margin exponent in the literature. Since it measures the amount of critical noise but does not locate it, we call it noise exponent. Obviously, in the best case P has NE q = ∞ such that η is bounded from 1/2. Note that the noise exponent is monotone, which means that if P has some NE q it also has NE q˜ ∈ [0, q]. Subsequently, we introduce other exponents that describe the interplay between the marginal distribution PX and a conditional distribution of labels η. Since the latter is only PX -almost surely defined, we introduce the following formal definition on η. Definition 2.24 (Version of Posterior Prob., c.f. [SC08, Def. 8.4]). A measurable function η : X → [0, 1] is a version of the posterior probability of P if the probability measures P ( · |x), defined by η(x) := P (y = 1|x) for x ∈ X, form a regular conditional probability of P , that is, Z P (A × B) = P (B|x) dPX (x) A

for all measurable A ⊂ X and B ⊂ Y . We define the classes X1 and X−1 by X1 := { x ∈ X : η(x) > 1/2 }, X−1 := { x ∈ X : η(x) < 1/2 }.

20

2. Preliminaries

Descriptively, if a fixed version η of P is smooth, then the set X0 in (2.11) can be seen as decision boundary separating the sets that contain x which are labeled by y = 1 resp. y = −1. The next definition introduces the notion of distance to the decision boundary without defining X0 . Definition 2.25 (Distance to Dec. Boundary, c.f. [SC08, Def. 8.5]). Let (X, d) be a metric space and let η : X → [0, 1] be a version of the posterior probability of P . Then, the associated version of the distance to the decision boundary is the function ∆η : X → [0, ∞] defined by   if x ∈ X−1 , d(x, X1 ) ∆η (x) := d(x, X−1 ) if x ∈ X1 , (2.13)   0 otherwise, where d(x, A) := inf x0 ∈A d(x, x0 ). We refer to Fig. 2.1 for an illustration of (2.13). Since the subsequent notions and results rely on the the prior definition, we assume in the following for (X, d) to be a metric space equipped with the Borel σ-algebra.

X1 x ∆η (x)

X−1 η = 0.5 Figure 2.1.: Example of distance ∆η to decision boundary

Descriptively, there is a dependence between the distance to the decision boundary and learning. For example, if the set {x ∈ X : ∆η (x) < t} is empty for some t > 0, then the classes X1 and X−1 are separated and learning should be easy. In the next definition, we define an exponent that describes this behavior of a probability distribution P .

2.3. Advanced Statistical Analysis

21

Definition 2.26 (Margin Exponent, c.f. [SC08, Def. 8.3]). A probability distribution P has margin exponent (ME) α ∈ (0, ∞] for the version η : X → [0, 1] of its posterior probability if there exists a constant cME > 0 such that PX ({x ∈ X : ∆η (x) < t}) ≤ (cME · t)α for all t > 0. Obviously, the margin exponent α is monotone. It can be influenced by e.g., the decision boundary itself or the geometry of the input space, see [SC08, Ch. 8]. In Fig. 2.2 we show that large values of the ME α reflect a low concentration of mass in the vicinity of the decision boundary.

(a) α ≤ 1

(b) 1 < α < ∞

(c) α = ∞

Figure 2.2.: Examples of data drawn by P with different values of ME α.

Next, we introduce a margin exponent that is a mixture between the margin and noise exponent, and hence, considers both the amount of noise around the decision boundary and the existence of noise. Definition 2.27 (Margin-Noise Exponent, c.f. [SC08, Def. 8.15]). A probability distribution P has margin-noise exponent (MNE) β ∈ (0, ∞] for the version η : X → [0, 1] of its posterior probability if there exists a constant cMNE > 0 such that Z |2η(x) − 1| dPX (x) ≤ (cMNE · t)β {∆η 0. Obviously, the margin-noise exponent is at least as large as the margin exponent, that is β ≥ α. Moreover, we have large MNE β if we have low mass and/or high noise around the decision boundary. The following margin exponent explicitly relates the existence of noise to its location and is crucial for our analysis in Chapters 3 and 4.

22

2. Preliminaries

Definition 2.28 (Lower Control, c.f. [SC08, Exercise 8.5]). Let η : X → [0, 1] be a version of the posterior probability of P . The associated distance to the decision boundary ∆η controls the noise from below (LC) by the exponent ζ ∈ [0, ∞) if there exists a constant cLC > 0 with ∆ζη (x) ≤ cLC |2η(x) − 1|

(2.14)

for PX -almost all x ∈ X. Descriptively, (2.14) shows that if η(x) is close to 1/2 for some x ∈ X, this x is located close to the decision boundary. We describe the dependence of different values of ζ and the location of large noise which is close to the critical level 1/2 in Fig. 2.3. Note that η is always bounded away from 1/2 if ζ = 0, e.g., consider x ∈ X1 , then 1 ≤ cLC (2η(x) − 1)

⇐⇒

1 1 + ≤ η(x). 2 2cLC

The next lemma shows that some analytic properties of the conditional probability η itself have an influence on margin conditions. It relates the reverse H¨older continuity of η to the lower control condition (2.14). Lemma 2.29 (Reverse H¨ older Continuity yields Lower Control). Let η : X → [0, 1] be a fixed version of the posterior probability of P that is continuous and let X0 = ∂X X−1 = ∂X X1 . If η is reverse H¨ older-continuous with exponent ρ ∈ (0, 1], that is, if there exists a constant c > 0 such that |η(x) − η(x0 )| ≥ c · d(x, x0 )ρ ,

x, x0 ∈ X,

then, ∆η controls the noise from below by the exponent ρ. Proof. We fix w.l.o.g. an x ∈ X−1 . By the reverse H¨older continuity we obtain c∆ρη (x) = c inf (d(x, x ˜))ρ ≤ inf |η(x) − η(˜ x)| ≤ inf η(˜ x) − η(x). x ˜∈X1

x ˜∈X1

x ˜∈X1

Since η(˜ x) > 1/2 for all x ˜ ∈ X1 , we find by continuity of η and ∂X1 = X0 that inf x˜∈X1 η(˜ x) = 1/2. Thus, ∆ρη (x) ≤ (2c)−1 (1 − 2η(x)). Obviously, the last inequality is immediately satisfied for x ∈ X0 and for x ∈ X1 the calculation is similar. Hence, with cLC := (2c)−1 we find that

2.3. Advanced Statistical Analysis

23

∆η controls the noise by the exponent ρ from below, that is, x ∈ X.



1

1

0.5

0.5

0.5

0

η

1

η

η

∆ρη (x) ≤ cLC |2η(x) − 1|,

0

0

X (a) ζ < 1

X

X

(b) ζ = 1

(c) ζ > 1

Figure 2.3.: Examples of η with different values of LC ζ

Note that the assumption that η is reverse H¨older continuous limits the range of the lower control exponent in (2.14) to the range (0, 1]. Descriptively, if the function values of η are close, then the input values have to be close, too, but this means in particular that η cannot reach the critical value of 1/2 far from the decision boundary. In the following, we show that there exist strong dependencies between margin conditions. For example, if we consider distributions that force noise to be close to the decision boundary and we have low mass around the decision boundary, consequently, such distributions cannot have large noise exponent. This coherency is described in the next lemma. Lemma 2.30 (LC and ME yield NE). Let P have ME α ∈ [0, ∞) for the version η of its posterior probability. Assume that the associated distance to the decision boundary controls the noise from below by the exponent ζ ∈ [0, ∞). Then, P has NE q := αζ . Proof. Since P has ME α ∈ [0, ∞), we find for some t > 0 that ∆ζη (x) ≤ |2η(x) − 1| < t, cLC

x ∈ X,

1

and we follow that ∆η (x) ≤ (cLC t) ζ . Consequently,   1 PX ({x ∈ X : |2η(x) − 1| < t}) ≤ PX {x ∈ X : ∆η (x) ≤ (cLC t) ζ } α

ζ ≤ cα ME (cLC t) .



24

2. Preliminaries

Be aware that although (2.14) forces “critically” large noise to be located close to the decision boundary this does not mean that only distributions with noise-free regions far away from the decision boundary are considered. It is still allowed to have large noise η(x) ∈ (0, 1/2 − ε] ∪ [1/2 + ε, 1) for some ε > 0 everywhere. In Fig. 2.4 we present two examples which make this situation more clear.

(a) The brighter the region the closer η is to 1/2. The decision boundary (within the bright stripe) is located in the middle of the picture. Far away from the decision boundary we locate in both sets X−1 and X1 noise (upper left and lower right corner). In the upper left corner the critical level η = 1/2 is reached, which is prohibited by (2.14).

η

1

0.5

0 X−1

X1

(b) Far away from the decision boundary, in X1 , we locate a region of critical noise, which is prohibited by (2.14). In X−1 we have noise too, but the critical level η = 1/2 is not reached. Figure 2.4.: Examples of η with regions of “critically” large noise far away from the decision boundary. Distributions that have in this region large noise which is close to the critical level 1/2 as in (a) in the lower right corner or in (b) in the lower left corner are still allowed under assumption (2.14), whereas distributions that reach the critical level 1/2 far away from the decision boundary are prohibited.

2.3. Advanced Statistical Analysis

25

Next, we introduce a margin condition that is some kind of inverse to assumption (2.14). In our framework this margin condition and its conclusions are only necessary to compare learning rates in Sections 3.3 and 4.4. Definition 2.31 (Upper Control, c.f. [SC08, Def. 8.16]). Let η : X → [0, 1] be a version of the posterior probability of P . The associated distance to the decision boundary ∆η controls the noise from above (UC) by the exponent ζ ∈ [0, ∞) if there exists a constant cUC > 0 with ∆ζη (x) ≥ cUC |2η(x) − 1|

(2.15)

for PX -almost all x ∈ X. If x ∈ X is located close to the decision boundary, then η(x) ≈ 1/2. In contrary to Lemma 2.29, we show in the next lemma that the smoothness of η influences the upper control condition (2.15). Lemma 2.32 (H¨ older Continuity yields UC). Let η : X → [0, 1] be a version of the posterior probability of P . Then, if η is H¨ older-continuous with exponent ρ ∈ (0, 1], we have that ∆η controls the noise from above with exponent ρ. Proof. We fix w.l.o.g. an x ∈ X1 and we recall that η(x0 ) < 1/2 for all x0 ∈ X−1 . By the H¨older-continuity of η we find |2η(x) − 1| = 2|η(x) − 1/2| ≤ 2|η(x) − η(x0 )| ≤ 2c(d(x, x0 ))ρ for all x0 ∈ X−1 . Obviously, the last inequality holds immediately for x ∈ X0 . Define cUC := 2c. Since |2η(x) − 1| ≤ 2c inf (d(x, x ˜))ρ = cUC ∆ρη (x), x ˜∈X−1

we have that ∆η controls the noise from above by exponent ρ.



We remark that the H¨older continuity limits the range of the upper control exponent ρ in (2.15) to the range (0, 1]. The subsequent lemmata describe the interplay of the upper control assumption together with various other exponents described above. Lemma 2.33 (UC and ME yield Let P have ME α ∈ [0, ∞) for the Assume that the associated distance noise by the exponent ζ ∈ [0, ∞). β := α + ζ.

MNE, c.f. [SC08, Lemma 8.17]). version η of its posterior probability. to the decision boundary controls the Then, P has margin-noise exponent

26

2. Preliminaries

Lemma 2.34 (NE and UC, c.f. [SC08, Lemma 8.23]). Let P have NE q ∈ [0, ∞). Furthermore, let there exist a version η of its posterior probability such that the associated distance to the decision boundary controls the noise from above by the exponent ζ ∈ [0, ∞). Then, P has margin exponent α := ζq and margin-noise exponent β := ζ(q + 1). Finally, we demonstrate how assumptions on a density of a marginal distribution PX together with some mild geometric assumption on the set X0 of critical noise influence margin conditions. To this end, we define the following geometric characteristic of an arbitrary set. Definition 2.35 (Rectifiable Set, c.f. [Fed69, Def. 3.2.14(1)]). A set T ⊂ X is m-rectifiable for an integer m > 0 if there exists a Lipschitzian function mapping some bounded subset of Rm onto T . Subsequently, we present a lemma that describes the Lebesgue measure of a set in the vicinity of the decision boundary. This lemma is crucial for the refined analysis in Chapters 3 and 4. We denote by ∂X T the relative boundary of T in X and by Hd−1 the (d − 1)-dimensional Hausdorff measure on Rd , see [Fed69, Introduction]. Lemma 2.36 (Lebesgue Measure near Decision Boundary). Let η be a fixed version η : X → [0, 1] of the posterior probability of P . Furthermore, let X0 be (d − 1)-rectifiable with Hd−1 (X0 ) > 0 and let X0 = ∂X X1 . Then, Hd−1 (X0 ) < ∞ and there exists a δ ∗ > 0 such that λd ({ x ∈ X | ∆η (x) ≤ δ }) ≤ 4Hd−1 ({ x ∈ X | η(x) = 1/2 }) · δ

(2.16)

for all δ ∈ (0, δ ∗ ]. Proof. The first assertion follows from the rectifiability of X0 . To see the second assertion we define as in [Ste15] for a set T ⊂ X and δ > 0 the sets T +δ := { x ∈ X | d(x, T ) ≤ δ }, T −δ := X \ (X \ T )+δ . Since X1 := { x ∈ X | η(x) ≤ 1/2 } is bounded and measurable, we find with [Ste15, Lemma A.10.3] and the proof of [Ste15, Lemma A.10.4(ii)] that there exists a δ ∗ > 0, such that for all δ ∈ (0, δ ∗ ] we have λd (X1+δ \ X1−δ ) ≤ 4Hd−1 (∂X X1 ) · δ

(2.17)

2.3. Advanced Statistical Analysis

27

Next, we show that { x ∈ X | ∆η (x) ≤ δ } ⊂ X1+δ \ X1−δ ∪ X0 .

(2.18)

For this purpose, we remark that according to (2.13) we have { x ∈ X | ∆η (x) ≤ δ } = { x ∈ X1 | d(x, X−1 ) ≤ δ } ∪ { x ∈ X−1 | d(x, X1 ) ≤ δ } ∪ X0 . Let us first show that { x ∈ X1 | d(x, X−1 ) ≤ δ } ⊂ X1+δ \ X1−δ . To this end, consider an x ∈ X1 with d(x, X−1 ) ≤ δ, and we follow immediately that x ∈ X1+δ . Now, assume that x ∈ X1−δ = X \ (X \ X1 )+δ . Then, we find that x∈ / (X \X1 )+δ such that d(x, X \X1 ) = d(x, X−1 ∪X0 ) > δ. Hence, x ∈ / X1−δ . +δ Symmetric arguments show that { x ∈ X−1 | d(x, X1 ) ≤ δ } ⊂ X1 \ X1−δ . By using (2.18) and the fact that that λd (X0 ) = 0 since X0 is (d − 1)rectifiable, we see that λd ({ x ∈ X | ∆η (x) ≤ δ }) ≤ λd (X1+δ \ X1−δ ). Finally, with (2.17) and the fact that X0 = ∂X X1 we find for all δ ∈ (0, δ ∗ ] that λd ({ x ∈ X | ∆η (x) ≤ δ }) ≤ λd (X1+δ \ X1−δ ) ≤ 4Hd−1 (X0 ) · δ = 4Hd−1 ({ x ∈ X | η(x) = 1/2 }) · δ.



Remark 2.37. By a similar concept as in the proof of Lemma 2.36 and by applying the bound in [Ste15, Lemma A.10.4(i)], we obtain the lower bound λd ({ x ∈ X | ∆η (x) ≤ δ }) ≥ cd · δ

f.a. δ ∈ (0, δ ∗ ]

for some constant cd > 0. In Fig. 2.5, we present some examples of sets satisfying the geometrical assumptions in the lemma above and show below how these assumptions affect the margin exponent, if the marginal distribution PX has a uniformly bounded density w.r.t. the Lebesgue measure.

28

2. Preliminaries

X1

X1

X1 X0

X0

X0 X−1

X−1

X1

X−1

(a) allowed

(b) not allowed

(c) allowed

Figure 2.5.: Geometrical assumptions on X0 from Lemma 2.36

Lemma 2.38 (Margin Exponent). Let η be a version of the posterior probability of P and assume that the marginal distribution PX has a uniformly bounded density w.r.t. the Lebesgue measure. Let X0 is (d − 1)-rectifiable with Hd−1 (X0 ) > 0 and let X0 = ∂X X1 = ∂X X−1 . Then, we have ME α = 1. Proof. We follow immediately from Lemma 2.36 that PX ({∆η < t}) ≤ cλd ({∆η < t}) ≤ 4cHd−1 ({η = 1/2}) · t. for all t ≤ δ ∗ and some constant c > 0. For t > δ ∗ we have PX ({∆η < t}) ≤ 1 < (δ ∗ )−1 · t and thus, P has ME α = 1 with constant cLC := min{4cHd−1 ({η = 1/2}), δ ∗ )−1 }.  Remark 2.39. Assumptions on the existence of a density of the marginal distribution PX can shrink the classes of considered distributions. To see that, assume that η is H¨ older-smooth with exponent ρ, that P has NE q and that PX has a density w.r.t. the Lebesgue measure that is bounded away from zero. Remark 2.37 together with Lemma 2.32 yields 1

1

ct ρ ≤ PX ({∆η (x) ≤ t ρ }) ≤ PX ({x ∈ X : |2η(x) − 1| < t}) ≤ cNE tq for some constant c > 0. Thus, ρq > 1 can never be satisfied.

2.3. Advanced Statistical Analysis

29

2.3.2. General Oracle Inequalities In this section, we present an oracle inequality for localized SVMs. Prior to that we state an oracle inequality for ERMs that we need for adaptive learning methods defined in Chapter 3 and 4. The presented oracle inequality follows a typical decomposition into a so-called approximation error inf RL,P (f ) − R∗L,P = R∗L,P,F − R∗L,P

f ∈F

that measures the error of the underlying learning method if one has knowledge of the probability measure P , and in the so-called stochastic error RL,P (fD ) − RL,P (fP ) = RL,P (fD ) − inf RL,P (f ) f ∈F

that measures the error that results from the underlying dataset D. A key element to control the stochastic error for improved results is to assume that there exist constants B > 0, θ ∈ [0, 1] and V ≥ B 2−θ such that for all f ∈ F we have bounds of the form ∗ kL ◦ f − L ◦ fL,P k∞ ≤ B,

EP (L ◦ f − L ◦

∗ fL,P )2

≤ V · EP (L ◦ f − L ◦

(2.19) θ ∗ fL,P )

,

(2.20)

where we call (2.19) supremum bound and (2.20) variance bound. Clearly, if the right-hand side of (2.20) is small for some predictor fD , that is fD is close to a Bayes decision function, we have small variance. We give in the following an example that shows that the best possible exponent θ = 1 is achievable. Example 2.40 (LS-Loss Var. Bound, c.f. [SC08, Example 7.3]). For some M > 0 let Y ⊂ [−M, M ]. For the least-square loss L(y, t) := (y − t)2 a short calculation shows that (2.19) and (2.20) are satisfied for B := 4M 2 , V := 16M 2 and θ = 1. Achieving a variance bound and moreover the best possible exponent θ = 1 is not the regular case and depends heavily on the loss function as can be seen in the next theorems. For the hinge loss and the classification loss even assumptions on the probability distribution have to be made to obtain such a bound. More precisely, both results assume that P has some noise exponent q ∈ [0, ∞], which measures the amount of noise, see Definition 2.23.

30

2. Preliminaries

Theorem 2.41 (Variance Bound for Classification Loss). Let P be a distribution on X × Y that has noise exponent q ∈ [0, ∞]. ∗ Moreover, let L := Lclass be the classification loss and let fL,P : X → [−1, 1] be a fixed Bayes decision function. Then, for all measurable f : X → R, we have  q ∗ ∗ EP (L ◦ f − L ◦ fL,P )2 ≤ cq EP (L ◦ f − L ◦ fL,P ) q+1 , where cq := (q + 1) q −1 cNE

q  q+1

. |f −f ∗

|

L,P ∗ Proof. Since (L ◦ f − L ◦ fL,P )2 = = 1(X−1 4{f 0, θ ∈ [0, 1] and V ≥ B such that for all f ∈ F the supremum bound (2.19) and the variance bound (2.20) are satisfied. Then, for all fixed τ > 0 and n ≥ 1, we have with probability P n not less than 1 − e−τ that RL,P (fD ) −

R∗L,P

0 : ∃s1 , . . . , s2i−1 ∈ SBE such that SBE ⊂

i−1 2[

) (sj + εBF ) ,

j=1

where we use the convention inf ∅ := ∞, and BE as well as BF denote the closed unit balls in E and F , respectively. We quote from [CS90, Ch. 1] two properties for entropy numbers that we apply in this work. To this end, note that the first entropy number of an operator matches its operator norm, that is e1 (S : E → F ) = kSk.

(2.21)

Moreover, let (G, k · kG ) be another normed space and let R : G → E be a bounded linear operator. Then, entropy numbers are multiplicative so that ei+l−1 (SR : G → F ) ≤ ei (S : E → F )el (R : G → E),

i, l ≥ 1 . (2.22)

It is not simple to bound entropy numbers, but various investigations can be found in [ET96]. We give an example for bounds on entropy numbers for RKHSs with Gaussian kernels in the subsequent theorem. Theorem 2.45 (Gaussian Entropy Bounds, c.f. [FS19, Thm. 5]). Let X ⊂ Rd be a closed Euclidean ball. Then, there exists a positive constant C, such that for all p ∈ (0, 1), γ ∈ (0, 1] we have 1

ei (id : Hγ (X) → `∞ (X)) ≤ (3C) p



d+1 ep

 d+1 p

d

1

γ − p i− p ,

i ≥ 1.

The theorem above improves the result for Gaussian RKHSs in [SC08, Thm. 6.27] by giving an explicit behavior of p ∈ (0, 1) in the right-hand side. In Section 2.2.2, we have seen that a localized SVMs decision function consist of functions contained in local RKHSs. In fact, the general oracle inequality for localized SVMS in [ES15, Thm. 5] needs a bound on the entropy numbers of the underlying local RKHSs Hγ (A), that is, a bound on ei (id : Hγ (A) → L2 (PX|A )),

(2.23)

2.3. Advanced Statistical Analysis

33

resp. ei (id : HJ (X) → L2 (PX )).

(2.24)

Below, we state a slight modification of the general bound from [ES15, Thm. 5] for that it suffices to have a bound on the average entropy numbers EDX ∼PXn ei (id : HJ → L2 (DX ))

(2.25)

instead of (2.23), which is slightly weaker, see [SC08, Section 7.5]. Theorem 2.46 (Oracle Inequality for Localized SVMs). Let L : X × Y × R → [0, ∞) be a locally Lipschitz continuous loss that can be clipped at M > 0 and that satisfies (2.19) for some B > 0. Based on ˚j 6= ∅ for every j ∈ {1, . . . , m}, we a partition (Aj )j=1,...,m of X, where A assume H(Aj ) to be separable RKHSs. Furthermore, for an arbitrary index set J ⊂ {1, . . . , m}, we assume that the variance bound (2.20) holds for an arbitrary θ ∈ [0, 1]. Assume that, for fixed n ≥ 1, there exist constants p ∈ (0, 1) and aJ > 0 such that 1

EDX ∼PXn ei (id : HJ → L2 (DX )) ≤ aJ i− 2p ,

i ≥ 1.

Finally, fix an f0 ∈ HJ with kf0 k∞ ≤ 1. Then, for all fixed τ > 0, λ := (λ1 , . . . , λm ) ∈ (0, ∞)m , and a := max{aJ , 2} the localized SVM ˆ 1, . . . , H ˆ m and LJ satisfies given in (2.8) using H X λj kfÛDj ,λj k2Hˆ + RLJ ,P (fÛD,λ ) − R∗LJ ,P j

j∈J

 ≤ 9

 X

λj k1Aj f0 k2Hˆ + RLJ ,P (f0 ) − R∗LJ ,P  j

j∈J

 +C

a2p n

1  2−p−θ+θp

 +3

72V τ n

1  2−θ

+

30Bτ n

with probability P n not less than 1 − 3e−τ , where C > 0 is a constant only depending on p, M, V, θ and B. Proof. The proof follows the one of [ES15, Thm. 5] with two slight modifications. First, it suffices to assume to have average entropy numbers of the form in (2.25). Second, it suffices to consider the individual RKHS-norms on the local set J ⊂ {1, . . . , m} instead of the whole set J = {1, . . . , m}. Combining these observations yields the result. 

34

2. Preliminaries

Note that the dependence of the parameter p in the right-hand side of the inequality results from bounds on (average) entropy numbers and that the constant C > 0 in Theorem 2.46 is exactly the constant from [SC08, Thm. 7.23]. Based on this oracle inequality we establish in Chapter 4 oracle inequalities for the hinge loss and Gaussian kernel. This is possible due to the subsequent remark. Remark 2.47. Theorem 2.46 is directly applicable for localized SVMs using hinge loss and Gaussian kernel. To see that, we note that the hinge loss is Lipschitz continuous, bounded by B = 1 and that it can be clipped at M = 1, see Example 2.8. Moreover, the Gaussian kernel is separable, see Example 2.5.

3. Histogram Rule: Oracle Inequality and Learning Rates In this chapter, we statistically analyze the empirical histogram rule that we introduced in Example 2.14. It serves as a framework for the refined analysis of localized SVMs in Chapter 4 and highlights the main concepts that lead together with a certain splitting procedure to improved error bounds and learning rates, see Sections 3.1 and 3.2. We derive rates under margin conditions and compare them with other known rates in Section 3.3. Moreover, we discuss the crucial effect resulting from the lower control condition that relates the distance to the decision boundary with the existence of noise. For the rest of this chapter, let D := ((x1 , y1 ), . . . , (xn , yn )) be a given data set of observations, where yi ∈ Y := {−1, 1}. Recall that the goal of binary classification is to find a function fD : X → Y that labels unseen input values correctly with high probability. In Section 2.1.1, we have seen that the classification loss is a suitable loss function and we write in the following L := Lclass (y, t) = 1(−∞,0] (y · sign t) for y ∈ Y, t ∈ R. We assume that xi ∈ B`d∞ and assume for D to be generated independently and identically by a probability distribution P on Rd × Y with marginal distribution PX on Rd such that X := supp(PX ) ⊂ B`d∞ and PX (∂X) = 0. Throughout this chapter, we further fix a version η : X → [0, 1] of the posterior probability of P and denote by P n the n-fold product measure. We denote by A = (Aj )j≥1 a partition of Rd into cubes of side length r ∈ (0, 1] and denote by hD,r the empirical histogram rule that we introduced in Example 2.14. Finally, for some arbitrary C, D ⊂ X we denote by 4 the symmetric difference defined by C4D := (C \ D) ∪ (D \ C). 3.1. Motivation Our first goal is to find with high probability an upper bound for the excess risk RL,P (hD,r ) − R∗L,P w.r.t. the classification loss L. Due to Example 2.14, Portions of Chapter 3 are based on: I. Blaschzyk and I. Steinwart, Improved Classification Rates under Refined Margin Conditions. Electron. J. Stat., 12:793–823, 2018.

© Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2020 I. K. Blaschzyk, Improved Classification Rates for Localized Algorithms under Margin Conditions, https://doi.org/10.1007/978-3-658-29591-2_3

36

3. Histogram Rule: Oracle Inequality and Learning Rates

the empirical histogram rule is an ERM over a set of step functions and w.r.t. the loss L. That is, we obtain directly a bound on the excess risk by Theorem 2.43, where Theorem 2.41 provides the required but not optimal variance bound for the classification loss under the assumption that P has NE q ∈ [0, ∞]. Hence, we have up to constants   τ  q+1 q+2 RL,P (hD,r ) − R∗L,P ≤ R∗L,P (hP,r ) − R∗L,P + d r n with high probability, where the term r−d arises from an upper bound on the size of the class of step functions. Under the assumption that P has MNE β, we can easily derive a bound on the approximation error R∗L,P (hP,r ) − R∗L,P , which behaves as rβ and thus by optimizing over r, we directly obtain the learning rate β(q+1)

n− β(q+2)+d(q+1) . We illustrate this more precise at the end of Section 3.3. However, on cells that are sufficiently bounded away from the decision boundary we make in the following observations on the approximation error and on the variance bound that indicate that the result above can be improved. In order to apply Theorem 2.43 on such sets, we show in the subsequent theorem that the empirical histogram rule hD,r is an ERM w.r.t. the affiliated local loss and over ( ) X F := cj 1Aj : cj ∈ {−1, 1} . Aj ∩X6=∅

Lemma 3.1 (Histogram Rule is ERM for Local Loss). Let (Aj )j≥1 be a partition of Rd into cubes of side length r ∈ (0, 1]. Consider S for an arbitrary index set J ⊂ { j ≥ 1 | Aj ∩ X 6= ∅ } the set TJ := j∈J Aj and the related local loss LTJ : X × Y × R → [0, ∞). Then, the empirical histogram rule hD,r is an empirical risk minimizer over F for the loss LTJ , that means RLTJ ,D (hD,r ) = inf RLTJ ,D (f ). f ∈F

3.1. Motivation Proof. For f =

37 P

Aj ∩X6=∅ cj 1Aj

∈ F we have

Z RLTJ ,D (f ) = =

LTJ (x, y, f (x)) dD(x, y) X×Y

XZ j∈J

L(y, f (x)) dD(x, y), Aj ×Y

where D denotes the average of Dirac measures defined in Section 2.1.2. That means for a single cell Aj with j ∈ J that n

Z L(y, f (x)) dD(x, y) = Aj ×Y

1X 1A (xi )1yi 6=cj , n i=1 j

where cj ∈ {−1, 1} is the label of the cell Aj . The risk on a cell is the smaller, the less often we have yi 6= cj , such that the best classifier on a cell is the one which decides by majority. This is true for the histogram rule by definition. Since the risk is zero on Aj with j 6∈ J, the histogram rule minimizes the risk with respect to LTJ .  Subsequently, we define sets consisting of cells that have a certain distance to the decision boundary ∆η , see Definition 2.25. Recall that we fixed a version η of the posterior probability of P on which the distance to the decision boundary heavily relies. To this end, we define for the partition A of cubes with side length r > 0 the set of indices of cubes that intersect X by J := { j ≥ 1 | Aj ∩ X 6= ∅ }. For some s > 0, we define the set of indices JFs := { j ∈ J | ∀ x ∈ Aj : ∆η (x) ≥ s }

(3.1)

and its associated set by FJFs :=

[

Aj

(3.2)

s j∈JF

that consists of cubes that are sufficiently far away from the decision boundary. To avoid notation overload, we simply write F s := FJFs , while keeping in mind that this set is partition based. The following lemma shows that the set F s contains cells that are either assigned to the class X1 or X−1 and that the set has a benign impact on the approximation error.

38

3. Histogram Rule: Oracle Inequality and Learning Rates

Lemma 3.2 (Approximation Error on F s ). For some s ≥ 2r let the set F s be defined as in (3.2). Moreover, let X0 = ∂X X1 = ∂X X−1 . Then, we have i) either Aj ∩ X1 = ∅ or Aj ∩ X−1 = ∅ for j ∈ JFs . ii) RLF s ,P (hP,r ) − R∗LF s ,P = 0. Proof. i) For Aj with j ∈ JFs , we assume to have an x1 ∈ Aj ∩ X1 6= ∅ and an x−1 ∈ Aj ∩ X−1 6= ∅. Then, we have kx−1 − x1 k∞ ≤ r and the connecting line x−1 x1 from x−1 to x1 is contained in Aj since Aj is convex. Next, pick an m > 1 such that t0 = 0,

tm = 1,

ti =

i m

and xi := ti x−1 + (1 − ti )x1 for i = 0, . . . , m. Clearly, xi ∈ x−1 x1 and xi ∈ X−1 ∪ X1 . Since x0 ∈ X1 and xm ∈ X−1 , there exists an i with xi ∈ X1 and xi+1 ∈ X−1 and we find kxi − xi+1 k∞ ≥ ∆η (xi ) ≥ s. On the other hand, kxi − xi+1 k∞ =

1 r s kx−1 − x1 k∞ ≤ ≤ m m 2m

s such that s ≤ 2m , which is not true for m ≥ 1. Hence, we cannot have an x1 ∈ Aj ∩ X1 6= ∅ and an x−1 ∈ Aj ∩ X−1 6= ∅ for j ∈ JFs . ii) By Lemma A.1 we obtain Z RLF s ,P (hP,r ) − R∗LF s ,P = |2η − 1| dPX (X 4{hP,r ≥0})∩F s

=

1 X Z s j∈JF

|2η − 1| dPX (X1 4{hP,r ≥0})∩Aj

and we show that PX ((X1 4{hP,r ≥ 0}) ∩ Aj ) = 0 for each j ∈ JFs . To see this, we remark that the set (X1 4{hP,r ≥ 0}) ∩ Aj contains those x ∈ Aj for that either hP,r (x) ≥ 0 and η(x) ≤ 1/2, or hP,r (x) < 0 and η(x) > 1/2. Hereby, we ignore the case η(x) = 1/2 for x ∈ Aj since for those

3.1. Motivation

39

the integrand |2η(x) − 1| is zero. Furthermore, we know from the previous part i) that either Aj ∩ X−1 = ∅ or Aj ∩ X1 = ∅. Let us first consider the case Aj ∩ X−1 = ∅. According to the definition of the histogram rule, cf. (2.4), we find for all x ∈ Aj ∩ X1 that fP,r (x) = P (Aj (x) × {1}) − P (Aj (x) × {−1}) Z Z Z Z = 1{1} (y)P (dy|x) dPX (x) − 1{−1} (y)P (dy|x) dPX (x) Aj

Y

Aj

Z

Y

Z η(x) dPX (x) −

= Aj

(1 − η(x)) dPX (x) Aj

Z 2η(x) − 1 dPX (x)

= Aj

≥ 0. such that hP,r (x) = 1. That means, we have η(x) ≥ 1/2 and hP,r (x) = 1 for all x ∈ Aj for all x ∈ Aj ∩ X1 . Analogously, we show for x ∈ Aj ∩ X−1 that η(x) ≤ 1/2 and hP,r (x) = −1. Hence, PX ((X1 4{hP,r ≥ 0}) ∩ Aj ) = 0 for all j ∈ JFs and the excess risk above is zero.  Besides this improvement of the approximation error, we improve a quantity that influences the stochastic error under a margin condition. To be more precise, if we assume that ∆η controls the noise from below in the sense of Definition 2.14 the next theorem reveals that we achieve the best possible variance bound θ = 1 on the previously defined set F s . Theorem 3.3 (Improved Variance Bound for Classification Loss). Assume that the associated distance to the decision boundary ∆η controls the noise from below by the exponent ζ ∈ [0, ∞) and define for some s > 0 the ∗ set F s as in (3.2). Furthermore, let fL,P be a fixed Bayes decision function for the classification loss. Then, there exists a constant cLC > 0 independent of s such that for all measurable f : X → {−1, 1} we have ∗ EP (LF s ◦ f − LF s ◦ fL,P )2 ≤

cLC ∗ EP (LF s ◦ f − LF s ◦ fL,P ). sζ

Proof. For x ∈ F s , we have ∆η (x) ≥ s and find with the lower-control assumption that sζ ≤ ∆ζη (x) ≤ cLC |2η(x) − 1|

40

3. Histogram Rule: Oracle Inequality and Learning Rates

and therefore 1 ≤ cLC s−ζ |2η(x) − 1|. ∗ Moreover, we find (LF s ◦ f − LF s ◦ fL,P )2 = 1F s

1F s

∗ |f −fL,P | 2

∗ |f −fL,P | 2

and

∗ = 1({fL,P 0, we defined in (3.1) for a fixed version η of the posterior probability of P and for some s > 0 the set of indices of cubes far away from the decision by JFs := { j ∈ J | ∀ x ∈ Aj : ∆η (x) ≥ s } and its associated set F s by (3.2). Similar, we define for the same s > 0 the set of indices of cubes near the decision boundary by s JN := { j ∈ J | ∀ x ∈ Aj : ∆η (x) ≤ 3s }

and its associated set by NJNs :=

[

Aj ,

s j∈JN

where we write in the following N s := NJNs similar to the set F s to avoid notation overload. To ensure that the separation above captures all cubes that intersect the input space, the subsequent lemma provides a sufficient condition on the separation parameter s under a mild geometric assumption on the decision boundary. Moreover, it shows that we are able to bound the excess risk over X by the excess risks over the sets defined above. Lemma 3.4 (Separation of Risks). For some s ≥ 2r let the sets N s and F s be defined as above. Moreover, let X0 = ∂X X1 = ∂X X−1 . Then, we have X ⊂ N s ∪ F s and RL,P (hD,r ) − R∗L,P ≤ RLN s ,P (hD,r ) −

(3.3) R∗LN s ,P



+ RLF s ,P (hD,r ) −

R∗LF s ,P



.

Proof. First, we show that X ⊂ N s ∪ F s . We define the set of indices JCs := { j ∈ J | ∃ x∗ ∈ Aj : ∆η (x∗ ) < s } and its associated set by C s :=

[ s j∈JC

Aj .

42

3. Histogram Rule: Oracle Inequality and Learning Rates

Since X ⊂ F s ∪ C s , it is sufficient to show that C s ⊂ N s . To this end, we fix a j ∈ JCs and show that ∀z ∈ Aj we have ∆η (z) ≤ 3s. If z ∈ X0 , we immediately have ∆η (z) = 0 < 3s, and we therefore assume w.l.o.g. that z ∈ X1 . Furthermore, there exists an x∗ ∈ Aj with ∆η (x∗ ) < s and we find ∆η (z) =

inf

x0 ∈X−1

kz − x0 k∞

≤ kz − x∗ k∞ + ≤r+

inf

x0 ∈X−1

inf

x0 ∈X−1

kx∗ − x0 k∞

kx∗ − x0 k∞ .

We observe the following cases. If x∗ ∈ X−1 , we obviously have in the calculation above ∆η (z) ≤ s. If x∗ ∈ X1 , we have inf

x0 ∈X−1

kx∗ − x0 k∞ = ∆η (x∗ ) < s

and we follow with r ≤ 2s that ∆η (z) ≤ 3s. In the case that x∗ ∈ X0 we find with X0 = ∂X X−1 that inf

x0 ∈X−1

kx∗ − x0 k∞ = 0,

such that ∆η (z) ≤ s. That means, ∆η (z) ≤ 3s for z ∈ X1 and the same is true for z ∈ X−1 . Since j ∈ JCs was arbitrary we follow that C s ⊂ N s . Finally, by showing that X ⊂ N s ∪ F s , we find by the non-negativity of the excess risk that RL,P (hD,r ) − R∗L,P Z = L(y, hD,r (x)) − L(y, f ∗ (x)) dP (x, y) X×Y Z Z = L(y, hD,r (x)) − L(y, f ∗ (x)) P (dy|x)dPX (x) X Y Z Z ≤ L(y, hD,r (x)) − L(y, f ∗ (x)) P (dy|x)dPX (x) s s Y ZN ∪F Z = L(y, hD,r (x)) − L(y, f ∗ (x)) P (dy|x)dPX (x) Ns Y Z Z + L(y, hD,r (x)) − L(y, f ∗ (x)) P (dy|x)dPX (x) Fs Y   = RLN s ,P (hD,r ) − R∗LN s ,P + RLF s ,P (hD,r ) − R∗LF s ,P .



3.2. Statistical Refinement and Main Results

43

To bound the excess risk RL,P (hD,r ) − R∗L,P , we have to find bounds on the sets N s and F s . Since we have seen in Lemma 3.1 that the histogram rule is an ERM w.r.t. a considered local loss of an partition based set, we apply the oracle inequality for ERMs from Theorem 2.43 separately on both error terms. In the previous section, we examined that the approximation error vanishes on the set F s , see Lemma 3.2. Moreover, we obtained by Theorem 3.3 an improved variance bound w.r.t. the classification loss under the assumption that the distance to the decision boundary controls the noise from below by ζ. In Part 1 of the proof of Theorem 3.5, we show that the mentioned improvements lead up to constants to RLF s ,P (hD,r ) − R∗LF s ,P ≤

r−d + τ sζ n

with probability P n not less than 1 − e−τ , where τ ≥ 1. The term s−γ on the right-hand side is exactly the constant from the variance bound. Whereas this error term increases for s → 0, the error term on the set N s behaves exactly in the opposite way and decreases for s → 0. In fact, bounding the risk on N s requires additional knowledge of the behavior of P in the vicinity of the decision boundary. Under the additional assumptions that P has ME α and MNE β, we show in Part 2 of the proof of Theorem 3.5 that RLN s ,P (hD,r ) −

R∗LN s ,P

β



≤r +

V (sr−d + τ ) n

α+ζ  α+2ζ

is satisfied up to constants and with probability P n not less than 1 − e−τ , where τ ≥ 1 and where V > 0 is the constant of the variance bound that we obtained by Theorem 2.41 together with Lemma 2.33. By optimizing over the splitting parameter s, we obtain the following oracle inequality. Theorem 3.5 (Oracle for Histogram Rule). Let (Aj )j≥1 be a partition of Rd into cubes of side length r ∈ (0, 1]. Assume that P has MNE β ∈ (0, ∞] and ME α ∈ (0, ∞] for some fixed version of η and assume that the associated distance to the the decision boundary ∆η controls the noise from below by the exponent ζ ∈ [0, ∞). Define κ := (1 + ζ)(α + ζ). Furthermore, let X0 be (d − 1)-rectifiable, X0 = ∂X X1 = ∂X X−1 with Hd−1 (X0 ) > 0 and let δ ∗ > 0 be the constant for that (2.16) is satisfied. For fixed n, τ ≥ 1 let there exist a constant c > 0 such that the bounds r ≤c·n

− κ+ζζ2 +dζ

,

(3.4)

44

3. Histogram Rule: Oracle Inequality and Learning Rates

and rd n ≥ c˜, dζ+κ+ζ 2 ζ



n

(3.5)

2 o− κ+ζ ζ

δ∗ 3 ,1

where c˜ := c min , are satisfied. Then, there exists a constant cα,ζ,d > 0 depending on α, ζ and d, but not on c, such that β

RL,P (hD,r ) − R∗L,P ≤ 6 (cMNE r) + cα,ζ,d τ · (rd n)

κ − κ+ζ 2

(3.6)

holds with probability P n not less than 1 − 2e−τ . Proof. We define α α+ζ

θ :=

(3.7)

and s := c

1+ζ(2−θ)+d(1−θ) 1+ζ(2−θ)

1−θ  τ  1+ζ(2−θ) , rd n

(3.8)

where c > 0 is the constant from (3.4). Next, we show that r ≤ s. To this end, a short calculation with (3.7) shows ζ κ+ζ 2 +dζ

=

ζ (1+ζ)(α+ζ)+ζ 2 +dζ

=

1−θ 1+ζ(2−θ)

·

1+ζ(2−θ) 1+ζ(2−θ)+d(1−θ) .

Assuming that r is upper bounded, see (3.4), we find with the previous calculation that 1+ζ(2−θ)   1+ζ(2−θ)+d(1−θ) 1−θ r ≤ c n− 1+ζ(2−θ) 1+ζ(2−θ)+d(1−θ) 1+ζ(2−θ)

⇐⇒

r

⇐⇒

r≤c

≤c

1+ζ(2−θ)+d(1−θ) 1+ζ(2−θ)

1+ζ(2−θ)+d(1−θ) 1+ζ(2−θ)

1−θ

n− 1+ζ(2−θ)

1−θ

(rd n) 1+ζ(2−θ) ,

and therefore r ≤ s. According to Lemma 3.4, we then split the excess risk RL,P (hD,r ) − R∗L,P into RL,P (hD,r ) − R∗L,P ≤ RLN s ,P (hD,r ) −

(3.9) R∗LN s ,P



+ RLF s ,P (hD,r ) −

R∗LF s ,P



.

3.2. Statistical Refinement and Main Results

45

The rest of the proof is structured in three parts, where we establish error bounds on N s and F s in the first two parts and combine the results obtained in the third and last part of the proof. In the following, we write N := N s and F := F s and keep in mind that these sets depend on a parameter s. Furthermore, we write hD := hD,r . Part 1. In the first part we establish an oracle inequality for RLF s ,P (hD )− R∗LF s ,P . Since hD is an ERM over F for the loss LF s due to Lemma 3.1, we apply the improved oracle inequality for ERM in Theorem 2.43. To this end, we find for all f ∈ F that kLF s ◦ f − LF s ◦ fL∗ F s ,P k∞ ≤ 1. Moreover, Theorem 3.3 yields variance bound θ = 1 with constant Vs := cLC s−ζ . Define c1 := max{1, cLC }. Together with R∗LF s ,P,F ≤ RLF s ,P (hP,r ), we find by Theorem 2.43 for all fixed τ ≥ 1 and n ≥ 1 that RLF s ,P (hD ) − R∗LF s ,P < 6(RLF s ,P (hP,r ) − R∗LF s ,P ) +

32c1 (log(|F| + 1) + τ ) sζ n

holds with probability P n not less than 1 − e−τ . Next, we bound |F| by using a volume comparison argument. Obviously, we have |F| ≤ d 2S|J| . We define J˜ := { j ≥ 1 | Aj ∩ 2 [−1, 1] 6= ∅ } and observe that S . Then, j∈J Aj ⊂ j∈J˜ Aj ⊂ 4B`d ∞  |J|rd = λd 

 [



Aj  ≤ λ d 

 [

 Aj  ≤ λd 4B`d∞ = 8d ,

j∈J˜

j∈J

such that with |J| ≤ 8d r−d we deduce log(|F| + 1) ≤ log(28

d −d

≤ log(28

d −d

d −d

+1

≤8 r

r r

d+1 −d

≤8

r

.

+ 1) +1

)

46

3. Histogram Rule: Oracle Inequality and Learning Rates

Inserting this into the oracle inequality above, we find together with Lemma 3.2 for the oracle inequality on F s that RLF s ,P (hD ) − R∗LF s ,P
0. Define c2 := max{1, cα,ζ }. By Theorem 2.43 with R∗LN s ,P,F ≤ RLN s ,P (hP,r ) we then obtain for fixed τ ≥ 1 and n ≥ 1 that RLN s ,P (hD ) − R∗LN s ,P

(3.11)

< 6(RLN s ,P (hP,r ) − R∗LN s ,P ) + 4



8c2 (log(|F| + 1) + τ ) n

1  2−θ

holds with probability P n not less than 1 − e−τ . In order to refine the right-hand side in the inequality above, we establish a bound on the cardis nality |F| = 2|JN | and a bound on the approximation error. To bound the s mentioned cardinality, we use the fact that S by definition N is contained in a tube around the decision line, that is j∈J s Aj ⊂ {∆η (x) ≤ 3s}. Since r N

is lower bounded by (3.5) we find with

3s = 3c

1+ζ(2−θ)+d(1−θ) 1+ζ(2−θ)



1 rd n

(1+ζ)(α+ζ)+ζ 2 ζ

1−θ  1+ζ(2−θ)

= 

≤ 3 min

1+ζ(2−θ) 1−θ

that

 δ∗ , 1 ≤ δ∗ , 3

such that Lemma 2.36 yields λd ({∆η (x) ≤ 3s}) ≤ 12Hd−1 ({η = 1/2})s.

3.2. Statistical Refinement and Main Results

47

Then,  s d |JN |r = λd 

 [

Aj  ≤ λd ({∆η (x) ≤ 3s}) ≤ 12Hd−1 ({η = 1/2})s

s j∈JN

such that s |JN | ≤ 12Hd−1 ({η = 1/2})sr−d = c3 sr−d ,

where c3 := 12Hd−1 ({η = 1/2}). Since s ≥ r ≥ rd we hence conclude that log(|F| + 1) ≤ log(2c3 sr

−d

≤ log(2c3 sr

−d

≤ c3 sr

−d

≤ c4 sr

−d

+ 1) +1

+ sr

(3.12)

)

−d

,

where c4 := 2 max{12Hd−1 ({η = 1/2}), 1}. Thus, the oracle inequality on the set N s in (3.11) changes to RLN s ,P (hD ) − R∗LN s ,P

(3.13)

≤ 6(RLN s ,P (hP,r ) − R∗LN s ,P ) + 4



8c2 (c4 sr−d + τ ) n

1  2−θ

and holds with probability P n ≥ 1 − e−τ . Finally, we have to bound the approximation error RLN s ,P (hP,r ) − R∗LN s ,P . By Lemma A.1 we obtain RLN s ,P (hP,r ) − R∗LN s ,P = =

Z |2η − 1| dPX (X1 4{hP,r ≥0})∩N

X Z s j∈JN

|2η − 1| dPX . (X1 4{hP,r ≥0})∩Aj

s We split the set of indices JN into sets of indices whose corresponding cells do not intersect the decision line and those who do by s s JN 1 := { j ∈ JN | PX (Aj ∩ X1 ) = 0 ∨ PX (Aj ∩ X−1 ) = 0 }, s s JN 2 := { j ∈ JN | PX (Aj ∩ X1 ) > 0 ∧ PX (Aj ∩ X−1 ) > 0 },

48

3. Histogram Rule: Oracle Inequality and Learning Rates

such that X Z s j∈JN

=

|2η − 1| dPX (X1 4{hP,r =1})∩Aj

X Z

+

|2η − 1| dPX (X1 4{hP,r =1})∩Aj

s j∈JN 1

X Z s j∈JN 2

|2η − 1| dPX . (X1 4{hP,r =1})∩Aj

s Since PX ((X1 4{hP,r = 1}) ∩ Aj ) = 0 for all j ∈ JN 1 , the first sum vanishes as in the calculation of the approximation error in Part 1. Moreover, we s notice that JN 2 only contains indices of cells of width r that intersect the decision boundary. Hence, since P has MNE β by assumption, we find X Z RLN s ,P (hP,r ) − R∗LN s ,P = |2η − 1| dPX (3.14) s j∈JN 2

(X1 4{hP,r =1})∩Aj

Z ≤

|2η − 1| dPX {∆η (x)≤r} β

≤ (cMNE r) . Altogether for the oracle inequality on the set N s with (3.13), we obtain that RLN s ,P (hD ) −

R∗LN s ,P

β

≤ 6 (cMNE r) + 4



8c2 (c4 sr−d + τ ) n

1  2−θ

(3.15)

holds with probability P n not less than 1 − e−τ . Part 3. In the last part, we combine the results obtained in Part 1, the oracle inequality on F s and Part 2, the oracle inequality on N s . Inserting (3.10) and (3.15) into (3.9) yields RL,P (hD ) − R∗L,P

(3.16) R∗LN s ,P



R∗LF s ,P



≤ RLN s ,P (hD ) − + RLF s ,P (hD ) −  1  8c2 (c4 sr−d + τ ) 2−θ 32c1 (8d+1 r−d + τ ) β β ≤ 6cMNE rn + 4 + n sζ n

3.2. Statistical Refinement and Main Results

49

with probability P n not less than 1 − 2e−τ . Together with sr−d ≥ 1 and τ, c4 ≥ 1 we find RL,P (hD ) − R∗L,P   1 8c2 (c4 sr−d + τ ) 2−θ 32c1 (8d+1 r−d + τ ) β ≤ 6 (cMNE r) +4 + n sζ n 1   8c2 (c4 τ sr−d +c4 τ sr−d ) 2−θ 32c1 (8d+1 τ r−d +τ r−d ) β ≤ 6 (cMNE r) +4 + n sζ n 1   c5 τ sr−d 2−θ c6 τ r−d β ≤ 6 (cMNE r) +4 + ζ n s n 1   1 c5 τ 2−θ c6 τ β ≤ 6 (cMNE r) +s 2−θ 4 d + ζ d r n s r n with constants c5 := 32c2 max{12Hd−1 ({η = 1/2}), 1} and c6 := 64 · 8d+1 max{cLC , 2ζ }. Thus, inserting s defined in (3.8), we have with probability P n not less than 1 − 2e−τ that RL,P (hD ) − R∗L,P 1

β

1

1

≤ 6 (cMNE r) + 4(c5 τ ) 2−θ · s 2−θ (rd n)− 2−θ + c6 τ · (sζ rd n)−1 2−θ+ζ(2−θ)

1

1+ζ

= 6 (cMNE r) + c7 τ 2−θ · (rd n)− (1+ζ(2−θ))(2−θ) + c8 τ · (rd n)− 1+ζ(2−θ) β

1+ζ

≤ 6 (cMNE r) + c9 τ · (rd n)− 1+ζ(2−θ) , β

where c7 , c8 , c9 > 0 are constants depending on α, ζ, d. Inserting θ, defined in (3.7), yields the result.  The smaller the side length r, the smaller the bound on the approximation error in Theorem 3.5. By assumption (3.5) we ensure that the side length is not too small so that we have enough data contained in the cubes. By choosing an appropriate sequence of rn depending on the data length n, we state learning rates in the next theorem. Theorem 3.6 (Learning Rates for Histogram Rule). Let the assumptions on X0 in Theorem 3.5 be satisfied, as well as the assumptions on P for β ≤ ζ −1 κ, where κ := (1 + ζ)(α + ζ). Let A = (Aj )j≥1 be a partition of Rd into cubes of side length rn = c · n

− β(κ+ζκ2 )+dκ

,

50

3. Histogram Rule: Oracle Inequality and Learning Rates

where c > 0 is the constant from Theorem 3.5. Then, there exists a constant cα,β,ζ,d > 0 such that for all τ ≥ 1 and n sufficiently large we have RL,P (hD,rn ) − R∗L,P ≤ cα,β,ζ,d τ · n

βκ − β(κ+ζ 2 )+dκ

with probability P n not less than 1 − 2e−τ . Proof. First, we prove that the chosen sequence rn satisfies assumptions (3.4) and (3.5). Since β ≤ ζ −1 κ, we find rn = c · n

− β(κ+ζκ2 )+dκ

≤c·n



κ ζ −1 κ(κ+ζ 2 )+dκ

≤c·n

− κ+ζζ2 +dζ

,

which matches assumption (3.4). Moreover, we have rnd n = c · n

dβκ − β(κ+ζ 2 )+dκ

β(κ+ζ 2 )

n = c · n β(κ+ζ2 )+dκ

such that assumption (3.5) is obviously satisfied for n sufficiently large. By inserting rn into the bound of Theorem 3.5 we obtain RL,P (hD,rn ) − R∗L,P ≤ 6cβMNE rnβ + cα,ζ,d τ · (rnd n) ≤ c1 · n = c1 · n

βκ − β(κ+ζ 2 )+dκ βκ − β(κ+ζ 2 )+dκ

≤ c3 τ · n

κ − κ+ζ 2 dκ2

+ c2 τ · n (β(κ+ζ2 )+dκ)(κ+ζ2 ) n

κ − κ+ζ 2

βκ(κ+ζ 2 )

+ c2 τ · n

− (β(κ+ζ 2 )+dκ)(κ+ζ 2 )

βκ − β(κ+ζ 2 )+dκ

with probability P n not less than 1 − 2e−τ , where the constant c1 > 0 depends on β, c2 > 0 on α, ζ, d, and c3 > 0 on α, β, ζ, d .  Note that the constraint on the MNE β in the previous theorem ensures that the chosen side length rn fulfils assumption (3.4). For other choices of rn , we cannot guarantee that the histogram rule learns with the rate from Theorem 3.6. However, since the bound on the MNE β is satisfied in all the comparisons in Section 3.3, we did not consider other choices of rn . The rates in Theorem 3.6 depend on the unknown margin parameters α, ζ and β that characterize the underlying distribution. However, it is possible to obtain the same rates without knowing these parameters in prior by the following data splitting procedure whose concept is similar to the one described in [SC08, Section 6.5]. Let (Rn ) be a sequence of finite subsets Rn ⊂ (0, 1]. For a given dataset D := ((x1 , y1 ), . . . , (xn , yn )) we define the

3.2. Statistical Refinement and Main Results

51

sets D1 := ((x1 , y1 ), . . . , (xk , yk )), D2 := ((xk+1 , yk+1 ), . . . , (xn , yn )), where k := b n2 c + 1 and n ≥ 4. Then, we use D1 as a training set and ∗ compute hD1 ,r for r ∈ Rn , and we use D2 to determine rD ∈ Rn such that 2 ∗ rD := arg min RL,D2 (hD1 ,r ). 2 r∈Rn

∗ The resulting decision function is hD1 ,rD and a learning method producing 2 this decision function is called training validation histogram rule (TV-HR). The following theorem shows that the TV-HR learns with the same rate as in Theorem 3.6 without knowing the parameters of the margin conditions.

Theorem 3.7 (Learning Rates for TV-HR). Let the assumptions of Theorem 3.5 on X0 and on P for β ≤ ζ −1 κ, where κ := (1 + ζ)(α + ζ), be satisfied. Let Rn be a finite subset of (0, 1] such that Rn is a n−1/d -net of (0, 1] and assume that the cardinality of Rn grows at most polynomially in n. Then, there exists a constant cα,β,ζ,d > 0 depending on α, β, ζ, d such that for all τ ≥ 1 and sufficiently large n the TV-HR satisfies ∗ ∗ ) − R RL,P (hD1 ,rD L,P ≤ cα,β,ζ,d τ · n

βκ − β(κ+ζ 2 )+dκ

2

with probability P

n

not less than 1 − 5e−τ . −

κ

Proof. We define rn∗ := c · n β(κ+ζ2 )+dκ , where c > 0 is the constant from (n) (n) Theorem 3.5. Moreover, we assume that Rn := {r1 , . . . , rl } with r0 := 0 (n) (n) and ri−1 < ri for i ∈ {2, . . . , l}. Since Rn is a n−1/d -net, we have (n)

ri

1

(n)

− ri−1 ≤ 2n− d

(3.17)

for all i ∈ {1, . . . , l}. Furthermore, there exists an index i ∈ {1, . . . , l} such 1 (n) (n) (n) (n) that ri−1 ≤ rn∗ ≤ ri . That means |r − rn∗ | ≤ n− d for r ∈ {ri−1 , ri } and we find 1

r ≤ rn∗ + n− d ≤ 2rn∗ , 1

r ≥ rn∗ − n− d ≥ rn∗ .

(3.18)

52

3. Histogram Rule: Oracle Inequality and Learning Rates

An analogous calculation as at the beginning of the proof of Theorem (n) (n) 3.5 shows that r ∈ {ri−1 , ri } satisfies assumptions (3.4) and (3.5) for sufficiently large n. Hence, Theorem 3.5 yields β

RL,P (hD1 ,r ) − R∗L,P ≤ 6 (cMNE r) + cα,ζ,d τ · (rnd n)

κ − κ+ζ 2

(3.19)

with probability P k not less than 1 − 4e−τ . Define τn := τ + log(1 + 2|Rn |). ∗ Since hD1 ,rD is an ERM, we find by Theorem 2.43 together with Lemma 2.30 2 and Theorem 2.41 that ∗ ∗ ) − R RL,P (hD1 ,rD L,P

(3.20)

2

r∈Rn

≤6

 8cα,ζ τn n−k   α+ζ  8cα,ζ τn α+2ζ ∗ RL,P (hD1 ,r ) − RL,P + 4 n−k

 RL,P (hD1 ,r ) − R∗L,P + 4

< 6 inf

inf (n)

(n)

r∈{ri−1 ,ri

}

α+ζ α+2ζ



holds with probability P n−k not less than 1 − e−τ . By combining (3.19) and (3.20), we obtain with (3.18), k ≥ n/2 and n − k = n/2 + n/2 − k ≥ n/4 that ∗ ∗ ) − R RL,P (hD1 ,rD L,P 2

≤6



inf (n)

(n)

r∈{ri−1 ,ri

≤ c 1 τn

6 (cMNE r) + cα,ζ,d τ ·

}



inf (n)

(n)

r∈{ri−1 ,ri

 ≤ c 1 τn

β

β (rn∗ )

+



}

rβ + (rnd n)

d (rn∗ )

n

−

κ κ+ζ 2

κ − κ+ζ 2

+n

κ

− (rnd n) κ+ζ2



+n

α+ζ − α+2ζ

α+ζ − α+2ζ



 +4

32cα,ζ τn n

α+ζ  α+2ζ

!



holds with P n not less than 1 − (1 + 4)e−τ and for some constant c1 > 0 depending on α, β, ζ and d. A variable transformation together with the definition of rn∗ and β ≤ κζ −1 yields ∗ ∗ ) − R RL,P (hD1 ,rD L,P 2    − κ 2 α+ζ κ+ζ β d ≤ c1 τ (rn∗ ) + (rn∗ ) n + n− α+2ζ

3.3. Comparison of Learning Rates

53

  − κ 2 dκ α+ζ κ+ζ + n β(κ+ζ2 )+dκ n + n− α+2ζ   βκ βκ − − ≤ c1 τ 2n β(κ+ζ2 )+dκ + n β(κ+ζ2 )+dκ 

≤ c1 τ

n

βκ − β(κ+ζ 2 )+dκ

≤ 3c1 τ · n

βκ − β(κ+ζ 2 )+dκ

with probability P n not less than 1 − 5e−τ .



3.3. Comparison of Learning Rates In this subsection, we compare the histogram rates obtained in Theorem 3.6 to the ones known from [AT07, BCDD14, KK07] and [SC08]. We try to find reasonable sets of assumptions in all the comparisons down below such that both the histogram rates and the rates obtained by the compared methods are satisfied. Let (Aj )j≥1 be a partition of Rd into cubes of side length r ∈ (0, 1] and assume that the mild geometric assumptions on the decision boundary X0 for fixed η in Theorem 3.5 are satisfied. We denote by (i) and (ii) the following assumptions on P : (i) P has ME α ∈ (0, ∞], (ii) there exists a ζ ∈ [0, ∞) and some constant such that ∆η controls the noise from a) below, b) above. Note that (ii) means in particular that the lower and upper control assumptions made in Definitons 2.28 and 2.31 hold for exactly the same ζ. Since we find with (i) and (ii)b immediately by Lemma 2.33 that P has MNE β = α + ζ, the assumptions on P of Theorem 3.6 are satisfied. Hence, we find with κ := (1 + ζ)(α + ζ) and a suitable cell-width rn that the histogram classifier hD,rn learns under assumption (i) and (ii) with a rate with exponent βκ β[κ+ζ 2 ]+dκ

=

(α+ζ)(1+ζ)(α+ζ) (α+ζ)[(1+ζ)(α+ζ)+ζ 2 ]+d(1+ζ)(α+ζ)

=

(1+ζ)(α+ζ) (1+ζ)(α+ζ)+ζ 2 +d(1+ζ) .

A simple transformation shows that this exponent equals (1+ζ)(α+ζ) (1+ζ)(α+ζ)+ζ 2 +d(1+ζ)

=

α+ζ ζ2 α+ζ+ 1+ζ +d

=

α+ζ ζ2 α+2ζ+ 1+ζ +d−ζ

=

α+ζ ζ α+2ζ+d− 1+ζ

.

(3.21)

54

3. Histogram Rule: Oracle Inequality and Learning Rates

Support Vector Machines. Let us assume that (i) and (ii) are satisfied. Then, Lemma 2.30 yields NE q = αζ and we have MNE β = α + ζ as noted above. Under these conditions [SC08, Section 8.3 (8.18)] shows that the best known rate for SVMs using Gaussian kernels and the hinge loss equals α+ζ

n− α+2ζ+d +ρ ,

(3.22)

where ρ > 0 is an arbitrary small number. Under (i) and (ii), the histogram classifier learns with a rate that has exponent (3.21) and accordingly, comζ pared to (3.22), the rate is better by 1+ζ ∈ [0, 1) in the denominator and matches it only in the case ζ = 0. In the following, we compare the result for the histogram rule with various other ones that basically make the assumption that P has some noise exponent q besides some smoothness condition on η, namely that (iii) η is H¨older-continuous for some ζ ∈ (0, 1]. Note that with condition (iii) Lemma 2.32 shows that assumption (ii) b is fulfilled by the exponent ζ and thus we assume in the following that (ii)a holds for exactly the same ζ. This automatically shrinks the range of ζ, since in (iii) we have ζ ∈ (0, 1], whereas in the case of (ii) we have ζ ∈ [0, ∞). Concerning these observations, under (i) and (ii)a we find together with Lemma 2.30 that we have NE q = αζ . Moreover, we have MNE β = α + ζ. Tree-based and Plug-in classifier. We assume (i), (ii)a and (iii). Then, according to [BCDD14, Thms. 6.1(i) and 6.3(i)] the rates of the considered adaptive tree-based classifiers equal α+ζ

n− α+2ζ+d

(3.23)

up to a log n factor. We remark that the rate in [BCDD14] is achieved under milder assumptions, i.a., some condition on the behavior of the approximation error. However, taking the interconnections of margin conditions into consideration, these assumptions are implied by (i), see [BCDD14, Prop. 4.1]. Moreover, the plug-in classifiers based on kernel, partitioning or nearest neighbor regression estimates in [KK07, Thms. 1, 3 and 5] learn under (i), (ii)a , and (iii) with rate α+ζ

n− α+3ζ+d .

(3.24)

3.3. Comparison of Learning Rates

55

Actually, this rate holds under a slightly weaker assumption, namely that there exists a c¯ > 0 and some α > 0 such that for all δ > 0 the inequality E(|η − 1/2| · 1{|η−1/2|≤δ} ) ≤ c¯ · δ 1+α is satisfied, but this is implied by (i) and (ii)a , see [DGW15, Section 5]. Under the above conditions, the histogram rule learns with a rate with exponent ζ (3.21). Hence, compared to (3.23), our exponent is better by 1+ζ ∈ (0, 12 ] in the denominator, whereas we have an improvement denominator compared to (3.24).

ζ(2+ζ) 1+ζ

∈ (0, 32 ] in the

We emphasize that neither for the histogram rates or the rates from the mentioned authors above nor in the above comparisons, assumptions on the existence of a density of the marginal distribution PX have been made. Down below, we present examples that do make such assumptions on PX . Note that assumptions on the density of PX in combination with a noise exponent, a H¨older assumption on η, and some regularity of the decision boundary, limit the class of the considered distributions, see e.g., [AT07, KK07, BCDD14] or Remark 2.39. We discuss this more precise in Section 4.4. Plug-in classifier I. The authors in [KK07] improved the rates for plugin classifiers based on kernel, partitioning, or nearest neighbor regression estimates given in (3.24) by assuming in addition that PX has a density w.r.t. the Lebesgue measure, which is bounded away from zero. Under this condition on PX and (i), (ii)a and (iii), the plug-in classifiers yield the rate α+ζ

n− 2ζ+d , see [KK07, Thms. 2, 4 and 6]. Hence, our rate with exponent (3.21) is better ζ if the margin exponent α fulfils the inequality α < 1+ζ . For example, we have a small margin exponent α if we have much mass around the decision boundary, that is, the density is unbounded in this region. In fact, probability distributions which have a marginal distribution that is bounded away from zero can never have large values of α, e.g., α = ∞, since this would mean that the classes X−1 and X1 are strictly separated such that the marginal distribution would be close to zero in the vicinity of the decision boundary. In the previous comparison, the histogram rate is only better in certain cases. If the assumption on the marginal distribution is slightly stronger, we show in the subsequent example that the histogram rates outperform the rates of the compared method.

56

3. Histogram Rule: Oracle Inequality and Learning Rates

Plug-in classifier II. Assume (ii)a , (iii) and that PX has a uniformly bounded density w.r.t. the Lebesgue measure. We compare the histogram rates to the ones obtained for certain plug-in classifiers defined in [AT07, Thm. 4.1]. Since the marginal distribution is uniformly bounded, Lemma 2.38 yields ME α = 1. Moreover, we obtain by Lemma 2.30 the NE q = ζ1 and the classifiers of [AT07] thus achieve the rates 1+ζ

n− 1+2ζ+d

(3.25)

in expectation. Under these conditions the histogram rule learns with a rate that has exponent (3.21), that is 1+ζ



n such that the rate is better by

1+2ζ+d−

ζ 1+ζ

ζ 1+ζ

,

in the denominator.

Note that the rate from [AT07] is optimal w.r.t. the assumptions that PX has a uniformly bounded density w.r.t. the Lebesgue measure, that it has some noise exponent q ∈ [0, ∞], and that (iii) is satisfied. The optimal rate equals then ζ(q+1)

n− ζ(q+2)+d .

(3.26)

However, we have seen above that (ii)a , (iii) and the assumption on PX imply these assumptions, but this is not a contradiction since this set of assumptions is a subset of the assumptions mentioned in [AT07], and on this set the histogram rate outperforms the rate from [AT07]. We show in the following that the improvement mainly arises from assumption (ii) a . To be more precise, there exist two sources for slow learning rates in general. The first one is the approximation error, the second one is the existence of critical noise. If we have critical noise, the excess classification risk is zero, see Lemma A.1. Recall that assumption (ii)a forces noise to be located close to the decision boundary. Hence, both effects mentioned above, the approximation error and critical noise, appear in the same region such that they cannot independently occur, which in turn leads to better rates compared to [AT07]. Note that assumption (ii)a does not mean we consider distributions without noisy regions far away from the decision boundary, see Section 2.3.1 and Fig 2.4. To make this heuristic argument more precise, we take a look at Theorem 3.5 and its proof and show that if we omit assumption (ii)a and thus consider

3.3. Comparison of Learning Rates

57

the assumptions taken in [AT07], we match the optimal rate in (3.26): To this end, we consider the above mentioned assumptions of [AT07], that is (iii) and the assumption that P has NE q and that PX has a density w.r.t. the Lebesgue measure that is uniformly bounded. Since we do not assume (ii)a , we cannot apply Theorem 3.3 which delivers a variance bound on the set F s bounded away from the decision boundary. Hence, the separation technique we used in the proof would make no sense any more, but we are able to bound the excess risk on the whole set by Part 2, which analyzes the set N s close to the decision boundary. This corresponds to the situation that s → ∞ such that the set F s is empty. In Part 2 of the proof two things change. First, we apply Theorem 2.41 directly and obtain a variance bound of the form θ ∗ ∗ EP (L ◦ f − L ◦ fL,P )2 ≤ cq EP (L ◦ f − L ◦ fL,P ) , q where θ := q+1 and where cq is a positive constant. Second, we bound the cardinality |F| in (3.11) in the same way as we did in Part 1. Note that the calculation for the approximation error does not change since we have by Lemma 2.34 MNE β := ζ(q + 1). Then, some straightforward calculation yields for the overall excess risk

RL,P (hD,r ) − R∗L,P ≤ 6 (cMNE r)

ζ(q+1)

+c

1  τ  2−θ . rd n

with high probability, where τ ≥ 1 and where c > 0 is a constant. Optimizing over r leads for the histogram classifier to the rate ζ(q+1)

n− ζ(q+2)+d , which matches the optimal rate (3.26), proven in [AT07, Thm. 4.1]. We further remark that instead of the H¨older assumption (iii), the weaker assumption (ii)b is sufficient for Theorem 3.5 and for the modification above. If we assume (ii)a in addition to the assumptions in [AT07] our rate improves immediately, since we can directly apply Theorem 3.5 and thus ζ obtain exactly the rate with exponent (3.21), which is better by 1+ζ . Finally, we remark that for the histogram results as well as for the results from [AT07, BCDD14, KK07] and [SC08], less assumptions are sufficient and in the comparisons above we tried to formulate reasonable sets of common assumptions.

4. Localized SVMs: Oracle Inequalities and Learning Rates The aim of this chapter is to derive global learning rates for localized SVMs. In Section 4.1 we begin with a motivation of the statistical analysis from Section 4.2, in which we establish finite sample bounds and derive local learning rates. In Section 4.3, we present the main results and compare the obtained rates in Section 4.4. 4.1. Motivation Let the assumptions on D and P from the beginning of Chapter 3 be satisfied. In Section 2.2.2 we described that, given some partition, localized SVMs learn individual SVMs on each cell by solving an optimization problem over some local RKHS and w.r.t. a suitable loss function. A typical loss function for classification is the classification loss Lclass , defined in Example 2.5, but the optimization problem is not solvable for this loss function. A suitable convex surrogate for the classification loss is due to Theorem 2.9 the hinge loss L := Lhinge : Y × R → [0, ∞), defined by L(y, t) := max{0, 1 − yt} for y = ±1, t ∈ R. For this loss function, existence and uniqueness of the localized SVM estimators are ensured, see Section 2.2.2. Although we showed in Section 2.2.2 that localized SVMs can be defined w.r.t. an arbitrary partition A it is necessary to specify the assumptions on the underlying partition to obtain learning rates. To this end, we denote by B`d2 the closed unit ball of the d-dimensional Euclidean space `d2 and we define the ball with radius r > 0 and center s ∈ B`d2 by Br (s) := { t ∈ Rd | kt − sk2 ≤ r } with Euclidean norm k · k2 in Rd . Moreover, we define the radius rA of a set A ⊂ B`d2 by rA = inf{r > 0 : ∃s ∈ B`d2 such that A ⊂ Br (s)} Portions of Chapter 4 are based on: I. Blaschzyk and I. Steinwart, Improved Classification Rates for Localized SVMs. ArXiv e-prints 1905.01502, 2019 (submitted)

© Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2020 I. K. Blaschzyk, Improved Classification Rates for Localized Algorithms under Margin Conditions, https://doi.org/10.1007/978-3-658-29591-2_4

60

4. Localized SVMs: Oracle Inequalities and Learning Rates

and we make the following assumptions on the underlying partition: (A) Let A := (Aj )j=1,...,m be a partition of B`d2 and r > 0 such that ˚j 6= ∅ for every j ∈ {1, . . . , m}, and such that there exist we have A z1 , . . . , zm ∈ B`d2 such that Aj ⊂ Br (zj ), and kzi − zj k2 > 2r , i 6= j, and 1

rAj < r ≤ 16m− d ,

for all j ∈ {1, . . . , m},

(4.1)

are satisfied. A Voronoi partition (Aj )j=1,...,m based on an r-net z1 , . . . , zm of B`d2 with r < 16m−1/d and kzi − zj k2 > 2r , i 6= j immediately satisfies the assumptions above, see [MS16]. To fully determine our learning method, we specify the underlying local RKHSs and restrict ourselves to Gaussian RKHSs, which possess benign characteristics, see Sections 2.2.1 and 2.3.2. (H) For every j ∈ {1, . . . , m} let kj : Aj × Aj → R be the Gaussian kernel with width γj > 0, defined by  kγj (x, x0 ) := exp −γj−2 kx − x0 k22 , (4.2) with corresponding RKHS Hγj over Aj . According to Lemma 2.21 and ˆ γ or short H ˆ j for the extended RKHSs over Lemma 2.22 we write H j B`d2 and for some J ⊂ {1, . . . , m} we write HJ for the joint RKHS over B`d2 . We focus on the statistical analysis of localized SVMs using Gaussian kernel and hinge loss, which produce for some γ := (γ1 , . . . , γm ) ∈ (0, ∞)m and λ := (λ1 , . . . , λm ) ∈ (0, ∞)m a final decision function fD,λ,γ ∈ HJ with J = {1, . . . , m}, given by fD,λ,γ (x) :=

m X

fˆDj ,λj ,γj (x),

(4.3)

j=1

ˆ j satisfy where the restrictions of fˆDj ,λj ,γj ∈ H fDj ,λj ,γj = arg min λj kf k2Hj + RL,Dj (f )

(4.4)

f ∈Hj

w.r.t. a local data set Dj , defined in (2.7), and some regularization parameter λj > 0.

4.1. Motivation

61

A generic finite sample bound for localized SVMs is given by Theorem 2.46. By using the hinge loss and Gaussian kernels we find by Remark 2.47 that the bound has, for some arbitrary small p ∈ (0, 1/2) and up to constants, the form RLJ ,P (fÛD,λ,γ ) − R∗LJ ,P X ≤ λj k1Aj f0 k2Hˆ + RLJ ,P (f0 ) − R∗LJ ,P j

j∈J

 +

a2p n

1  2−p−θ+θp

 +3

V n

1  2−θ

+

1 , n

q where θ := q+1 is the exponent and V the constant from a variance bound of the hinge loss, see Theorem 2.42, and where a appears in the bound on the average entropy numbers in (2.25) measuring the capacity of HJ . We remark that based on a similar oracle inequality learning rates already have been derived in [TBMS17, Thm. 3.2] under the assumption that P has NE q and MNE β. Since we aim to motivate improvements of the bound above, we do not discuss these rates at this point and refer the reader to Section 4.4. Recall that in Chapter 3 we improved the approximation error for the histogram rule and the variance bound for the classification loss on sets that are far away from the decision boundary. To improve the variance bound, we assumed that the distance to the decision boundary ∆η controls the noise from below, see Definition 2.28. The next theorem shows that under this assumption an improvement is even possible for the hinge loss and we achieve on sets far away from the decision boundary the best possible variance bound θ = 1.

Lemma 4.1 (Improved Variance Bound for Hinge Loss). Let η be fixed and assume that the distance to the decision boundary ∆η controls the noise from below by the exponent ζ ∈ [0, ∞). Let A be a partition and define for some s > 0 the set of indices J := { j ∈ {1, . . . , m} | ∀x ∈ Aj : ∆η (x) ≥ s }. ∗ Furthermore, let fL,P : X → [−1, 1] be a fixed Bayes decision function w.r.t. the hinge loss. Then, there exists a constant cLC > 0 independent of s such that for all measurable f : X → R we have

2cLC ∗ ∗ EP (LJ ◦ fÛ − LJ ◦ fL,P )2 ≤ ζ EP (LJ ◦ fÛ − LJ ◦ fL,P ). s

62

4. Localized SVMs: Oracle Inequalities and Learning Rates

Proof. Since $\wideparen{f} : X \to [-1,1]$ we consider functions $f : X \to [-1,1]$. Then, an analogous calculation as in the proof of [SC08, Thm. 8.24] yields $(L_J \circ f - L_J \circ f^*_{L,P})^2 = (f - f^*_{L,P})^2$ on $\bigcup_{j\in J} A_j$. Following the same arguments as in Theorem 3.3 we find for all $x \in A_j$, $j \in J$, with the lower-control assumption that
\[
1 \le \frac{c_{\mathrm{LC}}}{s^{\zeta}} \, |2\eta(x) - 1|.
\]
Then, we have
\[
\begin{aligned}
\mathbb{E}_P\bigl(L_J \circ f - L_J \circ f^*_{L,P}\bigr)^2
&= \sum_{j \in J} \int_{A_j} |f(x) - f^*_{L,P}(x)|^2 \, \mathrm{d}P_X(x) \\
&\le 2 \sum_{j \in J} \int_{A_j} |f(x) - f^*_{L,P}(x)| \, \mathrm{d}P_X(x) \\
&\le \frac{2 c_{\mathrm{LC}}}{s^{\zeta}} \sum_{j \in J} \int_{A_j} |f(x) - f^*_{L,P}(x)| \, |2\eta(x) - 1| \, \mathrm{d}P_X(x) \\
&\le \frac{2 c_{\mathrm{LC}}}{s^{\zeta}} \, \mathbb{E}_P\bigl(L_J \circ f - L_J \circ f^*_{L,P}\bigr). \qquad\Box
\end{aligned}
\]

To bound the approximation error, we need more precise investigations. We aim to find an appropriate $f_0 \in H_J$ such that the bound on
\[
\sum_{j \in J} \lambda_j \|\mathbf{1}_{A_j} f_0\|_{\hat H_j}^2 + \mathcal{R}_{L_J,P}(f_0) - \mathcal{R}^*_{L_J,P}
\]
is small. Obviously, we control the error above if we control both the norm and the excess risk. Concerning the norm, we will make the observation that the term is not substantial, since on each cell we will be able to choose sufficiently small regularization parameters $\lambda_j$, see Section 4.2.3. Concerning the excess risk, we remark that it is small if $f_0 \in H_J$ is close to a Bayes decision function, since its risk is then close to the Bayes risk. Note that we cannot assume the Bayes decision function to be contained in the RKHS $H_J$, see [SC08, Section 2]. Nonetheless, we find a function $f_0 \in H_J$ that is similar to a Bayes decision function. We define $f_0$ as the convolution of a function $K_\gamma : \mathbb{R}^d \to \mathbb{R}$ with some $f \in L_2(\mathbb{R}^d)$ such that the restriction to a cell $A$ is contained in the local RKHS over $A$, that is, $(K_\gamma * f)|_A \in H(A)$.
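The following sketch (one dimension, decision boundary at $0$, scipy assumed, parameter values illustrative) computes such a convolution by numerical quadrature: away from the boundary the convolved function is close to $\pm 1$, while it transitions smoothly through the boundary region of width of order $\gamma$.

```python
# Convolving K_gamma with a cut-off sign function (illustrative, d = 1).
import numpy as np
from scipy.integrate import quad

gamma, R = 0.1, 1.0                                   # kernel width and cut-off radius

def K(u):                                             # K_gamma from (4.6) for d = 1
    return (2.0 / (np.pi ** 0.5 * gamma)) ** 0.5 * np.exp(-2.0 * u ** 2 / gamma ** 2)

def f(y):                                             # (pi*gamma^2)^(-1/4) * 1_{[-R,R]}(y) * sign(y)
    return (np.pi * gamma ** 2) ** (-0.25) * np.sign(y) * (np.abs(y) <= R)

def conv(x):                                          # (K_gamma * f)(x) by quadrature
    return quad(lambda y: K(x - y) * f(y), -R, R, points=[0.0], limit=200)[0]

for x in (-0.5, -0.05, 0.0, 0.05, 0.5):
    print(f"x = {x:+.2f}   (K_gamma * f)(x) = {conv(x):+.4f}")  # roughly -1 ... +1, smooth near 0
```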


Hereby, we choose $f$ as a function that on a ball containing $A$ is similar to a Bayes decision function. Doing this, we observe the following cases. If a cell $A$ has no intersection with the decision boundary and, e.g., $A \cap X_1 \neq \emptyset$ but $A \cap X_{-1} = \emptyset$, we have $f^*_{L_{\mathrm{class}},P}(x) = 1$ for all $x \in A \cap X_1$. Otherwise, if the cell intersects the decision boundary, we find $f^*_{L_{\mathrm{class}},P} = \mathrm{sign}(2\eta - 1)$. That is, we aim to approximate different functions, and we will choose $f$ as a constant if the considered cell has no intersection with the decision boundary and as $\mathrm{sign}(2\eta - 1)$ otherwise. These observations are the basis for the investigations of the subsequent section.

4.2. Local Statistical Analysis

In this section, we derive finite sample bounds on the excess risk and learning rates on partition based sets. Concerning the observations on the approximation error and the improvement of the variance bound in the previous section, we analyze in the following different cases, namely the case that cells $A$ satisfy $A \cap X_1 \neq \emptyset$ and $A \cap X_{-1} \neq \emptyset$, so that the cells do have an intersection with the decision boundary, and the case that they do not, that is, $A \cap X_1 = \emptyset$ or $A \cap X_{-1} = \emptyset$. In the latter case, we distinguish between cells that are located sufficiently far away from the decision boundary and cells that are not. For each set consisting of cells that have one of these three properties we investigate in a first step the approximation error in Section 4.2.1. Moreover, we establish improved entropy bounds for local Gaussian kernels in Section 4.2.2. These are necessary to derive local oracle inequalities and learning rates, which we present in Section 4.2.3. We write $\gamma_{\max} := \max_{j \in J} \gamma_j$ and $\gamma_{\min} := \min_{j \in J} \gamma_j$ w.r.t. some $J \subset \{1,\dots,m\}$.

4.2.1. Approximation Error Bounds

For some $f_0 : X \to \mathbb{R}$ we define the function
\[
A_J^{(\gamma)}(\lambda) := \sum_{j \in J} \lambda_j \|\mathbf{1}_{A_j} f_0\|_{\hat H_j}^2 + \mathcal{R}_{L_J,P}(f_0) - \mathcal{R}^*_{L_J,P}. \qquad (4.5)
\]


Recall that we aim to find an $f_0 \in H_J$ such that both the norm and the approximation error in $A_J^{(\gamma)}(\lambda)$ are small. We show in the following that a suitable choice for $f_0$ is a function that is constructed by convolutions of some $f \in L_2(\mathbb{R}^d)$ with the function $K_\gamma : \mathbb{R}^d \to \mathbb{R}$, defined by
\[
K_\gamma(x) = \Bigl(\frac{2}{\pi^{1/2}\gamma}\Bigr)^{d/2} e^{-2\gamma^{-2}\|x\|_2^2}. \qquad (4.6)
\]

Note that we have $K_\gamma * f(x) = \langle \Phi_\gamma(x), f \rangle_{L_2(\mathbb{R}^d)}$ for $x \in X \subset \mathbb{R}^d$, where $\Phi_\gamma$ is a feature map of the Gaussian kernel, see Example 2.19. The following lemma shows that the restriction of this convolution is contained in a local RKHS and that we control the individual RKHS norms in (4.5).

Lemma 4.2 (Convolution). Let $A \subset B_r(z)$ for some $r > 0$, $z \in B_{\ell_2^d}$. Furthermore, let $H_\gamma(A)$ be the RKHS of the Gaussian kernel $k_\gamma$ over $A$ with $\gamma > 0$ and let the function $K_\gamma : \mathbb{R}^d \to \mathbb{R}$ be defined as in (4.6). Moreover, define for some $\rho \ge r$ the function $f_\gamma^\rho : \mathbb{R}^d \to \mathbb{R}$ by
\[
f_\gamma^\rho(x) := (\pi\gamma^2)^{-d/4} \cdot \mathbf{1}_{B_\rho(z)}(x) \cdot \tilde f(x),
\]
where $\tilde f : \mathbb{R}^d \to \mathbb{R}$ is some function with $\|\tilde f\|_\infty \le 1$. Then, we have $\mathbf{1}_A (K_\gamma * f_\gamma^\rho) \in \hat H_\gamma(A)$ with $\|K_\gamma * f_\gamma^\rho\|_\infty \le 1$ and
\[
\|\mathbf{1}_A (K_\gamma * f_\gamma^\rho)\|_{\hat H_\gamma(A)}^2 \le \Bigl(\frac{\rho^2}{\pi\gamma^2}\Bigr)^{d/2} \mathrm{vol}_d(B).
\]

Proof. Obviously, $f_\gamma^\rho \in L_2(\mathbb{R}^d)$, and we find
\[
\begin{aligned}
\|f_\gamma^\rho\|_{L_2(\mathbb{R}^d)}^2
&= \int_{\mathbb{R}^d} \bigl| (\pi\gamma^2)^{-d/4} \mathbf{1}_{B_\rho(z)}(x) \tilde f(x) \bigr|^2 \, \mathrm{d}x
\le (\pi\gamma^2)^{-d/2} \int_{\mathbb{R}^d} |\mathbf{1}_{B_\rho(z)}(x)|^2 \, \mathrm{d}x \\
&= (\pi\gamma^2)^{-d/2} \, \mathrm{vol}_d(B_\rho(z))
= \Bigl(\frac{\rho^2}{\pi\gamma^2}\Bigr)^{d/2} \mathrm{vol}_d(B). \qquad (4.7)
\end{aligned}
\]


Since the map $K_\gamma * \cdot : L_2(\mathbb{R}^d) \to H_\gamma(A)$, given by
\[
K_\gamma * g(x) := \Bigl(\frac{2}{\pi^{1/2}\gamma}\Bigr)^{d/2} \int_{\mathbb{R}^d} e^{-2\gamma^{-2}\|x-y\|_2^2} \, g(y) \, \mathrm{d}y, \qquad g \in L_2(\mathbb{R}^d),\ x \in A,
\]
is a metric surjection due to Example 2.19, we find
\[
\|(K_\gamma * f_\gamma^\rho)|_A\|_{H_\gamma(A)}^2 \le \|f_\gamma^\rho\|_{L_2(\mathbb{R}^d)}^2. \qquad (4.8)
\]
Next, Young's inequality, see [SC08, Thm. A.5.23], yields
\[
\|K_\gamma * f_\gamma^\rho\|_\infty
= \Bigl(\frac{2}{\pi\gamma^2}\Bigr)^{d/2} \bigl\| k_\gamma * (\mathbf{1}_{B_\rho(z)} \tilde f) \bigr\|_\infty
\le \Bigl(\frac{2}{\pi\gamma^2}\Bigr)^{d/2} \|k_\gamma\|_1 \, \|\tilde f\|_\infty \le 1. \qquad (4.9)
\]
Hence, with (4.8) and (4.7) we find
\[
\|\mathbf{1}_A (K_\gamma * f_\gamma^\rho)\|_{\hat H_\gamma(A)}^2
= \|(K_\gamma * f_\gamma^\rho)|_A\|_{H_\gamma(A)}^2
\le \|f_\gamma^\rho\|_{L_2(\mathbb{R}^d)}^2
\le \Bigl(\frac{\rho^2}{\pi\gamma^2}\Bigr)^{d/2} \mathrm{vol}_d(B).
\]
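The two norm bounds used above are easy to verify numerically. The following sketch (one dimension, scipy assumed, all parameter values and the choice of $\tilde f$ arbitrary) checks the $L_2$-bound (4.7) and the sup-norm bound (4.9):

```python
# Numerical check of (4.7) and (4.9) for d = 1 (illustrative values only).
import numpy as np
from scipy.integrate import quad

gamma, rho, z = 0.3, 0.8, 0.1
f_tilde = lambda y: np.cos(3 * y)                    # any function with |f_tilde| <= 1

def f_rho(y):                                        # f_gamma^rho from Lemma 4.2
    return (np.pi * gamma ** 2) ** (-0.25) * (np.abs(y - z) <= rho) * f_tilde(y)

# (4.7): ||f_gamma^rho||_{L2}^2 <= (rho^2 / (pi * gamma^2))^{1/2} * vol_1(B), with vol_1(B) = 2
l2_sq = quad(lambda y: f_rho(y) ** 2, z - rho, z + rho)[0]
print(l2_sq, (rho ** 2 / (np.pi * gamma ** 2)) ** 0.5 * 2)

# (4.9): ||K_gamma * f_gamma^rho||_infty <= 1
def conv(x):
    c = (2.0 / (np.pi ** 0.5 * gamma)) ** 0.5        # K_gamma(u) = c * exp(-2 u^2 / gamma^2)
    return quad(lambda y: c * np.exp(-2 * (x - y) ** 2 / gamma ** 2) * f_rho(y),
                z - rho, z + rho)[0]
print(max(abs(conv(x)) for x in np.linspace(-1, 1, 41)))   # should be <= 1
```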



To bound the excess risks in (4.5) we apply Zhang's equality, see Theorem 2.9, which yields
\[
\mathcal{R}_{L_J,P}(f_0) - \mathcal{R}^*_{L_J,P}
= \sum_{j \in J} \int_{A_j} |f_0 - f^*_{L_{\mathrm{class}},P}| \, |2\eta - 1| \, \mathrm{d}P_X. \qquad (4.10)
\]
We begin with an analysis on sets whose cells $A$ do have an intersection with the decision boundary. For such cells the subsequent lemma presents a suitable function $f_0$ and its difference to $f^*_{L_{\mathrm{class}},P}$ that occurs in (4.10). In particular, the function $f_0$ is a convolution of $K_\gamma$ with $\mathrm{sign}(2\eta - 1)$, since we have $f^*_{L_{\mathrm{class}},P}(x) = \mathrm{sign}(2\eta(x) - 1)$ for $x \in A$, as mentioned in Section 4.1.

Lemma 4.3 (Difference to $f^*_{L_{\mathrm{class}},P}$ I). Assume that the assumptions of Lemma 4.2 are satisfied with $A \cap X_1 \neq \emptyset$ and $A \cap X_{-1} \neq \emptyset$. We define the function $f_\gamma^{3r} : \mathbb{R}^d \to \mathbb{R}$ by
\[
f_\gamma^{3r}(x) := (\pi\gamma^2)^{-d/4} \cdot \mathbf{1}_{B_{3r}(z) \cap (X_1 \cup X_{-1})}(x) \, \mathrm{sign}(2\eta(x) - 1). \qquad (4.11)
\]


Then, we find for all $x \in A$ that
\[
|K_\gamma * f_\gamma^{3r}(x) - f^*_{L_{\mathrm{class}},P}(x)|
\le \frac{2}{\Gamma(d/2)} \int_{2\Delta_\eta^2(x)\gamma^{-2}}^{\infty} e^{-t} t^{d/2-1} \, \mathrm{d}t,
\]
where $K_\gamma$ is the function defined in (4.6).

Proof. Let us consider w.l.o.g. $x \in A \cap X_1$. Then,
\[
\Delta_\eta(x) = \inf_{\bar x \in X_{-1}} \|x - \bar x\|_2 \le \mathrm{diam}(B_r(z)) = 2r. \qquad (4.12)
\]
Next, we denote by $\mathring{B}$ the open ball and show that $\mathring{B}_{\Delta_\eta(x)}(x) \subset B_{3r}(z) \cap X_1$. For $x' \in \mathring{B}_{\Delta_\eta(x)}(x)$ we have $\|x' - x\|_2 < \Delta_\eta(x)$ such that $x' \in X_1$. Furthermore, (4.12) yields
\[
\|x' - z\|_2 \le \|x' - x\|_2 + \|x - z\|_2 < \Delta_\eta(x) + r \le 2r + r = 3r,
\]
and hence $x' \in B_{3r}(z)$. We find
\[
\begin{aligned}
K_\gamma * f_\gamma^{3r}(x)
&= \Bigl(\frac{2}{\pi\gamma^2}\Bigr)^{d/2} \int_{\mathbb{R}^d} e^{-2\gamma^{-2}\|x-y\|_2^2} \, \mathbf{1}_{B_{3r}(z)\cap(X_1\cup X_{-1})}(y) \, \mathrm{sign}(2\eta(y)-1) \, \mathrm{d}y \\
&= \Bigl(\frac{2}{\pi\gamma^2}\Bigr)^{d/2} \Bigl( \int_{B_{3r}(z)\cap X_1} e^{-2\gamma^{-2}\|x-y\|_2^2} \, \mathrm{d}y - \int_{B_{3r}(z)\cap X_{-1}} e^{-2\gamma^{-2}\|x-y\|_2^2} \, \mathrm{d}y \Bigr) \\
&\ge \Bigl(\frac{2}{\pi\gamma^2}\Bigr)^{d/2} \Bigl( \int_{\mathring{B}_{\Delta_\eta(x)}(x)} e^{-2\gamma^{-2}\|x-y\|_2^2} \, \mathrm{d}y - \int_{\mathbb{R}^d \setminus \mathring{B}_{\Delta_\eta(x)}(x)} e^{-2\gamma^{-2}\|x-y\|_2^2} \, \mathrm{d}y \Bigr) \\
&= 2 \Bigl(\frac{2}{\pi\gamma^2}\Bigr)^{d/2} \int_{\mathring{B}_{\Delta_\eta(x)}(x)} e^{-2\gamma^{-2}\|x-y\|_2^2} \, \mathrm{d}y - 1.
\end{aligned}
\]
Since $f^*_{L_{\mathrm{class}},P}(x) = 1$, we obtain by Lemma 4.2 for $\rho = 3r$ and $\tilde f = \mathbf{1}_{X_1 \cup X_{-1}} \mathrm{sign}(2\eta - 1)$, and by Lemma A.2, that
\[
\begin{aligned}
|K_\gamma * f_\gamma^{3r}(x) - f^*_{L_{\mathrm{class}},P}(x)|
&= |K_\gamma * f_\gamma^{3r}(x) - 1|
= 1 - K_\gamma * f_\gamma^{3r}(x) \qquad (4.13) \\
&\le 2 - 2 \Bigl(\frac{2}{\pi\gamma^2}\Bigr)^{d/2} \int_{\mathring{B}_{\Delta_\eta(x)}(x)} e^{-2\gamma^{-2}\|x-y\|_2^2} \, \mathrm{d}y \\
&= 2 - \frac{2}{\Gamma(d/2)} \int_{0}^{2\Delta_\eta^2(x)\gamma^{-2}} e^{-t} t^{d/2-1} \, \mathrm{d}t \\
&= \frac{2}{\Gamma(d/2)} \Bigl( \int_{0}^{\infty} e^{-t} t^{d/2-1} \, \mathrm{d}t - \int_{0}^{2\Delta_\eta^2(x)\gamma^{-2}} e^{-t} t^{d/2-1} \, \mathrm{d}t \Bigr) \\
&= \frac{2}{\Gamma(d/2)} \int_{2\Delta_\eta^2(x)\gamma^{-2}}^{\infty} e^{-t} t^{d/2-1} \, \mathrm{d}t.
\end{aligned}
\]
The case $x \in A \cap X_0$ is clear, and for $x \in A \cap X_{-1}$ the calculation yields the same inequality; hence (4.13) holds for all $x \in A$. $\Box$

Under the assumption that $P$ has some MNE $\beta$ we immediately obtain in the next theorem a bound on the approximation error on partition based sets that have an intersection with the decision boundary.

Theorem 4.4 (Approximation Error Bound I). Let (A) and (H) be satisfied and let $P$ have MNE $\beta \in (0,\infty]$. Define the set of indices
\[
J := \{\, j \in \{1,\dots,m\} \mid A_j \cap X_1 \neq \emptyset \text{ and } A_j \cap X_{-1} \neq \emptyset \,\}
\]
and the function $f_0 : X \to \mathbb{R}$ by
\[
f_0 := \sum_{j \in J} \mathbf{1}_{A_j} \bigl( K_{\gamma_j} * f_{\gamma_j}^{3r} \bigr),
\]

where the functions $K_{\gamma_j}$ and $f_{\gamma_j}^{3r}$ are defined in (4.6) and (4.11). Then, $f_0 \in H_J$ and $\|f_0\|_\infty \le 1$. Moreover, there exist constants $c_d, c_{d,\beta} > 0$ only depending on $d$ and $\beta$ such that
\[
A_J^{(\gamma)}(\lambda) \le c_d \cdot \sum_{j \in J} \frac{\lambda_j r^d}{\gamma_j^d} + c_{d,\beta} \cdot \max_{j \in J} \gamma_j^{\beta}.
\]
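Before turning to the proof, the opposite behaviour of the two terms in this bound is easy to inspect numerically. The following sketch (arbitrary illustrative values of $\lambda$, $r$, $d$ and $\beta$; a common width $\gamma$ on all cells) balances the norm part against the excess-risk part:

```python
# Trade-off between lambda * r^d / gamma^d and gamma^beta (illustrative values).
import numpy as np

lam, r, d, beta = 1e-5, 0.1, 2, 1.5
gammas = np.logspace(-4, np.log10(r), 400)           # gamma <= r, as required later
bound = lam * r ** d / gammas ** d + gammas ** beta
print(f"best gamma ~ {gammas[np.argmin(bound)]:.4f}, bound ~ {bound.min():.3e}")
print((d * lam * r ** d / beta) ** (1 / (beta + d)))  # closed-form balance point
```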

Proof. By Lemma 4.2 for $\rho = 3r$ and $\tilde f = \mathbf{1}_{X_1 \cup X_{-1}} \mathrm{sign}(2\eta - 1)$ we immediately have $f_0 \in H_J$, as well as $\|f_0\|_\infty = \|K_\gamma * f_\gamma^{3r}\|_\infty \le 1$. Moreover, Lemma 4.2 yields
\[
\sum_{j \in J} \lambda_j \|\mathbf{1}_{A_j} f_0\|_{\hat H_j}^2
= \sum_{j \in J} \lambda_j \|\mathbf{1}_{A_j}(K_{\gamma_j} * f_{\gamma_j}^{3r})\|_{\hat H_j}^2
\le c_d \cdot \sum_{j \in J} \frac{\lambda_j r^d}{\gamma_j^d}
\]
for some constant $c_d > 0$. Next, we bound the excess risk of $f_0$. To this end, we fix w.l.o.g. an $x \in A_j \cap X_1$ and find
\[
\Delta_\eta(x) = \inf_{\bar x \in X_{-1}} \|x - \bar x\|_2 \le \mathrm{diam}(B_r(z)) = 2r,
\]
and we conclude $A_j \subset \{\Delta_\eta(x) \le 2r\}$ for every $j \in J$. Together with Theorem 2.9 and Lemma 4.3 we then obtain
\[
\begin{aligned}
\mathcal{R}_{L_J,P}(f_0) - \mathcal{R}^*_{L_J,P}
&= \sum_{j \in J} \int_{A_j} |(K_{\gamma_j} * f_{\gamma_j}^{3r})(x) - f^*_{L_{\mathrm{class}},P}(x)| \, |2\eta(x) - 1| \, \mathrm{d}P_X(x) \\
&\le \frac{2}{\Gamma(d/2)} \sum_{j \in J} \int_{A_j} \int_{2\Delta_\eta^2(x)\gamma_j^{-2}}^{\infty} e^{-t} t^{d/2-1} \, \mathrm{d}t \, |2\eta(x) - 1| \, \mathrm{d}P_X(x) \\
&= \frac{2}{\Gamma(d/2)} \sum_{j \in J} \int_{A_j} \int_{0}^{\infty} \mathbf{1}_{[0,\sqrt{t/2}\,\gamma_j)}(\Delta_\eta(x)) \, e^{-t} t^{d/2-1} \, \mathrm{d}t \, |2\eta(x) - 1| \, \mathrm{d}P_X(x) \\
&\le \frac{2}{\Gamma(d/2)} \int_{0}^{\infty} \int_{\{\Delta_\eta \le 2r\}} \mathbf{1}_{[0,\sqrt{t/2}\,\gamma_{\max})}(\Delta_\eta) \, |2\eta - 1| \, \mathrm{d}P_X \, e^{-t} t^{d/2-1} \, \mathrm{d}t \\
&= \frac{2}{\Gamma(d/2)} \int_{0}^{\infty} \int_{\{\Delta_\eta \le \min\{2r, \sqrt{t/2}\,\gamma_{\max}\}\}} |2\eta(x) - 1| \, \mathrm{d}P_X(x) \, e^{-t} t^{d/2-1} \, \mathrm{d}t.
\end{aligned}
\]
Next, a simple calculation shows that
\[
\min\bigl\{2r, \sqrt{t/2}\,\gamma_{\max}\bigr\} =
\begin{cases}
2r, & \text{if } t \ge 8r^2\gamma_{\max}^{-2}, \\
\sqrt{t/2}\,\gamma_{\max}, & \text{else,}
\end{cases}
\]
and that $1 \le \bigl(\frac{t\gamma_{\max}^2}{8r^2}\bigr)^{\beta/2}$ for $t \ge 8r^2\gamma_{\max}^{-2}$. Finally, the definition of the MNE yields
\[
\begin{aligned}
\mathcal{R}_{L_J,P}(f_0) - \mathcal{R}^*_{L_J,P}
&\le \frac{2}{\Gamma(d/2)} \int_{0}^{\infty} \int_{\{\Delta_\eta \le \min\{2r, \sqrt{t/2}\,\gamma_{\max}\}\}} |2\eta - 1| \, \mathrm{d}P_X \, e^{-t} t^{d/2-1} \, \mathrm{d}t \\
&\le \frac{2 c_{\mathrm{MNE}}^{\beta}}{\Gamma(d/2)} \int_{0}^{\infty} \bigl( \min\{2r, \sqrt{t/2}\,\gamma_{\max}\} \bigr)^{\beta} e^{-t} t^{d/2-1} \, \mathrm{d}t \\
&= \frac{2 c_{\mathrm{MNE}}^{\beta}}{\Gamma(d/2)} \Bigl( \gamma_{\max}^{\beta} \int_{0}^{8r^2\gamma_{\max}^{-2}} \Bigl(\frac{t}{2}\Bigr)^{\beta/2} e^{-t} t^{d/2-1} \, \mathrm{d}t + \int_{8r^2\gamma_{\max}^{-2}}^{\infty} (2r)^{\beta} e^{-t} t^{d/2-1} \, \mathrm{d}t \Bigr) \\
&\le \frac{2 c_{\mathrm{MNE}}^{\beta}}{\Gamma(d/2)} \Bigl( \frac{\gamma_{\max}^{\beta}}{2^{\beta/2}} \int_{0}^{8r^2\gamma_{\max}^{-2}} e^{-t} t^{(d+\beta)/2-1} \, \mathrm{d}t + (2r)^{\beta} \Bigl(\frac{\gamma_{\max}^2}{8r^2}\Bigr)^{\beta/2} \int_{8r^2\gamma_{\max}^{-2}}^{\infty} e^{-t} t^{(d+\beta)/2-1} \, \mathrm{d}t \Bigr) \\
&= \frac{2 c_{\mathrm{MNE}}^{\beta}}{\Gamma(d/2)} \, \frac{\gamma_{\max}^{\beta}}{2^{\beta/2}} \Bigl( \int_{0}^{8r^2\gamma_{\max}^{-2}} e^{-t} t^{(d+\beta)/2-1} \, \mathrm{d}t + \int_{8r^2\gamma_{\max}^{-2}}^{\infty} e^{-t} t^{(d+\beta)/2-1} \, \mathrm{d}t \Bigr) \\
&= \frac{2^{1-\beta/2} c_{\mathrm{MNE}}^{\beta} \, \Gamma\bigl((d+\beta)/2\bigr)}{\Gamma(d/2)} \, \gamma_{\max}^{\beta}. \qquad\Box
\end{aligned}
\]

In the previous theorem, we observe a contrary behavior of the individual kernel parameters $\gamma_j$: e.g., smaller $\gamma_j$ lead to a smaller bound on the excess risk but increase the bound on the norm. We remark that the bound of the approximation error has the same structure as the error bound of global Gaussian kernels in [SC08, Thm. 8.18] for $\tau = \infty$, which has up to constants the form
\[
\lambda \|f_0\|_{H_\gamma(X)}^2 + \mathcal{R}_{L,P}(f_0) - \mathcal{R}^*_{L,P} \le \lambda\gamma^{-d} + \gamma^{\beta}, \qquad (4.14)
\]
for some appropriate $f_0 \in H$, where $\gamma > 0$ is the kernel parameter of the global Gaussian kernel and $\lambda > 0$ the regularization parameter. Assume that all cells have the same kernel and regularization parameter. Together with (4.1) we then find that the bound in Theorem 4.4 matches (4.14) up to constants. Next, we develop bounds on the approximation error over partition based sets whose cells $A$ have no intersection with the decision boundary. Recall that we apply (4.10), and again the subsequent lemma presents a suitable function $f_0$ and its difference to $f^*_{L_{\mathrm{class}},P}$ that occurs in (4.10). Note that


on those sets we have $f^*_{L_{\mathrm{class}},P}(x) = 1$ for $x \in A$ and hence, we choose a function $f_0 \in H_J$ that is a convolution of $K_\gamma$, defined in (4.6), with a constant function $f$ that we have to cut off to ensure that it is an element of $L_2(\mathbb{R}^d)$. Unfortunately, we will always make an error on such cells, since Gaussian RKHSs do not contain constant functions, see Example 4.2. In order to make the convolved function as flat as possible on a cell, we choose the radius of the ball on which $f$ is constant arbitrarily large, that means $\omega_+ > r$. Note that although $\omega_+ > r$, we obtain by convolution a function that is still contained in a local RKHS over a cell $A$.

Lemma 4.5 (Difference to $f^*_{L_{\mathrm{class}},P}$ II). Let the assumptions of Lemma 4.2 be satisfied with $A \cap X_1 = \emptyset$ or $A \cap X_{-1} = \emptyset$. For $\omega_- > 0$ we define $\omega_+ := \omega_- + r$ and the function $f_\gamma^{\omega_+} : \mathbb{R}^d \to \mathbb{R}$ by
\[
f_\gamma^{\omega_+}(x) :=
\begin{cases}
(\pi\gamma^2)^{-d/4} \cdot \mathbf{1}_{B_{\omega_+}(z) \cap (X_1 \cup X_0)}(x), & \text{if } x \in A \cap (X_1 \cup X_0), \\
(-1) \cdot (\pi\gamma^2)^{-d/4} \cdot \mathbf{1}_{B_{\omega_+}(z) \cap X_{-1}}(x), & \text{else.}
\end{cases} \qquad (4.15)
\]
Then, we find for all $x \in A$ that
\[
|(K_\gamma * f_\gamma^{\omega_+})(x) - f^*_{L_{\mathrm{class}},P}(x)|
\le \frac{1}{\Gamma(d/2)} \int_{2(\omega_-)^2\gamma^{-2}}^{\infty} e^{-t} t^{d/2-1} \, \mathrm{d}t,
\]

where $K_\gamma$ is the function defined in (4.6).

Proof. We assume w.l.o.g. that $x \in A \cap (X_1 \cup X_0)$ and show in a first step that $\mathring{B}_{\omega_-}(x) \subset B_{\omega_+}(z)$. To this end, consider an $x' \in \mathring{B}_{\omega_-}(x)$. Since $A \subset B_r(z)$ we find
\[
\|x' - z\|_2 \le \|x' - x\|_2 + \|x - z\|_2 < \omega_- + r = \omega_+,
\]
and hence $x' \in B_{\omega_+}(z)$. Next, we obtain by Lemma A.2 that
\[
\begin{aligned}
K_\gamma * f_\gamma^{\omega_+}(x)
&= \Bigl(\frac{2}{\pi^{1/2}\gamma}\Bigr)^{d/2} \int_{\mathbb{R}^d} e^{-2\gamma^{-2}\|x-y\|_2^2} (\pi\gamma^2)^{-d/4} \, \mathbf{1}_{B_{\omega_+}(z)}(y) \, \mathrm{d}y \\
&= \Bigl(\frac{2}{\pi\gamma^2}\Bigr)^{d/2} \int_{B_{\omega_+}(z)} e^{-2\gamma^{-2}\|x-y\|_2^2} \, \mathrm{d}y
\ge \Bigl(\frac{2}{\pi\gamma^2}\Bigr)^{d/2} \int_{\mathring{B}_{\omega_-}(x)} e^{-2\gamma^{-2}\|x-y\|_2^2} \, \mathrm{d}y \\
&= \frac{1}{\Gamma(d/2)} \int_{0}^{2(\omega_-)^2\gamma^{-2}} e^{-t} t^{d/2-1} \, \mathrm{d}t.
\end{aligned}
\]
Since $f^*_{L_{\mathrm{class}},P}(x) = 1$, we finally have by Lemma 4.2 for $\rho = \omega_+$ and $\tilde f := \mathbf{1}_{X_1 \cup X_0}$ that
\[
\begin{aligned}
|(K_\gamma * f_\gamma^{\omega_+})(x) - f^*_{L_{\mathrm{class}},P}(x)|
&= |(K_\gamma * f_\gamma^{\omega_+})(x) - 1|
= 1 - (K_\gamma * f_\gamma^{\omega_+})(x) \\
&\le 1 - \frac{1}{\Gamma(d/2)} \int_{0}^{2(\omega_-)^2\gamma^{-2}} e^{-t} t^{d/2-1} \, \mathrm{d}t
= \frac{1}{\Gamma(d/2)} \int_{2(\omega_-)^2\gamma^{-2}}^{\infty} e^{-t} t^{d/2-1} \, \mathrm{d}t.
\end{aligned}
\]
For $x \in A \cap X_{-1}$ the latter calculations yield with $\tilde f := \mathbf{1}_{X_{-1}}$ the same results and hence, the latter inequality holds for all $x \in A$. $\Box$

In the next theorem, we state bounds on the approximation error over sets that have no intersection with the decision boundary. If the cells are close to the decision boundary, we refine the bound under the additional assumption that $P$ has MNE $\beta$.

Theorem 4.6 (Approximation Error II). Let (A) and (H) be satisfied and define the set of indices
\[
J := \{\, j \in \{1,\dots,m\} \mid A_j \cap X_1 = \emptyset \text{ or } A_j \cap X_{-1} = \emptyset \,\}.
\]

For some $\omega_- > 0$ define $\omega_+ := \omega_- + r > 0$ and let the function $f_{\gamma_j}^{\omega_+}$ for every $j \in J$ be defined as in (4.15). Moreover, define the function $f_0 : X \to \mathbb{R}$ by
\[
f_0 := \sum_{j \in J} \mathbf{1}_{A_j} \bigl( K_{\gamma_j} * f_{\gamma_j}^{\omega_+} \bigr).
\]
Then, $f_0 \in H_J$ and $\|f_0\|_\infty \le 1$. Furthermore, for all $\xi > 0$ there exist constants $c_d, c_{d,\xi} > 0$ such that
\[
A_J^{(\gamma)}(\lambda) \le c_d \cdot \sum_{j \in J} \lambda_j \Bigl(\frac{\omega_+}{\gamma_j}\Bigr)^{d} + c_{d,\xi} \Bigl(\frac{\max_{j \in J}\gamma_j}{\omega_-}\Bigr)^{2\xi} \sum_{j \in J} P_X(A_j).
\]
In addition, if $P$ has MNE $\beta \in (0,\infty]$ and $A_j \subset \{\Delta_\eta(x) \le s\}$ for every $j \in J$, then
\[
A_J^{(\gamma)}(\lambda) \le c_d \cdot \sum_{j \in J} \lambda_j \Bigl(\frac{\omega_+}{\gamma_j}\Bigr)^{d} + c_{d,\xi} \Bigl(\frac{\max_{j \in J}\gamma_j}{\omega_-}\Bigr)^{2\xi} (c_{\mathrm{MNE}}\, s)^{\beta}.
\]
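Before the proof, note that the flatness error controlled by the second term stems from the incomplete-Gamma tail of Lemma 4.5, which can be evaluated exactly with the regularized upper incomplete gamma function. A small sketch (scipy assumed; the values of $d$, $\gamma$, $\omega_-$ and $\xi$ are illustrative) compares the exact tail with the polynomial bound used in the theorem:

```python
# Tail term of Lemma 4.5 vs. the xi-dependent polynomial bound (illustrative values).
from scipy.special import gamma, gammaincc   # gammaincc(a, x) = Gamma(a, x) / Gamma(a)

d, g = 3, 0.05
for om in (0.1, 0.2, 0.5):
    exact = gammaincc(d / 2, 2 * om ** 2 / g ** 2)
    for xi in (1.0, 3.0):
        poly = 2 ** (-xi) * gamma(d / 2 + xi) / gamma(d / 2) * (g / om) ** (2 * xi)
        print(f"omega_-={om:4.2f} xi={xi}: exact tail {exact:.2e} <= bound {poly:.2e}")
```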

Proof. By Lemma 4.2 with $\rho = \omega_+$ and $\tilde f := \mathbf{1}_{X_1 \cup X_0}$ resp. $\tilde f := \mathbf{1}_{X_{-1}}$ we immediately have $f_0 \in H_J$ and $\|f_0\|_\infty = \|K_{\gamma_j} * f_{\gamma_j}^{\omega_+}\|_\infty \le 1$. Moreover, Lemma 4.2 yields
\[
\sum_{j \in J} \lambda_j \|\mathbf{1}_{A_j} f_0\|_{\hat H_j}^2
= \sum_{j \in J} \lambda_j \|\mathbf{1}_{A_j}(K_{\gamma_j} * f_{\gamma_j}^{\omega_+})\|_{\hat H_j}^2
\le c_d \cdot \sum_{j \in J} \lambda_j \Bigl(\frac{\omega_+}{\gamma_j}\Bigr)^{d}
\]
for some constant $c_d > 0$. Next, we bound the excess risk of $f_0$. By applying Theorem 2.9, Lemma 4.5 and [SC08, Lemma A.1.1] we find for some arbitrary $\xi > 0$ that
\[
\begin{aligned}
\mathcal{R}_{L_J,P}(f_0) - \mathcal{R}^*_{L_J,P}
&= \sum_{j \in J} \int_{A_j} |(K_{\gamma_j} * f_{\gamma_j}^{\omega_+}) - f^*_{L_{\mathrm{class}},P}| \, |2\eta - 1| \, \mathrm{d}P_X \\
&\le \sum_{j \in J} \int_{A_j} \frac{1}{\Gamma(d/2)} \int_{2(\omega_-)^2\gamma_j^{-2}}^{\infty} e^{-t} t^{d/2-1} \, \mathrm{d}t \, |2\eta - 1| \, \mathrm{d}P_X \\
&\le \frac{1}{\Gamma(d/2)} \int_{2(\omega_-)^2\gamma_{\max}^{-2}}^{\infty} e^{-t} t^{d/2-1} \, \mathrm{d}t \, \sum_{j \in J} \int_{A_j} |2\eta - 1| \, \mathrm{d}P_X \\
&= \frac{\Gamma\bigl(d/2,\, 2(\omega_-)^2\gamma_{\max}^{-2}\bigr)}{\Gamma(d/2)} \sum_{j \in J} \int_{A_j} |2\eta - 1| \, \mathrm{d}P_X
\le \frac{2^{-\xi}\,\Gamma(d/2+\xi)}{\Gamma(d/2)} \Bigl(\frac{\gamma_{\max}}{\omega_-}\Bigr)^{2\xi} \sum_{j \in J} P_X(A_j).
\end{aligned}
\]
If in addition $P$ has MNE $\beta$ and $A_j \subset \{\Delta_\eta \le s\}$ for every $j \in J$, we modify the previous calculation of the excess risk. Then, we obtain again by Theorem 2.9, Lemma 4.5 and [SC08, Lemma A.1.1] for some arbitrary $\xi > 0$ that
\[
\begin{aligned}
\mathcal{R}_{L_J,P}(f_0) - \mathcal{R}^*_{L_J,P}
&\le \sum_{j \in J} \int_{A_j} \frac{1}{\Gamma(d/2)} \int_{2(\omega_-)^2\gamma_j^{-2}}^{\infty} e^{-t} t^{d/2-1} \, \mathrm{d}t \, |2\eta - 1| \, \mathrm{d}P_X \\
&\le \frac{1}{\Gamma(d/2)} \int_{2(\omega_-)^2\gamma_{\max}^{-2}}^{\infty} e^{-t} t^{d/2-1} \, \mathrm{d}t \int_{\{\Delta_\eta \le s\}} |2\eta - 1| \, \mathrm{d}P_X \\
&\le \frac{\Gamma\bigl(d/2,\, 2(\omega_-)^2\gamma_{\max}^{-2}\bigr)}{\Gamma(d/2)} \, c_{\mathrm{MNE}}^{\beta}\, s^{\beta}
\le \frac{c_{\mathrm{MNE}}^{\beta}\, 2^{-\xi}\,\Gamma(d/2+\xi)}{\Gamma(d/2)} \Bigl(\frac{\gamma_{\max}}{\omega_-}\Bigr)^{2\xi} s^{\beta}.
\end{aligned}
\]
The final bounds are obtained by a combination of the results.



Both bounds in the theorem above depend on the parameter $\xi > 0$. However, we observe in the theorems in Section 4.2.3, which state the corresponding oracle inequalities, that by choosing $\omega_-$ appropriately this $\xi$ will not have an influence anymore.

4.2.2. Entropy Bounds

An important tool to obtain oracle inequalities is to have bounds on entropy numbers, see Section 2.3.2. For local Gaussian RKHSs $H_\gamma(A)$ bounds on entropy numbers have been established in [MS16, Thm. 12] such that for $\gamma \le r$, $p \in (0,1)$ and $i \ge 1$ we have
\[
e_i\bigl(\mathrm{id} : H_\gamma(A) \to L_2(P_{X|A})\bigr)
\le c_p \sqrt{P_X(A)} \; r^{\frac{d+2p}{2p}} \gamma^{-\frac{d+2p}{2p}} i^{-\frac{1}{2p}}. \qquad (4.16)
\]
Unfortunately, the constant $c_p$ depends on $p$ in an unknown manner. We improve this result by following the lines of the proof of [MS16, Thm. 12], but apply the improved bound on entropy numbers for the Gaussian RKHS stated in Theorem 2.45.

Lemma 4.7 (Entropy Bound for Local Gaussian RKHSs). Let $A \subset B_{\ell_2^d}$ be such that $\mathring{A} \neq \emptyset$ and $A \subset B_r(z)$ with $r > 0$, $z \in B_{\ell_2^d}$. Let $H_\gamma(A)$ be the RKHS of the Gaussian kernel $k_\gamma$ over $A$. Then, for all $p \in (0,\frac12)$ there exists a constant $c_{d,p} > 0$ such that for all $\gamma \le r$ and $i \ge 1$ we have
\[
e_i\bigl(\mathrm{id} : H_\gamma(A) \to L_2(P_{X|A})\bigr)
\le c_{d,p} \sqrt{P_X(A)} \cdot r^{\frac{d}{2p}} \gamma^{-\frac{d}{2p}} i^{-\frac{1}{2p}},
\]
where $c_{d,p} := (3c_d)^{\frac{1}{2p}} \bigl(\frac{d+1}{2ep}\bigr)^{\frac{d+1}{2p}}$.


Proof. Following the lines of [MS16, Thm. 12] we consider the commutative diagram
\[
\begin{array}{ccc}
H_\gamma(A) & \xrightarrow{\;\mathrm{id}\;} & L_2(P_{X|A}) \\[2pt]
{\scriptstyle I_{B_r}^{-1}\circ I_A}\;\big\downarrow & & \big\uparrow\;{\scriptstyle \mathrm{id}} \\[2pt]
H_\gamma(B_r) & \xrightarrow{\;\mathrm{id}\;} & \ell_\infty(B_r)
\end{array}
\]
where the extension operator $I_A : H_\gamma(A) \to H_\gamma(\mathbb{R}^d)$ and the restriction operator $I_{B_r} : H_\gamma(\mathbb{R}^d) \to H_\gamma(B_r)$, defined in [SC08, Thm. 4.37], are isometric isomorphisms such that $\|I_{B_r}^{-1} \circ I_A : H_\gamma(A) \to H_\gamma(B_r)\| = 1$. According to (2.21) and (2.22) we then have
\[
e_i\bigl(\mathrm{id} : H_\gamma(A) \to L_2(P_{X|A})\bigr)
\le e_i\bigl(\mathrm{id} : H_\gamma(B_r) \to \ell_\infty(B_r)\bigr) \cdot \|\mathrm{id} : \ell_\infty(B_r) \to L_2(P_{X|A})\|, \qquad (4.17)
\]
where we find that
\[
\|\mathrm{id} : \ell_\infty(B_r) \to L_2(P_{X|A})\| \le \sqrt{P_X(A)}, \qquad (4.18)
\]
since for $f \in \ell_\infty(B_r)$
\[
\|f\|_{L_2(P_{X|A})}
= \Bigl(\int_X \mathbf{1}_A(x) |f(x)|^2 \, \mathrm{d}P_X(x)\Bigr)^{\frac12}
\le \|f\|_\infty \cdot \Bigl(\int_X \mathbf{1}_A(x) \, \mathrm{d}P_X(x)\Bigr)^{\frac12}
\le \|f\|_\infty \sqrt{P_X(A)}.
\]
Furthermore, by (2.21), (2.22) and Theorem 2.45 we obtain
\[
e_i\bigl(\mathrm{id} : H_\gamma(B_r) \to \ell_\infty(B_r)\bigr)
\le e_i\bigl(\mathrm{id} : H_{\frac{\gamma}{r}}(r^{-1}B) \to \ell_\infty(r^{-1}B)\bigr)
\le c_{d,p} \cdot r^{\frac{d}{2p}} \gamma^{-\frac{d}{2p}} i^{-\frac{1}{2p}}, \qquad (4.19)
\]
where $c_{d,p} := (3c_d)^{\frac{1}{2p}} \bigl(\frac{d+1}{2ep}\bigr)^{\frac{d+1}{2p}}$. Plugging (4.18) and (4.19) into (4.17) yields
\[
e_i\bigl(\mathrm{id} : H_\gamma(A) \to L_2(P_{X|A})\bigr)
\le c_{d,p} \sqrt{P_X(A)} \cdot r^{\frac{d}{2p}} \gamma^{-\frac{d}{2p}} i^{-\frac{1}{2p}}. \qquad\Box
\]

The bound of the previous lemma will appear in the finite sample bounds presented in the upcoming section. Compared to the bound in (4.16), the new bound has a slightly better behavior in $r$ and $\gamma_j$, but the kernel parameters are again, unfortunately, bounded by the radius. This restriction results from scaling the kernel parameters in the proof, see (4.19). Moreover, we can track the constants depending on $p$, since we applied Theorem 2.45, in which the constants are stated exactly. However, to apply the general oracle inequality in Theorem 2.46 we need a bound on the average entropy numbers of the form in (2.25). The next lemma establishes such a bound.

Lemma 4.8 (Average Entropy Bound). Let $(A_j)_{j=1,\dots,m}$ be a partition of $B_{\ell_2^d}$ with $\mathring{A}_j \neq \emptyset$ and $A_j \subset B_r(z_j)$ for $r > 0$, $z_j \in B_{\ell_2^d}$ for every $j \in \{1,\dots,m\}$, and assume (H). Denote by $D_X$ the empirical measure w.r.t. the data set $D$. Then, for all $p \in (0,\frac12)$ there exists a constant $\tilde c_{d,p} > 0$ such that for all $\gamma_j \le r$ and $i \ge 1$ we have
\[
e_i\bigl(\mathrm{id} : H_J \to L_2(D_X)\bigr)
\le \tilde c_{d,p} \, |J|^{\frac{1}{2p}} r^{\frac{d}{2p}} \Bigl(\sum_{j \in J} \lambda_j^{-1}\gamma_j^{-\frac{d}{p}} D_X(A_j)\Bigr)^{\frac12} i^{-\frac{1}{2p}}
\]
and, for the average entropy numbers,
\[
\mathbb{E}_{D_X \sim P_X^n}\, e_i\bigl(\mathrm{id} : H_J \to L_2(D_X)\bigr)
\le \tilde c_{d,p} \, |J|^{\frac{1}{2p}} r^{\frac{d}{2p}} \Bigl(\sum_{j \in J} \lambda_j^{-1}\gamma_j^{-\frac{d}{p}} P_X(A_j)\Bigr)^{\frac12} i^{-\frac{1}{2p}}, \qquad i \ge 1.
\]

The proof shows that the constant is given by $\tilde c_{d,p} := 2\,(9\ln(4)c_d)^{\frac{1}{2p}} \bigl(\frac{d+1}{2ep}\bigr)^{\frac{d+1}{2p}}$.

Proof. We define $a_j := c_{d,p} \sqrt{D_X(A_j)} \cdot r^{\frac{d}{2p}} \gamma_j^{-\frac{d}{2p}}$. By Lemma 4.7 we have
\[
e_i\bigl(\mathrm{id} : H_{\gamma_j}(A_j) \to L_2(D_{X|A_j})\bigr) \le a_j \, i^{-\frac{1}{2p}}
\]
for $j \in J$, $i \ge 1$. Following the lines of the proof of [MS16, Thm. 11] we find
\[
e_i\bigl(\mathrm{id} : H_J \to L_2(D_X)\bigr)
\le 2|J|^{\frac12} \Bigl(3\ln(4) \sum_{j \in J} \lambda_j^{-p} a_j^{2p}\Bigr)^{\frac{1}{2p}} i^{-\frac{1}{2p}}.
\]
By inserting $a_j$ and by applying $\|\cdot\|_{\ell_p^{|J|}}^p \le |J|^{1-p}\|\cdot\|_{\ell_1^{|J|}}^p$ we obtain
\[
\begin{aligned}
e_i\bigl(\mathrm{id} : H_J \to L_2(D_X)\bigr)
&\le 2|J|^{\frac12} \Bigl(3\ln(4) \sum_{j \in J} \lambda_j^{-p} a_j^{2p}\Bigr)^{\frac{1}{2p}} i^{-\frac{1}{2p}} \\
&= 2\,(3\ln(4))^{\frac{1}{2p}} |J|^{\frac12} \Bigl(\sum_{j \in J} \lambda_j^{-p}\Bigl(c_{d,p}\sqrt{D_X(A_j)}\, r^{\frac{d}{2p}}\gamma_j^{-\frac{d}{2p}}\Bigr)^{2p}\Bigr)^{\frac{1}{2p}} i^{-\frac{1}{2p}} \\
&\le c_{d,p}\, 2\,(3\ln(4))^{\frac{1}{2p}}\, |J|^{\frac12} |J|^{\frac{1-p}{2p}} r^{\frac{d}{2p}} \Bigl(\sum_{j \in J} \lambda_j^{-1}\gamma_j^{-\frac{d}{p}} D_X(A_j)\Bigr)^{\frac12} i^{-\frac{1}{2p}} \\
&= \tilde c_{d,p}\, |J|^{\frac{1}{2p}} r^{\frac{d}{2p}} \Bigl(\sum_{j \in J} \lambda_j^{-1}\gamma_j^{-\frac{d}{p}} D_X(A_j)\Bigr)^{\frac12} i^{-\frac{1}{2p}},
\end{aligned}
\]
where $\tilde c_{d,p} := c_{d,p}\, 2\,(3\ln(4))^{\frac{1}{2p}}$ and $c_{d,p}$ is the constant from Lemma 4.7. Finally, considering the above inequality in expectation yields
\[
\begin{aligned}
\mathbb{E}_{D_X \sim P_X^n}\, e_i\bigl(\mathrm{id} : H_J \to L_2(D_X)\bigr)
&\le \tilde c_{d,p}\, |J|^{\frac{1}{2p}} r^{\frac{d}{2p}} \Bigl(\sum_{j \in J} \lambda_j^{-1}\gamma_j^{-\frac{d}{p}}\, \mathbb{E}_{D_X \sim P_X^n} D_X(A_j)\Bigr)^{\frac12} i^{-\frac{1}{2p}} \\
&= \tilde c_{d,p}\, |J|^{\frac{1}{2p}} r^{\frac{d}{2p}} \Bigl(\sum_{j \in J} \lambda_j^{-1}\gamma_j^{-\frac{d}{p}} P_X(A_j)\Bigr)^{\frac12} i^{-\frac{1}{2p}}. \qquad\Box
\end{aligned}
\]


4.2.3. Oracle Inequalities and Learning Rates

In this section, we establish oracle inequalities and local learning rates on partition based sets whose cells possess one of the three properties that we described at the beginning of Section 4.2. That means, we consider sets whose cells do have an intersection with the decision boundary or do not. For cells with the latter property we distinguish between cells which are located sufficiently close to the decision boundary and cells which are not. A look into Theorem 2.46 shows that we have to bound the cardinality of the underlying set of indices to obtain rates. In particular for cells that are sufficiently close to the decision boundary, this bound can be refined under a mild geometric condition on the decision boundary, and we introduce the following set of assumptions:

(G) Let $\eta : X \to [0,1]$ be a fixed version of the posterior probability of $P$. Let $X_0 = \partial_X X_1 = \partial_X X_{-1}$ and let $X_0$ be $(d-1)$-rectifiable with $\mathcal{H}^{d-1}(X_0) > 0$.

Remember that under assumption (G) we have $\mathcal{H}^{d-1}(X_0) < \infty$. In particular, we showed in Lemma 2.36 under assumption (G) that the $d$-dimensional Lebesgue measure $\lambda^d$ of a set in the vicinity of the decision boundary behaves linearly in its distance to the decision boundary. Together with a volume comparison argument, this yields the following lemma.

Lemma 4.9 (Number of Cells). Let (A) and (G) for some $\eta$ be satisfied. Moreover, let $s \ge r$ and $s + r \le \delta^*$, where $\delta^*$ is considered in Lemma 2.36, and define
\[
J := \{\, j \in J \mid \forall x \in A_j : \Delta_\eta(x) \le s \,\}.
\]
Then, there exists a constant $c_d > 0$ such that $|J| \le c_d \cdot s r^{-d}$.

Proof. We define $T := \bigcup_{j \in J} A_j$ and $\tilde T := \bigcup_{j \in J} B_r(z_j)$. Obviously, $T \subset \tilde T$ since $A_j \subset B_r(z_j)$ for all $j \in J$. Furthermore, we have for all $x \in \tilde T$ that $\Delta_\eta(x) \le \tilde s$, where $\tilde s := s + r$. Then, we obtain with Lemma 2.36 that
\[
\lambda^d(\tilde T) \le \lambda^d(\{\Delta_\eta \le \tilde s\}) \le 4\mathcal{H}^{d-1}(X_0) \cdot \tilde s. \qquad (4.20)
\]
Moreover,
\[
\lambda^d(\tilde T) = \lambda^d\Bigl(\bigcup_{j \in J} B_r(z_j)\Bigr)
\ge \lambda^d\Bigl(\bigcup_{j \in J} B_{\frac{r}{4}}(z_j)\Bigr)
= |J|\, \lambda^d\bigl(B_{\frac{r}{4}}(z)\bigr)
= |J| \Bigl(\frac{r}{4}\Bigr)^{d} \lambda^d(B), \qquad (4.21)
\]
since $B_{\frac{r}{4}}(z_i) \cap B_{\frac{r}{4}}(z_j) = \emptyset$ for $i \neq j$. To see the latter, assume that we have an $x \in B_{\frac{r}{4}}(z_i) \cap B_{\frac{r}{4}}(z_j)$. But then, $\|z_i - z_j\|_2 \le \|x - z_j\|_2 + \|x - z_i\|_2 \le \frac{r}{4} + \frac{r}{4} \le \frac{r}{2}$, which is not true, since we assumed $\|z_i - z_j\|_2 > \frac{r}{2}$ for all $i \neq j$. Hence, the balls with radius $\frac{r}{4}$ are disjoint. Finally, by (4.20) together with (4.21) and $s \ge r$ we find
\[
|J| \le \frac{4^d \lambda^d(\tilde T)}{r^d \lambda^d(B)}
\le \frac{2^{2d+2}\,\mathcal{H}^{d-1}(\{x \in X \mid \eta = 1/2\}) \cdot \tilde s}{r^d \lambda^d(B)}
\le \frac{2^{2d+3}\,\mathcal{H}^{d-1}(\{x \in X \mid \eta = 1/2\}) \cdot s}{r^d \lambda^d(B)}.
\]



Based on the general oracle inequality in Theorem 2.46, together with the results on the approximation error in Section 4.2.1 and on the entropy bounds in Section 4.2.2, we establish an oracle inequality for partition based sets whose cells intersect the decision boundary.

Theorem 4.10 (Oracle Inequality I). Let $P$ have MNE $\beta \in (0,\infty]$ and NE $q \in [0,\infty]$ and let (G) and (H) be satisfied. Moreover, let (A) be satisfied for some $r := n^{-\nu}$ with $\nu > 0$ and define the set of indices
\[
J := \{\, j \in \{1,\dots,m\} \mid P_X(A_j \cap X_1) > 0 \text{ and } P_X(A_j \cap X_{-1}) > 0 \,\}.
\]
Let $\tau \ge 1$ be fixed and define $n^* := \bigl(\frac{4}{\delta^*}\bigr)^{\frac{1}{\nu}}$. Then, for all $p \in (0,\frac12)$, $n \ge n^*$, $\lambda := (\lambda_1,\dots,\lambda_m) \in (0,\infty)^m$ and $\gamma := (\gamma_1,\dots,\gamma_m) \in (0,r]^m$ the SVM given in (4.3) satisfies
\[
\mathcal{R}_{L_J,P}(\wideparen{f}_{D,\lambda,\gamma}) - \mathcal{R}^*_{L_J,P}
\le 9c_{d,\beta}\Bigl(\sum_{j\in J}\frac{\lambda_j r^d}{\gamma_j^d} + \max_{j\in J}\gamma_j^{\beta}\Bigr)
+ c_{d,p,q}\Bigl(\frac{r}{n}\Bigr)^{\frac{q+1}{q+2-p}}\Bigl(\sum_{j\in J}\lambda_j^{-1}\gamma_j^{-\frac{d}{p}}P_X(A_j)\Bigr)^{\frac{p(q+1)}{q+2-p}}
+ \tilde c_{p,q}\Bigl(\frac{\tau}{n}\Bigr)^{\frac{q+1}{q+2}} \qquad (4.22)
\]
with probability $P^n$ not less than $1 - 3e^{-\tau}$ and for some constants $c_{d,\beta}, c_{d,p,q} > 0$, and $\tilde c_{p,q} > 0$ only depending on $d, \beta, p, q$.

Proof. We apply the generic oracle inequality given in Theorem 2.46, see also Remark 2.47, and bound first of all the constant $a^{2p}$. To this end, we remark that an analogous calculation as in the proof of Theorem 4.4 shows that $A_j \subset \{\Delta_\eta \le 2r\}$ for every $j \in J$. Since
\[
n \ge \Bigl(\frac{4}{\delta^*}\Bigr)^{\frac{1}{\nu}} \iff 4r \le \delta^*,
\]
we obtain by Lemma 4.9 for $s = 2r$ that
\[
|J| \le c_1 r^{-d+1}, \qquad (4.23)
\]
where $c_1$ is a positive constant only depending on $d$. Together with Lemma 4.8 we then find that
\[
a^{2p} = \max\Bigl\{ \tilde c_{d,p} |J|^{\frac{1}{2p}} r^{\frac{d}{2p}} \Bigl(\sum_{j\in J}\lambda_j^{-1}\gamma_j^{-\frac{d}{p}}P_X(A_j)\Bigr)^{\frac12},\ 2 \Bigr\}^{2p}
\le \tilde c_{d,p}^{2p}\, |J|\, r^{d} \Bigl(\sum_{j\in J}\lambda_j^{-1}\gamma_j^{-\frac{d}{p}}P_X(A_j)\Bigr)^{p} + 4^p
\le c_1 \tilde c_{d,p}^{2p}\, r \Bigl(\sum_{j\in J}\lambda_j^{-1}\gamma_j^{-\frac{d}{p}}P_X(A_j)\Bigr)^{p} + 4^p.
\]
Moreover, Theorem 2.42 delivers a variance bound for $\theta = \frac{q}{q+1}$ and constant $V := 6c_{\mathrm{NE}}^{\frac{q}{q+1}}$. We denote by $A_J^{(\gamma)}(\lambda)$ the approximation error defined in (4.5). Then, we have by Theorem 2.46 with $\tau \ge 1$ that
\[
\begin{aligned}
\mathcal{R}_{L_J,P}(\wideparen{f}_{D,\lambda,\gamma}) - \mathcal{R}^*_{L_J,P}
&\le 9A_J^{(\gamma)}(\lambda) + c_{p,q}\Bigl(\frac{a^{2p}}{n}\Bigr)^{\frac{q+1}{q+2-p}} + 3c_{\mathrm{NE}}^{\frac{q}{q+2}}\Bigl(\frac{432\tau}{n}\Bigr)^{\frac{q+1}{q+2}} + \frac{30\tau}{n} \\
&\le 9A_J^{(\gamma)}(\lambda) + c_{d,p,q}\Bigl(\frac{r}{n}\Bigr)^{\frac{q+1}{q+2-p}}\Bigl(\sum_{j\in J}\lambda_j^{-1}\gamma_j^{-\frac{d}{p}}P_X(A_j)\Bigr)^{\frac{p(q+1)}{q+2-p}} + \tilde c_{p,q}\Bigl(\frac{\tau}{n}\Bigr)^{\frac{q+1}{q+2}}
\end{aligned}
\]
holds with probability $P^n$ not less than $1-3e^{-\tau}$ and with positive constants $c_{d,p,q} := c_{p,q}\bigl(c_1\tilde c_{d,p}^{2p}\bigr)^{\frac{q+1}{q+2-p}}$, $c_q := 2\max\bigl\{3c_{\mathrm{NE}}^{\frac{q}{q+2}}432^{\frac{q+1}{q+2}}, 30\bigr\}$ and $\tilde c_{p,q} := 2\max\bigl\{c_{p,q}4^{\frac{p(q+1)}{q+2-p}}, c_q\bigr\}$. Finally, Theorem 4.4 yields for the approximation error the bound
\[
A_J^{(\gamma)}(\lambda) \le c_2\Bigl(\sum_{j\in J}\frac{\lambda_j r^d}{\gamma_j^d} + \max_{j\in J}\gamma_j^{\beta}\Bigr),
\]
where $c_2 > 0$ is a constant depending on $d$ and $\beta$. Plugging this into the oracle inequality above yields the result. $\Box$


The bound in the theorem above depends on an arbitrarily small parameter $p \in (0,\frac12)$. Moreover, we observe that the kernel parameters are bounded by $r$, that is, $\gamma_j \le r$ for every $j \in J$. Both are consequences of the bound on (average) entropy numbers in Lemma 4.8. Formally, the oracle inequality in Theorem 4.10 has the same structure as the oracle inequalities for global and localized SVMs with Gaussian kernel and hinge loss in [SC08, Thm. 8.25] and [TBMS17, Thm. 3.1]. However, in particular in comparison to the result in [TBMS17, Thm. 3.1] we observe a difference in the bound on the stochastic error, which has, w.r.t. the whole input space and up to constants, the form
\[
\Bigl(\frac{r^{2p}(\min_{j\in J}\gamma_j)^{-2p}}{n}\Bigr)^{\frac{q+1}{q+2-p}}\Bigl(\sum_{j=1}^{m}\lambda_j^{-1}\gamma_j^{-\frac{d}{p}}P_X(A_j)\Bigr)^{\frac{p(q+1)}{q+2-p}}.
\]

The bound in (4.22) shows a better behavior in the kernel parameters γj (and radius r) that results from improved bounds on the average entropy numbers. We remark that the bound on the stochastic error could be improved if assumptions on PX (Aj ) are made. Since this would require additional assumptions on P we did not follow this path. Based on the previous theorem we derive in the subsequent theorem local learning rates. Note that for some an , bn ∈ R we write an ' bn if there exists constants c1 , c2 > 0 such that an ≤ c1 bn and an ≥ c2 bn . Theorem 4.11 (Learning Rates I). Assume that the assumptions of Theorem 4.10 are satisfied with mn and with rn ' n−ν , γn,j ' rnκ n−κ , λn,j ' n

−σ

,

for all j ∈ {1, . . . , mn }. Moreover, define κ := ν≤

(4.24)

κ 1−κ

q+1 β(q+2)+d(q+1)

and let (4.25)

and σ ≥ 1 be satisfied. Then, for all ε > 0 there exists a constant cβ,d,ε,σ,q > 0 only depending on β, d, ε, σ, and q such that for λn := (λn,1 , . . . , λn,mn ) ∈ (0, ∞)mn , and γ n := (γn,1 , . . . , γn,mn ) ∈ (0, rn ]mn , and all n sufficiently

82

4. Localized SVMs: Oracle Inequalities and Learning Rates

large we have with probability P n not less than 1 − 3e−τ that q+1

RLJ ,P (fÛD,λn ,γ n ) − R∗LJ ,P ≤ cβ,d,ε,σ,q · τ q+2 · rnβκ n−βκ+ε . In particular, the proof shows that one can even choose σ ≥ κ(β + d)(ν + 1) − ν > 0. Proof. We write λn := n−σ and γn := rnκ n−κ . As in the proof of Theorem 4.10 we find |J| ≤ cd rn−d+1 for some constant cd > 0. Together with Theorem 4.10 we then obtain that RLJ ,P (fÛD,λn ,γ n ) − R∗LJ ,P X λn,j rd

≤ c1

j∈J

+

d γn,j ! q+1

+

β max γn,j j∈J

+

q+1  r  q+2−p

n

X

−d p λ−1 n,j γn,j PX (Aj )

! p(q+1) q+2−p

j∈J

 τ  q+2 n

q+1  r  q+2−p X −d λn r d n p ≤ c2 |J| d n + γnβ + λ−1 PX (Aj ) n γn γn n j∈J !  τ  q+1 q+2 + n     q+1  τ  q+1 −p −d q+2−p λ r r λ γ q+2 n n n n n  ≤ c2  d + γnβ + + γn n n

! p(q+1) q+2−p

holds with probability P n not less than 1 − 3e−τ and for some positive constant c1 , c2 depending on β, d, p and  q. Moreover,  with (4.24), σ ≥ κ(β + d)(ν + 1) − ν and

(1−dκ)(q+1) q+2

=

β(q+2) β(q+2)+d(q+1)

q+1 q+2

RLJ ,P (fÛD,λn ,γ n ) − R∗LJ ,P !   q+1  τ  q+1 −d q+2−p λ n rn rn λ−p q+2 n γn β + γn + ≤ c2 + γnd n n

= βκ we find

4.2. Local Statistical Analysis

83

rn κ(β+d)(ν+1)−ν n γnd

= c2

 +

rn1−dκ n1−dκ

+ rnβκ n−βκ

q+1  q+2−p

n

−σ

p(q+1)  q+2−p

n−ν

≤ c3

nκ(β+d)(ν+1)−ν n−νdκ n−dκ +

≤ c3

 r  (1−dκ)(q+1) q+2−p n

n n

−νβκ −βκ



n

+

ε

n +

!

n

+ rnβκ n−βκ

 τ  q+1 q+2

!

n

rnβκ n−βκ q+1

+

 τ  q+1 q+2

+

 r  (1−dκ)(q+1) q+2 n

n

ε

n +

 τ  q+1 q+2

!

n

 q+1

≤ c4 rnβκ n−βκ+ε + τ q+2 n− q+2 q+1

≤ c5 τ q+2 · rnβκ n−βκ+ε , where p is chosen sufficiently small such that ε ≥ pσ(q+1) ≥ 0 and where the q+2 constants c3 , c4 , c5 > 0 depend on β, d, ε, σ, and q.  The restriction on the radii rn in (4.25) results from the constraint that γn,j ≤ rn for j ∈ J. If we choose smaller rn , we cannot ensure to learn with the rate in the previous theorem. Moreover, the proof shows that the regularization parameters λn,j can be chosen arbitrary small. This allows to choose γn,j sufficiently small, even smaller than in the result from [TBMS17, Thm. 3.2]. The next theorem considers sets that are close to the decision boundary but which do not have an intersection with it. Theorem 4.12 (Oracle inequality II). Let P have MNE β ∈ (0, ∞] and NE q ∈ [0, ∞] and let (G) and (H) be satisfied. Moreover, let (A) be satisfied for some r := n−ν with ν > 0. Define for s := n−α with α > 0 and α ≤ ν the set of indices J := {j ∈ {1, . . . , m} | ∀x ∈ Aj : ∆η (x) ≤ 3s, and PX (Aj ∩ X1 ) = 0 or PX (Aj ∩ X−1 ) = 0}. − 1 Let τ ≥ 1 be fixed and define n∗ := 4−1 δ ∗ α . Then, for all ε > 0, p ∈ (0, 12 ), n ≥ n∗ , λ := (λ1 , . . . , λm ) ∈ (0, ∞)m , and γ := (γ1 , . . . , γm ) ∈ (0, r]m

84

4. Localized SVMs: Oracle Inequalities and Learning Rates

the SVM given in (4.3) satisfies RLJ ,P (fÛD,λ,γ ) − R∗LJ ,P  ≤

cd,β,ε,q · r minj∈J γj

d

+ cd,β,ε,p,q



X

λj + cd,p,q

q+1  s  q+2−p

j∈J

(4.26)  p(q+1) q+2−p

 X 

n

−d p

λ−1 j γj

PX (Aj )

j∈J

 τ  q+1 q+2 n

with probability P n not less than 1−3e−τ and with constants cd,β,ε,q , cd,p,q > 0 and cd,β,ε,p,q > 0. Proof. We apply the generic oracle inequality given in Theorem 2.46, see also Remark 2.47, and bound first of all the contained constant a2p . To this end, we remark that we find with α ≤ ν that 4−1 δ ∗

− α1

≤n

δ ∗ ≥ 4n−α ≥ 3n−α + n−ν = 3s + r



such that we obtain by Lemma 4.9 that |J| ≤ c1 · sr−d ,

(4.27)

where c1 is a positive constant only depending on d. Together with Lemma 4.8 we then find for the constant a2p from Theorem 2.46 that    12 2p     d X 1 d −1 − p 2p   2p 2p a = max c˜d,p |J| r λj γj PX (Aj ) ,2     j∈J  p d X − p d  + 4p ≤ c˜2p λ−1 j γj PX (Aj ) d,p · |J|r j∈J

≤ c1 c˜2p d,p

 p d X − p  + 4p , · s λ−1 j γj PX (Aj ) j∈J

1

where cd,p := 2c1 (9 ln(4)cd ) 2p



d+1 2ep

 d+1 2p

. Again, Theorem 2.42 yields a q

(γ)

q q+1 variance bound for θ = q+1 and constant V := 6cNE . We denote by AJ (λ) the approximation error given in (4.5), and find by Theorem 2.46 with τ ≥ 1

4.2. Local Statistical Analysis

85

that RLJ ,P (fÛD,λ,γ ) − R∗LJ ,P q+1  2p  q+2−p   q+1 q a 432τ q+2 30τ (γ) q+2 ≤ 9AJ (λ) + cp,q + 3cNE + n n n q+1   p  q+2−p d X q+1 − (γ) p   + 4p  n− q+2−p ≤ 9AJ (λ) + cp,q c1 c˜2p λ−1 j γj PX (Aj ) d,p · s j∈J

+ cq

 τ  q+1 q+2 n

(γ)

≤ 9AJ (λ) + cd,p,q + cp,q 4



p(q+1) q+2−p

q+1  s  q+2−p

n q+1

· n− q+2 +cq

(γ) 9AJ (λ)+cd,p,q

q+1  s  q+2−p

n

  p(q+1) q+2−p d X −1 − p   λj γj PX (Aj ) j∈J

 τ  q+1 q+2 n   p(q+1) q+2−p  τ  q+1 d X q+2 −1 − p   λj γj PX (Aj ) +˜ cp,q n j∈J

holds with probability P n not less than 1 − 3e−τ and with positive constants q+1   q+2−p n q o q+1 q+2 cd,p,q := cp,q c1 c˜2p , cq := 2 max 3cNE 432 q+2 , 30 and c˜p,q := d,p n o q+1 p(q+1) 2 max cp,q 4 q+2−p , cq . Finally, Theorem 4.6 for ω− := γmax n 2ξ(q+2) , where ξ > 0, and ω+ := ω− + r, yields   X  ω+ d  γmax 2ξ (γ) β AJ (λ) ≤ c2  λj + s  (4.28) γj ω− j∈J    d X  2ξ ω+ γmax ≤ c2  λj + sβ  γmin ω− j∈J   !d q+1 X 2ξ(q+2) q+1 γmax n +r = c2  λj + n− q+2 sβ  γmin j∈J

86

4. Localized SVMs: Oracle Inequalities and Learning Rates  ≤ c 3 n

d(q+1) 2ξ(q+2)

 ≤ c 4 n ε





r γmin

γmax + r γmin

d X

d X

 λj + n

− q+1 q+2

sβ 

j∈J

 λj + n

− q+1 q+2

sβ  ,

j∈J

where in the last step that we applied γmax ≤ r, and where we picked an d(q+1) arbitrary ε > 0 and chose ξ sufficiently large such that ε ≥ 2ξ(q+2) > 0. The constants c2 , c3 > 0 only depend on d, β and ξ, whereas c4 > 0 depends only on d, β, ε and q. By plugging this into the oracle inequality above yields RLJ ,P (fÛD,λ,γ ) − R∗LJ ,P    d X q+1 r ≤ 9c4 nε λj + n− q+2 sβ  γmin j∈J

+ cd,p,q

≤ 9c4 nε



q+1  s  q+2−p

n r

γmin

+ cd,β,ε,p,q

  p(q+1) q+2−p  τ  q+1 d X q+2 −1 − p   λj γj PX (Aj ) + c˜p,q n j∈J

d X

λj + cd,p,q

j∈J

 τ  q+1 q+2 n

.

q+1  s  q+2−p

n

 p(q+1) q+2−p

 X 

−d p

λ−1 j γj

PX (Aj )

j∈J



The bound on the right-hand side in the previous theorem depends on the parameter s, which determines how close the cells are located to the decision boundary. Obviously, the smaller s, and thus r, the better the bound. Concerning the bound on the stochastic error, it has the same structure as in Theorem 4.12, and the main difference is the bound on the approximation error, described by the first term on the right-hand side. As might be seen in the proof this is due to the different handling of the approximation error on sets that have no intersection with the decision boundary, described in Section 4.2.1 and that led to the bound in Theorem 4.6. By having in mind that again the regularization parameters λj might be arbitrary small, the proof shows that the widths ω− from Theorem 4.6 can be appropriately chosen in (4.28) such that the approximation error with sufficiently small λj q+1 is even smaller than n− q+2 , the last term in the bound of 4.26. A look into

4.2. Local Statistical Analysis

87

(4.26) shows that the freedom of choosing sufficiently small λj has the effect that γj can be chosen as large as possible, that is, γj = r, which leads to kernels that are as flat as possible. These observations lead to the following theorem that derives learning rates on sets that have no intersection with the decision boundary, but are located sufficiently close to it. Theorem 4.13 (Learning Rates II). Assume that the assumptions of Theorem 4.12 are satisfied for mn , s ' sn and rn ' n−ν , γn,j ' rn ,

(4.29)

λn,j ' n−σ with some σ ≥ 1 and 1 + α − νd > 0, and for all j ∈ {1, . . . , mn }. Then, for all ε > 0 there exists a constant cβ,d,ε,σ,q > 0 such that for λn := (λn,1 , . . . , λn,mn ) ∈ (0, ∞)mn , and γ n := (γn,1 , . . . , γn,mn ) ∈ (0, rn ]mn , and all n sufficiently large we have with probability P n not less than 1 − 3e−τ that RLJ ,P (fÛD,λn ,γ n ) −

R∗LJ ,P

≤ cβ,d,ε,σ,q τ

q+1 q+2

 ·

sn rnd

 q+1 q+2

q+1

n− q+2 +ε .

Proof. We write λn := n−σ and γn := rn . By Theorem 4.12, Lemma 4.9 and (4.29) we find with probability P n not less than 1 − 3e−τ that RLJ ,P (fÛD,λn ,γ n ) − R∗LJ ,P  ≤ c1

r

d X

γmin +

j∈J

 τ  q+1 q+2

n

X

n

−d p λ−1 n,j γn,j PX (Aj )

! p(q+1) q+2−p

j∈J

!

n ε

≤ c2 |J|λn n +

≤ c2

λn,j nε +

q+1  s  q+2−p

s n λn nε + rnd



sn γnd n



q+1  q+2−p

sn rnd n

! p(q+1) q+2−p λ−1 n

X

PX (Aj )

j∈J q+1  q+2−p



p(q+1)

λn q+2−p +

! q+1

 τ  q+2 n

+

 τ  q+1 q+2 n

!

88

4. Localized SVMs: Oracle Inequalities and Learning Rates

≤ c2

s n nε + rnd nσ



sn rnd n

 q+1 q+2 n

pσ(q+1) q+2−p

q+1 q+2

+

 τ  q+1 q+2

!

n ! q+1

   τ  q+2 s n nε sn ε + n + rnd n rnd n n !   q+1 q+1 sn q+2 − q+1 ε q+2 q+2 ≤ c4 τ n +n rnd n   q+1 q+1 sn q+2 − q+1 q+2 ≤ c5 τ n q+2 +ε , rnd ≤ c3

where we chose p sufficiently small such that ε ≥ pσ(q+1) q+2−p > 0. The constants c1 , c2 > 0 depend only on d, β, ε, p and q, whereas the constants c3 , c4 , c5 > 0 depend on d, β, ε, σ and q.  The learning rates show that we have a trade-off between the radii rn and the splitting parameter sn . Obviously, we achieve the fastest rate if sn is chosen as small as possible, that is sn = rn . Finally, we present in the subsequent theorem the oracle inequality on sets that are sufficiently far away from the decision boundary. Theorem 4.14 (Oracle Inequality III). Let P have LC ζ ∈ [0, ∞) and let (G) and (H) be satisfied. Moreover, let (A) be satisfied for some r := n−ν with ν > 0. Define for s := n−α with α > 0 and α ≤ ν the set of indices J := { j ∈ {1, . . . , m} | ∀x ∈ Aj : ∆η (x) ≥ s }. Furthermore, let τ ≥ 1 be fixed. Then, for all ε > 0, p ∈ (0, 12 ), n ≥ 1, λ := (λ1 , . . . , λm ) ∈ (0, ∞)m , and γ := (γ1 , . . . , γm ) ∈ (0, r]m the SVM given in (4.3) satisfies RLJ ,P (fÛD,λ,γ ) − R∗LJ ,P (4.30)  p  d X X −d cd,ε · r p  1 ≤ nε λj + cd,p  λ−1 j γj PX (Aj ) minj∈J γj n j∈J

+ cd,ε,p ·

j∈J

τ sζ n

with probability P n not less than 1 − 3e−τ and some positive constants cd,ε , cd,p , cd,ε,p only depending on d, ε and p.

4.2. Local Statistical Analysis

89

Proof. We apply the generic oracle inequality given in Theorem 2.46, which is possible due to Remark 2.47. To this end, we find for the contained constant a2p with Lemma 4.8 and (4.1) that

a2p = max

  

1

d

c˜d,p |J| 2p r 2p

 

  12 2p   X −d p −1  λj γj PX (Aj ) , 2   j∈J p

(4.31)

 X −d p d  + 4p ≤ c˜2p λ−1 j γj PX (Aj ) d,p |J|r j∈J

≤ c1 c˜2p d,p

 p d X − p   + 4p λ−1 j γj PX (Aj ) j∈J

where c1 > 0 is a constant depending on d. According to Lemma 4.1 we have variance bound θ = 1 and constant V := 2cLC s−ζ . We denote by A(γ) (λ) the approximation error, defined in (4.5), and obtain by Theorem 2.46 together with (4.31) with probability P n not less than 1 − 3e−τ that RLJ ,P (fÛD,λn ,γ n ) − R∗LJ ,P cp · a n

(4.32)

2p

432cLC τ 30τ + + ζ n  s n p d X − (γ) p   n−1 ≤ 9AJ (λ) + cp c1 c˜2p λ−1 j γj PX (Aj ) d,p (γ)

≤ 9AJ (λ) +

j∈J

cp 4p 432cLC τ 30τ + + + ζ n sn n p d X − (γ) p  n−1 + cˆp τ ≤ 9AJ (λ) + cd,p  λ−1 j γj PX (Aj ) sζ n j∈J

for some constants cd,p , cˆp > 0. For the approximation error Theorem 4.6 1 with ω− := γmax n 2ξ , where ξ > 0, and ω+ := ω− + r, yields    d  2ξ X ω+ γmax (γ) AJ (λ) ≤ c2 · λj + c4 · PX (F ) (4.33) γj ω− j∈J

90

4. Localized SVMs: Oracle Inequalities and Learning Rates    d X  2ξ ω γ + max  ≤ c2  λj + c 4 · γmin ω− j∈J   !d 1 X 2ξ γ n r max = c2  + λj + n−1  γmin γmin j∈J    d X r = c3 nε λj + n−1  , γmin j∈J

where we applied in the last step that γmax ≤ r and where we fixed an ε d and chose ξ sufficiently large such that ε ≥ 2ξ > 0. The constants c2 > 0 and c3 > 0 only depend on d, ξ resp. d, ε. By combining the results above we have RLJ ,P (fD,λn ,γ n ) − R∗LJ ,P   p   d X d X − r p  n−1 ≤ 9c3 nε λj + n−1  + cd,p  λ−1 j γj PX (Aj ) γmin j∈J

+ cˆp

j∈J

τ

sζ n  p  d X d X −p r  n−1 + c4 τ ≤ 9c3 nε λj + cd,p,q  λ−1 j γj PX (Aj ) γmin sζ n j∈J

j∈J

for some constant c4 > 0 depending on d, ε and p.



The main differences between the previous oracle inequality and the one from Theorem 4.12 are the refined bound on the stochastic error and the last summand that results from the improved variance bound in Theorem 4.1. Note that we handled the bound on the approximation error in (4.33) in the same way we described after Theorem 4.12. Finally, we present learning rates on the set considered in the previous theorem. Theorem 4.15 (Learning Rate III). Assume that the assumptions of Theorem 4.14 are satisfied for mn , s ' sn and with rn ' n−ν , γn,j ' rn λn,j ' n

−σ

(4.34) ,

4.3. Main Results

91

for all j ∈ {1, . . . , mn } and with max{νd, αζ} < 1 and σ ≥ 1. Then, for all ε > 0 there exists a constant cd,ε,σ > 0 such that for λn := (λn,1 , . . . , λn,mn ) ∈ (0, ∞)mn , and γ n := (γn,1 , . . . , γn,mn ) ∈ (0, rn ]mn , and all n ≥ 1 we have with probability P n not less than 1 − 3e−τ that −1+ε RLJ ,P (fÛD,λn ,γ n ) − R∗LJ ,P ≤ cd,ε,σ τ · max{rn−d , s−ζ . n }n

Proof. We write λn := n−σ and γn := rn . Then, we obtain by Theorem 4.14 and (4.34) with probability P n not less than 1 − 3e−τ that RLJ ,P (fÛD,λ,γ ) − R∗LJ ,P   p   d X d X − r τ n p −1 ≤ c1 nε λn,j +  λn,j γn,j PX (Aj ) n−1 + ζ  minj∈J γj sn n j∈J j∈J   p  X τ −d  ≤ c2 nε |J|λn + λ−p PX (Aj ) n−1 + ζ  n rn sn n j∈J  −1 ≤ c2 τ rn−d n−σ+ε + nσp rn−d n−1 + s−ζ n n  −1 ≤ c3 τ rn−d n−1+ε + nε rn−d n−1 + s−ζ n n  −1 ≤ c3 τ 2rn−d n−1+ε + s−ζ n n −1+ε ≤ c4 τ · max{rn−d , s−ζ , n }n

where p is chosen sufficiently small such that ε ≥ pσ > 0 and where the constants c1 , c2 > 0 depend only on d, ε, p and the constants c3 , c4 > 0 only on d, ε, σ.  The rates presented in the previous theorem behave in a different way in terms of sn as the rates in Theorem 4.13. Although the kernel parameters γn = rn are chosen suitable, we observe in this case no trade-off in rn and −1 sn . The factor max{rn−d , s−ζ . If n } describes the deviation from the rate n rn = sn the size of νd resp. νζ determines the rate. 4.3. Main Results In Section 4.3.1, we derive global learning rates for localized SVMs with Gaussian kernel and hinge loss under margin conditions. Since the chosen parameters and rates will depend on the margin parameters, we present in Section 4.3.2 a procedure that achieves the same rates adaptively.

92

4. Localized SVMs: Oracle Inequalities and Learning Rates

4.3.1. Global Learning Rates To derive global rates we apply the splitting technique developed in Chapter 3.2 in which we analyzed the excess risk separately on overlapping sets that consist of cells that are sufficiently far away from the decision boundary or sufficiently close to it. However, for cells that possess the latter property we distinguish between cells which intersect the decision boundary and which do not, see also considerations in the previous section. By choosing individual kernel parameters on these sets we obtain local learning rates that we balance out in a last step to derive global learning rates. To this end, we define for s > 0 and a fixed version η of the posterior probability of P the set of indices of cells that are sufficiently far away by JFs := { j ∈ {1, . . . , m} | ∀ x ∈ Aj : ∆η (x) ≥ s } and the set of indices of cells near the decision boundary by s JN := { j ∈ {1, . . . , m} | ∀ x ∈ Aj : ∆η (x) ≤ 3s }.

For the corresponding sets we write [ N s := Aj and

F s :=

s j∈JN

[

Aj .

(4.35)

s j∈JF

s In addition, we split the set of indicies JN into

JN1 := { j ∈ JN | PX (Aj ∩ X1 ) > 0 and PX (Aj ∩ X−1 ) > 0 }, JN2 := { j ∈ JN | PX (Aj ∩ X1 ) = 0 or PX (Aj ∩ X−1 ) = 0 },

(4.36)

and denote the corresponding sets similar to (4.35) by N1s resp. N2s . Analogously to Lemma 3.2 and Lemma 3.4 the next lemma gives a sufficient condition on the separation parameter s such that we capture all cells in the input space when dividing it into the sets defined in (4.35). Moreover, we are able to assign a cell in F s either to the class X−1 or to X1 . Since the proof is almost identical to the ones in Lemma 3.2 and Lemma 3.4 we omit the proof here. Lemma 4.16 (Separation of Sets). Let (Aj )j=1,...,m be a partition of B`2d such that for every j ∈ {1, . . . , m} we ˚j 6= ∅ and (4.1) is satisfied for some r > 0. For s ≥ r define the sets have A N s and F s by (4.35). Moreover, let X0 = ∂X X1 = ∂X X−1 . Then, we have i) X ⊂ N s ∪ F s ,

4.3. Main Results

93

ii) either Aj ∩ X1 = ∅ or Aj ∩ X−1 = ∅ for all j ∈ JFs . To prevent notation overload, we omit in the sets (of indices) defined above the dependence on s for the rest of this section, while keeping in mind that all sets depend on the separation parameter s. Moreover, we simply write s LN1 := LJNs , similar we do with the losses w.r.t. to JN and JFs . 2 1 By Lemma 4.16 we immediately have that RLJ ,P (fÛD,λn ,γ n ) − R∗LJ ,P ≤ RLN ,P (fÛD,λn ,γ n ) − R∗LN ,P + RLF ,P (fÛD,λn ,γ n ) − R∗LF ,P such that individual learning rates on the sets N and F lead to global learning rates. We briefly describe this more precise by applying results from the previous Section 4.2. The set N is divided into the set N1 and N2 . For the set N1 we find by Theorem 4.11 with high probability that RLN1 ,P (fÛD,λn ,γ n ) − R∗LN1 ,P  rnβk n−βk is satisfied for lower bounded rn and γj ' rnκ n−κ . If rn → 0 the bound over N1 tends to zero. This is not the case for the bounds on the sets N2 and F that have no intersection with the decision boundary. By Theorems 4.13 and 4.15 we obtain with high probability on the sets N2 and F bounds of the form RLN2 ,P (fÛD,λn ,γ n ) − R∗LN2 ,P 



sn rnd

 q+1 q+2

q+1

n− q+2

and −1+ε RLF ,P (fÛD,λn ,γ n ) − R∗LF ,P  max{rn−d , s−ζ , n }·n

where in both cases the kernel parameter γn ' rn was chosen. Both bounds depend in an opposite way on the separation parameter sn . In Chapter 3, we observed a similar behavior of the separation parameter and to derive global bounds we handled this by a straightforward optimization over the parameter sn . However, the optimal s∗ does not fulfil the requirement rn ≤ s∗ that results from Lemma 4.16. We bypass this difficulty by choosing sn = rn . In the proof of the subsequent theorem, we observe the following effects resulting from this choice of sn . The rates on N2 are always better than the rates on N1 . Thus, for the rates on N1 and F the combination of the considered margin parameters and the dimension d affects the speed of the

94

4. Localized SVMs: Oracle Inequalities and Learning Rates

rates and leads to a differentiation of rn := n−ν in the subsequent theorem, see (4.37). In particular, if β ≥ (q + 1)(1 + max{d, ζ} − d) the rate on N1 dominates the one on F and ν has to fulfil ν≤

κ 1−κ .

In the other case, β ≤ (q + 1)(1 + max{d, ζ} − d), the rate on F dominates N1 , but only if ν≤

1−βκ βk+max{d,ζ} .

Indeed, for the latter range of β we always have 1−βκ βk+max{d,ζ}



κ 1−κ .

We discuss the influence of the choice of ν on the learning rates in more detail after the presented main theorem. Theorem 4.17 (Global Leaning Rates for Localized SVMs). Let P be a probability measure on Rd × {−1, 1} with marginal distribution PX on Rd , such that X := supp(PX ) ⊂ B`d2 , and PX (∂X) = 0. Let (G) be satisfied for some η and let P have MNE β ∈ (0, ∞], NE q ∈ [0, ∞] and LC q+1 ζ ∈ [0, ∞). Define κ := β(q+2)+d(q+1) . Let assumption (A) be satisfied for mn and define rn := n−ν , where ν satisfies ( ν≤

κ 1−κ 1−βκ βκ+max{d,ζ}

if β ≥ (q + 1)(1 + max{d, ζ} − d), else,

(4.37)

and assume that (H) holds. Define for J = {1, . . . , mn } the set of indices JN1 := {j ∈ J | ∀x ∈ Aj : ∆η (x) ≤ 3rn and PX (Aj ∩ X1 ) > 0 and PX (Aj ∩ X−1 ) > 0},

4.3. Main Results

95

as well as ( γn,j '

rnκ n−κ rn

for j ∈ JN1 , else,

(4.38)

λn,j ' n−σ for for for the

some σ ≥ 1 and for every j ∈ J. Moreover, let τ ≥ 1 be fixed and define − 1 − 1 δ ∗ considered in (2.36), n∗ := max{4, 4−1 δ ∗ ν , 4−1 δ ∗ α }. Then, all ε > 0 there exists a constant cβ,d,ε,σ,q > 0 such that for all n ≥ n∗ localized SVM classifier satisfies RLJ ,P (fÛD,λn ,γ n ) − R∗LJ ,P ≤ cβ,d,ε,σ,q τ · n−βκ(ν+1)+ε

(4.39)

with probability P n not less than 1 − 9e−τ . Before proving this theorem, we state the following two remarks. Remark 4.18. The exponents of rn in (4.37) match for β = (q + 1)(1 + max{d, ζ} − d). Moreover, a short calculation shows that in the case β ≥ (q + 1)(1 + κ max{d, ζ} − d) the best possible rate is achieved for ν := 1−κ and equals β(q+1)

n− β(q+2)+(d−1)(q+1) +ε . In the other case, β < (q + 1)(1 + max{d, ζ} − d), the best possible rate is 1−βκ achieved for ν := βκ+max{d,ζ} and equals n−

βκ[1+max{d,ζ}] +ε βκ+max{d,ζ}

.

Remark 4.19. The rates in Theorem 4.17 are better the smaller we choose the cell sizes rn . The smaller rn the more cells mn are considered and training localized SVMs is more efficient. To be more precise, it reduces the complexity of the kernel matrices or the time complexity of the solver, see [TBMS17]. However, (4.37) gives a lower bound on rn = n−ν . For smaller rn we do achieve rates for localized SVMs, but we cannot ensure that they learn with the rate (4.39). Indeed, we achieve slower rates. We illustrate this for the case β < (q + 1)(1 + max{d, ζ} − d). The proof of Theorem 4.17 shows that

96

4. Localized SVMs: Oracle Inequalities and Learning Rates

if 

q+1 q+2

 (1 + max{d, ζ} − d) < β < (q + 1)(1 + max{d, ζ} − d)

h i 1−βκ κ we can choose some ν ∈ βκ+max{d,ζ} , 1−κ such that the localized SVM classifier learns for some ε > 0 with rate n−(1−ν max{d,ζ})+ε . A short calculation shows that this rate is indeed slower than (4.39) for the 1−βκ given range of ν and matches the rate in (4.39) only for ν := βκ+max{d,ζ} . κ In the worst case, that is, ν := 1−κ the rate equals n−(1−ν max{d,ζ}) = n

− 1−

κ max{d,ζ} 1−κ



(q+1) max{d,ζ}

=n

− 1− β(q+2)+(d−1)(q+1)

= n−



β(q+2)+(q+1)(d−1−max{d,ζ}) β(q+2)+(d−1)(q+1)

up  toε in the exponent. Note that the numerator is positive since β > q+1 q+2 (1 + max{d, ζ} − d). Proof (Theorem 4.17). By Theorem 4.16 for sn := n−ν we find that RLJ ,P (fÛD,λn ,γ n ) − R∗LJ ,P

(4.40)

≤ RLN ,P (fÛD,λn ,γ n ) − R∗LN ,P + RLF ,P (fÛD,λn ,γ n ) − R∗LF ,P ≤ RLN1 ,P (fÛD,λn ,γ n ) − R∗LN1 ,P + RLN2 ,P (fÛD,λn ,γ n ) − R∗LN2 ,P + RLF ,P (fÛD,λn ,γ n ) − R∗LF ,P . In the subsequent steps, we bound the excess risks above separately for both choices of ν by applying Theorems 4.11, 4.13 and 4.15 for α := ν. First, we consider the case β ≥ (q +1)(1+max{d, ζ}−d) and check some requirements for the mentioned theorems. Since β ≥ q+1 q+2 (1 + max{d, ζ} − d) we have ν≤

κ 1−κ

=

q+1 β(q+2)+(d−1)(q+1)



1 max{d,ζ} .

4.3. Main Results

97

Moreover, 1 + ν(1 − d) ≥ 1 −

κ(d−1) 1−κ

=1−

(q+1)(d−1) β(q+2)+(d−1)(q+1)

=

β(q+2) β(q+2)+(d−1)(q+1)

> 0.

We apply Theorems 4.11, 4.13 and 4.15 with α := ν. Together with βκ(ν + 1) ≤

βκ 1−κ

(4.41)

=

(q+1)β(q+2) (q+2)[β(q+2)+(d−1)(q+1)]

=

q+1 q+2



(q+1)(1−ν(d−1)) q+2



1−

κ(d−1) 1−κ



and βκ(ν + 1) ≤

β(q+1) β(q+2)+(d−1)(q+1)

=1−

(d−1)(q+1)+β β(q+2)+(d−1)(q+1)

≤1−

(q+1) max{d,ζ} β(q+2)+(d−1)(q+1)

≤1−

κ max{d,ζ} 1−κ

(4.42)

κ such that βκ(ν + 1) ≤ 1 − ν max{d, ζ} for ν ≤ 1−κ , we obtain in (4.40) for n ε1 , ε2 , ε3 > 0 and with probability P not less than 1 − 9e−τ that

RLJ ,P (fÛD,λn ,γ n ) − R∗LJ ,P ≤ RLN1 ,P (fÛD,λn ,γ n ) −

R∗LN1 ,P

(4.43) + RLN2 ,P (fÛD,λn ,γ n ) −

R∗LN2 ,P

+ RLF ,P (fÛD,λn ,γ n ) − R∗LF ,P   (q+1)(1+α−νd) q+2 ≤ c1 τ n−βκ(ν+1) nε1 + n− nε2 + n−(1−max{νd,αζ}) nε3   ≤ c2 τ nε 2n−βκ(ν+1) + n−(1−ν max{d,ζ}) ≤ c3 τ n−βκ(ν+1)+ε holds for some ε := max{ε1 , ε2 , ε3 } and some constants c1 depending on d, β, q, ξ, ε1 , ε2 , ε3 , σ, and c2 , c3 > 0 depending on d, β, q, ξ, ε, σ. Second, we consider the case β < (q + 1)(1 + max{d, ζ} − d) and check 1−βκ again the requirements on ν ≤ βk+max{d,ζ} for the theorems applied above.

98

4. Localized SVMs: Oracle Inequalities and Learning Rates

We have ν≤ = ≤ =

1−βκ βκ+max{d,ζ}

(4.44)

β+d(q+1) β(q+1)+max{d,ζ}[β(q+2)+d(q+1)] q+1 β(q+2)+(d−1)(q+1) κ 1−κ

and ν≤

1−βκ βκ+max{d,ζ}



1−βκ max{d,ζ}



1 max{d,ζ} .

Moreover, 1 + ν(1 − d) ≥ 1 −

(1−βκ)(d−1) βκ+max{d,ζ}

=

max{d,ζ}−d(1−βκ)+1 βκ+max{d,ζ}



max{d,ζ}−d+1 βκ+max{d,ζ}

> 0.

We apply Theorem 4.10 and Theorems 4.12, 4.14 for α := ν. Together with (4.44) we find similar to (4.41) and (4.42) that βκ(ν + 1) ≤

βκ 1−κ

=

q+1 q+2



1−

κ(d−1) 1−κ



q+1 q+2



1−

(1−βκ)(d−1) βk+max{d,ζ}



(q+1)(1−ν(d−1)) , q+2

 

and βκ(ν + 1) ≤

βκ(1+max{d,ζ}) βκ+max{d,ζ}

=1−

max{d,ζ}(1−βκ) βκ+max{d,ζ}

≤ 1 − ν max{d, ζ}

such that we obtain in (4.40) that RLJ ,P (fÛD,λn ,γ n ) − R∗LJ ,P ≤ RLN1 ,P (fÛD,λn ,γ n ) − R∗LN1 ,P + RLN2 ,P (fÛD,λn ,γ n ) − R∗LN2 ,P + RLF ,P (fÛD,λn ,γ n ) − R∗LF ,P   (q+1)(1+α−νd) q+2 ≤ c1 τ nε1 n−βκ(ν+1) + nε2 n− + nε3 n−(1−ν max{d,ζ})   ≤ c2 τ nε 2n−βκ(ν+1) + n−(1−ν max{d,ζ})

4.3. Main Results

99

≤ c3 τ n−βκ(ν+1)+ε , holds with probability P n not less than 1 − 9e−τ .



4.3.2. Adaptivity Before comparing the rates from Theorem 4.17 with rates obtained by other algorithms in the next section, we show that our rates are achieved adaptively by a training validation approach. That means, without knowing the MNE β, the NE q and LC ζ in advance. To this end, we briefly describe the training validation support vector machine ansatz given in [MS16], which is similar to the one in Chapter 3. We define Λ := (Λn ) and Γ := (Γn ) as sequences of finite subsets Λn ⊂ (0, n−1 ] and Γn ⊂ (0, rn ]. For a data set D := ((x1 , y1 ), . . . , (xn , yn )) we define D1 := ((x1 , y1 ), . . . , (xl , yl )), D2 := ((xl+1 , yl+1 ), . . . , (xn , yn )), where l := b n2 c + 1 and n ≥ 4. Moreover, we split these sets into (1)

Dj

:= ((xi , yi )i∈{1,...,l} : xi ∈ Aj ),

(2) Dj

j ∈ {1, . . . , mn },

:= ((xi , yi )i∈{l+1,...,n} : xi ∈ Aj ),

j ∈ {1, . . . , mn },

(1)

P mn

and define lj := |Dj | for all j ∈ {1, . . . , mn } such that use

(1) Dj

j=1 lj

= l. We

as a training set by computing a local SVM predictor fD(1) ,λj ,γj := arg min λj kf k2Hˆ j

ˆ γ (Aj ) f ∈H j

γj (Aj )

+ RLj ,D(1) (f ) j

(2)

for every j ∈ {1, . . . , mn }. Then, we use Dj to determine (λj , γj ) by choosing a pair (λD2 ,j , γD2 ,j ) ∈ Λn × Γn such that RLj ,D2 (fÛD(1) ,λD j

2 ,j ,γD2 ,j

)=

min

(λj ,γj )∈Λn ×Γn

RLj ,D(2) (fÛD(1) ,λj ,γj ). j

j

Finally, we call the function fD1 ,λD2 ,γ D2 , defined by fD1 ,λD2 ,γ D2 :=

mn X j=1

1Aj fD(1) ,λD j

2 ,j ,γD2 ,j

,

(4.45)

100

4. Localized SVMs: Oracle Inequalities and Learning Rates

training validation support vector machine (TV-SVM) w.r.t Λ and Γ. We remark that the parameter selection is performed independently on each cell and leads to m · |Λ| · |Γ| many candidates. For more details we refer the reader to [MS16, Section 4.2]. The subsequent theorem shows that the TV-SVM, defined in (4.45), achieves the same rates as the local SVM predictor in (4.3). Theorem 4.20 (Learning Rates for TV-SVMs). Let the assumptions of Theorem 4.17 be satisfied with rn ' n−ν , for some ν > 0. Furthermore, fix an ρn -net Λn ⊂ (0, n−1 ] and an δn rn -net Γn ⊂ (0, rn ] with ρn ≤ n−2 and δn ≤ n−1 . Assume that the cardinalities |Λn | and |Γn | grow polynomially in n. Let τ ≥ 1. If ( κ if β ≥ (q + 1)(1 + max{d, ζ} − d), ν ≤ 1−κ1−βκ (4.46) else, βκ+max{d,ζ} then, for all ε > 0 there exists a constant cd,β,q,ε > 0 such that the TV-SVM, defined in (4.45), satisfies RLJ ,P (fD1 ,λD2 ,γ D2 ) − R∗LJ ,P ≤ cd,β,q,ε τ · n−βκ(ν+1)+ε

(4.47)

with probability P n not less than 1 − e−τ . Remark 4.21. The previous result shows a trade-off between the “range of adaptivity” and “computational complexity”: On the one hand, the smaller we choose ν the bigger the set of P for which we learn with the correct learning rate without knowing the margin parameters β, q and ζ. On the other hand, smaller values of ν lead to bigger cells and hence, training is more costly. This trade-off can be also observed in the result for Training-Validation-Voronoi-PartitionSVMs using least squares loss and Gaussian kernels in [MS16, Thm. 7]. In practice, we choose ν as large as it is computationally feasible. The previous theorem then shows adaptivity for some P but not for all P as the results for global algorithms, see e.g., [SC08, Thm. 8.26]. − 1 − 1 Proof. We assume w.l.o.g. that n ≥ n∗ := max{4, 4−1 δ ∗ ν , 4−1 δ ∗ α }. For n < n∗ the equation in (4.47) is immediately satisfied with constant βκ(ν+1)−ε c˜d,β,q,ε := (n∗ ) and with probability 1. We analyze the excess risk

4.3. Main Results

101

RLJ ,P (fD1 ,λD2 ,γ D2 ) − R∗L,P by applying the splitting technique described in Section 4.3.1 and by applying a generic oracle inequality for ERM given in Theorem 2.43 on each set. To this end, we define sn := rn and find by Theorem 4.16 that RLJ ,P (fD1 ,λD2 ,γ D2 ) − R∗LJ ,P ≤ RLN ,P (fD1 ,λD2 ,γ D2 ) −

(4.48)

R∗LN ,P

≤ RLN1 ,P (fD1 ,λD2 ,γ D2 ) −

+ RLF ,P (fD1 ,λD2 ,γ D2 ) −

R∗LN1 ,P

R∗LF ,P

+ RLN2 ,P (fD1 ,λD2 ,γ D2 ) − R∗LN2 ,P

+ RLF ,P (fD1 ,λD2 ,γ D2 ) − R∗LF ,P . In a fist step, we analyze RLN1 ,P (fD1 ,λD2 ,γ D2 ) − R∗LN RLN1 ,P (fD1 ,λD2 ,γ D2 ) =

X

1 ,P

. Note that

RLj ,P (fD1 ,λD2 ,j ,γD2 ,j ).

j∈JN1

According to Theorem 2.42 we have on the set N1 variance bound θ = q q+1

q q+1

with constant V := 6cNE . Then, for fixed data set D1 and τnN1 := τ + ln(1 + |Λn × Γn ||JN1 | ), as well as n − l ≥ n/4 for n ≥ 4, and l := b n2 c + 1, we find by Theorem 2.43 with probability P n−l not less than 1 − e−τ that RLN1 ,P (fD1 ,λD2 ,γ D2 ) − R∗LN1 ,P

(4.49) !

≤6

inf

|J | (λ,γ)∈(Λn ×Γn ) N1

 + 4

q q+1

RLN1 ,P (fD1 ,λ,γ ) − R∗LN1 ,P

48cNE (τ + ln(1 + |Λn × Γn | n−l

|JN1 |

 q+1 q+2 ) !

≤6

inf (λ,γ)∈(Λn ×Γn )

|JN | 1

RLN1 ,P (fD1 ,λ,γ ) −

R∗LN1 ,P

 +cq

τnN1 n

 q+1 q+2 .

By Theorem 4.10 for p ∈ (0, 12 ) we obtain with probability P l not less than 1 − 3|Λn × Γn ||JN1 | e−τ that RLN1 ,P (fD1 ,λ,γ ) − R∗LN1 ,P

102

4. Localized SVMs: Oracle Inequalities and Learning Rates

q+1  r  q+2−p X λj r d n β ≤ c1 + max γ + j j∈JN1 n γjd j∈JN1 !  τ  q+1 q+2 + n

|J

|

|J

X

−d p λ−1 j γj PX (Aj )

! p(q+1) q+2−p

j∈JN1

|

holds for all (λ, γ) ∈ Λn N1 ×Γn N1 simultaneously and some constant c1 > 0 depending on d, β, p, q. Then, Lemma A.3 i) yields inf (λ,γ)∈(Λn ×Γn )



|JN | 1

inf (λ,γ)∈(Λn ×Γn )

+

RLN1 ,P (fD1 ,λ,γ ) − R∗LN1 ,P

|JN | 1

n

X λj r d + max γjβ d j∈JN1 γ j j∈J

c1

 r  p1 X n

(4.50)

N1

−d p λ−1 j γj PX (Aj )

! p(q+1) q+2−p +

j∈JN1



q+1

 τ  q+1 q+2

!

n

 q+1

≤ c2 n−βκ(ν+1)+ε1 + τ q+2 n− q+2

where c2 > 0 is a constant depending on d, β, q and ε1 . By inserting (4.50) into (4.49) we have with probability P n not less than 1 − (1 + 3|Λn × Γn ||JN1 | )e−τ that RLN1 ,P (fD1 ,λD2 ,γ D2 ) − R∗LN1 ,P ! ≤6

inf (λ,γ)∈(Λn ×Γn )



≤ c2 n

|JN | 1

−βκ(ν+1)+ε1



RLN1 ,P (fD1 ,λ,γ ) − q+1 q+2

n

− q+1 q+2



 + cq

R∗LN1 ,P

τnN1 n

 +cq

τnN1 n

 q+1 q+2

 q+1 q+2 .

Analogously to the calculations at the beginning of the poof of Theorem 4.22 we find by Lemma 4.9 for sn = 2rn that |JN1 | ≤ cd rn rn−d . Thus, we obtain with the inequalities (4.41) resp. (4.44) that RLN1 ,P (fD1 ,λD2 ,γ D2 ) − R∗LN1 ,P    q+1 q+1 τ + ln(1 + |Λn × Γn ||JN1 | ) ≤ c2 n−βκ(ν+1)+ε1 + τ q+2 n− q+2 + cq n

(4.51)  q+1 q+2

4.3. Main Results

103

q+1

q+1

≤ c3 n−βκ(ν+1)+ε1 + τ q+2 n− q+2 +  +

ln(1 + |Λn × Γn ||JN1 | ) n



q+1 q+2

 τ  q+1 q+2 n !

 q+1 ! |JN1 | ln(2|Λn × Γn |) q+2 ≤ c3 n + 2τ n + n !   q+1 q+2 q+1 q+1 c r ln(2|Λ × Γ |) d n n n − ≤ c3 n−βκ(ν+1)+ε1 + 2τ q+2 n q+2 + rnd n !   q+1 q+2 q+1 q+1 c ln(2|Λ × Γ |) d n n − −βκ(ν+1)+ε1 = c3 n + 2τ q+2 n q+2 + n1−ν(d−1)   q+1 q+1 ≤ c4 2n−βκ(ν+1)+ˆε1 + 2τ q+2 n− q+2 , q+1 q+2

−βκ(ν+1)+ε1

− q+1 q+2



where εˆ1 > 0 and where c3 , c4 > 0 are constants depending on d, β, q, ε1 resp. d, β, q, εˆ1 . A variable transformation in τ together with Lemma 4.9 yields q+1

RLN1 ,P (fD1 ,λD2 ,γ D2 ) − R∗LN1 ,P ≤ c5 τ q+2 · n−βκ(ν+1)+˜ε1 for some ε˜1 > 0 with probability P n not less than 1 − e−τ , where c5 > 0 is a constant depending on d, β, q, ε˜1 . Next, we analyze RLN2 ,P (fD1 ,λD2 ,γ D2 ) − R∗LN ,P by the same procedure. 2 q According to Theorem 2.42 we have on the set N2 variance bound θ = q+1 q

q+1 with constant V := 6cNE . Then, for fixed data set D1 and

τnN2 := τ + ln(1 + |Λn × Γn ||JN2 | ), as well as n − l ≥ n/4 for n ≥ 4, and l := b n2 c + 1, we find by Theorem 2.43 with probability P n−l not less than 1 − e−τ that RLN2 ,P (fD1 ,λD2 ,γ D2 ) − R∗LN2 ,P ! ≤6

inf (λ,γ)∈(Λn ×Γn )

|JN | 2

RLN2 ,P (fD1 ,λ,γ ) − R∗LN2 ,P

 +cq

(4.52) q+1  τnN2 q+2 . n

104

4. Localized SVMs: Oracle Inequalities and Learning Rates

By Theorem 4.12 for sn = rn , p ∈ (0, 12 ) and εˆ > 0 we obtain with probability P n not less than 1 − (1 + 3|Λn × Γn ||JN2 | )e−τ that RLN2 ,P (fD1 ,λ,γ ) − R∗LN2 ,P  d X rn ≤ c6 λj nεˆ minj∈JN2 γj j∈JN2

+

q+1  r  q+2−p

n

X

n

−d p λ−1 j γj PX (Aj )

! p(q+1) q+2−p +

 τ  q+1 q+2 n

j∈JN2 |J

|

|J

!

|

holds for all (λ, γ) ∈ Λn N2 ×Γn N2 simultaneously and some constant c6 > 0 depending on d, β, p, q and εˆ. Then, Lemma A.3 ii) yields inf (λ,γ)∈(Λn ×Γn )

|JN | 2

RLN2 ,P (fD1 ,λ,γ ) − R∗LN2 ,P 



inf

|J | (λ,γ)∈(Λn ×Γn ) N2

+

c6

 r  p1 X n n

(4.53)

d X

rn minj∈JN2 γj

−d p λ−1 j γj PX (Aj )

λj nε˜

j∈JN2

! p(q+1) q+2−p

j∈JN2

+

 τ  q+1 q+2

!

n

  − q+1 q+1 q+1 ≤ c7 nε2 rnd−1 n q+2 + τ q+2 · n− q+2 , where c7 > 0 is a constant depending on d, β, q and ε2 . We insert (4.53) into (4.52) and obtain RLN2 ,P (fD1 ,λD2 ,γ D2 ) − R∗LN2 ,P    N2  q+1 q+2 − q+1 q+1 τn − q+1 ε2 d−1 q+2 q+2 q+2 ≤ c 7 n rn n +τ ·n + cq . n with probability P n not less than 1 − (1 + 3|Λn × Γn ||JN2 | )e−τ . An analogous calculation as in (4.51) yields RLN2 ,P (fD1 ,λD2 ,γ D2 ) − R∗LN2 ,P ≤ c8

n

ε2

− q+1 rnd−1 n q+2

+ 2τ

q+1 q+2

·n

− q+1 q+2

 +

cd ln(2|Λn × Γn |) n1−ν(d−1)

!  q+1 q+2

4.3. Main Results

≤ c8

n

ε2



n

105

1−ν(d−1)

− q+1 q+2

+ 2τ

q+1 q+2

·n

− q+1 q+2

 +

cd ln(2|Λn × Γn |) n1−ν(d−1)

!  q+1 q+2

   − q+1 q+1 q+1 q+2 ≤ c9 nεˆ2 n1−ν(d−1) + τ q+2 · n− q+2 , where εˆ2 > 0 and where c8 , c9 > 0 are constants depending on d, β, q, ε2 resp. d, β, q, εˆ2 . A variable transformation in τ together with Lemma 4.9 then yields  − q+1 q+1 q+2 RLN2 ,P (fD1 ,λD2 ,γ D2 ) − R∗LN2 ,P ≤ c10 τ q+2 · nε˜2 n1−ν(d−1)

(4.54)

with probability P n not less than 1 − e−τ , where c10 > 0 is a constant depending on d, β, q, ε˜2 . Next, we analyze RLF ,P (fD1 ,λD2 ,γ D2 ) −R∗LF ,P . According to Theorem 4.1 we have on the set F the best possible variance bound θ = 1 with constant V := 2cLC rn−ζ . Then, for fixed data set D1 and τnF := τ + ln(1 + |Λn × Γn ||JF | ), as well as n − l ≥ n/4 for n ≥ 4, and l := b n2 c + 1, we find by Theorem 2.43 with probability P n−l not less than 1 − e−τ that RLF ,P (fD1 ,λD2 ,γ D2 ) − R∗LF ,P (4.55)   F 2cLC τ ≤6 inf RLF ,P (fD1 ,λ,γ ) − R∗LF ,P + ζ n . |J | (λ,γ)∈(Λn ×Γn ) F rn n Then, Theorem 4.14 for sn = rn , p ∈ (0, 12 ) and ε˜ > 0 yields RLF ,P (fD1 ,λ,γ ) − R∗LF ,P  d X rn ≤ c10 λj nε˜ + minj∈JF γj j∈JF

X

−d p λ−1 j γj PX (Aj )

j∈JF

!p n

−1

+

τ rζ n

! ,

with probability P l not less than 1 − 3|Λn × Γn ||JF | e−τ and for all (λ, γ) ∈ |J | |J | Λn F × Γn F simultaneously and some constant c10 > 0 depending on d, p and ε˜. Again, Lemma A.3 iii) yields inf

(λ,γ)∈(Λn ×Γn )|JF |

RLF ,P (fD1 ,λ,γ ) − R∗LF ,P

(4.56)

106

4. Localized SVMs: Oracle Inequalities and Learning Rates  ≤

inf

(λ,γ)∈(Λn ×Γn )|JF |

+

X

c10

minj∈JF γj !p

−d p λ−1 j γj PX (Aj )

j∈JF

d X

rn

n

−1

+

λj nε˜

j∈JF

τ

!

rnζ n

 ≤ c11 max{rn−d , rn−ζ } · n−1+ε3 + τ · rn−ζ n−1 , where c11 > 0 is a constant depending on d and ε3 . By inserting (4.56) into (4.55) we find RLF ,P (fD1 ,λD2 ,γ D2 ) − R∗LF ,P (4.57)   F 2cLC τ ≤6 inf RLF ,P (fD1 ,λ,γ ) − R∗LF ,P + ζ n |J | (λ,γ)∈(Λn ×Γn ) F rn n   τ + ln(1 + |Λn × Γn ||JF | ) ≤ c12 max{rn−d , rn−ζ } · n−1+ε3 + τ · rn−ζ n−1 + rnζ n   |JF | ln(2|Λn × Γn |) ≤ c12 max{rn−d , rn−ζ } · n−1+ε3 + 2τ · rn−ζ n−1 + rnζ n   mn ln(2|Λn × Γn |) −d −ζ −1+ε3 −ζ −1 ≤ c12 max{rn , rn } · n + 2τ · rn n + rnζ n ! ln(2|Λn × Γn |) −d −ζ −1+ε3 −ζ −1 ≤ c13 max{rn , rn } · n + 2τ · rn n + rnd rnζ n  ≤ c14 max{rn−d , rn−ζ } · n−1+ˆε3 + 2τ · rn−ζ n−1 , where εˆ3 > 0 and where c12 , c13 , c14 > 0 are constants depending on d, ε3 resp. d, εˆ3 . A variable transformation in τ then yields RLF ,P (fD1 ,λD2 ,γ D2 ) − R∗LF ,P ≤ c15 τ · max{rn−d , rn−ζ } · n−1+˜ε3

(4.58)

with probability P n not less than 1 − e−τ , where c15 > 0 is a constant depending on d and ε˜3 . Finally, we compose (31), (4.54), (4.58) and insert these inequalities into (4.48). We obtain with probability P n not less than 1 − 3e−τ that RLJ ,P (fD1 ,λD2 ,γ D2 ) − R∗LJ ,P ≤ RLN1 ,P (fD1 ,λD2 ,γ D2 ) − R∗LN1 ,P + RLN2 ,P (fD1 ,λD2 ,γ D2 ) − R∗LN2 ,P + RLF ,P (fD1 ,λD2 ,γ D2 ) − R∗LF ,P

4.4. Comparison of Learning Rates  ≤ c16 τ

−βκ(ν+1)+˜ ε1

n 

+n

ε˜2



n

107

1−ν(d−1)

− q+1 q+2

≤ c16 τ nε 2n−βκ(ν+1) + max{rn−d , rn−ζ } · n−1

+ 

max{rn−d , rn−ζ }

·n

−1+˜ ε3



≤ c17 τ nε · n−βκ(ν+1) , where in the last step we applied βκ(ν + 1) ≤ 1 − ν max{d, ξ} analogously to the calculations in the proof of Theorem 4.17, where ε := max{˜ ε1 , ε˜2 , ε˜3 } and where c16 , c17 > 0 are constants depending on d, β, q and ε.  4.4. Comparison of Learning Rates In this section, we compare the results for localized SVMs with Gaussian kernels and hinge loss from Theorem 4.17 to the results from various classifiers we mentioned in the introduction. We compare the rates to the ones obtained by global and localized SVMs with Gaussian kernel and hinge loss in [TBMS17, Thm. 3.2], [SC08, (8.18)] and [LZC17]. Moreover, we make comparisons to the rates achieved by various plug-in classifier in [KK07, AT07, BCDD14, BHM18], as well as to rates obtained by the histogram rule in Chapter 3. As in Chapter 3 we try to find reasonable sets of assumptions in all comparisons. We emphasize that the rates for localized SVMs in Theorem 4.17 do not need an assumption on the existence of a density of the marginal distributions. Throughout this section we assume (A) for some rn := n−ν , (G) for some η, and (H) to be satisfied. Moreover, we denote by (i), (ii) and (iii) the following assumptions on P : (i) P has MNE β ∈ (0, ∞], (ii) P has NE q ∈ [0, ∞], (iii) P has LC ζ ∈ [0, ∞). Note that under the just mentioned assumptions the assumptions of Theorem 4.17 for localized SVMs using hinge loss and Gaussian kernels are satisfied. First, we compare the rates to the known ones for localized and global SVMs. Localized and global SVM. Under assumptions (i) and (ii), [SC08, (8.18)] shows that global SVMs using hinge loss and Gaussian kernels learn with the rate β(q+1)

n−βκ = n− β(q+2)+d(q+1) .

(4.59)

108

4. Localized SVMs: Oracle Inequalities and Learning Rates

We remark that in the special case that (i) is satisfied for β = ∞ this rate is also achieved for the same method in [LZC17]. The rate above is also matched by localized SVMs in [TBMS17] using hinge loss and Gaussian kernels as well as cell size rn = n−ν for some ν ≤ κ. Next, we show that the rates from Theorem 4.17 outperform the rates above under a mild additional assumption. To this end, we assume (iii) in addition to (i) and (ii). Then, the rate in (4.39) is satisfied and better by νβκ for all ν our analysis is applied to. According to Remark 4.18 the improvement is at most q + 1 in the denominator if β ≥ (q + 1)(1 + max{d, ζ} − d). In the other case, we 1−βκ obtain the fastest rate with ν := βκ+max{d,ζ} such that the exponent of the rate in (4.39) equals βκ(1+max{d,ζ}) βκ+max{d,ζ}

=

β(q+1)(1+max{d,ζ}) β(q+1)+max{d,ζ}(β(q+2)+d(q+1))

=

β(q+1) β+d(q+1) β(q+2)+d(q+1)− 1+max{d,ζ}

(4.60)

.

Compared to (4.59) we then have at most an improvement of the denominator.

β+d(q+1) 1+max{d,ζ}

in

The main improvement in the comparison above results from the strong effect of the lower-control condition (iii). Recall that (iii) restricts the location of noise in the sense that if we have noise for some x ∈ X, that is η(x) ≈ 1/2, then, (iii) forces this x to be located close to the decision boundary. This does not mean that we have no noise far away from the decision boundary. It is still allowed to have noise η(x) ∈ (0, 1/2 − ε] ∪ [1/2 + ε, 1) for x ∈ X and some ε > 0, only the case that η(x) = 1/2 is prohibited, see Section 2.3.1 and Fig. 2.4. In the following, we compare the result from Theorem 4.17 with results that make besides assumption (ii) some smoothness condition on η, namely that (iv) η is H¨older-continuous for some ρ ∈ (0, 1]. This assumption can be seen as a strong reverse assumption to (iii) since it implies that the distance to the decision boundary controls the noise from above, see Lemma 2.32. In particular, if (iii) and (iv) are satisfied, then ρ ≤ ζ. Vice versa, we observe that a reverse H¨older continuity assumption implies (iii) if η is continuous, see Lemma 2.29. If we assume (iii) in addition to (ii) and (iv) we satisfy the assumptions for localized SVMs in Theorem 4.17 since we find with Lemma 2.32 and

4.4. Comparison of Learning Rates

109

Lemma 2.34 that we have MNE β = ρ(q + 1). We observe that β = ρ(q + 1) ≤ (q + 1)(1 + max{d, ζ} − d)

(4.61)

and according to Theorem 4.17 localized SVMs then learn with the rate n−βκ(ν+1) for arbitrary ν ≤

1−βκ βκ+max{d,ζ} .

(4.62)

In particular, this rate is upper bounded by ρ(q+1)

n−βκ(ν+1) < n−βκ = n− ρ(q+2)+d .

(4.63)

Plug-in classifier I. Under assumption (ii), (iv) and the assumption that the support of the marginal distribution PX is included in a compact set, the so called “Hybrid” plug-in classifiers in [AT07, Equation 4.1] learn with the optimal rate ρ(q+1)

n− ρ(q+2)+d ,

(4.64)

see [AT07, Thm. 4.3]. If we assume in addition (iii), the localized SVM rate again equals (4.62) and satisfies (4.63) such that our rate is faster for 1−βκ 1−βκ arbitrary ν ≤ βκ+max{d,ζ} . For ν := βκ+max{d,ζ} we find for the exponent in (4.62) that βκ(ν + 1) = βκ(1+max{d,ζ}) βκ+max{d,ζ} =

ρ(q+1) ρ(q+1)−(ρ(q+2)+d) ρ(q+2)+d+ 1+max{d,ζ}

such that we have at most an improvement of

ρ(q+1) = ρ(q+2)+d− ρ+d

ρ+d 1+max{d,ζ}

1+max{d,ζ}

in the denominator.

In the comparison above the localized SVM rate outperforms the optimal rate by making the additional assumption (iii). This is not surprising, since the assumptions we made imply the assumptions of [AT07], see also Section 3.3. We emphasize once again that the rates we obtained as well as the other rates are achieved under less assumptions. Tree-based and Plug-in classifier. Assume that (ii) and (iv) are satisfied. Then, the classifiers resulting from the tree-based adaptive partitioning methods in [BCDD14, Section 6] yield the rate ρ(q+1)

n− ρ(q+2)+d ,

110

4. Localized SVMs: Oracle Inequalities and Learning Rates

see [BCDD14, Thms. 6.1(i) and 6.3(i)]. Moreover, [KK07, Thms. 1, 3, and 5] showed that plug-in-classifiers based on kernel, partitioning and nearest neighbor regression estimates learn with rate ρ(q+1)

n− ρ(q+3)+d .

(4.65)

We remark that both rates are satisfied under milder assumptions, see Section 3.3. In order to compare rates we add (iii) to (ii) and (iv). Then, the localized SVM rate again equals (4.62), and is faster for all ν our analysis is applied to. The improvement to the rate from [BCDD14] equals the improvement in the previous comparison, whereas compared to the rate from [KK07] the improvement is at least better by ρ in the denominator. The three comparisons above have in common that rates are solely improved by assumption (iii). This condition was even sufficient enough to improve the optimal rate in (4.64). It is to emphasize that neither for the rates from Theorem 4.17 or the rates from the mentioned authors above nor in the comparisons assumptions on the existence of a density of the marginal distribution PX have to be made. Subsequently, we present examples that do make such assumptions on PX . As we discussed in Remark 2.39, assumptions on the density of PX in combination with (ii), a H¨older assumption on η, and some regularity of the decision boundary shrink the classes of the considered distributions, see also [AT07, KK07, BCDD14]. As a consequence assumptions without such conditions are preferable, however, to compare rates we find subsequently assumption sets that do contain those. Plug-in classifier II. Let us assume that (ii) and (iv) are satisfied and that PX has a uniformly bounded density w.r.t. the Lebesgue measure. Then, [AT07, Thm. 4.1] shows that plug-in classifiers learn with the optimal rate ρ(q+1)

n− ρ(q+2)+d . If we assume in addition (iii), the localized SVM rate again equals (4.62) 1−βκ and satisfies (4.63) such that our rate is faster for arbitrary ν ≤ βκ+max{d,ζ} . In the next comparison, we make another assumption on the density of PX . Plug-in classifier III. Let us assume that (ii), (iv) are satisfied and that that PX has a density with respect to the Lebesgue measure that is bounded away from zero. Then, the authors in [BHM18] show that plug-in classifiers

4.4. Comparison of Learning Rates

111

based on a weighted and interpolated nearest neighbor scheme obtain the rate ρq

n− p(q+2)+d .

(4.66)

Under the same conditions, [KK07] improved for plug-in-classifier based on kernel, partitioning, and nearest neighbor regression estimates the rate in (4.65) to n−

ρ(q+1) 2ρ+d

.

(4.67)

By reason of comparison we add (iii) to (ii) and (iv). Then, the localized SVM rate equals (4.62) and satisfies (4.63) such that our rate is obviously faster than the rate in (4.66) for all possible choices of ν. The improvement ρ compared to (4.66) is at least ρ(q+2)+d . In order to compare the localized SVM rate with (4.67) we take a closer look on the rate and the margin parameters under the stated conditions. A short calculation shows for the exponent of the rate in (4.62) that we have βκ(ν + 1) =

ρ(q+1) ρ(q+2)+d (ν+1)

=

ρ(q+1) 2ρ+d+

ρ(q+2)−2ρ(ν+1)+d−d(ν+1) (ν+1)

from which its easy to derive that the exponent is larger than the one pq in (4.67) or equals it if ν ≥ 2ρ+d . We show that the largest ν we can choose satisfies this bound if ρ = 1 and derive a rate for this case. Since PX has a density with respect to the Lebesgue measure that is bounded away from zero we restrict ourselves to the case ρq ≤ 1, and hence q ≤ 1, see Remark 2.39. Moreover, Lemma 2.32 together with Lemma 2.34 yield α = ρq = q. Furthermore, we find by Lemma 2.30 that q = αζ and we follow ζ = ρ = 1. Thus, a short calculation shows that ν :=

1−βκ βκ+max{d,ζ}

=

1−βκ βκ+d

=

ρ+d ρ(q+1)+d(ρ(q+2)+d)

=

1+d q+1+d(q+2+d)



q 2+d ,

which is satisfied for all q ≤ 1. By inserting this into the exponent of the localized SVM rate in (4.62) with q = 1 we find βκ(ν + 1) = = =

βκ(1+d) βκ+d ρ(q+1) ρ(q+1)−(ρ(q+2)+d) ρ(q+2)+d+ 1+d q+1 q+2+d+

q+1−(q+2+d) 1+d

112

4. Localized SVMs: Oracle Inequalities and Learning Rates = =

q+1 q+1+d 2 2+d .

Hence, the localized SVM rate is faster than the rate in (4.67) for all q < 1 and matches it if q = 1. We remark that the [AT07] showed under a slight stronger density assumption that certain plug-in classifier achieve the optimal rate (4.67). Under assumptions that contain that PX has a density w.r.t. Lebesgue measure that is bounded away from zero, we improved in the previous comparison the rates from [BHM18] and in the case that η is Lipschitz the rates from [KK07]. Finally, we compare the localized SVM rates to the ones derived for the histogram rule Chapter 3, where we also considered a set of margin conditions and a similar strategy to derive the rates. Recall that under a certain set of assumptions we showed in Section 3.3 that the histogram rule outperformed the global SVM rates from [SC08, (8.18)] and the localized SVM rates from [TBMS17]. Histogram rule. Let us assume that (i) and (iii) are satisfied and that (v) P has ME α ∈ (0, ∞]. Then, we find by Lemma 2.30 that we have NE q = Theorem 3.6 the histogram rule then learns with rate −

n

α ζ

and according to

β(q+1) β(q+1)+d(q+1)+

βζ 1+ζ

(4.68)

as long as β ≤ (1 + ζ)(q + 1). Under these assumptions the localized SVM learns with the rate from Theorem 4.17, that is, n−βκ(ν+1) , where our rate depends on ν. To compare our rates we have to pay attention to the range of β that provides a suitable ν, see (4.37). If we have that (q + 1)(1 + max{d, ζ} − d) ≤ β ≤ (q + 1)(1 + ζ), then a short calculation shows that our local SVM rate in (4.39) is faster if ν is not too small, that is if ν satisfies β ((β + d)(q + 1)(ζ + 1) + βζ)

−1

≤ν≤

κ 1−κ .

4.4. Comparison of Learning Rates

113

According to Remark 4.18 the best possible rate is achieved for ν := and has then exponent

κ 1−κ

β(q+1) β(q+2)+(d−1)(q+1)

=

β(q+1) βζ βζ β(q+1)+d(q+1)+ (1+ζ) +β−(q+1)− 1+ζ

=

β(q+1) ζ β(q+1)+d(q+1)−(q+ 1+ζ )

ζ such that compared to (4.68) we have an improvement of q + 1+ζ in the denominator. In the other case, that is, if β ≤ (q + 1)(1 + max{d, ζ} − d) a short calculation shows that our local SVM rate is better for all choices

β ((β + d)(q + 1)(ζ + 1) + βζ)

−1

≤ν≤

1−βκ βk+max{d,ζ} .

In this case, we find due to Remark 4.18 that the best possible rate is 1−βκ achieved for ν := βk+max{d,ζ} and has exponent βκ[1+max{d,ζ}] βk+max{d,ζ}

= =

β(q+1) β max{d,ζ}−d(q+1) β(q+1)+d(q+1)+ 1+max{d,ζ} β(q+1) βζ β(q+1)+d(q+1)+ 1+ζ −

d(q+1)−β max{d,ζ} βζ + 1+ζ 1+max{d,ζ}

.

max{d,ζ} βζ Compared to (4.68) the rate is better by d(q+1)−β + 1+ζ > 0 in 1+max{d,ζ} the denominator. We remark that the lower bound on ν is not surprising since if ν → 0 our rate matches the global rate in (4.59) and we showed in Section 3.3 that under a certain set of assumptions the rate of the histogram classifier is faster than the one of the global SVM. Moreover, we remark that our rates in Theorem 4.17 hold for all values of β and not only for a certain range of β.

5. Discussion In this work, we refined the statistical analysis for both the histogram rule and localized SVMs. The analysis relied on a simple splitting technique of the input space, which splits the input space in a part in which cells are located close to the decision boundary and in a part in which the cells are sufficiently bounded away from it. This separation was described by a parameter s > 0 and we examined the excess risks on both parts separately. We found in both cases suitable splitting parameters and derived global learning rates under margin conditions, see Sections 3.2 and 4.3.1. The resulting learning rates outperformed rates obtained by several other learning algorithms under suitable sets of assumptions, see Sections 3.3 and 4.4. The assumption that the distance to the decision boundary controls the noise from below was a crucial condition to obtain these rates. Since the parameters of the margin conditions appeared in the parameters of the algorithms, we proposed for both cases training-validation procedures that achieved the same rates without knowing these parameters, see Sections 3.2 and 4.4. For future work, the following points are interesting: i) The improved learning rates for the histogram rule and for localized SVMs in Theorems 3.5 and 4.17 required certain sets of margin conditions. For both results, we showed that the rates improved or matched the optimal rates in [AT07]. It would be interesting to know if our rates are optimal under the considered sets of assumptions in Theorems 3.5 and 4.17. ii) The models in Chapters 3 and 4 considered bounded input spaces. The results could probably be generalized to unbounded input spaces if some assumptions on the tail behavior of the marginal distribution are made, similar as in [SC08, Ch. 8]. Clearly, the rate will then depend additionally on the tail condition. iii) The rates for the histogram rule in Theorem 3.6 are only satisfied if β ≤ ζ −1 (1 + ζ)(α + ζ). For the comparisons in Section 3.3, this assumption was satisfied in each case, however, to fill the gap, one could examine which rate the histogram rule achieves if β exceeds this bound. © Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2020 I. K. Blaschzyk, Improved Classification Rates for Localized Algorithms under Margin Conditions, https://doi.org/10.1007/978-3-658-29591-2_5

116

5. Discussion

iv) For the result in Theorem 4.17, the choice of sn = rn was straightforward. This sn is not the optimal splitting parameter for minimizing the bounds in Theorem 4.15 and Theorem 4.13. Unfortunately, the optimal sn did not fulfil the constraint given by Theorem 4.11. It could be worth to examine if there exists a better choice of sn leading to even faster rates than the ones from Theorem 4.17. v) The rates for localized SVMs on sets that are sufficiently far away from the decision boundary in Theorem 4.15 seem not to be as good as possible, since there exist cases in which we learn better on the set which intersects the decision boundary. A cause could be the choice of sn , see also (iv). We tried to refine the rate in Theorem 4.15 by a form of peeling, where we divided the set sufficiently far away from the decision boundary into stripes. The closer the stripe to the decision boundary, the better the bound. Unfortunately, summing up all these stripes erased possible refinements on the stripes. However, one might have some ideas to avoid this effect. vi) The rates in Theorem 4.17 obtain a factor nε , where ε > 0 may be arbitrarily small. Recent results in [FS19] showed that results can be improved up to the slight better factor log n. This might also be possible for the result on cells that intersect the decision boundary in Theorem 4.11. However, for the result on sets which have no intersection with the decision boundary in Theorems 4.13 and 4.15 this is difficult. As one might have noticed in the proofs of the mentioned Theorems, we chose the weights ω− dependent on the parameter ζ in order to bound the approximation error. The parameter ζ resulted from bounding the incomplete gamma functions in Theorem 4.6. This choice led additionally to a factor of nε in the rates. Hence, future investigations might find a way to better deal with the incomplete gamma function in Theorem 4.6 and potentially find a suitable choice for ξ. Then, it could be possible to obtain for the global rate a log n factor instead of nε .

A. Appendix In this appendix, we state technical lemmata that are required for some proofs in Chapters 3 and 4. Lemma A.1. Let Y := {−1, 1} and P be a probability measure on X × Y . For η(x) := P (y = 1|x), x ∈ X define the set X1 := { x ∈ X | η(x) > 1/2 }. Let L be the classification loss and consider for A ⊂ X the loss LA (x, y, t) := 1A (x)L(y, t), where y ∈ Y, t ∈ R. For a measurable f : X → R we then have Z RLA ,P (f ) − R∗LA ,P = |2η(x) − 1| dPX (x), (X1 4{f ≥0})∩A

where 4 denotes the symmetric difference. Proof. It is well known, e.g., [SC08, Example 3.8], that Z ∗ RLA ,P (f ) − RLA ,P = |2η(x) − 1| · 1(−∞,0) ((2η(x) − 1)signf (x)) dPX (x). A

(A.1) Next, for PX -almost all x ∈ A we have 1(−∞,0] ((2η(x) − 1)signf (x)) = 1 ⇔ (2η(x) − 1)signf (x) ≤ 0. The latter is true if for x ∈ A holds that f (x) < 0 and η(x) > 1/2 or that f (x) ≥ 0 and η(x) ≤ 1/2 or that η(x) = 1/2. However, for η(x) = 1/2 we have |2η(x) − 1| = 0 and hence this case can be ignored. Then, the latter obviously equals the set (X1 4{f ≥ 0}) ∩ A and we obtain in (A.1) that Z RLA ,P (f ) − R∗LA ,P = |2η(x) − 1| dPX (x). (X1 4{f ≥0})∩A 

© Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2020 I. K. Blaschzyk, Improved Classification Rates for Localized Algorithms under Margin Conditions, https://doi.org/10.1007/978-3-658-29591-2

118

A. Appendix

Lemma A.2. Let X ⊂ Rd and γ, ρ > 0. Then, we have 

2 πγ 2

d/2 Z

e−2γ

−2

kx−yk22

1 Γ(d/2)

dy =

Bρ (x)

2ρ2 γ −2

Z

e−t td/2−1 dt. 0

Proof. For ρ > 0 we find that d/2 Z −2 2 2 e−2γ kx−yk2 dy 2 πγ Bρ (x)  d/2 Z −2 2 2 = e−2γ kyk2 dy πγ 2 Bρ (0)  d/2 Z ρ −2 2 2 π d/2 = e−2γ t d · td−1 dt πγ 2 Γ(d/2 + 1) 0  d/2  d−1 Z √2ργ −1 2 2π d/2 1 γ −t2 d−1 √ √ = e d · t · dt πγ 2 dΓ(d/2) 0 2γ −1 2 Z 2ρ2 γ −2 2 1 = e−t td/2−1 · dt Γ(d/2) 0 2 Z 2ρ2 γ −2 1 = e−t td/2−1 dt. Γ(d/2) 0





Lemma A.3. Let (Aj )j=1,...,m be a partition of B`d2 . Let d ≥ 1, p ∈ (0, 12 ) and let rn ∈ (0, 1]. For ρn ≤ n−2 and δn ≤ n−1 fix a finite ρn -net Λn ⊂ (0, n−1 ] and a finite δn rn -net Γn ⊂ (0, rn ]. Let J ⊂ {1, . . . , mn } be an index set and for all j ∈ J let γj ∈ (0, rn ], λj > 0. Define γmax := maxj∈J , as well as analogously γmin . i) Let β ∈ (0, 1], q ∈ [0, ∞) and let |J| ≤ cd rn−d+1 for some constant cd > 0. Then, for all ε1 > 0 there exists a constant c˜1 > 0 such that X λj r d

n

inf

(λ,γ)∈(Λn ×Γn )|J|

+

j∈J

q+1  r  q+2−p

≤ c˜1 · n

n

X

n −βκ(ν+1)+ε1

γjd

j∈J

.

β + γmax

−d p λ−1 j γj PX (Aj )

! p(q+1) ! q+2−p

A. Appendix

119

ii) Let β ∈ (0, 1], q ∈ [0, ∞) and let |J| ≤ cd rn−d+1 for some constant cd > 0. Then, for all ε˜, ε2 > 0 there exists a constant c˜2 > 0 such that  inf

(λ,γ)∈(Λn ×Γn )|J|

+

q+1  r  q+2−p

n

rn γmin X

n

d X

λj nε˜

j∈J −d p λ−1 j γj PX (Aj )

! p(q+1) ! q+2−p

j∈J

≤ c˜2 · nε2 rnd−1 n

− q+1 q+2

.

iii) Let |J| ≤ cd rn−d . Then, for all ε˜, ε3 > 0 there exists a constant c˜3 > 0 such that !p !  d X d X rn −1 − p ε˜ −1 inf λj n + λj γj PX (Aj ) n γmin (λ,γ)∈(Λn ×Γn )|J| j∈J

≤ c˜3 ·

j∈J

rn−d n−1+ε3 .

Proof. We follow the lines of the proof of [MS16, Lemma 14]. Let us assume that Λn := {λ(1) , . . . λ(u) } and Γn := {γ (1) , . . . γ (v) } such that λ(i−1) < λ(i) and γ (l−1) < γ (l) for all i = 2, . . . , u and l = 2, . . . , v. Furthermore, let γ (0) = λ(0) := 0 and λ(u) := n−1 , γ (v) := rn . Then, fix a pair (λ∗ , γ ∗ ) ∈ [0, n−1 ] × [0, rn ]. Following the lines of the proof of [SC08, Lemma 6.30] there exist indices i ∈ {1, . . . , u} and l ∈ {1, . . . , v} such that λ∗ ≤ λ(i) ≤ λ∗ + 2ρn , γ ∗ ≤ γ (l) ≤ γ ∗ + 2δn rn . First, we prove i). With (A.2) we find X λj r d

n

inf (λ,γ)∈(Λn ×Γn

+

)|J|

q+1  r  q+2−p

n

n

j∈J

γjd

X j∈J

 β X λ(i) rd n (l) ≤ d + γ (l) j∈J γ

β + γmax

−d p λ−1 j γj PX (Aj )

! p(q+1) ! q+2−p

(A.2)

120

A. Appendix

+

q+1  r  q+2−p

X

n

λ

(i)

−1 

γ

(l)

− dp

! p(q+1) q+2−p PX (Aj )

j∈J

 β λ(i) rnd (l) d + γ γ (l) q+1 −p (l) −d ! q+2−p rn λ(i) γ + n

≤ |J|

! p(q+1) q+2−p X

PX (Aj )

j∈J

  q+1 (λ∗ + 2ρn )rn rn (λ∗ )−p q+2−p ∗ β ≤ + (γ + 2δn rn ) + (γ ∗ )d (γ ∗ )d n !   q+1 rn (λ∗ )−p q+2−p β ∗ ∗ −d ∗ β ∗ −d ≤ c1 λ rn (γ ) + (γ ) + + ρn rn (γ ) + (δn rn ) (γ ∗ )d n for some c1 > 0. We define λ∗ := n−σ for some σ ∈ [1, 2] and γ ∗ := rnκ n−κ . κ Obviously, λ∗ ∈ [0, n−1 ]. Moreover, we have γ ∗ ∈ [0, rn ] since ν ≤ 1−κ . −2 −1 Then, we obtain with ρn ≤ n and δn ≤ n , and together with 1 ≥ κ(β +   d)(ν +1)−ν > 0 and that

(1−dκ)(q+1) q+2−p

>

(1−dκ)(q+1) q+2

=

β(q+2) β(q+2)+d(q+1)

q+1 q+2

= βκ

!  q+1 rn (λ∗ )−p q+2−p β ∗ −d c1 λ rn (γ ) + (γ ) + + ρn rn (γ ) + (δn rn ) (γ ∗ )d n q+1  1−dκ ∗ −p  q+2−p rn (λ ) −σ −dκ dκ βκ −βκ ≤ c 1 n rn r n n + rn n + n1−dκ ! β + n−2 rn (γ ∗ )−d + rn n−1 ∗

∗ −d



∗ β

≤ c2 rn−1+(β+d)κ n−(β+d)κ rn rn−dκ ndκ +

rnβκ n−βκ

+ rn n

−1

 (1−dκ)(q+1) q+2−p

n

pσ(q+1) q+2−p

 β  ≤ c2 rnβκ n−βκ + rnβκ n−βκ nε1 + rn n−1 ≤ c3 · n−βκ(ν+1)+ε1

! + rn n

 −1 β

A. Appendix

121

holds for some constants c2 , c3 > 0 and where p is chosen sufficiently small such that ε1 ≥ pσ(q+1) q+2−p > 0. Second, we prove ii). With (A.2) we find  inf

(λ,γ)∈(Λn ×Γn )|J|

+  ≤



n

n

rn γ (l) +



q+1  r  q+2−p

j∈J

λ(i) nε˜

j∈J

X

n

n d

λ(i)

−1 

γ (l)

− dp

! p(q+1) q+2−p PX (Aj )

j∈J

|J|λ(i) nε˜ rn d

γ (l)

rn λ(i) nε˜ ≤ c4 d + γ (l)

≤ c4

λj nε˜

  p(q+1) q+2−p ! d X −p −1  λj γj PX (Aj )

q+1  r  q+2−p

rn γ (l)

d X

j∈J

d X

+

≤ c4

rn γmin

q+1 ! q+2−p

rn λ∗ nε˜ d (γ ∗ )

+

λ

(i)

− p(q+1) q+2−p

n

! p(q+1) q+2−p X

PX (Aj )

j∈J

rn d

γ (l)

rn (λ∗ + 2ρn )nε˜ d (γ ∗ )





λ(i)

q+1 ! q+2−p

(λ∗ )

d (γ ∗ )

n q+1 ! q+2−p

rn

− p(q+1) q+2−p

n rn

+

d (γ ∗ )

q+1 ! q+2−p

(λ∗ )

n

p(q+1)

− q+2−p

p(q+1)

− q+2−p

+ 2c4

ρn rn nε˜ (γ ∗ )

d

for some constant c4 > 0 depending on d. We define γ ∗ := rn and λ∗ := n−σ for some σ ∈ [1, 2]. Then, we obtain with ρn ≤ n−2 that

c4

rn λ∗ nε˜ (γ ∗ )

d

+

rn d

(γ ∗ ) n

q+1 ! q+2−p

(λ∗ )

p(q+1)

− q+2−p

+ 2c4

rn ρn nε˜ (γ ∗ )

d

122

A. Appendix q+1   q+2−p pσ(q+1) n−σ nε˜ 1 ρn nε˜ + n q+2−p + 2c4 d−1 d−1 d−1 rn rn n rn −1 ε˜ ε˜ q+1  n n ρn n − ≤ c4 d−1 + rnd−1 n q+2 nεˆ + 2c4 d−1 rn rn   q+1   −1 −1 − ≤ c5 nε2 rnd−1 n + rnd−1 n q+2 + n−2 rnd−1

= c4

≤ c6 nε2 rnd−1 n

− q+1 q+2

,

where c5 , c6 > 0 are constants depending on d and where p is chosen sufficiently small such that εˆ ≥ pσ(q+1) ε, εˆ}. q+2−p and ε2 := max{˜ Finally, we prove iii). We find with (A.2) that   p   d X d X − r n p nε˜  n−1  inf λj +  λ−1 j γj PX (Aj ) γmin (λ,γ)∈(Λn ×Γn )|J| j∈J j∈J  p  d X −1  − dp X r n ≤ nε˜ λ(i) +  λ(i) γ (l) PX (Aj ) n−1 γmin j∈J j∈J  p  d  −p  −d X r n  ≤ nε˜ |J|λ(i) + λ(i) γ (l) PX (Aj ) n−1 γmin j∈J



≤ c7 nε˜ γ (l)

−d



λ(i) + λ(i)

−p 

≤ c7 nε˜ (γ ∗ )

−d

(λ∗ + 2ρn ) + (λ∗ )

= c7 nε˜ (γ ∗ )

−d

λ∗ + (λ∗ )

−p

(γ ∗ )

γ (l)

−p

−d

−d

(γ ∗ )

n−1

−d

n−1

n−1 + 2ρn c7 nε˜ (γ ∗ )

−d

holds for some constant c7 > 0 depending on d. We define γ ∗ := rn and λ∗ := n−σ for some σ ∈ [1, 2]. Then, we obtain with ρn ≤ n−2 c7 nε˜ (γ ∗ )

−d

λ∗ + (λ∗ )

−p

(γ ∗ )

−d

n−1 + 2ρn c7 nε˜ (γ ∗ )

−d

≤ c7 nε˜rn−d n−σ + npσ rn−d n−1 + 2c7 nε˜rn−d n−2 ≤ c7 nε˜rn−d n−1 + nεˆrn−d n−1 + 2c7 nε˜rn−d n−2 ≤ c8 · rn−d n−1+ε3 for some constant c8 > 0 depending on d and where p is chosen sufficiently small such that εˆ ≥ pσ > 0. Here, ε3 := max{˜ ε, εˆ}. 

Bibliography [AT07] J.-Y. Audibert and A. Tsybakov. Fast learning rates for plug-in classifiers. Ann. Statist., 35:608–633, 2007. [BB98] K. P. Bennett and J. A. Blue. A support vector machine approach to decision trees. IEEE International Joint Conference on Neural Networks Proceedings, 3:2396–2401, 1998. [BCDD14] P. Binev, A. Cohen, W. Dahmen, and R. DeVore. Classification algorithms using adaptive partitioning. Ann. Statist., 42:2141– 2163, 2014. [BHM18] M. Belkin, D. J. Hsu, and P. Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. Advances in Neural Information Processing Systems (NIPS), 31:2306–2317, 2018. [BS18] I. Blaschzyk and I. Steinwart. Improved classification rates under refined margin conditions. Electron. J. Stat., 12:793–823, 2018. [BV92] L. Bottou and V. Vapnik. Local learning algorithms. Neural Comput., 4:888–900, 1992. [CS90] B. Carl and I. Stephani. Entropy, Compactness and the Approximation of Operators. Cambridge University Press, Cambridge, 1990. [CTJ07] H. Cheng, P.-N. Tan, and R. Jin. Localized support vector machine and its efficient algorithm. SIAM International Conference on Data Mining, 2007. [DC18] F. Dumpert and A. Christmann. Universal consistency and robustness of localized support vector machines. Neurocomputing, 315:96–106, 2018. [DGL96] L. Devroye, L. Gy¨ orfi, and L. Lugosi. A Probabilistc Theory of Pattern Recognition. Springer, 1996. © Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2020 I. K. Blaschzyk, Improved Classification Rates for Localized Algorithms under Margin Conditions, https://doi.org/10.1007/978-3-658-29591-2

124

Bibliography

[DGW15] M. D¨ oring, L. Gy¨orfi, and H. Walk. Exact rate of convergence of kernel-based classification rule, volume 605, pages 71–91. Springer International Publishing, 2015. [ES15] M. Eberts and I. Steinwart. Optimal learning rates for localized SVMs. ArXiv e-prints 1507.06615, 2015. [ET96] D.E. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential Operators. Cambridge University Press, 1996. [FDCBA14] M. Fernandez-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res., 15:3133–3181, 2014. [Fed69] H. Federer. Geometric Measure Theory. Springer-Verlag, Berlin, 1969. [FS19] M. Farooq and I. Steinwart. Learning rates for kernel-based expectile regression. Mach. Learn., 108:203–227, 2019. [Hab13] R. Hable. Universal consistency of localized versions of regularized kernel methods. J. Mach. Learn. Res., 14:153–186, 2013. [KK07] M. Kohler and A. Krzyzak. On the rate of convergence of local averaging plug-in classification rules under a margin condition. IEEE Trans. Inf. Theor., 53:1735–1742, 2007. [KUMH17] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Selfnormalizing neural networks. Advances in Neural Information Processing Systems (NIPS), 30:971–980, 2017. [LGZ17] S. Lin, X. Guo, and D.-X. Zhou. Distributed Learning with Regularized Least Squares. J. Mach. Learn. Res., 18(92):1–31, 2017. [LZC17] S. Lin, J. Zeng, and X. Chang. Learning rates for classification with Gaussian kernels. Neural Comput., 29:3353–3380, 2017. [MB18] N. M¨ ucke and G. Blanchard. Parallelizing spectrally regularized kernel algorithms. J. Mach. Learn. Res., 19(1):1069–1097, 2018. [MS16] M. Meister and I. Steinwart. Optimal learning rates for localized SVMs. J. Mach. Learn. Res, 17:1–44, 2016.

Bibliography

125

[MT99] E. Mammen and A. Tsybakov. Smooth discrimination analysis. Ann. Statist., 27:1808–1829, 1999. [RCR15] A. Rudi, R. Camoriano, and L. Rosasco. Less is More: Nystr¨ om Computational Regularization. Advances in Neural Information Processing Systems (NIPS), 28:1657–1665, 2015. [RR08] A. Rahimi and B. Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems (NIPS), 20:1177–1184, 2008. [RR17] A. Rudi and L. Rosasco. Generalization Properties of Learning with Random Features. Advances in Neural Information Processing Systems (NIPS), 30:3215–3225, 2017. [SC08] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008. [SS07] I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Ann. Statist., 35:575–607, 2007. [SSM98] B. Sch¨olkopf, AJ. Smola, and K-R. M¨ uller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput., 10:1299–1319, 1998. [ST17] I. Steinwart and P. Thomann. liquidSVM: A fast and versatile SVM package. ArXiv e-prints 1702.06899, 2017. [Ste15] I. Steinwart. Fully adaptive density-based clustering. Ann. Statist., 43:2132–2167, 2015. [TBMS17] P. Thomann, I. Blaschzyk, M. Meister, and I. Steinwart. Spatial decompositions for large scale SVMs. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 54:1329–1337, 2017. [WS01] C. K. I. Williams and M. Seeger. Using the Nystr¨om method to speed up kernel machines. Advances in Neural Information Processing Systems (NIPS), 13:682–688, 2001. [ZBMM06] H. Zhang, A. C. Berg, M. Maire, and J. Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2:2126–2136, 2006.

126

Bibliography

[ZDW15] Y. Zhang, J. Duchi, and M. Wainwright. Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates. J. Mach. Learn. Res., 16(102):3299–3340, 2015.