
CAMBRIDGE SERIES IN STATISTICAL AND PROBABILISTIC MATHEMATICS
Editorial Board: R. Gill, Department of Mathematics, Utrecht University; B. D. Ripley, Department of Statistics, University of Oxford; S. Ross, Department of Industrial Engineering, University of California, Berkeley; M. Stein, Department of Statistics, University of Chicago; D. Williams, School of Mathematical Sciences, University of Bath

This series of high-quality upper-division textbooks and expository monographs covers all aspects of stochastic applicable mathematics. The topics range from pure and applied statistics to probability theory, operations research, optimization, and mathematical programming. The books contain clear presentations of new developments in the field and also of the state of the art in classical methods. While emphasizing rigorous treatment of theoretical methods, the books also contain applications and discussions of new techniques made possible by advances in computational practice.

Already published
1. Bootstrap Methods and Their Application, A. C. Davison and D. V. Hinkley
2. Markov Chains, J. Norris
3. Asymptotic Statistics, A. W. van der Vaart


Applications of Empirical Process Theory Sara A. van de Geer

CAMBRIDGE UNIVERSITY PRESS


PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE

The Pitt Building, Trumpington Street, Cambridge, United Kingdom CAMBRIDGE UNIVERSITY PRESS

The Edinburgh Building, Cambridge CB2 2RU, UK http://www.cup.cam.ac.uk
40 West 20th Street, New York, NY 10011-4211, USA http://www.cup.org
10 Stamford Road, Oakleigh, Melbourne 3166, Australia
Ruiz de Alarcón 13, 28014 Madrid, Spain

© Cambridge University Press 2000

This book is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2000
Printed in the United States of America
Typeset in Monotype Times 11/13pt, in TeX [EPC]

A catalog record for this book is available from the British Library

Library of Congress Cataloging in Publication data
Geer, S. A. van de (Sara A.)
Applications of empirical process theory / Sara A. van de Geer.
p. cm. Includes bibliographical references and indexes.
ISBN 0 521 65002 X
1. Nonparametric statistics. 2. Estimation theory. 3. Limit theorems (Probability theory). I. Title.
QA278.8.G44 1999 519.5'4 dc21 98-50544 CIP

ISBN 0 521 65002 X hardback


Contents


Preface
Guide to the Reader
1 Introduction
1.1 Some examples from statistics
1.2 Problems and complements
2 Notation and Definitions
2.1 Stochastic order symbols
2.2 The empirical process
2.3 Entropy
2.4 Examples
2.5 Notes
2.6 Problems and complements
3 Uniform Laws of Large Numbers
3.1 Uniform laws of large numbers under finite entropy with bracketing
3.2 The chaining technique
3.3 A maximal inequality for weighted sums
3.4 Symmetrization
3.5 Hoeffding's inequality
3.6 Uniform laws of large numbers under random entropy conditions
3.7 Examples
3.8 Notes
3.9 Problems and complements
4 First Applications: Consistency
4.1 Consistency of maximum likelihood estimators
4.2 Examples
4.3 Consistency of least squares estimators
4.4 Examples
4.5 Notes
4.6 Problems and complements
5 Increments of Empirical Processes
5.1 Random entropy numbers and asymptotic equicontinuity
5.2 Random entropy numbers and classes depending on n
5.3 Empirical entropy and empirical norms
5.4 A uniform inequality based on entropy with bracketing
5.5 Entropy with bracketing and asymptotic equicontinuity
5.6 Modulus of continuity
5.7 Entropy with bracketing and empirical norms
5.8 Notes
5.9 Problems and complements
6 Central Limit Theorems
6.1 Definitions
6.2 Sufficient conditions for a class to be P-Donsker
6.3 Useful theorems
6.4 Measurability
6.5 Notes
6.6 Problems and complements
7 Rates of Convergence for Maximum Likelihood Estimators
7.1 The main idea
7.2 An exponential inequality for the maximum likelihood estimator
7.3 Convex classes of densities
7.4 Examples
7.5 Notes
7.6 Problems and complements
8 The Non-I.I.D. Case
8.1 Independent non-identically distributed random variables
8.2 Martingales
8.3 Application to maximum likelihood
8.4 Examples
8.5 Notes
8.6 Problems and complements
9 Rates of Convergence for Least Squares Estimators
9.1 Sub-Gaussian errors
9.2 Errors with exponential tails
9.3 Examples
9.4 Notes
9.5 Problems and complements
10 Penalties and Sieves
10.1 Penalized least squares
10.2 Penalized maximum likelihood
10.3 Least squares on sieves
10.4 Maximum likelihood on sieves
10.5 Notes
10.6 Problems and complements
11 Some Applications to Semiparametric Models
11.1 Partial linear models
11.2 Mixture models
11.3 A single-indexed model with binary explanatory variable
11.4 Notes
11.5 Problems and complements
12 M-Estimators
12.1 Introduction
12.2 Estimating a regression function using a general loss function
12.3 Classes of functions indexed by a finite-dimensional parameter
12.4 Notes
12.5 Problems and complements
Appendix
References
Symbol Index
Author Index
Subject Index

Preface

This book is an extended version of a set of lecture notes, written for the

AIO course 'Applications of Empirical Process Theory', which was given in the spring of 1996 in Utrecht, The Netherlands. The abbreviation AIO stands for Assistent in Opleiding, which is the Dutch equivalent of PhD student. The course was intended for students with a Master of Science in mathematics or statistics. Nonparametric (infinite-dimensional) models provide a good alternative to the more classical parametric models, because fewer, and often more natural, assumptions are imposed. In practice, nonparametric methods can be computationally complex, but this is nowadays a minor drawback. This book investigates the theoretical (asymptotic) properties of nonparametric M-estimators, with the emphasis on maximum likelihood estimators and least squares estimators. It treats the different models and estimation procedures in a unifying way, by invoking the theory on empirical processes. The general theory is illustrated with numerous examples. We hope that the methods provided will show that nonparametric models are in fact as basic as the more classical parametric models. Empirical process theory has turned out to be a very valuable tool in asymptotic statistics. Applications include the bootstrap, the delta-method, and goodness-of-fit testing. In this book, we consider its applications to M-estimation, with special focus on maximum likelihood and least squares. We treat the latter two methods in great detail, including penalties, sieves, and semiparametrics.

The description of M-estimators in general is in


fact deferred to the very last chapter, where we show that, basically, the methodology allows direct generalization. We do not assume any a priori knowledge of empirical process theory, and treat the subject, not in an exhaustive way, but with the applications always in mind. Moreover, we do not state any measurability conditions, because the formulation of these would require too many digressions. For most of the results, a full proof (modulo measurability) is presented, except when it concerns the calculation of entropy. Here, only some illustrations are given, which are meant to provide the reader with some understanding of what entropy actually is. I am most grateful to my husband Toon, for the many fruitful discussions we had on the subject of this book, and for his very valuable suggestions on how to distinguish signal from noise.

Guide to the Reader

Chapters 2, 3, 5, 6 and 8 present empirical process theory, and the other chapters concern applications. Some of the material in this book is added only for completeness and can certainly be skipped at first reading. Chapter 1 provides an overview of the type of problems the book will address. Chapters 2 and 3 are essential for the rest of the book. Most of the results in Chapter 3 are along the lines of Pollard (1984, Chapter II). Chapter 4 treats consistency of estimators, and can be seen as a preparation for the more complicated theory on rates of convergence. Chapter 5 is one of the more technical chapters. The main result there is Theorem 5.11, which says that the increments of an empirical process behave as the integral of the (square root) entropy (with bracketing). Once this message is clear, most of what follows can be understood without knowing too many precise details. Chapter 6 contains some of the fundamental results on empirical processes. This chapter could of course not be excluded from the book, but it plays only a small role in the subsequent chapters. Chapter 7 derives rates of convergence for maximum likelihood estimators. Some of its examples are rather technical, whereas others are quite elegant. A reader might find some of the examples artificial, because certain constants that depend on the unknown parameter are assumed to be known a priori. But one has to keep in mind that these models serve as a preparation for the more realistic models of Chapter 10.

Section 8.1 of Chapter 8 is needed to make the step to independent, but not identically distributed variables. The section supplies the remainder of the technical tools necessary for the statistical applications. The rest of Chapter 8 is about dependent variables, and can be skipped without being penalized for that later on in the book. In Chapter 9, we consider rates of convergence for least squares estimators. Here, the results appear in their neatest form. A reader may choose to skip Chapters 6 and 7, and go immediately to Chapter 9 after consulting Section 8.1. The rest of the book contains some selected topics which can be, more or less, read separately. Chapter 10 considers penalties and sieves. Again, the results for the regression estimators are more transparent than those for maximum likelihood estimators. Chapter 11 consists of three independent parts. There are some links with the previous chapters here, because the results on rates of convergence play a prominent role in the derivation of asymptotic normality of certain functions of the curve estimators. In Chapter 12, we first summarize the general recipe we used so far only for least squares and maximum likelihood estimators. The idea there is that the methods carry over immediately to general M-estimators. It is possible to read Section 12.1 at any stage. This might help the reader to understand what the common underlying structure actually is. The last section of the book completes the circle: we started with a parametric model, and we also end with one. Each chapter concludes with a problems section, where we sometimes also complement the theory with some auxiliary results. The level of the problems varies considerably.

1 Introduction

...where a motivation to study uniform laws of large numbers and central limit theorems is given. The examples concern parametric and nonparametric models, and contain indications on how we plan to derive consistency, rates of convergence and asymptotic normality of estimators.

Let $X_1, X_2, \ldots$ be independent copies of a random variable $X$ with distribution $P$ on $(\mathcal{X}, \mathcal{A})$. If $\mathcal{X}$ is the real line, equipped with the Borel $\sigma$-algebra, then by the strong law of large numbers, the sample mean $\bar{X} = (1/n)\sum_{i=1}^n X_i$ of the first $n$ observations converges almost surely to the population mean $\mu = EX$, as $n \to \infty$. If moreover $X$ has finite variance $\sigma^2$, the central limit theorem states that for $n$ large, $\bar{X}$ is approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$. For a general measurable space $(\mathcal{X}, \mathcal{A})$, such results hold for the sample mean $(1/n)\sum_{i=1}^n g(X_i)$, where $g : \mathcal{X} \to \mathbf{R}$ is some (measurable) real-valued transformation. This observation will be our starting point.

Strong law of large numbers. If $Eg(X)$ exists, then
$$\frac{1}{n}\sum_{i=1}^n g(X_i) \to Eg(X), \quad \text{a.s.} \eqno(1.1)$$

Central limit theorem. If $\sigma_g^2 = \mathrm{var}(g(X))$ exists, then
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl(g(X_i) - Eg(X)\bigr) \to_{\mathcal{L}} \mathcal{N}(0, \sigma_g^2). \eqno(1.2)$$


For example, if we take $g$ as the indicator function $1_A$ of some set $A \in \mathcal{A}$, (1.1) states the convergence of the proportion of observations falling into the set $A$, to the probability of the set $A$. Result (1.2) is then the classical central limit theorem for Bernoulli random variables. The convergence in (1.1) and (1.2) holds for a fixed function $g$, but can be extended to hold for several $g$ simultaneously.

Consider a class $\mathcal{G} = \{g_\theta : \theta \in \Theta\}$ of functions on $\mathcal{X}$, indexed by a parameter $\theta$ in a metric space $\Theta$. Assume that $Eg_\theta(X)$ and $\mathrm{var}(g_\theta(X))$ exist for all $\theta \in \Theta$. We shall investigate to what extent (1.1) and (1.2) hold uniformly in $\theta \in \Theta$. Although the study of uniform convergence is of interest in itself, our motivation is given by the numerous applications in asymptotic statistics. Let $\{\hat\theta_n\}$ be some (possibly) random sequence in $\Theta$, converging to $\theta_0 \in \Theta$ (a.s. or in probability). We have in mind the situation where $\hat\theta_n$ is an estimator of $\theta_0$, based on the first $n$ observations. Uniform results would lead to something like a law of large numbers and a central limit theorem for $g_{\hat\theta_n}$. In the next section, we present three examples to illustrate the importance of this. The first example sets out with a simple parametric model, where one sees that extensions of (1.1) and (1.2) can be used to prove asymptotic normality of the maximum likelihood estimator. Next, we note that a parametric model is not always appropriate. To indicate how to proceed in nonparametric models, we place ourselves in a general context in Example 1.2. We indicate there how extensions of (1.1) play a role in the development of a general theory on consistency of maximum likelihood estimators. The last example in this chapter shows that also in nonparametric models, one can exploit extensions of (1.2) to arrive at asymptotic normality.
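Anticipating the definition given in Chapter 2 (see (2.1)), the uniform version of (1.1) reads
$$\sup_{\theta \in \Theta}\Bigl|\frac{1}{n}\sum_{i=1}^n g_\theta(X_i) - Eg_\theta(X)\Bigr| \to 0, \quad \text{a.s.};$$
in particular, such a result yields a law of large numbers along any (random) sequence $\{\hat\theta_n\}$ in $\Theta$.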

1.1. Some examples from statistics

Example 1.1. A binary choice model. Let $X_i = (Y_i, Z_i)$, with $Z_i$ the education of individual $i$, $Y_i = 1$ if individual $i$ has a job, and $Y_i = 0$ otherwise. Suppose that we asked $n$ individuals for their education and employment status. We want to model how the probability of having a job depends on education, and estimate the parameters in the model.

Case (i). The logit model is
$$P(Y = 1 \mid Z = z) = F_0(\alpha_0 + \theta_0 z), \quad \text{with } F_0(t) = e^t/(1 + e^t)$$
the distribution function of the logistic distribution. To avoid notational digressions, we assume that only the parameter $\theta_0 \in \mathbf{R}$ is unknown, and that $\alpha_0 = 0$.


Denote the maximum likelihood estimator of $\theta_0$ by $\hat\theta_n$, i.e. $\hat\theta_n$ is the value of $\theta$ that maximizes the (conditional) log-likelihood
$$\sum_{i=1}^n \log p_\theta(Y_i \mid Z_i),$$
where
$$p_\theta(y \mid z) = F_0^y(\theta z)\bigl(1 - F_0(\theta z)\bigr)^{1-y}.$$
To investigate the asymptotic behaviour of this estimator, we introduce the score function
$$l_\theta(y, z) = \frac{\partial}{\partial\theta}\log p_\theta(y \mid z) = z\bigl(y - F_0(\theta z)\bigr),$$
and let
$$g_\theta(z) = \begin{cases} \dfrac{F_0(\theta z) - F_0(\theta_0 z)}{\theta - \theta_0}\, z, & \text{if } \theta \ne \theta_0, \\[1ex] z^2 F_0(\theta_0 z)\bigl(1 - F_0(\theta_0 z)\bigr), & \text{if } \theta = \theta_0. \end{cases}$$
Note that as $\theta \to \theta_0$, $g_\theta(z) \to g_{\theta_0}(z)$ for all $z$. Lemma 1.1 below assumes that we have some type of law of large numbers for $g_{\hat\theta_n}$.

Lemma 1.1 Suppose that
$$\frac{1}{n}\sum_{i=1}^n g_{\hat\theta_n}(Z_i) \to_{\mathbf{P}} Eg_{\theta_0}(Z) := I_{\theta_0}, \eqno(1.3)$$
where $I_{\theta_0} > 0$. Then $\sqrt{n}(\hat\theta_n - \theta_0)$ is asymptotically $\mathcal{N}(0, 1/I_{\theta_0})$-distributed.

Proof Since $\hat\theta_n$ maximizes the likelihood, the derivative of the log-likelihood at $\hat\theta_n$ is zero:
$$\sum_{i=1}^n l_{\hat\theta_n}(Y_i, Z_i) = 0. \eqno(1.4)$$
Clearly,
$$\sum_{i=1}^n l_{\hat\theta_n}(Y_i, Z_i) = \sum_{i=1}^n l_{\theta_0}(Y_i, Z_i) - (\hat\theta_n - \theta_0)\sum_{i=1}^n g_{\hat\theta_n}(Z_i). \eqno(1.5)$$
Combine (1.4) and (1.5) to give
$$\sqrt{n}(\hat\theta_n - \theta_0) = \frac{\frac{1}{\sqrt{n}}\sum_{i=1}^n l_{\theta_0}(Y_i, Z_i)}{\frac{1}{n}\sum_{i=1}^n g_{\hat\theta_n}(Z_i)};$$


recall that $X_i = (Y_i, Z_i)$. Note that
$$El_{\theta_0}(X) = 0, \qquad \mathrm{var}\bigl(l_{\theta_0}(X)\bigr) = I_{\theta_0}. \eqno(1.6)$$
By the central limit theorem, the numerator is asymptotically $\mathcal{N}(0, I_{\theta_0})$-distributed, so the result follows immediately from assumption (1.3).

Assumption (1.3) corresponds to an extension of the law of large numbers. Another approach towards proving asymptotic normality is based on an extension of the central limit theorem. To formulate this idea, we introduce the quantities
$$m_\theta = El_\theta(X), \qquad \sigma_\theta^2 = \mathrm{var}\bigl(l_\theta(X)\bigr).$$
For each $\theta$, $\sum_{i=1}^n \bigl(l_\theta(X_i) - m_\theta\bigr)/\sqrt{n}$ converges weakly to a $\mathcal{N}(0, \sigma_\theta^2)$-distribution. If we consider $\sum_{i=1}^n \bigl(l_\theta(X_i) - m_\theta\bigr)/\sqrt{n}$ as a stochastic process indexed by $\theta$, one can think of condition (1.7) below as asymptotic continuity of this process at $\theta_0$.

We shall make use of the stochastic order symbol $o_{\mathbf{P}}(1)$. For example, $\hat\theta_n = \theta_0 + o_{\mathbf{P}}(1)$ means that $\hat\theta_n$ converges to $\theta_0$ in probability. For a formal definition, see Section 2.1. The general formulation of Lemma 1.2, for M-estimators of a finite-dimensional parameter, can be found in Section 12.3, Lemma 12.7.

Lemma 1.2 Suppose that
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl(l_{\hat\theta_n}(X_i) - m_{\hat\theta_n}\bigr) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl(l_{\theta_0}(X_i) - m_{\theta_0}\bigr) + o_{\mathbf{P}}(1), \eqno(1.7)$$
and that $\hat\theta_n = \theta_0 + o_{\mathbf{P}}(1)$. Then $\sqrt{n}(\hat\theta_n - \theta_0)$ is asymptotically $\mathcal{N}(0, 1/I_{\theta_0})$-distributed.

Proof Note first that $m_{\theta_0} = 0$, $\sigma_{\theta_0}^2 = I_{\theta_0}$, and moreover that
$$m_{\hat\theta_n} = -(\hat\theta_n - \theta_0)\bigl(I_{\theta_0} + o_{\mathbf{P}}(1)\bigr). \eqno(1.8)$$
Again, we rewrite the derivative of the log-likelihood at $\hat\theta_n$:


Using assumption (1.7), we see that
$$0 = \frac{1}{\sqrt{n}}\sum_{i=1}^n l_{\hat\theta_n}(X_i) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl(l_{\hat\theta_n}(X_i) - m_{\hat\theta_n}\bigr) + \sqrt{n}\, m_{\hat\theta_n} = \frac{1}{\sqrt{n}}\sum_{i=1}^n l_{\theta_0}(X_i) + o_{\mathbf{P}}(1) - \sqrt{n}(\hat\theta_n - \theta_0)\bigl(I_{\theta_0} + o_{\mathbf{P}}(1)\bigr),$$
where in the last step, we invoked $m_{\theta_0} = 0$ and (1.8). The result now follows from the central limit theorem for $l_{\theta_0}$.
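As a purely illustrative companion to Lemma 1.1 (not part of the original text), the following small simulation sketch fits the logit model of Case (i) by Newton's method and compares the estimator with the $\mathcal{N}(0, 1/(nI_{\theta_0}))$ approximation; the sample size, the design for $Z$ and the value $\theta_0 = 1$ are arbitrary choices made only for this illustration.

import numpy as np

# Illustration only: logit model P(Y=1|Z=z) = F0(theta0*z), F0(t) = e^t/(1+e^t).
rng = np.random.default_rng(0)
theta0, n = 1.0, 5000
Z = rng.normal(size=n)                       # any design with finite variance
Y = rng.binomial(1, 1.0/(1.0 + np.exp(-theta0*Z)))

theta = 0.0
for _ in range(25):                          # Newton steps on the log-likelihood
    F = 1.0/(1.0 + np.exp(-theta*Z))
    score = np.sum(Z*(Y - F))                # sum_i l_theta(Y_i, Z_i)
    info = np.sum(Z**2 * F*(1.0 - F))        # observed information, cf. g_theta
    theta += score/info

F0 = 1.0/(1.0 + np.exp(-theta0*Z))
I0 = np.mean(Z**2 * F0*(1.0 - F0))           # estimate of I_{theta0} = E g_{theta0}(Z)
print(theta, np.sqrt(1.0/(n*I0)))            # MLE and its approximate standard deviation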

Case (ii). In Case (i), we assumed a parametric model for the probability of a job, given education, but there appears to be no intrinsic reason why the probability of having a job depends on education in this specific way. All we know is for instance that the higher the education, the more likely it is to have a job. In that case the model would be
$$P(Y = 1 \mid Z = z) = F_0(z),$$
with $F_0$ any increasing function of $z$ satisfying $0 \le F_0(z) \le 1$. The parameter space is now $\Lambda = \{F : \mathbf{R} \to [0,1],\ F \text{ increasing}\}$. The maximum likelihood estimator $\hat F_n$ is that value of $F \in \Lambda$ that maximizes
$$\sum_{i=1}^n \Bigl(Y_i\log F(Z_i) + (1 - Y_i)\log\bigl(1 - F(Z_i)\bigr)\Bigr). \eqno(1.9)$$
It is no longer so easy to apply the argument that the derivative of the log-likelihood at $\hat F_n$ is zero. After all, what is a derivative in this situation? (See the first example in Section 11.2.3 for more details.) Moreover, how do we measure the distance between $\hat F_n$ and $F_0$? There are several possibilities here. In the next example, we shall indicate how one can prove consistency in the so-called Hellinger metric by employing a uniform law of large numbers. One can also think of using the $L_2(Q)$-distance, with $Q$ the distribution of $Z$:
$$\|\hat F_n - F_0\|_Q = \Bigl(\int \bigl(\hat F_n(z) - F_0(z)\bigr)^2\, dQ(z)\Bigr)^{1/2}.$$
It will be shown in Example 7.4.3 that $\|\hat F_n - F_0\|_Q = O_{\mathbf{P}}(n^{-1/3})$ (for an explanation of stochastic order symbols, see Section 2.1). The same rate holds true for the Hellinger metric. Compare this with Case (i), where the rate of convergence for $\hat\theta_n$ is $O_{\mathbf{P}}(n^{-1/2})$. So the price one has to pay for


not assuming the parametric model is that the rate of convergence is much slower. We shall present a quantification of this phenomenon. The 'size' or 'richness' of the parameter space will be measured by means of its entropy (see Section 2.3 for a definition), and this entropy will be used to calculate the rate of convergence (see for example Theorem 7.4).

Case (iii). One can think of many intermediate models between Case (i) and Case (ii). Here is an example. Suppose that the probability of having a job indeed increases with education, but the amount of increase is lower at higher education levels. This means that $P(Y = 1 \mid Z = z)$ is a concave function of $z$:
$$P(Y = 1 \mid Z = z) = F_0(z), \quad \text{with } F_0 \in \Lambda = \{F : \mathbf{R} \to [0,1],\ F \text{ increasing and concave}\}.$$
The log-likelihood is of the same form as in Case (ii), but we maximize it over a smaller parameter space $\Lambda$ in order to get the maximum likelihood estimator $\hat F_n$. It turns out that under regularity conditions on $Q$, the rate of convergence is $O_{\mathbf{P}}(n^{-2/5})$ (see Example 7.4.3). This is an improvement as compared to Case (ii), due to the fact that we assumed a parameter space with smaller entropy.

Example 1.2. Maximum likelihood. In this example, we study the maximum likelihood problem in general terms. Let $X$ have density $p_{\theta_0}(x)$, $\theta_0 \in \Theta$, with respect to a $\sigma$-finite measure $\mu$. Here $\Theta$ may be finite-dimensional (as was the case in Example 1.1, Case (i)) or infinite-dimensional (as in Example 1.1, Cases (ii) and (iii)). The maximum likelihood estimator $\hat\theta_n$ maximizes the log-likelihood $\sum_{i=1}^n \log p_\theta(X_i)$ over all $\theta \in \Theta$. The idea here is that the log-likelihood will be close to its expectation for large sample sizes. Because the expected log-likelihood is maximized by $\theta_0$, $\hat\theta_n$ is indeed a sensible estimator. What we need to turn this idea into a rigorous argument is an extension of the law of large numbers (1.1).

The estimator $\hat\theta_n$ maximizes the likelihood over $\theta \in \Theta$. Because $\theta_0 \in \Theta$, we therefore have
$$\sum_{i=1}^n \log p_{\hat\theta_n}(X_i) \ge \sum_{i=1}^n \log p_{\theta_0}(X_i). \eqno(1.10)$$
On the other hand, for all $\theta$,
$$E\log\frac{p_{\theta_0}}{p_\theta}(X) \ge 0 \eqno(1.11)$$
(see Problem 1.3), with equality if $\theta = \theta_0$. The quantity given in (1.11) is the Kullback-Leibler information
$$K(p_\theta, p_{\theta_0}) = \int \log\Bigl(\frac{p_{\theta_0}}{p_\theta}\Bigr)\, p_{\theta_0}\, d\mu.$$
Let
$$g_\theta = \log\frac{p_\theta}{p_{\theta_0}},$$
so that
$$K(p_\theta, p_{\theta_0}) = -Eg_\theta(X).$$
Then by (1.10),
$$\sum_{i=1}^n g_{\hat\theta_n}(X_i) \ge 0,$$
or
$$K(p_{\hat\theta_n}, p_{\theta_0}) \le \frac{1}{n}\sum_{i=1}^n g_{\hat\theta_n}(X_i) + K(p_{\hat\theta_n}, p_{\theta_0}). \eqno(1.12)$$
By the law of large numbers, for each $\theta$,
$$\frac{1}{n}\sum_{i=1}^n g_\theta(X_i) + K(p_\theta, p_{\theta_0}) \to 0, \quad \text{a.s.}$$
Suppose this is also true for the sequence $\{\hat\theta_n\}$:
$$\frac{1}{n}\sum_{i=1}^n g_{\hat\theta_n}(X_i) + K(p_{\hat\theta_n}, p_{\theta_0}) \to 0, \quad \text{a.s.} \eqno(1.13)$$
Then it follows from (1.12) that the Kullback-Leibler information converges to zero. The Kullback-Leibler information is not a distance function, but its convergence often implies consistency of $\hat\theta_n$ in some metric of interest. A convenient metric is the Hellinger metric, defined as
$$h(p_\theta, p_{\theta_0}) = \Bigl(\frac{1}{2}\int \bigl(p_\theta^{1/2} - p_{\theta_0}^{1/2}\bigr)^2\, d\mu\Bigr)^{1/2}.$$
The next lemma shows that convergence of the Kullback-Leibler information always yields consistency in the Hellinger metric.


Lemma 1.3 We have
$$h^2(p_\theta, p_{\theta_0}) \le \tfrac{1}{2} K(p_\theta, p_{\theta_0}).$$
Proof Use the fact that $\tfrac{1}{2}\log u \le u^{1/2} - 1$ for all $u > 0$:
$$\tfrac{1}{2} K(p_\theta, p_{\theta_0}) \ge 1 - E\Bigl(\frac{p_\theta^{1/2}}{p_{\theta_0}^{1/2}}\Bigr)(X).$$
Observe that
$$E\Bigl(\frac{p_\theta^{1/2}}{p_{\theta_0}^{1/2}}\Bigr)(X) = \int p_\theta^{1/2} p_{\theta_0}^{1/2}\, d\mu,$$
and since a density integrates to one,
$$1 - \int p_\theta^{1/2} p_{\theta_0}^{1/2}\, d\mu = \frac{1}{2}\int p_\theta\, d\mu + \frac{1}{2}\int p_{\theta_0}\, d\mu - \int p_\theta^{1/2} p_{\theta_0}^{1/2}\, d\mu = \frac{1}{2}\int \bigl(p_\theta^{1/2} - p_{\theta_0}^{1/2}\bigr)^2\, d\mu = h^2(p_\theta, p_{\theta_0}).$$

In summary, the maximum likelihood estimator $\hat\theta_n$ maximizes an empirical average, whereas $\theta_0$ maximizes the expectation. If averages converge to expectations in a broad enough sense, this implies consistency of $\hat\theta_n$. We remark here that (1.13) is sometimes difficult to prove and perhaps not true. However, a modification of the idea presented here can demonstrate consistency in Hellinger distance, although one then might lose the convergence of the Kullback-Leibler information (see Section 4.1). Also the rate at which the maximum likelihood estimator converges can be obtained along these lines. For each $\theta$,

$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl(g_\theta(X_i) + K(p_\theta, p_{\theta_0})\bigr) \to_{\mathcal{L}} \mathcal{N}(0, \sigma_\theta^2), \eqno(1.14)$$
provided $\sigma_\theta^2 = \mathrm{var}\bigl(g_\theta(X)\bigr) < \infty$. First of all, this implies that for each such $\theta$,
$$\frac{1}{n}\sum_{i=1}^n g_\theta(X_i) + K(p_\theta, p_{\theta_0}) = O_{\mathbf{P}}(n^{-1/2}).$$
If the same is true for the sequence $\{\hat\theta_n\}$:
$$\frac{1}{n}\sum_{i=1}^n g_{\hat\theta_n}(X_i) + K(p_{\hat\theta_n}, p_{\theta_0}) = O_{\mathbf{P}}(n^{-1/2}),$$
then (1.12), together with Lemma 1.3, would imply that $h(p_{\hat\theta_n}, p_{\theta_0}) = O_{\mathbf{P}}(n^{-1/4})$. In fact, one may expect that $\sigma_\theta \to 0$ as $h(p_\theta, p_{\theta_0}) \to 0$, and that in view of (1.14), $\sum_{i=1}^n \bigl(g_{\hat\theta_n}(X_i) + K(p_{\hat\theta_n}, p_{\theta_0})\bigr)/n$ converges with a rate faster than $O_{\mathbf{P}}(n^{-1/2})$. This would give $h(p_{\hat\theta_n}, p_{\theta_0}) = o_{\mathbf{P}}(n^{-1/4})$. We shall see that the rate of convergence ranges from $O_{\mathbf{P}}(n^{-1/2})$ for regular parametric models, to $o_{\mathbf{P}}(n^{-1/4})$ for moderately complex infinite-dimensional models. If the parameter space is even richer (has very large entropy), the rate can be even slower, or one may not have consistency at all. However, the power $1/4$ appears to be something like a critical point.

Example 1.3. Estimating the mean in the binary choice model.

Here is another illustration of the application of extensions of the classical central limit theorem. Let us return to the situation of Example 1.1, Case (ii). There, we have $P(Y = 1 \mid Z = z) = F_0(z)$, with $F_0$ an unknown increasing function satisfying $0 \le F_0 \le 1$. Suppose that we want to estimate the average probability of having a job. Write this as
$$\theta_0 = \int F_0(z)\, dQ(z) = EF_0(Z).$$
A candidate for estimating $\theta_0$ is of course the observed proportion of individuals with a job, $\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i$. We have
$$\sqrt{n}(\bar Y - \theta_0) \to_{\mathcal{L}} \mathcal{N}\bigl(0, \mathrm{var}(Y)\bigr),$$
with
$$\mathrm{var}(Y) = \int F_0(z)\bigl(1 - F_0(z)\bigr)\, dQ(z) + \mathrm{var}\bigl(F_0(Z)\bigr). \eqno(1.15)$$
Let $\hat F_n$ be as before the maximum likelihood estimator of $F_0$. Then $\frac{1}{n}\sum_{i=1}^n \hat F_n(Z_i)$ is also a good candidate for estimating $\theta_0$, but it turns out that this estimator is just $\bar Y$. Moreover, if the distribution $Q$ of $Z$ is completely unknown, then it is also the maximum likelihood estimator of $\theta_0$. We shall not prove these two statements here (see Section 11.2 for more details). Instead, let us see what we can gain if we assume $Q$ to be known. Since
$$\theta_0 = \int F_0(z)\, dQ(z),$$
the maximum likelihood estimator of $\theta_0$ is then


$$\hat\theta_n = \int \hat F_n(z)\, dQ(z).$$
More generally, we define
$$\theta_F = \int F(z)\, dQ(z).$$
Then for each $F \in \Lambda = \{F : \mathbf{R} \to [0,1],\ F \text{ increasing}\}$,
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl(F(Z_i) - \theta_F\bigr) \to_{\mathcal{L}} \mathcal{N}\bigl(0, \mathrm{var}(F(Z))\bigr).$$
Now, suppose that the process $\{\sum_{i=1}^n (F(Z_i) - \theta_F)/\sqrt{n} : F \in \Lambda\}$ is asymptotically continuous at $F_0$, in the sense that
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl(\hat F_n(Z_i) - \theta_{\hat F_n}\bigr) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl(F_0(Z_i) - \theta_0\bigr) + o_{\mathbf{P}}(1). \eqno(1.16)$$
This assumption can be seen as an extension of the central limit theorem (1.2).

Lemma 1.4 Under (1.16),
$$\sqrt{n}(\hat\theta_n - \theta_0) \to_{\mathcal{L}} \mathcal{N}\Bigl(0, \int F_0(z)\bigl(1 - F_0(z)\bigr)\, dQ(z)\Bigr). \eqno(1.17)$$
Proof Since $\frac{1}{n}\sum_{i=1}^n \hat F_n(Z_i) = \bar Y$, we have
$$\sqrt{n}(\hat\theta_n - \theta_0) = \sqrt{n}(\bar Y - \theta_0) - \frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl(\hat F_n(Z_i) - \hat\theta_n\bigr) = \sqrt{n}(\bar Y - \theta_0) - \frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl(F_0(Z_i) - \theta_0\bigr) + o_{\mathbf{P}}(1) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl(Y_i - F_0(Z_i)\bigr) + o_{\mathbf{P}}(1).$$
The result now follows from the classical central limit theorem.

So the asymptotic distribution of $\hat\theta_n$ has smaller variance than that of $\bar Y$. This is what we gain from knowing the distribution of $Z$.

1.2. Problems and complements

1.1. Verify (1.6) and (1.8). In fact, recall that if $\{p_\theta : \theta \in \Theta \subset \mathbf{R}\}$ is any sufficiently regular class, then for $l_\theta = \partial \log p_\theta/\partial\theta$ and $\dot l_\theta = \partial l_\theta/\partial\theta$, one has $-E\dot l_{\theta_0}(X) = \mathrm{var}\bigl(l_{\theta_0}(X)\bigr) = I_{\theta_0}$. The quantity $I_{\theta_0}$ is called the Fisher information.

1.2. Find an expression for the Hellinger distance $h(p_\theta, p_{\theta_0})$ when the densities are as in Example 1.1, Case (ii).

1.3. Let $Z$ be a positive random variable. Using the concavity of the log-function, we have by Jensen's inequality, $E(\log Z) \le \log(EZ)$. Use this to show that $K(p_\theta, p_{\theta_0}) \ge 0$.

1.4. Show that $\mathrm{var}(Y) = E\,\mathrm{var}(Y \mid Z) + \mathrm{var}\bigl(E(Y \mid Z)\bigr)$, and use this to check (1.15).

2 Notation and Definitions

Stochastic order symbols are introduced, averages and expectations are written as integrals, and a definition of entropy is given. In some examples a bound for the entropy is given.

2.1. Stochastic order symbols

Let $(\Omega, \mathcal{F}, \mathbf{P})$ be a probability space, $Z_n : \Omega \to \mathbf{R}$, $n = 1, 2, \ldots$, be a sequence of random variables, and $\{k_n\}_{n=1}^\infty$ be a sequence of positive numbers. We say that $Z_n = O_{\mathbf{P}}(k_n)$, if
$$\lim_{T \to \infty}\limsup_{n \to \infty} \mathbf{P}\bigl(|Z_n| > T k_n\bigr) = 0.$$
Then $Z_n/k_n = O_{\mathbf{P}}(1)$. We say that $Z_n = o_{\mathbf{P}}(k_n)$, if for all $\epsilon > 0$,
$$\lim_{n \to \infty} \mathbf{P}\bigl(|Z_n| > \epsilon k_n\bigr) = 0.$$
Then $Z_n/k_n = o_{\mathbf{P}}(1)$.
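A standard illustration, not spelled out in the text: if $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$ and finite variance $\sigma^2$, then Chebyshev's inequality gives
$$\mathbf{P}\bigl(|\bar X_n - \mu| > T n^{-1/2}\bigr) \le \sigma^2/T^2,$$
so $\bar X_n - \mu = O_{\mathbf{P}}(n^{-1/2})$; it is also $o_{\mathbf{P}}(1)$, but (by the central limit theorem) not $o_{\mathbf{P}}(n^{-1/2})$ unless $\sigma^2 = 0$.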


2.2. The empirical process

Let $X_1, X_2, \ldots$ be independent copies of a random variable $X$ in $(\mathcal{X}, \mathcal{A})$ with distribution $P$, and let $\mathcal{G} = \{g_\theta : \theta \in \Theta\}$ be a class of functions on $\mathcal{X}$. We say that $\mathcal{G}$ satisfies the uniform law of large numbers (ULLN), if
$$\sup_{\theta \in \Theta}\Bigl|\frac{1}{n}\sum_{i=1}^n g_\theta(X_i) - Eg_\theta(X)\Bigr| \to 0, \quad \text{a.s.} \eqno(2.1)$$
Let us, for the moment, denote the expected value of $g_\theta(X)$ as
$$m_\theta = Eg_\theta(X).$$
Clearly, if (2.1) holds, then for any sequence $\{\theta_n\} \subset \Theta$,
$$\Bigl|\frac{1}{n}\sum_{i=1}^n g_{\theta_n}(X_i) - m_{\theta_n}\Bigr| \to 0, \quad \text{a.s.}$$
Observe that $\theta_n$ may even be random in this case. But it is important to realize that $m_{\theta_n}$ is not the expectation of $g_{\theta_n}(X)$ if $\theta_n$ is random! Instead, since $m_\theta = \int g_\theta\, dP$ for each fixed $\theta$, we have $m_{\theta_n} = \int g_{\theta_n}\, dP$. This is the reason why we shall often express expectations as integrals in this book. The sample average can be seen as an empirical expectation, which can also be written as an integral. Let $P_n$ be the empirical distribution based on $X_1, \ldots, X_n$, i.e., for each set $A \in \mathcal{A}$,
$$P_n(A) = \frac{1}{n}\sum_{i=1}^n 1\{X_i \in A\}.$$
Thus, $P_n$ puts mass $\frac{1}{n}$ at each of the $X_i$, $1 \le i \le n$:
$$P_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i},$$
with $\delta_{X_i}$ a point mass at $X_i$, $i = 1, \ldots, n$. We may now use the notation
$$\int g\, dP_n = \frac{1}{n}\sum_{i=1}^n g(X_i).$$


The difference between average and expectation is
$$\frac{1}{n}\sum_{i=1}^n g(X_i) - Eg(X) = \int g\, d(P_n - P).$$
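As a small computational sketch, not taken from the book: for the class of indicators $g_t(x) = 1\{x \le t\}$ and $X$ uniform on $[0,1]$ (so that $\int g_t\, dP = t$), the quantities $\int g_t\, dP_n$ and $\int g_t\, d(P_n - P)$ can be computed directly, and $\sup_t |P_n - P|$ is the classical Kolmogorov-Smirnov statistic. The grid and sample size below are arbitrary choices for the illustration.

import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = rng.uniform(size=n)

t_grid = np.linspace(0.0, 1.0, 501)
Pn = np.array([np.mean(X <= t) for t in t_grid])   # P_n(g_t) = (1/n) sum_i 1{X_i <= t}
P = t_grid                                         # P(g_t) = t
print(np.max(np.abs(Pn - P)))                      # sup_t |int g_t d(P_n - P)|
print(np.sqrt(n) * np.max(np.abs(Pn - P)))         # sup_t |nu_n(g_t)|, cf. the empirical process below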

Let $\mathcal{G}$ be a class of functions on $\mathcal{X}$. Before, we indexed $\mathcal{G}$ by a parameter $\theta \in \Theta$. Such a parametrization usually comes up naturally in statistical applications. It is a way to describe the form of the functions in $\mathcal{G}$. We do not really need it here. In fact, our natural parameter space will be the class $\mathcal{G}$ itself. In the new formulation, we say that $\mathcal{G}$ satisfies the uniform law of large numbers (ULLN) if
$$\sup_{g \in \mathcal{G}}\Bigl|\int g\, d(P_n - P)\Bigr| \to 0, \quad \text{a.s.} \eqno(2.2)$$
Theorem 3.7 contains the conditions on $\mathcal{G}$ for (2.2) to hold. Observe that (2.2) implies that for any sequence $\{g_n\} \subset \mathcal{G}$,
$$\Bigl|\int g_n\, d(P_n - P)\Bigr| \to 0, \quad \text{a.s.}$$
It is not so straightforward to formulate something like a uniform central limit theorem. Let us first recall that by the classical central limit theorem, for each $g$ with $\mathrm{var}(g(X))$ finite,
$$\sqrt{n}\int g\, d(P_n - P) \to_{\mathcal{L}} \mathcal{N}\bigl(0, \mathrm{var}(g(X))\bigr).$$
Here, the variance of $g(X)$ can be expressed as
$$\mathrm{var}\bigl(g(X)\bigr) = Eg^2(X) - \bigl(Eg(X)\bigr)^2.$$
We need a handy notation for this in case we are dealing with a possibly random function $g_n$. Clearly, $Eg^2(X) = \int g^2\, dP$. This is the squared $L_2(P)$-norm of the function $g$, which we shall denote by $\|\cdot\|^2$. Thus
$$\|g\|^2 = \int g^2\, dP.$$


Furthermore, let us write
$$\sigma_g^2 = \int g^2\, dP - \Bigl(\int g\, dP\Bigr)^2.$$
Then for $g$ fixed, $\sigma_g^2$ is the variance of $g(X)$. But if $g_n$ is random, we do not have this type of interpretation of $\sigma_{g_n}^2$. One can regard
$$\Bigl\{\nu_n(g) = \sqrt{n}\int g\, d(P_n - P) : g \in \mathcal{G}\Bigr\}$$
as a stochastic process indexed by $\mathcal{G}$. It is called the empirical process. The classes of functions we have in mind may be rather large, such as the class of all increasing functions $g$ with $0 \le g \le 1$. Convergence in distribution of this process is treated in Chapter 6 (a uniform central limit theorem). In statistical problems, the important issue is often not so much the weak convergence, but rather the so-called asymptotic equicontinuity of the process (which is essentially a necessary condition for weak convergence). We say that $\{\nu_n(g) : g \in \mathcal{G}\}$ is asymptotically equicontinuous at $g_0 \in \mathcal{G}$ if for each (random) sequence $\{g_n\} \subset \mathcal{G}$ with $\|g_n - g_0\| = o_{\mathbf{P}}(1)$, we have
$$|\nu_n(g_n) - \nu_n(g_0)| = o_{\mathbf{P}}(1). \eqno(2.3)$$
Then,
$$\nu_n(g_n) \to_{\mathcal{L}} \mathcal{N}(0, \sigma_{g_0}^2), \quad \text{provided } \|g_n - g_0\| = o_{\mathbf{P}}(1).$$
In Chapter 5, we shall formulate conditions on $\mathcal{G}$ that guarantee the asymptotic equicontinuity of the empirical process indexed by $\mathcal{G}$. In particular, we shall investigate the modulus of continuity of the empirical process, i.e., the behaviour of $|\nu_n(g) - \nu_n(g_0)|$ as a function of $\|g - g_0\|$. The conditions are in terms of entropy, which is a measure of the complexity of $\mathcal{G}$. We already announced in Example 1.1 what the rates of convergence are in the cases considered there. The larger the parameter space, that is, the larger its entropy, the harder it will be to estimate the true state of nature. In empirical process theory, the effect of a large $\mathcal{G}$ is that the increments of the empirical process behave irregularly. The empirical process may even not be asymptotically continuous at all.

2.3. Entropy

One can define entropy for general metric spaces, but we shall restrict ourselves to classes of functions. Let $Q$ be a measure on $(\mathcal{X}, \mathcal{A})$ and
$$L_p(Q) = \Bigl\{g : \mathcal{X} \to \mathbf{R} : \int |g|^p\, dQ < \infty\Bigr\}, \quad 1 \le p < \infty.$$
For $g \in L_p(Q)$, write
$$\|g\|_{p,Q} = \Bigl(\int |g|^p\, dQ\Bigr)^{1/p}.$$


We refer to $\|\cdot\|_{p,Q}$ as the $L_p(Q)$-norm, or $L_p(Q)$-metric, and call $\|g_1 - g_2\|_{p,Q}$ the $L_p(Q)$-distance between $g_1$ and $g_2$. Actually, these are rather pseudo-norms and pseudo-distances, but we omit the term 'pseudo', although it is not our intention to identify equivalence classes.

Definition 2.1 (Entropy for the $L_p(Q)$-metric) Consider for each $\delta > 0$, a collection of functions $g_1, \ldots, g_N$, such that for each $g \in \mathcal{G}$, there is a $j = j(g) \in \{1, \ldots, N\}$, such that $\|g - g_j\|_{p,Q} \le \delta$. Let $N_p(\delta, \mathcal{G}, Q)$ be the smallest value of $N$ for which such a covering of $\mathcal{G}$ by balls with radius $\delta$ exists. Then $H_p(\delta, \mathcal{G}, Q) = \log N_p(\delta, \mathcal{G}, Q)$ is called the $\delta$-entropy of $\mathcal{G}$ for the $L_p(Q)$-metric.
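A minimal worked instance of Definition 2.1, added here for illustration (it is not one of the book's examples): take $\mathcal{G}$ to be the constant functions $\{g_c \equiv c : c \in [0,1]\}$ and let $Q$ be any probability measure. The grid points $c_j = (2j-1)\delta$, $j = 1, \ldots, N = \lceil 1/(2\delta)\rceil$, form a $\delta$-covering set, since $\|g_c - g_{c_j}\|_{p,Q} = |c - c_j| \le \delta$ for the nearest grid point. Hence
$$H_p(\delta, \mathcal{G}, Q) \le \log\bigl(1/(2\delta) + 1\bigr),$$
so the entropy of this one-parameter class grows only logarithmically as $\delta \downarrow 0$; the infinite-dimensional classes appearing later (monotone functions, smooth densities) have entropies growing like a power of $1/\delta$.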

2. Notation and Definitions

24 Then

N

;=i

which implies that

7=1 where

= {ghj :

J=

g G

Now, insert the result of

Problem 2.5, to find that

j=l

7=1

Let us now have a closer look at the above tail-condition on Q. Suppose that Q has density q with respect to Lebesgue measure, and that the Riemann

dx is finite. Then there is a Cq such that for all T,

integral

there is a partition {Bj = {xj-uXj]}f^i such that for Lj = xj — xj-i,

J^^2m/(2m+l)gl/(2m+D^5^.)
0 and R > 0,

iv(^,^n{||g-gollee)

d

Suppose now that
$$\frac{H(\delta, \mathcal{G}, P_n)}{n} \to_{\mathbf{P}} 0, \quad \text{for all } \delta > 0$$
(see Theorem 3.7). The latter condition is clearly implied by (3.1). If $\mathcal{G}$ is a finite class, then it is of course no problem to get results uniformly in $\mathcal{G}$. Now, a totally bounded class can be approximated by a finite class. The chaining technique is to apply finer and finer approximations. This works as follows. Suppose $\mathcal{G} \subset L_2(Q)$, and
$$\sup_{g \in \mathcal{G}}\|g\|_Q \le R. \eqno(3.2)$$
For notational convenience, we index the functions in $\mathcal{G}$ by a parameter $\theta \in \Theta$: $\mathcal{G} = \{g_\theta : \theta \in \Theta\}$. For $s = 0, 1, 2, \ldots$, let $\{g_j^s\}_{j=1}^{N_s}$ be a minimal $2^{-s}R$-covering set of $(\mathcal{G}, \|\cdot\|_Q)$. So $N_s = N(2^{-s}R, \mathcal{G}, Q)$, and for each $\theta$, there exists a $g_\theta^s \in \{g_1^s, \ldots, g_{N_s}^s\}$ such that $\|g_\theta - g_\theta^s\|_Q \le 2^{-s}R$. We use the parameter $\theta$ here to indicate which function in the covering set approximates a particular $g_\theta$. We may choose $g_\theta^0 \equiv 0$, since $\|g_\theta\|_Q \le R$. Then for any $S$,
$$g_\theta = \sum_{s=1}^S \bigl(g_\theta^s - g_\theta^{s-1}\bigr) + \bigl(g_\theta - g_\theta^S\bigr).$$
One can think of this as telescoping from $g_\theta^0$ to $g_\theta$, i.e. we follow a path taking smaller and smaller steps. Take $S$ sufficiently large, in such a way that $(g_\theta - g_\theta^S)$ is small enough for the purpose one has in mind. The term $\sum_{s=1}^S (g_\theta^s - g_\theta^{s-1})$ can be handled by exploiting the fact that as $\theta$ varies, it involves only finitely many functions.


3.3. A maximal inequality for weighted sums

In this subsection, we make an excursion to a non-i.i.d. case. As before, we let $\mathcal{G} = \{g_\theta : \theta \in \Theta\}$ be a class of functions on $\mathcal{X}$. Moreover, $(\xi_1, \ldots, \xi_n)$ is a set of points in $\mathcal{X}$, and $Q_n$ is the probability measure that puts mass $(1/n)$ on each $\xi_i$, i.e.,
$$Q_n = \frac{1}{n}\sum_{i=1}^n \delta_{\xi_i}.$$
(Later on, $Q_n$ will be the empirical measure $P_n$, and we shall work conditionally on the event $(X_1, \ldots, X_n) = (\xi_1, \ldots, \xi_n)$.)

Consider a random vector $W \in \mathbf{R}^n$. We assume an exponential probability inequality (in fact, a sub-Gaussian inequality) for weighted sums of the form $\sum_{i=1}^n W_i\gamma_i$; in the next subsection, we shall see under what conditions such a probability inequality indeed holds. The exponential probability inequality ensures that it is the logarithm of the number of functions in a covering set that governs the behaviour of empirical processes: simply note that if $\{Z_j : 1 \le j \le N\}$ satisfies for each $j$: $\mathbf{P}(Z_j > a) \le \exp[-a^2]$, then
$$\mathbf{P}\Bigl(\max_{1 \le j \le N} Z_j > a\Bigr) \le N\exp[-a^2].$$
Thus, we need to take $a$ larger than the square root of $\log N$, in order to have the right-hand side small. Lemma 3.2 below states a maximal inequality for weighted sums. The proof applies the chaining technique.

Our first application will be the derivation of ULLNs (see Section 3.6). For that particular application, we certainly do not use Lemma 3.2 in its full strength. A complete exploitation of the lemma will occur later on, for example when we prove asymptotic equicontinuity of the empirical process (see Theorem 5.3), or when we derive rates of convergence for least squares estimators (see Theorem 9.1). We remark moreover that Lemma 3.2 contains the main ideas for proving maximal inequalities. For example, the proof of Theorem 8.13, which concerns a maximal inequality for martingales, primarily uses the same approach, but the technical details needed there make it less transparent. Before presenting the lemma, let us pay some attention to the entropy integral in (3.4). Due to the chaining technique, we are confronted with $2^{-s}R$-covering sets, for $s = 1, \ldots, S$. The total size of these covering sets


should not be too large. It turns out that we need to control the weighted sum
$$\sum_{s=1}^S 2^{-s}R\, H^{1/2}\bigl(2^{-s}R, \mathcal{G}, Q_n\bigr).$$
(Indeed, this expression involves the square root of the logarithm of a number of functions.) We can replace this sum by an integral, provided such an integral is well-defined. Therefore, we assume in what follows that $H(\delta, \mathcal{G}, Q_n)$ is a continuous function of $\delta$. If this is not the case, replace $H(\delta, \mathcal{G}, Q_n)$ by its smallest continuous majorant. It is easy to see that for $R > \rho$ and $S = \min\{s \ge 1 : 2^{-s}R \le \rho\}$,
$$\sum_{s=1}^S 2^{-s}R\, H^{1/2}\bigl(2^{-s}R, \mathcal{G}, Q_n\bigr) \le 2\int_{\rho/4}^R H^{1/2}(u, \mathcal{G}, Q_n)\, du.$$
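To get a feeling for the size of such entropy integrals, here is a small worked evaluation that is not in the text. If the entropy behaves like $H(u, \mathcal{G}, Q_n) \le A/u$ (the order attained, for example, by uniformly bounded classes of monotone functions; the constant $A$ is hypothetical), then
$$\int_0^\delta H^{1/2}(u, \mathcal{G}, Q_n)\, du \le \int_0^\delta \sqrt{A/u}\, du = 2\sqrt{A\delta},$$
so the integral converges at $0$ even though $H$ itself blows up; for $H(u) \le A\log(1 + 1/u)$, a parametric-type order, the integral is of the still smaller order $\delta\sqrt{A\log(1 + 1/\delta)}$ up to constants.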


^Exl{Ag.} = \p{Ag-)

g*d(p„'-p)

- 2

M\I

g*d(Pn-P)

Therefore P

fsup [ gd{Pn-P) >5 Vge^ J 5

5/4

< 2P sup

.

i—1

i=l

So we arrive at the following corollary.

Corollary 3.4 Suppose $\sup_{g \in \mathcal{G}}\|g\| \le R$. Then for $n \ge 8R^2/\delta^2$,
$$\mathbf{P}\Bigl(\sup_{g \in \mathcal{G}}\Bigl|\int g\, d(P_n - P)\Bigr| > \delta\Bigr) \le 4\mathbf{P}\Bigl(\sup_{g \in \mathcal{G}}\frac{1}{n}\Bigl|\sum_{i=1}^n W_i g(X_i)\Bigr| > \delta/4\Bigr), \eqno(3.11)$$
where $(W_1, \ldots, W_n)$ is a Rademacher sequence (see (3.10)), independent of

$(X_1, \ldots, X_n)$.

3.5. Hoeffding's inequality

There remains the investigation of the exponential probability inequality (3.3).

Lemma 3.5 Let $Z_1, \ldots, Z_n$ be independent random variables with expectation zero. Suppose that $b_i \le Z_i \le c_i$ for some $b_i \le c_i$, $i = 1, \ldots, n$. Then for all $a > 0$,
$$\mathbf{P}\Bigl(\sum_{i=1}^n Z_i \ge a\Bigr) \le \exp\Bigl[-\frac{2a^2}{\sum_{i=1}^n (c_i - b_i)^2}\Bigr].$$
The proof can be found in Hoeffding (1963). We apply the inequality to $Z_i = W_i\gamma_i$, $i = 1, \ldots, n$, where $W_1, \ldots, W_n$ is a Rademacher sequence and $\gamma_1, \ldots, \gamma_n$ are constants. Then we may choose $c_i = -b_i = |\gamma_i|$, $i = 1, \ldots, n$.
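Spelling out this application (the display below is a routine consequence, not a numbered inequality of the book): with $c_i - b_i = 2|\gamma_i|$, Hoeffding's inequality gives, for all $a > 0$,
$$\mathbf{P}\Bigl(\sum_{i=1}^n W_i\gamma_i \ge a\Bigr) \le \exp\Bigl[-\frac{a^2}{2\sum_{i=1}^n \gamma_i^2}\Bigr],$$
which is exactly the sub-Gaussian behaviour, with variance proxy $\sum_{i=1}^n \gamma_i^2$, that the chaining argument of Section 3.3 requires for the weighted sums $\sum_{i=1}^n W_i g(\xi_i)$.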


3.6. Uniform laws of large numbers under random entropy conditions

We now have the necessary equipment for a uniform law of large numbers under conditions on the empirical entropy. We start out with a preliminary lemma, where the entropy is with respect to the $L_2(P_n)$-norm, and where one assumes that the functions in $\mathcal{G}$ are uniformly bounded.

Lemma 3.6 Suppose $\sup_{g \in \mathcal{G}}|g|_\infty \le R$, and that
$$\frac{1}{n}H(\delta, \mathcal{G}, P_n) \to_{\mathbf{P}} 0, \quad \text{for all } \delta > 0. \eqno(3.12)$$
Then $\mathcal{G}$ satisfies the ULLN.

Proof Martingale arguments show that $\sup_{g \in \mathcal{G}}|\int g\, d(P_n - P)|$ converges a.s. to some constant (see Pollard (1984)). So we only have to prove that $\sup_{g \in \mathcal{G}}|\int g\, d(P_n - P)|$ converges to zero in probability. Let $\delta > 0$ be arbitrary. By Corollary 3.3, for $n \ge 8R^2/\delta^2$,

P

/ 0,

so that ^ satisfies the ULLN. Proof The result follows from Theorem 2.4 if we can show that sup \g\oo 0-

t (6,0) 0 be arbitrary. Take, for each 0, pe in such a way that

j w(6,pe)dP < 9.

3.7.

39

Examples

Be = {9 : t (0,0) < pe} and let ,Be„ be a finite cover of 0. Define gf = goj — w{6j,pej), and gf = gej + w{9j,p0j), j = 1,... ,N. Then 0 < Jigf - gf) dP < 25 and for 9 e Bq j , gf [0,1], ^ = g'ix) exists for all x, TF(g') < l|,

satisfies the ULLN, by applying Lemma 3.13. (Hint: consider the convex hull of

=

{ky{x)

= (x — y)l(x > y}}.)

depending on n, with envelope

3.6. Consider a collection of functions

G„ = sup |gl.

> l,b„ = o{n^^^), we have

Suppose that for some sequence

—Hi{5,P„) -op 0, and lim sup / n-fco

for all 0,

G„dP = 0.

JG„>b„

Then sup ge^„

/

gd(P„-P)

—Op

0.

3.9. Problems and complements

45

(See van de Geer (1988, Lemma 2.3.3).) 3.7. Let

S

dn

— ^

^ Ij

j

— Ij--- 3^n ^ j

1=1 where A\^n, ■ ■ ■ ,A„^d„ forms a partition of 3^. Use the result of Problem 3.6 to obtain that for d„ = o{n).

sup

I

gd(P„-P)

3.8. Let P be Lebesgue measure on ([0,1],.^), with j/ the Borel cr-algebra. Verify that sup \P„{A) - P{A)\ = 1,

n > 1.

Aej^

A

4 First Applications: Consistency

In this chapter, we consider consistency of maximum likelihood estimators and least squares estimators, under entropy conditions. Suppose that $\mathcal{P}$ is a class of densities. We show that the ULLN for certain transformations of $\mathcal{P}$ implies consistency of the maximum likelihood estimator. If $\mathcal{P}$ is convex, one may use a transformation that is uniformly bounded. This is desirable, because it means that the envelope condition, necessary for a ULLN to hold, is satisfied. In the regression model, let $\mathcal{G}$ be the class of all regression functions allowed by the model. We do not assume a ULLN for $\mathcal{G}$, but apply the entropy methods directly to study consistency of the least squares estimator.

4.1. Consistency of maximum likelihood estimators

Let $X_1, \ldots, X_n, \ldots$ be i.i.d. with distribution $P$ on $(\mathcal{X}, \mathcal{A})$, and suppose that $P$ has density $p_0 \in \mathcal{P}$, where $\mathcal{P}$ is a given class of densities with respect to the $\sigma$-finite measure $\mu$. Let $\hat p_n$ be the maximum likelihood estimator of $p_0$, i.e.
$$\hat p_n = \mathop{\mathrm{arg\,max}}_{p \in \mathcal{P}} \int \log p\, dP_n.$$
Throughout, we assume that a maximizer $\hat p_n \in \mathcal{P}$ exists.


Our aim is to prove Hellinger consistency of $\hat p_n$, where the Hellinger distance between two densities $p_1$ and $p_2$ is defined as
$$h(p_1, p_2) = \Bigl(\frac{1}{2}\int \bigl(p_1^{1/2} - p_2^{1/2}\bigr)^2\, d\mu\Bigr)^{1/2}.$$
The factor $\frac{1}{2}$ is a convention, which ensures that $h(p_1, p_2) \le 1$ (Problem 4.1). The Hellinger distance describes the separation between two probability measures in a natural way, independent of a particular parametrization. But the main reason why we have chosen this distance function is because of its convenience when studying the maximum likelihood problem in a general setup. In Example 1.2, it was shown that
$$h^2(\hat p_n, p_0) \le \int_{p_0 > 0} \frac{1}{2}\log\frac{\hat p_n}{p_0}\, d(P_n - P). \eqno(4.1)$$
Therefore, we have consistency in Hellinger distance if the ULLN holds for the class
$$\Bigl\{\frac{1}{2}\log\frac{p}{p_0}\, 1\{p_0 > 0\} : p \in \mathcal{P}\Bigr\}. \eqno(4.2)$$
However, the ULLN needs the envelope condition, whereas in most realistic statistical models, the envelope of the class (4.2) is not in $L_1(P)$. For instance, in many cases, the densities in $\mathcal{P}$ do not stay away from zero, so that the log-densities can become minus infinity. The latter problem can be easily overcome. Later on, we shall also handle the case where the densities do not have an integrable envelope, for the special situation where $\mathcal{P}$ is convex. Consider the class
$$\mathcal{G} = \Bigl\{\frac{1}{2}\log\frac{p + p_0}{2p_0}\, 1\{p_0 > 0\} : p \in \mathcal{P}\Bigr\}.$$
The functions in $\mathcal{G}$ are bounded from below: $g(x) \ge -\frac{1}{2}\log 2$, for all $x \in \mathcal{X}$ and $g \in \mathcal{G}$. The next lemma presents a modification of (4.1). Because inequalities of this type will play a major role, a general terminology will help to highlight the conformity of various problems. We propose to call them Basic Inequalities. These are inequalities with on one side essentially the squared distance between the estimator and the true parameter, and on the other side an empirical process.
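In schematic form (this display is a paraphrase of the preceding sentence, not a numbered inequality of the book), a Basic Inequality looks like
$$d^2(\hat\theta_n, \theta_0) \le \int g_{\hat\theta_n}\, d(P_n - P),$$
with $d$ the relevant metric and $\{g_\theta : \theta \in \Theta\}$ a suitable transformation of the model: consistency then follows once the empirical process on the right-hand side is uniformly small, and rates of convergence follow once its increments can be controlled in terms of $d(\theta, \theta_0)$.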


Lemma 4.1 (Basic Inequality) We have
$$h^2\Bigl(\frac{\hat p_n + p_0}{2}, p_0\Bigr) \le \int \frac{1}{2}\log\frac{\hat p_n + p_0}{2p_0}\, 1\{p_0 > 0\}\, d(P_n - P). \eqno(4.3)$$
Proof Use the concavity of the log-function to find that
$$\log\frac{\hat p_n + p_0}{2p_0}\, 1\{p_0 > 0\} \ge \frac{1}{2}\log\frac{\hat p_n}{p_0}\, 1\{p_0 > 0\}. \eqno(4.4)$$
Thus,
$$0 \le \int_{p_0 > 0} \log\frac{\hat p_n}{p_0}\, dP_n \le 2\int_{p_0 > 0} \log\frac{\hat p_n + p_0}{2p_0}\, dP_n = 2\int_{p_0 > 0} \log\frac{\hat p_n + p_0}{2p_0}\, d(P_n - P) + 2\int_{p_0 > 0} \log\frac{\hat p_n + p_0}{2p_0}\, dP.$$
But (see also Lemma 1.3),
$$\int_{p_0 > 0} \log\frac{\hat p_n + p_0}{2p_0}\, dP \le -2h^2\Bigl(\frac{\hat p_n + p_0}{2}, p_0\Bigr),$$
and the result follows.

P + Po

Lemma 4.2 We have h^puPi) < ^h\pi,p2)-

(4.5) Moreover

h^{p,Po) ^ 16h^(p,po).

(4.6) Proof Observe that

, „l/2 Pi1/2 +P2 < -1/2 +P2 , -1/2 Pi

So 1-1/2

-1/2|

1

(p/"+pf\

(4.7) Ip / -P2 l"2/pl/2+pf This gives inequality (4.5).

J

Pi

P2

1/2

Pi

1/2

-P2

4.1. Consistency of maximum likelihood estimators

49

The second inequality follows in the same way, using
0} : p G

.

Recall that the necessary conditions for the ULLN to hold are that the empirical covering numbers of ^ do not grow exponentially fast in n, and that the envelope function (4.9)

G = suplg| g6^

is in Li(P). The following theorem summarizes the results. Theorem 4.3 Suppose that for the class ^ defined in (4.8), we have (4.10)

^Hi ( 0,

and (4.11)

G G Li(P).

Then h{pn, po) —> 0 almost surely. To verify conditions (4.10) and (4.11), we prove in Lemma 4.4 that it is enough to have that Hi,s(5,^,|Uo) < 00 for all 5 > 0, where (4.12)

dpo = \{po > 0}dp.

; We shall show below that, when calculating an upper bound for the entropy of one may replace it by a simpler class, namely ^/po = {“Hpo > 0},

p G

J)enote the envelope of the densities by Xl3)

q{x) = supp(x), X e pe^

.

50

4. First Applications: Consistency

Lemma 4.4 We have for all 5 > 0, (4-14)

Hi{5,%P„) < Hi{25,0>jp^, P„).

Moreover, if q e Li(po), then G e Li(P). Finally, if (4-15)

< CO

for all S >0,

one has h(p„,po) -* 0, a.s. Proof By elementary calculations (see Problem 4.2), for pi = (p^ +po)/2, andp2 = (P2+Po )/2,

El l{Po > 0}. log — - log — l{po > 0} < 2 Pi Po Po Po Po This yields (4.14). To prove the second assertion of the lemma, note that from (4.16) (4.16)

so that

^ I Po

I

Hpo > 0},

y^GdP < jqdpo.

We conclude that the ULLN for ^/po implies consistency in Hellinger distance (see also Problem 4.3). To verify this ULLN, we have to check the conditions of Theorem 3.7, which are implied by those of Lemma 3.1. Lemma 3.1 uses entropy with bracketing, instead of the empirical entropy. The first takes a simple form: for all ^ > 0,

(4-17)

^/po ,P) =

/iq ).

Thus, if < CO for all 5 > 0, then the maximum likelihood estimator p„ is consistent in Hellinger distance. There are still many situations where the conditions of Theorem 4.3 are not met, and where nevertheless the maximum likelihood estimator is consistent. In fact, one could expect that Theorem 4.3 is much too rough, because it only uses the following property of p„: (4.18)

j log Pn dP„ > J log Po dP„.

This inequality yielded the Basic Inequality of Lemma 4.1. By replacing po in the right hand side by some other density in one can prove consistency under less strong conditions. It is clear that one has a whole range of choices here, and it depends on the model which choice gives you the best results. In the remainder of this section we shall investigate the model with ^ a convex class. Then (p„ + po)/2 e which supplies us with an alternative Basic Inequality.

4.2. Examples

51

Lemma 4.5 (Basic Inequality) Suppose 0^ is convex. Then

f

(4.19)

h^(p„po) < J Pn+PO d{P. - P).

Proof We have s/log

-I

2pn Pn+PO

2p„ -ijdPn Pn + PO

dP„
h^{p„,po). Pn +P0

From Lemma 4.5, we infer that consistency is implied by the ULLN for the class ' , (4.20)

= I

: pG^\.

Ip + Po

J

This class is unifornaly bounded by 2, so its envelope is certainly in LfP). Hence, the following theorem holds. Theorem 4.6 Suppose 0 is convex. Assume moreover that (4.21)

P„) -^p 0,

for all S > 0.

Then h(p„,po) —>• 0 almost surely.

4.2. Examples Example 4.2.1. The binary choice model We return to Example 1.1. Let Y G {0,1} be a binary response variable, and let Z € R be the covariate. We have i.i.d. observations Xt = (Y^Z,), i = 1,2,..., of Z = (T,Z). The model is P{Y = l\Z=z) = Fgfz),

52

4. First Applications: Consistency

with 00 € © an unknown parameter. We assume throughout that Fq {z ) is an increasing function of z for each 0 G 0. Now, we can take p = (counting measure on {0,1}) x Q as dominating measure, where Q is the distribution of Z. In the notation of the previous section, the densities are then ^ = {peiy,z) = yFe{z) + (1 - y)(l - Fe{z)) : 0 g 0}. Because Fe{z) is a probability, we have 0 < Fe{z) < 1 for all 0 G 0. Therefore, the envelope q of the densities is bounded by 1 as well: q = supp < 1. Moreover, (2 is a probability measure, so p is a finite measure. Now apply (2.5) (or Lemma 3.8). The class ^ is essentially a (subset of a) class of uniformly bounded monotone functions. Therefore, Hb {5,0*,P.) < a \,

for all 5 > 0.

o Hence, by Lemma 4.4, (4.22)

h{p„,po)-^0,

a.s.

The squared Hellinger distance is in this case

hHpe,P6o) = l^ j {Pe^-Fli^y dQ + ^ j ({\ - FeY'^ - {I - Fe,Y/^y dQ. So (4.22) is somewhat stronger than consistency of Fg in the L2(2)-norm. The consistency result holds no matter what the further assumptions on Fb are. Perhaps there are no further assumptions, or we assume in addition that Fq {z ) is a concave function of z. A parametric model is also possible, for example the one of Example 1.1, Case (i). Case (i) Suppose

^9z

Fe{z)

=

1 + e'6z’

with 0 G R (0 = R). In that case, consistency in Hellinger distance implies consistency of 0„, because the Hellinger distance h{pe,pe^) has a unique minimum in 0 = 0q (unless we are in the degenerate case where Z concentrates on z = 0). Now that we have got this far, we might as well verify the assumption (1.3), which we used for proving asymptotic normality of 0„. Because 0„ is consistent, it stays (almost surely) in a compact set. Use Lemma 3.10, on functions th^^t are continuous in the parameter, with the

4.2. Examples

53

parameter varying within a compact set. The envelope condition there is satisfied if Z has finite variance. So the conclusion is now: suppose Z has finite variance, then

^ik-eo) Example 4.2.2. Estimating a smooth function Let p. be Lebesgue measure on [0,1] and let

Jo Here me {1,2,...} and M are given. The fact that f pdp = 1 implies in this case that |p|oo < for some constant K depending on m and M (apply the same arguments as in Lemma 3.9). Application of Theorem 2.4 yields p)
0.

So we find that h{p„, po) -> 0 almost surely. If po stays away from zero, say Po> ril > 0, this in turn implies consistency in the supremum norm of all lower-order derivatives, i.e.' sup Ip^f^(x) - Po*^(x)| ->a.s. 0,

/c = 0,..., m - 1.

xe[0,l]

See Lemma 10.9 for a proof. Observe that the consistency result also holds for the maximum likelihood estimator over any subclass of SP. In practice, one often uses a penalty, instead of restricting the density, which has the major advantage that one does not have to know a bound for f (Pq '”\x ))^ dx. We shall study the penahzed maximum likelihood estimator in Section 10.2. Another approach would be to use kernel estimators. Both penalized estimators and kernel estimators involve a tuning parameter, which has to be of the right order in order to get good asymptotic results. Example 4.2.3. Estimating a monotone density Let p be Lebesgue measure on [0,1] and ^ = {p is a decreasing density on [0,1]}.

Theorem 4.3 appears to be of no use here, since the densities in ^ may become arbitrarily large. However, one can exploit the convexity of ^ to arrive at consistency. To verify the entropy condition (4.21) of Theorem 4.6, note that (4.24)

54

4. First Applications: Consistency

is a class of decreasing functions. Moreover, if |poloo < co, the class in (4.24) is also uniformly bounded. Because for all 5 > 0, we obtain from (2.5) (or Lemma 3.8) that Hi B (S,

P) < a \,

for all ^ > 0.

0

Hence, under the assumption |po|oo < oo> we have /i(p„,po) 0, almost surely. We shall extend the result to possibly unbounded po in Example 7.4.2. There, we shall also consider the case with unbounded support. It is important to note that we used the convexity of ^ here. This means that if we replace the estimator p„ by the maximum likelihood estimator over a non-convex subset of consistency is no longer guaranteed by our results. Example 4.2.4, Mixture models Consider a random variable Y with unknown distribution Fq on Suppose that X},g-go)„< \ -^W?\{m>K}j

R.

For K —K{5,t]) sufficiently large, we find

i=l This means that we can truncate the measurement error at K. '

We find P(llg - goL > ^) < P(5 < \\gn - golU 0,

SO that q ^ Li(po)4.9. Consider the class of regression functions

Define y)k{z) = k = 1,... ,m,xp = (tpi,... ,xp^-iV and E„ = Jrpxp^ dQ„. Assuming that the eigenvalues of E„ stay away from zero, show that, under condition (4.26) on the errors, the least squares estimator g„ is || ■ ||„-consistent. 4.10. Let 7i,... ,Yn be independent real-valued random variables with EYj = ao for I = 1,..., Lyo«J, and £Y) = Pq for i = [yonj + We assume that (Zo, po and the change point yo are independent of n and completely unknown. Write g{i;a,P,y) = al{l < j < («?]}+ Pl{[ny\ + 1 < i < n}. Show that, under condition (4.26) on the errors Wi = Yi — EYi, the least squares estimator g„ = gn(",an,Pn,yn) is || • ||„-consistent. Verify that this implies consistency of (a„, p„, y„), provided that the parameters are identifiable, i.e. provided that «o ^ po and yo G (0,1). 4.11. Consider a regression model with



and

where 0 is a fixed bounded subset of Show that, under condition (4.26) on the errors, the least squares estimator is || • ||„-consistent. 4.12. Consider the regression model with ^ = {g = Ifl : D G where Si is the collection of all subsets of Suppose that go = 0, and that Wu, W„,... are i.i.d. copies of a random variable W, satisfying P(fF > |) > 0. Show that the least squares estimator is inconsistent.

5 Increments of Empirical Processes

Recall that the ULLN for ^ holds if its envelope G is in Li(P) and Hi{S,^,Pn)/n ->p Oforalld > 0. Now,ifG G L2{P) andH{d,^,Pn)IH{8) = Op(l), uniformly in S > 0, where H{S) is some (non-random) function of S satisfying fd du < oo, then the empirical process is asymptotically equicontinuous. The same conclusion holds if /q Hg (u,^,P)du < co. We also have a closer look at the increments of the empirical process, and at the ratio ||g|U/||g||. Consider Li.d. random variables Xu... ,Xn with distribution P on and let ^ c= L2(P) be a collection of functions. The empirical process indexed by ^ is

Vn =

|v„(g) = ^/nj g d{P„ - P) : g G

.

In Sections 5.1 and 5.5, we study asymptotic equicontinuity of this pro cess, under random entropy conditions and conditions on the entropy with bracketing respectively, assuming that Xi,... ,X„ are the first n of an infinite sequence of independent copies of a population random variable X. The other sections of this chapter include the situation of triangular arrays. To investigate the behaviour of the empirical process near some fixed function go G we consider a neighbourhood of go, given by

^(5) = {gG^ : l|g-goil a g6^(5)

< 4P

> 8(5^,

sup ■^\{w,g- go)„| > a/4 , gem)

where

Lemma 5.1 below is a special case of Lemma 3.2. Due to the conditioning on Xi,... ,X„, we are confronted with the empirical entropy, and moreover, with the empirical radius (5.2)

3„= sup i|g-goll„. gem)

Lemma 5.1 For 3„ given in (5.2) and

(5.3)

a>

we have (5.4)

p( sup

>

7

I Xi,... ,X„

If the envelope function G of ^ is square integrable, and if H{5,^,P„) = op{n) for all ^ > 0, then it follows from the ULLN that eventually, 3„ < 23 (Problem 5.1). Moreover, the ULLN then also gives that eventually H(u,^,Pn) < H{u/2,^,P) for all u > 0. However, the latter does not help to check (5.3), because (5.3) involves values of u that become smaller as n

5.1. Random entropy numbers and asymptotic equicontinuity

65

increases. Therefore, we shall impose later on the condition that for some non-random non-increasing function H{u), lim lim sup P ( sup

(5.5)

w>o

H{u,off(M,^,T„)/if(w) = Op(l). In many cases, one will in fact have that for some A fixed

i.e., that the second term in the right-hand side of (5.7) below vanishes. For instance, suppose a class of functions ^ satisfies Pollard’s uniform entropy condition sup H{u \\G\\q ,^, Q) < X{u), for all u > 0, where Jtmte is the class of all probability measures with finite support. Then H{u,^,Pn) < X{u/\\G\\„) and one can take H{u) = X{u/{2\\G\\), provided G G LiiP). Note that Pollard’s uniform entropy condition is for example met if ^ is a Vapnik-Chervonenkis subgraph class (see Theorem 3.11). 'Lemma 5.2 Suppose that ^ has envelope G G L2{P). Moreover, assume * that (5.5) holds. Then for each 5 > 0 Gxed (i.e., not depending on n), and for (5.6) we have (5.7)

lii 0

Proof The conditions G G L2(P) and (5.5) imply that suplllg-golln - llg-golll

0,

So eventually, for each fixed ^ > 0, sup ilg-golU 0, there exists a 5 > 0 such that (5.9) Proof Take A> 1 sufficiently large, such that 4C exp[-A] < |, and

Next, take rj

\g€^(S)

< 4C exp

64,4CV2(2^)1 25602^2

tj

+2

< 4C exp [-,4] + I < >/> where we used J(2S) > 25. Remark Because the conditions (5.5) and (5.8) do not depend on go, we have in fact shown that v„ is asymptotically equicontinuous at each go. Moreover,

5.2. Random entropy numbers and classes depending on n

67

the conditions then also hold with ^ replaced by {gi - g2 : gi, g2 e so that we may conclude that the process {v„(gi) — v„(g2) : gi,g2 S is asymptotically equicontinuous at zero: for all ?/ > 0 there exists a 5 > 0 such that lim sup P

sup

lv„(gi) - v„(g2)l >rj\ 0. However, it follows from Lemma 5.4 below that in the case ^ is uniformly bounded, we can apply Lemma 5.2 to sequences, provided 5„ does not converge to zero too fast (see Lemma 5.5). We assume that for each n we have independent observations Xu-■■ ,X„, with distribution P on Here, P and the space may also de•pend on n, although we do not express this in our notation. Let furthermore , ^ be a class of functions on (^, sf), possibly also depending on n. We do however assume that the functions are uniformly bounded by some constant independent of n, say (5.10)

SUpIg-goloo < 1ge^

In Lemma 5.2, we used the ULLN to show that for $\delta$ fixed, eventually $\|g - g_0\|_n \le 2\delta$ whenever $\|g - g_0\| \le \delta$. We now study the case where $\|g - g_0\| \le \delta_n$, with $\delta_n \to 0$ as $n \to \infty$. The result can be found in Pollard (1984, Lemma II.6.33). We briefly present the proof, because we need some of its ingredients in Lemma 5.6, which examines the ratio $\|g\|_n / \|g\|$.

Lemma 5.4  Let $H(u)$ be a given non-increasing function of $u > 0$. For $n\delta_n^2 \ge 2 H(\delta_n)$, we have
(5.11)   $P\Bigl( \sup_{g \in \mathcal{G}(\delta_n)} \|g - g_0\|_n > 8 \delta_n \Bigr) \le 4 \exp[-n\delta_n^2] + 4 P\bigl( H(u, \mathcal{G}, P_n) > H(u) \text{ for some } u > 0 \bigr) .$

Proof  Apply the randomization device of Pollard (1984, page 32). Let $X_{n+1}, \ldots, X_{2n}$ be an independent copy of $X_1, \ldots, X_n$, and let $e_1, \ldots, e_n$ be independent random variables, independent of $X_1, \ldots, X_{2n}$, with $P(e_i = 1) = P(e_i = 0) = \tfrac12$, $i = 1, \ldots, n$. Set $X_i' = X_{2i-1+e_i}$, $X_i'' = X_{2i-e_i}$, $i = 1, \ldots, n$, and let $P_n' = \frac1n \sum_{i=1}^n \delta_{X_i'}$, $P_n'' = \frac1n \sum_{i=1}^n \delta_{X_i''}$, and $P_{2n} = (P_n' + P_n'')/2$. Then
(5.12)   $P\Bigl( \sup_{g \in \mathcal{G}(\delta_n)} \|g - g_0\|_n > 8 \delta_n \Bigr) \le 2 \exp\bigl[ H(\sqrt{2}\,\delta_n, \mathcal{G}, P_{2n}) - 2 n \delta_n^2 \bigr] .$
Now, verify that for all $u > 0$, $H(\sqrt{2}\,u, \mathcal{G}, P_{2n}) \le H(u, \mathcal{G}, P_n') + H(u, \mathcal{G}, P_n'')$. The rest is standard.

We are now ready to present the extension of Lemma 5.2 to the case where the radius is allowed to decrease with $n$. Take a non-increasing function $H(u)$ of $u > 0$, such that the integral (of the square root) converges:

(5.14)   $\int_0^1 H^{1/2}(u)\,du < \infty ,$

and define

(5.15)   $J(\delta) = \int_0^{\delta} H^{1/2}(u)\,du \vee \delta .$

Lemma 5.5  For each $A > 0$ and for $n\delta_n^2 \ge 2 A H(\delta_n)$ and $H(8\delta_n) \to \infty$, we have

(5.16)   $\limsup_{n\to\infty} P\Bigl( \sup_{g \in \mathcal{G}(\delta_n)} |\nu_n(g) - \nu_n(g_0)| > 4 C A^{1/2} J(8\delta_n) \Bigr) \le \limsup_{n\to\infty} 5 P\Bigl( \sup_{u > 0} \frac{H(u, \mathcal{G}, P_n)}{H(u)} > A \Bigr) .$

Proof  Replace in Lemma 5.4 the function $H(u)$ by $A H(u)$, where $A > 0$ is arbitrary. Then we find
$$P\Bigl( \sup_{g \in \mathcal{G}(\delta_n)} \|g - g_0\|_n > 8 \delta_n \Bigr) \le 4 \exp[-n\delta_n^2] + 4 P\Bigl( \sup_{u > 0} \frac{H(u, \mathcal{G}, P_n)}{H(u)} > A \Bigr) .$$
Use Lemma 5.1 to arrive at
$$P\Bigl( \sup_{g \in \mathcal{G}(\delta_n)} |\nu_n(g) - \nu_n(g_0)| > 4 C A^{1/2} J(8\delta_n) \Bigr) \le C \exp\Bigl[ - \frac{A\, J^2(8\delta_n)}{64\, \delta_n^2} \Bigr] + 4 \exp[-n\delta_n^2] + 5 P\Bigl( \sup_{u > 0} \frac{H(u, \mathcal{G}, P_n)}{H(u)} > A \Bigr) .$$
The assumptions $n\delta_n^2 \ge 2 A H(\delta_n)$ and $H(8\delta_n) \to \infty$ imply that $J^2(8\delta_n)/\delta_n^2 \to \infty$ as well as $n\delta_n^2 \to \infty$. So the first two terms on the right-hand side converge to zero.

We conclude that if the empirical entropy behaves well, in the sense that for some $A > 0$ (which without loss of generality may be taken equal to one), and for some $H(u)$ satisfying (5.14),
$$\limsup_{n\to\infty} P\Bigl( \sup_{u > 0} \frac{H(u, \mathcal{G}, P_n)}{H(u)} > A \Bigr) = 0 ,$$
then it follows from Lemma 5.5 that indeed, roughly speaking, the increment of the empirical process $\nu_n(g)$ at $g_0$ behaves like $J(\delta_n)$ for $\|g - g_0\| \le \delta_n$. In many examples, a 'good' uniform bound
$$\sup_{Q \in \mathcal{Q}} H(u, \mathcal{G}, Q) \le H(u) , \quad \text{for all } u > 0 ,$$
holds. Here, $\mathcal{Q}$ is the collection of all probability measures on $(\mathcal{X}, \mathcal{A})$. By 'good' we mean that $H(u, \mathcal{G}, P_n)$ and $H(u)$ are of the same order in $u$. Such a 'good' bound is true for example when $\mathcal{G}$ is a uniformly bounded Vapnik-Chervonenkis subgraph class, or a class of uniformly bounded monotone functions, etc. Of course, since the empirical measure has finite support, we may weaken the uniform entropy bound somewhat by requiring only that it holds uniformly in all probability measures $Q$ with finite support.

5.3. Empirical entropy and empirical norms

This section presents an extension of Lemma 5.4. Let $X_1, \ldots, X_n$ be i.i.d. and let $\mathcal{G}$ be a class of functions, uniformly bounded by a constant independent of $n$, say $\sup_{g \in \mathcal{G}} |g|_\infty \le 1$.

We shall often make use of the following peeling device. Let $\tau : \mathcal{G} \to [\rho, R)$ be a map, where $\rho > 0$ and where possibly $R = \infty$. Let $\rho = m_0 < m_1 < \cdots < m_S = R$ be a strictly increasing sequence (if $R = \infty$, take $S = \infty$ as well and take $\lim_{s\to\infty} m_s = \infty$). Then $\mathcal{G}$ can be peeled off into
$$\mathcal{G} = \bigcup_{s=1}^{S} \mathcal{G}_s , \quad \text{where } \mathcal{G}_s = \{ g \in \mathcal{G} : m_{s-1} \le \tau(g) < m_s \} , \quad s = 1, \ldots, S .$$
So, if $Z_n(g)$ is a stochastic process indexed by $\mathcal{G}$, we have for any positive $a$,
(5.17)   $P\Bigl( \sup_{g \in \mathcal{G}} \frac{|Z_n(g)|}{\tau(g)} > a \Bigr) \le \sum_{s=1}^{S} P\Bigl( \sup_{g \in \mathcal{G}_s} |Z_n(g)| > a\, m_{s-1} \Bigr) .$

Lemma 5.6  Let $H(u)/u^2$ be a non-increasing function of $u > 0$. For $n\delta_n^2 \to \infty$ and $n\delta_n^2 \ge 2 A H(\delta_n)$ for all $n$, we have
(5.18)   $\limsup_{n\to\infty} P\Bigl( \sup_{g \in \mathcal{G}} \frac{\|g\|_n}{\|g\| \vee \delta_n} > 14 \Bigr) \le \limsup_{n\to\infty} 4 P\Bigl( \sup_{u > 0} \frac{H(u, \mathcal{G}, P_n)}{H(u)} > A \Bigr) .$

Proof  A symmetrization argument gives

$$P\Bigl( \sup_{g \in \mathcal{G}} \frac{\|g\|_n}{\|g\| \vee \delta_n} > 14 \Bigr) \le 2 P\Bigl( \sup_{g \in \mathcal{G}} \frac{\|g\|_{P_n'} - \|g\|_{P_n''}}{\|g\| \vee \delta_n} > 12 \Bigr) .$$
Here $P_n'$ and $P_n''$ are defined as in the proof of Lemma 5.4. But, using the peeling device,
(5.19)   $P\Bigl( \sup_{g \in \mathcal{G}} \frac{\|g\|_{P_n'} - \|g\|_{P_n''}}{\|g\| \vee \delta_n} > 12 \Bigm| X_1, \ldots, X_n \Bigr) \le \sum_{s=0}^{\infty} P\Bigl( \sup_{g \in \mathcal{G},\; \|g\| \vee \delta_n \le 2^{s+1} \delta_n} \|g\|_{P_n'} - \|g\|_{P_n''} > 6 \cdot 2^{s} \delta_n \Bigm| X_1, \ldots, X_n \Bigr) .$
If $n\delta_n^2 \ge 2 A H(\delta_n)$, then also $n u^2 \ge 2 A H(u)$ for all $u \ge \delta_n$, because $H(u)/u^2$ is a non-increasing function of $u > 0$. So we obtain
(5.20)   $P\Bigl( \sup_{g \in \mathcal{G}} \frac{\|g\|_{P_n'} - \|g\|_{P_n''}}{\|g\| \vee \delta_n} > 12 \Bigr) \le \sum_{s=1}^{\infty} 2 \exp\bigl[ - n\, 2^{2s} \delta_n^2 \bigr] + 2 P\Bigl( \sup_{u > 0} \frac{H(u, \mathcal{G}, P_n)}{H(u)} > A \Bigr) .$
Because $n\delta_n^2 \to \infty$, the first term on the right-hand side of (5.20) converges to zero.

Example: the ratio of empirical and theoretical measure over a Vapnik-Chervonenkis class  Take
$$\mathcal{G} = \{ 1_D : D \in \mathcal{D} \} ,$$
where $\mathcal{D}$ is a Vapnik-Chervonenkis class. Then clearly, $\mathcal{G}$ is a Vapnik-Chervonenkis subgraph class with envelope $G = 1$. Therefore, $H(\delta, \mathcal{G}, P_n) \le A \log(1/\delta)$ for $\delta > 0$ sufficiently small (see Theorem 3.11). For $g = 1_D$, we have $\|g\|^2 = P(D)$ and $\|g\|_n^2 = P_n(D)$. Take
$$\delta_n^2 = \frac{A \log n}{n} .$$
Then $2 H(\delta_n, \mathcal{G}, P_n) \le A (\log n - \log\log n - \log A) \le A \log n = n \delta_n^2$. So we find from Lemma 5.6,
$$\limsup_{n\to\infty} P\Bigl( \sup_{D \in \mathcal{D}} \frac{P_n(D)}{P(D) \vee (A \log n / n)} > 196 \Bigr) = 0 .$$
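The uniformity in this bound is easy to explore numerically. The following sketch is purely illustrative and not part of the original argument: it takes the Vapnik-Chervonenkis class of half-lines $D = [0, d]$, $P$ the uniform distribution on $[0,1]$, and the constant $A = 1$, all of which are assumptions made here for the simulation only. The observed maximum of $P_n(D)/(P(D) \vee \log n / n)$ stays bounded, far below the (non-sharp) constant $196$ in the display above.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_ratio(n, A=1.0, n_grid=1000):
    """Largest value of P_n(D) / (P(D) v A log n / n) over half-lines D = [0, d]."""
    x = np.sort(rng.uniform(size=n))          # sample from P = Uniform(0, 1)
    d = np.linspace(0.0, 1.0, n_grid)         # grid of half-lines D = [0, d]
    P_n = np.searchsorted(x, d, side="right") / n   # empirical measure P_n(D)
    P = d                                     # true measure P(D) = d
    floor = A * np.log(n) / n
    return np.max(P_n / np.maximum(P, floor))

for n in [100, 1000, 10_000, 100_000]:
    print(n, round(max_ratio(n), 3))
```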

5.4. A uniform inequality based on entropy with bracketing

As before, $\mathcal{G}$ is a class of functions on $(\mathcal{X}, \mathcal{A})$. Moreover, $X_1, \ldots, X_n$ are i.i.d. with distribution $P$ on $(\mathcal{X}, \mathcal{A})$. In this section, we derive probability inequalities for $n$ fixed.


In Sections 5.1 and 5.2, the object of study was the behaviour of the empirical process on $\mathcal{G}(\delta) = \{ g \in \mathcal{G} : \|g - g_0\| \le \delta \}$. One arrives at the same conclusions if one first formulates the results for the entire class $\mathcal{G}$ and then applies them with $\mathcal{G}$ replaced by $\mathcal{G}(\delta)$. This is the approach we adopt here.

5.4.1. Bernstein's inequality

When using conditions on the entropy with bracketing, one no longer needs the symmetrization device. This means also that Hoeffding's inequality is not available. Instead, we shall apply Bernstein's inequality. Lemma 5.7 below presents this inequality for $g$ fixed, and Theorem 5.11 extends it to a uniform inequality.

Lemma 5.7 (Bernstein's inequality.)  Suppose that $\int g\,dP = 0$ and
(5.21)   $\int |g|^m\,dP \le \frac{m!}{2}\, K^{m-2} R^2 , \quad m = 2, 3, \ldots .$
Then for all $a > 0$,
(5.22)   $P\bigl( \nu_n(g) \ge a \bigr) \le \exp\Bigl[ - \frac{a^2}{2 (a K n^{-1/2} + R^2)} \Bigr] .$

The proof is in Shorack and Wellner (1986). See also Lemma 8.9, where an extension to martingales is given. The condition (5.21) is equivalent to assuming that $|g|$ has an exponential moment. Later on, it will be convenient to use the quantity
(5.23)   $\rho_K^2(g) = 2 K^2 \int \bigl( e^{|g|/K} - 1 - |g|/K \bigr)\,dP , \quad K > 0 .$
We think of $\rho_K(g)$ as an extension of the $L_2(P)$-norm, and we call $\rho_K(g_1 - g_2)$ the Bernstein difference between $g_1$ and $g_2$. The idea is $e^x \approx 1 + x + x^2/2$ for $x$ small, so that
$$2 K^2 \bigl( e^{|g|/K} - 1 - |g|/K \bigr) \approx |g|^2$$
for $K$ large. From the Taylor expansion $e^x - 1 - x = \sum_{m=2}^{\infty} x^m/m!$, one immediately sees that if $\rho_K(g) \le R$, then (5.21) holds. On the other hand, (5.21) implies
$$\rho_{2K}^2(g) = 2 (2K)^2 \sum_{m=2}^{\infty} \int \frac{|g|^m}{m!\,(2K)^m}\,dP \le \sum_{m=2}^{\infty} \frac{R^2}{2^{m-2}} = 2 R^2 .$$

For bounded $g$, $g(X)$ possesses an exponential moment. In many situations, we shall indeed consider (uniformly) bounded functions, so let us put this observation in a lemma.

Lemma 5.8  Suppose that
(5.24)   $\int g\,dP = 0 , \quad |g|_\infty \le K \quad \text{and} \quad \|g\| \le R .$
Then for all $a > 0$,
$$P\bigl( |\nu_n(g)| > a \bigr) \le 2 \exp\Bigl[ - \frac{a^2}{4 (a K n^{-1/2} + R^2)} \Bigr] ,$$
i.e., we then have exponential tails.

If $X_1, \ldots, X_n$ are the first $n$ of an infinite sequence of i.i.d. random variables, and $g \in L_2(P)$, then roughly speaking, we indeed are in a situation where one can take $K = \sqrt{n}$. This is because
$$P\Bigl( \int |g|\, 1\{|g| > \sqrt{n}\}\,dP_n \ne 0 \Bigr) \le P\bigl( |g(X_i)| > \sqrt{n} \text{ for some } i \in \{1, \ldots, n\} \bigr) \le n P\bigl( |g(X_1)| > \sqrt{n} \bigr) \le \bigl\| g\, 1\{|g| > \sqrt{n}\} \bigr\|^2 \to 0 ,$$
as $n \to \infty$.
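Bernstein's inequality is easy to check by simulation for a single bounded function. The sketch below is only an illustration: the choices $g(x) = x - \tfrac12$ with $X \sim \mathrm{Uniform}(0,1)$, so that one may take $K = \tfrac12$ and $R^2 = \mathrm{var}(g) = \tfrac1{12}$ in (5.21), as well as the sample size, are assumptions made here and not part of the text. The Monte Carlo tail of $\nu_n(g)$ is compared with the one-sided bound (5.22) as reconstructed above.

```python
import numpy as np

rng = np.random.default_rng(1)

n, n_rep = 200, 20_000
K, R2 = 0.5, 1.0 / 12.0            # |g|_inf <= K and var(g) = R^2 for g(x) = x - 1/2

# nu_n(g) = sqrt(n) * integral of g d(P_n - P), with X ~ Uniform(0, 1) and E g(X) = 0
x = rng.uniform(size=(n_rep, n))
nu = np.sqrt(n) * (x - 0.5).mean(axis=1)

for a in [0.1, 0.2, 0.3, 0.4]:
    empirical = np.mean(nu > a)
    bernstein = np.exp(-a**2 / (2.0 * (a * K / np.sqrt(n) + R2)))
    print(f"a={a:.1f}  P(nu_n(g) > a) ~ {empirical:.4f}   Bernstein bound {bernstein:.4f}")
```

As expected, the bound holds with room to spare; it is not meant to be sharp.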

We now state a corollary of Lemma 5.7, which we generalize in Theorem 5.11.


Corollary 5.9  Suppose that
(5.26)   $\int g\,dP = 0 \quad \text{and} \quad \rho_K(g) \le R .$
Then for

aa) 0, ^(5) = {g€^: ilg-gol|0 such that

(5.38)

limsupP ( sup |v„(g) — Vn(go)| >q] L^, for some i G {1,... , n}) —>■ 0 as n —> 00, we obtain P (sup

Vge^

«/

I G>Lfn

The class {g — go : g e Hb {5,'^,P). Recall that for

ig-go)d{Pn-P)

0.

has 5-entropy with bracketing equal to

= {(g-go-J{g-go)dP)\{G fj 1

ygi.gie^. Ilgi—g2ll^^

/

(see also the remark following Theorem 5.3).

5.6. Modulus of continuity

In this section we investigate the behaviour of $|\nu_n(g) - \nu_n(g_0)|$ as a function of $\|g - g_0\|$. Consider a uniformly bounded class of functions, say
(5.39)   $\sup_{g \in \mathcal{G}} |g - g_0|_\infty \le 1 .$
Assume moreover that
(5.40)   $H_B(\delta, \mathcal{G}, P) \le A \delta^{-\alpha} , \quad \text{for all } \delta > 0 ,$
for some $0 < \alpha < 2$ and some constant $A$. Then
$$\int_0^{\delta} H_B^{1/2}(u, \mathcal{G}, P)\,du < \infty , \quad \text{for all } \delta > 0 .$$


Application of Theorem 5.11 to the class $\mathcal{G}(\delta) = \{ g \in \mathcal{G} : \|g - g_0\| \le \delta \}$ shows that at each $g_0$, the increments of the empirical process $\nu_n(g)$ for $\|g - g_0\| \le \delta$ behave like $\delta^{1-\alpha/2}$. Because the theorem holds for each $n$, we can investigate what happens if $\delta$ depends on $n$, and converges to zero as $n$ tends to infinity. We shall prove below that for $\delta_n = n^{-1/(2+\alpha)}$,
$$\sup_{g \in \mathcal{G},\; \|g - g_0\| > \delta_n} \frac{|\nu_n(g) - \nu_n(g_0)|}{\|g - g_0\|^{1-\alpha/2}} = O_P(1) .$$
Thus, if $\mathcal{G}$ is small, in the sense that $\alpha$ is small, then the modulus of continuity of $\nu_n$ is also small. The limiting case, where $|\nu_n(g) - \nu_n(g_0)|$ behaves like $\|g - g_0\|$, uniformly in $\|g - g_0\|$ not too near to zero, is reserved for the finite-dimensional situation. In some cases, the supremum norm decreases with the $L_2(P)$-norm. Let us capture this by assuming that for some constants $c_0 > 0$ and $0 \le \beta < 1$,
(5.41)   $\sup_{g \in \mathcal{G}(\delta)} |g - g_0|_\infty \le (c_0 \delta)^{\beta} , \quad \text{for all } 0 < \delta < 1 .$
This is not really an additional assumption, because (5.39) implies (5.41) with $\beta = 0$. But for $\beta > 0$, (5.41) allows us to push $\|g - g_0\|$ a little more towards zero in the modulus of continuity result.

Lemma 5.13  Assume (5.39), (5.40) and (5.41). For some constants $c$ and $n_0$ depending on $\alpha$, $\beta$, $c_0$ and $A$, we have for all $T \ge c$ and $n \ge n_0$,
(5.42)   $P\Bigl( \sup_{g \in \mathcal{G},\; \|g - g_0\| \le n^{-1/(2+\alpha-2\beta)}} \Bigl| \int (g - g_0)\,d(P_n - P) \Bigr| \ge T n^{-\frac{2-\beta}{2+\alpha-2\beta}} \Bigr) \le c \exp\Bigl[ - \frac{T n^{\frac{\alpha}{2+\alpha-2\beta}}}{c^2} \Bigr] .$
Moreover, for $T \ge c$, $n \ge n_0$,
(5.43)   $P\Bigl( \sup_{g \in \mathcal{G},\; \|g - g_0\| > n^{-1/(2+\alpha-2\beta)}} \frac{|\nu_n(g) - \nu_n(g_0)|}{\|g - g_0\|^{1-\alpha/2}} \ge T \Bigr) \le c \exp\Bigl[ - \frac{T}{c^2} \Bigr] .$

(5.43)

llg-goll>«

_

1

Mg) - v«(go)l —■■ > T ||g-gof ^

,

________

< cexp

Proof Replace ^ by ^(5) in Theorem 5.11, and take K = 4(c q 5)^, R = f25 and a = with Q = 2-flCoAoc^. Then (5.31) is satisfied for all 5 > n-i/(2+a-2^) Condition (5.32) is satisfied if we take n > no, no sufficiently large, and (5.33) is satisfied if we take Co sufficiently large. So for all S > fj-i/(2+a-2^)^ .yye obtain (5.44)

P

sup lv„(g) - v„(go)| > It Ci Cq ^5^ 5 j < Cexp ge^(5)

^

For S = „-i/(2+«-2^), (5.44) gives (5.42).

J

_CMf_ 16c M/


Now, let S = min{s > 1 :

< n-i/(2+a-2^)j ^nd apply the peeling

device. Then for T = |v>.(g) - v«(go)l

sup < y^P f ^

llg-goli 1-1

sup

> T

lv„(g) - v„(go)| > ^CiCo^(2

\ge«^(2-+>)

S

< ^ C exp S=1

^

Cl (2-^+1)-“’ 16C2c q ^

n J

r

< cexp

Ti

c2.

Lemma 5.13 can be extended to functions with uniformly bounded Bernstein difference and with generalized $\delta$-entropy with bracketing not necessarily a polynomial in $\delta$. We shall not pursue this here. Instead, we consider an extension in another direction. Suppose that $\mathcal{G}$ has infinite entropy, but that $\mathcal{G} = \bigcup_{M \ge 1} \mathcal{G}_M$, where each $\mathcal{G}_M$ has finite entropy. In particular, we assume that there is a map $I : \mathcal{G} \to [1, \infty)$, such that
$$\mathcal{G}_M = \{ g \in \mathcal{G} : I(g) \le M \} .$$
We think of $I(g)$ as the complexity or irregularity of the function $g$ (for example some Sobolev or Besov norm). Now, a space of irregular functions is a rich space: we allow the entropy of $\mathcal{G}_M$ to increase with $M$. We shall evaluate the behaviour of the empirical process $\nu_n(g)$, not only as a function of its increments, but also as a function of $I(g)$. A slight further extension of the previous lemma is that we shall formulate the increments of the empirical process in terms of a more general distance function $d(\cdot, \cdot)$, such that
(5.45)   $\|g - g_0\| \le d(g, g_0) , \quad \text{for all } g \in \mathcal{G} .$

The following conditions will be used: there exist constants $0 < \alpha < 2$, $0 \le \beta < 1$, $c_0 > 0$ and $A > 0$ such that for all $M \ge 1$,
(5.46)   $\sup_{g \in \mathcal{G}_M} d(g, g_0) \le c_0 M ,$
(5.47)   $H_B(\delta, \mathcal{G}_M, P) \le A M^{\alpha} \delta^{-\alpha} , \quad \text{for all } \delta > 0 ,$
and
(5.48)   $\sup_{g \in \mathcal{G}_M,\; d(g, g_0) \le \delta} |g - g_0|_\infty \le (c_0 \delta)^{\beta} M^{1-\beta} , \quad \text{for all } \delta > 0 .$

Lemma 5.14 Assume (5.45)-(5.48). Then, for some constants c and no depending on a, c q and A, we have for all T > c and n > no. (5.49)

l/(g-go)d(P„-_^^y.^-^

P ^ge^,

sup d(g,go)c,n> no. /CCAA

(5.50j

P I \

|v„(g)-V„(go)|

sup _ 1

\ge^, d{g,go)>n 2+»-2^ 7(g)

«_a \r®/ \ >T\n 2+“-2(' f

v«(g) - v«(go)l ^^^-5(g,go)

> T

M 2

< Cl exp

So, applying the peeling device once more,

|v»(g) - Vn(g0)l

> T

\ge^,d{g,g,)>r,-^I{g)

00

< Ep s=0

< ^ Cl s=0

sup d{

\ge», 7(g), d(g,go)>«“5+i^2»

exp

j’20+l)a

< cexp

|v«(g) - v„(go)l Ig’SOl

> T22

TM“ 2

c1


5.7. Entropy with bracketing and empirical norms

Recall that in Section 5.3, we formulated a uniform bound for the ratio $\|g\|_n / \|g\|$, based on the empirical entropy. Here, the aim is to derive such a bound from the entropy with bracketing, instead of the empirical entropy. We again assume that $\mathcal{G}$ is uniformly bounded, say
(5.51)   $\sup_{g \in \mathcal{G}} |g|_\infty \le 1 .$

Lemma 5.15  We have for all $a > 0$,
(5.52)   $P\bigl( \bigl| \|g\|_n^2 - \|g\|^2 \bigr| \ge a \bigr) \le 2 \exp\Bigl[ - \frac{n a^2}{8 (a + \|g\|^2)} \Bigr] .$

Proof  This lemma can be obtained from Bernstein's inequality (see Lemma 5.7): since $|g|_\infty \le 1$, we have $\int |g|^{2m}\,dP \le \|g\|^2$ for all $m \ge 1$, so that the moment condition (5.21) holds for $g^2 - \|g\|^2$, with $K$ and $R^2$ proportional to $1$ and $\|g\|^2$ respectively.

Lemma 5.16  Take $n\delta_n^2 \ge 2 H_B(\delta_n, \mathcal{G}, P)$, $n \ge 1$, and $n\delta_n^2 \to \infty$. Then for each $0 < \eta < 1$, we have (5.53)

IlglU _1

lim sup P

sup ge^. llgll>255„/»)

n—fco

llgll

>7

=

0.

Moreover, (5.54)

limsupPj

n->co

\ge'S,

sup

HlglU - llglll ^ 2

llgll^25^„

- 0. y

Proof Apply the peeling device. Take g G and suppose that (s — 1)5^ < llgll < s5„, where s € (1,2,...}, s > 2^/t]. Furthermore, let -1 < g^ < g < < 1, and llg^ - g^ll < K- Since

for every bounded and continuous function $f : \ell^\infty(\mathcal{G}) \to \mathbf{R}$,
$$E f(\nu_n) \to E f(\nu) .$$
The class $\mathcal{G}$ is then called a $P$-Donsker class.
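For the classical example $\mathcal{G} = \{ 1\{\cdot \le t\} : t \in [0,1] \}$ with $P$ the uniform distribution, the limit $\nu$ is the Brownian bridge, and $\sup_{g} |\nu_n(g)|$ is $\sqrt{n}$ times the Kolmogorov-Smirnov statistic. The following sketch is an illustration of the definition only (the sample sizes, grid and replication counts are arbitrary choices made here): it compares the simulated distribution of $\sup_t |\nu_n(1\{\cdot \le t\})|$ with that of the supremum of the absolute value of a Brownian bridge.

```python
import numpy as np

rng = np.random.default_rng(2)

def sup_empirical_process(n):
    """sup_t |nu_n(1{. <= t})| = sqrt(n) * KS statistic, for P = Uniform(0, 1)."""
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    return np.sqrt(n) * np.maximum(i / n - x, x - (i - 1) / n).max()

def sup_brownian_bridge(n_grid=1000):
    """sup_t |B(t)| for a Brownian bridge B on [0, 1], evaluated on a grid."""
    dt = 1.0 / n_grid
    w = np.cumsum(rng.normal(scale=np.sqrt(dt), size=n_grid))   # Brownian motion W(t)
    t = np.arange(1, n_grid + 1) * dt
    return np.abs(w - t * w[-1]).max()                           # B(t) = W(t) - t W(1)

n_rep = 2000
for n in [10, 100, 1000]:
    emp = np.array([sup_empirical_process(n) for _ in range(n_rep)])
    print("n =", n, " quantiles:", np.round(np.quantile(emp, [0.5, 0.9, 0.99]), 3))
bb = np.array([sup_brownian_bridge() for _ in range(n_rep)])
print("limit   quantiles:", np.round(np.quantile(bb, [0.5, 0.9, 0.99]), 3))
```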


In the next two sections, we present some main results on weak convergence of the empirical process, sweeping possible measurability problems under the carpet. Section 6.4 gives a brief outline of an elegant approach to the measurability issue.

6.2. Sufficient conditions for $\mathcal{G}$ to be $P$-Donsker

Theorem 6.1  Suppose that $(\mathcal{G}, \|\cdot\|)$ is totally bounded, and that for all $\eta > 0$ there exists a $\delta > 0$ such that
(6.2)   $\limsup_{n\to\infty} P\Bigl( \sup_{g_1, g_2 \in \mathcal{G},\; \|g_1 - g_2\| \le \delta} |\nu_n(g_1) - \nu_n(g_2)| > \eta \Bigr) < \eta .$
Then $\mathcal{G}$ is $P$-Donsker.

See Dudley (1984) for a proof of this theorem.

Theorem 6.2  Suppose $\mathcal{G}$ has envelope $G \in L_2(P)$, and that for some (non-random) non-increasing function $H(\delta)$ with
(6.3)   $\int_0^1 H^{1/2}(u)\,du < \infty ,$
we have
(6.4)   $\lim_{A\to\infty} \limsup_{n\to\infty} P\Bigl( \sup_{\delta > 0} \frac{H(\delta, \mathcal{G}, P_n)}{H(\delta)} > A \Bigr) = 0 .$
Then (6.2) holds, and (hence) $\mathcal{G}$ is $P$-Donsker.

Proof Apply Theorem 5.3 to the process {v„(gi -gf) : gj,g2 e ^}. Because H ( 0, H{28, 1/2, is a P-Donsker class. This follows from similar entropy calculations as in Problem 2.6.

7 Rates of Convergence for Maximum Likelihood Estimators

condition without bracketing. • V Y Consider i.i.d. random variables Xi,...

with distribution P, and

assume that where ^ is a given elass of densities p with respect to the 0, (7.3)

wM‘^lpf,P) =

The last expression in (7.3) is the 5/V2-entropy of 0> endowed with the Hellinger metric. The technical intermediate results in the next section make it possible to relax (7.1) and (7.2). Indeed, assumption (7.1) is problematic, because in many situations po can become arbitrarily small. Also the assumption of a uniform bound on a class of densities is quite severe. We shall drop (7.1), and replace in (7.2) the entropy with bracketing of (^^/^ 11 • lU). by the smaller entropy with bracketing of (^i^Ml • lU)- However, we point out


that, in the entropy with bracketing condition of Theorem 7.4, there is still hidden the assumption that $\mathcal{P}$ is, in a certain sense, uniformly bounded (see also Problem 7.1).

7.2. An exponential inequality for the maximum likelihood estimator

In Lemma 4.1, we proved the Basic Inequality for the maximum likelihood estimator. Our aim is to invoke Theorem 5.11 for $\mathcal{G} = \{ g_p : p \in \mathcal{P} \}$. Theorem 5.11 presents a uniform probability inequality for the empirical process, under assumptions on the generalized entropy with bracketing. In order to be able to apply it, we have to show that, among other things, the Bernstein difference (at zero) of the functions in $\mathcal{G}$ is uniformly bounded, i.e., for some $K$ and $R$,
(7.4)   $\sup_{p \in \mathcal{P}} \rho_K(g_p) \le R ,$
where

Exploiting the fact that gp is bounded from below, it turns out that (7.4) holds with X = 1 and R = 4 (see Lemma 7.2). We first state a preliminary lemma. Lemma 7.1 For x > —T, T > 0, 2(eW-l-|x|) 5„ P(Mp «,Po ) > 5) < cexp —^

(7.10)

Proof The combination of Lemma 4.1 and Lemma 4.2 shows that it suffices to prove that P

Vn(gp) - ^h^{p,p 0 I < cexp —^

sup \pe&>, h(p,po)>d/4

Let S = min{s : 2*+M/4 > 1}. Application of the peeling device (see Section 5.3) gives


0

(£'/l

. 0, (7.41)

P{Vi - Ui 0,

so that the rate is (7.42)

hiPn,Po) = Op{n

Although this is a faster rate than the one in (7.39), the result is actually worse, because under (7.41) the Hellinger metric is less strong.

«

t*

7.4. Examples

117

As in Case I, the obtained rates can be applied to prove asymptotic normality of certain functions of F„ (see Example 11.2.3d). Example 7.4.5. A convolution model Let variables on [0,1]. Suppose that Z has Lebesgue measure. The distribution Fq independent copies Xi,... ,X„ of X = Y the form

Y and Z be independent random a given density k with respect to of Y is unknown. We observe + Z. The density of A is then of

(7.43) with

F e A = {all distributions on [0,1]}. Note that ^ = {pp : T G A} is a convex class. In fact, we can write ^ = conv(JT), with Jf = {k{- — y) : y e [0,1]}. We shall apply the result of Ball and Pajor (Theorem 3.14) to calculate the appropriate entropy. Throughout, we assume that the density /o, with respect to Lebesgue measure, of Fq exists, and that for some constant c\ > 0, (•7.44)

— < |/o(y)| < Cl,

for all y e [0,1].

Case (i) Suppose that Z has the beta distribution with density (7.45) where jS > 0 is fixed. For simplicity we assume that the two parameters of the beta distribution are equal. The rate of convergence is determined by the behaviour of k near x = 0 and x = 1. Let C2, C3,... be suitable, strictly positive constants depending on ci and jff. One easily verifies that under condition (7.44), for 0 < x 1, PO(^) < C2X^+\

and Pq {x ) > Define G(x) = Then for 0 < x < 1,

C3X^+^

118

7. Rates of Convergence for Maximum Likelihood Estimators

so that

j

1

^

G\x)dP{x) < C2cl x^-^ dx < cl The same arguments can be applied to the case 1 < x < 2, so that one finds f\\x)dP{x) K and define for i = 1,2,..., TJ

T,)=l p \l

L and

2L2



^.^ = elz,l/x._i_|Zi|/L.

Note that {

Wjj^ }. We have

the compensator of { gU,.z. _ gt/i-i,L _

^exp

p Il

2L2

L

-1

and exp

L

2L2

exp[Z,/L] ^-l+py(2L2)

_ Wi^L + l+Zj/L “ 1 + Pl/{2L^)

_ Wi,L-pf,L/(2L^) + ZilL l + p2j(2L2)


It follows that E

exp

Zj

p Il

L

2V-

0, a>0 and R>0, (8.37)

P(M„ > fl A Pnjc P02) =

We write (8.50)

fi^(p0i,P02) = ^^^(P0i>P02)-

We have seen that in the i.i.d. case the generalized entropy with bracketing of the set of log-likelihood ratios (involving the convex combinations) is bounded by the entropy with bracketing of the class of densities, endowed with Hellinger metric. The same is true in the dependent case. Definition 8.2 (Hellinger entropy with bracketing) Let Nb {S,R,F) be the smallest (non-random) value of N such that there exists a collection { [pt’Pf] }yli’ Pi = {p[r---’Pkj) Pf = {Pi,p---’P^j)’ a pair of -measurable functions, i = 1,... ,n, j = 1,... ,iV, such that for all 0 € 0, there is a (non-random) j — j{9) € { 1,... , JV }, such that (i)

h

(ii)

p Ij

^ on { h{pe, Pe„) < i? } n F,

< Pi,e < p Yj , i = 1,..., n, on {h{pe, PSo)
i 6C'P(2"+i ^). So we may apply Corollary 8.3 to each P^. This gives s

s < ^Cexp s=0

„24s -2^4 4C222s +2^2

n

n5^ < c exp-----^


Under condition (A), P(l|w|l„ > cr) is small for all a Xj q , since it implies Ellwll^ < 2al) < exp

12K2 ■

Moreover, if the entropy integral converges, one may take a -> oo, in which case the second term on the right-hand side of (9.4) vanishes. 9.2. Errors with exponential tails In Corollary 8.8, we also proved a maximal inequality for {(w, g — go)« : g G ^„(E)}. The assumption there is that the tails of the error distribution decrease exponentially fast. We refer to this as assumption (B). (B)

max^2KjE

- 1 - 1 ITil/Ki) < ol

Clearly, (A) is stronger than (B). Note however that Corollary 8.8 needs the assumption that the functions involved are uniformly bounded, which is in general not true for a class of regression functions. To handle this problem we introduce a class of renormalized functions. Define for L„ > 0, i ____I—^------ : g G^i: ^n = \ 1-b L„|lg — golU j and ^n{S) = {f€^n:

ll/IU < 5}-

Moreover, let (9.5)

Jb {S,^n{S),Qn) =

f

(u,#-„(5),Qn) duy 5.

where ci is a constant depending on cp and Ki (appropriately chosen) and where d 5 > d„, (9.8)

P(llg«-goL > 2 C2, Tn~^/^ < 1, (9.17)

r

P(l|g„ -golU > Tn-^'^) < C2exp------^ .


Note that for g G g = gl + g2, with gi(z) = oap(z), xp{z) = 1, and (g2, tp)n = 0. So, from the same arguments as in the previous example, we know that |a„ — «o| = \ J{gn ~ go) dQ„\ converges with rate Example 9.3.4. Functions of bounded variation in Suppose that z,- = {uk,vi), i = Id, k = 1,... ,«i, I = 1,... ,ri2, n = ni«2, with ui < ••• < u„^, vi < ■ • ■ < v„2- Consider the class ^ = {g :R2-^R, I(g)r), and

From Theorem 3.14, we now deduce that H(5, A, Q„)
(2Ap)-‘ geBl^

as an estimator of y. We shall not pursue this here. An alternative is to use the number of non-zero coefficients in the wavelet decomposition as a penalty. The arguments to prove that then the estimator


adapts to the amount of smoothness $\gamma$ when $\gamma$ is unknown (in the sense of obtaining the rate corresponding to $\gamma$) are similar to the ones used in Theorem 10.2. We shall not present any details here. Lemma 10.12 sketches the idea in an undressed context.

10.2. Penalized maximum likelihood

For each $n \ge 1$, let $X_1, \ldots, X_n$ be independent random variables with distribution $P$ on $[0,1]$ and let $p_0 = dP/d\mu$ be the density with respect to the Lebesgue measure $\mu$. The distribution $P$ (and hence the density $p_0$) may depend on $n$, but we assume throughout that for some $\eta_0$ independent of $n$,
(10.30)   $p_0 \ge \eta_0^2 .$
We also assume that $p_0$ is a member of the class
$$\mathcal{P} = \Bigl\{ p : [0,1] \to [0,\infty) : \int_0^1 p(x)\,dx = 1 ,\; I^2(p) < \infty \Bigr\} ,$$
where $I(p)$ measures the complexity of the density $p$. The penalized maximum likelihood estimator $\hat{p}_n$ is defined as
(10.31)   $\hat{p}_n = \arg\max_{p \in \mathcal{P}} \Bigl( \int \log p\,dP_n - \lambda_n^2 I^2(p) \Bigr) .$
Here, $\lambda_n$ is called the smoothing parameter.
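As a concrete toy version of (10.31), one can restrict attention to densities that are piecewise constant on a fixed partition of $[0,1]$ and penalize a crude discrete roughness measure of the log-heights. Everything in the sketch below (the data, the number of bins, the discrete penalty and the value of $\lambda_n^2$) is an ad hoc assumption made for illustration; it is meant only to show the structure of the penalized optimization, not the estimator studied in this section.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.beta(2, 5, size=300)           # observations on [0, 1] (hypothetical data)

n_bins = 20
edges = np.linspace(0.0, 1.0, n_bins + 1)
width = 1.0 / n_bins
counts = np.histogram(x, bins=edges)[0]
lam2 = 1e-3                            # lambda_n^2, the smoothing parameter (ad hoc choice)

def neg_objective(theta):
    # density p is piecewise constant with heights exp(theta_k) / normalizing constant
    log_heights = theta - np.log(np.sum(np.exp(theta)) * width)   # log p on each bin
    loglik = np.dot(counts, log_heights) / counts.sum()           # int log p dP_n
    penalty = np.sum(np.diff(theta) ** 2) / width                 # discrete roughness of log p
    return -(loglik - lam2 * penalty)

theta_hat = minimize(neg_objective, np.zeros(n_bins), method="L-BFGS-B").x
p_hat = np.exp(theta_hat) / (np.sum(np.exp(theta_hat)) * width)
print("estimated heights:", np.round(p_hat, 2), " integral:", round(p_hat.sum() * width, 6))
```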

Recall the notation
$$g_p = \frac12 \log\Bigl( \frac{p + p_0}{2 p_0} \Bigr) , \quad p \in \mathcal{P} .$$

Lemma 10.5 (Basic Inequality)  Let $\hat{p}_n$ be given by (10.31). We have
(10.32)   $h^2(\hat{p}_n, p_0) + 4 \lambda_n^2 I^2(\hat{p}_n) \le 16 \int g_{\hat{p}_n}\,d(P_n - P) + 4 \lambda_n^2 I^2(p_0) .$

Proof  By the definition of $\hat{p}_n$,
$$\int \log \frac{\hat{p}_n}{p_0}\,dP_n - \lambda_n^2 I^2(\hat{p}_n) \ge - \lambda_n^2 I^2(p_0) ,$$
so, using $4 g_{\hat{p}_n} \ge \log(\hat{p}_n/p_0)$ and Lemma 4.2,
$$16 \int g_{\hat{p}_n}\,d(P_n - P) - 4 \lambda_n^2 I^2(\hat{p}_n) \ge -16 \int g_{\hat{p}_n}\,dP - 4 \lambda_n^2 I^2(p_0) \ge h^2(\hat{p}_n, p_0) - 4 \lambda_n^2 I^2(p_0) .$$


In Subsections 4.2.2 and 7.4.1, we assumed that the density $p_0$ is known to lie in a Sobolev class. In the literature on penalized maximum likelihood estimation, it is more common to assume that the log-density lies in a Sobolev class (see for example Barron and Sheu (1991)). We shall consider both situations here.

10.2.1. Roughness penalty on the density

In this subsection, we take
(10.33)   $I^2(p) = \int_0^1 \bigl( p^{(m)}(x) \bigr)^2\,dx ,$
where $m \ge 1$ is a fixed integer. It follows from Theorem 2.4 and (10.30) that for all $M \ge 1$,
$$H_B\bigl( \delta, \{ g_p : p \in \mathcal{P},\; I(p) \vee I(p_0) \le M \}, P \bigr) \le A M^{1/m} \delta^{-1/m} , \quad \delta > 0 ,$$
for some constant $A$. Apply Lemma 5.14 with $\alpha = 1/m$, $\beta = 0$, and $d(g_p, g_{p_0}) = h(p, p_0)$. Then we find

Note that (10.34) only presents the modulus of continuity for $h(p, p_0)$ sufficiently far away from zero. This is a consequence of the fact that we do not have sub-Gaussian tail behaviour. Combination of (10.34) and (10.35) with Lemma 10.5 gives a rate of convergence for the penalized maximum likelihood estimator.

Theorem 10.6  Let $\hat{p}_n$ be defined by (10.31), with $I(p)$ given in (10.33). Then
(10.36)   $h(\hat{p}_n, p_0) = O_P(\lambda_n) (1 + I(p_0)) ,$
and
(10.37)   $I(\hat{p}_n) = O_P(1) (1 + I(p_0)) ,$

provided that (10.38)

2-i = Op(n5Sr)(l+/(po))2

Proof Let us use the shorthand notation h = h{p„,po), and I =I{Pn), l0=I{P0). Case (i) Suppose h > n~^{l +/ +/o), and / > 1 + /q . Then from (10.34), either

P + 2lP
be a decreasing collection of linear spaces, with (10.62)

d„j, = dim(^„,fc) =

k = 0,..., s.

Write for g € ^„_o,

dn{g) = min{d„j, : g

e

Suppose that for some fixed m and fixed g* G (10-63)

we have

\\g:-go\\n Tj

^

< ^ c exp k=0 Let A„ be the event

$F : \mathbf{R} \to \mathbf{R}$ is a given link function, $\theta_0 \in \mathbf{R}^{d_1}$ is an unknown parameter, and $\gamma_0$ is an unknown function in a given class of smooth functions. Thus, the model has two parts: a parametric part containing the finite-dimensional parameter $\theta_0$, and a non-parametric part containing the infinite-dimensional parameter $\gamma_0$. To ease the notation, we take $d_1 = d_2 = 1$. Notice that in principle, the range of $\theta_0^T u + \gamma_0(v)$ is not restricted, whereas the range of the response variable $Y$ often is. For instance, think of the situation where $Y$ is a binary variable (for example, yes/no answers to a questionnaire). Then $p_0(z)$ can only take values between 0 and 1. The link function $F$ allows one to take this into account. We shall assume that $\gamma_0$ is smooth, in the sense that $I(\gamma_0) < \infty$, where $I(\gamma)$ denotes the roughness of the function $\gamma$. Using the quasi-likelihood method, with a penalty on the roughness, one obtains estimators of $\theta_0$ and $\gamma_0$. We shall first derive a global rate of convergence for these estimators, and then prove asymptotic normality of the estimator of the parametric component $\theta_0$. In both stages, results from empirical process theory are inserted: the modulus of continuity of the empirical process in the first stage, and asymptotic equicontinuity in the second stage. The general definition of the penalized quasi-likelihood estimator is given in Subsection 11.1.3. We shall study two special cases: partial splines in Subsection 11.1.1, and the partially linear binary choice model in Subsection 11.1.2.

Throughout, we use the following notation. Take $\mathcal{G}$ to be the class of all regression functions $g$ of the form $g(u,v) = \theta u + \gamma(v)$. The distribution of $Z$ is denoted by $Q$, and $Q_n$ is the empirical distribution of $Z_1, \ldots, Z_n$. Write
$$\|g\|^2 = \int g^2\,dQ \quad \text{and} \quad \|g\|_n^2 = \int g^2\,dQ_n .$$
For a function $\gamma$ depending on $v$ only, we use the same notation, i.e.,
$$\|\gamma\|^2 = \int \gamma^2(v)\,dQ(u,v) \quad \text{and} \quad \|\gamma\|_n^2 = \int \gamma^2(v)\,dQ_n(u,v) .$$
We shall moreover write
(11.1)   $I^2(g) = I^2(\gamma) = \int_0^1 \bigl( \gamma^{(m)}(v) \bigr)^2\,dv , \quad g(u,v) = \theta u + \gamma(v) .$

Define
$$\psi_0(u,v) = u , \qquad \psi_k(u,v) = v^{k-1} , \quad k = 1, \ldots, m ,$$
$$\psi(z) = \bigl( \psi_0(z), \ldots, \psi_m(z) \bigr)^T ,$$
and let
$$\Sigma = \int \psi \psi^T\,dQ .$$

11.1.1. Partial splines

The model in this subsection is $Y = g_0(Z) + W$, with $E(W \mid Z) = 0$, and with
$$g_0(Z) = \theta_0 U + \gamma_0(V) , \qquad Z = (U, V) .$$
As in Subsection 10.1.1, we define the penalized least squares estimator as
(11.2)   $\hat{g}_n = \arg\min_{g \in \mathcal{G}} \Bigl( \frac{1}{n} \sum_{i=1}^n \bigl( Y_i - g(Z_i) \bigr)^2 + \lambda_n^2 I^2(g) \Bigr) ,$
where $\mathcal{G}$ is now the class of all regression functions $g$ of the form $g(u,v) = \theta u + \gamma(v)$, and where $I^2(g) = I^2(\gamma) = \int_0^1 (\gamma^{(m)}(v))^2\,dv$, $g(u,v) = \theta u + \gamma(v)$. Now, assume that for all $z \in [0,1]^2$,
(11.3)   $2 K^2 E\bigl( e^{|W|/K} - 1 - |W|/K \bigm| Z = z \bigr) \le \sigma_0^2 .$

Moreover, suppose (11.4)

f(go) 0. Then (11.9)

i IS.1 wMZi)

Proof Since ||gn — goil = Op(l) and \\h\\ > 0, it follows that |0„ — 0q I = op(l) and ||y„ — yoll = op(l). Now, define for t 6 R, gn,t(M,t>) = gn{u,v) + th(u,v) = (0„ + t)u + (y„(i;) - th{v)).


Because g„,t G ^ for all t, we find that

=

0.

£=0

But

= -(w, h)„ + {g„ - go, h)„ + Xll {%, h)

=-1 + 11 +111, where

^E

(w,

and

I{y,h)= f\^”'\v)h^"'\v)dv. Jo

Because X„

= op{n~^^'^), I{%) = Op(l), and I{h) < co, we find III = Xll{yn,h) < Xll{%)Iih) = op(«-i/2).

Moreover,

II

= (g„ - go,h)n = (0« - 0o)ll^ll^ + 0n - do){h, h)„ + {% - yo, h)„ = i + ii + iii.

Clearly, by the law of large numbers.

i=(0„^o)(Pll2 + o(l)). Because E{h{V)h{Z)) = 0, the law of large numbers also gives ii = (0„-0oMl). Moreover, E{y{V)h{Z)) = 0 also. Because I{y„) = Op(l), asymptotic equicontinuity (see Section 5.5) yields

iyn-yo,h)„ = Op{n j), since ||7„ — yoll = Op(l)- Thus, we may conclude that

II = (0„ - 0o)(Pf + 0(1)) + op(n-^).


Combining the results gives
$$0 = -\mathrm{I} + \mathrm{II} + \mathrm{III} = -(w, h)_n + (\hat\theta_n - \theta_0)\bigl( \|h\|^2 + o(1) \bigr) + o_P(n^{-1/2}) .$$
Rewriting this yields (11.9). It follows that $\hat\theta_n$ is asymptotically normal:
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \to_{\mathcal{L}} \mathcal{N}\Bigl( 0 ,\; \frac{\int \sigma^2(z) h^2(z)\,dQ(z)}{\|h\|^4} \Bigr) ,$$
where $\sigma^2(z) = E(W^2 \mid Z = z)$. If $Z$ and $W = Y - E(Y \mid Z)$ are independent, and $W$ is normally distributed, then $\hat\theta_n$ is an asymptotically efficient estimator of $\theta_0$. We shall not prove this here: it involves showing regularity of the estimator.
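A minimal numerical version of the partial spline estimator (11.2) can be obtained by representing $\gamma$ by its values on a grid of knots and penalizing squared second differences. The sketch below is only an illustration: the data-generating design, the binary $U$, the true $\gamma_0$, the number of knots, the use of $m = 2$, and the value of $\lambda_n^2$ are all assumptions made here, not part of the model description above.

```python
import numpy as np

rng = np.random.default_rng(4)
n, theta0 = 500, 1.0
u = rng.binomial(1, 0.5, size=n).astype(float)          # U (binary covariate, arbitrary choice)
v = rng.uniform(size=n)
gamma0 = lambda t: np.sin(2 * np.pi * t)                 # true gamma_0 (illustrative)
y = theta0 * u + gamma0(v) + 0.3 * rng.normal(size=n)    # Y = theta_0 U + gamma_0(V) + W

n_knots, lam2 = 50, 1e-4
knots = np.linspace(0.0, 1.0, n_knots)
B = np.zeros((n, n_knots))                               # maps observation i to its nearest knot
B[np.arange(n), np.argmin(np.abs(v[:, None] - knots[None, :]), axis=1)] = 1.0
D = np.diff(np.eye(n_knots), n=2, axis=0)                # second-difference (roughness) operator

# minimize (1/n) |y - theta*u - B g|^2 + lam2 * |D g|^2 over (theta, g)
X = np.column_stack([u, B])
P = np.zeros((n_knots + 1, n_knots + 1))
P[1:, 1:] = D.T @ D
coef = np.linalg.solve(X.T @ X / n + lam2 * P, X.T @ y / n)
print("estimated theta:", round(coef[0], 3), "  true theta:", theta0)
```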

11.1.2. Partially linear binary choice model Suppose that Y G {0,1}, with P(y = 1 I Z = z) = 1

- P(y = 0 I Z = z) = P(0oM + 7o(«^)), z = (u,v).

The link function F : R ^ (0,1) is assumed to be known (the case whejre F is unknown is treated in Section 11.3). We require it to be differentiable! with derivative

f{^) = dF{^)/d^,

(J G R. Moreover, we suppose

I/loo = sup 1/((^)1

we find from Lemma 11.3 that
0.

The same is true for the class

Thus, from Lemma 8.4, hl{Pg„,Pgo) +


0,

for all ?? > 0, and M < c»,

llg-goll>';. /(g)^AT

and Fvar(l7 | F) > 0, and Q has density q w.r.t. Lebesgue measure, with q{z) > ql. If we moreover assume that for some qi > 0 and for all z G [0,1]^ we have for 1^0 = go(z), (11.20)

1/(01 >qi>0,

for all

we obtain the same rate Op{X„) for g„.

- 01 ^ fJu


Lemma 11.5 Suppose the conditions of Lemma 11.4 are met. Furthermore, assume the identihability conditions (11.18) and (11.19), and also assume (11.20) . Then (11.21)

lli=’(g„)-F(go)ll = Op(U

(11.22)

l|g«-goli = Op(/l„),

and (11.23)

llg„-golU = Op(A„).

Proof Since l|F(g„) - F(go)lU = o p (1), together with I{g„) = Op(l), implies l|F(g„) — F(go)|l = op(l), we know from (11.18) that l|g„ — goll = op(l)Condition (11.19) now implies that \g„ - goU = op(l) (use Lemma 10.9). Now, use condition (11.20) to obtain that l|g„ — golU = Op(2„). Exploit the entropy with bracketing for the class (F(g) : g e IgU < C, I{g) < C} and {g ; Igloo < C, 1(g) < C}, to find from Lemma 5.16 that l|F(g„) — F(go)ll = Op{X„) and ||g„ - goll = Op(2„) respectively. Remark In view of Lemma 10.9, the conditions of Lemma 11.5 imply that for k e {0,... , m}. (11.24) Moreover, (11.25)

Ifn — yoloo = Op



Once we have obtained consistency of the estimator g„ in the supremum norm, it is not so hard to deduce asymptotic normality of The reason is that, locally, the model is approximately linear, provided that the appropriate expansions are valid. To this end, we assume that for all z G [0,1]^ and for

^0 = go(z), (11.26)

1 m) < —,

for all 1^ - ^ol < m-

We can apply the same arguments as in the previous section. Some notation is needed to state the result. Define

m

F(0(i-F(or

^eR,


and k = /(go), /o = /(go)- Let

E{UMU,VMU,V)\V = v) E{foiU,V)lo{U,V)\V = v) ’ and

h{u, v) = u — h{v). Lemma 11.6 Assume that the conditions of Lemma 11.5 hold, that (11.26) is met, and that (11.27)

l|(/o/o)‘/'^|l>0.

Moreover, assume X„ = op y/n{9„ - 9q ) -

(11.28)

and I {h) < oo. Then

y„T,liWMZi)kZi)

{fokY^'h

+ op(l)-

Proof For t G R, let gn,t = g„ + the cW{5„), we have by Theorem 7.6, (11.41)

d{F„,Fo) = Op{5„),


and by Corollary 7.8, /log

(11.42)

Moreover, because the entropy integral converges, (11.41) and (11.42) hold with 5„ = Now we are ready to start our three methods program. The first method is based on the idea that for any a G (0,1], the convex combination F« = (1 — a)F + aFo

dominates Fq . Theorem 11.8 Assume that the following conditions are met:

(i)

for all 0 < a < i and F G Aq , we have that the worst possible subdirections and the efficient influence curves bp^ = Ap^hp^ exist;

(ii)

{bp^: F

G

Ao, 0 < a < 1} forms a P-Donsker class;

(iii) lim^(|.^_fg)^o ~ ^^oll ~ (iv) (rate of convergence) we have

(v)

with S„ = o(n“^/'*); (control on the efficient inffuence curves bpj the efficient influence curves are uniformly bounded:

(11.43) (vi)

sup sup |hf^|oo

(11.55)

J

log pp^^yPn.

Furthermore, the concavity of the log-function, and the fact that a„ < 1/2, yield (11.56)

r

r

f

pp +

Combine (11.54), (11.55) and (11.56) to find (11.57)

,

pf o dP„. / logpp^dP„>{l—2cc„) / logPp^dP„ + 2a„ I log—^-----

2cc„ J

2pF„

dPn > X1 2 log Pf „ +PFo

UhJPnY (1 -f- Op(l)). Jb^dP

11. Some Applications to Semiparametric Models

220

From (11.49), we know that the left-hand side of this inequality is

ccn0p{n

= {\9„-9o\ + n ^^^)op{n

In view of (11.52), the right-hand side of (11.57) is of the form

fOp(n-i/2)-(0„-0o)(l + op(l

So we find from (11.57),

\e„ - 0oP < max{Op(n“^),(|0„ - 0q I + n~^^^)op{n~^^^)}, which implies |0„ — 0q I = But then, the left-hand side of (11.57) is Op(n“^), so that it reads (11.58) (^JbFadPn +Op-{9„-~9o){l + op{l))^ (l-hop(l)) =op(n“^). In other words.

The advantage of Theorem 11.8 is that, as far as rates are concerned, it only needs the appropriate order of the log-likelihood ratio, and that there is a general theory for this. In our further results, we encounter conditions on the rate of convergence, in some metric which depends on the problem at hand. In applications, this means that one might have to insert ad hoc arguments. Theorem 11.9 is based on the idea that even if the influence curve at F„ does not exist, there may well be something very close to it. We introduce a (pseudo-)metric d on A, which in some applications is equal to d. If F„ and Fq are close for the metric d, where d is sufficiently strong, then it is to be expected that something not unlike an influence curve exists at F„. Theorem 11.9 Suppose that the following conditions are met:

for all F G Aq , there exists a function bp, with f bp dPp = 0, and with JbpdPp, = -(9p-9p,); (ib) for all F G Aq , there is a direction hp € Loo, with JhfdF = 0, such that for bp = Aphp, we have (ia)

(11.59)

Wbp-bpWp^ t i , and F(y) < F(t o ) for aU y < t q , we define for y G [t q ,t i ), ' (i) ^(s), My) = < (ii) , (iii) ^(t i -),

if there is an s e [t o ,t i ) with Fo(s) = f(s), if -F’o(y) > F(y) for all y G [t o ,t i ), if Fo(y) < F(y) for all y G [t o .-t i ).

In the three cases, we find l?F(y) - ^(y)l = l^(s) - ay)\ < cMs) - Fo{y)\ = cilF(s) - Fo(y)|

(i)

,

= ci\F{y)-Foiy)\, (ii)

Kf(y) - ay)\ = l^(^o) - ay)\ < Cl (Fo(y) - Fo (t o )) < ci {Fo{y) - F{xo))

= ciiFoiy) - F{y)), (iii)

|(^f (y) - fo)i Wp„-erj

= 0p(l).


11.2.3. Examples

Proposition 11.7 is applied in Example 11.2.3a. In Examples 11.2.3b, 11.2.3c and 11.2.3f, we use Theorem 11.8. Examples 11.2.3d and 11.2.3e illustrate Theorem 11.9. Finally, we end with an open problem in Example 11.2.3g.

Example 11.2.3a. The binary choice model  Let $X_1 = (Y_1, Z_1), \ldots$ be i.i.d. copies of $X = (Y, Z)$, with $Y \in \{0,1\}$ a binary response variable, and $Z \in \mathbf{R}$ a covariate with distribution $Q$. Consider the model
$$P(Y = 1 \mid Z = z) = F_0(z) , \quad z \in \mathbf{R} , \quad F_0 \in \Lambda .$$
The density of $X$ w.r.t. $\mu = (\text{counting measure on } \{0,1\}) \times Q$ is
$$p_{F_0}(y, z) = y F_0(z) + (1 - y)(1 - F_0(z)) = \int k(y, z \mid u)\,dF_0(u) ,$$
with
$$k(y, z \mid u) = y\, 1\{u \le z\} + (1 - y)\, 1\{u > z\} .$$
So this is indeed a mixture model. (In fact, any model where the density is in some convex class can be regarded as a mixture model.) Suppose we want to estimate $E Y = \int F_0(z)\,dQ(z)$, and suppose that $Q$ is known. We then take $\theta_F = \int F(z)\,dQ(z) = \int (1 - Q(z-))\,dF(z)$, so that the gradient is $a(z) - \theta_F$, with $a(z) = 1 - Q(z-)$. We shall apply Proposition 11.7. The equation $b_F = A_F h_F$ in this example is
$$b_F(y, z) = y\, \frac{H_F(z)}{F(z)} - (1 - y)\, \frac{H_F(z)}{1 - F(z)} , \qquad h_F = \frac{dH_F}{dF} .$$
Moreover,
$$A_F^* b_F(u) = \int_{z < u} b_F(0, z)\,dQ(z) + \int_{z \ge u} b_F(1, z)\,dQ(z) .$$
We have to solve the equation
$$A_F^* b_F(u) = a(u) - \theta_F .$$
Taking the derivative w.r.t. $Q$ on both sides gives
$$b_F(0, u) - b_F(1, u) = -1 ,$$
or, using $b_F = A_F h_F$,
$$\frac{H_F(u)}{F(u)} + \frac{H_F(u)}{1 - F(u)} = 1 .$$


Hence, $H_F(u) = F(u)(1 - F(u))$. Note that indeed $h_F = dH_F/dF$ exists. Moreover,
$$b_F(y, z) = y (1 - F(z)) - (1 - y) F(z) ,$$
and since $F$ is a monotone function uniformly bounded by 1, we know that $\{ b_F : F \in \Lambda \}$ is a $P$-Donsker class (condition (ii) of Proposition 11.7). Condition (iii) is met as well. Since we solved the equation $A_F^* b_F(u) = a(u) - \theta_F$ for all $u$, condition (i) of Proposition 11.7 also holds true. So we have
$$\sqrt{n}\,(\hat\theta_n - \theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n b_{F_0}(Y_i, Z_i) + o_P(1) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \bigl( Y_i - F_0(Z_i) \bigr) + o_P(1) .$$
The efficient asymptotic variance is
(11.66)   $\int F_0(z) (1 - F_0(z))\,dQ(z) .$
The variance of the naive estimator $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$ is $\mathrm{var}(\bar Y) = \frac{1}{n} \mathrm{var}(Y)$, with
(11.67)   $\mathrm{var}(Y) = \int F_0(z)(1 - F_0(z))\,dQ(z) + \int F_0^2(z)\,dQ(z) - \Bigl( \int F_0(z)\,dQ(z) \Bigr)^2 .$
Indeed, this is larger than (11.66). Observe that the equality
$$\int b_{\hat{F}_n}\,dP_n = 0$$
reads
$$\frac{1}{n} \sum_{i=1}^n \bigl( Y_i - \hat{F}_n(Z_i) \bigr) = 0 .$$
In other words, $\hat\theta_n = \int \hat{F}_n(z)\,dQ(z)$ is an asymptotically efficient estimator of $\theta_0$, whereas $\bar{Y} = \frac{1}{n} \sum_{i=1}^n \hat{F}_n(Z_i)$ is inefficient. This is the result we already announced in Example 1.3.
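The efficiency comparison can be checked by simulation. The sketch below is illustrative only: the true $F_0$, the (known) distribution $Q$ of $Z$, and the use of a pool-adjacent-violators (isotonic regression) step to compute the maximum likelihood estimator $\hat{F}_n$ are all assumptions made here. It compares the Monte Carlo variances of $\hat\theta_n = \int \hat{F}_n\,dQ$ and of the naive mean $\bar{Y}$.

```python
import numpy as np

rng = np.random.default_rng(5)

def pava(y):
    """Pool-adjacent-violators: non-decreasing least squares fit to y (the MLE of F_0)."""
    level, weight = list(np.asarray(y, dtype=float)), [1.0] * len(y)
    i = 0
    while i < len(level) - 1:
        if level[i] > level[i + 1]:
            new = (weight[i] * level[i] + weight[i + 1] * level[i + 1]) / (weight[i] + weight[i + 1])
            level[i:i + 2] = [new]
            weight[i:i + 2] = [weight[i] + weight[i + 1]]
            i = max(i - 1, 0)
        else:
            i += 1
    return np.repeat(level, [int(w) for w in weight])

F0 = lambda z: z ** 2                      # true F_0 (illustrative choice)
grid = np.linspace(0.0, 1.0, 1001)         # Q = Uniform(0, 1), assumed known
theta0 = np.trapz(F0(grid), grid)

n, n_rep = 200, 2000
est_mle, est_naive = [], []
for _ in range(n_rep):
    z = np.sort(rng.uniform(size=n))
    y = (rng.uniform(size=n) < F0(z)).astype(float)
    Fhat = pava(y)                         # MLE of F_0 evaluated at the ordered z's
    Fhat_grid = np.interp(grid, z, Fhat, left=0.0, right=Fhat[-1])
    est_mle.append(np.trapz(Fhat_grid, grid))      # theta_hat = int Fhat dQ
    est_naive.append(y.mean())                     # naive estimator Y-bar
print("true theta:", round(theta0, 4))
print("n * var(MLE-based):", round(n * np.var(est_mle), 4),
      "   n * var(naive):", round(n * np.var(est_naive), 4))
```

In this toy example the efficient variance is $\int F_0(1-F_0)\,dQ = 2/15 \approx 0.13$, against $\mathrm{var}(Y) = 2/9 \approx 0.22$ for the naive estimator, and the simulation reproduces this gap.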

Example 11.2.3b. Convolution with the uniform  Let $\mu$ be Lebesgue measure, and $k(x \mid y) = k(x - y)$, with $k(x) = 1\{0 < x \le 1\}$ (the case of an interval of length different from 1 will be treated in Example 11.2.3e). The densities with respect to Lebesgue measure are
$$p_F(x) = F(x) - F(x-1) = \begin{cases} F(x) , & \text{if } 0 \le x \le 1 , \\ 1 - F(x-1) , & \text{if } 1 < x \le 2 . \end{cases}$$
We want to estimate — f adFo. Let us apply Theorem 11.8 here. We assume that d(y) = da(y)/dy exists, that d E Loo and that

da dFh

< c. 00

Write for x G [0,1], (11.68)

ho(x) =

Hpoix) Foix) ’

= 0. Now, we have to solve the equation

where

A*bFo{y) = a{y) - 6f o , or (11.70)

f bFoix)dx+ r

Jv-

Jo

hfo(x + l)dx = a{y) - dp^.

Differentiating (11.70) yields

-bpoiy) +

+ 1) =

Now, bpo should be of the form given in (11.68) and (11.69). Inserting this gives Hp^jy) Hp,iy) Fo{y) l-Eo(y) = d{y), or (11.71)

Hp,iy) = -Fo(y)(l - Fo(y))u(y).

The derivative hp^ = dHp^/dFo exists, since dd/dFo exists. We can now apply Theorem 11.8. Let Fa = (1 — cc)f + c c Fq , a G (0,1], replace in (11.71), Fq by Fa, and call the result Hp^. Then for some constant M

dHp dFa 00

M ®

So condition (vi) of Theorem 11.8 is satisfied.


Moreover, for 0 < x < 1,

hSx) = -{l-F,{x))d{x),

+ 1) = Fa(x)a(x). Condition (v) of Theorem 11.8 holds. Moreover, the collection {bp : F e A, a E (0,1]} forms a P-Donsker class. Also condition (iv) is met: we may take 'P( Uu Pt = 1(7; < [/,}, y,- = l{17i < Yi < Vi], where 7; and {Ui, Vi) are independent, 7 has distribution Fq on [0,2] and (C/,-, 7) has distribution Q on [0,2]. We want to estimate 0Fo = f adFo- In general, no explicit expression for the influence curve bpo can be given. The equation A'bpg = a — 6fg becomes

f[

J J u'^y

bF^{l,0,u,v)dQ{u,v)+

j f

J v>y J u’ dFoiy)

f

e“ dFoiu) - dpo-

One may verify that the conditions of Theorem 11.8 are met. It follows that the maximum likelihood estimator 6„ is asymptotically equivalent to the simple estimator and that they are both asymptotically efficient. A

Example 11.2.3g. Mixture with an exponential family Let (^,^) = (R, Borel sets), and k{x I y) = exp[ip(y)T(x) -c(ip(y))]. where c(ip) = log

J

exp[ipT(x)] d/r(x).

Write

c(ip) =

dc{xp) dip


We assume that {rp{y) : y G R} has a non-empty interior, and that this interior is a subset of the support of Fq . Then {k{- | y) ; y G supp(To)} is a complete exponential family, so that there is at most one solution b{x) to the equation

A*b{y) = a{y) — 0o, Now, take a(y) = c(ip(y)), so that _ 0p^ is the unique solution:

To-a.s..

= Jc{xp)dFo. Then clearly, hf„(x) =

A*{T-dFo) = biwiy)) - ^FoTherefore, if one can solve the equation AF^hf^ix) = T{x) — Of ^, the simple estimator



J

e„ = TdPn

is asymptotically efl&cient. The following lemma investigates this further. Lemma 11.12 Suppose that dFo/dxp exists and is a continuous function of y, of bounded variation. Then for fy

HFo(y) = J

dFn

c{rp)dFo-~{y)-eFoFQ{y),

we have (11.75)

Jkjx I y)dHF^{y) PFoiy)

T{x) - Of o -

Moreover, if each F e A is identiSable, Hf q is also the unique solution of (11.75).

Proof Note first of all that Hf ^ is of bounded variation, so that the integrals are well defined. Partial integration gives

j k{x I y)dHFo(y) = Jk{x\ y)c{xp{y))dFoiy)

-J

=J

^



PFoix)dFo

k{x\ y)c(v^(y)) dFoiy)

+J

^(yMx I dy) - pf o {x )9f o = Jk{x\ y)c{y){y)) dFoiy)

J

+ I = pf.(x)(T(x) -

~ c(i/;(y))) dxpiy) - pf o {x )9f o


If Hi and H2 are two solutions of (11.75), both of bounded variation, then it follows that lkix\y)d{Hi{y)-H2iy))=0. But since Hi — H2 is also of bounded variation, each F € A identifiable together with (11.75) implies Hi = H2. Corollary 11.13 Suppose that for some constant 02

d{dFo/dxp) TE--------------

-----------

2-

^ ^

Then T — Opo is the efficient influence curve for estimating and (11.76)

= f c{xp) dpQ,

, , d(dFo/dxp) ^ hp, = c{xp)--------—-------- Of o

is a worst possible subdirection. We conclude that under the condition of the above corollary, 9„ = f T dP„ is asymptotically efficient. The question arises whether the same is true for the maximum likehhood estimator 9n = J b(rp) dP„. We leave this as an open problem. 11.3. A single-indexed model with binary explanatory variable Let (Yi,Zi),... ,{Y„,Zn),... be i.i.d. copies of (T,Z), where Y e {0,1} is a binary response variable, and Z = (17, F) is a covariate consisting of a binary variable U € (0,1} and a continuous variable V e (0,1). For example, Y indicates whether a person has a job, U indicates whether he/she has children, and V is education (measured on a continuous scale). We transform the random variable V to an R-valued random variable, using the transformation V >->•