Stochastic Approximation and Recursive Estimation


English Pages 244 [248] Year 1976


TRANSLATIONS OF MATHEMATICAL MONOGRAPHS
VOLUME 47

M. B. Nevel'son
R. Z. Has'minskiĭ

Stochastic Approximation and Recursive Estimation

American Mathematical Society Providence, Rhode Island

STOKHASTICHESKAIA APPROKSIMATSIIA I REKURRENTNOE OTSENIVANIE
M. B. NEVEL'SON, R. Z. HAS'MINSKIĬ
Izdatel'stvo "Nauka", Moskva, 1972

Translated from the Russian by the Israel Program for Scientific Translations. Translation edited by B. Silver.

2000 Mathematics Subject Classification. Primary 62L20; Secondary 60J05, 93E10.

Abstract. This book is devoted to sequential methods of solving a class of problems to which belongs, for example, the problem of finding a maximum point of a function if each measured value of this function contains a random error. Some basic procedures of stochastic approximation are investigated from a single point of view, namely the theory of Markov processes and martingales. Examples are considered of applications of the theorems to some problems of estimation theory, learning theory and control theory, and also to some problems of information transmission in the presence of feedback.

Library of Congress Cataloging in Publication Data

Nevel'son, Mikhail Borisovich. Stochastic approximation and recursive estimation. (Translations of mathematical monographs; v. 47) Translation of Stokhasticheskaia approksimatsiia i rekurrentnoe otsenivanie. Bibliography: p. Includes index. 1. Stochastic approximation. 2. Estimation theory. I. Khas'minskiĭ, Rafail Zalmanovich, joint author. II. Title. III. Series. QA274.N4813 519.2 76-48298 ISBN 0-8218-1597-0 Softcover ISBN 0-8218-0906-7

The paper used in this book is acid-free and falls within the guidelines established to ensure permanence and durability. Visit the AMS home page at URL: http://www.ams.org/

10 9 8 7 6 5 4 3

04 03 02 01 00

TABLE OF CONTENTS

Preface .......................................................................................... 1

Introduction ................................................................................... 4
Chapter 1. Elements of probability theory ................................... 11
 1. Probability spaces .................................................................... 11
 2. Random variables ..................................................................... 13
 3. Conditional probabilities and conditional expectations .......... 17
 4. Independence. Product measures ............................................ 20
 5. Martingales and supermartingales ........................................... 22
Chapter 2. Discrete time Markov processes ................................. 27
 1. Markov processes ..................................................................... 27
 2. Markov chains .......................................................................... 32
 3. Recursive procedures ............................................................... 34
 4. Discrete model of diffusion ...................................................... 37
 5. Exit of sample functions from a domain .................................. 38
 6. Series of independent random variables ................................. 41
Chapter 3. Continuous time Markov processes. Stochastic equations ... 49
Chapter 4. Convergence of stochastic approximation procedures. I ... 79
 1. The Robbins-Monro procedure ................................................ 79
 2. The Kiefer-Wolfowitz procedure .............................................. 81
 3. Continuous procedures ............................................................ 86
 4. Convergence of the Robbins-Monro procedure ....................... 88
 5. Convergence of the Kiefer-Wolfowitz procedure .................... 94

Chapter 5. Convergence of stochastic approximation procedures. II ... 101
 1. Introductory remarks ............................................................... 101
 2. General theorems ..................................................................... 102
 3. Auxiliary results (continuous time) ......................................... 106
 4. Auxiliary results (discrete time) .............................................. 112
 5. One-dimensional procedures ................................................... 118
Chapter 6. Asymptotic normality of the Robbins-Monro procedure ... 123
 1. Preliminary remarks ................................................................. 123
 2. Asymptotic behavior of solutions ............................................ 129
 3. Investigation of the process I1(t) ............................................ 132
 4. Investigation of the process I2(t) ............................................ 136
 5. Asymptotic normality (continuous time) ................................. 140
 6. Asymptotic normality (discrete time) ...................................... 147
 7. Convergence of moments ........................................................ 154
Chapter 7. Some modifications of stochastic approximation procedures ... 161
 1. Statement of the problem ........................................................ 161
 2. General theorem ...................................................................... 162
 3. Auxiliary results ....................................................................... 165
 4. Theorems on convergence and asymptotic normality ............ 167
 5. Adaptive Robbins-Monro procedures ...................................... 169
 6. Asymptotic optimality .............................................................. 175
Chapter 8. Recursive estimation (discrete time) ........................... 181
 1. The Cramer-Rao inequality. Efficiency of estimates ............... 181
 2. The Cramer-Rao inequality in the multidimensional case ...... 186
 3. Estimation of a one-dimensional parameter ........................... 189
 4. Asymptotically efficient recursive procedure ......................... 194
 5. Estimation of a multidimensional parameter .......................... 197
 6. Estimation with dependent observations ................................ 202
Chapter 9. Recursive estimation (continuous time) ...................... 207
 1. The Cramer-Rao inequality ...................................................... 207
 2. Application of the Robbins-Monro procedure ......................... 210
 3. Time-dependent observations ................................................. 212
 4. Some applications .................................................................... 213
 5. A modification .......................................................................... 220
Chapter 10. Recursive estimation with a control parameter ........ 221
 1. Statement of the problem ........................................................ 221
 2. Asymptotically optimal recursive plan .................................... 223
 3. Two examples ........................................................................... 226
 4. Continuous case ....................................................................... 229
Notes on the Literature .................................................................. 233
Bibliography .................................................................................... 237
Subject Index .................................................................................. 242
Main Notation ................................................................................. 244


To the memory of our fathers who fell in 1942

PREFACE

This book is concerned with certain special questions in the theory of Markov processes defined by differential or difference equations. Our approach was largely influenced by the fact that, as recently observed by Ja. Z. Cypkin and others, these processes describe iterative procedures for the solution of numerous problems in the theory of automatic learning systems. This explains why most of our attention in this book centers on situations which are "degenerate" from the standpoint of the general theory of Markov processes; that is to say, the sample functions of the processes reduce to certain points of the phase space. This is the case, in particular, for stochastic approximation (s.a.) procedures, to be studied in Chapters 4-6. There is quite a large volume of literature on these procedures. We have selected only a few topics in the theory of s.a., which we have endeavored to present in a unified manner. Among these are convergence (Chapter 4), behavior of the procedure when the regression equation has several solutions (Chapter 5), and asymptotic normality (Chapter 6). In Chapter 7 we consider only the modifications of s.a. procedures having a thematic relation with the contents of Chapters 8-10. Other interesting modifications, proposed by Dupač [2], Kesten [1], Fabian [1] and others, are outside the scope of this book. The mathematical tools and results of Chapter 4 are related to the tools and some of the results in Chapter 4 of Aizerman, Braverman and Rozonoer [1] (referred to henceforth as ABR [1]). We have no intention of presenting techniques for the construction of iterative procedures suitable for solving various technical problems, since, in our opinion, such procedures have been discussed in sufficient detail in ABR [1] and Cypkin [1], [2]. On the contrary, in the bulk of the exposition the procedure is assumed to be given and we investigate its properties. For the same reason, we almost never present a full analysis of examples demonstrating the application


of the theorems to specific problems. Almost all the examples (see §2.8 or §10.3)(1) are considered only in an illustrative capacity. Only one application of the theorems of Chapters 4-7 is considered in comparative detail in Chapters 8 and 9: the application to the problem of parametric estimation, one of the fundamental problems of mathematical statistics.(2) Our standpoint here is similar to that of Albert and Gardner [1]: we confine attention to recursive methods of estimation in which each subsequent observation introduces only a slight correction to the available estimate, and this correction is moreover given by simple formulas. Unlike Albert and Gardner, however, we concentrate attention on the construction of asymptotically efficient estimates, which form Markov processes. Also considered (Chapter 10) is the problem of estimating an unknown density parameter, when an additional control parameter is at the disposal of the statistician. The material set forth in this book is closely bound up with various problems of learning theory, control theory and signal-transmission theory. We shall, however, say very little about these relationships. Only the relation of recursive estimation to the problem of modulator design in the presence of feedback is discussed in any detail (§9.4). In the same section we also discuss some recent results of Schalkwijk and Kailath [1], Zigangirov [1], and D'jačkov and Pinsker [1] on transmission of information along a Gaussian white channel with noiseless feedback, from the standpoint of recursive estimation. Throughout the book, the same questions are examined in parallel for continuous and discrete time. The material relating to continuous time may be omitted on a first reading. Nevertheless, readers capable of overcoming the logical difficulties inherent in the study of continuous-time processes will be rewarded later, not only because the problems discussed are of independent significance for continuous processes, but also because they contribute to a better understanding of the ideas of the proofs in the discrete case. (Sometimes the "continuous" variant of a theorem will be presented first, in order to clarify the reasoning underlying the more cumbersome arguments of the "discrete" case. This is true in particular in Chapter 6.)

(1) Throughout, we employ the "decimal" enumeration for propositions, sections and formulas. Thus, §2.7 and Theorem 3.7.1 mean §7 in Chapter 2 and Theorem 1 in §7 of Chapter 3. References to propositions and formulas in the same chapter omit the chapter number.
(2) Various problems may be reduced to estimation of a multidimensional distribution parameter (see ABR [1]); among these are such problems of learning theory as the recovery of an unknown function from observations under noisy conditions, and the probabilistic problem of teaching a machine to recognize patterns.


Unfortunately, an elementary course of probability theory, as given, say, at technical colleges, is not sufficient for an understanding of this book, since we make active use of concepts (such as martingales and Markov processes) based on Kolmogorov's [1] general conception of probability theory as a branch of measure theory. We should nevertheless like the book to be accessible to engineers. The first four sections of the book constitute an attempt to resolve this dilemma. The mathematically equipped reader may safely begin with the last section of Chapter 1. Other readers may choose one of two alternative courses: familiarize themselves with Kolmogorov's approach through some suitable text (and this is preferable!), or, convincing themselves that the definitions are intuitively reasonable and verifying the validity of the properties of conditional expectations for elementary cases, accept them as true in more general situations. §§1-4 of Chapter 1 are intended for readers adopting the second course; they are hardly more than a resume of some definitions and theorems from Kolmogorov [1], collected for later reference. Chapters 5 and 6 were written by Nevel'son, Chapters 8-10 by Has'minskiĭ, and the remaining chapters by both authors in collaboration. Each chapter is prefaced by a brief outline of its contents. Most of the references are assembled in the notes following the last chapter. All references are to the bibliography at the end of the book. More exhaustive bibliographies may be found in Wasan [1], Cypkin [1], [2], and the survey papers of Fabian [3], Schmetterer [1] and others. Our heartfelt thanks are due to A. N. Širjaev and B. Ja. Levit, whose critical remarks were of great assistance.

Moscow

M. B. Nevel'son
R. Z. Has'minskiĭ

INTRODUCTION

Consider the following problem. Let R(x) be some function which an experimenter can measure at any point x of the real line. Suppose he knows that, for example, R(x) is monotone increasing and that the equation

    R(x) = 0                                                    (0.1)

has a solution. How can he find the root x₀ of the equation? There are various rapidly convergent methods for solving this problem (such as Newton's method, which converges more rapidly than a geometric progression). The situation is otherwise if the observer can measure R(x) only to within an error whose magnitude cannot be neglected in view of the accuracy demanded of the solution to equation (0.1). In such cases it is rarely possible to devise procedures that converge to x₀ more rapidly than a magnitude of order n^(-1/2), where n is the number of observations. In 1951, Robbins and Monro suggested a method for solution of this and a more general problem, which they called the method of stochastic approximation (s.a.). Suppose that the outcome of the measurement at X_k at time k is

    Y_k = R(X_k) + ξ_k,

where ξ₁, ..., ξ_n are independent random variables (r.v.'s) with zero expectation. The procedure suggested by Robbins and Monro (the RM procedure) is as follows. Given an arbitrary initial point X(0) = x and an arbitrary sequence of positive numbers a_k such that

    Σ_k a_k² < ∞,   Σ_k a_k = ∞,                                (0.2)

set

    X_{k+1} = X_k − a_k Y_k.                                    (0.3)

The meaning is clear. If X_k < x₀, the difference X_{k+1} − X_k will be positive on the average, since its conditional expectation given X_k is −a_k R(X_k):

    E(X_{k+1} − X_k | X_k) = −a_k R(X_k).
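As a concrete numerical illustration (ours, not the book's), the recursion (0.3) with the classical step choice a_k = 1/(k + 1) can be simulated for a simple monotone regression function; here the root R(x) = x − 2 and the standard normal measurement errors are arbitrary choices made for the sketch.

```python
import random

def robbins_monro(noisy_measure, x0, n_steps):
    """Run the RM recursion X_{k+1} = X_k - a_k Y_k with a_k = 1/(k+1)."""
    x = x0
    for k in range(1, n_steps + 1):
        y = noisy_measure(x)          # Y_k = R(X_k) + xi_k
        x = x - y / (k + 1)           # a_k = 1/(k+1): sum a_k^2 < inf, sum a_k = inf
    return x

random.seed(0)
root = 2.0
# Noisy measurement of R(x) = x - root with zero-mean Gaussian error.
estimate = robbins_monro(lambda x: (x - root) + random.gauss(0.0, 1.0),
                         x0=0.0, n_steps=4000)
```

With a few thousand observations the estimate is typically within a few hundredths of the root, in line with the n^(-1/2) rate mentioned above.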


Otherwise, the difference will be negative on the average. Thus the procedure (0.3) "forces" the sequence X_k to move toward x₀. First, however, we must take care to ensure that the "jumps" X_{k+1} − X_k are "damped", for otherwise the sequence X_k cannot converge to x₀ (it will generally oscillate around x₀ without approaching it). This is guaranteed by the first condition of (0.2). Second, the magnitude of the jumps should not decrease too rapidly, for otherwise X_k will not "succeed" in reaching x₀. This property of the sequence X_k is embodied in the second condition of (0.2). (For more details, see §4.1.) Robbins and Monro [1] proved that the procedure (0.3) is convergent under conditions (0.2), provided certain restrictions are imposed on R(x). Their work stimulated widespread reactions, due apparently, in the first place, to the practical usefulness of the procedure. In fact, the procedure (0.3) indicates a simple plan of experiments, whose implementation requires storage of neither the outcomes Y_k of previous experiments nor the points X_k at which the experiments are performed (except the last). Moreover, the procedure yields a sequence X_k converging to x₀ with probability 1. It is particularly convenient when the statistician does not know in advance at what time he may be required to "produce" an estimate X_k for x₀: he constructs the estimate at each instant of time, progressively improving it. These features of s.a. procedures explain their popularity among applied workers, particularly specialists in automation and remote control and other branches of cybernetics. The original s.a. method was meant for solution of nonparametric problems

of mathematical statistics, but it may also be used to solve classical parametric problems. A detailed investigation of these possibilities may be found in the interesting book of Albert and Gardner [1]. For example, let Z_k = x₀ + ξ_k, k = 1, ..., n, be a sequence of independent observations of a "signal" x₀ with "additive noise" ξ_k. Thus ξ_k, k = 1, ..., n, is a sequence of independent r.v.'s with probability density p(x). If the "a priori" distribution π(x) of the parameter x₀ is given, it is known that the best estimate x̄_n (in the sense of the standard error) for x₀ is the conditional expectation of x₀ given Z₁, ..., Z_n:(1)

    x̄_n = E(x₀ | Z₁, ..., Z_n) = ∫ u p(Z₁ − u) ⋯ p(Z_n − u) dπ(u) / ∫ p(Z₁ − u) ⋯ p(Z_n − u) dπ(u).    (0.4)

If x₀ is not a r.v., one usually employs the maximum likelihood estimate x̂_n, which is the solution of the equation

(1) Throughout this book we shall denote estimates by the letters x and X. Capital letters X will usually be reserved for estimates computed by recursive formulas.


    max_z [p(Z₁ − z) ⋯ p(Z_n − z)] = p(Z₁ − x̂_n) ⋯ p(Z_n − x̂_n),      (0.5)

or the estimate (0.4) with a distribution π(x) which is open to fairly arbitrary choice. We use the notation x̄_n for estimates of type (0.5) or of type (0.4) with arbitrary π(x). Clearly, in all these cases the passage from x̄_n to x̄_{n+1} is quite tedious and utilizes all previous observations. However, one can try to construct an estimate X_n for x₀ by a RM procedure. It will suffice, for example, to consider the estimation procedure

    X_{n+1} = X_n − a_n(X_n − Z_{n+1}).                         (0.6)

This is evidently a special case of (0.3), with R(x) = x − x₀. It therefore follows from the Robbins-Monro theorem that X_k converges to x₀ as k → ∞. All the previously mentioned practical advantages of RM procedures extend to the procedure (0.6). It may be, however, that estimates of type (0.4) or (0.5) possess other important advantages, say in the sense of the rate of convergence to x₀ as k → ∞. Throughout this book, the quality of estimates will be measured by the mean square of the difference |x_k − x₀|, i.e. the function E|x_k − x₀|². One then has the following asymptotic result for the estimates x̄_n (see Ibragimov and Has'minskiĭ [1]): Under certain fairly weak assumptions on p(x) and π(x) (not specified here), we have the asymptotic equality(2)

    E|x̄_n − x₀|² ~ 1/(nI),   n → ∞,                            (0.7)

where I is the information content of the density p(x):

    I = ∫ [p′(x)]²/p(x) dx.
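For a concrete density the information I can be evaluated numerically (our sketch, not from the book); for the normal density with variance σ² the exact value is 1/σ².

```python
import math

def fisher_information(p, dp, lo, hi, n=200_000):
    """Numerically integrate I = integral of p'(x)^2 / p(x) dx by the midpoint rule."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += dp(x) ** 2 / p(x) * h
    return total

sigma = 2.0
p = lambda x: math.exp(-x * x / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
dp = lambda x: -(x / sigma ** 2) * p(x)     # analytic derivative of the normal density
info = fisher_information(p, dp, -10 * sigma, 10 * sigma)   # exact value: 1/sigma^2
```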

Moreover, it is well known that the normalized difference √(nI)(x̄_n − x₀) is asymptotically normal with parameters (0, 1). In order to compare RM estimates with x̄_n estimates, we need theorems about the asymptotic behavior of the distribution of the r.v. Y_n = √n(X_n − x₀) and the limit of its squared expectation. Such theorems have been established by several authors, among them Chung [1], Blum [2], Sacks [1] and Fabian [5]. Let Eξ_k² = σ² < ∞. It can be shown that for the procedure (0.3) (and a more general procedure; see Chapter 6) with a_n = [a(n + 1)]⁻¹ the r.v. Y_n is asymptotically normal with parameters (0, σ²/(a(2R′(x₀) − a))), provided 0 < a < 2R′(x₀). The question of optimal selection of the parameter a has been studied by many authors since the celebrated work of Chung [1].

(2) As usual, a_n ~ b_n means that lim a_n/b_n = 1.
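The limit variance σ²/(a(2R′(x₀) − a)) can be checked by a small Monte Carlo experiment (our sketch, not part of the book); here R(x) = x, so x₀ = 0 and R′(x₀) = 1, and with σ = 1, a = 0.8 the limit variance of Y_n = √n X_n is 1/0.96 ≈ 1.04.

```python
import math
import random

random.seed(1)

def rm_linear(a, n, x0=1.0, sigma=1.0):
    """RM recursion for R(x) = x with a_k = 1/(a(k+1))."""
    x = x0
    for k in range(1, n + 1):
        y = x + random.gauss(0.0, sigma)       # Y_k = R(X_k) + xi_k
        x = x - y / (a * (k + 1))
    return x

a, n, reps = 0.8, 2000, 400
samples = [math.sqrt(n) * rm_linear(a, n) for _ in range(reps)]
mc_variance = sum(s * s for s in samples) / reps
target = 1.0 / (a * (2.0 - a))                  # sigma^2 / (a (2 R'(x0) - a))
```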


In the above example R(x) = x − x₀. Here R′(x₀) = 1, and in order to obtain estimates with asymptotically optimal properties of the type (0.6) with a_n = [a(n + 1)]⁻¹ we must put a = 1. It is readily seen that the resulting estimate is simply the arithmetic mean. It is known that this estimate is optimal in the mean-square sense only for a normal distribution (its variance is σ²/n, while for any distribution other than normal, σ² > I⁻¹).

The situation is less favorable if R(x) is nonlinear. We would then have to put a = R′(x₀). However, the function R(x) is not at the observer's disposal, and all he can use are estimates for R′(x₀) based on observations. This "adaptive" method for determining the optimal factor [a(n + 1)]⁻¹ was suggested by Venter [1] (whose results and some generalizations will be discussed in Chapter 7). However, as stated previously, even if we learn how to select the factor a optimally, the variance of the estimate will generally exceed that of the estimates x̄_n by some finite factor. This is far from surprising. In fact, our calculation of estimates of type (0.6) makes no use of any properties of the distribution of ξ_k, whereas the estimates (0.4) and (0.5) depend essentially on the explicit form of the density. It is quite natural to ask whether one can set up recursive procedures similar to stochastic approximation which utilize the explicit form of the density p(x) and yield estimates satisfying condition (0.7). With this end in view, we use the following heuristic device (see Stratonovič [2]). Let x̂_n be the maximum likelihood estimate. Then the logarithmic derivative of the likelihood function will vanish at x̂_n. Thus,

    Σ_{k=1}^{n} (p′/p)(Z_k − x̂_n) = 0,   Σ_{k=1}^{n+1} (p′/p)(Z_k − x̂_{n+1}) = 0.      (0.8)

Assuming that the difference x̂_{n+1} − x̂_n is small and subtracting the first equation in (0.8) from the second, we obtain

    (p′/p)(Z_{n+1} − x̂_n) − ( Σ_{k=1}^{n} ((p′/p)(Z_k − x̂_n))′ ) (x̂_{n+1} − x̂_n) ≈ 0.

This yields the estimation procedure

    x̂_{n+1} ≈ x̂_n + ( Σ_{k=1}^{n} ((p′/p)(Z_k − x̂_n))′ )⁻¹ (p′/p)(Z_{n+1} − x̂_n).      (0.9)


Procedures of type (0.9), in a more general framework, have been considered by Stratonovič [2], Cypkin [1], [2] and others. Cypkin [2] calls an algorithm of type (0.9) quasi-optimal. The disadvantage of this algorithm is the need to store all the observations Z₁, ..., Z_n in the computer memory. In addition, the algorithm is not so easy to investigate, as the process X_n that it defines is not usually a Markov process. Let us assume now that X_n converges to the estimated parameter x₀ as n → ∞, and replace x̂_n in (0.9) by x₀. Then Σ_{k=1}^{n} ((p′/p)(Z_k − x₀))′ is a sum of identically distributed r.v.'s. Therefore, by the law of large numbers,

    Σ_{k=1}^{n} ((p′/p)(Z_k − x₀))′ = −nI + o(n),   n → ∞.

Substituting this estimate into (0.9), we obtain an algorithm proposed by Sakrison [2], [3]:(3)

    X_{n+1} = X_n − (1/(nI)) (p′/p)(Z_{n+1} − X_n).             (0.10)
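For normal noise the recursion (0.10) can be written out explicitly (our sketch, not the book's): if p is the standard normal density, then (p′/p)(u) = −u and I = 1, so the correction at step n is (Z_{n+1} − X_n)/n, and the recursion started from X₁ = Z₁ coincides with the arithmetic mean of the subsequent observations.

```python
import random

random.seed(2)

def efficient_recursion(zs):
    """X_{n+1} = X_n - (1/(nI)) (p'/p)(Z_{n+1} - X_n) for standard normal p,
    where (p'/p)(u) = -u and I = 1, started at X_1 = Z_1."""
    x = zs[0]
    for n, z in enumerate(zs[1:], start=1):
        x = x + (z - x) / n
    return x

theta = 1.5                                 # true parameter x_0 (arbitrary for the sketch)
zs = [theta + random.gauss(0.0, 1.0) for _ in range(3000)]
x_rec = efficient_recursion(zs)
x_mean = sum(zs[1:]) / len(zs[1:])          # arithmetic mean of Z_2, ..., Z_N
```

In this Gaussian case an easy induction shows X_N equals the mean of Z₂, ..., Z_N, so the recursive estimate attains the same mean-square error 1/(nI) as the classical one.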

Imposing certain restrictions on p(x), Sakrison proved that these estimates are consistent and asymptotically efficient in the strong sense (for the definitions, see Chapter 8). Thus the estimate (0.10) possesses all the merits of recursive estimates, and in addition produces asymptotically optimal results. These considerations, relating to the special case of additive noise, imply that recursive parametric estimation procedures are worthy of close attention. The situation is especially interesting if the experimenter has to find an optimal estimate for the parameter x₀ and is also able to determine a control parameter z.(4) An exact solution of this problem is extremely difficult. To set up an optimal observation plan (i.e. an optimal choice of a sequence of controls z_n and estimates x_n) one must, roughly speaking, "play through" the entire problem: first, specifying z₁, ..., z_n, find the optimal estimate x_n(z₁, ..., z_n), which is the conditional expectation E_{z₁,...,z_n}(x₀ | Y₁ = y₁, ..., Y_n = y_n) given the observations y₁, ..., y_n at the points z₁, ..., z_n, respectively. One then minimizes the expression

(3) Sakrison actually proposed a recursive estimation procedure for a more general situation. A detailed account of his results may be found in Chapter 8. A more rigorous formulation of the problem will be given in Chapter 10.

Consider the sum

    Σ_{k=0}^{∞} kλ µ{B_kλ},   where B_kλ = {ω: kλ ≤ ξ(ω) < (k + 1)λ} ∩ A,

if µ({ξ = ∞} ∩ A) = 0, and set the sum equal to ∞ if µ({ξ = ∞} ∩ A) > 0.

This sum always has a limit (possibly ∞) as λ → 0. The limit is known as the Lebesgue integral of the function ξ over the set A, denoted by ∫_A ξ µ{dω}. In general, when ξ takes values of different signs, the Lebesgue integral cannot always be defined in a logical way. For example, it is not clear what value to assign to the integral of a function ξ that takes the values +∞ and −∞ on sets of positive measure. Nevertheless, in some cases it is convenient to consider integrals of functions with values ±∞. Set ξ⁺ = max(ξ, 0) and ξ⁻ = max(−ξ, 0). It is obvious that ξ⁺ and ξ⁻ are nonnegative functions, and ξ = ξ⁺ − ξ⁻. If at least one of the functions ξ⁺ or ξ⁻ has a finite Lebesgue integral over the set A, we shall say that the integral ∫_A ξ µ{dω} exists, and set

    ∫_A ξ µ{dω} = ∫_A ξ⁺ µ{dω} − ∫_A ξ⁻ µ{dω}.
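For a measure concentrated on finitely many points the construction can be carried out directly (our sketch, not from the book); for each λ the sum underestimates the integral by less than λ·µ(A), so letting λ → 0 recovers the Lebesgue integral.

```python
import math

def lebesgue_sum(values, masses, lam):
    """Sum over k of k*lam * mu{ w : k*lam <= xi(w) < (k+1)*lam } for xi >= 0."""
    total = 0.0
    for v, m in zip(values, masses):
        k = math.floor(v / lam)     # the level set containing xi(w) = v
        total += k * lam * m
    return total

# A measure on four points; xi takes the listed nonnegative values.
values = [0.3, 1.7, 2.0, 5.25]
masses = [0.1, 0.4, 0.3, 0.2]
exact = sum(v * m for v, m in zip(values, masses))   # the Lebesgue integral itself
approx = lebesgue_sum(values, masses, lam=1e-4)
```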

The Lebesgue integral is equal to the definite Riemann integral if, for example, the set A ⊂ E¹ is an interval, µ is Lebesgue measure, and the function ξ is Riemann-integrable and bounded. It has many of the usual properties of the definite integral. For instance, if all the relevant integrals exist, then

(1) Throughout this book, vectors are interpreted as column-vectors in E^l, unless otherwise stated.

    ∫_A (aξ + bη) µ{dω} = a ∫_A ξ µ{dω} + b ∫_A η µ{dω},

    |∫_A ξ µ{dω}| ≤ ∫_A |ξ| µ{dω},

and these relations will be used frequently in what follows. Yet another important property of the integral is the Čebyšev inequality: if ξ(ω) ≥ 0, then

    µ{ω: ξ(ω) ≥ c, ω ∈ A} ≤ (1/c) ∫_A ξ(ω) µ{dω},   A ∈ 𝔄.
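For a measure on finitely many points the Čebyšev inequality can be verified directly (our sketch, not from the book):

```python
def chebyshev_check(values, masses, c):
    """Compare mu{ xi >= c } with (1/c) * integral of xi, for xi >= 0 and c > 0."""
    lhs = sum(m for v, m in zip(values, masses) if v >= c)
    rhs = sum(v * m for v, m in zip(values, masses)) / c
    return lhs, rhs

values = [0.3, 1.7, 2.0, 5.25]
masses = [0.1, 0.4, 0.3, 0.2]
lhs, rhs = chebyshev_check(values, masses, c=2.0)
# Here mu{xi >= 2} = 0.3 + 0.2 = 0.5, while (1/2) * integral = 2.36/2 = 1.18.
```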

Some other important properties of the integral will be indicated below for the special case in which µ is a probability measure. Let (Ω, 𝔄, P) be a probability space and ξ = ξ(ω) a r.v. If the integral

    ∫_Ω ξ(ω) P{dω} = ∫ ξ P{dω}

exists, it is called the expectation of ξ, denoted by Eξ. It may of course happen that Eξ = ∞ or Eξ = −∞. If E|ξ| < ∞, we shall say that the r.v. ξ has finite expectation. If ξ is a discrete r.v., assuming finitely or countably many values x₁, x₂, ... with probabilities p₁, p₂, ..., then it follows readily from the definition of the Lebesgue integral that (provided Eξ exists)

    Eξ = Σ_k x_k p_k.                                           (2.1)

In the general case, the definition of the expectation as a Lebesgue integral reduces to replacing ξ by an approximating discrete r.v., calculating the expectation by (2.1), and then going to the limit. This implies the formula

    Eξ = ∫ x dF_ξ(x),

where F_ξ(x) = P{ξ < x} is the distribution function of ξ. It can also be shown that for any Borel function f(x)

    Ef(ξ) = ∫ f(x) dF_ξ(x),                                     (2.2)

provided Ef(ξ) exists. A similar formula holds for vector-valued r.v.'s.
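Formula (2.1), and (2.2) with it, is immediate for a discrete r.v. (our sketch, not from the book); for a fair die, Eξ = 7/2 and Eξ² = 91/6.

```python
from fractions import Fraction

def expectation(values, probs, f=lambda x: x):
    """E f(xi) = sum_k f(x_k) p_k for a discrete r.v. (formulas (2.1), (2.2))."""
    return sum(f(x) * p for x, p in zip(values, probs))

faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6
e_xi = expectation(faces, probs)                          # 7/2
e_xi_sq = expectation(faces, probs, f=lambda x: x * x)    # 91/6
```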

Later we shall have much occasion to deal with sequences ξ_n of random variables (vectors) converging to ξ in some probability-theoretic sense. A sequence ξ_n converges to ξ: a) almost surely (a.s.) if P{ξ_n → ξ} = 1; b) in probability if P{|ξ_n − ξ| > ε} → 0 as n → ∞ for every ε > 0; c) in distribution if the sequence of distribution functions F_{ξ_n}(x) converges weakly to F_ξ(x), i.e., for any continuous bounded function f(x),

    ∫ f(x) dF_{ξ_n}(x) → ∫ f(x) dF_ξ(x)   as n → ∞.

Let ξ_n be a sequence of r.v.'s with finite expectation, converging in some sense to a r.v. ξ. Under what additional conditions can we say that

    Eξ = lim_{n→∞} Eξ_n?

General theorems of this kind were proved by Lebesgue, and may be found in any standard text on measure theory. For the reader's convenience, we state a few well-known propositions.

FATOU'S LEMMA. If ξ_n are nonnegative r.v.'s, then

    E lim inf_{n→∞} ξ_n ≤ lim inf_{n→∞} Eξ_n.

LEBESGUE'S THEOREM. If ξ_n → ξ and |ξ_n| < η with probability 1, where η is some r.v. with finite expectation, then

    Eξ = lim_{n→∞} Eξ_n.

LEMMA 2.1. If ξ_n → ξ in probability and the sequence E|ξ_n|^α is bounded for some α > 1, then ξ has finite expectation and

    Eξ = lim_{n→∞} Eξ_n.

This remains valid for convergence in distribution, and also in the case that ξ_n and ξ are random vectors in E^l.
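That some additional condition is genuinely needed is shown by a standard counterexample (ours, not the book's): on Ω = [0, 1] with Lebesgue measure, take ξ_n = n on [0, 1/n) and 0 elsewhere. Then ξ_n → 0 in probability, yet Eξ_n = 1 for every n, so Eξ_n does not converge to E(lim ξ_n) = 0 and only Fatou's inequality holds; moreover E|ξ_n|² = n is unbounded, so the hypothesis of Lemma 2.1 fails, as it must.

```python
def counterexample(n):
    """xi_n = n on [0, 1/n), 0 elsewhere, on Omega = [0,1] with Lebesgue measure.
    Returns (P{xi_n != 0}, E xi_n, E xi_n^2), all computed exactly."""
    p_large = 1.0 / n              # measure of the set where xi_n = n
    e_xi = n * (1.0 / n)           # = 1 for every n: no convergence to E(lim) = 0
    e_xi_sq = n ** 2 * (1.0 / n)   # = n, unbounded: Lemma 2.1 does not apply
    return p_large, e_xi, e_xi_sq

results = [counterexample(n) for n in (10, 100, 1000)]
```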

Suppose now that the σ-algebra 𝔄 is provided, apart from the probability measure P, with some other measure µ (not necessarily a probability measure). We shall say that µ is absolutely continuous relative to P if µ(A) = 0 whenever P{A} = 0. When can a measure µ be represented as an integral with respect to the probability measure P? In other words, when does there exist a r.v. ξ(ω) such that

    µ(A) = ∫_A ξ(ω) P{dω}

for any set A ∈ 𝔄? It is clear that a necessary condition for this to be so is that µ be absolutely continuous relative to P. It turns out that this condition is also sufficient:

RADON-NIKODYM THEOREM. Let (Ω, 𝔄, P) be a probability space and µ some measure on the σ-algebra 𝔄, absolutely continuous relative to P. Then there

is a r.v. $\xi(\omega) \ge 0$, unique up to stochastic equivalence, such that

$$\mu(A) = \int_A \xi(\omega)\, P\{d\omega\}.$$

If the measure $\mu$ is finite, then $\xi(\omega) < \infty$ with probability 1.

§3. Conditional probabilities and conditional expectations

Let $(\Omega, \mathfrak{A}, P)$ be a probability space and $B$ some event with $P\{B\} \ne 0$. We define the conditional probability $P\{A|B\}$, $A \in \mathfrak{A}$, by

$$P\{A|B\} = \frac{P\{AB\}}{P\{B\}}.$$

It is clear that for fixed $B$ the conditional probability $P\{\,\cdot\,|B\}$ is a measure on $\mathfrak{A}$. The conditional expectation $E(\xi|B)$ of a nonnegative r.v. $\xi = \xi(\omega)$ is defined as

$$E(\xi|B) = \int_\Omega \xi(\omega)\, P\{d\omega|B\}.$$

It is readily checked that

$$E(\xi|B) = \frac{1}{P\{B\}} \int_B \xi(\omega)\, P\{d\omega\}. \qquad (3.1)$$

We now wish to extend the concept of conditional probability and conditional expectation, defining them not only for a certain fixed event $B$ of nonzero probability, but for a whole $\sigma$-algebra of events. In particular, we shall then be able to consider conditional probabilities relative to certain events of probability 0. To this end, we first consider the minimal $\sigma$-algebra $\mathfrak{B}_0 \subset \mathfrak{A}$ containing a countable system of events $B_1, B_2, \ldots$, whose union is $\Omega$, such that $P\{B_i\} \ne 0$ for any $i$.

The conditional probability of an event $A$ relative to $\mathfrak{B}_0$ is defined as the r.v. $P\{\omega, A|\mathfrak{B}_0\} = P\{A|\mathfrak{B}_0\}$, which takes the constant value $P\{A|B_i\}$ on the set $B_i$:(2)

$$P\{A|\mathfrak{B}_0\} = \sum_{i=1}^{\infty} \chi_{B_i}(\omega)\, P\{A|B_i\}.$$

It is obvious that for any $\omega$ the set function $P\{\,\cdot\,|\mathfrak{B}_0\}$ is a measure on $\mathfrak{A}$. It is therefore natural to define the conditional expectation $E(\xi|\mathfrak{B}_0)$ as

$$E(\xi|\mathfrak{B}_0) = \int_\Omega \xi(\omega)\, P\{d\omega|\mathfrak{B}_0\}.$$

It is evident from the above two formulas that

$$P\{A|\mathfrak{B}_0\} = E(\chi_A|\mathfrak{B}_0),$$

(2) Here and below $\chi_A = \chi_A(\omega)$ will denote the characteristic function of the set $A$; that is, the function equal to 1 if $\omega \in A$ and to 0 if $\omega \notin A$.
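The piecewise-constant character of $E(\xi|\mathfrak{B}_0)$ can be illustrated numerically. The sketch below is our own illustration (the function name is hypothetical, not from the book): for a finite sample space $\{0, \ldots, n-1\}$ carrying the uniform measure and a partition into cells $B_i$, the conditional expectation takes on each cell the average of $\xi$ over that cell.

```python
def cond_expectation(xi, partition):
    """E(xi | B0) for a partition of {0,...,n-1} under the uniform
    probability measure: on each cell B_i the value is
    (1 / P{B_i}) * sum_{w in B_i} xi(w) P{w}, i.e. the cell average.
    Returns the conditional expectation as a list indexed by omega."""
    n = sum(len(cell) for cell in partition)
    out = [0.0] * n
    for cell in partition:
        avg = sum(xi[w] for w in cell) / len(cell)
        for w in cell:
            out[w] = avg        # constant on each B_i
    return out
```

Averaging the result over all of $\Omega$ recovers $E\xi$, which is the law of total expectation discussed below.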


i.e., conditional probability is a special case of conditional expectation. As it turns out, it is also convenient in the general case to define conditional probability as a special case of conditional expectation. Before stating the definition, we note some properties of the r.v. $E(\xi|\mathfrak{B}_0)$. First,

$$\int_B E(\xi|\mathfrak{B}_0)\, P\{d\omega\} = \int_B \xi(\omega)\, P\{d\omega\}$$

for any $B \in \mathfrak{B}_0$. This follows readily from the definition of $E(\xi|\mathfrak{B}_0)$, if we note that any set $B$ in $\mathfrak{B}_0$ is a finite or countable union of sets $B_i$ and use formula (3.1). Second, the r.v. $E(\xi|\mathfrak{B}_0)$ is constant on each set $B_i$. This is of course not true for arbitrary $\sigma$-algebras. It is readily shown, however, that any r.v. is constant on the sets $B_i$ if and only if it is $\mathfrak{B}_0$-measurable. These considerations justify the following general definitions (see Kolmogorov [1]).

DEFINITION 3.1. The conditional expectation of a nonnegative r.v. $\xi(\omega)$ relative to an arbitrary $\sigma$-algebra $\mathfrak{B} \subset \mathfrak{A}$ is defined as the $\mathfrak{B}$-measurable r.v. $E(\xi|\mathfrak{B})$ which satisfies the equality

$$\int_B E(\xi|\mathfrak{B})\, P\{d\omega\} = \int_B \xi\, P\{d\omega\} \qquad (3.2)$$

for any $B \in \mathfrak{B}$.

DEFINITION 3.2. The conditional probability $P\{A|\mathfrak{B}\} = P\{\omega, A|\mathfrak{B}\}$ of an event $A \in \mathfrak{A}$ relative to an arbitrary $\sigma$-algebra $\mathfrak{B} \subset \mathfrak{A}$ is defined as the r.v. $E(\chi_A|\mathfrak{B})$.

It is clear from (3.2) that

$$\int_B P\{A|\mathfrak{B}\}\, P\{d\omega\} = P\{AB\}, \qquad B \in \mathfrak{B},$$

and this relation, together with the stipulation that $P\{A|\mathfrak{B}\}$ be $\mathfrak{B}$-measurable, is equivalent to Definition 3.2.

It follows from the Radon-Nikodym theorem that the conditional expectation $E(\xi|\mathfrak{B})$ exists for any nonnegative r.v. $\xi$. Moreover, $E(\xi|\mathfrak{B}) < \infty$ with probability 1 if $E\xi < \infty$. This remark enables us to extend the definition of conditional expectation to an arbitrary r.v. having an expectation (not necessarily finite), by setting

$$E(\xi|\mathfrak{B}) = E(\xi^+|\mathfrak{B}) - E(\xi^-|\mathfrak{B}).$$

By the Radon-Nikodym theorem, the r.v.'s which are conditional expectations of $\xi(\omega)$ relative to a $\sigma$-algebra $\mathfrak{B}$ are all pairwise stochastically equivalent. Henceforth, the notation $E(\xi|\mathfrak{B})$ will be used for any one of these r.v.'s. A similar convention will be adopted for the conditional probability $P\{A|\mathfrak{B}\}$. In view of these conventions, equalities involving conditional expectations and conditional probabilities will be understood (though this will not always be stated explicitly) in the sense of almost sure equality (a.s.), i.e. up to events of probability zero.

We can now define the conditional expectation of a r.v. $\xi(\omega)$, $\omega \in \Omega$, relative to a random vector $\eta(\omega)$ taking values in euclidean $l$-space $E_l$, assuming that $\xi$ has an expectation. To this end, we consider the events $\{\eta \in B\}$, where $B$ is a set in the Borel $\sigma$-algebra $\mathfrak{B}_l$. The collection of events $\{\eta \in B\}$, $B \in \mathfrak{B}_l$, is clearly a $\sigma$-algebra of subsets of $\Omega$, which we denote by $\mathfrak{A}_\eta$ and call the $\sigma$-algebra generated by the random vector $\eta$. We set $E(\xi|\eta) = E(\xi|\mathfrak{A}_\eta)$ and call the r.v. $E(\xi|\eta)$ the conditional expectation of $\xi$ relative to $\eta$. Similarly, $P\{A|\eta\} = E(\chi_A|\eta)$ will be called the conditional probability of an event $A$ relative to $\eta$.

It can be shown (see, for example, Gihman and Skorohod [1]) that one can always find a $\mathfrak{B}_l$-measurable function $g(x)$, $x \in E_l$, such that $E(\xi|\eta) = g(\eta)$ with probability 1, i.e. $E(\xi|\eta)$ is a function of $\eta$. We are thus justified in defining the expectation $E(\xi|\eta = x)$ conditional on $\eta$ assuming a fixed value $x$, setting $E(\xi|\eta = x) = g(x)$. The conditional probability $P\{A|\eta = x\}$ is defined similarly.

We now consider the main properties of conditional expectations and conditional probabilities (the proofs may be found, e.g., in Kolmogorov [1], Doob [1], Gihman and Skorohod [1] or Loeve [1]). We assume throughout that the expectations of all r.v.'s $\xi$, $\eta$, $a\xi + b\eta$ and $\xi\eta$ appearing below do indeed exist.

a) $E(a\xi + b\eta|\mathfrak{B}) = aE(\xi|\mathfrak{B}) + bE(\eta|\mathfrak{B})$ for any constants $a$ and $b$.

b)
$$E(E(\xi|\mathfrak{B})) = E\xi. \qquad (3.3)$$

(This equality follows directly from (3.2) by setting $B = \Omega$.) It can be shown that when evaluating conditional expectations relative to a $\sigma$-algebra $\mathfrak{B}$ one can use $\mathfrak{B}$-measurable r.v.'s in exactly the same way as constants are used in evaluating ordinary expectations. To be precise:

c)
$$E(\xi|\mathfrak{B}) = \xi \quad \text{if } \xi \text{ is } \mathfrak{B}\text{-measurable}. \qquad (3.4)$$

(This follows from the definition of $E(\xi|\mathfrak{B})$.) Moreover,

d)
$$E(\xi\eta|\mathfrak{B}) = \xi E(\eta|\mathfrak{B}) \quad \text{if } \xi \text{ is } \mathfrak{B}\text{-measurable}, \qquad (3.5)$$

i.e. $\mathfrak{B}$-measurable r.v.'s may be brought forward through the expectation symbol.

e) If $\mathfrak{B}'$ is a $\sigma$-algebra containing $\mathfrak{B}$, then

$$E(E(\xi|\mathfrak{B})|\mathfrak{B}') = E(\xi|\mathfrak{B}), \qquad (3.6)$$

$$E(E(\xi|\mathfrak{B}')|\mathfrak{B}) = E(\xi|\mathfrak{B}). \qquad (3.7)$$

We shall say that two $\sigma$-algebras $\mathfrak{A}_1$ and $\mathfrak{A}_2$ are independent if any two events $A_1 \in \mathfrak{A}_1$ and $A_2 \in \mathfrak{A}_2$ are independent, i.e. $P\{A_1 A_2\} = P\{A_1\} P\{A_2\}$. A random variable $\xi$ is said to be independent of a $\sigma$-algebra $\mathfrak{B}$ if $\mathfrak{A}_\xi$ and $\mathfrak{B}$ are independent (where $\mathfrak{A}_\xi$ is the $\sigma$-algebra generated by $\xi$).

f)
$$E(\xi|\mathfrak{B}) = E\xi \quad \text{if } \xi \text{ and } \mathfrak{B} \text{ are independent}. \qquad (3.8)$$

g) $P\{A|\mathfrak{B}\} \ge 0$, $P\{\Omega|\mathfrak{B}\} = 1$, and

$$P\{A|\mathfrak{B}\} = \sum_{k=1}^{\infty} P\{A_k|\mathfrak{B}\} \quad \text{if } A = \bigcup_{k=1}^{\infty} A_k,$$

where the $A_k$ are pairwise disjoint sets. (Recall that these relations are the same as the corresponding properties of ordinary probabilities, which are valid here, however, only with probability 1.)

§4. Independence. Product measures

In the preceding section we defined independent events and $\sigma$-algebras. For the sequel we shall need the concepts of independent events, $\sigma$-algebras and random variables.

A sequence of events $A_1, A_2, \ldots$ is said to be independent if, for any $n_i$ and $k$, $i = 1, \ldots, k$, the events $A_{n_1}, \ldots, A_{n_k}$ satisfy

$$P\{A_{n_1} A_{n_2} \cdots A_{n_k}\} = P\{A_{n_1}\} P\{A_{n_2}\} \cdots P\{A_{n_k}\}.$$

A sequence of $\sigma$-algebras $\mathfrak{A}_1, \mathfrak{A}_2, \ldots$ is said to be independent if, for any $n_i$ and $k$, $i = 1, \ldots, k$, any events $A_{n_1} \in \mathfrak{A}_{n_1}, \ldots, A_{n_k} \in \mathfrak{A}_{n_k}$ are independent.

Random vectors $\xi_1, \ldots, \xi_n$ in $E_l$ are said to be independent if the $\sigma$-algebras $\mathfrak{A}_{\xi_1}, \ldots, \mathfrak{A}_{\xi_n}$ that they generate are independent. In particular, two r.v.'s $\xi$ and $\eta$ are independent if $P\{\xi \in A, \eta \in B\} = P\{\xi \in A\}\, P\{\eta \in B\}$ for any Borel sets $A$ and $B$.

Finally, consider two sets of random vectors $\Xi_1 = \{\xi_a, a \in A\}$ and $\Xi_2 = \{\xi_b, b \in B\}$, where $A$ and $B$ are arbitrary abstract sets. Let $\mathfrak{A}_i$ denote the minimal $\sigma$-algebra relative to which all random vectors in $\Xi_i$, $i = 1, 2$, are measurable. If $\mathfrak{A}_1$ is independent of $\mathfrak{A}_2$, we shall say that the sets $\Xi_1$ and $\Xi_2$ are independent.

Now let $(\Omega_1, \mathfrak{A}_1)$ and $(\Omega_2, \mathfrak{A}_2)$ be two measurable spaces. The direct product $\Omega = \Omega_1 \times \Omega_2$ of $\Omega_1$ and $\Omega_2$ is defined as the space whose elements are pairs of points $\omega = (\omega_1, \omega_2)$, $\omega_1 \in \Omega_1$, $\omega_2 \in \Omega_2$. In this space, we can define a $\sigma$-algebra $\mathfrak{A} = \mathfrak{A}_1 \times \mathfrak{A}_2$ generated by the sets of points $\{\omega: \omega_1 \in A_1, \omega_2 \in A_2\} = A_1 \times A_2$, where $A_1 \in \mathfrak{A}_1$ and $A_2 \in \mathfrak{A}_2$. If $\mu_1$ is a measure on $(\Omega_1, \mathfrak{A}_1)$ and $\mu_2$ a measure on $(\Omega_2, \mathfrak{A}_2)$, then the unique measure $\mu$ on $(\Omega, \mathfrak{A})$ satisfying the condition(3)

(3) To construct the measure $\mu$, one defines it on "rectangles" $A_1 \times A_2$ using (4.1), and then extends it to $\mathfrak{A}$ with the help of standard theorems on extension of measures.

$$\mu(A_1 \times A_2) = \mu_1(A_1)\, \mu_2(A_2) \qquad (4.1)$$

is known as the product of the measures $\mu_1$ and $\mu_2$, denoted by $\mu_1 \times \mu_2$.

The concept of product measure is closely related to that of independent r.v.'s. Let $\xi$ and $\eta$ be independent r.v.'s, defined on a probability space $(\Omega, \mathfrak{A}, P)$. Denote

$$P_\xi(A) = P(A), \quad A \in \mathfrak{A}_\xi; \qquad P_\eta(B) = P(B), \quad B \in \mathfrak{A}_\eta$$

($P_\xi$ is then called the restriction of $P$ to $\mathfrak{A}_\xi$). It follows from the independence of $\xi$ and $\eta$ that for any $A \in \mathfrak{A}_\xi$ and $B \in \mathfrak{A}_\eta$

$$P(AB) = P_\xi(A)\, P_\eta(B).$$

Hence, using (4.1), we deduce that the restriction of $P$ to $\mathfrak{A}_\xi \times \mathfrak{A}_\eta$ may be regarded as the direct product of the measures $P_\xi$ and $P_\eta$. Hence, in particular, we have the following proposition: if $f(x, y)$ is a Borel function, $x \in E_l$, $y \in E_k$, and $\xi$ and $\eta$ are independent r.v.'s with values in $E_l$ and $E_k$, respectively, then

$$Ef(\xi, \eta) = \int\!\!\int f(x, y)\, P_\xi\{dx\}\, P_\eta\{dy\}. \qquad (4.2)$$

Later we shall need a somewhat more general formula. Let $f(x, \omega)$ ($x \in E_l$, $\omega \in \Omega$) be a $\mathfrak{B}_l \times \mathfrak{A}$-measurable function, and $\xi$ a r.v. independent of the family $f(x, \omega)$. Let $P_f$ denote the restriction of $P$ to the minimal $\sigma$-algebra relative to which the r.v.'s $f(x, \omega)$, $x \in E_l$, are measurable. Then

$$Ef(\xi(\omega), \omega) = \int\!\!\int f(\xi(\omega_1), \omega_2)\, P_\xi \times P_f(d\omega_1 \times d\omega_2). \qquad (4.3)$$

We now state the important theorem on changing the order of integration in abstract Lebesgue integrals.

FUBINI'S THEOREM. Let $\mu_i$ be a $\sigma$-finite measure in a measurable space $(\Omega_i, \mathfrak{A}_i)$, $i = 1, 2$, and $f(\omega_1, \omega_2)$ an $\mathfrak{A}_1 \times \mathfrak{A}_2$-measurable function on $\Omega_1 \times \Omega_2$ such that

$$\int \Bigl[ \int |f(\omega_1, \omega_2)|\, \mu_2(d\omega_2) \Bigr] \mu_1(d\omega_1) < \infty.$$

Then the integral

$$\int |f(\omega_1, \omega_2)|\, \mu_1(d\omega_1)$$

is finite for $\mu_2$-almost all $\omega_2$, and, moreover,

$$\int \Bigl[ \int f(\omega_1, \omega_2)\, \mu_2(d\omega_2) \Bigr] \mu_1(d\omega_1) = \int \Bigl[ \int f(\omega_1, \omega_2)\, \mu_1(d\omega_1) \Bigr] \mu_2(d\omega_2)$$
$$= \int\!\!\int f(\omega_1, \omega_2)\, \mu_1 \times \mu_2\, (d\omega_1 \times d\omega_2).$$

We consider two simple corollaries of this theorem.

1. Let $f(\omega_1, \omega_2) = \xi(\omega_1)\eta(\omega_2)$. Then it follows from (4.2) and from Fubini's theorem that if $\xi$ and $\eta$ are independent r.v.'s then $E(\xi\eta) = E\xi\, E\eta$, provided $E\xi$ and $E\eta$ are finite.

2. Let $\xi_1, \xi_2, \ldots$ be nonnegative r.v.'s such that $\sum_1^\infty E\xi_i$ is convergent. Then $\sum_1^\infty \xi_i$ is convergent a.s. This follows from Fubini's theorem, if we note that the sum may be treated as an integral with respect to a $\sigma$-finite "counting" measure, i.e. the measure defined on the subsets of the natural number sequence and equal to the number of natural numbers in the set in question.

Using (4.3) and Fubini's theorem, the reader is invited to solve the following problem:

EXERCISE 4.1. Let $f(x, \omega)$ be a $\mathfrak{B}_l \times \mathfrak{A}$-measurable function with values in $E_k$, such that the set of random vectors $f(x, \omega)$, $x \in E_l$, is independent of the random vector $\xi = \xi(\omega)$ in $E_l$. Prove that

a) $E(f(\xi, \omega)|\xi = x) = Ef(x, \omega)$,

b) $Ef(\xi, \omega) = E\bigl\{ [Ef(x, \omega)]\big|_{x = \xi} \bigr\}$,

c) $P\{f(\xi, \omega) \in B | \xi = x\} = P\{f(x, \omega) \in B\}$, $B \in \mathfrak{B}_k$.

§5. Martingales and supermartingales

In this and the next section we give a brief account of the necessary basic properties of two types of discrete stochastic processes: martingales and Markov processes. By a discrete stochastic process we mean simply a sequence of r.v.'s $X(t) = X(t, \omega)$, $t = 0, 1, 2, \ldots$ (possibly vector-valued), defined on some probability space. As a function of $t$, the sequence $X(t, \omega)$ will be called a sample function of the process.

The martingale concept arises in a natural way in connection with the definition of a zero-sum game. If $X(t)$ is the payoff at time $t$ ($t = 1, 2, \ldots$), then the meaning of a zero-sum game should be that for any duration of the game up to time $t$ the mean payoff in the $t$th play of the game is zero. In mathematical terms, this means that $E(X(t)|X(1), \ldots, X(t-1)) = X(t-1)$. The following general definition includes the situation just described as a special case.

DEFINITION 5.1. Let $X(t)$, $t = 1, 2, \ldots$, be a sequence of r.v.'s with finite expectations, and $\mathfrak{A}_t$, $t = 1, 2, \ldots$, a sequence of imbedded $\sigma$-algebras, $\mathfrak{A}_1 \subset \mathfrak{A}_2 \subset \cdots$. Suppose that $X(t)$ is $\mathfrak{A}_t$-measurable. The pair $(X(t), \mathfrak{A}_t)$ is known as a martingale if

$$E(X(t+1)|\mathfrak{A}_t) = X(t) \quad \text{(a.s.)}, \qquad (5.1)$$

and a supermartingale if

$$E(X(t+1)|\mathfrak{A}_t) \le X(t) \quad \text{(a.s.)}. \qquad (5.2)$$

REMARK 5.1. By property (3.7) of the conditional expectation, we have for $s < t$

$$E(X(t)|\mathfrak{A}_s) = E\bigl( E(X(t)|\mathfrak{A}_{t-1}) \big| \mathfrak{A}_s \bigr) = E(X(t-1)|\mathfrak{A}_s) = \cdots = E(X(s+1)|\mathfrak{A}_s) = X(s),$$

and thus an equivalent definition of a martingale is

$$E(X(t)|\mathfrak{A}_s) = X(s), \qquad s < t.$$

Similarly, (5.2) is equivalent to

$$E(X(t)|\mathfrak{A}_s) \le X(s), \qquad s < t.$$
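Definition (5.1) can be checked exhaustively for a small example. The sketch below (our own construction, not from the book) enumerates every sample path of the symmetric $\pm 1$ random walk, so the conditional expectation $E(X(t+1)|X(1), \ldots, X(t))$ can be computed exactly as an average over all paths sharing a given prefix; for this martingale it must equal $X(t)$ on every prefix.

```python
from itertools import product

def walk_paths(T):
    """All sample paths of the symmetric +-1 random walk X(1),...,X(T),
    each carrying probability 2**-T; X(t) = e(1) + ... + e(t)."""
    paths = []
    for signs in product((-1, 1), repeat=T):
        x, path = 0, []
        for e in signs:
            x += e
            path.append(x)
        paths.append(path)
    return paths

def cond_mean_next(paths, t):
    """E(X(t+1) | X(1),...,X(t)): average of X(t+1) over all paths
    sharing a given length-t prefix, returned as {prefix: mean}."""
    groups = {}
    for p in paths:
        groups.setdefault(tuple(p[:t]), []).append(p[t])  # p[t] is X(t+1)
    return {pre: sum(v) / len(v) for pre, v in groups.items()}
```

Since each prefix extends by $+1$ or $-1$ with equal weight, every conditional mean equals the prefix's last value, which is exactly (5.1).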


LEMMA 5.1. Let $X(t)$, $t = 1, \ldots, T$, be a sequence of r.v.'s, and let $H_a^b$ denote the number of times the sequence $X(1), \ldots, X(T)$ intersects the interval $[a, b]$, $b > a$, in the upward direction. Then

$$(b - a) H_a^b \le (a - X(T))^+ + \sum_{t=1}^{T-1} I(t)\,\bigl( X(t+1) - X(t) \bigr), \qquad (5.3)$$

where the terms of the sequence $I(t)$, $t = 1, \ldots, T-1$, are 0 or 1 and $I(t)$ is uniquely determined by the values $X(1), \ldots, X(t)$.

PROOF. Let $t_1$ be the first time at which $X(t)$ takes a value smaller than $a$; $t_2$ the first time after $t_1$ at which $X(t) > b$; $t_3$ the first time after $t_2$ at which $X(t) < a$, and so on. Then the sequence $X(t)$ no longer intersects $[a, b]$ in the upward direction in the time interval $[t_{2H}, T]$, where $H = H_a^b$; moreover, it is readily seen (Figure 1) that

$$(b - a) H_a^b \le \sum_{i=1}^{H} \bigl[ X(t_{2i}) - X(t_{2i-1}) \bigr]. \qquad (5.4)$$

FIGURE 1

We now define the function $I(t)$ by stipulating that it change value from 0 to 1 or conversely at the points $t_k$ and only there,(4) and moreover that $I(t_1) = 1$. Thus, for example, $I(1) = 0$ if $t_1 > 1$ and $I(1) = 1$ if $t_1 = 1$.

(4) More precisely, the changes occur upon passage from $t_k - 1$ to $t_k$.
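The crossing construction of the proof is easy to mechanize. The following sketch (our own, with hypothetical names) records the times $t_1, t_2, \ldots$ at which a finite sequence falls below $a$ and then rises above $b$, counting the completed upward crossings $H_a^b$; the recorded pairs let one verify inequality (5.4) numerically.

```python
def upcrossings(xs, a, b):
    """Number H of upward crossings of [a, b] by the sequence xs
    (1-based times), together with the alternating crossing times
    t_1 (first drop below a), t_2 (next rise above b), t_3, ...
    as in the proof of Lemma 5.1."""
    H, times, below = 0, [], False
    for t, x in enumerate(xs, start=1):
        if not below and x < a:
            below = True
            times.append(t)          # t_{2i-1}: entered below a
        elif below and x > b:
            below = False
            times.append(t)          # t_{2i}: completed a crossing
            H += 1
    return H, times
```

Each completed crossing contributes $X(t_{2i}) - X(t_{2i-1}) > b - a$, so the recorded pairs always dominate $(b-a)H_a^b$, as (5.4) asserts.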

To determine the value of $I(t)$ at $t = T$, we consider two cases: 1) if $X(t)$ never assumes values smaller than $a$ after $t_{2H}$, then $I(T) = 0$; 2) if there is a time $t_{2H+1}$ after $t_{2H}$ at which the sequence assumes a value smaller than $a$ (for the first time after $t_{2H}$), then $I(T) = 1$.

Inequality (5.3) may be verified directly if $H_a^b = 0$. Let $H_a^b \ge 1$. It is clear that the value of $I(t)$ at a fixed time $t = 1, \ldots, T$ is uniquely determined by the values of $X(1), \ldots, X(t)$. We then have one of the equalities

$$\sum_{t=1}^{T-1} I(t)\,(X(t+1) - X(t)) = \begin{cases} \sum_{i=1}^{H} \bigl( X(t_{2i}) - X(t_{2i-1}) \bigr) & \text{in case 1)}, \\[4pt] \sum_{i=1}^{H} \bigl( X(t_{2i}) - X(t_{2i-1}) \bigr) + X(T) - X(t_{2H+1}) & \text{in case 2)}. \end{cases} \qquad (5.5)$$

It follows from (5.4) and (5.5) that

$$\sum_{t=1}^{T-1} I(t)\,(X(t+1) - X(t)) \ge \begin{cases} (b - a) H_a^b & \text{in case 1)}, \\ (b - a) H_a^b + X(T) - X(t_{2H+1}) & \text{in case 2)}. \end{cases}$$

Finally, noting that $X(t_{2H+1}) < a$, we obtain the lower bound (5.3).

LEMMA 5.2. Let $(X(t) = X(t, \omega), \mathfrak{A}_t)$ be a nonnegative supermartingale and $H_a^b = H_a^b(\omega)$ the random variable defined as the number of times the sequence $X(1), \ldots, X(T)$ intersects the interval $[a, b]$, $b > a$, in the upward direction. Then

E n t 0 , x E G. That EY(t) is finite for all t ;> t 0 follows from (2.4) and from the fact that EV(t0 , X(t0 )) is finite. This completes the proof. The following corollary of Theorem 2.1 (obtained by setting G = E1 and 1G

=

00)

is important for the sequel.

22. Let V(t, x) be a nonnegative function such that V(t, x) E DL and LV ..; 0 in the domain t ;> t 0 , x E Er Then (V(t, X(t)) , N,) is a super· martingale provided EV(t0 , X(t 0 )) < 00• THEOREM

The following exercise is a special case of Kolomogrov's inequality for supermartingales.

11. DISCRETE TIME MARKOV PROCESSES

34 EXERCISE 2.1.

Let V(x) be a function satisfying the assumptions of Theo-

rem 2.2. Show that for any C > 0 (2.5) Hint. Apply Theorem 2.1 10 the uit time T(C) of the sample function X(t) from the domain {x: V(x) < C }.
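Kolmogorov's inequality for a nonnegative supermartingale, $P\{\sup_t X(t) \ge C\} \le E X(0)/C$, can be checked by simulation. The sketch below is our own illustration (names and the particular process are ours, not the book's): a multiplicative walk with step mean below 1 is a nonnegative supermartingale, and the empirical frequency of ever reaching level $C$ must respect the bound $1/C$ when $X(0) = 1$.

```python
import random

def sup_exceeds(C, steps, rng):
    """One trajectory of a nonnegative supermartingale:
    X(t+1) = X(t) * U(t+1), U uniform on [0, 1.8] (so E U = 0.9 < 1),
    X(0) = 1.  Returns True if sup_t X(t) >= C along the path."""
    x = 1.0
    for _ in range(steps):
        if x >= C:
            return True
        x *= rng.uniform(0.0, 1.8)
    return x >= C

def exceed_frequency(C, n_paths, steps=50, seed=0):
    """Empirical estimate of P{sup_t X(t) >= C} over n_paths runs."""
    rng = random.Random(seed)
    hits = sum(sup_exceeds(C, steps, rng) for _ in range(n_paths))
    return hits / n_paths
```

With $C = 4$ the supermartingale inequality guarantees a probability of at most $1/4$; the simulated frequency is in fact far smaller, since the process drifts downward.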

§3. Recursively defined processes

Let us define a stochastic process in $E_l$ by the recurrence relation

$$X(t+1) = \Phi(t+1, X(t), \xi(t+1)), \qquad t = s, s+1, \ldots, \qquad (3.1)$$

where $\Phi(t+1, x, y)$, $x \in E_l$, $y \in E_q$, is a Borel-measurable function of the $l + q$ variables $x, y$ for each $t$, and $\xi(s+1), \xi(s+2), \ldots$ are random vectors with values in $E_q$. For the recurrence relation (3.1) to define a process $X(t)$, we must specify the initial state $X(s)$ at time $s$. The process $X(t)$ thus defined will be denoted by $X^{s, X(s)}(t)$. In particular, if $X(s) = x$ is not a random variable, then $X^{s,x}(t)$ is the process issuing at time $s$ from the fixed point $x$. It is clear that for all $t \ge u \ge s$

$$X^{s,x}(t) = X^{u,\, X^{s,x}(u)}(t). \qquad (3.2)$$

Suppose that the r.v.'s $X(s), \xi(s+1), \xi(s+2), \ldots$ are independent. Then it follows from the form of system (3.1) that the r.v.'s $X(u) = X^{s, X(s)}(u)$, $\xi(u+1), \xi(u+2), \ldots$ are also independent for $u \ge s$. In addition, the process $X(t) = X^{s, X(s)}(t)$ is uniquely determined by the values of $s$, $X(s)$ and $\xi(s+1), \ldots, \xi(t)$, i.e.

$$X^{s,x}(t) = \Psi\bigl( s, X(s), t, \xi(s+1), \ldots, \xi(t) \bigr).$$

For any $x \in E_l$, we set

$$P(s, x, t, \Gamma) = P\{X^{s,x}(t) \in \Gamma\}. \qquad (3.3)$$

It can be shown that the function thus defined is $\mathfrak{B}_l$-measurable as a function of $x$ and a measure as a function of $\Gamma$. In addition, for any $s < u < t$, the fact that $X(u)$ is independent of $\xi(u+1), \xi(u+2), \ldots$ and part b) of Exercise 1.4.1 imply that

$$P(s, x, t, \Gamma) = P\{X^{s,x}(t) \in \Gamma\} = P\{X^{u,\, X^{s,x}(u)}(t) \in \Gamma\}$$
$$= E\, \chi_\Gamma\bigl( \Psi(u, X(u), t, \xi(u+1), \xi(u+2), \ldots, \xi(t)) \bigr)$$
$$= E\, \bigl[ E\, \chi_\Gamma\bigl( \Psi(u, y, t, \xi(u+1), \xi(u+2), \ldots, \xi(t)) \bigr) \bigr]_{y = X^{s,x}(u)} = E\, P(u, X(u), t, \Gamma).$$

Hence, via (3.3), we obtain

$$P(s, x, t, \Gamma) = \int P(s, x, u, dy)\, P(u, y, t, \Gamma).$$

Thus $P(s, x, t, \Gamma)$ is a transition function. Next, using the fact that the random vector $(X(s), X(s+1), \ldots, X(u))$ is independent of $\xi(u+1)$ and part c) of Exercise 1.4.1, we see that

$$P\{X^{s,x}(u+1) \in \Gamma \mid X(s), X(s+1), \ldots, X(u)\}$$
$$= P\{\Phi(u+1, X(u), \xi(u+1)) \in \Gamma \mid X(s), X(s+1), \ldots, X(u)\}$$
$$= \bigl[ P\{\Phi(u+1, y, \xi(u+1)) \in \Gamma\} \bigr]_{y = X(u)} = P(u, X(u), u+1, \Gamma).$$

Similarly,

$$P\{X^{s,x}(t) \in \Gamma \mid X(s), X(s+1), \ldots, X(u)\} = P(u, X(u), t, \Gamma)$$

for all $t > u \ge s$. The above discussion implies

THEOREM 3.1. Let $\Phi(t, x, y)$ be a Borel-measurable function of $x$ and $y$ for every $t \ge s$, and let $X(s), \xi(s+1), \xi(s+2), \ldots$ be independent r.v.'s. Then the process $X(t) = X^{s, X(s)}(t)$ defined by (3.1) with initial state $X(s)$ is a Markov process, and its one-step transition function $P(u, x, u+1, \Gamma)$ is equal to

$$P\{\Phi(u+1, x, \xi(u+1)) \in \Gamma\}.$$
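The flow property (3.2), which drives the Chapman-Kolmogorov argument above, can be tested mechanically: running the recursion (3.1) from $s$ to $t$ must give the same state as stopping at an intermediate $u$ and restarting from $X^{s,x}(u)$, provided the same noise values are reused. A minimal sketch (function and variable names are ours):

```python
def run(phi, s, x, t, noise):
    """X^{s,x}(t) for the recursion (3.1):
    X(u+1) = phi(u+1, X(u), xi(u+1)), with X(s) = x.
    `noise[u]` supplies the value of xi(u)."""
    X = x
    for u in range(s + 1, t + 1):
        X = phi(u, X, noise[u])
    return X
```

For example, with the linear choice $\Phi(t, x, y) = 0.5x + y$ (an autoregression, our illustrative choice), `run(phi, 0, x, 5, noise)` coincides with restarting at $u = 2$ from the intermediate state, which is exactly (3.2).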

The method described in this section for deriving Markov processes from a sequence of independent random vectors $\xi(t)$ using the recurrence relation (3.1) seems at first sight highly specialized; however, it can be shown that it yields a remarkably large class of discrete Markov processes.

EXERCISE 3.1. Let $\xi(t) = (\xi_1(t), \ldots, \xi_n(t))$, $t = 1, 2, \ldots$, be independent random vectors such that $P\{\xi_i(t) = j\} = p_{ij}$, $i = 1, \ldots, n$, $\sum_j p_{ij} = 1$. Show that the recurrence relation $X(t+1) = \xi_{X(t)}(t+1)$ defines a Markov chain with transition probabilities $p_{ij}$.

Let $\Phi(t, x, \omega)$, $t = 0, 1, \ldots$, $x \in E_l$, be a set of vector-valued r.v.'s, defined on a probability space $(\Omega, \mathfrak{A}, P)$ and satisfying the following conditions (A).

A.1. The function $\Phi(t, x, \omega)$ takes values in $E_l$ and is $\mathfrak{B}_l \times \mathfrak{A}$-measurable for each $t = 0, 1, 2, \ldots$.

A.2. There exists a family of $\sigma$-algebras of events $F_t$, $t = 0, 1, 2, \ldots$, in $\Omega$ such that $F_s \subset F_t$ for $s < t$ (monotone family of $\sigma$-algebras).

A.3. The family of vectors $\Phi(t, x, \omega)$ is $F_t$-measurable and independent of $F_{t-1}$.

Consider the recursive procedure

$$X(t+1) = \Phi(t+1, X(t), \omega), \qquad (3.4)$$

with initial condition $X(s)$ measurable relative to $F_s$. The procedure (3.4) and the initial condition $X(s)$ define a stochastic process $X^{s, X(s)}(t)$. Repeating the proof of Theorem 3.1 almost word for word, we can prove

THEOREM 3.2. If conditions (A) are satisfied, then the process $X^{s, X(s)}(t)$ defined by (3.4) and an initial condition $X(s)$ measurable relative to $F_s$ is a Markov process. Its one-step transition function is given by

$$P(u, x, u+1, \Gamma) = P\{\Phi(u+1, x, \omega) \in \Gamma\}.$$

It can be shown that for any transition function $P(s, x, t, \Gamma)$ there exists a $\mathfrak{B}_l \times \mathfrak{A}$-measurable function $\Phi(t, x, \omega)$ such that the Markov process defined by (3.4) has transition function $P(s, x, t, \Gamma)$; we shall not go into details here, remarking only that the rather abstract procedure (3.4) may often be replaced by the more readily understood formula (3.1).

To end the section, we prove an auxiliary proposition.

LEMMA 3.1. Let $X(t)$ be a Markov process and let $V(t, X(t))$ have finite expectation for $t = s, s+1, \ldots$. Then

$$EV(t+1, X(t+1)) - EV(t, X(t)) = E\, LV(t, X(t)), \qquad (3.5)$$

$$EV(t+1, X(t+1)) - EV(s, X(s)) = \sum_{u=s}^{t} E\, LV(u, X(u)). \qquad (3.6)$$

PROOF. Using (1.6), we obtain

$$E\bigl( V(t+1, X(t+1)) \mid X(t) \bigr) = \int V(t+1, z)\, P(t, X(t), t+1, dz) \quad \text{(a.s.)} \qquad (3.7)$$

It follows from (3.7) and the definition of $L$ that

$$E\bigl[ V(t+1, X(t+1)) - V(t, X(t)) \mid X(t) \bigr] = \int \bigl[ V(t+1, z) - V(t, X(t)) \bigr] P(t, X(t), t+1, dz) = LV(t, X(t)). \qquad (3.8)$$

Taking expectations on both sides of (3.8), we get (3.5). Formula (3.6) is an obvious consequence of (3.5).

E and EIX(s)I"


O,

(5. 13)

~ . g (t ) k > 0 (or b(t. x) < k < 0), ,,2(1, x) < ai fort> 0, x E (r1 , r 2 ), and the r.v.'s ( 11 (t) are bounded a.s. by sone constant C Hirit.

Use lhe method of proof of the last theorem and lhe result of Exercise 4 .3.
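The telescoping identities (3.5) and (3.6) of Lemma 3.1 admit an exact finite-state check: for a chain with one-step matrix $P$, the generating operator is $LV(t, x) = \sum_z P_{xz} V(t+1, z) - V(t, x)$, and each one-step change of $EV$ must equal $E\,LV$. The sketch below (our own names and example) computes both sides exactly by propagating the state distribution.

```python
def ev_increments(P, V, dist0, T):
    """Exact check of (3.5) for a finite-state Markov chain.
    P is the one-step transition matrix, V(t, x) a function,
    dist0 the initial distribution.  Returns pairs
    (EV(t+1) - EV(t), E LV(t, X(t))) for t = 0, ..., T-1."""
    n = len(P)
    dist, out = list(dist0), []
    for t in range(T):
        ev_t = sum(dist[x] * V(t, x) for x in range(n))
        elv = sum(dist[x] * (sum(P[x][z] * V(t + 1, z) for z in range(n))
                             - V(t, x)) for x in range(n))
        # advance the distribution one step: dist'(z) = sum_x dist(x) P[x][z]
        dist = [sum(dist[x] * P[x][z] for x in range(n)) for z in range(n)]
        ev_next = sum(dist[z] * V(t + 1, z) for z in range(n))
        out.append((ev_next - ev_t, elv))
    return out
```

The two entries of every pair agree up to rounding, which is precisely identity (3.5); summing them over $t$ gives (3.6).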

§7. Convergence of sample functions

It follows from Exercise 5.1 and Theorem 5.2 that under certain assumptions the sample functions of a Markov process $X(t)$, $t \ge 0$, converge to a set $B$ over some (generally random) sequence of instants of time. In this section we establish conditions guaranteeing a more general proposition: $X(t) \to B$ (a.s.) as $t \to \infty$.

For any function $W(t, x)$, $t \ge 0$, $x \in E_l$, we set

$$\overline{W}(x) = \sup_{t \ge 0} W(t, x), \qquad \underline{W}(x) = \inf_{t \ge 0} W(t, x). \qquad (7.1)$$
THEOREM 7.1. Suppose that there exist a nonnegative function $V(t, x) \in D_L$ in the domain $t \ge 0$, $x \in E_l$, and a set $B$, such that

$$\lim_{|x| \to \infty} \underline{V}(x) = \infty, \qquad (7.2)$$

$$LV(t, x) \le g(t)\bigl[ 1 + V(t, x) \bigr] - \alpha(t)\, \varphi(t, x), \qquad \varphi(t, x) \in \Phi(B). \qquad (7.3)$$

Let the functions $\alpha(t)$ and $g(t)$ satisfy conditions (5.1), (5.13) and

$$\inf_{x \notin U_R(B)} \underline{V}(x) > 0 \quad \text{for any } R > 0, \qquad (7.4)$$

$$V(t, x) = 0 \ \text{for } x \in B, \quad \text{and in addition} \quad \lim_{x \to B} \overline{V}(x) = 0. \qquad (7.5)$$

Then the Markov process $X(t)$ with generating operator $L$ and arbitrary initial distribution converges a.s. to the set $B$ as $t \to \infty$. In addition,

$$\lim_{t \to \infty} E\bigl[ V(t, X(t)) \bigr]^a = 0$$

for any positive $a < 1$, provided $EV(0, X(0)) < \infty$.

PROOF. The first part of the theorem need be proved only for a process $X(t)$ with arbitrary deterministic initial condition $X(0) = x$. Set

$$W(t, x) = \bigl( 1 + V(t, x) \bigr) \prod_{u=t}^{\infty} \bigl( 1 + g(u) \bigr).$$
Then (see Remark 5.1) $LW(t, x) \le -\alpha(t)\, \varphi(t, x)$.

Consider now processes of the form

$$X(t+1) = X(t) + \alpha(t+1)\, F(t+1, X(t)) + \beta(t+1)\, G(t+1, X(t), \omega), \qquad t = s, s+1, \ldots, \quad X(s) = x, \qquad (7.6)$$

where $s \ge 0$ and $x \in E_l$; $F(t, x) = F(x) + q(t, x)$ and $G(t, x, \omega)$ are vectors in $E_l$; the function $\Phi(t, x, \omega)$ given by the right-hand side of (7.6) satisfies conditions (A) of §3, with $EG(t, x, \omega) = 0$; and $\alpha(t)$, $\beta(t)$ are sequences of positive numbers.

Theorem 7.1 has a simple corollary for processes of type (7.6) in the one-dimensional case:
THEOREM 7.2. Let $F(t, x)$ and $G(t, x, \omega)$ be functions with values in $E_1$ such that $F(x_0) = 0$,

$$\sup_{\varepsilon < |x - x_0| < 1/\varepsilon} F(x)\,(x - x_0) < 0 \quad \text{for every } \varepsilon > 0, \qquad (7.7)$$

$$F^2(x) + E G^2(t, x, \omega) \le K_1 (1 + x^2) \qquad (7.8)$$

(here and below, $K_i$ are positive constants), and

$$\alpha(t) > 0, \quad \beta(t) > 0, \quad \sum_{t=0}^{\infty} \alpha(t) = \infty, \qquad (7.9)$$

where $\gamma(t) = \sup_{x \in E_1} |q(t, x)|$ and

$$\sum_{t=0}^{\infty} \bigl[ \alpha^2(t) + \beta^2(t) + \alpha(t)\gamma(t) + \alpha^2(t)\gamma^2(t) \bigr] < \infty. \qquad (7.10)$$

Then the process $X^{s,x}(t)$ defined by (7.6) converges a.s. to $x_0$ as $t \to \infty$, for all $s \ge 0$ and $x \in E_1$.
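A one-dimensional instance of the recursion (7.6) is easy to simulate. The sketch below is our own illustration (names, step sizes and noise model are ours): with $F(x) = -x$, root $x_0 = 0$, $\beta(t) = \alpha(t) = 1/(t+1)$ and zero-mean Gaussian noise, the sign condition (7.7) holds and the step conditions match the pattern of (7.9) and (7.10), so the iterates approach the root.

```python
import random

def robbins_monro(F, x0, T, noise_scale=1.0, seed=0):
    """Sketch of recursion (7.6) with beta(t) = alpha(t) = 1/(t+1):
    X(t+1) = X(t) + alpha(t+1) * (F(X(t)) + G), where the noise G
    has mean zero (E G = 0), matching the assumptions on (7.6)."""
    rng = random.Random(seed)
    x = x0
    for t in range(1, T + 1):
        a = 1.0 / (t + 1)                  # sum a(t) = inf, sum a^2(t) < inf
        g = rng.gauss(0.0, noise_scale)    # E G = 0
        x = x + a * (F(x) + g)
    return x
```

Starting far from the root, the iterate is driven to a small neighborhood of $x_0 = 0$; the residual fluctuation shrinks like the tail of $\sum \alpha^2(t)$.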


Indeed, assuming for simplicity's sake that $x_0 = 0$ (which clearly involves no loss of generality), let us try to find a function $V(t, x)$ satisfying the assumptions of Theorem 7.1, letting $B$ be the set containing the single point $x = 0$. Set $V(t, x) = x^2$.

Using (1.7), (7.10) and the identity $EG_1(t, x, \omega) = 0$, we obtain

$$Lx^2 = E\bigl( x + \alpha(t) F(t, x) + \beta(t) G_1(t, x, \omega) \bigr)^2 - x^2$$
$$= 2\alpha(t) F(t, x)\, x + \alpha^2(t) F^2(t, x) + \beta^2(t)\, E G_1^2(t, x, \omega)$$
$$\le 2\alpha(t) F(x)\, x + 2\alpha(t)\gamma(t)\, |x| + K_2 \bigl( \alpha^2(t) F^2(x) + \beta^2(t)\, E G_1^2(t, x, \omega) + \alpha^2(t)\gamma^2(t) \bigr).$$

Thus, by (7.7), (7.8) and the inequality $|x| \le 1 + x^2$,

$$Lx^2 \le -2\alpha(t)\, |F(x)\, x| + g(t)\,(1 + x^2),$$

where $g(t) = K_3\bigl( \alpha^2(t) + \beta^2(t) + \alpha(t)\gamma(t) + \alpha^2(t)\gamma^2(t) \bigr)$. Thus, by (7.9), the function $V(t, x) = x^2$ satisfies the assumptions of Theorem 7.1 for $B = \{0\}$,
It is natural to study the behavior of the sample functions of the process xs,x(r) without the assumption that the set B = {x: F(x) = O} is a singleton. It

is readily seen that Theorem 7 .2 remains valid when

F(x)=

0, >0, { (B}, it follows from (5.9) that there exist n 2 (w) and 6(w) such that q>(u, X'•"(u))=6(w)>0

inf

for

n>n 2 (w).

,;.:!:i:1.1

n = max(n 1, n 2 ) and using (5.10) and (7.17), we have

00

~

0.

(u)

(u, X'·"(u))~6(w) ~

1;.+1- 1

~ a(u)=oo.

~ u-t;

From this contradiction and (7.16) it follows that

$$\lim_{t \to \infty} \rho\bigl( X^{s,x}(t), B_i \bigr) = 0 \quad \text{for } \omega \in \Omega_1.$$

2) Let $C_\varepsilon$ denote the $\varepsilon$-neighborhood of the set $E_1 \setminus B_i$, and $C_{2\varepsilon}$ the $\varepsilon$-neighborhood of $C_\varepsilon$. Suppose that for some $\omega \in \Omega_1$ there exist sequences $\tau_n' = \tau_n'(\omega)$ and $\tau_n'' = \tau_n''(\omega)$, $\tau_{n+1}' > \tau_n'' > \tau_n'$, $n = 1, 2, \ldots$, and a number $\varepsilon = \varepsilon(\omega) > 0$ for which $X^{s,x}(\tau_n') \in C_\varepsilon$, $X^{s,x}(\tau_n'') \notin C_{2\varepsilon}$ and $X^{s,x}(t) \in B_i$ for $\tau_n' < t < \tau_n''$. Then, by the equalities

$$X^{s,x}(\tau_n'') - X^{s,x}(\tau_n') = \sum_{u = \tau_n'}^{\tau_n'' - 1} \bigl[ \alpha(u+1)\, F\bigl( u+1, X^{s,x}(u) \bigr) + \beta(u+1)\, G\bigl( u+1, X^{s,x}(u), \omega \bigr) \bigr],$$

we have

$$\varepsilon \le \bigl| X^{s,x}(\tau_n'') - X^{s,x}(\tau_n') \bigr| \le \sum_{u = \tau_n'}^{\tau_n'' - 1} \alpha(u)\, \gamma(u) + \Bigl| \sum_{u = \tau_n'}^{\tau_n'' - 1} \beta(u+1)\, G\bigl( u+1, X^{s,x}(u), \omega \bigr) \Bigr|.$$

Letting $n \to \infty$ and using (7.15) and the condition

$$\sum_u \alpha(u)\, \gamma(u) < \infty,$$

we arrive at a contradiction. It remains to verify that the series of noise terms converges a.s. By property b) of Exercise 1.4.1,

$$\sum_u E\, \bigl| G\bigl( u+1, X^{s,x}(u), \omega \bigr) \bigr|^2 = \sum_u E\, \bigl[ E\, | G(u+1, y, \omega) |^2 \bigr]_{y = X^{s,x}(u)}$$
$$\le \sum_u g_1(u)\, \bigl[ 1 + EV\bigl( u, X^{s,x}(u) \bigr) \bigr] \le C \sum_u g_1(u) < \infty.$$

Thus the series (7.13) is convergent.


§8. Teaching pattern recognition

Following ABR [1], Chapter V, let us consider the simplest formulation of the problem of teaching a machine to recognize patterns belonging to one of two classes. The possible patterns in this version of the problem will be the points of a certain subset of $E_l$, assumed to be the union of two disjoint components $A$ and $B$, each of which will be identified with one of the classes to be recognized. Suppose, furthermore, that there exists an $(l-1)$-dimensional hyperplane through the origin that separates the patterns (Figure 3). Thus we can choose a vector $x^{(0)}$ normal to this hyperplane in such a way that $(y, x^{(0)}) > 0$ for $y \in A$ and $(y, x^{(0)}) < 0$ for $y \in B$. In fact, we shall assume a slightly more restrictive condition: for some $\varepsilon > 0$,

$$(y, x^{(0)}) \ge \varepsilon \quad \text{for } y \in A, \qquad (y, x^{(0)}) \le -\varepsilon \quad \text{for } y \in B.$$

FIGURE 3

Suppose now that when each "pattern", i.e. point of $A \cup B$, is presented, the "teacher" is able to "tell" the machine to which class the point belongs. This means that the machine can be "fed" the value of $\operatorname{sgn}(y, x^{(0)})$ for any pattern $y$.(6)

To complete this description of the model, it remains to note that the patterns in $A \cup B$ are presented independently, with some (generally unknown) distribution $P(dy)$ such that $P(A \cup B) = 1$. Let $\eta(t)$, $t = 1, 2, \ldots$, denote the sequence of patterns presented to the machine from $A \cup B$. Since the set $A \cup B$ is bounded, we may assume that for some nonrandom constant $K$

$$P\{|\eta(t)| < K\} = 1. \qquad (8.1)$$

The problem is as follows: using the data communicated by the teacher

(6) The machine "knows" that a positive scalar product means a pattern of class $A$, and a negative product a pattern of class $B$.
(that is, the values of $\operatorname{sgn}(\eta(t), x^{(0)})$, $t = 1, 2, \ldots$), construct a sequence of hyperplanes converging to a hyperplane separating the classes $A$ and $B$ with probability 1.

Before proceeding to the solution, we present a rigorous definition of a hyperplane separating $A$ and $B$ with probability 1. We note first that the only information forthcoming about the classes $A$ and $B$ is contained in the sequence of independent random vectors $\eta(t)$, $t = 1, 2, \ldots$, and it is the patterns in this sequence which are presented to the machine for recognition after a certain initial learning period. Hence the following natural definition, in which $\bar{x}$ and $t_0$ are random variables.

DEFINITION. Let $\bar{x}$ be a vector normal to a hyperplane $Q$. We shall say that the hyperplane separates the classes $A$ and $B$ for $t \ge t_0$ with probability 1 if

$$P\bigl\{ \operatorname{sgn}(\bar{x}, \eta(t)) = \operatorname{sgn}(x^{(0)}, \eta(t)) \ \text{for } t \ge t_0 \bigr\} = 1. \qquad (8.2)$$

Summarizing, we obtain the following mathematical formalization of our problem. Let $\eta(t)$, $t = 1, 2, \ldots$, be a sequence of independent random vectors in $E_l$ satisfying condition (8.1) and such that for some $x^{(0)} \in E_l$ and $\varepsilon > 0$

$$P\bigl\{ |(x^{(0)}, \eta(t))| > \varepsilon \bigr\} = 1. \qquad (8.3)$$

Assume that the value of $\zeta(t) = \operatorname{sgn}(x^{(0)}, \eta(t))$ may be observed for each $t$. Put the vector $\eta(t)$ in class $A$ if $(x^{(0)}, \eta(t)) > 0$, and in class $B$ if $(x^{(0)}, \eta(t)) < 0$. Using the observations $\zeta(t)$, $t = 1, 2, \ldots$, it is required to construct a vector $X(t)$ such that $X(t) \to \bar{x}$ (a.s.), where $\bar{x}$ is a vector satisfying condition (8.2) for some a.s. finite r.v. $t_0$. The solution is furnished by the following theorem:

THEOREM 8.1. Under conditions (8.1) and (8.3), the Markov process $X(t)$ defined recursively by

$$X(t+1) - X(t) = \frac{\operatorname{sgn}(x^{(0)}, \eta(t+1)) - \operatorname{sgn}(X(t), \eta(t+1))}{2}\, \eta(t+1), \qquad X(0) = x, \qquad (8.4)$$

for any $x \in E_l$ has the following properties:

1) $X(t) \to \bar{x}$ a.s. as $t \to \infty$, where $\bar{x}$ is a vector normal to a plane which separates $A$ and $B$ with probability 1 for $t > t_0(\omega)$, where $t_0(\omega)$ is a r.v. which is finite with probability 1.

2) $X(t) = \bar{x}$ for $t > t_0(\omega)$, so that the "learning" process (8.4) terminates a.s. after finitely many steps.
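Rule (8.4) is the classical perceptron correction: $X$ changes only when the current hyperplane misclassifies the presented pattern, and then it changes by $\pm\eta$. The sketch below (our own implementation; function names are ours) cycles through a fixed separable pattern list until a full pass makes no correction, illustrating the finite-termination property 2).

```python
def sgn(v):
    return 1.0 if v >= 0 else -1.0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def learn(x0, patterns, sweeps=100):
    """Rule (8.4): X changes by r * eta with
    r = (sgn(<x0, eta>) - sgn(<X, eta>)) / 2, so corrections occur
    only on mistakes.  Returns the learned X and the number of
    corrections made before a clean full sweep."""
    X, fixes = [0.0] * len(x0), 0
    for _ in range(sweeps):
        changed = False
        for eta in patterns:
            r = (sgn(dot(x0, eta)) - sgn(dot(X, eta))) / 2.0
            if r != 0.0:
                X = [xi + r * ei for xi, ei in zip(X, eta)]
                fixes += 1
                changed = True
        if not changed:
            break
    return X, fixes
```

Because the patterns are separated with a positive margin, the number of corrections is finite (this is the content of the theorem), after which the learned hyperplane classifies every pattern exactly as $x^{(0)}$ does.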

PROOF. We first observe that the process $X(t)$ remains unchanged if we replace $x^{(0)}$ in (8.4) by $cx^{(0)}$, where $c$ is any positive number. Thus, by (8.1) and (8.3), we may assume without loss of generality that

$$2\,\bigl| (x^{(0)}, \eta(t+1)) \bigr| - |\eta(t+1)|^2 > 1 \quad \text{(a.s.)} \qquad (8.5)$$

Now consider the function $V(x) = |x - x^{(0)}|^2$. Letting $L$ denote the generating operator of $X(t)$, we have

$$LV(x) = -2\, E\bigl\{ (x^{(0)} - x, \eta(t+1))\, r(x, \eta(t+1)) \bigr\} + E\bigl\{ r^2(x, \eta(t+1))\, |\eta(t+1)|^2 \bigr\}, \qquad (8.6)$$

where

$$r(x, \eta(t+1)) = \frac{\operatorname{sgn}(x^{(0)}, \eta(t+1)) - \operatorname{sgn}(x, \eta(t+1))}{2}.$$

It follows readily from (8.6) and the inequalities

$$r^2(x, y) = |r(x, y)| \le 1, \qquad \bigl[ (x^{(0)}, y) - (x, y) \bigr] r(x, y) \ge 0,$$
$$(x^{(0)}, y)\, r(x, y) \ge 0, \qquad (x, y)\, r(x, y) \le 0,$$

which are valid for all $x$ and $y$, that

$$LV(x) \le E\bigl\{ \bigl( |\eta(t+1)|^2 - 2\, |(x^{(0)}, \eta(t+1))| \bigr)\, |r(x, \eta(t+1))| \bigr\}.$$

This inequality and (8.5) imply the fundamental estimate

$$LV(x) \le -E\, |r(x, \eta(t+1))|. \qquad (8.7)$$

r (P(s, x, t, A)= I), and satisfies {2.l.l)

< u < t.

Let P(s, x, t, f) be a transition function in A, A E t\1. A process X(t) = X(t, w) in A is called a Markov process with this transition function if, for any s

< u < t, P {X (t} Er/,#'.,}

=

P (u, X (u) , t , r) (a.s.)

As usual, Nu is the o-algebra generated by the r.v.'s X(v), v < u. It is readily seen that the properties of discrete time Markov processes established in §2.1 remain valid for the continuous case . As in the discrete case, we understand the concept of conditionaJ probability P{X(t) E r1X(u)}, u < t , in its regular sense, i.e. we assume that P{X(t) E r1X(u)} is a measure as a function of for any fixed X(u) = x E £ 1 (for details, see §2.1). If the transition probability P(s, x , t, r) has a deusity relative to Lebesgue measure for any S, t E T and )C e £,, so that P(s, X, t, = f rp(s, x, t, y)dy, we shall call p(_s, x, t, y) the transition density of the Markov process. The transition function P(s, x , t, I"') is said to be temporally homogeneous

r

n

if P(s, x, t + s, f) is independent of s. A temporally homogeneous transition function is a function of one time argument, for in that case we have P(s, x, t, f) P(O, X, t - s, and we may therefore denote it by P(x, t - S, f). Similar notation may be introduced for the density of a homogeneous Markov p rocess (if it exists).

n.

=

..{ X(r)

(1) Proceue, X(t) and Y(t), t E T, are u id to be sroclwutic,llly equivolent on T if Y(r) } I for all , E T.

=

=

§ 1. CONTINUOUS TIME MARKOV PROCESSES


We now consider an important example of a Markov process, which will be used below. A separable process ξ(t) = ξ(t, ω) with values in E₁, t ≥ 0, is called a standard Wiener process if it satisfies the following conditions:

1. ξ(0) = 0, Eξ(t) = 0 and Eξ²(t) = t.
2. The increments ξ(t) − ξ(u) and ξ(v) − ξ(s) of the process are independent for any t > u ≥ v > s.
3. For any fixed t, h ≥ 0, the r.v. ξ(t + h) − ξ(t) is normally distributed.

For t > a and f(s) = f(s, ω) ∈ 𝔐[a, T] we have

  ∫_a^t f(s) dξ(s) = ∫_a^T χ_t(s) f(s) dξ(s),   (3.10)

where

  χ_t(s) = 1 if s ≤ t,   χ_t(s) = 0 if s > t.

(This is obviously true for step functions. The general case is proved by a limit procedure.) Now let τ be an arbitrary random variable. We define the integral with random upper limit by

  ∫_a^τ f(s) dξ(s) = G(τ),

where G(t) is given by (3.9). If moreover τ is a random variable independent of the future, i.e. such that {ω: τ < t} ∈ F_t, and in addition P{τ < ∞} = 1, ...

Let ξ₁(t), ..., ξ_k(t), t ≥ t₀ > 0, be Wiener processes such that ξ_i(t) are F_t-measurable, and the r.v.'s ξ_i(t + s) − ξ_i(t), i = 1, ..., k, are independent of any event in F_t for s > 0. Let b(t, x) and σ₁(t, x), ..., σ_k(t, x) be vectors in E₁, defined for x ∈ E₁ and t ∈ [t₀, T], t₀ ≥ 0, and measurable as functions of (t, x). A solution of the stochastic differential equation

  dX(t) = b(t, X(t)) dt + Σ_{r=1}^k σ_r(t, X(t)) dξ_r(t)   (4.1)

is defined as a solution of the corresponding integral equation

  X(t) = X(t₀) + ∫_{t₀}^t b(u, X(u)) du + Σ_{r=1}^k ∫_{t₀}^t σ_r(u, X(u)) dξ_r(u),   (4.2)

where X(t₀) is some given initial condition which is F_{t₀}-measurable, t₀ ≥ 0. Throughout the sequel we shall call the separable stochastic process X(t) a solution of equation (4.2) on [t₀, T] if the r.v. X(t) is F_t-measurable, the integrals in (4.2) exist, and (4.2) holds for any t ∈ [t₀, T]. The following theorem establishes the existence, uniqueness and certain properties of the solution of equation (4.2).

THEOREM 4.1. Let the vectors b(u, x) and σ₁(u, x), ..., σ_k(u, x) (t₀ ≤ u ≤ t; x ∈ E₁)


0 for x -4: 0; and c) W(x) is a linear function of x for Ix I ;. 6. It is readily seen that PROOF.

LW (x)~ -a (t) IR (.z) W' (x) I+ Q,

= 0.

Consider

ati'),

Q, = Q max I W" (z) I, zEEt

where L is the operator defined by (3.4). Thus our assertion will follow from Theorem 3.8.1 if we set a(t) = a(t), g(t) Q 1a2 (t)/2 and fl(_t, x) IR(x) W'(x)I.

=

=

Note that the assumptions of Theorem 4.2 impose no restrictions on the rate of increase of |R(x)| and |σ(t, x)| as functions of x. An example, to be studied later, will show that there is no parallel result for discrete procedures. We now deal with conditions for an s.a. procedure (4.1) to converge to x₀ in probability with increasing time.

§4. CONVERGENCE OF THE ROBBINS-MONRO PROCEDURE

EXERCISE 4.1. Let α(t) > 0 and β(t) > 0 (t ≥ t₀) be bounded functions such that

  ∫_{t₀}^∞ α(t) dt = ∞,   κ(t) = ∫_{t₀}^t β(u) exp{−∫_u^t α(v) dv} du → 0

as t → ∞. Suppose that there exists a function V(t, x) ∈ C₂ such that(⁴)

  V(t, x) > 0 for x ≠ 0,   lim_{|x|→∞} inf_t V(t, x) = ∞,

and

  LV(t, x) ≤ −α(t) V(t, x) + β(t),   t ≥ t₀.   (4.8)

1) Prove that any solution X^{s,x}(t), s ≥ t₀, x ∈ E₁, of equation (4.1) converges in probability to zero as t → ∞.
2) If in addition V(t, x) ≥ k|x − x₀|^γ, k > 0, γ > 0, then also E|X^{s,x}(t)|^γ → 0 as t → ∞.
3) Show that κ(t) → 0 as t → ∞ if β(t)/α(t) → 0 as t → ∞ and ∫_{t₀}^∞ α(t) dt = ∞.

Hint. Consider the function W(t, x) = V(t, x) exp{∫_{t₀}^t α(u) du}; use Lemma 3.5.1, Fatou's lemma and the regularity of the process to prove that

  EW(t, X^{s,x}(t)) − W(s, x) ≤ ...

  X(t+1) − X(t) = a(t)(R(X(t)) + G(t+1, X(t), ω)) = a(t)Φ(t+1, X(t), ω),   t = 1, 2, ...   (4.12)

where X, R and G are vectors in E₁, and the function

  Φ(t, x, ω) = (Φ₁(t, x, ω), ..., Φ_l(t, x, ω))

satisfies conditions (A) of §2.3, EG(t, x, ω) = 0, and a(t) is a positive sequence.

Let 𝔇 denote the set of functions in C₂ with bounded partial derivatives of second order. The following analogs of Theorems 4.1 and 4.3 are valid for system (4.12).

THEOREM 4.4. Assume that there exists a function V(x) ∈ 𝔇 satisfying the conditions

  V(x) > 0 (x ≠ x₀),   V(x₀) = 0,   V(x) → ∞ as |x| → ∞,

and, for any ε > 0,

  sup_{ε < |x−x₀| < 1/ε} (R(x), ∂V(x)/∂x) < 0.

Combining (4.16)–(4.18), we get

  lim_{t→∞} E f(t, X^x(t)) = 0,

and so (since f(x) > 0 for x ≠ 0) the assertion follows.

EXERCISE 4.2. Show that a) condition (4.16) may be replaced by the conditions

  lim_{t→∞} β(t) = 0,   lim_{t→∞} β(t)/α(t) = 0,

and b) condition (4.17) is satisfied if (4.16) holds.

THEOREM 4.5. Assume that there exists a function V(x) satisfying the conditions (4.10), (4.11), (4.13) and (4.14), and let

  Σ_{t=1}^∞ a(t) = ∞,   a(t) → 0 as t → ∞.

Then the process X^x(t) defined by (4.12) converges in probability to x₀ as t → ∞, for any x ∈ E₁.
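A procedure of the form (4.12) is straightforward to simulate. A minimal one-dimensional sketch; the regression function R, the noise G, the gain a(t) = a/t and the starting point are all illustrative assumptions:

```python
import random

def robbins_monro(noisy_R, x0, steps=20000, a=1.0):
    """X(t+1) = X(t) + a(t) * (R(X(t)) + G(t+1)), with a(t) = a/t, so that
    sum a(t) = infinity and a(t) -> 0, as Theorem 4.5 requires."""
    x = x0
    for t in range(1, steps + 1):
        x = x + (a / t) * noisy_R(x)
    return x

# Hypothetical regression function R(x) = -(x - 2) observed with N(0,1) noise;
# its root x0 = 2 is the point the procedure should approach.
random.seed(0)
estimate = robbins_monro(lambda x: -(x - 2.0) + random.gauss(0.0, 1.0), x0=10.0)
```

Here `noisy_R` returns a single noisy measurement R(x) + G, so each iteration uses exactly one observation, as in the discrete procedure of the text.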

The proof follows from Lemma 4.1 (see also Exercise 4.2) and is left to the reader.

§5. Convergence of the Kiefer-Wolfowitz procedure

We now study convergence conditions for the s.a. procedures introduced in §§2 and 3 for locating the maximum point x₀ of an unknown function f(x).


We first consider a continuous procedure, described (in the notation of §§2 and 3) by the equation

  dX = a(t) ∇_c f(X) dt + (a(t)/c(t)) Σ_{r=1}^k σ_r(t, X) dξ_r(t).   (5.1)

The differential generating operator of the Markov process X^x(t), X^x(0) = x, defined by equation (5.1) is given by (5.2). The results of §3.8 yield convergence conditions for the procedure (5.1), under various assumptions on the coefficients ∇_c f(x) and σ_r(t, x) of equation (5.1). We shall consider several such conditions in the following examples. The first example shows that Theorem 3.8.1 is applicable even when c(t) does not tend to zero with increasing t.

EXAMPLE 5.1. Let f(x) = −(F(x − x₀), (x − x₀)), where F is a positive definite matrix; in other words, f(x) < 0 for x ≠ x₀. Then it is easy to see that ∇_c f(x) = ∂f(x)/∂x. Therefore, setting V(t, x) = −f(x), we see that

  LV = −a(t) |∂f(x)/∂x|² + (a²(t)/c²(t)) Σ_{r=1}^k (Fσ_r(t, x), σ_r(t, x)).

Applying Theorem 3.8.1, we deduce that if condition (4.7) is fulfilled the process X^x(t) converges to x₀ a.s., provided the functions a(t) and c(t) satisfy the conditions

  ∫^∞ a(t) dt = ∞,   ∫^∞ (a²(t)/c²(t)) dt < ∞.

This recalls a familiar strategy for locating the maximum point of a parabola: one must first "aim" at extreme values, i.e. select large values of c(t) (see the second example of §10.3). The fact that in this example we can guarantee arbitrarily fast convergence of X^x(t) to x₀ is due to the precise knowledge we have of the structure of f(x) in terms of the parameters F and x₀. It should be clear that the procedure of Example 5.1 "will not work" for a slightly different function f(x). It would therefore be interesting to devise a procedure (i.e. to select functions a(t) and


IV. CONVERGENCE OF STOCHASTIC APPROXIMATION PROCEDURES. I

c(t)) such that the process X^x(t) will converge to x₀ for as broad a range of functions f(x) as possible, provided they possess a single maximum. A natural restriction on the functions f(x) is that the point x₀ be uniformly asymptotically stable in the large for the pure gradient procedure

  Ẋ = a ∂f(X)/∂x,   (5.4)

where a is a constant. It then follows from results of Krasovskiĭ ([1], Theorem 5.3) that there exists a function V(x) with the properties

  V(x) > 0 (x ≠ x₀),   V(x₀) = 0,   (∂f(x)/∂x, ∂V(x)/∂x) < 0 for x ≠ x₀.

Since the functions −f′(x)x and W′(x)x are positive for x ≠ 0, it clearly follows that

  inf_{r < |x| < 1/r} |∇_c f(x) W′(x)| > 0.

Our assertion now follows from Theorem 3.8.1, with B = {0}.

Note that we have proved convergence of X^x(t) to x₀ in Theorem 5.1 on the sole assumption that the stationary point (maximum of f(x)) is unique, and the function itself has a continuous derivative. There are no restrictions at all on the rate of increase of |f(x)| as |x| → ∞. As mentioned previously, there is no valid analog of this result for discrete s.a. procedures. In addition, the only conditions on a(t) and c(t) were (5.3) and c(t) → 0 as t → ∞. Usually, however, one requires that c(t) tend to zero sufficiently rapidly as t → ∞. We have eliminated this restriction by separate treatment of the domains |x| < c(t) and |x| ≥ c(t) in application of Theorem 3.8.1. We now consider the discrete Kiefer-Wolfowitz s.a. procedure of §2:

  X(t+1) − X(t) = a(t) ∇_c f(X(t)) + (a(t)/c(t)) G(t+1, X(t), ω) = Φ(t+1, X(t), ω),   (5.10)

where EG(t, x, ω) = 0 for any t ≥ 1 and x ∈ E₁. The generating operator of the Markov process X^x(t) defined by (5.10) maps any function V(t, x) in its domain onto

  LV(t, x) = EV(t+1, x + a(t) ∇_c f(x) + (a(t)/c(t)) G(t+1, x, ω)) − V(t, x).   (5.11)
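In one dimension, the finite-difference gradient estimate ∇_c f in (5.10) can be written out explicitly. A minimal sketch with illustrative gain sequences a(t) = 1/t and c(t) = t^{-1/4}, and a hypothetical noisy measurement of f (none of these specific choices come from the text):

```python
import random

def kiefer_wolfowitz(noisy_f, x0, steps=20000):
    """X(t+1) = X(t) + a(t) * [f(X + c(t)) - f(X - c(t))] / (2 c(t)),
    each f-value observed with zero-mean noise, as in procedure (5.10)."""
    x = x0
    for t in range(1, steps + 1):
        a, c = 1.0 / t, t ** -0.25
        x = x + a * (noisy_f(x + c) - noisy_f(x - c)) / (2.0 * c)
    return x

# Hypothetical objective f(x) = -(x - 1)**2 measured with N(0, 0.1) noise;
# its maximum point is x0 = 1.
random.seed(1)
xhat = kiefer_wolfowitz(lambda x: -(x - 1.0) ** 2 + random.gauss(0.0, 0.1), x0=4.0)
```

The chosen sequences satisfy Σ a(t) = ∞, c(t) → 0 and Σ (a(t)/c(t))² < ∞, the pattern of conditions appearing throughout this section.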


Various convergence conditions for this procedure are easily derived from the results of §2.7. We leave this task to the reader, as outlined in the following exercises.

EXERCISE 5.2. Prove that if there is a function V(x) ∈ C₂ satisfying conditions (5.5) and having bounded second partial derivatives, and moreover

1) |∂V(x)/∂x|² + |∂f(x)/∂x|² + E|G(t, x, ω)|² ≤ k₁(1 + V(x)),   (5.12)

2) the function f(x) has continuous partial derivatives satisfying a global Lipschitz condition,

3) Σ_t a(t)c(t) < ∞,

  a(t) > 0,   ∫_0^∞ a(t) dt = ∞,   ∫_0^∞ a²(t) dt < ∞.   (2.4)

PROOF. By (2.1)–(2.3) and the condition V(x) ∈ C₂⁰, we have

  LV(x) = ∂V/∂t + a(t)(R(x), ∂V/∂x) + (a²(t)/c²(t)) Σ_{r=1}^k (σ_r(t, x), ∂/∂x)² V
        ≤ −a(t) |(R(x), ∂V/∂x)| + k₂ (a²(t)/c²(t)) (1 + V(x)).

Thus our assertion is a corollary of Theorem 3.8.3 (see also Exercise 3.8.4) with F(x) = R(x) and q(t, x) ≡ 0.

THEOREM 2.2. Let X^x(t) be the process defined by equation (1.4), and assume that ∂f/∂x is continuous and there is a nonnegative function V(x) ∈ C₂⁰ satisfying (2.1), (2.3) and the inequalities

  (∂f(x)/∂x, ∂V(x)/∂x) < 0   for x ∉ B₂,   (2.5)

  a(t) (∇_c f(x) − ∂f(x)/∂x, ∂V(x)/∂x) ≤ β(t) (1 + V(x)).   (2.6)

V. CONVERGENCE OF STOCHASTIC APPROXIMATION PROCEDURES. II

Then the process X^x(t), x ∈ E₁, converges a.s. as t → ∞ either to a point of B₂ = {x: ∂f/∂x = 0} or to the boundary of one of its connected components, provided that

  a(t) > 0,   ∫ a(t) dt = ∞,   c(t) → 0,   ∫ (a²(t)/c²(t)) dt < ∞.
Using a suitable limit procedure, one readily shows that also W″(0) = −8/3 < 0. Similarly, W‴(z) satisfies the condition

  z W‴(z) e^z = Φ₂(z),

where

  Φ₂(z) = Φ₃(z) e^z − ∫_0^z e^u √u du.

The function Φ₂(z) vanishes at z = 0 and, in addition,

  Φ₂′(z) = (Φ₃′(z) + Φ₃(z) − √z) e^z > 0   for z > 0.

Thus W‴(z) > 0 for z > 0. Again using a limit procedure, we see that also W‴(0) = 16/15 > 0. This completes the proof.

PROOF OF THEOREM 3.1. We may assume without loss of generality that x̄ = 0. We shall show that there exists a function V(t, x) satisfying the conditions


§3. AUXILIARY RESULTS (CONTINUOUS TIME)

of Lemma 3.1. Indeed, set

  V(t, x) = T(t) − W(z),   z = V̄(x)/φ(t),   (3.12)

where W(z) is the function defined in Lemma 3.3, V̄(x) = (Cx, x) is the quadratic form figuring in the definition of B and satisfying conditions (3.3) for |x| < ε, and T(t) and φ(t) will be specified below. Let |x| < min(ε, δ) = ε₁. It is readily seen that

= LV (t z) = T' (t) + W' ( ) ~- It> ' Z Z cp(t)

a (t ) JV' (:)

cp (t)

(F(z t)-F(z)

a(t}W' (:)

cp \t)

'



(F ( ) au ) % '

iiz

au) oz

(3.13)

I

-~'(t) 2cp (t) .

~ a ·(tz)[lY"(;)oUiiU IJ ' In



  LV(t, x) = T(t+1) − T(t) − W′(z) Ey − (W″(z)/2) Ey² + R(t, x),   (4.4)

(4.4)

where

  y = U(x + Φ(t+1, x, ω))/φ(t+1) − U(x)/φ(t),

R(t, z)= -

o
(t 1, z, w) , (t + 1, z , w))}

r

(4.5)

Inserting (4.5) into (4.4) and performing some simple manipulations, we get

(4.6)

~w

~w

- ~(t+t) E ((t+ 1, z , w), Cz)I- cp (t + t) + 'P2 E(C(t + 1,x, w), (t+1 , x,w))

_ ( W' (z) ..L 2WN(z) z ~ ) E (C

x)+ R(t; x) .

We note, moreover, that by condition (I) of the theorem

I a 1 (CF (t, x), F (t, x))+

p2 Tr

((A (t, x) - A (t, 0)) CJ 2

1

:s;;;: k, la (t) q (t) 2

+ P2

Ix

lvf

and in addition-

T(t+ t)-T (t)-2~2 (t) µ (t)/cp(t + 1)

= In (1 + 2~2 (t) µ (t)/cp (t + 1))- 2q,~2(t(t)+"1)(t) """ ~O • Therefore

+ I (t, x) + R (t, x).

(4.7)

Using the fact that the functions W′(z)z and W″(z)z² are bounded, the inequalities

  |F(t, x)| ≤ |F(x)| + |q(t, x)|,   a(t) ≤ k β(t),   β(t) ≤ k₁ φ(t+1),

and bounds for (CΦ(t+1, x, ω), Φ(t+1, x, ω)) and (Φ(t+1, x, ω), Cx) derived from the Cauchy-Schwarz-Bunjakovskiĭ inequality, we deduce (4.8). It remains to estimate the remainder term R(t, x). To this end, we observe that since

w· (: + 8,y) 0


0}.

From

(4.10) Therefore, since W"'(z) is positive and Wm(z follows that

+ 81 y)(vz ) 3 xr is bounded, it

To estimate R 2 (t, x), we note that the inequality z/2 and so, by (4.9),

+ 8 2 y..; 0

implies y ..;·o,

r

R2 (t, z) ~'4' (:~ 1) 'E I w· (z + 81g)- w· (:) 11 (l) (t + 1, z, w) Xr=· l4.12) Now, using the obvious inclusion

we infer fcom (4.12) that

R (t z) & 2



....,

kts

,p' (t+1)

E I(l) (t

+ 1' z • ) I' ~ .---k "

Ii' +a' (t) q' (t) 1112 (t+ 1)

•.

(4.13) In"1ew of (4.7), (4.8), (4.11) and (4.13), we finally obtain

£,V ~kn [

j!2/)

-4-- ,x (t) q(t)

-,p•-v1-ca+ tl · Y,t• + tl

+ { j! (t) }l] = Y'4'ti + t>

(t)

v '

where, as follows from Assertion 5 in §375 of Fihtengol'c [1] and condition (3) of our theorem, the series Σ_{t=0}^∞ γ(t) is convergent. Finally, as in the proof of Theorem 3.1, we show that V(t, x(t)) → ∞ as t → ∞ for any sequence of vectors x(t) in E₁ converging to 0. Thus all the assumptions of Lemma 4.1 are satisfied, and the proof is complete.

REMARK 4.1. In comparison with Theorem 3.1, Theorem 4.1 involves an additional condition


- (~(t>IY"'(t+

~

t-0

1>)'
½,

..

+ 1) = i-1~+ 1 ~· (i) ~ kz/tN-l

≤ k₃/t^{3/2}.

§5. One-dimensional procedures

Theorems 3.1 and 4.1 make it possible (subject to certain added assumptions) to improve the results of §2 about the limit behavior of an s.a. process X^x(t), since they exclude the possibility of X^x(t) converging to points x̄ ∈ B̄. To simplify the exposition, we shall confine attention to one-dimensional procedures, though the results are easily carried over to s.a. procedures in E₁. In the one-dimensional case, it is quite easy to establish a simple condition ensuring that x̄ ∈ B = {x: F(x) = 0} is not a stable point (in the sense of §2) for the equations

  dX = a(t) F(X(t)) dt,   X(t+1) − X(t) = a(t) F(X(t)).

In fact, this will be the case for all points x̄ in some neighborhood of which F(x)(x − x̄) ≥ 0. We shall thus have quite simple results on the convergence of s.a. procedures in E₁. Let us assume that the sets B₁ and B₂, in which the functions R(x) and f′(x), respectively, vanish, are unions of finitely many closed intervals and isolated points. We define sets

  B̄₁ = {x̄: R(x̄) = 0, inf_{t≥0} A(t, x̄) > 0, R(x)(x − x̄) ≥ 0 in some neighborhood of x̄},

  B̄₂ = {x̄: f′(x̄) = 0, inf_{t≥0} A(t, x̄) > 0, f′(x)(x − x̄) ≥ 0 in some neighborhood of x̄},

where A(t, x) = σ²(t, x) for continuous time, and A(t, x) = EG²(t+1, x, ω) for discrete time. Note that the isolated points x̄ ∈ B̄₂ are local minimum points of f(x).

THEOREM 5.1. Suppose that for some r > 0

  R(x)x < 0 for |x| > r,   sup_{t≥0} σ²(t, x) < ∞,

and the function a(t) satisfies (2.4). Then any solution X^x(t), x ∈ E₁, of the equation

  dX(t) = a(t)[R(X(t)) dt + σ(t, X(t)) dξ(t)],   X(0) = x,

where ξ(t) is a standard Wiener process, converges a.s. as t → ∞ to a point of the set B₁\B̄₁.

PROOF. Let V(x) ∈ C₂ be a positive function satisfying the conditions 1) R(x)V′(x) < 0 for x ∉ B₁, and 2) V(x) linear for |x| > r. Such a function always exists; for example, if R′(x) is continuous, we can set V(x) = −∫_0^x R(u) du + C for |x| ≤ r. Then σ²(t, x)V″ ≤ k if k is a sufficiently large positive constant, since ∂²V/∂x² = 0 for |x| > r. It thus follows from Theorem 2.1 that X^x(t) converges to a point of B₁. But by Theorem 3.1, whose assumptions are certainly satisfied for F(x) = R(x), q(t, x) ≡ 0 and α(t) = β(t) = a(t), we have

  P{lim_{t→∞} X^x(t) = x̄} = 0   if x̄ ∈ B̄₁.

This completes the proof.
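The exclusion of points x̄ ∈ B̄₁ can be observed numerically. In the illustrative discrete procedure below (all specific choices are assumptions), R(x) = x − x³ vanishes at 0 and at ±1; near 0 we have R(x)·x ≥ 0, so 0 plays the role of a point of B̄₁, and the simulated paths settle near ±1 instead:

```python
import random

def sa_path(seed, steps=50000):
    """X(t+1) = X(t) + (1/t) * (R(X(t)) + noise), R(x) = x - x**3, with the
    truncation |X| <= 2 applied at each step for numerical safety."""
    rng = random.Random(seed)
    x = 0.0
    for t in range(1, steps + 1):
        x = x + (1.0 / t) * (x - x ** 3 + rng.gauss(0.0, 0.3))
        x = max(-2.0, min(2.0, x))
    return x

# Each path escapes the unstable root 0 and ends near a stable root -1 or +1.
endpoints = [sa_path(seed) for seed in range(5)]
```

Starting exactly at the unstable root, it is the measurement noise (inf A(t, x̄) > 0 in the definition of B̄₁) that drives the process away, exactly as the theorem predicts.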

THEOREM 5.2. Let the function f′(x) satisfy a Lipschitz condition, and suppose that for some r > 0

  f′(x)x < 0 for |x| > r,

and the function σ(t, x) satisfies the assumptions of Theorem 5.1. Assume, moreover, that a(t) and c(t) satisfy conditions (2.7), and

  ∫_0^∞ a(t)c(t) dt < ∞.   (5.1)

Then the process X^x(t) converges a.s. as t → ∞ to a point of the set B₂\B̄₂.

PROOF. Choose a positive function V(x) ∈ C₂ that is linear for |x| > r and satisfies f′(x)V′(x) < 0 for x ∉ B₂. This function satisfies condition (2.6) of Theorem


2.2 (see Remark 2.1), since |V′| is bounded and the integral ∫_0^∞ a(t)c(t) dt is finite by (5.1). In addition, the function σ²(t, x)V″(x) is also bounded. Therefore, by Theorem 2.2, the process X^x(t) converges a.s. as t → ∞ to a point of the set B₂. The desired conclusion now follows from Theorem 3.1 with F(x) = f′(x), q(t, x) = ∇_c f(x) − f′(x), α(t) = a(t) and β(t) = a(t)/c(t). (The only point that needs checking is the inequality

dt1ei(v, t,.1t>'

(I 15) .

where It - I

~. = ;-1 ~

I ~j

+.;- ~11+1 'l:i,.

r

~II=

~ 'lti,

;-2

k= 2, ... , t-1,

1- I

~tt=

~ ~J-

i=I

Setting

we deduce from (1.15) that

j / 1 (u) - g, (u) I

.

=IE ±, E

([ei(u,

,,.>

-ei(u.

111•>1 ei(u, t,.,,~h ... ' St.1t-1)j

t

~ ~ E i h.1t(U}-g,.1t(u)j.

•-1

(1.16)

We now use Lemma 1.1 to estimate |f_{tk}(u) − g_{tk}(u)|. In view of (1.9), we find that


VI. ASYMPTOTIC NORMALITY OF THE ROBBINS-MONRO PROCEDURE

I/- ,.(u)-g,.(u)l~ 21 u2 11 s ,.-R,.11 +e!ul3 [E l ti• l2 +

E

(lt,•12/ ~1 •••• ,

~1•-1H

+ ~ Iu I' l E I 11,. 12x";• (t) + E (It,,. I' x~.~11, ... , t1-:H•

(1.17)

By (1.16) and (1.17), as well as conditions (1.10)–(1.12), it will suffice to show that

  Σ_{k=1}^t E|η_{tk}|² χ_{Nk}(t) → 0   as t → ∞.   (1.18)

To this end, we observe, first, that

Hence, by (1.12),

lim sup II s,,.11 =0. ,_ 1~•~· Further, we have t

I

~

2

E {I TJ,11

x"•• (t:)}~ :, ~ E I'lt• i'

k•I

"-l

~:! ~11sll11 ~¾~11s,.11 ~•~t 11s,.11 t

t

1

sup

•-1

"-l

I

I

JP Tl

~ei"' 2., E lo •l

lt.•I'

sup l~l~t

11s,.11-o

as

t-00 .

This proves (1.18), and thereby completes the proof. In the following sections we shall study the asymptotic behavior of the processes Y(t) defined by (1.3) and (1.5). In so doing, our first project (§2) will be to establish, under certain additional assumptions (Conditions 1, 2 and 2′), the formula

Subsequently, in §§3 and 4, we shall show that all the sums on the right of (1.3) and (1.5), except the last, approach zero in some probability-theoretic sense. To be precise, we shall prove that in a suitable sense

  Y(t) ≈ a ∫^t (1/√u) Σ_{r=1}^k σ_r(u, X(u)) dξ_r(u)   (1.19)

in the continuous case, and

  Y(t) ≈ (a/√t) Σ_{k=1}^t G(k+1, X(k), ω)   (1.20)

in the discrete case.

Then, in § §5 and 6, we shall prove that the processes on the right of ( 1.19) and (1.20) are asymptotically normal, and establish a sharper characterization of these processes (Theorem I.I will be used in the treatment of the second process). Finally, in §7 we shall investigate the asymptotic behavior of the moments of Y(t), in both continuous and discrete cases.

§2. Asymptotic behavior of solutions

In this section we present conditions that guarantee rapid convergence to x₀ of a stochastic approximation process defined either by the differential equation

  dX(t) = (a/t) (R(X(t)) dt + Σ_{r=1}^k σ_r(t, X(t)) dξ_r(t)),   X(s) = ξ,   (2.1)

or by the difference equation

  X(t+1) − X(t) = (a/t) (R(X(t)) + G(t+1, X(t), ω)),   X(s) = ξ.   (2.2)

As usual (see §§3.3 and 2.3), we shall assume that the Wiener processes ξ_r(t) (the family of r.v.'s G(t, x, ω), x ∈ E₁) are measurable relative to a monotone sequence of σ-algebras F_t, and the increments ξ_r(t+h) − ξ_r(t) are independent of F_t (the r.v.'s G(t+1, x, ω) are independent of F_t). We shall assume that the r.v. ξ is F_s-measurable. Throughout, we shall adopt the following assumptions.

CONDITION 1. There exist a symmetric positive definite matrix(⁴) C and a number λ > 0 such that for all x ∈ E₁

  (CR(x), x − x₀) ≤ −λ (C(x − x₀), (x − x₀)),   2aλ > 1.

CONDITION 2. For all t ≥ 1 and x ∈ E₁,

  |R(x)| + Σ_{r=1}^k |σ_r(t, x)| ≤ K(1 + |x|),   K = const.

In the discrete case, this condition will be replaced by

CONDITION 2′. For all t = 1, 2, ... and x ∈ E₁,

  |R(x)|² + E|G(t, x, ω)|² ≤ K(1 + |x|²).
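For a scalar linear regression function R(x) = −λ(x − x₀), Conditions 1 and 2′ hold trivially, and the normalized deviation √t · (X(t) − x₀) of the procedure (2.2) is known to have limiting second moment a²σ²/(2aλ − 1) (a standard fact about the Robbins–Monro procedure, used here only as the target of the check). A small Monte-Carlo sketch; all numerical choices are illustrative:

```python
import random

def path(seed, steps=4000, a=1.0):
    """One path of (2.2) with R(x) = -x (so x0 = 0, lambda = 1), gain a/t,
    and N(0,1) noise G; note that 2*a*lambda = 2 > 1, as Condition 1 requires."""
    rng = random.Random(seed)
    x = 1.0
    for t in range(1, steps + 1):
        x = x + (a / t) * (-x + rng.gauss(0.0, 1.0))
    return x

# Empirical second moment of sqrt(t) * X(t) over independent paths; for this
# linear case the limit is a**2 * sigma**2 / (2*a*lambda - 1) = 1.
paths = [path(seed) for seed in range(400)]
second_moment = 4000 * sum(x * x for x in paths) / len(paths)
```

The requirement 2aλ > 1 in Condition 1 is visible in the formula: as 2aλ decreases to 1, the limiting variance blows up.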


1 and T such that (2.5)

for

and so, by formula (3.5.5) (which is valid under our assumptions; see Exercise

3.7 .1), :, EVs(X' ·t(t)}= ELVs(X'·t(t))

~ """=

_J!!. t EVt (X' · t (t}} + ~,2

'

t~T.

These relations imply that

:, [t,,1 EV1 (X10 t(t}})~kit')1 Therefore

EV1 (X'·~(t)) ~ ( Since C is positive defi.nite uid Pi

~r•

2

,

t~T.

EV1 (X'· t(T))+ ~ .

(2.6)

> 1, the estimate (2.3) now follows from (2.6).

To prove (2.4), we consider the auxiliary function V2 (t, x) = r"'1 V1 (x) + r-f , where O < 'Y < 1 and e < 1 - 'Y· Then, in view of (2.5), we see that for I

>T

+ yt'- 1 Vs(x)-er•-t , T1•

Thus

(Vi (t, X'·;

(t)) .

~ T1 )

.Y ,. t

is a supermartingale (see Corollary 3.8.1). Theorem 3.6.1 implies the existence

of a finite limit lim,_.t'YV1 (x), and this proves (2.4). 2) Now let x.r,t(r) be the process defined by (2.2) and L its generating operator. Then, assuming as before that x 0 = 0, we have

(c (z+ a(R(z) + G~t+t, z , (&)))),

LVi(z)= E

z+ a(R (z)+ G~t+t, z ,

(&)))

)-(Cz, z)

~24(C(R(z), z)+a1 11Cll(IR(.r) l1 + MIG(t + t, z, (&))11)

--,,

Thus, for some p 1

t

,,

> 1 and T, LV (z)~I

--:::

p,Y,(z) t

+~

for

ti

t > T.

Therefore, by Lemma 2.3.1,

E V 1 (X'· t (t

+ 1))- E Y: (X'· t (t)) ~ _ .fl. --,, t E V I (X'• t (t))

+~ 11 •

(2 -8)

Iterated application of this inequality gives

EV 1 (X'·; (t + 1)) I

~ EV (X'· t (T))

I

! -•+1 II (1-

II (1- ~ ) + k, ~ m-T

•- T

~)



Hence, using the familiar inequalities I

II (1-~) ~ t:·

m- 1

I

I

we see that (2.9) holds. Formula (2.3) now follows from the fact that C is positive definite. The proof


of (2.4) is exactly the same as in the continuous case, except that here one uses Theorems 2.2.2 and 1.5.1 instead of Corollary 3.8.1 and Theorem 3.6.1.

§3. Investigation of the process I₁(t)

In Chapter 3 we introduced the concept of a stochastic integral and then set up the Itô stochastic differential equation of a Markov process. Stochastic integrals may also be used to construct a broader class of stochastic processes. Thus, in proving the following lemma we shall have to consider an equation

  dY(t) = F(t, Y(t), ω) dt + Σ_{r=1}^k Ψ_r(t, ω) dξ_r(t)

in E₁, where F(t, y, ω) (for fixed y) and Ψ_r(t, ω) are F_t-measurable. As usual, we define a solution of this equation satisfying the initial condition X(s) = ξ, where ξ is an F_s-measurable random variable, to be a solution of the integral equation

  Y(t) = ξ + ∫_s^t F(u, Y(u), ω) du + Σ_{r=1}^k ∫_s^t Ψ_r(u, ω) dξ_r(u).   (3.1)

Under certain restrictions, which we shall henceforth assume to hold, this equation has an a.s. continuous F_t-measurable solution Y(t). Equation (3.1) is naturally associated with the operator

  L = ∂/∂t + (F(t, y, ω), ∂/∂y) + ½ Σ_{r=1}^k (Ψ_r(t, ω), ∂/∂y)².   (3.2)

In case F(t, y, ω) is independent of ω, Y(t) is a Markov process and L is its generating operator. The following exercise is useful in understanding the role of the operator L in the general case.

EXERCISE 3.1. Let V(t, y) ∈ C₂ and LV(t, y) ≤ 0 a.s. for t > s (s ≥ 1), and assume that the expectations E|V(s, Y(s))| < ∞,

(I:Y V (t, Y (t))

2

E

> s (s ;;;i, I),

(I LV (t, Y (t))IL~,),

1~,}

1 1'l', (t, w)F1

are bounded for t ≥ s on every finite interval. Prove that (V(t, Y(t)), F_t, t ≥ s) is a supermartingale. Given any y ∈ E₁, we set

  (y)_N = sgn y · min{|y|, N}.

If y is a vector in E₁ with coordinates y₁, ..., y_l, we set (y)_N equal to the vector ((y₁)_N, ..., (y_l)_N). We shall also say that a matrix A is stable if all its eigenvalues have negative real parts.
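The truncation operator (y)_N is used repeatedly below; a direct transcription of the definition:

```python
def truncate(y, N):
    """(y)_N = sgn(y) * min(|y|, N); for a vector y = (y_1, ..., y_l) it is
    applied coordinatewise, as in the definition above."""
    if isinstance(y, (list, tuple)):
        return type(y)(truncate(c, N) for c in y)
    return max(-N, min(N, y))

print(truncate(5.0, 2.0))            # -> 2.0
print(truncate([-3.0, 0.5], 2.0))    # -> [-2.0, 0.5]
```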

§3. INVESTIGATION OF THE PROCESS I₁(t)


The proof of the following celebrated theorem of stability theory may be found, e.g., in Malkin [1].

LJAPUNOV'S LEMMA. If A is stable, then for any symmetric positive definite matrix D there exists a symmetric positive definite matrix C such that CA + A*C = −D.

The following corollary of Ljapunov's lemma will be needed in the sequel.

COROLLARY 3.1. If A is a stable matrix with eigenvalues λ₁, ..., λ_l, and if λ̄ = min_{1≤i≤l} |Re λ_i|, then for any λ₀ < λ̄ there exists a symmetric positive definite matrix C such that (CAx, x) ≤ −λ₀(Cx, x).

To prove this we need only observe that the matrix A + λ₀J is stable, and then to apply Ljapunov's lemma.
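Ljapunov's lemma is also easy to verify numerically: the matrix equation CA + A*C = −D is linear in C, so vectorizing it yields an ordinary linear system. A sketch assuming NumPy is available, with an arbitrary illustrative stable matrix A:

```python
import numpy as np

def ljapunov_C(A, D):
    """Solve C A + A^T C = -D for C: with row-major vec, vec(C A) =
    (I kron A^T) vec(C) and vec(A^T C) = (A^T kron I) vec(C)."""
    n = A.shape[0]
    I = np.eye(n)
    M = np.kron(I, A.T) + np.kron(A.T, I)
    return np.linalg.solve(M, -D.flatten()).reshape(n, n)

A = np.array([[-1.0, 2.0], [0.0, -3.0]])  # stable: eigenvalues -1 and -3
D = np.eye(2)
C = ljapunov_C(A, D)
# C A + A^T C equals -D, and C is symmetric positive definite,
# exactly as the lemma asserts.
```

The system is solvable because for a stable A no two eigenvalues sum to zero, so the Kronecker-sum matrix M is invertible.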

LEMMA 3.1. Let A be a stable matrix; let Ψ_r(u) = Ψ_r(u, ω), r = 0, 1, ..., k, u ≥ s, be F_u-measurable stochastic processes with values in E₁, such that for some δ > 0

(3.3) 1-+00

liml'1'r (t)j2 i°=O,

r=i, ... , k

(a.s.).

(3.4)

1-+oo

Let the functions Ψ_r(u) be a.s. continuous. Then the process I₁(t) defined by (1.7) tends to zero with probability 1 as t → ∞.

PROOF. Consider the process Y^y(t) satisfying the equation

  dY(t) = ((A/t) Y(t) + Ψ₀(t)/√t) dt + Σ_{r=1}^k Ψ_r(t) dξ_r(t),   Y(1) = y,   (3.5)

with initial condition Y^y(1) = y. Clearly, I₁(t) = Y⁰(t), and so it will suffice to prove the equality

  lim_{t→∞} Y⁰(t) = 0   (a.s.).

To this end, together with (3.5) we define a process YY(t) as the solution of the initial-value problem

dY (t) = ( ~

Y (t) + ('Yo

(I~:~:+6)N )

dt Y(i)=.v-

By (3.3) and (3.4), for any with probability > 1 - E,

E

> 0 there exists a sufficiently large N

(3.6)

such that,

134

VI. ASYMPTOTIC NORMALITY OF THE llOBBINs-MONllO PROCEDURE

for all t ;;;i. 1 and r

= 1, ... , k .

Thus, from some N onward,

Y" (t) I= O} > 1- t.

P {sup I Y"Y (t) l~ I

If we can prove for arbitrary N

> 0 that Y(y) -

0 a.s. as t -

(3.7) - , it will follow

at once from (3.7) that

P {lim ,...., Y !I (t) = O} > 1- e. Since

E

is arbitrary, this will imply the desired conclusion. Thus it will suffice to

prove the assertion for the process Ȳ^y(t). To this end we consider an auxiliary function

  W(t, y) = t^κ (Cy, y) + k t^{κ−δ}.

Here κ and k are constants satisfying the inequalities 0 < κ < δ and k > 0, and C is a symmetric positive definite matrix such that

  AC + CA* = −J   (3.8)

(J is the l × l identity matrix). Such a matrix C exists, since A is stable (see Ljapunov's lemma). If L denotes the operator (3.2) for the process Ȳ^y(t), simple manipulations, using (3.8) and the inequality 2ab ≤ a² + b², yield

LW(t, y)=f"- 1 ((AC+CA•)y, y)+2tx-t-O(('Y0 (t)t 112 H)N, Cy)

"

+i--1-6 ~ (C{'Y,{t)t6/2)N,

,_,

+ xtx-l (Cy, ~

t"-1

y)

+ k {x -

((-J + xC) y,· y) +

('I', (t) t612)N)

6) ix-6-l

t•-1 - 6

1-k (6

-

x) + k1l

+ k2tK-l-6 (1 + (y,

y)].

(3.9)

Here k 1 and k 2 arc constants depending on /, N and the clements of C. It follows from (3.9) that lW(t, y).;; 0 for all sufficiently large t ;;;i. Tandy E £ 1 provided " is chosen sufficiently small and k sufficiently large. Since (3.6) is a lineu sys-

tem with bounded coefficients, it follows that for all t -- s there exist bounded moments

on any finite interval of the t•axis. Therefore (see Exercise 3.1) the process

(W(t, Y"(r)), f ,, t -- 7) is a supcnnartingale. By Theorem 3.6.1, there exists a finite limit


Y11 (t)) = Jim t" (CY 11 (t ), Y., (t ))
sandy E £ 1• Prove that (V(t, Y'·f(t)), F ,, t;;., s) is a supermartingale.

n

We shall also need the following simple assertion.

LEMMA 4.1. Let A be a stable matrix,

TI

- { "'--+1 (J +Alm), A _, -

k= 1, ... , t-1,

k=t,

J,

and let m1c., be a sequence of positive numbers such that

lim mu= 0

uniformly in

t.

(4.3)

Then (4.4) PROOF.

As follows from (1.6) and (3.12},

k=t, ... ,t-t,

(4.5)

where C is some positive constant. Thus

Therefore (4.4) follows from (4.3) and (4.5), and one readily proves that I

~ k2A1-t _

. 2As 1Im 2l1

~

,..... '

-

1



(4.6)

-'-1

We now proceed to investigation of the process

  I₂(t) = Σ_{k=1}^{t-1} A_{t,k} φ(k+1).

LEMMA 4.2. Let A be a stable matrix, and let φ(t+1) = Ψ(t) + σ(t+1), where Ψ(t) is F_t-measurable, E(σ(t+1) | F_t) = 0, and for some δ > 0


  lim_{t→∞} t^{1/2+δ} |Ψ(t)| = 0   (a.s.),
  lim_{t→∞} t^δ E(|σ(t+1)|² | F_t) = 0   (a.s.).

Then I₂(t) tends to zero with probability 1 as t → ∞.

Consider the process YY(r) defined by

PROOF.

Y(t+-1)-Y(t)= ~ Y(t)+q)~~t), Clearly, Y0 (r)

Y 11 (1)=y.

(4.7)

= /2 (t), and it will suffice to show that lim yo (t) = 0

(a.s.).

(4.8)

1-+ao

Together with (4.7), we define a process fY(t) recursively by •



Y(t+1)-Y(t)=: Y(t)+

('l' (1) 1/2+61 :+6 N+aN(t+1), 1

Y(t)=y,

where 0

(Vt6 E(!a(t+1)12lg- 1))No(t+1),

(t+i)=

1112+e12

N

>0

and show that for any N

ve(l 0, and C is a symmetric positive definite matrix satisfying condition (3.8). If L is the operator (4.2) for the process Y>'(t), then, clearly, • LW (t, y) = t"E

( (

12

C ( Y,, 7A Y+ ('I' (t) 11+16+6)N + a.,.,. (t + 1)) , Y-t-7A Y 1

+ {'I' {t\!~:+6JN + ON (t + f}) J§°I) + kt"- 6 -

(t -1}" (Cy, y)- k (t-1)"- 6 •

Evaluating the expectation in (4.9), using the relations

we obtain the estimate

(4.9)

F,measurability of i'(t) and the

§4. INVESTIGATION OF

nu: PROCESS l2(r)

139

LW (t, Y)< _,,c-l IY~+t"- t -& (k1I IJ I+ lc.z) + k3t"- 2 1y 12 +k.t"-

2 6 - 1y

Here and below, k;

I+ (Cy,

y) (t,c - (t -

= k;(l, N, C).

1)") + k (t"- 6 - ( t - 1t-6).

It follows, via the inequalities

tlt} lt-=T

I

t, supjX'·t(t)lo•o•• So-~, r • r-t

0

PROOF. To simplify matters, we assume that x₀ = 0. Let λ₁, ..., λ_l denote the eigenvalues of the matrix B, and set

  λ̄ = min_{1≤i≤l} |Re λ_i|.

Note that 2aλ̄ > 1, since A is a stable matrix. By Corollary 3.1, for any positive λ₀ < λ̄ there exists a symmetric positive definite matrix C such that

§5. ASYMPTOTIC NORMALITY (CONTINUOUS TIME)


  (CBx, x) ≤ −λ₀ (Cx, x).

Since |δ(x)| = o(|x|) as x → 0, it follows that for any λ < λ₀ we can choose ε > 0 such that for |x| < ε

  (CR(x), x) ≤ −λ (Cx, x).   (5.2)

In view of the inequality 2aλ̄ > 1, we may assume that the number λ in (5.2) is such that

  2aλ > 1.   (5.3)

Set

{ R (x)

R(:c)=

a, (t, x) =

R(~)l.:.l. Iz I

z)

0) or in probability (p

J'

(5.13)

0

= 0), where

Z (t)= e-'1 [ 11+a e-A• 0

..

~ o~"~,(u)] .. ,-1

and η is any r.v. independent of ξ_r(u), r = 1, ..., k. Now let η be a Gaussian r.v. with zero mean and covariance matrix Eηη* = S. Then Z(t) is a Gaussian process with zero mean. To show that it is stationary, we need only show that the function R(t, u) = EZ(t)Z*(t + u) is independent of t. This follows from the equalities

R (t, u) = e"' [

e,

I

1')1')'

+ 4 : Je-A•Soe-A•• tW] eA· 0 and X > 0 constants such that (5.2) and {5.3) hold. PROOF.

Set

VI. ASYMPTOTIC NORMALITY OF THE ROBBINS-MONRO PROCEDURE

~ 11

(x) =

{

~

-,-x,



G(t, x,)=

lxl >•

~x,~lt+t. l•, t(Jt), •ll>cJ• Ya•

Consequently

where

mi =

)

I G(k + 1, X'· t(k),

(I)) l'P {d6>) - O

,~1> dl (v).

Here ξ̄(v) = (ξ₁(v), …, ξ_k(v))* is a standard Wiener process and the real matrix σ⁽⁰⁾ has the property σ⁽⁰⁾σ⁽⁰⁾* = S₀.

PROOF. By Lemma 6.1, √t (X^x(t) − x₀) − Z₀(t) → 0 in probability as t → ∞. To prove the theorem, then, it will suffice to show that the finite-dimensional distributions of the process V_T(s) = Z₀(Te^s), where T and Te^s are positive integers, converge to the finite-dimensional distributions of the process Z(s). To this end, we observe that by (6.13) we can define Z₀(t) recursively by

Z₀(t+1) − Z₀(t) = (A/t) Z₀(t) + a ξ̄(t+1)/√t,  Z₀(1) = 0,    (6.15)

where ξ̄(t) = G(t, x₀, ω).

It follows from (6.15) that the process V_T(s) may also be defined recursively:

V_T(s + δ_T(s)) − V_T(s) = A V_T(s) δ_T(s) α_T(s) + a ξ̄(Te^s + 1) √(δ_T(s)) α_T(s),  V_T(0) = Z₀(T).    (6.16)


Letting T → ∞ in these equalities, we see that

sup_{s≥0} |α_T(s) − 1| → 0,  sup_{s≥0} δ_T(s) → 0.

It readily follows from (6.16) that

lim_{T→∞} (1/δ_T(s)) E(V_T(s + δ_T(s)) − v | V_T(s) = v) = Av,

lim_{T→∞} (1/δ_T(s)) E((V_T(s + δ_T(s)) − v − Av δ_T(s))(V_T(s + δ_T(s)) − v − Av δ_T(s))* | V_T(s) = v) = a² S₀.

Moreover, Theorem 6.3 implies that the distribution of V_T(0) converges as T → ∞ to the distribution of a Gaussian random vector with zero mean and correlation matrix S. These arguments suggest (see the heuristic considerations in §2.4) that the finite-dimensional distributions of the process V̄_T(s), defined as a random polygonal line coinciding with V_T(s) at all points where the latter is defined (Figure 5), converge to the finite-dimensional distributions of Z(t).

[Figure 5: the polygonal line V̄_T(s), coinciding with V_T(s) = Z₀(Te^s) at the points where the latter is defined.]

That the assumptions of Theorem 6.3 imply the truth of this conjecture follows from known results on the convergence of a sequence of Markov chains to the Markov process defined by a stochastic equation (see Skorohod [1] or Gihman and Skorohod [1]). This completes the proof.

REMARK 6.2.
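The scalar case of the recursion (6.15) and its stationary Gaussian limit can be illustrated numerically. The sketch below is an illustrative assumption, not the authors' construction: it takes B = −b (so A = aB + ½ with 2ab > 1 for stability) and unit-variance noise, and compares the empirical variance of Z₀(t) with the stationary value a²s₂/(2ab − 1):

```python
import math
import random

# Scalar sketch of (6.15): Z0(t+1) - Z0(t) = (A/t) Z0(t) + a*sqrt(s2)*xi(t+1)/sqrt(t),
# Z0(1) = 0, with A = a*B + 1/2 and B = -b (stability requires 2ab > 1).
# Var Z0(t) then approaches the stationary value a^2 * s2 / (2ab - 1).
def simulate_z0(a, b, s2, t_max, rng):
    big_a = -a * b + 0.5
    z = 0.0
    for t in range(1, t_max):
        z += (big_a / t) * z + a * math.sqrt(s2) * rng.gauss(0.0, 1.0) / math.sqrt(t)
    return z

def empirical_var(a, b, s2, t_max, n_paths, seed=1):
    rng = random.Random(seed)
    vals = [simulate_z0(a, b, s2, t_max, rng) for _ in range(n_paths)]
    mean = sum(vals) / n_paths
    return sum((v - mean) ** 2 for v in vals) / n_paths

stationary_var = lambda a, b, s2: a * a * s2 / (2 * a * b - 1)
```

With a = b = s₂ = 1 the stationary variance is 1, and the Monte Carlo estimate settles near that value well before t = 400.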

Theorems 6.1–6.3 remain valid if the sequence a(t) figuring in the difference equation defining the s.a. process is such that a(t) = a/t + o(1/t^{1+ε}) (ε > 0) as t → ∞.

§7. Convergence of moments

It was shown in §2 that the stochastic process Y(t) = √t (X^x(t) − x₀) has bounded moments up to second order, provided Conditions 1 and 2 are satisfied. We shall now show that these and higher-order moments of Y(t) are bounded under weaker assumptions. This will also enable us to prove that the moments of Y(t) converge to the appropriate moments of the normal law with parameters (0, S). It will be sufficient to assume, in particular, that Condition 1 of §2 is satisfied for an arbitrary λ > 0 (not necessarily such that 2aλ > 1). In the sequel, then, we shall assume the following

CONDITION (α). There exist a symmetric positive definite matrix C and a number λ > 0 such that for all x

(C R(x), x − x₀) ≤ −λ (C(x − x₀), x − x₀).    (7.1)

As always in this chapter, we first consider the continuous RM procedure.

LEMMA 7.1. Let Condition (α) hold, and, for all t ≥ 1 and x ∈ E_l, let

|R(x)| + Σ_{r=1}^{k} |σ_r(t, x)| ≤ K(1 + |x|),  K = const.    (7.2)

Then the solution X^x(t) of the equation

dX(t) = (a/t) [R(X(t)) dt + Σ_{r=1}^{k} σ_r(t, X(t)) dξ_r(t)],  X(1) = x,    (7.3)

satisfies the relation

sup_{t≥1} E (|X^x(t) − x₀| t^{ε})^{2n} = C_n < ∞,  ε = min(aλ, ½),    (7.4)

for all n = 1, 2, …

PROOF. We may assume that x₀ = 0. Set V_n(x) = (Cx, x)^n. Then, if L is the differential generating operator of X^x(t), we have

L V_n(x) = (2na/t) (Cx, x)^{n−1} (C R(x), x) + (a²/2t²) Σ_{r=1}^{k} (σ_r(t, x), ∂/∂x)² V_n(x).    (7.5)

The second derivatives of V_n(x) are homogeneous functions of order 2n − 2, and so

|∂²V_n/∂x_i ∂x_j| ≤ k₁ |x|^{2n−2}.    (7.6)

Moreover, the coefficients σ_r(t, x) satisfy (7.2). Hence, by (7.1), (7.5), (7.6) and the inequality

k₂ |x|^{2n} ≤ (Cx, x)^n ≤ k₃ |x|^{2n},    (7.7)

which is valid for all x ∈ E_l, we conclude that

L V_n(x) ≤ −(2naλ/t) V_n(x) + (k₄/t²) (V_n(x) + V_{n−1}(x)).    (7.8)

Thus, for all t ≥ T_n,

L V_n(x) ≤ −(naλ/t) V_n(x) + (k₅/t²) V_{n−1}(x).    (7.9)

Inequality (7.9) coincides with (2.5) for n = 1, and so the assertion is true for this case (see (2.6)). Its truth for arbitrary n is verified by induction, for by Exercise 3.7.1 and condition (7.2) the function V_n(x) satisfies (3.5.5), and therefore also


(d/dt) E V_n(X^x(t)) = E L V_n(X^x(t)).

71. Let the function R(x) mtiJfy condition (1.1), with a stable matrix A = aB + 1/z.J. Assume, moreover, that Condition (a:) and (7 .2) are satisLEMMA

fied. Then, if X"(t) iJ the solution of problem (7 .3), for any n

Vt 11 " = d,.
1.

where q

E [W n (X% (t)) x(IX%(t)1 Tn and some constant q 1 (1 < q 1 < q) we deduce from (7 .14) that

Writing this inequality as

! (t91nE Wn (X (t})}~-,n-.+1,-ak~'-9-1nintegrating from T tot, and using the inequality Wn(x);;. k 9 1xl2 ", we obtain the desired conclusion for 2nth moments. Since the lemma is certainly true for n 0, this completes the proof.

=

We now state the main result of this section for the continuous case. THEOREM

and let A = aB

7.1. Suppose that Condition (a:), (7.2) and (1.1) are satisfied, Let

+ ½J be a stable matrix. S=a2

JeA"S e,,..,,.dv,

m

0

S0 =

O

-

~ a:~•. r-1

and let x.r(t) be the solution of problem (7.3). Then the stochastic process Y(t) = Vt(X"(t) - x 0 ) has bounded moments of all orders n = 1, 2, . . .. Moreover, any mixed moments of the vector Y(t) converge as t - oo to the co"esponding moments of a Gaussian random vector T/ with zero mean and covariance matrix S, where Gr (t,

x} = o~,

Sas t -

oo.

lim 1-+oo

.r...zo

In particular, EY(t)Y(t)• -

PROOF. By Theorem 4.4.1, X(t)- x 0 under our assumptions. We may

thus use Theorem 5.1, according to which Y(t) converges to T/ in distribution. By Lemmas 7 .2 and l.2.1, this implies the desired conclusion.
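The continuous procedure (7.3) and the moment convergence it implies can be illustrated by an Euler–Maruyama simulation. The scalar choices below (R(x) = −x, constant σ, x₀ = 0, step sizes and path counts) are illustrative assumptions, not part of the text; for them the limit covariance reduces to S = a²σ²/(2a − 1):

```python
import math
import random

# Euler-Maruyama sketch of the scalar continuous procedure
#   dX(t) = (a/t) [ -X(t) dt + sigma * dxi(t) ],  X(1) = 0, x0 = 0.
# The limit theory predicts E[ t * X(t)^2 ] -> a^2 * sigma^2 / (2a - 1).
def y_variance(a=1.0, sigma=1.0, t_end=100.0, dt=0.05, n_paths=1000, seed=2):
    rng = random.Random(seed)
    n_steps = int((t_end - 1.0) / dt)
    vals = []
    for _ in range(n_paths):
        x, t = 0.0, 1.0
        for _ in range(n_steps):
            x += (a / t) * (-x * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0))
            t += dt
        vals.append(math.sqrt(t) * x)   # Y(t) = sqrt(t) * (X(t) - x0)
    mean = sum(vals) / n_paths
    return sum((v - mean) ** 2 for v in vals) / n_paths
```

For a = σ = 1 the predicted variance of Y(t) is 1, and the simulated variance approaches it already at moderate t.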


A similar result may be established for the discrete procedure

X(t+1) − X(t) = (a/(t+1)) [R(X(t)) + G(t+1, X(t), ω)],  X(1) = x.    (7.15)

In comparison with the continuous case, however, two additional complications arise here: first, the expression for the generating operator of a power of a quadratic form is not as simple as before; second, to ensure that the moments of the process Y(t) = √t (X^x(t) − x₀) are bounded, it is now necessary to assume the existence of moments of G(t, x, ω) of sufficiently high orders. We shall therefore assume that, for suitable β and all x ∈ E_l and t ≥ 1,

|R(x)|^β + E |G(t, x, ω)|^β ≤ K (1 + |x|^β).    (7.16)

EXERCISE 7.1. Suppose that condition (7.1) holds, and also (7.16) with β = 2n. Show that the solution of problem (7.15) satisfies, for k = 1, …, n, the condition

sup_{t≥1} E (|X^x(t) − x₀| t^{ε})^{2k} = c_k < ∞,  ε = min(aλ, ½).    (7.17)

Hint. Use the method of proof of Lemma 2.1, with induction on k.
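The discrete procedure (7.15) is easy to simulate. The sketch below uses illustrative assumptions (R(x) = −(x − x₀), standard normal noise G, step sizes a/t, a = 1); for these choices the scaled error √n(X(n) − x₀) is asymptotically N(0, a²/(2a − 1)):

```python
import random

# Sketch of the discrete procedure (7.15) with R(x) = -(x - x0) and G ~ N(0, 1):
#   X(t+1) = X(t) + (a/t) * ( -(X(t) - x0) + noise ),  starting from X(1) = start.
def rm_discrete(a, x0, n, rng, start=0.0):
    x = start
    for t in range(1, n + 1):
        x += (a / t) * (-(x - x0) + rng.gauss(0.0, 1.0))
    return x

rng = random.Random(3)
estimates = [rm_discrete(1.0, 2.0, 4000, rng) for _ in range(300)]
mean_est = sum(estimates) / len(estimates)
# for a = 1 the theory gives n * Var(X(n) - x0) -> a^2 / (2a - 1) = 1
scaled_var = 4000 * sum((e - mean_est) ** 2 for e in estimates) / len(estimates)
```

The empirical mean sits at x₀ and the n-scaled variance near 1, in line with the asymptotic normality results of this chapter.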

EXERCISE 7.2. (a) Let (7.1) hold, and also (7.16) for an even number β such that

β ≥ 2/min(2aλ, 1).    (7.18)

Show that the process Y(t) = √t (X^x(t) − x₀), where X^x(t) is a solution of problem (7.15), has bounded second moments.

(b) Show that if condition (7.18) is replaced by β ≥ 4/min(2aλ, 1), one can guarantee that

sup_t E |Y(t)|⁴ < ∞.
~·-"sn0 , with the same probability.

2.1. Let the truncated RM procedure (13) Jun,e the following

properties: I. For some domain U1 C U such that x 0 E U1 , and some T sup

t :,!:T, sEUa

P {~·

"< 00} = q < t.

> 0, (2.4)

II. For alls > T and x EU,

P{-r'• z (U1)


T and x E U 1 ,

P{A:•·'"}~ q. Next, using the Markov property of the process have 00

P{A!·'"}= ~

00

~

I

i-•+l ;-•+ t 11EU1

i

= {T!'.x
0,

LV (t, z ) ~ Next, since W(x)

ca (t) for .:r

> W(y) for any x f

E fj "- U 1•

(r 1 , r 2 ) and y E (r1 , r2 ), it

follows that

=

LV (t, .:r)

EV (t

=

~

+ 1,

1 EW (X ·"' EW(X 1·"' (t

(t

X'·"' (t

+ 1))

+ 1)) -

+ t))

-

V (t, .:r)

- lV (z) - Ka' (t) W (z) - Ka 1 (t) = LV (t, z).

Thus V(t, x) satisfies (3.5), and-the lemma is proved.

§ 4. Theorems on convergence and asymptotic nonnality The following propositions are direct consequences of the results of §§2 and 3.

4.1. Let U = (r), r,) be a rmite interval in Ei containing the point Xo, and ;,.x(t) a one-dimensional Markov process derming a-truncated THEOREM

procedure (1.3)-(1.S} in CJ. Assume, moreover, that CD

CD

a (t) > 0, ~ a (t) = oo, ~ a 1 (t) < oo, 1-t

sup e x 2 , w < 0. It is easy to see from (5.9) that the component X(t) will then have a tendency to move to the right (E(X(t + 1) - X(t)IX(t), W(t)) > O). At the same time (see the proof of Lemma 5 .1 ), the process W(t) will approach the arithmetic mean of the values of R'(X(r)), which is negative because of the structure of R(x). Thus the process (X(r), W(t)} has a tendency to "linger" in the domain x > x 2 , w < 0, and so there is a positive probability that X(t) will not converge to x 0 • The above heuristic considerations lose their force when R '(x) > 0 for sufficiently large Ix I. We shall show below that under the slightly stronger assumption

0


J

(Y,CX(t)+t{t), )+Y, 0 be sufficiently small.

= T(E) such that P. {Ps - t < W

(t)




T}

>

1 - t.

Now, letting a sample function (.X(t), W(t)) of the process (5.10), (5.13) with

= p 1 - E and r2 = p 2 + E "issue" from the point corresponding to the position at time T of the sample function of the process (S.9), (5.10) with ipitial condition X(J) = x, W(O) = 0, we infer via Theorem S.1 that

r1

X (t} = z 0 ,

Jim f-+»

But, by the choice of T

Jim W(t} = et (a.s.). 1..00

= T(E),

P{sup {IX (t}-X (t) I +I W (t}- W (t} I}= O} > t~e, t> T

and so, since

E

is arbitrary, the conclusion follows.

§ 6. Asymptotic optimality In the last section we established the convergence of adaptive RM procedures under suitable assumptions. It is not clear yet, however, where lies the advantage of these procedures over the ord.i nary RM procedure examined previously. To answer this question , we shall study the limit distribution of

yr(X(t) - x0 ) as t -

oo.

6.1. Retaining the assumptions and notation of Theorem S.1, assume that in (S.1S) THEOREM

e > 1/4. Moreover, let

lim E G1 (t, z ,

.......

C&>)

1-...

and assume that for some 8 Jim

sup

>0 sup

J

(6.1)

= o:

G2 (t, z , (a)) P {d} a: 0 .

R-• Jz-.so JJl>R

(6.2)

(6 _3)

Then the process X(t) defined by (5.13), (5.10) and the initial condition X(1) = x, W(0) = 0 satisfies the asymptotic condition

√t (X(t) − x₀) ⇒ N(0, σ²/(2α²)).    (6.4)

REMARK. At first sight, the variance of the normal distribution in (6.4) seems to be twice as small as the optimal variance mentioned at the beginning of

VII. MODIFICATION OF STOCHASTIC APPROXIMATION PROCEDURES

§5. Recall, however, that in order to calculate X(t+1) we need 2t independent observations, taken at the points X(i) ± c(i), i = 1, …, t. Clearly, for the ordinary RM procedure (5.4) with the optimal choice of the parameter a we also have

√t (X(2t) − x₀) ⇒ N(0, σ²/(2α²)).

This is the sense in which Theorem 6.1 furnishes an asymptotically optimal result.
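The adaptive idea of this section can be sketched in code. The details of (5.9)–(5.13) are not fully legible in this copy, so everything below is a hypothetical scalar illustration: the regression is probed at X(t) ± c, the symmetric difference feeds a running slope estimate W(t), and X(t) is updated Newton-style by the averaged observation; the function R, the noise, the burn-in, and the floor on the divisor are all assumptions:

```python
import random

# Hypothetical scalar sketch of an adaptive finite-difference procedure:
# two noisy values of R at X(t) +/- c per step, running slope estimate W(t).
def adaptive_rm(x0, alpha, c, n, rng, start=3.0, burn=25):
    def noisy_r(x):
        return -alpha * (x - x0) + rng.gauss(0.0, 1.0)   # assumed linear regression
    # burn-in: estimate the slope at the starting point before moving X
    slopes = [(noisy_r(start + c) - noisy_r(start - c)) / (2 * c) for _ in range(burn)]
    w, m, x = sum(slopes) / burn, burn, start
    for t in range(1, n + 1):
        yp, ym = noisy_r(x + c), noisy_r(x - c)
        m += 1
        w += ((yp - ym) / (2 * c) - w) / m     # W(t): running mean of slope estimates
        w_eff = min(w, -0.1)                   # keep the divisor away from zero
        x -= ((yp + ym) / 2.0) / ((t + 1) * w_eff)
    return x

rng = random.Random(4)
estimate = adaptive_rm(x0=1.5, alpha=2.0, c=0.5, n=20000, rng=rng)
```

Since W(t) converges to the true slope, the effective gain approaches the optimal one without a priori knowledge of R′(x₀), which is the point of the adaptive modification.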

6.1. ]) We first assume that '2 < 2,. in (5.12) and the function R(x) satisfies the inequality

PROOF OF TuEOREM

that for all x E £ 1

(6.5) Let (X(t), W(t)) be the procr.ss defined by (5.13) and (5.10) with initial condition X(s) = t, W(s) = 11 (t and Tl are F,-measurable and have finite variance). It is easy to see that the process Y(t) = X(t) admits a representation analogous to (6.1.5) (as usual, we are assuming that .x0 0):

vt

Y(t

+ 1> =A._., Vit+]'

=

vA•.!. C(k>i (k>

--·

l:

~ A u ~ {R(X(l:)+c(l:))+R(X(k)-c(l:))-2aX(k)}

-""'

••

Vi

21Wc1:>J _

~ ..t., ~ G (k+ t, X (l:), 111) 1

::.

Vi

(W(k))



(6.6)

where for

k P, k..; t -

:==

-P

~ (x-zo)'

max(3/4a, r 1 ) and p 2 = min(S/4a, r 2 ) . Choose

P {sup J X (t)- Zo I>~,}+ f - p t>T

·w (t) E(pi, Pl)

(that this is possible follows from Theorem S .1 ). Now consider the process T(r), WT(t)) (t (5.6):

(.X

for all t > T} < e (6.11)

> T) defined by (5.10) and

X(t + 1)-X (t)= 1

'""!.-

~k(XCt) + c(t})+.R (X(t)-c (t)) +G1 (t+1,X(t),e, for z-z0 < - e• Now the conclusion of the theorem is clearly true for the process (XT(t) , WT(t)) by virtue of part 1 of the proof. Moreover, it follows from ( 6.11) that

..

P {sup t>T

..

(IX (t)-Xr(t) 1+ 1W (t)- ~r(t) I)> O}¼and conditions (6.2) and (6.3), the process X(t) defined by (5.9) and (5.10) with initial conditions X(l) = x and W(O) = 0 satisfies (6.4). TuEOREM

The proof is analogous to the second part of the proof of Theorem 6.1 . We leave the details to the reader.

CHAPTER 8

RECURSIVE ESTIMATION (Discrete Time) We consider the application of the Robbins-Monro method to the statistical problem of estimating a distribution parameter (generally multidimensional). Our main attention will center on construetion of estimation procedures with asympto•

tically optimal (asymptotically efficient) properties. A distinctive feature of the approach is that there is no need for a priori infonnation about the estimate_d parameter. We shall therefore not discuss recursive procedures which are optimal from the standpoint of the Bayes approach (see Kalman and Bucy (1], Stratono• vi~ [I), Lipcer and Sirjaev [I] , and others). § 1. The Cramer-Rao inequality. Efficiency of estimates

let Y1 , • •• , Yn be independent random vectors in £ 1, each of which is distributed with density f(y, xX1 ) relative to a o-finite measure II(·) defined on some o•algebra of subsets of £ 1• (In particular, II{·) may be supported by a countable set of points Y;, j 1, 2, .. . ; this corresponds to r .v.'s Y 1 assuming each

=

value Y; with probability f(yi' x).) The parameter xis assumed to take some value in an open set X C E,c let us assume that the statistician "knoM" the values of the r.v.'s Y1 , ... , Y,. and the function f(y, x), but not the value of x. Even more: the statistician has no a priori information concedting x, except that x e X. In this setting, the problem of parametric statistical estimation is stated as follows : using only the "observations" Y1 , ••• , Y,,, construct an estimate x,,(Y., . .. , Y,,) of the unknown parameter x. In rigorous terms an estimate or statistic based on 11 observations is defined to be any N,,-measurable random variable, where N,. is the minimal o-algebra of events relative to which all the r.v.'s Y1 , •.• , Y,. are measurable.

Of course, this very general definition of an estimate admits functions of ( 1 ) One then uy, that Y 1, .. . , Y,, ue independent observation, from • populatioo witb density /(y, x ).


observations which are quite dissimilar from the estimated quantity. It is therefore natural that one tries to pick out from the set of all possible estimates those which are "best" in the sense of some quality index. The most frequently employed index is

n

E_x |x_n(Y₁, …, Y_n) − x|² = ∫⋯∫ |x_n(y₁, …, y_n) − x|² ∏_{i=1}^{n} f(y_i, x) ν(dy₁) ⋯ ν(dy_n).

(Henceforth, the symbols E_x and P_x mean that the expectation and probability are evaluated for a parameter value x.) We recall a few standard definitions, due primarily to Fisher (see Cramér [1]). A statistic x_n is said to be an unbiased estimate of the parameter x if E_x x_n = x for all x ∈ X, and it is asymptotically unbiased if E_x x_n → x as n → ∞.

J

Zn ('Y)f,. (~,

z) 'Vn (d~) = Z,

j Zn(~) /n (-Y, z+ Aµ) Vn (d~)= z+A1,1, whence it follows, as before, that

f (z,. {~)-z)(/n W, z+ 6µ)- /,a(~, z)) Vn (d~) = ~fl.

(2.5)

Taking the scalar product of (2.S) by X and using the Cauchy-Schwarz-Bunjakov• skii inequality, we obtain the analog of (1 .6):

u¥ (A, µ)1 ~2 j (A, Zn(~)- z)1 (/,. (-Y, z) + /,. (~, z+ Aµ)) ..... {~) - ~ -Y/,.(-Y,z)) ~-, v,.(d~). ~~ xJr (}"/,.(~,z+Aµ) We now find, using {2.2) and repeating the derivation of (1.8), that

l\1 {A, 1,1)1 ~2 [(S,. (z) A, A)+ (Sn (z x

+ A11) A, A) +. A' (A, µ)

2)

11/

J('Vf(y, z + Aµ)-l'/(y, z)}2v(dy) .

(2.7)

Finally, by virtue of the condition ../f(y, x) e ~(X), we deduce from (23) and (2.4) that b.

J

(l'/(y, z+Aµ)-Y/(y,z ))2v(dy)=Jv(dy) ·

( 6

r

~

('J/(y,

Z ..!.. uµ.)

Vaz 2

l)

/(V, ZTllfl)

~< !A \ (/(z+uµ)µ, µ)du. i

)!

. ,) du .

(2.8)


Letting ..:1 -

0 and using (2.6H2.8), we obtain

Settingµ

=r

1 (x)X,

we obtain an inequality equivalent to (2.4):

Titis completes the proof. Inequality (2.4) is usually proved under different assumptions (see, for example, Rao (1)). A similar inequality may be established for biased estimates, but we shall not do this here. In the light of (2.4), the following definition is natural. An estimate x,. of the parameter xis said to be efficient if S,.(x) 1 n- r 1 (x) for all x E X. An estimate x,. such that

lim (nSn (.r))

=

=

J-1 (.r) ,

f\➔CID

is said to be strongly asymptotically efficient. It is shown in courses of statistics that, under fairly broad assumptions, the

most familiar estimates x,. are asymptotically normal:

V~ (Zn

- %)

~ fil (0,

S (:r)).

If the covariance matrix of the limiting normal distribution coincides with J⁻¹(x), the estimate x_n is said to be weakly asymptotically efficient or, simply, asymptotically efficient.

EXERCISE 2.1. Let f(y, x) be the Gaussian density in E_l with vector of means m₁(x) and nonsingular covariance matrix m₂, independent of the parameter. Show that the information matrix J(x) may be expressed as

J(x) = (∂m₁(x)/∂x)* m₂⁻¹ (∂m₁(x)/∂x).    (2.9)
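Formula (2.9) can be checked numerically in the scalar case, where it reads J(x) = m₁′(x)²/m₂. The sketch below computes E[(∂/∂x log f(Y, x))²] by quadrature for an illustrative regression m₁(x) = x² (this example and the grid parameters are assumptions, not the text's):

```python
import math

# Numerical check of (2.9) in the scalar case: for Y ~ N(m1(x), m2), with m2
# free of the parameter, the Fisher information is J(x) = m1'(x)^2 / m2.
def fisher_info_numeric(m1, dm1, m2, x, half_width=10.0, n=4000):
    mu, s = m1(x), math.sqrt(m2)
    h = 2.0 * half_width * s / n
    total = 0.0
    for i in range(n):
        y = mu - half_width * s + (i + 0.5) * h
        dens = math.exp(-(y - mu) ** 2 / (2.0 * m2)) / math.sqrt(2.0 * math.pi * m2)
        score = (y - mu) * dm1(x) / m2         # d/dx log f(y, x)
        total += score ** 2 * dens * h         # midpoint rule for E[score^2]
    return total

# illustrative example: m1(x) = x^2 (so m1'(x) = 2x), m2 = 0.5, evaluated at x = 1.5
j_numeric = fisher_info_numeric(lambda x: x * x, lambda x: 2.0 * x, 0.5, x=1.5)
j_formula = (2.0 * 1.5) ** 2 / 0.5
```

The quadrature value agrees with the closed form to high accuracy, as (2.9) predicts.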

Let f(y, x) be a density in £ 1, x EX C E1c, and /(x) the corresponding Fisher information matrix. Assume that x,. is an asymptotically efficient estimate of x, so that EXERCISE 2.2.

Vn (Xn -

X)

~ m(0, J-l (x)).


Let z = ~x) be a fwiction from X to E" such that the matrix nonsingular.(2) Show that a) the information matrix of f(y, x) for the parameter ~x) is

a,pfiJx is

b) ~x") is an asymptotically nonnal and asymptotically efficient estimate of ~x).

§3. Estimation of a one-dimensional parameter

There are several general methods for constructing estimates of a parameter x₀ based on independent observations (maximum likelihood method, Bayes and generalized Bayes estimates, and so on). Common to all these methods is the shortcoming that the passage from an estimate x_n based on n observations to an estimate x_{n+1} involves a quite complicated calculation, utilizing all the previous observations. This precludes widespread use of computers in constructing estimates. There is thus a pressing need for estimation methods not requiring complicated "recalculation", such as a method in which the estimate x_{n+1} at time n + 1 could be determined from a formula

:i;.+ 1

=

q, (n,

Xi, ••• , Xn,

Yn+ 1),

n

=

0, 1, . ..

where ,p is a relatively easily computed function. The most convenient and practi• cal estimation procedures would be those in which X" +"l could be determined knowing only the previous estimate Xn and the new observation Yn+l : (3.1) where ,pis a function whose computation involves considerably less labor than,

say, maximum likelihood or Bayes estimates. (Henceforth X" will denote esti• mates calculated from recursive formulas of type (3 .I), and the notation x" will be reserved for arbitrary estimates.) REMARK, We have already stated that an estimation procedure of type (3.1) is of practical interest only when conputation of the fwiction ,p itself to within the prescribed accuracy does not require excessive time. If no restrictions are imposed on the class of admissible functions ,p, we can express any estimate as one of the components of a two-dimensional estimate of type (3.1). Indeed, suppos• ing to fix ideas that Y = Ep let X,, = cl>"(Yp ... , Y,,) be a sequence of func• tions defining a one-to-one mapping of£" into £ 1 , and xn any estimate, i.e. any measurable function of the observations Y1 , • • • , Y". Then ( 2) So tbat UM • - function .,-1(z) exists.


= Zn+t (Y,, ... , Y,., Y,.+1) = z,.~ 1 (;1 (i..,.) , Y,. + 1) = cp (n + 1, ).,., Y,.+1), 1 ).,.+I = $ 11 +1 (Y1, • • •• Y,., Y,.+1) = Cl>n+I (; ().,.), Y,.+1). Thus the ve~tor (.xn+ 1 , An+ 1 ) JS expressed recursively in terms of (xn, X,,) and Zn+t

Yn + 1 . · It is quite clear that this "universal" method for constructing recursive estimates is of little practical use, since it is in effect equivalent to storing all the observations in the computer memory. Below we shall examine some concrete methods for construction of esti· mates of type (3.1). As follows from the above discussion, these estimates will be practical only provided the function ip can be evaluated with comparative ease with the desired accuracy. DEnNITJON. An estimation procedure in which the estimate Xn+ 1 is expressible in terms of Xn and Yn+ 1 as in (3.1) is called a recursive procedure, and the estimates X 1 , ••• , Xn are called recunive estimates. If the obsemtions Y1 are independent and X0 is independent of the observations, it follows from Theorem 2.3.1 that recursive estimates form a Markov chain. To simplify the exposition, we shall always assume that X 0 ·is deterministic, usually arbitrary. Even when the estimate (3.1) is far from efficient, it may exhibit advantages over, say, the maximum likelihood estimate, since it may be computable in the _same period of time for much larger values of n (i.e., using a larger number of observations) and practically without use of the computer memory. In the examples given below, we use the following notation: Y1 , ••• , Yn, Y1 E E 1 , are independent random variables (observations), distributed with density f(y, x 0 ) with respect to a measure ll(dy), x 0 E £ 1 is an unknown parameter, and

f y/(y, ,:) =l

m 1 (z)= tni(z)

v(dy),

(y-m 1 (z))2 / (y, z)v(dy)

are the expectation and variance of Y1 when the parameter value is x. We shall also use the symbols E and P for the expectation and probability when the parameter value is x 0 , e.g., EY1 = m 1 (x 0 ). ut x, y E £., and let m 1(x) and m 2 (.x) be finite and m 1 (x) a strictly increasing function,(3) increasing at most linearly as Ix I - ..,_ Consider the procedure

X0 = coost,

(3.2)

( 3 ) Instead, ii is auff"u:.ient to assume that (m1(.x) - m 1(.xo))(.x - .xo) is positive for .x 'll .x0 , aod that m 1(.x) is continuous.


where

(3.3) It is clear that the function E(Y,.+ 1 - m 1 (x)) = m 1(x0 ) - m 1(x) is positive for x < x 0 , negative for x > x 0 , and increases at most linearly as lxl - 00• Since, moreover, Var Y,.+ 1 = m2(x 0) < 00, it follows by an application of Theo• rem 4.1.l that the RM procedure (3.2) yields a strongly consistent estimate of the parameter x. Under certain additional assumptions, this estimate is asymptotically normal, sometimes even asymptotically efficient. Suppose that, over and above the previous assumptions, the function m1 (x) is differentiable at x 0 , and m~(x0 ) = a

>O. Then, by Theorem 6.5.2 (see Remark 6.6.1), we see that the procedure

X,.+1-Xn

=,.:

1 (Yn+1-mi(Xn}),

X 0 = Const,

(3.4)

satisfies the asymptotic condition

√n (X_n − x₀) ⇒ N(0, a² m₂(x₀)/(2aα − 1))

as n → ∞, provided 2aα > 1. If it is also assumed that m₁′(x) ≥ c > 0 for x ∈ E₁ and E|Y_i|^p < ∞ for some even p ≥ 4/min(2ac, 1), then the procedure (3.4) also satisfies the assumptions of Theorem 6.7.2. Consequently, under the above conditions,

(3.5) Finally, we note that Theorem 6.3 enables us to calculate the limiting joint distribution of the random variables

as n -

1

n/n - e i. We summarize the above results in the following theorem. 00 ,

THEOREM 3.1. Let Y₁, Y₂, … be independent observations from a population with density f(y, x₀), x₀, y ∈ E₁. Assume that the functions m₁(x) and m₂(x) defined at the beginning of the section are such that m₁(x₂) > m₁(x₁) for x₂ > x₁, m₂(x₀) < ∞, and |m₁(x)| increases at most linearly as |x| → ∞. Then the procedure (3.2), with condition (3.3), yields a strongly consistent estimate of x₀.


=

If in addition m₁′(x₀) = α > 0 and 2aα > 1, then the procedure (3.4) is asymptotically normal; moreover, for any k > 0, t_i > 0, i = 1, …, k, and n < n₁ < ⋯ < n_k such that ln(n_i/n) → t_i as n → ∞, the joint distribution of the r.v.'s

√n (X_n − x₀), √n₁ (X_{n₁} − x₀), …, √n_k (X_{n_k} − x₀)

converges as n → ∞ to the joint distribution of the r.v.'s X(0), X(t₁), …, X(t_k), where X(t) is a stationary Gaussian Markov process satisfying the stochastic differential equation

dX(t) = (½ − aα) X(t) dt + a √(m₂(x₀)) dξ(t).

Finally, if in addition m₁′(x) ≥ c > 0 for all x ∈ E₁ and E|Y_i|^p < ∞ for some even number p ≥ 4/min(2ac, 1), then (3.5) is also true.
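The limit law of Theorem 3.1 can be checked by simulation. The regression m₁(x) = 2x (so α = 2), the variance m₂ = 0.9, the gain a = 1, and the sample sizes below are illustrative assumptions; the theorem then predicts √n(X_n − x₀) ⇒ N(0, a²m₂/(2aα − 1)) = N(0, 0.3):

```python
import random

# Procedure (3.4): X_{n+1} = X_n + (a/(n+1)) * (Y_{n+1} - m1(X_n)), X_0 = 0.
def rm_estimate(ys, a, m1):
    x = 0.0
    for n, y in enumerate(ys):
        x += (a / (n + 1)) * (y - m1(x))
    return x

x0, m2, a = 1.0, 0.9, 1.0          # illustrative parameter choices
rng = random.Random(5)
ests = []
for _ in range(400):
    ys = [2.0 * x0 + rng.gauss(0.0, m2 ** 0.5) for _ in range(5000)]
    ests.append(rm_estimate(ys, a, lambda x: 2.0 * x))
mean_est = sum(ests) / len(ests)
scaled_var = 5000 * sum((e - mean_est) ** 2 for e in ests) / len(ests)
```

The empirical mean lies at x₀ and the n-scaled variance near 0.3, matching the asymptotic variance of the theorem.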


.R

~ c\

1

(y-mi(zo))1 f (!I, z 0 ) v(dy)

111-"'t(zell> lie,

and from the finiteness of m₂(x₀). A comparison of the procedures (3.4) and (3.7) shows that for large n the latter approaches the procedure (3.4) with the optimal a = 1/α. At each sampling time, the procedure replaces α = m₁′(x₀) by its estimate α_n = m₁′(X_n), which is consistent since X_n → x₀. In this sense, the procedure (3.7) may be termed "adaptive".

EXERCISE 3.1. Assume that the equalities

may be differentiated with respect to x under the integ,al sign. Prove that (3.9)

It follows from (3.9) that the efficiency of the estimates (3.4) and (3.7) cannot exceed unity.

To end this section, we consider an example in which each of the independent observations Y_i is the sum of a "signal" m₁(x₀) (where m₁(x) satisfies the assumptions of Theorem 3.1) and a Gaussian r.v. ξ_i with zero mean and variance m₂ independent of the parameter:

Y_i = m₁(x₀) + ξ_i.    (3.10)

Then formulas (3.6) and (1.13) yield the following proposition: the procedure (3.7) produces an asymptotically normal and strongly asymptotically efficient estimate of the parameter x₀.

We now discuss the condition |m₁′(x)| ≥ c > 0. It guarantees, among other things, that m₁(x) is strictly monotone. It is clear that if m₁(x) is not monotone there may not exist even a consistent estimate of the parameter. For example, if m₁(x₀) = m₁(x₁) for x₀ ≠ x₁, the observations (3.10) will not enable us to distinguish between the parameter values x₀ and x₁. If the function m₁(x) nevertheless varies monotonically for sufficiently large |x|, the procedure (3.2) will converge to one of the solutions of the equation m₁(x) = m₁(x₀) (see Theorem 5.2.1). Now let m₁(x) be strictly monotone but m₁′(x₀) = 0. Then


=

(3.2) provides a consistent estimate of x₀. But by (1.13) we have J(x₀) = 0. In conjunction with inequality (1.9), this implies that the variance of any estimate will converge to zero more slowly than 1/n. We shall not go into details.

If it is known a priori that x₀ ∈ (a, b), it is natural to employ one of the truncated s.a. procedures of §§7.1–7.4 to construct the estimates. All the results of this section remain valid when this is done, and, of course, there is no need for any restrictions on the rate of increase of m₁(x) when (a, b) is a finite interval.
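The adaptive procedure (3.7), which replaces the fixed gain a by 1/m₁′(X_n) at each step, is also easy to simulate. The regression m₁(x) = 2x + sin(x) (strictly increasing, with slope in [1, 3]) and all sizes below are illustrative assumptions; for them the limit law is N(0, m₂/m₁′(x₀)²) with m₂ = 1:

```python
import math
import random

# Sketch of the adaptive procedure (3.7):
#   X_{n+1} = X_n + (Y_{n+1} - m1(X_n)) / ((n+1) * m1'(X_n)),  X_0 = 0.
def adaptive_estimate(ys, m1, dm1):
    x = 0.0
    for n, y in enumerate(ys):
        x += (y - m1(x)) / ((n + 1) * dm1(x))
    return x

m1 = lambda x: 2.0 * x + math.sin(x)   # assumed regression, slope in [1, 3]
dm1 = lambda x: 2.0 + math.cos(x)
x0 = 1.0
rng = random.Random(6)
ests = []
for _ in range(400):
    ys = [m1(x0) + rng.gauss(0.0, 1.0) for _ in range(4000)]
    ests.append(adaptive_estimate(ys, m1, dm1))
mean_est = sum(ests) / len(ests)
scaled_var = 4000 * sum((e - mean_est) ** 2 for e in ests) / len(ests)
target_var = 1.0 / dm1(x0) ** 2        # m2 / m1'(x0)^2 with m2 = 1
```

The n-scaled variance approaches m₂/m₁′(x₀)², the value the ordinary procedure (3.4) attains only with the optimal, usually unknown, choice a = 1/α.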

3.2. Let Y 1, Y 2 , •• • be independent observations with density f(y, x 0 ), such that m 1 (x) =0, m 2 (x) increases monotonically but does not ex• ceed K(l + x 2 ), and m4 = Er;' < 00 exists. 1) Prove tliat the procedure EXERCISE

together with conditions (3.3), provides a strongly consistent estimate of x 0 •

2) Suppose it is known that x 0 E (a, b), where a >-00, and that lm;{x)I > c > 0 for x E (a, b ). Prove that the procedure X 0 = const

is strongly corwstent and asymptotically normal, and its asymptotic variance is m4 -ml (zo) n (mi (Zo))2



3) Show that this procedure is asymptotically efficient if Y1 are Gaussian r .v.'s.

§4. Asymptotically efficient recursive procedure The procedures examined in §3 were asymptotically efficient only for Gaussian distributions. This is not surprising, for their construction utilized only the first two moments of f(y, x ). In this context it is only natural to ask whether, by using fuller infonnation about the function f(y, x), one cannot construct asymp· totically efficient recursive procedures valid for a larger class of distributions. This can indeed be done, and the important result will presently be proved. We first introduce a function

M(x) = ∫ ln [f(y, x)/f(y, x₀)] f(y, x₀) ν(dy).    (4.1)

This function clearly vanishes at x = x 0 and is nonpositive for all x. In fact, using the inequality 1n z < z - 1 (z ¢ 1), we have


M(x) = E ln [f(Y_i, x)/f(Y_i, x₀)] ≤ ∫ f(y, x) ν(dy) − 1 = 0.    (4.2)

Thus M(x) assumes its maximum value at x₀. If this function is also differentiable, then m(x) = M′(x) vanishes at x = x₀. In the following theorem we assume in addition that M(x) is a monotone function for x < x₀ and for x > x₀. Certain other assumptions, of a purely technical nature, will also be adopted.

THEOREM 4.1. Assume the following conditions:

l. The functions f(y, x) and M(x) are twice differentiable with respect to x, and the equalities (4.l) and

J/(y, x) v (dy) = 1

(4.3)

may be differentiated twice under the integral sign. 2. For all x -:I= x 0 , m (x) (x - x 0)

< 0.

(4.4)

3. The function

G(x)=J-2 (:r) E ((ln/(Y;, x))~J 2 =J-2 (:r)

J(';(~:•x~))

increases at most as a quadratic function as lxl 4. The integral

J

I

2

f(y, x 0 )qdy)

00•

(';(~,x~)) 2 /(y, x 0)v(dy)-+-0

as

R-+oo

(I/, x) >R}· { 11: , ,~I (!/, X)

uniformly in Ix - x 0 I < e (e > 0). 5. The functions I(x) and G(x) are continuous and positive for x E £ 1 • Then the recunive procedure

X_{n+1} − X_n = (1/((n+1) I(X_n))) · f′_x(Y_{n+1}, X_n)/f(Y_{n+1}, X_n),  X₀ = const,    (4.5)

gives a strongly consistent, asymptotically normal and asymptotiazlly efficient estimate of the parameter x 0 • If condition 4 is replaced by the anumption that, for some µ. > 0, k e > 0 and all Ix - x 0 I < e

J( /~(//,

x) /~(//, zo) }2/( ) l(x)/(11, z)- /(zo)/(11,zo) y, Zo v

(d) y
0,

kl x-xo 114'

then the second assertion of Theorem 3.1 is valid for the process X(t) described by the equation


= - ~ X (t) dt + 1-112

dX (t)

Finally, if it is also 11'Ue that for all JC m (z) (z) (x-x 0)

1

* JC

0

(z0) d6 (t)

there exists >.

> 0 1uch that

< - r,. (x-x0) I

and E

I ,~


., 1), then the estimJZte (4.5) is strongly asymptotically efficient. PROOF.

Let ,.,,. (Y ...,

) n+h X

t

= I (z)

/~ (Y n+1o z) I (Y11+1, z) •

aearly,

,.,,. y

m (z)

E..., ( 11+1t x) = 1 (.r) ,

E[

~ (Y,, z 0)

=

I (:t,)•I-2 (z0)

=

1-1 (z0) ,

the first part of the theorem is proved. The second and third parts follow from Theorems 6.6.3 and 6 .7.2. The most essential assumption of Theorem 4.1 is (4.4). If it fails to hold, the theorems of Olapter 5 imply that the procedure (4.5) is not consistent. In particular, let us consider the estimation of the shift parameter of a distribution. In this problem, ll(dy) = dy and Y; = x + t;, where the t; are

§5. ESTIMATING A MULTIDIMENSIONAL PARAMETER

independent r.v.'s with density f(y), and x is an unknown parameter, with f(y, x) = f(y − x). Let f(y) be positive for y ∈ E₁ and twice differentiable, satisfying condition (4.4). It is readily seen that the other assumptions of Theorem 4.1 hold, provided that

I = ∫ (f′(y))²/f(y) dy < ∞

and that for some c₁, c₂ > 0,

G(x) = ∫ (f′(y)/f(y))² f(y + x) dy ≤ c₁ + c₂ x².
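The shift-parameter case of procedure (4.5) can be illustrated concretely. For the logistic density f(u) = e^{−u}/(1 + e^{−u})², the Fisher information is I = 1/3 and the score is ∂/∂x log f(y − x) = tanh((y − x)/2) (both standard facts); the simulation sizes below are illustrative assumptions. Asymptotic efficiency then means n · Var(X_n) → 1/I = 3:

```python
import math
import random

# Procedure (4.5) for the shift parameter of the logistic density f(y - x):
#   X_{n+1} = X_n + tanh((Y_{n+1} - X_n)/2) / ((n+1) * I),  I = 1/3.
def recursive_mle_shift(ys, start=0.0, info=1.0 / 3.0):
    x = start
    for n, y in enumerate(ys):
        x += math.tanh((y - x) / 2.0) / ((n + 1) * info)
    return x

x0 = 0.7
rng = random.Random(7)
ests = []
for _ in range(400):
    # logistic observations via the inverse c.d.f.: Y = x0 + log(u/(1-u))
    ys = [x0 + math.log(u / (1.0 - u)) for u in (rng.random() for _ in range(4000))]
    ests.append(recursive_mle_shift(ys))
mean_est = sum(ests) / len(ests)
scaled_var = 4000 * sum((e - mean_est) ** 2 for e in ests) / len(ests)
```

The estimate is consistent and its n-scaled variance settles near 1/I = 3, the Cramér–Rao bound for this family, while each update uses only the previous estimate and one new observation.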

§5. ESTIMATING A MULTIDIMENSIONAL PARAMETER

THEOREM 5.2. Suppose that

$$\sup\, (m_1(x_0) - m_1(x),\ x - x_0) < 0 \quad [\dots]$$

Then for any $t_i > 0$, $i = 1, \dots, k$, and $n < n_1 < \dots < n_k$ such that $\ln(n_i/n) \to t_i$ as $n \to \infty$, the joint distribution of the random vectors

$$\sqrt{n}\, (X_n - x_0),\ \sqrt{n_1}\, (X_{n_1} - x_0),\ \dots,\ \sqrt{n_k}\, (X_{n_k} - x_0)$$

converges as $n \to \infty$ to the joint distribution of the random vectors $X(0), X(t_1), \dots, X(t_k)$, where $X(t)$ is a Gaussian Markov process satisfying the stochastic equation

$$dX(t) = A X(t)\, dt + \sigma\, m_2^{1/2}(x_0)\, d\xi(t),$$

and $m_2^{1/2}(x_0)$ is some matrix satisfying the condition $m_2^{1/2}(x_0) \left( m_2^{1/2}(x_0) \right)^* = m_2(x_0)$. If, moreover, there exists $\lambda > 0$ such that

$$\left| (m_1(x) - m_1(x_0),\ x - x_0) \right| \ge \lambda\, |x - x_0|^2$$

and $E|Y_n|^p < \infty$ for an even $p \ge 4/\min(2a\lambda, 1)$, then

$$\lim_{n \to \infty} \sqrt{n}\, E(X_n - x_0) = 0, \qquad \lim_{n \to \infty} n\, E(X_n - x_0)(X_n - x_0)^* = S.$$

The proof follows from Theorems 6.6.1, 6.6.3 and 6.7.2.

As in the one-dimensional case, one can improve the procedure (5.4) by adopting an "adaptive" estimation procedure

$$X_{n+1} - X_n = \frac{1}{n+1} \left( \frac{\partial m_1}{\partial x}(X_n) \right)^{-1} \left\{ Y_{n+1} - m_1(X_n) \right\}, \qquad X_0 = \text{const}. \tag{5.5}$$

THEOREM 5.3. Let $|m_2(x_0)| < \infty$, let the matrix $\partial m_1/\partial x$ be continuously differentiable for all $x \in E_l$, and assume that its determinant $|\partial m_1/\partial x|$ satisfies the condition

$$\infty > C > \left| \frac{\partial m_1}{\partial x} \right| > c > 0 \quad \text{for } x \in E_l.$$

Assume, moreover, that for all $x \ne x_0$,

$$\left( \left( \frac{\partial m_1}{\partial x}(x) \right)^{-1} \left( m_1(x_0) - m_1(x) \right),\ x - x_0 \right) < 0.$$

... but if $l > 1$, we can reduce the situation to $l = 1$ with a longer observation interval. For example, let $l = 2$ and $k = 1$. We may then assume that one "component" function is being observed over the time interval $(0, 2T]$. A similar reduction procedure applies when $k > l$ in cases for which a consistent estimate of $x_0$ nevertheless exists (see §4).

Evaluation of the estimate (2.2) involves an integral $\int_0^T \varphi(t)\, dY(t)$, where $Y(t)$ is the process defined by (2.1). The rigorous definition of an integral of this type is precisely similar to the definition of a stochastic integral with respect to a Wiener process (see §3.3), using a limit passage from $N_t$-measurable step functions $\varphi(t)$.

§2. APPLICATION OF THE ROBBINS-MONRO PROCEDURE

$$dY(t) = m(x_0)\, dt + \sigma\, d\xi(t), \tag{2.3}$$

where the matrix

$$\frac{\partial m(x)}{\partial x} = \left( \left( \frac{\partial m_i(x)}{\partial x_j} \right) \right)$$

is nonsingular, and examine the estimation procedure

$$dX(t) = \frac{1}{t+1} \left[ \frac{\partial m}{\partial x}(X(t)) \right]^{-1} \left( dY(t) - m(X(t))\, dt \right). \tag{2.4}$$

In view of (2.3), we obtain(³)

$$dX(t) = \frac{1}{t+1} \left[ \frac{\partial m}{\partial x}(X(t)) \right]^{-1} \left\{ \left( m(x_0) - m(X(t)) \right) dt + \sigma\, d\xi(t) \right\}. \tag{2.5}$$

We shall assume that the conditions are such as to guarantee convergence of the solution of equation (2.5) for a deterministic initial condition $X(0)$ to $x_0$ as $t \to \infty$; in addition, we stipulate that the matrix $\partial m(x)/\partial x$ satisfies a Hölder condition in a neighborhood of $x_0$, i.e. for some $\varepsilon > 0$ and $\gamma > 0$,

$$\left\| \frac{\partial m}{\partial x}(x) - \frac{\partial m}{\partial x}(x_0) \right\| \le K\, |x - x_0|^{\gamma} \quad \text{for } |x - x_0| < \varepsilon.$$
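A minimal numerical sketch of the procedure (2.4), under the illustrative assumption of the scalar linear model $m(x) = x$ (so $\partial m/\partial x \equiv 1$) with unit noise intensity:

```python
import math, random

def simulate_continuous_rm(x0=2.0, sigma=1.0, T=200.0, dt=0.01, seed=1):
    """Euler-Maruyama sketch of procedure (2.4) for the illustrative scalar
    linear model m(x) = x (so dm/dx = 1):
        dY(t) = x0 dt + sigma dW(t),
        dX(t) = (dY(t) - X(t) dt) / (t + 1)."""
    rng = random.Random(seed)
    x, t = 0.0, 0.0
    while t < T:
        dW = rng.gauss(0.0, math.sqrt(dt))
        dY = x0 * dt + sigma * dW        # observed increment of Y
        x += (dY - x * dt) / (t + 1.0)   # procedure (2.4)
        t += dt
    return x

est = simulate_continuous_rm()
print(est)  # drifts toward x0 = 2.0 as T grows
```

For this linear model the procedure amounts to continuous-time averaging of the observed increments, so $X(t) \to x_0$ at rate $1/\sqrt{t}$.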

§4. APPLICATIONS

$$\cdots \le \sqrt{\operatorname{var} x_0}\ \exp \left\{ -k \exp \left( 2(C - R) T \right) \right\}.$$

On the other hand, it is well known from the fundamental papers of Shannon (see, for example, [1]) that the capacity of the channel is $C$. It follows that the stipulated method of transmission guarantees a signaling rate arbitrarily close to the capacity and at the same time expresses the probability of erroneous reception in terms of a double exponential.

A similar method may be employed for another constraint on the input signal. Following D'jačkov and Pinsker [1], we again consider the channel (4.1) with instantaneous feedback, assuming now that the total transmission energy over the interval $(0, T)$ is bounded:

$$\int_0^T B^2(t, x_0, Y)\, dt \le P_{\mathrm{av}} T. \tag{4.8}$$

In view of the previous example, it is natural to assume that the signal to be transmitted will have the form

$$B(t) = \frac{\sqrt{P_{\mathrm{av}}}}{a\, D(t)} \left( X(t) - x_0 \right), \tag{4.9}$$

and the stochastic differential of the estimate will be

$$dX(t) = -\frac{b \sqrt{P_{\mathrm{av}}}}{\sigma^2}\, D(t)\, dY(t). \tag{4.10}$$

Here $a$ and $b$ are undetermined constants, to be selected optimally in a sense to be specified below, and, as before, $D^2(t) = E(X(t) - x_0)^2$.

The transmission method (4.1), (4.9), (4.10) does not necessarily satisfy condition (4.8). We may decide, however, to cut off transmission whenever the energy exceeds $P_{\mathrm{av}} T$, stipulating that in such cases the message has been incorrectly received. As we shall show presently, the parameters $a$ and $b$ may be chosen so that for large $T$ the probability

$$p_1 = P \left\{ \int_0^T B^2(t)\, dt \ge P_{\mathrm{av}} T \right\} \tag{4.11}$$

will be quite small, so that the probability of "error" due to excess energy will also be small. Before investigating the properties of this transmission system, we must solve two simple problems.

EXERCISE 4.1. Show that for the system (4.1), (4.9), (4.10) the function $D(t)$ satisfies the equation

$$D(t) = D(0) \exp \left\{ -C \left( \frac{2b}{a} - b^2 \right) t \right\}. \tag{4.12}$$
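The decay law (4.12) can also be checked by simulation. The sketch below uses the reconstruction of (4.9)-(4.10) given above, with receiver gain $b\sqrt{P_{\mathrm{av}}}/\sigma^2$, and takes the capacity as $C = P_{\mathrm{av}}/(2\sigma^2)$; the gain, the expression for $C$, and all numerical values are illustrative assumptions, not the authors' text:

```python
import math, random

def simulate_feedback_transmission(Pav=2.0, sigma=1.0, a=2.0, b=0.5,
                                   D0=1.0, T=2.0, dt=2e-3,
                                   n_paths=1500, seed=7):
    """Monte Carlo check of (4.12), under the reconstruction of (4.9)-(4.10)
    adopted above: dY = B dt + sigma dW with B = sqrt(Pav)(X - x0)/(a D(t)),
    and the receiver runs dX = -(b sqrt(Pav)/sigma^2) D(t) dY.
    The capacity is taken as C = Pav/(2 sigma^2); all numbers are illustrative."""
    rng = random.Random(seed)
    C = Pav / (2.0 * sigma ** 2)
    g = b * math.sqrt(Pav) / sigma ** 2       # receiver gain in (4.10)
    rate = C * (2.0 * b / a - b ** 2)         # decay rate predicted by (4.12)
    total = 0.0
    for _ in range(n_paths):
        e, t = D0, 0.0                        # e = X(t) - x0, with D(0) = D0
        while t < T:
            D = D0 * math.exp(-rate * t)      # deterministic D(t) from (4.12)
            dW = rng.gauss(0.0, math.sqrt(dt))
            dY = math.sqrt(Pav) * e / (a * D) * dt + sigma * dW
            e -= g * D * dY
            t += dt
        total += e * e
    empirical = total / n_paths               # Monte Carlo E(X(T) - x0)^2
    predicted = (D0 * math.exp(-rate * T)) ** 2
    return empirical, predicted

emp, pred = simulate_feedback_transmission()
print(emp, pred)  # the two agree to within Monte Carlo error
```

The empirical mean-square error tracks $D^2(t)$, which is the content of (4.12).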

IX. RECURSIVE ESTIMATION (CONTINUOUS TIME)

EXERCISE 4.2. Show that the process $B(t)$ defined by (4.1), (4.9), (4.10) is a Markov process, described by the stochastic equation

$$dB(t) = -C b^2\, B(t)\, dt - \frac{b\, P_{\mathrm{av}}}{a \sigma}\, d\xi(t). \tag{4.13}$$

We now cite a proposition which can be proved by the method of Libkind [1]: if $Z(t)$ is a one-dimensional Gaussian Markov process, described by the equation

$$dZ(t) = -\alpha Z(t)\, dt + \beta\, d\xi(t), \qquad Z(0) = z, \quad \alpha > 0,$$

then the following asymptotic formula holds as $T \to \infty$ $(\varkappa > 1)$:

$$\ln P \left\{ \int_0^T Z^2(t)\, dt > \varkappa\, \frac{\beta^2}{2\alpha}\, T \right\} \approx -\frac{\alpha (\varkappa - 1)^2}{4 \varkappa}\, T. \tag{4.14}$$

It is now quite easy to select the transmission parameters in such a way that, on the one hand, the signaling rate will be arbitrarily close to $R$ and, on the other, the error probability will be asymptotically minimal. Indeed, in view of (4.12), we can guarantee that whenever

$$\frac{2b}{a} - b^2 = \frac{R}{C}, \tag{4.15}$$

the rate will be arbitrarily close to $R$. It is clear that for sufficiently large $T$ the principal part of the error probability $p_{\mathrm{er}}$ is the number $p_1$ defined by (4.11).

EXERCISE 4.3. Deduce from (4.13)-(4.15) that for signaling rates arbitrarily close to $R$ the limit of $-\ln \min_{a,b} p_1 / T$ as $T \to \infty$ is equal to the maximum (with respect to $a$ and $b$) of the function

$$\frac{C b^2 (a^2 - 1)^2}{4 a^2}$$

under the constraint (4.15). Show that this maximum is achieved when $a^2 = 2\sqrt{C/R} - 1$, and that it is equal to $(\sqrt{C} - \sqrt{R})^2$.

Thus, we must estimate the $k$-dimensional vector $x_0$ on the basis of $k$-dimensional observations

$$dY(t) = \Phi(t)\, x_0\, dt + \sigma\, d\xi(t). \tag{4.17}$$

We may thus extend our concepts of efficiency and asymptotic efficiency to problem (4.16). In this case,

$$S_0(t) = \sigma^2 J, \qquad B(t) = \sigma^{-2} \int_0^t \Phi^*(s)\, \Phi(s)\, ds.$$

It follows from Theorem 3.1 that if

$$\inf_{z \ne 0} \frac{\int_0^T |\Phi(t)\, z|^2\, dt}{|z|^2} \to \infty \quad \text{as } T \to \infty, \tag{4.18}$$

then the procedure (3.1) furnishes an asymptotically efficient Gaussian estimate for $x_0$. Of course, the above arguments are also applicable when the function $f(t)$ is nonlinear in the $k$-dimensional parameter $x_0$.

§5. A modification

The estimation procedure (3.1) has a serious shortcoming, in that it involves evaluation of the matrix $(J + B(t, X(t)))^{-1}$; the requisite computational labor increases rapidly with the dimension. We shall therefore discuss another procedure, which avoids this difficulty. In accordance with §3,

$$B(t, X(t)) = \int_0^t \frac{\partial m^*}{\partial x}(s, X(t))\, S_0^{-1}(s)\, \frac{\partial m}{\partial x}(s, X(t))\, ds.$$

If $X(t)$ is a consistent estimate, $X(t) \to x_0$, one naturally expects that for large $t$ this matrix may be replaced, at the cost of a relatively small error, by the matrix

$$Z(t) = \int_0^t \frac{\partial m^*}{\partial x}(s, X(s))\, S_0^{-1}(s)\, \frac{\partial m}{\partial x}(s, X(s))\, ds. \tag{5.1}$$

We thus arrive at the procedure

$$dX(t) = (J + Z(t))^{-1}\, \frac{\partial m^*}{\partial x}(t, X(t))\, S_0^{-1}(t)\, \left( dY(t) - m(t, X(t))\, dt \right), \tag{5.2}$$

where $Z(t)$ is determined from (5.1).

This is not a recursive procedure in the strict sense of the definition of §2, since the computation of $Z(t)$ requires storage of past values of the estimate $X(s)$, $0 < s < t$. It is easy, however, to put (5.1) and (5.2) into recursive form, so that we can avoid the need for matrix inversion (except for inversion of the matrix $S_0(t)$; but this may be done in advance). Set $U(t) = (J + Z(t))^{-1}$. Then, in view of (5.1),

$$\frac{dU}{dt} = -U(t)\, \frac{dZ}{dt}\, U(t) = -U(t)\, \frac{\partial m^*}{\partial x}(t, X(t))\, S_0^{-1}(t)\, \frac{\partial m}{\partial x}(t, X(t))\, U(t), \tag{5.3}$$

$$dX(t) = U(t)\, \frac{\partial m^*}{\partial x}(t, X(t))\, S_0^{-1}(t)\, \left( dY(t) - m(t, X(t))\, dt \right). \tag{5.4}$$

It is evident from (5.3) and (5.4) that $(X(t), U(t))$ is a Markov process. The estimation procedure thus defined is frequently more convenient in practice than (3.1). Moreover, in those cases when it yields a consistent estimate, this estimate is apparently also asymptotically efficient. It would be interesting to have rigorous results in this direction.
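In discrete time, (5.3)-(5.4) amount to propagating $U$ alongside $X$, so that no matrix is inverted during estimation. The sketch below is an illustration: the linear model $m(t, x) = \varphi(t)^* x$ with $\varphi(t) = (1, \sin t)$ and $S_0(t) \equiv \sigma^2$ is assumed only for definiteness, and the pair (5.3)-(5.4) is Euler-discretized:

```python
import math, random

def recursive_filter(x0, T=400.0, dt=0.01, sigma=0.5, seed=3):
    """Discretized sketch of (5.3)-(5.4) for the illustrative linear model
    m(t, x) = phi(t)^T x with phi(t) = (1, sin t) and S_0(t) = sigma^2.
    U is propagated directly, so no matrix inversion is needed online."""
    rng = random.Random(seed)
    X = [0.0, 0.0]                        # estimate X(t), X(0) = 0
    U = [[1.0, 0.0], [0.0, 1.0]]          # U(0) = (J + Z(0))^{-1} = I
    t = 0.0
    while t < T:
        phi = (1.0, math.sin(t))
        dY = (phi[0]*x0[0] + phi[1]*x0[1]) * dt + sigma * rng.gauss(0.0, math.sqrt(dt))
        # innovation dY - m(t, X) dt, scaled by S_0^{-1} = sigma^{-2}
        innov = (dY - (phi[0]*X[0] + phi[1]*X[1]) * dt) / sigma**2
        Uphi = [U[0][0]*phi[0] + U[0][1]*phi[1],
                U[1][0]*phi[0] + U[1][1]*phi[1]]
        # (5.4): dX = U phi S_0^{-1} (dY - m dt)
        X[0] += Uphi[0] * innov
        X[1] += Uphi[1] * innov
        # (5.3): dU = -U phi S_0^{-1} phi^T U dt
        for i in range(2):
            for j in range(2):
                U[i][j] -= Uphi[i] * Uphi[j] / sigma**2 * dt
        t += dt
    return X

est = recursive_filter([1.0, -2.0])
print(est)  # approaches (1.0, -2.0)
```

Propagating $U$ in this way is the continuous-time analogue of the recursive least-squares update, with $U$ playing the role of an unnormalized error covariance.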

CHAPTER 10

RECURSIVE ESTIMATION WITH A CONTROL PARAMETER

The topic considered in this chapter is estimation of the unknown parameter $x$ of a density when the statistician is also able to select an additional control parameter $z$. In principle, this reduces to the optimal control of a process observed with errors, provided the a priori distribution of the unknown parameter is known. In this setting, however, determination of an optimal control proves to be a highly cumbersome operation. We therefore propose another approach, based on simultaneous recursive estimation of the parameter $x$ and of the value of $z$ for which observations are most profitable. We shall present conditions under which the risk $d_n$ of the recursive estimation plan constructed for a quadratic performance index is equivalent for large $n$ to the risk $\bar d_n$ of the optimal plan, in the sense that $d_n/\bar d_n \to 1$ as $n \to \infty$.

§1. Statement of the problem

Let $f(y \mid x, z)$ be a density relative to a measure $\nu(dy)$ on the real line, depending on parameters $x$ and $z$. To simplify the exposition, we shall assume that $x$, $y$ and $z$ are scalars, $x \in X \subset E^1$, $y \in E^1$ and $z \in Z \subset E^1$. The parameter $x = x_0$ is unknown and is to be estimated by observations with density $f$. The parameter $z$ plays the part of control. It is at the disposal of the statistician, who can either hold it constant (in which case we are again in the situation of Chapter 8, estimation with independent observations), or select its value at the $i$th observation depending on the results of previous observations:

$$Z_i = Z_i(Y_1, \dots, Y_{i-1}) = Z_i(Y^{(i-1)}). \tag{1.1}$$

In more rigorous terms, this means that the joint distribution of $Y_1, \dots, Y_n$ may be determined from the formula

$$f_n(y^{(n)}, x_0) = \prod_{i=1}^{n} f \left( y_i \mid x_0,\ z_i(y^{(i-1)}) \right). \tag{1.2}$$

We denote $Z^{(i)} = (Z_1, \dots, Z_i)$ and define an admissible control, plan of experiment, or simply plan, to be a sequence $Z_1, Z_2, \dots$ satisfying condition (1.1) and such that $Z_i \in Z$, $i = 1, 2, \dots$. Our problem is to select a control (plan)


guaranteeing the best estimation in the sense of some performance index.

As always in this book, we shall consider only a quadratic performance index, looking for $Z_i$ and $\bar x_n$ which minimize $E(\bar x_n - x_0)^2$. This problem may often be solved exactly if the estimated parameter is a random variable with a given distribution. The exact solution of this and more general problems is part of statistical decision theory and is described in numerous textbooks (see, for example, Blackwell and Girshick [1], Fel'dbaum [1], and others). However, this solution is not always in keeping with the demands of practice. As stated, the solution may be effected only when one has an "a priori" distribution of the parameter. This "a priori" obstacle is absolutely essential when the number $n$ of observations is relatively small, but in the case of interest here, when we wish to construct asymptotically optimal estimates for $n \to \infty$, it may be neglected. The point is that, under fairly general assumptions (see, for example, Ibragimov and Has'minskii [1]), the Bayes estimate with any a priori density $\pi(x)$ that does not vanish for $x \in X$ is asymptotically optimal even when $x_0$ is not a random variable.

A far more significant obstacle is the immense volume of computational labor required to evaluate the optimal solution (in the Bayesian approach; see Fel'dbaum [1], Chapter VI). For example, computation of the $n$th optimal control $\bar z_n$ as a function of $z^{(n-1)}$ and $y^{(n-1)}$ involves the following steps: 1) Compute the conditional expectation $\bar x_n = E(x_0 \mid Y^{(n)}, Z^{(n)})$ [...]

$$E_x (\bar x_n - x)^2 \ge \left( E_x \sum_{i=1}^{n} I(x, Z_i) \right)^{-1}. \tag{1.3}$$

Assume, moreover, that

$$\bar I(x) = \sup_{z \in Z} I(x, z) < \infty, \qquad x \in X.$$

Then it follows from (1.3) that, for any plan satisfying the assumptions of Theorem 8.6.1,

$$E_x (\bar x_n - x)^2 \ge \frac{1}{n\, \bar I(x)}. \tag{1.4}$$

A sequence $Z_1, Z_2, \dots$ will be called an asymptotically optimal plan if it satisfies condition (1.1) and there exists a corresponding sequence of estimates $\bar x_n$ [...] set $Z_i = 1$ for $i \in I_3$ if $\eta_1 > \eta_2$, and $Z_i = -1$ for $i \in I_3$ if $\eta_1 \le \eta_2$.

We now define an estimate $\bar X_n$ of $x_0$ thus:

$$\bar X_n = \begin{cases} 1 - \sqrt{\bar\eta}, & \text{if } \eta_1 > \eta_2, \\ -1 + \sqrt{\bar\eta}, & \text{if } \eta_2 \ge \eta_1, \end{cases} \tag{3.4}$$

where $\bar\eta$ is the arithmetic mean of the observations for which $i \in I_3$. (Of course, the result is somewhat more accurate if one also uses the observations with indices in $I_1$ if $\eta_1 > \eta_2$, and in $I_2$ if $\eta_1 \le \eta_2$. However, this has no effect on the asymptotic situation, which is all that interests us here.)

EXERCISE 3.1. Prove that the estimate (3.4) is asymptotically efficient, i.e., for any $x_0 \in E^1$,

$$n\, E_{x_0} (\bar X_n - x_0)^2 \to \frac{1}{4 \max \left( (x_0 - 1)^2,\ (x_0 + 1)^2 \right)}.$$

Hint. Distinguish between three cases: $|x_0| < n^{-1/8}$, $x_0 < -n^{-1/8}$ and $x_0 > n^{-1/8}$.
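The plan just described can be sketched in code. The observation model used below, $Y_i = (x_0 - Z_i)^2 + \xi_i$ with unit Gaussian noise, is an assumption chosen to match the two branches of (3.4) as recovered here, not a model stated explicitly in this fragment; the pilot block size is likewise illustrative:

```python
import math, random

def adaptive_design_estimate(x0, n=30000, seed=5):
    """Two-stage plan behind (3.4), assuming (illustratively) the model
    Y_i = (x0 - Z_i)^2 + xi_i with unit Gaussian noise. A short pilot stage
    with Z = +1 (block I_1) and Z = -1 (block I_2) decides which control is
    more informative; the bulk block I_3 then uses that control."""
    rng = random.Random(seed)
    pilot = max(10, int(n ** 0.75) // 10)   # size of each pilot block
    obs = lambda z: (x0 - z) ** 2 + rng.gauss(0.0, 1.0)
    eta1 = sum(obs(+1.0) for _ in range(pilot)) / pilot   # mean over I_1
    eta2 = sum(obs(-1.0) for _ in range(pilot)) / pilot   # mean over I_2
    z3 = 1.0 if eta1 > eta2 else -1.0                     # control used on I_3
    m = n - 2 * pilot
    eta_bar = sum(obs(z3) for _ in range(m)) / m          # mean over I_3
    # (3.4): invert y = (x0 - z3)^2 on the branch singled out by the pilot stage
    return z3 - math.copysign(math.sqrt(max(eta_bar, 0.0)), z3)

print(adaptive_design_estimate(0.6))   # near 0.6
print(adaptive_design_estimate(-1.3))  # near -1.3
```

Near $x_0 = 0$ the two controls are almost equally informative and the pilot comparison becomes noisy; this is precisely the delicate case singled out in the hint to Exercise 3.1.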

§4. Continuous case

All the results of this chapter carry over to the case of continuous observations, provided the observations have a stochastic differential in Itô's sense. In this section we present the continuous formulation of the problem and prove the analog of Theorem 2.1.


In accordance with the discussion of §8.1, we shall assume that the functions $m(x, z)$ and $\sigma(z)$ are defined and continuous for $z \in Z \subset E^1$ and $x \in X \subset E^1$. Assume that the observation process $Y(t)$ has a stochastic differential:

$$dY(t) = m(x_0, Z(t))\, dt + \sigma(Z(t))\, d\xi(t). \tag{4.1}$$

Here $x_0$ is the value of the parameter to be estimated, and $Z(t)$ is the control process (plan of the experiment), which takes values in $Z$. As usual, we shall also assume that the set of admissible controls contains only functions $Z(t)$ for which the stochastic differential (4.1) is defined. In particular, we assume that the process $Z(t)$ is $N_t$-measurable (compare with (1.1)), where $N_t$ is the $\sigma$-algebra of events generated by the history of the process $Y(s)$, $s \le t$. Under these conditions, inequality (9.1.5) becomes

$$E(X(t) - x_0)^2 \ge \left( E \int_0^t \left( \frac{m'_x(x_0, Z(s))}{\sigma(Z(s))} \right)^2 ds \right)^{-1}. \tag{4.2}$$

If the function

$$\Psi(x, z) = \left( \frac{m'_x(x, z)}{\sigma(z)} \right)^2$$

assumes its maximum for $z \in Z$ at some point $z(x) \in Z$, so that

$$\sup_{z \in Z} \Psi(x, z) = \Psi(z(x), x) = \bar\Psi(x), \tag{4.3}$$

then, for any admissible control $Z(t)$ and any unbiased estimate $X(t)$ of $x_0$ satisfying (9.1.5), it follows from (4.2) that

$$E(X(t) - x_0)^2 \ge \frac{1}{t\, \bar\Psi(x_0)} = \frac{\sigma^2(z(x_0))}{t \left( m'_x(x_0, z(x_0)) \right)^2}. \tag{4.4}$$
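The maximization in (4.3) is a one-dimensional search for each $x$. A toy sketch, assuming the illustrative model $m(x, z) = (x - z)^2$ with $\sigma(z) \equiv 1$ and $Z = [-1, 1]$ (so $\Psi(x, z) = 4(x - z)^2$, echoing the example of §3):

```python
def best_control(x, z_grid=None):
    """Grid search for z(x) maximizing Psi(x, z) = (m'_x(x, z)/sigma(z))^2
    in the illustrative model m(x, z) = (x - z)^2, sigma(z) = 1, so that
    Psi(x, z) = 4 (x - z)^2, over Z = [-1, 1]."""
    if z_grid is None:
        z_grid = [i / 100.0 for i in range(-100, 101)]
    psi = lambda z: 4.0 * (x - z) ** 2
    return max(z_grid, key=psi)

print(best_control(0.5))   # -1.0: the farther endpoint carries more information
print(best_control(-0.2))  # 1.0
```

For this model the maximizing control always sits at the endpoint of $Z$ farther from $x$, which is what drives the adaptive choice of $Z_i = \pm 1$ in the discrete example.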

Following §1, we shall say that an admissible control $Z(t)$ is asymptotically optimal if it ensures the existence of an estimate $X(t)$ such that

$$\sqrt{t}\, (X(t) - x_0) \Rightarrow N \left( 0,\ \frac{\sigma^2(z(x_0))}{\left( m'_x(x_0, z(x_0)) \right)^2} \right).$$

NOTES ON THE LITERATURE

... $> 0$ (although one must then impose more stringent conditions on the "noise"). The theorems proved here for continuous RM procedures are due to the authors [1].

§5. Convergence theorems for discrete multidimensional KW procedures were proved by Blum [2], Dupač [1] and others. For the continuous case, see Nevel'son and Has'minskii [1].

Chapter 5

The conjecture that a KW procedure cannot converge with positive probability to minimum points of the regression function was advanced by Fabian [1]. The results presented here are based on Has'minskii [1], [2] and Nevel'son [1], [2]. The two last-mentioned papers also prove more general theorems. See also Krasulina [1].

Chapter 6

§1. Theorem 1.1 for the one-dimensional case is proved, e.g., in Loève [1]; for the multidimensional case, see Sacks [1].

§2. Conditions for the validity of the estimate $EX^2(t) = O(1/t)$ as $t \to \infty$, where $X(t)$ is a discrete one-dimensional RM process, were first obtained by Chung [1]. An analog of Lemma 2.1 for discrete multidimensional RM processes was proved by Sacks [1] under the additional assumption $E|G(t, x, \omega)|^2 < c < \infty$. The fact that (2.9) is a consequence of (2.8) follows from a well-known lemma of Chung [1].

§§3 and 4. Lemma 3.1 was proved by Has'minskii [2]. Lemma 4.3 is due to Sacks [1].


§5. Theorem 5.3 for $\lambda > 0$ is proved in Has'minskii [2]. The asymptotic normality of a RM procedure was established, under certain assumptions, by Holevo [1].

§6. Theorem 6.1 was proved by Sacks under the following additional assumptions: a) the matrix $B$ may be reduced to diagonal form by a similarity transformation involving an orthogonal matrix; b) the norm of the matrix $A(t, x)$ is bounded for $t \ge 1$ and $x \in E_l$. Theorem 6.3 is new. Asymptotic normality theorems for other s.a. procedures were proved by Fabian [5], Derman [1], Burkholder [1], Dupač [1] and others. The "truncation" method in asymptotic normality proofs for s.a. procedures was first applied by Hodges and Lehmann [1].

§ 7. The fant convergence theorems for moments of s.a. proces.ses were proved by Chung (I) in the one-dimensional case. For the multidimensional case, the result of Theorem 7,2 is well known (see, for example, Schme-tterer (I ) ) for>..> ½.a. But if Condition (o) of § 7 holds with >..