Foundations of Modern Probability [3 ed.] 9783030618704, 9783030618711

This new, thoroughly revised and expanded 3rd edition of a classic gives a comprehensive coverage of modern probability


English Pages 946 [931]





Table of contents :
Preface to the First Edition
Preface to the Second Edition
Preface to the Third Edition
Acknowledgments
Words of Wisdom and Folly
Contents
Introduction and Reading Guide
I. Measure Theoretic Prerequisites
Chapter 1 Sets and Functions, Measures and Integration
Chapter 2 Measure Extension and Decomposition
Chapter 3 Kernels, Disintegration, and Invariance
II. Some Classical Probability Theory
Chapter 4 Processes, Distributions, and Independence
Chapter 5 Random Sequences, Series, and Averages
Chapter 6 Gaussian and Poisson Convergence
Chapter 7 Infinite Divisibility and General Null Arrays
III. Conditioning and Martingales
Chapter 8 Conditioning and Disintegration
Chapter 9 Optional Times and Martingales
Chapter 10 Predictability and Compensation
IV. Markovian and Related Structures
Chapter 11 Markov Properties and Discrete-Time Chains
Chapter 12 Random Walks and Renewal Processes
Chapter 13 Jump-Type Chains and Branching Processes
V. Some Fundamental Processes
Chapter 14 Gaussian Processes and Brownian Motion
Chapter 15 Poisson and Related Processes
Chapter 16 Independent-Increment and Lévy Processes
Chapter 17 Feller Processes and Semi-groups
VI. Stochastic Calculus and Applications
Chapter 18 Itô Integration and Quadratic Variation
Chapter 19 Continuous Martingales and Brownian Motion
Chapter 20 Semi-Martingales and Stochastic Integration
Chapter 21 Malliavin Calculus
VII. Convergence and Approximation
Chapter 22 Skorohod Embedding and Functional Convergence
Chapter 23 Convergence in Distribution
Chapter 24 Large Deviations
VIII. Stationarity, Symmetry and Invariance
Chapter 25 Stationary Processes and Ergodic Theory
Chapter 26 Ergodic Properties of Markov Processes
Chapter 27 Symmetric Distributions and Predictable Maps
Chapter 28 Multi-variate Arrays and Symmetries
IX. Random Sets and Measures
Chapter 29 Local Time, Excursions, and Additive Functionals
Chapter 30 Random Measures, Smoothing and Scattering
Chapter 31 Palm and Gibbs Kernels, Local Approximation
X. SDEs, Diffusions, and Potential Theory
Chapter 32 Stochastic Equations and Martingale Problems
Chapter 33 One-Dimensional SDEs and Diffusions
Chapter 34 PDE Connections and Potential Theory
Chapter 35 Stochastic Differential Geometry
Appendices
1. Measurable maps
2. General topology
3. Linear spaces
4. Linear operators
5. Function and measure spaces
6. Classes and spaces of sets
7. Differential geometry
Notes and References
Bibliography
Indices
Authors
Topics
Symbols


Probability Theory and Stochastic Modelling 99

Olav Kallenberg

Foundations of Modern Probability Third Edition

Probability Theory and Stochastic Modelling Volume 99

Editors-in-Chief: Peter W. Glynn, Stanford, CA, USA; Andreas E. Kyprianou, Bath, UK; Yves Le Jan, Orsay, France. Advisory Editors: Søren Asmussen, Aarhus, Denmark; Martin Hairer, Coventry, UK; Peter Jagers, Gothenburg, Sweden; Ioannis Karatzas, New York, NY, USA; Frank P. Kelly, Cambridge, UK; Bernt Øksendal, Oslo, Norway; George Papanicolaou, Stanford, CA, USA; Etienne Pardoux, Marseille, France; Edwin Perkins, Vancouver, Canada; Halil Mete Soner, Zürich, Switzerland.

The Probability Theory and Stochastic Modelling series is a merger and continuation of Springer's two well established series Stochastic Modelling and Applied Probability and Probability and Its Applications. It publishes research monographs that make a significant contribution to probability theory or an applications domain in which advanced probability methods are fundamental. Books in this series are expected to follow rigorous mathematical standards, while also displaying the expository quality necessary to make them useful and accessible to advanced students as well as researchers. The series covers all aspects of modern probability theory including:

• Gaussian processes
• Markov processes
• Random Fields, point processes and random sets
• Random matrices
• Statistical mechanics and random media
• Stochastic analysis

as well as applications that include (but are not restricted to):

• Branching processes and other models of population growth
• Communications and processing networks
• Computational methods in probability and stochastic processes, including simulation
• Genetics and other stochastic models in biology and the life sciences
• Information theory, signal processing, and image synthesis
• Mathematical economics and finance
• Statistical methods (e.g. empirical processes, MCMC)
• Statistics for stochastic processes
• Stochastic control
• Stochastic models in operations research and stochastic optimization
• Stochastic models in the physical sciences

More information about this series at http://www.springer.com/series/13205

Olav Kallenberg

Foundations of Modern Probability Third Edition


Olav Kallenberg Auburn University Auburn, AL, USA

ISSN 2199-3130 ISSN 2199-3149 (electronic) Probability Theory and Stochastic Modelling ISBN 978-3-030-61870-4 ISBN 978-3-030-61871-1 (eBook) https://doi.org/10.1007/978-3-030-61871-1 Mathematics Subject Classification: 60-00, 60-01, 60A10, 60G05 1st edition: © Springer Science+Business Media New York 1997 2nd edition: © Springer-Verlag New York 2002 3rd edition: © Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface to the First Edition Some thirty years ago it was still possible, as Lo`eve so ably demonstrated, to write a single book in probability theory containing practically everything worth knowing in the subject. The subsequent development has been explosive, and today a corresponding comprehensive coverage would require a whole library. Researchers and graduate students alike seem compelled to a rather extreme degree of specialization. As a result, the subject is threatened by disintegration into dozens or hundreds of subfields. At the same time the interaction between the areas is livelier than ever, and there is a steadily growing core of key results and techniques that every probabilist needs to know, if only to read the literature in his or her own field. Thus, it seems essential that we all have at least a general overview of the whole area, and we should do what we can to keep the subject together. The present volume is an earnest attempt in that direction. My original aim was to write a book about “everything.” Various space and time constraints forced me to accept more modest and realistic goals for the project. Thus, “foundations” had to be understood in the narrower sense of the early 1970s, and there was no room for some of the more recent developments. I especially regret the omission of topics such as large deviations, Gibbs and Palm measures, interacting particle systems, stochastic differential geometry, Malliavin calculus, SPDEs, measure-valued diffusions, and branching and superprocesses. Clearly plenty of fundamental and intriguing material remains for a possible second volume. Even with my more limited, revised ambitions, I had to be extremely selective in the choice of material. More importantly, it was necessary to look for the most economical approach to every result I did decide to include. In the latter respect, I was surprised to see how much could actually be done to simplify and streamline proofs, often handed down through generations of textbook writers. My general preference has been for results conveying some new idea or relationship, whereas many propositions of a more technical nature have been omitted. In the same vein, I have avoided technical or computational proofs that give little insight into the proven results. This conforms with my conviction that the logical structure is what matters most in mathematics, even when applications is the ultimate goal. Though the book is primarily intended as a general reference, it should also be useful for graduate and seminar courses on different levels, ranging from elementary to advanced. Thus, a first-year graduate course in measuretheoretic probability could be based on the first ten or so chapters, while the rest of the book will readily provide material for more advanced courses on various topics. Though the treatment is formally self-contained, as far as measure theory and probability are concerned, the text is intended for a rather sophisticated reader with at least some rudimentary knowledge of subjects like topology, functional analysis, and complex variables. My exposition is based on experiences from the numerous graduate and v


seminar courses I have been privileged to teach in Sweden and in the United States, ever since I was a graduate student myself. Over the years I have developed a personal approach to almost every topic, and even experts might find something of interest. Thus, many proofs may be new, and every chapter contains results that are not available in the standard textbook literature. It is my sincere hope that the book will convey some of the excitement I still feel for the subject, which is without a doubt (even apart from its utter usefulness) one of the richest and most beautiful areas of modern mathematics.

Preface to the Second Edition For this new edition the entire text has been carefully revised, and some portions are totally rewritten. More importantly, I have inserted more than a hundred pages of new material, in chapters on general measure and ergodic theory, the asymptotics of Markov processes, and large deviations. The expanded size has made it possible to give a self-contained treatment of the underlying measure theory and to include topics like multivariate and ratio ergodic theorems, shift coupling, Palm distributions, entropy and information, Harris recurrence, invariant measures, strong and weak ergodicity, Strassen’s law of the iterated logarithm, and the basic large deviation results of Cram´er, Sanov, Schilder, and Freidlin and Ventzel.1 Unfortunately, the body of knowledge in probability theory keeps growing at an ever increasing rate, and I am painfully aware that I will never catch up in my efforts to survey the entire subject. Many areas are still totally beyond reach, and a comprehensive treatment of the more recent developments would require another volume or two. I am asking for the reader’s patience and understanding.

Preface to the Third Edition Many years have passed since I started writing the first edition, and the need for a comprehensive coverage of modern probability has become more urgent than ever. I am grateful for the opportunity to publish a new, thoroughly revised and expanded edition. Much new material has been added, and there are even entirely new chapters on subjects like Malliavin calculus, multivariate arrays, and stochastic differential geometry (with so much else still missing). To facilitate the reader’s access and overview, I have grouped the material together into ten major areas, each arguably indispensable to any serious student and researcher, regardless of area of specialization. To me, every great mathematical theorem should be a revelation, prompting us to exclaim: “Wow, this is just amazing, how could it be true?” I have spent countless hours, trying to phrase every result in its most striking form, in my efforts to convey to the reader my own excitement. My greatest hope is that the reader will share my love for the subject, and help me to keep it alive. 1

I should have mentioned Varadhan, one of the giants of modern probability.

Acknowledgments Throughout my career, I have been inspired by the work of colleagues from all over the world, and in many cases by personal interaction and friendship. As in earlier editions, I would like to mention especially the early influence of my mentor Peter Jagers and the stimulating influences of the late Klaus Matthes, Gopi Kallianpur, and Kai Lai Chung. Other people who have inspired me through their writing and personal contacts include especially David Aldous, Sir John Kingman, and Jean-Fran¸cois LeGall. Some portions of the book have been stimulated by discussions with my AU colleague Ming Liao. Countless people from all over the world, their names long forgotten, have written to me through the years to ask technical questions or point out errors, which has led to numerous corrections and improvements. Their positive encouragement has been a constant source of stimulation and joy, especially during the present pandemic, which sadly affects us all. Our mathematical enterprise (like the coronavirus) knows no borders, and to me people of any nationality, religion, or ethnic group are all my sisters and brothers. One day soon we will emerge even stronger from this experience, and continue to enjoy and benefit from our rich mathematical heritage2 . I am dedicating this edition to the memory of my grandfathers whom I never met, both idolized by their families: Otto Wilhelm Kallenberg — Coming from a simple servant family, he rose to the rank of a trusted servant to the Swedish king, in the days when kings had real power. Through his advance, the family could afford to send their youngest son Herbert, my father, to a Latin school, where he became the first member of the Kallenberg family to graduate from high school. Olaf Sund (1864–1941) — Coming from a prominent family of lawyers, architects, scientists, and civil servants, and said to be the brightest child of a big family, he opted for a simple life as Lensmann in the rural community of Steigen north of the arctic circle. He died tragically during the brave resistance to the Nazi occupation of Norway during WWII. His youngest daughter Marie Kristine, my mother, became the first girl of the Sund family to graduate from high school. Olav Kallenberg October 2020

2 To me the great music, art, and literature belong to the same category of cultural treasures that we hope will survive any global crises or onslaughts of selfish nationalism. Every page of this book is inspired by the great music from Bach to Prokofiev, the art from Rembrandt to Chagall, and the literature from Dante to Con Fu. What a privilege to be alive!


Words of Wisdom and Folly

♣ "A mathematician who argues from probabilities in geometry is not worth an ace" — Socrates (on the demands of rigor in mathematics)

♣ "[We will travel a road] full of interest of its own. It familiarizes us with the measurement of variability, and with curious laws of chance that apply to a vast diversity of social subjects" — Francis Galton (on the wondrous world of probability)

♣ "God doesn't play dice" [i.e., there is no randomness in the universe] — Albert Einstein (on quantum mechanics and causality)

♣ "It might be possible to prove certain theorems, but they would not be of any interest, since in practice one could not verify whether the assumptions are fulfilled" — Émile Borel (on why bother with probability)

♣ "[The stated result] is a special case of a very general theorem [the strong Markov property]. The measure [theoretic] ideas involved are somewhat glossed over in the proof, in order to avoid complexities out of keeping with the rest of this paper" — Joseph L. Doob (on why bother with generality or mathematical rigor)

♣ "Probability theory [has two hands]: On the right is the rigorous [technical work]; the left hand . . . reduces problems to gambling situations, coin-tossing, motions of a physical particle" — Leo Breiman (on probabilistic thinking)

♣ "There are good taste and bad taste in mathematics just as in music, literature, or cuisine, and one who dabbles in it must stand judged thereby" — Kai Lai Chung (on the art of writing mathematics)

♣ "The traveler often has the choice between climbing a peak or using a cable car" — William Feller (on the art of reading mathematics)

♣ "A Catalogue Aria of triumphs is of less benefit [to the student] than an indication of the techniques by which such results are achieved" — David Williams (on seductive truths and the need of proofs)

♣ "One needs [for stochastic integration] a six months course [to cover only] the definitions. What is there to do?" — Paul-André Meyer (on the dilemma of modern math education)

♣ "There were very many [bones] in the open valley; and lo, they were very dry. And [God] said unto me, 'Son of man, can these bones live?' And I answered, 'O Lord, thou knowest.' " — Ezekiel 37:2–3 (on the reward of hard studies, as quoted by Chris Rogers and David Williams)

Contents

Introduction and Reading Guide . . . 1

I. Measure Theoretic Prerequisites . . . 7
1. Sets and functions, measures and integration . . . 9
2. Measure extension and decomposition . . . 33
3. Kernels, disintegration, and invariance . . . 55

II. Some Classical Probability Theory . . . 79
4. Processes, distributions, and independence . . . 81
5. Random sequences, series, and averages . . . 101
6. Gaussian and Poisson convergence . . . 125
7. Infinite divisibility and general null arrays . . . 147

III. Conditioning and Martingales . . . 161
8. Conditioning and disintegration . . . 163
9. Optional times and martingales . . . 185
10. Predictability and compensation . . . 207

IV. Markovian and Related Structures . . . 231
11. Markov properties and discrete-time chains . . . 233
12. Random walks and renewal processes . . . 255
13. Jump-type chains and branching processes . . . 277

V. Some Fundamental Processes . . . 295
14. Gaussian processes and Brownian motion . . . 297
15. Poisson and related processes . . . 321
16. Independent-increment and Lévy processes . . . 345
17. Feller processes and semi-groups . . . 367

VI. Stochastic Calculus and Applications . . . 393
18. Itô integration and quadratic variation . . . 395
19. Continuous martingales and Brownian motion . . . 417
20. Semi-martingales and stochastic integration . . . 439
21. Malliavin calculus . . . 465

VII. Convergence and Approximation . . . 487
22. Skorohod embedding and functional convergence . . . 489
23. Convergence in distribution . . . 505
24. Large deviations . . . 529

VIII. Stationarity, Symmetry and Invariance . . . 555
25. Stationary processes and ergodic theorems . . . 557
26. Ergodic properties of Markov processes . . . 587
27. Symmetric distributions and predictable maps . . . 611
28. Multi-variate arrays and symmetries . . . 631

IX. Random Sets and Measures . . . 659
29. Local time, excursions, and additive functionals . . . 661
30. Random measures, smoothing and scattering . . . 685
31. Palm and Gibbs kernels, local approximation . . . 709

X. SDEs, Diffusions, and Potential Theory . . . 733
32. Stochastic equations and martingale problems . . . 735
33. One-dimensional SDEs and diffusions . . . 751
34. PDE connections and potential theory . . . 771
35. Stochastic differential geometry . . . 801

Appendices . . . 827
1. Measurable maps . . . 827
2. General topology . . . 828
3. Linear spaces . . . 831
4. Linear operators . . . 835
5. Function and measure spaces . . . 839
6. Classes and spaces of sets . . . 843
7. Differential geometry . . . 847

Notes and References . . . 853
Bibliography . . . 889
Indices . . . 919
Authors . . . 919, Topics . . . 925, Symbols . . . 943

Introduction and Reading Guide In my youth, a serious student of probability theory1 was expected to master the basic notions and results in all areas of the subject2 . Since then the development has been explosive, leading to a rather extreme fragmentation and specialization. I believe that it remains essential for any serious probabilist to have a good overview of the entire subject. Everything in this book, from the overall organization to the choice of notation and displays of theorems, has been prepared with this goal in mind. The first thing you need to know is that the study of any more advanced text in mathematics3 requires an approach different from the usual one. When reading a novel you start reading from page 1, and after a few days or weeks you come to the end, at which time you will be familiar with all the characters and have a good overview of the plot. This approach never works in math, except for the most elementary texts. Instead it is crucial to adopt a top-down approach, where you first try to acquire a general overview, and then gradually work your way down to the individual theorems, until finally you reach the level of proofs and their logical structure. The inexperienced reader may feel tempted to skip the proofs, trying instead a few of the exercises. I would rather suggest the opposite. It is from the proofs you learn how to do math, which may be regarded as a major goal of the graduate studies. If you forgot the precise conditions in the statement of a theorem, you can always look them up, but the only way to learn how to prove your own theorems is to gather experience by studying dozens or hundreds of proofs. Here again it is important to adopt a top-down approach, always starting to look for the crucial ideas that make the proof ‘work’, and then gradually breaking down the argument into minor details and eventually perhaps some calculation. Some details are often left to the reader4 , suggesting an abundance of useful and instructive exercises, to identify and fill in all the little gaps implicit in the text. The book has now been divided into ten parts of 3–4 chapters each, and I think it is essential for any serious probabilist to be familiar with at least some material in each of those parts. Every part is preceded by a short introduction, where I indicate the contents of the included chapters and make suggestions about the choice of material to study in further detail. The precise selection may be less important, since you can always return to the omitted material when need arises. Every chapter also begins with a short introduction, highlighting some of the main results and indicating connections to material in 1

1 henceforth regarded as synonymous with the theory of stochastic processes
2 When I was a student, we had no graduate courses, only a reading list, and I remember studying a whole summer for my last oral exam, covering every theorem and proof in Loève (1963) plus half of Doob (1953), totaling some 1,000 pages.
3 What is advanced to some readers may be elementary to others, depending on background.
4 or else even the simplest proof would look like a lengthy computer program, and you would lose your overview


other chapters. You never need to worry about making your selection selfcontained, provided you are willing occasionally to use theorems whose proofs you are not yet familiar with5 . A similar remark applies to the axiom of choice and its various equivalents, which are used freely without comments6 . −−− Let me conclude with some remarks about the contents of the various parts. Part I gives a rather complete account of the basic measure theory needed throughout the remainder of the book. In my experience, even a year-long course in real analysis may be inadequate for our purposes, since many theorems of special importance in probability theory may not even be mentioned in standard real analysis texts, where the emphasis is often rather different. When teaching graduate courses in advanced probability, then regardless of background7 of the students enrolled, I am always beginning with a few weeks of general measure theory, where the students will also get used to my top-down approach. Part II provides the measure-theoretic foundations of probability theory, involving the notions of processes, distributions, and independence. Here we also develop the classical limit theory for random sums and averages, including the law of large numbers and the central limit theorem, along with a wealth of extensions and ramifications. Finally, we give a comprehensive treatment of the classical limit theory for null arrays of random variables and vectors, which will later play a basic role for the discussion of L´evy and related processes. Much of the discussion is based on the theory of characteristic functions 8 and Laplace transforms, which will later regain importance in connection with continuous martingales, random measures, and potential theory. Modern probability theory can be said to begin with the closely related theories of conditioning, martingales, and compensation, which form the subjects of Part III. The importance of this material can hardly be overstated, since methods based on conditioning are constantly used throughout probability theory, and martingale theory provides some of the basic tools for proving a wide range of limit theorems and path properties. Indeed, the convergence and regularization theorem for continuous-time martingales has been regarded as the single most important result in all of probability theory. The theory of compensators, based on the powerful Doob–Meyer decomposition, plays an equally basic role in the study of processes with jump discontinuities. Beside martingales, the classes of Markov and related processes represent 5

5 A prime example is the existence of Lebesgue measure, underlying much of modern probability theory, which is easy to believe and may be accepted on faith, though all standard proofs I am aware of, some presented in Chapter 2, are quite tricky and non-intuitive.
6 Purists may note that we are implicitly adopting ZFC as the axiomatic foundation of modern probability theory. Nobody else ever needs to worry.
7 I am no longer surprised when the students tell me things like 'Fubini's theorem we didn't have time to cover'.
8 It is a mistake to dismiss the theories of characteristic functions and classical null arrays, once regarded as core probability, as a technical nuisance that we can safely avoid.


the most important dependence structures in probability theory, generalizing the deterministic notion of dynamical systems. Though first introduced in Part IV, they will form a recurrent theme throughout the remainder of the book. After a general discussion of the general Markov property, we specialize to the classical theories of discrete- and continuous-time Markov chains, paying special attention to the theories of random walks and renewal theory, before providing a brief discussion of branching processes. The results in this part, many of fundamental importance in their own right, may also serve as motivations for the theories of L´evy and Feller processes in later chapters, as well as for the regenerative processes and their local time. The Brownian motion and Poisson processes 9 constitute the basic building blocks of probability theory, whose theory is covered by the initial chapters of Part V. The remaining chapters deal with the equally fundamental L´evy and Feller processes. To see the connections, we note that Feller processes form a broad class of Markov processes, general enough for most applications, and yet regular enough for technical convenience. Specializing to the spacehomogeneous case yields the class of L´evy processes, which may also be characterized as processes with stationary, independent increments, thus forming the continuous-time counterparts of random walks. The classical L´evy-Khinchin formula gives the basic representation of L´evy processes in terms of a Brownian motion and a Poisson process describing the jump structure. All facets of this theory are clearly of fundamental importance. Stochastic calculus is another core area10 of modern probability that we can’t live without. The subject, here covered by Part VI, can be regarded as a continuation of the martingale theory from Part III. In fact, the celebrated Itˆo formula is essentially a transformation rule for continuous or more general semi-martingales. We begin with the relatively elementary case of continuous integrators, analysing a detail some important consequences for Brownian motion, before moving on to the more subtle theory for processes with discontinuities. Still more subtle is the powerful stochastic analysis on Wiener space, also known as Malliavin calculus. Every serious probability student needs to be well acquainted with the theory of convergence in distribution in function spaces, covered by Part VII. Here the most natural and intuitive approach is via the powerful Skorohod embedding, which yields the celebrated Donsker theorem and its ramifications, along with various functional versions of the law of the iterated logarithm. More general results may occasionally require some quite subtle compactness arguments based on Prohorov’s theorem, extending to a powerful theory for distributional convergence of random measures. Somewhat similar in spirit is the theory of large deviations, which may be regarded as a subtle extension of the weak law of large numbers. Stationarity, defined as invariance in distribution under shifts, may be re9 The theory of Poisson and related processes has often been neglected by textbook authors. Its fundamental importance should be clear from material in later chapters. 10 It is often delegated to separate courses, though it truly belongs to the basic package.


garded as yet another basic dependence structure11 of probability theory, beside those of martingales and Markov processes. Its theory with ramifications are the subjects of Part VIII. Here the central result is the powerful ergodic theorem, generalizing the strong law of large numbers, which admits some equally remarkable extensions to broad classes of Markov processes. This part also contains a detailed discussion of some other invariance structures, leading in particular to the powerful predictable sampling and mapping theorems. Part IX begins with a detailed exposition of excursion theory, a powerful extension of renewal theory, involving a basic description of the entire excursion structure of a regenerative process in terms of a Poisson process, on the time scale given by the associated local time. Three totally different approaches to local time are discussed, each providing its share of valuable insight. The remaining chapters of this part deal with various aspects of random measure theory12 , including some basic results for Palm measures and a variety of fundamental limit theorems for particle systems. We may also mention some important connections to statistical mechanics. The final Part X begins with a detailed discussion of stochastic differential equations (SDEs), which play the same role in the presence of noise as do the ODEs for deterministic dynamical systems. Of special importance is the characterization of solutions in terms of martingale problems. The solutions to the simplest SDEs are continuous strong Markov processes, even called diffusions, and the subsequent chapter includes a detailed account of such processes. We will also highlight the close connection between Markov processes and potential theory, showing in particular how some problems in classical potential theory admit solutions in terms of a Brownian motion. We conclude with an introduction to the beautiful and illuminating theory of stochastic differential geometry.

11 This is another subject that has often been omitted from basic graduate-text packages, where it truly belongs.
12 Yet another neglected topic. Random measures arguably constitute an area of fundamental importance. Though they have been subject to an intense development for three quarters of a century, their theory remains virtually unknown among mainstream probabilists.


We conclude with a short list of some commonly used notation. A more comprehensive list will be found at the end of the book.

N = {1, 2, . . .}, Z+ = {0, 1, 2, . . .}, R+ = [0, ∞), R̄ = [−∞, ∞],
(S, S, Ŝ): localized Borel space, classes of measurable or bounded sets,
S+: class of S-measurable functions f ≥ 0, S(n): non-diagonal part of S^n,
MS, M̂S: classes of locally finite or normalized measures on S,
NS, N̂S: classes of integer-valued and bounded measures in MS,
M*S, N*S: classes of diffuse measures in MS and simple ones in NS,
G, λ: measurable group with Haar measure, Lebesgue measure on R,
δsB = 1B(s) = 1{s ∈ B}: unit mass at s and indicator function of B,
μf = ∫ f dμ, (f · μ)g = μ(fg), (μ ∘ f⁻¹)g = μ(g ∘ f), 1Bμ = 1B · μ,
μ(n): for μ ∈ N*S, the restriction of μ^n to S(n),
(θrμ)f = μ(f ∘ θr) = ∫ μ(ds) f(rs),
(ν ⊗ μ)f = ∫∫ ν(ds) μs(dt) f(s, t), (νμ)f = ∫∫ ν(ds) μs(dt) f(t),
(μ ∗ ν)f = ∫∫ μ(dx) ν(dy) f(x + y),
(Eξ)f = E(ξf), E(ξ | F)g = E(ξg | F),
L(·), L(· | ·)s, L(·)s: distribution, conditional or Palm distribution,
Cξf = E Σμ≤ξ f(μ, ξ − μ) with bounded μ,
⊥⊥, ⊥⊥F: independence and conditional independence given F,
=ᵈ, →ᵈ: equality and convergence in distribution,
→ʷ, →ᵛ, →ᵘ: weak, vague, and uniform convergence,
→ʷᵈ, →ᵛᵈ: weak or vague convergence in distribution,
‖f‖, ‖μ‖: supremum of |f| and total variation of μ.
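As an informal aside (not part of the text), the integral and image notation above can be made concrete for a finite discrete measure. The following Python sketch is purely illustrative, and all names in it are ad hoc:

```python
# A finite measure as a dict {point: mass}; mu f is then a weighted sum.
mu = {0: 0.5, 1: 0.3, 2: 0.2}

def integral(mu, f):            # mu f = "integral of f dmu"
    return sum(f(s) * m for s, m in mu.items())

def density(f, mu):             # f · mu, the measure with density f
    return {s: f(s) * m for s, m in mu.items()}

def pushforward(mu, f):         # mu ∘ f⁻¹, the image measure
    out = {}
    for s, m in mu.items():
        out[f(s)] = out.get(f(s), 0.0) + m
    return out

f = lambda s: s * s
g = lambda t: t + 1.0

# (mu ∘ f⁻¹) g = mu(g ∘ f)  and  (f · mu) g = mu(fg), as in the list above.
assert abs(integral(pushforward(mu, f), g) - integral(mu, lambda s: g(f(s)))) < 1e-12
assert abs(integral(density(f, mu), g) - integral(mu, lambda s: f(s) * g(s))) < 1e-12
```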

I. Measure Theoretic Prerequisites

Modern probability theory is technically a branch of real analysis, and any serious student needs a good foundation in measure theory and basic functional analysis, before moving on to the probabilistic aspects. Though a solid background in classical real analysis may be helpful, our present emphasis is often somewhat different, and many results included here may be hard to find in the standard textbook literature. We recommend the reader to be thoroughly familiar with especially the material in Chapter 1. The hurried or impatient reader might skip Chapters 2–3 for the moment, and return for specific topics when need arises.

−−−

1. Sets and functions, measures and integration. This chapter contains the basic notions and results of measure theory, needed throughout the remainder of the book. Statements used most frequently include the monotone-class theorem, the monotone and dominated convergence theorems, Fubini's theorem, criteria for Lᵖ-convergence, and the Hölder and Minkowski inequalities. The reader should also be familiar with the notion of Borel spaces, which provides the basic setting throughout the book.

2. Measure extension and decomposition. Here we include some of the more advanced results of measure theory of frequent use, such as the existence of Lebesgue and related measures, the Lebesgue decomposition and differentiation theorem, atomic decompositions of measures, the Radon–Nikodym theorem, and the Riesz representation. We also explore the relationship between signed measures on the line and functions of locally bounded variation. Though a good familiarity with the results is crucial, many proofs are rather technical and might be skipped on the first way through.

3. Kernels, disintegration, and invariance. Though kernels are hardly mentioned in most real analysis texts, they play a fundamental role throughout probability theory. Any serious reader needs to be well familiar with the basic kernel properties and operations, as well as with the existence theorems for single and dual disintegration. The quoted results on invariant measures and disintegration, including the theory of Haar measures, will be needed only for special purposes and might be skipped on a first reading, with a possible return when need arises.


Chapter 1

Sets and Functions, Measures and Integration

Sigma-fields, measurable functions, monotone-class theorem, Polish and Borel spaces, convergence and limits, measures and integration, monotone and dominated convergence, transformation of integrals, null sets and completion, product measures and Fubini's theorem, Hölder and Minkowski inequalities, Lᵖ-convergence, L²-projection, regularity and approximation

Though originating long ago from some simple gambling problems, probability theory has since developed into a sophisticated area of real analysis, depending profoundly on some measure theory already for the basic definitions. This chapter covers most of the standard measure theory needed to get started. Soon you will need more, which is why we include two further chapters to cover some more advanced topics. At that later stage, even some basic functional analysis will play an increasingly important role, where the underlying notions and results are summarized without proofs in Appendices 3–4.

In this chapter we give an introduction to basic measure theory, including the notions of σ-fields, measurable functions, measures, and integration. Though our treatment, here and in later chapters, is essentially self-contained, we do rely on some basic notions and results from elementary real analysis, including the notions of limits and continuity, and some metric topology involving open and closed sets, completeness, and compactness.

Many results covered here will henceforth be used in subsequent chapters without explicit reference. This applies especially to the monotone and dominated convergence theorems, Fubini's theorem, and the Hölder and Minkowski inequalities. Though a good background in real analysis may be an advantage, the hurried reader is warned against skipping the measure-theoretic chapters without at least a quick review of the included results, many of which may be hard to find in the standard textbook literature. This applies in particular to the monotone-class theorem and the notion of Borel spaces in this chapter, the atomic decomposition in Chapter 2, and the theory of kernels, disintegration, and invariant measures in Chapter 3.

To fix our notation, we begin with some elementary notions from set theory. For subsets A, Ak, B, . . . of an abstract space S, recall the definitions of union A ∪ B or ∪k Ak, intersection A ∩ B or ∩k Ak, complement Aᶜ, and difference A \ B = A ∩ Bᶜ. The latter is said to be proper if A ⊃ B. The symmetric


difference of A and B is given by A △ B = (A \ B) ∪ (B \ A). Among basic set relations, we note in particular the distributive laws

A ∩ ∪k Bk = ∪k (A ∩ Bk),   A ∪ ∩k Bk = ∩k (A ∪ Bk),

and de Morgan's laws

(∪k Ak)ᶜ = ∩k Akᶜ,   (∩k Ak)ᶜ = ∪k Akᶜ,

valid for arbitrary (not necessarily countable) unions and intersections. The latter formulas allow us to convert any relation involving unions (intersections) into the dual formula for intersections (unions).

We define a σ-field¹ in S as a non-empty collection S of subsets of S closed under countable unions and intersections² as well as under complementation. Thus, if A, A1, A2, . . . ∈ S, then also Aᶜ, ∪k Ak, and ∩k Ak lie in S. In particular, the whole space S and the empty set ∅ belong to every σ-field. In any space S, there is a smallest σ-field {∅, S} and a largest one 2^S, consisting of all subsets of S. Note that every σ-field S is closed under monotone limits, so that if A1, A2, . . . ∈ S with An ↑ A or An ↓ A, then also A ∈ S. A measurable space is a pair (S, S), where S is a space and S is a σ-field in S.

For any class of σ-fields in S, the intersection, but usually not the union, is again a σ-field. If C is an arbitrary class of subsets of S, there is a smallest σ-field in S containing C, denoted by σ(C) and called the σ-field generated or induced by C. Note that σ(C) can be obtained as the intersection of all σ-fields in S containing C. We endow a metric or topological space S with its Borel σ-field BS generated by the topology³ in S, unless a σ-field is otherwise specified. The elements of BS are called Borel sets. When S is the real line R, we often write B instead of BR.

More primitive classes than σ-fields often arise in applications. A class C of subsets of a space S is called a π-system if it is closed under finite intersections, so that A, B ∈ C implies A ∩ B ∈ C. Furthermore, a class D is a λ-system if it contains S and is closed under proper differences and increasing limits. Thus, we require S ∈ D, that A, B ∈ D with A ⊃ B implies A \ B ∈ D, and that A1, A2, . . . ∈ D with An ↑ A implies A ∈ D.

The following monotone-class theorem is useful to extend an established property or relation from a class C to the generated σ-field σ(C). An application of this result is referred to as a monotone-class argument.

Theorem 1.1 (monotone classes, Sierpiński) For any π-system C and λ-system D in a space S, we have

C ⊂ D  ⟹  σ(C) ⊂ D.

¹ also called a σ-algebra
² For a field or algebra, we require closure only under finite set operations.
³ class of open subsets
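A standard first illustration (added here as an aside, and developed properly in later chapters): on R, the intervals (−∞, x], x ∈ R, form a π-system generating B. If two probability measures agree on all such intervals, then the class of Borel sets on which they agree is a λ-system containing this π-system, so by Theorem 1.1 they agree on all of B.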


Proof: First we note that a class C is a σ-field iff it is both a π-system and a λ-system. Indeed, if C has the latter properties, then A, B ∈ C implies Aᶜ = S \ A ∈ C and A ∪ B = (Aᶜ ∩ Bᶜ)ᶜ ∈ C. Next let A1, A2, . . . ∈ C, and put A = ∪n An. Then Bn ↑ A, where Bn = ∪k≤n Ak ∈ C, and so A ∈ C. Finally, ∩n An = (∪n Anᶜ)ᶜ ∈ C.

For the main assertion, we may clearly assume that D = λ(C), defined as the smallest λ-system containing C. It suffices to show that D is also a π-system, since it is then a σ-field containing C and therefore contains the smallest σ-field σ(C) with this property. Thus, we need to show that A ∩ B ∈ D whenever A, B ∈ D.

The relation A ∩ B ∈ D holds when A, B ∈ C, since C is a π-system contained in D. We proceed by extension in two steps. First we fix any B ∈ C, and define SB = {A ⊂ S; A ∩ B ∈ D}. Then SB is a λ-system containing C, and so it contains the smallest λ-system D with this property. Thus, A ∩ B ∈ D for any A ∈ D and B ∈ C. Next we fix any A ∈ D, and define SA = {B ⊂ S; A ∩ B ∈ D}. As before, we note that even SA contains D, which yields the desired property. □

For any family of spaces St, t ∈ T, we define the Cartesian product Xt∈T St as the class of all collections {st; t ∈ T}, where st ∈ St for all t. When T = {1, . . . , n} or T = N = {1, 2, . . .}, we often write the product space as S1 × · · · × Sn or S1 × S2 × · · · , respectively, and if St = S for all t, we use the notation S^T, S^n, or S^∞. For topological spaces St, we endow Xt St with the product topology, unless a topology is otherwise specified.

Now assume that each space St is equipped with a σ-field St. In Xt St we may then introduce the product σ-field ⊗t St, generated by all one-dimensional⁴ cylinder sets At × Xs≠t Ss, where t ∈ T and At ∈ St. As before, we write in appropriate cases

S1 ⊗ · · · ⊗ Sn,  S1 ⊗ S2 ⊗ · · · ,  S^T,  S^n,  S^∞.

⁴ Note the analogy with the definition of product topologies.

Lemma 1.2 (product σ-field) For any separable metric spaces S1, S2, . . . , we have

B(S1 × S2 × · · ·) = BS1 ⊗ BS2 ⊗ · · · .

Thus, for countable products of separable metric spaces, the product and Borel σ-fields agree. In particular, BRᵈ = (BR)ᵈ = Bᵈ is the σ-field generated by all rectangular boxes I1 × · · · × Id, where I1, . . . , Id are arbitrary real intervals. This special case can also be proved directly.

Proof: The assertion can be written as σ(C1) = σ(C2), where C1 is the class of open sets in S = S1 × S2 × · · · and C2 is the class of cylinder sets of the form S1 × · · · × Sk−1 × Bk × Sk+1 × · · · with Bk ∈ BSk. Now let C be the subclass of such cylinder sets, with Bk open in Sk. Then C is a topological base in S, and since S is separable, every set in C1 is a countable union of sets in C. Hence, C1 ⊂ σ(C). Furthermore, the open sets in Sk generate BSk, and so C2 ⊂ σ(C). Since trivially C lies in both C1 and C2, we get

σ(C1) ⊂ σ(C) ⊂ σ(C2) ⊂ σ(C) ⊂ σ(C1),

and so equality holds throughout, proving the asserted relation. □
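To make the generated σ-field σ(C) concrete, note that on a finite space it can be computed literally as a closure under complements and unions. The following Python sketch is only an illustration of the definition (the function name is ours, not standard):

```python
from itertools import combinations

def generated_sigma_field(space, C):
    # Smallest sigma-field on the finite set `space` containing the sets in C.
    space = frozenset(space)
    sets = {frozenset(), space} | {frozenset(A) for A in C}
    while True:
        new = set(sets)
        new |= {space - A for A in sets}                   # complements
        new |= {A | B for A, B in combinations(sets, 2)}   # pairwise unions
        if new == sets:             # closed under the operations: a sigma-field (finite case)
            return sets
        sets = new

S = {1, 2, 3, 4}
sigma = generated_sigma_field(S, [{1}, {2, 3}])
print(len(sigma))   # 8: the sigma-field generated by the partition {1}, {2,3}, {4}
```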


Every point mapping f : S → T induces a set mapping f⁻¹ : 2^T → 2^S in the opposite direction, given by

f⁻¹B = {s ∈ S; f(s) ∈ B},  B ⊂ T.

Note that f⁻¹ preserves the basic set operations, in the sense that, for any subsets B and Bk of T,

f⁻¹Bᶜ = (f⁻¹B)ᶜ,  f⁻¹ ∪n Bn = ∪n f⁻¹Bn,  f⁻¹ ∩n Bn = ∩n f⁻¹Bn.  (1)

We show that f⁻¹ also preserves σ-fields, in both directions. For convenience, we write

f⁻¹C = {f⁻¹B; B ∈ C},  C ⊂ 2^T.

Lemma 1.3 (induced σ-fields) For any mapping f between measurable spaces (S, S) and (T, T), we have
(i) S′ = f⁻¹T is a σ-field in S,
(ii) T′ = {B ⊂ T; f⁻¹B ∈ S} is a σ-field in T.

Proof: (i) If A, A1, A2, . . . ∈ S′, there exist some sets B, B1, B2, . . . ∈ T with A = f⁻¹B and An = f⁻¹Bn for all n. Since T is a σ-field, the sets Bᶜ, ∪n Bn, and ∩n Bn all belong to T, and (1) yields

Aᶜ = (f⁻¹B)ᶜ = f⁻¹Bᶜ,
∪n An = ∪n f⁻¹Bn = f⁻¹ ∪n Bn,
∩n An = ∩n f⁻¹Bn = f⁻¹ ∩n Bn,

which all lie in f⁻¹T = S′.

(ii) Let B, B1, B2, . . . ∈ T′, so that f⁻¹B, f⁻¹B1, f⁻¹B2, . . . ∈ S. Using (1) and the fact that S is a σ-field, we get

f⁻¹Bᶜ = (f⁻¹B)ᶜ,  f⁻¹ ∪n Bn = ∪n f⁻¹Bn,  f⁻¹ ∩n Bn = ∩n f⁻¹Bn,

which all belong to S. Thus, Bᶜ, ∪n Bn, and ∩n Bn all lie in T′. □
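The relations (1) are easy to check on small examples; here is a purely illustrative sketch (nothing in it is part of the text):

```python
S, T = set(range(10)), set(range(5))
f = lambda s: s % 5                       # a map f : S -> T

def preimage(B):                          # f⁻¹B = {s in S; f(s) in B}
    return {s for s in S if f(s) in B}

B1, B2 = {0, 1}, {1, 2, 3}
assert preimage(T - B1) == S - preimage(B1)              # preserves complements
assert preimage(B1 | B2) == preimage(B1) | preimage(B2)  # preserves unions
assert preimage(B1 & B2) == preimage(B1) & preimage(B2)  # preserves intersections
```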


For any measurable spaces (S, S) and (T, T), a mapping f : S → T is said to be S/T-measurable or simply measurable⁵ if f⁻¹T ⊂ S, that is, if f⁻¹B ∈ S for every B ∈ T. It is enough to verify the defining condition for a suitable subclass:

Lemma 1.4 (measurable functions) Let f be a mapping between measurable spaces (S, S) and (T, T), and let C ⊂ 2^T with σ(C) = T. Then

f is S/T-measurable  ⟺  f⁻¹C ⊂ S.

Proof: Let T′ = {B ⊂ T; f⁻¹B ∈ S}. Then C ⊂ T′ by hypothesis, and T′ is a σ-field by Lemma 1.3 (ii). Hence, T′ = σ(T′) ⊃ σ(C) = T, which shows that f⁻¹B ∈ S for all B ∈ T. □

Lemma 1.5 (continuity and measurability) For a map f between topological spaces S, T with Borel σ-fields S, T, we have

f is continuous  ⟹  f is S/T-measurable.

Proof: Let S′ and T′ be the classes of open sets in S and T. Since f is continuous and S = σ(S′), we have f⁻¹T′ ⊂ S′ ⊂ S. By Lemma 1.4 it follows that f is S/σ(T′)-measurable. It remains to note that σ(T′) = T. □

We insert a result about subspace topologies and σ-fields, needed in Chapter 23. Given a class C of subsets of S and a set A ⊂ S, we define A ∩ C = {A ∩ C; C ∈ C}.

Lemma 1.6 (subspaces) Let (S, ρ) be a metric space with topology T and Borel σ-field S, and let A ⊂ S. Then the induced topology and Borel σ-field in (A, ρ) are given by

TA = A ∩ T,  SA = A ∩ S.

Proof: The natural embedding πA : A → S is continuous and hence measurable, and so A ∩ T = πA⁻¹T ⊂ TA and A ∩ S = πA⁻¹S ⊂ SA. Conversely, for any B ∈ TA, we define G = (B ∪ Aᶜ)°, where the complement and interior are with respect to S, and note that B = A ∩ G. Hence, TA ⊂ A ∩ T, and therefore

SA = σ(TA) ⊂ σ(A ∩ T) ⊂ σ(A ∩ S) = A ∩ S,

where the operation σ(·) refers to the subspace A. □

As with continuity, even measurability is preserved by composition.

⁵ Note the analogy with the definition of continuity in terms of topologies on S, T.


Lemma 1.7 (composition) For maps f : S → T and g : T → U between the measurable spaces (S, S), (T, T), (U, U), we have

f, g are measurable  ⟹  h = g ∘ f is S/U-measurable.

Proof: Let C ∈ U, and note that B ≡ g⁻¹C ∈ T since g is measurable. Noting that (g ∘ f)⁻¹ = f⁻¹ ∘ g⁻¹, and using the fact that even f is measurable, we get

h⁻¹C = (g ∘ f)⁻¹C = f⁻¹g⁻¹C = f⁻¹B ∈ S. □

For many results in measure and probability theory, it is convenient first to give a proof for the real line, and then extend to more general spaces. Say that the measurable spaces S and T are Borel isomorphic, if there exists a bijection f : S ↔ T such that both f and f⁻¹ are measurable. A measurable space S is said to be Borel if it is Borel isomorphic to a Borel set in [0, 1]. We show that every Polish space⁶ endowed with its Borel σ-field is Borel.

Theorem 1.8 (Polish and Borel spaces) For any Polish space S, we have
(i) S is homeomorphic to a Borel set in [0, 1]^∞,
(ii) S is Borel isomorphic to a Borel set in [0, 1].

Proof: (i) Fix a complete metric ρ in S. We may assume that ρ ≤ 1, since we can otherwise replace ρ by the equivalent metric ρ ∧ 1, which is again complete. Since S is separable, we may choose a dense sequence x1, x2, . . . ∈ S. Then the mapping

x ↦ (ρ(x, x1), ρ(x, x2), . . .),  x ∈ S,

defines a homeomorphic embedding of S into the compact space K = [0, 1]^∞, and we may regard S as a subset of K. In K we introduce the metric

d(x, y) = Σn 2⁻ⁿ |xn − yn|,  x, y ∈ K,

and define S̄ as the closure of S in K. Writing |Bxε|ρ for the ρ-diameter of the d-ball Bxε = {y ∈ S; d(x, y) < ε} in S, we define

Un(ε) = {x ∈ S̄; |Bxε|ρ < n⁻¹},  ε > 0, n ∈ N,

and put Gn = ∪ε Un(ε). The Gn are open in S̄, since x ∈ Un(ε) and y ∈ S̄ with d(x, y) < ε/2 implies y ∈ Un(ε/2), and S ⊂ Gn for each n by the equivalence of the metrics ρ and d. This gives S ⊂ S̃ ⊂ S̄, where S̃ = ∩n Gn.

For any x ∈ S̃, we may choose some x1, x2, . . . ∈ S with d(x, xn) → 0. By the definitions of Un(ε) and Gn, the sequence (xk) is Cauchy even for ρ, and

⁶ A topological space is said to be Polish if it is separable and admits a complete metrization.

15

so by completeness ρ(xk , y) → 0 for some y ∈ S. Since ρ and d are equivalent on S, we have even d(xk , y) → 0, and therefore x = y. This gives S˜ ⊂ S, and so S˜ = S. Finally, S˜ is a Borel set in K, since the Gn are open subsets of the ¯ compact set S. (ii) Write 2∞ for the countable product {0, 1}∞ , and let B denote the subset of binary sequences with infinitely many zeros. Then x = Σn xn 2−n defines a 1−1 correspondence between I = [0, 1) and B. Since x → (x1 , x2 , . . .) is clearly bi-measurable, I and B are Borel isomorphic, written as I ∼ B. Furthermore, B c = 2∞ \ B is countable, so that B c ∼ N, which implies B ∪ N ∼ B ∪ N ∪ (−N) ∼ B ∪ Bc ∪ N = 2∞ ∪ N, and hence I ∼ B ∼ 2∞ . This gives I ∞ ∼ (2∞ )∞ ∼ 2∞ ∼ I, and the assertion follows by (i). 2 For any Borel space (S, S), we may specify a localizing sequence Sn ↑ S in S, and say that a set B ⊂ S is bounded, if B ⊂ Sn for some n ∈ N. The ring ˆ If ρ is a metric generating S, we of bounded sets B ∈ S will be denoted by S. may choose Sˆ to consist of all ρ -bounded sets. In locally compact spaces S, ¯ we may choose Sˆ as the class of sets B ∈ S with compact closure B. By a dissection system in S we mean a nested sequence of countable partitions (Inj ) ⊂ Sˆ generating S, such that for any fixed n ∈ N, every bounded set is covered by finitely many sets Inj . For topological spaces S, we require in addition that, whenever s ∈ G ⊂ S with G open, we have s ∈ Inj ⊂ G for some n, j ∈ N. A class C ⊂ Sˆ is said to be dissecting, if every open set G ⊂ S is a countable union of sets in C, and every set B ∈ Sˆ is covered by finitely many sets in C. To state the next result, we note that any collection of functions fi : Ω → Si , i ∈ I, defines a mapping f = (fi ) from Ω to Xi Si , given by



f (ω) = fi (ω); i ∈ I ,

ω ∈ Ω.

(2)

We may relate the measurability of f to that of the coordinate mappings fi . Lemma 1.9 (coordinate functions) For any measurable spaces (Ω, A), (Si , Si ) and functions fi : Ω → Si , i ∈ I, define f = (fi ) : Ω → Xi Si . Then these conditions are equivalent: (i) f is A /



i

Si -measurable,

(ii) fi is A/Si -measurable for every i ∈ I. Proof: Use Lemma 1.4 with C equal to the class of cylinder sets Ai × Xj=i Sj , 2 for arbitrary i ∈ I and Ai ∈ Si .

16

Foundations of Modern Probability

Changing our perspective, let the fi in (2) map Ω into some measurable spaces (Si , Si ). In Ω we may then introduce the generated or induced σ-field σ(f ) = σ{fi ; i ∈ I}, defined as the smallest σ-field in Ω making all the fi measurable. In other words, σ(f ) is the intersection of all σ-fields A in Ω, such that fi is A/St -measurable for every i ∈ I. In this notation, the functions fi are clearly measurable with respect to a σ-field A in Ω iff σ(f ) ⊂ A. Further note that σ(f ) agrees with the σ-field in Ω generated by the collection {fi−1 Si ; i ∈ I}. For functions on or into a Euclidean space Rd , measurability is understood to be with respect to the Borel σ-field B d . Thus, a real-valued function f on a measurable space (S, S) is measurable iff {s; f (s) ≤ x} ∈ S for all x ∈ R. The ¯ = [−∞, ∞] same convention applies to functions into the extended real line R ¯ + = [0, ∞], regarded as compactifications of R and or the extended half-line R R+ = [0, ∞), respectively. Note that BR¯ = σ{B, ±∞} and BR¯ + = σ{BR+ , ∞}. For any set A ⊂ S, the associated indicator function 7 1A : S → R is defined to equal 1 on A and 0 on Ac . For sets A = {s ∈ S; f (s) ∈ B}, it is often convenient to write 1{·} instead of 1{·} . When S is a σ-field in S, we note that 1A is S-measurable iff A ∈ S. Linear combinations of indicator functions are called simple functions. Thus, every simple function f : S → R has the form f = c 1 1A 1 + · · · + cn 1 An , where n ∈ Z+ = {0, 1, . . .}, c1 , . . . , cn ∈ R, and A1 , . . . , An ⊂ S. Here we may take c1 , . . . , cn to be the distinct non-zero values attained by f , and define Ak = f −1 {ck } for all k. This makes f measurable with respect to a given σ-field S in S iff A1 , . . . , An ∈ S. The class of measurable functions is closed under countable limiting operations: ¯ are Lemma 1.10 (bounds and limits) If the functions f1 , f2 , . . . : (S, S) → R measurable, then so are the functions inf fn , lim sup fn , lim inf fn . sup fn , n n→∞ n

n→∞

Proof: To see that supn fn is measurable, write



s; sup fn (s) ≤ t = n

=





∩ s; fn(s) ≤ t n≥1 ∩ fn−1[−∞, t] ∈ S, n≥1

and use Lemma 1.4. For the other three cases, write inf n fn = −supn (−fn ), and note that lim sup fn = inf sup fk , n→∞

n≥1 k≥n

lim inf fn = sup inf fk . n→∞ n≥1 k≥n

2

7 The term characteristic function should be avoided, since it has a different meaning in probability.

1. Sets and Functions, Measures and Integration

17

Since fn → f iff lim supn fn = lim inf n fn = f , it follows easily that both the set of convergence and the possible limit are measurable. This extends to functions taking values in more general spaces: Lemma 1.11 (convergence and limits) Let f1 , f2 , . . . be measurable functions from a measurable space (Ω, A) into a metric space (S, ρ). Then



(i) ω; fn (ω) converges ∈ A when S is separable and complete, (ii) fn → f on Ω implies that f is measurable. Proof: (i) Since S is complete, convergence of fn is equivalent to the Cauchy convergence lim sup ρ(fm , fn ) = 0, n→∞ m≥n

where the left-hand side is measurable by Lemmas 1.2, 1.5, and 1.10. (ii) If fn → f , we have g ◦ fn → g ◦ f for any continuous function g : S → R, and so g ◦f is measurable by Lemmas 1.5 and 1.10. Fixing any open set G ⊂ S, we may choose some continuous functions g1 , g2 , . . . : S → R+ with gn ↑ 1G , and conclude from Lemma 1.10 that 1G ◦ f is measurable. Thus, f −1 G ∈ A for all G, and so f is measurable by Lemma 1.4. 2 Many results in measure theory can be proved by a simple approximation, based on the following observation. Lemma 1.12 (simple approximation) For any measurable function f ≥ 0 on (S, S), there exist some simple, measurable functions f1 , f2 , . . . : S → R+ with fn ↑ f . Proof: We may define fn (s) = 2−n [2n f (s)] ∧ n,

s ∈ S, n ∈ N.

2
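To see the dyadic construction of Lemma 1.12 in action, here is a small numerical sketch in Python (an addition, not part of the original text); the choice of f and of the evaluation points is arbitrary and purely illustrative.

```python
import numpy as np

def simple_approx(f, s, n):
    # f_n(s) = 2^{-n} * floor(2^n * f(s)), capped at n, as in Lemma 1.12
    return np.minimum(np.floor(2.0**n * f(s)) / 2.0**n, n)

f = lambda s: np.exp(s)              # any measurable function f >= 0
s = np.linspace(0.0, 1.0, 5)         # a few sample points in S = [0, 1]

for n in (1, 2, 4, 8):
    fn = simple_approx(f, s, n)
    # f_n increases pointwise to f, with error at most 2^{-n} once n exceeds sup f
    print(n, fn, float(np.max(f(s) - fn)))
```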

To illustrate the method, we show that the basic arithmetic operations are measurable.

Lemma 1.13 (elementary operations) If the functions f, g : (S, S) → R are measurable, then so are the functions
(i) f g and af + bg, a, b ∈ R,
(ii) f /g when g ≠ 0 on S.

Proof: By Lemma 1.12 applied to f± = (±f) ∨ 0 and g± = (±g) ∨ 0, we may approximate by simple measurable functions fn → f and gn → g. Here afn + bgn and fn gn are again simple measurable functions. Since they converge to af + bg and f g, respectively, the latter functions are again measurable by Lemma 1.10. The same argument applies to the ratio f /g, provided we choose gn ≠ 0.


An alternative argument is to write af + bg, f g, or f /g as a composition ψ ◦ ϕ, where ϕ = (f, g) : S → R2, and ψ(x, y) is defined as ax + by, xy, or x/y, respectively. The desired measurability then follows by Lemmas 1.2, 1.5, 1.9, and 1.10. In the case of ratios, we may use the continuity of the mapping (x, y) → x/y on R × (R \ {0}). □

We proceed with a functional representation of measurable functions. Given some functions f, g on a common space Ω, we say that f is g-measurable if the induced σ-fields are related by σ(f) ⊂ σ(g).

Lemma 1.14 (functional representation, Doob) Let f, g be measurable functions from (Ω, A) into some measurable spaces (S, S), (T, T), where S is Borel. Then these conditions are equivalent:
(i) f is g-measurable,
(ii) f = h ◦ g for a measurable mapping h : T → S.

Proof: Since S is Borel, we may take S ∈ B[0,1], and by a suitable modification of h, we may further reduce to the case where S = [0, 1]. If f = 1A for a g-measurable set A ⊂ Ω, Lemma 1.3 yields a set B ∈ T with A = g⁻¹B. Then f = 1A = 1B ◦ g, and we may choose h = 1B. The result extends by linearity to any simple, g-measurable function f. In general, Lemma 1.12 yields some simple, g-measurable functions f1, f2, . . . with 0 ≤ fn ↑ f, and we may choose some associated T-measurable functions h1, h2, . . . : T → [0, 1] with fn = hn ◦ g. Then h = supn hn is again T-measurable by Lemma 1.10, and we have

  h ◦ g = (supn hn) ◦ g = supn (hn ◦ g) = supn fn = f.  □

Given a measurable space (S, S), we say that a function μ : S → R̄+ is countably additive if

  μ(∪k≥1 Ak) = Σk≥1 μAk,   A1, A2, . . . ∈ S disjoint.  (3)

A measure on (S, S) is defined as a countably additive set function μ : S → R̄+ with μ∅ = 0. The triple (S, S, μ) is then called a measure space. From (3) we note that any measure is finitely additive and non-decreasing. This in turn implies the countable sub-additivity

  μ(∪k≥1 Ak) ≤ Σk≥1 μAk,   A1, A2, . . . ∈ S.

We note some basic continuity properties:

Lemma 1.15 (continuity) For any measure μ on (S, S) and sets A1, A2, . . . ∈ S, we have
(i) An ↑ A ⇒ μAn ↑ μA,
(ii) An ↓ A, μA1 < ∞ ⇒ μAn ↓ μA.


Proof: For (i), we may apply (3) to the differences Dn = An \ An−1 with A0 = ∅. To get (ii), apply (i) to the sets Bn = A1 \ An. □

The simplest measures on a measurable space (S, S) are the unit point masses or Dirac measures δs, s ∈ S, given by δsA = 1A(s). A measure of the form a δs is said to be degenerate. For any countable set A = {s1, s2, . . .}, we may form the associated counting measure μ = Σn δsn. More generally, we may form countable linear combinations of measures on S, as follows.

Proposition 1.16 (sums of measures) For any measures μ1, μ2, . . . on (S, S), the sum μ = Σn μn is again a measure.

Proof: First we note that, for any array of constants cjk ≥ 0, j, k ∈ N,

  Σj Σk cjk = Σk Σj cjk.  (4)

This is obvious for finite sums. In general, we have for any m, n ∈ N

  Σj Σk cjk ≥ Σj≤m Σk≤n cjk = Σk≤n Σj≤m cjk.

Letting m → ∞ and then n → ∞, we obtain (4) with inequality ≥. The reverse relation follows by symmetry, and the equality follows.

Now consider any disjoint sets A1, A2, . . . ∈ S. Using (4) and the countable additivity of each μn, we get

  μ(∪k Ak) = Σn μn(∪k Ak) = Σn Σk μnAk = Σk Σn μnAk = Σk μAk.  □

The last result is essentially equivalent to the following:

Corollary 1.17 (monotone limits) For any σ-finite measures μ1, μ2, . . . on (S, S), we have

  μn ↑ μ or μn ↓ μ  ⇒  μ is a measure on S.

Proof: First let μn ↑ μ. For any n ∈ N, there exists a measure νn such that μn = μn−1 + νn, where μ0 = 0. Indeed, let A1, A2, . . . ∈ S be disjoint with μn−1Ak < ∞ for all k, and define νn,k = μn − μn−1 on each Ak. Then μ = Σn νn = Σn,k νn,k is a measure on S by Proposition 1.16. If instead μn ↓ μ, then define νn = μ1 − μn as above, and note that νn ↑ ν for some measure ν ≤ μ1. Hence, μn = μ1 − νn ↓ μ1 − ν, which is again a measure. □


Any measurable mapping f between two measurable spaces (S, S) and (T, T) induces a mapping of measures on S into measures on T. More precisely, given any measure μ on (S, S), we may define a measure μ ◦ f⁻¹ on (T, T) by

  (μ ◦ f⁻¹)B = μ(f⁻¹B) = μ{s ∈ S; f(s) ∈ B},  B ∈ T.

Here the countable additivity of μ ◦ f⁻¹ follows from that for μ, together with the fact that f⁻¹ preserves unions and intersections. It is often useful to identify a measure-determining class C ⊂ S, such that a measure on S is uniquely determined by its values on C.

Lemma 1.18 (uniqueness) Let μ, ν be bounded measures on a measurable space (S, S), and let C be a π-system in S with S ∈ C and σ(C) = S. Then

  μ = ν  ⇔  μA = νA, A ∈ C.

Proof: Assuming μ = ν on C, let D be the class of sets A ∈ S with μA = νA. Using the condition S ∈ C, the finite additivity of μ and ν, and Lemma 1.15, we see that D is a λ-system. Moreover, C ⊂ D by hypothesis. Hence, Theorem 1.1 yields D ⊃ σ(C) = S, which means that μ = ν. The converse assertion is obvious. □

For any measure μ on (S, S) and set A ∈ S, the mapping ν : B → μ(A ∩ B) is again a measure on (S, S), called the restriction of μ to A and denoted by 1A μ. A measure μ on S is said to be σ-finite, if S is a countable union of disjoint sets An ∈ S with μAn < ∞. Since μ = Σn 1An μ, such a μ is even s-finite, in the sense of being a countable sum of bounded measures. A measure μ on a localized Borel space S is said to be locally finite if μB < ∞ for all B ∈ Ŝ, and we write MS for the class of such measures μ.

For any measure μ on a topological space S, we define the support supp μ as the set of points s ∈ S, such that μB > 0 for every neighborhood B of s. Note that supp μ is closed and hence Borel measurable.

Lemma 1.19 (support) For any measure μ on a separable, complete metric space S, we have μ(supp μ)c = 0.

Proof: Suppose that instead μ(supp μ)c > 0. Since S is separable, the open set (supp μ)c is a countable union of closed balls of radius < 1. Choose one of them, B1, with μB1 > 0, and continue recursively to form a nested sequence of balls Bn of radius < 2⁻ⁿ, such that μBn > 0 for all n. The associated centers xn ∈ Bn form a Cauchy sequence, and so by completeness we have convergence xn → x. Then x ∈ ∩n Bn ⊂ (supp μ)c and x ∈ supp μ, a contradiction proving our claim. □


Our next aim is to define the integral

  μf = ∫ f dμ = ∫ f(ω) μ(dω)

of a real-valued, measurable function f on a measure space (S, S, μ). First take f to be simple and non-negative, hence of the form c1 1A1 + · · · + cn 1An for some n ∈ Z+, A1, . . . , An ∈ S, and c1, . . . , cn ∈ R+, and define⁸

  μf = c1 μA1 + · · · + cn μAn.

Using the finite additivity of μ, we may check that μf is independent of the choice of representation of f. It is further clear that the integration map f → μf is linear and non-decreasing, in the sense that

  μ(af + bg) = a μf + b μg,  a, b ≥ 0,
  f ≤ g  ⇒  μf ≤ μg.

To extend the integral to general measurable functions f ≥ 0, use Lemma 1.12 to choose some simple measurable functions f1, f2, . . . with 0 ≤ fn ↑ f, and define μf = limn μfn. We need to show that the limit is independent of the choice of approximating sequence (fn):

Lemma 1.20 (consistence) Let f, f1, f2, . . . and g be measurable functions on a measure space (S, S, μ), where all but f are simple. Then

  0 ≤ fn ↑ f,  0 ≤ g ≤ f  ⇒  μg ≤ limn→∞ μfn.

Proof: By the linearity of μ, it is enough to take g = 1A for some A ∈ S. Fixing any ε > 0, we define

  An = {ω ∈ A; fn(ω) ≥ 1 − ε},  n ∈ N.

Then An ↑ A, and so

  μfn ≥ (1 − ε) μAn ↑ (1 − ε) μA = (1 − ε) μg.

It remains to let ε → 0. □

The linearity and monotonicity extend immediately to arbitrary f ≥ 0, since fn ↑ f and gn ↑ g imply afn + bgn ↑ af + bg, and if also f ≤ g, then fn ≤ (fn ∨ gn) ↑ g. We prove a basic continuity property of the integral.

Theorem 1.21 (monotone convergence, Levi) For any measurable functions f, f1, f2, . . . on (S, S, μ), we have

  0 ≤ fn ↑ f  ⇒  μfn ↑ μf.

⁸ The convention 0 · ∞ = 0 applies throughout measure theory.


Proof: For every n we may choose some simple measurable functions gnk with 0 ≤ gnk ↑ fn as k → ∞. The functions hnk = g1k ∨ · · · ∨ gnk have the same properties and are further non-decreasing in both indices. Hence,

  f ≥ limk→∞ hkk ≥ limk→∞ hnk = fn ↑ f,

and so 0 ≤ hkk ↑ f. Using the definition and monotonicity of the integral, we obtain

  μf = limk→∞ μhkk ≤ limk→∞ μfk ≤ μf.  □
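As a toy check of Theorem 1.21 (an added illustration, not from the original text), take μ to be counting measure on N, so that integrals are sums; the summand k⁻² and the truncation level are arbitrary choices.

```python
# Monotone convergence for counting measure on N: mu f = sum_k f(k).
# Here f_n = f * 1{k <= n} increases to f(k) = k^{-2}, so mu f_n increases to mu f.
import math

def mu(f, K=10**6):                 # truncated sum as a stand-in for the full series
    return sum(f(k) for k in range(1, K + 1))

f = lambda k: k**-2
for n in (1, 10, 100, 1000):
    fn = lambda k, n=n: f(k) if k <= n else 0.0
    print(n, mu(fn))                # increases towards mu f = pi^2 / 6
print(math.pi**2 / 6)
```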

The last result yields the following key inequality.

Lemma 1.22 (Fatou) For any measurable functions f1, f2, . . . ≥ 0 on (S, S, μ), we have

  lim infn→∞ μfn ≥ μ lim infn→∞ fn.

Proof: Since fm ≥ infk≥n fk for all m ≥ n, we have

  infk≥n μfk ≥ μ infk≥n fk,  n ∈ N.

Letting n → ∞, we get by Theorem 1.21

  lim infk→∞ μfk ≥ limn→∞ μ infk≥n fk = μ lim infk→∞ fk.  □
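The inequality in Fatou's lemma can be strict. A standard example, added here for illustration and not part of the original text: on the two-point space with counting measure, let fn alternate between the two point masses, so that lim infn μfn = 1 while μ(lim infn fn) = 0.

```python
# Fatou on S = {0, 1} with counting measure mu; f_n is the indicator of {n mod 2}.
def f(n, s):
    return 1.0 if s == n % 2 else 0.0

N = 1000
mu_fn = [f(n, 0) + f(n, 1) for n in range(N)]                    # mu f_n = 1 for every n
liminf_f = [min(f(n, s) for n in range(N)) for s in (0, 1)]      # pointwise lim inf; the values
                                                                 # recur periodically, so min = lim inf
print(min(mu_fn), sum(liminf_f))   # lim inf mu f_n = 1  >  mu(lim inf f_n) = 0
```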

A measurable function f on (S, S, μ) is said to be integrable if μ|f| < ∞. Writing f = g − h for some integrable functions g, h ≥ 0 (e.g., as f+ − f− with f± = (±f) ∨ 0), we define μf = μg − μh. It is easy to see that the extended integral is independent of the choice of representation f = g − h, and that μf satisfies the basic linearity and monotonicity properties, now with any real coefficients.

The last lemma leads to a general condition, other than the monotonicity in Theorem 1.21, allowing us to take limits under the integral sign, in the sense that fn → f ⇒ μfn → μf. By the classical dominated convergence theorem, this holds when |fn| ≤ g for a measurable function g ≥ 0 with μg < ∞. The same argument yields a more powerful extended version:

Theorem 1.23 (extended dominated convergence, Lebesgue) Let f, f1, f2, . . . and g, g1, g2, . . . ≥ 0 be measurable functions on (S, S, μ). Then

  fn → f,  |fn| ≤ gn → g,  μgn → μg < ∞  ⇒  μfn → μf.


Proof: Applying Fatou's lemma to the functions gn ± fn ≥ 0, we get

  μg + lim infn→∞ (±μfn) = lim infn→∞ μ(gn ± fn) ≥ μ(g ± f) = μg ± μf.

Subtracting μg < ∞ from each side gives

  μf ≤ lim infn→∞ μfn ≤ lim supn→∞ μfn ≤ μf.  □

Next we show how integrals are transformed by measurable maps.

Lemma 1.24 (substitution) Given a measure space (Ω, A, μ), a measurable space S, and some measurable maps f : Ω → S and g : S → R, we have

  μ(g ◦ f) = (μ ◦ f⁻¹)g,  (5)

whenever either side exists⁹.

Proof: For indicator functions g, (5) reduces to the definition of μ ◦ f⁻¹. From here on, we may extend by linearity and monotone convergence to any measurable function g ≥ 0. For general g it follows that μ|g ◦ f| = (μ ◦ f⁻¹)|g|, and so the integrals in (5) exist at the same time. When they do, we get (5) by taking differences on both sides. □

For another basic transformation of measures and integrals, fix a measurable function f ≥ 0 on a measure space (S, S, μ), and define a function f · μ on S by

  (f · μ)A = μ(1A f) = ∫A f dμ,  A ∈ S,

where the last equality defines the integral over A. Then clearly ν = f · μ is again a measure on (S, S). We refer to f as the μ-density of ν. The associated transformation rule is the following.

Lemma 1.25 (chain rule) For any measure space (S, S, μ) and measurable functions f : S → R+ and g : S → R, we have

  μ(f g) = (f · μ)g,

whenever either side exists.

Proof: As before, we may first consider indicator functions g, and then extend in steps to the general case. □

Given a measure space (S, S, μ), we say that A ∈ S is μ-null or simply null if μA = 0. A relation between functions on S is said to hold almost everywhere with respect to μ (written as a.e. μ or μ-a.e.) if it holds for all s ∈ S outside a μ-null set. The following frequently used result explains the relevance of null sets.

⁹ meaning that if one side exists, then so does the other and the two are equal


Lemma 1.26 (null sets and functions) For any measurable function f ≥ 0 on a measure space (S, S, μ), we have

  μf = 0  ⇔  f = 0 a.e. μ.

Proof: This is obvious when f is simple. For general f, we may choose some simple measurable functions fn with 0 ≤ fn ↑ f, and note that f = 0 a.e. iff fn = 0 a.e. for every n, that is, iff μfn = 0 for all n. Since the latter integrals converge to μf, the last condition is equivalent to μf = 0. □

The last result shows that two integrals agree when the integrands are a.e. equal. We may then allow the integrands to be undefined on a μ-null set. It is also clear that the conclusions of Theorems 1.21 and 1.23 remain valid, if the hypotheses are only fulfilled outside a null set.

In the other direction, we note that if two σ-finite measures μ, ν are related by ν = f · μ for a density f, then the latter is μ-a.e. unique, which justifies the notation f = dν/dμ. It is further clear that any μ-null set is also a null set for ν. For measures μ, ν with the latter property, we say that ν is absolutely continuous with respect to μ, and write ν ≪ μ. The other extreme case is when μ, ν are mutually singular, written as μ ⊥ ν, in the sense that μA = 0 and νAc = 0 for some set A ∈ S.

Given any measure space (S, S, μ) and σ-field F ⊂ S, we define the μ-completion of F in S as the σ-field F^μ = σ(F, Nμ), where Nμ is the class of subsets of arbitrary μ-null sets in S. The description of F^μ can be made more explicit:

Lemma 1.27 (completion) For any measure space (Ω, A, μ), σ-field F ⊂ A, Borel space (S, S), and function f : Ω → S, these conditions are equivalent:
(i) f is F^μ-measurable,
(ii) f = g a.e. μ for an F-measurable function g.

Proof: Beginning with indicator functions, let G be the class of subsets A ⊂ Ω with A △ B ∈ Nμ for some B ∈ F. Then A \ B and B \ A again lie in Nμ, and so G ⊂ F^μ. Conversely, F^μ ⊂ G, since both F and Nμ are trivially contained in G. Combining the two relations gives G = F^μ, which shows that A ∈ F^μ iff 1A = 1B a.e. for some B ∈ F.

In general we may take S = [0, 1]. For any F^μ-measurable function f, we may choose some simple F^μ-measurable functions fn with 0 ≤ fn ↑ f. By the result for indicator functions, we may next choose some simple F-measurable functions gn, such that fn = gn a.e. for all n. Since a countable union of null sets is again null, the function g = lim supn gn has the desired property. □

Every measure μ on (S, S) extends uniquely to the σ-field S^μ. Indeed, for any A ∈ S^μ, Lemma 1.27 yields some sets A± ∈ S with A− ⊂ A ⊂ A+ and μ(A+ \ A−) = 0, and any extension satisfies μA = μA±. With this choice, it is easy to check that μ remains a measure on S^μ.


We proceed to construct product measures and establish conditions allowing us to change the order of integration. This requires a technical lemma of independent importance.

Lemma 1.28 (sections) For any measurable spaces (S, S), (T, T), measurable function f : S × T → R+, and σ-finite measure μ on S, we have
(i) f(s, t) is S-measurable in s ∈ S for fixed t ∈ T,
(ii) ∫ f(s, t) μ(ds) is T-measurable in t ∈ T.

Proof: We may take μ to be bounded. Both statements are obvious when f = 1A with A = B × C for some B ∈ S and C ∈ T, and they extend by monotone-class arguments to any indicator functions of sets in S ⊗ T. The general case follows by linearity and monotone convergence. □

We may now state the main result for product measures, known as Fubini's theorem¹⁰.

Theorem 1.29 (product measure and iterated integral, Lebesgue, Fubini, Tonelli) For any σ-finite measure spaces (S, S, μ), (T, T, ν), we have
(i) there exists a unique measure μ ⊗ ν on (S × T, S ⊗ T), such that

  (μ ⊗ ν)(B × C) = μB · νC,  B ∈ S, C ∈ T,

(ii) for any measurable function f : S × T → R with (μ ⊗ ν)|f| < ∞, we have

  (μ ⊗ ν)f = ∫ μ(ds) ∫ f(s, t) ν(dt) = ∫ ν(dt) ∫ f(s, t) μ(ds).

Note that the iterated integrals¹¹ in (ii) are well defined by Lemma 1.28, although the inner integrals νf(s, ·) and μf(·, t) may fail to exist on some null sets in S and T, respectively.

Proof: By Lemma 1.28, we may define

  (μ ⊗ ν)A = ∫ μ(ds) ∫ 1A(s, t) ν(dt),  A ∈ S ⊗ T,  (6)

which is clearly a measure on S × T satisfying (i). By a monotone-class argument there is at most one such measure. In particular, (6) remains true with the order of integration reversed, which proves (ii) for indicator functions f. The formula extends by linearity and monotone convergence to arbitrary measurable functions f ≥ 0.

¹⁰ This is a subtle result of measure theory. The elementary proposition in calculus for double integrals of continuous functions goes back at least to Cauchy.
¹¹ Iterated integrals should be read from right to left, so that the inner integral on the right becomes an integrand in the next step. This notation saves us from the nuisance of awkward parentheses, or from relying on confusing conventions for multiple integrals.


In general we note that (ii) holds with f replaced by |f|. If (μ ⊗ ν)|f| < ∞, it follows that NS = {s ∈ S; ν|f(s, ·)| = ∞} is a μ-null set in S, whereas NT = {t ∈ T; μ|f(·, t)| = ∞} is a ν-null set in T. By Lemma 1.26 we may redefine f(s, t) = 0 when s ∈ NS or t ∈ NT. Then (ii) follows for f by subtraction of the formulas for f+ and f−. □

We call μ ⊗ ν in Theorem 1.29 the product measure of μ and ν. Iterating the construction, we may form product measures μ1 ⊗ · · · ⊗ μn = ⊗k μk satisfying higher-dimensional versions of (ii). If μk = μ for all k, we often write the product measure as μ⊗n or μn.

By a measurable group we mean a group G endowed with a σ-field G, such that the group operations in G are G-measurable. If μ1, . . . , μn are σ-finite measures on G, we may define the convolution μ1 ∗ · · · ∗ μn as the image of the product measure μ1 ⊗ · · · ⊗ μn on Gn under the iterated group operation (x1, . . . , xn) → x1 · · · xn. The convolution is said to be associative if (μ1 ∗ μ2) ∗ μ3 = μ1 ∗ (μ2 ∗ μ3) whenever both μ1 ∗ μ2 and μ2 ∗ μ3 are σ-finite, and commutative if μ1 ∗ μ2 = μ2 ∗ μ1.

A measure μ on G is said to be right- or left-invariant if μ ◦ θr⁻¹ = μ for all r ∈ G, where θr denotes the right or left shift s → sr or s → rs. When G is Abelian, the shift is also called a translation. On product spaces G × T, the translations are defined as mappings of the form θr : (s, t) → (s + r, t).

Lemma 1.30 (convolution) On a measurable group (G, G), the convolution of σ-finite measures is associative. On Abelian groups it is also commutative, and
(i) (μ ∗ ν)B = ∫ μ(B − s) ν(ds) = ∫ ν(B − s) μ(ds),  B ∈ G,
(ii) if μ = f · λ and ν = g · λ for an invariant measure λ, then μ ∗ ν has the λ-density

  (f ∗ g)(s) = ∫ f(s − t) g(t) λ(dt) = ∫ f(t) g(s − t) λ(dt),  s ∈ G.

Proof: Use Fubini's theorem. □
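For measures on the integer group (Z, +) carried by finitely many atoms, the convolution of Lemma 1.30 reduces to the familiar convolution of weight sequences. The following sketch (an addition, not from the text, with arbitrary example weights) computes μ ∗ ν directly from the definition as the image of μ ⊗ ν under addition.

```python
from collections import defaultdict

def convolve(mu, nu):
    """Convolution of two finite measures on (Z, +), given as {point: mass} dicts."""
    out = defaultdict(float)
    for x, a in mu.items():          # iterate over the product measure mu (x) nu
        for y, b in nu.items():
            out[x + y] += a * b      # push the mass forward under (x, y) -> x + y
    return dict(out)

mu = {0: 0.5, 1: 0.5}                # a fair coin
nu = {0: 0.5, 1: 0.5}
print(convolve(mu, nu))              # {0: 0.25, 1: 0.5, 2: 0.25}
print(convolve(convolve(mu, nu), nu) == convolve(mu, convolve(nu, nu)))  # associativity check
```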

For any measure space (S, S, μ) and constant p > 0, let Lp = Lp(S, S, μ) be the class of measurable functions f : S → R satisfying

  ‖f‖p ≡ (μ|f|^p)^{1/p} < ∞.

Theorem 1.31 (Hölder and Minkowski inequalities) For any measurable functions f, g on a measure space (S, S, μ), we have
(i) ‖f g‖r ≤ ‖f‖p ‖g‖q,  p, q, r > 0 with p⁻¹ + q⁻¹ = r⁻¹,
(ii) ‖f + g‖p ≤ ‖f‖p + ‖g‖p,  p ≥ 1,
(iii) ‖f + g‖p ≤ (‖f‖p^p + ‖g‖p^p)^{1/p},  p ∈ (0, 1].

Proof: (i) We may clearly take r = 1 and ‖f‖p = ‖g‖q = 1. Since p⁻¹ + q⁻¹ = 1 implies (p − 1)(q − 1) = 1, the equations y = x^{p−1} and x = y^{q−1} are equivalent for x, y ≥ 0. By calculus,

  |f g| ≤ ∫₀^{|f|} x^{p−1} dx + ∫₀^{|g|} y^{q−1} dy = p⁻¹ |f|^p + q⁻¹ |g|^q,

and so

  ‖f g‖₁ ≤ p⁻¹ ∫ |f|^p dμ + q⁻¹ ∫ |g|^q dμ = p⁻¹ + q⁻¹ = 1.

(ii) For p > 1, we get by (i) with q = p/(p − 1) and r = 1

  ‖f + g‖p^p ≤ ∫ |f| |f + g|^{p−1} dμ + ∫ |g| |f + g|^{p−1} dμ
       ≤ ‖f‖p ‖f + g‖p^{p−1} + ‖g‖p ‖f + g‖p^{p−1},

and it remains to divide by ‖f + g‖p^{p−1}.

(iii) Clear from the concavity of |x|^p.  □
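A quick numerical sanity check of Theorem 1.31 on a finite measure space (counting measure on 1000 points), added here for illustration; the random data and the exponents are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
f, g = rng.standard_normal(1000), rng.standard_normal(1000)

def norm(h, p):
    # ||h||_p with respect to counting measure on {1, ..., 1000}
    return float((np.abs(h) ** p).sum() ** (1.0 / p))

p, q = 3.0, 1.5                                          # p^{-1} + q^{-1} = 1, so r = 1 in (i)
print(norm(f * g, 1.0) <= norm(f, p) * norm(g, q))       # Hoelder, (i)
print(norm(f + g, 2.0) <= norm(f, 2.0) + norm(g, 2.0))   # Minkowski, (ii)
p = 0.5                                                  # sub-additivity of ||.||_p^p, (iii)
print(norm(f + g, p) ** p <= norm(f, p) ** p + norm(g, p) ** p)
```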

Claim (ii) above is sometimes needed in the following extended form, where for any f as in Theorem 1.29 we define

  ‖f‖p(s) = (ν|f(s, ·)|^p)^{1/p},  s ∈ S.

Corollary 1.32 (extended Minkowski inequality) For μ, ν, f as in Theorem 1.29, suppose that μf(·, t) exists for t ∈ T a.e. ν. Then

  ‖μf‖p ≤ μ‖f‖p,  p ≥ 1.

Proof: Since |μf| ≤ μ|f|, we may assume that f ≥ 0 and ‖μf‖p ∈ (0, ∞). For p > 1, we get by Fubini's theorem and Hölder's inequality

  ‖μf‖p^p = ν(μf)^p = ν(μf · (μf)^{p−1}) = μν(f · (μf)^{p−1})
       ≤ μ(‖f‖p ‖(μf)^{p−1}‖q) = μ‖f‖p · ‖μf‖p^{p−1}.

Now divide by ‖μf‖p^{p−1}. The proof for p = 1 is similar but simpler. □

In particular, Theorem 1.31 shows that ‖·‖p becomes a norm for p ≥ 1, if we identify functions that agree a.e. For any p > 0 and f, f1, f2, . . . ∈ Lp, we write fn → f in Lp if ‖fn − f‖p → 0, and say that (fn) is Cauchy in Lp if ‖fm − fn‖p → 0 as m, n → ∞.


Lemma 1.33 (completeness) Let the sequence (fn) be Cauchy in Lp, where p > 0. Then there exists an f ∈ Lp such that ‖fn − f‖p → 0.

Proof: Choose a sub-sequence (nk) ⊂ N with Σk ‖fnk+1 − fnk‖p^{p∧1} < ∞. By Theorem 1.31 and monotone convergence we get ‖Σk |fnk+1 − fnk| ‖p^{p∧1} < ∞, and so Σk |fnk+1 − fnk| < ∞ a.e. Hence, (fnk) is a.e. Cauchy in R, and so Lemma 1.11 yields fnk → f a.e. for some measurable function f. By Fatou's lemma,

  ‖f − fn‖p ≤ lim infk→∞ ‖fnk − fn‖p ≤ supm≥n ‖fm − fn‖p → 0,  n → ∞,

which shows that fn → f in Lp. □

We give a useful criterion for convergence in Lp.

Lemma 1.34 (Lp-convergence) Let f, f1, f2, . . . ∈ Lp with fn → f a.e., where p > 0. Then

  ‖fn − f‖p → 0  ⇔  ‖fn‖p → ‖f‖p.

Proof: If fn → f in Lp, we get by Theorem 1.31

  | ‖fn‖p^{p∧1} − ‖f‖p^{p∧1} | ≤ ‖fn − f‖p^{p∧1} → 0,

and so ‖fn‖p → ‖f‖p. Now assume instead the latter condition, and define

  gn = 2^p (|fn|^p + |f|^p),  g = 2^{p+1} |f|^p.

Then gn → g a.e. and μgn → μg < ∞ by hypotheses. Since also |fn − f|^p ≤ gn and |fn − f|^p → 0 a.e., Theorem 1.23 yields ‖fn − f‖p^p = μ|fn − f|^p → 0. □

Taking p = q = 2 and r = 1 in Theorem 1.31 (i) yields the Cauchy inequality¹²

  ‖f g‖₁ ≤ ‖f‖₂ ‖g‖₂.

In particular, the inner product ⟨f, g⟩ = μ(f g) exists for f, g ∈ L2 and satisfies |⟨f, g⟩| ≤ ‖f‖₂ ‖g‖₂. From the obvious bi-linearity of the inner product, we get the parallelogram identity

  ‖f + g‖² + ‖f − g‖² = 2 ‖f‖² + 2 ‖g‖²,  f, g ∈ L2.  (7)

Two functions f, g ∈ L2 are said to be orthogonal, written as f ⊥ g, if ⟨f, g⟩ = 0. Orthogonality between two subsets A, B ⊂ L2 means that f ⊥ g for all f ∈ A and g ∈ B. A subspace M ⊂ L2 is said to be linear if f, g ∈ M and a, b ∈ R imply af + bg ∈ M, and closed if fn → f in L2 for some fn ∈ M implies f ∈ M.

¹² also known as the Schwarz or Buniakovsky inequality
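Viewing Rⁿ with counting measure as a finite-dimensional instance of L2, the Cauchy inequality and the parallelogram identity (7) can be checked numerically. This small sketch is an addition, not part of the original text; the random vectors are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
f, g = rng.standard_normal(50), rng.standard_normal(50)   # elements of L2 for counting measure

inner = float(f @ g)                     # <f, g> = mu(f g)
norm = lambda h: float(np.sqrt(h @ h))   # ||h||_2

print(abs(inner) <= norm(f) * norm(g))   # Cauchy (Schwarz) inequality
lhs = norm(f + g) ** 2 + norm(f - g) ** 2
rhs = 2 * norm(f) ** 2 + 2 * norm(g) ** 2
print(abs(lhs - rhs) < 1e-9)             # parallelogram identity (7), up to rounding
```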


Theorem 1.35 (projection) For a closed, linear subspace M ⊂ L2, any function f ∈ L2 has an a.e. unique decomposition f = g + h with g ∈ M, h ⊥ M.

Proof: Fix any f ∈ L2, and define r = inf{‖f − g‖; g ∈ M}. Choose g1, g2, . . . ∈ M with ‖f − gn‖ → r. Using the linearity of M, the definition of r, and (7), we get as m, n → ∞

  4 r² + ‖gm − gn‖² ≤ ‖2f − gm − gn‖² + ‖gm − gn‖²
       = 2 ‖f − gm‖² + 2 ‖f − gn‖² → 4 r².

Thus, ‖gm − gn‖ → 0, and so the sequence (gn) is Cauchy in L2. By Lemma 1.33 it converges toward some g ∈ L2, and since M is closed, we have g ∈ M. Noting that h = f − g has norm r, we get for any l ∈ M

  r² ≤ ‖h + tl‖² = r² + 2 t⟨h, l⟩ + t² ‖l‖²,  t ∈ R,

which implies ⟨h, l⟩ = 0. Hence, h ⊥ M, as required.

To prove the uniqueness, let g′ + h′ be another decomposition with the stated properties. Then both g − g′ ∈ M and g − g′ = h′ − h ⊥ M, and so g − g′ ⊥ g − g′, which implies ‖g − g′‖² = ⟨g − g′, g − g′⟩ = 0, and hence g = g′ a.e. □

We proceed with a basic approximation of sets. Let F, G denote the classes of closed and open subsets of S.

Lemma 1.36 (regularity) For any bounded measure μ on a metric space S with Borel σ-field S, we have

  μB = supF⊂B μF = infG⊃B μG,  B ∈ S,

with F, G restricted to F, G, respectively.

Proof: For any G ∈ G there exist some closed sets Fn ↑ G, and by Lemma 1.15 we get μFn ↑ μG. This proves the statement for B belonging to the π-system G of open sets. Letting D be the class of sets B with the stated property, we further note that D is a λ-system. Hence, Theorem 1.1 yields D ⊃ σ(G) = S. □

The last result leads to a basic approximation property for functions.

Lemma 1.37 (approximation) Consider a metric space S with Borel σ-field S, a bounded measure μ on (S, S), and a constant p > 0. Then the bounded, continuous functions on S are dense in Lp(S, S, μ). Thus, for any f ∈ Lp, there exist some bounded, continuous functions f1, f2, . . . : S → R with ‖fn − f‖p → 0.


Proof: If f = 1A with A ⊂ S open, we may choose some continuous functions fn with 0 ≤ fn ↑ f, and then ‖fn − f‖p → 0 by dominated convergence. The result extends by Lemma 1.36 to any A ∈ S. The further extension to simple, measurable functions is immediate. For general f ∈ Lp, we may choose some simple, measurable functions fn → f with |fn| ≤ |f|. Since |fn − f|^p ≤ 2^{p+1} |f|^p, we get ‖fn − f‖p → 0 by dominated convergence. □

Next we show how the pointwise convergence of measurable functions is almost uniform. Here ‖f‖_A = sup_{s∈A} |f(s)|.

Lemma 1.38 (near uniformity, Egorov) Let f, f1, f2, . . . be measurable functions on a finite measure space (S, S, μ), such that fn → f on S. Then there exist some sets A1, A2, . . . ∈ S satisfying

  μA_k^c → 0,  ‖fn − f‖_{A_k} → 0,  k ∈ N.

Proof: Define

  A_{r,n} = ∩_{k≥n} {s ∈ S; |fk(s) − f(s)| < r⁻¹},  r, n ∈ N.

Letting n → ∞ for fixed r, we get A_{r,n} ↑ S and hence μA_{r,n}^c → 0. Given any ε > 0, we may choose n1, n2, . . . ∈ N so large that μA_{r,n_r}^c < ε 2⁻ʳ for all r. Writing A = ∩_r A_{r,n_r}, we get

  μAc ≤ μ(∪_r A_{r,n_r}^c) < ε Σ_r 2⁻ʳ = ε,

and we note that fn → f uniformly on A.

Combining the last two results, we may show that every measurable function is almost continuous. Lemma 1.39 (near continuity, Lusin) Consider a measurable function f and a bounded measure μ on a compact metric space S with Borel σ-field S. Then there exist some continuous functions f1 , f2 , . . . on S, such that



μ x; fn (x) = f (x) → 0. Proof: We may clearly take f to be bounded. By Lemma 1.37 we may choose some continuous functions g1 , g2 , . . . on S, such that μ|gk − f | ≤ 2−k . By Fubini’s theorem, we get μ Σ k |gk − f | =

Σ k μ|gk − f | ≤ Σ k 2−k = 1,

and so Σk |gk − f | < ∞ a.e., which implies gk → f a.e. By Lemma 1.38, we may next choose some A1 , A2 , . . . ∈ S with μAcn → 0, such that the convergence is uniform on every An . Since each gk is uniformly continuous on S,



we conclude that f is uniformly continuous on each An . By Tietze’s extension theorem13 , the restriction f |An has then a continuous extension fn to S. 2

Exercises

1. Prove the triangle inequality μ(A △ C) ≤ μ(A △ B) + μ(B △ C). (Hint: Note that 1_{A△B} = |1A − 1B|.)
2. Show that Lemma 1.10 fails for uncountable index sets. (Hint: Show that every measurable set depends on countably many coordinates.)
3. For any space S, let μA denote the cardinality of the set A ⊂ S. Show that μ is a measure on (S, 2^S).
4. Let K be the class of compact subsets of some metric space S, and let μ be a bounded measure such that infK∈K μKc = 0. Show that for any B ∈ BS, μB = sup{μK; K ∈ K, K ⊂ B}.
5. Show that any absolutely convergent series can be written as an integral with respect to counting measure on N. State series versions of Fatou's lemma and the dominated convergence theorem, and give direct elementary proofs.
6. Give an example of integrable functions f, f1, f2, . . . on a probability space (S, S, μ), such that fn → f but μfn ↛ μf.
7. Let μ, ν be σ-finite measures on a measurable space (S, S) with sub-σ-field F. Show that if μ ≪ ν on S, it also holds on F. Further show by an example that the converse may fail.
8. Fix two measurable spaces (S, S) and (T, T), a measurable function f : S → T, and a measure μ on S with image ν = μ ◦ f⁻¹. Show that f remains measurable with respect to the completions S^μ and T^ν.
9. Fix a measure space (S, S, μ) and a σ-field T ⊂ S, let S^μ denote the μ-completion of S, and let T^μ be the σ-field generated by T and the μ-null sets of S^μ. Show that A ∈ T^μ iff there exist some B ∈ T and N ∈ S^μ with A △ B ⊂ N and μN = 0. Also, show by an example that T^μ may be strictly greater than the μ-completion of T.
10. State Fubini's theorem for the case where μ is σ-finite and ν is counting measure on N. Give a direct proof of this version.
11. Let f1, f2, . . . be μ-integrable functions on a measurable space S, such that g = Σk fk exists a.e., and put gn = Σk≤n fk. Restate the dominated convergence theorem for the integrals μgn in terms of the functions fk, and compare with the result of the preceding exercise.
12. Extend Theorem 1.29 to the product of n measures.
13. Let M ⊃ N be closed linear subspaces of L2. Show that if f ∈ L2 has projections g onto M and h onto N, then g has projection h onto N.
14. Let M be a closed linear subspace of L2, and let f, g ∈ L2 with M-projections f̂ and ĝ. Show that ⟨f̂, g⟩ = ⟨f, ĝ⟩ = ⟨f̂, ĝ⟩.

¹³ A continuous function on a closed subset of a normal topological space S has a continuous extension to S.



15. Show that if μ ≪ ν and νf = 0 with f ≥ 0, then also μf = 0. (Hint: Use Lemma 1.26.)
16. For any σ-finite measures μ1 ≪ μ2 and ν1 ≪ ν2, show that μ1 ⊗ ν1 ≪ μ2 ⊗ ν2. (Hint: Use Fubini's theorem and Lemma 1.26.)

Chapter 2

Measure Extension and Decomposition

Outer measure, Carathéodory extension, Lebesgue measure, shift and rotation invariance, Hahn and Jordan decompositions, Lebesgue decomposition and differentiation, Radon–Nikodym theorem, Lebesgue–Stieltjes measures, finite-variation functions, signed measures, atomic decomposition, factorial measures, right and left continuity, absolutely continuous and singular functions, Riesz representation

The general measure theory developed in Chapter 1 would be void and meaningless, unless we could establish the existence of some non-trivial, countably additive measures. In fact, much of modern probability theory depends implicitly on the existence of such measures, needed already to model an elementary sequence of coin tossings. Here we prove the equivalent existence of Lebesgue measure, which will later allow us to establish the more general Lebesgue–Stieltjes measures, and a variety of discrete- and continuous-time processes throughout the book. The basic existence theorems in this and the next chapter are surprisingly subtle, and the hurried or impatient reader may feel tempted to skip this material and move on to the probabilistic parts of the book. However, he/she should be aware that much of the present theory underlies the subsequent probabilistic discussions, and any serious student should be prepared to return for reference when need arises. Our first aim is to construct Lebesgue measure, using the powerful approach of Carathéodory based on outer measures. The result will lead, via the Daniell–Kolmogorov theorem in Chapter 8, to the basic existence theorem for Markov processes in Chapter 11. We proceed to the correspondence between signed measures and functions of locally finite variation, of special importance for the theory of semi-martingales and general stochastic integration. A further high point is the powerful Riesz representation, which will enable us in Chapter 17 to construct Markov processes with a given generator, via resolvents and associated semi-groups of transition operators. We may further mention the Radon–Nikodym theorem, relevant to the theory of conditioning in Chapter 8, and Lebesgue's differentiation theorem, instrumental for proving the general ballot theorem in Chapter 25. We begin with an ingenious technical result, which will play a crucial role for our construction of Lebesgue measure in Theorem 2.2, and for the proof of Riesz' representation Theorem 2.25. By an outer measure on a space S




we mean a non-decreasing, countably sub-additive set function μ : 2^S → R̄+ with μ∅ = 0. Given an outer measure μ on S, we say that a set A ⊂ S is μ-measurable if

  μE = μ(E ∩ A) + μ(E ∩ Ac),  E ⊂ S.  (1)

Note that the inequality ≤ holds automatically by sub-additivity. The following result gives the basic measure construction from outer measures.

Theorem 2.1 (restriction of outer measure, Carathéodory) Let μ be an outer measure on S, and write S for the class of μ-measurable sets. Then S is a σ-field and the restriction of μ to S is a measure.

Proof: Since μ∅ = 0, we have for any set E ⊂ S

  μ(E ∩ ∅) + μ(E ∩ S) = μ∅ + μE = μE,

which shows that ∅ ∈ S. Also note that trivially A ∈ S implies Ac ∈ S. Next let A, B ∈ S. Using (1) for A and B together with the sub-additivity of μ, we get for any E ⊂ S

  μE = μ(E ∩ A) + μ(E ∩ Ac)







= μ E ∩ A ∩ B + μ E ∩ A ∩ B c + μ(E ∩ Ac )







≥ μ E ∩ (A ∩ B) + μ E ∩ (A ∩ B)c , which shows that even A∩B ∈ S. It follows easily that S is a field. If A, B ∈ S are disjoint, we also get by (1) for any E ⊂ S









μ E ∩ (A ∪ B) = μ E ∩ (A ∪ B) ∩ A + μ E ∩ (A ∪ B) ∩ Ac = μ(E ∩ A) + μ(E ∩ B). 



(2) 

Finally, let A1 , A2 , . . . ∈ S be disjoint, and put Un = k≤n Ak and U = U n n . Using (2) recursively along with the monotonicity of μ, we get μ(E ∩ U ) ≥ μ(E ∩ Un ) = Σ μ(E ∩ Ak ). k≤n

Letting n → ∞ and combining with the sub-additivity of μ, we obtain μ(E ∩ U ) = Σ k μ(E ∩ Ak ). (3) Taking E = S, we see in particular that μ is countably additive on S. Noting that Un ∈ S and using (3) twice, along with the monotonicity of μ, we also get μE = μ(E ∩ Un ) + μ(E ∩ Unc ) ≥ Σ μ(E ∩ Ak ) + μ(E ∩ U c ) k≤n

→ μ(E ∩ U ) + μ(E ∩ U c ), which shows that U ∈ S. Thus, S is a σ-field.

2
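On a finite space the Carathéodory condition (1) can be checked by brute force. The following sketch is an added illustration with an arbitrarily chosen outer measure: for μE = 1{E ≠ ∅} on a three-point set, only ∅ and S are μ-measurable, so the σ-field produced by Theorem 2.1 can be quite small.

```python
from itertools import chain, combinations

S = {0, 1, 2}
subsets = [frozenset(c) for c in chain.from_iterable(combinations(S, r) for r in range(4))]

def outer(E):                        # an outer measure: 0 on the empty set, 1 otherwise
    return 0 if not E else 1

def measurable(A):                   # Caratheodory condition (1), checked over all test sets E
    return all(outer(E) == outer(E & A) + outer(E - A) for E in subsets)

print([sorted(A) for A in subsets if measurable(A)])   # only [] and [0, 1, 2]
```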

Much of modern probability relies on the existence of non-trivial, countably additive measures. Here we prove that the elementary notion of interval length can be extended to a measure λ on R, known as Lebesgue measure.



Theorem 2.2 (Lebesgue measure, Borel) There exists a unique measure λ on (R, B), such that λ[a, b] = b − a, a ≤ b. For the proof, we first need to extend the set of lengths |I| of real intervals I to an outer measure on R. Then define λA = inf

{Ik }

Σ k |Ik |,

A ⊂ R,

(4)

where the infimum extends over all countable covers of A by open intervals I1 , I2 , . . . . We show that (4) provides the desired extension. Lemma 2.3 (outer Lebesgue measure) The function λ in (4) is an outer measure on R, satisfying λI = |I| for every interval I. Proof: The set function λ is clearly non-negative and non-decreasing with λ∅ = 0. To prove the countable sub-additivity, let A1 , A2 , . . . ⊂ R be arbitrary. For any ε > 0 and n ∈ N, we may choose some open intervals In1 , In2 , . . . such that An ⊂ ∪ k Ink , λAn ≥ Σ k |Ink | − ε2−n , n ∈ N. Then A ⊂ I ,

∪n

n

∪ n ∪ k nk Σ n Σ k |Ink | ≤ Σ n λAn + ε,

λ ∪ n An ≤

and the desired relation follows as we let ε → 0. To prove the second assertion, we may take I = [a, b] for some finite a < b. Since I ⊂ (a − ε, b + ε) for every ε > 0, we get λI ≤ |I| + 2ε, and so λI ≤  |I|. As for the reverse relation, we need to prove that if I ⊂ k Ik for some open intervals I1 , I2 , . . . , then |I| ≤ Σk |Ik |. By the Heine–Borel theorem1 , I remains covered by finitely many intervals I1 , . . . , In , and it suffices to show that |I| ≤ Σk≤n |Ik |. This reduces the assertion to the case of finitely many covering intervals I1 , . . . , In . The statement is clearly true for a single covering interval. Proceeding by induction, assume the truth for n − 1 covering intervals, and turn to the case of covering by I1 , . . . , In . Then b belongs to some Ik = (ak , bk ), and so the interval Ik = I \ Ik is covered by the remaining intervals Ij , j = k. By the induction hypothesis, we get |I| = b − a ≤ (b − ak ) + (ak − a) ≤ |Ik | + |Ik | ≤ |Ik | + Σ |Ij | = Σ j |Ij |, j = k as required. 2 We show that the class of measurable sets in Lemma 2.3 contains all Borel sets. 1

A set in Rd is compact iff it is bounded and closed.



Lemma 2.4 (measurability of intervals) Let λ denote the outer measure in Lemma 2.3. Then the interval (−∞, a] is λ-measurable for every a ∈ R. Proof: For any set E ⊂ R and constant ε > 0, we may cover E by some open intervals I1 , I2 , . . . , such that λE ≥ Σn |In | − ε. Writing I = (−∞, a] and using the sub-additivity of λ and Lemma 2.3, we get

Σ n |In| = Σ n |In ∩ I| + Σ n |In ∩ I c | = Σ n λ(In ∩ I) + Σ n λ(In ∩ I c )

λE + ε ≥

≥ λ(E ∩ I) + λ(E ∩ I c ). Since ε was arbitrary, it follows that I is λ -measurable.

2

Proof of Theorem 2.2: Define λ as in (4). Then Lemma 2.3 shows that λ is an outer measure satisfying λI = |I| for every interval I. Furthermore, Theorem 2.1 shows that λ is a measure on the σ-field S of all λ-measurable sets. Finally, Lemma 2.4 shows that S contains all intervals (−∞, a] with a ∈ R. Since the latter sets generate the Borel σ-field B, we have B ⊂ S. To prove the uniqueness, consider any measure μ with the stated properties, and put In = [−n, n] for n ∈ N. Using Lemma 1.18, with C as the class of intervals, we see that λ(B ∩ In ) = μ(B ∩ In ),

B ∈ B, n ∈ N.

Letting n → ∞ and using Lemma 1.15, we get λB = μB for all B ∈ B, as required. 2 Before proceeding to a more detailed study of Lebesgue measure, we state a general extension theorem that can be proved by essentially the same arguments. Here a non-empty class I of subsets of a space S is called a semi-ring, if for any I, J ∈ I we have I ∩ J ∈ I, and the set I \ J is a finite union of disjoint sets I1 , . . . , In ∈ I. Theorem 2.5 (extension, Carath´eodory) Let μ be a finitely additive, countably sub-additive set function on a semi-ring I ⊂ 2S , such that μ∅ = 0. Then μ extends to a measure on σ(I). Proof: Define a set function μ∗ on 2S by μ∗ A = inf

{Ik }

Σ k μIk ,

A ⊂ S,

where the infimum extends over all covers of A by sets I1 , I2 , . . . ∈ I, and take μ∗ A = ∞ when no such cover exists. Proceeding as in the proof of Lemma 2.3, we see that μ∗ is an outer measure on S. To see that μ∗ extends μ, fix any I ∈ I, and consider an arbitrary cover I1 , I2 , . . . ∈ I of I. Using the sub-additivity and finite additivity of μ, we get



μ∗ I ≤ μI ≤ Σ k μ(I ∩ Ik ) ≤

Σ k μIk ,

which implies μ∗ I = μI. By Theorem 2.1, it remains to show that every set I ∈ I is μ∗ -measurable. Then let A ⊂ S be covered by some sets I1 , I2 , . . . ∈ I with μ∗ A ≥ Σk μIk − ε, and proceed as in the proof of Lemma 2.4, noting that In \ I is a finite disjoint union of sets Inj ∈ I, and hence μ(In \ I) = Σj μInj by the finite additivity of μ. 2 For every d ∈ N, we may use Theorem 1.29 to form the d -dimensional Lebesgue measure λd = λ ⊗ · · · ⊗ λ on Rd , generalizing the elementary notions of area and volume. We show that λd is invariant under translations (or shifts), as well as under arbitrary rotations, and that either invariance characterizes λd up to a constant factor. Theorem 2.6 (invariance of Lebesgue measure) Let μ be a locally finite measure on a product space Rd × S, where (S, S) is a localized Borel space. Then these conditions are equivalent: (i) μ = λd ⊗ ν for some locally finite measure ν on S, (ii) μ is invariant under translations of Rd , (iii) μ is invariant under rigid motions of Rd . Proof, (ii) ⇒ (i): Assume (ii). Write I for the class of intervals I = (a, b] with rational endpoints. Then for any I1 , . . . , Id ∈ I and C ∈ S with νC < ∞, 



μ I1 × · · · × Id × C = |I1 | · · · |Id | νC 



= (λd ⊗ ν) I1 × · · · × Id × C . For fixed I2 , . . . , Id and C, this extends by monotonicity to arbitrary intervals I1 , and then, by the uniqueness in Theorem 2.2, to any B1 ∈ B. Proceeding recursively in d steps, we get for arbitrary B1 , . . . , Bd ∈ B 







μ B1 × · · · × Bd × C = (λd ⊗ ν) B1 × · · · × Bd × C , which yields (i) by the uniqueness in Theorem 1.29. (i) ⇒ (ii)−(iii): Let μ = λd ⊗ ν. For any h = (h1 , . . . , hd ) ∈ Rd , define the shift θh : Rd → Rd by θh x = x + h for all x ∈ Rd . Then for any intervals I1 , . . . , Id and sets C ∈ S, 



μ I1 × · · · × Id × C = |I1 | · · · |Id | νC 



= μ ◦ θh−1 I1 × · · · × Id × C , where θh (x, s) = (x + h, s). As before, it follows that μ = μ ◦ θh−1 . To see that μ is also invariant under orthogonal transformations ψ on Rd , we note that for any x, h ∈ Rd


(θh ◦ ψ)x = = = =

ψx +h  ψ x + ψ −1 h ψ(x + h ) (ψ ◦ θh )x,

where h = ψ −1 h. Since μ is shift-invariant, we obtain −1 μ ◦ ψ −1 ◦ θh−1 = μ ◦ θh−1  ◦ ψ = μ ◦ ψ −1 ,

where ψ(x, s) = (ψx, s). Thus, even μ ◦ ψ −1 is shift-invariant and hence of the form λd ⊗ ν  . Writing B for the unit ball in Rd , we get for any C ∈ S λd B · ν  C = = = =

μ◦ ψ −1 (B × C) μ ψ −1 B × C μ(B × C) λd B · νC.

Dividing by λd B gives ν  C = νC, and so ν  = ν, which implies μ ◦ ψ −1 = μ. 2 We show that any integrable function on Rd is continuous in a suitable average sense. Lemma 2.7 (mean continuity) Let f be a measurable function on Rd with  λd |f | < ∞. Then    lim f (x + h) − f (x) dx = 0. h→0

Proof: By Lemma 1.37 and a simple truncation, we may choose some continuous functions f1 , f2 , . . . with bounded supports such that λd |fn − f | → 0. By the triangle inequality, we get for n ∈ N and h ∈ Rd         f (x + h) − f (x) dx ≤ fn (x + h) − fn (x) dx + 2λd |fn − f |.

Since the fn are bounded, the right-hand side tends to 0 by dominated convergence, as h → 0 and then n → ∞. 2 By a signed measure on a localized Borel space (S, S) we mean a function  ˆ ν : Sˆ → R, such that ν n Bn = Σn νBn for any disjoint sets B1 , B2 , . . . ∈ S, where the series converges absolutely. Say that the measures μ, ν on (S, S) are (mutually) singular and write μ ⊥ ν, if there exists an A ∈ S with μA = νAc = 0. Note that A may not be unique. We state the basic decomposition of a signed measure into positive components. Theorem 2.8 (Hahn decomposition) For any signed measure ν on S, there exist some unique, locally finite measures ν± ≥ 0 on S, such that ν = ν+ − ν− ,

ν + ⊥ ν− .



Proof: We may take ν to be bounded. Put c = sup{νA; A ∈ S}, and note that if A, A ∈ S with νA ≥ c − ε and νA ≥ c − ε , then ν(A ∪ A ) = νA + νA − ν(A ∩ A ) ≥ (c − ε) + (c − ε ) − c = c − ε − ε . Choosing A1 , A2 , . . . ∈ S with νAn ≥ c−2−n , we get by iteration and countable additivity ν ∪ Ak ≥ c − Σ 2−k k>n

k>n

= c − 2−n ,

 

n ∈ N.

Define A+ = n k>n Ak and A− = Ac+ . Using the countable additivity again, we get νA+ = c. Hence, for sets B ∈ S, νB = νA+ − ν(A+ \ B) ≥ 0, B ⊂ A+ , νB = ν(A+ ∪ B) − νA+ ≤ 0, B ⊂ A− . We may then define the measures ν+ and ν− by ν+ B = ν(B ∩ A+ ), ν− B = −ν(B ∩ A− ),

B ∈ S.

To prove the uniqueness, suppose that also ν = μ+ − μ− for some positive measures μ+ ⊥ μ− . Choose a set B+ ∈ S with μ− B+ = μ+ B+c = 0. Then ν is both positive and negative on the sets A+ \ B+ and B+ \ A+ , and therefore ν = 0 on A+  B+ . Hence, for any C ∈ S, μ+ C = μ+ (B+ ∩ C) = ν(B+ ∩ C) = ν(A+ ∩ C) = ν+ C, which shows that μ+ = ν+ . Then also μ− = μ+ − ν = ν+ − ν = ν− .

2
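On a finite set, the Hahn decomposition of Theorem 2.8 is simply the split of a signed measure into its positive and negative parts, with A+ the set of points of non-negative mass. The code below is an added sketch with arbitrary example data, not part of the original text.

```python
# Signed measure on a finite set, given by point masses.
nu = {'a': 2.0, 'b': -1.5, 'c': 0.5, 'd': -0.25}

A_plus = {s for s, m in nu.items() if m >= 0}          # nu >= 0 on A_+, nu <= 0 on its complement
nu_plus = {s: m for s, m in nu.items() if m > 0}       # nu_+ = restriction of nu to A_+
nu_minus = {s: -m for s, m in nu.items() if m < 0}     # nu_- = -nu on the complement

print(A_plus, nu_plus, nu_minus)
# check: nu = nu_+ - nu_- and the two parts live on disjoint sets, hence are mutually singular
print(all(abs(nu[s] - (nu_plus.get(s, 0.0) - nu_minus.get(s, 0.0))) < 1e-12 for s in nu))
```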

The last result yields the existence of the maximum μ ∨ ν and minimum μ ∧ ν of two σ-finite measures μ and ν. Corollary 2.9 (maximum and minimum) For any σ-finite measures μ, ν on S, (i) there exist a largest measure μ ∧ ν and a smallest one μ ∨ ν satisfying μ ∧ ν ≤ μ, ν ≤ μ ∨ ν, (ii) the measures in (i) satisfy ( μ − μ ∧ ν ) ⊥ ( ν − μ ∧ ν ), μ ∧ ν + μ ∨ ν = μ + ν.



Proof: We may take μ and ν to be bounded. Writing ρ+ − ρ− for the Hahn decomposition of μ − ν, we put μ ∧ ν = μ − ρ+ , μ ∨ ν = μ + ρ− .

2

For any measures μ, ν on (S, S), we say that ν is absolutely continuous with respect to μ and write ν  μ, if μA = 0 implies νA = 0 for all A ∈ S. We show that any σ-finite measure has a unique decomposition into an absolutely continuous and a singular component, where the former has a basic integral representation. Theorem 2.10 (Lebesgue decomposition, Radon–Nikodym theorem) Let μ, ν be σ-finite measures on S. Then (i) ν = νa + νs for some unique measures νa  μ and νs ⊥ μ, (ii) νa = f · μ for a μ-a.e. unique measurable function f ≥ 0 on S. Two lemmas will be needed for the proof. Lemma 2.11 (closure) Consider some measures μ, ν and measurable functions f1 , f2 , . . . ≥ 0 on S, and put f = supn fn . Then fn · μ ≤ ν, n ∈ N



f · μ ≤ ν.

Proof: First let f · μ ≤ ν and g · μ ≤ ν, and put h = f ∨ g. Writing A = {f ≥ g}, we get h · μ = 1A h · μ + 1Ac h · μ = 1 A f · μ + 1 Ac g · μ ≤ 1A ν + 1Ac ν = ν. Thus, we may assume that fn ↑ f . But then ν ≥ fn · μ ↑ f · μ by monotone convergence, and so f · μ ≤ ν. 2 Lemma 2.12 (partial density) Let μ, ν be bounded measures on S with μ ⊥ ν. Then there exists a measurable function f ≥ 0 on S, such that μf > 0,

f · μ ≤ ν.

Proof: For any n ∈ N, we introduce the signed measure χn = ν − n−1 μ. By − Theorem 2.8 there exists a set A+ n ∈ S with complement An such that ±χn ≥ 0 + ± on An . Since the χn are non-decreasing, we may assume that A+ 1 ⊂ A2 ⊂ · · · .   + c − − Writing A = n An and noting that A = n An ⊂ An , we obtain νAc ≤ νA− n −1 − = χn A− n + n μAn ≤ n−1 μS → 0,



and so νAc = 0. Since μ ⊥ ν, we get μA > 0. Furthermore, A+ n ↑ A implies + ↑ μA > 0, and we may choose n so large that μA > 0. Putting f = μA+ n n −1 −1 + + n 1An , we obtain μf = n μAn > 0 and f · μ = n−1 1A+n · μ = 1A+n · ν − 1A+n · χn ≤ ν.

2
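Before turning to the proof, it may help to see Theorem 2.10 in the simplest discrete case, where everything is explicit: on a countable set, νa has density f = ν/μ on {μ > 0}, and νs is the restriction of ν to {μ = 0}. This sketch is an added illustration with arbitrary example data, not part of the original text.

```python
mu = {'a': 1.0, 'b': 2.0, 'c': 0.0}        # reference measure
nu = {'a': 0.5, 'b': 0.0, 'c': 3.0}        # measure to decompose

f       = {s: nu[s] / mu[s] for s in mu if mu[s] > 0}   # Radon-Nikodym density d(nu_a)/d(mu)
nu_abs  = {s: f[s] * mu[s] for s in f}                  # nu_a = f . mu, absolutely continuous w.r.t. mu
nu_sing = {s: nu[s] for s in mu if mu[s] == 0}          # nu_s, carried by {mu = 0}, hence singular to mu

print(f, nu_abs, nu_sing)
```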

Proof of Theorem 2.10: We may take μ and ν to be bounded. Let C be the class of measurable functions f ≥ 0 on S with f · μ ≤ ν, and define c = sup{μf ; f ∈ C}. Choose f1 , f2 , . . . ∈ C with μfn → c. Then f ≡ supn fn ∈ C by Lemma 2.11, and μf = c by monotone convergence. Define νa = f · μ and νs = ν − νa , and note that νa  μ. If νs ⊥ μ, then Lemma 2.12 yields a measurable function g ≥ 0 with μg > 0 and g · μ ≤ νs . But then f + g ∈ C with μ(f + g) > c, which contradicts the definition of c. Thus, νs ⊥ μ. To prove the uniqueness of νa and νs , suppose that also ν = νa + νs for some measures νa  μ and νs ⊥ μ. Choose A, B ∈ S with νs A = μAc = νs B = μB c = 0. Then clearly νs (A ∩ B) = νs (A ∩ B) = νa (Ac ∪ B c ) = νa (Ac ∪ B c ) = 0, and so

νa = = = νs = =

1A∩B · νa 1A∩B · ν 1A∩B · νa = νa , ν − νa ν − νa = νs .

To see that f is a.e. unique, suppose that also νa = g·μ for some measurable function g ≥ 0. Writing h = f − g and noting that h · μ = 0, we get

μ|h| =

{h>0}

h dμ −

{h 0, Lemma 1.36 yields an open set Gδ ⊃ A with μGδ < δ. Define

μ(x − h, x + h) > ε , ε > 0, Aε = x ∈ A; lim sup h h→0 and note that the Aε are measurable, since the lim sup may be taken along the rationals. For every x ∈ Aε there exists an interval I = (x − h, x + h) ⊂ Gδ with 2μI > ε|I|, and we note that the class Iε,δ of such intervals covers Aε . Hence, Lemma 2.16 yields some disjoint sets I1 , . . . , In ∈ Iε,δ satisfying Σk |Ik | ≥ λAε/4. Then



Σ k |Ik | ≤ 8 ε−1 Σ k μIk

λAε ≤ 4

≤ 8 μGδ /ε < 8 δ/ε, and as δ → 0 we get λAε = 0. Thus, lim sup μ(x − h, x + h)/h ≤ ε a.e. λ on A, and the assertion follows since ε is arbitrary. 2 Proof of Theorem 2.15: Since Fs = 0 a.e. λ by Lemma 2.17, we may assume that F = f · λ. Define



F+ (x) = lim sup h−1 F (x + h) − F (x) , h→0





F− (x) = lim inf h−1 F (x + h) − F (x) , h→0

and note that F+ = 0 a.e. on the set {f = 0} = {x; f (x) = 0} by Lemma 2.17. Applying this result to the function Fr = (f − r)+ · λ for arbitrary r ∈ R, and noting that f ≤ (f − r)+ + r, we get F+ ≤ r a.e. on {f ≤ r}. Thus, for r restricted to the rationals,







λ{f < F+ } = λ ∪ r f ≤ r < F+ ≤

Σr λ

f ≤ r < F+ = 0,

which shows that F+ ≤ f a.e. Applying this to the function −F = (−f ) · λ yields F− = −(−F )+ ≥ f a.e. Thus, F+ = F− = f a.e., and so F  exists a.e. and equals f . 2 For a localized Borel space (S, S), let MS be the class of locally finite measures on S. It becomes a measurable space in its own right, when endowed with the σ-field induced by the evaluation maps πB : μ → μB, B ∈ S. In particular, the class of probability measures on S is a measurable subset of MS . To state the basic atomic decomposition of measures μ ∈ MS , recall that δs B = 1B (s), and write McS for the class of diffuse 2 measures μ ∈ MS , where ¯ + = {0, 1, . . . ; ∞}. μ{s} = 0 for all s ∈ S. Put Z Theorem 2.18 (atomic decomposition) For any measures μ ∈ MS with S a localized Borel space, we have (i)

μ=α+

Σ βk δσ ,

k≤κ

k

¯ + , β1 , β2 , . . . > 0, and distinct σ1 , σ2 , . . . ∈ S, for some α ∈ McS , κ ∈ Z (ii) μ is Z+ -valued iff (i) holds with α = 0 and βk ∈ N for all k, (iii) the representation in (i) is unique apart from the order of terms. 2

or non-atomic



Proof: (i) By localization we may assume that μS < ∞. Since S is Borel we may let S ∈ B[0,1] , and by an obvious embedding we may even take S = [0, 1]. Define F (x) = μ[0, x] as in Theorem 2.14, and note that the possible atoms of μ correspond uniquely to the jumps of F . The latter may be listed in nonincreasing order as a sequence of pairs {x, ΔF (x)}, equivalent to the set of atom locations and sizes (σk , βk ) of μ. By Corollary 1.17, subtraction of the atoms yields a measure α, which is clearly diffuse. / N for some k. Assuming (ii) Let μ be Z+ -valued, and suppose that βk ∈ S = [0, 1], we may choose some open balls Bn ↓ {σk }. Then μBn ↓ βk , and so / Z+ for large n. The contradiction shows that βk ∈ N for all k. To see μBn ∈  that even α = 0, suppose that instead α = 0. Writing A = S \ k {σk }, we get μA > 0, and so we may choose an s ∈ supp (1A μ) and some balls Bn ↓ {s}. Then 0 < μ(A ∩ Bn ) ↓ 0, which yields another contradiction. (iii) Consider two such decompositions

Σ βk δσ = α + Σ βk δσ . k≤κ

μ = α+

k

k≤κ



 k

Pairing off the atoms in the two sums until no such terms remain, we are left with the equality α = α′. □

βn1 , βn2 , . . . ; σn1 , σn2 , . . .







→ β1 , β2 , . . . ; σ1 , σ2 , . . . ,



where the βk and σk are atom sizes and positions of μ with a ≤ βk < b and increasing σk . The measurability on the right now follows by Lemma 1.11. To see that the atomic component of μ is measurable, it remains to partition (0, ∞) into countably many intervals [a, b). The measurability of the diffuse part α now follows by subtraction. (ii)−(iv): This is clear from (i), since the mentioned classes and operations can be measurably expressed in terms of the coefficients. (v) Note that (μ ⊗ ν)(B × C) is measurable for any B, C ∈ S, and extend by a monotone-class argument. 2 Let S (n) denote the non-diagonal part of S n , consisting of all n -tuples (s1 , . . . , sn ) with distinct components sk . When μ ∈ NS∗ , we may define the factorial measure μ(n) as the restriction of the product measure μn to S (n) . The definition for general μ ∈ NS is more subtle. Write NˆS for the class of bounded measures in NS , and put μ(1) = μ. Lemma 2.20 (factorial measures) For any measure μ = Σ i∈I δσi in NˆS , these conditions are equivalent and define some measures μ(n) on S n : (i) (ii) (iii)

μ(n) =

Σ

i ∈ I (n)

δσi1 ,...,σin ,



(n)

μ(m+n) f = μ(m) (ds) μ − Σ δsi (dt) f (s, t),   i≤m 1 (n) μ (ds) f Σ δsi = Σ f (ν). Σ n ≥ 0 n! i≤n ν ≤μ

Proof: When μ is simple, (i) agrees with the elementary definition. Since also μ(1) = μ, we may take (i) as our definition. We also put μ(0) = 1 for consistency. Since the μ(n) are uniquely determined by both (ii) and (iii), it remains to show that each of those relations follows from (i). (ii) Write Ii = I \ {i1 , . . . , im } for any i = (i1 , . . . , im ) ∈ I (m) . Then by (i),

μ(m) (ds)



=

Σ i∈I

=

Σ i∈I

= =

(m)

μ−  

(m)

Σ Σ i∈I j ∈I (m)

Σ i∈I

(m+n)

Σ δs i≤m

(n)

i

μ−

Σ δσ k≤m

Σ δσ j ∈I

(n) i



(n)

j

(dt) f (s, t) (n)

ik



(dt) f σi1 , . . . , σik , t 

(dt) f σi1 , . . . , σik , t

i

f σi1 , . . . , σim ; σj1 , . . . , σjn









f σi1 , . . . , σim+n = μ(m+n) f.

(iii) Given an enumeration μ = Σi∈I δσi with I ⊂ N, put Iˆ(n) = {i ∈ I n ; i1 < · · · < in } for n > 0. Let Pn be the class of permutations of 1, . . . , n. Using the atomic representation of μ and the tetrahedral decomposition of I (n) , we get


μ(n) (ds) f



Σ δs i≤n

i






=

δσ Σ f kΣ i∈I ≤n

=

Σ Σ f Σ δσ i ∈ Iˆ π ∈ P k≤n

(n)

= n!



(n)

Σˆ

i ∈ I (n)

ik

n

f



Σ δσ k≤n

 π◦ik



ik

= n! Σ {f (ν); ν = n}. ν ≤μ

2

Now divide by n! and sum over n.
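For a finite point measure μ = Σi δσi (possibly with repeated atoms), definition (i) of Lemma 2.20 can be evaluated directly by enumerating n-tuples of distinct indices. The following sketch computes μ⁽²⁾ in this way; it is an added illustration with arbitrary atoms, not part of the original text.

```python
from collections import Counter
from itertools import permutations

sigma = ['a', 'b', 'b']                      # mu = delta_a + 2 delta_b, given via its index list I

def factorial_measure(points, n):
    # mu^(n) = sum over n-tuples of *distinct indices* of delta_{(sigma_i1, ..., sigma_in)}
    return Counter(tuple(points[i] for i in idx)
                   for idx in permutations(range(len(points)), n))

mu2 = factorial_measure(sigma, 2)
print(dict(mu2))
# {('a','b'): 2, ('b','a'): 2, ('b','b'): 2}: the diagonal pair ('a','a') is absent,
# and ('b','b') gets mass 2, coming from the two distinct copies of the atom b.
```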

We now explore the relationship between signed measures and functions of bounded variation. For any function F on R, we define the total variation on the interval [a, b] as  

 

Σ k F (tk ) − F (tk−1) , {t }

F ba = sup k

where the supremum extends over all finite partitions a = t0 < t1 < · · · < tn = b. The positive and negative variations of F are defined by similar expressions, though with |x| replaced by x± = (±x) ∨ 0, so that x = x+ − x− and |x| = x+ + x− . We also write Δba F = F (b) − F (a). We begin with a basic decomposition of functions of locally finite variation, corresponding to the Hahn decomposition in Theorem 2.8. Theorem 2.21 (Jordan decomposition) For any function F on R of locally finite variation, (i) F = F+ − F− for some non-decreasing functions F+ and F− , and F ts ≤ Δts F+ + Δts F− ,

s < t,

(ii) equality holds in (i) iff Δts F± agree with the positive and negative variations of F on (s, t]. Proof: For any s < t, we have (Δts F )+ = (Δts F )− + Δts F, |Δts F | = (Δts F )+ + (Δts F )− = 2 (Δts F )− + Δts F. Summing over the intervals in an arbitrary partition s = t0 < t1 < · · · < tn = t and taking the supremum of each side, we obtain Δts F+ = Δts F− + Δts F, F ts = 2 Δts F− + Δts F = Δts F+ + Δts F− , where F± (x) denote the positive and negative variations of F on [0, x] (apart from the sign when x < 0). Thus, F = F (0) + F+ − F− , and the stated relation



holds with equality. If also F = G_+ − G_− for some non-decreasing functions G_±, then (Δ_s^t F)_± ≤ Δ_s^t G_±, and so Δ_s^t F_± ≤ Δ_s^t G_±. Thus, ‖F‖_s^t ≤ Δ_s^t G_+ + Δ_s^t G_−, with equality iff Δ_s^t F_± = Δ_s^t G_±. □

We turn to another useful decomposition of finite-variation functions.

Theorem 2.22 (left and right continuity) For any function F on R of locally finite variation,
(i) F = F_r + F_l, where F_r is right-continuous with left-hand limits and F_l is left-continuous with right-hand limits,
(ii) if F is right-continuous, then so are the minimal components F_± in Theorem 2.21.

Proof: (i) By Proposition 2.21 we may take F to be non-decreasing. The right- and left-hand limits F^±(s) then exist at every point s, and we note that F^−(s) ≤ F(s) ≤ F^+(s). Further note that F has at most countably many jump discontinuities. For t > 0, we define

    F_l(t) = Σ_{s∈[0,t)} ( F^+(s) − F(s) ),    F_r(t) = F(t) − F_l(t).

When t ≤ 0 we need to take the negative of the corresponding sum on (t, 0]. It is easy to check that F_l is left-continuous while F_r is right-continuous, and that both functions are non-decreasing.

(ii) Let F be right-continuous at a point s. If ‖F‖_s^t → c > 0 as t ↓ s, we may choose t − s so small that ‖F‖_s^t < 4c/3. Next we may choose a partition s = t_0 < t_1 < · · · < t_n = t of [s, t], such that the corresponding F-increments δ_k satisfy Σ_k |δ_k| > 2c/3. By the right continuity of F at s, we may assume t_1 − s to be small enough that δ_1 = |F(t_1) − F(s)| < c/3. Then ‖F‖_{t_1}^t > c/3, and so

    4c/3 > ‖F‖_s^t = ‖F‖_s^{t_1} + ‖F‖_{t_1}^t > c + c/3 = 4c/3,

a contradiction. Hence c = 0. Assuming F_± to be minimal, we obtain

    Δ_s^t F_± ≤ ‖F‖_s^t → 0,    t ↓ s.    □
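The Jordan decomposition is easy to see numerically: accumulate the positive and negative parts of the increments of F separately. A minimal Python sketch (the test function F and the grid are arbitrary illustrative choices):

    import numpy as np

    t = np.linspace(0.0, 3.0, 3001)
    F = np.sin(2 * t) + 0.3 * t          # a function of finite variation on [0, 3]

    dF = np.diff(F)
    F_plus = np.concatenate([[0.0], np.cumsum(np.maximum(dF, 0.0))])    # positive variation on [0, x]
    F_minus = np.concatenate([[0.0], np.cumsum(np.maximum(-dF, 0.0))])  # negative variation on [0, x]

    print(np.allclose(F, F[0] + F_plus - F_minus))          # F = F(0) + F_+ - F_-
    total_var = np.sum(np.abs(dF))                          # variation of F along this partition
    print(np.isclose(total_var, F_plus[-1] + F_minus[-1]))  # equality as in Theorem 2.21 (i)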

Justified by the last theorem, we may choose our finite-variation functions F to be right-continuous, which enables us to prove a basic correspondence with locally finite signed measures. We also consider the jump part and atomic decompositions F = F^c + F^d and ν = ν^d + ν^a, where for functions F or signed measures ν on R_+ we define

    F_t^d = Σ_{s≤t} ΔF_s,    ν^a = Σ_{t≥0} ν{t} δ_t.


Theorem 2.23 (signed measures and finite-variation functions) Let F be a right-continuous function on R of locally finite variation. Then
(i) there exists a unique signed measure ν on R, such that

    ν(s, t] = Δ_s^t F,    s < t,

(ii) the Hahn decomposition ν = ν_+ − ν_− and Jordan decomposition F = F_+ − F_− into minimal components are related by

    ν_±(s, t] ≡ Δ_s^t F_±,    s < t,

(iii) the atomic decomposition ν = ν^d + ν^a and jump part decomposition F = F^c + F^d are related by

    ν^d(s, t] ≡ Δ_s^t F^c,    ν^a(s, t] ≡ Δ_s^t F^d.
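Before turning to the proof, here is a small discrete sketch in Python of the correspondence in (i), for a pure-jump, right-continuous F (the jump times and sizes are arbitrary illustrative choices); the signed measure ν simply places the jump of F at each jump location:

    import numpy as np

    jump_times = np.array([0.5, 1.0, 2.2, 3.0])
    jump_sizes = np.array([1.0, -2.5, 0.7, 1.2])

    def F(x):                     # right-continuous, of locally finite variation, F(0) = 0
        return float(np.sum(jump_sizes[jump_times <= x]))

    def nu(s, t):                 # nu(s, t], a sum of signed atoms in (s, t]
        return float(np.sum(jump_sizes[(jump_times > s) & (jump_times <= t)]))

    for s, t in [(0.0, 1.5), (0.7, 3.0), (1.0, 2.2)]:
        print(np.isclose(nu(s, t), F(t) - F(s)))   # nu(s, t] = Delta_s^t F in each case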

Proof: (i) The positive and negative variations F_± are right-continuous by Proposition 2.22. Hence, Proposition 2.14 yields some locally finite measures μ_± on R with μ_±(s, t] ≡ Δ_s^t F_±, and we may take ν = μ_+ − μ_−.

(ii) Choose an A ∈ S with ν_+ A^c = ν_− A = 0. For any B ∈ B, we get

    μ_+ B ≥ μ_+(B ∩ A) ≥ ν(B ∩ A) = ν_+(B ∩ A) = ν_+ B,

which shows that μ_+ ≥ ν_+. Then also μ_− ≥ ν_−. If the equality fails on some interval (s, t], then

    ‖F‖_s^t = μ_+(s, t] + μ_−(s, t] > ν_+(s, t] + ν_−(s, t],

contradicting Proposition 2.21. Hence, μ_± = ν_±.

(iii) This is elementary, given the atomic decomposition in Theorem 2.18. □

A function F: R → R is said to be absolutely continuous, if for any a < b and ε > 0 there exists a δ > 0, such that for any finite collection of disjoint intervals (a_k, b_k] ⊂ (a, b] with Σ_k |b_k − a_k| < δ, we have Σ_k |F(b_k) − F(a_k)| < ε. In particular, every absolutely continuous function is continuous with locally finite variation. Given a function F of locally finite variation, we say that F is singular, if for any a < b and ε > 0 there exists a finite set of disjoint intervals (a_k, b_k] ⊂ (a, b], such that

    Σ_k |b_k − a_k| < ε,    ‖F‖_a^b < Σ_k |F(b_k) − F(a_k)| + ε.

A locally finite, signed measure ν on R is said to be absolutely continuous or singular, if the components ν_± in the associated Hahn decomposition satisfy ν_± ≪ λ or ν_± ⊥ λ, respectively. We proceed to relate the notions of absolute continuity and singularity for functions and measures.


Theorem 2.24 (absolutely continuous and singular functions) Let F be a right-continuous function of locally finite variation on R, and let ν be the signed measure on R with ν(s, t] ≡ Δ_s^t F. Then
(i) F is absolutely continuous iff ν ≪ λ,
(ii) F is singular iff ν ⊥ λ.

Proof: If F is absolutely continuous or singular, then the corresponding property holds for the total variation function ‖F‖_a^x with arbitrary a, and hence also for the minimal components F_± in Proposition 2.23. Thus, we may take F to be non-decreasing, so that ν is a positive and locally finite measure on R.

First let F be absolutely continuous. If ν fails to be absolutely continuous with respect to λ, there exists a bounded interval I = (a, b) with subset A ∈ B, such that λA = 0 but νA > 0. Taking ε = νA/2, choose a corresponding δ > 0, as in the definition of absolute continuity. Since A is measurable with outer Lebesgue measure 0, we may next choose an open set G with A ⊂ G ⊂ I such that λG < δ. But then νA ≤ νG < ε = νA/2, a contradiction. This shows that ν ≪ λ.

Next let F be singular, and fix any bounded interval I = (a, b]. Given any ε > 0, there exist some Borel sets A_1, A_2, . . . ⊂ I, such that λA_n < ε 2^{−n} and νA_n → νI. Then B = ∪_n A_n satisfies λB < ε and νB = νI. Next we may choose some Borel sets B_n ⊂ I with λB_n → 0 and νB_n = νI. Then C = ∩_n B_n satisfies λC = 0 and νC = νI, which shows that ν ⊥ λ on I.

Conversely, let ν ≪ λ, so that ν = f · λ for some locally integrable function f ≥ 0. Fix any bounded interval I, put A_n = {x ∈ I; f(x) > n}, and let ε > 0 be arbitrary. Since νA_n → 0 by Lemma 1.15, we may choose n so large that νA_n < ε/2. Put δ = ε/2n. For any Borel set B ⊂ I with λB < δ, we obtain

    νB = ν(B ∩ A_n) + ν(B ∩ A_n^c) ≤ νA_n + n λB < ε/2 + n δ = ε.

In particular, this applies to any finite union B of intervals (a_k, b_k] ⊂ I, and so F is absolutely continuous.

Finally, let ν ⊥ λ. Fix any finite interval I = (a, b], and choose a Borel set A ⊂ I such that λA = 0 and νA = νI. For any ε > 0 there exists an open set G ⊃ A with λG < ε. Letting (a_n, b_n) denote the connected components of G and writing I_n = (a_n, b_n], we get Σ_n |I_n| < ε and Σ_n ν(I ∩ I_n) = νI. This shows that F is singular. □

We conclude with yet another basic extension theorem (further basic measure extensions appear in Chapters 3 and 8). Here the underlying space S is taken to be a locally compact, second countable, Hausdorff topological space (abbreviated as lcscH). Let G, F, K be the classes of open, closed, and compact sets in S, and put Ĝ = {G ∈ G; Ḡ ∈ K}. Let Ĉ_+ = Ĉ_+(S) be the class of continuous functions f: S → R_+ with compact support, defined


as the closure of the set {x ∈ S; f(x) > 0}. By U ≺ f ≺ V we mean that f ∈ Ĉ_+ with 0 ≤ f ≤ 1, such that

    f = 1 on U,    supp f ⊂ V°.

A positive linear functional on Ĉ_+ is defined as a mapping μ: Ĉ_+ → R_+, such that

    μ(f + g) = μf + μg,    f, g ∈ Ĉ_+.

This clearly implies the homogeneity μ(cf) = c μf for any f ∈ Ĉ_+ and c ∈ R_+. A measure μ on the Borel σ-field S = B_S is said to be locally finite (also called a Radon measure), if μK < ∞ for every K ∈ K. We show how any positive linear functional can be extended to a measure.

Theorem 2.25 (Riesz representation) Let μ be a positive linear functional on Ĉ_+(S), where S is lcscH. Then μ extends uniquely to a locally finite measure on S.

Several lemmas are needed for the proof, beginning with a simple topological fact.

Lemma 2.26 (partition of unity) For a compact set K ⊂ S covered by the open sets G_1, . . . , G_n, we may choose some functions f_1, . . . , f_n ∈ Ĉ_+(S) with f_k ≺ G_k, such that Σ_k f_k = 1 on K.

Proof: For any x ∈ K, we may choose some k ≤ n and V ∈ Ĝ with x ∈ V and V̄ ⊂ G_k. By compactness, K is covered by finitely many such sets V_1, . . . , V_m. For every k ≤ n, let U_k be the union of all sets V_j with V̄_j ⊂ G_k. Then Ū_k ⊂ G_k, and so we may choose g_1, . . . , g_n ∈ Ĉ_+ with U_k ≺ g_k ≺ G_k. Define

    f_k = g_k (1 − g_1) · · · (1 − g_{k−1}),    k = 1, . . . , n.

Then f_k ≺ G_k for all k, and by induction

    f_1 + · · · + f_n = 1 − (1 − g_1) · · · (1 − g_n).

It remains to note that Π_k (1 − g_k) = 0 on K, since K ⊂ ∪_k U_k. □
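The recursive construction in the proof is easy to carry out numerically. A minimal Python sketch (the compact set, the covering intervals, and the bump functions below are illustrative choices): it builds f_k = g_k(1 − g_1) · · · (1 − g_{k−1}) and checks that the f_k sum to 1 on K.

    import numpy as np

    # K = [0, 1], covered by two open intervals G_1, G_2
    G = [(-0.2, 0.6), (0.4, 1.2)]

    def bump(a, b, margin=0.05):
        # continuous g with 0 <= g <= 1, g = 1 on [a+margin, b-margin], g = 0 outside (a, b)
        def g(x):
            return np.minimum(np.clip((x - a) / margin, 0, 1),
                              np.clip((b - x) / margin, 0, 1))
        return g

    g = [bump(a, b) for a, b in G]
    x = np.linspace(0, 1, 1001)            # grid of points of K

    f, prod = [], np.ones_like(x)
    for gk in g:                           # f_k = g_k (1 - g_1) ... (1 - g_{k-1})
        f.append(gk(x) * prod)
        prod *= 1 - gk(x)

    print(np.allclose(sum(f), 1.0))        # True: sum_k f_k = 1 on K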

By an inner content on an lcscH space S we mean a non-decreasing function μ: G → R̄_+, finite on Ĝ, such that μ is finitely additive and countably sub-additive, and also satisfies the inner continuity

    μG = sup{ μU; U ∈ Ĝ, Ū ⊂ G },    G ∈ G.    (7)

Lemma 2.27 (inner approximation) For a positive linear functional μ on Ĉ_+(S), an inner content ν on S is given by

    νG = sup{ μf; f ≺ G },    G ∈ G.


Proof: Note that ν is non-decreasing with ν∅ = 0, and that νG < ∞ for bounded G. Moreover, ν is inner continuous in the sense of (7).

To see that ν is countably sub-additive, fix any G_1, G_2, . . . ∈ G, and let f ≺ ∪_k G_k. The compactness implies f ≺ ∪_{k≤n} G_k for a finite n, and Lemma 2.26 yields some functions g_k ≺ G_k, such that Σ_k g_k = 1 on supp f. The products f_k = g_k f satisfy f_k ≺ G_k and Σ_k f_k = f, and so

    μf = Σ_{k≤n} μf_k ≤ Σ_{k≤n} νG_k ≤ Σ_{k≥1} νG_k.

Since f ≺ ∪_k G_k was arbitrary, we obtain ν(∪_k G_k) ≤ Σ_k νG_k, as required.

To see that ν is finitely additive, fix any disjoint sets G, G′ ∈ G. If f ≺ G and f′ ≺ G′, then f + f′ ≺ G ∪ G′, and so

    μf + μf′ = μ(f + f′) ≤ ν(G ∪ G′) ≤ νG + νG′.

Taking the supremum over all f and f′ gives νG + νG′ = ν(G ∪ G′), as required. □

An outer measure μ on S is said to be regular, if it is finitely additive on G and satisfies the outer and inner regularity

    μA = inf{ μG; G ∈ G, G ⊃ A },    A ⊂ S,    (8)
    μG = sup{ μK; K ∈ K, K ⊂ G },    G ∈ G.    (9)

Lemma 2.28 (outer approximation) Any inner content μ on S can be extended to a regular outer measure.

Proof: We may define the extension by (8), since the right-hand side equals μA when A ∈ G. By the finite additivity on G, we have 2 μ∅ = μ∅ < ∞, which implies μ∅ = 0. To prove the countable sub-additivity, fix any A_1, A_2, . . . ⊂ S. For any ε > 0, we may choose some G_1, G_2, . . . ∈ G with G_n ⊃ A_n and μG_n ≤ μA_n + ε 2^{−n}. Since μ is sub-additive on G, we get

    μ(∪_n A_n) ≤ μ(∪_n G_n) ≤ Σ_n μG_n ≤ Σ_n μA_n + ε.

The desired relation follows since ε was arbitrary. Thus, the extension is an outer measure on S. Finally, the inner regularity in (9) follows from (7) and the monotonicity of μ. □

Lemma 2.29 (measurability) If μ is a regular outer measure on S, then every Borel set in S is μ-measurable.


Proof: Fix any F ∈ F and A ⊂ G ∈ G. By the inner regularity in (9), we may choose G_1, G_2, . . . ∈ G with Ḡ_n ⊂ G \ F and μG_n → μ(G \ F). Since μ is non-decreasing and finitely additive on G, we get

    μG ≥ μ(G \ ∂G_n)
       = μG_n + μ(G \ Ḡ_n)
       ≥ μG_n + μ(G ∩ F)
       → μ(G \ F) + μ(G ∩ F)
       ≥ μ(A \ F) + μ(A ∩ F).

The outer regularity in (8) gives

    μA ≥ μ(A \ F) + μ(A ∩ F),    F ∈ F, A ⊂ S.

Hence, every closed set is measurable, and the measurability extends to σ(F) = B_S = S by Theorem 2.1. □

Proof of Theorem 2.25: Construct an inner content ν as in Lemma 2.27, and conclude from Lemma 2.28 that ν admits an extension to a regular outer measure on S. By Theorem 2.1 and Lemma 2.29, the restriction of the latter to S = B_S is a Radon measure on S, here still denoted by ν.

To see that μ = ν on Ĉ_+, fix any f ∈ Ĉ_+. For any n ∈ N and k ∈ Z_+, let

    f_k^n(x) = ( nf(x) − k )_+ ∧ 1,    G_k^n = {nf > k} = {f_k^n > 0}.

Noting that Ḡ_{k+1}^n ⊂ {f_k^n = 1} and using the definition of ν and the outer regularity in (8), we get for appropriate k

    νf_{k+1}^n ≤ νG_{k+1}^n ≤ μf_k^n ≤ νḠ_k^n ≤ νf_{k−1}^n.

Writing G_0 = G_0^n = {f > 0} and noting that nf = Σ_k f_k^n, we obtain

    n νf − νG_0 ≤ n μf ≤ n νf + νḠ_0.

Here νḠ_0 < ∞ since G_0 is bounded. Dividing by n and letting n → ∞ gives μf = νf.

To prove the asserted uniqueness, let μ and ν be locally finite measures on S with μf = νf for all f ∈ Ĉ_+. By an inner approximation, we have μG = νG for every G ∈ G, and so a monotone-class argument yields μ = ν. □
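As a sanity check on the recipe νG = sup{μf; f ≺ G} of Lemma 2.27, the following Python sketch (an illustration only; the weight w and the interval G are arbitrary choices) takes the positive linear functional μf = ∫ f(x) w(x) dx on Ĉ_+(R) and approximates νG by trapezoidal bump functions f ≺ G; the values increase to ∫_G w dλ, as the representation theorem predicts.

    import numpy as np

    w = lambda t: 1.0 + np.sin(t) ** 2                # weight defining mu f = ∫ f w dx
    x, dx = np.linspace(-1.0, 3.0, 400001, retstep=True)
    mu = lambda f: float(np.sum(f(x) * w(x)) * dx)    # Riemann-sum approximation of the functional

    a, b = 0.0, 2.0                                   # open set G = (a, b)

    def bump(eps):
        # continuous f with 0 <= f <= 1, f = 1 on [a+eps, b-eps], vanishing outside G, i.e. f ≺ G
        return lambda t: np.clip(np.minimum((t - a) / eps, (b - t) / eps), 0.0, 1.0)

    for eps in (0.5, 0.1, 0.01, 0.001):
        print(f"eps={eps:<6} mu(f) = {mu(bump(eps)):.6f}")   # increases towards nu(G)

    inside = (x > a) & (x < b)
    print("integral of w over G:", float(np.sum(w(x[inside])) * dx))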


Exercises

1. Show that any countably additive set function μ ≥ 0 on a field F with μ∅ = 0 extends to a measure on σ(F). Further show that the extension is unique whenever μ is bounded.

2. Construct d-dimensional Lebesgue measure λ_d directly, by the method of Theorem 2.2. Then show that λ_d = λ ⊗ · · · ⊗ λ (d factors).

3. Derive the existence of d-dimensional Lebesgue measure from Riesz' representation theorem, using basic properties of the Riemann integral.

4. Let λ denote Lebesgue measure on R_+, and fix any p > 0. Show that the class of step functions with bounded support and finitely many jumps is dense in L^p(λ). Generalize to R_+^d.

5. Show that if μ_1 = f_1 · μ and μ_2 = f_2 · μ, then μ_1 ∨ μ_2 = (f_1 ∨ f_2) · μ and μ_1 ∧ μ_2 = (f_1 ∧ f_2) · μ. In particular, we may take μ = μ_1 + μ_2. Extend the result to sequences μ_1, μ_2, . . . .

6. Given any family μ_i, i ∈ I, of σ-finite measures on a measurable space S, prove the existence of a largest measure μ = ⋀_i μ_i, such that μ ≤ μ_i for all i ∈ I. Further show that if the μ_i are bounded by some σ-finite measure ν, there exists a smallest measure μ̂ = ⋁_i μ_i, such that μ_i ≤ μ̂ for all i. (Hint: Use Zorn's lemma.)

7. For any bounded, signed measure ν on (S, S), prove the existence of a smallest measure |ν| such that |νA| ≤ |ν|A for all A ∈ S. Further show that |ν| = ν_+ + ν_−, where ν_± are the components in the Hahn decomposition of ν. Finally, for any bounded, measurable function f on S, show that |νf| ≤ |ν||f|.

8. Extend the last result to complex-valued measures χ = μ + iν, where μ and ν are bounded, signed measures on (S, S). Introducing the complex-valued Radon–Nikodym density f = dχ/d(|μ| + |ν|), show that |χ| = |f| · (|μ| + |ν|).

Chapter 3

Kernels, Disintegration, and Invariance

Kernel criteria and properties, composition and product, disintegration of measures and kernels, partial and iterated disintegration, Haar measure, factorization and modularity, orbit measures, invariant measures and kernels, invariant disintegration, asymptotic and local invariance

So far we have mostly dealt with standard measure theory, typically covered by textbooks in real analysis, though here presented with a slightly different emphasis to suit our subsequent needs. Now we come to some special topics of fundamental importance in probability theory, which are rarely treated even in more ambitious texts. The hurried reader may again feel tempted to bypass this material and move on to the probabilistic parts of the book. This may be fine as far as the classical probability is concerned. However, already when coming to conditioning and compensation, and to discussions of the general Markov property and continuous-time chains, the serious student may feel the need to return for reference and to catch up with some basic ideas. The need may be even stronger as he/she moves on to the later and more advanced chapters.

Our plans are to begin with a discussion of kernels, which are indispensable in the contexts of conditioning, Markov processes, random measures, and many other areas. They often arise by disintegration of suitable measures on a product space. A second main theme is the theory of invariant measures and disintegrations, of importance in the context of stationary distributions and associated Palm measures. We conclude with a discussion of locally and asymptotically invariant measures, needed in connection with Poisson and Cox convergence of particle systems.

Given some measurable spaces (S, S) and (T, T), we define a kernel μ: S → T as a measurable function μ from S to the space of measures on T. In other words, μ is a function of two variables s ∈ S and B ∈ T, such that μ(s, B) is S-measurable in s for fixed B and a measure in B for fixed s. For a simple example, we note that the set of Dirac measures δ_s, s ∈ S, can be regarded as a kernel δ: S → S.

Just as every measure μ on S can be identified with a linear functional f → μf on S_+, we may identify a kernel μ: S → T with a linear operator A: T_+ → S_+, given by

    Af(s) = μ_s f,    μ_s B = A 1_B(s),    s ∈ S,    (1)


for any f ∈ T_+ and B ∈ T. When μ = δ, the associated operator is simply the identity map I on S_+. In general, we often denote a kernel and its associated operator by the same letter. When T is a localized Borel space, we usually require kernels μ: S → T to be locally finite, in the sense that μ_s is locally finite for every s. The following result characterizes the locally finite kernels from S to T.

Lemma 3.1 (kernel criteria) For any Borel spaces S, T, consider a function μ: S → M_T with associated operator A as in (1). Then for any dissecting semi-ring I ⊂ T̂, these conditions are equivalent:
(i) μ is a kernel from S to T,
(ii) μ is a measurable function from S to M_T,
(iii) μ_s I is S-measurable for every I ∈ I,
(iv) the operator A maps T_+ into S_+.
Furthermore,
(v) for any function f: S → T, the mapping s → δ_{f(s)} is a kernel from S to T iff f is measurable,
(vi) the identity map on M_S is a kernel from M_S to S.

Proof, (i) ⇔ (ii): Since M_T is generated by all projections π_B: μ → μB with B ∈ T, we see that (ii) ⇒ (i). The converse holds by Lemma 1.4.

(i) ⇔ (iii): Since trivially (i) ⇒ (iii), it remains to show that (iii) ⇒ (i). Then note that I is a π-system, whereas the sets B ∈ T̂ with measurable μ_s B form a λ-system containing I. The assertion then follows by a routine application of the monotone-class Theorem 1.1.

(i) ⇔ (iv): For f = 1_B, we note that Af(s) = μ_s B. Thus, (i) is simply the property (iv) restricted to indicator functions. The extension to general f ∈ T_+ follows immediately by linearity and monotone convergence.

(v) This holds since {s; δ_{f(s)} B = 1} = f^{−1}B for all B ∈ T.

(vi) The identity map on M_S trivially satisfies (ii). □
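For finite spaces, a kernel is just a matrix whose rows are the measures μ_s, and the associated operator of (1) is matrix–vector multiplication. A small Python sketch (the matrix below is a made-up example):

    import numpy as np

    # kernel mu: S -> T with S = T = {0, 1, 2}; row s is the measure mu_s on T
    mu = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.3, 0.3, 0.4]])

    def A(f):
        # associated operator (A f)(s) = mu_s f = sum_t f(t) mu_s({t}), as in (1)
        return mu @ f

    f = np.array([1.0, 5.0, 2.0])            # a function in T_+
    print(A(f))                              # the function s -> mu_s f on S

    B = np.array([0.0, 1.0, 1.0])            # indicator of B = {1, 2}
    print(A(B), mu[:, 1:].sum(axis=1))       # A 1_B(s) = mu_s(B) recovers the kernel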

A kernel μ : S → T is said to be finite if μs T < ∞ for all s ∈ S, s -finite if it is a countable sum of finite kernels, and σ-finite if it satisfies μs fs < ∞ for some measurable function f > 0 on S × T , where fs = f (s, ·). It is called a probability kernel if μs T ≡ 1 and a sub-probability kernel when μs T ≤ 1. We list some basic properties of s - and σ -finite kernels. Lemma 3.2 (kernel properties) Let the kernel μ : S → T be s-finite. Then (i) for any measurable function f ≥ 0 on S ⊗ T , the function μs fs is Smeasurable, and νs = fs · μs is again an s-finite kernel from S to T , (ii) for any measurable function f : S × T → U , the function νs = μs ◦ fs−1 is an s-finite kernel from S to U ,


(iii) μ = Σ_n μ_n for some sub-probability kernels μ_1, μ_2, . . . : S → T,
(iv) μ is σ-finite iff μ = Σ_n μ_n for some finite, mutually singular kernels μ_1, μ_2, . . . : S → T,

(v) μ is σ-finite iff there exists a measurable function f > 0 on S × T with μs fs = 1{μs = 0} for all s ∈ S, (vi) if μ is uniformly σ-finite, it is also σ-finite, and (v) holds with f (s, t) = h(μs , t) for some measurable function h > 0 on MS × T , (vii) if μ is a probability kernel and T is Borel, there exists a measurable function f : S × [0, 1] → T with μs = λ ◦ fs−1 for all s ∈ S. A function f as in (v) is called a normalizing function of μ. Proof: (i) Since μ is s -finite and all measurability and measure properties are preserved by countable summations, we may assume that μ is finite. The measurability of μs fs follows from the kernel property when f = 1B for some measurable rectangle B ⊂ S × T . It extends by Theorem 1.1 to arbitrary B ∈ S × T , and then by linearity and monotone convergence to any f ∈ (S ⊗ T )+ . Fixing f and applying the stated measurability to the functions 1B f for arbitrary B ∈ T , we see that fs · μs is again measurable, hence a kernel from S to T . The s -finiteness is clear if we write f = Σn fn for some bounded f1 , f2 , . . . ∈ (S ⊗ T )+ . (ii) Since μ is s -finite and the measurability, measure property, and inverse mapping property are all preserved by countable summations, we may again take μ to be finite. The measurability of f yields f −1 B ∈ S ⊗ T for every B ∈ U . Since fs−1 B = {t ∈ T ; (s, t) ∈ f −1 B} = (f −1 B)s , we see from (i) that (μs ◦ fs−1 )B = μs (f −1 B)s is measurable, which means that νs = μs ◦ fs−1 is a kernel. Since νs = μs < ∞ for all s ∈ S, the kernel ν is again finite. (iii) Since μ is s -finite and N2 is countable, we may assume that μ is finite. Putting ks = [ μs ] + 1, we note that μs = Σn ks−1 1{n ≤ ks } μs , where each term is clearly a sub-probability kernel. (iv) First let μ be σ-finite, so that μs fs < ∞ for some measurable function f > 0 on S × T . Then (i) shows that νs = fs · μs is a finite kernel from S to T , and so μs = gs · νs with g = 1/f . Putting Bn = 1{n − 1 < g ≤ n} for n ∈ N, we get μs = Σn (1Bn g)s · νs , where each term is a finite kernel from S to T . The terms are mutually singular since B1 , B2 , . . . are disjoint. Conversely, let μ = Σn μn for some mutually singular, finite kernels μ1 , μ2 , . . . . By singularity we may choose a measurable partition B1 , B2 , . . . of S × T , such that μn is supported by Bn for each n. Since the μn are finite, we may further choose some measurable functions fn > 0 on S × T , such that μn fn ≤ 2−n for all n. Then f = Σn 1Bn fn > 0, and

    μf = Σ_{m,n} μ_m(1_{B_n} f_n) = Σ_n μ_n f_n ≤ Σ_n 2^{−n} = 1,

which shows that μ is σ-finite.

(v) If μ is σ-finite, then μs gs < ∞ for some measurable function g > 0 on S × T . Putting f (s, t) = g(s, t)/μs gs with 0/0 = 1, we note that μs fs = 1{μs = 0}. The converse assertion is clear from the definitions. (vi) The first assertion holds by (iv). Now choose a measurable partition B1 , B2 , . . . of T with μs Bn < ∞ for all s and n, and define 

    g(s, t) = Σ_n 2^{−n} ( μ_s B_n ∨ 1 )^{−1} 1_{B_n}(t),    s ∈ S, t ∈ T.

The function f (s, t) = g(s, t)/μs gs has the desired properties. (vii) Since T is Borel, we may assume that T ∈ B[0,1] . The function



    f(s, r) = inf{ x ∈ [0, 1]; μ_s[0, x] ≥ r },    s ∈ S, r ∈ [0, 1],

λ ◦ fs−1 [0, x] = λ r ∈ [0, 1]; f (s, r) ≤ x



= μs [0, x], fs−1

and so λ ◦ = μs on [0, 1] for all s by Lemma 4.3 below. In particular, / T }, we may fs (r) ∈ T a.e. λ for every s ∈ S. On the exceptional set {fs (r) ∈ 2 redefine fs (r) = t0 for any fixed t0 ∈ T . For any s -finite kernels μ : S → T and ν : S × T → U , we define their composition and product as the kernels μ ⊗ ν : S → T × U and μν : S → U , given by (2) (μ ⊗ ν)s f = μs (dt) νs,t (du) f (t, u),

(μν)s f =



μs (dt)

νs,t (du) f (u)

= (μ ⊗ ν)s (1T ⊗ f ). Note that μν equals the projection of μ ⊗ ν onto U . Define Aμ f (s) = μs f , and write μ ˆ = δ ⊗ μ. Lemma 3.3 (composition and product) For any s-finite kernels μ : S → T and ν : S × T → U , (i) μ ⊗ ν and μν are s-finite kernels from S to T × U and U , respectively, (ii) μ ⊗ ν is σ-finite whenever this holds for μ, ν, (iii) the kernel operations μ ⊗ ν and μν are associative, (iv) μˆ ν = μ ⊗ ν and μ ˆνˆ = (μ ⊗ ν)ˆ, (v) for any kernels μ : S → T and ν : T → U , we have Aμν = Aμ Aν .

3. Kernels, Disintegration, and Invariance

59

Proof: (i) By Lemma 3.2 (i) applied twice, the inner integral in (2) is S ⊗ T measurable and the outer integral is S-measurable. The countable additivity holds by repeated monotone convergence. This proves the kernel property of μ ⊗ ν, and the result for μν is an immediate consequence. (ii) If μ and ν are σ-finite, we may choose some measurable functions f > 0 on S × T and g > 0 on S × T × U with μs fs ≤ 1 and νs,t gs,t ≤ 1 for all s and t. Then (2) yields (μ ⊗ ν)s (f g)s ≤ 1. (iii) Consider the kernels μ : S → T , ν : S × T → U , and ρ : S × T × U → V . Letting f ∈ (T ⊗ U ⊗ V)+ be arbitrary and using (2) repeatedly, we get



μ ⊗ (ν ⊗ ρ) s f =





μs (dt) μs (dt)

=



=

(ν ⊗ ρ)s,t (du dv) f (t, u, v)





νs,t (du)

(μ ⊗ ν)s (dt du)



ρs,t,u (dv) f (t, u, v) ρs,t,u (dv) f (t, u, v)



= (μ ⊗ ν) ⊗ ρ s f, which shows that μ ⊗ (ν ⊗ ρ) = (μ ⊗ ν) ⊗ ρ. Projecting onto V yields μ(νρ) = (μν)ρ. (iv) Consider any kernels μ : S → T and ν : S × T → U , and note that μˆ ν and μ ⊗ ν are both kernels from S to T × U . For any f ∈ (T ⊗ U )+ , we have

(μˆ ν )s f = =





μs (dt)



μs (dt)

δs,t (ds dt )



νs ,t (du) f (t , u)

νs,t (du) f (t, u)

= (μ ⊗ ν)s f, which shows that μˆ ν = μ ⊗ ν. Hence, by (ii) μ ˆνˆ = (δ ⊗ μ) ⊗ ν = δ ⊗ (μ ⊗ ν) = (μ ⊗ ν)ˆ. (v) For any f ∈ U+ , we have



Aμν f (s) = (μν)s f =

=

=



μs (dt)

(μν)s (du) f (u) νt (du) f (u)

μs (dt) Aν f (t)

= Aμ (Aν f )(s) = (Aμ Aν )f (s), which implies Aμν f = (Aμ Aν )f , and hence Aμν = Aμ Aν .

2

60

Foundations of Modern Probability

We turn to the reverse problem of representing a given measure ρ on S × T in the form ν ⊗ μ, for some measure ν on S and kernel μ : S → T . The resulting disintegration may be regarded as an infinitesimal decomposition of ρ into measures δs ⊗μs . Any σ-finite measure ν ∼ ρ(·×T ) is called a supporting measure of ρ, and we refer to μ as the associated disintegration kernel. Theorem 3.4 (disintegration) Let ρ be a σ-finite measure on S × T , where T is Borel. Then (i) ρ = ν ⊗ μ for a σ-finite measure ν ∼ ρ(· × T ) ≡ ρˆS and a σ-finite kernel μ: S → T, (ii) the μs are ν-a.e. unique up to normalizations, and they are a.e. bounded iff ρˆS is σ-finite, (iii) when ρˆ is σ-finite and ν = ρˆS , we may choose the μs to be probability measures on T . Proof: (i) If ρ is σ-finite and = 0, we may choose g ∈ (S × T )+ such that ρ = g · ρ˜ for some probability measure ρ˜ on S × T . If ρ˜ = ν ⊗ μ ˜ for some measure ν on S and kernel μ ˜ : S × T , then Lemma 3.2 (i) shows that μ = g · μ ˜ is a σ-finite kernel from S to T , and so for any f ≥ 0,



˜) f (ν ⊗ μ)f = μ ⊗ (g · μ = (μ ⊗ μ ˜)(f g) = ρ˜(f g) = ρf, which shows that ρ = ν ⊗ μ. We may then assume that ρ is a probability measure on S × T . Since T is Borel, we may further take T = R. Choosing ν = ρˆ and putting νr = ρ(· × (−∞, r]), we get νr ≤ ν for all r ∈ R, and so Theorem 2.10 yields νr = fr · ν for some measurable functions fr : S → [0, 1]. Here fr is non-decreasing in r ∈ Q with limits 0 and 1 at ±∞, outside a fixed ν -null set N ⊂ S. Redefining fr = 0 on N , we can make those properties hold identically. For the right-continuous versions Fs,t = inf r>t fr with t ∈ R, we see by monotone convergence that Fr = fr a.e. for all r ∈ Q. Now Proposition 2.14 yields some probability measures μ(s, ·) on R with distribution functions F (s, t). Here μ(s, B) is measurable in s ∈ S for each B ∈ T by a monotone-class argument, which means that μ is a kernel from S to T . Furthermore, we have for any A ∈ S and r ∈ Q 



ρ A × (−∞, r] =



A

= A

ν(ds) Fr (s) ν(ds) μs (−∞, r],

which extends by a monotone-class argument and monotone convergence to the general disintegration formula ρf = (ν ⊗ μ)f , for arbitrary f ∈ (S × T )+ . (ii) Suppose that ρ = ν ⊗ μ = ν  ⊗ μ with ν ∼ ν  ∼ ρ(· × T ). Then

3. Kernels, Disintegration, and Invariance

61

g · ρ = ν ⊗ (g · μ) = ν  ⊗ (g · μ ),

g ≥ 0,

which allows us to consider only bounded measures ρ. Since also (h · ν) ⊗ μ = ν  ⊗ (h · μ ) for any h ∈ S+ , we may further assume that ν = ν  . Then



A



ν(ds) μs B − μs B = 0,

A ∈ S, B ∈ T ,

which implies μs B = μs B a.e. ν. Since T is Borel, a monotone-class argument 2 yields μs = μs a.e. ν. The following partial disintegration will be needed in Chapter 31. Corollary 3.5 (partial disintegration) Let ν, ρ be σ-finite measures on S and S ×T , respectively, where T is Borel. Then there exists an a.e. unique maximal kernel μ : S → T, ν ⊗ μ ≤ ρ. Here the maximality of μ means that, whenever μ is a kernel satisfying ν ⊗ μ ≤ ρ, we have μs ≤ μs for s ∈ S a.e. ν. Proof: Since ρ is σ-finite, we may choose a σ-finite measure γ on S with γ ∼ ρ(· × T ). Consider its Lebesgue decomposition γ = γa + γs with respect to ν, as in Theorem 2.10, so that γa  ν and γs ⊥ ν. Choose A ∈ S with νAc = γs A = 0, and let ρ denote the restriction of ρ to A × T . Then ρ (· × T )  ν, and so Theorem 3.4 yields a σ-finite kernel μ : S → T with ν ⊗ μ = ρ ≤ ρ. Now consider any kernel μ : S → T with ν ⊗ μ ≤ ρ. Since νAc = 0, we have ν ⊗ μ ≤ ρ = ν ⊗ μ. Since μ and μ are σ-finite, there exists a measurable function g > 0 on S × T , such that the kernels μ ˜ = g · μ and μ ˜ = g · μ satisfy    μs ≤ 1. Then ν ⊗ μ ˜ ≤ ν⊗μ ˜, and so μ ˜s B ≤ μ ˜s B, s ∈ S a.e. ν, for ˜ μs ∨ ˜ every B ∈ S. This extends by a monotone-class argument to μs ≤ μs a.e. ν, which shows that μ is maximal. In particular, μ is a.e. unique. 2 We often need to disintegrate kernels. Here the previous construction still applies, except that now we need a product-measurable version of the Radon– Nikodym theorem, and we must check that the required measurability is preserved throughout the construction. Since the simplest approach is based on martingale methods, we postpone the proof until Theorem 9.27. Corollary 3.6 (disintegration of kernels) Consider some σ-finite kernels ν : S → T and ρ : S → T × U with νs ∼ ρs (· × U ) for all s ∈ S, where T, U are Borel. Then there exists a σ-finite kernel μ : S × T → U,

ρ = ν ⊗ μ.

62

Foundations of Modern Probability

We turn to the subject of iterated disintegration, needed for some conditional constructions in Chapters 8 and 31. To explain the notation, we consider some σ-finite measures μ1 , μ2 , μ3 , μ12 , μ13 , μ23 , μ123 on products of the Borel spaces S, T, U . They are said to form a projective family, if all relations of the form μ1 ∼ μ12 (· × T ) or μ12 ∼ μ123 (· × U ) are satisfied1 . Then Theorem 3.4 yields some disintegrations ∼

μ12 = μ1 ⊗ μ2|1 = μ2 ⊗ μ1|2 , μ13 = μ1 ⊗ μ3|1 , μ123 = μ12 ⊗ μ3|12 = μ1 ⊗ μ23|1 ∼ = μ2 ⊗ μ13|2 , for some σ-finite kernels μ2|1 , μ3|12 , μ12|3 , . . . between appropriate spaces, where ∼ = denotes equality apart from the order of component spaces. If μ2|1 ∼ μ23|1 (·× U ) a.e. μ1 , we may proceed to form some iterated disintegration kernels such as μ3|2|1 , where μ23|1 = μ2|1 ⊗ μ3|2|1 a.e. μ1 . We show that the required support properties hold automatically, and that the iterated disintegration kernels μ3|2|1 and μ3|1|2 are a.e. equal and agree with the single disintegration kernel μ3|12 . Theorem 3.7 (iterated disintegration) Consider a projective set of σ-finite measures μ1 , μ2 , μ3 , μ12 , μ13 , μ23 , μ123 on products of Borel spaces S, T, U , and form the associated disintegration kernels μ1|2 , μ2|1 , μ3|1 , μ3|12 . Then (i) μ2|1 ∼ μ23|1 (· × U ) a.e. μ1 , μ1|2 ∼ μ13|2 (· × U ) a.e. μ2 , ∼

(ii) μ3|12 = μ3|2|1 = μ3|1|2 a.e. μ12 , (iii) if μ13 = μ123 (· × T × ·), then also μ3|1 = μ2|1 μ3|1|2 a.e. μ1 . Proof: (i) We may assume that μ123 = 0. Fixing a probability measure ˜1 , μ ˜2 , μ ˜3 , μ ˜12 , μ ˜13 , μ ˜23 of μ ˜123 onto μ ˜123 ∼ μ123 , we form the projections μ S, T, U, S × T, S × U, T × U , respectively. Then for any A ∈ S ⊗ T , ˜123 (A × U ) μ ˜12 A = μ ∼ μ123 (A × U ) ∼ μ12 A, which shows that μ ˜ 12 ∼ μ12 . A similar argument gives μ ˜ 1 ∼ μ1 and μ ˜ 2 ∼ μ2 . Hence, Theorem 2.10 yields some measurable functions p1 , p2 , p12 , p123 > 0 on S, T , S × T , S × T × U , satisfying μ ˜1 = p1 · μ1 , μ ˜12 = p12 · μ12 ,

μ ˜2 = p2 · μ2 , μ ˜123 = p123 · μ123 .

Inserting those densities into the disintegrations 1

For functions or constants a, b ≥ 0, the relation a ∼ b means that a = 0 iff b = 0.

3. Kernels, Disintegration, and Invariance

63



μ ˜12 = μ ˜1 ⊗ μ ˜2|1 = μ ˜2 ⊗ μ ˜1|2 , ∼

˜1 ⊗ μ ˜23|1 = μ ˜2 ⊗ μ ˜13|2 , μ ˜123 = μ we obtain









˜2|1 μ12 = μ1 ⊗ p2|1 · μ ∼

(3)

= μ2 ⊗ p1|2 · μ ˜1|2 , 







˜23|1 μ123 = μ1 ⊗ p23|1 · μ ∼

= μ2 ⊗ p13|2 · μ ˜13|2 , where p2|1 =

p1 , p12

p1|2 =

p2 , p12

p23|1 =

p1 , p123

p13|2 =

p2 . p123

Comparing with the disintegrations of μ12 and μ123 , and invoking the uniqueness in Theorem 3.4, we obtain a.e. μ ˜2|1 ∼ μ2|1 , μ ˜23|1 ∼ μ23|1 ,

μ ˜1|2 ∼ μ1|2 , μ ˜13|2 ∼ μ13|2 .

(4)

Furthermore, we get from (3) μ ˜1 ⊗ μ ˜2|1 = μ ˜12 = μ ˜123 (· × U ) ˜23|1 (· × U ), =μ ˜1 ⊗ μ and similarly with subscripts 1 and 2 interchanged, and so the mentioned uniqueness yields a.e. μ ˜2|1 = μ ˜23|1 (· × U ), ˜13|2 (· × U ). (5) μ ˜1|2 = μ Combining (4) and (5), we get a.e. ˜2|1 = μ ˜23|1 (· × U ) μ2|1 ∼ μ ∼ μ23|1 (· × U ), and similarly with 1 and 2 interchanged. (ii) By (i) and Theorem 3.6, we have a.e. μ23|1 = μ2|1 ⊗ μ3|2|1 , μ13|2 = μ1|2 ⊗ μ3|1|2 , for some product-measurable kernels μ3|2|1 and μ3|1|2 . Combining the various disintegrations and using the commutativity in Lemma 3.3 (ii), we get μ12 ⊗ μ3|12 = μ123 = μ1 ⊗ μ23|1 = μ1 ⊗ μ2|1 ⊗ μ3|2|1 = μ12 ⊗ μ3|2|1 , and similarly with 1 and 2 interchanged. It remains to use the a.e. uniqueness in Theorem 3.4.

64

Foundations of Modern Probability

(iii) The stated hypothesis yields μ1 ⊗ μ3|1 = μ13 = μ123 (· × T × ·) = μ1 ⊗ μ23|1 (T × ·), and so by (ii) and the uniqueness in Theorem 3.4, we have a.e. μ3|1 = = = =

μ23|1 (T × ·) μ2|1 ⊗ μ3|2|1 (T × ·) μ2|1 μ3|2|1 μ2|1 μ3|1|2 .

2

A group G with associated σ-field G is said to be measurable, if the group operations r → r−1 and (r, s) → rs on G are G -measurable. Defining the left and right shifts θr and θ˜r on G by θr s = θ˜s r = rs, r, s ∈ G, we say that a measure λ on G is left-invariant if λ ◦ θr−1 = λ for all r ∈ G. A σ-finite, leftinvariant measure λ = 0 is called a Haar measure on G. Writing f˜(r) = f (r−1 ), ˜ by λf ˜ = λf˜. we may define a corresponding right-invariant measure λ We call G a topological group if it is endowed with a topology rendering the group operations continuous. When G is lcscH 2 , it becomes a measurable group when endowed with the Borel σ-field G. We state the basic existence and uniqueness theorem for Haar measures. Theorem 3.8 (Haar measure) For any lcscH group G, (i) there exists a left-invariant, locally finite measure λ = 0 on G, (ii) λ is unique up to a normalization, (iii) when G is compact, λ is also right-invariant. Proof (Weil): For any f, g ∈ Cˆ+ we define |f |g = inf Σk ck , where the infimum extends over all finite sets of constants c1 , . . . , cn ≥ 0, such that f (x) ≤

Σ ck g(sk x),

k≤n

x ∈ G,

for some s1 , . . . , sn ∈ G. By compactness we have |f |g < ∞ when g = 0, and since |f |g is non-decreasing and translation invariant in f , it satisfies the sub-additivity and homogeneity properties |f + f  |g ≤ |f |g + |f  |g ,

|cf |g = c|f |g ,

(6)

as well as the inequalities f ≤ |f |g ≤ |f |h |h|g . g We may normalize |f |g by fixing an f0 ∈ Cˆ+ \ {0} and putting λg f = |f |g /|f0 |g , 2

f, g ∈ Cˆ+ , g = 0.

locally compact, second countable, Hausdorff

(7)

3. Kernels, Disintegration, and Invariance

65

From (6) and (7) we note that λg (f + f  ) ≤ λg f + λg f  , λg (cf ) = cλg f, |f0 |−1 f ≤ λg f ≤ |f |f0 .

(8)

Conversely, λg is nearly super-additive in the following sense: Lemma 3.9 (near super-additivity) For any f, f  ∈ Cˆ+ and ε > 0, there exists an open set U = ∅ with λg f + λg f  ≤ λg (f + f  ) + ε,

0 = g ≺ U.

Proof: Fix any h ∈ Cˆ+ with h = 1 on supp(f + f  ), and define for δ > 0 fδ = f + f  + δh, hδ = f /fδ , hδ = f  /fδ , so that hδ , hδ ∈ Cˆ+ . By compactness, we may choose a neighborhood U of the identity element ι ∈ G satisfying     hδ (x) − hδ (y) < δ,      hδ (x) − hδ (y) < δ,

x−1 y ∈ U.

(9)

Now assume 0 = g ≺ U , and let fδ (x) ≤ Σk ck g(sk x) for some s1 , . . . , sn ∈ G and c1 , . . . , cn ≥ 0. Since g(sk x) = 0 implies sk x ∈ U , (9) yields f (x) = fδ (x) hδ (x)

Σ k ck g(sk x) h δ (x)

≤ Σ k ck g(sk x) hδ (s−1 k )+δ , ≤

and similarly for f  . Noting that hδ + hδ ≤ 1, we get |f |g + |f  |g ≤ Σ k ck (1 + 2δ). Taking the infimum over all dominating sums for fδ and using (6), we obtain |f |g + |f  |g ≤ |fδ |g (1 + 2δ)



≤ |f + f  |g + δ|h|g (1 + 2δ). Now divide by |f0 |g , and use (8) to obtain



λg f + λg f  ≤ λg (f + f  ) + δλg h (1 + 2δ) ≤ λg (f + f  ) + 2δ|f + f  |f0 + δ(1 + 2δ)|h|f0 , which tends to λg (f + f  ) as δ → 0.

2

66

Foundations of Modern Probability

End of proof of Theorem 3.8: We may regard the functionals λg as elements ˆ of the product space Λ = (R+ )C+ . For any neighborhood U of ι, let ΛU be the closure in Λ of the set {λg ; 0 = g ≺ U }. Since λg f ≤ |f |f0 < ∞ for all f ∈ Cˆ+ by (8), the ΛU are compact by Tychonov’s theorem3 . Furthermore, the finite intersection property holds for the family {ΛU ; ι ∈ U }, since U ⊂ V implies  ΛU ⊂ ΛV . We may then choose an element λ ∈ U ΛU , here regarded as a functional on Cˆ+ . From (8) we note that λ = 0. To see that λ is linear, fix any f, f  ∈ Cˆ+ and a, b ≥ 0, and choose some g1 , g2 , . . . ∈ Cˆ+ with supp gn ↓ {ι}, such that λgn f → λf, λgn f  → λf  , λgn ( af + bf  ) → λ( af + bf  ). By (8) and Lemma 3.9 we obtain λ(af + bf  ) = a λf + b λf  . Thus, λ is a nontrivial, positive linear functional on Cˆ+ , and so by Theorem 2.25 it extends uniquely to a locally finite measure on S. The invariance of the functionals λg clearly carries over to λ. Now consider any left-invariant, locally finite measure λ = 0 on G. Fixing a right-invariant, locally finite measure μ = 0 and a function h ∈ Cˆ+ \ {0}, we define p(x) = h(y −1 x) μ(dy), x ∈ G, and note that p > 0 on G. Using the invariance of λ and μ along with Fubini’s theorem, we get for any f ∈ Cˆ+

(λh)(μf ) =





=



f (yx) μ(dy)

h(x) λ(dx)

=



h(x)f (yx) λ(dx)

μ(dy)

=



μ(dy)

h(y −1 x)f (x) λ(dx)



=

f (y) μ(dy)

h(x) λ(dx)



f (x) λ(dx)

h(y −1 x) μ(dy) = λ(f p).

Since f was arbitrary, we obtain (λh)μ = p · λ or equivalently λ/λh = p−1 · μ. The asserted uniqueness now follows, since the right-hand side is independent of λ. If S is compact, we may choose h ≡ 1 to obtain λ/λS = μ/μS. 2 From now on, we make no topological or other assumptions on G, beyond the existence of a Haar measure. The uniqueness and basic properties of such measures are covered by the following result, which also provides a basic factorization property. For technical convenience we consider s-finite (rather than σ-finite) measures, defined as countable sums of bounded measures. Note that most of the basic computational rules, including Fubini’s theorem, remain valid 3

Products of compact spaces are again compact.

3. Kernels, Disintegration, and Invariance

67

for s -finite measures, and that s -finiteness has the advantage of being preserved by measurable maps. Theorem 3.10 (factorization and modularity) Consider a measurable space S and a measurable group G with Haar measure λ. Then (i) for any s -finite, G-invariant measure μ on G × S, there exists a unique, s -finite measure ν on S satisfing μ = λ ⊗ ν, (ii) μ, ν are simultaneously σ-finite, (iii) there exists a measurable homomorphism Δ: G → (0, ∞) with ˜ = Δ · λ, λ ◦ θ˜−1 = Δr λ, r ∈ G, λ r

˜ = λ. (iv) when λ < ∞ we have Δ ≡ 1, and λ is also right-invariant with λ In particular, (i) shows that any s -finite, left-invariant measure on G is ˜ is known as the modular function of proportional to λ. The dual mapping Δ G. The use of s -finite measures leads to some significant simplifications in subsequent proofs. Proof: (i) Since λ is σ-finite, we may choose a measurable function h > 0 on G with λh = 1, and define Δr = λ(h ◦ θ˜r ) ∈ (0, ∞],

r ∈ G.

Using Fubini’s theorem (three times), the G -invariance of μ and λ, and the ˜ we get for any f ≥ 0 definitions of Δ and λ,

μ(f Δ) =



μ(dr ds) f (r, s)

=









μ(dr ds) h(r)

=

˜ λ(dp)



μ(dr ds) f p−1 r, s h(r)

λ(dp)

=

λ(dp) h(pr)

λ(dp) f (p−1 , s)



μ(dr ds) h(r) f (p, s).

˜ = λ(f Δ), and so λ ˜ = Δ · λ. Choosing μ = λ for a singleton S, we get λf ˜ = ΔΔ ˜ · λ, hence ΔΔ ˜ = 1 a.e. λ, and finally Δ ∈ (0, ∞) a.e. ˜ ·λ Therefore λ = Δ λ. Thus, for general S,

μ(f Δ) =



λ(dp) Δ(p)

μ(dr ds) h(r) f (p, s).

Since Δ > 0, we have μ(· × S)  λ, and so Δ ∈ (0, ∞) a.e. μ(· × S), which yields the simplified formula

μf =



λ(dp)

μ(dr ds) h(r) f (p, s),

showing that μ = λ ⊗ ν with νf = μ(h ⊗ f ).

68

Foundations of Modern Probability

(ii) When μ is σ-finite, we have μf < ∞ for some measurable function f > 0 on G × S, and so by Fubini’s theorem νf (r, ·) < ∞ for r ∈ G a.e. λ, which shows that even ν is σ-finite. The reverse implication is obvious. (iii) For every r ∈ G, the measure λ ◦ θ˜r−1 is left-invariant since θp and θ˜r commute for all p, r ∈ G. Hence, (i) yields λ ◦ θ˜r−1 = cr λ for some constants cr > 0. Applying both sides to h gives cr = Δ(r). The homomorphism relation Δ(p q) = Δ(p)Δ(q) follows from the reverse semi-group property θ˜p q = θ˜q ◦ θ˜p , and the measurability holds by the measurability of the group operation and Fubini’s theorem. (iv) Statement (iii) yields λ = Δr λ for all r ∈ G, and so Δ ≡ 1 when ˜ = λ and λ ◦ θ˜−1 = λ for all r ∈ G, which shows λ < ∞. Then (iii) gives λ r that λ is also right-invariant. 2 Given a group G and a space S, we define an action of G on S as a mapping (r, s) → rs from G × S to S, such that p(rs) = (p r)s and ιs = s for all p, r ∈ G and s ∈ S, where ι denotes the identity element of G. If G and S are measurable spaces, we say that G acts measurably on S if the action map is product-measurable. The shifts θr and projections πs are defined by rs = θr s = πs r, and the orbit containing s is given by πs G = {rs; r ∈ G}. The orbits form a partition of S, and we say that the action is transitive if πs G ≡ S. For s, s ∈ S, we write s ∼ s if s and s belong to the same orbit, so that s = rs for some r ∈ G. If G acts on both S and T , its joint action on S × T is given by r(s, t) = (rs, rt). For any group G acting on S, we say that a subset B ⊂ S is G-invariant if B ◦ θr−1 = B for all r ∈ G, and a function f on S is G-invariant if f ◦ θr = f . When S is measurable with σ-field S, the class of G -invariant sets in S is again a σ-field, denoted by IS . For a transitive group action we have IS = {∅, S}, and every invariant function is a constant. When the group action is measurable, a measure ν on S is said to be G-invariant if ν ◦ θr−1 = ν for all r ∈ G. When G is a measurable group with Haar measure λ, acting measurably on S, we say that G acts properly 4 on S, if there exists a normalizing function g > 0 on S, such that g is measurable and satisfies λ(g ◦ πs ) < ∞ for all s ∈ S. We may then define a kernel ϕ on S by ϕs =

λ ◦ πs−1 , λ(g ◦ πs )

s ∈ S.

(10)

Lemma 3.11 (orbit measures) Let G be a measurable group with Haar measure λ, acting properly on S. Then the measures ϕs are σ-finite and satisfy (i) ϕs = ϕs , s ∼ s , ϕs ⊥ ϕs , s ∼  s , (ii) ϕs ◦ θr−1 = ϕs = ϕ rs , r ∈ G, s ∈ S. 4

Not to be confused with the topological notion of proper group action.

3. Kernels, Disintegration, and Invariance

69

Proof: The ϕs are σ-finite since ϕs g ≡ 1. The first equality in (ii) holds −1 = Δr λ ◦ πs−1 by the left-invariance of λ, and the second holds since λ ◦ πrs by Lemma 3.10 (ii). The latter equation yields ϕs = ϕs when s ∼ s . The orthogonality holds for s ∼ s , since ϕs is confined to the orbit πs (G), which is universally measurable by Theorem A1.2 (i). 2 We show that any s -finite, G -invariant measure on S is a unique mixture of orbit measures. An invariant measure ν on S is said to be ergodic if νI ∧νI c = 0 for all I ∈ IS . It is further said to be extreme, if for any decomposition ν = ν  + ν  with invariant ν  and ν  , all three measures are proportional. Theorem 3.12 (invariant measures) Let G be a measurable group with Haar measure λ, acting properly on S, and define ϕ by (10) in terms of a normalizing function g > 0 on S. Then there is a 1−1 correspondence between all s-finite, invariant measures ν on S and all s-finite measures μ on the range ϕ(S), given by



ν(ds) g(s) ϕs =

(i)

ν=

(ii)

μf =

m μ(dm),



f ≥ 0.

ν(ds) g(s) f (ϕs ),

For such μ, ν, we have (iii) μ, ν are simultaneously σ-finite, (iv) ν is ergodic or extreme iff μ is degenerate, (v) there exists a σ-finite, G-invariant measure ν˜ ∼ ν. Proof, (i)−(ii): Let ν be s -finite and G -invariant. Using (10), Fubini’s theorem, and Theorem 3.10 (iii), we get for any f ∈ S+



ν(ds) g(s) ϕs f =

=

=

=

=

=

λ(f ◦ πs ) λ(g ◦ πs ) g(s) f (rs) λ(dr) ν(ds) λ(g ◦ πs ) g(r−1 s) f (s) λ(dr) ν(ds) λ(g ◦ πr−1 s ) g(r−1 s) f (s) λ(dr) Δr ν(ds) λ(g ◦ πs ) g(rs) f (s) λ(dr) ν(ds) λ(g ◦ πs ) ν(ds) g(s)

ν(ds) f (s) = νf,

which proves the first relation in (i). The second relation follows with μ as in (ii), by the substitution rule for integrals.

70

Foundations of Modern Probability 

Conversely, let ν = m μ(dm) for an s -finite measure μ on ϕ(S). For measurable M ⊂ MS we have A ≡ ϕ−1 M ∈ IS by Lemma 3.11, and so (λ ◦ πs−1 )(1A g) λ(g ◦ πs ) = 1A (s) = 1M (ϕs ).

ϕs (g; A) = Hence,

(g · ν) ◦ ϕ−1 M = ν(g; A) =



μ(dm) m(g; A)



=

μ(dm) = μM, M

which shows that μ is uniquely given by (ii) and is therefore s-finite. (iii) Use (i)−(ii) and the facts that g > 0 and ϕs = 0. (iv) Since μ is unique, ν is extreme iff μ is degenerate. For any I ∈ IS , ϕs I ∧ ϕs I c =

λG 1I (s) ∧ 1I c (s) = 0, λ(g ◦ πs )

s ∈ S, 

which shows that the ϕs are ergodic. Conversely, let ν = m μ(dm) with μ non-degenerate, so that μM ∧ μM c = 0 for a measurable subset M ⊂ MS . Then I = ϕ−1 M ∈ IS with νI ∧ νI c = 0, which shows that ν is non-ergodic. Hence, ν is ergodic iff μ is degenerate. (v) Choose a bounded measure νˆ ∼ ν, and define ν˜ = νˆϕ. Then ν˜ is invariant by Lemma 3.11, and ν˜ ∼ (g · ν)ϕ = ν by (ii). It is also σ-finite, since ν˜g = νˆϕg = νˆS < ∞ by (i). 2 When G acts measurably on S and T , we say that a kernel μ : S → T is invariant 5 if μ rs = μs ◦ θr−1 , r ∈ G, s ∈ S. The definition is motivated by the following observation: Lemma 3.13 (invariant kernels) Let the group G act measurably on the Borel spaces S, T , and consider a σ-finite, invariant measure ν on S and a kernel μ : S → T . Then these conditions are equivalent: (i) the measure ρ = ν ⊗ μ is jointly invariant on S × T , (ii) for any r ∈ G, the kernel μ is r-invariant in s ∈ S a.e. ν. Proof, (ii) ⇒ (i): Assuming (ii), we get for any measurable f ≥ 0 on S × T



(ν ⊗ μ) ◦ θr−1 f = (ν ⊗ μ)(f ◦ θr )

=



ν(ds)

= 5

ν(ds)

μs (dt) f (rs, rt)





μs ◦ θr−1 (dt) f (rs, t)

If we write μ ◦ θr−1 = θr μ, the defining condition simplifies to μ rs = θr μs .

3. Kernels, Disintegration, and Invariance

=



ν(ds)

=

71



μrs (dt) f (rs, t) μs (dt) f (s, t)

ν(ds)

= (ν ⊗ μ)f. (i) ⇒ (ii): Assuming (i) and fixing r ∈ G, we get by the same calculation 



μs ◦ θr−1 (dt) f (rs, t) =



μrs (dt) f (rs, t),

s ∈ S a.e. ν, 2

which extends to (ii) since f is arbitrary. We turn to the reverse problem of invariant disintegration.

Theorem 3.14 (invariant disintegration) Let G be a measurable group with Haar measure λ, acting properly on S and measurably on T , where S, T are Borel, and consider a σ-finite, jointly G-invariant measure ρ on S × T . Then (i) ρ = ν ⊗ μ for a σ-finite, invariant measure ν on S and an invariant kernel μ : S → T , (ii) when S = G we may choose ν = λ, in which case μ is given uniquely, for any B ∈ G with λB = 1, by μr f =

ρ(dp ds) f (rp−1 s),

B×S

r ∈ G, f ≥ 0.

We begin with part (ii), which may be proved by a simple skew factorization. Proof of (ii): On G × S we define the mappings ϑ(r, s) = (r, rs), θr (p, s) = (rp, rs), θr (p, s) = (rp, s), where p, r ∈ G and s ∈ S. Since clearly ϑ−1 ◦ θr = θr ◦ ϑ−1 , the joint Ginvariance of ρ gives −1 ρ ◦ ϑ ◦ θr = ρ ◦ θr−1 ◦ ϑ = ρ ◦ ϑ, r ∈ G, where ϑ = (ϑ−1 )−1 . Hence, ρ ◦ ϑ is invariant under shifts of G alone, and so ρ ◦ ϑ = λ ⊗ ν by Theorem 3.10 for a σ-finite measure ν on S, which yields the stated formula for the invariant kernel μr = ν ◦ θr−1 . To see that ρ = λ ⊗ μ, let f ≥ 0 be measurable on G × S, and write (λ ⊗ μ)f =





λ(dr)

=



λ(dr)

ν ◦ θr−1 (ds) f (r, s) ν(ds) f (r, rs)

= (λ ⊗ ν) ◦ ϑ−1 f = ρf.

2

To prepare for the proof of part (i), we begin with the special case where S contains G as a factor.

72

Foundations of Modern Probability

Corollary 3.15 (repeated disintegration) Let G be a measurable group with Haar measure λ, acting measurably on the Borel spaces S, T , and consider a σ-finite, jointly invariant measure ρ on G × S × T . Then (i) ρ = λ ⊗ ν ⊗ μ for some invariant kernels ν : G → S and μ : G × S → T , ˆr−1 s ◦ θr−1 for some measure νˆ on S and kernel (ii) νr ≡ νˆ ◦ θr−1 and μr,s ≡ μ μ ˆ: S → T, (iii) the measure νˆ ⊗ μ ˆ on S × T is unique, (iv) we may choose ν as any invariant kernel G → S with λ ⊗ ν ∼ ρ(· × T ). When G acts measurably on S, its action on G × S is proper, and so for any s-finite, jointly G-invariant measure β on G × S, Theorem 3.12 yields a σ-finite, G-invariant measure β˜ ∼ β. Hence, β˜ = λ ⊗ ν by Theorem 3.14 (ii) for an invariant kernel ν : G → S. Proof: By Theorem 3.14 (ii) we have ρ = (λ⊗χ)◦ϑ−1 for a σ-finite measure χ on S × T , where ϑ(r, s, t) = (r, rs, rt) for all (r, s, t) ∈ G × S × T . Since T is Borel, Theorem 3.4 yields a further disintegration χ = νˆ ⊗ μ ˆ in terms of a measure νˆ on S and a kernel μ ˆ : S → T , generating some invariant kernels ν and μ as in (ii). Then (i) may be verified as before, and (iii) holds by the uniqueness of χ. (iv) For any kernel ν  as stated, write νˆ for the generating measure on S. ˆ for some kernel Since λ × ν  ∼ λ × μ, we have νˆ ∼ νˆ, and so χ = νˆ ⊗ μ  2 μ ˆ : S → T . Now continue as before. We also need some technical facts: Lemma 3.16 (invariance sets) Let G be a group with Haar measure λ, acting measurably on S, T . Fix a kernel μ : S → T and a measurable, jointly Ginvariant function f on S × T . Then these subsets of S and S × T are Ginvariant:



A = s ∈ S; μrs = μs ◦ θr−1 , r ∈ G ,



B = (s, t) ∈ S × T ; f (rs, t) = f (ps, t), (r, p) ∈ G2 a.e. λ2 . Proof: For any s ∈ A and r, p ∈ G, we get μrs ◦ θp−1 = μs ◦ θr−1 ◦ θp−1 = μs ◦ θp−1r = μp rs , which shows that even rs ∈ A. Conversely, rs ∈ A implies s = r−1 (rs) ∈ A, and so θr−1 A = A. Now let (s, t) ∈ B. Then Theorem 3.10 (ii) yields (qs, t) ∈ B for any q ∈ G, and the invariance of λ gives 







f q −1 rqs, t = f q −1 p qs, t ,

(r, p) ∈ G2 a.e. λ2 ,

3. Kernels, Disintegration, and Invariance

73

which implies (qs, qt) ∈ B by the joint G -invariance of f . Conversely, (qs, qt) ∈ 2 B implies (s, t) = q −1 (qs, qt) ∈ B, and so θq−1 B = B. Proof of Theorem 3.14 (i): The measure ρ(· × T ) is clearly s -finite and G -invariant on S. Since G acts properly on S, Theorem 3.12 yields a σ-finite, G -invariant measure ν on S. Applying Corollary 3.15 to the G -invariant measures λ ⊗ ρ on G × S × T and λ ⊗ ν on G × S, we obtain a G -invariant kernel β : G × S → T satisfying λ ⊗ ρ = λ ⊗ ν ⊗ β. Now introduce the kernels βr (p, s) = β(rp, s) and write fr (p, s, t) = f (r−1 p, s, t) for any measurable function f ≥ 0 on G × S × T . By the invariance of λ, we have for any r ∈ G 







λ ⊗ ν ⊗ βr f = λ ⊗ ν ⊗ β fr = (λ ⊗ ρ)fr = (λ ⊗ ρ)f,

and so λ ⊗ ρ = λ ⊗ ν ⊗ βr . Hence, by the a.e. uniqueness β(rp, s) = β(p, s),

(p, s) ∈ G × S a.e. λ ⊗ ν, r ∈ G.

Let A be the set of all s ∈ S satisfying β(rp, s) = β(p, s),

(r, p) ∈ G2 a.e. λ2 ,

and note that νAc = 0 by Fubini’s theorem. By Lemma 3.10 (ii), the defining condition is equivalent to β(r, s) = β(p, s),

(r, p) ∈ G2 a.e. λ2 ,

and so for any g, h ∈ G+ with λg = λh = 1, (g · λ)β(·, s) = (h · λ)β(·, s) = μ(s),

s ∈ A.

(11)

To extend this to an identity, we may redefine β(r, s) = 0 when s ∈ Ac . This will not affect the disintegration of λ ⊗ ρ, and by Lemma 3.16 it even preserves the G -invariance of β. Fixing a g ∈ G+ with λg = 1, we get for any f ≥ 0 on S × T ρf = (λ ⊗ ρ)(g ⊗ f ) 



= λ ⊗ ν ⊗ β (g ⊗ f )



= ν ⊗ (g · λ)β f = (ν ⊗ μ)f, and so ρ = ν ⊗ μ. Now combine (11) with the G -invariance of β and λ to get μs ◦ θr−1 = = =



λ(dp) g(p) β(p, s) ◦ θr−1 λ(dp) g(p) β(rp, rs) λ(dp) g(r−1 p) β(p, rs) = μrs ,

74

Foundations of Modern Probability 2

which shows that μ is G -invariant.

We turn to some notions of asymptotic and local invariance, of importance in Chapters 25, 27, and 30–31. The uniformly bounded measures μn on Rd are said to be asymptotically invariant if      μn − μn ∗ δx  → 0,

x ∈ Rd .

(12)

Lemma 3.17 (asymptotic invariance) Let μ1 , μ2 , . . . be uniformly bounded, asymptotically invariant measures on Rd . Then (i) (12) holds uniformly for bounded x, (ii) μn − μn ∗ ν → 0 for any distribution ν on Rd , (iii) the singular components μn of μn satisfy μn → 0. Proof: (ii) For any distribution ν, we get by dominated convergence           μn − μn ∗ ν  =  μn − μn ∗ δx ν(dx)    ≤  μn − μn ∗ δx  ν(dx) → 0.

(iii) When ν is absolutely continuous, so is μn ∗ ν for all n, and we get  

 

μn ≤  μn − μn ∗ ν  → 0. (i) Writing λh for the uniform distribution on [0, h]d and noting that μ ∗ ν ≤ μ ν , we get for any x ∈ Rd with |x| ≤ r              μn − μn ∗ δx  ≤  μn − μn ∗ λh  +  μn ∗ λh − μn ∗ λh ∗ δx      +  μ n ∗ λh ∗ δ x − μn ∗ δ x          ≤ 2  μn − μn ∗ λh  +  λh − λh ∗ δx      ≤ 2  μn − μn ∗ λh  + 2 rh−1 d 1/2 ,

which tends to 0 for fixed r, as n → ∞ and then h → ∞.

2

Next we say that the measures μn on Rd are weakly asymptotically invariant, if they are uniformly bounded and such that the convolutions μn ∗ ν are asymptotically invariant for every bounded measure ν  λd on Rd , so that      μn ∗ ν − μn ∗ ν ∗ δx  → 0,

x ∈ Rd .

(13)

Lemma 3.18 (weak asymptotic invariance) For any uniformly bounded measures μ1 , μ2 , . . . on Rd , these conditions are equivalent: (i) the μn are weakly asymptotically invariant, (ii)

     μn (Ih + x) − μn (Ih + x + hei ) dx → 0,

h > 0, i ≤ d.

3. Kernels, Disintegration, and Invariance (iii)

75

     μn ∗ ν − μn ∗ ν   → 0 for all probability measures ν, ν   λd .

In (ii) it suffices to consider dyadic h = 2−k and to replace the integration by summation over (hZ)d . Proof, (i) ⇔ (ii): Letting ν = f · λd and ν  = f  · λd , we note that      μ ∗ (ν − ν  )  ≤ μ ν − ν 

= μ f − f  1 . By Lemma 1.37 it is then enough in (13) to consider measures ν = f · λd , where f is continuous with bounded support. We may also take r = hei for some h > 0 and i ≤ d, and by uniform continuity we may restrict h to dyadic values 2−k . By uniform continuity, we may next approximate f in L1 by simple functions fn over the cubic grids In in Rd of mesh size 2−n , which implies the equivalence with (ii). The last assertion is yet another consequence of the uniform continuity. (i) ⇔ (iii): Condition (13) is clearly a special case of (iii). To show that (13) ⇒ (iii), we may approximate as in (ii) to reduce to the case of finite sums ν = Σj aj λmj and ν  = Σj bj λmj , where Σj aj = Σj bj = 1, and λmj denotes the uniform distribution over the cube Imj = 2−m [j − 1, j] in Rd . Assuming (i) and writing h = 2−m , we get              μn ∗ ν − μn ∗ ν   ≤  μn ∗ ν − μn ∗ λm0  +  μn ∗ ν  − μn ∗ λm0      ≤ j (aj + bj )  μn ∗ λmj − μn ∗ λm0      = j (aj + bj )  μn ∗ λm0 ∗ δjh − μn ∗ λm0  → 0,

Σ Σ

by the criterion in (ii).

2

Now let γ be a random vector in Rd with distribution μ, and write μt = L(γt ), where γt = tγ. Then μ is said to be locally invariant, if the μt are weakly asymptotically invariant as t → ∞. We might also say that μ is strictly locally invariant, if the μt are asymptotically invariant in the strict sense. However, the latter notion gives nothing new: Theorem 3.19 (local invariance) Let μ  μ be locally finite measures on Rd . Then (i) μ strictly locally invariant ⇔ μ  λd , (ii) μ locally invariant ⇒ μ locally invariant. Proof: (i) Let μ be strictly locally invariant with singular component μ . Then Lemma 3.17 yields μ = μ ◦ St−1 → 0 as t → ∞, where St x = tx, which implies μ = 0 and hence μ  λd .

76

Foundations of Modern Probability

Conversely, let μ = f ·λd for some measurable function f ≥ 0. Then Lemma 1.37 yields some continuous functions fn ≥ 0 with bounded supports such that f − fn 1 → 0. Writing μt = μ ◦ St−1 and h = 1/t, we get for any r ∈ Rd          μt − μt ∗ δr  = f (x) − f (x + hr) dx     ≤ fn (x) − fn (x + hr) dx + 2 f − fn 1 ,

which tends to 0 as h → 0 and then n → ∞. Since r was arbitrary, the strict local invariance follows. (ii) We may take μ to be bounded, so that its local invariance reduces to weak asymptotic invariance of the measures μt = μ ◦ St−1 . Since μ  μ is locally finite, we may further choose μ = f · μ for some μ -integrable function f ≥ 0 on Rd . Then Lemma 1.37 yields some continuous functions fn with bounded supports, such that fn → f in L1 (μ). Writing μn = fn · μ and μn,t = μn ◦ St−1 , we get               μn,t ∗ ν − μt ∗ ν  ≤  μn,t − μt  +  μn − μ    = (fn − f ) · μ

= μ|fn − f | → 0, and so it suffices to verify (13) for the measures μn,t . Thus, we may henceforth choose μ = f · μ for a continuous function f with bounded support. To verify (13), let r and ν be such that the measures ν and ν ∗ δr are both supported by the ball B0a . Writing μt = = = = ν − ν ∗ δr = =

(f · μ)t (f · μ) ◦ St−1 (f ◦ St ) · (μ ◦ St−1 ) ft · μt , g · λd − (g ◦ θ−r ) · λd Δg · λd ,

we get for any bounded, measurable function h on Rd 







μt ∗ ν − μt ∗ ν ∗ δr h = (ft · μt ) ∗ (Δg · λd ) h

=

=

= ≈



ft (x) μt (dx) ft (x) μt (dx)

h(y) dy

Δg(y) h(x + y) dy

Δg(y − x) h(y) dy

ft (x) Δg(y − x) μt (dx)

h(y) ft (y) dy



Δg(y − x) μt (dx)

= μt ∗ (Δg · λd ) ft h.

3. Kernels, Disintegration, and Invariance

77

Letting m be the modulus of continuity of f and putting mt = m(r/t), we may estimate the approximation error in the fifth step by

        ft (x) − ft (y) Δg(y − x) μt (dx)     ≤ mt μt (dx) h(y) Δg(y − x) dy

|h(y)| dy

≤ mt h μt

|Δg(y)| dy

≤ 2 mt h μ ν → 0, by the uniform continuity of f . It remains to note that   

     ft · μt ∗ (Δg · λd )  ≤ f μt ∗ ν − μt ∗ ν ∗ δt  → 0,

2

by the local invariance of μ.

Exercises

1. Let μ_1, μ_2, . . . be kernels between two measurable spaces S, T. Show that the function μ = Σ_n μ_n is again a kernel.

2. Fix a function f between two measurable spaces S, T, and define μ(s, B) = 1_B ∘ f(s). Show that μ is a kernel iff f is measurable.

3. Prove the existence and uniqueness of Lebesgue measure on R^d from the corresponding properties of Haar measures.

4. Show that if the group G is compact, then every measurable group action is proper. Then show how the formula for the orbit measures simplifies in this case, and give a corresponding version of Theorem 3.12.

5. Use the previous exercise to find the general form of the invariant measures on a sphere and on a spherical cylinder.

6. Extend Lemma 3.10 (i) to the context of general invariant measures.

7. Extend the mean continuity in Lemma 2.7 to general invariant measures.

8. Give an example of some measures μ_1, μ_2, . . . that are weakly but not strictly asymptotically invariant.

9. Give an example of a bounded measure μ on R that is locally invariant but not absolutely continuous.

II. Some Classical Probability Theory Armed with the basic measure-theoretic machinery from Part I, we proceed with the foundations of general probability theory. After giving precise definitions of processes, distributions, expectations, and independence, we introduce the various notions of probabilistic convergence, and prove some of the classical limit theorems for sums and averages, including the law of large numbers and the central limit theorem. Finally, a general compound Poisson approximation will enable us to give a short, probabilistic proof of the classical limit theorem for general null arrays. The material in Chapter 4, along with at least the beginnings of Chapters 5−6, may be regarded as essential core material, constantly needed for the continued study, whereas a detailed reading of the remaining material might be postponed until a later stage. −−− 4. Processes, distributions, and independence. Here the probabilistic notions of processes, distributions, expected values, and independence are defined in terms of standard notions of measure theory from Chapter 1. We proceed with a detailed study of independence criteria, 0−1 laws, and regularity conditions for processes. Finally, we prove the existence of sequences of independent random variables, and show how distributions on Rd are induced by their distribution functions. 5. Random sequences, series, and averages. Here we examine the relationship between convergence a.s., in probability, and in Lp , and give conditions for uniform integrability. Next, we develop convergence criteria for random series and averages, including the strong laws of large numbers. Finally, we study the basic properties of convergence in distribution, with special emphasis on the continuous mapping and approximation theorems as well as the Skorohod coupling. 6. Gaussian and Poisson convergence. Here we prove the classical Poisson and Gaussian limit theorems for null arrays of random variables or vectors, including the celebrated central limit theorem. The theory of regular variation enables us to identify the domain of Gaussian attraction. The mentioned results also yield precise criteria for the weak law of large numbers. Finally, we study the relationship between weak compactness and tightness, in the special case of distributions on Euclidean spaces. 7. Infinite divisibility and general null arrays. Here we establish the representation of infinitely divisible distributions on Rd , involving a Gaussian component and a compound Poisson integral, material needed for the theory of L´evy processes and essential for a good understanding of Feller processes and general semi-martingales. The associated convergence criteria, along with a basic compound Poisson approximation, yield a short, non-technical approach to the classical limit theorem for general null arrays. 79

Chapter 4

Processes, Distributions, and Independence Random elements and processes, finite-dimensional distributions, expectation and covariance, moments and tails, Jensen’s inequality, independence, pairwise independence and grouping, product measures and convolution, iterated expectation, Kolmogorov and Hewitt–Savage 0−1 laws, Borel–Cantelli lemma, replication and Bernoulli sequences, kernel representation, H¨ older continuity, multi-variate distributions

Armed with the basic notions and results of measure theory from previous chapters, we may now embark on our study of proper probability theory. The dual purposes of this chapter are to introduce some basic terminology and notation and to prove some fundamental results, many of which will be needed throughout the remainder of the book. In modern probability theory, it is customary to relate all objects of study to a basic probability space (Ω, A, P ), which is simply a normalized measure space. Random variables may then be defined as measurable functions ξ on Ω,  and their expected values as integrals E ξ = ξ dP . Furthermore, independence between random quantities reduces to a kind of orthogonality between the induced sub-σ-fields. The reference space Ω is introduced only for technical convenience, to provide a consistent mathematical framework, and its actual choice plays no role1 . Instead, our interest focuses on the induced distributions L(ξ) = P ◦ ξ −1 and their associated characteristics. The notion of independence is fundamental for all areas of probability theory. Despite its simplicity, it has some truly remarkable consequences. A particularly striking result is Kolmogorov’s 0−1 law, stating that every tail event associated with a sequence of independent random elements has probability 0 or 1. Thus, any random variable that depends only on the ‘tail’ of the sequence is a.s. a constant. This result and the related Hewitt–Savage 0−1 law convey much of the flavor of modern probability: though the individual elements of a random sequence are erratic and unpredictable, the long-term behavior may often conform to deterministic laws and patterns. A major objective is then to uncover the latter. Here the classical Borel–Cantelli lemma is one of the simplest but most powerful tools available. To justify our study, we need to ensure the existence2 of the random objects 1

except in some special contexts, such as in the modern theory of Markov processes
The entire theory relies in a subtle way on the existence of non-trivial, countably additive measures. Assuming only finite additivity wouldn't lead us very far.


under discussion. For most purposes it suffices to use the Lebesgue unit interval ([0, 1], B, λ) as our basic probability space. In this chapter, the existence will be proved only for independent random variables with prescribed distributions, whereas a more general discussion is postponed until Chapter 8. As a key step, we may use the binary expansion of real numbers to construct a socalled Bernoulli sequence, consisting of independent random digits 1 or 0 with probabilities p and 1−p, respectively.3 The latter may be regarded as a discretetime counterpart of the fundamental Poisson processes, to be introduced and studied in Chapter 15. The distribution of a random process X is determined by its finite-dimensional distributions, which are not affected by a change on a null set of each variable Xt . It is then natural to look for versions of X with suitable regularity properties. As another striking result, we provide a moment condition ensuring the existence of a H¨older continuous modification of the process. Regularizations of various kind are important throughout modern probability theory, as they may enable us to deal with events depending on the values of a process at uncountably many times. To begin our systematic exposition of the theory, we fix an arbitrary probability space (Ω, A, P ), where P is a probability measure with total mass 1. In the probabilistic context, the sets A ∈ A are called events, and P A = P (A) is called the probability of A. In addition to results valid for all measures, there are properties requiring the boundedness or normalization of P , such as the relation P Ac = 1 − P A and the fact that An ↓ A implies P An → P A. Some infinite set operations have a special probabilistic significance. Thus, for any sequence of events A1 , A2 , . . . ∈ A, we may consider the sets {An i.o.} where An happens infinitely often, and {An ult.} where An happens ultimately (for all but finitely many n). Those occurrences are events in their own right, formally expressible in terms of the An as

  {A_n i.o.} = ∩_n ∪_{k≥n} A_k = { Σ_n 1_{A_n} = ∞ },
  {A_n ult.} = ∪_n ∩_{k≥n} A_k = { Σ_n 1_{A_n^c} < ∞ }.

For any p > 0, the integral E|ξ|^p = ‖ξ‖_p^p is called the p-th absolute moment of ξ. By Hölder's inequality (or by Jensen's inequality in Lemma 4.5) we have ‖ξ‖_p ≤ ‖ξ‖_q for p ≤ q, so that the corresponding L^p-spaces are non-increasing in p. If ξ ∈ L^p and either p ∈ N or ξ ≥ 0, we may further define the p-th moment of ξ as E ξ^p. We give a useful relationship between moments and tail probabilities.

Lemma 4.4 (moments and tails) For any random variable ξ ≥ 0 and constant p > 0, we have

  E ξ^p = p ∫_0^∞ P{ξ > t} t^{p−1} dt = p ∫_0^∞ P{ξ ≥ t} t^{p−1} dt.

Proof: By Fubini's theorem and calculus,

  E ξ^p = p E ∫_0^ξ t^{p−1} dt
      = p E ∫_0^∞ 1{ξ > t} t^{p−1} dt
      = p ∫_0^∞ P{ξ > t} t^{p−1} dt.

The proof of the second expression is similar. □
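As a quick sanity check of the identity, the two expressions can be compared by simulation; the following sketch (Python/NumPy, with an exponential ξ and p = 1.7 chosen purely for illustration) approximates the tail integral on a finite grid:

```python
import numpy as np

rng = np.random.default_rng(0)
xi = rng.exponential(scale=2.0, size=100_000)    # example: a random variable xi >= 0
p = 1.7

lhs = np.mean(xi ** p)                           # Monte Carlo estimate of E xi^p

# p * int_0^infinity P{xi > t} t^(p-1) dt, approximated on a grid
t = np.linspace(0.0, xi.max(), 2000)
tail = np.array([np.mean(xi > s) for s in t])    # empirical tail probabilities P{xi > t}
rhs = p * np.trapz(tail * t ** (p - 1), t)

print(lhs, rhs)   # the two values agree up to Monte Carlo and discretisation error
```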

A random vector ξ = (ξ1 , . . . , ξd ) or process X = (Xt ) is said to be integrable, if integrability holds for every component ξk or value Xt , in which case 7

We omit the customary square brackets, as in E[ξ], when there is no risk for confusion.


we may write E ξ = (E ξ1 , . . . , E ξd ) or EX = (EXt ). Recall that a function f : Rd → R is said to be convex if



f cx + (1 − c)y ≤ cf (x) + (1 − c)f (y),

x, y ∈ Rd , c ∈ [0, 1].

(6)

This may be written as f (E ξ) ≤ Ef (ξ), where ξ is a random vector in Rd with P {ξ = x} = 1 − P {ξ = y} = c. The following extension to integrable random vectors is known as Jensen’s inequality. Lemma 4.5 (convex maps, H¨older, Jensen) For any integrable random vector ξ in Rd and convex function f : Rd → R, we have Ef (ξ) ≥ f (E ξ). Proof: By a version of the Hahn–Banach theorem 8 , the convexity condition (6) is equivalent to the existence for every s ∈ Rd of a supporting affine function hs (x) = ax + b with f ≥ hs and f (s) = hs (s). Taking s = E ξ gives Ef (ξ) ≥ E hs (ξ) = hs (E ξ) = f (E ξ).

2

The covariance of two random variables ξ, η ∈ L2 is given by



  Cov(ξ, η) = E(ξ − E ξ)(η − E η) = E ξη − (E ξ)(E η).

It is clearly bi-linear, in the sense that

  Cov( Σ_{j≤m} a_j ξ_j , Σ_{k≤n} b_k η_k ) = Σ_{j≤m} Σ_{k≤n} a_j b_k Cov(ξ_j, η_k).

Taking ξ = η ∈ L^2 gives the variance

  Var(ξ) = Cov(ξ, ξ) = E(ξ − E ξ)^2 = E ξ^2 − (E ξ)^2,

and Cauchy's inequality yields

  |Cov(ξ, η)| ≤ { Var(ξ) Var(η) }^{1/2}.

Two random variables ξ, η ∈ L2 are said to be uncorrelated if Cov(ξ, η) = 0. For any collection of random variables ξt ∈ L2 , t ∈ T , the associated covariance function ρs,t = Cov(ξs , ξt ), s, t ∈ T , is non-negative definite, in the sense that Σ ij ai aj ρti ,tj ≥ 0 for any n ∈ N, t1 , . . . tn ∈ T , and a1 , . . . , an ∈ R. This is clear if we write 8

Any two disjoint, convex sets in Rd can be separated by a hyperplane.

  Σ_{i,j} a_i a_j ρ_{t_i, t_j} = Σ_{i,j} a_i a_j Cov(ξ_{t_i}, ξ_{t_j}) = Var( Σ_i a_i ξ_{t_i} ) ≥ 0.
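Numerically, the non-negative definiteness of a covariance function can be observed on simulated data; the sketch below (NumPy, with an arbitrary linear mixing of i.i.d. Gaussians as the example family) compares the quadratic form with the variance of the corresponding linear combination:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 6, 50_000
A = rng.normal(size=(T, T))                   # arbitrary mixing matrix (illustrative choice)
xi = rng.normal(size=(n, T)) @ A.T            # rows = samples of (xi_{t_1}, ..., xi_{t_T})

rho = np.cov(xi, rowvar=False)                # empirical covariance function rho_{s,t}
a = rng.normal(size=T)                        # arbitrary coefficients a_1, ..., a_T

quad = a @ rho @ a                            # sum_{i,j} a_i a_j rho_{t_i, t_j}
print(quad, np.var(xi @ a, ddof=1))           # both equal Var(sum_i a_i xi_{t_i}) >= 0
```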

The events At ∈ A, t ∈ T , are said to be independent, if for any distinct indices t1 , . . . , tn ∈ T , (7) P ∩ Atk = Π P Atk . k≤n

k≤n

More generally, we say that the families Ct ⊂ A, t ∈ T , are independent9 , if independence holds between the events At for any At ∈ Ct , t ∈ T . Finally, the random elements ξt , t ∈ T , are said to be independent if independence holds between the generated σ-fields σ(ξt ), t ∈ T . Pairwise independence between the objects A and B, ξ and η, or B and C is often denoted by A⊥⊥B, ξ⊥⊥η, or B⊥⊥C, respectively. The following result is often useful to extend the independence property. Lemma 4.6 (extension) Let Ct , t ∈ T , be independent π-systems. Then the independence extends to the σ-fields Ft = σ(Ct ),

t ∈ T.

Proof: We may assume that C_t ≠ ∅ for all t. Fix any distinct indices t_1, . . . , t_n ∈ T, and note that (7) holds for arbitrary A_{t_k} ∈ C_{t_k}, k = 1, . . . , n. For fixed A_{t_2}, . . . , A_{t_n}, we introduce the class D of sets A_{t_1} ∈ A satisfying (7). Then D is a λ-system containing C_{t_1}, and so D ⊃ σ(C_{t_1}) = F_{t_1} by Theorem 1.1. Thus, (7) holds for arbitrary A_{t_1} ∈ F_{t_1} and A_{t_k} ∈ C_{t_k}, k = 2, . . . , n. Proceeding recursively in n steps, we obtain the desired extension to arbitrary A_{t_k} ∈ F_{t_k}, k = 1, . . . , n. □

An immediate consequence is the following basic grouping property. Here and below, we write

  F ∨ G = σ(F, G),  F_S = ∨_{t∈S} F_t = σ{F_t; t ∈ S}.

Corollary 4.7 (grouping) Let Ft , t ∈ T , be independent σ-fields, and let T be a disjoint partition of T . Then the independence extends to the σ-fields FS =

∨ Ft,

t∈S

S ∈T.

Proof: For any S ∈ T , let CS be the class of finite intersections of sets in t∈S Ft . Then the classes CS are independent π-systems, and the independence 2 extends by Lemma 4.6 to the generated σ-fields FS . 

Though independence between more than two σ-fields is clearly stronger than pairwise independence, the full independence may be reduced in various ways to the pairwise version. Given a set T, we say that a class T ⊂ 2^T is separating, if for any s ≠ t in T there exists a set S ∈ T, such that exactly one of the elements s and t lies in S.

not to be confused with independence within each class


Lemma 4.8 (pairwise independence) Let the F_n or F_t be σ-fields on Ω, indexed by N or T. Then
(i) F_1, F_2, . . . are independent iff

  σ(F_1, . . . , F_n) ⊥⊥ F_{n+1},  n ∈ N,

(ii) for any separating class T ⊂ 2^T, the F_t are independent iff

  F_S ⊥⊥ F_{S^c},  S ∈ T.

Proof: The necessity of the two conditions follows from Corollary 4.7. As for the sufficiency, we consider only part (ii), the proof for (i) being similar. Under the stated condition, we need to show that, for any finite subset S ⊂ T , the σ-fields Fs , s ∈ S, are independent. Assume the statement to be true for |S| ≤ n, where |S| denote the cardinality of S. Proceeding to the case where |S| = n + 1, we may choose U ∈ T such that S  = S ∩ U and S  = S \ U are non-empty. Since FS  ⊥⊥FS  , we get for any sets As ∈ Fs , s ∈ S, P

∩ As = s∈S





∩ As s∈S

P



∩ s∈S

P





As =

Π P As ,

s∈S

where the last relation follows from the induction hypothesis.

2

A σ-field F is said to be P -trivial if P A = 0 or 1 for every A ∈ F . Note that a random element is a.s. a constant iff its distribution is a degenerate probability measure. Lemma 4.9 (triviality and degeneracy) For a σ-field F on Ω, these conditions are equivalent: (i) F is P -trivial, (ii) F ⊥⊥ F , (iii) every F-measurable random variable is a.s. a constant. Proof, (i) ⇔ (ii): If F⊥⊥F , then for any A ∈ F we have P A = P (A ∩ A) = (P A)2 , and so P A = 0 or 1. Conversely, if F is P -trivial, then for any A, B ∈ F we have P (A ∩ B) = P A ∧ P B = P A · P B, which means that F⊥⊥F . (i) ⇔ (iii): Assume (iii). Then 1B is a.s. a constant for every B ∈ F, and so P B = 0 or 1, proving (i). Conversely, (i) yields Ft ≡ P {ξ ≤ t} = 0 or 1 for every F-measurable random variable ξ, and so ξ = sup{t; Ft = 0} a.s., proving (iii). 2 We proceed with a basic relationship between independence and product measures. Lemma 4.10 (product measures and independence) Let ξ1 , . . . , ξn be random elements with distributions μ1 , . . . , μn , in some measurable spaces S1 , . . . , Sn . Then these conditions are equivalent: (i) the ξk are independent,


(ii) ξ = (ξ1 , . . . , ξn ) has distribution μ1 ⊗ · · · ⊗ μn . Proof: Assuming (i), we get for any measurable product set B = B1 × · · · × Bn

Π P {ξk ∈ Bk } = Π μk Bk = ⊗ μk B. k≤n k≤n

P {ξ ∈ B} =

k≤n

This extends by Theorem 1.1 to arbitrary sets in the product σ-field, proving (ii). The converse is obvious. 2 Combining the last result with Fubini’s theorem, we obtain a useful method of calculating expected values, in the context of independence. A more general version is given in Theorem 8.5. Lemma 4.11 (iterated expectation) Let ξ, η be independent random elements in some measurable spaces S, T , and consider a measurable function f : S × T → R with E{E|f (s, η)|}s=ξ < ∞. Then

Ef (ξ, η) = E Ef (s, η)

s=ξ

.

Proof: Let μ and ν be the distributions of ξ and η, respectively. Assuming f ≥ 0 and writing g(s) = Ef (s, η), we get by Lemma 1.24 and Fubini’s theorem

  E f(ξ, η) = ∫ f(s, t) (μ ⊗ ν)(ds dt)
       = ∫ μ(ds) ∫ f(s, t) ν(dt)
       = ∫ g(s) μ(ds) = E g(ξ).

For general f , this applies to the function |f |, and so E|f (ξ, η)| < ∞. The desired relation then follows as before. 2 In particular, we get for any independent random variables ξ1 , . . . , ξn E Π k ξk = Π k E ξ k ,

Var





Σ k ξk

= Σ k Var(ξk ),

whenever the expressions on the right exist. For any random elements ξ, η in a measurable group G, the product ξη is again a random element in G. We give a connection between independence and the convolutions of Lemma 1.30. Corollary 4.12 (convolution) Let ξ, η be independent random elements in a measurable group G with distributions μ, ν. Then L(ξη) = μ ∗ ν.


Proof: For any measurable set B ⊂ G, we get by Lemma 4.10 and the definition of convolution

P {ξη ∈ B} = (μ ⊗ ν) (x, y) ∈ G2 ; xy ∈ B



= (μ ∗ ν)B.

2

For any σ-fields F_1, F_2, . . . , we introduce the associated tail σ-field

  T = ∩_n ∨_{k>n} F_k = ∩_n σ{F_k; k > n}.

The following remarkable result shows that T is trivial when the Fn are independent. An extension appears in Corollary 9.26. Theorem 4.13 (Kolmogorov’s 0−1 law) Let F1 , F2 , . . . be σ-fields with associated tail σ-field T . Then F1 , F2 , . . . independent ⇒ T trivial. 

Proof: For each n ∈ N, define Tn = k>n Fk , and note that F1 , . . . , Fn , Tn are independent by Corollary 4.7. Hence, so are the σ-fields F1 , . . . , Fn , T , and then also F1 , F2 , . . . , T . By the same theorem, we obtain T0 ⊥⊥T , and so T ⊥⊥T . Thus, T is P -trivial by Lemma 4.9. 2 We consider some simple illustrations: Corollary 4.14 (sums and averages) Let ξ1 , ξ2 , . . . be independent random variables, and put Sn = ξ1 + · · · + ξn . Then (i) the sequence (Sn ) is a.s. convergent or a.s. divergent, (ii) the sequence (Sn /n) is a.s. convergent or a.s. divergent, and its possible limit is a.s. a constant. Proof: Define Fn = σ{ξn }, n ∈ N, and note that the associated tail σ-field T is P -trivial by Theorem 4.13. Since the sets where (Sn ) and (Sn /n) converge are T -measurable by Lemma 1.10, the first assertions follows. The last one is clear from Lemma 4.9. 2 By a finite permutation of N we mean a bijective map p : N → N, such that pn = n for all but finitely many n. For any space S, a finite permutation p of N induces a permutation Tp on S ∞ , given by Tp (s) = s ◦ p = (sp1 , sp2 , . . .),

s = (s1 , s2 , . . .) ∈ S ∞ .

A set I ⊂ S ∞ is said to be symmetric or permutation invariant if Tp−1 I ≡ {s ∈ S ∞ ; s ◦ p ∈ I} = I, for every finite permutation p of N. For a measurable space (S, S), the symmetric sets I ∈ S ∞ form a σ-field I ⊂ S ∞ , called the permutation invariant σ-field in S ∞ . We may now state the second basic 0−1 law, which applies to sequences of random elements that are independent and identically distributed, abbreviated as i.i.d..


Theorem 4.15 (Hewitt–Savage 0−1 law) Let ξ = (ξ_k) be an infinite sequence of random elements in a measurable space S, and let I denote the permutation invariant σ-field in S^∞. Then

  ξ_1, ξ_2, . . . i.i.d.  ⇒  ξ^{−1} I trivial.

Our proof is based on a simple approximation. Write A △ B = (A \ B) ∪ (B \ A), and note that

  P(A △ B) = P(A^c △ B^c) = E|1_A − 1_B|,  A, B ∈ A.   (8)

Lemma 4.16 (approximation) For any σ-fields F_1 ⊂ F_2 ⊂ · · · and set A ∈ ∨_n F_n, we may choose

  A_1, A_2, . . . ∈ ∪_n F_n,  P(A △ A_n) → 0.

Proof: Define C = ∪_n F_n, and let D be the class of sets A ∈ ∨_n F_n with the stated property. Since C is a π-system and D is a λ-system containing C, Theorem 1.1 yields ∨_n F_n = σ(C) ⊂ D. □



μ I (In ∩ I˜n ) ≤ μ(I In ) + μ(I I˜n ) → 0. Since moreover In ⊥ ⊥I˜n under μ, we get μI ← μ(In ∩ I˜n ) = (μIn )(μI˜n ) → (μI)2 . Thus, μI = (μI)2 , and so P ◦ ξ −1 I = μI = 0 or 1.

2

Again we list some easy consequences. Here a random variable ξ is said to d be symmetric if ξ = −ξ. Corollary 4.17 (random walk) Let ξ1 , ξ2 , . . . be i.i.d., non-degenerate random variables, and put Sn = ξ1 + · · · + ξn . Then (i) P {Sn ∈ B i.o.} = 0 or 1 for every B ∈ B, (ii) lim supn Sn = ∞ a.s. or = −∞ a.s., (iii) lim supn (±Sn ) = ∞ a.s. when the ξn are symmetric.


Proof: (i) This holds by Theorem 4.15, since for any finite permutation p of N we have x_{p_1} + · · · + x_{p_n} = x_1 + · · · + x_n for all but finitely many n.
(ii) By Theorem 4.15 and Lemma 4.9, we have lim sup_n S_n = c a.s. for some constant c ∈ R̄ = [−∞, ∞], and so a.s.

  c = lim sup_n S_{n+1} = lim sup_n (S_{n+1} − ξ_1) + ξ_1 = c + ξ_1.

Since |c| < ∞ would imply ξ_1 = 0 a.s., contradicting the non-degeneracy of ξ_1, we have |c| = ∞.
(iii) Writing

  c = lim sup_n S_n ≥ lim inf_n S_n = − lim sup_n (−S_n) = −c,

we get −c ≤ c ∈ {±∞}, which implies c = ∞. □
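The dichotomy in (ii)−(iii) is easy to observe empirically; the following sketch (NumPy, with ±1 steps as an arbitrary symmetric, non-degenerate example) shows both running extremes of a simulated walk growing without apparent bound:

```python
import numpy as np

rng = np.random.default_rng(2)
steps = rng.choice([-1.0, 1.0], size=1_000_000)   # i.i.d. symmetric, non-degenerate increments
S = np.cumsum(steps)

for m in (10**4, 10**5, 10**6):                   # running max and min keep growing,
    print(m, S[:m].max(), S[:m].min())            # consistent with lim sup (+-S_n) = infinity
```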

From a suitable 0−1 law it is often easy to see that a given event has probability 0 or 1. To distinguish between the two cases is often much harder. The following classical result, known as the Borel–Cantelli lemma, is sometimes helpful, especially when the events are independent. A more general version appears in Corollary 9.21. Theorem 4.18 (Borel, Cantelli) For any A1 , A2 , . . . ∈ A,

Σ n P An < ∞



P {An i.o.} = 0,

and equivalence holds when the An are independent. The first assertion was proved earlier as an application of Fatou’s lemma. Using expected values yields a more transparent argument. Proof: If

Σn P An < ∞, we get by monotone convergence E Σ n 1A = Σ n E1A = Σ n P An < ∞. n

n

Thus, Σ_n 1_{A_n} < ∞ a.s., which means that P{A_n i.o.} = 0.
Now let the A_n be independent with Σ_n P A_n = ∞. Since 1 − x ≤ e^{−x} for all x, we get

  P ∪_{k≥n} A_k = 1 − P ∩_{k≥n} A_k^c
        = 1 − Π_{k≥n} P A_k^c = 1 − Π_{k≥n} (1 − P A_k)
        ≥ 1 − Π_{k≥n} exp(−P A_k)
        = 1 − exp( − Σ_{k≥n} P A_k ) = 1.

Hence, as n → ∞,

  1 = P ∪_{k≥n} A_k ↓ P ∩_n ∪_{k≥n} A_k = P{A_n i.o.},

and so the probability on the right equals 1. □
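Both halves of the lemma are easy to visualise by simulation; in the sketch below (NumPy, with probabilities 1/n and 1/n² chosen as standard examples of a divergent and a convergent series) the first family of independent events keeps occurring at all scales, while the second occurs only finitely often:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000
n = np.arange(1, N + 1)
U = rng.random(N)

occ_div = np.nonzero(U < 1.0 / n)[0] + 1      # independent A_n with P A_n = 1/n   (sum diverges)
occ_conv = np.nonzero(U < 1.0 / n**2)[0] + 1  # independent A_n with P A_n = 1/n^2 (sum converges)

print(len(occ_div), occ_div[-5:])             # occurrences appear arbitrarily late
print(len(occ_conv), occ_conv)                # only a few small indices occur
```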

For many purposes, it suffices to use the Lebesgue unit interval ([0, 1], B[0, 1], λ) as our basic probability space. In particular, the following result ensures the existence on [0, 1] of some independent random variables ξ1 , ξ2 , . . . with arbitrarily prescribed distributions. The present statement is only preliminary. Thus, we will remove the independence assumption in Theorem 8.21, prove an extension to arbitrary index sets in Theorem 8.23, and eliminate the restriction on the spaces in Theorem 8.24. Theorem 4.19 (existence, Borel) For any probability measures μ1 , μ2 , . . . on the Borel spaces S1 , S2 , . . . , there exist some independent random elements ξ1 , ξ2 , . . . on ([0, 1], λ) with distributions μ1 , μ2 , . . . . As a consequence, there exists a probability measure μ on S1 × S2 × · · · satisfying μ ◦ (π1 , . . . , πn )−1 = μ1 ⊗ · · · ⊗ μn , n ∈ N. For the proof, we begin with two special cases of independent interest. By a Bernoulli sequence with rate p we mean a sequence of i.i.d. random variables ξ1 , ξ2 , . . . with P {ξn = 1} = p = 1 − P {ξn = 0}. We further say that a random variable ξ is uniformly distributed on [0, 1], written as U (0, 1), if its distribution L(ξ) equals Lebesgue measure λ on [0, 1]. Every number x ∈ [0, 1] has a binary expansion r1 , r2 , . . . ∈ {0, 1} satisfying x = Σ n rn 2−n , where for uniqueness we require that Σ n rn = ∞ when x > 0. We give a simple construction of Bernoulli sequences on the Lebesgue unit interval. Lemma 4.20 (Bernoulli sequence) Let ξ be a random variable in [0, 1] with binary expansion η1 , η2 , . . . . Then these conditions are equivalent: (i) ξ is U (0, 1), (ii) the ηn form a Bernoulli sequence with rate 12 . 

Proof, (i) ⇒ (ii): If ξ is U(0, 1), then P ∩_{j≤n} {η_j = k_j} = 2^{−n} for all k_1, . . . , k_n ∈ {0, 1}. Summing over k_1, . . . , k_{n−1} gives P{η_n = k} = 1/2 for k = 0 or 1. A similar calculation yields the asserted independence.
(ii) ⇒ (i): Assume (ii). Letting ξ' be U(0, 1) with binary expansion η'_1, η'_2, . . . , we get (η_n) =d (η'_n), and so

  ξ = Σ_n η_n 2^{−n} =d Σ_n η'_n 2^{−n} = ξ'. □
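The correspondence is easy to check numerically; the sketch below (NumPy) extracts the first eight binary digits of simulated U(0, 1) variables and inspects their means and correlations:

```python
import numpy as np

rng = np.random.default_rng(4)
xi = rng.random(100_000)                                        # U(0,1) samples

# digit j of the binary expansion of x is floor(2^j x) mod 2
digits = np.array([np.floor(xi * 2**j) % 2 for j in range(1, 9)])

print(digits.mean(axis=1))            # each digit has mean close to 1/2
print(np.corrcoef(digits)[0, 1:])     # distinct digits are (nearly) uncorrelated
```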

Next we show how a single U (0, 1) random variable can be used to generate a whole sequence.


Lemma 4.21 (replication) There exist some measurable functions f1 , f2 , . . . on [0, 1] such that (i) ⇒ (ii), where (i) ξ is U (0, 1), (ii) ξn = fn (ξ), n ∈ N, are i.i.d. U (0, 1). Proof: Let b1 (x), b2 (x), . . . be the binary expansion of x ∈ [0, 1], and note that the bk are measurable. Rearranging the bk into a two-dimensional array hnj , n, j ∈ N, we define fn (x) = Σ j 2−j hnj (x),

x ∈ [0, 1], n ∈ N.

By Lemma 4.20, the random variables bk (ξ) form a Bernoulli sequence with rate 12 , and the same result shows that the variables ξn = fn (ξ) are U (0, 1). They are further independent by Corollary 4.7. 2 We finally need to construct a random element with specified distribution from a given randomization variable. The result is stated in an extended form for kernels, to meet our needs in Chapters 8, 11, 22, and 27–28. Lemma 4.22 (kernel representation) Consider a probability kernel μ : S → T with T Borel, and let ξ be U (0, 1). Then there exists a measurable function f : S × [0, 1] → T , such that



L f (s, ξ) = μ(s, ·),

s ∈ S.

Proof: We may choose T ∈ B[0,1] , from where we can easily reduce to the case of T = [0, 1]. Define



  f(s, t) = sup{ x ∈ [0, 1]; μ(s, [0, x]) < t },  s ∈ S, t ∈ [0, 1],   (9)

and note that f is product measurable on S × [0, 1], since the set {(s, t); μ(s, [0, x]) < t} is measurable for each x by Lemma 1.13, and the supremum in (9) can be restricted to rational x. If ξ is U (0, 1), we get for x ∈ [0, 1]





  P{ f(s, ξ) ≤ x } = P{ ξ ≤ μ(s, [0, x]) } = μ(s, [0, x]),

and so by Lemma 4.3, f(s, ξ) has distribution μ(s, ·).

2
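The construction in (9) is the familiar inverse-transform sampler; as an illustration only (the kernel μ(s, ·) below, with distribution function x^s on [0, 1], and the grid discretisation are ad hoc choices), the sketch checks that f(s, ξ) has the prescribed distribution:

```python
import numpy as np

GRID = np.linspace(0.0, 1.0, 2001)

def f(s, t):
    """Discretised version of (9): sup{x in [0,1]; mu(s,[0,x]) < t},
    for the example kernel with mu(s,[0,x]) = x**s."""
    below = GRID[GRID ** s < t]
    return below[-1] if below.size else 0.0

rng = np.random.default_rng(5)
s = 3.0
xi = rng.random(20_000)                          # the U(0,1) randomisation variable
samples = np.array([f(s, t) for t in xi])

for x in (0.25, 0.5, 0.75):                      # P{f(s,xi) <= x} should equal mu(s,[0,x]) = x**s
    print(x ** s, np.mean(samples <= x))
```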

Proof of Theorem 4.19: By Lemma 4.22 there exist some measurable functions fn : [0, 1] → Sn with λ ◦ fn−1 = μn . Letting ξ be the identity map on [0, 1] and choosing ξ1 , ξ2 , . . . as in Lemma 4.21, we note that the functions 2 ηn = fn (ξn ), n ∈ N, have the desired joint distribution. We turn to some regularizations and path properties of random processes. Say that two processes X, Y on a common index set T are versions of one another if Xt = Yt a.s. for every t ∈ T . When T = Rd or R+ , two continuous or


right-continuous versions X, Y of a given process are clearly indistinguishable, in the sense that X ≡ Y a.s., so that the entire paths agree outside a fixed null set. For any mapping f between two metric spaces (S, ρ) and (S  , ρ ), we define the modulus of continuity wf = w(f, ·) by



wf (r) = sup ρ (fs , ft ); s, t ∈ S, ρ(s, t) ≤ r ,

r > 0,

so that f is uniformly continuous iff w_f(r) → 0 as r → 0. Say that f is Hölder continuous of order p, if^{10} w_f(r) ≲ r^p as r → 0. The stated property is said to hold locally if it is valid on every bounded set.
A simple moment condition ensures the existence of a Hölder-continuous version of a given process on R^d. Important applications are given in Theorems 14.5, 29.4, and 32.3, and a related tightness criterion appears in Corollary 23.7.

Theorem 4.23 (moments and Hölder continuity, Kolmogorov, Loève, Chentsov) Let X be a process on R^d with values in a complete metric space (S, ρ), such that

  E ρ(X_s, X_t)^a ≲ |s − t|^{d+b},  s, t ∈ R^d,   (10)

for some constants a, b > 0. Then a version of X is locally H¨older continuous of order p, for every p ∈ (0, b/a). Proof: It is clearly enough to consider the restriction of X to [0, 1]d . Define



  D_n = { 2^{−n} (k_1, . . . , k_d); k_1, . . . , k_d ∈ {1, . . . , 2^n} },  n ∈ N,

and put

  ξ_n = max{ ρ(X_s, X_t); s, t ∈ D_n, |s − t| = 2^{−n} },  n ∈ N.

Since

  |{ (s, t) ∈ D_n^2; |s − t| = 2^{−n} }| ≤ d 2^{dn},  n ∈ N,

we get by (10) for any p ∈ (0, b/a)

  E Σ_n (2^{pn} ξ_n)^a = Σ_n 2^{apn} E ξ_n^a
            ≲ Σ_n 2^{apn} 2^{dn} (2^{−n})^{d+b}
            = Σ_n 2^{(ap−b)n} < ∞.

The sum on the left is then a.s. convergent, and so ξ_n ≲ 2^{−pn} a.s. Now any points s, t ∈ ∪_n D_n with |s − t| ≤ 2^{−m} can be connected by a piecewise linear path, which for every n ≥ m involves at most 2d steps between nearest neighbors in D_n. Thus, for any r ∈ [2^{−m−1}, 2^{−m}],

  sup{ ρ(X_s, X_t); s, t ∈ ∪_n D_n, |s − t| ≤ r } ≲ Σ_{n≥m} ξ_n
          ≲ Σ_{n≥m} 2^{−pn} ≲ 2^{−pm} ≲ r^p,

For functions f, g > 0, we mean by f ≲ g that f ≤ c g for some constant c < ∞.

showing that X is a.s. H¨older continuous on n Dn of order p.  In particular, X agrees a.s. on n Dn with a continuous process Y on [0, 1]d ,  and we note that the H¨older continuity of Y on n Dn extends with the same order p to the entire cube [0, 1]d . To show that Y is a version of X, fix any  t ∈ [0, 1]d , and choose t1 , t2 , . . . ∈ n Dn with tn → t. Then Xtn = Ytn a.s. for P each n. Since also Xtn → Xt by (10) and Ytn → Yt a.s. by continuity, we get 2 Xt = Yt a.s. Path regularity can sometimes be established by comparison with a regular process: d

Lemma 4.24 (transfer of regularity) Let X = Y be random processes on T with values in a separable metric space S, where the paths of Y belong to a set U ⊂ S T that is Borel for the σ-field U = (BS )T ∩ U . Then X has a version with paths in U . Proof: For clarity we may write Y˜ for the path of Y , regarded as a random element in U . Then Y˜ is Y -measurable, and Lemma 1.14 yields a measurable ˜ = f (X), and note that mapping f : S T → U with Y˜ = f (Y ) a.s. Define X d ˜ X) = (Y˜ , Y ). Since the diagonal in S 2 is measurable, we get in particular (X, ˜ t = Xt } = P {Y˜t = Yt } = 1, P {X

t ∈ T.

2

We conclude with a characterization of distribution functions on Rd , required in Chapter 6. For any vectors x = (x1 , . . . , xd ) and y = (y1 , . . . , yd ), write x ≤ y for the component-wise inequality xk ≤ yk , k = 1, . . . , d, and similarly for x < y. In particular, the distribution function F of a probability measure μ on Rd is given by F (x) = μ{y; y ≤ x}. Similarly, let x ∨ y denote the componentwise maximum. Put 1 = (1, . . . , 1) and ∞ = (∞, . . . , ∞). For any rectangular box (x, y] = {u; x < u ≤ y} = (x1 , y1 ] × · · · × (xd , yd ], we note that μ(x, y] = Σ u s(u)F (u), where s(u) = (−1)p with p = Σk 1{uk = yk }, and the summation extends over all vertices u of (x, y]. Writing F (x, y] for the stated sum, we say that F has non-negative increments if F (x, y] ≥ 0 for all pairs x < y. We further say that F is right-continuous if F (xn ) → F (x) as xn ↓ x, and proper if F (x) → 1 or 0 as mink xk → ±∞, respectively. We show that any function with the stated properties determines a probability measure on Rd . Theorem 4.25 (distribution functions) For functions F : Rd → [0, 1], these conditions are equivalent: (i) F is the distribution function of a probability measure μ on Rd ,

4. Processes, Distributions, and Independence

97

(ii) F is right-continuous and proper with non-negative increments. Proof: For any F as in (ii), the set function F (x, y] is clearly finitely additive. Since F is proper, we have also F (x, y] → 1 as x → −∞ and y → ∞, i.e., as (x, y] ↑ (−∞, ∞) = Rd . Hence, for every n ∈ N there exists a probability measure μn on (2−n Z)d with Z = {. . . , −1, 0, 1, . . .}, such that 



μn {2−n k} = F 2−n (k − 1), 2−n k ,

k ∈ Zd , n ∈ N.

The finite additivity of F (x, y] yields







μm 2−m (k − 1, k] = μn 2−m (k − 1, k] ,

k ∈ Zd , m < n in N.

(11)

By (11) we can split the Lebesgue unit interval ([0, 1], B[0, 1], λ) recursively to construct some random vectors ξ1 , ξ2 , . . . with distributions μ1 , μ2 , . . . , satisfying ξm − 2−m < ξn ≤ ξm , m < n. In particular, ξ1 ≥ ξ2 ≥ · · · ≥ ξ1 − 1, and so ξn converges pointwise to some random vector ξ. Define μ = λ ◦ ξ −1 . To see that μ has distribution function F , we conclude from the properness of F that

λ ξn ≤ 2−n k = μn (−∞, 2−n k] = F (2−n k),

k ∈ Zd , n ∈ N.

Since also ξn ↓ ξ a.s., Fatou’s lemma yields for dyadic x ∈ Rd λ{ξ < x} = λ{ξn < x ult.} ≤ lim inf λ{ξn < x} ≤ F (x) n→∞ = lim sup λ{ξn ≤ x} n→∞

≤ λ{ξn ≤ x i.o.} ≤ λ{ξ ≤ x}, and so F (x) ≤ λ{ξ ≤ x} ≤ F (x + 2−n 1),

n ∈ N.

Letting n → ∞ and using the right-continuity of F , we get λ{ξ ≤ x} = F (x), 2 which extends to any x ∈ Rd , by the right-continuity of both sides. We also need a version for unbounded measures, extending the one-dimensional Theorem 2.14: Corollary 4.26 (unbounded measures) For any right-continuous function F on Rd with non-negative increments, there exists a measure μ on Rd with μ(x, y] = F (x, y],

x ≤ y in Rd .

98

Foundations of Modern Probability

Proof: For any a ∈ Rd , we may apply Theorem 4.25 to suitably normalized versions of the function Fa (x) = F (a, a ∨ x] to obtain a measure μa on [a, ∞) with μa (a, x] = F (a, x] for all x > a. Then clearly μa = μb on (a ∨ b, ∞) for any a, b, and so the set function μ = supa μa is a measure with the required property. 2

Exercises 1. Give an example of two processes X, Y with different distributions, such that d Xt = Yt for all t. d

2. Let X, Y be {0, 1} -valued processes on an index set T . Show that X = Y iff P {Xt1 + · · · + Xtn > 0} = P {Yt1 + · · · + Ytn > 0} for all n ∈ N and t1 , . . . , tn ∈ T . 3. Let F be a right-continuous function of bounded variation with F (−∞) = 0.  For any random variable ξ, show that EF (ξ) = P {ξ ≥ t} F (dt). (Hint: First let F be the distribution function of a random variable η⊥ ⊥ξ, and use Lemma 4.11.) 4. Given a random variable ξ ∈ L1 and a strictly convex function f on R, show that Ef (ξ) = f (E ξ) iff ξ = E ξ a.s. 5. Let ξ = Σj aj ξj and η = Σj bj ηj , where the sums converge in L2 . Show that Cov(ξ, η) = Σ ij ai bj Cov(ξi , ηj ), where the double series is absolutely convergent.

6. Let the σ-fields Ft,n , t ∈ T , n ∈ N, be non-decreasing in n for fixed t and independent in t for fixed n. Show that the independence extends to the σ-fields  Ft = n Ft,n .

7. For every t ∈ T , let ξ t , ξ1t , ξ2t , . . . be random elements in a metric space St with ξnt → ξ t a.s., such that the ξnt are independent in t for fixed n ∈ N. Show that the independence extends to the limits ξ t . (Hint: First show that E Πt∈S ft (ξ t ) = Πt∈S Eft(ξt) for any bounded, continuous functions ft on St, and for finite subsets S ⊂ T .) 8. Give an example of three events that are pairwise independent but not independent. 9. Give an example of two random variables that are uncorrelated but not independent. 10. Let ξ1 , ξ2 , . . . be i.i.d. random elements with distribution μ in a measurable space (S, S). Fix a set A ∈ S with μA > 0, and put τ = inf{k; ξk ∈ A}. Show that ξτ has distribution μ(· |A) = μ(· ∩ A)/μA. 11. Let ξ1 , ξ2 , . . . be independent random variables with values in [0, 1]. Show  that E Πn ξn = Πn E ξn . In particular, P n An = Πn P An for any independent events A1 , A2 , . . . . 12. For any random variables ξ1 , ξ2 , . . . , prove the existence of some constants c1 , c2 , . . . > 0, such that the series Σ n cn ξn converges a.s.

13. Let ξ1 , ξ2 , . . . be random variables with ξn → 0 a.s. Prove the existence of a measurable function f ≥ 0 with f > 0 outside 0, such that Σ n f (ξn ) < ∞ a.s. Show that the conclusion fails if we assume only L1 -convergence. 14. Give an example of events A1 , A2 , . . . , such that P {An i.o.} = 0 while Σ n P An = ∞.

4. Processes, Distributions, and Independence

99

15. Extend Lemma 4.20 to a correspondence between U (0, 1) random variables ϑ and Bernoulli sequences ξ1 , ξ2 , . . . with rate p ∈ (0, 1). 16. Give an elementary proof of Theorem 4.25 for d = 1. (Hint: Define ξ = F −1 (ϑ) where ϑ is U (0, 1), and note that ξ has distribution function F .) 17. Let ξ1 , ξ2 , . . . be random variables with P {ξn = 0 i.o.} = 1. Prove the existence cn ∈ R, such that P {|cn ξn | > 1 i.o.} = 1. (Hint: Note of some constants

that P Σk≤n |ξk | > 0 → 1.)

Chapter 5

Random Sequences, Series, and Averages Moments and tails, convergence a.s. and in probability, sub-sequence criterion, continuity and completeness, convergence in distribution, tightness and uniform integrability, convergence of means, Lp -convergence, random series and averages, positive and symmetric terms, variance criteria, three-series criterion, strong laws of large numbers, empirical distributions, portmanteau theorem, continuous mapping and approximation, Skorohod coupling, representation and measurability of limits

Here our first aim is to introduce and compare the basic modes of convergence of random quantities. For random elements ξ and ξ1 , ξ2 , . . . in a metric or topological space S, the most commonly used notions are those of almost sure conP vergence (ξn → ξ a.s.) and convergence in probability (ξn → ξ), corresponding to the general notions of convergence a.e. and in measure, respectively. When S = R we also have the concept of Lp -convergence, familiar from Chapter 1. Those three notions are used throughout this book. For a special purpose in Chapter 10, we also need the notion of weak L1 -convergence. Our second major theme is to study the very different notion of converd gence in distribution (ξn → ξ), defined by the condition Ef (ξn ) → Ef (ξ) for all bounded, continuous functions f on S. This is clearly equivalent to weak convergence of the associated distributions μn = L(ξn ) and μ = L(ξ), written w as μn → μ and defined by the condition μn f → μf for every f as above. In this chapter we will establish only the most basic results of weak convergence theory, such as the ‘portmanteau’ theorem, the continuous mapping and approximation theorems, and the Skorohod coupling. Our development of the general theory continues in Chapters 6 and 23, and further distributional limit theorems of various kind appear throughout the remainder of the book. Our third main theme is to characterize the convergence of series Σk ξk and averages n−c Σk≤n ξk , where ξ1 , ξ2 , . . . are independent random variables and c is a positive constant. The two problems are related by the elementary Kronecker lemma, and the main results are the basic three-series criterion and the strong law of large numbers. The former result is extended in Chapter 9 to the powerful martingale convergence theorem, whereas extensions and refinements of the latter result are proved in Chapters 22 and 25. The mentioned theorems are further related to certain weak convergence results presented in Chapters 6–7. © Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1_6

101

102

Foundations of Modern Probability

Before embarking on our systematic study of the various notions of convergence, we consider a couple of elementary but useful inequalities. Lemma 5.1 (moments and tails, Bienaym´e, Chebyshev, Paley & Zygmund) For any random variable ξ ≥ 0 with 0 < E ξ < ∞, we have

1 (E ξ)2 ≤ P ξ > rE ξ ≤ , r > 0. (1 − r)2+ E ξ2 r The upper bound is often referred to as the Chebyshev or Markov inequality. When E ξ 2 < ∞, we get in particular the classical estimate



P |ξ − E ξ| > ε ≤ ε−2 Var(ξ),

ε > 0.

Proof: We may clearly assume that E ξ = 1. The upper bound then follows as we take expectations in the inequality r1{ξ > r} ≤ ξ. To get the lower bound, we note that for any r, t > 0, t2 1{ξ > r} ≥ (ξ − r)(2t + r − ξ) = 2 ξ(r + t) − r(2 t + r) − ξ 2 . Taking expected values, we get for r ∈ (0, 1) t2 P {ξ > r} ≥ 2(r + t) − r(2 t + r) − E ξ 2 ≥ 2 t(1 − r) − E ξ 2 . Now choose t = E ξ 2 /(1 − r).

2

For any random elements ξ and ξ1 , ξ2 , . . . in a separable metric space (S, ρ), P we say that ξn converges in probability to ξ and write ξn → ξ if



lim P ρ(ξn , ξ) > ε = 0,

n→∞

ε > 0.

Lemma 5.2 (convergence in probability) For any random elements ξ, ξ1 , ξ2 , . . . in a separable metric space (S, ρ), these conditions are equivalent: P (i) ξn → ξ,

(ii) E ρ(ξn , ξ) ∧ 1 → 0, (iii) for any sub-sequence N  ⊂ N, we have ξn → ξ a.s. along a further subsequence N  ⊂ N  . P

In particular, ξn → ξ a.s. implies ξn → ξ, and the notion of convergence in probability depends only on the topology, regardless of the metrization ρ. Proof: The implication (i) ⇒ (ii) is obvious, and the converse holds by Chebyshev’s inequality. Now assume (ii), and fix an arbitrary sub-sequence N  ⊂ N. We may then choose a further sub-sequence N  ⊂ N  such that

  E Σ_{n∈N''} { ρ(ξ_n, ξ) ∧ 1 } = Σ_{n∈N''} E{ ρ(ξ_n, ξ) ∧ 1 } < ∞,


where the equality holds by monotone convergence. The series on the left then converges a.s., which implies ξn → ξ a.s. along N  , proving (iii). Now assume (iii). If (ii) fails, there exists an ε > 0 such that E{ρ(ξn , ξ)∧1} > ε along a sub-sequence N  ⊂ N. Then (iii) yields ξn → ξ a.s. along a further sub-sequence N  ⊂ N  , and so by dominated convergence E{ρ(ξn , ξ) ∧ 1} → 0 2 along N  , a contradiction proving (ii). For a first application, we show that convergence in probability is preserved by continuous mappings. Lemma 5.3 (continuous mapping) For any separable metric spaces S, T , let ξ, ξ1 , ξ2 , . . . be random elements in S, and let the mapping f : S → T be measurable and a.s. continuous at ξ. Then P

ξn → ξ



P

f (ξn ) → f (ξ).

Proof: Fix any sub-sequence N  ⊂ N. By Lemma 5.2 we have ξn → ξ a.s. along a further sub-sequence N  ⊂ N  , and so by continuity f (ξn ) → f (ξ) a.s. P along N  . Hence, f (ξn ) → f (ξ) by Lemma 5.2. 2 For any separable metric spaces (Sk , ρk ), we may introduce the product space S = S1 × S2 × · · · endowed with the product topology, admitting the metrization

(1) ρ(x, y) = Σ k 2−k ρk (xk , yk ) ∧ 1 , x, y ∈ S. 

Since BS = k BSk by Lemma 1.2, a random element in S is simply a sequence of random elements in Sk , k ∈ N. Lemma 5.4 (random sequences) Let ξ = (ξ1 , ξ2 , . . .) and ξ n = (ξ1n , ξ2n , . . .), n ∈ N, be random elements in S1 × S2 × · · · , for some separable metric spaces S1 , S2 , . . . . Then P

ξn → ξ

P



ξkn → ξk in Sk , k ∈ N.

Proof: With ρ as in (1), we get for any n ∈ N



E ρ(ξ n , ξ) ∧ 1 = E ρ(ξ n , ξ) =

Σ k 2−k E





ρk (ξkn , ξk ) ∧ 1 .

Thus, by dominated convergence,



E ρ(ξ n, ξ) ∧ 1 → 0







E ρk (ξkn , ξk ) ∧ 1 → 0,

k ∈ N.

2

Combining the last two lemmas, we may show how convergence in probability is preserved by the basic arithmetic operations. P

Corollary 5.5 (elementary operations) For any random variables ξn → ξ P and ηn → η, we have


(i) a ξ_n + b η_n →P a ξ + b η,  a, b ∈ R,
(ii) ξ_n η_n →P ξη,
(iii) ξ_n / η_n →P ξ/η when η, η_1, η_2, . . . ≠ 0 a.s.

Proof: By Lemma 5.4 we have (ξ_n, η_n) →P (ξ, η) in R^2, and so (i) and (ii) follow by Lemma 5.3. To prove (iii), we may apply Lemma 5.3 to the function f : (x, y) → (x/y) 1{y ≠ 0}, which is clearly a.s. continuous at (ξ, η). □

(ii) ξn → ξ for a random element ξ in S. Similar results hold for a.s. convergence. Proof: The a.s. case is immediate from Lemma 1.11. Assuming (ii), we get











E ρ(ξm , ξn ) ∧ 1 ≤ E ρ(ξm , ξ) ∧ 1 + E ρ(ξn , ξ) ∧ 1 → 0, proving (i). Now assume (i), and define



nk = inf n ≥ k; sup E{ρ(ξm , ξn ) ∧ 1} ≤ 2−k , m≥n

k ∈ N.

The nk are finite and satisfy E

Σk





ρ(ξnk , ξnk+1 ) ∧ 1 ≤ Σ k 2−k < ∞,

and so Σk ρ(ξnk , ξnk+1 ) < ∞ a.s. The sequence (ξnk ) is then a.s. Cauchy and converges a.s. toward a measurable limit ξ. To prove (ii), write











E ρ(ξm , ξ) ∧ 1 ≤ E ρ(ξm , ξnk ) ∧ 1 + E ρ(ξnk , ξ) ∧ 1 , and note that the right-hand side tends to zero as m, k → ∞, by the Cauchy 2 convergence of (ξn ) and dominated convergence. For any probability measures μ and μ1 , μ2 , . . . on a metric space (S, ρ) with w Borel σ-field S, we say that μn converges weakly to μ and write μn → μ, if μn f → μf for every f ∈ CˆS , defined as the class of bounded, continuous functions f : S → R. For any random elements ξ and ξ1 , ξ2 , . . . in S, we further w d say that ξn converges in distribution to ξ and write ξn → ξ if L(ξn ) → L(ξ),


i.e., if Ef (ξn ) → Ef (ξ) for all f ∈ CˆS . The latter mode of convergence clearly depends only on the distributions, and the ξn and ξ need not even be defined on the same probability space. To motivate the definition, note that xn → x in a metric space S iff f (xn ) → f (x) for all continuous functions f : S → R, and that L(ξ) is determined by the integrals Ef (ξ) for all f ∈ CˆS . We give a connection between convergence in probability and distribution. Lemma 5.7 (convergence in probability and distribution) Let ξ, ξ1 , ξ2 , . . . be random elements in a separable metric space (S, ρ). Then P

d

ξn → ξ ⇒ ξn → ξ, with equivalence when ξ is a.s. a constant. P

Proof: Let ξn → ξ. For any f ∈ CˆS we need to show that Ef (ξn ) → Ef (ξ). If the convergence fails, there exists a sub-sequence N  ⊂ N such that inf n∈N  |Ef (ξn ) − Ef (ξ)| > 0. Then Lemma 5.2 yields ξn → ξ a.s. along a further sub-sequence N  ⊂ N  . By continuity and dominated convergence, we get Ef (ξn ) → Ef (ξ) along N  , a contradiction. d Conversely, let ξn → s ∈ S. Since ρ(x, s) ∧ 1 is a bounded and continuP ous function of x, we get E{ρ(ξn , s)∧1} → E{ρ(s, s)∧1} = 0, and so ξn → s. 2 A T -indexed family of random vectors ξt in Rd is said to be tight, if



lim sup P |ξt | > r = 0.

r→∞ t∈T

For sequences (ξn ), this is clearly equivalent to



lim lim sup P |ξn | > r = 0, r→∞

(2)

n→∞

which is often easier to verify. Tightness plays an important role for the compactness methods developed in Chapters 6 and 23. For the moment, we note only the following simple connection with weak convergence. Lemma 5.8 (weak convergence and tightness) For random vectors ξ, ξ1 , ξ2 , . . . in Rd , we have d ξn → ξ ⇒ (ξn ) is tight. Proof: Fix any r > 0, and define f (x) = {1 − (r − |x|)+ }+ . Then



lim Ef (ξn ) = Ef (ξ) lim sup P |ξn | > r ≤ n→∞ n→∞





≤ P |ξ| > r − 1 . Here the right-hand side tends to 0 as r → ∞, and (2) follows.

2

We further note the following simple relationship between tightness and convergence in probability.

106

Foundations of Modern Probability

Lemma 5.9 (tightness and convergence in probability) For random vectors ξ1 , ξ2 , . . . in Rd , these conditions are equivalent: (i) (ξn ) is tight, P

(ii) cn ξn → 0 for any constants cn ≥ 0 with cn → 0. Proof: Assume (i), and let cn → 0. Fixing any r, ε > 0, and noting that cn r ≤ ε for all but finitely many n ∈ N, we get







lim sup P |cn ξn | > ε ≤ lim sup P |ξn | > r . n→∞

n→∞

Here the right-hand side tends to 0 as r → ∞, and so P{|c_n ξ_n| > ε} → 0. Since ε was arbitrary, we get c_n ξ_n →P 0, proving (ii).
If instead (i) is false, we have inf_k P{|ξ_{n_k}| > k} > 0 for a sub-sequence (n_k) ⊂ N. Putting c_n = sup{k^{−1}; n_k ≥ n}, we note that c_n → 0, and yet P{|c_{n_k} ξ_{n_k}| > 1} does not tend to 0. Thus, even (ii) fails. □
We turn to a related notion for expected values. The random variables ξ_t, t ∈ T, are said to be uniformly integrable if



  lim_{r→∞} sup_{t∈T} E[ |ξ_t|; |ξ_t| > r ] = 0,   (3)

where E(ξ; A) = E(1_A ξ). For sequences (ξ_n) in L^1, this is clearly equivalent to

  lim_{r→∞} lim sup_{n→∞} E[ |ξ_n|; |ξ_n| > r ] = 0.   (4)

Condition (3) holds in particular if the ξt are Lp -bounded for some p > 1, in the sense that supt E|ξt |p < ∞. To see this, it suffices to write 



E |ξt |; |ξt | > r ≤ r−p+1 E|ξt |p ,

r, p > 0.

We give a useful criterion for uniform integrability. For motivation, we note that if ξ is an integrable random variable, then E(|ξ|; A) → 0 as P A → 0, by Lemma 5.2 and dominated convergence, in the sense that

  lim_{ε→0} sup_{P A < ε} E(|ξ|; A) = 0.

Lemma 5.10 (uniform integrability) For any random variables ξ_t, t ∈ T, these conditions are equivalent:
(i) the ξ_t are uniformly integrable,
(ii) sup_t E|ξ_t| < ∞, and sup_t E(|ξ_t|; A) → 0 as P A → 0.

Proof: Assume (i). For any A ∈ A, we may write

  E( |ξ_t|; A ) ≤ r P A + E[ |ξ_t|; |ξ_t| > r ],  t ∈ T, r > 0.

Here the second part of (ii) follows as we let P A → 0 and then r → ∞. To prove the first part, we may take A = Ω and choose r > 0 large enough. Conversely, assume (ii). By Chebyshev’s inequality, we get as r → ∞ supt P {|ξt | > r} ≤ r−1 supt E|ξt | → 0, and (i) follows from the second part of (ii) with A = {|ξt | > r}.

2
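Condition (3) is easy to inspect numerically for an L^p-bounded family; the sketch below (NumPy, with three rescaled Gaussian samples standing in for the family ξ_t) shows sup_t E[|ξ_t|; |ξ_t| > r] decreasing to 0 as r grows:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400_000
# example family: scaled Gaussians, so sup_t E xi_t^2 < infinity and (3) must hold
family = {t: t * rng.standard_normal(n) for t in (0.5, 1.0, 2.0)}

for r in (1.0, 2.0, 4.0, 8.0):
    sup_tail = max(np.mean(np.abs(x) * (np.abs(x) > r)) for x in family.values())
    print(r, sup_tail)        # sup_t E[|xi_t|; |xi_t| > r] -> 0 as r -> infinity
```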

The relevance of uniform integrability for the convergence of moments is clear from the following result, which also provides a weak convergence version of Fatou’s lemma. Lemma 5.11 (convergence of means) For any random variables ξ, ξ1 , ξ2 , . . . d ≥ 0 with ξn → ξ, we have (i) E ξ ≤ lim inf n E ξn , (ii) E ξn → E ξ < ∞ ⇔ the ξn are uniformly integrable. Proof: For any r > 0, the function x → x ∧ r is bounded and continuous on R+ . Thus, lim inf E ξn ≥ n→∞ lim E(ξn ∧ r) n→∞ = E(ξ ∧ r), and the first assertion follows as we let r → ∞. Next assume (4), and note in particular that E ξ ≤ lim inf n E ξn < ∞. For any r > 0, we get             E ξn − E ξ  ≤ E ξn − E(ξn ∧ r) + E(ξn ∧ r) − E(ξ ∧ r)     + E(ξ ∧ r) − E ξ .

Letting n → ∞ and then r → ∞, we obtain E ξ_n → E ξ. Now assume instead that E ξ_n → E ξ < ∞. Keeping r > 0 fixed, we get as n → ∞

  E[ ξ_n; ξ_n > r ] ≤ E[ ξ_n − ξ_n ∧ (r − ξ_n)_+ ]
          → E[ ξ − ξ ∧ (r − ξ)_+ ].

Conversely, (i) implies ξn → ξ.

108

Foundations of Modern Probability

Proof: First let ξn → ξ in Lp . Then ξn p → ξ p by Lemma 1.31, and Lemma 5.1 yields for any ε > 0





P |ξn − ξ| > ε = P |ξn − ξ|p > εp



≤ ε−p ξn − ξ pp → 0, P

P

which shows that ξn → ξ. We may henceforth assume that ξn → ξ. In pard ticular, |ξn |p → |ξ|p by Lemmas 5.3 and 5.7, and so (ii) ⇔ (iii) by Lemma 5.11. Next assume (ii). If (i) fails, there exists a sub-sequence N  ⊂ N with inf n∈N  ξn − ξ p > 0. Then Lemma 5.2 yields ξn → ξ a.s. along a further sub-sequence N  ⊂ N  , and so Lemma 1.34 gives ξn − ξ p → 0 along N  , a contradiction. Thus, (ii) ⇒ (i), and so all three conditions are equivalent. 2 We consider yet another mode of convergence for random variables. Letting ξ, ξ1 , . . . ∈ Lp for a p ∈ [1, ∞), we say that ξn → ξ weakly in Lp if E ξn η → E ξη for every η ∈ Lq , where p−1 + q −1 = 1. Taking η = |ξ|p−1 sgn ξ gives η q = older’s inequality, ξ p−1 p , and so by H¨ ξ pp = E ξη = lim E ξn η n→∞

lim inf ξn p , ≤ ξ p−1 p n→∞ which implies ξ p ≤ lim inf n ξn p . Now recall that any L2 -bounded sequence has a sub-sequence converging weakly in L2 . The following related criterion for weak compactness in L1 will be needed in Chapter 10. Lemma 5.13 (weak L1 -compactness 1 , Dunford) Every uniformly integrable sequence of random variables has a sub-sequence converging weakly in L1 . Proof: Let (ξn ) be uniformly integrable. Define ξnk = ξn 1{|ξn | ≤ k}, and note that (ξnk ) is L2 -bounded in n for fixed k. By the compactness in L2 and a diagonal argument, there exist a sub-sequence N  ⊂ N and some random variables η1 , η2 , . . . such that ξnk → ηk holds weakly in L2 and then also in L1 , as n → ∞ along N  for fixed k. Now ηk − ηl 1 ≤ lim inf n ξnk − ξnl 1 , and by uniform integrability the righthand side tends to zero as k, l → ∞. Thus, the sequence (ηk ) is Cauchy in L1 , and so it converges in L1 toward some ξ. By approximation it follows easily 2 that ξn → ξ weakly in L1 along N  . We turn to some convergence criteria for random series, beginning with an important special case. Proposition 5.14 (series with positive terms) For any independent random variables ξ1 , ξ2 , . . . ≥ 0,

  Σ_n ξ_n < ∞ a.s.  ⇔  Σ_n E(ξ_n ∧ 1) < ∞.

The converse statement is also true, but it is not needed in this book.


Proof: The right-hand condition yields E Σn (ξn ∧ 1) < ∞ by Fubini’s theorem, and so Σn (ξn ∧ 1) < ∞ a.s. In particular, Σn 1{ξn > 1} < ∞ a.s., and so the series Σn (ξn ∧ 1) and Σn ξn differ by at most finitely many terms, which implies Σn ξn < ∞ a.s. Conversely, let Σn ξn < ∞ a.s. Then also Σn (ξn ∧ 1) < ∞ a.s., and so we may assume that ξn ≤ 1 for all n. Since 1 − x ≤ e−x ≤ 1 − a x for x ∈ [0, 1] with a = 1 − e−1 , we get 



0 < E exp − Σ n ξn = Π n Ee−ξn ≤ Π n (1 − a Eξn ) ≤ Π nexp(−a Eξn ) = exp − a Σ n Eξn , and so

ΣnEξn < ∞.

2

To deal with more general series, we need the following strengthened version of Chebyshev’s inequality. A further extension appears as Proposition 9.16. Lemma 5.15 (maximum inequality, Kolmogorov) Let ξ1 , ξ2 , . . . be independent random variables with mean 0, and put Sn = Σk≤n ξk . Then



P supn |Sn | > r ≤ r−2 Σ nVar(ξn ),

r > 0.

Proof: We may assume that Σ_n E ξ_n^2 < ∞. Writing τ = inf{n; |S_n| > r} and noting that S_k 1{τ = k} ⊥⊥ (S_n − S_k) for k ≤ n, we get

  Σ_{k≤n} E ξ_k^2 = E S_n^2 ≥ Σ_{k≤n} E[ S_n^2; τ = k ]
         ≥ Σ_{k≤n} ( E[ S_k^2; τ = k ] + 2 E[ S_k (S_n − S_k); τ = k ] )
         = Σ_{k≤n} E[ S_k^2; τ = k ] ≥ r^2 P{τ ≤ n}.

As n → ∞, we obtain

  Σ_k E ξ_k^2 ≥ r^2 P{τ < ∞} = r^2 P{ sup_k |S_k| > r }.

2
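The inequality can be compared with simulation; in the sketch below (NumPy, with uniform increments on [−1, 1], so Var(ξ_k) = 1/3, and r chosen so that the bound is non-trivial) the empirical probability stays below the right-hand side:

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps, r = 200, 20_000, 20.0
xi = rng.uniform(-1.0, 1.0, size=(reps, n))       # independent, mean 0, Var = 1/3
S = np.cumsum(xi, axis=1)

lhs = np.mean(np.max(np.abs(S), axis=1) > r)      # P{ sup_k |S_k| > r }
rhs = n * (1.0 / 3.0) / r ** 2                    # r^{-2} * sum_k Var(xi_k)
print(lhs, rhs)                                   # lhs should not exceed rhs
```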

The last result yields a sufficient condition for a.s. convergence of random series with independent terms. Precise criteria will be given in Theorem 5.18. Lemma 5.16 (variance criterion for series, Khinchin & Kolmogorov) Let ξ1 , ξ2 , . . . be independent random variables with mean 0. Then

Σ nVar(ξn) < ∞



Σ n ξn

converges a.s.

110

Foundations of Modern Probability Proof: Write Sn = ξ1 + · · · + ξn . By Lemma 5.15 we get for any ε > 0



P sup |Sn − Sk | > ε ≤ ε−2 Hence, as n → ∞,

k≥n

Σ E ξk2.

k≥n

P

sup |Sh − Sk | ≤ 2 sup |Sn − Sk | → 0,

h,k≥n

k≥n

and Lemma 5.2 yields the corresponding a.s. convergence along a sub-sequence. Since the supremum on the left is non-increasing in n, the a.s. convergence extends to the entire sequence, which means that (Sn ) is a.s. Cauchy convergent. 2 Thus, Sn converges a.s. by Lemma 5.6. We turn to a basic connection between series with positive and symmetric P terms. By ξn → ∞ we mean that P {ξn > r} → 1 for every r > 0. Theorem 5.17 (positive and symmetric terms) Let ξ1 , ξ2 , . . . be independent, symmetric random variables. Then these conditions are equivalent: (i) (ii) (iii)

Σ n ξn converges a.s., Σ n ξn2 < ∞ a.s.,  Σ nE ξn2 ∧ 1 < ∞. 

 



If the conditions fail, then  Σ k≤n ξk  → ∞. P

Proof: The equivalence (ii) ⇔ (iii) holds by Proposition 5.14. Next assume (iii), and conclude from Lemma 5.16 that Σn ξn 1{|ξn | ≤ 1} converges a.s. From (iii) and Fubini’s theorem we have also Σn 1{|ξn | > 1} < ∞ a.s. Hence, the series Σn ξn 1{|ξn | ≤ 1} and Σn ξn differ by at most finitely many terms, and so even the latter series converges a.s. This shows that (i) ⇐ (ii) ⇔ (iii). It P remains to show that |Sn | → ∞ when (ii) fails, since the former condition yields |Sn | → ∞ a.s. along a sub-sequence, contradicting (i). Here Sn = ξ1 + · · · + ξn as before. Thus, suppose that (ii) fails. Then Kolmogorov’s 0−1 law yields Σn ξn2 = ∞ a.s. First let |ξn | = cn be non-random for every n. If the cn are unbounded, then for every r > 0 we may choose a sub-sequence n1 , n2 , . . . ∈ N

such that cn1 > r and cnk+1 > 2cnk for all k. Then clearly P Σj≤k ξnj ∈ I ≤ 2−k for every interval I of length 2r, and so by convolution P {|Sn | ≤ r} ≤ 2−k for all n ≥ nk , which implies P {|Sn | ≤ r} → 0. 2 Next let cn ≤ c < ∞ for all n. Choosing a > 0 so small that cos x ≤ e−a x −1 for |x| ≤ 1, we get for 0 < |t| ≤ c 0 ≤ Ee itSn = ≤

Π exp

k ≤ n



Π cos(t ck )

k≤n  −a t2 c2k

= exp − a t2

c2k Σ k≤n



→ 0.

5. Random Sequences, Series, and Averages

111

Anticipating the elementary Lemma 6.1 below, we get again P {|Sn | ≤ r} → 0 for each r > 0. For general ξn , choose some independent i.i.d. random variables ϑn with P {ϑn = ±1} = 12 , and note that the sequences (ξn ) and (ϑn |ξn |) have the same distribution. Letting μ be the distribution of the sequence (|ξn |), we get by Lemma 4.11



P |Sn | > r =

 

P 





Σ ϑk xk  > r k≤n

μ(dx),

r > 0.

Here the integrand tends to 0 for μ-almost every sequence x = (xn ), by the result for constant |ξn |, and so the integral tends to 0 by dominated convergence. 2 We may now give precise criteria for convergence, a.s. or in distribution, of a series of independent random variables. Write Var(ξ; A) = Var(ξ1A ). Theorem 5.18 (three-series criterion, Kolmogorov, L´evy) Let ξ1 , ξ2 , . . . be independent random variables. Then Σn ξn converges a.s. iff it converges in distribution, which holds iff

(i) Σ n P |ξn | > 1 < ∞, (ii) (iii)





Σ nE ξn; |ξn| ≤ 1 converges, Σ nVar ξn; |ξn| ≤ 1 < ∞.

Our proof requires some simple symmetrization inequalities. Say that m is a median of the random variable ξ if P{ξ > m} ∨ P{ξ < m} ≤ 1/2. A symmetrization of ξ is defined as a random variable of the form ξ̃ = ξ − ξ′, where ξ′ ⊥⊥ ξ with ξ′ =d ξ. For symmetrizations of the random variables ξ_1, ξ_2, ..., we require the same properties for the whole sequences (ξ_n) and (ξ′_n).

Lemma 5.19 (symmetrization) Let ξ̃ be a symmetrization of a random variable ξ with median m. Then for any r > 0,

  (1/2) P{|ξ − m| > r} ≤ P{|ξ̃| > r} ≤ 2 P{|ξ| > r/2}.

Proof: Let ξ̃ = ξ − ξ′ as above, and write

  {ξ − m > r, ξ′ ≤ m} ∪ {ξ − m < −r, ξ′ ≥ m} ⊂ {|ξ̃| > r} ⊂ {|ξ| > r/2} ∪ {|ξ′| > r/2}. □

We also need a simple centering lemma.

Lemma 5.20 (centering) For any random variables ξ, η, ξ_1, ξ_2, ... and constants c_1, c_2, ..., we have

  ξ_n →d ξ and ξ_n + c_n →d η  ⇒  c_n → some c.


Proof: Let ξ_n →d ξ. If c_n → ±∞ along a sub-sequence N′ ⊂ N, then clearly ξ_n + c_n →P ±∞ along N′, which contradicts the tightness of ξ_n + c_n. Thus, the c_n are bounded. Now assume that c_n → a and c_n → b along two sub-sequences N_1, N_2 ⊂ N. Then ξ_n + c_n →d ξ + a along N_1, while ξ_n + c_n →d ξ + b along N_2, and so ξ + a =d ξ + b. Iterating this yields ξ + n(b − a) =d ξ for every n ∈ Z, which is only possible when a = b. Thus, all limit points of (c_n) agree, and c_n converges. □

Proof of Theorem 5.18: Assume (i)−(iii), and define ξ′_n = ξ_n 1{|ξ_n| ≤ 1}. By (iii) and Lemma 5.16 the series Σ_n (ξ′_n − E ξ′_n) converges a.s., and so by (ii) the same thing is true for Σ_n ξ′_n. Finally, P{ξ_n ≠ ξ′_n i.o.} = 0 by (i) and the Borel–Cantelli lemma, and so Σ_n (ξ_n − ξ′_n) has a.s. finitely many non-zero terms. Hence, even Σ_n ξ_n converges a.s.

Conversely, suppose that Σ_n ξ_n converges in distribution. By Lemma 5.19 the sequence of symmetrized partial sums Σ_{k≤n} ξ̃_k is tight, and so Σ_n ξ̃_n converges a.s. by Theorem 5.17. In particular, ξ̃_n → 0 a.s. For any ε > 0 we obtain Σ_n P{|ξ̃_n| > ε} < ∞ by the Borel–Cantelli lemma. Hence, Σ_n P{|ξ_n − m_n| > ε} < ∞ by Lemma 5.19, where m_1, m_2, ... are medians of ξ_1, ξ_2, ... . Using the Borel–Cantelli lemma again, we get ξ_n − m_n → 0 a.s.

Now let c_1, c_2, ... be arbitrary with m_n − c_n → 0. Then even ξ_n − c_n → 0 a.s. Putting η_n = ξ_n 1{|ξ_n − c_n| ≤ 1}, we get a.s. ξ_n = η_n for all but finitely many n, and similarly for the symmetrized variables ξ̃_n and η̃_n. Thus, even Σ_n η̃_n converges a.s. Since the η̃_n are bounded and symmetric, Theorem 5.17 yields Σ_n Var(η_n) = (1/2) Σ_n Var(η̃_n) < ∞. Thus, Σ_n (η_n − E η_n) converges a.s. by Lemma 5.16, as does the series Σ_n (ξ_n − E η_n). Comparing with the distributional convergence of Σ_n ξ_n, we conclude from Lemma 5.20 that Σ_n E η_n converges. In particular, E η_n → 0 and η_n − E η_n → 0 a.s., and so η_n → 0 a.s., whence ξ_n → 0 a.s. Then m_n → 0, and so we may take c_n = 0 in the previous argument, and (i)−(iii) follow. □

A sequence of random variables ξ_1, ξ_2, ... with partial sums S_n is said to obey the strong law of large numbers, if S_n/n converges a.s. to a constant. The weak law is the corresponding property with convergence in probability. The following elementary proposition enables us to convert convergence results for random series into laws of large numbers.

Lemma 5.21 (series and averages, Kronecker) For any a_1, a_2, ... ∈ R and c > 0,

  Σ_{n≥1} n^{−c} a_n converges  ⇒  n^{−c} Σ_{k≤n} a_k → 0.

Proof: Let Σ_n b_n = b with b_n = n^{−c} a_n. By dominated convergence as n → ∞,

  Σ_{k≤n} b_k − n^{−c} Σ_{k≤n} a_k = Σ_{k≤n} (1 − (k/n)^c) b_k
    = c Σ_{k≤n} b_k ∫_{k/n}^1 x^{c−1} dx
    = c ∫_0^1 ( Σ_{k≤nx} b_k ) x^{c−1} dx
    → b c ∫_0^1 x^{c−1} dx = b,

and the assertion follows since the first term on the left tends to b. □
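A quick deterministic sanity check of the lemma (with the ad-hoc choices a_n = (−1)^n n / log(n+1) and c = 1): the series Σ_n n^{−1} a_n is alternating with terms decreasing to 0, hence convergent, so the normalized sums n^{−1} Σ_{k≤n} a_k should drift to 0, if only at a logarithmic rate. The sketch below prints both quantities along the sequence.

```python
import numpy as np

# Ad-hoc illustration of Kronecker's lemma: a_n = (-1)^n * n / log(n+1), c = 1.
N = 1_000_000
n = np.arange(1, N + 1, dtype=float)
a = np.where(n % 2 == 0, 1.0, -1.0) * n / np.log(n + 1.0)

series = np.cumsum(a / n)     # partial sums of the series sum_n n^{-1} a_n (convergent)
kron = np.cumsum(a) / n       # n^{-1} sum_{k<=n} a_k, which the lemma sends to 0

for m in (10**3, 10**5, 10**6):
    print(f"n={m:>8d}   partial series = {series[m - 1]:+.6f}   "
          f"n^-1 * sum_(k<=n) a_k = {kron[m - 1]:+.6f}")
```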

The following simple result illustrates the method.

Corollary 5.22 (variance criterion for averages, Kolmogorov) Let ξ_1, ξ_2, ... be independent random variables with mean 0, and fix any c > 0. Then

  Σ_{n≥1} n^{−2c} Var(ξ_n) < ∞  ⇒  n^{−c} Σ_{k≤n} ξ_k → 0 a.s.

Proof: Since Σ_n n^{−c} ξ_n converges a.s. by Lemma 5.16, the assertion follows by Lemma 5.21. □

In particular, we see that if ξ, ξ_1, ξ_2, ... are i.i.d. with E ξ = 0 and E ξ² < ∞, then n^{−c} Σ_{k≤n} ξ_k → 0 a.s. for any c > 1/2. The statement fails for c = 1/2, as we see by taking ξ to be N(0, 1). The best possible normalization will be given in Corollary 22.8.

The next result characterizes the stated convergence for arbitrary c > 1/2. For c = 1 we recognize the strong law of large numbers. Corresponding criteria for the weak law are given in Theorem 6.17.

Theorem 5.23 (strong laws of large numbers, Kolmogorov, Marcinkiewicz & Zygmund) Let ξ, ξ_1, ξ_2, ... be i.i.d. random variables, put S_n = Σ_{k≤n} ξ_k, and fix a p ∈ (0, 2). Then n^{−1/p} S_n converges a.s. iff these conditions hold, depending on the value of p:

  • for p ∈ (0, 1]: ξ ∈ L^p,
  • for p ∈ (1, 2): ξ ∈ L^p and E ξ = 0.

In that case, the limit equals E ξ when p = 1 and is otherwise equal to 0.

Proof: Assume E|ξ|^p < ∞ and also, for p ≥ 1, that E ξ = 0. Define ξ′_n = ξ_n 1{|ξ_n| ≤ n^{1/p}}, and note that by Lemma 4.4

  Σ_n P{ξ′_n ≠ ξ_n} = Σ_n P{|ξ|^p > n} ≤ ∫_0^∞ P{|ξ|^p > t} dt = E|ξ|^p < ∞.

Hence, the Borel–Cantelli lemma yields P{ξ′_n ≠ ξ_n i.o.} = 0, and so ξ_n = ξ′_n for all but finitely many n ∈ N a.s. It is then equivalent to show that n^{−1/p} Σ_{k≤n} ξ′_k → 0 a.s. By Lemma 5.21 it suffices to prove instead that Σ_n n^{−1/p} ξ′_n converges a.s.

For p < 1, this is clear if we write

  E Σ_n n^{−1/p} |ξ′_n| = Σ_n n^{−1/p} E(|ξ|; |ξ| ≤ n^{1/p})
    < ∫_0^∞ t^{−1/p} E(|ξ|; |ξ| ≤ t^{1/p}) dt
    = E( |ξ| ∫_{|ξ|^p}^∞ t^{−1/p} dt )
    < E|ξ|^p < ∞.

If instead p > 1, then by Theorem 5.18 it suffices to prove that Σ_n n^{−1/p} E ξ′_n converges and Σ_n n^{−2/p} Var(ξ′_n) < ∞. Since E ξ′_n = −E(ξ; |ξ| > n^{1/p}), we have for the former series

  Σ_n n^{−1/p} |E ξ′_n| = Σ_n n^{−1/p} E(|ξ|; |ξ| > n^{1/p})
    < ∫_0^∞ t^{−1/p} E(|ξ|; |ξ| > t^{1/p}) dt
    = E( |ξ| ∫_0^{|ξ|^p} t^{−1/p} dt )
    < E|ξ|^p < ∞.

For the latter series, we obtain

  Σ_n n^{−2/p} Var(ξ′_n) ≤ Σ_n n^{−2/p} E(ξ′_n)²
    = Σ_n n^{−2/p} E(ξ²; |ξ| ≤ n^{1/p})
    < ∫_0^∞ t^{−2/p} E(ξ²; |ξ| ≤ t^{1/p}) dt
    = E( ξ² ∫_{|ξ|^p}^∞ t^{−2/p} dt )
    < E|ξ|^p < ∞.

When p = 1, we have E ξ′_n = E(ξ; |ξ| ≤ n) → 0 by dominated convergence. Thus, n^{−1} Σ_{k≤n} E ξ′_k → 0, and we may prove instead that n^{−1} Σ_{k≤n} ξ″_k → 0 a.s., where ξ″_n = ξ′_n − E ξ′_n. By Lemma 5.21 and Theorem 5.18 it is then enough to show that Σ_n n^{−2} Var(ξ′_n) < ∞, which may be seen as before.

Conversely, suppose that n^{−1/p} S_n = n^{−1/p} Σ_{k≤n} ξ_k converges a.s. Then

  ξ_n / n^{1/p} = S_n / n^{1/p} − ((n−1)/n)^{1/p} S_{n−1} / (n−1)^{1/p} → 0 a.s.,

and in particular P{|ξ_n|^p > n i.o.} = 0. Hence, by Lemma 4.4 and the Borel–Cantelli lemma,

  E|ξ|^p = ∫_0^∞ P{|ξ|^p > t} dt ≤ 1 + Σ_{n≥1} P{|ξ|^p > n} < ∞.

For p > 1, the direct assertion yields n^{−1/p}(S_n − n E ξ) → 0 a.s., and so n^{1−1/p} E ξ converges, which implies E ξ = 0. □
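The theorem is easy to probe by simulation. In the Python sketch below (an ad-hoc setup: symmetric Pareto-type variables with tail index α = 1.5, so that E|ξ|^p < ∞ exactly when p < α, and E ξ = 0 by symmetry), the normalized sums n^{−1/p}|S_n| shrink along a long sample path for p = 1.2 < α, while for p = 1.6 > α no such stabilization should be expected.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha = 1.5                     # tail index: E|xi|^p < infinity exactly when p < alpha
N = 2_000_000
u = rng.random(N)
xi = rng.choice([-1.0, 1.0], size=N) * u ** (-1.0 / alpha)   # symmetric Pareto(alpha), mean 0

S = np.cumsum(xi)
for p in (1.2, 1.6):            # p = 1.2 satisfies xi in L^p; p = 1.6 does not
    line = ", ".join(f"n=10^{k}: {abs(S[10**k - 1]) / (10**k) ** (1 / p):8.3f}"
                     for k in (3, 4, 5, 6))
    print(f"p={p}:  n^(-1/p)|S_n| at {line}")
```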


For a simple application of the law of large numbers, consider an arbitrary sequence of random variables ξ_1, ξ_2, ..., and define the associated empirical distributions as the random probability measures μ̂_n = n^{−1} Σ_{k≤n} δ_{ξ_k}. The corresponding empirical distribution functions F̂_n are given by

  F̂_n(x) = μ̂_n(−∞, x] = n^{−1} Σ_{k≤n} 1{ξ_k ≤ x},  x ∈ R, n ∈ N.

Proposition 5.24 (empirical distribution functions, Glivenko, Cantelli) Let ξ_1, ξ_2, ... be i.i.d. random variables with distribution function F and empirical distribution functions F̂_1, F̂_2, ... . Then

  lim_{n→∞} sup_x |F̂_n(x) − F(x)| = 0 a.s.   (5)

Proof: The law of large numbers yields F̂_n(x) → F(x) a.s. for every x ∈ R. Now fix any finite partition −∞ = x_1 < x_2 < ··· < x_m = ∞. By the monotonicity of F and F̂_n,

  sup_x |F̂_n(x) − F(x)| ≤ max_k |F̂_n(x_k) − F(x_k)| + max_k |F(x_{k+1}) − F(x_k)|.

Letting n → ∞ and refining the partition indefinitely, we get in the limit

  lim sup_{n→∞} sup_x |F̂_n(x) − F(x)| ≤ sup_x ΔF(x) a.s.,

which proves (5) when F is continuous.

For general F, let ϑ_1, ϑ_2, ... be i.i.d. U(0, 1), and define η_n = g(ϑ_n) for each n, where g(t) = sup{x; F(x) < t}. Then η_n ≤ x iff ϑ_n ≤ F(x), and so (η_n) =d (ξ_n). We may then assume that ξ_n ≡ η_n. Writing Ĝ_1, Ĝ_2, ... for the empirical distribution functions of ϑ_1, ϑ_2, ..., we see that also F̂_n = Ĝ_n ∘ F. Writing A = F(R) and using the result for continuous F, we get a.s.

  sup_x |F̂_n(x) − F(x)| = sup_{t∈A} |Ĝ_n(t) − t|
    ≤ sup_{t∈[0,1]} |Ĝ_n(t) − t| → 0. □
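The uniform convergence is easy to observe numerically. In the sketch below (standard exponential data, an ad-hoc choice), the supremum of |F̂_n − F| is computed exactly by evaluating both one-sided deviations at the order statistics, where the supremum of the step function F̂_n against a continuous F is attained.

```python
import numpy as np

rng = np.random.default_rng(3)

def uniform_distance(sample, cdf):
    """sup_x |F_hat_n(x) - F(x)|, attained at the order statistics for a step function."""
    x = np.sort(sample)
    n = len(x)
    Fx = cdf(x)
    upper = np.arange(1, n + 1) / n      # value of F_hat_n at x_(i)
    lower = np.arange(0, n) / n          # value of F_hat_n just to the left of x_(i)
    return max(np.max(np.abs(upper - Fx)), np.max(np.abs(Fx - lower)))

F_exp = lambda x: 1.0 - np.exp(-x)       # true distribution function of Exp(1)
for n in (100, 10_000, 1_000_000):
    sample = rng.exponential(size=n)
    print(f"n = {n:>9d}   sup_x |F_hat_n - F| = {uniform_distance(sample, F_exp):.5f}")
```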

We turn to a more systematic study of convergence in distribution. Though for the moment we are mostly interested in distributions on Euclidean spaces, it is crucial for future applications to consider the more general setting of an abstract metric space. In particular, the theory is applied in Chapter 23 to random elements in various function spaces. For a random element ξ in a metric space S with Borel σ-field S, let S_ξ denote the class of sets B ∈ S with ξ ∉ ∂B a.s., called the ξ-continuity sets.


Theorem 5.25 (portmanteau² theorem, Alexandrov) Let ξ, ξ_1, ξ_2, ... be random elements in a metric space (S, S) with classes G, F of open and closed sets. Then these conditions are equivalent:

  (i)  ξ_n →d ξ,
  (ii) lim inf_{n→∞} P{ξ_n ∈ G} ≥ P{ξ ∈ G},  G ∈ G,
  (iii) lim sup_{n→∞} P{ξ_n ∈ F} ≤ P{ξ ∈ F},  F ∈ F,
  (iv) P{ξ_n ∈ B} → P{ξ ∈ B},  B ∈ S_ξ.

Proof: Assume (i), and fix any G ∈ G. Letting f be continuous with 0 ≤ f ≤ 1_G, we get Ef(ξ_n) ≤ P{ξ_n ∈ G}, and (ii) follows as we let n → ∞ and then f ↑ 1_G. The equivalence (ii) ⇔ (iii) is clear from taking complements.

Now assume (ii)−(iii). For any B ∈ S,

  P{ξ ∈ B°} ≤ lim inf_{n→∞} P{ξ_n ∈ B} ≤ lim sup_{n→∞} P{ξ_n ∈ B} ≤ P{ξ ∈ B̄}.

Here the extreme members agree when ξ ∉ ∂B a.s., and (iv) follows.

Conversely, assume (iv), and fix any F ∈ F. Write F^ε = {s ∈ S; ρ(s, F) ≤ ε}. Then the sets ∂F^ε ⊂ {s; ρ(s, F) = ε} are disjoint, and so ξ ∉ ∂F^ε for almost every ε > 0. For such an ε, we may write P{ξ_n ∈ F} ≤ P{ξ_n ∈ F^ε}, and (iii) follows as we let n → ∞ and then ε → 0.

Finally, assume (ii), and let f ≥ 0 be continuous. By Lemma 4.4 and Fatou's lemma,

  Ef(ξ) = ∫_0^∞ P{f(ξ) > t} dt
    ≤ ∫_0^∞ lim inf_{n→∞} P{f(ξ_n) > t} dt
    ≤ lim inf_{n→∞} ∫_0^∞ P{f(ξ_n) > t} dt
    = lim inf_{n→∞} Ef(ξ_n).   (6)

Now let f be continuous with |f| ≤ c < ∞. Applying (6) to the functions c ± f yields Ef(ξ_n) → Ef(ξ), which proves (i). □

We insert an easy application to subspaces, needed in Chapter 23.

Corollary 5.26 (sub-spaces) Let ξ, ξ_1, ξ_2, ... be random elements in a subspace A of a metric space (S, ρ). Then

  ξ_n →d ξ in (A, ρ)  ⇔  ξ_n →d ξ in (S, ρ).

2 From the French word portemanteau = ‘big suitcase’. Here it is just a collection of criteria, traditionally grouped together.


Proof: Since ξ, ξ_1, ξ_2, ... ∈ A, condition (ii) of Theorem 5.25 is equivalent to

  lim inf_{n→∞} P{ξ_n ∈ A ∩ G} ≥ P{ξ ∈ A ∩ G},  G ⊂ S open.

By Lemma 1.6, this agrees with condition (ii) of Theorem 5.25 for the subspace A. □

Directly from the definitions, it is clear that convergence in distribution is preserved by continuous mappings. The following more general statement is a key result of weak convergence theory.

Theorem 5.27 (continuous mapping, Mann & Wald, Prohorov, Rubin) For any metric spaces S, T and Borel set C ⊂ S, consider some measurable functions f, f_1, f_2, ... : S → T satisfying s_n → s ∈ C ⇒ f_n(s_n) → f(s). Then for any random elements ξ, ξ_1, ξ_2, ... in S,

  ξ_n →d ξ ∈ C a.s.  ⇒  f_n(ξ_n) →d f(ξ).

In particular, we see that if f : S → T is a.s. continuous at ξ, then

  ξ_n →d ξ  ⇒  f(ξ_n) →d f(ξ).
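For a small numerical illustration of the last display (with an ad-hoc setup): if ξ_n is a standardized Binomial(n, 0.3) variable, then ξ_n →d ζ with ζ standard normal, and the continuous map f(x) = x² gives f(ξ_n) →d ζ². The sketch compares the empirical tail P{ξ_n² > c} with the limiting value 2 P{ζ > √c}.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(5)

def normal_tail(z):                     # P{N(0,1) > z}
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

p, c, m = 0.3, 2.0, 200_000             # success prob., threshold, Monte Carlo sample size
for n in (5, 50, 5000):
    x = rng.binomial(n, p, size=m)
    z = (x - n * p) / np.sqrt(n * p * (1 - p))   # standardized, tends to N(0,1) in distribution
    emp = np.mean(z**2 > c)                       # P{ f(xi_n) > c } with f(x) = x^2
    print(f"n={n:>5d}   P(z_n^2 > {c}) ~ {emp:.4f}   "
          f"limit 2*P(N(0,1) > sqrt({c})) = {2 * normal_tail(sqrt(c)):.4f}")
```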

Proof: Fix any open set G ⊂ T, and let s ∈ f^{−1}G ∩ C. By hypothesis there exist an integer m ∈ N and a neighborhood B of s, such that f_k(s′) ∈ G for all k ≥ m and s′ ∈ B. Thus, B ⊂ ∩_{k≥m} f_k^{−1}G, and so

  f^{−1}G ∩ C ⊂ ∪_m ( ∩_{k≥m} f_k^{−1}G )°.

Writing μ, μ_1, μ_2, ... for the distributions of ξ, ξ_1, ξ_2, ..., we get by Theorem 5.25

  μ(f^{−1}G) ≤ μ ∪_m ( ∩_{k≥m} f_k^{−1}G )°
    = sup_m μ ( ∩_{k≥m} f_k^{−1}G )°
    ≤ sup_m lim inf_{n→∞} μ_n ( ∩_{k≥m} f_k^{−1}G )°
    ≤ lim inf_{n→∞} μ_n(f_n^{−1}G).

Then the same theorem yields μ_n ∘ f_n^{−1} →w μ ∘ f^{−1}, which means that f_n(ξ_n) →d f(ξ). □

For a simple consequence, we note a useful randomization principle:

Corollary 5.28 (randomization) For any metric spaces S, T, consider some probability kernels μ, μ_1, μ_2, ... : S → T satisfying

  s_n → s in S  ⇒  μ_n(s_n, ·) →w μ(s, ·) in M_T.

Then for any random elements ξ, ξ_1, ξ_2, ... in S,

  ξ_n →d ξ  ⇒  E μ_n(ξ_n, ·) →w E μ(ξ, ·) in M_T.


Proof: For any bounded, continuous function f on T, the integrals μf and μ_nf are bounded, measurable functions on S, such that s_n → s implies μ_nf(s_n) → μf(s). Hence, Theorem 5.27 yields μ_nf(ξ_n) →d μf(ξ), and so E μ_nf(ξ_n) → E μf(ξ). The assertion follows since f was arbitrary. □

We turn to an equally important approximation theorem. Here the idea is to prove ξ_n →d ξ by approximating ξ_n ≈ η_n and ξ ≈ η, where η_n →d η. The required convergence will then follow, provided the former approximations hold uniformly in n.

Theorem 5.29 (approximation) Let ξ, ξ_n and η^k, η_n^k be random elements in a separable metric space (S, ρ). Then ξ_n →d ξ, whenever

  (i)  η^k →d ξ,
  (ii) η_n^k →d η^k as n → ∞, k ∈ N,
  (iii) lim_{k→∞} lim sup_{n→∞} E[ρ(η_n^k, ξ_n) ∧ 1] = 0.

Proof: For any closed set F ⊂ S and constant ε > 0, we have

  P{ξ_n ∈ F} ≤ P{η_n^k ∈ F^ε} + P{ρ(η_n^k, ξ_n) > ε},

where F^ε = {s ∈ S; ρ(s, F) ≤ ε}. By Theorem 5.25, we get as n → ∞

  lim sup_{n→∞} P{ξ_n ∈ F} ≤ P{η^k ∈ F^ε} + lim sup_{n→∞} P{ρ(η_n^k, ξ_n) > ε}.

Letting k → ∞, we conclude from Theorem 5.25 and (iii) that

  lim sup_{n→∞} P{ξ_n ∈ F} ≤ P{ξ ∈ F^ε}.

Here the right-hand side tends to P{ξ ∈ F} as ε → 0. Since F was arbitrary, Theorem 5.25 yields ξ_n →d ξ. □

We may now consider convergence in distribution of random sequences.

Theorem 5.30 (random sequences) Let ξ = (ξ^1, ξ^2, ...) and ξ_n = (ξ_n^1, ξ_n^2, ...), n ∈ N, be random elements in S_1 × S_2 × ···, for some separable metric spaces S_1, S_2, ... . Then ξ_n →d ξ iff for any functions f_k ∈ Ĉ_{S_k},

  E[ f_1(ξ_n^1) ··· f_m(ξ_n^m) ] → E[ f_1(ξ^1) ··· f_m(ξ^m) ],  m ∈ N.   (7)

In particular, ξ_n →d ξ follows from the finite-dimensional convergence

  (ξ_n^1, ..., ξ_n^m) →d (ξ^1, ..., ξ^m),  m ∈ N.   (8)

If the sequences ξ and ξ_n have independent components, it suffices that ξ_n^k →d ξ^k for each k.


Proof: The necessity is clear from the continuity of the projections s ↦ s^k. To prove the sufficiency, we first assume that (7) holds for a fixed m. Writing S_k = {B ∈ B_{S_k}; ξ^k ∉ ∂B a.s.} and applying Theorem 5.25 m times, we obtain

  P{(ξ_n^1, ..., ξ_n^m) ∈ B} → P{(ξ^1, ..., ξ^m) ∈ B},   (9)

for any set B = B^1 × ··· × B^m with B^k ∈ S_k for all k. Since the S_k are separable, we may choose some countable bases C_k ⊂ S_k, so that C_1 × ··· × C_m becomes a countable base in S_1 × ··· × S_m. Hence, any open set G ⊂ S_1 × ··· × S_m is a countable union of measurable rectangles B_j = B_j^1 × ··· × B_j^m with B_j^k ∈ S_k for all k. Since the S_k are fields, we may easily reduce to the case of disjoint sets B_j. By Fatou's lemma and (9),

  lim inf_{n→∞} P{(ξ_n^1, ..., ξ_n^m) ∈ G} = lim inf_{n→∞} Σ_j P{(ξ_n^1, ..., ξ_n^m) ∈ B_j}
    ≥ Σ_j P{(ξ^1, ..., ξ^m) ∈ B_j}
    = P{(ξ^1, ..., ξ^m) ∈ G},

and so (8) holds by Theorem 5.25.

To see that (8) implies ξ_n →d ξ, fix any a_k ∈ S_k, k ∈ N, and note that the mapping (s^1, ..., s^m) ↦ (s^1, ..., s^m, a_{m+1}, a_{m+2}, ...) is continuous on S_1 × ··· × S_m for each m ∈ N. By (8) it follows that

  (ξ_n^1, ..., ξ_n^m, a_{m+1}, ...) →d (ξ^1, ..., ξ^m, a_{m+1}, ...),  m ∈ N.   (10)

Writing η_n^m and η^m for the sequences in (10) and letting ρ be the metric in (1), we also note that ρ(ξ, η^m) ≤ 2^{−m} and ρ(ξ_n, η_n^m) ≤ 2^{−m} for all m and n. The convergence ξ_n →d ξ now follows by Theorem 5.29. □

For distributional convergence of some random objects ξ_1, ξ_2, ..., the joint distribution of the elements ξ_n is clearly irrelevant. This suggests that we look for a more useful dependence, which may lead to simpler and more transparent proofs.

Theorem 5.31 (coupling, Skorohod, Dudley) Let ξ, ξ_1, ξ_2, ... be random elements in a separable metric space (S, ρ), such that ξ_n →d ξ. Then there exist some random elements ξ̃, ξ̃_1, ξ̃_2, ... in S with

d ξ˜n = ξn ;

ξ˜n → ξ˜ a.s.

Our proof involves families of independent random elements with specified distributions, whose existence is ensured in general by Corollary 8.25 below. When S is complete, we may rely on the more elementary Theorem 4.19. Proof: First take S = {1, . . . , m}, and put pk = P {ξ = k} and pnk = P {ξn = d k}. Letting ϑ⊥ ⊥ ξ be U (0, 1), we may construct some random elements ξ˜n = ξn ,

120

Foundations of Modern Probability

such that ξ˜n = k whenever ξ = k and ϑ ≤ pnk /pk . Since pnk → pk for each k, we get ξ˜n → ξ a.s. For general S, fix any p ∈ N, and choose a partition of S into sets B1 , B2 , . . .  ∈ Sξ of diameter < 2−p . Next choose m so large that P {ξ ∈ k≤m Bk } < 2−p ,  c and put B0 = k≤m Bk . For k = 0, . . . , m, define κ = k when ξ ∈ Bk and d κn = k when ξn ∈ Bk , n ∈ N. Then κn → κ, and the result for finite S d ˜ n → κ a.s. We further introduce some independent yields some κ ˜ n = κn with κ random elements ζnk in S with distributions L(ξn | ξn ∈ Bk ), and define ξ˜np = d Σk ζnk 1{˜κn = k}, so that ξ˜np = ξn for each n. By construction, we have



ρ(ξ˜np , ξ) > 2−p ⊂ {˜ κn = κ} ∪ {ξ ∈ B0 },

n, p ∈ N.

Since κ ˜ n → κ a.s. and P {ξ ∈ B0 } < 2−p , there exists for every p some np ∈ N with

P ∪ ρ(ξ˜np , ξ) > 2−p < 2−p , p ∈ N, n ≥ np

and we may further assume that n1 < n2 < · · · . Then the Borel–Cantelli lemma yields a.s. supn≥np ρ(ξ˜np , ξ) ≤ 2−p for all but finitely many p. Defining d ηn = ξ˜p for np ≤ n < np+1 , we note that ξn = ηn → ξ a.s. 2 n

We conclude with a result on the functional representation of limits, needed in Chapters 18 and 32. To motivate the problem, recall from Lemma 5.6 that, if P ξn → η for some random elements in a complete metric space S, then η = f (ξ) a.s. for some measurable function f : S ∞ → S, where ξ = (ξn ). Since f depends on the distribution μ of ξ, a universal representation would be of the form η = f (ξ, μ). For certain purposes, we need to choose a measurable version of the latter function. To allow constructions by repeated approximation in P probability, we consider the more general case where ηn → η for some random elements ηn = fn (ξ, μ). ˆ S be the space of probability measures μ on For a precise statement, let M S, endowed with the σ-field induced by the evaluation maps μ → μB, B ∈ BS . Proposition 5.32 (representation of limits) Consider a complete metric space (S, ρ), a measurable space U , and some measurable functions f1 , f2 , . . . : U × ˆ U and a function ˆ U → S. Then there exist a measurable set A ⊂ M M f : U × A → S such that, for any random element ξ in U with L(ξ) = μ, P

fn (ξ, μ) → some η



μ ∈ A,

in which case we can choose η = f (ξ, μ). Proof: For sequences s = (s1 , s2 , . . .) in S, define l(s) = limk sk when the limit exists and put l(s) = s∞ otherwise, where s∞ ∈ S is arbitrary. By Lemma 1.11, l is a measurable mapping from S ∞ to S. Next consider a sequence η = (η1 , η2 , . . .) of random elements in S, and put ν = L(η). Define

5. Random Sequences, Series, and Averages

121

n1 , n2 , . . . as in the proof of Lemma 5.6, and note that each nk = nk (ν) is a measurable function of ν. Let C be the set of measures ν with nk (ν) < ∞ for all k, and note that ηn converges in probability iff ν ∈ C. Introduce the measurable function

ˆ S∞ . g(s, ν) = l sn (ν) , sn (ν) , . . . , s = (s1 , s2 , . . .) ∈ S ∞ , ν ∈ M 1

2

If ν ∈ C, the proof of Lemma 5.6 shows that ηnk (ν) converges a.s., and so P ηn → g(η, ν). Now let ηn = fn (ξ, μ) for a random element ξ in U with distribution μ and some measurable functions fn . We need to show that ν is a measurable function of μ. But this is clear from Lemma 3.2 (ii), applied to the kernel κ(μ, ·) = μ ˆ U → S ∞. ˆ U to U and the function F = (f1 , f2 , . . .) : U × M 2 from M For a simple consequence, we consider limits in probability of measurable processes. The resulting statement will be needed in Chapter 18. Corollary 5.33 (measurability of limits, Stricker & Yor) Let X 1 , X 2 , . . . be S-valued, measurable processes on T , for a complete metric space S and a measurable space T . Then there exists a measurable set A ⊂ T , such that P

Xtn → some Xt



t ∈ A,

in which case we can choose X to be product measurable on A × Ω. Proof: Define ξt = (Xt1 , Xt2 , . . .) and μt = L(ξt ). By Proposition 5.32 there exist a measurable set C ⊂ PS ∞ and a measurable function f : S ∞ × C → S, P such that Xtn converges in probability iff μt ∈ C, in which case Xtn → f (ξt , μt ). It remains to note that the mapping t → μt is measurable, which is clear from Lemmas 1.4 and 1.28. 2

Exercises 1. Let ξ1 , . . . , ξn be independent, symmetric random variables. Show that   P ( Σk ξk )2 ≥ r Σk ξk2 ≥ (1 − r)2 /3 for all r ∈ (0, 1). (Hint: Reduce by Lemma 4.11 to the case of non-random |ξk |, and use Lemma 5.1.) 2. Let ξ1 , . . . , ξn be independent, symmetric random variables. Show that P {maxk |ξk | > r} ≤ 2 P {|S| > r} for all r > 0, where S = Σk ξk . (Hint: Let η be d

the first term ξk where maxk |ξk | is attained, and check that (η, S − η) = (η, η − S).)

3. Let ξ1 , ξ2 , . . . be i.i.d. random variables with P {|ξn | > t} > 0 for all t > 0. Prove the existence of some constants c1 , c2 , . . . , such that cn ξn → 0 in probability but not a.s. 4. Show that a family of random variables ξt is tight, iff supt Ef (|ξt |) < ∞ for some increasing function f : R+ → R+ with f (∞) = ∞. P

5. Consider some random variables ξn , ηn , such that (ξn ) is tight and ηn → 0. P Show that even ξn ηn → 0.

122

Foundations of Modern Probability

6. Show that the random variables ξt are uniformly integrable, iff supt Ef (|ξt |) < ∞ for some increasing function f : R+ → R+ with f (x)/x → ∞ as x → ∞. 7. Show that the condition supt E|ξt | < ∞ in Lemma 5.10 can be omitted if A is non-atomic. 8. Let ξ1 , ξ2 , . . . ∈ L1 . Show that the ξn are uniformly integrable, iff the condition in Lemma 5.10 holds with supn replaced by lim supn . 9. Deduce the dominated convergence theorem from Lemma 5.11. 10. Show that if (|ξt |p ) and (|ηt |p ) are uniformly integrable for some p > 0, then so is (|aξt + bηt |p ) for any a, b ∈ R. (Hint: Use Lemma 5.10.) Use this fact to derive Proposition 5.12 from Lemma 5.11. 11. Give examples of random variables ξ, ξ1 , ξ2 , . . . ∈ L2 , such that ξn → ξ a.s. but not in L2 , in L2 but not a.s., or in L1 but not in L2 . 12. Let ξ1 , ξ2 , . . . be independent random variables in L2 . Show that converges in L2 , iff Σ n E ξn and Σ n Var(ξn ) both converge.

Σ n ξn

13. Give an example of some independent, symmetric random variables ξ1 , ξ2 , . . . , such that Σ n ξn converges a.s. but Σ n |ξn | = ∞ a.s. 14. Let ξn , ηn be symmetric random variables with |ξn | ≤ |ηn |, such that the pairs (ξn , ηn ) are independent. Show that Σ n ξn converges whenever Σ n ηn does.

15. Let ξ1 , ξ2 , . . . be independent, symmetric random variables. Show that E{( Σ n ξn )2 ∧ 1} ≤ Σ n E(ξn2 ∧ 1) whenever the latter series converges. (Hint: Integrate over the sets where supn |ξn | ≤ 1 or > 1, respectively.) 16. Consider some independent sequences of symmetric random variables ξk , ηk1 , ηk2 , . . . with |ηkn | ≤ |ξk | such that each k. Show that exercise.)

Σk ξk

Σk ηkn → Σk ηk . (Hint: P

P

converges, and suppose that ηkn → ηk for Use a truncation based on the preceding

17. Let Σn ξn be a convergent series of independent random variables. Show that the sum is a.s. independent of the order of terms iff Σn |E(ξn ; |ξn | ≤ 1)| < ∞. 18. Let the random variables ξnj be symmetric and independent for each n. Show that

Σj ξnj → 0 iff Σj E(ξnj2 ∧ 1) → 0. P

d

d

19. Let ξn → ξ and an ξn → ξ for a non-degenerate random variable ξ and some constants an > 0. Show that an → 1. (Hint: Turning to sub-sequences, we may assume that an → a.) d

d

20. Let ξn → ξ and an ξn + bn → ξ for some non-degenerate random variable ξ, where an > 0. Show that an → 1 and bn → 0. (Hint: Symmetrize.)

21. Let ξ1 , ξ2 , . . . be independent random variables such that an Σk≤n ξk converges in probability for some constants an → 0. Show that the limit is degenerate. 22. Show that Theorem 5.23 fails for p = 2. (Hint: Choose the ξk to be independent and N (0, 1).) 23. Let ξ1 , ξ2 , . . . be i.i.d. and such that n−1/p Σk≤n ξk is a.s. bounded for some p ∈ (0, 2). Show that E|ξ1 |p < ∞. (Hint: Argue as in the proof of Theorem 5.23.) 24. For any p ≤ 1, show that the a.s. convergence in Theorem 5.23 remains valid in Lp . (Hint: Truncate the ξk .)

5. Random Sequences, Series, and Averages

123

25. Give an elementary proof of the strong law of large numbers when E|ξ|4 < ∞. (Hint: Assuming E ξ = 0, show that E Σn (Sn /n)4 < ∞.) 26. Show by examples that Theorem 5.25 fails without the stated restrictions on the sets G, F , B.

27. Use Theorem 5.31 to give a simple proof of Theorem 5.27 when S is separable. Generalize to random elements ξ, ξn in Borel sets C, Cn , respectively, assuming only fn (xn ) → f (x) for xn ∈ Cn and x ∈ C with xn → x. Extend the original proof to that case. 28. Give a short proof of Theorem 5.31 when S = R. (Hint: Note that the distribution functions Fn , F satisfy Fn−1 → F −1 a.e. on [0, 1].)

Chapter 6

Gaussian and Poisson Convergence Characteristic functions and Laplace transforms, equi-continuity and tightness, linear projections, null arrays, Poisson convergence, positive and symmetric terms, central limit theorem, local CLT, Lindeberg condition, Gaussian convergence, weak laws of large numbers, domain of Gaussian attraction, slow variation, Helly’s selection theorem, vague and weak convergence, tightness and weak compactness, extended continuity theorem

This is yet another key chapter, dealing with issues of fundamental importance throughout modern probability. Our main focus will be on the Poisson and Gaussian distributions, along with associated limit theorems, extending the classical central limit theorem and related Poisson approximation. The importance of the mentioned distributions extends far beyond the present context, as will gradually become clear throughout later chapters. Indeed, the Gaussian distributions underlie the construction of Brownian motion, arguably the most important process of the entire subject, whose study leads in turn into stochastic calculus. They also form a basis for the multiple Wiener– Itˆo integrals, of crucial importance in Malliavin calculus and other subjects. Similarly, the Poisson distributions underlie the construction of Poisson processes, which appear as limit laws for a wide variety of particle systems. Throughout this chapter, we will use the powerful machinery of characteristic functions and Laplace transforms, which leads to short and transparent proofs of all the main results. Methods based on such functions also extend far beyond the present context. Thus, characteristic functions are used to derive the basic time-change theorems for continuous martingales, and they underlie the use of exponential martingales, which are basic for the Girsanov theorems and related applications of stochastic calculus. Likewise, Laplace transforms play a basic role throughout random-measure theory, and underlie the notion of potentials, of utmost importance in advanced Markov-process theory. To begin with the basic definitions, consider a random vector ξ in Rd with distribution μ. The associated characteristic function μ ˆ is given by

μ ˆ(t) =

eitx μ(dx) = E eitξ ,

t ∈ Rd ,

where tx denotes the inner product t1 x1 + · · · + td xd . For distributions μ on ˜, given by Rd+ , it is often more convenient to consider the Laplace transform μ

μ ˜(u) =

e−ux μ(dx) = E e−uξ ,

u ∈ Rd+ .

© Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1_7

125

126

Foundations of Modern Probability

Finally, for distributions μ on Z+ , it may be preferable to use the (probability) generating function ψ, given by ψ(s) =

Σ snP {ξ = n} = E s ξ ,

n≥0

s ∈ [0, 1].

Formally, μ ˜(u) = μ ˆ(iu) and μ ˆ(t) = μ ˜(−it), and so the functions μ ˆ and μ ˜ essentially agree apart from domain. Furthermore, the generating function ψ ˜(−log s). is related to the Laplace transform μ ˜ by μ ˜(u) = ψ(e−u ) or ψ(s) = μ Though the characteristic function always exists, it can’t always be extended to an analytic function on the complex plane. ˆ is clearly For any distribution μ on Rd , the characteristic function ϕ = μ uniformly continuous with |ϕ(t)| ≤ ϕ(0) = 1. It is also Hermitian in the sense that ϕ(−t) = ϕ(t), ¯ where the bar denotes complex conjugation. If ξ has characteristic function ϕ, then the linear combination a ξ = a1 ξ1 +· · ·+ad ξd has characteristic function t → ϕ(ta). Also note that, if ξ and η are independent random vectors with characteristic functions ϕ and ψ, then the characteristic function of the pair (ξ, η) equals the tensor product ϕ ⊗ ψ : (s, t) → ϕ(s)ψ(t). In particular, ξ + η has characteristic function ϕ ψ, and for the symmetrized variable ξ − ξ  it equals |ϕ|2 . Whenever applicable, the quoted statements carry over to Laplace transforms and generating functions. The latter functions have the further advantage of being positive, monotone, convex, and analytic—properties that simplify many arguments. We list some simple but useful estimates involving characteristic functions. The second inequality was used already in the proof of Theorem 5.17, and the remaining relations will be useful in the sequel to establish tightness. Lemma 6.1 (tail estimates) For probability measures μ on R,

r 2/r (i) (1 − μ ˆt ) dt, r > 0, μ x; |x| ≥ r ≤ 2 −2/r (ii)

μ[−r, r] ≤ 2 r



1/r

−1/r

|ˆ μt | dt,

r > 0.

If μ is supported by R+ , then also

(iii)



μ[r, ∞) ≤ 2 1 − μ ˜(r−1 ) ,

r > 0.

Proof: (i) Using Fubini’s theorem and noting that sin x ≤ x/2 for x ≥ 2, we get for any c > 0

c −c

(1 − μ ˆt ) dt =





μ(dx) 

c

−c

(1 − eitx ) dt 

sin cx μ(dx) cx

≥ c μ x; |cx| ≥ 2 , = 2c

and it remains to take c = 2/r.

1−

6. Gaussian and Poisson Convergence (ii) Write 1 2

127



1 − cos(x/r) μ(dx) 2 (x/r) = r μ(dx) (1 − r|t|)+ eixt dt

μ[−r, r] ≤ 2



=r ≤r (iii) Noting that e−x
0

1−μ ˜t = ≥

1 2

(1 − e−tx ) μ(dx)



μ x; tx ≥ 1 .

2

Recall that a family of probability measures μa on Rd is said to be tight if



lim sup μa x; |x| > r = 0.

r→∞

a

We may characterize tightness in terms of characteristic functions. Lemma 6.2 (equi-continuity and tightness) For a family of probability meaˆa or Laplace transforms sures μa on Rd or Rd+ with characteristics functions μ μ ˜a , we have μa ) is equi-continuous at 0, (i) (μa ) is tight in Rd ⇔ (ˆ μa ) is equi-continuous at 0. (ii) (μa ) is tight in Rd+ ⇔ (˜ μa ) is uniformly equi-continuous on Rd or Rd+ . In that case, (ˆ μa ) or (˜ Proof: The sufficiency is immediate from Lemma 6.1, applied separately in each coordinate. To prove the necessity, let ξa denote a random vector with distribution μa , and write for any s, t ∈ Rd  

 

 

 

|μ ˆa (s) − μ ˆa (t) | ≤ E  eis ξa − eit ξa 

= E  1 − ei(t−s) ξa  

  ≤ 2 E (t − s) ξa  ∧ 1 .

If {ξa } is tight, then by Lemma 5.9 the right-hand side tends to 0 as t − s → 0, uniformly in a, and the asserted uniform equi-continuity follows. The proof for Laplace transforms is similar. 2 For any probability measures μ, μ1 , μ2 , . . . on Rd , recall that the weak conw vergence μn → μ is defined by μn f → μf for any bounded, continuous function d f on R , where μf denotes the integral f dμ. The usefulness of characteristic functions is mainly due to the following basic result.

128

Foundations of Modern Probability

Theorem 6.3 (characteristic functions, L´evy) Let μ, μ1 , μ2 , . . . be probability ˆ, μ ˆ1 , μ ˆ2 , . . . or Laplace measures on Rd or Rd+ with characteristic functions μ ˜2 , . . . . Then transforms μ ˜, μ ˜1 , μ w

(i) μn → μ on Rd ⇔ μ ˆn (t) → μ ˆ(t), t ∈ Rd , w

˜n (u) → μ ˜(u), u ∈ Rd+ , (ii) μn → μ on Rd+ ⇔ μ in which case μ ˆn → μ ˆ or μ ˜n → μ ˜ uniformly on every bounded set. In particular, a probability measure on Rd or Rd+ is uniquely determined by its characteristic function or Laplace transform, respectively. For the proof of Theorem 6.3, we need some special cases of the Stone–Weierstrass approximation theorem. Here [0, ∞] denotes the compactification of R+ . Lemma 6.4 (uniform approximation, Weierstrass) (i) Any continuous function f : Rd → R with period 2π in each coordinate admits a uniform approximation by linear combinations of cos kx and sin kx with k ∈ Zd+ . (ii) Any continuous function g : [0, ∞]d → R+ admits a uniform approximation by linear combinations of functions e−kx with k ∈ Zd+ . Proof of Theorem 6.3: We consider only (i), the proof of (ii) being similar. w ˆn (t) → μ ˆ(t) for every t, by the definition of weak convergence. If μn → μ, then μ By Lemmas 5.8 and 6.2, the latter convergence is uniform on every bounded set. ˆ(t) for every t. By Lemma 6.1 and dominated Conversely, let μ ˆn (t) → μ convergence, we get for any a ∈ Rd and r > 0



lim sup μn x; |ax| > r ≤ lim

n→∞

n→∞

=

r 2/r 1−μ ˆn (ta) dt 2 −2/r

r 2/r 1−μ ˆ(ta) dt. 2 −2/r

Since μ ˆ is continuous at 0, the right-hand side tends to 0 as r → ∞, which shows that the sequence (μn ) is tight. Given any ε > 0, we may then choose r > 0 so large that μn {|x| > r} ≤ ε for all n and μ{|x| > r} ≤ ε. Now fix any bounded, continuous function f : Rd → R, say with |f | ≤ c < ∞. Let fr denote the restriction of f to the ball {|x| ≤ r}, and extend fr to a continuous function f˜ on Rd with |f˜| ≤ c and period 2πr in each coordinate. By Lemma 6.4 there exists a linear combination g of the functions cos(kx/r) and sin(kx/r), k ∈ Zd+ , such that |f˜ − g| ≤ ε. Writing · for the supremum norm, we get for any n ∈ N     μn f − μn g  ≤ μn {|x| > r} f − f˜ + f˜ − g

≤ (2c + 1) ε,

6. Gaussian and Poisson Convergence

129

and similarly for μ. Thus,         μn f − μf  ≤ μn g − μg  + 2 (2c + 1) ε,

n ∈ N.

Letting n → ∞ and then ε → 0, we obtain μn f → μf . Since f was arbitrary, w 2 this proves that μn → μ. The next result often enables us to reduce the d -dimensional case to that of dimension 1. Corollary 6.5 (one-dimensional projections, Cram´er & Wold) For any random vectors ξ, ξ1 , ξ2 , . . . in Rd or Rd+ , we have d

d

(i) ξn → ξ in Rd ⇔ t ξn → t ξ, t ∈ Rd , d

d

(ii) ξn → ξ in Rd+ ⇔ u ξn → u ξ, u ∈ Rd+ . In particular, the distribution of a random vector ξ in Rd or Rd+ is uniquely determined by those of all linear combinations tξ with t ∈ Rd or Rd+ , respectively. d

Proof: If t ξn → t ξ for all t ∈ Rd , then Eeit ξn → Eeit ξ by the definition of d weak convergence, and so ξn → ξ by Theorem 6.3, proving (i). The proof of (ii) is similar. 2 We now apply the continuity Theorem 6.3 to prove some classical limit theorems, beginning with the case of Poisson convergence. To motivate the introduction of the associated distribution, consider for each n ∈ N some i.i.d. random variables ξn1 , . . . , ξnn with distribution P {ξnj = 1} = 1 − P {ξnj = 0} = cn ,

n ∈ N,

and assume that n cn → c < ∞. Then the sums Sn = ξn1 + . . . + ξnn have generating functions ψn (s) =



1 − (1 − s) cn

→ e−c (1−s) n n c s , = e−c Σ n ≥ 0 n!

n

s ∈ [0, 1].

The limit ψ(s) = e−c(1−s) is the generating function of the Poisson distribution with parameter c, the distribution of a random variable η with probabilities P {η = n} = e−c cn /n! for n ∈ Z+ . Note that η has expected value E η = d ψ  (1) = c. Since ψn → ψ, Theorem 6.3 yields Sn → η. To state some more general instances of Poisson convergence, we need to introduce the notion of a null array 1 . By this we mean a triangular array of 1

Its elements are also said to be uniformly asymptotically negligible, abbreviated as u.a.n.

130

Foundations of Modern Probability

random variables or vectors ξnj , 1 ≤ j ≤ mn , n ∈ N, that are independent for each n and satisfy   (1) supj E |ξnj | ∧ 1 → 0. P

Informally, this means that ξnj → 0 as n → ∞, uniformly in j. When ξnj ≥ 0 for all n and j, we may allow the mn to be infinite. The ‘null’ property can be described as follows in terms of characteristic functions or Laplace transforms: Lemma 6.6 (null arrays) Let (ξnj ) be a triangular array in Rd or Rd+ with ˜nj . Then characteristic functions μ ˆnj or Laplace transforms μ (i) (ξnj ) is null in Rd ⇔ supj |1 − μ ˆnj (t)| → 0, t ∈ Rd , ˜nj (u) → 1, u ∈ Rd+ . (ii) (ξnj ) is null in Rd+ ⇔ inf j μ P

Proof: Relation (1) holds iff ξn,jn → 0 for all sequences (jn ). By Theorem 6.3 this is equivalent to μ ˆ n,jn (t) → 1 for all t and (jn ), which is in turn equivalent to the condition in (i). The proof of (ii) is similar. 2 We may now give a general criterion for Poisson convergence of the row sums in a null array of integer-valued random variables. The result will be extended in Theorem 7.14 to more general limiting distributions and in Theorem 30.1 to the context of point processes. Theorem 6.7 (Poisson convergence) Let (ξnj ) be a null array of Z+ -valued random variables, and let ξ be a Poisson variable with mean c. Then Σj ξnj d

→ ξ iff (i) (ii)

Σj P {ξnj > 1} → 0, Σj P {ξnj = 1} → c.

Here (i) is equivalent to (i ) supj ξnj ∨ 1 → 1, P

and when

Σj ξnj is tight, (i) holds iff every limit is Poisson.

We need an elementary lemma. Related approximations will also be used in Chapters 7 and 30. Lemma 6.8 (sums and products) Consider a null array of constants cnj ≥ 0, and fix any c ∈ [0, ∞]. Then

Π j (1 − cnj ) → e−c



Σ j cnj → c.

6. Gaussian and Poisson Convergence

131

Proof: Since supj cnj < 1 for large n, the first relation is equivalent to Σj log(1 − cnj ) → −c, and the assertion follows since log(1 − x) = −x + o(x) as x → 0. 2 Proof of Theorem 6.7: Let ψnj be the generating function of ξnj . By Theod rem 6.3, the convergence Σj ξnj → ξ is equivalent to Πj ψnj (s) → e−c(1−s) for arbitrary s ∈ [0, 1], which holds by Lemmas 6.6 and 6.8 iff

Σj





1 − ψnj (s) → c(1 − s),

s ∈ [0, 1].

(2)

By an easy computation, the sum on the left equals (1 − s) Σ j P {ξnj > 0} +

Σ (s − sk ) Σ j P {ξnj = k} = T1 + T2,

(3)

k>1

and we also note that s(1 − s) Σ j P {ξnj > 1} ≤ T2 ≤ s Σ j P {ξnj > 1}.

(4)

Assuming (i)−(ii), we note that (2) follows from (3) and (4). Now assume (2). Then for s = 0 we get Σj P {ξnj > 0} → c, and so in general T1 → c(1−s). But then (2) implies T2 → 0, and (i) follows by (4). Finally, (ii) follows by subtraction. To see that (i) ⇔ (i ), we note that



Π j P {ξnj ≤ 1}  = Π j 1 − P {ξnj > 1} . By Lemma 6.8, the right-hand side tends to 1 iff Σj P {ξnj > 1} → 0, which is P supj ξnj ≤ 1 =

the stated equivalence. To prove the last assertion, put cnj = P {ξnj > 0}, and write 









E exp − Σ j ξnj − P supj ξnj > 1 ≤ E exp − Σ j (ξnj ∧ 1)





Π j E exp −(ξnj ∧ 1)

= Π j 1 − (1 − e−1 ) cnj

≤ Π j exp −(1 − e−1 ) cnj

= exp − (1 − e−1 ) Σ j cnj . =

If (i) holds and Σj ξnj → η, then the left-hand side tends to Ee−η > 0, and so the sums cn = Σj cnj are bounded. Hence, cn converges along a sub-sequence N  ⊂ N toward a constant c. But then (i)−(ii) hold along N  , and the first assertion shows that η is Poisson with mean c. 2 d

1 , 2

Next we consider some i.i.d. random variables ξ1 , ξ2 , . . . with P {ξk = ±1} = and write Sn = ξ1 + · · · + ξn . Then n−1/2 Sn has characteristic function ϕn (t) = cosn (n−1/2 t) =



1 − 12 t2 n−1 + O(n−2 )

→ e−t

2 /2

= ϕ(t).

n

132

Foundations of Modern Probability

By a classical computation, the function e−x

∞ −∞

2 /2

eitx e−x

2 /2

dx = (2π)1/2 e−t

has Fourier transform 2 /2

t ∈ R.

,

Hence, ϕ is the characteristic function of a probability measure on R with 2 density (2π)−1/2 e−x /2 . This is the standard normal or Gaussian distribution d N (0, 1), and Theorem 6.3 shows that n−1/2 Sn → ζ, where ζ is N (0, 1). The notation is motivated by the facts that Eζ = 0 and Var(ζ) = 1, where the former relation is obvious by symmetry and latter follows from Lemma 6.9 below. The general Gaussian law N (m, σ 2 ) is defined as the distribution of the random variable η = m + σζ, which has clearly mean m and variance σ 2 . From the form of the characteristic functions together with the uniqueness property, we see that any linear combination of independent Gaussian random variables is again Gaussian. More general limit theorems may be derived from the following technical result. Lemma 6.9 (Taylor expansion) Let ϕ be the characteristic function of a random variable ξ with E|ξ|n < ∞. Then n

(it)k E ξ k + o(tn ), k! k=0

ϕ(t) = Σ

t → 0.

Proof: Noting that |eit − 1| ≤ |t| for all t ∈ R, we get recursively by dominated convergence ϕ(k) (t) = E(iξ)k eitξ ,

t ∈ R, 0 ≤ k ≤ n.

In particular, ϕ(k) (0) = E(iξ)k for k ≤ n, and the result follows from Taylor’s formula. 2 The following classical result, known as the central limit theorem, explains the importance of the normal distributions. The present statement is only preliminary, and more general versions will be obtained by different methods in Theorems 6.13, 6.16, and 6.18. Theorem 6.10 (central limit theorem, Lindeberg, L´evy) Let ξ1 , ξ2 , . . . be i.i.d. random variables with mean 0 and variance 1, and let ζ be N (0, 1). Then n−1/2

Σ ξk → ζ. k≤n d

Proof: Let the ξk have characteristic function ϕ. By Lemma 6.9, the characteristic function of n−1/2 Sn equals ϕn (t) = =



ϕ(n−1/2 t)

n

1 − 12 t2 n−1 + o(n−1 ) 2 /2

→ e−t

,

n

6. Gaussian and Poisson Convergence

133

where the convergence holds as n → ∞ for fixed t.

2

We will also establish a stronger density version, needed in Chapter 30. Here we introduce the smoothing densities 



ph (x) = (πh)−d Π 1 − cos hxi /x2i , i≤d

x = (x1 , . . . , xd ) ∈ Rd ,

with characteristic functions pˆh (t) =

Π i≤d



1 − |ti |/h



+,

t = (t1 , . . . , td ) ∈ Rd .

Say that a distribution on Rd is non-lattice if it is not supported by any shifted, proper sub-group of Rd . Theorem 6.11 (density convergence, Stone) Let μ be a non-lattice distribution on Rd with mean 0 and covariances δij , and write ϕ for the standard normal density on Rd . Then  

 

lim nd/2 μ∗n ∗ ph − ϕ∗n  = 0, n→∞

h > 0.

n−1−d/2 for all i ≤ d, the assertion is trivially fulfilled Proof: Since |∂i ϕ∗n | <

for the standard normal distribution. It is then enough to prove that, for any measures μ and ν as stated,  

 

lim nd/2 (μ∗n − ν ∗n ) ∗ ph  = 0,

h > 0.

n→∞

By Fourier inversion, 



μ∗n − ν ∗n ∗ ph (x) = (2π)−d







ˆn (t) − νˆn (t) dt, eixt pˆh (t) μ

and so we get with Ih = [−h, h]d    ∗n  (μ − ν ∗n ) ∗ ph  ≤ (2π)−d

Ih

   n  μ ˆ (t) − νˆn (t) dt.

It remains to show that the integral on the right declines faster than n−d/2 . For notational convenience we may take d = 1, the general case being similar. By a standard Taylor expansion, as in Lemma 6.9, we have

μ ˆn (t) = 1 − 12 t2 + o(t2 )

n



= exp − 12 n t2 (1 + o(1)) = e−nt

2 /2







1 + n t2 o(1) ,

and similarly for νˆn (t), and so   2  n  μ ˆ (t) − νˆn (t) = e−nt /2 n t2 o(1).

134

Foundations of Modern Probability

Here clearly



n1/2

e−nt

2 /2



n t2 dt =

t2 e−t

2 /2

dt < ∞,

and since μ and ν are non-lattice, we have |ˆ μ| ∨ |ˆ ν | < 1, uniformly on compacts in R \ {0}. Writing Ih = Iε ∪ (Ih \ Iε ), we get for any ε > 0

n1/2 Ih

   n  μ ˆ (t) − νˆn (t) dt < r + h n1/2 mnε ,

ε

where rε → 0 as ε → 0, and mε < 1 for all ε > 0. The desired convergence now follows, as we let n → ∞ and then ε → 0. 2 We now examine the relationship between null arrays of symmetric and positive random variables. In this context, we also derive criteria for convergence toward Gaussian and degenerate limits, respectively. Theorem 6.12 (positive or symmetric terms) Let (ξnj ) be a null array of symmetric random variables, and let ζ be N (0, c) for some c ≥ 0. Then

Σ j ξnj → ζ d



Σ j ξnj2 → c, P

where convergence holds iff (i) (ii)





Σj P |ξnj | >ε → 0, Σj E ξnj2 ∧ 1 → c.

ε > 0,

Here (i) is equivalent to (i ) supj |ξnj | → 0, P

2 is tight, (i) holds iff every limit is Gaussian or and when Σj ξnj or Σj ξnj degenerate, respectively.

The necessity of (i) is remarkable and plays a crucial role in our proof of the more general Theorem 6.16. It is instructive to compare the present statement with the corresponding result for random series in Theorem 5.17. Further note the extended version in Proposition 7.10. Proof: First let is equivalent that

Σj ξnj → ζ. d

Σj E



By Theorem 6.3 and Lemmas 6.6 and 6.8, it

1 − cos(t ξnj ) → 12 c t2 ,

t ∈ R,

(5)

where the convergence is uniform on every bounded interval. Comparing the integrals of (5) over [0, 1] and [0, 2], we get Σj Ef (ξnj ) → 0, where f (0) = 0 and 4 sin x sin 2x + , x ∈ R \ {0}. f (x) = 3 − x 2x Here f is continuous with f (x) → 3 as |x| → ∞, and f (x) > 0 for x = 0. Indeed, the latter relation is equivalent to 8 sin x − sin 2x < 6 x for x > 0,

6. Gaussian and Poisson Convergence

135

which is obvious when x ≥ π/2 and follows by differentiation twice when x ∈ (0, π/2). Writing g(x) = inf y>x f (y) and letting ε > 0 be arbitrary, we get

Σj P









Σ j P f (ξnj) > g(ε) ≤ Σ j Ef (ξnj ) g(ε) → 0,

|ξnj | > ε ≤

which proves (i). 2 P → c, the corresponding symmetrized variables ηnj satisfy If instead Σj ξnj

so Σj P {|ηnj | > ε} → 0 as before. By Lemma 5.19 it follows Σj ηnj → 0, and 2 2 − mnj | > ε} → 0, where the mnj are medians of ξnj , and since that Σj P {|ξnj P

supj mnj → 0, condition (i) follows again. Using Lemma 6.8, we further note that (i) ⇔ (i ). Thus, we may henceforth assume that (i) is fulfilled. Next we note that, for any t ∈ R and ε > 0,

Σj E



1 − cos(t ξnj ); |ξnj | ≤ ε

= 12 t2 1 − O(t2 ε2 )





Σj E





2 ξnj ; |ξnj | ≤ ε .

Assuming (i), the equivalence between (5) and (ii) now follows as we let n → ∞ 2 and then ε → 0. To prove the corresponding result for the variables ξnj , we may write instead, for any t, ε > 0,

Σj E



2





2 1 − e−t ξnj ; ξnj ≤ ε = t 1 − O(t ε)



Σj E





2 2 ξnj ; ξnj ≤ε ,

and proceed as before. This completes the proof of the first assertion. d Finally, assume that (i) holds and Σj ξnj → η. Then the same relation holds for the truncated variables ξnj 1{|ξnj | ≤ 1}, and so we may assume that 2 . If cn → ∞ along a sub|ξnj | ≤ 1 for all j and k. Define cn = Σj Eξnj −1/2 sequence, then the distribution of cn Σj ξnj tends to N (0, 1) by the first assertion, which is impossible by Lemmas 5.8 and 5.9. Thus, (cn ) is bounded and converges along a sub-sequence. By the first assertion, Σj ξnj then tends to a Gaussian limit, and so even η is Gaussian. 2 We may now prove a classical criterion for Gaussian convergence, involving normalization by second moments. Theorem 6.13 (Gaussian variance criteria, Lindeberg, Feller) Let (ξnj ) be a triangular array with E ξnj = 0 and Σj Var(ξnj ) → 1, and let ζ be N (0, 1). Then these conditions are equivalent: (i) (ii)

Σj ξnj → ζ and supj Var(ξnj ) → 0,   Σj E ξnj2 ; |ξnj | > ε → 0, ε > 0. d

Here (ii) is the celebrated Lindeberg condition. Our proof is based on two elementary lemmas.

136

Foundations of Modern Probability

Lemma 6.14 (product comparison) For any complex numbers z1 , . . . , zn and z1 , . . . , zn of modulus ≤ 1, we have   

 

Π k zk − Π k zk  ≤ Σ k |zk − zk |.

Proof: For n = 2 we get |z1 z2 − z1 z2 | ≤ |z1 z2 − z1 z2 | + |z1 z2 − z1 z2 | ≤ |z1 − z1 | + |z2 − z2 |, 2

and the general result follows by induction. Lemma 6.15 (Taylor expansion) For any t ∈ R and n ∈ Z+ ,   it e −

n

Σ k=0

|t|n+1 (it)k  2|t|n ≤ ∧ . k! n! (n + 1)!

Proof: Letting hn (t) denote the difference on the left, we get

hn (t) = i

0

t

t > 0, n ∈ Z+ .

hn−1 (s)ds,

Starting from the obvious relations |h−1 | ≡ 1 and |h0 | ≤ 2, we get by induction 2 |hn−1 (t)| ≤ |t|n /n! and |hn (t)| ≤ 2|t|n /n!. Returning to the proof of Theorem 6.13, we consider at this point only the sufficiency of condition (ii), needed for the proof of the main Theorem 6.16. The necessity will be established after the proof of that theorem. 2 and cn = Σj cnj . First Proof of Theorem 6.13, (ii) ⇒ (i): Write cnj = E ξnj note that for any ε > 0,



2 supj cnj ≤ ε2 + supj E ξnj ; |ξnj | > ε







2 ≤ ε2 + Σ j E ξnj ; |ξnj | > ε ,

which tends to 0 under (ii), as n → ∞ and then ε → 0. Now introduce some independent random variables ζnj with distributions d N (0, cnj ), and note that ζn = Σj ζnj is N (0, cn ). Hence, ζn → ζ. Letting ϕnj and ψnj denote the characteristic functions of ξnj and ζnj , respectively, it remains by Theorem 6.3 to show that Πj ϕnj − Πj ψnj → 0. Then conclude from Lemmas 6.14 and 6.15 that, for fixed t ∈ R,   

 

Π j ϕnj (t) − Π j ψnj (t)   ≤ Σ j ϕnj (t) − ψnj (t)         ≤ Σ j ϕnj (t) − 1 + 12 t2 cnj  + Σ j ψnj (t) − 1 + 12 t2 cnj  2 2 < E ξnj (1 ∧ |ξnj |) + Σ j E ζnj (1 ∧ |ζnj |).

Σj

6. Gaussian and Poisson Convergence

137

For any ε > 0, we have

Σ j E ξnj2 (1 ∧ |ξnj |) ≤ ε Σ j cnj + Σ j E





2 ξnj ; |ξnj | > ε ,

which tends to 0 by (ii), as n → ∞ and then ε → 0. Further note that

Σ j E ζnj2 (1 ∧ |ζnj |)



Σ j E|ζnj |3 3/2 Σ j cnj E|ξ|3

= 1/2 < cn supj cnj → 0,

2

by the first part of the proof.

We may now derive precise criteria for convergence to a Gaussian limit. Note the striking resemblance with the three-series criterion in Theorem 5.18. A far-reaching extension of the present result is obtained by different methods in Chapter 7. As before, we write Var(ξ; A) = Var(ξ1A ). Theorem 6.16 (Gaussian convergence, Feller, L´evy) Let (ξnj ) be a null array of random variables, and let ζ be N (b, c) for some constants b, c. Then d Σj ξnj → ξ iff (i) (ii) (iii)





Σj P |ξnj | > ε →0, ε > 0, Σj E ξnj ; |ξnj | ≤ 1 → b, Σj Var ξnj ; |ξnj | ≤ 1 → c.

Here (i) is equivalent to (i ) supj |ξnj | → 0, P

and when

Σj ξnj is tight, (i) holds iff every limit is Gaussian.

Proof: To see that (i) ⇔ (i ), we note that







P supj |ξnj | > ε = 1 − Π j 1 − P {|ξnj | > ε} ,

ε > 0.

Since supj P {|ξnj | > ε} → 0 under both conditions, the assertion follows by Lemma 6.8. d Now let Σj ξnj → ζ. Introduce medians mnj and symmetrizations ξ˜nj of d ˜ where the variables ξnj , and note that mn ≡ supj |mnj | → 0 and Σj ξ˜nj → ζ, ˜ ζ is N (0, 2c). By Lemma 5.19 and Theorem 6.12, we get for any ε > 0

Σj P









Σ j P |ξ nj − mnj | > ε − mn ≤ 2 Σ j P |ξ˜nj | > ε − mn → 0.

|ξnj | > ε ≤

P

Thus, we may henceforth assume condition (i) and hence that supj |ξnj | → 0. Then

Σj ξnj

d

→ η is equivalent to

Σj ξnj

 → η, where ξnj = ξnj 1{|ξnj | ≤ 1}, d

138

Foundations of Modern Probability

and so we may further let |ξnj | ≤ 1 a.s. for all n and j. Then (ii) and (iii) reduce to bn ≡ Σj E ξnj → b and cn ≡ Σj Var(ξnj ) → c, respectively. Write bnj = E ξnj , and note that supj |bnj | → 0 by (i). Assuming (ii)−(iii),

Σj ξnj − bn → ζ − b by Theorem 6.13, and so Σj ξnj → ζ. Conversely, d d Σj ξnj → ζ implies Σj ξ˜nj → ζ,˜ and (iii) follows by Theorem 6.12. But then d Σj ξnj − bn → ζ − b, whence by Lemma 5.20 the bn converge toward some b. d This gives Σj ξnj → ζ + b − b, and so b = b, which means that even (ii) is d

we get

d

fulfilled. It remains to prove, under condition (i), that any limiting distribution is d d Gaussian. Then assume Σj ξnj → η, and note that Σj ξ˜nj → η˜, where η˜ is a symmetrization of η. If cn → ∞ along a sub-sequence, then c−1/2 Σj ξ˜nj tends n to N (0, 2) by the first assertion, which is impossible by Lemma 5.9. Thus, (cn ) is bounded, and so the convergence cn → c holds along a sub-sequence. But then Σnj ξnj − bn tends to N (0, c), again by the first assertion, and Lemma 5.20 shows that even bn converges toward a limit b. Hence, Σnj ξnj tends to N (b, c), which is then the distribution of η. 2 Proof of Theorem 6.13, (i) ⇒ (ii): The second condition in (i) shows that (ξnj ) is a null array. Furthermore, we have for any ε > 0

Σ j Var







Σ j E ξnj2 ; |ξnj | ≤ ε 2 ≤ Σ j E ξnj → 1.

ξnj ; |ξnj | ≤ ε ≤



By Theorem 6.16 even the left-hand side tends to 1, and (ii) follows.

2

Using Theorem 6.16, we may derive the ultimate versions of the weak laws of large numbers. Note that the present conditions are only slightly weaker than those for the strong laws in Theorem 5.23. Theorem 6.17 (weak laws of large numbers) Let ξ, ξ1 , ξ2 , . . . be i.i.d. random variables, put Sn = Σk≤n ξk , and fix a p ∈ (0, 2). Then n−1/p Sn converges in probability iff these conditions hold as r → ∞, depending on the value of p: • for p = 1: • for p = 1:

rp P {|ξ| > r} → 0, rP {|ξ| > r} → 0 and E(ξ; |ξ| ≤ r) → some c ∈ R.

In that case, the limit equals c when p = 1, and is otherwise equal to 0. Proof: By Theorem 6.16 applied to the null array of random variables ξnj = n−1/p ξj , j ≤ n, the stated convergence is equivalent to



(i) n P |ξ| > n1/p ε → 0, ε > 0, 



(ii) n1−1/p E ξ; |ξ| ≤ n1/p → c, 



(iii) n1−2/p Var ξ; |ξ| ≤ n1/p → 0.

6. Gaussian and Poisson Convergence

139

Here (i) is equivalent to rp P {|ξ| > r} → 0, by the monotonicity of P {|ξ| > r1/p }. Furthermore, Lemma 4.4 yields for any r > 0 





rp−2 Var ξ; |ξ| ≤ r ≤ rp E (ξ/r)2 ∧ 1

1

= rp r





P |ξ| ≥ r t dt,

0       p E ξ; |ξ| ≤ r  ≤ r E |ξ/r| ∧ 1 1

p−1 

= rp

0

P |ξ| ≥ rt dt.

Since t−a is integrable on [0, 1] for any a < 1, we get by dominated convergence (i) ⇒ (iii), and also (i) ⇒ (ii) with c = 0 when p < 1. If instead p > 1, we see from (i) and Lemma 4.4 that



E|ξ| = <

0 ∞  0





P |ξ| > r dr 

1 ∧ r−p dr < ∞.

Then E(ξ; |ξ| ≤ r) → E ξ, and (ii) yields E ξ = 0. Moreover, (i) implies 







rp−1 E |ξ|; |ξ| > r = rp P |ξ| > r + rp−1

r







P |ξ| > t dt → 0.

Assuming in addition that E ξ = 0, we obtain (ii) with c = 0. When p = 1, condition (i) yields 







n P |ξ| > n → 0. E |ξ|; n < |ξ| ≤ n + 1 <

Hence, under (i), condition (ii) is equivalent to E(ξ; |ξ| ≤ r) → c.

2

For a further extension of the central limit theorem in Proposition 6.10, we characterize convergence toward a Gaussian limit of suitably normalized partial sums from a single i.i.d. sequence. Here a non-decreasing function L ≥ 0 is said to vary slowly at ∞, if supx L(x) > 0 and L(cx) ∼ L(x) as x → ∞ for each c > 0. This holds in particular when L is bounded, but it is also true for many unbounded functions, such as for log(x ∨ 1). Theorem 6.18 (domain of Gaussian attraction, L´evy, Feller, Khinchin) Let ξ, ξ1 , ξ2 , . . . be i.i.d., non-degenerate random variables, and let ζ be N (0, 1). Then these conditions are equivalent: (i) there exist some constants an and mn , such that an

Σ (ξk − mn) → ζ, k≤n d





(ii) the function L(x) = E ξ 2 ; |ξ| ≤ x varies slowly at ∞. We may then take mn ≡ E ξ. Finally, (i) holds with an ≡ n−1/2 and mn ≡ 0 iff E ξ = 0 and E ξ 2 = 1.

140

Foundations of Modern Probability

Even other stable distributions may occur as limits, though the convergence criteria are then more complicated. Our proof of Theorem 6.18 is based on a technical result, where for every m ∈ R we define



Lm (x) = E (ξ − m)2 ; |ξ − m| ≤ x ,

x > 0.

Lemma 6.19 (slow variation, Karamata) Let ξ be a non-degenerate random variable, such that L0 varies slowly at ∞. Then so does the function Lm for every m ∈ R, and as x → ∞, 

x2−p E |ξ|p ; |ξ| > x L0 (x)



→ 0,

p ∈ [0, 2).

(6)

Proof: Fix any constant r ∈ (1, 22−p ), and choose x0 > 0 so large that L0 (2x) ≤ rL0 (x) for all x ≥ x0 . For such an x, we get with summation over n≥0  

x2−p E |ξ|p ; |ξ| > x = x2−p Σ n E |ξ|p ; |ξ|/x ∈ (2n , 2n+1 ]

Σ n 2 (p−2)nE ξ 2; |ξ|/x ∈ (2n, 2n+1] ≤ Σ n 2 (p−2)n (r − 1) rn L0 (x) ≤

=



(r − 1)L0 (x) . 1 − 2 p−2 r

Here (6) follows, as we divide by L0 (x) and let x → ∞ and then r → 1. In particular, E|ξ|p < ∞ for all p < 2. If even E ξ 2 < ∞, then E(ξ − m)2 < ∞, and the first assertion is obvious. If instead E ξ 2 = ∞, we may write 







Lm (x) = E ξ 2 ; |ξ − m| ≤ x + m E m − 2 ξ; |ξ − m| ≤ x . Here the last term is bounded, and the first term lies between the bounds L0 (x ± m) ∼ L0 (x). Thus, Lm (x) ∼ L0 (x), and the slow variation of Lm fol2 lows from that of L0 . Proof of Theorem 6.18: Let L vary slowly at ∞. Then so does Lm with m = E ξ by Lemma 6.19, and so we may take Eξ = 0. Now define



cn = 1 ∨ sup x > 0; nL(x) ≥ x2 ,

n ∈ N,

and note that cn ↑ ∞. From the slow variation of L it is further clear that cn < ∞ for all n, and that moreover nL(cn ) ∼ c2n . In particular, cn ∼ n1/2 iff L(cn ) ∼ 1, i.e., iff Var(ξ) = 1. We shall verify the conditions of Theorem 6.16 with b = 0, c = 1, and ξnj = ξj /cn , j ≤ n. Beginning with (i), let ε > 0 be arbitrary, and conclude from Lemma 6.19 that

c2 P {|ξ| > cn ε} c2 P {|ξ| > cn ε} ∼ n → 0. n P |ξ/cn | > ε ∼ n L(cn ) L(cn ε)

6. Gaussian and Poisson Convergence

141

Recalling that E ξ = 0, we get by the same lemma   

 



n E ξ/cn ; |ξ/cn | ≤ 1  ≤ n c−1 n E |ξ|; |ξ| > cn 



cn E |ξ|; |ξ| > cn L(cn )





→ 0,

(7)

which proves (ii). To obtain (iii), we note that by (7) 

n Var ξ/cn ; |ξ/cn | ≤ 1

 

= n c−2 n L(cn ) − n E ξ/cn ; |ξ| ≤ cn

 2

→ 1.

Hence, Theorem 6.16 yields (i) with an = c−1 n and mn ≡ 0. Now assume (i) for some constants an and mn . Then a corresponding √ result ˜ ξ˜1 , ξ˜2 , . . . , with constants an / 2 and 0, holds for the symmetrized variables ξ, d ξ˜k → ζ. Here clearly cn → ∞, and and so we may assume that c−1 n

Σk≤n ˜ c−1 n+1 Σk≤n ξk

d

moreover cn+1 ∼ cn , since even → ζ by Theorem 5.29. Now define for x > 0 ˜ > x}, T˜(x) = P {|ξ|   ˜ ≤x , ˜ L(x) = E ξ˜2 ; |ξ|   U˜ (x) = E ξ˜2 ∧ x2 . ˜ Then Theorem 6.16 yields n T˜(cn ε) → 0 for all ε > 0, and also n c−2 n L(cn ) → 1. 2 ˜ ˜ Thus, cn T (cn ε)/L(cn ) → 0, which extends by monotonicity to x2 T˜(x) x2 T˜(x) ≤ → 0, ˜ U˜ (x) L(x)

x → ∞.

Next define for any x > 0 T (x) = P {|ξ| > x}, U (x) = E( ξ 2 ∧ x2 ). By Lemma 5.19, we have T (x + |m|) ≤ 2 T˜(x) for any median m of ξ. Furthermore, we get by Lemmas 4.4 and 5.19 U˜ (x) =



x2 0

≤2

0

P {ξ˜ 2 > t} dt

x2

P {4 ξ 2 > t} dt

= 8 U (x/2). Hence, as x → ∞, 4 x2 T (x) L(2x) − L(x) ≤ L(x) U (x) − x2 T (x) 8 x2 T˜(x − |m|) → 0, ≤ 8−1 U˜ (2x) − 2x2 T˜(x − |m|)

142

Foundations of Modern Probability

which shows that L is slowly varying. d Finally, let n−1/2 Σ k≤n ξk → ζ. By the previous argument with cn = n1/2 ˜ 1/2 ) → 2, which implies E ξ˜2 = 2 and hence Var(ξ) = 1. But then we get L(n d 2 n−1/2 Σ k≤n (ξk − E ξ) → ζ, and so by comparison E ξ = 0. We return to the general problem of characterizing weak convergence of probability measures μn on Rd in terms of the associated characteristic func˜ n . Suppose that μ ˆn or μ ˜n converges toward tions μ ˆn or Laplace transforms μ a continuous limit ϕ, which is not known in advance to be a characteristic function or Laplace transform. To conclude that μn tends weakly toward some measure μ, we need an extended version of Theorem 6.3, whose proof requires a compactness argument. with Here let Md be the space of locally finite measures on Rd , endowed  the vague topology generated by the evaluation maps πf : μ → μf = f dμ for all f ∈ Cˆ+ , where Cˆ+ is the class of continuous functions f ≥ 0 on Rd with v bounded support. Thus, μn converges vaguely to μ, written as μn → μ, iff μn f → μf for all f ∈ Cˆ+ . If the μn are probability measures, then clearly μ = μRd ≤ 1. We prove a version of Helly’s selection theorem, showing that the set of probability measures on Rd is vaguely relatively sequentially compact. Theorem 6.20 (vague compactness, Helly) Every sequence of probability measures on Rd has a vaguely convergent sub-sequence. Proof: Fix some probability measures μ1 , μ2 , . . . on Rd , and let F1 , F2 , . . . be the corresponding distribution functions. Write Q for the set of rational numbers. By a diagonal argument, Fn converges on Qd along a sub-sequence N  ⊂ N toward a limit G, and we may define



F (x) = inf G(r); r ∈ Qd , r > x ,

x ∈ Rd .

(8)

Since each Fn has non-negative increments, so has G and hence also F . We further see from (8) and the monotonicity of G that F is right-continuous. Hence, Corollary 4.26 yields a measure μ on Rd with μ(x, y] = F (x, y] for any v bounded rectangular box (x, y] ⊂ Rd , and we need to show that μn → μ along N . Then note that Fn (x) → F (x) at every continuity point x of F . By the monotonicity of F , there exists a countable, dense set D ⊂ R, such that F is continuous on C = (Dc )d . Then μn U → μU for every finite union U of rectangular boxes with corners in C, and for any bounded Borel set B ⊂ Rd , a simple approximation yields inf μn B μB o ≤ lim n→∞ ¯ ≤ lim sup μn B ≤ μB. n→∞

(9)

6. Gaussian and Poisson Convergence

143

For any bounded μ-continuity set B, we may consider functions f ∈ Cˆ+ supported by B, and show as in Theorem 5.25 that μn f → μf , proving that v 2 μn → μ. v

If μn → μ for some probability measures μn on Rd , we may still have μ < 1, due to an escape of mass to infinity. To exclude this possibility, we need to assume that (μn ) is tight. Lemma 6.21 (vague and weak convergence) Let μ1 , μ2 , . . . be probability meav sures on Rd , such that μn → μ for a measure μ. Then these conditions are equivalent: (i) (μn ) is tight, (ii) μ = 1, w (iii) μn → μ. Proof, (i) ⇔ (ii): By a simple approximation, the vague convergence yields (9) for every bounded Borel set B, and in particular for the balls Br = {x ∈ Rd ; |x| < r}, r > 0. If (ii) holds, then μBr → 1 as r → ∞, and the first inequality gives (i). Conversely, (i) implies lim supn μn Br → 1, and the last inequality yields (ii). (i) ⇔ (iii): Assume (i), and fix any bounded continuous function f : Rd → R. For any r > 0, we may choose a gr ∈ Cˆ+ with 1Br ≤ gr ≤ 1, and note that                 μn f − μf  ≤ μn f − μn f gr  + μn f gr − μf gr  + μf gr − μf      ≤ μn f gr − μf gr  + f (μn + μ)B0c .

Here the right-hand side tends to zero as n → ∞ and then r → ∞, and so 2 μn f → μf , proving (iii). The converse was proved in Lemma 5.8. Combining the last two results, we may prove the equivalence of tightness and weak sequential compactness. A more general version appears in Theorem 23.2, which forms the starting point for a theory of weak convergence on function spaces. Corollary 6.22 (tightness and weak compactness) For any probability measures μ1 , μ2 , . . . on Rd , these conditions are equivalent: (i) (μn ) is tight, (ii) every sub-sequence of (μn ) has a weakly convergent further sub-sequence. Proof, (i) ⇒ (ii): By Theorem 6.20, every sub-sequence has a vaguely convergent further sub-sequence. If (μn ) is tight, then by Lemma 6.21 the convergence holds even in the weak sense.

144

Foundations of Modern Probability

(ii) ⇒ (i): Assume (ii). If (i) fails, we may choose some nk → ∞ and ε > 0, w such that μnk Bkc > ε for all k ∈ N. By (ii) we have μnk → μ along a sub sequence N ⊂ N, for a probability measure μ. Then (μnk ; k ∈ N  ) is tight by Lemma 5.8, and in particular there exists an r > 0 with μnk Brc ≤ ε for all 2 k ∈ N  . For k > r this is a contradiction, and (i) follows. We may now prove the desired extension of Theorem 6.3. Theorem 6.23 (extended continuity theorem, L´evy, Bochner) Let μ1 , μ2 , . . . ˆn . Then these be probability measures on Rd with characteristics functions μ conditions are equivalent: (i) μ ˆn (t) → ϕ(t), t ∈ Rd , where ϕ is continuous at 0, w (ii) μn → μ for a probability measure μ on Rd . In that case, μ has characteristic function ϕ. Similar statements hold for ˜n . distributions μn on Rd+ with Laplace transforms μ Proof, (ii) ⇒ (i): Clear by Theorem 6.3. (i) ⇒ (ii): Assuming (i), we see from the proof of Theorem 6.3 that (μn ) is w tight. Then Corollary 6.22 yields μn → μ along a sub-sequence N  ⊂ N, for a probability measure μ. The convergence extends to N by Theorem 6.3, proving (ii). The proof for Laplace transforms is similar. 2

Exercises 1. Show that if ξ, η are independent Poisson random variables, then ξ + η is again Poisson. Also show that the Poisson property is preserved by convergence in distribution. 2. Show that any linear combination of independent Gaussian random variables is again Gaussian. Also show that the Gaussian property is preserved by convergence in distribution. 3. Show that ϕr (t) = (1 − t/r)+ is a characteristic functions for every r > 0. (Hint: Calculate the Fourier transform ψˆr of the function ψr (t) = 1{|t| ≤ r}, and note that the Fourier transform ψˆr2 of ψr∗2 is integrable. Now use Fourier inversion.) 4. Let ϕ be a real, even function that is convex on R+ and satisfies ϕ(0) = 1 and ϕ(∞) ∈ [0, 1]. Show that ϕ is the characteristic function of a symmetric distribution c on R. In particular, ϕ(t) = e−|t| is a characteristic function for every c ∈ [0, 1]. (Hint: Approximate by convex combinations of functions ϕr as above, and use Theorem 6.23.) 5. Show that if μ ˆ is integrable, then μ has a bounded, continuous density. (Hint:  Let ϕr be the triangular density above. Then (ϕ ˆr )ˆ= 2πϕr , and so e−itu μ ˆt ϕˆr (t) dt = 2π ϕr (x − u) μ(dx). Now let r → 0.) 6. Show that a distribution μ is supported by a set a Z + b iff |ˆ μt | = 1 for some t = 0.

6. Gaussian and Poisson Convergence

145

7. Give a direct proof of the continuity theorem for generating functions of disv tributions on Z+ . (Hint: Note that if μn → μ for some distributions on R+ , then μ ˜n → μ ˜ on (0, ∞).) 8. The moment-generating function of a distribution μ on R is given by μ ˜t = etx μ(dx). Show that if μ ˜ t < ∞ for all t in a non-degenerate interval I, then μ ˜ is analytic in the strip {z ∈ C; &z ∈ I o }. (Hint: Approximate by measures with bounded support.) 

9. Let μ, μ1 , μ2 , . . . be distributions on R with moment-generating functions μ ˜, μ ˜1 , μ ˜2 , . . . , such that μ ˜n → μ ˜ < ∞ on a non-degenerate interval I. Show that w v μn → μ. (Hint: If μn → ν along a sub-sequence N  , then μ ˜n → ν˜ on I o along N  , and so ν˜ = μ ˜ on I. By the preceding exercise, we get νR = 1 and νˆ = μ ˆ. Thus, ν = μ.) 



10. Let μ, ν be distributions on R with finite moments xn μ(dx) = xn ν(dx) = mn , where Σn tn |mn |/n! < ∞ for some t > 0. Show that μ = ν. (Hint: The absolute moments satisfy the same relation for any smaller value of t, and so the moment-generating functions exist and agree on (−t, t).) 11. For each n ∈ N, let μn be a distribution on R with finite moments mkn , k ∈ N, such that limn mkn = ak for some constants ak with Σk tk |ak |/k! < ∞ for a t > 0. w Show that μn → μ for a distribution μ with moments ak . (Hint: Each function xk is uniformly integrable with respect to the measures μn . In particular, (μn ) is tight. w If μn → ν along a sub-sequence, then ν has moments ak .) 12. Given a distribution μ on R × R+ , introduce the Fourier–Laplace transform  ϕ(s, t) = eisx−ty μ(dx dy), where s ∈ R and t ≥ 0. Prove versions for ϕ of Theorems 6.3 and 6.23. 1 , . . . , ξ d ) in Zd , let 13. Consider a null array of random vectors ξnj = (ξnj + nj be independent Poisson variables with means c1 , . . . , cd , and put ξ = d k = 1} → c for all k and (ξ 1 , . . . , ξ d ). Show that Σj ξnj → ξ iff Σj P {ξnj k Σj P { Σk ξnjk

ξ1, . . . , ξd

d

k = ξ k , and note > 1} → 0. (Hint: Introduce some independent random variables ηnj nj

that

Σj ξnj → ξ iff Σj ηnj → ξ.) d

d

14. Consider some random variables ξ⊥⊥η with finite variance, such that the distribution of (ξ, η) is rotationally invariant. Show that ξ is centered Gaussian. (Hint: Let ξ1 , ξ2 , . . . be i.i.d. and distributed as ξ, and note that n−1/2 Σk≤n ξk has the same distribution for all n. Now use Proposition 6.10.) 15. Prove a multi-variate version of the Taylor expansion in Lemma 6.9. 16. Let μ have a finite n -th moment mn . Show that μ ˆ is n times continuously (n) differentiable and satisfies μ ˆ 0 = in mn . (Hint: Differentiate n times under the integral sign.) (2n)

17. For μ and mn as above, show that μ ˆ 0 exists iff m2n < ∞. Also characterize (2n−1) the distributions μ where μ ˆ0 exists. (Hint: For μ ˆ0 , proceed as in the proof of  Proposition 6.10, and use Theorem 6.18. For μ ˆ 0 , use Theorem 6.17. Extend by induction to n > 1.) (n)

18. Let μ be a distribution on R+ with moments mn . Show that μ ˜0 = (−1)n mn , whenever either side exists and is finite. (Hint: Prove the statement for n = 1, and extend by induction.)

146

Foundations of Modern Probability

19. Deduce Proposition 6.10 from Theorem 6.13. 20. Let the random variables ξ and ξnj be such as in Theorem 6.13, and assume that

Σj E|ξnj |p → 0 for a p > 2. Show that Σj ξnj → ξ. d

2 → 21. Extend Theorem 6.13 to random vectors in Rd , with the condition Σj E ξnj 2 1 replaced by Σj Cov(ξnj ) → a, with ξ as N (0, a), and with ξnj replaced by |ξnj |2 . (Hint: Use Corollary 6.5 to reduce to dimension 1.)

22. Show that Theorem 6.16 remains true for random vectors in Rd , with Var(ξnj ; |ξnj | ≤ 1) replaced by the corresponding covariance matrix. (Hint: If a, a1 , a2 , . . . are symmetric, non-negative definite matrices, then an → a iff u an u → u au for all u ∈ Rd . To see this, use a compactness argument.) 23. Show that Theorems 6.7 and 6.16 remain valid for possibly infinite row-sums

Σj ξnj . (Hint: Use Theorem 5.17 or 5.18, together with Theorem 5.29.) 24. Let ξ, ξ1 , ξ2 , . . . be i.i.d. random variables. Show that n−1/2 Σk≤n ξk converges

in probability iff ξ = 0 a.s. (Hint: Use condition (iii) in Theorem 6.16.)

25. Let ξ1 , ξ2 , . . . be i.i.d. μ, and fix any p ∈ (0, 2). Find a μ such that n−1/p Σk≤n ξk → 0 in probability but not a.s.

26. Let ξ1 , ξ2 , . . . be i.i.d., and let p > 0 be such that n−1/p Σk≤n ξk → 0 in probability but not a.s. Show that lim supn n−1/p | Σk≤n ξk | = ∞ a.s. (Hint: Note that E|ξ1 |p = ∞.) 27. Give an example of a distribution on R with infinite second moment, belonging to the domain of attraction of the Gaussian law. Also find the corresponding normalization.

Chapter 7

Infinite Divisibility and General Null Arrays Compound Poisson distributions and approximation, i.i.d. and null arrays, infinitely divisible distributions, L´evy measure, characteristics, L´evy– Khinchin formula, convergence and closure, one-dimensional criteria, positive and symmetric terms, general null arrays, limit laws and convergence criteria, sums and extremes, diffuse distributions

The fundamental roles of Gaussian and Poisson distributions should be clear from Chapter 6. Here we consider the more general and equally basic family of infinitely divisible distributions, which may be defined as distributional limits of linear combinations of independent Poisson and Gaussian random variables. Such distributions constitute the most general limit laws appearing in the classical limit theorems for null arrays. They further appear as finite-dimensional distributions of L´evy processes—the fundamental processes with stationary, independent increments, treated in Chapter 16—which may be regarded as prototypes of general Markov processes. The special criteria for convergence toward Poisson and Gaussian distributions, derived by simple analytic methods in Chapter 6, will now be extended to the case of general infinitely divisible limits. Though the use of some analytic tools is still unavoidable, the present treatment is more probabilistic in flavor, and involves as crucial steps a centering at truncated means followed by a compound Poisson approximation. This approach has the advantage of providing some better insight into even the Gaussian and Poisson results, previously obtained by elementary but more technical estimates. Since the entire theory is based on approximations by compound Poisson distributions, we begin with some characterizations of the latter. Here the basic representation is in terms of Poisson processes, discussed more extensively in Chapter 15. Here we need only the basic definition in terms of independent, Poisson distributed increments. Lemma 7.1 (compound Poisson distributions) Let ξ be a random vector in Rd , fix any bounded measure ν = 0 on Rd \ {0}, and put c = ν and νˆ = ν/c. Then these conditions are equivalent: d (i) ξ = Xκ for a random walk X = (Xn ) in Rd based on νˆ, where κ⊥⊥X is Poisson distributed with E κ = c, d  (ii) ξ = x η(dx) for a Poisson process η on Rd \ {0} with E η = ν, © Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1_8

147

148

Foundations of Modern Probability

(iii) log E eiuξ =

( eiux − 1) ν(dx),

u ∈ Rd .

The measure ν is then uniquely determined by L(ξ). Proof, (i) ⇒ (iii): Writing EeiuX1 = ϕ(u), we get cn EeiuXκ = Σ e−c {ϕ(u)}n n! n≥0 = e−c ec ϕ(u) = ec{ϕ(u)−1} , and so





log EeiuXκ = c ϕ(u) − 1 =c



=





eiux νˆ(dx) − 1

( eiux − 1) ν(dx).

(ii) ⇒ (iii): To avoid repetitions, we defer the proof until Lemma 15.2 (ii), which remains valid with −f replaced by the function f (x) = ix. (iii) ⇒ (i)−(ii): Clear by the uniqueness in Theorem 6.3.

2

The distributions in Lemma 7.1 are said to be compound Poisson, and the underlying measure ν is called the characteristic measure of ξ. We will use the compound Poisson distributions to approximate the row sums in triangular arrays (ξnj ), where the ξnj are understood to be independent in j for fixed n. P Recall that (ξnj ) is called a null array if ξnj → 0 as n → ∞, uniformly in j. It is further called an i.i.d. array if the ξnj are i.i.d. in j for fixed n, and the successive rows have lengths mn → ∞. For any random vector ξ with distribution μ, we introduce an associated compound Poisson random vector ξ˜ with characteristic measure μ. For a triangular array ξnj , the corresponding compound Poisson vectors ξ˜ij are again d assumed to have row-wise independent entries. By ξn ∼ ηn we mean that, if either side converges in distribution along a sub-sequence, then so does the other along the same sequence, and the two limits agree. Proposition 7.2 (compound Poisson approximation) Let (ξnj ) be a triangular array in Rd with associated compound Poisson array (ξ˜nj ). Then the d equivalence Σj ξnj ∼ Σj ξ˜nj holds under each of these conditions: (i) (ξnj ) is a null array in Rd+ ,

d

(ii) (ξnj ) is a null array with ξnj = −ξnj , (iii) (ξnj ) is an i.i.d. array. The stated equivalence fails for general null arrays, where a modified version is given in Theorem 7.11. For the proof, we first need to show that the ‘null’ property holds even in case (iii). This requires a simple technical lemma:

7. Infinite Divisibility and General Null Arrays

149

Lemma 7.3 (i.i.d. and null arrays) Let (ξnj ) be an i.i.d. array in Rd , such P that the sequence ξn = Σj ξnj is tight. Then ξn1 → 0. Proof: It is enough to take d = 1. By Lemma 6.2 the functions μ ˆ n (u) = ˆn = eψn on a suitable interval I = Eeiuξn are equi-continuous at 0, and so μ ˆnj (r) = [−a, a], for some continuous functions ψn with |ψn | ≤ 12 . Writing μ ˆ nj = eψn /mn → 1, uniEeirξnj for j ≤ mn , we conclude by continuity that μ formly on I. Proceeding as in Lemma 6.1 (i), we get for any ε ≤ a−1

a −a





1−μ ˆnj (r) dr = 2r





(1 − sinaxax ) μnj (dx)

≥ 2r 1 −

sin aε  μnj {|x| ≥ ε}, aε P

and so P {|ξnj | ≥ ε} → 0 as n → ∞, which implies ξnj → 0.

2

For the main proof, we often need the elementary fact er − 1 ∼ r as r → 0,

(1)

which holds when r ↑ 0 or r ↓ 0, and even when r → 0 in the complex plane. Proof of Proposition 7.2: Let μ ˆnj or μ ˜nj be the characteristic functions or Laplace transforms of ξnj . Since the latter form a null array, even in case (iii) by Lemma 7.3, Lemma 6.6 ensures that, for every r > 0, there exists an m ∈ N ˜nj = 0 on the ball B0r for all n > m. Then μ ˆnj = eψnj or such that μ ˆnj = 0 or μ −ϕnj , for some continuous functions ψnj and ϕnj , respectively. μ ˜nj = e (i) For any u ∈ Rd+ , ˜nj (u) − log Ee−uξnj = − log μ = ϕnj (u), − log Ee−uξnj = 1 − μ ˜nj (u) = 1 − e−ϕnj (u) , ˜

and we need to show that

Σ j ϕnj (u) ≈ Σ j





1 − e−ϕnj (u) ,

u ∈ Rd+ ,

in the sense that if either side converges to a positive limit, then so does the other and the limits are equal. Since all functions are positive, this is clear from (1). (ii) For any u ∈ Rd , ˆnj (u) log Eeiuξnj = log μ = ψnj (u), ˜

ˆnj (u) − 1 log Eeiuξnj = μ = eψnj (u) − 1,

150

Foundations of Modern Probability

and we need to show that

Σ j ψnj (u) ≈ Σ j





eψnj (u) − 1 ,

u ∈ Rd ,

in the sense of simultaneous convergence to a common finite limit. Since all functions are real and ≤ 0, this follows from (1). (iii) Writing ψnj = ψn , we may reduce our claim to







exp mn ψn (u) ≈ exp mn eψn (u) − 1



,

u ∈ Rd ,

in the sense of simultaneous convergence to a common complex number. This follows from (1). 2 A random vector ξ or its distribution L(ξ) is said to be infinitely divisible, if d for every n ∈ N we have ξ = Σj ξnj for some i.i.d. random vectors ξn1 , . . . , ξnn . Obvious examples of infinitely divisible distributions include the Gaussian and Poisson laws, which is clear by elementary calculations. More generally, we see from Lemma 7.1 that any compound Poisson distribution is infinitely divisible. The following approximation is fundamental: Corollary 7.4 (compound Poisson limits) For a random vector ξ in Rd , these conditions are equivalent: (i) ξ is infinitely divisible, d

(ii) ξn → ξ for some compound Poisson random vectors ξn . d

Proof, (i) ⇒ (ii): Under (i) we have ξ =

Σj ξnj

for an i.i.d. array (ξnj ).  d ˜ Choosing an associated compound Poisson array (ξnj ), we get j ξ˜nj → ξ by Lemma 7.2 (iii), and we note that ξn = Σj ξ˜nj is again compound Poisson. (ii) ⇒ (i): Assume (ii) for some ξn with characteristic measures νn . For evd ery m ∈ N, we may write ξn = Σj≤m ξnj for some i.i.d. compound Poisson vectors ξnj with characteristic measures νn /m. Then the sequence (ξn1 , . . . , ξnm ) d is tight by Lemma 6.2, and so (ξn1 , . . . , ξnm ) → (ζ1 , . . . , ζm ) as n → ∞ along a d sub-sequence, where the ζj are again i.i.d. with sum Σj ζj = ξ. Since m was arbitrary, (i) follows. 2 We proceed to characterize the general infinitely divisible distributions. For a special case, we may combine the Gaussian and compound Poisson distribu d tions and use Lemma 7.1 (ii) to form a variable ξ = b + ζ + x η(dx), involving a constant vector b, a centered Gaussian random vector ζ, and an independent Poisson process η on Rd . For extensions to unbounded ν = Eη, we need to apply a suitable centering, to ensure convergence of the Poisson integral. This suggests the more general representation:

7. Infinite Divisibility and General Null Arrays

151

Theorem 7.5 (infinitely divisible distributions, L´evy, Itˆo) (i) A random vector ξ in Rd is infinitely divisible iff d



ξ =b+ζ +

|x|≤1

x (η − Eη)(dx) +



|x|>1

x η(dx),

for a constant vector b ∈ Rd , a centered Gaussian random vector ζ with Cov(ζ) = a, and an independent Poisson process η on Rd \ {0} with intensity ν satisfying (|x|2 ∧ 1) ν(dx) < ∞. (ii) A random vector ξ in Rd+ is infinitely divisible iff

d

ξ =a+

x η(dx),

for a constant vector a ∈ Rd+ and a Poisson process η on Rd+ \ {0} with intensity ν satisfying (|x| ∧ 1) ν(dx) < ∞. The triple (a, b, ν) or pair (a, ν) is then unique, and any choice with the stated properties may occur. This shows that an infinitely divisible distribution on Rd or Rd+ is uniquely characterized by the triple (a, b, ν) or pair (a, ν), referred to as the characteristics of ξ or L(ξ). The intensity measure ν of the Poisson component is called the L´evy measure of ξ or L(ξ). The representations of Theorem 7.5 can also be expressed analytically in terms of characteristic functions or Laplace transforms, which leads to some celebrated formulas: Corollary 7.6 (L´evy–Khinchin representations, Kolmogorov, L´evy) Let ξ be an infinitely divisible random vector in Rd or Rd+ with characteristics (a, b, ν) or (a, ν). Then (i) for Rd -valued ξ we have E eiuξ = eψ(u) , u ∈ Rd , where ψ(u) = iu b − 12 u au +







eiu x − 1 − iu x 1{|x| ≤ 1} ν(dx),

(ii) for Rd+ -valued ξ we have E e−uξ = e−ϕ(u) , u ∈ Rd+ , where

ϕ(u) = ua +

(1 − e−ux ) ν(dx).

The stated representations will be proved together with the associated convergence criteria, stated below. Given a characteristic triple (a, b, ν), as above, we define for any h > 0 the truncated versions

ah = a + bh = b − 



|x|≤h

xx ν(dx), x ν(dx),

h 0 with ν ∂B0h = 0. Then ξn → ξ iff v

ahn → ah ,

νn → ν on Rd+ \ {0}.

We first consider the one-dimensional case, which allows some significant simplifications. The characteristic exponent ψ in Theorem 7.6 may then be written as   i rx 1 + x2 ψ(r) = i c r + eirx − 1 − ν˜(dx), r ∈ R, (2) 1 + x2 x2 where x2 ν˜(dx) = σ 2 δ0 (dx) + ν(dx), 1 + x2   x − x 1{|x| ≤ 1} ν(dx), c = b+ 1 + x2 and the integrand in (2) is defined by continuity as − 12 r2 when x = 0. For infinitely divisible distributions on R+ , we define instead ν˜(dx) = a δ0 + (1 − e−x ) ν(dx), so that the characteristic exponent ϕ in Theorem 7.6 becomes

ϕ(r) =

1 − e−rx ν˜(dx), 1 − e−x

r ≥ 0,

(3)

where the integrand is interpreted as r when x = 0. We refer to the pair (c, ν˜) or measure ν˜ as the modified characteristics of ξ or L(ξ). Lemma 7.8 (one-dimensional criteria) (i) Let ξn and ξ be infinitely divisible in R with modified characteristics (cn , ν˜n ) and (c, ν˜). Then d

ξn → ξ



w

cn → c, ν˜n → ν˜.

(ii) Let ξn and ξ be infinitely divisible in R+ with modified characteristics ν˜n and ν˜. Then d w ξn → ξ ⇔ ν˜n → ν˜. v 1 Here νn → ν means that νn f → νf for all f ∈ Cˆ+ (Rd \{0}), the class of continuous functions f ≥ 0 on Rd with a finite limit as |x| → ∞, and such that f (x) = 0 in a neighborhood of 0. The meaning in (iii) is similar.

7. Infinite Divisibility and General Null Arrays

153

The last two statements will first be proved with ‘infinitely divisible’ replaced by representable, meaning that ξn and ξ are random vectors allowing representations as in Theorem 7.5. Some earlier lemmas will then be used to deduce the stated versions. w

Proof of Lemma 7.8, representable case: (ii) If ν˜n → ν, then ϕn → ϕ by ˆ, the continuity in (3), and so the associated Laplace transforms satisfy μ ˆn → μ w w ˆn → μ ˆ, which implies μn → μ by Theorem 6.3. Conversely, μn → μ implies μ and so ϕn → ϕ, which yields Δϕn → Δϕ, where Δϕ(r) = ϕ(r + 1) − ϕ(r)

=

e−rx ν˜(dx). w

Using Theorem 6.3, we conclude that ν˜n → ν˜. w

(i) If cn → c and ν˜n → ν˜, then ψn → ψ by the boundedness and continuity w ˆ, which implies μn → μ by Theorem 6.3. of the integrand in (2), and so μ ˆn → μ w ˆn → μ ˆ uniformly on bounded intervals, and so Conversely, μn → μ implies μ ψn → ψ in the same sense. Now define 1

χ(r) =

−1





ψ(r) − ψ(r + s) ds

sin x 1 + x2 = 2 eirx (1 − ) x2 ν˜(dx), x and similarly for χn , where the interchange of integrations is justified by Fubini’s theorem. Then χn → χ, and so by Theorem 6.3

(1 − sinx x ) 1 +x2x

2

ν˜n (dx) → (1 − w

sin x 1 + x2 ) x2 ν˜(dx). x

Since the integrand is continuous and bounded away from 0, it follows that w ν˜n → ν˜. This implies convergence of the integral in (2), and so by subtraction 2 cn → c. Proof of Theorem 7.7, representable case: For bounded measures mn and m w v ¯ and mn (−h, h) → m(−h, h) on R, we note that mn → m iff mn → m on R\{0} for some h > 0 with m{±h} = 0. Thus, for distributions μ and μn on R, we w v ¯ and ahn → ah for every h > 0 with ν{±h} = 0. have ν˜n → ν˜ iff νn → ν on R\{0} w v Similarly, ν˜n → ν˜ holds for distributions μ and μn on R+ iff νn → ν on (0, ∞] h h and an → a for all h > 0 with ν{h} = 0. Thus, (iii) follows immediately from Lemma 7.8. To obtain (ii) from the same lemma when d = 1, it remains to w note that the conditions bhn → bh and cn → c are equivalent when ν˜n → ν˜ and 2 −1 3 ν{±h} = 0, since |x − x(1 + x ) | ≤ |x| . v Turning to the proof of (ii) when d > 1, we first assume that νn → ν on h h h h d R \ {0}, and that an → a and bn → b for some h > 0 with ν{|x| = h} w = 0. To prove μn → μ, it suffices by Corollary 6.5 to show that, for any onew dimensional projection πu : x → u x with u = 0, μn ◦ πu−1 → μ ◦ πu−1 . Then  −1 fix any k > 0 with ν{|u x| = k} = 0, and note that μ ◦ πu has the associated characteristics ν u = ν ◦ πu−1 and

154

Foundations of Modern Probability au,k = u ah u + bu,k = u bh +









(u x)2 1(0,k] (|u x|) − 1(0,h] (|x|) ν(dx),



u x 1(1,k] (|u x|) − 1(1,h] (|x|) ν(dx).

u,k u −1 Let au,k n , bn , and νn denote the corresponding characteristics of μn ◦πu . Then u v u u,k u,k u,k u,k ¯ νn → ν on R \ {0}, and furthermore an → a and bn → b . The desired convergence now follows from the one-dimensional result. w w Conversely, let μn → μ. Then μn ◦ πu−1 → μ ◦ πu−1 for every u = 0, and the u v u u,k ¯ and one-dimensional result yields νn → ν on R \ {0}, as well as au,k n → a u,k u,k  bn → b for any k > 0 with ν{|u x| = k} = 0. In particular, the sequence (νn K) is bounded for every compact set K ⊂ Rd \ {0}, and so the sequences (u ahn u) and (u bhn ) are bounded for any u = 0 and h > 0. In follows easily that (ahn ) and (bhn ) are bounded for every h > 0, and therefore all three sequences are relatively compact. v Given a sub-sequence N  ⊂ N, we have νn → ν  along a further sub-sequence    N ⊂ N , for some measure ν satisfying (|x|2 ∧ 1)ν  (dx) < ∞. Fixing any h > 0 with ν  {|x| = h} = 0, we may choose yet another sub-sequence N  , such that even ahn and bhn converge toward some limits a and b . The direct assertion w then yields μn → μ along N  , where μ is infinitely divisible with characteristics determined by (a , b , ν  ). Since μ = μ, we get ν  = ν, a = ah , and b = bh . Thus, the convergence remains valid along the original sequence. 2

Proof of Theorem 7.5: Directly from the representation, or even easier by Corollary 7.6, we see that any representable random vector ξ is infinitely divisd ible. Conversely, if ξ is infinitely divisible, Corollary 7.4 yields ξn → ξ for some compound Poisson random vectors ξn , which are representable by Lemma 7.1. Then even ξ is representable, by the preliminary version of Theorem 7.7 (i). 2 Proof of Theorem 7.7, general case: Since infinite divisibility and representability are equivalent by Theorem 7.5, the proof in the representable case 2 remains valid for infinitely divisible random vectors ξn and ξ. We may now complete the classical limit theory for sums of independent random variables from Chapter 6, beginning with the case of general i.i.d. arrays (ξnj ). Corollary 7.9 (i.i.d.-array convergence) Let (ξnj ) be an i.i.d. array in Rd , let ξ be infinitely divisible in Rd with characteristics (a, b, ν), and fix any h > 0 d with ν{|x| = h} = 0. Then Σj ξnj → ξ iff v

(i) mn L(ξn1 ) → ν on Rd \ {0}, 



 ; |ξn1 | ≤ h → ah , (ii) mn E ξn1 ξn1





(iii) mn E ξn1 ; |ξn1 | ≤ h → bh .

7. Infinite Divisibility and General Null Arrays

155

Proof: For an i.i.d. array (ξnj ) in Rd , the associated compound Poisson array (ξ˜nj ) is infinitely divisible with characteristics 



bn = E ξnj ; |ξnj | ≤ 1 ,

an = 0,

νn = L(ξnj ),

and so the row sums ξn = Σj ξnj are infinitely divisible with characteristics 2 (0, mn bn , mn μn ). The assertion now follows by Theorems 7.2 and 7.7. The following extension of Theorem 6.12 clarifies the connection between null arrays with positive and symmetric terms. Here we define p2 (x) = x2 . Theorem 7.10 (positive and symmetric terms) Let (ξnj ) be a null array of symmetric random variables, and let ξ, η be infinitely divisible with characteristics (a, 0, ν) and (a, ν ◦ p−1 2 ), respectively, where ν is symmetric and a ≥ 0. Then d d Σ j ξnj → ξ ⇔ Σ j ξnj2 → η. Proof: Define μnj = L(ξnj ), and fix any h > 0 with ν{|x| = h} = 0. By d Lemma 7.2 and Theorem 7.7 (i), we have Σj ξnj → ξ iff

Σ j μnj → ν v

Σj E whereas



¯ \ {0}, on R

2 ξnj ; |ξnj | ≤ h → a +



|x|≤h

x2 ν(dx),

Σj ξnj2 → η iff v −1 on (0, ∞], Σ j μnj ◦ p−1 2 → ν ◦ p2   Σ j E ξnj2 ; ξnj2 ≤ h2 → a + y≤h y (ν ◦ p−1 2 )(dy). d

2

The two sets of conditions are equivalent by Lemma 1.24.

2

The convergence problem for general null arrays is more delicate, since a compound Poisson approximation as in Proposition 7.2 applies only after a centering at truncated means, as specified below. Proposition 7.11 (compound Poisson approximation) Let (ξnj ) be a null array of random vectors in Rd , fix any h > 0, and define 



bnj = E ξnj ; |ξnj | ≤ h , Then

Σ j ξnj

d



Σj



n, j ∈ N.



(ξnj − bnj ) + bnj .

(4) (5)

Our proof will be based on a technical estimate: Lemma 7.12 (uniform summability) Let (ξnj ) be a null array with truncated means bnj as in (4), and let ϕnj be the characteristic functions of the vectors ηnj = ξnj − bnj . Then convergence of either side of (5) implies lim sup n→∞

 

 

Σ j 1 − ϕnj (u) < ∞,

u ∈ Rd .

156

Foundations of Modern Probability Proof: The definitions of bnj , ηnj , and ϕnj yield 





1 − ϕnj (u) = E 1 − eiu ηnj + iu ηnj 1 |ξnj | ≤ h





− iu bnj P |ξnj | > h . Putting an = pn =



Σ j E ηnj ηnj ; |ξ nj | ≤ h Σ j P |ξnj | > h ,



,

we get by Lemma 6.15  

 

Σ j 1 − ϕnj (u)

1  < u an u + (2 + |u|) pn .

2

It is then enough to show that (u an u) and (pn ) are bounded. Assuming convergence on the right of (5), the stated boundedness follows easily from Theorem 7.7, together with the fact that maxj |bnj | → 0. If instead d Σj ξnj → ξ, we may introduce an independent copy (ξnj ) of the array (ξnj ) and apply Proposition 7.2 and Theorem 7.7 to the symmetric random variables u  = u ξnj − u ξnj . Then for any h > 0, ζnj lim sup n→∞

lim sup n→∞

Σj P

Σj E







u |ζnj | > h < ∞,

u 2 (ζnj );

u |ζnj |

≤h





(6)

< ∞.

(7)

The boundedness of pn follows from (6) and Lemma 5.19. Next we note that u  | ≤ h replaced by |ξnj | ∨ |ξnj | ≤ h. (7) remains true with the condition |ζnj  Using the independence of ξnj and ξnj , we get 1 2





Σ j E (ζnju )2; |ξnj | ∨ |ξnj | ≤ h]

= Σ j E (u ηnj )2 ; |ξnj | ≤ h P |ξnj | ≤ h   2 − Σ j E u ηnj ; |ξnj | ≤ h



2 ≥ u an u minj P |ξnj | ≤ h − Σ j u bnj P |ξnj | > h .

Here the last sum is bounded by pn maxj (u bnj )2 → 0, and the minimum on 2 the right tends to 1. The boundedness of (u an u) now follows by (7). Proof of Proposition 7.11: By Lemma 6.14 it suffices to show that  

Σ j  ϕnj (u) − exp



 

ϕnj (u) − 1  → 0,

u ∈ Rd ,

where ϕnj is the characteristic function of ηnj . This follows from Taylor’s formula, together with Lemmas 6.6 and 7.12. 2 In particular, we may now identify the possible limits. Corollary 7.13 (null-array limits, Feller, Khinchin) Let (ξnj ) be a null array d of random vectors in Rd , such that Σj ξnj → ξ for a random vector ξ. Then ξ is infinitely divisible.

7. Infinite Divisibility and General Null Arrays

157

Proof: The random vectors η˜nj in Lemma 7.11 are infinitely divisible, and ηnj − bnj ). The infinite divisibility so the same thing is true for the sums Σj (˜ of ξ then follows by Theorem 7.7 (i). 2 To obtain explicit convergence criteria for general null arrays in Rd , it remains to combine Theorem 7.7 with Proposition 7.11. The present result extends Theorem 6.16 for Gaussian limits and Corollary 7.9 for i.i.d. arrays. For convenience, we write Cov(ξ; A) for the covariance matrix of the random vector 1A ξ. Theorem 7.14 (null-array convergence, Doeblin, Gnedenko) Let (ξnj ) be a null array of random vectors in Rd , and let ξ be infinitely divisible with characteristics (a, b, ν). Then for fixed h > 0 with ν{|x| = h} = 0, we have d Σj ξnj → ξ iff (i) (ii) (iii)

Σj L(ξnj ) → ν on Rd \ {0}, ξnj ; |ξnj | ≤ h → ah , Σj Cov   Σj E ξnj ; |ξnj | ≤ h → bh. v

Proof: Define for any n and j





anj = Cov ξnj ; |ξnj | ≤ h , 



bnj = E ξnj ; |ξnj | ≤ h . By Theorems 7.7 and 7.11, we have (i ) (ii ) (iii )

Σj ξnj → ξ iff d

d nj ) → ν on R \ {0}, Σj L(η    ηnj ηnj ; |ηnj | ≤ h → ah , Σj E 

 Σj bnj + E ηnj ; |ηnj | ≤ h → bh. v

Here (i) ⇔ (i ) since maxj |bnj | → 0. By (i) and the facts that maxj |bnj | → 0 and ν{|x| = h} = 0, we further see that the sets {|ηnj | ≤ h} in (ii ) and (iii ) may be replaced by {|ξnj | ≤ h}. To prove (ii) ⇔ (ii ), we note that by (i)   

Σj





    ≤ 



 anj − E ηnj ηnj ; |ξnj | ≤ h

< maxj |bnj |2

Similarly, (iii) ⇔ (iii ) because   

Σj E





 



Σ j bnj bnj P |ξnj | > h  Σj P





|ξnj | > h → 0.





   ηnj ; |ξnj | ≤ h  =  Σ j bnj P |ξnj | > h 

≤ maxj |bnj |



Σj P





|ξnj | > h → 0.

2

When d = 1, the first condition in Theorem 7.14 admits some interesting probabilistic interpretations, one of which involves the row-wise extremes. For vd ¯ \ {0} random measures η and ηn on R \ {0}, the convergence ηn −→ η on R d ¯ ˆ means that ηn f → ηf for all f ∈ C+ (R \ {0}).

158

Foundations of Modern Probability

Theorem 7.15 (sums and extremes) Let (ξnj ) be a null array of random variables with distributions μnj , and define ηn = Σ j δ ξnj ,





αn± = maxj ± ξnj ,

n ∈ N.

Fix a L´evy measure ν on R \ {0}, let η be a Poisson process on R \ {0} with

E η = ν, and put α± = sup x ≥ 0; η{±x} > 0 . Then these conditions are equivalent: v ¯ \ {0}, (i) μnj → ν on R

Σj

d ¯ \ {0}, (ii) ηn → η on R

(iii) αn± → α± . d

Though (i) ⇔ (ii) is immediate from Theorem 30.1 below, we include a direct, elementary proof. Proof: Condition (i) holds iff

Σ j μnj (x, ∞) → ν(x, ∞),

Σ j μnj (−∞, −x) → ν(−∞, −x), for all x > 0 with ν{±x} = 0. By Lemma 6.8, the first condition is equivalent to   P {αn+ ≤ x} = Π j 1 − P {ξnj > x} → e−ν(x,∞) = P {α+ ≤ x},

d

which holds for all continuity points x > 0 iff αn+ → α+ . Similarly, the second d condition holds iff αn− → α− . Thus, (i) ⇔ (iii). To show that (i) ⇒ (ii), we may write the latter condition in the form

Σ j f (ξnj ) → ηf, d

¯ \ {0}). f ∈ Cˆ+ (R

(8)

Here the variables f (ξnj ) form a null array with distributions μnj ◦ f −1 , and ηf is compound Poisson with characteristic measure ν ◦ f −1 . By Theorem 7.7 (ii), (8) is then equivalent to the conditions

Σ j μnj ◦ f −1 → ν ◦ f −1 on (0, ∞], f (x) μnj (dx) = 0. lim lim sup Σ j ε→0 n→∞ f (x)≤ε v

(9) (10)

Now (9) follows immediately from (i). To deduce (10), it suffices to note that the sum on the left is bounded by Σj μnj (f ∧ ε) → ν(f ∧ ε). d

Finally, assume (ii). By a simple approximation, ηn (x, ∞) → η(x, ∞) for any x > 0 with ν{x} = 0. In particular, for such an x,

7. Infinite Divisibility and General Null Arrays

159

P {αn+ ≤ x} = P ηn (x, ∞) = 0

→ P η(x, ∞) = 0





= P {α+ ≤ x}, and so αn+ → α+ . Similarly αn− → α− , which proves (iii). d

d

2

The characteristic pair or triple clearly contains important information about an infinitely divisible distribution. In particular, we have the following classical result: Proposition 7.16 (diffuse distributions, Doeblin) Let ξ be an infinitely divisible random vector in Rd with characteristics (a, b, ν). Then L(ξ) diffuse



a = 0 or ν = ∞.

Proof: If a = 0 and ν < ∞, then ξ is compound Poisson apart from a shift, and so μ = L(ξ) is not diffuse. When either condition fails, it does so for at least one coordinate projection, and we may take d = 1. If a > 0, the diffuseness is obvious by Lemma 1.30. Next let ν be unbounded, say with ν(0, ∞) = ∞. For every n ∈ N we may then write ν = νn + νn , where νn is supported by (0, n−1 ) and has total mass log 2. For μ we get a corresponding decomposition μn ∗ μn , where μn is compound Poisson with L´evy measure νn and μn {0} = 12 . For any x ∈ R and ε > 0, we get μ{x} ≤ μn {x}μn {0} + μn [x − ε, x) μn (0, ε] + μn (ε, ∞) ≤

1 2

μn [x − ε, x] + μn (ε, ∞).

Letting n → ∞ and then ε → 0, and noting that μn → δ0 and μn → μ, we get 2 μ{x} ≤ 12 μ{x} by Theorem 5.25, and so μ{x} = 0. w

w

Exercises 1. Prove directly from definitions that a compound Poisson distribution is infinitely divisible, and identify the corresponding characteristics. Also do the same verification using generating functions. 2. Prove directly from definitions that every Gaussian distribution in Rd is infinitely divisible. (Hint: Write a Gaussian vector in Rd in the form ξ = a ζ + b for a suitable matrix a and vector b, where ζ is standard Gaussian in Rd . Then compute the distribution of ζ1 + · · · + ζn by a simple convolution.) 3. Prove the infinite divisibility of the Poisson and Gaussian distributions from Theorems 6.7 and 6.13. (Hint: Apply the mentioned theorems to suitable i.i.d. arrays.) 4. Show that a distribution μ on R with characteristic function μ ˆ (u) = e−|u| for a p ∈ (0, 2] is infinitely divisible. Then find the form of the associated L´evy measure. p

160

Foundations of Modern Probability

(Hint: Note that μ is symmetric p -stable, and conclude that ν(dx) = cp |x|−p−1 dx for a constant cp > 0.) 5. Show that the Cauchy distribution μ with density π −1 (1 + x2 )−1 is infinitely divisible, and find the associated L´evy measure. (Hint: Check that μ is symmetric 1-stable, and use the preceding exercise. It remains to find the constant c1 .) 6. Extend Proposition 7.10 to null arrays of spherically symmetric random vectors in Rd . 7. Derive Theorems 6.7 and 6.12 from Lemma 7.2 and Theorem 7.7. 8. Show by an example that Theorem 7.11 fails without the centering at truncated means. (Hint: Without the centering, condition (ii) of Theorem 7.14 becomes Σj E(ξnj ξnj ; |ξnj | ≤ h) → ah.)

9. If ξ is infinitely divisible with characteristics (a, b, ν) and p > 0, show that  E|ξ|p < ∞ iff |x|>1 |x|p ν(dx) < ∞. (Hint: If ν has bounded support, then E|ξ|p < ∞ for all p. It is then enough to consider compound Poisson distributions, for which the result is elementary.)

10. Prove directly that a Z+ -valued random variable ξ is infinitely divisible (on Z+ ) iff − log Esξ = Σk (1 − sk ) νk , s ∈ (0, 1], for a unique, bounded measure ν = (νk ) −x to show that the on N. (Hint: Assuming L(ξ) = μ∗n n , use the inequality 1 − x ≤ e w sequence (nμn ) is tight on N. Then nμn → ν along a sub-sequence, for a bounded measure ν on N. Finally note that − log(1 − x) ∼ x as x → 0. As for the uniqueness, take differences and use the uniqueness theorem for power series.) 11. Prove directly that a random variable ξ ≥ 0 is infinitely divisible iff − log Ee−uξ  = ua + (1 − e−ux ) ν(dx), u ≥ 0, for some unique constant a ≥ 0 and measure ν  on (0, ∞) with (|x| ∧ 1) < ∞. (Hint: If L(ξ) = μ∗n n , note that the measures w χn (dx) = n(1 − e−x ) μn (dx) are tight on R+ . Then χn → χ along a sub-sequence, −x and we may write χ(dx) = a δ0 (dx) + (1 − e ) ν(dx). The desired representation now follows as before. As for the uniqueness, take differences and use the uniqueness theorem for Laplace transforms.) 12. Prove directly that a random variable ξ is infinitely divisible iff ψu = log Eeiuξ exists and is given by Corollary 7.6 (i) for some unique constants a ≥ 0 and b and a  measure ν on R \ {0} with (x2 ∧ 1) ν(dx) < ∞. (Hint: Proceed as in Lemma 7.8.)

III. Conditioning and Martingales Modern probability theory can be said to begin with the theory of conditional expectations and distributions, along with the basic properties of martingales. This theory also involves the technical machinery of filtrations, optional and predictable times, predictable processes and compensators, all of which are indispensable for the further developments. The material of Chapter 8 is constantly used in all areas of probability theory, and should be thoroughly mastered by any serious student of the subject. The same thing can be said about at least the early parts of Chapter 9, including the basic properties of optional times and discrete-time martingales. The more advanced continuoustime theory, along with results for predictable processes and compensators in Chapter 10, might be postponed for a later study. −−− 8. Conditioning and disintegration. Here conditional expectations are introduced by the intuitive method of Hilbert space projection, which leads easily to the basic properties. Of equal importance is the notion of conditional distributions, along with the powerful disintegration and transfer theorems, constantly employed in all areas of modern probability theory. The latter result yields in particular the fundamental Daniell–Kolmogorov theorem, ensuring the existence of processes with given finite-dimensional distributions. 9. Optional times and martingales. Here we begin with the basic properties of optional times with associated σ-fields, along with a brief discussion of random time change. Martingales are then introduced as projective sequences of random variables, and we give short proofs of their basic properties, including the classical maximum inequalities, the optional sampling theorem, and the fundamental convergence and regularization theorems. After discussing a range of ramifications and applications, we conclude with some extensions to continuous-time sub-martingales. 10. Predictability and compensation. Here we introduce the classes of predictable times and processes, and give a complete, probabilistic proof of the fundamental Doob–Meyer decomposition. The latter result leads to the equally basic notion of compensator of a random measure on a product space R+ × S, and to a variety of characterizations and time-change reductions. We conclude with the more advanced theory of discounted compensators and associated predictable mapping theorems.

161

Chapter 8

Conditioning and Disintegration Conditional expectation, conditional variance and covariance, local property, uniform integrability, conditional probabilities, conditional distributions and disintegration, conditional independence, chain rule, projection and orthogonality, commutation criteria, iterated conditioning, extension and transfer, stochastic equations, coupling and randomization, existence of sequences and processes, extension by conditioning, infinite product measures

Modern probability theory can be said to begin with the notions of conditioning and disintegration. In particular, conditional expectations and distributions are needed already for the definitions of martingales and Markov processes, the two basic dependence structures beyond independence and stationarity. In many other areas throughout probability theory, conditioning is used as a universal tool, needed to describe and analyze systems involving randomness. The notion may be thought of in terms of averaging, projection, and disintegration—alternative viewpoints that are all essential for a proper understanding. In all but the most elementary contexts, conditioning is defined with respect to a σ-field rather than a single event. Here the resulting quantity is not a constant but a random variable, measurable with respect to the conditioning σ-field. The idea is familiar from elementary constructions of conditional expectations E(ξ| η), for random vectors (ξ, η) with a nice density, where the expected value becomes a function of η. This corresponds to conditioning on the generated σ-field F = σ(η). General conditional expectations are traditionally constructed by means of the Radon–Nikodym theorem. However, the simplest and most intuitive approach is arguably via Hilbert space projection, where E(ξ | F) is defined for any ξ ∈ L2 as the orthogonal projection of ξ onto the linear subspace of F-measurable random variables. The existence and basic properties of the L2 -version extend by continuity to arbitrary ξ ∈ L1 . The orthogonality yields E{ξ−E(ξ | F )}ζ = 0 for any bounded, F-measurable random variable ζ, which leads immediately to the familiar averaging characterization of E(ξ | F ) as a version of the density d(ξ · P )/dP on the σ-field F. The conditional expectation is only defined up to a P -null set, in the sense that any two versions agree a.s. We may then look for versions of the conditional probabilities P (A | F) = E(1A | F) that combine into a random probability measure on Ω. In general, such regular versions exist only for A restricted to a sufficiently nice sub- σ-field. The basic case is for random elements ξ in © Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1_9

163

164

Foundations of Modern Probability

a Borel space S, where the conditional distribution L(ξ | F ) exists as an Fmeasurable random measure on S. Assuming in addition that F = σ(η) for a random element η in a space T , we may write P {ξ ∈ B | η} = μ(η, B) for a probability kernel μ : T → S, which leads to a decomposition of the distribution of (ξ, η) according to the values of η. The result is formalized by the powerful disintegration theorem—an extension of Fubini’s theorem of constant use in subsequent chapters, especially in combination with the strong Markov property. Conditional distributions can be used to establish the basic transfer theod rem, often needed to convert a distributional equivalence ξ = f (η) into an a.s. d representation ξ = f (˜ η ), for a suitable choice of η˜ = η. The latter result leads in turn to the fundamental Daniell–Kolmogorov theorem, which guarantees the existence of a random sequence or process with specified finite-dimensional distributions. A different approach yields the more general Ionescu Tulcea extension, where the measure is specified by a sequence of conditional distributions. Further topics discussed in this chapter include the notion of conditional independence, which is fundamental for both Markov processes and exchangeability, and also plays an important role in connection with SDEs in Chapter 32, and for the random arrays treated in Chapter 28. Especially useful for various applications is the elementary but powerful chain rule. We finally call attention to the local property of conditional expectations, which leads in particular to simple and transparent proofs of the optional sampling and strong Markov properties. Returning to our construction of conditional expectations, fix a probability space (Ω, A, P ) and an arbitrary sub- σ-field F ⊂ A. Introduce in L2 = L2 (A) the closed linear subspace M , consisting of all random variables η ∈ L2 that agree a.s. with elements of L2 (F). By the Hilbert space projection Theorem 1.35, there exists for every ξ ∈ L2 an a.s. unique random variable η ∈ M with ξ −η ⊥ M , and we define E F ξ = E(ξ | F) as an arbitrary F-measurable version of η. The L2 -projection E F is easily extended to L1 , as follows: Theorem 8.1 (conditional expectation, Kolmogorov) For any σ-field F ⊂ A, there exists an a.s. unique linear operator E F : L1 → L1 (F), such that 



(i) E E F ξ; A = E(ξ; A), ξ ∈ L1 , A ∈ F . The operators E F have these further properties, whenever the corresponding expressions exist for the absolute values: (ii) ξ ≥ 0 ⇒ E F ξ ≥ 0 a.s., (iii) E|E F ξ| ≤ E|ξ|, (iv) 0 ≤ ξn ↑ ξ ⇒ E F ξn ↑ E F ξ a.s., (v) E F ξη = ξ E F η a.s. when ξ is F-measurable, 







(vi) E ξ E F η = E η E F ξ = E(E F ξ)(E F η),

8. Conditioning and Disintegration

165

(vii) E F E G ξ = E F ξ a.s., F ⊂ G. In particular, E F ξ = ξ a.s. iff ξ has an F-measurable version, and E F ξ = E ξ a.s. when ξ⊥ ⊥F . We often refer to (i) as the averaging property, to (ii) as the positivity, to (iii) as the L1 -contractivity, to (iv) as the monotone convergence property, to (v) as the pull-out property, to (vi) as the self-adjointness, and to (vii) as the tower property 1 . Since the operator E F is both self-adjoint by (vi) and idempotent by (vii), it may be thought of as a generalized projection on L1 . The first assertion is an immediate consequence of Theorem 2.10. However, the following projection approach is more elementary and intuitive, and it has the further advantage of leading easily to properties (ii)−(vii). Proof of Theorem 8.1: For ξ ∈ L2 , define E F ξ by projection as above. Then for any A ∈ F we get ξ − E F ξ ⊥ 1A , and (i) follows. Taking A = {E F ξ ≥ 0}, we get in particular         E E F ξ  = E E F ξ; A − E E F ξ; Ac = E(ξ; A) − E(ξ; Ac ) ≤ E|ξ|, proving (iii). The mapping E F is then uniformly L1 -continuous on L2 . Since L2 is dense in L1 by Lemma 1.12 and L1 is complete by Lemma 1.33, the operator E F extends a.s. uniquely to a linear and continuous mapping on L1 . Properties (i) and (iii) extend by continuity to L1 , and Lemma 1.26 shows that E F ξ is a.s. determined by (i). If ξ ≥ 0, we may combine (i) for A = {E F ξ ≤ 0} with Lemma 1.26 to obtain E F ξ ≥ 0 a.s., which proves (ii). If 0 ≤ ξn ↑ ξ, then ξn → ξ in L1 by dominated convergence, and so by (iii) we have E F ξn → E F ξ in L1 . Since E F ξn is a.s. non-decreasing in n by (ii), Lemma 5.2 yields the corresponding a.s. convergence, which proves (iv). Property (vi) is obvious when ξ, η ∈ L2 , and it extends L1 by (iv). To prove (v), we see from the characterization in (i) that E F ξ = ξ a.s. when ξ is F-measurable. In general we need to show that 



E( ξ η; A) = E ξE F η; A ,

A ∈ F,

which follows immediately from (vi). Finally, (vii) is obvious for ξ ∈ L2 since 2 L2 (F) ⊂ L2 (G), and it extends to the general case by means of (iv). Using conditional expectations, we can define conditional variances and covariances in the obvious way by Var(ξ | F ) = = Cov(ξ, η | F ) = = 1

also called the chain rule

E F (ξ − E F ξ)2 E F ξ 2 − (E F ξ)2 , E F (ξ − E F ξ)(η − E F η) E F ξη − (E F ξ)(E F η),

166

Foundations of Modern Probability

as long as the right-hand sides exist. Using the shorthand notation VarF(ξ) and CovF(ξ, η), we note the following computational rules, which may be compared with the simple rule E ξ = E(F F ξ) for expected values: Lemma 8.2 (conditional variance and covariance) For any random variables ξ, η ∈ L2 and σ-field F, we have (i) Var(ξ) = E[ VarF(ξ) ] + Var(E F ξ), (ii)

Cov(ξ, η) = E[ CovF(ξ, η) ] + Cov(E F ξ, E F η).

Proof: We need to prove only (ii), since (i) is the special case with ξ = η. Then write Cov(ξ, η) = E(ξη) − (Eξ)(Eη) = EE F (ξη) − (EE F ξ)(EE F η)



= E CovF(ξ, η) + (E F ξ)(E F η) − (EE F ξ)(EE F η)

= E CovF(ξ, η) + E(E F ξ)(E F η) − (EE F ξ)(EE F η)



= E CovF(ξ, η) + Cov(E F ξ, E F η).

2

Next we show that the conditional expectation E F ξ is local in both ξ and F, an observation that simplifies many proofs. Given two σ-fields F and G, we say that F = G on A if A ∈ F ∩ G and A ∩ F = A ∩ G. Lemma 8.3 (local property) Consider some σ-fields F, G ⊂ A and functions ξ, η ∈ L1 , and let A ∈ F ∩ G. Then F = G, ξ = η a.s. on A



E F ξ = E G η a.s. on A.

Proof: Since 1A E F ξ and 1A E G η are F ∩ G-measurable, we get B ≡ A ∩ {E ξ > E G η} ∈ F ∩ G, and the averaging property yields F





E E F ξ; B = E(ξ; B) = E(η; B) 



= E E G η; B . Hence, E F ξ ≤ E G η a.s. on A by Lemma 1.26. The reverse inequality follows by the symmetric argument. 2 The following technical result plays an important role in Chapter 9. Lemma 8.4 (uniform integrability, Doob) For any ξ ∈ L1 , the conditional expectations E(ξ |F ), F ⊂ A, are uniformly integrable. Proof: By Jensen’s inequality and the self-adjointness property, 





E |E F ξ|; A ≤ E E F | ξ|; A 





= E |ξ| P FA ,

A ∈ A.

8. Conditioning and Disintegration

167

By Lemma 5.10 we need to show that this tends to zero as P A → 0, uniformly in F. By dominated convergence along sub-sequences, it is then enough to show P that P Fn An → 0 for any σ-fields Fn ⊂ A and sets An ∈ A with P An → 0, 2 which is clear since EP Fn An = P An → 0. The conditional probability of an event A ∈ A, given a σ-field F, is defined as

P FA = E F 1A

or P (A |F ) = E(1A |F),

F

A ∈ A.

1

Thus, P A is the a.s. unique random variable in L (F) satisfying



E P FA; B = P (A ∩ B),

B ∈ F.

Note that P FA = P A a.s. iff A⊥⊥F, and that P FA = 1A a.s. iff A is a.s. Fmeasurable. The positivity of E F gives 0 ≤ P FA ≤ 1 a.s., and the monotone convergence property yields P F ∪ n An = Σ n P FAn a.s.,

A1 , A2 , . . . ∈ A disjoint.

(1)

F

Still the random set function P may not be a measure, in general, since the exceptional null set in (1) may depend on the sequence (An ). For random elements η in a measurable space (S, S), we define η -conditioning as conditioning with respect to the induced σ-field σ(η), so that E η ξ = E σ(η) ξ, or

P η A = P σ(η) A,

E(ξ | η) = E{ ξ | σ(η)}, P (A | η) = P {A | σ(η)}.

By Lemma 1.14, the η -measurable function E η ξ may be represented as f (η) for some measurable function f on S, determined a.e. L(η) by the averaging property

  E f (η); η ∈ B = E ξ; η ∈ B , B ∈ S. In particular, f depends only on the distribution of (ξ, η). The case of P η A is similar. Conditioning on a σ-field F is the special case where η is the identity map (Ω, A) → (Ω, F). Motivated by (1), we may hope to construct some measure-valued versions of the functions P F and P η . Then recall from Chapter 3 that, for any measurable spaces (T, T ) and (S, S), a kernel μ : T → S is defined as a function ¯ + , such that μ(t, B) is T -measurable in t ∈ T for fixed B μ: T ×S → R and a measure in B ∈ S for fixed t. In particular, μ is a probability kernel if μ(t, S) = 1 for all t. Random measures are simply kernels on the basic probability space Ω. Now fix a σ-field F ⊂ A and a random element η in a measurable space (T, T ). By a (regular) conditional distribution of η, given F, we mean an F-measurable probability kernel μ = L(η | F ) : Ω → T , such that μ(ω, B) = P {η ∈ B | F}ω a.s.,

ω ∈ Ω, B ∈ T .

168

Foundations of Modern Probability

The idea is to choose versions of the conditional probabilities on the right that combine for each ω into a probability measure in B. More generally, for any random element ξ in a measurable space (S, S), we define a conditional distribution of η, given ξ, as a random measure of the form

    μ(ξ, B) = P{η ∈ B | ξ} a.s.,   B ∈ T,   (2)

for a probability kernel μ: S → T. In the extreme cases where η is F-measurable or independent of F, we note that P{η ∈ B | F} has the regular version 1{η ∈ B} or P{η ∈ B}, respectively.

To ensure the existence of conditional distributions L(η | F), we need to impose some regularity conditions on the space T. The following existence and disintegration property is a key result of modern probability.

Theorem 8.5 (conditional distributions, disintegration) Let ξ, η be random elements in S, T, where T is Borel. Then L(ξ, η) = L(ξ) ⊗ μ for a probability kernel μ: S → T, where μ is unique a.e. L(ξ) and satisfies

    (i) L(η | ξ) = μ(ξ, ·) a.s.,
    (ii) E[f(ξ, η) | ξ] = ∫ μ(ξ, dt) f(ξ, t) a.s.,   f ≥ 0.

Proof: The stated disintegration with associated uniqueness hold by Theorem 3.4. Properties (i) and (ii) then follow by the averaging characterization of conditional expectations. □

Taking differences, we may extend (ii) to suitable real-valued functions f. When (S, S) = (Ω, F), the kernel μ becomes an F-measurable random measure ζ on T. Letting ξ be an F-measurable random element in a space T', we get by (ii)

    E[f(ξ, η) | F] = ∫ ζ(dt) f(ξ, t),   f ≥ 0.   (3)

Part (ii) may also be written in integrated form as

    E f(ξ, η) = E ∫ μ(ξ, dt) f(ξ, t),   f ≥ 0,   (4)

and similarly for (3). When ξ ⊥⊥ η we may choose μ(ξ, ·) ≡ L(η), in which case (4) reduces to the identity of Lemma 4.11.

Applying (4) to functions of the form f(ξ), we may extend many properties of ordinary expectations to a conditional setting. In particular, such extensions hold for the Jensen, Hölder, and Minkowski inequalities. The first of those yields the L^p-contractivity

    ‖E^F ξ‖_p ≤ ‖ξ‖_p,   ξ ∈ L^p, p ≥ 1.

Considering conditional distributions of entire sequences (ξ, ξ_1, ξ_2, ...), we may further derive conditional versions of the basic continuity properties of ordinary integrals. We list some simple applications of conditional distributions.
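To make the disintegration concrete, here is a minimal numerical sketch (not from the text; the Gaussian pair and the binning scheme are illustrative assumptions): it approximates the kernel μ of Theorem 8.5 by conditioning on bins of ξ and checks the integrated identity (4) by Monte Carlo.

```python
# Approximate the disintegration L(xi, eta) = L(xi) (x) mu by binning xi,
# then check identity (4): E f(xi, eta) = E Int mu(xi, dt) f(xi, t).
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
xi = rng.normal(size=n)                 # xi ~ N(0, 1)
eta = xi + rng.normal(size=n)           # eta depends on xi

# Crude regular conditional distribution: within each xi-bin, the empirical
# distribution of eta approximates mu(x, .) for x in that bin.
edges = np.quantile(xi, np.linspace(0, 1, 21))
idx = np.clip(np.digitize(xi, edges) - 1, 0, 19)

def f(x, t):                            # any bounded test function f >= 0
    return np.exp(-(x - t) ** 2)

lhs = f(xi, eta).mean()                 # E f(xi, eta)

rhs = 0.0                               # E Int mu(xi, dt) f(xi, t)
for k in range(20):
    in_bin = idx == k
    t_vals = eta[in_bin]                # approximate sample from mu(x, .)
    rhs += f(xi[in_bin][:, None], t_vals[None, :]).mean() * in_bin.mean()

print(lhs, rhs)                         # the two estimates should nearly agree
```

The two printed numbers differ only by the Monte Carlo and binning error, illustrating how (4) turns a conditional statement into one that can be checked by unconditional averaging.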


Lemma 8.6 (conditioning criteria) For any random elements ξ, η in Borel spaces S, T, where S = [0, 1] in (ii), we have

    (i) ξ ⊥⊥ η ⇔ L(ξ) = L(ξ | η) a.s. ⇔ L(η) = L(η | ξ) a.s.,
    (ii) ξ is a.s. η-measurable ⇔ ξ = E(ξ | η) a.s.

Proof: (i) Compare (4) with Lemma 4.11.

(ii) Use Theorem 8.1 (v). □

In case of independence, we have some further useful properties.

Lemma 8.7 (independence case) For any random elements ξ ⊥⊥ η in S, T and a measurable function f: S × T → U, where S, T, U are Borel, define X = f(ξ, η). Then

    (i) L(X | ξ) = L[f(x, η)]_{x=ξ} a.s.,
    (ii) X is a.s. ξ-measurable ⇔ (X, ξ) ⊥⊥ η.

Proof: (i) For any measurable function g ≥ 0 on S, we get by (4) and Lemma 8.6 (i)

    E[g(X); ξ ∈ B] = E[E{1_B(x)(g ∘ f)(x, η)}_{x=ξ}],

and so

    E[g(X) | ξ] = E[(g ∘ f)(x, η)]_{x=ξ} a.s.

(ii) If X is ξ-measurable, then so is the pair (X, ξ) by Lemma 1.9, and the relation ξ ⊥⊥ η yields (X, ξ) ⊥⊥ η. Conversely, assuming (X, ξ) ⊥⊥ η and using (i) and Lemma 8.6 (i), we get a.s.

    L(X, ξ) = L(X, ξ | η) = L[f(ξ, y), ξ]_{y=η},

so that

    (X, ξ) =^d (f(ξ, y), ξ),   y ∈ T a.s. L(η).

Fixing a y with equality and applying the resulting relation to the set A = {(z, x); z = f(x, y)}, which is measurable in U × S by the measurability of the diagonal in U^2, we obtain

    P{X = f(ξ, y)} = P{(X, ξ) ∈ A} = P{(f(ξ, y), ξ) ∈ A} = P(Ω) = 1.

Hence, X = f(ξ, y) a.s., which shows that X is a.s. ξ-measurable. □
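As a quick sanity check of Lemma 8.7 (i), the following simulation sketch (an assumed independent pair ξ ⊥⊥ η and an arbitrary f, purely for illustration) compares the empirical conditional law of X = f(ξ, η) near ξ = x with the law of f(x, η).

```python
# Illustrate L(X | xi = x) = L(f(x, eta)) for independent xi, eta.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
xi = rng.uniform(size=n)                # xi ~ U(0, 1)
eta = rng.normal(size=n)                # eta ~ N(0, 1), independent of xi
f = lambda x, y: x + np.sin(y)          # any measurable f
X = f(xi, eta)

x0, h = 0.5, 0.01                       # condition on xi close to x0
cond = np.abs(xi - x0) < h
lhs = X[cond]                           # approximate sample from L(X | xi = x0)
rhs = f(x0, rng.normal(size=lhs.size))  # sample from L(f(x0, eta))

print(np.quantile(lhs, [0.1, 0.5, 0.9]))
print(np.quantile(rhs, [0.1, 0.5, 0.9]))  # the two quantile rows should be close
```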

Next we show how distributional invariance properties are preserved under suitable conditioning. Here a mapping T on S is said to be bi-measurable, if it is a measurable bijection with a measurable inverse.


Lemma 8.8 (conditional invariance) For any random elements ξ, η in a Borel space S and a bi-measurable mapping T on S, we have

    T(ξ, η) =^d (ξ, η)  ⇒  (L(Tξ | η), Tξ) =^d (L(ξ | η), ξ).

Proof: Write L(ξ | η) = μ(η, ·), and note that μ depends only on L(ξ, η). Since also σ(Tη) = σ(η) by the bi-measurability of T, we get a.s. L(Tξ | η) = L(Tξ | Tη) = μ(Tη, ·), and so

    (L(Tξ | η), Tξ) = (μ(Tη, ·), Tξ) =^d (μ(η, ·), ξ) = (L(ξ | η), ξ). □

The notion of independence between σ-fields F_t, t ∈ T, extends immediately to the conditional setting. Thus, we say that the F_t are conditionally independent given a σ-field G, if for any distinct indices t_1, ..., t_n ∈ T, n ∈ N, we have

    P^G ∩_{k≤n} B_k = Π_{k≤n} P^G B_k a.s.,   B_k ∈ F_{t_k}, k = 1, ..., n.

This reduces to ordinary independence when G is the trivial σ-field {Ω, ∅}. The pairwise version of conditional independence is denoted by ⊥⊥_G. Conditional independence involving events A_t or random elements ξ_t, t ∈ T, is defined as before in terms of the induced σ-fields σ(A_t) or σ(ξ_t), respectively, and the notation involving ⊥⊥ carries over to this case. In particular, any F-measurable random elements ξ_t are conditionally independent given F. If instead the ξ_t are independent of F, their conditional independence given F is equivalent to ordinary independence between the ξ_t.

By Theorem 8.5, every general statement or formula involving independence between countably many random elements in Borel spaces has a conditional counterpart. For example, Lemma 4.8 shows that the σ-fields F_1, F_2, ... are conditionally independent, given a σ-field G, iff

    (F_1, ..., F_n) ⊥⊥_G F_{n+1},   n ∈ N.

More can be said in the conditional case, and we begin with a basic characterization. Here and below, F, G, ... with or without subscripts denote sub-σ-fields of A, and F ∨ G = σ(F, G).

Theorem 8.9 (conditional independence, Doob) For any σ-fields F, G, H, we have

    F ⊥⊥_G H  ⇔  P^{F∨G} = P^G a.s. on H.


Proof: Assuming the second condition and using the tower and pull-out properties of conditional expectations, we get for any F ∈ F and H ∈ H

    P^G(F ∩ H) = E^G P^{F∨G}(F ∩ H)
               = E^G[P^{F∨G} H; F]
               = E^G[P^G H; F]
               = (P^G F)(P^G H),

showing that F ⊥⊥_G H. Conversely, assuming F ⊥⊥_G H and using the tower and pull-out properties, we get for any F ∈ F, G ∈ G, and H ∈ H

    E[P^G H; F ∩ G] = E[(P^G F)(P^G H); G]
                    = E[P^G(F ∩ H); G]
                    = P(F ∩ G ∩ H).

By a monotone-class argument, this extends to

    E[P^G H; A] = P(H ∩ A),   A ∈ F ∨ G,

and the stated condition follows by the averaging characterization of P^{F∨G} H. □

The following simple consequence is needed in Chapters 27–28.

Corollary 8.10 (contraction) For any random elements ξ, η, ζ, we have

    (ξ, η) =^d (ξ, ζ),  σ(η) ⊂ σ(ζ)  ⇒  ξ ⊥⊥_η ζ.

Proof: For any measurable set B, we note that the variables

    M_1 = P{ξ ∈ B | η},   M_2 = P{ξ ∈ B | ζ}

form a bounded martingale with M_1 =^d M_2. Then E(M_2 − M_1)^2 = E M_2^2 − E M_1^2 = 0, and so M_1 = M_2 a.s., which implies ξ ⊥⊥_η ζ by Theorem 8.9. □

The last theorem yields some further useful properties. Let Ḡ denote the completion of G with respect to the basic σ-field A, generated by G and the family of null sets N = {N ⊂ A; A ∈ A, PA = 0}.

Corollary 8.11 (extension and inclusion) For any σ-fields F, G, H,

    (i) F ⊥⊥_G H ⇔ F ⊥⊥_G (G, H),
    (ii) F ⊥⊥_G F ⇔ F ⊂ Ḡ.

Proof: (i) By Theorem 8.9, both relations are equivalent to

    P(F | G, H) = P(F | G) a.s.,   F ∈ F.

(ii) If F ⊥⊥_G F, Theorem 8.9 yields

    1_F = P(F | F, G) = P(F | G) a.s.,   F ∈ F,

which implies F ⊂ Ḡ. Conversely, the latter relation implies

    P(F | G) = P(F | Ḡ) = 1_F = P(F | F, G) a.s.,   F ∈ F,

and so F ⊥⊥_G F by Theorem 8.9. □
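The criterion of Theorem 8.9 is easy to test on a finite example. The sketch below (a hypothetical three-variable discrete model, not from the text) builds a joint law of the form p(g) p(f | g) p(h | g), which forces F ⊥⊥_G H, and verifies Doob's criterion P(H | F, G) = P(H | G) for every cell.

```python
# Verify P(h | f, g) = P(h | g) for a joint pmf of the form p(g)p(f|g)p(h|g).
import itertools
import numpy as np

rng = np.random.default_rng(2)
pg = rng.dirichlet(np.ones(3))                 # p(g)
pf_g = rng.dirichlet(np.ones(4), size=3)       # p(f | g), rows indexed by g
ph_g = rng.dirichlet(np.ones(2), size=3)       # p(h | g), rows indexed by g
p = np.einsum('g,gf,gh->fgh', pg, pf_g, ph_g)  # joint over (f, g, h)

cond_fg = p / p.sum(axis=2, keepdims=True)     # P(h | f, g)
cond_g = p.sum(axis=0) / p.sum(axis=(0, 2))[:, None]  # P(h | g)

for f, g in itertools.product(range(4), range(3)):
    assert np.allclose(cond_fg[f, g], cond_g[g])
print("P(H | F, G) = P(H | G) on all cells: conditional independence verified")
```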

The following basic result is often applied in both directions.

Theorem 8.12 (chain rule) For any σ-fields G, H, F_1, F_2, ..., these conditions are equivalent:

    (i) H ⊥⊥_G (F_1, F_2, ...),
    (ii) H ⊥⊥_{G, F_1, ..., F_n} F_{n+1},   n ≥ 0.

In particular, we often need to employ the equivalence

    H ⊥⊥_G (F, F')  ⇔  H ⊥⊥_G F  and  H ⊥⊥_{G, F} F'.

Proof: Assuming (i), we get by Theorem 8.9 for any H ∈ H and n ≥ 0

    P(H | G, F_1, ..., F_n) = P(H | G) = P(H | G, F_1, ..., F_{n+1}),

and (ii) follows by another application of Theorem 8.9. Assuming (ii) instead, we get by Theorem 8.9 for any H ∈ H

    P(H | G, F_1, ..., F_n) = P(H | G, F_1, ..., F_{n+1}),   n ≥ 0.

Summing over n < m gives

    P(H | G) = P(H | G, F_1, ..., F_m),   m ≥ 1,

and so Theorem 8.9 yields H ⊥⊥_G (F_1, ..., F_m) for all m ≥ 1, which extends to (i) by a monotone-class argument. □

The last result is also useful to establish ordinary independence. Then take G = {∅, Ω} in Theorem 8.12 to get

    H ⊥⊥ (F_1, F_2, ...)  ⇔  H ⊥⊥_{F_1, ..., F_n} F_{n+1},   n ≥ 0.


Reasoning with conditional expectations and independence is notoriously subtle and non-intuitive. It may then be helpful to translate the probabilistic notions into their geometric counterparts in suitable Hilbert spaces. For any σ-field F on Ω, we introduce the centered² sub-space L²_0(F) ⊂ L² of F-measurable random variables ξ with E ξ = 0 and E ξ² < ∞. For sub-spaces F ⊂ L²_0, let π_F denote projection onto F, and write F^⊥ for the orthogonal complement of F. Define F ⊥ G by the orthogonality E(ξη) = 0 of all variables ξ ∈ F and η ∈ G, and write F ⊖ G = F ∩ G^⊥ and F ∨ G = span(F, G). When F ⊥ G, we may also write F ∨ G = F ⊕ G. We further introduce the conditional orthogonality

    F ⊥_G H  ⇔  π_{G^⊥} F ⊥ π_{G^⊥} H.

Theorem 8.13 (conditional independence and orthogonality) Let F, G, H be σ-fields on Ω generating the centered sub-spaces F, G, H ⊂ L²_0. Then

    F ⊥⊥_G H  ⇔  F ⊥_G H.

By Theorem 8.1 (vi), the latter condition may also be written as π_{G^⊥} F ⊥ H or F ⊥ π_{G^⊥} H. For the trivial σ-field G = {∅, Ω}, we get the useful equivalence

    F ⊥⊥ H  ⇔  F ⊥ H,   (5)

which can also be verified directly.

Proof: Using Theorem 8.9 and Lemmas A4.5 and A4.7, we get

    F ⊥⊥_G H ⇔ (π_{F∨G} − π_G) H = 0
            ⇔ π_{(F∨G) ⊖ G} H = 0
            ⇔ (F ∨ G) ⊖ G ⊥ H
            ⇔ π_{G^⊥} F ⊥ H,

where the last equivalence follows from the fact that, by Lemma A4.5,

    F ∨ G = (π_G F ⊕ π_{G^⊥} F) ∨ G = π_{G^⊥} F ∨ G = π_{G^⊥} F ⊕ G. □

For a simple illustration, we note the following useful commutativity criterion, which follows immediately from Theorem 8.13 and Lemma A4.8.

Corollary 8.14 (commutativity) For any σ-fields F, G, we have

    E^F E^G = E^G E^F = E^{F∩G}  ⇔  F ⊥⊥_{F∩G} G.

² Using L²_0(F) instead of L²(F) has some technical advantages. In particular, it makes the trivial σ-field {∅, Ω} correspond to the null space 0 ⊂ L².
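The Hilbert-space picture can be made concrete on a finite probability space. The following sketch (an illustrative example, with an assumed partition generating G) computes E(ξ | G) as the block-wise average, i.e. the orthogonal projection onto L²(G), and checks that ξ − E(ξ | G) is orthogonal to a G-measurable variable.

```python
# Conditional expectation as an orthogonal projection on a finite Omega.
import numpy as np

rng = np.random.default_rng(3)
omega = 12                                # points of Omega
p = rng.dirichlet(np.ones(omega))         # probability weights
xi = rng.normal(size=omega)               # a random variable on Omega

blocks = [list(range(0, 4)), list(range(4, 8)), list(range(8, 12))]
cond = np.empty(omega)
for b in blocks:                          # E(xi | G) is constant on each block
    cond[b] = np.dot(p[b], xi[b]) / p[b].sum()

# Orthogonality: E[(xi - E(xi | G)) * eta] = 0 for any G-measurable eta,
# i.e. any variable that is constant on the blocks of the partition.
eta = np.concatenate([np.full(4, c) for c in (2.0, -1.0, 0.5)])
print(np.dot(p, (xi - cond) * eta))       # ~ 0 up to floating-point error
```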


Conditional distributions can sometimes be constructed by suitable iteration. To explain the notation, let ξ, η, ζ be random elements in some Borel spaces S, T, U. Since S × U and T × U are again Borel, there exist some probability kernels μ: S → T × U and μ': T → S × U, such that a.s.

    L(η, ζ | ξ) = μ(ξ, ·),   L(ξ, ζ | η) = μ'(η, ·),

corresponding to the dual disintegrations

    L(ξ, η, ζ) = L(ξ) ⊗ μ ≅ L(η) ⊗ μ'.

For fixed s or t, we may regard μ_s and μ'_t as probability measures in their own right, and repeat the conditioning, leading to disintegrations of the form

    μ_s = μ̄_s ⊗ ν_s,   μ'_t = μ̄'_t ⊗ ν'_t,

where μ̄_s = μ_s(· × U) and μ̄'_t = μ'_t(· × U), for some kernels ν_s: T → U and ν'_t: S → U. It is suggestive to write even the latter relations in terms of conditioning, as in

    μ_s(ζ̃ ∈ · | η̃) = ν_s(η̃, ·),   μ'_t(ζ̃ ∈ · | ξ̃) = ν'_t(ξ̃, ·),

for the coordinate variables ξ̃, η̃, ζ̃ in S, T, U. Since all spaces are Borel, Corollary 3.6 allows us to choose ν_s and ν'_t as kernels S × T → U. With a slight abuse of notation, we may write the iterated conditioning in the form

    P(· | η | ξ)_{s,t} = P(· | ξ)_s(· | η)_t,   P(· | ξ | η)_{t,s} = P(· | η)_t(· | ξ)_s.

Substituting s = ξ and t = η, and putting F = σ(ξ), G = σ(η), and H = σ(ζ), we get some (ξ, η)-measurable random probability measures on (Ω, H), here written suggestively as

    (P_F)_G = P(· | η | ξ)_{ξ,η},   (P_G)_F = P(· | ξ | η)_{η,ξ}.

Using this notation, we may state the basic properties of iterated conditioning in a striking form. Here we say that a σ-field F is Borel generated, if F = σ(ξ) for some random element ξ in a Borel space.

Theorem 8.15 (iterated conditioning) For any Borel generated σ-fields F, G, H, we have a.s. on H

    (i) (P_F)_G = (P_G)_F = P_{F∨G},
    (ii) P_F = E_F (P_F)_G.


Part (i) must not be confused with the elementary commutativity in Corollary 8.14. Similarly, we must not confuse (ii) with the elementary tower property in Theorem 8.1 (vii), which may be written as P_F = E_F P_{F∨G} a.s.

Proof: Writing

    F = σ(ξ),   G = σ(η),   H = σ(ζ),

for some random elements in Borel spaces S, T, U, and putting

    μ_1 = L(ξ),   μ_2 = L(η),   μ_{12} = L(ξ, η),   μ_{13} = L(ξ, ζ),   μ_{123} = L(ξ, η, ζ),

we get from Theorem 3.7 the relations

    μ_{3|2|1} = μ_{3|12} ≅ μ_{3|1|2} a.e. μ_{12},
    μ_{3|1} = μ_{2|1} μ_{3|12} a.e. μ_1,

which are equivalent to (i) and (ii). We can also obtain (ii) directly from (i), by noting that

    E_F (P_F)_G = E_F P_{F∨G} = P_F,

by the tower property in Theorem 8.1. □

Regular conditional distributions can be used to construct random elements with desired properties. This may require an extension of the basic probability space. By an extension of (Ω, A, P) we mean a product space (Ω̂, Â) = (Ω × S, A ⊗ S), equipped with a probability measure P̂ satisfying P̂(· × S) = P. Any random element ξ on Ω may be regarded as a function on Ω̂. Thus, we simply replace ξ by the random element ξ̂(ω, s) = ξ(ω) on Ω̂, which has the same distribution. For extensions of this type, we retain our original notation and write P and ξ instead of P̂ and ξ̂.

We begin with an elementary extension suggested by Theorem 8.5. The result is needed for various constructions in Chapters 27–28.

Lemma 8.16 (extension) For any probability kernel μ: S → T and random elements ξ, ζ in S, U, there exists a random element η in T, defined on a suitable extension of the probability space, such that

    L(η | ξ, ζ) = μ(ξ, ·) a.s.,   η ⊥⊥_ξ ζ.

Proof: Put (Ω̂, Â) = (Ω × T, A ⊗ T), where T denotes the σ-field in T, and define a probability measure P̂ on Ω̂ by

    P̂ A = E ∫ 1_A(·, t) μ(ξ, dt),   A ∈ Â.


Then clearly P̂(· × T) = P, and the random element η(ω, t) ≡ t on Ω̂ satisfies L(η | A) = μ(ξ, ·) a.s. In particular, Theorem 8.9 yields η ⊥⊥_ξ A. □

Most constructions require only a single randomization variable, defined as a U(0, 1) random variable ϑ, independent of all previously introduced random elements and σ-fields. We always assume the basic probability space to be rich enough to support any randomization variables we need. This involves no essential loss of generality, since the condition becomes fulfilled after a simple extension of the original space. Thus, we may take

    Ω̂ = Ω × [0, 1],   P̂ = P ⊗ λ,   Â = A ⊗ B_{[0,1]},

where λ denotes Lebesgue measure on [0, 1]. Then ϑ(ω, t) ≡ t is U(0, 1) on Ω̂ with ϑ ⊥⊥ A. By Lemma 4.21 we may use ϑ to produce a whole sequence of independent randomization variables ϑ_1, ϑ_2, ..., if needed.

The following powerful and commonly used result shows how a probabilistic structure can be transferred between different contexts through a suitable randomization.

Theorem 8.17 (transfer) Let ξ, η, ζ be random elements in S, T, U, where T is Borel. Then for any ξ̃ =^d ξ, there exists a random element η̃ in T with

    (i) (ξ̃, η̃) =^d (ξ, η),   (ii) η̃ ⊥⊥_{ξ̃} ζ.

Proof: (i) By Theorem 8.5, there exists a probability kernel μ: S → T satisfying

    μ(ξ, B) = P{η ∈ B | ξ} a.s.,   B ∈ T.

Next, Lemma 4.22 yields a measurable function f: S × [0, 1] → T, such that, for any U(0, 1) random variable ϑ ⊥⊥ (ξ̃, ζ),

    L[f(s, ϑ)] = μ(s, ·),   s ∈ S.

Defining η̃ = f(ξ̃, ϑ) and using Lemmas 1.24 and 4.11 together with Theorem 8.5, we get for any measurable function g: S × T → R_+

    E g(ξ̃, η̃) = E g(ξ̃, f(ξ̃, ϑ))
               = E ∫ g(ξ, f(ξ, u)) du
               = E ∫ g(ξ, t) μ(ξ, dt)
               = E g(ξ, η),

which shows that (ξ̃, η̃) =^d (ξ, η).

(ii) By Theorem 8.12 we have ϑ ⊥⊥_{ξ̃} ζ, and so Corollary 8.11 yields (ξ̃, ϑ) ⊥⊥_{ξ̃} ζ, which implies η̃ ⊥⊥_{ξ̃} ζ. □

˜ η˜) = (ξ, η). which shows that (ξ, ˜ ϑ)⊥⊥ ˜ ζ, (ii) By Theorem 8.12 we have ϑ⊥⊥ξ˜ ζ , and so Corollary 8.11 yields (ξ, ξ 2 which implies η˜⊥⊥ξ˜ ζ. The last result can be used to transfer representations of random objects:


Corollary 8.18 (stochastic equations) For any random elements ξ, η in Borel spaces S, T and a measurable map f: T → S with ξ =^d f(η), we may choose η̃ =^d η with ξ = f(η̃) a.s.

Proof: By Theorem 8.17 there exists a random element η̃ in T with (ξ, η̃) =^d (f(η), η). In particular, η̃ =^d η and (ξ, f(η̃)) =^d (f(η), f(η)). Since the diagonal in S^2 is measurable, we get

    P{ξ = f(η̃)} = P{f(η) = f(η)} = 1,

and so ξ = f(η̃) a.s. □

This leads in particular to a useful extension of Theorem 5.31.

and so ξ = f (˜ η ) a.s. This leads in particular to a useful extension of Theorem 5.31.

Corollary 8.19 (extended Skorohod coupling) Let f, f1 , f2 , . . . be measurable functions from a Borel space S to a Polish space T , and let ξ, ξ1 , ξ2 , . . . be d ˜ ξ˜1 , ξ˜2 , . . . random elements in S with fn (ξn ) → f (ξ). Then we may choose ξ, with d d ˜ a.s. fn (ξ˜n ) → f (ξ) ξ˜ = ξ, ξ˜n = ξn , d

d

Proof: By Theorem 5.31 there exist some η = f (ξ) and ηn = fn (ξn ) with d d ηn → η a.s. By Corollary 8.18 we may further choose ξ˜ = ξ and ξ˜n = ξn , such ˜ a.s. ˜ = η and fn (ξ˜n ) = ηn for all n. But then fn (ξ˜n ) → f (ξ) 2 that a.s. f (ξ) We proceed to clarify the relationship between conditional independence and randomizations. Important applications appear in Chapters 11, 13, 27– 28, and 32. Proposition 8.20 (conditional independence by randomization) Let ξ, η, ζ be random elements in S, T, U , where S is Borel. Then these conditions are equivalent: ζ, (i) ξ ⊥ ⊥ η (ii) ξ = f (η, ϑ) a.s. for a measurable function f : T × [0, 1] → S and a U (0, 1) random variable ϑ⊥⊥(η, ζ). Proof: The implication (ii) ⇒ (i) holds by the argument for Theorem 8.17 (ii). Now assume (i), and let ϑ⊥⊥(η, ζ) be U (0, 1). Then Theorem 8.17 yields a measurable function f : T × [0, 1] → S, such that the random element ξ˜ = d d ˜ η) = (ξ, η). We further see from the sufficiency f (η, ϑ) satisfies ξ˜ = ξ and (ξ, ˜ ⊥η ζ. Hence, Proposition 8.9 yields part that ξ⊥ L{ξ˜ | η, ζ} = L( ξ˜ | η) = L( ξ | η) = L{ξ | η, ζ},


and so (ξ̃, η, ζ) =^d (ξ, η, ζ). By Theorem 8.17, we may next choose ϑ̃ =^d ϑ with (ξ, η, ζ, ϑ̃) =^d (ξ̃, η, ζ, ϑ). In particular, ϑ̃ ⊥⊥ (η, ζ) and (ξ, f(η, ϑ̃)) =^d (ξ̃, f(η, ϑ)). Since ξ̃ = f(η, ϑ), and the diagonal in S^2 is measurable, we get ξ = f(η, ϑ̃) a.s., which is the required relation with ϑ̃ in place of ϑ. □

The transfer theorem can be used to construct random sequences or processes with given finite-dimensional distributions. For any measurable spaces S_1, S_2, ..., we say that a sequence of distributions³ μ_n on S_1 × ··· × S_n, n ∈ N, is projective if

    μ_{n+1}(· × S_{n+1}) = μ_n,   n ∈ N.   (6)

Theorem 8.21 (existence of random sequences, Daniell) For any projective sequence of distributions μ_n on S_1 × ··· × S_n, n ∈ N, where S_2, S_3, ... are Borel, there exist some random elements ξ_n in S_n, n ∈ N, such that

    L(ξ_1, ..., ξ_n) = μ_n,   n ∈ N.

Proof: By Lemmas 4.10 and 4.21 there exist some independent random variables ξ_1, ϑ_2, ϑ_3, ..., such that L(ξ_1) = μ_1 and the ϑ_n are i.i.d. U(0, 1). We proceed recursively to construct ξ_2, ξ_3, ... with the stated properties, such that each ξ_n is a measurable function of ξ_1, ϑ_2, ..., ϑ_n. Once ξ_1, ..., ξ_n are constructed, let L(η_1, ..., η_{n+1}) = μ_{n+1}. Then the projective property yields (ξ_1, ..., ξ_n) =^d (η_1, ..., η_n), and so by Theorem 8.17 we may form ξ_{n+1} as a measurable function of ξ_1, ..., ξ_n, ϑ_{n+1}, such that (ξ_1, ..., ξ_{n+1}) =^d (η_1, ..., η_{n+1}). This completes the recursion. □

Next we show how a given process can be extended to any unbounded index set. The result is stated in an abstract form, designed to fulfill our needs in especially Chapters 12 and 17. Let ι denote the identity map on any space.

Corollary 8.22 (projective limits) Consider some Borel spaces S, S_1, S_2, ... and measurable maps π_n: S → S_n and π_k^n: S_n → S_k, k ≤ n, such that

    π_k^n = π_k^m ∘ π_m^n,   k ≤ m ≤ n.   (7)

Let S̄ be the class of sequences (s_1, s_2, ...) ∈ S_1 × S_2 × ··· with π_k^n s_n = s_k for all k ≤ n, and let the map h: S̄ → S be measurable with (π_1, π_2, ...) ∘ h = ι on S̄. Then for any distributions μ_n on S_n with

    μ_n ∘ (π_k^n)^{-1} = μ_k,   k ≤ n ∈ N,   (8)

there exists a distribution μ on S with μ ∘ π_n^{-1} = μ_n for all n.

Proof: Introduce the measures

    μ̄_n = μ_n ∘ (π_1^n, ..., π_n^n)^{-1},   n ∈ N,   (9)

³ probability measures


and conclude from (7) and (8) that

    μ̄_{n+1}(· × S_{n+1}) = μ_{n+1} ∘ (π_1^{n+1}, ..., π_n^{n+1})^{-1}
                          = μ_{n+1} ∘ (π_n^{n+1})^{-1} ∘ (π_1^n, ..., π_n^n)^{-1}
                          = μ_n ∘ (π_1^n, ..., π_n^n)^{-1} = μ̄_n.

By Theorem 8.21 there exists a measure μ̄ on S_1 × S_2 × ··· with

    μ̄ ∘ (π̄_1, ..., π̄_n)^{-1} = μ̄_n,   n ∈ N,   (10)

where π̄_1, π̄_2, ... denote the coordinate projections in S_1 × S_2 × ···. From (7), (9), and (10) we see that μ̄ is restricted to S̄, which allows us to define μ = μ̄ ∘ h^{-1}. It remains to note that

    μ ∘ π_n^{-1} = μ̄ ∘ (π_n h)^{-1}
                 = μ̄ ∘ π̄_n^{-1}
                 = μ̄_n ∘ π̄_n^{-1}
                 = μ_n ∘ (π_n^n)^{-1} = μ_n. □

We often need a version of Theorem 8.21 for processes on a general index set T. Then for any collection of measurable spaces (S_t, S_t), t ∈ T, we define (S_I, S_I) = ⊗_{t∈I} (S_t, S_t), I ⊂ T. For any random elements ξ_t in S_t, write ξ_I for the restriction of the process (ξ_t) to the index set I. Now let T̂ and T̄ be the classes of finite and countable subsets of T, respectively. A family of probability measures μ_I, I ∈ T̂ or T̄, is said to be projective if

    μ_J(· × S_{J\I}) = μ_I,   I ⊂ J in T̂ or T̄.   (11)

Theorem 8.23 (existence of processes, Kolmogorov) For any Borel spaces S_t, t ∈ T, consider a projective family of probability measures μ_I on S_I, I ∈ T̂. Then there exist some random elements X_t in S_t, t ∈ T, such that

    L(X_I) = μ_I,   I ∈ T̂.

Proof: The product σ-field S_T in S_T is generated by all coordinate projections π_t, t ∈ T, and hence consists of all countable cylinder sets B × S_{T\U}, B ∈ S_U, U ∈ T̄. For every U ∈ T̄, Theorem 8.21 yields a probability measure μ_U on S_U with

    μ_U(· × S_{U\I}) = μ_I,   I ∈ Û,

and Proposition 4.2 shows that the family μ_U, U ∈ T̄, is again projective. We may then define a function μ: S_T → [0, 1] by

    μ(· × S_{T\U}) = μ_U,   U ∈ T̄.

To show that μ is countably additive, consider any disjoint sets A_1, A_2, ... ∈ S_T. For every n we have A_n = B_n × S_{T\U_n} for some U_n ∈ T̄ and B_n ∈ S_{U_n}. Writing U = ∪_n U_n and C_n = B_n × S_{U\U_n}, we get

    μ ∪_n A_n = μ_U ∪_n C_n = Σ_n μ_U C_n = Σ_n μ A_n.

We may now define the process X = (X_t) as the identity map on the probability space (S_T, S_T, μ). □

If the projective sequence in Theorem 8.21 is defined recursively in terms of conditional distributions, then no regularity condition is needed on the state spaces. For a precise statement, define the composition μ ⊗ ν of two kernels μ and ν as in Chapter 3.

Theorem 8.24 (extension by conditioning, Ionescu Tulcea) For any measurable spaces (S_n, S_n) and probability kernels μ_n: S_1 × ··· × S_{n-1} → S_n, n ∈ N, there exist some random elements ξ_n in S_n, n ∈ N, such that

    L(ξ_1, ..., ξ_n) = μ_1 ⊗ ··· ⊗ μ_n,   n ∈ N.

Proof: Put F_n = S_1 ⊗ ··· ⊗ S_n and T_n = S_{n+1} × S_{n+2} × ···, and note that the class C = ∪_n (F_n × T_n) is a field in T_0 generating the σ-field F_∞. Define an additive function μ on C by

    μ(A × T_n) = (μ_1 ⊗ ··· ⊗ μ_n) A,   A ∈ F_n, n ∈ N,   (12)

which is clearly independent of the representation C = A × T_n. We need to extend μ to a probability measure on F_∞. By Theorem 2.5, it is then enough to show that μ is continuous at ∅. For any C_1, C_2, ... ∈ C with C_n ↓ ∅, we need to show that μC_n → 0. Renumbering if necessary, we may assume for every n that C_n = A_n × T_n with A_n ∈ F_n. Now define

    f_k^n = (μ_{k+1} ⊗ ··· ⊗ μ_n) 1_{A_n},   k ≤ n,   (13)

with the understanding that f_n^n = 1_{A_n} for k = n. By Lemmas 3.2 and 3.3, each f_k^n is an F_k-measurable function on S_1 × ··· × S_k, and (13) yields

    f_k^n = μ_{k+1} f_{k+1}^n,   0 ≤ k < n.   (14)

Since C_n ↓ ∅, the functions f_k^n are non-increasing in n for fixed k, say with limits g_k. By (14) and dominated convergence,

    g_k = μ_{k+1} g_{k+1},   k ≥ 0.   (15)

Combining (12) and (13) gives μC_n = f_0^n ↓ g_0. If g_0 > 0, then (15) yields an s_1 ∈ S_1 with g_1(s_1) > 0. Continuing recursively, we may construct a sequence s̄ = (s_1, s_2, ...) ∈ T_0, such that g_n(s_1, ..., s_n) > 0 for all n. Then

    1_{C_n}(s̄) = 1_{A_n}(s_1, ..., s_n) = f_n^n(s_1, ..., s_n) ≥ g_n(s_1, ..., s_n) > 0,

and so s̄ ∈ ∩_n C_n, which contradicts the assumption C_n ↓ ∅. Thus, g_0 = 0, which means that μC_n → 0. □

In particular, we may choose some independent random elements with arbitrarily prescribed distributions. The result extends the elementary Theorem 4.19.

Corollary 8.25 (infinite product measures, Łomnicki & Ulam) For any probability spaces (S_t, S_t, μ_t), t ∈ T, there exist some independent random elements ξ_t in S_t with

    L(ξ_t) = μ_t,   t ∈ T.

Proof: For countable subsets I ⊂ T, the associated product measures μ_I = ⊗_{t∈I} μ_t exist by Theorem 8.24. Now proceed as in the proof of Theorem 8.23. □

Exercises

1. Show that (ξ, η) =^d (ξ', η) iff P(ξ ∈ B | η) = P(ξ' ∈ B | η) a.s. for any measurable set B.

2. Show that E^F ξ = E^G ξ a.s. for all ξ ∈ L^1 iff F̄ = Ḡ.

3. Show that the averaging property implies the other properties of conditional expectations listed in Theorem 8.1.

4. State the probabilistic counterpart of Lemma A4.9, and give a direct proof.

5. Let 0 ≤ ξ_n ↑ ξ and 0 ≤ η ≤ ξ, where ξ_1, ξ_2, ..., η ∈ L^1, and fix a σ-field F. Show that E^F η ≤ sup_n E^F ξ_n. (Hint: Apply the monotone convergence property to E^F(ξ_n ∧ η).)

6. For any [0, ∞]-valued random variable ξ, define E^F ξ = sup_n E^F(ξ ∧ n). Show that this extension of E^F satisfies the monotone convergence property. (Hint: Use the preceding result.)

7. Show that the above extension of E^F remains characterized by the averaging property, and that E^F ξ < ∞ a.s. iff the measure ξ · P = E(ξ; ·) is σ-finite on F. Extend E^F ξ to random variables ξ such that the measure |ξ| · P is σ-finite on F.

8. Let ξ_1, ξ_2, ... be [0, ∞]-valued random variables, and fix any σ-field F. Show that lim inf_n E^F ξ_n ≥ E^F lim inf_n ξ_n a.s.

9. Fix any σ-field F, and let ξ, ξ_1, ξ_2, ... be random variables with ξ_n → ξ and E^F sup_n |ξ_n| < ∞ a.s. Show that E^F ξ_n → E^F ξ a.s.

10. Let the σ-field F be generated by a partition A_1, A_2, ... ∈ A of Ω. For any ξ ∈ L^1, show that E(ξ | F) = E(ξ | A_k) = E(ξ; A_k)/PA_k on A_k whenever PA_k > 0.

11. For any σ-field F, event A, and random variable ξ ∈ L^1, show that E(ξ | F, 1_A) = E(ξ; A | F)/P(A | F) a.s. on A.

12. Let the random variables ξ_1, ξ_2, ... ≥ 0 and σ-fields F_1, F_2, ... be such that E(ξ_n | F_n) → 0 in probability. Show that ξ_n → 0 in probability. (Hint: Consider the random variables ξ_n ∧ 1.)

13. Let (ξ, η) =^d (ξ̃, η̃), where ξ ∈ L^1. Show that E(ξ | η) =^d E(ξ̃ | η̃). (Hint: If E(ξ | η) = f(η), then E(ξ̃ | η̃) = f(η̃) a.s.)

14. Let (ξ, η) be a random vector in R^2 with probability density f, put F(y) = ∫ f(x, y) dx, and let g(x, y) = f(x, y)/F(y). Show that P(ξ ∈ B | η) = ∫_B g(x, η) dx a.s.

15. Use conditional distributions to deduce the monotone and dominated convergence theorems for conditional expectations from the corresponding unconditional results.

16. Assume E^F ξ =^d ξ for a ξ ∈ L^1. Show that ξ is a.s. F-measurable. (Hint: Choose a strictly convex function f with Ef(ξ) < ∞, and apply the strict Jensen inequality to the conditional distributions.)

17. Assuming (ξ, η) =^d (ξ, ζ), where η is ζ-measurable, show that ξ ⊥⊥_η ζ. (Hint: Show as above that P(ξ ∈ B | η) =^d P(ξ ∈ B | ζ), and deduce the corresponding a.s. equality.)

18. Let ξ be a random element in a separable metric space S. Show that L(ξ | F) is a.s. degenerate iff ξ is a.s. F-measurable. (Hint: Reduce to the case where L(ξ | F) is degenerate everywhere and hence equal to δ_η for an F-measurable random element η in S. Then show that ξ = η a.s.)

19. Give a direct proof of the equivalence (5). Then state the probabilistic counterpart of Lemma A4.7, and give a direct proof.

20. Show that if G ⊥⊥_{F_n} H for some increasing σ-fields F_n, then G ⊥⊥_{F_∞} H.

21. Assuming ξ ⊥⊥_η ζ and γ ⊥⊥ (ξ, η, ζ), show that ξ ⊥⊥_{η,γ} ζ and ξ ⊥⊥_η (ζ, γ).

22. Extend Lemma 4.6 to the context of conditional independence. Also show that Corollary 4.7 and Lemma 4.8 remain valid for conditional independence, given a σ-field H.

23. Consider any σ-field F and random element ξ in a Borel space, and define η = L(ξ | F). Show that ξ ⊥⊥_η F.

24. Let ξ, η be random elements in a Borel space S. Prove the existence of a measurable function f: S × [0, 1] → S and a U(0, 1) random variable γ ⊥⊥ η such that ξ = f(η, γ) a.s. (Hint: Choose f with (f(η, ϑ), η) =^d (ξ, η) for any U(0, 1) random variable ϑ ⊥⊥ (ξ, η), and then let (γ, η̃) =^d (ϑ, η) with (ξ, η) = (f(γ, η̃), η̃) a.s.)

25. Let ξ, η be random elements in Borel spaces S, T such that ξ = f(η) a.s. for a measurable function f: T → S. Show that η = g(ξ, ϑ) a.s. for a measurable function g: S × [0, 1] → T and a U(0, 1) random variable ϑ ⊥⊥ ξ. (Hint: Use Theorem 8.17.)

26. Suppose that the function f: T → S above is injective. Show that η = h(ξ) a.s. for a measurable function h: S → T. (Hint: Show that in this case η ⊥⊥ ϑ, and define h(x) = λg(x, ·).) Compare with Theorem A1.1.

27. Let ξ, η be random elements in a Borel space S. Show that we can choose a random element η̃ in S with (ξ, η) =^d (ξ, η̃) and η ⊥⊥_ξ η̃.

28. Let the probability measures P, Q on (Ω, A) be related by Q = ξ · P for a random variable ξ ≥ 0, and fix a σ-field F ⊂ A. Show that Q = E_P(ξ | F) · P on F.

29. Assume as before that Q = ξ · P on A, and let F ⊂ A. Show that E_Q(η | F) = E_P(ξη | F)/E_P(ξ | F) a.s. Q for any random variable η ≥ 0.

Chapter 9

Optional Times and Martingales

Filtrations, strictly and weakly optional times, closure properties, optional evaluation and hitting, augmentation, random time change, sub- and super-martingales, centering and convex maps, optional sampling and stopping, martingale transforms, maximum and upcrossing inequalities, sub-martingale convergence, closed martingales, limits of conditional expectations, regularization of sub-martingales, increasing limits of super-martingales

The importance of martingales and related topics can hardly be exaggerated. Indeed, filtrations and optional times as well as a wide range of sub- and super-martingales are constantly used in all areas of modern probability. They appear frequently throughout the remainder of this book.

In discrete time, a martingale is simply a sequence of integrable random variables centered at successive conditional means, a centering that can always be achieved by the elementary Doob decomposition. More precisely, given any discrete filtration F = (F_n), defined as an increasing sequence of σ-fields in Ω, we say that the random variables M_0, M_1, ... form a martingale with respect to F if E(M_n | F_{n-1}) = M_{n-1} a.s. for all n. A special role is played by the uniformly integrable martingales, which can be represented in the form M_n = E(ξ | F_n) for some integrable random variable ξ.

Martingale theory owes its usefulness to a number of powerful general results, such as the optional sampling theorem, the sub-martingale convergence theorem, and a variety of maximum inequalities. Applications discussed in this chapter include extensions of the Borel–Cantelli lemma and Kolmogorov's 0−1 law. Martingales can also be used to establish the existence of measurable densities and to give a short proof of the law of large numbers.

Much of the discrete-time theory extends immediately to continuous time, thanks to the fundamental regularization theorem, which ensures that every continuous-time martingale with respect to a right-continuous filtration has a right-continuous version with left-hand limits. The implications of this result extend far beyond martingale theory. In particular, it will enable us in Chapters 16–17 to obtain right-continuous versions of independent-increment and Feller processes.

The theory of continuous-time martingales is continued in Chapters 10, 18–20, and 35 with studies of quadratic variation, random time-change, integral representations, removal of drift, additional maximum inequalities, and various decomposition theorems. Martingales also play a basic role for especially the Skorohod embedding in Chapter 22, the stochastic integration in Chapters 18


and 20, and the theories of Feller processes, SDEs, and diffusions in Chapters 17 and 32–33. As for the closely related notion of optional times, our present treatment is continued with a more detailed study in Chapter 10. Optional times are fundamental not only for martingale theory, but also for various models involving Markov processes. In the latter context they appear frequently throughout the remainder of the book.

To begin our systematic exposition of the theory, we fix an arbitrary index set T ⊂ R̄. A filtration on T is defined as a non-decreasing family of σ-fields F_t ⊂ A, t ∈ T. We say that a process X on T is adapted to F = (F_t) if X_t is F_t-measurable for every t ∈ T. The smallest filtration with this property, given by F_t = σ{X_s; s ≤ t}, t ∈ T, is called the induced or generated filtration. Here 'smallest' is understood in the sense of set inclusion for each t.

A random time is defined as a random element τ in T̄ = T ∪ {sup T}. We say that τ is F-optional¹ if {τ ≤ t} ∈ F_t for every t ∈ T, meaning that the process X_t = 1{τ ≤ t} is adapted². When T is countable, it is clearly equivalent that {τ = t} ∈ F_t for every t ∈ T. With every optional time τ we associate a σ-field



    F_τ = {A ∈ A; A ∩ {τ ≤ t} ∈ F_t, t ∈ T}.

We list some basic properties of optional times and corresponding σ-fields.

Lemma 9.1 (optional times) For any optional times σ, τ,

    (i) σ ∨ τ and σ ∧ τ are again optional,
    (ii) F_σ ∩ {σ ≤ τ} ⊂ F_{σ∧τ} = F_σ ∩ F_τ,
    (iii) F_σ ⊂ F_τ on {σ ≤ τ},
    (iv) τ is F_τ-measurable,
    (v) if τ ≡ t ∈ T̄, then τ is optional with F_τ = F_t.

Proof: (i) For any t ∈ T,

    {σ ∨ τ ≤ t} = {σ ≤ t} ∩ {τ ≤ t} ∈ F_t,
    {σ ∧ τ > t} = {σ > t} ∩ {τ > t} ∈ F_t.

(ii) For any A ∈ F_σ and t ∈ T,







    A ∩ {σ ≤ τ} ∩ {τ ≤ t} = A ∩ {σ ≤ t} ∩ {τ ≤ t} ∩ {σ ∧ t ≤ τ ∧ t},

which belongs to F_t since σ ∧ t and τ ∧ t are both F_t-measurable. Hence, F_σ ∩ {σ ≤ τ} ⊂ F_τ.

¹ Optional times are often called stopping times.
² We often omit the prefix 'F-' when there is no risk for confusion.


The first relation now follows as we replace τ by σ ∧ τ . Replacing σ and τ by the pairs (σ ∧ τ, σ) and (σ ∧ τ, τ ) gives Fσ∧τ ⊂ Fσ ∩ Fτ . To prove the reverse relation, we note that for any A ∈ Fσ ∩ Fτ and t ∈ T ,











A ∩ σ ∧ τ ≤ t = A ∩ {σ ≤ t} ∪ A ∩ {τ ≤ t} ∈ Ft , whence A ∈ Fσ∧τ . (iii) Clear from (ii). (iv) Applying (ii) to the pair (τ, t) gives {τ ≤ t} ∈ Fτ for all t ∈ T , which extends immediately to any t ∈ R. Now use Lemma 1.4. (v) Clearly Fτ = Fτ ∩ {τ ≤ t} ⊂ Ft . Conversely, let A ∈ Ft and s ∈ T . When s ≥ t we get A ∩ {τ ≤ s} = A ∈ Ft ⊂ Fs , whereas for s < t we have 2 A ∩ {τ ≤ s} = ∅ ∈ Fs . Thus, A ∈ Fτ , proving that Ft ⊂ Fτ . Given a filtration F on R+ , we may define a new filtration F + by Ft+ = + = F. In particular, u>t Fu , t ≥ 0, and say that F is right-continuous if F F + is right-continuous for any filtration F. We say that a random time τ is weakly F-optional if {τ < t} ∈ Ft for every t > 0. Then τ + h is clearly  F-optional for every h > 0, and we may define Fτ + = h>0 Fτ +h . When the index set is Z+ , we take F + = F and make no difference between strictly and weakly optional times. The notions of optional and weakly optional times agree when F is rightcontinuous: 

Lemma 9.2 (weakly optional times) A random time τ is weakly F-optional iff it is F + -optional, in which case



Fτ + = Fτ+ = A ∈ A; A ∩ {τ < t} ∈ Ft , t > 0 .

(1)

Proof: For any t ≥ 0, we note that {τ ≤ t} =

∩ {τ < r},

{τ < t} =

r>t

∪ {τ ≤ r},

(2)

r 0 A ∩ {τ < t} =

∪ r 0 A ∩ {τ ≤ t} =



∩ A ∩ {τ < r} r ∈ (t, t + h)



∈ Ft+h ,

and so A ∩ {τ ≤ t} ∈ Ft+ . For A = Ω this proves the first assertion, and for general A ∈ A it proves the second relation in (1). To prove the first relation, we note that A ∈ Fτ + iff A ∈ Fτ +h for each h > 0, that is, iff A∩{τ +h ≤ t} ∈ Ft for all t ≥ 0 and h > 0. But this is equivalent to A ∩ {τ ≤ t} ∈ Ft+h for all t ≥ 0 and h > 0, hence to A ∩ {τ ≤ t} ∈ Ft+

188

Foundations of Modern Probability

for every t ≥ 0, which means that A ∈ Fτ+ .

2

We have seen that the maximum and minimum of two optional times are again optional. The result extends to countable collections, as follows. Lemma 9.3 (closure properties) For any random times τ1 , τ2 , . . . and filtration F on R+ or Z+ , we have (i) if the τn are F-optional, then so is τ = supn τn , (ii) if the τn are weakly F-optional, then so is σ = inf n τn , and Fσ+ = ∩ n Fτ+n . Proof: To prove (i) and the first assertion in (ii), we write {τ ≤ t} = ∩ n {τn ≤ t},

{σ < t} = ∪ n {τn < t},

(3)

where the strict inequalities may be replaced by ≤ for the index set T = Z+ .  To prove the second assertion in (ii), we note that Fσ+ ⊂ n Fτ+n by Lemma  + 9.1. Conversely, assuming A ∈ n Fτn , we get by (3) for any t ≥ 0 A ∩ {σ < t} = A ∩∪ n {τn < t}  = ∪ n A ∩ {τn < t} ∈ Ft , with the indicated modification for T = Z+ . Thus, A ∈ Fσ+ .

2

In particular, part (ii) of the last result is useful for the approximation of optional times from the right. ¯ +, Lemma 9.4 (discrete approximation) For any weakly optional time τ in R there exist some countably-valued optional times τn ↓ τ . Proof: We may define τn = 2−n [2n τ + 1],

n ∈ N.

¯ for all n, and τn ↓ τ . The τn are optional, since Then τn ∈ 2−n N {τn ≤ k 2−n } = {τ < k 2−n } ∈ Fk2−n

k, n ∈ N.

2

We may now relate the optional times to random processes. Say that a process X on R+ is progressively measurable or simply progressive, if its restriction to Ω × [0, t] is Ft ⊗ B[0,t] -measurable for every t ≥ 0. Note that any progressive process is adapted by Lemma 1.28. Conversely, we may approximate from the left or right to show that any adapted and left- or right-continuous process is progressive. A set A ⊂ Ω × R+ is said to be progressive if the corresponding indicator function 1A has this property, and we note that the progressive sets form a σ-field.

9. Optional Times and Martingales

189

Lemma 9.5 (optional evaluation) Consider a filtration F on an index set T , a process X on T with values in a measurable space (S, S), and an optional time τ in T . Then Xτ is Fτ -measurable under each of these conditions: (i) T is countable and X is adapted, (ii) T = R+ and X is progressive. Proof: In both cases, we need to show that



Xτ ∈ B, τ ≤ t ∈ Ft ,

t ≥ 0, B ∈ S.

This is clear in case (i), if we write {Xτ ∈ B} =

∪ s≤t





Xs ∈ B, τ = s ∈ Ft ,

B ∈ S.

In case (ii) it is enough to show that Xτ ∧t is Ft -measurable for every t ≥ 0. We may then take τ ≤ t and prove instead that Xτ is Ft -measurable. Writing Xτ = X ◦ ψ with ψ(ω) = {ω, τ (ω)}, we note that ψ is measurable from Ft to Ft ⊗ B[0,t] , whereas X is measurable on Ω × [0, t] from Ft ⊗ B[0,t] to S. The 2 required measurability of Xτ now follows by Lemma 1.7. For any process X on R+ or Z+ and a set B in the range space of X, we introduce the hitting time

τB = inf t > 0; Xt ∈ B . We often need to certify that τB is optional. The following elementary result covers some common cases. Lemma 9.6 (hitting times) For a filtration F on T = R+ or Z+ , let X be an F-adapted process on T with values in a measurable space (S, S), and let B ∈ S. Then τB is weakly optional under each of these conditions: (i) T = Z+ , (ii) T = R+ , S is a metric space, B is closed, and X is continuous, (iii) T = R+ , S is a topological space, B is open, and X is right-continuous. Proof: (i) Write {τB ≤ n} =



{Xk ∈ B} ∈ Fn ,

k ∈ [1, n]

n ∈ N.

(ii) Letting ρ be the metric on S, we get for any t > 0 {τB ≤ t} =

∪ ∩ ∪ h > 0 n ∈ N r ∈ Q ∩ [h, t]





ρ(Xr , B) ≤ n−1 ∈ Ft .

(iii) By Lemma 9.2, it suffices to write {τB < t} =



r ∈ Q ∩ (0, t)

{Xr ∈ B} ∈ Ft ,

t > 0.

2

For special purposes, we need the following more general but much deeper result, known as the debut theorem. Here and below, a filtration F is said to be complete, if the basic σ-field A is complete and each Ft contains all P -null sets in A.

190

Foundations of Modern Probability

Theorem 9.7 (first entry, Doob, Hunt) Let the set A ⊂ R+ ×Ω be progressive for a right-continuous, complete filtration F. Then the time τ (ω) = inf{t ≥ 0; (t, ω) ∈ A} is F-optional. Proof: Since A is progressive, A ∩ [0, t) ∈ Ft ⊗ B[0,t] for every t > 0. Noting that {τ < t} is the projection of A ∩ [0, t) onto Ω, we get {τ < t} ∈ Ft by Theorem A1.2, and so τ is optional by Lemma 9.2. 2 For applications of this and other results, we may need to extend a given filtration F on R+ , to make it both right-continuous and complete. Writing A for the completion of A, put N = {A ∈ A; P A = 0} and define F t = σ{Ft , N }. Then F = ( F t ) is the smallest complete extension of F. Similarly, F + = (Ft+ ) is the smallest right-continuous extension of F. We show that the two extensions commute and can be combined into a smallest right-continuous and complete version, known as the (usual) augmentation of F. Lemma 9.8 (augmented filtration) Any filtration F on R+ has a smallest right-continuous, complete extension G, given by 

Gt = Ft+ = F



t+

t ≥ 0.

,

(4)

Proof: First we note that 

Ft+ ⊂ F 







⊂ F t+





t+

,

t ≥ 0.



Conversely, let A ∈ F t+ . Then A ∈ F t+h for every h > 0, and so as in Lemma 1.27 there exist some sets Ah ∈ Ft+h with P (A  Ah ) = 0. Now choose hn → 0, and defineA = {Ahn i.o.}. Then A = Ft+ and P (A  A ) = 0, and so A ∈ Ft+ . Thus, F t+ ⊂ Ft+ , which proves the second relation in (4). In particular, the filtration G in (4) contains F and is both right-continuous and complete. For any filtration H with those properties, we have Gt = F t+ ⊂ Ht+ = Ht+ = Ht , t ≥ 0, which proves the required minimality of G.

2

The σ-fields Fτ arise naturally in connection with a random time change: Proposition 9.9 (random time-change) Let X ≥ 0 be a non-decreasing, right-continuous process, adapted to a right-continuous filtration F, and define



s ≥ 0. Then (i) the τs form a right-continuous process of optional times, generating a right-continuous filtration Gs = Fτs , s ≥ 0, τs = inf t > 0; Xt > s ,

(ii) if X is continuous and τ is F-optional, then Xτ is G-optional with Fτ ⊂ G X τ ,

9. Optional Times and Martingales

191

(iii) when X is strictly increasing, we have Fτ = GXτ . In case (iii), we have in particular Ft = GXt for all t, so that the processes (τs ) and (Xt ) play symmetric roles. Proof, (i)−(ii): The times τs are optional by Lemmas 9.2 and 9.6, and since (τs ) is right-continuous, so is (Gs ) by Lemma 9.3. If X is continuous, then Lemma 9.1 yields for any F-optional time τ > 0 and set A ∈ Fτ A ∩ {Xτ ≤ s} = A ∩ {τ ≤ τs } ∈ Fτs = Gs , s ≥ 0. For A = Ω it follows that Xτ is G-optional, and for general A we get A ∈ GXτ . Thus, Fτ ⊂ GXτ . Both statements extend by Lemma 9.3 to arbitrary τ . (iii) For any A ∈ GXt with t > 0, we have A ∩ {t ≤ τs } = A ∩ {Xt ≤ s} ∈ Gs = Fτs , s ≥ 0, and so





A ∩ t ≤ τs ≤ u ∈ F u ,

s ≥ 0, u > t.

Taking the union over all s ∈ Q+ gives3 A ∈ Fu , and as u ↓ t we get A ∈ Ft+ = Ft . Hence, Ft = GXt , which extends as before to t = 0. By Lemma 9.1, we obtain for any A ∈ GXτ A ∩ {τ ≤ t} = A ∩ {Xτ ≤ Xt } ∈ GXt = Ft , t ≥ 0, and so A ∈ Fτ . Thus, GXτ ⊂ Fτ , and so the two σ-fields agree.

2

To motivate the definition of martingales, fix a filtration F on an index set T and a random variable ξ ∈ L1 , and introduce the process Mt = E(ξ | Ft ),

t ∈ T.

Then M is clearly integrable (for each t) and adapted, and the tower property of conditional expectations yields Ms = E(Mt | Fs ) a.s.,

s ≤ t.

(5)

Any integrable and adapted process M satisfying (5) is called a martingale with respect to F, or an F-martingale. When T = Z+ , it is enough to require (5) for t = s + 1, so that the condition becomes 





E ΔMn  Fn−1 = 0 a.s., 3

n ∈ N,

Recall that Q+ denotes the set of rational numbers r ≥ 0.

(6)

192

Foundations of Modern Probability

where ΔMn = Mn − Mn−1 . A martingale in Rd is a process M = (M 1 , . . . , M d ) where M 1 , . . . , M d are one-dimensional martingales. We also consider the cases where (5) or (6) are replaced by corresponding inequalities. Thus, we define a sub-martingale as an integrable and adapted process X satisfying Xs ≤ E(Xt | Fs ) a.s.,

s ≤ t.

(7)

Reversing the inequality sign4 yields the notion of a super-martingale. In particular, the mean is non-decreasing for sub-martingales and non-increasing for super-martingales. Given a filtration F on Z+ , we say that a random sequence A = (An ) with A0 = 0 is predictable with respect to F or simply F-predictable, if An is Fn−1 -measurable for every n ∈ N, so that the shifted sequence (θA)n = An+1 is adapted. The following elementary result, known as the Doob decomposition, is often used to derive results for sub-martingales from the corresponding martingale versions. An extension to continuous time is proved in Chapter 10. Lemma 9.10 (centering, Doob) An integrable, F-adapted process X on Z+ has an a.s. unique decomposition M + A, where M is an F-martingale and A is F-predictable with A0 = 0. In particular, X is a sub-martingale iff A is a.s. non-decreasing. Proof: If X = M + A for some processes M and A as stated, then clearly ΔAn = E(ΔXn |Fn−1 ) a.s. for all n ∈ N, and so An =

ΣE k≤n



 



ΔXk  Fk−1

a.s.,

n ∈ Z+ ,

(8)

which proves the required uniqueness. In general, we may define a predictable process A by (8). Then M = X − A is a martingale, since 

 





 



E ΔMn  Fn−1 = E ΔXn  Fn−1 − ΔAn = 0 a.s.,

n ∈ N.

2

Next we show how the martingale and sub-martingale properties are preserved by suitable transformations. Lemma 9.11 (convex maps) Consider a random sequence M = (Mn ) in Rd and a convex function f : Rd → R, such that X = f (M ) is integrable. Then X is a sub-martingale under each of these conditions: (i) M is a martingale in Rd , (ii) M is a sub-martingale in R and f is non-decreasing. 4 The sign conventions are suggested by analogy with sub- and super-harmonic functions. Note that sub-martingales tend to be increasing whereas super-martingales tend to be decreasing. For branching processes the conventions are the opposite.

9. Optional Times and Martingales

193

Proof: (i) By the conditional version of Jensen’s inequality, we have for s≤t

f (Ms ) = f E(Mt | Fs )  





≤ E f (Mt )  Fs

a.s.

(9)

(ii) Here the first relation in (9) becomes f (Ms ) ≤ f {E(Mt | Fs )}, and the conclusion remains valid. 2 The last result is often applied with f (x) = |x|p for some p ≥ 1, or with f (x) = x+ = x ∨ 0 when d = 1. We turn to a basic version of the powerful optional sampling theorem. An extension to continuous-time sub-martingales appears as Theorem 9.30. Say that an optional time τ is bounded if τ ≤ u a.s. for some u ∈ T . This always holds when T has a last element. Theorem 9.12 (optional sampling, Doob) Let M be a martingale on a countable index set T with filtration F, and consider some optional times σ, τ , where τ is bounded. Then Mτ is integrable, and Mσ∧τ = E(Mτ | Fσ ) a.s. Proof: By Lemmas 8.3 and 9.1, we get for any t ≤ u in T E(Mu | Fτ ) = E(Mu | Ft ) = Mt = Mτ a.s. on {τ = t}, and so E(Mu | Fτ ) = Mτ a.s. whenever τ ≤ u a.s. If σ ≤ τ ≤ u, then Fσ ⊂ Fτ by Lemma 9.1, and we get a.s.





E(Mτ | Fσ ) = E E(Mu | Fτ )  Fσ

= E(Mu | Fσ ) = Mσ . Further note that E(Mτ | Fσ ) = Mτ a.s. when τ ≤ σ ∧ u. In general, using Lemmas 8.3 and 9.1, we may combine the two special cases to get a.s. E(Mτ | Fσ ) = = E(Mτ |Fσ ) = =

E(Mτ | Fσ∧τ ) Mσ∧τ on {σ ≤ τ }, E(Mσ∧τ | Fσ ) Mσ∧τ on {σ > τ }.

2

In particular, the martingale property is preserved by a random time change: Corollary 9.13 (random-time change) Consider an F-martingale M and a non-decreasing family of bounded optional times τs taking countably many values. Then the process N is a G -martingale, where Ns = Mτs ,

G s = Fτ s .

This leads in turn to a useful martingale criterion.

194

Foundations of Modern Probability

Corollary 9.14 (martingale criterion) Let M be an integrable, adapted process on T . Then these conditions are equivalent: (i) M is a martingale, (ii) EMσ = EMτ for any bounded, optional times σ, τ taking countably many values in T . In (ii) it is enough to consider optional times taking at most two values. Proof: If s < t in T and A ∈ Fs , then τ = s1A + t1Ac is optional, and so 0 = EMt − EMτ = EMt − E(Ms ; A) − E(Mt ; Ac ) = E(Mt − Ms ; A). Since A is arbitrary, it follows that E(Mt − Ms | Fs ) = 0 a.s.

2

The following predictable transformation of martingales is basic for the theory of stochastic integration. Corollary 9.15 (martingale transform) Let M be a martingale on an index set T with filtration F, fix an optional time τ taking countably many values, and let η be a bounded, Fτ -measurable random variable. Then we may form another martingale Nt = η (Mt − Mt∧τ ), t ∈ T. Proof: The integrability holds by Theorem 9.12, and the adaptedness is clear if we replace η by η 1{τ ≤ t} in the expression for Nt . Now fix any bounded, optional time σ taking countably many values. By Theorem 9.12 and the pull-out property of conditional expectations, we get a.s. 

 



E(Nσ | Fτ ) = η E Mσ − Mσ∧τ  Fτ

= η (Mσ∧τ − Mσ∧τ ) = 0, and so ENσ = 0. Thus, N is a martingale by Lemma 9.14.

2

In particular, the martingale property is preserved by optional stopping, in the sense that the stopped process Mtτ = Mτ ∧t is a martingale whenever M is a martingale and τ is an optional time taking countably many values. More generally, we may consider predictable step processes of the form Vt =

Σ ηk 1{t > τk },

k≤n

t ∈ T,

where τ1 ≤ · · · ≤ τn are optional times and each ηk is a bounded, Fτk measurable random variable. For any process X, the associated elementary stochastic integral (V · X)t ≡



t 0

Vs dXs =

Σ ηk (Xt − Xt∧τ ),

k≤n

k

t ∈ T,

9. Optional Times and Martingales

195

is a martingale by Corollary 9.15, whenever X is a martingale and each τk takes countably many values. In discrete time, we may take V to be a bounded, predictable sequence, in which case (V · X)n =

Σ Vk ΔXk ,

n ∈ Z+ .

k≤n

The result for martingales extends in an obvious way to sub-martingales X, provided that V ≥ 0. We proceed with some basic martingale inequalities, beginning with some extensions of Kolmogorov’s maximum inequality from Lemma 5.15. Theorem 9.16 (maximum inequalities, Bernstein, L´evy) Let X be a submartingale on a countable index set T . Then for any r ≥ 0 and u ∈ T ,







(i) r P sup Xt ≥ r ≤ E Xu ; sup Xt ≥ r ≤ EXu+ , (ii)



t≤u

t≤u



r P supt |Xt | ≥ r ≤ 3 supt E|Xt |.

Proof: (i) By dominated convergence it is enough to consider finite index sets, so we may take T = Z+ . Define τ = u ∧ inf{t; Xt ≥ r} and B = {maxt≤u Xt ≥ r}. Then τ is an optional time bounded by u, and we have B ∈ Fτ and Xτ ≥ r on B. Hence, Lemma 9.10 and Theorem 9.12 yield r P B ≤ E(Xτ ; B) ≤ E(Xu ; B) ≤ EXu+ . (ii) For the Doob decomposition X = M + A, relation (i) applied to −M yields







r P min Xt ≤ −r ≤ r P min Mt ≤ −r t≤u

t≤u

≤ EMu− = EMu+ − EMu ≤ EXu+ − EX0 ≤ 2 max E|Xt |. t≤u

2

It remains to combine with (i).

We turn to a powerful norm inequality. For processes X on an index set T , we define X ∗ = sup |Xt |. Xt∗ = sup |Xs |, s≤t

t∈T

Theorem 9.17 (Lp -inequality, Doob) Let M be a martingale on a countable index set T , and fix any p, q > 1 with p−1 + q −1 = 1. Then Mt∗ p ≤ q Mt p ,

t ∈ T.

196

Foundations of Modern Probability

Proof: By monotone convergence, we may take T = Z+ . If Mt p < ∞, then Ms p < ∞ for all s ≤ t by Jensen’s inequality, and so we may assume that 0 < Mt∗ p < ∞. Applying Theorem 9.16 to the sub-martingale |M |, we   get r P {Mt∗ > r} ≤ E |Mt |; Mt∗ > r , r > 0. Using Lemma 4.4, Fubini’s theorem, and H¨older’s inequality, we obtain Mt∗ pp = p ≤p





0





0

P {Mt∗ > r} rp−1 dr 



E |Mt |; Mt∗ > r rp−2 dr

= p E |Mt |



Mt∗ 0

rp−2 dr

∗(p−1)

= q E |Mt | Mt

  ∗(p−1)  q ≤ q Mt p  Mt

= q Mt p Mt∗ p−1 p . 2

Now divide by the last factor on the right.

The next inequality is needed to prove the basic convergence theorem. For any function f : T → R and constants a < b, we define the number of [a, b] crossings of f up to time t as the supremum of all n ∈ Z+ , such that there exist times s1 < t1 < s2 < t2 < · · · < sn < tn ≤ t in T with f (sk ) ≤ a and f (tk ) ≥ b for all k. This supremum may clearly be infinite. Lemma 9.18 (upcrossing inequality, Doob, Snell) Let X be a sub-martingale on a countable index set T , and let Nab (t) be the number of [a, b]-crossings of X up to time t. Then E(Xt − a)+ , t ∈ T, a < b in R. ENab (t) ≤ b−a Proof: As before we may take T = Z+ . Since Y = (X − a)+ is again a sub-martingale by Lemma 9.11, and the [a, b] -crossings of X correspond to [0, b−a] -crossings of Y , we may take X ≥ 0 and a = 0. Then define recursively the optional times 0 = τ0 ≤ σ1 < τ1 < σ2 < · · · by



σk = inf n ≥ τk−1 ; Xn = 0 ,



τk = inf n ≥ σk ; Xn ≥ b , and introduce the predictable process Vn =

Σ1 k≥1



k ∈ N,



σk < n ≤ τ k ,

n ∈ N.

Since (1 − V ) · X is again a sub-martingale by Corollary 9.15, we get

E (1 − V ) · X





t

≥ E (1 − V ) · X

Since also (V · X)t ≥ b N0b (t), we obtain

0

= 0,

t ≥ 0.

9. Optional Times and Martingales

197

b EN0b (t) ≤ ≤ = ≤

E(V · X)t E(1 · X)t EXt − EX0 EXt .

2

We may now prove the fundamental convergence theorem for sub-martingales. Say that a process X is Lp -bounded if supt Xt p < ∞. Theorem 9.19 (convergence and regularity, Doob) Let X be an L1 -bounded sub-martingale on a countable index set T . Then outside a fixed P -null set, X converges along every increasing or decreasing sequence in T . Proof: By Theorem 9.16 we have X ∗ < ∞ a.s., and Lemma 9.18 shows that X has a.s. finitely many upcrossings of every interval [a, b] with rational a < b. Then X has clearly the asserted property, outside the P -null set where any of these countably many conditions fails. 2 We consider an interesting and useful application. Proposition 9.20 (one-sided bound) Let M be a martingale on Z+ with ΔM ≤ c a.s. for a constant c < ∞. Then a.s.



{Mn converges} = sup Mn < ∞ . n

Proof: Since M − M0 is again a martingale, we may take M0 = 0. Consider the optional times



τm = inf n; Mn ≥ m ,

m ∈ N.

The processes M τm are again martingales by Corollary 9.15. Since M τm ≤ m+c a.s., we have E|M τm | ≤ 2(m + c) < ∞, and so M τm converges a.s. by Theorem 9.19. Hence, M converges a.s. on



∪ m {τm = ∞} ⊂ ∪ m {M = M τ

supn Mn < ∞ =

m

}.

The reverse implication is obvious, since every convergent sequence in R is bounded. 2 The last result yields a useful extension of the Borel–Cantelli lemma from Theorem 4.18: Corollary 9.21 (extended Borel–Cantelli lemma, L´evy) For a filtration F on Z+ , let An ∈ Fn , n ∈ N. Then a.s. {An i.o.} =





Σ n P(An| Fn−1) = ∞

.

198

Foundations of Modern Probability Proof: The sequence Mn =

Σ k≤n





1Ak − P (Ak | Fk−1 ) ,

n ∈ Z+ ,

is a martingale with |ΔMn | ≤ 1, and so by Proposition 9.20 the events {supn (±Mn ) = ∞} agree a.s., which implies P {Mn → ∞} = P {Mn → −∞} = 0. Hence, we have a.s.



Σ n1A = ∞

= Σ n P (An |Fn−1 ) = ∞ .

{An i.o.} =

n

2

A martingale M is said to be closed if u = sup T belongs to T , in which case Mt = E(Mu | Ft ) a.s. for all t ∈ T . When u ∈ T , we say that M is closable if it can be extended to a martingale on T¯ = T ∪ {u}. If Mt = E(ξ | Ft ) for a variable ξ ∈ L1 , we may clearly choose Mu = ξ. We give some general criteria for closability. An extension to continuous-time sub-martingales appears as part of Theorem 9.30. Theorem 9.22 (uniform integrability and closure, Doob) For a martingale M on an unbounded index set T , these conditions are equivalent: (i) M is uniformly integrable, (ii) M is closable at sup T , (iii) M is L1 -convergent at sup T . Under those conditions, M may be closed by the limit in (iii). Proof: Clearly (ii) ⇒ (i) by Lemma 8.4 and (i) ⇒ (iii) by Theorem 9.19 and Proposition 5.12. Now let Mt → ξ in L1 as t → u ≡ sup T . Using the L1 -contractivity of conditional expectations, we get as t → u for fixed s Ms = E(Mt |Fs ) → E(ξ |Fs ) in L1 . Thus, Ms = E(ξ |Fs ) a.s., and we may take Mu = ξ. (iii) ⇒ (ii).

This shows that 2

For comparison, we consider the case of Lp -convergence for p > 1. Corollary 9.23 (Lp -convergence) Let M be a martingale on an unbounded index set T , and fix any p > 1. Then these conditions are equivalent: (i) M converges in Lp , (ii) M is Lp -bounded.

9. Optional Times and Martingales

199

Proof: We may clearly assume that T is countable. If M is Lp -bounded, it converges in L1 by Theorem 9.19. Since |M |p is uniformly integrable by Theorem 9.17, the convergence extends to Lp by Proposition 5.12. Conversely, 2 if M converges in Lp , it is Lp -bounded by Lemma 9.11. We turn to the convergence of martingales of the special form Mt = E(ξ |Ft ), as t increases or decreases along a sequence. Without loss of generality, we may take the index set T to be unbounded above or below, and define respectively F∞ =

∨ Ft,

t∈T

F−∞ =

∩ Ft.

t∈T

Theorem 9.24 (conditioning limits, Jessen, Lévy) Let F be a filtration on a countable index set T ⊂ R, unbounded above or below. Then for any ξ ∈ L1, we have as t → ±∞
    E(ξ | F_t) → E(ξ | F_{±∞}) a.s. and in L1.

Proof: By Theorems 9.19 and 9.22, the martingale M_t = E(ξ | F_t) converges a.s. and in L1 as t → ±∞, where the limit M_{±∞} may clearly be taken to be F_{±∞}-measurable. To obtain M_{±∞} = E(ξ | F_{±∞}) a.s., we need to verify that
    E(M_{±∞}; A) = E(ξ; A),   A ∈ F_{±∞}.   (10)
Then note that, by the definition of M,
    E(M_t; A) = E(ξ; A),   A ∈ F_s, s ≤ t.   (11)

This clearly remains true for s = −∞, and as t → −∞ we get the ‘minus’ version of (10). To get the ‘plus’ version, let t → ∞ in (11) for fixed s, and extend by a monotone-class argument to arbitrary A ∈ F_∞. □

In particular, we note the following special case.

Corollary 9.25 (Lévy) For any filtration F on Z+,
    P(A | F_n) → 1_A a.s.,   A ∈ F_∞.
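As a purely illustrative aside, Corollary 9.25 can be watched at work numerically. In the sketch below (our own setup, with arbitrary parameters) we take a simple random walk S observed up to a fixed horizon N and the event A = {S_N ≥ 0} ∈ F_N ⊂ F_∞. Then P(A | F_n) is an explicit binomial tail probability in the current position, and along a sample path it converges to 1_A as n → N, as the corollary predicts.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)
N = 1000
S = np.concatenate(([0], np.cumsum(rng.choice([-1, 1], size=N))))   # S_0, ..., S_N

def cond_prob(n):
    """P(S_N >= 0 | F_n): among the m = N - n remaining +-1 steps, at least
    ceil((m - S_n)/2) must point upward, a Binomial(m, 1/2) tail event."""
    m = N - n
    k_min = max(int(np.ceil((m - S[n]) / 2)), 0)
    return float(binom.sf(k_min - 1, m, 0.5))      # P(Bin(m, 1/2) >= k_min)

A = 1 if S[N] >= 0 else 0
for n in (0, 10, 100, 500, 900, 990, 1000):
    print(f"n = {n:4d}   P(A | F_n) = {cond_prob(n):.4f}   1_A = {A}")
```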

For a simple application, we prove an extension of Kolmogorov’s 0−1 law in Theorem 4.13. Here a relation F1 ⊂ F2 between two σ-fields is said to hold a.s. if A ∈ F1 implies A ∈ F̄2, where F̄2 denotes the completion of F2.

Corollary 9.26 (tail σ-field) Let the σ-fields F1, F2, ... be conditionally independent, given G. Then
    ∩_{n≥1} σ{F_n, F_{n+1}, ...} ⊂ G a.s.


Proof: Let T be the σ-field on the left, and note that T ⊥⊥_G (F1 ∨ ··· ∨ F_n) by Theorem 8.12. Using Theorem 8.9 and Corollary 9.25, we get for any A ∈ T
    P(A | G) = P(A | G, F1, ..., F_n) → 1_A a.s.,
which shows that T ⊂ G a.s. □

The last theorem yields a short proof of the strong law of large numbers. Here we put S_n = ξ_1 + ... + ξ_n for some i.i.d. random variables ξ_1, ξ_2, ... in L1, and define F_{−n} = σ{S_n, S_{n+1}, ...}. Then F_{−∞} is trivial by Theorem 4.15, and E(ξ_k | F_{−n}) = E(ξ_1 | F_{−n}) a.s. for every k ≤ n, since (ξ_k, S_n, S_{n+1}, ...) =d (ξ_1, S_n, S_{n+1}, ...). Hence, Theorem 9.24 yields
    n^{−1} S_n = E(n^{−1} S_n | F_{−n}) = n^{−1} Σ_{k≤n} E(ξ_k | F_{−n})
               = E(ξ_1 | F_{−n}) → E(ξ_1 | F_{−∞}) = E ξ_1.

For a further application of Theorem 9.24, we prove a kernel version of the regularization Theorem 8.5, needed in Chapter 32.

Theorem 9.27 (kernel densities) Consider a probability kernel μ : S → T × U, where T, U are Borel. Then the densities
    ν(s, t, B) = μ(s, dt × B) / μ(s, dt × U),   s ∈ S, t ∈ T, B ∈ U,   (12)

have versions combining into a probability kernel ν : S × T → U.

Proof: We may take T, U to be Borel subsets of R, in which case μ can be regarded as a probability kernel from S to R². Letting I_n be the σ-field in R generated by the intervals I_{nk} = 2^{−n}[(k − 1), k), k ∈ Z, we define
    M_n(s, t, B) = Σ_k (μ(s, I_{nk} × B) / μ(s, I_{nk} × U)) 1{t ∈ I_{nk}},   s ∈ S, t ∈ T, B ∈ U,
subject to the convention 0/0 = 0. Then M_n(s, ·, B) is a version of the density in (12) on the σ-field I_n, and for fixed s and B it is also a martingale with respect to μ(s, · × U). By Theorem 9.24 we get M_n(s, ·, B) → ν(s, ·, B) a.e. μ(s, · × U). Thus, ν has the product-measurable version
    ν(s, t, B) = lim sup_{n→∞} M_n(s, t, B),   s ∈ S, t ∈ T, B ∈ U.

To construct a kernel version of ν, we may proceed as in the proof of Theorem 3.4, noting that in each step the exceptional (s, t) -set A lies in S ⊗T , with sections As = {t ∈ T ; (s, t) ∈ A} satisfying μ(s, As × U ) = 0 for all s ∈ S. 2
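As a purely illustrative aside, the dyadic construction in the last proof is just a histogram-refinement estimate of a conditional probability, and its martingale convergence can be watched numerically. The sketch below is ours (the joint law of (T, U), the set B, and the point t0 are arbitrary choices): for each n it evaluates the analogue of M_n(t0, B) = μ(I_{nk} × B)/μ(I_{nk} × U) over the dyadic interval I_{nk} containing t0, using a large Monte Carlo sample in place of μ, and the values settle down to the conditional probability as n grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# A made-up joint law mu on T x U: T uniform on [0,1), U = T + Gaussian noise.
T = rng.random(2_000_000)
U = T + 0.3 * rng.standard_normal(T.size)

B = (U <= 0.5)        # the set B appearing in (12)
t0 = 0.3              # the point t at which we track nu(t, B)

for n in range(1, 11):
    width = 2.0 ** (-n)
    k = np.floor(t0 / width)
    in_I = (T >= k * width) & (T < (k + 1) * width)      # t0's dyadic interval I_nk
    M_n = (in_I & B).sum() / max(in_I.sum(), 1)          # ~ mu(I_nk x B) / mu(I_nk x U)
    print(f"n = {n:2d}   M_n = {M_n:.4f}")

# The limit nu(t0, B) = P(U <= 0.5 | T = 0.3) is known here in closed form:
print("limit:", round(float(norm.cdf((0.5 - t0) / 0.3)), 4))
```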


To extend the previous theory to continuous time, we need to form suitably regular versions of the various processes. The following closely related regularizations may then be useful. Say that a process X on R+ is right-continuous with left-hand limits (abbreviated5 as rcll ) if Xt = Xt+ for all t ≥ 0, and the left-hand limits Xt− exist and are finite for all t > 0. For any process Y on Q+ , we define a process Y + by (Y + )t = Yt+ , t ≥ 0, whenever the right-hand limits exist. Theorem 9.28 (regularization, Doob) For any F-sub-martingale X on R+ with restriction Y to Q+ , we have (i) Y + exists and is rcll outside a fixed P -null set A, and Z = 1Ac Y + is a sub-martingale with respect to the augmented filtration F + , (ii) when F is right-continuous, X has an rcll version iff EX is right-continuous, hence in particular when X is a martingale. The proof requires an extension of Theorem 9.22 to suitable sub-martingales. Lemma 9.29 (uniform integrability) Let X be a sub-martingale on Z− . Then these conditions are equivalent: (i) X is uniformly integrable, (ii) EX is bounded. Proof: Let EX be bounded. Form the predictable sequence 

 



    α_n = E(ΔX_n | F_{n−1}) ≥ 0,   n ≤ 0,
and note that
    E Σ_{n≤0} α_n = EX_0 − inf_{n≤0} EX_n < ∞.
Hence, Σ_n α_n < ∞ a.s., and so we may define
    A_n = Σ_{k≤n} α_k,   M_n = X_n − A_n,   n ≤ 0.

Since EA∗ < ∞, and M is a martingale closed at 0, both A and M are uniformly integrable. 2 Proof of Theorem 9.28: (i) By Lemma 9.11 the process Y ∨ 0 is L1 -bounded on bounded intervals, and so the same thing is true for Y . Thus, by Theorem 9.19 the right- and left-hand limits Yt± exist outside a fixed P -null set A, and so Z = 1Ac Y + is rcll. Also note that Z is adapted to F + . To prove that Z is a sub-martingale for F +, fix any times s < t, and choose sn ↓ s and tn ↓ t in Q+ with sn < t. Then Ysm ≤ E(Ytn |Fsm ) a.s. for all m and n, and Theorem 9.24 yields Zs ≤ E(Ytn |Fs+ ) a.s. as m → ∞. Since Ytn → Zt in L1 by Lemma 9.29, it follows that 5

The French acronym càdlàg is often used, even in English texts.


    Z_s ≤ E(Z_t | F_s^+) = E(Z_t | F̄_s^+) a.s.
(ii) For any t < t_n ∈ Q_+,
    (EX)_{t_n} = E(Y_{t_n}),   X_t ≤ E(Y_{t_n} | F_t) a.s.,

and as t_n ↓ t, we get by Lemma 9.29 and the right-continuity of F
    (EX)_{t+} = EZ_t,   X_t ≤ E(Z_t | F_t) = Z_t a.s.   (13)

If X has a right-continuous version, then clearly Zt = Xt a.s. Hence, (13) yields (EX)t+ = EXt , which shows that EX is right-continuous. If instead EX is right-continuous, then (13) gives E|Zt − Xt | = EZt − EXt = 0, and so Zt = Xt a.s., which means that Z is a version of X.

2

Justified by the last theorem, we may henceforth take all sub-martingales to be rcll, unless otherwise specified, and also let the underlying filtration be right-continuous and complete. Most of the previously quoted results for submartingales on a countable index set extend immediately to continuous time. In particular, this is true for the convergence Theorem 9.19 and the inequalities in Theorem 9.16 and Lemma 9.18. We proceed to show how Theorems 9.12 and 9.22 extend to sub-martingales in continuous time. Theorem 9.30 (optional sampling and closure, Doob) Let X be an F-submartingale on R+ , where X and F are right-continuous, and consider some optional times σ, τ , where τ is bounded. Then Xτ is integrable, and Xσ∧τ ≤ E(Xτ | Fσ ) a.s.

(14)

This extends to unbounded times τ iff X^+ is uniformly integrable.

Proof: Introduce the optional times σ_n = 2^{−n}[2^n σ + 1] and τ_n = 2^{−n}[2^n τ + 1], and conclude from Lemma 9.10 and Theorem 9.12 that
    X_{σ_m ∧ τ_n} ≤ E(X_{τ_n} | F_{σ_m}) a.s.,   m, n ∈ N.
As m → ∞, we get by Lemma 9.3 and Theorem 9.24
    X_{σ ∧ τ_n} ≤ E(X_{τ_n} | F_σ) a.s.,   n ∈ N.   (15)

By the result for the index sets 2−n Z+ , the random variables X0 ; . . . , Xτ2 , Xτ1 form a sub-martingale with bounded mean, and hence are uniformly integrable by Lemma 9.29. Thus, (14) follows as we let n → ∞ in (15).


If X^+ is uniformly integrable, then X is L1-bounded, and hence converges a.s. toward some X_∞ ∈ L1. By Proposition 5.12 we get X_t^+ → X_∞^+ in L1, and so E(X_t^+ | F_s) → E(X_∞^+ | F_s) in L1 for each s. Letting t → ∞ along a sequence, we get by Fatou’s lemma
    X_s ≤ lim_{t→∞} E(X_t^+ | F_s) − lim inf_{t→∞} E(X_t^− | F_s)
        ≤ E(X_∞^+ | F_s) − E(X_∞^− | F_s) = E(X_∞ | F_s).

Approximating as before, we obtain (14) for arbitrary σ, τ. Conversely, the stated condition yields the existence of an X_∞ ∈ L1 with X_s ≤ E(X_∞ | F_s) a.s. for all s > 0, and so X_s^+ ≤ E(X_∞^+ | F_s) a.s. by Lemma 9.11. Hence, X^+ is uniformly integrable by Lemma 8.4. □

For a simple application, we consider the hitting probabilities of a continuous martingale. The result will be useful in Chapters 18, 22, and 33.

Corollary 9.31 (first hit) Let M be a continuous martingale with M_0 = 0 and P{M* > 0} > 0, and define τ_x = inf{t > 0; M_t = x}. Then for a < 0 < b,
    P(τ_a < τ_b | M* > 0) ≤ b/(b − a) ≤ P(τ_a ≤ τ_b | M* > 0).

Proof: Since τ = τ_a ∧ τ_b is optional by Lemma 9.6, Theorem 9.30 yields EM_{τ∧t} = 0 for all t > 0, and so by dominated convergence EM_τ = 0. Hence,
    0 = a P{τ_a < τ_b} + b P{τ_b < τ_a} + E(M_∞; τ = ∞)
      ≤ a P{τ_a < τ_b} + b P{τ_b ≤ τ_a, M* > 0}
      = b P{M* > 0} − (b − a) P{τ_a < τ_b},
which implies the first inequality. The second one follows as we take complements. □
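As a purely illustrative aside, Corollary 9.31 is easy to check by simulation. For a symmetric simple random walk with integer barriers a < 0 < b, the walk cannot jump over either barrier, so the same optional-sampling argument applies with equality and gives P{τ_a < τ_b} = b/(b − a) exactly (the classical gambler's-ruin probability). The Python sketch below is ours, with arbitrary barrier values.

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = -3, 5                       # integer barriers, a < 0 < b
n_paths = 100_000

pos = np.zeros(n_paths, dtype=int)
active = np.ones(n_paths, dtype=bool)      # paths not yet absorbed
hit_a = np.zeros(n_paths, dtype=bool)
while active.any():
    pos[active] += rng.choice([-1, 1], size=int(active.sum()))
    hit_a |= active & (pos <= a)
    active &= (pos > a) & (pos < b)

print("simulated P{tau_a < tau_b} :", hit_a.mean())
print("predicted  b / (b - a)     :", b / (b - a))
```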



The next result plays a crucial role in Chapter 17.

Lemma 9.32 (absorption) For any right-continuous super-martingale X ≥ 0, we have X = 0 a.s. on [τ, ∞), where
    τ = inf{t ≥ 0; X_t ∧ X_{t−} = 0}.

Proof: By Theorem 9.28, the process X remains a super-martingale with respect to the right-continuous filtration F^+. The times τ_n = inf{t ≥ 0; X_t < n^{−1}} are F^+-optional by Lemma 9.6, and the right-continuity of X yields X_{τ_n} ≤ n^{−1} on {τ_n < ∞}. Hence, Theorem 9.30 yields







    E(X_t; τ_n ≤ t) ≤ E(X_{τ_n}; τ_n ≤ t) ≤ n^{−1},   t ≥ 0, n ∈ N.


Noting that τn ↑ τ , we get by dominated convergence E(Xt ; τ ≤ t) = 0, and so Xt = 0 a.s. on {τ ≤ t}. The assertion now follows, as we apply this result 2 to all t ∈ Q+ and use the right-continuity of X. Finally we show how the right-continuity of an increasing sequence of supermartingales extends to the limit. The result is needed in Chapter 34. Theorem 9.33 (increasing super-martingales, Meyer) Let X 1 ≤ X 2 ≤ · · · be right-continuous super-martingales with supn EX0n < ∞. Then Xt = supn Xtn is again an a.s. right-continuous super-martingale. Proof (Doob): By Theorem 9.28 we may take the filtration to be rightcontinuous. The super-martingale property carries over to X by monotone convergence. To prove the asserted right-continuity, we may take X 1 to be bounded below by an integrable random variable, since we can otherwise consider the processes obtained by optional stopping at the times m ∧ inf{t; Xt1 < −m}, for arbitrary m > 0. For fixed ε > 0, let T be the class of optional times τ with lim sup |Xu − Xt | ≤ 2ε, u↓t

t < τ,

and put p = inf τ ∈T Ee−τ . Choose σ1 , σ2 , . . . ∈ T with Ee−σn → p , and note that σ ≡ supn σn ∈ T with Ee−σ = p . We need to show that σ = ∞ a.s. Then introduce the optional times



    τ_n = inf{t > σ; |X^n_t − X_σ| > ε},   n ∈ N,
and put τ = lim sup_n τ_n. Noting that
    |X_t − X_σ| = lim inf_{n→∞} |X^n_t − X_σ| ≤ ε,   t ∈ [σ, τ),
we obtain τ ∈ T. By the right-continuity of X^n, we have |X^n_{τ_n} − X_σ| ≥ ε on {τ_n < ∞} for every n. Furthermore, on the set A = {σ = τ < ∞},
    lim inf_{n→∞} X^n_{τ_n} ≥ sup_k lim_{n→∞} X^k_{τ_n} = sup_k X^k_σ = X_σ,
and so
    lim inf_{n→∞} X^n_{τ_n} ≥ X_σ + ε on A.
Since A ∈ F_σ by Lemma 9.1, we get by Fatou’s lemma, optional sampling, and monotone convergence
    E(X_σ + ε; A) ≤ E(lim inf_n X^n_{τ_n}; A)
                 ≤ lim inf_n E(X^n_{τ_n}; A)
                 ≤ lim_n E(X^n_σ; A) = E(X_σ; A).
Thus, PA = 0, and so τ > σ a.s. on {σ < ∞}. Since p > 0 would yield the contradiction Ee^{−τ} < p, we have p = 0, and so σ = ∞ a.s. □
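As a purely illustrative aside before the exercises, the uniform-integrability condition in the unbounded-time part of Theorem 9.30 is not a technicality. For the symmetric simple random walk M (a martingale that is not uniformly integrable) and τ = inf{n; M_n = 1}, we have τ < ∞ a.s. and hence EM_τ = 1 ≠ 0 = EM_0, whereas the bounded times τ ∧ t are handled correctly. The sketch below is ours, with arbitrary parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
n_paths, horizon = 100_000, 20_000

pos = np.zeros(n_paths, dtype=int)
alive = np.ones(n_paths, dtype=bool)       # paths that have not yet hit +1
for _ in range(horizon):
    n_alive = int(alive.sum())
    if n_alive == 0:
        break
    pos[alive] += rng.choice([-1, 1], size=n_alive)
    alive &= (pos != 1)

print("E M_{tau ^ t} :", pos.mean())          # ~ 0, as optional sampling at the bounded time tau ^ t requires
print("P{tau <= t}   :", 1.0 - alive.mean())  # close to 1; letting t -> infinity gives E M_tau = 1 != 0
```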


Exercises 1. For any optional times σ, τ , show that {σ = τ } ∈ Fσ ∩ Fτ and Fσ = Fτ on {σ = τ }. However, Fτ and F∞ may differ on {τ = ∞}. 2. Show that if σ, τ are optional times on the time scale R+ or Z+ , then so is σ + τ. 3. Give an example of a random time that is weakly optional but not optional. (Hint: Let F be the filtration induced by the process Xt = ϑ t with P {ϑ = ±1} = 12 , and take τ = inf{t; Xt > 0}.) 4. Fix a random time τ and a random variable ξ in R \ {0}. Show that the process Xt = ξ 1{τ ≤ t} is adapted to a given filtration F iff τ is F-optional and ξ is Fτ -measurable. Give corresponding conditions for the process Yt = ξ 1{τ < t}. 5. Let P be the class of sets A ∈ R+ × Ω such that the process 1A is progressive. Show that P is a σ-field and that a process X is progressive iff it is P-measurable. 6. Let X be a progressive process with induced filtration F , and fix any optional time τ < ∞. Show that σ{τ, X τ } ⊂ Fτ ⊂ Fτ+ ⊂ σ{τ, X τ +h } for every h > 0. (Hint: The first relation becomes an equality when τ takes only countably many values.) Note that the result may fail when P {τ = ∞} > 0. 7. Let M be an F-martingale on a countable index set, and fix an optional time τ . Show that M − M τ remains a martingale, conditionally on Fτ . (Hint: Use Theorem 9.12 and Lemma 9.14.) Extend the result to continuous time. 8. Show that any sub-martingale remains a sub-martingale with respect to the induced filtration. 9. Let X 1 , X 2 , . . . be sub-martingales such that the process X = supn X n is integrable. Show that X is again a sub-martingale. Also show that lim supn X n is a sub-martingale when even supn |X n | is integrable. 10. Show that the Doob decomposition of an integrable random sequence X = (Xn ) depends on the filtration, unless X is a.s. X0 -measurable. (Hint: Compare the filtrations induced by X and by the sequence Yn = (X0 , Xn+1 ).) 11. Fix a random time τ and a random variable ξ ∈ L1 , and define Mt = ξ 1{τ ≤ t}. Show that M is a martingale with respect to the induced filtration F iff E(ξ; τ ≤ t | τ > s) = 0 for any s < t. (Hint: The set {τ > s} is an atom of Fs .) 12. Let F, G be filtrations on the same probability space. Show that every Fmartingale is a G-martingale iff Ft ⊂ Gt ⊥⊥Ft F∞ for every t ≥ 0. (Hint: For the necessity, consider F-martingales of the form Ms = E(ξ |Fs ) with ξ ∈ L1 (Ft ).) 13. For any rcll super-martingale X ≥ 0 and constant r ≥ 0, show that r P {supt Xt ≥ r} ≤ EX0 . 14. Let M be an L2 -bounded martingale on Z+ . Mimicing the proof of Lemma 5.16, show that Mn converges a.s. and in L2 . 15. Give an example of a martingale that is L1 -bounded but not uniformly integrable. (Hint: Every positive martingale is L1 -bounded.) 16. For any optional times σ, τ with respect to a right-continuous filtration F, show that E Fσ and E Fτ commute on L1 with product E Fσ∧τ , and that Fσ ⊥⊥Fσ∧τ Fτ . (Hint: The first condition holds by Theorem 9.30, and the second one since Fσ∧τ =


Fσ on {σ ≤ τ } and Fσ∧τ = Fτ on {τ ≤ σ} by Lemma 9.1.. The two conditions are also seen to be equivalent by Corollary 8.14.) 17. Let ξn → ξ in L1 . For any increasing σ-fields Fn , show that E(ξn |Fn ) → E(ξ |F∞ ) in L1 . 18. Let ξ, ξ1 , ξ2 , . . . ∈ L1 with ξn ↑ ξ a.s. For any increasing σ-fields Fn , show that P E(ξn | Fn ) → E(ξ | F∞ ) a.s. (Hint: Note that supm E(ξ −ξn |Fm ) → 0 by Proposition 9.16, and use the monotonicity.) 19. Show that any right-continuous sub-martingale is a.s. rcll. 20. Given a super-martingale X ≥ 0 on Z+ and some optional times τ0 ≤ τ1 ≤ · · · , show that the sequence (Xτn ) is again a super-martingale. (Hint: Truncate the times τn , and use the conditional Fatou lemma.) Show by an example that the result fails for sub-martingales. 21. For any random time τ ≥ 0 and right-continuous filtration F = (Ft ), show that the process Xt = P (τ ≤ t | Ft ) has a right-continuous version. (Hint: Use Theorem 9.28 (ii).)

Chapter 10

Predictability and Compensation Predictable times and processes, strict past, accessible and totally inaccessible times, decomposition of optional times, Doob–Meyer decomposition, natural and predictable processes, dual predictable projection, predictable martingales, compensation and decomposition of random measures, ql-continuous filtrations, induced and discounted compensators, fundamental martingale, product moments, predictable maps

The theory of martingales and optional times from the previous chapter leads naturally into a general theory of processes, of fundamental importance for various advanced topics of stochastic calculus, random measure theory, and many other subjects throughout the remainder of this book. Here a crucial role is played by the notions of predictable processes and associated predictable times, forming basic subclasses of adapted processes and optional times. Among the many fundamental results proved in this chapter, we note in particular the celebrated Doob–Meyer decomposition, a continuous-time counterpart of the elementary Doob decomposition from Lemma 9.10, here proved directly1 by a discrete approximation based on Dunford’s characterization of weak compactness, combined with Doob’s ingenious approximation of inaccessible times. The result may be regarded as a probabilistic counterpart of some basic decompositions in potential theory, as explained in Chapter 34. Even more important for probabilists are the applications to increasing processes, leading to the notion of compensator of a random measure or point process on a product space R+ ×S, defined as a predictable random measure on the same space, providing the instantaneous rate of increase of the underlying process. In particular, we will see in Chapter 15 how the compensator can be used to reduce a fairly general point process on R+ to Poisson, in a similar way as a continuous martingale can be time-changed into a Brownian motion. A suitable transformation leads to the equally important notion of discounted compensator, needed in Chapters 15 and 27 to establish some general mapping theorems. The present material is related in many ways to material discussed elsewhere. Apart from the already mentioned connections, we note in particular the notions of tangential processes, general semi-martingales, and stochastic integration in Chapters 16, 20, and 35, the quasi-left continuity of Feller processes in Chapter 17, and the predictable mapping theorems in Chapters 15, 19, and 27. 1

The standard proof, often omitted, is based on a subtle use of capacity theory.


We take all random objects in this chapter to be defined on a fixed probability space (Ω, A, P ) with a right-continuous and complete filtration F. A random time τ in [0, ∞] is said to be predictable, if it is announced by some optional times τn ↑ τ with τn < τ a.s. on {τ > 0} for all n. Note that predictable times are optional by Lemma 9.3 (i). With any optional time τ we may associate the σ-field Fτ − generated by F0 and the classes Ft ∩ {τ > t} for all t > 0, representing the strict past of τ . The following properties of predictable times and associated σ-fields may be compared with similar results for optional times and σ-fields Fτ in Lemmas 9.1 and 9.3. Lemma 10.1 (predictable times and strict past) For random times σ, τ, τ1 , τ2 , . . . in [0, ∞], we have (i) Fσ ∩ {σ < τ } ⊂ Fτ − ⊂ Fτ when σ, τ are optional, (ii) {σ < τ } ∈ Fσ− ∩ Fτ − for optional σ and predictable τ , (iii)



n

∨_n F_{τ_n} = F_{τ−} when τ is predictable and announced by (τ_n),
(iv) if τ_1, τ_2, ... are predictable, so is τ = sup_n τ_n with ∨_n F_{τ_n−} = F_{τ−},
(v) if σ, τ are predictable, so is σ ∧ τ with F_{σ∧τ−} ⊂ F_{σ−} ∩ F_{τ−}.

Proof: (i) For any A ∈ F_σ, we have
    A ∩ {σ < τ} = ∪_{r∈Q_+} (A ∩ {σ ≤ r} ∩ {r < τ}) ∈ F_{τ−},

since the intersections on the right are generators of F_{τ−}, and so F_σ ∩ {σ < τ} ⊂ F_{τ−}. The second relation holds since all generators of F_{τ−} lie in F_τ.
(ii) If τ is announced by (τ_n), then (i) yields
    {τ ≤ σ} = {τ = 0} ∪ ∩_n {τ_n < σ} ∈ F_{σ−}.
(iii) For any A ∈ F_{τ_n}, we get by (i)
    A = (A ∩ {τ_n < τ}) ∪ (A ∩ {τ_n = τ = 0}) ∈ F_{τ−},
and so ∨_n F_{τ_n} ⊂ F_{τ−}. Conversely, (i) yields for any t ≥ 0 and A ∈ F_t
    A ∩ {τ > t} = ∪_n (A ∩ {τ_n > t}) ∈ ∨_n F_{τ_n−} ⊂ ∨_n F_{τ_n},
which shows that F_{τ−} ⊂ ∨_n F_{τ_n}.
(iv) If each τ_n is announced by σ_{n1}, σ_{n2}, ..., then τ is announced by the optional times σ_n = σ_{1n} ∨ ··· ∨ σ_{nn} and is therefore predictable. By (i) we get for all n
    F_{σ_n} = F_0 ∪ ∪_{k≤n} (F_{σ_n} ∩ {σ_n < τ_k}) ⊂ ∨_{k≤n} F_{τ_k−}.


Combining with (i) and (iii) gives
    F_{τ−} = ∨_n F_{σ_n} ⊂ ∨_{n, k≤n} F_{τ_k−} = ∨_n F_{τ_n−} ⊂ F_{τ−},

and so equality holds throughout, proving our claim. (v) If σ and τ are announced by σ1 , σ2 , . . . and τ1 , τ2 , . . . , then σ ∧ τ is announced by the optional times σn ∧ τn , and is therefore predictable. Combining (iii) above with Lemma 9.1 (ii), we get Fσ∧τ − = ∨n Fσn ∧τn = ∨n (Fσn ∩ Fτn ) ⊂ ∨n Fσn ∩ ∨n Fτn = Fσ− ∩ Fτ − .

2

On the product space Ω × R+ we may introduce the predictable σ-field P, generated by all continuous, adapted processes on R+ . The elements of P are called predictable sets, and the P-measurable functions on Ω × R+ are called predictable processes. Note that every predictable process is progressive. We provide some useful characterizations of the predictable σ-field. Lemma 10.2 (predictable σ-field) The predictable σ-field is generated by each of these classes of sets or processes: (i) F0 × R+ and all sets A × (t, ∞) with A ∈ Ft , t ≥ 0, (ii) F0 × R+ and all intervals (τ, ∞) with optional τ , (iii) all left-continuous, adapted processes. Proof: Let P1 , P2 , P3 be the σ-fields generated by the classes in (i)−(iii). Since continuous functions are left-continuous, we have trivially P ⊂ P3 . To obtain P3 ⊂ P1 , we note that any left-continuous process X can be approximated by the processes Xtn = X0 1[0,1] (nt) +



Σ_{k≥1} X_{k/n} 1_{(k,k+1]}(nt),   t ≥ 0.

210

Foundations of Modern Pobability

(ii) For any predictable time τ and Fτ − -measurable random variable α, the process Xt = α 1{τ ≤ t} is predictable. Proof: (i) If X = 1A×(t,∞) for some t > 0 and A ∈ Ft , then clearly



Xτ 1{τ < ∞} = 1 = A ∩ {t < τ < ∞} ∈ Fτ − .

This extends by a monotone-class argument and a subsequent approximation, first to predictable indicator functions, and then to the general case. (ii) We may clearly take α to be integrable. For any announcing sequence (τn ) of τ , we define 



    X^n_t = E(α | F_{τ_n}) (1{τ_n ∈ (0, t)} + 1{τ_n = 0}),   t ≥ 0.

The X n are left-continuous and adapted, hence predictable. Moreover, X n → 2 X on R+ a.s. by Theorem 9.24 and Lemma 10.1 (iii). An optional time τ is said to be totally inaccessible, if P {σ = τ < ∞} = 0 for every predictable time σ. An accessible time may then be defined as an optional time τ , such that P {σ = τ < ∞} = 0 for every totally inaccessible time σ. For any random time τ , we introduce the associated graph



[τ ] = (t, ω) ∈ R+ × Ω; τ (ω) = t , which allows us to express the previous conditions on σ and τ as [σ] ∩ [τ ] = ∅ a.s. For any optional time τ and set A ∈ Fτ , the time τA = τ 1A + ∞ · 1Ac is again optional and called the restriction of τ to A. We consider a basic decomposition of optional times. Related decompositions of random measures and martingales are given in Theorems 10.17 and 20.16. Proposition 10.4 (decomposition of optional time) For any optional time τ , (i) there exists an a.s. unique set A ∈ Fτ ∩ {τ < ∞}, such that τA is accessible and τAc is totally inaccessible,  (ii) there exist some predictable times τ1 , τ2 , . . . with [τA ] ⊂ n [τn ] a.s. Proof: Define

    p = sup_{(τ_n)} P ∪_n {τ = τ_n < ∞},   (1)

where the supremum extends over all sequences of predictable times τn . Combining sequences where the probability in (1) approaches p, we may construct a sequence (τn ) for which the supremum is attained. For such a maximal sequence, we define A as the union in (1). To see that τA is accessible, let σ be totally inaccessible. Then [σ] ∩ [τn ] = ∅ a.s. for every n, and so [σ] ∩ [τA ] = ∅ a.s. If τAc is not totally inaccessible, then P {τAc = τ0 < ∞} > 0 for some predictable time τ0 , which contradicts the maximality of τ1 , τ2 , . . . . This shows that A has the desired property.


To prove that A is a.s. unique, let B be another set with the stated properties. Then τA\B and τB\A are both accessible and totally inaccessible, and so 2 τA\B = τB\A = ∞ a.s., which implies A = B a.s. We turn to the celebrated Doob–Meyer decomposition, a cornerstone of modern probability. By an increasing process we mean a non-decreasing, rightcontinuous, and adapted process A with A0 = 0. It is said to be integrable if EA∞ < ∞. Recall that all sub-martingales are taken to be right-continuous. Local sub-martingales and locally integrable processes are defined by localization, in the usual way. Theorem 10.5 (Doob–Meyer decomposition, Meyer, Dol´eans) For an adapted process X, these conditions are equivalent: (i) X is a local sub-martingale, (ii) X = M + A a.s. for a local martingale M and a locally integrable, nondecreasing, predictable process A with A0 = 0. The processes M and A are then a.s. unique. Note that the decomposition depends in a crucial way on the underlying ˆ filtration F. We often refer to A as the compensator of X, and write A = X. Several proofs of this result are known, most of which seem to rely on the deep section theorems. Here we give a more elementary proof, based on weak compactness in L1 and an approximation of totally inaccessible times. For clarity, we divide the proof into several lemmas. Let (D) be the class of measurable processes X, such that the family {Xτ } is uniformly integrable, where τ ranges over the set of finite optional times. We show that it is enough to consider sub-martingales of class (D). Lemma 10.6 (uniform integrability) A local sub-martingale X with X0 = 0 is locally of class (D). Proof: First reduce to the case where X is a true sub-martingale. Then introduce for each n the optional time τ = n ∧ inf{t > 0; |Xt | > n}. Here |X τ | ≤ n ∨ |Xτ |, which is integrable by Theorem 9.30, and so X τ is of class (D). 2 An increasing process A is said to be natural, if it is integrable and such ∞ that E 0 (ΔMt ) dAt = 0 for any bounded martingale M . We first establish a preliminary decomposition, where the compensator A is claimed to be natural rather than predictable. Lemma 10.7 (preliminary decomposition, Meyer) Any sub-martingale X of class (D) admits a decomposition M + A, where (i) M is a uniformly integrable martingale, (ii) A is a natural, increasing process.


Proof (Rao): We may take X0 = 0. Introduce the n-dyadic times tnk = k2−n , k ∈ Z+ , and define for any process Y the associated differences Δnk Y = Ytnk+1 − Ytnk . Let     Ant = Σ E Δnk X  Ftnk , t ≥ 0, n ∈ N, k < 2n t n

and note that M = X − A is a martingale on the n-dyadic set. Writing τrn = inf{t; Ant > r} for n ∈ N and r > 0, we get by optional sampling, for any n-dyadic time t, n

    ½ E(A^n_t; A^n_t > 2r) ≤ E(A^n_t − A^n_t ∧ r)
                           ≤ E(A^n_t − A^n_{τ^n_r ∧ t}) = E(X_t − X_{τ^n_r ∧ t})
                           = E(X_t − X_{τ^n_r ∧ t}; A^n_t > r).   (2)

By the martingale property and uniform integrability, we further obtain 1, rP {Ant > r} ≤ EAnt = EXt <

and so the probability on the left tends to zero as r → ∞, uniformly in t and n. Since the variables Xt − Xτrn ∧t are uniformly integrable by (D), the same property holds for Ant by (2) and Lemma 5.10. In particular, the sequence (An∞ ) is uniformly integrable, and each M n is a uniformly integrable martingale. By Lemma 5.13 there exists a random variable α ∈ L1 (F∞ ), such that n A∞ → α weakly in L1 along a sub-sequence N  ⊂ N. Define 





    M_t = E(X_∞ − α | F_t),   A = X − M,

and note that A_∞ = α a.s. by Theorem 9.24. For dyadic t and bounded random variables ξ, the martingale and self-adjointness properties yield
    E(A^n_t − A_t)ξ = E(M_t − M^n_t)ξ
                    = E{E(M_∞ − M^n_∞ | F_t) ξ}
                    = E{(M_∞ − M^n_∞) E(ξ | F_t)}
                    = E(A^n_∞ − α) E(ξ | F_t) → 0,

as n → ∞ along N  . Thus, Ant → At weakly in L1 for dyadic t. In particular, we get for any dyadic s < t 

    0 ≤ E(A^n_t − A^n_s; A_t − A_s < 0) → E((A_t − A_s) ∧ 0) ≤ 0.
Thus, the last expectation vanishes, and so A_t ≥ A_s a.s. By right-continuity it follows that A is a.s. non-decreasing. Also note that A_0 = 0 a.s., since A^n_0 = 0 for all n. To see that A is natural, consider any bounded martingale N, and conclude from Fubini’s theorem and the martingale properties of N and A^n − A = M − M^n that


    E(N_∞ A^n_∞) = Σ_k E(N_∞ Δ^n_k A^n) = Σ_k E(N_{t^n_k} Δ^n_k A^n)
                 = Σ_k E(N_{t^n_k} Δ^n_k A) = E Σ_k N_{t^n_k} Δ^n_k A.
Using weak convergence on the left and dominated convergence on the right, and combining with Fubini’s theorem and the martingale property of N, we get
    E ∫_0^∞ N_{t−} dA_t = E(N_∞ A_∞)
                        = Σ_k E(N_∞ Δ^n_k A)
                        = Σ_k E(N_{t^n_{k+1}} Δ^n_k A)
                        = E Σ_k N_{t^n_{k+1}} Δ^n_k A
                        → E ∫_0^∞ N_t dA_t.
Hence, E ∫_0^∞ (ΔN_t) dA_t = 0, as required. □

To complete the proof of Theorem 10.5, we need to show that the process A in the last lemma is predictable. This will be inferred from the following ingenious approximation of totally inaccessible times. Lemma 10.8 (approximation of inaccessible times, Doob) For a totally inaccessible time τ , put τn = 2−n [2n τ ], and let X n be a right-continuous version of the process P {τn ≤ t | Ft }. Then  

 

    lim_{n→∞} sup_{t≥0} |X^n_t − 1{τ ≤ t}| = 0 a.s.   (3)

Proof: Since τn ↑ τ , we may assume that Xt1 ≥ Xt2 ≥ · · · ≥ 1{τ ≤ t} for all t ≥ 0. Then Xtn = 1 for t ∈ [τ, ∞), and on the set {τ = ∞} we have Xt1 ≤ P (τ < ∞ | Ft ) → 0 a.s. as t → ∞ by Theorem 9.24. Thus, supn |Xtn − 1{τ ≤ t}| → 0 a.s. as t → ∞. To prove (3), it is then enough to show that for every ε > 0, the optional times



    σ_n = inf{t ≥ 0; X^n_t − 1{τ ≤ t} > ε},   n ∈ N,

tend a.s. to infinity. The σn are clearly non-decreasing, say with limit σ. Note that either σn ≤ τ or σn = ∞ for all n. By optional sampling, Theorem 8.5, and Lemma 9.1, we have  



    X^n_σ 1{σ < ∞} = P(τ_n ≤ σ < ∞ | F_σ) → P(τ ≤ σ < ∞ | F_σ) = 1{τ ≤ σ < ∞}.
Hence, X^n_σ → 1{τ ≤ σ} a.s. on {σ < ∞}, and so by right-continuity we have σ_n < σ on the latter set for large enough n. Thus, σ is predictable and announced by the times σ_n ∧ n.


Applying the optional sampling and disintegration theorems to the optional times σ_n, we obtain
    ε P{σ < ∞} ≤ ε P{σ_n < ∞}
               ≤ E(X^n_{σ_n}; σ_n < ∞) = P{τ_n ≤ σ_n < ∞}
               = P{τ_n ≤ σ_n ≤ τ < ∞} → P{τ = σ < ∞} = 0,

where the last equality holds since τ is totally inaccessible. Thus, σ = ∞ a.s. 2 It follows easily that A has only accessible jumps: Lemma 10.9 (accessibility of jumps) For a natural, increasing process A and a totally inaccessible time τ , we have ΔAτ = 0 a.s. on {τ < ∞}. Proof: Rescaling if necessary, we may assume that A is a.s. continuous at dyadic times. Define τn = 2−n [2n τ ]. Since A is natural, we have



E 0



 



P τn > t  Ft dAt = E



∞ 0



 



P τn > t  Ft− dAt ,

and since τ is totally inaccessible, Lemma 10.8 yields

E Aτ − = E =E



0∞ 0

1{τ > t} dAt 1{τ ≥ t} dAt = EAτ .

Hence, E(ΔAτ ; τ < ∞) = 0, and so ΔAτ = 0 a.s. on {τ < ∞}.

2

We may now show that A is predictable. Lemma 10.10 (Dol´eans) Any natural, increasing process is predictable. Proof: Fix a natural increasing process A. Consider a bounded martingale M and a predictable time τ < ∞ announced by σ1 , σ2 , . . . . Then M τ − M σk is again a bounded martingale, and since A is natural, we get by dominated convergence E(ΔMτ )(ΔAτ ) = 0. In particular, we may take Mt = P (B | Ft ) with B ∈ Fτ . By optional sampling, we have Mτ = 1B and M τ − ← M σk = P (B | Fσk ) → P (B | Fτ − ).


Thus, ΔMτ = 1B − P (B | Fτ − ), and so

    E(ΔA_τ; B) = E{ΔA_τ P(B | F_{τ−})} = E{E(ΔA_τ | F_{τ−}); B}.
Since B ∈ F_τ was arbitrary, we get ΔA_τ = E(ΔA_τ | F_{τ−}) a.s., and so the process A_t = (ΔA_τ)1{τ ≤ t} is predictable by Lemma 10.3 (ii). It is also natural, since for any bounded martingale M
    E(ΔA_τ ΔM_τ) = E{ΔA_τ E(ΔM_τ | F_{τ−})} = 0.



μB = E 0

1B (t) dMt ,

B ∈ P,

where the inner integral is an ordinary Lebesgue–Stieltjes integral. The martingale property shows that μ vanishes for the sets B = F ×(t, ∞) with F ∈ Ft . By Lemma 10.2 and a monotone-class argument, it follows that μ = 0 on P. Since M is predictable, the same thing is true for the process ΔMt = Mt −Mt− , and then also for the sets J± = {t > 0; ±ΔMt > 0}. Thus, μJ± = 0, and so ΔM = 0 a.s., which means that M is a.s. continuous. But then Mt ≡ M0 a.s. by Proposition 18.2 below. 2 Proof of Theorem 10.5: The sufficiency is obvious, and the uniqueness holds by Lemma 10.11. It remains to show that any local sub-martingale X has the stated decomposition. By Lemmas 10.6 and 10.11, we may take X to be of

216

Foundations of Modern Pobability

class (D). Then Lemma 10.7 yields X = M + A for a uniformly integrable martingale M and a natural increasing process A, and the latter is predictable by Lemma 10.10. 2 The two properties in Lemma 10.10 are in fact equivalent. Theorem 10.12 (natural and predictable processes, Dol´eans) Let A be an integrable, increasing process. Then A is natural



A is predictable.

Proof: If an integrable, increasing process A is natural, it is also predictable by Lemma 10.10. Now let A be predictable. By Lemma 10.7, we have A = M + B for a uniformly integrable martingale M and a natural increasing process B, and Lemma 10.10 shows that B is predictable. But then A = B a.s. by Lemma 10.11, and so A is natural. 2 The following criterion is essentially implicit in earlier proofs. Lemma 10.13 (dual predictable projection) Let X, Y be locally integrable, increasing processes, where Y is predictable. Then X has compensator Y iff

E



V dX = E

V dY,

V ≥ 0 predictable.

Proof: By localization, we may take X, Y to be integrable. Then Y is the compensator of X iff M = Y − X is a martingale, which holds iff EMτ = 0 for every optional time τ . This is equivalent to the stated relation for V = 1[0,τ ] , and the general result follows by a straightforward monotone-class argument. 2 We may now establish the fundamental relationship between predictable times and processes. Theorem 10.14 (predictable times and processes, Meyer) For any optional time τ , these conditions are equivalent: (i) τ is predictable, (ii) the process 1{τ ≤ t} is predictable, (iii) E ΔMτ = 0 for any bounded martingale M . Proof (Chung & Walsh): Since (i) ⇒ (ii) by Lemma 10.3 (ii) and (ii) ⇔ (iii) by Theorem 10.12, it remains to show that (iii) ⇒ (i). Then introduce the martingale Mt = E(e−τ | Ft ) and super-martingale ∧t − Mt Xt = e−τ     = E e−τ ∧t − e−τ  Ft ≥ 0,

t ≥ 0,

and note that Xτ = 0 a.s. by optional sampling. Letting σ = inf{t ≥ 0; Xt− ∧ Xt = 0}, we see from Lemma 9.32 that {t ≥ 0; Xt = 0} = [σ, ∞) a.s., and in

10. Predictability and Compensation

217

particular σ ≤ τ a.s. Using optional sampling again, we get E(e−σ − e−τ ) = EXσ = 0, and so σ = τ a.s. Hence, Xt ∧ Xt− > 0 a.s. on [0, τ ). Finally, (iii) yields EXτ − = E(e−τ − Mτ − ) = E(e−τ − Mτ ) = EXτ = 0, and so Xτ − = 0. It is now clear that τ is announced by the optional times 2 τn = inf{t; Xt < n−1 }. To illustrate the power of the last result, we give a short proof of the following useful statement, which can also be proved directly. Corollary 10.15 (predictable restriction) For any predictable time τ and set A ∈ Fτ − , the restriction τA is again predictable. Proof: The process 1A 1{τ ≤ t} = 1{τA ≤ t} is predictable by Lemma 10.3, 2 and so the time τA is predictable by Theorem 10.14. Using the last theorem, we may show that predictable martingales are continuous. Proposition 10.16 (predictable martingale) For a local martingale M , M is predictable



M is a.s. continuous

Proof: The sufficiency is clear by definitions. To prove the necessity, we note that for any optional time τ , Mtτ = Mt 1[0,τ ] (t) + Mτ 1(τ,∞) (t),

t ≥ 0.

Thus, the predictability is preserved by optional stopping, and so we may take M to be is a uniformly integrable martingale. Now fix any ε > 0, and introduce the optional time τ = inf{t > 0; |ΔMt | > ε}. Since the left-continuous version Mt− is predictable, so is the process ΔMt , as well as the random set A = {t > 0; |ΔMt | > ε}. Hence, the interval [τ, ∞) = A ∪ (τ, ∞) has the same property, and so τ is predictable by Theorem 10.14. Choosing an announcing sequence (τn ), we conclude that, by optional sampling, martingale convergence, and Lemmas 10.1 (iii) and 10.3 (i), M τ − ← Mτ n = E(Mτ |Fτn ) → E(Mτ |Fτ − ) = Mτ . Thus, τ = ∞ a.s., and ε being arbitrary, it follows that M is a.s. continuous. 2 We turn to a more refined decomposition of increasing processes, extending the decomposition of optional times in Proposition 10.4. For convenience here

218

Foundations of Modern Pobability

and below, any right-continuous, non-decreasing process X on R+ starting at 0 may be identified with a random measure ξ on (0, ∞), via the relationship in Theorem 2.14, so that ξ[0, t] = Xt for all t ≥ 0. We say that ξ is adapted or predictable 2 if the corresponding property holds for X. The compensator of ξ is defined as the predictable random measure ξˆ associated with the compensator ˆ of X. X An rcll process X or filtration F is said to be quasi-left continuous 3 (qlcontinuous for short) if ΔXτ = 0 a.s. on {τ < ∞} or Fτ = Fτ − , respectively, for every predictable time τ . We further say that X has accessible jumps, if ΔXτ = 0 a.s. on {τ < ∞} for every totally inaccessible time τ . Accordingly, a random measure ξ on (0, ∞) is said to be ql-continuous if ξ{τ } = 0 a.s. for every predictable time τ , and accessible if it is purely atomic with ξ{τ } = 0 a.s. for every totally inaccessible time τ , where μ{∞} = 0 for measures μ on R+ . Theorem 10.17 (decomposition of random measure) Let ξ be an adapted ranˆ Then dom measure on (0, ∞) with compensator ξ. (i) ξ has an a.s. unique decomposition ξ = ξ c + ξ q + ξ a , where ξ c is diffuse, ξ q + ξ a is purely atomic, ξ q is ql-continuous, and ξ a is accessible, (ii) ξ a is a.s. supported by disjoint graphs,



n [τn ],

for some predictable times τ1 , τ2 , . . . with

ˆ c , and ξ is ql(iii) when ξ is locally integrable, ξ c + ξ q has compensator (ξ) continuous iff ξˆ is a.s. continuous. Proof: Subtracting the predictable component ξ c , we may take ξ to be purely atomic. Put η = ξˆ and At = η(0, t]. Consider the locally integrable  ˆ and define process Xt = s≤t (ΔAs ∧ 1) with compensator X, Aqt =

0

t+

ˆ s = 0} dAs , 1{ΔX

Aat = At − Aqt ,

t ≥ 0.

For any predictable time τ < ∞, the graph [τ ] is again predictable by Theorem 10.14, and so by Theorem 10.13, 







ˆτ = 0 E(ΔAqτ ∧ 1) = E ΔXτ ; ΔX

ˆ τ ; ΔX ˆ τ = 0 = 0, = E ΔX which shows that Aq is ql-continuous. Now let τn0 = 0 for n ∈ N, and define recursively the random times



ˆ t ∈ 2−n (1, 2] , τnk = inf t > τn,k−1 ; ΔX

n, k ∈ N.

2 Then ξB is clearly measurable for every B ∈ B, so that ξ is indeed a random measure in the sense of Chapters 15, 23, and 29–31 below. 3 This horrible term is well established.


They are predictable by Theorem 10.14, and {t > 0; ΔAat > 0} = n,k [τnk ] a.s. by the definition of Aa . Thus, for any totally inaccessible time τ , we have ΔAaτ = 0 a.s. on {τ < ∞}, which shows that Aa has accessible jumps. If A = B q + B a is another decomposition with the stated properties, then Y = Aq − B q = B a − Aa is ql-continuous with accessible jumps, and so Lemma 10.4 gives ΔYτ = 0 a.s. on {τ < ∞} for any optional time τ , which means that Y is a.s. continuous. Since it is also purely discontinuous, we get Y = 0 a.s., which proves the asserted uniqueness. When A is locally integrable, we may write instead Aq = 1{ΔAˆ = 0} · A, ˆ For any predictable process V ≥ 0, we ˆ c = 1{ΔAˆ = 0} · A. and note that (A) get by Theorem 10.13 E V dAq = E 1{ΔAˆ = 0} V dA

=E



=E

1{ΔAˆ = 0} V dAˆ ˆ c, V d(A) 2

ˆ c. which shows that Aq has compensator (A)

By the compensator of an optional time τ we mean the compensator of the associated jump process Xt = 1{τ ≤ t}. We may characterize the different categories of optional times in terms of their compensators. Corollary 10.18 (compensation of optional time) Let τ be an optional time with compensator A. Then (i) τ is predictable iff A is a.s. constant apart from a possible unit jump, (ii) τ is accessible iff A is a.s. purely discontinuous, (iii) τ is totally inaccessible iff A is a.s. continuous, (iv) τ has the accessible part τD , where D = {ΔAτ > 0, τ < ∞}. Proof: (i) If τ is predictable, so is the process Xt = 1{τ ≤ t} by Theorem 10.14, and hence A = X a.s. Conversely, if At = 1{σ ≤ t} for an optional time σ, the latter is predictable by Theorem 10.14, and Lemma 10.13 yields 







P {σ = τ < ∞} = E ΔXσ ; σ < ∞ = = = =

E ΔAσ ; σ < ∞ P {σ < ∞} EA∞ = EX∞ P {τ < ∞}.

Thus, τ = σ a.s., and so τ is predictable. (ii) Clearly τ is accessible iff X has accessible jumps, which holds by Theorem 10.17 iff A = Ad a.s. (iii) The time τ is totally inaccessible iff X is ql-continuous, which holds by Theorem 10.17 iff A = Ac a.s.

(iv) Use (ii) and (iii). □

Next we characterize quasi-left continuity for filtrations and martingales. Proposition 10.19 (ql-continuous filtration, Meyer) For a filtration F, these conditions are equivalent: (i) every accessible time is predictable, (ii) Fτ − = Fτ on {τ < ∞} for all predictable times τ , (iii) ΔMτ = 0 a.s. on {τ < ∞} for any martingale M and predictable time τ . If the basic σ-field in Ω is F∞ , then Fτ − = Fτ on {τ = ∞} for any optional time τ , and the relation in (ii) extends to all of Ω. Proof, (i) ⇒ (ii): Let τ be a predictable time, and fix any B ∈ Fτ ∩{τ < ∞}. Then [τB ] ⊂ [τ ], and so τB is accessible, hence by (i) even predictable. The process Xt = 1{τB ≤ t} is then predictable by Theorem 10.14, and since



Xτ 1{τ < ∞} = 1 τB ≤ τ < ∞ = 1B , Lemma 10.3 (i) yields B ∈ Fτ − . (ii) ⇒ (iii): Fix a martingale M , and let τ be a bounded, predictable time announced by (τn ). Using (ii) and Lemma 10.1 (iii), we get as before Mτ − ← Mτn = E(Mτ |Fτn ) → E(Mτ |Fτ − ) = E(Mτ |Fτ ) = Mτ , and so Mτ − = Mτ a.s. (iii) ⇒ (i): If τ is accessible, Proposition 10.4 yields some predictable times  τn with [τ ] ⊂ n [τn ] a.s. By (iii) we have ΔMτn = 0 a.s. on {τn < ∞} for every martingale M and all n, and so ΔMτ = 0 a.s. on {τ < ∞}. Hence, τ is predictable by Theorem 10.14. 2 The following inequality will be needed in Chapter 20. Proposition 10.20 (norm relation, Garsia, Neveu) Let A be a right- or leftcontinuous, predictable, increasing process, and let ζ ≥ 0 be a random variable    such that a.s.  (4) E A∞ − At  Ft ≤ E(ζ | Ft ), t ≥ 0. Then A∞ p ≤ p ζ p , p ≥ 1. In the left-continuous case, predictability is clearly equivalent to adaptedness. The proper interpretation of (4) is to take E(At | Ft ) ≡ At , and choose right-continuous versions of the martingales E(A∞ | Ft ) and E(ζ| Ft ). For a right-continuous A, we may clearly take ζ = Z ∗ , where Z is the super-martingale on the left of (4). We also note that if A is the compensator of an


increasing process X, then (4) holds with ζ = X∞ . Proof: We may take A to be right-continuous, the left-continuous case being similar but simpler. We can also take A to be bounded, since we may otherwise replace A by the process A ∧ u for arbitrary u > 0, and let u → ∞ in the resulting formula. For each r > 0, the random time τr = inf{t; At ≥ r} is predictable by Theorem 10.14. By optional sampling and Lemma 10.1, we note that (4) remains valid with t replaced by τr −. Since τr is Fτr − -measurable by the same lemma, we obtain 





E A∞ − r; A∞ > r ≤ E A∞ − r; τr < ∞







≤ E A∞ − Aτr − ; τr < ∞ ≤ E(ζ; τr < ∞) ≤ E(ζ; A∞ ≥ r).

Writing A∞ = α and letting p−1 +q −1 = 1, we get by Fubini’s theorem, H¨older’s inequality, and some calculus α pp = p2 q −1 E = p2 q −1 ≤p q 2

−1





α

0 ∞

0∞ 0

= p2 q −1 E ζ

(α − r) rp−2 dr 



E α − r; α > r rp−2 dr 



E ζ; α ≥ r rp−2 dr



α

rp−2 dr

0

= p E ζαp−1 ≤ p ζ p α p−1 p . If α p > 0, we may finally divide both sides by α p−1 p .

2

We can’t avoid the random measure point of view when turning to processes on a product space R+ × S, where S is a localized Borel space. Here a random measure is simply a locally finite kernel ξ : Ω → (0, ∞) × S, and we say that ξ is adapted, predictable, or locally integrable if the corresponding properties hold for the real-valued processes ξt B = ξ{(0, t] × B} with B ∈ Sˆ fixed. The compensator of ξ is defined as a predictable random measure ξˆ on (0, ∞) × S, ˆ such that ξˆt B is a compensator of ξt B for every B ∈ S. For an alternative approach, suggested by Lemma 10.13, we say that a process V on R+ ×S is predictable if it is P×S -measurable, where P denotes the predictable σ-field on R+ . The compensator ξˆ may then be characterized as the ˆ a.s. unique, predictable random measure on (0, ∞)×S, such that E ξV = E ξV for every predictable process V ≥ 0 on R+ × S. Theorem 10.21 (compensation of random measure, Grigelionis, Jacod) Let ξ be a locally integrable, adapted random measure on (0, ∞) × S, where S


is Borel. Then there exists an a.s. unique predictable random measure ξˆ on (0, ∞) × S, such that for any predictable process V ≥ 0 on R+ × S,

E



V dξ = E

ˆ V dξ.

Again we emphasize that the compensator ξˆ depends in a crucial way on the choice of underlying filtration F. Our proof relies on a technical lemma, which can be established by straightforward monotone-class arguments. Lemma 10.22 (predictable random measures) (i) For any predictable random measure ξ and predictable process V ≥ 0 on (0, ∞) × S, the process V · ξ is again predictable. (ii) For any predictable process V ≥ 0 on (0, ∞)×S and predictable, measurevalued process ρ on S, the process Yt = Vt,s ρt (ds) is again predictable. Proof of Theorem 10.21: Since ξ is locally integrable, we may choose a  predictable process V > 0 on R+ × S such that E V dξ < ∞. If the random ˆ then Lemma 10.22 shows that ξ has measure ζ = V · ξ has compensator ζ, −1 ˆ ˆ compensator ξ = V · ζ. Thus, we may henceforth assume that Eξ{(0, ∞) × S} = 1. Put η = ξ(· × S). Using the kernel composition of Chapter 3, we may introduce the probability measure μ = P ⊗ ξ on Ω × R+ × S with projection ν = P ⊗ η onto Ω × R+ . Applying Theorem 3.4 to the restrictions of μ and ν to the σ-fields P ⊗ S and P, respectively, we obtain a probability kernel ρ : (Ω × R+ , P) → (S, S) satisfying μ = ν ⊗ ρ, or P ⊗ ξ = P ⊗ η ⊗ ρ on (Ω × R+ × S, P × S). Letting ηˆ be the compensator of η, we may introduce the random measure ξˆ = ηˆ ⊗ ρ on R+ × S. To see that ξˆ is the compensator of ξ, we note that ξˆ is predictable by Lemma 10.22 (i). For any predictable process V ≥ 0 on R+ × S, the process  Ys = Vs,t ρt (ds) is again predictable by Lemma 10.22 (ii). By Theorem 8.5 and Lemma 10.13, we get

E

V dξˆ = E =E =E





ηˆ(dt) η(dt)



Vs,t ρt (ds) Vs,t ρt (ds)

V dξ.

Finally note that ξˆ is a.s. unique by Lemma 10.13.

2

The theory simplifies when the filtration F is induced by ξ, in the sense of being the smallest right-continuous, complete filtration making ξ adapted, in which case we call ξˆ the induced compensator. Here we consider only the special case of a single point mass ξ = δ τ,χ , where τ is a random time with associated mark χ in S.


Proposition 10.23 (induced compensator) Let (τ, ζ) be a random element in (0, ∞] × S with distribution μ, where S is Borel. Then ξ = δ τ,ζ has the induced compensator μ(dr × B) , t ≥ 0, B ∈ S. (5) ξˆt B = (0, t∧τ ] μ([r, ∞] × S) Proof: The process ηt B on the right of (5) is clearly predictable for every B ∈ S. It remains to show that Mt = ξt B − ηt B is a martingale, hence that E(Mt − Ms ; A) = 0 for any s < t and A ∈ Fs . Since Mt = Ms on {τ ≤ s}, and the set {τ > s} is a.s. an atom of Fs , it suffices to show that E(Mt − Ms ) = 0, or EMt ≡ 0. Then use Fubini’s theorem to get

μ(dr × B) μ([r, ∞] × S) μ(dr × B) = μ(dx) (0,∞] (0,t∧x] μ([r, ∞] × S) μ(dr × B) = μ(dx) (0,t] μ([r, ∞] × S) [r,∞]

E ηt B = E

(0,t∧τ ]

= μ{(0, t] × B} = E ξt B.

2

Now return to the case of a general filtration F: Theorem 10.24 (discounted compensator) For any adapted pair (τ, χ) in (0, ∞] × S with compensator η, there exists a unique, predictable random measure ζ on (0, τ ] × S with ζ ≤ 1, satisfying (5) with μ replaced by ζ. Writing Zt = 1 − ζ¯t , we have a.s. (i)

ζ = Z− · η, and Z is the unique solution of Z = 1 − Z− · η¯,

(ii)

Zt = exp(−¯ ηtc ) Π (1 − Δ¯ ηs ), 

(iii) (iv) (v)

s≤t

t ≥ 0,

< 1 on [0, τ ), ≤ 1 on [τ ], Z is non-increasing and ≥ 0 with Zτ − > 0, Y = Z −1 satisfies d Yt = Yt d η¯t on {Zt > 0}. Δ¯ η

The process in (ii) is a special case of Dol´eans’ exponential in Chapter 20. Though the subsequent discussion involves some stochastic differential equations, the present versions are elementary, and there is no need for the general theory in Chapters 18–20 and 32–33. Proof: (iii) The time σ = inf{t ≥ 0; Δ¯ ηt ≥ 1} is optional, by an elementary approximation. Hence, the random interval (σ, ∞) is predictable, and so the same thing is true for the graph [σ] = {Δ¯ η ≥ 1} \ (σ, ∞). Thus, ¯ = E η¯[σ]. P {τ = σ} = E ξ[σ] Since also


1{τ = σ} ≤ 1{σ < ∞} ≤ η̄[σ],

by the definition of σ, we obtain 1{τ = σ} = η¯[σ] = Δ¯ ησ a.s. on {σ < ∞}, which implies Δ¯ ηt ≤ 1 and τ = σ a.s. on the same set. (i)−(ii): If (5) holds with μ = ζ for some random measure ζ on (0, τ ] × S, then the process Zt = 1 − ζ¯t satisfies −1 ¯ d η¯t = Zt− d ζt −1 = −Zt− d Zt ,

and so d Zt = −Zt d η¯t , which implies Zt − 1 = −(Z · η¯)t , or Z = 1 − Z− · η¯. Conversely, any solution Z to the latter equation yields a measure solving (5) with B = S, and so (5) has the general solution ζ = Z− · η. Furthermore, the equation Z = 1 − Z− · η¯ has the unique solution (ii), by an elementary special case of Theorem 20.8 below. In particular, Z is predictable, and the predictability of ζ follows by Lemma 10.22. (iv) Since 1 − Δ¯ η ≥ 0 a.s. by (iii), (ii) shows that Z is a.s. non-increasing  ηt ≤ η¯τ < ∞ a.s., and supt 0. (v) By an elementary integration by parts, we get on the set {t ≥ 0; Zt > 0} 0 = d(Zt Yt ) = Zt− d Yt + Yt d Zt . Using the chain rule for Stieltjes integrals along with equation Z = 1 − Z− · η¯, we obtain −1 Yt d Zt d Yt = −Zt− −1 = Zt− Yt Zt− d η¯t = Yt d η¯t .

2

The power of the discounted compensator is mainly due to the following result and its sequels, which will play key roles in Chapters 15 and 27. Letting τ , χ, η, ζ, Z be such as in Theorem 10.24, we introduce a predictable process V on R+ × S with ζ|V | < ∞ a.s., and form the processes4 Ut,x = Vt,x + Zt−1



t+

V d ζ,

0

Mt = Uτ,χ 1{τ ≤ t} −



t+

U d η, 0

t ≥ 0, x ∈ S,

(6)

t ≥ 0,

(7)

whenever these expressions exist. Proposition 10.25 (fundamental martingale) For τ, χ, η, ζ, Z as in Theorem 10.24, let V be a predictable process on R+ × S with ζ|V | < ∞ a.s. and ζV = 0 a.s. on {Zτ = 0}. Then for U, M as above, we have 4

Recall our convention 0/0=0.


(i) M exists on [0, ∞] and satisfies M∞ = Vτ,χ a.s., (ii) E |Uτ,χ | < ∞ implies 5 E Vτ,χ = 0, and M is a uniformly integrable mar Vτ,χ p for all p > 1. tingale with M ∗ p <

Proof: (i) Write Y = Z −1 . Using the conditions on V , the definition of ζ, and Theorem 10.24 (iv), we obtain η|V | = ζ(Y− |V |) ≤ Yτ − ζ|V | < ∞. Next, we see from (6) and Theorem 10.24 (v) that η|U − V | ≤ ζ|V | η¯Y = (Yτ − 1) ζ|V | < ∞, whenever Zτ > 0. If instead Zτ = 0, we have η|U − V | ≤ ζ|V |



τ−

Y d¯ η

0

= (Yτ − − 1) ζ|V | < ∞. In either case U is η -integrable, and M is well defined. Now let t ≥ 0 with Zt > 0. Using (6), Theorem 10.24 (v), Fubini’s theorem, and the definition of ζ, we get for any x ∈ S

t+ 0

(U − V ) dη =





t+



0



0

Ys d¯ ηs

t+

=

dYs t+

= 0

s+

V dζ 0 s+

V dζ 0

(Yt − Ys− ) Vs,y dζs,y

= Ut,x − Vt,x −

t+

V dη. 0

Simplifying and combining with (6), we get

t+

0

U dη = Ut,x − Vt,x = Yt



t+

V dζ. 0

To extend to general t, suppose that Zτ = 0. Using the previous version of (8), the definition of ζ, and the conditions on V , we get

τ +



U dη = 0

τ −

0

= Yτ −



U dη +

0



= Yτ −

τ −

U dη [τ ]



V dζ +

V dη [τ ]

τ +

V dζ = 0 0

= Uτ,x − Vτ,x , 5

Thus, a mere integrability condition on U implies a moment identity for V .


which shows that (8) is generally true. In particular, Vτ,χ = Uτ,χ −



τ +

U dη 0

= Mτ = M∞ . (ii) If E |Uτ,χ | < ∞, then E η|U | < ∞ by compensation, which shows that M is uniformly integrable. For any optional time σ, the process 1[0,σ] U is again predictable and Eη -integrable, and so by the compensation property and (7), 



EMσ = E Uτ,χ ; τ ≤ σ − E



σ+

U dη 0

= E δτ,χ (1[0,σ] U ) − E η(1[0,σ] U ) = 0, which shows that M is a martingale. Thus, in view of (i), EVτ,χ = EM∞ = EM0 = 0. Furthermore, by Doob’s inequality, M ∗ p < M∞ p = Vτ,χ p ,

p > 1.

2

We also need a multi-variate version of the last result. Here the optional times τ1 , . . . , τn are said to be orthogonal, if they are a.s. distinct, and the atomic parts of their compensators have disjoint supports. Corollary 10.26 (product moments) For every j ≤ m, consider an adapted pair (τj , χj ) in (0, ∞) × Sj and a predictable process Vj on R+ × Sj , where the τj are orthogonal. Define Zj , ζj , Uj as in Theorem 10.24 and (6), and suppose that for all j ≤ m, (i)

ζj |Vj | < ∞,

ζj Vj = 0 a.s. on {Zj (τj ) = 0},

(ii)

E |Uj (τj , χj )| < ∞, E |Vj (τj , χj )|pj < ∞,

for some p1 , . . . , pm > 0 with (iii)

E

Π Vj (τj , χj ) = 0.

j ≤ 1. Σj p−1

Then

j ≤m

Proof: Let the pairs (τj , χj ) have compensators η1 , . . . , ηm , and define the martingales M1 , . . . , Mm as in (7). Fix any i = j in {1, . . . , m}, and choose ηt > some predictable times σ1 , σ2 , . . . as in Theorem 10.17, such that {t > 0; Δ¯  0} = k [σk ] a.s. By compensation and orthogonality, we have for any k ∈ N E δσk [τj ] = P {τj = σk } = E δτj [σk ] = E ηj [σk ] = 0. Summing over k gives ηk [τj ] = 0 a.s., which shows that (ΔMi )(ΔMj ) = 0 a.s.  for all i = j. Integrating repeatedly by parts, we conclude that M = j Mj is


a local martingale. Letting p−1 = Σj p−1 ≤ 1, we get by H¨older’s inequality, j Proposition 10.25 (ii), and the various hypotheses M ∗ 1 ≤ M ∗ p ≤ Π j Mj∗ pj <

Π j Vτ ,χ p j

j

j

< ∞.

Thus, M is a uniformly integrable martingale, and so by Lemma 10.25 (i), E Π j Vj (τj , χj ) = E Π j Mj (∞) = EM (∞) = EM (0) = 0.

2

This yields a powerful predictable mapping theorem based on the discounted compensator. Important extensions and applications appear in Chapters 15 and 27. Theorem 10.27 (predictable mapping) Suppose that for all j ≤ m, (i) (τj , χj ) is adapted in (0, ∞) × Kj with discounted compensator ζj , (ii) Yj is a predictable map of R+ × Kj into a probability space (Sj , μj ), (iii) the τj are orthogonal and ζj ◦ Yj−1 ≤ μj a.s. Then the variables γj = Yj (τj , χj ) are independent with distributions 6 μj . Proof: For fixed Bj ∈ Sj , consider the predictable processes Vj (t, x) = 1Bj ◦ Yj (t, x) − μj Bj ,

t ≥ 0, x ∈ Kj , j ≤ m.

By definitions and hypotheses,

0

t+



Vj dζj =

t+

0



1Bj (Yj ) dζj − μj Bj 1 − Zj (t)



≤ Zj (t) μj Bj . Since changing from Bj to Bjc affects only the sign of Vj , the two versions t+ combine into −Zj (t) μj Bjc ≤ Vj dζj ≤ Zj (t) μj Bj . 0

In particular, |ζj Vj | ≤ Zj (τj ) a.s., and so ζj Vj = 0 a.s. on {Zj (τj ) = 0}. Defining Uj as in (6) and using the previous estimates, we get −1 ≤ = ≤ = 6

−1Bjc ◦ Yj Vj − μj Bjc ≤ Uj Vj + μj Bj 1Bj ◦ Yj ≤ 1,

Thus, the mere inequality in (iii) implies the equality L(γj ) = μj .


which implies |Uj | ≤ 1. Letting ∅ = J ⊂ {1, . . . , m} and applying Corollary 10.26 with pj = |J| for all j ∈ J, we obtain



E Π 1Bj (γj ) − μj Bj = E Π Vj (τj , χj ) = 0. j ∈J

(8)

j ∈J

We may now use induction on |J| to prove that P

∩ {γj ∈ Bj } = Π μj Bj , j ∈J

j ∈J

J ⊂ {1, . . . , m}.

(9)

For |J| = 1 this holds by (8). Now assume (9) for all |J| < k, and proceed to a J with |J| = k. Expanding the product in (8), and applying the induction hypothesis to each term involving at least one factor μj Bj , we see that  all terms but one reduce to ± j∈J μj Bj , whereas the remaining term equals  P j∈J {γj ∈ Bj }. Thus, (9) remains true for J, which completes the induction. By a monotone-class argument, the formula for J = {1, . . . , m} extends  2 to L(γ1 , . . . , γm ) = j μj .
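As a purely illustrative aside, the circle of ideas around Theorems 10.24 and 10.27 can be checked numerically in the induced-filtration setting of Proposition 10.23, where everything is explicit. For a random time τ with law μ on {1, ..., K}, the hazard increments Δη̄_k = μ{k}/μ[k, ∞) give Z_k = Π_{j≤k}(1 − Δη̄_j) = P{τ > k}, and the measure Z_{k−1}Δη̄_k (whose restriction to (0, τ] is the discounted compensator ζ) reproduces μ. For a continuous law, the corresponding map Y(t) = F(t) satisfies the domination condition of Theorem 10.27 with Lebesgue measure on [0, 1], so F(τ) is uniform. The Python sketch below is ours; the particular law μ and the exponential example are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)

# Discrete check: a made-up law mu of tau on {1, ..., 5}.
mu = np.array([0.10, 0.30, 0.25, 0.20, 0.15])
survival = np.concatenate((np.cumsum(mu[::-1])[::-1], [0.0]))   # P(tau >= k), k = 1..6
hazard = mu / survival[:-1]                                     # Delta eta-bar_k = mu{k} / mu[k, inf)

Z = np.cumprod(1.0 - hazard)                                    # Theorem 10.24 (ii), purely discontinuous case
zeta = np.concatenate(([1.0], Z))[:-1] * hazard                 # Z_{k-1} * Delta eta-bar_k

print(np.allclose(Z, survival[1:]))     # True:  Z_k = P(tau > k)
print(np.allclose(zeta, mu))            # True:  the discounted compensator reproduces mu

# Continuous check: the uniform transform F(tau) ~ U(0,1), an instance of Theorem 10.27.
tau = rng.exponential(scale=2.0, size=100_000)
F_tau = 1.0 - np.exp(-tau / 2.0)
print(np.round(np.histogram(F_tau, bins=10, range=(0.0, 1.0))[0] / tau.size, 3))   # ~ 0.1 per bin
```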

Exercises

1. Show by an example that the σ-fields Fτ and Fτ− may differ. (Hint: Take τ to be a constant.)
2. Give examples of optional times that are predictable, accessible but not predictable, and totally inaccessible. (Hint: Use Corollary 10.18.)
3. Give an example of a right-continuous, adapted process that is not predictable. (Hint: Use Theorem 10.14.)
4. Given a Brownian motion B on [0, 1], let F be the filtration induced by Xt = (Bt, B1). Find the Doob–Meyer decomposition B = M + A on [0, 1), and show that A has a.s. finite variation on [0, 1].
5. For any totally inaccessible time τ, show that supt |P(τ ≤ t + ε | Ft) − 1{τ ≤ t}| → 0 a.s. as ε → 0. Derive a corresponding result for the compensator. (Hint: Use Lemma 10.8.)
6. Let the process X be adapted and rcll. Show that X is predictable iff it has accessible jumps and ΔXτ is Fτ−-measurable for every predictable time τ < ∞. (Hint: Use Proposition 10.17 and Lemmas 10.1 and 10.3.)
7. Show that the compensator A of a ql-continuous, local sub-martingale is a.s. continuous. (Hint: Note that A has accessible jumps. Use optional sampling at an arbitrary predictable time τ < ∞ with announcing sequence (τn).)
8. Show that any general inequality involving an increasing process A and its compensator Â remains valid in discrete time. (Hint: Embed the discrete-time process and filtration into continuous time.)
9. Let ξ be a binomial process on R+ with E ξ = nμ for a diffuse probability measure μ. Show that if ξ has points τ1 < · · · < τn, then ξ has induced compensator ξ̂ = Σk≤n 1[τk−1,τk) μk for some non-random measures μ1, . . . , μn, to be identified.
10. Let ξ be a point process on R+ with ‖ξ‖ = n and with induced compensator of the form ξ̂ = Σk≤n 1[τk−1,τk) μk for some non-random measures μ1, . . . , μn. Show that L(ξ) is uniquely determined by those measures. Under what condition is ξ a binomial process?
11. Determine the natural compensator of a renewal process ξ based on a measure μ. Under what conditions on μ does ξ have accessible atoms, and when is ξ ql-continuous?
12. Let τ1 < τ2 < · · · form a renewal process based on a diffuse measure μ. Find the associated discounted compensators ζ1, ζ2, . . . .
13. Let ξ be a binomial process with ‖ξ‖ = n, based on a diffuse probability measure μ on R+. Find the natural compensator ξ̂ of ξ. Also find the discounted compensators ζ1, . . . , ζn of the points τ1 < · · · < τn of ξ.

IV. Markovian and Related Structures Here we introduce the notion of Markov processes, representing another basic dependence structure of probability theory and playing a prominent role throughout the remainder of the book. After a short discussion of the general Markov property, we establish the basic convergence and invariance properties of discrete- and continuous-time Markov chains. In the continuous-time case, we focus on the structure of pure jump-type processes, and include a brief discussion of branching processes. Chapter 12 is devoted to a detailed study of random walks and renewal processes, exploring in particular some basic recurrence properties and a two-sided version of the celebrated renewal theorem. A hurried reader might concentrate on especially the beginnings of Chapters 11 and 13, along with selected material from Chapter 12. −−− 11. Markov properties and discrete-time chains. The beginning of this chapter, exploring the nature of the general Markov property, may be regarded as essential core material. After discussing in particular the role of transition kernels and the significance of space and time homogeneity, we prove an elementary version of the strong Markov property. We conclude with a discussion of the basic recurrence and ergodic properties of discrete-time Markov chains, including criteria for the existence of invariant distributions and convergence of transition probabilities. 12. Random walks and renewal processes. After establishing the recurrence dichotomy of random walks in Rd and giving criteria for recurrence and transience, we proceed to a study of fluctuation properties in one dimension, involving the notions of ladder times and heights and criteria for divergence to infinity. The remainder of the chapter deals with the main results of renewal theory, including the stationary version of a given renewal process, a two-sided version of the basic renewal theorem, and solutions of the renewal equation. The mentioned subjects will prepare for the related but more sophisticated theories of L´evy processes, local time, and excursion theory in later chapters. 13. Jump-type chains and branching processes. Here we begin with a detailed study of pure jump-type Markov processes, leading to a description in terms of a rate kernel, determining both the jump structure and the nature of holding times. This material is of fundamental importance and may serve as an introduction to the theory of Feller processes and their generators. After discussing the recurrence and ergodic properties of continuous-time Markov chains, we conclude with a brief introduction to branching processes, of importance for both theory and applications.


Chapter 11

Markov Properties and Discrete-Time Chains Extended Markov property, transition kernels, finite-dimensional distributions, Chapman–Kolmogorov relation, existence, invariance and independence, random dynamical systems, iterated optional times, strong Markov property, space and time homogeneity, stationarity and invariance, strong homogeneity, occupation times, recurrence, periodicity, excursions, irreducible chains, convergence dichotomy, mean recurrence times, absorption

Markov processes are without question the most important processes of modern probability, and various aspects of their theory and applications will appear throughout the remainder of this book. They may be thought of informally as random dynamical systems, which explains their fundamental importance for the subject. They will be considered in discrete and continuous time, and in a wide variety of spaces. To make the mentioned description precise, we may fix an arbitrary Borel space S and filtration F. An adapted process X in S is said to be Markov, if for any times s < t we have Xt = fs,t(Xs, ϑs,t) a.s. for some measurable functions fs,t and U(0, 1) random variables ϑs,t ⊥⊥ Fs. The stated condition is equivalent to the less transparent conditional independence Xt ⊥⊥Xs Fs. The process is said to be time-homogeneous if we can choose fs,t ≡ f0,t−s, and space-homogeneous (in Abelian groups) if fs,t(x, ·) ≡ fs,t(0, ·) + x. The evolution is formally described in terms of a family of transition kernels μs,t(x, ·) = L{fs,t(x, ϑ)}, satisfying an a.s. version of the Chapman–Kolmogorov relation μs,t μt,u = μs,u. In the usual axiomatic setup, we assume the latter equation to hold identically. This chapter is devoted to the most basic parts of Markov process theory. Space homogeneity is shown to be equivalent to independence of the increments, which motivates our discussion of random walks and Lévy processes in Chapters 12 and 16. In the time-homogeneous case, we prove a primitive form of the strong Markov property, and show how the result simplifies when the process is also space-homogeneous. We further show how invariance of the initial distribution implies stationarity of the process, which motivates our treatment of stationary processes in Chapter 25. The general theory of Markov processes is more advanced and will not be continued until Chapter 17, where we develop the basic theory of Feller processes. In the meantime we will consider several important sub-classes, such


as the pure jump-type processes in Chapter 13, Brownian motion and related processes in Chapters 14 and 19, and the mentioned random walks and Lévy processes in Chapters 12 and 16. A detailed study of diffusion processes appears in Chapters 32–33, and further aspects of Brownian motion are considered in Chapters 29, 34–35. To begin our systematic study of Markov processes, consider an arbitrary time scale T ⊂ R equipped with a filtration F = (Ft), and fix a measurable space (S, S). An S-valued process X on T is called a Markov process, if it is adapted to F and such that

Ft ⊥⊥Xt Xu,   t ≤ u in T.   (1)

Just as for the martingale property, the Markov property depends on the choice of filtration, with the weakest version obtained for the filtration induced by X. The simple property (1) may be strengthened as follows.

Lemma 11.1 (extended Markov property) The Markov property (1) extends to1

Ft ⊥⊥Xt {Xu; u ≥ t},   t ∈ T.   (2)

1 Using the shift operators θt, this becomes Ft ⊥⊥Xt θtX for all t ∈ T. Informally, the past and future are conditionally independent, given the present.

Proof: Fix any t = t0 ≤ t1 ≤ · · · in T. Then (1) yields Ftn ⊥⊥Xtn Xtn+1 for every n ≥ 0, and so by Theorem 8.12

Ft ⊥⊥Xt0,...,Xtn Xtn+1,   n ≥ 0.

By the same result, this is equivalent to

Ft ⊥⊥Xt (Xt1, Xt2, . . .),

and (2) follows by a monotone-class argument.   2

For any times s ≤ t in T, we assume the existence of some regular conditional distributions

μs,t(Xs, B) = P(Xt ∈ B | Xs) = P(Xt ∈ B | Fs) a.s.,   B ∈ S,   (3)

where the transition kernels μs,t exist by Theorem 8.5 when S is Borel. We may further introduce the one-dimensional distributions νt = L(Xt), t ∈ T. When T begins at 0, we show that the distribution of X is uniquely determined by the kernels μs,t together with the initial distribution ν0. For a precise statement, it is convenient to use the kernel operations introduced in Chapter 3. Recall that if μ, ν are kernels on S, the kernels μ ⊗ ν : S → S 2 and μν : S → S are given, for any s ∈ S and B ∈ S 2 or S, by

(μ ⊗ ν)(s, B) = ∫ μ(s, dt) ∫ ν(t, du) 1B(t, u),
(μν)(s, B) = (μ ⊗ ν)(s, S × B) = ∫ μ(s, dt) ν(t, B).
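For kernels on a finite state space, both operations reduce to elementary sums, and μν is just a matrix product. The following sketch is only a numerical illustration with made-up two-state kernels, not part of the formal development.

```python
import numpy as np

# Two kernels on S = {0, 1}, stored as row-stochastic matrices:
# mu[s, t] = mu(s, {t}) and nu[t, u] = nu(t, {u}).
mu = np.array([[0.7, 0.3],
               [0.4, 0.6]])
nu = np.array([[0.9, 0.1],
               [0.2, 0.8]])

# (mu ⊗ nu)(s, {(t, u)}) = mu(s, {t}) nu(t, {u}), a kernel from S to S^2.
tensor = mu[:, :, None] * nu[None, :, :]        # indexed by (s, t, u)

# (mu nu)(s, {u}) = (mu ⊗ nu)(s, S × {u}) = Σ_t mu(s, {t}) nu(t, {u}),
# i.e. an ordinary matrix product.
composition = tensor.sum(axis=1)
assert np.allclose(composition, mu @ nu)
```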

Proposition 11.2 (finite-dimensional distributions) Let X be a Markov process on T with one-dimensional distributions νt and transition kernels μs,t. Then for any t0 ≤ · · · ≤ tn in T,

(i) L(Xt0, . . . , Xtn) = νt0 ⊗ μt0,t1 ⊗ · · · ⊗ μtn−1,tn,
(ii) L(Xt1, . . . , Xtn | Ft0) = (μt0,t1 ⊗ · · · ⊗ μtn−1,tn)(Xt0, ·).

Proof: (i) This is obvious for n = 0. Proceeding by induction, assume the statement with n replaced by n − 1, and fix any bounded, measurable function f on S n+1. Since Xt0, . . . , Xtn−1 are Ftn−1-measurable, we get by Theorem 8.5 and the induction hypothesis

E f(Xt0, . . . , Xtn) = E E[ f(Xt0, . . . , Xtn) | Ftn−1 ]
  = E ∫ f(Xt0, . . . , Xtn−1, x) μtn−1,tn(Xtn−1, dx)
  = (νt0 ⊗ μt0,t1 ⊗ · · · ⊗ μtn−1,tn) f,

as desired. This completes the induction.

(ii) From (i) we get for any B ∈ S and C ∈ S n

P{(Xt0, . . . , Xtn) ∈ B × C} = ∫B νt0(dx) (μt0,t1 ⊗ · · · ⊗ μtn−1,tn)(x, C)
  = E[ (μt0,t1 ⊗ · · · ⊗ μtn−1,tn)(Xt0, C); Xt0 ∈ B ],

and the assertion follows by Theorem 8.1 and Lemma 11.1.   2
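In the finite-state, time-homogeneous case, formula (i) can be checked directly against simulation. The following sketch uses an arbitrary two-state kernel and is meant only as a numerical illustration of the tensor formula.

```python
import numpy as np

rng = np.random.default_rng(0)
nu = np.array([0.5, 0.5])                      # initial distribution
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])                     # one-step transition kernel

# Proposition 11.2 (i): L(X0, X1, X2) = nu ⊗ P ⊗ P.
joint = nu[:, None, None] * P[:, :, None] * P[None, :, :]

# Monte Carlo estimate of the same joint law.
n = 200_000
x0 = rng.choice(2, size=n, p=nu)
x1 = (rng.random(n) < P[x0, 1]).astype(int)    # P(X1 = 1 | X0) = P[X0, 1]
x2 = (rng.random(n) < P[x1, 1]).astype(int)
freq = np.zeros((2, 2, 2))
np.add.at(freq, (x0, x1, x2), 1.0 / n)

print(np.abs(joint - freq).max())              # small, of order n**-0.5
```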

An obvious consistency requirement leads to the basic Chapman–Kolmogorov relation between the transition kernels. Say that two kernels μ, μ′ agree a.s. if μ(x, ·) = μ′(x, ·) for almost every x.

Corollary 11.3 (Chapman, Smoluchovsky) For a Markov process in a Borel space S, we have

μs,u = μs,t μt,u a.s. νs,   s ≤ t ≤ u.

Proof: By Proposition 11.2, we have a.s. for any B ∈ S

μs,u(Xs, B) = P(Xu ∈ B | Fs)
  = P((Xt, Xu) ∈ S × B | Fs)
  = (μs,t ⊗ μt,u)(Xs, S × B) = (μs,t μt,u)(Xs, B).

Since S is Borel, we may choose a common null set for all B.   2


We will henceforth require the stated relation to hold identically, so that

μs,u = μs,t μt,u,   s ≤ t ≤ u.   (4)

Thus, we define a Markov process by condition (3), in terms of some transition kernels μs,t satisfying (4). In the discrete-time case of T = Z+, the latter relation imposes no restriction, since we may then start from any versions of the kernels μn = μn−1,n, and define μm,n = μm+1 · · · μn for arbitrary m < n. Given such a family of transition kernels μs,t and an arbitrary initial distribution ν, we show that an associated Markov process exists.

Theorem 11.4 (existence, Kolmogorov) Consider a time scale T starting at 0, a Borel space (S, S), a probability measure ν on S, and a family of probability kernels μs,t on S, s ≤ t in T, satisfying (4). Then there exists an S-valued Markov process X on T with initial distribution ν and transition kernels μs,t.

Proof: Consider the probability measures

νt1,...,tn = νμt0,t1 ⊗ · · · ⊗ μtn−1,tn,   0 = t0 ≤ t1 ≤ · · · ≤ tn, n ∈ N.

To see that the family (νt1,...,tn) is projective, let B ∈ S n−1 be arbitrary, and define for any k ∈ {1, . . . , n} the set

Bk = {(x1, . . . , xn) ∈ S n; (x1, . . . , xk−1, xk+1, . . . , xn) ∈ B}.

Then (4) yields

νt1,...,tn Bk = (νμt0,t1 ⊗ · · · ⊗ μtk−1,tk+1 ⊗ · · · ⊗ μtn−1,tn) B = νt1,...,tk−1,tk+1,...,tn B,

as desired. By Theorem 8.23 there exists an S-valued process X on T with

L(Xt1, . . . , Xtn) = νt1,...,tn,   t1 ≤ · · · ≤ tn, n ∈ N,   (5)

and in particular L(X0) = ν0 = ν. To see that X is Markov with transition kernels μs,t, fix any times s1 ≤ · · · ≤ sn = s ≤ t and sets B ∈ S n and C ∈ S, and conclude from (5) that

P{(Xs1, . . . , Xsn, Xt) ∈ B × C} = νs1,...,sn,t(B × C)
  = E[ μs,t(Xs, C); (Xs1, . . . , Xsn) ∈ B ].

Letting F be the filtration induced by X, we get by a monotone-class argument

P{Xt ∈ C; A} = E[ μs,t(Xs, C); A ],   A ∈ Fs,

and so P{Xt ∈ C | Fs} = μs,t(Xs, C) a.s.   2


Now let (G, G) be a measurable group with identity element ι. Recall that a kernel μ on G is said to be invariant if μr = μι ◦ θr−1 ≡ θrμι, meaning that

μ(r, B) = μ(ι, r−1B),   r ∈ G, B ∈ G.

A G-valued Markov process is said to be space-homogeneous if its transition kernels μs,t are invariant. We further say that a process X in G has independent increments, if for any times t0 ≤ · · · ≤ tn the increments Xtk−1−1 Xtk are mutually independent and independent of X0. More generally, given a filtration F on T, we say that X has F-independent increments, if X is adapted to F and such that Xs−1 Xt ⊥⊥ Fs for all s ≤ t in T. Note that the elementary notion of independence corresponds to the case where F is induced by X. For Abelian groups G, we may use an additive notation and write the increments in the usual way as Xt − Xs.

Theorem 11.5 (space homogeneity and independence) Let X be an F-adapted process on T with values in a measurable group G. Then these conditions are equivalent:
(i) X is a G-homogeneous F-Markov process,
(ii) X has F-independent increments.
In that case, X has transition kernels

μs,t(r, B) = P{Xs−1 Xt ∈ r−1B},   r ∈ G, B ∈ G, s ≤ t in T.   (6)

Proof, (i) ⇒ (ii): Let X be F-Markov with invariant transition kernels μs,t, and write μ̄s,t = μs,t(ι, ·), so that

μs,t(r, B) = μ̄s,t(r−1B),   r ∈ G, B ∈ G, s ≤ t in T.

Then Theorem 8.5 yields for any s ≤ t in T and B ∈ G

P(Xs−1 Xt ∈ B | Fs) = P(Xt ∈ XsB | Fs) = μs,t(Xs, XsB) = μ̄s,t B.   (7)

Thus, Xs−1 Xt is independent of Fs with distribution μ̄s,t, and (6) follows by means of (7).

(ii) ⇒ (i): Let Xs−1 Xt be independent of Fs with distribution μ̄s,t. Defining the associated kernels μs,t by (7), we get by Theorem 8.5 for any s, t, and B as before

P(Xt ∈ B | Fs) = P(Xs−1 Xt ∈ Xs−1 B | Fs) = μ̄s,t(Xs−1 B) = μs,t(Xs, B),

and so X is Markov with the invariant transition kernels in (7).   2
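For a random walk on the integers (an Abelian group written additively), the invariance in (6) just says that P(Xt = y | Xs = x) depends only on y − x. The following sketch, with an arbitrary step distribution, is a purely numerical illustration of that point.

```python
import numpy as np

rng = np.random.default_rng(1)
steps, probs = np.array([-1, 0, 2]), np.array([0.3, 0.5, 0.2])

n, s, t = 400_000, 3, 5
xi = rng.choice(steps, size=(n, t), p=probs)   # i.i.d. increments
X = np.cumsum(xi, axis=1)                      # X_1, ..., X_t with X_0 = 0
Xs, Xt = X[:, s - 1], X[:, t - 1]

# The estimated rows below agree (up to Monte Carlo error) for every x,
# reflecting mu_{s,t}(x, {x + b}) = P{X_t - X_s = b}.
for x in (-1, 0, 2):
    sel = Xs == x
    print(x, [round(float(np.mean(Xt[sel] == x + b)), 3) for b in (-2, 0, 2, 4)])
```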


We now specialize to the time-homogeneous case, where T = R+ or Z+, and the transition kernels are of the form μs,t = μt−s, so that

P(Xt ∈ B | Fs) = μt−s(Xs, B) a.s.,   B ∈ S, s ≤ t in T.

Introducing the initial distribution ν = L(X0), we get by Proposition 11.2

L(Xt0, . . . , Xtn) = νμt0 ⊗ μt1−t0 ⊗ · · · ⊗ μtn−tn−1,
L(Xt1, . . . , Xtn | Ft0) = (μt1−t0 ⊗ · · · ⊗ μtn−tn−1)(Xt0, ·).

The Chapman–Kolmogorov relation now reduces to the semi-group property

μs+t = μs μt,   s, t ∈ T,

which is again assumed to hold identically. The following result suggests that we think of a discrete-time Markov process as a stochastic dynamical system. Let ϑ be a generic U(0, 1) random variable.

Proposition 11.6 (random dynamical system) Let X be a process on Z+ with values in a Borel space S. Then these conditions are equivalent:
(i) X is Markov with transition kernels μn on S,
(ii) there exist some measurable functions f1, f2, . . . : S × [0, 1] → S and i.i.d. U(0, 1) random variables ϑ1, ϑ2, . . . ⊥⊥ X0, such that

Xn = fn(Xn−1, ϑn) a.s.,   n ∈ N.

In that case μn(s, ·) = L{fn(s, ϑ)} a.s. L(Xn−1), and we may choose fn ≡ f iff X is time-homogeneous.

Proof, (ii) ⇒ (i): Assume (ii), and define μn(s, ·) = L{fn(s, ϑ)}. Writing F for the filtration induced by X, we get by Theorem 8.5 for any B ∈ S

P(Xn ∈ B | Fn−1) = P(fn(Xn−1, ϑn) ∈ B | Fn−1)
  = λ{t; fn(Xn−1, t) ∈ B} = μn(Xn−1, B),

proving (i).

(i) ⇒ (ii): Assume (i). By Lemma 4.22 we may choose the associated functions fn as above. Let ϑ̃1, ϑ̃2, . . . be i.i.d. U(0, 1) and independent of X̃0 = X0, and define recursively X̃n = fn(X̃n−1, ϑ̃n) for n ∈ N. Then X̃ is Markov as before with transition kernels μn. Hence, X̃ =d X by Proposition 11.2, and so Theorem 8.17 yields some random variables ϑn with {X, (ϑn)} =d {X̃, (ϑ̃n)}. Now (ii) follows, since the diagonal in S 2 is measurable. The last assertion is clear by construction.   2
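For a finite state space, functions as in (ii) can be obtained by inverse transform of the transition rows. The sketch below is only an illustration under that particular choice, with an arbitrary 3 × 3 matrix; it simulates Xn = f(Xn−1, ϑn) and recovers the transition probabilities empirically.

```python
import numpy as np

P = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3],
              [0.3, 0.3, 0.4]])               # time-homogeneous transition kernel
cum = np.cumsum(P, axis=1)

def f(x, u):
    # Inverse transform of row x: P{f(x, U) = j} = P[x, j] for U uniform on (0, 1).
    return int(np.searchsorted(cum[x], u, side="right"))

rng = np.random.default_rng(2)
n = 200_000
X = np.empty(n + 1, dtype=int)
X[0] = 0
for k in range(1, n + 1):
    X[k] = f(X[k - 1], rng.random())          # X_n = f(X_{n-1}, theta_n)

counts = np.zeros((3, 3))
np.add.at(counts, (X[:-1], X[1:]), 1.0)
print(counts / counts.sum(axis=1, keepdims=True))   # approximately P
```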


Now fix a transition semi-group (μt) on a Borel space S. For any probability measure ν on S, Theorem 11.4 yields an associated Markov process Xν, and Proposition 4.2 shows that the associated distribution Pν is uniquely determined by ν. Note that Pν is a probability measure on the path space (S T, S T). For ν = δx, we write Px instead of Pδx. Integration with respect to Pν or Px is denoted by Eν or Ex, respectively.

Lemma 11.7 (mixtures) The measures Px form a probability kernel from S to S T, and for any initial distribution ν,

Pν A = ∫S Px(A) ν(dx),   A ∈ S T.

Proof: The measurability of PxA and the stated formula are obvious for cylinder sets of the form A = (πt1, . . . , πtn)−1B. The general case follows easily by a monotone-class argument.   2

Rather than considering one Markov process Xν for every initial distribution ν, we may introduce a single canonical process X, defined as the identity mapping on the path space (S T, S T), equipped with the different probability measures Pν. Then Xt agrees with the evaluation map πt: x → xt on S T, which is measurable by the definition of S T. For our present purposes, it suffices to endow the path space S T with the canonical filtration F induced by X. On S T we further introduce the shift operators θt: S T → S T, t ∈ T, given by

(θtx)s = xs+t,   s, t ∈ T, x ∈ S T,

which are clearly measurable with respect to S T. In the canonical case we note that θtX = θt = X ◦ θt. Optional times with respect to a Markov process are often constructed recursively in terms of shifts on the underlying path space. Thus, for any pair of optional times σ and τ on the canonical space, we may consider the random time γ = σ + τ ◦ θσ, which is understood to be infinite when σ = ∞. Under weak restrictions on space and filtration, we show that γ is again optional. Here CS and DS denote the spaces of continuous or rcll functions, respectively, from R+ to S.

Proposition 11.8 (iterated optional times) For a metric space S, let σ, τ be optional times on the canonical space S ∞, CS, or DS, endowed with the right-continuous, induced filtration. Then we may form an optional time γ = σ + τ ◦ θσ.

Proof: Since σ ∧ n + τ ◦ θσ∧n ↑ γ, Lemma 9.3 justifies taking σ to be bounded. Let X be the canonical process with induced filtration F. Since X is F+-progressive, Xσ+s = Xs ◦ θσ is F+σ+s-measurable for every s ≥ 0 by


Lemma 9.5. Hence, for fixed t ≥ 0, all sets A = {Xs ∈ B} with s ≤ t and B ∈ S satisfy θσ−1A ∈ F+σ+t. Since the latter sets form a σ-field, we get

t ≥ 0.

(8)

Now fix any t ≥ 0, and note that {γ < t} =



∪ σ < r, r ∈ Q ∩ (0, t)



τ ◦ θσ < t − r .

(9)

+ For every r ∈ (0, t) we have {τ < t−r} ∈ Ft−r , and so θσ−1 {τ < t −r} ∈ Fσ+t−r by (8), and Lemma 9.2 gives













σ < r, τ ◦ θσ < t − r = σ + t − r < t ∩ θσ−1 τ < t − r ∈ Ft .

Thus, (9) yields {γ < t} ∈ Ft , and so γ is F + -optional by Lemma 9.2.

2

We may now extend the Markov property to suitable optional times. The present statement is only preliminary, and stronger versions will be established, under appropriate conditions, in Theorems 13.1, 14.11, and 17.17. Proposition 11.9 (strong Markov property) Let X be a time-homogeneous Markov process on T = R+ or Z+ , and let τ be an optional time taking countably many values. Then in the general and canonical cases, we have respectively

 



(i) P θτ X ∈ A  Fτ = PXτ A a.s. on {τ < ∞}, (ii) Eν (ξ ◦ θτ | Fτ ) = EXτ ξ, Pν -a.s. on {τ < ∞}, for any set A ∈ S T , distribution ν on S, and bounded or non-negative random variable ξ. Since {τ < ∞} ∈ Fτ , formulas (i)−(ii) make sense by Lemma 8.3, even though θτ X and PXτ are only defined for τ < ∞. Proof: By Lemmas 8.3 and 9.1, we may take τ = t to be finite and nonrandom. For sets of the form A = (πt1 , . . . , πtn )−1 B, Proposition 11.2 yields

 





t1 ≤ · · · ≤ tn , B ∈ S n , n ∈ N,  

P θt X ∈ A  Ft = P (Xt+t1 , . . . , Xt+tn ) ∈ B  Ft 

(10)





= μt1 ⊗ μt2 −t1 ⊗ · · · ⊗ μtn −tn−1 (Xt , B) = PXt A, which extends to arbitrary A ∈ S T by a monotone-class argument. For canonical X, (i) is equivalent to (ii) with ξ = 1A , since in that case ξ ◦ θτ = 1{θτ X ∈ A}. The result extends to general ξ by linearity and monotone convergence. 2 When X is both space- and time-homogeneous, we can state the strong Markov property without reference to the family (Px ). For notational clarity, we consider only processes in Abelian groups, the general case being similar.


Theorem 11.10 (space and time homogeneity) Let X be a space- and timehomogeneous Markov process in a measurable Abelian group S. Then (i) Px A = P0 (A − x), x ∈ S, A ∈ S T , (ii) the strong Markov property holds at an optional time τ < ∞, iff Xτ is a.s. Fτ -measurable and

⊥ Fτ . X − X 0 = θτ X − X τ ⊥ d

Proof: (i) By Proposition 11.2, we get for sets A as in (10) Px A = P ◦ (πt1 , . . . , πtn )−1 B x  = μt1 ⊗ μt2 −t1 ⊗ · · · ⊗ μtn −tn−1 (x, B) 



= μt1 ⊗ μt2 −t1 ⊗ · · · ⊗ μtn −tn−1 (0, B − x) = P0 ◦ (πt1 , . . . , πtn )−1 (B − x) = P0 (A − x), which extends to general A by a monotone-class argument. (ii) Assume the strong Markov property at τ . Letting A = π0−1 B with B ∈ S, we get 1B (Xτ ) = PXτ {π0 ∈ B} 



= P Xτ ∈ B  Fτ



a.s.,

and so Xτ is a.s. Fτ -measurable. By (i) and Theorem 8.5,  





P θτ X − Xτ ∈ A  Fτ = PXτ (A + Xτ ) = P0 A,

A ∈ ST ,

(11)

which shows that θτ X − Xτ ⊥ ⊥ Fτ with distribution P0 . Taking τ = 0, we get in particular L(X − X0 ) = P0 , and the asserted properties follow. Now assume the stated conditions. To deduce the strong Markov property at τ , let A ∈ S T be arbitrary, and conclude from (i) and Theorem 8.5 that 



P θτ X ∈ A  Fτ







= P θτ X − Xτ ∈ A − Xτ  Fτ = P0 (A − Xτ ) = PXτ A.



2

If a time-homogeneous Markov process X has initial distribution ν, the distribution at time t ∈ T equals νt = νμt , or

νt B =

ν(dx) μt (x, B),

B ∈ S, t ∈ T.

A distribution ν is said to be invariant for the semi-group (μt ) if νt is independent of t, so that νμt = ν for all t ∈ T . We also say that a process X on T is d stationary, if θt X = X for all t ∈ T . The two notions2 are related as follows. 2

Stationarity and invariance are often confused. Here the distinction is of course crucial.

242

Foundations of Modern Probability

Lemma 11.11 (stationarity and invariance) Let X be a time-homogeneous Markov process on T with transition kernels μt and initial distribution ν. Then ⇔

X is stationary

ν is invariant for (μt ).

Proof: If ν is invariant, then Proposition 11.2 yields d

t, t1 ≤ · · · ≤ tn in T,

(Xt+t1 , . . . , Xt+tn ) = (Xt1 , . . . , Xtn ),

and so X is stationary by Proposition 4.2. The converse is obvious.

2

The strong Markov property has many applications. We begin with a classical maximum inequality for random walks, which will be useful in Chapter 23. Proposition 11.12 (maximum inequality, Ottaviani) Let ξ1 , ξ2 , . . . be i.i.d. random variables with E ξi = 0 and E ξi2 = 1, and put Xn = Σ i≤n ξi . Then √



P |X | ≥ r n √ n , r > 1, n ∈ N. P Xn∗ ≥ 2 r n ≤ −2 1−r √ Proof: Put c = r n, and define τ = inf{k ∈ N; |Xk | ≥ 2c}. By the strong Markov property at τ and Theorem 8.5,

P {|Xn | ≥ c} ≥ P |Xn | ≥ c, Xn∗ ≥ 2c







≥ P τ ≤ n, |Xn − Xτ | ≤ c







≥ P Xn∗ ≥ 2 c min P |Xk | ≤ c , k≤n

and Chebyshev’s inequality yields





min P |Xk | ≤ c ≥ min 1 − k c−2 k≤n

k≤n

≥ 1 − n c−2 = 1 − r−2 .





2

Similar ideas can be used to prove the following more sophisticated result, which will find important applications in Chapters 16 and 27. Corollary 11.13 (uniform convergence) For r ∈ Q+ , let X r be rcll processes P in Rd , such that the increments in r are independent and satisfy (X r −X s )∗t → 0 as r, s → ∞ for fixed t > 0. Then there exists an rcll process X in Rd , such that (X r − X)∗t → 0 a.s., t > 0. Proof: Fixing a t > 0, we can choose a sequence rn → ∞ in Q+ , such that (X rm − X rn )∗t → 0 a.s.,

m, n → ∞.

By completeness there exists a process X in Rd with (X rn − X)∗t → 0 a.s., and we note that X is again a.s. rcll. Now define Y r = X r − X, and note that

11. Markov Properties and Discrete-Time Chains

243

(Y r )∗t → 0. Fixing any ε, r > 0 and a finite set A ⊂ [r, ∞) ∩ Q, and putting σ = inf{s ∈ A; (Y s )∗t > 2ε}, we get as in Proposition 11.12 P



P {(Y r )∗t > ε} ≥ P (Y r )∗t > ε, max(Y s )∗t > 2ε

s∈A

≥ P σ < ∞, (Y r − Y s )∗t ≤ ε











≥ P max(Y s )∗t > 2ε min P (Y r − Y s )∗t ≤ ε , s∈A

s∈A

which extends immediately to A = [r, ∞) ∩ Q. Solving for the first factor on the right gives

P {(Y r )∗t > ε}

, P sup(Y s )∗t > 2ε ≤ s≥r 1 − sups≥r P (Y r − Y s )∗t > ε and so sups≥r (Y s )∗t → 0 which implies (X r − X)∗t = (Y r )∗t → 0 a.s. The assertion now follows since t > 0 was arbitrary. 2 P

We proceed to clarify the nature of the strong Markov property of a process X at a finite optional time τ . The condition is clearly a combination of the properties L( θτ X | Xτ ) = PXτ a.s. Fτ ⊥ ⊥ θτ X, Xτ

Though the latter relation—here referred to as strong homogeneity—appears to be weaker than the strong Markov property, the two conditions are in fact essentially equivalent. Here we fix any right-continuous filtration F on R+ . Theorem 11.14 (strong homogeneity) Let X be an F-adapted, rcll process in a separable metric space (S, ρ), and consider a probability kernel (Px ) : S → DS . Then these conditions are equivalent: (i) X is strongly homogeneous at every bounded, optional time τ , (ii) X satisfies the strong Markov property at every optional time τ < ∞. Our proof is based on a 0−1 law for absorption, involving the sets



I = x ∈ D; xt ≡ x0 ,





A = s ∈ S; Ps I = 1 .

(12)

Lemma 11.15 (absorption) For any X as in Theorem 11.14 (i) and optional times τ < ∞, we have PXτ I = 1I (θτ X) = 1A (Xτ ) a.s. Proof: We may clearly take τ to be bounded, say by n ∈ N. Fix any h > 0, and partition S into disjoint Borel sets B1 , B2 , . . . of diameter < h. For each k ∈ N, define

τk = n ∧ inf t > τ ; ρ(Xτ , Xt ) > h



on {Xτ ∈ Bk },

(13)

and put τk = τ otherwise. The times τk are again bounded and optional, and clearly

(14) {Xτk ∈ Bk } ⊂ Xτ ∈ Bk , sup ρ(Xτ , Xt ) ≤ h . t∈[τ,n]


Using property (i) and (14), we get as n → ∞ and h → 0 

E P Xτ I c ; θ τ X ∈ I



= ≤ = ≤





Σ k E PX I c; θτ X ∈ I, X τ ∈ Bk Σ k E PX I c; Xτ ∈ Bk

Σ k P θτ X ∈/ I, Xτ ∈ Bk

sup ρ(Xτ , Xt ) ≤ h Σ k P θτ X ∈/ I, Xτ ∈ Bk , t∈[τ,n] τ

τk

k

k

k





→ P θτ X ∈ / I, sup ρ(Xτ , Xt ) = 0 = 0, t≥τ

and so PXτ I = 1 a.s. on {θτ X ∈ I}. Since also EPXτ I = P {θτ X ∈ I} by (i), we obtain the first of the displayed equalities. The second one follows by the definition of A. 2 Proof of Theorem 11.14: Assume (i), and define I and A as in (12). To prove (ii) on {Xτ ∈ A}, fix any times t1 < · · · < tn and Borel sets B1 , . . . , Bn ,  write B = k Bk , and conclude from (i) and Lemma 11.15 that P



∩ k {Xτ +t

k

 

∈ Bk }  Fτ



= = = = =

P {Xτ ∈ B | Fτ } 1{Xτ ∈ B} P {Xτ ∈ B | Xτ } PXτ {x0 ∈ B} PXτ ∩ k {xtk ∈ Bk },

which extends to (ii) by a monotone-class argument. To prove (ii) on {Xτ ∈ / A}, we may take τ ≤ n a.s., and divide Ac into / A}. disjoint Borel sets Bk of diameter < h. Fix any F ∈ Fτ with F ⊂ {Xτ ∈ For each k ∈ N, define τk as in (13) on the set F c ∩ {Xτ ∈ Bk }, and let τk = τ otherwise. Note that (14) remains true on F c . Using (i), (14), and Lemma 11.15, we get as n → ∞ and h → 0      L(θτ X ; F ) − E(PXτ ; F )       =  k E 1{θτ X ∈ ·} − PXτ ; Xτ ∈ Bk , F       =  k E 1{θτk X ∈ ·} − PXτk ; Xτk ∈ Bk , F     =  k E 1{θτk X ∈ ·} − PXτk ; Xτk ∈ Bk , F c 

≤ ≤

Σ Σ Σ Σ k P Xτ ∈ B k ; F c

sup ρ(Xτ , Xt ) ≤ h Σ k P Xτ ∈ Bk , t∈[τ,n] k





→ P Xτ ∈ / A, sup ρ(Xτ , Xt ) = 0 = 0, t≥τ

which shows that the left-hand side is zero.

2

We conclude with some asymptotic and invariance properties for discretetime Markov chains. First we consider the successive visits to a fixed state


y ∈ S. Assuming the process to be canonical, we introduce the hitting time τy = inf{n ∈ N; Xn = y}, and define recursively τyk+1 = τyk + τy ◦ θτyk ,

k ∈ Z+ ,

starting with τy0 = 0. We further introduce the occupation times



κy = sup k; τyk < ∞ =

Σ 1{Xn = y},

n≥1

y ∈ S,

and show how the distribution of κy can be expressed in terms of the hitting probabilities rxy = Px {τy < ∞} = Px {κy > 0}, x, y ∈ S. Lemma 11.16 (occupation times) For any x, y ∈ S and k ∈ N, (i) (ii)

k−1 Px {κy ≥ k} = Px {τyk < ∞} = rxy ryy , rxy . E x κy = 1 − ryy

Proof: (i) By the strong Markov property, we have for any k ∈ N





Px τyk+1 < ∞ = Px τyk < ∞, τy ◦ θτyk < ∞



= Px {τyk < ∞} Py {τy < ∞} = ryy Px {τyk < ∞}, and the second relation follows by induction on k. The first relation holds since κy ≥ k iff τyk < ∞. (ii) By (i) and Lemma 4.4, we have

Σ Px{κy ≥ k} rxy k−1 = . = Σ rxy ryy 1 − ryy k≥1

Ex κy =

k≥1

2

In particular, we get for x = y k , Px {κx ≥ k} = Px {τxk < ∞} = rxx

k ∈ N.

Thus, under Px , the number of visits to x is either a.s. infinite or geometrically distributed with mean Ex κx + 1 = (1 − rxx )−1 < ∞. This leads to a corresponding division of S into recurrent and transient states. Recurrence can often be deduced from the existence of an invariant distribution. Here and below we write pnxy = μn (x, {y}). Lemma 11.17 (invariant distributions and recurrence) Let ν be an invariant distribution on S. Then for any x ∈ S, ν{x} > 0
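The geometric law of the occupation time is easy to observe by simulation when rxx < 1; the sketch below uses a hypothetical two-state chain in which state 0 leads to an absorbing state, so that r00 = 1/2, and is only a numerical illustration of Lemma 11.16.

```python
import numpy as np

rng = np.random.default_rng(3)
n_paths, horizon = 100_000, 200
r = 0.5                          # r_00: from 0, return to 0 w.p. 1/2, else absorbed

kappa = np.zeros(n_paths, dtype=int)
for i in range(n_paths):
    visits = 0
    for _ in range(horizon):
        if rng.random() < r:     # the chain returns to 0
            visits += 1
        else:                    # the chain jumps to the absorbing state
            break
    kappa[i] = visits

for k in range(1, 5):
    print(k, float(np.mean(kappa >= k)), r**k)      # Lemma 11.16 (i): P_0{kappa_0 >= k} = r^k
print("mean", float(kappa.mean()), r / (1 - r))      # Lemma 11.16 (ii): E_0 kappa_0 = r/(1-r)
```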



x is recurrent.


Proof: By the invariance of ν,



ν(dy) pnyx ,

0 < ν{x} =

n ∈ N.

(15)

Thus, by Lemma 11.16 and Fubini’s theorem, ∞=

Σ



ν(dy) pnyx

n≥1



=

=

ν(dy) Σ pnyx n≥1

ν(dy)

ryx 1 ≤ . 1 − rxx 1 − rxx 2

Hence, rxx = 1, and so x is recurrent.

The period dx of a state x is defined as the greatest common divisor of the set {n ∈ N; pnxx > 0}, and we say that x is aperiodic if dx = 1. Lemma 11.18 (periodicity) If a state x ∈ S has period d < ∞, we have pnd xx > 0 for all but finitely many n. Proof: Put C = {n ∈ N; pnd xx > 0}, and conclude from the Chapman– Kolmogorov relation that C is closed under addition. Since C has greatest common divisor 1, the generated additive group equals Z. In particular, there exist some n1 , . . . , nk ∈ C and z1 , . . . , zk ∈ Z with Σj zj nj = 1. Writing m = n1 Σj |zj | nj , we note that any number n ≥ m can be represented, for suitable h ∈ Z+ and r ∈ {0, . . . , n1 − 1}, as n = m + h n1 + r  = h n1 + Σ n1 |zj | + rzj nj ∈ C.

2

j ≤k

For every x ∈ S, we define the excursions of X from x by Yn = X τx ◦ θτxn ,

n ∈ Z+ ,

as long as τxn < ∞. To allow infinite excursions, we may introduce an ex¯ ≡ (Δ, Δ, . . .) whenever τxn = ∞. / S, and define Yn = Δ traneous element Δ ∈ Conversely, X may be recovered from the Yn through the formulas τn =

Σ inf k 0; Yk (t) = x ,

Xt = Yn (t − τn ),

t ∈ [τn , τn+1 ),

(16) n ∈ Z+ .

(17)

The distribution νx = Px ◦ Y0−1 is called the excursion law at x. When x is recurrent and ryx = 1, Proposition 11.9 shows that Y1 , Y2 , . . . are i.i.d. νx under Py . The result extends to the general case, as follows. Lemma 11.19 (excursions) Let X be a discrete-time Markov process in a Borel space S, and fix any x ∈ S. Then there exist some independent processes Y0 , Y1 , . . . in S, all but Y0 with distribution νx , such that X is a.s. given by (16) and (17).


d Proof: Put Y˜0 = Y0 , and let Y˜1 , Y˜2 , . . . be independent of Y˜0 and i.i.d. νx . Construct associated random times τ˜0 , τ˜1 , . . . as in (16), and define a process d ˜ ˜ as in (17). By Corollary 8.18, it is enough to show that X = X. Writing X









κ = sup n ≥ 0; τn < ∞ , κ ˜ = sup n ≥ 0; τ˜n < ∞ , it is equivalent to show that 



d ¯, Δ ¯ , . . .) = ¯, Δ ¯, . . . . (Y0 , . . . , Yκ , Δ Y˜0 , . . . , Y˜κ˜ , Δ

(18)

Using the strong Markov property on the left and the independence of the Y˜n on the right, it is easy to check that both sides are Markov processes in ¯ } with the same initial distribution and transition kernel. Hence, (18) S Z+ ∪ {Δ holds by Proposition 11.2. 2 We now consider discrete-time Markov chains X on the time scale Z+ , taking values in a countable state space S. Here the Chapman–Kolmogorov relation becomes n = Σ j pm pm+n ij pjk , ik

i, k ∈ S, m, n ∈ N,

(19)

where pnij = μn (i, {j}), i, j ∈ S. This may be written in matrix form as pm+n = pm pn , which shows that the matrix pn of n -step transition probabilities is simply the n-th power of the matrix p = p1 , justifying our notation. Regarding the initial distribution ν as a row vector (νi ), we may write the distribution at time n as νpn . Now define rij = Pi {τj < ∞} as before, where τj = inf{n > 0; Xn = j}. A Markov chain in S is said to be irreducible if rij > 0 for all i, j ∈ S, so that every state can be reached from any other state. For irreducible chains, all states have the same recurrence and periodicity properties: Proposition 11.20 (irreducible chains) For an irreducible Markov chain, (i) either all states are recurrent or all are transient, (ii) all states have the same period, (iii) if ν is invariant, then νi > 0 for all i. For the proof of (i) we need the following lemma. Lemma 11.21 (recurrence classes) For a recurrent i ∈ S, define Si = {j ∈ S; rij > 0}. Then (i) rjk = 1 for all j, k ∈ Si , (ii) all states in Si are recurrent.
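In other words, n-step probabilities and marginal distributions can be computed as matrix powers and row-vector products; a minimal numpy sketch with arbitrary numbers:

```python
import numpy as np

p = np.array([[0.6, 0.4, 0.0],
              [0.2, 0.5, 0.3],
              [0.1, 0.0, 0.9]])                    # one-step matrix p = p^1
nu = np.array([1.0, 0.0, 0.0])                     # initial distribution as a row vector

p5 = np.linalg.matrix_power(p, 5)                  # 5-step transition probabilities
p2, p3 = np.linalg.matrix_power(p, 2), np.linalg.matrix_power(p, 3)
assert np.allclose(p2 @ p3, p5)                    # Chapman-Kolmogorov (19) in matrix form
print(nu @ p5)                                     # distribution at time 5
```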


Proof: By the recurrence of i and the strong Markov property, we get for any j ∈ Si

0 = Pi τj < ∞, τi ◦ θτj = ∞ = Pi {τj < ∞} Pj {τi = ∞} = rij (1 − rji ). Since rij > 0 by hypothesis, we obtain rji = 1. Fixing any m, n ∈ N with n pm ij , pji > 0, we get by (19) Ej κj ≥

Σ pm+n+s jj

s>0

≥ =

Σ

pnji psii pm ij s>0 n m pji pij Ei κi =

∞,

and so j is recurrent by Lemma 11.16. Reversing the roles of i and j gives rij = 1. Finally, we get for any j, k ∈ Si

rjk ≥ Pj τi < ∞, τk ◦ θτi < ∞ = rji rik = 1.



2

Proof of Proposition 11.20: (i) Use Lemma 11.21. n (ii) Fix any i, j ∈ S, and choose m, n ∈ N with pm ij , pji > 0. By (19),

≥ pnji phii pm pm+h+n jj ij ,

h ≥ 0.

> 0, and so3 dj |(m + n). Hence, in general phii > 0 For h = 0 we get pm+n jj implies dj |h, and we get dj ≤ di . Reversing the roles of i and j yields the opposite inequality. (iii) Fix any i ∈ S. Choosing j ∈ S with νj > 0, and then n ∈ N with 2 pnji > 0, we see from (15) that even νi > 0. We turn to the basic limit theorem for irreducible Markov chains. Related u results appear in Chapters 13, 17, and 33. Write → for convergence in the total ˆ S be the class of probability measures on S. variation norm · , and let M Theorem 11.22 (convergence dichotomy, Markov, Kolmogorov, Orey) For an irreducible, aperiodic Markov chain in S, one of these cases occurs: (i) There exists a unique invariant distribution ν, the latter satisfies νi > 0 for all i ∈ S, and as n → ∞, u ˆ S. (20) Pμ ◦ θ−1 → Pν , μ ∈ M n

(ii) No invariant distribution exists, and as n → ∞, pnij → 0, 3

i, j ∈ S.

Here m|n (pronounced m divides n) means that n is divisible by m.

(21)


Markov chains satisfying (i) are clearly recurrent, whereas those in (ii) may be either recurrent or transient. This suggests a further division of the irreducible, aperiodic, and recurrent Markov chains into positive-recurrent and null-recurrent ones, depending on whether (i) or (ii) occurs. We shall prove Theorem 11.22 by a simple coupling 4 argument. For our present purposes, an elementary coupling by independence is sufficient. Lemma 11.23 (coupling) Let X, Y be independent Markov chains in S, T with transition matrices (pii ) and (qjj  ), respectively. Then (i) (X, Y ) is a Markov chain in S ×T with transition matrix rij, ij  = pii qjj  , (ii) if X and Y are irreducible, aperiodic, then so is (X, Y ), (iii) for X, Y as in (ii) with invariant distributions μ, ν, the chain (X, Y ) is recurrent. Proof: (i) Using Proposition 11.2, we may compute the finite-dimensional distributions of (X, Y ) for an arbitrary initial distribution μ ⊗ ν on S × T . (ii) For X, Y as stated, fix any i, i ∈ S and j, j  ∈ T . Then Proposition n n n 11.18 yields rij, ij  = pii qjj  > 0 for all but finitely many n ∈ N, and so (X, Y ) has again the stated properties. (iii) Since the product measure μ ⊗ ν is clearly invariant for (X, Y ), the assertion follows by Proposition 11.17. 2 The point of the construction is that, if the coupled processes eventually meet, their distributions are asymptotically the same. Lemma 11.24 (strong ergodicity) For an irreducible, recurrent Markov chain in S 2 with transition matrix pii pjj  , we have as n → ∞     Pμ ◦ θn−1 − Pν ◦ θn−1  → 0,

ˆ S. μ, ν ∈ M

(22)

Proof (Doeblin): Let X, Y be independent with distributions Pμ , Pν . By Lemma 11.23 the pair (X, Y ) is again Markov with respect to the induced filtration F, and by Proposition 11.9 it satisfies the strong Markov property at every finite optional time τ . Choosing τ = inf{n ≥ 0; Xn = Yn }, we get for any measurable set A ⊂ S ∞

 

P θ τ X ∈ A  Fτ



= PXτ A = PYτ A





= P θτ Y ∈ A  Fτ .

4 The idea is to study the limiting behavior of a process X by approximating with a ˜ whose asymptotic behavior is known, thus replacing a technical analysis of distriprocess X butional properties by a simple pathwise comparison. The method is frequently employed in subsequent chapters.


d

˜ n = Xn for n ≤ τ and In particular, (τ, X τ , θτ X) = (τ, X τ , θτ Y ). Putting X d ˜ = X, and so for any A as above, ˜ n = Yn otherwise, we obtain X X     P {θn X ∈ A} − P {θn Y ∈ A}    ˜ ∈ A} − P {θn Y ∈ A} = P {θn X 



 ˜ ∈ A, τ > n − P θn Y ∈ A, τ > n  = P θn X

2

≤ P {τ > n} → 0.

The next result ensures the existence of an invariant distribution. Here a coupling argument is again useful. Lemma 11.25 (existence) If (21) fails, then an invariant distribution exists. Proof: Suppose that (21) fails, so that lim supn pni0 ,j0 > 0 for some i0 , j0 ∈ S. By a diagonal argument, we have pni0 ,j → cj along a sub-sequence N  ⊂ N for some constants cj with cj0 > 0, where 0 < Σj cj ≤ 1 by Fatou’s lemma. To extend the convergence to arbitrary i, let X, Y be independent processes with the same transition matrix (pij ), and conclude from Lemma 11.23 that (X, Y ) is an irreducible Markov chain on S 2 with transition probabilities qij,i j  = pii pjj  . If (X, Y ) is transient, then Proposition 11.16 yields n < ∞, Σ n (pnij )2 = Σ n qii,jj

i, j ∈ S,

and (21) follows. The pair (X, Y ) is then recurrent, and Lemma 11.24 gives pnij − pni0 ,j → 0 for all i, j ∈ I. Hence, pnij → cj along N  for all i, j. Next conclude from the Chapman–Kolmogorov relation that pn+1 = ik =

Σ j pnij pjk Σ j pij pnjk ,

i, k ∈ S.



Letting n → ∞ along N , and using Fatou’s lemma on the left and dominated convergence on the right, we get

Σ j cj pjk ≤ Σ j pij ck = ck , k ∈ S. gives Σj cj ≤ 1 on both sides, and

(23)

so (23) holds with Summing over k equality. Thus, (ci ) is invariant, and we may form the invariant distribution 2 νi = c i / Σ j c j . Proof of Theorem 11.22: If no invariant distribution exists, then (21) holds by Lemma 11.25. Now let ν be an invariant distribution, and note that νi > 0 for all i by Proposition 11.20. By Lemma 11.23 the coupled chain in Lemma 11.24 is irreducible and recurrent, and so (22) holds for any initial distribution μ, and (20) follows since Pν ◦ θn−1 = Pν by Lemma 11.11. If even ν  is invariant, 2 then (20) yields Pν  = Pν , and so ν  = ν. The limits in Theorem 11.22 may be expressed as follows in terms of the mean recurrence times Ej τj .


Theorem 11.26 (mean recurrence times, Kolmogorov) For a Markov chain in S and states i, j ∈ S with j aperiodic, we have as n → ∞ Pi {τj < ∞} . (24) p nij → Ej τj Proof: First let i = j. If j is transient, then pnjj → 0 and Ej τj = ∞, and so (24) is trivially true. If instead j is recurrent, then the restriction of X to the set Sj = {i; rji > 0} is irreducible and recurrent by Lemma 11.21, and aperiodic by Proposition 11.20. Hence, pnjj converges by Theorem 11.22. To identify the limit, define

Ln = sup k ∈ Z+ ; τjk ≤ n n

Σ 1{Xk = j},

=



n ∈ N.

k=1

The τjn form a random walk under Pj , and so the law of large numbers yields L(τjn ) n 1 = n → a.s. Pj . τjn τj Ej τj By the monotonicity of Lk and τjn , we obtain Ln /n → (Ej τj )−1 a.s. Pj . Since Ln ≤ n, we get by dominated convergence 1 n

n

Σ pkjj = k=1

1 Ej Ln → , n Ej τj

and (24) follows. Now let i = j. Using the strong Markov property, the disintegration theorem, and dominated convergence, we get pnij = Pi {Xn = j}

= Pi τj ≤ n, (θτj X)n−τj = j 

n−τj

= Ei pjj →

; τj ≤ n



Pi {τj < ∞} . Ej τj

2

Exercises 1. Let X be a process with Xs ⊥⊥Xt {Xu , u ≥ t} for all s < t. Show that X is Markov with respect to the induced filtration. 2. Show that the Markov property is preserved under time-reversal, but the possible space or time homogeneity is not, in general. (Hint: Use Lemma 11.1, and consider a random walk starting at 0.) 3. Let X be a Markov process in a space S, and fix a measurable function f on S. Show by an example that the process Yt = f (Xt ) need not be Markov. (Hint: Let X be a simple symmetric random walk in Z, and take f (x) = [x/2].) 4. Let X be a Markov process in R with transition functions μt satisfying μt (x, B) = μt (−x, −B). Show that the process Yt = |Xt | is again Markov.


5. Fix any process X on R+ , and define Yt = X t = {Xs∧t ; s ≥ 0}. Show that Y is Markov with respect to the induced filtration. 6. Consider a random element ξ in a Borel space and a filtration F with F∞ ⊂ σ{ξ}. Show that the measure-valued process Xt = L(ξ |Ft ) is Markov. (Hint: Note that ξ⊥⊥Xt Ft for all t.) 7. Let X be a time-homogeneous Markov process in a Borel space S. Prove the existence of some measurable functions fh : S ×[0, 1] → S, h ≥ 0, and U (0, 1) random variables ϑt,h ⊥⊥X t , t, h ≥ 0, such that Xt+h = fh (Xt , ϑt,h ) a.s. for all t, h ≥ 0. 8. Let X be a time-homogeneous, rcll Markov process in a Polish space S. Prove the existence of a measurable function f : S × [0, 1] → D R+ ,S and some U (0, 1) random variables ϑt ⊥⊥X t such that θt X = f (Xt , ϑt ) a.s. Extend the result to optional times taking countably many values. 9. Let X be a process on R+ with state space S, and define Yt = (Xt , t), t ≥ 0. Show that X, Y are simultanously Markov, and that Y is then time-homogeneous. Give a relation between the transition kernels for X, Y . Express the strong Markov property of Y at a random time τ in terms of the process X. 10. Let X be a discrete-time Markov process in S with invariant distribution ν. Show that Pν {Xn ∈ B i.o.} ≥ νB for every measurable set B ⊂ S. Use the result to give an alternative proof of Proposition 11.17. (Hint: Use Fatou’s lemma.) 11. Fix an irreducible Markov chain in S with period d. Show that S has a unique partition into subsets S1 , . . . , Sd , such that pij = 0 unless i ∈ Sk and j ∈ Sk+1 for some k ∈ {1, . . . , d}, with addition defined modulo d. 12. Let X be an irreducible Markov chain with period d, and define S1 , . . . , Sd as above. Show that the restrictions of (Xnd ) to S1 , . . . , Sd are irreducible, aperiodic and either all positive recurrent or all null recurrent. In the former case, show that the original chain has a unique invariant distribution ν. Further show that (20) holds iff μSk = 1/d for all k. (Hint: If (Xnd ) has an invariant distribution ν k on Sk , then νjk+1 = Σi νik pij form an invariant distribution in Sk+1 .) 13. Given a Markov chain X in S, define the classes Ci as in Lemma 11.21. Show that if j ∈ Ci but i ∈ Cj for some i, j ∈ S, then i is transient. If instead i ∈ Cj for every j ∈ Ci , show that Ci is irreducible (so that the restriction of X to Ci is an irreducible Markov chain). Further show that the irreducible sets are disjoint, and that every state outside the irreducible sets is transient. 14. For any Markov chain, show that (20) holds iff

Σj |pnij − νj | → 0 for all i.

15. Let X be an irreducible, aperiodic Markov chain in N. Show that X is transient iff Xn → ∞ a.s. under every initial distribution, and null recurrent iff the same divergence holds in probability but not a.s. 16. For every irreducible, positive recurrent subset Sk ⊂ S, there exists a unique invariant distribution νk restricted to Sk , and every invariant distribution is a convex combination Σk ck νk . 17. Show that a Markov chain in a finite state space S has at least one irreducible set and one invariant distribution. (Hint: Starting from any i0 ∈ S, choose i1 ∈ Ci0 ,  i2 ∈ Ci1 , etc. Then n Cin is irreducible.) 18. Let X⊥⊥Y be Markov processes with transition kernels μs,t and νs,t . Show that (X, Y ) is again Markov with transition kernels μs,t (x, ·) ⊗ νs,t (y, ·). (Hint: Compute


the finite-dimensional distributions from Proposition 11.2, or use Proposition 8.12 with no computations.) 19. Let X⊥⊥Y be irreducible Markov chains with periods d1 , d2 . Show that Z = (X, Y ) is irreducible iff d1 , d2 are relatively prime, and that Z has then period d1 d2 . 20. State and prove a discrete-time version of Theorem 11.14. Further simplify the continuous-time proof when S is countable.

Chapter 12

Random Walks and Renewal Processes Recurrence dichotomy and criteria, transience for d ≥ 3, Wald equations, first maximum and last return, duality, ladder times and heights, fluctuations, Wiener–Hopf factorization, ladder distributions, boundedness and divergence, stationary renewal process, occupation measure, two-sided renewal theorem, asymptotic delay, renewal equation

Before turning to the continuous-time Markov chains, we consider the special case of space- and time-homogeneous processes in discrete time, where the increments are independent, identically distributed (i.i.d.), and thus determined by the single one-step distribution μ. Such processes are known as random walks. The corresponding continuous-time processes are the L´evy processes, to be discussed in Chapter 16. For simplicity, we consider only random walks in Euclidean spaces, though the definition makes sense in an arbitrary measurable group. Thus, a random walk in Rd is defined as a discrete-time random process X = (Xn ) evolving by i.i.d. steps ξn = ΔXn = Xn − Xn−1 . For most purposes we may take X0 = 0, so that Xn = ξ1 +. . .+ξn for all n. Though random walks may be regarded as the simplest of all Markov chains, they exhibit many basic features of the more general discrete-time processes, and hence may serve as a good introduction to the general subject. We shall further see how random walks enter naturally into the description and analysis of certain continuoustime phenomena. Some basic facts about random walks have already been noted in previous chapters. Thus, we proved some simple 0−1 laws in Chapter 4, and in Chapters 5–6 we established the ultimate versions of the law of large numbers and the central limit theorem, both of which deal with the asymptotic behavior of n−c Xn for suitable constants c > 0. More sophisticated limit theorems of this kind will be derived in Chapters 22–23 and 30, often through approximation by a Brownian motion or a more general L´evy process. Random walks in Rd are either recurrent or transient, and our first aim is to derive recurrence criteria in terms of the transition distribution μ. We proceed with some striking connections between maxima and return times, anticipating the arcsine laws in Chapters 14 and 22. This is followed by a detailed study of ladder times and heights for one-dimensional random walks, culminating with the Wiener–Hopf factorization and Baxter’s formula. Finally, we prove a two-sided version of the renewal theorem, describing the asymptotic behavior of the occupation measure and its intensity for a transient random walk. © Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1_13


In addition to the mentioned connections, we note the relevance of renewal theory for the study of continuous-time Markov chains, as considered in Chapter 13. Renewal processes are simple instances of regenerative sets, to be studied in full generality in Chapter 29, in connection with local time and excursion theory. To begin our systematic discussion of random walks, assume as before that Xn = ξ1 + · · · + ξn for all n ∈ Z+ , where the ξn are i.i.d. random vectors in Rd . The distribution of X is then determined by the common distribution μ = L(ξn ) of the increments. By the effective dimension of X we mean the dimension of the linear subspace spanned by the support of μ. For most purposes, we may assume that the effective dimension agrees with the dimension of the underlying space, since we may otherwise restrict our attention to the generated subspace. A random walk X in Rd is said to be recurrent if lim inf n→∞ |Xn | = 0 a.s. and transient if |Xn | → ∞ a.s. In terms of the occupation measure ξ = Σn δXn , transience means that ξ is a.s. locally finite, and recurrence that ξB0ε = ∞ a.s. for every ε > 0, where Bxr = {y; |x − y| < r}. We define the recurrence set A of X as the set of points x ∈ Rd with ξBxε = ∞ a.s. for every ε > 0. We now state the basic dichotomy for random walks in Rd . Here the intensity (measure) Eξ of ξ is defined by (Eξ)f = E(ξf ). Theorem 12.1 (recurrence dichotomy) Let X be a random walk in Rd with occupation measure ξ and recurrence set A. Then exactly one of these cases occurs: (i) X is recurrent with A = supp E ξ, which is then a closed subgroup of Rd , (ii) X is transient and E ξ is locally finite. Proof: The event |Xn | → ∞ has probability 0 or 1 by Kolmogorov’s 0−1 law. When |Xn | → ∞ a.s., the Markov property at time m yields for any m, n ∈ N and r > 0



P |Xm | < r, inf |Xm+k | ≥ r



k≥n

≥ P |Xm | < r, inf |Xm+k − Xm | ≥ 2r k≥n

= P {|Xm | < r} P



inf |Xk | ≥ 2r .

k≥n

Noting that the event on the left can occur for at most n different values of m, we get

P inf |Xk | ≥ 2r Σ P {|Xm | < r} < ∞, n ∈ N. k≥n

m≥1

Since |Xk | → ∞ a.s., the probability on the left is positive for large enough n, and so for r > 0 E ξB0r = E Σ 1{|Xm | < r} m≥1

=

Σ P {|Xm| < r} < ∞,

m≥1


which shows that Eξ is locally finite. Next let |Xn | → ∞ a.s., so that P {|Xn | < r i.o.} > 0 for some r > 0. Covering B0r by finitely many balls G1 , . . . , Gn of radius ε/2, we conclude that P {Xn ∈ Gk i.o.} > 0 for at least one k. By the Hewitt–Savage 0−1 law, the latter probability is in fact 1. Thus, the optional time τ = inf{n; Xn ∈ Gk } is a.s. finite, and so the strong Markov property yields

1 = P Xn ∈ Gk i.o.





≤ P |Xτ +n − Xτ | < ε i.o.

= P |Xn | < ε i.o. = P {ξB0ε = ∞},





which shows that 0 ∈ A, and X is recurrent. The set S = supp Eξ clearly contains A. To prove the converse, fix any x ∈ S and ε > 0. By the strong Markov property at σ = inf{n ≥ 0; |Xn − x| < ε/2} and the recurrence of X, we get

P {ξBxε = ∞} = P |Xn − x| < ε i.o.





≥ P σ < ∞, |Xσ+n − Xσ | < ε/2 i.o.





= P {σ < ∞} P |Xn | < ε/2 i.o. > 0. By the Hewitt–Savage 0−1 law, the probability on the left then equals 1, which means that x ∈ A. Thus, S ⊂ A ⊂ S, and so in this case A = S. The set S is clearly a closed additive semi-group in Rd . To see that in this case it is even a group, it remains to show that x ∈ S implies −x ∈ S. Defining σ as before and using the strong Markov property, along with the fact that x ∈ S ⊂ A, we get

ε = ∞} = P |Xn + x| < ε i.o. P {ξB−x





= P |Xσ+n − Xσ + x| < ε i.o.







≥ P |Xσ − x| < ε, |Xn | < ε/2 i.o. = 1, which shows that −x ∈ A = S.

2
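The dichotomy can be seen in a crude simulation: for simple random walks, the fraction of paths that revisit the origin within a long but finite horizon approaches 1 in dimensions 1 and 2 (very slowly in dimension 2), while in dimension 3 it stabilizes near the Pólya return probability, about 0.34. The sketch below is only a Monte Carlo illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def returned(d, n_steps=20_000):
    # One simple random walk path in Z^d: does it revisit the origin?
    dims = rng.integers(0, d, size=n_steps)
    signs = rng.choice([-1, 1], size=n_steps)
    steps = np.zeros((n_steps, d), dtype=int)
    steps[np.arange(n_steps), dims] = signs
    path = np.cumsum(steps, axis=0)
    return bool(np.any(np.all(path == 0, axis=1)))

for d in (1, 2, 3):
    frac = np.mean([returned(d) for _ in range(300)])
    print(d, frac)
```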

We give some simple sufficient conditions for recurrence. Theorem 12.2 (recurrence for d ≤ 2) A random walk X in Rd is recurrent under each of these conditions: • for d = 1: • for d = 2:

n−1 Xn → 0, Eξ1 = 0 and E|ξ1 |2 < ∞. P

For d = 1 we recognize the weak law of large numbers, which was characterized in Theorem 6.17. In particular, the condition holds when Eξ1 = 0.


By contrast, Eξ1 ∈ (0, ∞] implies Xn → ∞ a.s. by the strong law of large numbers, so in that case X is transient. Our proof of Theorem 12.2 is based on the following scaling relation1 . Lemma 12.3 (scaling) For a random walk X in Rd , rd Σ P {|Xn | ≤ ε}, Σ P {|Xn| ≤ rε} <

n≥0

r ≥ 1, ε > 0.

n≥0

Proof: The ball B0rε may be covered by balls G1 , . . . , Gm of radius ε/2, where rd . Introducing the optional times τk = inf{n; Xn ∈ Gk }, k = 1, . . . , m, m<

we see from the strong Markov property that

Σ n P {|Xn| ≤ rε}

≤ ≤ = <

n ∈ Gk } Σ k,n P {X

Σ k,n P |Xτ +n − Xτ | ≤ ε; τk < ∞ Σ k P {τk < ∞} Σ n P {|Xn| ≤ ε} rd Σ n P {|Xn | ≤ ε}. k

k

2

Proof of Theorem 12.2 (Chung & Ornstein): (d = 1) Fix any ε > 0 and r ≥ 1, and conclude from Lemma 12.3 that

Σ n P {|Xn| ≤ ε}

> r−1 Σ n P {|Xn | ≤ rε}







=



P |X[rt] | ≤ rε dt.

0

Since the integrand on the right tends to 1 as r → ∞, the integral tends to ∞ by Fatou’s lemma, and the recurrence of X follows by Theorem 12.1. (d = 2) We may take X to have effective dimension 2, since the 1-dimensional case is already covered by (i). Then the central limit theorem yields d n−1/2 Xn → ζ for a non-degenerate normal random vector ζ. In particular, c2 for bounded c > 0. Fixing any ε > 0 and r ≥ 1, we conclude P {|ζ| ≤ c} >

from Lemma 12.3 that

Σ n P {|Xn| ≤ ε}

> r−2 Σ n P {|Xn | ≤ rε}





= 0

As r → ∞, we get by Fatou’s lemma

Σ n P {|Xn| ≤ ε}



>

> ε









P |X[r2 t] | ≤ rε dt.



P |ζ| ≤ εt−1/2 dt

0 ∞ 2 −1 1

t dt = ∞,

and the recurrence follows again by Theorem 12.1.

2

An exact recurrence criterion can be given in terms of the characteristic function μ ˆ of μ. Here we write Bε = B0ε for an open ε-ball around the origin. 1

< b means that a ≤ c b for some constant c > 0. Here a


Theorem 12.4 (recurrence criterion, Chung & Fuchs) Let X be a random walk in Rd based on a distribution μ, and fix an ε > 0. Then X is recurrent iff 1 & dt = ∞. (1) sup 1 − rμ ˆt 0 0 = Xn



= P X1 ∨ · · · ∨ Xn−1 < 0 = Xn ,

n ∈ N,

where the second equality holds by (4). Theorem 12.16 (Wiener–Hopf factorization) For a random walk in R based on a distribution μ, we have (i) (ii)

Theorem 12.16 (Wiener–Hopf factorization) For a random walk in R based on a distribution μ, we have

 (i)  δ_0 − δ_1 ⊗ μ = (δ_0 − χ^+) ∗ (δ_0 − ψ^−) = (δ_0 − ψ^+) ∗ (δ_0 − χ^−),
 (ii) δ_0 − ψ^± = (δ_0 − χ^±) ∗ (δ_0 − χ^0).

Note that the convolutions in (i) are defined on the space Z_+ × R, whereas those in (ii) can be regarded as defined on Z_+. Alternatively, we may regard χ^0 as a measure on N × {0}, and think of all convolutions as defined on Z_+ × R.

Proof: (i) Define the measures ρ_1, ρ_2, … on (0, ∞) by

   ρ_n B = P{ X_1 ∧ ··· ∧ X_{n−1} > 0, X_n ∈ B }
         = E Σ_k 1{ τ_k = n, X_{τ_k} ∈ B },   n ∈ N, B ∈ B_{(0,∞)},   (9)

where the second equality holds by (8). Put ρ_0 = δ_0, and regard the sequence ρ = (ρ_n) as a measure on Z_+ × (0, ∞). Noting that the corresponding measures on R equal ρ_n + ψ_n^−, and using the Markov property at time n − 1, we get

   ρ_n + ψ_n^− = ρ_{n−1} ∗ μ = ( ρ ∗ (δ_1 ⊗ μ) )_n,   n ∈ N.   (10)

Applying the strong Markov property at τ_1 to the second expression in (9), we see that also

   ρ_n = Σ_{k=1}^n χ_k^+ ∗ ρ_{n−k} = (χ^+ ∗ ρ)_n,   n ∈ N.   (11)


Recalling the values at zero, we get from (10) and (11)

   ρ + ψ^− = δ_0 + ρ ∗ (δ_1 ⊗ μ),   ρ = δ_0 + χ^+ ∗ ρ.

Eliminating ρ between the two equations yields the first relation in (i), and the second relation follows by symmetry.

(ii) Since the restriction of ψ^+ to (0, ∞) equals ψ_n^+ − χ_n^0, we get for any B ∈ B_{(0,∞)}

   ( χ_n^+ − ψ_n^+ + χ_n^0 ) B = P{ max_{k<n} X_k = 0, X_n ∈ B }.

For the first strict ladder time τ_1 and height X_{τ_1}, we have

   E[ s^{τ_1} e^{iuX_{τ_1}} ] = 1 − exp{ − Σ_{n≥1} n^{−1} s^n E( e^{iuX_n}; X_n > 0 ) }.   (12)

A similar relation holds for (σ_1, X_{σ_1}), with X_n > 0 replaced by X_n ≥ 0.

Proof: Consider the mixed generating and characteristic functions



   χ̂^+_{s,t} = E[ s^{τ_1} exp(itX_{τ_1}) ],   ψ̂^−_{s,t} = E[ s^{σ_1^−} exp(itX_{σ_1^−}) ],

and note that the first relation in Theorem 12.16 (i) is equivalent to

   1 − s μ̂_t = (1 − χ̂^+_{s,t})(1 − ψ̂^−_{s,t}),   |s| < 1, t ∈ R.

Taking logarithms and expanding in Taylor series, we obtain

   Σ_n n^{−1} (s μ̂_t)^n = Σ_n n^{−1} (χ̂^+_{s,t})^n + Σ_n n^{−1} (ψ̂^−_{s,t})^n.

For fixed s ∈ (−1, 1), this equation is of the form ν̂ = ν̂^+ + ν̂^−, where ν and ν^± are bounded signed measures on R, (0, ∞), and (−∞, 0], respectively. By the uniqueness theorem for characteristic functions, we get ν = ν^+ + ν^−. In particular, ν^+ equals the restriction of ν to (0, ∞). Thus, the corresponding Laplace transforms agree, and (12) follows by summation of a Taylor series for the logarithm. A similar argument yields the formula for (σ_1, X_{σ_1}).   □
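As a sanity check on (12), one may compare its value at u = 0, s → 1, namely P{τ_1 < ∞} = 1 − exp{−Σ_{n≥1} n^{−1} P{X_n > 0}}, with a direct simulation of the first strict ladder time. The sketch below does this for a Gaussian random walk with negative drift, where P{X_n > 0} is available in closed form; all names and parameters here are illustrative assumptions.

```python
import numpy as np
from math import erf, sqrt, exp

def norm_sf(x):
    """P{Z > x} for a standard normal variable Z."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

def ladder_prob_series(a, n_terms=5000):
    """P{tau_1 < infty} from (12) with u = 0 and s -> 1, for increments N(-a, 1),
    where P{X_n > 0} = P{Z > a*sqrt(n)}."""
    s = sum(norm_sf(a * sqrt(n)) / n for n in range(1, n_terms + 1))
    return 1.0 - exp(-s)

def ladder_prob_mc(a, n_paths=20_000, horizon=500, seed=0):
    """Monte Carlo estimate of P{tau_1 <= horizon}; with negative drift this is
    very close to P{tau_1 < infty} already for a moderate horizon."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_paths):
        x = np.cumsum(rng.normal(-a, 1.0, size=horizon))
        hits += bool(x.max() > 0.0)
    return hits / n_paths

a = 0.5
print(ladder_prob_series(a), ladder_prob_mc(a))   # the two estimates should agree closely
```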


The last result yields expressions for the probability that a random walk stays negative or non-positive, as well as criteria for its divergence to −∞.

Corollary 12.18 (boundedness and divergence) For a random walk X in R,

 (i)  P{τ_1 = ∞} = 1/(E σ_1^−) = exp{ − Σ_{n≥1} n^{−1} P{X_n > 0} },
 (ii) P{σ_1 = ∞} = 1/(E τ_1^−) = exp{ − Σ_{n≥1} n^{−1} P{X_n ≥ 0} },

and the a.s. divergence X_n → −∞ is equivalent to each of the conditions

   Σ_{n≥1} n^{−1} P{X_n > 0} < ∞,     Σ_{n≥1} n^{−1} P{X_n ≥ 0} < ∞.

Proof: The last expression for P{τ_1 = ∞} follows from (12) with u = 0, as we let s → 1. Similarly, the formula for P{σ_1 = ∞} is obtained from the version of (12) for the pair (σ_1, X_{σ_1}). In particular, P{τ_1 = ∞} > 0 iff the series in (i) converges, and similarly for the condition P{σ_1 = ∞} > 0 in terms of the series in (ii). Since both conditions are equivalent to X_n → −∞ a.s., the last assertion follows. Finally, the first equalities in (i) and (ii) are obtained most easily from Lemma 12.13, if we note that the number of strict or weak ladder times τ_n < ∞ or σ_n < ∞ is geometrically distributed.   □

We turn to a detailed study of the occupation measure ξ = Σ_{n≥0} δ_{X_n} of a transient random walk in R, based on the transition and initial distributions μ and ν. Recall from Theorem 12.1 that the associated intensity measure Eξ = ν ∗ Σ_n μ^{∗n} is locally finite. By the strong Markov property, the sequence (X_{τ+n} − X_τ) has the same distribution for every finite optional time τ. Thus, a similar invariance holds for the occupation measure, and the associated intensities agree. A renewal is then said to occur at time τ, and the whole subject is known as renewal theory. When μ and ν are supported by R_+, we refer to ξ as a renewal process based on μ and ν, and to Eξ as the associated renewal measure. The standard choice is to take ν = δ_0, though in general we may also allow a non-degenerate delay distribution ν.

The occupation measure ξ is clearly a random measure on R, defined as a kernel from Ω to R. By Lemma 15.1 below, the distribution of a random measure on R_+ is determined by the distributions of the integrals ξf = ∫ f dξ, for all f belonging to the space Ĉ_+(R_+) of continuous functions f: R_+ → R_+ with bounded support. For any measure μ on R and constant t ∈ R, we define θ_t μ = μ ∘ θ_t^{−1} with θ_t x = x + t, so that (θ_t μ)B = μ(B − t) and (θ_t μ)f = μ(f ∘ θ_t) for any measurable set B ⊂ R and function f ≥ 0. A random measure ξ is said to be stationary if θ_t ξ =d ξ on R_+. When ξ is a renewal process based on a transition distribution μ, the delayed process η = δ_α ∗ ξ is called a stationary version of ξ, if it is stationary on R_+ with the same spacing distribution μ. We show that such a version exists whenever μ has finite mean, in which case ν is uniquely determined by μ. Recall that λ denotes Lebesgue measure on R_+.

Theorem 12.19 (stationary renewal process) Let ξ be a renewal process based on a distribution μ on R_+ with mean m > 0. Then

 (i)  ξ has a stationary version η on R_+ iff m < ∞,
 (ii) the distribution of η is unique, with Eη = m^{−1} λ and delay distribution
        ν[0, t] = m^{−1} ∫_0^t μ(s, ∞) ds,   t ≥ 0,
 (iii) ν = μ iff ξ is a stationary Poisson process on R_+.

Proof, (i)−(ii): For a delayed renewal process ξ based on μ and ν, Fubini's theorem yields

   Eξ = E Σ_{n≥0} δ_{X_n} = Σ_{n≥0} L(X_n)
      = Σ_{n≥0} ν ∗ μ^{∗n}
      = ν + μ ∗ Σ_{n≥0} ν ∗ μ^{∗n}
      = ν + μ ∗ Eξ,

and so ν = (δ_0 − μ) ∗ Eξ. If ξ is stationary, then Eξ is shift invariant, and Theorem 2.6 yields Eξ = c λ for some constant c > 0. Thus, ν = c (δ_0 − μ) ∗ λ, and the asserted formula follows with m^{−1} replaced by c. As t → ∞, we get 1 = c m by Lemma 4.4, which implies m < ∞ and c = m^{−1}.

Conversely, let m < ∞, and choose ν as stated. Then

   Eξ = ν ∗ Σ_{n≥0} μ^{∗n}
      = m^{−1} (δ_0 − μ) ∗ λ ∗ Σ_{n≥0} μ^{∗n}
      = m^{−1} λ ∗ ( Σ_{n≥0} μ^{∗n} − Σ_{n≥1} μ^{∗n} ) = m^{−1} λ,

which shows that Eξ is invariant. By the strong Markov property, the shifted random measure θ_t ξ is again a renewal process based on μ, say with delay distribution ν_t. Since E θ_t ξ = θ_t Eξ, we get as before

   ν_t = (δ_0 − μ) ∗ (θ_t Eξ) = (δ_0 − μ) ∗ Eξ = ν,

which implies θ_t ξ =d ξ, showing that ξ is stationary.

(iii) By Theorem 13.6 below, ξ is a stationary Poisson process on R_+ with rate c > 0, iff it is a delayed renewal process based on the exponential distribution with mean c^{−1}, in which case μ = ν. Conversely, let ξ be a stationary renewal process based on a common distribution μ = ν with mean m < ∞. Then the tail probability f(t) = μ(t, ∞) satisfies the differential equation m f′ + f = 0 with initial condition f(0) = 1, and so f(t) = e^{−t/m}, which shows that μ = ν is exponential with mean m. As before, ξ is then a stationary Poisson process with rate m^{−1}.   □
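The delay distribution in (ii) is easy to sample from, so the stationarity claim can be checked directly by simulation. The sketch below builds a delayed renewal process with gamma-distributed spacings and verifies that Eη[0, t] ≈ t/m for several t; it is only an illustration, and all names and numerical choices in it are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
shape, scale = 2.0, 1.5                  # spacing distribution mu = Gamma(2, 1.5)
m = shape * scale                        # mean of mu

def mu_tail(x):
    """mu(x, infinity) for Gamma(2, scale): exp(-x/scale) * (1 + x/scale)."""
    z = x / scale
    return np.exp(-z) * (1.0 + z)

def sample_delay():
    """Sample from the stationary delay nu(dx) = mu(x, inf)/m dx by rejection:
    propose x uniform on [0, upper] and accept with probability mu_tail(x)."""
    upper = 60.0                         # mu_tail is negligible beyond this point
    while True:
        x = rng.uniform(0.0, upper)
        if rng.uniform() < mu_tail(x):
            return x

def points_in(t_max):
    """Count points of the delayed renewal process eta in [0, t_max]."""
    t, count = sample_delay(), 0
    while t <= t_max:
        count += 1
        t += rng.gamma(shape, scale)
    return count

for t_max in (5.0, 20.0, 50.0):
    est = np.mean([points_in(t_max) for _ in range(5000)])
    print(t_max, est, t_max / m)         # E eta[0, t] should be close to t/m
```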


The last result yields a similar statement for the occupation measure of a general random walk.

Corollary 12.20 (stationary occupation measure) Let ξ be the occupation measure of a random walk X in R based on distributions μ, ν, where μ has mean m ∈ (0, ∞), and ν is defined as in Theorem 12.19 in terms of the ladder height distribution μ̃ and its mean m̃. Then ξ is stationary on R_+ with intensity m^{−1}.

Proof: Since X_n → ∞ a.s., Propositions 12.14 and 12.15 show that the ladder times τ_n and heights H_n = X_{τ_n} have finite mean, and by Theorem 12.19 the renewal process ζ = Σ_n δ_{H_n} is stationary for the prescribed choice of ν. Fixing t ≥ 0 and putting σ_t = inf{n ∈ Z_+; X_n ≥ t}, we note in particular that X_{σ_t} − t has distribution ν. By the strong Markov property at σ_t, the sequence X_{σ_t+n} − t, n ∈ Z_+, has then the same distribution as X. Since X_k < t for k < σ_t, we get θ_t ξ =d ξ on R_+, which proves the asserted stationarity.

To identify the intensity, let ξ_n be the occupation measure of the sequence X_k − H_n, τ_n ≤ k < τ_{n+1}, and note that H_n ⊥⊥ ξ_n =d ξ_0 for each n by the strong Markov property. Hence, Fubini's theorem yields

   Eξ = E Σ_{n≥0} ξ_n ∗ δ_{H_n} = Σ_{n≥0} E( δ_{H_n} ∗ Eξ_n ) = Eξ_0 ∗ E Σ_{n≥0} δ_{H_n} = Eξ_0 ∗ Eζ.

Noting that Eζ = m̃^{−1} λ by Theorem 12.19, that Eξ_0 (0, ∞) = 0, and that m̃ = m Eτ_1 by Proposition 12.15, we get on R_+

   Eξ = (Eξ_0 R_− / m̃) λ = (Eτ_1 / m̃) λ = m^{−1} λ.   □

We proceed to study the asymptotic behavior of the occupation measure ξ and its intensity Eξ. Under weak restrictions on μ, we show that θ_t ξ approaches the stationary version ξ̃, whereas θ_t Eξ is asymptotically proportional to Lebesgue measure. For simplicity, we assume that the mean of μ exists in R̄ = [−∞, ∞].

We will use the standard notation for convergence on measure spaces; for a detailed discussion of such convergence, see Chapter 23 and Appendix 5. Thus, for locally finite measures ν, ν_1, ν_2, … on R_+, the vague convergence ν_n →v ν means that ν_n f → νf for all f ∈ Ĉ_+(R_+). Similarly, for random measures ξ, ξ_1, ξ_2, … on R_+, the distributional convergence ξ_n →vd ξ is defined by the condition ξ_n f →d ξf for every f ∈ Ĉ_+(R_+). A measure μ on R is said to be non-arithmetic, if the additive subgroup generated by supp μ is dense in R.

Theorem 12.21 (two-sided renewal theorem, Blackwell, Feller & Orey) Let ξ be the occupation measure of a random walk X in R based on distributions μ and ν, where μ is non-arithmetic with mean m ∈ R̄ \ {0}. When m ∈ (0, ∞), let ξ̃ be the stationary version in Corollary 12.20, and otherwise put ξ̃ = 0. Then as t → ∞,

 (i)  θ_t ξ →vd ξ̃,
 (ii) θ_t Eξ →v Eξ̃ = (m^{−1} ∨ 0) λ.

Our proof will be based on two further lemmas. When m < ∞, the crucial step is to prove convergence of the ladder height distributions of the sequences X − t. This will be accomplished by a coupling argument:

Lemma 12.22 (asymptotic delay) For X as in Theorem 12.21 with m ∈ (0, ∞), let ν_t be the distribution of the first ladder height ≥ 0 of the sequence X − t. Then ν_t →w ν̃ as t → ∞.

Proof: Let α and α′ be independent random variables with distributions ν and ν̃. Choose some i.i.d. sequences (ξ_k) ⊥⊥ (ϑ_k) independent of α and α′, such that L(ξ_k) = μ and P{ϑ_k = ±1} = ½. Then

   M_n = α′ − α − Σ_{k≤n} ϑ_k ξ_k,   n ∈ Z_+,

is a random walk based on a non-arithmetic distribution with mean 0. By Theorems 12.1 and 12.2, the range {M_n} is then a.s. dense in R, and so for any ε > 0 the optional time σ = inf{n ≥ 0; M_n ∈ [0, ε]} is a.s. finite.

Next define ϑ′_k = ±ϑ_k, with plus sign iff k > σ, and note as in Lemma 12.11 that {α′, (ξ_k, ϑ′_k)} =d {α′, (ξ_k, ϑ_k)}. Let κ_1, κ_2, … and κ′_1, κ′_2, … be the values of k where ϑ_k = 1 or ϑ′_k = 1, respectively. The sequences

   X_n = α + Σ_{j≤n} ξ_{κ_j},   X′_n = α′ + Σ_{j≤n} ξ_{κ′_j},   n ∈ Z_+,

are again random walks based on μ with initial distributions ν and ν̃, respectively. Writing σ_± = Σ_{k≤σ} (±ϑ_k ∨ 0), we note that

   X′_{σ_− + n} − X_{σ_+ + n} = M_σ ∈ [0, ε],   n ∈ Z_+.

Putting β = X^∗_{σ_+} ∨ X′^∗_{σ_−}, and considering the first entries of X and X′ into the interval [t, ∞), we obtain for any r ≥ ε

   ν̃[ε, r] − P{β ≥ t} ≤ ν_t[0, r] ≤ ν̃[0, r + ε] + P{β ≥ t}.

Letting t → ∞ and then ε → 0, and noting that ν̃{0} = 0 by stationarity, we get ν_t[0, r] → ν̃[0, r], which implies ν_t →w ν̃.   □

To prove (i) ⇒ (ii) in the main theorem, we need the following technical result, which also plays a crucial role when m = ∞.

Lemma 12.23 (uniform integrability) Let ξ be the occupation measure of a transient random walk X in R^d with arbitrary initial distribution, and fix a bounded set B ∈ B^d. Then the random variables ξ(B + x), x ∈ R^d, are uniformly integrable.


Proof: Fix any x ∈ R^d, and put τ = inf{n ≥ 0; X_n ∈ B + x}. Writing η for the occupation measure of an independent random walk starting at 0, we get by the strong Markov property

   ξ(B + x) =d η(B + x − X_τ) 1{τ < ∞} ≤ η(B − B).

Finally, note that Eη(B − B) < ∞ by Theorem 12.1, since X is transient.   □

Proof of Theorem 12.21 (m < ∞): By Lemma 12.23 it is enough to prove (i). If m < 0, then X_n → −∞ a.s. by the law of large numbers, and so θ_t ξ = 0 for sufficiently large t, and (i) follows. If instead m ∈ (0, ∞), then ν_t →w ν̃ by Lemma 12.22, and we may choose some random variables α_t and α with distributions ν_t and ν̃, respectively, such that α_t → α a.s. We also introduce the occupation measure ξ_0 of an independent random walk starting at 0. Now let f ∈ Ĉ_+(R_+), and extend f to R by putting f(x) = 0 for x < 0. Since ν̃ ≪ λ, we have ξ_0{−α} = 0 a.s. Hence, by the strong Markov property and dominated convergence,

   (θ_t ξ)f =d ∫ f(α_t + x) ξ_0(dx) → ∫ f(α + x) ξ_0(dx) =d ξ̃f.

(m = ∞): Here it is clearly enough to prove (ii). The strong Markov property yields Eξ = ν ∗ Eχ ∗ Eζ, where χ is the occupation measure of the ladder height sequence of (X_n − X_0), and ζ is the occupation measure of the same process, prior to the first ladder time. Here Eζ R_− < ∞ by Proposition 12.14, and so by dominated convergence it suffices to show that θ_t Eχ →v 0. Since the ladder heights have again infinite mean by Proposition 12.15, we may assume instead that ν = δ_0, and let μ be an arbitrary distribution on R_+ with mean m = ∞.

Put I = [0, 1], and note that Eξ(I + t) is bounded by Lemma 12.23. Define b = lim sup_t Eξ(I + t), and choose some t_k → ∞ with Eξ(I + t_k) → b. Subtracting the bounded measures μ^{∗j} for j < n, we get (μ^{∗n} ∗ Eξ)(I + t_k) → b for all n ∈ Z_+. Using the reverse Fatou lemma, we obtain for any B ∈ B_{R_+}

   lim inf_{k→∞} Eξ(I − B + t_k) · μ^{∗n}B
      ≥ lim inf_{k→∞} ∫_B Eξ(I − x + t_k) μ^{∗n}(dx)
      = b − lim sup_{k→∞} ∫_{B^c} Eξ(I − x + t_k) μ^{∗n}(dx)
      ≥ b − ∫_{B^c} lim sup_{k→∞} Eξ(I − x + t_k) μ^{∗n}(dx)
      ≥ b − ∫_{B^c} b μ^{∗n}(dx) = b μ^{∗n}B.

Since n was arbitrary, the 'lim inf' on the left is then ≥ b for every B with Eξ B > 0.


Now fix any h > 0 with μ(0, h] > 0, and write J = [0, a] with a = h + 1. Noting that Eξ[r, r + h] > 0 for all r ≥ 0, we get

   lim inf_{k→∞} Eξ(J + t_k − r) ≥ b,   r ≥ a.

Since δ_0 = (δ_0 − μ) ∗ Eξ, we further note that

   1 = ∫_0^{t_k} μ(t_k − x, ∞) Eξ(dx) ≥ Σ_{n≥1} μ(na, ∞) Eξ(J + t_k − na).

As k → ∞, we get 1 ≥ b Σ_{n≥1} μ(na, ∞) by Fatou's lemma. Here the sum diverges since m = ∞, which implies θ_t Eξ →v 0.   □
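For renewal processes on R_+, part (ii) of Theorem 12.21 contains the classical Blackwell statement Eξ(t, t + h] → h/m. The following sketch, included only as an illustration (the names and numerical choices are ours), estimates Eξ(t, t + h] by simulation for a non-arithmetic spacing distribution and shows the drift toward h/m.

```python
import numpy as np

rng = np.random.default_rng(2)
shape, scale = 2.0, 1.5            # gamma spacing distribution mu, mean m = 3
m, h = shape * scale, 1.0

def count_in_window(t, n_paths=20_000):
    """Monte Carlo estimate of E xi(t, t+h] for the undelayed renewal process."""
    total = 0
    for _ in range(n_paths):
        s = 0.0
        while s <= t + h:
            s += rng.gamma(shape, scale)
            if t < s <= t + h:
                total += 1
    return total / n_paths

for t in (2.0, 10.0, 40.0):
    print(t, count_in_window(t), h / m)   # estimates approach h/m = 1/3 as t grows
```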

The preceding theory may be used to study the renewal equation F = f + F ∗ μ, which often arises in applications. Here the convolution F ∗ μ is defined by

   (F ∗ μ)_t = ∫_0^t F(t − s) μ(ds),   t ≥ 0,

whenever the integrals on the right exist. Under suitable regularity conditions, the stated equation has the unique solution F = f ∗ μ̄, where μ̄ denotes the renewal measure Σ_{n≥0} μ^{∗n}. Additional conditions ensure convergence at ∞ of the solution F. The precise statements require some further terminology. By a regular step function we mean a function on R_+ of the form

   f_t = Σ_{j≥1} a_j 1_{[j−1, j)}(t/h),   t ≥ 0,   (13)

where h > 0 and a_1, a_2, … ∈ R. A measurable function f on R_+ is said to be directly Riemann integrable, if λ|f| < ∞, and there exist some regular step functions f_n^± with f_n^− ≤ f ≤ f_n^+ and λ(f_n^+ − f_n^−) → 0.

Theorem 12.24 (renewal equation) Consider a distribution μ ≠ δ_0 on R_+ with associated renewal measure μ̄, and a locally bounded, measurable function f on R_+. Then

 (i)  the equation F = f + F ∗ μ has the unique, locally bounded solution F = f ∗ μ̄,
 (ii) when f is directly Riemann integrable and μ is non-arithmetic with mean m, we have F_t → m^{−1} λf as t → ∞.

Proof: (i) Iterating the renewal equation gives

   F = Σ_{k<n} f ∗ μ^{∗k} + F ∗ μ^{∗n},   n ∈ N.
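Part (i) of Theorem 12.24 translates directly into a numerical scheme: discretize time and compute F recursively from F = f + F ∗ μ. The sketch below does this on a grid and, for a directly Riemann integrable f, shows F_t settling near m^{−1} λf as in part (ii). It is a rough illustration only; the discretization and all names in it are ours.

```python
import numpy as np

def solve_renewal(f, mu_cdf, t_max=30.0, dt=0.01):
    """Solve F = f + F * mu on a grid by forward recursion (crude midpoint scheme)."""
    n = int(t_max / dt) + 1
    t = np.arange(n) * dt
    mu_mass = mu_cdf(t[1:]) - mu_cdf(t[:-1])   # mu((j-1)dt, j dt], j = 1..n-1
    F = np.empty(n)
    F[0] = f(t[0])
    for i in range(1, n):
        # (F * mu)(t_i) ~ sum_{j=1}^{i} F(t_{i-j}) * mu_mass[j-1]
        F[i] = f(t[i]) + np.dot(mu_mass[:i][::-1], F[:i])
    return t, F

f = lambda s: np.exp(-s)              # directly Riemann integrable, lambda f = 1
mu_cdf = lambda s: 1.0 - np.exp(-s)   # mu = Exp(1), mean m = 1
t, F = solve_renewal(f, mu_cdf)
print(F[100], F[1000], F[-1])         # all close to m^{-1} * lambda f = 1
```

For this choice of μ the exact solution is F ≡ 1, which gives a convenient accuracy check on the discretization.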

Exercises

2. For any non-degenerate random walk X in R^d, show that |X_n| →P ∞. (Hint: Use Lemma 6.1.)

3. Let X be a random walk in R, based on a symmetric, non-degenerate distribution with bounded support. Show that X is recurrent. (Hint: Recall that lim sup_n (±X_n) = ∞ a.s.)

4. Show that the accessible set A equals the closed semi-group generated by supp μ. Also show by examples that A may or may not be a group.

5. Let ν be an invariant measure on the accessible set of a recurrent random walk in R^d. Show by examples that Eξ may or may not be of the form ∞ · ν.

6. Show that a non-degenerate random walk in R^d has no invariant distribution. (Hint: If ν is invariant, then μ ∗ ν = ν.)

7. Show by examples that the conditions in Theorem 12.2 are not necessary. (Hint: For d = 2, consider mixtures of N(0, σ²) and use Lemma 6.19.)

8. Let X be a random walk in R based on the symmetric p-stable distribution with characteristic function e^{−|t|^p}. Show that X is recurrent for p ≥ 1 and transient for p < 1.

9. Let X be a random walk in R² based on the distribution μ², where μ is symmetric p-stable. Show that X is recurrent for p = 2 and transient for p < 2.

10. Let μ = c μ_1 + (1 − c) μ_2 for some symmetric distributions μ_1, μ_2 on R^d and a constant c ∈ (0, 1). Show that a random walk based on μ is recurrent iff recurrence holds for random walks based on μ_1 and μ_2.

11. Let μ = μ_1 ∗ μ_2, where μ_1, μ_2 are symmetric distributions on R^d. Show that if a random walk based on μ is recurrent, then so are the random walks based on μ_1, μ_2. Also show by an example that the converse is false. (Hint: For the latter part, let μ_1, μ_2 be supported by orthogonal sub-spaces.)

12. For a symmetric, recurrent random walk in Z^d, show that the expected number of visits to an accessible state k ≠ 0 before return to the origin equals 1. (Hint: Compute the distribution, assuming a probability p for return before visit to k.)

13. Use Proposition 12.14 to show that any non-degenerate random walk in Z^d has an infinite mean recurrence time. Compare with the preceding problem.

14. Show how part (i) of Proposition 12.15 can be strengthened by means of Theorems 6.17 and 12.2.

15. For a non-degenerate random walk in R, show that lim sup_n X_n = ∞ a.s. iff σ_1 < ∞ a.s., and that X_n → ∞ a.s. iff Eσ_1 < ∞. In both conditions, note that σ_1 can be replaced by τ_1.

16. Let ξ be a renewal process based on a non-arithmetic distribution on R_+. Show that sup{t > 0; Eξ[t, t + ε] = 0} < ∞ for any ε > 0. (Hint: Mimic the proof of Proposition 11.18.)

17. Let μ be a distribution on Z_+, such that the group generated by supp μ equals Z. Show that Theorem 12.19 remains valid with ν{n} = c^{−1} μ(n, ∞), n ≥ 0, and prove a corresponding version of Corollary 12.20.

18. Let ξ be the occupation measure of a random walk in Z based on a distribution μ with mean m ∈ R̄ \ {0}, such that the group generated by supp μ equals Z. Show as in Theorem 12.21 that Eξ{n} → m^{−1} ∨ 0.

19. Prove the renewal theorem for random walks in Z_+ from the ergodic theorem for discrete-time Markov chains, and conversely. (Hint: Given a distribution μ on N, construct a Markov chain X in Z_+ with X_{n+1} = X_n + 1 or 0, such that the recurrence times at 0 are i.i.d. μ. Note that X is aperiodic iff Z is the smallest group containing supp μ.)

20. Fix a distribution μ on R with symmetrization μ̃. Note that if μ̃ is non-arithmetic, then so is μ. Show by an example that the converse is false.

21. Simplify the proof of Lemma 12.22, in the case where even the symmetrization μ̃ is non-arithmetic. (Hint: Let ξ_1, ξ_2, … and ξ′_1, ξ′_2, … be i.i.d. μ, and define X̃_n = α′ − α + Σ_{k≤n} (ξ_k − ξ′_k).)

22. Show that any monotone, Lebesgue integrable function on R_+ is directly Riemann integrable.

23. State and prove the counterpart of Theorem 12.24 for arithmetic distributions.

24. Let (ξ_n), (η_n) be independent i.i.d. sequences with distributions μ, ν, put X_n = Σ_{k≤n} (ξ_k + η_k), and define U = ⋃_{n≥0} [X_n, X_n + ξ_{n+1}). Show that F_t = P{t ∈ U} satisfies the renewal equation F = f + F ∗ μ ∗ ν with f_t = μ(t, ∞). For μ, ν with finite means, show also that F_t converges as t → ∞, and identify the limit.

25. Consider a renewal process ξ based on a non-arithmetic distribution μ with mean m < ∞, fix any h > 0, and define F_t = P{ξ[t, t + h] = 0}. Show that F = f + F ∗ μ with f_t = μ(t + h, ∞). Further show that F_t converges as t → ∞, and identify the limit. (Hint: Consider the first point of ξ in (0, t), if any.)

26. For ξ as above, let τ = inf{t ≥ 0; ξ[t, t + h] = 0}, and put F_t = P{τ ≤ t}. Show that F_t = μ(h, ∞) + ∫_0^{h∧t} μ(ds) F_{t−s}, or F = f + F ∗ μ_h, where μ_h = 1_{[0,h]} · μ and f ≡ μ(h, ∞).

Chapter 13

Jump-Type Chains and Branching Processes

Strong Markov property, renewals in Poisson process, rate kernel, embedded Markov chain, existence and explosion, compound and pseudo-Poisson processes, backward equation, invariant distribution, convergence dichotomy, mean recurrence times, Bienaymé process, extinction and asymptotics, binary splitting, Brownian branching tree, genealogy and Yule tree, survival rate and distribution, diffusion approximation

After our detailed study of discrete-time Markov chains in Chapter 11, we turn our attention to the corresponding processes in continuous time, where the paths are taken to be piecewise constant apart from isolated jumps. The evolution of the process is then governed by a rate kernel α, determining both the rate at which transitions occur and the associated transition probabilities. For bounded α we get a pseudo-Poisson process, which may be described as a discrete-time Markov chain with transition times given by an independent, homogeneous Poisson process. Of special interest is the space-homogeneous case of compound Poisson processes, where the underlying Markov chain is a random walk.

In Chapter 17 we show how every Feller process can be approximated in a natural way by a sequence of pseudo-Poisson processes, characterized by the boundedness of their generators. A similar compound Poisson approximation of general Lévy processes plays a basic role in Chapter 16.

The chapter ends with a brief introduction to branching processes in discrete and continuous time. In particular, we study the extinction probability and asymptotic population size, and identify the ancestral structure of a critical Brownian branching tree in terms of a suitable Yule process. For a critical Bienaymé process, we further derive the asymptotic survival rate and distribution, and indicate how a suitably scaled version can be approximated by a Feller diffusion. The study of spatial branching processes will be resumed in Chapter 30.

A process X in a measurable space (S, S) is said to be of pure jump-type, if its paths are a.s. right-continuous and constant apart from isolated jumps. We may then denote the jump times of X by τ_1, τ_2, …, with the understanding that τ_n = ∞ if there are fewer than n jumps. By a simple approximation based on Lemma 9.3, the times τ_n are optional with respect to the right-continuous filtration F = (F_t) induced by X. For convenience we may choose X to be the identity map on the canonical path space Ω. When X is Markov, we write P_x




for the distribution with initial state x, and note that the mapping x → Px is a kernel from (S, S) to (Ω, F∞ ). We begin our study of pure jump-type Markov processes by proving an extension of the elementary strong Markov property in Proposition 11.9. A further extension appears as Theorem 17.17. Theorem 13.1 (strong Markov property, Doob) A pure jump-type Markov process satisfies the strong Markov property at every optional time. Proof: For any optional time τ , we may choose some optional times σn ≥ τ + 2−n taking countably many values, such that σn → τ a.s. When A ∈ Fτ ∩ {τ < ∞} and B ∈ F∞ , we get by Proposition 11.9







P θσn X ∈ B; A = E PXσn B; A .

(1)

The right-continuity of X yields P {Xσn = Xτ } → 0. If B depends on finitely many coordinates, it is also clear that 



P {θσn X ∈ B}  {θτ X ∈ B} → 0,

n → ∞.

Hence, (1) remains true for such sets B with σn replaced by τ , and the relation extends to the general case by a monotone-class argument. 2 We now examine the structure of a general pure jump-type Markov process. Here the crucial step is to determine the distributions associated with the first jump, occurring at time τ1 . A random variable γ ≥ 0 is said to be exponentially distributed if P {γ > t} = ect , t ≥ 0, for a constant c > 0, in which case  ∞ −ct E γ = 0 e dt = c−1 . Say that a state x ∈ S is absorbing if Px {X ≡ x} = Px {τ1 = ∞} = 1. Lemma 13.2 (first jump) Let X be a pure jump-type Markov process, and fix a non-absorbing state x. Then under Px , (i) τ1 is exponentially distributed, (ii) θτ1 X⊥⊥ τ1 . Proof: (i) Put τ1 = τ . By the Markov property at fixed times, we get for any s, t ≥ 0







Px τ > s + t = Px τ > s, τ ◦ θs > t = Px {τ > s} Px {τ > t}. This is a Cauchy equation in s and t, whose only non-increasing solutions are of the form Px {τ > t} = e−ct with c ∈ [0, ∞]. Since x is non-absorbing and τ > 0 a.s., we have c ∈ (0, ∞), and so τ is exponentially distributed with parameter c.

13. Jump-Type Chains and Branching Processes

279

(ii) By the Markov property at fixed times, we get for any B ∈ F∞





Px τ > t, θτ X ∈ B = Px τ > t, (θτ X) ◦ θt ∈ B





= Px {τ > t} Px θτ X ∈ B , which shows that τ ⊥ ⊥ θτ X.

2

Writing X∞ = x when X eventually gets absorbed at x, we define the rate function c and jump transition kernel μ by c(x) = (Ex τ1 )−1 , μ(x, B) = Px {Xτ1 ∈ B},

x ∈ S, B ∈ S,

combined into a rate kernel α = c μ, or α(x, B) = c(x) μ(x, B),

x ∈ S, B ∈ S.

The required measurability follows from that for the kernel (Px ). Assuming in addition that μ(x, ·) = δx when α(x, ·) = 0, we may reconstruct μ from α, so that μ becomes a measurable function of α. We proceed to show how X can be represented in terms of a discrete-time Markov chain and an independent sequence of exponential random variables. In particular, we will see how the distributions Px are uniquely determined by the rate kernel α. In the sequel, we always assume the existence of the required randomization variables. Theorem 13.3 (embedded Markov chain) For a pure jump-type Markov process X with rate kernel α = c μ, there exist a Markov chain Y on Z+ with transition kernel μ and some i.i.d., exponential random variables γ1 , γ2 , . . . ⊥⊥ Y with mean 1, such that a.s. for all n ∈ Z+ , (i) Xt = Yn , t ∈ [τn , τn+1 ), n γk (ii) τn = Σ . k = 1 c(Yk−1 ) Proof: Let τ1 , τ2 , . . . be the jump times of X, and put τ0 = 0, so that (i) holds with Yn = Xτn for n ∈ Z+ . Introduce some i.i.d., exponentially ⊥ X with mean 1, and define for n ∈ N distributed random variables γ1 , γ2 , . . . ⊥



γn = (τn − τn−1 ) c(Yn ) 1{τn−1 < ∞} + γn 1 c(Yn ) = 0 . By Lemma 13.2, we get for any t ≥ 0, B ∈ S, and x ∈ S with c(x) > 0





Px γ1 > t, Y1 ∈ B = Px τ1 c(x) > t, Y1 ∈ B = e−t μ(x, B),



280

Foundations of Modern Probability

which clearly remains true when c(x) = 0. By the strong Markov property, we obtain for any n, a.s. on {τn < ∞}, 







Px γn+1 > t, Yn+1 ∈ B  Fτn = PYn γ1 > t, Y1 ∈ B



= e−t μ(Yn , B).

(2)

The strong Markov property also gives τn+1 < ∞ a.s. on the set {τn < ∞, c(Yn ) > 0}. Arguing recursively, we get {c(Yn ) = 0} = {τn+1 = ∞} a.s., and (ii) follows. The same relation shows that (2) remains a.s. true on {τn = ∞}, and in both cases we may replace Fτn by Gn = Fτn ∨ σ{γ1 , . . . , γn }. Thus, the pairs (γn , Yn ) form a discrete-time Markov chain with the desired transition kernel. By Proposition 11.2, the latter property along with the initial distri2 bution determine uniquely the joint distribution of Y and (γn ). In applications we are often faced with the converse problem of constructing a Markov process X with a given rate kernel α. Then write α(x, B) = c(x) μ(x, B) for some rate function c : S → R+ and transition kernel μ on S, such that μ(x, ·) = δx when c(x) = 0 and otherwise μ(x, {x}) = 0. If X does exist, it may clearly be constructed as in Theorem 13.3. The construction fails when ζ ≡ supn τn < ∞, in which case an explosion is said to occur at time ζ. Theorem 13.4 (synthesis) For a finite kernel α = c μ on S with α(x, {x}) ≡ 0, let Y be a Markov chain in S with transition kernel μ, and let γ1 , γ2 , . . . ⊥⊥ Y be i.i.d., exponential random variables with mean 1. Then these conditions are equivalent: (i)

Σ nγn/c(Yn−1) = ∞ a.s. under every initial distribution for Y ,

(ii) there exists a pure jump-type Markov process on R+ with rate kernel α. Proof: Assuming (i), let Px be the joint distribution of the sequences Y = (Yn ) and Γ = (γn ) when Y0 = x. Regarding (Y, Γ) as the identity map on the canonical space Ω = (S × R+ )∞ , we may construct X from (Y, Γ) as in Theorem 13.3, where Xt = x0 is arbitrary for t ≥ supn τn , and introduce the filtrations G = (Gn ) induced by (Y, γ) and F = (Ft ) induced by X. We need to verify the Markov property Lx ( θt X | Ft ) = LXt (X) for the distributions under Px and PXt , since the rate kernel can then be identified via Theorem 13.3. For any t ≥ 0 and n ∈ Z+ , put κ = sup{k; τk ≤ t}, and define

β = (t − τn ) c(Yn ),





T m (Y, Γ) = (Yk , γk+1 ); k ≥ m , (Y  , Γ ) = T n+1 (Y, Γ),

γ  = γn+1 .

Since clearly Ft = Gn ∨ σ{γ  > β} on {κ = n},

13. Jump-Type Chains and Branching Processes

281

it suffices by Lemma 8.3 to prove that  









Lx (Y  , Γ ); γ  − β > r  Gn , γ  > β = LYn T (Y, Γ); γ1 > r . ⊥Gn (γ  , β) since γ  ⊥ ⊥ (Gn , Y  , Γ ), we see that the left-hand Noting that (Y  , Γ )⊥ side equals  





Lx (Y  , Γ ); γ  − β > r  Gn Px {γ  > β | Gn }

 



P {γ  − β > r | G } x n

= Lx (Y  , Γ )  Gn = (PYn ◦ T

−1

−r

Px {γ  > β | Gn }

)e , 2

as required.

To complete the picture, we need a convenient criterion for non-explosion. Proposition 13.5 (explosion) For any rate kernel α and initial state x, let (Yn ) and (τn ) be such as in Theorem 13.3. Then a.s. 1 τn → ∞ ⇔ Σ = ∞. (3) n c(Yn ) In particular, τn → ∞ a.s. when x is recurrent for (Yn ). Proof: Write βn = {c(Yn−1 )}−1 . Noting that Ee−uγn = (1 + u)−1 for all u ≥ 0, we get by Theorem 13.3 (ii) and Fubini’s theorem

Π n (1 + uβn)−1

= exp − Σ n log(1 + uβn )

E( e−uζ | Y ) =

a.s.

(4)

Since 12 (r ∧ 1) ≤ log(1 + r) ≤ r for all r > 0, the series on the right converges for every u > 0 iff Σ n βn < ∞. Letting u → 0 in (4), we get by dominated convergence

P ( ζ < ∞ | Y ) = 1 Σ n βn < ∞ a.s.,

which implies (3). If x is visited infinitely often, the series Σ n βn has infinitely 2 many terms c−1 x > 0, and the last assertion follows. The simplest and most basic pure jump-type Markov processes are the homogeneous Poisson processes, which are space and time homogeneous processes of this kind starting at 0 and proceeding by unit jumps. Here we define a stationary Poisson process on R+ with rate c > 0 as a simple point process ξ with stationary, independent increments, such that ξ[0, t) is Poisson distributed with mean c t. To prove the existence, we may use generating functions to see that the Poisson distributions μt with mean t satisfy the semi-group property μs ∗ μt = μs+t . The desired existence now follows by Theorem 8.23. Theorem 13.6 (stationary Poisson process, Bateman) Let ξ be a simple point process on R+ with points τ1 < τ2 < · · · , and put τ0 = 0. Then these conditions are equivalent:

282

Foundations of Modern Probability

(i) ξ is a stationary Poisson process, (ii) ξ has stationary, independent increments, (iii) the variables τn − τn−1 , n ∈ N, are i.i.d., exponentially distributed. In that case ξ has rate c = (E τ1 )−1 . Proof, (i) ⇒ (ii): Clear by definition. (ii) ⇒ (iii): Under (ii), Theorem 11.5 shows that Xt = ξ[0, t], t ≥ 0, is a space- and time-homogeneous Markov process of pure jump type, and so by d Lemma 13.2, τ1 is exponential and independent of θτ1 ξ. Since θτ1 ξ = ξ by Theorem 13.1, we may proceed recursively, and (iii) follows. (iii) ⇒ (i): Assuming (iii) with E τ1 = c−1 , we may choose a stationary Poisson process η with rate c and points σ1 < σ2 < · · · , and conclude as before d that (σn ) = (τn ). Hence, ξ = Σ n δτn = Σ n δσn = η. d

2

By a pseudo-Poisson process in a measurable space S we mean a process of the form X = Y ◦ N a.s., where Y is a discrete-time Markov process in S and N ⊥⊥Y is an homogeneous Poisson process. Letting μ be the transition kernel of Y and writing c for the constant rate of N , we may construct a kernel 



α(x, B) = c μ x, B \ {x} ,

x ∈ S, B ∈ S,

(5)

which is measurable since μ(x, {x}) is a measurable function of x. We may characterize pseudo-Poisson processes in terms of the rate kernel: Proposition 13.7 (pseudo-Poisson process, Feller) For a process X in a Borel space S, these conditions are equivalent: (i) X = Y ◦ N a.s. for a discrete-time Markov chain Y in S with transition kernel μ and a Poisson process N ⊥⊥ Y with constant rate c > 0, (ii) X is a pure jump-type Markov process with bounded rate kernel α. In that case, we may choose α and (μ, c) to be related by (5). Proof, (i) ⇒ (ii): Assuming (i), write τ1 , τ2 , . . . for the jump times of N , and let F be the filtration induced by the pair (X, N ). As in Theorem 13.4, we see that X is F-Markov. To identify the rate kernel α, fix any initial state x, and note that the first jump of X occurs at the time τn when Yn first leaves x. For each transition of Y , this happens with probability px = μ(x, {x}c ). By Proposition 15.3, the time until first jump is then exponentially distributed with parameter c px . If px > 0, the position of X after the first jump has distribution p−1 x μ(x, · \ {x}). Thus, α is given by (5). (ii) ⇒ (i): Assuming (ii), put rx = α(x, S) and c = supx rx , and note that the kernel

μ(x, ·) = c−1 α(x, ·) + (c − rx ) δx , x ∈ S,

13. Jump-Type Chains and Branching Processes

283

satisfies (5). Thus, if X  = Y  ◦ N  is a pseudo-Poisson process based on μ d and c, then X  is again Markov with rate kernel α, and so X = X  . Hence, d 2 Corollary 8.18 yields X = Y ◦ N a.s. for a pair (Y, N ) = (Y  , N  ). If the underlying Markov chain Y is a random walk in a measurable Abelian group S, then X = Y ◦ N is called a compound Poisson process. Here X − X0 ⊥⊥ X0 , the jump sizes are i.i.d., and the jump times form an independent, homogeneous Poisson process. Thus, the distribution of X − X0 is determined by the characteristic measure 1 ν = c μ, where c is the rate of the jump-time process and μ is the common jump distribution. Compound Poisson processes may be characterized analytically in terms of the rate kernel, and probabilistically in terms of the increments of the process. A kernel α on S is said to be invariant if αx = α0 ◦θx−1 or α(x, B) = α(0, B −x) for all x and B. We also say that a process X in S has independent increments ⊥ {Xr ; r ≤ s} for all s < t. if Xt − Xs ⊥ Corollary 13.8 (compound Poisson process) For a pure jump-type process X in a measurable Abelian group, these conditions are equivalent: (i) X is Markov with invariant rate kernel, (ii) X has stationary, independent increments, (iii) X is compound Poisson. Proof: If a pure jump-type Markov process is space-homogeneous, then its rate kernel is clearly invariant, and the converse follows from the representation in Theorem 13.3. Thus, (i) ⇔ (ii) by Proposition 11.5. Next, Theorem 13.3 yields (i) ⇒ (iii), and the converse follows by Theorem 13.4. 2 We now derive a combined differential and integral equation for the transition kernels μt . An abstract version of this result appears in Theorem 17.6. For any measurable and suitably integrable function f : S → R, we define

Tt f (x) =

f (y) μt (x, dy)

= Ex f (Xt ),

x ∈ S, t ≥ 0.

Theorem 13.9 (backward equation, Kolmogorov) Let α be the rate kernel of a pure jump-type Markov process in S, and fix a bounded, measurable function f : S → R. Then Tt f (x) is continuously differentiable in t for fixed x, and

∂ Tt f (x) = α(x, dy) Tt f (y) − Tt f (x) , ∂t

1

t ≥ 0, x ∈ S.

also called L´evy measure, as in the general case of Chapter 16

(6)

284

Foundations of Modern Probability

Proof: Put τ = τ1 , and let x ∈ S and t ≥ 0. By the strong Markov property at σ = τ ∧ t and Theorem 8.5,

Tt f (x) = Ex f (Xt ) = Ex f (θσ X)t−σ = Ex Tt−σ f (Xσ )







= f (x) Px {τ > t} + Ex Tt−τ f (Xτ ); τ ≤ t = f (x) e−t cx +



t

e−s cx ds



0



and so

e t cx Tt f (x) = f (x) +

t

α(x, dy) Tt−s f (y),



e s cx ds

0

α(x, dy) Ts f (y).

(7)

Here the use of the disintegration theorem is justified by the fact that X(ω, t) is product measurable on Ω × R+ , by the right continuity of the paths. From (7) we see that Tt f (x) is continuous in t for each x, and so by dominated convergence the inner integral on the right is continuous in s. Hence, Tt f (x) is continuously differentiable in t, and (6) follows by an easy computation. 2 Next we show how the invariant distributions of a pure jump-type Markov process are related to those of the embedded Markov chain. Proposition 13.10 (invariant measures) Let the processes X, Y be related  as in Theorem 13.3, and fix a probability measure ν on S with c dν < ∞. Then ν is invariant for X ⇔ c · ν is invariant for Y. Proof: By Theorem 13.9 and Fubini’s theorem, we have for any bounded measurable function f : S → R

Eν f (Xt ) =





t

f (x) ν(dx) +

ds



ν(dx)

0





α(x, dy) Ts f (y) − Ts f (x) .

Thus, ν is invariant for X iff the second term on the right is identically zero. Now (6) shows that Tt f (x) is continuous in t, and by dominated convergence this remains true for the integral

It =



ν(dx)





α(x, dy) Tt f (y) − Tt f (x) ,

t ≥ 0.

Thus, the condition becomes It ≡ 0. Since f is arbitrary, it is enough to take t = 0. Our condition then reduces to (να)f ≡ ν(cf ) or (c · ν)μ = c · ν, which means that c · ν is invariant for Y . 2 We turn to a study of pure jump-type Markov processes in a countable state space S, also called continuous-time Markov chains. Here the kernels μt are determined by the transition functions pijt = μt (i, {j}). The connectivity properties are simpler than in the discrete case, and the notion of periodicity has no continuous-time counterpart.

13. Jump-Type Chains and Branching Processes

285

Lemma 13.11 (positivity) Consider a continuous-time Markov chain in S with transition functions pijt . Then for fixed i, j ∈ S, exactly one of these cases occurs: (i) pijt > 0, t > 0, (ii) pijt = 0, t ≥ 0. In particular, piit > 0 for all t and i. Proof: Let q = (qij ) be the transition matrix of the embedded Markov chain Y in Theorem 13.3. If qijn = Pi {Yn = j} = 0 for all n ≥ 0, then clearly 1{Xt = j} ≡ 1 a.s. Pi , and so pijt = 0 for all t ≥ 0. If instead qijn > 0 for some n ≥ 0, there exist some states i = i0 and i1 , . . . , in = j with qik−1 ,ik > 0 for k = 1, . . . , n. Since the distribution of (γ1 , . . . , γn+1 ) has the positive density Π k≤n+1 e−xk > 0 on Rn+1 + , we obtain for any t > 0 ptij

≥P



n

γk

Σ k = 1 ci

k−1



n+1

γk

k=1

cik−1

≤t 0.

Noting that p0ii = qii0 = 1, we get in particular piit > 0 for all t ≥ 0.

2

A continuous-time Markov chain is said to be irreducible if pijt > 0 for all i, j ∈ S and t > 0. This clearly holds iff the underlying discrete-time process Y in Theorem 13.3 is irreducible, in which case



sup t > 0; Xt = j < ∞







sup n > 0; Yn = j < ∞.

Thus, when Y is recurrent, the sets {t; Xt = j} are a.s. unbounded under Pi for all i ∈ S; otherwise they are a.s. bounded. The two possibilities are again referred to as recurrence and transience, respectively. The basic ergodic Theorem 11.22 for discrete-time Markov chains has the following continuous-time counterpart. Further extensions are considered in u Chapter 17 and 26. Recall that → means convergence in total variation, and ˆ let MS be the class of probability measures on S. Theorem 13.12 (convergence dichotomy, Kolmogorov) For an irreducible, continuous-time Markov chain in S, exactly one of these cases occurs: (i) there exists a unique invariant distribution ν, the latter satisfies νi > 0 for all i ∈ S, and as t → ∞ we have u ˆ S, Pμ ◦ θ−1 → Pν , μ ∈ M t

(ii) no invariant distribution exists, and as t → ∞, pijt → 0,

i, j ∈ S.

Proof: By Lemma 13.11, the discrete-time chain Xnh , n ∈ Z+ , is irreducible and aperiodic. Suppose that (Xnh ) is positive recurrent for some h > 0, say with invariant distribution ν. Then the chain (Xnh ) is positive recurrent for every h of the form 2−m h, and by the uniqueness in Theorem 11.22 it has

286

Foundations of Modern Probability

the same invariant distribution. Since the paths are right-continuous, a simple approximation shows that ν is invariant even for the original process X. For any distribution μ on S, we have       Pμ ◦ θt−1 − Pν  = 

 

Σ i μi Σj (ptij − νj )Pj    ≤ Σ i μi Σ j ptij − νj .

Thus, (i) will follow by dominated convergence, if we can show that the inner sum on the right tends to zero. This is clear if we put n = [t/h] and r = t − nh, and note that by Theorem 11.22,  

 

 

 

r Σ k  ptik − νk  ≤ Σ j,k  pnh ij − νj  pjk    = Σ j  pnh ij − νj  → 0.

It remains to consider the case where (Xnh ) is null-recurrent or transient for every h > 0. Fixing any i, k ∈ S and writing n = [t/h] and r = t − nh as before, we get ptik = Σ j prij pnh jk ≤ pnh ik +

Σ prij

j = i

r = pnh ik + (1 − pii ),

which tends to zero as t → ∞ and then h → 0, by Theorem 11.22 and the 2 continuity of ptii . As in discrete time, condition (ii) of the last theorem holds for any transient Markov chain, whereas a recurrent chain may satisfy either condition. Recurrent chains satisfying (i) and (ii) are again referred to as positive-recurrent and null-recurrent, respectively. Note that X may be positive recurrent even if the embedded, discrete-time chain Y is null-recurrent, and vice versa. On the other hand, X clearly has the same asymptotic properties as the discrete-time processes (Xnh ), h > 0. For any j ∈ S, we now introduce the first exit and recurrence times γj = inf{t > 0; Xt = j}, τj = inf{t > γj ; Xt = j}. As for Theorem 11.26 in the discrete-time case, we may express the asymptotic transition probabilities in terms of the mean recurrence times Ej τj . To avoid trite exceptions, we consider only non-absorbing states. Theorem 13.13 (mean recurrence times, Kolmogorov) For a continuoustime Markov chain in S and states i, j ∈ S with j non-absorbing, we have as t→∞ Pi {τj < ∞} . (8) pijt → cj Ej τj

13. Jump-Type Chains and Branching Processes

287

Proof: We may take i = j, since the general statement will then follow as in the proof of Theorem 11.26. If j is transient, then 1{Xt = j} → 0 a.s. Pj , and so by dominated convergence pjjt = Pj {Xt = j} → 0. This agrees with (8), since in this case Pj {τj = ∞} > 0. Turning to the recurrent case, let Sj be the class of states i accessible from j. Then Sj is clearly irreducible, and so pjjt converges by Theorem 13.12. To identify the limit, define

Ljt = λ s ≤ t; Xs = j



t

= 0

1{Xs = j} ds,

t ≥ 0,

and let τjn be the time of n -th return to j. Letting m, n → ∞ with |m − n| ≤ 1 and using the strong Markov property and the law of large numbers, we get a.s. Pj Lj (τjm ) n m Lj (τjm ) · n· = n τj m τj n E j γj 1 → = . E j τj cj E j τj By the monotonicity of Lj it follows that t−1 Ljt → (cj Ej τj )−1 a.s. Hence, by Fubini’s theorem and dominated convergence, Ej Ljt 1 t s 1 → pjj ds = , t 0 t cj Ej τj 2

and (8) follows.

We conclude with some applications to branching processes and their diffusion approximations. By a Bienaym´e process 2 we mean a discrete-time branching process X = (Xn ), where Xn represents the total number of individuals in generation n. Given that Xn = m, the individuals in the n -th generation give rise to independent families of progeny Y1 , . . . , Ym starting at time n, each distributed like the original process X with X0 = 1. This clearly defines a discrete-time Markov chain taking values in Z+ . We say that X is critical when E1 X1 = 1, and super- or sub-critical when E1 X1 > 1 or < 1, respectively.3 To avoid some trite exceptions, we assume that both possibilities X1 = 0 and X1 > 1 have positive probabilities. Theorem 13.14 (Bienaym´e process) Let X be a Bienaym´e process with E1 X1 = μ and Var1 (X1 ) = σ 2 , and put f (s) = E1 sX1 . Then (i) P1 {Xn → 0} is the smallest s ∈ [0, 1] with f (s) = s, and so Xn → 0 a.s.



μ ≤ 1,

(ii) when μ < ∞, we have μ−n Xn → γ a.s. for a random variable γ ≥ 0, 2

also called a Galton–Watson process The conventions of martingale theory are the opposite. Thus, a branching process is super-critical iff it is a sub-martingale, and sub-critical iff it is a super-martingale. 3

288

Foundations of Modern Probability

(iii) when 4 σ 2 < ∞, we have E1 γ = 1 ⇔ μ > 1, and {γ = 0} = {Xn → 0} a.s. P1 . The proof is based on some simple algebraic facts: Lemma 13.15 (recursion and moments) Let X be a Bienaym´e process with E1 X1 = μ and Var1 (X1 ) = σ 2 . Then (i) the functions f (n) (s) = E1 sXn on [0, 1] satisfy f (m+n) (s) = f (m) ◦ f (n) (s), (ii) E1 Xn = μn , and when σ 2 < ∞ we have  n σ2, Var1 (Xn ) = μn −1 σ 2 μn−1 μ−1 ,

m, n ∈ N, μ = 1, μ = 1.

Proof: (i) By the Markov and independence properties of X, we have f (m+n) (s) = = = =

E1 sXm+n = E1 EXm sXn E1 (E1 sXn )Xm f (m) (E1 sXn ) f (m) ◦ f (n) (s).

(ii) Write μn = E1 Xn and σn2 = Var1 (Xn ), whenever those moments exist. The Markov and independence properties yield μm+n = = = =

E1 Xm+n = E1 EXm Xn E1 (Xm E1 Xn ) (E1 Xm )(E1 Xn ) μm μn ,

and so by induction μn = μn . Furthermore, Theorem 12.9 (iii) gives 2 2 σm+n = σm μn + μ2m σn2 .

The variance formula is clearly true for n = 1. Proceeding by induction, suppose it holds for a given n. Then for n + 1 we get when μ = 1 2 σn+1 = σn2 μ + μ2n σ 2 μn − 1 + μ2n σ 2 = μ σ 2 μn−1 μ − 1

σ 2 μn n = μ − 1 + μn (μ − 1) μ − 1 n+1 μ −1 , = σ 2 μn μ−1 as required. When μ = 1, the proof simplifies to 4

This assertion remains true under the weaker condition E1 (X1 log X1 ) < ∞.

13. Jump-Type Chains and Branching Processes

289

2 σn+1 = σn2 + σ 2 = n σ2 + σ2 = (n + 1) σ 2 .

2

Proof of Theorem 13.14: (i) Noting that P1 {Xn = 0} = E1 0Xn = f (n) (0), we get by Lemma 13.15 (i)





P1 {Xn+1 = 0} = f P1 {Xn = 0} . Since also P1 {X0 = 0} = 0 and P1 {Xn = 0} ↑ P {Xn → 0}, we conclude that the extinction probability q is the smallest root in [0, 1] of the equation f (s) = s. Now f is strictly convex with f  (1) = μ and f (0) ≥ 0, and so the equation f (s) = s has the single root 1 when μ ≤ 1, whereas for μ > 1 it has two roots q < 1 and 1. (ii) Since E1 X1 = μ, the Markov and independence properties yield E(Xn+1 | X0 , . . . , Xn ) = Xn (E1 X1 ) = μ Xn , n ∈ N. Dividing both sides by μn+1 , we conclude that the sequence Mn = μ−n Xn is a martingale. Since it is non-negative and hence L1 -bounded, it converges a.s. by Theorem 9.19. (iii) If μ ≤ 1, then Xn → 0 a.s. by (i), and so E1 γ = 0. If instead μ > 1 and σ 2 < ∞, then Lemma 13.15 (ii) yields σ2 1 − μ−n → < ∞, Var(Mn ) = σ 2 μ(μ − 1) μ(μ − 1) and so (Mn ) is uniformly integrable, which implies E1 γ = E1 M0 = 1. Furthermore, P1 {γ = 0} = E1 PX1 {γ = 0} 

= E1 P1 {γ = 0} 



X1



= f P1 {γ = 0} , which shows that even P1 {γ = 0} is a root of the equation f (s) = s. It can’t be 1 since E1 γ = 1, so in fact P1 {γ = 0} = P1 {Xn → 0}. Since also 2 {Xn → 0} ⊂ {γ = 0}, the two events agree a.s. The simplest case is when X proceeds by binary splitting, so that the offspring distribution is confined to the values 0 and 2. To obtain a continuoustime Markov chain, we may choose the life lengths until splitting or death to be independent and exponentially distributed with constant mean. This makes X = (Xt ) a birth & death process with rates nλ and nμ, and we say that X is critical of rate 1 if λ = μ = 1. We also consider the Yule process, where λ = 1 and μ = 0, so that X is non-decreasing and no deaths occur.

290

Foundations of Modern Probability

Theorem 13.16 (binary splitting) Let X be a critical, unit rate binary-splitting process, and define pt = (1 + t)−1 . Then (i) P1 {Xt > 0} = pt , (ii) P1

t > 0,



 Xt = n  Xt > 0 = pt (1 − pt )n−1 ,



t > 0, n ∈ N,

(iii) for fixed t > 0, the ancestors of Xt at times s ≤ t form a Yule process Zst , s ∈ [0, t], with rate function (1 + t − s)−1 . Proof, (i) The generating functions s ∈ [0, 1], t > 0,

F (s, t) = E1 sXt ,

satisfy Kolmogorov’s backward equation

2 ∂ F (s, t) = 1 − F (s, t) , ∂t

F (s, 0) = s,

which has the unique solution s + t − st p2t s = qt + 1 + t − st 1 − qt s = qt + Σ p2t qtn−1 sn ,

F (s, t) =

n≥1

where qt = 1 − pt . In particular, P1 {Xt = 0} = F (0, t) = qt . (ii) For any n ∈ N,

 





P1 {Xt = n} P1 {Xt > 0} p2 q n−1 = pt qtn−1 . = t t pt

P1 X t = n  X t > 0 =

(iii) Assuming X0 = 1, we get 



P {Zst = 1} = EP Zst = 1  Xs 

Xs −1 = E Xs pt−s qt−s

= = =

Σ





n−1 n p2s qsn−1 pt−s qt−s

n≥1 p2s pt−s

n(qs qt−s )n−1 Σ n≥1

p2s pt−s 1+t−s = . (1 − qs qt−s )2 (1 + t)2

Writing τ1 = inf{s ≤ t; Zst > 1}, we conclude that

 





 

P τ1 > s  Xt > 0 = P Zst = 1  Xt > 0 1+t−s , = 1+t



13. Jump-Type Chains and Branching Processes

291

which shows that τ1 has the defective probability density (1 + t)−1 on [0, t]. By Proposition 10.23, the associated compensator has density (1 + t − s)−1 . The Markov and branching properties carry over to the process Zst by Theorem 8.12. Using those properties and proceeding recursively, we see that Zst is a 2 Yule process in s ∈ [0, t] with rate function (1 + t − s)−1 . Now let the individual particles perform independent Brownian motions5 in Rd , so that the discrete branching process X = (Xt ) turns into a random branching tree ξ = (ξt ). For fixed t > 0, even the ancestral processes Zst combine into a random branching tree ζ t = (ζst ) on [0, t]. Corollary 13.17 (Brownian branching tree) Let the processes ξt form a critical, unit rate Brownian branching tree in Rd , and define pt = (1 + t)−1 . Then for s ≤ s + h = t, (i) the ancestors of ξt at time s form a ph -thinning ζst of ξs , (ii) ξt is a sum of conditionally independent clusters of age h, rooted at the points of ζst , (iii) for fixed t > 0, the processes ζst , s ∈ [0, t], form a Brownian Yule tree with rate function (1 + t − s)−1 . Similar results hold asymptotically for critical Bienaym´e processes. Theorem 13.18 (survival rate and distribution, Kolmogorov, Yaglom) Let X be a critical Bienaym´e process with Var1 (X1 ) = σ 2 ∈ (0, ∞). Then as n → ∞, (i) n P1 {Xn > 0} → 2 σ −2 , 

−2  (ii) P1 n−1 Xn > r  Xn > 0 → e−2 rσ ,

r ≥ 0.

Our proof is based on an elementary approximation of generating functions: Lemma 13.19 (generating functions) For X as above, define fn (s) = E1 sXn . Writing σ 2 = 2 c we have as n → ∞, uniformly in s ∈ [0, 1), 1 n



1 1 − 1 − fn (s) 1 − s



→ c.

Proof (Spitzer): For the restrictions to Z+ of a binary splitting process X, we get by Theorem 13.16 P1 {X1 = k} = (1 − p)2 pk−1 ,

k ∈ N,

for a constant p ∈ (0, 1), and an easy calculation yields 1 np 1 − = = n c. 1 − fn (s) 1 − s 1−p 5

For the notions of Brownian motion and p -thinnings, see Chapters 14–15.

292

Foundations of Modern Probability

Thus, equality holds in the stated relation with p and c related by c ∈ (0, 1). 1+c

p=

For general X, fix any constants c1 , c2 > 0 with c1 < c < c2 , and let X (1) , X (2) be associated binary splitting processes. By elementary estimates there exist some constants n1 , n2 ∈ Z, such that the associated generating functions satisfy (1)

(2)

fn+n1 ≤ fn ≤ fn+n2 ,

n > |n1 | ∨ |n2 |.

Combining this with the identities for fn(1) , fn(2) , we obtain c1



1 (n + n1 ) ≤ n n

1 1 − 1 − fn (s) 1 − s



≤ c2

(n + n2 ) . n

Here let n → ∞, and then c1 ↑ c and c2 ↓ c.

2

Proof of Theorem 13.18: (i) Taking s = 0 in Lemma 13.19 gives fn (0) → 1. (ii) Writing sn = e−r/n and using the uniformity in Lemma 13.19, we get 1 1 1

≈ + c → + c, n(1 − sn ) r n 1 − fn (sn ) where a ≈ b means a − b → 0. Since also n{1 − fn (0)} → c−1 by (i), we obtain 

 



fn (sn ) − fn (0) 1 − fn (0)

n 1 − fn (sn ) = 1− 1 − fn (0) c 1 → 1 − −1 = , r +c 1 + cr

E1 e−rXn /n  Xn > 0 =

recognized as the Laplace transform of the exponential distribution with mean c−1 . Now use Theorem 6.3. 2 We finally state a basic approximation property of Bienaym´e branching processes, anticipating some weak convergence and SDE theory from Chapters 23 and 31–32. Theorem 13.20 (diffusion approximation, Feller) For a critical Bienaym´e (n) process X with Var1 (X1 ) = 1, consider versions X (n) with X0 = n, and put (n)

Yt

= n−1 X[nt] , (n)

t ≥ 0.

sd

Then Y (n) −→ Y in D R+ , where Y satisfies the SDE 1/2

dYt = Yt

dBt ,

Y0 = 1.

(9)

13. Jump-Type Chains and Branching Processes

293

Note that uniqueness in law holds for (9) by Theorem 33.1 below. By Theorem 33.3 we have even pathwise uniqueness, which implies strong existence by Theorem 32.14. The solutions are known as squared Bessel processes of order 0. (n) d

(n)

Proof (outline): By Theorems 7.2 and 13.18 we have Yt ∼ Y˜t for fixed (n) t > 0, where the variables Y˜t are compound Poisson with characteristic (n) w measures νt → νt , and ν2t (dx) = t−2 e−x/t dx,

x, t > 0.

(n) d

Hence, Lemma 7.8 yields Yt → Yt , where the limit is infinitely divisible in R+ with characteristics (0, νt ). By suitable scaling, we obtain a similar convergence (n) of the general transition kernels L(Ys+t | Ys(n) )x , except that the limiting law μt (x, ·) is now infinitely divisible with characteristics (0, x νt ). We also note that the time-homogeneous Markov property carries over to the limit. By Lemma 7.1 (i) we may write d

Yt = ξ1t + · · · + ξκt t ,

t > 0,

where the ξkt are independent, exponential random variables with mean t/2, and κt is independent and Poisson distributed with mean 2/t. In particular, EYt = (Eκt )(Eξ1t ) = 1, and Theorem 12.9 (iii) yields Var(Yt ) = E κt Var(ξ1t ) + Var(κt ) (E ξ1t )2     2 t 2 2 t 2 = + = t. t 2 t 2 By Theorem 23.11 below, the processes Y (n) are tight in D R+ . Since the jumps of Y (n) are a.s. of order n−1 , every limiting process Y has continuous paths. The previous moment calculations suggest that Y is then a diffusion process with drift 0 and diffusion rate σ 2 (x) = x, which leads to the stated SDE (details omitted). The asserted convergence now follows by Theorem 17.28. 2

Exercises 1. Prove the implication (iii) ⇒ (i) in Theorem 13.6 by direct computation. (Hint: Note that {Nt ≥ n} = {τn ≤ t} for all t and n.) 2. Give a non-computational proof of the implication (i) ⇒ (iii) in Theorem 13.6, using the reverse implication. (Hint: Note as in Theorem 2.19 that τ1 , τ2 , . . . are measurable functions of ξ.) 3. Prove the implication (i) ⇒ (iii) in Theorem 13.6 by direct computation. 4. Show that the Poisson process is the only renewal process that is also a Markov process. 5. Let Xt = ξ[0, t] for a binomial process ξ with ξ = n based on a diffuse probability measure μ on R+ . Show that we can choose μ such that X becomes a Markov process, and determine the corresponding rate kernel.

294

Foundations of Modern Probability

6. For a pure jump-type Markov process in S, show that Px {τ2 ≤ t} = o(t) for all x ∈ S. Also note that the bound can be strengthened to O(t2 ) if the rate function is bounded, but not in general. (Hint: Use Lemma 13.2 and dominated convergence.) 7. Show that a transient, discrete-time Markov chain Y can be embedded into an exploding (resp., non-exploding) continuous-time chain X. (Hint: Use Propositions 11.16 and 13.5.) 8. In Corollary 13.8, use the measurability of the mapping X = Y ◦N to derive the implication (iii) ⇒ (i) from its converse. (Hint: Proceed as in the proof of Theorem 13.6.) 9. Consider a pure jump-type Markov process in (S, S) with transition kernels μt and rate kernel α. For any x ∈ S and B ∈ S, show that α(x, B) = μ˙ 0 (x, B \ {x}). (Hint: Apply Theorem 13.9 with f = 1B\{x} , and use dominated convergence.) 10. Use Theorem 13.9 to derive a system of differential equations for the transition functions pij (t) of a continuous-time Markov chain. (Hint: Take f (i) = δij for fixed j.) 11. Give an example of a positive recurrent, continuous-time Markov chain, such that the embedded discrete-time chain is null-recurrent, and vice versa. (Hint: Use Proposition 13.10.) 12. Prove Theorem 13.12 by a direct argument, mimicking the proof of Theorem 11.22. 13. Let X be a Bienaym´e process as in Theorem 13.14. Determine the extinction probability and asymptotic growth rate if instead we start with m individuals. 14. Let X = (Xt ) be a binary-splitting process as in Theorem 13.16, and define Yn = Xnh for fixed h > 0. Show that Y = (Yn ) is a critical Bienaym´e process, and find the associated offspring distribution and asymptotic survival rate. 15. Show that the binary-splitting process in Theorem 13.16 is a pure jump-type Markov process, and calculate the associated rate kernel. 16. For a critical Bienaym´e process with Var1 (X1 ) = σ 2 < ∞, find the asymptotic behavior of Pm {Xn > 0} and Pm {n−1 Xn > r} as n → ∞ for fixed m ∈ N. 17. How will Theorem 13.20 change if we assume instead that Var1 (X1 ) = σ 2 > 0?

V. Some Fundamental Processes The Brownian motion and Poisson processes constitute the basic building blocks of probability theory, and both classes of processes enjoy a wealth of remarkable properties of constant use throughout the subject. L´evy processes are continuous-time counterparts of random walks, representable as mixtures of Brownian and Poisson processes. They may also be regarded as prototypes of Feller processes, which form a broad class of Markov processes, restricted only by some natural regularity assumptions that simplify the analysis. Most of the included material is of fundamental importance, though some later parts of especially Chapters 16–17 are more advanced and might be postponed until a later stage of study. −−− 14. Gaussian processes and Brownian motion. After a preliminary study of general Gaussian processes, we introduce Brownian motion and derive some of its basic properties, including the path regularity, quadratic variation, reflection principle, the arcsine laws, and the laws of the iterated logarithm. The chapter concludes with a study of multiple Wiener–Itˆo integrals and the associated chaos expansion of Gaussian functionals. Needless to say, the basic theory of Brownian motion is of utmost importance. 15. Poisson and related processes. The basic theory of Poisson and related processes is equally important. Here we note in particular the mapping and marking theorems, the relations to binomial processes, and the roles of Cox transforms and thinnings. The independence properties lead to a Poisson representation of random measures with independent increments. We also note how a simple point process can be transformed to Poisson through a random time change derived from the compensator. 16. Independent-increment and L´ evy processes. Here we note how processes with independent increments are composed of Gaussian and Poisson variables, reducing in the time-homogeneous case to mixtures of Brownian motions and Poisson processes. The resulting processes may be regarded as both continuous-time counterparts of random walks and as space-homogeneous Feller processes. After discussing some related approximation theorems, we conclude with an introduction to tangential processes. 17. Feller processes and semi-groups. The Feller processes form a general class of Markov processes, broad enough to cover most applications of interest, and yet restricted by some regularity conditions ensuring technical flexibility and convenience. Here we study the underlying semigroup theory, prove the general path regularity and strong Markov property, and establish some basic approximation and convergence theorems. Though considerably more advanced, this material is clearly of fundamental importance.


Chapter 14

Gaussian Processes and Brownian Motion Covariance and independence, rotational symmetry, isonormal Gaussian process, independent increments, Brownian motion and bridge, scaling and inversion, Gaussian Markov processes, quadratic variation, path irregularity, strong Markov and reflection properties, Bessel processes, maximum process, arcsine and uniform laws, laws of the iterated logarithm, Wiener integral, spectral and moving-average representations, Ornstein–Uhlenbeck process, multiple Wiener–Itˆ o integrals, chaos expansion of variables and processes

Here we initiate the study of Brownian motion, arguably the single most important object of modern probability. Indeed, we will see in Chapters 22–23 how the Gaussian limit theorems of Chapter 6 extend to pathwise approximations of broad classes of random walks and discrete-time martingales by a Brownian motion. In Chapter 19 we show how every continuous local martingale can be represented as a time-changed Brownian motion. Similarly, we will see in Chapters 32–33 how large classes of diffusion processes may be constructed from Brownian motion by various pathwise transformations. Finally, the close relationship between Brownian motion and classical potential theory is explored in Chapter 34. The easiest construction of Brownian motion is via an isonormal Gaussian process on L2 (R+ ), whose existence is a consequence of the characteristic spherical symmetry of multi-variate Gaussian distributions. Among the many basic properties of Brownian motion, this chapter covers the H¨older continuity and existence of quadratic variation, the strong Markov and reflection properties, the three arcsine laws, and the law of the iterated logarithm. The values of an isonormal Gaussian process on L2 (R+ ) can be identified with integrals of L2 -functions with respect to an associated Brownian motion. Many processes of interest have representations in terms of such integrals, and in particular we consider spectral and moving-average representations of stationary Gaussian processes. More generally, we introduce the multiple Wiener– Itˆo integrals ζ n f of functions f ∈ L2 (Rn+ ), and establish the fundamental chaos expansion of Brownian L2 -functionals. The present material is related to practically every other chapter in the book. Thus, we refer to Chapter 6 for the definition of Gaussian distributions and the basic Gaussian limit theorem, to Chapter 8 for the transfer theorem, to Chapter 9 for properties of martingales and optional times, to Chapter 11 © Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1_15




for basic facts about Markov processes, to Chapter 12 for similarities with random walks, to Chapter 27 for some basic symmetries, and to Chapter 15 for analogies with the Poisson process. Our study of Brownian motion per se is continued in Chapter 19 with the basic recurrence or transience dichotomy, some further invariance properties, and a representation of Brownian martingales. Brownian local time and additive functionals are studied in Chapter 29. In Chapter 34 we consider some basic properties of Brownian hitting distributions, and examine the relationship between excessive functions and additive functionals of Brownian motion. A further discussion of multiple integrals and chaos expansions appears in Chapters 19 and 21. To begin with some basic definitions, we say that a process X on a parameter space T is Gaussian, if the random variable c1 Xt1 + · · · + cn Xtn is Gaussian1 for every choice of n ∈ N, t1 , . . . , tn ∈ T , and c1 , . . . , cn ∈ R. This holds in particular if the Xt are independent Gaussian random variables. A Gaussian process X is said to be centered if EXt = 0 for all t ∈ T . We also say that the processes X k on Tk , k ∈ I, are jointly Gaussian, if the combined process X = {Xtk ; t ∈ Tk , k ∈ I} is Gaussian. The latter condition is certainly fulfilled when the Gaussian processes X k are independent. Some simple facts clarify the fundamental role of the covariance function. As usual, we take all distributions to be defined on the σ-field generated by all evaluation maps. Lemma 14.1 (covariance and independence) Let X and X k , k ∈ I, be jointly Gaussian processes on an index set T . Then (i) the distribution of X is uniquely determined by the functions mt = EXt ,

rs,t = Cov(Xs, Xt),   s, t ∈ T,
(ii) the processes X^k are independent² iff
Cov(X^j_s, X^k_t) = 0,   s ∈ Tj, t ∈ Tk,  j ≠ k in I.

Proof: (i) If the processes X, Y are Gaussian with the same mean and covariance functions, then the mean and variance agree for the variables c1 Xt1 + · · · + cn Xtn and c1 Yt1 + · · · + cn Ytn for any c1 , . . . , cn ∈ R and t1 , . . . , tn ∈ T , n ∈ N. Since the latter are Gaussian, their distributions must then agree. d Hence, the Cram´er–Wold theorem yields (Xt1 , . . . , Xtn ) = (Yt1 , . . . , Ytn ) for d any t1 , . . . , tn ∈ T , n ∈ N, and so X = Y by Proposition 4.2. (ii) Assume the stated condition. To prove the asserted independence, we may take I to be finite. Introduce some independent processes Y k , k ∈ I, with the same distributions as the X k , and note that the combined processes 1

¹ also said to be normally distributed
² This (mutual) independence must not be confused with independence between the values of X^k = (X^k_t) for each k.



X = (X^k) and Y = (Y^k) have the same means and covariances. Hence, their joint distributions agree by part (i). In particular, the independence between the processes Y^k implies the corresponding property for the processes X^k.  □
The multi-variate Gaussian distributions are characterized by a simple symmetry property:
Proposition 14.2 (spherical symmetry, Maxwell) For independent random variables ξ1, . . . , ξd with d ≥ 2, these conditions are equivalent:
(i) L(ξ1, . . . , ξd) is spherically symmetric,
(ii) the ξk are i.i.d. centered Gaussian.
Proof, (i) ⇒ (ii): Assuming (i), let ϕ be the common characteristic function of ξ1, . . . , ξd. Noting that −ξ1 =d ξ1, we see that ϕ is real and symmetric. Since also s ξ1 + t ξ2 =d ξ1 √(s² + t²), we obtain the functional equation
ϕ(s) ϕ(t) = ϕ(√(s² + t²)),   s, t ∈ R.
This is a Cauchy equation for the function ψ(x) = ϕ(|x|^{1/2}), and we get ϕ(t) = e^{−ct²} for a constant c, which is positive since |ϕ| ≤ 1.
(ii) ⇒ (i): Assume (ii). By scaling we may take the ξk to be i.i.d. N(0, 1). Then (ξ1, . . . , ξd) has the joint density

∏_{k≤d} (2π)^{−1/2} e^{−x_k²/2} = (2π)^{−d/2} e^{−|x|²/2},   x ∈ R^d,
and the spherical symmetry follows from that of |x|² = x1² + · · · + xd².  □
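The rotation invariance in Proposition 14.2 is also easy to observe numerically. The following is a minimal sketch, not part of the text; the dimension, sample size, seed, and the particular orthogonal matrix are arbitrary choices made only for illustration.

```python
# Numerical illustration of Proposition 14.2: for i.i.d. centered Gaussian
# coordinates, the joint law is invariant under orthogonal transformations.
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200_000
xi = rng.standard_normal((n, d))              # i.i.d. N(0, 1) coordinates

# a fixed orthogonal matrix T (QR factor of a random matrix)
T, _ = np.linalg.qr(rng.standard_normal((d, d)))
eta = xi @ T.T                                # rotated sample

# both samples should show near-identity covariance and matching radius quantiles
print(np.round(np.cov(xi, rowvar=False), 3))
print(np.round(np.cov(eta, rowvar=False), 3))
print(np.quantile(np.linalg.norm(xi, axis=1), [0.25, 0.5, 0.75]))
print(np.quantile(np.linalg.norm(eta, axis=1), [0.25, 0.5, 0.75]))
```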

In infinite dimensions, the Gaussian property is essentially a consequence of the spherical symmetry alone, without any requirement of independence. Theorem 14.3 (unitary invariance, Schoenberg, Freedman) For an infinite sequence of random variables ξ1 , ξ2 , . . . , these conditions are equivalent: (i) L(ξ1 , . . . , ξn ) is spherically symmetric for every n ∈ N, (ii) the ξk are conditionally i.i.d. N (0, σ 2 ), given a random variable σ 2 ≥ 0. Proof: Under (i) the ξk are exchangeable, and so by de Finetti’s Theorem 27.2 below they are conditionally μ -i.i.d. for a random probability measure μ on R. By the law of large numbers, μB = n→∞ lim n−1

Σ_{k≤n} 1{ξ_k ∈ B}  a.s.,   B ∈ B,

and in particular μ is a.s. {ξ3 , ξ4 , . . .} -measurable. By spherical symmetry, we have for any orthogonal transformation T on R2 

 





 



P{ (ξ1, ξ2) ∈ B | ξ3, . . . , ξn } = P{ T(ξ1, ξ2) ∈ B | ξ3, . . . , ξn },   B ∈ B(R²).



As n → ∞, we get μ² = μ² ∘ T⁻¹ a.s. Applying this to a countable, dense set of mappings T, we see that the exceptional null set can be chosen to be independent of T. Thus, μ² is a.s. spherically symmetric, and so μ is a.s. centered Gaussian by Proposition 14.2. It remains to take σ² = ∫ x² μ(dx).  □
Now fix a real, separable Hilbert space³ H with inner product ⟨·, ·⟩. By an isonormal Gaussian process on H we mean a centered Gaussian process ζh, h ∈ H, such that E(ζh ζk) = ⟨h, k⟩ for all h, k ∈ H. For its existence, fix any ortho-normal basis (ONB) h1, h2, . . . ∈ H, and let ξ1, ξ2, . . . be i.i.d. N(0, 1). Then for any element h = Σ_j b_j h_j we define ζh = Σ_j b_j ξ_j, where the series converges a.s. and in L², since Σ_j b_j² < ∞. Note that ζ is centered Gaussian, and also linear, in the sense that
ζ(ah + bk) = a ζh + b ζk  a.s.,   h, k ∈ H,  a, b ∈ R.
If also k = Σ_j c_j h_j, we obtain
E(ζh ζk) = Σ_{i,j} b_i c_j E(ξ_i ξ_j) = Σ_i b_i c_i = ⟨h, k⟩.

By Lemma 14.1, the stated conditions determine uniquely the distribution of ζ. In particular, the symmetry in Proposition 14.2 extends to a distributional invariance of ζ under unitary transformations on H.
The Gaussian distributions arise naturally in the context of processes with independent increments. A similar Poisson criterion is given in Theorem 15.10.
Theorem 14.4 (independent increments, Lévy) Let X be a continuous process in R^d with X0 = 0. Then these conditions are equivalent:
(i) X has independent increments,
(ii) X is Gaussian, and there exist some continuous functions b in R^d and a in R^{d²}, where a has non-negative definite increments, such that
L(Xt − Xs) = N(bt − bs, at − as),   s < t.

Proof, (i) ⇒ (ii): Fix any s < t in R+ and u ∈ R^d. For every n ∈ N, divide the interval [s, t] into n sub-intervals of equal length, and let ξn1, . . . , ξnn be the corresponding increments of uX. Then maxj |ξnj| → 0 a.s. since X is continuous, and so the sum u(Xt − Xs) = Σj ξnj is Gaussian by Theorem 6.16. Since the increments of X are independent, the entire process X is then Gaussian. Writing bt = EXt and at = Cov(Xt), we get by linearity and independence
E(Xt − Xs) = EXt − EXs = bt − bs,
Cov(Xt − Xs) = Cov(Xt) − Cov(Xs) = at − as,   s < t.
³ See Appendix 3 for some basic properties of Hilbert spaces.



The continuity of X yields Xs →d Xt as s → t, and so bs → bt and as → at. Thus, both functions are continuous.
(ii) ⇒ (i): Use Lemma 14.1.  □

It is not so obvious that continuous processes such as in Theorem 14.4 really exist. If X has stationary, independent increments, the mean and covariance functions would clearly be linear. For d = 1 the simplest choice is to take b = 0 and at = t, so that Xt − Xs is N (0, t − s) for all s < t. We prove the existence of such a process and estimate its local modulus of continuity.

Theorem 14.5 (Brownian motion, Thiele, Bachelier, Wiener) There exists a process B on R+ with B0 = 0, such that
(i) B has stationary, independent increments,
(ii) Bt is N(0, t) for every t ≥ 0,
(iii) B is a.s. locally Hölder continuous of order p, for every p ∈ (0, 1/2).
In Theorem 14.18 we will see that the stated order of Hölder continuity can't be improved. Related results are given in Proposition 14.10 and Lemma 22.7 below.
Proof: Let ζ be an isonormal Gaussian process on L²(R+, λ), and introduce a process Bt = ζ1[0,t], t ≥ 0. Since indicator functions of disjoint intervals are orthogonal, the increments of B are uncorrelated and hence independent. Furthermore, ‖1(s,t]‖² = t − s for any s ≤ t, and so Bt − Bs is N(0, t − s). For any s ≤ t we get
Bt − Bs =d Bt−s =d (t − s)^{1/2} B1,    (1)
whence E|Bt − Bs|^p = (t − s)^{p/2} E|B1|^p < ∞ for every p > 0. The asserted Hölder continuity now follows by Theorem 4.23.  □
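As a numerical aside (not from the text), the construction in Theorem 14.5 can be imitated on a grid by summing i.i.d. Gaussian increments; the grid size, horizon, and seed below are arbitrary choices.

```python
# Sketch: simulate Brownian paths from i.i.d. Gaussian increments on a grid
# and check the N(0, t) marginals and the stationarity of increments.
import numpy as np

rng = np.random.default_rng(1)
n_paths, n_steps, T = 50_000, 1_000, 1.0
dt = T / n_steps
dB = rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt)
B = np.cumsum(dB, axis=1)                     # B at times dt, 2dt, ..., T

t_idx = n_steps // 2                          # t = 0.5
print("Var(B_t)      ~", B[:, t_idx].var(), "  expected", (t_idx + 1) * dt)
print("Var(B_T - B_t) ~", (B[:, -1] - B[:, t_idx]).var(),
      "  expected", T - (t_idx + 1) * dt)
```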

A process B as in Theorem 14.5 is called a (standard) Brownian motion 4 . A d -dimensional Brownian motion is a process B = (B 1 , . . . , B d ) in Rd , where B 1 , . . . , B d are independent, one-dimensional Brownian motions. By Proposition 14.2, the distribution of B is invariant under orthogonal transformations of Rd . Furthermore, any continuous process X in Rd with stationary independent increments and X0 = 0 can be written as Xt = b t + aBt for some vector b and matrix a. Other important Gaussian processes can be constructed from a Brownian motion. For example, we define a Brownian bridge as a process on [0, 1] distributed as Xt = Bt − tB1 , t ∈ [0, 1]. An easy computation shows that X has covariance function rs,t = s(1 − t), 0 ≤ s ≤ t ≤ 1. 4

⁴ More appropriately, it is also called a Wiener process.
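A short companion sketch (again not from the text): forming Xt = Bt − tB1 from simulated paths and comparing its empirical covariance with s(1 − t). All numerical parameters are arbitrary.

```python
# Sketch: the Brownian bridge X_t = B_t - t B_1 has covariance s(1 - t), s <= t.
import numpy as np

rng = np.random.default_rng(2)
n_paths, n_steps = 100_000, 500
dt = 1.0 / n_steps
B = np.cumsum(rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt), axis=1)
t_grid = dt * np.arange(1, n_steps + 1)
X = B - t_grid * B[:, -1:]                    # bridge values on the grid

i, j = n_steps // 4, 3 * n_steps // 4         # s ~ 0.25, t ~ 0.75
s, t = t_grid[i], t_grid[j]
emp = np.mean(X[:, i] * X[:, j])
print("empirical cov:", emp, "  theoretical s(1 - t):", s * (1 - t))
```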



The Brownian motion and bridge have many nice symmetry properties. For example, if B is a Brownian motion, then so is −B, as well as the process r−1 B(r2 t) for any r > 0. The latter transformations, known as Brownian scaling, are especially useful. We also note that, for any u > 0, the processes Bu±t − Bu are Brownian motions on R+ and [0, u], respectively. If B is instead a Brownian bridge, then so are the processes −Bt and B1−t . We list some less obvious invariance properties. Further, possibly random mappings preserving the distribution of a Brownian motion or bridge are exhibited in Theorem 14.11, Lemma 14.14, and Proposition 27.16. Lemma 14.6 (scaling and inversion) (i) If B is a Brownian motion, then so is the process tB 1/t , whereas these processes are Brownian bridges: (1 − t)B t/(1−t) ,

t B_{(1−t)/t},   t ∈ (0, 1);
(ii) If B is a Brownian bridge, then so is the process B_{1−t}, whereas these processes are Brownian motions:
(1 + t) B_{t/(1+t)},   (1 + t) B_{1/(1+t)},   t ≥ 0.

Proof: Since the mentioned processes are centered Gaussian, it suffices by Lemma 14.1 to verify that they have the required covariance functions. This is clear from the expressions s ∧ t and (s ∧ t)(1 − s ∨ t) for the covariance functions of the Brownian motion and bridge.  □
Combining Proposition 11.5 with Theorem 14.4, we note that any space- and time-homogeneous, continuous Markov process in R^d has the form aBt + bt + c for a Brownian motion B in R^d, a d × d matrix a, and some vectors b, c ∈ R^d. We turn to a general characterization of Gaussian Markov processes. As before, we use the convention 0/0 = 0.
Proposition 14.7 (Markov property) Let X be a Gaussian process on an index set T ⊂ R, and put rs,t = Cov(Xs, Xt). Then X is Markov iff
rs,u = rs,t rt,u / rt,t,   s ≤ t ≤ u in T.    (2)
If X is also stationary on T = R, then rs,t = a e^{−b|s−t|} for some constants a ≥ 0 and b ∈ [0, ∞].
Proof: Subtracting the means, if necessary, we may take EXt ≡ 0. Fixing any t ≤ u in T, we may choose a ∈ R such that X'_u ≡ Xu − aXt ⊥ Xt. Then a = rt,u/rt,t when rt,t ≠ 0, and if rt,t = 0 we may take a = 0. By Lemma 14.1 we get X'_u ⊥⊥ Xt.
Now let X be Markov, and fix any s ≤ t ≤ u. Then Xs ⊥⊥_{Xt} Xu, and so Xs ⊥⊥_{Xt} X'_u. Since also Xt ⊥⊥ X'_u by the choice of a, Proposition 8.12 yields Xs ⊥⊥ X'_u. Hence, rs,u = a rs,t, and (2) follows as we insert the expression for a.



Conversely, (2) implies Xs ⊥ X'_u for all s ≤ t, and so Ft ⊥⊥ X'_u by Lemma 14.1, where Ft = σ{Xs; s ≤ t}. Then Proposition 8.12 yields Ft ⊥⊥_{Xt} Xu, which is the required Markov property of X at t.
If X is stationary, we have rs,t = r_{|s−t|,0} = r_{|s−t|}, and (2) reduces to the Cauchy equation r0 r_{s+t} = rs rt, s, t ≥ 0, allowing only the bounded solutions rt = a e^{−bt}.  □
A continuous, centered Gaussian process X on R with covariance function rs,t = e^{−|t−s|} is called a stationary Ornstein–Uhlenbeck process⁵. Such a process can be expressed as Xt = e^{−t} B(e^{2t}), t ∈ R, in terms of a Brownian motion B. The previous result shows that the Ornstein–Uhlenbeck process is essentially the only stationary Gaussian process that is also a Markov process.
We consider some basic path properties of Brownian motion.
Lemma 14.8 (level sets) For a Brownian motion or bridge B,



λ{t; Bt = u} = 0  a.s.,   u ∈ R.

Proof: Defining Xtn = B[nt]/n , t ∈ R+ or [0, 1], n ∈ N, we note that Xtn → Bt for every t. Since the processes X n are product-measurable on Ω × R+ or Ω × [0, 1], so is B. Hence, Fubini’s theorem yields



E λ{t; Bt = u} = ∫ P{Bt = u} dt = 0,   u ∈ R.  □

Next we show that Brownian motion has locally finite quadratic variation. Extensions to general semi-martingales appear in Proposition 18.17 and Corollary 20.15. A sequence of partitions is said to be nested, if each of them is a refinement of the previous one. Theorem 14.9 (quadratic variation, L´evy) Let B be a Brownian motion, and fix any t > 0. Then for any partitions 0 = tn,0 < tn,1 < · · · < tn,kn = t with hn ≡ maxk (tn,k − tn,k−1 ) → 0, we have 

ζn ≡ Σ_k (B_{t_{n,k}} − B_{t_{n,k−1}})² → t  in L².    (3)
The convergence holds a.s. when the partitions are nested.
Proof (Doob): To prove (3), we see from the scaling property Bt − Bs =d |t − s|^{1/2} B1 that
Eζn = Σ_k E(B_{t_{n,k}} − B_{t_{n,k−1}})² = Σ_k (t_{n,k} − t_{n,k−1}) EB1² = t,
Var(ζn) = Σ_k Var(B_{t_{n,k}} − B_{t_{n,k−1}})² = Σ_k (t_{n,k} − t_{n,k−1})² Var(B1²) ≤ hn t EB1⁴ → 0.

⁵ In Chapters 21 and 32 we use a slightly different normalization.
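The L²-convergence in (3) is easy to observe in simulation. The following sketch is an illustration only, not part of the text; the horizon, partition sizes, and seed are arbitrary.

```python
# Sketch: sums of squared Brownian increments over finer partitions of [0, t]
# approach t, as in (3) of Theorem 14.9.
import numpy as np

rng = np.random.default_rng(3)
t, n_max = 2.0, 2**18
dB = rng.standard_normal(n_max) * np.sqrt(t / n_max)
B = np.concatenate(([0.0], np.cumsum(dB)))    # one fine path on [0, t]

for k in (2**8, 2**12, 2**16, 2**18):         # coarser partitions of the same path
    step = n_max // k
    incr = B[::step][1:] - B[::step][:-1]
    print(f"n = {k:7d}   sum of squared increments = {np.sum(incr**2):.4f}")
# the printed values should approach t = 2.0 as the mesh size decreases
```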



For nested partitions, the a.s. convergence follows once we show that the sequence (ζn ) is a reverse martingale, that is 





E[ ζn−1 − ζn | ζn, ζn+1, . . . ] = 0  a.s.,   n ∈ N.    (4)

Inserting intermediate partitions if necessary, we may assume that kn = n for all n. Then there exist some t1 , t2 , . . . ∈ [0, t], such that the n-th partition has division points t1 , . . . , tn . To verify (4) for a fixed n, we introduce an auxiliary random variable ϑ⊥ ⊥B with P {ϑ = ±1} = 12 , and replace B by the Brownian motion Bs = Bs∧tn + ϑ (Bs − Bs∧tn ), s ≥ 0. Since the sums ζn , ζn+1 , . . . for B  agree with those for B, whereas ζn −ζn−1 is replaced by ϑ(ζn −ζn−1 ), it suffices to show that E{ϑ(ζn −ζn−1 ) | ζn , ζn+1 , . . .} = 0 a.s. This is clear from the choice of ϑ, if we first condition on ζn−1 , ζn , . . . . 2 The last result shows that the linear variation of B is locally unbounded, which is why we can’t define the stochastic integral V dB as an ordinary Stieltjes integral, prompting us in Chapter 18 to use a more sophisticated approach. We give some more precise information about the irregularity of paths, supplementing the regularity property in Theorem 14.5. Proposition 14.10 (path irregularity, Paley, Wiener, & Zygmund) Let B be a Brownian motion in R, and fix any p > 12 . Then a.s. (i) B is nowhere 6 H¨older continuous of order p, (ii) B is nowhere differentiable. Proof: (i) Fix any p > 12 and m ∈ N. If Bω is H¨older continuous of order p at a point tω ∈ [0, 1], then for s ∈ [0, 1] with |s − t| ≤ mh we have |Δh Bs | ≤ |Bs − Bt | + |Bs+h − Bt | < (mh)p < hp .



Applying this to m fixed, disjoint intervals of length h between the bounds t ± mh, we see that the probability of the stated H¨older continuity is bounded by expressions of the form 

h−1 P |Bh | ≤ c hp

m



= h−1 P |B1 | ≤ c hp−1/2

m

< h−1 hm(p−1/2) ,

for some constants c > 0. Choosing m > (p − 12 )−1 , we see that the right-hand side tends to 0 as h → 0. 6 Thus, for any ω outside a P -null set, there is no point t ≥ 0 where the path B(ω, ·) has the stated regularity property. This is clearly stronger than the corresponding statement for fixed points t ≥ 0 or intervals I ⊂ R+ . Since the stated events may not be measurable, the results should be understood in the sense of inner measure or completion.

14. Gaussian Processes and Brownian Motion

305

(ii) If Bω is differentiable at a point tω ≥ 0, it is Lipschitz continuous at t, hence H¨older continuous of order 1, which contradicts (i) for p = 1. 2 Proposition 11.5 shows that Brownian motion B is a space-homogeneous Markov process with respect to its induced filtration. If the Markov property holds for a more general filtration F = (Ft ) —i.e., if B is F-adapted and such that the process Bt = Bs+t − Bs is independent of Fs for each s ≥ 0 —we say that B is a Brownian motion with respect to F or an F-Brownian motion. In particular, we may take Ft = Gt ∨ N , t ≥ 0, where G is the filtration induced by B and N = σ{N ⊂ A; A ∈ A, P A = 0}. With this choice, F becomes right-continuous by Corollary 9.26. We now extend the Markov property of B to suitable optional times. A more general version of this result appears in Theorem 17.17. As in Chapter 9 we write Ft+ = Ft+ . Theorem 14.11 (strong Markov property, Hunt) For an F-Brownian motion B in Rd and an F + -optional time τ < ∞, we have these equivalent properties: (i) B has the strong F + -Markov property at τ , d (ii) the process Bt = Bτ +t − Bτ satisfies B = B  ⊥⊥Fτ+ . Proof: (ii) As in Lemma 9.4, there exist some optional times τn → τ  with τn ≥ τ + 2−n taking countably many values. Then Fτ+ ⊂ n Fτn by Lemmas 9.1 and 9.3, and so by Proposition 11.9 and Theorem 11.10, each process Btn = Bτn +t − Bτn , t ≥ 0, is a Brownian motion independent of Fτ+ . The continuity of B yields Btn → Bt a.s. for every t. For any A ∈ Fτ+ and t1 , . . . , tk ∈ R+ , k ∈ N, and for bounded, continuous functions f : Rk → R, we get by dominated convergence



E f (Bt1 , . . . , Btk ); A = Ef (Bt1 , . . . , Btk ) · P A. The general relation L(B  ; A) = L(B) · P A now follows by a straightforward extension argument. 2 d
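As an illustration of the strong Markov property just proved (not from the text), one can restart a simulated path at the first passage time of a level a and check that the shifted increments again look Brownian and are essentially uncorrelated with the pre-τ data; the level, grid, and sample sizes below are arbitrary.

```python
# Sketch: for tau = first passage of level a, B'_s = B_{tau+s} - B_tau should
# again have Var ~ s and be (roughly) uncorrelated with tau, cf. Theorem 14.11.
import numpy as np

rng = np.random.default_rng(4)
n_paths, n_steps, dt, a = 20_000, 4_000, 1e-3, 0.5
dB = rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt)
B = np.cumsum(dB, axis=1)

hit = (B >= a).argmax(axis=1)                 # index of first passage of level a
rows = np.where(B.max(axis=1) >= a)[0]        # keep only paths that actually hit a
k = 500                                       # look k steps beyond tau
rows = rows[hit[rows] + k < n_steps]
Bp = B[rows, hit[rows] + k] - B[rows, hit[rows]]   # B'_{k dt}

print("Var(B'_s) ~", Bp.var(), "  expected", k * dt)
print("corr(B'_s, tau):", np.corrcoef(Bp, hit[rows] * dt)[0, 1], "  (should be ~ 0)")
```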

When B is a Brownian motion in Rd , the process X = |B| is called a Bessel process of order d. More general Bessel processes are obtained as solutions to suitable SDEs. We show that |B| inherits the strong Markov property from B. Corollary 14.12 (Bessel processes) For an F-Brownian motion B in Rd , |B| is a strong F + -Markov process. d

Proof: By Theorem 14.11, it is enough to show that |B + x| = |B + y| whenever |x| = |y|. Then choose an orthogonal transformation T on Rd with T x = y, and note that |B + x| = |T (B + x)| d

= |T B + y| = |B + y|.

2



The strong Markov property can be used to derive the distribution of the maximum of Brownian motion up to a fixed time. A stronger result will be obtained in Corollary 29.3. Proposition 14.13 (maximum process, Bachelier) Let B be a Brownian motion in R, and define Mt = sups≤t Bs , t ≥ 0. Then d

d

Mt =d Mt − Bt =d |Bt|,   t ≥ 0.

Our proof relies on the following continuous-time counterpart of Lemma 12.11, where the given Brownian motion B is compared with a suitably reflected version B̃.
Lemma 14.14 (reflection principle) For a Brownian motion B and an associated optional time τ, we have B =d B̃ with
B̃t = Bt∧τ − (Bt − Bt∧τ),   t ≥ 0.

Proof: It suffices to compare the distributions up to a fixed time t, and so we may assume that τ < ∞. Define B^τ_t = B_{τ∧t} and B'_t = B_{τ+t} − B_τ. By Theorem 14.11, the process B' is a Brownian motion independent of (τ, B^τ). Since also −B' =d B', we get (τ, B^τ, B') =d (τ, B^τ, −B'), and it remains to note that
Bt = B^τ_t + B'_{(t−τ)+},   B̃t = B^τ_t − B'_{(t−τ)+},   t ≥ 0.  □
Proof of Proposition 14.13: By scaling we may take t = 1. Applying Lemma 14.14 with τ = inf{t; Bt = x} gives







P{ M1 ≥ x, B1 ≤ y } = P{ B̃1 ≥ 2x − y },   x ≥ y ∨ 0.

By differentiation, the joint distribution of (M1, B1) has density −2φ'(2x − y), where φ denotes the standard normal density. Changing variables, we conclude that (M1, M1 − B1) has density −2φ'(x + y), x, y ≥ 0. In particular, M1 and M1 − B1 have the same marginal density 2φ(x) on R+.  □
To prepare for the next major result, we prove another elementary sample path property.
Lemma 14.15 (local extremes) The local maxima and minima of a Brownian motion or bridge are a.s. distinct.
Proof: For a Brownian motion B and any intervals I = [a, b] and J = [c, d] with b < c, we may write
sup_{t∈J} Bt − sup_{t∈I} Bt = sup_{t∈J} (Bt − Bc) + (Bc − Bb) − sup_{t∈I} (Bt − Bb).



Here the distribution of second term on the right is diffuse, which extends by independence to the whole expression. In particular, the difference on the left is a.s. non-zero. Since I and J are arbitrary, this proves the result for local maxima. The proofs for local minima and the mixed case are similar. The result for the Brownian bridge B o follows from that for Brownian motion, since the distributions of the two processes are mutually absolutely continuous on every interval [0, t] with t < 1. To see this, construct from B and B o the corresponding ‘bridges’ Xs = Bs − s t−1 Bt Ys = Bso − s t−1 Bto ,

s ∈ [0, t],

d

and check that Bt ⊥ ⊥ X = Y ⊥⊥ Bto . The stated equivalence now follows since N (0, t) ∼ N (0, t(1 − t)) for any t ∈ [0, 1). 2 The arcsine law may be defined7 as the distribution of ξ = sin2 α when α is U (0, 2π). Its name derives from the fact that √

P{ξ ≤ t} = P{ |sin α| ≤ √t } = (2/π) arcsin √t,   t ∈ [0, 1].
Note that the arcsine distribution is symmetric about 1/2, since
ξ = sin²α =d cos²α = 1 − sin²α = 1 − ξ.
We consider three basic functionals of Brownian motion, which are all arcsine distributed.
Theorem 14.16 (arcsine laws, Lévy) For a Brownian motion B on [0, 1] with maximum M1, these random variables are all arcsine distributed:
τ1 = λ{t; Bt > 0},   τ2 = inf{t; Bt = M1},   τ3 = sup{t; Bt = 0}.

The relations τ1 = τ2 = τ3 may be compared with the discrete-time analogues in Theorem 12.12 and Corollary 27.8. In Theorems 16.18 and 22.11, the arcsine laws are extended by approximation to appropriate random walks and L´evy processes. d

Proof (OK): To see that τ1 = τ2 , let n ∈ N, and conclude from Corollary 27.8 below that n−1 7

Σ_{k≤n} 1{B_{k/n} > 0} =d n^{−1} min{ k ≥ 0; B_{k/n} = max_{j≤n} B_{j/n} }.
⁷ Equivalently, we may take α to be U(0, π) or U(0, π/2).



By Lemma 14.15 the right-hand side tends a.s. to τ2 as n → ∞. To see that the left-hand side converges to τ1 , we note that by Lemma 14.8







λ{t ∈ [0, 1]; Bt > 0} + λ{t ∈ [0, 1]; Bt < 0} = 1  a.s.
It remains to note that, for any open set G ⊂ [0, 1],
liminf_{n→∞} n^{−1} Σ_{k≤n} 1_G(k/n) ≥ λG.

In case of τ2 , fix any t ∈ [0, 1], let ξ and η be independent N (0, 1), and let α be U (0, 2π). Using Proposition 14.13 and the circular symmetry of the distribution of (ξ, η), we get



P{τ2 ≤ t} = P{ sup_{s≤t}(Bs − Bt) ≥ sup_{s≥t}(Bs − Bt) }
= P{ |Bt| ≥ |B1 − Bt| }
= P{ t ξ² ≥ (1 − t) η² }
= P{ η²/(ξ² + η²) ≤ t }
= P{ sin²α ≤ t }.
In case of τ3, we write instead







P{τ3 < t} = P{ sup_{s≥t} Bs < 0 } + P{ inf_{s≥t} Bs > 0 }
= 2 P{ sup_{s≥t}(Bs − Bt) < −Bt }
= 2 P{ |B1 − Bt| < Bt }
= P{ |B1 − Bt| < |Bt| }
= P{τ2 ≤ t}.

2
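A simulation sketch (not from the text) compares the empirical quantiles of τ1, τ2, τ3 with the arcsine quantiles sin²(πq/2); the discretization is crude, so only rough agreement should be expected, and all numerical parameters are arbitrary.

```python
# Sketch: occupation time of (0, inf), argmax time, and last zero of a
# Brownian path on [0, 1] should all be (approximately) arcsine distributed.
import numpy as np

rng = np.random.default_rng(5)
n_paths, n_steps = 20_000, 2_000
dt = 1.0 / n_steps
B = np.cumsum(rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt), axis=1)

tau1 = (B > 0).mean(axis=1)                          # time spent above 0
tau2 = (B.argmax(axis=1) + 1) * dt                   # location of the maximum
cross = np.where(B[:, :-1] * B[:, 1:] <= 0,
                 np.arange(1, n_steps), 0)           # indices of sign changes (0 if none)
tau3 = (cross.max(axis=1) + 1) * dt                  # last zero before time 1

for name, tau in [("tau1", tau1), ("tau2", tau2), ("tau3", tau3)]:
    emp = np.quantile(tau, [0.25, 0.5, 0.75])
    theo = np.sin(np.pi * np.array([0.25, 0.5, 0.75]) / 2) ** 2
    print(name, np.round(emp, 3), "  arcsine quantiles", np.round(theo, 3))
```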

The first two arcsine laws have simple counterparts for the Brownian bridge: Proposition 14.17 (uniform laws) For a Brownian bridge B with maximum M1 , these random variables are U (0, 1): τ1 = λ{t; Bt > 0},



τ2 = inf t; Bt = M1 . d

Proof: The relation τ1 = τ2 may be proved in the same way as for Brownian motion. To see that τ2 is U (0, 1), write (x) = x − [x], and consider for each d u ∈ [0, 1] a process Btu = B(u+t) − Bu , t ∈ [0, 1]. Here clearly B u = B for each u u, and the maximum of B occurs at (τ2 − u). Hence, Fubini’s theorem yields for any t ∈ [0, 1]

P{τ2 ≤ t} = ∫₀¹ P{ (τ2 − u) ≤ t } du = E λ{u; (τ2 − u) ≤ t} = t.

From Theorem 14.5 we see that t−c Bt → 0 a.s. as t → 0 for fixed c ∈ [0, 12 ). The following classical result gives the exact growth rate of Brownian motion at 0 and ∞. Extensions to random walks and renewal processes appear in Corollaries 22.8 and 22.14, and a functional version is given in Theorem 24.18. Theorem 14.18 (laws of the iterated logarithm, Khinchin) For a Brownian motion B in R, we have a.s. Bt

lim sup 

2t log log(1/t)

t→0

Bt

= lim sup  t→∞

= 1.

2t log log t

˜t = tB1/t of Lemma 14.6 converts Proof: Since the Brownian inversion B the two formulas into one another, it is enough to prove the result for t → ∞. Then note that as u → ∞

∞ u

e−x

2 /2

dx ∼ u−1





x e−x

u −1 −u2 /2

=u e

2 /2

dx

.

Letting Mt = sups≤t Bs and using Proposition 14.13, we hence obtain, uniformly in t > 0,





P Mt > u t1/2 = 2P Bt > u t1/2



2 /2

∼ (2/π)1/2 u−1 e−u

.

Writing ht = (2t log log t)1/2 , we get for any r > 1 and c > 0



n−c P M (rn ) > c h(rn−1 ) <

2 /r

(log n)−1/2 ,

n ∈ N.

Fixing c > 1 and choosing r < c2 , we see from the Borel–Cantelli lemma that







P lim sup (Bt /ht ) > c ≤ P M (rn ) > c h(rn−1 ) i.o. = 0, t→∞

which implies lim supt→∞ (Bt /ht ) ≤ 1 a.s. To prove the reverse inequality, we may write



n−c P B(rn ) − B(rn−1 ) > c h(rn ) >

2 r/(r−1)

(log n)−1/2 ,

Taking c = {(r − 1)/r}1/2 , we get by the Borel–Cantelli lemma lim sup t→∞

Bt − Bt/r B(rn ) − B(rn−1 ) ≥ lim sup ht h(rn ) n→∞  1/2 r−1 ≥ a.s. r

n ∈ N.

310

Foundations of Modern Probability

The previous upper bound yields lim supt→∞ (−Bt/r /ht ) ≤ r−1/2 , and so by combination Bt ≥ (1 − r−1 )1/2 − r−1/2 a.s. lim sup ht t→∞ Letting r → ∞, we obtain lim supt→∞ (Bt /ht ) ≥ 1 a.s.

2

In Theorem 14.5 we constructed a Brownian motion B from an isonormal Gaussian process ζ on L2 (R+ , λ) by setting Bt = ζ 1[0,t] a.s. for all t ≥ 0. Conversely, given a Brownian motion B on R+ , Theorem 8.17 ensures the existence of a corresponding isonormal Gaussian process ζ. Since every function h ∈ L2 (R+ , λ) can be approximated by simple step functions, as in the proof of Lemma 1.37, the random variables ζh are a.s. unique. They can also be constructed directly from B, as suitable Wiener integrals h dB. Since the latter may fail to exist in the pathwise Stieltjes sense, a different approach is required. For a first step, consider the class S of simple step functions of the form ht =

Σ aj 1(t

j ≤n

j−1 ,tj ]

(t),

t ≥ 0,

where n ∈ Z+ , 0 = t0 < · · · < tn , and a1 , . . . , an ∈ R. For any h ∈ S, we define the integral ζh in the obvious way as

ζh =



0

=

ht dBt = Bh

Σ aj (Bt

j ≤n

j

− Btj−1 ),

which is clearly centered Gaussian with variance E(ζh)2 = =

Σ a2j (tj − tj−1)

j ≤n ∞ 0

h2t dt = h 2 ,

where h denotes the norm in L2 (R+ , λ). Thus, the integration h → ζh = h dB defines a linear isometry from S ⊂ L2 (R+ , λ) into L2 (Ω, P ). Since S is dense inL2 (R+ , λ), the integral extends by continuity to a linear isometry h → ζh = h dB from L2 (λ) to L2 (P ). The variable ζh is again centered Gaussian for every h ∈ L2 (λ), and by linearity the whole process h → ζh is Gaussian. By a polarization argument, the integration preserves inner products, in the sense that







E(ζh ζk) = 0

ht kt dt = h, k ,

h, k ∈ L2 (λ).

We consider two general representations of stationary Gaussian processes in terms of Wiener integrals ζh. Here a complex notation is convenient. By a complex, isonormal Gaussian process on a (real) Hilbert space H, we mean a process ζ = ξ + iη on H such that ξ and η are independent, real, isonormal

14. Gaussian Processes and Brownian Motion

311

Gaussian processes on H. For any f = g + ih with g, h ∈ H, we define ζf = ξg − ηh + i(ξh + ηg). Now let X be a stationary, centered Gaussian process on R with covariance function rt = E Xs Xs+t , s, t ∈ R. Then r is non-negative definite, and it is further continuous whenever X is continuous in probability, in which case Bochner’s theorem yields a unique spectral representation

rt =



−∞

t ∈ R,

eitx μ(dx),

in terms of a bounded, symmetric spectral measure μ on R. We may derive a similar spectral representation of the process X itself. By a different argument, the result extends to suitable non-Gaussian processes. As usual, we take the basic probability space to be rich enough to support the required randomization variables. Proposition 14.19 (spectral representation, Stone, Cram´er) Let X be an L2 continuous, stationary, centered Gaussian process on R with spectral measure μ. Then there exists a complex, isonormal Gaussian process ζ on L2 (μ), such ∞ that eitx dζx a.s., t ∈ R. (5) Xt = & −∞

Proof: Writing Y for the right-hand side of (5), we obtain E Ys Yt = E = = =





cos sx dξx − sin sx dηx

  

cos tx dξx − sin tx dηx



cos sx cos tx − sin sx sin tx μ(dx)

cos(s − t)x μ(dx) ei(s−t)x μ(dx) = rs−t . d

Since X, Y are both centered Gaussian, Lemma 14.1 yields Y = X. Since X and ζ are further continuous and defined on the separable spaces L2 (X) and L2 (μ), they may be regarded as random elements in suitable Borel spaces. The a.s. representation (5) then follows by Theorem 8.17. 2 Our second representation requires a suitable regularity condition on the spectral measure μ. Write λ for Lebesgue measure on R. Proposition 14.20 (moving average representation) Let X be an L2 -continuous, stationary, centered Gaussian process on R with spectral measure μ  λ. Then there exist an isonormal Gaussian process ζ on L2 (λ) and a function f ∈ L2 (λ), such that

Xt =



−∞

ft−s dζs a.s.,

t ∈ R.

(6)

312

Foundations of Modern Probability

√ Proof: Choose a symmetric density g ≥ 0 of μ, and define h = g. Since 2 h lies in L (λ), it admits a Fourier transform in the sense of Plancherel

ˆ s = (2π)−1/2 lim fs = h

a

a→∞ −a

eisx hx dx,

s ∈ R,

(7)

which is again real and square integrable. For each t ∈ R, the function kx = e−itx hx has Fourier transform kˆs = fs−t , and so by Parseval’s identity,

rt = =



−∞ ∞ −∞

eitx h2x dx hx k¯x dx =



∞ −∞

fs fs−t ds.

(8)

Now let ζ be an isonormal Gaussian process on L2 (λ). For f as in (7), define a process Y on R by the right-hand side of (6). Then (8) gives E Ys Ys+t = rt d for all s, t ∈ R, and so Y = X by Lemma 14.1. Again, Theorem 8.17 yields the desired a.s. representation of X. 2 For an example, we consider a moving average representation of the stationary Ornstein–Uhlenbeck process. Then let ζ be an isonormal Gaussian process on L2 (R, λ), and introduce the process √ t s−t Xt = 2 e dζs , t ≥ 0, −∞

which is clearly centered Gaussian. Furthermore, rs,t = E Xs Xt

s∧t

=2

eu−s eu−t du

=e

,

−∞ −|t−s|

s, t ∈ R,

as desired. The Markov property of X is clear from the fact that Xt = es−t Xs +

√ t u−t 2 e dζu , s

s ≤ t.
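As a numerical aside (not from the text), the stationary Ornstein–Uhlenbeck process with covariance e^{−|t|} can be simulated exactly through its Gaussian AR(1) transition over a step h; the step size, lag, and sample size below are arbitrary choices.

```python
# Sketch: exact simulation of the stationary OU process with r_t = e^{-|t|}
# via X_{t+h} = e^{-h} X_t + sqrt(1 - e^{-2h}) * N(0, 1), started in equilibrium.
import numpy as np

rng = np.random.default_rng(7)
n_paths, n_steps, h = 20_000, 400, 0.01
X = rng.standard_normal(n_paths)                   # stationary start, N(0, 1)
path = np.empty((n_paths, n_steps + 1))
path[:, 0] = X
for k in range(n_steps):
    X = np.exp(-h) * X + np.sqrt(1 - np.exp(-2 * h)) * rng.standard_normal(n_paths)
    path[:, k + 1] = X

lag = 200                                          # time lag t = lag * h = 2.0
emp = np.mean(path[:, 0] * path[:, lag])
print("empirical Cov(X_0, X_t):", emp, "  e^{-t}:", np.exp(-lag * h))
```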

We turn to a discussion of multiple Wiener–Itˆo integrals 8 ζ n with respect to an isonormal Gaussian process ζ on a separable, infinite-dimensional Hilbert space H. Without loss of generality, we may choose H to be of the form L2 (S, μ), so that the tensor product H ⊗n may be identified with L2 (S n , μ⊗n ), where μ⊗n denotes the n-fold product measure μ ⊗ · · · ⊗ μ. The tensor product  k≤n hk = h1 ⊗ · · · ⊗ hn of h1 , . . . , hn ∈ H is then equivalent to the function h1 (t1 ) · · · hn (tn ) on S n . Recall that for any ONB h1 , h2 , . . . in H, the products  ⊗n . j≤n hrj with r1 , . . . , rn ∈ N form an ONB in H We begin with the basic characterization and existence result for the multiple integrals ζ n . 8 They are usually denoted by In . However, we often need a notation exhibiting the dependence on ζ, such as in the higher-dimensional versions of Theorem 28.19

14. Gaussian Processes and Brownian Motion

313

Theorem 14.21 (multiple stochastic integrals, Wiener, Itˆo) Let ζ be an isonormal Gaussian process on a separable Hilbert space H. Then for every n ∈ N there exists a unique, continuous, linear map ζ n : H ⊗n → L2 (P ), such that a.s. ζ n ⊗ hk = k≤n

Π ζhk ,

h1 , . . . , hn ∈ H orthogonal.

k≤n

Here the uniqueness means that ζ n h is a.s. unique for every h, and the linearity that ζ n (af + bg) = a ζ nf + b ζ n g a.s.,

a, b ∈ R, f, g ∈ H ⊗n .

In particular we have ζ 1 h = ζh a.s., and for consistency we define ζ 0 as the identity mapping on R. For the proof, we may take H = L2 (I, λ) with I = [0, 1]. Let En be the class of elementary functions of the form f=

1A , Σ cj k⊗ ≤n

j ≤m

(9)

k j

where the sets A1j , . . . , Anj ∈ BI are disjoint for each j ∈ {1, . . . , m}. The indicator functions 1Akj are then orthogonal for fixed j, and we need to take ζ nf =

ζAkj , Σ cj kΠ ≤n

(10)

j ≤m

where ζA = ζ1A . By the linearity in each factor, ζ nf is a.s. independent of the choice of representation (9) for f . To extend ζ n to the entire space L2 (I n , λ⊗n ), we need some further lemmas. For any function f on I n , we introduce the symmetrization f˜(t1 , . . . , tn ) = (n!)−1

Σ

p ∈ Pn

f (tp1 , . . . , tpn ),

t1 , . . . , tn ∈ R+ ,

where Pn is the set of permutations of {1, . . . , n}. We begin with the basic L2 -structure, which will later carry over to general f . Lemma 14.22 (isometry) The elementary integrals ζ nf in (10) are orthogonal for different n and satisfy E(ζ nf )2 = n! f˜ 2 ≤ n! f 2 ,

f ∈ En .

(11)

Proof: The second relation in (11) follows from Minkowski’s inequality. To prove the remaining assertions, we first reduce to the case where all sets Akj belong to a fixed collection of disjoint sets B1 , B2 , . . . . For any finite index sets J = K in N, we get E

ζBk = Π E(ζBj )2 Π EζBj = 0, Π ζBj kΠ ∈K j ∈ J ∩K j ∈ JK

j ∈J

which proves the asserted orthogonality. Since clearly f, g = 0 when f and g involve different index sets, it also reduces the proof of the isometry in (11) to the case where all terms in f involve the same sets B1 , . . . , Bn , though

314

Foundations of Modern Probability

in possibly different order. Since ζ nf = ζ n f˜, we may further assume that  f = k 1Bk . Then E(ζ nf )2 = Π k E(ζBk )2 = Π k λBk = f 2 = n! f˜ 2 , where the last relation holds since, in the present case, the permutations of f are orthogonal. 2 To extend the integral, we need to show that the set of elementary functions is dense in L2 (λ⊗n ). Lemma 14.23 (approximation) The set En is dense in L2 (λ⊗n ). Proof: By a standard argument based on monotone convergence and a monotone-class argument, any function f ∈ L2 (λ⊗n ) can be approximated by  linear combinations of products k≤n 1Ak , and so it suffices to approximate functions f of the latter type. For each m, we then divide I into 2m intervals Imj of length 2−m , and define fm = f ◦

Σ ⊗ 1I

j1 ,. . . ,jn k ≤ n

m,jk

,

(12)

where the summation extends over all collections of distinct indices j1 , . . . , jn ∈ {1, . . . , 2m }. Here fm ∈ En for each m, and the sum in (12) tends to 1 a.e. 2 λ⊗n . Thus, by dominated convergence, fm → f in L2 (λ⊗n ). So far we have defined the integral ζ n as a uniformly continuous mapping on a dense subset of L2 (λ⊗n ). By continuity it extends immediately to all of L2 (λ⊗n ), with preservation of both linearity and the norm relations in (11). To  complete the proof of Theorem 14.21, it remains to show that ζ n k≤n hk = 2 This follows from the following Πk ζhk for orthogonal h1, . . . , hn ∈ 2L (λ). technical result, where for any f ∈ L (λ⊗n ) and g ∈ L2 (λ) we write (f ⊗1 g)(t1 , . . . , tn−1 ) =



f (t1 , . . . , tn ) g(tn ) dtn .

Lemma 14.24 (recursion) For any f ∈ L2 (λ⊗n ) and g ∈ L2 (λ) with n ∈ N, we have   (13) ζ n+1 (f ⊗ g) = ζ nf · ζg − n ζ n−1 f˜ ⊗1 g . Proof: By Fubini’s theorem and Cauchy’s inequality, we have f ⊗ g = f g , f˜ ⊗1 g ≤ f˜ g ≤ f g , and so both sides of (13) are continuous in probability in f and g. Hence, it suffices to prove the formula for f ∈ En and g ∈ E1 . By the linearity of each  side, we may next reduce to the case of functions f = k≤n 1Ak and g = 1A ,

14. Gaussian Processes and Brownian Motion

315 

where A1 , . . . , An are disjoint and either A ∩ k Ak = ∅ or A = A1 . In the former case, we have f˜ ⊗1 g = 0, and (13) follows from the definitions. In the latter case, (13) becomes 







ζ n+1 A2 × A2 × · · · × An = (ζA)2 − λA ζA2 · · · ζAn .

(14)

Approximating 1A2 as in Lemma 14.23 by functions fm ∈ E2 with support in A2 , we see that the left-hand side equals (ζ 2A2 )(ζA2 ) · · · (ζAn ), which reduces the proof of (14) to the two-dimensional version ζ 2A2 = (ζA)2 − λA. Here we may divide A for each m into 2m subsets Bmj of measure ≤ 2−m , and note as in Theorem 14.9 and Lemma 14.23 that (ζA)2 =

(ζBmi )(ζBmj ) Σ i (ζBmi)2 + iΣ = j

→ λA + ζ 2A2

2

in L2 .

We proceed to derive an explicit representation of the integrals ζ n in terms of Hermite polynomials p0 , p1 , . . . , defined as orthogonal polynomials of degrees 0, 1, . . . with respect to the standard normal distribution N (0, 1). This determines the pn up to normalizations, which we choose for convenience to give leading coefficients 1. The first few polynomials are then p0 (x) = 1,

p1 (x) = x,

p2 (x) = x2 − 1,

p3 (x) = x3 − 3x,

... .

Theorem 14.25 (polynomial representation, Itˆo) Let ζ be an isonormal Gaussian process on a separable Hilbert space H with associated multiple Wiener–Itˆo integrals ζ 1 , ζ 2 , . . . . Then for any ortho-normal elements h1 , . . . , hm ∈ H and integers n1 , . . . , nm ≥ 1 with sum n, we have ζ n ⊗ hj

⊗ nj

j ≤m

=

Π pn (ζhj ).

j ≤m

j

ˆ = h/ h , we see that the stated Using the linearity of ζ n and writing h formula is equivalent to the factorization ζ n ⊗ hj

⊗nj

j ≤m

=

⊗nj

Π ζ n hj j

h1 , . . . , hk ∈ H orthogonal,

,

j ≤m

(15)

together with the representation of the individual factors ˆ ζ n h⊗n = h n pn (ζ h),

h ∈ H \ {0}.

(16)

Proof: We prove (15) by induction on n. Assuming equality for all integrals of order ≤ n, we fix any ortho-normal elements h, h1 , . . . , hm ∈ H and integers  ⊗n k, n1 , . . . , nm ∈ N with sum n + 1, and write f = j≤m hj j . By Lemma 14.24 and the induction hypothesis, we get 









ζ n+1 f ⊗ h⊗k = ζ n f ⊗ h⊗(k−1) · ζh − (k − 1) ζ n−1 f ⊗ h⊗(k−2)



= (ζ n−k+1 f ) ζ k−1 h⊗(k−1) · ζh − (k − 1) ζ k−2 h⊗(k−2) = (ζ n−k+1 f )(ζ k h⊗k ).



316

Foundations of Modern Probability

Using the induction hypothesis again, we obtain the desired extension to ζ n+1 . It remains to prove (16) for an arbitrary element h ∈ H with h = 1. Then conclude from Lemma 14.24 that ζ n+1 h⊗(n+1) = (ζ n h⊗n )(ζh) − n ζ n−1 h⊗(n−1) ,

n ∈ N.

Since ζ 0 1 = 1 and ζ 1 h = ζh, we see by induction that ζ n h⊗n is a polynomial in ζh of degree n with leading coefficient 1. By the definition of Hermite polynomials, it remains to show that the integrals ζ n h⊗n with different n are orthogonal, which holds by Lemma 14.22. 2 Given an isonormal Gaussian process ζ on a separable Hilbert space H, we introduce the space L2 (ζ) = L2 (Ω, σ{ζ}, P ) of ζ-measurable random variables ξ with Eξ 2 < ∞. The n -th polynomial chaos Pn is defined as the closed linear sub-space generated by all polynomials of degree ≤ n in the random variables ζh, h ∈ H. For every n ∈ Z+ , we also introduce the n -th homogeneous chaos Hn , consisting of all integrals ζ nf , f ∈ H ⊗n . The relationship between the mentioned spaces is clarified by the following result, where we write ⊕ for direct sum and ' for orthogonal complement. Theorem 14.26 (chaos decomposition, Wiener) Let ζ be an isonormal Gaussian process on a separable Hilbert space H, with associated polynomial and homogeneous chaoses Pn and Hn , respectively. Then (i) the Hn are orthogonal, closed, linear sub-spaces of L2 (ζ), satisfying n

Pn = ⊕ Hk , k=0

n ∈ Z+ ;



L2 (ζ) = ⊕ Hn , n=0

(ii) for any ξ ∈ L2 (ζ), there exist some unique symmetric elements fn ∈ H ⊗n , n ≥ 0, such that ξ = Σ ζ nfn a.s. n≥0

In particular, we note that H0 = P0 = R and Hn = Pn ' Pn−1 ,

n ∈ N.

Proof: (i) The properties in Lemma 14.22 extend to arbitrary integrands, and so the spaces Hn are mutually orthogonal, closed, linear sub-spaces of L2 (ζ). From Lemma 14.23 or Theorem 14.25 we see that also Hn ⊂ Pn . Conversely, let ξ be a polynomial of degree n in the variables ζh. We may then choose some ortho-normal elements h1 , . . . , hm ∈ H, such that ξ is a polynomial in ζh1 , . . . , ζhm of degree n. Since any power (ζhj )k is a linear combination of the variables p0 (ζhj ), . . . , pk (ζhj ), Theorem 14.25 shows that ξ is a linear combination of multiple integrals ζ kf with k ≤ n, which means that  ξ ∈ k≤n Hk . This proves the first relation.  To prove the second relation, let ξ ∈ L2 (ζ) ' n Hn . In particular, ξ⊥(ζh)n for every h ∈ H and n ∈ Z+ . Since Σ n |ζh|n /n! = e|ζh| ∈ L2 , the series eiζh = Σ n (iζh)n /n! converges in L2 , and we get ξ⊥eiζh for every h ∈ H. By linearity of the integral ζh, we hence obtain for any h1 , . . . , hn ∈ H, n ∈ N,

14. Gaussian Processes and Brownian Motion 

E ξ exp



Σ iuk ζhk k≤n

= 0,

317

u1 , . . . , un ∈ R.

Applying Theorem 6.3 to the distributions of (ζh1 , . . . , ζhn ) under the bounded measures μ± = E(ξ ± ; ·), we conclude that



E ξ; (ζh1 , . . . , ζhn ) ∈ B = 0,

B ∈ B n,

which extends by a monotone-class argument to E( ξ; A) = 0 for arbitrary A ∈ σ{ζ}. Since ξ is ζ -measurable, it follows that ξ = E( ξ| ζ) = 0 a.s. (ii) By (i) any element ξ ∈ L2 (ζ) has an orthogonal expansion ξ = Σ ζ nfn = Σ ζ nf˜n , n≥0

n≥0

for some elements fn ∈ H ⊗n with symmetrizations f˜n , n ∈ Z+ . Now assume that also ξ = Σ n ζ n gn . Projecting onto Hn and using the linearity of ζ n , we gn − f˜n = 0, and get ζ n (gn − fn ) = 0. By the isometry in (11) it follows that ˜ ˜ 2 so g˜n = fn . We may often write the decomposition in (ii) as ξ = Σn Jn ξ, where Jn ξ = ζ nfn . We proceed to derive a Wiener chaos expansion in L2 (ζ, H), stated in terms of the WI-integrals ζ nfn (·, t) for suitable functions fn on T n+1 . Corollary 14.27 (chaos expansion of processes) Let ζ be an isonormal Gaussian process on H = L2 (T, μ), and let X ∈ L2 (ζ, H). Then there exist some functions fn ∈ L2 (μn+1 ), n ≥ 0, each symmetric in the first n arguments, such that (i) Xt = Σ n ζ nfn (·, t) in L2 (ζ, H), (ii) X 2 = Σ n n! fn 2 . Proof: We may regard X as an element in the space L2 (ζ) ⊗ H, admitting an ONB of the form (αi ⊗ hj ), for suitable ONB’s α1 , α2 , . . . ∈ L2 (ζ) and h1 , h2 , . . . ∈ H. This yields an orthogonal expansion X = Σj ξj ⊗ hj in terms of some elements ξj ∈ L2 (ζ). Applying the classical Wiener chaos expansion to each ξj , we get an orthogonal decomposition

Σ n,j Jnξj ⊗ hj = Σ n,j ζ n gn,j ⊗ hj ,

X =

(17)

for some symmetric functions gn,j ∈ L (μ ) satisfying 2

n

Σ n,j ζ ngn,j 2 = Σ n,j n! gn,j 2 .

X 2 =

(18)

2

Regarding the hj as functions in L (μ), we may now introduce the functions fn (s, t) = Σ j gn,j (s) hj (t) in L2 (μn+1 ),

s ∈ T n , t ∈ T, n ≥ 0,

which may clearly be chosen to be product measurable on T n+1 and symmetric 2 in s ∈ T n . The relations (i) and (ii) are obvious from (17) and (18). Next we will see how the chaos expansion is affected by conditioning.

318

Foundations of Modern Probability

Corollary 14.28 (conditioning) Let ξ ∈ L2 (ζ) with chaos expansion ξ = Σ nζ nfn. Then E( ξ | 1A ζ) =

Σ ζ n(1⊗n A fn )

a.s.,

n≥0

A∈T.

Proof: By Lemma 14.23 we may choose ξ = ζ nf with f ∈ En , and by  linearity we may take f = j 1Bj for some disjoint sets B1 , . . . , Bn ∈ T . Writing ξ = Π ζBj j ≤n





Π ζ(Bj ∩ A) + ζ(Bj ∩ Ac) j ≤n = Σ Π ζ(Bj ∩ A) Π ζ(Bk ∩ Ac ), J ∈2 j ∈J k∈J =

n

c

and noting that the conditional expectation of the last product vanishes by independence unless J c = ∅, we get

Π ζ(Bj ∩ A) = ζ ⊗ 1B ∩A j ≤n

E( ξ | 1A ζ) =

j ≤n n

j



n

2

n (1⊗ A f ).

We also need a related property for processes adapted to a Brownian motion. Corollary 14.29 (adapted processes) Let B be a Brownian motion generated by an isonormal Gaussian process ζ on L2 (R+ , λ). Then a process X ∈ L2 (ζ, λ) is B-adapted iff it can be represented as in Corollary 14.27, with fn (·, t) supported by [0, t]n for every t ≥ 0 and n ≥ 0. Proof: The sufficiency is obvious. Now let X be adapted. We may then apply Corollary 14.28 for every fixed t ≥ 0. Alternatively, we may apply Corollary 14.27 with ζ replaced by 1[0,t] ζ, and use the a.e. uniqueness of the representation. In either case, it remains to check that the resulting functions 2 1⊗n [0,t] fn (·, t) remain product measurable.

Exercises 1. Let ξ1 , . . . , ξn be i.i.d. N (m, σ 2 ). Show that the random variables ξ¯ = n−1 Σk ξk d ¯ 2 are independent, and that (n−1)s2 = and s2 = (n−1)−1 (ξk − ξ) (ξk −m)2 .

Σk

Σk 0; Bt > 0} = 0 a.s. (Hint: By Kolmogorov’s 0−1 law, the stated event has probability 0 or 1. Alternatively, use Theorem 14.18.) 9. For a Brownian motion B, define τa = inf{t > 0; Bt = a}. Compute the density of the distribution of τa for a = 0, and show that E τa = ∞. (Hint: Use Proposition 14.13.) 10. For a Brownian motion B, show that Zt = exp(cBt − 12 c2 t) is a martingale for every c. Use optional sampling to compute the Laplace transform of τa above, and compare with the preceding result. 13. Show that the local maxima of a Brownian motion are a.s. dense in R, and the corresponding times are a.s. dense in R+ . (Hint: Use the preceding result.) 14. Show by a direct argument that lim supt t−1/2 Bt = ∞ a.s. as t → 0 and ∞, where B is a Brownian motion. (Hint: Use Kolmogorov’s 0−1 law.) 15. Show that the local law of the iterated logarithm for Brownian motion remains valid for the Brownian bridge. 16. For a Brownian motion B in Rd , show that the process |B| satisfies the law of the iterated logarithm at 0 and ∞. 17. Let ξ1 , ξ2 , . . . be i.i.d. N (0, 1). Show that lim supn (2 log n)−1/2 ξn = 1 a.s. 18. For a Brownian motion B, show that Mt = t−1 Bt is a reverse martingale, and conclude that t−1 Bt → 0 a.s. and in Lp , p > 0, as t → ∞. (Hint: The limit is degenerate by Kolmogorov’s 0−1 law.) Deduce the same result from Theorem 25.9. 19. For a Brownian bridge B, show that Mt = (1 − t)−1 Bt is a martingale on [0, 1). Check that M is not L1 -bounded. 20. Let ζ n be the n-fold Wiener–Itˆo integral w.r.t. Brownian motion B on R+ . Show that the process Mt = ζ n (1[0,t]n ) is a martingale. Express M in terms of B, and compute the expression for n = 1, 2, 3. (Hint: Use Theorem 14.25.) 21. Let ζ1 , . . . , ζn be independent, isonormal Gaussian processes on a separable  Hilbert space H. Show that there exists a unique continuous linear mapping k ζk   ⊗n 2 from H to L (P ) such that k ζk k hk = Πk ζk hk a.s. for all h1 , . . . , hn ∈ H.  Also show that k ζk is an isometry.

Chapter 15

Poisson and Related Processes Random measures, Laplace transforms, binomial processes, Poisson and Cox processes, transforms and thinnings, mapping and marking, mixed Poisson and binomial processes, existence and randomization, Cox and thinning uniqueness, one-dimensional uniqueness criteria, independent increments, symmetry, Poisson reduction, random time change, Poisson convergence, positive, symmetric, and centered Poisson integrals, double Poisson integrals, decoupling

Brownian motion and the Poisson processes are universally recognized as the basic building blocks of modern probability. After studying the former process in the previous chapter, we turn to a detailed study of Poisson and related processes. In particular, we construct Poisson processes on bounded sets as mixed binomial processes, and derive a variety of Poisson characterizations in terms of independence, symmetry, and renewal properties. A randomization of the underlying intensity measure leads to the richer class of Cox processes. We also consider related randomizations of general point processes, obtainable through independent mappings of the individual points. In particular, we will see how such transformations preserve the Poisson property. For most purposes, it is convenient to regard Poisson and other point processes as integer-valued random measures. The relevant parts of this chapter then serve as an introduction to general random measure theory. Here we will see how Cox processes and thinnings can be used to derive some general uniqueness criteria for simple point processes and diffuse random measures. The notions and results of this chapter form a basis for the corresponding convergence theory in Chapters 23 and 30, where typically Poisson and Cox processes arise as distributional limits. We also stress the present significance of the compensators and predictable processes from Chapter 10. In particular, we may use a predictable transformation to reduce a ql-continuous simple point process on R+ to Poisson, in a similar way as a continuous local martingale can be time-changed into a Brownian motion. This gives a basic connection to the stochastic calculus of Chapters 18–19. Using discounted compensators, we can even go beyond the ql-continuous case. We finally recall the fundamental role of compound Poisson processes for the theory of L´evy processes and their distributions in Chapters 7 and 16, and for the excursion theory in Chapter 29. In such connections, we often encounter Poisson integrals of various kind, whose existence is studied toward the end of this chapter. We also characterize the existence of double Poisson integrals of © Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1_16




the form ξ 2 f or ξηf , which provides a connection to the multiple Wiener–Itˆo integrals in Chapters 14, 19, and 21. Given a localized Borel space S, we define a random measure on S as a locally finite kernel from the basic probability space (Ω, A, P ) into S. Thus, ξ(ω, B) is a locally finite measure in B ∈ S for fixed ω and a random variable in ω ∈ Ω for fixed B. Equivalently, we may regard ξ as a random element in the space MS of locally finite measures on S, endowed with the σ-field generated by all evaluation maps πB : μ → μB with B ∈ S. The intensity Eξ of ξ is the measure on S defined by (Eξ)B = E(ξB) on S. A point process on S is defined as an integer-valued random measure ξ. Alternatively, it may be regarded as a random element in the space NS ⊂ MS of all locally finite, integer-valued measures on S. By Theorem 2.18, it may be ¯ +, written as ξ = Σk≤κ δσk for some random elements σ1 , σ2 , . . . in S and κ in Z and we say that ξ is simple if the σk are all distinct. For a general ξ, we may eliminate the possible multiplicities to form a simple point process ξ ∗ , equal to the counting measure on the support of ξ. Its measurability as a function of ξ is ensured by Theorem 2.19. We begin with some basic uniqueness criteria for random measures. Stronger results for simple point processes and diffuse random measures are given in Theorem 15.8, and related convergence criteria appear in Theorem 23.16. Recall that Sˆ+ denotes the class of measurable functions f ≥ 0 with bounded support. For countable classes I ⊂ S, we write Iˆ+ for the family of simple, I-measurable functions f ≥ 0 on S. Dissecting systems were defined in Chapter 1. Lemma 15.1 (uniqueness) Let ξ, η be random measures on S, and fix any d ˆ Then ξ = dissecting semi-ring I ⊂ S. η under each of the conditions d (i) ξf = ηf , f ∈ Iˆ+ , (ii)

E e−ξf = E e−ηf ,

f ∈ Sˆ+ .

When S is Polish, we may take f ∈ CˆS or f ≤ 1 in (i)−(ii). Proof: (i) By the Cram´er–Wold Theorem 6.5, the stated condition implies d

(ξI1 , . . . , ξIn ) = (ηI1 , . . . , ηIn ),

I1 , . . . , In ∈ I, n ∈ N. d

By a monotone-class argument in MS , we get ξ = η on the σ-field ΣI in MS generated by all projections πI : μ → μI, I ∈ I. By a monotone-class argument ˆ Hence, ΣI contains the σin S, πB remains ΣI -measurable for every B ∈ S. field generated by the latter projections, and so it agrees with the basic σ-field d on MS , which shows that ξ = η. The statement for Polish spaces follows by a simple approximation. (ii) Use the uniqueness Theorem 6.3 for Laplace transforms.

2

A random measure ξ on a Borel space S is said to have independent increments, if the random variables ξB1 , . . . , ξBn are independent for disjoint sets

15. Poisson and Related Processes

323

By a Poisson process on S with intensity measure μ ∈ M_S we mean a point process ξ on S with independent increments, such that ξB is Poisson distributed with mean μB for every B ∈ Ŝ. By Lemma 15.1 the stated conditions define a distribution for ξ, determined by the intensity measure μ. More generally, for any random measure η on S, we say that a point process ξ is a Cox process¹ directed by η, if it is conditionally Poisson given η with E(ξ | η) = η a.s. Choosing η = αμ for some measure μ ∈ M_S and random variable α ≥ 0, we get a mixed Poisson process based on μ and α.

Given a probability kernel ν: S → T and a point measure μ = Σ_k δ_{s_k}, we define a ν-transform of μ as a point process η = Σ_k δ_{τ_k} on T, where the τ_k are independent random elements in T with distributions ν_{s_k}, assumed to be such that η becomes a.s. locally finite. The distribution of η is clearly a kernel P_μ on the appropriate subset of N_T. Replacing μ by a general point process ξ on S, we may define a ν-transform η of ξ by L(η | ξ) = P_ξ, as long as this makes η a.s. locally finite.

By a ρ-randomization of ξ we mean a transform with respect to the kernel ν_s = δ_s ⊗ ρ_s from S to S × T, where ρ is a probability kernel from S to T. In particular, we may form a uniform randomization by choosing ρ to be Lebesgue measure λ on [0, 1]. We may also think of a randomization as a ρ-marking of ξ, where independent random marks with distributions ρ_s are attached to the points of ξ with locations s. We further consider p-thinnings of ξ, obtained by independently deleting the points of ξ with probability 1 − p.

A binomial process² on S is defined as a point process of the form ξ = Σ_{i≤n} δ_{σ_i}, where σ₁, …, σ_n are independent random elements in S with a common distribution μ. The name is suggested by the fact that ξB is binomially distributed for every B ∈ S. Replacing n by a random variable κ ⊥⊥ (σ_i) in Z₊ yields a mixed binomial process.

¹ also called a doubly stochastic Poisson process
² Binomial processes have also been called sample or Bernoulli processes, or, when S ⊂ R, processes with the order statistics property.

To examine the relationship between the various processes, it is often helpful to use the Laplace transforms of random measures ξ on S, defined for any measurable function f ≥ 0 on the state space S by ψ_ξ(f) = E e^{−ξf}. Note that ψ_ξ determines the distribution L(ξ) by Lemma 15.1. Here we list some basic formulas, where we recall that a kernel ν between measurable spaces S and T may be regarded as an operator between the associated function spaces, given by νf(s) = ∫ ν(s, dt) f(t).

Lemma 15.2 (Laplace transforms) Let f ∈ Ŝ₊ with S Borel. Then a.s.

(i) for a mixed binomial process ξ based on μ and κ,

E(e^{−ξf} | κ) = (μ e^{−f})^κ,

(ii) for a Cox process ξ directed by η,

E(e^{−ξf} | η) = exp{ −η(1 − e^{−f}) },

(iii) for a ν-transform ξ of η,

E(e^{−ξf} | η) = exp{ η log ν e^{−f} },

(iv) for a p-thinning ξ of η,

E(e^{−ξf} | η) = exp{ η log(1 − p(1 − e^{−f})) }.

Throughout, we use the compact notation explained in previous chapters. For example, in a more explicit notation the formula in part (iii) becomes³

E[ exp{ −∫ f(s) ξ(ds) } | η ] = exp ∫ log( ∫ e^{−f(t)} ν_s(dt) ) η(ds).

³ Such transliterations are henceforth left to the reader, who may soon see the advantage of our compact notation.
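Before turning to the proof, the formula in (ii) with η = μ non-random, that is, for an ordinary Poisson process, is easy to check numerically. The following Python sketch is illustrative only: the rate, the test function f, and the sample sizes are arbitrary choices, not taken from the text. It simulates a homogeneous Poisson process on [0, 1] as a Poisson number of i.i.d. uniform points and compares a Monte Carlo estimate of E e^{−ξf} with the closed form exp{−μ(1 − e^{−f})}.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_process(rate, T, rng):
    """Homogeneous Poisson process on [0, T]: Poisson count, then i.i.d. uniform points."""
    n = rng.poisson(rate * T)
    return rng.uniform(0.0, T, size=n)

f = lambda s: 2.0 * np.exp(-3.0 * s)   # an arbitrary bounded test function on [0, 1]
rate, T, n_sim = 5.0, 1.0, 50_000

# Monte Carlo estimate of the Laplace functional E exp(-xi f)
vals = np.array([np.exp(-f(poisson_process(rate, T, rng)).sum()) for _ in range(n_sim)])

# Closed form exp(-mu(1 - e^{-f})) with mu = rate * Lebesgue measure, by quadrature
grid = np.linspace(0.0, T, 10_001)
closed_form = np.exp(-rate * np.trapz(1.0 - np.exp(-f(grid)), grid))

print(vals.mean(), closed_form)        # the two numbers should agree to about 2 decimals
```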

Proof: (i) Let ξ = Σ_{k≤n} δ_{σ_k}, where σ₁, …, σ_n are i.i.d. μ and n ∈ Z₊. Then

E e^{−ξf} = E exp{ Σ_k −f(σ_k) } = Π_k E e^{−f(σ_k)} = (μ e^{−f})ⁿ.

The general result follows by conditioning on κ.

(ii) For a Poisson random variable κ with mean c, we have

E s^κ = e^{−c} Σ_{k≥0} c^k s^k / k! = e^{−c} e^{cs} = e^{−c(1−s)},  s ∈ (0, 1].   (1)

Hence, if ξ is a Poisson process with intensity μ, we get for any function f = Σ_{k≤n} c_k 1_{B_k} with disjoint B₁, …, B_n ∈ Ŝ and constants c₁, …, c_n ≥ 0

E e^{−ξf} = E exp{ Σ_k −c_k ξB_k }
  = Π_k E (e^{−c_k})^{ξB_k}
  = Π_k exp{ −μB_k (1 − e^{−c_k}) }
  = exp{ Σ_k −μB_k (1 − e^{−c_k}) }
  = exp{ −μ(1 − e^{−f}) },

which extends by monotone and dominated convergence to arbitrary f ∈ S₊. Here the general result follows by conditioning on η.

(iii) First let η = Σ_k δ_{s_k} be non-random in N_S. Choosing some independent random elements τ_k in T with distributions ν_{s_k}, we get

E e^{−ξf} = E exp{ Σ_k −f(τ_k) } = Π_k E e^{−f(τ_k)} = Π_k ν_{s_k} e^{−f}
  = exp{ Σ_k log ν_{s_k} e^{−f} } = exp{ η log ν e^{−f} }.


Again, the general result follows by conditioning on η.

(iv) This is easily deduced from (iii). For a direct proof, let η = Σ_k δ_{s_k} as before, and put ξ = Σ_k ϑ_k δ_{s_k} for some independent random variables ϑ_k in {0, 1} with Eϑ_k = p(s_k). Then

E e^{−ξf} = E exp{ Σ_k −ϑ_k f(s_k) }
  = Π_k E e^{−ϑ_k f(s_k)}
  = Π_k ( 1 − p(s_k)(1 − e^{−f(s_k)}) )
  = exp Σ_k log( 1 − p(s_k)(1 − e^{−f(s_k)}) )
  = exp{ η log(1 − p(1 − e^{−f})) },

which extends by conditioning to general η. □
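As a quick numerical illustration of part (iv) — a sketch with arbitrary choices of rate and retention function, not taken from the text — the following snippet applies a location-dependent p-thinning to a homogeneous Poisson process. The thinned count then has equal mean and variance, matching the Poisson mean ∫ p dμ, consistent with the fact (a special case of Theorem 15.3 below) that a thinned Poisson process is again Poisson.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_thinning(points, p, rng):
    """Keep each point independently with retention probability p(s)."""
    keep = rng.random(len(points)) < p(points)
    return points[keep]

rate, T, n_sim = 10.0, 1.0, 20_000
p = lambda s: 0.25 * (1.0 + s)         # retention probabilities in (0, 1]

counts = []
for _ in range(n_sim):
    pts = rng.uniform(0.0, T, size=rng.poisson(rate * T))
    counts.append(len(p_thinning(pts, p, rng)))

grid = np.linspace(0.0, T, 1001)
expected = rate * np.trapz(p(grid), grid)      # = integral of p with respect to mu
print(np.mean(counts), np.var(counts), expected)   # all three should be close
```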

We proceed with some basic closure properties for Cox processes and point process transforms. In particular, we note that the Poisson property is preserved by randomizations.

Theorem 15.3 (mapping and marking, Bartlett, Doob, Prékopa, OK) Let μ: S → T and ν: T → U be probability kernels between the Borel spaces S, T, U.

(i) For a Cox process ξ directed by a random measure η on S, consider a locally finite μ-transform ζ ⊥⊥_ξ η of ξ. Then ζ is a Cox process on T directed by ημ.

(ii) For a μ-transform ξ of a point process η on S, consider a locally finite ν-transform ζ ⊥⊥_ξ η of ξ. Then ζ is a μν-transform of η.

Recall that the kernel μν: S → U is given by

(μν)_s f = ∫ (μν)_s(du) f(u) = ∫ μ_s(dt) ∫ ν_t(du) f(u).

Proof: (i) By Proposition 8.9 and Lemma 15.2 (ii) and (iii), we have

E(e^{−ζf} | η) = E[ E(e^{−ζf} | ξ, η) | η ]
  = E[ E(e^{−ζf} | ξ) | η ]
  = E[ exp{ ξ log μ e^{−f} } | η ]
  = exp{ −η(1 − μ e^{−f}) }
  = exp{ −ημ(1 − e^{−f}) }.

Now use Lemmas 15.1 (ii) and 15.2 (ii).

(ii) Using Lemma 15.2 (iii) twice, we get

E(e^{−ζf} | η) = E[ E(e^{−ζf} | ξ) | η ]
  = E[ exp{ ξ log ν e^{−f} } | η ]
  = exp{ η log μν e^{−f} }.

Now apply Lemmas 15.1 (ii) and 15.2 (ii). □
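The marking part of Theorem 15.3 is equally easy to probe by simulation. The sketch below is illustrative only — the rate and the mark kernel q are arbitrary choices, not from the text. It attaches independent {0, 1}-valued marks to the points of a homogeneous Poisson process and checks that the two marked counts are uncorrelated, each with variance equal to its mean, as expected when the marked process is again Poisson on the product space.

```python
import numpy as np

rng = np.random.default_rng(2)
rate, T, n_sim = 20.0, 1.0, 20_000
q = lambda s: 0.3 + 0.4 * s            # location-dependent probability of mark 1

n0, n1 = [], []
for _ in range(n_sim):
    pts = rng.uniform(0.0, T, size=rng.poisson(rate * T))
    marks = rng.random(len(pts)) < q(pts)      # independent marks given the points
    n1.append(int(marks.sum()))
    n0.append(len(pts) - int(marks.sum()))

n0, n1 = np.array(n0), np.array(n1)
print(np.cov(n0, n1)[0, 1])            # approximately 0: the marked counts decouple
print(n1.mean(), n1.var())             # mean approximately equal to variance
```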

We turn to a basic relationship between mixed Poisson and binomial processes. The equivalence (i) ⇔ (ii) is elementary for bounded S, and leads in that case to an easy construction of the general Poisson process. The significance of mixed Poisson and binomial processes is further clarified by Theorem 15.14 below. Write 1_B μ = 1_B · μ for the restriction of μ to B.

Theorem 15.4 (mixed Poisson and binomial processes, Wiener, Moyal, OK) For a point process ξ on S, these conditions are equivalent:

(i) ξ is a mixed Poisson or binomial process,

(ii) 1_B ξ is a mixed binomial process for every B ∈ Ŝ.

In that case,

(iii) 1_B ξ ⊥⊥_{ξB} 1_{B^c} ξ,  B ∈ Ŝ.

Proof, (i) ⇒ (ii): First let ξ be a mixed Poisson process based on μ and ρ. Since 1_B ξ is again mixed Poisson and based on 1_B μ and ρ, we may take B = S and μS = 1. Now consider a mixed binomial process η based on μ and κ, where κ is conditionally Poisson given ρ with E(κ | ρ) = ρ. Using (1) and Lemma 15.2 (i)−(ii), we get for any f ∈ S₊

E e^{−ηf} = E (μ e^{−f})^κ
  = E E[ (μ e^{−f})^κ | ρ ]
  = E exp{ −ρ(1 − μ e^{−f}) }
  = E exp{ −ρμ(1 − e^{−f}) } = E e^{−ξf},

and so ξ =d η by Lemma 15.1.

Next, let ξ be a mixed binomial process based on μ and κ. Fix any B ∈ Ŝ with μB > 0, and consider a mixed binomial process η based on μ̂_B = 1_B μ/μB and β, where β is conditionally binomially distributed given κ, with parameters κ and μB. Letting f ∈ S₊ be supported by B and using Lemma 15.2 (i) twice, we get

E e^{−ηf} = E (μ̂_B e^{−f})^β
  = E E[ (μ̂_B e^{−f})^β | κ ]
  = E ( 1 − μB(1 − μ̂_B e^{−f}) )^κ
  = E (μ e^{−f})^κ = E e^{−ξf},


and so 1_B ξ =d η by Lemma 15.1.

(ii) ⇒ (i): Let 1_B ξ be mixed binomial for every B ∈ Ŝ. Fix a localizing sequence B_n ↑ S in Ŝ, so that the restrictions 1_{B_n} ξ are mixed binomial processes based on some probability measures μ_n. Since the μ_n are unique, we see as above that μ_m = 1_{B_m} μ_n / μ_n B_m whenever m ≤ n, and so there exists a measure μ ∈ M_S with μ_n = 1_{B_n} μ / μB_n for all n ∈ N.

When μS < ∞, we may first normalize to μS = 1. Since s_k^{n_k} → s^n as s_k → s ∈ (0, 1) and n_k → n ∈ N̄, we get for any f ∈ Ŝ₊ with μf > 0, by Lemma 15.2 (i) and monotone and dominated convergence,

E e^{−ξf} ← E e^{−ξ_n f} = E (μ_n e^{−f})^{ξB_n} → E (μ e^{−f})^{ξS},

where ξ_n = 1_{B_n} ξ. Choosing f = ε 1_{B_n} and letting ε ↓ 0, we get ξS < ∞ a.s., and so the previous calculation applies to arbitrary f ∈ Ŝ₊, which shows that ξ is a mixed binomial process based on μ.

Next let μS = ∞. If f ∈ Ŝ₊ is supported by B_m for a fixed m ∈ N, then for any n ≥ m,

E e^{−ξf} = E (μ_n e^{−f})^{ξB_n} = E ( 1 − μ(1 − e^{−f})/μB_n )^{μB_n α_n},

where α_n = ξB_n / μB_n. By Theorem 6.20, we have convergence α_n → α in [0, ∞] along a sub-sequence. Noting that (1 − m_n^{−1})^{m_n x_n} → e^{−x} as m_n → ∞ and 0 ≤ x_n → x ∈ [0, ∞], we get by Theorem 5.31

E e^{−ξf} = E exp{ −αμ(1 − e^{−f}) }.

As before, we get α < ∞ a.s., and so by monotone and dominated convergence, we may extend the displayed formula to arbitrary f ∈ Ŝ₊. By Lemmas 15.1 and 15.2 (ii), ξ is then distributed as a mixed Poisson process based on μ and α.

(iii) We may take μS < ∞, since the case μS = ∞ will then follow by a martingale or monotone-class argument. Then consider any mixed binomial process ξ = Σ_{k≤κ} δ_{σ_k}, where the σ_k are i.i.d. μ and independent of κ. For any B ∈ Ŝ, we note that 1_B ξ and 1_{B^c} ξ are conditionally independent binomial processes based on 1_B μ and 1_{B^c} μ, respectively, given the variables κ and ϑ_k = 1{k ≤ κ, σ_k ∈ B}, k ∈ N. Thus, 1_B ξ is conditionally a binomial process based on μ̂_B and ξB, given 1_{B^c} ξ, ξB, and ϑ₁, ϑ₂, …. Since the conditional distribution depends only on ξB, we obtain 1_B ξ ⊥⊥_{ξB} (1_{B^c} ξ, ϑ₁, ϑ₂, …), and the assertion follows. □

Using the easy necessity part of Theorem 15.4, we may prove the existence of Cox processes and transforms. The result also covers the cases of binomial processes and randomizations.
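The simplest Cox processes, the mixed Poisson processes, are easy to simulate directly, which may serve as an illustration before the general existence result below. In the following sketch the directing measure is η = αλ with a Gamma-distributed level α; the distribution of α and all parameters are arbitrary choices, not from the text. The resulting count is over-dispersed (variance strictly larger than the mean), in contrast to an ordinary Poisson count.

```python
import numpy as np

rng = np.random.default_rng(3)
T, n_sim = 1.0, 50_000

# Mixed Poisson process on [0, T]: conditionally on alpha, the total count is Poisson(alpha * T)
alpha = rng.gamma(shape=2.0, scale=3.0, size=n_sim)   # random directing level, E alpha = 6
counts = rng.poisson(alpha * T)

print(counts.mean(), counts.var())   # variance exceeds the mean: over-dispersion,
                                     # the signature of a genuinely random directing measure
```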


Theorem 15.5 (Cox and transform existence)

(i) For any random measure η on S, there exists a Cox process ξ directed by η.

(ii) For any point process ξ on S and probability kernel ν: S → T, there exists a ν-transform ζ of ξ.

Proof: (i) First let η = μ be non-random with μS ∈ (0, ∞). By Corollary 8.25 we may choose a Poisson distributed random variable κ with mean μS and some independent i.i.d. random elements σ₁, σ₂, … in S with distribution μ/μS. By Theorem 15.4, the random measure ξ = Σ_{j≤κ} δ_{σ_j} is then Poisson with intensity μ. For general μ ∈ M_S, we may choose a partition of S into disjoint subsets B₁, B₂, … ∈ Ŝ, such that μB_k ∈ (0, ∞) for all k. As before, there exists for every k a Poisson process ξ_k on S with intensity μ_k = 1_{B_k} μ, and by Corollary 8.25 we may choose the ξ_k to be independent. Writing ξ = Σ_k ξ_k and using Lemma 15.2 (i), we get for any measurable function f ≥ 0 on S

E e^{−ξf} = Π_k E e^{−ξ_k f}
  = Π_k exp{ −μ_k(1 − e^{−f}) }
  = exp{ −Σ_k μ_k(1 − e^{−f}) }
  = exp{ −μ(1 − e^{−f}) }.

Using Lemmas 15.1 (ii) and 15.2 (i), we conclude that ξ is a Poisson process with intensity μ.

Now let ξ_μ be a Poisson process with intensity μ. Then for any m₁, …, m_n ∈ Z₊ and disjoint B₁, …, B_n ∈ Ŝ,

P ∩_{k≤n} { ξ_μ B_k = m_k } = Π_{k≤n} e^{−μB_k} (μB_k)^{m_k} / m_k!,


which is a measurable function of μ. The measurability extends to arbitrary ˆ since the general probability on the left is a finite sum of such sets Bk ∈ S, products. Now the sets on the left form a π-system generating the σ-field in NS , and so Theorem 1.1 shows that Pμ = L(ξμ ) is a probability kernel from MS to NS . Then Lemma 8.16 ensures the existence, for any random measure η on S, of a Cox process ξ directed by η. (ii) First let μ = Σk δsk be non-random in NS . Then Corollary 8.25 yields some independent random elements τk in T with distributions νsk , making ζμ = Σk δτk a ν-transform of μ. Letting B1 , . . . , Bn ∈ T and s1 , . . . , sn ∈ (0, 1), we get by Lemma 15.2 (iii) E

Π k sk

ζμ Bk

= E exp ζμ Σ k 1Bk log sk

= exp μ log ν exp Σ k 1Bk log sk = exp μ log ν Π k (sk )1Bk .

15. Poisson and Related Processes

329 

Using Lemma 3.2 (i) twice, we see that ν k (sk )1Bk is a measurable function on S for fixed s1 , . . . , sn , and so the right-hand side is a measurable function of μ. Differentiating mk times with respect to sk for each k and setting  s1 = · · · = sn = 1, we conclude that the probability P k {ζμ Bk = mk } is a measurable function of μ for any m1 , . . . , mn ∈ Z+ . As before, Pμ = L(ζμ ) is then a probability kernel from NS to NT , and the general result follows by Lemma 8.16. 2 We proceed with some basic properties of Poisson and Cox processes. Lemma 15.6 (Cox simplicity and integrability) Let ξ be a Cox process directed by a random measure η on S. Then (i) ξ is a.s. simple ⇔ η is a.s. diffuse, (ii) {ξf < ∞} = η(f ∧ 1) < ∞ a.s. for all f ∈ S+ . Proof: (i) First reduce by conditioning to the case where ξ is Poisson with intensity μ. Since both properties are local, we may also assume that μ ∈ (0, ∞). By a further conditioning based on Theorem 15.4, we may then take ξ to be a binomial process based on μ, say ξ = Σk≤n δσk , where the σk are i.i.d. μ. Then Fubini’s theorem yields P {σi = σj } = =



μ{s} μ(ds)

Σ s(μ{s})2,

i = j,

which shows that the σk are a.s. distinct iff μ is diffuse. (ii) Conditioning on η and using Theorem 8.5, we may again reduce to the case where ξ is Poisson with intensity μ. For any f ∈ S+ , we get as 0 < r → 0

exp −μ(1 − e−rf )



= Ee−rξf → P {ξf < ∞}.

Since 1 − e−f ) f ∧ 1, we have μ(1 − e−rf ) ≡ ∞ when μ(f ∧ 1) = ∞, whereas 2 μ(1 − e−rf ) → 0 by dominated convergence when μ(f ∧ 1) < ∞. The following uniqueness property will play an important role below. Lemma 15.7 (Cox and thinning uniqueness, Krickeberg) Let ξ, ξ˜ be Cox processes on S directed by some random measures η, η˜, or p -thinnings of some point processes η, η˜, for a fixed p ∈ (0, 1]. Then d d ξ = ξ˜ ⇔ η = η˜. Proof: The implications to the left are obvious. To prove the implication to the right for the Cox transform, we may invert Lemma 15.2 (ii) to get for f ∈ Sˆ+

Ee−ηf = E exp ξ log(1 − f ) , f < 1.

330

Foundations of Modern Probability

In case of p -thinnings, we may use Lemma 15.2 (iv) instead to get



Ee−ηf = E exp ξ log 1 − p−1 (1 − e−f )



,

f < − log(1 − p). 2

In both cases, the assertion follows by Lemma 15.1 (ii).

Using Cox and thinning transforms, we may partially strengthen the general uniqueness criteria of Theorem 15.1. Related convergence criteria are given in Chapter 23 and 30. Theorem 15.8 (one-dimensional uniqueness criteria, M¨onch, Grandell, OK) ˆ Then Let S be a Borel space with a dissecting ring U ⊂ S. (i) for any point processes ξ, η on S, we have ξ ∗ = η ∗ iff d

P {ξU = 0} = P {ηU = 0},

U ∈ U,

(ii) for any simple point processes or diffuse random measures ξ, η on S, we d have ξ = η iff for a fixed c > 0, E e−c ξU = E e−c ηU ,

U ∈ U,

(iii) for a simple point process or diffuse random measure ξ and an arbitrary d random measure η on S, we have ξ = η iff d

ξU = ηU,

U ∈ U.

Proof: (i) Let C be the class of sets {μ; μU = 0} in NS with U ∈ U , and note that C is a π-system, since for any B, C ∈ U ,



{μB = 0} ∩ {μC = 0} = μ(B ∪ C) = 0 ∈ C, by the ring property of U. Furthermore, the sets M ∈ BNS with P {ξ ∈ M } = P {η ∈ M } form a λ -system D, which contains C by hypothesis. Hence, d Theorem 1.1 (i) yields D ⊃ σ(C), which means that ξ = η on σ(C). Since the ring U is dissecting, it contains a dissecting system (Inj ), and for any μ ∈ NS we have

Σj





μ(U ∩ Inj ) ∧ 1 → μ∗ U,

U ∈ U,

since the atoms of μ are ultimately separated by the sets Inj . Since the sum on the left is σ(C)-measurable, so is the limit μ∗ U for every U ∈ U . By the proof of Lemma 15.1, the measurability extends to the map μ → μ∗ , and we d conclude that ξ ∗ = η ∗ . ˜ η˜ be Cox processes directed by (ii) First let ξ, η be diffuse. Letting ξ, c ξ, c η, respectively, we get by conditioning and hypothesis ˜ = 0} = E e−c ξU = E e−c ηU P {ξU = P {˜ η U = 0}, U ∈ U .

15. Poisson and Related Processes

331 d

d

˜ η˜ are a.s. simple by Lemma 15.6 (i), part (i) yields ξ˜ = η˜, and so ξ = η Since ξ, by Lemma 15.7 (i). ˜ η˜ be p Next let ξ, η be simple point processes. For p = 1 − e−c , let ξ, thinnings of ξ, η, respectively. Since Lemma 15.2 (iv) remains true when 0 ≤ f ≤ ∞, we may take f = ∞ · 1U for any U ∈ U to get

˜ = 0} = E exp ξ log(1 − p 1U ) P {ξU





= E exp ξU log(1 − p) = Ee−c ξ , ˜ η˜ satisfy and similarly for η˜ in terms of η. Hence, the simple point processes ξ, d d ˜ the condition in (i), and so ξ = η˜, which implies ξ = η. (iii) First let ξ be a simple point process. The stated condition yields ηU ∈ Z+ a.s. for every U ∈ U , and so η is a point process by Lemma 2.18. d d d By (i) we get ξ = η ∗ , and in particular ηU = ξU = η ∗ U for all U ∈ U , which ∗ implies η = η since the ring U is covering. ˜ η˜ be Cox processes directed by ξ, η, Next let η be a.s. diffuse. Letting ξ, d ˜ respectively, we get ξU = η˜U for any U ∈ U by Lemma 15.2 (ii). Since ξ˜ is d d simple by Lemma 15.6 (i), we conclude as before that ξ˜ = η˜, and so ξ = η by Lemma 15.7 (i). 2 The last result yields a nice characterization of Poisson processes: Corollary 15.9 (Poisson criterion, R´enyi) Let ξ be a random measure on S ˆ Then these conditions with Eξ{s} ≡ 0 a.s., and fix a dissecting ring U ⊂ S. are equivalent: (i) ξ is a Poisson process, (ii) ξU is Poisson distributed for every U ∈ U . Proof: Assume (ii). Since λ = Eξ ∈ MS , Theorem 15.5 yields a Poisson d process η on S with Eη = λ. Here ξU = ηU for all U ∈ U , and since η is d simple by Corollary 15.6 (i), we get ξ = η by Theorem 15.8 (iii), proving (i). 2 Much of the previous theory extends to processes with marks. Given some Borel spaces S and T , we define a T -marked point process on S as a point process ξ on S × T with ξ({s} × T ) ≤ 1 for all s ∈ S. Say that ξ has independent increments, if the point processes ξ(B1 × ·), . . . , ξ(Bn × ·) on T ˆ We also say that ξ are independent for any disjoint sets B1 , . . . , Bn ∈ S. is a Poisson process, if ξ is Poisson in the usual sense on the product space S×T . We may now characterize Poisson processes in terms of the independence property, extending Theorem 13.6. The result plays a crucial role in Chapters 16 and 29. A related characterization in terms of compensators is given in Corollary 15.17. For future purposes, we consider a slightly more general case. For any Smarked point process ξ on T , we introduce the discontinuity set D = {t ∈ T ;

332

Foundations of Modern Probability

E ξ({t} × S) > 0}, and say that ξ is an extended Poisson process, if it is an ordinary Poisson process on Dc , and the contributions ξt to the points t ∈ D are independent point measures on S with mass ≤ 1. Note that L(ξ) is then uniquely determined by E ξ. Theorem 15.10 (independent-increment point processes, Bateman, Copeland & Regan, Itˆo, Kingman) For an S-marked point process ξ on T , these conditions are equivalent: (i) ξ has independent T -increments, (ii) ξ is an extended Poisson process on T × S. Proof (OK): Subtracting the fixed discontinuities, we may reduce to the case where E ξ({t} × S) ≡ 0. Here we first consider a simple point process ξ on T = [0, 1] with E ξ{t} ≡ 0. Define a set function ρB = − log Ee−ξB ,

B ∈ Tˆ ,

and note that ρB ≥ 0 for all B since ξB ≥ 0 a.s. If ξ has independent increments, then for disjoint B, C ∈ Tˆ , ρ(B ∪ C) = − log Ee−ξB−ξC 

= − log Ee−ξB Ee−ξC = ρB + ρC,



which shows that ρ is finitely additive. It is even countably additive, since Bn ↑ B in Tˆ implies ρBn ↑ ρB by monotone and dominated convergence. It is further diffuse, since ρ{t} = − log Ee−ξ{t} = 0 for all t ∈ T . Now Theorem 15.5 yields a Poisson process η on T , such that E η = c−1 ρ ˆ we get by Lemma 15.2 (ii) with c = 1 − e−1 . For any B ∈ S,

Ee−ηB = exp −c−1 ρ(1 − e−1B )



= e−ρB = Ee−ξB . d

Since ξ, η are simple by hypothesis and Lemma 15.6 (i), we get ξ = η by Theorem 15.8 (ii). In the marked case, fix any B ∈ BT ×S , and define a simple point process on T by η = 1B ξ(· × S). Then the previous case shows that ξB = η is Poisson distributed, and so ξ is Poisson by Corollary 15.9. 2 The last result extends to a representation of random measures with independent increments4 . A extension to general processes on R+ will be given in Theorem 16.3. 4

also said to be completely random

15. Poisson and Related Processes

333

Corollary 15.11 (independent-increment random measures, Kingman) A random measure ξ on S has independent increments iff a.s.



ξB = αB + 0

x η(B × dx),

B ∈ S,

(2)

for a fixed measure α ∈ MS and an extended Poisson process η on S × (0, ∞) ∞ with ˆ (x ∧ 1) E η(B × dx) < ∞, B ∈ S. (3) 0

Proof (OK): On S × (0, ∞) we define a point process η = Σs δs, ξ{s} , where the required measurability follows by a simple approximation. Since η has independent S-increments and 







η {s} × (0, ∞) = 1 ξ{s} > 0 ≤ 1,

s ∈ S,

Theorem 15.10 shows that η is a Poisson process. Subtracting the atomic part from ξ, we obtain a diffuse random measure α satisfying (2), which has again independent increments and hence is a.s. non-random by Theorem 6.12. Next, Lemma 15.2 (i) yields for any B ∈ S and r > 0

− log E exp − r

0





x η(B × dx) =

0



(1 − e−rx ) Eη(B × dx).

As r → 0, we see by dominated convergence that (3) holds.

∞ 0

x η(B × dx) < ∞ a.s. iff 2

We summarize the various Poisson criteria, to highlight some remarkable equivalences. The last condition anticipates Corollary 15.17 below. Corollary 15.12 (Poisson criteria) Let ξ be a simple point process on S with E ξ{s} = 0 for all s ∈ S. Then these conditions are equivalent 5 : (i) ξ is a Poisson process, (ii) ξ has independent increments, ˆ (iii) ξB is Poisson distributed for every B ∈ S. When ξ is stationary on S = R+ , it is further equivalent that (iv) ξ is a renewal process with exponential holding times, (v) ξ has induced compensator cλ for a constant c ≥ 0. Proof: Here (i) ⇔ (ii) by Theorem 15.10 and (i) ⇔ (iii) by Corollary 15.9. For stationary ξ, the equivalence of (i)–(ii), (iv) was proved in Theorem 13.6. 2 In the multi-variate case, we have a further striking equivalence: 5 The equivalence (ii) ⇔ (iii) is remarkable, since both conditions are usually required to define a Poisson process.


Foundations of Modern Probability

Corollary 15.13 (multi-variate Poisson criteria) Let ξ be a point process on S × T with a simple, locally finite S-projection ξ¯ = ξ(· × T ), such that ¯ E ξ{s} = 0 for all s ∈ S. Then these conditions are equivalent: (i) ξ is a Poisson process on S × T , (ii) ξ has independent S-increments, (iii) ξ is a ν-randomization of a Poisson process ξ¯ on S, for a probability kernel ν : S → T . Proof: Since (iii) ⇒ (ii) is obvious and (ii) ⇒ (i) by Theorem 15.10, it remains to prove that (i) ⇒ (iii). Then let ξ be Poisson with intensity λ = E ξ, ¯ = λ(· × T ). By Theorem 3.4 and note that ξ¯ is again Poisson with intensity λ ¯ ⊗ ν. Letting η be there exists a probability kernel ν : S → T such that λ = λ ¯ a ν -randomization of ξ, we see from Theorem 15.3 (i) that η is again Poisson d ¯ and so L(ξ| ξ) ¯ = L(η| ξ), ¯ which shows ¯ = (η, ξ), with intensity λ. Hence, (ξ, ξ) ¯ that even ξ is a ν -randomization of ξ. 2 We may next characterize the mixed Poisson and binomial processes by a basic symmetry property. Related results for more general processes appear in Theorems 27.9 and 27.10. Given a random measure ξ and a diffuse measure λ d on S, we say that ξ is λ -symmetric if ξ ◦ f −1 = ξ for every λ -preserving map f on S. Theorem 15.14 (symmetry, Davidson, Matthes et al., OK) Let S be a Borel space with a diffuse measure λ ∈ MS . Then (i) a simple point process ξ on S is λ-symmetric iff it is a mixed Poisson or binomial process based on λ, (ii) a diffuse random measure ξ on S is λ-symmetric iff ξ = α λ a.s. for a random variable α ≥ 0. In both cases, this holds 6 iff L(ξB) depends only on λB. Proof: (i) If S = [0, 1], then ξ is invariant under λ -preserving transformations, and so ξ remains conditionally λ -symmetric, given ξ . We may then reduce by conditioning to the case of a constant ξ = n, so that ξ = Σk≤n δσk for some random variables σ1 < · · · < σn in [0, 1]. Now consider any subintervals I1 < · · · < In of [0, 1], where I < J means that s < t for all s ∈ I and t ∈ J. By contractability, the probability P

∩ k {σk ∈ Ik } = P ∩ k {ξIk > 0}

is invariant under individual shifts of I1 , . . . , In , subject to the mentioned order restriction. Thus, L(σ1 , . . . , σn ) is shift invariant on the set Δn = {s1 < · · · < 6 In particular, the notions of exchangeability and contractability are equivalent for simple or diffuse random measures on a finite or infinite interval. Note that the corresponding statements for finite sequences or processes on [0, 1] are false. See Chapters 27–28.



sn } ⊂ [0, 1]n , and hence equals n! λn on Δn , which means that ξ is a binomial process on [0, 1] based on λ and n. If instead S = R+ , the restrictions 1[0,t] ξ are mixed binomial processes based on 1[0,t] λ, and so ξ is mixed Poisson on R+ by Theorem 15.4. (ii) We may take S = [0, 1]. As in (i), we may further reduce by conditioning to the case where ξ = 1 a.s. Then Eξ = λ by contractability, and for any intervals I1 < I2 in I, the moment Eξ 2 (I1 × I2 ) is invariant under individual shifts of I1 and I2 , so that Eξ 2 is proportional to λ2 on Δ2 . Since ξ is diffuse, we have Eξ 2 = 0 on the diagonal {s1 = s2 } in [0, 1]2 . Hence, by symmetry and normalization, Eξ 2 = λ2 on [0, 1]2 , and so Var(ξB) = E(ξB)2 − (EξB)2 = λ2 B 2 − (λB)2 = 0,

B ∈ B[0,1] .

Then ξB = λB a.s. for all B, and so ξ = λ a.s. by a monotone-class argument. Now suppose that L(ξB) depends only on λB, and define Ee−rξB = ϕr (λB),

B ∈ S, r ≥ 0.

For any λ -preserving map f on S, we get Ee−rξ◦f

−1 B

= ϕr ◦ λ(f −1 B) = ϕr (λB) = Ee−rξB ,

and so ξ(f −1 B) = ξB by Theorem 6.3. Then Theorem 15.8 (iii) yields d 2 ξ ◦ f −1 = ξ, which shows that ξ is λ -symmetric. d

A Poisson process ξ on R+ is said to be stationary with rate c ≥ 0 if Eξ = c λ. Then Proposition 11.5 shows that Nt = ξ[0, t], t ≥ 0, is a spaceand time-homogeneous Markov process. We show how a point process on R+ , subject to a suitable regularity condition, can be transformed to Poisson by a predictable map. This leads to various time-change formulas for point processes, similar to those for continuous local martingales in Chapter 19. When ξ is an S-marked point process on (0, ∞), it is automatically locally ˆ Recall integrable, which ensures the existence of the associated compensator ξ. that ξ is said to be ql-continuous if ξ([τ ] × S) = 0 a.s. for every predictable ˆ time τ , so that the process ξˆt B is a.s. continuous for every B ∈ S. Theorem 15.15 (predictable mapping to Poisson) For any Borel spaces S, T and measure μ ∈ MT , let ξ be a ql-continuous, S-marked point process on ˆ and consider a predictable map Y : Ω×R+ ×S → T . (0, ∞) with compensator ξ, Then ξˆ ◦ Y −1 = μ a.s. ⇒ η = ξ ◦ Y −1 is Poisson (μ).


Foundations of Modern Probability

In particular, any simple, ql-continuous point process ξ on R+ is unit rate ˆ Note the analogy with Poisson, on the time scale given by the compensator ξ. the time-change reduction of continuous local martingales to Brownian motion in Theorem 19.4. Proof: For disjoint sets B1 , . . . , Bn ∈ BT with finite μ-measure, we need to show that ηB1 , . . . , ηBn are independent Poisson variables with means μB1 , . . . , μBn . Then define for each k ≤ n the processes

Jtk = Jˆtk =

t+

S 0 t 0

S

1Bk (Ys,x ) ξ(ds dx), ˆ dx). 1Bk (Ys,x ) ξ(ds

k Here Jˆ∞ = μBk < ∞ a.s. by hypothesis, and so the J k are simple, integrable point processes on R+ with compensators Jˆk . For fixed u1 , . . . , un ≥ 0, we put

Xt =

Σ k≤n





uk Jtk − (1 − e−uk )Jˆtk ,

t ≥ 0.

The process Mt = e−Xt has bounded variation and finitely many jumps, and so by an elementary change of variables, Mt − 1 = =

Σ Δe−Xs −

s≤t

Σ k≤n



t+



t 0

e−Xs dXsc





e−Xs− (1 − e−uk ) d Jˆsk − Jsk .

0

Since the integrands on the right are bounded and predictable, M is a uniformly integrable martingale, and so EM∞ = 1. Thus, 







E exp − Σ k uk ηBk = exp − Σ k (1 − e−uk ) μBk , 2

and the assertion follows by Theorem 6.3.
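For a simple point process with a deterministic compensator, the time change behind Theorem 15.15 is easy to visualize numerically. The following sketch is illustrative only; the rate function and all parameters are arbitrary choices, not taken from the text. It simulates an inhomogeneous Poisson process on R₊ by thinning a dominating homogeneous process and maps the arrival times through the cumulative rate Λ, the compensator here; the transformed inter-arrival times should then be approximately i.i.d. standard exponential, as for a unit-rate Poisson process.

```python
import numpy as np

rng = np.random.default_rng(4)

lam = lambda t: 2.0 + 1.5 * np.sin(t)                # deterministic intensity, bounded by 3.5
Lam = lambda t: 2.0 * t + 1.5 * (1.0 - np.cos(t))    # its integral, the compensator

# Inhomogeneous Poisson process on [0, T] by thinning a rate-3.5 homogeneous process
T, rate_max = 200.0, 3.5
cand = np.cumsum(rng.exponential(1.0 / rate_max, size=2000))
cand = cand[cand < T]
keep = rng.random(len(cand)) < lam(cand) / rate_max
arrivals = cand[keep]

gaps = np.diff(Lam(arrivals))      # inter-arrival times on the compensator time scale
print(gaps.mean(), gaps.std())     # both approximately 1, as for unit-rate Poisson
```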

We also have a multi-variate time-change result, similar to Proposition 19.8 for continuous local martingales. Say that the simple point processes ξ1 , . . . , ξn are orthogonal if Σ k ξk {s} ≤ 1 for all s. Corollary 15.16 (time-change reduction to Poisson, Papangelou, Meyer) Let ξ1 , . . . , ξn be orthogonal, ql-continuous point processes on (0, ∞) with a.s. unbounded compensators ξˆ1 , . . . , ξˆn , and define



Yk (t) = inf s > 0; ξˆk [0, s] > t , Then the processes ηk = ξk ◦

Yk−1

t ≥ 0, k ≤ n.

on R+ are independent, unit-rate Poisson.

Proof: The processes ηk are Poisson by Theorem 15.15, since each Yk is predictable and

ξˆk [0, Yk (t)] = ξˆk s ≥ 0; ξˆk [0, s] ≤ t = t a.s., t ≥ 0, by the continuity of ξˆk . The multi-variate statement follows from the same theorem, if we apply the predictable map Y = (Y1 , . . . , Yn ) to the marked



point process ξ = (ξ1 , . . . , ξn ) on (0, ∞) × {1, . . . , n} with compensator ξˆ = 2 (ξˆ1 , . . . , ξˆn ). From Theorem 15.15 we see that a ql-continuous, marked point process ξ is Poisson iff its compensator is a.s. non-random, similarly to the characterization of Brownian motion in Theorem 19.3. The following more general statement may be regarded as an extension of Theorem 15.10. Theorem 15.17 (extended Poisson criterion, Watanabe, Jacod) Let ξ be an ˆ where S is S-marked, F-adapted point process on (0, ∞) with compensator ξ, Borel. Then these conditions are equivalent: (i) ξˆ = μ is a.s. non-random, (ii) ξ is extended Poisson with E ξ = μ. Proof (OK): It is enough to prove that (i) ⇒ (ii). Then write μ = ν + Σ k (δtk ⊗ μk ),

ξ = η + Σ k (δtk ⊗ ξk ),

where t1 , t2 , . . . > 0 are the times with μ({tk }×S) > 0. Here ν is the restriction  of μ to the remaining set Dc × S with D = k {tk }, η is the corresponding restriction of ξ with E η = ν, and the ξk are point processes on S with ξk ≤ 1 and E ξk = μk . We need to show that η, ξ1 , ξ2 , . . . are independent, and that η is a Poisson process. Then for any measurable A ⊂ Dc × S with νA < ∞, we define

Xt =

0

t+

1A (r, s) ξ(dr ds),

Nt = e−Xt 1{Xt = 0}. ˆ

As in Theorem 15.15, we may check that N is a uniformly integrable martingale with N0 = 1 and N∞ = e νA 1{ηA = 0}. For any B1 , B2 , . . . ∈ S, we further introduce the bounded martingales Mtk = (ξk − μk )Bk 1{t ≥ tk },

t ≥ 0, k ∈ N.

Since N and M 1 , M 2 , . . . are mutually orthogonal, the process Mtn = Nt

Π Mtk ,

k≤n

t ≥ 0,

0 is again a uniformly integrable martingale, and so EM∞ = 1 and EM0n = 0 for all n > 0. Proceeding by induction, as in the proof of Theorem 10.27, we conclude that

E



Π ξk Bk ; ηA = 0 k≤n



= e−νA

Π μk Bk ,

k≤n

n ∈ N.

The desired joint distribution now follows by Corollary 15.9.

2

We may use discounted compensators to extend the previous time-change results beyond the ql-continuous case. To avoid distracting technicalities, we consider only unbounded point processes.


Foundations of Modern Probability

Theorem 15.18 (time-change reduction to Poisson) Let ξ = Σj δτj be a simple, unbounded, F-adapted point process on (0, ∞) with compensator η, let ϑ1 , ϑ2 , . . . ⊥⊥ F be i.i.d. U (0, 1), and define ρt = 1 − Σ j ϑj 1{t = τj },

Yt = ηtc −

Σ log s≤t





1 − ρs Δηs ,

t ≥ 0.

Then ξ˜ = ξ ◦ Y −1 is a unit rate Poisson process on R+ . Proof: Write σ0 = 0, and let σj = Yτj for j > 0. We claim that the differences σj −σj−1 are i.i.d. and exponentially distributed with mean 1. Since the τj are orthogonal, it is enough to consider σ1 . Letting G be the rightcontinuous filtration induced by F and the pairs (σj , ϑj ), we note that (τ1 , ϑ1 ) has G -compensator η˜ = η ⊗ λ on [0, τ1 ] × [0, 1]. Write ζ for the associated ¯ t]. Define discounted version with projection ζ¯ onto R+ , and put Zt = 1 − ζ(0, the G -predictable processes U and V = e−U on R+ × [0, 1] by Ut,r = ηtc − Vt,r =

Σ log(1 − Δηs) − log(1 − r Δηt), Π (1 − Δηs) (1 − r Δηt) s 0. Writing ξn = Σj δτnj and putting σnj = Yn (τnj ) with Yn as before, we get a.s. for every j ∈ N  

 

 

 

| τnj − σnj | ≤  τnj − ηn [0, τnj ]  +  ηn [0, τnj ] − σnj  → 0. 7

vd

For the convergence −→, see Chapter 23.



Letting n → ∞ along N  , we get ξn = Σ j δτnj −→ Σ j δσnj = ξ, vd

d

which yields the required convergence.

2

Poisson integrals of various kind occur frequently in applications. Here we give criteria for the existence of the integrals ξf , (ξ − ξ  )f , and (ξ − μ)f , where ξ and ξ  are independent Poisson processes with a common intensity μ. In each case, the integral may be defined as a limit in probability of elementary integrals ξfn , (ξ − ξ  )fn , or (ξ − μ)fn , respectively, where the fn are bounded with compact support and such that |fn | ≤ |f | and fn → f . We say that the integral of f exists, if the appropriate limit exists and is independent of the choice of approximating functions fn . Proposition 15.20 (Poisson integrals) Let ξ⊥⊥ ξ  be Poisson processes on S with a common diffuse intensity μ. Then for measurable functions f on S, we have (i) ξf exists ⇔ μ(|f | ∧ 1) < ∞, (ii) (ξ − ξ  )f exists ⇔ μ(f 2 ∧ 1) < ∞, (iii) (ξ − μ)f exists ⇔ μ(f 2 ∧ |f |) < ∞. In each case, the existence is equivalent to tightness of the corresponding set of approximating elementary integrals. Proof: (i) This is a special case of Lemma 15.6 (ii). (ii) For counting measures ν = Σk δsk we form a symmetrization ν˜ = P {ϑk = ±1} = 12 . Σk ϑk δsk , where ϑ1, ϑ2, . . . are i.i.d. random variables with 2 By Theorem 5.17, the series ν˜f converges a.s. iff νf < ∞, and otherwise P ˆ The |˜ ν fn | → ∞ for any bounded approximations fn = 1Bn f with Bn ∈ S. result extends by conditioning to any point process ν with symmetric randomization ν˜. Now Proposition 15.3 exhibits ξ − ξ  as such a randomization of the Poisson process ξ +ξ  , and (i) yields (ξ +ξ  )f 2 < ∞ a.s. iff μ(f 2 ∧1) < ∞. (iii) Write f = g + h, where g = f 1{|f | ≤ 1} and h = f 1{|f | > 1}. First let μg 2 + μ|h| = μ(f 2 ∧ |f |) < ∞. Since clearly E(ξf − μf )2 = μf 2 , the integral (ξ − μ)g exists. Since also ξh exists by (i), even (ξ − μ)f = (ξ − μ)g + ξh − μh exists. Conversely, if (ξ − μ)f exists, then so does (ξ − ξ  )f , and (ii) yields μg 2 + μ{h = 0} = μ(f 2 ∧ 1) < ∞. The existence of (ξ − μ)g now follows by the direct assertion, and trivially even ξh exists. The existence of μh = (ξ − μ)g + ξh − (ξ − μ)f then follows, and so μ|h| < ∞. 2 The multi-variate case is more difficult, and so we consider only positive double integrals ξ 2f and ξηf , where ξ 2 = ξ ⊗ ξ and ξη = ξ ⊗ η. Write f1 = μ2 fˆ and f2 = μ1 fˆ, where fˆ = f ∧ 1 and μi denotes μ -integration in the i-th coordinate. Say that a function f on S 2 is non-diagonal if f (x, x) ≡ 0.


Foundations of Modern Probability

Theorem 15.21 (double Poisson integrals, OK & Szulga) Let ξ⊥⊥η be Poisson processes on S with a common intensity μ, and consider a measurable, non-diagonal function f ≥ 0 on S 2 . Then ξηf < ∞ a.s. iff ξ 2f < ∞ a.s., which holds iff (i) f1 ∨ f2 < ∞ a.e. μ, (ii) μ{f1 ∨ f2 > 1} < ∞,   (iii) μ2 fˆ; f1 ∨ f2 ≤ 1 < ∞. Note that this result covers even the case where E ξ = E η. For the proof, we may clearly take μ to be diffuse, and whenever convenient we may even choose S = R+ and μ = λ. Define ψ(x) = 1 − e−x . Lemma 15.22 (Poisson moments) Let ξ be Poisson with Eξ = μ. Then for any measurable set B ⊂ S and function f ≥ 0 on S or S 2 , we have

(i) E ψ(ξf ) = ψ μ(ψ ◦ f ) , (ii) P {ξB > 0} = ψ(μB), (iii) E (ξf )2 = μf 2 + (μf )2 , E ξηf = μ2f , (iv) E (ξηf )2 = μ2f 2 + μ(μ1 f )2 + μ(μ2 f )2 + (μ2f )2 . Proof: (i)−(ii): Use Lemma 15.2 (ii) with η = μ. (iii) When μf < ∞, the first relation becomes Var(ξf ) = μf 2 , which holds by linearity, independence, and monotone convergence. For the second relation, use Fubini’s theorem. (iv) Use (iii) and Fubini’s theorem to get E(ξηf )2 = E{ξ(ηf )}2 = Eμ1 (ηf )2 + E(μ1 ηf )2 = μ1 E(ηf )2 + E{η(μ1 f )}2 2 = μ2f 2 + μ1 (μ2 f )2 + μ2 (μ1 f )2 + (μ2f )2 . Lemma 15.23 (tail estimate) For measurable f : S 2 → [0, 1] with μ2f < ∞,  

we have μ2f ψ . P ξηf > 12 μ2f >

1 + f1∗ ∨ f2∗ Proof: We may clearly take μ2f > 0. By Lemma 15.22 (iii)−(iv), we have E ξηf = μ2f and E(ξηf )2 ≤ (μ2f )2 + μ2f + f1∗ μf1 + f2∗ μf2 = (μ2f )2 + (1 + f1∗ + f2∗ ) μ2f, and so by Lemma 5.1,

(1 − 12 )2 (μ2f )2 P ξηf > 12 μ2f ≥ (μ2f )2 + (1 + f1∗ + f2∗ ) μ2f   1 + f1∗ ∨ f2∗ −1 > 1 +

μ2f   μ2f > ψ . 2

1 + f1∗ ∨ f2∗



Lemma 15.24 (decoupling) For non-diagonal, measurable functions f ≥ 0 on S 2 , we have E ψ(ξ 2f ) ) E ψ(ξηf ). Proof: Here it is helpful to take μ = λ on R+ . We may further assume that f is supported by the wedge {s < t} in R2+ . It is then equivalent to show that







E (V · ξ)∞ ∧ 1 ) E (V · η)∞ ∧ 1 , where Vt = ξf (·, t) ∧ 1. Since V is predictable with respect to the filtration induced by ξ, η, the random time

τ = inf t ≥ 0; (V · η)t > 1



is optional, and so the process 1{τ > t} is again predictable. Noting that ξ and η are both compensated by λ, we get



E (V · ξ)∞ ; τ = ∞ ≤ = = ≤ and so





E(V · ξ)τ E(V · λ)τ E(V · η)τ

E (V · η)∞ ∧ 1 + 2 P {τ < ∞},





E (V · ξ)∞ ∧ 1 ≤ E (V · ξ)∞ ; τ = ∞ + P {τ < ∞}



≤ E (V · η)∞ ∧ 1 + 3 P {τ < ∞}



≤ 4 E (V · η)∞ ∧ 1 . The same argument applies with the roles of ξ and η interchanged.

2

Proof of Theorem 15.21: Since ξ ⊗ η is a.s. simple, we have ξηf < ∞ iff ξη fˆ < ∞ a.s., which allows us to take f ≤ 1. First assume (i)−(iii). Since by (i)   E ξη f ; f1 ∨ f2 = ∞ ≤ Σ i μ2 (f ; fi = ∞) = 0, we may assume that f1 , f2 < ∞. Then Lemma 15.20 (i) gives ηf (s, ·) < ∞ and ξf (·, s) < ∞ a.s. for all s ≥ 0. Furthermore, (ii) implies ξ{f1 > 1} < ∞ and η{f2 > 1} < ∞ a.s. By Fubini’s theorem, we get a.s. 











ξη f ; f1 ∨ f2 > 1 ≤ ξ ηf ; f1 > 1 + η ξf ; f2 > 1 < ∞, which allows us to assume that even f1 , f2 ≤ 1. Then (iii) yields E ξηf = μ2f < ∞, which implies ξηf < ∞ a.s. Conversely, suppose that ξηf < ∞ a.s. for a function f : S 2 → [0, 1]. By Lemma 15.22 (i) and Fubini’s theorem,



E ψ μ ψ(t ηf ) = E ψ(t ξηf ) → 0,

t ↓ 0,


Foundations of Modern Probability

which implies μ ψ(t ηf ) → 0 a.s. and hence ηf < ∞ a.e. μ ⊗ P . By Lemma 15.20 and Fubini’s theorem we get f1 = μ2 f < ∞ a.e. μ, and the symmetric argument yields f2 = μ1 f < ∞ a.e. This proves (i). Next, Lemma 15.22 (i) yields on the set {f1 > 1}



E ψ(ηf ) = ψ μ2 (ψ ◦ f )

≥ ψ (1 − e−1 )f1



≥ ψ(1 − e−1 ) ≡ c > 0. Hence, for any measurable set B ⊂ {f1 > 1},



E μ 1 − ψ(ηf ); B ≤ (1 − c) μB, and so by Chebyshev’s inequality,





P μ ψ(ηf ) < 12 c μB ≤ P μ{1 − ψ(ηf ); B} > (1 − 12 c) μB



E μ 1 − ψ(ηf ); B (1 − c) μB 1 2







1−c . 1 − 12 c

Since B was arbitrary, we conclude that



P μ ψ(ηf ) ≥ 12 c μ{f1 > 1} ≥ 1 −

1−c c > 0. = 2−c 1 − 12 c

Noting that μ ψ(ηf ) < ∞ a.s. by Lemma 15.20 and Fubini’s theorem, we obtain μ{f1 > 1} < ∞. Combining with a similar result for f2 , we obtain (ii). Applying Lemma 15.23 to the function f 1{f1 ∨ f2 ≤ 1} gives

P ξηf >

1 2

μ2 (f ; f1 ∨ f2 ≤ 1)



> ψ



1 2



μ2 (f ; f1 ∨ f2 ≤ 1) ,

which implies (iii), since the opposite statement would yield the contradiction P {ξηf = ∞} > 0. To extend the result to the integral ξ 2f , we see from Lemma 15.24 that E ψ(t ξ 2f ) ) E ψ(t ξηf ), Letting t → 0 gives

t > 0.

P {ξ 2f = ∞} ) P {ξηf = ∞},

which implies ξ 2f < ∞ a.s. iff ξηf < ∞ a.s.

2

Exercises 1. Let ξ be a point process on a Borel space S. Show that ξ = Σk δτk for some random elements τk in S ∪ {Δ}, where Δ ∈ / S is arbitrary. Extend the result to general random measures. (Hint: We may assume that S = R+ .)

15. Poisson and Related Processes

343

2. Show that two random measures ξ, η are independent iff Ee−ξf −ηg = Ee−ξf × Ee−ηg for all measurable f, g ≥ 0. When ξ, η are simple point processes, prove the equivalence of P {ξB + ηC = 0} = P {ξB = 0}P {ηC = 0} for any B, C ∈ S. (Hint: Regard (ξ, η) as a random measure on 2 S.) 3. Let ξ1 , ξ2 , . . . be independent Poisson processes with intensity measures μ1 , μ2 , . . . , such that the measure μ = Σk μk is σ-finite. Show that ξ = Σk ξk is again Poisson with intensity measure μ. 4. Show that the classes of mixed Poisson and binomial processes are preserved by randomization. 5. Let ξ be a Cox process on S directed by a random measure η, and let f be a measurable mapping into a space T , such that η ◦ f −1 is a.s. σ-finite. Prove directly from definitions that ξ ◦ f −1 is a Cox process on T directed by η ◦ f −1 . Derive a corresponding result for p -thinnings. Also show how the result follows from Proposition 15.3. 6. Use Proposition 15.3 to show that (iii) ⇒ (ii) in Corollary 13.8, and use Theorem 15.10 to prove the converse. 7. Consider a p -thinning η of ξ and a q -thinning ζ of η with ζ⊥ ⊥η ξ. Show that ζ is a p q -thinning of ξ. 8. Let ξ be a Cox process directed by η or a p -thinning of η with p ∈ (0, 1), and fix two disjoint sets B, C ∈ S. Show that ξB⊥⊥ξC iff ηB⊥ ⊥ηC. (Hint: Compute the Laplace transforms. The ‘if’ assertions can also be obtained from Proposition 8.12.) 9. Use Lemma 15.2 to derive expressions for P {ξB = 0} when ξ is a Cox process directed by η, a μ-randomization of η, or a p -thinning of η. (Hint: Note that Ee−t ξB → P {ξB = 0} as t → ∞.) 10. Let ξ be a p -thinning of η, where p ∈ (0, 1). Show that ξ, η are simultaneously Cox. (Hint: Use Lemma 15.7.) 11. (Fichtner ) For a fixed p ∈ (0, 1), let η be a p -thinning of a point process ξ on S. Show that ξ is Poisson iff η ⊥⊥ ξ − η. (Hint: Extend by iteration to arbitrary p. Then a uniform randomization of ξ on S × [0, 1] has independent increments in the second variable, and the result follows by Theorem 19.3.) 12. Use Theorem 15.8 to give a simplified proof of Theorem 15.4, in the case of simple ξ. 13. Derive Theorem 15.4 from Theorem 15.14. (Hint: Note that ξ is symmetric on S iff it is symmetric on Bn for every n. If ξ is simple, the assertion follows immediately from Theorem 15.14. Otherwise, apply the same result to a uniform randomization on S × [0, 1].) 14. For ξ as in Theorem 15.14, show that P {ξB = 0} = ϕ(μB) for a completely monotone function ϕ. Conclude from the Hausdorff–Bernstein characterization and Theorem 15.8 that ξ is a mixed Poisson or binomial process based on μ. 15. Show that the distribution of a simple point process ξ on R may not be determined by the distributions of ξI for all intervals I. (Hint: If ξ is restricted to {1, . . . , n}, then the distributions of all ξI give Σk≤n k(n − k + 1) ≤ n3 linear relations between the 2 n − 1 parameters.) 16. Show that the distribution of a point process may not be determined by the one-dimensional distributions. (Hint: If ξ is restricted to {0, 1} with ξ{0}∨ξ{1} ≤ n,

344

Foundations of Modern Probability

then the one-dimensional distributions give 4n linear relations between the n(n + 2) parameters.) 17. Show that Lemma 15.1 remains valid with B1 , . . . , Bn restricted to an arbitrary pre-separating class C, as defined in Chapter 23 or Appendix 6. Also show that Theorem 15.8 holds with B restricted to a separating class. (Hint: Extend to the case where C = {B ∈ S; (ξ + η)∂B = 0 a.s.}. Then use monotone-class arguments for sets in S and MS .) 18. Show that Theorem 15.10 may fail without the condition Eξ({s} × K) ≡ 0. 19. Give an example of a non-Poisson point process ξ on S such that ξB is Poisson for every B ∈ S. (Hint: Take S = {0, 1}.) 20. Describe the marked point processes ξ with independent increments, in the presence of possible fixed atom sites. (Hint: Note that there are at most countably many fixed discontinuities, and that the associated component of ξ is independent of the remaining part. Now use Proposition 5.14.) 21. Describe the random measures ξ with independent increments, in the presence of possible fixed atom sites. Write the general representation as in Corollary 15.11, for a suitably generalized Poisson process η. 22. Show by an example that Theorem 15.15 may fail without the assumption of ql-continuity. 23. Use Theorem 15.15 to show that any ql-continuous simple point process ξ on (0, ∞) with a.s. unbounded compensator ξˆ can be time-changed into a stationary Poisson process η w.r.t. the correspondingly time-changed filtration. Also extend the result to possibly bounded compensators, and describe the reverse map η → ξ. 24. Extend Corollary 15.16 to possibly bounded compensators. Show that the result fails in general, when the compensators are not continuous. 25. For ξ1 , . . . , ξn as in Corollary 15.16 with compensators ξˆ1 , . . . , ξˆn , let Y1 , . . . , Yn be predictable maps into T such that the measures ξˆk ◦ Y −1 = μk are non-random k

in MT . Show that the processes ξk ◦ Yk−1 are independent Poisson μ1 , . . . , μn . 26. Prove Theorem 15.20 (i), (iii) by means of characteristic functions.

27. Let ξ⊥⊥η be Poisson processes on S with E ξ = E η = μ, and let f1 , f2 , . . . : P S → R be measurable with ∞ > μ(fn2 ∧ 1) → ∞. Show that |(ξ − η)fn | → ∞. (Hint: Consider the symmetrization ν˜ of a fixed measure ν ∈ NS with νfn2 → ∞, and argue along sub-sequences, as in the proof of Theorem 5.17.) 28. Give conditions for ξ 2 f < ∞ a.s. when ξ is Poisson with E ξ = λ and f ≥ 0 on R2+ . (Hint: Combine Theorems 15.20 and 15.21.)

Chapter 16

Independent-Increment and L´ evy Processes Infinitely divisible distributions, independent-increment processes, L´evy processes and subordinators, orthogonality and independence, stable L´evy processes, first-passage times, coupling of L´evy processes, approximation of random walks, arcsine laws, time change of stable integrals, tangential existence and comparison

Just as random walks are the most basic Markov processes in discrete time, so also the L´evy processes constitute the most fundamental among continuoustime Markov processes, as they are precisely the processes of this kind that are both space- and time-homogeneous. They may also be characterized as processes with stationary, independent increments, which suggests that they be thought of simply as random walks in continuous time. Basic special cases are given by Brownian motion and homogeneous Poisson processes. Dropping the stationarity assumption leads to the broader class of processes with independent increments, and since many of the basic properties are the same, it is natural to extend our study to this larger class. Under weak regularity conditions, their paths are right-continuous with left-hand limits (rcll), and the one-dimensional distributions agree with the infinitely divisible laws studied in Chapter 7. As the latter are formed by a combination of Gaussian and compound Poisson distributions, the general independent-increment processes are composed of Gaussian processes and suitable Poisson integrals. The classical limit theorems of Chapters 6–7 suggest similar approximations of random walks by L´evy processes, which may serve as an introduction to the general weak convergence theory in Chapters 22–23. In this context, we will see how results for random walks or Brownian motion may be extended to a more general continuous-time context. In particular, two of the arcsine laws derived for Brownian motion in Chapter 14 remain valid for general symmetric L´evy processes. Independent-increment processes may also be regarded as the simplest special cases of semi-martingales, to be studied in full generality in Chapter 20. We conclude the chapter with a discussion of tangential processes, where the ˜ with condiidea is to approximate a general semi-martingale X by a process X ˜ have similar asymptotic tionally independent increments, such that X and X properties. This is useful, since the asymptotic behavior of the latter processes can often be determined by elementary methods. © Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1_17



Foundations of Modern Probability

To resume our discussion of general processes with independent increments, P we say that an Rd -valued process X is continuous in probability 1 if Xs → Xt whenever s → t. This clearly prevents the existence of fixed discontinuities. To examine the structure of processes with independent increments, we may first choose regular versions of the paths. Recall that X is said to be rcll 2 , if it has finite limits Xt± from the right and left at every point t ≥ 0, and the former equals Xt . Then the only possible discontinuities are jumps, and we say that X has a fixed jump at a time t > 0 if P {Xt = Xt− } > 0. Lemma 16.1 (infinitely divisible distributions, de Finetti) For a random vector ξ in Rd , these conditions are equivalent: (i) ξ is infinitely divisible, d

(ii) ξ = X1 for a process X in Rd with stationary, independent increments and X0 = 0. We may then choose X to be continuous in probability, in which case (iii) L(X) is uniquely determined by L(ξ), (iv) ξ ∈ Rd+ a.s. ⇔ Xt ∈ Rd+ a.s., t ≥ 0. Proof: To prove (i) ⇔ (ii), let ξ be infinitely divisible. Using the representation in Theorem 7.5, we can construct the finite-dimensional distributions of an associated process X in Rd with stationary, independent increments, and then invoke Theorem 8.23 for the existence of an associated process. Assertions (iii)−(iv) also follow from the same theorem. 2 We first describe the processes with stationary, independent increments: Theorem 16.2 (L´evy processes and subordinators, L´evy, Itˆo) Let the process X in Rd be continuous in probability with X0 = 0. Then X has stationary, independent increments iff it has an rcll version, given for t ≥ 0 by t

Xt = b t + σBt +

0

|x|≤1

x (η − E η)(ds dx) +

t 0

|x|>1

x η(ds dx),

where b ∈ Rd , σ is a d × d matrix, B is a Brownian motion in Rd , and η is an independent Poisson process on R+ × (Rd \ {0}) with E η = λ ⊗ ν, where  (|x|2 ∧ 1) ν(dx) < ∞. When X is Rd+ -valued, the representation simplifies to t

Xt = a t +

x η(ds dx) a.s., 0

t ≥ 0,

where a ∈ Rd+ , and η is a Poisson process on R+ × (Rd+ \ {0}) with E η = λ ⊗ ν, where (|x| ∧ 1) ν(dx) < ∞. The triple (b, a, ν) with a = σ  σ or pair (a, ν) is then unique, and any choice with the stated properties may occur. 1

also said to be stochastically continuous right-continuous with left-hand limits; the French acronym c` adl` ag is also common, even in English texts 2

16. Independent-Increment and L´evy Processes

347

A process X in Rd with stationary, independent increments is called a L´evy processes, and when X is R+ -valued it is also called a subordinator 3 . Furthermore, the measure ν is called the L´evy measure of X. The result follows easily from the theory of infinitely divisible distributions: Proof: Given Theorem 7.5, it is enough to show that X has an rcll version. This being obvious for the first two terms as well as for the final integral term, it remains to prove that the first double integral Mt has an rcll version. By Theorem 9.17 we have Mtε → Mt uniformly in L2 as ε → 0, where Mtε is the sum of compensated jumps of size |x| ∈ (ε, 1], hence a martingale. Proceeding along an a.s. convergent sub-sequence, we see that the rcll property extends to the limit. 2 The last result can also be proved by probabilistic methods. We will develop the probabilistic approach in a more general setting, where we drop the stationarity assumption and even allow some fixed discontinuities. Given an rcll process X in Rd , we define the associated jump point process η on (0, ∞) × (Rd \ {0} by η=

Σ δ t,ΔX t>0

t

=

Σ1 t>0





(t, ΔXt ) ∈ · ,

(1)

where the summations extend over all times t > 0 with ΔXt = Xt − Xt− = 0. Theorem 16.3 (independent-increment processes, L´evy, Itˆo, Jacod, OK) Let the process X in Rd be rcll in probability with independent increments and X0 = 0. Then (i) X has an rcll version with jump point process η, given for t ≥ 0 by t

Xt = bt + Gt +

0

|x|≤1

x (η − E η)(ds dx) +

t 0

|x|>1

x η(ds dx),

(2)

where b is an rcll function in Rd with b0 = 0, G is a continuous, centered Gaussian process in Rd with independent increments and G0 = 0, and η is an independent, extended Poisson process on R+ × (Rd \ {0}) with 4 t 0

(|x|2 ∧ 1) E η  (ds dx) < ∞,

t > 0,

(3)

(ii) for non-decreasing X, the representation simplifies to t

Xt = at +

x η(ds dx) a.s., 0

t ≥ 0,

where a is a non-decreasing, rcll function in Rd+ with a0 = 0, and η is an extended Poisson process on R+ × (Rd+ \ {0}) with t 0

(|x| ∧ 1) E η(ds dx) < ∞,

t > 0.

3 This curious term comes from the fact that the composition of a Markov process X with a subordinator T yields a new Markov process Y = X ◦ T . 4 Here we need to replace η by a slightly modified version η  , as explained below.

348

Foundations of Modern Probability

Both representations are a.s. unique, and any processes η and G with the stated properties may occur.

First we need to clarify the meaning of the jump component in (i) and the associated integrability condition. Here we consider separately the continuous Poisson component and the process of fixed discontinuities. For the former, let η be a stochastically continuous Poisson process on R+ × (Rd \ {0}, and introduce the elementary processes Ytr =

t

0 r1

x η(ds dx),

t ≥ 0,

with associated symmetrizations Y˜tr . Proposition 16.4 (Poisson component) For any stochastically continuous Poisson process η on R+ × (Rd \ {0}), these conditions are equivalent: (i) the family {Y˜tr ; r > 0} is tight for every t ≥ 0, t

(ii)

0

(|x|2 ∧ 1) E η(ds dx) < ∞,

t > 0,

(iii) there exists an rcll process Y in R with d

(Y r − Y )∗t → 0 a.s.,

t ≥ 0.

Proof: Since clearly (iii) ⇒ (i), it is enough to prove (i) ⇒ (ii) ⇒ (iii). (i) ⇒ (ii): Assume (i), and fix any t > 0. By Lemma 6.2 we can choose ˜r a > 0, such that EeiuYt ≥ 12 for all r < 1 and u ≤ a. Then by Lemma 15.2 and Fubini’s theorem, − 12



a



˜r



a

log E eiuYt du =

du

0

0

≈ a

|x|>r



|x|>r

Now take the lim sup as r → 0.

(1 − cos ux) E ηt (dx)





= a

|x|>r

1−



sin ax E ηt (dx) ax

(|ax|2 ∧ 1) E ηt (dx).

(ii) ⇒ (iii): By Lemma 15.22, we have Var|ηf | = E η|f |2 for suitably bounded functions f . Hence as r, r → 0, t



E |Ytr − Ytr |2 = Var

0 r 1} < ∞ a.s. for all t > 0. Then these conditions are equivalent: (i) the family {Y˜tn ; n > 0} is tight for every t ≥ 0,   (ii) Σ Var{ξs; |ξs| ≤ 1} + P {|ξs| > 1} < ∞, t > 0, s ∈ Dt

(iii) there exists an rcll process Y in Rd with Ytn → Yt a.s.,

t ≥ 0.

Proof: Here again (iii) ⇒ (i), and so it remains to prove (i) ⇒ (ii) ⇒ (iii). (i) ⇒ (ii): Fix any t > 0. By Theorem 4.18 we may assume that |ξt | ≤ 1 for all t. Then Theorem 5.17 yields Σ Var(ξs) = 1 Σ Var(ξ˜s) < ∞. s ∈ Dt

2

s ∈ Dt

(ii) ⇒ (iii): Under (i), we have convergence Ytn → Yt a.s. for every t ≥ 0 by Theorem 5.18, and we also note that the limiting process Y is rcll in probability. The existence of a version with rcll paths then follows by Proposition 16.6 below. 2 We proceed to establish the pathwise regularity. Proposition 16.6 (regularization) For a process X in Rd with independent increments, these conditions are equivalent: (i) X is rcll in probability, (ii) X has a version with rcll paths. This requires a simple analytic result. Lemma 16.7 (complex exponentials) For any constants x1 , x2 , . . . ∈ Rd , these conditions are equivalent: (i) xn converges in Rd , (ii) eiuxn converges in C for almost every u ∈ Rd .

350

Foundations of Modern Probability

Proof: Assume (ii). Fix a standard Gaussian random vector ζ in Rd , and note that exp{it ζ(am − an )} → 1 a.s. as m, n → ∞ for fixed t ∈ R. Then by P dominated convergence E exp{it ζ(am − an )} → 1, and so ζ(am − an ) → 0 by Theorem 6.3, which implies am − an → 0. Thus, the sequence (an ) is Cauchy, and (i) follows. 2 P

Proof of Proposition 16.6: Assume (i). In particular, Xt → 0 as t → 0, and so ϕt (u) → 1 for every u ∈ Rd , where ϕt (u) = EeiuXt ,

t ≥ 0, u ∈ Rd .

For fixed u ∈ Rd we may then choose a tu > 0, such that ϕt (u) = 0 for all t ∈ [0, tu ). Then the process eiuXt , t < t u , u ∈ Rd , Mtu = ϕt (u) exists and is a martingale in t, and so by Theorem 9.28 it has a version with rcll paths. Since ϕt (u) has the same property, we conclude that even eiuXt has an rcll version. By a similar argument, every s > 0 has a neighborhood Is , such that eiuXt has an rcll version on Is , except for a possible jump at s. The rcll property then holds on all of Is , and by compactness we conclude that eiuXt has an rcll version on the entire domain R+ . Now let A be the set of pairs (ω, u) ∈ Ω × Rd , such that eiuXt has right and left limits along Q at every point. Expressing this in terms of upcrossings, we see that A is product measurable, and the previous argument shows that all u -sections Au have probability 1. Hence, by Fubini’s theorem, P -almost every ω -section Aω has full λd -measure. This means that, on a set Ω of probability 1, eiuXt has the stated continuity properties for almost every u. Then by Lemma 16.7, Xt itself has the same continuity properties on Ω . Now define ˜ t ≡ 0 otherwise. Then X ˜ is clearly rcll, and by (i) ˜ t ≡ Xt+ on Ω , and put X X ˜ 2 we have Xt = Xt a.s. for all t. To separate the jump component from the remainder of X, we need the following independence criterion: Lemma 16.8 (orthogonality and independence) Let (X, Y ) be an rcll processes in (Rd )2 with independent increments and X0 = Y0 = 0, where Y is an elementary step process. Then (ΔX)(ΔY ) = 0 a.s.



X⊥⊥ Y.

Proof: By a transformation of the jump sizes, we may assume that Y has locally integrable variation. Now introduce as before the characteristic functions t ≥ 0, u, v ∈ Rd , ϕt (u) = EeiuXt , ψt (v) = EeivYt , and note that both are = 0 on some interval [0, tu,v ), where tu,v > 0 for all u, v ∈ Rd . On the same interval, we can then introduce the martingales

16. Independent-Increment and L´evy Processes

Mtu =

eiuXt , ϕt (u)

Ntv =

eivYt , ψt (v)

351

t < tu,v , u, v ∈ Rd ,

where N v has again integrable variation on every compact subinterval. Fixing any t ∈ (0, tu,v ) and writing h = t/n with n ∈ N, we get by the martingale property and dominated convergence E Mtu Ntv − 1 = E = E →E = E

Σ k≤n



t

0 t 0

u u Mkh − M(k−1)h



v v Nkh − N(k−1)h





u u M[sn+1−]h − M[sn−]h dNsv

(ΔMsu ) dNsv

Σ (ΔMsu)(ΔNsv ) = 0.

s≤t

This shows that E Mtu Ntv = 1, and so E eiuXt +ivYt = E eiuXt E eivYt ,

t < tu,v , u, v ∈ Rd .

By a similar argument, along with the independence (ΔXt , ΔYt ) ⊥ ⊥ (X − ΔXt , Y − ΔYt ),

t > 0,

every point r > 0 has a neighborhood Iu,v , such that t

t

t

t

E eiuXs +ivYs = E eiuXs E eivYs ,

s, t ∈ Iu,v , u, v ∈ Rd ,

where Xst = Xt − Xs and Yst = Yt − Ys . By a compactness argument, the same relation holds for all s, t ≥ 0, and since u, v were arbitrary, Theorem 6.3 yields ⊥ Yst for all s, t ≥ 0. Since (X, Y ) has independent increments, the indeXst ⊥ pendence extends to any finite collection of increments, and we get X⊥⊥Y . 2 The last lemma allows us to eliminate the compensated random jumps, which leaves us with a process with only non-random jumps. It remains to show that the latter process is Gaussian. Proposition 16.9 (Gaussian processes with independent increments) Let X be an rcll process in Rd with independent increments and X0 = 0. Then these conditions are equivalent: (i) X has only non-random jumps, (ii) X is Gaussian of the form Xt = bt + Gt a.s., t ≥ 0, where G is a continuous, centered, Gaussian process with independent increments and G0 = 0, and b is an rcll function in Rd with b0 = 0.


Proof: We need to prove only (i) ⇒ (ii), so assume (i). Let X̃ be a symmetrization of X, and note that X̃ has again independent increments. It is also a.s. continuous, since the only discontinuities of X are constant jumps, which are canceled by the symmetrization. Hence, X̃ is symmetric Gaussian by Theorem 14.4.

Now fix any t > 0, and write ξ_{nj} and ξ̃_{nj} for the increments of X, X̃ over the n sub-intervals of length t/n. Since max_j |ξ̃_{nj}| → 0 a.s. by compactness, the ξ̃_{nj} form a null array in R^d, and so by Theorem 6.12,

  Σ_j P{ |ξ̃_{nj}| > ε } → 0,   ε > 0.

Letting m_{nj} be component-wise medians of ξ_{nj}, we see from Lemma 5.19 that the differences ξ′_{nj} = ξ_{nj} − m_{nj} again form a null array satisfying

  Σ_j P{ |ξ′_{nj}| > ε } → 0,   ε > 0.   (4)

Putting m_n = Σ_j m_{nj}, we may write

  X_t = Σ_j ξ′_{nj} + m_n = Σ_j ξ′_{nj} + (n|m_n|) · m_n/(n|m_n|),   n ∈ N,

where the n(1 + |m_n|) terms on the right again form a null array in R^d, satisfying a condition like (4). Then X_t is Gaussian by Theorem 6.16. A similar argument shows that all increments X_t − X_s are Gaussian, and so the entire process X has the same property, and the difference G = X − EX is centered Gaussian. It is further continuous, since X has only constant jumps, which all go into the mean b_t = EX_t = X_t − G_t. The latter is clearly non-random and rcll, and the independence of the increments carries over to G. □

Proof of Theorem 16.3 (OK): By Proposition 16.6 we may assume that X has rcll paths. Then introduce the associated jump point process η on R_+ × (R^d \ {0}), which clearly inherits the property of independent increments. By Theorem 15.10 the process η is then extended Poisson, and hence can be decomposed into a stochastically continuous Poisson process η_1 and a process η_2 ⊥⊥ η_1 with fixed discontinuities, supported by a countable set D ⊂ R_+.

Let Y_1^r, Y_2^n be the associated approximating processes in Propositions 16.4 and 16.5, and note that by Lemma 16.8 the pairs (Y_1^r, Y_2^n) are independent from the remaining parts of X. Factoring the characteristic functions, we see from Lemma 6.2 that the corresponding families of symmetrized processes Ỹ_1^r, Ỹ_2^n are tight at fixed times t > 0. Hence, Propositions 16.4 and 16.5 yield the stated integrability conditions for Eη adding up to (3), and there exist some limiting processes Y_1, Y_2, adding up to the integral terms in (2). The previous independence property clearly extends to Y ⊥⊥ (X − Y), where Y = Y_1 + Y_2. Since the jumps of X are already contained in Y and the continuous centering in Y_1 gives no new jumps, the only remaining jumps of X − Y are the fixed jumps arising from the centering in Y_2. Thus, Proposition 19.2 yields X − Y = G + b, where G is continuous Gaussian with independent increments and b is non-random, rcll. □
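To make the decomposition just established concrete, here is a minimal simulation sketch (not from the text) of a one-dimensional process with independent increments built from a drift, a Gaussian component, and jumps read off from a Lévy measure. The concrete choices below (symmetric ν(dx) = |x|^{−1−α} dx, cutoffs ε and 50, an Euler grid) are ours, purely for illustration; because this ν is symmetric, the compensator of the truncated small jumps vanishes and no extra centering term is needed.

import numpy as np

# Sketch of X = drift + Gaussian part + jump parts for a Lévy process in R.
rng = np.random.default_rng(3)
a, b, alpha, eps = 0.5, 0.1, 1.2, 0.05
t_max, dt = 1.0, 1e-3
grid = np.arange(0.0, t_max + dt, dt)

# Drift plus Gaussian component.
dB = rng.normal(0.0, np.sqrt(dt), len(grid) - 1)
X = b * grid + np.sqrt(a) * np.concatenate([[0.0], np.cumsum(dB)])

# Jumps with modulus in (lo, hi]: finite total intensity, so compound Poisson.
def add_jumps(X, lo, hi):
    lam = 2.0 * (lo**(-alpha) - hi**(-alpha)) / alpha          # nu{ lo < |x| <= hi }
    times = rng.uniform(0.0, t_max, rng.poisson(lam * t_max))
    u = rng.uniform(size=times.size)
    mags = (lo**(-alpha) - u * (lo**(-alpha) - hi**(-alpha)))**(-1.0 / alpha)
    signs = rng.choice([-1.0, 1.0], size=times.size)
    for s, x in zip(times, mags * signs):
        X[grid >= s] += x
    return X

X = add_jumps(X, eps, 1.0)      # "small" jumps above the cutoff eps
X = add_jumps(X, 1.0, 50.0)     # large jumps, truncated at 50 for the simulation
print(X[-1])                    # one sample of X_1

Dropping the jumps below ε and the truncation at 50 are the only approximations; refining ε recovers the full small-jump integral in the limit.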


Of special interest is the case where X is a semi-martingale:

Corollary 16.10 (semi-martingales with independent increments, Jacod, OK) Let X be an rcll process with independent increments. Then
(i) X = Y + b, where Y is a semi-martingale and b is non-random rcll,
(ii) X is a semi-martingale iff b has locally finite variation,
(iii) when X is a semi-martingale, (3) holds with E η̃ replaced by E η.

Proof: (i) We need to show that all terms making up X − b are semi-martingales. Then note that G and the first integral term are martingales, whereas the second integral term represents a process of isolated jumps.
(ii) Part (i) shows that X is a semi-martingale iff b has the same property, which holds by Corollary 20.21 below iff b has locally finite variation.
(iii) By the definition of η, the only discontinuities of b are the fixed jumps E(ξ_t; |ξ_t| ≤ 1). Since b is a semi-martingale, its quadratic variation is locally finite, and we get

  ∫_{D_t} ∫_{|x|≤1} |x|² Eη(ds dx)
   = Σ_{s ∈ D_t} E( |ξ_s|²; |ξ_s| ≤ 1 )
   = Σ_{s ∈ D_t} ( Var{ξ_s; |ξ_s| ≤ 1} + |E{ξ_s; |ξ_s| ≤ 1}|² ) < ∞.  □

By Proposition 11.5, a Lévy process X is Markov for the induced filtration G = (G_t), with the translation-invariant transition kernels

  μ_t(x, B) = μ_t(0, B − x) = P{ X_t ∈ B − x },   t ≥ 0.
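By translation invariance, the kernel composition μ_s μ_t reduces to convolution of the laws of X_s and X_t, so the Chapman–Kolmogorov relation becomes μ_s * μ_t = μ_{s+t}. The following small numerical check (not from the text) verifies this for the simplest case, Brownian motion with drift b and variance a; the grid and parameter values are arbitrary illustrative choices.

import numpy as np

# Check mu_s * mu_t = mu_{s+t} for a Gaussian Lévy process by grid convolution.
a, b, s, t = 1.0, 0.3, 0.4, 0.6
x = np.linspace(-12.0, 12.0, 4801)
dx = x[1] - x[0]

def density(u, x):
    return np.exp(-(x - b * u)**2 / (2 * a * u)) / np.sqrt(2 * np.pi * a * u)

conv = np.convolve(density(s, x), density(t, x), mode="same") * dx
print(np.max(np.abs(conv - density(s + t, x))))   # small discretization error only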

More generally, given any filtration F, we say that a process X is Lévy with respect to F or simply F-Lévy, if it is adapted to F and such that (X_t − X_s) ⊥⊥ F_s for all s < t. In particular, we may choose the filtration F_t = G_t ∨ N, t ≥ 0, with N = σ{N ⊂ A; A ∈ A, PA = 0}, which is right-continuous by Corollary 9.26. Just as for Brownian motion in Theorem 14.11, we see that when X is F-Lévy for a right-continuous, complete filtration F, it is a strong Markov process, in the sense that the process X′ = θ_τ X − X_τ satisfies X′ =d X with X′ ⊥⊥ F_τ, for every optional time τ < ∞.
Turning to some symmetry conditions, we say that a process X on R_+ is self-similar, if for every r > 0 the scaled process X_{rt}, t ≥ 0, has the same distribution as sX for a constant s = h(r) > 0. Excluding the trivial case of E|X_t| ≡ 0, we see that h satisfies the Cauchy equation h(rs) = h(r)h(s). If


X is right-continuous, then h is continuous, and all solutions are of the form h(r) = r^α for some α ∈ R. A Lévy process X in R is said to be strictly stable if it is self-similar, and weakly stable if it is self-similar apart from a centering, so that for every r > 0 the process X_{rt}, t ≥ 0, has the same distribution as sX_t + bt for some s and b. In either case, the symmetrized version X − X′ is strictly stable with the same parameter α, and so s is again of the form r^α. Since clearly α > 0, we may introduce the stability index p = α^{−1}, and say that X is strictly or weakly p-stable. The terminology carries over to random variables or vectors with distribution L(X_1).

Corollary 16.11 (stable Lévy processes) Let X be a non-degenerate Lévy process in R with characteristics (a, b, ν). Then X is weakly p-stable for a p > 0 iff one of these conditions holds:
(i) p = 2 and ν = 0,
(ii) p ∈ (0, 2), a = 0, and ν(dx) = c_± |x|^{−p−1} dx on R_± for some c_± ≥ 0.
For subordinators, it is further equivalent that
(iii) p ∈ (0, 1) and ν(dx) = c x^{−p−1} dx on (0, ∞) for a c > 0.

In (i) we recognize a scaled Brownian motion with possible drift. An important case of (ii) is the symmetric, 1-stable Lévy process, known under suitable normalization as a Cauchy process.

Proof: Writing S_r: x ↦ rx for r > 0, we note that the processes X(r^p t) and rX have characteristics r^p(a, b, ν) and (r²a, rb, ν ∘ S_r^{−1}), respectively. Since the latter are determined by the distributions, it follows that X is weakly p-stable iff r^p a = r²a and r^p ν = ν ∘ S_r^{−1} for all r > 0. In particular, a = 0 when p ≠ 2. Writing F(x) = ν[x, ∞) or ν(−∞, −x], we also note that r^p F(rx) = F(x) for all r, x > 0, so that F(x) = x^{−p} F(1), which yields a density of the stated form. The condition ∫(x² ∧ 1) ν(dx) < ∞ implies p ∈ (0, 2) when ν ≠ 0. When X ≥ 0 we have the stronger requirement ∫(x ∧ 1) ν(dx) < ∞, so in this case p < 1. □

If X is weakly p-stable with p ≠ 1, it becomes strictly p-stable through a suitable centering. In particular, a weakly p-stable subordinator is strictly stable iff its drift component vanishes, in which case X is simply said to be stable. We show how stable subordinators may arise naturally even in the context of continuous processes. Given a Brownian motion B in R, we introduce the maximum process M_t = sup_{s≤t} B_s with right-continuous inverse







  T_r = inf{ t ≥ 0; M_t > r } = inf{ t ≥ 0; B_t > r },   r ≥ 0.

Theorem 16.12 (first-passage times, Lévy) For a Brownian motion B, the process T is a ½-stable subordinator with Lévy measure

  ν(dx) = (2π)^{−1/2} x^{−3/2} dx,   x > 0.


Proof: By Lemma 9.6, the random times T_r are optional with respect to the right-continuous filtration F induced by B. By the strong Markov property of B, the process θ_r T − T_r is then independent of F_{T_r} and distributed as T. Since T is further adapted to the filtration (F_{T_r}), it follows that T has stationary, independent increments and hence is a subordinator.
To see that T is ½-stable, fix any c > 0, put B̃_t = c^{−1} B(c²t), and define T̃_r = inf{t ≥ 0; B̃_t > r}. Then

  T_{cr} = inf{ t ≥ 0; B_t > cr }
   = c² inf{ t ≥ 0; B̃_t > r } = c² T̃_r.

By Proposition 16.11, the Lévy measure of T has a density of the form a x^{−3/2}, x > 0, and it remains to identify a. Then note that the process

  X_t = exp( uB_t − ½u²t ),   t ≥ 0,

is a martingale for any u ∈ R. In particular, E X_{T_r ∧ t} = 1 for any r, t ≥ 0, and since clearly B_{T_r} = r, we get by dominated convergence

  E exp( −½u²T_r ) = e^{−ur},   u, r ≥ 0.

Taking u = √2 and comparing with Corollary 7.6, we obtain

  √2 / a = ∫_0^∞ (1 − e^{−x}) x^{−3/2} dx
   = 2 ∫_0^∞ e^{−x} x^{−1/2} dx = 2√π,

which shows that a = (2π)^{−1/2}. □
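The Laplace-transform identity E exp(−½u²T_r) = e^{−ur} used above is easy to test numerically. The following crude simulation sketch (not from the text) approximates T_r on a time grid; the grid size dt, horizon t_max and path count are arbitrary choices, and paths that never cross r before t_max contribute terms bounded by exp(−u²t_max/2), which we simply drop.

import numpy as np

# Simulation check of E exp(-u^2 T_r / 2) = exp(-u r) for Brownian first passage.
rng = np.random.default_rng(0)
dt, t_max, r, u = 1e-3, 20.0, 1.0, 1.0
n_steps, n_paths = int(t_max / dt), 2000

acc = 0.0
for _ in range(n_paths):
    B = np.cumsum(rng.normal(0.0, np.sqrt(dt), n_steps))
    idx = np.argmax(B > r)                 # first grid index above level r (0 if never)
    if B[idx] > r:
        acc += np.exp(-0.5 * u**2 * (idx + 1) * dt)
print(acc / n_paths, np.exp(-u * r))       # both numbers should be close to 0.37

The discretization slightly overestimates T_r, so the empirical value sits just below e^{−ur}; refining dt reduces the gap.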

If we add a negative drift to a Brownian motion, the associated maximum process M becomes bounded, so that T = M^{−1} terminates by a jump to infinity. Here it becomes useful to allow subordinators with infinite jumps. By an extended subordinator we mean a process of the form X_t ≡ Y_t + ∞·1{t ≥ ζ} a.s., where Y is an ordinary subordinator and ζ is an independent, exponentially distributed random variable, so that X is obtained from Y by exponential killing. The representation in Theorem 16.3 remains valid in the extended case, except for a positive mass of ν at ∞. The following characterization is needed in Chapter 29.

Lemma 16.13 (extended subordinators) Let X be a non-decreasing, right-continuous process in [0, ∞] with X_0 = 0 and induced filtration F. Then X is an extended subordinator iff

 



L Xs+t − Xs  Fs = L(Xt ) a.s. on {Xs < ∞},

s, t > 0.

(5)

Proof: Writing ζ = inf{t; Xt = ∞}, we get from (5) the Cauchy equation P {ζ > s + t} = P {ζ > s}P {ζ > t},

s, t ≥ 0,

(6)

356

Foundations of Modern Probability

which implies that ζ is exponentially distributed with mean m ∈ (0, ∞]. Next define μt = L(Xt |Xt < ∞), t ≥ 0, and conclude from (5) and (6) that the μt form a semigroup under convolution. By Theorem 11.4 there exists a corresponding process Y with stationary, independent increments. From the rightcontinuity of X, it follows that Y is continuous in probability. Hence, Y has d ˜ ⊥Y , and let X ˜ denote the a subordinator version. Now choose ζ˜ = ζ with ζ⊥ d ˜ ˜ process Y killed at ζ. Comparing with (5), we note that X = X. By Theorem ˜ a.s., which means that X is an extended 8.17 we may assume that even X = X subordinator. The converse assertion is obvious. 2 The weak convergence of infinitely divisible laws extends to a pathwise approximation of the corresponding L´evy processes. Theorem 16.14 (coupling of L´evy processes, Skorohod) For any L´evy prod d ˜n = Xn cesses X, X 1 , X 2 , . . . in Rd with X1n → X1 , there exist some processes X   with  ˜n  P t ≥ 0. sup X s − Xs  → 0, s≤t

For the proof, we first consider two special cases. Lemma 16.15 (compound Poisson case) Theorem 16.14 holds for compound Poisson processes X, X 1 , X 2 , . . . with characteristic measures ν, ν1 , ν2 , . . . satw isfying νn → ν. Proof: Adding positive masses at the origin, we may assume that ν and the νn have the same total mass, which may then be reduced to 1 through a suitable scaling. If ξ1 , ξ2 , . . . and ξ1n , ξ2n , . . . are associated i.i.d. sequences, we d get (ξ1n , ξ2n , . . .) → (ξ1 , ξ2 , . . .) by Theorem 5.30, and by Theorem 5.31 we may strengthen this to a.s. convergence. Letting N be an independent, unit-rate Poisson process and defining Xt = Σj≤Nt ξj and Xtn = Σj≤Nt ξjn , we obtain 2 (X n − X)∗t → 0 a.s. for all t ≥ 0. Lemma 16.16 (case of small jumps) Theorem 16.14 holds for L´evy processes X n satisfying P EX n ≡ 0, 1 ≥ (ΔX n )∗1 → 0. Proof: Since (ΔX n )∗1 → 0, we may choose some constants hn → 0 with P ∈ N, such that w(X n , 1, hn ) → 0. By the stationarity of the mn = h−1 n P n increments, we get w(X , t, hn ) → 0 for all t ≥ 0. Further note that X is centered Gaussian by Theorem 7.7. As in Theorem 22.20 below, we may then P d n ), such that (Y n − X)∗t → 0 for all t ≥ 0. form some processes Y n = (X[m n t]hn d ˜n = By Corollary 8.18, we may further choose some processes X X n with n n ˜ Y ≡ X[mn t]hn a.s. Letting n → ∞ for fixed t ≥ 0, we obtain P

16. Independent-Increment and L´evy Processes

˜ n − X)∗ ∧ 1 E (X t

357











≤ E (Y n − X)∗t ∧ 1 + E w(X n , t, hn ) ∧ 1 → 0.

2

Proof of Theorem 16.14: The asserted convergence is clearly equivalent to ρ(X̃^n, X) → 0, where ρ denotes the metric

  ρ(X, Y) = ∫_0^∞ e^{−t} E[ (X − Y)*_t ∧ 1 ] dt.

For any h > 0, we may write X = L^h + M^h + J^h and X^n = L^{n,h} + M^{n,h} + J^{n,h}, with L^h_t ≡ b^h t and L^{n,h}_t ≡ b^h_n t, where M^h and M^{n,h} are martingales containing the Gaussian components and all centered jumps of size ≤ h, whereas the processes J^h and J^{n,h} contain all the remaining jumps. Write G for the Gaussian component of X, and note that ρ(M^h, G) → 0 as h → 0 by Proposition 9.17. For any h > 0 with ν{|x| = h} = 0, Theorem 7.7 yields b^h_n → b^h and ν^h_n →w ν^h, where ν^h and ν^h_n denote the restrictions of ν and ν_n, respectively, to the set {|x| > h}. The same theorem gives a^h_n → a as n → ∞ and then h → 0, so under those conditions M^{n,h}_1 →d G_1.
Now fix any ε > 0. By Lemma 16.16, we may choose some constants h, r > 0 and processes M̃^{n,h} =d M^{n,h}, such that ρ(M^h, G) ≤ ε and ρ(M̃^{n,h}, G) ≤ ε for all n > r. Under the additional assumption ν{|x| = h} = 0, Lemma 16.15 yields an r′ ≥ r and some processes J̃^{n,h} =d J^{n,h} independent of M̃^{n,h}, such that ρ(J̃^{n,h}, J^h) ≤ ε for all n > r′. We may finally choose r″ ≥ r′ so large that ρ(L^h, L^{n,h}) ≤ ε for all n > r″. Then the processes

  X̃^n ≡ L^{n,h} + M̃^{n,h} + J̃^{n,h} =d X^n,   n ∈ N,

satisfy ρ(X, X̃^n) ≤ 4ε for all n > r″. □
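The compound-Poisson step of the proof can be visualized with a small coupling experiment (not from the text): two compound Poisson processes share one unit-rate Poisson clock, and their jumps are generated from common uniforms via inverse cdfs, so close jump distributions give uniformly close paths. The exponential jump laws below are arbitrary illustrative choices.

import numpy as np

# Coupled compound Poisson paths built from a common clock and common uniforms.
rng = np.random.default_rng(1)
t_max = 10.0
n_jumps = rng.poisson(t_max)                     # common Poisson clock on [0, t_max]
times = np.sort(rng.uniform(0.0, t_max, n_jumps))
u = rng.uniform(size=n_jumps)                    # common uniforms for both processes

jumps_x = -np.log(1.0 - u)                       # Exp(1) jumps for X
jumps_y = -np.log(1.0 - u) / 1.05                # Exp(1.05) jumps for the approximation

grid = np.linspace(0.0, t_max, 1001)
X = np.array([jumps_x[times <= s].sum() for s in grid])
Y = np.array([jumps_y[times <= s].sum() for s in grid])
print(np.max(np.abs(X - Y)))                     # small sup-distance on [0, t_max]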

Combining Theorem 16.14 with Corollary 7.9, we get a similar approximation of random walks, extending the result for Gaussian limits in Theorem 22.20. A slightly weaker result is obtained by different methods in Theorem 23.14.

Corollary 16.17 (Lévy approximation of random walks) Consider a Lévy process X and some random walks S^1, S^2, ... in R^d, such that S^n_{k_n} →d X_1 for some constants k_n → ∞, and let N be an independent, unit-rate Poisson process. Then there exist some processes

  X^n =d (S^n ∘ N_{k_n t}),   with   sup_{s≤t} |X^n_s − X_s| →P 0,   t ≥ 0.
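The Poissonization S^n ∘ N_{k_n t} is easy to experiment with. The sketch below (not from the text) takes a centered ±n^{−1/2} coin-flip walk with k_n = n, so the limiting X is a standard Brownian motion, and it only checks the mean and variance of the subordinated value at t = 1; the step law, n and the number of replications are our own choices.

import numpy as np

# Poissonized random walk evaluated at t = 1: S^n(N_n) should be roughly N(0, 1).
rng = np.random.default_rng(2)
n, reps = 2000, 500
vals = np.empty(reps)
for i in range(reps):
    m = rng.poisson(n)                                  # N_{n·1}
    steps = rng.choice([-1.0, 1.0], size=m) / np.sqrt(n)
    vals[i] = steps.sum()                               # X^n_1 = S^n(N_n)
print(vals.mean(), vals.var())                          # approximately 0 and 1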

Using the last result, we may extend the first two arcsine laws of Theorem 14.16 to symmetric Lévy processes, where symmetry means −X =d X.


Theorem 16.18 (arcsine laws) Let X be a symmetric Lévy process in R with X_1 ≠ 0 a.s. Then these random variables are arcsine distributed:

  τ_1 = λ{ t ≤ 1; X_t > 0 },
  τ_2 = inf{ t ≥ 0; X_t ∨ X_{t−} = sup_{s≤1} X_s }.   (7)

Proof: Introduce the random walk S^n_k = X_{k/n}, let N be an independent, unit-rate Poisson process, and define X^n_t = S^n ∘ N_{nt}. Then Corollary 16.17 yields some processes X̃^n =d X^n with (X̃^n − X)*_1 →P 0. Define τ^n_1, τ^n_2 as in (7) in terms of X̃^n, and conclude from Lemmas 22.12 and 7.16 that τ^n_i →d τ_i for i = 1, 2. Now define

  σ^n_1 = N_n^{−1} Σ_{k ≤ N_n} 1{ S^n_k > 0 },
  σ^n_2 = N_n^{−1} min{ k; S^n_k = max_{j ≤ N_n} S^n_j }.

Since t^{−1} N_t → 1 a.s. by the law of large numbers, we have sup_{t≤1} |n^{−1} N_{nt} − t| → 0 a.s., and so σ^n_2 − τ^n_2 → 0 a.s. Applying the same law to the sequence of holding times in N, we further note that σ^n_1 − τ^n_1 →P 0. Hence, σ^n_i →d τ_i for i = 1, 2. Now σ^n_1 =d σ^n_2 by Corollary 27.8, and Theorem 22.11 yields σ^n_2 →d sin²α, where α is U(0, 2π). Hence, τ_1 =d τ_2 =d sin²α. □

In Theorem 15.15, we saw how any ql-continuous, simple point process on R_+ can be time-changed into a Poisson process. Similarly, we will see in Theorem 19.4 below how a continuous local martingale can be time-changed into a Brownian motion. Here we consider a third case of time-change reduction, involving integrals of p-stable Lévy processes. Since the general theory relies on some advanced stochastic calculus from Chapter 20, we consider only the elementary case of p < 1.

Proposition 16.19 (time change of stable integrals, Rosiński & Woyczyński, OK) Let X be a strictly p-stable Lévy process with p ∈ (0, 1), and let the process V ≥ 0 be predictable and such that A = V^p · λ is a.s. finite but unbounded.

Then

  (V · X) ∘ τ =d X,   τ_s = inf{ t; A_t > s },  s ≥ 0.

Proof: Define a point process ξ on R_+ × (R \ {0}) by ξB = Σ_s 1_B(s, ΔX_s), and recall from Corollary 16.2 and Proposition 16.11 that ξ is Poisson with intensity λ ⊗ ν, for some measure ν(dx) = c_± |x|^{−p−1} dx on R_±. In particular, ξ has compensator ξ̂ = λ ⊗ ν. Now introduce on R_+ × R the predictable map T_{s,x} = (A_s, xV_s). Since A is continuous, we have {A_s ≤ t} = {s ≤ τ_t} and A_{τ_t} = t. Hence, Fubini's theorem yields for any t, u > 0

  (λ ⊗ ν) ∘ T^{−1}( [0, t] × (u, ∞) )
   = (λ ⊗ ν){ (s, x); A_s ≤ t, xV_s > u }
   = ∫_0^{τ_t} ν{ x; xV_s > u } ds
   = ν(u, ∞) ∫_0^{τ_t} V^p_s ds
   = t ν(u, ∞),

and similarly for the sets [0, t] × (−∞, −u). Thus, ξ̂ ∘ T^{−1} = ξ̂ = λ ⊗ ν a.s., and so Theorem 15.15 gives ξ ∘ T^{−1} =d ξ. Finally, note that

  (V · X)_{τ_t} = ∫_0^{τ_t+} x V_s ξ(ds dx)
   = ∫_0^∞ x V_s 1{A_s ≤ t} ξ(ds dx)
   = ∫_0^{t+} y (ξ ∘ T^{−1})(dr dy),

where the process on the right has the same distribution as X.

2
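A quick numerical illustration of the identity (not from the text), for p = 1/2, uses the first-passage subordinator of Theorem 16.12: by the reflection principle its unit increment satisfies T_1 =d Z^{−2} with Z standard normal, so increments over a cell of length h are distributed as h²Z^{−2}. The integrand V_s = 1 + s, the grid size and the replication count are our own choices, and evaluating V at left endpoints makes this only an approximation.

import numpy as np

# Compare the law of (V · X)_{tau_1} with that of X_1 for a 1/2-stable subordinator.
rng = np.random.default_rng(4)
h, reps = 1e-3, 4000

def tau(s):                        # inverse of A_t = \int_0^t (1 + r)^{1/2} dr
    return (1.0 + 1.5 * s)**(2.0 / 3.0) - 1.0

grid = np.arange(0.0, tau(1.0), h)
V = 1.0 + grid

vals, direct = np.empty(reps), np.empty(reps)
for i in range(reps):
    dX = h**2 / rng.normal(size=grid.size)**2      # subordinator increments on the grid
    vals[i] = np.sum(V * dX)                       # approximation of (V · X)_{tau_1}
    direct[i] = 1.0 / rng.normal()**2              # X_1 sampled directly
print(np.quantile(vals, [0.25, 0.5, 0.75]))
print(np.quantile(direct, [0.25, 0.5, 0.75]))      # the two rows should roughly agree

Because the law is heavy-tailed, quartiles rather than means are compared.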

Anticipating our more detailed treatment in Chapter 20, we define a semi-martingale in R^d as an adapted process X with rcll paths, admitting a centering into a local martingale. Thus, subtracting all suitably centered jumps yields a continuous semi-martingale, in the sense of Chapter 18. Here the centering of small jumps amounts to a subtraction of the compensator η̂ rather than the intensity Eη of the jump point process η, to ensure that the resulting process will be a local martingale. Note that such a centering depends on the choice of underlying filtration F. The set of local characteristics of X is given by the triple (η̂, [M], A), where M is a continuous local martingale and A is a continuous, predictable process of locally finite variation, such that X is the sum of the centered jumps and the continuous semi-martingale M + A. We say that X is locally symmetric if η̂ is symmetric and A = 0. Two semi-martingales X, Y are said to be F-tangential⁵ if they have the same local characteristics with respect to F. For filtrations F, G on a common probability space, we say that G is a standard extension of F if

  F_t ⊂ G_t ⊥⊥_{F_t} F,   t ≥ 0.

These are the minimal conditions ensuring that G will preserve all conditioning and adaptedness properties of F.

Theorem 16.20 (tangential existence) Let X be an F-semi-martingale with local characteristics Y. Then there exists an F̃-semi-martingale X̃ ⊥⊥_Y F with Y-conditionally independent increments, for a standard extension F̃ ⊃ F, such that X and X̃ are F̃-tangential.

We will prove this only in the ql-continuous case where Y is continuous, in which case we may rely on some elementary constructions:

⁵ This should not be confused with the more elementary notion of tangent processes. For two processes to be tangential, they must be semi-martingales with respect to the same filtration, which typically requires us to extend the original filtration without affecting the local characteristics, a highly non-trivial step.


Lemma 16.21 (tangential constructions) Let M be a continuous local Fmartingale, and let ξ be an S-marked, F-adapted point process with continuous ˜ ⊥η F directed by η and a Brownian F-compensator η. Form a Cox process ξ⊥ ˜ F), put M ˜ = B ◦ [M ], and let F˜ be the filtration generated by motion B⊥ ⊥(ξ, ˜M ˜ ). Then 6 (F, ξ, (i) F˜ is a standard extension of F, ˜ is a continuous local F-martingale ˜ ˜ ] = [M ] a.s., (ii) M with [M ˜ (iii) ξ, ξ˜ have the same F-compensator η, ˜ η)-compensator of ξ. ˜ (iv) η is both a (ξ, η)-compensator of ξ and a (ξ, ˜ ⊥η F and B⊥⊥ ˜ F, Theorem 8.12 yields (ξ, ˜ B)⊥⊥η F, Proof: (i) Since ξ⊥ ξ,η and so ˜ t) ⊥ (ξ˜t , M ⊥ F, t ≥ 0. η, [M ]

˜ , we further note that Using Theorem 8.9 and the definitions of ξ˜ and M ˜ t) ⊥ (ξ˜t , M ⊥ t (η, [M ]), t η , [M ]

t ≥ 0.

Combining those relations and using Theorem 8.12, we obtain ˜ t) ⊥ (ξ˜t , M ⊥ t F, t

t ≥ 0,

η , [M ]

which implies F˜t ⊥ ⊥Ft F for all t ≥ 0. ˜ F), we get (ii) Since B⊥⊥(ξ, ˜M ˜ s ), B ⊥ ⊥ s (F, ξ, ˜ [M ], M

and so

˜ ⊥ θs M ⊥ s F˜s , ˜ [M ], M

s ≥ 0, s ≥ 0.

˜ ⊥⊥[M ] M ˜ s and using Theorem 8.12, we obtain Combining with the relation θs M ˜⊥ θs M ⊥ F˜s , [M ]

s ≥ 0.

Localizing if necessary to ensure integrability, we get for any s ≤ t the desired martingale property 





   

˜ s  F˜s = E E M ˜ s  F˜s , [M ]  F˜s ˜t − M ˜t − M E M 



 

˜ s  [M ]  F˜s = 0, ˜t − M =E E M

and the associated rate property 



   

˜ 2  F˜s = E E M ˜ 2  F˜s , [M ]  F˜s ˜2 −M ˜2 −M E M t s t s   

˜ 2  [M ]  F˜s ˜ t2 − M =E E M s 







  = E E [M ]t − [M ]s  [M ]  F˜s . 6

˜ as a Brownification of M . We may think of ξ˜ as a Coxification of ξ and of M

16. Independent-Increment and L´evy Processes

361

(iii) Property (i) shows that η remains an F˜ -compensator of ξ. Next, the ˜ ⊥η F implies θt ξ⊥ ˜ ⊥ ˜t Ft . Combining with the Cox property and relation ξ⊥ η,ξ ˜ ⊥η (ξ˜t , Ft ). Invoking the tower property of using Theorem 8.12, we get θt ξ⊥ ˜ we obtain conditional expectations and the Cox property of ξ, 



 E( θr ξ˜ | F˜t ) = E θt ξ˜  ξ˜t , Ft









  = E E θt ξ˜  ξ˜t , η, Ft  ξ˜t , Ft 

 = E E( θt ξ˜ | η)  ξ˜t , Ft 

= E θt η | ξ˜t , Ft = E( θt η | F˜t ).





˜ ˜ ˜ Since η remains F-predictable, it is then an F-compensator of ξ. (iv) The martingale properties in (ii) extend to the filtrations generated ˜ η), respectively, by the tower property of conditional expectaby (ξ, η) and ξ, tions. Since η is continuous and adapted to both filtrations, it is both (ξ, η)˜ η)-predictable. The assertions follow by combination of those properand (ξ, ties. 2 Proof of Theorem 16.20 for continuous η: Let ξ be the jump point process of X, and let η denote the F-compensator of ξ. Further, let M be the continuous martingale component of X, and let A be the predictable drift component ˜ M ˜ , F˜ as in Lemma 16.21, of X, for a suitable truncation function. Define ξ, ˜ and construct an associated semi-martingale X by compensating the jumps ˜ Then X ˜ has the same local characteristics Y = ([M ], η, A) as X, given by ξ. ˜ F), we ˜ ⊥η F and B⊥⊥(ξ, ˜ and so the two processes are F-tangential. Since ξ⊥ ˜ ˜ have X⊥ ⊥Y F, and the independence properties of B show that X has conditionally independent increments. 2 Tangential processes have similar asymptotic properties at ∞. Here we state only the most basic result of this type. Say that the function ϕ : R+ → R+ has moderate growth, if it is non-decreasing and continuous with ϕ(0) = 0, and there exists a constant c > 0 such that ϕ(2x) ≤ c ϕ(x) for all x > 0. In that case, there exists a function h > 0 on (0, ∞), such that ϕ(cx) ≤ h(c) ϕ(x) for all c, x > 0. Basic examples include the power functions ϕ(x) = |x|p with p > 0, and the functions ϕ(x) = x ∧ 1 and ϕ(x) = 1 − e−x . Theorem 16.22 (tangential comparison, Zinn, Hitchenko, OK) Let the processes X, Y be tangential and either increasing or locally symmetric. Then for any function ϕ : R+ → R+ of moderate growth, we have 7 E ϕ(X ∗ ) ) E ϕ(Y ∗ ). Our proof is based on some tail estimates: 7

The domination constants are understood to depend only on ϕ.

362

Foundations of Modern Probability

Lemma 16.23 (tail comparison) Let the processes X, Y be tangential and either increasing or locally symmetric. Then for any c, x > 0,







(i) P (ΔX)∗ > x ≤ 2 P (ΔY )∗ > x , (ii) P {X ∗ > x} ≤ 3 P {Y ∗ > c x} + 4 c. Proof: (i) Let ξ, η be the jump point processes of X, Y . Fix any x > 0, and introduce the optional time



τ = inf t > 0; |ΔYt | > x . Since the set (0, τ ] is predictable by Lemma 10.2, we get











P (ΔX)∗ > x ≤ P {τ < ∞} + E ξ (0, τ ] × [−x, x]c

= P {τ < ∞} + E η (0, τ ] × [−x, x]c = 2P {τ < ∞}

= 2 P (ΔY )∗ > x .

ˆ Yˆ by omitting all jumps (ii) Fix any c, x > 0. For increasing X, Y , form X, greater than cx, which clearly preserves the tangential relation. If X, Y are inˆ Yˆ by omitting all jumps of modulus > 2cx. stead locally symmetric, we form X, ˆ Yˆ are then local L2 -martingales with By Lemma 20.5 below and its proof, X, jumps a.s. bounded by 4cx. They also remain tangential, and the tangential ˆ [Yˆ ]. relation carries over to the quadratic variation processes [X], Now introduce the optional time



τ = inf t > 0; |Yˆt | > c x . For increasing X, Y , we have ˆ τ > x} ≤ E X ˆ τ = E Yˆτ x P {X = E Yˆτ − + EΔYˆτ ≤ 3 c x. If X, Y are instead locally symmetric, then using the Jensen and Bernstein– L´evy inequalities in 4.5 and 9.16, the integration by parts formula in Theorem 20.6 below, and the tangential properties and bounds, we get 

2

ˆ ∗ > x} x P {X τ

ˆ τ |)2 ≤ (E|X ˆ 2 = E[X] ˆτ ≤ EX τ ˆ = E[Y ]τ = E Yˆτ2 

≤ |Yˆτ − | + |ΔYˆτ | ≤ (4 c x)2 . ˆ ∗ > x} ≤ 4c. Thus, in both cases P {X τ 1 ∗ ∗ Since Y ≥ 2 (ΔY ) , we have

2

16. Independent-Increment and L´evy Processes

363



ˆ (ΔX)∗ ≤ 2 c x ⊂ {X = X},

{τ < ∞} ⊂ {Yˆ ∗ > c x} ⊂ {Y ∗ > c x}. Combining with the previous tail estimate and (i), we obtain



P {X ∗ > x} ≤ P (ΔX)∗ > 2 c x + P {X ∗ > x}



ˆ τ∗ > x} ≤ 2 P (ΔY )∗ > 2 c x + P {τ < ∞} + P {X ≤ 3 P {Y ∗ > c x} + 4 c.

2

Proof of Theorem 16.22: For any c, x > 0, consider the optional times



τ = inf t > 0; |Xt | > x ,



σ = inf t > 0; P {(θt Y )∗ > c x | Ft } > c . Since θτ X, θτ Y remain conditionally tangential given Fτ , Lemma 16.23 (ii) yields a.s. on {τ < σ} 











  P (θτ X)∗ > x  Fτ ≤ 3 P (θτ Y )∗ > c x  Fτ + 4c ≤ 7c,

and since {τ < σ} ∈ Fτ by Lemma 9.1, we get



P X ∗ > 3 x, (ΔX)∗ ≤ x, σ = ∞

≤ P (θτ X)∗ > x, τ < σ





= E P {(θτ X)∗ > x | Fτ }; τ < σ



≤ 7c P {τ < ∞} = 7c P {X ∗ > x}. Furthermore, Lemma 16.23 (i) yields





P (ΔX)∗ > x ≤ 2 P (ΔY )∗ > x





≤ 2 P Y ∗ > 12 x , and Theorem 9.16 gives













P {σ < ∞} = P supt P (θt Y )∗ > c x  Ft > c 



≤ P supt P Y ∗ > 12 c x  Ft > c







≤ c−1 P Y ∗ > 12 c x . Combining the last three estimates, we obtain



P {X ∗ > 3 x} ≤ P X ∗ > 3 x, (ΔX)∗ ≤ x, σ = ∞



+ P (ΔX)∗ > x + P {σ < ∞}

≤ 7c P {X ∗ > x} + 2 P Y ∗ > 12 x



+ c−1 P Y ∗ > 12 c x .



364

Foundations of Modern Probability

Since ϕ is non-decreasing of moderate growth, so that ϕ(rx) ≤ h(r) ϕ(x) for some function h > 0, we obtain ∗ ∗ ∗ (h−1 3 − 7c)E ϕ(X ) ≤ E ϕ(X /3) − 7c E ϕ(X ) ≤ 2 E ϕ(2 Y ∗ ) + c−1 E ϕ(2 Y ∗/c)





≤ 2h2 + c−1 h2/c E ϕ(Y ∗ ). E ϕ(Y ∗ ). Now choose c < (7h3 )−1 to get E ϕ(X ∗ ) <

2

Combining the last two theorems, we may simplify the tangential comparison. Say that X, Y are weakly tangential if their local characteristics have the same distribution. Here there is no mention of associated filtrations, and the two processes may even be defined on different probability spaces. Corollary 16.24 (extended comparison) Theorem 16.22 remains true when X and Y are weakly tangential. Proof: Let X, Y be semi-martingales on some probability spaces Ω, Ω with filtrations F, G, and write A, B for the associated local characteristics. ˜ Y˜ with conditionally By Theorem 16.20 we may choose some processes X, ˜ are independent increments, on suitable extensions of Ω, Ω , such that X, X ˜ ˜ ˜ F-tangential while Y, Y are G -tangential, for some standard extensions F˜ of ˜ Y˜ have the same local characteristics A, B. F and G˜ of G. In particular, X, d d ˜ ˜ = ˜ Y˜ have Y since X, Now suppose that A = B, and conclude that also X conditionally independent increments. Assuming X, Y to be either increasing or locally symmetric and letting ϕ have moderate growth, we get by Theorem 16.22 ˜ ∗) E ϕ(X ∗ ) ) E ϕ(X = E ϕ(Y˜ ∗ ) 2 ) E ϕ(Y ∗ ).

Exercises 1. Give an example of a process, whose increments are independent but not infinitely divisible. (Hint: Allow fixed discontinuities.) 2. Show that a non-decreasing process with independent increments has at most countably many fixed discontinuities, and that the associated component is independent of the remaining part. Then write the representation as in Theorem 16.3 for a suitably generalized Poisson process η. 3. Given a convolution semi-group of distributions μt on Rd , construct a L´evy process X with L(Xt ) = μt for all t ≥ 0, starting from a suitable Poisson process and an independent Brownian motion. (Hint: Use Lemma 4.24 and Theorems 8.17 and 8.23.) 4. Let X be a process with stationary, independent increments and X0 = 0. Show that X has a version with rcll paths. (Hint: Use the previous exercise.)

16. Independent-Increment and L´evy Processes

365

5. For a L´evy process X of effective dimension d ≥ 3, show that |Xt | → ∞ a.s. as t → ∞. (Hint: Define τ = inf{t; |Xt | > 1}, and iterate to form a random walk (Sn ). Show that the latter has the same effective dimension as X, and use Theorem 12.8.) 6. Let X be a real L´evy process, and fix any p ∈ (0, 2). Show that t−1/p Xt converges a.s. iff E|X1 |p < ∞ and either p ≤ 1 or EX1 = 0. (Hint: Define a random walk (Sn ) as before, show that S1 satisfies the same moment condition as X1 , and apply Theorem 5.23.) 7. Show that a real L´evy process X is a subordinator iff X1 ≥ 0 a.s. 8. Let X be a real process as in Theorem 16.3. Show that if X ≥ 0 a.s., then G = 0 a.s. and ν is restricted to (0, ∞). 9. Let X be a weakly p -stable L´evy process. Show that when p = 1, we may choose c ∈ R such that the process Xt − ct becomes strictly p -stable. Explain why such a centering fails for p = 1. 10. Show that a L´evy process is symmetric p -stable for a p ∈ (0, 2] iff E eiuXt = p e−ct|u| for a constant c ≥ 0. Similarly, show that a subordinator Y is strictly p p stable for a p ∈ (0, 1) iff E e−uYt = e−ctu for a constant c ≥ 0. In each case, find the corresponding characteristics. (Hint: Derive a scaling property for the characteristic exponent.) 11. Let X be a symmetric, p -stable L´evy process and let T be a strictly q -stable subordinator, for some p ∈ (0, 2] and q ∈ (0, 1). Show that Y = X ◦ T is a symmetric p q -stable L´evy process. (Hint: Check that Y is again a L´evy process, and calculate E eiuYt .) 12. Give an example of two tangential ql-continuous, simple point processes ξ, η, such that η is Cox while ξ is not. 13. Give an example two tangential continuous martingales M, N , such that N has conditionally independent increments while M has not. 14. Let M, N be tangential continuous martingales. Show that M ∗ p ) N ∗ p for all p > 0. (Hint: Use Theorem 18.7.) 15. Let M, N be tangential martingales. Show that M ∗ p ) N ∗ p for all p ≥ 1. (Hint: Note that [M ], [N ] are again tangential, and use Theorems 16.22 and 20.12.)

Chapter 17

Feller Processes and Semi-groups Transition semi-groups, pseudo-Poisson processes, Feller properties, resolvents and generator, forward and backward equations, Yosida approximation, closure and cores, L´evy processes, Hille–Yosida theorem, positive-maximum principle, compactification, existence and regularization, strong Markov property, discontinuity sets, Dynkin’s formula, characteristic operator, diffusions and elliptic operators, convergence of Feller processes, approximation of Markov chains, quasi-left continuity

As stressed before, Markov processes are among the most basic processes of modern probability. After studying several special cases in previous chapters, we now turn to a detailed study of the broad class of Feller processes. Those are Markov processes general enough to cover most applications of interest, yet restricted by some regularity conditions that allow for a reasonable flexibility, leading to a rich arsenal of basic tools and powerful properties. The crucial new idea is to regard the transition kernels as operators Tt on an appropriate function space. The Chapman–Kolmogorov relation then turns into the semi-group property Ts Tt = Ts+t , which suggests a formal representation Tt = etA in terms of a generator A. Under suitable regularity conditions— the so-called Feller properties—it is indeed possible to define a generator A describing the infinitesimal evolution of the underlying process X. Under further conditions, X will be shown to have continuous paths iff A extends an elliptic differential operator. In general, the powerful Hille–Yosida theorem provides precise conditions for the existence of a Feller process corresponding to a given operator A. Using the basic regularity theorem for sub-martingales from Chapter 9, we show that every Feller process has a right-continuous version with left-hand limits. Given this fundamental result, it is straightforward to extend the strong Markov property to arbitrary Feller processes. We also explore some profound connections with martingale theory. Finally, we establish a general continuity theorem for Feller processes, and deduce a corresponding approximation of discrete-time Markov chains by diffusions and other continuous-time Markov processes, anticipating some weak convergence results from Chapter 23. To clarify the connection between transition kernels and operators, let μ be an arbitrary probability kernel on a measurable space (S, S). The associated transition operator T is given by T f (x) = (T f )(x) = μ(x, dy) f (y),

x ∈ S,


(1)


where f : S → R is assumed to be measurable and either bounded or nonnegative. Approximating f by simple functions and using monotone convergence, we see that T f is again a measurable function on S. We also note that T is a positive contraction operator, in the sense that 0 ≤ f ≤ 1 implies 0 ≤ T f ≤ 1. A special role is played by the identity operator I, corresponding to the kernel μ(x, ·) ≡ δx . The importance of transition operators for the study of Markov processes is due to the following simple fact. Lemma 17.1 (semi-group property) For each t ≥ 0, let μt be a probability kernel on S with associated transition operator Tt . Then for any s, t ≥ 0, ⇔

μs+t = μs μt

Ts+t = Ts Tt .

Proof: For any B ∈ S, we have Ts+t 1B (x) = μs+t (x, B) and (Ts Tt )1B (x) = Ts (Tt 1B )(x)

= =



μs (x, dy) (Tt 1B )(y) μs (x, dy) μt (y, B)

= (μs μt )(x, B). Thus, the Chapman–Kolmogorov relation on the left is equivalent to Ts+t 1B = (Ts Tt )1B for any B ∈ S, which extends to the semi-group property Ts+t = Ts Tt , by linearity and monotone convergence. 2 By analogy with the situation for the Cauchy equation, we might hope to represent the semi-group in the form Tt = etA , t ≥ 0, for a suitable generator A. For this formula to make sense, the operator A must be suitably bounded, so that the exponential function can be defined through a Taylor expansion. We consider a simple case where such a representation exists. Proposition 17.2 (pseudo-Poisson processes) Let (Tt ) be the transition semigroup of a pure jump-type Markov process in S with bounded rate kernel α. Then Tt = etA , t ≥ 0, where for bounded measurable functions f : S → R, Af (x) =





f (y) − f (x) α(x, dy),

x ∈ S.

Proof: Since α is bounded, we may write α(x, B) ≡ c μ(x, B \ {x}) for a probability kernel μ and a constant c ≥ 0. By Proposition 13.7, X is then a pseudo-Poisson process of the form X = Y ◦ N , where Y is a discretetime Markov chain with transition kernel μ, and N is an independent Poisson process with fixed rate c. Letting T be the transition operator associated with μ, we get for any t ≥ 0 and f as stated

17. Feller Processes and Semi-groups

369

Tt f (x) = Ex f (X t )

= Σ Ex f (Yn ); Nt = n n≥0

Σ P {Nt = n} Exf (Yn)

=

n≥0

=

Σ e−ct n≥0

(ct)n n T f (x) n!

= ect (T −I) f (x). Hence, Tt = etA for all t ≥ 0, where Af (x) = c (T − I)f (x) =c =







f (y) − f (x) μ(x, dy)

f (y) − f (x) α(x, dy).

2

For the further analysis, we take S to be a locally compact, separable metric space, and write C0 = C0 (S) for the class of continuous functions f : S → R with f (x) → 0 as x → ∞. We can make C0 into a Banach space by introducing the norm f = supx |f (x)|. A semi-group of positive contraction operators Tt on C0 is called a Feller semi-group, if it has the additional regularity properties (F1 ) Tt C0 ⊂ C0 , t ≥ 0, (F2 ) Tt f (x) → f (x) as t → 0, f ∈ C0 , x ∈ S. In Theorem 17.6 we show that (F1 )−(F2 ), together with the semi-group property, imply the strong continuity (F3 ) Tt f → f as t → 0, f ∈ C0 . To clarify the probabilistic significance of those conditions, we assume for simplicity that S is compact, and also that (Tt ) is conservative, in the sense that Tt 1 = 1 for all t. For every initial state x, we may introduce an associated Markov process Xtx , t ≥ 0, with transition operators Tt . Lemma 17.3 (Feller properties) Let (Tt ) be a conservative transition semigroup on a compact metric space (S, ρ). Then the Feller properties (F1 )−(F3 ) are equivalent to, respectively, d

(i) Xtx → Xty as x → y, (ii)

P Xtx →

x as t → 0,

t ≥ 0, x ∈ S,

(iii) supx Ex ρ(Xs , Xt ) ∧ 1 → 0 as |s − t| → 0. Proof: The first two equivalences are obvious, so we prove only the third one. Then let the sequence f1 , f2 , . . . be dense in C = CS . By the compactness of S, we note that xn → x in S iff fk (xn ) → fk (x) for each k. Thus, ρ is topologically equivalent to the metric ρ (x, y) =

Σ 2−k k≥1





|fk (x) − fk (y)| ∧ 1 ,

x, y ∈ S.

370

Foundations of Modern Probability

Since S is compact, the identity mapping on S is uniformly continuous with respect to ρ and ρ , and so we may assume that ρ = ρ . Next we note that, for any f ∈ C, x ∈ S, and t, h ≥ 0,

Ex f (Xt ) − f (Xt+h )

2





= Ex f 2 − 2f Th f − Th f 2 (Xt ) 



≤ f 2 − 2f Th f + Th f 2   

 

 

 

≤ 2 f f − Th f  + f 2 − Th f 2 . Assuming (F3 ), we get supx Ex |fk (Xs ) − fk (Xt )| → 0 as s − t → 0 for fixed k, and so by dominated convergence supx Ex ρ(Xs , Xt ) → 0. Conversely, the 2 latter condition yields Th fk → fk for each k, which implies (F3 ). Our first aim is to construct the generator of a Feller semi-group (Tt ) on C0 . Since in general there is no bounded linear operator A satisfying Tt = etA , we need to look for a suitable substitute. For motivation, note that if p is a real-valued function on R+ with representation pt = e at , we can recover a from p by either a differentiation or an integration: 0

t−1 (pt − 1) → a as t → 0, ∞

e−λt pt dt = (λ − a)−1 ,

λ > 0.

The latter formula suggests that we introduce, for every λ > 0, the resolvent or potential Rλ , defined as the Laplace transform

Rλ f =

0



e−λt (Tt f ) dt,

f ∈ C0 .

Note that the integral exists, since Tt f (x) is bounded and right-continuous in t ≥ 0 for fixed x ∈ S. Theorem 17.4 (resolvents and generator) Let (Tt ) be a Feller semi-group on C0 with resolvents Rλ , λ > 0. Then (i) the λRλ are injective contraction operators on C0 , and λRλ → I holds strongly as λ → ∞, (ii) the range D = Rλ C0 is independent of λ and dense in C0 , (iii) there exists an operator A on C0 with domain D, such that Rλ−1 = λ − A on D for every λ > 0, (iv) A and Tt commute on D for every t ≥ 0. Proof, (i)−(iii): If f ∈ C0 , then (F1 ) yields Tt f ∈ C0 for every t, and so by dominated convergence we have even Rλ f ∈ C0 . To prove the stated contraction property, we may write for any f ∈ C0     λRλ f  ≤ λ



0

≤ λ f

e−λt Tt f dt

∞ 0

e−λt dt = f .

17. Feller Processes and Semi-groups

371

A simple computation yields the resolvent equation Rλ − Rμ = (μ − λ)Rλ Rμ ,

λ, μ > 0,

(2)

which shows that the operators Rλ commute and have a common range D. If f = R1 g for some g ∈ C0 , we get by (2) as λ → ∞         λRλ f − f  = (λRλ − I)R1 g      = (R1 − I)Rλ g    ≤ λ−1 R1 − I  g → 0.

The convergence extends by a simple approximation to the closure of D. Now introduce the one-point compactification Sˆ = S ∪{Δ} of S, and extend ˆ by putting f (Δ) = 0. If D ¯ = C0 , then the Hahn– any f ∈ C0 to Cˆ = C(S) ˆ such that Banach theorem yields a bounded linear functional ϕ ≡ 0 on C, ϕR1 f = 0 for all f ∈ C0 . By Riesz’ representation Theorem 2.25, ϕ extends ˆ Letting f ∈ C0 and using (F2 ), we get by to a bounded, signed measure on S. dominated convergence as λ → ∞

0 = λϕRλ f =

=



ϕ(dx)





ϕ(dx) 0

0



λe−λt Tt f (x) dt

e−s Ts/λ f (x) ds → ϕf,

and so ϕ ≡ 0. The contradiction shows that D is dense in C0 . To see that the operators Rλ are injective, let f ∈ C0 with Rλ0 f = 0 for some λ0 > 0. Then (2) yields Rλ f = 0 for every λ > 0, and since λRλ f → f as λ → ∞, we get f = 0. Hence, the inverses Rλ−1 exist on D. Multiplying (2) by Rλ−1 from the left and by Rμ−1 from the right, we get on D the relation Rμ−1 − Rλ−1 = μ − λ. Thus, the operator A = λ − Rλ−1 on D is independent of λ. (iv) Note that Tt and Rλ commute for any t, λ > 0, and write Tt (λ − A)Rλ = Tt = (λ − A)Rλ Tt = (λ − A) Tt Rλ .

2

The operator A in Theorem 17.4 is called the generator of the semi-group (Tt ). To emphasize the role of the domain D, we often say that (Tt ) has generator (A, D). The term is justified by the following lemma. Lemma 17.5 (uniqueness) A Feller semi-group is uniquely determined by its generator. Proof: The operator A determines Rλ = (λ − A)−1 for all λ > 0. By the uniqueness theorem for Laplace transforms, it then determines the measure μf (dt) = Tt f (x)dt on R+ for all f ∈ C0 and x ∈ S. Since the density Tt f (x) is

372

Foundations of Modern Probability 2

right-continuous in t for fixed x, the assertion follows.

We now show that any Feller semi-group is strongly continuous, and derive general versions of Kolmogorov’s forward and backward equations. Theorem 17.6 (strong continuity, forward and backward equations) Let (Tt ) be a Feller semi-group with generator (A, D). Then (i) (Tt ) is strongly continuous and satisfies Tt f − f =



0

t

Ts Af ds,

f ∈ D, t ≥ 0,

(ii) Tt f is differentiable at 0 iff f ∈ D, in which case d (Tt f ) = Tt Af = ATt f, t ≥ 0. dt For the proof we use the Yosida approximation Aλ = λARλ = λ(λRλ − I),

λ > 0,

(3)

with associated semi-group Ttλ = etA , t ≥ 0. Note that this is the transition semi-group of a pseudo-Poisson process with rate λ, based on the transition operator λRλ . λ

Lemma 17.7 (Yosida approximation) For any f ∈ D,     (i) Tt f − Ttλ f  ≤ tAf − Aλ f , t, λ > 0, (ii) Aλ f → Af as λ → ∞, (iii) Ttλ f → Tt f as λ → ∞ for each f ∈ C0 , uniformly for bounded t ≥ 0. Proof: By Theorem 17.4, we have Aλ f = λRλ Af → Af for any f ∈ D. For fixed λ > 0, we further note that h−1 (Thλ − I) → Aλ in the norm topology as h → 0. Now for any commuting contraction operators B and C,       n     B f − C n f  ≤ B n−1 + B n−2 C + · · · + C n−1  Bf − Cf      ≤ n Bf − Cf .

Fixing any f ∈ C0 and t, λ, μ > 0, we hence obtain as h = t/n → 0      λ  μ  μ  Tt f − Tt f  ≤ n Thλ f − Th f   λ  T f −f T μf − f − h = t  h h h   → t Aλ f − Aμ f .

   

For f ∈ D it follows that Ttλ f is Cauchy convergent as λ → ∞ for fixed t. Since D is dense in C0 , the same property holds for arbitrary f ∈ C0 . Denoting the limit by T˜t f , we get in particular      λ    Tt f − T˜t f  ≤ t Aλ f − Af ,

f ∈ D, t ≥ 0.

(4)

17. Feller Processes and Semi-groups

373

Thus, for each f ∈ D we have Ttλ f → T˜t f as λ → ∞, uniformly for bounded t, which again extends to all f ∈ C0 . To identify T˜t , we may use the resolvent equation (2) to obtain, for any f ∈ C0 and λ, μ > 0,



e−λt Ttμ μRμ f dt = (λ − Aμ )−1 μRμ f μ = Rν f, (5) λ+μ where ν = λμ(λ + μ)−1 . As μ → ∞ we have ν → λ, and so Rν f → Rλ . Furthermore,        μ     μ  Tt μRμ f − T˜t f  ≤ μRμ f − f  + Tt f − T˜t f  → 0, 0



and so by dominated convergence, (5) yields e−λt T˜t f dt = Rλ f . Hence, the semi-groups (Tt ) and (T˜t ) have the same resolvent operators Rλ , and so they agree by Lemma 17.5. In particular, (i) then follows from (4). 2 Proof of Theorem 17.6: The semi-group (Ttλ ) is clearly norm continuous in t for each λ > 0, and the strong continuity of (Tt ) follows by Lemma 17.7 as λ → ∞. We further have h−1 (Thλ − I) → Aλ as h ↓ 0. Using the semi-group relation and continuity, we obtain more generally d λ T = Aλ Ttλ = Ttλ Aλ , t ≥ 0, dt t which implies t Tsλ Aλ f ds, f ∈ C0 , t ≥ 0. (6) Ttλ f − f = 0

If f ∈ D, Lemma 17.7 yields as λ → ∞        λ λ      Ts A f − Ts Af  ≤ Aλ f − Af  + Tsλ Af − Ts Af  → 0,

uniformly for bounded s, and so (i) follows from (6) as λ → ∞. By the strong continuity of Tt , we may differentiate (i) to get the first relation in (ii). The second relation holds by Theorem 17.4. Conversely, let h−1 (Th f − f ) → g for some functions f, g ∈ C0 . As h → 0, we get Th − I Th f − f ARλ f ← Rλ f = R λ → Rλ g, h h and so f = (λ − A)Rλ f = λRλ f − ARλ f = Rλ (λf − g) ∈ D. 2 For a given generator A, it is often hard to identify the domain D, or the latter may simply be too large for convenient calculations. It is then useful to restrict A to a suitable sub-domain. An operator A with domain D on a Banach space B is said to be closed, if its graph G = {(f, Af ); f ∈ D} is a ¯ is closed subset of B 2 . In general, we say that A is closable, if the closure G

374

Foundations of Modern Probability

¯ called the closure of A. Note that A the graph of a single-valued operator A, is closable iff the conditions D * fn → 0 and Afn → g imply g = 0. When A is closed, a core for A is defined as a linear sub-space D ⊂ D, such that the restriction A|D has closure A. Note that A is then uniquely determined by A|D . We give conditions ensuring D ⊂ D to be a core, when A is the generator of a Feller semi-group (Tt ) on C0 . Lemma 17.8 (closure and cores) Let (A, D) be the generator of a Feller semigroup, and fix any λ > 0 and a sub-space D ⊂ D. Then (i) (A, D) is closed, (ii) D is a core for A ⇔ (λ − A)D is dense in C0 . Proof: (i) Let f1 , f2 , . . . ∈ D with fn → f and Afn → g. Then (I − A)fn → f − g. Since R1 is bounded, we get fn → R1 (f − g). Hence, f = R1 (f − g) ∈ D, and (I − A)f = f − g or g = Af . Thus, A is closed. (ii) If D is a core for A, then for any g ∈ C0 and λ > 0 there exist some f1 , f2 , . . . ∈ D with fn → Rλ g and Afn → ARλ g, and we get (λ − A)fn → (λ − A)Rλ g = g. Thus, (λ − A)D is dense in C0 . Conversely, let (λ − A)D be dense in C0 . To see that D is a core, fix any f ∈ D. By hypothesis, we may choose some f1 , f2 , . . . ∈ D with gn ≡ (λ − A)fn → (λ − A)f ≡ g. Since Rλ is bounded, we obtain fn = Rλ gn → Rλ g = f , and thus Afn = λfn − gn → λf − g = Af.

2

A sub-space D ⊂ C0 is said to be invariant under (Tt ) if Tt D ⊂ D for  all t ≥ 0. Note that for any subset B ⊂ C0 , the linear span of t Tt B is an invariant sub-space of C0 . Proposition 17.9 (invariance and cores, Watanabe) Let (A, D) be the generator of a Feller semi-group. Then every dense, invariant sub-space D ⊂ D is a core for A. Proof: By the strong continuity of (Tt ), the operator R1 can be approximated in the strong topology by some finite linear combinations L1 , L2 , . . . of the operators Tt . Now fix any f ∈ D, and define gn = Ln f . Noting that A and Ln commute on D by Theorem 17.4, we get (I − A)gn = (I − A)Ln f = Ln (I − A)f → R1 (I − A)f = f.

17. Feller Processes and Semi-groups

375

Since gn ∈ D which is dense in C0 , it follows that (I − A)D is dense in C0 . Hence, D is a core by Lemma 17.8. 2 The L´evy processes in Rd are archetypes of Feller processes, and we proceed to identify their generators1 . Let C0∞ denote the class of all infinitely differentiable functions f on Rd , such that f and all its derivatives belong to C0 = C0 (Rd ). Theorem 17.10 (L´evy processes) Let Tt , t ≥ 0, be transition operators of a L´evy process in Rd with characteristics (a, b, ν). Then (i) (Tt ) is a Feller semi-group, (ii) C0∞ is a core for the generator A of (Tt ), (iii) for any f ∈ C0∞ and x ∈ Rd , we have Af (x) =

1 2

aij ∂ij2 f (x) + bi ∂i f (x) +





f (x + y) − f (x) − y i ∂i f (x) 1{|y| ≤ 1} ν(dy).

In particular, a standard Brownian motion in Rd has generator 12 Δ, and the uniform motion with velocity b ∈ Rd has generator b ∇, both on the core C0∞ , where Δ and ∇ denote the Laplace and gradient operators, respectively. The generator of the jump component has the same form as for the pseudo-Poisson processes in Proposition 17.2, apart from a compensation for small jumps by a linear drift term. ∗[t−1 ]

Proof of Theorem 17.10 (i), (iii): As t → 0 we have μt v Corollary 7.9 yields μt /t → ν on Rd \ {0} and at,h ≡ t−1 bt,h ≡ t−1



|x|≤h |x|≤h

w

→ μ1 , and so

xx μt (dx) → ah , x μt (dx) → bh ,

(7)

for any h > 0 with ν{|x| = h} = 0. Now fix any f ∈ C0∞ , and write



t−1 Tt f (x) − f (x) = t−1 = t−1



+t



|y|≤h −1



f (x + y) − f (x) μt (dy)





f (x + y) − f (x) − y i ∂i f (x) − 12 y i y j ∂ij2 f (x) μt (dy)

|y|>h





2 f (x + y) − f (x) μt (dy) + bit,h ∂i f (x) + 12 aij t,h ∂ij f (x).

As t → 0, the last three terms approach the expression in (iii), though with aij replaced by ahij and with integration over {|x| > h}. To establish the required convergence, it remains to show that the first term on the right tends to zero 1

Here and below, summation over repeated indices is understood.

376

Foundations of Modern Probability

as h → 0, uniformly for small t > 0. Now this is clear from (7), since the integrand is of order h|y|2 by Taylor’s formula. By the uniform boundedness of the derivatives of f , the convergence is uniform in x. Thus, C0∞ ⊂ D by Theorem 17.6, and (iii) holds on C0∞ . (ii) Since C0∞ is dense in C0 , it suffices by Proposition 17.9 to show that it is also invariant under (Tt ). Then note that, by dominated convergence, the 2 differentiation operators commute with each Tt , and use condition (F1 ). We proceed to characterize the linear operators A on C0 , whose closures A¯ are generators of Feller semi-groups. Theorem 17.11 (generator criteria, Hille, Yosida) Let A be a linear operator on C0 with domain D. Then A is closable and its closure A¯ is the generator of a Feller semi-group on C0 , iff these conditions hold: (i) D is dense in C0 , (ii) the range of λ0 − A is dense in C0 for some λ0 > 0, (iii) f ∨ 0 ≤ f (x) ⇒ Af (x) ≤ 0, for any f ∈ D and x ∈ S. Condition (iii) is known as the positive-maximum principle. Proof: First assume that A¯ is the generator of a Feller semi-group (Tt ). Then (i) and (ii) hold by Theorem 17.4. To prove (iii), let f ∈ D and x ∈ S with f + = f ∨ 0 ≤ f (x). Then Tt f (x) ≤ Tt f + (x) ≤ Tt f + ≤ f + = f (x),

t ≥ 0,

and so h−1 (Th f − f )(x) ≤ 0. As h → 0, we get Af (x) ≤ 0. Conversely, let A satisfy (i)−(iii). For any f ∈ D, choose an x ∈ S with |f (x)| = f , and put g = f sgn f (x). Then g ∈ D with g + ≤ g(x), and so (iii) yields Ag(x) ≤ 0. Thus, we get for any λ > 0     (λ − A)f  ≥ λg(x) − Ag(x)

≥ λg(x) = λ f .

(8)

To show that A is closable, let f1 , f2 , . . . ∈ D with fn → 0 and Afn → g. By (i) we may choose g1 , g2 , . . . ∈ D with gn → g, and (8) yields         (λ − A)(gm + λfn ) ≥ λgm + λfn ,

m, n ∈ N, λ > 0.

As n → ∞, we get (λ − A)gm − λg ≥ λ gm . Dividing by λ and letting λ → ∞, we obtain gm − g ≥ gm , which gives g = 0 as m → ∞. Thus, A is closable, and by (8) the closure A¯ satisfies    ¯  ≥ λ f , (λ − A)f

¯ λ > 0, f ∈ dom(A).

(9)

17. Feller Processes and Semi-groups

377

¯ n → g for some f1 , f2 , . . . ∈ dom(A). ¯ By Now let λn → λ > 0 and (λn − A)f (9) the sequence (fn ) is then Cauchy, say with limit f ∈ C0 . The definition ¯ = g, and so g belongs to the range of λ − A. ¯ Letting of A¯ yields (λ − A)f Λ be the set of constants λ > 0 such that λ − A¯ has range C0 , it follows in particular that Λ is closed. If it can be shown to be open as well, then (ii) gives Λ = (0, ∞). Then fix any λ ∈ Λ, and conclude from (9) that λ − A¯ has a bounded inverse Rλ with norm Rλ ≤ λ−1 . For any μ > 0 with |λ − μ| Rλ < 1, we may form a bounded linear operator ˜ μ = Σ (λ − μ)n Rn+1 , R λ

n≥0

and note that

¯R ˜ μ − (λ − μ)R ˜ μ = I. ¯R ˜ μ = (λ − A) (μ − A) In particular μ ∈ Λ, which shows that λ ∈ Λo . ¯ λ= To prove the resolvent equation (2), we start from the identity (λ − A)R ¯ (μ − A)Rμ = I. A simple rearrangement yields ¯ λ − Rμ ) = (μ − λ)Rμ , (λ − A)(R and (2) follows as we multiply from the left by Rλ . In particular, (2) shows that Rλ and Rμ commute for any λ, μ > 0. ¯ = I on dom(A) ¯ and Rλ ≤ λ−1 , we have for any f ∈ Since Rλ (λ − A) ¯ dom(A) as λ → ∞        ¯  λRλ f − f  = Rλ Af ¯ → 0. ≤ λ−1 Af From (i) and the contractivity of λRλ , it follows easily that λRλ → I in the λ strong topology. Now define Aλ as in (3), and let Ttλ = etA . Arguing as λ in the proof of Lemma 17.7, we get Tt f → Tt f for each f ∈ C0 , uniformly for bounded t, where the Tt form a strongly continuous family of contraction operators on C0 , such that e−λt Tt dt = Rλ for all λ > 0. To deduce the semi-group property, fix any f ∈ C0 and s, t ≥ 0, and note that as λ → ∞, 











λ Ts+t − Ts Tt f = Ts+t − Ts+t f + Tsλ Ttλ − Tt f





+ Tsλ − Ts Tt f → 0. The positivity of the operators Tt will follow immediately, if we can show that Rλ is positive for each λ > 0. Then fix any function g ≥ 0 in C0 , and ¯ . By the definition of A, ¯ there exist put f = Rλ g, so that g = (λ − A)f ¯ . If inf x f (x) < 0, we have some f1 , f2 , . . . ∈ D with fn → f and Afn → Af inf x fn (x) < 0 for all sufficiently large n, and we may choose some xn ∈ S with fn (xn ) ≤ fn ∧ 0. By (iii) we have Afn (xn ) ≥ 0, and so inf x (λ − A)fn (x) ≤ (λ − A)fn (xn ) ≤ λfn (xn ) = λ inf x fn (x).

378

Foundations of Modern Probability

As n → ∞, we get the contradiction 0 ≤ inf x g(x) ¯ (x) = inf x (λ − A)f ≤ λ inf x f (x) < 0. To see that A¯ is the generator of the semi-group (Tt ), we note that the opera2 tors λ − A¯ are inverses of the resolvent operators Rλ . The previous proof shows that, if an operator A on C0 satisfies the positive maximum principle in (iii), then it is dissipative, in the sense that (λ−A)f ≥ λ f for all f ∈ dom(A) and λ > 0. This yields the following simple observation, needed later. Lemma 17.12 (maximality) Let (A, D) be the generator of a Feller semigroup on C0 , and consider an extension of A to a linear operator (A , D ) satisfying the positive-maximum principle. Then (A, D) = (A , D ). Proof: Fix any f ∈ D , and put g = (I − A )f . Since A is dissipative and (I − A)R1 = I on C0 , we get         f − R1 g  ≤ (I − A )(f − R1 g)     = g − (I − A)R1 g  = 0,

and so f = R1 g ∈ D.

2

We proceed to show how a nice Markov process can be associated with every Feller semi-group (Tt ). For the corresponding transition kernels μt to have total mass 1, we need the operators Tt to be conservative, in the sense that supf ≤1 Tt f (x) = 1 for all x ∈ S. This can be achieved by a suitable extension. Then introduce an auxiliary state2 Δ ∈ S, and form the compactified space Sˆ = S ∪{Δ}, where Δ is chosen as the point at infinity when S is non-compact, and otherwise is taken to be isolated from S. Note that any function f ∈ C0 ˆ achieved by putting f (Δ) = 0. The original has a continuous extension to S, semi-group on C0 extends as follows to a conservative semi-group on Cˆ = CSˆ . Lemma 17.13 (compactification) Every Feller semi-group (Tt ) on C0 extends ˆ given by to a conservative Feller semi-group (Tˆt ) on C,



Tˆt f = f (Δ) + Tt f − f (Δ) ,

ˆ t ≥ 0, f ∈ C.

Proof: It is straightforward to verify that (Tˆt ) is a strongly continuous semiˆ To show that the operators Tˆt are positive, fix any f ∈ Cˆ with group on C. f ≥ 0, and note that g ≡ f (Δ) − f ∈ C0 with g ≤ f (Δ). Hence, 2

often called the coffin state or cemetery; this morbid terminology is well established

17. Feller Processes and Semi-groups

379

Tt g ≤ Tt g + ≤ Tt g + ≤ g + ≤ f (Δ), and so Tˆt f = f (Δ) − Tt g ≥ 0. The contraction and conservation properties 2 now follow from the fact that Tˆt 1 = 1. Next we construct an associated semi-group of Markov transition kernels ˆ satisfying μt on S,

Tt f (x) =

f (y) μt (x, dy),

f ∈ C0 .

(10)

Say that a state x ∈ Sˆ is absorbing for (μt ), if μt (x, {x}) = 1 for all t ≥ 0. Proposition 17.14 (existence) For any Feller semi-group (Tt ) on C0 , there ˆ satisfying exists a unique semi-group of Markov transition kernels μt on S, (10) and such that Δ is absorbing for (μt ). Proof: For fixed x ∈ S and t ≥ 0, the mapping f → Tˆt f (x) is a positive linear functional on Cˆ with norm 1. Hence, Riesz’ representation Theorem 2.25 yields some probability measures μt (x, ·) on Sˆ satisfying Tˆt f (x) =



f (y) μt (x, dy),

ˆ x ∈ S, ˆ t ≥ 0. f ∈ C,

(11)

The measurability on the right is clear by continuity. By a standard approximation followed by a monotone-class argument, we obtain the desired measurabilˆ The Chapman–Kolmogorov ity of μt (x, B), for any t ≥ 0 and Borel set B ⊂ S. relation holds on Sˆ by Lemma 17.1. Formula (10) is a special case of (11), which also yields f (y) μt (Δ, dy) = Tˆt f (Δ) = f (Δ) = 0,

f ∈ C0 ,

showing that Δ is absorbing. The uniqueness of (μt ) is clear from the last two properties. 2 ˆ Theorem 11.4 yields a Markov process For any probability measure ν on S, ˆ X in S with initial distribution ν and transition kernels μt . Let Pν be the distribution of X ν with associated integration operator Eν . When ν = δx , we may write Px and Ex instead, for simplicity. We now extend Theorem 16.3 to a basic regularization theorem for Feller processes. For an rcll process X, we say that Δ is absorbing for X ± if Xt = Δ or Xt− = Δ implies Xu = Δ for all u ≥ t. ν

Theorem 17.15 (regularization, Kinney) Let X be a Feller process in Ŝ with initial distribution ν. Then
(i) X has an rcll version X̃ in Ŝ, such that Δ is absorbing for X̃±,
(ii) if (Tt) is conservative and ν is restricted to S, we can choose X̃ to be rcll even in S.


For the proof, we introduce an associated class of super-martingales, to which we may apply the regularity theorems in Chapter 9. Let C0⁺ be the class of non-negative functions in C0.

Lemma 17.16 (resolvents and super-martingales) For any f ∈ C0⁺, the process Yt = e−t R1 f(Xt), t ≥ 0, is a super-martingale under Pν for every ν.

Proof: Letting (Gt) be the filtration induced by X, we get for any t, h ≥ 0

E(Yt+h | Gt) = E( e−t−h R1 f(Xt+h) | Gt )
  = e−t−h Th R1 f(Xt)
  = e−t−h ∫0∞ e−s Ts+h f(Xt) ds
  = e−t ∫h∞ e−s Ts f(Xt) ds ≤ Yt.  □

Proof of Theorem 17.15: By Lemma 17.16 and Theorem 9.28, the process f(Xt) has a.s. right- and left-hand limits along Q+ for any f ∈ D ≡ dom(A). Since D is dense in C0, the same property holds for every f ∈ C0. By the separability of C0, we can choose the exceptional null set N to be independent of f. If x1, x2, ... ∈ Ŝ are such that f(xn) converges for every f ∈ C0, then by compactness of Ŝ the sequence xn converges in the topology of Ŝ. Thus, on N^c, the process X has right- and left-hand limits Xt± along Q+, and on N we may redefine X to be 0. Then clearly X̃t = Xt+ is rcll. It remains to show that X̃ is a version of X, so that Xt+ = Xt a.s. for each t ≥ 0. This holds since Xt+h →P Xt as h ↓ 0, by Lemma 17.3 and dominated convergence.

For any f ∈ C0 with f > 0 on S, the strong continuity of (Tt) yields even R1 f > 0 on S. Applying Lemma 9.32 to the super-martingale Yt = e−t R1 f(X̃t), we conclude that X ≡ Δ a.s. on the interval [ζ, ∞), where

ζ = inf{ t ≥ 0; Δ ∈ {X̃t, X̃t−} }.

Discarding the exceptional null set, we can make this hold identically. If (Tt) is conservative and ν is restricted to S, then X̃t ∈ S a.s. for every t ≥ 0. Thus, ζ > t a.s. for all t, and hence ζ = ∞ a.s., which may again be taken to hold identically. Then X̃t and X̃t− take values in S, and the stated regularity properties remain valid in S. □

By the last theorem, we may choose Ω as the space of rcll functions in Ŝ such that Δ is absorbing, and let X be the canonical process on Ω. Processes with different initial distributions ν are then distinguished by their distributions Pν on Ω. The process X is clearly Markov under each Pν, with initial distribution ν and transition kernels μt, and it has the regularity properties of Theorem 17.15. In particular, X ≡ Δ on the interval [ζ, ∞), where ζ denotes the terminal time

ζ = inf{ t ≥ 0; Xt = Δ or Xt− = Δ }.


Let (Ft) be the right-continuous filtration generated by X, and put A = F∞ = ⋁t Ft. As before, the shift operators θt on Ω are given by

(θt ω)s = ωs+t,   s, t ≥ 0.

The process X with associated distributions Pν, filtration F = (Ft), and shift operators θt is called the canonical Feller process with semi-group (Tt).

We now state a general version of the strong Markov property. The result extends the special versions obtained in Proposition 11.9 and Theorems 13.1 and 14.11. A further instance of this property is given in Theorem 32.11.

Theorem 17.17 (strong Markov property, Dynkin & Yushkevich, Blumenthal) For any canonical Feller process X, initial distribution ν, optional time τ, and random variable ξ ≥ 0, we have

Eν( ξ ◦ θτ | Fτ ) = EXτ ξ   a.s. Pν on {τ < ∞}.

Proof: By Lemmas 8.3 and 9.1, we may take τ < ∞. Let G be the filtration induced by X. Then the times τn = 2−n [2^n τ + 1] are G-optional by Lemma 9.4, and Lemma 9.3 gives Fτ ⊂ Gτn for all n. Thus, Proposition 11.9 yields

Eν( ξ ◦ θτn; A ) = Eν( EXτn ξ; A ),   A ∈ Fτ, n ∈ N.   (12)

To extend this to τ, we first take ξ = Πk≤m fk(Xtk) for some f1, ..., fm ∈ C0 and t1 < · · · < tm. Then ξ ◦ θτn → ξ ◦ θτ, by the right-continuity of X and continuity of f1, ..., fm. Writing hk = tk − tk−1 with t0 = 0, and using Feller property (F1) and the right-continuity of X, we obtain

EXτn ξ = Th1( f1 Th2( · · · (fm−1 Thm fm) · · · ))(Xτn)
  → Th1( f1 Th2( · · · (fm−1 Thm fm) · · · ))(Xτ) = EXτ ξ.

Thus, (12) extends to τ by dominated convergence on both sides. Using standard approximation and monotone-class arguments, we may finally extend the result to arbitrary ξ. □

For a simple application, we get a useful 0−1 law:

Corollary 17.18 (Blumenthal's 0−1 law) For a canonical Feller process, we have

Px A = 0 or 1,   x ∈ S, A ∈ F0.

Proof: Taking τ = 0 in Theorem 17.17, we get for any x ∈ S and A ∈ F0

1A = Px(A | F0) = PX0 A = Px A   a.s. Px.  □

To appreciate the last result, recall that F0 = F0+ . In particular, Px {τ = 0} = 0 or 1 for any state x ∈ S and F-optional time τ . The strong Markov property is often used in the following extended form.


Corollary 17.19 (optional projection) For any canonical Feller process X, non-decreasing, adapted process Y, and random variable ξ ≥ 0, we have

Ex ∫0∞ (EXt ξ) dYt = Ex ∫0∞ (ξ ◦ θt) dYt,   x ∈ S.

Proof: We may take Y0 = 0. Consider the right-continuous inverse

τs = inf{ t ≥ 0; Yt > s },   s ≥ 0,

and note that the τs are optional by Lemma 9.6. By Theorem 17.17, we have

Ex( EXτs ξ; τs < ∞ ) = Ex( Ex(ξ ◦ θτs | Fτs); τs < ∞ )
  = Ex( ξ ◦ θτs; τs < ∞ ).

Since τs < ∞ iff s < Y∞, we get by integration

Ex ∫0^{Y∞} (EXτs ξ) ds = Ex ∫0^{Y∞} (ξ ◦ θτs) ds,

and the asserted formula follows by Lemma 1.24. □

Next we show that any martingale on the canonical space of a Feller process X is a.s. continuous outside the discontinuity set of X. For Brownian motion, this also follows from the integral representation in Theorem 19.11.

Theorem 17.20 (discontinuity sets) Let X be a canonical Feller process with initial distribution ν, and let M be a local Pν-martingale. Then

{ t > 0; ΔMt ≠ 0 } ⊂ { t > 0; Xt− ≠ Xt }   a.s.   (13)

Proof (Chung & Walsh): By localization, we may take M to be uniformly integrable and hence of the form Mt = E(ξ | Ft) for some ξ ∈ L1. Let C be the class of random variables ξ ∈ L1, such that the generated process M satisfies (13). Then C is a linear sub-space of L1. It is further closed, since if Mt^n = E(ξn | Ft) with ‖ξn‖1 → 0, then

P{ supt |Mt^n| > ε } ≤ ε−1 E|ξn| → 0,   ε > 0,

and so supt |Mt^n| →P 0.

Now let ξ = Πk≤n fk(Xtk) for some f1, ..., fn ∈ C0 and t1 < · · · < tn. Writing hk = tk − tk−1, we note that

Mt = Πk≤m fk(Xtk) · T_{tm+1 − t} gm+1(Xt),   t ∈ [tm, tm+1],   (14)

where

gk = fk Thk+1( fk+1 Thk+2( · · · Thn fn ) · · · ),   k = 1, ..., n,

subject to obvious conventions when t < t1 and t > tn. Since Tt g(x) is jointly continuous in (t, x) for each g ∈ C0, equation (14) defines a right-continuous


version of M satisfying (13), and so ξ ∈ C. By a simple approximation, C then contains all indicator functions of sets ∩k≤n {Xtk ∈ Gk} with G1, ..., Gn open. The result extends by a monotone-class argument to any X-measurable indicator function ξ, and a routine argument yields the final extension to L1. □

A basic role in the theory is played by the processes

M^f_t = f(Xt) − f(X0) − ∫0t Af(Xs) ds,   t ≥ 0, f ∈ D.

Lemma 17.21 (Dynkin's formula) For a Feller process X,
(i) the processes M^f are martingales under any initial distribution ν for X,
(ii) for any bounded optional time τ,

Ex f(Xτ) = f(x) + Ex ∫0τ Af(Xs) ds,   x ∈ S, f ∈ D.

Proof: For any t, h ≥ 0,

M^f_{t+h} − M^f_t = f(Xt+h) − f(Xt) − ∫t^{t+h} Af(Xs) ds = M^f_h ◦ θt,

and so by the Markov property at t and Theorem 17.6,

Eν( M^f_{t+h} − M^f_t | Ft ) = Eν( M^f_h ◦ θt | Ft ) = EXt M^f_h = 0.

Thus, M^f is a martingale, and (ii) follows by optional sampling. □
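Dynkin's formula lends itself to a quick numerical sanity check. The sketch below is an illustration added here, not part of the original text: it simulates standard Brownian motion in R, for which the generator acts on smooth functions as Af = ½ f″, and compares both sides of (ii) at the bounded optional time τ = t_max ∧ inf{t ≥ 0; |Xt − x| ≥ 1}. The test function, step size, and sample size are arbitrary choices.

```python
import numpy as np

# Monte Carlo check of Dynkin's formula (ii) for standard Brownian motion on R,
# whose generator acts on smooth functions as Af = (1/2) f''.
rng = np.random.default_rng(0)

f = np.sin                           # illustrative test function
Af = lambda y: -0.5 * np.sin(y)      # Af = (1/2) f'' for f = sin

x0, dt, t_max, n_paths = 0.3, 1e-3, 1.0, 5000
n_steps = int(t_max / dt)

lhs = np.empty(n_paths)              # samples of f(X_tau)
rhs = np.empty(n_paths)              # samples of int_0^tau Af(X_s) ds

for i in range(n_paths):
    x, integral = x0, 0.0
    for _ in range(n_steps):
        # tau = t_max ∧ first exit time of (x0 - 1, x0 + 1): a bounded optional time
        if abs(x - x0) >= 1.0:
            break
        integral += Af(x) * dt
        x += np.sqrt(dt) * rng.standard_normal()
    lhs[i] = f(x)
    rhs[i] = integral

print("E_x f(X_tau)            ≈", lhs.mean())
print("f(x) + E_x ∫ Af(X_s) ds ≈", f(x0) + rhs.mean())
```

The two printed values should agree up to Monte Carlo and discretization error.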

To prepare for the next major result, we introduce the optional times

τh = inf{ t ≥ 0; ρ(Xt, X0) > h },   h > 0,

where ρ denotes the metric in S. Note that a state x is absorbing iff τh = ∞ a.s. Px for every h > 0.

Lemma 17.22 (escape times) When x ∈ S is non-absorbing, we have Ex τh < ∞ for h > 0 small enough.

Proof: If x is not absorbing, then μt(x, Bxε) < p < 1 for some t, ε > 0, where Bxε = {y; ρ(x, y) ≤ ε}. By Lemma 17.3 and Theorem 5.25, we may choose h ∈ (0, ε] so small that

μt(y, Bxh) ≤ μt(y, Bxε) ≤ p,   y ∈ Bxh.

Then Proposition 11.2 yields

Px{τh ≥ nt} ≤ Px ∩k≤n { Xkt ∈ Bxh } ≤ p^n,   n ∈ Z+,

and so by Lemma 4.4,

Ex τh = ∫0∞ P{τh ≥ s} ds ≤ t Σn≥0 P{τh ≥ nt} ≤ t Σn≥0 p^n = t / (1 − p) < ∞.  □

We turn to a probabilistic description of the generator and its domain. Say that A is maximal within a class of linear operators, if it extends every member of the class.

Theorem 17.23 (characteristic operator, Dynkin) Let (A, D) be the generator of a Feller process. Then
(i) for f ∈ D we have Af(x) = 0 when x is absorbing, and otherwise

Af(x) = lim_{h→0} ( Ex f(Xτh) − f(x) ) / Ex τh,   (15)

(ii) A is the maximal operator on C0 with those properties.

Proof: (i) Fix any f ∈ D. If x is absorbing, then Tt f(x) = f(x) for all t ≥ 0, and so Af(x) = 0. For non-absorbing x, Lemma 17.21 yields instead

Ex f(Xτh∧t) − f(x) = Ex ∫0^{τh∧t} Af(Xs) ds,   t, h > 0.   (16)

By Lemma 17.22, we have Ex τh < ∞ for sufficiently small h > 0, and so (16) extends to t = ∞ by dominated convergence. Relation (15) now follows from the continuity of Af, along with the fact that ρ(Xs, x) ≤ h for all s < τh.

(ii) Since the positive-maximum principle holds for any extension of A with the stated properties, the assertion follows by Lemma 17.12. □

In the special case of S = R^d, let Ĉ∞ be the class of infinitely differentiable functions on R^d with bounded support. An operator (A, D) with D ⊃ Ĉ∞ is said to be local on Ĉ∞, if Af(x) = 0 when f vanishes in a neighborhood of x. For a local generator A on Ĉ∞, the positive-maximum principle yields a local positive-maximum principle, asserting that if f ∈ Ĉ∞ has a local maximum ≥ 0 at a point x, then Af(x) ≤ 0.

We give a basic relationship between diffusion processes and elliptic differential operators. This connection is further explored in Chapters 21, 32, and 34–35.

Theorem 17.24 (Feller diffusions and elliptic operators, Dynkin) Let (A, D) be the generator of a Feller process X in R^d with Ĉ∞ ⊂ D. Then these conditions are equivalent:
(i) X is continuous on [0, ζ), a.s. Pν for every ν,
(ii) A is local on Ĉ∞.


In this case, there exist some functions aij, bi, c ∈ C(R^d) with c ≥ 0 and (aij) symmetric, non-negative definite, such that for any f ∈ Ĉ∞ and x ∈ R^d,

Af(x) = ½ aij(x) ∂²ij f(x) + bi(x) ∂i f(x) − c(x) f(x).   (17)

Here the matrix (aij) gives the diffusion rates of X, the vector (bi) represents the drift rate of X, and c is the rate of killing³. For semi-groups of this type, we may choose Ω as the set of paths that are continuous on [0, ζ). The Markov process X is then referred to as a canonical Feller diffusion.

Proof: (i) ⇒ (ii): Use Theorem 17.23.

(ii) ⇒ (i): Assume (ii). Fix any x ∈ R^d and 0 < h < m, and choose an f ∈ Ĉ∞ with f ≥ 0 and support {y; h ≤ |y − x| ≤ m}. Then Af(y) = 0 for all y ∈ Bxh, and so Lemma 17.21 shows that f(Xt∧τh) is a martingale under Px. By dominated convergence, we have Ex f(Xτh) = 0, and since m was arbitrary,

Px{ |Xτh − x| ≤ h or Xτh = Δ } = 1,   x ∈ R^d, h > 0.

By the Markov property at fixed times, we get for any initial distribution ν

Pν ∩_{t∈Q+} θt⁻¹{ |Xτh − X0| ≤ h or Xτh = Δ } = 1,   h > 0,

which implies

Pν{ sup_{t<ζ} |ΔXt| ≤ h } = 1,   h > 0,

has a local maximum 0 at x, and so



ε Ag± (x) = ± Af (x) − Af˜(x) − ε Σ i aii (x) ≤ 0,

ε > 0.

As ε → 0 we get Af (x) = Af˜(x), showing that (17) is generally true.

2
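As a concrete instance of (17), standard Brownian motion in R^d has aij = δij, bi = 0, and c = 0, so that A = ½Δ on Ĉ∞ (compare Exercises 12–13 below). The sketch that follows is an illustration added here, not part of the original text: it estimates (Th f(x) − f(x))/h by Monte Carlo in one dimension and compares it with ½ f″(x); the test function and step sizes are arbitrary.

```python
import numpy as np

# For Brownian motion, T_h f(x) = E f(x + sqrt(h) Z) with Z ~ N(0,1), and
# (T_h f(x) - f(x)) / h should approach (1/2) f''(x) as h -> 0,
# i.e. the representation (17) with a = 1, b = 0, c = 0.
rng = np.random.default_rng(1)

f = np.cos
half_d2f = lambda y: -0.5 * np.cos(y)      # (1/2) f''

x, n = 0.7, 2_000_000
for h in (0.5, 0.1, 0.02):
    z = rng.standard_normal(n)
    Th_f = np.mean(f(x + np.sqrt(h) * z))  # Monte Carlo estimate of T_h f(x)
    print(f"h = {h:4}:  (T_h f - f)/h ≈ {(Th_f - f(x)) / h:+.4f},"
          f"  (1/2) f''(x) = {half_d2f(x):+.4f}")
```

As h decreases, the printed difference quotients move toward ½ f″(x), illustrating how the local behaviour of the semi-group recovers the differential operator.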

We turn to a basic limit theorem for Feller processes, extending some results for Lévy processes in Theorems 7.7 and 16.14.

Theorem 17.25 (convergence, Trotter, Sova, Kurtz, Mackevičius) Let X, X¹, X², ... be Feller processes in S with semi-groups (Tt), (Tn,t) and generators (A, D), (An, Dn), respectively, and fix a core D for A. Then these conditions are equivalent:
(i) for any f ∈ D, there exist some fn ∈ Dn with fn → f and An fn → Af,
(ii) Tn,t → Tt strongly for each t > 0,
(iii) Tn,t f → Tt f for every f ∈ C0, uniformly for bounded t > 0,
(iv) X0^n →d X0 in S ⇒ X^n →sd X in D(R+, Ŝ).

Our proof is based on two lemmas, the former extending Lemma 17.7.

Lemma 17.26 (norm inequality) Let (Tt), (T′t) be Feller semi-groups with generators (A, D), (A′, D′), respectively, where A′ is bounded. Then

‖Tt f − T′t f‖ ≤ ∫0t ‖(A − A′) T′s f‖ ds,   f ∈ D, t ≥ 0.   (18)

Proof: Fix any f ∈ D and t > 0. Since (T′s) is norm continuous, Theorem 17.6 yields

(∂/∂s)(Tt−s T′s f) = Tt−s (A′ − A) T′s f,   0 ≤ s ≤ t.

Here the right-hand side is continuous in s, by the strong continuity of (Ts), the boundedness of A′, the commutativity of A′ and T′s, and the norm continuity of (T′s). Hence,

T′t f − Tt f = ∫0t (∂/∂s)(Tt−s T′s f) ds = ∫0t Tt−s (A′ − A) T′s f ds,

and (18) follows by the contractivity of Tt−s. □

We may next establish a continuity property for the Yosida approximations A^λ, A^λ_n of the generators A, An, respectively.


Lemma 17.27 (continuity of Yosida approximation) Let (A, D), (An, Dn) be generators of some Feller semi-groups, satisfying (i) of Theorem 17.25. Then A^λ_n → A^λ strongly for every λ > 0.

Proof: By Lemma 17.8, it is enough to show that A^λ_n f → A^λ f for every f ∈ (λ − A)D. Then define g ≡ Rλ f ∈ D. By (i) we may choose some gn ∈ Dn with gn → g and An gn → Ag. Then

fn ≡ (λ − An) gn → (λ − A) g = f,

and so

‖A^λ_n f − A^λ f‖ = λ² ‖R^λ_n f − R^λ f‖
  ≤ λ² ‖R^λ_n (f − fn)‖ + λ² ‖R^λ_n fn − R^λ f‖
  ≤ λ ‖f − fn‖ + λ² ‖gn − g‖ → 0.  □

Proof of Theorem 17.25, (i) ⇒ (iii): Assume (i). Since D is dense in C0 , we may take f ∈ D. Then choose some functions fn as in (i), and conclude from Lemmas 17.7 and 17.26 that, for any n ∈ N and t, λ > 0,                λ  λ Tn,t f − Tt f  ≤ Tn,t (f − fn ) + (Tn,t − Tn,t )fn  + Tn,t (fn − f )      λ    + (Tn,t − Ttλ )f  + (Ttλ − Tt )f          ≤ 2 fn − f + t (Aλ − A)f  + t (An − Aλn )fn  t  + (Aλn − Aλ )Tsλ f  ds. (19) 0

By Lemma 17.27 and dominated convergence, the last term tends to zero as n → ∞. For the third term on the right, we get             (An − Aλn )fn  ≤ An fn − Af  + (A − Aλ )f          + (Aλ − Aλn )f  + Aλn (f − fn ),

which tends to (A − Aλ )f by the same lemma. Hence, by (19),  

 

 

 

lim sup sup Tn,t f − Tt f  ≤ 2u (Aλ − A)f , n→∞

u, λ > 0,

t≤u

and the desired convergence follows by Lemma 17.7, as we let λ → ∞. (iii) ⇒ (ii): Obvious. (ii) ⇒ (i): Assume (ii), fix any f ∈ D and λ > 0, and define g = (λ − A)f and fn = Rnλ g. By dominated convergence, fn → Rλ g = f . Since (λ − An )fn = g = (λ − A)f , we also note that An fn → Af , proving (i). (iv) ⇒ (ii): Assume (iv). We may take S to be compact and the semi-groups (Tt ) and (Tn,t ) to be conservative. We need to show that, for any f ∈ C and


t > 0, we have Ttn f (xn ) → Tt f (x) whenever xn → x in S. Then let X0 = x and X0n = xn . By Lemma 17.3, the process X is a.s. continuous at t. Thus, d (iv) yields Xtn → Xt , and the desired convergence follows. fd

d

(i) – (iii) ⇒ (iv): Assume (i) – (iii), and let X0n → X0 . To obtain X n −→ X, it is enough to show that as n → ∞, for any f0 , . . . , fm ∈ C and 0 = t0 < t1 · · · tm , E Π fk (Xtnk ) → E Π fk (Xtk ). (20) k≤m

k≤m

This holds by hypothesis when m = 0. Proceeding by induction, we may use the Markov property to rewrite (20) in the form E

Π fk (Xtn ) · Thn

k 0

E f (Xτn ) − f (Xτn +h )

2





= E f 2 − 2f Th f + Th f 2 (Xτn )  

 

≤ f 2 − 2f Th f + Th f 2   

 

 

 

≤ 2 f f − Th f  + f 2 − Th f 2 . Letting n → ∞ and then h ↓ 0, and using dominated convergence on the left and strong continuity on the right, we see that E{f (Xτ − ) − f (Xτ )}2 = 0, which implies f (Xτ − ) = f (Xτ ) a.s. Applying this to a separating sequence of functions f1 , f2 , . . . ∈ C0 , we obtain Xτ − = Xτ a.s. (iii) ⇒ (i): By (iii) and Theorem 17.20, we have ΔMτ = 0 a.s. on {τ < ∞} for every martingale M , and so τ is predictable by Theorem 10.14.


(i) ⇒ (ii): Obvious.

2

Exercises 1. Show how the proofs of Theorems 17.4 and 17.6 can be simplified, if we assume (F3 ) instead of the weaker condition (F2 ). 2. Consider a pseudo-Poisson process X on S with rate kernel α. Give conditions ensuring X to be Feller. 3. Verify the resolvent equation (2), and conclude that the range of Rλ is independent of λ. 4. Show that a Feller semi-group (Tt ) is uniquely determined by the resolvent operator Rλ for a fixed λ > 0. Interpret the result probabilistically in terms of an independent, exponentially distributed random variable with mean λ−1 . (Hint: Use Theorem 17.4 and Lemma 17.5.) 5. Consider a discrete-time Markov process in S with transition operator T , and let τ be an independent random variable with a fixed geometric distribution. Show that T is uniquely determined by Ex f (Xτ ) for arbitrary x ∈ S and f ≥ 0. (Hint: Apply the preceding result to the associated pseudo-Poisson process.) 6. Give a probabilistic description of the Yosida approximation Ttλ in terms the original process X and two independent Poisson processes with rate λ. 7. Given a Feller diffusion semi-group, write the differential equation in Theorem 17.6 (ii), for suitable f , as a PDE for the function Tt f (x) on R+ × Rd . Also show that the backward equation of Theorem 13.9 is a special case of the same equation. 8. Consider a Feller process X and an independent subordinator T . Show that Y = X ◦ T is again Markov, and that Y is L´evy whenever this is true for X. If both T and X are stable, then so is Y . Find the relationship between the transition semi-groups, respectively between the indices of stability. 9. Consider a Feller process X and an independent renewal process τ0 , τ1 , . . . . Show that Yn = Xτn is a discrete-time Markov process, and express its transition kernel in terms of the transition semi-group of X. Also show that Yt = X ◦ τ[t] may fail to be Markov, even when (τn ) is Poisson. 10. Let X, Y be independent Feller processes in S, T with generators A, B. Show ˜ where A, ˜ B ˜ that (X, Y ) is a Feller process in S × T with generator extending A˜ + B, denote the natural extensions of A, B to C0 (S × T ). 11. Consider in S a Feller process with generator A and a pseudo-Poisson process with generator B. Construct a Markov process with generator A + B. 12. Use Theorem 17.23 to show that the generator of Brownian motion in R extends A = 12 Δ, on the set D of functions f ∈ C02 with Af ∈ C0 . 13. Let Rλ be the λ-resolvent of Brownian motion in R. For any f ∈ C0 , put h = Rλ f , and show by direct computation that λh− 12 h = f . Conclude by Theorem 17.4 that 12 Δ with domain D, defined as above, extends the generator A. Thus, A = 12 Δ by the preceding exercise or by Lemma 17.12. 14. Show that if A is a bounded generator on C0 , then the associated Markov process is pseudo-Poisson. (Hint: Note as in Theorem 17.11 that A satisfies the


positive-maximum principle. Next use Riesz' representation theorem to express A in terms of bounded kernels, and show that A has the form of Proposition 17.2.)

15. Let X, X^n be processes as in Theorem 23.14. Show that if Xt^n →d Xt for all t > 0, then X^n →d X in D(R+, R^d), and compare with the stated theorem. Also prove a corresponding result for a sequence of Lévy processes X^n. (Hint: Use Theorems 17.28 and 17.25, respectively.)

VI. Stochastic Calculus and Applications Stochastic calculus is rightfully recognized as one of the central areas of modern probability. In Chapter 18 we develop the classical theory of Itˆo integration with respect to continuous semi-martingales, based on a detailed study of the quadratic variation process. Chapter 19 deals with some basic applications to Brownian motion and related processes, including various martingale characterizations and transformations of martingales based on a random time change or a change of probability measure. In Chapter 20, the integration theory is extended to the more subtle case of semi-martingales with jump discontinuities, and in the final Chapter 21 we develop some basic aspects of the Malliavin calculus for functionals on Wiener space. The material of the first two chapters is absolutely fundamental, whereas the remaining chapters are more advanced and could be deferred to a later stage of studies. −−− 18. Itˆ o integration and quadratic variation. After a detailed study of the quadratic variation and covariation processes for continuous local martingales, we introduce the stochastic integral via an ingenious martingale characterization. This leads easily to some basic continuity and iteration properties, and to the fundamental Itˆo formula, showing how continuous semi-martingales are transformed by smooth mappings. 19. Continuous martingales and Brownian motion. Using stochastic calculus, we derive some basic properties of Brownian motion and continuous local martingales. Thus, we show how the latter can be reduced to Brownian motions by a random time change, and how the drift of a continuous semimartingale can be removed by a suitable change of measure. We further derive some integral representations of Brownian functionals and continuous martingales. 20. Semi-martingales and stochastic integration. Here we extend the stochastic calculus for continuous integrators to the context of semi-martingales with jump discontinuities. Important applications include some general decompositions of semi-martingales, the Burkholder-type inequalities, the Bichteler– Dellacherie characterization of stochastic integrators, and a representation of the Dol´eans exponential. 21. Malliavin calculus. Here we study the Malliavin derivative D and its dual D∗ , along with the Ornstein–Uhlenbeck generator L. Those three operators are closely related and admit simple representations in terms of multiple Wiener–Itˆo integrals. We further explain in what sense D∗ extends the Itˆo integral, derive a differentiation property for the latter, and establish the Clark–Ocone representation of Brownian functionals. We also indicate how the theory can be used to prove the existence and smoothness of densities. 393

Chapter 18

Itô Integration and Quadratic Variation

Local and finite-variation martingales, completeness, covariation, continuity, norm comparison, uniform integrability, Cauchy-type inequalities, martingale integral, semi-martingales, continuity, dominated convergence, chain rule, optional stopping, integration by parts, approximation of covariation, Itô's formula, local integral, conformal mapping, Fisk–Stratonovich integral, continuity and approximation, random time change, dependence on parameter, functional representation

Here we initiate the study of stochastic calculus, arguably the most powerful tool of modern probability, with applications to a broad variety of subjects throughout the rest of the book. For the moment, we may only mention the time-change reduction and integral representation of continuous local martingales in Chapter 19, the Girsanov theory for removal of drift in the same chapter, the predictable transformations in Chapter 27, the construction of local time in Chapter 29, and the stochastic differential equations in Chapters 32–33. In this chapter we consider only stochastic integration with respect to continuous semi-martingales, whereas the more subtle case of integrators with possible jump discontinuities is postponed until Chapter 20, and an extension to non-anticipating integrators appears in Chapter 21. The theory of stochastic integration is inextricably linked to the notions of quadratic variation and covariation, already encountered in Chapter 14 in the special case of Brownian motion, and together the two notions are developed into a theory of amazing power and beauty. We begin with a construction of the covariation process [M, N] of a pair of continuous local martingales M and N, which requires an elementary approximation and completeness argument. The processes M* and [M] = [M, M] will be related by some useful continuity and norm relations, including the powerful BDG inequalities. Given the quadratic variation [M], we may next construct the stochastic integral ∫V dM for suitable progressive processes V, using a simple Hilbert-space argument. Combining with the ordinary Stieltjes integral ∫V dA for processes A of locally finite variation, we may finally extend the integral to arbitrary continuous semi-martingales X = M + A. The continuity properties of quadratic variation carry over to the stochastic integral, and in conjunction with the obvious linearity they characterize the integration.



A key result for applications is Itˆ o’s formula, which shows how semi-martingales are transformed under smooth mappings. Though the present substitution rule differs from the elementary result for Stieltjes integrals, the two formulas can be brought into agreement by a simple modification of the integral. We conclude the chapter with some special topics of importance for applications, such as the transformation of stochastic integrals under a random time-change, and the integration of processes depending on a parameter. The present material may be thought of as a continuation of the martingale theory of Chapters 9–10. Though no results for Brownian motion are used explicitly in this chapter, the existence of the Brownian quadratic variation in Chapter 14 may serve as a motivation. We also need the representation and measurability of limits obtained in Chapter 5. Throughout the chapter we take F = (Ft ) to be a right-continuous and complete filtration on R+ . A process M is said to be a local martingale, if it is adapted to F and such that the stopped and centered processes M τn − M0 are true martingales for some optional times τn ↑ ∞. By a similar localization we may define local L2 -martingales, locally bounded martingales, locally integrable processes, etc. The required optional times τn are said to form a localizing sequence. Any continuous local martingale may clearly be reduced by localization to a sequence of bounded, continuous martingales. Conversely, we see by dominated convergence that every bounded local martingale is a true martingale. The following useful result may be less obvious. Lemma 18.1 (local martingales) For any process M and optional times τn ↑ ∞, these conditions are equivalent: (i) M is a local martingale, (ii) M τn is a local martingale for every n. Proof, (i) ⇒ (ii): If M is a local martingale with localizing sequence (σn ) and τ is an arbitrary optional time, then the processes (M τ )σn = (M σn )τ are true martingales. Thus, M τ is again a local martingale with localizing sequence (σn ). (ii) ⇒ (i): Suppose that each process M τn is a local martingale with localizing sequence (σkn ). Since σkn → ∞ a.s. for each n, we may choose some indices kn with

P{ σ^n_{kn} < τn ∧ n } ≤ 2^−n,   n ∈ N.

Writing τ′n = τn ∧ σ^n_{kn}, we get τ′n → ∞ a.s. by the Borel–Cantelli lemma, and so the optional times τ″n = inf_{m≥n} τ′m satisfy τ″n ↑ ∞ a.s. It remains to note that the processes M^{τ″n} = (M^{τ′n})^{τ″n} are true martingales. □

Next we show that every continuous martingale of finite variation is a.s. constant. An extension appears as Lemma 10.11.


Proposition 18.2 (finite-variation martingales) Let M be a continuous local martingale. Then M has locally finite variation ⇔ M is a.s. constant.

Proof: By localization we may reduce to the case where M0 = 0 and M has bounded variation. In fact, let Vt denote the total variation of M on the interval [0, t], and note that V is continuous and adapted. For each n ∈ N, we may then introduce the optional time τn = inf{t ≥ 0; Vt = n}, and note that M^{τn} − M0 is a continuous martingale with total variation bounded by n. We further note that τn → ∞, and that if M^{τn} = M0 a.s. for each n, then even M = M0 a.s.

In the reduced case, fix any t > 0, write tn,k = kt/n, and conclude from the continuity of M that a.s.

ζn ≡ Σk≤n ( Mtn,k − Mtn,k−1 )² ≤ Vt maxk≤n | Mtn,k − Mtn,k−1 | → 0.

Since ζn ≤ Vt², which is bounded by a constant, it follows by the martingale property and dominated convergence that EMt² = Eζn → 0, and so Mt = 0 a.s. for each t > 0. □

Our construction of stochastic integrals depends on the quadratic variation and covariation processes, which therefore need to be constructed first. Here we use a direct approach, which has the further advantage of giving some insight into the nature of the basic integration-by-parts formula in Theorem 18.16. An alternative but less elementary approach would be to use the Doob–Meyer decomposition in Chapter 10.

The construction utilizes predictable step processes of the form

Vt = Σk ξk 1{t > τk} = Σk ηk 1(τk, τk+1](t),   t ≥ 0,   (1)

where the τn are optional times with τn ↑ ∞ a.s., and the ξk and ηk are Fτk-measurable random variables for all k ∈ N. For any process X we consider the elementary integral process V · X, given as in Chapter 9 by

(V · X)t ≡ ∫0t V dX = Σk ξk ( Xt − Xτk∧t ) = Σk ηk ( Xτk+1∧t − Xτk∧t ),   (2)

where the series converge since they have only finitely many non-zero terms. Note that (V ·X)0 = 0, and that V ·X inherits the possible continuity properties of X. It is further useful to note that V · X = V · (X − X0 ). The following simple estimate will be needed later. Lemma 18.3 (martingale preservation and L2 -bound) For any continuous L2 -martingale M with M0 = 0 and predictable step process V with |V | ≤ 1, the process V · M is again an L2 -martingale satisfying E(V · M )2t ≤ EMt2 , t ≥ 0.


Proof: First suppose that the sum in (1) has only finitely many non-zero terms. Then V · M is a martingale by Corollary 9.15, and the L²-bound follows by the computation

E(V · M)²t = E Σk ηk² ( Mτk+1∧t − Mτk∧t )² ≤ E Σk ( Mτk+1∧t − Mτk∧t )² = E Mt².

The estimate extends to the general case by Fatou's lemma, and the martingale property then extends by uniform integrability. □

Now consider the space M² of all L²-bounded, continuous martingales M with M0 = 0, equipped with the norm ‖M‖ = ‖M∞‖2. Recall that ‖M*‖2 ≤ 2‖M‖ by Proposition 9.17.

Lemma 18.4 (completeness) Let M² be the space of L²-bounded, continuous martingales with M0 = 0 and norm ‖M‖ = ‖M∞‖2. Then M² is complete and hence a Hilbert space.

Proof: For any Cauchy sequence M¹, M², ... in M², the sequence (M^n_∞) is Cauchy in L² and thus converges toward some ξ ∈ L². Introduce the L²-martingale Mt = E(ξ | Ft), t ≥ 0, and note that M∞ = ξ a.s. since ξ is F∞-measurable. Hence,

‖(M^n − M)*‖2 ≤ 2‖M^n − M‖ = 2‖M^n_∞ − M∞‖2 → 0,

and so ‖M^n − M‖ → 0. Moreover, (M^n − M)* → 0 a.s. along a sub-sequence, which implies that M is a.s. continuous with M0 = 0. □

We may now establish the existence and basic properties of the quadratic variation and covariation processes [M] and [M, N]. Extensions to possibly discontinuous processes are considered in Chapter 20.

Theorem 18.5 (covariation) For any continuous local martingales M, N, there exists a continuous process [M, N] with locally finite variation and [M, N]0 = 0, such that a.s.
(i) M N − [M, N] is a local martingale,
(ii) [M, N] = [N, M],
(iii) [a M1 + b M2, N] = a [M1, N] + b [M2, N],
(iv) [M, N] = [M − M0, N],
(v) [M] = [M, M] is non-decreasing,
(vi) [M^τ, N] = [M^τ, N^τ] = [M, N]^τ for any optional time τ.
The process [M, N] is determined a.s. uniquely by (i).


Proof: The a.s. uniqueness of [M, N ] follows from Proposition 18.2, and (ii)−(iii) are immediate consequences. If [M, N ] exists with the stated properties and τ is an optional time, then Lemma 18.1 shows that M τ N τ − [M, N ]τ is a local martingale, as is the process M τ (N − N τ ) by Corollary 9.15. Hence, even M τ N − [M, N ]τ is a local martingale, and (vi) follows. Furthermore, M N − (M − M0 )N = M0 N is a local martingale, which yields (iv) whenever either side exists. If both [M + N ] and [M − N ] exist, then 

4M N − [M + N ] − [M − N ]









= (M + N )2 − [M + N ] − (M − N )2 − [M − N ]



is a local martingale, and so we may take [M, N ] = ([M + N ] − [M − N ])/4. It is then enough to prove the existence of [M ] when M0 = 0. First let M be bounded. For each n ∈ N, let τ0n = 0, and define recursively



n τk+1 = inf t > τkn ; |Mt − Mτkn | = 2−n ,

Note that

τkn

k ≥ 0.

→ ∞ as k → ∞ for fixed n. Introduce the processes

Σ k Mτ 1 Qnt = Σ k Mt∧τ

Vtn =

n k



n t ∈ (τkn , τk+1 ] ,

2

n k

n − Mt∧τk−1 .

The V n are bounded, predictable step processes, and clearly Mt2 = 2 (V n · M )t + Qnt ,

t ≥ 0.

(3)

By Lemma 18.3, the integrals V n · M are continuous L2 -martingales, and since |V n − M | ≤ 2−n for each n, we have for m ≤ n      m    V · M − V n · M  = (V m − V n ) · M 

≤ 2−m+1 M . Hence, Lemma 18.4 yields a continuous martingale N with (V n · M − N )∗ → 0. The process [M ] = M 2 − 2N is again continuous, and by (3) we have P



∗

Qn − [M ]



=2 N −Vn·M

∗

P

→ 0.

In particular, [M ] is a.s. non-decreasing on the random time set T = {τkn ; n, k ∈ N}, which extends by continuity to the closure T¯. Also note that [M ] is constant on each interval in T¯c , since this is true for M and hence also for every Qn . Thus, (v) follows. In the unbounded case, define τn = inf{t > 0; |Mt | = n},

n ∈ N.

The processes [M τn ] exist as before, and clearly [M τm ]τm = [M τn ]τm a.s. for all m < n. Hence, [M τm ] = [M τn ] a.s. on [0, τm ], and since τn → ∞, there exists a non-decreasing, continuous, and adapted process [M ], such that [M ] = [M τn ]


a.s. on [0, τn] for every n. Here (M^{τn})² − [M]^{τn} is a local martingale for every n, and so M² − [M] is a local martingale by Lemma 18.1. □

Next we establish a basic continuity property.

Proposition 18.6 (continuity) For any continuous local martingales Mn starting at 0, we have

Mn* →P 0 ⇔ [Mn]∞ →P 0.

Proof: First let Mn* →P 0. Fix any ε > 0, and define τn = inf{t ≥ 0; |Mn(t)| > ε}, n ∈ N. Write Nn = Mn² − [Mn], and note that Nn^{τn} is a true martingale on R̄+. In particular, E[Mn]τn ≤ ε², and so by Chebyshev's inequality





P{ [Mn]∞ > ε } ≤ P{τn < ∞} + ε−1 E[Mn]τn ≤ P{Mn* > ε} + ε.

Here the right-hand side tends to zero as n → ∞ and then ε → 0, which shows that [Mn]∞ →P 0.

Conversely, let [Mn]∞ →P 0. By a localization argument and Fatou's lemma, we see that Mn is L²-bounded. Now proceed as before to get Mn* →P 0. □

Next we prove a pair of basic norm inequalities¹ involving the quadratic variation, known as the BDG inequalities. Partial extensions to discontinuous martingales appear in Theorem 20.12.

Theorem 18.7 (norm comparison, Burkholder, Millar, Gundy, Novikov) For a continuous local martingale M with M0 = 0, we have

E M*^p ≍ E [M]∞^(p/2),

p > 0.

Proof: By optional stopping, we may take M and [M ] to be bounded. Write M  = M − M τ with τ = inf{t; Mt2 = r}, and define N = (M  )2 − [M  ]. By Corollary 9.31, we have for any r > 0 and c ∈ (0, 2−p )









P M ∗2 ≥ 4r − P [M ]∞ ≥ c r ≤ P M ∗2 ≥ 4r, [M ]∞ < c r



≤ P N > −c r, supt Nt > r − c r



≤ c P {N ∗ > 0} ≤ c P {M ∗2 ≥ r}. Multiplying by (p/2) rp/2−1 and integrating over R+ , we get by Lemma 4.4 ∗p 2−p EM ∗p − c−p/2 E[M ]p/2 ∞ ≤ c EM ,

with domination constant cp = c−p/2 /(2−p − c). which gives the bound <

1 2

Recall that f ≍ g means f ≤ c g and g ≤ c f for some constant c > 0. The domination constants are understood to depend only on p.


Defining N as before with τ = inf{t; [M ]t = r}, we get for any r > 0 and c ∈ (0, 2−p/2−2 )









P [M ]∞ ≥ 2 r − P M ∗2 ≥ c r ≤ P [M ]∞ ≥ 2 r, M ∗2 < c r





≤ P N < 4 c r, inf t Nt < 4 c r − r





≤ 4c P [M ]∞ ≥ r . Integrating as before yields −p/2 2−p/2 E[M ]p/2 EM ∗p ≤ 4c E[M ]p/2 ∞ −c ∞ ,

and the bound ≳ follows with domination constant cp = c^(−p/2) / (2^(−p/2) − 4c). □
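The two-sided bound of Theorem 18.7 can be made tangible by simulation. The sketch below is an illustration added here, not part of the original text: M is Brownian motion run on [0, 1], so that [M]∞ = 1 and E[M]∞^(p/2) = 1, and the printed ratios E M*^p / E[M]∞^(p/2) stay bounded away from 0 and ∞ across several values of p, as the theorem predicts. All parameters are arbitrary choices.

```python
import numpy as np

# Monte Carlo illustration of E M*^p ≍ E [M]_∞^(p/2) for M = Brownian motion
# stopped at time 1, where [M]_∞ = 1 and hence E [M]_∞^(p/2) = 1.
rng = np.random.default_rng(2)

n_paths, n_steps = 10_000, 500
dt = 1.0 / n_steps

increments = np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))
paths = np.cumsum(increments, axis=1)
M_star = np.max(np.abs(paths), axis=1)     # M* = sup_{t<=1} |M_t| on the grid

for p in (0.5, 1.0, 2.0, 4.0):
    print(f"p = {p}:  E M*^p / E [M]^(p/2) ≈ {np.mean(M_star ** p):.3f}")
```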

We often need to certify that a given local martingale is a true martingale. The last theorem yields a useful criterion. Corollary 18.8 (uniform integrability) Let M be a continuous local martingale. Then M is a uniformly integrable martingale, whenever 



E( |M0| + [M]∞^(1/2) ) < ∞.

t      1/2    U V d [M, N ] ≤ U 2 · [M ] t V 2 · [N ] t , 0

t ≥ 0.

Proof: (i) By Theorem 18.5 (iii) and (v), we have a.s. for any a, b ∈ R and t>0 0 ≤ [aM + bN ]t = a2 [M ]t + 2ab [M, N ]t + b2 [N ]t . By continuity we may choose a common exceptional null set for all a, b, and so [M, N ]2t ≤ [M ]t [N ]t a.s. Applying this inequality to the processes M − M s and N − N s for any s < t, we obtain a.s.      1/2    [M, N ]t − [M, N ]s  ≤ [M ]t − [M ]s [N ]t − [N ]s ,

(4)

and by continuity we may again choose a common null set. Now let 0 = t0 < t1 < · · · < tn = t be arbitrary, and conclude from (4) and the classical Cauchy inequality that

402

Foundations of Modern Probability | [M, N ]t | ≤

 

Σ k  [M, N ]t  

 

k

− [M, N ]tk−1 

1/2

≤ [M ]t [N ]t . It remains to take the supremum over all partitions of [0, t]. (ii) Writing dμ = d[M ], dν = d[N ], and dρ = |d[M, N ]|, we conclude from (i) that (ρI)2 ≤ (μI)(νI) a.s. for every interval I. By continuity, we may choose the exceptional null set A to be independent of I. Letting G ⊂ R+ be open with connected components Ik and using Cauchy’s inequality, we get on Ac ρG = Σ k ρIk  1/2 ≤ Σ k μIk νIk ≤



1/2

Σ j μIj Σ k νIk   1/2

= μG · νG . By Lemma 1.36, the last relation extends to any B ∈ BR+ . Now fix any simple, measurable functions f = Σk ak 1Bk and g = Σk bk 1Bk . Using Cauchy’s inequality again, we obtain on Ac ρ|f g| ≤ Σ k |ak bk | ρBk 

1/2

Σ k |ak bk | μBk νBk  1/2 ≤ Σ j a2j μBj Σ k b2k νBk   ≤

1/2

≤ μf 2 νg 2 , which extends by monotone convergence to any measurable functions f, g on R+ . By Lemma 1.35, we may choose f (t) = Ut (ω) and g(t) = Vt (ω) for any 2 fixed ω ∈ Ac . Let E be the class of bounded, predictable step processes with jumps at finitely many fixed times. To motivate the construction of general stochastic integrals, and for subsequent needs, we derive a basic identity for elementary integrals. Lemma 18.10 (covariation identity) For any continuous local martingales M, N and processes U, V ∈ E, the integrals U ·M and V ·N are again continuous local martingales, and we have  U · M, V · N = ( U V ) · [M, N ] a.s. (5) Proof: We may clearly take M0 = N0 = 0. The first assertion follows by localization from Lemma 18.3. To prove (5), let Ut = Σ k≤n ξk 1(tk ,tk+1 ] (t), where ξk is bounded and Ftk -measurable for each k. By localization we may take M , N , [M, N ] to be bounded, so that M , N , and M N − [M, N ] are martingales ¯ + . Then on R 





Σ j ξj Mt − Mt Σ k Nt  = E Σ k ξk Mt Nt − Mt Nt   = E Σ k ξk [M, N ]t − [M, N ]t  

E ( U · M ) ∞ N∞ = E

j+1

k+1

j

k+1

k+1

= E U · [M, N ]

∞.

k+1

k

k

k

− Ntk




Replacing M, N by M τ , N τ for an arbitrary optional time τ , we get τ E ( U · M )τ Nτ = E( U · M τ )∞ N∞



= E U · [M τ , N τ ] 







= E U · [M, N ] τ . By Lemma 9.14, the process ( U · M )N − U · [M, N ] is then a martingale, and so [ U · M, N ] = U · [M, N ] a.s. The general formula follows by iteration. 2 To extend the stochastic integral V ·M to more general processes V , we take (5) to be the characteristic property. For any continuous local martingale M , let L(M ) be the class of all progressive processes V , such that (V 2 · [M ])t < ∞ a.s. for every t > 0. Theorem 18.11 (martingale integral, Itˆo, Kunita & Watanabe) For any continuous local martingale M and process V ∈ L(M ), there exists an a.s. unique continuous local martingale V · M with (V · M )0 = 0, such that for any continuous local martingale N , [V · M, N ] = V · [M, N ] a.s. Proof: To prove the uniqueness, let M  , M  be continuous local martingales with M0 = M0 = 0, such that [M  , N ] = [M  , N ] = V · [M, N ] a.s. for all continuous local martingales N . Then by linearity [M  − M  , N ] = 0 a.s. Taking N = M  − M  gives [M  − M  ] = 0 a.s., and so M  = M  a.s. by Proposition 18.6.   To prove the existence, we first assume V 2M = E V 2 · [M ] ∞ < ∞. Since V is measurable, we get by Proposition 18.9 and Cauchy’s inequality       E V · [M, N ] ∞  ≤ V M N , 

N ∈ M2 .



The mapping N → E V · [M, N ] ∞ is then a continuous linear functional on M2 , and so Lemma 18.4 yields an element V · M ∈ M2 with 

E V · [M, N ]

 ∞

= E (V · M )∞ N∞ ,

N ∈ M2 .

Replacing N by N τ for an arbitrary optional time τ and using Theorem 18.5 and optional sampling, we get 

E V · [M, N ]

 τ







∞

= E V · [M, N ]τ = E V · [M, N τ ] = E(V · M )∞ Nτ = E(V · M )τ Nτ .




Since V is progressive, Lemma 9.14 shows that V · [M, N ] − (V · M )N is a martingale, which implies [V · M, N ] = V · [M, N ] a.s. The last relation extends by localization to any continuous local martingale N . For general V , define



τn = inf t > 0; V 2 · [M ]





t

n ∈ N.

=n ,

By the previous argument, there exist some continuous local martingales V · M τn such that, for any continuous local martingale N , 

V · M τn , N = V · [M τn , N ] a.s.,

n ∈ N.

(6)

For m < n it follows that (V · M τn )τm satisfies the corresponding relation with [M τm , N ], and so (V · M τn )τm = V · M τm a.s. Hence, there exists a continuous process V · M with (V · M )τn = V · M τn a.s. for all n, and Lemma 18.1 shows that V ·M is again a local martingale. Finally, (6) yields [V ·M, N ] = V ·[M, N ] 2 a.s. on [0, τn ] for each n, and so the same relation holds on R+ . By Lemma 18.10, the stochastic integral V · M of Theorem 18.11 extends the previously defined elementary integral. It is also clear that V · M is a.s. bilinear in the pair (V, M ), with the following basic continuity property. Lemma 18.12 (continuity) For any continuous local martingales Mn and processes Vn ∈ L(Mn ), we have (Vn · Mn )∗ → 0 P





Vn2 · [Mn ]



P



→ 0.

Proof: Note that [Vn · Mn ] = Vn2 · [Mn ], and use Proposition 18.6.

2

Before continuing our study of basic properties, we extend the stochastic integral to a larger class of integrators. By a continuous semi-martingale we mean a process X = M + A, where M is a continuous local martingale and A is a continuous, adapted process with locally finite variation and A0 = 0. The decomposition X = M + A is then a.s. unique by Proposition 18.2, and it is often referred to as the canonical decomposition of X. A continuous semimartingale in Rd is defined as a process X = (X 1 , . . . , X d ), where X 1 , . . . , X d are continuous semi-martingales in R. Let L(A) be the class of progressive processes V , such that the processes  (V · A)t = 0t V dA exist as elementary Stieltjes integrals. For any continuous semi-martingale X = M + A, we write L(X) = L(M ) ∩ L(A), and define the X-integral of a process V ∈ L(X) as the sum V ·X = V ·M +V ·A, which makes V ·X a continuous semi-martingale with canonical decomposition V ·M +V ·A. For progressive processes V , it is further clear that V ∈ L(X)



V 2 ∈ L([M ]), V ∈ L(A).

Lemma 18.12 yields the following stochastic version of the dominated convergence theorem.


Corollary 18.13 (dominated convergence) For any continuous semi-martingale X and processes U, V, V1 , V2 , . . . ∈ L(X), we have Vn → V |Vn | ≤ U





⇒ Vn · X − V · X

∗ t

P

→ 0, t ≥ 0.

Proof: Let X = M + A. Since U ∈ L(X), we have U 2 ∈ L([M ]) and U ∈ L(A). By dominated convergence for ordinary Stieltjes integrals, we obtain a.s.

 ∗ (Vn − V )2 · [M ] t + Vn · A − V · A t → 0. Here the former convergence implies (Vn · M − V · M )∗t → 0 by Lemma 18.12, and the assertion follows. 2 P

We further extend the elementary chain rule of Lemma 1.25 to stochastic integrals. Lemma 18.14 (chain rule) For any continuous semi-martingale X and progressive processes U, V with V ∈ L(X), we have (i) U ∈ L(V · X) ⇔ U V ∈ L(X), and then (ii) U · (V · X) = ( U V ) · X a.s. Proof: (i) Letting X = M + A, we have U ∈ L(V · X) ⇔ U 2 ∈ L([ V · M ]), U V ∈ L(X) ⇔ (U V )2 ∈ L([M ]),

U ∈ L(V · A), U V ∈ L(A).

Since [ V · M ] = V 2 · [M ], the two pairs of conditions are equivalent. (ii) The relation U · (V · A) = ( U V ) · A is elementary. To see that even U · (V · M ) = (U V ) · M a.s., consider any continuous local martingale N , and note that 

( U V ) · M, N = ( U V ) · [M, N ] 



= U · V · [M, N ] = U · [ V · M, N ]  = U · ( V · M ), N .

2

Next we examine the behavior under optional stopping. Lemma 18.15 (optional stopping) For any continuous semi-martingale X, process V ∈ L(X), and optional time τ , we have a.s. (V · X)τ = V · X τ = (V 1[0,τ ] ) · X.


Proof: The relations being obvious for ordinary Stieltjes integrals, we may take X = M to be a continuous local martingale. Then (V ·M )τ is a continuous local martingale starting at 0, and we have 

(V · M )τ , N = = = =



V · M, N τ V · [M, N τ ] V · [M, N ]τ (V 1[0,τ ] ) · [M, N ].

Thus, (V · M )τ satisfies the conditions characterizing the integrals V · M τ and 2 (V 1[0,τ ] ) · M . We may extend the definitions of quadratic variation and covariation to any continuous semi-martingales X = M +A and Y = N +B by putting [X] = [M ] and [X, Y ] = [M, N ]. As a crucial step toward a general substitution rule, we show how the covariation process can be expressed in terms of stochastic integrals. For martingales X and Y , the result is implicit in the proof of Theorem 18.5. Theorem 18.16 (integration by parts) For any continuous semi-martingales X, Y , we have a.s. (7) XY = X0 Y0 + X · Y + Y · X + [X, Y ]. Proof: We may take X = Y , since the general result will then follow by polarization. First let X = M ∈ M2 , and define V n and Qn as in the proof of Theorem 18.5. Then V n → M and |Vtn | ≤ Mt∗ < ∞, and so Corollary 18.13 P yields (V n · M )t → (M · M )t for each t ≥ 0. Now (7) follows as we let n → ∞ in the relation M 2 = 2 V n · M + Qn , and it extends by localization to general continuous local martingales M with M0 = 0. If instead X = A, then (7) reduces to A2 = 2 A · A, which holds by Fubini’s theorem. For general X we may take X0 = 0, since the formula for general X0 will then follow by an easy computation from the result for X − X0 . In this case, (7) reduces to X 2 = 2 X · X + [M ]. Subtracting the formulas for M 2 and A2 , it remains to show that AM = A · M + M · A a.s. Then fix any t > 0, put tnk = (k/n)t, and introduce for s ∈ (tnk−1 , tnk ], k, n ∈ N, the processes Ans = Atnk−1 ,

Note that

Msn = Mtnk .

At Mt = (An · M )t + (M n · A)t ,

n ∈ N.

P

Here (A · M )t → (A · M )t by Corollary 18.13, and (M n · A)t → (M · A)t by dominated convergence for ordinary Stieltjes integrals. 2 n

Our terminology is justified by the following result, which extends Theorem 14.9 for Brownian motion. It also shows that [X, Y ] is a.s. measurably determined3 by X and Y . 3 This is remarkable, since it is defined by martingale properties that depend on both probability measure and filtration.


Proposition 18.17 (approximation of covariation, Fisk) For any continuous semi-martingales X, Y on [0, t] and partitions 0 = tn0 < tn1 < · · · < tnkn = t, n ∈ N, with maxk (tnk − tn,k−1) → 0, we have

ζn ≡ Σk ( Xtnk − Xtn,k−1 )( Ytnk − Ytn,k−1 ) →P [X, Y]t.   (8)

Proof: We may clearly take X0 = Y0 = 0. Introduce for s ∈ (tn,k−1, tnk], k, n ∈ N, the predictable step processes

Xs^n = Xtn,k−1,   Ys^n = Ytn,k−1,

and note that

Xt Yt = (X^n · Y)t + (Y^n · X)t + ζn,   n ∈ N.

Since X^n → X and Y^n → Y, and also (X^n)t* ≤ Xt* < ∞ and (Y^n)t* ≤ Yt* < ∞, we get by Corollary 18.13 and Theorem 18.16

ζn →P Xt Yt − (X · Y)t − (Y · X)t = [X, Y]t.  □
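Proposition 18.17 is easy to visualize numerically. The sketch below is an illustration added here, not part of the original text: it simulates a single Brownian path B on [0, t] and evaluates the sums ζn with X = Y = B over coarser sub-partitions of the simulation grid; as the mesh shrinks, ζn approaches [B]t = t. The grid sizes are arbitrary choices.

```python
import numpy as np

# Illustration of Proposition 18.17 with X = Y = B (Brownian motion):
# sums of squared increments over finer partitions of [0, t] approach [B]_t = t.
rng = np.random.default_rng(3)

t, n_fine = 1.0, 2 ** 20
dB = np.sqrt(t / n_fine) * rng.standard_normal(n_fine)
B = np.concatenate(([0.0], np.cumsum(dB)))          # one fine-grained Brownian path

for k in (4, 8, 12, 16, 20):
    step = 2 ** (20 - k)                             # 2^k sub-intervals of [0, t]
    incr = np.diff(B[::step])
    print(f"2^{k:2d} intervals:  zeta_n = {np.sum(incr ** 2):.4f}   (target [B]_t = {t})")
```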

We turn to a version of Itô's formula, arguably the most important formula of modern probability.⁴ It shows that the class of continuous semi-martingales is closed under smooth maps, and exhibits the canonical decomposition of the image process in terms of the components of the original process. Extended versions appear in Corollaries 18.19 and 18.20, as well as in Theorems 20.7 and 29.5.

Write C^k = C^k(R^d) for the class of k times continuously differentiable functions on R^d. When f ∈ C², we write ∂i f and ∂²ij f for the first and second order partial derivatives of f. Here and below, summation over repeated indices is understood.

Theorem 18.18 (substitution rule, Itô) For any continuous semi-martingale X in R^d and function f ∈ C²(R^d), we have a.s.

f(X) = f(X0) + ∂i f(X) · X^i + ½ ∂²ij f(X) · [X^i, X^j].

(9)

This may also be written in differential form as

df(X) = ∂i f(X) dX^i + ½ ∂²ij f(X) d[X^i, X^j].

(10)
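For f(x) = x² and X = B a standard Brownian motion, formula (9) reads B_t² = 2 ∫0t B dB + t, since [B]t = t. The following sketch is an added illustration, not part of the original text: it forms the left-endpoint Riemann sums that define the Itô integral on a fine grid and checks the identity on a single simulated path, up to discretization error.

```python
import numpy as np

# Discretized check of Ito's formula for f(x) = x^2 and X = Brownian motion:
#   B_t^2 = 2 * int_0^t B dB + [B]_t,   with [B]_t = t.
rng = np.random.default_rng(4)

t, n = 1.0, 200_000
dB = np.sqrt(t / n) * rng.standard_normal(n)
B = np.concatenate(([0.0], np.cumsum(dB)))

ito_integral = np.sum(B[:-1] * dB)     # left-endpoint sums -> Ito integral ∫ B dB
lhs = B[-1] ** 2
rhs = 2.0 * ito_integral + t           # 2 ∫ B dB + [B]_t
print(f"B_t^2 = {lhs:.5f},   2*∫B dB + t = {rhs:.5f}")
```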

It is suggestive to write d [X i , X j ] = dX i dX j , and think of (10) as a second order Taylor expansion. If X has canonical decomposition M + A, we get the corresponding decomposition of f (X) by substituting M i + Ai for X i on the right side of (9). When M = 0, the last term vanishes, and (9) reduces to the familiar substitution rule for ordinary Stieltjes integrals. In general, the appearance of this Itˆo 4 Possible contenders might include the representation of infinitely divisible distributions, the polynomial representation of multiple WI-integrals, and the formula for the generator of a continuous Feller process.


correction term shows that the rules of ordinary calculus fail for the Itˆo integral. Proof of Theorem 18.18: For notational convenience, we may take d = 1, the general case being similar. Then fix a continuous semi-martingale X in R, and let C be the class of functions f ∈ C 2 satisfying (9), now written in the form (11) f (X) = f (X0 ) + f  (X) · X + 12 f  (X) · [X]. The class C is clearly a linear subspace of C 2 containing the functions f (x) ≡ 1 and f (x) ≡ x. We shall prove that C is closed under multiplication, and hence contains all polynomials. Then suppose that (11) holds for both f and g, so that F = f (X) and G = g(X) are continuous semi-martingales. Using the definition of the integral, along with Proposition 18.14 and Theorem 18.16, we get (f g)(X) − (f g)(X0 ) = F G − F 0 G0 = F ·G + G · F + [F, G]

= F · g  (X) · X + 12 g  (X) · [X]





+ G · f  (X) · X + 12 f  (X) · [X] + f  (X) · X, g  (X) · X 

= f g  + f  g (X) · X +

1 2 







f g  + 2f  g  + f  g (X) · [X]

= (f g) (X) · X + 12 (f g) (X) · [X]. Now let f ∈ C 2 be arbitrary. By Weierstrass’ approximation theorem5 , we may choose some polynomials p1 , p2 , . . . , such that sup|x|≤c |pn (x) − f  (x)| → 0 for every c > 0. Integrating the pn twice yields some polynomials fn satisfying 









      sup fn (x) − f (x) ∨ fn (x) − f  (x) ∨ fn (x) − f  (x) → 0,

c > 0.

|x|≤c

In particular, fn (Xt ) → f (Xt ) for each t > 0. Letting X have canonical decomposition M + A and using dominated convergence for ordinary Stieltjes integrals, we get for any t ≥ 0

fn (X) · A + 12 fn (X) · [X]





t



→ f  (X) · A + 12 f  (X) · [X] t .

Similarly, {fn (X) − f  (X)}2 · [M ])t → 0 for all t, and so by Lemma 18.12

fn (X) · M



P

t





→ f  (X) · M t ,

t ≥ 0.

Thus, formula (11) for the polynomials fn extends in the limit to the same relation for f . 2 We also need a local version of the last theorem, involving stochastic integrals up to the first time ζD when X exits a given domain D ⊂ Rd . For 5

Any continuous function on [0, 1] admits a uniform approximation by polynomials.


continuous and adapted X, the time ζD is clearly predictable, in the sense of being announced by some optional times τn ↑ ζD with τn < ζD a.s. on {ζD > 0} for all n. Indeed, writing ρ for the Euclidean metric in Rd , we may choose



τn = inf t ∈ [0, n]; ρ(Xt , Dc ) ≤ n−1 ,

n ∈ N.

(12)

We say that X is a semi-martingale on [0, ζD ), if the stopped process X τn is a semi-martingale in the usual sense for every n ∈ N. To define the covariation processes [X i , X j ] on the interval [0, ζD ), we require [X i , X j ]τn = [(X i )τn , (X j )τn ] a.s. for every n. Stochastic integrals with respect to X 1 , . . . , X d are defined on [0, ζD ) in a similar way. Corollary 18.19 (local Itˆo formula) For any domain D ⊂ Rd , let X be a continuous semi-martingale on [0, ζD ). Then (9) holds a.s. on [0, ζD ) for every f ∈ C 2 (D). Proof: Choose some fn ∈ C 2 (Rd ) with fn (x) = f (x) when ρ(x, Dc ) ≥ n−1 . Applying Theorem 18.18 to fn (X τn ) with τn as in (12), we get (9) on [0, τn ]. 2 Since n was arbitrary, the result extends to [0, ζD ). By a complex, continuous semi-martingale we mean a process of the form Z = X +iY , where X, Y are real, continuous semi-martingales. The bilinearity of the covariation process suggests that we define the quadratic variation of Z as  [Z] = [Z, Z] = X + iY, X + iY = [X] + 2i [X, Y ] − [Y ]. Let L(Z) be the class of processes W = U + iV with U, V ∈ L(X) ∩ L(Y ). For such a process W , we define the integral by W · Z = (U + iV ) · (X + iY ) 



= U ·X −V ·Y +i U ·Y +V ·X . Corollary 18.20 (conformal mapping) Let f be an analytic function on a domain D ⊂ C. Then (9) holds for every continuous semi-martingale Z in D. Proof: Writing f (x + iy) = g(x, y) + ih(x, y) for any x + iy ∈ D, we get g1 + ih1 = f  ,

g2 + ih2 = if  ,

and so by iteration  + ih11 = f  , g11

 g12 + ih12 = if  ,

 g22 + ih22 = −f  .

Now (9) follows for Z = X + iY , as we apply Corollary 18.19 to the semi-martingale (X, Y ) and the functions g, h. 2


Under suitable regularity conditions, we may modify the Itˆo integral so that it will obey the rules of ordinary calculus. Then for any continuous semimartingales X, Y , we define the Fisk–Stratonovich integral by

t

0

X ◦ dY = (X · Y )t + 12 [X, Y ]t ,

t ≥ 0,

(13)

or, in differential form X ◦ dY = XdY + 12 d[X, Y ], where the first term on the right is an ordinary Itˆo integral. The point of this modification is that the substitution rule simplifies to df (X) = ∂i f (X) ◦ dX i , conforming with the chain rule of ordinary calculus. We may also prove a version for FS -integrals of the chain rule in Proposition 18.14. Theorem 18.21 (modified Itˆo integral, Fisk, Stratonovich) The integral in (13) satisfies the computational rules of elementary calculus. Thus, a.s., (i) for any continuous semi-martingale X in Rd and function f ∈ C 3 (Rd ),

f (Xt ) = f (X0 ) +

0

t

∂i f (X) ◦ dX i ,

t ≥ 0,

(ii) for any real, continuous semi-martingales X, Y, Z, X ◦ (Y ◦ Z) = (XY ) ◦ Z. Proof: (i) By Itˆo’s formula, 3 f (X) · [X j , X k ]. ∂i f (X) = ∂i f (X0 ) + ∂ij2 f (X) · X j + 12 ∂ijk

Using Itˆo’s formula again together with (5) and (13), we get 0

∂i f (X) ◦ dX i = ∂i f (X) · X i + 12 [∂i f (X), X i ] = ∂i f (X) · X i + 12 ∂ij2 f (X) · [X j , X i ] = f (X) − f (X0 ).

(ii) By Lemma 18.14 and integration by parts, X ◦ (Y ◦ Z) = = = =

X · (Y ◦ Z) + 12 [X, Y ◦ Z] X · (Y · Z) + 12 X · [Y, Z] + 12 [X, Y · Z] + XY · Z + 12 [XY, Z] XY ◦ Z.

1 4



X, [Y, Z]

2

The more convenient substitution rule of Corollary 18.21 comes at a high price: The FS -integral does not preserve the martingale property, and it requires even the integrand to be a continuous semi-martingale, which forces us to impose the stronger regularity constraint on the function f in the substitution rule. Our next task is to establish a basic uniqueness property, justifying our reference to the process V · M in Theorem 18.11 as an integral.
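The price and the payoff of the modification (13) are both visible in a simple simulation. In the sketch below (added here purely for illustration; all parameters are arbitrary), the left-endpoint sums approximating the Itô integral ∫0t B dB converge to (B_t² − t)/2, while midpoint sums of the Fisk–Stratonovich type converge to B_t²/2 = (B_t² − t)/2 + ½[B]t, in line with (13) and the ordinary chain rule.

```python
import numpy as np

# Ito vs Fisk-Stratonovich integral of B against itself:
#   Ito:          ∫ B dB   -> (B_t^2 - t) / 2   (left-endpoint sums)
#   Stratonovich: ∫ B ∘ dB ->  B_t^2 / 2        (midpoint sums = Ito + (1/2)[B]_t)
rng = np.random.default_rng(5)

t, n = 1.0, 200_000
dB = np.sqrt(t / n) * rng.standard_normal(n)
B = np.concatenate(([0.0], np.cumsum(dB)))

ito = np.sum(B[:-1] * dB)                     # left-endpoint sums
strat = np.sum(0.5 * (B[:-1] + B[1:]) * dB)   # midpoint sums

print(f"Ito sum          = {ito:.5f},  (B_t^2 - t)/2 = {(B[-1]**2 - t)/2:.5f}")
print(f"Stratonovich sum = {strat:.5f},   B_t^2 / 2   = {B[-1]**2/2:.5f}")
```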


Theorem 18.22 (continuity) The integral V · M in Theorem 18.11 is the a.s. unique linear extension of the elementary stochastic integral, such that for any t > 0,
(V_n² · [M])_t → 0 in probability  ⇒  (V_n · M)*_t → 0 in probability.

This follows immediately from Lemmas 18.10 and 18.12, together with the following approximation of progressive processes by predictable step processes.

Lemma 18.23 (approximation) For any continuous semi-martingale X = M + A and process V ∈ L(X), there exist some processes V_1, V_2, ... ∈ E, such that a.s., simultaneously for all t > 0,
((V_n − V)² · [M])_t + ((V_n − V) · A)*_t → 0.   (14)

Proof: We may take t = 1, since we can then combine the processes V_n for disjoint, finite intervals to construct an approximating sequence on R_+. It is further enough to consider approximations in the sense of convergence in probability, since the a.s. versions will then follow for a suitable sub-sequence. This allows us to perform the construction in steps, first approximating V by bounded, progressive processes V', next approximating each V' by continuous, adapted processes V'', and finally approximating each V'' by predictable step processes V'''. The first and last steps being elementary, we may focus on the second step.

Then let V be bounded. We need to construct some continuous, adapted processes V_n, such that (14) holds a.s. for t = 1. Since the V_n can be chosen to be uniformly bounded, we may replace the first term by (|V_n − V| · [M])_1. Thus, it is enough to establish the approximation (|V_n − V| · A)_1 → 0 when the process A is non-decreasing, continuous, and adapted with A_0 = 0. Replacing A_t by A_t + t if necessary, we may even take A to be strictly increasing.

To form the required approximations, we may introduce the inverse process T_s = sup{t ≥ 0; A_t ≤ s}, and define for all t, h > 0
V^h_t = h⁻¹ ∫_{T(A_t − h)}^{t} V dA = h⁻¹ ∫_{(A_t − h)+}^{A_t} V(T_s) ds.
Then Theorem 2.15 yields V^h ◦ T → V ◦ T as h → 0, a.e. on [0, A_1], and so by dominated convergence,
∫_0^1 |V^h − V| dA = ∫_0^{A_1} |V^h(T_s) − V(T_s)| ds → 0.

The processes V^h are clearly continuous. To prove their adaptedness, we note that the process T(A_t − h) is adapted for every h > 0, by the definition of T. Since V is progressive, we further note that V · A is adapted and hence progressive. The adaptedness of (V · A)_{T(A_t − h)} now follows by composition. □
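The averaging device used in this proof is easy to visualize numerically. The sketch below is only an illustration (the particular A and V are arbitrary choices of mine, not from the text): it computes V^h from the displayed formula and checks that the L¹(dA) distance between V^h and V shrinks as h → 0.

```python
import numpy as np

# For a strictly increasing, continuous A with inverse T, form
#     V^h_t = (1/h) * integral of V(T_s) ds over ((A_t - h)+, A_t],
# and watch  int |V^h - V| dA  decrease as h -> 0.
t = np.linspace(0.0, 1.0, 20_001)
A = t + 0.5 * t**2                      # strictly increasing, A_0 = 0
V = np.sign(np.sin(12 * np.pi * t))     # a bounded, discontinuous integrand

s_grid = np.linspace(0.0, A[-1], 20_001)
T = np.interp(s_grid, A, t)             # inverse time change T_s
V_of_T = np.interp(T, t, V)             # V(T_s) on the s-grid
ds = s_grid[1] - s_grid[0]
prim = np.concatenate([[0.0], np.cumsum(V_of_T)[:-1]]) * ds   # primitive of V(T_s)

def smooth(h):
    upper = np.interp(A, s_grid, prim)
    lower = np.interp(np.maximum(A - h, 0.0), s_grid, prim)
    return (upper - lower) / h

for h in (0.2, 0.05, 0.01):
    Vh = smooth(h)
    err = np.sum(np.abs(Vh - V)[:-1] * np.diff(A))   # approximates int |V^h - V| dA
    print(f"h = {h:5.2f}   L1(dA) error = {err:.4f}")
```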


Though the class L(X) of stochastic integrands is sufficient for most purposes, it is sometimes useful to allow integration of slightly more general processes. Given any continuous semi-martingale X = M + A, let L̂(X) denote the class of product-measurable processes V, such that (V − Ṽ) · [M] = 0 and (V − Ṽ) · A = 0 a.s. for some process Ṽ ∈ L(X). For V ∈ L̂(X) we define V · X = Ṽ · X a.s. The extension clearly enjoys all the previously established properties of stochastic integration.

We often need to see how semi-martingales, covariation processes, and stochastic integrals are transformed by a random time change. Then consider a non-decreasing, right-continuous family of finite optional times τ_s, s ≥ 0, here referred to as a finite random time change τ. If even F is right-continuous, so is the induced filtration G_s = F_{τ_s}, s ≥ 0, by Lemma 9.3. A process X is said to be τ-continuous, if it is a.s. continuous on R_+ and constant on every interval [τ_{s−}, τ_s], s ≥ 0, where τ_{0−} = X_{0−} = 0 by convention.

Theorem 18.24 (random time change, Kazamaki) Let τ be a finite random time change with induced filtration G, and let X = M + A be a τ-continuous F-semi-martingale. Then
(i) X ◦ τ is a continuous G-semi-martingale with canonical decomposition M ◦ τ + A ◦ τ, such that [X ◦ τ] = [X] ◦ τ a.s.,
(ii) V ∈ L(X) implies V ◦ τ ∈ L̂(X ◦ τ) and
(V ◦ τ) · (X ◦ τ) = (V · X) ◦ τ a.s.   (15)

Proof: (i) It is easy to check that the time change X → X ◦ τ preserves continuity, adaptedness, monotonicity, and the local martingale property. In particular, X ◦ τ is then a continuous G-semi-martingale with canonical decomposition M ◦ τ + A ◦ τ. Since M² − [M] is a continuous local martingale, so is the time-changed process M² ◦ τ − [M] ◦ τ, and we get [X ◦ τ] = [M ◦ τ] = [M] ◦ τ = [X] ◦ τ a.s. If V ∈ L(X), we also note that V ◦ τ is product-measurable, since this is true for both V and τ.

(ii) Fixing any t ≥ 0 and using the τ-continuity of X, we get
(1_{[0,t]} ◦ τ) · (X ◦ τ) = 1_{[0, τ^{-1}_t]} · (X ◦ τ) = (X ◦ τ)^{τ^{-1}_t} = (1_{[0,t]} · X) ◦ τ,
which proves (15) when V = 1_{[0,t]}. If X has locally finite variation, the result extends by a monotone-class argument and monotone convergence to arbitrary V ∈ L(X). In general, Lemma 18.23 yields some continuous, adapted processes V_1, V_2, ..., such that a.s.
∫ (V_n − V)² d[M] + ∫ |(V_n − V) dA| → 0.
By (15) the corresponding properties hold for the time-changed processes, and since the processes V_n ◦ τ are right-continuous and adapted, hence progressive, we obtain V ◦ τ ∈ L̂(X ◦ τ). Now assume instead that the approximating processes V_1, V_2, ... are predictable step processes. Then (15) holds as before for each V_n, and the relation extends to V by Lemma 18.12. □

Next we consider stochastic integrals of processes depending on a parameter. Given a measurable space (S, S), we say that a process V on S × R_+ is progressive (short for progressively measurable), if its restriction to S × [0, t] is S ⊗ B_t ⊗ F_t-measurable for every t ≥ 0, where B_t = B[0, t]. A simple version of the following result will be useful in Chapter 19.

Theorem 18.25 (dependence on parameter, Doléans, Stricker & Yor) For measurable S, consider a continuous semi-martingale X and a progressive process V_s(t) on S × R_+ such that V_s ∈ L(X) for all s ∈ S. Then the process Y_s(t) = (V_s · X)_t has a version that is progressive and a.s. continuous at each s ∈ S.

Proof: Let X have canonical decomposition M + A, and let the processes V^n_s on S × R_+ be progressive and such that for any t ≥ 0 and s ∈ S,

((V^n_s − V_s)² · [M])_t + ((V^n_s − V_s) · A)*_t → 0 in probability.

Then Lemma 18.12 yields (V^n_s · X − V_s · X)*_t → 0 in probability for every s and t. Proceeding as in the proof of Proposition 5.32, we may choose a sub-sequence {n_k(s)} ⊂ N depending measurably on s, such that the same convergence holds a.s. along {n_k(s)} for any s and t. Define Y_{s,t} = limsup_k (V^{n_k}_s · X)_t whenever this is finite, and put Y_{s,t} = 0 otherwise. If the processes (V^n_s · X)_t have progressive versions on S × R_+ that are a.s. continuous for each s, then Y_{s,t} is clearly a version of the process (V_s · X)_t with the same properties.

We apply this argument in three steps. First we reduce to the case of bounded, progressive integrands by taking V^n = V 1{|V| ≤ n}. Next, we apply the transformation in the proof of Lemma 18.23, to reduce to the case of continuous and progressive integrands. Finally, we approximate any continuous, progressive process V by the predictable step processes V^n_s(t) = V_s(2^{-n}[2^n t]). The integrals V^n_s · X are then elementary, and the desired continuity and measurability are obvious by inspection. □

We turn to the related topic of functional representations. For motivation, we note that the construction of a stochastic integral V · X depends in a subtle way on the underlying probability measure P and filtration F. Thus, we cannot expect a general representation F(V, X) of the integral process V · X. In view of Proposition 5.32, we might still hope for a modified representation of the form F(μ, V, X), where μ = L(V, X). Even this may be too optimistic, since the canonical decomposition of X also depends on F. Here we consider only a special case, sufficient for our needs in Chapter 32.

Fixing any progressive functions σ^i_j and b^i of suitable dimension, defined on the path space C_{R+,R^d}, we consider an adapted process X satisfying the stochastic differential equation
dX^i_t = σ^i_j(t, X) dB^j_t + b^i(t, X) dt,   (16)

where B is a Brownian motion in R^r. A detailed discussion of such equations appears in Chapter 32. Here we need only the simple fact from Lemma 32.1 that the coefficients σ^i_j(t, X) and b^i(t, X) are again progressive. Write a^{ij} = σ^i_k σ^j_k.

Proposition 18.26 (functional representation) For any progressive functions σ, b, f of suitable dimension, there exists a measurable mapping
F : M̂_{C_{R+,R^d}} × C_{R+,R^d} → C_{R+,R},   (17)

such that whenever X solves (16) with L(X) = μ and f^i(X) ∈ L(X^i) for all i, we have f^i(X) · X^i = F(μ, X) a.s.

Proof: From (16) we note that X is a semi-martingale with covariation processes [X^i, X^j] = a^{ij}(X) · λ and drift components b^i(X) · λ. Hence, f^i(X) ∈ L(X^i) for all i iff the processes (f^i)² a^{ii}(X) and f^i b^i(X) are a.s. Lebesgue integrable, which holds in particular when f is bounded. Letting f_1, f_2, ... be progressive with
((f^i_n − f^i)² a^{ii}(X)) · λ + |(f^i_n − f^i) b^i(X)| · λ → 0,
we get by Lemma 18.12
(f^i_n(X) · X^i − f^i(X) · X^i)*_t → 0 in probability,   t ≥ 0.   (18)

Thus, if f^i_n(X) · X^i = F_n(μ, X) a.s. for some measurable mappings F_n as in (17), Proposition 5.32 yields a similar representation for the limit f^i(X) · X^i. As in the previous proof, we may apply this argument in steps, reducing first to the case where f is bounded, next to the case of continuous f, and finally to the case where f is a predictable step function. Here the first and last steps are again elementary. For the second step, we may now use the simpler approximation
f_n(t, x) = n ∫_{(t − n⁻¹)+}^{t} f(s, x) ds,   t ≥ 0, n ∈ N, x ∈ C_{R+,R^d}.
By Theorem 2.15 we have f_n(t, x) → f(t, x) a.e. in t for each x ∈ C_{R+,R^d}, and (18) follows by dominated convergence. □
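For orientation, equations of the form (16) can be simulated by a simple Euler scheme. The sketch below is only an illustration: the function euler_maruyama and the coefficients sigma and b are my own example choices, not the construction used in Chapter 32.

```python
import numpy as np

def euler_maruyama(sigma, b, x0, T=1.0, n=1000, seed=0):
    """Simulate one path of dX = sigma(t, X) dB + b(t, X) dt on [0, T]."""
    rng = np.random.default_rng(seed)
    d, r = len(x0), sigma(0.0, np.asarray(x0)).shape[1]
    dt = T / n
    X = np.empty((n + 1, d))
    X[0] = x0
    for k in range(n):
        t = k * dt
        dB = rng.normal(0.0, np.sqrt(dt), r)           # Brownian increments in R^r
        X[k + 1] = X[k] + sigma(t, X[k]) @ dB + b(t, X[k]) * dt
    return np.linspace(0.0, T, n + 1), X

# Example coefficients (assumptions of this sketch, not from the book):
sigma = lambda t, x: np.array([[0.3 + 0.1 * np.sin(x[0]), 0.0],
                               [0.0, 0.2]])
b = lambda t, x: -0.5 * x

t, X = euler_maruyama(sigma, b, x0=[1.0, -1.0])
print("terminal state:", X[-1])
```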


Exercises

1. Show that if M is a local martingale and ξ is an F_0-measurable random variable, then the process N_t = ξM_t is again a local martingale.

2. Show that every local martingale M ≥ 0 with EM_0 < ∞ is a super-martingale. Also show by an example that M may fail to be a martingale. (Hint: Use Fatou's lemma. Then take M_t = X_{t/(1−t)+}, where X is a Brownian motion starting at 1, stopped when it reaches 0.)

3. For a continuous local martingale M, show that M and [M] have a.s. the same intervals of constancy. (Hint: For any r ∈ Q_+, put τ = inf{t > r; [M]_t > [M]_r}. Then M^τ is a continuous local martingale on [r, ∞) with quadratic variation 0, and so M^τ is a.s. constant on [r, τ]. Use a similar argument in the other direction.)

4. For any continuous local martingales M_n starting at 0 and associated optional times τ_n, show that (M_n)*_{τ_n} → 0 in probability iff [M_n]_{τ_n} → 0 in probability. State the corresponding result for stochastic integrals.

5. Give examples of continuous semi-martingales X_1, X_2, ..., such that X_n^* → 0 in probability, and yet [X_n]_t does not tend to 0 in probability for any t > 0. (Hint: Let B be a Brownian motion stopped at time 1, put A^n_{k 2^{-n}} = B_{(k−1)^+ 2^{-n}}, and interpolate linearly. Define X^n = B − A^n.)

6. For a Brownian motion B and an optional time τ, show that EB_τ = 0 when Eτ^{1/2} < ∞, and EB_τ² = Eτ when Eτ < ∞. (Hint: Use optional sampling and Theorem 18.7.)

7. Deduce the first inequality of Proposition 18.9 from Proposition 18.17 and the classical Cauchy inequality.

8. For any continuous semi-martingales X, Y, show that [X + Y]^{1/2} ≤ [X]^{1/2} + [Y]^{1/2} a.s.

9. (Kunita & Watanabe) Let M, N be continuous local martingales, and fix any p, q, r > 0 with p⁻¹ + q⁻¹ = r⁻¹. Show that ‖[M, N]_t‖²_{2r} ≤ ‖[M]_t‖_p ‖[N]_t‖_q for all t > 0.

10. Let M, N be continuous local martingales with M_0 = N_0 = 0. Show that M ⊥⊥ N implies [M, N] ≡ 0 a.s. Also show by an example that the converse is false. (Hint: Let M = U · B and N = V · B for a Brownian motion B and suitable U, V ∈ L(B).)

11. Fix a continuous semi-martingale X, and let U, V ∈ L(X) with U = V a.s. on a set A ∈ F_0. Show that U · X = V · X a.s. on A. (Hint: Use Proposition 18.15.)

12. For a continuous local martingale M, let U, U_1, U_2, ... and V, V_1, V_2, ... ∈ L(M) with |U_n| ≤ V_n, U_n → U and V_n → V in probability, and {(V_n − V) · M}*_t → 0 in probability for all t > 0. Show that (U_n · M)_t → (U · M)_t in probability for all t. (Hint: Write (U_n − U)² ≤ 2(V_n − V)² + 8V², and use Theorem 1.23 and Lemmas 5.2 and 18.12.)

13. Let B be a Brownian bridge. Show that X_t = B_{t∧1} is a semi-martingale on R_+ with the induced filtration. (Hint: Note that M_t = (1 − t)⁻¹ B_t is a martingale on [0, 1), integrate by parts, and check that the compensator has finite variation.)

14. Show by an example that the canonical decomposition of a continuous semi-martingale may depend on the filtration. (Hint: Let B be a Brownian motion with induced filtration F, put G_t = F_t ∨ σ(B_1), and use the preceding result.)


15. Show by stochastic calculus that t−p Bt → 0 a.s. as t → ∞, where B is a Brownian motion and p > 12 . (Hint: Integrate by parts to find the canonical decomposition. Compare with the L1 -limit.) 16. Extend Theorem 18.16 to a product of n semi-martingales. 17. Consider a Brownian bridge X and a bounded, progressive process V with  Show that E 01 V dX = 0. (Hint: Integrate by parts to get  −1 1 V ds.) 0 V dX = 0 (V − U )dB, where B is a Brownian motion and Ut = (1 − t) t s

1 V dt = 0 a.s. 01 t 1

18. Show that Proposition 18.17 remains valid for any finite optional times t and P tnk satisfying maxk (tnk − tn,k−1 ) → 0. 19. Let M be a continuous local martingale. Find the canonical decomposition of |M |p when p ≥ 2. For such a p, deduce the second relation in Theorem 18.7. (Hint: Use Theorem 18.18. For the last part, use H¨older’s inequality.) 20. Let M be a continuous local martingale with M0 = 0 and [M ]∞ ≤ 1. Show 2 that P {supt Mt ≥ r} ≤ e−r /2 for all r ≥ 0. (Hint: Consider the super-martingale 2 Z = exp(cM − c [M ]/2) for a suitable c > 0.) 21. Let X, Y be continuous semi-martingales. Fix a t > 0 and some partitions (tnk ) of [0, t] with maxk (tnk − tn,k−1 ) → 0. Show that 12 Σk (Ytnk + Ytn,k−1 )(Xtnk − P

Xtn,k−1 ) → (Y ◦ X)t . (Hint: Use Corollary 18.13 and Proposition 18.17.)

23. Say that a process is predictable if it is measurable with respect to the σ-field in R+ × Ω induced by all predictable step processes. Show that every predictable process is progressive. Conversely, given any progressive process X and constant h > 0, show that the process Yt = X(t−h)+ is predictable. 24. For a progressive process V and a non-decreasing, continuous, adapted process A, prove the existence of a predictable process V˜ with |V − V˜ | · A = 0 a.s. (Hint: Use Lemma 18.23.) 25. Use the preceding statement to deduce Lemma 18.23. (Hint: Begin with predictable V , using a monotone-class argument.) 26. Construct the stochastic integral V · M by approximation from elementary integrals, using Lemmas 18.10 and 18.23. Show that the resulting integral satisfies the relation in Theorem 18.11. (Hint: First let M ∈ M2 and E(V 2 · [M ])∞ < ∞, and extend by localization.) d ˜ for some Brownian motions B, B ˜ on possibly different 27. Let (V, B) = (V˜ , B) d ˜ ˜ filtered probability spaces and some V ∈ L(B), V ∈ L(B). Show that (V, B, V ·B) = ˜ V˜ · B). ˜ (Hint: Argue as in the proof of Proposition 18.26.) (V˜ , B, 28. Let X be a continuous F-semi-martingale. Show that X remains conditionally a semi-martingale given F0 , and that the conditional quadratic variation agrees with [X]. Also show that if V ∈ L(X), where V = σ(Y ) for some continuous process Y and measurable function σ, then V remains conditionally X-integrable, and the conditional integral agrees with V · X. (Hint: Conditioning on F0 preserves martingales.)

Chapter 19

Continuous Martingales and Brownian Motion Real and complex exponential martingales, characterization of Brownian motion, time-change reduction, harmonic and analytic maps, point polarity and transience, skew-product representation, orthogonality and independence, Brownian invariance, Brownian functionals and martingales, Gauss and Poisson reduction, iterated and multiple integrals, chaos and integral representation, change of measure, transformation of drift, preservation properties, Cameron–Martin theorem, uniform integrability, Wald’s identity, removal of drift

Here we deal with a wide range of applications of stochastic calculus, the principal tools of which were introduced in the previous chapter. A recurrent theme is the notion of exponential martingales, which appear in both a real and a complex variety. Exploring the latter yields an easy access to L´evy’s celebrated martingale characterization of Brownian motion, and to the basic time-change reduction of isotropic continuous local martingales to a Brownian motion. Applying the latter result to suitable compositions of a Brownian motion with harmonic or analytic maps, we obtain important information about Brownian motion in Rd . Similar methods can be used to analyze a variety of other transformations leading to Gaussian processes. As a further application of the exponential martingales, we derive stochastic integral representations of Brownian functionals and martingales, and examine their relationship to the chaos expansions obtained by different methods in Chapter 14. In this context, we also note how the previously introduced multiple Wiener–Itˆo integrals can be expressed as iterated single Itˆo integrals. A related problem, of crucial importance for Chapter 32, is to represent a continuous local martingale with absolutely continuous covariation process in terms of stochastic integrals with respect to a suitable Brownian motion. As a final major topic, we examine the transformations induced by absolutely continuous changes of the probability measure. Here the density process turns out to be a real exponential martingale, and any continuous local martingale remains a martingale under the new measure, apart from an additional drift term. This observation is useful for applications, where it is often employed to remove the drift of a given semi-martingale. The appropriate change of measure then depends on the given process, and it becomes important to find effective criteria for a proposed exponential process to be a true martingale. © Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1_20

417

418

Foundations of Modern Probability

The present exposition may be regarded as a continuation of our discussion of martingales and Brownian motion in Chapters 9 and 14, respectively. Changes of time and measure are both important for the theory of stochastic differential equations, developed in Chapters 32–33. The time-change reductions of continuous martingales have counterparts for the point processes explored in Chapter 15, where the present Gaussian processes are replaced by suitable Poisson processes. The results about changes of measure are extended in Chapter 20 to the context of possibly discontinuous semi-martingales. To begin our technical discussion of the mentioned ideas, we consider first the complex exponential martingales, which are closely related to the real versions appearing in Lemma 19.22. Lemma 19.1 (complex exponential martingales) Let M be a real, continuous local martingale with M0 = 0. Then 



Zt = exp iMt + 12 [M ]t ,

t ≥ 0,

is a complex local martingale, satisfying Zt = 1 + i(Z · M )t a.s.,

t ≥ 0.

Proof: Applying Corollary 18.20 to the complex-valued semi-martingale Xt = iMt + 12 [M ]t and the entire function f (z) = ez , we get 

dZt = Zt dXt + 12 d[X]t





= Zt idMt + 12 d[M ]t − 12 d[M ]t = iZt dMt .



2

We now explore the connection between continuous martingales and Gauss¯ denote the closed ian processes. For a subset K of a Hilbert space H, let K linear subspace generated by K. Lemma 19.2 (isonormal martingales) For a Hilbert space H, let K ⊂ H ¯ = H. Let M h , h ∈ K, be continuous local martingales with M0h = 0, with K such that (1) [M h , M k ]∞ = h, k a.s., h, k ∈ K. Then there exists an isonormal Gaussian process ζ ⊥⊥ F0 on H, such that h = ζh a.s., M∞

h ∈ K.

Proof: For any linear combination h = u1 h1 + · · · + un hn , let Nt = u1 Mth1 + · · · + un Mthn ,

t ≥ 0.

Then (1) yields



Σ j,k uj uk M h , M h ∞ = Σ j,k uj uk hj , hk = h 2 .

[N ]∞ =

j

k

19. Continuous Martingales and Brownian Motion

419

The process Z = exp(iN + 12 [N ]) is a.s. bounded, and so by Lemma 19.1 it is a uniformly integrable martingale. Writing ξ = N∞ , we hence obtain for any A ∈ F0 P A = E(Z ; A) ∞ 

= E exp iN∞ + 12 [N ]∞ ; A = E(eiξ ; A) exp



1 2



h 2 .

Since u1 , . . . , un were arbitrary, we conclude from Theorem 6.3 that the ranh1 hn , . . . , M∞ ) is independent of F0 and centered Gaussian with dom vector (M∞ covariances hj , hk . hi . To see that ζh is a.s. For any element h = Σi ai hi , define ζh = Σi ai M∞ independent of the representation of h, suppose that also h = Σi bi hi . Writing ci = ai − bi , we get E



Σ i aiM∞h − Σ i biM∞h i

i

2



2

hi = E Σ i ci M∞ = Σ i,j ci cj hi , hj





2 =  Σ i ci hi 





2 =  Σ i ai hi − Σ i bi hi  = 0,

as desired. Note that ζ is centered Gaussian with E(ζh ζk) = h, k . We may finally extend ζ by continuity to an isonormal Gaussian process on H. 2 For a first application, we may establish a celebrated martingale characterization of Brownian motion. Theorem 19.3 (Brownian criterion, L´evy) For a continuous process B = (B 1 , . . . , B d ) in Rd with B0 = 0, these conditions are equivalent: (i) B is an F-Brownian motion, (ii) B is a local F-martingale with [B i , B j ]t ≡ δij t a.s. Proof (OK): Assuming (ii), we introduce for fixed s < t the continuous local martingales i i − Br∧s , r ≥ s, i ≤ d, Mri = Br∧t and conclude from Lemma 19.2 that the differences Bti −Bsi are i.i.d. N (0, t−s) 2 and independent of Fs , which proves (i). The last result suggests that we might transform a general continuous local martingale M into a Brownian motion through a suitable random time change. In higher dimensions, this requires M to be isotropic, in the sense that a.s. [M i ] = [M j ], i = j, [M i , M j ] = 0, which holds in particular when M is a Brownian motion in Rd . For continuous local martingales M in C, the condition is a.s. equivalent to [M ] = 0, or

420

Foundations of Modern Probability [ &M ] = [,M ], [ &M, ,M ] = 0.

In the isotropic case, we refer to [M 1 ] = · · · = [M d ] or [ &M ] = [,M ] as the rate process of M . Though the proof is straightforward when [M ]∞ = ∞ a.s., the general case requires a rather subtle extension of the filtered probability space. For two filtrations F, G on Ω, we say that G is a standard extension of F if

⊥ F, Ft ⊂ G t ⊥ Ft

t ≥ 0.

This is precisely the condition needed to ensure that all adaptedness and conditioning properties will be preserved. Theorem 19.4 (time-change reduction, Doeblin, Dambis, Dubins & Schwarz) Let M be an isotropic, continuous local F-martingale in Rd with M0 = 0, and

define τs = inf t ≥ 0; [M 1 ]t > s , Gs = Fτs , s ≥ 0. Then there exists a Brownian motion B in Rd w.r.t. a standard extension Gˆ of G, such that a.s. B = M ◦ τ on [ 0, [M 1 ]∞ ), M = B ◦ [M 1 ]. Proof (OK): We may take d = 1, the proof for d > 1 being similar. Introduce a Brownian motion X⊥⊥F with induced filtration X , and put Gˆt = Gt ∨Xt . Since G⊥⊥X , it is clear that Gˆ is a standard extension of both G and X . In ˆ Now define particular, X remains a Brownian motion under G.

Bs = Mτs +

0

s

1{τr = ∞} dXr ,

s ≥ 0.

(2)

Since M is τ -continuous by Proposition 18.6, Theorem 18.24 shows that the first term M ◦ τ is a continuous G -martingale, hence also a Gˆ -martingale, with quadratic variation [M ◦ τ ]s = [M ]τs = s ∧ [M ]∞ ,

s ≥ 0.

The second term in (2) has quadratic variation s−s∧[M ]∞ , and the covariation vanishes since M ◦ τ ⊥ ⊥X. Thus, [B]s = s a.s., and so Theorem 19.3 shows that B is a Gˆ -Brownian motion. Finally, Bs = Mτs for s < [M ]∞ , which implies M = B ◦ [M ] a.s. by the τ -continuity of M . 2 In two dimensions, isotropic martingales arise naturally by composition of a complex Brownian motion B with a possibly multi-valued analytic function f . For any continuous process X, we may clearly choose a continuous evolution of f (X), as long as X avoids all singularities of f . Similar results hold for harmonic functions, which is especially useful in dimensions d ≥ 3, where no analytic functions exist.

19. Continuous Martingales and Brownian Motion

421

Theorem 19.5 (harmonic and analytic maps, L´evy) (i) For an harmonic function f on Rd , consider an isotropic, continuous local martingale M in Rd that a.s. avoids all singularities of f . Then f (M ) is a local martingale satisfying [f (M ) ] = |∇f (M )|2 · [M 1 ] a.s. (ii) For an analytic function f , consider a complex, isotropic, continuous local martingale M that a.s. avoids all singularities of f . Then f (M ) is again a complex, isotropic local martingale satisfying [ &f (M ) ] = |f  (M )|2 · [ &M ] a.s. If B is a Brownian motion and f  ≡ 0, then [ &f (B)] is a.s. unbounded and strictly increasing. Proof: (i) Since M is isotropic, we get by Corollary 18.19 f (M ) = f (M0 ) + fi · M i + 12 Δf (M ) · [M 1 ]. Here the last term vanishes since f is harmonic, and so f (M ) is a local martingale. The isotropy of M also yields 

Σ i fi(M ) · M i 2 = Σ i fi (M ) · [M 1 ]  

[f (M ) ] =

2 = ∇f (M ) · [M 1 ].

(ii) Since f is analytic, Corollary 18.20 yields f (M ) = f (M0 ) + f  (M ) · M + 12 f  (M ) · [M ]. Since M is isotropic, the last term vanishes, and we get [f (M ) ] = f  (M ) · M

2

= f  (M )



· [M ] = 0,

which shows that f (M ) is again isotropic. Writing M = X + iY and f  (M ) = U + iV , we get [ &f (M ) ] = [ U · X − V · Y ] = ( U 2 + V 2 ) · [X] = |f  (M )|2 · [ &M ]. Noting that f  has at most countably many zeros, unless it vanishes identically, we obtain by Fubini’s theorem



Eλ t ≥ 0; f  (Bt ) = 0 =

0







P f  (Bt ) = 0 dt = 0,

422

Foundations of Modern Probability

and so [ &f (B) ] = |f  (B)|2 · λ is a.s. strictly increasing. To see that it is also a.s. unbounded, we note that f (B) converges a.s. on the set {[ &f (B) ] < ∞}. However, f (B) diverges a.s., since f is non-constant and the random walk 2 B0 , B1 , . . . is recurrent by Theorem 12.2. Combining the last two results, we may prove the polarity of single points when d ≥ 2, and the transience of the process when d ≥ 3. Note that the latter property is a continuous-time counterpart of Theorem 12.8 for random walks. Both results are important for the potential theory developed in Chapter 34. Define τa = inf{t > 0; Bt = a}. Theorem 19.6 (point polarity and transience, L´evy, Kakutani) For a Brownian motion B in Rd , we have (i) for d ≥ 2, τa = ∞ a.s. for all a ∈ Rd , (ii) for d ≥ 3, |Bt | → ∞ a.s. as t → ∞. Proof: (i) We may take d = 2, and hence choose B to be a complex Brownian motion. Using Theorem 19.5 (ii) with f (z) = ez , we see that M = eB is a conformal local martingale with unbounded rate [ &M ]. By Theorem 19.4 we have M − 1 = X ◦ [ &M ] a.s. for a Brownian motion X, and since M = 0, it follows that X a.s. avoids −1. Hence, τ−1 = ∞ a.s., and so by scaling and rotational symmetry we have τa = ∞ a.s. for every a = 0. To extend this to a = 0, we use the Markov property at h > 0 to obtain



P0 τ0 ◦ θh < ∞ = E0 PBh {τ0 < ∞} = 0,

h > 0.

As h → 0, we get P0 {τ0 < ∞} = 0, and so τ0 = ∞ a.s. (ii) Here we may take d = 3. For any a = 0 we have τa = ∞ a.s. by (i), and so by Theorem 19.5 (i) the process M = |B − a|−1 is a continuous local martingale. By Fatou’s lemma, M is then an L1 -bounded super-martingale, and so by Theorem 9.19 it converges a.s. toward some random variable ξ. Since d clearly Mt → 0 we have ξ = 0 a.s. 2 Combining (i) above with Theorem 17.11, we see that a complex, isotropic, continuous local martingale M with M0 = 0 avoids every fixed point a = 0. Thus, Theorem 19.5 (ii) applies to any analytic function f with only isolated singularities. Since f is allowed to be multi-valued, the result applies even to functions with essential singularities, such as to f (z) = log(1 + z). For a simple application, we consider the windings of planar Brownian motion about a fixed point. Corollary 19.7 (skew-product representation, Galmarino) Let B be a complex Brownian motion with B0 = 1, and choose a continuous version of V = arg B with V0 = 0. Then there exists a real Brownian motion Y ⊥⊥ |B|, such   that a.s. Vt ≡ Y ◦ |B|−2 · λ t , t ≥ 0.

19. Continuous Martingales and Brownian Motion

423

Proof: Using Theorem 19.5 (ii) with f (z) = log(1 + z), we see that Mt = log |Bt |+iVt is an isotropic martingale with rate [ &M ] = |B|−2 ·λ. Hence, Theorem 19.4 yields a complex Brownian motion Z = X + iY with M = Z ◦ [ &M ] a.s., and the assertion follows. 2 For a non-isotropic, continuous local martingale M in Rd , there is no single random time change that reduces the process to a Brownian motion. However, we may transform the components M i individually as in Theorem 19.4, to obtain a set of one-dimensional Brownian motions B 1 , . . . , B d . If the latter processes are independent, they may be combined into a d -dimensional Brownian motion B = (B 1 , . . . , B d ). We show that the required independence arises automatically, whenever the original components M i are strongly orthogonal, in the sense that [M i , M j ] = 0 a.s. for all i = j. Proposition 19.8 (multi-variate time change, Knight) Let M 1 , M 2 , . . . be strongly orthogonal, continuous local martingales starting at 0. Then there exist some independent Brownian motions B 1 , B 2 , . . . with M k = B k ◦ [M k ] a.s.,

k ∈ N.

Proof: When [M k ]∞ = ∞ a.s. for all k, the result is an easy consequence of Lemma 19.2. In general, choose some independent Brownian motions ⊥ F with induced filtration X , and define X 1, X 2, . . . ⊥



Bsk = M k (τsk ) + X k (s − [M k ]∞ )+ ,

s ≥ 0, k ∈ N.

Write ψt = −log(1 − t)+ , and put Gt = Fψt + X(t−1)+ , t ≥ 0. To check that B 1 , B 2 , . . . have the desired joint distribution, we may take the [M k ] to be bounded. Then the processes k , Ntk = Mψkt + X(t−1) +

t ≥ 0,

are strongly orthogonal, continuous G -martingales with quadratic variations [N k ]t = [M k ]ψt + (t − 1)+ , and we note that

Bsk = Nσksk ,

t ≥ 0,

σsk = inf t ≥ 0; [N k ]t > s .

The assertion now follows from the result for [M k ]∞ = ∞ a.s.

2

We turn to some general invariance properties of a Brownian motion or bridge. Then for any processes U , V on I = R+ or [0, 1], we define1

λV = I

Vt dt,

 U, V =



I

Ut Vt dt,

V 2 = V, V

1/2

.

1 The inner product ·, · must not be confused with the predictable covariation introduced in Chapter 20.

424

Foundations of Modern Probability

A process B on [0, 1] is said to be a Brownian bridge w.r.t. a filtration F, if it is F-adapted and such that, conditionally on Ft for every t ∈ [0, 1], it is a Brownian bridge from (t, Bt ) to (1, 0). We anticipate the fact that an FBrownian bridge in Rd is a continuous semi-martingale on [0, 1]. The associated martingale component is clearly a standard Brownian motion. Theorem 19.9 (Brownian invariance) (i) Consider an F-Brownian motion X and some F-predictable processes V t on R+ , t ≥ 0, such that V s , V t = s ∧ t a.s.,

s, t ≥ 0.

Then Yt = (V t · X)∞ , t ≥ 0, is a version of a Brownian motion. (ii) Consider an F-Brownian bridge X and some F-predictable processes V t on [0, 1], t ∈ [0, 1], such that V s , V t = s ∧ t a.s.,

λV t = t,

s, t ∈ [0, 1].

Then Yt = (V · X)1 , t ∈ [0, 1], is a version of a Brownian bridge. t

The existence of the integrals V t ·X should be regarded as part of the assertions. In case of (ii) we need to convert the relevant Brownian-bridge integrals into continuous martingales. Then for any Lebesgue integrable process V on 1 [0, 1], we define V¯t = (1 − t)−1 Vs ds, t ∈ [0, 1]. t

Lemma 19.10 (Brownian bridge integral) (i) Let X be an F-Brownian bridge with martingale component B, and con sider an F-predictable process V on [0, 1], such that E F0 01 V 2 < ∞ a.s. and λV is F0 -measurable. Then



1 0

Vt dXt =

1 0

(Vt − V¯t ) dBt a.s.

(ii) For any processes U, V ∈ L2 [0, 1], we have U¯ , V¯ ∈ L2 [0, 1] and

1 0

(Ut − U¯t )(Vt − V¯t ) dt =



1 0

Ut Vt dt − λU · λV.

Proof: (ii) We may take U = V , since the general case will then follow by polarization. Writing Rt = t1 Vs ds = (1 − t)V¯t and integrating by parts, we get on [0, 1) V¯t2 dt = (1 − t)−2 Rt2 dt = (1 − t)−1 Rt2 + 2 = (1 − t) V¯t2 + 2 For bounded V , we conclude that





(1 − t)−1 Rt Vt dt

V¯t Vt dt.

19. Continuous Martingales and Brownian Motion 0

1



(Vt − V¯t )2 dt =

1

0



1

= 0

425

Vt2 dt − (λV )2 (Vt − λV )2 dt,

(3)

and so by Minkowski’s inequality V¯ 2 ≤ V 2 + V − V¯ 2 = V 2 + V − λV 2 ≤ 2 V 2 , which extends to the general case by monotone convergence. This gives V¯ ∈ L2 , and the asserted relation follows from (3) by dominated convergence. (i) The process Mt = Xt /(1 − t) is clearly a continuous martingale on [0, 1), and so Xt = (1 − t)Mt is a continuous semi-martingale on the same interval. Integrating by parts, we get for any t ∈ [0, 1) dXt = (1 − t) dMt − Mt dt = dBt − Mt dt, and also





t 0

Vs Ms ds = Mt

0



dMs

0

=

Vs ds −

t

=

0

t

t

1 s





t 0

dMs

Vr dr − Mt

s

Vr dr

0



1

Vs ds

t

V¯s dBs − Xt V¯t .

Hence by combination,



t

0

Vs dXs =

t

0

(Vs − V¯s ) dBs + Xt V¯t ,

t ∈ [0, 1).

(4)

By Cauchy’s inequality and dominated convergence, we get as t → 1 

E F0 |Xt V¯t |

2

≤ E F0 (Mt2 ) E F0



1 t

|Vs | ds

≤ (1 − t) E F0 (Mt2 ) E F0

1

= t E F0

t

1

t

2

Vs2 ds

Vs2 ds → 0,

P and so by dominated convergence Xt V¯t → 0. By (4) and Corollary 18.13, it remains to note that V¯ is B-integrable, which holds since V¯ ∈ L2 by (ii). 2

Proof of Theorem 19.9: (i) Use Lemma 19.2. (ii) By Lemma 19.10 (i), we have a.s.

Yt =

1 0

(Vrt − V¯rt ) dBr ,

t ∈ [0, 1],

426

Foundations of Modern Probability

ˆ Furthermore, Lemma 19.10 (ii) yields for the Brownian motion B = X − X. 1 0

Vrs − V¯rs





Vrt − V¯rt dr =



1

0

Vrs Vrt dr − λV s · λV t

= s ∧ t − st. 2

The assertion now follows by Lemma 19.2.

Next we consider a basic stochastic integral representation of martingales with respect to a Brownian filtration. Theorem 19.11 (Brownian martingales) Let B = (B 1 , . . . , B d ) be a Brownian motion in Rd with complete, induced filtration F. Then every local Fmartingale M is a.s. continuous, and there exist some (P × λ)-a.e. unique processes V 1 , . . . , V d ∈ L(B 1 ), such that M = M0 +

Σ V k · Bk

a.s.

(5)

k≤d

The proof is based on a similar representation of Brownian functionals: Lemma 19.12 (Brownian functionals, Itˆo) Let B = (B 1 , . . . , B d ) be a Brownian motion in Rd , and let ξ be a B-measurable random variable with E ξ = 0 and E ξ 2 < ∞. Then there exist some (P × λ)-a.e. unique processes V 1 , . . . , V d ∈ L(B 1 ), such that 2 a.s.



E 0

|Vt |2 dt < ∞,

ξ=

Σ k≤d





V k dB k .

0

Proof (Dellacherie): Let H be the Hilbert space of B-measurable random variables ξ ∈ L2 with Eξ = 0, and write K for the subspace of elements ξ admitting the desired representation Σk (V k · B k )∞ . For such a ξ we get E ξ 2 = E Σk {(V k )2 · λ}∞ , which implies the asserted uniqueness. By the obvious completeness of L(B 1 ), the same formula shows that K is closed. To obtain K = H, we need to show that any ξ ∈ H ' K vanishes a.s. Then fix any non-random functions u1 , . . . , ud ∈ L2 (R). Put M = Σk uk ·B k , and define the process Z as in Lemma 19.1. Then Proposition 18.14 yields

and so ξ ⊥ (Z∞ − 1), or

Z − 1 = iZ · M = i Σ (Zuk ) · B k , k≤d





E ξ exp i Σ k (uk · B k )∞ = 0. Specializing to step functions uk and using Theorem 6.3, we get



E ξ; (Bt1 , . . . , Btn ) ∈ C = 0,

t1 , . . . , tn ∈ R+ , C ∈ B n , n ∈ N,

2 The moment condition is essential, since the existence would otherwise be trivial and the uniqueness would fail. This is similar to the situation for the Skorohod embedding in Theorem 22.1.

19. Continuous Martingales and Brownian Motion

427

which extends by a monotone-class argument to E( ξ; A) = 0 for any A ∈ F∞ . 2 Thus, ξ = E( ξ | F∞ ) = 0 a.s. Proof of Theorem 19.11: We may clearly take M0 = 0, and by suitable localization we may assume that M is uniformly integrable. Then M∞ exists in L1 (F∞ ) and may be approximated in L1 by some random variables ξ1 , ξ2 , . . . ∈ L2 (F∞ ) with Eξn = 0. The martingales Mtn = E( ξn | Ft ) are a.s. continuous by Lemma 19.12, and Proposition 9.16 yields for any ε > 0





P (ΔM )∗ > 2ε ≤ P (M n − M )∗ > ε



≤ ε−1 E| ξn − M∞ | → 0. Hence, (ΔM )∗ = 0 a.s., and so M is a.s. continuous. The remaining assertions now follow by localization from Lemma 19.12. 2 We turn to the converse problem of finding a Brownian motion B satisfying (5) for given processes V k . This result plays a crucial role in Chapter 32. Theorem 19.13 (integral representation, Doob) Let M be a continuous local F-martingale in Rd with M0 = 0, such that [M i , M j ] =

Σ Vki Vkj · λ

k≤n

a.s.,

i, j ≤ d,

for some F-progressive processes Vki , i ≤ d, k ≤ n. Then there exists a Brownian motion B = (B 1 , . . . , B d ) in Rd w.r.t. a standard extension G of F, such that M i = Σ Vki · B k a.s., i ≤ d. k≤n

Proof: For any t ≥ 0, let Nt and Rt be the null and range spaces of the matrix Vt , and write Nt⊥ and Rt⊥ for their orthogonal complements. Denote the corresponding orthogonal projections by πNt , πRt , πNt⊥ , and πRt⊥ , respectively. Note that Vt is a bijection from Nt⊥ to Rt , and write Vt−1 for the inverse mapping from Rt to Nt⊥ . All these mappings are clearly Borel-measurable functions of Vt , and hence again progressive. Now introduce a Brownian motion X⊥⊥F in Rn with induced filtration X , and note that Gt = Ft ∨ Xt , t ≥ 0, is a standard extension of both F and X . Thus, V remains G -progressive, and the martingale properties of M and X remain valid for G. In Rn we introduce the local G -martingale B = V −1 πR · M + πN · X. The covariation matrix of B has density    (V −1 πR ) V V  (V −1 πR ) + πN πN = πN ⊥ πN ⊥ + πN π N = πN ⊥ + πN = I,

428

Foundations of Modern Probability

and so B is a Brownian motion by Theorem 19.3. Furthermore, the process πR⊥ · M vanishes a.s., since its covariation matrix has density πR⊥ V V  πR ⊥ = 0. Hence, Proposition 18.14 yields V · B = V V −1 πR · M + V πN · Y = πR · M = (πR + πR⊥ ) · M = M.

2

The following joint extension of Theorems 15.15 and 19.2 will be needed in Chapter 27. Proposition 19.14 (Gauss and Poisson reduction) Consider a continuous local martingale M in Rd , a ql-continuous, K-marked point process ξ on (0, ∞) ˆ a predictable process V : R+ × K → S, ˆ and a progressive with compensator ξ, d ˆ / S. process U : T × R+ → R , where K, S are Borel and S = S ∪ {Δ} with Δ ∈ Let the random measure μ on S and process ρ on T 2 be F0 -measurable with ρs,t = Σ i,j

μ = ξˆ ◦ V −1 ,





0

j i Us,r Ut,r d[M i , M j ]r ,

s, t ∈ T,

and define the point process η on S and process X on T by Xt = Σ i

η = ξ ◦ V −1 ,





0

i Ut,r dMri ,

t ∈ T.

Then conditionally on F0 , we have (i) η and X are independent, (ii) η is Poisson on S with intensity measure μ, (iii) X is centered Gaussian with covariance function ρ. Proof: For any constants c1 , . . . , cm ∈ R, times t1 , . . . , tm ∈ T , and disjoint sets B1 , . . . , Bn ∈ S with μBj < ∞, consider the processes Nt = Ytk =

Σ k ck Σ i S

t+

0



t

0

Utik ,r dMri , t ≥ 0, k ≤ n.

1Bk (Vs,x ) ξ(ds dx),

Next define for any u1 , . . . , un ≥ 0 the exponential local martingales 



Zt0 = exp iNt + 12 [N, N ]t ,



Ztk = exp −uk Ytk + (1 − e−uk )Yˆtk ,

t ≥ 0, k ≤ n,

where the local martingale property holds for Z 0 by Lemma 19.1 and for Z 1 , . . . , Z n by Theorem 20.8, applied to the processes Akt = (1 − e−uk )(Yˆtk −Ytk ). The same property holds for the product Zt = Πk Ztk , since the Z k are strongly orthogonal by Theorems 20.4 and 20.6. Furthermore,

Σ k ck Xt , Y∞k∞ = ηBk , Uti ,r Utj ,r d[M i , M j ]r [N, N ]∞ = Σ h,k ch ck Σ i,j 0 = Σ h,k ch ck ρt ,t , N∞ =

k

h

k

h k

Yˆ∞k =

S

K

ˆ dx) = μBk , 1Bk (Vs,x ) ξ(ds

k ≤ n.

19. Continuous Martingales and Brownian Motion

429

The product Z remains a local martingale with respect to the conditional probability measure PA = P (· |A), for any A ∈ F0 with P A > 0. Choosing A such that ρth ,tk and μBk are bounded on A, we see that even Z becomes bounded on A, and hence is a uniformly integrable PA -martingale. In particular, E(Z∞ |A) = 1 or E(Z∞ ; A) = P A, which extends immediately to arbitrary A ∈ F0 . Hence, E(Z∞ | F0 ) = 1, and we get



E exp i

Σ k c k Xt

= exp −

 

k

1 2

− Σ k uk ηBk  F0

Σ h,k chck ρt ,t



− Σ k (1 − e−uk ) μBk .

h k

By Theorem 6.3, the variables Xt1 , . . . , Xtm and ηB1 , . . . , ηBn have then the required joint conditional distribution, and the assertion follows by a monotoneclass argument. 2 For a similar purpose in Chapter 27, we also need a joint extension of Theorems 10.27 and 19.2, where the ordinary compensator is replaced by its discounted version. Proposition 19.15 (discounted reduction maps) For j ∈ N, let (τj , χj ) be orthogonal, adapted pairs as in Theorem 10.27 with discounted compensators ζj , and let Yj : R+ × Kj → Sj be predictable with ζj ◦ Yj−1 ≤ μj a.s. for some F0 -measurable random distributions μj on Sj . Put γj = Yj (τj , χj ), and define X as in Lemma 19.14 for a continuous local martingale M in Rd and some predictable processes Ut , t ∈ S, such that the process ρ on S 2 is F0 -measurable. Then conditionally on F0 , we have (i) X, γ1 , γ2 , . . . are independent, (ii) L(γj ) = μj for all j, (iii) X is centered Gaussian with covariance function ρ. Proof: For V1 , . . . , Vm as before, we define the associated martingales M1 , . . . , Mm as in Lemma 10.25. Further introduce the exponential local martingale Z = exp(iN + 12 [N ]), where N is such as in the proof of Lemma 19.14. Then (Z − 1) Πj Mj is conditionally a uniformly integrable martingale, and we get as in Theorem 10.27



E F0 (Z∞ − 1) Π j 1Bj (γj ) − μj Bj = 0. Combining with the same relation without the factor Z∞ − 1 and proceeding recursively, we get as before 



E F0 exp(iN∞ ) Π j 1Bj (γj ) = exp − 12 [N ]∞

Π j μj Bj .

Applying Theorem 6.3 to the bounded measure P˜ = conclude that



Πj 1B (γj ) · P F , 0

j

P F0 X ∈ A; γj ∈ Bj , j ≤ m = νρ (A) Π j μj Bj ,

we

430

Foundations of Modern Probability

where νρ denotes the centered Gaussian distribution on RT with covariance function ρ. 2 Next we show how the multiple Wiener–Itˆo integrals of Chapter 14 can be expressed as iterated single Itˆo integrals. Then introduce the tetrahedral sets



Δn = (t1 , . . . , tn ) ∈ Rn+ ; t1 < · · · < tn ,

n ∈ N.

Given a function f ∈ L2 (Rn+ , λn ), we write fˆ = n! f˜1Δn , where f˜ is the symmetrization of f defined in Chapter 14. For any isonormal Gaussian process ζ on L2 (λ), we may introduce an associated Brownian motion B on R+ satisfying Bt = ζ1[0,t] for all t ≥ 0. Theorem 19.16 (multiple and iterated integrals) Let B be a Brownian motion on R+ generated by an isonormal Gaussian process ζ on L2 (λ). Then for any f ∈ L2 (λn ), we have (6) ζ nf = dBtn dBtn−1 · · · fˆ(t1 , . . . , tn ) dBt1 a.s. Though a formal verification is easy, the existence of the integrals on the right depends in a subtle way on the choice of suitable versions in each step. The existence of such versions is regarded as part of the assertion. Proof: We prove by induction that the iterated integral

Vtkk+1 ,...,tn =



dBtk−1 · · ·

dBtk



fˆ(t1 , . . . , tn ) dBt1

exists for almost all tk+1 , . . . , tn , and that V k has a version supported by Δn−k , which is progressive in the variable tk+1 for fixed tk+2 , . . . , tn . We further need to establish the isometry 

E Vtkk+1 ,...,tn

2



=

···





2 fˆ(t1 , . . . , tn ) dt1 · · · dtk ,

(7)

which allows us in the next step to define Vtk+1 for almost all tk+2 , . . . , tn . k+2 ,...,tn 0 ˆ The integral V = f has clearly the stated properties. Now suppose that a has been constructed with the desired properties. version of the integral Vtk−1 k ,...,tn For any tk+1 , . . . , tn such that (7) is finite, Theorem 18.25 shows that the process k = Xt,t k+1 ,...,tn

t

0

Vtk−1 dBtk , k ,...,tn

t ≥ 0,

has a progressive version that is a.s. continuous in t for fixed tk+1 , . . . , tn . By Proposition 18.15 we obtain Vtkk+1 ,...,tn = Xtkk+1 ,tk+1 ,...,tn a.s.,

tk+1 , . . . , tn ≥ 0,

and the progressivity carries over to V k , regarded as a process in tk+1 for fixed tk+2 , . . . , tn . Since V k−1 is supported by Δn−k+1 , we may choose X k to be

19. Continuous Martingales and Brownian Motion

431

supported by R+ × Δn−k , which ensures that V k will be supported by Δn−k . Finally, formula (7) for V k−1 yields 

E Vtkk+1 ,...,tn

2

=E

=

 ···

2

Vtk−1 dtk k ,...,tn





2 fˆ(t1 , . . . , tn ) dt1 · · · dtk .

To prove (6), we note that the right-hand side is linear and L2 -continuous in f . Furthermore, the two sides agree for indicator functions of rectangles in Δn . The equality extends by a monotone-class argument to arbitrary indicator functions in Δn , and the further extension to L2 (Δn ) is immediate. It remains 2 to note that ζ nf = ζ nf˜ = ζ nfˆ for any f ∈ L2 (λn ). For a Brownian functional with mean 0 and finite variance, we have established both a chaos expansion in Theorem 14.26 and a stochastic integral representation in Lemma 19.12. We proceed to examine how the two formulas are related. For any f ∈ L2 (λn ), we define ft (t1 , . . . , tn−1 ) = f (t1 , . . . , tn−1 , t), and write ζ n−1 f (t) = ζ n−1 ft when ft < ∞. Proposition 19.17 (chaos and integral representations) Let ζ be an isonormal Gaussian process on L2 (λ), generating a Brownian motion B, and let ξ ∈ L2 (ζ) with chaos expansion Σ n>0 ζ n fn . Then a.s.



ξ= 0

Vt =

Vt dBt ,

Σ ζ n−1fˆn(t),

t ≥ 0.

n>0

Proof: Proceeding as in the previous proof, we get for any m ∈ N

ΣE n≥m



2

fˆn 2 Σ n≥m = Σ E(ζ nfn )2 < ∞. n≥m

ζ n−1fˆn (t) dt =

(8)

Since the integrals ζ nf are orthogonal for different n, the series for Vt converges in L2 for almost every t ≥ 0. On the exceptional null set, we may redefine Vt = 0. Choosing progressive versions of the integrals ζ n−1fˆn (t), as before, we see from the proof of Corollary 5.33 that even the sum V can be chosen to be progressive. Applying (8) with m = 1 gives V ∈ L(B). Using Theorem 19.16, we get by a formal calculation ξ= =

ζ nfn Σ n≥1

Σ n≥1





= =



dBt

ζ n−1fˆn (t) dBt

Σ ζ n−1fˆn(t)

n≥1

Vt dBt .

To justify the interchange of integration and summation, we may use (8) and let m → ∞ to obtain

432

Foundations of Modern Probability E



dBt

Σ ζ n−1fˆn(t) n≥m

2





Σ E ζ n−1fˆn(t) n≥m = Σ E(ζ nfn )2 → 0. n≥m

=

2

dt 2

Now let P, Q be probability measures on a common measurable space (Ω, A), endowed with a right-continuous and P -complete filtration (Ft ). When Q  P on Ft , we write Zt for the corresponding density, so that Q = Zt · P on Ft . Since the martingale property depends on the choice of probability measure, we need to distinguish between P -martingales and Q -martingales.  Let EP denote integration with respect to P , and write F∞ = t Ft . We summarize the basic properties of such measure transformations. Recall from Theorem 9.28 that any F-martingale has an rcll version. Lemma 19.18 (change of measure) Let P, Q be probability measures on a measurable space Ω with a right-continuous, P -complete filtration F, such that Q = Zt · P on Ft ,

t ≥ 0,

for an adapted process Z. Then for any adapted process X, (i) Z is a P -martingale, (ii) Z is uniformly integrable ⇔ Q  P on F∞ , (iii) X is a Q-martingale ⇔ XZ is a P -martingale. When Z is right-continuous and X is rcll, we have also (iv) Q = Zτ · P on Fτ ∩ {τ < ∞} for any optional time τ , (v) X is a local Q-martingale ⇔ XZ is a local P -martingale, (vi) inf Zs > 0 a.s. Q, s≤t

t > 0.

Proof: (iii) For any adapted process X, we note that Xt is Q -integrable iff Xt Zt is P -integrable. When this holds for all t, the Q -martingale property of X becomes Xs dQ = Xt dQ, A ∈ Fs , s < t. A

A

By the definition of Z, this is equivalent to 







EP Xs Zs ; A = EP Xt Zt ; A ,

A ∈ Fs , s < t,

which means that XZ is a P -martingale. (i)−(ii): Choosing Xt ≡ 1 in (iii), we see that Z is a P -martingale. Now let Z be uniformly P -integrable, say with L1 -limit Z∞ . For any t < u and A ∈ Ft , we have QA = EP (Zu ; A). As u → ∞ we get QA = EP (Z∞ ; A), which extends to arbitrary A ∈ F∞ by a monotone-class argument. Thus, Q = Z∞ ·P on F∞ . Conversely, if Q = ξ · P on F∞ , then EP ξ = 1, and the P -martingale Mt = EP ( ξ |Ft ) satisfies Q = Mt · P on Ft for every t. But then Zt = Mt a.s. for all t, and Z is uniformly P -integrable with limit ξ. (iv) By optional sampling,

19. Continuous Martingales and Brownian Motion 



QA = EP Zτ ∧t ; A , and so





433

A ∈ Fτ ∧t , t ≥ 0,





Q A; τ ≤ t = EP Zτ ; A ∩ {τ ≤ t} ,

A ∈ Fτ , t ≥ 0.

The assertion now follows by monotone convergence as t → ∞. (v) For any optional time τ , we need to show that X τ is a Q -martingale iff (XZ)τ is a P -martingale. This may be seen as in (iv), once we note that Q = Ztτ · P on Fτ ∧t for all t. (vi) By Lemma 9.32 it is enough to show that Zt > 0 a.s. Q for every t > 0. 2 This is clear from the fact that Q{Zt = 0} = EP (Zt ; Zt = 0) = 0. The measure Q is usually not given at the outset, but needs to be constructed from the martingale Z. This requires some regularity conditions on the underlying probability space. Lemma 19.19 (existence) Let Z ≥ 0 be a martingale with Z0 = 1, on the canonical probability space (Ω, P ) with induced filtration F, where Ω = D R+ ,S for a Polish space S. Then there exists a probability measure Q on Ω with Q = Zt · P on Ft ,

t ≥ 0.

Proof: For every t ≥ 0 we may introduce the probability measure Qt = Zt ·P on Ft , which may be regarded as defined on D [0,t],S . Since the latter spaces are again Polish for the Skorohod topology, Corollary 8.22 yields a probability measure Q on D R+ ,S with projections Qt . The measure Q has clearly the required properties. 2 We show how the drift of a continuous semi-martingale is transformed under a change of measure with a continuous density Z. An extension appears in Theorem 20.9. Theorem 19.20 (transformation of drift, Girsanov, van Schuppen & Wong) Consider some probability measures P , Q and a continuous process Z, such that Q = Z · P on F , t ≥ 0. t

t

Then for any continuous local P -martingale M , we have the local Q-martingale ˜ = M − Z −1 · [M, Z]. M ˜ is a Proof: First let Z −1 be bounded on the support of [M ]. Then M continuous P -semi-martingale, and so by Proposition 18.14 and an integration by parts, ˜ Z − (M ˜ Z)0 = M = =

˜ ·Z +Z ·M ˜ + [M ˜ , Z] M ˜ · Z + Z · M − [M, Z] + [M ˜ , Z] M ˜ M · Z + Z · M,

434

Foundations of Modern Probability

˜ Z is a local P -martingale. Hence, by Lemma 19.18, M ˜ is which shows that M a local Q -martingale. For general M , define τn = inf{t ≥ 0; Zt < 1/n}, and note as before that ˜ τn is a local Q -martingale for every n ∈ N. Since τn → ∞ a.s. Q by Lemma M ˜ is a local Q -martingale by Lemma 18.1. 19.18, M 2 Next we show how the basic notions of stochastic calculus are preserved by a change of measure. Then let [X]P be the quadratic variation of X under the probability measure P , write LP (X) for the class of processes V that are X-integrable under P , and let (V ·X)P be the corresponding stochastic integral. Proposition 19.21 (preservation properties) Let Q = Zt · P on Ft for all t ≥ 0, where Z is continuous. Then (i) a continuous P -semi-martingale X is also a Q -semi-martingale, and [X]P = [X]Q a.s. Q, (ii) LP (X) ⊂ LQ (X), and for any V ∈ LP (X), (V · X)P = (V · X)Q a.s. Q, (iii) for any continuous local P -martingale M and process V ∈ LP (M ), ∼ ˜ a.s. Q, (V · M ) = V · M whenever either side exists. Proof: (i) Consider a continuous P -semi-martingale X = M + A, where M is a continuous local P -martingale and A is a process of locally finite variation. Under Q we may write ˜ + Z −1 · [M, Z] + A, X=M ˜ is the continuous local Q -martingale of Theorem 19.20, and we note where M −1 that Z · [M, Z] has locally finite variation, since Z > 0 a.s. Q by Lemma 19.18. Thus, X is also a semi-martingale under Q. The statement for [X] is now clear from Proposition 18.17. (ii) If V ∈ LP (X), then V 2 ∈ LP ([X]) and V ∈ LP (A), and so the same ˜ + A). To get V ∈ LQ (X), relations hold under Q, and we get V ∈ LQ (M −1 it remains to show that V ∈ LQ (Z · [M, Z]). Since Z > 0 under Q, it is equivalent to show that V ∈ LQ ([M, Z]), which is clear by Proposition 18.9, ˜ , Z]Q and V ∈ LQ (M ˜ ). since [M, Z]Q = [M ˜ ) as before. If V belongs to either class, (iii) Note that LQ (M ) = LQ (M then under Q, Proposition 18.14 yields the a.s. relations (V · M )∼ = V · M − Z −1 · [V · M, Z] = V · M − V Z −1 · [M, Z] ˜. = V ·M

2

19. Continuous Martingales and Brownian Motion

435

In particular, Theorem 19.3 shows that if B is a P -Brownian motion in ˜ is a Brownian motion under Q, since both processes are continuous Rd , then B martingales with the same covariation process. The preceding theory simplifies when P and Q are equivalent on each Ft , since in that case Z > 0 a.s. P by Lemma 19.18. If Z is also continuous, it can be expressed as an exponential martingale. More general processes of this kind are considered in Theorem 20.8. Lemma 19.22 (real exponential martingales) A continuous process Z > 0 is a local martingale iff a.s. 



Zt = E(M )t ≡ exp Mt − 12 [M ]t ,

t ≥ 0,

(9)

for a continuous local martingale M . Here M is a.s. unique, and for any continuous local martingale N , [M, N ] = Z −1 · [Z, N ] a.s. Proof: If M is a continuous local martingale, so is E(M ) by Itˆo’s formula. Conversely, let Z > 0 be a continuous local martingale. Then Corollary 18.19 yields log Z − log Z0 = Z −1 · Z − 12 Z −2 · [Z] = Z −1 · Z − 12 [Z −1 · Z], and (9) follows with M = log Z0 + Z −1 · Z. The last assertion is clear from this expression, and the uniqueness of M follows from Proposition 18.2. 2 The drift of a continuous semi-martingale can sometimes be eliminated by a suitable change of measure. Beginning with the case of a Brownian motion B with deterministic drift, we need to show that E(B) is a true martingale, which follows easily by a direct computation. By P ∼ Q we mean that P  Q and Q  P . Let L2loc be the class of measurable functions f : R+ → Rd , such that |f |2 is locally Lebesgue integrable. For any f ∈ L2loc we define f · λ = (f 1 · λ, . . . , f d · λ), where the components on the right are ordinary Lebesgue integrals. Theorem 19.23 (shifted Brownian motion, Cameron & Martin) Let B be a canonical Brownian motion in Rd with induced, complete filtration F, fix a continuous function h : R+ → Rd with h0 = 0, and put Ph = L(B + h). Then these conditions are equivalent: (i) Ph ∼ P0 on Ft for all t ≥ 0, (ii) h = f · λ for an f ∈ L2loc . Under those conditions, (iii) Ph = E(f · B)t · P0 .

436

Foundations of Modern Probability

Proof, (i) ⇒ (ii)−(iii): Assuming (i), Lemma 19.18 yields a P0 -martingale Z > 0, such that Ph = Zt · P0 on Ft for all t ≥ 0. Here Z is a.s. continuous by Theorem 19.11, and by Lemma 19.22 it equals E(M ) for a continuous local P0 martingale M . Using Theorem 19.11 again, we note that M = V ·B = Σ i V i ·B i a.s. for some processes V i ∈ L(B 1 ), and in particular V ∈ L2loc a.s. By Theorem 19.20, the process ˜ = B − [B, M ] B = B−V ·λ is a Ph -Brownian motion, and so under Ph the canonical process B has the semi-martingale decompositions ˜ +V ·λ B=B = (B − h) + h. Since the decomposition is a.s. unique by Proposition 18.2, we get V ·λ = h a.s. Then also h = f · λ for some measurable function f on R+ , and since clearly λ{t ≥ 0; Vt = ft } = 0 a.s., we have even f ∈ L2loc , proving (ii). Furthermore, M = V · B = f · B a.s., and (iii) follows. (ii) ⇒ (i): Assume (ii). Since M = f ·B is a time-changed Brownian motion under P0 , the process Z = E(M ) is a P0 -martingale, and Lemma 19.19 yields a probability measure Q on CR+ ,Rd , such that Q = Zt · P0 on Ft for all t ≥ 0. Moreover, Theorem 19.20 shows that the process ˜ = B − [B, M ] B = B−h is a Q -Brownian motion, and so Q = Ph . In particular, this proves (i).

2

For more general semi-martingales, Theorem 19.20 and Lemma 19.22 suggest that we might remove the drift by a change of measure of the form Q = E(M )t · P on Ft for all t ≥ 0, where M is a continuous local martingale with M0 = 0. Then by Lemma 19.18 we need Z = E(M ) to be a true martingale, which can often be verified through the following criterion. Theorem 19.24 (uniform integrability, Novikov) Let M be a real, continuous local martingale with M0 = 0. Then E(M ) is a uniformly integrable martingale, whenever   E exp 12 [M ]∞ < ∞. This will first be proved in a special case. Lemma 19.25 (Wald’s identity) For any real Brownian motion B and optional time τ , we have E eτ /2 < ∞







E exp Bτ − 12 τ = 1.

19. Continuous Martingales and Brownian Motion

437

Proof: First consider the hitting times



τb = inf t ≥ 0; Bt = t − b ,

b > 0.

Since the τb remain optional with respect to the right-continuous, induced filtration, we may take B to be a canonical Brownian motion with associated distribution P = P0 . Defining ht ≡ t and Z = E(B), we see from Theorem 19.23 that Ph = Zt · P on Ft for all t ≥ 0. Since τb < ∞ a.s. under both P and Ph , Lemma 19.18 yields 



E exp Bτb − 12 τb = EZτb   = E Zτb ; τb < ∞ = Ph {τb < ∞} = 1. The stopped process Mt ≡ Zt∧τb is a positive local martingale, and by Fatou’s lemma it is also a super-martingale on [0, ∞]. Since clearly EM∞ = EZτb = 1 = EM0 , we see from the Doob decomposition that M is a true martingale on [0, ∞]. For τ as stated, we get by optional sampling 1 = EMτ = EZτ ∧τb 







= E[ Z_τ; τ ≤ τ_b ] + E[ Z_{τ_b}; τ > τ_b ].    (10)

Using the definition of τb and the condition on τ , we get as b → ∞ 





E[ Z_{τ_b}; τ > τ_b ] = e^{−b} E[ e^{τ_b/2}; τ > τ_b ]



≤ e−b Eeτ /2 → 0, and so the last term in (10) tends to zero. Since also τb → ∞, the first term on the right tends to EZτ by monotone convergence, and we get EZτ = 1. 2 Proof of Theorem 19.24: Since E(M ) is a super-martingale on [0, ∞], it suffices to show that, under the stated condition, E E(M )∞ = 1. Using Theorem 19.4 and Proposition 9.9, we may then reduce to the statement of Lemma 19.25. 2 This yields in particular a classical result for Brownian motion: Corollary 19.26 (drift removal, Girsanov) Consider a Brownian motion B and a progressive process V in Rd , such that



E exp{ ½ (|V|² · λ)_∞ } < ∞.
Then
(i) Q = E(V · B)_∞ · P is a probability measure,
(ii) B̃ = B − V · λ is a Q-Brownian motion.

Proof: Combine Theorems 19.20 and 19.24.

2
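For a concrete case (an illustration added here, with constants θ ∈ R and T > 0): take V_t = −θ 1_{[0,T]}(t). Then ½ (|V|² · λ)_∞ = ½ θ² T, so the condition holds, and
Q = E(V · B)_∞ · P = exp( −θ B_T − ½ θ² T ) · P,
while B̃_t = B_t + θ(t ∧ T) is a Q-Brownian motion. Equivalently, under Q the process B behaves like a Brownian motion with drift −θ on [0, T].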


Exercises 1. Let B be a real Brownian motion and let V ∈ L(B). Show that X = V · B is a time-changed Brownian motion, express the required time change τ in terms of V , and verify that X is τ -continuous. 2. Consider a real Brownian motion B and some progressive processes V t ∈ L(B), t ≥ 0. Give conditions on the V t for the process Xt = (V t · B)∞ to be Gaussian. 3. Let M be a continuous local martingale in Rd , and let V 1 , . . . , V d be predictable ∞ i j processes on S × R+ , such that Σ ij 0 Vs,r Vt,r d[M i , M j ]r = ρs,t is a.s. non-random   i dM i is Gaussian on S with for all s, t ∈ S. Show that the process Xs = i 0∞ Vs,r r covariance function ρ. (Hint: Use Lemma 19.2.) 4. For M in Theorem 19.4, let [M ]∞ = ∞ a.s. Show that M is τ -continuous in the sense of Theorem 18.24, and conclude from Theorem 19.3 that B = M ◦ τ is a Brownian motion. Also show for any V ∈ L(M ) that (V ◦ τ ) · B = (V · M ) ◦ τ a.s. 5. Deduce Theorem 19.3 for d = 1 from Theorem 22.17. (Hint: Proceed as above to construct a discrete-time martingale with jump sizes h. Let h → 0, and use a version of Proposition 18.17.) 6. Let M be a real, continuous local martingale. Show that M converges a.s. on the set {supt Mt < ∞}. (Hint: Use Theorem 19.4.) 7. Let F, G be right-continuous filtrations, where G is a standard extension of F, and let τ be an F-optional time. Show that Fτ ⊂ Gτ ⊥⊥Fτ F. (Hint: Apply optional sampling to the uniformly integrable martingale Mt = E(ξ|Ft ), for any ξ ∈ L1 (F∞ ).) 8. Let F, G be filtrations on a common probability space (Ω, A, P ). Show that G is a standard extension of F iff every F-martingale is also a G-martingale. (Hint: Consider martingales of the form Mt = E(ξ| Ft ), where ξ ∈ L1 (F∞ ). Here Mt is Gt -measurable for all ξ iff Ft ⊂ Gt , and then Mt = E(ξ| Gt ) a.s. for all ξ iff F⊥⊥Ft Gt by Proposition 8.9.) 9. Let M be a non-trivial, isotropic, continuous local martingale in Rd , and fix an affine transformation f on Rd . Show that even f (M ) is isotropic iff f is conformal (i.e., the composition of a rigid motion with a change of scale). 10. Show that for any d ≥ 2, the path of a Brownian motion in Rd has Lebesgue measure 0. (Hint: Use Theorem 19.6 and Fubini’s theorem.) 11. Deduce Theorem 19.6 (ii) from Theorem 12.8. (Hint: Define τ = inf{t; |Bt | = 1}, and iterate the construction to form a random walk in Rd with steps of size 1.) 12. Give an example of a bounded, continuous martingale M ≥ 0 with M0 = 1, such that P {Mt = 0} > 0 for large t > 0. (Hint: We may take M to be a Brownian motion stopped at the levels 0 and 2.) 13. For any Brownian motion B and optional time τ < ∞, show that E exp(Bτ − τ ) ≤ 1, where the inequality may be strict. (Hint: Truncate and use Fatou’s lemma. Note that t − 2Bt → ∞ by the law of large numbers.) 1 2

14. In Theorem 19.23, give necessary and sufficient conditions for Ph ∼ P0 on F∞ , and find the associated density. 15. Show by an example that Lemma 19.25 may fail without a moment condition on τ . (Hint: Choose a τ > 0 with Bτ = 0.)

Chapter 20

Semi-Martingales and Stochastic Integration Predictable covariation, L2 -martingale and semi-martingale integrals, quadratic variation and covariation, substitution rule, Dol´eans exponential, change of measure, BDG inequalities, martingale integral, purely discontinuous semi-martingales, semi-martingale and martingale decompositions, exponential super-martingales and inequalities, quasi-martingales, stochastic integrators

In the last two chapters, we have seen how the Itˆo integral with associated transformation rules leads to a powerful stochastic calculus for continuous semimartingales, defined as sums of continuous local martingales and continuous processes of locally finite variation. This entire theory will now be extended to semi-martingales with possible jump discontinuities. In particular, the semimartingales turn out to be precisely the processes that can serve as stochastic integrators with reasonable adaptedness and continuity properties. The present theory is equally important but much more subtle, as it depends crucially on the Doob–Meyer decomposition from Chapter 10 and the powerful BDG inequalities for general local martingales. Our construction of the general stochastic integral proceeds in three steps. First we imitate the definition of the L2 -integral V · M from Chapter 18, using a predictable version M, N of the covariation process. A suitable truncation then allows us to extend the integral to arbitrary semi-martingales X and bounded, predictable processes V . The ordinary covariation [X, Y ] can now be defined by the integration-by-parts formula, and we may use the general BDG inequalities to extend the martingale integral V · M to more general integrands V . Once the stochastic integral is defined, we may develop a stochastic calculus for general semi-martingales. In particular, we prove an extension of Itˆo’s formula, solve a basic stochastic differential equation, and establish a general Girsanov-type theorem for absolutely continuous changes of the probability measure. The latter material extends the relevant parts of Chapters 18–19. The stochastic integral and covariation process, along with the Doob–Meyer decomposition from Chapter 10, provide the basic tools for a more detailed study of general semi-martingales. In particular, we may now establish two general decompositions, similar to the decompositions of optional times and increasing processes in Chapter 10. We further derive some exponential inequalities for martingales with bounded jumps, characterize local quasi-martingales © Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1_21


as special semi-martingales, and show that no continuous extension of the predictable integral exists beyond the context of semi-martingales.
Throughout this chapter, let M² be the class of uniformly square-integrable martingales, and note as in Lemma 18.4 that M² is a Hilbert space for the norm ‖M‖ = (E M_∞²)^{1/2}. Let M²_0 be the closed linear subspace of martingales M ∈ M² with M_0 = 0. The corresponding classes M²_loc and M²_{0,loc} are defined as the sets of processes M, such that the stopped versions M^{τ_n} belong to M² or M²_0, respectively, for suitable optional times τ_n → ∞.
For every M ∈ M²_loc, we note that the process M² is a local sub-martingale. The corresponding compensator ⟨M⟩ is called the predictable quadratic variation of M. More generally, we define¹ the predictable covariation ⟨M, N⟩ of two processes M, N ∈ M²_loc as the compensator of MN, also obtainable through the polarization formula 4⟨M, N⟩ = ⟨M + N⟩ − ⟨M − N⟩. Note that ⟨M, M⟩ = ⟨M⟩. If M and N are continuous, then clearly ⟨M, N⟩ = [M, N] a.s. Here we list some further useful properties.
Proposition 20.1 (predictable covariation) For any M, N, M^n ∈ M²_loc, we have a.s.
(i) ⟨M, N⟩ is symmetric, bilinear with ⟨M, N⟩ = ⟨M − M_0, N⟩,
(ii) ⟨M⟩ = ⟨M, M⟩ is non-decreasing,
(iii) |⟨M, N⟩| ≤ ∫ |d⟨M, N⟩| ≤ ⟨M⟩^{1/2} ⟨N⟩^{1/2},
(iv) ⟨M^τ, N⟩ = ⟨M^τ, N^τ⟩ = ⟨M, N⟩^τ for any optional time τ,
(v) ⟨M^n⟩_∞ →^P 0 ⇒ (M^n − M^n_0)^* →^P 0.

Proof, (i)−(iv): Lemma 10.11 shows that ⟨M, N⟩ is the a.s. unique predictable process of locally integrable variation starting at 0, such that MN − ⟨M, N⟩ is a local martingale. The symmetry and bilinearity follow immediately, as does the last property in (i), since MN_0, M_0N, and M_0N_0 are all local martingales. Moreover, we get (iii) as in Proposition 18.9, and (iv) as in Theorem 18.5.
(v) Here we may take M^n_0 = 0 for all n. Let ⟨M^n⟩_∞ →^P 0, fix any ε > 0, and define τ_n = inf{t; ⟨M^n⟩_t ≥ ε}. Since ⟨M^n⟩ is predictable, even τ_n is predictable by Theorem 10.14, say with announcing sequence τ_{nk} ↑ τ_n. The latter may be chosen such that M^n becomes an L²-martingale and (M^n)² − ⟨M^n⟩ a uniformly integrable martingale on every interval [0, τ_{nk}]. Then Proposition 9.17 yields
E(M^n)^{*2}_{τ_{nk}} ≲ E(M^n)²_{τ_{nk}} = E⟨M^n⟩_{τ_{nk}} ≤ ε,
and as k → ∞ we get E(M^n)^{*2}_{τ_n−} ≲ ε. Now fix any δ > 0, and write
P{ (M^n)^{*2} > δ } ≤ P{τ_n < ∞} + δ^{−1} E(M^n)^{*2}_{τ_n−} ≲ P{ ⟨M^n⟩_∞ ≥ ε } + δ^{−1} ε.
Here the right-hand side tends to zero as n → ∞ and then ε → 0.
¹ The predictable covariation ⟨·, ·⟩ must not be confused with the inner product appearing in Chapters 14, 21, and 35.

2

The predictable quadratic variation can be used to extend the Itˆo integral from Chapter 18. Then let E be the class of bounded, predictable step processes V with jumps at finitely many fixed times. The corresponding integral V · X is referred to as an elementary predictable integral. Given any M ∈ M2loc , let L2 (M ) be the class of predictable processes V , such that (V 2 · M )t < ∞ a.s. for all t > 0. Assuming M ∈ M2loc and V ∈ L2 (M ), we may form an integral process V · M belonging to M20,loc . In the following statement, we assume that M, N, Mn ∈ M2loc , and let U, V, Vn be predictable processes such that the appropriate integrals exist. Theorem 20.2 (L2 -martingale integral, Courr`ege, Kunita & Watanabe) The elementary predictable integral extends a.s. uniquely to a bilinear map of any M ∈ M2loc and V ∈ L2 (M ) into a process V · M ∈ M20,loc , such that a.s. 

(i) (V_n² · ⟨M_n⟩)_t →^P 0 ⇒ (V_n · M_n)^*_t →^P 0, t > 0,
(ii) ⟨V · M, N⟩ = V · ⟨M, N⟩, N ∈ M²_loc,
(iii) U · (V · M) = (UV) · M,
(iv) Δ(V · M) = V (ΔM),
(v) (V · M)^τ = V · M^τ = (V 1_{[0,τ]}) · M for any optional time τ.
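For orientation (a special case added here): if M = B is a real Brownian motion, then ⟨B⟩_t = t, so L²(B) consists of the predictable processes V with ∫_0^t V_s² ds < ∞ a.s. for all t > 0, and V · B is the Itô integral of Chapter 18; property (ii) with N = B then reads ⟨V · B, B⟩_t = (V · ⟨B⟩)_t = ∫_0^t V_s ds.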

The integral V · M is a.s. uniquely determined by (ii). The proof depends on an elementary approximation of predictable processes, similar to Lemma 18.23 in the continuous case. Lemma 20.3 (approximation) Let V be a predictable process with |V |p ∈ L(A), where A is increasing and p ≥ 1. Then there exist some V1 , V2 , . . . ∈ E with   |Vn − V |p · A t → 0 a.s., t > 0. P

Proof: It is enough to establish the approximation (|Vn − V |p · A)t → 0. By Minkowski’s inequality we may then approximate in steps, and by dominated convergence we may first take V to be simple. Then each term can be approximated separately, and so we may next choose V = 1B for a predictable set B. Approximating separately on disjoint intervals, we may finally assume that B ⊂ Ω × [0, t] for some t > 0. The desired approximation then follows from Lemma 10.2 by a monotone-class argument. 2


Proof of Theorem 20.2: As in Theorem 18.11, we may construct the integral V · M as the a.s. unique process in M²_{0,loc} satisfying (ii). The mapping (V, M) → V · M is clearly bilinear and extends the elementary predictable integral, just as in Lemma 18.10. Properties (iii) and (v) may be proved as in Propositions 18.14 and 18.15. The stated continuity follows immediately from (ii) and Proposition 20.1 (v). To prove the asserted uniqueness, it is then enough to apply Lemma 20.3 with A = ⟨M⟩ and p = 2. To prove (iv), we may apply Lemma 20.3 with
A_t = ⟨M⟩_t + Σ_{s≤t} (ΔM_s)²,    t ≥ 0,

to ensure the existence of some processes Vn ∈ E, such that a.s. 

V_n(ΔM) → V(ΔM),    (V_n · M − V · M)^* → 0.

In particular, Δ(V_n · M) → Δ(V · M) a.s., and (iv) follows from the corresponding relation for the elementary integrals V_n · M, once we have shown that Σ_{s≤t} (ΔM_s)² < ∞ a.s. For the latter condition, we may take M ∈ M²_0 and define t_{n,k} = kt 2^{−n} for k ≤ 2^n. Then Fatou's lemma yields
E Σ_{s≤t} (ΔM_s)² ≤ E lim inf_{n→∞} Σ_k ( M_{t_{n,k}} − M_{t_{n,k−1}} )²
≤ lim inf_{n→∞} E Σ_k ( M_{t_{n,k}} − M_{t_{n,k−1}} )²
= E M_t² < ∞.
2

A semi-martingale is defined as a right-continuous, adapted process X, admitting a decomposition M + A into a local martingale M and a process A of locally finite variation starting at 0. If the variation of A is even locally ˆ A, ˆ where Aˆ denotes the compensator integrable, we can write X = (M +A−A)+ of A. We may then choose A to be predictable, in which case the decomposition is a.s. unique by Propositions 18.2 and 10.16, and X is called a special semimartingale with canonical decomposition M + A. The L´evy processes constitute basic examples of semi-martingales, and we note that a L´evy process is a special semi-martingale iff its L´evy measure ν satisfies (x2 ∧ |x|) ν(dx) < ∞. From Theorem 10.5 we further see that any local sub-martingale is a special semi-martingale. We may extend the stochastic integration to general semi-martingales. For the moment we consider only locally bounded integrands, which covers most applications of interest. Theorem 20.4 (semi-martingale integral, Dol´eans-Dade & Meyer) The L2 integral of Theorem 20.2 and the ordinary Lebesgue–Stieltjes integral extend a.s. uniquely to a bilinear map of any semi-martingale X and locally bounded, predictable process V into a semi-martingale V · X. The integration map also satisfies


(i) properties (iii)−(v) of Theorem 20.2,
(ii) V ≥ |V_n| → 0 ⇒ (V_n · X)^*_t →^P 0, t > 0,

(iii) when X is a local martingale, so is V · X.
Our proof relies on a basic truncation:
Lemma 20.5 (truncation, Doléans-Dade, Jacod & Mémin, Yan) Any local martingale M has a decomposition M′ + M″ into local martingales, such that
(i) M′ has locally integrable variation,
(ii) |ΔM″| ≤ 1 a.s.
Proof: Define
A_t = Σ_{s≤t} (ΔM_s) 1{ |ΔM_s| > ½ },    t ≥ 0.

By optional sampling, A has locally integrable variation. Let Â be the compensator of A, and put M′ = A − Â and M″ = M − M′. Then M′ and M″ are again local martingales, and M′ has locally integrable variation. Since also
|ΔM″| ≤ |ΔM − ΔA| + |ΔÂ| ≤ ½ + |ΔÂ|,
it remains to show that |ΔÂ| ≤ ½. Since the constructions of A and Â commute with optional stopping, we may take M and M′ to be uniformly integrable. Since Â is predictable, the times τ = n ∧ inf{t; |ΔÂ_t| > ½} are predictable by Theorem 10.14, and it suffices to show that |ΔÂ_τ| ≤ ½ a.s. Since clearly E(ΔM_τ | F_{τ−}) = E(ΔM′_τ | F_{τ−}) = 0 a.s., Lemma 10.3 yields
|ΔÂ_τ| = | E( ΔA_τ | F_{τ−} ) |
= | E( ΔM_τ; |ΔM_τ| > ½ | F_{τ−} ) |
= | E( ΔM_τ; |ΔM_τ| ≤ ½ | F_{τ−} ) | ≤ ½.

2
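To see what the truncation does in a simple case (an example added here, under its own assumptions): let M_t = Σ_{k≤N_t} ξ_k − t Eξ be a compensated compound Poisson martingale, where N is a unit-rate Poisson process and the ξ_k are i.i.d. with E|ξ| < ∞. Then A_t = Σ_{k≤N_t} ξ_k 1{|ξ_k| > ½} has compensator Â_t = t E[ξ; |ξ| > ½], the process M′ = A − Â collects the compensated large jumps and has locally integrable variation, while M″ = M − M′ has jumps ξ_k 1{|ξ_k| ≤ ½}, hence |ΔM″| ≤ ½.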

Proof of Theorem 20.4: By Lemma 20.5 we may write X = M + A, where M is a local martingale with bounded jumps, hence a local L2 -martingale, and A has locally finite variation. For any locally bounded, predictable process V , we may then define V · X = V · M + V · A, where the first term is the integral in Theorem 20.2 and the second term is an ordinary Lebesgue–Stieltjes integral. If V ≥ |Vn | → 0, we get by dominated convergence 

(V_n² · ⟨M⟩)_t + (V_n · A)^*_t → 0,

and so by Theorem 20.2 we have (Vn · X)∗t → 0 for all t > 0. For the uniqueness it suffices to prove that, if M = A is a local L2 martingale of locally finite variation, then V · M = V · A a.s. for every locally bounded, predictable process V , where V · M is the integral in Theorem 20.2 P


and V · A is an ordinary Stieltjes integral. The two integrals clearly agree when V ∈ E. For general V , we may approximate as in Lemma 20.3 by processes Vn ∈ E, such that a.s.

{ (V_n − V)² · ⟨M⟩ + |V_n − V| · A }^* → 0.
Then for any t > 0,

(Vn · M )t → (V · M )t , (Vn · A)t → (V · A)t , and the desired equality follows. (iii) By Lemma 20.5 and a suitable localization, we may assume that V is bounded and X has integrable variation A. By Lemma 20.3 we may next choose some uniformly bounded processes V1 , V2 , . . . ∈ E, such that (|Vn − V | · A)t → 0 a.s. for every t ≥ 0. Then (Vn · X)t → (V · X)t a.s. for all t, which extends to L1 by dominated convergence. Thus, the martingale property of Vn · X carries over to V · X. 2 For any semi-martingales X, Y , the left-continuous versions (X− )t = Xt− and (Y− )t = Yt− are locally bounded and predictable, and hence may serve as integrands in general stochastic integrals. We may then define2 the quadratic variation [X] and covariation [X, Y ] by the integration-by-parts formulas [X] = X 2 − X02 − 2 X− · X, [X, Y ] = X Y − X0 Y0 − X− · Y − Y− · X 



= ¼ { [X + Y] − [X − Y] },    (1)

so that in particular [X] = [X, X]. We list some further properties of the covariation. Theorem 20.6 (covariation) For any semi-martingales X, Y , we have a.s. (i) [X, Y ] is symmetric, bilinear with [X, Y ] = [X − X0 , Y ], (ii) [X] = [X, X] is non-decreasing, (iii) | [X, Y ] | ≤



∫ |d[X, Y]| ≤ [X]^{1/2} [Y]^{1/2},

(iv) Δ[X] = (ΔX)2 , Δ[X, Y ] = (ΔX)(ΔY ), (v) [V · X, Y ] = V · [X, Y ] for locally bounded, predictable V , (vi) [X τ , Y ] = [X τ , Y τ ] = [X, Y ]τ for any optional time τ , (vii) M, N ∈ M2loc ⇒ [M, N ] has compensator M, N , (viii) [X, A]t =

Σ_{s≤t} (ΔX_s)(ΔA_s) when A has locally finite variation.

2 Both covariation processes [X, Y ] and X, Y remain important, as is evident from, e.g., Theorems 20.9, 20.16, and 20.17.
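As a simple illustration (a worked example added here): for a standard Poisson process N we have N_0 = 0 and ΔN ∈ {0, 1}, so
N_t² = Σ_{s≤t} { (N_{s−} + ΔN_s)² − N_{s−}² } = 2 (N_− · N)_t + Σ_{s≤t} (ΔN_s)²,
and the defining formula (1) gives [N]_t = N_t² − 2 (N_− · N)_t = Σ_{s≤t} (ΔN_s)² = N_t, in agreement with (iv) and (viii).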


Proof: (i) The symmetry and bilinearity of [X, Y ] are obvious from (1), and the last relation holds since [X, Y0 ] = 0. (ii) Proposition 18.17 extends with the same proof to general semi-martingales. In particular, [X]s ≤ [X]t a.s. for all s ≤ t. By right-continuity, we can choose the exceptional null set to be independent of s and t, which means that [X] is a.s. non-decreasing. (iii) Proceed as in case of Proposition 18.9. (iv) By (1) and Theorem 20.2 (iv), Δ[X, Y ]t = Δ(X Y )t − Δ(X− · Y )t − Δ(Y− · X)t = Xt Yt − Xt− Yt− − Xt− (ΔYt ) − Yt− (ΔXt ) = (ΔXt )(ΔYt ). (v) For V ∈ E the relation follows most easily from the extended version in Proposition 18.17. Also note that both sides are a.s. linear in V . Now let V, V1 , V2 , . . . be locally bounded and predictable with V ≥ |Vn | → 0. Then Vn · [X, Y ] → 0 by dominated convergence, and Theorem 20.4 yields P

[Vn · X, Y ] = (Vn · X) Y − (Vn · X)− · Y − (Vn Y− ) · X → 0. A monotone-class argument yields an extension to arbitrary V . (vi) Use (v) with V = 1[0,τ ] . (vii) Since M− · N and N− · M are local martingales, the assertion follows from (1) and the definition of M, N . (viii) For step processes A, the stated relation follows from the extended version in Proposition 18.17. If instead ΔA ≤ ε, we conclude from the same result and property (iii), along with the ordinary Cauchy inequality, that  


[X, A]²_t ∨ | Σ_{s≤t} (ΔX_s)(ΔA_s) |² ≤ [X]_t [A]_t ≤ ε [X]_t ∫_0^t |dA_s|.
The assertion now follows by a simple approximation.    2

We may now extend Itˆo’s formula in Theorem 18.18 to a substitution rule for general semi-martingales. By a semi-martingale in Rd we mean a process X = (X 1 , . . . , X d ) composed by one-dimensional semi-martingales X i . Let [X i , X j ]c denote the continuous components of the finite-variation processes [X i , X j ], and write ∂ i f and ∂ ij2 f for the first and second order partial derivatives of f , respectively.


Theorem 20.7 (substitution rule, Kunita & Watanabe) For any semi-martingale X = (X 1 , . . . , X d ) in Rd and function f ∈ C 2 (Rd ), we have 3

f(X_t) = f(X_0) + ∫_0^t ∂_i f(X_{s−}) dX^i_s + ½ ∫_0^t ∂²_{ij} f(X_{s−}) d[X^i, X^j]^c_s
+ Σ_{s≤t} { Δf(X_s) − ∂_i f(X_{s−}) ΔX^i_s }.    (2)
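As a quick consistency check (a worked special case, not part of the original argument): for d = 1 and f(x) = x², formula (2) reads
X_t² = X_0² + 2 ∫_0^t X_{s−} dX_s + [X]^c_t + Σ_{s≤t} (ΔX_s)²,
since Δ(X_s²) − 2X_{s−} ΔX_s = (ΔX_s)². Comparing with the defining formula (1) for [X], this recovers the decomposition [X]_t = [X]^c_t + Σ_{s≤t} (ΔX_s)², consistent with Theorem 20.6 (iv).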

Proof: Let f ∈ C 2 (Rd ) satisfy (2), where the last term has locally finite variation. We show that (2) remains true for the functions gk (x) = xk f (x), k = 1, . . . , n. Then note that by (1), gk (X) = gk (X0 ) + X−k · f (X) + f (X− ) · X k + [X k , f (X)].

(3)

Writing fˆ(x, y) = f (x) − f (y) − ∂i f (y)(xi − yi ), we get by (2) and Theorem 20.2 (iii) X−k · f (X) = X−k ∂i f (X− ) · X i + 12 X−k ∂ij2 f (X− ) · [X i , X j ]c k ˆ + Σs Xs− (4) f (Xs , Xs− ). Next we see from (ii), (iv), (v), and (viii) of Theorem 20.6 that 

X k , f (X) = ∂i f (X− ) · [X k , X i ] + Σs (ΔXsk ) fˆ(Xs , Xs− )

= ∂i f (X− ) · [X k , X i ]c + Σs (ΔXsk ){Δf (Xs )}.

(5)

Inserting (4)−(5) into (3) and using the elementary formulas ∂i gk (x) = δik f (x) + xk ∂i f (x), ∂ij2 gk (x) = δik ∂j f (x) + δjk ∂i f (x) + xk ∂ij2 f (x),

gˆk (x, y) = (xk − yk ) f (x) − f (y) + yk fˆ(x, y), followed by some simplification, we obtain the desired expression for gk (X). Formula (2) is trivially true for constant functions, and it extends by induction and linearity to arbitrary polynomials. By Weierstrass’ theorem, a general function f ∈ C 2 (Rd ) can be approximated by polynomials, such that all derivatives up to second order tend to those of f , uniformly on every compact set. To prove (2) for f , it is then enough to show that the right-hand side tends to zero in probability, as f and its first and second order derivatives tend to zero, uniformly on compact sets. For the integrals in (2), this holds by the dominated convergence property in Theorem 20.4, and it remains to consider the last term. Writing Bt = {x ∈ Rd ; |x| ≤ Xt∗ } and g B = supB |g|, we get by Taylor’s formula in Rd 



Σ fˆ(Xs, Xs−) s≤t

<



3





 2  ∂ ij f B Σ |ΔXs |2 Σ i,j s≤t t

Σ

   2  i,j ∂ ij f B

t

Σ i [X i]t → 0.

As always, summation over repeated indices is understood.



The same estimate shows that the last term has locally finite variation.

2

To illustrate the use of the general substitution rule, we consider a partial extension of Lemma 19.22 and Proposition 32.2 to general semi-martingales. Theorem 20.8 (Dol´eans exponential) For a semi-martingale X with X0 = 0, the equation Z = 1 + Z− · X has the a.s. unique solution 

Z_t = E(X)_t ≡ exp( X_t − ½ [X]^c_t ) Π_{s≤t} (1 + ΔX_s) e^{−ΔX_s},    t ≥ 0.    (6)

The infinite product in (6) is a.s. absolutely convergent, since Σs≤t (ΔXs )2 ≤ [X]t < ∞. However, we may have ΔXs = −1 for some s > 0, in which case Z = 0 for t ≥ s. The process E(X) in (6) is called the Dol´eans exponential of X. For continuous X we get E(X) = exp(X − 12 [X]), conforming with the notation of Lemma 19.22. For processes A of locally finite variation, formula (6) simplifies to E(A) = exp(Act ) Π (1 + ΔAs ), t ≥ 0. s≤t
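For example (an illustration added here): if X_t = N_t − t for a unit-rate Poisson process N, then [X]^c = 0 and ΔX_s ∈ {0, 1}, so (6) gives
E(X)_t = e^{N_t − t} Π_{s≤t} (1 + ΔX_s) e^{−ΔX_s} = e^{N_t − t} · 2^{N_t} e^{−N_t} = 2^{N_t} e^{−t},
the familiar exponential martingale of N, which indeed satisfies Z = 1 + Z_− · X.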

Proof of Theorem 20.8: To check that (6) is a solution, we may write Z = eY V , where Yt = Xt − 12 [X]ct , Vt =

Π (1 + ΔXs) e−ΔX . s

s≤t

Then Theorem 20.7 yields Zt − 1 = (Z− · Y )t + (eY− · V )t + 12 (Z− · [X]c )t +

Σ s≤t





ΔZs − Zs− ΔXs − eYs− ΔVs .

(7)

Now eY− · V = Σ eY− ΔV since V is of pure jump type, and furthermore ΔZ = Z− ΔX. Hence, the right-hand side of (7) simplifies to Z− · X, as desired. To prove the uniqueness, let Z be an arbitrary solution, and put V = Ze−Y , where Y = X − 12 [X]c as before. Then Theorem 20.7 gives Vt − 1 = (e−Y− · Z)t − (V− · Y )t + 12 (V− · [X]c )t − (e−Y− · [X, Z]c )t +

Σ s≤t



ΔVs + Vs− ΔYs − e−Ys− ΔZs



= (V− · X)t − (V− · X)t + 12 (V− · [X]c )t + 12 (V− · [X]c )t − (V− · [X]c )t + =

Σ s≤t



ΔVs + Vs− ΔXs − Vs− ΔXs



Σ ΔVs.

s≤t

Thus, V is purely discontinuous of locally finite variation. We may further compute ΔVt = Zt e−Yt − Zt− e−Yt− = (Zt− + ΔZt ) e−Yt− −ΔYt − Z t− e−Yt− = Vt− (1 + ΔXt ) e−ΔXt − 1 ,



which shows that Vt = 1 + (V− · A)t , At =

Σ s≤t





(1 + ΔXt ) e−ΔXt − 1 .

It remains to show that the homogeneous equation V = V− · A has the  unique solution V = 0. Then define R t = (0,t] |dA|, and conclude from Theorem 20.7 and the convexity of the function x → xn that R tn = n (R−n−1 · R)t + ≥ n (R−n−1 · R)t .

Σ s≤t



n−1 ΔRsn − n Rs− ΔRs



(8)

We prove by induction that Vt∗ ≤

Vt∗ R tn , n!

t ≥ 0, n ∈ Z+ .

(9)

This is obvious for n = 0, and if it holds for n − 1, then by (8) Vt∗ = (V− · A)∗t V ∗ (R n−1 · R)t V ∗R n ≤ t t , ≤ t − (n − 1)! n! as required. Since R tn /n! → 0 as n → ∞, relation (9) yields Vt∗ = 0 for all t > 0. 2 The equation Z = 1 + Z− · X arises naturally in connection with changes of probability measure. Here we extend Theorem 19.20 to general local martingales. Theorem 20.9 (change of measure, van Schuppen & Wong) Consider some probability measures P , Q and a local martingale Z ≥ 0, such that Q = Zt · P on Ft for all t ≥ 0, and let M be a local P -martingale such that [M, Z] has locally integrable variation and P -compensator M, Z . Then we have the local Q-martingale ˜ = M − Z −1 · M, Z . M − A lemma is needed for the proof. Lemma 20.10 (integration by parts) For any semi-martingale X with X0 = 0 and predictable process A of locally finite variation, we have A X = A · X + X− · A a.s. Proof: We need to show that ΔA · X = [A, X] a.s., which is equivalent to

(0,t]

(ΔAs ) dXs =

Σ (ΔAs)(ΔXs),

s≤t

t ≥ 0,

by Theorem 20.6 (viii). Since the series on the right is absolutely convergent by Cauchy’s inequality, we may reduce, by dominated convergence on each side,



to the case where A is constant apart from finitely many jumps. By Lemma 10.3 and Theorem 10.14, we may proceed to the case where A has at most one jump occurring at a predictable time τ . Introducing an announcing sequence (τn ) and writing Y = ΔA · X, we get by Theorem 20.2 (v) Yτn ∧t = 0 = Yt − Yt∧τ a.s.,

t ≥ 0, n ∈ N.

Thus, even Y is constant, apart from a possible jump at τ . Finally, Theorem 2 20.2 (iv) yields ΔYτ = (ΔAτ )(ΔXτ ) a.s. on {τ < ∞}. Proof of Theorem 20.9: Let τn = inf{t; Zt < 1/n} for n ∈ N, and note ˜ is well defined under Q, and that τn → ∞ a.s. Q by Lemma 19.18. Hence, M ˜ it suffices as in Lemma 19.18 to show that (M Z)τn is a local P -martingale m for every n. Writing = for equality up to a local P -martingale, we see from Lemma 20.10 with X = Z and A = Z−−1 · M, Z that, on every interval [0, τn ], m

m

M Z = [M, Z] = M, Z m = Z− · A = AZ. m ˜ Z = (M − A) Z = Thus, M 0, as required.

2

Using the last theorem, we may show that the class of semi-martingales is closed under absolutely continuous changes of probability measure. A special case of this result was previously obtained as part of Proposition 19.21. Corollary 20.11 (preservation law, Jacod) If Q  P on Ft for all t > 0, then every P -semi-martingale is also a Q-semi-martingale. Proof: Let Q = Zt · P on Ft for all t ≥ 0. We need to show that every local P -martingale M is a semi-martingale under Q. By Lemma 20.5, we may then take ΔM to be bounded, so that [M ] is locally bounded. By Theorem 20.9 it suffices to show that [M, Z] has locally integrable variation, and by Theorem 20.6 (iii) it is then enough to prove that [Z]1/2 is locally integrable. By Theorem 20.6 (iv), 1/2

[Z]t

1/2

≤ [Z]t− + |ΔZt | ∗ + |Zt |, ≤ [Z]t− + Zt− 1/2

t ≥ 0,

and so the desired integrability follows by optional sampling.

2

Next we extend the BDG inequalities of Theorem 18.7 to general local martingales. Such extensions are possible only for exponents p ≥ 1. Theorem 20.12 (norm comparison, Burkholder, Davis, Gundy) For any local martingale M with M0 = 0, we have 4 E M ∗p ) E [M ]p/2 ∞ ,

p ≥ 1.

(10)

4 Here M ∗ = supt |Mt |, and f ) g means that f ≤ cg and g ≤ cf for some constant c > 0. The domination constants are understood to depend only on p.



In particular, we conclude as in Corollary 18.8 that M is a uniformly integrable martingale whenever E[M ]1/2 ∞ < ∞. Proof for p = 1 (Davis): Exploiting the symmetry of the argument, we write M  and M  for the processes M ∗ and [M ]1/2 , taken in either order. Putting J = ΔM , we define At =

Σ Js 1 s≤t





∗ |Js | > 2Js− ,

t ≥ 0.

Since |ΔA| ≤ 2ΔJ ∗ , we have

∞ 0

Σs|ΔAs|

|dAs | =

 . ≤ 2J ∗ ≤ 4M∞

ˆ we get Writing Aˆ for the compensator of A and putting D = A − A,   ∨ ED∞ ≤ E ED∞

< E





0∞ 0

|dDs |  |dAs | < EM∞ .

(11)

To get a similar estimate for N = M − D, we introduce the optional times



τr = inf t; Nt ∨ Jt∗ > r ,

r > 0,

and note that

  P {N∞ > r} ≤ P {τr < ∞} + P τr = ∞, N∞ >r



 ≤ P {N∞ > r} + P {J ∗ > r} + P {Nτr > r}.

(12)

Arguing as in the proof of Lemma 20.5, we get |ΔN | ≤ 4J−∗ , and so 

 ∧ Nτr − + 4Jτ∗r − Nτr ≤ N∞  ≤ N∞ ∧ 5r.



Since N 2 − [N ] is a local martingale, we get by Chebyshev’s inequality or Proposition 9.16, respectively, r2 P {Nτr > r} < ENτ2r

 < E(N∞ ∧ r)2 .

Hence, by Fubini’s theorem and calculus, 0



P {Nτr > r}dr <





0

  E(N∞ ∧ r)2 r−2 dr < EN∞ .

Combining this with (11)−(12) and using Lemma 4.4, we get  EN∞ =



0



0

∞ ∞

 P {N∞ > r}dr





 P {N∞ > r} + P {J ∗ > r} + P {Nτr > r} dr

  < EN∞ + EJ ∗ < EM∞ .





   It remains to note that EM∞ ≤ ED∞ + EN∞ .

2

Extension to p > 1 (Garsia): For any t ≥ 0 and B ∈ Ft , we may apply (10) with p = 1 to the local martingale 1B (M − M t ) to get a.s.  





 



t 1/2 t ∗ c−1 1 E [M − M ]∞  Ft ≤ E (M − M )∞  Ft



 





≤ c1 E [M − M t ]1/2 ∞  Ft .

Since

1/2

[M ]1/2 ∞ − [M ]t

≤ [M − M t ]1/2 ∞ ≤ [M ]1/2 , ∞

∗ M∞ − Mt∗ ≤ (M − M t )∗∞ ∗ ≤ 2M∞ , 1/2 E(ζ | Ft ) with At = [M ]t and Proposition 10.20 yields E(A∞ − At | Ft ) <

ζ = M ∗ , and also with At = Mt∗ and ζ = [M ]1/2 ∞ . Since

ΔMt∗ ≤ Δ[M ]t = |ΔMt |

1/2

≤ [M ]t ∧ 2Mt∗ , 1/2

E(ζ | Fτ ) a.s. for any optional time τ , and so we have in both cases ΔAτ <

the cited condition remains fulfilled for the left-continuous version A− . Hence, ζ p for every p ≥ 1, and (10) follows. 2 Proposition 10.20 yields A∞ p <

The last theorem allows us to extend the stochastic integral to a larger class of integrands. Then write M for the space of local martingales, and let M0 be the subclass of processes M with M0 = 0. For any M ∈ M, let L(M ) be the class of predictable processes V such that (V 2 · [M ])1/2 is locally integrable. Theorem 20.13 (martingale integral, Meyer) The elementary predictable integral extends a.s. uniquely to a bilinear map of any M ∈ M and V ∈ L(M ) into a process V · M ∈ M0 , satisfying P P (i) |Vn | ≤ V in L(M ) with (Vn2 · [M ])t → 0 ⇒ (Vn · M )∗t → 0, (ii) properties (iii)−(v) of Theorem 20.2, (iii) [V · M, N ] = V · [M, N ] a.s., N ∈ M. The integral V · M is a.s. uniquely determined by (iii). Proof: To construct the integral, we may reduce by localization to the case where E(M − M0 )∗ < ∞ and E(V 2 · [M ])1/2 ∞ < ∞. For each n ∈ N, define Vn = V 1{|V | ≤ n}. Then Vn · M ∈ M0 by Theorem 20.4, and by Theorem 20.12 we have E(Vn · M )∗ < ∞. Using Theorems 20.6 (v) and 20.12, Minkowski’s inequality, and dominated convergence, we obtain 

E Vm · M − V n · M

∗

< E (Vm − Vn ) · M



1/2 ∞

= E (Vm − Vn )2 · [M ]

1/2 ∞

→ 0.



Hence, there exists a process V · M with E(Vn · M − V · M )∗ → 0, and we note that V · M ∈ M0 and E(V · M )∗ < ∞. To prove (iii), we note that the relation holds for each Vn by Theorem 20.6 (v). Since E[Vn · M − V · M ]1/2 ∞ → 0 by Theorem 20.12, we get by Theorem 20.6 (iii) for any N ∈ M and t ≥ 0    P  2  [Vn · M, N ]t − [V · M, N ]t  ≤ Vn · M − V · M t [N ]t → 0.

(13)

Next we see from Theorem 20.6 (iii) and (v) that

t t        Vn d[M, N ] = d[Vn · M, N ] 0

0

1/2

≤ [Vn · M ]t

1/2

[N ]t .

As n → ∞, we get by monotone convergence on the left and Minkowski’s inequality on the right t  1/2 1/2   V d[M, N ] ≤ [V · M ]t [N ]t < ∞. 0

Hence, Vn · [M, N ] → V · [M, N ] by dominated convergence, and (iii) follows by combination with (13). To see that V · M is determined by (iii), it remains to note that, if [M ] = 0 a.s. for some M ∈ M0 , then M ∗ = 0 a.s. by Theorem 20.12. To prove the stated continuity property, we may reduce by localization to the case where E(V 2 · 2 1/2 [M ])1/2 ∞ < ∞. But then E(Vn · [M ])∞ → 0 by dominated convergence, and ∗ Theorem 20.12 yields E(Vn · M ) → 0. To prove the uniqueness of the integral, it is enough to consider bounded integrands V . We may then approximate as in P Lemma 20.3 by uniformly bounded processes Vn ∈ E with {(Vn −V )2 ·[M ]}∞ → P 0, and conclude that (Vn · M − V · M )∗ → 0. Of the remaining properties in Theorem 20.2, assertion (iii) may be proved as before by means of (iii) above, whereas properties (iv)−(v) follow easily by truncation from the corresponding statements in Theorem 20.4. 2 A semi-martingale X = M + A is said to be purely discontinuous, if there exist some local martingales M 1 , M 2 , . . . of locally finite variation, such that E(M − M n )∗2 t → 0 for every t > 0. The property is clearly independent of the choice of decomposition X = M + A. To motivate the terminology, note that ˆ any martingale M of locally finite variation can be written as M = M0 +A− A, ˆ where At = Σs≤t ΔMs and A is the compensator of A. Thus, M − M0 is then a compensated sum of jumps. We proceed to establish a basic decomposition5 of a general semi-martingale X into continuous and purely discontinuous components, corresponding to the elementary decomposition of the quadratic variation [X] into a continuous part and a jump part. 5 Not to be confused with the decomposition of functions of bounded variation in Theorem 2.23 (iii). The distinction is crucial for part (ii) below, which involves both notions.



Theorem 20.14 (semi-martingale decomposition, Yoeurp, Meyer) For any real semi-martingale X, we have (i) X = X0 + X c + X d a.s. uniquely, for a continuous local martingale X c with X0c = 0 and a purely discontinuous semi-martingale X d , (ii) [X c ] = [X]c and [X d ] = [X]d a.s. Proof: (i) It is enough to consider the martingale component in an arbitrary decomposition X = X0 +M +A, and by Lemma 20.5 we may take M ∈ M20,loc . Then choose some optional times τn ↑ ∞ with τ0 = 0, such that M τn ∈ M20 for all n. It is enough to construct the desired decomposition for each process M τn − M τn−1 , which reduces the discussion to the case of M ∈ M20 . Now let C and D be the classes of continuous and purely discontinuous processes in M20 , and note that both are closed linear subspaces of the Hilbert space M20 . The desired decomposition will follow from Theorem 1.35, if we can show that D⊥ ⊂ C. First let M ∈ D⊥ . To see that M is continuous, fix any ε > 0, and put τ = inf{t; ΔMt > ε}. Define At = 1{τ ≤ t}, let Aˆ be the compensator of A, ˆ Integrating by parts and using Lemma 10.13 gives and put N = A − A. 1 2

E Aˆ2τ ≤ E



=E

Aˆ dAˆ Aˆ dA = E Aˆτ

= EAτ ≤ 1. Thus, N is L2 -bounded and hence lies in D. For any bounded martingale M  , we get  N∞ ) = E M  dN E(M∞

=E =E



(ΔM  ) dN (ΔM  ) dA



= E ΔMτ ; τ < ∞ , where the first equality is obtained as in the proof of Lemma 10.7, the second holds by the predictability of M− , and the third holds since Aˆ is predictable and hence natural. Letting M  → M in M2 , we obtain 0 = E(M N )  ∞ ∞  = E ΔMτ ; τ < ∞ ≥ ε P {τ < ∞}. Thus, ΔM ≤ ε a.s., and so ΔM ≤ 0 a.s. since ε is arbitrary. Similarly, ΔM ≥ 0 a.s., and the desired continuity follows. Next let M ∈ D and N ∈ C, and choose martingales M n → M of locally finite variation. By Theorem 20.6 (vi) and (vii) and optional sampling, we get for any optional time τ

0 = E[M^n, N]_τ = E(M^n_τ N_τ) → E(M_τ N_τ) = E[M, N]_τ,

and so [M, N ] is a martingale by Lemma 9.14. Since it is also continuous by (14), Proposition 18.2 yields [M, N ] = 0 a.s. In particular, E(M∞ N∞ ) = 0, which shows that C ⊥ D. The asserted uniqueness now follows easily. (ii) By Theorem 20.6 (iv), we have for any M ∈ M2 [M ]t = [M ]ct +

Σ (ΔMs)2,

s≤t

t ≥ 0.

(14)

If M ∈ D, we may choose some martingales of locally finite variation M n → M . By Theorem 20.6 (vii) and (viii), we get [M n ]c = 0 and E[M n − M ]∞ → 0. For any t ≥ 0, we get by Minkowski’s inequality and (14)    

Σ (ΔMsn)2

s≤t

1/2





Σ (ΔMs)2

s≤t

1/2   ≤ 

Σ s≤t



ΔMsn − ΔMs

2 1/2

1/2 P

≤ [M n − M ]t

→ 0,

  1/2 1/2  1/2 P   [M n ]t − [M ]t  ≤ [M n − M ]t → 0.

Taking limits in (14) for the martingales M n , we get the same formula for M without the term [M ]ct , which shows that [M ] = [M ]d . Now consider any M ∈ M2 . Using the strong orthogonality [M c , M d ] = 0, we get a.s. [M ]c + [M ]d = [M ] = [M c + M d ] = [M c ] + [M d ], which shows that even [M c ] = [M ]c a.s. By the same argument combined with Theorem 20.6 (viii), we obtain [X d ] = [X]d a.s. for any semi-martingale X. 2 The last result yields an explicit formula for the covariation of two semimartingales. Corollary 20.15 (decomposition of covariation) For any semi-martingales X, Y , we have (i) X c is the a.s. unique continuous local martingale M with M0 = 0, such that [X − M ] is a.s. purely discontinuous, (ii) [X, Y ]t = [X c , Y c ]t +

Σ (ΔXs)(ΔYs)

s≤t

a.s.,

t ≥ 0.

In particular, (V · X)c = V · X c a.s. for any semi-martingale X and locally bounded, predictable process V , and [X c , Y c ] = [X, Y ]c a.s. Proof: If M has the stated properties, then [(X − M )c ] = [X − M ]c = 0 a.s., and so (X − M )c = 0 a.s. Thus, X − M is purely discontinuous. Formula (ii) holds by Theorems 20.6 (iv) and 20.14 when X = Y , and the general result


2

follows by polarization.

The purely discontinuous component of a local martingale has a further decomposition, similar to the decompositions of optional times and increasing processes in Propositions 10.4 and 10.17. Corollary 20.16 (martingale decomposition, Yoeurp) For any purely discontinuous, local martingale M , we have (i) M = M0 + M q + M a a.s. uniquely, where M q , M a ∈ M0 are purely discontinuous, M q is ql-continuous, and M a has accessible jumps, (ii)





t; ΔMta = 0 ⊂ disjoint graphs,



n [τn ]

a.s. for some predictable times τ1 , τ2 , . . . with

(iii) [M q ] = [M ]q and [M a ] = [M ]a a.s., (iv) M q = M

c

and M a = M

d

a.s. when M ∈ M2loc .

Proof: Form a locally integrable process At =

Σ s≤t





(ΔMs )2 ∧ 1 ,

t ≥ 0,

ˆ and define with compensator A, M q = M − M0 − M a = 1 ΔAˆt = 0 · M. Then Theorem 20.4 yields M q , M a ∈ M0 and ΔM q = 1{ΔAˆ = 0}ΔM a.s., and M q and M a are purely discontinuous by Corollary 20.15. The proof may now be completed as in case of Proposition 10.17. 2 We illustrate the use of the previous decompositions by proving two exponential inequalities for martingales with bounded jumps. Theorem 20.17 (exponential inequalities, OK & Sztencel) Let M be a local martingale with M0 = 0, such that |ΔM | ≤ c for a constant c ≤ 1. Then (i) if [M ]∞ ≤ 1 a.s., we have  r2 exp − , r ≥ 0, P {M ∗ ≥ r} <

2(1 + rc) (ii) if M ∞ ≤ 1 a.s., we have  log (1 + rc) P {M ∗ ≥ r} < exp − r , r ≥ 0.

2c 2

For continuous martingales M , both bounds reduce to e−r /2 , which can also be seen directly by more elementary methods. For the proof of Theorem 20.17 we need two lemmas, beginning with a characterization of certain pure jump-type martingales.



Lemma 20.18 (accessible jump-type martingales) Let N be a pure jump-type process with integrable variation and accessible jumps. Then N is a martingale    iff  E ΔNτ  Fτ − = 0 a.s., τ < ∞ predictable. Proof: Proposition 10.17 yields some predictable times τ1 , τ2 , . . . with dis joint graphs such that {t > 0; ΔNt = 0} ⊂ n [τn ]. Under the stated condition, we get by Fubini’s theorem and Lemma 10.1, for any bounded optional time τ , 



τ ; τn ≤ τ Σ n E ΔN   

 = Σ n E E ΔNτ  Fτ − ; τn ≤ τ = 0,

ENτ =

n

n

n

and so N is a martingale by Lemma 9.14. Conversely, given any uniformly integrable martingale N and finite predictable time τ , we have a.s. E(Nτ | Fτ − ) = 2 Nτ − and hence E(ΔNτ | Fτ − ) = 0. Though for general martingales M the process Z = eM −[M ]/2 of Lemma 19.22 may fail to be a martingale, it can often be replaced by a similar supermartingale. Lemma 20.19 (exponential super-martingales) Let M be a local martingale with M0 = 0 and |ΔM | ≤ c < ∞ a.s., and define a = f (c) and b = g(c), where x + log (1 − x)+ , x2 Then we have the super-martingales f (x) = −

g(x) =



ex − 1 − x . x2



Xt = exp Mt − a [M ]t , 

Yt = exp Mt − b M



t

,

t ≥ 0.

Proof: In case of X, we may clearly assume that c < 1. By Theorem 20.7 we get, in an obvious shorthand notation, (X−−1 · X)t = Mt − (a − 12 ) [M ]ct +

Σ s≤t



2



eΔMs −a(ΔMs ) − 1 − ΔMs .

Here the first term on the right is a local martingale, and the second term is non-increasing since a ≥ 12 . To see that even the sum is non-increasing, we need to show that exp(x − ax2 ) ≤ 1 + x or f (−x) ≤ f (c) whenever |x| ≤ c, which is clear by a Taylor expansion of each side. Thus, (X−−1 ) · X is a local super-martingale, which remains true for X− · (X−−1 · X) = X since X > 0. By Fatou’s lemma, X is then a true super-martingale. In case of Y , Theorem 20.14 and Proposition 20.16 yield a decomposition M = M c + M q + M a , and Theorem 20.7 gives (Y−−1 · Y )t = Mt − b M

c t

+ 12 [M ]ct +



= Mt + b [M q ]t − M q

 t

Σ s≤t



eΔMs −b ΔM s − 1 − ΔMs

− (b − 12 ) [M ]ct





1 + ΔMs + b (ΔMs )2 Σ 1 + b ΔM s s≤t  1 + ΔMsa + b (ΔMsa )2 a +Σ − 1 − ΔM s . 1 + b ΔM a s s≤t

+

eΔMs −b ΔM s −



Here the first two terms on the right are martingales, and the third term is non-increasing since b ≥ 12 . Even the first sum of jumps is non-increasing, since ex − 1 − x ≤ b x2 for |x| ≤ c and ey ≤ 1 + y for y ≥ 0. The last sum clearly defines a purely discontinuous process N with locally finite variation and accessible jumps. Fixing any finite predictable time τ , and writing ξ = ΔMτ and η = ΔM τ , we note that   1 + ξ + b ξ2

E 

 

 

 

− 1 − ξ  ≤ E  1 + ξ + b ξ 2 − (1 + ξ)(1 + b η)    = b E  ξ 2 − (1 + ξ) η 

1 + bη

≤ b (2 + c) Eξ 2 .

Since E

Σ t (ΔMt)2 ≤ E[M ]∞ = EM



≤ 1,

we conclude that N has integrable total variation. Furthermore, Lemmas 10.3 and 20.18 show that a.s. E(ξ |Fτ − ) = 0 and  



E( ξ 2 | Fτ − ) = E Δ[M ]τ  Fτ −

= E( η |Fτ − ) = η. 



Thus,





 1 + ξ + b ξ2 − 1 − ξ  Fτ − = 0, 1 + bη and Lemma 20.18 shows that N is a martingale. The proof may now be completed as before. 2

E

Proof of Theorem 20.17: (i) Fix any u > 0, and conclude from Lemma 20.19 that the process



Xtu = exp uMt − u2 f (uc) [M ]t ,

t ≥ 0,

is a positive super-martingale. Since [M ] ≤ 1 and X0u = 1, we get for any r > 0

P sup t Mt > r









≤ P sup t Xtu > exp ur − u2 f (uc)



≤ exp −ur + u2 f (uc) .

(15)

The function F (x) = 2xf (x) is clearly continuous and strictly increasing from [0, 1) onto R+ . Since also F (x) ≤ x/(1 − x), we have F −1 (y) ≥ y/(1 + y). Choosing u = F −1 (rc)/c in (15) gives 

rF −1 (rc) P sup t Mt > r ≤ exp − 2c  r2 ≤ exp − . 2 (1 + rc)



It remains to combine with the same inequality for −M . (ii) The function G(x) = 2x g(x) is clearly continuous and strictly increasing onto R+ . Since also G(x) ≤ ex − 1, we have G−1 (y) ≥ log(1 + y). Then as before, 

r G−1 (rc) P sup t Mt > r ≤ exp − 2c  r log (1 + rc) ≤ exp − , 2c and the result follows. 2 By a quasi-martingale we mean an integrable, adapted, and right-continuous process X, such that 

sup π

 E  Xt Σ k≤n

 

k

− E(Xtk+1 |Ftk ) < ∞,

(16)

where π is the set of finite partitions of R+ of the form 0 = t0 < t1 < · · · < tn < ∞, and we take tn+1 = ∞ and X∞ = 0 in the last term. In particular, (16) holds when X is the sum of an L1 -bounded martingale and a process of integrable variation starting at 0. We show that this is close to the general situation. Here localization is defined in the usual way in terms of some optional times τn ↑ ∞. Theorem 20.20 (quasi-martingales, Rao) A quasi-martingale is the difference of two non-negative super-martingales. Thus, for a process X with X0 = 0, these conditions are equivalent: (i) X is a local quasi-martingale, (ii) X is a special semi-martingale. Proof: For any t ≥ 0, let Pt be the class of partitions π of the interval [t, ∞) of the form t = t0 < t1 < · · · < tn , and define ηπ± =

ΣE k≤n



Xtk − E(Xtk+1 |Ftk )

    Ft , ±

π ∈ Pt ,

where tn+1 = ∞ and X∞ = 0, as before. We show that ηπ+ and ηπ− are a.s. non-decreasing under refinements of π ∈ Pt . It is then enough to add one more division point u to π, say in the interval (tk , tk+1 ). Put α = Xtk − Xu and β = Xu − Xtk+1 . By sub-additivity and Jensen’s inequality, we get the desired relation 

 

E E α + β  Ftk



 

  ±  Ft ≤ E E(α | Ftk )± + E(β | Ftk )±  Ft 

≤ E E(α | Ftk )± + E(β | Fu )±  Ft .

± Now fix any t ≥ 0, and conclude from (16) that m± t ≡ supπ∈Pt Eηπ < ∞. ± ± −1 For every n ∈ N, we may then choose πn ∈ Pt with Eηπn > mt − n . The sequences (ηπ±n ) being Cauchy in L1 , they converge in L1 toward some limits



Yt± . Further note that E|ηπ± − Yt± | < n−1 whenever π is a refinement of πn . Thus, ηπ± → Yt± in L1 along the directed set Pt . Next fix any s < t, let π ∈ Pt be arbitrary, and define π  ∈ Ps by adding the point s to π. Then

Ys± ≥ ηπ± = Xs − E(Xt |Fs ) ≥ E(ηπ± | Fs ).



±

+ E(ηπ± | Fs )

Taking limits along Pt on the right, we get Ys± ≥ E(Yt± |Fs ) a.s., which means that the processes Y ± are super-martingales. By Theorem 9.28, the right-hand ± then exist outside a fixed null set, and the limits along the rationals Zt± = Yt+ ± processes Z are right-continuous super-martingales. For π ∈ Pt , we have 2 Xt = ηπ+ − ηπ− → Yt+ − Yt− , and so Zt+ − Zt− = Xt+ = Xt a.s. The following consequence is easy to believe but surprisingly hard to prove. Corollary 20.21 (non-random semi-martingales) Let X be a non-random process in Rd . Then these conditions are equivalent: (i) X is a semi-martingale, (ii) X is rcll of locally finite variation. Proof (OK): Clearly (ii) ⇒ (i). Now assume (i) with d = 1. Then X is trivially a quasi-martingale, and so by Theorem 20.20 it is a special semimartingale, hence X = M + A for a local martingale M and a predictable process A of locally finite variation. Since M = X − A is again predictable, it is continuous by Proposition 10.16. Then X and A have the same jumps, and so by subtraction we may take all processes to be continuous. Then [M ] = [X − A] = [X], which is non-random by Proposition 18.17, and we see as in Theorem 19.3 that M is a Brownian motion on the deterministic time scale [X], hence has independent increments. Then A = X − M has the same property, and since it is also continuous of locally finite variation, it is a.s. non-random by Proposition 14.4. Then even M = X − A is a.s. non-random, hence 0 a.s., and so X = A, which has locally finite variation. 2 Now we show that semi-martingales are the most general integrators for which a stochastic integral with reasonable continuity properties can be defined. As before, let E be the class of bounded, predictable step processes with jumps at finitely many fixed times. Theorem 20.22 (stochastic integrators, Bichteler, Dellacherie) A right-continuous, adapted process X is a semi-martingale iff, for any V1 , V2 , . . . ∈ E, Vn∗ ∞ → 0



P

(Vn · X)t → 0,

t > 0.

Our proof is based on three lemmas, beginning with the crucial functionalanalytic part of the argument.



Lemma 20.23 (convexity and tightness) For any convex, tight set K ⊂ L1 (P ), there exists a bounded random variable ρ > 0 with sup E(ρ ξ) < ∞. ξ∈K

Proof (Yan): Let B be the class of bounded, non-negative random variables,

and define C = γ ∈ B; sup E(γ ξ) < ∞ . ξ∈K

We claim that, for any γ1 , γ2 , . . . ∈ C, there exists a γ ∈ C with {γ > 0} =  n {γn > 0}. Indeed, we may assume that γn ≤ 1 and supξ∈K E(γn ξ) ≤ 1, in  which case we may choose γ = n 2−n γn . It is then easy to construct a ρ ∈ C, such that P {ρ > 0} = supγ∈C P {γ > 0}. Clearly, {γ > 0} ⊂ {ρ > 0} a.s.,

γ ∈ C,

(17)

since we could otherwise choose a ρ ∈ C with P {ρ > 0} > P {ρ > 0}. To see that ρ > 0 a.s., suppose that instead P {ρ = 0} > ε > 0. Since K is tight, we may choose r > 0 so large that P {ξ > r} ≤ ε for all ξ ∈ K. Then P {ξ − β > r} ≤ ε for all ξ ∈ K and β ∈ B. By Fatou’s lemma, we get P {ζ > r} ≤ ε for all ζ in the L1 -closure Z = K − B. In particular, the random variable ζ0 = 2r1{ρ = 0} lies outside Z. Now Z is convex and closed, and so, by a version of the Hahn–Banach theorem, there exists a γ ∈ (L1 )∗ = L∞ satisfying sup E(γ ξ) − inf E(γ β) ≤ sup E(γ ζ) β∈B

ξ∈K

ζ∈Z

< E(γ ζ0 ) = 2rE(γ; ρ = 0).

(18)

Here γ ≥ 0, since we would otherwise get a contradiction by choosing β = b 1{γ < 0} for large enough b > 0. Hence, (18) reduces to sup E(γ ξ) < 2rE(γ; ρ = 0), ξ∈K

which implies γ ∈ C and E(γ; ρ = 0) > 0. But this contradicts (17), proving that indeed ρ > 0 a.s. 2 Two further lemmas are needed for the proof of Theorem 20.22. Lemma 20.24 (tightness and boundedness) Let X be a right-continuous, adapted process, and let T be the class of optional times τ < ∞ taking finitely many values. Then {Xτ ; τ ∈ T } is tight ⇒ X ∗ < ∞ a.s. Proof: By Lemma 9.4, any bounded optional time τ can be approximated from the right by optional times τn ∈ T , so that Xτn → Xτ by right continuity. Hence, Fatou’s lemma yields







inf P |Xτn | > r , P |Xτ | > r ≤ lim n→∞



and so the hypothesis remains true with T replaced by the class Tˆ of bounded optional times. By Lemma 9.6, the times



τ t,n = t ∧ inf s; |Xs | > n ,

t > 0, n ∈ N,

belong to Tˆ , and as n → ∞ we get P {X ∗ > n} = sup P {Xt∗ > n} t>0





≤ sup P |Xτ | > n → 0.

2

τ ∈Tˆ

Lemma 20.25 (scaling) For any finite random variable ξ, there exists a bounded random variable ρ > 0 with E|ρ ξ| < ∞. Proof: We may take ρ = (|ξ| ∨ 1)−1 .

2

Proof of Theorem 20.22: The necessity is clear from Theorem 20.4. Now assume the stated condition. By Lemma 5.9, it is equivalent to assume for each t > 0 that the family Kt = {(V · X)t ; V ∈ E1 } is tight, where E1 = {V ∈ E; |V | ≤ 1}. The latter family is clearly convex, and by the linearity of the integral the convexity carries over to Kt . so Lemma 20.25 yields a By Lemma 20.24 we have X ∗ < ∞ a.s., and  probability measure Q ∼ P , such that EQ Xt∗ = Xt∗ dQ < ∞. In particular, Kt ⊂ L1 (Q), and we note that Kt remains tight with respect to Q. Hence, Lemma 20.23 yields a probability measure R ∼ Q with bounded density ρ > 0, such that Kt is bounded in L1 (R). For any partition 0 = t0 < t1 < · · · < tn = t, we note that 

Σ ER Xt k≤n where Vs =

 

k

− ER (Xtk+1 |Ftk )  = ER (V · X)t + ER |Xt |,

Σ sgn k 1, since it is isometrically isomorphic to the reflexive space Lp (ζ) × Lp (ζ, H). This yields a useful closure property: Lemma 21.6 (weak closure) Let ξ1 , ξ2 , . . . ∈ D1,2 with ξn → ξ in Lp (ζ) for a p > 1. Then these conditions are equivalent: (i) ξ ∈ D1,p , and D ξn → D ξ weakly in Lp (ζ, H), (ii) supn E D ξn p < ∞. Proof, (i) ⇒ (ii): Note that weak convergence in Lp implies Lp -boundedness, by the Banach-Steinhaus theorem. (ii) ⇒ (i): By (ii) the sequence (ξn ) is bounded in the reflexive space D1,p . Thus, (ξn ) is weakly relatively compact in D1,p , and hence converges weakly in D1,p along a subsequence, say to some η ∈ D1,p . Since also ξn → ξ in Lp (ζ), we have η = ξ a.s., and so ξ ∈ D1,p , and ξn → ξ weakly in D1,p along the entire 2 sequence. In particular, D ξn → D ξ weakly in Lp (ζ, H). So far we have regarded the derivative D as an operator from Lp (ζ) to L (ζ, H). It is straightforward to extend D to an operator from Lp (ζ, K) to Lp (ζ, H ⊗ K), for a possibly different separable Hilbert space K. Assuming as before that H = L2 (μ) and K = L2 (ν) for some σ-finite measures μ, ν on suitable spaces T, T  , we may identify H ⊗ K with L2 (μ ⊗ ν). For any random variables ξ1 , . . . , ξm ∈ D1,p and elements k1 , . . . , km ∈ K, we define p

D

D ξ i ⊗ ki . Σ ξ i ki = i Σ ≤m

i≤m

¯ with h ¯ = (h1 , . . . , hn ) In particular, we get for smooth variables ξi = fi (ζ h) D

¯ (hj ⊗ ki ). Σ fi(ζh) ki = i Σ Σ ∂j fi(ζ h) ≤m j ≤n

i≤m

As before, D extends to a closed operator from Lp (P ⊗ ν) to Lp (P ⊗ μ ⊗ ν). Iterating the construction, we may define the higher order derivatives Dm on Lp (ζ) by setting Dm+1 ξ = D(Dm ξ) with D1 = D. The associated domains Dm,p are defined recursively, and in particular Dm,2 is a Hilbert space with inner product m ! " ξ, η m,2 = E(ξ η)+ Σ E D i ξ, D i η H ⊗i . i=1



For first order derivatives, the chaos decomposition in Theorem 14.26 yields explicit expressions of the L2 -derivative D ξ and its domain D1,2 , in terms of multiple Wiener-Itˆo integrals. For elements in L2 (ζ, H), we write · and ·, · for the norm and inner product in H, so that the resulting quantities are random variables, not constants. This applies in particular to the norm Dξ . Theorem 21.7 (derivative of WI-integral) For an isonormal Gaussian process ζ on H = L2 (T, μ), let ξ = ζ nf with f ∈ L2 (μ⊗n ) symmetric. Then (i) ξ ∈ D1,2 with Dt ξ = n ζ n−1f (·, t), t ∈ T , (ii) E D ξ 2 = n ξ 2 = n(n!) f 2 . Proof: For any ONB h1 , h2 , . . . ∈ H, Theorem 14.25 yields an orthogonal expansion of ζ n f in terms of products ζ n ⊗ hj

⊗nj

j ≤m

=

Π pn (ζhj ),

j ≤m

(4)

j

where Σj≤m nj = n. Using the simple chain rule (2), for smooth functions of polynomial growth, together with the elementary fact that pn = n pn−1 , we get D ζ n ⊗ hj

⊗nj

j ≤m

=

pn (ζhj ), Σ ni pn −1(ζhi) hijΠ = i i

i≤m

j

which is equivalent to Dt ζ n f = n ζ n−1 f˜(t) for the symmetrization f˜ of the elementary integrand f in (4). This extends by linearity to any linear combination of such functions. Using Lemma 14.22, along with the orthogonality of the various terms and factors, we obtain  

E  D ζ n ⊗ hj

⊗nj

j ≤m

 2  =

=





  n2i pn −1 (ζhi ) 2 Π pn (ζhj )2 Σ i≤m j = i i

j

(nj )! Σ n2i (ni − 1)! jΠ = i

i≤m

=

(nj )! Σ nij Π ≤m

i≤m

=n





Π pn (ζhj )2 j ≤m j

= n ζ nf 2 , which extends by orthogonality to general linear combinations. This proves (i) and (ii) for such variables ξ, and the general result follows by the closure property of D. 2 Corollary 21.8 (chaos representation of D) For an isonormal Gaussian process ζ on a separable Hilbert space H, let ξ ∈ L2 (ζ) with chaos decomposition ξ = Σ n≥0 Jn ξ. Then (i) ξ ∈ D1,2 ⇔ in which case (ii) D ξ =

Σ n Jnξ 2 < ∞,

n>0

Σ D(Jnξ)

n>0

in L2 (ζ, H),

21. Malliavin Calculus 2 (iii) E D ξ =

D.




Σ E D(Jnξ) n>0

2

=

Σ n Jnξ 2.

n>0

Proof: Use Theorems 14.26 and 21.7, along with the closure property of 2

Corollary 21.9 (vanishing derivative) For any ξ ∈ D1,2 , D ξ = 0 a.s.



ξ = E ξ a.s.

Proof: If ξ = E ξ, then D ξ = 0 by (2). Conversely, D ξ = 0 yields Jn ξ = 0 2 for all n > 0 by Theorem 21.8 (iii), and so ξ = J0 ξ = E ξ. Corollary 21.10 (indicator variables) For any A ∈ σ(ζ), 1A ∈ D1,2



P A ∈ {0, 1}.

Proof: If P A ∈ {0, 1}, then ξ = 1A is a.s. a constant, and so ξ ∈ S ⊂ D1,2 . Conversely, let 1A ∈ D1,2 . Applying the chain rule to the function f (x) = x2 on [0, 1] yields D1A = D12A = 21A D1A , and so D1A = 0. Hence, Corollary 21.9 shows that 1A is a.s. a constant, which implies that either 1A = 0 a.s. or 2 1A = 1 a.s., meaning that P A ∈ {0, 1}. Corollary 21.11 (conditioning) Let ξ ∈ D1,2 , and fix any A ∈ T . Then E(ξ | 1A ζ) ∈ D1,2 and   





 



Dt E ξ  1A ζ = E Dt ξ  1A ζ 1A (t) a.e. Proof: Letting ξ = twice, we get   

Σ nζ nfn and using Corollary 14.28 and Theorem 21.7 

Dt E ξ  1A ζ = Dt

Σ n ζ n(f n 1⊗n A )



Σ n n ζ n−1 fn(·, t) 1⊗n A (·, t)

⊗(n−1) n−1 = Σn n ζ fn (·, t) 1A 1A (t)   

=

= E Dt ξ  1A ζ 1A (t).

In particular, the last series converges, so that E(ξ | 1A ζ) ∈ D1,2 .

2

The last result yields immediately the following: Corollary 21.12 (null criterion) Let ξ ∈ D1,2 be 1A ζ -measurable for an A ∈ T . Then Dt ξ = 0 a.e. on Ac × Ω. Corollary 21.13 (adapted processes) Let B be a Brownian motion generated by an isonormal Gaussian process ζ on L2 (R+ , λ), and consider a B-adapted process X ∈ D1,2 (H). Then Xs ∈ D1,2 for almost every s ≥ 0, and the process (s, t) → Dt Xs has a version such that, for fixed t ≥ 0,



(i) Dt Xs = 0 for all s ≤ t, (ii) Dt Xs is B-adapted in s ≥ t. Proof: (i) Use Corollary 21.12. (ii) Apply Theorems 21.7 and 21.8 to the chaos representation in Corollary 14.29. 2 Next we introduce the divergence operator 4 , defined as the adjoint D∗ : L (ζ, H) → L2 (ζ) of the Malliavin derivative D on D1,2 , in the sense of the duality relation ED ξ, Y =  ξ, D∗ Y , valid for all ξ ∈ D1,2 and suitable Y ∈ L2 (ζ, H), where the inner products are taken in L2 (ζ, H) on the left and in L2 (ζ) on the right. More precisely, the operator D∗ and its domain D∗ are given by: 2

Lemma 21.14 (existence of D∗ ) For an isonormal Gaussian process ζ on H and element Y ∈ L2 (ζ, H), these conditions are equivalent: (i) there exists an a.s. unique element D∗ Y ∈ L2 (ζ) with ED ξ, Y = E(ξD∗ Y ),

ξ ∈ D1,2 ,

(ii) there exists a constant c > 0 such that

   2 ED ξ, Y  ≤ c E|ξ|2 ,

ξ ∈ D1,2 .

Proof: By (i) and Cauchy’s inequality, we get (ii) with c2 = E|D∗ Y |2 . Now assume (ii). Then the map ϕ(ξ) = ED ξ, Y is a bounded linear functional on D1,2 , and since the latter is a dense subset of L2 (ζ), ϕ extends by continuity to all of L2 (ζ). Hence, the Riesz–Fischer theorem yields ϕ(ξ) = E( ξγ) for an 2 a.s. unique γ ∈ L2 (ζ), and (i) follows with D∗ Y = γ. For suitable elements X ∈ L2 (ζ, H), with chaos expansions as in Corollary 14.27 in terms of functions fn on T n+1 , symmetric in the first n arguments, we may now express the divergence D∗ X in terms of the full symmetrizations f˜n on T n+1 . Since fn (s, t) is already symmetric in s ∈ T n , we note that for all n ≥ 0,   (n + 1) f˜n (s, t) = fn (s, t) + Σ fn s1 , . . . , si−1 , t, si+1 , . . . , sn , si . i≤n

Theorem 21.15 (chaos representation of D*) For an isonormal Gaussian process ζ on H = L²(μ), let X ∈ L²(ζ, H) with X_t = Σ_{n≥0} ζ^{⊗n} f_n(·, t), where the f_n ∈ L²(μ^{n+1}) are symmetric in the first n arguments. Writing f̃_n for the full symmetrizations of f_n, we have

(i) X ∈ D*  ⇔  Σ_{n≥0} (n + 1)! ‖f̃_n‖² < ∞,

in which case

(ii) D*X = Σ_{n≥0} D*(J_n X) = Σ_{n≥0} ζ^{⊗(n+1)} f̃_n in L²(ζ),
(iii) ‖D*X‖² = Σ_{n≥0} ‖D*(J_n X)‖² = Σ_{n≥0} (n + 1)! ‖f̃_n‖².

⁴ also called the Skorohod integral, and often denoted by δ or by an integral sign

Proof: For an n ≥ 1, let η = ζ^{⊗n} g with symmetric g ∈ L²(μⁿ). Using Theorem 21.7 and the extension of Lemma 14.22, we get

    E⟨X, Dη⟩ = E ∫ X_t (Dη)_t μ(dt)
        = E ∫ X_t n ζ^{⊗(n−1)} g(·, t) μ(dt)
        = Σ_m n E ∫ ζ^{⊗m} f_m(·, t) ζ^{⊗(n−1)} g(·, t) μ(dt)
        = n ∫ E ζ^{⊗(n−1)} f_{n−1}(·, t) ζ^{⊗(n−1)} g(·, t) μ(dt)
        = n! ∫ ⟨ f_{n−1}(·, t), g(·, t) ⟩ μ(dt) = n! ⟨f_{n−1}, g⟩ = n! ⟨f̃_{n−1}, g⟩
        = E ( ζ^{⊗n} f̃_{n−1} )( ζ^{⊗n} g ) = E η ( ζ^{⊗n} f̃_{n−1} ).

Now let X ∈ D*. Comparing the above computation with Lemma 21.14 (i), we get

    E η ( ζ^{⊗n} f̃_{n−1} ) = E η (D*X),  η ∈ H_n, n ≥ 1,

and so J_n(D*X) = ζ^{⊗n} f̃_{n−1}, n ≥ 1, which remains true for n = 0, since E D*X = E⟨D1, X⟩ = 0 by Lemma 21.14 (i) and Corollary 21.9. Hence,

    D*X = Σ_{n>0} J_n(D*X) = Σ_{n>0} ζ^{⊗n} f̃_{n−1} = Σ_{n≥0} ζ^{⊗(n+1)} f̃_n,
    ‖D*X‖² = Σ_{n>0} ‖J_n(D*X)‖² = Σ_{n>0} ‖ζ^{⊗n} f̃_{n−1}‖² = Σ_{n>0} n! ‖f̃_{n−1}‖²,

which proves (ii)−(iii), along with the convergence of the series in (i). Conversely, suppose that the series in (i) converges, so that Σ_n ζ^{⊗n} f̃_{n−1} converges in L². By the previous calculation, we get for any η ∈ L²(ζ)

    E⟨ X, D(J_n η) ⟩ = E (J_n η)( ζ^{⊗n} f̃_{n−1} ) = E η ( ζ^{⊗n} f̃_{n−1} ).

If η ∈ D^{1,2}, we have Σ_n D(J_n η) = Dη in L²(ζ, H) by Corollary 21.8, and so

    E⟨X, Dη⟩ = E η Σ_{n>0} ζ^{⊗n} f̃_{n−1}.

Hence, by Cauchy's inequality,

    | E⟨X, Dη⟩ | ≤ ‖η‖ ‖ Σ_{n>0} ζ^{⊗n} f̃_{n−1} ‖ ≤ c ‖η‖,  η ∈ D^{1,2},

for a constant c > 0, and X ∈ D* follows by Lemma 21.14. □
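The D/D* duality behind Theorem 21.15 is easy to probe numerically in a toy case. For unit vectors h, k with ⟨h, k⟩ = ρ, take the (non-adapted) element X_t = (ζh) k(t), whose divergence is D*X = (ζh)(ζk) − ⟨h, k⟩ (cf. Lemma 21.17 below), and the test variable ξ = (ζh)² − 1. Both sides of E⟨Dξ, X⟩ = E(ξ D*X) then equal 2ρ. The following Monte Carlo sketch is only illustrative; the concrete choices and names are ours.

```python
# Monte Carlo sketch of the duality E<D xi, X> = E(xi D*X) for
#   X_t = (zeta h) k(t),  xi = (zeta h)^2 - 1,  <h, k> = rho.
import numpy as np

rng = np.random.default_rng(1)
rho = 0.6
z1 = rng.standard_normal(1_000_000)                              # zeta h
z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal(z1.size)  # zeta k

xi = z1**2 - 1
div_X = z1 * z2 - rho                      # D*X
lhs = np.mean(2 * z1**2 * rho)             # E <D xi, X> = E 2 (zeta h)^2 <h,k>
rhs = np.mean(xi * div_X)                  # E (xi D*X)

print(lhs, rhs, 2 * rho)                   # all three ~ 1.2
```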

Many results for D* require a slightly stronger condition. Let D̂^{1,2} be the class of processes X ∈ L²(P ⊗ μ), such that X_t ∈ D^{1,2} for almost every t, and the process D_s X_t on T² has a product-measurable version in L²(μ²). Note that D̂^{1,2} is a Hilbert space with norm

    ‖X‖²_{1,2} = ‖X‖²_{P⊗μ} + ‖DX‖²_{P⊗μ²},  X ∈ D̂^{1,2},

and hence is isomorphic to D^{1,2}(H).

Corollary 21.16 (sub-domain of D*) For an isonormal Gaussian process ζ on L²(μ), the class D̂^{1,2} is a proper subset of D*.

Partial proof: For the moment, we prove only the stated inclusion. Then let X ∈ L²(P ⊗ μ) with a representation X_t = Σ_n ζ^{⊗n} f_n(·, t), as in Lemma 14.27, where the functions f_n ∈ L²(μ^{n+1}) are symmetric in the first n arguments. If X ∈ D̂^{1,2}, then DX ∈ L², and we get by Lemma 14.22, along with the vector-valued versions of Theorem 21.7 and Corollary 21.8,

    E‖DX‖² = Σ_n E‖D(J_n X)‖² = Σ_n n (n!) ‖f_n‖²
        ≥ ½ Σ_n (n + 1)! ‖f_n‖²
        ≥ ½ Σ_n (n + 1)! ‖f̃_n‖²,

with summations over n ≥ 1. Hence, the series on the right converges, and so X ∈ D* by Theorem 21.15. □

Now let S_H be the class of smooth random elements in L²(ζ, H) of the form η = Σ_{i≤n} ξ_i h_i, where ξ₁, …, ξ_n ∈ S and h₁, …, h_n ∈ H.

Lemma 21.17 (divergence of smooth elements) The class S_H is contained in D*, and for any β = Σ_i ξ_i h_i in S_H,

    D*β = Σ_i ξ_i (ζh_i) − Σ_i ⟨Dξ_i, h_i⟩.

Proof: By linearity, we may take β = ξh for some ξ ∈ S and h ∈ H, which reduces the asserted formula to D*(ξh) = ξ(ζh) − ⟨Dξ, h⟩. By the definition of D*, we need to check that

    E ξ⟨Dη, h⟩ = E η( ξ(ζh) − ⟨Dξ, h⟩ ),  η ∈ D^{1,2}.

This holds by Lemma 21.3 (ii) for any η ∈ S, and the general result follows by approximation. □

The next result allows us to factor out scalar variables.

Lemma 21.18 (scalar factors) Let ξ ∈ D^{1,2} and Y ∈ D* with

(i) ξY ∈ L²(ζ, H),
(ii) ξ D*Y − ⟨Dξ, Y⟩ ∈ L²(ζ).

Then ξY ∈ D*, and

    D*(ξY) = ξ D*Y − ⟨Dξ, Y⟩.

Proof: Using the elementary product rule for differentiation and the D/D* duality, we get for any ξ, η ∈ S

    E⟨Dη, ξY⟩ = E⟨Y, ξ Dη⟩
        = E⟨ Y, D(ξη) − η Dξ ⟩
        = E( (D*Y) ξη − η ⟨Y, Dξ⟩ )
        = E η( ξ D*Y − ⟨Dξ, Y⟩ ),

which extends to general ξ, η ∈ D^{1,2} as stated, by (i)−(ii) and the closure of D. Hence, Lemma 21.14 yields ξY ∈ D*, and the asserted formula follows. □

This yields a simple factorization property:

Corollary 21.19 (factorization) Let ξ ∈ L²(ζ) be 1_{A^c}ζ-measurable for an A ∈ T. Then 1_A ξ ∈ D*, and D*(1_A ξ) = ξ(ζA) a.s.

Proof: First let ξ ∈ D^{1,2}. Using Lemmas 21.17 and 21.18, along with Corollary 21.12, we get

    D*(1_A ξ) = ξ D*(1_A) − ⟨Dξ, 1_A⟩ = ξ(ζA) − 0 = ξ(ζA).

The general result follows since D^{1,2} is dense in L²(ζ) and D* is closed, being the adjoint of a closed operator. □

End of proof of Corollary 21.16: It remains to show that the stated inclusion is strict. Then let A, B ∈ T be disjoint with μA ∧ μB > 0, and define Y = 1_A ξ with ξ = 1{ζB > 0}, so that Y ∈ D* by Corollary 21.19. On the other hand, Corollary 21.10 yields ξ ∉ D^{1,2}, since P{ζB > 0} = ½. Noting that 1_A ξ ∈ D̂^{1,2} iff ξ ∈ D^{1,2} when μA > 0, we conclude that even Y ∉ D̂^{1,2}. □

We turn to a key relationship between D and D*.

Theorem 21.20 (commutation) Let X ∈ D̂^{1,2} be such that, for t ∈ T a.e. μ,

(i) the process s → D_t X_s belongs to D*,
(ii) the process t → D*(D_t X) has a version in L²(P ⊗ μ).

Then D*X ∈ D^{1,2}, and

    D_t(D*X) = X_t + D*(D_t X),  t ∈ T.

Proof: By Corollary 14.27 we may write X_t = Σ_n ζ^{⊗n} f_n(·, t), where each f_n is symmetric in the first n arguments. Since D̂^{1,2} ⊂ D* by Corollary 21.16, Theorem 21.15 yields

    D*X = Σ_n ζ^{⊗(n+1)} f̃_n,  Σ_n (n + 1)! ‖f̃_n‖² < ∞.

In particular,

    J_n(D*X) = ζ^{⊗n} f̃_{n−1},  ‖J_n(D*X)‖² = n! ‖f̃_{n−1}‖²,

and so Σ_n n ‖J_n(D*X)‖² < ∞, which implies D*X ∈ D^{1,2} by Corollary 21.8. Moreover, Theorem 21.7 yields

    D_t(D*X) = D_t Σ_n ζ^{⊗(n+1)} f̃_n
        = Σ_n (n + 1) ζ^{⊗n} f̃_n(·, t)
        = Σ_n ( ζ^{⊗n} f_n(·, t) + Σ_{i≤n} ζ^{⊗n} f_n( t₁, …, t_{i−1}, t, t_{i+1}, …, t_n, t_i ) )
        = X_t + Σ_{n≥0} ζ^{⊗n} g_n(·, t),        (5)

where

    g_n( t₁, …, t_n, t ) = Σ_{i≤n} f_n( t₁, …, t_{i−1}, t, t_{i+1}, …, t_n, t_i ).

Now fix any s, t ∈ T, and conclude from Theorem 21.7 that D_t X_s = Σ_n n ζ^{⊗(n−1)} f_n(·, t, s). Then Theorem 21.15 yields

    D*(D_t X) = Σ_{n≥0} ζ^{⊗n} g_n′(·, t),  t ∈ T,

where

    g_n′( t₁, …, t_n, t ) = Σ_{i≤n} f_n( t₁, …, t_{i−1}, t_n, t_{i+1}, …, t_{n−1}, t, t_i )
        = Σ_{i≤n} f_n( t₁, …, t_{i−1}, t, t_{i+1}, …, t_{n−1}, t_n, t_i )
        = g_n(·, t),

since f_n is symmetric in the first n variables. Hence, (5) holds with g_n replaced by g_n′, and the asserted relation follows. □

To state the next result, it is suggestive to write, for any X, Y ∈ D̂^{1,2},

    tr(DX · DY) = ∫∫ (D_s X_t)(D_t Y_s) μ^{⊗2}(ds dt).

Corollary 21.21 (covariance) For any X, Y ∈ D̂^{1,2}, we have⁵

    ⟨D*X, D*Y⟩ = E⟨X, Y⟩ + E tr(DX · DY).

⁵ Throughout this chapter, the inner product ⟨·,·⟩ must not be confused with the predictable covariation in Chapter 20.

Proof: By a simple approximation, we may assume that X, Y have finite chaos decompositions. Then D*X, D*Y ∈ D^{1,2}, and we see from the D/D* duality, used twice, and Theorem 21.20 that

    ⟨D*X, D*Y⟩ = E⟨ X, D(D*Y) ⟩
        = E⟨X, Y⟩ + E⟨ X, D*(DY) ⟩
        = E⟨X, Y⟩ + E tr(DX · DY),

where the last equality is clear if we write out the duality explicitly in terms of μ-integrals. □

An operator T on a space of random variables ξ is said to be local if ξ = 0 a.s. on a set A ∈ F implies Tξ = 0 a.e. on A. We show that D* is local on D̂^{1,2} and D is local on D^{1,1}. To make this more precise, we note that for any statement on Ω × T, the qualification 'a.e.' is understood to be with respect to P ⊗ μ.

Theorem 21.22 (local properties) For any set A ⊂ Ω in F, we have

(i) X ∈ D̂^{1,2}, 1_A X = 0 a.e.  ⇒  1_A (D*X) = 0 a.s.,
(ii) ξ ∈ D^{1,1}, 1_A ξ = 0 a.s.  ⇒  1_A (Dξ) = 0 a.e.

Proof: (i) Choose a smooth function f: R → [0, 1] with f(0) = 1 and support in [−1, 1], and define f_n(x) = f(nx). Let η ∈ S_c be arbitrary. We can show that η f_n(‖X‖²) ∈ D^{1,2}, and use the product and chain rules to get

    D( η f_n(‖X‖²) ) = f_n(‖X‖²) Dη + 2 η f_n′(‖X‖²) DX · X.

Hence, the D/D*-duality yields

    E( η (D*X) f_n(‖X‖²) ) = E( f_n(‖X‖²) ⟨X, Dη⟩ ) + 2 E( η f_n′(‖X‖²) ⟨DX · X, X⟩ ).        (6)

As n → ∞, we note that

    η (D*X) f_n(‖X‖²) → η (D*X) 1{X = 0},
    f_n(‖X‖²) ⟨X, Dη⟩ → 0,
    η f_n′(‖X‖²) ⟨DX · X, X⟩ → 0,

and (6) yields formally

    E( η D*X; X = 0 ) = 0.        (7)

Since S_c is dense in L²(ζ), this implies D*X = 0 a.s. on A ⊂ {X = 0}, as required. To justify (7), we note that

    | f_n(‖X‖²) ⟨X, Dη⟩ | ≤ ‖f‖_∞ ‖X‖_H ‖Dη‖_H,

and

    | 2 η f_n′(‖X‖²) ⟨DX · X, X⟩ | ≤ 2 |η| f_n′(‖X‖²) ‖DX‖ ‖X‖²
        ≤ 2 |η| ‖f′‖_∞ ‖DX‖,

where both bounds are integrable. Hence, (7) follows from (6) by dominated convergence.

(ii) It is enough to prove the statement for bounded variables ξ ∈ D^{1,1}, since it will then follow for general ξ by the chain rule for D, applied to the variable arctan ξ. For f and f_n as in (i), define g_n(t) = ∫_{−∞}^t f_n(r) dr, and note that ‖g_n‖_∞ ≤ n^{−1}‖f‖₁, and also D g_n(ξ) = f_n(ξ) Dξ by the chain rule. Now let Y ∈ S_b(H) be arbitrary, and note that the D/D*-duality remains valid in the form E(η D*Y) = E⟨Dη, Y⟩, for any bounded variables η ∈ D^{1,1}. Using the chain rule for D, the extended D/D*-duality, and some easy estimates, we obtain

    | E f_n(ξ) ⟨Dξ, Y⟩ | = | E⟨ D(g_n ∘ ξ), Y ⟩ |
        = | E g_n(ξ) D*Y |
        ≤ n^{−1} ‖f‖₁ E|D*Y| → 0,

and so by dominated convergence

    E( ⟨Dξ, Y⟩; ξ = 0 ) = 0,

which extends by suitable approximation to Dξ = 0 a.e. on A ⊂ {ξ = 0}. □

Since the operators D and D* are local, we may extend their domains as follows. Say that a random variable ξ lies locally in D^{1,p}, and write ξ ∈ D^{1,p}_loc, if there exist some measurable sets A_n ↑ Ω with associated random variables ξ_n ∈ D^{1,p}, such that ξ = ξ_n on A_n. The definitions of D^{m,p}_loc and D̂^{1,2}_loc are similar.

To motivate our introduction of the third basic operator of the subject, let X be a standard Ornstein–Uhlenbeck process on H, defined as a centered Gaussian process on R × H with covariance function

    E(X_s h)(X_t k) = e^{−|t−s|} ⟨h, k⟩,  h, k ∈ H, s, t ∈ R.

This makes X a stationary Markov process on H, such that X_t is isonormal Gaussian on H for every t ∈ R, whereas X_t h is a real OU-process in t for each h ∈ H with ‖h‖ = 1. For the construction of X, fix any ONB h₁, h₂, … ∈ H, let the X_t h_j be independent OU-processes in R, and extend to H by linearity and continuity⁶. For the present purposes, we may put X₀ = ζ, and define the associated transition operators T_t directly on L²(ζ) by

    T_t(f ∘ ζ) = E( f ∘ X_t | ζ ),  t ≥ 0,

⁶ Here the possible path continuity of X plays no role.

for measurable functions f on R^H with f(ζ) ∈ L². In fact, any ξ ∈ L²(ζ) has such a representation by Lemma 1.14 or Theorem 14.26, and since X_t =ᵈ ζ, the right-hand side is a.s. independent of the choice of f. To identify the operators T_t, we note that the Hermite polynomials p_n, normalized as in Theorem 14.25, satisfy the identity

    exp( xt − ½ t² ) = Σ_{n≥0} p_n(x) tⁿ/n!,  x, t ∈ R.        (8)

From the moving average representation in Chapter 14, we further note that for any t ≥ 0,

    X_t = e^{−t} ζ + √(1 − e^{−2t}) ζ_t,        (9)

where ζ_t =ᵈ ζ with ζ_t ⊥⊥ ζ.

Theorem 21.23 (Ornstein–Uhlenbeck semi-group) For a standard OU-process X on H with X₀ = ζ, the transition operators T_t on L²(ζ) are given by

    T_t ξ = Σ_{n≥0} e^{−nt} J_n ξ,  ξ ∈ L²(ζ), t ≥ 0.

Proof: By Proposition 4.2 and Corollary 6.5, it is enough to consider random variables of the form ξ = f(ζh) with ‖h‖ = 1. By the uniqueness theorem for moment-generating functions, we may take f(x) = exp(rx − ½r²) for an arbitrary r ∈ R. Using the representation (9) with a = e^{−t} and b = √(1 − e^{−2t}), we obtain

    f(X_t h) = exp( r X_t h − ½ r² )
        = exp( ra ζh + rb ζ_t h − ½ r² )
        = exp( ra ζh − ½ r²a² ) exp( rb ζ_t h − ½ r²b² ).

In particular, we get for t = 0 by (8) and Theorem 14.25

    ξ = f(ζh) = exp( r ζh − ½ r² )
        = Σ_{n≥0} p_n(ζh) rⁿ/n!
        = Σ_{n≥0} ζ^{⊗n} h^{⊗n} rⁿ/n!,

which shows that J_n ξ = ζ^{⊗n} h^{⊗n} rⁿ/n!. For general t, we note that ζ_t h is N(0, 1) with ζ_t ⊥⊥ ζ, and conclude by the same calculation that

    E( f(X_t h) | ζ ) = exp( ra ζh − ½ r²a² )
        = Σ_{n≥0} ζ^{⊗n} h^{⊗n} (ra)ⁿ/n!
        = Σ_{n≥0} e^{−nt} J_n ξ. □
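The action of T_t on the individual chaoses can also be checked numerically through the representation (9): conditionally on ζh = x, we have E p_n(a x + b Z) = aⁿ p_n(x) for a standard normal Z, a = e^{−t}, b = √(1 − a²). The following sketch tests this for n = 3, p₃(x) = x³ − 3x; it is illustrative only, and the numerical choices are ours.

```python
# Numerical sketch of Theorem 21.23 via (9): E[p_3(a*x + b*Z)] = e^{-3t} p_3(x).
import numpy as np

rng = np.random.default_rng(2)
t, x, n = 0.7, 1.3, 3
a, b = np.exp(-t), np.sqrt(1 - np.exp(-2 * t))

p3 = lambda y: y**3 - 3 * y
z = rng.standard_normal(2_000_000)       # independent copy zeta_t h

print(np.mean(p3(a * x + b * z)))        # Monte Carlo value of T_t p_3 at x
print(np.exp(-n * t) * p3(x))            # e^{-3t} p_3(x)
```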

The Tt clearly form a contraction semi-group on L2 (ζ), and we proceed to determine the associated generator, in the sense of convergence in L2 .

Theorem 21.24 (Ornstein–Uhlenbeck generator) For a standard OU-process X on H with X₀ = ζ, the generator L and its domain L ⊂ L²(ζ) are given by

(i) ξ ∈ L  ⇔  Σ_{n>0} n² ‖J_n ξ‖² < ∞,
(ii) Lξ = Σ_{n>0} L(J_n ξ) = − Σ_{n>0} n J_n ξ in L²,
(iii) ‖Lξ‖² = Σ_{n>0} ‖L(J_n ξ)‖² = Σ_{n>0} n² ‖J_n ξ‖².

Proof: Let an operator L with domain L be given by (i)−(ii), and let G denote the generator of (T_t) with domain G. First let ξ ∈ L. Since J_n and T_t commute on L²(ζ) and n ≥ t^{−1}(1 − e^{−nt}) → n as t → 0 for fixed n, we get by dominated convergence as t → 0

    ‖ t^{−1}(T_t ξ − ξ) − Lξ ‖² = Σ_n ‖ n J_n ξ − t^{−1}(1 − e^{−nt}) J_n ξ ‖²
        = Σ_n ( n − t^{−1}(1 − e^{−nt}) )² ‖J_n ξ‖² → 0,

and so ξ ∈ G with Gξ = Lξ, proving that G = L on L ⊂ G. To show that G ⊂ L, let ξ ∈ G be arbitrary. Since the J_n act continuously on L²(ζ), we get as before

    J_n(Gξ) ← t^{−1}( T_t(J_n ξ) − J_n ξ ) → − n J_n ξ,

and so

    Σ_n n² ‖J_n ξ‖² = Σ_n ‖J_n(Gξ)‖² = ‖Gξ‖² < ∞,

which shows that indeed ξ ∈ L. □

We may now establish a remarkable relationship between the three basic operators of Malliavin calculus:

Theorem 21.25 (operator relation) The operators D, D*, L are related by

(i) ξ ∈ L  ⇔  ξ ∈ D^{1,2}, Dξ ∈ D*,
(ii) L = −D*D on L.

Proof: First let ξ ∈ D^{1,2} and Dξ ∈ D*. Then for any η ∈ H_n, we get by the D/D*-duality and Theorem 21.7

    E η(D*Dξ) = E⟨Dη, Dξ⟩ = n E η(J_n ξ),

and so J_n(D*Dξ) = n J_n ξ, which shows that ξ ∈ L with D*Dξ = −Lξ. It remains to show that L ⊂ dom(D*D), so let ξ ∈ L be arbitrary. Then Theorem 21.24 yields

    Σ_n n ‖J_n ξ‖² ≤ Σ_n n² ‖J_n ξ‖² < ∞,

and so ξ ∈ D^{1,2} by Corollary 21.8. Next, Corollary 21.8 and Theorem 21.24 yield for any η ∈ D^{1,2}

    E⟨Dη, Dξ⟩ = Σ_n n E (J_n η)(J_n ξ) = −E η(Lξ),

and so Dξ ∈ D* by Lemma 21.14. Thus, we have indeed ξ ∈ dom(D*D). □

We proceed to show that L behaves like a second order differential operator, when acting on smooth random variables. For comparison, recall from Chapter 32 that an OU-process in R^d solves the SDE dX_t = √2 dB_t − X_t dt, and hence that its generator A agrees formally with the expression for L below. See also Theorem 17.24 for an equivalent formula in the context of general Feller diffusions.

Theorem 21.26 (L as elliptic operator) For smooth random variables ξ = f(ζh̄) with h̄ = (h₁, …, h_n) ∈ Hⁿ, we have

    Lξ = Σ_{i,j≤n} ∂²_{ij} f(ζh̄) ⟨h_i, h_j⟩ − Σ_{i≤n} ∂_i f(ζh̄) ζh_i.

Proof: Since ξ ∈ D^{1,2}, we have by (2)

    Dξ = Σ_{i≤n} ∂_i f(ζh̄) h_i,

so that in particular Dξ ∈ S_H ⊂ D*. Using (2) again, along with Lemma 21.17, we obtain

    D*(Dξ) = Σ_{i≤n} ∂_i f(ζh̄) ζh_i − Σ_{i,j≤n} ∂²_{ij} f(ζh̄) ⟨h_i, h_j⟩,

and the assertion follows by Theorem 21.25. □

We may extend the last result to a chain rule for the OU-operator L, which may be regarded as a second order counterpart of the first order chain rule (3) for D.

Corollary 21.27 (second order chain rule) Let ξ₁, …, ξ_n ∈ D^{2,4}, and let f ∈ C²(Rⁿ) with bounded first and second order partial derivatives. Then f(ξ) ∈ L with ξ = (ξ₁, …, ξ_n), and we have

    L(f ∘ ξ) = Σ_{i,j≤n} ∂²_{ij} f(ξ) ⟨Dξ_i, Dξ_j⟩ + Σ_{i≤n} ∂_i f(ξ) Lξ_i.

Proof: Approximate ξ in the norm ‖·‖_{2,4} by smooth random variables, and f by smooth functions in Ĉ(Rⁿ). The result then follows by the continuity of L in the norm ‖·‖_{2,2}. We omit the details. □

We now come to the remarkable fact that the divergence operator D*, here introduced as the adjoint of the differentiation operator D, is a non-anticipating extension of the Itô integral of stochastic calculus. This is why it is often regarded as an integral and may even be denoted by an integral sign.

Here we take ζ to be an isonormal Gaussian process on H = L²(R₊, λ), generating a Brownian motion B with B_t = ζ[0, t] a.s. for all t ≥ 0. Letting F be the filtration induced by B, we write L²_F(P ⊗ λ) for the class of F-progressive processes X on R₊ with E∫X_t² dt < ∞. For such a process X, we know from Chapter 18 that the Itô integral ∫X_t dB_t exists.

Theorem 21.28 (D* as extended Itô integral) Let ζ be an isonormal process on L²(R₊, λ), generating a Brownian motion B with induced filtration F. Then

(i) L²_F(P ⊗ λ) ⊂ D*,
(ii) D*X = ∫₀^∞ X_t dB_t a.s.,  X ∈ L²_F(P ⊗ λ).

Proof: First let X be a predictable step process of the form

    X_t = Σ_{i≤m} ξ_i 1_{(t_i, t_{i+1}]}(t),  t ≥ 0,

where 0 ≤ t₁ < ⋯ < t_{m+1} and each ξ_i is an F_{t_i}-measurable L²-random variable. By Corollary 21.19, we have X ∈ D* with

    ∫₀^∞ X_t dB_t = Σ_{i≤m} ξ_i ( B_{t_{i+1}} − B_{t_i} )
        = Σ_{i≤m} D*( ξ_i 1_{(t_i, t_{i+1}]} ) = D*X.        (10)

For general X ∈ L²_F(P ⊗ λ), Lemma 18.23 yields some predictable step processes Xⁿ as above, such that Xⁿ → X in L²(P ⊗ λ). Applying (10) to the Xⁿ and using Theorem 18.11, we get

    D*Xⁿ = ∫₀^∞ Xⁿ_t dB_t → ∫₀^∞ X_t dB_t in L².

Hence, Lemma 21.14 yields⁷ X ∈ D* and D*X = ∫₀^∞ X_t dB_t. □

⁷ Indeed, the adjoint of a closed operator is always closed.

We proceed with a striking differentiation property of the Itô integral.

Theorem 21.29 (derivative of Itô integral, Pardoux & Peng) Let ζ be an isonormal Gaussian process on L²([0, 1], λ) generating a Brownian motion B, and consider a B-adapted process X ∈ L²(P ⊗ λ). Then

(i) X ∈ D̂^{1,2}  ⇔  (X · B)₁ ∈ D^{1,2},

in which case

(ii) X · B ∈ D̂^{1,2}, and (X · B)_t ∈ D^{1,2}, t ∈ [0, 1],
(iii) D_s ∫₀^t X_r dB_r = X_s 1{s ≤ t} + ∫_s^t (D_s X_r) dB_r a.e.,  s, t ∈ [0, 1].

Proof: First let X ∈ D̂^{1,2}. Since for fixed t the process s → D_t X_s is square integrable by definition and adapted by Corollary 21.13, we have X ∈ D* by Theorem 21.28 (i). Furthermore, the isometry of the Itô integral yields

    E ∫₀¹ dt | ∫_t¹ D_t X_s dB_s |² = ∫₀¹ dt ∫_t¹ E(D_t X_s)² ds < ∞,

and so X · B ∈ D̂^{1,2}. By Theorems 21.20 and 21.28 (ii) we conclude that (X · B)_t ∈ D^{1,2} for all t ∈ [0, 1], and that (iii) holds for s ≤ t. Equation (iii) remains true for s > t, since all three terms then vanish by Corollary 21.13 (i).

It remains to prove the implication '⇐' in (i), so assume that (X · B)₁ ∈ D^{1,2}. Let Xⁿ_t denote the projection of X_t onto P_n = H₀ ⊕ ⋯ ⊕ H_n. Then (Xⁿ · B)_t is the projection of (X · B)_t onto P_{n+1}, and so (Xⁿ · B)₁ → (X · B)₁ in the topology of D^{1,2}. Applying (iii) to each process Xⁿ and using the martingale property of the Itô integral, we obtain

    ∫₀¹ E( D_s(Xⁿ · B)₁ )² ds = ∫₀¹ E|Xⁿ_s|² ds + ∫₀¹ ds ∫₀^s E(D_r Xⁿ_s)² dr
        ≥ ∫₀¹ ds ∫₀^s E(D_r Xⁿ_s)² dr
        = E‖DXⁿ‖²_{λ²}.

0   2 E DX n λ2 . 0

By a similar estimate for DX m − DX n we conclude that DX n converges in ˆ 1,2 by the closure of D. (Alternatively, we may use a L2 (λ2 ), and so X ∈ D version of Lemma 21.6.) 2 Next recall from Lemma 19.12 that any B-measurable random variable  ξ ∈ L2 with Eξ = 0 can be written as an Itˆo integral 0∞ Xt dBt , for an a.e. unique process X ∈ L2F (P ⊗ λ). We may now give an explicit formula for the integrand X. Theorem 21.30 (Brownian functionals, Clark, Ocone) Let B be a Brownian motion on [0, 1] with induced filtration F. Then for any ξ ∈ D1,2 ,

(i) ξ = E ξ + 0

1

 





E Dt ξ  Ft dBt ,

(ii) E| ξ |p < |E ξ|p +



1 0

E|Dt ξ|p dt,

p ≥ 2.

For the stated formula to make sense, we need to choose a progressively measurable version of the integrand. Such an optional projection of Dt ξ is easily constructed by a suitable approximation. Proof: (i) Assuming ξ = Σ n ζ n fn , we get by Theorem 21.7 and Corollary 14.28   

 E Dt ξ  Ft = Σ n E ζ n−1 fn (·, t)  Ft n>0

=

Σ n ζ n−1 n>0





1[0,t]n−1 fn (·, t) .

484

Foundations of Modern Probability

The left-hand side is further adapted and square integrable, hence belongs to D∗ by Theorem 21.28 (i). Applying D∗ to the last sum and using Theorems 21.15 and 21.28 (ii), we conclude that

1 0







E Dt ξ  Ft dBt =

Σ ζ nfn = ξ − Eξ,

n>0

where the earlier factor n is canceled out by the symmetrization. (ii) Using Jensen’s inequality (three times), along with Fubini’s theorem and the Burkholder inequality in Theorem 18.7, we get from (i)  

E  E| ξ |p − |E ξ|p <

1 0

 



 p



E Dt ξ  Ft dBt 

 1    

 2 p/2   <  E E D ξ F dt t t  

0 1      p ≤ E E Dt ξ  Ft  dt 0 1



0

E|Dt ξ|p dt.

2
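For a concrete check of Theorem 21.30 (i), take ξ = B₁². Then D_t ξ = 2B₁ on [0, 1], so E(D_t ξ | F_t) = 2B_t, and the formula predicts B₁² = 1 + ∫₀¹ 2B_t dB_t, which is exactly Itô's formula. The following small simulation sketch compares the two sides pathwise on a discretized Brownian path; it is illustrative only.

```python
# Pathwise check of the Clark-Ocone representation for xi = B_1^2:
#   B_1^2 = 1 + int_0^1 2 B_t dB_t   (left-point Ito sums approximate the integral).
import numpy as np

rng = np.random.default_rng(3)
n, paths = 2000, 5
dt = 1.0 / n

for _ in range(paths):
    dB = rng.standard_normal(n) * np.sqrt(dt)
    B = np.concatenate(([0.0], np.cumsum(dB)))
    ito = np.sum(2 * B[:-1] * dB)            # Ito sums of E(D_t xi | F_t) = 2 B_t
    print(B[-1]**2, 1 + ito)                 # the two numbers nearly agree
```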

The Malliavin calculus can often be used to show the existence and possible regularity of a density. The following simple result illustrates the idea.

Theorem 21.31 (functionals with smooth density) For an isonormal Gaussian process ζ on H, let ξ ∈ L²(ζ) with

    ξ ∈ D^{1,2},  Dξ ∈ D*,  ‖Dξ‖ > 0.

Then L(ξ) ≪ λ with the bounded, continuous density⁸

    g(x) = E( D*( ‖Dξ‖^{−2} Dξ ); ξ > x ),  x ∈ R.

⁸ Since ‖Dξ‖^{−2} is a random variable, not a constant, it can't be pulled out in front of E or even of D*.

Proof: Let f ∈ Ĉ(R) be arbitrary, and put F(x) = ∫_{−∞}^x f(y) dy. Then the chain rule in Theorem 21.1 yields F(ξ) ∈ D^{1,2} and

    ⟨ D(F ∘ ξ), Dξ ⟩ = f(ξ) ‖Dξ‖².

Using the D/D*-duality and Fubini's theorem, we obtain

    E f(ξ) = E⟨ D(F ∘ ξ), ‖Dξ‖^{−2} Dξ ⟩
        = E( D*( ‖Dξ‖^{−2} Dξ ) F(ξ) )
        = E( D*( ‖Dξ‖^{−2} Dξ ) ∫_{−∞}^ξ f(x) dx )
        = ∫ E( D*( ‖Dξ‖^{−2} Dξ ); ξ > x ) f(x) dx.

Since f was arbitrary, this gives the asserted density. □
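In the most elementary case ξ = ζh with ‖h‖ = 1, we have Dξ = h, ‖Dξ‖² = 1 and D*(h) = ζh, so the formula reduces to g(x) = E(ζh; ζh > x), which indeed equals the standard normal density φ(x). The following Monte Carlo sketch verifies this numerically; it is illustrative only.

```python
# Sanity check of Theorem 21.31 for xi = zeta h, ||h|| = 1:
#   g(x) = E[ Z ; Z > x ] should equal the N(0,1) density phi(x).
import numpy as np

rng = np.random.default_rng(4)
z = rng.standard_normal(2_000_000)

for x in (-1.0, 0.0, 1.5):
    g = np.mean(np.where(z > x, z, 0.0))             # E[Z; Z > x]
    phi = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)     # N(0,1) density
    print(x, g, phi)
```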

Weaker conditions will often ensure the mere absolute continuity, though we may then be unable to give an explicit formula for the density.

Theorem 21.32 (absolute continuity) For an isonormal Gaussian process ζ on H, let ξ be a ζ-measurable random variable. Then

    ξ ∈ D^{1,1}_loc,  Dξ ≠ 0 a.s.  ⇒  L(ξ) ≪ λ.

Proof: By localization, we may let ξ ∈ D^{1,1}, and we may further assume that μ = L(ξ) is supported by [−1, 1]. We need to show that if A ∈ B_{[−1,1]} with λA = 0, then μA = 0. As in Lemma 1.37, we may then choose some smooth functions f_n: [−1, 1] → [0, 1] with bounded derivatives, such that f_n → 1_A a.e. λ + μ. Writing F_n(x) = ∫_{−1}^x f_n(y) dy, we see by dominated convergence that F_n(x) → 0 for all x. Since the F_n are again smooth with bounded derivatives, the chain rule in Theorem 21.1 yields F_n(ξ) ∈ D^{1,1} with

    D(F_n ∘ ξ) = f_n(ξ) Dξ → 1{ξ ∈ A} Dξ a.s.

Since also F_n(ξ) → 0, the closure property of D gives 1{ξ ∈ A} Dξ = 0 a.s., and since Dξ ≠ 0 a.s. by hypothesis, we obtain μA = P{ξ ∈ A} = 0. □

Exercises

1. Show that when H = L²(T, μ) and ξ ∈ D^{1,2}, we can choose the process D_t ξ to be product measurable on T × Ω.

2. Let H¹ be the space of functions x ∈ C₀([0, 1]) of the form x_t = ∫₀^t ẋ_s ds with ẋ ∈ H = L²([0, 1]). Show that the relation ⟨x, y⟩_{H¹} = ⟨ẋ, ẏ⟩_H defines an inner product on H¹, making it a Hilbert space isomorphic to H, and that the natural embedding H¹ → C₀ is injective and continuous.

3. Let B be a Brownian motion on [0, 1] generated by an isonormal Gaussian process ζ on L²([0, 1]), so that B_t = ζ1_{[0,t]}. For any f ∈ C^∞(R^d) and t ∈ [0, 1], show that the random variable ξ = f(B_t) is smooth.

4. For ξ = f(B_t) as above and h ∈ L², show that ⟨Dξ, h⟩ agrees with the directional derivative of ξ, in the direction of h·λ ∈ H¹. (Hint: Write ⟨Dξ, h⟩ = f′(B_t)(h·λ)_t = (d/dε) ξ(B + ε h·λ)|_{ε=0}.)

5. Let B be a Brownian motion generated by an isonormal Gaussian process ζ on L²([0, 1]). Show that for any t ∈ [0, 1], n ∈ N, and f ∈ Ĉ^∞, the random variable ξ = f(B_tⁿ) belongs to D^{1,2}. (Hint: Note that B_tⁿ has a finite chaos expansion. Then use Lemma 21.5 and Corollary 21.8.)

6. For a Brownian motion B, some fixed t₁, …, t_n, and a suitably smooth function f on R^d, define ξ = f(B_{t₁}, …, B_{t_n}). Use Corollary 21.27 to calculate an expression for Lξ.

7. Let B be a Brownian motion generated by an isonormal Gaussian process ζ on L²([0, 1]), and fix any n ∈ N. For X = Bⁿ, show that X ∈ D*, and calculate an expression for D*X. Check that X satisfies the conditions in Theorem 21.28, compute the corresponding Itô integral ∫₀¹ X_t dB_t, and compare the two expressions. (Hint: Use Theorem 21.15.)

8. For a Brownian motion B on [0, 1] generated by the isonormal process ζ, define ξ = B_t for a fixed t > 0. Calculate Dξ, and verify the expression in Theorem 21.30.

9. For a Brownian motion B and a fixed t > 0, use Theorem 21.31 to calculate an expression for the density of L(B_t), and compare with the known formula for the normal density.

VII. Convergence and Approximation Here our primary objective is to derive functional versions of the major limit theorems of probability theory, through the development of general embedding and compactness arguments. The intuitive Skorohod embedding in Chapter 22 leads easily to functional versions of the central limit theorem and the law of the iterated logarithm, and allows extensions to renewal and empirical processes, as well as to martingales with small jumps. Though the compactness approach in Chapter 23 is more subtle, as it depends on the development of compactness criteria in the underlying function or measure spaces, it has a much wider scope, as it applies even to processes with jumps and to random sets and measures. In Chapter 24, we develop some powerful large deviation principles, providing precise exponential bounds for tail probabilities in a great variety of contexts. The first two chapters contain essential core material, whereas the last chapter is more advanced and might be postponed. −−− 22. Skorohod embedding and functional convergence. Here the key step is to express a centered random variable as the value of a Brownian motion B at a suitably bounded optional time. Iterating the construction yields an embedding of an entire random walk X into B. By estimating the associated approximation error, we can derive a variety of weak or strong limit theorems for X from the corresponding results for B, leading in particular to functional versions of the central limit theorem and the law of the iterated logarithm, with extensions to renewal and empirical processes. A similar embedding applies to martingales with uniformly small jumps. 23. Convergence in distribution. Here we develop the compactness approach to convergence in distribution in function and measure spaces. The key result is Prohorov’s theorem, which shows how the required relative compactness is equivalent to tightness, under suitable conditions on the underlying space. The usefulness of the theory relies on our ability to develop efficient tightness criteria. Here we consider in particular the spaces D R+ ,S of functions in S with only jump discontinuities, and the space MS of locally finite measures on S, leading to a wealth of useful convergence criteria. 24. Large deviations. The large deviation principles (LDPs) developed in this chapter can be regarded as refinements of the weak law of large numbers in a variety of contexts. The theory involves some general extension principles, leading to an extremely versatile and powerful body of methods and results, covering a wide range of applications. In particular, we include Schilder’s LDP for Brownian motion, Sanov’s LDP for empirical distributions, the Freidlin– Wentzel LDP for stochastic dynamical systems, and Strassen’s functional law of the iterated logarithm.


Chapter 22

Skorohod Embedding and Functional Convergence Embedding of random walk, Brownian martingales, moment identities and estimates, rate of continuity, approximations of random walk, functional central limit theorem, laws of the iterated logarithm, arcsine laws, approximation of renewal processes, empirical distribution functions, embedding and approximation of martingales, martingales with small jumps, time-scale comparison

In Chapter 6 we used analytic methods to derive criteria for sums of independent random variables to be approximately Gaussian. Though this may remain the easiest approach to the classical limit theorems, such results are best understood when viewed in the context of some general approximation results for random processes. The aim of this chapter is to develop the purely probabilistic technique of Skorohod embedding to derive such functional limit theorems. The scope of applications of the present method extends far beyond the classical limit theorems. This is because some of the most important functionals of a random process, such as the supremum up to a fixed time or the time at which the maximum occurs, cannot be expressed in terms of the values at finitely many times. The powerful new technique allows us to extend some basic results for Brownian motion from Chapter 14, including the arcsine laws and the law of the iterated logarithm, to random walks and related sequences. Indeed, the technique of Skorohod embedding extends even to a wide class of martingales with small jumps. From the statements for random walks, similar results can be deduced for various related processes. In particular, we prove a functional central limit theorem and a law of the iterated logarithm for renewal processes, and show how suitably normalized versions of the empirical distribution functions, based on a sample of i.i.d. random variables, can be approximated by a Brownian bridge.

In the simplest setting, we may consider a random walk (X_n), based on some i.i.d. random variables ξ_k with mean 0 and variance 1. Here we can prove the existence of a Brownian motion B and some optional times τ₁ ≤ τ₂ ≤ ⋯, such that X_n = B_{τ_n} a.s. for every n. For applications, it is essential to choose the τ_n such that the differences Δτ_n become i.i.d. with finite mean. The path of the step process X_{[t]} will then be close to that of B, and many results for Brownian motion carry over, at least approximately, to the random walk.


The present account depends in many ways on material from previous chapters. Thus, we rely on the basic theory of Brownian motion, as set forth in Chapter 14. We also make frequent use of ideas and results from Chapter 9 on martingales and optional times. Finally, occasional references are made to Chapter 5 for empirical distributions, to Chapter 8 for the transfer theorem, to Chapter 12 for random walks and renewal processes, and to Chapter 15 for Poisson processes. More general approximations and functional limit theorems will be obtained by different methods in Chapters 23–24 and 30. The present approximations of martingales with small jumps are closely related to the time-change results for continuous local martingales in Chapter 19.

To clarify the basic ideas, we begin with a detailed discussion of the classical Skorohod embedding of random walks. Say that a random variable ξ or its distribution is centered if Eξ = 0.

Theorem 22.1 (embedding of random walk, Skorohod) Let ξ₁, ξ₂, … be i.i.d., centered random variables, and put X_n = ξ₁ + ⋯ + ξ_n. Then there exists a filtered probability space with a Brownian motion B and some optional times 0 = τ₀ ≤ τ₁ ≤ ⋯ with i.i.d. differences Δτ_n = τ_n − τ_{n−1}, such that

    (B_{τ_n}) =ᵈ (X_n),  EΔτ_n = Eξ₁²,  E(Δτ_n)² ≤ 4 Eξ₁⁴.
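The embedding constructed in the proof below (via Lemmas 22.4–22.5) is easy to simulate. The following sketch uses the exact exit law of Brownian motion from an interval [α, β] starting at 0, namely P{B_τ = β} = −α/(β − α) and E(τ | α, β) = −αβ, together with the mixing law μ̃ of Lemma 22.4; the concrete three-point law μ = uniform{−2, −1, 3} and all names are our own illustrative choices.

```python
# Simulation sketch of the Skorohod embedding for mu = uniform{-2, -1, 3}:
# sample (alpha, beta) from mu~, then use the exact exit law of BM from [alpha, beta].
import numpy as np

rng = np.random.default_rng(5)
pairs = np.array([(-2.0, 3.0), (-1.0, 3.0)])     # support of mu~ (Lemma 22.4)
probs = np.array([5/9, 4/9])                     # mu~ weights (here c = 1)

idx = rng.choice(2, size=1_000_000, p=probs)
a, b = pairs[idx, 0], pairs[idx, 1]
hit_b = rng.random(a.size) < -a / (b - a)        # exit at beta?
B_tau = np.where(hit_b, b, a)
E_tau = -a * b                                   # conditional mean of tau

print([np.mean(B_tau == v) for v in (-2, -1, 3)])   # ~ [1/3, 1/3, 1/3] = mu
print(np.mean(E_tau), (4 + 1 + 9) / 3)              # E tau = E xi^2 = 14/3
```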

Here the moment conditions are crucial, since without them the statement would be trivially true and totally useless. In fact, we could then take B ⊥⊥ (ξ_n) and choose recursively τ_n = inf{ t ≥ τ_{n−1}; B_t = X_n }. Our proof of Theorem 22.1 is based on some lemmas. First we list a few martingales associated with Brownian motion.

Lemma 22.2 (Brownian martingales) For a Brownian motion B, these processes are martingales:

    B_t,  B_t² − t,  B_t⁴ − 6tB_t² + 3t².

Proof: Note that

    EB_t = EB_t³ = 0,  EB_t² = t,  EB_t⁴ = 3t².

Write F for the filtration induced by B, let 0 ≤ s ≤ t, and recall that the process B̃_t = B_{s+t} − B_s is again a Brownian motion independent of F_s. Hence,

    E( B_t² | F_s ) = E( B_s² + 2B_s B̃_{t−s} + B̃²_{t−s} | F_s ) = B_s² + t − s.

Moreover,

    E( B_t⁴ | F_s ) = E( B_s⁴ + 4B_s³ B̃_{t−s} + 6B_s² B̃²_{t−s} + 4B_s B̃³_{t−s} + B̃⁴_{t−s} | F_s )
        = B_s⁴ + 6(t − s)B_s² + 3(t − s)²,

and so

    E( B_t⁴ − 6tB_t² | F_s ) = B_s⁴ − 6sB_s² + 3(s² − t²). □

By optional sampling, we may deduce some useful moment formulas.

22. Skorohod Embedding and Functional Convergence

491

Lemma 22.3 (moment relations) For any Brownian motion B and optional time τ such that B τ is bounded, we have E τ 2 ≤ 4EBτ4 .

E τ = EBτ2 ,

EBτ = 0,

(1)

Proof: By optional stopping and Lemma 22.2, we get for any t ≥ 0 E(τ ∧ t) = EBτ2∧t ,

EBτ ∧t = 0, 3 E(τ ∧ t) + 2

EBτ4∧t

= 6 E(τ ∧

t)Bτ2∧t .

(2) (3)

The first two relations in (1) follow from (2) by dominated and monotone convergence as t → ∞. In particular, we have Eτ < ∞. We may then take limits even in (3), and conclude by dominated and monotone convergence, together with Cauchy’s inequality, that 3 E τ 2 + EBτ4 = 6 E τ Bτ2  1/2 ≤ 6 Eτ 2 EBτ4 . Writing r = (Eτ 2 /EBτ4 )1/2 , we get 3 r2 + 1 ≤ 6 r, which yields 3 (r − 1)2 ≤ 2, 2 and finally r ≤ 1 + (2/3)1/2 < 2. Next we may write the centered distributions1 as mixtures of suitable twopoint distributions. For any a ≤ 0 ≤ b, there exists a unique probability measure νa,b on {a, b} with mean 0, given by νa,b = δ0 when ab = 0, and otherwise b δa − a δb , a < 0 < b. νa,b = b−a This defines a probability kernel ν : R− × R+ → R, where for mappings between measure spaces, measurability is defined in terms of the σ-fields generated by the evaluation maps πB : μ → μB, for arbitrary sets B in the underlying σ-field. Lemma 22.4 (centered distributions as mixtures) For any centered distribution μ on R, there exists a μ-measurable distribution μ ˜ on R− × R+ , such that

μ=

μ ˜(dx dy) νx,y .

Proof (Chung): Let μ± denote the restrictions of μ to R± \ {0}, define l(x) ≡ x, and put c = l dμ+ = − l dμ− . For any measurable function f : R → R+ with f (0) = 0, we get

c



f dμ =



l dμ+



=

f dμ− −





l dμ−

(y − x) μ− (dx) μ+ (dy)



f dμ+ f dνx,y ,

and so we may choose μ ˜(dx dy) = μ{0} δ0,0 (dx dy) + c−1 (y − x) μ− (dx) μ+ (dy). 1

probability measures with mean 0

492

Foundations of Modern Probability

The measurability of the mapping μ → μ ˜ is clear by a monotone-class argument, once we note that μ ˜(A × B) is a measurable function of μ for any A, B ∈ B. 2 The embedding in Theorem 22.1 may be constructed recursively, beginning with a single random variable ξ with mean 0. Lemma 22.5 (embedding of random variable) For any Brownian motion B and centered distribution μ on R, choose a random pair (α, β)⊥⊥B with distribution μ ˜ as in Lemma 22.4, and define a random time τ and filtration F by





τ = inf t ≥ 0; Bt ∈ {α, β} , Then τ is F-optional with L(Bτ ) = μ,



x2 μ(dx),

Eτ =



Ft = σ α, β; Bs , s ≤ t . E τ2 ≤ 4



x4 μ(dx).

Proof: The process B is clearly an F-Brownian motion, and τ is F-optional as in Lemma 9.6 (ii). Using Lemma 22.3 and Fubini’s theorem, we get L(Bτ ) = E L(Bτ | α, β) = E να,β = μ, E τ = E E(τ | α, β) =E =

x2 να,β (dx) x2 μ(dx),

2 E τ 2 = E E(τ | α, β) ≤ 4E x4 να,β (dx)

=4

x4 μ(dx).

2

Proof of Theorem 22.1: Let μ be the common distribution of the ξn . Introduce a Brownian motion B and some independent i.i.d. random pairs (αn , βn ), n ∈ N, with the distribution μ ˜ of Lemma 22.4, and define recursively some random times 0 = τ0 ≤ τ1 ≤ · · · by



τn = inf t ≥ τn−1 ; Bt − Bτn−1 ∈ {αn , βn } ,

n ∈ N.

Each τn is clearly optional for the filtration Ft = σ{αk , βk , k ≥ 1; B t }, t ≥ 0, and B is an F-Brownian motion. By the strong Markov property at τn , (n) the process Bt = Bτn +t − Bτn is again a Brownian motion independent of Gn = σ{τk , Bτk ; k ≤ n}. Since also (αn+1 , βn+1 )⊥⊥(B (n) , Gn ), we obtain (αn+1 , βn+1 , B (n) ) ⊥⊥ Gn , and so the pairs (Δτn , ΔBτn ) are i.i.d. The remaining assertions now follow by Lemma 22.5. 2 Using the last theorem, we may approximate a centered random walk by a Brownian motion. As before, we assume the underlying probability space to be rich enough to support the required randomization variables.

22. Skorohod Embedding and Functional Convergence

493

Theorem 22.6 (approximation of random walk, Skorohod, Strassen) Let ξ1 , ξ2 , . . . be i.i.d. random variables with E ξi = 0 and E ξi2 = 1, and put Xn = ξ1 + · · · + ξn . Then there exists a Brownian motion B, such that (i)





t−1/2 supX[s] − Bs  → 0, P

t → ∞,

s≤t

(ii)

X[t] − Bt lim  = 0 a.s. 2t log log t

t→∞

The proof of (ii) requires the following estimate. Lemma 22.7 (rate of continuity) For a real Brownian motion B, |Bu − Bt | = 0 a.s. lim lim sup sup  r↓1 t→∞ t≤u≤rt 2t log log t Proof: Write h(t) = (2t log log t)1/2 . It is enough to show that lim lim sup r↓1

n→∞

sup rn ≤t≤rn+1

|Bt − Brn | = 0 a.s. h(rn )

(4)

Proceeding as in the proof of Theorem 14.18, we get2 as n → ∞ for fixed r > 1 and c > 0 P



sup

 



  Bt − Brn  > ch(r n ) < P B{rn (r − 1)} > ch(rn )

t∈[rn ,r n+1 ]

2

< n−c /(r−1) ( log n)−1/2 .

Choosing c2 > r − 1 and using the Borel–Cantelli lemma, we see that the lim sup in (4) is a.s. bounded by c, and the assertion follows as we let r → 1. 2 For the main proof, we introduce the modulus of continuity w(f, t, h) =

sup r,s≤t, |r−s|≤h

|fr − fs |,

t, h > 0.

Proof of Theorem 22.6: By Theorems 8.17 and 22.1, we may choose a Brownian motion B and some optional times 0 ≡ τ0 ≤ τ1 ≤ · · · , such that Xn = Bτn a.s. for all n, and the differences τn − τn−1 are i.i.d. with mean 1. Then τn /n → 1 a.s. by the law of large numbers, and so τ[t] /t → 1 a.s. Now (ii) follows by Lemma 22.7. Next define     δt = sup τ[s] − s, t ≥ 0, s≤t

and note that the a.s. convergence τn /n → 1 implies δt /t → 0 a.s. Fixing any t, h, ε > 0, and using the scaling property of B, we get 2

< b means a ≤ c b for some constant c > 0. Recall that a

494

Foundations of Modern Probability  



 

P t−1/2 sup Bτ[s] − Bs  > ε s≤t









≤ P w B, t + th, th > ε t1/2 + P {δt > th}



= P w(B, 1 + h, h) > ε + P {t−1 δt > h}. Here the right-hand side tends to zero as t → ∞ and then h → 0, and (i) follows. 2 For an immediate application of the last theorem, we may extend the law of the iterated logarithm to suitable random walks. Corollary 22.8 (law of the iterated logarithm, Hartman & Wintner) Let ξ1 , ξ2 , . . . be i.i.d. random variables with E ξi = 0 and E ξi2 = 1, and put Xn = ξ1 + · · · + ξn . Then Xn lim sup  = 1 a.s. n→∞ 2n log log n Proof: Combine Theorems 14.18 and 22.6.

2

The first approximation in Theorem 22.6 yields a classical weak convergence result for random walks. Here we introduce the space D [0,1] of right-continuous functions on [0, 1] with left-hand limits (rcll). For the present purposes, we equip D [0,1] with the norm x = supt |xt | and the σ-field D generated by all evaluation maps πt : x → xt . Since the norm is clearly D-measurable3 , the same thing is true for the open balls Bxr = {y; x − y < r}, x ∈ D [0,1] , r > 0. For a process X with paths in D [0,1] and a functional f : D [0,1] → R, we say that f is a.s. continuous at X if X ∈ Df a.s., where Df denotes the set of functions4 x ∈ D [0,1] , such that f is discontinuous at the path x. We may now state a version of the functional central limit theorem. Theorem 22.9 (functional central limit theorem, Donsker) Let ξ1 , ξ2 , . . . be i.i.d. random variables with E ξi = 0 and E ξi2 = 1, and define Xtn = n−1/2

Σ ξk ,

t ∈ [0, 1], n ∈ N.

k ≤ nt

Let the functional f : D [0,1] → R be measurable and a.s. continuous at the paths of a Brownian motion B on [0, 1]. Then d

f (X n ) → f (B). This follows immediately from Theorem 22.6, together with the following lemma. 3

Note, however, that D is strictly smaller than the Borel σ-field induced by the norm. The measurability of Df is irrelevant here, provided we interpret the a.s. condition in the sense of inner measure. 4

22. Skorohod Embedding and Functional Convergence

495

Lemma 22.10 (approximation and convergence) For n ∈ N, let Xn , Yn , Y be rcll processes on [0, 1] such that d

Yn = Y,

n ∈ N;

P

Xn − Yn → 0,

and let the functional f : D [0,1] → R be measurable and a.s. continuous at Y . Then d f (Xn ) → f (Y ). Proof: Put T = Q ∩ [0, 1]. By Theorem 8.17 there exist some processes d on T , such that (Xn , Y ) = (Xn , Yn ) on T for all n. Then each Xn is a.s. bounded with finitely many upcrossings of any non-degenerate interval, and ˜ n (t) = X  (t+) exists a.s. with paths in D [0,1] . From the right so the process X n d ˜n, Y ) = (Xn , Yn ) on [0, 1] for every n. continuity of paths, it is also clear that (X To obtain the desired convergence, we note that

Xn

  P ˜  d Xn − Y  = Xn − Yn → 0,

d

P

˜ n ) → f (Y ), as in Lemma 5.3. and hence f (Xn ) = f (X

2

In particular, we may recover the central limit theorem in Proposition 6.10 by taking f (x) = x1 in Theorem 22.9. We may also obtain results that go beyond the classical theory, such as by choosing f (x) = supt |xt |. For a less obvious application, we may extend the arcsine laws of Theorem 14.16 to suitable random walks. Recall that a random variable ξ is said to be arcsine distributed d if ξ = sin2 α, where α is U (0, 2π). Theorem 22.11 (arcsine laws, Erd¨ os & Kac, Sparre-Andersen) Let ξ1 , ξ2 , . . . be i.i.d., non-degenerate random variables, and put Xn = ξ1 + · · · + ξn . Define for n ∈ N τn1 = n−1 Σ 1{Xk > 0}, k≤n





τn2 = n−1 min k ≥ 0; Xk = max Xi ,



i≤n

τn3 = n−1 max k ≤ n; Xk Xn ≤ 0 , and let τ be arcsine distributed. Then d

(i) τnk → τ for k = 1, 2, 3 when E ξi = 0 and E ξi2 < ∞, d

(ii) τnk → τ for k = 1, 2 when L(ξi ) is symmetric. For the proof, we introduce on D [0,1] the functionals



f1 (x) = λ t ∈ [0, 1]; xt > 0 ,



f2 (x) = inf t ∈ [0, 1]; xt ∨ xt− = sup xs ,



s≤1

f3 (x) = sup t ∈ [0, 1]; xt x1 ≤ 0 . The following result is elementary.

496

Foundations of Modern Probability

Lemma 22.12 (continuity of functionals) The functionals fi are measurable, and • f1 is continuous at x ⇔ λ{t; xt = 0} = 0, • f2 is continuous at x ⇔ xt ∨ xt− has a unique maximum, • f3 is continuous at x, if all local extremes of xt or xt− on (0, 1] are = 0. Proof of Theorem 22.11: Clearly, τni = fi (X n ) for n ∈ N and i = 1, 2, 3, where Xtn = n−1/2 X[nt] , t ∈ [0, 1], n ∈ N. (i) By Theorems 14.16 and 22.9, it suffices to show that each fi is a.s. continuous at B. Thus, we need to verify that B a.s. satisfies the conditions in Lemma 22.12. This is obvious for f1 , since Fubini’s theorem yields



E λ t ≤ 1; Bt = 0 =



1

0

P {Bt = 0}dt = 0.

The conditions for f2 and f3 follow easily from Lemma 14.15. d

(ii) Since τn1 = τn2 by Corollary 27.8, it is enough to consider τn1 . Then introduce an independent Brownian motion B, and define σnε = n−1

Σ1 k≤n





εBk + (1 − ε)Xk > 0 ,

n ∈ N, ε ∈ (0, 1]. d

d

By (i) together with Theorem 12.12 and Corollary 27.8, we have σnε = σn1 → τ . Since P {Xn = 0} → 0, e.g. by Theorem 5.17, we also note that  

 

lim sup σnε − τn1  ≤ n−1 ε→0

Σ 1{Xk = 0} → 0. P

k≤n

P

Hence, we may choose some εn → 0 with σnεn − τn1 → 0, and Theorem 5.29 d yields τn1 → τ . 2 Results like Theorem 22.9 are called invariance principles, since the limiting distribution of f (X n ) is the same5 for all i.i.d. sequences (ξk ) with Eξi = 0 and Eξi2 = 1. This is often useful for applications, since a direct computation may be possible for a special choice of distribution, such as for P {ξk = ±1} = 12 . The approximations in Theorem 22.6 yield similar results for renewal processes, here regarded as non-decreasing step processes. Theorem 22.13 (approximation of renewal process) Let N be a renewal process based on a distribution μ with mean 1 and variance σ 2 ∈ (0, ∞). Then there exists a Brownian motion B, such that    P t−1/2 supNs − s − σBs  → 0, t → ∞, (i) s≤t

(ii)

Nt − t − σBt

lim 

t→∞ 5

= 0 a.s.

2t log log t

sometimes also referred to as universality

22. Skorohod Embedding and Functional Convergence

497

Proof: (ii) Let τ0 , τ1 , . . . be the renewal times of N , and introduce the random walk Xn = n − τn + τ0 , n ∈ Z+ . Choosing a Brownian motion B as in Theorem 22.6, we get as n → ∞ Nτn − τn − σBn 

Xn − σBn = → 0 a.s. 2n log log n 2n log log n

Since τn ∼ n a.s. by the law of large numbers, we may replace n in the denominator by τn , and by Lemma 22.7 we may further replace Bn by Bτn . Hence, Nt − t − σBt

→ 0 a.s. along (τn ). 2t log log t By Lemma 22.7 it is enough to show that τn+1 − τn  → 0 a.s., 2τn log log τn 

which is easily seen from Theorem 22.6. (i) From Theorem 22.6, we further obtain 







n−1/2 supNτk − τk − σBk  = n−1/2 supXk − τ0 − σBk  → 0, k≤n

P

k≤n

and by Brownian scaling, n−1/2 w(B, n, 1) = w(B, 1, n−1 ) → 0. d

It is then enough to show that  

 

 

 

n−1/2 supτk − τk−1 − 1 = n−1/2 supXk − Xk−1  → 0, k≤n

P

k≤n

which is again clear from Theorem 22.6.

2

Proceeding as in Corollary 22.8 and Theorem 22.9, we may deduce an associated law of the iterated logarithm and a weak convergence result. Corollary 22.14 (limits of renewal process) Let N be a renewal process based on a distribution μ with mean 1 and variance σ 2 < ∞, and let the functional f : D [0,1] → R be measurable and a.s. continuous at a Brownian motion B. Then ±(Nt − t) = σ a.s., lim sup  (i) t→∞ 2t log log t d

(ii) f (X r ) → f (B) as r → ∞, where Nrt − rt √ , t ∈ [0, 1], r > 0. Xtr = σ r Part (ii) above yields a similar result for empirical distribution functions, based on a sequence of i.i.d. random variables. The asymptotic behavior can then be expressed in terms of a Brownian bridge.

498

Foundations of Modern Probability

Theorem 22.15 (approximation of empirical distributions) Let ξ1 , ξ2 , . . . be i.i.d. random variables with ordinary and empirical distribution functions F and Fˆ1 , Fˆ2 , . . . . Then there exist some Brownian bridges B 1 , B 2 , . . . , such that 







  P sup n1/2 Fˆn (x) − F (x) − B n ◦ F (x) → 0, x

n → ∞.

(5)

Proof: Arguing as in the proof of Proposition 5.24, we may reduce to the case where the ξn are U (0, 1), and F (t) ≡ t on [0, 1]. Then clearly



n1/2 Fˆn (t) − F (t) = n−1/2

Σ k≤n





1{ξk ≤ t} − t ,

t ∈ [0, 1].

Now introduce for every n an independent Poisson random variable κn with mean n, and conclude from Proposition 15.4 that Ntn =

Σ 1{ξk ≤ t},

k ≤ κn

t ∈ [0, 1],

is a homogeneous Poisson process on [0, 1] with rate n. By Theorem 22.13, there exist some Brownian motions W n on [0, 1] with 



sup  n−1/2 (Ntn − nt) − Wtn  → 0. P

t≤1

For the associated Brownian bridges Btn = Wtn − tW1n , we get  

 

sup  n−1/2 (Ntn − tN1n ) − B tn  → 0. P

t≤1

To deduce (5), it is enough to show that  

n−1/2 sup  t≤1

Σ



k ≤ |κn −n|

 

P 1{ξk ≤ t} − t  → 0.

(6)

P

Here |κn − n| → ∞, e.g. by Proposition 6.10, and so (6) holds by Proposition 5.24 with n1/2 replaced by |κn − n|. It remains to note that n−1/2 |κn − n| is 2 tight, since E(κn − n)2 = n. We turn to a martingale version of the Skorohod embedding in Theorem 22.1 with associated approximation in Theorem 22.6. Theorem 22.16 (embedding of martingale) For a martingale (Mn ) with M0 = 0 and induced filtration (Gn ), there exists a Brownian motion B with associated optional times 0 = τ0 ≤ τ1 ≤ · · · , such that a.s. for all n ∈ N, (i) Mn = B τn , 

 



 





(ii) E Δτn  Fn−1 = E (ΔMn )2  Gn−1 ,

 





 



(iii) E (Δτn )2  Fn−1 ≤ 4 E (ΔMn )4  Gn−1 , where (Fn ) denotes the filtration induced by the pairs (Mn , τn ).

22. Skorohod Embedding and Functional Convergence

499

Proof: Let μ1 , μ2 , . . . be probability kernels satisfying 



L ΔMn | Gn−1 = μn (M1 , . . . , Mn−1 ; · ) a.s.,

n ∈ N.

(7)

Since the Mn form a martingale, we may assume that μn (x; ·) has mean 0 for ˜n (x; ·) on R2 as in Lemma 22.4, all x ∈ Rn−1 . Define the associated measures μ and conclude from the measurability part of the lemma that μ ˜ n is a probability kernel from Rn−1 to R2 . Next choose some measurable functions fn : Rn → R2 as in Lemma 4.22, such that L{fn (x, ϑ)} = μ ˜n (x, ·) when ϑ is U (0, 1). Now fix a Brownian motion B  and some independent i.i.d. U (0, 1) random variables ϑ1 , ϑ2 , . . . . Let τ0 = 0, and define recursively some random variables αn , βn , τn for n ∈ N by 



, ϑn , (αn , βn ) = fn Bτ 1 , . . . , Bτ n−1 



 τn = inf t ≥ τn−1 ; Bt − Bτ n−1 ∈ {αn , βn } . 

(8) (9)

Since B  is a Brownian motion for the filtration Bt = σ{(B  )t , (ϑn )}, t ≥ 0, and (n) each τn is B -optional, the strong Markov property shows that Bt = Bτ n +t −     Bτn is again a Brownian motion independent of Fn = σ{τk , Bτ  ; k ≤ n}. Since k also ϑn+1 ⊥ ⊥(B (n) , Fn ), we have (B (n) , ϑn+1 )⊥ ⊥Fn . Writing Gn = σ{Bτ  ; k ≤ n}, k we conclude that    Δτn+1 , ΔBτ n+1 (10) ⊥⊥ Fn .  Gn

Using (8) and Theorem 8.5, we get  









 =μ ˜n Bτ 1 , . . . , Bτ n−1 ;· . L αn , βn  Gn−1 

(11)

  Since also B (n−1) ⊥ ⊥(αn , βn , Gn−1 ), we have B (n−1) ⊥⊥ Gn−1 (αn , βn ), and B (n−1) is conditionally a Brownian motion. Applying Lemma 22.5 to the conditional  , we get by (9), (10), and (11) distributions given Gn−1











 = μn Bτ 1 , . . . , Bτ n−1 ;· , L ΔBτ n  Gn−1 

         E Δτn  Fn−1 = E Δτn  Gn−1 

 = E (ΔBτ n )2  Gn−1 ,  



 2   2  E (Δτn )  Fn−1 = E (Δτn )  Gn−1 

  ≤ 4 E (ΔBτ n )4  Gn−1 . 

(12)

(13)

(14)

Comparing (7) and (12), we get (Bτ n ) = (Mn ). By Theorem 8.17, we may then choose a Brownian motion B with associated optional times τ1 , τ2 , . . . , such that



d B, (Mn ), (τn ) = B  , (Bτ n ), (τn ) . d

All a.s. relations between the objects on the right, including their conditional expectations with respect to appropriate induced σ-fields, remain valid for the

500

Foundations of Modern Probability

objects on the left. In particular, (i) holds for all n, and (13)−(14) imply the corresponding formulas (ii)−(iii). 2 The last theorem allows us to approximate a martingale with small jumps by a Brownian motion. For discrete-time martingales M , we define the quadratic variation [M ] and predictable variation M by

Σ (ΔMk )2, 

2 n = Σ E (ΔMk )  Fk−1 , k≤n

[M ]n = M

k≤n

in analogy with the continuous-time versions in Chapters 18 and 20. Theorem 22.17 (approximation of martingales with small jumps) For n ∈ N, let M n be an Fn -martingale M n on Z+ satisfying M0n = 0, |ΔMkn | ≤ 1, n put ζn = [M ]∞ , and define

P

supk |ΔMkn | → 0,

(15)



Xtn = Σ k ΔMkn 1 [M n ]k ≤ t ,

t ∈ [0, 1], n ∈ N. Then P (i) (X n − B n )∗ζn ∧1 → 0 for some Brownian motions B n , (ii) claim (i) remains true with [M n ] replaced by M n , (iii) the third condition in (15) can be replaced by

Σk P



 



P

n |ΔMkn | > ε  Fk−1 → 0,

ε > 0.

(16)

For the proof, we need to show that the time scales given by the sequences (τkn ), [M n ], and M n are asymptotically equivalent. Lemma 22.18 (time-scale comparison) In Theorem 22.17, let Mkn = B n (τkn ) a.s. for some Brownian motions B n with associated optional times τkn as in Theorem 22.16, and put

κnt = inf k; [M n ]k > t , t ≥ 0, n ∈ N. Then as n → ∞ for fixed t > 0,  



 





 sup τkn − [M n ]k  ∨ [M n ]k − M n k  → 0. P

(17)

k≤κn t

Proof: By optional stopping, we may assume [M n ] to be uniformly bounded, and take the supremum in (17) over all k. To handle the second difference in (17), we note that Dn = [M n ] − M n is a martingale for each n. Using the martingale property, Proposition 9.17, and dominated convergence, we get supk E(Dkn )2 E(Dn )∗2 <

= = ≤

n 2 ) Σ k E(ΔD  k

 n Σ k E E (ΔDkn)2 F k−1 

2 n Σ k E E Δ[M n]k  Fk−1 E Σ k (ΔMkn )4

= < E supk (ΔMkn )2 → 0,

22. Skorohod Embedding and Functional Convergence

501

and so (Dn )∗ → 0. This clearly remains true if each sequence M n is defined in terms of the filtration G n induced by M n . To complete the proof of (17), it is enough to show, for the latter versions P of M n , that (τ n − M n )∗ → 0. Then write T n for the filtration induced by n n the pairs (Mk , τk ), k ∈ N, and conclude from Theorem 22.16 (ii) that P

M n

m

=







n Σ E Δτkn  Tk−1 k≤m

m, n ∈ N.

,

˜ n = τ n − M n is a T n -martingale. Using Theorem 22.16 (ii)−(iii), Hence, D we get as before ˜ n )2 ˜ n )∗2 < sup k E (D E (D

 k

˜ n )2  T n = Σ E E (ΔD k  k−1

n 2 n k E E (Δτk )  T  k−1

n 4 n k E E (ΔMk )  Gk−1 n 4 k (ΔMk ) sup k (ΔMkn )2 → 0. k

≤ <

Σ Σ

= E < E



Σ

2

The sufficiency of (16) is a consequence of the following simple estimate. Lemma 22.19 (Dvoretzky) For any filtration F on Z+ and sets An ∈ Fn , n ∈ N, we have P





∪ n An ≤ P Σ n P (An| Fn−1) > ε

+ ε,

ε > 0.

Proof: Write ξn = 1An and ξˆn = P (An | Fn−1 ), fix any ε > 0, and define τ = inf{n; ξˆ1 + · · · + ξˆn > ε}. Then {τ ≤ n} ∈ Fn−1 for each n, and so E

Hence, P





Σ ξ n = Σ n E ξ n ; τ > n  n n = E Σ ξˆn ≤ ε. n h}.

P

Since δn → 0 by Lemma 22.18, the right-hand side tends to zero as n → ∞ and then h → 0, and the assertion follows. In case of the time scales M n , define κn = inf{k; [M n ] > 2}. Then P [M n ]κn − M n κn → 0 by Lemma 22.18, and so

P M n



< 1, κn < ∞ → 0.

κn

We may then reduce by optional stopping to the case where [M n ] ≤ 3. The proof may now be completed as before. 2 Though the Skorohod embedding has no natural extension to higher dimensions, we can still derive some multi-variate approximations by applying the previous results separately in each component. To illustrate the method, we show how suitable random walks in Rd can be approximated by continuous processes with stationary, independent increments. Extensions to more general limits are obtained by different methods in Corollary 16.17 and Theorem 23.14. Theorem 22.20 (approximation of random walks in Rd ) Let X 1 , X 2 , . . . be w n ) → N (0, σσ  ), for some d × d matrix σ and random walks in Rd with L(Xm n n n . Then there exist some Brownian integers mn → ∞, and define Yt = X[m n t] 1 2 d motions B , B , . . . in R , such that 

Y n − σBn

Proof: Since by Theorem 6.16,  

∗

P

t

→ 0,

 

P

t ≥ 0.

max ΔXkn  → 0,

t ≥ 0,

k≤mn t

we may assume that |ΔXkn | ≤ 1 for all n and k. Subtracting the means, we may further take EXkn ≡ 0. Applying Theorem 22.17 in each coordinate yields P w(Y n , t, h) → 0, as n → ∞ and then h → 0. Furthermore, w(σB, t, h) → 0 a.s. as h → 0. d Applying Theorem 6.16 in both directions yields Ytnn → σBt as tn → t. d Hence, by independence, (Ytn1 , . . . , Ytnm ) → σ(Bt1 , . . . , Btm ) for all n ∈ N and d

t1 , . . . , tn ≥ 0, and so Y n → σB on Q+ by Theorem 5.30. By Theorem 5.31, or even simpler by Corollary 8.19 and Theorem A5.4, there exist some rcll d processes Y˜ n = Y n with Y˜tn → σBt a.s. for all t ∈ Q+ . For any t, h > 0, E



Y˜ n − σB

∗ t



∧1 ≤ E







 n  max Y˜jh − σBjh  ∧ 1

j≤t/h











+ E w Y˜ n , t, h ∧ 1 + E w(σB, t, h) ∧ 1 .

22. Skorohod Embedding and Functional Convergence

503

Multiplying by e−t , integrating over t > 0, and letting n → ∞ and then h → 0 along Q+ , we get by dominated convergence

0



e−t E



Y˜ n − σB

∗ t



∧ 1 dt → 0.

By monotonicity, the last integrand tends to zero as n → ∞, and so (Y˜ n −σB)∗t P → 0 for each t > 0. Now use Theorem 8.17. 2

Exercises 1. Show that Theorem 22.6 yields a purely probabilistic proof of the central limit theorem, with no need of characteristic functions. 2. Given a Brownian motion B, put τ0 = 0, and define recursively τn = inf{t > τn−1 ; |Bτn − Bτn−1 | = 1}. Show that EΔτn = ∞ for all n. (Hint: We may use Proposition 12.14.) 3. Construct some Brownian martingales as in Lemma 22.2 with leading terms Bt3 and Bt5 . Use multiple Wiener–Itˆo integrals to give an alternative proof of the lemma. For every n ∈ N, find a martingale with leading term Btn . (Hint: Use Theorem 14.25.) 4. For a Brownian motion B with associated optional time τ < ∞, show that Eτ ≥ EBτ2 . (Hint: Truncate τ and use Fatou’s lemma.) 5. For Xn as in Corollary 22.8, show that the sequence of random variables (2n log log n)−1/2 Xn , n ≥ 3, is a.s. relatively compact with set of limit points [−1, 1]. (Hint: Prove the corresponding property for Brownian motion, and use Theorem 22.6.) 6. Let ξ1 , ξ2 , . . . be i.i.d. random vectors in Rd with mean 0 and covariances δij . Show that Corollary 22.8 remains true with Xn replaced by |Xn |. More precisely, show that the sequence (2n log log n)−1/2 Xn , n ≥ 3, is relatively compact in Rd , with all limit points contained in the closed unit ball. (Hint: Apply Corollary 22.8 to the projections u · Xn for arbitrary u ∈ Rd with |u| = 1.) 7. In Theorem 14.18, show that for any c ∈ (0, 1) we may choose tn → ∞, such that a.s. the limsup along (tn ) equals c. Conclude that the set of limit points in the preceding exercise agrees with the closed unit ball in Rd . 

 



8. Condition (16) clearly follows from $\sum_k E\big[\,|\Delta M^n_k| \wedge 1 \,\big|\, \mathcal{F}^n_{k-1}\big] \xrightarrow{P} 0$. Show by an example that the converse implication is false. (Hint: Consider a sequence of random walks.)

9. Specialize Lemma 22.18 to random walks, and give a direct proof in that case. 10. For random walks, show that condition (16) is also necessary for the approximation in Theorem 22.17. (Hint: Use Theorem 6.16.) 11. Specialize Theorem 22.17 to random walks in R, and derive a corresponding extension of Theorem 22.9. Then derive a functional version of Theorem 6.13. 12. Specialize further to successive renormalizations of a single random walk Xn . Then derive a limit theorem for the values at t = 1, and compare with Proposition 6.10.


13. In the second arcsine law of Theorem 22.11, show that the first maximum on [0, 1] can be replaced by the last one. Conclude that the associated times σn and P τn satisfy τn − σn → 0. (Hint: Use the corresponding result for Brownian motion. Alternatively, use the symmetry of both (Xn ) and the arcsine distribution.) 14. Extend Theorem 22.11 to symmetric random walks satisfying a Lindeberg condition. Further extend the results for τn1 and τn2 to random walks based on diffuse, symmetric distributions. Finally, show that the result for τn3 may fail in the latter case. (Hint: Consider the n−1 -increments of a compound Poisson process based on the uniform distribution on [−1, 1], perturbed by a small diffusion term εn B, where B is an independent Brownian motion.) 15. In the context of Theorem 22.20 and for a Brownian motion B, show that we d can choose Y n = X n with (Y n − σB)∗t → 0 a.s. for all t ≥ 0. Prove a corresponding version of Theorem 22.17. (Hint: Use Theorem 5.31 or Corollary 8.19.)

Chapter 23

Convergence in Distribution Tightness and relative compactness, convergence and tightness in function spaces, convergence of continuous and rcll processes, functional central limit theorem, moments and tightness, optional equi-continuity and tightness, approximation of random walks, tightness and convergence of random measures, existence via tightness, vague and weak convergence, Cox and thinning continuity, strong Cox continuity, measure-valued processes, convergence of random sets and point processes

Distributional convergence is another of those indispensable topics, constantly used throughout modern probability. The basic notions were introduced already in Chapter 5, and in Chapter 6 we proved some fundamental limit theorems for sums of independent random variables. In the previous chapter we introduced the powerful method of Skorohod embedding, which enabled us to establish some functional versions of those results, where the Gaussian limits are replaced by entire paths of Brownian motion and related processes. Despite the amazing power of the latter technique, its scope of applicability is inevitably restricted by the limited class of possible limit laws. Here we turn to a systematic study of general weak convergence theory, based on some fundamental principles of compactness in appropriate function and measure spaces. In particular, some functional limit theorems, derived in the last chapter by cumbersome embedding and approximation techniques, will now be accessible by straightforward compactness arguments. The key result is Prohorov’s theorem, which gives the basic connection between tightness and relative distributional compactness. This result enables us to convert some classical compactness criteria into similar probabilistic versions. In particular, the Arzel`a–Ascoli theorem yields a corresponding criterion for distributional compactness of continuous processes. Similarly, an optional equi-continuity condition guarantees the appropriate compactness, for rightcontinuous processes with left-hand limits1 . We will also derive some general criteria for convergence in distribution of random measures and sets, with special emphasis on the point process case. The general criteria will be applied to some interesting concrete situations. In addition to some already familiar results from Chapters 22, we will derive convergence criteria for Cox processes and thinnings of point processes. Further applications appear in other chapters, such as a general approximation result 1

henceforth referred to as rcll processes


for Markov chains in Chapter 17, various limit theorems for exchangeable processes and particle systems in Chapters 27 and 30, and a way of constructing weak solutions to SDEs in Chapter 32. To avoid overloading this chapter with technical detail, we outline some of the underlying analysis in Appendices 5–6. Beginning with continuous processes, we fix some separable and complete metric spaces (T, d), (S, ρ), where T is also locally compact, and consider the space $C_{T,S}$ of continuous functions from T to S, endowed with the locally uniform topology. The associated notion of convergence, written as $x_n \xrightarrow{ul} x$, is given by

$\sup_{t \in K} \rho(x_{n,t}, x_t) \to 0, \qquad K \in \mathcal{K}_T, \qquad (1)$

where KT denotes the class of compact sets K ⊂ T . Since T is locally compact, we may choose K1 , K2 , . . . ∈ KT with Kno ↑ T , and it is enough to verify (1) for all Kn . Using such a sequence, we may easily construct a separable and complete metrization of the topology. In Lemma A5.1 we show that the Borel σ-field in CT,S is generated by the evaluation maps πt : x → xt , t ∈ T . This is significant for our purposes, since the random elements in CT,S are then precisely the continuous processes on T taking values in S. Given any such processes X and X1 , X2 , . . . , we need to find tractable uld conditions for the associated convergence in distribution, written as Xn −→ X. This mode of convergence is clearly strictly stronger than the finite-dimensional fd convergence Xn −→ X, defined by2 d

(Xtn1 , . . . , Xtnm ) → (Xt1 , . . . , Xtm ),

t1 , . . . , tm ∈ T, m ∈ N.

(2)

This is clear already in the non-random case, since the point-wise convergence xnt → xt in CT,S may not be locally uniform. The classical way around this difficulty is to require the functions xn to be locally equi-continuous. Formally, we may impose suitable conditions on the associated local moduli of continuity



$w_K(x, h) = \sup\big\{\rho(x_s, x_t);\ s, t \in K,\ d(s, t) \le h\big\}, \qquad h > 0,\ K \in \mathcal{K}_T. \qquad (3)$
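As a concrete illustration (not part of the text), the sketch below evaluates a discrete version of the modulus $w_K(x, h)$ in (3) on a grid, for $K = [0, 1]$ and $S = \mathbb{R}$, and shows how a pointwise bounded family such as $x_n(t) = \sin(nt)$ fails the equi-continuity condition appearing in Lemma 23.1 below. The grid and test functions are arbitrary choices.

```python
import numpy as np

def modulus(x, times, h):
    """Discrete version of w_K(x, h) in (3): the largest |x_s - x_t| over
    grid points s, t in K = [0, 1] with |s - t| <= h (times must be sorted)."""
    w = 0.0
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            if times[j] - times[i] > h:
                break
            w = max(w, abs(x[j] - x[i]))
    return w

# x_n(t) = sin(n t) is pointwise bounded but not equi-continuous:
# w_K(x_n, h) does not become small uniformly in n as h -> 0.
times = np.linspace(0.0, 1.0, 2001)
for n in [1, 10, 100]:
    x = np.sin(n * times)
    print(n, round(modulus(x, times, h=0.05), 3))
```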

Using the Arzel`a–Ascoli compactness criterion, in the version of Theorem A5.2, we may easily characterize convergence in CT,S . Though this case is quite elementary, it is discussed in detail below, since it sets the pattern for the more advanced criteria for distributional convergence, considered throughout the remainder of the chapter. Lemma 23.1 (convergence in CT,S ) Let x, x1 , x2 , . . . be continuous, S-valued functions on T , where S, T are separable, complete metric spaces and T is ul locally compact with a dense subset T  . Then xn −→ x in C T,S iff 2

To avoid double subscripts, we may sometimes write xn , Xn as xn , X n .


(i) $x^n_t \to x_t$ for all $t \in T'$,
(ii) $\lim_{h \to 0} \limsup_{n \to \infty} w_K(x^n, h) = 0, \qquad K \in \mathcal{K}_T.$

Proof: If xn −→ x, then (i) holds by continuity and (ii) holds by Theorem A5.2. Conversely, assume (i)−(ii). Then (xn ) is relatively compact by Theoul rem A5.2, and so for any sub-sequence N  ⊂ N we have convergence xn −→ y along a further sub-sequence N  . Then by continuity, xn (t) → y(t) along N  for all t ∈ T , and so (i) yields x(t) = y(t) for all t ∈ T  . Since x and y are ul continuous and T  is dense in T , we get x = y. Hence, xn −→ x along N  ,  2 which extends to N since N was arbitrary. The approach for random processes is similar, except that here we need to replace the compactness criterion in Theorem A5.2 by a criterion for compactness in distribution. The key result is Prohorov’s Theorem 23.2 below, which relates compactness in a given metric space S to weak compactness 3 in the ˆ S of probability measures on S. For most applications of associated space M interest, we may choose S to be a suitable function or measure space. To make this precise, let Ξ be a family of random elements in a metric space S. Generalizing a notion from Chapter 5, we say that Ξ is tight if / K} = 0. inf sup P {ξ ∈

K∈KS ξ∈Ξ

(4)

We also say that Ξ is relatively compact in distribution, if every sequence of random elements ξ1 , ξ2 , . . . ∈ Ξ has a sub-sequence converging in distribution toward a random element ξ in S. We may now state the fundamental equivalence between tightness and relative compactness, for random elements in sufficiently regular metric spaces. This extends the version for Euclidean spaces in Proposition 6.22. Theorem 23.2 (tightness and relative compactness, Prohorov) For a set Ξ of random elements in a metric space S, we have (i) ⇒ (ii) with (i) Ξ is tight, (ii) Ξ is relatively compact in distribution, and equivalence holds when S is separable and complete. In particular, we see that when S is separable and complete, a single random / K} = 0. For sequences element ξ in S is tight, in the sense that inf K P {ξ ∈ we may then replace the ‘sup’ in (4) by ‘lim sup’. For the proof of Theorem 23.2, we need a simple lemma. Recall from Lemma 1.6 that a random element in a sub-space of a metric space S may also be regarded as a random element in S. 3 ˆ S are both metric spaces, and compactness For all applications in this book, S and M in the topological sense is equivalent to sequential compactness.


Lemma 23.3 (tightness preservation) Tightness is preserved by continuous maps. Thus, if S is a metric space and Ξ is a set of random elements in a subset A ⊂ S, we have Ξ is tight in A



Ξ is tight in S.

Proof: Compactness is preserved by continuous maps. This applies in particular to the natural embedding I : A → S. 2 Proof of Theorem 23.2 (Varadarajan): For S = Rd , the result was proved in Proposition 6.22. Turning to the case of S = R∞ , consider a tight sequence of random elements ξ n = (ξ1n , ξ2n , . . .) in R∞ . Writing ηkn = (ξ1n , . . . , ξkn ), we see from Lemma 23.3 that the sequence (ηkn ; n ∈ N) is tight in Rk for each k ∈ N. d For any sub-sequence N  ⊂ N, a diagonal argument yields ηkn → some ηk for all k ∈ N, as n → ∞ along a further sub-sequence N  . The sequence {L(ηk )} is projective, since the coordinate projections are continuous, and so Theorem d 8.21 yields a random sequence ξ = (ξ1 , ξ2 , . . .) with (ξ1 , . . . , ξk ) = ηk for all k. f d d But then ξ n −→ ξ along N  , and so Theorem 5.30 gives ξ n → ξ along the same sequence. Next let S ⊂ R∞ . If (ξn ) is tight in S, then by Lemma 23.3 it remains tight d as a sequence in R∞ . Hence, for any sequence N  ⊂ N, we have ξn → some ∞  ξ in R along a further sub-sequence N . To extend the convergence to S, it suffices by Lemma 5.26 to check that ξ ∈ S a.s. Then choose some compact sets Km ⊂ S with lim inf n P {ξn ∈ Km } ≥ 1 − 2−m for all m ∈ N. Since the Km remain closed in R∞ , Theorem 5.25 yields P {ξ ∈ Km } ≥ lim sup P {ξn ∈ Km } n∈N 

≥ lim inf P {ξn ∈ Km } n→∞



≥ 1 − 2−m ,

and so ξ ∈ m Km ⊂ S a.s. Now let S be σ-compact. It is then separable and therefore homeomorphic to a subset A ⊂ R∞ . By Lemma 23.3, the tightness of (ξn ) carries over to the image sequence (ξ˜n ) in A, and by Lemma 5.26 the possible relative compactness of (ξ˜n ) implies the same property for (ξn ). This reduces the discussion to the previous case. Now turn to the general case. If (ξn ) is tight, there exist some compact sets Km ⊂ S with lim inf n P {ξn ∈ Km } ≥ 1 − 2−m . In particular, P {ξn ∈ A} → 1  with A = m Km , and so P {ξn = ηn } → 1 for some random elements ηn in A. Then (ηn ) is again tight, even as a sequence in A, and since A is σ-compact, we see as before that (ηn ) is relatively compact as a sequence in A. By Lemma 5.26 it remains relatively compact in S, and by Theorem 5.29 the relative compactness carries over to (ξn ). To prove the converse, let S be separable and complete, and suppose that (ξn ) is relatively compact. For any r > 0, we may cover S by some open balls


$B_1, B_2, \ldots$ of radius r. Writing $G_k = B_1 \cup \cdots \cup B_k$, we claim that

$\lim_{k \to \infty} \inf_n P\{\xi_n \in G_k\} = 1. \qquad (5)$

If this fails, we may choose some integers nk ↑ ∞ with supk P {ξnk ∈ Gk } = c < d 1. The relative compactness yields ξnk → some ξ along a sub-sequence N  ⊂ N, and so

$P\{\xi \in G_m\} \le \liminf_{k \in N''} P\{\xi_{n_k} \in G_k\} \le c < 1, \qquad m \in \mathbb{N}.$

As m → ∞, we get the contradiction 1 < 1, proving (5). Now take $r = m^{-1}$, and write $G^m_k$ for the corresponding sets $G_k$. For any ε > 0, (5) yields some $k_1, k_2, \ldots \in \mathbb{N}$ with

$\inf_n P\{\xi_n \in G^m_{k_m}\} \ge 1 - \varepsilon\, 2^{-m}, \qquad m \in \mathbb{N}.$

 ¯ Writing A = m Gm km , we get inf n P {ξn ∈ A} ≥ 1 − ε. Since A is complete and 2 totally bounded, hence compact, it follows that (ξn ) is tight.

For applications of the last theorem, we need to find effective criteria for tightness. Beginning with the space CT,S , we may convert the classical Arzel`a– Ascoli compactness criterion into a condition for tightness, stated in terms of the local moduli of continuity in (3). Note that wK (x, h) is continuous in x for fixed h > 0, and hence depends measurably on x. Theorem 23.4 (tightness in CT,S , Prohorov) Let Ξ be a set of continuous, S-valued processes on T , where S, T are separable, complete metric spaces and T is locally compact with a dense subset T  . Then Ξ is tight in CT,S iff (i) πt Ξ is tight in S, t ∈ T  , (ii) lim sup E wK (X, h) ∧ 1 = 0, h→0 X∈Ξ

$K \in \mathcal{K}_T$,
in which case
(iii) $\bigcup_{t \in K} \pi_t\, \Xi$ is tight in S, $\quad K \in \mathcal{K}_T$.

Proof: Let Ξ be tight in CT,S , so that for every ε > 0 there exists a compact / A} ≤ ε for all X ∈ Ξ. Then for any K ∈ KT and set A ⊂ CT,S with P {X ∈

X ∈ Ξ, P ∪ {Xt } ∈ / ∪ πt A ≤ P {X ∈ / A} ≤ ε, X ∈ Ξ, t∈K



t∈K

and (iii) follows since t∈K πt A is relatively compact by Theorem A5.2 (iii). The same theorem yields an h > 0 with wK (x, h) ≤ ε for all x ∈ A, and so for any X ∈ Ξ,







E wK (X, h) ∧ 1 ≤ ε + P wK (X, h) > ε ≤ 2ε, which implies (ii), since ε > 0 was arbitrary. Finally, note that (iii) ⇒ (i).


Conversely, assume (i)−(ii). Let t1 , t2 , . . . ∈ T  be dense in T , and choose K1 , K2 , . . . ∈ KT with Kno ↑ T . For any ε > 0, (i) yields some compact sets B1 , B2 , . . . ⊂ S with



/ Bm ≤ 2−m ε, P X(tm ) ∈

m ∈ N, X ∈ Ξ,

(6)

and by (ii) there exist some hnk > 0 with



P wKn (X, hnk ) > 2−k ≤ 2−n−k ε,

n, k ∈ N, X ∈ Ξ.

(7)

By Theorem A5.2, the set

∩ m,n,k

Aε =



x ∈ CT,S ; x(tm ) ∈ Bm , wKn (x, hnk ) ≤ 2−k



is relatively compact in CT,S , and for any X ∈ Ξ we get by (6) and (7) P {X ∈ / Aε } ≤ Σ n,k 2−n−k ε + Σ m 2−m ε = 2 ε. Since ε > 0 was arbitrary, Ξ is then tight in CT,S .

2

We may now characterize convergence in distribution of continuous, Svalued processes on T , generalizing the elementary Lemma 23.1 for non-random uld functions. Write Xn −→ X for convergence in distribution in C T,S with respect to the locally uniform topology. Corollary 23.5 (convergence of continuous processes, Prohorov) Let X, X1 , X2 , . . . be continuous, S-valued processes on T , where S, T are separable, complete metric spaces and T is locally compact with a dense subset T  . Then uld Xn −→ X iff fd

(i) Xn −→ X on T  ,

(ii) lim lim sup E wK (Xn , h) ∧ 1 = 0, h→0

n→∞

K ∈ KT .

(8)

uld

Proof: If Xn −→ X, then (i) holds by Theorem 5.27 and (ii) by Theorems 23.2 and 23.4. Conversely, assume (i)−(ii). Then (Xn ) is tight by Theorem 23.4, and so by Theorem 23.2 it is relatively compact in distribution. For any uld sub-sequence N  ⊂ N, we get Xn −→ some Y along a further sub-sequence N  . fd fd Then Theorem 5.27 yields Xn −→ Y along N  , and so by (i) we have X = Y d on T  , which extends to T by Theorem 5.27. Hence, X = Y by Proposition uld 4.2, and so Xn −→ X along N  , which extends to N since N  was arbitrary. 2 We illustrate the last result by proving a version of Donsker’s Theorem 22.9 in Rd . Since Theorem 23.5 applies only to processes with continuous paths, we need to replace the original step processes by suitably interpolated versions. In Theorem 23.14 below, we will see how such an interpolation can be avoided.


Theorem 23.6 (functional central limit theorem, Donsker) Let $\xi_1, \xi_2, \ldots$ be i.i.d. random vectors in $\mathbb{R}^d$ with mean 0 and covariances $\delta_{ij}$, form the continuous processes

$X^n_t = n^{-1/2} \Big( \sum_{k \le nt} \xi_k + (nt - [nt])\, \xi_{[nt]+1} \Big), \qquad t \ge 0,\ n \in \mathbb{N},$

lim r2 lim sup P Xn∗ ≥ r n = 0, r→∞

n→∞

which implies lim h−1 lim sup sup P

h→0

n→∞



t≤1



n sup |Xt+r − Xtn | > ε = 0.

0≤r≤h

Now (8) follows easily, as we split [0, 1] into sub-intervals of length ≤ h.

2

The Kolmogorov–Chentsov criterion in Theorem 4.23 can be converted into a sufficient condition for tightness in CRd ,S . An important application appears in Theorem 32.9. Theorem 23.7 (moment criteria and H¨older continuity) Let X 1 , X 2 , . . . be continuous processes on Rd with values in a separable, complete metric space (S, ρ). Then (X n ) is tight in CRd ,S , whenever (X0n ) is tight in S and 



|s − t| d+b , sup E  ρ(Xsn , Xtn )  <

a

n

s, t ∈ Rd ,

(9)

for some a, b > 0. The limiting processes are then a.s. locally H¨older continuous with exponent c, for any c ∈ (0, b/a). Proof: For every X n , we may define the associated quantities ξnk as in the a < −kb 2 . Hence, Lemma 1.31 proof of Theorem 4.23, and we see that E ξnk

yields for m, n ∈ N    a∧1 < wK (Xn , 2−m )a Σ ξnk a∧1 a

k≥m

<

Σ 2−kb/(a∨1)

k≥m −mb/(a∨1)

< 2

,

which implies (8). Condition (9) extends by Lemma 5.11 to any limiting process X, and the last assertion follows by Theorem 4.23. 2 For processes with possible jump discontinuities, the theory is similar but more complicated. Here we consider the space D R+ ,S of rcll4 functions on R+ , 4

right-continuous with left-hand limits

512

Foundations of Modern Probability

taking values in a separable, complete metric space (S, ρ). For any ε, t > 0, such a function x has clearly at most finitely many jumps of size ρ(xt , xt− ) > ε up to time t. Though it still makes sense to consider the locally uniform topology with ul associated mode of convergence xn −→ x, it is technically convenient and more useful for applications to allow appropriate time changes, given by increasing bijections on R+ . Note that such functions λ are continuous and strictly increasing with λ0 = 0 and λ∞ = ∞. We define the Skorohod convergence s xn → x by the requirements ul

λn −→ ι,

ul

xn ◦ λn −→ x,

for some increasing bijections λn on R+ , where ι denotes the identity map on R+ . This mode of convergence generates a Polish topology on D R+ ,S , known as Skorohod’s J1 -topology, whose basic properties are listed in Lemma A5.3. In particular, the associated Borel σ-field is again generated by the evaluation maps πt , and so the random elements in D R+ ,S are precisely the rcll processes in S. The associated compactness criteria in Theorem A5.4 are analogous to the Arze`a–Ascoli conditions in Theorem A5.2, except for being based on the modified local moduli of continuity w˜t (x, h) = inf max sup ρ(xr , xs ), (Ik )

k

t, h > 0,

(10)

r,s∈Ik

where the infimum extends over all partitions of the interval [0, t) into sub˜t (x, h) → 0 intervals Ik = [u, v), such that v − u ≥ h when v < t. Note that w as h → 0 for fixed x ∈ D R+ ,S and t > 0. In order to derive criteria for convergence in distribution in DR+ ,S , corresponding to the conditions of Theorem 23.5 for processes in CT,S , we first need a convenient criterion for tightness. Theorem 23.8 (tightness in D R+ ,S ) Let Ξ be a set of rcll processes on R+ with values in a separable, complete metric space S, and fix a dense set T ⊂ R+ . Then Ξ is tight in D R+ ,S iff (i) πt Ξ is tight in S, t ∈ T , (ii) lim sup E w˜t (X, h) ∧ 1 = 0, h→0 X∈Ξ

in which case (iii) ∪ πs Ξ is tight in S, s≤t

t > 0,

t ≥ 0.

Proof: Same as for Theorem 23.4, except that now we need to use Theorem A5.4 in place of Theorem A5.2. 2 We may now characterize convergence in distribution of S-valued, rcll processes on R+ . In particular, the conditions simplify and lead to stronger conclusd uld sions, when the limiting process is continuous. Write Xn −→ X and Xn −→ X for convergence in distribution in D R+ ,S with respect to the Skorohod and locally uniform topologies.

23. Convergence in Distribution

513

Theorem 23.9 (convergence of rcll processes, Skorohod) Let X, X1 , X2 , . . . be rcll processes on R+ with values in a separable, complete metric space S. Then fd

sd

(i) Xn −→ X, iff Xn −→ X on a dense set T ⊂ {t ≥ 0; Xt = Xt− a.s.}

and lim lim sup E w˜t (Xn , h) ∧ 1 = 0, t > 0, (11) h→0

n→∞

fd

uld

(ii) Xn −→ X with X a.s. continuous, iff Xn −→ X on a dense set T ⊂ R+

and lim lim sup E wt (Xn , h) ∧ 1 = 0, t > 0, h→0

n→∞

(iii) for a.s. continuous X, sd

Xn −→ X



uld

Xn −→ X.

The present proof is more subtle than for continuous processes. For motivation and technical convenience, we begin with the non-random case. Lemma 23.10 (convergence in D R+ ,S ) Let x, x1 , x2 , . . . be rcll functions on R+ with values in a separable, complete metric space S. Then s

(i) xn → x, iff xnt → xt for all t in a dense set T ⊂ {t ≥ 0; xt = xt− } and lim lim sup w˜t (xn , h) = 0,

h→0

t > 0,

n→∞

ul

(ii) xn −→ x with x continuous, iff xnt → xt for all t in a dense set T ⊂ R+ and lim lim sup wt (xn , h) = 0, t > 0, h→0

(iii) for continuous x,

n→∞ s

xn → x



ul

xn −→ x.

s

Proof: (i) If xn → x, then xnt → xt for all t with xt = xt− , and the displayed condition holds by Theorem A5.4. Conversely, assume the stated conditions, and conclude from the same theorem that (xn ) is relatively compact in D R+ ,S . s Then for any sub-sequence N  ⊂ N, we have xn → some y along a further sub-sequence N  . To see that x = y, fix any t > 0 with yt = yt− , and choose tk ↓ t in T with xtk = xtk − . Then 







ρ(xt , yt ) ≤ ρ(xt , xtk ) + ρ xtk , xntk + ρ xntk , yt , which tends to 0 as n → ∞ along N  and then k → ∞, by the convergence s xn → y, the point-wise convergence xn → x on T , and the right continuity of x. Hence, xt = yt on the continuity set of y. Since the latter set is dense in s R+ and both functions are right-continuous, we obtain x = y, and so xn → x along N  . This extends to N, since N  was arbitrary. s

ul

(iii) The claim ‘⇐’ is obvious. Now let xn → x, so that λn −→ ι and ul ul xn ◦ λn −→ x for some λn . Since x is continuous, we have also x ◦ λn −→ x, ul ul and so by combination ρ(xn ◦ λn , x ◦ λn ) −→ 0, which implies xn −→ x.

514

Foundations of Modern Probability

(ii) Writing Jt (x) = sups≤t ρ(xs , xs− ), we note that for any t, h > 0, w˜t (x, h) ∨ Jt (x) ≤ wt (x, h) ≤ 2w ˜t (x, h) + Jt (x).

(12)

ul

Now let xn −→ x with x continuous. Then xnt → xt and Jt (xn ) → 0 for all t > 0, and so the stated conditions follow by (i) and (12). Conversely, assume s the two conditions. Then by (12) and (i) we have xn → x, and (12) also shows n that Jt (x ) → 0, which yields the continuity of x. By (iii) the convergence ul 2 may then be strengthened to xn −→ x. sd

Proof of Theorem 23.9: (i) Let Xn −→ X. Then Theorem 5.27 yields fd Xn −→ X on {t ≥ 0; Xt = Xt− a.s.}. Furthermore, (Xn ) is tight in D R+ ,S by Theorem 23.2, and (11) follows by Theorem 23.8 (ii). Conversely, assume the stated conditions. Then (Xn,t ) is tight in S for every t ∈ T by Theorem 23.2, and so (Xn ) is tight in D R+ ,S by Lemma 23.8. Assuming T to be countable and writing Xn,T = {Xn,t ; t ∈ T }, we see that even (Xn , Xn,T ) is tight in D R+ ,S × S T . By Theorem 23.2 the sequence (Xn , Xn,T ) is then relatively compact in distribution, and so for any sub-sequence N  ⊂ N sd we have convergence (Xn , Xn,T ) −→ (Y, Z) along a further sub-sequence N  , d for an S-valued process (Y, Z) on R+ ∪ T . Since also Xn,T → XT , we may take s d Z = XT . By Theorem 5.31, we may choose Xn = Xn such that a.s. Xn → Y    s and Xn,t → Xt along N for all t ∈ T . Hence, Lemma 23.10 (i) yields Xn → X a.s., and so Xn −→ X along N  , which extends to N since N  was arbitrary. sd

sd

(iii) Let Xn −→ X for a continuous X. By Theorem 5.31 we may choose s ul d Xn = Xn with Xn → X a.s. Then Lemma 23.10 (iii) yields Xn −→ X a.s., uld which implies Xn −→ X. The converse claim is again obvious. (ii) Use (iii) and Theorem 5.31 to reduce to Lemma 23.10 (ii).

2

Tightness in D R+ ,S is often verified most easily by means of the following sufficient condition. Given a process X, we say that a random time is Xoptional if it is optional with respect to the filtration induced by X. Theorem 23.11 (optional equi-continuity and tightness, Aldous) Let X 1 , X 2 , . . . be rcll processes in a metric space (S, ρ). Then (11) holds, if for any bounded X n -optional times τn and positive constants hn → 0, 



P

ρ Xτnn , Xτnn +hn → 0,

n → ∞.

(13)

Our proof will be based on two lemmas, beginning with a restatement of condition (13).

23. Convergence in Distribution

515

Lemma 23.12 (equi-continuity) The condition in Theorem 23.11 holds iff



lim lim sup sup E ρ(Xσn , Xτn ) ∧ 1 = 0,

h→0

n→∞

t > 0,

(14)

σ,τ

where the supremum extends over all X n -optional times σ, τ ≤ t with σ ≤ τ ≤ σ + h. Proof: Replacing ρ by ρ ∧ 1 if necessary, we may assume that ρ ≤ 1. The condition in Theorem 23.11 is then equivalent to lim lim sup sup sup Eρ(Xτn , Xτn+h ) = 0,

δ→0

τ ≤t h∈[0,δ]

n→∞

t > 0,

where the outer supremum extends over all X n -optional times τ ≤ t. To deduce (14), suppose that 0 ≤ τ − σ ≤ δ. Then [τ, τ + δ] ⊂ [σ, σ + 2δ], and so by the triangle inequality and a simple substitution, δ

δ ρ(Xσ , Xτ ) ≤



≤ Thus,

0



ρ(Xσ , Xτ +h ) + ρ(Xτ , Xτ +h ) dh

2δ 0

ρ(Xσ , Xσ+h ) dh +

δ

ρ(Xτ , Xτ +h ) dh.

0

sup Eρ(Xσ , Xτ ) ≤ 3 sup sup Eρ(Xτ , Xτ +h ), σ,τ

τ

h∈[0,2δ]

where the suprema extend over all optional times τ ≤ t and σ ∈ [τ − δ, τ ]. 2 We also need the following elementary estimate. Lemma 23.13 (exponential bound) For any random variables ξ1 , . . . , ξn ≥ 0 with sum Xn , we have E e−Xn ≤ e−nc + max P {ξk < c}, k≤n

c > 0.

Proof: Let p be the maximum on the right. By the H¨older and Chebyshev inequalities, Ee−Xn = E Π k e−ξk ≤ Π k (Ee−nξk )1/n

n ≤ (e−nc + p)1/n 2 = e−nc + p. Proof of Theorem 23.11: Again we may assume that ρ ≤ 1, and by suitable approximation we may extend condition (14) to weakly optional times σ and τ . For all n ∈ N and ε > 0, we define recursively the weakly X n -optional times



n σk+1 = inf s > σkn ; ρ(Xσnkn , Xsn ) > ε ,

k ∈ Z+ ,

starting with σ0n = 0, and note that for m ∈ N and t, h > 0, w˜t (X n , h) ≤ 2ε +

Σ1 k 0,

(16)

and so by (14) and (15), n < t}. lim lim sup E w˜t (X n , h) ≤ 2ε + lim sup P {σm

h→0

n→∞

(17)

n→∞

Next we see from (16) and Lemma 23.13 that, for any c > 0,





n n < t ≤ et E e−σm ; σm r} = 0, B ∈ S, r→∞ ξ∈Ξ

(ii)





ˆ inf sup E ξ(B \ K) ∧ 1 = 0, B ∈ S.

K∈K ξ∈Ξ

When the ξ ∈ Ξ are a.s. bounded, the set Ξ is weakly tight iff (i)−(ii) hold with B = S. Proof: Let Ξ be vaguely tight. Then for any ε > 0, there exists a vaguely ˆ / A} < ε for all ξ ∈ Ξ. For fixed B ∈ S, compact set A ⊂ MS , such that P {ξ ∈ Theorem A5.6 yields r = supμ∈A μB < ∞, and so supξ∈Ξ P {ξB > r} < ε. The same lemma yields a K ∈ K with supμ∈A μ(B \ K) < ε, which implies supξ∈Ξ E{ξ(B \ K) ∧ 1} ≤ 2ε. Since ε was arbitrary, this proves the necessity of (i) and (ii). Now assume (i)−(ii). Fix any s0 ∈ S, and put Bn = {s ∈ S; ρ(s, s0 ) < n}. For any ε > 0, we may choose some constants r1 , r2 , . . . > 0 and sets K1 , K2 , . . . ∈ K, such that for all ξ ∈ Ξ and n ∈ N,

P {ξBn > rn } < 2−n ε,

E ξ(Bn \ Kn ) ∧ 1 < 2−2n ε.

(18)

Writing A for the set of measures μ ∈ MS with μBn ≤ rn ,

μ(Bn \ Kn ) ≤ 2−n ,

n ∈ N,

we see from Theorem A5.6 that A is vaguely relatively compact. Noting that







P ξ(Bn \ Kn ) > 2−n ≤ 2n E ξ(Bn \ Kn ) ∧ 1 , we get from (18) for any ξ ∈ Ξ



n ∈ N,

P {ξ ∈ / A} ≤ Σ n P {ξBn > rn } + Σ n 2n E ξ(Bn \ Kn ) ∧ 1 < 2ε, 5 Thus, we assume the supports of f to be metrically bounded. Requiring compact supports yields a weaker topology.

518

Foundations of Modern Probability 2

and since ε was arbitrary, we conclude that Ξ is tight.

We may now derive criteria for convergence in distribution with respect to vw vd the vague topology, written as ξn −→ ξ and defined6 by L(ξn ) −→ L(ξ), where vw ˆ M with respect to the vague topology on −→ denotes weak convergence on M S ˆ MS . Let Sξ be the class of sets B ∈ Sˆ with ξ∂B = 0 a.s. A non-empty class I of sets I ⊂ S is called a semi-ring, if it is closed under finite intersections, and such that every proper difference in I is a finite union of disjoint I -sets; it is called a ring if it is also closed under finite unions. Typical semi-rings in Rd are families of rectangular boxes I1 × · · · × Id . By Iˆ+ we denote the class of simple functions over I. Theorem 23.16 (convergence of random measures, Harris, Matthes et al.) Let ξ, ξ1 , ξ2 , . . . be random measures on a separable, complete metric space S, and fix a dissecting semi-ring I ⊂ Sˆξ . Then these conditions are equivalent: vd

(i) ξn −→ ξ, d (ii) ξn f → ξf for all f ∈ CˆS or Iˆ+ ,

(iii) Ee−ξn f → Ee−ξf for all f ∈ CˆS or Iˆ+ with f ≤ 1. Our proof relies on two simple lemmas. For any function f : S → R, let Df be the set of discontinuity points of f . Lemma 23.17 (limits of integrals) Let ξ, ξ1 , ξ2 , . . . be random measures on d vd S with ξn −→ ξ, and let f ∈ Sˆ+ with ξDf = 0 a.s. Then ξn f → ξf . v

Proof: By Theorem 5.27 it is enough to prove that, if μn → μ with μDf = 0, then μn f → μf . By a suitable truncation and normalization, we may take all w μn and μ to be probability measures. Then the same theorem yields μn ◦f −1 → −1 2 μ ◦ f , which implies μn f → μf since f is bounded. Lemma 23.18 (continuity sets) Let ξ, η, ξ1 , ξ2 , . . . be random measures on S vd with ξn −→ η, and fix a dissecting ring U and a constant r > 0. Then (i) Sˆη ⊃ Sˆξ whenever lim inf E e−r ξn U ≥ E e−r ξU , n→∞

U ∈ U,

(ii) for point processes ξ, ξ1 , ξ2 , . . . , we may assume instead that lim inf P {ξn U = 0} ≥ P {ξU = 0}, n→∞

U ∈ U.

6 ˆM , This mode of convergence involves the topologies on all three spaces S, MS , M S which must not be confused.

23. Convergence in Distribution

519

Proof: Fix any B ∈ Sˆξ . Let ∂B ⊂ F ⊂ U with F ∈ Sˆη and U ∈ U , where F is closed. By (i) and Lemma 23.17, Ee−r η∂B ≥ Ee−r ηF = n→∞ lim Ee−r ξn F ≥ lim inf Ee−r ξn U n→∞

≥ Ee−r ξU . Letting U ↓ F and then F ↓ ∂B, we get Ee−r η∂B ≥ Ee−r ξ∂B = 1 by dominated convergence. Hence, η∂B = 0 a.s., which means that B ∈ Sˆη . The proof of (ii) is similar. 2 Proof of Theorem 23.16: Here (i) ⇒ (ii) holds by Lemma 23.17 and (ii) ⇔ d (iii) by Theorem 6.3. To prove (ii) ⇒ (i), assume that ξn f → ξf for all f ∈ CˆS ˆ ˆ or I+ , respectively. For any B ∈ S we may choose an f with 1B ≤ f , and so ξn B ≤ ξn f is tight by the convergence of ξn f . Thus, (ξn ) satisfies (i) of Theorem 23.15. To prove (ii) of the same result, we may assume that ξ∂B = 0 a.s. A simple d approximation yields (1B ξn )f → (1B ξ)f for all f ∈ CˆS of Iˆ+ , and so we may d take S = B, and let the measures ξn and ξ be a.s. bounded with ξn f → ξf for d all f ∈ CS or I+ . Here Corollary 6.5 yields (ξn f, ξn S) → (ξf, ξS) for all such d f , and so (ξn S ∨ 1)−1 ξn f → (ξS ∨ 1)−1 ξf , which allows us to assume ξn ≤ 1 d for all n. Then ξn f → ξf implies Eξn f → Eξf for all f as above, and so v Eξn → Eξ by the definition of vague convergence. Here Lemma A5.6 yields inf K∈K supn Eξn K c = 0, which proves (ii) of Theorem 23.15 for the sequence (ξn ). d The latter theorem shows that (ξn ) is tight. If ξn → η along a sub-sequence d N  ⊂ N for a random measure η on S, then Lemma 23.17 yields ξn f → ηf for d all f ∈ CˆS or I+ . In the former case, ηf = ξf for all f ∈ CS , which implies d d η = ξ. In the latter case, Lemma 23.18 yields Sˆη ⊃ Sˆξ ⊃ I, and so ξn f → ηf d for all f ∈ Iˆ+ , which again implies η = ξ. Since N  was arbitrary, we obtain vd 2 ξn −→ ξ. Corollary 23.19 (existence of limit) Let ξ1 , ξ2 , . . . be random measures on d an lcscH space S, such that ξn f → αf for some random variables αf , f ∈ CˆS . vd Then ξn −→ ξ for a random measure ξ on S satisfying d ξf = αf , f ∈ CˆS . Proof: Condition (ii) of Theorem 23.15 being void, the tightness of (ξn ) follows already from the tightness of (ξn f ) for all f ∈ CˆS . Hence, Theorem vd 23.2 yields ξn −→ ξ along a sub-sequence N  ⊂ N, for a random measure ξ d on S, and so ξn f → ξf along N  for every f ∈ CˆS . The latter convergence vd extends to N by hypothesis, and so ξn −→ ξ by Theorem 23.16. 2

520

Foundations of Modern Probability

Theorem 23.20 (weak and vague convergence) Let ξ, ξ1 , ξ2 , . . . be a.s. bounded vd random measures on S with ξn −→ ξ. Then these conditions are equivalent: wd

(i) ξn −→ ξ, d (ii) ξn S → ξS,   c (iii) inf lim sup E ξn B ∧ 1 = 0. B∈Sˆ

n→∞

Proof, (i) ⇒ (ii): Clear since 1 ∈ CS . (ii) ⇒ (iii): If (iii) fails, then a diagonal argument yields a sub-sequence N  ⊂ N with   c E ξ B ∧ 1 > 0. (19) inf lim inf n  B∈Sˆ

n∈N

Under (ii) the sequences (ξn ) and (ξn S) are vaguely tight, whence so is the vd ˜ α) sequence of pairs (ξn , ξn S), which yields the convergence (ξn , ξn S) −→ (ξ, d   ˜ along a further sub-sequence N ⊂ N . Since clearly ξ = ξ, a transfer argument allows us to take ξ˜ = ξ. Then for any B ∈ Sˆξ , we get along N  0 ≤ ξn B c = ξ n S − ξ n B d → α − ξB, and so ξB ≤ α a.s., which implies ξS ≤ α a.s. since B was arbitrary. Since d also α = ξS, we get  

 

E  e−ξS − e−α  = Ee−ξS − Ee−α = Ee−ξS − Ee−ξS = 0, proving that α = ξS a.s. Hence, for any B ∈ Sˆξ , we have along N  





E ξn B c ∧ 1 = E (ξn S − ξn B) ∧ 1

→ E (ξS − ξB) ∧ 1 







= E ξB c ∧ 1 , which tends to 0 as B ↑ S by dominated convergence. This contradicts (19), and (iii) follows. ˆ and using (iii) ⇒ (i): Writing K c ⊂ B c ∪ (B \ K) when K ∈ K and B ∈ S, the additivity of P and ξn and the sub-additivity of x ∧ 1 for x ≥ 0, we get 

lim sup E ξn K c ∧ 1 n→∞









≤ lim sup E ξn B c ∧ 1 + E ξn (B \ K) ∧ 1 n→∞









≤ lim sup E ξn B c ∧ 1 + lim sup E ξn (B \ K) ∧ 1 . n→∞

n→∞

Taking the infimum over K ∈ K and then letting B ↑ S, we get by Theorem 23.15 and (iii)   inf lim sup E ξn K c ∧ 1 = 0. (20) K∈K

n→∞

23. Convergence in Distribution

521

Similarly, we have for any r > 1

lim sup P ξn S > r n→∞







≤ lim sup P ξn B > r − 1 + P ξn B c > 1

n→∞









≤ lim sup P ξn B > r − 1 + lim sup E ξn B c ∧ 1 . n→∞

n→∞

Letting r → ∞ and then B ↑ S, we get by Theorem 23.15 and (iii)



lim lim inf P ξn S > r = 0.

r→∞

(21)

n→∞

By Theorem 23.15, we see from (20) and (21) that $(\xi_n)$ is weakly tight. Assuming $\xi_n \xrightarrow{wd} \eta$ along a sub-sequence, we have also $\xi_n \xrightarrow{vd} \eta$, and so $\eta \stackrel{d}{=} \xi$. Since the sub-sequence was arbitrary, (i) follows by Theorem 23.2. □

For a first application, we prove some continuity properties of Cox processes and thinnings, extending the simple uniqueness properties in Lemma 15.7. Further limit theorems involving Cox processes and thinnings appear in Chapter 30.

Theorem 23.21 (Cox and thinning continuity) For n ∈ N and a fixed p ∈ (0, 1], let $\xi_n$ be a Cox process on S directed by a random measure $\eta_n$, or a p-thinning of a point process $\eta_n$ on S. Then

$\xi_n \xrightarrow{vd} \xi \iff \eta_n \xrightarrow{vd} \eta$

for some ξ, η, where ξ is a Cox process directed by η or a p-thinning of η, respectively. In that case also $(\xi_n, \eta_n) \xrightarrow{vd} (\xi, \eta)$ in $\mathcal{M}^2_S$.

Proof: Let $\eta_n \xrightarrow{vd} \eta$ for a random measure η on S. Then Lemma 15.2 yields, for any $f, g \in \hat{C}_S$,

$E\, e^{-\xi_n f - \eta_n g} = E \exp\big\{-\eta_n(1 - e^{-f} + g)\big\} \to E \exp\big\{-\eta(1 - e^{-f} + g)\big\} = E\, e^{-\xi f - \eta g},$

where ξ is a Cox process directed by η. Hence, Theorem 6.3 gives ξn f + ηn g → vd ξf + ηg for all such f and g, and so (ξn , ηn ) −→ (ξ, η) by Theorem 23.16. vd Conversely, let ξn −→ ξ for a point process ξ on S. Then the sequence (ξn ) is vaguely tight, and so Theorem 23.15 yields for any B ∈ Sˆ and u > 0



$\lim_{r \to 0} \inf_{n \ge 1} E\, e^{-r\, \xi_n B} = \sup_{K \in \mathcal{K}} \inf_{n \ge 1} E \exp\big\{-u\, \xi_n(B \setminus K)\big\} = 1.$

Similar relations hold for (ηn ) by Lemma 15.2, with r and u replaced by r = 1 − e−r and u = 1 − e−u , and so even (ηn ) is vaguely tight by Theovd rem 23.15, which implies that (ηn ) is vaguely relatively compact. If ηn −→ η


along a sub-sequence, then ξn −→ ξ  as before, where ξ  is a Cox process did rected by η. Since ξ  = ξ, the distribution of η is unique by Lemma 15.7, and vd so the convergence ηn −→ η extends to the entire sequence. The proof for p -thinnings is similar. 2 vd
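The sketch below (not part of the text) simulates the two objects of Theorem 23.21 in their simplest form and checks the total counts against standard identities assumed here: for a Cox process the count is mixed Poisson, so its mean is EΛ and its variance EΛ + Var Λ, while a p-thinning of a Poisson(μ) number of points is again Poisson with mean pμ. The directing law Λ ~ Exp(mean 5) and all other constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n_sim = 100000

# Cox process on S = [0, 1] directed by eta = Lambda * Lebesgue, Lambda ~ Exp(mean 5);
# only the total counts eta(S) and xi(S) are examined here.
Lam = rng.exponential(scale=5.0, size=n_sim)
cox_counts = rng.poisson(Lam)
print("Cox counts:  mean", round(cox_counts.mean(), 3), " var", round(cox_counts.var(), 3))
print("predicted:   mean", round(Lam.mean(), 3), " var", round(Lam.mean() + Lam.var(), 3))

# p-thinning of a Poisson(mu) number of points: each point kept with probability p.
p, mu = 0.3, 10.0
eta_counts = rng.poisson(mu, size=n_sim)
thin_counts = rng.binomial(eta_counts, p)
print("thinned:     mean", round(thin_counts.mean(), 3), " var", round(thin_counts.var(), 3))
print("predicted:   mean", p * mu, " var", p * mu)   # thinning of Poisson is Poisson(p*mu)
```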

We also consider a stronger, non-topological version of distributional convergence for random measures, defined as follows. Beginning with non-random u measures μ and μ1 , μ2 , . . . ∈ MS , we write μn → μ for convergence in toul tal variation and μn −→ μ for the corresponding local version, defined by u ˆ For random measures ξ and ξn on S, we may now 1B μn → 1B μ for all B ∈ S. u ud uld define the convergence ξn −→ ξ by L(ξn ) → L(ξ), and let ξn −→ ξ be the ud ˆ Finally, we corresponding local version, given by 1B ξn −→ 1B ξ for all B ∈ S. ulP P ˆ define the convergence ξn −→ ξ by ξn − ξ B → 0 for all B ∈ S, and similarly uP for the global version ξn −→ ξ. The last result has the following partial analogue for strong convergence. Theorem 23.22 (strong Cox continuity) Let ξ, ξ1 , ξ2 , . . . be Cox processes directed by some random measures η, η1 , η2 , . . . on S. Then ulP

ηn −→ η



uld

ξn −→ ξ,

with equivalence when $\eta, \eta_1, \eta_2, \ldots$ are non-random.

Proof: We may clearly take S to be bounded. First let η = λ and all $\eta_n = \lambda_n$ be non-random. For any n ∈ N, put

$\hat{\lambda}_n = \lambda \wedge \lambda_n, \qquad \lambda'_n = \lambda - \hat{\lambda}_n, \qquad \lambda''_n = \lambda_n - \hat{\lambda}_n,$

so that

$\lambda = \hat{\lambda}_n + \lambda'_n, \qquad \lambda_n = \hat{\lambda}_n + \lambda''_n, \qquad \|\lambda_n - \lambda\| = \|\lambda'_n\| + \|\lambda''_n\|.$

Letting $\hat{\xi}_n, \xi'_n, \xi''_n$ be independent Poisson processes with intensities $\hat{\lambda}_n, \lambda'_n, \lambda''_n$, respectively, we note that

$\xi \stackrel{d}{=} \hat{\xi}_n + \xi'_n, \qquad \xi_n \stackrel{d}{=} \hat{\xi}_n + \xi''_n.$

Assuming $\lambda_n \xrightarrow{u} \lambda$, we get

$\|\mathcal{L}(\xi) - \mathcal{L}(\xi_n)\| \le \|\mathcal{L}(\xi) - \mathcal{L}(\hat{\xi}_n)\| + \|\mathcal{L}(\xi_n) - \mathcal{L}(\hat{\xi}_n)\| \lesssim P\{\xi'_n \ne 0\} + P\{\xi''_n \ne 0\} = \big(1 - e^{-\|\lambda'_n\|}\big) + \big(1 - e^{-\|\lambda''_n\|}\big) \le \|\lambda'_n\| + \|\lambda''_n\| = \|\lambda_n - \lambda\| \to 0,$

which shows that $\xi_n \xrightarrow{ud} \xi$.


Conversely, let $\xi_n \xrightarrow{ud} \xi$. Then for any $B \in \mathcal{S}$ and $n \in \mathbb{N}$,

$P\{\xi_n B = 0\} \ge P\{\xi_n = 0\} \to P\{\xi = 0\} = e^{-\|\lambda\|} > 0.$

Since $(\log x)' = x^{-1} \lesssim 1$ on every interval [ε, 1] with ε > 0, we get

$\|\lambda - \lambda_n\| \lesssim \sup_B |\lambda B - \lambda_n B| = \sup_B \big|\log P\{\xi B = 0\} - \log P\{\xi_n B = 0\}\big| \lesssim \sup_B \big|P\{\xi B = 0\} - P\{\xi_n B = 0\}\big| \le \|\mathcal{L}(\xi) - \mathcal{L}(\xi_n)\| \to 0,$

which shows that λn → λ. uP Now let η and the ηn be random measures satisfying ηn −→ η. To obtain ud ξn −→ ξ, it is enough to show that, for any sub-sequence N  ⊂ N, the desired convergence holds along a further sub-sequence N  . By Lemma 5.2 we may u then assume that ηn → η a.s. Letting the processes ξ and ξn be conditionally independent and Poisson distributed with intensities η and ηn , respectively, we conclude from the previous case that     L(ξn | ηn ) − L(ξ | η) → 0 a.s.,

and so by dominated convergence,   

     L(ξn ) − L(ξ) = E L(ξn | ηn ) − L(ξ | η)       = sup E E{f (ξn )| ηn } − E{f (ξ)| η}  |f |≤1





≤ sup E E{f (ξn )| ηn } − E{f (ξ)| η} |f |≤1





≤ E L(ξn | ηn ) − L(ξ | η) → 0, ud

which shows that ξn −→ ξ.

2
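The proof above bounds $\|\mathcal{L}(\xi) - \mathcal{L}(\xi_n)\|$ in terms of $\|\lambda - \lambda_n\|$ via a Poisson coupling. The small computation below (not part of the text) checks the corresponding one-dimensional statement, assumed here as the standard coupling bound: the total variation distance $\sup_A |P(A) - Q(A)|$ between Poisson(a) and Poisson(b) is at most |a − b|.

```python
import numpy as np
from math import exp

def poisson_pmf(lam, kmax):
    """Probabilities P{X = k}, k = 0, ..., kmax, for X ~ Poisson(lam)."""
    p = [exp(-lam)]
    for k in range(kmax):
        p.append(p[-1] * lam / (k + 1))
    return np.array(p)

def tv_poisson(a, b, kmax=200):
    """Total variation distance sup_A |P(A) - Q(A)| between Poisson(a) and Poisson(b),
    computed from the (truncated) probability mass functions."""
    return 0.5 * np.abs(poisson_pmf(a, kmax) - poisson_pmf(b, kmax)).sum()

for a, b in [(5.0, 5.5), (5.0, 6.0), (10.0, 10.1), (2.0, 4.0)]:
    print(a, b, round(tv_poisson(a, b), 4), "<=", abs(a - b))
```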

For measure-valued processes $X^n$ with rcll paths, we may characterize tightness in terms of the real-valued projections

$X^n_t f = \int f(s)\, X^n_t(ds), \qquad f \in \hat{C}^+_S.$

Theorem 23.23 (measure-valued processes) Let X 1 , X 2 , . . . be vaguely rcll processes in MS , where S is lcscH. Then these conditions are equivalent: (i) (X n ) is vaguely tight in D R+ ,MS , (ii) (X n f ) is tight in D R+ ,R+ for every f ∈ CˆS+ .


Proof: Let (X n f ) be tight for every f ∈ Cˆ + , and fix any ε > 0. Let f1 , f2 , . . . be such as in Theorem A5.7, and choose some compact sets B1 , B2 , . . . ⊂ DR+ ,R+ with

P X n fk ∈ Bk ≥ 1 − ε2−k , k, n ∈ N. (22) 

Then A = k {μ; μfk ∈ Bk } is relatively compact in D R+ ,MS , and (22) yields 2 P {X n ∈ A} ≥ 1 − ε. Next we consider convergence of random sets. Given an lcscH space S, let F, G, K be the classes of closed, open, and compact subsets of S. We endow F with the Fell topology, generated by the sets



F ∈ F ; F ∩ G = ∅ ,

G ∈ G,

F ∈ F; F ∩ K = ∅ ,

K ∈ K.



Some basic properties of this topology are summarized in Theorem A6.1. In particular, F is compact and metrizable, and the set {F ; F ∩ B = ∅} is uniˆ versally measurable for every B ∈ S. A random closed set in S is defined as a random element ϕ in F. Write ˆ and note that the probabilities P {ϕB = ∅} are ϕ ∩ B = ϕB for any B ∈ S, well defined. For any random closed set ϕ, we introduce the class



ˆ P {ϕB o = ∅} = P {ϕB ¯ = ∅} , Sˆϕ = B ∈ S; which is separating by Lemma A6.2. We may now state the basic convergence criterion for random sets. Theorem 23.24 (convergence of random sets, Norberg, OK) Let ϕ, ϕ1 , ϕ2 , . . . be random closed sets in an lcscH space S, and fix a separating class U ⊂ Sˆϕ . d Then ϕn → ϕ iff P {ϕn U = ∅} → P {ϕU = ∅},

U ∈ U.

In particular, we may take U = Sˆϕ . Proof: Write h(B) = P {ϕB = ∅}, hn (B) = P {ϕn B = ∅}. d

If ϕn → ϕ, Theorem 5.25 yields inf hn (B) h(B o ) ≤ lim n→∞ ¯ ≤ lim sup hn (B) ≤ h(B), n→∞

and so for any B ∈ Sˆϕ we have hn (B) → h(B).

ˆ B ∈ S,

(23)

23. Convergence in Distribution

525

Now consider a separating class U satisfying (23). Fix any B ∈ Sˆϕ , and conclude from (23) that, for any U, V ∈ U with U ⊂ B ⊂ V , h(U ) ≤ lim inf hn (B) n→∞

≤ lim sup hn (B) ≤ h(V ).

(24)

n→∞

Since U is separating, we may choose some Uk , Vk ∈ U with Uk ↑ B o and ¯ Here clearly V¯k ↓ B. Uk ↑ B o ⇒ {ϕUk = ∅} ↑ {ϕB o = ∅} ⇒ h(Uk ) ↑ h(B o ) = h(B), ¯ ⇒ {ϕV¯k = ∅} ↓ {ϕB ¯ = ∅} V¯k ↓ B ¯ = h(B), ⇒ h(V¯k ) ↓ h(B) where the finite-intersection property was used in the second line. In view of (24), we get hn (B) → h(B), and (23) follows with U = Sˆϕ . Since F is compact, the sequence (ϕn ) is relatively compact by Theorem d 23.2. For any sub-sequence N  ⊂ N, we obtain ϕn → ψ along a further sub sequence N , for a random closed set ψ. Combining the direct statement with (23), we get (25) P {ϕB = ∅} = P {ψB = ∅}, B ∈ Sˆϕ ∩ Sˆψ . Since Sˆϕ ∩ Sˆψ is separating by Lemma A6.2, we may approximate as before to extend (25) to any compact set B. The class of sets {F ; F ∩ K = ∅} with compact K is clearly a π-system, and so a monotone-class argument yields d d 2 ϕ = ψ. Since N  is arbitrary, we obtain ϕn → ϕ along N. Simple point processes allow the dual descriptions as integer-valued random measures or locally finite random sets. The corresponding notions of convergence are different, and we proceed to clarify how they are related. Since the vd mapping μ → supp μ is continuous on NS , the convergence ξn −→ ξ implies d supp ξn → supp ξ. Conversely, if E ξ and E ξn are locally finite, then by Theovd d rem 23.24 and Proposition 23.25, we have ξn −→ ξ whenever supp ξn → supp ξ v and Eξn → Eξ. Here we give a precise criterion. Theorem 23.25 (convergence of point processes) Let ξ, ξ1 , ξ2 , . . . be point processes on S with ξ simple, and fix any dissecting ring U ⊂ Sˆξ and semi-ring vd I ⊂ U . Then ξn −→ ξ iff (i) P {ξn U = 0} → P {ξU = 0}, (ii) lim sup P {ξn I > 1} ≤ P {ξI > 1}, n→∞

U ∈ U, I ∈ I.

Proof: The necessity holds by Lemma 23.17. Now assume (i)−(ii). We may choose U and I to be countable. Define η(U ) = ξU ∧ 1 and ηn (U ) = ξn U ∧ 1

526

Foundations of Modern Probability d

for U ∈ U . Taking repeated differences in (i), we get ηn → η for the product d d topology on R+U . Hence, Theorem 5.31 yields some processes η˜ = η and η˜n = ηn , such that a.s. η˜n (U ) → η˜(U ) for all U ∈ U . By a transfer argument, we may d d choose ξ˜n = ξn and ξ˜ = ξ, such that a.s. η˜n (U ) = ξ˜n U ∧ 1, ˜ ∧ 1, U ∈ U . η˜(U ) = ξU We may then assume that ηn (U ) → η(U ) for all U ∈ U a.s. Now fix an ω ∈ Ω with ηn (U ) → η(U ) for all U ∈ U , and let U ∈ U be arbitrary. Splitting U into disjoint sets I1 , . . . , Im ∈ I with ξIk ≤ 1 for all k ≤ m, we get ηn (U ) → η(U ) ≤ ξU = Σ j ξIj = Σ j η(Ij ) ← Σ j ηn (Ij ) ≤ Σ j ξn Ij = ξn U, and so as n → ∞, we have a.s.



lim sup ξn U ∧ 1 ≤ ξU n→∞ ≤ lim inf ξn U,

U ∈ U.

n→∞

(26)

Next we note that, for any h, k ∈ Z+ ,

k≤h≤1

c



= {h > 1} ∪ h < k ∧ 2







= {k > 1} ∪ h = 0, k = 1 ∪ h > 1 ≥ k , where all unions are disjoint. Substituting h = ξI and k = ξn I, and using (ii) and (26), we get



lim P ξI < ξn I ∧ 2 = 0,

n→∞

I ∈ I.

(27)

Letting U ∈ U be a union of disjoint sets I1 , . . . , Im ∈ I, we get by (27)



lim sup P ξn U > ξU ≤ lim sup P n→∞

n→∞

≤P

∪j



ξn Ij > ξIj



∪ j {ξIj > 1}.

By dominated convergence, we can make the right-hand side arbitrarily small, and so P {ξn U > ξU } → 0. Combining with (26) gives P {ξn U = ξU } → 0, P P which means that ξn U → ξU . Hence, ξn f → ξf for every f ∈ U+ , and Theovd 2 rem 23.16 (i) yields ξn −→ ξ.



Exercises 1. Show that the two versions of Donsker’s theorem in Theorems 22.9 and 23.6 are equivalent. 2. Extend either version of Donsker’s theorem to random vectors in Rd . 3. For any metric space (S, ρ), show that if xn → x in D R+ ,S with x continuous, then sups≤t ρ(xns , xs ) → 0 for every t ≥ 0. (Hint: Note that x is uniformly continuous on every interval [0, t].) d

4. For any separable, complete metric space (S, ρ), show that if X n → X in d D R+ ,S with X continuous, we may choose Y n = X n with sups≤t ρ(Ysn , Xs ) → 0 a.s. for every t ≥ 0. (Hint: Combine the preceding result with Theorems 5.31 and A5.4.) 5. Give an example where xn → x and yn → y in D R+ ,R , and yet (xn , yn ) → (x, y) in D R+ ,R2 . 6. Let f be a continuous map between the metric spaces S, T , where T is d d separable, complete. Show that if X n → X in D R+ ,S , then f (X n ) → f (X) in D R+ ,T . (Hint: By Theorem 5.27 it suffices to show that xn → x in D R+ ,S implies f (xn ) → f (x) in D R+ ,T . Since A = {x, x1 , x2 , . . .} is relatively compact in D R+ ,S ,  Theorem A5.4 shows that Ut = s≤t πs A is relatively compact in S for every t > 0. Hence, f is uniformly continuous on each Ut .) 7. Show by an example that the condition in Theorem 23.11 is not necessary for tightness. (Hint: Consider non-random processes Xn .) 8. Show that in Theorem 23.11 it is enough to consider optional times taking finitely many values. (Hint: Approximate from the right and use the right-continuity of paths.) 9. Let the process X on R+ be continuous in probability with values in a sepP arable, complete metric space (S, ρ). Show that if ρ(Xτn , Xτn +hn ) → 0 for any bounded optional times τn and constants hn → 0, then X has an rcll version. (Hint: Approximate by suitable step processes, and use Theorems 23.9 and 23.11.) d

10. Let X, X 1 , X 2 , . . . be L´evy processes in Rd . Show that X n → X in D R+ ,Rd iff d

X1n → X1 in Rd . Compare with Theorem 16.14. 11. Examine how Theorem 23.15 simplifies when S is lcscH. (Hint: We may then omit condition (ii).) 12. Show that (ii)−(iii) of Theorem 23.16 remain sufficient if we replace Sˆξ by an arbitrary separating class. (Hint: Restate the conditions in terms of Laplace transforms, and extend to Sˆξ by a suitable approximation.) 13. Let η, η1 , η2 , . . . be λ -randomizations of some point processes ξ, ξ1 , ξ2 , . . . on d d a separable, complete space S. Show that ξn → ξ iff ηn → η. uld

vd

14. For random measures ξ, ξ1 , ξ2 , . . . on S, show that ξn −→ ξ implies ξn −→ ξ. Further show by an example that the converse implication may fail. fd

15. Let X and X n be random elements in D R+ ,Rd with X n −→ X, and such that d

d

uX n → uX in D R+ ,R for every u ∈ Rd . Show that X n → X. (Hint: Proceed as in Theorems 23.23 and A5.7.)



16. On an lcscH space S, let ξ, ξ1 , ξ2 , . . . be simple point processes with associated vd d supports ϕ, ϕ1 , ϕ2 , . . . . Show that ξn −→ ξ implies ϕn → ϕ. Further show by an example that the reverse implication may fail. 17. For an lcscH space S, let U ⊂ Sˆ be separating. Show that if K ⊂ G with K ¯ ⊂ G. (Hint: First compact and G open, there exists a U ∈ U with K ⊂ U o ⊂ U ¯ ⊂ C o ⊂ C¯ ⊂ G.) choose B, C ∈ Sˆ with K ⊂ B o ⊂ B

Chapter 24

Large Deviations Exponential rates, cumulant-generating function, Legendre–Fenchel transform, rate function, large deviations in Rd , Schilder’s theorem, regularization and uniqueness, large deviation principle, goodness, exponential tightness, weak and functional LDPs, continuous mapping, random sequences, exponential equivalence, perturbed dynamical systems, empirical distributions, relative entropy, functional law of the iterated logarithm

In many contexts throughout probability theory and its applications, we need to estimate the asymptotic rate of decline of various tail probabilities. Since such problems might be expected to lead to tedious technical calculations, we may be surprised and delighted to learn about the existence of a powerful general theory, providing such asymptotic rates with great accuracy. Indeed, in many cases of interest, the asymptotic rate turns out to be exponential, and the associated rate function can often be obtained explicitly from the underlying distributions. In its simplest setting, the theory of large deviation provides exponential rates of convergence in the weak law of large numbers. Here we consider some i.i.d. random variables ξ1 , ξ2 , . . . with mean m and cumulant-generating function Λ(u) = log Eeuξi < ∞, and write ξ¯n = n−1 Σ k≤n ξk . For any x > m, we show that the tail probabilities P {ξ¯n > x} tend to 0 at an exponential rate I(x), given by the Legendre–Fenchel transform Λ∗ of Λ. In higher dimensions, it is convenient to state the result in the more general form −n−1 log P {ξ¯n ∈ B} → I(B) = inf I(x), x∈B

where B is restricted to a suitable class of continuity sets. In this standard form of a large-deviation principle with rate function I, the result extends to a wide range of contexts throughout probability theory. For a striking example of fundamental importance in statistical mechanics, we may mention Sanov’s theorem, which provides similar large deviation bounds for the empirical distributions of a sequence of i.i.d. random variables with a common distribution μ. Here the rate function I, defined on the space of probability measures ν on R, agrees with the relative entropy H(ν| μ). Another important case is Schilder’s theorem for a family of rescaled Brownian motions ˙ 22 —the squared norm in in Rd , where the rate function becomes I(x) = 12 x the Cameron–Martin space first encountered in Chapter 19. The latter result can be used to derive the powerful Fredlin–Wentzell estimates for randomly © Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1_25

529

530

Foundations of Modern Probability

perturbed dynamical systems. It also yields a short proof of Strassen’s law of the iterated logarithm, a stunning extension of the classical Khinchin law from Chapter 14. Modern proofs of those and other large deviation results rely on some general extension principles, which explains the wide applicability of the present ideas. In addition to some rather elementary and straightforward methods of continuity and approximation, we consider the more sophisticated and extremely powerful principles of inverse continuous mapping and projective limits, both of which play crucial roles in ensuing applications. We may also stress the significance of the notion of exponential tightness, and of the essential equivalence between the setwise and functional formulations of the large-deviation principle. Large deviation theory is arguably one of the most technical areas of modern probability. For a beginning student, it is then essential not to get distracted by topological subtleties or elaborate computations. Many results are therefore stated under simplifying assumptions. Likewise, we postpone our discussion of the general principles, until the reader has become well acquainted with some basic ideas in a simple and concrete setting. For those reasons, important applications appear both at the beginning and at the end of the chapter, separated by some more abstract discussions of general notions and principles. Returning to the elementary setting of i.i.d. random variables ξ, ξ1 , ξ2 , . . . , we write Sn = Σ k≤n ξk and ξ¯n = Sn /n. If m = Eξ exists and is finite, then P {ξ¯n > x} → 0 for all x > m by the weak law of large numbers. Under stronger moment conditions, the rate of convergence turns out to be exponential and can be estimated with great accuracy. This elementary but quite technical observation, along with its multi-dimensional counterpart, lie at the core of large-deviation theory, and provide both a general pattern and a point of departure for the more advanced developments. For motivation, we begin with some simple observations. Lemma 24.1 (tail rate) For i.i.d. random variables ξ, ξ1 , ξ2 , . . . , (i) n−1 log P {ξ¯n ≥ x} → sup n−1 log P {ξ¯n ≥ x} ≡ −h(x), x ∈ R, n

(ii) h is [0, ∞]-valued, non-decreasing, and convex, (iii) h(x) < ∞ ⇔ P {ξ ≥ x} > 0. Proof: (i) Writing pn = P {ξ¯n ≥ x}, we get for any m, n ∈ N



pm+n = P Sm+n ≥ (m + n)x



≥ P Sm ≥ mx, Sm+n − Sm ≥ nx = pm pn . Taking logarithms, we conclude that the sequence − log pn is sub-additive, and the assertion follows by Lemma 25.19. (ii) The first two assertions are obvious. To prove the convexity, let x, y ∈ R be arbitrary, and proceed as before to get







P S2n ≥ n(x + y) ≥ P {Sn ≥ nx} P {Sn ≥ ny}. Taking logarithms, dividing by 2n, and letting n → ∞, we obtain h



1 2



(x + y) ≤

1 2





h(x) + h(y) ,

x, y > 0.

(iii) If P {ξ ≥ x} = 0, then P {ξ¯n ≥ x} = 0 for all n, and so h(x) = ∞. Conversely, (i) yields log P {ξ ≥ x} ≤ −h(x), and so h(x) = ∞ implies P {ξ ≥ x} = 0. 2 To identify the limit in Lemma 24.1, we need some further notation, here given for convenience directly in d dimensions. For any random vector ξ in Rd , we introduce the cumulant-generating function u ∈ Rd .

Λ(u) = Λξ (u) = log Eeuξ ,

(1)

Note that Λ is convex, since by H¨older’s inequality we have for any u, v ∈ Rd and p, q > 0 with p + q = 1

Λ(pu + qv) = log E exp (pu + qv) ξ

≤ log (Eeuξ )p (Eevξ )q





= p Λ(u) + q Λ(v). The surface z = Λ(u) in Rd+1 is determined by the class of supporting hyper-planes1 of different slopes, and we note that the plane with slope x ∈ Rd (or normal vector (1, −x)) has equation z + Λ∗ (x) = xu,

u ∈ Rd ,

where Λ∗ is the Legendre–Fenchel transform of Λ, given by



Λ∗ (x) = sup ux − Λ(u) , u∈Rd

x ∈ Rd .

(2)

We can often compute Λ* explicitly. Here we consider two simple cases, both needed below. The results may be proved by elementary calculus.

Lemma 24.2 (Gaussian and Bernoulli distributions)
(i) When $\xi = (\xi_1, \ldots, \xi_d)$ is standard Gaussian in $\mathbb{R}^d$,
$\Lambda^*_\xi(x) \equiv \tfrac{1}{2}|x|^2, \qquad x \in \mathbb{R}^d,$
(ii) when $\xi \in \{0, 1\}$ with $P\{\xi = 1\} = p \in (0, 1)$,
$\Lambda^*_\xi(x) = \begin{cases} x \log\frac{x}{p} + (1 - x) \log\frac{1-x}{1-p}, & x \in [0, 1], \\ \infty, & x \notin [0, 1]. \end{cases}$

¹ d-dimensional affine subspaces
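A short numerical check (not part of the text; grids and parameter values are arbitrary) of Lemma 24.2: the supremum in (2) is maximized over a grid of u-values for the standard Gaussian and Bernoulli(p) cumulant-generating functions, and compared with the stated closed forms.

```python
import numpy as np

u = np.linspace(-40.0, 40.0, 400001)        # grid for the supremum in (2)

def legendre(Lambda, x):
    """Numerical Legendre-Fenchel transform sup_u (u*x - Lambda(u)) over the u-grid."""
    return np.max(x * u - Lambda(u))

# (i) standard Gaussian in R: Lambda(u) = u^2/2, so Lambda*(x) = x^2/2
gauss = lambda v: 0.5 * v**2
for x in [0.0, 0.5, 1.0, 2.0]:
    print("Gauss", x, round(legendre(gauss, x), 4), "closed form", 0.5 * x**2)

# (ii) Bernoulli(p): Lambda(u) = log(1 - p + p e^u)
p = 0.3
bern = lambda v: np.log(1.0 - p + p * np.exp(v))
closed = lambda x: x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))
for x in [0.1, 0.3, 0.5, 0.9]:
    print("Bern ", x, round(legendre(bern, x), 4), "closed form", round(closed(x), 4))
```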



The function Λ∗ is again convex, since for any x, y ∈ Rd and for p, q as before, 





Λ∗ (px + qy) = supu p ux − Λ(u) + q uy − Λ(u)







≤ p supu ux − Λ(u) + q supu uy − Λ(u) = p Λ∗ (x) + q Λ∗ (y).



If Λ < ∞ near the origin, then m = Eξ exists and agrees with the gradient ∇Λ(0). Thus, the surface z = Λ(u) has tangent hyper-plane z = mu at 0, and we conclude that Λ∗ (m) = 0 and Λ∗ (x) > 0 for x = m. If ξ is truly d -dimensional, then Λ is strictly convex at 0, and Λ∗ is finite and continuous near m. For d = 1, we sometimes need the corresponding one-sided statements, which are easily derived by dominated convergence. The following key result identifies the function h in Lemma 24.1. For simplicity, we assume that m = Eξ exists in [−∞, ∞). Theorem 24.3 (rate function, Cram´er, Chernoff ) Let ξ, ξ1 , ξ2 , . . . be i.i.d. random variables with m = E ξ < ∞. Then



n−1 log P ξ¯n ≥ x → −Λ∗ (x),

x ≥ m.

(3)

Proof: Using Chebyshev’s inequality and (1), we get for any u > 0







P ξ¯n ≥ x = P euSn ≥ enux uSn ≤ e−nux Ee

= exp nΛ(u) − nux , and so





n−1 log P ξ¯n ≥ x ≤ Λ(u) − ux.

This remains true for u ≤ 0, since in that case Λ(u) − ux ≥ 0 for x ≥ m. Hence, by (2), we have the upper bound



n−1 log P ξ¯n ≥ x ≤ −Λ∗ (x),

x ≥ m, n ∈ N.

(4)

To derive a matching lower bound, we first assume that Λ < ∞ on R+ . Then Λ is smooth on (0, ∞) with Λ (0+) = m and Λ (∞) = ess sup ξ ≡ b, and so for any a ∈ (m, b) we can choose a u > 0 with Λ (u) = a. Let η, η1 , η2 , . . . be i.i.d. with distribution 



P {η ∈ B} = e−Λ(u) E euξ ; ξ ∈ B ,

B ∈ B.

(5)

Then Λη (r) = Λξ (r + u) − Λξ (u), and so Eη = Λη (0) = Λξ (u) = a. For any ε > 0, we get by (5)  

 





ηn ); |¯ ηn − a| < ε P  ξ¯n − a  < ε = enΛ(u) E exp(−nu¯







≥ exp nΛ(u) − nu(a + ε) P |¯ ηn − a| < ε .

(6)


Here the last probability tends to 1 by the law of large numbers, and so by (2)

    liminf_{n→∞} n^{−1} log P{ |ξ̄_n − a| < ε } ≥ Λ(u) − u(a + ε)
                                               ≥ −Λ*(a + ε).

Fixing any x ∈ (m, b) and putting a = x + ε, we get for small enough ε > 0

    liminf_{n→∞} n^{−1} log P{ξ̄_n ≥ x} ≥ −Λ*(x + 2ε).

Since Λ* is continuous on (m, b) by convexity, we may let ε → 0 and combine with (4) to obtain (3). The result for x > b is trivial, since in that case both sides of (3) equal −∞. If instead x = b < ∞, then both sides equal log P{ξ = b}, the left side by a simple computation and the right side by an elementary estimate. Finally, let x = m > −∞. Since the statement is trivial when ξ = m a.s., we may assume that b > m. For any y ∈ (m, b), we have

    0 ≥ n^{−1} log P{ξ̄_n ≥ m}
      ≥ n^{−1} log P{ξ̄_n ≥ y} → −Λ*(y) > −∞.

Here Λ*(y) → Λ*(m) = 0 by continuity, and (3) follows for x = m. This completes the proof when Λ < ∞ on R_+.

The case where Λ(u) = ∞ for some u > 0 may be handled by truncation. Thus, for any r > m we consider the random variables ξ_k^r = ξ_k ∧ r. Writing Λ_r and Λ*_r for the associated functions Λ and Λ*, we get for x ≥ m ≥ Eξ^r

    n^{−1} log P{ξ̄_n ≥ x} ≥ n^{−1} log P{ξ̄_n^r ≥ x} → −Λ*_r(x).    (7)

Now Λ_r(u) ↑ Λ(u) by monotone convergence as r → ∞, and by Dini's theorem² the convergence is uniform on every compact interval where Λ < ∞. Since also Λ is unbounded on the set where Λ < ∞, it follows easily that Λ*_r(x) → Λ*(x) for all x ≥ m. The required lower bound is now immediate from (7). □

We may supplement Lemma 24.1 with a criterion for exponential decline of the tail probabilities P{ξ̄_n ≥ x}.

Corollary 24.4 (exponential rate) Let ξ, ξ_1, ξ_2, . . . be i.i.d. random variables with m = Eξ < ∞ and b = ess sup ξ. Then
(i) for any x ∈ (m, b), the probabilities P{ξ̄_n ≥ x} decrease exponentially iff Λ(ε) < ∞ for some ε > 0,
(ii) the exponential decline extends to x = b iff P{ξ = b} ∈ (0, 1).

² When f_n ↑ f for some continuous functions f_n, f on a metric space, the convergence is uniform on compacts.
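Before turning to the proof, a small numerical check of (3) in the Bernoulli case of Lemma 24.2 (ii). The parameters p = 0.3 and x = 0.5 below are arbitrary; the tail is computed exactly from the binomial distribution rather than by simulation, since the probabilities involved are far too small to estimate by Monte Carlo — which is precisely the point of the theorem. The sketch is an informal aside, not part of the text.

    import math

    def log_binom_tail(n, p, x):
        # log P{ S_n >= n x } for S_n ~ Binomial(n, p), computed in log space
        k0 = math.ceil(n * x)
        logs = [
            math.lgamma(n+1) - math.lgamma(k+1) - math.lgamma(n-k+1)
            + k*math.log(p) + (n-k)*math.log(1-p)
            for k in range(k0, n+1)
        ]
        m = max(logs)
        return m + math.log(sum(math.exp(l - m) for l in logs))

    p, x = 0.3, 0.5
    rate = x*math.log(x/p) + (1-x)*math.log((1-x)/(1-p))   # Lambda*(x), Lemma 24.2 (ii)
    for n in (10**2, 10**3, 10**4):
        print(n, log_binom_tail(n, p, x)/n, -rate)

The printed averages approach −Λ*(x) at the polynomially slow speed one expects from the sub-exponential prefactor.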


Proof: If Λ(ε) < ∞ for some ε > 0, then Λ'(0+) = m by dominated convergence, and so Λ*(x) > 0 for all x > m. If instead Λ = ∞ on (0, ∞), then Λ*(x) = 0 for all x ≥ m. The statement for x = b is trivial. □

The large deviation estimates in Theorem 24.3 are easily extended from intervals [x, ∞) to arbitrary open or closed sets, which leads to a large-deviation principle for i.i.d. sequences in R. To fulfill our needs in subsequent applications and extensions, we consider a version of the same result in R^d. Motivated by the last result, and to avoid some technical complications, we assume that Λ(u) < ∞ for all u. Write B° and B̄ = B^− for the interior and closure of a set B.

Theorem 24.5 (large deviations in R^d, Varadhan) Let ξ, ξ_1, ξ_2, . . . be i.i.d. random vectors in R^d with Λ = Λ_ξ < ∞. Then for any B ∈ B^d,

    −inf_{x∈B°} Λ*(x) ≤ liminf_{n→∞} n^{−1} log P{ξ̄_n ∈ B}
                      ≤ limsup_{n→∞} n^{−1} log P{ξ̄_n ∈ B} ≤ −inf_{x∈B̄} Λ*(x).

Proof: To derive the upper bound, fix any ε > 0. By (2) there exists for every x ∈ R^d some u_x ∈ R^d, such that

    u_x x − Λ(u_x) > { Λ*(x) − ε } ∧ ε^{−1},

and by continuity we may choose an open ball B_x around x, such that

    u_x y > Λ(u_x) + { Λ*(x) − ε } ∧ ε^{−1},    y ∈ B_x.

By Chebyshev's inequality and (1), we get for any n ∈ N

    P{ξ̄_n ∈ B_x} ≤ E exp{ u_x S_n − n inf{ u_x y; y ∈ B_x } }
                 ≤ exp( −n [ { Λ*(x) − ε } ∧ ε^{−1} ] ).    (8)

Further note that Λ < ∞ implies Λ*(x) → ∞ as |x| → ∞, at least when d = 1. By Lemma 24.1 and Theorem 24.3, we may then choose r > 0 so large that

    n^{−1} log P{ |ξ̄_n| > r } ≤ −1/ε,    n ∈ N.    (9)

Now let B ⊂ R^d be closed. Then the set {x ∈ B; |x| ≤ r} is compact and may be covered by finitely many balls B_{x_1}, . . . , B_{x_m} with centers x_i ∈ B. By (8) and (9), we get for any n ∈ N

    P{ξ̄_n ∈ B} ≤ Σ_{i≤m} P{ξ̄_n ∈ B_{x_i}} + P{ |ξ̄_n| > r }
               ≤ Σ_{i≤m} exp( −n [ { Λ*(x_i) − ε } ∧ ε^{−1} ] ) + e^{−n/ε}
               ≤ (m + 1) exp( −n [ { Λ*(B) − ε } ∧ ε^{−1} ] ),


where Λ*(B) = inf_{x∈B} Λ*(x). Hence,

    limsup_{n→∞} n^{−1} log P{ξ̄_n ∈ B} ≤ −{ Λ*(B) − ε } ∧ ε^{−1},

and the upper bound follows since ε was arbitrary.

Turning to the lower bound, we first assume that Λ(u)/|u| → ∞ as |u| → ∞. Fix any open set B ⊂ R^d and a point x ∈ B. By compactness and the smoothness of Λ, there exists a u ∈ R^d with ∇Λ(u) = x. Let η, η_1, η_2, . . . be i.i.d. random vectors with distribution (5), and note as before that Eη = x. For ε > 0 small enough, we get as in (6)

    P{ξ̄_n ∈ B} ≥ P{ |ξ̄_n − x| < ε }
               ≥ exp{ nΛ(u) − nux − nε|u| } P{ |η̄_n − x| < ε }.

Hence, by the law of large numbers and (2),

    liminf_{n→∞} n^{−1} log P{ξ̄_n ∈ B} ≥ Λ(u) − ux − ε|u|
                                        ≥ −Λ*(x) − ε|u|.

It remains to let ε → 0 and take the supremum over x ∈ B.

To eliminate the growth condition on Λ, let ζ, ζ_1, ζ_2, . . . be i.i.d. standard Gaussian random vectors independent of ξ and the ξ_n. Then for any σ > 0 and u ∈ R^d, we have by Lemma 24.2 (i)

    Λ_{ξ+σζ}(u) = Λ_ξ(u) + Λ_ζ(σu) = Λ_ξ(u) + ½ σ²|u|² ≥ Λ_ξ(u),

and in particular Λ*_{ξ+σζ} ≤ Λ*_ξ. Since also Λ_{ξ+σζ}(u)/|u| ≥ σ²|u|/2 → ∞, we note that the previous bound applies to ξ̄_n + σζ̄_n. Now fix any x ∈ B as before, and choose ε > 0 small enough that B contains a 2ε-ball around x. Then

    P{ |ξ̄_n + σζ̄_n − x| < ε } ≤ P{ξ̄_n ∈ B} + P{ σ|ζ̄_n| ≥ ε }
                              ≤ 2 [ P{ξ̄_n ∈ B} ∨ P{ σ|ζ̄_n| ≥ ε } ].

Applying the lower bound to the variables ξ̄_n + σζ̄_n and the upper bound to ζ̄_n, we get by Lemma 24.2 (i)

    −Λ*_ξ(x) ≤ −Λ*_{ξ+σζ}(x)
             ≤ liminf_{n→∞} n^{−1} log P{ |ξ̄_n + σζ̄_n − x| < ε }
             ≤ liminf_{n→∞} n^{−1} log [ P{ξ̄_n ∈ B} ∨ P{ σ|ζ̄_n| ≥ ε } ]
             ≤ [ liminf_{n→∞} n^{−1} log P{ξ̄_n ∈ B} ] ∨ (−ε²/2σ²).

The desired lower bound now follows, as we let σ → 0 and then take the supremum over all x ∈ B. □
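The change of measure (5) used in the proof is also the standard importance-sampling recipe for estimating such rare-event probabilities: sampling from the tilted law and reweighting by exp{nΛ(u) − uS_n} gives an unbiased estimator with manageable relative error. The sketch below (plain Python, exponential tilting of a centred Gaussian; the values n = 400, x = 1 and the tilt u = x are only illustrative choices) is an informal aside, not part of the proof.

    import math, random

    random.seed(2)
    n, x = 400, 1.0
    u = x                                 # tilt so that the tilted mean is x (Lambda'(u) = u = x)
    # exact value, since xbar_n ~ N(0, 1/n):
    log_true = math.log(0.5*math.erfc(x*math.sqrt(n)/math.sqrt(2)))

    reps, acc = 2000, 0.0
    for _ in range(reps):
        S = sum(random.gauss(x, 1.0) for _ in range(n))    # sample from the tilted law (5)
        if S >= n*x:
            acc += math.exp(n*u*u/2 - u*S)                 # likelihood ratio on this event
    print(math.log(acc/reps)/n, log_true/n, -0.5*x*x)      # all three close to -1/2

A direct Monte Carlo estimate of the same probability (about 10^{-89} here) would of course be hopeless.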


We can also establish large-deviation results in function spaces. The following theorem is basic and sets the pattern for more complex results. For convenience, we may write C = C_{[0,1],R^d} and C_0^k = {x ∈ C^k; x_0 = 0}. We also introduce the Cameron–Martin space H_1, consisting of all absolutely continuous functions x ∈ C_0 with a Radon–Nikodym derivative ẋ ∈ L², so that ‖ẋ‖_2² = ∫_0^1 |ẋ_t|² dt < ∞.

Theorem 24.6 (large deviations of Brownian motion, Schilder) Let X be a d-dimensional Brownian motion on [0, 1]. Then for any Borel set B ⊂ C_{[0,1],R^d}, we have

    −inf_{x∈B°} I(x) ≤ liminf_{ε→0} ε² log P{εX ∈ B}
                     ≤ limsup_{ε→0} ε² log P{εX ∈ B} ≤ −inf_{x∈B̄} I(x),

where

    I(x) = ½ ‖ẋ‖_2²  for x ∈ H_1,    I(x) = ∞  for x ∉ H_1.

The proof requires a simple topological fact.

Lemma 24.7 (level sets) The level sets L_r below are compact in C_{[0,1],R^d}:

    L_r = I^{−1}[0, r] = { x ∈ H_1; ‖ẋ‖_2² ≤ 2r },    r > 0.

Proof: Cauchy's inequality gives

    |x_t − x_s| ≤ ∫_s^t |ẋ_u| du ≤ (t − s)^{1/2} ‖ẋ‖_2,    0 ≤ s < t ≤ 1, x ∈ H_1.

By the Arzelà–Ascoli Theorem A5.2, the set L_r is then relatively compact in C. It is also weakly compact in the Hilbert space H_1 with norm ‖x‖ = ‖ẋ‖_2. Thus, every sequence x_1, x_2, . . . ∈ L_r has a sub-sequence that converges in both C and H_1, say with limits x ∈ C and y ∈ L_r, respectively. For every t ∈ [0, 1], the sequence x_n(t) then converges in R^d to both x(t) and y(t), and we get x = y ∈ L_r. □

Proof of Theorem 24.6: To establish the lower bound, fix any open set B ⊂ C. Since I = ∞ outside H_1, it suffices to prove that

    −I(x) ≤ liminf_{ε→0} ε² log P{εX ∈ B},    x ∈ B ∩ H_1.    (10)

Then note as in Lemma 1.37 that C_0² is dense in H_1, and also that ‖x‖_∞ ≤ ‖ẋ‖_1 ≤ ‖ẋ‖_2 for any x ∈ H_1. Hence, for every x ∈ B ∩ H_1, there exist some functions x_n ∈ B ∩ C_0² with I(x_n) → I(x), and it suffices to prove (10) for x ∈ B ∩ C_0². For small enough h > 0, Theorem 19.23 yields

    P{εX ∈ B} ≥ P{ ‖εX − x‖_∞ < h }
              = E[ E(−(ẋ/ε)·X)_1; ‖εX‖_∞ < h ].    (11)


Integrating by parts gives

    log E(−(ẋ/ε)·X)_1 = −ε^{−1} ∫_0^1 ẋ_t dX_t − ε^{−2} I(x)
                      = −ε^{−1} ẋ_1 X_1 + ε^{−1} ∫_0^1 ẍ_t X_t dt − ε^{−2} I(x),

and so by (11),

    ε² log P{εX ∈ B} ≥ −I(x) − h|ẋ_1| − h ‖ẍ‖_1 + ε² log P{ ‖εX‖_∞ < h }.

Now (10) follows as we let ε → 0 and then h → 0.

Turning to the upper bound, fix any closed set B ⊂ C, and let B_h be the closed h-neighborhood of B. Letting X_n be the n-segment, polygonal approximation of X with X_n(k/n) = X(k/n) for k ≤ n, we note that

    P{εX ∈ B} ≤ P{εX_n ∈ B_h} + P{ ε‖X − X_n‖ > h }.    (12)

Writing I(B_h) = inf{I(x); x ∈ B_h}, we obtain

    P{εX_n ∈ B_h} ≤ P{ I(εX_n) ≥ I(B_h) }.

Here 2I(X_n) is a sum of nd variables ξ_{ik}², where the ξ_{ik} are i.i.d. N(0, 1), and so by Lemma 24.2 (i) and an interpolated version of Theorem 24.5,

    limsup_{ε→0} ε² log P{εX_n ∈ B_h} ≤ −I(B_h).    (13)

Next, we get by Proposition 14.13 and some elementary estimates

    P{ ε‖X − X_n‖ > h } ≤ n P{ ε‖X‖ > h√n / 2 }
                        ≤ 2nd P{ ε²ξ² > h²n/4d },

where ξ is N(0, 1). Applying Theorem 24.5 and Lemma 24.2 (i) again, we obtain

    limsup_{ε→0} ε² log P{ ε‖X − X_n‖ > h } ≤ −h²n/8d.    (14)

Combining (12), (13), and (14) gives

    limsup_{ε→0} ε² log P{εX ∈ B} ≤ −I(B_h) ∧ (h²n/8d),

and as n → ∞ we obtain the upper bound −I(B_h). It remains to show that I(B_h) ↑ I(B) as h → 0. Then fix any r > sup_h I(B_h). For every h > 0, we may choose an x_h ∈ B_h with I(x_h) ≤ r, and by Lemma 24.7 we may extract a convergent sequence x_{h_n} → x with h_n → 0, such that even I(x) ≤ r. Since also x ∈ ⋂_h B_h = B, we obtain I(B) ≤ r, as required. □

The last two theorems suggest the following abstraction. Letting ξ_ε, ε > 0, be random elements in a metric space S with Borel σ-field S, we say that


the family (ξ_ε) satisfies a large-deviation principle (LDP) with rate function I: S → [0, ∞], if for any B ∈ S,

    −inf_{x∈B°} I(x) ≤ liminf_{ε→0} ε log P{ξ_ε ∈ B}
                     ≤ limsup_{ε→0} ε log P{ξ_ε ∈ B} ≤ −inf_{x∈B̄} I(x).    (15)

For sequences ξ_1, ξ_2, . . . , we require the same condition with the normalizing factor ε replaced by n^{−1}. It is often convenient to write I(B) = inf_{x∈B} I(x). Writing S_I for the class {B ∈ S; I(B°) = I(B̄)} of I-continuity sets, we note that (15) yields the convergence

    lim_{ε→0} ε log P{ξ_ε ∈ B} = −I(B),    B ∈ S_I.    (16)
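To make (16) concrete in the Brownian setting of Theorem 24.6 (where the normalizing factor is ε²), take d = 1 and B = {x; sup_t x_t ≥ 1}. The infimum of ½‖ẋ‖_2² over B is ½, attained by the straight line x_t = t, and B is easily seen to be an I-continuity set. By the reflection principle, P{sup_t εX_t ≥ 1} = 2P{εX_1 ≥ 1}, so the limit can be checked in closed form; the few lines of Python below (an informal aside, with an arbitrary range of ε-values) do exactly that.

    import math

    # Reflection principle: P{ sup_{t<=1} eps*X_t >= 1 } = 2 P{ X_1 >= 1/eps }
    #                                                    = erfc(1/(eps*sqrt(2)))
    for eps in (0.5, 0.2, 0.1, 0.05):
        p = math.erfc(1.0/(eps*math.sqrt(2.0)))
        print(eps, eps*eps*math.log(p))      # approaches -1/2 = -inf_B I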

If ξ, ξ_1, ξ_2, . . . are i.i.d. random vectors in R^d with Λ = Λ_ξ < ∞, then by Theorem 24.5 the averages ξ̄_n satisfy the LDP in R^d with rate function Λ*. If instead X is a d-dimensional Brownian motion on [0, 1], then Theorem 24.6 shows that the processes ε^{1/2}X satisfy the LDP in C_{[0,1],R^d}, with rate function I(x) = ½‖ẋ‖_2² for x ∈ H_1 and I(x) = ∞ otherwise.

We show that the rate function I is essentially unique.

Lemma 24.8 (regularization and uniqueness) Let (ξ_ε) satisfy an LDP in a metric space S with rate function I. Then
(i) I can be chosen to be lower semi-continuous,
(ii) the choice of I in (i) is unique.

Proof: (i) Assume (15) for some I. Then the function

    J(x) = liminf_{y→x} I(y),    x ∈ S,

is clearly lower semi-continuous with J ≤ I. It is also easy to verify that J(G) = I(G) for all open sets G ⊂ S. Thus, (15) remains true with I replaced by J.

(ii) Suppose that (15) holds for each of the lower semi-continuous functions I and J, and let I(x) < J(x) for some x ∈ S. By the semi-continuity of J, we may choose a neighborhood G of x with J(Ḡ) > I(x). Applying (15) to both I and J, we get the contradiction

    −I(x) ≤ −I(G) ≤ liminf_{ε→0} ε log P{ξ_ε ∈ G} ≤ −J(Ḡ) < −I(x).    □

Justified by the last result, we may henceforth take the lower semi-continuity to be part of our definition of a rate function. (An arbitrary function I satisfying (15) will then be called a raw rate function.) No regularization is needed in Theorems 24.5 and 24.6, since the associated rate functions Λ∗ and I are


already lower semi-continuous, the former as the supremum of a family of continuous functions, the latter by Lemma 24.7.

It is sometimes useful to impose a slightly stronger regularity condition on the function I. Thus, we say that I is good, if the level sets I^{−1}[0, r] = {x ∈ S; I(x) ≤ r} are compact (rather than just closed). Note that the infimum I(B) = inf_{x∈B} I(x) is then attained for every closed set B ≠ ∅. The rate functions in Theorems 24.5 and 24.6 are clearly both good. A related condition on the family (ξ_ε) is the exponential tightness

    inf_K limsup_{ε→0} ε log P{ξ_ε ∉ K} = −∞,    (17)

where the infimum extends over all compact sets K ⊂ S. We actually need only the slightly weaker condition of sequential exponential tightness, where (17) is only required along sequences ε_n → 0. To simplify our exposition, we often omit the sequential qualification from our statements, and carry out the proofs under the stronger non-sequential hypothesis. We finally say that (ξ_ε) satisfies the weak LDP with rate function I, if the lower bound in (15) holds as stated, while the upper bound is only required for compact sets B.

We list some relations between the mentioned properties.

Lemma 24.9 (goodness, exponential tightness, and weak LDP) Let ξ_ε, ε > 0, be random elements in a metric space S. Then
(i) if (ξ_ε) satisfies an LDP with rate function I, then (16) holds, and the two conditions are equivalent when I is good,
(ii) if the ξ_ε are exponentially tight and satisfy a weak LDP with rate function I, then I is good, and (ξ_ε) satisfies the full LDP,
(iii) if S is Polish and (ξ_ε) satisfies an LDP with rate function I, then I is good iff (ξ_ε) is sequentially exponentially tight.

Proof: (i) Let I be good and satisfy (16). Write B^h for the closed h-neighborhood of B ∈ S. Since I(B^h) is non-increasing in h, we have B^h ∉ S_I for at most countably many h > 0. Hence, (16) yields for almost every h > 0

    limsup_{ε→0} ε log P{ξ_ε ∈ B} ≤ lim_{ε→0} ε log P{ξ_ε ∈ B^h} = −I(B^h).

To see that I(B^h) ↑ I(B̄) as h → 0, suppose that instead sup_h I(B^h) < I(B̄). Since I is good, we may choose for every h > 0 some x_h ∈ B^h with I(x_h) = I(B^h), and then extract a convergent sequence x_{h_n} → x ∈ B̄ with h_n → 0. Then the lower semi-continuity of I yields the contradiction

    I(B̄) ≤ I(x) ≤ liminf_{n→∞} I(x_{h_n}) ≤ sup_{h>0} I(B^h) < I(B̄),

proving the upper bound. Next let x ∈ B°, and conclude from (16) that, for almost all sufficiently small h > 0,

    −I(x) ≤ −I({x}^h) = lim_{ε→0} ε log P{ξ_ε ∈ {x}^h}
          ≤ liminf_{ε→0} ε log P{ξ_ε ∈ B}.

The lower bound now follows as we take the supremum over x ∈ B o . (ii) By (17), we may choose some compact sets Kr satisfying lim sup ε log P {ξε ∈ / Kr } < −r,

r > 0.

(18)

ε→0

For any closed set B ⊂ S, we have 





P {ξε ∈ B} ≤ 2 P ξε ∈ B ∩ Kr ∨ P ξε ∈ / Kr



r > 0,

,

and so, by the weak LDP and (18), lim sup ε log P {ξε ∈ B} ≤ −I(B ∩ Kr ) ∧ r ε→0 ≤ −I(B) ∧ r. The upper bound now follows as we let r → ∞. Applying the lower bound and (18) to the sets Krc gives



/ Kr < −r, −I(Krc ) ≤ lim sup ε log P ξε ∈

r > 0,

ε→0

and so I −1 [0, r] ⊂ Kr for all r > 0, which shows that I is good. (iii) The sufficiency follows from (ii), applied to an arbitrary sequence εn → 0. Now let S be separable and complete, and assume that the rate function I is good. For any k ∈ N, we may cover S by some open balls Bk1 , Bk2 , . . . of  c ) = ∞, since any radius 1/k. Putting Ukm = j≤m Bkj , we have supm I(Ukm level set I −1 [0, r] is covered by finitely many sets Bkj . Now fix any sequence εn → 0 and constant r > 0. By the LDP upper bound and the fact that c } → 0 as m → ∞ for fixed n and k, we may choose mk ∈ N so P {ξεn ∈ Ukm large that

c ≤ exp (−rk/εn ), n, k ∈ N. P ξεn ∈ Uk,m k Summing a geometric series, we obtain



c lim sup εn log P ξεn ∈ ∪ k Uk,m ≤ −r. k n→∞

The asserted exponential tightness now follows, since the set bounded and hence relatively compact.

 k

Uk,mk is totally 2

By analogy with weak convergence theory, we may look for a version of (16) for continuous functions. Theorem 24.10 (functional LDP, Varadhan, Bryc) Let ξε , ε > 0, be random elements in a metric space S.


(i) If (ξ_ε) satisfies an LDP with rate function I, and f: S → R is continuous and bounded above, then

    Λ_f ≡ lim_{ε→0} ε log E exp{ f(ξ_ε)/ε } = sup_{x∈S} { f(x) − I(x) }.

(ii) If the ξ_ε are exponentially tight and the limits Λ_f in (i) exist for all f ∈ C_b, then (ξ_ε) satisfies an LDP with the good rate function

    I(x) = sup_{f∈C_b} { f(x) − Λ_f },    x ∈ S.
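Part (i) is Varadhan's lemma, and it is easy to test numerically in the Gaussian case ξ_ε = ε^{1/2}ζ with ζ standard normal, where I(x) = ½x²: for f(x) = x one gets sup_x {x − ½x²} = ½. The short Python check below evaluates the expectation by quadrature on a grid; it is only an illustration, and the grid bounds and step are ad hoc choices.

    import math

    def lhs(eps, f, step=1e-3, span=60.0):
        # eps * log E exp{ f(sqrt(eps)*Z)/eps } for Z ~ N(0,1), by quadrature in log space
        n = int(span/step)
        logs = [f(math.sqrt(eps)*k*step)/eps - 0.5*(k*step)**2
                + math.log(step) - 0.5*math.log(2*math.pi)
                for k in range(-n, n+1)]
        m = max(logs)
        return eps*(m + math.log(sum(math.exp(l - m) for l in logs)))

    f = lambda x: x                     # sup_x { f(x) - x^2/2 } = 1/2
    for eps in (1.0, 0.1, 0.01):
        print(eps, lhs(eps, f), 0.5)

Here the limit is in fact attained exactly for every ε, since E exp(Z/√ε) = exp(1/2ε).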

Proof: (i) For every n ∈ N, we can choose finitely many closed sets B1 , . . . ,  Bm ⊂ S, such that f ≤ −n on j Bjc , and the oscillation of f on each Bj is at most n−1 . Then 



lim sup ε log Eef (ξε )/ε ≤ max lim sup ε log E ef (ξε )/ε ; ξε ∈ Bj ∨ (−n) j≤m

ε→0

ε→0





≤ max sup f (x) − inf I(x) ∨ (−n) j≤m

x∈Bj

x∈Bj





≤ max sup f (x) − I(x) + n−1 ∨ (−n) j≤m x∈Bj





= sup f (x) − I(x) + n−1 ∨ (−n). x∈S

The upper bound now follows as we let n → ∞. Next we fix any x ∈ S with a neighborhood G, and write 

lim inf ε log Eef (ξε )/ε ≥ lim inf ε log E ef (ξε )/ε ; ξε ∈ G ε→0



ε→0

≥ inf f (y) − inf I(y) y∈G

y∈G

≥ inf f (y) − I(x). y∈G

Here the lower bound follows as we let G ↓ {x}, and then take the supremum over x ∈ S. (ii) Note that I is lower semi-continuous, as the supremum over a family of continuous functions. Since Λf = 0 for f = 0, it is also clear that I ≥ 0. By Lemma 24.9 (ii), it remains to show that (ξε ) satisfies the weak LDP with rate function I. Then fix any δ > 0. For every x ∈ S, we may choose a function fx ∈ Cb satisfying

fx (x) − Λfx > I(x) − δ ∧ δ −1 , and by continuity there exists a neighborhood Bx of x such that



fx (y) > Λfx + I(x) − δ ∧ δ −1 ,

y ∈ Bx .

By Chebyshev’s inequality, we get for any ε > 0







P {ξε ∈ Bx } ≤ E exp ε−1 fx (ξε ) − inf {fx (y); y ∈ Bx }



≤ E exp ε−1 fx (ξε ) − Λfx − {I(x) − δ} ∧ δ −1



,


and so by the definition of Λfx ,

lim sup ε log P ξε ∈ Bx





ε→0







≤ lim ε log E exp fx (ξε )/ε − Λfx − I(x) − δ ∧ δ −1 ε→0



= − I(x) − δ ∧ δ −1 . Now fix any compact set K ⊂ S, and choose x1 , . . . , xm ∈ K such that K ⊂  i Bxi . Then





lim sup ε log P ξε ∈ K ≤ max lim sup ε log P ξε ∈ Bxi i≤m

ε→0

ε→0







≤ − min I(xi ) − δ ∧ δ −1 i≤m





≤ − I(K) − δ ∧ δ −1 . The upper bound now follows as we let δ → 0. Next consider any open set G and element x ∈ G. For any n ∈ N, we may choose a continuous function fn : S → [−n, 0] such that fn (x) = 0 and fn = −n on Gc . Then

−I(x) = inf Λf − f (x) f ∈Cb

≤ Λfn − fn (x) = Λfn

= lim ε log E exp fn (ξε )/ε

ε→0



≤ lim inf ε log P ξε ∈ G ∨ (−n). ε→0

The lower bound now follows as we let n → ∞, and then take the supremum over all x ∈ G. □

We proceed to show how the LDP is preserved by continuous mappings. The following results are often referred to as direct and inverse contraction principles. Given a rate function I on S and a map f: S → T, we define the image J = I ∘ f^{−1} on T as the function

    J(y) = I(f^{−1}{y}) = inf{ I(x); f(x) = y },    y ∈ T.    (19)

Note that the corresponding set functions are related by

    J(B) ≡ inf_{y∈B} J(y) = inf{ I(x); f(x) ∈ B } = I(f^{−1}B),    B ⊂ T.
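As a simple illustration of (19), not taken from the text: if the ξ̄_n are standard Gaussian averages with rate function I(x) = ½x² on R, then Theorem 24.11 (i) below, applied to the continuous map f(x) = |x|, gives an LDP for |ξ̄_n| with rate

    J(y) = inf{ ½x²; |x| = y } = ½y²  for y ≥ 0,    J(y) = ∞  for y < 0,

while the non-injective map f(x) = x² yields J(y) = y/2 for y ≥ 0, so passing to an image may genuinely lower the rate.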

Theorem 24.11 (continuous mapping) Let f be a continuous map between two metric spaces S, T , and let ξε be random elements in S. (i) If (ξε ) satisfies an LDP in S with rate function I, then the images f (ξε ) satisfy an LDP in T with the raw rate function J = I ◦ f −1 . Moreover, J is a good rate function on T , whenever the function I is good on S.


(ii) (Ioffe) Let (ξε ) be exponentially tight in S, let f be injective, and let the images f (ξε ) satisfy a weak LDP in T with rate function J. Then (ξε ) satisfies an LDP in S with the good rate function I = J ◦ f . Proof: (i) Since f is continuous, the set f −1 B is open or closed whenever the corresponding property holds for B. Using the LDP for (ξε ), we get for any B ⊂ T

−I(f −1 B o ) ≤ lim inf ε log P ξε ∈ f −1 B o ε→0







¯ ≤ −I(f −1 B), ¯ ≤ lim sup ε log P ξε ∈ f −1 B ε→0

which proves the LDP for {f (ξε )} with the raw rate function J = I ◦ f −1 . When I is good, we claim that 



J −1 [0, r] = f I −1 [0, r] ,

r ≥ 0.

(20)

To see this, fix any r ≥ 0, and let x ∈ I −1 [0, r]. Then J ◦ f (x) = I ◦ f −1 ◦ f (x)



= inf I(u); f (u) = f (x) ≤ I(x) ≤ r,

which means that f (x) ∈ J −1 [0, r]. Conversely, let y ∈ J −1 [0, r]. Since I is good and f is continuous, the infimum in (19) is attained at some x ∈ S, and we get y = f (x) with I(x) ≤ r. Thus, y ∈ f (I −1 [0, r]), which completes the proof of (20). Since continuous maps preserve compactness, (20) shows that the goodness of I carries over to J. (ii) Here I is again a rate function, since the lower semi-continuity of J is preserved by composition with the continuous map f . By Lemma 24.9 (ii), it is then enough to show that (ξε ) satisfies a weak LDP in S. To prove the upper bound, fix any compact set K ⊂ S, and note that the image set f (K) is again compact, since f is continuous. Hence, the weak LDP for {f (ξε )} yields





lim sup ε log P ξε ∈ K = lim sup ε log P f (ξε ) ∈ f (K) ε→0



ε→0

≤ −J{f (K)} = −I(K). Next we fix any open set G ⊂ S, and let x ∈ G be arbitrary with I(x) = r < ∞. Since (ξε ) is exponentially tight, we may choose a compact set K ⊂ S

such that lim sup ε log P ξε ∈ / K < −r. (21) ε→0

The continuous image f (K) is compact in T , and so by (21) and the weak LDP for {f (ξε )},


−I(K c ) = −J f (K c )





≤ −J {f (K)}c





≤ lim inf ε log P f (ξε ) ∈ / f (K)

ε→0



≤ lim sup ε log P ξε ∈ / K < −r. ε→0

Since I(x) = r, we conclude that x ∈ K. As a continuous bijection from the compact set K onto f (K), the function f is a homeomorphism between the two sets with their subset topologies. Then Lemma 1.6 yields an open set G ⊂ T , such that f (x) ∈ f (G ∩ K) = G ∩ f (K). Noting that





/K , P f (ξε ) ∈ G ≤ P ξε ∈ G + P ξε ∈ and using the weak LDP for {f (ξε )}, we get −r = −I(x) = −J{f (x)}

≤ lim inf ε log P f (ξε ) ∈ G ε→0



≤ lim inf ε log P ξε ∈ G



ε→0

Hence, by (21),



ε→0





−I(x) ≤ lim inf ε log P ξε ∈ G , ε→0



∨ lim sup ε log P ξε ∈ /K . x ∈ G,

and the lower bound follows as we take the supremum over all x ∈ G.

2

We turn to the powerful method of projective limits. The following sequential version is sufficient for our needs, and will enable us to extend the LDP to a variety of infinite-dimensional contexts. Some general background on projective limits is provided by Appendix 6. Theorem 24.12 (random sequences, Dawson & G¨artner) For any metric spaces S1 , S2 , . . . , let ξε = (ξεn ) be random elements in S ∞ = S1 ×S2 ×· · · , such that for each n ∈ N the vectors (ξε1 , . . . , ξεn ) satisfy an LDP in S n = S1 ×· · ·×Sn with the good rate function In . Then (ξε ) satisfies an LDP in S ∞ with the good rate function I(x) = supn In (x1 , . . . , xn ),

x = (x1 , x2 , . . .) ∈ S ∞ .

(22)

Proof: For any m ≤ n, we introduce the natural projections πn : S ∞ → S n and πmn : S n → S m . Since the πmn are continuous and the In are good, −1 for all m ≤ n, and so πmn (In−1 [0, r]) ⊂ Theorem 24.11 yields Im = In ◦ πmn −1 Im [0, r] for all r ≥ 0 and m ≤ n. Hence, for each r ≥ 0 the level sets In−1 [0, r] form a projective sequence. Since they are also compact by hypothesis, and by (22), (23) I −1 [0, r] = ∩ n πn−1 In−1 [0, r], r ≥ 0, the sets I −1 [0, r] are compact by Lemma A6.4. Thus, I is again a good rate function.


Now fix any closed set A ⊂ S, and put An = πn A, so that πmn An = Am − for all m ≤ n. Since the πmn are continuous, we have also πmn A− n ⊂ Am for − m ≤ n, which means that the sets An form a projective sequence. We claim that A = ∩ n πn−1 A− (24) n. Here the relation A ⊂ πn−1 A− / A. By the definition of n is obvious. Now let x ∈ the product topology, we may choose a k ∈ N and an open set U ⊂ S k , such that x ∈ πk−1 U ⊂ Ac . It follows easily that πk x ∈ U ⊂ Ack . Since U is open,  c / n πn−1 A− we have even πk x ∈ (A− n , which completes the proof of k ) . Thus, x ∈ −1 (24). The projective property carries over to the intersections A− n ∩ In [0, r], and formulas (23) and (24) combine into the relation



−1 A ∩ I −1 [0, r] = ∩ n πn−1 A− n ∩ In [0, r] ,

r ≥ 0.

(25)

Now let I(A) > r ∈ R. Then A ∩ I −1 [0, r] = ∅, and by (25) and Lemma −1 − A6.4 we get A− n ∩ In [0, r] = ∅ for some n ∈ N, which implies In (An ) ≥ r. −1 n Noting that A ⊂ πn An and using the LDP in S , we conclude that





lim sup ε log P ξε ∈ A ≤ lim sup ε log P πn ξε ∈ An ε→0



ε→0

≤ −In (A− n ) ≤ −r, The upper bound now follows as we let r ↑ I(A). Finally, fix an open set G ⊂ S ∞ , and let x ∈ G be arbitrary. By the definition of the product topology, we may choose n ∈ N and an open set U ⊂ S n such that x ∈ πn−1 U ⊂ G. The LDP in S n yields





lim inf ε log P ξε ∈ G ≥ lim inf ε log P πn ξε ∈ U ε→0



ε→0

≥ −In (U ) ≥ −In ◦ πn (x) ≥ −I(x), and the lower bound follows as we take the supremum over all x ∈ G.

2

We consider yet another basic extension principle for the LDP, involving suitable approximations. The families of random elements ξε and ηε in a common separable metric space (S, d) are said to be exponentially equivalent, if



lim ε log P d(ξε , ηε ) > h = −∞,

ε→0

h > 0.

(26)

The separability of S is only needed to ensure measurability of the pairwise distances d(ξε , ηε ). In general, we may replace (26) by a similar condition involving the outer measure. Lemma 24.13 (approximation) Let ξε , ηε be exponentially equivalent random elements in a separable metric space S. Then (ξε ) satisfies an LDP with the good rate function I, iff the same LDP holds for (ηε ).


Proof: Let the LDP hold for (ξε ) with rate function I. Fix any closed set B ⊂ S, and write B h for the closed h -neighborhood of B. Then











P ηε ∈ B ≤ P ξε ∈ B h + P d(ξε , ηε ) > h , and so by (26) and the LDP for (ξε ),

lim sup ε log P ηε ∈ B





ε→0





≤ lim sup ε log P ξε ∈ B h ∨ lim sup ε log P d(ξε , ηε ) > h ε→0



ε→0

≤ −I(B h ) ∨ (−∞) = −I(B h ).

Since I is good, we have I(B h ) ↑ I(B) as h → 0, and the required upper bound follows. Now fix any open set G ⊂ S, and let x ∈ G. If d(x, Gc ) > h > 0, we may choose a neighborhood U of x such that U h ⊂ G. Noting that











P ξε ∈ U ≤ P ηε ∈ G + P d(ξε , ηε ) > h , we get by (26) and the LDP for (ξε ) −I(x) ≤ −I(U )













≤ lim inf ε log P ξε ∈ U ε→0



≤ lim inf ε log P ηε ∈ G ∨ lim sup ε log P d(ξε , ηε ) > h ε→0



ε→0

= lim inf ε log P ηε ∈ G . ε→0

The required lower bound now follows, as we take the supremum over all x ∈ G. □

We turn to some important applications, illustrating the power of the general theory. First we study the perturbations of an ordinary differential equation ẋ = b(x) by a small noise term. More precisely, we consider the unique solutions X^ε with X_0^ε = 0 of the d-dimensional SDEs³

    dX_t = ε^{1/2} dB_t + b(X_t) dt,    t ≥ 0, ε ≥ 0,    (27)

where B is a Brownian motion in R^d and b is a bounded and uniformly Lipschitz continuous mapping on R^d. Let H_∞ be the set of all absolutely continuous functions x: R_+ → R^d with x_0 = 0, such that ẋ ∈ L².

Theorem 24.14 (perturbed dynamical systems, Freidlin & Wentzell) For any bounded, uniformly Lipschitz continuous function b: R^d → R^d, the solutions X^ε to (27) with X_0^ε = 0 satisfy an LDP in C_{R_+,R^d} with the good rate function

    I(x) = ½ ∫_0^∞ | ẋ_t − b(x_t) |² dt,    x ∈ H_∞.    (28)

³ See Chapter 32 for a detailed discussion of such equations.
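A quick way to visualize the theorem is to simulate (27) by the Euler–Maruyama scheme for a few small values of ε and watch the paths collapse onto the solution of ẋ = b(x), the unique path with zero rate in (28). The sketch below (plain Python, with the drift b(x) = −x, horizon T = 5 and step 0.01 chosen only for illustration) is an informal aside, not part of the text.

    import math, random

    random.seed(3)
    b = lambda x: -x                    # drift of the unperturbed system x' = b(x), x(0) = 0
    T, dt = 5.0, 0.01
    n = int(T/dt)

    def euler_path(eps):
        x, path = 0.0, [0.0]
        for _ in range(n):
            x += b(x)*dt + math.sqrt(eps*dt)*random.gauss(0.0, 1.0)
            path.append(x)
        return path

    for eps in (1.0, 0.1, 0.01):
        sup_dev = max(abs(v) for v in euler_path(eps))   # here the ODE solution is x == 0
        print(eps, sup_dev)                              # deviations shrink roughly like sqrt(eps)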


Here it is understood that I(x) = ∞ when x ∈ / H∞ . Note that the result for b = 0 extends Theorem 24.6 to processes on R+ . Proof: If B 1 is a Brownian motion on [0, 1], then for every r > 0, the 1 is a Brownian motion on scaled process B r = Φ(B 1 ) given by Btr = r1/2 Bt/r [0, r]. Since Φ is continuous from C[0,1] to C[0,r] , we see from Theorems 24.6 and 24.11 (i), combined with Lemma 24.7, that the processes ε1/2 B r satisfy an ˙ 22 LDP in C[0,r] with the good rate function Ir = I1 ◦ Φ−1 , where I1 (x) = 12 x for x ∈ H1 and I1 (x) = ∞ otherwise. Now Φ maps H1 onto Hr , and when y = Φ(x) with x ∈ H1 , we have x˙ t = r1/2 y˙ rt . Hence, a simple calculation yields 1 r ˙ 22 , which extends Theorem 24.6 to [0, r]. For the Ir (y) = 2 0 |y˙ s |2 ds = 12 y further extension to R+ , write πn x for the restriction of a function x ∈ CR+ to [0, n], and infer from Theorem 24.12 that the processes ε1/2 B satisfy an LDP ˙ 22 . in CR+ with the good rate function I∞ (x) = supn In (πn x) = 12 x By an elementary version of Theorem 32.3, the integral equation

xt = zt +

t 0

b(xs ) ds,

t ≥ 0,

(29)

has a unique solution x = F (z) in C = CR+ for every z ∈ C. Letting z 1 , z 2 ∈ C be arbitrary, and writing a for the Lipschitz constant of b, we note that the corresponding solutions xi = F (z i ) satisfy t       1     1  xt − x2t  ≤ z 1 − z 2  + a xs − x2s  ds, 0

t ≥ 0.

Hence, Gronwall’s Lemma 32.4 yields x1 − x2 ≤ z 1 − z 2 ear on the interval [0, r], which shows that F is continuous. Using Schilder’s theorem on R+ , along with Theorem 24.11 (i), we conclude that the processes X ε satisfy an LDP in CR+ with the good rate function I = I∞ ◦ F −1 . Now F is clearly bijective, and by (29) the functions z and x = F (z) lie simultaneously in H∞ , in which case z˙ = x˙ − b(x) a.e. Thus, I is indeed given by (28). 2 Now consider a random element ξ with distribution μ, in an arbitrary metric space S. Introduce the cumulant-generating functional Λ(f ) = log Eef (ξ) = log μef , f ∈ Cb (S), with associated Legendre–Fenchel transform



    Λ*(ν) = sup_{f∈C_b} { νf − Λ(f) },    ν ∈ M̂_S,    (30)

where M̂_S is the class of probability measures on S, endowed with the topology of weak convergence. Note that Λ and Λ* are both convex, by the same argument as for R^d.

For any measures μ, ν ∈ M̂_S, we define the relative entropy of ν with respect to μ by

    H(ν | μ) = ν log p = μ(p log p)   when ν ≪ μ with ν = p·μ,
    H(ν | μ) = ∞                      otherwise.

Since x log x is convex, the function H(ν | μ) is convex in ν for fixed μ, and Jensen's inequality yields

    H(ν | μ) ≥ μp log μp = νS log νS = 0,    ν ∈ M̂_S,

with equality iff ν = μ.

Now let ξ_1, ξ_2, . . . be i.i.d. random elements in S. The associated empirical distributions

    η_n = n^{−1} Σ_{k≤n} δ_{ξ_k},    n ∈ N,

may be regarded as random elements in M̂_S, and we note that

    η_n f = n^{−1} Σ_{k≤n} f(ξ_k),    f ∈ C_b(S), n ∈ N.
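Anticipating Sanov's theorem below, the identification of the rate with H(ν | μ) can be checked by hand in the finite case. The sketch below (plain Python; the three-point space, μ and the target ν are arbitrary choices with rational weights) computes the exact multinomial probability that the empirical distribution of n samples equals ν, and compares n^{−1} times its logarithm with −H(ν | μ). It is an illustration only.

    import math

    mu = [0.5, 0.3, 0.2]
    nu = [0.2, 0.3, 0.5]                       # target empirical distribution
    H = sum(q*math.log(q/p) for p, q in zip(mu, nu))   # relative entropy H(nu | mu)

    def log_prob_empirical(n):
        # log P{ eta_n = nu } for i.i.d. samples from mu (n*nu must be integer-valued)
        counts = [round(n*q) for q in nu]
        out = math.lgamma(n+1)
        for k, p in zip(counts, mu):
            out += k*math.log(p) - math.lgamma(k+1)
        return out

    for n in (10, 100, 1000, 10000):
        print(n, -log_prob_empirical(n)/n, H)

By Stirling's formula the printed averages converge to H(ν | μ), with a correction of order (log n)/n.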

In particular, Theorem 24.5 applies to the random vectors (ηn f1 , . . . , ηn fm ), for fixed f1 , . . . , fm ∈ Cb (S). The following result may be regarded as an infinitedimensional version of Theorem 24.5. It also provides an important connection to statistical mechanics, via the entropy function. Theorem 24.15 (large deviations of empirical distributions, Sanov) Let ξ1 , ξ2 , . . . be i.i.d. random elements with distribution μ in a Polish space S, and put Λ(f ) = log μ ef . Then the associated empirical distributions η1 , η2 , . . . satisfy ˆ S with the good rate function an LDP in M ˆ S. Λ∗ (ν) = H(ν | μ), ν ∈ M (31) A couple of lemmas will be needed for the proof. ˆ S, Lemma 24.16 (relative entropy, Donsker & Varadhan) For any μ, ν ∈ M (i) the supremum in (30) may be taken over all bounded, measurable functions f : S → R, (ii) then (31) extends to measures on any measurable space S. Proof: (i) Use Lemma 1.37 and dominated convergence. (ii) If ν  μ, then H(ν | μ) = ∞ by definition. Choosing B ∈ S with μB = 0 and νB > 0, and taking fn = n1B , we obtain νfn − log μefn = n νB → ∞. Thus, even Λ∗ (ν) = ∞ in this case, and it remains to prove (31) when ν  μ. Assuming ν = p · μ and writing f = log p, we note that νf − log μef = ν log p − log μp = H(ν | μ). If f = log p is unbounded, it may approximated by some bounded, measurable functions fn satisfying μefn → 1 and νfn → νf , and we get Λ∗ (ν) ≥ H(ν | μ).


To prove the reverse inequality, we first let S be finite and generated by a partition B1 , . . . , Bn of S. Putting μk = μBk , νk = νBk , and pk = νk /μk , we may write our claim in the form

Σ k νk xk − log Σ k μk ex ≤ Σ k νk log pk ,

g(x) ≡

k

where x = (x1 , . . . , xn ) ∈ Rd is arbitrary. Here the function g is concave and satisfies ∇g(x) = 0 for x = (log p1 , . . . , log pn ), asymptotically when pk = 0 for some k. Thus,   supx g(x) = g log p1 , . . . , log pk =

Σ k νk log pk .

To prove the general inequality νf − log μef ≤ ν log p, we may choose f to be simple. The generated σ-field F ⊂ S is then finite, and we note that ν = μ(p |F) · μ on F. Using the result in the finite case, along with Jensen’s inequality for conditional expectations, we obtain

νf − log μef ≤ μ μ(p |F ) log μ(p |F)  



≤ μ μ p log p  F = ν log p.





2

Lemma 24.17 (exponential tightness) The empirical distributions ηn in Theˆ S. orem 24.15 are exponentially tight in M Proof: If B ∈ S with P {ξ ∈ B} = p ∈ (0, 1), Theorem 24.3 and Lemmas 24.1 and 24.2 yield for any x ∈ [p, 1]



supn n−1 log P ηn B > x ≤ −x log

1−x x − (1 − x) log . p 1−p

(32)

In particular, the right-hand side tends to −∞ as p → 0 for fixed x ∈ (0, 1). Now fix any r > 0. By (32) and Theorem 23.2, we may choose some compact sets K1 , K2 , . . . ⊂ S with



P ηn Kkc > 2−k ≤ e−knr , Summing over k gives lim sup n−1 log P n→∞

∪k



k, n ∈ N.

ηn Kkc > 2−k ≤ −r,

and it remains to note that the set

ˆ S ; νKkc ≤ 2−k M = ∩k ν ∈ M is compact, by another application of Theorem 23.2.



2

Proof of Theorem 24.15: By Theorem 1.8, we can embed S as a Borel subset of a compact metric space K. The function space Cb (K) is separable,


and we can choose a dense sequence f1 , f2 , . . . ∈ Cb (K). For any m ∈ N, the random vector {f1 (ξ), . . . , fm (ξ)} has cumulant-generating function Λm (u) = log E exp = Λ◦

Σ uk fk (ξ)

k≤m

Σ uk f k ,

k≤m

u ∈ Rm ,

and so by Theorem 24.5 the random vectors (ηn f1 , . . . , ηn fm ) satisfy an LDP in Rm with the good rate function Λ∗m . Then by Theorem 24.12, the infinite sequences (ηn f1 , ηn f2 , . . .) satisfy an LDP in R∞ with the good rate function J = supm (Λ∗m ◦ πm ), where πm denotes the natural projection of R∞ onto Rm . ˆ K is compact by Theorem 23.2 and the mapping ν → (νf1 , νf2 , . . .) Since M ˆ K into R∞ , Theorem 24.11 (ii) shows that the is a continuous injection of M ˆ K with the good rate function random measures ηn satisfy an LDP in M 



IK (ν) = J νf1 , νf2 , . . .  = supm Λ∗m νf1 , . . . , νfm = sup sup m

u∈R



m

u k fk Σ uk νfk − Λ ◦ kΣ k≤m ≤m

= sup νf − Λ(f )

f ∈F







= sup νf − Λ(f ) ,

(33)

f ∈Cb

where F is the set of all linear combinations of f1 , f2 , . . . . ˆ K is continuous, since ˆS →M Next, we note that the natural embedding M for any f ∈ Cb (K) the restriction of f to S belongs to Cb (S). Since it is also trivially injective, we see from Theorem 24.11 (ii) and Lemma 24.17 that the ηn ˆ S , with a good rate function IS equal to the restricsatisfy an LDP even in M ˆ tion of IK to MS . It remains to note that IS = Λ∗ by (33) and Lemma 24.16. 2 We conclude with a remarkable application of Schilder’s Theorem 24.6. Writing B for a standard Brownian motion in Rd , we introduce for each t > e the scaled process Bst Xst =  , s ≥ 0. (34) 2t log log t Theorem 24.18 (functional law of the iterated logarithm, Strassen) For a Brownian motion B in Rd , define the processes X t by (34). Then these statements are equivalent and hold outside a fixed P -null set: (i) the set of paths X t with t ≥ 3 is relatively compact in CR+ ,Rd , with set of limit points as t → ∞



K = x ∈ H∞ ; x ˙ 2≤1 , (ii) for any continuous function F : CR+ ,Rd → R, lim sup F (X t ) = sup F (x). t→∞

x∈K


In particular, we may recover the classical law of the iterated logarithm in Theorem 14.18 by choosing F (x) = x1 . Using Theorem 22.6, we can easily derive a correspondingly strengthened version for random walks. Proof: The equivalence (i) ⇔ (ii) being elementary, it is enough to prove d (i). Noting that X t = (2 log log t)−1/2 B and using Theorem 24.6, we get for any measurable set A ⊂ CR+ ,Rd and constant r > 1

lim sup n→∞

lim inf 1 2

n

log n



log P X ∈ A rn

≤ lim sup

log P {X t ∈ A} ¯ ≤ −2 I(A), log log t

≥ lim inf

log P {X t ∈ A} ≥ −2 I(Ao ), log log t

t→∞



log n

n→∞

where I(x) =



log P X r ∈ A

t→∞

x ˙ 22 for x ∈ H∞ and I(x) = ∞ otherwise. Hence,





Σ n P Xr ∈ A n

< ∞, = ∞,

¯ > 1, 2 I(A) 2 I(Ao ) < 1.

(35)

Now fix any r > 1, and let G ⊃ K be open. Note that 2 I(Gc ) > 1 by Lemma 24.7. By the first part of (35) and the Borel–Cantelli lemma, we have n n / G i.o.} = 0, or equivalently 1G (X r ) → 1 a.s. Since G was arbitrary, P {X r ∈ n it follows that ρ(X r , K) → 0 a.s. for any metrization ρ of CR+ ,Rd . In particular, this holds with any c > 0 for the metric

ρc (x, y) =



0



(x − y)∗s ∧ 1 e−cs ds,

x, y ∈ CR+ ,Rd .

To extend the convergence to the entire family {X t }, fix a path of B with n n n n ρ1 (X r , K) → 0, and choose some functions y r ∈ K with ρ1 (X r , y r ) → 0. n For any t ∈ [rn , rn+1 ), the paths X r and X t are related by X t (s) = X r (t r−n s) n



rn log log rn t log log t

1/2

,

s > 0.

Defining y t in the same way in terms of y r , we note that also y t ∈ K, since n I(y t ) ≤ I(y r ). (The two H∞ -norms would agree if the logarithmic factors were omitted.) Furthermore, n



ρr (X t , y t ) = ≤

∞ 

0∞  0

X t − yt

∗

n

∧ 1 e−rs ds

n

∗

X r − yr



= r−1 ρ1 X r , y r n

n



s



rs



∧ 1 e−rs ds

→ 0.

Thus, ρr (X t , K) → 0. Since K is compact, we conclude that {X t } is relatively compact, and that all its limit points as t → ∞ belong to K. Now fix any y ∈ K and u ≥ ε > 0. By the established part of the theorem and Cauchy’s inequality, we have a.s.


lim sup X t − y

∗

t→∞

ε

≤ sup (x − y)∗ε x∈K

≤ sup x∗ε + yε∗ x∈K

≤ 2 ε1/2 .

(36)

Write x∗ε,u = sups∈[ε,u] |xs − xε |, and choose r > u/ε to ensure independence n between the variables (X r − y)∗ε,u . Applying the second case of (35) to the open set A = {x; (x − y)∗ε,u < ε}, and using the Borel–Cantelli lemma together with (36), we obtain a.s. 

lim inf X t − y t→∞

∗ u



≤ lim sup X t − y t→∞ 1/2

≤ 2ε

∗ ε



n

+ lim inf X r − y n→∞

∗ ε,u

+ ε.

As ε → 0, we get lim inf t (X t − y)∗u = 0 a.s., and so lim inf t ρ1 (X t , y) ≤ e−u a.s. Letting u → ∞, we obtain lim inf t ρ1 (X t , y) = 0 a.s. Applying this result to a dense sequence y1 , y2 , . . . ∈ K, we see that a.s. every element of K is a limit 2 point as t → ∞ of the family {X t }.

Exercises 1. For any random vector ξ and constant a in Rd , show that Λξ−a (u) = Λξ (u) − ua and Λ∗ξ−a (x) = Λ∗ξ (x + a). 2. For any random vector ξ in Rd and non-singular d × d matrix a, show that Λaξ (u) = Λξ (ua) and Λ∗aξ (x) = Λ∗ξ (a−1 x). 3. For any random vectors ξ⊥⊥η, show that Λξ,η (u, v) = Λξ (u) + Λη (v) and Λ∗ξ,η (x, y) = Λ∗ξ (x) + Λ∗η (y). 4. Prove the claims of Lemma 24.2. 5. For a Gaussian vector ξ in Rd with mean m ∈ Rd and covariance matrix a, show that Λ∗ξ (x) = 12 (x − m) a−1 (x − m). Explain the interpretation when a is singular. 6. Let ξ be a standard Gaussian random vector in Rd . Show that the family ε1/2 ξ satisfies an LDP in Rd with the good rate function I(x) = 12 |x|2 . (Hint: Deduce the result along the sequence εn = n−1 from Theorem 24.5, and extend by monotonicity to general ε > 0.) 7. Use Theorem 24.11 (i) to deduce the preceding result from Schilder’s theorem. (Hint: For x ∈ H1 , note that |x1 | ≤ x ˙ 2 , with equality iff xt ≡ tx1 .) 8. Prove Schilder’s theorem on [0, T ] by the same argument as for [0, 1]. 9. Deduce Schilder’s theorem in C[0,n],Rd from the version in C[0,1],Rnd . 10. Let B be a Brownian bridge in Rd . Show that the processes ε1/2 B satisfy an LDP in C[0,1],Rd with the good rate function I(x) = 12 x ˙ 22 for x ∈ H1 with x1 = 0 and I(x) = ∞ otherwise. (Hint: Write Bt = Xt − tX1 , where X is a Brownian motion in Rd , and use Theorem 24.11. Check that x˙ − a 2 is minimized for a = x1 .) 11. Show that the exponential tightness and its sequential version are preserved by continuous mappings.


12. Show that if the processes X ε , Y ε in CR+ ,Rd are exponentially tight, then so is any linear combination aX ε + b Y ε . (Hint: Use the Arzel`a–Ascoli theorem.) 13. Prove directly from (27) that the processes X ε in Theorem 24.14 are exponentially tight. (Hint: Use Lemmas 24.7 and 24.9 (iii), along with the Arzel`a–Ascoli theorem.) Derive the same result from the stated theorem. 14. Let ξε be random elements in a locally compact metric space S, satisfying an LDP with the good rate function I. Show that the ξε are exponentially tight, even in the non-sequential sense. (Hint: For any r > 0, there exists a compact set Kr ⊂ S with I −1 [0, r] ⊂ Kro . Now apply the LDP upper bound to the closed sets (Kro )c ⊃ Krc .) 15. For any metric space S and lcscH space T , let X ε be random elements in ε to an arbitrary compact set K ⊂ T satisfy an LDP in CT,S , whose restrictions XK CK,S with the good rate function IK . Show that the X ε satisfy an LDP in CT,S with the good rate function I = supK (IK ◦ πK ), where πK denotes the restriction map CT,S → CK,S . 16. Let ξkj be i.i.d. random vectors in Rd with Λ(u) = Eeuξkj < ∞ for all u ∈ Rd . Show that the sequences ξ¯n = n−1 Σk≤n (ξk1 , ξk2 , . . .) satisfy an LDP in (Rd )∞ with the good rate function I(x) = Σj Λ∗ (xj ). Also derive an LDP for the associated random walks in Rd . 17. Let ξ be a sequence of i.i.d. N (0, 1) random variables. Use the preceding result to show that the sequences ε1/2 ξ satisfy an LDP in R∞ with the good rate function I(x) = 12 x 2 for x ∈ l2 and I(x) = ∞ otherwise. Also derive the statement from Schilder’s theorem. 18. Let ξ1 , ξ2 , . . . be i.i.d. random probability measures on a Polish space S. Derive ˆ S for the averages ξ¯n = n−1 an LDP in M Σk≤nξk . (Hint: Define Λ(f ) = log Eeξk f , and proceed as in the proof of Sanov’s theorem.) 19. Show how the classical LIL4 in Theorem 14.18 follows from Theorem 24.18. Also use the latter result to derive an LIL for the variables ξt = |B2t − Bt |, where B is a Brownian motion in Rd . 20. Use Theorem 24.18 to derive a corresponding LIL in C[0,1],Rd . 21. Use Theorems 22.6 and 24.18 to derive a functional LIL for random walks, based on i.i.d. random variables with mean 0 and variance 1. (Hint: For the result in CR+ ,R , replace the summation process S[t] by its linearly interpolated version, as in Corollary 23.6.) 22. Use Theorems 22.13 and 24.18 to derive a functional LIL for suitable renewal processes. 23. Let B 1 , B 2 , . . . be independent Brownian motions in Rd . Show that the sequence of paths Xtn = (2 log n)−1/2 Btn , n ≥ 2, is a.s. relatively compact in CR+ ,Rd with set of limit points K = {x ∈ H∞ ; x ˙ 2 ≤ 1}.

4

law of the iterated logarithm

VIII. Stationarity, Symmetry and Invariance Stationary and symmetric processes represent the third basic dependence structure of probability theory. In Theorem 25 we prove the pointwise and mean ergodic theorems in discrete and continuous time, along with various multi-variate and sub-additive extensions. Some main results admit extensions to transition semi-groups of Markov processes, which leads in Chapter 26 to some powerful limit theorems for the latter. Here we also study weak and strong ergodicity as well as Harris recurrence. In Chapter 27 we prove the basic representations and limit theorems for exchangeable processes, as well as the predictable sampling and mapping theorems. The final Chapter 28 provides representations of contractable, exchangeable, and rotatable arrays. For the novice we recommend especially Chapter 25 plus selected material from Chapters 26–27. Chapter 28 is more advanced and might be postponed. −−− 25. Stationary processes and ergodic theorems. After a general discussion of stationary and ergodic sequences and processes, we prove the basic discrete- and continuous-time ergodic theorems, along with various multivariate and sub-additive extensions. Next we consider the basic ergodic decomposition and prove some powerful coupling theorems. Among applications we note some limit theorems for entropy and information and for random matrices. We also include a very general version of the classical ballot theorem. 26. Ergodic properties of Markov processes. Here we begin with the a.e. ergodic theorem for positive L1 −L∞ contractions on an abstract measure space, along with a related ratio ergodic theorem, leading to some powerful limit theorems for Markov processes. Next we characterize weak and strong ergodicity, and study the notion of Harris recurrence and the existence of invariant measures for regular Feller processes, using tools from potential theory. 27. Symmetric distributions and predictable maps. Here the central result is the celebrated de Finetti-type representation of exchangeable and contractable sequences, along with various continuous-time extensions, allowing applications to sampling from a finite population. We further note the powerful predictable sampling and mapping theorems, where the original invariance property is extended to predictable times and mappings. 28. Multi-variate arrays and symmetries. Here we establish the Aldous–Hoover-type coding representations of jointly exchangeable and contractable arrays of arbitrary dimension, and identify pairs of functions that can be used to represent the same array. After highlighting some special cases, we proceed to a representation of rotatable arrays in terms of Gaussian random variables, and conclude with an extension of Kingman’s paintbox representation for exchangeable and related partitions.


Chapter 25

Stationary Processes and Ergodic Theory Stationarity and invariance, two-sided extension, invariance and almost invariance, ergodicity, maximal ergodic lemma, discrete and continuoustime ergodic theorems, varying functions, moment and maximum inequalities, multi-variate ergodic theorem, commuting maps, convex averages, mean and sub-additive ergodic theorems, products of random matrices, ergodic decomposition, shift coupling and mixing criteria, group coupling, ballot theorems, entropy and information

We now come to the subject of stationarity, the third basic dependence structure of modern probability, beside those of martingales and Markov processes1 . Apart from their useful role in probabilistic modeling and in providing a natural setting for many general theorems, stationary processes often arise as limiting processes in a wide variety of contexts throughout probability. In particular, they appear as steady-state versions of various Markov and renewal-type processes. Their importance is further due to the existence of a range of fundamental limit theorems and maximum inequalities, belonging to the basic tool kit of every probabilist. A process is said to be stationary if its distribution is invariant2 under shifts. A key result is Birkhoff’s ergodic theorem, which may be regarded as a strong law of large numbers for stationary sequences and processes. After proving the classical ergodic theorems in discrete and continuous time, we turn to the multi-variate versions of Zygmund and Wiener, the former in a setting for non-commutative maps on rectangular regions, the latter in the commutative case and involving averages over increasing families of convex sets. We further consider a version of Kingman’s sub-additive ergodic theorem, and discuss a basic application to random matrices. In all the mentioned results, the limit is a random variable, measurable with respect to an appropriate invariant σ-field I. Of special importance is the ergodic case, where I is trivial and the limit reduces to a constant. For general stationary processes, we consider a decomposition of the distribution into ergodic components. We further consider some basic criteria for a suitable shift coupling of two processes, expressed in terms of the tail and invariant σfields T and I, respectively. Those results will be helpful to prove some ergodic 1

not to mention the mere independence
² Stationarity and invariance are often confused. The a.s. invariance Tξ = ξ is clearly much stronger than the stationarity Tξ =ᵈ ξ, or L(Tξ) = L(ξ).





theorems in Chapters 27 and 31. We conclude with some general versions of the classical ballot theorem, and prove the basic limit theorem for entropy and information. Our treatment of stationary sequences and processes is continued in Chapters 27 and 31 with some important applications and extensions of the present theory, including various ergodic theorems for Palm distributions. In Chapter 26, we show how the basic ergodic theorems allow extensions to suitable contraction operators, which leads to a profound unification of the present theory with the ergodic theory for Markov transition operators. In this context, we also prove some powerful ratio ergodic theorems. Returning to the basic notions of stationarity and invariance, we fix a general measurable space (S, S), often taken to be Borel. Given a measure μ and a measurable transformation T on S, we say that T preserves μ or is μ -preserving if μ ◦ T −1 = μ. Thus, if ξ is a random element in S with distribution μ, then d T is measure-preserving iff T ξ = ξ. In particular, we may consider a random sequence ξ = (ξ0 , ξ1 , . . .) in a measurable space (S  , S  ), and let θ denote the shift on S = (S  )∞ , given by θ(x0 , x1 , . . .) = (x1 , x2 , . . .). Then ξ is said to be d stationary 3 if θ ξ = ξ. We show that the general situation is equivalent to this special case. Lemma 25.1 (stationarity and invariance) For any random element ξ in S and measurable map T on S, we have d (i) T ξ = ξ iff the sequence (T n ξ) is stationary, (ii) for ξ as in (i), (f ◦ T n ξ) is stationary for every measurable function f , (iii) any stationary random sequence in S can be represented as in (ii). d

Proof: Assuming T ξ = ξ, we get θ(f ◦ T n ξ) = (f ◦ T n+1 ξ) = (f ◦ T n T ξ) d = (f ◦ T n ξ), and so (f ◦ T n ξ) is stationary. Conversely, if η = (η0 , η1 , . . .) is stationary, we d may write ηn = π0 (θn η) with π0 (x0 , x1 , . . .) = x0 , and note that θη = η by the stationarity of η. 2 In particular, when ξ0 , ξ1 , . . . is a stationary sequence of random elements in a measurable space S, and f is a measurable map of S ∞ into a measurable space S  , we see that the random sequence ηn = f (ξn , ξn+1 , . . .),

n ∈ Z+ ,

3 The existence of a P -preserving transformation T on the underlying probability space Ω is often postulated as part of the definition, so that any random element ξ generates a stationary sequence ξn = ξ ◦ T n . Here we prefer a purely probabilistic setting, where stationarity is defined directly in terms of the distributions of (ξn ). Though the resulting statements are more general, most proofs remain essentially the same.

25. Stationary Processes and Ergodic Theory

559

is again stationary. The definition of stationarity extends in an obvious way to random sequences indexed by Z = {. . . , −1, 0, 1, . . .}. A technical advantage of the twosided version is that the associated shift operators are invertible and form a group, rather than just a semi-group. We show that the two cases are essentially equivalent. Here we assume the existence of appropriate randomization variables, as explained in Chapter 8. Lemma 25.2 (two-sided extension) Every stationary random sequence ξ0 , ξ1 , . . . in a Borel space can be extended to a two-sided stationary sequence (. . . , ξ−1 , ξ0 , ξ1 , ξ2 , . . .). Proof: Letting ϑ1 , ϑ2 , . . . be i.i.d. U (0, 1) and independent of ξ = (ξ0 , ξ1 , . . .), we may construct the ξ−n recursively as functions of ξ and ϑ1 , . . . , ϑn , such d that (ξ−n , ξ−n+1 , . . .) = ξ for all n. In fact, once ξ−1 , . . . , ξ−n have been chosen, the existence of ξ−n−1 is clear from Theorem 8.17, if we note that d (ξ−n , ξ−n+1 , . . .) = θξ. Finally, the extended sequence is stationary by Proposition 4.2. 2 Now fix a measurable transformation T on a measure space (S, S, μ), and let S μ denote the μ -completion of S. We say that a set I ⊂ S is invariant if T −1 I = I and almost invariant if T −1 I = I a.e. μ, in the sense that μ(T −1 I  I) = 0. Since inverse maps preserve the basic set operations, the classes I and I  of invariant sets in S and almost invariant sets in S μ form σ-fields in S, called the invariant and almost invariant σ-fields, respectively. A measurable function f on S is said to be invariant if f ◦ T ≡ f , and almost invariant if f ◦T = f a.e. μ. We show how invariant or almost invariant sets and functions are related. Lemma 25.3 (invariant sets and functions) Fix a measure μ and a measurable transformation T on S, and let f be a measurable map of S into a Borel space U . Then (i) f is invariant iff it is I-measurable, (ii) f is almost invariant iff it is I  -measurable. Proof: Since U is Borel, we may take U = R. If f is invariant or almost invariant, then so is the set Ix = f −1 (−∞, x) for any x ∈ R, and so Ix ∈ I or I  , respectively. Conversely, if f is measurable with respect to I or I  , then Ix ∈ I or I  , respectively, for every x ∈ R. Hence, the function fn (s) = 2−n [2n f (s)], s ∈ S, is invariant or almost invariant for every n ∈ N, and the invariance or almost invariance carries over to the limit f . 2 We proceed to clarify the relationship between the invariant and almost invariant σ-fields. Here I μ denotes the μ -completion of I in S μ , the σ-field generated by I and the μ -null sets in S μ .



Lemma 25.4 (almost invariance) For any distribution μ and μ-preserving transformation T on S, the invariant and almost invariant σ-fields I and I  are related by I  = I μ . Proof: If J ∈ I μ , there exists an I ∈ I with μ(I J) = 0. Since T is μ -preserving, we get 











μ T −1 J  J ≤ μ T −1 J  T −1 I + μ T −1 I  I + μ(I  J) = μ ◦ T −1 (J  I) = μ(J  I) = 0, which shows that J ∈ I  . Conversely, given any J ∈ I  , we may choose a   J  ∈ S with μ(J J  ) = 0, and put I = n k≥n T −n J  . Then clearly I ∈ I 2 and μ(I J) = 0, and so J ∈ I μ . A measure-preserving map T on a probability space (S, S, μ) is said to be ergodic for μ or simply μ -ergodic, if the invariant σ-field I is μ -trivial, in the sense that μI = 0 or 1 for every I ∈ I. Depending on viewpoint, we may prefer to say that μ is ergodic for T , or T -ergodic. The terminology carries over to any random element ξ with distribution μ, which is said to be ergodic whenever this is true for T or μ. Thus, ξ is ergodic iff P {ξ ∈ I} = 0 or 1 for any I ∈ I, i.e., if the σ-field Iξ = ξ −1 I in Ω is P -trivial. In particular, a stationary sequence ξ = (ξn ) is ergodic, if the shift-invariant σ-field is trivial for the distribution of ξ. We show how the ergodicity of a random element ξ is related to the ergodicity of the generated stationary sequence. Lemma 25.5 (ergodicity) Let ξ be a random element in S with distribution μ, and let T be a μ -preserving map on S. Then ξ is T -ergodic



(T n ξ) is θ-ergodic,

in which case η = (f ◦ T n ξ) is θ-ergodic for every measurable function f on S. Proof: Fix any measurable map f : S → S  , and define F = {f ◦T n ; n ≥ 0}, so that F ◦ T = θ ◦ F . If I ⊂ (S  )∞ is θ -invariant, then T −1 F −1 I = F −1 θ−1 I = F −1 I, and so F −1 I is T -invariant in S. Assuming ξ to be ergodic, we obtain P {η ∈ I} = P {ξ ∈ F −1 I} = 0 or 1, which shows that even η is ergodic. Conversely, let the sequence (T n ξ) be ergodic, and fix any T -invariant set I in S. Put F = {T n ; n ≥ 0}, and define A = {s ∈ S ∞ ; sn ∈ I i.o.}. Then I = F −1 A and A is θ -invariant. Hence, P {ξ ∈ I} = P {(T n ξ) ∈ A} = 0 or 1, which means that even ξ is ergodic. 2 We may now state the fundamental a.s. and mean ergodic theorem for stationary sequences of random variables. Recall that (S, S) denotes an arbitrary measurable space, and write Iξ = ξ −1 I for convenience.



Theorem 25.6 (ergodic theorem, Birkhoff) Let ξ be a random element in S with distribution μ, and let T be a μ-preserving map⁴ on S with invariant σ-field I. Then for any measurable function f ≥ 0 on S,

    n^{−1} Σ_{k<n} f(T^k ξ) → E[ f(ξ) | I_ξ ]  a.s.,    (1)

and for f ∈ L¹(μ) the convergence holds also in L¹.

The key step of the proof is the following maximal inequality, stated for a stationary, integrable sequence ξ = (ξ_1, ξ_2, . . .) with partial sums S_n = ξ_1 + · · · + ξ_n.

Lemma 25.7 (maximal ergodic lemma) Under the stated assumptions,

    E[ ξ_1; sup_{n≥1} S_n > 0 ] ≥ 0.
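As an informal illustration of the theorem, not from the text: for the rotation T x = x + α (mod 1) on [0, 1) with α irrational and μ Lebesgue measure, the invariant σ-field is trivial, so the ergodic averages of any integrable f converge to the constant ∫ f dμ. For the rotation every orbit in fact equidistributes (unique ergodicity), so a fixed starting point suffices in the quick Python check below, where f(x) = x² and ∫ f dμ = 1/3; the starting point and sample sizes are arbitrary.

    import math

    alpha = math.sqrt(2) - 1                # irrational rotation: T x = x + alpha (mod 1)
    f = lambda x: x*x                       # integral of f over [0,1) equals 1/3

    def birkhoff_average(x0, n):
        x, s = x0, 0.0
        for _ in range(n):
            s += f(x)
            x = (x + alpha) % 1.0
        return s/n

    for n in (10**3, 10**5, 10**7):
        print(n, birkhoff_average(0.1, n), 1/3)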

Proof (Garsia): Put Mn = S1 ∨ · · · ∨ Sn . Assuming ξ to be defined on the canonical space R∞ , we get Sk = ξ1 + Sk−1 ◦ θ ≤ ξ1 + (Mn ◦ θ)+ ,

k = 1, . . . , n.

Taking maxima yields Mn ≤ ξ1 +(Mn ◦θ)+ for all n ∈ N, and so by stationarity 





E ξ1 ; Mn > 0 ≥ E Mn − (Mn ◦ θ)+ ; Mn > 0





≥ E (Mn )+ − (Mn ◦ θ)+ = 0. Since Mn ↑ supn Sn , the assertion follows by dominated convergence.

2

Proof of Theorem 25.6 (Yosida & Kakutani): First let f ∈ L¹, and put ηk = f(T^{k−1} ξ) for convenience. Since E(η1 | Iξ) is an invariant function of ξ by Lemmas 1.14 and 25.3, the sequence ζk = ηk − E(η1 | Iξ) is again stationary. Writing Sn = ζ1 + · · · + ζn, we define for any ε > 0

    ζnε = (ζn − ε) 1_{Aε},    Aε = { lim sup_{n→∞} (Sn/n) > ε },

and note that the sums Snε = ζ1ε + · · · + ζnε satisfy

    { sup_n Snε > 0 } = { sup_n (Snε/n) > 0 }
                      = { sup_n (Sn/n) > ε } ∩ Aε = Aε.

Since Aε ∈ Iξ, the sequence (ζnε) is stationary, and Lemma 25.7 yields

    0 ≤ E[ ζ1ε; sup_n Snε > 0 ] = E[ ζ1 − ε; Aε ]
      = E[ E(ζ1 | Iξ); Aε ] − ε P Aε = −ε P Aε,

and so P Aε = 0. Since ε > 0 was arbitrary, we get lim sup_n (Sn/n) ≤ 0 a.s., and applying the same argument to the sequence (−ζn) gives Sn/n → 0 a.s., which proves the a.s. convergence in (1). Since the ηk are integrable and identically distributed, the averages n⁻¹ Σ_{k≤n} ηk are uniformly integrable by Proposition 5.12, and the convergence in (1) holds even in L¹.

Now let f ≥ 0 be arbitrary, and put η̄ = E[ f(ξ) | Iξ ]. Applying the previous result to the functions f ∧ r for arbitrary r > 0, we see that (1) holds a.s. on {η̄ < ∞}. Next, we have a.s. for any r > 0

    lim inf_{n→∞} n⁻¹ Σ_{k≤n} f(T^k ξ) ≥ lim_{n→∞} n⁻¹ Σ_{k≤n} ( f(T^k ξ) ∧ r )
                                       = E[ f(ξ) ∧ r | Iξ ].

As r → ∞, the right-hand side tends a.s. to η¯, by the monotone convergence property of conditional expectations. In particular, the left-hand side is a.s. infinite on {¯ η = ∞}, as required. 2 Write I and T for the shift-invariant and tail σ-fields in R∞ , respectively, and note that I ⊂ T . Thus, for any sequence of random variables ξ = (ξ1 , ξ2 , . . .), we have Iξ = ξ −1 I ⊂ ξ −1 T . By Kolmogorov’s 0−1 law, the latter σ-field is trivial when the ξn are independent. If they are even i.i.d. and integrable, then Theorem 25.6 yields n−1 (ξ1 + · · · + ξn ) → Eξ1 a.s. and in L1 , conforming with Theorem 5.23. Hence, the last theorem contains the strong law of large numbers. We may often allow the function f = fn,k in Theorem 25.6 to depend on n or k. Here we consider a slightly more general situation. Corollary 25.8 (varying functions, Maker) Let ξ be a random element in S with distribution μ, let T be a μ-preserving map on S with invariant σ-field I, and let f and fm,k be measurable functions on S. Then (i) if fm,k → f a.s. and supm,k |fm,k | ∈ L1 , we have as m, n → ∞ n−1

Σ fm,k (T k ξ) → E kr |fm,k |, and conclude from the same result that a.s.  

lim sup  n−1 m,n→∞



fm,k (T k ξ)  ≤ lim n−1 Σ gr (T k ξ) Σ n→∞ k 0.

Proof: For any r > 0, put ξkʳ = ξk 1{ξk > r}, and note that ξk ≤ ξkʳ + r. Assuming ξ to be defined on the canonical space R^∞, and writing An = Sn/n, we get An − 2r = An ∘ (ξ − 2r) ≤ An ∘ (ξʳ − r), which implies M − 2r ≤ M ∘ (ξʳ − r). Applying Lemma 25.7 to the sequence ξʳ − r, we obtain

    r P{M > 2r} ≤ r P{ M ∘ (ξʳ − r) > 0 }
                ≤ E[ ξ1ʳ; M ∘ (ξʳ − r) > 0 ]
                ≤ E ξ1ʳ = E[ ξ1; ξ1 > r ].  □

Proof of Proposition 25.10: We may clearly assume that ξ1 ≥ 0 a.s.

(i) By Lemma 25.11, Fubini's theorem, and calculus,

    E M^p = p E ∫_0^M r^{p−1} dr
          = p ∫_0^∞ P{M > r} r^{p−1} dr
          ≤ 2p ∫_0^∞ E[ ξ1; 2ξ1 > r ] r^{p−2} dr
          = 2p E[ ξ1 ∫_0^{2ξ1} r^{p−2} dr ]
          = 2p (p − 1)⁻¹ E[ ξ1 (2ξ1)^{p−1} ] ≲ E ξ1^p.

(ii) For m = 0, we may write

    E M − 1 ≤ E(M − 1)⁺
            = ∫_1^∞ P{M > r} dr
            ≤ 2 ∫_1^∞ E[ ξ1; 2ξ1 > r ] r⁻¹ dr
            = 2 E[ ξ1 ∫_1^{2ξ1 ∨ 1} r⁻¹ dr ]
            = 2 E[ ξ1 log₊ 2ξ1 ]
            ≤ e + 2 E[ ξ1 log 2ξ1; 2ξ1 > e ]
            ≲ 1 + E[ ξ1 log₊ ξ1 ].

For m > 0, we may write instead

    E[ M log₊^m M ] = ∫_0^∞ P{ M log₊^m M > r } dr
                    = ∫_1^∞ P{M > t} ( m log^{m−1} t + log^m t ) dt
                    ≤ 2 ∫_1^∞ E[ ξ1; 2ξ1 > t ] ( m log^{m−1} t + log^m t ) t⁻¹ dt
                    = 2 E[ ξ1 ∫_0^{log₊ 2ξ1} ( m x^{m−1} + x^m ) dx ]
                    = 2 E[ ξ1 ( log₊^m 2ξ1 + (m+1)⁻¹ log₊^{m+1} 2ξ1 ) ]
                    ≤ 2e + 4 E[ ξ1 log^{m+1} 2ξ1; 2ξ1 > e ]
                    ≲ 1 + E[ ξ1 log₊^{m+1} ξ1 ].  □
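The following simulation is not from the original text; it is a rough numerical check of the weak-type bound r P{M > 2r} ≤ E[ξ1; ξ1 > r] proved above, with M = sup_n (Sn/n), for an i.i.d. (hence stationary) nonnegative sequence. The supremum is truncated to a finite horizon, so the simulated left-hand side slightly underestimates the true one; the exponential distribution and all parameters are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(1)
    n_paths, horizon, r = 5_000, 1_000, 2.0

    xi = rng.exponential(1.0, size=(n_paths, horizon))       # stationary sequence, here i.i.d.
    An = np.cumsum(xi, axis=1) / np.arange(1, horizon + 1)   # S_n / n along each path
    M = An.max(axis=1)                                       # finite-horizon proxy for sup_n S_n/n

    lhs = r * np.mean(M > 2 * r)
    rhs = np.mean(np.where(xi[:, 0] > r, xi[:, 0], 0.0))     # E[xi_1; xi_1 > r]
    print(lhs, rhs, lhs <= rhs)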

Given a measure space (S, S, μ), we introduce for any m ≥ 0 the class L logm L(μ) of measurable functions f on S satisfying |f | logm + |f | dμ < ∞. Note in particular that L log0 L = L1 . Using the maximum inequalities of Proposition 25.10, we may prove the following multi-variate version of Theorem 25.6, for possibly non-commuting, measure-preserving transformations T1 , . . . , T d . Theorem 25.12 (multi-variate ergodic theorem, Zygmund) Let ξ be a random element in S with distribution μ, let T1 , . . . , Td be μ-preserving maps on S with invariant σ-fields I1 , . . . , Id , and put Jk = ξ −1 Ik . Then for any f ∈ L logd−1 L(μ), we have as n1 , . . . , nd → ∞ f Σ · · ·k Σ k r ≤m , r > 0. r P sup d d k λ Bk Proof: Fix any r, a > 0 and n ∈ N, and define a process ν on Rd and a random set K in Sa = {x ∈ Rd ; |x| ≤ a} by



ν(x) = inf k ∈ N; ξ(Bk + x) > r λd Bk ,



x ∈ Rd ,

K = x ∈ Sa ; ν(x) ≤ n . By Lemma 25.15 we may choose a finite, random subset M ⊂ K, such that the   2d sets Bν(x) + x, x ∈ M , are disjoint and λd K ≤ d Σx∈M λd Bν(x) . Writing b = sup{|x|; x ∈ Bn }, we get

ξSa+b ≥





ξ Bν(x) + x



x∈M



≥r

x∈M

≥r



2d d

λd Bν(x) −1

λd K.

Taking expectations and using Fubini’s theorem and the stationarity and measurability of ν, we obtain 

m

2d d



λd Sa+b ≥ rEλd K



=r Sa d



P ν(x) ≤ n dx

= rλ Sa P max k≤n

ξBk >r . d λ Bk

Now divide by λd Sa , and then let a → ∞ and n → ∞ in this order.

2

We finally need an elementary Hilbert-space result. By a contraction on a Hilbert space H we mean a linear operator T , such that T x ≤ x for all x ∈ H. For any linear subspace M ⊂ H, write M ⊥ for the orthogonal ¯ for the closure of M . The adjoint T ∗ of an operator T is complement and M determined by the identity x, T y = T ∗ x, y , where ·, · denotes the inner product in H. Lemma 25.17 (invariant subspace) Let T be a set of contraction operators on a Hilbert space H, and write N for the T -invariant subspace of H. Then ¯ N ⊥ ⊂ R,





R = span x − T x; x ∈ H, T ∈ T .

Proof: If x ⊥ R,  x − T ∗ x, y =  x, y − T y = 0,

T ∈ T , y ∈ H,

which implies T ∗ x = x for every T ∈ T . Hence, for any T ∈ T , we have T x, x = x, T ∗ x = x 2 , and so by contraction 0 ≤ T x − x 2 = T x 2 + x 2 − 2T x, x ≤ 2 x 2 − 2 x 2 = 0, ¯ which implies T x = x. This gives R⊥ ⊂ N , and so N ⊥ ⊂ (R⊥ )⊥ = R. Proof of Theorem 25.14: First let f ∈ L1 , and define T˜s f = f ◦ Ts ,

An = (λd Bn )−1

Bn

T˜s ds.

For any ε > 0, Lemma 25.17 yields a measurable decomposition f = fε +

Σ k≤m





gkε − T˜sk gkε + hε ,

2


ε where f ε ∈ L2 is T˜s -invariant for all s ∈ Rd , the functions g1ε , . . . , gm are ε ε ε bounded, and E|h (ξ)| < ε. Here clearly An f ≡ f . Using Lemma A6.5 (ii), we get as n → ∞ for fixed k ≤ m and ε > 0



   λd (Bn + sk )  Bn   ε ε An gk − T˜sk gk  ≤ gkε λd B n  

≤2

1 + |sk |/r(Bn )

d

− 1 gkε → 0.

Finally, Lemma 25.16 gives



r P supn An |hε (ξ)| ≥ r ≤







2d 2d E|hε (ξ)| ≤ d d



ε,

r, ε > 0,

P

and so supn An |hε (ξ)| → 0 as ε → 0. In particular, lim inf n An f (ξ) < ∞ a.s., which ensures that 







lim sup − lim inf An f (ξ) = lim sup − lim inf An hε (ξ) n→∞

n→∞

n→∞

n→∞

P

≤ 2 supn An |hε (ξ)| → 0. Thus, the left-hand side vanishes a.s., and the required a.s. convergence follows. When f ∈ Lp for a p ≥ 1, the asserted Lp -convergence follows as before, by the uniform integrability of the powers |An f (ξ)|p . We may now identify the limit, as in the proof of Corollary 25.9, and the a.s. convergence extends to arbitrary f ≥ 0, as in case of Theorem 25.6. 2 The Lp -version of Theorem 25.14 remains valid under weaker conditions. For a simple extension, say that the distributions μn on Rd are asymptotically invariant if μn − μn ∗ δs → 0 for every s ∈ Rd , where · denotes the total variation norm. Note that the conclusion of Theorem 25.14 can be written as ¯ a.s., where μn X → X 

1 B λd ¯ = E f (ξ)  Iξ . μn = dn , Xs = f (Ts ξ), X λ Bn Corollary 25.18 (mean ergodic theorem) Let X be a stationary, measurable, and Lp -valued process on Rd , where p ≥ 1. Then for any asymptotically invariant distributions μn on Rd ,    ¯ ≡ E Xs  IX in Lp . μn X → X Proof: By Theorem 25.14 we may choose some distributions νm on Rd with ¯ in Lp . Using Minkowski’s inequality and its extension in Corollary νm X → X ¯ and dominated 1.32, along with the stationarity of X, the invariance of X, convergence, we get as n → ∞ and then m → ∞        ¯  ≤ μn X − (μn ∗ νm )X  + (μn ∗ νm )X − X ¯   μn X − X p p p         ¯  μn (ds) ≤  μn − μn ∗ νm  X p + (δs ∗ νm )X − X p        ¯  → 0. ≤ X p  μn − μn ∗ δt  νm (dt) + νm X − X p

2
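The next snippet is not part of the original text; it is a minimal discrete stand-in for the averages in Theorem 25.14 and Corollary 25.18 (cf. Exercise 22): a stationary, ergodic random field on the grid, obtained as a local moving average of i.i.d. noise, is averaged over growing squares. The construction and all parameters are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(2)
    N = 1_000
    Z = rng.exponential(1.0, size=(N + 1, N + 1))   # i.i.d. noise on a grid

    # stationary, ergodic field: a local moving average of the noise, with mean 1
    X = 0.25 * (Z[:-1, :-1] + Z[1:, :-1] + Z[:-1, 1:] + Z[1:, 1:])

    for n in (10, 100, 1_000):
        print(n, X[:n, :n].mean())   # averages over the squares [0, n)^2 approach E X = 1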


We turn to a sub-additive version of Theorem 25.6. For motivation and subsequent needs, we begin with a simple result for non-random sequences. A sequence c1, c2, . . . ∈ R is said to be sub-additive, if c_{m+n} ≤ c_m + c_n for all m, n ∈ N.

Lemma 25.19 (sub-additivity) For any sub-additive sequence c1, c2, . . . ∈ R,

    lim_{n→∞} c_n/n = inf_n c_n/n ∈ [−∞, ∞).

Proof: Iterating the sub-additivity relation, we get for any k, n ∈ N

    c_n ≤ [n/k] c_k + c_{n−k[n/k]} ≤ [n/k] c_k + c0 ∨ · · · ∨ c_{k−1},

where c0 = 0. Noting that [n/k] ∼ n/k as n → ∞, we get lim sup_n (c_n/n) ≤ c_k/k for all k, and so

    inf_n c_n/n ≤ lim inf_{n→∞} c_n/n ≤ lim sup_{n→∞} c_n/n ≤ inf_n c_n/n.  □

For two-dimensional arrays c_{jk}, 0 ≤ j < k, sub-additivity is defined by c_{0,n} ≤ c_{0,m} + c_{m,n} for all m < n, which reduces to the one-dimensional notion when c_{jk} = c_{k−j} for some sequence c_k. Note that sub-additivity holds automatically for arrays of the form c_{jk} = a_{j+1} + · · · + a_k.

We now extend the ergodic theorem to sub-additive arrays of random variables ξ_{jk}, 0 ≤ j < k. For motivation, recall from Theorem 25.6 that, if ξ_{jk} = η_{j+1} + · · · + η_k for a stationary sequence of integrable random variables η_k, then ξ_{0,n}/n converges a.s. and in L¹. A similar result holds for general sub-additive arrays (ξ_{jk}), stationary under simultaneous shifts in the two indices, so that (ξ_{j+1,k+1}) =ᵈ (ξ_{j,k}). To allow a wider range of applications, we introduce the slightly weaker hypotheses

    (ξ_{k,2k}, ξ_{2k,3k}, . . .) =ᵈ (ξ_{0,k}, ξ_{k,2k}, . . .),    k ∈ N,    (5)

    (ξ_{k,k+1}, ξ_{k,k+2}, . . .) =ᵈ (ξ_{0,1}, ξ_{0,2}, . . .),    k ∈ N.    (6)

For convenience, we also restate the sub-additivity condition

    ξ_{0,n} ≤ ξ_{0,m} + ξ_{m,n},    0 < m < n.    (7)

Theorem 25.20 (sub-additive ergodic theorem, Kingman, Liggett) Let (ξ_{jk}) be a sub-additive array of random variables with E ξ_{0,1}⁺ < ∞, satisfying (5)−(6). Then
(i) n⁻¹ ξ_{0,n} → ξ̄ a.s. in [−∞, ∞), where E ξ̄ = inf_n (E ξ_{0,n}/n) ≡ c,
(ii) the convergence in (i) extends to L¹ when c > −∞,
(iii) ξ̄ is a.s. constant when the sequences in (5) are ergodic.


Proof: Put ξ0,n = ξn for convenience. By (6) and (7) we have Eξn+ ≤ nEξ1+ < ∞. First let c > −∞, so that the variables ξm,n are integrable. Iterating (7) gives n ξn [n/k] ≤ Σ ξ(j−1)k, jk /n + Σ ξj−1, j /n, n j =1 j = k[n/k]+1

n, k ∈ N.

(8)

By (5) the sequence ξ(j−1)k, jk , j ∈ N, is stationary for fixed k, and so Theorem 25.6 yields n−1 Σj≤n ξ(j−1)k, jk → ξ¯k a.s. and in L1 , where E ξ¯k = E ξk . Hence, the first term in (8) tends to ξ¯k /k a.s. and in L1 . Similarly, n−1 Σj≤n ξj−1, j → ξ¯1 a.s. and in L1 , and so the second term in (8) tends in the same sense to 0. Thus, the right-hand side converges a.s. and in L1 toward ξ¯k /k, and since k is arbitrary, we get   lim sup (ξn /n) ≤ inf n ξ¯n /n n→∞ ≡ ξ¯ < ∞ a.s. (9) The variables ξn+ /n are uniformly integrable by Proposition 5.12, and moreover E lim sup (ξn /n) ≤ E ξ¯   n→∞ ≤ inf n E ξ¯n /n = inf n (E ξn /n) = c.

(10)

To derive a lower bound, let κn ⊥⊥(ξjk ) be uniformly distributed over {1, . . . , n} for each n, and define ζkn = ξκn , κn +k , ηkn = ξκn +k − ξκn +k−1 ,

k ∈ N.

Then by (6), d

n ∈ N.

(ζ1n , ζ2n , . . .) = (ξ1 , ξ2 , . . .),

(11)

d

Moreover, ηkn ≤ ξκn +k−1, κn +k = ξ1 by (6) and (7), and so the variables (ηkn )+ are uniformly integrable. On the other hand, the sequence E ξ1 , E ξ2 , . . . is sub-additive, and so by Lemma 25.19 we have as n → ∞ 

E ηkn = n−1 E ξn+k − E ξk → inf n (E ξn /n) = c,



k ∈ N.

(12)

In particular, supn E|ηkn | < ∞, which shows that the sequence ηk1 , ηk2 , . . . is tight for each k. Hence, by Theorems 5.30, 6.20, and 8.21, there exist some random variables ζk and ηk such that 



d



ζ1n , ζ2n , . . . ; η1n , η2n , . . . → ζ1 , ζ2 , . . . ; η1 , η2 , . . . d



(13)

along a sub-sequence. Here (ζk ) = (ξk ) by (11), and so by Theorem 8.17 we may assume that ζk = ξk for each k. The sequence η1 , η2 , . . . is clearly stationary, and by Lemma 5.11 it is also integrable. From (7) we get

    η1ⁿ + · · · + ηkⁿ = ξ_{κn+k} − ξ_{κn} ≤ ξ_{κn, κn+k} = ζkⁿ,

and so in the limit η1 + · · · + ηk ≤ ξk a.s. Hence, Theorem 25.6 yields

    ξn/n ≥ n⁻¹ Σ_{k≤n} ηk → η̄    a.s. and in L¹,

for some η̄ ∈ L¹. In particular, the variables ξn⁻/n are uniformly integrable, and so the same thing is true for ξn/n. Using Lemma 5.11 and the uniform integrability of the variables (ηkⁿ)⁺, together with (10) and (12), we get

    c = lim sup_{n→∞} E η1ⁿ ≤ E η1 = E η̄
      ≤ E lim inf_{n→∞} (ξn/n)
      ≤ E lim sup_{n→∞} (ξn/n) ≤ E ξ̄ ≤ c.

Thus, ξn/n converges a.s., and by (9) the limit equals ξ̄. Furthermore, by Lemma 5.11, the convergence holds even in L¹, and E ξ̄ = c. If the sequences in (5) are ergodic, then ξ̄n = E ξn a.s. for each n, and we get ξ̄ = c a.s.

Now assume instead that c = −∞. Then for each r ∈ Z, the truncated array ξ_{m,n} ∨ r(n − m), 0 ≤ m < n, satisfies the hypotheses of the theorem with c replaced by c_r = inf_n (E ξnʳ/n) ≥ r, where ξnʳ = ξn ∨ rn. Thus, ξnʳ/n = (ξn/n) ∨ r converges a.s. toward a random variable ξ̄ʳ with mean c_r, and so ξn/n → inf_r ξ̄ʳ ≡ ξ̄. Finally, E ξ̄ ≤ inf_r c_r = c = −∞ by monotone convergence.  □

For an application of the last result, we derive a celebrated ergodic theorem for products of random matrices.

Theorem 25.21 (random matrices, Furstenberg & Kesten) Let Xᵏ = (ξijᵏ) be a stationary sequence of random d × d matrices, such that

    ξijᵏ > 0 a.s.,    E| log ξijᵏ | < ∞,    i, j ≤ d.

Then there exists a random variable η, independent of i, j, such that as n → ∞,

    n⁻¹ log(X¹ · · · Xⁿ)_{ij} → η    a.s. and in L¹,    i, j ≤ d.
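Before the proof, here is a quick numerical sanity check, not from the original text: for i.i.d. (hence stationary) matrices with entries drawn uniformly from an interval, the normalized logarithms n⁻¹ log(X¹ · · · Xⁿ)_{ij} are computed for all i, j and seen to agree; by ergodicity the common limit η is then a constant, the top Lyapunov exponent. The entry distribution and all parameters are arbitrary assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    d, n = 3, 4_000
    X = rng.uniform(0.5, 2.0, size=(n, d, d))   # i.i.d. d x d matrices with positive entries

    P, log_scale = np.eye(d), 0.0
    for k in range(n):
        P = P @ X[k]
        s = P.max()                  # rescale to avoid overflow, tracking the log of the scale
        log_scale += np.log(s)
        P /= s

    rates = (np.log(P) + log_scale) / n   # n^{-1} log (X^1 ... X^n)_{ij}
    print(rates.round(4))                 # all d^2 entries are nearly equal
    print(rates.max() - rates.min())      # spread of order 1/n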

Proof: First let i = j = 1, and define 

ηm,n = log X m+1 · · · X n



11 ,

0 ≤ m < n.

The array (−ηm,n ) is clearly sub-additive and jointly stationary, and E|η0,1 | < ∞ by hypothesis. Further note that 

Hence,

X1 · · · Xn



11

≤ d n−1 Π max ξijk . k≤n

i,j


ξijk Σ log max i,j   ≤ Σ Σ i,j  log ξijk  , k≤n

η0,n − (n − 1) log d ≤

and so

k≤n

 

 

n−1 Eη0,n ≤ log d + Σ i,j E  log ξij1  < ∞.

Thus, by Theorem 25.20 and its proof, we have η0,n /n → η a.s. and in L1 for an invariant random variable η. To extend the convergence to arbitrary i, j ∈ {1, . . . , d}, we may write for any n ∈ N     2 n+1 X 3 · · · X n 11 ξ1j ≤ X 2 · · · X n+1 ij ξi1 

1 n+2 ≤ ξ1i ξj1

−1 

X 1 · · · X n+2



11 .

Noting that n−1 log ξijn → 0 a.s. and in L1 by Theorem 25.6, and using the stationarity of (X n ) and the invariance of ξ, we obtain n−1 log (X 2 · · · X n+1 )ij → η a.s. and in L1 . The desired convergence now follows by stationarity. 2 We turn to the decomposition of an invariant distribution into ergodic components. For motivation, consider the setting of Theorem 25.6 or 25.14, and let S be Borel, to ensure the existence of regular conditional distributions. Writing η = L(ξ | Iξ ), we get L(ξ) = EL(ξ | Iξ ) = Eη m P {η ∈ dm}.

= Furthermore,

(14)



ηI = P ξ ∈ I  Iξ

= 1{ξ ∈ I} a.s.,

I ∈ I,

and so ηI = 0 or 1 a.s. for any fixed I ∈ I. If we can choose the exceptional null set to be independent of I, then η is a.s. ergodic, and (14) gives the desired ergodic decomposition of μ = L(ξ). Though the suggested statement is indeed true, its proof requires a different approach. Proposition 25.22 (ergodicity by conditioning, Farrell, Varadarajan) Let ξ be a random element in a Borel space S with distribution μ, and consider a measurable group G of μ-preserving maps Ts on S, s ∈ Rd , with invariant σ-field I. Then η = L(ξ | Iξ ) is a.s. invariant and ergodic under G. For the proof, consider an increasing sequence of convex sets Bn ∈ B d with r(Bn ) → ∞, and introduce on S the probability kernels μn (x, A) = (λd Bn )−1



Bn

1A (Ts x) ds,

x ∈ S, A ∈ S,

(15)

along with the empirical distributions ηn = μn (ξ, ·). By Theorem 25.14, we have ηn f → ηf a.s. for every bounded, measurable function f on S, where η = L(ξ | Iξ ). Say that a class C ⊂ S is measure-determining, if every distribution on S is uniquely determined by its values on C.


Lemma 25.23 (degenerate limit) Let A1 , A2 , . . . ∈ S be measure-determining with empirical distributions ηn = μn (ξ, ·), where the μn are given by (15). Then ξ is ergodic whenever ηn Ak → P {ξ ∈ Ak } a.s., k ∈ N. Proof: By Theorem 25.14, we have ηn A → ηA ≡ P ( ξ ∈ A | Iξ ) a.s. for every A ∈ S, and so by comparison ηAk = P {ξ ∈ Ak } a.s. for all k. Since the Ak are measure-determining, we get η = L(ξ) a.s. Hence, for any I ∈ I we have a.s. 

 P {ξ ∈ I} = ηI = P ξ ∈ I  Iξ = 1I (ξ) ∈ {0, 1}, which implies P {ξ ∈ I} = 0 or 1.

2

Proof of Proposition 25.22: By the stationarity of ξ, we have for any A ∈ S and s ∈ Rd 

 η ◦ Ts−1 A = P Ts ξ ∈ A  Iξ 





= P ξ ∈ A  Iξ = ηA a.s. Since S is Borel, we obtain η ◦ Ts−1 = η a.s. for every s. Now put C = [0, 1]d , and define η¯ = C (η ◦ Ts−1 ) ds. Since η is a.s. invariant under shifts in Zd , the variable η¯ is a.s. invariant under arbitrary shifts. Furthermore, Fubini’s theorem yields

λd s ∈ [0, 1]d ; η ◦ Ts−1 = η = 1 a.s., and therefore η¯ = η a.s. Thus, η is a.s. G-invariant. Now choose a measure-determining sequence A1 , A2 , . . . ∈ S, which exists since S is Borel. Noting that ηn Ak → ηAk a.s. for every k by Theorem 25.14, we get by Theorem 8.5 η

∩k









x ∈ S; μn (x, Ak ) → ηAk = P Iξ ∩ k ηn Ak → ηAk = 1 a.s.

Since η is a.s. a G-invariant distribution on S, Lemma 25.23 applies for every ω ∈ Ω outside a P -null set, and we conclude that η is a.s. ergodic. 2 In (14) we saw that, for stationary ξ, the distribution μ = L(ξ) is a mixture of invariant, ergodic probability measures. We show that this representation is unique6 and characterizes the ergodic measures as extreme points in the convex set of invariant measures. To explain the terminology, we say that a subset M of a linear space is convex if c m1 + (1 − c) m2 ∈ M for all m1 , m2 ∈ M and c ∈ (0, 1). In that case, an element m ∈ M is extreme if for any m1 , m2 , c as above, the relation m = c m1 + (1 − c) m2 implies m1 = m2 = m. With any set of measures μ on a measurable space (S, S), we associate the σ-field generated by all evaluation maps πB : μ → μB, B ∈ S. 6

Under stronger conditions, a more explicit representation was obtained in Theorem 3.12.


Theorem 25.24 (ergodic decomposition, Krylov & Bogolioubov) Let G = {Ts ; s ∈ Rd } be a measurable group of transformations on a Borel space S. Then (i) the G-invariant distributions on S form a convex set M , whose extreme points are precisely the ergodic measures in M , (ii) every measure μ ∈ M has a unique representation μ = ν restricted to the set of ergodic measures in M .



m ν(dm), with

Proof: The set M is clearly convex, and Proposition 25.22 yields for every  μ ∈ M a representation μ = m ν(dm), where ν is a probability measure on the set of ergodic measures in M . To see that ν is unique, we introduce a regular conditional distribution η = μ( · | I) a.s. μ on S, and note that μn A → ηA a.s. μ for all A ∈ S by Theorem 25.14. Thus, for any A1 , A2 , . . . ∈ S, we have m

∩k





x ∈ S; μn (x, Ak ) → η(x, Ak ) = 1 a.e. ν.

The same relation holds with η(x, Ak ) replaced by mAk , since ν is restricted to the class of ergodic measures in M . Assuming the sets Ak to be measuredetermining, we conclude that m{x; η(x, ·) = m} = 1 a.e. ν. Hence, for any measurable set A ⊂ M , μ{η ∈ A} = =



m{η ∈ A} ν(dm) 1A (m) ν(dm) = νA,

which shows that ν = μ ◦ η −1 . To prove the equivalence of ergodicity and extremality, fix any measure  μ ∈ M with ergodic decomposition m ν(dm). First let μ be extreme. If it is not ergodic, then ν is non-degenerate, and we have ν = c ν1 +(1−c)ν2 for some ν1 ⊥ν2 and c ∈ (0, 1). Since μ is extreme, we obtain m ν1 (dm) = m ν2 (dm), and so ν1 = ν2 by the uniqueness of the decomposition. The contradiction shows that μ is ergodic. Next let μ be ergodic, so that ν = δμ , and let μ = c μ1 + (1 − c) μ2 with μ1 , μ2 ∈ M and c ∈ (0, 1). If μi = m νi (dm) for i = 1, 2, then δμ = c ν1 + (1 − c) ν2 by the uniqueness of the decomposition. Hence, ν1 = ν2 = δμ , 2 and so μ1 = μ2 , which shows that μ is extreme. We turn to some powerful coupling results, needed for the ergodic theory of Markov processes in Chapter 26 and for our discussion of Palm distributions in Chapter 31. First we consider pairs of measurable processes on R+ , with values in an arbitrary measurable space S. In the associated path space, we introduce  the invariant σ-field I and tail σ-field T = t Tt with Tt = σ(θt ), where clearly I ⊂ T . For any signed measure ν, write ν A for the total variation of ν on the σ-field A.


Theorem 25.25 (coupling on R+ , Goldstein, Berbee, Aldous & Thorisson) Let X, Y be S-valued, product-measurable processes on R+ . Then these conditions are equivalent: d (i) X = Y on T , d (ii) (σ, θσ X) = (τ, θτ Y ) for some random times σ, τ ≥ 0,  (iii) L(θt X) − L(θt Y ) → 0 as t → ∞. So are the conditions: (i ) (ii ) (iii )

d

X = Y on I, d θ X = θτ Y for some random times σ, τ ≥ 0, σ   1   0 {L(θst X) − L(θst Y )} ds → 0 as t → ∞.

d When the path space is Borel, we may choose Y˜ = Y such that (ii) and (ii ) become, respectively, θσ X = θτ Y˜ a.s. θτ X = θτ Y˜ ,

Proof for (i)−(iii): Let μ1 , μ2 be the distributions of X, Y , and assume μ1 = μ2 on T . Write U = S R+ , and define a mapping p on R+ × U by p(s, x) = (s, θs x). Let C be the class of pairs (ν1 , ν2 ) of measures on R+ × U with (16) ν1 ◦ p−1 = ν2 ◦ p−1 , ν¯1 ≤ μ1 , ν¯2 ≤ μ2 , where ν¯i = νi (R+ × ·), and regard C as partially ordered under component-wise inequality. Since by Corollary 1.17 every linearly ordered subset has an upper bound in C, Zorn’s lemma7 yields a maximal pair (ν1 , ν2 ). To see that ν¯1 = μ1 and ν¯2 = μ2 , define μi = μi − ν¯i , and conclude from the equality in (16) that          μ1 − μ2 T = ν ¯1 − ν¯2 T     ≤ ν¯1 − ν¯2 Tn



≤ 2 ν1 (n, ∞) × U → 0,

(17)

which implies μ1 = μ2 on T . Next, Corollary 2.13 yields some measures μni ≤ μi satisfying μn1 = μn2 = μ1 ∧ μ2 on Tn , n ∈ N. Writing νin = δn ⊗ μni , we get ν¯in ≤ μi and ν1n ◦ p−1 = ν2n ◦ p−1 , and so (ν1 + ν1n , ν2 + ν2n ) ∈ C. Since (ν1 , ν2 ) is maximal, we obtain ν1n = ν2n = 0, and so Corollary 2.9 gives μ1 ⊥ μ2 on Tn for all n. Thus, μ1 An = μ2 Acn = 0 for some sets An ∈ Tn . Then also μ1 A = μ2 Ac = 0, where A = lim supn An ∈ T . Since the μi agree on T , we obtain μ1 = μ2 = 0, which means that ν¯i = μi . Hence, Theorem 8.17 yields some random variables σ, τ ≥ 0, such that the pairs (σ, X) and (τ, Y ) have distributions ν1 , ν2 , and the desired coupling follows from the equality in (16). 7 We don’t hesitate to use the Axiom of Choice, whenever it leads to short and transparent proofs.


The remaining claims are easy. Thus, the relation (σ, θσ X) = (τ, θτ Y ) implies μ1 − μ2 Tn → 0 as in (17), and the latter condition yields μ1 = μ2 on T . When the path space is Borel, the asserted a.s. coupling follows from the distributional version by Theorem 8.17. 2 To avoid repetitions, we postpone the proof for (i )−(iii ) until after the proof of the next theorem, involving groups G of transformations on an arbitrary measurable space (S, S). Theorem 25.26 (group coupling, Thorisson) Let G be an lcscH group, acting measurably on a space S with G-invariant σ-field I. Then for any random elements ξ, η in S, these conditions are equivalent: d

(i) ξ = η on I, d (ii) γ ξ = η for a random element γ in G. Proof (OK): Let μ1 , μ2 be the distributions of ξ, η. Define p : G × S → S by p(r, s) = rs, and let C be the class of pairs (ν1 , ν2 ) of measures on G × S satisfying (16) with ν¯i = νi (G × ·). As before, Zorn’s lemma yields a maximal element (ν1 , ν2 ), and we claim that ν¯i = μi for i = 1, 2. To see this, let λ be a right-invariant Haar measure on G, which exists by ˜ ∼ λ, Theorem 3.8. Since λ is σ-finite, we may choose a probability measure λ and define μi = μi − ν¯i , ˜ ⊗ μ , i = 1, 2. χi = λ i By Corollary 2.13, there exist some measures νi ≤ χi satisfying ν1 ◦ p−1 = ν2 ◦ p−1 = χ1 ◦ p−1 ∧ χ2 ◦ p−1 . Then ν¯i ≤ μi for i = 1, 2, and so (ν1 + ν1 , ν2 + ν2 ) ∈ C. The maximality of (ν1 , ν2 ) yields ν1 = ν2 = 0, and so χ1 ◦ p−1 ⊥ χ2 ◦ p−1 by Corollary 2.9. Thus, ˜ there exists a set A1 = Ac2 ∈ S with χi ◦ p−1 Ai = 0 for i = 1, 2. Since λ  λ, Fubini’s theorem gives S

μi (ds)

G

1Ai (gs) λ(dg) = (λ ⊗ μi ) ◦ p−1 Ai = 0.

(18)

By the right invariance of λ, the inner integral on the left is G-invariant and therefore I -measurable in s ∈ S. Since also μ1 = μ2 on I by (16), equation (18) remains true with Ai replaced by Aci . Adding the two versions gives λ ⊗ μi = 0, and so μi = 0. Thus, ν¯i = μi for i = 1, 2. Since G is Borel, Theorem 8.17 yields some random elements σ, τ in G, such d that (σ, ξ) and (τ, η) have distributions ν1 , ν2 . By (16) we get σξ = τ η, and so d the same theorem yields a random element τ˜ in G with (˜ τ , σξ) = (τ, τ η). But d −1 −1 then τ˜ σξ = τ τ η = η, which proves the desired relation with γ = τ˜−1 σ. 2


Proof of Theorem 25.25 (i )−(iii ): In the last proof, replace S by the path space U = S R+ , G by the semi-group of shifts θt , t ≥ 0, and λ by Lebesgue d measure on R+ . Assuming X = Y on I, we may proceed as before up to equation (18), which now takes the form

U

μi (dx)



0



1Ai (θt x) dt = (λ ⊗ μi ) ◦ p−1Ai = 0.

(19)

Writing fi (x) for the inner integral on the left, we note that for any h > 0,

fi (θh x) = =



h∞ 0

1Ai (θt x) dt 1θ−1Ai (θt x) dt. h

(20)

Hence, (19) remains true with Ai replaced by θh−1Ai , and then also for the θ1 -invariant sets A˜1 = lim sup θn−1A1 , n→∞

inf θn−1A2 , A˜2 = lim n→∞ where n → ∞ along N. Since A˜1 = A˜c2 , we may henceforth take the Ai in (19) to be θ1 -invariant. Then so are the functions fi in view of (20). By the monotonicity of fi ◦ θh , the fi are then θh -invariant for all h > 0 and therefore d I -measurable. Arguing as before, we see that θσ X = θτ Y for some random variables σ, τ ≥ 0. The remaining assertions are again routine. 2 ¯  to a striking relationship between the sample intensity ξ =  We turn  E ξ[0, 1]  Iξ of a stationary, a.s. singular random measure ξ on R+ and the corresponding maximum over increasing intervals. It may be compared with the more general but less precise maximum inequalities in Proposition 25.10 and Lemmas 25.11 and 25.16. For the needs of certain applications, we also consider random measures ξ on [0, 1). Here ξ¯ = ξ[0, 1) by definition, and stationarity is defined in the cyclic sense of shifts θt on [0, 1), where θt s = s + t (mod 1), and correspondingly for sets and measures. Recall that ξ is said to be singular if its absolutely continuous component vanishes, which holds in particular for purely atomic measures ξ.

Theorem 25.27 (ballot theorem, Whitworth, Tak´ acs, OK) Let ξ be a stationary, a.s. singular random measure on R+ or [0, 1). Then there exists a U (0, 1) random variable σ⊥⊥Iξ , such that (21) σ sup t−1 ξ[0, t] = ξ¯ a.s. t>0

To justify the statement, we note that the singularity of a measure μ is a measurable property. Indeed, by Proposition 2.24, it is equivalent that the function Ft = μ[0, t] be singular. Now it is easy to check that the singularity of F can be described by countably many conditions, each involving the increments of F over finitely many intervals with rational endpoints.


Proof: If ξ is stationary on [0, 1), the periodic continuation η = Σ n≤0 θn ξ ¯ Furthermore, is clearly stationary on R+ , and moreover Iη = Iξ and η¯ = ξ. the elementary inequality xk x1 + · · · + xn ≤ max , k≤n t1 + · · · + tn tk

n ∈ N,

valid for arbitrary x1 , x2 , . . . ≥ 0 and t1 , t2 , . . . > 0, shows that supt t−1 η[0, t] = supt t−1 ξ[0, t]. Thus, it suffices to consider random measures on R+ . Then put Xt = ξ(0, t], and define At = inf (s − Xs ), s≥t



αt = 1 At = t − Xt ,

t ≥ 0.

(22)

Noting that At ≤ t − Xt and using the monotonicity of X, we get for any s < t As = inf (r − Xr ) ∧ At r∈[s,t)

≥ (s − Xt ) ∧ At ≥ (s − t + At ) ∧ At = s − t + At . If A0 is finite, then so is At for every t, and we get by subtraction 0 ≤ At − As ≤ t − s on {A0 > −∞},

s < t.

(23)

Thus, A is non-decreasing and absolutely continuous on {A0 > −∞}. Now fix a singular path of X such that A0 is finite, and let t ≥ 0 be such that At < t − Xt . Then At + Xt± < t by monotonicity, and so by the left and right continuity of A and X, there exists an ε > 0 such that As + Xs < s − 2ε,

|s − t| < ε.

Then (23) yields s − Xs > As + 2ε > At + ε, |s − t| < ε, and by (22) it follows that As = At for |s − t| < ε. In particular, A has derivative At = 0 = αt at t. On the complementary set D = {t ≥ 0; At = t − Xt }, Theorem 2.15 shows that both A and X are differentiable a.e., the latter with derivative 0, and we may form a set D by excluding the corresponding null sets. We may also exclude the countable set of isolated points of D. Then for any t ∈ D we may choose some tn → t in D \ {t}. By the definition of D, X t − Xt Atn − At =1− n , tn − t tn − t

n ∈ N,

and as n → ∞ we get At = 1 = αt . Combining this with the result in the previous case, we see that A = α a.e. Since A is absolutely continuous, Theorem 2.15 yields

    At − A0 = ∫_0^t αs ds  on {A0 > −∞},    t ≥ 0.    (24)

Now Xt /t → ξ¯ a.s. as t → ∞, by Corollary 30.10 below. When ξ¯ < 1, (22) gives −∞ < At /t → 1 − ξ¯ a.s. Furthermore,

At + Xt − t = inf (s − t) − (Xs − Xt ) s≥t







= inf s − θt ξ(0, s] , s≥0

and hence









αt = 1 inf s − θt ξ(0, s] = 0 , s≥0

t ≥ 0.

Dividing (24) by t and using Corollary 25.9, we get a.s. on {ξ¯ < 1}  



P sup (Xt /t) ≤ 1  Iξ



t>0

 



= P sup (Xt − t) = 0  Iξ

t≥0





 

= P A0 = 0  Iξ







 ¯ = E α0  Iξ = 1 − ξ.

Replacing ξ by rξ and taking complements, we obtain more generally 





 P r sup (Xt /t) > 1  Iξ = rξ¯ ∧ 1 a.s.,

r ≥ 0,

t>0

(25)

where the result for rξ¯ ∈ [1, ∞) follows by monotonicity. When ξ¯ ∈ (0, ∞), we may define σ by (21). If instead ξ¯ = 0 or ∞, we take σ = ϑ, where ϑ⊥⊥ξ is U (0, 1). In the latter case, (21) clearly remains true, since ξ = 0 a.s. on {ξ¯ = 0} and Xt /t → ∞ a.s. on {ξ¯ = ∞}. To verify the distributional claim, we conclude from (25) and Theorem 8.5 that, on {ξ¯ ∈ (0, ∞)},









  P σ < r  Iξ = P r sup (Xt /t) > ξ¯  Iξ



t>0

= r ∧ 1 a.s.,

r ≥ 0.

Since the same relation holds trivially when ξ¯ = 0 or ∞, σ is conditionally 2 U (0, 1) given Iξ , hence unconditionally U (0, 1) and independent of Iξ . The last theorem yields a similar result in discrete time. Here (21) holds only with inequality and will be supplemented by an equality similar to (25). For a stationary sequence ξ = (ξ1 , ξ2 , . . .) in R+ with invariant σ-field Iξ , we define ξ¯ = E(ξ1 | Iξ ) a.s. On {1, . . . , n} we define stationarity in the obvious way in terms of addition modulo n, and put ξ¯ = n−1 Σ k ξk . Corollary 25.28 (discrete-time ballot theorem) Let ξ = (ξ1 , ξ2 , . . .) be a finite or infinite, stationary random sequence in R+ , and put Sn = Σ k≤n ξk . Then (i) there exists a U (0, 1) random variable σ⊥ ⊥Iξ , such that σ sup (Sn /n) ≤ ξ¯ a.s., n>0

(ii) when the ξk are Z+-valued, we have also

    P[ sup_{n>0} (Sn − n) ≥ 0 | Iξ ] = ξ̄ ∧ 1  a.s.

Proof: (i) Arguing by periodic continuation, as before, we may reduce to the case of infinite sequences ξ. Now let ϑ ⊥⊥ ξ be U(0, 1), and define Xt = S_{[t+ϑ]}. Then X has stationary increments, and we note that I_X = Iξ and X̄ = ξ̄. By Theorem 25.27 there exists a U(0, 1) random variable σ ⊥⊥ I_X, such that a.s.

    sup_{n>0} (Sn/n) = sup_{t>0} (S_{[t]}/t)
                     ≤ sup_{t>0} (Xt/t) = ξ̄/σ.

(ii) If the ξk are Z+-valued, the same result yields a.s.

    P[ sup_{n>0} (Sn − n) ≥ 0 | Iξ ] = P[ sup_{t≥0} (Xt − t) > 0 | Iξ ]
                                     = P[ sup_{t>0} (Xt/t) > 1 | Iξ ]
                                     = P[ ξ̄ > σ | Iξ ] = ξ̄ ∧ 1.  □
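The following Monte Carlo check is not part of the original text; it illustrates part (ii) of Corollary 25.28 in the i.i.d. case, where Iξ is trivial and ξ̄ equals the mean (cf. also Exercise 29). The supremum over n > 0 is truncated to a finite horizon, which is essentially harmless here since Sn − n drifts to −∞; the Poisson distribution and all parameters are arbitrary assumptions.

    import numpy as np

    rng = np.random.default_rng(4)
    lam, n_paths, horizon = 0.6, 20_000, 400

    xi = rng.poisson(lam, size=(n_paths, horizon))           # i.i.d. Z_+-valued, mean lam < 1
    S = np.cumsum(xi, axis=1)
    ever = (S - np.arange(1, horizon + 1) >= 0).any(axis=1)  # sup_{n>0} (S_n - n) >= 0 ?

    print(ever.mean(), lam)   # both close to xi_bar ∧ 1 = 0.6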

To state the next result, let ξ be a random element in a countable space S, and put pj = P {ξ = j}. For any σ-field F, we define the information I(j) and conditional information I(j |F ) by I(j) = − log pj ,







I(j | F ) = − log P ξ = j  F ,

j ∈ S.

For motivation, we note the additivity 









I ξ1 , . . . , ξn = I(ξ1 ) + I(ξ2 | ξ1 ) + · · · + I ξn  ξ1 , . . . , ξn−1 ,

(26)

valid for any random elements ξ1 , . . . , ξn in S. Next, we form the associated entropy H(ξ) = E I(ξ) and conditional entropy H(ξ|F) = E I(ξ|F), and note that H(ξ) = E I(ξ) = − Σ j pj log pj . By (26), even H is additive, in the sense that 





 



H ξ1 , . . . , ξn = H(ξ1 ) + H(ξ2 | ξ1 ) + · · · + H ξn  ξ1 , . . . , ξn−1 .

(27)

When the sequence (ξn ) is stationary and ergodic with H(ξ0 ) < ∞, we show that the averages in (26) and (27) converge toward a common limit. Theorem 25.29 (entropy and information, Shannon, McMillan, Breiman, Ionescu Tulcea) Let ξ = (ξk ) be a stationary, ergodic sequence in a countable space S, such that H(ξ0 ) < ∞. Then 







n−1 I ξ1 , . . . , ξn → H ξ0  ξ−1 , ξ−2 , . . .



a.s. and in L1 .

582

Foundations of Modern Probability

Note that H(ξ0 ) < ∞ holds automatically when the state space is finite. Our proof will be based on a technical estimate. Lemma 25.30 (maximum inequality, Chung, Neveu) For any countably valued random variable ξ and discrete filtration (Fn ), we have E supn I(ξ | Fn ) ≤ H(ξ) + 1. Proof: Write pj = P {ξ = j} and η = supn I(ξ |Fn ). For fixed r > 0, we introduce the optional times

τj = inf n; I(j |Fn ) > r

 









= inf n; P ξ = j  Fn < e−r ,

j ∈ S.

By Lemma 8.3,





P η > r, ξ = j = P τj < ∞, ξ = j  









= E P ξ = j  Fτj ; τj < ∞ ≤ e−r P {τj < ∞} ≤ e−r .

Since the left-hand side is also bounded by pj , Lemma 4.4 yields 



Σ j E η; ξ = j

∞ = Σj P η > r, ξ = j dr 0

Eη =









Σ j 0 e−r ∧ pj dr   = Σ j pj 1 − log pj



= H(ξ) + 1.

2

Proof of Theorem 25.29 (Breiman): We may choose ξ to be defined on the canonical space S ∞ . Then introduce the functions 

 



 



gk (ξ) = I ξ0  ξ−1 , . . . , ξ−k+1 , 

g(ξ) = I ξ0  ξ−1 , ξ−2 , . . . . By (26), we may write the assertion as n−1

Σ gk (θk ξ) → E g(ξ)

a.s. and in L1 .

(28)

k≤n

Here gk (ξ) → g(ξ) a.s. by martingale convergence, and E supk gk (ξ) < ∞ by Lemma 25.30. Hence, (28) follows by Corollary 25.8. 2
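To close the chapter, here is a small numerical illustration, not from the original text, of Theorem 25.29 in the simplest i.i.d. case, where H(ξ0 | ξ−1, ξ−2, . . .) = H(ξ0) and, by (26), the information n⁻¹ I(ξ1, . . . , ξn) is just the average of the terms −log p(ξk). The alphabet and probabilities are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(5)
    p = np.array([0.5, 0.3, 0.2])
    H = -(p * np.log(p)).sum()            # entropy of one symbol, in nats

    n = 200_000
    xi = rng.choice(len(p), size=n, p=p)  # i.i.d., hence stationary and ergodic
    info = -np.log(p[xi])                 # increments of I(xi_1, ..., xi_n)
    avg = np.cumsum(info) / np.arange(1, n + 1)

    print(avg[[999, 99_999, n - 1]], H)   # n^{-1} I(xi_1,...,xi_n) approaches H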


Exercises 1. State and prove some continuous-time, two-sided, and higher-dimensional versions of Lemma 25.1. 2. For a stationary random sequence ξ = (ξ1 , ξ2 , . . .), show that the ξn are i.i.d. iff ξ1 ⊥⊥(ξ2 , ξ3 , . . .). 3. For a Borel space S, let X be a stationary array of random elements in S indexed by Nd . Prove the existence of a stationary array Y indexed by Zd such that X = Y a.s. on Nd . 4. Let X be a stationary process on R+ with values in a Borel space S. Prove the d existence of a stationary process Y on R with X = Y on R+ . Strengthen this to an a.s. equality, when S is a complete metric space and X is right-continuous. 5. Let ξ be a two-sided, stationary random sequence with restriction η to N. Show that ξ, η are simultaneously ergodic. (Hint: For any measurable, invariant set I ∈ S Z , there exists a measurable, invariant set I  ∈ S N with I = S Z− × I  a.s. L(ξ).) 6. Establish two-sided and higher-dimensional versions of Lemmas 25.4 and 25.5, as well as of Theorem 25.9. 7. Say that a measure-preserving transformation T on a probability space (S, S, μ) is mixing if μ(A ∩ T −n B) → μA · μB for all A, B ∈ S. Prove a counterpart of Lemma 25.5 for mixing. Also show that mixing transformations are ergodic. (Hint: For the latter assertion, take A = B to be invariant.) 8. Show that it suffices to verify the mixing property for sets in a generating π -system. Using this, prove that i.i.d. sequences are mixing under shifts. 9. For any a ∈ R, define T s = s + a (mod 1) on [0, 1]. Show that T fails to be mixing but is ergodic iff a ∈ Q. (Hint: For the ergodicity, let I ⊂ [0, 1] be T -invariant. Then so is the measure 1I λ. Since the points ka are dense in [0, 1], it follows that 1I λ is invariant. Now use Theorem 2.6.) 10. (Bohl, Sierpi´ nski, Weyl ) For any a ∈ Q, put μn = n−1 Σ k≤n δka , where w ka ∈ [0, 1] is defined modulo 1. Show that μn → λ. (Hint: Apply Theorem 25.6 to the map in the previous exercise.) 11. Show that the transformation T s = 2s (mod 1) on [0, 1] is mixing. Also show how the map of Lemma 4.20 can be generated as in Lemma 25.1 by means of T . 12. Note that Theorem 25.6 remains true for invertible shifts T , with averages taken over increasing index sets [an , bn ] with bn −an → ∞. Show by an example that the a.s. convergence may fail without the monotonicity. (Hint: Consider an i.i.d. sequence (ξn ) and disjoint intervals [an , bn ], and use the Borel–Cantelli lemma.) 13. Consider a one- or two-sided, stationary random sequence (ξn ) in a measurable space (S, S), and fix any B ∈ S. Show that a.s. either ξn ∈ B c for all n or ξn ∈ B i.o. (Hint: Use Theorem 25.6.) 14. (von Neumann) Give a direct proof of the L2 -version of Theorem 25.6. (Hint: Define a unitary operator U on L2 (S) by U f = f ◦ T . Let M denote the U -invariant ¯ A , the closed range of subspace of L2 and put A = I − U . Check that M ⊥ = R A. By Theorem 1.35 it is enough to take f ∈ M or f ∈ RA .) Deduce the general Lp -version, and extend the argument to higher dimensions.


15. In the context of Theorem 25.24, show that the set of ergodic measures in M is measurable. (Hint: Use Lemma 3.2, Proposition 5.32, and Theorem 25.14.) 16. Prove a continuous-time version of Theorem 25.24. 17. Deduce Theorem 5.23 for p ≤ 1 from Theorem 25.20. (Hint: Take Xm,n = |Sn − Sm |p , and note that E|Sn |p = o(n) when p < 1.) 18. Let ξ = (ξ1 , ξ2 , . . .) be a stationary sequence of random variables, fix any B ∈ BRd , and let κn be the number of indices k ∈ {1, . . . , n−d} with (ξk , . . . , ξd ) ∈ B. Use Theorem 25.20 to show that κn /n converges a.s. Deduce the same result from Theorem 25.6, by considering suitable sub-sequences.

19. Strengthen the inequality in Lemma 25.7 to E[ ξ1; sup_n (Sn/n) ≥ 0 ] ≥ 0. (Hint: Apply the original result to the variables ξk + ε, and let ε → 0.)

20. Extend Proposition 25.10 to stationary processes on Zd . 21. Extend Theorem 25.14 to averages over arbitrary rectangles An = [0, an1 ] × · · · × [0, and ], such that anj → ∞ and supn (ani /anj ) < ∞ for all i = j. (Hint: Note that Lemma 25.16 extends to this case.) 22. Derive a version of Theorem 25.14 for stationary processes X on Zd . (Hint: ˜ on Rd , By a suitable randomization, construct an associated stationary process X ˜ apply Theorem 25.14 to X, and estimate the error term as in Corollary 30.10.) d

23. Give an example of two processes X, Y on R+ , such that X = Y on I but not on T . 24. Derive a version of Theorem 25.25 for processes on Z+ . Also prove versions for processes on Rd+ and Zd+ . 25. Show that Theorem 25.25 (ii) implies a corresponding result for processes on ˜ t = θt X and Y˜ = θt Y .) Also R. (Hint: Apply Theorem 25.25 to the processes X show how the two-sided statement follows from Theorem 25.26. ˜ t = (Xt , t), and let I˜ be the associated 26. For processes X on R+ , define X d

invariant σ-field. Assuming X, Y to be measurable, show that X = Y on T iff d ˜ ˜ (Hint: Use Theorem 25.25.) ˜= Y on I. X 27. Show by an example that the conclusion of Theorem 25.27 may fail when ξ is not singular. 28. Give an example where the inequality in Corollary 25.28 is a.s. strict. (Hint: Examine the proof.) 29. (Bertrand, Andr´e ) Show that if two candidates A, B in an election get the proportions p and 1−p of the votes, then the probability that A will lead throughout the ballot count equals (2p − 1)+ . (Hint: Apply Corollary 25.28, or use a direct combinatorial argument based on the reflection principle.) 30. Prove the second claim in Corollary 25.28 by a martingale argument, when ξ1 , . . . , ξn are Z+ -valued and exchangeable. (Hint: We may assume that Sn is nonrandom. Then the variables Mk = Sk /k form a reverse martingale, and the result follows by optional sampling.) 31. Prove that the convergence in Theorem 25.29 holds in Lp for arbitrary p > 0 when S is finite. (Hint: Show as in Lemma 25.30 that supn I(ξ |Fn ) p < ∞ when ξ is S -valued, and use Corollary 25.8 (ii).)


32. Show that H(ξ, η) ≤ H(ξ) + H(η) for any ξ, η. (Hint: Note that H(η | ξ) ≤ H(η) by Jensen’s inequality.) 33. Give an example of a stationary Markov chain (ξn ), such that H(ξ1 ) > 0 but H(ξ1 | ξ0 ) = 0. 34. Give an example of a stationary Markov chain (ξn ), such that H(ξ1 ) = ∞ but H(ξ1 | ξ0 ) < ∞. (Hint: Choose the state space Z+ and transition probabilities pij = 0, unless j ∈ {0, i + 1}.)

Chapter 26

Ergodic Properties of Markov Processes

Transition and contraction operators, operator ergodic theorem, maximum inequalities, ratio ergodic theorem, filling operators and functionals, ratio limit theorem, space-time invariance, strong and weak ergodicity and mixing, tail and invariant triviality, Harris recurrence and ergodicity, potential and resolvent operators, convergence dichotomy, recurrence dichotomy, invariant measures, distributional and pathwise limits

The primary aim of this chapter is to study the asymptotic behavior of broad classes of Markov processes. First, we will see how the ergodic theorems of the previous chapter admit some powerful extensions to suitable contraction operators, leading to some pathwise and distributional ratio limit theorems for fairly general Markov processes. Next, we establish some basic criteria for strong or weak ergodicity of conservative Markov semi-groups, and show that Harris recurrent Feller processes are strongly ergodic. Finally, we show that any regular Feller process is either Harris recurrent or uniformly transient, examine the existence of invariant measures, and prove some associated distributional and pathwise limit theorems. The latter analysis relies crucially on some profound tools from potential theory. For a more detailed description, we first extend the basic ergodic theorem of Chapter 25 to suitable contraction operators on an arbitrary measure space, and establish a general operator version of the ratio ergodic theorem. The relevance of those results for the study of Markov processes is due to the fact that their transition operators are positive L1 −L∞ -contractions with respect to any invariant measure λ on the state space S. The mentioned results cover both the positive-recurrent case where λS < ∞, and the null-recurrent case where λS = ∞. Even more remarkably, the same ergodic theorems apply to both the transition probabilities and the sample paths, in both cases giving conclusive information about the asymptotic behavior. Next we prove, for an arbitrary Markov process, that a certain condition of strong ergodicity is equivalent to the triviality of the tail σ-field, the constancy of all bounded, space-time invariant functions, and a uniform mixing condition. We also consider a similar result where all four conditions are replaced by suitably averaged versions. For both sets of equivalences, one gets very simple and transparent proofs by applying the general coupling results of Chapter 25. In order to apply the mentioned theorems to specific Markov processes, we need to find regularity conditions ensuring the existence of an invariant © Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1_27


measure or the triviality of the tail σ-field. Here we consider a general class of Feller processes satisfying either a strong recurrence or a uniform transience condition. In the former case, we prove the existence of an invariant measure, required for the application of the mentioned ergodic theorems, and show that the space-time invariant functions are constant, which implies the mentioned strong ergodicity. Our proofs of the latter results depend on some potential theoretic tools, related to those developed in Chapter 17. To set the stage for the technical developments, we consider a Markov transition operator T on an arbitrary measurable space (S, S). Note that T is positive, in the sense that f ≥ 0 implies Tf ≥ 0, and also that T 1 = 1. As before, we write Px for the distribution of a Markov process on Z+ with transition operator T , starting at x ∈ S. More generally, we define Pμ =  Px μ(dx) for any measure μ on S. A measure λ on S is said to be invariant if λ Tf = λf for any measurable function f ≥ 0. Writing θ for the shift on the ˜ = f ◦ θ. path space S ∞ , we define the associated operator θ˜ by θf For any p ≥ 1, an operator T on a measure space (S, S, μ) is called an Lp -contraction, if Tf p ≤ f p for every f ∈ Lp . We further say that T is an L1 −L∞ -contraction, if it is an Lp -contraction for every p ∈ [1, ∞]. The following result shows the relevance of the mentioned notions for the theory of Markov processes. Lemma 26.1 (transition and contraction operators) Let T be a Markov transition operator on (S, S) with invariant measure λ. Then (i) T is a positive L1 −L∞ -contraction on (S, λ), (ii) θ˜ is a positive L1 −L∞ -contraction on (S ∞ , Pλ ). Proof: (i) Applying Jensen’s inequality to the transition kernel μ(x, B) = T 1B (x) and using the invariance of λ, we get for any p ∈ [1, ∞) and f ∈ Lp Tf pp = λ|μf |p ≤ λμ|f |p = λ|f |p = f pp . The result for p = ∞ is obvious. (ii) Proceeding as in Lemma 11.11, we see that θ is a measure-preserving transformation on (S ∞ , Pλ ). Hence, for any measurable function f ≥ 0 on S ∞ and constant p ≥ 1, we have     ˜ p = Pλ f ◦ θp Pλ θf 



= Pλ ◦ θ−1 |f |p = Pλ |f |p . The contraction property for p = ∞ is again obvious.

2

Some crucial results from Chapter 25 carry over to the context of positive L1 − L∞ -contractions on an arbitrary measure space. First we consider an

26. Ergodic Properties of Markov Processes

589

operator version of Birkhoff’s ergodic theorem. To simplify the writing, we introduce the operators Sn =

Σ T k,

An = Sn /n,

M f = supn An f.

k 0 ≥ 0, f ∈ L1 . If T is even an L1 −L∞ -contraction, then also (ii) r μ{M f > 2r} ≤ μ(f ; f > r), f ∈ L1 , r > 0, f p , f ∈ Lp , p > 1. (iii) M f p <

Proof: (i) For any f ∈ L1 , write Mn f = S1 f ∨ · · · ∨ Sn f , and conclude that by positivity Sk f = f + T Sk−1 f ≤ f + T (Mn f )+ , k = 1, . . . , n. Hence, Mn f ≤ f + T (Mn f )+ for all n, and so by positivity and contractivity, 





μ f ; Mn f > 0 ≥ μ Mn f − T (Mn f )+ ; Mn f > 0





≥ μ (Mn f )+ − T (Mn f )+  

 

 

 

= (Mn f )+ 1 − T (Mn f )+ 1 ≥ 0. As before, it remains to let n → ∞. (ii) Put fr = f 1{f > r}. By the L∞ -contractivity and positivity of An , An f − 2r ≤ An (f − 2r) ≤ An (fr − r),

n ∈ N,

which implies M f − 2r ≤ M (fr − r). Hence, by part (i),

r μ{M f > 2r} ≤ r μ M (fr − r) > 0





≤ μ fr ; M (fr − r) > 0 



≤ μfr = μ f ; f > r .



590

Foundations of Modern Probability (iii) Here the earlier proof applies with only notational changes.

2

Proof of Theorem 26.2: Fix any f ∈ L1 . By dominated convergence, f may be approximated in L1 by functions f˜ ∈ L1 ∩ L∞ ⊂ L2 . By Lemma 25.17 we may next approximate f˜ in L2 by functions of the form fˆ+ (g − T g), where fˆ, g ∈ L2 and T fˆ = fˆ. Finally, we may approximate g in L2 by functions g˜ ∈ L1 ∩ L∞ . Since T contracts L2 , the functions g˜ − T g˜ will then approximate g − T g in L2 . Combining the three approximations, we have for any ε > 0 f = fε + (gε − T gε ) + hε + kε ,

(1)

where fε ∈ L2 with T fε = fε , gε ∈ L1 ∩ L∞ , and hε 2 ∨ kε 1 < ε. Since fε is invariant, we have An fε ≡ fε . Next, we note that         An (gε − T gε )∞ = n−1 gε − T n gε ∞

Hence,

≤ 2n−1 gε ∞ → 0.

(2)

lim sup An f ≤ fε + M hε + M kε < ∞ a.e., n→∞

and similarly for lim inf n An f . Combining the two estimates gives 



lim sup − lim inf An f ≤ 2M |hε | + 2M |kε |. n→∞

n→∞

Now Lemma 26.3 yields for any ε, r > 0     M |hε |2 < hε 2 < ε,



μ M |kε | > 2r



≤ r−1 kε 1 < ε/r,

and so M |hε | + M |kε | → 0 a.e. as ε → 0 along a suitable sequence. Thus, An f converges a.e. toward a limit Af . To prove that Af is T -invariant, we see from (1) and (2) that the a.e. limits Ahε and Akε exist and satisfy T Af − Af = (T A − A)(hε + kε ). By the contraction property and Fatou’s lemma, the right-hand side tends a.e. to 0 as ε → 0 along a sequence, and we get T Af = Af a.e. 2 The limit Af in the last theorem may be 0, in which case the a.s. convergence An f → Af gives little information about the asymptotic behavior of An f . For example, this happens when μS = ∞ and T is the operator induced by a μ -preserving and ergodic transformation θ on S. Then Af is a constant, and the condition Af ∈ L1 implies Af = 0. To get around this difficulty, we may instead compare the asymptotic behavior of Sn f with that of Sn g for a suitable reference function g ∈ L1 . This idea leads to a far-reaching and powerful extension of Birkhoff’s theorem. Theorem 26.4 (ratio ergodic theorem, Chacon & Ornstein) Let T be a positive L1 -contraction on a measure space (S, S, μ), and let f ∈ L1 and g ∈ L1+ . Then Sn f converges a.e. on {S∞ g > 0}. Sn g

26. Ergodic Properties of Markov Processes

591

Three lemmas are needed for the proof. Lemma 26.5 (individual terms) In the context of Theorem 26.4, T nf → 0 a.e. on {S∞ g > 0}. Sn+1 g Proof: We may take f ≥ 0. Fix any ε > 0, and define hn = T n f − ε Sn+1 g, An = {hn > 0}, n ≥ 0. By positivity, hn = T hn−1 − εg ≤ T h+ n−1 − εg,

n ≥ 1.

Examining separately the cases An and Acn , we conclude that + h+ n ≤ T hn−1 − ε1An g,

and so by contractivity,



n ≥ 1, 

+ ε μ(g; An ) ≤ μ T h+ n−1 − μhn + ≤ μh+ n−1 − μhn .

Summing over n gives ε μ Σ 1An g ≤ μh+ 0 = μ(f − εg) n≥1 ≤ μf < ∞, which implies μ(g; An i.o.) = 0 and hence lim supn (T n f /Sn+1 g) ≤ ε a.e. on {g > 0}. Since ε was arbitrary, we obtain T n f /Sn+1 g → 0 a.e. on {g > 0}. Applying this result to the functions T m f and T m g gives the same convergence 2 on {Sm−1 g = 0 < Sm g}, for arbitrary m > 1. To state the next result, we introduce the non-linear filling operator F on L1 , given by F h = T h+ − h− . We may think of the sequence F n h as arising from successive attempts to fill a hole h− , where in each step we are only mapping matter that has not yet fallen into the hole. We also define Mn h = S1 h ∨ · · · ∨ Sn h. Lemma 26.6 (filling operator) For any h ∈ L1 and n ∈ N, F n−1 h ≥ 0 on {Mn h > 0}. Proof: Writing hk = h+ + (F h)+ + · · · + (F k h)+ , we claim that hk ≥ Sk+1 h,

k ≥ 0.

(3)

This is clear for k = 0 since h+ = h + h− ≥ h. Proceeding by induction, let (3) be true for k = m ≥ 0. Using the induction hypothesis and the definitions of Sk , hk , and F , we get for m + 1

592

Foundations of Modern Probability Sm+2 h = h + T Sm+1 h ≤ h + T hm = h + Σ T (F k h)+ k ≤ m



Σ F k+1h + (F k h)− k ≤ m

= h + Σ (F k+1 h)+ − (F k+1 h)− + (F k h)− k≤m = h+

= h + hm+1 − h+ + h− − (F m+1 h)− ≤ hm+1 .

This completes the proof of (3). If Mn h > 0 at some point in S, then Sk h > 0 for some k ≤ n, and so by (3) we have hk > 0 for some k < n. But then (F k h)+ > 0, and therefore (F k h)− = 0 for the same k. Since (F k h)− is non-increasing, it follows that 2 (F n−1 h)− = 0, and hence F n−1 h ≥ 0. To state our third and crucial lemma, write g ∈ T1 (f ) for a given f ∈ L1+ , if there exists a decomposition f = f1 +f2 with f1 , f2 ∈ L1+ such that g = T f1 +f2 . Note in particular that f, g ∈ L1+ implies U (f −g) = f  −g for some f  ∈ T1 (f ). The classes Tn (f ) are defined recursively by Tn+1 (f ) = T1 {Tn (f )}, and we put  T (f ) = n Tn (f ). Now introduce the functionals



ψB f = sup μ(g; B); g ∈ T (f ) ,

f ∈ L1+ , B ∈ S.

Lemma 26.7 (filling functionals) For any f, g ∈ L1+ and B ∈ S,



B ⊂ lim sup Sn (f − g) > 0 n→∞



ψB f ≥ ψB g.

Proof: Fix any g  ∈ T (g) and c > 1. First we show that







lim sup Sn (f − g) > 0 ⊂ lim sup Sn (cf − g  ) > 0 n→∞

a.e.

(4)

n→∞

Here we may assume that g  ∈ T1 (g), since the general result then follows by iteration in finitely many steps. Letting g  = r + T s for some r, s ∈ L1+ with r + s = g, we obtain Sn (cf − g  ) =

Σ T k (cf − r − T s)

k 0} Sn f Sn f

by Lemma 26.5, we conclude that eventually Sn (cf − g  ) ≥ Sn (f − g) a.e. on the same set, and (4) follows. Combining the given hypothesis with (4), we obtain the a.e. relation B ⊂ {M (cf − g  ) > 0}. Now Lemma 26.6 yields



F n−1 (cf − g  ) ≥ 0 on Bn ≡ B ∩ Mn (cf − g  ) > 0 ,

n ∈ N.

26. Ergodic Properties of Markov Processes

593

Since Bn ↑ B a.e. and F n−1 (cf − g  ) = f  − g  for some f  ∈ T (cf ), we get

0 ≤ μ F n−1 (cf − g  ); Bn



= μ(f  ; Bn ) − μ(g  ; Bn ) ≤ ψB (cf ) − μ(g  ; Bn ) → c ψB f − μ(g  ; B), and so c ψB f ≥ μ(g  ; B). It remains to let c → 1 and take the supremum over 2 g  ∈ T (g). Proof of Theorem 26.4: We may take f ≥ 0. On {S∞ g > 0}, put α = lim inf n→∞

Sn f Sn f ≤ lim sup = β, Sn g n→∞ Sn g

and define α = β = 0 otherwise. Since Sn g is non-decreasing, we have for any c>0

Sn (f − cg) {β > c} ⊂ lim sup > 0, S∞ g > 0 Sn g n→∞

⊂ lim sup Sn (f − cg) > 0 . n→∞

Writing B = {β = ∞, S∞ g > 0}, we see from Lemma 26.7 that c ψB g = ψB (cg) ≤ ψB f ≤ μf < ∞, and as c → ∞ we get ψB g = 0. But then μ(T n g; B) = 0 for all n ≥ 0, and therefore μ(S∞ g; B) = 0. Since S∞ g > 0 on B, we obtain μB = 0, which means that β < ∞ a.e. Now define C = {α < a < b < β} for fixed b > a > 0. As before,



C ⊂ lim sup Sn (f − bg) ∧ lim sup Sn (ag − f ) > 0 , n→∞

n→∞

and so by Lemma 26.7, b ψC g = ψC (bg) ≤ ψC f ≤ ψC (ag) = a ψC g < ∞, which implies ψC g = 0, and therefore μC = 0. Hence, μ{α < β} ≤



Σμ a 0 . Ex Sn g Proof: (i) By Lemma 26.1 (ii), we may apply Theorem 26.4 to the L1−L∞ contraction θ˜ on (S ∞ , Pλ ) induced by the shift θ, and the result follows. (ii) Writing f˜(x) = Ex f (X) and using the Markov property at k, we get Ex f (θk X) = Ex EXk f (X) = Ex f˜(Xk ) = T k f˜(x), and so

Σ Exf (θk X) = Σ T k f˜(x) k 0.

The following properties define the notion of strong ergodicity, as opposed to the weak ergodicity in Theorem 26.11. Theorem 26.10 (strong ergodicity and mixing, Orey) For a conservative Markov semi-group on R+ or Z+ with distributions Pμ , these conditions are equivalent: (i) (ii) (iii) (iv)

the tail σ-field T is Pμ -trivial for every μ, Pμ is mixing for every μ, every bounded, space-time invariant function is a constant,    −1 −1  Pμ ◦ θt − Pν ◦ θt  → 0 as t → ∞ for all μ and ν.

First proof: By Theorem 25.25 (i), conditions (ii) and (iv) are equivalent to respectively (ii ) Pμ = PμB on T for all μ and B, (iv ) Pμ = Pν on T for all μ, ν.

596

Foundations of Modern Probability

It is then enough to prove that (ii ) ⇔ (i) ⇒ (iv ) and (iv) ⇒ (iii) ⇒ (i). (i) ⇔ (ii ): If Pμ A = 0 or 1, then clearly also PμB A = 0 or 1, which shows that (i) ⇒ (ii ). Conversely, let A ∈ T be arbitrary with Pμ A > 0. Taking B = A in (ii ) gives Pμ A = (Pμ A)2 , which implies Pμ A = 1. (i) ⇒ (iv ): Applying (i) to the distribution 12 (μ + ν) gives Pμ A + Pν A = 0 or 2 for every A ∈ T , which implies Pμ A = Pν A = 0 or 1. (iv) ⇒ (iii): Let f be bounded and space-time invariant. Using (iv) with μ = δx and ν = δy gives         f (x, s) − f (y, s) = Ex f (Xt , s + t) − Ey f (Xt , s + t)   ≤ f Px ◦ θt−1 − Py ◦ θt−1  → 0,

which shows that f (x, s) = f (s) is independent of x. Then f (s) = f (s + t) by (5), and so f is a constant. (iii) ⇒ (i): Fix any A ∈ T . Since A ∈ Tt = σ(θt ) for every t ≥ 0, we have A = θt−1At for some sets At ∈ F∞ , which are unique since the θt are surjective. For any s, t ≥ 0, we have −1 θt−1 θs−1As+t = θs+t As+t = A = θt−1At ,

and so θs−1As+t = At . Putting f (x, t) = Px At and using the Markov property at time s, we get Ex f (Xs , s + t) = Ex PXs As+t



= Ex Px θs−1As+t  Fs



= Px At = f (x, t). Thus, f is space-time invariant and hence a constant c ∈ [0, 1]. By the Markov property at t and martingale convergence as t → ∞, we have a.s. c = f (Xt , t) = PXt At

 

= Pμ θt−1At  Ft



= Pμ (A | Ft ) → 1A , which implies Pμ A = c ∈ {0, 1}. This shows that T is Pμ -trivial.

2

Second proof: We can avoid to use the rather deep Theorem 25.25 by giving direct proofs of the implications (i) ⇒ (ii) ⇒ (iv). (i) ⇒ (ii): Assuming (i), we get by reverse martingale convergence   

     Pμ (· ∩ B) − Pμ (·) Pμ (B) T = Eμ Pμ (B | Tt ) − Pμ B; ·  T t t     ≤ Eμ Pμ (B | Tt ) − Pμ B  → 0.

26. Ergodic Properties of Markov Processes

597

(ii) ⇒ (iv): Let μ′ − ν′ be the Hahn decomposition of μ − ν, and choose B ∈ S with μ′Bᶜ = ν′B = 0. Writing χ = μ′ + ν′ and A = {X₀ ∈ B}, we get by (ii)

   ‖ Pμ ∘ θ_t⁻¹ − Pν ∘ θ_t⁻¹ ‖ = ‖μ′‖ ‖ P^A_χ ∘ θ_t⁻¹ − P^{Aᶜ}_χ ∘ θ_t⁻¹ ‖ → 0.   □

The invariant σ-field I on Ω consists of all events A ⊂ Ω with θ_t⁻¹A = A for all t ≥ 0. Note that a random variable ξ on Ω is I-measurable iff ξ ∘ θ_t = ξ for all t ≥ 0. The invariant σ-field I is clearly contained in the tail σ-field T. We say that Pμ is weakly mixing if

   lim_{t→∞} ‖ ∫₀¹ ( Pμ − P^B_μ ) ∘ θ_{st}⁻¹ ds ‖ = 0,    B ∈ F∞ with Pμ B > 0,

with the understanding that θ_s = θ_[s] when the time scale is discrete. We may now state a weak counterpart of Theorem 26.10.

Theorem 26.11 (weak ergodicity and mixing) For a conservative Markov semi-group on R+ or Z+ with distributions Pμ, these conditions are equivalent:
(i) the invariant σ-field I is Pμ-trivial for every μ,
(ii) Pμ is weakly mixing for every μ,
(iii) every bounded, invariant function is a constant,
(iv) ‖ ∫₀¹ (Pμ − Pν) ∘ θ_{st}⁻¹ ds ‖ → 0 as t → ∞, for all μ and ν.

Proof: By Theorem 25.25 (ii), we note that (ii) and (iv) are equivalent to the conditions
(ii′) Pμ = P^B_μ on I for all μ and B,
(iv′) Pμ = Pν on I for all μ, ν.
Here (ii′) ⇔ (i) ⇒ (iv′) may be established as before, and it only remains to show that (iv) ⇒ (iii) ⇒ (i).

(iv) ⇒ (iii): Let f be bounded and invariant. Then f(x) = E_x f(X_t) = T_t f(x), and therefore f(x) = ∫₀¹ T_{st} f(x) ds. Using (iv) gives

   | f(x) − f(y) | = | ∫₀¹ ( T_{st} f(x) − T_{st} f(y) ) ds |
                   ≤ ‖f‖ ‖ ∫₀¹ (P_x − P_y) ∘ θ_{st}⁻¹ ds ‖ → 0,

which shows that f is constant.

(iii) ⇒ (i): Fix any A ∈ I, and define f(x) = P_x A. Using the Markov property at t and the invariance of A, we get

   E_x f(X_t) = E_x P_{X_t} A = E_x P_x( θ_t⁻¹A | F_t ) = P_x θ_t⁻¹A = P_x A = f(x),




which shows that f is invariant. By (iii) it follows that f equals a constant c ∈ [0, 1]. Hence, by the Markov property and martingale convergence, we have a.s.

   c = f(X_t) = P_{X_t} A = Pμ( θ_t⁻¹A | F_t ) = Pμ(A | F_t) → 1_A,

which implies Pμ A = c ∈ {0, 1}. Thus, I is Pμ-trivial. □
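For a first feeling for condition (iv) of Theorem 26.10, note that for a Markov chain the distance ‖ Pμ ∘ θ_t⁻¹ − Pν ∘ θ_t⁻¹ ‖ reduces, by the Markov property, to the total-variation distance between the time-t marginals μP^t and νP^t. The sketch below is a numerical aside, not part of the text, assuming a finite state space and discrete time with an arbitrarily chosen irreducible, aperiodic transition matrix.

```python
# Total-variation mixing for a finite, irreducible, aperiodic chain:
# || mu P^t - nu P^t ||_TV -> 0 for any initial distributions mu, nu,
# which is the marginal form of condition (iv) in Theorem 26.10.
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.1, 0.5]])        # illustrative transition matrix
mu = np.array([1.0, 0.0, 0.0])
nu = np.array([0.0, 0.0, 1.0])
for t in (1, 5, 10, 20):
    Pt = np.linalg.matrix_power(P, t)
    d = 0.5 * np.abs(mu @ Pt - nu @ Pt).sum()   # total-variation distance
    print(t, round(d, 8))                       # decays geometrically to 0
```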

We now specialize to conservative Feller processes X with distributions P_x, defined on an lcscH (locally compact, second countable, Hausdorff) space S with Borel σ-field S. Say that the process is regular, if there exist a locally finite measure ρ on S and a continuous function (x, y, t) → p_t(x, y) > 0 on S² × (0, ∞), such that

   P_x{X_t ∈ B} = ∫_B p_t(x, y) ρ(dy),    x ∈ S, B ∈ S, t > 0.

The supporting measure ρ is then unique up to an equivalence, and supp ρ = S by the Feller property. A Feller process is said to be Harris recurrent, if it is regular with a supporting measure ρ satisfying

   ∫₀^∞ 1_B(X_t) dt = ∞ a.s. P_x,    x ∈ S, B ∈ S with ρB > 0.   (6)
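As a concrete illustration of (6), standard Brownian motion in R^d is regular, with Gaussian transition densities and ρ equal to Lebesgue measure (cf. Exercise 10). The following sketch is a numerical aside rather than part of the development: since |B_t|² has the law of t times a χ²_d variable, the expected occupation time of the unit ball up to time T equals ∫₀^T P(χ²_d ≤ 1/t) dt, which diverges for d ≤ 2 and stays bounded for d ≥ 3, in line with the recurrence dichotomy established later in this chapter.

```python
# Expected occupation time of the unit ball for d-dimensional Brownian motion,
#     E_0 \int_0^T 1{|B_t| <= 1} dt = \int_0^T P(chi2_d <= 1/t) dt,
# computed by numerical integration: it grows without bound for d = 1 and
# levels off near a finite limit for d = 3.
import numpy as np
from scipy.stats import chi2

def expected_occupation(d, T, n=200_000):
    t = np.linspace(1e-6, T, n)
    return np.trapz(chi2.cdf(1.0 / t, df=d), t)

for d in (1, 3):
    print(d, [round(expected_occupation(d, T), 2) for T in (10, 100, 1000)])
```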

Theorem 26.12 (Harris recurrence and ergodicity, Orey) For a Feller process X, we have

   X is Harris recurrent  ⇒  X is strongly ergodic.

Proof: By Theorem 26.10 it suffices to prove that any bounded, space-time invariant function f : S × R+ → R is a constant. First we show, for fixed x ∈ S, that f(x, t) is independent of t. Then assume that instead f(x, h) ≠ f(x, 0) for some h > 0, say f(x, h) > f(x, 0). Recall from Lemma 26.9 that M^s_t = f(X_t, s + t) is a P_y-martingale for any y ∈ S and s ≥ 0. In particular, the limit M^s_∞ exists a.s. P_x along hQ, and we get a.s.

   E_x( M^h_∞ − M^0_∞ | F₀ ) = M^h_0 − M^0_0 = f(x, h) − f(x, 0) > 0,

which implies P_x{ M^h_∞ > M^0_∞ } > 0. We may then choose some constants a < b such that

   P_x{ M^0_∞ < a < b < M^h_∞ } > 0.   (7)

We also note that

   M^{s+h}_t ∘ θ_s = f( X_{s+t}, s + t + h ) = M^h_{s+t},    s, t, h ≥ 0.   (8)


Restricting s, t to hQ+, we define

   g(y, s) = P_y ∩_{t≥0} { M^s_t ≤ a < b ≤ M^{s+h}_t },    y ∈ S, s ≥ 0.

Using (8) and the Markov property at s, we get a.s. P_x, for any r ≤ s,

   g(X_s, s) = P_{X_s} ∩_{t≥0} { M^s_t ≤ a < b ≤ M^{s+h}_t }
             = P_x^{F_s} ∩_{t≥0} { M^s_t ∘ θ_s ≤ a < b ≤ M^{s+h}_t ∘ θ_s }
             = P_x^{F_s} ∩_{t≥s} { M^0_t ≤ a < b ≤ M^h_t }
             ≥ P_x^{F_s} ∩_{t≥r} { M^0_t ≤ a < b ≤ M^h_t }.

By martingale convergence we get, a.s. as s → ∞ along hQ and then r → ∞,

   lim inf_{s→∞} g(X_s, s) ≥ lim inf_{t→∞} 1{ M^0_t ≤ a < b ≤ M^h_t }
                           ≥ 1{ M^0_∞ < a < b < M^h_∞ },

and so by (7),

   P_x{ g(X_s, s) → 1 } ≥ P_x{ M^0_∞ < a < b < M^h_∞ } > 0.   (9)

Now fix any non-empty, bounded, open set B ⊂ S. Using (6) and the right-continuity of X, we note that lim sup_s 1_B(X_s) = 1 a.s. P_x, and so by (9)

   P_x{ lim sup_{s→∞} 1_B(X_s) g(X_s, s) = 1 } > 0.   (10)

Furthermore, we have by regularity

   p_h(u, v) ∧ p_{2h}(u, v) ≥ ε > 0,    u, v ∈ B,   (11)

for some ε > 0. By (10) we may choose some y ∈ B and s ≥ 0 with g(y, s) > 1 − ½ ε ρB. Define for i = 1, 2

   B_i = B \ { u ∈ S; f(u, s + ih) ≤ a < b ≤ f(u, s + (i + 1)h) }.

Using (11), the definitions of B_i, M^s, and g, and the properties of y and s, we get

   ε ρB_i ≤ P_y{ X_{ih} ∈ B_i }
          ≤ 1 − P_y{ f(X_{ih}, s + ih) ≤ a < b ≤ f(X_{ih}, s + (i + 1)h) }
          = 1 − P_y{ M^s_{ih} ≤ a < b ≤ M^{s+h}_{ih} }
          ≤ 1 − g(y, s) < ½ ε ρB.

Thus, ρB₁ + ρB₂ < ρB, and there exists a u ∈ B \ (B₁ ∪ B₂). But this yields the contradiction a < b ≤ f(u, s + 2h) ≤ a, showing that f(x, t) = f(x) is indeed independent of t.


To see that f(x) is also independent of x, assume that instead

   ρ{ x; f(x) ≤ a } ∧ ρ{ x; f(x) ≥ b } > 0

for some a < b. Then by (6), the martingale M_t = f(X_t) satisfies

   ∫₀^∞ 1{M_t ≤ a} dt = ∫₀^∞ 1{M_t ≥ b} dt = ∞  a.s.   (12)

Writing M̃ for the right-continuous version of M, which exists by Theorem 9.28 (ii), we get by Fubini's theorem, for any x ∈ S,

   sup_{u>0} E_x | ∫₀^u 1{M_t ≤ a} dt − ∫₀^u 1{M̃_t ≤ a} dt |
        ≤ ∫₀^∞ E_x | 1{M_t ≤ a} − 1{M̃_t ≤ a} | dt
        ≤ ∫₀^∞ P_x{ M_t ≠ M̃_t } dt = 0,

˜ t dt = 0, Px Mt = M

˜ t ≥ b. Thus, the integrals on the and similarly for the events Mt ≥ b and M ˜ . In left agree a.s. Px for all x, and so (12) remains true with M replaced by M particular, ˜ t ≤ a < b ≤ lim sup M ˜ t a.s. Px . lim inf M t→∞

t→∞

˜ is a bounded, right-continuous martingale and But this is impossible, since M hence converges a.s. The contradiction shows that f (x) = c a.e. ρ for some constant c ∈ R. Then for any t > 0, f (x) = Ex f (Xt )

=

x ∈ S,

f (y) pt (x, y) ρ(dy) = c,

2

and so f (x) is indeed independent of x.

Our further analysis of regular Feller processes requires some potential theory. For any measurable functions f, h ≥ 0 on S, we define the h -potential Uh f of f by ∞ h Uh f (x) = Ex e−At f (Xt ) dt, x ∈ S, 0

where Ah denotes the elementary additive functional

Aht =

t

h(Xs ) ds,

0

t ≥ 0.

When h is a constant a ≥ 0, we note that Uh = Ua agrees with the resolvent operator Ra of the semigroup (Tt ), in which case

Ua f (x) = Ex

= 0



0



e−at f (Xt ) dt

e−at Tt f (x) dt,

x ∈ S.

The classical resolvent equation extends to general h -potentials, as follows.

26. Ergodic Properties of Markov Processes

601

Lemma 26.13 (resolvent equation) Let f ≥ 0 and h ≥ k ≥ 0 be measurable functions on S, where h is bounded. Then Uh h ≤ 1, and Uk f = Uh f + Uh (h − k) Uk f = Uh f + Uk (h − k) Uh f. Proof: Define F = f (X), H = h(X), and K = k(X) for convenience. By Itˆo’s formula for continuous functions of bounded variation,

e−At = 1 − h

t

0

e−As Hs ds, h

t ≥ 0,

(13)

which implies Uh h ≤ 1. We also see from the Markov property of X that a.s.

Uk f (Xt ) = EXt ExFt

=

= ExFt



0∞

0



t

e−As Fs ds k

e−As ◦θt Fs+t ds k

e−Au +At Fu du. k

k

(14)

Using (13) and (14) together with Fubini’s theorem, we get Uh (h − k) Uk f (x) = Ex = Ex

0



0



= Ex = Ex

∞ ∞ ∞

e−At (Ht − Kt ) dt h

e−Au Fu du k





u 0





t

e−Au +At Fu du k

k

e−At +At (Ht − Kt ) dt h

k



e−Au Fu 1 − e−Au +Au du k

0∞  0

h

k



e−Au − e−Au Fu du k

h

= Uk f (x) − Uh f (x). A similar calculation yields the same expression for Uk (h − k) Uh f (x).

2

For a simple application of the resolvent equation, we show that any bounded potential function Uh f is continuous. Lemma 26.14 (boundedness and continuity) For a regular Feller process on S and any bounded, measurable functions f, h ≥ 0 on S, we have ⇒

Uh f is bounded

Uh f is continuous.

Proof: Using Fatou’s lemma and the continuity of pt (·, y), we get for any time t > 0 and sequence xn → x in S

inf lim inf Tt f (xn ) = lim n→∞ n→∞ ≥



pt (xn , y)f (y) ρ(dy)

pt (x, y)f (y) ρ(dy)

= Tt f (x).


If f ≤ c, the same relation applies to the function Tt (c − f ) = c − Tt f , and so by combination Tt f is continuous. By dominated convergence, Ua f is then continuous for every a > 0. Now let h ≤ a. Applying the previous result to the bounded, measurable function (a − h) Uh f ≥ 0, we see that even Ua (a − h) Uh f is continuous. The continuity of Uh f now follows from Lemma 26.13, with h and k replaced by a and h. 2 We proceed with some useful estimates. Lemma 26.15 (lower bounds) For a regular Feller process on S, there exist some continuous functions h, k : S → (0, 1], such that for any measurable function f ≥ 0 on S, (i) U2 f (x) ≥ ρ(kf ) h(x), (ii) Uh f (x) ≥ ρ(kf ) Uh h(x). Proof: Fix any compact sets K ⊂ S and T ⊂ (0, ∞) with ρK > 0 and λT > 0. Define uTa (x, y) = e−at pt (x, y) dt, x, y ∈ S, T

and note that for any measurable function f ≥ 0 on S, Ua f (x) ≥



uTa (x, y) f (y) ρ(dy),

x ∈ S.

Applying Lemma 26.13 to the constant functions 4 and 2 gives U2 f (x) = U4 f (x) + 2 U4 U2 f (x) ≥2





K

uT4 (x, y) ρ(dy)

and (i) follows with



h(x) = 2 K

uT2 (y, z) f (z) ρ(dz),

uT4 (x, y) ρ(dy) ∧ 1,

k(x) = inf uT2 (y, x) ∧ 1. y∈K

To deduce (ii), we may combine (i) with Lemma 26.13 for the functions 2 and h, to obtain Uh f (x) ≥ Uh (2 − h) U2 f (x) ≥ Uh U2 f (x) ≥ Uh h(x) ρ(kf ). The continuity of h is clear by dominated convergence. For the same reason, the function uT2 is jointly continuous on S 2 . Since K is compact, the functions uT2 (y, ·), y ∈ K, are then equi-continuous on S, which yields the required continuity of k. Finally, the relation h > 0 is obvious, whereas k > 0 holds by the

26. Ergodic Properties of Markov Processes

603 2

compactness of K. Fixing a function h as in Lemma 26.15, we introduce the kernel Qx B = Uh (h1B )(x),

x ∈ S, B ∈ S,

(15)

and note that Q_x S = U_h h(x) ≤ 1 by Lemma 26.13.

Lemma 26.16 (convergence dichotomy) For h as in Lemma 26.15, define Q by (15). Then
(i) when U_h h ≢ 1, there exists an r ∈ (0, 1) with Qⁿ S ≤ r^{n−1},

n ∈ N,

(ii) when Uh h ≡ 1, there exists a Q-invariant distribution ν ∼ ρ with    n  Qx − ν  → 0,

x ∈ S,

and every σ-finite, Q-invariant measure on S is proportional to ν. Proof: (i) Choose k as in Lemma 26.15, fix any a ∈ S with Uh h(a) < 1,

and define r = 1 − h(a) ρ hk (1 − Uh h) . Note that ρ{hk (1−Uh h)} > 0, since Uh h is continuous by Lemma 26.14. Using Lemma 26.15 (i), we obtain 0 < 1−r ≤ h(a) ρk ≤ U2 1(a) = 12 . Next, Lemma 26.15 (ii) yields





(1 − r) Uh h = h(a) ρ hk (1 − Uh h) Uh h ≤ Uh h(1 − Uh h). Hence,

Q2 S = Uh h Uh h ≤ r Uh h = r QS,

and so by iteration Qn S ≤ rn−1 QS ≤ rn−1 , n ∈ N. (ii) Introduce a measure ρ˜ = hk · ρ on S. Since Uh h = 1, Lemma 26.15 (ii) yields ρ˜B = ρ(hk 1B ) ≤ Uh (h 1B )(x) (16) = Qx B, B ∈ S. Regarding ρ˜ as a kernel, we have for any x, y ∈ S and m, n ∈ Z+ 







n n Qm ˜k = Qm ˜k = 0. x − Qy ρ x S − Qy S ρ

604

Foundations of Modern Probability

Iterating this and using (16), we get as n → ∞

     n   k n  Qx − Qn+k    = (δ − Q )Q x y y     = (δx − Qky )(Q − ρ˜)n      n ≤  δx − Qky  supz Qz − ρ˜

≤ 2 (1 − ρ˜S)n → 0. Hence, supx Qnx − ν → 0 for some set function ν on S, and we note that ν is a Q -invariant probability measure. By Fubini’s theorem, we have Qx  ρ for all x, and so ν = νQ  ρ. Conversely, Lemma 26.15 (ii) yields νB = νQ(B)

= ν Uh (h1B ) ≥ ρ(hk1B ), which shows that even ρ  ν. Now consider any σ-finite, Q -invariant measure μ on S. By Fatou’s lemma, we get for any B ∈ S μB = lim inf μQn B n→∞ ≥ μνB = μS νB. Choosing B with μB < ∞ and νB > 0, we obtain μS < ∞. By dominated convergence, we conclude that μ = μQn → μν = μS ν, which proves the asserted uniqueness of ν. 2 We may now establish the basic recurrence dichotomy for regular Feller processes. Write U = U0 , and say that X is uniformly transient if U 1K = supx Ex





0

1K (Xt ) dt < ∞,

K ⊂ S compact.

Theorem 26.17 (recurrence dichotomy) A regular Feller process is either Harris recurrent or uniformly transient.

Proof: Choose h, k as in Lemma 26.15 and Q, r, ν as in Lemma 26.16. First assume U_h h ≢ 1. Letting a ∈ (0, ‖h‖], we note that ah ≤ (h ∧ a)‖h‖, and hence

   a U_{h∧a} h ≤ U_{h∧a}(h ∧ a) ‖h‖ ≤ ‖h‖.   (17)

Furthermore, Lemma 26.13 yields

   U_{h∧a} h ≤ U_h h + U_h( h U_{h∧a} h ) = Q( 1 + U_{h∧a} h ).

Iterating this relation and using Lemma 26.16 (i) and (17), we get

Σ Qk 1 + QnUh∧ah     ≤ Σ rk−1 + rn−1  Uh∧a h  k≤n

Uh∧a h ≤

k≤n

≤ (1 − r)−1 + rn−1 h /a.

26. Ergodic Properties of Markov Processes

605

Letting n → ∞ and then a → 0, we conclude by dominated and monotone convergence that U h ≤ (1 − r)−1 . Now fix any compact set K ⊂ S. Since b ≡ inf K h > 0, we get for all x ∈ S U 1K (x) ≤ b−1 U h(x) ≤ b−1 (1 − r)−1 < ∞, which shows that X is uniformly transient. Now assume that instead Uh h ≡ 1. Fix any measurable function f on S with 0 ≤ f ≤ h and ρf > 0, and put g = 1 − Uf f . By Lemma 26.13, we get g= = = = ≤

1 − Uf f Uh h − Uh f − Uh (h − f )Uf f Uh (h − f )(1 − Uf f ) Uh (h − f )g Uh hg = Qg.

(18)

Iterating this relation and using Lemma 26.16 (ii), we obtain g ≤ Qn g → νg, where ν ∼ ρ is the unique Q -invariant distribution on S. Inserting this into (18) gives g ≤ Uh (h − f )νg, and so by Lemma 26.15 (ii),



νg ≤ ν Uh (h − f ) νg



≤ 1 − ρ(kf ) νg. Since ρ(kf ) > 0, we obtain νg = 0, and so Uf f = 1 − g = 1 a.e. ν ∼ ρ. Since Uf f is continuous by Lemma 26.14 and supp ρ = S, we obtain Uf f ≡ 1. Taking expected values in (13), we conclude that Af∞ = ∞ a.s. Px for every x ∈ S. Now fix any compact set K ⊂ S with ρK > 0. Since b ≡ inf K h > 0, 2 we may choose f = b 1K , and the desired Harris recurrence follows. A measure λ on S is said to be invariant for the semi-group (Tt ), if λ(Tt f ) = λf for all t > 0 and every measurable function f ≥ 0 on S. In the Harris recurrent case, the existence of an invariant measure λ can be inferred from Lemma 26.16. Theorem 26.18 (invariant measure, Harris, Watanabe) For a Harris recurrent Feller process on S with supporting measure ρ, (i) there exists a locally finite, invariant measure λ ∼ ρ, (ii) every σ-finite, invariant measure agrees with λ up to a normalization. To prepare for the proof, we first express the required invariance in terms of the resolvent operators. Lemma 26.19 (invariance equivalence) Let (Tt ) be a Feller semi-group on S with resolvent (Ua ), and fix a locally finite measure λ on S and a constant c > 0. Then these conditions are equivalent:

606

Foundations of Modern Probability

(i) λ is Tt -invariant for every t > 0, (ii) λ is a Ua -invariant for every a ≥ c. Proof, (i) ⇒ (ii): Assuming (i), we get by Fubini’s theorem for any measurable function f ≥ 0 and constant a > 0

λ(Ua f ) =

0



=

∞ ∞

e−at λ(Tt f ) dt e−at λf dt

0 −1

(19)

= a λf.

(ii) ⇒ (i): Assume (ii). Then for any measurable function f ≥ 0 on S with λf < ∞, the integrals in (19) agree for all a ≥ c. Hence, by Theorem 6.3 the measures λ(Tt f ) e−ct dt and λf e−ct dt agree on R+ , which implies λ(Tt f ) = λf for almost every t ≥ 0. By the semi-group property and Fubini’s theorem, we then obtain for any t ≥ 0 λ(Tt f ) = c λUc (Tt f )

= cλ

0 ∞

=c

0

=c





e−cs Ts Tt f ds

e−cs λ(Ts+t f ) ds e−cs λf ds = λf.

2

0

Proof of Theorem 26.18: (i) Let h, Q, ν be such as in Lemmas 26.15 and 26.16, and put λ = h−1 · ν. Using the definition of λ (twice), the Q -invariance of ν (three times), and Lemma 26.13, we get for any constant a ≥ h and bounded, measurable function f ≥ 0 on S a λUa f = = = = =

a ν(h−1 Ua f ) a νUh Ua f  ν Uh f − Ua f + Uh hUa f νUh f ν(h−1 f ) = λf,

which shows that λ is a Ua -invariant for every such a. By Lemma 26.19 it follows that λ is also (Tt ) -invariant. (ii) Consider any σ-finite, (Tt ) -invariant measure λ on S. By Lemma 26.19, λ is even a Ua -invariant for every a ≥ h . Now define ν  = h · λ . Letting f ≥ 0 be bounded and measurable on S and using Lemma 26.13, we get as before ν  Uh (hf ) = λ (hUh (hf )) = a λ Ua hUh (hf ) 



= a λ Ua (hf ) − Uh (hf ) + a Ua Uh (hf ) = a λ Ua (hf ) = λ (hf ) = ν  f,



26. Ergodic Properties of Markov Processes

607

which shows that ν  is Q -invariant. Hence, the uniqueness in Lemma 26.16 2 (ii) yields ν  = c ν for some constant c ≥ 0, which implies λ = c λ. A Harris recurrent Feller process is said to be positive-recurrent if the invariant measure λ is bounded, and null-recurrent otherwise. In the former case, we may choose λ to be a probability measure on S. For any process X P in S, the divergence Xt → ∞ a.s. or Xt → ∞ means that 1K (Xt ) → 0 in the same sense for every compact set K ⊂ S. Theorem 26.20 (distributional limits) Let X be a regular Feller process in S. Then for any distribution μ on S, we have as t → ∞ (i) when X is positive-recurrent with invariant distribution λ,    A  Pμ ◦ θt−1 − Pλ  → 0,

A ∈ F∞ with Pμ A > 0,

(ii) when X is null-recurrent or transient, Pμ

Xt −→ ∞. Proof: (i) Since Pλ ◦ θt−1 = Pλ by Lemma 11.11, the assertion follows from Theorem 26.12, together with properties (ii) and (iv) of Theorem 26.10. (ii) (null-recurrent case) For any compact set K ⊂ S and constant ε > 0, define

Bt = x ∈ S; Tt 1K (x) ≥ μTt 1K − ε , t > 0, and note that for any invariant measure λ, 



μTt 1K − ε λBt ≤ λ(Tt 1K ) = λK < ∞.

(20)

Since μTt 1K −Tt 1K (x) → 0 for all x ∈ S by Theorem 26.12, we have lim inf t Bt = S, and so λBt → ∞ by Fatou’s lemma. Hence, (20) yields lim supt μTt 1K ≤ ε, and ε being arbitrary, we obtain Pμ {Xt ∈ K} = μTt 1K → 0. (ii) (transient case) Fix any compact set K ⊂ S with ρK > 0, and conclude from the uniform transience of X that U 1K is bounded. Hence, by the Markov property at t and dominated convergence,

Eμ U 1K (Xt ) = Eμ EXt

= Eμ

t



∞ 0

1K (Xs ) ds

1K (Xs ) ds → 0,



which shows that U 1K (Xt ) −→ 0. Since U 1K is strictly positive and also conPμ

tinuous by Lemma 26.14, we conclude that Xt −→ ∞.

2

We conclude with a pathwise limit theorem for regular Feller processes. Here ‘almost surely’ means a.s. Pμ for every initial distribution μ on S.

608

Foundations of Modern Probability

Theorem 26.21 (pathwise limits) For a regular Feller process X in S, we have as t → ∞ (i) when X is positive-recurrent with invariant distribution λ, t−1



0

t

f (θs X) ds → Eλ f (X) a.s.,

f bounded, measurable,

(ii) when X is null-recurrent, t−1



t 0

1K (Xs ) ds → 0 a.s.,

(iii) when X is transient,

K ⊂ S compact,

Xt → ∞ a.s.

Proof: (i) From Lemma 11.11 and Theorems 26.10 (i) and 26.12, we note that Pλ is stationary and ergodic, and so the assertion holds a.s. Pλ by Corollary 25.9. Since the stated convergence is a tail event and Pμ = Pλ on T for any μ, the general result follows. (ii) Since Pλ is shift-invariant with Pλ {Xs ∈ K} = λK < ∞, the left-hand side converges a.e. Pλ by Theorem 26.2. By Theorems 26.10 and 26.12, the limit is a.e. a constant c ≥ 0. Using Fatou’s lemma and Fubini’s theorem, we get t Eλ c ≤ lim inf t−1 Pλ {Xs ∈ K} ds t→∞

0

= λK < ∞,

which implies c = 0 since Pλ = λ = ∞. The general result follows from the fact that Pμ = Pν on T for any distributions μ and ν. (iii) Fix any compact set K ⊂ S with ρK > 0, and conclude from the Markov property at t > 0 that a.s. Pμ ,

U 1K (Xt ) = EXt = EμFt



0∞ t

1K (Xr ) dr 1K (Xr ) dr.

Using the tower property of conditional expectations, we get for any s < t

 



Eμ U 1K (Xt )  Fs = EμFs ≤ EμFs





t ∞ s

1K (Xr ) dr 1K (Xr ) dr

= U 1K (Xs ), which shows that U 1K (Xt ) is a super-martingale. Since it is also non-negative and right-continuous, it converges a.s. Pμ as t → ∞, and the limit equals 0 Pμ

a.s., since U 1K (Xt ) −→ 0 by the previous proof. Since U 1K is strictly positive 2 and continuous, it follows that Xt → ∞ a.s. Pμ .

26. Ergodic Properties of Markov Processes

609

Exercises 1. For a measure space (S, S, μ), let T be a positive, linear operator on L1 ∩ L∞ . Show that if T is both an L1 -contraction and an L∞ -contraction, it is also an Lp contraction for every p ∈ [1, ∞]. (Hint: Prove a H¨older-type inequality for T .) 2. Extend Lemma 25.3 to any transition operators T on a measurable space (S, S). Thus, letting I be the class of sets B ∈ S with T 1B = 1B , show that an S-measurable function f ≥ 0 is T -invariant iff it is I-measurable. 3. Prove a continuous-time version of Theorem 26.2, for measurable semi-groups of positive L1 −L∞ -contractions. (Hint: Interpolate in the discrete-time result.) 4. Let (Tt ) be a measurable, discrete- or continuous-time semi-group of positive L1 −L∞ -contractions on (S, S, ν), let μ1, μ2 , . . . be asymptotically invariant distriν butions on Z+ or R+ , and define An = Tt μn (dt). Show that An f → Af for any ν 1 f ∈ L (λ), where → denotes convergence in measure. (Hint: Proceed as in Theorem 26.2, using the contractivity together with Minkowski’s and Chebyshev’s inequalities to estimate the remainder terms.) 5. Prove a continuous-time version of Theorem 26.4. (Hint: Use Lemma 26.5 to interpolate in the discrete-time result.) 6. Derive Theorem 25.6 from Theorem 26.4. (Hint: Take g ≡ 1, and proceed as in Corollary 25.9 to identify the limit.) 7. Show that when f ≥ 0, the limit in Theorem 26.4 is strictly positive on the set {S∞ f ∧ S∞ g > 0}. 8. Show that the limit in Theorem 26.4 is invariant, at least when T is induced by a measure-preserving map on S. 9. Derive Lemma 26.3 (i) from Lemma 26.6. (Hint: Note that if g ∈ T (f ) with f ∈ L1+ , thenμg ≤ μf . Conclude that for any h ∈ L1 , μ{h; Mn h > 0} ≥  μ U n−1 h; Mn h > 0 ≥ 0.) 10. Show that Brownian motion X in Rd is regular and strongly ergodic for every d ∈ N, with an invariant measure that is unique up to a constant factor. Also show that X is Harris recurrent for d ≤ 2, uniformly transient for d ≥ 3. ˜ Show that 11. Let X be a Markov process with associated space-time process X. ˜ is weakly ergodic in the X is strongly ergodic in the sense of Theorem 26.10 iff X sense of Theorem 26.11. (Hint: Note that a function is space-time invariant for X ˜ iff it is invariant for X.) 12. For a Harris recurrent process on R+ or Z+ , every tail event is clearly a.s. invariant. Show by an example that the statement may fail in the transient case. 13. State and prove discrete-time versions of Theorems 26.12, 26.17, and 26.18. (Hint: The continuous-time arguments apply with obvious changes.) 14. Derive discrete-time versions of Theorems 26.17 and 26.18 from the corresponding continuous-time results. 15. Give an example of a regular Markov process that is weakly but not strongly ergodic. (Hint: For any strongly ergodic process, the associated space-time process has the stated property. For a less trivial example, consider a suitable super-critical branching process.)

610

Foundations of Modern Probability

16. Give examples of non-regular Markov processes with no invariant measure, with exactly one (up to a normalization), and with more than one. 17. Show that a discrete-time Markov process X and the corresponding pseudoPoisson process Y have the same invariant measures. Further show that if X is regular then so is Y , but not conversely.

Chapter 27

Symmetric Distributions and Predictable Maps Asymptotically invariant sampling, exchangeable and contractable, mixed and conditionally i.i.d. sequences, coding representation and equivalence criteria, urn sequences and factorial measures, strong stationarity, prediction sequence, predictable sampling, optional skipping, sojourns and maxima, exchangeable and mixed L´evy processes, compensated jumps, sampling processes, hyper-contraction and tightness, rotatable sequences and processes, predictable mapping

Classical results in probability typically require both an independence and an invariance condition1 . It is then quite remarkable that, under special circumstances, the independence follows from the invariance assumption alone. We have already seen some notable instances of this phenomenon, such as in Theorems 11.5 and 11.14 for Markov processes and in Theorems 10.27 and 19.8 for martingales. Here and in the next chapter, we will study some broad classes of random phenomena, where a symmetry or invariance condition alone yields some very precise structural information, typically involving properties of independence. For a striking example, recall that an infinite sequence of random variables d ξ = (ξ1 , ξ2 , . . .) is said to be stationary if θn ξ = ξ for every n ∈ Z+ , where θn ∞ denotes the n -step shift operator on R . Replacing n by an arbitrary optional d time τ in Z+ yields the notion of strong stationarity, where θτ ξ = ξ. We will show that ξ is strongly stationary iff it is conditionally i.i.d., given a suitable σ-field I. This statement is essentially equivalent to de Finetti’s theorem—the fact that ξ is exchangeable, defined as invariance in distribution under finite permutations, iff it is mixed i.i.d. Thus, apart from a randomization, a simple symmetry condition leads us back to the elementary setting of i.i.d. sequences. This is by modern standards a quite elementary result that can be proved in just a few lines. The same argument yields the remarkable extension by RyllNardzewski—the fact that the mere contractability 2 , where all sub-sequences are assumed to have the same distribution, suffices for ξ to be mixed i.i.d. Further highlights of the theory include the predictable sampling theorem, extending of the exchangeablity property of a finite or infinite sequence to arbitrary 1 2

We may think of i.i.d., such as in the CLT, LLN, and LIL. short for contraction invariance in distribution

© Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1_28

611

612

Foundations of Modern Probability

predictable permutations. In Chapters 14, 16, and 22, we saw how the latter result yields short proofs of the classical arcsine laws. All the mentioned results have continuous-time counterparts. In particular, a process on R+ with exchangeable or contractable increments is a mixture of L´evy processes. The more subtle representation of exchangeable processes on [0, 1] will be derived from some asymptotic properties for sequences obtained by sampling without replacement from a finite population. Similarly, the predictable sampling theorem turns into the quite subtle predictable mapping theorem—the invariance in distribution of an exchangeable process on R+ or [0, 1] under predictable and pathwise measure-preserving transformations. Here some predictable mapping properties from Chapter 19, involving ordinary or discounted compensators, will be helpful for the proof. The material in this chapter is related in many ways to other parts of the book. In particular, we note some links to various applications and extensions in Chapters 14–16, for results involving exchangeable sequences or processes. The predictable sampling theorem is further related to results for random time change in Chapters 15–16 and 19. To motivate our first main topic, we consider a simple limit theorem for multi-variate sampling from a stationary process. Here we consider a measurable process X on an index set T , taking values in a space S, and let τ = (τ1 , τ2 , . . .) be an independent sequence of random elements in T with joint distribution μ. Define the associated sampling sequence ξ = X ◦ τ in S ∞ by ξ = (ξ1 , ξ2 , . . .) = (Xτ1 , Xτ2 , . . .) henceforth referred to as a μ -sample from X. The sampling distributions μ1 , μ2 , . . . on T ∞ are said to be asymptotically invariant, if their projections onto T k are asymptotically invariant for every k ∈ N. Recall that IX denotes the invariant σ-field of X, and note that the conditional distribution η = L(X0 | IX ) exists by Theorem 8.5 when S is Borel. Lemma 27.1 (asymptotically invariant sampling) Let X be a stationary, measurable process on T = R or Z with values in a Polish space S. For n ∈ N let ξn be a μn -sample from X, where μ1 , μ2 , . . . are asymptotically invariant distributions on T ∞ . Then ξn → ξ in S ∞ , d

L(ξ) = E η ∞ with η = L(X0 | IX ).

Proof: Write ξ = (ξ k ) and ξn = (ξnk ). Fix any asymptotically invariant distributions ν1 , ν2 , . . . on T , and let f1 , . . . , fm be measurable functions on S bounded by ±1. Proceeding as in the proof of Corollary 25.18 (i), we get   E

 

Π k fk(ξnk ) − E Π k fk (ξ k )   ≤ E  μn ⊗ k fk (X) − Π k ηfk   

 

≤  μn − μn ∗ νrm  + ≤

 

 

E  (νrm ∗ δt ) ⊗ k fk (X) − Π k ηfk  μn (dt)

     μn − μn ∗ δt  νrm (dt) +

 

 

Σ k suptE (νr ∗ δt)fk (X) − ηfk .

27. Symmetric Distributions and Predictable Maps

613

Using the asymptotic invariance of μn and νr , together with Corollary 25.18 (i) and dominated convergence, we see that the right-hand side tends to 0 as n → ∞ and then r → ∞. The assertion now follows by Theorem 5.30. 2 The last result leads immediately to a version of de Finetti’s theorem. For a precise statement, consider a finite or infinite random sequence ξ = (ξ1 , ξ2 , . . .) with index set I, and say that ξ is exchangeable if d

(ξk1 , ξk2 , . . .) = (ξ1 , ξ2 , . . .),

(1)

for any finite3 permutation (k1 , k2 , . . .) of I. For infinite sequences ξ, we also consider the seemingly weaker property of contractability 4 , where I = N and (1) is only required for increasing sequences k1 < k2 < · · · . Then ξ is stationary, and every sub-sequence has the same distribution as ξ. By Lemma 27.1 we obtain L(ξ) = E η ∞ with η = L(ξ1 | Iξ ), here called the directing random measure of ξ. We may also give a stronger conditional statement. Recall that, for any random measure η on a measurable space (S, S), the associated σ-field is generated by the random variables ηB for arbitrary B ∈ S. Theorem 27.2 (exchangeable sequences, de Finetti, Ryll-Nardzewski) For an infinite random sequence ξ in a Borel space S, these conditions are equivalent: (i) ξ is exchangeable, (ii) ξ is contractable, (iii) ξ is mixed i.i.d., (iv) L(ξ | η) = η ∞ a.s. for a random distribution η on S. Here η is a.s. unique and equal to L(ξ1 | Iξ ). Since η ∞ is a.s. the distribution of an i.i.d. sequence in S based on the measure η, (iv) means that ξ is conditionally i.i.d. η. Taking expectations of both sides in (iv), we obtain the seemingly weaker condition L(ξ) = E η ∞ , stating that ξ is mixed i.i.d. The latter condition implies that ξ is exchangeable, and so the two versions of (iii) are indeed equivalent. First proof of Theorem 27.2 (OK): Since S is Borel, we may take S = [0, 1]. Write μn for the uniform distribution on the product set Πk {(k − 1)n + 1, . . . , kn}. Assuming (ii), we see from Lemma 27.1 that L(ξ) = E η ∞ . More generally, we may extend (1) to

1A (ξ), ξk1 , ξk2 , . . .



d





= 1A (ξ), ξ1 , ξ2 , . . . ,

k1 < k2 < · · · ,



for any invariant set A ∈ S . Applying Lemma 27.1 to the sequence of pairs {ξk , 1A (ξ)}, we get as before L(ξ; ξ ∈ A) = E(η ∞ ; ξ ∈ A), and since η is Iξ -measurable, it follows that L(ξ | η) = η ∞ a.s. 3

one affecting only finitely many elements short for contraction invariance in distribution, also called sub-sequence invariance, spreading invariance, or spreadability 4

614

Foundations of Modern Probability

To see that η is a.s. unique, we may use the law of large numbers and Theorem 8.5 to obtain n−1 Σ 1B (ξk ) → ηB a.s., k≤n

B ∈ S.

2
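For a concrete example of the theorem, a Pólya urn sequence is exchangeable, and its directing random measure is a Bernoulli law with a Beta-distributed parameter. The following simulation sketch is an illustration only, not part of the text; it starts from one ball of each colour and checks that the limiting frequency of the first colour is approximately uniform on (0, 1).

```python
# Polya urn started from (1 red, 1 blue) with unit reinforcement: the empirical
# frequency of red converges a.s. to a Beta(1,1) = U(0,1) variable, which
# identifies the directing random measure as Bernoulli(theta), theta ~ U(0,1).
import numpy as np

rng = np.random.default_rng(1)

def polya_limit(n=2_000):
    red, blue = 1, 1
    total = 0
    for _ in range(n):
        draw = rng.random() < red / (red + blue)
        red += int(draw)
        blue += 1 - int(draw)
        total += int(draw)
    return total / n                      # approximates the limiting frequency

limits = np.array([polya_limit() for _ in range(1_000)])
print(np.quantile(limits, [0.1, 0.5, 0.9]))   # close to 0.1, 0.5, 0.9
```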

We can also prove de Finetti’s theorem by a simple martingale argument: Second proof of Theorem 27.2 (Aldous): If ξ is contractable, then d

(ξm , θm ξ) = (ξk , θm ξ) d = (ξk , θn ξ),

k ≤ m ≤ n.



Now write Tξ = n σ(θn ξ). Using Lemma 8.10 followed by Theorem 9.24 as n → ∞, we get a.s. for a fixed set B ∈ S L(ξm | θm ξ) = L(ξk | θm ξ) = L(ξk | θn ξ) → L(ξk | Tξ ), which shows that the extreme members agree a.s. In particular, L(ξm | θm ξ) = L(ξm | Tξ ) = L(ξ1 | Tξ ) a.s. ⊥Tξ θm ξ for all m ∈ N, and so by iteration Here the first relation gives ξm ⊥ ξ1 , ξ2 , . . . are conditionally independent given Tξ . The second relation shows that the conditional distributions agree a.s. This proves that L(ξ | Tξ ) = ν ∞ with ν = L(ξ1 | Tξ ). 2 The nature of de Finetti’s criterion is further clarified by the following functional representation, here included to prepare for the much more subtle higher-dimensional versions in Chapter 28. Corollary 27.3 (coding representation) Let ξ = (ξj ) be an infinite random sequence in a Borel space S. Then ξ is exchangeable iff j ∈ N,

ξj = f (α, ζj ),

for a measurable function f : [0, 1]2 → S and some i.i.d. U (0, 1) random variables α, ζ1 , ζ2 , . . . . The directing random measure ν of ξ is then given by νB =





1B ◦ f (α, x) dx,

B ∈ S.

Proof: Let ξ be conditionally i.i.d. with directing random measure ν. For any i.i.d. U (0, 1) random variables α ˜ and ζ˜1 , ζ˜2 , . . . , a repeated use of Theorem 8.17 yields some measurable functions g, h, such that d

ν˜ ≡ g(˜ α) = ν,

d (ξ˜1 , ν˜) ≡ h(˜ ν , ζ˜1 ), ν˜ = (ξ1 , ν).

(2)

27. Symmetric Distributions and Predictable Maps

615

Writing f (s, x) = h{g(s), x} gives ˜ ξ˜1 = h(˜ ν , ζ1 )

α), ζ˜1 = f g(˜ = f (˜ α, ζ˜1 ), and we may define more generally ξ˜j = f (˜ α, ζ˜j ) = h(˜ ν , ζ˜j ),

j ∈ N.

The last expression shows that the ξ˜j are conditionally independent given ν˜, and by (2) their common conditional distribution equals ν˜. Hence, the sequence d ˜ = ξ˜ = (ξ˜j ) satisfies (˜ ν , ξ) (ν, ξ), and Corollary 8.18 yields a.s. ν = g(α),

ξj = h(ν, ζj ),

j ∈ N,

(3)

for some i.i.d. U (0, 1) variables α and ζ1 , ζ2 , . . . . To identify ν, let B ∈ S be arbitrary, and use (3), Fubini’s theorem, and the definition of ν to get





νB = P ξj ∈ B  ν





= P h(ν, ζj ) ∈ B  ν = =





1B ◦ h(ν, x) dx

1B ◦ f (α, x) dx.

2

The representing function f in Corollary 27.3 is far from unique. Here we state some basic equivalence criteria, motivating the two-dimensional versions in Theorem 28.3. The present proof is the same, apart from some obvious simplifications. To simplify the writing, put I = [0, 1]. Proposition 27.4 (equivalent coding functions, Hoover, OK) Fix any measurable functions f, g : I 2 → S, and let α, ξ1 , ξ2 , . . . and α , ξ1 , ξ2 , . . . be i.i.d. U (0, 1). Then these conditions are equivalent 5 : d (i) we may choose (˜ α, ξ˜1 , ξ˜2 , . . .) = (α, ξ1 , ξ2 , . . .) with f (α, ξi ) = g(˜ α, ξ˜i ) a.s., i ≥ 1, (ii)





d





f (α, ξi ); i ≥ 1 = g(α, ξi ); i ≥ 1 ,

(iii) there exist some measurable functions T, T  : I → I and U, U  : I 2 → I, preserving λ in the highest order arguments, such that







f T (α), U (α, ξi ) = g T  (α), U  (α, ξi ) a.s., i ≥ 1, 5 This result is often misunderstood: it is not enough in (iii) to consider λ -preserving transformations U, U  of a single variable, nor is (iv) true without the extra randomization variables α , ξi .

616

Foundations of Modern Probability

(iv) there exist some measurable functions T : I 2 → I and U : I 4 → I, mapping λ2 into λ in the highest order arguments, such that



f (α, ξi ) = g T (α, α ), U (α, α , ξi , ξi ) a.s., i ≥ 1. Theorem 27.2 fails for finite sequences. Here we need instead to replace the inherent i.i.d. sequences by suitable urn sequences, generated by successive drawing without replacement from a finite set. To make this precise, fix a measurable space S, and consider a measure of the form μ = Σ k≤n δsk with s1 , . . . , sn ∈ S. As in Lemma 2.20, we introduce on S n the associated factorial measure μ(n) = Σ δs◦p , p ∈ Pn

where Pn is the set of permutations p = (p1 , . . . , pn ) of 1, . . . , n, and s ◦ p = (sp1 , . . . , spn ). Note that μ(n) is independent of the order of s1 , . . . , sn and depends measurably on μ. Lemma 27.5 (finite exchangeable sequences) Let ξ = (ξ1 , . . . , ξn ) be a finite random sequence in S, and put η = Σk δξk . Then ξ is exchangeable iff L(ξ | η) =

η (n) a.s. n!

Proof: Since η is invariant under permutations of ξ1 , . . . , ξn , we have (ξ ◦ d p, η) = (ξ, η) for any p ∈ Pn . Now introduce an exchangeable permutation π⊥⊥ξ of 1, . . . , n. Using Fubini’s theorem twice, we get for any measurable sets A and B in appropriate spaces







P ξ ∈ B, η ∈ A = P ξ ◦ π ∈ B, η ∈ A 

 



= E P ξ ◦ π ∈ B  ξ ; η ∈ A 



= E η (n) B/n! ; η ∈ A .

2

Just as for martingales and Markov processes, we may relate the defining properties to a filtration F = (Fn ). Thus, a finite or infinite sequence of random elements ξ = (ξ1 , ξ2 , . . .) is said to be F-exchangeable, if it is F-adapted and such that, for every n ≥ 0, the shifted sequence θn ξ = (ξn+1 , ξn+2 , . . .) is conditionally exchangeable given Fn . For infinite sequences ξ, the definition of F-contractability is similar.6 When F is the filtration induced by ξ, the stated properties reduce to the unqualified versions considered earlier. An infinite, adapted sequence ξ is said to be strongly stationary or Fd stationary if θτ ξ = ξ for every optional time τ < ∞. By the prediction sequence of ξ we mean the set of conditional distributions πn = L( θn ξ | Fn ),

n ∈ Z+ .

(4)

6 Since both definitions can be stated without reference to regular conditional distributions, no restrictions are needed on S.

27. Symmetric Distributions and Predictable Maps

617

The random probability measures π0 , π1 , . . . on S are said to form a measurevalued martingale, if (πn B) is a real-valued martingale for every measurable set B ⊂ S. We show that strong stationarity is equivalent to exchangeability,7 and exhibit an interesting martingale connection. Corollary 27.6 (strong stationarity) Let ξ be an infinite, F-adapted random sequence with prediction sequence π, taking values in a Borel space S. Then these conditions are equivalent: (i) (ii) (iii) (iv)

ξ is F-exchangeable, ξ is F-contractable, ξ is F-stationary, π is a measure-valued F-martingale.

Proof: Conditions (i) and (ii) are equivalent by Theorem 27.2. Assuming (ii), we get a.s. for any B ∈ S ∞ and n ∈ Z+  





 



E πn+1 B  Fn = P θn+1 ξ ∈ B  Fn  







= P θn ξ ∈ B  Fn = πn B,

(5)

which proves (iv). Conversely, (ii) follows by iteration from the second equality in (5), and so (ii) and (iv) are equivalent. Next we note that (4) extends by Lemma 8.3 to  



πτ B = P θτ ξ ∈ B  Fτ



a.s.,

B ∈ S ∞,

for any finite optional time τ . By Lemma 9.14, condition (iv) is then equivalent to 



P θτ ξ ∈ B = Eπτ B = Eπ0 B = P {ξ ∈ B},

B ∈ S ∞, 2

which is in turn equivalent to (iii).

The exchangeability property extends to a broad class of random permutations. Say that an integer-valued random variable τ is predictable with respect to a given filtration F, if the shifted time (τ − 1)+ is F-optional. Theorem 27.7 (predictable sampling) Let ξ = (ξ1 , ξ2 , . . .) be a finite or infinite, F-exchangeable random sequence indexed by I. Then for any a.s. distinct, F-predictable times τ1 , . . . , τn in I, we have d

(ξτ1 , . . . , ξτn ) = (ξ1 , . . . , ξn ).

(6)

7 This remains true for finite sequences ξ = (ξ1 , . . . , ξn ), if we define θk ξ = (ξk+1 , . . . , ξn , ξk , . . . , ξ1 ).

618

Foundations of Modern Probability

Note that we do not require ξ to be infinite or the τk to be increasing. A simple special case is that of optional skipping, where I = N and τ1 < τ2 < · · · . If τk ≡ τ + k for some optional time τ < ∞, then (6) reduces to the strong stationarity in Proposition 27.6. For both applications and proof, it is useful to consider the inverse allocation sequence 



j ∈ I.

αj = inf k; τk = j ,

When αj is finite, it gives the position of j in the permuted sequence (τk ). The times τk are clearly predictable iff the sequence (αj ) is predictable in the sense of Chapters 9–10. Proof of Theorem 27.7: First let ξ be indexed by In = {1, . . . , n}, so that (τ1 , . . . , τn ) and (α1 , . . . , αn ) are mutually inverse random permutations of In . For each m ∈ {0, . . . , n}, put αjm = αj for all j ≤ m, and define recursively 



m αj+1 = min In \ {α1m , . . . , αjm } ,

m ≤ j ≤ n.

Then (α1m , . . . , αnm ) is a predictable, Fm−1 -measurable permutation of In . Since also αjm = αjm−1 = αj whenever j < m, Theorem 8.5 yields for any measurable functions f1 , . . . , fn ≥ 0 on S E

Π j fα

m j

(ξj ) = E E =E



Π j fα

Π fα j m}, and conclude from the previous version of (6) that

k = 1, . . . , n,

27. Symmetric Distributions and Predictable Maps 



619

d

ξτ1m , . . . , ξτnm = (ξ1 , . . . , ξn ).

(8)

As m → ∞ we get τkm → τk , and (6) follows from (8) by dominated convergence. 2 The last result yields a simple proof of yet another striking property of random walks in R, relating the first maximum to the number of positive values. This leads in turn to simple proofs of the arcsine laws in Theorems 14.16 and 22.11. Corollary 27.8 (sojourns and maxima, Sparre-Andersen) Let ξ1 , . . . , ξn be exchangeable random variables, and put Sk = ξ1 + · · · + ξk . Then

Σ 1{Sk > 0} = min k≤n d





k ≥ 0; Sk = max Sj . j≤n

Proof (OK): Put ξ˜k = ξn−k+1 for k = 1, . . . , n, and note that the ξ˜k remain exchangeable for the filtration Fk = σ{Sn , ξ˜1 , . . . , ξ˜k }, k = 0, . . . , n. Write S˜k = ξ˜1 + · · · + ξ˜k , and introduce the predictable permutation k−1

αk = Σ 1{S˜j < Sn } + (n − k + 1)1{S˜k−1 ≥ Sn },

k = 1, . . . , n.

j ∈0

Define ξk = Σj ξ˜j 1{αj = k} for k = 1, . . . , n, and conclude from Theorem 27.7 that (ξk ) = (ξk ). Writing Sk = ξ1 + · · · + ξk , we further note that d



min k ≥ 0; Sk = maxj Sj n−1

=



n

1{Sk > 0}. Σ 1{S˜j < Sn} =kΣ j =0 =1

2
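The identity is easy to test by simulation. The sketch below is an illustration only, for i.i.d. Gaussian increments; it compares the distribution of the number of strictly positive partial sums with that of the first index at which the maximum of S_0, . . . , S_n is attained.

```python
# Empirical check of the Sparre-Andersen identity: for exchangeable increments,
# #{k <= n: S_k > 0}  and  min{k >= 0: S_k = max_{j<=n} S_j}  have the same law.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 20, 100_000
xi = rng.standard_normal((reps, n))
S = np.cumsum(xi, axis=1)
pos = (S > 0).sum(axis=1)                                  # number of k with S_k > 0
S_with_zero = np.concatenate([np.zeros((reps, 1)), S], axis=1)
first_argmax = S_with_zero.argmax(axis=1)                  # first k attaining the maximum
print(np.bincount(pos, minlength=n + 1))
print(np.bincount(first_argmax, minlength=n + 1))          # the two histograms agree up to noise
```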

Turning to the continuous-time case, we say that a process X in a topoP logical space is continuous in probability 8 if Xs → Xt as s → t. An Rd -valued process X on [0, 1] or R+ is said to be exchangeable or contractable, if it is continuous in probability with X0 = 0, and such that the increments Xt − Xs over disjoint intervals (s, t] of equal length form exchangeable or contractable sequences. Finally, we say that X has conditionally stationary, independent increments, given a σ-field I, if the stated property holds conditionally for any finite collection of intervals. The following continuous-time version of Theorem 27.2 characterizes the exchangeable processes on R+ ; the harder finite-interval case is treated in Theorem 27.10. The point process case was already considered separately, by different methods, in Theorem 15.14. uhlmann) Let X be an Proposition 27.9 (exchangeable processes on R+ , B¨ Rd -valued process on R+ , continuous in probability with X0 = 0. Then these conditions are equivalent: 8

also said to be stochastically continuous

620

(i) (ii) (iii) (iv)

Foundations of Modern Probability

X X X X

has exchangeable increments, has contractable increments, is a mixed L´evy process, is conditionally L´evy, given a σ-field I.

Proof: The sufficiency of (iii)−(iv) being obvious, it suffices to show that the conditions are also necessary. Thus, let X be contractable. The increments ξnk over the dyadic intervals Ink = 2−n (k − 1, k] are then contractable for fixed n, and so by Theorem 27.2 they are conditionally i.i.d. ηn for some random probability measure ηn on R. Using Corollary 4.12 and the uniqueness in Theorem 27.2, we obtain ηn∗2

n−m

= ηm a.s.,

(9)

m < n.

Thus, for any m < n, the increments ξmk are conditionally i.i.d. ηm , given ηn . Since the σ-fields σ(ηn ) are a.s. non-decreasing by (9), Theorem 9.24 shows that the ξmk remain conditionally i.i.d. ηm , given I ≡ σ{η0 , η1 , . . .}. Now fix any disjoint intervals I1 , . . . , In of equal length with associated increments ξ1 , . . . , ξn , and approximate by disjoint intervals I1m , . . . , Inm of equal length with dyadic endpoints. For each m, the associated increments ξkm are conditionally i.i.d., given I. Thus, for any bounded, continuous functions f 1 , . . . , fn , (10) E I Π fk (ξkm ) = Π E I fk (ξkm ) = Π E I fk (ξ1m ). k≤n

k≤n

k≤n

P ξkm →

ξk for each k, and so (10) Since X is continuous in probability, we have extends by dominated convergence to the original variables ξk . By suitable approximation and monotone-class arguments, we may finally extend the rela2 tions to any measurable indicator functions fk = 1Bk . We turn to the more difficult case of exchangeable processes on [0, 1]. To avoid some technical complications9 , we will prove the following result only for d = 1. Theorem 27.10 (exchangeable processes on [0, 1]) Let X be an Rd -valued process on [0, 1], continuous in probability with X0 = 0. Then X is exchangeable iff it has a version Xt = α t + σBt +

Σ βj j ≥1





1{τj ≤ t} − t ,

t ∈ [0, 1],

(11)

for a Brownian bridge B, some independent i.i.d. U (0, 1) random variables τ1 , τ2 , . . . , and an independent set 10 of coefficients α, σ, β1 , β2 , . . . with Σj |βj |2 < ∞ a.s. The sum in (11) converges a.s., uniformly on [0, 1], toward an rcll process on [0, 1]. 9

The complications for d > 1 are similar to those in Theorem 7.7. Thus, B, τ1 , τ2 , . . . , {α, σ, (βj )} are independent; no claim is made about the dependence between α, σ, β1 , β2 , . . . . 10

27. Symmetric Distributions and Predictable Maps

621

In particular, a simple point process on [0, 1] is symmetric with respect to Lebesgue measure λ iff it is a mixed binomial process based on λ, as noted already in Theorem 15.14. Combining the present result with Theorem 27.9, we see that a continuous process X on R+ or [0, 1] with X0 = 0 is exchangeable iff Xt ≡ α t + σBt a.s., where B is a Brownian motion or bridge, respectively, and (α, σ) is an independent pair of random variables. The coefficients α, σ, β1 , β2 , . . . in (11) are clearly a.s. unique and X-measurable, apart from the order of the jump sizes βj . Thus, X determines a.s. the pair (α, ν), where ν is the a.s. bounded random measure on R given by ν = σ 2 δ0 + Σ j βj2 δβj .

(12)

Since L(X) is also uniquely determined by L(α, ν), we say that a process X as in (11) is directed by (α, ν). First we need to establish the convergence of the sum in (11). Lemma 27.11 (uniform convergence) Let Ytn be the n-th partial sum of the series in (11). Then these conditions are equivalent: (i)

Σj βj2 < ∞ a.s.,

(ii) Ytn converges a.s. for every t ∈ [0, 1], (iii) there exists an rcll process Y on [0, 1], such that (Y n − Y )∗ → 0 a.s. Proof: We may clearly take the variables βj to be non-random. Further note that trivially (iii) ⇒ (ii). (i) ⇔ (ii): Here we may take the βj to be non-random. Then for fixed t ∈ (0, 1) the terms are independent and bounded with mean 0 and variance βj2 t(1 − t), and so by Theorem 5.18 the series converges iff Σj βj2 < ∞.

(i) ⇒ (iii): The processes Mtn = (1 − t)−1 Ytn are L2 -martingales on [0, 1), with respect to the filtration induced by the processes 1{τj ≤ t}. By Doob’s inequality, we have for any m < n and t ∈ [0, 1) n m ∗2 E(Y n − Y m )∗2 t ≤ E(M − M )t



≤ 4 E Mtn − Mtm

2



= 4 (1 − t)−2 E Ytn − Ytm

2

≤ 4 t(1 − t)−1 Σ βj2 , j >m

which tends to 0 as m → ∞ for fixed t. By symmetry we have the same n , and so by combination convergence for the time-reversed processes Y1−t (Y m − Y n )∗1 → 0, P

m, n → ∞.

Statement (iii) now follows by Lemma 11.13.

2

622

Foundations of Modern Probability

For the main proof, we consider first a continuity theorem for exchangeable processes of the form (11). Once the general representation is established, this yields a continuity theorem for general exchangeable processes on [0, 1], which is clearly of independent interest. The processes X are regarded as random elements in the function space D [0,1] of rcll functions on [0, 1], endowed with an obvious modification of the Skorohod topology, discussed in Chapter 23 and Appendix 5. We further regard the a.s. bounded random measures ν as random elements in the measure space MR endowed with the weak topology, equivalent to the vague topology on MR¯ . Theorem 27.12 (limits of exchangeable processes) Let X1 , X2 , . . . be exchangeable processes as in (11) directed by the pairs (αn , νn ). Then sd

X n −→ X in D [0,1]



wd

(αn , νn ) −→ (α, ν) in R × MR ,

where X is exchangeable and directed by (α, ν). Here and below, we need an elementary but powerful tightness criterion. Lemma 27.13 (hyper-contraction and tightness) Let the random variables ξ1 , ξ2 , . . . ≥ 0 and σ-fields F1 , F2 , . . . be such that 







E ξn2  Fn ≤ c E(ξn | Fn )



2

< ∞ a.s.,

n ∈ N,

for a constant c > 0, and put ηn = E(ξn |Fn ). Then (ξn ) is tight



(ηn ) is tight. P

Proof: By Lemma 5.9, we need to show that rn ηn → 0 whenever 0 ≤ rn → 0. Then conclude from Lemma 5.1 that, for any p ∈ (0, 1) and ε > 0, 0 < (1 − p)2 c−1 

≤ P ξn ≥ p ηn  Fn

 



≤ P rn ξn ≥ p ε  Fn + 1{rn ηn < ε}. P

Here the first term on the right tends in probability to 0, since rn ξn → 0 by P Lemma 5.9. Hence, 1{rn ηn < ε} → 1, which means that P {rn ηn ≥ ε} → 0. P Since ε is arbitrary, we get rn ηn → 0. 2 wd

sd

Proof of Theorem 27.12: First let (αn , νn ) −→ (α, ν). To show that X n −→ X for the corresponding processes in (11), it suffices by Lemma 5.28 to take all αn and νn to be non-random. Thus, we may restrict our attention to processes X n with constant coefficients αn , σn , and βnj , j ∈ N. fd To prove that X n −→ X, we begin with four special cases. First we note that, if αn → α, then trivially αn t → αt uniformly on [0, 1]. Similarly, σn → σ implies σn B → σB in the same sense. Next suppose that αn = σn = 0 and βn,m+1 = βn,m+2 = · · · = 0 for a fixed m ∈ N. We may then assume that even

27. Symmetric Distributions and Predictable Maps

623

α = σ = 0 and βm+1 = βm+2 = · · · = 0, and that βnj → βj for all j. Here the convergence X n → X is obvious. We finally assume that αn = σn = 0 and α = β1 = β2 = · · · = 0. Then maxj |βnj | → 0, and for any s ≤ t, 2 E(Xsn Xtn ) = s (1 − t) Σ k βnk 2 → s (1 − t) σ = E(Xs Xt ).

(13)

fd

In this case, X n −→ X by Theorem 6.12 and Corollary 6.5. By independence, fd we may combine the four special cases into X n −→ X, whenever βj = 0 for all but finitely many j. To extend this to the general case, we may use Theorem 5.29, where the required uniform error estimate is obtained as in (13). sd To strengthen the convergence to X n −→ X in D [0,1] , it is enough to verify the tightness criterion in Theorem 23.11. Thus, for any X n -optional times τn and positive constants hn → 0 with τn + hn ≤ 1, we need to show P that Xτnn +hn − Xτnn → 0. By Theorem 27.7 and a simple approximation, it is P equivalent that Xhnn → 0, which is clear since E(Xhnn )2 = h2n αn2 + hn (1 − hn ) νn → 0. sd

To obtain the reverse implication, let X n −→ X in D [0,1] for some process d X. Since αn = X1n → X1 , the sequence (αn ) is tight. Next define for n ∈ N n − X1n ηn = 2 X1/2

Then



= 2 σn B1/2 + 2 Σ j βnj 1{τj ≤ 12 } − 2 E(ηn2 | νn ) = σn2 + Σ j βnj = νn ,



2 E(ηn4 | νn ) = 3 σn2 + Σ j βnj

≤ 3 νn 2 .

2

1 2



.

4 − 2 Σ j βnj

Since (ηn ) is tight, even (νn ) is tight by Lemmas 23.15 and 27.13, and so the same thing is true for the sequence of pairs (αn , νn ). The tightness implies relative compactness in distribution, and so every subsequence contains a further sub-sequence converging in R × MR¯ toward some random pair (α, ν). Since the measures in (12) form a vaguely closed subset of MR¯ , the limit ν has the same form for suitable σ and β1 , β2 , . . . . Then the sd d direct assertion yields X n −→ Y with Y as in (11), and therefore X = Y . Since the coefficients in (11) are measurable functions of Y , the distribution of (α, ν) is uniquely determined by that of X. Thus, the limiting distribution is wd independent of sub-sequence, and the convergence (αn , νn ) −→ (α, ν) remains valid along N. Finally, we may use Corollary 8.18 to transfer the representation (22) to the original process X. 2

624

Foundations of Modern Probability

Next we prove a functional limit theorem for partial sums of exchangeable random variables, extending the classical limit theorems for random walks in Theorems 22.9 and 23.14. Then for every n ∈ N we consider some exchangeable random variables ξnj , j ≤ mn , and form the summation processes Xtn =

Σ

ξnj ,

j ≤ mn t

t ∈ [0, 1], n ∈ N.

(14)

Assuming mn → ∞, we show that the X n can be approximated by exchangeable processes as in (11). The convergence criteria may be stated in terms of the random variables and measures αn = Σ j ξnj ,

2 νn = Σ j ξnj δ ξnj ,

n ∈ N.

(15)

When the αn and νn are non-random, this reduces to a functional limit theorem for sampling without replacement from a finite population, which is again of independent interest. Theorem 27.14 (limits of sampling processes, Hagberg, OK) For n ∈ N, let ξnj , j ≤ mn → ∞, be exchangeable random variables, and define the processes X n with associated pairs (αn , νn ) by (14) and (15). Then sd

X n −→ X in D [0,1]



wd

(αn , νn ) −→ (α, ν) in R × MR ,

where X is exchangeable on [0, 1] and directed by (α, ν). Proof: Let τ1 , τ2 , . . . be i.i.d. U (0, 1) and independent of all ξnj , and define

Σ j ξnj 1{τj ≤t}  = αn t + Σ j ξnj 1{τj ≤ t} − t ,

Ytn =

t ∈ [0, 1].

Writing ξ˜nk for the k-th jump from the left of Y n (including possible zero jumps d d ˜n = X n, when ξnj = 0), we note that (ξ˜nj ) = (ξnj ) by exchangeability. Thus, X n n n ˜ , Y ) → 0 a.s. by Proposi˜ is defined as in (14). Furthermore, d(X where X t tion 5.24, where d is the metric in Theorem A5.4. Hence, by Theorem 5.29, it is equivalent to replace X n by Y n . The assertion then follows by Theorem 27.12. 2 We are now ready to prove the main representation theorem for exchangeable processes on [0, 1]. As noted before, we consider only the case of d = 1. Proof of Theorem 27.10: The sufficiency being obvious, it is enough to prove the necessity. Then let X have exchangeable increments. Introduce the step processes   Xtn = X 2−n [2n t] , t ∈ [0, 1], n ∈ N, define νn as in (15) in terms of the jump sizes of X n , and put αn ≡ X1 . wd If the sequence (νn ) is tight, then (αm , νn ) −→ (α, ν) along a sub-sequence, sd and Theorem 27.14 yields X n −→ Y along the same sub-sequence, where Y

27. Symmetric Distributions and Predictable Maps

625 fd

has a representation as in (11). In particular, X n −→ Y , and so the finitedimensional distributions of X and Y agree for dyadic times. The agreement extends to arbitrary times, since both processes are continuous in probability. Then Lemma 4.24 shows that X has a version in D [0,1] , whence Corollary 8.18 yields the desired representation. To prove the required tightness of (νn ), let ξnj denote the increments of X n , put ζnj = ξnj − 2−n αn , and note that

Σ j ξnj2 2 = Σ j ζnj + 2−n αn2 .

νn =

(16)

n − X1n = 2X1/2 − X1 , and noting that Writing ηn = 2X1/2 the elementary estimates

E(ηn4 | νn ) <  = < 

Σj ζnj = 0, we get

2 2 ζni ζnj Σ ζnj4 + iΣ = j

j ≥1



Σ j ζnj2 



2

2

E(ηn2 | νn ) .

Since ηn is independent of n, the sequence of sums 27.13, and the tightness of (νn ) follows by (16).

Σj ζnj2

is tight by Lemma 2

Next we extend Theorem 14.3 to continuous time and higher dimensions. d Say that a random sequence ξ = (ξk ) in Rd is rotatable 11 if ξ n U = ξ n for every n finite sub-sequence ξ and orthogonal matrix U . A process X on R+ or [0, 1] is said to be rotatable, if it is continuous in probability with X0 = 0, and such that the increments over disjoint intervals of equal lengths are rotatable. Proposition 27.15 (rotatable sequences and processes, Schoenberg, Freedman) (i) An infinite random sequence X = (ξni ) in Rd is rotatable iff ξni = σki ζnk a.s.,

i ≤ d, n ∈ N,

for a random d × d matrix σ and some i.i.d. N (0, 1) random variables ζnk , k ≤ d, n ∈ N. (ii) An Rd -valued process X = (Xti ) on I = [0, 1] or R+ is rotatable iff i ≤ d, t ∈ I,

Xti = σki Btk a.s.,

for a random d × d matrix σ and an R -valued Brownian motion B on I. d

In both cases, σσ  is X-measurable and L(X) is determined by L(σσ  ). Proof: (i) As in Theorem 14.3, the ξn are conditionally i.i.d. with a common distribution ν on Rd , such that all one-dimensional projections are Gaussian. Letting ρ be the covariance matrix of ν, we may choose σ to be the non-negative square root of ρ, which yields the required representation. The last assertions are now obvious. 11

short for rotation-invariant in distribution, also called spherical

626

Foundations of Modern Probability

(ii) We may take I = [0, 1]. Then let h₁, h₂, ... ∈ L²[0, 1] be the Haar system of ortho-normal functions on [0, 1], and introduce the associated random vectors Y_n = ∫ h_n dX in R^d, n ∈ N. Then the sequence Y = (Y_n) is rotatable and hence representable as in (i), which translates into a representation of X as in (ii), for a Brownian motion B′ on the dyadic subset of [0, 1]. By Theorem 8.17, we may extend B′ to a Brownian motion B on the entire interval [0, 1], and since X is continuous in probability, the stated representation remains a.s. valid for all t ∈ [0, 1].   □

We turn to a continuous-time counterpart of the predictable sampling Theorem 27.7. Here we consider predictable mappings V on the index set I that are a.s. measure-preserving, in the sense that λ ∘ V⁻¹ = λ a.s. for Lebesgue measure λ on I = [0, 1] or R_+. The associated transformations of a process X on I are given by

    (X ∘ V⁻¹)_t = ∫_I 1{V_s ≤ t} dX_s,   t ∈ I,

where the right-hand side is a stochastic integral of the predictable process U_s = 1{V_s ≤ t} with respect to the semi-martingale X. Note that if X_t = ξ[0, t] for a random measure ξ on I, the map t ↦ (X ∘ V⁻¹)_t reduces to the distribution function of the transformed measure ξ ∘ V⁻¹, so that X ∘ V⁻¹ =^d X becomes equivalent to ξ ∘ V⁻¹ =^d ξ.

Theorem 27.16 (predictable invariance) Let X be an R^d-valued, F-exchangeable process on I = [0, 1] or R_+, and let V be an F-predictable mapping on I. Then

    λ ∘ V⁻¹ = λ a.s.  ⇒  X ∘ V⁻¹ =^d X.

This holds by definition when V is non-random. The result applies in particular to Lévy processes X. For a Brownian motion or bridge, we have the stronger statements in Theorem 19.9. With some effort, we can prove the result from the discrete-time version in Theorem 27.7 by a suitable approximation. Here we use instead some general time-change results from Chapter 19, which were in turn based on ideas from Chapters 10 and 15. For clarity, we treat the cases of I = R_+ or [0, 1] separately.

Proof for I = R_+: Extending the original filtration if necessary, we may take the characteristics of X to be F₀-measurable. First let X have isolated jumps. Then

    X_t = α t + σ B_t + ∫_0^t ∫ x ξ(ds dx),   t ≥ 0,

for a Cox process ξ on R_+ × R^d directed by λ ⊗ ν and an independent Brownian motion B. We may also choose the triple (α, σ, ν) to be F₀-measurable and the pair (B, ξ) to be F-exchangeable. Since ξ has compensator ξ̂ = λ ⊗ ν, we obtain ξ̂ ∘ V⁻¹ = λ ⊗ ν, which implies ξ ∘ V⁻¹ =^d ξ by Lemma 19.14. Next define for every t ≥ 0 a predictable process U_{t,r} = 1{V_r ≤ t}, r ≥ 0, and note that

    ∫_0^∞ U_{s,r} U_{t,r} d[B^i, B^j]_r = δ_{ij} ∫_0^∞ 1{V_r ≤ s ∧ t} dr = (s ∧ t) δ_{ij}.

Using Lemma 19.14 again, we conclude that the R^d-valued process

    (B ∘ V⁻¹)_t = ∫_0^∞ U_{t,r} dB_r,   t ≥ 0,

is Gaussian with the same covariance function as B. The same result shows that ξ ◦ V −1 and B ◦ V −1 are conditionally independent given F0 . Hence,







    ( (ξ, λ, B) ∘ V⁻¹, α, σ, ν ) =^d ( ξ, λ, B, α, σ, ν ),

which implies X ∘ V⁻¹ =^d X.
  In the general case, we may write

    X_t = M^ε_t + (X_t − M^ε_t),   t ≥ 0,

where M^ε is the purely discontinuous local martingale formed by all compensated jumps of modulus ≤ ε. Then (X − M^ε) ∘ V⁻¹ =^d (X − M^ε) as above, and it suffices to show that M^ε_t →^P 0 and (M^ε ∘ V⁻¹)_t →^P 0 as ε → 0 for fixed t ≥ 0. In the one-dimensional case, we may use the isometry property of stochastic L²-integrals in Theorem 20.2, along with the measure-preserving property of V, to see that

    E^{F₀} (M^ε ∘ V⁻¹)²_t = E^{F₀} ( ∫_0^∞ 1{V_s ≤ t} dM^ε_s )²
                          = E^{F₀} ∫_0^∞ 1{V_s ≤ t} d[M^ε]_s
                          = E^{F₀} ∫_0^∞ 1{V_s ≤ t} ds ∫_{|x|≤ε} x² ν(dx)
                          = t ∫_{|x|≤ε} x² ν(dx) → 0.

Hence, by Jensen's inequality and dominated convergence,

    E[ (M^ε ∘ V⁻¹)²_t ∧ 1 ] ≤ E[ E^{F₀}( (M^ε ∘ V⁻¹)²_t ) ∧ 1 ] → 0,

which implies (M^ε ∘ V⁻¹)_t →^P 0. Specializing to V_s ≡ s, we get in particular M^ε_t →^P 0.   □

Proof for I = [0, 1]: Extending the original filtration if necessary, we may assume that the coefficients in the representation of X are F₀-measurable. First truncate the sum of centered jumps in the representation of Theorem 27.10. Let X_n denote the remainder after the first n terms, and write X_n = M_n + X̂_n. Noting that tr[M_n]₁ = Σ_{j>n} |β_j|² → 0, and using the BDG inequalities in Theorem 20.12, we obtain (M_n ∘ V⁻¹)_t →^P 0 for every t ∈ [0, 1]. Using the martingale property of (X₁ − X_t)/(1 − t) and the symmetry of X, we further see that


    E( ∫_0^1 |dX̂_n| | F₀ ) ≲ E(X_n* | F₀) ≲ (tr[X_n]₁)^{1/2} = ( Σ_{j>n} |β_j|² )^{1/2} → 0.

Hence, (X_n ∘ V⁻¹)_t →^P 0 for all t ∈ [0, 1], which reduces the proof to the case of finitely many jumps. It is then enough to consider a jointly exchangeable pair, consisting of a marked point process ξ on [0, 1] and a Brownian bridge B in R^d. By a simple randomization, we may further reduce to the case where ξ has a.s. distinct marks. It is then equivalent to consider finitely many optional times τ₁, ..., τ_m, such that B and the point processes ξ_j = δ_{τ_j} are jointly exchangeable. To apply Proposition 19.15, we may express the continuous martingale M as an integral with respect to the Brownian motion in Lemma 19.10, with the predictable process V replaced by the associated indicator functions U_r^t = 1{V_r ≤ t}, r ∈ [0, 1]. Then the covariations of the lemma become

    ∫_0^1 (U_r^s − Ū_r^s)(U_r^t − Ū_r^t) dr = ∫_0^1 U_r^s U_r^t dr − λU^s · λU^t
                                            = λU^{s∧t} − λU^s · λU^t
                                            = s ∧ t − st = E(B_s B_t),

as required. As for the random measures ξ_j = δ_{τ_j}, Proposition 10.23 yields the associated compensators

    η_j[0, t] = ∫_0^{t∧τ_j} ds/(1 − s) = − log(1 − t ∧ τ_j),   t ∈ [0, 1].

Since the η_j are diffuse, we get by Theorem 10.24 (ii) the corresponding discounted compensators

    ζ_j[0, t] = 1 − exp(−η_j[0, t]) = 1 − (1 − t ∧ τ_j) = t ∧ τ_j,   t ∈ [0, 1].

Thus, ζ_j = λ([0, τ_j] ∩ ·) ≤ λ a.s., and so ζ_j ∘ V⁻¹ ≤ λ ∘ V⁻¹ = λ a.s. Hence, Proposition 19.15 yields

    (B ∘ V⁻¹, V_{τ₁}, ..., V_{τ_m}) =^d (B, τ₁, ..., τ_m),

which implies X ∘ V⁻¹ =^d X.   □
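As a quick numerical illustration of the simplest, non-random case of Theorem 27.16, the sketch below applies a deterministic measure-preserving rearrangement V of [0, 1] to simulated Brownian increments and compares a few distributional statistics of X and X ∘ V⁻¹. The particular map V, the grid sizes, and all other numerical choices are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_paths = 1000, 2000
dt = 1.0 / n
t = np.arange(n) * dt            # left endpoints of the time grid

# Non-random measure-preserving map V on [0, 1]: swap the two halves.
# Then (X o V^{-1})_u = int 1{V_s <= u} dX_s, approximated on the grid.
V = np.where(t < 0.5, t + 0.5, t - 0.5)

dX = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n))   # Brownian increments
X_half = dX[:, t <= 0.5].sum(axis=1)                   # X_{1/2}
X_one = dX.sum(axis=1)                                 # X_1
XV_half = dX[:, V <= 0.5].sum(axis=1)                  # (X o V^{-1})_{1/2}
XV_one = dX[:, V <= 1.0].sum(axis=1)                   # (X o V^{-1})_1

# Since lambda o V^{-1} = lambda, the transformed process should again be
# a Brownian motion; compare empirical variances at times 1/2 and 1.
print("Var X_{1/2}:", X_half.var(), " Var (XoV^-1)_{1/2}:", XV_half.var())
print("Var X_1:    ", X_one.var(),  " Var (XoV^-1)_1:    ", XV_one.var())
```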

Exercises

1. Let μ_c be the joint distribution of τ₁ < τ₂ < ···, where ξ = Σ_j δ_{τ_j} is a stationary Poisson process on R_+ with rate c > 0. Show that the μ_c are asymptotically invariant as c → 0.


2. Let the random sequence ξ be conditionally i.i.d. η. Show that ξ is ergodic iff η is a.s. non-random.
3. Let ξ, η be random probability measures on a Borel space such that E ξ^∞ = E η^∞. Show that ξ =^d η. (Hint: Use the law of large numbers.)
4. Let ξ₁, ξ₂, ... be contractable random elements in a Borel space S. Prove the existence of a measurable function f : [0, 1]² → S and some i.i.d. U(0, 1) random variables ϑ₀, ϑ₁, ..., such that ξ_n = f(ϑ₀, ϑ_n) a.s. for all n. (Hint: Use Lemma 4.22, Proposition 8.20, and Theorems 8.17 and 27.2.)
5. Let ξ = (ξ₁, ξ₂, ...) be an F-contractable random sequence in a Borel space S. Prove the existence of a random measure η, such that for any n ∈ Z_+ the sequence θ_n ξ is conditionally i.i.d. η, given F_n and η.
6. Describe the representation of Lemma 27.5, in the special case where ξ₁, ..., ξ_n are conditionally i.i.d. with directing random measure η.
7. Give an example of a finite, exchangeable sequence that is not mixed i.i.d. Also give an example of an exchangeable process on [0, 1] that is not mixed Lévy.
8. For a simple point process or diffuse random measure ξ on [0, 1], show that ξ is exchangeable iff it is contractable.
9. Say that a finite random sequence in S is contractable if all subsequences of equal length have the same distribution. Give an example of a finite random sequence that is contractable but not exchangeable. (Hint: We may take |S| = 3 and n = 2.)
10. Show that for |S| = 2, a finite or infinite sequence ξ in S is exchangeable iff it is contractable. (Hint: Regard ξ as a simple point process on {1, ..., n} or N, and use Theorem 15.8.)
11. State and prove a continuous-time version of Lemma 27.6. (If no regularity conditions are imposed on the exchangeable processes in Theorem 27.9, we need to consider optional times taking countably many values.)
12. Let ξ₁, ..., ξ_n be exchangeable random variables, fix a Borel set B, and let τ₁ < ··· < τ_ν be the indices k ∈ {1, ..., n} with Σj t},

    p_a^b(x) = x + a(1 − b⁻¹ x) 1{0 < x < b},

and note that p_a^b ∘ p_b = p_b. The contractability of (X, Y) yields (X, Y) ∘ p_a^b =^d (X, Y), and so

    (X ∘ p_b, Y) =^d (X ∘ p_a^b ∘ p_b, Y ∘ p_a^b)
                 = (X ∘ p_b, Y ∘ p_a^b).

By Corollary 8.10, we get

    (X ∘ p_b) ⊥⊥_{Y ∘ p_a^b} Y,

and since σ(Y ∘ p_a^b) = σ(Y ∘ p_a) and σ(Y ∘ p_b; b > a) = σ(Y ∘ p_a), a monotone-class argument yields the extended relation

    (X ∘ p_a) ⊥⊥_{Y ∘ p_a} Y.

Since also (X ∘ p_a) ⊥⊥_Y (Z ∘ p_a) by hypothesis, Theorem 8.12 gives

    (X ∘ p_a) ⊥⊥_{Y ∘ p_a} (Z ∘ p_a).

Combining this with the hypothetical relations

    X ⊥⊥_Y Z,   (X, Y) ∘ p_a =^d (X, Y),   (Y, Z) ∘ p_a =^d (Y, Z),


we see as in case (i) that

    (X, Y, Z) ∘ p_a =^d (X, Y, Z),   a ∈ Q_+.

Relabeling Q if necessary, we get the same relation for the more general mappings p_{a,t}. For any finite sets I, J ⊂ Q with |I| = |J|, we may choose compositions p and q of such functions satisfying p(I) = q(J), and the required contractability follows.   □

Lemma 28.5 (conditional independence) Let (X, ξ) be a contractable array on Z̃^d with restriction (Y, η) to Ñ^d, where ξ has independent entries. Then Y ⊥⊥_η ξ.

Proof: Putting

    p_n(k) = k − n 1{k ≤ 0},   k ∈ Z, n ∈ N,

we get by the contractability of (X, ξ)

    (Y, ξ ∘ p_n) =^d (Y, ξ),   n ∈ N.

Since also σ{ξ ∘ p_n} ⊂ σ{ξ}, Corollary 8.10 yields

    Y ⊥⊥_{ξ ∘ p_n} ξ,   n ∈ N.

Further note that ∩_n σ{ξ ∘ p_n} = σ{η} a.s. by Corollary 9.26. The assertion now follows by martingale convergence as n → ∞.   □

For any S-valued random array X on N^d, we introduce the associated shell

σ-field

    S(X) = ∩_n σ{ X_k; max_i k_i ≥ n }.

Proposition 28.6 (shell σ-field, Aldous, Hoover) Let the array X on Z_+^d be separately exchangeable under permutations on N, and write X^+ for the restriction of X to N^d. Then

    X^+ ⊥⊥_{S(X^+)} (X \ X^+).

Proof: When d = 1, X is just a random sequence (η, ξ₁, ξ₂, ...) = (η, ξ), and S(X^+) reduces to the tail σ-field T. Applying de Finetti's theorem to both ξ and (ξ, η) gives

    L(ξ | μ) = μ^∞,   L(ξ | ν, η) = ν^∞,

for some random probability measures μ and ν on S. By the law of large numbers, μ and ν are T-measurable with μ = ν a.s., and so ξ ⊥⊥_T η by Corollary 8.10.
  Assuming the truth in dimension d − 1, we turn to the d-dimensional case. Define an S^∞-valued array Y on Z_+^{d−1} by




    Y_m = ( X_{m,k}; k ≥ 0 ),   m ∈ Z_+^{d−1},

and note that Y is separately exchangeable under permutations on N. Writing Y^+ for the restriction of Y to N^{d−1} and X^0 for the restriction of X to N^{d−1} × {0}, we get by the induction hypothesis

    (X^+, X^0) = Y^+ ⊥⊥_{S(Y^+)} (Y \ Y^+) = X \ (X^+, X^0).

Since also S(Y^+) ⊂ S(X^+) ∨ σ(X^0) ⊂ σ(Y^+), we obtain

    X^+ ⊥⊥_{S(X^+), X^0} (X \ X^+).    (5)

Next we apply the result for d = 1 to the S^∞-valued sequence

    Z_k = ( X_{m,k}; m ∈ N^{d−1} ),   k ≥ 0,

to obtain

    X^+ = (Z \ Z₀) ⊥⊥_{S(Z)} Z₀ = X^0.

Since clearly S(Z) ⊂ S(X^+) ⊂ σ(X^+), we conclude that

    X^+ ⊥⊥_{S(X^+)} X^0.

Now combine with (5) and use Theorem 8.12 to get the desired relation.   □

We need only the following corollary. Given an array ξ on Z_+^d and a set I ⊂ N_d = {1, ..., d}, we write ξ^I for the sub-array on the index set N^I × {0}^{I^c}, and put ξ̂^J = (ξ^I; I ⊂ J). Further define ξ^m = (ξ^I; |I| = m) and ξ̂^m = (ξ^I; |I| ≤ m), and similarly for arrays on N̂^{↑d}. Recall that N^{↑d} = (J ⊂ N; |J| = d) and Ñ^d = (J ⊂ N; |J| ≤ d).

Corollary 28.7 (independent entries)
(i) Let ξ be a random array on Z_+^d with independent entries, separately exchangeable under permutations on N, and define η_k = f(ξ̂_k), k ∈ Z_+^d, for a measurable function f. Then even η has independent entries iff

    η_k ⊥⊥ (ξ̂_k \ ξ_k),   k ∈ Z_+^d,    (6)

in which case

    η^I ⊥⊥ (ξ \ ξ^I),   I ⊂ N_d.    (7)

(ii) Let ξ be a contractable array on Ñ^d with independent entries, and define

    η_J = f(ξ̂_J),   J ∈ Ñ^d,    (8)

for a measurable function f. Then even η has independent entries iff

    η_J ⊥⊥ (ξ̂_J \ ξ_J),   J ∈ Ñ^d,    (9)

in which case

    η^k ⊥⊥ (ξ \ ξ^k),   k ≤ d.    (10)


Proof: (i) Assume (6). Since also ξ̂_k ⊥⊥ (ξ \ ξ̂_k), we get η_k ⊥⊥ (ξ \ ξ_k), and so



ηk ⊥ ⊥ ξ \ ξk , ηˆ|k| \ ηk ,

k ∈ Zd+ ,

where |k| is the number of k -components in N. As in Lemma 4.8, this yields both (7) and the independence of the η -entries. Conversely, let η have independent entries. Then S(η I ) is trivial for ∅ = I ⊂ Nd by Kolmogorov’s 0−1 law. Applying Lemma 28.6 to the ZI+ -indexed array (η I , ξˆI \ ξ I ), which is clearly separately exchangeable on NI , we obtain η I ⊥⊥(ξˆI \ ξ I ), and (6) follows. (ii) Assume (9). Since also ξˆJ ⊥⊥(ξ \ ξˆJ ), we obtain ηJ ⊥⊥(ξ \ ξJ ), and so 



ηJ ⊥ ⊥ ξ \ ξ d , ηˆd \ ηJ ,

˜ d. J ∈N

Thus, η has independent entries and satisfies (10). Conversely, let η have independent entries. Extend ξ to a contractable array on Q↑d , and extend η accordingly to Q↑d by means of (8). Define (ξk , ηk ) = (ξ, η) r(1,k1 ),...,r(d,kd ) ,

k ∈ Zd+ ,

where r(i, k) = i + k(k + 1)−1 . Then ξ  is separately exchangeable on Nd with independent entries, and (8) yields ηk = f (ξˆk ) for all k ∈ Zd+ . Since η  ⊂ η has again independent entries, (i) yields ηk ⊥ ⊥(ξˆk \ ξk ) for all k ∈ Nd , or equivalently ↑d ˆ ⊥(ξJ \ ξJ ) for suitable J ∈ Q . The general relation (9) now follows by the ηJ ⊥ contractability of ξ. 2 Lemma 28.8 (coding) Let ξ = (ξj ) and η = (ηj ) be random arrays on a countable index set I, satisfying d

(ξi , ηi ) = (ξj , ηj ), ξj ⊥ ⊥ (ξ \ ξj , η), ηj

i, j ∈ I,

(11)

j ∈ I.

(12)

Then there exist some measurable functions f, g, such that for any U-array ϑ⊥⊥(ξ, η) on I, the random variables ζj = g(ξj , ηj , ϑj ),

j ∈ I,

(13)

form a U-array ζ⊥ ⊥ η satisfying ξj = f (ηj , ζj ) a.s.,

j ∈ I.

(14)

Proof: For any i ∈ I, Theorem 8.17 yields a measurable function f , such

that d (ξi , ηi ) = f (ηi , ϑi ), ηi . The same theorem provides some measurable functions g and h, such that the random elements ζi = g(ξi , ηi , ϑi ), η˜i = h(ξi , ηi , ϑi ) satisfy




(ξi , ηi ) = f (˜ ηi , ζi ), η˜i



a.s.,

d

(˜ ηi , ζi ) = (ηi , ϑi ). In particular η˜i = ηi a.s., and so (14) holds for j = i with ζi as in (13). Furthermore, ζi is U (0, 1) and independent of ηi . The two statements extend by (11) to arbitrary j ∈ I. Combining (12) with the relations ϑ⊥⊥(ξ, η) and ϑj ⊥⊥(ϑ \ ϑj ), we get 



(ξj , ηj , ϑj ) ⊥ ⊥ ξ \ ξj , η, ϑ \ ϑj , ηj

which implies

ζj ⊥ ⊥ (ζ \ ζj , η),

j ∈ I,

j ∈ I.

ηj

⊥ηj , Theorem 8.12 yields ζj ⊥⊥(ζ \ ζj , η). Then Lemma 4.8 shows Since also ζj ⊥ that the array η and the elements ζj , j ∈ I, are all independent. Thus, ζ is indeed a U-array independent of η. 2 Proposition 28.9 (invariant coding) Let G be a finite group acting measurably on a Borel space S, and let ξ, η be random elements in S satisfying d

r(ξ, η) = (ξ, η),

r η = η a.s.,

r ∈ G.

Then for any U (0, 1) random variable ϑ⊥ ⊥(ξ, η), (i) there exists a measurable function f : S ×[0, 1] → S and a U (0, 1) random variable ζ⊥ ⊥ η, such that a.s. r ξ = f (r η, ζ),

r ∈ G,

(ii) we may choose ζ = b(ξ, η, ϑ), for a measurable function b : S 2 × [0, 1] → [0, 1] satisfying b(rx, ry, t) = b(x, y, t),

x, y ∈ S, r ∈ G, t ∈ [0, 1].

Our proof relies on a simple algebraic fact: Lemma 28.10 (invariant functions) Let G be a finite group acting measurably on a Borel space S, such that rs = s for all r ∈ G and s ∈ S. Then there exists a measurable function h : S → G, such that (i) h(rs) r = h(s),

r ∈ G, s ∈ S,

(ii) for any map b : S → S, the function f (s) = h−1 s b(hs s), satisfies f (rs) = rf (s),

s ∈ S,

r ∈ G, s ∈ S.


Proof: (i) We may take S ⊂ R. For any s ∈ S, the elements rs are all different, and we may choose hs to be the element r ∈ G maximizing rs. Then hrs is the element p ∈ G maximizing p(rs), so that pr maximizes (pr)s, and hence hrs r = p r = hs . (ii) Using (i) and the definition of f , we get for any r ∈ G and s ∈ S f (rs) = h−1 rs b(hrs rs) = h−1 rs b(hs s) = h−1 rs hs f (s) = r f (s).

2

Proof of Lemma 28.9: (i) Assuming rη = η identically, we may define h as in Lemma 28.10. Writing ξ˜ = hη ξ and η˜ = hη η, we have for any r ∈ G ˜ η˜) = hη (ξ, η) (ξ, = hrη r(ξ, η). Letting γ⊥⊥(ξ, η) be uniformly distributed in G, we get by Fubini’s theorem d

γ(ξ, η) = (ξ, η), d (γhη , ξ, η) = (γ, ξ, η). Combining those relations and using Lemma 28.10 (i), we obtain 

˜ η˜) = η, hη ξ, hη η (η, ξ, d





= γη, hγη γ ξ, hγη γ η 

= γη, hη ξ, hη η d







= γhη η, hη ξ, hη η ˜ η˜). = (γ η˜, ξ,



˜ Hence, Propo⊥(ξ, η), we get η⊥⊥η˜ ξ. Since also γ η˜⊥⊥η˜ ξ˜ by the independence γ⊥ sition 8.20 yields a measurable function g : S × [0, 1] → S and a U (0, 1) random variable ζ⊥⊥ η, such that a.s. η , ζ) hη ξ = ξ˜ = g(˜ = g(hη η, ζ). This shows that ξ = f (η, ζ) a.s., where f (s, t) = h−1 s g(hs s, t),

s ∈ S, t ∈ [0, 1],

for a measurable extension of h to S. Using Lemma 28.10 (ii), we obtain a.s. rξ = rf (η, ζ) = f (rη, ζ),

r ∈ G.

(ii) Defining rt = t for r ∈ G and t ∈ [0, 1], we have by (i) and the independence ζ⊥ ⊥η


r(ξ, η, ζ) = (rξ, rη, ζ) d = (ξ, η, ζ), r ∈ G. Proceeding as in (i), we may choose a measurable function b : S 2 ×[0, 1] → [0, 1] and a U (0, 1) random variable ϑ⊥⊥(ξ, η), such that a.s. ζ = b(rξ, rη, ϑ) a.s.,

r ∈ G.

Averaging over G, we may assume relation (ii) to hold identically. Letting d ϑ ⊥ ⊥(ξ, η) be U (0, 1) and putting ζ  = b(ξ, η, ϑ ), we get (ξ, η, ζ  ) = (ξ, η, ζ), 2 and so (i) remains true with ζ replaced by ζ  . We proceed with some elementary properties of symmetric functions. Note that if two arrays ξ, η are related by ηJ = f (ξˆJ ), then ηˆJ = fˆ(ξˆJ ) with 







fˆ xI ; I ⊂ Nd = f (xJ◦I ; I ⊂ N|J| ); J ⊂ Nd . ˜ d , and Lemma 28.11 (symmetric functions) Let ξ be a random array on N let f, g be measurable functions between suitable spaces. Then ˜ d ⇒ ηˆJ = fˆ(ξˆJ ) on N ˜ d, (i) ηJ = f (ξˆJ ) on N ˆ (d) with f symmetric ⇒ ηˆk = fˆ(ξˆk ) on N ˆ (d) , (ii) ηk = f (ξˆk ) on N (iii) when f, g are symmetric, so is f ◦ gˆ, (iv) for exchangeable ξ and symmetric f , the array ηJ = f (ξˆJ ) on N↑d is again symmetric. Proof: (i) This just restates the definition. ˆ (ii) Let k ∈ N(d) and J ⊂ Nd . By the symmetry of f and the definition of ξ, 





f ξk◦(J◦I) ; I ⊂ N|J| = ξ(k◦J)◦I ; I ⊂ N|J| = f (ξˆk◦I ).



ˆ fˆ, η, ηˆ, Hence, by the definitions of ξ, 

fˆ(ξˆk ) = fˆ ξk◦I ; I ⊂ Nd







= f ξk◦(J◦I) ; I ⊂ N|J| ; J ⊂ Nd

= f (ξˆk◦J ); J ⊂ Nd 







= ηk◦J ; J ⊂ Nd = ηˆk . (iii) Put ηk = g(ξˆk ) and h = f ◦ gˆ. Applying (ii) to g and using the symmetry of f , we get for any vector k ∈ N(d) with associated subset k˜ ∈ N↑d h(ξˆk ) = f ◦ gˆ(ξˆk ) ηk˜ ) = f (ˆ ηk ) = f (ˆ ˆ = f ◦ gˆ(ξ˜ ) = h(ξˆ˜ ). k

k

642

Foundations of Modern Probability

(iv) Fix any J ∈ N↑d and a permutation p of Nd . Using the symmetry of f , we get as in (ii) (η ◦ p)J = ηp◦J = f (ξˆp◦J ) 







= f ξ(p◦J)◦I ; I ⊂ Nd = f ξp◦(J◦I) ; I ⊂ Nd

= f (ξ ◦ p)J◦I ; I ⊂ Nd

= f (ξ ◦ p)ˆJ .



Thus, η ◦ p = g(ξ ◦ p) for a suitable function g, and so the exchangeability of ξ carries over to η. 2 ˜ d with inProposition 28.12 (inversion) Let ξ be a contractable array on N dependent entries, and fix a measurable function f , such that the array ˜ d, (15) ηJ = f (ξˆJ ), J ∈ N has independent entries. Then (i) there exist some measurable functions g, h, such that for any U-array ˜ d , the variables ϑ⊥⊥ ξ on N ˜ d, (16) ζJ = g(ξˆJ , ϑJ ), J ∈ N form a U-array ζ⊥⊥ η satisfying ηJ , ζˆJ ) a.s., ξJ = h(ˆ

˜ d, J ∈N

(17)

(ii) when f is symmetric, we can choose g, h with the same property. ˜ d , respectively, Proof: (i) Write ξ d , ξˆd for the restrictions of ξ to N↑d and N and similarly for η, ζ, ϑ. For d = 0, Lemma 28.8 applied to the pair (ξ∅ , η∅ ) ⊥ ξ is U (0, 1), the variable ζ∅ = yields some functions g, h, such that when ϑ∅ ⊥ ⊥ η∅ and ξ∅ = h(η∅ , ζ∅ ) a.s. g(ξ∅ , ϑ∅ ) becomes U (0, 1) with ζ∅ ⊥ Proceeding by recursion on d, let g, h be such that (16) defines a U-array ˜ d−1 satisfying (17) on N ˜ d−1 . To extend the construction to ζˆd−1 ⊥⊥ ηˆd−1 on N dimension d, we note that ξJ ⊥ ⊥ (ξˆd \ ξJ ), J ∈ N↑d , ξˆJ

since ξ has independent entries. Then (15) yields 



ξJ ⊥ ⊥ ξˆd \ ξJ , ηˆd ,  ξˆJ , ηJ

J ∈ N↑d ,

and by contractability we see that (ξˆJ , ηJ ) = (ξJ , ξˆJ , ηJ ) has the same distribution for all J ∈ N↑d . Hence, Lemma 28.8 yields some measurable functions gd , hd , such that the random variables   (18) ζJ = gd ξˆJ , ηJ , ϑJ , J ∈ N↑d , form a U-array ζ d on N↑d satisfying





ζ d⊥ ⊥ ξˆd−1 , η d ,

(19)

28. Multi-variate Arrays and Symmetries 

ξJ = hd ξˆJ , ηJ , ζJ



643 a.s.,

J ∈ N↑d .

(20)

˜ d−1 into (20), we obtain Inserting (15) into (18), and (17) with J ∈ N

ζJ = gd ξˆJ , f (ξˆJ ), ϑJ ,

ˆ  (ˆ ηJ , ζˆJ ), ηJ , ζJ ξJ = h d h



a.s.,

J ∈ N↑d ,

ˆ  . Thus, (16) and (17) remain valid on N ˜ d for for a measurable function h suitable extensions of g, h. To complete the recursion, we need to show that ζˆd−1 , ζ d , ηˆd are indepen⊥ (ξˆd , ϑd ), since ϑ is a U-array independent of ξ. dent. Then note that ϑˆd−1 ⊥ Hence, (15) and (16) yield   ϑˆd−1 ⊥ ⊥ ξˆd−1 , η d , ζ d . Combining this with (19) gives





ζ d⊥ ⊥ ξˆd−1 , η d , ϑˆd−1 , η d , ζˆd−1 ). and so by (15) and (16) we have ζ d ⊥⊥ (ˆ d d−1 remains to be proved. Since ξ and η are related by (15) Now only ηˆ ⊥⊥ ζˆ and have independent entries, Lemma 28.7 (ii) yields η d ⊥⊥ ξˆd−1 , which extends ⊥ (ˆ η d−1 , ζˆd−1 ) by (15) and (16). to η d ⊥⊥ (ξˆd−1 , ϑˆd−1 ) since ϑ⊥⊥ (ξ, η). Hence, η d ⊥ d−1 d−1 by the induction hypothesis, we have indeed ηˆd ⊥⊥ ζˆd−1 . Since also ηˆ ⊥⊥ ζˆ (ii) Here a similar recursive argument applies, except that now we need to invoke Lemmas 28.9 and 28.11 in each step, to ensure the desired symmetry of g, h. We omit the details. 2 For the next result, we say that an array is representable, if it can be represented as in Theorem 28.1. ˜d Lemma 28.13 (augmentation) Suppose that all contractable arrays on N d ˜ are representable, and let (X, ξ) be a contractable array on N , where ξ has independent entries. Then there exist a measurable function f and a U-array ˜ d , such that η⊥⊥ ξ on N   ˜ d. XJ = f ξˆJ , ηˆJ a.s., J ∈ N Proof: Since (X, ξ) is representable, there exist some measurable functions g, h and a U-array ζ, such that a.s. XJ = g(ζˆJ ), ˜ d. (21) ξJ = h(ζˆJ ), J ∈ N Then Lemma 28.12 yields a function k and a U-array η⊥⊥ ξ, such that   ˜ d. ζJ = k ξˆJ , ηˆJ a.s., J ∈ N ˆ 2 Inserting this into (21) gives the desired representation of X with f = g ◦ k. −−− We are now ready to prove the main results, beginning with Theorem 28.1 (i). Here our proof is based on the following lemma.

644

Foundations of Modern Probability

ˆ (d−1) are Lemma 28.14 (recursion) Suppose that all exchangeable arrays on N (d) ˆ representable. Then any exchangeable array X on N can be represented on ˆ (d−1) in terms of a U-array ξ on N ˜ d−1 , such that the pair (X, ξ) is exchangeable N ˜ k = (Xh ; h ∼ k) satisfy and the arrays X 



˜k ⊥ X ⊥ X \ X˜ k , ξ , ξˆk

k ∈ N(d) .

(22)

ˆ (d) . For r = (r1 , . . . , rm ) ¯ be a stationary extension of X to Z Proof: Let X (d) ˆ (d−1) and ˆ ∈ Z , let r+ be the sub-sequence of elements rj > 0. For k ∈ N (d) ˆ with r+ ∼ (1, . . . , |k|), write k ◦ r = (kr , . . . , krm ), where kr = rj r ∈ Z 1 j ˆ (d−1) , we introduce the array when rj ≤ 0. On N



ˆ (d) , r+ ∼ (1, . . . , |k|) , ¯ k◦r ; r ∈ Z Yk = X

ˆ (d−1) , k∈N

¯ h ; h+ ∼ k} with a specified order of enumeration. so that informally Yk = {X ¯ extends to the pair (X, Y ). The exchangeability of X For k = (k1 , . . . , kd ) ∈ N(d) , write Yˆk for the restriction of Y to sequences in k˜ of length < d, and note that 



d ˜k , Y , ˜ k , Yˆk ) = ˜k , X \ X (X X

k ∈ N(d) ,

¯ to sesince the two sides are restrictions of the same exchangeable array X ˜ ˆ ˜ k , Y ), quences in Z− ∪ {k} and Z, respectively. Since also σ(Yk ) ⊂ σ(X \ X Lemma 8.10 yields   ˜k ⊥ (23) X ⊥ X \ X˜ k , Y , k ∈ N(d) . Yˆk

ˆ (d−1) are representable, there exist a U-array Since exchangeable arrays on N ˜ d−1 and a measurable function f , such that ξ on N (Xk , Yk ) = f (ξˆk ),

ˆ (d−1) . k∈N

(24)

By Theorems 8.17 and 8.20 we may choose ξ⊥⊥X  ,Y X, where X  denotes the ˆ (d−1) , in which case (X, ξ) is again exchangeable by Lemma restriction of X to N 28.4. The same conditional independence yields ˜k ⊥ (25) X ⊥ ξ, k ∈ N(d). ˜k Y, X \ X

Using Theorem 8.12, we may combine (23) and (25) into 



˜k ⊥ X ⊥ X \ X˜ k , ξ , Yˆk

k ∈ N(d) ,

and (22) follows since Yˆk is ξˆk -measurable by (24).

2

Proof of Theorem 28.1 (i): The result for d = 0 is elementary and follows from Lemma 28.8 with η = 0. Proceeding by induction, suppose that all ˆ (d−1) are representable, and consider an exchangeable exchangeable arrays on N (d) ˜ d−1 and a ˆ array X on N . By Lemma 28.14, we may choose a U-array ξ on N measurable function f satisfying Xk = f (ξˆk ) a.s.,

ˆ (d−1) , k∈N

(26)


˜ k = (Xh ; h ∼ k) with and such that (X, ξ) is exchangeable and the arrays X k ∈ N(d) satisfy (22). Letting Pd be the group of permutations on Nd , we may write   ˜ k = Xk◦p ; p ∈ Pd , X 



ξˆk = ξk◦I ; I ⊂ Nd ,

k ∈ N(d) ,

˜ k , ξˆk ) are equally distributed by the where k ◦ I = (ki ; i ∈ I). The pairs (X exchangeability of (X, ξ), and in particular 



d ˜ k , ξˆk ), ˜ k◦p , ξˆk◦p = (X X

p ∈ Pd , k ∈ N(d) .

Since ξ is a U-array, the arrays ξˆk◦p are a.s. different for fixed k, and so the associated symmetry groups are a.s. trivial. Hence, Lemma 28.9 yields some ⊥ ξˆk and a measurable function h, such that U (0, 1) random variables ζk ⊥ 

Xk◦p = h ξˆk◦p , ζk



p ∈ Pd , k ∈ N(d) .

a.s.,

(27)

Now introduce a U-array ϑ⊥⊥ ξ on N↑d , and define Yk = h(ξˆk , ϑ˜ ), k





Y˜k = Yk◦p ; p ∈ Pd ,

k ∈ N(d) .

(28)

Comparing with (27) and using the independence properties of (ξ, ϑ), we get d ˜ k , ξˆk ), (Y˜k , ξˆk ) = (X  Y˜k ⊥ ⊥ Y \ Y˜k , ξ , ξˆk

k ∈ N(d) .

d

˜ k , ξ) for all k ∈ N(d) , and moreover By (22) it follows that (Y˜k , ξ) = (X ˜k ⊥ ⊥ξ (X \ X˜ k ), X ⊥ξ (Y \ Y˜k ), Y˜k ⊥

k ∈ N(d) .

˜ k with different k˜ are conditionally independent given ξ, Thus, the arrays X ˆ d−1 , and similarly for the arrays Y˜k . Letting X  be the restriction of X to N d which is ξ-measurable by (26), we obtain (X, ξ) = (X  , Y, ξ). Using (28) along with Corollary 8.18, we conclude that Xk = h(ξˆk , ηk ) a.s.,

k ∈ N(d) ,

for a U-array η⊥⊥ ξ on N↑d . Combining this with (26) yields the desired representation of X. 2 −−− Next we prove the equivalence criteria for jointly exchangeable arrays. Proof of Theorem 28.3, (i) ⇔ (ii): The implication (i) ⇒ (ii) is obvious, and the converse follows from Theorem 8.17. (iv) ⇒ (i): We may write condition (iv) as






f α, ξi , ξj , ζij = g α , ξi , ξj , ζij where

α = T (¯ α),

i = j,

(29) 

ζij = V α ¯ , ξ¯i , ξ¯j , ζ¯ij ,

with

α ¯ = (α, α ), ξ¯i = (ξi , ξi ), The symmetry of V and ζ¯ yields   ¯ , ξ¯i , ξ¯j , ζ¯ij ζ  = V α 

a.s., 

ξi = U (¯ α, ξ¯i ),

ij



ζ¯ij = (ζij , ζij ).



=V α ¯ , ξ¯j , ξ¯i , ζ¯ji = ζji ,

i = j.

Using the mapping property λ → λ of the functions T, U, V , and condition¯ we see that the variables ing first on the pair α ¯ and then on the array (¯ α, ξ), α , ξi , ζij are again i.i.d. U (0, 1). Condition (i) now follows from (29). 2

(i) ⇒ (iii): Assume (i), so that 







Xij = f α, ξi , ξj , ζij

= g α , ξi , ξj , ζij

i = j,

a.s.,

(30)

˜ 2 . Since the latter may be chosen for some U-arrays (α, ξ, ζ) and (α , ξ  , ζ  ) on N ¯ ζ) ¯ is again to be conditionally independent given X, the combined array (¯ α, ξ, jointly exchangeable by Lemma 28.4 (i). Hence, it may be represented as in ˜ 2 , so that a.s. for all Theorem 28.1 (i) in terms of a U-array (α , ξ  , ζ  ) on N i = j   ξ¯i = U¯ (α , ξi ), ζ¯ij = V¯ α , ξi , ξj , ζij , α ¯ = T¯(α ), for some functions T¯ = (T, T  ), U¯ = (U, U  ), and V¯ = (V, V  ) between suitable spaces. Since ζ¯ij = ζ¯ji , we may choose V¯ to be symmetric, and by Lemma 28.7 (ii) we may choose T, U, V and T  , U  , V  to preserve λ in the highest order arguments. Now (iii) follows by substitution into (30). (iii) ⇒ (iv): Condition (iii) may be written as 





f α , ξi , ξj , ζij = g α , ξi , ξj , ζij where



a.s.,

i = j,

(31)

α = T  (α),

ξi = U  (α, ξi ),

ζij = V  (α, ξi , ξj , ζij ),

(32)



ξi

ζij

(33)



α = T (α),



= U (α, ξi ),



= V (α, ξi , ξj , ζij ).

By the measure-preserving property of T  , U  , V  and T  , U  , V  along with ˜ 2. Lemma 28.7, the triples (α , ξ  , ζ  ) and (α , ξ  , ζ  ) are again U-arrays on N By Lemma 28.12 we may solve for (α, ξ, ζ) in (32) to obtain α = T (¯ α ),

ξi = U (¯ α , ξ¯i ),





ζij = V α ¯  , ξ¯i , ξ¯j , ζ¯ij ,

(34)

for some measurable functions T, U, V between suitable spaces, where α ¯  = (α , β),

ξ¯i = (ξi , ηi ),

ζ¯ij = (ζij , ϑij ),

for an independent U-array (β, η, ϑ). Substituting (34) into (33) and using (31), we get






α ), U¯ (¯ f α , ξi , ξj , ζij = g T¯(¯ α , ξ¯i ), U¯ (¯ α , ξ¯j ), V¯ α ¯  , ξ¯i , ξ¯j , ζ¯ij



,

where T¯, U¯ , V¯ result from composition of (T, U, V ) into (T  , U  , V  ). By Lemma 28.7 (ii) we can modify T¯, U¯ , V¯ to achieve the desired mapping property λ2 → λ, and by Lemma 28.12 (ii) we can choose V to be symmetric, so that V¯ becomes symmetric as well, by Lemma 28.11 (iii). 2 −−− Next we prove Theorem 28.1 (ii) by establishing a representation5 as in (i). Here the following construction plays a key role for the main proof. Define T˜  = T˜ \ {∅} and Z− = −Z+ = Z \ N. ˜ and Lemma 28.15 (key construction) Let X be a contractable array on Z, ˜ and n ∈ N define for J ∈ N 



XJn = X{n}∪(J+n) ,  YJn = YJ+n , Y{n}∪(J+n) .

˜ , YJ = XI∪J ; I ∈ Z −

˜ and the latter are Then the pairs (X, Y ) and (X n , Y n ) are contractable on N, equally distributed with

⊥n (X \ X n ), X n⊥ Y

n ∈ N.

(35)

Proof: It is enough to prove (35), the remaining assertions being obvious. Then write X = (Qn , Rn ) for all n ∈ N, where Qn denotes the restriction of X ˜ + n − 1 and Rn = X \ Qn . Defining to N pn (k) = k − (n − 1) 1{k < n},

k ∈ Z, n ∈ N,

and using the contractability of X, we obtain d

(Qn , Rn ) = X = X ◦ pn = (Qn , Rn ◦ pn ), and so Lemma 8.10 yields Rn ⊥⊥Rn ◦pn Qn , which amounts to 







⊥n X n , X n+1 , . . . ; X∅ . Y, X 1 , . . . , X n−1 ⊥ Y

Replacing n by n + 1 and noting that Y n+1 ⊂ Y n , we see that also 







⊥n X n+1 , X n+2 , . . . ; X∅ . Y, X 1 , . . . , X n ⊥ Y

Combining the last two relations, we get by Lemma 4.8 



⊥n Y, X 1 , . . . , X n−1 , X n+1 , . . . ; X∅ = X \ X n . Xn ⊥ Y

5

No direct proof of the extension property is known.

2


˜ d−1 are Lemma 28.16 (recursion) Suppose that all contractable arrays on N d ˜ representable, and let X be a contractable array on N . Then there exists a ˜ d−1 , such that (X, η) is contractable with σ(X∅ ) ⊂ σ(η∅ ), and U-array η on N the arrays ˜ d−1 , J ∈N XJn = X{n}∪(J+n) , ηJn = ηJ+n , 

satisfy



⊥ X n⊥ X \ X n, η , n η

n ∈ N.

(36)

Proof: Defining Y as in Lemma 28.15 in terms of a contractable extension ˜ we note that the pair (X∅ , Y ) is contractable on N ˜ d−1 . Then by of X to Z, hypothesis it has a representation X∅ = f (η∅ );

YJ = g(ˆ ηJ ),

˜ d−1 , J ∈N

(37)

˜ d−1 . By Theorems 8.17 for some measurable functions f, g and a U-array η on N and 8.21, we may extend the pairs (X, Y ) and (Y, η) to contractable arrays on ˜ Choosing Q. η⊥ (38) ⊥ X, X∅ , Y

˜ by Lemma 28.4. we see that (X, Y, η) becomes contractable on Q By Theorem 8.12 and Lemma 28.15, we have Xn ⊥ ⊥ (X \ X n ), X∅ , Y

n ∈ N.

Combining with (38) and using a conditional version of Lemma 4.8, we get 



Xn ⊥ ⊥ X \ X n, η , X∅ , Y

n ∈ N,

and since X∅ and Y are η-measurable by (37), it follows that 



X n⊥ X \ X n, η , ⊥ η

n ∈ N.

(39)

˜ by the index sets T˜ε with By the contractability of (X, η), we may replace Q  Tε = k>0 (k − ε, k], ε > 0. Letting ε → 0, we see from Corollary 9.26 that (39) ˜ d−1 . Since also X n ⊥⊥ηn η by Lemma 28.5, remains true with η restricted to N (36) follows by Theorem 8.12. 2 Proof of Theorem 28.1 (ii): For d = 0, the result is elementary and follows from Lemma 28.8 with η = 0. Now suppose that all contractable arrays on ˜ d . Choose ˜ d−1 are representable, and let X be a contractable array on N N d−1 d ˜ ˜ as in Lemma 28.16, and extend η to N by attaching a U-array η on N some independent U (0, 1) variables ηJ with |J| = d. Then (X, η) remains contractable, the element X∅ is η∅ -measurable, and the arrays6 XJn = X{n}∪(J+n) ,   ˜ d−1 , n ∈ N, η n = ηJ+n , η{n}∪(J+n) , J ∈N J

satisfy



η

6



X n⊥ X \ X n, η , ⊥ n

n ∈ N.

Note that these definitions differ slightly from those in Lemma 28.16.

(40)

28. Multi-variate Arrays and Symmetries

649

The pairs (X n , η n ) are equally distributed and inherit the contractability from (X, η). Furthermore, the elements of η n are independent for fixed n and uniformly distributed on [0, 1]2 . Hence, Corollary 28.13 yields a measurable ˜ d−1 , such that ⊥η n on N function G and some U-arrays ζ n ⊥ ηJn , ζˆJn ) a.s., XJn = G(ˆ

˜ d−1 , n ∈ N. J ∈N

By Theorem 8.17, we may choose 



ζ n = h X n , η n , ϑn ,

(41)

n ∈ N,

(42)

for a U-sequence ϑ = (ϑn )⊥ ⊥(X, η) and a measurable function h, so that by Proposition 8.20, ζn ⊥ ⊥ (η, ζ \ ζ n ), n ∈ N. n n X ,η

Using (40) and (42), along with the independence of the ϑn , we get (η, ζ \ ζ n ), X n⊥ ⊥ n η

n ∈ N.

Combining the last two relations, we get by Theorem 8.12 ζ n⊥ (η, ζ \ ζ n ), ⊥ n η

n ∈ N.

⊥ η , the same result yields Since also ζ ⊥ n

n

ζ n⊥ ⊥(η, ζ \ ζ n ),

n ∈ N.

By Lemma 4.8, the ζ n are then mutually independent and independent of η. ˜ d \ {∅}, given Now combine the arrays ζ n into a single U-array ζ⊥⊥ η on N by ˜ d−1 , n ∈ N, ζ{n}∪(J+n) = ζJn , J ∈ N ˜ d by attaching an independent U (0, 1) variable ζ∅ . Then (41) and extend ζ to N becomes ˜ d \ {∅}, XJ = F (ˆ ηJ , ζˆJ ) a.s., J ∈ N (43) where F is given for k = 1, . . . , d by







F (y, z)I ; I ⊂ Nk = G yI+1 , (y, z){1}∪(I+1) ; I ⊂ Nk−1 . Since X∅ is η∅ -measurable, we may extend (43) to J = ∅, through a suitable extension of F to [0, 1]2 , as in Lemma 1.14. By Lemma 28.8 with η = 0, we may ˜ d and a measurable function b : [0, 1] → [0, 1]2 , finally choose a U-array ξ on N such that ˜ d. (ηJ , ζJ ) = b(ξJ ) a.s., J ∈ N ˜ d , and a substitution into (43) yields Then (ˆ ηJ , ζˆJ ) = ˆb(ξˆJ ) a.s. for all J ∈ N 2 the desired representation XJ = f (ξˆJ ) a.s. with f = F ◦ ˆb. −−− We turn to a more detailed study of separately exchangeable arrays X on N2 . Define the shell σ-field of X as in Lemma 28.6. Say that X is dissociated if     Xij ; i ≤ m, j ≤ n ⊥ ⊥ Xij ; i > m, j > n , m, n ∈ N.

650

Foundations of Modern Probability

Theorem 28.17 (conditioning and independence) Let X be the separately exchangeable array on N2 given by (4), write S for the shell σ-field of X, and let X − be an exchangeable extension of X to Z2− . Then (i) S ⊂ σ(α, ξ, η) a.s., (ii) X⊥ ⊥S (α, ξ, η), (iii) the Xij are conditionally independent given S, (iv) X is i.i.d. ⇔ X⊥ ⊥ (α, ξ, η), (v) X is dissociated ⇔ X⊥ ⊥ α, (vi) X is S-measurable ⇔ X⊥⊥ ζ, (vii) L(X | X − ) = L(X | α) a.s. Proof: (i) Use Corollary 9.26. (ii) Apply Lemma 28.6 to the array X on N2 , extended to Z2+ by X0,0 = α,

Xi,0 = ξi ,

i, j ∈ N.

X0,j = ηj ,

(iii) By Lemma 8.7, the Xij are conditionally independent given (α, ξ, η). Now apply (i)−(ii) and Theorem 8.9. (iv) If X is i.i.d., then S is trivial by Kolmogorov’s 0−1 law, and (ii) yields X⊥⊥(α, ξ, η). Conversely, X⊥⊥(α, ξ, η) implies L(X) = L(X | α, ξ, η) by d Lemma 8.6, and so Lemma 8.7 yields X = Y with Yij = g(ζij ) = λ3 f (·, ·, ·, ζij ),

i, j ∈ N,

which shows that the Yij , and then also the Xij , are i.i.d. (v) Choose some disjoint, countable sets N1 , N2 , . . . ⊂ N, and write X n for the restriction of X to Nn2 . If X is dissociated, the sequence (X n ) is i.i.d. Applying Theorem 27.2 to the exchangeable sequence of pairs (α, X n ), we obtain a.s. L(X) = L(X 1 ) = L(X 1 | α) = L(X| α), and so X⊥ ⊥ α by Lemma 8.6. Conversely, X⊥ ⊥ α implies L(X) = L(X | α) a.s. d by the same lemma, and so Lemma 8.7 yields X = Y with Yij = g(ξi , ηj , ζij ) = λf (·, ξi , ηj , ζij ),

i, j ∈ N,

which shows that Y , and then also X, is dissociated. (vi) We may choose X to take values in [0, 1]. If X is S -measurable, then by (i)−(ii) and Lemma 8.7,


X = E(X| S) = E(X| α, ξ, η), and so X is (α, ξ, η) -measurable, which implies X⊥⊥ ζ. Conversely, X⊥⊥ ζ d implies L(X) = L(X | ζ) a.s. by Lemma 8.6, and so Lemma 8.7 yields X = Y with Yij = g(α, ξi , ηj ) = λf (α, ξi , ηj , ·), i, j ∈ N, which shows that Y is (α, ξ, η) -measurable. Applying (i)−(ii) to Y yields Y = E(Y | α, ξ, η) = E(Y | SY ), d

which shows that Y is SY -measurable. Hence, X = Y is SX -measurable. (vii) Define the arrays X n as in (v), and use X − to extend the sequence (X ) to Z− , so that the entire sequence is exchangeable on Z. Since X − is invariant under permutations on N, the exchangeability on N is preserved by conditioning on X − , and the X n with n > 0 are conditionally i.i.d. by the law of large numbers. This remains true under conditioning on α in the sequence of pairs (X n , α). By the uniqueness in Theorem 27.2, the two conditional distributions agree a.s., and so by exchangeability n

L(X| X − ) = L(X 1 | X − ) = L(X 1 | α) = L(X| α).

2

Corollary 28.18 (representations in special cases, Aldous) Let X be a separately exchangeable array on N2 with shell σ-field S. Then (i) X is dissociated iff it has an a.s. representation Xij = f (ξi , ηj , ζij ),

i, j ∈ N,

(ii) X is S-measurable iff it has an a.s. representation Xij = f (α, ξi , ηj ),

i, j ∈ N,

(iii) X is dissociated and S-measurable iff it has an a.s. representation Xij = f (ξi , ηj ),

i, j ∈ N.

Proof: (i) Use Theorem 28.17 (v). (ii) Proceed as in the proof of Theorem 28.17 (iv). 2

(iii) Combine (i) and (ii). −−−
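For orientation, here is a small simulation in the spirit of Corollary 28.18 (i): a dissociated, separately exchangeable array is generated from independent U(0, 1) variables ξ_i, η_j, ζ_ij through a fixed measurable function f, and separate exchangeability is checked crudely by comparing permutation-invariant statistics before and after a row permutation. The particular function f and the array size are arbitrary choices made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200

def f(x, y, z):
    # an arbitrary measurable function on [0, 1]^3
    return np.sin(6.0 * x * y) + 0.5 * (z - 0.5)

xi = rng.random(m)            # row variables xi_i
eta = rng.random(m)           # column variables eta_j
zeta = rng.random((m, m))     # entrywise noise zeta_ij
X = f(xi[:, None], eta[None, :], zeta)

# Permuting the rows gives an array with the same distribution, so
# summary statistics should agree up to sampling error.
perm = rng.permutation(m)
print("mean, std of X:          ", X.mean(), X.std())
print("mean, std of permuted X: ", X[perm].mean(), X[perm].std())
```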


We turn to the subject of rotational symmetries. Given a real-valued random array X = (X_ij) indexed by N², we introduce the rotated version

    Y_ij = Σ_{h,k} U_ih V_jk X_hk,   i, j ∈ N,

where U, V are unitary arrays on N² affecting only finitely many coordinates, hence orthogonal matrices of finite order n, extended to i ∨ j > n by U_ij = V_ij = δ_ij. Writing Y = (U ⊗ V)X for the mapping X → Y, we say that X is separately rotatable if (U ⊗ V)X =^d X for all U, V, and jointly rotatable if the same condition holds with U = V. The representations of separately and jointly rotatable arrays may be stated in terms of G-arrays, defined as arrays of independent N(0, 1) random variables.

Theorem 28.19 (rotatable arrays, Aldous, OK) Let X = (X_ij) be a real-valued random array on N². Then
(i) X is separately rotatable iff a.s.

    X_ij = σ ζ_ij + Σ_k α_k ξ_ki η_kj,   i, j ∈ N,

for some independent G-arrays (ξ_ki), (η_kj), (ζ_ij) and an independent set⁷ of random variables σ, α₁, α₂, ... with Σ_k α_k² < ∞ a.s.,
(ii) X is jointly rotatable iff a.s.

    X_ij = ρ δ_ij + σ ζ_ij + σ′ ζ_ji + Σ_{h,k} α_hk ( ξ_hi ξ_kj − δ_ij δ_hk ),   i, j ∈ N,

for some independent G-arrays (ξ_hi), (ζ_ij) and an independent set⁸ of random variables ρ, σ, σ′, and α_hk with Σ_{h,k} α_hk² < ∞ a.s.
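The following sketch samples a finite sub-array from the representation in part (i), with a fixed σ and finitely many coefficients α_k, and checks numerically that an orthogonal rotation of the rows leaves the empirical second moment of the entries unchanged. All concrete parameter values are illustrative assumptions, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 300, 3
sigma = 0.7
alpha = np.array([1.0, 0.5, 0.25])     # finitely many nonzero alpha_k

xi = rng.normal(size=(K, n))           # G-array (xi_ki)
eta = rng.normal(size=(K, n))          # G-array (eta_kj)
zeta = rng.normal(size=(n, n))         # G-array (zeta_ij)
X = sigma * zeta + np.einsum('k,ki,kj->ij', alpha, xi, eta)

# Random orthogonal matrix acting on the row index only (V = identity).
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
Y = U @ X

# Separate rotatability: entrywise second moments should agree,
# both close to sigma^2 + sum_k alpha_k^2.
print("E X_ij^2 (empirical):", (X ** 2).mean())
print("E Y_ij^2 (empirical):", (Y ** 2).mean())
print("sigma^2 + sum alpha_k^2 =", sigma ** 2 + (alpha ** 2).sum())
```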

In case (i), we note the remarkable symmetry (Xij ) = (Xji ). The centering term δij δhk in (ii) is needed to ensure convergence of the double sum9 . The simpler representation in (i) is essentially a special case. Here we prove only part (i), which still requires several lemmas. Lemma 28.20 (conditional moments) Let X be a separately rotatable array on N2 with rotatable extension X − to Z2− . Then 

 



E |Xij |p  X − < ∞ a.s.,

p > 0, i, j ∈ N.

Proof: For fixed i ∈ Z, we have Xij = σi ζij for all j by Theorem 14.3, where σi ≥ 0 and the ζij are independent of σi and i.i.d. N (0, 1). By the stronger 7 This means that the arrays ξ, η, ζ, (σ, α) are mutually independent; no independence is claimed between the variables σ, α1 , α2 , . . . . 8 Here ξ, ζ, (ρ, σ, σ  , α) are independent; the dependence between ρ, σ, σ  , α is arbitrary. 9 Comparing (ii) with Theorem 14.25, we recognize a double Wiener–Itˆo integral. To see the general pattern, we need to go to higher dimensions.

28. Multi-variate Arrays and Symmetries

653

Theorem 27.15, we have in fact (ζij )⊥ ⊥X − for i ≤ 0 and j > 0, and since σi is − X -measurable by the law of large numbers, we get  





E |Xij |p  X − = σip E|ζij |p < ∞ a.s.,

i ≤ 0, j ∈ N.

Hence, by rotatability, 





E |Xij |p  X − < ∞ a.s.,

i j < 0, p > 0.

(44)

Taking i = 1, we put ζ1j = ζj and X1j = σ1 ζj = ξj . Writing E( · |X − ) = and letting n ∈ N, we get by Cauchy’s inequality E X−







p/2

E X |ξj |p = E X ζj2 σ12  2 p/2 ξ 2 + · · · + ξ−n − = E X ζj2 −1 2 ζ 2 + · · · + ζ−n   −1 p 2  p 1/2 ζ − j X− 2 2 ≤ EX E ξ + · · · + ξ . −1 −n 2 2 ζ−1 + · · · + ζ−n For the second factor on the right, we get a.s. by Minkowski’s inequality and (44)  p − − 2 2 + · · · + ξ−n ≤ np E X |ξ−1 |2p < ∞. E X ξ−1 The first factor is a.s. finite for n > 2p, since its expected value equals 

E

2 ζ−1

p  −p ζj2 2 2 = E|ζj |2p E ζ−1 + · · · + ζ−n 2 + · · · + ζ−n ∞ 2 < r−2p e−r /2 rn−1 dr < ∞.

0

2

Lemma 28.21 (arrays of product type, Aldous) Let X be a separately rotatable array of the form Xij = f (ξi , ηj ), i, j ∈ N, for a measurable function f on [0, 1]2 and some independent U-sequences (ξi ) and (ηj ). Then a.s. Xij = Σ k ck ϕki ψkj , i, j ∈ N, (45)

for some independent G-arrays (ϕki ), (ψkj ) on N2 and constants ck with Σk c2k < ∞. ¯ be a stationary extension of X to Z2 with restriction X − to Proof: Let X and note that X − ⊥ ⊥X by the independence of all ξi and ηj . Hence, Lemma 28.20 yields EXij2 < ∞ for all i, j, so that f ∈ L2 (λ2 ). By a standard Hilbert space result10 , there exist some ortho-normal sequences (gk ), (hk ) in L2 (λ) and constants ck > 0 with Σk c2k < ∞, such that Z2− ,

f (x, y) = Σ k ck gk (x) hk (y) in L2 , and (45) follows with 10

Cf. Lemma A3.1 in K(05), p. 450.

x, y ∈ [0, 1],

654

Foundations of Modern Probability ϕki = gk (ξi ), ψkj = hk (ηj ),

i, j, k ∈ N.

We need to show that the variables ϕk = gk (ξ) and ψk = hk (η) are i.i.d. N (0, 1), where ξ1 = ξ and η1 = η. By Fubini’s theorem, 0

1



dx Σ k c2k gk (x)

2

= Σ k c2k gk 2 < ∞,

and so the sum on the left converges a.e. λ, and we may modify the gk (along with f ) on a null set, to ensure that

Σ k c2k



gk (x)

2

< ∞,

x ∈ [0, 1].

We may then define Y (x) = f (x, η) = Σ k ck gk (x) ψk ,

x ∈ [0, 1],

2

with convergence in L . The array X = (Xij ) remains rotatable in j under conditioning on the invariant variables ξi , and so by Lemma 8.7 (i) and Theorem 27.15 the variables Y (x1 ), . . . , Y (xn ) are jointly centered Gaussian for x1 , . . . , xn ∈ [0, 1] a.e. λn . Defining a measurable map G : [0, 1] → l2 by G(x) = {ck gk (x)} and putting μ = λ ◦ G−1 , we see from Lemma 1.19 that



λ x ∈ [0, 1]; G(x) ∈ supp μ = μ(supp μ) = 1. We can then modify the functions gk (along with Y and f ) on a null set, to make G(x) ∈ supp μ hold identically. Since the set of Gaussian distributions on Rn is weakly closed, we conclude that Y is centered Gaussian. Now let H1 be the linear subspace of l2 spanned by the sequences {ck gk (x)}, so that for any (yk ) ∈ H1 the sum Σk yk ψk converges in L2 to a centered Gaussian limit. If instead (yk )⊥ H1 in l2 , we have Σk ck yk gk (x) ≡ 0, which implies Σk ck yk gk = 0 in L2 . Hence, ck yk ≡ 0 by ortho-normality, and so ¯ 1 = l2 , and so the sums Σ yk ψk are centered Gaussian yk ≡ 0. This gives H k 2 for all (yk ) ∈ l . The ψk are then jointly centered Gaussian by Corollary 6.5. For the remaining properties, interchange the roles of i and j, and use the ortho-normality and independence. 2 Lemma 28.22 (totally rotatable arrays, Aldous) Let X be a dissociated, separately rotatable array on N2 with shell σ-field S, such that E(X| S) = 0. Then the Xij are i.i.d. centered Gaussian. Proof (OK): For any rotation R of rows or columns and indices h = k, we get by Lemma 8.8, Proposition 28.17 (iii), and the centering assumption



 



d



 



d



 



E (RX)k  S = E Xk  S = 0,  

E (RX)h (RX)k  S = E Xh Xk  S



= E(Xh | S) E(Xk | S) = 0.

28. Multi-variate Arrays and Symmetries

655

Letting T be the tail σ-field in either index, and noting that T ⊂ S, we get by the tower property of conditioning 

 = 0, 

 E (RX)h (RX)k  T = 0.

E (RX)k  T

Now L(X | T ) is a.s. Gaussian by Theorem 27.15. Since the Gaussian property is preserved by rotations, Lemma 14.1 yields L(RX | T ) = L(X | T ), and so X is conditionally rotatable in both indices, hence i.i.d. N (0, 1). Thus, Xij = σ ζij a.s. for a G-array (ζij ) and an independent random variable σ ≥ 0. Since X is dissociated, we get for h, k ∈ N2 differing in both indices  

 

E σ 2 E|ζh |·E|ζk | = E Xh Xk  = E|Xh |·E|Xk | = (E σ)2 E|ζh |·E|ζk |, and so Eσ 2 = (Eσ)2 , which yields Var(σ) = 0, showing that σ is a constant. 2 Proof of Theorem 28.19 (i): If X is separately rotatable on N2 , it is also separately exchangeable and hence representable as in (4), in terms of a Uarray (α, ξ, η, ζ). The rotatability on N2 is preserved by conditioning on an exchangeable extension X − to Z2− , and by Proposition 28.17 (vii) it is equivalent to condition on α, which makes X dissociated. By Theorem A1.3, it is enough to prove the representation under this conditioning, and so we may assume that X is representable as in Corollary 28.18 (i). Then E|Xij |p < ∞ for all p > 0 by Lemma 28.20. Letting S be the shell σ-field of X, we may decompose X into the arrays X  = E(X| S),

X  = X − E(X| S),

which are again separately rotatable by Lemma 8.8. Writing g(a, x, y) = λf (a, x, y, ·), we have by Propositions 8.7 (i), 8.9, and 28.17 (i)−(ii) the a.s. representations Xij = g(ξi , ηj ), i, j ∈ N, Xij = f (ξi , ηj , ζij ) − g(ξi , ηj ), and in particular X  and X  are again dissociated. By Lemma 28.21, X  is representable as in (45). Furthermore, the Xij are i.i.d. centered Gaussian by Lemma 28.22, since clearly E(X  | S) = 0. Then X  ⊥⊥ X  by Proposition 28.17 (iv), and we may choose the underlying G-arrays to be independent. 2 −−− Exchangeable arrays can be used to model the interactions between nodes in a random graph or network, in which case Xij represents the directed interaction between nodes i, j. For an adjacency array we have Xij ∈ {0, 1}, where Xij = 1 when there is a directed link from i to j.

656

Foundations of Modern Probability

We turn to the related subject of random partitions of a countable set I, represented by arrays Xij = κi 1{i ∼ j},

i, j ∈ I,

where i ∼ j whenever i, j belong to the same random subset of I, and κi = κj is an associated random mark, taking values in a Borel space S. Given a class d P of injective maps on I, we say that X is P-symmetric if X ◦ p = X for all p ∈ P, where (X ◦ p)ij = Xpi ,pj . In particular, the partition is exchangeable if the array X = (Xij ) is jointly exchangeable. We give an extended version of the celebrated paintbox representation. Theorem 28.23 (symmetric partitions, Kingman, OK) Let P be a class of injective maps on a countable set I, and let the array X = (Xij ) on I 2 represent a random partition of I with marks in a Borel space S. Then X is P-symmetric iff (46) Xij = b(ξi )1{ξi = ξj }, i, j ∈ I, for a measurable function b : R → S and a P-symmetric sequence of random variables ξ1 , ξ2 , . . . . Though contractable arrays on N2 may not be exchangeable in general, the two properties are equivalent when X is the indicator array of a random partition on N. The equivalence fails for partitions of N2 . Corollary 28.24 (contractable partitions) Let X be a marked partition of N. Then X is exchangeable ⇔ X is contractable. Proof: Use Theorem 28.23, together with the equivalence of exchangeabil2 ity and contractability for a single sequence (ξn ). Our proof of Theorem 28.23 is based on an elementary algebraic fact, where for any partition of N with associated indicator array rij = 1{i ∼ j}, we define a mapping m(r) : N → N by



mj (r) = min i ∈ N; i ∼ j ,

j ∈ N.

Lemma 28.25 (lead elements) For any indicator array r on N2 and injective map p on N, there exists an injection q on N, depending measurably on r and p, such that q ◦ mj (r ◦ p) = mpj (r), j ∈ N. Proof: Fix any indicator array r with associated partition classes A1 , A2 , . . . , listed in the order of first entry, and define Bk = p−1 Ak , ak = min Ak ,





K = k ∈ N; Bk = ∅ ,

bk = min Bk ,

q(bk ) = ak ,

k ∈ K.

28. Multi-variate Arrays and Symmetries

657

For any j ∈ N, we note that j ∈ Bk = p−1 Ak and hence pj ∈ Ak for some k ∈ K, and so

q mj (r ◦ p) = q(bk ) = ak = mpj (r). To extend q to an injective map on N, we need to check that |J c | ≤ |I c |, where     J = bk ; k ∈ K . I = ak ; k ∈ K , Noting that |Bk | ≤ |Ak | for all k since p is injective, we get 



∪ Bk \ {bk } k ∈ K  ≤ ∪ Ak \ {ak } ≤ |I c |. k∈K

|J c | =

2

Proof of Theorem 28.23: We may take I = N. The sufficiency is clear from (46), since (X ◦ p)ij = Xpi ,pj = b(ξpi ) 1{ξpi = ξpj }, i, j ∈ N. To prove the necessity, let X be P-symmetric, put kj (r) = rjj , choose ⊥X to be i.i.d. U (0, 1), and define ϑ1 , ϑ 2 , . . . ⊥ ηj = ϑ ◦ mj (X), κj = kj (X) = Xjj ,

j ∈ N.

(47)

For any p ∈ P, Lemma 28.25 yields a random injection q(X) on N, such that q(X) ◦ mj (X ◦ p) = m(X) ◦ pj ,

j ∈ N.

(48)

Using (47) and (48), the P-symmetry of X, the independence ϑ⊥⊥X, and Fubini’s theorem (four times), we get for any measurable function f ≥ 0 on ([0, 1] × S)∞ 







Ef η ◦ p, κ ◦ p = Ef ϑ ◦ m(X) ◦ p, k(X) ◦ p



= Ef ϑ ◦ q(X) ◦ m(X ◦ p), k(X ◦ p)  



 

= E Ef ϑ ◦ q(r) ◦ m(r ◦ p), k(r ◦ p)  r=X 



 = E Ef ϑ ◦ m(r ◦ p), k(r ◦ p)  r=X



= Ef ϑ ◦ m(X ◦ p), k(X ◦ p)  



 



 

= E Ef t ◦ m(X ◦ p), k(X ◦ p)  t=ϑ

 

= E Ef t ◦ m(X), k(X)  t=ϑ



= Ef ϑ ◦ m(X), k(X) = Ef (η, κ), d

which shows that (η, κ) ◦ p = (η, κ).


Now choose a Borel isomorphism h : [0, 1] × S → B ∈ B with inverse (b, c) : B → [0, 1] × S, and note that the B -valued random variables ξj = h(κj , ηj ) are again P-symmetric. Now (46) follows from ξi = ξj ⇔ ⇔ ⇔ ⇔

(κi , ηi ) = (κj , ηj ) ηi = η j mi (X) = mj (X) i ∼ j.

2
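To make the paintbox picture concrete, the following sketch draws an exchangeable random partition of {1, ..., n} by sampling i.i.d. labels from a fixed "paintbox" (a few atoms plus a dust part in which every element becomes a singleton) and grouping equal labels. The particular atom sizes are an illustrative assumption.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)
n = 10000

# Paintbox: atoms of sizes 0.5, 0.2, 0.1, and a 0.2 "dust" part.
atoms = [0.5, 0.2, 0.1]
cum = np.cumsum(atoms)

labels = []
for _ in range(n):
    u = rng.random()
    k = np.searchsorted(cum, u)
    if k < len(atoms):
        labels.append(("atom", k))             # painted with colour k
    else:
        labels.append(("dust", rng.random()))  # a.s. distinct label

blocks = Counter(labels)
sizes = sorted((v for v in blocks.values() if v > 1), reverse=True)
print("relative sizes of non-singleton blocks:", [s / n for s in sizes])
print("fraction of singletons:",
      sum(1 for v in blocks.values() if v == 1) / n)
```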

Exercises 1. Show how the de Finetti–Ryll-Nardzewski theorem can be obtained as a special case of Theorem 28.1. 2. Explain the difference between the representations in (2) and (4). Then write out explicitly the corresponding three-dimensional representations, and again explain in what way they differ. 3. State a coding representation of a single random element in a Borel space, and specify when two coding functions f, g give rise to random elements with the same distribution. 4. Show how the one-dimensional coding representation with associated equivalence criteria follow from the corresponding two-dimensional results. 5. Give an example of a measure-preserving map f on [0, 1] that is not invertible. Then show how we can construct an inversion by introducing an extra randomization variable. (Hint: For a simple example, let f (x) = 2 x (mod 1) for x ∈ [0, 1]. Then introduce a Bernoulli variable to choose between the intervals [0, 12 ] and ( 12 , 1].) 6. Let Xij = f (α, ξi , ηj ) for some i.i.d. random variables α, ξ1 , ξ2 , . . ., and η1 , η2 , . . . . Show that the Xij cannot be i.i.d., non-degenerate. (Hint: This can be shown from either Corollary 28.18 or Theorem 28.3.) 7. Express the notions of separate and joint rotatability as invariance under unitary transformations on a Hilbert space H. Then write the double sum in Theorem 28.19 (ii) in terms of a double Wiener-Itˆo integral of a suitable isonormal Gaussian process. Similarly, write the double sum in part (i) as a tensor product of two independent isonormal Gaussian processes on H. 8. Show that all sets in an exchangeable partition have cardinality 1 or ∞, and give the probabilities for a fixed element to belong to a set of each type. Further give the probability for two fixed elements to belong to the same set. (Hint: Use the paintbox representation.) 9. Say that a graph is exchangeable if the associated adjacency matrix is jointly exchangeable. Show that in an infinite, exchangeable graph, a.s. every node is either disconnected or has infinitely many links to other nodes. (Hint: Use Lemma 30.9.) 10. Say that two nodes in a graph are connected, if one can be reached from the other by a sequence of links. Show that this defines an equivalence relation between sites. For an infinite, exchangeable graph, show that the associated partition is exchangeable. (Hint: Note that the partition is exchangeable and hence has a paintbox representation.)

IX. Random Sets and Measures In Chapter 29 we show how the entire excursion structure of a regenerative process can be described by a Poisson process on the time scale given by an associated local time process. We further study the local time as a function of the space variable, and initiate the study of continuous additive functionals. In Chapter 30 we discuss the weak and strong Poisson or Cox convergence under superposition and scattering of point processes, highlighting the relationship between the asymptotic properties of the processes and smoothing properties of the underlying transforms. Finally, in Chapter 31 we explore the basic notion of Palm measures, first for stationary point processes on the real line, and then for general random measures on an abstract Borel space. In both cases, we prove some local approximations and develop a variety of duality relations. A duality argument also leads to the notion of Gibbs kernel, related to ideas in statistical mechanics. For a first encounter, we recommend especially a careful study of the semi-martingale local time and excursion theory in Chapter 29, along with some basic limit theorems from Chapter 30. −−− 29. Local time, excursions, and additive functionals. This chapter provides three totally different but essentially equivalent approaches to local time. Our first approach is based on Tanaka’s formula in stochastic calculus and leads to an interpretation of local time as an occupation density. The second approach extends to a fundamental representation of the entire excursion structure of a regenerative process in terms of a Poisson process on the local time scale. The third approach is based on the potential theory of continuous additive functionals. 30. Random measures, smoothing and scattering. Here we begin with the weak or strong Poisson convergence, for superpositions of independent, uniformly small point processes. Next, we consider the Cox convergence of dissipative transforms of suitable point processes, depending on appropriate averaging and smoothing properties of the transforms. Here a core result is the celebrated theorem of Dobrushin, based on conditions of asymptotic invariance. We finally explore some Cox criteria of stationary line and flat processes, and study the basic cluster dichotomy for spatial branching processes. 31. Palm and Gibbs kernels, local approximation. Here we begin with a study of Palm measures of stationary point processes on Euclidean spaces, highlighting the dual relationship between stationarity in discrete and continuous time. Next, we consider the Palm measures of random measures on an abstract Borel space, and prove some basic local approximations. Further highlights include a conditioning approach to Palm measures and a dual approach via conditional densities. Finally, we discuss the dual notions of Gibbs and Papangelou kernels, of significance in statistical mechanics. 659

Chapter 29

Local Time, Excursions, and Additive Functionals

Semi-martingale local time, Tanaka's formula, space-time regularity, occupation measure and density, extended Itô formula, regenerative sets and processes, excursion law, excursion local time and Poisson process, approximations of local time, inverse local time as a subordinator, Brownian excursion, Ray–Knight theorem, continuous additive functionals, Revuz measure, potentials, excessive functions, additive-functional local time, additive functionals of Brownian motion

Local time is yet another indispensable notion of modern probability, playing key roles in both stochastic calculus and excursion theory, and also needed to represent continuous additive functionals of suitable Markov processes. Most remarkably, it appears as the occupation density of continuous semi-martingales, and it provides a canonical time scale for the family of excursions of a regenerative process. In each of the mentioned application areas, there is an associated method of construction, which leads to three different but essentially equivalent versions of the local time process. We begin with the semi-martingale local time, constructed from Tanaka's formula, and leading to a useful extension of Itô's formula, and to an interpretation of local time as an occupation density. Next we develop excursion theory for processes that are regenerative at a fixed state, and prove the powerful Itô representation, involving a Poisson process of excursions on the time scale given by an associated local time. Among its many applications, we consider a version of the Ray–Knight theorem, describing the spatial variation of Brownian local time. Finally, we study continuous additive functionals (CAFs) and their potentials, prove the existence of local time at a regular point, and show that any CAF of a one-dimensional Brownian motion is a mixture of local times.
  The beginning of the chapter may be regarded as a continuation of the stochastic calculus developed in Chapter 18. The present excursion theory continues the elementary discussion of the discrete-time case in Chapter 12. Though the theory of CAFs is formally developed for Feller processes, few results are needed from Chapter 17, beyond the strong Markov property and its integrated version in Corollary 17.19. Both semi-martingale local time and excursion theory will be useful in Chapter 33 to study one-dimensional SDEs and diffusions. Our discussion of CAFs of Brownian motion with associated potentials will continue at the end of Chapter 34.


For the stochastic calculus approach to local time, let X be a continuous semi-martingale in R. The semi-martingale local time L⁰ of X at 0 may be defined by Tanaka's formula

    L⁰_t = |X_t| − |X_0| − ∫_0^t sgn(X_s −) dX_s,   t ≥ 0,    (1)

where sgn(x−) = 1_{(0,∞)}(x) − 1_{(−∞,0]}(x). More generally, we define the local time L^x_t at a point x ∈ R as the local time at 0 of the process X_t − x. Note that the stochastic integral on the right exists, since the integrand is bounded and progressive. The process L⁰ is clearly continuous and adapted with L⁰_0 = 0. For motivation, we note that if Itô's rule could be applied to the function f(x) = |x|, we would formally obtain (1) with L⁰_t = ∫_{s≤t} δ(X_s) d[X]_s.
  We state the basic properties of local time at a fixed point. Say that a non-decreasing function f is supported by a Borel set A, if the associated measure μ satisfies μA^c = 0. The support of f is the smallest closed set with this property.

Theorem 29.1 (semi-martingale local time) Let X be a continuous semi-martingale with local time L⁰ at 0. Then a.s.
(i) L⁰ is non-decreasing, continuous, and supported by Ξ = { t ≥ 0; X_t = 0 },
(ii) L⁰_t = ( −|X_0| − inf_{s≤t} ∫_0^s sgn(X−) dX ) ∨ 0,   t ≥ 0.

The proof of (ii) depends on an elementary observation. Lemma 29.2 (supporting function, Skorohod) Let f be a continuous function f on R+ with f0 ≥ 0. Then gt = − inf fs ∧ 0 = sup(−fs ) ∨ 0, s≤t

s≤t

t ≥ 0,

is the unique non-decreasing, continuous function g on R+ with g0 = 0, satis fying h ≡ f + g ≥ 0, 1{h > 0} dg = 0, Proof: The given function has clearly the stated properties. To prove the uniqueness, suppose that both g and g  have the stated properties, and put h = f + g and h = f + g  . If gt < gt for some t > 0, define s = sup{r < t; gr = gr }, and note that h ≥ h − h = g  − g > 0 on (s, t]. Hence, gs = gt , and so 2 0 < gt − gt ≤ gs − gs = 0, a contradiction. Proof of Theorem 29.1: (i) For any constant h > 0, we may choose a convex function fh ∈ C 2 with  −x, x ≤ 0, fh (x) = x − h, x ≥ h.

29. Local Time, Excursions, and Additive Functionals

663

Then Itˆo’s formula gives, a.s. for any t ≥ 0, Yth ≡ fh (Xt ) − fh (X0 ) −

1 2

=

t

0



t

0

fh (Xs ) dXs

fh (Xs ) d[X]s .

As h → 0, we have fh (x) → |x| and fh → sgn(x−). By Corollary 18.13 and ∗ P dominated convergence, we get (Y h −L0 )t → 0 for every t > 0. Now (i) follows, h since the processes Y are non-decreasing with



0





1 Xs ∈ / [0, h] dYsh = 0 a.s.,

h > 0. 2

(ii) This is clear from Lemma 29.2.

This leads in particular to a basic relationship between a Brownian motion, its maximum process, and its local time at 0. The result extends the more elementary Proposition 14.13. Corollary 29.3 (local time and maximum process, L´evy) Let L0 be the local time at 0 of a Brownian motion B, and define Mt = sups≤t Bs . Then 

Proof: Define Bt = −



d

L0 , |B| = (M, M − B).

s≤t

Mt = sup Bs ,

sgn(Bs −) dBs ,

s≤t

and conclude from (1) and Theorem 29.1 (ii) that L = M  and |B| = L0 −B  = d 2 M  − B  . Further note that B  = B by Theorem 19.3. 0

For any continuous local martingale M , the associated local time Lxt has a jointly continuous version. More generally, we have for any continuous semimartingale: Theorem 29.4 (spatial regularity, Trotter, Yor) Let X be a continuous semimartingale with canonical decomposition M + A. Then the local time L = (Lxt ) of X has a version that is rcll in x, uniformly for bounded t, and satisfies =2 Lxt − Lx− t



t 0

x ∈ R, t ∈ R+ .

1{Xs = x} dAs ,

Proof: By the definition of L, we have for any x ∈ R and t ≥ 0 Lxt = |Xt − x| − |X0 − x| −



0

t





sgn Xs − x − dMs −

0

t





sgn Xs − x − dAs .

By dominated convergence, the last term has the required continuity properties, with discontinuities given by the displayed formula. Since the first two

664

Foundations of Modern Probability

terms are trivially continuous in x and t, it remains to show that the first integral term, henceforth denoted by Itx , has a jointly continuous version. By localization we may then assume that the processes X − X0 , [M ]1/2 ,  and |dA| are all bounded by some constant c. Fixing any p > 2 and using Theorem 18.7, we get1 for any x < y 

E Ix − Iy

∗p



≤ 2p E 1(x,y] (X) · M

t



∗p t

p/2

< E 1(x,y] (X) · [M ]

t

.

To estimate the integral on the right, put y − x = h, and choose f ∈ C 2 with f  ≥ 2 · 1(x,y] and |f  | ≤ 2h. Then by Itˆo’s formula, 1(x,y] (X) · [M ] ≤ 12 f  (X) · [X] = f (X) − f (X0 ) − f  (X) · X 



≤ 4 c h + f  (X) · M ,

and another use of Theorem 18.7 yields

E f  (X) · M

∗p/2 t



< E {f  (X)}2 · [M ]

≤ (2 c h)

p/2

p/4 t

.

Combining the last three estimates gives E(I x − I y )∗p (c h)p/2 , and since t <

p/2 > 1, the desired continuity follows by Theorem 4.23. 2 We may henceforth take L to be a regularized version of the local time. Given a continuous semi-martingale X, write ξt for the associated occupation measures, defined as below with respect to the quadratic variation process [X]. We show that ξt  λ a.s. for every t ≥ 0, and that the set of local times Lxt provides the associated densities. This also leads to a simultaneous extension of the Itˆo and Tanaka formulas. Since every convex function f on R has a non-decreasing and left-continuous left derivative f− (x) = f  (x−), Theorem 2.14 yields an associated measure μf on R satisfying μ [x, y) = f  (y) − f  (x) x ≤ y in R. −

f



More generally, when f is the difference of two convex functions, μf becomes a locally finite, signed measure. Theorem 29.5 (occupation density, Meyer, Wang) Let X be a continuous semi-martingale with right-continuous local time L. Then outside a fixed P null set, we have (i) for any measurable function f ≥ 0 on R,

0 1



t

f (Xs ) d[X]s =

Recall the notation (X · Y )t =

t 0

Xs dYs .



−∞

f (x) Lxt dx,

t ≥ 0,

29. Local Time, Excursions, and Additive Functionals

665

(ii) when f is the difference between two convex functions, f (Xt ) − f (X0 ) =



t

0

f− (X) dX +

1 2

∞ −∞

Lxt μf (dx),

t ≥ 0.

In particular, Theorem 18.18 extends to functions f ∈ CR1 , such that f  is absolutely continuous with density f  . Note that (ii) remains valid for the left-continuous version of L, provided we replace f− (X) by the right derivative f+ (X). Proof: (ii) For f (x) ≡ |x − a|, this reduces to the definition of Lat . Since the formula is also trivially true for affine functions f (x) ≡ ax + b, it extends by linearity to the case where μf is supported by a finite set. By linearity and suitable truncation, it remains to prove the formula when μf is positive with bounded support and f (−∞) = f  (−∞) = 0. Then for any n ∈ N, we define the functions   gn (x) = f  2−n [2n x]− ,

fn (x) =

x

−∞

gn (u) du,

x ∈ R,

and note that (ii) holds for all fn . As n → ∞ we get fn (x−) = gn (x−) ↑ f− (x), P and so by Corollary 18.13 we have fn (X−) · X → f− (X) · X. Further note that  f n → f by monotone convergence. It remains to show that Lxt μfn (dx) → Lxt μf (dx). Then let h be any bounded, right-continuous function on R, and note that μfn h = μf hn with hn (x) = h(2−n [2n x + 1]). Since hn → h, we get μf hn → μf h by dominated convergence. (i) Comparing (ii) with Itˆo’s formula, we obtain the stated relation for fixed t ≥ 0 and f ∈ C. Since both sides define random measures on R, a simple approximation yields ξt = Lt · λ, a.s. for fixed t ≥ 0. By the continuity of each side, we may choose the exceptional null set to be independent of t. If f ∈ C 1 with f  as stated, then (ii) applies with μf (dx) = f  (x) dx, and the last assertion follows by (i). 2 In particular, the occupation measure of X at time t, given by

ηt A =

t

0

1A (Xs ) d[X]s ,

A ∈ BR , t ≥ 0,

(2)

is a.s. absolutely continuous with density Lt . This leads to a spatial approximation of L, which may be compared with the temporal approximation in Proposition 29.12 below. Corollary 29.6 (spatial approximation) Let X be a continuous semi-martingale in R with local time L and occupation measures ηt . Then outside a fixed P -null set, we have Lxt = lim h−1 ηt [x, x + h), h→0

t ≥ 0, x ∈ R.

666

Foundations of Modern Probability 2

Proof: Use Theorem 29.5 and the right-continuity of L.

Local time processes also arise naturally in the context of regenerative processes. Here we consider an rcll process X in a Polish space S, adapted to a right-continuous and complete filtration F. Say that X is regenerative at a state a ∈ S, if it satisfies the strong Markov property at a, so that for any optional time τ , 





 L θτ X  Fτ = Pa a.s. on





τ < ∞, Xτ = a ,

for some probability measure Pa on the path space D R+ ,S of X. In particular, a strong Markov process is regenerative at every point. The associated random set Ξ = {t ≥ 0; Xt = a} is regenerative in the same sense. By the right-continuity of X, we note that Ξ * tn ↓ t implies t ∈ Ξ, ¯ \ Ξ are isolated from the right. In particular, which means that all points of Ξ the first-entry times τr = inf{t ≥ r; t ∈ Ξ} lie in Ξ, a.s. on {τr < ∞}. Since ¯ c is open and hence a countable union of disjoint intervals (u, v), it also (Ξ) follows that Ξc is a countable union of intervals (u, v) or [u, v). If we are only interested in regeneration at a fixed point a, we may further assume that X0 = a or 0 ∈ Ξ a.s. The mentioned properties will be assumed below, even when we make no reference to an underlying process X. We first classify the regenerative sets into five different categories. Recall ¯ o = ∅, and perfect if it is that a set B ⊂ R+ is said to be nowhere dense if (B) closed with no isolated points. Theorem 29.7 (regenerative sets) For a regenerative set Ξ, exactly one of these cases occurs a.s.: (i) Ξ is locally finite, (ii) Ξ = R+ , (iii) Ξ is a locally finite union of disjoint intervals with i.i.d., exponentially distributed lengths, ¯ = supp(1Ξ λ), (iv) Ξ is nowhere dense with no isolated points, and Ξ (v) Ξ is nowhere dense with no isolated points, and λΞ = 0. Case (iii) is excluded when Ξ is closed. In case of (i), Ξ is a generalized renewal process based on a distribution μ on (0, ∞]. In cases (ii)−(iv), Ξ is known as a regenerative phenomenon, whose distribution is determined by the p -function p(t) = P {t ∈ Ξ}. Our proof of Theorem 29.7 is based on the following dichotomies: Lemma 29.8 (local dichotomies) For a regenerative set Ξ, ¯ a.s., ¯ o = ∅ a.s., or Ξo = Ξ (i) either (Ξ) (ii) either a.s. all points of Ξ are isolated, or a.s. none of them is, ¯ a.s. (iii) either λΞ = 0 a.s., or supp(Ξ · λ) = Ξ

29. Local Time, Excursions, and Additive Functionals

667

Proof: We may take F to be the right-continuous, complete filtration induced by Ξ, which allows us to use a canonical notation. For any optional time τ , the regenerative property yields

P {τ = 0} = E P (τ = 0 | F0 ); τ = 0



= (P {τ = 0})2 , and so P {τ = 0} = 0 or 1. If σ is another optional time, then τ  = σ + τ ◦ θσ is again optional by Proposition 11.8, and so for any h ≥ 0,





P τ  − h ≤ σ ∈ Ξ = P τ ◦ θσ ≤ h, σ ∈ Ξ



= P {τ ≤ h} P {σ ∈ Ξ}, which implies L(τ  − σ | σ ∈ Ξ) = P ◦ τ −1 . In particular, τ = 0 a.s. gives τ  = σ a.s. on {σ ∈ Ξ}. (i) The previous observations apply to the optional times τ = inf Ξc and σ = τr . If τ > 0 a.s., then τ ◦ θτr > 0 a.s. on {τr < ∞}, and so τr ∈ Ξo a.s. on ¯ we obtain Ξ ¯ = Ξo a.s. the same set. Since the set {τr ; r ∈ Q+ } is dense in Ξ, If instead τ = 0 a.s., then τ ◦ θτr = 0 a.s. on {τr < ∞}, and so τr ∈ Ξc a.s. on ¯ ⊂ Ξc a.s., and so Ξc = R+ a.s. It remains to note that the same set. Hence, Ξ ¯ c , since Ξc is a disjoint union of intervals (u, v) or [u, v). Ξc = (Ξ) (ii) Put τ = inf(Ξ \ {0}). If τ = 0 a.s., then τ ◦ θτr = 0 a.s. on {τr < ∞}. Since every isolated point of Ξ equals τr for some r ∈ Q+ , it follows that Ξ has a.s. no isolated points. If instead τ > 0 a.s., we define recursively some optional times σn by σn+1 = σn + τ ◦ θσn , starting with σ1 = τ . Then the σn form a renewal process based on L(τ ), and so σn → ∞ a.s. by the law of large numbers. Thus, Ξ = {σn < ∞; n ∈ N} a.s. with σ0 = 0, and a.s. all points of Ξ are isolated. (iii) Let τ = inf{t > 0; (1Ξ · λ)t > 0}. If τ = 0 a.s., then τ ◦ θτr = 0 a.s. on ¯ ⊂ supp(1Ξ λ) {τr < ∞}, and so τr ∈ supp(1Ξ λ) a.s. on the same set. Hence, Ξ a.s., and the two sets agree a.s. If instead τ > 0 a.s., then τ = τ + τ ◦ θτ > τ a.s. on {τ < ∞}, which implies τ = ∞ a.s., and so λΞ = 0 a.s. 2 Proof of Theorem 29.7: Using Lemma 29.8 (ii), we may eliminate case (i) where Ξ is a renewal process, which is clearly a.s. locally finite. Next, we may use parts (i) and (iii) of the same lemma to separate out cases (iv) and (v). Excluding (i) and (iv)−(v), we see that Ξ is a.s. a locally finite union of intervals of i.i.d. lengths. Writing γ = inf Ξc and using the regenerative property at τs , we get

P {γ > s + t} = P γ ◦ θs > t, γ > s



= P {γ > s} P {γ > t},

s, t ≥ 0.

Hence, the monotone function pt = P {γ > t} satisfies the Cauchy equation ps+t = ps pt with initial condition p0 = 1, and so pt ≡ e−ct for some constant

668

Foundations of Modern Probability

c ≥ 0. Here c = 0 corresponds to case (ii) and c > 0 to case (iii). If Ξ is closed, then γ ∈ Ξ on {γ < ∞}, and the regenerative property at γ yields γ = γ + γ ◦ θγ > γ a.s. on {γ < ∞}, which implies γ = ∞ a.s., corresponding to case (ii). 2 ¯ c is a countable union of disjoint intervals (σ, τ ), each The complement (Ξ) supporting an excursion of X away from a. The shifted excursions Yσ,τ = θσ X τ are rcll processes in their own right. Let D0 be the space of possible excursion paths, write l(x) for the length of excursion x ∈ D0 , and put Dh = {x ∈ D0 ; l(x) > h}. Let κh be the number of excursions of X longer than h. We need some basic facts. Lemma 29.9 (long excursions) Fix any h > 0, and allow even h ≥ 0 when the recurrence time is positive. Then exactly one of these cases occurs: (i) κh = 0 a.s., (ii) κh = ∞ a.s., (iii) κh is geometrically distributed with mean mh ∈ [1, ∞). The excursions Yjh are i.i.d., and in case (iii) we have l(Yκhh ) = ∞. Proof: Let κth be the number of excursions in Dh completed at time t ∈ [0, ∞], and note that κτht > 0 when τt = ∞. Writing ph = P {κh > 0}, we get for any t > 0

ph = P {κτht > 0} + P κτht = 0, κh ◦ θτt > 0



= P {κτht > 0} + P {κτht = 0} ph . As t → ∞, we obtain ph = ph + (1 − ph ) ph , which implies ph = 0 or 1. When ph = 1, let σ = σ1 be the right endpoint of the first Dh -excursion, and define recursively σn+1 = σn + σ1 ◦ θσn , starting with σ0 = 0. Writing q = P {σ < ∞} and using the regenerative property at each σn , we obtain P {κh > n} = q n . Thus, q = 1 yields κh = ∞ a.s., whereas for q < 1 the variable κh is geometrically distributed with mean mh = (1 − q)−1 , proving the 2 first assertion. Even the last claim holds by regeneration at each σn . ˆ = inf{h > 0; κh = 0 a.s.}. For any h < h, ˆ let νh be the common Write h distribution of all excursions in Dh . The measures νh may be combined into a single σ-finite measure ν : ¯ a.s. Lemma 29.10 (excursion law, Itˆo) Let X be regenerative at a with Ξ perfect. Then ˆ (i) there exists a measure ν on D0 , such that for every h ∈ (0, h), νDh ∈ (0, ∞),

νh = ν( · |Dh ),

(ii) ν is unique up to a normalization, (iii) ν is bounded iff the recurrence time is a.s. positive.

29. Local Time, Excursions, and Additive Functionals

669

ˆ and let Y 1 , Y 2 , . . . be such as in Proof (OK): (i) Fix any h ≤ k in (0, h), h h Lemma 29.9. Then the first excursion in Dk is the first process Yhj belonging to Dk , and since the Yhj are i.i.d. νh , we have νk = νh ( · |Dk ),

ˆ 0 < h ≤ k < h.

(3)

ˆ and define ν˜h = νh /νh Dk for all h ∈ (0, k]. Then (3) Now fix any k ∈ (0, h), yields ν˜h = ν˜h (· ∩ Dh ) for any h ≤ h ≤ k, and so ν˜h increases as h → 0 toward ˆ we a measure ν on D0 with ν(· ∩ Dh ) = ν˜h for all h ≤ k. For any h ∈ (0, h), get ν( · |Dh ) = ν˜h∧k ( · |Dh ) = νh∧k ( · |Dh ) = νh . (ii) If ν  is another measure with the stated properties, then νh ν  (· ∩ Dh ) ν(· ∩ Dh ) = = , νDk ν h Dk ν  Dk

ˆ h ≤ k < h.

Letting h → 0 for fixed k, we get ν = rν  with r = νDk /ν  Dk . (iii) If the recurrence time is positive, then (3) remains true for h = 0, and ˆ and write κh,k for the we may take ν = ν0 . Otherwise, let h ≤ k in (0, h), number of Dh -excursions up to the first completed excursion in Dk . For fixed ¯ is perfect and nowhere dense. k we have κh,k → ∞ a.s. as h → 0, since Ξ Noting that κh,k is geometrically distributed with mean Eκh,k = (νh Dk )−1

= ν(Dk |Dh )

−1

= νDh /νDk , we get νDh → ∞, which shows that ν is unbounded.

2

We turn to the fundamental theorem of excursion theory, describing the excursion structure of X in terms of a Poisson process ξ on R+ × D0 and a diffuse random measure ζ on R+ . The measure ζ, with associated cumulative process Lt = ζ[0, t], is unique up to a normalization and will be referred to as the excursion local time of X at a, whereas ξ is referred to as the excursion point process of X at a. Theorem 29.11 (excursion local time and point process, L´evy, Itˆo) Let the ¯ a.s. perfect. Then there exist a nonprocess X be regenerative at a with Ξ ¯ a.s., a Poisson decreasing, continuous, adapted process L on R+ with support Ξ process ξ on R+ × D0 with E ξ = λ ⊗ ν, and a constant c ≥ 0, such that (i) Ξ · λ = c L a.s., (ii) the excursions of X with associated L-values are given a.s. by the restriction of ξ to [0, L∞ ]. The product ν · L is a.s. unique.

670

Foundations of Modern Probability

Proof (OK, beginning): When Eγ = c > 0, we define ν = ν0 /c, and introduce a Poisson process ξ on R+ × D0 with intensity λ ⊗ ν, say with points (σj , Y˜j ), j ∈ N. Putting σ0 = 0, we see from Theorem 13.6 that the differences γ˜j = σj −σj−1 are i.i.d. exponentially distributed with mean m. By Proposition 15.3 (i), the processes Y˜j are further independent of the σj and i.i.d. ν0 . Writing κ ˜ = inf{j; l(Y˜j ) = ∞}, we see from Theorem 29.7 and Lemma 29.9 that 



d





γj , Yj ; j ≤ κ = γ˜j , Y˜j ; j ≤ κ ˜ ,

(4)

where the quantities on the left are the holding times and subsequent excursions of X. By Theorem 8.17 we may redefine ξ to make (4) hold a.s. Then the stated conditions become fulfilled with L = 1Ξ · λ. When Eγ = 0, we may again choose ξ to be Poisson λ ⊗ ν, but now with ˆ we may enumerate the points of ξ in ν as in Lemma 29.10. For any h ∈ (0, h), j ˜h ˜ h = inf{j; l(Yjh ) = ∞}. The processes R+ × Dh as (σh , Yj ), j ∈ N, and define κ h ˜ Yj are i.i.d. νh , and Lemma 29.9 yields 



d





γjh , Yjh ; j ≤ κh = γ˜jh , Y˜jh ; j ≤ κ ˜h ,

ˆ h ∈ (0, h).

(5)

Since the excursions in Dh form nested sub-arrays, the entire collections have the same finite-dimensional distributions, and so by Theorem 8.17 we may redefine ξ to make (5) hold a.s. Let τhj be the right endpoint of the j-th excursion in Dh , and define



Lt = inf σhj ; h, j > 0, τhj ≥ t , Then for any t ≥ 0 and h, j > 0, Lt < σhj ⇒ t ≤ τhj ⇒ Lt ≤ σhj .

t ≥ 0.

(6)

To see that L is continuous, we may assume that (5) holds identically. Since ν is unbounded, we may further assume that the set {σhj ; h, j > 0} is dense in the interval [0, L∞ ]. If ΔLt > 0, there exist some i, j, h > 0 with Lt− < σhi < σhj < Lt+ . By (6) we get t − ε ≤ τhi < τhj ≤ t + ε for every ε > 0, which is impossible. Thus, ΔLt = 0 for all t. ¯ ⊂ supp L a.s., we may further take Ξ ¯ ω to be perfect and To prove Ξ ¯ nowhere dense for each ω ∈ Ω. If t ∈ Ξ, then for every ε > 0 there exist some i, j, h > 0 with t − ε < τhi < τhj < t + ε. By (6) we get Lt−ε ≤ σhi < σhj ≤ Lt+ε , 2 and so Lt−ε < Lt+ε for all ε > 0, which implies t ∈ supp L. In the perfect case, it remains to establish the a.s. relation Ξ · λ = c L for a constant c ≥ 0, and to show that L is a.s. unique and adapted. The former claim is a consequence of Theorem 29.13 below, whereas the latter statement follows immediately from the next theorem. Most standard constructions of L are special cases of the following result, where ηt A denotes the number of excursions in a set A ∈ D0 completed by time t ≥ 0, so that η is an adapted, measure-valued process on D0 . The result may be compared with the global construction in Corollary 29.6.

29. Local Time, Excursions, and Additive Functionals

671

Proposition 29.12 (approximation) For sets A1 , A2 , . . . ∈ D0 with finite measures νAn → ∞, we have   η A  P (i) sup  t n − Lt  → 0, u ≥ 0, t≤u νAn (ii) the convergence holds a.s. when the An are increasing. In particular, we have ηt Dh /νDh → Lt a.s. as h → 0 for fixed t, which shows that L is a.s. determined by the regenerative set Ξ. Proof: Choose ξ as in Theorem 29.11, and put ξs = ξ([0, s] × ·). If the An are increasing, then for any unit rate Poisson process N on R+ , we have d (ξs An ) = (Ns νAn ) for all s ≥ 0. Since t−1 Nt → 1 a.s. by the law of large numbers, we get ξs An → s a.s., s ≥ 0. νAn By the monotonicity of each side, this may be strengthen to   ξ A   s n  − s  → 0 a.s., sup  νA s≤r n

r ≥ 0.

The desired convergence now follows, by the continuity of L and the fact that ξLt − ≤ ηt ≤ ξLt for all t ≥ 0. Dropping the monotonicity assumption, but keeping the condition νAn ↑ ∞ for convenience, we may choose a non-decreasing sequence An ∈ D0 such that d νAn = νAn for all n. Then clearly (ξs An ) = (ξs An ) for fixed n, and so     ξ A   ξ A   s n  d  s n   → 0 a.s., sup  − s  = sup  − s   s≤r νAn s≤r νAn

r ≥ 0,

P

which implies the convergence → 0 on the left. The asserted convergence now follows as before. 2 The excursion local time L may be described most conveniently in terms of its right-continuous inverse



Ts = L−1 s = inf t ≥ 0; Lt > s ,

s ≥ 0,

which will be shown to be a generalized subordinator, defined as a non-decreasing L´evy process with possibly infinite jumps. To state the next result, form Ξ ⊂ Ξ by omitting all points of Ξ that are isolated from the right. Recall that l(u) denotes the length of an excursion path u ∈ D0 . Theorem 29.13 (inverse local time) Let L, ξ, ν, c be such as in Theorem 29.11. Then a.s. (i) T = L−1 is a generalized subordinator with characteristics (c, ν ◦ l−1 ) and range Ξ in R+ ,

(ii) Ts = c s +

s+

l(u) ξ(dr du), 0

s ≥ 0.

672

Foundations of Modern Probability

Proof: We may discard the P -null set where L fails to be continuous with ¯ by the definition ¯ If Ts < ∞ for some s ≥ 0, then Ts ∈ supp L = Ξ support Ξ. ¯ \ Ξ . Thus, T (R+ ) ⊂ Ξ ∪ {∞} of T , and since L is continuous, we get Ts ∈ Ξ a.s. Conversely, let t ∈ Ξ . Then for any ε > 0 we have Lt+ε > Lt , and so t ≤ T ◦ Lt ≤ t + ε. As ε → 0, we get T ◦ Lt = t. Thus, Ξ ⊂ T (R+ ) a.s. The times Ts are optional by the right-continuity of F. Furthermore, Proposition 29.12 shows that, as long as Ts < ∞, the process θs T − Ts is obtainable from θTs X by a measurable map independent of s. Using the regenerative property at each Ts , we conclude from Lemma 16.13 that T is a generalized subordinator, hence admitting a representation as in Theorem 16.3. Since the ¯ c , we obtain (ii) for a constant jumps of T agree with the interval lengths in (Ξ) c ≥ 0.  The double integral in (ii) equals x (ξs ◦ l−1 )(dx), which shows that T has L´evy measure E(ξ1 ◦ l−1 ) = ν ◦ l−1 . Taking s = Lt in (ii), we get a.s. for any t ∈ Ξ t = T ◦ Lt L + t = c Lt + l(u) ξ(dr du) 0

= c Lt + (Ξc · λ)t , and so by subtraction c Lt = (1Ξ · λ)t a.s., which extends by continuity to arbitrary t ≥ 0. Noting that a.s. T −1 [0, t] = [0, Lt ] or [0, Lt ) for all t ≥ 0, since T is strictly increasing, we have a.s. ξ[0, t] = Lt = λ ◦ T −1 [0, t],

t ≥ 0,

which extends to ξ = λ ◦ T −1 a.s. by Lemma 4.3.

2

To justify our terminology, we show that the semi-martingale and excursion local times agree whenever both exist. Proposition 29.14 (reconciliation) Let X be a continuous semi-martingale with local time L, and let X be regenerative at some a ∈ R with P {La∞ = 0} > 0. Then (i) Ξ = {t; Xt = a} is a.s. perfect and nowhere dense, (ii) La is a version of the excursion local time at a. Proof: The regenerative set Ξ = {t; Xt = a} is closed by the continuity of X. If Ξ were a countable union of closed intervals, then La would vanish a.s., contrary to our hypothesis, and so by Theorem 29.7 it is perfect and nowhere dense. Let L be a version of the excursion local time at a, and put T = L−1 . Define Ys = La ◦ Ts for s < L∞ , and put Ys = ∞ otherwise. The continuity of La yields Ys± = La ◦ Ts± for every s < L∞ . If ΔTs > 0, then La ◦ Ts− = La ◦ Ts , since (Ts− , Ts ) is an excursion interval of X and La is continuous with support in Ξ. Thus, Y is a.s. continuous on [0, L∞ ). By Corollary 29.6 and Proposition 29.12, the processes θs Y − Ys can be obtained from θTs X for all s < L∞ by a common measurable map. By the

29. Local Time, Excursions, and Additive Functionals

673

regenerative property at a, Y is then a generalized subordinator, and so by Theorem 16.3 and the continuity of Y , we have Ys ≡ c s a.s. on [0, L∞ ) for a constant c ≥ 0. For t ∈ Ξ we have a.s. T ◦ Lt = t, and therefore Lat = La ◦ (T ◦ Lt ) = (La ◦ T ) ◦ Lt = c Lt , which extends to R+ since both extremes are continuous with support in Ξ. 2 For Brownian motion it is convenient to normalize local time by Tanaka’s formula, which leads to a corresponding normalization of the excursion law ν. By the spatial homogeneity of Brownian motion, we may restrict our attention to excursions from 0. Then excursions of different lengths have the same distribution, apart from a scaling. For a precise statement, we define the scaling operators Sr on D by t ≥ 0, r > 0, f ∈ D.

(Sr f )t = r1/2 ft/r ,

Theorem 29.15 (Brownian excursion) Let ν be the normalized excursion law of Brownian motion. Then there exists a unique distribution νˆ on the set of excursions of length one, such that ν = (2π)−1/2



∞ 0



νˆ ◦ Sr−1 r−3/2 dr.

(7)

Proof: By Theorem 29.13, the inverse local time L−1 is a subordinator with L´evy measure ν ◦ l−1 , where l(u) denotes the length of u. Furthermore, d L = M by Corollary 29.3, where Mt = sups≤t Bs , and so by Theorem 16.12 the measure ν ◦ l−1 has density (2π)−1/2 r−3/2 , r > 0. As in Theorem 8.5 there exists a probability kernel (νr ) : (0, ∞) → D0 , such that νr ◦ l−1 ≡ δr and ν = (2π)−1/2



∞ 0

νr r−3/2 dr,

(8)

and we note that the measures νr are unique a.e. λ. ˜ = Sr B is again a Brownian motion, and by For any r > 0, the process B ˜ equals L ˜ = Sr L. If B has an excursion u Corollary 29.6 the local time of B ˜ ends at rt, and the ending at time t, then the corresponding excursion Sr u of B 1/2 ˜ ˜ local time for B at the new excursion equals Lrt = r Lt . Thus, the excursion ˜ is obtained from the process ξ for B through the mapping process ξ˜ for B d Tr : (s, u) → (r1/2 s, Sr u). Since ξ˜ = ξ, each Tr leaves the intensity measure λ ⊗ ν invariant, and we get ν ◦ Sr−1 = r1/2 ν,

r > 0.

Combining (8) and (9), we get for any r > 0

∞ 0



νx ◦ Sr−1 x−3/2 dx = r1/2



= 0



∞ 0

νx x−3/2 dx

νrx x−3/2 dx,

(9)

674

Foundations of Modern Probability

and the uniqueness in (8) yields νx ◦ Sr−1 = νrx ,

x > 0 a.e. λ, r > 0.

By Fubini’s theorem, we may then fix an x = c > 0 with νc ◦ Sr−1 = νcr ,

r > 0 a.s. λ.

−1 Defining νˆ = νc ◦ S1/c , we conclude that for almost every r > 0,

νr = νc(r/c) −1 = νc ◦ Sr/c −1 = νc ◦ S1/c ◦ Sr−1

= νˆ ◦ Sr−1 . Substituting this into (8) yields equation (7). If μ is another probability measure with the stated properties, then for almost every r > 0 we have μ ◦ Sr−1 = νˆ ◦ Sr−1 , and hence −1 μ = μ ◦ Sr−1 ◦ S1/r −1 = νˆ ◦ Sr−1 ◦ S1/r = νˆ.

2

Thus, νˆ is unique.

By the continuity of paths, an excursion of Brownian motion is either positive or negative, and by symmetry the two possibilities have the same probaν+ + νˆ− ). A bility 12 under νˆ. This leads to the further decomposition νˆ = 12 (ˆ process with distribution νˆ+ is called a (normalized) Brownian excursion. For subsequent needs, we insert a simple computational result. Lemma 29.16 (height distribution) Let ν be the excursion law of Brownian motion. Then

ν u ∈ D0 ; supt ut > h = (2 h)−1 , h > 0. Proof: By Tanaka’s formula, the process M = 2B ∨ 0 − L0 = B + |B| − L0 is a martingale, and so for τ = inf{t ≥ 0; Bt = h} we get 



E L0τ ∧t = 2 E Bτ ∧t ∨ 0 ,

t ≥ 0.

Hence, by monotone and dominated convergence, EL0τ = 2E(Bτ ∨ 0) = 2 h. On the other hand, Theorem 29.11 shows that L0τ is exponentially distributed 2 with mean (νAh )−1 , where Ah = {u; supt ut ≥ h}. We proceed to show that Brownian local time has some quite amazing spatial properties.

29. Local Time, Excursions, and Additive Functionals

675

Theorem 29.17 (space dependence, Ray, Knight) Let B be a Brownian motion with local time L, and put τ = inf{t > 0; Bt = 1}. Then the process St = Lτ1−t on [0, 1] is a squared Bessel process of order 2. Several proofs are known. Here we derive the result from the previously developed excursion theory. Proof (Walsh): Fix any u ∈ [0, 1], put σ = Luτ , and let ξ ± denote the Poisson processes of positive and negative excursions from u. Write Y for the process B, stopped when it first hits u. Then Y ⊥⊥(ξ + , ξ − ) and ξ + ⊥⊥ξ − , and ⊥(ξ − , Y ). Since σ is ξ + -measurable, we obtain ξ + ⊥⊥σ (ξ − , Y ), and hence so ξ + ⊥ + ⊥σ (ξσ− , Y ), which implies the Markov property of Lxτ at x = u. ξσ ⊥ To find the corresponding transition kernels, fix any x ∈ [0, u), and write h = u − x. Put τ0 = 0, and let τ1 , τ2 , . . . be the right endpoints of excursions from x exceeding u. Further define ζk = Lxτk+1 − Lxτk , k ≥ 0, so that Lxτ = ζ0 + · · · + ζκ with κ = sup{k; τk ≤ τ }. By Lemma 29.16, the variables ζk are i.i.d. and exponentially distributed with mean 2h. Since κ agrees with the number of completed u -excursions before time τ reaching x, and since σ⊥⊥ ξ − , we further see that κ is conditionally Poisson σ/2h, given σ. Next we show that (σ, κ)⊥ ⊥(ζ0 , ζ1 , . . .). Then define σk = Luτk . Since ξ − is ⊥(ζ1 , ζ2 , . . .), and so (σ, σ1 , σ2 , . . .)⊥⊥ (Y, ζ1 , ζ2 , . . .). Poisson, we have (σ1 , σ2 , . . .)⊥ The desired relation now follows, since κ is a measurable function of (σ, σ1 , σ2 , . . .), and ζ0 depends measurably on Y . For any s ≥ 0, we may now compute



κ+1 

 σ 

 = E (1 + 2sh)−κ−1  σ 

E exp(−sLu−h ) σ = E τ



Ee−s ζ0

= (1 + 2sh)−1 exp



−s σ . 1 + 2sh

By the Markov property of Lxτ , the last relation is equivalent, via the substitutions u = 1 − t and 2s = (a − t)−1 , to the martingale property, for every a > 0, of the process  −L1−t τ , t ∈ [0, a). Mt = (a − t)−1 exp 2(a − t) Now let X be a squared Bessel process of order 2, and note that L1τ = X0 = 0 by Theorem 29.4. By Corollary 14.12, the process X is again Markov. To see that X has the same transition kernel as L1−t τ , it is enough to show that, is replaced by for any a > 0, the process M remains a martingale when L1−t τ Xt . This is clear from Itˆo’s formula, if we note that X is a weak solution to 1/2 2 the SDE dXt = 2Xt dBt + 2dt. For an important application of the last result, we show that the local time is strictly positive on the range of the process.

676

Foundations of Modern Probability

Corollary 29.18 (range and support) Let M be a continuous local martingale with local time L. Then outside a fixed P -null set, {Lxt > 0} =





inf Ms < x < sup Ms , s≤t

s≤t

x ∈ R, t ≥ 0.

Proof: By Corollary 29.6 and the continuity of L, we have Lxt = 0 for x outside the stated interval, except on a fixed P -null set. To see that Lxt > 0 otherwise, we may reduce by Theorem 19.3 and Corollary 29.6 to the case where M is a Brownian motion B. Letting τu = inf{t ≥ 0; Bt = u}, we see from Theorems 19.6 (i) and 18.16 that, outside a fixed P -null set, 0 ≤ x < u ∈ Q+ .

Lxτu > 0,

(10)

If 0 ≤ x < sups≤t Bs for some t and x, there exists a u ∈ Q+ with x < u < sups≤t Bs . But then τu < t, and (10) yields Lxt ≥ Lxτu > 0. A similar argument 2 applies to the case where inf s≤t Bs < x ≤ 0. Our third approach to local times is via additive functionals and their potentials. Here we consider a canonical Feller process X with state space S, associated terminal time ζ, probability measures Px , transition operators Tt , shift operators θt , and filtration F. By a continuous, additive functional (CAF) of X we mean a non-decreasing, continuous, adapted process A with A0 = 0 and Aζ∨t ≡ Aζ , satisfying As+t = As + At ◦ θs a.s.,

s, t ≥ 0,

(11)

where the ‘a.s.’ without qualification means Px -a.s. for every x. By the continuity of A, we may choose the exceptional null set to be independent of t. If it can also be chosen to be independent of s, then A is said to be perfect. For a simple example, let f ≥ 0 be a bounded, measurable function on S, and consider the associated elementary CAF

At =

0

t

f (Xs ) ds,

t ≥ 0.

(12)

More generally, given a CAF A and a function f as above, we may define a  new CAF f · A by (f · A)t = s≤t f (Xs ) dAs , t ≥ 0. A less trivial example is given by the local time of X at a fixed point x, whenever it exists in either sense discussed above. For any CAF A and constant α ≥ 0, we may introduce the associated ∞ α-potential e−αt dAt , x ∈ S, UAα (x) = Ex 0

and put UAα f = Ufα·A . In the special case where At ≡ t ∧ ζ, we shall often write U α f = UAα f . Note in particular that UAα = U α f = Rα f for the A in (12). If α = 0, we may omit the superscript and write U = U 0 and UA = UA0 . We proceed to show that a CAF is determined by its α-potential, whenever the latter is finite.

29. Local Time, Excursions, and Additive Functionals

677

Lemma 29.19 (potentials of additive functionals) Let A, B be continuous, additive functionals of a Feller process X, and fix any α ≥ 0. Then UAα = UBα < ∞





A = B a.s.

Proof: Define Aαt = s≤t e−αs dAs , and conclude from (11) and the Markov property at t that, for any x ∈ S, 

 



 



Ex Aα∞  Ft − Aαt = e−αt Ex Aα∞ ◦ θt  Ft



= e−αt UAα (Xt ).

(13)

Comparing with the same relation for B, we see that Aα − B α is a continuous Px -martingale of finite variation, and so Aα = B α a.s. Px by Proposition 18.2. Since x was arbitrary, we get A = B a.s. 2 For any CAF A of Brownian motion in Rd , we may introduce the associated for any measurable function g ≥ 0 on Rd by νA g = Revuz measure νA , given  ¯ ¯ E(g · A)1 , where E = Ex dx. When A is defined by (12), we get in particular νA g = f, g , where ·, · denotes the inner product in L2 (Rd ). In general, we need to show that νA is σ-finite. Lemma 29.20 (Revuz measure) For a continuous, additive functional A of Brownian motion X in Rd , the associated Revuz measure νA is σ-finite. Proof: Fix any integrable function f > 0 on Rd , and define

g(x) = Ex



0

e−t−At f (Xt ) dt,

x ∈ Rd .

Using Corollary 17.19, the additivity of A, and Fubini’s theorem, we get

UA1 g(x) = Ex = Ex = Ex



0



0∞ 0



= Ex = Ex ≤ E0





0∞ 0∞

Hence, by Fubini’s theorem, e−1 νA g ≤ ≤

0

e−t dAt EXt e−t dAt eAt dAt



0∞ t

∞ 0

e−s−As f (Xs ) ds

e−s−As ◦θt f (Xs+t ) ds e−s−As f (Xs ) ds

e−s−As f (Xs ) ds 





0

s

eAt dAt

e−s 1 − e−As f (Xs ) ds e−s f (Xs + x) ds.



UA1 g(x) dx

dx E0

= E0

=







0 ∞ −s

e−s f (Xs + x) ds

e ds

0



f (Xs + x) dx

f (x)dx < ∞.

678

Foundations of Modern Probability 2

The assertion now follows since g > 0. 2

density (2πt)−d/2 e−|x| /2t of Brownian Now let pt (x) denote the transition  ∞ −αt d α measure μ on Rd , we motion in R , and put u (x) = 0 e pt (x) dt. For any  α α may introduce the associated α -potential U μ(x) = u (x − y) μ(dy). We show that the Revuz measure has the same potential as the underlying CAF. Theorem 29.21 (potentials of Revuz measure, Hunt, Revuz) Let A be a continuous, additive functional of Brownian motion in Rd with Revuz measure νA . Then UAα = U α νA , α ≥ 0. Proof: By monotone convergence we may take α > 0. By Lemma 29.20, we may choose some positive functions fn ↑ 1 with νfn ·A 1 = νA fn < ∞ for each n, and by dominated convergence we have Ufαn ·A ↑ UAα and U α νfn ·A ↑ U α νA . Thus, we may further assume that νA is bounded. Then clearly UAα < ∞ a.e. Now fix any bounded, continuous function f ≥ 0 on Rd , and conclude by dominated convergence that U α f is again bounded and continuous. Writing h = n−1 for an arbitrary n ∈ N, we get by dominated convergence and the additivity of A 1 U α f (Xs ) dAs νA U α f = E¯ 0

= lim E¯ n→∞

Σ U αf (Xjh)Ah ◦ θjh.

j 0, and so Ct = c(Xt ) by Lemma 1.14 for a measurable and invariant function c. To deal with A, we need to show that if f (X) has a semi-martingale decomposition M + V , then Vt − Vs depends measurably on X in the interval [s, t]. The required measurability then follows by approximation with sums of conditional expectations, as in the proof in Lemma 10.7, since the conditioning σ-field Fs may be replaced by conditioning on Xs , by independence of the increments. Since such a measurability is already known for the term 12 ∇f [X] in (9), it holds for Af as well. We may now proceed as before to see that A = a(Xt ) for some measurable and invariant function a. Since a is a left-invariant vector field on S, it is automatically smooth28 . A similar argument may yield the desired smoothness of the second order invariant vector field c. 2 Every continuous process in S with stationary, independent left increments can be shown to be a semi-martingale of the stated form. The intrinsic drift rate a clearly depends on the connection ∇, which in turn depends on the metric ρ, via Theorem 35.21. However, ρ is not unique since it can be generated by any inner product on Tι , and so there is no natural definition of Brownian motion in S. In fact, martingales and drift rates are not well-defined either, since even the canonical connection may fail to be unique29 .

Exercises 1. Show that when S = Rn , a semi-martingale in S is just a continuous semimartingale X = (X 1 , . . . , X n ) in the usual sense. Further show how the covariance matrix [X i , X j ], i, j ≤ n, can be recovered from b[X]. 2. For any semi-martingales X, Y and bilinear form b on S, construct a process b [X, Y ] of locally finite variation, depending linearly on b, such that (df ⊗dg)[X, Y ] = [f (X), g(Y )] for smooth functions f, g, so that in particular b [X, X] = b [X]. (Hint: Regard (X, Y ) as a semi-martingale in S 2 , and define b [X, Y ] = (π1 ⊗ π2 )∗ b [(X, Y )], where π1 , π2 are the coordinate projections S 2 → S, and (π1 ⊗ π2 )∗ b denotes the pull-back of b by π1 ⊗ π2 into a bilinear form on S 2 . Now check the desired formula.) 3. Explain where in the proof of Theorem 35.2 it is important that we use the FS rather than the Itˆo integral. Specify which of the four parts of the theorem depend on that convention. 28 29

Cf. Proposition 3.7 in Warner (1983), p. 85. This is far from obvious; thanks to Ming Liao for answering my question.

35. Stochastic Differential Geometry

825

4. Same requests as in the previous exercise for the extrinsic Lemma 35.3 with accompanying discussion. (Hint: Note that the reasons are very different.) 5. Show that the class of connections on S is not closed under addition or multiplication by scalars. (Hint: This is clear from (4), and even more obvious from Theorem 35.6.) 6. Show that for any connection ∇ with Christoffel symbols Γkij , the latter can be changed arbitrarily in a neighborhood of a point x ∈ S, such that the new functions ˜ k are Christoffel symbols of a connection ∇ ˜ on S. (Hint: Extend the Γ ˜ k to smooth Γ ij ij k functions agreeing with Γij outside a larger neighborhood.) 7. Show that the class of martingales in S depends on the choice of connection. Thus, letting X be a (non-trivial) martingale for a connection ∇, show that we can choose another connection ∇ , such that X fails to be a ∇ -martingale. (Hint: Use Theorem 35.7 together with the previous exercise.) 8. Compare the martingale notions in Lemma 35.3 and Theorem 35.8. Thus, check that, for a given embedding, the two notions agree. On the other hand, explain in what sense one notion is extrinsic while the other is intrinsic. 9. Show that the notion of geodesic depends on the choice of connection. Thus, given a geodesic γ for a connection ∇, show that we can choose another connection ∇ such that γ fails to be a ∇ -geodesic. (Hint: This might be seen most easily from Corollary 35.11.) 10. For a given connection ∇ on S, show that the classes of geodesics and martingales are invariant under a scaling of time. Thus, if γ is a geodesic in S, then so is the function γt = γct , t ≥ 0, for any constant c > 0. Similarly, if M is a martingale in S, then so is the process Mt = Mct . (Hint: For geodesics, use Lemma 35.9. For martingales, we can use Theorem 35.7.) 11. When S = Rn with the flat connection, show that the notions of affine and convex functions agree with the elementary notions for Euclidean spaces. Further show that, in this case, a geodesic represents motion with constant speed along a straight line. For Euclidean spaces S, S  with flat connections, verify the properties and equivalences in Lemma 35.12 and Theorems 35.10 and 35.13. 12. Show that the intrinsic notion of drift depends on the choice of connection, whereas the intrinsic diffusion rate doesn’t. Thus, show that if a semi-martingale X has drift A for connection ∇, we can choose another connection ∇ such that the associated drift is different. (Hint: Note that X is a martingale for ∇ iff A = 0. For the second assertion, note that b [X] is independent of ∇.) 13. For S = Rn with the flat connection, explain how our intrinsic local characteristics are related to the classical notions for ordinary semi-martingales. 14. Show that in Theorem 35.15, we can replace Lebesgue measure λ by any diffuse measure on R, and even by a suitable random measure. 15. Clarify the relationship between the intrinsic results of Theorem 35.16 and Corollary 35.17 and the extrinsic ones in Lemma 35.3. (Hint: Since the definitions of local characteristics are totally different in the two cases, the formal agreement merely justifies our intrinsic definitions. Further note that the intrinsic properties hold for any smooth embedding of S, whereas the extrinsic ones refer to a specific choice of embedding.)

826

Foundations of Modern Probability

16. Show that for S = Rd with the Euclidean metric, the intrinsic notion of isotropic martingales agrees with the notion for continuous local martingales in Chapter 19. Prove a corresponding statement for Brownian motion. 17. Show that if X is a martingale in a Riemannian manifold (S, ρ), it may fail to be so for a different choice of metric ρ. Further show that if X is an isotropic martingale in (S, ρ), we can change ρ so that it remains a martingale but is no longer isotropic. Finally, if X is a Brownian motion in (S, ρ), we can change ρ so that it remains an isotropic martingale, though with a rate different from 1.

Appendices 1. Measurable maps, 2. General topology, 3. Linear spaces, 4. Linear operators, 5. Function and measure spaces, 6. Classes and spaces of sets, 7. Differential geometry

Here we review some basic topology, functional analysis, and differential geometry, needed at various places throughout the book. We further consider some special results from advanced measure theory, and discuss the topological properties of various spaces of functions, measures, and sets, of crucial importance for especially the weak convergence theory in Chapter 23. Proofs are included only when they are not easily accessible in the standard literature.

1. Measurable maps The basic measure theory required for the development of modern probability was surveyed in Chapters 1–3. Here we add some less elementary facts, needed for special purposes. If a measurable mapping is invertible, the measurability of its inverse can sometimes be inferred from the measurability of the range. Theorem A1.1 (range and inverse, Kuratowski) For any measurable bijection f between two Borel spaces S and T , the inverse f −1 : T → S is again measurable. 2

Proof: See Parthasarathy (1967), Section I.3.

We turn to the basic projection and section theorems, which play important roles in some more advanced contexts. Given a measurable space (Ω, F), we  define the universal completion of F as the σ-field F¯ = μ F μ , where F μ denotes the completion of F with respect to μ, and the intersection extends over all probability measures μ on F. For any spaces Ω and S, we define the  projection πA of a set A ⊂ Ω × S onto Ω as the union s As , where



As = ω ∈ Ω; (ω, s) ∈ A ,

s ∈ S.

Theorem A1.2 (projection and sections, Lusin, Choquet, Meyer) For any measurable space (Ω, F) and Borel space (S, S), consider a set A ∈ F ⊗ S with projection πA onto Ω. Then (i) πA ∈ F¯ = ∩ μ F μ , © Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1

827

828

Foundations of Modern Probability

(ii) for any probability measure P on F, there exists a random element ξ in S, such that {ω, ξ(ω)} ∈ A, ω ∈ πA a.s. P. 2

Proof: See Dellacherie & Meyer (1975), Section III.44.

We note an application to functional representations needed in Chapter 28: Corollary A1.3 (conditioning representation) Consider a probability space (Ω, F, P ) with sub-σ-field G ⊂ F and a measurable function f : T × U → S, where S, T, U are Borel. Let ξ, η be random elements in S and U , such that

L(ξ | G) ∈ L{f (t, η)}; t ∈ T



a.s.

Then there exist a G-measurable random element τ in T and a random element η˜ in U , such that d ξ = f (τ, η˜) a.s. η = η˜ ⊥ ⊥ τ, Proof: Fix a version of the conditional distribution L(ξ | G). Applying Theorem A1.2 (ii) to the product-measurable set



A = (ω, t) ∈ Ω × T ; L(ξ | G)(ω) = L{f (t, η)} , we obtain a G-measurable random element τ in T satisfying



L(ξ | G) = L f (t, η)

t=τ

= L(ξ | τ ) a.s., d

where the last equality holds by the chain rule for conditioning. Letting ζ = η in U with ζ⊥ ⊥ τ and using Lemma 4.11, we get a.s.









L f (τ, ζ) τ = L f (t, η)

t=τ

= L(ξ | τ ), d

d

and so {f (τ, ζ), τ } = (ξ, τ ). Then Theorem 8.17 yields a random pair (˜ η , τ˜) = (ζ, τ ) in U × T with

f (˜ τ , η˜), τ˜ = (ξ, τ ) a.s. d

d

In particular τ˜ = τ a.s., and so ξ = f (τ, η˜) a.s. Further note that η˜ = ζ = η, and that η˜ ⊥⊥ τ˜ = τ since ζ⊥⊥ τ by construction. 2

2. General topology Elementary metric topology, including properties of open and closed sets, continuity, convergence, and compactness, are used throughout this book. For special purposes, we also need the slightly more subtle notions of separability and completeness, dense or nowhere dense sets, uniform and equi-continuity, and approximation. We assume the reader to be well familiar with most of this

Appendices

829

material. Here we summarize some basic ideas of general, non-metric topology, of special importance in Sections 5–6 below and Chapter 23. Proofs may be found in most textbooks on real analysis. A topology on a space S is defined as a class T of subsets of S, called the open sets, such that T is closed under finite intersections and arbitrary unions and contains S and ∅. The complements of open sets are said to be closed. For any A ⊂ S, the interior Ao is the union of all open set contained in A. Similarly, the closure A¯ is the intersection of all closed sets containing A, and the boundary ∂A is the difference A¯ \ Ao . A set A is dense in S if A¯ = S, ¯ c is dense in S. The space S is separable if it has a and nowhere dense if (A) countable, dense subset. It is connected if S and ∅ are the only sets that are both open and closed. For any A ⊂ S, the relative topology on A consists of all sets A ∩ G with G ∈ T . The family of topologies on S is partially ordered under set inclusion. The smallest topology on S has only the two sets S and ∅, whereas the largest one, the discrete topology 2S , contains all subsets of S. For any class C of subsets of S, the generated topology is the smallest topology T ⊃ C. A topology T is said to be metrizable, if there exists a metric ρ on S, such that T is generated by all open ρ -balls. In particular, it is Polish if ρ can be chosen to be separable and complete. An open set containing a point x ∈ S is called a neighborhood of x. A class Bx of neighborhoods of x is called a neighborhood base at x, if every neighborhood of x contains a set in Bx . A base B of T is a class containing a neighborhood base at every point. Every base clearly generates the topology. The space S is said to be first countable, if it has a countable neighborhood base at every point, and second countable, if it has a countable base. Every metric space is clearly first countable, and it is second countable iff it is separable. A topological space S is said to be Hausdorff, if any two distinct points have disjoint neighborhoods, and normal if all singleton sets are closed, and any two disjoint closed sets are contained in disjoint open sets. In particular, all metric spaces are normal. A mapping f between two topological spaces S, S  is said to be continuous, if f −1 B is open in S for every open set B ⊂ S  . The spaces S, S  are homeomorphic, if there exists a bijection (1−1 correspondence) f : S → S  such that f and f −1 are both continuous. For any family F of functions fi mapping S into some topological spaces Si , the generated topology on S is the smallest one making all the fi continuous. In particular, if S is a Cartesian product of some topological spaces Si , i ∈ I, so that the elements of S are functions x = (xi ; i ∈ I), then the product topology on S is generated by all coordinate projections πi : x → xi . A set A ⊂ S is said to be compact, if every cover of A by open sets contains a finite subcover, and relatively compact if its closure A¯ is compact. By Tychonov’s theorem, any Cartesian product of compact spaces is again compact in the product topology. A topological space S is said to be locally compact, if it has a base consisting of sets with compact closure. A set I is said to be partially ordered under a relation ≺, if

830

Foundations of Modern Probability

• i ≺ j ≺ i ⇒ i = j, • i ≺ i for all i ∈ I, • i ≺ j ≺ k ⇒ i ≺ k. It is said to be directed if also • i, j ∈ I ⇒ i ≺ k and j ≺ k for some k ∈ I. A net in a topological space S is a mapping of a directed set I into S. In particular, sequences are nets on I = N with the usual order. A net (xi ) is said to converge to a limit x ∈ S, written as xi → x, if for every neighborhood A of x there exists an i ∈ I, such that xj ∈ A for all j - i. Possible limits are unique when S is Hausdorff, but not in general. We further say that (xi ) clusters at x, if for any neighborhood A of x and i ∈ I, there exists a j - i with xj ∈ A. For sequences (xn ), this means that xn → x along a sub-sequence. Net convergence and clustering are constantly used1 throughout analysis and probability, such as in continuous time and higher dimensions, or in the definitions of total variation and Riemann integrals. They also serve to characterize the topology itself: Lemma A2.1 (closed sets) For sets A in a topological space S, these conditions are equivalent: (i) A is closed, (ii) for every convergent net in A, even the limit lies in A, (iii) for every net in A, all cluster points also lie in A. When S is a metric space, it suffices in (ii)−(iii) to consider sequences. We can also use nets to characterize compactness: Lemma A2.2 (compact sets) For sets A in a topological space S, these conditions are equivalent: (i) A is compact, (ii) every net in A has at least one cluster point in A. If the net in (ii) has exactly one cluster point x, it converges to x. When S is a metric space, it is enough in (ii) to consider sequences. For sequences, property (ii) above amounts to sequential compactness—the existence of a convergent sub-sequence with limit in A. For the notion of relative sequential compactness, the limit is not required to lie in A. Given a compact topological space X, let CX be the class of continuous functions f : X → R equipped with the supremum norm f = supx |f (x)|. An algebra A ⊂ CX is defined as a linear subspace closed under multiplication. It is said to separate points, if for any x = y in X there exists an f ∈ A with f (x) = f (y). 1

only the terminology may be less familiar

Appendices

831

Theorem A2.3 (Stone–Weierstrass) For a compact topological space X, let A ⊂ CX be an algebra separating points and containing the constant function 1. Then A¯ = CX . Like many other results in this and subsequent sections, this statement is also true in a complex version. Throughout this book we are making use of spaces that are locally compact, second countable, and Hausdorff, here referred to as lcscH spaces. We record some standard facts about such spaces: Lemma A2.4 (lcscH spaces) Let the space S be lcscH. Then (i) S is Polish and σ-compact, (ii) S has a separable, complete metrization, such that a set is relatively compact iff it is bounded, (iii) if S is not already compact, it can be compactified by the attachment of a single point Δ = S.

3. Linear spaces In this and the next section, we summarize some basic ideas of functional analysis, such as Hilbert and Banach spaces, and operators between linear spaces. Aspects of this material are needed in several chapters throughout the book. Proofs may be found in textbooks on real analysis. A linear space 2 over R is a set X with two operations, addition x + y and multiplication by scalars3 cx, such that for any x, y, z ∈ X and a, b ∈ R, • • • • • • • •

x + (y + z) = (x + y) + z, x + y = y + x, x + 0 = x for some 0 ∈ X, x + (−x) = 0 for some −x ∈ X, a(bx) = (ab)x, (a + b)x = ax + bx, a(x + y) = ax + ay, 1x = x.

A subspace of X is a subset closed under the mentioned operations, hence a linear space in its own right. A function x ≥ 0 on X is called a norm, if for any x, y ∈ X and c ∈ R, 2

also called a vector space Here and below, we can also use the set C of complex numbers as our scalar field. Usually this requires only trivial modifications. 3

832

Foundations of Modern Probability

• x = 0 ⇔ x = 0, • cx = |c| x , • x + y ≤ x + y . The norm generates a metric ρ on X, given by ρ(x, y) = x − y . If ρ is complete, in the sense that every Cauchy sequence converges, then X is called a Banach space. In that case, any closed subspace is again a Banach space. Given a linear space X, we define a linear functional on X as a function f : X → R, such that f (ax + by) = a f (x) + b f (y),

x, y ∈ X, a, b ∈ R.

Similarly, a convex functional is defined as a map p : X → R+ satisfying • p(cx) = c p(x), x ∈ X, c ≥ 0, • p(x + y) ≤ p(x) + p(y), x, y ∈ X. Lemma A3.1 (Hahn–Banach) Let X be a linear space with subspace Y , fix a convex functional p on X, and let f be a linear functional on Y with f ≤ p on Y . Then f extends to a linear functional on X, such that f ≤ p on X. A set M in a linear space X is said to be convex if c x + (1 − c) y ∈ M,

x, y ∈ M, c ∈ [0, 1].

A hyper-plane in X is determined by an equation f (x) = a, for a linear functional f = 0 and a constant a. We say that M is supported by the plane f (x) = a if f (x) ≤ a, or f (x) ≥ a, for all x ∈ M . The sets M, N are separated by the plane f (x) = a if f (x) ≤ a on M and f (x) ≥ a on N , or vice versa. For clarity, we state the following support and separation result for convex sets in a Euclidean space. Corollary A3.2 (convex sets) (i) For any convex set M ⊂ Rd and point x ∈ ∂M , there exists a hyper-plane through x supporting M . (ii) For any disjoint convex sets M, N ⊂ Rd , there exists a hyper-plane separating M and N . When X is a normed linear space, we say that a linear functional f on X is bounded if |f (x)| ≤ c x for some constant c < ∞. Then there is a smallest value of c, denoted by f , so that |f (x)| ≤ f x . The space of bounded linear functionals f on X is again a normed linear space (in fact a Banach space), called the dual of X and denoted by X ∗ . Continuing recursively, we may form the second dual X ∗∗ = (X ∗ )∗ , etc. Theorem A3.3 (Hahn–Banach) Let X be a normed linear space with a subspace Y . Then for any g ∈ Y ∗ , there exists an f ∈ X ∗ with f = g and f = g on Y .

Appendices

833

Corollary A3.4 (natural embedding) For a normed linear space X, we may define a linear isometry T : X → X ∗∗ by (T x)f = f (x),

x ∈ X, f ∈ X ∗ .

The space X is said to be reflexive if T(X) = X**, so that X and X** are isomorphic, written as X ≅ X**. The reflexive property extends to any closed linear subspace of X. Familiar examples of reflexive spaces include l^p and L^p for p > 1 (but not for p = 1), where the duals equal l^q and L^q, respectively, with p⁻¹ + q⁻¹ = 1. For p = 2, we get a Hilbert space H with the unique property H ≅ H*.

On a normed linear space X, we may consider not only the norm topology induced by ‖x‖, but also the weak topology generated by the dual X*. Thus, the weak topology on X* is generated by the second dual X**. Even more important is the weak* topology on X*, generated by the coordinate maps π_x : X* → R given by π_x f = f(x), x ∈ X, f ∈ X*.

Lemma A3.5 (topologies on X*) For a normed linear space X, write T_w*, T_w, T_n for the weak*, weak, and norm topologies on X*. Then
(i) T_w* ⊂ T_w ⊂ T_n,
(ii) T_w* = T_w = T_n when X is reflexive.

An inner product on a linear space X is a function ⟨x, y⟩ on X², such that for any x, y, z ∈ X and a, b ∈ R,
• ⟨x, x⟩ ≥ 0, with equality iff x = 0,
• ⟨x, y⟩ = ⟨y, x⟩,
• ⟨ax + by, z⟩ = a⟨x, z⟩ + b⟨y, z⟩.
The function ‖x‖ = ⟨x, x⟩^{1/2} is then a norm on X, satisfying the Cauchy inequality⁴
• |⟨x, y⟩| ≤ ‖x‖ ‖y‖, x, y ∈ X.
If the associated metric is complete, then X is called a Hilbert space⁵. If ⟨x, y⟩ = 0, we say that x and y are orthogonal and write x ⊥ y. For any subset A ⊂ H, we write x ⊥ A if x ⊥ y for all y ∈ A. The orthogonal complement A^⊥, consisting of all x ∈ H with x ⊥ A, is a closed linear subspace of H.

A linear functional f on a Hilbert space H is clearly continuous iff it is bounded, in the sense that |f(x)| ≤ c‖x‖, x ∈ H, for some constant c ≥ 0, in which case we define ‖f‖ as the smallest constant c with this property. The bounded linear functionals form a linear space H* with norm ‖f‖, which is again a Hilbert space isomorphic to H.

⁴ also known as the Schwarz or Buniakovsky inequality
⁵ Hilbert spaces appear at many places throughout the book, especially in Chapters 8, 14, 18–19, and 21.


Lemma A3.6 (Riesz representation⁶) For functions f on a Hilbert space H, these conditions are equivalent:
(i) f is a continuous linear functional on H,
(ii) there exists a unique element y ∈ H satisfying

f(x) = ⟨x, y⟩,   x ∈ H.

In this case, ‖f‖ = ‖y‖.

An ortho-normal system (ONS) in H is a family of orthogonal elements x_i ∈ H with norm 1. It is said to be complete or to form an ortho-normal basis (ONB) if y ⊥ (x_i) implies y = 0.

Lemma A3.7 (ortho-normal bases) For any Hilbert space H, we have⁷
(i) H has an ONB (x_i),
(ii) the cardinality of (x_i) is unique,
(iii) every x ∈ H has a unique representation x = Σ_i b_i x_i, where b_i = ⟨x, x_i⟩,
(iv) if x = Σ_i b_i x_i and y = Σ_i c_i x_i, then

⟨x, y⟩ = Σ_i b_i c_i,   ‖x‖² = Σ_i b_i².

The cardinality in (ii) is called the dimension of H. Since any finite-dimensional Hilbert space is isomorphic⁸ to a Euclidean space R^n, we may henceforth assume the dimension to be infinite. The simplest infinite-dimensional Hilbert space is l², consisting of all infinite sequences x = (x_n) in R with ‖x‖² = Σ_n x_n² < ∞. Here the inner product is given by

⟨x, y⟩ = Σ_{n≥1} x_n y_n,   x = (x_n), y = (y_n),

and we may choose an ONB consisting of the elements (0, . . . , 0, 1, 0, . . .), with 1 in the n-th position. Another standard example is L²(μ), for any σ-finite measure μ with infinite support on a measurable space S. Here the elements are (equivalence classes of) measurable functions f on S with ‖f‖² = μf² < ∞, and the inner product is given by ⟨f, g⟩ = μ(fg). Familiar ONB's include the Haar functions when μ equals Lebesgue measure λ on [0, 1], and suitably normalized Hermite polynomials when μ = N(0, 1) on R. All the basic Hilbert spaces are essentially equivalent:

⁶ also known as the Riesz–Fischer theorem
⁷ This requires the axiom of choice, assumed throughout this book.
⁸ In other words, there exists a 1−1 correspondence between the two spaces preserving the algebraic structure.


Lemma A3.8 (separable Hilbert spaces) For an infinite-dimensional Hilbert space H, these conditions are equivalent:
(i) H is separable,
(ii) H has countably infinite dimension,
(iii) H ≅ l²,
(iv) H ≅ L²(μ) for any σ-finite measure μ with infinite support.
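As a finite-dimensional illustration of Lemma A3.7 (iii)–(iv), here is a minimal numerical sketch (not from the book; it uses numpy, and all names are ad hoc): it expands two vectors of R^n in a randomly generated ONB and checks the Parseval relations ⟨x, y⟩ = Σ_i b_i c_i and ‖x‖² = Σ_i b_i².

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# A random orthonormal basis of R^n: the columns of Q from a QR factorization.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

x = rng.standard_normal(n)
y = rng.standard_normal(n)

# Coordinates b_i = <x, e_i>, c_i = <y, e_i> in the ONB (e_i) = columns of Q.
b = Q.T @ x
c = Q.T @ y

# Parseval relations from Lemma A3.7 (iv).
assert np.isclose(x @ y, b @ c)           # <x, y> = sum_i b_i c_i
assert np.isclose(x @ x, (b ** 2).sum())  # ||x||^2 = sum_i b_i^2

# Reconstruction from Lemma A3.7 (iii): x = sum_i b_i e_i.
assert np.allclose(x, Q @ b)
```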

Most Hilbert spaces H arising in applications are separable, and for simplicity we may often take H = l 2 , or let H = L2 (μ) for a suitable measure μ. For any Hilbert spaces H1 , H2 , we may form a new Hilbert space H1 ⊗ H2 , called the tensor product of H1 , H2 . When H1 = L2 (μ1 ) and H2 = L2 (μ2 ), it may be thought of as the space L2 (μ1 ⊗ μ2 ), where μ1 ⊗ μ2 is the product measure of μ1 , μ2 . If f1 , f2 , . . . and g1 , g2 , . . . are ONB’s in H1 and H2 , the functions hij = fi ⊗gj form an ONB in H1 ⊗H2 , where (f ⊗g)(x, y) = f (x)g(y). Iterating the construction, we may form the n-fold tensor powers H ⊗n = H ⊗ · · · ⊗ H, for every n ∈ N.

4. Linear operators

A linear operator between two normed linear spaces X, Y is a mapping T : X → Y satisfying

T(ax + by) = a Tx + b Ty,   x, y ∈ X, a, b ∈ R.

It is said to be bounded⁹ if ‖Tx‖ ≤ c‖x‖ for some constant c < ∞, in which case there is a smallest number ‖T‖ with this property, so that

‖Tx‖ ≤ ‖T‖ ‖x‖,   x ∈ X.

When ‖T‖ ≤ 1 we call T a contraction operator. The space of bounded linear operators X → Y is denoted by B_{X,Y}, and when X = Y we write B_{X,X} = B_X. Those are again linear spaces, under the operations (T₁ + T₂)x = T₁x + T₂x, (cT)x = c(Tx), x ∈ X, c ∈ R, and the function ‖T‖ above is a norm on B_{X,Y}. Furthermore, B_{X,Y} is complete, hence a Banach space, whenever this is true for Y. In particular, we note that B_{X,R} equals the dual space X*.

Theorem A4.1 (Banach–Steinhaus) Let T ⊂ B_{X,Y}, for a Banach space X and a normed linear space Y. Then

sup_{T∈T} ‖Tx‖ < ∞, x ∈ X   ⇔   sup_{T∈T} ‖T‖ < ∞.

⁹ Bounded operators are especially important in Chapter 26.


If T ∈ B_{X,Y} and S ∈ B_{Y,Z}, the product ST ∈ B_{X,Z} is given by (ST)x = S(Tx), and we note that ‖ST‖ ≤ ‖S‖ ‖T‖. Two operators S, T ∈ B_X are said to commute if ST = TS. The identity operator I ∈ B_X is defined by Ix = x for all x ∈ X. Furthermore, T ∈ B_X is said to be idempotent if T² = T. For any operators T, T₁, T₂, . . . ∈ B_{X,Y}, we say that T_n converges strongly to T if

‖T_n x − Tx‖ → 0 in Y,   x ∈ X.

We also consider convergence in the norm topology, defined by ‖T_n − T‖ → 0 in B_{X,Y}. For any A ∈ B_{X,Y}, the adjoint A* of A is a linear map on Y* given by

(A*f)(x) = f(Ax),   x ∈ X, f ∈ Y*.

If also B ∈ B_{Y,Z}, we note that (BA)* = A*B*. Further note that I* = I, where the latter I is given by If = f.

Lemma A4.2 (adjoint operator) For any normed linear spaces X, Y, the map A → A* is a linear isometry from B_{X,Y} to B_{Y*,X*}.

When A ∈ B_{X,Y} is bijective, its inverse A⁻¹ : Y → X is again linear, though it need not be bounded.

Theorem A4.3 (inverses and adjoints) For a Banach space X and a normed linear space Y, let A ∈ B_{X,Y}. Then these conditions are equivalent:
(i) A has a bounded inverse A⁻¹ ∈ B_{Y,X},
(ii) A* has a bounded inverse (A*)⁻¹ ∈ B_{X*,Y*}.
In that case, (A*)⁻¹ = (A⁻¹)*.

For a Hilbert space H we may identify H* with H, which leads to some simplifications. Then for any A ∈ B_H there exists a unique operator A* ∈ B_H satisfying ⟨Ax, y⟩ = ⟨x, A*y⟩, x, y ∈ H, which again is called the adjoint of A. The mapping A → A* is a linear isometry on H satisfying (A*)* = A. We say that A is self-adjoint if A* = A, so that ⟨Ax, y⟩ = ⟨x, Ay⟩. In that case,

‖A‖ = sup_{‖x‖=1} |⟨Ax, x⟩|.

For any A ∈ B_H, the null space N_A and range R_A are given by N_A = {x ∈ H; Ax = 0}, R_A = {Ax; x ∈ H}, and we note that (R_A)^⊥ = N_{A*} and (N_A)^⊥ = R̄_{A*}, the closure of R_{A*}.
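The following minimal numpy sketch (illustrative only; the matrices and names are hypothetical) checks the finite-dimensional versions of these facts: the adjoint of a matrix operator is its transpose, and for a self-adjoint (symmetric) A the operator norm equals sup_{‖x‖=1} |⟨Ax, x⟩|, attained at an eigenvector of largest absolute eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# The adjoint relation <Ax, y> = <x, A*y>, with A* = A^T in R^n.
A = rng.standard_normal((n, n))
x, y = rng.standard_normal(n), rng.standard_normal(n)
assert np.isclose((A @ x) @ y, x @ (A.T @ y))

# For a self-adjoint operator, ||A|| = sup_{||x||=1} |<Ax, x>|,
# which equals the largest absolute eigenvalue.
S = (A + A.T) / 2                      # symmetric, hence self-adjoint
eigvals = np.linalg.eigvalsh(S)
op_norm = np.linalg.norm(S, 2)         # spectral norm
assert np.isclose(op_norm, np.abs(eigvals).max())

# A crude check of the supremum formula over random unit vectors.
xs = rng.standard_normal((10_000, n))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
sup_quad = np.abs(np.einsum('ij,jk,ik->i', xs, S, xs)).max()
assert sup_quad <= op_norm + 1e-9
```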


Lemma A4.4 (unitary operators¹⁰) For any U ∈ B_H, these conditions are equivalent:
(i) UU* = U*U = I,
(ii) U is an isometry of H onto itself,
(iii) U maps any ONB in H into an ONB.

We state an abstract version of Theorem 1.35:

Lemma A4.5 (orthogonal decomposition) Let M be a closed subspace of a Hilbert space H. Then for any x ∈ H, we have
(i) x has a unique decomposition x = y + z with y ∈ M and z ∈ M^⊥,
(ii) y is the unique element of M minimizing ‖x − y‖,
(iii) writing y = π_M x, we have π_M ∈ B_H with ‖π_M‖ ≤ 1, and π_M + π_{M^⊥} = I.

Lemma A4.6 (projections¹¹) When A ∈ B_H, these conditions are equivalent:
(i) A = π_M for a closed subspace M ⊂ H,
(ii) A is self-adjoint and idempotent.
In that case, M = R_A.

For any sub-spaces M, N ⊂ H, we introduce the orthogonal complement M ⊖ N = M ∩ N^⊥ = {x ∈ M; x ⊥ N}. When M ⊥ N, we further consider the direct sum M ⊕ N = {x + y; x ∈ M, y ∈ N}. The conditional orthogonality M ⊥_R N is defined by¹²

M ⊥_R N  ⇔  π_{R^⊥} M ⊥ π_{R^⊥} N.
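Here is a small numpy sketch (illustrative, not from the book) of Lemmas A4.5–A4.6 in R^n: the orthogonal projection onto M = span of given vectors is built as π_M = QQ^T, and is checked to be self-adjoint, idempotent, of norm ≤ 1, and distance-minimizing.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 7, 3

# M = span of k random vectors; Q has orthonormal columns spanning M.
V = rng.standard_normal((n, k))
Q, _ = np.linalg.qr(V)
P = Q @ Q.T                                # the projection pi_M

# Lemma A4.6: pi_M is self-adjoint and idempotent, with ||pi_M|| <= 1.
assert np.allclose(P, P.T)
assert np.allclose(P @ P, P)
assert np.linalg.norm(P, 2) <= 1 + 1e-12

# Lemma A4.5: x = y + z with y in M, z in M^perp, and y minimizes ||x - y||.
x = rng.standard_normal(n)
y, z = P @ x, (np.eye(n) - P) @ x
assert np.isclose(y @ z, 0.0)              # y is orthogonal to z
for _ in range(1000):                      # compare with random points of M
    m = Q @ rng.standard_normal(k)
    assert np.linalg.norm(x - y) <= np.linalg.norm(x - m) + 1e-12
```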

We list some classical propositions of probabilistic relevance.

Lemma A4.7 (orthogonality) For any closed sub-spaces M, N ⊂ H, these conditions are equivalent:
(i) M ⊥ N,
(ii) π_M π_N = 0,
(iii) π_M + π_N = π_R for some R.
In that case, R = M ⊕ N.

Lemma A4.8 (commutativity) For any closed sub-spaces M, N ⊂ H, these conditions are equivalent:
(i) M ⊥_{M∩N} N,
(ii) π_M π_N = π_N π_M,
(iii) π_M π_N = π_R for some R.
In that case, R = M ∩ N.

¹⁰ Unitary operators are especially important in Chapters 14 and 27–28.
¹¹ Projection operators play a fundamental role in especially Chapter 8, where most of the quoted results have important probabilistic counterparts.
¹² Here our terminology and notation are motivated by Theorem 8.13.


Lemma A4.9 (tower property) For any closed sub-spaces M, N ⊂ H, these conditions are equivalent:
(i) N ⊂ M,
(ii) π_M π_N = π_N,
(iii) π_N π_M = π_N,
(iv) ‖π_N x‖ ≤ ‖π_M x‖, x ∈ H.

The next result may be compared with the probabilistic versions in Corollary 25.18 and Theorem 30.11.

Theorem A4.10 (mean ergodic theorem) For a bi-measurable bijection¹³ T on (S, μ) with μ ∘ T⁻¹ = μ, define Uf = f ∘ T for f ∈ L²(μ), and let M be the null space of U − I. Then
(i) U is a unitary operator on L²(μ),
(ii) n⁻¹ Σ_{k<n} f ∘ T^k = n⁻¹ Σ_{k<n} U^k f → π_M f.
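As a numerical illustration of (ii) (a sketch under assumptions, not from the book): for the irrational rotation T(x) = x + α mod 1 on S = [0, 1) with Lebesgue measure, T is ergodic, so π_M f is the constant ∫f dμ, and the Cesàro averages n⁻¹ Σ_{k<n} f(T^k x) should approach that constant.

```python
import numpy as np

alpha = np.sqrt(2) - 1                      # irrational rotation angle
f = lambda x: np.cos(2 * np.pi * x) ** 2    # a test function in L^2[0, 1)
mean_f = 0.5                                # its integral over [0, 1)

x0 = 0.3                                    # starting point
n = 200_000
orbit = (x0 + alpha * np.arange(n)) % 1.0   # x, Tx, T^2 x, ...
averages = np.cumsum(f(orbit)) / np.arange(1, n + 1)

# The ergodic averages approach the space average pi_M f = 1/2.
print(averages[[99, 9_999, n - 1]])         # -> values close to 0.5
assert abs(averages[-1] - mean_f) < 1e-3
```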

w̃_t(x, h) = inf_{(I_k)} max_k sup_{r,s ∈ I_k} ρ(x_r, x_s),   (3)

where the infimum extends over all partitions of the interval [0, t) into subintervals I_k = [u, v) with v − u ≥ h when v < t. Note that w̃_t(x, h) → 0 as h → 0, for fixed x ∈ D_{R₊,S} and t > 0.

Theorem A5.4 (compactness in D_{R₊,S}, Prohorov) Let A ⊂ D_{R₊,S} for a separable, complete metric space S, and fix a dense set T ⊂ R₊. Then A is relatively compact in the Skorohod topology iff
(i) π_t A is relatively compact in S for all t ∈ T,
(ii) lim_{h→0} sup_{x∈A} w̃_t(x, h) = 0, t > 0.   (4)
In that case,
(iii) ∪_{s≤t} π_s A is relatively compact in S for all t ≥ 0.

Proof: Detailed proofs are implicit in Ethier & Kurtz (1986), pp. 122–127. Related versions appear in Billingsley (1968), pp. 116–120, and Jacod & Shiryaev (1987), pp. 292–298. □

We turn to some measure spaces. Given a separable, complete metric space S, let M_S be the class of locally finite measures μ on S, so that μB < ∞ for all bounded Borel sets B. For any μ, μ₁, μ₂, . . . ∈ M_S, we define the vague convergence μ_n →ᵛ μ by

μ_n f → μf,   f ∈ Ĉ_S,   (5)

where Ĉ_S is the class of bounded, continuous functions f ≥ 0 on S with bounded support. The maps π_f : μ → μf induce the vague topology¹⁸ on M_S:

¹⁸ When S is locally compact, it is equivalent to require the supports of f to be compact. In general, that would give a smaller and less useful topology.


Lemma A5.5 (vague topology on M_S) For a separable, complete metric space S, there exists a topology T on M_S, such that
(i) T induces the convergence μ_n →ᵛ μ in (5),
(ii) M_S is Polish under T,
(iii) T generates the Borel σ-field σ{π_f; f ∈ Ĉ_S}.

Proof: See K(17), pp. 111–117. □

On the sub-class M_S^b of bounded measures on S, we also consider the weak topology and associated weak convergence μ_n →ʷ μ, given by (5) with Ĉ_S replaced by the class C_S of bounded, continuous functions on S. No further analysis is required, since this agrees with the vague topology and convergence, when the metric ρ on S is replaced by the equivalent metric ρ ∧ 1, so that all sets in S become bounded.

Theorem A5.6 (vague compactness in M_S, Prohorov, Matthes et al.) Let A ⊂ M_S for a separable, complete metric space S. Then A is vaguely relatively compact iff
(i) sup_{μ∈A} μB < ∞, B ∈ Ŝ,
(ii) inf_{K∈K} sup_{μ∈A} μ(B \ K) = 0, B ∈ Ŝ.
In particular, a set A ⊂ M_S^b is weakly relatively compact iff (i)−(ii) hold with B = S.

Proof: See K(17), pp. 114, 116. □
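A toy numerical sketch of the vague-convergence criterion (5) (illustrative only; all choices here are hypothetical): empirical measures μ_n of i.i.d. uniform samples on [0, 1] satisfy μ_n f → μf for bounded continuous f with bounded support, where μ is Lebesgue measure on [0, 1].

```python
import numpy as np

rng = np.random.default_rng(3)

def mu_n_f(f, n):
    """Integral of f under the empirical measure mu_n of n uniform samples."""
    u = rng.uniform(0.0, 1.0, size=n)
    return f(u).mean()

# Bounded continuous test functions with bounded support; both have
# integral 1/2 against Lebesgue measure mu on [0, 1].
f1 = lambda x: np.maximum(1.0 - np.abs(2.0 * x - 1.0), 0.0)   # triangular bump
f2 = lambda x: np.sin(np.pi * x) ** 2

for n in (100, 10_000, 1_000_000):
    print(n, round(mu_n_f(f1, n), 4), round(mu_n_f(f2, n), 4))  # -> approach 0.5
```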

We finally consider spaces of measure-valued rcll functions. Here we may characterize compactness in terms of countably many one-dimensional projections. Here we write D_{R₊,S} = D_S for simplicity.

Theorem A5.7 (measure-valued functions) Given an lcscH space S, there exist some f₁, f₂, . . . ∈ Ĉ_S, such that these conditions are equivalent, for any A ⊂ D_{M_S}:
(i) A is relatively compact in D_{M_S},
(ii) Af_j = {xf_j; x ∈ A} is relatively compact in D_{R₊} for every j ∈ N.

Proof: If A is relatively compact, then so is Af for every f ∈ Ĉ_S, since the map x → xf is continuous from D_{M_S} to D_{R₊}. To prove the converse, we may choose a dense collection f₁, f₂, . . . ∈ Ĉ_S, closed under addition, such that Af_j is relatively compact for every j. In particular, sup_{x∈A} x_t f_j < ∞ for all t ≥ 0 and j ∈ N, and so by Theorem A5.6 the set {x_t; x ∈ A} is relatively compact in M_S for every t ≥ 0. By Theorem A5.4 it remains to verify (4), where w̃ may be defined in terms of the complete metric


ρ(μ, ν) = Σ_{k≥1} 2^{−k} (|μf_k − νf_k| ∧ 1),   μ, ν ∈ M_S.   (6)

If (4) fails, then either we may choose some x^n ∈ A and t_n → 0 with lim sup_n ρ(x^n_{t_n}, x^n_0) > 0, or else there exist some x^n ∈ A and bounded s_n < t_n < u_n with u_n − s_n → 0 such that

lim sup_{n→∞} ρ(x^n_{s_n}, x^n_{t_n}) ∧ ρ(x^n_{t_n}, x^n_{u_n}) > 0.   (7)

In the former case, (6) yields lim sup_n |x^n_{t_n} f_j − x^n_0 f_j| > 0 for some j ∈ N, which contradicts the relative compactness of Af_j. Assuming (7) instead, we have by (6) some i, j ∈ N with

lim sup_{n→∞} |x^n_{s_n} f_i − x^n_{t_n} f_i| ∧ |x^n_{t_n} f_j − x^n_{u_n} f_j| > 0.

Now for any a, a′, b, b′ ∈ R,

½ (|a| ∧ |b′|) ≤ (|a| ∧ |b|) ∨ (|a′| ∧ |b′|) ∨ (|a + a′| ∧ |b + b′|).   (8)

Since the set {f_k} is closed under addition, (8) yields the same relation with a common i = j. But then (4) fails for Af_i, which contradicts the relative compactness of Af_i by Theorem A5.4. Thus, (4) holds for A, and so A is relatively compact. □

6. Classes and spaces of sets

Given an lcscH space S, we introduce the classes G, F, K of open, closed, and compact subsets, respectively. Here we may regard F as a space in its own right, endowed with the Fell topology and associated convergence¹⁹. To introduce the latter, choose a metric ρ on S making every closed ρ-ball compact, and define

ρ(s, F) = inf_{x∈F} ρ(s, x),   s ∈ S, F ∈ F.

Then for any F, F₁, F₂, . . . ∈ F, we define F_n →ᶠ F by the condition

ρ(s, F_n) → ρ(s, F),   s ∈ S.   (9)

We show that (9) is independent of the choice of ρ, and is induced by the topology generated by the sets

{F ∈ F; F ∩ G ≠ ∅}, G ∈ G,   {F ∈ F; F ∩ K = ∅}, K ∈ K.   (10)
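A small numerical sketch of the convergence criterion (9) (illustrative only, names hypothetical): for closed subsets of R, take F_n = [1/n, 1 + 1/n] and F = [0, 1]; the distance functions ρ(s, F_n) then converge pointwise to ρ(s, F), so F_n → F in the Fell topology.

```python
import numpy as np

def dist_to_interval(s, a, b):
    """Distance rho(s, F) from the points s to the closed set F = [a, b]."""
    return np.maximum.reduce([a - s, s - b, np.zeros_like(s)])

s_grid = np.linspace(-2.0, 3.0, 11)           # a few test points s in S = R
d_limit = dist_to_interval(s_grid, 0.0, 1.0)  # rho(s, F) for F = [0, 1]

for n in (1, 10, 100, 1000):
    d_n = dist_to_interval(s_grid, 1.0 / n, 1.0 + 1.0 / n)  # F_n = [1/n, 1+1/n]
    print(n, np.max(np.abs(d_n - d_limit)))   # sup-difference tends to 0
```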

Theorem A6.1 (Fell topology on F) Let F be the class of closed sets in an lcscH space S. Then there exists a topology T on F, such that
(i) T generates the convergence F_n →ᶠ F in (9),
(ii) F is compact and metrizable under T,
(iii) T is generated by the sets (10),
(iv) {F ∈ F; F ∩ B ≠ ∅} is universally Borel measurable for every B ∈ S.

¹⁹ This is important for our discussion of random sets in Chapters 23 and 34.

This is important for our discussion of random sets in Chapters 23 and 34.

844

Foundations of Modern Probability

Proof, (i)−(iii): We show that the Fell topology in (iii) is generated by the maps F → ρ(s, F ), s ∈ S. The convergence criterion in (i) then follows, once the metrization property is established. To see that the stated maps are continuous, put Bsr = {t ∈ S; ρ(s, t) < r}, and note that {F ; ρ(s, F ) < r} = {F ; F ∩ Bsr =  ∅}, r ¯ {F ; ρ(s, F ) > r} = {F ; F ∩ Bs = ∅}. Here the sets on the right are open, by the definition of the Fell topology and the choice of ρ. Thus, the Fell topology contains the ρ -topology. To prove the converse, fix any F ∈ F and a net {Fi } ⊂ F with directed index set (I, ≺), such that Fi → F in the ρ -topology. To see that convergence holds even in the Fell topology, let G ∈ G be arbitrary with F ∩ G ∈ / ∅. Fix any s ∈ F ∩ G. Since ρ(s, Fi ) → ρ(s, F ) = 0, we may further choose some si ∈ Fi with ρ(s, si ) → 0. Since G is open, there exists an i ∈ I with sj ∈ G / ∅ for all j - i. for all j - i. Then also Fj ∩ G ∈ Next let K ∈ K with F ∩ K = ∅. Define rs = 12 ρ(s, F ) for each s ∈ K, and put Gs = Bs,rs . Since K is compact, it is covered by finitely many balls Gsk . For each k we have ρ(sk , Fi ) → ρ(sk , F ), and so there exists an ik ∈ I with Fj ∩ Gsk = ∅ for all j - ik . Letting i ∈ I with i - ik for all k, it is clear that Fj ∩ K = ∅ for all j - i. (ii) Fix a countable, dense set D ⊂ S, and assume that ρ(s, Fi ) → ρ(s, F ) for all s ∈ D. For any s, s ∈ S,         ρ(s, Fj ) − ρ(s, F ) ≤ ρ(s , Fj ) − ρ(s , F ) + 2ρ(s, s ).

Given any s and ε > 0, we can make the left-hand side < ε by choosing s ∈ D with ρ(s, s ) < ε/3, and then letting i ∈ I be such that |ρ(s , Fj ) − ρ(s , F )| < ε/3 for all j - i. Hence, the Fell topology is also generated by the maps ¯ ∞, F → ρ(s, F ) for all s ∈ D. But then F is homeomorphic to a subset of R + which is second-countable and metrizable. To see that F is compact, it suffices to show that every sequence (Fn ) ⊂ F contains a convergent sub-sequence. Then choose a sub-sequence with ρ(s, Fn ) ¯ + for all s ∈ D, and hence also for all s ∈ S. Since the converging in R functions ρ(s, Fn ) are equi-continuous, the limit f is again continuous, and so the set F = {s ∈ S; f (s) = 0} is closed. To obtain Fn → F , we need to show that, whenever F ∩G = ∅ or F ∩K = ∅ for some G ∈ G or K ∈ K, the same relation holds eventually even for Fn . In the former case, fix any s ∈ F ∩ G, and note that ρ(s, Fn ) → f (s) = 0. Hence, we may choose some sn ∈ Fn with sn → s, and since sn ∈ G for large n, we get Fn ∩ G = ∅. In the latter case, assume that instead Fn ∩ K = ∅ along a sub-sequence. Then there exist some sn ∈ Fn ∩ K, and so sn → s ∈ K along a further sub-sequence. Here 0 = ρ(sn , Fn ) → ρ(s, F ), which yields the contradiction s ∈ F ∩ K.

Appendices

845

(iv) The mapping (s, F ) → ρ(s, F ) is jointly continuous, hence Borel measurable. Now S and F are both separable, and so the Borel σ-field in S × F agrees with the product σ-field S ⊗ BF . Since s ∈ F iff ρ(s, F ) = 0, it follows that {(s, F ); s ∈ F } belongs to S ⊗ BF . Hence, so does {(s, F ); s ∈ F ∩ B} for arbitrary B ∈ S. The assertion now follows by Theorem A1.2. 2 Say that a class U ⊂ Sˆ is separating, if for any K ⊂ G with K ∈ K and G ∈ G, there exists a U ∈ U with K ⊂ U ⊂ G. A class I ⊂ Sˆ is pre-separating if the finite unions of I -sets form a separating class. When S is Euclidean, we typically choose I to be a class of intervals or rectangles and U as the corresponding class of finite unions. Lemma A6.2 (separation) For any monotone function h : Sˆ → R, we have the separating class

ˆ h(B o ) = h(B) ¯ . Sˆh = B ∈ S; Proof: Fix a metric ρ in S such that every closed ρ -ball is compact, and let K ∈ K and G ∈ G with K ⊂ G. For any ε > 0, define Kε = {s ∈ S; ¯ ε = {s ∈ S; ρ(s, K) ≤ ε}. Since K is compact, d(s, K) < ε} and note that K c we have ρ(K, G ) > 0, and so K ⊂ Kε ⊂ G for sufficiently small ε > 0. Fur2 thermore, the monotonicity of h yields Kε ∈ Sˆh for almost every ε > 0. We often need the separating class to be countable. Lemma A6.3 (countable separation) Every separating class U ⊂ Sˆ contains a countable separating subclass. ˆ closed under finite unions. Proof: Fix a countable topological base B ⊂ S, ¯ with K o ⊃ B. ¯ Then For every B ∈ B, choose some compact sets KB,n ↓ B B,n o ¯ for every pair (B, n) ∈ B × N, choose UB,n ∈ U with B ⊂ UB,n ⊂ KB,n . The 2 family {UB,n } is clearly separating. To state the next result, fix any metric spaces S1 , S2 , . . . , and introduce the product spaces S n = S1 × · · · × Sn and S = S1 × S2 × · · · , endowed with their product topologies. For any m < n < ∞, let πm and πmn be the natural projections of S or S n onto S m . Say that the sets An ⊂ S n , n ∈ N, form a projective sequence if πmn An ⊂ Am for all m ≤ n. Their projective limit in S  is then defined as the set A = n πn−1 An . Lemma A6.4 (projective limits) For any metric spaces S1 , S2 , . . . , consider a projective sequence of non-empty, compact sets Kn ⊂ S1 × · · · × Sn , n ∈ N.  Then the projective limit K = n πn−1 Kn is again non-empty and compact. Proof: Since the Kn are non-empty, we may choose some sequences xn = πn−1 Kn , n ∈ N. By the projective property of the sets Km , we have Km for all m ≤ n. In particular, the sequence x1m , x2m , . . . is relatively

(xnm ) ∈ πm x n ∈


compact in S_m for each m ∈ N, and so a diagonal argument yields convergence x^n → x = (x_m) ∈ S along a sub-sequence N′ ⊂ N, which implies π_m x^n → π_m x along N′ for each m ∈ N. Since the K_m are closed, we obtain π_m x ∈ K_m for all m, and so x ∈ K, which shows that K is non-empty. The compactness of K may be proved by a similar argument, where we assume that x¹, x², . . . ∈ K. □

We also include a couple of estimates for convex sets, needed in Chapter 25. Here ∂_ε B denotes the ε-neighborhood of the boundary ∂B.

Lemma A6.5 (convex sets) If B ⊂ R^d is convex and ε > 0, then
(i) λ^d(B − B) ≤ ((2d)!/(d!)²) λ^d B,
(ii) λ^d(∂_ε B) ≤ 2((1 + ε/r_B)^d − 1) λ^d B.

Proof: For (i), see Rogers & Shephard (1958). Part (ii) is elementary. □

Throughout this book, we assume the Zermelo–Fraenkel axioms (ZF) of set theory, amended with the Axiom of Choice (AC), stated below. It is known that, if the ZF axioms are consistent²⁰ (non-contradictory), then so is the ZF system amended with AC.

Axiom of Choice, AC: For any class of sets S_t ≠ ∅, t ∈ T, there exists a function f on T with f(t) ∈ S_t, t ∈ T.

In other words, for any collection of non-empty sets S_t, the product set ×_t S_t is again non-empty. We often need the following equivalent statement involving partially ordered sets (S, ≺). When A ⊂ S is linearly ordered under ≺, we say that b ∈ S is an upper bound of A if s ≺ b for all s ∈ A. Furthermore, an element m ∈ S is said to be maximal if m ≺ s ∈ S ⇒ s = m.

Lemma A6.6 (Zorn's lemma, Kuratowski) Let (S, ≺) be a partially ordered set. Then (i) ⇒ (ii), where
(i) every linearly ordered subset A ⊂ S has an upper bound b ∈ S,
(ii) S has a maximal element m ∈ S.

Proof: See Dudley (1989) for a comprehensive discussion of AC and its equivalents. □

²⁰ The consistency of ZF must be taken on faith, since no proof is possible. It had better be true, or else the entire edifice of modern mathematics would collapse. Most logicians seem to agree that even questions of truth are meaningless.


7. Differential geometry

Here we state some basic facts from intrinsic²¹ differential geometry, needed in Chapter 35. A more detailed discussion of the required results is given by Emery (1989).

Lemma A7.1 (tangent vectors) Let M be a smooth manifold with class S of smooth functions M → R, and consider a point-wise smooth map u : S → S. Then these conditions are equivalent²²:
(i) u is a differential operator²³ of the form

uf = u^i ∂_i f,   f ∈ S,

(ii) u is linear and satisfies

uf² = 2f(uf),   f ∈ S,

(iii) for any smooth functions f = (f¹, . . . , f^n) : M → R^n and ϕ : R^n → R,

u(ϕ ∘ f) = (∂_i ϕ ∘ f) uf^i.

Proof, (i) ⇒ (iii): Consider the restriction to a local chart U, and write f^i = f̃^i ∘ ψ in terms of some local coordinates ψ¹, . . . , ψ^n. Using (i) (twice) along with the elementary chain rule, we get

u(ϕ ∘ f) = (uψ^i) (∂_i(ϕ ∘ f̃) ∘ ψ)
  = (uψ^i) (∂_j ϕ ∘ f̃ ∘ ψ)(∂_i f̃^j ∘ ψ)
  = (∂_j ϕ ∘ f)(uψ^i)(∂_i f̃^j ∘ ψ)
  = (∂_j ϕ ∘ f)(uf^j).

(iii) ⇒ (ii): Taking n = 2 and ϕ(x, y) = ax + by yields the linearity of u. Then take n = 1 and ϕ(x) = x² to get the relation in (ii).

(ii) ⇒ (i): Condition (ii) gives u1 = 0, and by polarization u(fg) = f(ug) + g(uf), which extends by iteration to

u(fgh) = fg(uh) + gh(uf) + hf(ug),   f, g, h ∈ S.   (11)

r ∈ Rn ,

meaning that no notion or result depends on the choice of local coordinates Here and below, summation over repeated indices is understood. 23 In local coordinates, we can identify u with the geometric vector u = (u1 , . . . , un ). 22

848

Foundations of Modern Probability

for some smooth functions hij on Rn . Inserting r = ϕ(x), applying u to both sides, and noting that uc = 0 for constants c, we get



uf (x) = uϕi (x) ∂i f˜(0) + u ϕi ϕj (hij ◦ ϕ) (x). Taking x = a and using (11), we obtain



uf (a) = uϕi (a) ∂i f˜(0), 2

and the assertion follows with ui = uϕi .

The space of tangent vectors at x is denoted by Tx S, and a smooth function of tangent vectors is called a (tangent) vector field on S. The space of vector fields u, v, . . . on S will be denoted by TS . A cotangent at x is an element of the dual space Tx∗ , and smooth functions α, β, . . . of cotangent vectors are called (simple) forms. Similarly, smooth functions b, c, . . . on the dual TS∗2 of the product space TS2 are called bilinear forms. The dual coupling of simple or bilinear forms α or b with vector fields u, v is often written as α, u = α(u),

b, (u, v) = b(u, v).

A special role is played by the forms df , given by df, u = df (u) = uf,

f ∈ S.

For any simple forms α, β, we define a bilinear form α ⊗ β by (α ⊗ β)(u, v) = α, u β, v = α(u)β(v), so that in particular (df ⊗ dg)(u, v) = (uf )(vg). In general, we have the following representations24 in terms of simple or bilinear forms df and df ⊗ dg. Lemma A7.2 (simple and bilinear forms) For a smooth manifold S, (i) every form on S is a finite sum α = a i df i with a i , f i ∈ S, (ii) every bilinear form on S is a finite sum b = b ij (df i ⊗ dg j ) with b ij , f i , g j ∈ S. Such simple and bilinear forms can be represented as follows in terms of local coordinates: 24

Note that the coefficients are functions, not constants.

Appendices

849

Lemma A7.3 (coordinate representations) Let S be a smooth manifold with local coordinates x1 , . . . , xn . Then for any f, g, h ∈ S, (i)

f dg = (f ∂i g)(x) dxi ,

(ii)

f (dg ⊗ dh) = f (∂i g)(∂j h) (x) (dxi ⊗ dxj ).





For a smooth mapping ϕ between two manifolds S, S  , the push-forward of a tangent vector u ∈ Tx S into a tangent vector ϕ ◦ u ∈ Tϕx S  is given by25 (ϕ ◦ u)f = u(f ◦ ϕ), df (ϕ ◦ u) = d(f ◦ ϕ)u. The dual operations of pull-back of a simple or bilinear form α or b on S, here denoted by ϕ∗ or ϕ∗2 , respectively, are given, for tangent vectors u, v at suitable points, by (ϕ∗ α)u = α(ϕ ◦ u), (ϕ∗2 b)(u, v) = b(ϕ ◦ u, ϕ ◦ v). (In traditional notation, the latter formulas may be written as $ # $ # ∗ (Tx ϕ) α, A x = α, (Tx ϕ)A ϕ(x) ,   (T ∗ ϕ ⊗ T ∗ ϕ) b(x)(A, B) = b(ϕ(x)) (Tx ϕ)A, (Tx ϕ)B .)

Here we list some useful pull-back formulas, first for simple forms, ϕ∗ (f α) = (f ◦ ϕ)(ϕ∗ α), ϕ∗ (df ) = d(f ◦ ϕ), and then for bilinear forms, ϕ∗2 (f b) = (f ◦ ϕ)(ϕ∗2 b), ϕ (α ⊗ β) = ϕ∗ α ⊗ ϕ∗ β, (ψ ◦ ϕ)∗2 = ϕ∗2 ◦ ψ ∗2 . ∗2

A Riemannian metric on S is a symmetric, positive-definite, bilinear form ρ on S, determining an inner product u | v = ρ(u, v) with associated norm u on S. For any f ∈ S, we define the gradient grad f = fˆ as the unique vector field on S satisfying ρ(fˆ, u) = fˆ | u = df (u) = uf, for any vector field u on S. The inner product on TS may be transferred to the forms df through the relation df | dg = fˆ | gˆ , so that26 25 To facilitate the access for probabilists, we often use a simplified notation. Traditional versions of the stated formulas are sometimes inserted for reference and comparison. 26 The mapping df → fˆ extends immediately to a linear map α → α between TS∗ and TS with inverse u → u , sometimes called the musical isomorphisms. Unfortunately, the nice interpretation of fˆ as a gradient vector seems to be lost in general.

850

Foundations of Modern Probability fˆg = gˆf = fˆ| gˆ = df | dg .

Letting ρ = ρij (dxi ⊗ dxj ) in local coordinates and writing (ρij ) for the inverse of the matrix (ρij ), we note that u | v = ρij ui v j , fˆi = ρij ∂j f,

fˆ | gˆ = ρij fˆi gˆj , ∂i f = ρij fˆj .

For special purposes, we also need some Lie derivatives. Since the formal definitions may be confusing for the novice27 , we list only some basic formulas, which can be used recursively to calculate the Lie derivatives along a given vector field u of a smooth function f , another vector field v, a form α, or a bilinear form b. Theorem A7.4 (Lie derivatives) (i) Lu f = uf , (ii) Lu v = uv − vu = [u, v], (iii) Lu α, v = Lu α, v − α, Lu v , (iv) (Lu b)(v, w) = u{b(v, w)} − b (Lu v, w) − b (v, Lu w), Proof: See Emery (1989), pp. 16–19.

2

Thus, Lu f is simply the vector field u itself, applied to the function f . The u -derivative of a vector field v equals the commutator or Lie bracket [u, v]. Since the u -derivative of a simple or bilinear form α or b can be calculated recursively from (i)−(iv), we never need to go back to the underlying definitions in terms of derivatives. From the stated formulas, it is straightforward to derive the following useful relation: Corollary A7.5 (transformation rule) For any smooth function f , bilinear form b, and vector field u, we have Lf u b = f Lu b + df ⊗ b(u, ·) + b(·, u) ⊗ df. Proof: See Emery (1989), p. 19.

2

Exercise 1. Explain in what sense we can think of ξ in Theorem A1.2 as a measurable selection. Further state the result in terms of random sets. 2. A partially ordered set I is said to be linearly ordered if, for any two elements i, j ∈ I, we have either i ≺ j or j ≺ i. Give an example of a partially ordered set that is directed by not linearly ordered. 27

and are presumably well-known to the experts

Appendices

851

3. For a topological space S, give an example of a net in S that is not indexed by a linearly ordered set I. 4. Give an example of a Polish space that is not locally compact. (Hint: This is true for for many function spaces.) 5. Give an example of a convex functional on Rd that is not linear. (Hint: If M is a convex body in Rd , then the functional p(x) = inf{r > 0; (x/r) ∈ M } is convex. When is it linear?) 6. Give examples of two different norms in Rd . Then give the geometric meaning of a linear functional on Rd . Finally, for each of the norms, explain the meaning of the Hahn–Banach theorem, and show that the claimed extension of f may not be unique. (Hint: The lp norms are different for 1 ≤ p ≤ ∞. We may think of the dual as a space of vectors.) 7. For Rd with the usual ‘dot’ product, give examples of two different ONBs, and explain the geometric meaning of the statements in Lemma A3.7. 8. In the Euclidean space Rd , give the geometric and/or algebraic meaning of adjoints, unitary operators, and projections. Further explain in what sense geometric projections are self-adjoint and idempotent, and give the geometric meaning of the various propositions involving projection operators. 9. Prove Lemmas A4.7, A4.8, and A4.9. (Hint: Here Lemma A4.6 may be helpful.) 10. Show that the operator U in Theorem A4.10 is unitary, and prove the result. Then explain how the statement is related to Theorems 25.18 and 30.11. 11. Prove Lemma A6.5 (ii). (Hint (Day): First show that if Sr ⊂ B, then B + Sε ⊂ (1 + ε/r)B, where Sr denotes an r -ball around 0.)

Notes and References Here my modest aims are to trace the origins of some basic ideas in each chapter 1 , to give references for the main results cited in the text, and to suggest some literature for further reading. No completeness is claimed, and knowledgeable readers will notice many errors and omissions, for which I apologize in advance. I also inserted some footnotes with short comments about the lives and careers of a few notable people 2 . Here my selection is very subjective, and many others could have been included.

1. Sets and functions, measures and integration The existence of non-trivial, countably additive measures was first discovered by Borel (1895/98), who constructed Lebesgue measure on the Borel σfield in R. The corresponding integral was introduced by Lebesgue3 (1902/04), who also established the dominated convergence theorem. The monotone convergence theorem and Fatou’s lemma were later obtained by Levi (1906a) and Fatou (1906), respectively. Lebesgue also introduced the higher-dimensional Lebesgue measure and proved a first version of Fubini’s theorem, subsequently generalized by Fubini (1907) and Tonelli (1909). The integration theory was extended to general measures and abstract spaces by many authors, including ´chet (1928). Radon (1913) and Fre The norm inequalities in Lemma 1.31 were first noted for finite sums by ¨ lder (1889) and Minkowski (1907), respectively, and were later extended Ho to integrals by Riesz (1910). Part (i) for p = 2 goes back to Cauchy (1821) for finite sums and to Buniakowsky (1859) for integrals. The Hilbert space projection theorem can be traced back to Levi (1906b). ´ski (1928). Their Monotone-class theorems were first established by Sierpin `ve (1955), and use in probability theory goes back to Halmos (1950), Loe Dynkin (1961). Mazurkiewicz (1916) proved that a topological space is Polish iff it is homeomorphic to a countable intersection of open sets in [0, 1]∞ . The fact that every uncountable Borel set in a Polish space is Borel isomorphic to 2∞ was proved independently by Alexandrov (1916) and Hausdorff (1916). Careful proofs of both results appear in Dudley (1989). Localized Borel spaces were used systematically in K(17). Most results in this chapter are well known and can be found in any textbook on real analysis. Many graduate-level probability texts, such as Billingsley (1995) and C ¸ inlar (2011), contain introductions to measure 1

A comprehensive history of modern probability theory still remains to be written. I included only people who are no longer alive. 3 Henri Lebesgue (1875–1941), French mathematician, along with Borel a founder of measure theory, which made modern probability possible. 2

© Springer Nature Switzerland AG 2021 O. Kallenberg, Foundations of Modern Probability, Probability Theory and Stochastic Modelling 99, https://doi.org/10.1007/978-3-030-61871-1

853

854

Foundations of Modern Probability

theory. More advanced or comprehensive accounts are given, with numerous remarks and references, by Dudley (1989) and Bogachev (2007). We are tacitly adopting ZFC or ZF+AC—the Zermelo–Fraenkel system amended with the Axiom of Choice—as the axiomatic foundation of modern analysis and probability. See, e.g., Dudley (1989) for a detailed discussion. 2. Measure extension and decomposition As already mentioned, Borel (1895/98) was the first to prove the existence of one-dimensional Lebesgue measure. The modern construction via outer mea´odory (1918). Functions of bounded variation were sures in due to Carathe introduced by Jordan (1881), who proved that any such function is the difference of two non-decreasing functions. The corresponding decomposition of signed measures was obtained by Hahn (1921). Integrals with respect to nondecreasing functions were defined by Stieltjes (1894), but their importance was not recognized until Riesz4 (1909b) proved his representation theorem for linear functionals on C[0, 1]. The a.e. differentiability of a function of bounded variation was first proved by Lebesgue (1904). Vitali (1905) noted the connection between absolute continuity and the existence of a density. The Radon–Nikodym theorem was then proved in increasing generality by Radon (1913), Daniell (1920), and Nikodym (1930). A combined proof that also yields the Lebesgue decomposition was devised by von Neumann. Atomic decompositions and factorial measures appear in K(75/76) and K(83/86), respectively, and the present discussion is adapted from K(17). Constructions of Lebesgue measure and proofs of the Radon–Nikodym theorem and Lebesgue’s differentiation theorem are given in most textbooks on real analysis. Detailed accounts adapted to the needs of probabilists appear `ve (1977), Billingsley (1995), and Dudley (1989). in, e.g., Loe 3. Kernels, disintegration, and invariance Though kernels have long been used routinely in many areas of probability theory, they are hardly mentioned in most real analysis or early probability texts. Their basic properties were listed in K(97/02), where the basic kernel operations are also mentioned. Though the disintegration of measures on a product space is well known and due to Doob (1938), the associated disintegration formula is rarely stated and often used without proof, an early exception being K(75/76). Partial disintegrations go back to K(83/86). The theory of iterated disintegration was developed in K(10/11b), for applications to Palm measures. Invariant measures on special groups were identified by explicit computation by many authors, including Hurwitz (1897). Haar (1933) proved the 4 Riesz, Frigyes (1880–1956) and Marcel (1886–1969), Hungarian mathematicians and brothers, making fundamental contributions to functional analysis, potential theory, and other areas. Among their many students, we note R´enyi (Frigyes) and Cram´er (Marcel).

Historical Notes and References

855

existence of invariant measures on a general lcscH group. Their uniqueness was then established by Weil (1936/40), who also extended the theory to homogeneous spaces. The present representation of invariant measures, under a non-topological, proper group action, was established in K(11a). The usefulness of s-finite measures was recognized by Sharpe (1988) and Getoor (1990). Invariant disintegrations were first used to construct Palm distributions of stationary point processes. Here the regularization approach goes back to Ryll-Nardzewski (1961) and Papangelou (1974a), whereas the simple skew factorization was first noted by Matthes (1963) and further developed by Mecke (1967) and Tortrat (1969). The general factorization and regularization results for invariant measures on a product space were obtained in K(07). The notions of strict and weak asymptotic invariance, going back to Dobrushin (1956), were further studied by Debes et al. (1970/71) and Matthes et al. (1974/78). Stone (1968) noted that absolute continuity implies local invariance, and the latter notion was studied more systematically in K(78b). The existence and uniqueness of Haar measures appears, along with much related material, in many textbooks on real or harmonic analysis, such as in Hewitt & Ross (1979). The existence of regular conditional distributions is proved in most graduate-level probability texts, though often without a statement of the associated disintegration formula. For a more detailed and comprehensive account of the remaining material, we refer to K(17). 4. Processes, distributions, and independence Countably additive probability measures were first used by Borel5 (1909/ 24), who constructed random variables as measurable functions on the Lebesgue unit interval and proved Theorem 4.18 for independent events. Cantelli (1917) noticed that the ‘easy’ part remains true without the independence assumption. Lemma 4.5 was proved by Jensen (1906), after a special case had ¨ lder. been noted by Ho The modern framework—with random variables as measurable functions on an abstract probability space (Ω, A, P ) and with expected values as P integrals over Ω—was used implicitly by Kolmogorov from (1928) on. It was later formalized in the monograph of Kolmogorov6 (1933c), which also contains the author’s 0−1 law, discovered long before the 0−1 law of Hewitt 5´

Emile Borel (1871–1956), French mathematician, was the first to base probability theory on countably additive measures, 24 years before Kolmogorov’s axiomatic foundations, which makes him the true founder of the subject. 6 Andrey Kolmogorov (1903–87), Russian mathematician, making profound contributions to many areas of mathematics, including probability. His 1933 paper was revolutionary, not only for the axiomatization of probability theory, which was essentially known since Borel (1909), but also for his definition of conditional expectations, his 0−1 law, and his existence theorem for processes with given finite-dimensional distributions. As a leader of the Moscow probability school, he had countless of students.

856

Foundations of Modern Probability

& Savage (1955). Early work in probability theory deals mostly with properties depending only on the finite-dimensional distributions. Wiener (1923) was the first to construct the distribution of a process as a measure on a function space. The general continuity criterion in Theorem 4.23, essentially due to Kolmogorov, `ve was first published by Slutsky (1937), with minor extensions added by Loe (1955) and Chentsov (1956). The search for general regularity properties was initiated by Doob (1937/47). It soon became clear, through the work of ´vy (1934/37), Doob (1951/53), and Kinney (1953), that most processes Le of interest have right-continuous versions with left-hand limits. Introductions to measure-theoretic probability are given by countless authors, including Breiman (1968), Chung (1974), Billingsley (1995), and C ¸ inlar (2011). Some more specific regularity properties are discussed in `ve (1977) and Crame ´r & Leadbetter (1967). Earlier texts tend to Loe give more weight to distribution functions and their densities, less to measures and σ-fields. 5. Random sequences, series, and averages The weak law of large numbers was first obtained by Bernoulli (1713), for the sequences named after him. More general versions were then established ´ (1853), Chebyshev (1867), and Markov with increasing rigor by Bienayme (1899). A necessary and sufficient condition for the weak law of large numbers was finally obtained by Kolmogorov (1928/29). Khinchin & Kolmogorov (1925) studied series of independent, discrete random variables, and showed that convergence holds under the condition in Lemma 5.16. Kolmogorov (1928/29) then obtained his maximum inequality and showed that the three conditions in Theorem 5.18 are necessary and sufficient for a.s. convergence. The equivalence with convergence in distribution ´vy (1937). was later noted by Le A strong law of large numbers for Bernoulli sequences was stated by Borel (1909a), but the first rigorous proof is due to Faber (1910). The simple criterion in Corollary 5.22 was obtained in Kolmogorov (1930). In (1933a) Kolmogorov showed that existence of the mean is necessary and sufficient for the strong law of large numbers, for general i.i.d. sequences. The extension to exponents p = 1 is due to Marcinkiewicz & Zygmund (1937). Proposition 5.24 was proved in stages by Glivenko (1933) and Cantelli (1933). Riesz (1909a) introduced the notion of convergence in measure, for probability measures equivalent to convergence in probability, and showed that it implies a.e. convergence along a subsequence. The weak compactness criterion in Lemma 5.13 is due to Dunford (1939). The functional representation of Proposition 5.32 appeared in K(96a), and Corollary 5.33 was given by Stricker & Yor (1978). The theory of weak convergence was founded by Alexandrov (1940/43), who proved in particular the ‘portmanteau’ Theorem 5.25. The continuous

Historical Notes and References

857

mapping Theorem 5.27 was obtained for a single function fn ≡ f by Mann & Wald (1943), and then in general by Prohorov (1956) and Rubin. The coupling Theorem 5.31 is due for complete S to Skorohod (1956) and in general to Dudley (1968). More detailed accounts of the material in this chapter may be found in `ve (1977) and Chow & Teicher (1997). many textbooks, such as in Loe Additional results on random series and a.s. convergence appear in Stout ´ & Woyczyn ´ ski (1992). (1974) and Kwapien 6. Gaussian and Poisson convergence ´ lya (1920)—has a long and The central limit theorem—so first called by Po glorious history, beginning with the work of de Moivre7 (1738), who obtained the familiar approximation of binomial probabilities in terms of the normal density. Laplace (1774, 1812/20) stated a general version of the central limit theorem, but his proof was incomplete, as was the proof of Chebyshev (1867/90). The normal laws were used by Gauss (1809/16) to model the distribution of errors in astronomical observations. The first rigorous proof was given by Liapounov (1901), though under an extra moment condition. Then Lindeberg (1922a) proved his fundamental Theorem 6.13, which led in turn to the basic Proposition 6.10, in a series of ´vy (1922a–c). Bernstein (1927) obpapers by Lindeberg (1922b) and Le tained the first extension to higher dimensions. The general problem of normal convergence, regarded for two centuries as the central (indeed the only nontrivial) problem in probability, was eventually solved in the form of Theorem ´vy (1935a). 6.16, independently by Feller (1935) and Le Many authors have proved local versions of the central limit theorem, the present version being due to Stone (1965). The domain of attraction to the ´vy, Feller, and Khinchin. General laws of normal law was identified by Le attraction were studied by Doeblin (1947, posth.), who noticed the connection with Karamata’s (1930) theory of regular variation. Even the simple Poisson approximation of binomial probabilities was noted by de Moivre (1711). It was rediscovered by Poisson (1837), who also noted that the limits form a probability distribution in its own right. Applications of this ‘law of small numbers’ were compiled by Cournot (1843) and Bortkiewicz8 (1898). Though characteristic functions were used in probability already by Laplace (1812/20), their first rigorous use to prove a limit theorem is credited to Liapounov (1901). The first general continuity theorem was established by ´vy (1922c), who assumed the characteristic functions to converge uniformly Le in a neighborhood of the origin. The simpler criterion in Theorem 6.23 is due to Bochner (1933). Our direct approach to Theorem 6.3 may be new, in 7 Abraham de Moivre (1667–1754), French mathematician. To escape the persecution of huguenots he fled to England, where he befriended Newton. He discovered both the normal and the Poisson approximation to the binomial distribution. 8 Known for the example of the number of deaths by horse kicks in the Prussian army.

858

Foundations of Modern Probability

avoiding the relatively deep selection theorem of Helly (1911/12). The basic ´r & Wold (1936). Corollary 6.5 was noted by Crame Introductions to characteristic functions and classical limit theorems may `ve (1977). Feller (1971) is a rich be found in many textbooks, notably Loe source of further information on Laplace transforms, characteristic functions, and classical limit theorems. For more detailed or advanced results on characteristic functions, see Lukacs (1970). 7. Infinite divisibility and general null arrays ´vy The scope of the classical central limit problem was broadened by Le (1925) to encompass the study of suitably normalized partial sums, leading to the classes of stable and self-decomposable limiting distributions. To include the case of the classical Poisson approximation, Kolmogorov proposed a further extension to general triangular arrays, subject to the sole condition of uniform asymptotic negligibility, allowing convergence to general infinitely divisible laws. de Finetti (1929) saw how the latter distributions are related to processes with stationary, independent increments, and posed the problem of finding their general form. Here the characteristic functions were first characterized by Kolmogorov ´vy (1932), though only under a moment condition that was later removed by Le (1934/35). Khinchin (1937/38) noted the potential for significant simplifications in the one-dimensional case, leading to the celebrated L´evy–Khinchin formula. The probabilistic interpretation in terms of Poisson processes is essenˆ (1942b). The simple form for non-negative random variables tially due to Ito ˇina (1964). was noted by Jir For the mentioned limit problem, Feller (1937) and Khinchin (1937) proved independently that all limiting distributions are infinitely divisible. It remained to characterize the convergence to specific limits, continuing the work ´vy (1935a) for Gaussian limits. The general criteby Feller (1935) and Le ria were found independently by Doeblin (1939a) and Gnedenko (1939). Convergence criteria for infinitely divisible distributions on Rd+ were given ˇina (1966) and Nawrotzki (1968). Corresponding criteria for nullby Jir arrays were obtained in K(75/76), after the special case of Z+ -valued variables had been settled by Kerstan & Matthes (1964). Extreme value theory is surveyed by Leadbetter et al. (1983). Lemma 7.16 appears in Doeblin (1939a). The limit theory for null arrays of random variables was once regarded as the central topic of probability theory, and comprehensive treatments were `ve (1977), and Petrov given by Gnedenko & Kolmogorov (1954/68), Loe (1995). The classical approach soon got a bad reputation, because of the numerous technical estimates involving characteristic and distribution functions, and the subject is often omitted in modern textbooks.9 Eventually, Feller 9 This is a huge mistake, not only since characteristic functions provide the easiest approach to the classical limit theorems, but also for their crucial role in connection with

Historical Notes and References

859

(1966/71) provided a more readable, even enjoyable account. The present concise probabilistic approach, based on compound Poisson approximations and methods of modern measure theory, was pioneered in K(97). It has the further advantage of extending easily to the multi-variate case. 8. Conditioning and disintegration Though conditional densities have been computed by statisticians ever since Laplace (1774), a general approach to conditioning was not devised until Kolmogorov (1933c), who defined conditional probabilities and expectations as random variables on the basic probability space, using the Radon–Nikodym theorem, which had recently become available. His original notion of conditioning with respect to a random vector was extended by Halmos (1950) to arbitrary random elements, and then by Doob (1953) to general sub -σ-fields. Our present Hilbert space approach to conditioning, essentially due to von Neumann (1940), is simpler and more suggestive, in avoiding the use of the less intuitive Radon–Nikodym theorem. It has the further advantage of leading to the attractive interpretation of martingales as projective families of random variables. The existence of regular conditional distributions was studied by several authors, beginning with Doob (1938). It leads immediately to the often used but rarely stated disintegration Theorem 8.5, extending Fubini’s theorem to the case of dependent random variables. The fundamental importance of conditional distributions is not always appreciated.10 The interpretation of the simple Markov property in terms of conditional independence was indicated already by Markov (1906), and the formal statement of Proposition 8.9 appears in Doob (1953). Further properties of con¨ hler (1980) and others. Our ditional independence have been listed by Do statement of conditional independence in terms of Hilbert-space projections is new. The subtle device of iterated conditioning was developed in K(10/11b). The fundamental transfer Theorem 8.17, noted by many authors, may have been stated explicitly for the first time by Aldous (1981). It leads in particular to a short proof of Daniell’s (1918/19/20) existence theorem for sequences of random elements in a Borel space, which in turn implies Kolmogorov’s (1933c) celebrated extension to continuous time and arbitrary index sets. L  omnicki & Ulam (1934) noted that no topological assumptions are needed for independent random elements, a result that was later extended by Ionescu Tulcea (1949/50) to measures specified by a sequence of conditional distributions. The traditional Radon–Nikodym approach to conditional expectations appears in most textbooks, such as in Billingsley (1995). Our projection approach is essentially the one used by Dellacherie & Meyer (1975) and martingales, L´evy processes, random measures, and potentials. 10 Dellacherie & Meyer (1975) write: “[They] have a bad reputation, and probabilists rarely employ them without saying ‘we are obliged’ to use conditional distributions . . . .” We may hope that times have changed, and their importance is now universally recognized.

860

Foundations of Modern Probability

Elliott (1982). 9. Optional times and martingales Martingales were first introduced by Bernstein11 (1927/37), in his efforts to relax the independence assumption in the classical limit theorems. ´vy (1935a-b/37) extended Kolmogorov’s maximum Both Bernstein and Le inequality and the central limit theorem to a general martingale context. The term martingale (originally denoting part of a horse’s harness and later used for a special gambling system) was introduced in the probabilistic context by Ville (1939). The first martingale convergence theorem was obtained by Jessen12 (1934) ´vy (1935b), both of whom proved Theorem 9.24 for filtrations generated and Le by a sequence of independent random variables. A sub-martingale version of the same result appears in Sparre-Andersen & Jessen (1948). The ´vy (1937/54), who also noted the independence assumption was removed by Le simple martingale proof of Kolmogorov’s 0−1 law and proved his conditional version of the Borel–Cantelli lemma. The general convergence theorem for discrete-time martingales was proved by Doob13 (1940), and the basic regularity theorems for continuous-time martingales first appeared in Doob (1951). The theory was extended to submartingales by Snell (1952) and Doob (1953). The latter book is also the original source of such fundamental results as the martingale closure theorem, the optional sampling theorem, and the Lp -inequality. Though hitting times have long been used informally, general optional times (often called stopping times) seem to appear for the first time in Doob (1936). Abstract filtrations were not introduced until Doob (1953). Progressive processes were introduced by Dynkin (1961), and the modern definition of the σ-fields Fτ is due to Yushkevich. The first general exposition of martingale theory, truly revolutionary for its time, was given by Doob (1953). Most graduate-level texts, such as C ¸ inlar (2011), contain introductions to martingales, filtrations, and optional times. More extensive coverage of the discrete-time theory appears in Neveu (1975), Chow & Teicher (1997), and Rogers & Williams (1994). The encyclopedic treatises of Dellacherie & Meyer (1975/80) remain standard references for the continuous-time theory. 10. Predictability and compensation Predictable and totally inaccessible times appear implicitly, along with 11

Sergei Bernstein (1880–1968), Ukrainian and Sovjet mathematician working in analysis and probability, invented martingales and proved martingale versions of some classical results for independent variables. 12 Børge Jessen (1907–93), Danish mathematician working in analysis and geometry, proved the first martingale convergence theorem, long before Doob. 13 Joseph Doob (1910–2004), American mathematician and a founder of modern probability, known for path-breaking work on martingales and probabilistic potential theory.

Historical Notes and References

861

quasi-left continuous processes, in the work of Blumenthal (1957) and Hunt (1957/58). A systematic study of optional times and their associated σ-fields ´ans was initiated by Chung & Doob (1965), Meyer (1966/68), and Dole (1967a). Their ideas were further developed by Dellacherie (1972), Dellacherie & Meyer (1975/80), and others into a ‘general theory of processes’, which has in many ways revolutionized modern probability. The connection between excessive functions and super-martingales, noted by Doob (1954), suggested a continuous-time extension of the elementary Lemma 9.10. Such a result was eventually proved by Meyer14 (1962/63) in the form of Lemma 10.7, after special decompositions in the Markovian context ´ans (1967) had been obtained by Volkonsky (1960) and Shur (1961). Dole proved the equivalence of natural and predictable processes, leading to the ultimate version of the Doob–Meyer decomposition in the form of Theorem 10.5. The original proof, appearing in Dellacherie (1972) and Dellacherie & Meyer (1975/80), is based on some deep results in capacity theory. The present more elementary approach combines Rao’s (1969) simple proof of Lemma 10.7, based on Dunford’s (1939) weak compactness criterion, with ˆ & Doob’s (1984) ingenious approximation of totally inaccessible times. Ito Watanabe (1965) noted the extension to general sub-martingales, by suitable localization. The present proof of Theorem 10.14 is taken from Chung & Walsh (1974). The moment inequality in Proposition 10.20 was proved independently by Garsia (1973) and Neveu (1975), after a special case had been obtained by Burkholder et al. (1972). Compensators of optional times first arose as hazard functions in reliability theory. More general compensators were later studied in the Markovian context by S. Watanabe (1964) under the name of L´evy systems. Grigelionis (1971) and Jacod (1975) constructed the compensators of adapted random measures on a product space R+ × S. The elementary formula for the induced compensator of a random pair (τ, χ) is classical and appears in, e.g., Dellacherie (1972) and Jacod (1975). The equation Z = 1 − Z− · η¯ was studied ´ans (1970). Discounted compensators were inin greater generality by Dole troduced in K(90), which contains the fundamental martingale and mapping results like Theorem 10.27. Induced compensators of point processes have been studied by many authors, including Last & Brandt (1995) and Jacobsen (2006). More advanced and comprehensive accounts of the ‘general theory’ are given by Dellacherie (1972), Elliott (1982), Rogers & Williams (1987/2000), and Dellacherie & Meyer (1975/80).

14 ´ Meyer (1934–2003), French mathematician, making profound contribuPaul-Andre tions to martingales, stochastic integration, Markov processes, potential theory, Malliavin calculus, quantum probability, and stochastic differential geometry. As a leader of the famous Strasbourg school of probability, his work inspired probabilists all over the world.

11. Markov properties and discrete-time chains

Discrete-time Markov chains in a finite state space were introduced by Markov15 (1906), who proved the first ergodic theorem and introduced the notion of irreducible chains. Kolmogorov (1936a/b) extended the theory to a countable state space and general transition probabilities. In particular, he obtained a decomposition of the state space into irreducible sets, classified the states with respect to recurrence and periodicity, and described the asymptotic behavior of the n-step transition probabilities. Kolmogorov's original proofs were analytic. The more intuitive coupling approach was introduced by Doeblin (1938a/b), long before the strong Markov property was formalized.

Bachelier had noted the connection between random walks and diffusions, which inspired Kolmogorov (1931a) to give a precise definition of Markov processes in continuous time. His treatment is purely analytic, with the distribution specified by a family of transition kernels satisfying the Chapman–Kolmogorov relation, previously noted in special cases by Chapman (1928) and Smoluchovsky. Kolmogorov makes no reference to sample paths. The transition to probabilistic methods began with the work of Lévy (1934/35) and Doeblin (1938a/b).

Though the strong Markov property had been used informally since Bachelier (1900/01), it was first proved rigorously in a special case by Doob (1945). The elementary inequality of Ottaviani is from (1939). General filtrations were introduced in Markov process theory by Blumenthal (1957). The modern setup, with a canonical process X defined on the path space Ω, equipped with a filtration F, a family of shift operators θt, and a collection of probability measures Px, was developed systematically by Dynkin (1961/65). A weaker form of Theorem 11.14 appears in Blumenthal & Getoor (1968), and the present version is from K(87).

Most graduate-level texts contain elementary introductions to Markov chains in discrete and continuous time. More extensive discussions appear in Feller (1968/71) and Freedman (1971a). Most literature on general Markov processes is fairly advanced, leading quickly into discussions of semi-group and potential theory, here treated in Chapters 17 and 34. Standard references include Dynkin16 (1965), Blumenthal & Getoor (1968), Sharpe (1988), and Dellacherie & Meyer (1987/92). The coupling method, soon forgotten after Doeblin's tragic death17, was revived by Lindvall (1992) and Thorisson (2000).

15 Andrei Markov (1856–1922), Russian mathematician, inventor of Markov processes.
16 Eugene Dynkin (1924–2014), Russian-American mathematician, worked on Lie groups before turning to probability. Laid the foundations of the modern theory of Markov processes, and made seminal contributions to the theory of super-processes.
17 Wolfgang Doeblin (1915–40), born in Berlin to the famous author Alfred Döblin. When Hitler came to power, his Jewish family fled to Paris. After a short career inspired by Lévy and Fréchet, he joined the French army to fight the Nazis. When his unit got cornered, he shot himself, to avoid getting caught and sent to an extermination camp. Aged 25, he had already made revolutionary contributions to many areas of probability, including construction of diffusions with given drift and diffusion rates, time-change reduction of martingales to Brownian motion, solution of the general limit problem for null arrays, and recurrence and ergodic properties of Markov processes.

12. Random walks and renewal processes

Random walks arose early in a wide range of applications, such as in gambling, queuing, storage, and insurance; their history can be traced back to the origins of probability. Already Bachelier (1900/01) used random walks to approximate diffusion processes. Such approximations were used in potential theory in the 1920s, to yield probabilistic interpretations in terms of a simple symmetric random walk. Finally, random walks played an important role in the sequential analysis developed by Wald (1947).

The modern developments began with Pólya's (1921) discovery that a simple symmetric random walk in Z^d is recurrent for d ≤ 2, otherwise transient. His result was later extended to Brownian motion by Lévy (1940) and Kakutani (1944a). The general recurrence criterion in Theorem 12.4 was obtained by Chung & Fuchs (1951), and the probabilistic approach to Theorem 12.2 was found by Chung & Ornstein (1962). The first condition in Corollary 12.7 is even necessary for recurrence, as noted independently by Ornstein (1969) and Stone (1969). Wald (1944) discovered the equations named after him.

The reflection principle was first used by André (1887) in the context of the ballot problem. The systematic study of fluctuation and absorption problems for random walks began with the work of Pollaczek (1930). Ladder times and heights, first introduced by Blackwell, were explored in an influential paper by Feller (1949). The factorizations in Theorem 12.16 were originally derived by the Wiener–Hopf technique, developed by Paley & Wiener (1934) as a general tool in Fourier analysis. Theorem 12.17 is due for u = 0 to Sparre-Andersen (1953/54) and in general to Baxter (1961). The intricate combinatorial methods used by the former author were later simplified by Feller and others.

Though renewals in Markov chains are implicit already in some early work of Kolmogorov and Lévy, the general renewal process may have been first introduced by Palm (1943). The first renewal theorem was obtained by Erdős et al. (1949) for random walks in Z+, though the result is then an easy consequence of Kolmogorov's (1936a/b) ergodic theorem for Markov chains on a countable state space, as noted by Chung. Blackwell (1948/53) extended the result to random walks in R+. The ultimate version for transient random walks in R is due to Feller & Orey (1961). The first coupling proof of Blackwell's theorem was given by Lindvall (1977). Our proof is a modification of an argument in Athreya et al. (1978), which originally did not cover all cases. The method seems to require the existence of a possibly infinite mean. An analytic approach to the general case appears in Feller (1971).

Elementary introductions to random walks are given by many authors, including Chung (1974) and Feller (1971). A detailed treatment of random walks in Z^d is given by Spitzer (1976). Further aspects of renewal theory are discussed in Smith (1954).

13. Jump-type chains and branching processes

Continuous-time Markov chains have been studied by many authors, beginning with Kolmogorov (1931a). The transition functions of general pure jump-type Markov processes were explored by Pospišil (1935/36) and Feller (1936/40), and the corresponding path properties were analysed by Doeblin (1939b) and Doob (1942b). The first continuous-time version of the strong Markov property was obtained by Doob (1945). The equivalence of the counting and renewal descriptions of Poisson processes was established by Bateman (1910). The significance of pseudo-Poisson processes was recognized by Feller.

The classical, discrete-time branching process was first studied by Bienaymé18 (1845), who found in particular the formula for the extinction probability. His results were partially rediscovered by Watson & Galton19 (1874), after whom the process has traditionally been named. In the critical case, the asymptotic survival probability and associated distribution were found by Kolmogorov (1938) and Yaglom (1947), and the present comparison argument is due to Spitzer. The transition probabilities of a birth & death process with rates nμ and nλ were obtained by Palm (1943) and Kendall (1948). The associated ancestral process was identified in K(17). The diffusion limit of a Bienaymé process was obtained by Feller (1951). The associated space-time process, known as a super-process, was discovered by Watanabe (1968) and later studied extensively by Dawson (1977/93) and others. The corresponding ancestral structure was uncovered in profound work by Dynkin (1991), Dawson & Perkins (1991), and Le Gall (1991).

Comprehensive accounts of pure jump-type Markov processes have been given by many authors, beginning with Doob (1953), Chung (1960), and Kemeny et al. (1966). The underlying regenerative structure was analyzed by Kingman (1972). Countless authors have discussed applications to queuing20 theory and other areas, including Takács (1962). Detailed accounts of Bienaymé and related processes include those by Harris (1963), Athreya & Ney (1972), and Jagers (1975). Etheridge (2000) gives an accessible introduction to super-processes.

14. Gaussian processes and Brownian motion

As we have seen, the Gaussian distribution first arose in the work of de Moivre (1738) and Laplace (1774, 1812/20), and was popularized through Gauss' (1809) theory of errors.

18 Irénée-Jules Bienaymé (1796–1878), French probabilist and statistician.
19 The latter became infamous as a founder of eugenics—the 'science' of breeding a human master race—later put in action by Hitler and others. Since his results were either wrong or already known, I would be happy to drop his name from the record.
20 I prefer the traditional spelling, as in ensuing and pursuing.

Gauss’ (1809) theory of errors. Maxwell21 (1875/78) derived the Gaussian law as the velocity distribution for the molecules in a gas, assuming the hypotheses of Proposition 14.2. Freedman (1962/63) used de Finetti’s theorem to characterize sequences and processes with rotatable symmetries. His results were later recognized as equivalent to a relationship between positive definite and completely monotone functions proved by Schoenberg (1938). Isonormal Gaussian processes were introduced by Segal (1954). The Brownian motion process, first introduced by Thiele (1880), was rediscovered and used by Bachelier22 (1900) to model fluctuations on the stock market. Bachelier also noted several basic properties of the process, such as the distribution of the maximum and the connection with the heat equation. His work, long neglected and forgotten, was eventually revived by Kolmogorov and others. He is now recognized as the founder of mathematical finance. The mathematical notion must not be confused with the physical phenomenon of Brownian motion—the irregular motion of pollen particles suspended in water. Such a motion, first noted by van Leeuwenhoek23 , was originally thought to be a biological phenomenon, motivating some careful studies during different seasons, meticulously recorded by the botanist Brown (1828). Eventually, some evidence from physics and chemistry suggested the existence of atoms24 , explaining Brownian motion as caused by the constant bombardment by water molecules. The atomic hypothesis was still controversial, when Einstein (1905/06), ignorant of Bachelier’s work, modeled the motion by the mentioned process, and used it to estimate the size of the molecules25 . A more refined model for the physical Brownian motion was proposed by Langevin (1908) and Ornstein & Uhlenbeck (1930). A rigorous study of the Brownian motion process was initiated by Wiener26 (1923), who constructed its distribution as a measure on the space of continuous paths. The significance of Wiener’s revolutionary paper was not appre´vy ciated until after the pioneering work of Kolmogorov (1931a/33b), Le (1934/35), and Feller (1936). The fact that the Brownian paths are nowhere differentiable was noted in a celebrated paper by Paley et al. (1933). Wiener also introduced stochastic integrals of deterministic L2 -functions, later studied in further detail by Paley et al. (1933). The spectral repre21

21 James Clerk Maxwell (1831–79), Scottish physicist, known for his equations describing the electric and magnetic fields, and a co-founder of statistical mechanics.
22 Louis Bachelier (1870–1946), French mathematician, the founder of mathematical finance and discoverer of the Brownian motion process.
23 Antonie van Leeuwenhoek (1632–1723), Dutch scientist, who was the first to construct a microscope powerful enough to observe the micro-organisms in 'plain' water.
24 A similar claim by Democritus can be dismissed, as it was based on a logical fallacy.
25 This is one of the three revolutionary achievements of Albert Einstein (1879–1955) during his 'miracle year' of 1905, the others being his discovery of special relativity, including the celebrated formula E = mc^2, and his explanation of the photo-electric effect, marking the beginning of quantum mechanics and earning him a Nobel prize in 1921.
26 Norbert Wiener (1894–1964), famous American mathematician, making profound contributions to numerous areas of analysis and probability.

The spectral representation of stationary processes, originally deduced from Bochner's (1932) theorem by Cramér (1942), was later recognized as equivalent to a general Hilbert space result of Stone (1932). The chaos expansion of Brownian functionals was discovered by Wiener (1938), and the theory of multiple integrals with respect to Brownian motion was developed in a seminal paper of Itô (1951c).

The law of the iterated logarithm was discovered by Khinchin27, first (1923/24) for Bernoulli sequences, and then (1933b) for Brownian motion. A more detailed study of Brownian paths was pursued by Lévy (1948), who proved the existence of quadratic variation in (1940) and the arcsine laws in (1948). Though many proofs of the latter have since been given, the present deduction from basic symmetry properties may be new. Though the strong Markov property was used implicitly by Lévy and others, the result was not carefully stated and proved until Hunt (1956).

Most modern probability texts contain introductions to Brownian motion. The books by Itô & McKean (1965), Freedman (1971b), Karatzas & Shreve (1991), and Revuz & Yor (1999) give more detailed information on the subject. Further discussion of multiple Wiener–Itô integrals appears in Kallianpur (1980), Dellacherie et al. (1992), and Nualart (2006). The advanced theory of Gaussian processes is surveyed by Adler (1990).

15. Poisson and related processes

As we have seen, the Poisson distribution was recognized by de Moivre (1711) and Poisson (1837) as an approximation to the binomial distribution. In a study of Bible chronology, Ellis (1844) obtains Poisson processes as approximations of more general renewal processes. Point processes with stationary, independent increments were later considered by Lundberg (1903) to model a stream of insurance claims, by Rutherford & Geiger (1908) to describe the process of radioactive decay, and by Erlang (1909) to model the incoming calls to a telephone exchange. Here the first rigorous proof of the Poisson distribution for the increments was given by Bateman (1910), who also showed that the holding times are i.i.d. and exponentially distributed. The Poisson property in the non-homogeneous case is implicit in Lévy (1934/35), but it was not rigorously proven until Copeland & Regan (1936).

The mixed binomial representation of Poisson processes is used implicitly by Wiener (1938), in his construction of higher-dimensional Poisson processes. It was stated more explicitly by Moyal (1962), leading to the elementary construction of general Poisson processes, noted by both Kingman (1967) and Mecke (1967). Itô (1942b) noted that a marked point process with stationary, independent increments is equivalent to a Poisson process in the plane.

27 Alexandr Khinchin (1894–1959), Russian mathematician, along with Kolmogorov a leader of the Moscow school of analysis and probability. He wrote monographs on statistical mechanics, information theory, and queuing theory.

The corresponding result in the non-homogeneous case eluded Itô, but follows from Kingman's (1967) general description of random measures with independent increments.

The mapping and marking properties of a Poisson process were established by Prékopa (1958), after partial results and special cases had been noted by Bartlett (1949) and Doob (1953). Cox processes, first introduced by Cox (1955), were studied extensively by Kingman (1964), Krickeberg (1972), and Grandell (1976). Thinnings were first considered by Rényi28 (1956). Laplace transforms of random measures were used systematically by von Waldenfels (1968) and Mecke (1968); their use to construct Cox processes and transforms relies on a lemma of Jiřina (1964). One-dimensional uniqueness criteria were obtained in the Poisson case by Rényi (1967), and then in general independently by Mönch (1971) and in K(73a/83), with further extensions by Grandell (1976). The symmetry characterization of mixed Poisson and binomial processes was noted independently by Matthes et al. (1974), Davidson (1974b), and K(73a). S. Watanabe (1964) proved that a ql-continuous simple point process is Poisson iff its compensator is deterministic, a result later extended to general marked point processes by Jacod (1975). A corresponding time-change reduction to Poisson was obtained independently by Meyer (1971) and Papangelou (1972); a related Poisson approximation was obtained by Brown (1978/79). The extended time-change result in Theorem 15.15 was obtained in K(90).

Multiple Poisson integrals were first mentioned by Itô (1956), though some underlying ideas can be traced back to papers by Wiener (1938) and Wiener & Wintner (1943). Many sophisticated methods have been applied through the years to the study of such integrals. The present convergence criterion for double Poisson integrals is a special case of some general results for multiple Poisson and Lévy integrals, established in K & Szulga (1989).

Introductions to Poisson and related processes are given by Kingman (1993) and Last & Penrose (2017). More advanced and comprehensive accounts of random measure theory appear in Matthes et al. (1978), Daley & Vere-Jones (2003/08), and K(17).

16. Independent-increment and Lévy processes

Up to the 1920s, Brownian motion and the Poisson process were essentially the only known processes with independent increments. In (1924/25) Lévy discovered the stable distributions and noted that they, too, could be associated with suitable 'decomposable' processes. de Finetti (1929) saw the connection between processes with independent increments and infinitely divisible distributions, and posed the problem of characterizing the latter. As noted in Chapter 7, the problem was solved by Kolmogorov (1932) and Lévy (1934/35).

28 Alfréd Rényi (1921–70), Hungarian mathematician. Through his passion for the subject, he inspired a whole generation of mathematicians.

The paper of Lévy29 (1934/35) was truly revolutionary30, in giving at the same time a complete description of processes with independent increments. Though his intuition was perfectly clear, his reasoning was sometimes a bit sketchy, and a more careful analysis was given by Itô (1942b). In the meantime, analytic proofs of the representation formula for the characteristic function were given by Lévy himself (1937), by Feller (1937), and by Khinchin (1937). The Lévy–Khinchin formula for the characteristic function was extended by Jacod (1983) to the case of possibly fixed discontinuities. Our probabilistic description in the general case is new and totally different.

The basic convergence Theorem 16.14 for Lévy processes and the associated approximation of random walks in Corollary 16.17 are essentially due to Skorohod (1957), though with rather different statements and proofs. The special case of stable processes has been studied by countless authors, ever since Lévy (1924/25). Versions of Proposition 16.19 were discovered independently by Rosiński & Woyczyński (1986) and in K(92).

The local characteristics of a semi-martingale were introduced independently by Grigelionis (1971) and Jacod (1975). Tangent processes, first mentioned by Itô (1942b), were used by Jacod (1984) to extend the notion of semi-martingales. The existence of tangential sequences with conditionally independent terms was noted by Kwapień & Woyczyński (1992), and a careful proof appears in de la Peña & Giné (1999). A discrete-time version of the tangential comparison was obtained by Zinn (1986) and Hitczenko (1988), and a detailed proof appears in Kwapień & Woyczyński (1992). Both results were extended to continuous time in K(17a). The simplified comparison criterion may be new.

Modern introductions to Lévy processes have been given by Bertoin (1996) and Sato (2013). The latter gives a detailed account of the Lévy–Itô representation in the stochastically continuous case. The analytic approach in the general case is carefully explained by Jacod & Shiryaev (1987), which also contains an exposition of the associated limit theory.

17. Feller processes and semi-groups

Semi-group ideas are implicit in Kolmogorov's pioneering (1931a) paper, whose central theme is to identify some local characteristics determining the transition probabilities through a system of differential equations, the so-called Kolmogorov forward and backward equations. Markov chains and diffusion processes were originally treated separately, but in (1935) Kolmogorov proposed a unified framework, with transition kernels regarded as operators (initially operating on measures rather than on functions), and with local characteristics given by an associated generator.

29 Paul Lévy (1886–1971), French mathematician and a founder of modern probability, making profound contributions to the central limit theorem, stable processes, Brownian motion, Lévy processes, and local time.
30 Loève (1978), who had been a student of Lévy, writes (alluding to Greek mythology): "Decomposability sprang forth fully armed from the forehead of P. Lévy . . . His analysis . . . was so complete that since then only improvements of detail have been added."

Kolmogorov's ideas were taken up by Feller31 (1936), who derived existence and uniqueness criteria for the forward and backward equations. The general theory of contraction semi-groups on a Banach space was developed independently by Hille (1948) and Yosida (1948), both of whom recognized its significance for the theory of Markov processes. The power of the semi-group approach became clear through the work of Feller (1952/54), who gave a complete description of the generators of one-dimensional diffusions, in particular characterizing the boundary behavior of the process in terms of the domain of the generator.

A systematic study of Markov semi-groups was now initiated by Dynkin (1955a). The usual approach is to postulate the strong continuity (F3), instead of the weaker and more easily verified condition (F2). The positive maximum principle first appeared in the work of Itô (1957), and the core condition in Proposition 17.9 is due to Watanabe (1968).

The first regularity theorem was obtained by Doeblin (1939b), who gave conditions for the paths to be step functions. A sufficient condition for continuity was then obtained by Fortet (1943). Finally, Kinney (1953) showed that any Feller process has a version with rcll paths, after Dynkin (1952) had derived the same property under a Hölder condition.

The use of martingale methods in the study of Markov processes dates back to Kinney (1953) and Doob (1954). The strong Markov property for Feller processes was proved independently by Dynkin & Yushkevich (1956) and Blumenthal (1957), after special cases had been obtained by Doob (1945), Hunt (1956), and Ray (1956). Blumenthal's (1957) paper also contains his 0−1 law. Dynkin (1955a) introduced his characteristic operator, and a version of Theorem 17.24 appears in Dynkin (1956).

There is a vast literature on the approximation of Markov chains and Markov processes, covering a wide range of applications. The use of semi-group methods to prove limit theorems can be traced back to Lindeberg's (1922a) proof of the central limit theorem. The general results of Theorems 17.25 and 17.28 were developed in stages by Trotter (1958a), Sova (1967), Kurtz (1969/75), and Mackevičius (1974). Our proof of Theorem 17.25 uses ideas from Goldstein (1976).

A splendid introduction to semi-group theory is given by the relevant chapters in Feller (1971). In particular, Feller shows how the one-dimensional Lévy–Khinchin formula and associated limit theorems can be derived by semi-group methods. More detailed and advanced accounts of the subject appear in Dynkin (1965), Ethier & Kurtz (1986), and Dellacherie & Meyer (1987/92).

31 William Feller (1906–70), Croatian-American mathematician, making path-breaking contributions to many areas of analysis and probability, including classical limit theorems and diffusion processes. His introductory volumes on probability theory, arguably the best probability books ever written, became bestsellers and were translated to many languages.

18. Itô integration and quadratic variation

The first stochastic integral with a random integrand was defined by Itô32 (1942a/44), using Brownian motion as the integrator and product measurable and adapted processes as integrands. Doob (1953) made the connection with martingale theory. A version of the fundamental substitution rule was proved by Itô (1951a), and later extended by many authors. The compensated integral in Corollary 18.21 was introduced independently by Fisk (1966) and Stratonovich (1966).

The quadratic variation process was originally obtained from the Doob–Meyer decomposition. Fisk (1966) noted how it can also be obtained directly from the process, as in Proposition 18.17. Our present construction was inspired by Rogers & Williams (2000). The BDG inequalities were originally proved for p > 1 and discrete time by Burkholder (1966). Millar (1968) noted the extension to continuous martingales, in which context the further extension to arbitrary p > 0 was obtained independently by Burkholder & Gundy (1970) and Novikov (1971). Kunita & Watanabe (1967) introduced the covariation of two martingales and used it to characterize the integral. They further established some general inequalities related to Proposition 18.9.

Itô's original integral was extended to square-integrable martingales by Courrège (1962/63) and Kunita & Watanabe (1967), and then to continuous semi-martingales by Doléans & Meyer (1970). The idea of localization is due to Itô & Watanabe (1965). Theorem 18.24 was obtained by Kazamaki (1972), as part of a general theory of random time change. Stochastic integrals depending on a parameter were studied by Doléans (1967b) and Stricker & Yor (1978), and the functional representation in Proposition 18.26 first appeared in K(96a).

Elementary introductions to Itô calculus and its applications appear in countless textbooks, including Chung & Williams (1990) and Klebaner (2012). For more advanced accounts and further information, we recommend especially Ikeda & Watanabe (2014), Rogers & Williams (2000), Karatzas & Shreve (1998), and Revuz & Yor (1999).

19. Continuous martingales and Brownian motion

The fundamental characterization of Brownian motion in Theorem 19.3 was proved by Lévy (1937), who also in (1940) noted the conformal invariance up to a time change of complex Brownian motion and stated the polarity of singletons. A rigorous proof of Theorem 19.6 was provided by Kakutani (1944a/b). Kunita & Watanabe (1967) gave the first modern proof of Lévy's characterization, based on Itô's formula for exponential martingales.

32 Kiyosi Itô (1915–2008), Japanese mathematician, the creator of stochastic calculus, SDEs, multiple WI-integrals, and excursion theory. He is also notable for reworking and making rigorous Lévy's theory of processes with independent increments.

The corresponding time-change reduction of continuous local martingales to a Brownian motion was discovered by Doeblin (1940b), and used by him to solve Kolmogorov's fundamental problem of constructing diffusion processes with given drift and diffusion rates33. However, his revolutionary paper was deposited in a sealed envelope to the Academy of Sciences in Paris, and was not known to the world until 60 years later. In the meantime, the result was rediscovered independently by Dambis (1965) and Dubins & Schwarz (1965). A multi-variate version of Doeblin's theorem was noted by Knight (1971), and a simplified proof was given by Meyer (1971). A systematic study of isotropic martingales was initiated by Getoor & Sharpe (1972).

The skew-product representation in Corollary 19.7 is due to Galmarino (1963). The integral representation in Theorem 19.11 is essentially due to Itô (1951c), who noted its connection with multiple stochastic integrals and chaos expansions. A one-dimensional version of Theorem 19.13 appears in Doob (1953).

The change of measure was first considered by Cameron & Martin (1944), which is the source of Theorem 19.23, and in Wald's (1946/47) work on sequential analysis, containing the identity of Lemma 19.25 in a version for random walks. The Cameron–Martin theorem was gradually extended to more general settings by many authors, including Maruyama (1954/55), Girsanov (1960), and van Schuppen & Wong (1974). The martingale criterion of Theorem 19.24 was obtained by Novikov (1972).

The material in this chapter is covered by many texts, including the monographs by Karatzas & Shreve (1998) and Revuz & Yor (1999). A more advanced and amazingly informative text is Jacod (1979).

20. Semi-martingales and stochastic integration

Doob (1953) conceived the idea of a stochastic integration theory for general L^2-martingales, based on a suitable decomposition of continuous-time sub-martingales. Meyer's (1962) proof of such a result opened the door to the L^2-theory, which was then developed by Courrège (1962/63) and Kunita & Watanabe (1967). The latter paper contains in particular a version of the general substitution rule. The integration theory was later extended in a series of papers by Meyer (1967) and Doléans-Dade & Meyer (1970), and reached maturity with the notes of Meyer (1976) and the books by Jacod (1979), Métivier & Pellaumail (1980), Métivier (1982), and Dellacherie & Meyer (1975/80).

The basic role of predictable processes as integrands was recognized by Meyer (1967). By contrast, semi-martingales were originally introduced in an ad hoc manner by Doléans-Dade & Meyer (1970), and their basic preservation laws were only gradually recognized.

33 A totally different construction was given independently by Itô (1942a), through his invention of stochastic calculus and the theory of SDEs. Itô's approach was also developed independently by Gihman (1947/50/51).

In particular, Jacod (1975) used the general Girsanov theorem of van Schuppen & Wong (1974) to show that the semi-martingale property is preserved under absolutely continuous changes of the probability measure. The characterization of general stochastic integrators as semi-martingales was obtained independently by Bichteler (1979) and Dellacherie (1980), in both cases with support from analysts.

Quasi-martingales were originally introduced by Fisk (1965) and Orey (1966). The decomposition of Rao (1969b) extends a result by Krickeberg (1956) for L^1-bounded martingales. Yoeurp (1976) combined a notion of stable subspaces, due to Kunita & Watanabe (1967), with the Hilbert space structure of M^2 to obtain an orthogonal decomposition of L^2-martingales, equivalent to the decompositions in Theorem 20.14 and Proposition 20.16. Elaborating on those ideas, Meyer (1976) showed that the purely discontinuous component admits a representation as a sum of compensated jumps.

SDEs driven by general Lévy processes were considered already by Itô (1951b). The study of SDEs driven by general semi-martingales was initiated by Doléans-Dade (1970), who obtained her exponential process as a solution to the equation in Theorem 20.8. The scope of the theory was later expanded by many authors, and a comprehensive account is given by Protter (1990).

The martingale inequalities in Theorems 20.12 and 20.17 have ancient origins. Thus, a version of the latter result for independent random variables was proved by Kolmogorov (1929) and, in a sharper form, by Prohorov (1959). Their result was extended to discrete-time martingales by Johnson et al. (1985) and Hitczenko (1990). The present statements appeared in K & Sztencel (1991). Early versions of the inequalities in Theorem 20.12 were proved by Khinchin (1923/24) for symmetric random walks and by Paley (1932) for Walsh series. A version for independent random variables was obtained by Marcinkiewicz & Zygmund (1937/38). The extension to discrete-time martingales is due to Burkholder (1966) for p > 1 and to Davis (1970) for p = 1. The result was extended to continuous time by Burkholder et al. (1972), who also noted how the general result can be deduced from the statement for p = 1. The present proof is a continuous-time version of Davis' original argument.

Introductions to general semi-martingales and stochastic integration are given by Dellacherie & Meyer (1975/80), Jacod & Shiryaev (1987), and Rogers & Williams (2000). Protter (1990) offers an alternative approach, originally suggested by Dellacherie (1980) and Meyer. The book by Jacod (1979) remains a rich source of further information on the subject.

21. Malliavin calculus

Malliavin34 (1978) sensationally gave a probabilistic proof of Hörmander's (1967) celebrated regularity theorem for solutions to certain elliptic partial differential equations.

34 Paul Malliavin (1925–2010), French mathematician, already famous for contributions to harmonic analysis and other areas, before his revolutionary work in probability.

The underlying ideas35, forming the core of a stochastic calculus of variations known as Malliavin calculus, were further developed by many authors, including Stroock (1981), Bismut (1981), and Watanabe (1987), into a powerful tool with many applications. Ramer (1974) and Skorohod (1975) independently extended Itô's stochastic integral to the case of anticipating integrands. Their integral was recognized by Gaveau & Trauber (1982) as the adjoint of the Malliavin derivative D. The expression for the integrand in Itô's representation of Brownian functionals was obtained by Clark (1970/71), under some regularity conditions that were later removed by Ocone (1984). The differentiability theorem for Itô integrals was proved by Pardoux & Peng (1990).

Many introductions to Malliavin calculus exist, including an early account by Bell (1987). A standard reference is the book by Nualart (1995/2006), which also contains a wealth of applications to anticipating stochastic calculus, stochastic PDEs, mathematical finance, and other areas. I also found the notes of Kunze (2013) especially helpful.

22. Skorohod embedding and functional convergence

The first functional limit theorems were obtained by Kolmogorov (1931b/33a), who considered special functionals of a random walk. Erdős36 & Kac (1946/47) conceived the idea of an 'invariance principle' that would allow the extension of functional limit theorems from special cases to a general setting. They also treated some interesting functionals of a random walk. The first general functional limit theorems were obtained by Donsker37 (1951/52) for random walks and empirical distribution functions, following an idea of Doob (1949). A general theory based on sophisticated compactness arguments was later developed by Prohorov (1956) and others.

Skorohod's38 (1965) embedding theorem provided a new, probabilistic approach to Donsker's theorem. Extensions to the martingale context were obtained by many authors, beginning with Dubins (1968). Lemma 22.19 appears in Dvoretzky (1972). Donsker's weak invariance principle was supplemented by a strong version due to Strassen (1964), allowing extensions of many a.s. limit theorems for Brownian motion to suitable random walks. In particular, his result yields a simple proof of the Hartman & Wintner (1941) law of the iterated logarithm, originally deduced from some deep results of Kolmogorov (1929).

35 Denis Bell tells me that much of this material was known before the work of Malliavin, including the notion of Malliavin derivative.
36 Paul Erdős (1913–96), Hungarian, the most prolific and eccentric mathematician of the 20th century. Constantly on travel and staying with friends, his only possession was a small suitcase where he kept his laundry. His motto was 'a new roof—a new proof'. He also offered monetary awards, successively upgraded, for solutions to open problems.
37 Monroe Donsker (1924–91), American mathematician, who proved the first functional limit theorem, and along with Varadhan developed the theory of large deviations.
38 Anatoliy Skorohod (1930–2011), Ukrainian mathematician, making profound contributions to many areas of stochastic processes.

Komlós et al. (1975/76) showed how the approximation rate in the Skorohod embedding can be improved by a more delicate 'strong approximation'. For an exposition of their work and its numerous applications, see Csörgő & Révész (1981).

Many texts on different levels contain introductions to weak convergence and functional limit theorems. A standard reference remains Billingsley (1968), which contains a careful treatment of the general theory and its numerous applications. Books focusing on applications in special areas include Pollard (1984) for empirical distributions and Whitt (2002) for queuing theory. Versions of the Skorohod embedding, even for martingales, appear in Hall & Heyde (1980) and Durrett (2019).

23. Convergence in distribution

After Donsker (1951/52) had proved his functional limit theorems for random walks and empirical distribution functions, a general theory of weak convergence in function spaces was developed by the Russian school, in seminal papers by Prohorov (1956), Skorohod (1956/57), and Kolmogorov (1956). Thus, Prohorov39 (1956) proved his fundamental compactness Theorem 23.2, in a setting for separable, complete metric spaces. The abstract theory was later extended in various directions by Le Cam (1957), Varadarajan (1958), and Dudley (1966/67). Skorohod (1956) considered four different topologies on D[0,1], of which the J1-topology considered here is the most important for applications. The theory was later extended to D(R+) by Stone (1963) and Lindvall (1973). Moment conditions for tightness were developed by Chentsov (1956) and Billingsley (1968), before the powerful criterion of Aldous (1978) became available. Kurtz (1975) and Mitoma (1983) noted that tightness in D(R+, S) can often be verified by conditions on the one-dimensional projections, as in Theorem 23.23.

Convergence criteria for random measures were established by Prohorov (1961) for compact spaces, by von Waldenfels (1968) for locally compact spaces, and by Harris (1971) for separable, complete metric spaces. In the latter setting, convergence and tightness criteria for point processes, first announced in Debes et al. (1970), were proved in Matthes et al. (1974). The relationship between weak and vague convergence in distribution was clarified, for locally compact spaces, in K(75/76). Strong convergence of point processes was studied extensively by Matthes et al. (1974), where it forms a foundation of their theory of infinitely divisible point processes. One-dimensional convergence criteria, noted for locally compact spaces in K(75/83) and K(96b), were extended to Polish spaces by Peterson (2001). The power of the Cox transform in this context was explored by Grandell (1976).

39 Yuri Prohorov (1929–2013), Russian probabilist, founder of the modern theory of weak convergence.

The one-dimensional criteria provide a link to random closed sets with the associated Fell (1962) topology. Random sets had been studied extensively by many authors, including Choquet (1953/54), Kendall (1974), and Matheron (1975), when associated convergence criteria were identified in Norberg40 (1984).

Detailed accounts of the theory and applications of weak convergence appear in many textbooks and monographs, including Parthasarathy (1967), Billingsley (1968), Pollard (1984), Ethier & Kurtz (1986), Jacod & Shiryaev (1987), and Whitt (2002). The convergence theory for random measures is discussed in Matthes et al. (1978), Daley & Vere-Jones (2008), and K(17b). An introduction to random sets appears in Schneider & Weil (2008), and more comprehensive treatments are given by Matheron (1975) and Molchanov (2005).

24. Large deviations

The theory of large deviations originated with some refinements of the central limit theorem noted by many authors, beginning with Khinchin (1929). Here the object of study is the ratio of tail probabilities r_n(x) = P{ζ_n > x} / P{ζ > x}, where ζ is N(0, 1) and ζ_n = n^{-1/2} Σ_{k≤n} ξ_k for some i.i.d. random variables ξ_k with mean 0 and variance 1, so that r_n(x) → 1 for each x. A precise asymptotic expansion was obtained by Cramér41 (1938), when x varies with n at a rate x = o(n^{1/2}); see Petrov (1995) for details.

In the same paper, Cramér (1938) obtained the first true large deviation result, in the form of our Theorem 24.3, though under some technical assumptions that were later removed by Chernoff (1952) and Bahadur (1971). Varadhan (1966) extended the result to higher dimensions and rephrased it as a general large deviation principle. At about the same time, Schilder (1966) proved his large deviation result for Brownian motion, using the present change-of-measure approach. Similar methods were used by Freidlin & Wentzell (1970/98) to study random perturbations of dynamical systems. This was long after Sanov (1957) had obtained his large deviation result for empirical distributions of i.i.d. random variables. The relative entropy H(ν|μ) appearing in the limit had already been introduced in statistics by Kullback & Leibler (1951). Its crucial link to the Legendre–Fenchel transform Λ*, long anticipated by physicists, was formalized by Donsker & Varadhan (1975/83). The latter authors also developed some profound and far-reaching extensions of Sanov's theorem, in a series of revolutionary papers42.
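For orientation, the quantities just mentioned may be recalled in a standard schematic form (a sketch of the usual textbook statements, not necessarily the exact formulations of Theorem 24.3 or of Sanov's theorem in this book): writing Λ(t) = log E e^{tξ_1} and S_n = ξ_1 + · · · + ξ_n, Cramér's theorem asserts, roughly, that for x above the mean
\[
  \Lambda^*(x) \;=\; \sup_{t\in\mathbb{R}}\bigl(tx - \Lambda(t)\bigr),
  \qquad
  \frac{1}{n}\,\log P\{S_n/n \ge x\} \;\to\; -\Lambda^*(x),
\]
while in Sanov's theorem the rate function is the relative entropy H(ν|μ) = ∫ log(dν/dμ) dν when ν ≪ μ, and +∞ otherwise.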

40 After this work, based on an unpublished paper of mine, Norberg went on to write some deep papers on random capacities, inspired by unpublished notes of W. Vervaat.
41 Harald Cramér (1893–1985), Swedish mathematician and a founder of risk, large deviation, and extreme-value theories. He also wrote the first rigorous treatise on statistical theory. At the end of a distinguished academic career, he became chancellor of the entire Swedish university system.
42 This work earned Varadhan an Abel prize in 2007, long after the death of his main collaborator. Together with Stroock, Varadhan also made profound contributions to the theory of multi-variate diffusions.

Ellis (1985) gives a detailed exposition of those results, along with a discussion of their physical significance.

Some of the underlying principles and techniques were developed at a later stage. Thus, an abstract version of the projective limit approach was introduced by Dawson & Gärtner (1987). Bryc (1990) supplemented Varadhan's (1966) functional version of the LDP with a converse proposition. Similarly, Ioffe (1991) appended a powerful inverse to the classical 'contraction principle'. Finally, Pukhalsky (1991) established the equivalence, under suitable regularity conditions, of the exponential tightness and the goodness of the rate function.

Strassen (1964) established his formidable law of the iterated logarithm by direct estimates. Freedman (1971b) gives a detailed account of the original approach. Varadhan (1984) recognized the result as a corollary to Schilder's theorem, and a complete proof along the suggested lines, different from ours, appears in Deuschel & Stroock (1989).

Gentle introductions to large deviation theory and its applications are given by Varadhan (1984) and Dembo & Zeitouni (1998). The more demanding text of Deuschel & Stroock (1989) provides much additional insight for the persistent reader.

25. Stationary processes and ergodic theorems

The history of ergodic theory dates back to the work of Boltzmann43 (1887) in statistical mechanics. His 'ergodic hypothesis'—the conjectural equality of time and ensemble averages—was long accepted as a heuristic principle. In probabilistic terms it amounts to the convergence t^{-1} ∫_0^t f(X_s) ds → E f(X_0), where X_t represents the state of the system—typically the configuration of all molecules in a gas—at time t, and the expected value is taken with respect to an invariant probability measure on a compact sub-manifold of the state space.

The ergodic hypothesis was sensationally proved as a mathematical theorem, first in an L^2-version by von Neumann (1932), after Koopman (1931) had noted the connection between measure-preserving transformations and unitary operators on a Hilbert space, and then in the pointwise form of Birkhoff44 (1932). The proof of the latter version was simplified in stages: first by Yosida & Kakutani (1939), who showed that the statement is an easy consequence of the maximal ergodic lemma, and then by Garsia (1965), who gave a short proof of the latter result.

Khinchin (1933a/34) pioneered a translation of the basic ergodic theory into the probabilistic setting of stationary sequences and processes. The first multi-variate ergodic theorem was obtained by Wiener (1939), who proved his result in the special case of averages over concentric balls. More general versions were established by many authors, including Day (1942) and Pitt (1942).

43 Ludwig Boltzmann (1844–1906), Austrian physicist, along with Maxwell and Gibbs a founder of statistical mechanics.
44 George David Birkhoff (1884–1944), prominent American mathematician.
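Schematically, and as a standard modern formulation rather than the exact statement used in this book, the von Neumann and Birkhoff theorems mentioned above assert that for a measure-preserving transformation T of a probability space and any f ∈ L^1,
\[
  \frac{1}{n}\sum_{k=0}^{n-1} f\circ T^{k} \;\longrightarrow\; E[\,f \mid \mathcal{I}\,]
  \quad \text{a.s. and in } L^1,
\]
where \mathcal{I} denotes the σ-field of T-invariant sets; when T is ergodic the limit is simply Ef, which is the precise form of Boltzmann's equality of time and ensemble averages.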

The classical methods were pushed to the limit in a notable paper by Tempel'man (1972). The first ergodic theorem for non-commuting transformations was obtained by Zygmund (1951), where the underlying maximum inequalities go back to Hardy & Littlewood (1930). Sucheston (1983) noted that the statement follows easily from Maker's (1940) lemma. The ergodic theorem for random matrices was proved by Furstenberg & Kesten (1960), long before the sub-additive ergodic theorem became available. The latter was originally proved by Kingman (1968), under the stronger hypothesis of joint stationarity of the array (X_{m,n}). The present extension and shorter proof are due to Liggett (1985).

The ergodic decomposition of invariant measures dates back to Krylov & Bogolioubov (1937), though the basic role of the invariant σ-field was not recognized until the work of Farrell (1962) and Varadarajan (1963). The connection between ergodic decompositions and sufficient statistics is explored in an elegant paper of Dynkin (1978). The traditional approach to the subject is via Choquet theory, as surveyed by Dellacherie & Meyer (1983).

The coupling equivalences in Theorem 25.25 (i) were proved by Goldstein (1979), after Griffeath (1975) had obtained a related result for Markov chains. The shift coupling part of the same theorem was established by Berbee (1979) and Aldous & Thorisson (1993), and the version for abstract groups was then obtained by Thorisson (1996).

The original ballot theorem, due to Whitworth (1878) and later rediscovered by Bertrand (1887), is one of the earliest rigorous results of probability theory. It states that if two candidates A and B in an election get the proportions p and 1 − p of the votes, then A will lead throughout the ballot count with probability (2p − 1)+. Extensions and new approaches have been noted by many authors, beginning with André (1887) and Barbier (1887). A modern combinatorial proof is given by Feller (1968), and a simple martingale argument appears in Chow & Teicher (1997). A version for cyclically stationary sequences and processes was obtained by Takács (1967). All earlier versions are subsumed by the present statement from K(99a), whose proof relies heavily on Takács' ideas.

The first version of Theorem 25.29 was obtained by Shannon (1948), who proved convergence in probability for stationary and ergodic Markov chains in a finite state space. The Markovian restriction was lifted by McMillan (1953), who also strengthened the result to convergence in L^1. Carleson (1958) extended McMillan's result to a countable state space. The a.s. convergence is due to Breiman (1957/60) and Ionescu Tulcea (1960) for finite state spaces and to Chung (1961) in the countable case.

Elementary introductions to stationary processes have been given by many authors, beginning with Doob (1953) and Cramér & Leadbetter (1967). A modern and comprehensive survey of general ergodic theorems is given by Krengel (1985). A nice introduction to information theory is given by Billingsley (1965). Thorisson (2000) gives a detailed account of modern coupling principles.

26. Ergodic properties of Markov processes

The first ratio ergodic theorems were obtained by Doeblin (1938b), Doob (1938/48a), Kakutani (1940), and Hurewicz (1944). Hopf (1954) and Dunford & Schwartz (1956) extended the pointwise ergodic theorem to general L^1–L^∞-contractions, and the ratio ergodic theorem was extended to positive L^1-contractions by Chacon & Ornstein (1960). The present approach to their result is due to Akcoglu & Chacon (1970).

The notion of Harris recurrence goes back to Doeblin (1940a) and Harris45 (1956). The latter author used the condition to ensure the existence, in discrete time, of a σ-finite invariant measure. A corresponding continuous-time result was obtained by H. Watanabe (1964). The total variation convergence of Markov transition probabilities was obtained for a countable state space by Orey (1959/62), and in general by Jamison & Orey (1967). Blackwell & Freedman (1964) noted the equivalence of mixing and tail triviality. The present coupling approach goes back to Griffeath (1975) and Goldstein (1979) for the case of strong ergodicity, and to Berbee (1979) and Aldous & Thorisson (1993) for the corresponding weak result.

There is an extensive literature on ergodic theorems for Markov processes, mostly dealing with the discrete-time case. General expositions have been given by many authors, beginning with Neveu (1971) and Orey (1971). Our treatment of Harris recurrent Feller processes is adapted from Kunita (1990), who in turn follows the discrete-time approach of Revuz (1984). Krengel (1985) gives a comprehensive survey of abstract ergodic theorems.

27. Symmetric distributions and predictable maps

de Finetti46 (1930) proved that an infinite sequence of random events is exchangeable iff the joint distribution is mixed i.i.d. He later (1937) claimed the corresponding result for general random variables, and a careful proof for even more general state spaces was given by Hewitt & Savage47 (1955). Ryll-Nardzewski48 (1957) noted that the statement remains true under the weaker hypothesis of contractability, and Bühlmann (1960) extended the result to processes on R+. The sampling approach to de Finetti's theorem was noted in K(99b).

Finite exchangeable sequences arise in the context of sampling from a finite population, in which context a variety of limit theorems were obtained by Chernoff & Teicher (1958), Hájek (1960), Rosén (1964), Billingsley (1968), and Hagberg (1973). The representation of exchangeable processes on [0, 1] was discovered independently in K(73b), along with the associated functional limit theorems.

45 Ted Harris (1919–2005), American mathematician, making profound contributions to branching processes, Markov ergodic theory, and convergence of random measures.
46 Bruno de Finetti (1906–85), Italian mathematician and philosopher, whose famous theorem became a cornerstone in his theory of subjective probability and Bayesian statistics.
47 This paper is notable for containing the authors' famous 0−1 law.
48 Czeslaw Ryll-Nardzewski (1926–2015), Polish mathematician prominent in analysis and probability, making profound contributions to both exchangeability and Palm measures.

The sum of compensated jumps was shown in K(74) to converge uniformly a.s. The optional skipping theorem—the special case of Theorem 27.7 for i.i.d. sequences and increasing sequences of predictable times—was first noted by Doob49 (1936/53). It formalizes the intuitive idea that a gambler can't 'beat the odds' by using a clever gambling strategy, as explained in Billingsley (1986). The general predictable sampling theorem, proved in K(88) along with its continuous-time counterpart, yields a simple proof of Sparre-Andersen's (1953/54) notable identity50 for random walks, which in turn leads to a short proof of Lévy's third arcsine law. Comprehensive surveys of exchangeability theory have been given by Aldous (1985) and in K(05).

28. Multi-variate arrays and symmetries

The notion of partial exchangeability—the invariance in distribution of a sequence of random variables under a proper sub-group of permutations—goes back to de Finetti (1938). Separately exchangeable arrays on N^2 were first studied by Dawid (1972). Symmetric, jointly exchangeable arrays on N^2 arise naturally in the context of U-statistics, whose theory goes back to Hoeffding (1948). They were later studied by McGinley & Sibson (1975), Silverman (1976), and Eagleson & Weber (1978). More recently, exchangeable arrays have been used extensively to describe random graphs.

General representations of exchangeable arrays were established independently in profound papers by Aldous (1981) and Hoover (1979), using totally different methods. Thus, Aldous uses probabilistic methods to derive the representation of separately exchangeable arrays on N^2, whereas Hoover derives the representation of jointly exchangeable arrays of arbitrary dimension, using ideas from non-standard analysis and mathematical logic, going back to Gaifman (1961) and Krauss (1969). Hoover also identifies pairs of functions f, g that can be used to represent the same array. Probabilistic proofs of Hoover's results were found51 in K(92), along with the extensions to contractable arrays. Hoover's uniqueness criteria were amended in K(89/92).

The notion of rotatable arrays was suggested by some limit theorems for U-statistics, going back to Hoeffding (1948). Rotatable arrays also arise naturally in quantum mechanics, where the asymptotic eigenvalue distribution of X was studied by Wigner (1955), leading in particular to his celebrated semicircle law for the 'Gaussian orthogonal ensemble', where the entries X_{ij} = X_{ji} are i.i.d. N(0, 1) for i < j.
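As an orientation only (with a normalization chosen here for illustration, not taken from the book): for a symmetric n × n random matrix X whose entries on and above the diagonal are independent with mean 0 and variance 1, Wigner's semicircle law states that the empirical distribution of the eigenvalues of n^{-1/2} X converges to the density
\[
  \rho(x) \;=\; \frac{1}{2\pi}\,\sqrt{4 - x^{2}}, \qquad |x| \le 2.
\]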

49 See also the recollections of Halmos (1985), pp. 74–76, who was a student of Doob.
50 Feller (1966/71) writes that Sparre-Andersen's result 'was a sensation greeted with incredulity, and the original proof was of an extraordinary intricacy and complexity'. He goes on to give a simplified proof, which is still quite complicated.
51 Hoover's unpublished paper requires expert knowledge in mathematical logic. Before developing my own proofs, I had to ask the author if my interpretations were correct.

Olson & Uppuluri (1972) noted that, under the stated independence assumption, the Gaussian distribution follows from the joint rotatability suggested by physical considerations52. Scores of physicists and mathematicians have extended Wigner's results, but under independence assumptions without physical significance. Separately and jointly rotatable arrays on N^2 were first studied by Dawid (1977/78), who also conjectured the representation in the separate case. The statement was subsequently proved by Aldous (1981), under some simplifying assumptions that were later removed in K(88). The latter paper also gives the representation in the jointly rotatable case. The higher-dimensional versions, derived in K(95), are stated most conveniently in terms of multiple Wiener–Itô integrals on tensor products of Hilbert spaces. Those results lead in turn to related representations of exchangeable and contractable random sheets.

Motivated by problems in population genetics, Kingman (1978) established his 'paint-box' representation of exchangeable partitions, later extended to more general symmetries in K(05). Algebraic and combinatorial aspects of exchangeable partitions have been studied extensively by Pitman (1995/2002) and Gnedin (1997).

A comprehensive account of contractable, exchangeable, and rotatable arrays is given in K(05), where the proofs of some major results take up to a hundred pages. From the huge literature on the eigenvalue distribution of random matrices, I can especially recommend the books by Mehta (1991) for physicists and by Anderson et al. (2010) for mathematicians.

29. Local time, excursions, and additive functionals

Local time of Brownian motion at a fixed point was discovered and explored by Lévy (1939), who devised several explicit constructions, mostly of the type of Proposition 29.12. Much of Lévy's analysis is based on the observation in Corollary 29.3. The elementary Lemma 29.2 is due to Skorohod (1961/62). Tanaka's (1963) formula, first noted for Brownian motion, was taken by Meyer (1976) as a basis for a general semi-martingale approach. The resulting local time process was recognized as an occupation density, independently by Meyer (1976) and Wang (1977). Trotter (1958b) proved that Brownian local time has a jointly continuous version, and the regularization theorem for general continuous semi-martingales was obtained by Yor (1978).

Modern excursion theory originated with the seminal paper of Itô (1972), which was partly inspired by work of Lévy (1939). In particular, Itô proved a version of Theorem 29.11, assuming the existence of local time. Horowitz (1972) independently studied regenerative sets and noted their connection with subordinators, equivalent to the existence of a local time. A systematic theory of regenerative processes was developed by Maisonneuve (1974). The remarkable Theorem 29.17 was discovered independently by Ray (1963) and Knight (1963), and the present proof is essentially due to Walsh (1978). Our construction of the excursion process is close in spirit to Lévy's original ideas and those of Greenwood & Pitman (1980).

⁵² An observer can't distinguish between arrays differing by a joint rotation.


Elementary additive functionals of integral type had been discussed extensively in the literature, when Dynkin proposed a study of the general case. The existence Theorem 29.23 was obtained by Volkonsky (1960), and the construction of local time in Theorem 29.24 dates back to Blumenthal & Getoor (1964). The integral representation of CAFs in Theorem 29.25 was proved independently by Volkonsky (1958/60) and McKean & Tanaka (1961). The characterization of an additive functional in terms of a suitable measure on the state space dates back to Meyer (1962), and an explicit representation of the associated measure was found by Revuz (1970), after special cases had been considered by Hunt (1957/58).

An introduction to Brownian local time appears in Karatzas & Shreve (1998). The books by Itô & McKean (1965) and Revuz & Yor (1999) contain an abundance of further information on the subject. The latter text may also serve as an introduction to additive functionals and excursion theory. For more information on the latter topics, the reader may consult Blumenthal & Getoor (1968), Blumenthal (1992), and Dellacherie et al. (1992).

30. Random measures, smoothing and scattering

The Poisson convergence of superpositions of many small, independent point processes has been noted in increasing generality by many authors, including Palm (1943), Khinchin (1955), Ososkov (1956), Grigelionis (1963), Goldman (1967), and Jagers (1972). Limit theorems under simultaneous thinning and rescaling of a given point process were obtained by Rényi (1956), Nawrotzki (1962), Belyaev (1963), and Goldman (1967). The Cox convergence of dissipative transforms, quoted from K(17b), extends the thinning theorem in K(75), which in turn implies Mecke's (1968) characterization of Cox processes. The partly analogous theory of strong convergence, developed by Matthes et al. (1974/78), forms a basis for their theory of infinitely divisible point processes.

The limit theorem of Dobrushin⁵³ (1956) and its various extensions played a crucial role in the development of the subject. Related results for evolutions generated by independent random walks go back to Maruyama (1955). The Poisson convergence of traffic flows was first noted by Breiman (1963). Precise versions of those results, including the role of asymptotically invariant measures, were given by Stone (1968). For constant motion, the fact that independence alone implies the Poisson property was noted in K(78c). Wiener's multi-variate ergodic theorem was extended to random measures by Nguyen & Zessin (1979). A general theory of averaging, smoothing, and scattering, with associated criteria for weak or strong Cox convergence, was developed in K(17b).

⁵³ Roland Dobrushin (1929–95), Russian mathematical physicist, famous for his work in statistical mechanics.


Line and flat processes have been studied extensively, first by Miles (1969/71/74) in the Poisson case, and then more generally by Davidson⁵⁴ (1974a/c), Krickeberg (1974a/b), Papangelou (1974a-b/76), and in K(76/78b-c/81). Spatial branching processes were studied extensively by Matthes et al. (1974/78). Here the stability dichotomy with associated truncation criteria go back to Debes et al. (1970/71). The Palm tree criterion for stability⁵⁵ was noted in K(77). Further developments in this area appear in Liemant et al. (1988).

A systematic development of point process theory was initiated by the East-German school, as documented by Matthes⁵⁶ et al. (1974/78/82). Daley & Vere-Jones (2003/08) give a detailed introduction to point processes and their applications. General random measure theory has been studied extensively since Jagers (1974) and K(75/76/83/86), and a modern, comprehensive account was attempted in K(17b).

31. Palm and Gibbs kernels, local approximation

In his study of intensity fluctuations in telephone traffic, now recognized as the beginning of modern point process theory, Palm⁵⁷ (1943) introduced some Palm probabilities associated with a simple, stationary point process on R. An extended and more rigorous account of Palm's ideas was given by Khinchin (1955), in a monograph on queuing theory, and some basic inversion formulas became known as the Palm–Khinchin equations. Kaplan (1955) established the fundamental time-cycle duality as an extension of results for renewal processes by Doob (1948b), after a partial discrete-time version had been noted by Kac (1947). Kaplan's result was rediscovered in the context of Palm distributions, independently by Ryll-Nardzewski (1961) and Slivnyak (1962). In the special case of intervals on the line, Theorem 31.5 (i) was first noted by Korolyuk, as cited by Khinchin (1955), and part (iii) of the same theorem was obtained by Ryll-Nardzewski (1961). The general versions are due to König & Matthes (1963) and Matthes (1963) for d = 1, and to Matthes et al. (1978) for d > 1.

Averaging properties of Palm and related measures have been explored by many authors, including Slivnyak (1962), Zähle (1980), and Thorisson (1995). Palm measures of stationary point processes on the line have been studied in detail by many authors, including Nieuwenhuis (1994) and Thorisson (2000). The duality between Palm measures and moment densities was noted in K(99c). Local approximations of general point processes and the associated Palm measures were studied in K(09).

⁵⁴ Rollo Davidson (1944–1970), British math prodigy, making profound contributions to stochastic analysis and geometry, before dying in a climbing accident at age 25. A yearly prize was established to his memory.
⁵⁵ often referred to as 'Kallenberg's backward method'—there I got what I deserve!
⁵⁶ Klaus Matthes (1931–98), a leader of the East-German probability school, who along with countless students and collaborators developed the modern theory of point processes, with applications to queuing, branching processes, and statistical mechanics. Served for many years as head of the Weierstrass Institute of the Academy of Sciences in DDR.
⁵⁷ Conny Palm (1907–51), Swedish engineer and applied mathematician, and the founder of modern point process theory. Also known for designing the first Swedish computer.


The iteration approach to conditional distributions and Palm measures goes back to K(10/11b). Exterior conditioning plays a basic role in statistical mechanics, where the theory of Gibbs measures, inspired by the pioneering work of Gibbs⁵⁸ (1902), has been pursued by scores of physicists and mathematicians. In particular, Theorem 31.15 (i) is a version of the celebrated 'DLR equation', noted by Dobrushin (1970) and Lanford & Ruelle (1969). Connections with modern point process theory were recognized and explored by Nguyen & Zessin (1979), Matthes et al. (1979), and Rauchenschwandtner (1980).

Motivated by problems in stochastic geometry, Papangelou (1974b) introduced the kernel named after him, under some simplifying assumptions that were later removed in K(78a). An elegant projection approach to Papangelou kernels was noted by van der Hoeven (1983). The general theory of Palm and Gibbs kernels was developed in K(83/86). As with many subjects of the previous chapter, here too the systematic development was initiated by Matthes et al. (1974/78/82). A comprehensive account of general Palm and Gibbs kernels is given in K(17b), which also contains a detailed exposition of the stationary case. Applications to queuing theory and related areas are discussed by many authors, including Franken et al. (1981) and Baccelli & Brémaud (2000).

32. Stochastic equations and martingale problems

Long before the existence of any general theory for SDEs, Langevin (1908) proposed his equation to model the velocity of a Brownian particle. The solution process was later studied by Ornstein & Uhlenbeck (1930), and was thus named after them. A more rigorous discussion appears in Doob (1942a). The general idea of a stochastic differential equation goes back to Bernstein (1934/38), who proposed a pathwise construction of diffusion processes by a discrete approximation, leading in the limit to a formal differential equation driven by a Brownian motion. Kolmogorov posed the problem of constructing a diffusion process with given drift and diffusion rates⁵⁹, which motivated Itô (1942a/51b) to develop his theory of SDEs, including a precise definition of the stochastic integral, conditions for existence and uniqueness of solutions, the Markov property of the solution process, and the continuous dependence on initial state. Similar results were later obtained independently by Gihman (1947/50/51). A general theory of stochastic flows was developed by Kunita (1990) and others.

The notion of a weak solution was introduced by Girsanov (1960), and criteria for weak existence appear in Skorohod (1965). The ideas behind the transformations in Propositions 32.12 and 32.13 date back to Girsanov (1960) and Volkonsky (1958), respectively.

⁵⁸ Josiah Willard Gibbs (1839–1903), American physicist, along with Maxwell and Boltzmann a founder of statistical mechanics.
⁵⁹ The problem was also solved independently by Doeblin (1940b), who used a totally different method based on a random time change; see the notes to Chapter 19.


The notion of a martingale problem goes back to Lévy's martingale characterization of Brownian motion and Dynkin's theory of the characteristic operator. A comprehensive theory was developed by Stroock & Varadhan (1969), who established the equivalence with weak solutions to the associated SDEs, gave general criteria for uniqueness in law, and deduced conditions for the strong Markov and Feller properties. The measurability part of Theorem 32.10 extends an exercise in Stroock & Varadhan (1979). Yamada & Watanabe (1971) proved that weak existence and pathwise uniqueness imply strong existence and uniqueness in law. Under the same conditions, they also established the existence of a functional solution, possibly depending on the initial distribution. The latter dependence was later removed in K(96a). Ikeda & Watanabe (1989) noted how the notions of pathwise uniqueness and uniqueness in law extend by conditioning from degenerate to arbitrary initial distributions.

The theory of SDEs is covered by countless textbooks on different levels, including Ikeda & Watanabe (2014), Rogers & Williams (2000), and Karatzas & Shreve (1998). More information on the martingale problem appears in Jacod (1979), Stroock & Varadhan (1979), and Ethier & Kurtz (1986). There is also an extensive literature on SPDEs (stochastic partial differential equations), where a standard reference is Walsh (1984).

33. One-dimensional SDEs and diffusions

The study of continuous Markov processes with associated parabolic differential equations, initiated by Kolmogorov (1931a) and Feller (1936), took a new turn with the seminal papers of Feller (1952/54), who studied the generators of one-dimensional diffusions, within the framework of the newly developed semi-group theory. In particular, Feller gave a complete description in terms of scale function and speed measure, classified the boundary behavior, and showed how the latter is determined by the domain of the generator. Finally, he identified the cases when explosion occurs, corresponding to the absorption cases in Theorem 33.15. A more probabilistic approach to those results was developed by Dynkin (1955b/59), who along with Ray (1956) continued Feller's study of the relationship between analytic properties of the generator and sample path properties of the process. The idea of constructing diffusions on a natural scale through a time change of Brownian motion is due to Hunt (1958) and Volkonsky (1958), and the full description in Theorem 33.9 was completed by Volkonsky (1960) and Itô & McKean (1965). The present stochastic calculus approach is based on ideas in Méléard (1986).

The ratio ergodic Theorem 33.14 was first obtained for Brownian motion by Derman (1954), by a method originally devised for discrete-time chains by Doeblin (1938a). It was later extended to more general diffusions by Motoo & Watanabe (1958). The ergodic behavior of recurrent, one-dimensional diffusions was analyzed by Maruyama & Tanaka (1957).


For one-dimensional SDEs, Skorohod (1965) noticed that Itô's original Lipschitz condition for pathwise uniqueness can be replaced by a weaker Hölder condition. He also obtained a corresponding comparison theorem. The improved conditions in Theorems 33.3 and 33.5 are due to Yamada & Watanabe (1971) and Yamada (1973), respectively. Perkins (1982) and Le Gall (1983) noted how the semi-martingale local time can be used to simplify and unify the proofs of those and related results. The fundamental weak existence and uniqueness criteria in Theorem 33.1 were discovered by Engelbert & Schmidt (1984/85), whose (1981) 0–1 law is implicit in Lemma 33.2.

Elementary introductions to one-dimensional diffusions appear in Breiman (1968), Freedman (1971b), and Rogers & Williams (2000). More detailed and advanced accounts are given by Dynkin (1965) and Itô & McKean (1965). Further information on one-dimensional SDEs appears in Karatzas & Shreve (1998) and Revuz & Yor (1999).

34. PDE connections and potential theory

The fundamental solution to the heat equation in terms of the Gaussian kernel was obtained by Laplace (1809). A century later Bachelier (1900) noted the relationship between Brownian motion and the heat equation. The PDE connections were further explored by many authors, including Kolmogorov (1931a), Feller (1936), Kac (1951), and Doob (1955). A first version of Theorem 34.1 was obtained by Kac (1949), who was in turn inspired by Feynman's (1948) work on the Schrödinger equation. Theorem 34.2 is due to Stroock & Varadhan (1969).

In a discussion of the Dirichlet problem, Green (1828) introduced the functions named after him. The Dirichlet, sweeping, and equilibrium problems were all studied by Gauss (1840), in a pioneering paper on electrostatics. The rigorous developments in potential theory began with Poincaré (1890/99), who solved the Dirichlet problem for domains with a smooth boundary. The equilibrium measure was characterized by Gauss as the unique measure minimizing a certain energy functional, but the existence of the minimum was not rigorously established until Frostman (1935). The cone condition for regularity is due to Zaremba (1909).

The first probabilistic connections were made by Phillips & Wiener (1923) and Courant et al. (1928), who solved the Dirichlet problem in the plane by a method of discrete approximation, involving a version of Theorem 34.5 for a simple symmetric random walk. Kolmogorov & Leontovich (1933) evaluated a special hitting distribution for two-dimensional Brownian motion and noted that it satisfies the heat equation. Kakutani (1944b/45) showed how the harmonic measure and sweeping kernel can be expressed in terms of a Brownian motion. The probabilistic methods were extended and perfected by Doob (1954/55), who noted the profound connections with martingale theory. A general potential theory was later developed by Hunt (1957/58) for broad classes of Markov processes.


The interpretation of Green functions as occupation densities was known to Kac (1951), and a probabilistic approach to Green functions was developed by Hunt⁶⁰ (1956). The connection between equilibrium measures and quitting times, implicit already in Spitzer (1964) and Itô & McKean (1965), was exploited by Chung⁶¹ (1973) to yield the representation in Theorem 34.14.

Time reversal of diffusion processes was first considered by Schrödinger (1931). Kolmogorov (1936b/37) computed the transition kernels of the reversed process and gave necessary and sufficient conditions for symmetry. The basic role of time reversal and duality in potential theory was recognized by Doob (1954) and Hunt (1958). Proposition 34.15 and the related construction in Theorem 34.21 go back to Hunt, but Theorem 34.19 may be new. The measure ν in Theorem 34.21 is related to the Kuznetsov measures, discussed extensively in Getoor (1990). The connection between random sets and alternating capacities was established by Choquet (1953/54), and a corresponding representation of infinitely divisible random sets was obtained by Matheron (1975).

The classical decomposition of subharmonic functions goes back to Riesz (1926/30). The basic connection between super-harmonic functions and supermartingales was established by Doob (1954), leading to a probabilistic counterpart of Riesz' theorem. This is where he recognized the need for a general decomposition theorem for super-martingales, generalizing the elementary Lemma 9.10. Doob also proved the continuity of the composition of an excessive function with Brownian motion.

Introductions to probabilistic potential theory and other PDE connections appear in Bass (1995/98), Chung (1995), and Karatzas & Shreve (1998). A detailed exposition of classical probabilistic potential theory is given by Port & Stone (1978). Doob (1984) provides a wealth of further information on both the analytic and the probabilistic aspects. An introduction to Hunt's work and subsequent developments is given by Chung (1982). More advanced and comprehensive treatments appear in Blumenthal & Getoor (1968), Dellacherie & Meyer (1987/92), and Sharpe (1988), which also cover the theory of additive functionals and their potentials. The connection between random sets and alternating capacities is discussed in Molchanov (2005).

35. Stochastic differential geometry

Brownian motion and stochastic calculus on Riemannian manifolds and Lie groups were considered already by Yosida (1949), Itô (1950), Dynkin (1961), and others, involving the sophisticated machinery of Levi-Civita connections and constructions by 'rolling without slipping'.

⁶⁰ Gilbert Hunt (1916–2008), American mathematician and a prominent tennis player, whose deep and revolutionary papers established a close connection between Markov processes and general potential theory.
⁶¹ Kai Lai Chung (1917–2009), Chinese-American mathematician, making significant contributions to many areas of modern probability. He became a friend of mine and even spent a week as a guest in our house.


Martingales in Riemannian manifolds have been studied by Duncan (1983) and Kendall (1988). A Lévy–Khinchin type representation of infinitely divisible distributions on Lie groups was obtained by Hunt (1956b), and a related theory of Lévy processes in Lie groups and homogeneous spaces has been developed by Liao (2018).

Schwartz⁶² (1980/82) realized that much of the classical theory could be developed for semi-martingales in a differentiable manifold without any Riemannian structure. His ideas⁶³ were further pursued by Meyer (1981/82), arguably leading to a significant simplification⁶⁴ of the whole area. The need for a connection to define martingales in a differential manifold was recognized by Meyer, and the present definition of martingales in connected manifolds goes back to Bismut (1981). Characterizations of martingales in terms of convex functions were given by Darling (1982). Our notion of local characteristics of a semi-martingale, believed to be new⁶⁵, arose from attempts to make probabilistic sense of the beautiful but somewhat obscure Schwartz–Meyer theory for general semi-martingales. Though the diffusion rates require no additional structure, the drift rate makes sense only in manifolds with a connection⁶⁶.

A survey of Brownian motion and SDEs in a Riemannian manifold appears in Ikeda & Watanabe (2014). For beginners we further recommend the 'overture' to the Riemannian theory in Rogers & Williams (2000), and the introduction to stochastic differential geometry by Kendall (1988). Detailed and comprehensive accounts of the Schwartz–Meyer theory are given by Émery (1989/2000).

Appendices

A modern and comprehensive account of general measure theory is given by Bogachev (2007). Another useful reference is Dudley (1989), who also discusses the axiomatization of mathematics. The projection and section theorems rely on capacity theory, for which we refer to Dellacherie (1972) and Dellacherie & Meyer (1975/83). Most of the material on topology and functional analysis can be found in standard textbooks. The J1-topology was introduced by Skorohod (1956), and detailed expositions appear in Billingsley (1968), Ethier & Kurtz (1986), and Jacod & Shiryaev (1987). Our discussion of topologies on MS is adapted from K(17b). The topology on the space of closed sets was introduced in a more general setting by Fell (1962), and a detailed proof of Theorem A6.1, different from ours, appears in Matheron (1975).

⁶² Laurent Schwartz (1915–2002), French mathematician, whose creation of distribution theory earned him a Fields medal in 1950. Towards the end of a long and distinguished career, he got interested in stochastic differential geometry.
⁶³ Michel Émery tells me that much of this material was known before the work of Schwartz, whose aim was to develop a Bourbaki-style approach to stochastic calculus in manifolds.
⁶⁴ hence the phrase sans larmes (without tears) in the title of Meyer's 1981 paper
⁶⁵ A notion of local characteristics also appears in Meyer (1981), though with a totally different meaning.
⁶⁶ simply because a martingale is understood to be a semi-martingale without drift


A modern and comprehensive account of the relevant set theory is given by Molchanov (2004). The material on differential geometry is standard, and our account is adapted from Émery (1989), though with a simplified notation.

Bibliography

Here we list only publications that are explicitly mentioned in the text or notes or are directly related to results cited in the book. Knowledgeable readers will notice that many books or papers of historical significance are missing. The items are ordered alphabetically after the main part of the last name, as in de Finetti and von Neumann.

Adler, R.J. (1990). An Introduction to Continuity, Extrema, and Related Topics for General Gaussian Processes. Inst. Math. Statist., Hayward, CA.
Akcoglu, M.A., Chacon, R.V. (1970). Ergodic properties of operators in Lebesgue space. Adv. Appl. Probab. 2, 1–47.
Aldous, D.J. (1978). Stopping times and tightness. Ann. Probab. 6, 335–340.
— (1981). Representations for partially exchangeable arrays of random variables. J. Multivar. Anal. 11, 581–598.
— (1985). Exchangeability and related topics. Lect. Notes in Math. 1117, 1–198. Springer, Berlin.
Aldous, D., Thorisson, H. (1993). Shift-coupling. Stoch. Proc. Appl. 44, 1–14.
Alexandrov, A.D. (1940/43). Additive set-functions in abstract spaces. Mat. Sb. 8, 307–348; 9, 563–628; 13, 169–238.
Alexandrov, P. (1916). Sur la puissance des ensembles mesurables B. C.R. Acad. Sci. Paris 162, 323–325.
Anderson, G.W., Guionnet, A., Zeitouni, O. (2010). An Introduction to Random Matrices. Cambridge Univ. Press.
André, D. (1887). Solution directe du problème résolu par M. Bertrand. C.R. Acad. Sci. Paris 105, 436–437.
Athreya, K., McDonald, D., Ney, P. (1978). Coupling and the renewal theorem. Amer. Math. Monthly 85, 809–814.
Athreya, K.B., Ney, P.E. (1972). Branching Processes. Springer, Berlin.
Baccelli, F., Brémaud, P. (2000). Elements of Queueing [sic] Theory, 2nd ed. Springer, Berlin.
Bachelier, L. (1900). Théorie de la spéculation. Ann. Sci. École Norm. Sup. 17, 21–86.
— (1901). Théorie mathématique du jeu. Ann. Sci. École Norm. Sup. 18, 143–210.
Bahadur, R.R. (1971). Some Limit Theorems in Statistics. SIAM, Philadelphia.
Barbier, É. (1887). Généralisation du problème résolu par M. J. Bertrand. C.R. Acad. Sci. Paris 105, 407, 440.
Bartlett, M.S. (1949). Some evolutionary stochastic processes. J. Roy. Statist. Soc. B 11, 211–229.
Bass, R.F. (1995). Probabilistic Techniques in Analysis. Springer, NY.
— (1998). Diffusions and Elliptic Operators. Springer, NY.


Bateman, H. (1910). Note on the probability distribution of α-particles. Philos. Mag. 20 (6), 704–707. Baxter, G. (1961). An analytic approach to finite fluctuation problems in probability. J. d’Analyse Math. 9, 31–70. Bell, D.R. (1987). The Malliavin Calculus. Longman & Wiley. Belyaev, Y.K. (1963). Limit theorems for dissipative flows. Th. Probab. Appl. 8, 165– 173. Berbee, H.C.P. (1979). Random Walks with Stationary Increments and Renewal Theory. Mathematisch Centrum, Amsterdam. Bernoulli, J. (1713). Ars Conjectandi. Thurnisiorum, Basel. Bernstein, S.N. (1927). Sur l’extension du th´eor`eme limite du calcul des probabilit´es aux sommes de quantit´es d´ependantes. Math. Ann. 97, 1–59. — (1934). Principes de la th´eorie des ´equations diff´erentielles stochastiques. Trudy Fiz.Mat., Steklov Inst., Akad. Nauk. 5, 95–124. — (1937). On some variations of the Chebyshev inequality (in Russian). Dokl. Acad. Nauk SSSR 17, 275–277. ´ — (1938). Equations diff´erentielles stochastiques. Act. Sci. Ind. 738, 5–31. Bertoin, J. (1996). L´evy Processes. Cambridge Univ. Press. Bertrand, J. (1887). Solution d’un probl`eme. C.R. Acad. Sci. Paris 105, 369. Bichteler, K. (1979). Stochastic integrators. Bull. Amer. Math. Soc. 1, 761–765. ´, I.J. (1845). De la loi de multiplication et de la dur´ee des familles. Soc. Bienayme Philomat. Paris Extraits, S´er. 5, 37–39. — (1853). Consid´erations `a l’appui de la d´ecouverte de Laplace sur la loi de probabilit´e dans la m´ethode des moindres carr´es. C.R. Acad. Sci. Paris 37, 309–324. Billingsley, P. (1965). Ergodic Theory and Information. Wiley, NY. — (1968). Convergence of Probability Measures. Wiley, NY. — (1986/95). Probability and Measure, 3rd ed. Wiley, NY. Birkhoff, G.D. (1932). Proof of the ergodic theorem. Proc. Natl. Acad. Sci. USA 17, 656–660. Bismut, J.M. (1981). Martingales, the Malliavin calculus and hypoellipticity under general H¨ormander’s condition. Z. Wahrsch. verw. Geb. 56, 469–505. Blackwell, D. (1948). A renewal theorem. Duke Math. J. 15, 145–150. — (1953). Extension of a renewal theorem. Pacific J. Math. 3, 315–320. Blackwell, D., Freedman, D. (1964). The tail σ-field of a Markov chain and a theorem of Orey. Ann. Math. Statist. 35, 1291–1295. Blumenthal, R.M. (1957). An extended Markov property. Trans. Amer. Math. Soc. 82, 52–72. — (1992). Excursions of Markov Processes. Birkh¨auser, Boston. Blumenthal, R.M., Getoor, R.K. (1964). Local times for Markov processes. Z. Wahrsch. verw. Geb. 3, 50–74. — (1968). Markov Processes and Potential Theory. Academic Press, NY.


Bochner, S. (1932). Vorlesungen u ¨ber Fouriersche Integrale, Akad. Verlagsges., Leipzig. Repr. Chelsea, NY 1948. — (1933). Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse. Math. Ann. 108, 378–410. Bogachev, V.I. (2007). Measure Theory 1–2. Springer. ¨ Boltzmann, L. (1887). Uber die mechanischen Analogien des zweiten Hauptsatzes der Thermodynamik. J. Reine Angew. Math. 100, 201–212. ´ Borel, E. (1895). Sur quelques points de la th´eorie des fonctions. Ann. Sci. Ecole Norm. Sup. (3) 12, 9–55. — (1898). Le¸cons sur la Th´eorie des Fonctions. Gauthier-Villars, Paris. — (1909a). Les probabilit´es d´enombrables et leurs applications arithm´etiques. Rend. Circ. Mat. Palermo 27, 247–271. — (1909b/24). Elements of the Theory of Probability, 3rd ed., Engl. transl. Prentice-Hall 1965. Bortkiewicz, L.v. (1898). Das Gesetz der kleinen Zahlen. Teubner, Leipzig. Breiman, L. (1957/60). The individual ergodic theorem of infomation theory. Ann. Math. Statist. 28, 809–811; 31, 809–810. — (1963). The Poisson tendency in traffic distribution. Ann. Math. Statist. 34, 308–311. — (1968). Probability. Addison-Wesley, Reading, MA. Repr. SIAM, Philadelphia 1992. Brown, R. (1828). A brief description of microscopical observations made in the months of June, July and August 1827, on the particles contained in the pollen of plants; and on the general existence of active molecules in organic and inorganic bodies. Ann. Phys. 14, 294–313. Brown, T.C. (1978). A martingale approach to the Poisson convergence of simple point processes. Ann. Probab. 6, 615–628. — (1979). Compensators and Cox convergence. Math. Proc. Cambridge Phil. Soc. 90, 305–319. Bryc, W. (1990). Large deviations by the asymptotic value method. In Diffusion Processes and Related Problems in Analysis (M. Pinsky, ed.), 447–472. Birkh¨auser, Basel. ¨hlmann, H. (1960). Austauschbare stochastische Variabeln und ihre Grenzwerts¨ Bu atze. Univ. Calif. Publ. Statist. 3, 1–35. Buniakowsky, V.Y. (1859). Sur quelques in´egalit´es concernant les int´egrales ordinaires et les int´egrales aux diff´erences finies. M´em. de l’Acad. St.-Petersbourg 1:9. Burkholder, D.L. (1966). Martingale transforms. Ann. Math. Statist. 37, 1494–1504. Burkholder, D.L., Davis, B.J., Gundy, R.F. (1972). Integral inequalities for convex functions of operators on martingales. Proc. 6th Berkeley Symp. Math. Statist. Probab. 2, 223–240. Burkholder, D.L., Gundy, R.F. (1970). Extrapolation and interpolation of quasi-linear operators on martingales. Acta Math. 124, 249–304. Cameron, R.H., Martin, W.T. (1944). Transformation of Wiener integrals under translations. Ann. Math. 45, 386–396. Cantelli, F.P. (1917). Su due applicazione di un teorema di G. Boole alla statistica matematica. Rend. Accad. Naz. Lincei 26, 295–302.


— (1933). Sulla determinazione empirica della leggi di probabilit`a. Giorn. Ist. Ital. Attuari 4, 421–424. ´odory, C. (1918/27). Vorlesungen u Carathe ¨ber reelle Funktionen, 2nd ed. Teubner, Leipzig (1st ed. 1918). Repr. Chelsea, NY 1946. Carleson, L. (1958). Two remarks on the basic theorems of information theory. Math. Scand. 6, 175–180. ´ Cauchy, A.L. (1821). Cours d’analyse de l’Ecole Royale Polytechnique, Paris. Chacon, R.V., Ornstein, D.S. (1960). A general ergodic theorem. Illinois J. Math. 4, 153–160. Chapman, S. (1928). On the Brownian displacements and thermal diffusion of grains suspended in a non-uniform fluid. Proc. Roy. Soc. London (A) 119, 34–54. Chebyshev, P.L. (1867). Des valeurs moyennes. J. Math. Pures Appl. 12, 177–184. — (1890). Sur deux th´eor`emes relatifs aux probabilit´es. Acta Math. 14, 305–315. Chentsov, N.N. (1956). Weak convergence of stochastic processes whose trajectories have no discontinuities of the second kind and the “heuristic” approach to the Kolmogorov– Smirnov tests. Th. Probab. Appl. 1, 140–144. Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statist. 23, 493–507. Chernoff, H., Teicher, H. (1958). A central limit theorem for sequences of exchangeable random variables. Ann. Math. Statist. 29, 118–130. Choquet, G. (1953/54). Theory of capacities. Ann. Inst. Fourier Grenoble 5, 131–295. Chow, Y.S., Teicher, H. (1978/97). Probability Theory: Independence, Interchangeability, Martingales, 3nd ed. Springer, NY. Chung, K.L. (1960). Markov Chains with Stationary Transition Probabilities. Springer, Berlin. — (1961). A note on the ergodic theorem of information theory. Ann. Math. Statist. 32, 612–614. — (1968/74). A Course in Probability Theory, 2nd ed. Academic Press, NY. — (1973). Probabilistic approach to the equilibrium problem in potential theory. Ann. Inst. Fourier Grenoble 23, 313–322. — (1982). Lectures from Markov Processes to Brownian Motion. Springer, NY. — (1995). Green, Brown, and Probability. World Scientific, Singapore. Chung, K.L., Doob, J.L. (1965). Fields, optionality and measurability. Amer. J. Math. 87, 397–424. Chung, K.L., Fuchs, W.H.J. (1951). On the distribution of values of sums of random variables. Mem. Amer. Math. Soc. 6. Chung, K.L., Ornstein, D.S. (1962). On the recurrence of sums of random variables. Bull. Amer. Math. Soc. 68, 30–32. Chung, K.L., Walsh, J.B. (1974). Meyer’s theorem on previsibility. Z. Wahrsch. verw. Geb. 29, 253–256. Chung, K.L., Williams, R.J. (1983/90). Introduction to Stochastic Integration, 2nd ed. Birkh¨auser, Boston. C ¸ inlar, E. (2011). Probability and Stochastics. Springer.


Clark, J.M.C. (1970/71). The representation of functionals of Brownian motion by stochastic integrals. Ann. Math. Statist. 41, 1282–95; 42, 1778. Copeland, A.H., Regan, F. (1936). A postulational treatment of the Poisson law. Ann. Math, 37, 357–362. ¨ Courant, R., Friedrichs, K., Lewy, H. (1928). Uber die partiellen Differentialgleichungen der mathematischen Physik. Math. Ann. 100, 32–74. Cournot, A.A. (1843). Exposition de la th´eorie des chances et des probabilit´es. Hauchette, Paris. `ge, P. (1962/63). Int´egrales stochastiques et martingales de carr´e int´egrable. Courre Sem. Brelot–Choquet–Deny 7. Publ. Inst. H. Poincar´e. Cox, D.R. (1955). Some statistical methods connected with series of events. J. R. Statist. Soc. Ser. B 17, 129–164. ´r, H. (1938). Sur un nouveau th´eor`eme-limite de la th´eorie des probabilit´es. Actual. Crame Sci. Indust. 736, 5–23. — (1942). On harmonic analysis in certain functional spaces. Ark. Mat. Astr. Fys. 28B:12 (17 pp.). ´r, H., Leadbetter, M.R. (1967). Stationary and Related Stochastic Processes. Crame Wiley, NY. ´r, H., Wold, H. (1936). Some theorems on distribution functions. J. London Crame Math. Soc. 11, 290–295. ¨ rgo ¨ , M., Re ´ ve ´sz, P. (1981). Strong Approximations in Probability and Statistics. Cso Academic Press, NY. Daley, D.J., Vere-Jones, D. (2003/08). An Introduction to the Theory of Point Processes 1-2, 2nd ed. Springer, NY. Dambis, K.E. (1965). On the decomposition of continuous submartingales. Th. Probab. Appl. 10, 401–410. Daniell, P.J. (1918/19). Integrals in an infinite number of dimensions. Ann. Math. (2) 20, 281–288. — (1919/20). Functions of limited variation in an infinite number of dimensions. Ann. Math. (2) 21, 30–38. — (1920). Stieltjes derivatives. Bull. Amer. Math. Soc. 26, 444–448. Darling, R.W.R. (1982). Martingales in manifolds—definition, examples and behaviour under maps. Lect. Notes in Math. 921, Springer 1982. Davidson, R. (1974a). Stochastic processes of flats and exchangeability. In: Stochastic Geometry (E.F. Harding & D.G. Kendall, eds.), pp. 13–45. Wiley, London. — (1974b). Exchangeable point-processes. Ibid., 46–51. — (1974c). Construction of line-processes: second-order properties. Ibid., 55–75. Davis, B.J. (1970). On the integrability of the martingale square function. Israel J. Math. 8, 187–190. Dawid, A.P. (1977). Spherical matrix distributions and a multivariate model. J. Roy. Statist. Soc. (B) 39, 254–261. — (1978). Extendibility of spherical matrix distributions. J. Multivar. Anal. 8, 559–566.


Dawson, D.A. (1977). The critical measure diffusion. Z. Wahrsch. verw. Geb. 40, 125– 145. ´ — (1993). Measure-valued Markov processes. In: Ecole d’´et´e de probabilit´es de Saint-Flour XXI–1991. Lect. Notes in Math. 1541, 1–126. Springer, Berlin. ¨ rtner, J. (1987). Large deviations from the McKean–Vlasov limit Dawson, D.A., Ga for weakly interacting diffusions. Stochastics 20, 247–308. Dawson, D.A., Perkins, E.A. (1991). Historical processes. Mem. Amer. Math. Soc. 93, #454. Day, M.M. (1942). Ergodic theorems for Abelian semigroups. Trans. Amer. Math. Soc. 51, 399–412. Debes, H., Kerstan, J., Liemant, A., Matthes, K. (1970/71). Verallgemeinerung eines Satzes von Dobrushin I, III. Math. Nachr. 47, 183–244; 50, 99–139. Dellacherie, C. (1972). Capacit´es et Processus Stochastiques. Springer, Berlin. — (1980). Un survol de la th´eorie de l’int´egrale stochastique. Stoch. Proc. Appl. 10, 115-144. Dellacherie, C., Maisonneuve, B., Meyer, P.A. (1992). Probabilit´es et Potentiel, V. Hermann, Paris. Dellacherie, C., Meyer, P.A. (1975/80/83/87). Probabilit´es et Potentiel, I–IV. Hermann, Paris. Engl. trans., North-Holland. Dembo, A., Zeitouni, O. (1998). Large Deviations Techniques and Applications, 2nd ed. Springer, NY. Derman, C. (1954). Ergodic property of the Brownian motion process. Proc. Natl. Acad. Sci. USA 40, 1155–1158. Deuschel, J.D., Stroock, D.W. (1989). Large Deviations. Academic Press, Boston. Dobrushin, R.L. (1956). On Poisson laws for distributions of particles in space (in Russian). Ukrain. Mat. Z. 8, 127–134. — (1970). Gibbsian random fields without hard core conditions. Theor. Math. Phys. 4, 705–719. Doeblin, W. (1938a). Expos´e de la th´eorie des chaˆınes simples constantes de Markov `a un nombre fini d’´etats. Rev. Math. Union Interbalkan. 2, 77–105. — (1938b). Sur deux probl`emes de M. Kolmogoroff concernant les chaˆınes d´enombrables. Bull. Soc. Math. France 66, 210–220. — (1939a). Sur les sommes d’un grand nombre de variables al´eatoires ind´ependantes. Bull. Sci. Math. 63, 23–64. — (1939b). Sur certains mouvements al´eatoires discontinus. Skand. Aktuarietidskr. 22, 211–222. — (1940a). El´ements d’une th´eorie g´en´erale des chaˆınes simples constantes de Markoff. Ann. Sci. Ecole Norm. Sup. 3, 57, 61–111. — (1940b/2000). Sur l’´equation de Kolmogoroff. C.R. Acad. Sci. 331. — (1947). Sur l’ensemble de puissances d’une loi de probabilit´e. Ann. Ec. Norm. Sup. ¨ hler, R. (1980). On the conditional independence of random events. Th. Probab. Do Appl. 25, 628–634.


´ans(-Dade), C. (1967a). Processus croissants naturel et processus croissants tr`es Dole bien mesurable. C.R. Acad. Sci. Paris 264, 874–876. — (1967b). Int´egrales stochastiques d´ependant d’un param`etre. Publ. Inst. Stat. Univ. Paris 16, 23–34. — (1970). Quelques applications de la formule de changement de variables pour les semimartingales. Z. Wahrsch. verw. Geb. 16, 181–194. ´ans-Dade, C., Meyer, P.A. (1970). Int´egrales stochastiques par rapport aux Dole martingales locales. Lect. Notes in Math. 124, 77–107. Springer, Berlin. Donsker, M.D. (1951/52). An invariance principle for certain probability limit theorems. Mem. Amer. Math. Soc. 6. — (1952). Justification and extension of Doob’s heuristic approach to the Kolmogorov– Smirnov theorems. Ann. Math. Statist. 23, 277–281. Donsker, M.D., Varadhan, S.R.S. (1975/83). Asymptotic evaluation of certain Markov process expectations for large time, I–IV. Comm. Pure Appl. Math. 28, 1–47, 279–301; 29, 389–461; 36, 183–212. Doob, J.L. (1936). Note on probability. Ann. Math. (2) 37, 363–367. — (1937). Stochastic processes depending on a continuous parameter. Trans. Amer. Math. Soc. 42, 107–140. — (1938). Stochastic processes with an integral-valued parameter. Trans. Amer. Math. Soc. 44, 87–150. — (1940). Regularity properties of certain families of chance variables. Trans. Amer. Math. Soc. 47, 455–486. — (1942a). The Brownian movement and stochastic equations. Ann. Math. 43, 351–369. — (1942b). Topics in the theory of Markoff chains. Trans. Amer. Math. Soc. 52, 37–64. — (1945). Markoff chains—denumerable case. Trans. Amer. Math. Soc. 58, 455–473. — (1947). Probability in function space. Bull. Amer. Math. Soc. 53, 15–30. — (1948a). Asymptotic properties of Markov transition probabilities. Trans. Amer. Math. Soc. 63, 393–421. — (1948b). Renewal theory from the point of view of the theory of probability. Trans. Amer. Math. Soc. 63, 422–438. — (1949). Heuristic approach to the Kolmogorov–Smirnov theorems. Ann. Math. Statist. 20, 393–403. — (1951). Continuous parameter martingales. Proc. 2nd Berkeley Symp. Math. Statist. Probab., 269–277. — (1953). Stochastic Processes. Wiley, NY. — (1954). Semimartingales and subharmonic functions. Trans. Amer. Math. Soc. 77, 86–121. — (1955). A probability approach to the heat equation. Trans. Amer. Math. Soc. 80, 216–280. — (1984). Classical Potential Theory and its Probabilistic Counterpart. Springer, NY. — (1994). Measure Theory. Springer, NY. Dubins, L.E. (1968). On a theorem of Skorohod. Ann. Math. Statist. 39, 2094–2097.


Dubins, L.E., Schwarz, G. (1965). On continuous martingales. Proc. Natl. Acad. Sci. USA 53, 913–916. Dudley, R.M. (1966). Weak convergence of probabilities on nonseparable metric spaces and empirical measures on Euclidean spaces. Illinois J. Math. 10, 109–126. — (1967). Measures on non-separable metric spaces. Illinois J. Math. 11, 449–453. — (1968). Distances of probability measures and random variables. Ann. Math. Statist. 39, 1563–1572. — (1989). Real Analysis and Probability. Wadsworth, Brooks & Cole, Pacific Grove, CA. Duncan, T.E. (1983). Some geometric methods for stochastic integration in manifolds. Geometry and Identification, 63–72. Lie groups: Hist., Frontiers and Appl. Ser. B 1, Math. Sci. Press, Brooklyn 1983. Dunford, N. (1939). A mean ergodic theorem. Duke Math. J. 5, 635–646. Dunford, N., Schwartz, J.T. (1956). Convergence almost everywhere of operator averages. J. Rat. Mech. Anal. 5, 129–178. Durrett, R. (1991/2019). Probability Theory and Examples, 2nd ed. Wadsworth, Brooks & Cole, Pacific Grove, CA. Dvoretzky, A. (1972). Asymptotic normality for sums of dependent random variables. Proc. 6th Berkeley Symp. Math. Statist. Probab. 2, 513–535. Dynkin, E.B. (1952). Criteria of continuity and lack of discontinuities of the second kind for trajectories of a Markov stochastic process (Russian). Izv. Akad. Nauk SSSR, Ser. Mat. 16, 563–572. — (1955a). Infinitesimal operators of Markov stochastic processes (Russian). Dokl. Akad. Nauk SSSR 105, 206–209. — (1955b). Continuous one-dimensional Markov processes (Russian). Dokl. Akad. Nauk SSSR 105, 405–408. — (1956). Markov processes and semigroups of operators. Infinitesimal operators of Markov processes. Th. Probab. Appl. 1, 25–60. — (1959). One-dimensional continuous strong Markov processes. Th. Probab. Appl. 4, 3–54. — (1961a). Theory of Markov Processes. Engl. trans., Prentice-Hall and Pergamon Press, Englewood Cliffs, NJ, and Oxford. (Russian orig. 1959.) — (1961b). Non-negative eigenfunctions of the Laplace–Beltrami operator and Brownian motion in certain symmetric spaces (in Russian). Dokl. Akad. Nauk. SSSR, 141, 288–291. — (1965). Markov Processes, Vols. 1–2. Engl. trans., Springer, Berlin. (Russian orig. 1963.) — (1978). Sufficient statistics and extreme points. Ann. Probab. 6, 705–730. — (1991). Branching particle systems and superprocesses. Ann. Probab. 19, 1157–1194. Dynkin, E.B., Yushkevich, A.A. (1956). Strong Markov processes. Th. Probab. Appl. 1, 134–139. Eagleson, G.K., Weber, N.C. (1978). Limit theorems for weakly exchangeable arrays. Math. Proc. Cambridge Phil. Soc. 84, 123–130.


Einstein, A. (1905). On the movement of small particles suspended in a stationary liquid demanded by the molecular-kinetic theory of heat. Engl. trans. in Investigations on the Theory of the Brownian Movement. Repr. Dover, NY 1956. — (1906). On the theory of Brownian motion. Ibid. Elliott, R.J. (1982). Stochastic Calculus and Applications. Springer, NY. Ellis, R.L. (1844). On a question in the theory of probabilities. Cambridge Math. J. 4 (21), 127–133. Ellis, R.S. (1985). Entropy, Large Deviations, and Statistical Mechanics. Springer, NY. ´ Emery, M. (1989). Stochastic Calculus in Manifolds. Springer. — (2000/13). Martingales continues dans les vari´et´es diff´erentiables. Lect. Notes in Math. 1738, 3–84. Reprinted in Stochastic Differential Geometry at Saint-Flour, pp. 263– 346. Springer. Engelbert, H.J., Schmidt, W. (1981). On the behaviour of certain functionals of the Wiener process and applications to stochastic differential equations. Lect. Notes in Control and Inform. Sci. 36, 47–55. — (1984). On one-dimensional stochastic differential equations with generalized drift. Lect. Notes in Control and Inform. Sci. 69, 143–155. Springer, Berlin. — (1985). On solutions of stochastic differential equations without drift. Z. Wahrsch. verw. Geb. 68, 287–317. ¨ s, P., Feller, W., Pollard, H. (1949). A theorem on power series. Bull. Amer. Erdo Math. Soc. 55, 201–204. ¨ s, P., Kac, M. (1946). On certain limit theorems in the theory of probability. Bull. Erdo Amer. Math. Soc. 52, 292–302. — (1947). On the number of positive sums of independent random variables. Bull. Amer. Math. Soc. 53, 1011–1020. Erlang, A.K. (1909). The theory of probabilities and telephone conversations. Nyt. Tidskr. Mat. B 20, 33–41. Etheridge, A.M. (2000). An Introduction to Superprocesses. Univ. Lect. Ser. 20, Amer. Math. Soc., Providence, RI. Ethier, S.N., Kurtz, T.G. (1986). Markov Processes: Characterization and Convergence. Wiley, NY. ¨ Faber, G. (1910). Uber stetige Funktionen, II. Math. Ann. 69, 372–443. Farrell, R.H. (1962). Representation of invariant measures. Illinois J. Math. 6, 447– 467. Fatou, P. (1906). S´eries trigonom´etriques et s´eries de Taylor. Acta Math. 30, 335–400. Fell, J.M.G. (1962). A Hausdorff topology for the closed subsets of a locally compact non-Hausdorff space. Proc. Amer. Math. Soc. 13, 472–476. ¨ Feller, W. (1935/37). Uber den zentralen Grenzwertsatz der Wahrscheinlichkeitstheorie, I–II. Math. Z. 40, 521–559; 42, 301–312. — (1936). Zur Theorie der stochastischen Prozesse (Existenz und Eindeutigkeitss¨ atze). Math. Ann. 113, 113–160. — (1937). On the Kolmogoroff–P. L´evy formula for infinitely divisible distribution functions. Proc. Yugoslav Acad. Sci. 82, 95–112.


— (1940). On the integro-differential equations of purely discontinuous Markoff processes. Trans. Amer. Math. Soc. 48, 488–515; 58, 474. — (1949). Fluctuation theory of recurrent events. Trans. Amer. Math. Soc. 67, 98–119. — (1951). Diffusion processes in genetics. Proc. 2nd Berkeley Symp. Math. Statist. Probab., 227–246. Univ. Calif. Press, Berkeley. — (1952). The parabolic differential equations and the associated semi-groups of transformations. Ann. Math. 55, 468–519. — (1954). Diffusion processes in one dimension. Trans. Amer. Math. Soc. 77, 1–31. — (1950/57/68; 1966/71). An Introduction to Probability Theory and its Applications, 1 (3rd ed.); 2 (2nd ed.). Wiley, NY (1st eds. 1950, 1966). Feller, W., Orey, S. (1961). A renewal theorem. J. Math. Mech. 10, 619–624. Feynman, R.P. (1948). Space-time approach to nonrelativistic quantum mechanics. Rev. Mod. Phys. 20, 367–387. Finetti, B. de (1929). Sulle funzioni ad incremento aleatorio. Rend. Acc. Naz. Lincei 10, 163–168. — (1930). Fuzione caratteristica di un fenomeno aleatorio. Mem. R. Acc. Lincei (6) 4, 86–133. — (1937). La pr´evision: ses lois logiques, ses sources subjectives. Ann. Inst. H. Poincar´e 7, 1–68. — (1938). Sur la condition d’´equivalence partielle. Act. Sci. Ind. 739, 5–18. Fisk, D.L. (1965). Quasimartingales. Trans. Amer. Math. Soc. 120, 369–389. — (1966). Sample quadratic variation of continuous, second-order martingales. Z. Wahrsch. verw. Geb. 6, 273–278. Fortet, R. (1943). Les fonctions al´eatoires du type de Markoff associ´ees `a certaines ´equations lin´eaires aux d´eriv´ees partielles du type parabolique. J. Math. Pures Appl. 22, 177–243. ¨ nig, D., Arndt, U., Schmidt, V. (1981). Queues and Point Processes. Franken, P., Ko Akademie-Verlag, Berlin. ´chet, M. (1928). Les Espaces Abstraits. Gauthier-Villars, Paris. Fre Freedman, D. (1962/63). Invariants under mixing which generalize de Finetti’s theorem. Ann. Math. Statist. 33, 916–923; 34, 1194–1216. — (1971a). Markov Chains. Holden-Day, San Francisco. Repr. Springer, NY 1983. — (1971b). Brownian Motion and Diffusion. Holden-Day, San Francisco. Repr. Springer, NY 1983. Freidlin, M.I., Wentzell, A.D. (1970). On small random permutations of dynamical systems. Russian Math. Surveys 25, 1–55. — (1998). Random Perturbations of Dynamical Systems. Engl. trans., Springer, NY. (Russian orig. 1979.) Frostman, O. (1935). Potentiel d’´equilibre et capacit´e des ensembles avec quelques applications `a la th´eorie des fonctions. Medd. Lunds Univ. Mat. Sem. 3, 1-118. Fubini, G. (1907). Sugli integrali multipli. Rend. Acc. Naz. Lincei 16, 608–614. Furstenberg, H., Kesten, H. (1960). Products of random matrices. Ann. Math. Statist. 31, 457–469.


Gaifman, H. (1961). Concerning measures in first order calculi. Israel J. Math. 2, 1–18. Galmarino, A.R. (1963). Representation of an isotropic diffusion as a skew product. Z. Wahrsch. verw. Geb. 1, 359–378. Garsia, A.M. (1965). A simple proof of E. Hopf’s maximal ergodic theorem. J. Math. Mech. 14, 381–382. — (1973). Martingale Inequalities: Seminar Notes on Recent Progress. Math. Lect. Notes Ser. Benjamin, Reading, MA. Gauss, C.F. (1809/16). Theory of Motion of the Heavenly Bodies. Engl. trans., Dover, NY 1963. — (1840). Allgemeine Lehrs¨atze in Beziehung auf die im vehrkehrten Verh¨ altnisse des Quadrats der Entfernung wirkenden Anziehungs- und Abstossungs-Kr¨afte. Gauss Werke 5, 197–242. G¨ottingen 1867. Gaveau, B., Trauber, P. (1982). L’int´egrale stochastique comme op´erateur de divergence dans l’espace fonctionnel. J. Functional Anal. 46, 230–238. Getoor, R.K. (1990). Excessive Measures. Birkh¨auser, Boston. Getoor, R.K., Sharpe, M.J. (1972). Conformal martingales. Invent. Math. 16, 271– 308. Gibbs, J.W. (1902). Elementary Principles in Statistical Mechanics. Yale Univ. Press. Gihman, I.I. (1947). On a method of constructing random processes (Russian). Dokl. Akad. Nauk SSSR 58, 961–964. — (1950/51). On the theory of differential equations for random processes, I–II (Russian). Ukr. Mat. J. 2:4, 37–63; 3:3, 317–339. Girsanov, I.V. (1960). On transforming a certain class of stochastic processes by absolutely continuous substitution of measures. Th. Probab. Appl. 5, 285–301. Glivenko, V.I. (1933). Sulla determinazione empirica della leggi di probabilit`a. Giorn. Ist. Ital. Attuari 4, 92–99. Gnedenko, B.V. (1939). On the theory of limit theorems for sums of independent random variables (Russian). Izv. Akad. Nauk SSSR Ser. Mat. 181–232, 643–647. Gnedenko, B.V., Kolmogorov, A.N. (1954/68). Limit Distributions for Sums of Independent Random Variables. Engl. trans., 2nd ed., Addison-Wesley, Reading, MA. (Russian orig. 1949.) Gnedin, A.V. (1997). The representation of composition structures. Ann. Probab. 25, 1437–1450. Goldman, J.R. (1967). Stochastic point processes: limit theorems. Ann. Math. Statist. 38, 771–779. Goldstein, J.A. (1976). Semigroup-theoretic proofs of the central limit theorem and other theorems of analysis. Semigroup Forum 12, 189–206. Goldstein, S. (1979). Maximal coupling. Z. Wahrsch. verw. Geb. 46, 193–204. Grandell, J. (1976). Doubly Stochastic Poisson Processes. Lect. Notes in Math. 529. Springer, Berlin. Green, G. (1828). An essay on the application of mathematical analysis to the theories of electricity and magnetism. Repr. in Mathematical Papers, Chelsea, NY 1970. Greenwood, P., Pitman, J. (1980). Construction of local time and Poisson point processes from nested arrays. J. London Math. Soc. (2) 22, 182–192.


Griffeath, D. (1975). A maximal coupling for Markov chains. Z. Wahrsch. verw. Geb. 31, 95–106. Grigelionis, B. (1963). On the convergence of sums of random step processes to a Poisson process. Th. Probab. Appl. 8, 172–182. — (1971). On the representation of integer-valued measures by means of stochastic integrals with respect to Poisson measure. Litovsk. Mat. Sb. 11, 93–108. Haar, A. (1933). Der Maßbegriff in der Theorie der kontinuerlichen Gruppen. Ann. Math. 34, 147–169. Hagberg, J. (1973). Approximation of the summation process obtained by sampling from a finite population. Th. Probab. Appl. 18, 790–803. Hahn, H. (1921). Theorie der reellen Funktionen. Julius Springer, Berlin. ´ jek, J. (1960). Limiting distributions in simple random sampling from a finite populaHa tion. Magyar Tud. Akad. Mat. Kutat´ o Int. K¨ ozl. 5, 361–374. Hall, P., Heyde, C.C. (1980). Martingale Limit Theory and its Application. Academic Press, NY. Halmos, P.R. (1950). Measure Theory, Van Nostrand, Princeton. Repr. Springer, NY 1974. — (1985). I want to be a Mathematician, an Automathography. Springer, NY. Hardy, G.H., Littlewood, J.E. (1930). A maximal theorem with function-theoretic applications. Acta Math. 54, 81–116. Harris, T.E. (1956). The existence of stationary measures for certain Markov processes. Proc. 3rd Berkeley Symp. Math. Statist. Probab. 2, 113–124. — (1963/89). The Theory of Branching Processes. Dover Publ., NY. — (1971). Random measures and motions of point processes. Z. Wahrsch. verw. Geb. 18, 85–115. Hartman, P., Wintner, A. (1941). On the law of the iterated logarithm. J. Math. 63, 169–176. Hausdorff, F. (1916). Die M¨achtigkeit der Borelschen Mengen. Math. Ann. 77, 430–437. ¨ Helly, E. (1911/12). Uber lineare Funktionaloperatoren. Sitzungsber. Nat. Kais. Akad. Wiss. 121, 265–297. Hewitt, E., Ross, K.A. (1979). Abstract Harmonic Analysis I, 2nd ed. Springer NY. Hewitt, E., Savage, L.J. (1955). Symmetric measures on Cartesian products. Trans. Amer. Math. Soc. 80, 470–501. Hille, E. (1948). Functional analysis and semi-groups. Amer. Math. Colloq. Publ. 31, NY. Hitczenko, P. (1988). Comparison of moments for tangent sequences of random variables. Probab. Th. Rel. Fields 78, 223–230. — (1990). Best constants in martingale version of Rosenthal’s inequality. Ann. Probab. 18, 1656–1668. Hoeffding, W. (1948). A class of statistics with asymptotically normal distributions. Ann. Math. Statist. 19, 293–325. Hoeven, P.C.T. van der (1983). On Point Processes. Mathematisch Centrum, Amsterdam.


¨ ¨ lder, O. (1889). Uber Ho einen Mittelwertsatz. Nachr. Akad. Wiss. G¨ ottingen, math.phys. Kl., 38–47. Hoover, D.N. (1979). Relations on probability spaces and arrays of random variables. Preprint, Institute for Advanced Study, Princeton. Hopf, E. (1954). The general temporally discrete Markov process. J. Rat. Mech. Anal. 3, 13–45. ¨ rmander, L. (1967). Hypoelliptic second order differential equations. Acta Math. Ho 119, 147–171. Horowitz, J. (1972). Semilinear Markov processes, subordinators and renewal theory. Z. Wahrsch. verw. Geb. 24, 167–193. Hunt, G.A. (1956a). Some theorems concerning Brownian motion. Trans. Amer. Math. Soc. 81, 294–319. — (1956b). Semigroups of measures on Lie groups. Trans. Amer. Math. Soc. 81, 264–293. — (1957/58). Markoff processes and potentials, I–III. Illinois J. Math. 1, 44–93, 316–369; 2, 151–213. Hurewicz, W. (1944). Ergodic theorem without invariant measure. Ann. Math. 45, 192–206. ¨ Hurwitz, A. (1897). Uber die Erzeugung der Invarianten durch Integration. Nachr. Ges. G¨ ottingen, math.-phys. Kl., 71–90. Ikeda, N., Watanabe, S. (1989/2014). Stochastic Differential Equations and Diffusion Processes, 2nd ed. Kodansha and Elsevier. Ioffe, D. (1991). On some applicable versions of abstract large deviations theorems. Ann. Probab. 19, 1629–1639. Ionescu Tulcea, A. (1960). Contributions to information theory for abstract alphabets. Ark. Mat. 4, 235–247. Ionescu Tulcea, C.T. (1949/50). Mesures dans les espaces produits. Atti Accad. Naz. Lincei Rend. 7, 208–211. ˆ , K. (1942a). Differential equations determining Markov processes (Japanese). Zenkoku Ito Shij¯ o S¯ ugaku Danwakai 244:1077, 1352–1400. — (1942b). On stochastic processes I: Infinitely divisible laws of probability. Jap. J. Math. 18, 261–301. — (1944). Stochastic integral. Proc. Imp. Acad. Tokyo 20, 519–524. — (1946). On a stochastic integral equation. Proc. Imp. Acad. Tokyo 22, 32–35. — (1950). Stochastic differential equations in a differentiable manifold. Nagoya Math. J. 1, 35–47. — (1951a). On a formula concerning stochastic differentials. Nagoya Math. J. 3, 55–65. — (1951b). On stochastic differential equations. Mem. Amer. Math. Soc. 4, 1–51. — (1951c). Multiple Wiener integral. J. Math. Soc. Japan 3, 157–169. — (1956). Spectral type of the shift transformation of differential processes with stationary increments. Trans. Amer. Math. Soc. 81, 253–263. — (1957). Stochastic Processes (Japanese). Iwanami Shoten, Tokyo. — (1972). Poisson point processes attached to Markov processes. Proc. 6th Berkeley Symp. Math. Statist. Probab. 3, 225–239.


ˆ , K., McKean, H.P. (1965). Diffusion Processes and their Sample Paths. Repr. Ito Springer, Berlin 1996. ˆ , K., Watanabe, S. (1965). Transformation of Markov processes by multiplicative Ito functionals. Ann. Inst. Fourier 15, 15–30. Jacobsen, M. (2006). Point Process Theory and Applications. Birkh¨auser, Boston. Jacod, J. (1975). Multivariate point processes: Predictable projection, Radon-Nikodym derivative, representation of martingales. Z. Wahrsch. verw. Geb. 31, 235–253. — (1979). Calcul Stochastique et Probl`emes de Martingales. Lect. Notes in Math. 714. Springer, Berlin. — (1983). Processus `a accroissements ind´ependants: use condition n´ecessaire et suffisante de convergence en loi. Z. Wahrsch. verw. Geb. 63, 109–136. — (1984). Une g´en´eralisation des semimartingales: les processus admettant un processus `a accroissements ind´ependants tangent. Lect. Notes in Math. 1059, 91–118. Springer, Berlin. Jacod, J., Shiryaev, A.N. (1987). Limit Theorems for Stochastic Processes. Springer, Berlin. Jagers, P. (1972). On the weak convergence of superpositions of point processes. Z. Wahrsch. verw. Geb. 22, 1–7. — (1974). Aspects of random measures and point processes. Adv. Probab. Rel. Topics 3, 179–239. Marcel Dekker, NY. — (1975). Branching Processes with Biological Applications. Wiley. Jamison, B., Orey, S. (1967). Markov chains recurrent in the sense of Harris. Z. Wahrsch. verw. Geb. 8, 206–223. Jensen, J.L.W.V. (1906). Sur les fonctions convexes et les in´egalit´es entre les valeurs moyennes. Acta Math. 30, 175–193. Jessen, B. (1934). The theory of integration in a space of an infinite number of dimensions. Acta Math. 63, 249–323. ˇina, M. (1964). Branching processes with measure-valued states. In: Trans. 3rd Prague Jir Conf. Inf. Th. Statist. Dec. Func. Rand. Proc., pp. 333–357. ˇ — (1966). Asymptotic behavior of measure-valued branching processes. Rozpravy Ceskoˇ slovensk´e Akad. Vˇed. Rada Mat. Pˇrirod. Vˇed 76:3, Praha. Johnson, W.B., Schechtman, G., Zinn, J. (1985). Best constants in moment inequalities for linear combinations of independent and exchangeable random variables. Ann. Probab. 13, 234–253. Jordan, C. (1881). Sur la s´erie de Fourier. C.R. Acad. Sci. Paris 92, 228–230. Kac, M. (1947). On the notion of recurrence in discrete stochastic processes. Bull. Amer. Math. Soc. 53, 1002–1010. — (1949). On distributions of certain Wiener functionals. Trans. Amer. Math. Soc. 65, 1–13. — (1951). On some connections between probability theory and differential and integral equations. Proc. 2nd Berkeley Symp. Math. Statist. Probab., 189–215. Univ. of California Press, Berkeley. Kakutani, S. (1940). Ergodic theorems and the Markoff process with a stable distribution. Proc. Imp. Acad. Tokyo 16, 49–54.

— (1944a). On Brownian motions in n-space. Proc. Imp. Acad. Tokyo 20, 648–652. — (1944b). Two-dimensional Brownian motion and harmonic functions. Proc. Imp. Acad. Tokyo 20, 706–714. — (1945). Markoff process and the Dirichlet problem. Proc. Japan Acad. 21, 227–233. Kallenberg, O. (1973a). Characterization and convergence of random measures and point processes. Z. Wahrsch. verw. Geb. 27, 9–21. — (1973b). Canonical representations and convergence criteria for processes with interchangeable increments. Z. Wahrsch. verw. Geb. 27, 23–36. — (1974). Series of random processes without discontinuities of the second kind. Ann. Probab. 2, 729–737. — (1975a). Limits of compound and thinned point processes. J. Appl. Probab. 12, 269– 278. — (1975/76, 1983/86). Random Measures, 1st–2nd eds, 3rd–4th eds. Akademie-Verlag & Academic Press, Berlin, London. — (1976/80/81). On the structure of stationary flat processes, I–III. Z. Wahrsch. verw. Geb. 37, 157–174; 52, 127–147; 56, 239–253. — (1977). Stability of critical cluster fields. Math. Nachr. 77, 1–43. — (1978a). On conditional intensities of point processes. Z. Wahrsch. verw. Geb. 41, 205–220. — (1978b). On the asymptotic behavior of line processes and systems of non-interacting particles. Z. Wahrsch. verw. Geb. 43, 65–95. — (1978c). On the independence of velocities in a system of non-interacting particles. Ann. Probab. 6, 885–890. — (1987). Homogeneity and the strong Markov property. Ann. Probab. 15, 213–240. — (1988a). Spreading and predictable sampling in exchangeable sequences and processes. Ann. Probab. 16, 508–534. — (1988b). Some new representation in bivariate exchangeability. Probab. Th. Rel. Fields 77, 415–455. — (1989). On the representation theorem for exchangeable arrays. J. Multivar. Anal. 30, 137–154. — (1990). Random time change and an integral representation for marked stopping times. Probab. Th. Rel. Fields 86, 167–202. — (1992). Some time change representations of stable integrals, via predictable transformations of local martingales. Stoch. Proc. Appl. 40, 199–223. — (1992). Symmetries on random arrays and set-indexed processes. J. Theor. Probab. 5, 727–765. — (1995). Random arrays and functionals with multivariate rotational symmetries. Probab. Th. Rel. Fields 103, 91–141. — (1996a). On the existence of universal functional solutions to classical SDEs. Ann. Probab. 24, 196–205. — (1996b). Improved criteria for distributional convergence of point processes. Stoch. Proc. Appl. 64, 93-102. — (1997/2002). Foundations of Modern Probability, eds. 1–2. Springer, NY.

— (1999a). Ballot theorems and sojourn laws for stationary processes. Ann. Probab. 27, 2011–2019. — (1999b). Asymptotically invariant sampling and averaging from stationary-like processes. Stoch. Proc. Appl. 82, 195–204. — (1999c). Palm measure duality and conditioning in regenerative sets. Ann. Probab. 27, 945–969. — (2005). Probabilistic Symmetries and Invariance Principles. Springer. — (2007). Invariant measures and disintegrations with applications to Palm and related kernels. Probab. Th. Rel. Fields 139, 285–310, 311. — (2009). Some local approximation properties of simple point processes. Probab. Th. Rel. Fields 143, 73–96. — (2010). Commutativity properties of conditional distributions and Palm measures. Comm. Stoch. Anal. 4, 21–34. — (2011a). Invariant Palm and related disintegrations via skew factorization. Probab. Th. Rel. Fields 149, 279–301. — (2011b). Iterated Palm conditioning and some Slivnyak-type theorems for Cox and cluster processes. J. Theor. Probab. 24, 875–893. — (2017a). Tangential existence and comparison, with applications to single and multiple integration. Probab. Math. Statist. 37, 21–52. — (2017b). Random Measures, Theory and Applications. Springer. Kallenberg, O., Sztencel, R. (1991). Some dimension-free features of vector-valued martingales. Probab. Th. Rel. Fields 88, 215–247. Kallenberg, O., Szulga, J. (1989). Multiple integration with respect to Poisson and L´evy processes. Probab. Th. Rel. Fields 83, 101–134. Kallianpur, G. (1980). Stochastic Filtering Theory. Springer, NY. Kaplan, E.L. (1955). Transformations of stationary random sequences. Math. Scand. 3, 127–149. Karamata, J. (1930). Sur une mode de croissance r´eguli`ere des fonctions. Mathematica (Cluj) 4, 38–53. Karatzas, I., Shreve, S.E. (1988/91/98). Brownian Motion and Stochastic Calculus, 2nd ed. Springer, NY. Kazamaki, N. (1972). Change of time, stochastic integrals and weak martingales. Z. Wahrsch. verw. Geb. 22, 25–32. Kemeny, J.G., Snell, J.L., Knapp, A.W. (1966). Denumerable Markov Chains. Van Nostrand, Princeton. Kendall, D.G. (1948). On the generalized “birth-and-death” process. Ann. Math. Statist. 19, 1–15. — (1974). Foundations of a theory of random sets. In Stochastic Geometry (eds. E.F. Harding, D.G. Kendall), pp. 322–376. Wiley, NY. Kendall, W.S. (1988). Martingales on manifolds and harmonic maps. Geometry of random motion. Contemp. Math. 73, 121–157. Kerstan, J., Matthes, K. (1964). Station¨are zuf¨allige Punktfolgen II. Jahresber. Deutsch. Mat.-Verein. 66, 106–118.

Khinchin, A.Y. (1923). Über dyadische Brüche. Math. Z. 18, 109–116.
— (1924). Über einen Satz der Wahrscheinlichkeitsrechnung. Fund. Math. 6, 9–20.
— (1929). Über einen neuen Grenzwertsatz der Wahrscheinlichkeitsrechnung. Math. Ann. 101, 745–752.
— (1933a). Zur mathematischen Begründung der statistischen Mechanik. Z. Angew. Math. Mech. 13, 101–103.
— (1933b). Asymptotische Gesetze der Wahrscheinlichkeitsrechnung. Springer, Berlin. Repr. Chelsea, NY 1948.
— (1934). Korrelationstheorie der stationären stochastischen Prozesse. Math. Ann. 109, 604–615.
— (1937). Zur Theorie der unbeschränkt teilbaren Verteilungsgesetze. Mat. Sb. 2, 79–119.
— (1938). Limit Laws for Sums of Independent Random Variables (Russian). Moscow.
— (1955). Mathematical Methods in the Theory of Queuing (Russian). Engl. trans., Griffin, London 1960.
Khinchin, A.Y., Kolmogorov, A.N. (1925). Über Konvergenz von Reihen, deren Glieder durch den Zufall bestimmt werden. Mat. Sb. 32, 668–676.
Kingman, J.F.C. (1964). On doubly stochastic Poisson processes. Proc. Cambridge Phil. Soc. 60, 923–930.
— (1967). Completely random measures. Pac. J. Math. 21, 59–78.
— (1968). The ergodic theory of subadditive stochastic processes. J. Roy. Statist. Soc. (B) 30, 499–510.
— (1972). Regenerative Phenomena. Wiley, NY.
— (1978). The representation of partition structures. J. London Math. Soc. 18, 374–380.
— (1993). Poisson Processes. Clarendon Press, Oxford.
Kinney, J.R. (1953). Continuity properties of Markov processes. Trans. Amer. Math. Soc. 74, 280–302.
Klebaner, F.C. (2012). Introduction to Stochastic Calculus with Applications, 3rd ed. Imperial Coll. Press, London.
Knight, F.B. (1963). Random walks and a sojourn density process of Brownian motion. Trans. Amer. Math. Soc. 107, 56–86.
— (1971). A reduction of continuous, square-integrable martingales to Brownian motion. Lect. Notes in Math. 190, 19–31. Springer, Berlin.
Kolmogorov, A.N. (1928/29). Über die Summen durch den Zufall bestimmter unabhängiger Grössen. Math. Ann. 99, 309–319; 102, 484–488.
— (1929). Über das Gesetz des iterierten Logarithmus. Math. Ann. 101, 126–135.
— (1930). Sur la loi forte des grands nombres. C.R. Acad. Sci. Paris 191, 910–912.
— (1931a). Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung. Math. Ann. 104, 415–458.
— (1931b). Eine Verallgemeinerung des Laplace–Liapounoffschen Satzes. Izv. Akad. Nauk USSR, Otdel. Matem. Yestestv. Nauk 1931, 959–962.
— (1932). Sulla forma generale di un processo stocastico omogeneo (un problema di B. de Finetti). Atti Accad. Naz. Lincei Rend. (6) 15, 805–808, 866–869.

— (1933a). Über die Grenzwertsätze der Wahrscheinlichkeitsrechnung. Izv. Akad. Nauk USSR, Otdel. Matem. Yestestv. Nauk 1933, 363–372.
— (1933b). Zur Theorie der stetigen zufälligen Prozesse. Math. Ann. 108, 149–160.
— (1933c). Foundations of the Theory of Probability (German), Springer, Berlin. Engl. trans., Chelsea, NY 1956.
— (1935). Some current developments in probability theory (in Russian). Proc. 2nd All-Union Math. Congr. 1, 349–358. Akad. Nauk SSSR, Leningrad.
— (1936a). Anfangsgründe der Markoffschen Ketten mit unendlich vielen möglichen Zuständen. Mat. Sb. 1, 607–610.
— (1936b). Zur Theorie der Markoffschen Ketten. Math. Ann. 112, 155–160.
— (1937). Zur Umkehrbarkeit der statistischen Naturgesetze. Math. Ann. 113, 766–772.
— (1938). Zur Lösung einer biologischen Aufgabe. Izv. Nauchno-Issled. Inst. Mat. Mech. Tomsk. Gosud. Univ. 2, 1–6.
— (1956). On Skorohod convergence. Th. Probab. Appl. 1, 213–222.
Kolmogorov, A.N., Leontovich, M.A. (1933). Zur Berechnung der mittleren Brownschen Fläche. Physik. Z. Sowjetunion 4, 1–13.
Komlós, J., Major, P., Tusnády, G. (1975/76). An approximation of partial sums of independent r.v.'s and the sample d.f., I–II. Z. Wahrsch. verw. Geb. 32, 111–131; 34, 33–58.
König, D., Matthes, K. (1963). Verallgemeinerung der Erlangschen Formeln, I. Math. Nachr. 26, 45–56.
Koopman, B.O. (1931). Hamiltonian systems and transformations in Hilbert space. Proc. Nat. Acad. Sci. USA 17, 315–318.
Krauss, P.H. (1969). Representations of symmetric probability models. J. Symbolic Logic 34, 183–193.
Krengel, U. (1985). Ergodic Theorems. de Gruyter, Berlin.
Krickeberg, K. (1956). Convergence of martingales with a directed index set. Trans. Amer. Math. Soc. 83, 313–357.
— (1972). The Cox process. Symp. Math. 9, 151–167.
— (1974a). Invariance properties of the correlation measure of line-processes. In: Stochastic Geometry (E.F. Harding & D.G. Kendall, eds.), pp. 76–88. Wiley, London.
— (1974b). Moments of point processes. Ibid., pp. 89–113.
Krylov, N., Bogolioubov, N. (1937). La théorie générale de la mesure dans son application à l'étude des systèmes de la mécanique non linéaires. Ann. Math. 38, 65–113.
Kullback, S., Leibler, R.A. (1951). On information and sufficiency. Ann. Math. Statist. 22, 79–86.
Kunita, H. (1990). Stochastic Flows and Stochastic Differential Equations. Cambridge Univ. Press, Cambridge.
Kunita, H., Watanabe, S. (1967). On square integrable martingales. Nagoya Math. J. 30, 209–245.
Kunze, M. (2013). An Introduction to Malliavin Calculus. Lecture notes, Universität Ulm.
Kuratowski, K. (1933). Topologie I. Monografje Matem., Warsaw & Lwow.

Kurtz, T.G. (1969). Extensions of Trotter's operator semigroup approximation theorems. J. Funct. Anal. 3, 354–375.
— (1975). Semigroups of conditioned shifts and approximation of Markov processes. Ann. Probab. 3, 618–642.
Kwapień, S., Woyczyński, W.A. (1992). Random Series and Stochastic Integrals: Single and Multiple. Birkhäuser, Boston.
Lanford, O.E., Ruelle, D. (1969). Observables at infinity and states with short range correlations in statistical mechanics. Comm. Math. Phys. 13, 194–215.
Langevin, P. (1908). Sur la théorie du mouvement brownien. C.R. Acad. Sci. Paris 146, 530–533.
Laplace, P.S. de (1774). Mémoire sur la probabilité des causes par les événemens. Engl. trans. in Statistical Science 1, 359–378.
— (1809). Mémoire sur divers points d'analyse. Repr. in Oeuvres Complètes de Laplace 14, 178–214. Gauthier-Villars, Paris 1886–1912.
— (1812/20). Théorie Analytique des Probabilités, 3rd ed. Repr. in Oeuvres Complètes de Laplace 7. Gauthier-Villars, Paris 1886–1912.
Last, G., Brandt, A. (1995). Marked Point Processes on the Real Line: The Dynamic Approach. Springer, NY.
Last, G., Penrose, M.D. (2017). Lectures on the Poisson Process. Cambridge Univ. Press.
Leadbetter, M.R., Lindgren, G., Rootzén, H. (1983). Extremes and Related Properties of Random Sequences and Processes. Springer, NY.
Lebesgue, H. (1902). Intégrale, longueur, aire. Ann. Mat. Pura Appl. 7, 231–359.
— (1904). Leçons sur l'Intégration et la Recherche des Fonctions Primitives. Paris.
Le Cam, L. (1957). Convergence in distribution of stochastic processes. Univ. California Publ. Statist. 2, 207–236.
Le Gall, J.F. (1983). Applications des temps locaux aux équations différentielles stochastiques unidimensionelles. Lect. Notes in Math. 986, 15–31.
— (1991). Brownian excursions, trees and measure-valued branching processes. Ann. Probab. 19, 1399–1439.
Levi, B. (1906a). Sopra l'integrazione delle serie. Rend. Ist. Lombardo Sci. Lett. (2) 39, 775–780.
— (1906b). Sul principio de Dirichlet. Rend. Circ. Mat. Palermo 22, 293–360.
Lévy, P. (1922a). Sur le rôle de la loi de Gauss dans la théorie des erreurs. C.R. Acad. Sci. Paris 174, 855–857.
— (1922b). Sur la loi de Gauss. C.R. Acad. Sci. Paris 1682–1684.
— (1922c). Sur la détermination des lois de probabilité par leurs fonctions caractéristiques. C.R. Acad. Sci. Paris 175, 854–856.
— (1924). Théorie des erreurs. La loi de Gauss et les lois exceptionelles. Bull. Soc. Math. France 52, 49–85.
— (1925). Calcul des Probabilités. Gauthier-Villars, Paris.
— (1934/35). Sur les intégrales dont les éléments sont des variables aléatoires indépendantes. Ann. Scuola Norm. Sup. Pisa (2) 3, 337–366; 4, 217–218.

— (1935a). Propriétés asymptotiques des sommes de variables aléatoires indépendantes ou enchaînées. J. Math. Pures Appl. (8) 14, 347–402.
— (1935b). Propriétés asymptotiques des sommes de variables aléatoires enchaînées. Bull. Sci. Math. (2) 59, 84–96, 109–128.
— (1937/54). Théorie de l'Addition des Variables Aléatoires, 2nd ed. Gauthier-Villars, Paris (1st ed. 1937).
— (1939). Sur certains processus stochastiques homogènes. Comp. Math. 7, 283–339.
— (1940). Le mouvement brownien plan. Amer. J. Math. 62, 487–550.
— (1948/65). Processus Stochastiques et Mouvement Brownien, 2nd ed. Gauthier-Villars, Paris (1st ed. 1948).
Liao, M. (2018). Invariant Markov Processes under Lie Group Action. Springer Intern. Publ.
Liapounov, A.M. (1901). Nouvelle forme du théorème sur la limite des probabilités. Mem. Acad. Sci. St. Petersbourg 12, 1–24.
Liemant, A., Matthes, K., Wakolbinger, A. (1988). Equilibrium Distributions of Branching Processes. Akademie-Verlag, Berlin.
Liggett, T.M. (1985). An improved subadditive ergodic theorem. Ann. Probab. 13, 1279–1285.
Lindeberg, J.W. (1922a). Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung. Math. Zeitschr. 15, 211–225.
— (1922b). Sur la loi de Gauss. C.R. Acad. Sci. Paris 174, 1400–1402.
Lindvall, T. (1973). Weak convergence of probability measures and random functions in the function space D[0, ∞). J. Appl. Probab. 10, 109–121.
— (1977). A probabilistic proof of Blackwell's renewal theorem. Ann. Probab. 5, 482–485.
— (1992). Lectures on the Coupling Method. Wiley, NY.
Loève, M. (1977/78). Probability Theory 1–2, 4th ed. Springer, NY (eds. 1–3, 1955/63).
Łomnicki, Z., Ulam, S. (1934). Sur la théorie de la mesure dans les espaces combinatoires et son application au calcul des probabilités: I. Variables indépendantes. Fund. Math. 23, 237–278.
Lukacs, E. (1960/70). Characteristic Functions, 2nd ed. Griffin, London.
Lundberg, F. (1903). Approximate representation of the probability function. Reinsurance of collective risks (in Swedish). Ph.D. thesis, Almqvist & Wiksell, Uppsala.
Mackevičius, V. (1974). On the question of the weak convergence of random processes in the space D[0, ∞). Lithuanian Math. Trans. 14, 620–623.
Maisonneuve, B. (1974). Systèmes Régénératifs. Astérisque 15. Soc. Math. de France.
Maker, P. (1940). The ergodic theorem for a sequence of functions. Duke Math. J. 6, 27–30.
Malliavin, P. (1978). Stochastic calculus of variations and hypoelliptic operators. In: Proc. Intern. Symp. on Stoch. Diff. Equations, Kyoto 1976, pp. 195–263. Wiley 1978.
Mann, H.B., Wald, A. (1943). On stochastic limit and order relations. Ann. Math. Statist. 14, 217–226.
Marcinkiewicz, J., Zygmund, A. (1937). Sur les fonctions indépendantes. Fund. Math. 29, 60–90.

— (1938). Quelques théorèmes sur les fonctions indépendantes. Studia Math. 7, 104–120.
Markov, A.A. (1899). The law of large numbers and the method of least squares (Russian). Izv. Fiz.-Mat. Obshch. Kazan Univ. (2) 8, 110–128.
— (1906). Extension of the law of large numbers to dependent events (Russian). Bull. Soc. Phys. Math. Kazan (2) 15, 135–156.
Maruyama, G. (1954). On the transition probability functions of the Markov process. Natl. Sci. Rep. Ochanomizu Univ. 5, 10–20.
— (1955). Continuous Markov processes and stochastic equations. Rend. Circ. Mat. Palermo 4, 48–90.
Maruyama, G., Tanaka, H. (1957). Some properties of one-dimensional diffusion processes. Mem. Fac. Sci. Kyushu Univ. 11, 117–141.
Matheron, G. (1975). Random Sets and Integral Geometry. Wiley, London.
Matthes, K. (1963). Stationäre zufällige Punktfolgen, I. Jahresber. Deutsch. Math.-Verein. 66, 66–79.
Matthes, K., Kerstan, J., Mecke, J. (1974/78/82). Infinitely Divisible Point Processes. Wiley, Chichester (German/English/Russian eds.).
Matthes, K., Warmuth, W., Mecke, J. (1979). Bemerkungen zu einer Arbeit von Nguyen Xuan Xanh und Hans Zessin. Math. Nachr. 88, 117–127.
Maxwell, J.C. (1875). Theory of Heat, 4th ed. Longmans, London.
— (1878). On Boltzmann's theorem on the average distribution of energy in a system of material points. Trans. Cambridge Phil. Soc. 12, 547.
Mazurkiewicz, S. (1916). Über Borelsche Mengen. Bull. Acad. Cracovie 1916, 490–494.
McGinley, W.G., Sibson, R. (1975). Dissociated random variables. Math. Proc. Cambridge Phil. Soc. 77, 185–188.
McKean, H.P. (1969). Stochastic Integrals. Academic Press, NY.
McKean, H.P., Tanaka, H. (1961). Additive functionals of the Brownian path. Mem. Coll. Sci. Univ. Kyoto, A 33, 479–506.
McMillan, B. (1953). The basic theorems of information theory. Ann. Math. Statist. 24, 196–219.
Mecke, J. (1967). Stationäre zufällige Maße auf lokalkompakten Abelschen Gruppen. Z. Wahrsch. verw. Geb. 9, 36–58.
— (1968). Eine characteristische Eigenschaft der doppelt stochastischen Poissonschen Prozesse. Z. Wahrsch. verw. Geb. 11, 74–81.
Mehta, M.L. (1991). Random Matrices. Academic Press.
Méléard, S. (1986). Application du calcul stochastique à l'étude des processus de Markov réguliers sur [0, 1]. Stochastics 19, 41–82.
Métivier, M. (1982). Semimartingales: A Course on Stochastic Processes. de Gruyter, Berlin.
Métivier, M., Pellaumail, J. (1980). Stochastic Integration. Academic Press, NY.
Meyer, P.A. (1962). A decomposition theorem for supermartingales. Illinois J. Math. 6, 193–205.
— (1963). Decomposition of supermartingales: the uniqueness theorem. Illinois J. Math. 7, 1–17.

— (1966). Probability and Potentials. Engl. trans., Blaisdell, Waltham.
— (1967). Intégrales stochastiques, I–IV. Lect. Notes in Math. 39, 72–162. Springer, Berlin.
— (1968). Guide détaillé de la théorie générale des processus. Lect. Notes in Math. 51, 140–165. Springer.
— (1971). Démonstration simplifiée d'un théorème de Knight. Lect. Notes in Math. 191, 191–195. Springer, Berlin.
— (1976). Un cours sur les intégrales stochastiques. Lect. Notes in Math. 511, 245–398. Springer, Berlin.
— (1981). Géométrie stochastique sans larmes. Lect. Notes in Math. 850, 44–102.
— (1982). Géométrie différentielle stochastique (bis). Lect. Notes in Math. 921, 165–207.
Miles, R.E. (1969/71). Poisson flats in Euclidean spaces, I–II. Adv. Appl. Probab. 1, 211–237; 3, 1–43.
— (1974). A synopsis of ‘Poisson flats in Euclidean spaces’. In: Stochastic Geometry (E.F. Harding & D.G. Kendall, eds), pp. 202–227. Wiley, London.
Millar, P.W. (1968). Martingale integrals. Trans. Amer. Math. Soc. 133, 145–166.
Minkowski, H. (1907). Diophantische Approximationen. Teubner, Leipzig.
Mitoma, I. (1983). Tightness of probabilities on C([0, 1]; S′) and D([0, 1]; S′). Ann. Probab. 11, 989–999.
Moivre, A. de (1711). De Mensura Sortis (On the measurement of chance). Engl. trans., Int. Statist. Rev. 52 (1984), 229–262.
— (1738). The Doctrine of Chances, 2nd ed. Repr. Case and Chelsea, London and NY 1967.
Molchanov, I. (2005). Theory of Random Sets. Springer, London.
Mönch, G. (1971). Verallgemeinerung eines Satzes von A. Rényi. Studia Sci. Math. Hung. 6, 81–90.
Motoo, M., Watanabe, H. (1958). Ergodic property of recurrent diffusion process in one dimension. J. Math. Soc. Japan 10, 272–286.
Moyal, J.E. (1962). The general theory of stochastic population processes. Acta Math. 108, 1–31.
Nawrotzki, K. (1962). Ein Grenzwertsatz für homogene zufällige Punktfolgen (Verallgemeinerung eines Satzes von A. Rényi). Math. Nachr. 24, 201–217.
— (1968). Mischungseigenschaften stationärer unbegrenzt teilbarer zufälliger Maße. Math. Nachr. 38, 97–114.
Neumann, J. von (1932). Proof of the quasi-ergodic hypothesis. Proc. Natl. Acad. Sci. USA 18, 70–82.
— (1940). On rings of operators, III. Ann. Math. 41, 94–161.
Neveu, J. (1971). Mathematical Foundations of the Calculus of Probability. Holden-Day, San Francisco.
— (1975). Discrete-Parameter Martingales. North-Holland, Amsterdam.
Nguyen, X.X., Zessin, H. (1979). Ergodic theorems for spatial processes. Z. Wahrsch. verw. Geb. 48, 133–158.

Nieuwenhuis, G. (1994). Bridging the gap between a stationary point process and its Palm distribution. Statist. Nederl. 48, 37–62.
Nikodym, O.M. (1930). Sur une généralisation des intégrales de M. J. Radon. Fund. Math. 15, 131–179.
Norberg, T. (1984). Convergence and existence of random set distributions. Ann. Probab. 12, 726–732.

Novikov, A.A. (1971). On moment inequalities for stochastic integrals. Th. Probab. Appl. 16, 538–541. — (1972). On an identity for stochastic integrals. Th. Probab. Appl. 17, 717–720. Nualart, D. (1995/2006). The Malliavin Calculus and Related Topics. Springer, NY. Ocone, D. (1984). Malliavin calculus and stochastic integral representation of diffusion processes. Stochastics 12, 161–185. Olson, W.H., Uppuluri, V.R.R. (1972). Asymptotic distribution of eigenvalues of random matrices. Proc. 6th Berkely Symp. Math. Statist. Probab. 3, 615–644. Orey, S. (1959). Recurrent Markov chains. Pacific J. Math. 9, 805–827. — (1962). An ergodic theorem for Markov chains. Z. Wahrsch. verw. Geb. 1, 174–176. — (1966). F -processes. Proc. 5th Berkeley Symp. Math. Statist. Probab. 2:1, 301–313. — (1971). Limit Theorems for Markov Chain Transition Probabilities. Van Nostrand, London. Ornstein, D.S. (1969). Random walks. Trans. Amer. Math. Soc. 138, 1–60. Ornstein, L.S., Uhlenbeck, G.E. (1930). On the theory of Brownian motion. Phys. Review 36, 823–841. Ososkov, G.A. (1956). A limit theorem for flows of homogeneous events. Th. Probab. Appl. 1, 248–255. Ottaviani, G. (1939). Sulla teoria astratta del calcolo delle probabilit`a proposita dal Cantelli. Giorn. Ist. Ital. Attuari 10, 10–40. Paley, R.E.A.C. (1932). A remarkable series of orthogonal functions I. Proc. London Math. Soc. 34, 241–264. Paley, R.E.A.C., Wiener, N. (1934). Fourier transforms in the complex domain. Amer. Math. Soc. Coll. Publ. 19. Paley, R.E.A.C., Wiener, N., Zygmund, A. (1933). Notes on random functions. Math. Z. 37, 647–668. Palm, C. (1943). Intensity Variations in Telephone Traffic (German). Ericsson Technics 44, 1–189. Engl. trans., North-Holland Studies in Telecommunication 10, Elsevier 1988. Papangelou, F. (1972). Integrability of expected increments of point processes and a related random change of scale. Trans. Amer. Math. Soc. 165, 486–506. — (1974a). On the Palm probabilities of processes of points and processes of lines. In: Stochastic geometry (E.F. Harding & D.G. Kendall, eds), pp. 114–147. Wiley, London. — (1974b). The conditional intensity of general point processes and an application to line processes. Z. Wahrsch. verw. Geb. 28, 207–226. — (1976). Point processes on spaces of flats and other homogeneous spaces. Math. Proc. Cambridge Phil. Soc. 80, 297–314.

Pardoux, E., Peng, S. (1990). Adapted solution of a backward stochastic differential equation. Systems and Control Letters 14, 55–61.
Parthasarathy, K.R. (1967). Probability Measures on Metric Spaces. Academic Press, NY.
Peña, V.H. de la, Giné, E. (1999). Decoupling: from Dependence to Independence. Springer, NY.
Perkins, E. (1982). Local time and pathwise uniqueness for stochastic differential equations. Lect. Notes in Math. 920, 201–208. Springer, Berlin.
Peterson, L.D. (2001). Convergence of random measures on Polish spaces. Dissertation, Auburn Univ.
Petrov, V.V. (1995). Limit Theorems of Probability Theory. Clarendon Press, Oxford.
Phillips, H.B., Wiener, N. (1923). Nets and Dirichlet problem. J. Math. Phys. 2, 105–124.
Pitman, J.W. (1995). Exchangeable and partially exchangeable random partitions. Probab. Th. Rel. Fields 102, 145–158.
Pitt, H.R. (1942). Some generalizations of the ergodic theorem. Proc. Camb. Phil. Soc. 38, 325–343.
Poincaré, H. (1890). Sur les équations aux dérivées partielles de la physique mathématique. Amer. J. Math. 12, 211–294.
— (1899). Théorie du Potentiel Newtonien. Gauthier-Villars, Paris.
Poisson, S.D. (1837). Recherches sur la Probabilité des Jugements en Matière Criminelle et en Matière Civile, Précédées des Règles Générales du Calcul des Probabilités. Bachelier, Paris.
Pollaczek, F. (1930). Über eine Aufgabe der Wahrscheinlichkeitstheorie, I–II. Math. Z. 32, 64–100, 729–750.
Pollard, D. (1984). Convergence of Stochastic Processes. Springer, NY.
Pólya, G. (1920). Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung und das Momentenproblem. Math. Z. 8, 171–181.
— (1921). Über eine Aufgabe der Wahrscheinlichkeitsrechnung betreffend die Irrfahrt im Strassennetz. Math. Ann. 84, 149–160.
Port, S.C., Stone, C.J. (1978). Brownian Motion and Classical Potential Theory. Academic Press, NY.
Pospíšil, B. (1935/36). Sur un problème de M.M.S. Bernstein et A. Kolmogoroff. Časopis Pěst. Mat. Fys. 65, 64–76.
Prékopa, A. (1958). On secondary processes generated by a random point distribution of Poisson type. Ann. Univ. Sci. Budapest Sect. Math. 1, 153–170.
Prohorov, Y.V. (1956). Convergence of random processes and limit theorems in probability theory. Th. Probab. Appl. 1, 157–214.
— (1959). Some remarks on the strong law of large numbers. Th. Probab. Appl. 4, 204–208.
— (1961). Random measures on a compactum. Soviet Math. Dokl. 2, 539–541.
Protter, P. (1990). Stochastic Integration and Differential Equations. Springer, Berlin.
Pukhalsky, A.A. (1991). On functional principle of large deviations. In New Trends in Probability and Statistics (V. Sazonov and T. Shervashidze, eds.), 198–218. VSP Moks'las, Moscow.

Radon, J. (1913). Theorie und Anwendungen der absolut additiven Mengenfunktionen. Wien Akad. Sitzungsber. 122, 1295–1438.
Ramer, R. (1974). On nonlinear transformations of Gaussian measures. J. Functional Anal. 15, 166–187.
Rao, K.M. (1969a). On decomposition theorems of Meyer. Math. Scand. 24, 66–78.
— (1969b). Quasimartingales. Math. Scand. 24, 79–92.
Rauchenschwandtner, B. (1980). Gibbsprozesse und Papangeloukerne. Verb. wissensch. Gesellsch. Österreichs, Vienna.
Ray, D.B. (1956). Stationary Markov processes with continuous paths. Trans. Amer. Math. Soc. 82, 452–493.
— (1963). Sojourn times of a diffusion process. Illinois J. Math. 7, 615–630.
Rényi, A. (1956). A characterization of Poisson processes. Magyar Tud. Akad. Mat. Kutato Int. Közl. 1, 519–527.
— (1967). Remarks on the Poisson process. Studia Sci. Math. Hung. 2, 119–123.
Revuz, D. (1970). Mesures associées aux fonctionnelles additives de Markov, I–II. Trans. Amer. Math. Soc. 148, 501–531; Z. Wahrsch. verw. Geb. 16, 336–344.
— (1984). Markov Chains, 2nd ed. North-Holland, Amsterdam.
Revuz, D., Yor, M. (1991/94/99). Continuous Martingales and Brownian Motion, 3rd ed. Springer, Berlin.
Riesz, F. (1909a). Sur les suites de fonctions mesurables. C.R. Acad. Sci. Paris 148, 1303–1305.
— (1909b). Sur les opérations fonctionelles linéaires. C.R. Acad. Sci. Paris 149, 974–977.
— (1910). Untersuchungen über Systeme integrierbarer Funktionen. Math. Ann. 69, 449–497.
— (1926/30). Sur les fonctions subharmoniques et leur rapport à la théorie du potentiel, I–II. Acta Math. 48, 329–343; 54, 321–360.
Rogers, C.A., Shephard, G.C. (1958). Some extremal problems for convex bodies. Mathematica 5, 93–102.
Rogers, L.C.G., Williams, D. (1994/2000). Diffusions, Markov Processes, and Martingales 1–2, 2nd ed. Cambridge Univ. Press.
Rosén, B. (1964). Limit theorems for sampling from a finite population. Ark. Mat. 5, 383–424.
Rosiński, J., Woyczyński, W.A. (1986). On Itô stochastic integration with respect to p-stable motion: Inner clock, integrability of sample paths, double and multiple integrals. Ann. Probab. 14, 271–286.
Rutherford, E., Geiger, H. (1908). An electrical method of counting the number of particles from radioactive substances. Proc. Roy. Soc. A 81, 141–161.
Ryll-Nardzewski, C. (1957). On stationary sequences of random variables and the de Finetti's [sic] equivalence. Colloq. Math. 4, 149–156.
— (1961). Remarks on processes of calls. Proc. 4th Berkeley Symp. Math. Statist. Probab. 2, 455–465.
Sanov, I.N. (1957). On the probability of large deviations of random variables (Russian). Engl. trans.: Sel. Trans. Math. Statist. Probab. 1 (1961), 213–244.

Sato, K.-i. (1999/2013). Lévy Processes and Infinitely Divisible Distributions, 2nd ed. Cambridge Univ. Press.
Schilder, M. (1966). Some asymptotic formulae for Wiener integrals. Trans. Amer. Math. Soc. 125, 63–85.
Schneider, R., Weil, W. (2008). Stochastic and Integral Geometry. Springer, Berlin.
Schoenberg, I.J. (1938). Metric spaces and completely monotone functions. Ann. Math. 39, 811–841.
Schrödinger, E. (1931). Über die Umkehrung der Naturgesetze. Sitzungsber. Preuss. Akad. Wiss. Phys. Math. Kl., 144–153.
Schuppen, J.H. van, Wong, E. (1974). Transformation of local martingales under a change of law. Ann. Probab. 2, 879–888.
Schwartz, L. (1980). Semi-martingales sur des variétés et martingales conformes sur des variétés analytiques complexes. Lect. Notes in Math. 780. Springer.
— (1982). Géométrie différentielle du 2e ordre, semimartingales et équations différentielles stochastiques sur une variété différentielle. Lect. Notes in Math. 921, 1–148. Springer.
Segal, I.E. (1954). Abstract probability spaces and a theorem of Kolmogorov. Amer. J. Math. 76, 721–732.
Shannon, C.E. (1948). A mathematical theory of communication. Bell System Tech. J. 27, 379–423, 623–656.
Sharpe, M. (1988). General Theory of Markov Processes. Academic Press, Boston.
Shiryaev, A.N. (1995). Probability, 2nd ed. Springer, NY.
Shur, M.G. (1961). Continuous additive functionals of a Markov process. Dokl. Akad. Nauk SSSR 137, 800–803.
Sierpiński, W. (1928). Un théorème général sur les familles d'ensembles. Fund. Math. 12, 206–210.
Silverman, B.W. (1976). Limit theorems for dissociated random variables. Adv. Appl. Probab. 8, 806–819.
Skorohod, A.V. (1956). Limit theorems for stochastic processes. Th. Probab. Appl. 1, 261–290.
— (1957). Limit theorems for stochastic processes with independent increments. Th. Probab. Appl. 2, 122–142.
— (1961/62). Stochastic equations for diffusion processes in a bounded region, I–II. Th. Probab. Appl. 6, 264–274; 7, 3–23.
— (1965). Studies in the Theory of Random Processes. Addison-Wesley, Reading, MA. (Russian orig. 1961.)
— (1975). On a generalization of a stochastic integral. Th. Probab. Appl. 20, 219–233.
Slivnyak, I.M. (1962). Some properties of stationary flows of homogeneous random events. Th. Probab. Appl. 7, 336–341.
Slutsky, E.E. (1937). Qualche proposizione relativa alla teoria delle funzioni aleatorie. Giorn. Ist. Ital. Attuari 8, 183–199.
Smith, W.L. (1954). Asymptotic renewal theorems. Proc. Roy. Soc. Edinb. A 64, 9–48.
Snell, J.L. (1952). Application of martingale system theorems. Trans. Amer. Math. Soc. 73, 293–312.

Sova, M. (1967). Convergence d'opérations linéaires non bornées. Rev. Roumaine Math. Pures Appl. 12, 373–389.
Sparre-Andersen, E. (1953/54). On the fluctuations of sums of random variables, I–II. Math. Scand. 1, 263–285; 2, 195–223.
Sparre-Andersen, E., Jessen, B. (1948). Some limit theorems on set-functions. Danske Vid. Selsk. Mat.-Fys. Medd. 25:5 (8 pp.).
Spitzer, F. (1964). Electrostatic capacity, heat flow, and Brownian motion. Z. Wahrsch. verw. Geb. 3, 110–121.
— (1976). Principles of Random Walk, 2nd ed. Springer, NY.
Stieltjes, T.J. (1894/95). Recherches sur les fractions continues. Ann. Fac. Sci. Toulouse 8, 1–122; 9, 1–47.
Stone, C.J. (1963). Weak convergence of stochastic processes defined on a semi-infinite time interval. Proc. Amer. Math. Soc. 14, 694–696.
— (1965). A local limit theorem for multidimensional distribution functions. Ann. Math. Statist. 36, 546–551.
— (1968). On a theorem by Dobrushin. Ann. Math. Statist. 39, 1391–1401.
— (1969). On the potential operator for one-dimensional recurrent random walks. Trans. Amer. Math. Soc. 136, 427–445.
Stone, M.H. (1932). Linear transformations in Hilbert space and their applications to analysis. Amer. Math. Soc. Coll. Publ. 15.
Stout, W.F. (1974). Almost Sure Convergence. Academic Press, NY.
Strassen, V. (1964). An invariance principle for the law of the iterated logarithm. Z. Wahrsch. verw. Geb. 3, 211–226.
Stratonovich, R.L. (1966). A new representation for stochastic integrals and equations. SIAM J. Control 4, 362–371.
Stricker, C., Yor, M. (1978). Calcul stochastique dépendant d'un paramètre. Z. Wahrsch. verw. Geb. 45, 109–133.
Stroock, D.W. (1981). The Malliavin calculus and its applications to second order parabolic differential equations, I–II. Math. Systems Theory 14, 25–65, 141–171.
Stroock, D.W., Varadhan, S.R.S. (1969). Diffusion processes with continuous coefficients, I–II. Comm. Pure Appl. Math. 22, 345–400, 479–530.
— (1979). Multidimensional Diffusion Processes. Springer, Berlin.
Sucheston, L. (1983). On one-parameter proofs of almost sure convergence of multiparameter processes. Z. Wahrsch. verw. Geb. 63, 43–49.
Takács, L. (1962). Introduction to the Theory of Queues. Oxford Univ. Press, NY.
— (1967). Combinatorial Methods in the Theory of Stochastic Processes. Wiley, NY.
Tanaka, H. (1963). Note on continuous additive functionals of the 1-dimensional Brownian path. Z. Wahrsch. verw. Geb. 1, 251–257.
Tempel'man, A.A. (1972). Ergodic theorems for general dynamical systems. Trans. Moscow Math. Soc. 26, 94–132.
Thiele, T.N. (1880). On the compensation of quasi-systematic errors by the method of least squares (in Danish). Vidensk. Selsk. Skr., Naturvid. Mat. 12 (5), 381–408.
Thorisson, H. (1995). On time and cycle stationarity. Stoch. Proc. Appl. 55, 183–209.

— (1996). Transforming random elements and shifting random fields. Ann. Probab. 24, 2057–2064.
— (2000). Coupling, Stationarity, and Regeneration. Springer, NY.
Tonelli, L. (1909). Sull'integrazione per parti. Rend. Acc. Naz. Lincei (5) 18, 246–253.
Tortrat, A. (1969). Sur les mesures aléatoires dans les groupes non abéliens. Ann. Inst. H. Poincaré, Sec. B, 5, 31–47.
Trotter, H.F. (1958a). Approximation of semi-groups of operators. Pacific J. Math. 8, 887–919.
— (1958b). A property of Brownian motion paths. Illinois J. Math. 2, 425–433.
Varadarajan, V.S. (1958). Weak convergence of measures on separable metric spaces. On the convergence of probability distributions. Sankhyā 19, 15–26.
— (1963). Groups of automorphisms of Borel spaces. Trans. Amer. Math. Soc. 109, 191–220.
Varadhan, S.R.S. (1966). Asymptotic probabilities and differential equations. Comm. Pure Appl. Math. 19, 261–286.
— (1984). Large Deviations and Applications. SIAM, Philadelphia.
Ville, J. (1939). Étude Critique de la Notion du Collectif. Gauthier-Villars, Paris.
Vitali, G. (1905). Sulle funzioni integrali. Atti R. Accad. Sci. Torino 40, 753–766.
Volkonsky, V.A. (1958). Random time changes in strong Markov processes. Th. Probab. Appl. 3, 310–326.
— (1960). Additive functionals of Markov processes. Trudy Mosk. Mat. Obshc. 9, 143–189.
Wald, A. (1944). On cumulative sums of random variables. Ann. Math. Statist. 15, 283–296.
— (1946). Differentiation under the integral sign in the fundamental identity of sequential analysis. Ann. Math. Statist. 17, 493–497.
— (1947). Sequential Analysis. Wiley, NY.
Waldenfels, W.v. (1968). Characteristische Funktionale zufälliger Maße. Z. Wahrsch. verw. Geb. 10, 279–283.
Walsh, J.B. (1978). Excursions and local time. Astérisque 52–53, 159–192.
— (1984). An introduction to stochastic partial differential equations. Lect. Notes in Math. 1180, 265–439.
Wang, A.T. (1977). Generalized Itô's formula and additive functionals of Brownian motion. Z. Wahrsch. verw. Geb. 41, 153–159.
Warner, F.W. (1983). Foundations of Differential Manifolds and Lie groups. Springer, Berlin.
Watanabe, H. (1964). Potential operator of a recurrent strong Feller process in the strict sense and boundary value problem. J. Math. Soc. Japan 16, 83–95.
Watanabe, S. (1964). On discontinuous additive functionals and Lévy measures of a Markov process. Japan. J. Math. 34, 53–79.
— (1968). A limit theorem of branching processes and continuous state branching processes. J. Math. Kyoto Univ. 8, 141–167.

— (1987). Analysis of Wiener functionals (Malliavin calculus) and its applications to heat kernels. Ann. Probab. 15, 1–39.
Watson, H.W., Galton, F. (1874). On the probability of the extinction of families. J. Anthropol. Inst. Gr. Britain, Ireland 4, 138–144.
Weil, A. (1936). La mesure invariante dans les espaces de groupes et les espaces homogènes. Enseignement Math. 35, 241.
— (1940). L'intégration dans les Groupes Topologiques et ses Applications. Hermann et Cie, Paris.
Whitt, W. (2002). Stochastic-Process Limits. Springer, NY.
Whitworth, W.A. (1878). Arrangements of m things of one sort and n things of another sort under certain conditions of priority. Messenger of Math. 8, 105–14.
Wiener, N. (1923). Differential space. J. Math. Phys. 2, 131–174.
— (1938). The homogeneous chaos. Amer. J. Math. 60, 897–936.
— (1939). The ergodic theorem. Duke Math. J. 5, 1–18.
Wiener, N., Wintner, A. (1943). The discrete chaos. Amer. J. Math. 65, 279–298.
Wigner, E.P. (1955). Characteristic vectors of bordered matrices with infinite dimension. Ann. Math. 62, 548–564.
Williams, D. (1991). Probability with Martingales. Cambridge Univ. Press.
Yaglom, A.M. (1947). Certain limit theorems of the theory of branching processes. Dokl. Acad. Nauk SSSR 56, 795–798.
Yamada, T. (1973). On a comparison theorem for solutions of stochastic differential equations and its applications. J. Math. Kyoto Univ. 13, 497–512.
Yamada, T., Watanabe, S. (1971). On the uniqueness of solutions of stochastic differential equations. J. Math. Kyoto Univ. 11, 155–167.
Yoeurp, C. (1976). Décompositions des martingales locales et formules exponentielles. Lect. Notes in Math. 511, 432–480. Springer, Berlin.
Yor, M. (1978). Sur la continuité des temps locaux associée à certaines semimartingales. Astérisque 52–53, 23–36.
Yosida, K. (1948). On the differentiability and the representation of one-parameter semigroups of linear operators. J. Math. Soc. Japan 1, 15–21.
— (1949). Brownian motion on the surface of the 3-sphere. Ann. Math. Statist. 20, 292–296.
Yosida, K., Kakutani, S. (1939). Birkhoff's ergodic theorem and the maximal ergodic theorem. Proc. Imp. Acad. 15, 165–168.
Zähle, M. (1980). Ergodic properties of general Palm measures. Math. Nachr. 95, 93–106.
Zaremba, S. (1909). Sur le principe du minimum. Bull. Acad. Sci. Cracovie.
Zinn, J. (1986). Comparison of martingale difference sequences. In: Probability in Banach Spaces. Lect. Notes in Math. 1153, 453–457. Springer, Berlin.
Zygmund, A. (1951). An individual ergodic theorem for noncommutative transformations. Acta Sci. Math. (Szeged) 14, 103–110.

Indices

Authors

Abel, N.H., 26, 283, 823, 875 Adler, R.J., 866 Akcoglu, M.A., 878 Aldous, D.J., 514, 555, 576, 614, 633, 636, 651–4, 859, 874, 877–80 Alexandrov, A.D., 116, 856 Alexandrov, P., 853 Anderson, G.W., 880 André, D., 263, 584, 863, 877 Arndt, U., 898 Arzelà, C., 505–6, 509, 512, 536, 553, 840 Ascoli, G., 505–6, 509, 512, 536, 553, 840 Athreya, K.B., 863–4 Baccelli, F., 883 Bachelier, L., 301, 306, 862–3, 865, 885 Bahadur, R.R., 875 Banach, S., 86, 369, 371, 460, 831–2, 835–6, 851 Barbier, É., 877 Bartlett, M.S., 325, 867 Bass, R.F., 886 Bateman, H., 281, 332, 864, 866 Baxter, G., 255, 267, 863 Bayes, T., 878 Bell, D.R., 873 Beltrami, E., 822 Belyaev, Y.K., 881 Berbee, H.C.P., 576, 877–8 Bernoulli, J., 81–2, 93, 323, 531, 658, 731, 856 Bernstein, S.N., 195, 343, 362, 857, 860, 883 Bertoin, J., 868 Bertrand, J., 578, 584, 877 Bessel, F.W., 293, 305, 763 Bichteler, K., 393, 460, 872 Bienaymé, I.J., 102, 277, 287–8, 291, 294, 701, 856, 864 Billingsley, P., 841, 853–4, 856, 859, 874–5, 877–9, 887 Birkhoff, G.D., 557, 561, 589–90, 876 Bismut, J.M., 873, 887 Blackwell, D., 270, 863, 878 Blumenthal, R.M., 381, 389, 681, 861–2, 869, 881, 886 Bochner, S., 144, 311, 857, 866 Bogachev, V.I., 854, 887 Bogolioubov, N., 575, 877 Bohl, P., 583

Boltzmann, L., 876, 883 Borel, E., 9-10, 14–5, 35, 44–5, 51, 56–7, 61– 2, 64, 70–2, 81, 83, 92–3, 104, 112, 114, 174, 185, 197, 309, 396, 551–2, 658, 853–4, 855–6, 860 Bortkiewicz, L.v., 857 Bourbaki, N., 887 Brandt, A., 861 Breiman, L., 581–2, 696, 856, 877, 881, 885 Br´emaud, P., 883 Brown, R., 865 Brown, T.C., 338, 867 Bryc, W., 540, 876 B¨ uhlmann, H., 619, 878 Buniakowsky, V.Y., 833, 853 Burkholder, D.L., 393, 400, 449, 459, 861, 870, 872 Cameron, R.H., 417, 435, 529, 535, 871 Campbell, N.R., 709–10, 722, 725, 729–31 Cantelli, F.P., 81, 83, 92, 112, 114–5, 185, 197, 309, 396, 551–2, 583, 855–6, 860 Carath´eodory, C., 33–4, 36, 854 Carleson, L., 877 Cauchy, A.L., 20, 27–8, 86, 104, 108, 110, 160, 278, 353–4, 368, 372, 395, 398, 401, 403, 425, 445, 448, 472–3, 536, 653, 667, 683, 765, 771–3, 832–3, 853 Chacon, R.V., 590, 878 Chapman, S., 233, 235, 238, 246, 250, 367– 8, 679, 862 Chebyshev, P.L., 102, 107, 109, 450, 532, 541, 696, 856–7 Chentsov, N.N., 95, 511, 856, 874 Chernoff, H., 532, 875, 878 Choquet, G., 785, 788–9, 827, 875, 886 Chow, Y.S., 857, 860, 877 Christoffel, E.B., 801, 808, 825 Chung, K.L., 216, 258–9, 491, 582, 783, 856, 861, 863–4, 870, 877, 886 C ¸ inlar, E., 853, 856, 860 Clark, J.M.C., 393, 483, 873 Copeland, A.H., 332, 866 Courant, R., 885 Cournot, A.A., 857 Courr`ege, P., 401, 441, 870–1 Cox, D.R., 55, 321, 323, 325, 327–31, 343, 360–1, 365, 505, 521–2, 626, 659, 685, 688–90, 696, 722, 729, 867, 881 Cram´er, H., 129, 298, 311, 322, 532, 854, 856, 858, 866, 875, 877

Cs¨org¨ o, M., 874 Daley, D.J., 867, 875, 882 Dambis, K.E., 420, 871 Daniell, P.J., 161, 164, 178, 854, 859 Darling, R.W.R., 814, 887 Davidson, R., 334, 697, 867, 882 Davis, B.J., 449–50, 872, 891 Dawid, A.P., 879–80 Dawson, D.A., 544, 864, 876 Day, M.M., 851, 876 Debes, H., 855, 874, 882 Dellacherie, C., 393, 426, 460, 827, 859–62, 866, 869, 871–2, 877, 881, 886–7 Dembo, A., 876 Democritus, 865 Derman, C., 765, 884 Deuschel, J.D., 876 Dini, U., 534 Dirac, P., 19, 55 Dirichlet, P.G.L., 771, 775–7, 885 Dobrushin, R.L., 659, 694, 855, 881, 883 Doeblin, W., 157, 159, 249, 420, 857–8, 862, 864, 869, 871, 878, 883–4 D¨ohler, R., 859 Dol´eans(-Dade), C., 211, 214, 223, 393, 413, 442–3, 447, 861, 870–2 Donsker, M.D., 3, 494, 510–1, 527, 548, 873– 5 Doob, J.L., 1–2, 18, 161, 166, 170, 185, 190, 192–3, 195–8, 201–2, 204–5, 207, 211, 213, 226, 228, 303, 325, 348, 397, 427, 437, 439, 771, 776, 793– 4, 854, 856, 859, 860–2, 864, 867, 869–71, 873, 877–9, 882–3, 885–6 Dubins, L.E., 420, 871, 873 Dudley, R.M., 119, 840, 846, 853–4, 857, 874, 887 Duncan, T.E., 887 Dunford, N., 108, 207, 589, 856, 861, 878 Durrett, R., 874 Dvoretzky, A., 501, 873 Dynkin, E.B., 367, 381, 383–4, 757, 853, 860, 862, 864, 869, 877, 881, 884–6 Eagleson, G.K., 879 Egorov, D., 30 Einstein, A., 865 Elliott, R.J., 860–1 Ellis, R.L., 866 Ellis, R.S., 875 ´ Emery, M., 814, 847, 850, 887–8 Engelbert, H.J., 752, 885 Erd¨os, P., 495, 863, 873

Erlang, A.K., 866 Etheridge, A.M., 864 Ethier, S.N., 841, 869, 875, 884, 887 Faber, G., 856 Farrell, R.H., 573, 877 Fatou, P., 22–3, 82, 97, 107, 116, 203–4, 250, 272, 422, 461, 590, 601, 607–8, 714, 792, 853 Fell, J.M.G., 524, 772, 843, 875, 887 Feller, W., 3, 79, 135, 137, 139, 156, 185–6, 207, 231, 233, 262, 270, 277, 282, 292, 295, 367, 369–72, 374–6, 378– 9, 381–4, 386–90, 407, 555, 587, 598, 600–2, 604–5, 607–8, 661, 676, 677, 679, 681, 683, 735, 744, 751, 757, 759, 763–6, 768, 772–3, 817, 823, 857–8, 862–5, 868, 869, 877– 9, 884–5, 897 Fenchel, M.W., 529, 531, 547, 875 Feynman, R.P., 771–2, 885 Finetti, B. de, 299, 346, 555, 611, 613–4, 658, 858, 867, 878–9 Fischer, E.S., 472, 834 Fisk, D.L., 395, 407, 410, 749, 804–5, 870, 872 Fortet, R., 869 Fourier, J.B.J., 132, 144, 259, 312 Fraenkel, A., 846, 854 Franken, P., 883 Fr´echet, M., 853, 862 Freedman, D., 299, 625, 862, 865–6, 876, 878, 885 Freidlin, M.I., 487, 529, 546, 875 Friedrichs, K., 893 Frostman, O., 885 Fubini, G., 2, 7, 9, 25, 66, 89, 164, 853, 859 Fuchs, W.H.J., 259, 863 Furstenberg, H., 572, 877 Gaifman, H., 879 Galmarino, A.R., 422, 871 Galton, F., 287, 864 Garsia, A.M., 220, 451, 561, 861, 876 G¨artner, J., 544, 876 Gauss, C.F., 857, 864–5, 885 Gaveau, B., 873 Geiger, H., 866 Getoor, R.K., 681, 855, 862, 871, 881, 886 Gibbs, J.W., 659, 709, 725–7, 729–31, 883 Gihman, I.I., 871, 883 Gin´e, E., 868 Girsanov, I.V., 125, 395, 433, 437, 822, 871, 883

921

Hurwitz, A., 854 Ikeda, N., 870, 884, 887 Ioffe, D., 876 Ionescu Tulcea, A., 581, 877 Ionescu Tulcea, C.T., 164, 180, 859 Itˆo, K., 3, 125, 151, 295, 297, 312–3, 315, 319, 332, 346–7, 393, 395, 403, 407– 10, 426, 439, 465, 470, 481–2, 485, 503, 601, 632, 652, 658, 661, 664, 668–9, 675, 735–6, 738, 752, 759, 773, 804, 807, 809, 858, 861, 866– 9, 870–2, 880–1, 883–6 Jacobsen, M., 861 Jacod, J., 221, 337, 347, 353, 443, 449, 841, 861, 867–8, 871–2, 875, 884, 887 Jagers, P., 864, 882 Jamison, B., 878 Jensen, J.L.W.V., 81, 85–6, 166, 168, 193, 362, 459, 484, 562, 585, 855 Jessen, B., 199, 860 Jiˇrina, M., 858, 867 Johnson, W.B., 872 Jordan, C., 33, 47, 854 Kac, M., 495, 771–2, 873, 882, 885–6 Kakutani, S., 422, 561, 776, 863, 870, 876, 878, 885 Kallenberg, O., 903–4 Kallianpur, G., 866 Kaplan, E.L., 659, 709, 713, 882 Karamata, J., 140, 857 Karatzas, I., 866, 870–1, 881, 884–6 Kazamaki, N., 412, 870 Kemeny, J.G., 864 Kendall, D.G., 864, 875 Kendall, W.S., 887 Kerstan, J., 721, 858, 894, 909 Kesten, H., 572, 877 Khinchin, A.Y., 3, 109, 139, 151, 156, 309, 856–8, 866, 868–9, 872, 875–6, 881– 2 Kingman, J.F.C., 332–3, 555, 557, 570, 632, 656, 864, 866–7, 877, 880 Kinney, J.R., 379, 856, 869 Klebaner, F.C., 870 Knapp, A.W., 904 Knight, F.B., 423, 661, 675, 871, 880 Koebe, P., 774 Kolmogorov, A.N., 33, 81, 90, 95, 109, 111, 113, 151, 161, 164, 179, 185, 195, 199, 233, 235–6, 238, 246, 248, 250– 1, 256, 283, 285–6, 290–1, 319, 367– 8, 372, 511, 562, 679, 772, 841,

922 855–6, 858–60, 862–5, 867–9, 871– 4, 883–6 Koml´os, J., 874 K¨onig, D., 714, 882, 898 Koopman, B.O., 876 Korolyuk, V.S., 714, 882 Krauss, P.H., 879 Krengel, U., 877–8 Krickeberg, K., 329, 697, 867, 872, 882 Kronecker, L., 112 Krylov, N., 575, 877 Kullback, S., 875 Kunita, H., 415, 441, 445, 870–2, 878, 883 Kunze, M., 873 Kuratowski, K., 827, 846 Kurtz, T.G., 386, 841, 869, 874–5, 884, 887 Kuznetsov, S.E., 886 Kwapie´ n, S., 857, 868

Foundations of Modern Probability

Lindgren, G., 907 Lindvall, T., 862–3, 874 Lipschitz, R., 738, 754–6, 885 Littlewood, J.E., 564, 877 Lo`eve, M., 1, 95, 853–4, 856–8, 868 L  omnicki, Z., 181, 859 Lukacs, E., 858 Lundberg, F., 866 Lusin, N.N., 30, 827

Mackeviˇcius, V., 386, 869 Maisonneuve, B., 880, 894 Major, P., 906 Maker, P., 562, 877 Malliavin, P., 125, 393, 465–6, 472, 478, 480, 484, 861, 872–3 Mann, H.B., 117, 857 Marcinkiewicz, J., 113, 856, 872 Markov, A.A., 102, 234, 856, 862 Martin, W.T., 417, 435, 529, 535, 871 Lanford, O.E., 883 Maruyama, G., 694, 766, 871, 881, 884 Langevin, P., 735, 737, 865, 883 Laplace, P.S. de, 2, 125–8, 130, 142, 153, Matheron, G., 875, 886, 888 292, 323, 371, 375, 720, 774, 822, Matthes, K., 334, 517–8, 687, 702–4, 714, 721, 842, 855, 858, 867, 874–5, 881, 857, 859, 864, 885 882–3, 894, 908 Last, G., 861, 867 Maxwell, J.C., 299, 865, 883 Leadbetter, M.R., 856, 858, 877 Lebesgue, H., 7, 22, 25, 33, 35, 37, 40, 42, Mazurkiewicz, S., 853 54, 61, 82, 93, 97, 215, 275, 424, McDonald, D., 889 443, 621, 711, 736, 759, 794, 825, McGinley, W.G., 879 McKean, H.P., 682, 759, 866, 881, 884–6 853–4 McMillan, B., 581, 877 Le Cam, L., 874 Mecke, J., 690, 721, 855, 866–7, 881, 909 Leeuwenhoek, A. van, 865 Mehta, M.L., 880 Le Gall, J.F., 864, 885 M´el´eard, S., 761, 884 Legendre, A.M., 529, 531, 547, 875 M´emin, J., 443 Leibler, R.A., 875 Leontovich, M.A., 885 M´etivier, M., 871 Levi, B., 21, 853 Meyer, P.A., 2, 161, 204, 207, 211, 216, 220, 228, 336, 389, 397, 439, 442, 451, Levi-Civita, T., 819, 886 453, 664, 771, 794, 814–5, 827, 859– L´evy, P., 2–3, 79, 111, 128, 132, 137, 139, 60, 861–2, 867, 869–72, 877, 880– 144, 151, 158, 195, 197, 199, 231, 1, 886–7, 894 233, 255, 277, 283, 295, 300, 303, 307, 321, 345–7, 353–6, 358, 362, Miles, R.E., 881 365, 367, 375, 386, 391, 417, 419, Millar, P.W., 400, 870 421–2, 442, 462, 516, 611–2, 620, Minkowski, H., 7, 9, 26–7, 168, 313, 425, 629, 663, 669, 671, 683, 801, 823, 454, 563, 569, 653, 692, 700, 853 856–8, 860, 862–3, 865–7, 868–70, Mitoma, I., 874 879–80, 883 Moivre, A. de, 857, 864, 866 Lewy, H., 893 Molchanov, I., 875, 886, 888 Liao, M., 824, 887 M¨onch, G., 330, 867 Liapounov, A.M., 857 Morgan, A. de, 10 Lie, M.S., 801, 819, 823, 850, 862, 886 Motoo, M., 765, 884 Liemant, A., 882, 894 Moyal, J.E., 866 Liggett, T.M., 570, 877 Lindeberg, J.W., 125, 132, 135, 857, 869 Nash, J.F., 802

Indices

Nawrotzki, K., 858, 881 Neumann, J. von, 583, 854, 859, 876 Neveu, J., 220, 582, 860–1, 878 Newton, I., 798, 857 Ney, P., 864, 889 Nguyen, X.X., 692, 881 Nieuwenhuis, G., 882 Nikodym, O.M., 7, 33, 40, 54, 61, 163, 535, 854, 859 Norberg, T., 524, 875 Novikov, A.A., 400, 436, 870–1 Nualart, D., 866, 873 Ocone, D., 393, 483, 873 Olson, W.H., 879 Orey, S., 248, 270, 595, 598, 863, 872, 878 Ornstein, D.S., 258, 590, 863, 878 Ornstein, L.S., 297, 303, 312, 393, 465, 478– 80, 865, 883 Ososkov, G.A., 881 Ottaviani, G., 242, 862 Paley, R.E.A.C., 102, 304, 863, 865, 872 Palm, C., 4, 55, 575, 659, 685, 705–6, 709– 11, 715–6, 720–5, 729–32, 863–4, 878, 881, 882–3 Papangelou, F., 336, 659, 698, 709, 727–9, 855, 867, 882–3 Pardoux, E., 482, 873 Parseval, M.A., 312 Parthasarathy, K.R., 875 Pellaumail, J., 871 Pe˜ na, V.H. de la, 868 Peng, S., 873 Penrose, M.D., 867 Perkins, E.A., 864, 885 Peterson, L.D., 874 Petrov, V.V., 858, 875 Phillips, H.B., 885 Picard, E., 733, 735, 738 Pitman, J.W., 880 Pitt, H.R., 877 Poincar´e, H., 885 Poisson, S.D., 129, 323, 857, 866 Pollaczek, F., 863 Pollard, D., 874–5 Pollard, H., 897 P´olya, G., 857, 863 Port, S.C., 886 Pospiˇsil, B., 864 Pr´ekopa, A., 325, 867 Prohorov, Y.V., 3, 117, 487, 505, 507, 509– 10, 841–2, 857, 872–3, 874 Protter, P., 872

923

Pukhalsky, A.A., 876 Radon, J., 7, 33, 40, 51, 54, 61, 163, 535, 853–4, 859 Ramer, R., 873 Rao, K.M., 211, 458, 861, 872 Rauchenschwandtner, B., 883 Ray, D.B., 661, 675, 869, 880, 884 Regan, F., 332, 866 R´enyi, A., 854, 867, 881 R´ev´esz, P., 874 Revuz, D., 661, 677–8, 866, 870–1, 878, 881, 885 Riemann, B., 54, 273, 733, 801–2, 818–9, 821–3, 826, 849 Riesz, F., 7, 33, 51, 54, 371, 379, 391, 472, 733, 771, 796, 834, 853, 854, 856, 886 Riesz, M., 854 Rogers, C.A., 846 Rogers, L.C.G., 860–1, 870, 872, 874–5, 887 Rootz´en, H., 907 Ros´en, B., 878 Rosi´ nski, J., 358, 868 Ross, K.A., 855 Rubin, H., 117, 857 Ruelle, D., 883 Rutherford, E., 866 Ryll-Nardzewski, C., 555, 611, 613, 632, 658, 714, 855, 878, 882 Sanov, I.N., 487, 529, 548, 875 Sato, K.-i., 868 Savage, L.J., 81, 91, 856, 878 Schechtman, G., 902 Schilder, M., 487, 529, 536, 550, 552, 875–6 Schmidt, V., 898 Schmidt, W., 752, 885 Schneider, R., 875 Schoenberg, I.J., 299, 625, 865 Schr¨odinger, E., 885–6 Schuppen, J.H. van, 433, 448, 871–2 Schwartz, J.T., 589, 878 Schwartz, L., 814–5, 887 Schwarz, G., 420, 871 Schwarz, H.A., 833 Segal, I.E., 865 Shannon, C.E., 581, 877 Shappon, C.E., Sharpe, M., 855, 862, 871, 886 Shephard, G.C., 846 Shiryaev, A.N., 841, 868, 872, 875, 887 Shreve, S.E., 866, 870–1, 881, 884–6 Shur, M.G., 861

924

Sibson, R., 879 Sierpi´ nski, W., 10, 583, 853 Silverman, B.W., 879 Skorohod, A.V., 3, 101, 119, 177, 356, 426, 433, 465, 472, 487, 489–90, 493, 502, 505, 512–3, 516, 622, 662, 742, 754–5, 841, 857, 868, 873–4, 880, 883–4, 887 Slivnyak, I.M., 717, 721, 882 Slutsky, E.E., 856 Smith, W.L., 864 Smoluchovsky, M., 235, 862 Snell, J.L., 196, 860, 904 Sobolev, S., 465, Sova, M., 386, 869 Sparre-Andersen, E., 263, 267, 495, 619, 860, 863, 879 Spitzer, F., 291, 864, 886 Steinhaus, H., 835 Stieltjes, T.J., 33, 42, 215, 310, 396, 405–6, 408, 443, 854 Stone, C.J., 133, 696, 855, 857, 863, 874, 881, 886 Stone, M.H., 311, 831, 866 Stout, W.F., 857 Strassen, V., 487, 493, 550, 873, 876 Stratonovich, R.L., 395, 410, 749, 804–5, 870 Stricker, C., 121, 413, 856, 870 Stroock, D.W., 741, 743–4, 773, 873, 875–6, 884–5 Sucheston, L., 877 Sztencel, R., 455, 872 Szulga, J., 340, 867 Tak´ acs, L., 578, 864, 877 Tanaka, H., 659, 661–2, 664, 682, 753, 755, 766, 880–1, 884 Taylor, B., 132–3, 136, 156, 368, 386, 446 Teicher, H., 857, 860, 877–8 Tempel’man, A.A., 877 Thorisson, H., 576–7, 716, 863, 877–8, 882 Tonelli, L., 25, 853 Tortrat, A., 855 Trauber, P., 873 Trotter, H.F., 386, 663, 869, 880 Tusn´ ady, G., 906 Tychonov, A.N., 66, 829 Uhlenbeck, G.E., 297, 303, 312, 393, 465, 478–80, 865, 883 Ulam, S., 181, 859 Uppuluri, V.R.R., 879 Varadarajan, V.S., 508, 573, 874, 877

Foundations of Modern Probability

Varadhan, S.R.S., 534, 540, 548, 741, 773, 875–6, 884–5 Vere-Jones, D., 867, 875, 882 Vervaat, W., 875 Ville, J., 860 Vitali, G., 854 Volkonsky, V.A., 679, 682, 759, 861, 881, 883–4 Voronoy, G.F., 711 Wakolbinger, A., 908 Wald, A., 117, 255, 261, 417, 436, 857, 863, 871 Waldenfels, W. von, 867, 874 Walsh, J.B., 216, 675, 861, 880, 884 Walsh, J.L., 872 Wang, A.T., 664, 880 Warmuth, W., 909 Warner, F.W., 824 Watanabe, H., 765, 878, 884 Watanabe, S., 337, 374, 403, 415, 441, 445, 605, 735, 743–4, 747, 754, 861, 864, 867, 869–73, 884–5, 887 Watson, H.W., 287, 864 Weber, N.C., 879 Weierstrass, K., 128, 408, 446, 831 Weil, A., 64, 855 Weil, W., 875 Wentzell, A.D., 487, 529, 546, 875 Weyl, H., 583 Whitney, H., 802 Whitt, W., 874–5 Wiener, N., 125, 255, 266, 295, 297, 301, 304, 310, 312–3, 316–7, 319, 393, 465, 470, 503, 557, 564, 566, 632, 652, 658, 746, 856, 863, 865–7, 876, 880–1, 885, 911 Wigner, E.P., 879–80 Williams, D., 860–1, 870, 872, 884–5, 887 Williams, R.J., 870 Wintner, A., 493, 867, 873 Wold, H., 129, 298, 322, 858 Wong, E., 433, 448, 871–2 Woyczy´ nski, W.A., 358, 857, 868 Yaglom, A.M., 291, 864 Yamada, T., 735, 747, 754–5, 884–5 Yoerp, C., 453, 455, 872 Yor, M., 121, 413, 663, 856, 866, 870–1, 880– 1, 885 Yosida, K., 367, 372, 376, 387, 390, 561, 869, 876, 886 Yule, G.U., 277, 290–1 Yushkevicz, A.A., 381, 860, 869

Zähle, M., 717, 882 Zaremba, S., 775, 885 Zeitouni, O., 876, 889 Zermelo, E., 846, 854 Zessin, H., 692, 881 Zinn, J., 361, 868, 902 Zorn, M., 54, 576–7, 846 Zygmund, A., 102, 113, 304, 557, 565, 856, 872, 877, 911

Topics Abelian group, 26, 241, 283 absolute, -ly continuous, 24, 49 distribution, 485 function, 49–50 measure, 24 moment, 85 absor|bing, -ption boundary, 761 martingale, 203 natural, 763 state, 278, 379 accessible boundary, set, 763 random measure, 218 time/jump, 210, 214, 218 action, 68 joint, 68 measurable, 68 proper, 68 transitive, 68 adapted process, 186 random measure, 218, 221 additive functional Brownian, 682 continuous, 676, 794–7 elementary, 676 uniqueness, 679 adjacency array, 655 adjoint operator, 568, 836 a.e. —almost everywhere, 23 affine maps, 810–12 algebra, 830 allocation sequence, 618 almost everywhere, 23 invariant, 559 alternating function, 785, 789 analytic map, 421

ancestors, 290–1 announcing sequence, 208 aperiodic, 246 approximation (in/of) covariation, 407 distribution, 118 empirical distribution, 498 exchangeable sum, 624 local time, 665, 671 Lp , 29 LDP, 545 Markov chains, 388 martingale, 500 Palm measure, 714, 718 point process, 714, 718 predictable process, 441 random walk, 357, 493, 502, 516 renewal process, 496 totally inaccessible time, 213 arcsine laws Brownian motion, 307 random walks, 495 L´evy processes, 358 Arzel`a–Ascoli thm, 506, 840 a.s. —almost surely, 83 associative, 26 asymptotic|ally hitting probability, 524–5 invariant, 74, 569, 612, 692 weakly, 74 negligible, 129 atom, -ic decomposition, 44, 49 support, 525 augmented filtration, 190 averag|e, -ing, 112, 165, 691 distributional, 717 functional, 715–6 Palm measure, 716–7 strong, 691 weak, 692 avoidance function, 330, 524–5 axiom of choice, 846 backward equation, 283, 372 recursion, 705–6 tree, 706 balayage —sweeping, 776 ballot theorem, 578, 580 Banach Hahn–B thm, 832 space, 832 -Steinhaus thm, 835

BDG —Burkholder et al., 400, 449 Bernoulli distribution, 531 sequence, 93 Bernstein–L´evy inequality, 195 Bessel process, 293, 305, 763 Bienaym´e process, 287 approximation, 292 asymptotics, 291 bilinear, 86 form, 802, 848 binary expansion, 93 splitting, 289–90 tree, 291 binomial process, 323 mixed, 323, 334 Birkhoff’s ergodic thm, 561, 563 birth–death process, 289 Blumenthal’s 0−1 law, 381 Borel -Cantelli lemma, 83, 92 extended, 197 isomorphic, 14 set, σ-field, 10 space, 14 localized, 15 boundary absorbing, 761 accessible, 763 entrance, 763–4 exit, 763 reflecting, 761, 763 regular, 775–6 bounded (in) Lp , 106, 197 linear functional, 832–3 martingale, 197 operator, 835 optional time, 193 set, 15 time, 193 branching Brownian motion, 291 extinction, 287 moments, 288 process, 287 recursion, 288 tree, 291 Brownian (motion), 301, 865 additive functional, 682 arcsine laws, 307 branching, 291 bridge, 301, 424, 498

criterion, 419 drift removal, 437 embedding, 490, 498 excursion, 673 functional, 426, 431, 483 H¨older continuity, 301, 304 invariance, 424 iterated logarithm, 309 large deviations, 536 level sets, 303 manifold, 806, 821-2 martingales, 426, 490 maximum process, 306, 663 multiple integral, 430 path regularity, 301, 304 point polarity, 422 quadratic variation, 303 rate of continuity, 493 reflection, 306 scaling, inversion, 302 shifted, 435 strong Markov property, 305 time-change reduction, 420, 423, 759 transience, 422 Burkholder inequalities, 400, 449 CAF —continuous additive functional, 676 c`adl`ag —rcll Cameron–Martin space, 435, 535–6, 550 Campbell measure, 710 compound, 725 reduced, 725 canonical decomposition, 404 Feller diffusion, 385 process, 381 filtration, 239 process, 239 capacity, 783, 788 alternating, 785 Carath´eodory extension, 36 restriction, 34 Cartesian product, 11 Cauchy distribution, 160 equation, 278, 353 inequality, 28, 401, 833 in probability, 103 problem, 772–4 process, 354 sequence, 27, 104 center, -ing, 111, 173, 192, 298

central limit thm, 132 local, 133 chain rule conditional independence, 172 connection, 807–8 differentiation, 466, 468 second order, 482 integration, 23 change of measure, 432, 448 scale, 752, 757 time, 759 chaos representation, 316, 431 adapted, 318 conditional, 318 divergence operator, 472 Malliavin derivative, 470 processes, 317 Chapman–Kolmogorov eqn, 235, 238 characteristic|s, 151 exponent, 151 function, 125, 128 local, 359 measure, 148, 283 operator, 384 Chebyshev’s inequality, 102 Chentsov criterion, 95, 511 Christoffel symbols, 808, 819 clos|ed, -ure, -able martingale, 198 operator, 373–4, 466, 468, 838 optional times, 188 set, 830 weakly, 469 cluster, 291 invariant, 701–2 kernel, 701 net, 830 process, 702 coding representation, 614, 633 equivalent, 615, 634 commut|ing, -ative operation, 26 operators, 173, 370, 475–6, 566, 837 compact, -ification locally, 829 one-point, 378 relatively, 829 set, 829–30 vague, 142, 842 weak, 842 weak L1 , 108 CT,S , 840 DR+ ,S , 841

comparison moment, 361 solutions, 755–6 tail, 362 compensat|ed, -or discounted, 223 optional time, 219 random measure, 218, 221 sub-martingale, 211 complement, 9 complet|e|ly, -ion, filtration, 190 Lp , 28 metric, 832 monotone —alternating, 785 in probability, 104 σ-field, 24 space, 398 complex-valued process, 310, 409 composition, 13 compound Campbell measure, 725 optional time, 239 Poisson approximation, 148, 155, 686 distribution, 147–8 limits, 150 process, 283 condenser thm, 786 condition, -al|ly, -ing covariance, 165–6 density, 723 distribution, 167–8 entropy, information, 581 expectation, 164 i.i.d., 613 independent, 170 inner, 730 intensity, 691 invariance, 170 iterated, 174 L´evy, 620 orthogonal, 173, 837 outer, 726 probability, 167 variance, 165–6 cone condition, 775 conformal mapping, 409 connected, 829 connection, 806 Riemannian, 819 conservative semi-group, 369, 378 consistency, 846 constant velocities, 696

continu|ous, -ity additive functional, 676 mapping, 103, 117, 829 martingale component, 453 in probability, 346, 619 right, 48 set, 115 stochastically, 346 theorem, 117 time-change, 190 contract|ion, -able, 739 array, 632–3 invariance, 611, 613 operator, 165, 168, 568, 588, 835 principle, 542 sequence, 613, 616 convergence (in/of) averages, 112–3, 132–3, 138–9 CT,S , 506, 510 DR+ ,S , 513 density, 133 dichotomy, 248, 285, 603 distribution, 104, 116 Feller processes, 386 Lp , 28, 107, 198 means, 107 nets, 830 point processes, 525 probability, 102 random measures, 518 sets, 524 series, 109–11 vague, 142, 517, 841 weak, 104, 143, 517, 842 convex function, -al, 664–5, 832 map, 86, 192, 812–4 set, 566–7, 574, 691, 832, 846 convolution, 26, 89 powers, 694 coordinate map, 15, 833 representation, 849 core of operator, 374, 839 cotangent, 848 countabl|e, -y, additive, 18 first, second, 829 sub-additive, 18 counting measure, 19 coupling exchangeable arrays, 635 group, 577 L´evy processes, 356

Markov chains, 249 Palm measure, 716 shift, 576–7 Skorohod, 119 covariance, 86, 298 covariation, 398, 444, 802–3 Cox process, 323 continuity, 521–2 convergence, 688–90, 694, 696 criterion, 690, 696, 729 directing random measure, 323 existence, 328 integrability, 329 simplicity, 329 tangential, 360 uniqueness, 329 Cram´er–Wold thm, 129 critical branching, 287, 289, 701 cumulant-generating function, 531, 547 cycle stationary, 713 cylinder set, 11 (D) —sub-martingale class, 211 Daniell–Kolmogorov thm, 178 debut, 189–90 decomposition atomic, 44 covariation, 454 Doob, 192 Doob–Meyer, 211 ergodic, 69, 575 finite-variation function, 48 Hahn, 38, 49 Jordan, 47, 49 Lebesgue, 40 martingale, 455 optional time, 210 orthogonal, 837 quasi-martingale, 458 random measure, 218 semi-martingale, 453 decoupling, 261, 341 degenerate measure, 19 random element, 88 delay, 268–9, 271 dens|e, -ity, 23, 829 criterion, 698 limit thm, 133 derivative adapted process, 471–2 Itˆo integral, 482 WI-integral, 470 diagonal space, 46

dichotomy recurrence, 256, 604 regeneracy, 666 stability, 702 difference proper, 9 symmetric, 9–10 differential equation backward, 283, 372 Dol´eans, 447 forward, 372 Langevin, 737 differentiation, 42, 304 operator, 465–6 diffuse —non-atomic distribution, 159 measure, 44 random measure, 329 diffusion, 756 approximation, 292 equation, 736, 751 Feller, 384 manifold, 817 martingale, 817 rate, 814 dimension, 834 effective, 256 Dirac measure, 19 direct, -ed set, 830 sum, 837 directing random measure, 613 directly Riemann integrable, 273 direction, -al derivative, 466 flat, 697 Dirichlet problem, 775–6 discontinuity set, 382 discounted compensator, 223 map, 429 discrete approximation, 188 time, 247 topology, 829 disintegration, 60, 168 invariant, 71, 711 iterated, 62 kernel, 61 Palm, 710–1 partial, 61 dissect|ion, -ing class, 15 ring/semi-ring, 518

system, 15 dissipative, 378, 688 dissociated, 649–51 distribution, 83 function, 84, 96 distributive laws, 10 divergence, 268 covariance, 476 factorization, 475 smooth elements, 474 DLR-equation, 883 Dol´eans exponential, 223, 447 domain, 774 attraction, 139 operator, 838 dominated convergence, 22, 405 ergodic thm, 564 Donsker’s thm, 494, 511 Doob decomposition, 192 inequality, 195 -Meyer decomposition, 211 doubly stochastic Poisson —Cox drift component, 745, 752, 815 rate, 814 removal, 437, 752 dual, -ity disintegration, 723 ladder heights, 264 Palm/density, 723 Palm/Gibbs, 729 predictable projection, 216, 222 second, 832 space, 832 time/cycle, 713 dynamical system, 238 perturbed, 546 Dynkin’s characteristic operator, 384 formula, 383 effective dimension, 256 Egorov’s thm, 30 elementary additive functional, 676 function, 313 operations, 17 stochastic integral, 441 elliptic operator, 384–5, 481 embedd|ed, -ing Markov chain, 279 manifold, 802, 819–20

martingale, 498 random walk, 490 space, 833 empirical distributions, 115 endpoints, 666, 766 energy, 818 entrance boundary, 763–4 entropy, 581 conditional, 581 relative, 548 equi-continuity, 127, 506 equilibrium measure, 783–4, 787–8 equivalent coding, 634 ergodic decomposition, 69, 575 diffusion, 766 map, 560 measure, 69, 560 random element, 560 sequence, 560 strongly, 595, 598 weakly, 597 ergodic theorem continuous time, 563 discrete time, 561 Feller process, 607–8 information, 581 Markov chains, 248, 285 mean, 569, 692, 838 multi-variate, 565–6 operator, 589 random matrices, 572 random measures, 691 ratio, 590, 594, 765 sub-additive, 570 escape time, 383 evaluation map, 44, 83, 516, 839 event, 82 excessive, 791–4, 796 uniformly, 679 exchangeable array, 633 finitely, 616 increments, 619 jointly, 632 partition, 656 point process, 334 process, 619–20, 622 random measure, 334 separately, 632 sequence, 611, 613, 616 excursion, 668 Brownian, 673 endpoints, 666

height, 674 law, 668 length, 668 local time, 669 Markov process, 246 measure, 668 point process, 669 space, 668 existence Brownian motion, 301 Cox process, 328 Feller process, 379 Markov process, 236 Poisson process, 328 process, 179 random sequence, 93, 178, 180–1 SDE solution, 737 point-process transform, 328 exit boundary, 763 expectat|ion, -ed value, 85 explosion, 280–1, 740, 763 exponential distribution, 278–81, 289 Dol´eans, 223, 447 equivalence, 545 inequality, 455 martingale complex, 418 real, 435 rate, 533 super-martingale, 456 tightness, 539, 541, 549 exten|did, -sion Borel–Cantelli lemma, 197 dominated convergence, 22 filtration, 190, 359 hitting, quitting, 787 independence, 87 Itˆo integral, 482 line, halfline, 16 measure, 24, 36 Poisson process, 332, 347 probability space, 175 set function, 36 subordinator, 355 external conditioning, 726 extinction, 287 local, 702 extreme|s, 158, 306 measure, 69, 574 point, 69, 575 factorial measure, 46, 616 factorization, 67

Fatou’s lemma, 22, 107 Fell topology, 524, 843 Feller diffusion, 384, 772 branching, 292 generator, 370, 376 process, 379 canonical, 381 convergence, 386 ql-continuity, 389 properties, 369, 744 semi-group, 369 conservative, 378 Feynman–Kac formula, 772 field, 10 filling functional, 592 operator, 591 filtration, 186 augmented, 190 complete, 190 extended, 190, 359 induced, 186 right-continuous, 187 de Finetti’s thm, 613 finite, -ly additive, 36 -dimensional distributions, 84, 235 exchangeable, 616 locally, 20 variation function, 47–9 process, 397 permutation, 613 s-, σ-finite, 20 first entry, 190 hit, 203 jump, 278 maximum, 264, 619 passage, 354 fixed jump, 332, 346–7, 349 flat, 697 connection, 807–8, 820 flow, 563 fluctuation, 265 form, 802, 848 forward equation, 372 Fourier transform, 132, 259 FS —Fisk–Stratonovich integral, 410, 804–5 Fubini’s thm, 25, 89, 168, 853, 859 function, -al central limit theorem, 494, 511

continuous, linear, 832 convex, 832 large deviations, 540 law of the iterated log, 550 measurable, 13 positive, 51 representation, 18, 414, 742 simple, 16–17 solution, 747 fundamental identity, 782 martingale, 224 Galton–Watson —Bienaym´e, 287 Gaussian distribution, 132, 857, 864–5 convergence, 132–5, 137 isonormal, 300, 310 jointly, 298 Markov process, 302 process, 347, 298, 351 reduction, 428 genealogy, 290–1 generalized subordinator, 355 generat|ed, -ing cumulant, 531, 547 filtration, 186 function, 126, 288, 479 probability, 126 σ-field, 12, 16 topology, 829 generator, 368 criteria, 376 geodesic, 810–1 geometric distribution, 668 Gibbs kernel, 725–6 -Palm duality, 729 Girsanov thm, 433, 437 Glivenko–Cantelli thm, 115, 498 goodness of rate function, 539 gradient, 819, 849 graph of operator, 373, 838 optional time, 210 Green|ian domain, 778 function, 759, 779 potential, 779–80 Gronwall’s lemma, 738 group, -ing Abelian, 26, 241, 283 action, 68 Lie, 823

measurable, 26, 64 property, 87 Haar functions, 834 measure, 64 Hahn -Banach thm, 86, 832 decomposition, 38, 49 harmonic function, 421, 594, 774, 822 measure, 776, 782 minorant, 796 sub/super, 791–2 Harris recurrent, 598, 604–5 Hausdorff space, 50, 829 hazard function, 861 heat equation, 773 height excursion, 674 ladder, 264–5 Helly selection thm, 142 Hermite polynomials, 315, 479, 834 Hewitt–Savage 0−1 law, 91 Hilbert space, 833 domain, 469 martingales, 398 separable, 835 tensor product, 835 Hille–Yosida thm, 376 historical process, 290–1 hitting criterion, 703 function, 789 kernel, 774, 781 extended, 787 probability, 774, 788 time, 189, 757 H¨older continuous, 95, 301, 304, 511 inequality, 26, 168 holding time, 666 homeomorphic, 829 homogeneous chaos, 316 space, 237–41 time, 238 hyper-contraction, 622 hyper-plane, 832 idempotent, 836 identity map, 836 i.i.d. —independent, equal distribution array, 148–9, 154

inaccessible boundary, 761 time, 210 increasing process, 211 natural, 211 super-martingales, 204 increments, 96 exchangeable, 619 independent, 345 independent, 87 conditional, 170 increments, 237, 300, 332–3, 347, 619 motion, 696 pairwise, 88 index of H¨older continuity, 95 stability, 354 indicator function, 16 indistinguishable, 95 induced compensator, 222–3 connection, 810 filtration, 186, 222 Riemannian metric, 819 σ-field, 12, 16 topology, 829 infinitely divisible, 150–1, 346 convergence, 152 information, 581 conditional, 581 initial distribution, 234 invariant, 241–2 injective operator, 370 inner conditioning, 730 content, 51 continuous, 52 product, 28, 833 radius, 566, 691 regular, 52 integra|l, -ble, 21–2 approximation, 411 continuity, 404, 411 function, 21–2 increasing process, 404 Itˆo, 403 martingale, 403, 441, 451 multiple, 313, 430 process, 397 random measure, 221 semi-martingale, 404, 442 stable, 358

integral representation, 431 invariant measure, 69, 575 martingale, 427 integration by parts, 406, 444, 467 intensity, 256, 322, 567 sample, 578 interior, 829 interpolation, 588 intersection, 9 intrinsic, 801, 847 drift, diffusion, 815 invarian|t, -ce disintegration, 71, 711 distribution, 245, 248, 285, 768–9 factorization, 67 function, 559, 589, 594 kernel, 70, 237, 283 measure, 26, 37, 67, 69, 284, 558, 605 permutation, 90 principle, 496 process, 698 rotational, 37 scatter, 694 set, σ-field, 68, 559, 563, 597 smoothing, 693 sub-space, 374, 568 u-invariant, 698 unitary, 299 invers|e, -ion, 302 coding representation, 642 contraction principle, 542 formulas, 711 function, 827 local time, 671 maximum process, 354 operator, 836 i.o. —infinitely often, 82 irreducible, 247, 285 isolated, 666 isometr|ic, -y, 313 isomorphic, 834 isonormal Gaussian, 300 complex, 310 martingales, 418 isotropic, 419–20, 906, 821–2 iterated clustering, 701–2 conditioning, 174 disintegrations, 723 expectation, 89 integrals, 25, 430 optional times, 239 Itˆo

correction term, 407–8 equation, 738 excursion law/thm, 668–9 formula, 407, 446, 665 local, 409 integral, 403 extended, 482 L´evy–I representation, 346–7 J1 -topology, 841 Jensen’s inequality, 86, 168 joint, -ly action, 68 exchangeable, 632 Gaussian, 298 invariant, 70 stationary, 710–1 Jordan decomposition, 47, 49 jump accessible, 214 kernel, 279 point process, 347 kernel, 55 cluster, 701 composition, 58 conditioning, 167–8 criteria, 56 density, 200 disintegration, 61 Gibbs, 725–6 hitting, 774 invariant, 70, 237 locally σ-finite, 56 maximal, 61 Palm, 710 probability, 56 product, 58 quitting, 783 rate, 279 representation, 94 transition, 234 killing, 772 Kolmogorov backward equation, 283, 372, 772 Chapman-K eqn, 235, 238 -Chentsov criterion, 95, 511 consistency/extension thm, 179 Daniell-K thm, 178 maximum inequality, 109 three-series criterion, 111 0−1 law, 90, 199 Kronecker’s lemma, 112 ladder

distributions, 267 height, 264–5 time, 264–5 λ -system, 10 Langevin equation, 737 Laplace -Beltrami operator, 822 equation, 774 operator, 773, 822 transform, 125, 323–4 continuity, 128 uniqueness, 322 large deviation principle, 537 Brownian motion, 536 empirical distribution, 548 Rd , 534 sequences, 544 last exit, 783 return, 262 lattice distribution, 133 law arcsine, 307 excursion, 668 iterated logarithm, 309, 494, 549 large numbers,13 lcscH space, 50, 831 LDP —large deviation principle Lebesgue decomposition, 40 differentiation, 42 dominated convergence, 22 measure, 35, 37 -Stieltjes measure, 42 unit interval, 93 left-invariant, 64 Legendre–Fenchel transform, 531, 547 length, 818 level set, 303 Levi-Civita connection, 819 L´evy arcsine laws, 307 BM characterization, 419 Borel–Cantelli extension, 197 -Itˆ o representation, 346–7 -Khinchin representation, 151 measure, 151, 283, 347 process, 516 generator, 375 Lie group, 823 representation, 346–7 0−1 law extension, 199 Lie bracket, 850

derivatives, 819, 850 group, 823 life length, 289 limit, 17 Lindeberg condition, 135 line process, 696–7 linear functional, 832 map, 21 operator, 835 SDE, 737 space, 831 Lipschitz condition, 738, 754–6 L log L -condition, 288, 565 local, -ly, localize approximation, 714, 718 averaging, 133 Borel space, 15 central limit thm, 133 characteristics, 359, 814 compact, 50 conditioning, 166 coordinates, 808–9 equi-continuous, 506 extinction, 702 finite measure, 20, 51, 56 variation, 47–9 Itˆo formula, 409 invariant, 75, 696 martingale, 396 problem, 741 modulus of continuity, 506 operator, 477, 839 sequence, 396 sets, 15 sub-martingale, 211 symmetric, 359 uniform, 506, 839 local time additive functional, 680–1 approximation, 665, 671 excursion, 669 inverse, 671 positivity, 676 semi-martingale, 662 Lp convergence, 28, 107, 198 inequality, 195 Lusin’s thm, 30 Malliavin derivative, 466 adapted process, 471–2 chaos representation, 470

indicator variable, 471 WI-integral, 470 manifold, 802 Riemannian, 818 mark|ed, -ing optional time, 223, 226–7 point process, 323, 325, 331, 335 theorem, 325 Markov chain, 247, 284 convergence, 248, 285 coupling, 249 excursions, 246 invariant distribution, 248 irreducible, 247 mean recurrence time, 251, 286 occupation time, 245 period, 246 recurrent, 245, 247 positive, null, 249, 286 strong ergodicity, 249 transient, 247 Markov inequality, 102 Markov process backward equation, 283 canonical, 239 embedded M-chain, 279 existence, 236 explosion, 281 finite-dim. distr., 235 Gaussian, 302 initial distribution, 234 invariant measure, 284 property, 234 extended, 234 strong, 240, 243, 278 rate kernel, 279 semi-group, 238 space homogeneous, 237 strong M-property, 240, 243, 278 time homogeneous, 238 transition kernel, 234 martingale, 191 absorption, 203 bounded, 197 clos|ure, -ed, 198, 202 continuous, 397–8 convergence, 197–8 criterion, 194, 809 decomposition, 455 embedding, 498 fundamental, 224–5 hitting, 203 inequalities, 195–6, 400, 449 integral, 403, 441, 451

isonormal, 418 isotropic, 419–20 local, 396 manifold, 805–7 optional sampling, 193, 202 predictable, 215, 217 problem, 741 regularization, 197, 201 representation, 427 semi-, 404, 442 sub-, super-, 192 transform, 194 truncation, 443 uniformly integrable, 198 maxim|um, -al ergodic lemma, 561 inequality, 109, 195, 242, 564, 567, 582, 589 kernel, 61 measure, 39 operator, 378, 384 principle, 376, 378 process, 306, 663 upper bound, 846 mean, 85 continuity, 38 ergodic thm, 569, 838 recurrence time, 250–1, 286 -value property, 774 measurable action, 68 decomposition, 45 map, 13 function, map, 13 group, 26, 64 limit, 121 set, space, 34 measure, 18 -determining, 20 equilibrium, 783 Haar, 64 invariant, 64 Lebesgue, 35 Lebesgue–Stieltjes, 42 preserving, 558, 626 product, 25 space, 18 supporting, 710 sweeping, 782 -valued, 523, 617, 842 Mecke equation, 721 median, 111 metri|c, -zation, 829 Minkowski’s inequality, 26, 168

extended, 27 mixed binomial, 323, 326, 334, 720 i.i.d., 613 L´evy, 619–20 Poisson, 323, 326, 334, 720 solutions, 743 mixing, 595 weakly, 597 moderate growth, 361 modified Campbell measure, 725 modulus of continuity, 841 Palm measure, 716 modular function, 67 modulus of continuity, 95, 506 moment, 85, 102 absolute, 85 comparison, 361 criterion, 95, 511 density, 723–4 identity, 340 monotone class, 10 convergence, 21 limits, 19 de Morgan’s laws, 10 moving average, 311, 479 multi|ple, -variate Campbell measure, 725 ergodic thm, 565–6 integral, 430 Palm measure, 725 stochastic integral, 430 time change, 423 Wiener–Itˆ o integral, 313, 315 multiplicative functional, 772 musical isomorphism, 849 natural absorption, 763 compensator, 223 embedding, 833 increasing process, 211, 216 scale, 757 nearly continuous, 30 super-additive, 65 uniform, 30 neighborhood, 829 net, 830 nonarithmetic, 270 decreasing, 21

diagonal, 46 interacting, 696 lattice, 133, 694 negative definite, 86 increments, 96 simgular, 694 topological, 522 transitive, 69 norm, 27, 831, 833 inequality, 400, 449 relation, 220 topology, 833, 836 normal Gaussian distribution, 132 space, 829 normalizing function, 68 nowhere dense, 666, 829 null array, 129–30, 148, 686 convergence, 157 limits, 156 recurrent, 249, 286, 607 set, 23–4, 43 space, 836 occupation density, 664, 779 measure, 256, 264, 268, 270, 664 time, 245 ONB —ortho-normal basis, 834 onedimensional criteria, 129, 152, 330 sided bound, 197, 268 operator|s adjoint, 836 self-, 836 bounded, 835 clos|ure, -ed, 838 commuting, 837 contraction, 835 core, 839 domain, 838 ergodic thm, 589, 838 graph, 838 idempotent, 836 identity, 836 inverse, 836 local, 839 null space, 836 orthogonal, 837 complement, 83 conditionally, 837 projection, 837

range, 836 unitary, 837 optional equi-continuity, 514 evaluation, 189 projection, 382 sampling, 193, 202 skipping, 618 stopping, 194, 405 time, 186 iterated, 239 weakly, 187 orbit, 68 measure, 68 order statistics, 323 Ornstein–Uhlenbeck generator, 480 process, 303, 312, 478, 737 semi-group, 479 orthogonal, 833 complement, 833, 837 conditionally, 837 decomposition, 837 functions, spaces, 28, 837 martingales, 423 measures, 24 optional times, 226 projection, 29, 837 ortho-normal, 834 basis, ONB, 834 outer conditioning, 726, 730 measure, 33 regular, 52 p -function, 666 -thinning, 323 paintbox representation, 656 Palm approximation, 714, 718 averaging, 716–7 conditioning, 730 -density duality, 723 disintegration, 710–1 distribution, 710 -Gibbs duality, 729 invariant, 711 inversion, 711 iteration, 723 kernel, 710 measure, 710–1 modified, 716 recursion, 705 reduced, 725

tree, 706 uniqueness, 711 Papangelou kernel, 727–8 invariance, 729 parallelogram identity, 28 parameter dependence, 413 partial disintegration, 61 order, 829 particle system, 694 partition symmetric, 656 of unity, 51 path regularity, 304 -wise uniqueness, 737, 754 perfect additive functional, 676 set, 666 period, 246 permutation, 90 invariant, 90 persistence, 702 perturbed dynamical system, 546 π-system, 10 Picard iteration, 738 point mass, atom, 19 measure, 45 polarity, 422 point process, 322, 686 approximation, 714, 718 marked, 323 simple, 322 Poisson approximation, 687 binomial property, 326 compound, 147, 283 approximation, 148, 155 limits, 150 convergence, 130, 338, 686–7 criterion, 281, 331–4, 721 distribution, 129 excursion process, 669 existence, 328 extended, 332, 337 holding times, 281–2 integral, 339 double, 340 mapping, marking, 325 mixed, 326, 334 moments, 340 process, 281, 323 pseudo-, 282

reduction, 335–6, 338, 428 stationary, 335 polar set, 422, 782 polarization, 440 Polish space, 14, 829 polynomial chaos, 316 portmanteau thm, 116 positive conditional expectation, 165 contraction operator, 368 functional, 51 maximum principle, 376, 378, 384 operator, 588 recurrent, 249, 286, 607, 769 terms, 108, 110, 134, 155 variation, 47 potential, 370, 600 additive functional, 676, 796 bounded, continuous, 601 Green, 779–80 operator, 370 Revuz measure, 678 term, 772 predict|ion, -able covariation, 440 invariance, 626 mapping, 227, 335 martingale, 215, 217 process, 209, 221 projection, 216 quadratic variation, 440, 500 random measure, 218, 221–2 restriction, 217 sampling, 617 sequence, 192, 616 set, σ-field, 209 step process, 194, 397 time, 208–9 pre-separating, 845 preservation (of) measure, 615, 626 semi-martingales, 434, 449 stochastic integrals, 434 probability generating function, 126 kernel, 56 measure, space, 82 process binomial, 323 contractable, 619–20 Cox, 323 exchangeable, 619–20 point, 322 Poisson, 323

regenerative, 666 renewal, 268 rotatable, 625 product Cartesian, 11 kernel, 58 measure, 25, 88 moment, 226 operator, 836 σ-field, 11 topology, 103, 829 progressive process, 188, 413 set, σ-field, 188 Prohorov’s thm, 507 project|ive, -ion dual, 216, 222 group, 68 limit, 178, 845 manifold, 805–6, 817 measures, 62 optional, 382 orthogonal, 29, 837 predictable, 216, 222 sequence, 845 set, 827 proper action, 68 difference, 9 function, 96 pseudo-Poisson, 282, 368 pull-back, 802, 849 pull-out property, 165 pure|ly atomic, 48–9 discontinuous, 452 jump-type, 277 push-forward, 802, 849 quadratic variation, 303, 398, 444, 500, 818 quasi-left (ql) continuous Feller process, 389 filtration, 220 random measure, 218, 335 quasi-martingale, 458 queuing, 882–3 quitting kernel, 783 extended, 787 Radon measure, 51 -Nikodym thm, 40 random array, 631

element, 83 matrix, 572 measure, 83, 221, 322, 516, 686 partition, 656 process, 83 sequence, 83, 103, 118 series, 108–11 set, 83, 691, 788–9 time, 186 variable, vector, 83 walk, 91, 255, 490 randomization, 117, 177, 323 uniform, 323 range, 836 rate drift, diffusion, 735, 814 function, 279, 532 good, 539 semi-continuous, 538 kernel, 279 process, 759, 821 ratio ergodic thm, 590, 594, 765 Ray–Knight theorem, 675 rcll —right-continuous, left limits, 201, 840 recurren|t, -ce class, 247 dichotomy, 256, 604 Harris, 598 null, positive, 249, 286, 607, 769 process, 256, 285, 766 random walk, 256–61 state, 245 time, 251, 286 recursion, 314 reduced Campbell measure, 725 Palm measure, 725 reflec|ting, -tion, -sive boundary, 761, 763 fast, slow, 761 Brownian motion, 306 principle, 263 space, 833 regenerative phenomenon, 666 process, 666 set, 666 regular|ity, -ization boundary, 775 conditional distribution, 167–8 diffusion, 757 domain, 775–6 Feller process, 379, 598 local time, 663

martingale, sub-, 197, 201 measure, outer measure, 29 point process, 728 rate function, 538 stochastic flow, 738 relative, -ly compact, 507, 829 entropy, 547–8 topology, 829 renewal equation, 273 measure, 268 process, 268 delayed, 268 stationary, 268–9 two-sided, 270 theorem, 270–1 replication, 94 representation coding, 633 infinitely divisible, 151 kernel, 94 limit, 120 resolvent, 370, 380 equation, 601 restriction of measure, 20 optional time, 210 outer measure, 34 Revuz measure, 677 potential, 678, 796 Riemann, -ian connection, 819 integrable, 273 manifold, metric, 818, 849 quadratic variation, 818 Riesz decomposition, 796 -Fischer thm, 834 representation, 51, 834 rightcontinuous filtration, 187 function, 48–50 process, 201 invariant, 64, 823 rigid motion, 37 ring, 518 rotat|ion, -able array, 652 invariant, 37 sequence, process, 625 totally, 654

s-finite, 66 sampl|e, -ing intensity, 578 process, 624 sequence, 612 without replacement, 616, 624 scal|e, -ing, 302 Brownian, 302 function, 757 limit, 139 operator, 673 SDE, 746 scatter dissipative, 688 invariant, 694 Schwarz’ inequality, 28 SDE —stochastic differential eqn, 735 comparison, 755–6 existence, 747 explosion, 740 functional solution, 747 linear equation, 737 pathwise uniqueness, 737, 747, 754 strong solution, 737–738 transformation, 745–6, 752 uniqueness in law, 737, 747, 752 weak solution, 737, 741–2, 752 second countable, 50 sections, 25, 827–8 selection, 827–8 self -adjoint, 165, 836 -similar, 354 semicontinuous, 791 group, 368, 563 martingale, 359, 404, 442, 802 criterion, 459 decomposition 453 integral, 442, 804 local time, 662 non-random, 459 special, 442 ring, 518 separable, 829, 835 separat|ely, -ing class, 87, 845 exchangeable, 633 points, 830 sequentially compact, 830 series (of) measures, 19 random variables, 108–11 shell

measurable, 651 σ-field, 636 shift, -ed, 68 coupling, 716 measure, 26 operator, 239, 381, 558, 690 σ-field, 10 induced, 10 product, 11 σ-finite, 20 signed measure, 38, 49 simple function, 16–7 point measure, 45 point process, 322 random walk, 262 singular, -ity function, 49–50 measures, 24, 38, 49, 578 point, 421 set, 753 skewfactorization, 71 product, 422 Skorohod convergence, 512–3, 840 coupling, 119 extended, 177 embedding, 490, 498 integral, 472 topology, 512, 622, 841 slow reflection, 761 variation, 139–40 smooth, -ing density, 484 function, 466 operation, 693 random measure, 693 sojourn, 619 spa|ce, -ial approximation, 665 Borel, 14 filling, 567 homogeneous, 237, 241 measurable, 34 Polish, 14 probability, 82 product, 11 regularity, 663 space-time function, 594 invariant, 594–5 spanning criterion, 697

special semi-martingale, 442 spectral measure, 311 representation, 311 speed measure, 759 spherical symmetry, 299 spreadable, 613 stab|le, -ility cluster kernel, 702–4, 706 index, 354 integral, 358 process, 354 standard differentiation basis, 718 extension, 359, 420 stationar|y, -ity cycle, 713 increments, 619 joint, 710–1 process, 241–2, 558, 563 random measure, 710 renewal process, 268 strong, 616–7 u -, 698 statistical mechanics, 876, 883 Stieltjes integral, 404 measure, 42 stochastic continuity, 619 differential equation, 735 geometry, 801 dynamical system, 238 equation, 177 flow, 738 geometry, 696–8 integrator, 459 process, 83 Stone–Weierstrass thm, 831 stopping time —optional time, 186 Stratonovich integral, 410, 804–5 strict comparison, 756 past, 208 strong, -ly continuous, 369, 372 convergence, 836 equivalence, 689 ergodic, 249, 595, 598, 768 existence, 737 homogeneous, 243 law of large numbers, 113 Markov, 240, 278, 305, 353, 381, 744

orthogonal, 423 solution, 738 stationary, 616 subadditive, 570 ergodic thm, 570 sequence, 570 set function, 18 critical, 287 harmonic, 791 manifold, 816–7 martingale, 192 sequence criterion, 102 invariance, 613 set, space, 13, 83, 116, 831 subordinator, 346–7 extended, 355 substitution, 23, 407, 446 sum of measures, 19 supercritical, 287 harmonic, 791–2 martingale, 192, 380 position, 686–7 support, -ing additive functional, 680 affine function, 86 function, 662 hyper-plane, 832 local time, 662 measure, 598, 710 set, 20 survival, 291 sweeping kernel, 776 problem, 782 symmetr|y, -ic difference, 9 functions, 641 point process, 334 random measure, 334 variable, 91 set, 90 spherical, 299 stable, 358 terms, 110, 134, 155 symmetrization, 111, 260 system dissecting, 15 π -, λ -, 10 tail

comparison, 362 probability, 85, 102, 126 rate, 530 σ -field, 90, 199, 595 Tanaka’s formula, 662 tangent, -ial comparison, 361, 364 existence, 359–60 processes, 359 vector, space, 847 Taylor expansion, 132, 136 tensor product, 835, 848 terminal time, 380 tetrahedral, 430, 632 thinning, 291, 323 convergence, 521, 690 Cox criterion, 690 uniqueness, 329 three-seies criterion, 111 tightness (in) exponential, 539 CT,S , 509, 511 DR+ ,S , 512, 514, 523 MS , 517 Rd , 105–6, 127, 143 relative compactness, 143, 507 time accessible, 210 homogeneous, 238 optional, 186, 219 weakly, 187 predictable, 208–10, 216 random, 186 reversal, 786, 793 scales, 500 totally inaccessible, 210, 213 time change, 190 diffusion, 759 filtration, 190 integral, 412 martingale, 193, 420, 423 point process, 336, 338 stable integral, 358 topolog|y, -ical, 829 group, 64 locally uniform, 839 norm, 833 Skorohod, 841 uniform, 839 vague, 842 weak, 842 total, -ly inaccessible, 210, 213 variation, 47

tower property, 165, 838 transfer random element, 176 regularity, 96 solution, 747 transform, -ation Cox, 323 drift, 745 Laplace, 125, 323 point process, 323 transient process, 256, 285, 422 state, 245 uniformly, 604 transition density, 777–8 function, 284 kernel, 234, 279 matrix, 247 operator, 367, 588 semi-group, 238 transitive, 68 translation, 26 tree branching, 291 Palm, 706 trivial σ-field, 88, 90 truncation of martingale, 443 two-sided extension, 559 Tychonov’s thm, 66, 829 ult. —ultimately, 82 uncorrelated, 86 uniform, -ly approximation, 128 convergence, 242, 620–1 distribution, 93 excessive, 679 integrable, 106, 166, 198, 201, 401, 436 laws, 308 locally, 839 randomization, 323 transient, 604 uniqueness (of) additive functional, 679 characteristic function, 128 Cox, thinning, 329 diffuse random measure, 330 distribution function, 84 generator, 371 Laplace transform, 128, 322 in law, 737, 752, 773–4 measure, 20 Palm measure, 711

pathwise, 737 point process, 330 unitary, 837–8 invariance, 299 universal, -ly adapted, 746 completion, 746, 827 measurable, 189–90, 827, 843 upcrossing, 196 urn sequence, 616 vague, -ly compact, 142, 842 convergent, 517, 520, 841 tight, 517 topology, 142, 517, 841–2 variance, 86 criterion, 109, 113, 135 variation positive, 47 total, 47 vector field, 848 space, 831 velocity, 696 version of process, 94 Voronoi cell, 711 Wald equations, 261 identity, 436 weak, -ly compact, 842 continuous, 742 convergent, 517, 520, 842 ergodic, 595 existence, 737, 742, 752 L1 -compact, 108 large deviations, 539 law of large numbers, 138 mixing, 597 optional, 187 solution, 737, 741–2, 752 tight, 517 topology, 517, 833, 842 Weierstrass approximation, 128, 408, 446 Stone-W thm, 831 weight function, 692 well posed, 741 WI —Wiener–Itˆo, 313 Wiener chaos, 316–8 -Hopf factorization, 266 integral, 310

multivariate ergodic thm, 566 process —Brownian motion, 301 Wiener–Itˆo integral, 313, 430 derivative, 470 Yosida approximation, 372, 387 Yule process, 289–90 tree, 291 zero–infinity law, 690–1 zero–one law absorption, 243 Blumenthal, 381 Hewitt–Savage, 91 Kolmogorov, 90 ZF —Zermelo–Fraenkel, 846 Zorn’s lemma, 846

Symbols A, 370, 676, 741 A∗, 836 An, Aht, 589, 600 Aλ, 372 A × B, 11 Ac, A \ B, AΔB, 9–10 Ao, Ā, ∂A, 829 1A, 1Aμ, 16, 20 A, Aμ, 24, 82 A ⊗ B, 11 (a, b), 741 a(Xt), 817 α, αj, 279, 618 α ⊗ β, 802, 848 α, X, α, u, 748, 804 Bxε, 14 B, BS, 10 BX, BX,Y, 835 b[X], 803 ĈS, Ĉ, 104, 378 C0, C0∞, 369, 375 Ck, C1,2, 407, 773 CK+, 142 CT,S, 839 CKD, 783 Cov(ξ, η), Cov(ξ; A), 86, 157 CovF(·) = Cov(·|F), 166 c, 279 D, Dh, 774, 668

(D), 211 D[0,1], DR+,S, 622, 840–1 Dp, D1,p, 466–8 D∗, D∗, 472 Δ, 67, 192, 211, 375, 378, 490 Δn, 430 ΔU, ΔU1,...,Un, 785 ∇, ∇n, 806–8, 375 δx, 19 ∂i, ∂²ij, 407 =ᵈ, →ᵈ, 84, 104

E, 85, 322 ¯ 677 E, Ex , Eμ , 239 E[ξ; A], 85 E[ξ|F] = E F ξ, 164 ¯ E(··), E(··), 710, 716 E, En , 313, 402 E(X), 435, 447 F , Fˆn , 84, 96–7, 115 F± , Δba F , F ba , 47 Fa , Fs , F  , 42–3 Fc , Fd , 48 F ∗ μ, 273 Fk , Φk , 697 F, 186, 524 ¯ 24, 190, 827 F μ , F, F + , 187 Fτ , Fτ − , 186, 208 F∞ , F−∞ , 199 r FD , FD , 781 F ⊗ G, 11 F ∨ G, n Fn , 87 F⊥⊥G, F⊥⊥G H, 87, 170 f −1 , f± , 12, 22 fˆ = grad f , 849 f · μ, f · A, 23, 676 μf , μ ◦ f −1 , 20–1 f ◦ g, f ∗ g, 14, 26 f ⊗ g, 312 f ⊗d , fˆ⊗d , 259 f, g, f ⊥ g, 28, 833 df , df, dg, 802, 848–9 U ≺ f ≺ V , 51 f fd →, →, 843, 506 ϕs , ϕB, 68, 524 ϕ ◦ u, ϕ∗ α, ϕ∗2 b, 802, 849 G, GD , 725, 779 G, 524, 843 ga,b , g D , 759, 779 Γ, Γkij , 725, 808

γKD, 783 ⊥G, ⊥⊥G, 170, 173

H, 833 H1 , H∞ , 535, 546 H1 ⊗ H2 , H ⊗n , 312, 835 D,D  D , HK , 781, 787 HK ha,b , 757 H(·), H(·|·), 548, 581 I, I(B), |I|, 35, 368, 537–8, 836 In f = ζ n f , 312 I(ξ), I(ξ|F), 581 Iˆ+ , Iξ , 322, 560, 691 ι, 68 Jn ξ, 317, 470 K, KT , 524, 839, 843 r K D , KD , 781 κy , κh , 245, 668 L, Lxt , 480, 662, 669, 681 D,D  LD , 783, 787 K , LK p L , 26 ˆ L(X), L(X), 403–4, 412, 451 L2 (ζ), L2 (M ), 316, 441 L, Lu , 480, 850 L(·), L(·|·), L(··), 83 l(u), l2 , 668, 834 Λ, Λ∗ , 531, 547 λ, 34–5 M , M f , 564, 589 [M ], [M, N ], 398, 500 M q , M a , 455 M τ , M f , 194, 383, 741 M ∗ , Mt∗ , 195 M , M, N , 440, 500 M (λ, ϕ), 720 M, M0 , 451 M2 , M20 , M2loc , 398, 440 MS , 44, 322, 841 ˆ S , 44, 547, 741 McS , M m =, 449, 807 μ∗ , μ, 45, 248 μ∗n , μ(n) , 46, 494 μt , μs,t , 234, 238 μd , μa , 48 μ12 , μ23|1 , μ3|12 , μ3|2|1 , 62 μD K , 783 μf , μ ◦ f −1 , 20–1 f · μ, 1B μ, μf , 20, 23, 664 μ ∗ ν, 26 μν, μ ⊗ ν, 25, 58

μ ⊥ ν, μ ≪ ν, μ ∼ ν, 24, 40, 435 μ ∨ ν, μ ∧ ν, 39

N, Ñd, N̂d, N̂(d), 11, 632 NA, 836 N(m, σ²), 132 NS, NS∗, N̂S, 45–6 Nμ, 24 ν, 151, 346, 668, 701, 759 ν±, νd, νa, 38, 48 νt, νa,b, 234, 491 νA, 677 ω, Ω, Ω̂, 82, 175

Tt, T̂t, Ttλ, 368, 372, 378 Tx, T∗x, 848 TS, TS∗, TS∗2, 802, 848 tr(·), 476, 821 τA, τB, 189, 210 τa, τa,b, 757 [τ], 210 θ, θn, θt, 192, 611, 381 θ̃, θr, θ̃r, 64, 239, 588, 710 U, Uh, Ua, 600, 604 Uα, UA, UAα, 676, 678 U(0, 1), U(0, 2π), 93, 307 U, V, u|v, 423, 818

P, P̄, 82, 793 P ≪ Q, P ∼ Q, 432, 435 Px, Pν, PT, 239, 379, 741, 806 P F, PF, (PF)G, 167, 174 P(·|·), P(··), 167, 710 P(·|··), P(··|·), 722 P ◦ ξ−1, 83 P(A|F) = P FA, 167 Pn, 46, 616 pa,b, 757 pnij, ptij, 247, 284 pt, pDt, 777 Lp, ‖·‖p, 26 →ᴾ, −→ᵘᴾ, 102, 607, 522

π, πs, πM, 68, 697, 837 πB, πf, πt, 44, 142, 839 ψ, 126, 340 Q, Q+, 142, 191 Qx, Qξ, 603, 729 R, R+, R̄, R̄+, Rd, 10, 16, 151 Rλ, RA, 370, 836 rij, r(B), 247, 566 ˆ S n, S (n), 378, 46 S, Sn, Sr, 564, 589, 593, 673 ˆ (S, S, S), 10, 15 S, SH, 466, 474, 802 Ŝμ, Ŝh, Ŝξ, Ŝϕ, 115, 518, 524, 845 S/T, 13 (Σ), 728 σn, σn−, 264 σ(·), (σ, b), 10, 16, 736

→ˢ, −→ˢᵈ, 512, 840–1 supp μ, 20 T, T∗, 367, 568 T̄, T, 186, 835

→ᵘ, −→ᵘˡ, −→ᵘˡᵈ, −→ᵘᴾ, 522 ∼ᵘᵈ, ≈ᵘ, 687, 718 V · X, 194, 403, 441–2, 451 Var(·), Var(·; A), 86, 111 VarF(·) = Var(·|F), 165–6

→ᵛ, −→ᵛᵈ, −→ᵛʷ, 142, 518, 841 ∼ᵛᵈ, 686 w, wf, 95, 493, 506, 840 w̃(f, t, h), 512, 841

→ʷ, −→ʷᵈ, 520, 842 X̂, 211, 805–6, 817 Xc, Xd, 453 Xτ, 194 X∗, Xt∗, X∗∗, 195, 832 X ◦ dY, 410 X ◦ V−1, 626 [X], [X, Y], b[X], 406, 444, 803 α, X, 804 BX, BX,Y, 836 x, x, y, x ⊥ y, 831, 833–4 Ξ, 666 ξ, ξp, 107, 669 ξ̂, ξt, ξ̂t, 669, 221–2 ξc, ξq, ξa, 218 ξ2, ξη, 339 ξ̄, ξ̄n, 530, 578, 580, 691 ‖ξ‖1,p, ‖ξ, η‖m,2, 469 Dξ, 466 χn, 702 Z, Z+, Z̄+, 16, 44, 559 ζ, ζn, 223, 300, 310–3, 380 ζD, 408, 775 ∅, 10 1A, 1{·}, 16, 82 2S, 2∞, 10, 15

∼, ≈, 718 < , , ∼ =, 95, 400, 833 ⊥, ⊥⊥, 833, 87 ⊥G, ⊥⊥G, 170, 173 ⊕, , 316 ⊗, 11, 25, 58 ∨, ∧, 39 ‖·‖, ‖·‖p, 26, 248, 369 [[, [, (, 766
