Probability theory and stochastic processes 9783030401825, 9783030401832

English Pages 717 Year 2020

Table of contents:
Contents
Introduction
PART THREE: Advanced topics
Acknowledgements
I: PROBABILITY THEORY
1.1 Sample Space, Events and Probability
The Language of Probabilists
The σ-field of Events
1.1.2 Probability of Events
Basic Formulas
Negligible Sets
1.2.1 Independent Events
1.2.2 Bayes’ Calculus
1.2.3 Conditional Independence
1.3.1 Probability Distributions and Expectation
Expectation for Discrete Random Variables
Basic Properties of Expectation
Independent Variables
The Product Formula for Expectations
The Binomial Distribution
The Geometric Distribution
The Poisson Distribution
The Multinomial Distribution
Random Graphs
1.3.3 Conditional Expectation
1.4.1 Generating Functions
Moments from the Generating Function
Counting with Generating Functions
Random Sums
1.4.2 Probability of Extinction
1.5.1 The Borel–Cantelli Lemma
1.5.2 Markov’s Inequality
1.5.3 Proof of Borel’s Strong Law
1.6 Exercises
Chapter 2 Integration
2.1.1 Measurable Functions
Stability Properties of Measurable Functions
Dynkin’s Systems
2.1.2 Measure
Negligible Sets
Equality of Measures
Existence of Measures
2.2.1 Construction of the Integral
2.2.2 Elementary Properties of the Integral
More Elementary Properties
2.2.3 Beppo Levi, Fatou and Lebesgue
Differentiation under the integral sign
2.3 The Other Big Theorems
2.3.2 The Fubini–Tonelli Theorem
Integration by Parts Formula
2.3.3 The Riesz–Fischer Theorem
Hölder’s Inequality
The Riesz–Fischer Theorem
The Product of a Measure by a Function
Lebesgue’s decomposition
2.4 Exercises
3.1.1 Translation
3.1.2 Probability Distributions
Famous Continuous Random Variables
Change of Variables
Correlation Coefficient
3.1.3 Independence and the Product Formula
Order Statistics
Sampling from a Distribution
3.1.4 Characteristic Functions
Ladder Random Variables
Random Sums and Wald’s Identity
3.1.5 Laplace Transforms
3.2.1 Two Equivalent Definitions
3.2.2 Independence and Non-correlation
3.2.3 The pdf of a Non-degenerate Gaussian Vector
3.3.1 The Intermediate Theory
Bayesian Tests of Hypotheses
3.3.2 The General Theory
Connection with the Intermediate Theory
Properties of the Conditional Expectation
3.3.3 The Doubly Stochastic Framework
3.4 Exercises
4.1.1 A Sufficient Condition and a Criterion
A Criterion
4.1.2 Beppo Levi, Fatou and Lebesgue
4.1.3 The Strong Law of Large Numbers
Large Deviations
4.2.1 Convergence in Probability
4.2.2 Convergence in Lp
4.2.3 Uniform Integrability
4.3.1 Kolmogorov’s Zero-one Law
4.3.2 The Hewitt–Savage Zero-one Law
4.4.1 The Role of Characteristic Functions
Paul Lévy’s Characterization
Bochner’s Theorem
4.4.2 The Central Limit Theorem
Statistical Applications
The Variation Distance
The Coupling Inequality
A More General Definition
A Bayesian Interpretation
Convergence in Variation
Radon Linear Forms
Vague Convergence
Fourier Transforms of Finite Measures
The Proof of Paul Lévy’s criterion
4.5.1 Almost-sure vs in Probability
4.5.2 The Rank of Convergence in Distribution
4.6 Exercises
II: STANDARD STOCHASTIC PROCESSES
Random Processes as Collections of Random Variables
Finite-dimensional Distributions
Transfer to Canonical Spaces
5.1.2 Second-order Stochastic Processes
Wide-sense Stationarity
5.1.3 Gaussian Processes
Gaussian Subspaces
5.2.1 Versions and Modifications
5.2.2 Kolmogorov’s Continuity Condition
5.3.1 Measurable Processes and their Integrals
5.3.2 Histories and Stopping Times
Stopping Times
5.4 Exercises
6.1.1 The Markov Property on the Integers
First-step Analysis
Local characteristics
Gibbs Distributions
The Hammersley–Clifford Theorem
Communication Classes
Period
6.2.2 Stationary Distributions and Reversibility
Reversibility
6.2.3 The Strong Markov Property
Regenerative Cycles
6.3.1 Classification of States
The Potential Matrix Criterion of Recurrence
6.3.2 The Stationary Distribution Criterion
Birth-and-death Markov Chains
6.3.3 Foster’s Theorem
6.4.1 The Markov Chain Ergodic Theorem
6.4.2 Convergence in Variation to Steady State
6.4.3 Null Recurrent Case: Orey’s Theorem
6.4.4 Absorption
Before Absorption
Time to Absorption
Final Destination
6.5.1 Basic Principle and Algorithms
The Propp–Wilson Algorithm
Sandwiching
6.6 Exercises
7.1.1 The Counting Process and the Interval Sequence
The Counting Process
Superposition of independent HPPs
Strong Markov Property
A Smoothing Formula for HPPs
Watanabe’s Characterization
The Strong Markov Property via Watanabe’s theorem
7.2.1 The Infinitesimal Generator
The Uniform HMC
7.2.2 The Local Characteristics
7.2.3 HMCs from HPPs
Aggregation of States
7.3.1 The Strong Markov Property
7.3.2 Imbedded Chain
7.3.3 Conditions for Regularity
7.4.1 Recurrence
Invariant Measures of Recurrent Chains
The Stationary Distribution Criterion of Ergodicity
Reversibility
7.4.2 Convergence to Equilibrium
7.5 Exercises
8.1.1 Point Processes as Random Measures
Points
8.1.2 Point Process Integrals and the Intensity Measure
Campbell’s Formula
Cluster Point Processes
Finite-dimensional Distributions
The Laplace Functional
The Avoidance Function
8.2.1 Construction
The Covariance Formula
The Exponential Formula
8.3.1 As Unmarked Poisson Processes
Thinning and Coloring
Transportation
Poisson Shot Noise
The Case of Finite Intensity Measures
The Mixed Poisson Case
8.4 The Boolean Model
Isolated Points
8.5 Exercises
9.1.1 The Basic Example
9.1.2 Multiple Access Communication
The Instability of ALOHA
Backlog Dependent Policies
9.1.3 The Stack Algorithm
9.2.1 Isolated Markovian Queues
Congestion as a Birth-and-Death Process
Burke’s Output Theorem
Jackson Networks
Gordon–Newell Networks
9.3 Non-exponential Models
9.3.1 M/GI/∞
9.3.2 M/GI/1/∞/FIFO
9.3.3 GI/M/1/∞/FIFO
9.4 Exercises
10.1.1 The Renewal Measure
10.1.2 The Renewal Equation
Solution of the Renewal Equation
10.1.3 Stationary Renewal Processes
Direct Riemann Integrability
The Key Renewal Theorem
Renewal Reward Processes
10.2.2 The Coupling Proof of Blackwell’s Theorem
10.2.3 Defective and Excessive Renewal Equations
10.3.1 Examples
10.3.2 The Limit Distribution
10.4 Semi-Markov Processes
Improper Multivariate Renewal Equations
10.5 Exercises
11.1.1 As a Rescaled Random Walk
11.1.2 Simple Operations on Brownian motion
11.1.3 Gauss–Markov Processes
The Reflection Principle
11.2.3 Non-differentiability
11.2.4 Quadratic Variation
11.3.1 Construction
Series Expansion of Wiener integrals
A Characterization of the Wiener Integral
11.3.2 Langevin’s Equation
11.3.3 The Cameron–Martin Formula
11.4 Fractal Brownian Motion
11.5 Exercises
12.1.1 Covariance Functions and Characteristic Functions
Two Particular Cases
The General Case
12.1.2 Filtering of wss Stochastic Processes
A First Approach
White Noise via the Doob–Wiener Integral
The Approximate Derivative Approach
12.2.1 The Cramér–Khintchin Decomposition
The Shannon–Nyquist Sampling Theorem
12.2.2 A Plancherel–Parseval Formula
12.2.3 Linear Operations
12.3.1 The Power Spectral Matrix
12.3.2 Band-pass Stochastic Processes
Complementary reading
12.4 Exercises
III: ADVANCED TOPICS
13.1.1 The Martingale Property
Convex Functions of Martingales
Martingale Transforms and Stopped Martingales
13.1.2 Kolmogorov’s Inequality
13.1.3 Doob’s Inequality
13.1.4 Hoeffding’s Inequality
A General Framework of Application
13.2.1 Doob’s Optional Sampling Theorem
Wald’s Exponential Formula
13.2.3 The Maximum Principle
13.3.1 The Fundamental Convergence Theorem
Kakutani’s Theorem
13.3.2 Backwards (or Reverse) Martingales
Local Absolute Continuity
Harmonic Functions and Markov Chains
13.3.3 The Robbins–Sigmund Theorem
Doob’s decomposition
The Martingale Law of Large Numbers
The Robbins–Monro algorithm
13.4 Continuous-time Martingales
13.4.1 From Discrete Time to Continuous Time
13.4.2 The Banach Space MP
13.4.3 Time Scaling
13.5 Exercises
14.1.1 Construction
14.1.2 Properties of the Itô Integral Process
14.1.3 Itô’s Integrals Defined as Limits in Probability
Functions of Brownian Motion
14.2.2 Some Extensions
A Finite Number of Discontinuities
The Vectorial Differentiation Rule
14.3.1 Square-integrable Brownian Functionals
14.3.2 Girsanov’s Theorem
The Strong Markov Property of Brownian Motion
14.3.3 Stochastic Differential Equations
Strong and Weak Solutions
14.3.4 The Dirichlet Problem
14.4 Exercises
15.1.1 The Martingale Definition
15.1.2 Stochastic Intensity Kernels
The Case of Marked Point Processes
Stochastic Integrals and Martingales
15.1.3 Martingales as Stochastic Integrals
15.1.4 The Regenerative Form of the Stochastic Intensity
15.2.1 Changing the History
15.2.2 Absolutely Continuous Change of Probability
The Reference Probability Method
15.2.3 Changing the Time Scale
Cryptology
15.3.1 An Extension of Watanabe’s Theorem
15.3.2 Grigelionis’ Embedding Theorem
Variants of the Embedding Theorems
15.4 Exercises
16.1.1 Invariant Events and Ergodicity
16.1.2 Mixing
The Stochastic Process Point of View
16.1.3 The Convex Set of Ergodic Probabilities
16.2.1 Lindley’s Sequence
16.2.2 Loynes’ Equation
16.3.1 The Ergodic Case
16.3.2 The Non-ergodic Case
16.3.3 The Continuous-time Ergodic Theorem
16.4 Exercises
17.1.1 Palm Distribution
Compatibility
Stationary Frameworks
17.1.3 Palm Probability and the Campbell–Mecke Formula
Thinning and Conditioning
17.2.1 Event-time Stationarity
17.2.2 Inversion Formulas
Backward and Forward Recurrence Times
17.2.3 The Exchange Formula
17.2.4 From Palm to Stationary
The G/G/1/∞ Queue in Continuous Time
17.3.1 The Local Interpretation
17.3.2 The Ergodic Interpretation
17.4.1 The PSATA Property
17.4.2 Queue Length at Departures or Arrivals
17.4.3 Little’s Formula
17.5 Exercises
A.1 The Greatest Common Divisor
A.2 Eigenvalues and Eigenvectors
A.3 The Perron–Frobenius Theorem
B.1 Infinite Products
B.2 Abel’s Theorem
Cesàro’s Lemma
Toeplitz’s Lemma
B.5 Subadditive Functions
B.7 The Abstract Definition of Continuity
B.8 Change of Time
C.2 Schwarz’s Inequality
C.3 Isometric Extension
C.4 Orthogonal Projection
C.5 Riesz’s Representation Theorem
C.6 Orthonormal expansions
Bibliography
Index

Universitext

Pierre Brémaud

Probability Theory and Stochastic Processes

Series Editors

Sheldon Axler, San Francisco State University
Carles Casacuberta, Universitat de Barcelona
John Greenlees, University of Warwick, Coventry
Angus MacIntyre, Queen Mary University of London
Kenneth Ribet, University of California, Berkeley
Claude Sabbah, École Polytechnique, CNRS, Université Paris-Saclay, Palaiseau
Endre Süli, University of Oxford
Wojbor A. Woyczyński, Case Western Reserve University

Universitext is a series of textbooks that presents material from a wide variety of mathematical disciplines at master’s level and beyond. The books, often well class-tested by their author, may have an informal, personal, even experimental approach to their subject matter. Some of the most successful and established books in the series have evolved through several editions, always following the evolution of teaching curricula, into very polished texts. Thus as research topics trickle down into graduate-level teaching, first textbooks written for new, cutting-edge courses may make their way into Universitext.

More information about this series at http://www.springer.com/series/223

Pierre Brémaud Département d’Informatique INRIA, École Normale Supérieure Paris CX 5, France

ISSN 0172-5939
ISSN 2191-6675 (electronic)
Universitext
ISBN 978-3-030-40182-5
ISBN 978-3-030-40183-2 (eBook)
https://doi.org/10.1007/978-3-030-40183-2

© Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Pour Marion

Contents Introduction

xv

Part One: Probability Theory

1

1 Warming Up 1.1 Sample Space, Events and Probability . . . . . . 1.1.1 Events . . . . . . . . . . . . . . . . . . . . 1.1.2 Probability of Events . . . . . . . . . . . . 1.2 Independence and Conditioning . . . . . . . . . . 1.2.1 Independent Events . . . . . . . . . . . . . 1.2.2 Bayes’ Calculus . . . . . . . . . . . . . . . 1.2.3 Conditional Independence . . . . . . . . . 1.3 Discrete Random Variables . . . . . . . . . . . . . 1.3.1 Probability Distributions and Expectation 1.3.2 Famous Discrete Probability Distributions 1.3.3 Conditional Expectation . . . . . . . . . . 1.4 The Branching Process . . . . . . . . . . . . . . . 1.4.1 Generating Functions . . . . . . . . . . . . 1.4.2 Probability of Extinction . . . . . . . . . . 1.5 Borel’s Strong Law of Large Numbers . . . . . . . 1.5.1 The Borel–Cantelli Lemma . . . . . . . . . 1.5.2 Markov’s Inequality . . . . . . . . . . . . . 1.5.3 Proof of Borel’s Strong Law . . . . . . . . 1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

3 3 4 6 9 9 11 13 14 14 22 29 32 32 38 40 40 41 42 43

2 Integration 2.1 Measurability and Measure . . . . . . . . . . 2.1.1 Measurable Functions . . . . . . . . . 2.1.2 Measure . . . . . . . . . . . . . . . . 2.2 The Lebesgue Integral . . . . . . . . . . . . 2.2.1 Construction of the Integral . . . . . 2.2.2 Elementary Properties of the Integral 2.2.3 Beppo Levi, Fatou and Lebesgue . . 2.3 The Other Big Theorems . . . . . . . . . . . 2.3.1 The Image Measure Theorem . . . . 2.3.2 The Fubini–Tonelli Theorem . . . . . 2.3.3 The Riesz–Fischer Theorem . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

51 52 52 60 66 66 71 73 75 76 76 83

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

vii

CONTENTS

viii

2.4

2.3.4 The Radon–Nikod´ ym Theorem . . . . . . . . . . . . . . . . 88 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

3 Probability and Expectation 3.1 From Integral to Expectation . . . . . . . . . . . . . 3.1.1 Translation . . . . . . . . . . . . . . . . . . . 3.1.2 Probability Distributions . . . . . . . . . . . . 3.1.3 Independence and the Product Formula . . . . 3.1.4 Characteristic Functions . . . . . . . . . . . . 3.1.5 Laplace Transforms . . . . . . . . . . . . . . . 3.2 Gaussian vectors . . . . . . . . . . . . . . . . . . . . 3.2.1 Two Equivalent Definitions . . . . . . . . . . 3.2.2 Independence and Non-correlation . . . . . . . 3.2.3 The pdf of a Non-degenerate Gaussian Vector 3.3 Conditional Expectation . . . . . . . . . . . . . . . . 3.3.1 The Intermediate Theory . . . . . . . . . . . . 3.3.2 The General Theory . . . . . . . . . . . . . . 3.3.3 The Doubly Stochastic Framework . . . . . . 3.3.4 The L2 -theory of Conditional Expectation . . 3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

95 95 95 97 108 114 118 119 119 121 123 125 125 131 135 136 136

4 Convergences 4.1 Almost-sure Convergence . . . . . . . . . . . . . 4.1.1 A Sufficient Condition and a Criterion . 4.1.2 Beppo Levi, Fatou and Lebesgue . . . . 4.1.3 The Strong Law of Large Numbers . . . 4.2 Two Other Types of Convergence . . . . . . . . 4.2.1 Convergence in Probability . . . . . . . . 4.2.2 Convergence in Lp . . . . . . . . . . . . 4.2.3 Uniform Integrability . . . . . . . . . . . 4.3 Zero-one Laws . . . . . . . . . . . . . . . . . . . 4.3.1 Kolmogorov’s Zero-one Law . . . . . . . 4.3.2 The Hewitt–Savage Zero-one Law . . . . 4.4 Convergence in Distribution and in Variation . . 4.4.1 The Role of Characteristic Functions . . 4.4.2 The Central Limit Theorem . . . . . . . 4.4.3 Convergence in Variation . . . . . . . . . 4.4.4 Proof of Paul L´evy’s Criterion . . . . . . 4.5 The Hierarchy of Convergences . . . . . . . . . 4.5.1 Almost-sure vs in Probability . . . . . . 4.5.2 The Rank of Convergence in Distribution 4.6 Exercises . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

145 145 145 148 149 156 156 158 160 162 162 163 166 166 172 176 182 188 188 189 190

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

CONTENTS

ix 197

Part Two: Standard Stochastic Processes 5 Generalities on Random Processes 5.1 The Distribution of a Random Process . . . . . 5.1.1 Kolmogorov’s Theorem on Distributions 5.1.2 Second-order Stochastic Processes . . . . 5.1.3 Gaussian Processes . . . . . . . . . . . . 5.2 Random Processes as Random Functions . . . . 5.2.1 Versions and Modifications . . . . . . . . 5.2.2 Kolmogorov’s Continuity Condition . . . 5.3 Measurability Issues . . . . . . . . . . . . . . . 5.3.1 Measurable Processes and their Integrals 5.3.2 Histories and Stopping Times . . . . . . 5.4 Exercises . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

199 199 199 203 206 208 208 209 212 212 213 218

6 Markov Chains, Discrete Time 6.1 The Markov Property . . . . . . . . . . . . . . . . 6.1.1 The Markov Property on the Integers . . . 6.1.2 The Markov Property on a Graph . . . . . 6.2 The Transition Matrix . . . . . . . . . . . . . . . 6.2.1 Topological Notions . . . . . . . . . . . . . 6.2.2 Stationary Distributions and Reversibility 6.2.3 The Strong Markov Property . . . . . . . 6.3 Recurrence and Transience . . . . . . . . . . . . . 6.3.1 Classification of States . . . . . . . . . . . 6.3.2 The Stationary Distribution Criterion . . . 6.3.3 Foster’s Theorem . . . . . . . . . . . . . . 6.4 Long-run Behavior . . . . . . . . . . . . . . . . . 6.4.1 The Markov Chain Ergodic Theorem . . . 6.4.2 Convergence in Variation to Steady State . 6.4.3 Null Recurrent Case: Orey’s Theorem . . . 6.4.4 Absorption . . . . . . . . . . . . . . . . . 6.5 Monte Carlo Markov Chain Simulation . . . . . . 6.5.1 Basic Principle and Algorithms . . . . . . 6.5.2 Exact Sampling . . . . . . . . . . . . . . . 6.6 Exercises . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

221 221 221 227 235 235 237 242 245 245 250 257 260 260 262 265 266 272 272 275 280

7 Markov Chains, Continuous Time 7.1 Homogeneous Poisson Processes on the Line . . . . . . . 7.1.1 The Counting Process and the Interval Sequence . 7.1.2 Stochastic Calculus of hpps . . . . . . . . . . . . 7.2 The Transition Semigroup . . . . . . . . . . . . . . . . . 7.2.1 The Infinitesimal Generator . . . . . . . . . . . . 7.2.2 The Local Characteristics . . . . . . . . . . . . . 7.2.3 hmcs from hpps . . . . . . . . . . . . . . . . . . 7.3 Regenerative Structure . . . . . . . . . . . . . . . . . . . 7.3.1 The Strong Markov Property . . . . . . . . . . . 7.3.2 Imbedded Chain . . . . . . . . . . . . . . . . . . 7.3.3 Conditions for Regularity . . . . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

289 289 289 294 298 298 302 306 310 310 311 314

CONTENTS

x . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

317 317 324 325

8 Spatial Poisson Processes 8.1 Generalities on Point Processes . . . . . . . . . . . . . . 8.1.1 Point Processes as Random Measures . . . . . . . 8.1.2 Point Process Integrals and the Intensity Measure 8.1.3 The Distribution of a Point Process . . . . . . . . 8.2 Unmarked Spatial Poisson Processes . . . . . . . . . . . 8.2.1 Construction . . . . . . . . . . . . . . . . . . . . 8.2.2 Poisson Process Integrals . . . . . . . . . . . . . . 8.3 Marked Spatial Poisson Processes . . . . . . . . . . . . . 8.3.1 As Unmarked Poisson Processes . . . . . . . . . . 8.3.2 Operations on Poisson Processes . . . . . . . . . . 8.3.3 Change of Probability Measure . . . . . . . . . . 8.4 The Boolean Model . . . . . . . . . . . . . . . . . . . . . 8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

329 329 329 334 338 345 345 347 351 351 354 357 360 365

9 Queueing Processes 9.1 Discrete-time Markovian Queues . . . . 9.1.1 The Basic Example . . . . . . . 9.1.2 Multiple Access Communication 9.1.3 The Stack Algorithm . . . . . . 9.2 Continuous-time Markovian Queues . . 9.2.1 Isolated Markovian Queues . . . 9.2.2 Markovian Networks . . . . . . 9.3 Non-exponential Models . . . . . . . . 9.3.1 M/GI/∞ . . . . . . . . . . . . 9.3.2 M/GI/1/∞/fifo . . . . . . . . 9.3.3 GI/M/1/∞/fifo . . . . . . . . 9.4 Exercises . . . . . . . . . . . . . . . . .

7.4

7.5

Long-run Behavior 7.4.1 Recurrence 7.4.2 Convergence Exercises . . . . . .

. . . . . . . . . . . . . . . . . . to Equilibrium . . . . . . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

371 371 371 372 375 378 378 385 391 392 394 396 399

10 Renewal and Regenerative Processes 10.1 Renewal Point processes . . . . . . . . . . . . . . . 10.1.1 The Renewal Measure . . . . . . . . . . . . 10.1.2 The Renewal Equation . . . . . . . . . . . . 10.1.3 Stationary Renewal Processes . . . . . . . . 10.2 The Renewal Theorem . . . . . . . . . . . . . . . . 10.2.1 The Key Renewal Theorem . . . . . . . . . 10.2.2 The Coupling Proof of Blackwell’s Theorem 10.2.3 Defective and Excessive Renewal Equations 10.3 Regenerative Processes . . . . . . . . . . . . . . . . 10.3.1 Examples . . . . . . . . . . . . . . . . . . . 10.3.2 The Limit Distribution . . . . . . . . . . . . 10.4 Semi-Markov Processes . . . . . . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

403 403 403 407 413 416 416 424 428 430 430 431 435

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

CONTENTS

xi

10.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439 11 Brownian Motion 11.1 Brownian Motion or Wiener Process . . . . . 11.1.1 As a Rescaled Random Walk . . . . . . 11.1.2 Simple Operations on Brownian motion 11.1.3 Gauss–Markov Processes . . . . . . . . 11.2 Properties of Brownian Motion . . . . . . . . 11.2.1 The Strong Markov Property . . . . . 11.2.2 Continuity . . . . . . . . . . . . . . . . 11.2.3 Non-differentiability . . . . . . . . . . 11.2.4 Quadratic Variation . . . . . . . . . . 11.3 The Wiener–Doob Integral . . . . . . . . . . . 11.3.1 Construction . . . . . . . . . . . . . . 11.3.2 Langevin’s Equation . . . . . . . . . . 11.3.3 The Cameron–Martin Formula . . . . . 11.4 Fractal Brownian Motion . . . . . . . . . . . . 11.5 Exercises . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

443 443 443 445 447 449 449 451 451 453 454 454 458 459 461 463

12 Wide-sense Stationary Stochastic Processes
12.1 The Power Spectral Measure
12.1.1 Covariance Functions and Characteristic Functions
12.1.2 Filtering of wss Stochastic Processes
12.1.3 White Noise
12.2 Fourier Analysis of the Trajectories
12.2.1 The Cramér–Khintchin Decomposition
12.2.2 A Plancherel–Parseval Formula
12.2.3 Linear Operations
12.3 Multivariate wss Stochastic Processes
12.3.1 The Power Spectral Matrix
12.3.2 Band-pass Stochastic Processes
12.4 Exercises

Part Three: Advanced Topics

13 Martingales
13.1 Martingale Inequalities
13.1.1 The Martingale Property
13.1.2 Kolmogorov’s Inequality
13.1.3 Doob’s Inequality
13.1.4 Hoeffding’s Inequality
13.2 Martingales and Stopping Times
13.2.1 Doob’s Optional Sampling Theorem
13.2.2 Wald’s Formulas
13.2.3 The Maximum Principle
13.3 Convergence of Martingales
13.3.1 The Fundamental Convergence Theorem
13.3.2 Backwards (or Reverse) Martingales
13.3.3 The Robbins–Sigmund Theorem
13.3.4 Square-integrable Martingales
13.4 Continuous-time Martingales
13.4.1 From Discrete Time to Continuous Time
13.4.2 The Banach Space Mp
13.4.3 Time Scaling
13.5 Exercises

14 A Glimpse at Itô’s Stochastic Calculus
14.1 The Itô Integral
14.1.1 Construction
14.1.2 Properties of the Itô Integral Process
14.1.3 Itô’s Integrals Defined as Limits in Probability
14.2 Itô’s Differential Formula
14.2.1 Elementary Form
14.2.2 Some Extensions
14.3 Selected Applications
14.3.1 Square-integrable Brownian Functionals
14.3.2 Girsanov’s Theorem
14.3.3 Stochastic Differential Equations
14.3.4 The Dirichlet Problem
14.4 Exercises

15 Point Processes with a Stochastic Intensity
15.1 Stochastic Intensity
15.1.1 The Martingale Definition
15.1.2 Stochastic Intensity Kernels
15.1.3 Martingales as Stochastic Integrals
15.1.4 The Regenerative Form of the Stochastic Intensity
15.2 Transformations of the Stochastic Intensity
15.2.1 Changing the History
15.2.2 Absolutely Continuous Change of Probability
15.2.3 Changing the Time Scale
15.3 Point Processes under a Poisson process
15.3.1 An Extension of Watanabe’s Theorem
15.3.2 Grigelionis’ Embedding Theorem
15.4 Exercises

16 Ergodic Processes
16.1 Ergodicity and Mixing
16.1.1 Invariant Events and Ergodicity
16.1.2 Mixing
16.1.3 The Convex Set of Ergodic Probabilities
16.2 A Detour into Queueing Theory
16.2.1 Lindley’s Sequence
16.2.2 Loynes’ Equation
16.3 Birkhoff’s Theorem
16.3.1 The Ergodic Case
16.3.2 The Non-ergodic Case
16.3.3 The Continuous-time Ergodic Theorem
16.4 Exercises

17 Palm Probability
17.1 Palm Distribution and Palm Probability
17.1.1 Palm Distribution
17.1.2 Stationary Frameworks
17.1.3 Palm Probability and the Campbell–Mecke Formula
17.2 Basic Properties and Formulas
17.2.1 Event-time Stationarity
17.2.2 Inversion Formulas
17.2.3 The Exchange Formula
17.2.4 From Palm to Stationary
17.3 Two Interpretations of Palm Probability
17.3.1 The Local Interpretation
17.3.2 The Ergodic Interpretation
17.4 General Principles of Queueing Theory
17.4.1 The pasta Property
17.4.2 Queue Length at Departures or Arrivals
17.4.3 Little’s Formula
17.5 Exercises

A Number Theory and Linear Algebra
A.1 The Greatest Common Divisor
A.2 Eigenvalues and Eigenvectors
A.3 The Perron–Fröbenius Theorem

B Analysis
B.1 Infinite Products
B.2 Abel’s Theorem
B.3 Tykhonov’s Theorem
B.4 Cesàro, Toeplitz and Kronecker’s Lemmas
B.5 Subadditive Functions
B.6 Gronwall’s Lemma
B.7 The Abstract Definition of Continuity
B.8 Change of Time

C Hilbert Spaces
C.1 Basic Definitions
C.2 Schwarz’s Inequality
C.3 Isometric Extension
C.4 Orthogonal Projection
C.5 Riesz’s Representation Theorem
C.6 Orthonormal expansions

Bibliography

Index

Introduction

This initiation to the theory of probability and random processes has three objectives. The first is to provide the theoretical background necessary for the student to feel comfortable and secure in the utilization of probabilistic models, particularly those involving stochastic processes. The second is to introduce the specific stochastic processes most often encountered in applications (operations research, insurance, finance, biology, physics, computer and communications sciences and signal processing), that is, Markov chains in discrete and continuous time, Poisson processes in time and space, renewal and regenerative processes, queues and their networks, wide-sense stationary processes, and Brownian motion. In addition to the above topics, which form the indispensable cultural background of an applied probabilist, this text gives (and this is the third objective) the advanced tools such as martingales, ergodic theory, Palm theory and stochastic integration, useful for the analysis of the more complex stochastic models. The only prerequisite is an elementary knowledge of calculus and of linear algebra usually acquired at the undergraduate or beginning graduate level, the occasional material beyond this being given in the Appendix or in the main text.

The above cartoon symbolizes the feeling of awe and distress that may invade the potential reader’s mind when considering the size of the book. The fact is that this text has been written for an ideal reader who has had no previous exposure to probability theory and who is willing to learn in a self-teaching mode the basics of the theory of stochastic processes that will constitute a solid platform both for applications and for more specialized studies. This implies that the material should be presented in a detailed, progressive and self-contained manner. However, there is sufficient modularity to devise a table of contents adapted to one’s appetite and interest. The material of this book has been divided into three parts roughly corresponding to the three objectives listed at the beginning of this introduction.


PART ONE: Probability theory

This part stands alone as a short course in probability at the intermediate level. It contains all the results referenced in the rest of the book, and the initiate can therefore skip it and use it as an appendix. The first chapter introduces the basic notions in the elementary framework of discrete random variables and gives a few tricks that permit us to obtain at an early stage non-trivial results such as the strong law of large numbers for coin tossing or the extinction probability of a branching process. One of its purposes is to persuade the neophyte of the power of a formal approach to probability while introducing the main concepts of expectation, independence and conditional expectation. Having acquired familiarity with the vocabulary and the spirit of probability theory, the reader will be ready for the development of this discipline in the framework of integration theory, of which the second chapter provides a detailed account. This theory requires more concentration from the beginner but the effort is worthwhile as it will provide her/him with enough confidence in the manipulation of stochastic models. Probability theory is usually developed in the more theoretical texts without a preliminary exposition of integration theory, whose results are then presented as need arises. But, as far as stochastic processes are concerned, the theory of integration with respect to a finite measure (such as a probability) is not sufficient. The third chapter translates the previous one into the probabilistic language and formalizes the concepts of distribution, independence and conditional expectation. The fourth chapter features the various notions of convergence of a sequence of random variables (almost-sure, in distribution, in variation, in probability and in the mean square) and their interconnections, and closes the first part on the probabilistic background directly useful in the rest of the book.

PART TWO: Standard stochastic processes

This part forms a basic course on stochastic processes. It begins with the pivotal Chapter 5, devoted to general issues such as trajectory continuity, measurability and stopping times. Chapter 6 and Chapter 7 introduce the stochastic processes that are the most popular and that can be treated at an elementary level, namely discrete-time homogeneous Markov chains, homogeneous Poisson process on the line and continuous-time homogeneous Markov chains. Chapter 8 is devoted to Poisson processes in space, with or without marks, a versatile source of spatial models. Chapter 10 gives the essentials of renewal theory and its application to regenerative processes. Chapter 9 gives a panoramic view of the classical queues and their networks at the elementary level. Chapter 11 features the Brownian motion and the Doob–Wiener stochastic integral. The latter is, together with Bochner’s representation of characteristic functions, one of the foundations of the theory of wide-sense stationary stochastic processes of Chapter 12.

PART THREE: Advanced topics

This part complements the previous one in that it is not exclusively devoted to specific random processes, but rather to general classes of such processes. Only a taste of the topics treated here will be given since each one of them requires and deserves considerably more space.


Chapter 13 gives the basic theory of martingales, one of the most important all-purpose tools of probability theory. Chapter 14 is a short introduction to the Brownian motion stochastic calculus based on the Itô integral. It is a natural continuation of the chapters on Brownian motion and on martingales. Chapter 15, a novel item in the table of contents of a textbook on stochastic processes, introduces point processes on the line admitting a stochastic intensity and the associated stochastic calculus. Chapter 16 gives the essentials of ergodic theory. The presentation is not the classical one and gives the opportunity to extend the elementary results of Chapter 9 on queueing. Another novel item of the table of contents for a book at this level is Chapter 17 on Palm probability, a natural complement to the theory of renewal point processes.

Practical issues

The index gives the page(s) where a particular notation or abbreviation is used. These items will appear at the beginning of the list corresponding to their first letter. The special numbering of equations, such as (⋆), (†) and the like, is used only locally, inside proofs. Just before the Exercises section of each chapter there is a subsubsection entitled “Complementary reading” pointing at books (only books) where additional material connected with the current topic can be found. The selection is mainly based on two criteria: accessibility by a reader of this book and relevance to applications (with, of course, a natural bias towards the author’s own interests). No attempt at exhaustivity or proper crediting has been made, the reader being directed to the Bibliography or to the sporadic footnotes for this. In these subsubsections, only the year of the last edition is given. The full history is given in the Bibliography.

Acknowledgements

I wish to acknowledge the precious help of Eva Löcherbach (University of Paris I Sorbonne), Anne Bouillard (Nokia France), Paolo Baldi (University Roma II Torre Vergata) and Léo Miolane (Inria Paris). I warmly thank them and also, last but not least, Marina Reizakis, the patient and diligent editor of this book.

Pierre Brémaud
Paris, July 14, 2019

I: PROBABILITY THEORY

Chapter 1 Warming Up

We apparently live in a random world. There are events that we are unable to predict with absolute certainty. Is this world inherently random or does randomness just refer to our incapacity to solve the highly complex deterministic equations that rule the universe? In fact, probabilists do not attempt to take sides in this debate and just observe that there are some phenomena that look random and yet seem to exhibit some kind of regularity. The canonical example is a coin tossed over and over by a non-mischievous person: the result is an erratic sequence of heads and tails, yet there seems to be a balance between heads and tails. This regularity takes the form of the law of large numbers: the long-run proportion of heads is 1/2, that is, as the number of tosses tends to infinity, the frequency of heads approaches 1/2. Is this a physical law, or is it a mathematical theorem? At first sight, it is a physical law, and indeed it has to do with a complex process, but so complex that it is preferable to view the law of large numbers as a theorem resulting from a mathematical model. It took some time for the corresponding mathematical theory to emerge. The modern era, announced by the proof of the strong law of large numbers for coin tossing by Émile Borel in 1909, really started with Andreï Nikolaïevitch Kolmogorov, who axiomatized probability in 1933 in terms of the theory of measure and integration, which is presented in the next chapter. Before this, however, it is wise to introduce the terminology and the probabilistic concepts (expectation, independence, conditional expectation) in the elementary framework of discrete random variables. This is done in the current chapter, which contains the proofs of two results that demonstrate the power of probabilistic reasoning, already at an elementary level: Borel’s strong law of large numbers and the computation of the extinction probability of a branching process.

1.1 Sample Space, Events and Probability

The study of random phenomena requires a clear and precise language. That of probability theory features familiar mathematical objects such as points, sets and functions, which, however, receive a particular interpretation: points are outcomes (of an experiment), sets are events, functions are random numbers. The meaning of these terms will be given just after we recall the notation concerning the elementary operations on sets: union, intersection and complementation.

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/978-3-030-40183-2_1


1.1.1 Events

If $A$ and $B$ are subsets of some set $\Omega$, $A \cup B$ denotes their union and $A \cap B$ their intersection. In this book, $\overline{A}$ (rather than $A^c$ or other common notations) denotes the complement of $A$ in $\Omega$. The notation $A + B$ (the sum of $A$ and $B$) implies by convention that $A$ and $B$ are disjoint, in which case it represents the union $A \cup B$. Similarly, the notation $\sum_{k=1}^{\infty} A_k$ is used for $\cup_{k=1}^{\infty} A_k$ only when the $A_k$’s are pairwise disjoint. The notation $A - B$ is used only if $B \subseteq A$, and it stands for $A \cap \overline{B}$. In particular, if $B \subseteq A$, then $A = B + (A - B)$. The symmetric difference of $A$ and $B$, that is, the set $(A \cup B) - (A \cap B)$, is denoted by $A \,\triangle\, B$. The indicator function of the subset $A \subseteq \Omega$ is the function $1_A : \Omega \to \{0, 1\}$ defined by

$$1_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{if } \omega \in \overline{A}. \end{cases}$$

Random phenomena are observed by means of experiments (performed either by man or nature). Each experiment results in an outcome. The collection of all possible outcomes $\omega$ is called the sample space $\Omega$. Any subset $A$ of the sample space $\Omega$ will be regarded for the time being¹ as a representation of some event.

Example 1.1.1: Tossing a die, take 1. The experiment consists in tossing a die once. The possible outcomes are $\omega = 1, 2, \ldots, 6$ and the sample space is the set $\Omega = \{1, 2, 3, 4, 5, 6\}$. The subset $A = \{1, 3, 5\}$ is the event “the result is odd.”

Example 1.1.2: Throwing a dart. The experiment consists in throwing a dart at a wall. The sample space can be chosen to be the plane $\mathbb{R}^2$. An outcome is the position $\omega = (x, y) \in \mathbb{R}^2$ hit by the dart. The subset $A = \{(x, y);\ x^2 + y^2 > 1\}$ is an event that could be named “you missed the dartboard” if the dartboard is a closed disk of radius 1 centered at 0.

Example 1.1.3: Heads or tails, take 1. The experiment is an infinite succession of coin tosses. One can take for the sample space the collection of all sequences $\omega = \{x_n\}_{n \ge 1}$, where $x_n = 1$ or $0$, depending on whether the $n$-th toss results in heads or tails. The subset $A = \{\omega;\ x_k = 1 \text{ for } k = 1 \text{ to } 1000\}$ is a lucky event for anyone betting on heads!

The Language of Probabilists

Probabilists have their own dialect. They say that outcome $\omega$ realizes event $A$ if $\omega \in A$. For instance, in the die model of Example 1.1.1, the outcome $\omega = 1$ realizes the event “result is odd”, since $1 \in A = \{1, 3, 5\}$. Obviously, if $\omega$ does not realize $A$, it realizes $\overline{A}$. Event $A \cap B$ is realized by outcome $\omega$ if and only if $\omega$ realizes both $A$ and $B$. Similarly, $A \cup B$ is realized by $\omega$ if and only if at least one event among $A$ and $B$ is realized (both can be realized). Two events $A$ and $B$ are called incompatible when $A \cap B = \emptyset$. In other words, event $A \cap B$ is impossible: no outcome $\omega$ can realize both $A$ and $B$. For this reason one refers to the empty set $\emptyset$ as the impossible event. Naturally, $\Omega$ is called the certain event.

¹ See Definition 1.1.4.


Recall now that the notation $\sum_{k=1}^{\infty} A_k$ is used for $\cup_{k=1}^{\infty} A_k$ only when the subsets $A_k$ are pairwise disjoint. In the terminology of sets, the sets $A_1, A_2, \ldots$ form a partition of $\Omega$ if

$$\sum_{k=1}^{\infty} A_k = \Omega.$$

The probabilists then say that the events A1 , A2 , . . . are mutually exclusive and exhaustive. They are exhaustive in the sense that any outcome ω realizes at least one among them. They are mutually exclusive in the sense that any two distinct events among them are incompatible. Therefore, any ω realizes one and only one of the events An (n ≥ 1). If B ⊆ A, event B is said to imply event A, since ω realizes A when it realizes B.
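The vocabulary above translates directly into set operations. The following is a minimal sketch on the die model of Example 1.1.1; the helper names (`realizes`, `complement`, `indicator`, `is_partition`) are illustrative choices made here, not notation from the text.

```python
# Die model of Example 1.1.1: outcomes are points, events are sets.
OMEGA = frozenset(range(1, 7))


def realizes(omega, event):
    """Outcome omega realizes the event when it belongs to it."""
    return omega in event


def complement(event, sample_space=OMEGA):
    """The complement of the event in the sample space."""
    return sample_space - event


def indicator(event):
    """Indicator function 1_A: returns 1 on A and 0 on its complement."""
    return lambda omega: 1 if omega in event else 0


def is_partition(events, sample_space=OMEGA):
    """Mutually exclusive and exhaustive: every outcome realizes exactly one event."""
    return all(sum(omega in e for e in events) == 1 for omega in sample_space)


odd = frozenset({1, 3, 5})
at_most_two = frozenset({1, 2})

print(realizes(1, odd))                        # does outcome 1 realize "odd"?
print(sorted(odd & at_most_two))               # outcomes realizing both events
print(sorted(odd | at_most_two))               # outcomes realizing at least one
print(is_partition([odd, complement(odd)]))    # an event and its complement
print(at_most_two <= frozenset({1, 2, 3}))     # "B implies A" is set inclusion
```

An event and its complement always form a partition, whereas `odd` and `at_most_two` do not: they share the outcome 1 and miss the outcomes 4 and 6.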

The σ-field of Events

Probability theory assigns to each event a number, the probability of the said event. The collection $\mathcal{F}$ of events to which a probability is assigned is not always identical to the collection of all subsets of $\Omega$. The requirement on $\mathcal{F}$ is that it should be a σ-field:

Definition 1.1.4 Let $\mathcal{F}$ be a collection of subsets of $\Omega$, such that
(i) $\Omega$ is in $\mathcal{F}$,
(ii) if $A$ belongs to $\mathcal{F}$, then so does its complement $\overline{A}$, and
(iii) if $A_1, A_2, \ldots$ belong to $\mathcal{F}$, then so does their union $\cup_{k=1}^{\infty} A_k$.
One then calls $\mathcal{F}$ a σ-field on $\Omega$ (here the σ-field of events).

Note that the impossible event $\emptyset$, being the complement of the certain event $\Omega$, is in $\mathcal{F}$. Note also that if $A_1, A_2, \ldots$ belong to $\mathcal{F}$, then so does their intersection $\cap_{k=1}^{\infty} A_k$ (Exercise 1.6.5).

Example 1.1.5: Trivial σ-field, gross σ-field. These are respectively the collection $\mathcal{P}(\Omega)$ of all subsets of $\Omega$ and the σ-field with only two sets: $\{\Omega, \emptyset\}$. If the sample space $\Omega$ is finite or countable, one usually (but not always and not necessarily) considers any subset of $\Omega$ to be an event, that is, $\mathcal{F} = \mathcal{P}(\Omega)$.
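On a finite sample space the three requirements of Definition 1.1.4 can be checked mechanically. The sketch below assumes a finite $\Omega$, in which case closure under pairwise unions already gives closure under countable unions; the function names `power_set` and `is_sigma_field` are hypothetical helpers introduced for this illustration.

```python
from itertools import chain, combinations

OMEGA = frozenset(range(1, 7))


def power_set(omega):
    """All subsets of a finite set, as frozensets."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}


def is_sigma_field(collection, omega):
    """Check (i)-(iii) of Definition 1.1.4 for a finite family on a finite omega."""
    omega = frozenset(omega)
    family = {frozenset(a) for a in collection}
    if omega not in family:                              # (i) omega is in F
        return False
    if any(omega - a not in family for a in family):     # (ii) complements
        return False
    # (iii) a finite family closed under pairwise unions is closed under
    # countable unions, since only finitely many distinct sets can occur.
    return all(a | b in family for a in family for b in family)


print(is_sigma_field(power_set(OMEGA), OMEGA))                       # all subsets
print(is_sigma_field([set(), OMEGA], OMEGA))                         # two sets only
print(is_sigma_field([set(), {1, 3, 5}, {2, 4, 6}, OMEGA], OMEGA))   # odd/even
print(is_sigma_field([set(), {1}, OMEGA], OMEGA))                    # fails (ii)
```

The last family fails because the complement of $\{1\}$ is missing; adding $\{2, 3, 4, 5, 6\}$ would repair it.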

Example 1.1.6: Borel σ-field. The Borel σ-field on the $n$-dimensional euclidean space $\mathbb{R}^n$, denoted $\mathcal{B}(\mathbb{R}^n)$, is, by definition, the smallest σ-field on $\mathbb{R}^n$ that contains all rectangles, that is, all sets of the form $\prod_{j=1}^{n} I_j$, where the $I_j$’s are intervals of $\mathbb{R}$. This definition is not constructive and therefore one may wonder if there exist sets that are not Borel sets (that is, not sets in $\mathcal{B}(\mathbb{R}^n)$). The theory tells us that there are indeed such sets, but they are in a sense “pathological”: the proof of existence of non-Borel sets is not constructive, in the sense that it involves the axiom of choice. In any case, all the sets whose $n$-volume you have ever been able to compute in your early life are Borel sets. More about this in Chapter 2.


Example 1.1.7: Heads or tails, take 2. Take F to be the smallest σ-field that contains all the sets of the form {ω ; xk = 1} (k ≥ 1). This σ-field also contains all the sets of the form {ω ; xk = 0} (k ≥ 1) (pass to the complements) and therefore (take intersections) all the sets of the form {ω ; x1 = a1 , . . . , xn = an } for all n ≥ 1 and all a1 , . . . , an ∈ {0, 1}.

1.1.2 Probability of Events

The probability $P(A)$ of an event $A \in \mathcal{F}$ measures the likeliness of its occurrence. As a function defined on $\mathcal{F}$, the probability $P$ is required to satisfy a few properties, the axioms of probability.

Definition 1.1.8 A probability on $(\Omega, \mathcal{F})$ is a mapping $P : \mathcal{F} \to \mathbb{R}$ such that
(i) $0 \le P(A) \le 1$ for all $A \in \mathcal{F}$,
(ii) $P(\Omega) = 1$, and
(iii) $P\left(\sum_{k=1}^{\infty} A_k\right) = \sum_{k=1}^{\infty} P(A_k)$ for all sequences $\{A_k\}_{k \ge 1}$ of pairwise disjoint events in $\mathcal{F}$.
Property (iii) is called σ-additivity. The triple $(\Omega, \mathcal{F}, P)$ is called a probability space, or probability model.

Example 1.1.9: Tossing a die, take 2. An event $A$ is a subset of $\Omega = \{1, 2, 3, 4, 5, 6\}$. The formula $P(A) = \frac{|A|}{6}$, where $|A|$ is the cardinality of $A$ (the number of elements in $A$), defines a probability $P$.

Example 1.1.10: Heads or tails, take 3. Choose a probability $P$ such that for any event of the form $A = \{x_1 = a_1, \ldots, x_n = a_n\}$, where $a_1, \ldots, a_n$ are in $\{0, 1\}$,

$$P(A) = \frac{1}{2^n}.$$

Note that this does not define the probability of all events of F. But the theory (wait until Chapter 3) tells us that there exists such a probability satisfying the above requirement and that this probability is unique.
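The two models of Examples 1.1.9 and 1.1.10 are small enough to be written out. The following sketch uses exact rational arithmetic; `prob` and `cylinder_prob` are names chosen for this illustration, and only finitely many tosses are inspected in the coin model, in line with the remark that the formula does not by itself define the probability of every event.

```python
from fractions import Fraction
from itertools import product

OMEGA = frozenset(range(1, 7))


def prob(event):
    """P(A) = |A| / 6 for the die of Example 1.1.9."""
    return Fraction(len(frozenset(event) & OMEGA), len(OMEGA))


def cylinder_prob(prefix):
    """P(x_1 = a_1, ..., x_n = a_n) = 1 / 2^n, as in Example 1.1.10."""
    return Fraction(1, 2 ** len(prefix))


# Axioms (i) and (ii) on the die model.
print(prob(OMEGA), prob(set()), prob({1, 3, 5}))

# Additivity on two disjoint events.
A, B = {1, 3, 5}, {2, 4}
print(prob(A | B) == prob(A) + prob(B))

# The 2^n events {x_1 = a_1, ..., x_n = a_n} are disjoint and exhaust the
# sample space, so their probabilities must add up to 1.
n = 3
print(sum(cylinder_prob(p) for p in product((0, 1), repeat=n)))
```

Using `Fraction` rather than floating-point numbers keeps equalities such as additivity exact.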

Example 1.1.11: Random point in a square, take 1. The following is a possible model of a random point inside the unit square $[0, 1]^2 = [0, 1] \times [0, 1]$: $\Omega = [0, 1]^2$, $\mathcal{F}$ is the collection of sets in the Borel σ-field $\mathcal{B}(\mathbb{R}^2)$ that are contained in $[0, 1]^2$, and the probability of a rectangle $[a, b] \times [c, d]$ contained in $[0, 1]^2$ is required to be its area $(b - a)(d - c)$. The theory tells us that there does indeed exist one and only one probability $P$ satisfying the above requirement, called the Lebesgue measure on $[0, 1]^2$, which formalizes the intuitive notion of “area”. (More about this in Chapters 2 and 3.)

The probability of Example 1.1.9 suggests an unbiased die, where each outcome 1, 2, 3, 4, 5, or 6 has the same probability. As we shall soon see, probability $P$ of Example 1.1.10 implies an unbiased coin and independent tosses (the emphasized terms will be defined later).

The axioms of probability are motivated by the heuristic interpretation of $P(A)$ as the empirical frequency of occurrence of event $A$. If $n$ “independent” experiments are performed, among which $n_A$ result in the realization of $A$, then the empirical frequency

$$F(A) = \frac{n_A}{n}$$

should be close to $P(A)$ if $n$ is “sufficiently large”. (This statement has to be made precise. It is in fact a loose expression of the law of large numbers that will be given later on.) Clearly, the empirical frequency function $F$ satisfies the axioms of probability.
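The frequency interpretation can be observed on simulated die tosses. This is only a heuristic sketch: it relies on Python's pseudo-random generator as a stand-in for "independent" experiments, and the seed and sample sizes are arbitrary choices made for the illustration.

```python
import random


def empirical_frequency(event, n, rng):
    """Fraction n_A / n of n simulated die tosses that realize the event."""
    hits = sum(rng.randint(1, 6) in event for _ in range(n))
    return hits / n


rng = random.Random(2020)   # fixed seed (arbitrary), for reproducibility
odd = {1, 3, 5}             # the event "the result is odd", P(A) = 1/2

for n in (100, 10_000, 100_000):
    print(n, empirical_frequency(odd, n, rng))
```

The printed frequencies should drift towards 1/2 as `n` grows, although no single run proves anything; the precise statement is the law of large numbers.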

Basic Formulas

We shall now list the properties of probability that follow directly from the axioms.

Theorem 1.1.12 For any event $A$,

$$P(\overline{A}) = 1 - P(A) \tag{1.1}$$

and

$$P(\emptyset) = 0. \tag{1.2}$$

Proof. For a proof of (1.1), use additivity:

$$1 = P(\Omega) = P(A + \overline{A}) = P(A) + P(\overline{A}).$$

Applying (1.1) with $A = \Omega$ gives (1.2).

Theorem 1.1.13 Probability is monotone, that is, for any events $A$ and $B$,

$$A \subseteq B \implies P(A) \le P(B). \tag{1.3}$$

Proof. For a proof, observe that when $A \subseteq B$, $B = A + (B - A)$, and therefore $P(B) = P(A) + P(B - A) \ge P(A)$.

Theorem 1.1.14 Probability is sub-σ-additive: for any sequence $A_1, A_2, \ldots$ of events,

$$P\left(\cup_{k=1}^{\infty} A_k\right) \le \sum_{k=1}^{\infty} P(A_k). \tag{1.4}$$

Proof. Observe that

$$\cup_{k=1}^{\infty} A_k = \sum_{k=1}^{\infty} A'_k,$$

where $A'_1 := A_1$ and

$$A'_k := A_k \cap \overline{\cup_{i=1}^{k-1} A_i} \qquad (k \ge 2).$$

Therefore,

$$P\left(\cup_{k=1}^{\infty} A_k\right) = P\left(\sum_{k=1}^{\infty} A'_k\right) = \sum_{k=1}^{\infty} P(A'_k).$$

But $A'_k \subseteq A_k$, and therefore $P(A'_k) \le P(A_k)$.
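The disjointification used in the proof of (1.4) is easy to replay on a finite example. The sketch below assumes the uniform die model of Example 1.1.9; `disjointify` is a hypothetical helper that builds the sets $A'_k$, and the three overlapping events are chosen arbitrarily.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))


def prob(event):
    """Uniform probability on the die: P(A) = |A| / 6."""
    return Fraction(len(event), len(OMEGA))


def disjointify(events):
    """Return A'_1, A'_2, ... where A'_k is A_k minus (A_1 u ... u A_{k-1})."""
    seen = frozenset()
    pieces = []
    for a in events:
        pieces.append(frozenset(a) - seen)
        seen = seen | frozenset(a)
    return pieces


events = [frozenset({1, 2}), frozenset({2, 3, 4}), frozenset({4, 5})]
pieces = disjointify(events)
union = frozenset().union(*events)

print([sorted(p) for p in pieces])                  # pairwise disjoint pieces
print(prob(union), sum(prob(p) for p in pieces))    # equal, by additivity
print(sum(prob(a) for a in events))                 # upper bound in (1.4)
print(prob(OMEGA - events[0]), 1 - prob(events[0])) # formula (1.1)
```

The pieces have the same union as the original events, which is why the sum of their probabilities equals the probability of the union, while the sum over the original events can only be larger.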



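The next (very important) property is the sequential continuity of probability:

Theorem 1.1.15 Let $\{A_n\}_{n \ge 1}$ be a non-decreasing sequence of events, that is, $A_{n+1} \supseteq A_n$ for all $n \ge 1$. Then

$$P\left(\cup_{n=1}^{\infty} A_n\right) = \lim_{n \uparrow \infty} P(A_n). \tag{1.5}$$

Proof. Write

$$A_n = A_1 + (A_2 - A_1) + \cdots + (A_n - A_{n-1})$$

and

$$\cup_{k=1}^{\infty} A_k = A_1 + (A_2 - A_1) + (A_3 - A_2) + \cdots.$$

Therefore,

$$P\left(\cup_{k=1}^{\infty} A_k\right) = P(A_1) + \sum_{j=2}^{\infty} P(A_j - A_{j-1}) = \lim_{n \uparrow \infty} \left\{ P(A_1) + \sum_{j=2}^{n} P(A_j - A_{j-1}) \right\} = \lim_{n \uparrow \infty} P(A_n).$$

Corollary 1.1.16 Let $\{B_n\}_{n \ge 1}$ be a non-increasing sequence of events, that is, $B_{n+1} \subseteq B_n$ for all $n \ge 1$. Then

$$P\left(\cap_{n=1}^{\infty} B_n\right) = \lim_{n \uparrow \infty} P(B_n). \tag{1.6}$$

Proof. To obtain (1.6), write (using De Morgan’s identity; see Exercise 1.6.1):

$$P\left(\cap_{n=1}^{\infty} B_n\right) = 1 - P\left(\overline{\cap_{n=1}^{\infty} B_n}\right) = 1 - P\left(\cup_{n=1}^{\infty} \overline{B_n}\right),$$

and apply (1.5) with $A_n = \overline{B_n}$:

$$1 - P\left(\cup_{n=1}^{\infty} \overline{B_n}\right) = 1 - \lim_{n \uparrow \infty} P(\overline{B_n}) = \lim_{n \uparrow \infty} \left(1 - P(\overline{B_n})\right) = \lim_{n \uparrow \infty} P(B_n).$$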
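As an illustration of (1.5), not taken from the text, consider in the heads-or-tails model of Example 1.1.10 the non-decreasing events $A_n$ = “at least one head among the first $n$ tosses”. The sketch below computes $P(A_n)$ by enumerating the $2^n$ equiprobable prefixes and replays the telescoping sum of the proof; the function name `prob_head_within` is a choice made here.

```python
from fractions import Fraction
from itertools import product


def prob_head_within(n):
    """P(A_n) by enumeration: each prefix (x_1, ..., x_n) has probability 1/2^n."""
    favourable = sum(1 for prefix in product((0, 1), repeat=n) if 1 in prefix)
    return Fraction(favourable, 2 ** n)


probs = [prob_head_within(n) for n in range(1, 11)]

# P(A_1), then the increments P(A_j - A_{j-1}) used in the proof of (1.5).
increments = [probs[0]] + [probs[j] - probs[j - 1] for j in range(1, len(probs))]

print(probs[:4])
print(increments[:4])
print(sum(increments) == probs[-1])   # telescoping sum
print(1 - probs[-1])                  # remaining gap to the limit
```

Here $P(A_n) = 1 - 2^{-n}$ increases to 1, so by (1.5) the event “a head eventually occurs”, which is the union of the $A_n$, has probability 1.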



Negligible Sets

A central notion of probability is that of a negligible set.

Definition 1.1.17 A set $N \subset \Omega$ is called $P$-negligible if it is contained in an event $A \in \mathcal{F}$ of probability $P(A) = 0$.

Theorem 1.1.18 A countable union of negligible sets is a negligible set.


Proof. Let Nk (k ≥ 1) be P -negligible sets. By definition there exists a sequence Ak (k ≥ 1) of events of null probability such that Nk ⊆ Ak (k ≥ 1). We have N := ∪k≥1 Nk ⊆ A := ∪k≥1 Ak , and by the sub-σ-additivity property of probability, P (A) = 0.



Example 1.1.19: Random point in a square, take 2. (Example 1.1.11 continued) Recall the model of a random point inside the unit square [0, 1]² = [0, 1] × [0, 1]. Each rational point therein has a null area and therefore a null probability. Therefore, the (countable) set of rational points of the square has null probability. In other words, the probability of drawing a rational point is, in this particular model, null.

1.2 Independence and Conditioning

In the frequency interpretation of probability, a situation where $n_{A \cap B}/n \approx (n_A/n) \times (n_B/n)$, or

$$\frac{n_{A \cap B}}{n_B} \approx \frac{n_A}{n}$$

(here ≈ is a non-mathematical symbol meaning “approximately equal”) suggests some kind of “independence” of $A$ and $B$, in the sense that statistics relative to $A$ do not vary when passing from a neutral sample of population to a selected sample characterized by the property $B$. For example, the proportion of people with a family name beginning with H is “approximately” the same among a large population with the usual mix of men and women as it would be among a “large” all-male population. Therefore, one’s gender is “independent” of the fact that one’s name begins with an H.²

1.2.1 Independent Events

The above discussion prompts us to give the following formal definition of independence, the single most important concept of probability theory.

Definition 1.2.1 Two events $A$ and $B$ are called independent if and only if

$$P(A \cap B) = P(A)P(B). \tag{1.7}$$

Remark 1.2.2 One should be aware that incompatibility is different from independence. As a matter of fact, two incompatible events $A$ and $B$ are independent if and only if at least one of them has null probability. Indeed, if $A$ and $B$ are incompatible, $P(A \cap B) = P(\emptyset) = 0$, and therefore (1.7) holds if and only if $P(A)P(B) = 0$.

The notion of independence carries over to families of events in the following manner.

Definition 1.2.3 A family $\{A_n\}_{n \in \mathbb{N}}$ of events is called independent if for any finite set of indices $i_1 < \cdots < i_r$ where $i_j \in \mathbb{N}$ ($1 \le j \le r$),

$$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_r}) = P(A_{i_1}) \times P(A_{i_2}) \times \cdots \times P(A_{i_r}).$$

One also says that the $A_n$’s ($n \in \mathbb{N}$) are jointly independent.

² As far as we know...
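Definition 1.2.3 asks for the product formula on every finite sub-family, which is strictly more than independence of each pair. The following sketch checks this on a standard example that is not from the text: two fair coin tosses with $A$ = “first toss is heads”, $B$ = “second toss is heads” and $C$ = “both tosses agree”. The function name `are_independent` is an illustrative choice.

```python
from fractions import Fraction
from itertools import combinations, product

# Two fair coin tosses with the uniform probability on the four outcomes.
OMEGA = list(product((0, 1), repeat=2))


def prob(event):
    """Uniform probability: number of outcomes in the event over |OMEGA|."""
    return Fraction(sum(1 for w in OMEGA if w in event), len(OMEGA))


def are_independent(events):
    """Definition 1.2.3 for a finite family: product rule on every sub-family."""
    for r in range(2, len(events) + 1):
        for sub in combinations(events, r):
            intersection = set(OMEGA).intersection(*sub)
            product_of_probs = Fraction(1)
            for e in sub:
                product_of_probs *= prob(e)
            if prob(intersection) != product_of_probs:
                return False
    return True


A = {w for w in OMEGA if w[0] == 1}      # first toss is heads
B = {w for w in OMEGA if w[1] == 1}      # second toss is heads
C = {w for w in OMEGA if w[0] == w[1]}   # both tosses agree

print(are_independent([A, B]), are_independent([A, C]), are_independent([B, C]))
print(are_independent([A, B, C]))
print(are_independent([A, set(OMEGA) - A]))   # incompatible events, Remark 1.2.2
```

Each pair satisfies (1.7), but $P(A \cap B \cap C) = 1/4$ while the product of the three probabilities is $1/8$, so the family is not jointly independent. The last line illustrates Remark 1.2.2: $A$ and its complement are incompatible, both have positive probability, and they are therefore not independent.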


Example 1.2.4: The switches. Two locations A and B in a communications network are connected by three different paths, and each path contains a number of links that can fail. These are represented symbolically in the figure below by switches that are in the lifted position if the link is unable to operate. The number associated with a switch is the probability that the switch is lifted. The switches are lifted independently. What is the probability that A is accessible from B, that is, that there exists at least one available path for communications?

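[Figure: three parallel paths between A and B. Upper path: two switches, each lifted with probability 0.25. Middle path: one switch, lifted with probability 0.4. Lower path: three switches, each lifted with probability 0.1.]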

Let $U_1$ be the event “no switch lifted in the upper path”. Defining $U_2$ and $U_3$ similarly, we see that the probability to be computed is that of $U_1 \cup U_2 \cup U_3$, or by de Morgan’s law, that of the complement of $\overline{U}_1 \cap \overline{U}_2 \cap \overline{U}_3$:

$$1 - P(\overline{U}_1 \cap \overline{U}_2 \cap \overline{U}_3) = 1 - P(\overline{U}_1)P(\overline{U}_2)P(\overline{U}_3),$$

where the last equality follows from the independence assumption concerning the switches. Letting now $U_1^1$ = “switch 1 (first from left) in the upper path is not lifted” and $U_1^2$ = “switch 2 in the upper path is not lifted”, we have $U_1 = U_1^1 \cap U_1^2$, therefore, in view of the independence assumption,

$$P(\overline{U}_1) = 1 - P(U_1) = 1 - P(U_1^1)P(U_1^2).$$

We must now use the data $P(U_1^1) = 1 - 0.25$, $P(U_1^2) = 1 - 0.25$ to obtain $P(\overline{U}_1) = 1 - (0.75)^2$. Similarly $P(\overline{U}_2) = 1 - 0.6$ and $P(\overline{U}_3) = 1 - (0.9)^3$. The final result (of rather limited interest) is $1 - (0.4375)(0.4)(0.271) = 0.952575$.
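The computation of Example 1.2.4 can be cross-checked numerically. The sketch below evaluates the product formula and, independently, sums the probabilities of all $2^6$ switch configurations in which at least one path is available; the layout of two, one and three switches per path is the one implied by the probabilities used in the example, and the function names are illustrative.

```python
from itertools import product

# Lift probabilities per path, as used in Example 1.2.4.
PATHS = [(0.25, 0.25), (0.4,), (0.1, 0.1, 0.1)]


def path_available(lift_probs):
    """P(no switch lifted on the path), using independence of the switches."""
    p = 1.0
    for q in lift_probs:
        p *= 1.0 - q
    return p


def connection_probability(paths):
    """1 - P(every path has at least one lifted switch)."""
    all_blocked = 1.0
    for lift_probs in paths:
        all_blocked *= 1.0 - path_available(lift_probs)
    return 1.0 - all_blocked


def connection_probability_bruteforce(paths):
    """Same quantity, summing over all configurations of the switches."""
    flat = [q for path in paths for q in path]
    total = 0.0
    for lifted in product((False, True), repeat=len(flat)):
        # Probability of this configuration, by independence.
        weight = 1.0
        for is_lifted, q in zip(lifted, flat):
            weight *= q if is_lifted else 1.0 - q
        # A and B communicate if some path has no lifted switch.
        idx, connected = 0, False
        for path in paths:
            segment = lifted[idx: idx + len(path)]
            idx += len(path)
            if not any(segment):
                connected = True
        if connected:
            total += weight
    return total


print(connection_probability(PATHS))
print(connection_probability_bruteforce(PATHS))
```

Both functions should agree with the value 0.952575 obtained in the example, up to floating-point rounding.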

Example 1.2.5: Is this number the larger one? Let a and b be two numbers in {1, 2, . . . , 10 000}. Nothing is known about these numbers, except that they are not equal, say a > b. Only one of these numbers is shown to you, secretly chosen at random and equiprobably. Call this random number X. Is there a good strategy for guessing if the number shown to you is the larger one? Of course, one would like to have a probability of success strictly larger than 1/2. Perhaps surprisingly, there is such a strategy, which we now describe. Select at random, uniformly on {1, 2, . . . , 10 000} and independently of X, a number Y . If X ≥ Y , say that X is the larger number (= a), otherwise say that it is the smaller one. Let us compute the probability PE of a wrong guess. An error occurs when either (i) X ≥ Y and X = b, or (ii) X < Y and X = a. These events are exclusive of one another, and therefore

1.2. INDEPENDENCE AND CONDITIONING


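\[
\begin{aligned}
P_E &= P(X \ge Y, X = b) + P(X < Y, X = a) \\
    &= P(b \ge Y, X = b) + P(a < Y, X = a) \\
    &= P(b \ge Y)\,P(X = b) + P(a < Y)\,P(X = a) \\
    &= \tfrac{1}{2} P(b \ge Y) + \tfrac{1}{2} P(a < Y) = \tfrac{1}{2}\bigl(P(b \ge Y) + P(a < Y)\bigr) \\
    &= \tfrac{1}{2}\bigl(1 - P(Y \in [b+1, a])\bigr) = \tfrac{1}{2}\left(1 - \frac{a-b}{10\,000}\right) < \tfrac{1}{2} .
\end{aligned}
\]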
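As a check of the error probability of the threshold strategy, here is a minimal sketch in exact arithmetic. The function names are hypothetical, and the code assumes, as in the example, that X equals a or b with probability 1/2 each and that Y is uniform on {1, . . . , 10 000} and independent of X.

```python
from fractions import Fraction

N = 10_000  # size of the range {1, ..., N}, as in the example

def error_probability(a, b, n=N):
    """Exact probability of a wrong guess, computed from the two error events."""
    if not (1 <= b < a <= n):
        raise ValueError("need 1 <= b < a <= n")
    p_b_ge_y = Fraction(b, n)       # P(b >= Y): Y falls in {1, ..., b}
    p_a_lt_y = Fraction(n - a, n)   # P(a < Y): Y falls in {a+1, ..., n}
    return Fraction(1, 2) * (p_b_ge_y + p_a_lt_y)

def closed_form(a, b, n=N):
    """The final expression of the derivation: (1/2)(1 - (a - b)/n)."""
    return Fraction(1, 2) * (1 - Fraction(a - b, n))

print(error_probability(7000, 3000), closed_form(7000, 3000))
```

The two functions agree for every admissible pair, and the value is strictly below 1/2 because a − b ≥ 1.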

1.2.2 Bayes’ Calculus

We continue the heuristic discussion of Subsection 1.2.1 in terms of empirical frequencies. Dependence between A and B occurs when $P(A \cap B) \neq P(A)P(B)$. In this case the relative frequency $n_{A \cap B}/n_B \approx P(A \cap B)/P(B)$ is different from the frequency $n_A/n$. This suggests the following definition.

Definition 1.2.6 The conditional probability of A given B is the number
\[
P(A \mid B) := \frac{P(A \cap B)}{P(B)} , \tag{1.8}
\]
defined when P (B) > 0.

Remark 1.2.7 The quantity P (A | B) represents our expectation of A being realized when the only available information is that B is realized. Indeed, this expectation is based on the relative frequency $n_{A \cap B}/n_B$ alone. Of course, if A and B are independent, then P (A | B) = P (A).

Probability theory is primarily concerned with the computation of probabilities of complex events. The following formulas, called Bayes’ rules, are useful for that purpose.

Theorem 1.2.8 With P (A) > 0, we have the Bayes rule of retrodiction:
\[
P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)} . \tag{1.9}
\]

Proof. Rewrite Definition (1.8) symmetrically in A and B: $P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A)$.

Theorem 1.2.9 Let $B_1, B_2, \ldots$ be events forming a partition of Ω. Then for any event A, we have the Bayes rule of total causes:
\[
P(A) = \sum_{i=1}^{\infty} P(A \mid B_i)\,P(B_i) . \tag{1.10}
\]

Proof. Decompose A as follows:
\[
A = A \cap \Omega = A \cap \left( \bigcup_{i=1}^{\infty} B_i \right) = \bigcup_{i=1}^{\infty} (A \cap B_i).
\]
Therefore (by σ-additivity and the definition of conditional probability):
\[
P(A) = P\left( \bigcup_{i=1}^{\infty} (A \cap B_i) \right) = \sum_{i=1}^{\infty} P(A \cap B_i) = \sum_{i=1}^{\infty} P(A \mid B_i)\,P(B_i).
\]

Theorem 1.2.10 For any sequence of events $A_1, \ldots, A_n$, we have the Bayes sequential formula:
\[
P\left( \cap_{i=1}^{k} A_i \right) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2) \cdots P\left( A_k \mid \cap_{i=1}^{k-1} A_i \right). \tag{1.11}
\]

Proof. By induction. First observe that (1.11) is true for k = 2 by definition of conditional probability. Suppose that (1.11) is true for k. Write
\[
P\left( \cap_{i=1}^{k+1} A_i \right) = P\left( \left( \cap_{i=1}^{k} A_i \right) \cap A_{k+1} \right) = P\left( A_{k+1} \mid \cap_{i=1}^{k} A_i \right) P\left( \cap_{i=1}^{k} A_i \right),
\]
and replace $P\left( \cap_{i=1}^{k} A_i \right)$ by the assumed equality (1.11) to obtain the same equality with k + 1 replacing k.

Example 1.2.11: Should we always believe doctors? Doctors apply a test that gives a positive result in 99% of the cases where the patient is affected by the disease. However it happens in 2% of the cases that a healthy patient has a positive test. Statistical data show that one individual out of 1000 has the disease. What is the probability that a patient with a positive test is affected by the disease?

Solution: Let M be the event “patient is ill,” and let + and − be the events “test is positive” and “test is negative” respectively. We have the data
\[
P(M) = 0.001, \quad P(+ \mid M) = 0.99, \quad P(+ \mid \overline{M}) = 0.02,
\]
and we must compute P (M | +). By the Bayes retrodiction formula,
\[
P(M \mid +) = \frac{P(+ \mid M)\,P(M)}{P(+)} .
\]
By the Bayes formula of total causes,
\[
P(+) = P(+ \mid M)\,P(M) + P(+ \mid \overline{M})\,P(\overline{M}).
\]
Therefore,
\[
P(M \mid +) = \frac{(0.99)(0.001)}{(0.99)(0.001) + (0.02)(0.999)} ,
\]
that is, approximately 0.047.
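The two Bayes rules used in this example combine into a one-line computation. The sketch below is illustrative only; the function name `posterior` and its parameter names are hypothetical, not notation from the text.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(M | +), using the rule of total causes for P(+) and then retrodiction."""
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive

# Data of the example: P(M) = 0.001, P(+ | M) = 0.99, P(+ | not M) = 0.02.
p_ill_given_positive = posterior(prior=0.001, sensitivity=0.99, false_positive_rate=0.02)
print(round(p_ill_given_positive, 4))
```

Varying `prior` shows how strongly the answer depends on the rarity of the disease rather than on the quality of the test.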


Remark 1.2.12 The quantitative result of the above example may be disquieting. In fact, this may happen with grouped blood tests (maybe in a prison or in the army) to detect, say, AIDS. A single infected individual will provoke a positive test alert for the whole group. Of course, the doctor in charge will then proceed to individual tests. See Exercise 1.6.19.

Example 1.2.13: The ballot problem. In an election, candidates I and II have obtained a and b votes, respectively. Candidate I won, that is, a > b. We seek to compute the probability that in the course of the vote counting procedure, candidate I has always had the lead. Let $p_{a,b}$ be this probability. We have by the formula of total causes, conditioning on the last vote:
\[
\begin{aligned}
p_{a,b} &= P(\text{I always ahead} \mid \text{I gets the last vote})\,P(\text{I gets the last vote}) \\
        &\quad + P(\text{I always ahead} \mid \text{II gets the last vote})\,P(\text{II gets the last vote}) \\
        &= \frac{a}{a+b}\, p_{a-1,b} + \frac{b}{a+b}\, p_{a,b-1} ,
\end{aligned}
\]
with the convention that for a = b + 1, $p_{a-1,b} = p_{b,b} = 0$. The result follows by induction on the total number of votes a + b:
\[
p_{a,b} = \frac{a-b}{a+b} .
\]

1.2.3 Conditional Independence

Definition 1.2.14 Let A, B and C be events, where P (C) > 0. One says that A and B are conditionally independent given C if
\[
P(A \cap B \mid C) = P(A \mid C)\,P(B \mid C) . \tag{1.12}
\]
In other words, A and B are independent with respect to the probability $P_C$ defined by $P_C(A) = P(A \mid C)$ (see Exercise 1.6.11).

Example 1.2.15: Cheap watches. Two factories A and B manufacture watches. Factory A produces on average one defective item out of 100, and B produces on average one bad watch out of 200. A retailer receives a container of watches from one of the two above factories, but he does not know which. He checks the first watch. It works! (a) What is the probability that the second watch he will check is good? (b) Are the states of the first two watches independent? You will need to invent reasonable hypotheses when needed.

Solution: (a) Let $X_n$ be the state of the n-th watch in the container, with $X_n = 1$ if it works and $X_n = 0$ if it does not. Let Y be the factory of origin. We express our a priori ignorance of where the case comes from by
\[
P(Y = A) = P(Y = B) = \tfrac{1}{2} .
\]
(Note that this is a hypothesis.) Also, we assume that given Y = A (resp., Y = B), the states of the successive watches are independent. For instance,
\[
P(X_1 = 1, X_2 = 0 \mid Y = A) = P(X_1 = 1 \mid Y = A)\,P(X_2 = 0 \mid Y = A).
\]
We have the data
\[
P(X_n = 0 \mid Y = A) = 0.01, \qquad P(X_n = 0 \mid Y = B) = 0.005.
\]
We are required to compute
\[
P(X_2 = 1 \mid X_1 = 1) = \frac{P(X_1 = 1, X_2 = 1)}{P(X_1 = 1)} .
\]
By the Bayes formula of total causes, the numerator of this fraction equals
\[
P(X_1 = 1, X_2 = 1 \mid Y = A)\,P(Y = A) + P(X_1 = 1, X_2 = 1 \mid Y = B)\,P(Y = B),
\]
that is, $(0.5)(0.99)^2 + (0.5)(0.995)^2$, and the denominator is
\[
P(X_1 = 1 \mid Y = A)\,P(Y = A) + P(X_1 = 1 \mid Y = B)\,P(Y = B),
\]
that is, $(0.5)(0.99) + (0.5)(0.995)$. Therefore,
\[
P(X_2 = 1 \mid X_1 = 1) = \frac{(0.99)^2 + (0.995)^2}{0.99 + 0.995} .
\]

(b) The states of the two watches are not independent. Indeed, if they were, then
\[
P(X_2 = 1 \mid X_1 = 1) = P(X_2 = 1) = (0.5)(0.99 + 0.995),
\]
a result different from what we obtained.
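The two quantities compared in part (b) differ only slightly, which is easier to see in exact arithmetic. The following sketch is an illustration under the hypotheses of the example (equiprobable factories, conditional independence of the watches given the factory); the names `PRIOR`, `P_GOOD` and `prob_first_k_good` are hypothetical.

```python
from fractions import Fraction

# Hypotheses of the example: equiprobable factories, and per-watch success rates.
PRIOR = {"A": Fraction(1, 2), "B": Fraction(1, 2)}
P_GOOD = {"A": 1 - Fraction(1, 100), "B": 1 - Fraction(1, 200)}

def prob_first_k_good(k):
    """P(X_1 = 1, ..., X_k = 1) by the formula of total causes over the factory."""
    return sum(PRIOR[f] * P_GOOD[f] ** k for f in PRIOR)

p_second_good_given_first = prob_first_k_good(2) / prob_first_k_good(1)
p_second_good = prob_first_k_good(1)  # same law for every watch

print(float(p_second_good_given_first), float(p_second_good))
```

The conditional probability is slightly larger than the unconditional one: a working first watch makes the better factory a little more likely.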

Remark 1.2.16 The above example shows that two events A and B that are conditionally independent given some event C and at the same time conditionally independent given $\overline{C}$, may yet not be independent.

1.3 Discrete Random Variables

The number of heads in a sequence of 1000 coin tosses, the number of days it takes until the next rain and the size of a genealogical tree are random numbers. All are functions of the outcome of a random experiment performed either by man or nature, and these outcomes take discrete values, that is, values in a countable set. These values are integers in the above examples, but they could be more complex mathematical objects. This section gives the basic theory of discrete random variables.

1.3.1 Probability Distributions and Expectation

Definition 1.3.1 Let E be a countable set. A function X : Ω → E such that for all x∈E {ω; X(ω) = x} ∈ F is called a discrete random variable. (Being in F, the event {X = x} can be assigned a probability.)


Remark 1.3.2 Calling an integer-valued random variable X a random number is an innocuous habit as long as one is aware that it is not the function X that is random, but the outcome ω. This in turn makes the number X(ω) random. Example 1.3.3: Tossing a die, take 3. The sample space is the set Ω = {1, 2, 3, 4, 5, 6}. Take for X the identity: X(ω) = ω. Therefore X is a random number obtained by tossing a die.

Example 1.3.4: Heads or tails, take 4. (Example 1.1.10 continued.) The sample space Ω is the collection of all sequences $\omega = \{x_n\}_{n \ge 1}$, where $x_n = 1$ or 0. Define a random variable $X_n$ by $X_n(\omega) = x_n$. It is the random number obtained at the n-th toss. It is indeed a random variable since for all $a_n \in \{0, 1\}$, $\{\omega ; X_n(\omega) = a_n\} = \{\omega ; x_n = a_n\} \in \mathcal{F}$, by definition of $\mathcal{F}$.

The following are elementary remarks. Let E and F be countable sets. Let X be a random variable with values in E, and let f : E → F be a function. Then Y := f (X) is a random variable.

Proof. Let y ∈ F . The set {ω; Y (ω) = y} is in $\mathcal{F}$ since it is a countable union of sets in $\mathcal{F}$, namely:
\[
\{Y = y\} = \bigcup_{x \in E;\, f(x) = y} \{X = x\} .
\]

Let $E_1$ and $E_2$ be countable sets. Let $X_1$ and $X_2$ be random variables with values in $E_1$ and $E_2$, respectively. Then $Y := (X_1, X_2)$ is a random variable with values in $E = E_1 \times E_2$.

Proof. Let $x = (x_1, x_2) \in E$. The set {ω; Y (ω) = x} is in $\mathcal{F}$ since it is the intersection of sets in $\mathcal{F}$, namely:
\[
\{Y = x\} = \{X_1 = x_1\} \cap \{X_2 = x_2\} .
\]

Definition 1.3.5 Let X be a discrete random variable taking its values in E. Its probability distribution function is the function π : E → [0, 1], where
\[
\pi(x) := P(X = x) \qquad (x \in E) .
\]

Example 1.3.6: The gambler’s fortune. This is a continuation of the coin tosses example (Example 1.1.10). The number of occurrences of heads in n tosses is $S_n = X_1 + \cdots + X_n$. This random variable is the fortune at time n of a gambler systematically betting on heads. It takes integer values from 0 to n. We have
\[
P(S_n = k) = \binom{n}{k} \frac{1}{2^n} .
\]

Proof. The event $\{S_n = k\}$ is “k among $X_1, \ldots, X_n$ are equal to 1”. There are $\binom{n}{k}$ distinct ways of assigning k values of 1 and n − k values of 0 to $X_1, \ldots, X_n$, and all have the same probability $2^{-n}$.


Remark 1.3.7 One may have to prove that a random variable X, taking its values in $\overline{\mathbb{N}} := \mathbb{N} \cup \{\infty\}$ (and therefore for which the value ∞ is a priori possible) is in fact almost surely finite, that is, to prove that P (X = ∞) = 0 or, equivalently, that P (X < ∞) = 1. Since
\[
\{X < \infty\} = \bigcup_{n=0}^{\infty} \{X = n\} ,
\]
we have
\[
P(X < \infty) = \sum_{n=0}^{\infty} P(X = n) .
\]
(This remark provides an opportunity to recall that in an expression such as $\sum_{n=0}^{\infty}$, the sum is over $\mathbb{N}$ and does not include ∞ as the notation seems to suggest. A less ambiguous notation would be $\sum_{n \in \mathbb{N}}$. If we want to sum over all integers plus ∞, we shall always use the notation $\sum_{n \in \overline{\mathbb{N}}}$.)

Expectation for Discrete Random Variables

Definition 1.3.8 Let X be a discrete random variable taking its values in a countable set E and let the function g : E → R be either non-negative or such that it satisfies the absolute summability condition
\[
\sum_{x \in E} |g(x)|\,P(X = x) < \infty .
\]
Then one defines E[g(X)], the expectation of g(X), by the formula
\[
E[g(X)] := \sum_{x \in E} g(x)\,P(X = x) .
\]
If the absolute summability condition is satisfied, the random variable g(X) is called integrable, and in this case the expectation E[g(X)] is a finite number. If it is only assumed that g is non-negative, the expectation may well be infinite.

Example 1.3.9: The gambler’s fortune. This is a continuation of Example 1.3.6. Consider the random variable $S_n = X_1 + \cdots + X_n$ taking its values in {0, 1, . . . , n}. Its expectation is $E[S_n] = n/2$, as the following straightforward computation shows:
\[
\begin{aligned}
E[S_n] &= \sum_{k=0}^{n} k\,P(S_n = k) \\
       &= \frac{1}{2^n} \sum_{k=1}^{n} k\, \frac{n!}{k!\,(n-k)!} \\
       &= \frac{n}{2^n} \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,((n-1)-(k-1))!} \\
       &= \frac{n}{2^n} \sum_{j=0}^{n-1} \frac{(n-1)!}{j!\,(n-1-j)!} = \frac{n}{2^n}\, 2^{n-1} = \frac{n}{2} .
\end{aligned}
\]
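The definition of expectation translates directly into a finite sum when the distribution has finite support. The sketch below, with hypothetical helper names `expectation` and `fair_coin_heads_pmf`, checks the value n/2 obtained above in exact arithmetic; it is an illustration, not part of the formal development.

```python
from fractions import Fraction
from math import comb

def expectation(g, pmf):
    """E[g(X)] = sum over x of g(x) P(X = x), for a finitely supported pmf."""
    return sum(g(x) * p for x, p in pmf.items())

def fair_coin_heads_pmf(n):
    """P(S_n = k) = C(n, k) / 2^n for k = 0, ..., n."""
    return {k: Fraction(comb(n, k), 2 ** n) for k in range(n + 1)}

n = 10
pmf = fair_coin_heads_pmf(n)
mean = expectation(lambda k: k, pmf)
print(mean)
```

Passing another function `g`, for instance `lambda k: k * k`, gives the corresponding expectation with the same code.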


Example 1.3.10: Finite random variables with infinite expectations. One should be aware that a discrete random variable taking finite values may have an infinite expectation. The canonical example is the random variable X taking its values in E = N and with probability distribution
\[
P(X = n) = \frac{1}{c\,n^2} \qquad (n \ge 1),
\]
where the constant c is chosen such that
\[
P(X < \infty) = \sum_{n=1}^{\infty} P(X = n) = \sum_{n=1}^{\infty} \frac{1}{c\,n^2} = 1
\]
(that is, $c = \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$). Indeed, the expectation of X is
\[
E[X] = \sum_{n=1}^{\infty} n\,P(X = n) = \sum_{n=1}^{\infty} n\, \frac{1}{c\,n^2} = \sum_{n=1}^{\infty} \frac{1}{c\,n} = \infty .
\]

Remark 1.3.11 The above example seems artificial. It is however not pathological, and there are a lot more natural occurrences of the phenomenon. Consider for instance Example 1.3.9, and let T be the first integer n (necessarily even) such that 2Sn − n = 0. (The quantity 2Sn − n is the fortune at time n of a gambler systematically betting on heads.) Then as it turns out and as we shall prove later (in Example 6.3.6), T is a finite random variable with infinite expectation.

The telescope formula below gives an alternative way of computing the expectation of an integer-valued random variable.

Theorem 1.3.12 For a random variable X taking its values in N,
\[
E[X] = \sum_{n=1}^{\infty} P(X \ge n) .
\]

Proof.
\[
\begin{aligned}
E[X] &= P(X = 1) + 2P(X = 2) + 3P(X = 3) + \cdots \\
     &= P(X = 1) + P(X = 2) + P(X = 3) + \cdots \\
     &\quad + P(X = 2) + P(X = 3) + \cdots \\
     &\quad + P(X = 3) + \cdots
\end{aligned}
\]
The n-th row of the last array sums to P (X ≥ n).
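The telescope formula can be checked on any finitely supported distribution. The following sketch compares the direct sum with the sum of tail probabilities for a fair die; the helper names are hypothetical and the die is only a convenient test case.

```python
from fractions import Fraction

def mean_direct(pmf):
    """E[X] computed as the sum of n P(X = n)."""
    return sum(n * p for n, p in pmf.items())

def mean_by_tails(pmf):
    """E[X] computed as the sum over n >= 1 of P(X >= n) (telescope formula)."""
    top = max(pmf)
    return sum(
        sum(p for m, p in pmf.items() if m >= n)  # P(X >= n)
        for n in range(1, top + 1)
    )

die = {face: Fraction(1, 6) for face in range(1, 7)}
print(mean_direct(die), mean_by_tails(die))
```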

Basic Properties of Expectation

Let A be some event. The expectation of the indicator random variable $X = 1_A$ is
\[
E[1_A] = P(A) .
\]
(We call this the expectation formula for indicator functions.)

Proof. The random variable $X = 1_A$ takes the value 1 with probability P (X = 1) = P (A) and the value 0 with probability $P(X = 0) = P(\overline{A}) = 1 - P(A)$. Therefore, E[X] = 0 × P (X = 0) + 1 × P (X = 1) = P (X = 1) = P (A).

Theorem 1.3.13 Let $g_1$ and $g_2$ be functions from E to R such that $g_1(X)$ and $g_2(X)$ are integrable (resp., non-negative), and let $\lambda_1, \lambda_2 \in \mathbb{R}$ (resp., $\in \mathbb{R}_+$). Then
\[
E[\lambda_1 g_1(X) + \lambda_2 g_2(X)] = \lambda_1 E[g_1(X)] + \lambda_2 E[g_2(X)]
\]
(linearity of expectation). Also, if $g_1(x) \le g_2(x)$ for all x ∈ E,
\[
E[g_1(X)] \le E[g_2(X)]
\]
(monotonicity of expectation). Finally, for any function g such that g(X) is integrable, we have the triangle inequality
\[
|E[g(X)]| \le E[|g(X)|] .
\]

Proof. These properties follow directly from the corresponding properties of series.

Example 1.3.14: The matching paradox. There are n boxes $B_1, \cdots, B_n$ and n objects $O_1, \cdots, O_n$. These objects are placed “at random” in the boxes, one and only one per box. What is the average number of matchings, that is, of boxes that receive an object with the same index? The problem will be stated mathematically in a way that gives meaning to the phrase “at random”. Let $\Pi_n$ be the set of permutations of {1, 2, . . . , n}. Let σ be a random permutation, that is, $P(\sigma = \sigma^{(0)}) = \frac{1}{n!}$ for all $\sigma^{(0)} \in \Pi_n$. The random placement is assimilated to such a random permutation, and a matching at position (box) i is said to occur if $\sigma_i = i$. Let $X_i = 1$ if a matching occurs at position i, and $X_i = 0$ otherwise. The total number of matches is therefore $Z_n := \sum_{i=1}^{n} X_i$, so that the average number of matches is
\[
E[Z_n] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} P(X_i = 1) .
\]
By symmetry, $P(X_i = 1) = P(X_1 = 1)$ (1 ≤ i ≤ n), so that
\[
E[Z_n] = n\,P(X_1 = 1) .
\]
But there are (n − 1)! permutations $\sigma^{(0)}$ such that $\sigma^{(0)}_1 = 1$, each one occurring with probability $\frac{1}{n!}$, so that
\[
P(X_1 = 1) = \frac{(n-1)!}{n!} = \frac{1}{n} .
\]
Therefore
\[
E[Z_n] = n \times \frac{1}{n} = 1 .
\]
This number remains constant and does not increase with n as one (maybe) expects!
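For small n the result of the matching example can be verified by brute force, enumerating all n! permutations. The sketch below is only a sanity check; `mean_matches` is a hypothetical name, and positions are indexed from 0 rather than 1 for convenience.

```python
from fractions import Fraction
from itertools import permutations

def mean_matches(n):
    """Average number of fixed points of a uniformly chosen permutation of n items."""
    perms = list(permutations(range(n)))
    total_matches = sum(
        sum(1 for position, value in enumerate(sigma) if value == position)
        for sigma in perms
    )
    return Fraction(total_matches, len(perms))

for n in range(1, 7):
    print(n, mean_matches(n))
```

The enumeration grows like n!, so this is practical only for small n; linearity of expectation gives the answer for all n at once.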


Mean and Variance

Definition 1.3.15 Let X be a random variable such that E[|X|] < ∞ (X is integrable). In this case (and only in this case) the mean μ of X is defined by
\[
\mu := E[X] = \sum_{n=0}^{+\infty} n\,P(X = n) .
\]

From the inequality $|a| \le 1 + a^2$, true for all a ∈ R, we have that $|X| \le 1 + X^2$, and therefore, by the monotonicity and linearity properties of expectation, $E[|X|] \le 1 + E[X^2]$ (we also used the fact that E[1] = 1). Therefore if $E[X^2] < \infty$ (in which case we say that X is square-integrable) then X is integrable. The following definition then makes sense.

Definition 1.3.16 Let X be a square-integrable random variable. Its variance is, by definition, the quantity
\[
\sigma^2 := E[(X - \mu)^2] = \sum_{n=0}^{+\infty} (n - \mu)^2\,P(X = n) .
\]
The variance is also denoted by Var (X). From the linearity of expectation, it follows that $E[(X - \mu)^2] = E[X^2] - 2\mu E[X] + \mu^2$, that is,
\[
\mathrm{Var}(X) = E[X^2] - \mu^2 .
\]
The mean is the “center of inertia” of a random variable. More precisely,

Theorem 1.3.17 Let X be a real random variable with mean μ and finite variance $\sigma^2$. Then, for all a ∈ R, a ≠ μ,
\[
E[(X - a)^2] > E[(X - \mu)^2] = \sigma^2 .
\]

Proof.
\[
\begin{aligned}
E\left[(X - a)^2\right] &= E\left[((X - \mu) + (\mu - a))^2\right] \\
&= E\left[(X - \mu)^2\right] + (\mu - a)^2 + 2(\mu - a)\,E[X - \mu] \\
&= E\left[(X - \mu)^2\right] + (\mu - a)^2 > E\left[(X - \mu)^2\right] .
\end{aligned}
\]

Independent Variables

Definition 1.3.18 Two discrete random variables X and Y are called independent if
\[
P(X = i, Y = j) = P(X = i)\,P(Y = j) \qquad (i, j \in E) .
\]

Remark 1.3.19 The left-hand side of the last display is P ({X = i} ∩ {Y = j}). This is a general feature of the notational system: commas replace intersection signs. For instance, P (A, B) is the probability that both events A and B occur.


Definition 1.3.18 extends to a finite number of random variables:

Definition 1.3.20 The discrete random variables $X_1, \ldots, X_k$ taking their values in $E_1, \ldots, E_k$ respectively are said to be independent if for all $i_1 \in E_1, \ldots, i_k \in E_k$,
\[
P(X_1 = i_1, \ldots, X_k = i_k) = P(X_1 = i_1) \cdots P(X_k = i_k) .
\]

Definition 1.3.21 A sequence $\{X_n\}_{n \ge 1}$ of discrete random variables taking their values in the sets $\{E_n\}_{n \ge 1}$ respectively is called independent if any finite collection of distinct random variables $X_{i_1}, \ldots, X_{i_r}$ extracted from this sequence is independent. It is said to be iid (independent and identically distributed) if $E_n \equiv E$ for all n ≥ 1, if it is independent and if the probability distribution function of $X_n$ does not depend on n.

Example 1.3.22: Heads or tails, take 5. (Example 1.3.4 continued) We are going to show that the sequence $\{X_n\}_{n \ge 1}$ is iid. Therefore, we have a model for independent tosses of an unbiased coin.

Proof. Event $\{X_k = a_k\}$ is the direct sum of events $\{X_1 = a_1, \ldots, X_{k-1} = a_{k-1}, X_k = a_k\}$ for all possible values of $(a_1, \ldots, a_{k-1})$. Since there are $2^{k-1}$ such values and each one has probability $2^{-k}$, we have $P(X_k = a_k) = 2^{k-1} 2^{-k}$, that is,
\[
P(X_k = 1) = P(X_k = 0) = \tfrac{1}{2} .
\]
Therefore,
\[
P(X_1 = a_1, \ldots, X_k = a_k) = P(X_1 = a_1) \cdots P(X_k = a_k)
\]
for all $a_1, \ldots, a_k \in \{0, 1\}$, from which it follows by definition that $X_1, \ldots, X_k$ are independent random variables, and more generally that $\{X_n\}_{n \ge 1}$ is a family of independent random variables.

Definition 1.3.23 Let $\{X_n\}_{n \ge 1}$ and $\{Y_n\}_{n \ge 1}$ be sequences of discrete random variables taking their values in the sets $\{E_n\}_{n \ge 1}$ and $\{F_n\}_{n \ge 1}$, respectively. They are said to be independent if for any finite collection of random variables $X_{i_1}, \ldots, X_{i_r}$ and $Y_{j_1}, \ldots, Y_{j_s}$ extracted from their respective sequences, the discrete random variables $(X_{i_1}, \ldots, X_{i_r})$ and $(Y_{j_1}, \ldots, Y_{j_s})$ are independent. (This means that for all $a_1 \in E_{i_1}, \ldots, a_r \in E_{i_r}$, $b_1 \in F_{j_1}, \ldots, b_s \in F_{j_s}$,
\[
P\left( \left( \cap_{\ell=1}^{r} \{X_{i_\ell} = a_\ell\} \right) \cap \left( \cap_{m=1}^{s} \{Y_{j_m} = b_m\} \right) \right) = P\left( \cap_{\ell=1}^{r} \{X_{i_\ell} = a_\ell\} \right) P\left( \cap_{m=1}^{s} \{Y_{j_m} = b_m\} \right) .)
\]

The notion of conditional independence for events (Definition 1.2.14) extends naturally to discrete random variables.

Definition 1.3.24 Let X, Y , Z be random variables taking their values in the countable sets E, F , G, respectively. One says that X and Y are conditionally independent given Z if for all x, y, z in E, F , G, respectively, the events {X = x} and {Y = y} are conditionally independent given {Z = z}.

Recall that the events {X = x} and {Y = y} are said to be conditionally independent given {Z = z} if
\[
P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z)\,P(Y = y \mid Z = z) .
\]


The Product Formula for Expectations

Theorem 1.3.25 Let Y and Z be two independent discrete random variables with values in the countable sets F and G, respectively, and let v : F → R, w : G → R be functions that are either non-negative or such that v(Y ) and w(Z) are both integrable. Then
\[
E[v(Y)\,w(Z)] = E[v(Y)]\,E[w(Z)] .
\]

Proof. Consider the discrete random variable X with values in E = F × G defined by X = (Y, Z), and consider the function g : E → R defined by g(x) = v(y)w(z), where x = (y, z). We have, under the above stated conditions,
\[
\begin{aligned}
E[v(Y)\,w(Z)] = E[g(X)] &= \sum_{x \in E} g(x)\,P(X = x) \\
&= \sum_{y \in F} \sum_{z \in G} v(y)\,w(z)\,P(Y = y, Z = z) \\
&= \sum_{y \in F} \sum_{z \in G} v(y)\,w(z)\,P(Y = y)\,P(Z = z) \\
&= \left( \sum_{y \in F} v(y)\,P(Y = y) \right) \left( \sum_{z \in G} w(z)\,P(Z = z) \right) \\
&= E[v(Y)]\,E[w(Z)].
\end{aligned}
\]

For independent random variables, “variances add up”:

Corollary 1.3.26 Let $X_1, \ldots, X_n$ be independent square-integrable random variables with values in N. Then
\[
\sigma^2_{X_1 + \cdots + X_n} = \sigma^2_{X_1} + \cdots + \sigma^2_{X_n} . \tag{1.13}
\]

Proof. Let $\mu_1, \ldots, \mu_n$ be the respective means of $X_1, \ldots, X_n$. The mean of the sum $X := X_1 + \cdots + X_n$ is $\mu := \mu_1 + \cdots + \mu_n$. By the product formula for expectations, if i ≠ k,
\[
E[(X_i - \mu_i)(X_k - \mu_k)] = E[X_i - \mu_i]\,E[X_k - \mu_k] = 0.
\]
Therefore
\[
\begin{aligned}
\mathrm{Var}(X) = E\left[(X - \mu)^2\right] &= E\left[ \left( \sum_{i=1}^{n} (X_i - \mu_i) \right)^2 \right] \\
&= E\left[ \sum_{i=1}^{n} \sum_{k=1}^{n} (X_i - \mu_i)(X_k - \mu_k) \right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{n} E[(X_i - \mu_i)(X_k - \mu_k)] \\
&= \sum_{i=1}^{n} E\left[(X_i - \mu_i)^2\right] = \sum_{i=1}^{n} \mathrm{Var}(X_i).
\end{aligned}
\]
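Both the product formula and the additivity of variances can be checked exactly on a small case. The sketch below uses two independent fair dice as a test case; this choice, and the helper names `mean` and `variance`, are assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import product

die = {face: Fraction(1, 6) for face in range(1, 7)}

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Joint law of two independent dice: the product of the marginal laws.
joint = {(y, z): die[y] * die[z] for y, z in product(die, die)}

# E[YZ] computed from the joint law.
e_product = sum(y * z * p for (y, z), p in joint.items())

# Law of the sum Y + Z, obtained by grouping the joint law by value of y + z.
sum_pmf = {}
for (y, z), p in joint.items():
    sum_pmf[y + z] = sum_pmf.get(y + z, 0) + p

print(e_product, mean(die) ** 2)
print(variance(sum_pmf), 2 * variance(die))
```

Replacing `joint` by a law that is not a product (for instance, forcing both dice to show the same face) makes both equalities fail, which shows where independence is used.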




Remark 1.3.27 Note that means always add up, even when the random variables are not independent. Let X be a square-integrable random variable. Then, clearly, for any a ∈ R, aX is square-integrable and its variance is given by the formula
\[
\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X) .
\]

Example 1.3.28: Variance of the empirical mean. From this remark and Corollary 1.3.26, we immediately obtain that if $X_1, \ldots, X_n$ are independent and identically distributed square-integrable random variables with values in N with common variance $\sigma^2$, then
\[
\mathrm{Var}\left( \frac{X_1 + \cdots + X_n}{n} \right) = \frac{\sigma^2}{n} .
\]

1.3.2 Famous Discrete Probability Distributions

The Binomial Distribution

Consider an iid sequence $\{X_n\}_{n \ge 1}$ of random variables taking their values in the set {0, 1} and with a common distribution given by
\[
P(X_n = 1) = p \qquad (p \in (0, 1)) .
\]
This may be taken as a model for a game of heads and tails with a possibly biased coin (when $p \neq \frac{1}{2}$). Since $P(X_j = a_j) = p$ or 1 − p depending on whether $a_j = 1$ or 0, and since there are exactly $h(a) := \sum_{j=1}^{k} a_j$ coordinates of $a = (a_1, \ldots, a_k)$ that are equal to 1,
\[
P(X_1 = a_1, \ldots, X_k = a_k) = p^{h(a)} q^{k - h(a)} , \tag{1.14}
\]
where q := 1 − p. (The integer h(a) is called the Hamming weight of the binary vector a.) The heads and tails framework shelters two important discrete random variables: the binomial random variable and the geometric random variable.

The Binomial Distribution

Definition 1.3.29 A random variable X taking its values in the set E = {0, 1, . . . , n} and with the probability distribution
\[
P(X = i) = \binom{n}{i} p^i (1 - p)^{n-i} \qquad (0 \le i \le n)
\]
is called a binomial random variable of size n and parameter p ∈ (0, 1). This is denoted by X ∼ B(n, p).

Example 1.3.30: Number of heads in coin tossing. Define $S_n = X_1 + \cdots + X_n$. This random variable takes the values 0, 1, . . . , n. To obtain $S_n = i$, where 0 ≤ i ≤ n, one must have $X_1 = a_1, \ldots, X_n = a_n$ with $\sum_{j=1}^{n} a_j = i$. There are $\binom{n}{i}$ distinct ways of having this, and each occurs with probability $p^i (1 - p)^{n-i}$. Therefore, for 0 ≤ i ≤ n,
\[
P(S_n = i) = \binom{n}{i} p^i (1 - p)^{n-i} .
\]

Theorem 1.3.31 The mean and the variance of a binomial random variable X of size n and parameter p are respectively
\[
E[X] = np \quad \text{and} \quad \mathrm{Var}(X) = np(1 - p) .
\]

Proof. Consider the random variable $S_n$ of Example 1.3.30, which is a binomial random variable. We have, since expectations add up,
\[
E[S_n] = \sum_{i=1}^{n} E[X_i] = n\,E[X_1] ,
\]
and since the $X_i$’s are iid (and therefore in this case variances add up),
\[
\mathrm{Var}(S_n) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = n\,\mathrm{Var}(X_1) .
\]
Now,
\[
E[X_1] = 0 \times P(X_1 = 0) + 1 \times P(X_1 = 1) = P(X_1 = 1) = p ,
\]
and since $X_1^2 = X_1$,
\[
E\left[X_1^2\right] = E[X_1] = p .
\]
Therefore
\[
\mathrm{Var}(X_1) = E\left[X_1^2\right] - E[X_1]^2 = p - p^2 = p(1 - p).
\]
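The formulas for the mean and variance of a binomial random variable can be confirmed directly from the probability distribution. The sketch below is an illustration in exact arithmetic; the values n = 12 and p = 1/3 are arbitrary choices, and `binomial_pmf` is a hypothetical helper name.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """Distribution of B(n, p): P(X = i) = C(n, i) p^i (1 - p)^(n - i)."""
    return {i: comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)}

n, p = 12, Fraction(1, 3)  # arbitrary illustrative values
pmf = binomial_pmf(n, p)

binom_mean = sum(i * q for i, q in pmf.items())
binom_var = sum((i - binom_mean) ** 2 * q for i, q in pmf.items())

print(binom_mean, n * p)
print(binom_var, n * p * (1 - p))
```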

The Geometric Distribution

Definition 1.3.32 A random variable X taking its values in $\mathbb{N}_+ := \{1, 2, \ldots\}$ and with the distribution
\[
P(X = k) = (1 - p)^{k-1} p \qquad (k \ge 1) ,
\]
where 0 < p < 1, is called a geometric random variable with parameter p. This is denoted X ∼ Geo(p).

Example 1.3.33: First “heads” in the sequence. Let $\{X_n\}_{n \ge 1}$ be an iid sequence of random variables taking their values in the set {0, 1} with common distribution given by $P(X_n = 1) = p \in (0, 1)$. Define the random variable T to be the first time of occurrence of 1 in this sequence, that is,
\[
T = \inf\{n \ge 1 ; X_n = 1\} ,
\]
with the convention that if $X_n = 0$ for all n ≥ 1, then T = ∞. The event {T = k} is exactly $\{X_1 = 0, \ldots, X_{k-1} = 0, X_k = 1\}$, and therefore,
\[
P(T = k) = P(X_1 = 0) \cdots P(X_{k-1} = 0)\,P(X_k = 1) ,
\]
that is,
\[
P(T = k) = (1 - p)^{k-1} p .
\]

Theorem 1.3.34 The mean of a geometric random variable X with parameter p > 0 is
\[
E[X] = \frac{1}{p} .
\]

Proof.
\[
E[X] = \sum_{k=1}^{\infty} k\,(1 - p)^{k-1} p .
\]
But for α ∈ (0, 1),
\[
\sum_{k=1}^{\infty} k\,\alpha^{k-1} = \frac{d}{d\alpha} \left( \sum_{k=1}^{\infty} \alpha^k \right) = \frac{d}{d\alpha} \left( \frac{1}{1 - \alpha} - 1 \right) = \frac{1}{(1 - \alpha)^2} .
\]
Therefore, with α = 1 − p,
\[
E[X] = \frac{1}{p^2} \times p = \frac{1}{p} .
\]

Theorem 1.3.35 A geometric random variable T with parameter p ∈ (0, 1) is memoryless in the sense that for any integer $k_0 \ge 1$,
\[
P(T = k + k_0 \mid T > k_0) = P(T = k) \qquad (k \ge 1) .
\]

Proof. We first compute
\[
\begin{aligned}
P(T > k_0) &= \sum_{k = k_0 + 1}^{\infty} (1 - p)^{k-1} p \\
&= p\,(1 - p)^{k_0} \sum_{n=0}^{\infty} (1 - p)^n = \frac{p\,(1 - p)^{k_0}}{1 - (1 - p)} = (1 - p)^{k_0} .
\end{aligned}
\]
Therefore,
\[
\begin{aligned}
P(T = k_0 + k \mid T > k_0) &= \frac{P(T = k_0 + k, T > k_0)}{P(T > k_0)} = \frac{P(T = k_0 + k)}{P(T > k_0)} \\
&= \frac{p\,(1 - p)^{k + k_0 - 1}}{(1 - p)^{k_0}} = p\,(1 - p)^{k-1} = P(T = k) .
\end{aligned}
\]
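The memoryless property can be checked exactly for any rational parameter. The sketch below is illustrative; the function names are hypothetical and the parameter value 1/4 is an arbitrary choice.

```python
from fractions import Fraction

def geo_pmf(k, p):
    """P(T = k) = (1 - p)^(k - 1) p for k >= 1."""
    return (1 - p) ** (k - 1) * p

def geo_tail(k0, p):
    """P(T > k0) = (1 - p)^k0, as computed in the proof above."""
    return (1 - p) ** k0

def conditional_pmf(k, k0, p):
    """P(T = k0 + k | T > k0), from the definition of conditional probability."""
    return geo_pmf(k0 + k, p) / geo_tail(k0, p)

p = Fraction(1, 4)  # arbitrary illustrative parameter
print([conditional_pmf(k, 5, p) == geo_pmf(k, p) for k in range(1, 6)])
```

A truncated sum of `k * geo_pmf(k, p)` also approaches 1/p, in line with Theorem 1.3.34.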

Example 1.3.36: The coupon collector. Each chocolate tablet of a certain brand contains a coupon, randomly and independently chosen among n types. A prize may be claimed once the chocolate amateur has gathered a collection containing a subset with all the types of coupons. What is the average value of the number X of chocolate tablets bought when this happens for the first time?

Solution: Let $X_i$ (0 ≤ i ≤ n − 1) be the number of tablets bought during the time where there are exactly i different types of coupons in the collector’s box, so that
\[
X = \sum_{i=0}^{n-1} X_i .
\]
Each $X_i$ is a geometric random variable with parameter $p_i = 1 - \frac{i}{n}$, and therefore $E[X_i] = \frac{1}{p_i} = \frac{n}{n - i}$. In particular,
\[
E[X] = \sum_{i=0}^{n-1} E[X_i] = n \sum_{i=1}^{n} \frac{1}{i} .
\]
We can have a more precise idea of how far away from its mean the random variable X can be. Observing that $\left| \sum_{i=1}^{n} 1/i - \ln n \right| \le 1$, we have that $|E[X] - n \ln n| \le n$. We shall now prove that for all c > 0,
\[
P(X > n \ln n + cn) \le e^{-c} . \tag{$\star$}
\]
For this, define $A_\alpha$ to be the event that no coupon of type α shows up in the first n ln n + cn tablets. Then (by sub-additivity)
\[
\begin{aligned}
P(X > n \ln n + cn) = P\left( \cup_{\alpha=1}^{n} A_\alpha \right) &\le \sum_{\alpha=1}^{n} P(A_\alpha) \\
&= \sum_{\alpha=1}^{n} \left( 1 - \frac{1}{n} \right)^{n \ln n + cn} = n \left( 1 - \frac{1}{n} \right)^{n \ln n + cn} ,
\end{aligned}
\]
and therefore, since $1 + x \le e^x$ for all x ∈ R,
\[
P(X > n \ln n + cn) \le n \left( e^{-\frac{1}{n}} \right)^{n \ln n + cn} = n\,e^{-\ln n - c} = e^{-c} .
\]

Remark 1.3.37 An inequality such as ($\star$) is called a concentration inequality. With E[X] replaced by its approximation n ln n, it reads
\[
P\left( \frac{X - E[X]}{E[X]} > \frac{c}{\ln n} \right) \le e^{-c} ,
\]
which explains the terminology, since $\frac{X - E[X]}{E[X]}$ measures the relative dispersion of X around its mean.
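The mean of the coupon collector's time and the intermediate bound used in the proof of the concentration inequality can be evaluated numerically. The sketch below is an illustration; `expected_purchases` and `union_bound` are hypothetical names, and the exponent n ln n + cn is used as a real number, as in the text.

```python
from fractions import Fraction
from math import exp, log

def expected_purchases(n):
    """E[X] = n (1 + 1/2 + ... + 1/n), in exact arithmetic."""
    return n * sum(Fraction(1, i) for i in range(1, n + 1))

def union_bound(n, c):
    """n (1 - 1/n)^(n ln n + c n): the sub-additivity bound, before using 1 + x <= e^x."""
    return n * (1.0 - 1.0 / n) ** (n * log(n) + c * n)

n, c = 50, 2.0  # arbitrary illustrative values
print(float(expected_purchases(n)), n * log(n))
print(union_bound(n, c), exp(-c))
```

The first line shows the mean within n of n ln n; the second shows the union bound sitting below e^{-c}.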

The Poisson Distribution

Definition 1.3.38 A random variable X taking its values in N and such that for all k ≥ 0,
\[
P(X = k) = e^{-\theta} \frac{\theta^k}{k!} , \tag{1.15}
\]
is called a Poisson random variable with parameter θ > 0. This is denoted by X ∼ Poi(θ).

Example 1.3.39: Poisson’s law of rare events, take 1. A veterinary surgeon of the Prussian army collecting data relative to accidents due to horse kicks found that the yearly number of such casualties was approximately following a Poisson distribution. Here is an explanation of his findings. Suppose that you play “heads or tails” for a large number n of (independent) tosses of a coin such that
\[
P(X_i = 1) = \frac{\alpha}{n} .
\]
In the example, n is the (large) number of soldiers, $X_i = 1$ if the i-th soldier was hurt and $X_i = 0$ otherwise. Let $S_n$ be the total number of heads (wounded soldiers) and let $p_n(k) := P(S_n = k)$. It turns out that
\[
\lim_{n \uparrow \infty} p_n(k) = e^{-\alpha} \frac{\alpha^k}{k!} \tag{$\star$}
\]
(with the convention 0! = 1). (The average number of heads is α and the choice $P(X_i = 1) = \frac{\alpha}{n}$ guarantees this. Letting n ↑ ∞ accounts for n being large but unknown.) Here is the proof of this result, which is known as Poisson’s law of rare events. As we know, the random variable $S_n$ follows a binomial law:
\[
P(S_n = k) = \binom{n}{k} \left( \frac{\alpha}{n} \right)^k \left( 1 - \frac{\alpha}{n} \right)^{n-k}
\]
of mean $n \times \frac{\alpha}{n} = \alpha$. With $p_n(k) = P(S_n = k)$, we see that $p_n(0) = \left( 1 - \frac{\alpha}{n} \right)^n \to e^{-\alpha}$ as n ↑ ∞. Also,
\[
\frac{p_n(k+1)}{p_n(k)} = \frac{n - k}{k + 1} \cdot \frac{\frac{\alpha}{n}}{1 - \frac{\alpha}{n}} \to \frac{\alpha}{k + 1}
\]
as n ↑ ∞. Therefore, ($\star$) holds true for all k ≥ 0. The limit distribution is therefore a Poisson distribution of mean α.

Theorem 1.3.40 The mean of a Poisson random variable with parameter θ > 0 is given by
\[
E[X] = \theta ,
\]
and its variance is
\[
\mathrm{Var}(X) = \theta .
\]

Proof. We have
\[
\begin{aligned}
E[X] &= e^{-\theta} \sum_{k=1}^{\infty} \frac{k\,\theta^k}{k!} = e^{-\theta}\,\theta \sum_{k=1}^{\infty} \frac{\theta^{k-1}}{(k-1)!} \\
&= e^{-\theta}\,\theta \sum_{j=0}^{\infty} \frac{\theta^j}{j!} = e^{-\theta}\,\theta\,e^{\theta} = \theta .
\end{aligned}
\]

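Also:
\[
\begin{aligned}
E\left[X^2 - X\right] &= e^{-\theta} \sum_{k=0}^{\infty} \left( k^2 - k \right) \frac{\theta^k}{k!} = e^{-\theta} \sum_{k=2}^{\infty} k(k-1) \frac{\theta^k}{k!} \\
&= e^{-\theta}\,\theta^2 \sum_{k=2}^{\infty} \frac{\theta^{k-2}}{(k-2)!} = e^{-\theta}\,\theta^2 \sum_{j=0}^{\infty} \frac{\theta^j}{j!} = e^{-\theta}\,\theta^2\,e^{\theta} = \theta^2 .
\end{aligned}
\]
Therefore,
\[
\begin{aligned}
\mathrm{Var}(X) &= E\left[X^2\right] - E[X]^2 \\
&= E\left[X^2 - X\right] + E[X] - E[X]^2 = \theta^2 + \theta - \theta^2 = \theta .
\end{aligned}
\]

Theorem 1.3.41 Let $X_1$ and $X_2$ be two independent Poisson random variables with means $\theta_1 > 0$ and $\theta_2 > 0$, respectively. Then $X = X_1 + X_2$ is a Poisson random variable with mean $\theta = \theta_1 + \theta_2$.

Proof. For k ≥ 0,
\[
\begin{aligned}
P(X = k) &= P(X_1 + X_2 = k) \\
&= P\left( \cup_{i=0}^{k} \{X_1 = i, X_2 = k - i\} \right) \\
&= \sum_{i=0}^{k} P(X_1 = i, X_2 = k - i) \\
&= \sum_{i=0}^{k} P(X_1 = i)\,P(X_2 = k - i) \\
&= \sum_{i=0}^{k} e^{-\theta_1} \frac{\theta_1^i}{i!}\, e^{-\theta_2} \frac{\theta_2^{k-i}}{(k-i)!} \\
&= \frac{e^{-(\theta_1 + \theta_2)}}{k!} \sum_{i=0}^{k} \frac{k!}{i!\,(k-i)!}\, \theta_1^i\, \theta_2^{k-i} \\
&= e^{-(\theta_1 + \theta_2)} \frac{(\theta_1 + \theta_2)^k}{k!} .
\end{aligned}
\]

Remark 1.3.42 See Example 1.4.8 for an alternative shorter proof using generating functions (defined in Subsection 1.4.1).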
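Two of the Poisson results above lend themselves to a quick numerical illustration: the law of rare events (binomial probabilities with parameter α/n approach Poisson probabilities) and the stability of the Poisson family under independent sums. The sketch below uses floating-point arithmetic; the helper names, the truncation at `k_max`, and the parameter values are assumptions made for the example.

```python
from math import comb, exp, factorial

def poisson_pmf(k, theta):
    """P(X = k) = e^(-theta) theta^k / k!."""
    return exp(-theta) * theta ** k / factorial(k)

def binomial_pmf(k, n, p):
    """P(S_n = k) for a binomial law of size n and parameter p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def max_gap(n, alpha, k_max=20):
    """Largest gap between B(n, alpha/n) and Poi(alpha) over k = 0, ..., k_max."""
    return max(
        abs(binomial_pmf(k, n, alpha / n) - poisson_pmf(k, alpha))
        for k in range(k_max + 1)
    )

def convolve(theta1, theta2, k):
    """P(X1 + X2 = k) for independent Poisson variables, by direct summation."""
    return sum(poisson_pmf(i, theta1) * poisson_pmf(k - i, theta2) for i in range(k + 1))

for n in (100, 1_000, 10_000):
    print(n, max_gap(n, alpha=2.0))
print(convolve(1.5, 2.5, 3), poisson_pmf(3, 4.0))
```

The printed gaps shrink as n grows, and the convolution agrees with the Poisson probability of mean θ1 + θ2 up to rounding.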

The Multinomial Distribution

Consider the random vector $X = (X_1, \ldots, X_n)$ where all the random variables $X_i$ take their values in the same countable space E (this restriction is not essential, but it simplifies the notation). Let $\pi : E^n \to \mathbb{R}_+$ be a function such that
\[
\sum_{x \in E^n} \pi(x) = 1 .
\]
The discrete random vector X above is said to admit the probability distribution π if
\[
P(X = x) = \pi(x) \qquad (x \in E^n) .
\]
In fact, there is nothing new here with respect to previous definitions, since X is a discrete random variable taking its values in the countable set $\mathcal{X} := E^n$.

Example 1.3.43: Multinomial random vector. We place, independently of one another, k balls in n boxes $B_1, \ldots, B_n$, with probability $p_i$ for a given ball to be assigned to box $B_i$. Of course,
\[
\sum_{i=1}^{n} p_i = 1 .
\]
After placing all the balls in the boxes, there are $X_i$ balls in box $B_i$, where
\[
\sum_{i=1}^{n} X_i = k .
\]
The random vector $X = (X_1, \ldots, X_n)$ is a multinomial vector of size (n, k) and parameters $p_1, \ldots, p_n$, that is, its probability distribution is
\[
P(X_1 = m_1, \ldots, X_n = m_n) = \frac{k!}{\prod_{i=1}^{n} (m_i)!} \prod_{i=1}^{n} p_i^{m_i} ,
\]
where $m_1 + \cdots + m_n = k$.

Proof. Observe that (α): there are $k! / \prod_{i=1}^{n} (m_i)!$ distinct ways of placing k balls in n boxes in such a manner that $m_1$ balls are in box $B_1$, $m_2$ are in $B_2$, etc., and (β): each of these distinct ways occurs with the same probability $\prod_{i=1}^{n} p_i^{m_i}$.
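The multinomial formula is easy to evaluate and to sanity-check by summing over all admissible count vectors. The sketch below works in exact arithmetic; the function name `multinomial_pmf` and the box probabilities (1/2, 1/3, 1/6) are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(X_1 = m_1, ..., X_n = m_n) for k = sum(counts) balls and box probabilities probs."""
    k = sum(counts)
    # Number of ways to place k balls with the prescribed counts per box (integer).
    ways = factorial(k) // prod(factorial(m) for m in counts)
    return ways * prod(p ** m for p, m in zip(probs, counts))

probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]  # illustrative values
k = 4

# Sum over all count vectors (m_1, m_2, m_3) with m_1 + m_2 + m_3 = k.
total = sum(
    multinomial_pmf(counts, probs)
    for counts in product(range(k + 1), repeat=len(probs))
    if sum(counts) == k
)
print(multinomial_pmf((2, 1, 1), probs), total)
```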

Random Graphs A graph is a discrete object and therefore random graphs are, from a purely formal point of view, discrete random variables. The random graphs considered below are in fact described by a finite collection of iid {0, 1}-valued random variables. A (finite) graph G = (V, E) consists of a finite collection V of vertices v and of a collection E of unordered pairs of distinct vertices, u, v, called the edges. If u, v ∈ E, then u and v are called neighbors, and this is also denoted by u ∼ v. The degree of vertex v ∈ V is the number of edges stemming from it. Definition 1.3.44 (Gilbert, 1959) Let n be a fixed positive integer and let V = {1, 2, . . . , n} be a finite set of vertices. To each unordered pair of distinct vertices u, v, associate a random variable X u,v taking its values in {0, 1} and suppose that all such variables are iid with probability p ∈ (0, 1) for the value 1. This defines a random graph denoted by G(n, p), a random element taking its values in the (finite) set of all graphs with vertices {1, 2, . . . , n} and admitting for an edge the unordered pair of vertices u, v if and only if X u,v = 1.


Note that $G(n, p)$ is indeed a discrete random variable (taking its values in the finite set consisting of the collection of graphs with vertex set $V = \{1, 2, \ldots, n\}$). Similarly, the set $E_{n,p}$ of edges of $G(n, p)$ is also a discrete random variable. If we call any unordered pair of vertices $\langle u, v \rangle$ a potential edge (there are $\binom{n}{2}$ such edges forming the set $E_n$), $G(n, p)$ is constructed by accepting a potential edge as one of its edges with probability $p$, independently of all other potential edges. The probability of occurrence of a graph $G = (V, E)$ with exactly $m$ edges is then
\[ P(G(n, p) = G) = P(E_{n,p} = E) = p^{m} (1 - p)^{\binom{n}{2} - m} . \]

Note that the degree of a given vertex, that is, the number of edges stemming from it, is a binomial random variable $B(n - 1, p)$. In particular, the average degree is $d = (n - 1)p$. Another type of random graph is the Erdős–Rényi random graph (Definition 1.3.45 below). It is closely related to the Gilbert graph (Exercise 1.6.27).

Definition 1.3.45 (Erdős and Rényi, 1959) Consider the collection $\mathcal{G}_m$ of graphs $G = (V, E)$ where $V = \{1, 2, \ldots, n\}$ with exactly $m$ edges ($|E| = m$). There are $\binom{\binom{n}{2}}{m}$ such graphs. The Erdős–Rényi random graph $G_{n,m}$ is a random graph uniformly distributed on $\mathcal{G}_m$.
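For a very small number of vertices, the statements above can be verified by listing every graph. The sketch below assumes $n = 4$ and $p = 1/3$ (values chosen only for illustration; the helper names are not from the text). It enumerates all graphs on $\{1, \ldots, n\}$, weights each by $p^{m}(1-p)^{\binom{n}{2}-m}$, and checks that the weights sum to 1 and that the mean degree of a vertex is $(n-1)p$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def gilbert_probability(num_edges, n, p):
    """Probability that G(n, p) equals one fixed graph with num_edges edges."""
    return p ** num_edges * (1 - p) ** (comb(n, 2) - num_edges)


def all_graphs(n):
    """Yield every graph on vertices 1..n as a frozenset of unordered pairs."""
    potential_edges = list(combinations(range(1, n + 1), 2))
    for r in range(len(potential_edges) + 1):
        for edges in combinations(potential_edges, r):
            yield frozenset(edges)


# Illustrative parameters (assumed).
n, p = 4, Fraction(1, 3)
graphs = list(all_graphs(n))

# Total probability over all graphs.
total = sum(gilbert_probability(len(g), n, p) for g in graphs)

# Expected degree of vertex 1, computed directly from the distribution.
mean_degree = sum(
    gilbert_probability(len(g), n, p) * sum(1 for edge in g if 1 in edge)
    for g in graphs
)
print(len(graphs), total, mean_degree)
```

All graphs with the same number $m$ of edges receive the same weight, which is the observation behind Exercise 1.6.27.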

1.3.3

Conditional Expectation

This subsection introduces the concept of conditional expectation given a random element (variable or vector).³ Let $Z$ be a discrete random variable with values in $E$, and let $f : E \to \mathbb{R}$ be a non-negative function. Let $A$ be some event of positive probability. The conditional expectation of $f(Z)$ given $A$, denoted by $E[f(Z) \mid A]$, is by definition the expectation when the distribution of $Z$ is replaced by its conditional distribution given $A$:
\[ E[f(Z) \mid A] := \sum_{z} f(z) P(Z = z \mid A) . \]
Let $\{A_i\}_{i \in \mathbb{N}}$ be a partition of the sample space. The following formula is then a direct consequence of Bayes' formula of total causes:
\[ E[f(Z)] = \sum_{i \in \mathbb{N}} E[f(Z) \mid A_i] \, P(A_i) . \]

³ The general theory of conditional expectation will be given in Section 3.3.

Example 1.3.46: The Poisson and multinomial distributions. Suppose we have $N$ bins in which we place balls in the following manner. The number $Y_i$ of balls in bin $i$ is a Poisson variable of mean $\frac{m}{N}$, and is independent of the numbers in the other bins. In particular, the total number of balls $Y_1 + \cdots + Y_N$ is, as the sum of independent Poisson random variables, a Poisson random variable whose mean is the sum of the means of the coordinates, that is, $m$. For a given integer $k$, we will compute the conditional probability that there are $k_1$ balls in bin 1, $k_2$ balls in bin 2, etc., given that the total number of balls is $k_1 + \cdots + k_N = k$:
\begin{align*}
P(Y_1 = k_1, \ldots, Y_N = k_N \mid Y_1 + \cdots + Y_N = k)
&= \frac{P(Y_1 = k_1, \ldots, Y_N = k_N, \; Y_1 + \cdots + Y_N = k)}{P(Y_1 + \cdots + Y_N = k)} \\
&= \frac{P(Y_1 = k_1, \ldots, Y_N = k_N)}{P(Y_1 + \cdots + Y_N = k)} .
\end{align*}
By independence of the $Y_i$'s, and since they are Poisson variables with mean $\frac{m}{N}$,
\[ P(Y_1 = k_1, \ldots, Y_N = k_N) = \prod_{i=1}^{N} e^{-\frac{m}{N}} \frac{\left(\frac{m}{N}\right)^{k_i}}{k_i!} . \]
Also,
\[ P(Y_1 + \cdots + Y_N = k) = e^{-m} \frac{m^k}{k!} . \]
Therefore
\[ P(Y_1 = k_1, \ldots, Y_N = k_N \mid Y_1 + \cdots + Y_N = k) = \frac{k!}{k_1! \cdots k_N!} \left( \frac{1}{N} \right)^{k} . \]
But this is equal to $P(Z_1 = k_1, \ldots, Z_N = k_N)$, where $Z_i$ is the number of balls in bin $i$ when $k = k_1 + \cdots + k_N$ balls are placed independently and at random in the $N$ bins. Note that the above equality is independent of $m$.

The conditional expectation of some discrete random variable $Z$ given some other discrete random variable $Y$ is the expectation of $Z$ using the probability measure modified by the observation of $Y$. For instance, if $Y = y$, instead of the original probability assigning the mass $P(A)$ to the event $A$, we use the conditional probability given $Y = y$ assigning the mass $P(A \mid Y = y)$ to this event.

Definition 1.3.47 Let $X$ and $Y$ be two discrete random variables taking their values in the countable sets $F$ and $G$, respectively, and let $g : F \times G \to \mathbb{R}$ be either non-negative, or such that $E[|g(X, Y)|] < \infty$. Define for each $y \in G$ such that $P(Y = y) > 0$,
\[ \psi(y) = \sum_{x \in F} g(x, y) P(X = x \mid Y = y) , \tag{1.16} \]

and if $P(Y = y) = 0$, let $\psi(y) = 0$. This quantity is called the conditional expectation of $g(X, Y)$ given $Y = y$, and is denoted by $E^{Y=y}[g(X, Y)]$, or $E[g(X, Y) \mid Y = y]$. The random variable $\psi(Y)$ is called the conditional expectation of $g(X, Y)$ given $Y$, and is denoted by $E^{Y}[g(X, Y)]$ or $E[g(X, Y) \mid Y]$.

The sum in (1.16) is well defined (possibly infinite however) when $g$ is non-negative. Note that in the non-negative case, we have that
\begin{align*}
\sum_{y \in G} \psi(y) P(Y = y) &= \sum_{y \in G} \sum_{x \in F} g(x, y) P(X = x \mid Y = y) P(Y = y) \\
&= \sum_{x} \sum_{y} g(x, y) P(X = x, Y = y) = E[g(X, Y)] .
\end{align*}
In particular, if $E[g(X, Y)] < \infty$, then
\[ \sum_{y \in G} \psi(y) P(Y = y) < \infty , \]
which implies that $\psi(y) < \infty$ for all $y \in G$ such that $P(Y = y) > 0$. We observe (for reference in a few lines) that in this case, $\psi(Y) < \infty$ almost surely, that is to say $P(\psi(Y) < \infty) = 1$ (in fact, $P(\psi(Y) = \infty) = \sum_{y;\, \psi(y) = \infty} P(Y = y) = 0$).

Let now $g : F \times G \to \mathbb{R}$ be a function of arbitrary sign such that $E[|g(X, Y)|] < \infty$, and in particular $E[g^{\pm}(X, Y)] < \infty$. Denote by $\psi^{\pm}$ the functions associated to $g^{\pm}$ as in (1.16). As we just saw, for all $y \in G$, $\psi^{\pm}(y) < \infty$, and therefore $\psi(y) = \psi^{+}(y) - \psi^{-}(y)$ is well defined (not an indeterminate $\infty - \infty$ form). Thus, the conditional expectation is also well defined in the integrable case. From the observation made a few lines above, in this case, $|E^{Y}[g(X, Y)]| < \infty$.

Example 1.3.48: Binomial example. Let $X_1$ and $X_2$ be independent binomial random variables of the same size $N$ and same parameter $p$. We are going to show that
\[ E^{X_1 + X_2}[X_1] = h(X_1 + X_2) = \frac{X_1 + X_2}{2} . \]
We have
\begin{align*}
P(X_1 = k \mid X_1 + X_2 = n) &= \frac{P(X_1 = k) P(X_2 = n - k)}{P(X_1 + X_2 = n)} \\
&= \frac{\binom{N}{k} p^{k} (1 - p)^{N-k} \binom{N}{n-k} p^{n-k} (1 - p)^{N-n+k}}{\binom{2N}{n} p^{n} (1 - p)^{2N-n}} = \frac{\binom{N}{k} \binom{N}{n-k}}{\binom{2N}{n}} ,
\end{align*}
where we have used the fact that the sum of two independent binomial random variables with size $N$ and parameter $p$ is a binomial random variable with size $2N$ and parameter $p$. This is the hypergeometric distribution. The right-hand side of the last display is the probability of obtaining $k$ black balls when a sample of $n$ balls is randomly selected from an urn containing $N$ black balls and $N$ red balls. The mean of such a distribution is (by reason of symmetry) $\frac{n}{2}$, therefore
\[ E^{X_1 + X_2 = n}[X_1] = \frac{n}{2} = h(n) , \]
and this gives the announced result.
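The recipe of Definition 1.3.47 can be carried out mechanically on a finite joint distribution. The following sketch (function names and the values $N = 3$, $p = 1/3$ are assumptions made for illustration) builds the joint distribution of $X = X_1$ and $Y = X_1 + X_2$ for two independent binomial variables, computes $\psi(y)$ from (1.16), and compares it with $y/2$.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(k, size, p):
    """P(B = k) for a binomial variable of the given size and parameter p."""
    if k < 0 or k > size:
        return Fraction(0)
    return comb(size, k) * p ** k * (1 - p) ** (size - k)


def conditional_expectation(joint, g):
    """Return (psi, p_y) where psi[y] = sum_x g(x, y) P(X = x | Y = y), as in (1.16).

    joint maps pairs (x, y) to P(X = x, Y = y); psi[y] is set to 0 when P(Y = y) = 0.
    """
    p_y = {}
    for (x, y), prob in joint.items():
        p_y[y] = p_y.get(y, Fraction(0)) + prob
    psi = {y: Fraction(0) for y in p_y}
    for (x, y), prob in joint.items():
        if p_y[y] > 0:
            psi[y] += g(x, y) * prob / p_y[y]
    return psi, p_y


# Illustrative parameters (assumed).
N, p = 3, Fraction(1, 3)

# Joint distribution of (X1, X1 + X2) for independent binomial X1, X2.
joint = {
    (k, k + j): binomial_pmf(k, N, p) * binomial_pmf(j, N, p)
    for k in range(N + 1)
    for j in range(N + 1)
}
psi, p_y = conditional_expectation(joint, lambda x, y: x)
print({y: str(value) for y, value in sorted(psi.items())})
```

Summing $\psi(y) P(Y = y)$ over $y$ recovers $E[X_1] = Np$, in line with the identity $\sum_{y} \psi(y) P(Y = y) = E[g(X, Y)]$ shown above.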

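Example 1.3.49: Poisson example. Let $X_1$ and $X_2$ be two independent Poisson random variables with respective means $\theta_1 > 0$ and $\theta_2 > 0$. We seek to compute $E^{X_1 + X_2}[X_1]$, that is $E^{Y}[X]$, where $X = X_1$, $Y = X_1 + X_2$. Following the instructions of Definition 1.3.47, we must first compute (only for $y \ge x$, why?)
\begin{align*}
P(X = x \mid Y = y) &= \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{P(X_1 = x, X_1 + X_2 = y)}{P(X_1 + X_2 = y)} \\
&= \frac{P(X_1 = x, X_2 = y - x)}{P(X_1 + X_2 = y)} = \frac{P(X_1 = x) P(X_2 = y - x)}{P(X_1 + X_2 = y)} \\
&= \frac{e^{-\theta_1} \frac{\theta_1^{x}}{x!} \, e^{-\theta_2} \frac{\theta_2^{y-x}}{(y-x)!}}{e^{-(\theta_1 + \theta_2)} \frac{(\theta_1 + \theta_2)^{y}}{y!}}
= \binom{y}{x} \left( \frac{\theta_1}{\theta_1 + \theta_2} \right)^{x} \left( \frac{\theta_2}{\theta_1 + \theta_2} \right)^{y-x} .
\end{align*}
Therefore, letting $\alpha = \frac{\theta_1}{\theta_1 + \theta_2}$,
\[ \psi(y) = E^{Y=y}[X] = \sum_{x=0}^{y} x \binom{y}{x} \alpha^{x} (1 - \alpha)^{y-x} = \alpha y . \]
Finally, $E^{Y}[X] = \psi(Y) = \alpha Y$, that is,
\[ E^{X_1 + X_2}[X_1] = \frac{\theta_1}{\theta_1 + \theta_2} (X_1 + X_2) . \]

1.4

The Branching Process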

The branching process is also known as the Galton–Watson process. Sir Francis Galton, a cousin of Darwin, was interested in the survival probability of a given line of English peerage. He posed the problem in the Educational Times in 1873. In the same year and the same journal, Reverend Watson proposed the method of solution that has become a textbook classic, and thereby initiated an important branch of probability. The elementary theory of branching processes of Subsection 1.4.2 provides the opportunity to introduce the tool of generating functions.

1.4.1

Generating Functions

The computation of probabilities in discrete probability models often requires an enumeration of all the possible outcomes realizing a given event. Generating functions are very useful for this task, and more generally, for obtaining distribution functions of integer-valued random variables. In order to introduce this versatile tool, we shall need to define the expectation of a complex-valued function of an integer-valued variable.

Let $X$ be a discrete random variable with values in $\mathbb{N}$, and let $\varphi : \mathbb{N} \to \mathbb{C}$ be a complex function with real and imaginary parts $\varphi_R$ and $\varphi_I$, respectively. The expectation $E[\varphi(X)]$ is naturally defined by
\[ E[\varphi(X)] = E[\varphi_R(X)] + i E[\varphi_I(X)] , \]
provided that the expectations on the right-hand side are well defined and finite.

Definition 1.4.1 Let $X$ be an $\mathbb{N}$-valued random variable. Its generating function (gf) is the function $g : \overline{D}(0; 1) := \{z \in \mathbb{C};\ |z| \le 1\} \to \mathbb{C}$ defined by
\[ g(z) = E[z^{X}] = \sum_{k=0}^{\infty} P(X = k) z^{k} . \tag{1.17} \]

The power series associated with the sequence $\{P(X = n)\}_{n \ge 0}$ has a radius of convergence $R \ge 1$, since $\sum_{n=0}^{\infty} P(X = n) = 1 < \infty$. The domain of definition of $g$ could be, in specific cases, larger than the closed unit disk centered at the origin. In the next two examples, the domain of absolute convergence is the whole complex plane.


Example 1.4.2: The gf of the binomial variable. For the binomial random variable of size $n$ and parameter $p$,
\[ \sum_{k=0}^{n} P(X = k) z^{k} = \sum_{k=0}^{n} \binom{n}{k} (zp)^{k} (1 - p)^{n-k} , \]
and therefore
\[ g(z) = (1 - p + pz)^{n} . \tag{1.18} \]

Example 1.4.3: The gf of the Poisson variable. For the Poisson random variable of mean $\theta$,
\[ \sum_{k=0}^{\infty} P(X = k) z^{k} = e^{-\theta} \sum_{k=0}^{\infty} \frac{(\theta z)^{k}}{k!} , \]
and therefore
\[ g(z) = e^{\theta (z - 1)} . \tag{1.19} \]

Here is an example where the radius of convergence is finite.

Example 1.4.4: The gf of the geometric variable. For the geometric random variable of Definition 1.3.32,
\[ \sum_{k=1}^{\infty} P(X = k) z^{k} = \sum_{k=1}^{\infty} p (1 - p)^{k-1} z^{k} , \]
and therefore, with $q = 1 - p$,
\[ g(z) = \frac{pz}{1 - qz} . \]
The radius of convergence of this power series is $\frac{1}{q}$.
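The three closed forms can be compared numerically with truncated versions of the defining series (1.17). This is a rough sketch with assumed parameter values and an assumed helper name `gf_from_pmf`; the truncation lengths are chosen generously rather than optimally. For the geometric case the evaluation point is taken outside the unit disk but inside the disk of radius $1/q$.

```python
from math import comb, exp, factorial, isclose


def gf_from_pmf(pmf, z, terms):
    """Truncated power series sum_{k < terms} P(X = k) z^k."""
    return sum(pmf(k) * z ** k for k in range(terms))


# Binomial, size 5 and parameter 0.3 (assumed values); the series is a finite sum.
n, p = 5, 0.3
binomial_series = gf_from_pmf(lambda k: comb(n, k) * p ** k * (1 - p) ** (n - k), 1.5, n + 1)
binomial_closed = (1 - p + p * 1.5) ** n

# Poisson, mean 2 (assumed value); converges for every z.
theta = 2.0
poisson_series = gf_from_pmf(lambda k: exp(-theta) * theta ** k / factorial(k), 1.5, 80)
poisson_closed = exp(theta * (1.5 - 1))

# Geometric on {1, 2, ...} with p = 0.25, so q = 0.75 and the radius is 1/q = 4/3.
pg, q = 0.25, 0.75
geometric_series = gf_from_pmf(lambda k: pg * q ** (k - 1) if k >= 1 else 0.0, 1.2, 600)
geometric_closed = pg * 1.2 / (1 - q * 1.2)

print(binomial_series, binomial_closed)
print(poisson_series, poisson_closed)
print(geometric_series, geometric_closed)
```

Pushing the geometric evaluation point beyond $1/q$ makes the partial sums grow without bound, which is the finite radius of convergence at work.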

Moments from the Generating Function

Generating functions are powerful computational tools. First of all, they can be used to obtain moments of a discrete random variable.

Theorem 1.4.5 We have
\[ g'(1) = E[X] \tag{1.20} \]
and
\[ g''(1) = E[X(X - 1)] . \tag{1.21} \]

Proof. Inside the open disk centered at the origin and of radius $R$, the power series defining the generating function $g$ is continuous, and differentiable at any order term by term. In particular, differentiating both sides of (1.17) twice inside the open disk $D(0; R)$ gives
\[ g'(z) = \sum_{n=1}^{\infty} n P(X = n) z^{n-1} \tag{1.22} \]
and
\[ g''(z) = \sum_{n=2}^{\infty} n(n - 1) P(X = n) z^{n-2} . \tag{1.23} \]
When the radius of convergence $R$ is strictly larger than 1, we obtain the announced results by letting $z = 1$ in the previous identities. If $R = 1$, the same is basically true but the mathematical argument is more subtle. The difficulty is not with the right-hand side of (1.22), which is always well defined at $z = 1$, being equal to $\sum_{n=1}^{\infty} n P(X = n)$, a non-negative and possibly infinite quantity. The difficulty is that $g$ may not be differentiable at $z = 1$, a border point of the disk (here of radius 1) on which it is defined. However, by Abel's theorem (Theorem B.2.3), the limit as the real variable $x$ increases to 1 of $\sum_{n=1}^{\infty} n P(X = n) x^{n-1}$ is $\sum_{n=1}^{\infty} n P(X = n)$. Therefore $g'$, as a function on the real interval $[0, 1)$, can be extended to $[0, 1]$ by (1.20), and this extension preserves continuity. With this definition of $g'(1)$, Formula (1.20) holds true. Similarly, when $R = 1$, the function $g''$ defined on $[0, 1)$ by (1.23) is extended to a continuous function on $[0, 1]$ by defining $g''(1)$ by (1.21). □

Theorem 1.4.6 The generating function characterizes the distribution of a random variable.

This means the following. Suppose that, without knowing the distribution of $X$, you have been able to compute its generating function $g$, and that, moreover, you are able to give its power series expansion in a neighborhood of the origin:⁴
\[ g(z) = \sum_{n=0}^{\infty} a_n z^{n} . \]
Since $g(z)$ is the generating function of $X$,
\[ g(z) = \sum_{n=0}^{\infty} P(X = n) z^{n} , \]
and since the power series expansion around the origin is unique, the distribution of $X$ is identified as $P(X = n) = a_n$ for all $n \ge 0$. Similarly, if two $\mathbb{N}$-valued random variables $X$ and $Y$ have the same generating function, they have the same distribution. Indeed, the identity in a neighborhood of the origin of the power series:
\[ \sum_{n=0}^{\infty} P(X = n) z^{n} = \sum_{n=0}^{\infty} P(Y = n) z^{n} \]
implies the identity of their coefficients.

⁴ This is a common situation; see Theorem 1.4.10 for instance.

Theorem 1.4.7 Let $X$ and $Y$ be two independent integer-valued random variables with respective generating functions $g_X$ and $g_Y$. Then the sum $X + Y$ has the gf
\[ g_{X+Y}(z) = g_X(z) \times g_Y(z) . \tag{1.24} \]

Proof. Use the product formula for expectations:
\[ g_{X+Y}(z) = E\left[ z^{X+Y} \right] = E\left[ z^{X} z^{Y} \right] = E\left[ z^{X} \right] E\left[ z^{Y} \right] . \]
□

Example 1.4.8: Sums of independent Poisson variables. Let $X$ and $Y$ be two independent Poisson random variables of means $\alpha$ and $\beta$ respectively. We shall prove that the sum $X + Y$ is a Poisson random variable with mean $\alpha + \beta$. Indeed, according to (1.24) and (1.19),
\[ g_{X+Y}(z) = g_X(z) \times g_Y(z) = e^{\alpha (z - 1)} e^{\beta (z - 1)} = e^{(\alpha + \beta)(z - 1)} , \]
and the assertion follows directly from Theorem 1.4.6 since $g_{X+Y}$ is the gf of a Poisson random variable with mean $\alpha + \beta$.
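When the variables take finitely many values, a generating function is a polynomial and Theorems 1.4.5 to 1.4.7 become statements about coefficient lists. The sketch below uses binomial variables rather than the Poisson variables of the example, because their gf has finitely many coefficients; the sizes 3 and 5, the parameter $1/4$ and all function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def binomial_gf(n, p):
    """Coefficients [P(X = 0), ..., P(X = n)] of g(z) = (1 - p + p z)^n."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]


def multiply(a, b):
    """Polynomial product; by (1.24) this is the gf of a sum of independent variables."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def first_derivative_at_one(coeffs):
    """g'(1) = sum_n n P(X = n), that is E[X] by (1.20)."""
    return sum(n * c for n, c in enumerate(coeffs))


def second_derivative_at_one(coeffs):
    """g''(1) = sum_n n (n - 1) P(X = n), that is E[X (X - 1)] by (1.21)."""
    return sum(n * (n - 1) * c for n, c in enumerate(coeffs))


p = Fraction(1, 4)
g_sum = multiply(binomial_gf(3, p), binomial_gf(5, p))

mean = first_derivative_at_one(g_sum)
variance = second_derivative_at_one(g_sum) + mean - mean ** 2
print(mean, variance)
```

Since the product has the same coefficients as the gf of a binomial variable of size 8, Theorem 1.4.6 identifies the sum as binomial of size 8 and parameter $p$; the variance is recovered as $g''(1) + g'(1) - g'(1)^2$.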

Counting with Generating Functions

The following example is typical of the use of generating functions in combinatorics (the art of counting).

Example 1.4.9: The lottery. Let $X_1, X_2, X_3, X_4, X_5$, and $X_6$ be independent random variables uniformly distributed over $\{0, 1, \ldots, 9\}$. We shall compute the generating function of $Y = 27 + X_1 + X_2 + X_3 - X_4 - X_5 - X_6$ and use the result to obtain the probability that in a 6-digit lottery the sum of the first three digits equals the sum of the last three digits. We have
\begin{align*}
E[z^{X_i}] &= \frac{1}{10} (1 + z + \cdots + z^{9}) = \frac{1}{10} \, \frac{1 - z^{10}}{1 - z} , \\
E[z^{-X_i}] &= \frac{1}{10} \left( 1 + \frac{1}{z} + \cdots + \frac{1}{z^{9}} \right) = \frac{1}{10} \, \frac{1 - z^{-10}}{1 - z^{-1}} = \frac{1}{10} \, \frac{1}{z^{9}} \, \frac{1 - z^{10}}{1 - z} ,
\end{align*}
and
\begin{align*}
E[z^{Y}] &= E\left[ z^{27 + \sum_{i=1}^{3} X_i - \sum_{i=4}^{6} X_i} \right] \\
&= E\left[ z^{27} \prod_{i=1}^{3} z^{X_i} \prod_{i=4}^{6} z^{-X_i} \right] = z^{27} \prod_{i=1}^{3} E[z^{X_i}] \prod_{i=4}^{6} E[z^{-X_i}] .
\end{align*}
Therefore,
\[ g_Y(z) = \frac{1}{10^{6}} \, \frac{\left( 1 - z^{10} \right)^{6}}{(1 - z)^{6}} . \]
But $P(X_1 + X_2 + X_3 = X_4 + X_5 + X_6) = P(Y = 27)$ is the coefficient of $z^{27}$ in the power series expansion of $g_Y(z)$. Since
\[ (1 - z^{10})^{6} = 1 - \binom{6}{1} z^{10} + \binom{6}{2} z^{20} + \cdots \]
and
\[ (1 - z)^{-6} = 1 + \binom{6}{5} z + \binom{7}{5} z^{2} + \binom{8}{5} z^{3} + \cdots \]
(negative binomial formula), we find that
\[ P(Y = 27) = \frac{1}{10^{6}} \left( \binom{32}{5} - \binom{6}{1} \binom{22}{5} + \binom{6}{2} \binom{12}{5} \right) . \]
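The value given by the generating-function computation can be cross-checked by direct counting. The sketch below (function names are illustrative) evaluates the binomial-coefficient expression and, independently, counts the 6-digit tickets whose first three digits and last three digits have the same sum.

```python
from itertools import product
from math import comb


def lucky_count_from_formula():
    """Numerator of P(Y = 27): coefficient of z^27 in (1 - z^10)^6 (1 - z)^(-6)."""
    return comb(32, 5) - comb(6, 1) * comb(22, 5) + comb(6, 2) * comb(12, 5)


def lucky_count_brute_force():
    """Count 6-digit tickets with equal digit sums in the two halves."""
    ways = [0] * 28                      # ways[s] = number of digit triples with sum s
    for digits in product(range(10), repeat=3):
        ways[sum(digits)] += 1
    # The two halves are independent, so the pairs with equal sums number sum_s ways[s]^2.
    return sum(w * w for w in ways)


count = lucky_count_from_formula()
print(count, lucky_count_brute_force(), count / 10 ** 6)
```

Both computations give 55252 favorable tickets out of $10^6$, that is, a probability of 0.055252.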

Random Sums

How to compute the distribution of random sums? Here again, generating functions help.

Theorem 1.4.10 Let $\{Y_n\}_{n \ge 1}$ be an iid sequence of integer-valued random variables with the common generating function $g_Y$. Let $T$ be another random variable, integer-valued, independent of the sequence $\{Y_n\}_{n \ge 1}$, and let $g_T$ be its generating function. The generating function of
\[ X = \sum_{n=1}^{T} Y_n , \]
where by convention $\sum_{n=1}^{0} = 0$, is
\[ g_X(z) = g_T(g_Y(z)) . \tag{1.25} \]

Proof. Since $\{T = k\}_{k \ge 0}$ is a sequence forming a partition of $\Omega$, we have (Exercise 1.6.3) $1 = \sum_{k=0}^{\infty} 1_{\{T = k\}}$. Therefore
\begin{align*}
z^{X} = z^{\sum_{n=1}^{T} Y_n} &= \left( \sum_{k=0}^{\infty} 1_{\{T = k\}} \right) z^{\sum_{n=1}^{T} Y_n} \\
&= \sum_{k=0}^{\infty} \left( z^{\sum_{n=1}^{T} Y_n} \right) 1_{\{T = k\}} = \sum_{k=0}^{\infty} \left( z^{\sum_{n=1}^{k} Y_n} \right) 1_{\{T = k\}} .
\end{align*}
Taking expectations,
\begin{align*}
E[z^{X}] &= \sum_{k=0}^{\infty} E\left[ 1_{\{T = k\}} \, z^{\sum_{n=1}^{k} Y_n} \right] \\
&= \sum_{k=0}^{\infty} E[1_{\{T = k\}}] \, E\left[ z^{\sum_{n=1}^{k} Y_n} \right] ,
\end{align*}
where we have used independence of $T$ and $\{Y_n\}_{n \ge 1}$. Now, $E[1_{\{T = k\}}] = P(T = k)$, and $E\left[ z^{\sum_{n=1}^{k} Y_n} \right] = g_Y(z)^{k}$, and therefore
\[ E[z^{X}] = \sum_{k=0}^{\infty} P(T = k) \, g_Y(z)^{k} = g_T(g_Y(z)) . \]
□

Another useful result is Wald's identity below (Formula (13.2.10)), which gives the expectation of a random sum of independent and identically distributed integer-valued variables. By taking derivatives in (1.25) of Theorem 1.4.10,
\[ E[X] = g_X'(1) = g_Y'(1) \, g_T'(g_Y(1)) = E[Y_1] E[T] . \]

A stronger version of this result is often needed:

Theorem 1.4.11 Let $\{Y_n\}_{n \ge 1}$ be a sequence of integer-valued integrable random variables such that $E[Y_n] = E[Y_1]$ for all $n \ge 1$. Let $T$ be an integer-valued random variable such that for all $n \ge 1$, the event $\{T \ge n\}$ is independent of $Y_n$. Define $X = \sum_{n=1}^{T} Y_n$. Then
\[ E[X] = E[Y_1] E[T] . \tag{1.26} \]

Proof. We have
\[ E[X] = E\left[ \sum_{n=1}^{\infty} Y_n 1_{\{n \le T\}} \right] = \sum_{n=1}^{\infty} E[Y_n 1_{\{n \le T\}}] . \]
But $E[Y_n 1_{\{n \le T\}}] = E[Y_n] E[1_{\{n \le T\}}] = E[Y_1] P(n \le T)$. The result then follows from the telescope formula. □
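Formula (1.25) and the identity $E[X] = E[Y_1]E[T]$ can be checked exactly when $T$ and the $Y_n$ take finitely many values, in the special case where $T$ is independent of the whole sequence. In the sketch below, the two distributions and all function names are assumptions chosen for illustration: the composed polynomial $g_T(g_Y(z))$ is compared with the distribution of the random sum obtained by direct enumeration.

```python
from fractions import Fraction
from itertools import product


def poly_add(a, b):
    size = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) for i in range(size)]


def poly_mul(a, b):
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def compose(g_T, g_Y):
    """Coefficients of g_T(g_Y(z)) = sum_k P(T = k) g_Y(z)^k."""
    result, power = [Fraction(0)], [Fraction(1)]   # power holds g_Y(z)^k
    for p_k in g_T:
        result = poly_add(result, [p_k * c for c in power])
        power = poly_mul(power, g_Y)
    return result


def random_sum_pmf(g_T, g_Y):
    """Distribution of X = Y_1 + ... + Y_T by enumerating T and (Y_1, ..., Y_T)."""
    pmf = {}
    for t, p_t in enumerate(g_T):
        for ys in product(range(len(g_Y)), repeat=t):
            prob = p_t
            for y in ys:
                prob *= g_Y[y]
            pmf[sum(ys)] = pmf.get(sum(ys), Fraction(0)) + prob
    return pmf


def mean(coeffs):
    return sum(k * c for k, c in enumerate(coeffs))


# Illustrative distributions (assumed): index k holds P(T = k), respectively P(Y = k).
g_T = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
g_Y = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

g_X = compose(g_T, g_Y)
pmf_X = random_sum_pmf(g_T, g_Y)
print([str(c) for c in g_X], mean(g_X))
```

The weaker hypothesis of Theorem 1.4.11, where only $\{T \ge n\}$ and $Y_n$ are independent, is not exercised by this check.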



The following technical result will be needed in the next subsection on branching processes. It gives details concerning the shape of the generating function restricted to the interval $[0, 1]$.

Theorem 1.4.12 (α) Let $g : [0, 1] \to \mathbb{R}$ be defined by $g(x) = E[x^{X}]$, where $X$ is a non-negative integer-valued random variable. Then $g$ is non-decreasing and convex. Moreover, if $P(X = 0) < 1$, then $g$ is strictly increasing, and if $P(X \le 1) < 1$, it is strictly convex.
(β) Suppose $P(X \le 1) < 1$. If $E[X] \le 1$, the equation $x = g(x)$ has a unique solution $x \in [0, 1]$, namely $x = 1$. If $E[X] > 1$, it has two solutions in $[0, 1]$, $x = 1$ and $x = x_0 \in (0, 1)$.

Proof. Just observe that for $x \in [0, 1]$,
\[ g'(x) = \sum_{n=1}^{\infty} n P(X = n) x^{n-1} \ge 0 , \]
and therefore $g$ is non-decreasing, and
\[ g''(x) = \sum_{n=2}^{\infty} n(n - 1) P(X = n) x^{n-2} \ge 0 , \]
and therefore $g$ is convex. For $g'(x)$ to be null for some $x \in (0, 1)$, it is necessary to have $P(X = n) = 0$ for all $n \ge 1$, and therefore $P(X = 0) = 1$. For $g''(x)$ to be null for some $x \in (0, 1)$, one must have $P(X = n) = 0$ for all $n \ge 2$, and therefore $P(X = 0) + P(X = 1) = 1$. The graph of $g : [0, 1] \to \mathbb{R}$ has, in the strictly increasing strictly convex case $P(X = 0) + P(X = 1) < 1$, the general shape shown in the figure, where we distinguish two cases: $E[X] = g'(1) \le 1$, and $E[X] = g'(1) > 1$. The rest of the proof is then easy. □

[Figure: Two aspects of the generating function. Left panel: $E[X] \le 1$; right panel: $E[X] > 1$. Both axes run from 0 to 1 and each curve starts at height $P(X = 0)$.]

1.4.2

Probability of Extinction

We shall first formally define the branching process. Let $Z_n = (Z_n^{(1)}, Z_n^{(2)}, \ldots)$, where the random variables $\{Z_n^{(j)}\}_{n \ge 1, j \ge 1}$ are iid and integer-valued. The recurrence equation
\[ X_{n+1} = \sum_{k=1}^{X_n} Z_{n+1}^{(k)} \tag{1.27} \]
($X_{n+1} = 0$ if $X_n = 0$) may be interpreted as follows: $X_n$ is the number of individuals in the $n$th generation of a given population (humans, particles, etc.). Individual number $k$ of the $n$th generation gives birth to $Z_{n+1}^{(k)}$ descendants, and this accounts for Eqn. (1.27). The number $X_0$ of ancestors is assumed to be independent of $\{Z_n\}_{n \ge 1}$. The sequence of random variables $\{X_n\}_{n \ge 0}$ is called a branching process because of the genealogical tree that it generates (see the figure below).

[Figure: Sample tree of a branching process (one ancestor), with generation sizes $X_0 = 1$, $X_1 = 2$, $X_2 = 5$, $X_3 = 8$, $X_4 = 7$, $X_5 = 6$, $X_6 = 2$.]

The event $E$ = “an extinction occurs” is just “at least one generation is empty”, that is,
\[ E = \bigcup_{n=1}^{\infty} \{X_n = 0\} . \]

We now proceed to the computation of the extinction probability when there is one ancestor. We discard trivialities by supposing that $P(Z \le 1) < 1$, where $Z$ denotes any of the variables $Z_n^{(k)}$. Let $g$ be the common generating function of the variables $Z_n^{(k)}$. The generating function of the number of individuals in the $n$th generation is denoted
\[ \psi_n(z) = E[z^{X_n}] . \]


We prove successively that (a) $P(X_{n+1} = 0) = g(P(X_n = 0))$, (b) $P(E) = g(P(E))$, and (c) if $E[Z_1] \le 1$, the probability of extinction is 1; and if $E[Z_1] > 1$, the probability of extinction is $< 1$ but nonzero.

Proof.

(a) In (1.27), $X_n$ is independent of the $Z_{n+1}^{(k)}$'s. Therefore, by Theorem 1.4.10, $\psi_{n+1}(z) = \psi_n(g(z))$. Iterating this equality, we obtain $\psi_{n+1}(z) = \psi_0(g^{(n+1)}(z))$, where $g^{(n)}$ is the $n$th iterate of $g$. Since there is only one ancestor, $\psi_0(z) = z$, and therefore $\psi_{n+1}(z) = g^{(n+1)}(z) = g(g^{(n)}(z))$, that is, $\psi_{n+1}(z) = g(\psi_n(z))$. In particular, since $\psi_n(0) = P(X_n = 0)$, (a) is proved.

(b) Since $X_n = 0$ implies $X_{n+1} = 0$, the sequence $\{X_n = 0\}$, $n \ge 1$, is non-decreasing, and therefore, by monotone sequential continuity,
\[ P(E) = \lim_{n \uparrow \infty} P(X_n = 0) . \]
The generating function $g$ is continuous, and therefore from (a) and the last equation, the probability of extinction satisfies (b).

(c) Let $Z$ be any of the random variables $Z_n^{(k)}$. Since the trivial cases where $P(Z = 0) = 1$ or $P(Z \ge 2) = 0$ have been eliminated, by Theorem 1.4.12:
(α) If $E[Z] \le 1$, the only solution of $x = g(x)$ in $[0, 1]$ is 1, and therefore $P(E) = 1$. The branching process eventually becomes extinct.
(β) If $E[Z] > 1$, there are two solutions of $x = g(x)$ in $[0, 1]$, 1 and $x_0$ such that $0 < x_0 < 1$. From the strict convexity of $g : [0, 1] \to [0, 1]$, it follows that the sequence $y_n = P(X_n = 0)$ that satisfies $y_0 = 0$ and $y_{n+1} = g(y_n)$ converges to $x_0$. Therefore, when the mean number of descendants $E[Z]$ is strictly larger than 1, $P(E) \in (0, 1)$. □
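The recursion $y_0 = 0$, $y_{n+1} = g(y_n)$ used in the proof is also a practical way to approximate the extinction probability. The sketch below is a minimal illustration under assumed offspring distributions (the two lists of probabilities and the iteration count are choices made here, not taken from the text). In the supercritical example, the fixed-point equation $x = \frac14 + \frac14 x + \frac12 x^2$ has roots $\frac12$ and 1, so the iteration should approach $\frac12$.

```python
def offspring_gf(pmf):
    """Return x -> g(x) = sum_k P(Z = k) x^k for a finite offspring distribution."""
    def g(x):
        return sum(p_k * x ** k for k, p_k in enumerate(pmf))
    return g


def extinction_probability(pmf, iterations=500):
    """Approximate P(E) by iterating y_{n+1} = g(y_n) from y_0 = 0 (one ancestor)."""
    g = offspring_gf(pmf)
    y = 0.0
    for _ in range(iterations):
        y = g(y)                      # y is now P(X_n = 0) for the next generation
    return y


# Assumed offspring distributions: index k holds P(Z = k).
supercritical = [0.25, 0.25, 0.5]     # E[Z] = 1.25 > 1
subcritical = [0.5, 0.25, 0.25]       # E[Z] = 0.75 < 1

print(extinction_probability(supercritical))
print(extinction_probability(subcritical))
```

In the critical case $E[Z] = 1$ the iteration still converges to 1, but only slowly, so a fixed iteration count gives a poor approximation there.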


1.5

Borel’s Strong Law of Large Numbers

The empirical frequency of heads in a long sequence of independent tosses of a fair coin is, asymptotically, $\frac{1}{2}$. This is a special case of Borel's strong law of large numbers:

Theorem 1.5.1 Let $\{X_n\}_{n \ge 1}$ be an iid sequence of $\{0, 1\}$-valued random variables taking the value 1 with probability $p \in [0, 1]$. Then
\[ P\left( \lim_{n \uparrow \infty} \frac{1}{n} \sum_{k=1}^{n} X_k = p \right) = 1 . \]
We then say: the sequence $\left\{ \frac{1}{n} \sum_{k=1}^{n} X_k \right\}_{n \ge 1}$ converges almost surely to $p$.

For the proof, some preliminaries are in order.

1.5.1

The Borel–Cantelli Lemma

Consider a sequence of events $\{A_n\}_{n \ge 1}$. Let
\[ \{A_n \text{ i.o.}\} := \{\omega;\ \omega \in A_n \text{ for an infinity of indices } n\} . \]
Here i.o. means infinitely often. We have the Borel–Cantelli lemma:

Theorem 1.5.2
\[ \sum_{n=1}^{\infty} P(A_n) < \infty \implies P(A_n \text{ i.o.}) = 0 . \]

Proof. First observe that
\[ \{A_n \text{ i.o.}\} = \bigcap_{n=1}^{\infty} \bigcup_{k \ge n} A_k . \]
(Indeed, if $\omega$ belongs to the set on the right-hand side, then for all $n \ge 1$, $\omega$ belongs to at least one among $A_n, A_{n+1}, \ldots$, which implies that $\omega$ is in $A_n$ for an infinite number of indices $n$. Conversely, if $\omega$ is in $A_n$ for an infinite number of indices $n$, it is for all $n \ge 1$ in at least one of the sets $A_n, A_{n+1}, \ldots$.) The set $\bigcup_{k \ge n} A_k$ decreases as $n$ increases, so that by the sequential continuity property of probability,
\[ P(A_n \text{ i.o.}) = \lim_{n \uparrow \infty} P\left( \bigcup_{k \ge n} A_k \right) . \tag{1.28} \]
But by sub-σ-additivity,
\[ P\left( \bigcup_{k \ge n} A_k \right) \le \sum_{k \ge n} P(A_k) , \]
and by the summability assumption, the right-hand side of this inequality goes to 0 as $n \uparrow \infty$. □

For the converse Borel–Cantelli lemma below, an additional assumption of independence is needed.


Theorem 1.5.3 Let $\{A_n\}_{n \ge 1}$ be a sequence of independent events. Then,
\[ \sum_{n=1}^{\infty} P(A_n) = \infty \implies P(A_n \text{ i.o.}) = 1 . \]

Proof. We may without loss of generality assume that $P(A_n) > 0$ for all $n \ge 1$ (why?). The divergence hypothesis implies that for all $n \ge 1$,
\[ \prod_{k \ge n} (1 - P(A_k)) = 0 . \]
This infinite product equals, in view of the independence assumption,
\[ \prod_{k \ge n} P\left( \overline{A_k} \right) = P\left( \bigcap_{k=n}^{\infty} \overline{A_k} \right) . \]
Passing to the complement and using De Morgan's identity,
\[ P\left( \bigcup_{k \ge n} A_k \right) = 1 . \]
Therefore, by (1.28),
\[ P(A_n \text{ i.o.}) = \lim_{n \uparrow \infty} P\left( \bigcup_{k \ge n} A_k \right) = 1 . \]
□



1.5.2

Markov’s Inequality

Theorem 1.5.4 Let $Z$ be a non-negative real random variable and let $a > 0$. Then,
\[ P(Z \ge a) \le \frac{E[Z]}{a} . \tag{1.29} \]

Proof. From the inequality $Z \ge a 1_{\{Z \ge a\}}$, it follows by taking expectations that $E[Z] \ge a E[1_{\{Z \ge a\}}] = a P(Z \ge a)$. □

Example 1.5.5: Chebyshev's inequality. Let $X$ be a real (discrete) random variable with mean $\mu$ and finite variance $\sigma^2$. Specializing the Markov inequality of Theorem 1.5.4 to $Z = (X - \mu)^2$, $a = \varepsilon^2 > 0$, we obtain Chebyshev's inequality: For all $\varepsilon > 0$,
\[ P(|X - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2} . \]


Example 1.5.6: The weak law of large numbers. Let $\{X_n\}_{n \ge 1}$ be an iid sequence of real square-integrable random variables with common mean $\mu$ and common variance $\sigma^2 < \infty$. Since the variance of the empirical mean $\frac{S_n}{n} := \frac{X_1 + \cdots + X_n}{n}$ is equal to $\frac{\sigma^2}{n}$, we have by Chebyshev's inequality, for all $\varepsilon > 0$,
\[ P\left( \left| \frac{S_n}{n} - \mu \right| \ge \varepsilon \right) = P\left( \left| \frac{\sum_{i=1}^{n} (X_i - \mu)}{n} \right| \ge \varepsilon \right) \le \frac{\sigma^2}{n \varepsilon^2} . \]
In other words, the empirical mean $\frac{S_n}{n}$ converges to the mean $\mu$ in probability, which means exactly (by definition of the convergence in probability) that, for all $\varepsilon > 0$,
\[ \lim_{n \uparrow \infty} P\left( \left| \frac{S_n}{n} - \mu \right| \ge \varepsilon \right) = 0 . \]
This specific result is called the weak law of large numbers.
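For coin tossing, the probability bounded by Chebyshev's inequality can be computed exactly, which shows how conservative the bound is. The sketch below assumes $\{0, 1\}$-valued variables with $p = 1/2$ and $\varepsilon = 1/10$ (illustrative values; the function names are not from the text) and uses exact rational arithmetic.

```python
from fractions import Fraction
from math import comb


def exact_deviation_probability(n, p, eps):
    """P(|S_n / n - p| >= eps) when S_n is binomial of size n and parameter p."""
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(Fraction(k, n) - p) >= eps
    )


def chebyshev_bound(n, p, eps):
    """sigma^2 / (n eps^2) with sigma^2 = p (1 - p) for a {0, 1}-valued variable."""
    return p * (1 - p) / (n * eps ** 2)


# Illustrative parameters (assumed).
p, eps = Fraction(1, 2), Fraction(1, 10)
for n in (10, 100, 400):
    exact = exact_deviation_probability(n, p, eps)
    print(n, float(exact), float(chebyshev_bound(n, p, eps)))
```

For small $n$ the bound exceeds 1 and carries no information; it becomes useful only once $n \varepsilon^2$ is large compared with $\sigma^2$.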

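1.5.3

Proof of Borel's Strong Law

Let $S_n := \sum_{k=1}^{n} X_k$ and $Z_n := \frac{1}{n} S_n$.

Lemma 1.5.7 If
\[ \sum_{n \ge 1} P(|Z_n - p| \ge \varepsilon_n) < \infty \tag{1.30} \]
for some sequence of positive numbers $\{\varepsilon_n\}_{n \ge 1}$ converging to 0, then the sequence $\{Z_n\}_{n \ge 1}$ converges $P$-a.s. to $p$.

Proof. If, for a given $\omega$, $|Z_n(\omega) - p| \ge \varepsilon_n$ finitely often (or f.o.; that is, for at most a finite number of indices $n$), then $\limsup_{n \uparrow \infty} |Z_n(\omega) - p| \le \lim_{n \uparrow \infty} \varepsilon_n = 0$. Therefore
\[ P\left( \lim_{n \uparrow \infty} Z_n = p \right) \ge P(|Z_n - p| \ge \varepsilon_n \ \text{f.o.}) . \]
On the other hand,
\[ \{|Z_n - p| \ge \varepsilon_n \ \text{f.o.}\} = \overline{\{|Z_n - p| \ge \varepsilon_n \ \text{i.o.}\}} . \]
Therefore
\[ P(|Z_n - p| \ge \varepsilon_n \ \text{f.o.}) = 1 - P(|Z_n - p| \ge \varepsilon_n \ \text{i.o.}) . \]
Hypothesis (1.30) implies (Borel–Cantelli lemma) that
\[ P(|Z_n - p| \ge \varepsilon_n \ \text{i.o.}) = 0 . \]
By linking the above facts, we obtain
\[ P\left( \lim_{n \uparrow \infty} Z_n = p \right) \ge 1 , \]
and of course, the only possibility is $= 1$. □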



In order to prove almost-sure convergence using Lemma 1.5.7, we must find some adequate upper bound for the general term of the series occurring in the left-hand side of (1.30). The basic tool for this is the Markov inequality:

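\begin{align*}
P\left( \left| \frac{S_n}{n} - p \right| \ge \varepsilon \right) &= P\left( \left( \frac{S_n}{n} - p \right)^{4} \ge \varepsilon^{4} \right) \\
&\le \frac{E\left[ \left( \frac{S_n}{n} - p \right)^{4} \right]}{\varepsilon^{4}} = \frac{E\left[ \left( \sum_{i=1}^{n} Y_i \right)^{4} \right]}{n^{4} \varepsilon^{4}} ,
\end{align*}
where $Y_i := X_i - p$. In view of the independence hypothesis, $E[Y_1 Y_2 Y_3 Y_4] = E[Y_1] E[Y_2] E[Y_3] E[Y_4] = 0$, $E[Y_1 Y_2^{3}] = E[Y_1] E[Y_2^{3}] = 0$, and the like. Finally, in the expansion
\[ E\left[ \left( \sum_{i=1}^{n} Y_i \right)^{4} \right] = \sum_{i,j,k,\ell=1}^{n} E[Y_i Y_j Y_k Y_\ell] , \]
only the terms of the form $E[Y_i^{4}]$ and $E[Y_i^{2} Y_j^{2}]$, $i \ne j$, remain. There are $n$ terms of the first type and $3n(n - 1)$ terms of the second type. Therefore, $n E[Y_1^{4}] + 3n(n - 1) E[Y_1^{2} Y_2^{2}]$ remains, which is less than $K n^{2}$ for some finite $K$. Therefore
\[ P\left( \left| \frac{S_n}{n} - p \right| \ge \varepsilon \right) \le \frac{K}{n^{2} \varepsilon^{4}} , \]
and in particular, with $\varepsilon = n^{-\frac{1}{8}}$,
\[ P\left( \left| \frac{S_n}{n} - p \right| \ge n^{-\frac{1}{8}} \right) \le \frac{K}{n^{\frac{3}{2}}} , \]
from which it follows that
\[ \sum_{n=1}^{\infty} P\left( \left| \frac{S_n}{n} - p \right| \ge n^{-\frac{1}{8}} \right) < \infty . \]
Therefore, by Lemma 1.5.7, $\left| \frac{S_n}{n} - p \right|$ converges almost surely to 0.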
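The count of surviving terms in the fourth-moment expansion can be confirmed on small cases. The sketch below (illustrative names; $p = 1/3$ and $n \le 6$ are assumptions) computes $E[(\sum_i Y_i)^4]$ by enumerating all outcomes of $n$ tosses and compares it with $n E[Y_1^4] + 3n(n-1) E[Y_1^2]^2$, using $E[Y_1^2 Y_2^2] = E[Y_1^2] E[Y_2^2]$ by independence.

```python
from fractions import Fraction
from itertools import product


def fourth_moment_brute_force(n, p):
    """E[(sum_i (X_i - p))^4] by enumerating all 2^n outcomes of n {0, 1}-valued tosses."""
    total = Fraction(0)
    for xs in product((0, 1), repeat=n):
        prob = Fraction(1)
        for x in xs:
            prob *= p if x else 1 - p
        total += prob * (sum(xs) - n * p) ** 4
    return total


def fourth_moment_formula(n, p):
    """n E[Y^4] + 3 n (n - 1) E[Y^2]^2 for Y = X - p, with X in {0, 1} and P(X = 1) = p."""
    q = 1 - p
    second = p * q                        # E[Y^2] = p (1 - p)
    fourth = p * q ** 4 + q * p ** 4      # Y = 1 - p with prob. p, Y = -p with prob. 1 - p
    return n * fourth + 3 * n * (n - 1) * second ** 2


p = Fraction(1, 3)
for n in range(1, 7):
    print(n, fourth_moment_brute_force(n, p), fourth_moment_formula(n, p))
```

Since $|Y_i| \le 1$, both $E[Y_1^4]$ and $E[Y_1^2]^2$ are at most 1, so the expression is at most $n + 3n(n-1) \le 3n^2$; in this setting one admissible constant is $K = 3$.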

Complementary reading

[Brémaud, 2017] is entirely devoted to the main discrete probability models and methods, using only the elementary tools presented in this chapter.

1.6

Exercises

Exercise 1.6.1. De Morgan's rules
Let $\{A_n\}_{n \ge 1}$ be a sequence of subsets of $\Omega$. Prove De Morgan's identities:
\[ \overline{\bigcap_{n=1}^{\infty} A_n} = \bigcup_{n=1}^{\infty} \overline{A_n} \quad \text{and} \quad \overline{\bigcup_{n=1}^{\infty} A_n} = \bigcap_{n=1}^{\infty} \overline{A_n} . \]

Exercise 1.6.2. Finitely often
Let $\{A_n\}_{n \ge 1}$ be a sequence of subsets of $\Omega$. Show that $\omega \in B := \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} \overline{A_k}$ if and only if there exists at most a finite number (depending on $\omega$) of indices $k$ such that $\omega \in A_k$. (The event $B$ is therefore the event that events $A_n$ occur finitely often.)

Exercise 1.6.3. Indicator functions
Show that for all subsets $A, B \subset \Omega$, where $\Omega$ is an arbitrary set, $1_{A \cap B} = 1_A \times 1_B$ and $1_{\overline{A}} = 1 - 1_A$. Show that if $\{A_n\}_{n \ge 1}$ is a partition of $\Omega$,
\[ 1 = \sum_{n \ge 1} 1_{A_n} . \]

Exercise 1.6.4. Small σ-fields
Is there a σ-field on $\Omega$ with 6 elements (including of course $\Omega$ and $\emptyset$)?

Exercise 1.6.5. Composed events
Let $\mathcal{F}$ be a σ-field on some set $\Omega$.
(1) Show that if $A_1, A_2, \ldots$ are in $\mathcal{F}$, then so is $\bigcap_{k=1}^{\infty} A_k$.
(2) Show that if $A_1, A_2$ are in $\mathcal{F}$, then so is their symmetric difference $A_1 \triangle A_2 := A_1 \cup A_2 - A_1 \cap A_2$.

Exercise 1.6.6. Set inverse functions
Let $f : U \to E$ be a function, where $U$ and $E$ are arbitrary sets. For any subset $A \subseteq E$, define
\[ f^{-1}(A) = \{u \in U;\ f(u) \in A\} . \]
(i) Show that for all $u \in U$, $1_A(f(u)) = 1_{f^{-1}(A)}(u)$.
(ii) Prove that if $\mathcal{E}$ is a σ-field on $E$, then the collection of subsets of $U$
\[ f^{-1}(\mathcal{E}) := \left\{ f^{-1}(A);\ A \in \mathcal{E} \right\} \]
is a σ-field on $U$.

Exercise 1.6.7. Identities
Prove the set identities
\[ P(A \cup B) = 1 - P(\overline{A} \cap \overline{B}) \]
and
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) . \]

Exercise 1.6.8. Urns
1. An urn contains 17 red balls and 19 white balls. Balls are drawn in succession at random and without replacement. What is the probability that the first 2 balls are red?


2. An urn contains N balls numbered from 1 to N . Someone draws n balls (1 ≤ n ≤ N ) simultaneously from the urn. What is the probability that the lowest number drawn is k?

Exercise 1.6.9. Independence of a family of events
1. Give a simple example of a probability space $(\Omega, \mathcal{F}, P)$ with three events $A_1, A_2, A_3$ that are pairwise independent but not globally independent (the family $\{A_1, A_2, A_3\}$ is not independent).
2. If $\{A_i\}_{i \in \mathbb{N}}$ is an independent family of events, is it true that $\{\tilde{A}_i\}_{i \in \mathbb{N}}$ is also an independent family of events, where for each $i \in \mathbb{N}$, $\tilde{A}_i = A_i$ or $\overline{A_i}$ (your choice, for instance, $\tilde{A}_0 = A_0$, $\tilde{A}_1 = \overline{A_1}$, $\tilde{A}_3 = A_3$, ...)?

Exercise 1.6.10. Extension of the product formula for independent events
Let $\{C_n\}_{n \ge 1}$ be a sequence of independent events. Prove that
\[ P\left( \bigcap_{n=1}^{\infty} C_n \right) = \prod_{n=1}^{\infty} P(C_n) . \]
(This extends formula (1.7) to a countable number of sets.)

Exercise 1.6.11. Conditional independence and the Markov property
1. Let $(\Omega, \mathcal{F}, P)$ be a probability space. Define for a fixed event $C$ of positive probability, $P_C(A) := P(A \mid C)$. Show that $P_C$ is a probability on $(\Omega, \mathcal{F})$. (Note that $A$ and $B$ are independent with respect to this probability if and only if they are conditionally independent given $C$.)
2. Let $A_1, A_2, A_3$ be three events of positive probability. Show that events $A_1$ and $A_3$ are conditionally independent given $A_2$ if and only if the “Markov property” holds, that is, $P(A_3 \mid A_1 \cap A_2) = P(A_3 \mid A_2)$.

Exercise 1.6.12. Roll it! You roll fairly and simultaneously three unbiased dice. (i) What is the probability that you obtain the (unordered) outcome {1, 2, 4}? (ii) What is the probability that some die shows 2, given that the sum of the 3 values equals 5? Exercise 1.6.13. Heads or tails as usual A person, A, tossing an unbiased coin N times obtains TA tails. Another person, B, tossing her own unbiased coin N + 1 times has TB tails. What is the probability that TA ≥ TB ? Hint: Introduce HA and HB , the number of heads obtained by A and B respectively, and use a symmetry argument. Exercise 1.6.14. Apartheid University In the renowned Social Apartheid University, students have been separated into three social groups for pedagogical purposes. In group A, one finds students who individually

46

CHAPTER 1. WARMING UP

have a probability of passing equal to 0.95. In group B this probability is 0.75, and in group C only 0.65. The three groups are of equal size. What is the probability that a student passing the course comes from group A? B? C?

Exercise 1.6.15. A wise bet
There are three cards. The first one has both faces red, the second one has both faces white, and the third one is white on one face, red on the other. A card is drawn at random, and the color of a randomly selected face of this card is shown to you (the other remains hidden). What is the winning strategy if you must bet on the color of the hidden face?

Exercise 1.6.16. A sequence of liars
Consider a sequence of $n$ “liars” $L_1, \ldots, L_n$. The first liar $L_1$ receives information about the occurrence of some event in the form “yes or no”, and transmits it to $L_2$, who transmits it to $L_3$, etc. Each liar transmits what he hears with probability $p \in (0, 1)$, and the contrary with probability $q = 1 - p$. The decision of lying or not is made independently by each liar. What is the probability $x_n$ of obtaining the correct information from $L_n$? What is the limit of $x_n$ as $n$ increases to infinity?

Exercise 1.6.17. The campus library complaint
You are looking for a book in the campus libraries. Each library has it with probability 0.60, but the copy held by any given library may have been stolen with probability 0.25. If there are three libraries, what are your chances of obtaining the book?

Exercise 1.6.18. Professor Nebulous
Professor Nebulous travels from Los Angeles to Paris with stopovers in New York and London. At each stop his luggage is transferred from one plane to another. In each airport, including Los Angeles, his luggage fails to be placed in the right plane with probability $p$. Professor Nebulous finds that his suitcase has not reached Paris. What are the chances that the mishap took place in Los Angeles, New York, and London, respectively?

Exercise 1.6.19. Blood test
Give a mathematical model and invent data to corroborate the informal discussion of Remark 1.2.12.

Exercise 1.6.20. Safari butchers
Three tourists participate in a safari in Africa. Here comes an elephant, unaware of the rules of the game. The innocent beast is killed, having received two out of the three bullets simultaneously shot by the tourists. The tourists’ hit probabilities are: Tourist A: 1/4, Tourist B: 1/2, Tourist C: 3/4. Give for each tourist the probability that he was the one who missed.

Exercise 1.6.21. One is the sum of the other two
You perform three independent tosses of an unbiased die. What is the probability that one of these tosses results in a number that is the sum of the other two numbers?

Exercise 1.6.22. Poincaré’s formula
Let $A_1, \ldots, A_n$ be events and let $X_1, \ldots, X_n$ be their indicator functions. By expanding the expression $E\left[\prod_{i=1}^n (1 - X_i)\right]$, deduce Poincaré’s formula:

$$P\left(\cup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) - \sum_{1 \le i < j \le n} P(A_i \cap A_j) + \sum_{1 \le i < j < k \le n} P(A_i \cap A_j \cap A_k) - \cdots$$
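As a numerical sanity check of Poincaré’s formula (not a proof), the following sketch enumerates a small finite sample space with the uniform probability and compares both sides of the identity. The sample space, the three events and the helper names `prob` and `poincare_rhs` are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite sample space, equipped with the uniform probability.
OMEGA = frozenset(range(12))

def prob(event):
    """Uniform probability of an event (a subset of OMEGA)."""
    return Fraction(len(event & OMEGA), len(OMEGA))

def poincare_rhs(events):
    """Alternating sum of the probabilities of all r-fold intersections."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = 1 if r % 2 == 1 else -1
        for family in combinations(events, r):
            total += sign * prob(frozenset.intersection(*family))
    return total

# Three overlapping events chosen arbitrarily for the illustration.
events = [
    frozenset({0, 1, 2, 3}),
    frozenset({2, 3, 4, 5, 6}),
    frozenset({0, 3, 6, 9}),
]
lhs = prob(frozenset.union(*events))
rhs = poincare_rhs(events)
print(lhs, rhs, lhs == rhs)
```

Exact rational arithmetic (`Fraction`) is used so that the comparison is not affected by floating-point rounding.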

Exercise 1.6.23. No name
Let $X$ be a discrete random variable taking its values in $E$, with probability distribution $p(x)$ ($x \in E$). Let $A := \{\omega;\ p(X(\omega)) = 0\}$. What is the probability of this event?

Exercise 1.6.24. Null variance
Prove that a null variance implies that the random variable is almost surely constant.

Exercise 1.6.25. Moment inequalities
(a) Prove that for any integer-valued random variable $X$, $P(X \neq 0) \le E[|X|]$.
(b) Prove that for any square-integrable real-valued discrete random variable $X$,

$$P(X = 0) \le \frac{\mathrm{Var}(X)}{E[X]^2} .$$

Exercise 1.6.26. Checking conditional independence
Let $X$, $Y$ and $Z$ be three discrete random variables with values in $E$, $F$, and $G$, respectively. Prove the following: If for some function $g : E \times F \to [0, 1]$, $P(X = x \mid Y = y, Z = z) = g(x, y)$ for all $x, y, z$, then $P(X = x \mid Y = y) = g(x, y)$ for all $x, y$, and $X$ and $Z$ are conditionally independent given $Y$.

Exercise 1.6.27. G(n, p) with a given number of edges
Prove that the conditional distribution of $G(n, p)$ given that the number of edges is $m \le \binom{n}{2}$ is uniform on the set $\mathcal{G}_m$ of graphs $G = (V, E)$, where $V = \{1, 2, \ldots, n\}$, with exactly $m$ edges.

Exercise 1.6.28. Sum of geometric variables
Let $T_1$ and $T_2$ be two independent geometric random variables with the same parameter $p \in (0, 1)$. Give the probability distribution of their sum $X = T_1 + T_2$.

Exercise 1.6.29. The coupon collector, take 2
In the coupon collector problem of Example 1.3.36, show that the number $X$ of chocolate tablets bought when all the $n$ coupons have been collected for the first time satisfies the inequality $|E[X] - n \ln n| \le n$.

Exercise 1.6.30. The coupon collector, take 3
In the coupon collector problem of Example 1.3.36, compute the variance $\sigma_X^2$ of $X$ (the number of chocolate tablets needed to complete the collection of the $n$ different coupons) and show that $\sigma_X^2 / n^2$ has a limit (to be identified) as $n$ grows indefinitely.
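The inequality of Exercise 1.6.29 can be checked numerically for a few values of $n$. The sketch below assumes the classical closed form $E[X] = n(1 + 1/2 + \cdots + 1/n)$ for $n$ equiprobable coupon types; that formula and the function name `expected_tablets` are assumptions of this illustration, which supports the inequality on examples but does not prove it.

```python
import math
from fractions import Fraction

def expected_tablets(n):
    """n * (1 + 1/2 + ... + 1/n), assuming n equiprobable coupon types."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))

for n in (1, 2, 10, 100, 1000):
    # Distance between the assumed mean and n ln n, compared with n.
    gap = abs(float(expected_tablets(n)) - n * math.log(n))
    print(n, round(gap, 3), gap <= n)
```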

Exercise 1.6.31. The coupon collector, take 4
In the coupon collector problem of Example 1.3.36, prove that for all $c > 0$, $P(X > n \ln n + cn) \le e^{-c}$. Hint: you might find it useful to define $A_i$ to be the event that the Type $i$ coupon has not shown up in the first $n \ln n + cn$ tablets.

Exercise 1.6.32. Factorial of Poisson
1. Let $X$ be a Poisson random variable with mean $\theta > 0$. Compute the mean of the random variable $X!$ (factorial, not exclamation mark!).
2. Compute $E\left[\theta^X\right]$.

Exercise 1.6.33. Even and odd Poisson
Let $X$ be a Poisson random variable with mean $\theta > 0$. What is the probability that $X$ is odd? even?

Exercise 1.6.34. A random sum
Let $\{X_n\}_{n \ge 1}$ be independent random variables taking the values 0 and 1 with probability $q = 1 - p$ and $p$, respectively, where $p \in (0, 1)$. Let $T$ be a Poisson random variable with mean $\theta > 0$, independent of $\{X_n\}_{n \ge 1}$. Define $S = X_1 + \cdots + X_T$. Show that $S$ is a Poisson random variable with mean $p\theta$.

Exercise 1.6.35. Multiplicative Bernoulli
Let $X_1, \ldots, X_{2n}$ be independent random variables taking the values 0 or 1, and such that for all $i$ ($1 \le i \le 2n$), $P(X_i = 1) = p \in [0, 1]$. Define $Z = \sum_{i=1}^n X_i X_{n+i}$. Compute $P(Z = k)$ ($1 \le k \le n$).

Exercise 1.6.36. The matchbox
A smoker has one matchbox with $n$ matches in each pocket. He reaches at random for one box or the other. What is the probability that, having eventually found an empty matchbox, there will be $k$ matches left in the other box?

Exercise 1.6.37. Means and variances via generating functions
(a) Compute the mean and variance of the binomial random variable $B$ of size $n$ and parameter $p$ from its generating function. Do the same for the Poisson random variable $P$ of mean $\theta$.
(b) What is the generating function $g_T$ of the geometric random variable $T$ with parameter $p \in (0, 1)$ (recall $P(T = n) = (1 - p)^{n-1} p$, $n \ge 1$)? Compute its first two derivatives and deduce from the result the variance of $T$.


Exercise 1.6.38. Factorial moment of Poisson
What is the $n$-th factorial moment ($E[X(X - 1) \cdots (X - n + 1)]$) of a Poisson random variable $X$ of mean $\theta > 0$?

Exercise 1.6.39. From generating function to probability distribution
What is the probability distribution of the integer-valued random variable $X$ with generating function $g(z) = \frac{1}{(2 - z)^2}$? Compute its variance.

Exercise 1.6.40. Negative binomial formula
Prove that for all $z \in \mathbb{C}$, $|z| < 1$,

$$(1 - z)^{-p} = 1 + \binom{p}{p-1} z + \binom{p+1}{p-1} z^2 + \binom{p+2}{p-1} z^3 + \cdots .$$

Exercise 1.6.41. Throw a die
You perform three independent tosses of an unbiased die. What is the probability that one of these tosses results in a number that is the sum of the other two numbers? (You are required to find a solution using generating functions.)

Exercise 1.6.42. Residual time
Let $X$ be a random variable with values in $\mathbb{N}$ and with finite mean $m$. We know (telescope formula; Theorem 1.3.12) that $p_n = \frac{1}{m} P(X > n)$, $n \in \mathbb{N}$, defines a probability distribution on $\mathbb{N}$. Compute its generating function.

Exercise 1.6.43. The blue pinko
The blue pinko (an extravagant Australian bird) lays $T$ eggs, each egg blue or pink, with probability $p$ that a given egg is blue. The colors of the successive eggs are independent, and independent of the number of eggs laid. Exercise 1.6.34 shows that if the number of eggs is Poisson with mean $\theta$, then the number of blue eggs is Poisson with mean $\theta p$ and the number of pink eggs is Poisson with mean $\theta q$. Show that the number of blue eggs and the number of pink eggs are independent random variables.

Exercise 1.6.44. The entomologist
Each individual of a specific breed of insects has, independently of the others, the probability $\theta$ of being a male. An entomologist seeks to collect exactly $M > 1$ males, and therefore stops hunting as soon as she captures $M$ males. She has to capture an insect in order to determine its gender. What is the distribution of $X$, the number of insects she must catch to collect exactly $M$ males?

Exercise 1.6.45. The return of the entomologist
The situation is as in Exercise 1.6.44. What is the distribution of $X$, the smallest number of insects that the entomologist must catch to collect at least $M$ males and $N$ females?

Exercise 1.6.46. The entomologist strikes again
This continues Exercise 1.6.44. What is the expectation of $X$? (In Exercise 1.6.44, you computed the distribution of $X$, from which you can of course compute the mean. However, you can give the solution directly, and this is what is required in the present exercise.)

Exercise 1.6.47. A recurrence equation
Recall the notation $a^+ = \max(a, 0)$. Consider the recurrence equation

$$X_{n+1} = (X_n - 1)^+ + Z_{n+1} \qquad (n \ge 0),$$

where $X_0$ is a random variable taking its values in $\mathbb{N}$, and $\{Z_n\}_{n \ge 1}$ is a sequence of independent random variables taking their values in $\mathbb{N}$, and independent of $X_0$. Express the generating function $\psi_{n+1}$ of $X_{n+1}$ in terms of the generating function $\varphi$ of $Z_1$.

Exercise 1.6.48. Extinction of a branching process
Compute the probability of extinction of a branching process with one ancestor when the probabilities of having 0, 1, or 2 sons are respectively 1/4, 1/4, and 1/2.

Exercise 1.6.49. Several ancestors
(a) Give the survival probability in the model of Section 1.4.2 with $k$ ancestors, $k > 1$.
(b) Give the mean and variance of $X_n$ in the model of Section 1.4.2 with one ancestor.

Exercise 1.6.50. Size of the branching tree
When the probability of extinction is 1 ($m < 1$), call $Y$ the size of the branching tree ($Y = \sum_{n \ge 0} X_n$). Prove that $g_Y(z) = z\, g_Z(g_Y(z))$.

Chapter 2 Integration

From a technical point of view, a probability is a particular kind of measure and expectation is integration with respect to that measure. However, we shall see in the next chapter that this point of view is short-sighted because the probabilistic notions of independence and conditioning, which are absent from the theory of measure and integration, play a fundamental role in probability. Nevertheless, the foundations of probability theory rest on Lebesgue’s integration theory, of which the present chapter gives a detailed outline and the main results.

The reader is assumed to have a working knowledge of the Riemann integral. This type of integral is sufficient for many purposes, but it has a few weak points when compared to the Lebesgue integral. For instance:

(1) The class of Riemann-integrable functions is too narrow. As a matter of fact, there are functions that have a Lebesgue integral and yet are not Riemann integrable (see Example 2.2.14).

(2) The stability properties under the limit operation of the functions that admit a Riemann integral are too weak. Indeed, it often happens that a limit of Riemann-integrable functions does not have a Riemann integral, whereas the limit of, for instance, non-negative functions for which the Lebesgue integral is well defined also admits a well-defined Lebesgue integral.

(3) The Riemann integral is defined with respect to the Lebesgue measure (length, area, volume, etc.) whereas Lebesgue’s integral can be defined with respect to a general abstract measure, a probability for instance.

This last advantage makes it worthwhile to invest a little time in order to understand the fundamental results of the Lebesgue integration theory, because the return is considerable. In fact, the Lebesgue integral of a function $f$ with respect to an abstract measure $\mu$ contains a variety of mathematical objects besides the usual Lebesgue integral on $\mathbb{R}^d$,

$$\int_{\mathbb{R}^d} f(x_1, \cdots, x_d)\, dx_1 \cdots dx_d .$$

In fact, an infinite sum

$$\sum_{n \in \mathbb{Z}} f(n)$$

can also be regarded as a Lebesgue integral with respect to the counting measure on $\mathbb{Z}$. The Stieltjes–Lebesgue integral

$$\int_{\mathbb{R}} f(x)\, dF(x)$$

with respect to a right-continuous non-decreasing function $F$ is again a special case of the Lebesgue integral. Most importantly, the expectation $E[Z]$ of a random variable $Z$ will be recognized as an abstract integral. It is the latter type of integral that is of interest in this book and the purpose of the present chapter is to define the Lebesgue integral and to give its properties directly useful to probability theory.

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/978-3-030-40183-2_2

2.1 Measurability and Measure

This section describes the functions that Lebesgue’s integration theory admits for integrands and gives the formal definition and properties of the measure with respect to which such functions are integrated.

2.1.1 Measurable Functions

Recall the definition of a σ-field:

Definition 2.1.1 Denote by $\mathcal{P}(X)$ the collection of all subsets of a given set $X$. A collection of subsets $\mathcal{X} \subseteq \mathcal{P}(X)$ is called a σ-field on $X$ if: (α) $X \in \mathcal{X}$, (β) $A \in \mathcal{X} \Longrightarrow \overline{A} \in \mathcal{X}$, and (γ) $A_n \in \mathcal{X}$ for all $n \in \mathbb{N}$ $\Longrightarrow \cup_{n=0}^{\infty} A_n \in \mathcal{X}$. One then says that $(X, \mathcal{X})$ is a measurable space. A set $A \in \mathcal{X}$ is called a measurable set.

Observe that (see Exercise 1.6.5): (γ′) $A_n \in \mathcal{X}$ for all $n \in \mathbb{N}$ $\Longrightarrow \cap_{n=0}^{\infty} A_n \in \mathcal{X}$. In fact, given the properties (α) and (β), properties (γ) and (γ′) are equivalent. Note also that $\emptyset \in \mathcal{X}$, being the complement of $X$. Therefore, a σ-field on $X$ is a collection of subsets of $X$ that contains $X$ and $\emptyset$, and is closed under countable unions, countable intersections and complementation. The two simplest examples of σ-fields on $X$ are the gross σ-field $\mathcal{X} = \{\emptyset, X\}$ and the trivial σ-field $\mathcal{X} = \mathcal{P}(X)$.

Definition 2.1.2 The σ-field generated by a non-empty collection of subsets $\mathcal{C} \subseteq \mathcal{P}(X)$ is, by definition, the smallest σ-field on $X$ containing all the sets in $\mathcal{C}$ (see Exercise 2.4.1). It is denoted by $\sigma(\mathcal{C})$.

Of course, if $\mathcal{G}$ is a σ-field, $\sigma(\mathcal{G}) = \mathcal{G}$, a fact that will often be used. We now review some basic notions of topology.


Definition 2.1.3 A topology on a set $X$ is a collection $\mathcal{O}$ of subsets of $X$ satisfying the following properties:
(i) $X$ and $\emptyset$ belong to $\mathcal{O}$;
(ii) the union of an arbitrary collection of sets in $\mathcal{O}$ is a set in $\mathcal{O}$; and
(iii) the intersection of a finite number of sets in $\mathcal{O}$ is a set in $\mathcal{O}$.
The elements $O \in \mathcal{O}$ are called open sets and the pair $(X, \mathcal{O})$ is called a topological space. (Usually, when the context clearly defines the open sets, we say: “the topological space $X$”.)

Example 2.1.4: Metric spaces. Let $X$ be a set. A function $d : X \times X \to \mathbb{R}_+$ such that for all $x, y, z \in X$, (i) $d(x, y) = 0 \Leftrightarrow x = y$, (ii) $d(x, y) = d(y, x)$ and (iii) $d(x, z) \le d(x, y) + d(y, z)$ is called a metric on $X$. The pair $(X, d)$ is called a metric space. Any metric induces a topology as follows. A set $O \subseteq X$ is called open if for all $x \in O$ there is an open ball $B(x, a) := \{y \in X;\ d(y, x) < a\}$ ($a > 0$) contained in $O$. The collection of such sets is indeed a topology, as can be straightforwardly checked.

Example 2.1.5: The euclidean topology. In $\mathbb{R}^n$, the usual euclidean distance defines a topology called the euclidean topology.

Topology concerns continuity:

Definition 2.1.6 Let $(X, \mathcal{O})$ and $(E, \mathcal{V})$ be two topological spaces. A function $f : X \to E$ is said to be continuous (with respect to these topologies) if $V \in \mathcal{V} \Rightarrow f^{-1}(V) \in \mathcal{O}$. In other terms, $f^{-1}(\mathcal{V}) \subseteq \mathcal{O}$.

This is a rather abstract definition (however it will turn out to be quite convenient, as we shall soon see). For metric spaces, continuity is more explicit, and is usually defined in terms of metrics:

Theorem 2.1.7 Let $(X, d)$ and $(E, \rho)$ be two metric spaces. A necessary and sufficient condition for a mapping $f : X \to E$ to be continuous according to the above abstract definition (and with respect to the topologies induced by the corresponding metrics) is that for any $\varepsilon > 0$ and any $x \in X$, there exists a $\delta > 0$ such that $d(y, x) \le \delta$ implies $\rho(f(y), f(x)) \le \varepsilon$.

The proof is given in Section B.7 of the appendix. We shall now be more “concrete” about σ-fields.


Definition 2.1.8 Let $(X, \mathcal{O})$ be a topological space. The σ-field $\sigma(\mathcal{O})$ will be denoted by $\mathcal{B}(X)$ and called the Borel σ-field on $X$ (associated with topology $\mathcal{O}$). A set $B \in \mathcal{B}(X)$ is called a Borel set of $X$ (with respect to the topology $\mathcal{O}$).

When $X = \mathbb{R}^n$ is endowed with the euclidean topology, the Borel σ-field is denoted by $\mathcal{B}(\mathbb{R}^n)$. The next result gives a sometimes more convenient characterization of $\mathcal{B}(\mathbb{R}^n)$.

Theorem 2.1.9 $\mathcal{B}(\mathbb{R}^n)$ is generated by the collection $\mathcal{C}$ of all rectangles of the type $\prod_{i=1}^n (-\infty, a_i]$, where $a_i \in \mathbb{Q}$ for all $i \in \{1, \ldots, n\}$.

Proof. It suffices to show that $\mathcal{B}(\mathbb{R}^n)$ is generated by the collection $\mathcal{C}'$ of all rectangles $\prod_{i=1}^n (a_i, b_i)$ with rational endpoints (that is, such that $a_i, b_i \in \mathbb{Q}$ for all $i \in \{1, \ldots, n\}$). Note that $\mathcal{C}'$ is a countable collection and that all its elements are open sets for the euclidean topology (the latter we denote by $\mathcal{O}$). It follows that $\mathcal{C}' \subseteq \mathcal{O}$ and therefore $\sigma(\mathcal{C}') \subseteq \sigma(\mathcal{O}) = \mathcal{B}(\mathbb{R}^n)$. It remains to show that $\mathcal{O} \subseteq \sigma(\mathcal{C}')$, since this implies that $\sigma(\mathcal{O}) \subseteq \sigma(\mathcal{C}')$. For this it suffices to show that any set $O \in \mathcal{O}$ is a countable union of elements in $\mathcal{C}'$. Take $x \in O$. By definition of the euclidean topology, there exists a non-empty open ball $B(x, r)$ centered at $x$ and contained in $O$. Now we can always choose a rational rectangle $R_x \in \mathcal{C}'$ that contains $x$ and that is contained in $B(x, r)$. Clearly $\cup_{x \in O} R_x = O$. Since the $R_x$ are chosen in a countable family of sets, the union $\cup_{x \in O} R_x$ is in fact countable. As a countable union of sets in $\mathcal{C}'$ it is in $\sigma(\mathcal{C}')$. Therefore $O \in \sigma(\mathcal{C}')$. □

Definition 2.1.10 $\mathcal{B}(\overline{\mathbb{R}})$ is the σ-field on $\overline{\mathbb{R}}$ generated by the intervals of type $[-\infty, a]$, $a \in \mathbb{R}$. If $I = \prod_{j=1}^n I_j$, where $I_j$ is an interval of $\mathbb{R}$, the Borel σ-field $\mathcal{B}(I)$ on $I$ is the collection of all the Borel sets contained in $I$.

Remark 2.1.11 A question naturally arises at this point. How can we be sure that the Borel σ-field (of $\mathbb{R}$ for instance) is not just the trivial σ-field? In other terms, does there exist at least one subset of $\mathbb{R}$ that is not a Borel set? The answer is yes: there exist such “pathological” subsets, but we shall not prove this here.

The central concept of Lebesgue’s integration theory is that of a measurable function.

Definition 2.1.12 Let $(X, \mathcal{X})$ and $(E, \mathcal{E})$ be two measurable spaces. A function $f : X \to E$ is said to be a measurable function with respect to $\mathcal{X}$ and $\mathcal{E}$ if $f^{-1}(C) \in \mathcal{X}$ for all $C \in \mathcal{E}$. In other terms, $f^{-1}(\mathcal{E}) \subseteq \mathcal{X}$. This will be denoted by: $f : (X, \mathcal{X}) \to (E, \mathcal{E})$ or $f \in \mathcal{E}/\mathcal{X}$.

Let $(X, \mathcal{X})$ be a measurable space. A function $f : (X, \mathcal{X}) \to (\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ is called a Borel function from $X$ to $\mathbb{R}^k$. A function $f : (X, \mathcal{X}) \to (\overline{\mathbb{R}}, \mathcal{B}(\overline{\mathbb{R}}))$ is called an extended Borel function, or simply a Borel function. As for functions $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, they are called real Borel functions. In general, in a sentence such as “$f$ is a Borel function defined on $X$”, the σ-field $\mathcal{X}$ is assumed to be the obvious one in the given context.

The key to the definition of the Lebesgue integral is the theorem of approximation of non-negative measurable functions by simple Borel functions.

Definition 2.1.13 A function $f : X \to \mathbb{R}$ of the form

$$f(x) = \sum_{i=1}^{k} a_i 1_{A_i}(x),$$

where $k \in \mathbb{N}_+$, $a_1, \ldots, a_k \in \mathbb{R}$ and $A_1, \ldots, A_k$ are sets in $\mathcal{X}$, is called a simple Borel function (defined on $X$).

Theorem 2.1.14 Let $f : (X, \mathcal{X}) \to (\overline{\mathbb{R}}, \mathcal{B}(\overline{\mathbb{R}}))$ be a non-negative Borel function. There exists a non-decreasing sequence $\{f_n\}_{n \ge 1}$ of non-negative simple Borel functions converging pointwise to the function $f$.

Proof. Let

$$f_n(x) := \sum_{k=0}^{n2^n - 1} k 2^{-n}\, 1_{A_{k,n}}(x) + n\, 1_{A_n}(x),$$

where $A_{k,n} := \{x \in X : k2^{-n} < f(x) \le (k + 1)2^{-n}\}$ and $A_n := \{x \in X : f(x) > n\}$. This sequence of functions has the announced properties. In fact, for any $x \in X$ such that $f(x) < \infty$ and for $n$ large enough, $|f(x) - f_n(x)| \le 2^{-n}$, and for any $x \in X$ such that $f(x) = \infty$, $f_n(x) = n$ indeed converges to $f(x) = +\infty$. □

It seems difficult to prove measurability since σ-fields are often not defined explicitly (see the definition of $\mathcal{B}(\mathbb{R}^n)$, for instance). However, the following result renders the task feasible. The corollaries will provide examples of application.

Theorem 2.1.15 Let $(X, \mathcal{X})$ and $(E, \mathcal{E})$ be two measurable spaces, where $\mathcal{E} = \sigma(\mathcal{C})$ for some collection $\mathcal{C}$ of subsets of $E$. Let $f : X \to E$ be some function. Then $f : (X, \mathcal{X}) \to (E, \mathcal{E})$ if and only if $f^{-1}(C) \in \mathcal{X}$ for all $C \in \mathcal{C}$.

Proof. Only sufficiency requires a proof. We start with the following trivial observations. Let $X$ and $E$ be sets, let $\mathcal{G}$ be a σ-field on $E$, and let $\mathcal{C}_1$, $\mathcal{C}_2$ be non-empty collections of subsets of $E$. Then (i) $\sigma(\mathcal{G}) = \mathcal{G}$, and (ii) $\mathcal{C}_1 \subseteq \mathcal{C}_2 \Rightarrow \sigma(\mathcal{C}_1) \subseteq \sigma(\mathcal{C}_2)$. Let now $f : X \to E$ be a function from $X$ to $E$, and define $\mathcal{G} := \{C \subseteq E;\ f^{-1}(C) \in \mathcal{X}\}$. One checks that $\mathcal{G}$ is a σ-field. But by hypothesis, $\mathcal{C} \subseteq \mathcal{G}$. Therefore, by (ii) and (i), $\mathcal{E} = \sigma(\mathcal{C}) \subseteq \sigma(\mathcal{G}) = \mathcal{G}$, that is, $f^{-1}(C) \in \mathcal{X}$ for all $C \in \mathcal{E}$. □
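The dyadic construction in the proof of Theorem 2.1.14 can be evaluated pointwise. The sketch below is an illustration under the assumption that the value $f(x) \ge 0$ is already known as a Python float (possibly `math.inf`); the function name `dyadic_approx` is hypothetical. It returns $f_n(x)$ from $f(x)$ and $n$, so that monotonicity in $n$ and the error bound $2^{-n}$ can be observed on examples.

```python
import math

def dyadic_approx(value, n):
    """Return f_n(x) given value = f(x) >= 0, following the proof's definition."""
    if value > n:                      # x belongs to A_n = {f > n}
        return float(n)
    if value == 0:                     # no interval (k/2^n, (k+1)/2^n] contains 0
        return 0.0
    # Unique k with k/2^n < value <= (k+1)/2^n.
    k = math.ceil(value * 2**n) - 1
    return k / 2**n

for v in (0.0, 0.3, 1.0, 2.75, math.inf):
    print(v, [dyadic_approx(v, n) for n in range(1, 6)])
```

Because the intervals are open on the left, $f_n(x)$ is strictly below $f(x)$ whenever $0 < f(x) \le n$; for instance `dyadic_approx(1.0, 1)` is `0.5`, not `1.0`.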


Stability Properties of Measurable Functions

Measurability is a stable property, in the sense that all the usual operations on measurable functions preserve measurability. (This is not the case for continuity, which is in general not stable with respect to limits.) Also, the class of measurable functions is a rich one. In particular, “continuous functions are measurable”. More precisely:

Corollary 2.1.16 Let $X$ and $E$ be two topological spaces with respective Borel σ-fields $\mathcal{B}(X)$ and $\mathcal{B}(E)$. Any continuous function $f : X \to E$ is measurable with respect to $\mathcal{B}(X)$ and $\mathcal{B}(E)$.

Proof. By definition of continuity, the inverse image of an open set of $E$ is an open set of $X$ and is therefore in $\mathcal{B}(X)$. By Theorem 2.1.15, since the open sets of $E$ generate $\mathcal{B}(E)$, the function $f$ is measurable with respect to $\mathcal{B}(X)$ and $\mathcal{B}(E)$. □

Corollary 2.1.17 Let $(X, \mathcal{X})$ be a measurable space and let $n \ge 1$ be an integer. Then $f = (f_1, \ldots, f_n) : (X, \mathcal{X}) \to (\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ if and only if for all $1 \le i \le n$, $\{f_i \le a_i\} \in \mathcal{X}$ for all $a_i \in \mathbb{Q}$ (the rational numbers).

Proof. Since by Theorem 2.1.9, $\mathcal{B}(\mathbb{R}^n)$ is generated by the sets $\prod_{i=1}^n (-\infty, a_i]$, where $a_i \in \mathbb{Q}$ for all $i \in \{1, \ldots, n\}$, it suffices by Theorem 2.1.15 to show that for all $a \in \mathbb{Q}^n$, $\{f \le a\} \in \mathcal{X}$. This is indeed the case since

$$\{f \le a\} = \cap_{i=1}^n \{f_i \le a_i\},$$

and therefore $\{f \le a\} \in \mathcal{X}$, being the intersection of a countable (actually: finite) number of sets in $\mathcal{X}$. □

Measurability is closed under composition:

Theorem 2.1.18 Let $(X, \mathcal{X})$, $(Y, \mathcal{Y})$ and $(E, \mathcal{E})$ be measurable spaces and let $\varphi : (X, \mathcal{X}) \to (Y, \mathcal{Y})$, $g : (Y, \mathcal{Y}) \to (E, \mathcal{E})$. Then $g \circ \varphi : (X, \mathcal{X}) \to (E, \mathcal{E})$.

Proof. Let $f := g \circ \varphi$ (meaning: $f(x) = g(\varphi(x))$ for all $x \in X$). For all $C \in \mathcal{E}$, $f^{-1}(C) = \varphi^{-1}(g^{-1}(C)) = \varphi^{-1}(D) \in \mathcal{X}$, because $D = g^{-1}(C)$ is a set in $\mathcal{Y}$ since $g \in \mathcal{E}/\mathcal{Y}$, and therefore $\varphi^{-1}(D) \in \mathcal{X}$ since $\varphi \in \mathcal{Y}/\mathcal{X}$. □

Corollary 2.1.19 Let $\varphi = (\varphi_1, \ldots, \varphi_n)$ be a measurable function from $(X, \mathcal{X})$ to $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ and let $g : \mathbb{R}^n \to \mathbb{R}$ be a continuous function. Then $g \circ \varphi : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

Proof. Follows directly from Theorem 2.1.18 and Corollary 2.1.16.



This corollary in turn allows us to show that addition, multiplication and quotients preserve measurability.


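Corollary 2.1.20 Let $\varphi_1, \varphi_2 : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and let $\lambda \in \mathbb{R}$. Then $\varphi_1 \times \varphi_2$, $\varphi_1 + \varphi_2$, $\lambda\varphi_1$, $(\varphi_1/\varphi_2)\, 1_{\{\varphi_2 \neq 0\}}$ are real Borel functions. Moreover, the set $\{\varphi_1 = \varphi_2\}$ is a measurable set.

Proof. For the first three functions, take in Corollary 2.1.19 $g(x_1, x_2) = x_1 \times x_2$, $= x_1 + x_2$, $= \lambda x_1$ successively. For $(\varphi_1/\varphi_2)\, 1_{\{\varphi_2 \neq 0\}}$, let $\psi_2 := \frac{1_{\{\varphi_2 \neq 0\}}}{\varphi_2}$ (set equal to 0 on $\{\varphi_2 = 0\}$), check that the latter function is measurable, and use the just proved fact that the product $\varphi_1 \psi_2$ is then measurable. Finally, $\{\varphi_1 = \varphi_2\} = \{\varphi_1 - \varphi_2 = 0\} = (\varphi_1 - \varphi_2)^{-1}(\{0\})$ is a measurable set since $\varphi_1 - \varphi_2$ is a measurable function and $\{0\}$ is a measurable set (any singleton is in $\mathcal{B}(\mathbb{R})$; exercise). □

Finally, taking the limit preserves measurability, as will now be proved. Unless otherwise explicitly mentioned, the limits of functions must be understood as pointwise limits.

Theorem 2.1.21 Let $f_n : (X, \mathcal{X}) \to (\overline{\mathbb{R}}, \mathcal{B}(\overline{\mathbb{R}}))$ ($n \in \mathbb{N}$). Then $\liminf_{n \uparrow \infty} f_n$ and $\limsup_{n \uparrow \infty} f_n$ are Borel functions, and the set

$$\{\limsup_{n \uparrow \infty} f_n = \liminf_{n \uparrow \infty} f_n\} = \{\exists \lim_{n \uparrow \infty} f_n\}$$

belongs to $\mathcal{X}$. In particular, if $\{\exists \lim_{n \uparrow \infty} f_n\} = X$, the function $\lim_{n \uparrow \infty} f_n$ is a Borel function.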

Proof. We first prove the result in the particular case when the sequence of functions is non-decreasing. Denote by $f$ the limit of this sequence. By Theorem 2.1.15 it suffices to show that for all $a \in \mathbb{R}$, $\{f \le a\} \in \mathcal{X}$. But since the sequence $\{f_n\}_{n \ge 1}$ is non-decreasing, we have that $\{f \le a\} = \cap_{n=1}^{\infty} \{f_n \le a\}$, which is indeed in $\mathcal{X}$, as a countable intersection of sets in $\mathcal{X}$.

Now recall that by definition,

$$\liminf_{n \uparrow \infty} f_n := \lim_{n \uparrow \infty} g_n, \quad \text{where } g_n := \inf_{k \ge n} f_k .$$

The function $g_n$ is measurable since for all $a \in \mathbb{R}$, $\{\inf_{k \ge n} f_k \le a\}$ is a measurable set, being the complement of $\{\inf_{k \ge n} f_k > a\} = \cup_{m \ge 1} \cap_{k \ge n} \{f_k \ge a + 1/m\}$, a measurable set (as a countable union of countable intersections of measurable sets). Since the sequence $\{g_n\}_{n \ge 1}$ is non-decreasing, the measurability of $\liminf_{n \uparrow \infty} f_n$ follows from the particular case of non-decreasing functions. Similarly, $\limsup_{n \uparrow \infty} f_n = -\liminf_{n \uparrow \infty} (-f_n)$ is measurable. The set $\{\limsup_{n \uparrow \infty} f_n = \liminf_{n \uparrow \infty} f_n\}$ is the set on which two measurable functions are equal, and therefore, by the last assertion of Corollary 2.1.20, it is a measurable set. Finally, if $\lim_{n \uparrow \infty} f_n$ exists, it is equal to $\limsup_{n \uparrow \infty} f_n$, which is, as we just proved, a measurable function. □

Dynkin’s Systems

Proving that a given property is common to all measurable functions, or to all measurable sets, may at times appear difficult because there is usually no constructive definition of the σ-fields involved (these are often defined as “the smallest σ-field containing a certain class of subsets”). There is however a technical tool that allows us to do this, the Dynkin theorem(s). The central notions in this respect are those of a π-system and of a d-system.

Definition 2.1.22 Let $X$ be a set. The collection $\mathcal{S} \subseteq \mathcal{P}(X)$ is called a π-system of $X$ if it is closed under finite intersections.

For instance, the collection of finite unions of intervals of the type $(a, b]$ is a π-system of $\mathbb{R}$.

Definition 2.1.23 Let $X$ be a set. A non-empty collection of sets $\mathcal{S} \subseteq \mathcal{P}(X)$ is called a d-system, or Dynkin system, of sets if
(a) $X, \emptyset \in \mathcal{S}$.
(b) $\mathcal{S}$ is closed under strict difference (that is, if $A, B \in \mathcal{S}$ and $A \subseteq B$, then $B - A \in \mathcal{S}$).
(c) $\mathcal{S}$ is closed under sequential non-decreasing limits (that is, the limit of a non-decreasing sequence of sets in $\mathcal{S}$ is in $\mathcal{S}$).

Theorem 2.1.24 Let $X$ be a set. If the collection $\mathcal{S} \subseteq \mathcal{P}(X)$ is a π-system and a d-system, it is a σ-field.

Proof. (i) $\mathcal{S}$ contains $X$ and $\emptyset$, by definition of a d-system. (ii) $\mathcal{S}$ is closed under complementation. (Apply (b) in Definition 2.1.23 with $B = X$ and $A \in \mathcal{S}$, to obtain that $\overline{A} \in \mathcal{S}$.) (iii) $\mathcal{S}$ is closed under countable unions. To prove this, we first show that it is closed under finite unions. Indeed, if $A, B \in \mathcal{S}$, then by (ii), $\overline{A}, \overline{B} \in \mathcal{S}$, and therefore, since $\mathcal{S}$ is a π-system, $\overline{A} \cap \overline{B} \in \mathcal{S}$. Taking the complement we obtain by (ii) that $A \cup B \in \mathcal{S}$. Now for countable unions, consider a sequence $\{A_n\}_{n \ge 1}$ of elements of $\mathcal{S}$. The union $\cup_{n \ge 1} A_n$ can be written as $\cup_{n \ge 1} (\cup_{k=1}^n A_k)$, which is the union of a non-decreasing sequence of sets in $\mathcal{S}$, and therefore it is in $\mathcal{S}$, by (c) of Definition 2.1.23. □

The smallest d-system containing a non-empty collection of sets $\mathcal{C} \subseteq \mathcal{P}(X)$ is denoted by $d(\mathcal{C})$. Observe that since a σ-field $\mathcal{G}$ is already a d-system, $d(\mathcal{G}) = \mathcal{G}$. In particular, for any collection of sets $\mathcal{C} \subseteq \mathcal{P}(X)$, $d(\mathcal{C}) \subseteq \sigma(\mathcal{C})$. Also note that if $\mathcal{C}_1 \subseteq \mathcal{C}_2$, then $d(\mathcal{C}_1) \subseteq d(\mathcal{C}_2)$. We now state Dynkin’s theorem (sometimes called the monotone class theorem).


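Theorem 2.1.25 Let $\mathcal{S}$ be a π-system defined on $X$. Then $d(\mathcal{S}) = \sigma(\mathcal{S})$.

Proof. Any σ-field containing $\mathcal{S}$ will contain $d(\mathcal{S})$. Therefore $\sigma(d(\mathcal{S})) = \sigma(\mathcal{S})$. If we can show that $d(\mathcal{S})$ is a σ-field, then $\sigma(d(\mathcal{S})) = d(\mathcal{S})$, so that $d(\mathcal{S}) = \sigma(\mathcal{S})$. To prove that $d(\mathcal{S})$ is a σ-field, it suffices by Theorem 2.1.24 to show that it is a π-system. For this, define

$$\mathcal{D}_1 := \{A \in d(\mathcal{S});\ A \cap C \in d(\mathcal{S}) \text{ for all } C \in \mathcal{S}\} .$$

One checks that $\mathcal{D}_1$ is a d-system and that $\mathcal{S} \subseteq \mathcal{D}_1$ since $\mathcal{S}$ is a π-system by assumption. Therefore $d(\mathcal{S}) \subseteq \mathcal{D}_1$. By definition of $\mathcal{D}_1$, $\mathcal{D}_1 \subseteq d(\mathcal{S})$. Therefore $\mathcal{D}_1 = d(\mathcal{S})$. Define now

$$\mathcal{D}_2 := \{A \in d(\mathcal{S});\ A \cap C \in d(\mathcal{S}) \text{ for all } C \in d(\mathcal{S})\} .$$

This is a d-system. Also, if $C \in \mathcal{S}$, then $A \cap C \in d(\mathcal{S})$ for all $A \in \mathcal{D}_1 = d(\mathcal{S})$, and therefore $\mathcal{S} \subseteq \mathcal{D}_2$. Therefore $d(\mathcal{S}) \subseteq d(\mathcal{D}_2) = \mathcal{D}_2$. Also by definition of $\mathcal{D}_2$, $\mathcal{D}_2 \subseteq d(\mathcal{S})$, so that finally $\mathcal{D}_2 = d(\mathcal{S})$. In particular, $d(\mathcal{S})$ is a π-system. □

There is a functional form of Dynkin’s theorem. We first define a d-system of functions.

Definition 2.1.26 Let $\mathcal{H}$ be a collection of non-negative functions $f : X \to \overline{\mathbb{R}}_+$ such that
(α) $1 \in \mathcal{H}$,
(β) $\mathcal{H}$ is closed under monotone non-decreasing sequential limits (that is, if $\{f_n\}_{n \ge 1}$ is a non-decreasing sequence of functions in $\mathcal{H}$, then $\lim_{n \uparrow \infty} f_n \in \mathcal{H}$), and
(γ) if $f_1, f_2 \in \mathcal{H}$, then $\lambda_1 f_1 + \lambda_2 f_2 \in \mathcal{H}$ for all $\lambda_1, \lambda_2 \in \mathbb{R}$ such that $\lambda_1 f_1 + \lambda_2 f_2$ is a non-negative function.
Then $\mathcal{H}$ is called a d-system, or Dynkin system, of functions.

Theorem 2.1.27 Let $\mathcal{H}$ be a family of non-negative functions $f : X \to \overline{\mathbb{R}}_+$ that form a d-system. Let $\mathcal{S}$ be a π-system on $X$ such that $1_C \in \mathcal{H}$ for all $C \in \mathcal{S}$. Then $\mathcal{H}$ contains all non-negative functions $f : (X, \sigma(\mathcal{S})) \to (\overline{\mathbb{R}}, \mathcal{B}(\overline{\mathbb{R}}))$.

Proof. The collection $\mathcal{D} := \{A \subseteq X;\ 1_A \in \mathcal{H}\}$ is a d-system containing $\mathcal{S}$ by hypothesis. Therefore $d(\mathcal{S}) \subseteq d(\mathcal{D}) = \mathcal{D}$. Since $d(\mathcal{S}) = \sigma(\mathcal{S})$ by Dynkin’s theorem, $\sigma(\mathcal{S}) \subseteq \mathcal{D}$. This means that $\mathcal{H}$ contains the indicator functions of all the sets in $\sigma(\mathcal{S})$. Being a d-system of functions, it contains, by property (γ) of Definition 2.1.26, all the non-negative simple $\sigma(\mathcal{S})$-measurable functions. The rest of the proof follows from the theorem of approximation of non-negative measurable functions by non-negative simple functions (Theorem 2.1.14) and property (β) of Definition 2.1.26. □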
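On a finite set, both $\sigma(\mathcal{S})$ and $d(\mathcal{S})$ can be computed by brute-force closure, which gives a concrete feel for Theorem 2.1.25 and for why the π-system hypothesis matters. The sketch below is an illustration only: the four-point set, the two generating collections and the function names `generated_sigma_field` and `generated_d_system` are assumptions, and on a finite set closure under non-decreasing limits is automatic, so only differences need to be iterated.

```python
from itertools import combinations

def generated_sigma_field(universe, sets):
    """Smallest family containing `sets`, closed under complement and union."""
    universe = frozenset(universe)
    family = {frozenset(), universe} | {frozenset(s) for s in sets}
    while True:
        new = {universe - a for a in family}
        new |= {a | b for a, b in combinations(family, 2)}
        if new <= family:          # nothing new: the closure is reached
            return family
        family |= new

def generated_d_system(universe, sets):
    """Smallest family containing `sets`, the universe and the empty set,
    closed under differences B - A with A a subset of B."""
    universe = frozenset(universe)
    family = {frozenset(), universe} | {frozenset(s) for s in sets}
    while True:
        new = {b - a for a in family for b in family if a <= b}
        if new <= family:
            return family
        family |= new

X = {1, 2, 3, 4}
pi_system = [{1, 2}, {2, 3}, {2}]   # closed under intersection
not_pi = [{1, 2}, {2, 3}]           # the intersection {2} is missing

print(generated_d_system(X, pi_system) == generated_sigma_field(X, pi_system))
print(len(generated_d_system(X, not_pi)), len(generated_sigma_field(X, not_pi)))
```

For the collection that is not a π-system, the generated d-system is strictly smaller than the generated σ-field, so the hypothesis of the theorem cannot simply be dropped.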

2.1.2 Measure

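The next most important notion of integration theory after that of measurable sets and measurable functions is that of measure.

Definition 2.1.28 Let $(X, \mathcal{X})$ be a measurable space. A set function $\mu : \mathcal{X} \to [0, \infty]$ is called a measure on $(X, \mathcal{X})$ if $\mu(\emptyset) = 0$ and if for any countable sequence $\{A_n\}_{n \ge 0}$ of pairwise disjoint sets in $\mathcal{X}$, the following property (σ-additivity) is satisfied:

$$\mu\left(\sum_{n=0}^{\infty} A_n\right) = \sum_{n=0}^{\infty} \mu(A_n),$$

where the sum of sets on the left-hand side denotes the union of these pairwise disjoint sets. The triple $(X, \mathcal{X}, \mu)$ is then called a measure space.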

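The next two properties have been proved in the first chapter. We repeat the proofs for self-containedness. First, the monotonicity property:

$$A \subseteq B \text{ and } A, B \in \mathcal{X} \Longrightarrow \mu(A) \le \mu(B) .$$

Indeed, $B = A + (B - A)$ and therefore $\mu(B) = \mu(A) + \mu(B - A) \ge \mu(A)$. The sub-σ-additivity property:

$$A_n \in \mathcal{X} \text{ for all } n \in \mathbb{N} \Longrightarrow \mu\left(\bigcup_{n=0}^{\infty} A_n\right) \le \sum_{n=0}^{\infty} \mu(A_n),$$

is obtained by writing

$$\mu\left(\bigcup_{n=0}^{\infty} A_n\right) = \mu\left(\sum_{n=0}^{\infty} A'_n\right) = \sum_{n=0}^{\infty} \mu(A'_n),$$

where $A'_0 = A_0$ and for $n \ge 1$, $A'_n = A_n \cap \overline{\left(\cup_{j=0}^{n-1} A_j\right)} \subseteq A_n$, so that $\mu(A'_n) \le \mu(A_n)$ by the monotonicity property.

Example 2.1.29: The Dirac measure. Let $a \in X$. The measure $\varepsilon_a$ defined by $\varepsilon_a(C) = 1_C(a)$ is called the Dirac measure at $a \in X$. The set function $\mu : \mathcal{X} \to [0, \infty]$ defined by $\mu(C) = \sum_{i=0}^{\infty} \alpha_i\, \varepsilon_{a_i}(C)$, where $a_i \in X$, $\alpha_i \in \mathbb{R}_+$ for all $i \in \mathbb{N}$, is a measure denoted $\mu = \sum_{i=0}^{\infty} \alpha_i \varepsilon_{a_i}$.

Example 2.1.30: Weighted counting measure. Let $\{\alpha_n\}_{n \in \mathbb{Z}}$ be a sequence of $\mathbb{R}_+$. The set function $\mu : \mathcal{P}(\mathbb{Z}) \to [0, \infty]$ defined by $\mu(C) = \sum_{n \in C} \alpha_n$ is a measure on $(\mathbb{Z}, \mathcal{P}(\mathbb{Z}))$. In the case $\alpha_n \equiv 1$, it is called the counting measure on $\mathbb{Z}$ (then $\mu(C) = \mathrm{card}(C)$).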
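The Dirac and weighted counting measures of Examples 2.1.29 and 2.1.30 are easy to evaluate on finite sets. The following sketch is only an illustration: it restricts attention to finite sets $C$, and the names `dirac` and `weighted_counting`, as well as the particular weights, are assumptions. It checks additivity on two disjoint sets and sub-additivity on two overlapping sets.

```python
def dirac(a):
    """Dirac measure at a: maps a set C to 1 if a is in C, and to 0 otherwise."""
    return lambda C: 1 if a in C else 0

def weighted_counting(weights):
    """Measure mapping a finite set C to the sum of weights[n] over n in C.
    Points without an explicit weight are given weight 0 in this sketch."""
    return lambda C: sum(weights.get(n, 0) for n in C)

eps0 = dirac(0)
mu = weighted_counting({-1: 2, 0: 1, 3: 5})

A, B = {-1, 0}, {3, 7}                 # disjoint sets
print(eps0(A), eps0(B))                # mass of the Dirac measure at 0
print(mu(A | B) == mu(A) + mu(B))      # additivity on disjoint sets

C, D = {-1, 0}, {0, 3}                 # overlapping sets
print(mu(C | D), mu(C) + mu(D))        # sub-additivity: left value <= right value
```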


Example 2.1.31: The Lebesgue measure. There exists one and only one measure $\ell$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $\ell((a, b]) = b - a$. This measure is called the Lebesgue measure on $\mathbb{R}$. (Beware. The statement of this example is in fact a theorem, which is part of a more general result, Theorem 2.1.53 below.) More generally, the Lebesgue measure on $\mathbb{R}^n$, denoted by $\ell^n$, is the unique measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ such that

$$\ell^n\left(\prod_{i=1}^n (a_i, b_i]\right) = \prod_{i=1}^n (b_i - a_i) .$$

The proof of existence of $\ell^n$ for $n \ge 2$, assuming the existence of $\ell$, is a consequence of the forthcoming Theorem 2.3.7.

Definition 2.1.32 Let $\mu$ be a measure on $(X, \mathcal{X})$. If $\mu(X) < \infty$ the measure $\mu$ is called a finite measure. If there exists a sequence $\{K_n\}_{n \ge 1}$ of $\mathcal{X}$ such that $\mu(K_n) < \infty$ for all $n \ge 1$ and $\cup_{n=1}^{\infty} K_n = X$, the measure $\mu$ is called a σ-finite measure. A measure $\mu$ on $(\mathbb{R}^m, \mathcal{B}(\mathbb{R}^m))$ such that $\mu(C) < \infty$ for all bounded Borel sets $C$ is called a locally finite measure. It is called non-atomic or diffuse if $\mu(\{a\}) = 0$ for all $a \in \mathbb{R}^m$.

Remark 2.1.33 Non-atomicity does not imply that $\mu$ is the null measure. Indeed, the “proof” that it is null, namely

$$\mu(C) = \sum_{a \in C} \mu(\{a\}) = 0,$$

is not valid when $C$ is not countable, since σ-additivity applies only to countable families of sets.

We shall single out the case of a probability measure:

Definition 2.1.34 A measure $P$ on a measurable space $(\Omega, \mathcal{F})$ such that $P(\Omega) = 1$ is called a probability measure (for short: a probability).

Example 2.1.35: A few examples. The Dirac measure $\varepsilon_a$ is a probability measure. The counting measure $\nu$ on $\mathbb{Z}$ is a σ-finite measure. Any locally finite measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ is σ-finite. Lebesgue measure is a locally finite measure.

Theorem 2.1.36 Let $(X, \mathcal{X}, \mu)$ be a measure space. Let $\{A_n\}_{n \ge 1}$ be a sequence of $\mathcal{X}$, non-decreasing (that is, $A_n \subseteq A_{n+1}$ for all $n \ge 1$). Then

$$\mu\left(\cup_{n=1}^{\infty} A_n\right) = \lim_{n \uparrow \infty} \mu(A_n) . \tag{2.1}$$

Proof. With the convention $A_0 := \emptyset$, since $A_n = A_0 + \cup_{k=1}^n (A_k - A_{k-1})$ and $\cup_{n \ge 0} A_n = A_0 + \cup_{k \ge 1} (A_k - A_{k-1})$, where the sets $A_k - A_{k-1}$ are pairwise disjoint, by σ-additivity

$$\mu(A_n) = \mu(A_0) + \sum_{k=1}^n \mu(A_k - A_{k-1})$$

and

$$\mu(\cup_{n \ge 0} A_n) = \mu(A_0) + \sum_{k \ge 1} \mu(A_k - A_{k-1}),$$

from which the result follows. (A word of caution: see Exercise 2.4.8.)
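A worked instance of (2.1), offered as an illustration rather than as part of the text: take the Lebesgue measure $\ell$ on $\mathbb{R}$ and the non-decreasing sets $A_n = (0, 1 - 1/n]$ (so that $A_1 = \emptyset$). The only facts assumed are $\ell((a, b]) = b - a$ and $\ell((0, 1)) = 1$.

```latex
\[
A_n = \left(0,\, 1 - \tfrac{1}{n}\right], \qquad
\ell(A_n) = 1 - \tfrac{1}{n}, \qquad
\bigcup_{n=1}^{\infty} A_n = (0, 1), \qquad
\ell\big((0,1)\big) = 1 = \lim_{n \uparrow \infty} \ell(A_n).
\]
```

No single $A_n$ has measure 1, yet the limit of the measures recovers the measure of the union, as the theorem asserts for non-decreasing sequences.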




Definition 2.1.37 Let μ be a measure on the measurable topological space (X, B(X)). Its support supp(μ) is, by definition, the closure of the subset of X consisting of all the points x such that for all open neighborhoods Nx of x, μ(Nx ) > 0.

Negligible Sets

Definition 2.1.38 Let $(X, \mathcal{X}, \mu)$ be a measure space. A μ-negligible set is a set contained in a measurable set $N \in \mathcal{X}$ such that $\mu(N) = 0$. One says that some property $\mathcal{P}$ relative to the elements $x \in X$ holds μ-almost everywhere (μ-a.e.) if the set $\{x \in X : x \text{ does not satisfy } \mathcal{P}\}$ is a μ-negligible set.

For instance, if $f$ and $g$ are two Borel functions defined on $X$, the expression $f \le g$ μ-a.e. means that $\mu(\{x : f(x) > g(x)\}) = 0$.

Theorem 2.1.39 A countable union of μ-negligible sets is a μ-negligible set.

Proof. Let $A_n$, $n \ge 1$, be a sequence of μ-negligible sets, and let $N_n$, $n \ge 1$, be a sequence of measurable sets such that $\mu(N_n) = 0$ and $A_n \subseteq N_n$. Then $N = \cup_{n \ge 1} N_n$ is a measurable set containing $\cup_{n \ge 1} A_n$, and $N$ is of μ-measure 0, by the sub-σ-additivity property. □

Example 2.1.40: The rationals are Lebesgue-negligible. Any singleton $\{a\}$, $a \in \mathbb{R}$, is a Borel set of Lebesgue measure 0. The set of rationals $\mathbb{Q}$ is a Borel set of Lebesgue measure 0.

Proof. The Borel σ-field $\mathcal{B}(\mathbb{R})$ is generated by the intervals $I_a = (-\infty, a]$, $a \in \mathbb{R}$ (Theorem 2.1.9), and therefore $\{a\} = \cap_{n \ge 1} (I_a - I_{a - 1/n})$ is also in $\mathcal{B}(\mathbb{R})$. Denoting by $\ell$ the Lebesgue measure, $\ell(I_a - I_{a - 1/n}) = 1/n$, and therefore $\ell(\{a\}) = \lim_{n \uparrow \infty} \ell(I_a - I_{a - 1/n}) = 0$. $\mathbb{Q}$ is a countable union of sets in $\mathcal{B}(\mathbb{R})$ (singletons) and is therefore in $\mathcal{B}(\mathbb{R})$. It has Lebesgue measure 0 as a countable union of sets of Lebesgue measure 0. □
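An alternative route to $\ell(\mathbb{Q}) = 0$, sketched here as a standard covering argument that is not taken from the text: enumerate the rationals as $q_1, q_2, \ldots$ (the enumeration and the parameter $\varepsilon > 0$ are notation introduced for this illustration) and cover $q_n$ by an interval of length $\varepsilon 2^{-n}$. Monotonicity and sub-σ-additivity then give

```latex
\[
\ell(\mathbb{Q})
\le \ell\Big( \bigcup_{n \ge 1} \big( q_n - \varepsilon 2^{-n-1},\; q_n + \varepsilon 2^{-n-1} \big] \Big)
\le \sum_{n \ge 1} \varepsilon\, 2^{-n}
= \varepsilon .
\]
```

Since $\varepsilon > 0$ is arbitrary, $\ell(\mathbb{Q}) = 0$. The same computation shows that any countable subset of $\mathbb{R}$ is Lebesgue-negligible.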

Definition 2.1.41 The measure space (X, X , μ) is called complete if X contains all the μ-negligible subsets of X.

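Given a measure space $(X, \mathcal{X}, \mu)$ that is not necessarily complete, denote by $\mathcal{N}$ the collection of μ-negligible subsets of $X$ (note that it contains the empty set). Let $\overline{\mathcal{X}}$ be formed by the sets $A \cup N$ ($A \in \mathcal{X}$, $N \in \mathcal{N}$). This is obviously a σ-field. Let $\overline{\mu}$ be the function from $\overline{\mathcal{X}}$ to $\mathbb{R}_+$ defined by

$$\overline{\mu}(A \cup N) = \mu(A) \qquad (A \in \mathcal{X},\ N \in \mathcal{N}) .$$

Then $(X, \overline{\mathcal{X}}, \overline{\mu})$ is a complete measure space, called the completion of $(X, \mathcal{X}, \mu)$. The measure $\overline{\mu}$ is an extension of $\mu$ in the sense that $\overline{\mu}(A) = \mu(A)$ for all $A \in \mathcal{X}$.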


Equality of Measures

The forthcoming results, which are consequences of Dynkin’s theorem, help to prove that two measures are identical.

Let $\mathcal{P}$ be a collection of subsets of $X$, and let $\mu : \mathcal{P} \to [0, \infty]$ be σ-additive, that is, for any countable family $\{A_n\}_{n \ge 1}$ of mutually disjoint sets in $\mathcal{P}$ such that $\cup_{n \ge 1} A_n \in \mathcal{P}$, we have $\mu(\cup_{n \ge 1} A_n) = \sum_{n \ge 1} \mu(A_n)$. Then $\mu$ is called a measure on $\mathcal{P}$. Let $\mathcal{C} \subseteq \mathcal{P}$ be a collection of subsets of $X$. A mapping $\mu : \mathcal{P} \to [0, \infty]$ is called σ-finite on $\mathcal{C}$ if there exists a countable family $\{C_n\}_{n \ge 1}$ of sets in $\mathcal{C}$ such that $\cup_{n \ge 1} C_n = X$ and $\mu(C_n) < \infty$ for all $n \ge 1$.

Theorem 2.1.42 Let $\mu_1$ and $\mu_2$ be two measures on $(X, \mathcal{X})$ and let $\mathcal{S}$ be a π-system of measurable sets generating $\mathcal{X}$. Suppose that $\mu_1$ and $\mu_2$ are σ-finite on $\mathcal{S}$. If $\mu_1$ and $\mu_2$ agree on $\mathcal{S}$ (that is, $\mu_1(C) = \mu_2(C)$ for all $C \in \mathcal{S}$), then they are identical.

Proof. Let $C \in \mathcal{S}$ be such that $\mu_1(C) (= \mu_2(C)) < \infty$. Consider the collection $\mathcal{D}_C = \{A \in \mathcal{X} ;\ \mu_1(C \cap A) = \mu_2(C \cap A)\}$. One verifies that $\mathcal{D}_C$ is a d-system (the finiteness of $\mu_1(C)$ and $\mu_2(C)$ is needed here). Moreover, by hypothesis, $\mathcal{S} \subseteq \mathcal{D}_C$, and therefore $d(\mathcal{S}) \subseteq d(\mathcal{D}_C)$. But $d(\mathcal{D}_C) = \mathcal{D}_C$ and, by Dynkin’s Theorem 2.1.25, $d(\mathcal{S}) = \sigma(\mathcal{S}) = \mathcal{X}$. Therefore for all $C \in \mathcal{S}$ of finite measure and all $A \in \mathcal{X}$, we have $\mu_1(C \cap A) = \mu_2(C \cap A)$. By the assumed σ-finiteness of $\mu_1$ and $\mu_2$ on $\mathcal{S}$, there exists a countable family $\{C_n\}_{n \ge 1}$ of sets in $\mathcal{S}$ such that $\cup_{n \ge 1} C_n = X$ and $\mu_1(C_n) = \mu_2(C_n) < \infty$ for all $n \ge 1$. For $\alpha = 1, 2$ and all $n \ge 1$ we have that

$$\mu_\alpha\Big(\bigcup_{i=1}^{n} (C_i \cap B)\Big) = \sum_{1 \le i \le n} \mu_\alpha(C_i \cap B) - \sum_{1 \le i < j \le n} \mu_\alpha(C_i \cap C_j \cap B) + \cdots$$

[…]

$\{f > 0\}$, and therefore by sequential continuity, $\mu(\{f > 0\}) = 0$, that is, $f \le 0$, μ-a.e. On the other hand, by hypothesis, $f \ge 0$, μ-a.e. Therefore $f = 0$, μ-a.e.

(f) With $A = \{f > 0\}$, $1_A f$ is a non-negative measurable function. By (e), $1_A f = 0$, μ-a.e. This implies that $1_A = 0$, μ-a.e., that is to say $f \le 0$, μ-a.e. Similarly, $f \ge 0$, μ-a.e. Therefore, $f = 0$, μ-a.e.

(g) It is enough to consider the case $f \ge 0$. Since $f \ge n 1_{\{f = \infty\}}$ for all $n \ge 1$, we have $\infty > \mu(f) \ge n \mu(\{f = \infty\})$, and therefore $n \mu(\{f = \infty\}) \le \mu(f) < \infty$. This bound cannot hold for all $n \ge 1$ unless $\mu(\{f = \infty\}) = 0$. □

The extension to complex Borel functions of the properties (a), (b), (d) and (f) is immediate.

Example 2.2.14: Lebesgue-integrable but not Riemann-integrable. The function $f$ defined by $f := 1_{\mathbb{Q}}$ ($\mathbb{Q}$ is the set of rational numbers) is a Borel function and it is Lebesgue integrable with its integral equal to zero because $\{f \ne 0\}$ is the set of rational numbers, which has null Lebesgue measure. However, $f$ is not Riemann integrable.

2.2.3 Beppo Levi, Fatou and Lebesgue

The following versions of the theorems of Beppo Levi, Fatou and Lebesgue differ from the previous ones by the introduction of “μ-almost everywhere” in the statements of the conditions. No other proofs are needed since integrals of almost everywhere equal functions are equal and countable unions of negligible sets are negligible. Only a convention must be stated: if the limit of a sequence of real measurable functions exists μ-almost everywhere, that is, outside a μ-negligible set, then the limit is assigned some arbitrary value on this μ-negligible set; a common choice is the value 0.

Remember that we are looking for conditions guaranteeing that

$$\int_X \lim_{n \uparrow \infty} f_n \, d\mu = \lim_{n \uparrow \infty} \int_X f_n \, d\mu . \tag{2.6}$$

We start by restating the Beppo Levi or monotone convergence theorem.


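Theorem 2.2.15 Let $f_n : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ ($n \ge 1$) be such that

(i) $f_n \ge 0$ μ-a.e., and

(ii) $f_{n+1} \ge f_n$ μ-a.e.

Then, there exists a non-negative function $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that

$$\lim_{n \uparrow \infty} f_n = f \quad \mu\text{-a.e.} ,$$

and (2.6) holds true.

Next, we restate Fatou’s lemma.

Theorem 2.2.16 Let $f_n : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ ($n \ge 1$) be such that $f_n \ge 0$ μ-a.e. ($n \ge 1$). Then

$$\int_X \big(\liminf_{n \uparrow \infty} f_n\big) \, d\mu \le \liminf_{n \uparrow \infty} \int_X f_n \, d\mu . \tag{2.7}$$

Finally, we restate the Lebesgue or dominated convergence theorem.

Theorem 2.2.17 Let $f_n : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ ($n \ge 1$) be such that, for some function $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and some μ-integrable function $g : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$:

(i) $\lim_{n \uparrow \infty} f_n = f$, μ-a.e., and

(ii) $|f_n| \le |g|$ μ-a.e. for all $n \ge 1$.

Then, (2.6) holds true.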
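To see why a domination condition such as (ii) cannot simply be dropped, one may look at the classical example $f_n = n 1_{(0, 1/n]}$ on $[0, 1]$ with Lebesgue measure: the sequence converges to 0 at every point, yet every integral equals 1, so (2.6) fails. The sketch below is an illustration of this standard example and is not taken from the text. The integrals are computed exactly as height times length, and the names `f`, `integral`, `tail` and `integrals` are hypothetical.

```python
from fractions import Fraction

def f(n, x):
    """f_n(x) = n on (0, 1/n], and 0 elsewhere."""
    return n if 0 < x <= Fraction(1, n) else 0

def integral(n):
    """Lebesgue integral of the step function f_n: height n times length 1/n."""
    return n * Fraction(1, n)

x = Fraction(1, 7)
tail = [f(n, x) for n in range(1, 50)]        # vanishes as soon as 1/n < x
integrals = [integral(n) for n in range(1, 50)]
```

At the point $x = 1/7$ the values `tail` are 0 from $n = 8$ on, while every entry of `integrals` equals 1. No integrable $g$ can dominate all the $f_n$ here, since $\sup_n f_n(x)$ is of the order of $1/x$ near 0.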

Differentiation under the integral sign

Let $(X, \mathcal{X}, \mu)$ be a measure space and let $(a, b) \subseteq \mathbb{R}$. Let $f : (a, b) \times X \to \mathbb{R}$ and for all $t \in (a, b)$, define $f_t : X \to \mathbb{R}$ by $f_t(x) := f(t, x)$. Suppose that for all $t \in (a, b)$, $f_t$ is measurable with respect to $\mathcal{X}$, and define, when possible, the function $I : (a, b) \to \mathbb{R}$ by the formula

$$I(t) = \int_X f(t, x) \, \mu(dx) . \tag{2.8}$$

Assume that for μ-almost all $x$ the function $t \mapsto f(t, x)$ is continuous at $t_0 \in (a, b)$ and that there exists a μ-integrable function $g : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $|f(t, x)| \le |g(x)|$ μ-a.e. for all $t$ in a neighborhood $V$ of $t_0$. Then $I$ is well defined and is continuous at $t_0$.




If we furthermore assume that (α) $t \mapsto f(t, x)$ is continuously differentiable on $V$ for μ-almost all $x$, and (β) for some μ-integrable function $h : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and all $t \in V$,

$$|(df/dt)(t, x)| \le |h(x)| \quad \mu\text{-a.e.} ,$$

then $I$ is differentiable at $t_0$ and

$$I'(t_0) = \int_X (df/dt)(t_0, x) \, \mu(dx) . \tag{2.9}$$

Proof. Let $\{t_n\}_{n \ge 1}$ be a sequence in $V \setminus \{t_0\}$ such that $\lim_{n \uparrow \infty} t_n = t_0$, and define $f_n(x) = f(t_n, x)$, $f(x) = f(t_0, x)$. By dominated convergence,

$$\lim_{n \uparrow \infty} I(t_n) = \lim_{n \uparrow \infty} \mu(f_n) = \mu(f) = I(t_0) .$$

Also

$$\frac{I(t_n) - I(t_0)}{t_n - t_0} = \int_X \frac{f(t_n, x) - f(t_0, x)}{t_n - t_0} \, \mu(dx) ,$$

and for some $\theta \in (0, 1)$, possibly depending upon $n$,

$$\left| \frac{f(t_n, x) - f(t_0, x)}{t_n - t_0} \right| \le |(df/dt)(t_0 + \theta(t_n - t_0), x)| .$$

The latter quantity is bounded by $|h(x)|$. Therefore, by dominated convergence,

$$\lim_{n \uparrow \infty} \frac{I(t_n) - I(t_0)}{t_n - t_0} = \lim_{n \uparrow \infty} \int_X \frac{f(t_n, x) - f(t_0, x)}{t_n - t_0} \, \mu(dx) = \int_X (df/dt)(t_0, x) \, \mu(dx) .$$

□
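Formula (2.9) can be checked numerically on a simple case. The sketch below assumes, for illustration only, $X = [0, 1]$ with Lebesgue measure and $f(t, x) = e^{-tx}$, for which $|(df/dt)(t, x)| = x e^{-tx} \le 1$ when $t \ge 0$, so that $h \equiv 1$ is an integrable bound. The midpoint rule `integrate` and all other names are hypothetical helpers, not part of the text.

```python
import math

def integrate(g, a=0.0, b=1.0, steps=2000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (k + 0.5) * h) for k in range(steps))

def I(t):
    """I(t) = integral over [0, 1] of exp(-t x) dx, as in (2.8)."""
    return integrate(lambda x: math.exp(-t * x))

def dI_by_formula(t):
    """Integral of the t-derivative of the integrand, as in (2.9)."""
    return integrate(lambda x: -x * math.exp(-t * x))

t0, dt = 1.0, 1e-5
finite_difference = (I(t0 + dt) - I(t0 - dt)) / (2 * dt)
formula = dI_by_formula(t0)
# Closed form: I(t) = (1 - exp(-t)) / t, hence I'(t) = (exp(-t) (t + 1) - 1) / t**2.
exact = (math.exp(-t0) * (t0 + 1) - 1) / t0**2
```

The three quantities `finite_difference`, `formula` and `exact` agree up to the quadrature and rounding errors, as (2.9) predicts.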



2.3 The Other Big Theorems

Besides Beppo Levi, Fatou and Lebesgue’s theorems, the four main results of integration theory are (i) the image measure theorem, (ii) the Fubini–Tonelli theorem relative to the product measures (to be defined in a few lines), which gives conditions allowing one to “choose the order of integration in multiple integrals”, (iii) the Riesz–Fischer theorem relative to the Hilbert space structure of the space of square-integrable functions, and (iv) the Radon–Nikodým theorem relative to the product of a measure by a function, more precisely a converse of Theorem 2.3.28.


2.3.1 The Image Measure Theorem

Definition 2.3.1 Let $(X, \mathcal{X})$ and $(E, \mathcal{E})$ be two measurable spaces, let $h : (X, \mathcal{X}) \to (E, \mathcal{E})$ be a measurable function and let $\mu$ be a measure on $(X, \mathcal{X})$. Define the set function $\mu \circ h^{-1} : \mathcal{E} \to [0, \infty]$ by

$$(\mu \circ h^{-1})(C) := \mu(h^{-1}(C)) \qquad (C \in \mathcal{E}) . \tag{2.10}$$

Then, as one easily checks, $\mu \circ h^{-1}$ is a measure on $(E, \mathcal{E})$ called the image of $\mu$ by $h$.

Integrals can be computed in the original domain or in the image domain. More precisely:

Theorem 2.3.2 For a non-negative $f : (E, \mathcal{E}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$,

$$\int_X (f \circ h)(x) \, \mu(dx) = \int_E f(y) \, (\mu \circ h^{-1})(dy) . \tag{2.11}$$

For functions $f : (E, \mathcal{E}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ of arbitrary sign either one of the conditions

(a) $f \circ h$ is μ-integrable,

(b) $f$ is $\mu \circ h^{-1}$-integrable,

implies the other, and equality (2.11) then holds.

Proof. The equality (2.11) is readily verified when $f$ is a non-negative simple Borel function. In the general case one approximates $f$ by a non-decreasing sequence of non-negative simple Borel functions $\{f_n\}_{n \ge 1}$ and (2.11) then follows from the same equality written with $f = f_n$, by letting $n \uparrow \infty$ and using the monotone convergence theorem. For the case of functions of arbitrary sign, apply (2.11) with $f^+$ and $f^-$. □
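For a measure carried by finitely many points, both sides of (2.11) are finite sums and the identity can be verified directly. The sketch below assumes a measure on $\{-2, \ldots, 2\}$ given by arbitrary point masses, the map $h(x) = x^2$ and the test function $f(y) = 3y + 1$; these choices and the helper `image_measure` are hypothetical and only serve the illustration.

```python
from fractions import Fraction

# A finite measure on X = {-2, -1, 0, 1, 2}, described by its point masses.
mu = {-2: Fraction(1, 2), -1: Fraction(1, 4), 0: Fraction(3, 2),
      1: Fraction(1, 4), 2: Fraction(1, 3)}

def h(x):
    return x * x

def f(y):
    return 3 * y + 1

def image_measure(mu, h):
    """Point masses of mu o h^{-1}: the mass at y is mu(h^{-1}({y}))."""
    image = {}
    for x, mass in mu.items():
        image[h(x)] = image.get(h(x), 0) + mass
    return image

# Left-hand side of (2.11): integrate f o h against mu on X.
lhs = sum(f(h(x)) * mass for x, mass in mu.items())
# Right-hand side of (2.11): integrate f against the image measure on E.
rhs = sum(f(y) * mass for y, mass in image_measure(mu, h).items())
```

The image measure simply regroups the masses of the points having the same image, which is why `lhs` and `rhs` coincide.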

2.3.2 The Fubini–Tonelli Theorem

The first task is to define products of measurable spaces and of measures.

Definition 2.3.3 Let $(X_i, \mathcal{X}_i)$ ($1 \le i \le n$) be measurable spaces and let $X := \prod_{i=1}^{n} X_i$. Define the product σ-field $\mathcal{X} := \otimes_{i=1}^{n} \mathcal{X}_i$ to be the smallest σ-field on $X$ containing all the so-called generalized measurable rectangles $\prod_{i=1}^{n} A_i$, where $A_i \in \mathcal{X}_i$ ($1 \le i \le n$).

If $(X_i, \mathcal{X}_i) = (Y, \mathcal{Y})$ ($1 \le i \le n$), $\prod_{i=1}^{n} X_i$ is denoted by $Y^n$ and $\otimes_{i=1}^{n} \mathcal{X}_i$ by $\mathcal{Y}^{\otimes n}$. For the product of σ-fields, we have the rule of associativity. For instance (exercise):

$$\mathcal{X}_1 \otimes \mathcal{X}_2 \otimes \mathcal{X}_3 = (\mathcal{X}_1 \otimes \mathcal{X}_2) \otimes \mathcal{X}_3 = \mathcal{X}_1 \otimes (\mathcal{X}_2 \otimes \mathcal{X}_3) .$$

The Borel σ-field $\mathcal{B}(\mathbb{R})^{\otimes n}$ (denoted by $\mathcal{B}(\mathbb{R})^n$ for short) is the σ-field on $\mathbb{R}^n$ generated by the generalized measurable rectangles of $\mathbb{R}^n$. The question is: Is this σ-field identical to $\mathcal{B}(\mathbb{R}^n)$, the σ-field generated by the rectangles of the type $\prod_{i=1}^{n} (a_i, b_i]$, where $-\infty < a_i \le b_i < +\infty$ ($1 \le i \le n$)? The answer is positive and a consequence of the rule of associativity of the product of σ-fields and of the result below.


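Theorem 2.3.4 Let $E$ and $F$ be two separable metric spaces with respective Borel σ-fields $\mathcal{B}(E)$ and $\mathcal{B}(F)$. Then $\mathcal{B}(E \times F) = \mathcal{B}(E) \otimes \mathcal{B}(F)$.

Proof. (i) For the proof that $\mathcal{B}(E \times F) \supseteq \mathcal{B}(E) \otimes \mathcal{B}(F)$, separability is not needed. We just have to observe that the projections $\pi_1$ and $\pi_2$ from $E \times F$ to $E$ and $F$, respectively, are continuous, and therefore measurable functions from $(E \times F, \mathcal{B}(E \times F))$ to $(E, \mathcal{B}(E))$ and $(F, \mathcal{B}(F))$, respectively. In particular, if $C \in \mathcal{B}(E)$, then $C \times F = \pi_1^{-1}(C) \in \mathcal{B}(E \times F)$ and similarly, if $D \in \mathcal{B}(F)$, then $E \times D \in \mathcal{B}(E \times F)$. Therefore $C \times D = (C \times F) \cap (E \times D) \in \mathcal{B}(E \times F)$.

(ii) We now prove that $\mathcal{B}(E \times F) \subseteq \mathcal{B}(E) \otimes \mathcal{B}(F)$. By the separability assumption, there exists a dense countable subset of $E$. Consider the collection $\mathcal{U}$ consisting of all open balls with rational radius centered at some point of this dense set. It forms a base for the topology in the sense that any open set of $E$ is the union of sets in $\mathcal{U}$. Let $\mathcal{V}$ be a similar base for $F$. To any $(x, y) \in O$, an open set of $E \times F$, one can associate an open set $U \times V$, where $U \in \mathcal{U}$ and $V \in \mathcal{V}$, that contains $(x, y)$ and is contained in $O$. Therefore $O$ is the union (at most countable) of sets of the form $U \times V$, where $U \in \mathcal{U}$ and $V \in \mathcal{V}$. In particular, every open set of $E \times F$ is measurable with respect to $\mathcal{B}(E) \otimes \mathcal{B}(F)$. □

Lemma 2.3.5 Let $X = X_1 \times X_2$ and $\mathcal{X} = \mathcal{X}_1 \otimes \mathcal{X}_2$. Let $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. For fixed $x_1 \in X_1$, let the function $f_{x_1} : X_2 \to \mathbb{R}$ be defined by $f_{x_1}(x_2) := f(x_1, x_2)$. Then $f_{x_1}$ is a measurable function from $(X_2, \mathcal{X}_2)$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

Proof. STEP 1. We first prove the result for the special case $f = 1_F$, where $F \in \mathcal{X}$. Let for fixed $x_1 \in X_1$

$$F_{x_1} := \{x_2 \in X_2 ;\ (x_1, x_2) \in F\} .$$

($F_{x_1}$ is called the section of $F$ at $x_1$.) We have $f_{x_1} = 1_{F_{x_1}}$. Therefore we have to prove that $F_{x_1} \in \mathcal{X}_2$. For this define

$$\mathcal{C}_{x_1} = \{F \subseteq X ;\ F_{x_1} \in \mathcal{X}_2\} .$$

We want to show that $\mathcal{C}_{x_1} \supseteq \mathcal{X}$. For this, we first observe that $\mathcal{C}_{x_1}$ is a σ-field since it contains $X$ and $\varnothing$, and

(i) if $F \in \mathcal{C}_{x_1}$, then $\overline{F} \in \mathcal{C}_{x_1}$. Indeed, since $F_{x_1} \in \mathcal{X}_2$, we have that $\overline{(F_{x_1})} \in \mathcal{X}_2$. But $\overline{(F_{x_1})} = (\overline{F})_{x_1}$. Therefore $(\overline{F})_{x_1} \in \mathcal{X}_2$.

(ii) if $F_n \in \mathcal{C}_{x_1}$ ($n \ge 1$), then $\cup_{n \ge 1} F_n \in \mathcal{C}_{x_1}$. Indeed, $(F_n)_{x_1} \in \mathcal{X}_2$ and therefore $\cup_{n \ge 1} (F_n)_{x_1} \in \mathcal{X}_2$. But $\cup_{n \ge 1} (F_n)_{x_1} = (\cup_{n \ge 1} F_n)_{x_1}$. Therefore $(\cup_{n \ge 1} F_n)_{x_1} \in \mathcal{X}_2$.

The σ-field $\mathcal{C}_{x_1}$ contains all the rectangles $A \times B$, where $A \in \mathcal{X}_1$, $B \in \mathcal{X}_2$. Indeed, $(A \times B)_{x_1} = B$ if $x_1 \in A$, and $(A \times B)_{x_1} = \varnothing$ if $x_1 \notin A$. We may now conclude that, since $\mathcal{C}_{x_1}$ is a σ-field containing the generators of $\mathcal{X}$, it contains $\mathcal{X}$.

STEP 2. Let now $f : (X, \mathcal{X}) \to (\mathbb{R}_+, \mathcal{B}(\mathbb{R}_+))$. It is the limit of some non-decreasing sequence $\{f_n\}_{n \ge 1}$ of non-negative simple functions. In particular, $f_{x_1} = \lim_{n \uparrow \infty} (f_n)_{x_1}$.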


It therefore suffices to prove that for any non-negative simple function $g = \sum_{i=1}^{k} a_i 1_{A_i}$, the function $g_{x_1}$ is $\mathcal{X}_2$-measurable. This is true since

$$g_{x_1} = \sum_{i=1}^{k} a_i 1_{(A_i)_{x_1}}$$

and since the $(A_i)_{x_1}$ are in $\mathcal{X}_2$ by the result in Step 1.

STEP 3. We now consider a general $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. We have, with the usual notation, $f = f^+ - f^-$, and therefore $f_{x_1} = (f^+)_{x_1} - (f^-)_{x_1}$. By Step 2, $(f^+)_{x_1}$ and $(f^-)_{x_1}$ are $\mathcal{X}_2$-measurable, and therefore so is $f_{x_1}$. □

Lemma 2.3.6 Let $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be a non-negative function. Then, if $\mu_2$ is σ-finite, the function $x_1 \mapsto \int_{X_2} f_{x_1}(x_2) \, \mu_2(dx_2)$ is measurable from $(X_1, \mathcal{X}_1)$ to $(\mathbb{R}_+, \mathcal{B}(\mathbb{R}_+))$.

Proof. For fixed $x_1 \in X_1$, the integral $\int_{X_2} f_{x_1} \, d\mu_2$ is well defined since $f_{x_1}$ is measurable (by the previous lemma) and non-negative. Observe that the conclusion of the lemma is true for $f = 1_{A \times B}$, where $A \in \mathcal{X}_1$ and $B \in \mathcal{X}_2$. Indeed, in this case,

$$\int_{X_2} f_{x_1} \, d\mu_2 = \int_{X_2} 1_A(x_1) 1_B(x_2) \, \mu_2(dx_2) = 1_A(x_1) \mu_2(B) .$$

We now prove that the lemma is true for $f = 1_F$ when $F \in \mathcal{X}$. Let

$$g_F(x_1) := \int_{X_2} 1_F(x_1, x_2) \, \mu_2(dx_2) = \mu_2(F_{x_1}) .$$

Consider the collection of sets

$$\mathcal{C} := \{F \in \mathcal{X} ;\ g_F \text{ is } \mathcal{X}_1\text{-measurable}\} .$$

First suppose that $\mu_2$ is a finite measure. In this case, $\mathcal{C}$ is a Dynkin system, since (i) if $A$ and $B$ are in $\mathcal{C}$ and $A \subseteq B$, then $\mu_2((B - A)_{x_1}) = \mu_2(B_{x_1}) - \mu_2(A_{x_1})$ (this is where we need finiteness of $\mu_2$) is $\mathcal{X}_1$-measurable, and (ii) if $\{C_n\}_{n \ge 1}$ is a non-decreasing sequence of sets in $\mathcal{C}$,

$$\mu_2\big((\cup_{n \ge 1} C_n)_{x_1}\big) = \lim_{n \uparrow \infty} \mu_2\big((C_n)_{x_1}\big)$$

is $\mathcal{X}_1$-measurable. Since $\mathcal{C}$ contains the measurable rectangles $A \times B$, it contains $\mathcal{X}$, by Dynkin’s theorem (Theorem 2.1.25).

If $\mu_2$ is not finite, but only σ-finite, there exists a sequence $\{K_n\}_{n \ge 1}$ of elements of $\mathcal{X}_2$ increasing to $X_2$ and such that the measure $\mu_{2,n}$ defined by $\mu_{2,n}(A) = \mu_2(A \cap K_n)$ is finite. Then $\mu_2(F_{x_1}) = \lim_{n \uparrow \infty} \mu_{2,n}(F_{x_1})$ is $\mathcal{X}_1$-measurable.

Finally, we pass from indicator functions of measurable sets to non-negative measurable functions by the usual monotone convergence argument. □


Theorem 2.3.7 Suppose $\mu_1$ and $\mu_2$ σ-finite. Then there exists a unique measure $\mu$ on $(X_1 \times X_2, \mathcal{X}_1 \otimes \mathcal{X}_2)$ such that

$$\mu(A_1 \times A_2) = \mu_1(A_1) \mu_2(A_2) \tag{2.12}$$

for all $A_1 \in \mathcal{X}_1$, $A_2 \in \mathcal{X}_2$. This measure, denoted by $\mu_1 \otimes \mu_2$, or $\mu_1 \times \mu_2$, is called the product measure of $\mu_1$ by $\mu_2$.

Proof. Existence. Consider the set function $\mu : \mathcal{X} \to [0, \infty]$ defined by

$$\mu(F) = \int_{X_1} \left( \int_{X_2} 1_F(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) .$$

It is a measure on $(X, \mathcal{X})$ (the monotone convergence theorem proves σ-additivity) that is obviously σ-finite and satisfies (2.12).

Uniqueness. Let $\mathcal{A}$ be the algebra consisting of the finite sums of disjoint measurable rectangles. Define (uniquely) the measure $\mu_0$ on $(X, \mathcal{A})$ by $\mu_0(A_1 \times A_2) = \mu_1(A_1) \mu_2(A_2)$. By Carathéodory’s theorem (Theorem 2.1.50), there exists a unique extension of $\mu_0$ to $(X, \mathcal{X})$. □

The above result extends in an obvious manner to a finite number of σ-finite measures.

Example 2.3.8: Lebesgue measure on $\mathbb{R}^n$. The typical example of a product measure is the Lebesgue measure on the space $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$: It is the unique measure $\ell^n$ on that space that is such that

$$\ell^n\Big(\prod_{i=1}^{n} A_i\Big) = \prod_{i=1}^{n} \ell(A_i) \qquad \text{for all } A_1, \ldots, A_n \in \mathcal{B}(\mathbb{R}) .$$

Going back to the situation with two measure spaces (the case of a finite number of measure spaces is similar) we have the following result:

Theorem 2.3.9 Let $(X_1, \mathcal{X}_1, \mu_1)$ and $(X_2, \mathcal{X}_2, \mu_2)$ be two measure spaces in which $\mu_1$ and $\mu_2$ are σ-finite. Let $(X, \mathcal{X}, \mu) = (X_1 \times X_2, \mathcal{X}_1 \otimes \mathcal{X}_2, \mu_1 \otimes \mu_2)$.

(A) Tonelli. If $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is non-negative, then, for $\mu_1$-almost all $x_1$, the function $x_2 \mapsto f(x_1, x_2)$ is measurable with respect to $\mathcal{X}_2$, and

$$x_1 \mapsto \int_{X_2} f(x_1, x_2) \, \mu_2(dx_2)$$

is a measurable function with respect to $\mathcal{X}_1$. Furthermore,

$$\int_X f \, d\mu = \int_{X_1} \left( \int_{X_2} f(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) = \int_{X_2} \left( \int_{X_1} f(x_1, x_2) \, \mu_1(dx_1) \right) \mu_2(dx_2) . \tag{2.14}$$


(B) Fubini. If $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is μ-integrable, then, (a): for $\mu_1$-almost all $x_1$, the function $x_2 \mapsto f(x_1, x_2)$ is $\mu_2$-integrable, and (b): $x_1 \mapsto \int_{X_2} f(x_1, x_2) \, \mu_2(dx_2)$ is $\mu_1$-integrable, and (2.14) is true.

The global result is referred to as the Fubini–Tonelli theorem.

Remark 2.3.10 Part (A) (Tonelli) says that one can integrate a non-negative $\mathcal{X}$-measurable function in any order of its variables. Part (B) (Fubini) says that the same is true of any $\mathcal{X}$-measurable function provided that function is μ-integrable. In general, in order to apply Part (B) one must first apply Part (A) to $|f|$ to ascertain whether or not $\int_X |f| \, d\mu < \infty$.

Proof. (A) The σ-finite measures

$$\nu(F) = \int_X 1_F \, d\mu$$

and

$$\mu(F) = \int_{X_1} \left( \int_{X_2} 1_F(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1)$$

coincide on the algebra $\mathcal{A}$ consisting of the finite sums of disjoint generalized measurable rectangles. They are therefore identical, by Theorem 2.1.42. Therefore we have proved the theorem for $f$ of the form $1_F$, $F \in \mathcal{X}$. The general case of a non-negative measurable function is obtained by the usual monotone convergence argument.

(B) Since $f$ is μ-integrable, by Tonelli’s theorem,

$$\int_{X_1} \left( \int_{X_2} |f(x_1, x_2)| \, \mu_2(dx_2) \right) \mu_1(dx_1) < \infty$$

and in particular

$$\int_{X_2} |f(x_1, x_2)| \, \mu_2(dx_2) < \infty , \quad \mu_1\text{-a.e.}$$

Therefore, outside a $\mu_1$-negligible set $N_1$,

$$\int_{X_2} f^{\pm}(x_1, x_2) \, \mu_2(dx_2) < \infty .$$

We may suppose that the above inequalities are true everywhere because we may replace $f$, without changing its integral with respect to $\mu$, by a function μ-almost everywhere equal to $f$. By Tonelli,

$$\int_X f^{\pm} \, d\mu = \int_{X_1} \left( \int_{X_2} f^{\pm}(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1)$$

and therefore

$$\begin{aligned}
\int_X f \, d\mu &= \int_X f^+ \, d\mu - \int_X f^- \, d\mu \\
&= \int_{X_1} \left( \int_{X_2} f^+(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) - \int_{X_1} \left( \int_{X_2} f^-(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) \\
&= \int_{X_1} \left( \int_{X_2} f^+(x_1, x_2) \, \mu_2(dx_2) - \int_{X_2} f^-(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) \\
&= \int_{X_1} \left( \int_{X_2} f(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) .
\end{aligned}$$


(All this fuss guarantees that at every step we do not encounter ∞ − ∞ forms.)
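The integrability hypothesis of Part (B) matters. A classical illustration, which is not taken from the text, uses the counting measure on $\mathbb{N} \times \mathbb{N}$ and the array $a(m, n)$ equal to $1$ if $n = m$, to $-1$ if $n = m + 1$, and to $0$ otherwise: the sum of $|a(m, n)|$ is infinite, and the two iterated sums differ. In the sketch below the inner sums are exact because each row and each column has finitely many non-zero terms; all names are hypothetical.

```python
def a(m, n):
    """a(m, n) = 1 if n == m, -1 if n == m + 1, and 0 otherwise (m, n >= 0)."""
    if n == m:
        return 1
    if n == m + 1:
        return -1
    return 0

def row_sum(m):
    """Sum over n of a(m, n); only n = m and n = m + 1 contribute."""
    return sum(a(m, n) for n in range(m + 2))

def column_sum(n):
    """Sum over m of a(m, n); only m = n - 1 and m = n contribute."""
    return sum(a(m, n) for m in range(n + 1))

N = 200
# Sum over n first, then over m: every inner sum is 0.
rows_then_columns = sum(row_sum(m) for m in range(N))
# Sum over m first, then over n: the inner sum is 1 for n = 0 and 0 afterwards.
columns_then_rows = sum(column_sum(n) for n in range(N))
```

The first iterated sum is 0 and the second is 1, whatever the truncation level `N`, so that no analogue of (2.14) can hold for this non-integrable array.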



Remark 2.3.11 Is the hypothesis of σ-finiteness superfluous? In fact it is not, as the following counterexample shows. Take $(X_i, \mathcal{X}_i) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ ($i = 1, 2$), let $\mu_1 = \ell$, the Lebesgue measure, and let $\mu_2 = \nu$, the measure that associates with a measurable set its cardinality (only finite sets have a finite measure, and therefore, since there is no sequence of finite sets increasing to $\mathbb{R}$, this measure is not σ-finite). Now, let $C = \{(x, x) ;\ x \in \mathbb{R}\}$ (the diagonal of $\mathbb{R}^2$). Clearly $C_{x_1} = \{x_1\}$ and $C_{x_2} = \{x_2\}$, so that $\mu_2(C_{x_1}) = 1$ and therefore

$$\int_{X_1} \mu_2(C_{x_1}) \, \mu_1(dx_1) = \int_{\mathbb{R}} 1 \, \ell(dx) = \infty .$$

On the other hand, $\mu_1(C_{x_2}) = 0$, and therefore,

$$\int_{X_2} \mu_1(C_{x_2}) \, \mu_2(dx_2) = \int_{\mathbb{R}} 0 \, \nu(dx) = 0 .$$

Integration by Parts Formula

Theorem 2.3.12 Let $\mu_1$ and $\mu_2$ be two σ-finite measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. For any interval $(a, b) \subseteq \mathbb{R}$,

$$\mu_1((a, b]) \mu_2((a, b]) = \int_{(a, b]} \mu_1((a, t]) \, \mu_2(dt) + \int_{(a, b]} \mu_2((a, t)) \, \mu_1(dt) . \tag{2.15}$$

Observe that the first integral features the interval $(a, t]$ (closed on the right), whereas in the second integral, the interval is of the type $(a, t)$ (open on the right).

Proof. The proof consists in computing the $\mu_1 \times \mu_2$-measure of the square $D := (a, b] \times (a, b]$ in two ways. The first one is obvious and gives the left-hand side of (2.15). The second one consists in observing that $\mu(D) = \mu(D_1) + \mu(D_2)$, where $D_1 = \{(x, y) ;\ a < y \le b,\ a < x \le y\}$ and $D_2 = \{(a, b] \times (a, b]\} \cap \overline{D_1}$. Then $\mu(D_1)$ and $\mu(D_2)$ are computed using Tonelli’s theorem. For instance,

$$\mu(D_1) = \int_{\mathbb{R}} \left( \int_{\mathbb{R}} 1_{D_1}(x, y) \, \mu_1(dx) \right) \mu_2(dy) = \cdots$$

[…]

0 (using property (b) of Theorem 2.2.12),

$$f \, \mathcal{R} \, g \implies \int_X |f|^p \, d\mu = \int_X |g|^p \, d\mu .$$

The operations $+$, $\times$, $*$ and multiplication by a scalar $\alpha \in \mathbb{C}$ are defined on the equivalence classes by

$$\{f\} + \{g\} = \{f + g\} , \quad \{f\} \{g\} = \{f g\} , \quad \{f\}^* = \{f^*\} , \quad \alpha \{f\} = \{\alpha f\} .$$

The first equality means that $\{f\} + \{g\}$ is, by definition, the equivalence class consisting of the functions $f + g$, where $f$ and $g$ are members of $\{f\}$ and $\{g\}$, respectively. Similar interpretations hold for the other equalities.

By definition, for a given $p \ge 1$, $L^p_{\mathbb{C}}(\mu)$ is the collection of equivalence classes $\{f\}$ such that $\int_X |f|^p \, d\mu < \infty$. Clearly it is a vector space over $\mathbb{C}$ (for the proof recall that

$$\left( \frac{|f| + |g|}{2} \right)^p \le \frac{1}{2} |f|^p + \frac{1}{2} |g|^p$$

since $t \mapsto t^p$ is a convex function when $p \ge 1$).

In order to avoid cumbersome notation, in this section and in general whenever we consider $L^p$-spaces, we shall write $f$ for $\{f\}$. This abuse of notation is harmless since two members of the same equivalence class have the same integral if that integral is defined. Therefore, using this loose notation, we may write

$$L^p_{\mathbb{C}}(\mu) = \left\{ f : \int_X |f|^p \, d\mu < \infty \right\} . \tag{2.17}$$

When the measure is the counting measure on the set $\mathbb{Z}$ of relative integers, the traditional notation is $\ell^p_{\mathbb{C}}(\mathbb{Z})$. This is the space of complex sequences $\{x_n\}_{n \in \mathbb{Z}}$ such that

$$\sum_{n \in \mathbb{Z}} |x_n|^p < \infty .$$

The following is a simple and often used observation.

Theorem 2.3.16 Let $p$ and $q$ be positive real numbers such that $p > q$. If the measure $\mu$ on $(X, \mathcal{X})$ is finite, then $L^p_{\mathbb{C}}(\mu) \subseteq L^q_{\mathbb{C}}(\mu)$. In particular, $L^2_{\mathbb{C}}(\mu) \subseteq L^1_{\mathbb{C}}(\mu)$.

Proof. From the inequality $|a|^q \le 1 + |a|^p$, true for all $a \in \mathbb{C}$, it follows that $\mu(|f|^q) \le \mu(1) + \mu(|f|^p)$. Since $\mu(1) = \mu(X) < \infty$, $\mu(|f|^q) < \infty$ whenever $\mu(|f|^p) < \infty$. □


Remark 2.3.17 This inclusion is not true in general if $\mu$ is not a finite measure. For instance, consider the Lebesgue measure $\ell$ on $\mathbb{R}$: there exist functions in $L^1_{\mathbb{C}}(\ell)$ that are not in $L^2_{\mathbb{C}}(\ell)$ and vice versa (Exercise 2.4.18).

In the case of the (not finite) counting measure on $\mathbb{Z}$, the order of inclusion is the reverse of the one concerning finite measures:

Theorem 2.3.18 $\ell^p_{\mathbb{C}}$ inclusions. If $p > q$, $\ell^q_{\mathbb{C}}(\mathbb{Z}) \subset \ell^p_{\mathbb{C}}(\mathbb{Z})$. In particular, $\ell^1_{\mathbb{C}}(\mathbb{Z}) \subset \ell^2_{\mathbb{C}}(\mathbb{Z})$.

Proof. Exercise 2.4.19. □

Hölder’s Inequality

Theorem 2.3.19 Let $p$ and $q$ be real numbers in $(1, \infty)$ such that

$$\frac{1}{p} + \frac{1}{q} = 1$$

($p$ and $q$ are then said to be conjugate) and let $f, g : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be non-negative real functions. Then,

$$\int_X f g \, d\mu \le \left( \int_X f^p \, d\mu \right)^{1/p} \left( \int_X g^q \, d\mu \right)^{1/q} . \tag{2.18}$$

In particular, if $f, g \in L^2_{\mathbb{C}}(\mu)$, then $f g \in L^1_{\mathbb{C}}(\mu)$.

Proof. Let

$$A = \left( \int_X f^p \, d\mu \right)^{1/p} , \quad B = \left( \int_X g^q \, d\mu \right)^{1/q} .$$

It may be assumed that $0 < A, B < \infty$, because otherwise Hölder’s inequality is trivially satisfied. Let $F := f/A$, $G := g/B$, so that

$$\int_X F^p \, d\mu = \int_X G^q \, d\mu = 1 .$$

Suppose that we have been able to prove that

$$F(x) G(x) \le \frac{1}{p} F(x)^p + \frac{1}{q} G(x)^q . \tag{2.19}$$

Integrating this inequality yields

$$\int_X (F G) \, d\mu \le \frac{1}{p} + \frac{1}{q} = 1 ,$$

and this is just (2.18). Inequality (2.19) is trivially satisfied if $x$ is such that $F(x) = 0$ or $G(x) = 0$. It is also satisfied when $F(x) > 0$ and $G(x) > 0$. Indeed, letting

$$s(x) := p \ln(F(x)) , \quad t(x) := q \ln(G(x)) ,$$

from the convexity of the exponential function and the assumption that $1/p + 1/q = 1$,

$$e^{s(x)/p + t(x)/q} \le \frac{1}{p} e^{s(x)} + \frac{1}{q} e^{t(x)} ,$$

and this is precisely inequality (2.19). For the last assertion of the theorem, take $p = q = 2$ and apply (2.18) to $|f|$ and $|g|$. □
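Inequality (2.18) is easy to test numerically when $\mu$ is a measure on a finite set, in which case the integrals are weighted sums. The sketch below draws random non-negative functions, random point masses and random exponents $p > 1$, and checks (2.18) up to rounding. The helper names, the random ranges and the tolerance are assumptions made for the illustration.

```python
import random

def integral(values, weights):
    """Integral of a function on a finite set w.r.t. the given point masses."""
    return sum(v * w for v, w in zip(values, weights))

def holder_sides(f, g, weights, p):
    """Return both sides of (2.18) for conjugate exponents p and q."""
    q = p / (p - 1)                      # so that 1/p + 1/q = 1
    lhs = integral([a * b for a, b in zip(f, g)], weights)
    rhs = (integral([a ** p for a in f], weights) ** (1 / p)
           * integral([b ** q for b in g], weights) ** (1 / q))
    return lhs, rhs

random.seed(0)
checks = []
for _ in range(200):
    size = random.randint(1, 8)
    f = [random.uniform(0, 5) for _ in range(size)]
    g = [random.uniform(0, 5) for _ in range(size)]
    weights = [random.uniform(0, 2) for _ in range(size)]
    p = random.uniform(1.1, 6.0)
    lhs, rhs = holder_sides(f, g, weights, p)
    # Small tolerance: equality can occur (e.g. when size == 1).
    checks.append(lhs <= rhs * (1 + 1e-12) + 1e-12)
```

All entries of `checks` are true. With `size == 1` the two sides coincide up to rounding, which is why the comparison is made with a tolerance.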




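Minkowski’s Inequality

Theorem 2.3.20 Let $p \ge 1$ and let $f, g : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be non-negative functions in $L^p_{\mathbb{C}}(\mu)$. Then,

$$\left( \int_X (f + g)^p \, d\mu \right)^{1/p} \le \left( \int_X f^p \, d\mu \right)^{1/p} + \left( \int_X g^p \, d\mu \right)^{1/p} . \tag{2.20}$$

Proof. For $p = 1$ the inequality (in fact an equality) is obvious. Therefore, assume $p > 1$. From Hölder’s inequality

$$\int_X f (f + g)^{p-1} \, d\mu \le \left( \int_X f^p \, d\mu \right)^{1/p} \left( \int_X (f + g)^{(p-1)q} \, d\mu \right)^{1/q}$$

and

$$\int_X g (f + g)^{p-1} \, d\mu \le \left( \int_X g^p \, d\mu \right)^{1/p} \left( \int_X (f + g)^{(p-1)q} \, d\mu \right)^{1/q} .$$

Adding up the above two inequalities and observing that $(p - 1) q = p$, we obtain

$$\int_X (f + g)^p \, d\mu \le \left[ \left( \int_X f^p \, d\mu \right)^{1/p} + \left( \int_X g^p \, d\mu \right)^{1/p} \right] \left( \int_X (f + g)^p \, d\mu \right)^{1/q} .$$

One may assume that the right-hand side of (2.20) is finite and that the left-hand side is positive (otherwise the inequality is trivial). Therefore $\int_X (f + g)^p \, d\mu \in (0, \infty)$ and we may divide both sides of the last display by $\left( \int_X (f + g)^p \, d\mu \right)^{1/q}$. Observing that $1 - 1/q = 1/p$ yields the announced inequality (2.20). □

Theorem 2.3.21 Let $p \ge 1$. The mapping $\nu_p : L^p_{\mathbb{C}}(\mu) \to [0, \infty)$ defined by

$$\nu_p(f) := \left( \int_X |f|^p \, d\mu \right)^{1/p} \tag{2.21}$$

is a norm on $L^p_{\mathbb{C}}(\mu)$.

Proof. Clearly, $\nu_p(\alpha f) = |\alpha| \nu_p(f)$ for all $\alpha \in \mathbb{C}$, $f \in L^p_{\mathbb{C}}(\mu)$. Also, $\nu_p(f) = 0$ if and only if $\left( \int_X |f|^p \, d\mu \right)^{1/p} = 0$, which in turn is equivalent to $f = 0$, μ-a.e. Finally, $\nu_p(f + g) \le \nu_p(f) + \nu_p(g)$ for all $f, g \in L^p_{\mathbb{C}}(\mu)$, by Minkowski’s inequality. □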

The Riesz–Fischer Theorem

Denoting $\nu_p(f)$ by $\|f\|_p$, $L^p_{\mathbb{C}}(\mu)$ is a normed vector space over $\mathbb{C}$, with the norm $\|\cdot\|_p$ and the induced metric $d_p(f, g) := \|f - g\|_p$.

Theorem 2.3.22 Let $p \ge 1$. The metric $d_p$ makes of $L^p_{\mathbb{C}}(\mu)$ a complete normed vector space. In other words, $L^p_{\mathbb{C}}(\mu)$ is a Banach space for the norm $\|\cdot\|_p$.

Proof. To show completeness one must prove that for any sequence $\{f_n\}_{n \ge 1}$ of $L^p_{\mathbb{C}}(\mu)$ that is a Cauchy sequence (that is, such that $\lim_{m, n \uparrow \infty} d_p(f_n, f_m) = 0$), there exists an $f \in L^p_{\mathbb{C}}(\mu)$ such that $\lim_{n \uparrow \infty} d_p(f_n, f) = 0$.


Since $\{f_n\}_{n \ge 1}$ is a Cauchy sequence, one can select a subsequence $\{f_{n_i}\}_{i \ge 1}$ such that

$$d_p(f_{n_{i+1}}, f_{n_i}) \le 2^{-i} . \tag{2.22}$$

Let

$$g_k = \sum_{i=1}^{k} |f_{n_{i+1}} - f_{n_i}| , \quad g = \sum_{i=1}^{\infty} |f_{n_{i+1}} - f_{n_i}| .$$

By (2.22) and Minkowski’s inequality, $\|g_k\|_p \le 1$. Fatou’s lemma applied to the sequence $\{g_k^p\}_{k \ge 1}$ gives $\|g\|_p \le 1$. In particular, any member of the equivalence class of $g$ is μ-almost everywhere finite and therefore

$$f_{n_1}(x) + \sum_{i=1}^{\infty} \big( f_{n_{i+1}}(x) - f_{n_i}(x) \big)$$

converges absolutely for μ-almost all $x$. Call the corresponding limit $f(x)$ (set $f(x) = 0$ when this limit does not exist). Since

$$f_{n_1} + \sum_{i=1}^{k-1} \big( f_{n_{i+1}} - f_{n_i} \big) = f_{n_k} ,$$

we see that

$$f = \lim_{k \uparrow \infty} f_{n_k} \quad \mu\text{-a.e.}$$

One must show that $f$ is the limit in $L^p_{\mathbb{C}}(\mu)$ of $\{f_{n_k}\}_{k \ge 1}$. Let $\varepsilon > 0$. There exists an integer $N = N(\varepsilon)$ such that $\|f_n - f_m\|_p \le \varepsilon$ whenever $m, n \ge N$. For all $m > N$, by Fatou’s lemma we have

$$\int_X |f - f_m|^p \, d\mu \le \liminf_{i \to \infty} \int_X |f_{n_i} - f_m|^p \, d\mu \le \varepsilon^p .$$

Therefore $f - f_m \in L^p_{\mathbb{C}}(\mu)$ and consequently $f \in L^p_{\mathbb{C}}(\mu)$. It also follows from the last inequality that

$$\lim_{m \to \infty} \|f - f_m\|_p = 0 .$$

□

The next result is a by-product of the proof of Theorem 2.3.22.

Theorem 2.3.23 Let $p \ge 1$ and let $\{f_n\}_{n \ge 1}$ be a convergent sequence in $L^p_{\mathbb{C}}(\mu)$. Let $f$ be the corresponding limit in $L^p_{\mathbb{C}}(\mu)$. Then, there exists a subsequence $\{f_{n_i}\}_{i \ge 1}$ such that

$$\lim_{i \uparrow \infty} f_{n_i} = f \quad \mu\text{-a.e.} \tag{2.23}$$

Note that the statement in (2.23) is about functions and not about equivalence classes. The functions thereof are any members of the corresponding equivalence class. In particular, when a given sequence of functions converges μ-a.e. to two functions, these two functions are necessarily equal μ-a.e. Therefore,

Theorem 2.3.24 If $\{f_n\}_{n \ge 1}$ converges both to $f$ in $L^p_{\mathbb{C}}(\mu)$ and to $g$ μ-a.e., then $f = g$ μ-a.e.


Of special interest for applications is the space $L^2_{\mathbb{C}}(\mu)$ of complex measurable functions $f : X \to \mathbb{C}$ such that

$$\int_X |f(x)|^2 \, \mu(dx) < \infty ,$$

where two functions $f$ and $f'$ such that $f(x) = f'(x)$, μ-a.e. are not distinguished. We have by the Riesz–Fischer theorem:

Theorem 2.3.25 $L^2_{\mathbb{C}}(\mu)$ is a vector space with scalar field $\mathbb{C}$, and when endowed with the inner product

$$\langle f, g \rangle := \int_X f(x) g(x)^* \, \mu(dx) , \tag{2.24}$$

it is a Hilbert space.

The norm of a function $f \in L^2_{\mathbb{C}}(\mu)$ is

$$\|f\| = \left( \int_X |f(x)|^2 \, \mu(dx) \right)^{1/2}$$

and the distance between two functions $f$ and $g$ in $L^2_{\mathbb{C}}(\mu)$ is

$$d(f, g) = \left( \int_X |f(x) - g(x)|^2 \, \mu(dx) \right)^{1/2} .$$

The completeness property of $L^2_{\mathbb{C}}(\mu)$ reads in this case as follows. If $\{f_n\}_{n \ge 1}$ is a sequence of functions in $L^2_{\mathbb{C}}(\mu)$ such that

$$\lim_{m, n \uparrow \infty} \int_X |f_n(x) - f_m(x)|^2 \, \mu(dx) = 0 ,$$

then, there exists a function $f \in L^2_{\mathbb{C}}(\mu)$ such that

$$\lim_{n \uparrow \infty} \int_X |f_n(x) - f(x)|^2 \, \mu(dx) = 0 .$$

In $L^2_{\mathbb{C}}(\mu)$, Schwarz’s inequality reads as follows:

$$\left| \int_X f(x) g(x)^* \, \mu(dx) \right| \le \left( \int_X |f(x)|^2 \, \mu(dx) \right)^{1/2} \left( \int_X |g(x)|^2 \, \mu(dx) \right)^{1/2} .$$

Example 2.3.26: Complex sequences. The set of complex sequences $a = \{a_n\}_{n \in \mathbb{Z}}$ such that

$$\sum_{n \in \mathbb{Z}} |a_n|^2 < \infty$$

is, when endowed with the inner product

$$\langle a, b \rangle = \sum_{n \in \mathbb{Z}} a_n b_n^* ,$$

a Hilbert space, denoted by $\ell^2_{\mathbb{C}}(\mathbb{Z})$. This is indeed a particular case of a Hilbert space $L^2_{\mathbb{C}}(\mu)$, where $X = \mathbb{Z}$ and $\mu$ is the counting measure. In this example, Schwarz’s inequality takes the form

$$\left| \sum_{n \in \mathbb{Z}} a_n b_n^* \right| \le \left( \sum_{n \in \mathbb{Z}} |a_n|^2 \right)^{1/2} \times \left( \sum_{n \in \mathbb{Z}} |b_n|^2 \right)^{1/2} .$$


2.3.4 The Radon–Nikodým Theorem

The Product of a Measure by a Function

Definition 2.3.27 Let $(X, \mathcal{X}, \mu)$ be a measure space and let $h : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be a non-negative measurable function. Define the set function $\nu : \mathcal{X} \to [0, \infty]$ by

$$\nu(C) = \int_C h(x) \, \mu(dx) .$$

Then $\nu$ is a measure on $(X, \mathcal{X})$ called the product of $\mu$ by the function $h$. This is denoted by $d\nu = h \, d\mu$.

That $\nu$ is a measure is easily checked. First of all, it is obvious that $\nu(\varnothing) = 0$. As for the σ-additivity property, write for any sequence of mutually disjoint measurable sets $\{A_n\}_{n \ge 1}$,

$$\begin{aligned}
\nu(\cup_{n \ge 1} A_n) &= \int_{\cup_{n \ge 1} A_n} h \, d\mu = \int_X 1_{\cup_{n \ge 1} A_n} h \, d\mu \\
&= \int_X \Big( \sum_{n \ge 1} 1_{A_n} \Big) h \, d\mu = \int_X \lim_{k \uparrow \infty} \Big( \sum_{n=1}^{k} 1_{A_n} \Big) h \, d\mu \\
&= \lim_{k \uparrow \infty} \int_X \Big( \sum_{n=1}^{k} 1_{A_n} \Big) h \, d\mu = \lim_{k \uparrow \infty} \sum_{n=1}^{k} \int_X 1_{A_n} h \, d\mu \\
&= \lim_{k \uparrow \infty} \sum_{n=1}^{k} \nu(A_n) = \sum_{n \ge 1} \nu(A_n) ,
\end{aligned}$$

where the fifth equality is by monotone convergence.

Theorem 2.3.28 Let $\mu$, $h$ and $\nu$ be as in Definition 2.3.27.

(i) For non-negative $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$,

$$\int_X f(x) \, \nu(dx) = \int_X f(x) h(x) \, \mu(dx) . \tag{2.25}$$

(ii) If $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ has arbitrary sign, then either one of the following conditions

(a) $f$ is ν-integrable,

(b) $f h$ is μ-integrable,

implies the other, and the equality (2.25) then holds.

Proof. Verify (2.25) for elementary non-negative functions and, approximating $f$ by a non-decreasing sequence of such functions, use the monotone convergence theorem as in the proof of (2.11). For the case of functions of arbitrary sign, apply (2.25) with $f = f^+$ and $f = f^-$. □

Observe that in the situation of Theorem 2.3.28, for all $C \in \mathcal{X}$,

$$\mu(C) = 0 \implies \nu(C) = 0 . \tag{2.26}$$


Definition 2.3.29 Let $\mu$ and $\nu$ be two measures on $(X, \mathcal{X})$.

(A) If (2.26) holds for all $C \in \mathcal{X}$, $\nu$ is said to be absolutely continuous with respect to $\mu$. This is denoted by $\nu \ll \mu$.

(B) The measures $\mu$ and $\nu$ on $(X, \mathcal{X})$ are said to be mutually singular if there exists a set $A \in \mathcal{X}$ such that $\nu(A) = \mu(\overline{A}) = 0$. This is denoted by $\mu \perp \nu$.

Lebesgue’s decomposition

Theorem 2.3.30 Let $\mu$ and $\nu$ be two σ-finite measures on $(X, \mathcal{X})$. There exists a unique decomposition (called the Lebesgue decomposition)

$$\nu = \nu_a + \nu_s$$

such that

$$\nu_a \ll \mu \quad \text{and} \quad \nu_s \perp \mu ,$$

and a non-negative measurable function $g : X \to \mathbb{R}$ such that

$$d\nu_a = g \, d\mu ,$$

this function being μ-essentially unique (that is, if $g'$ is such that $d\nu_a = g' \, d\mu$, then $g(x) = g'(x)$ μ-a.e.).

Proof. STEP 1. We first assume that $\mu$ and $\nu$ are finite and that $\nu \le \mu$, that is, $\nu(A) \le \mu(A)$ for all $A \in \mathcal{X}$. Define a mapping $\varphi : L^2_{\mathbb{R}}(\mu) \to \mathbb{R}$ by

$$\varphi(f) = \int_X f \, d\nu .$$

The latter integral is well defined since the hypothesis of finiteness of $\mu$ implies that $L^2_{\mathbb{R}}(\mu) \subseteq L^1_{\mathbb{R}}(\mu)$, and the hypothesis $\nu \le \mu$ implies that $\int_X |f| \, d\nu \le \int_X |f| \, d\mu$. Also $\varphi(f)$ does not depend on the function chosen in the equivalence class of $L^2_{\mathbb{R}}(\mu)$. In fact, letting $f'$ be another such function, $f = f'$ μ-a.e. implies that $f = f'$ ν-a.e. and then $\int_X f \, d\nu = \int_X f' \, d\nu$. By Schwarz’s inequality,

$$|\varphi(f)| \le \left( \int_X f^2 \, d\nu \right)^{1/2} \nu(X)^{1/2} \le \left( \int_X f^2 \, d\mu \right)^{1/2} \nu(X)^{1/2} = \nu(X)^{1/2} \, \|f\|_{L^2_{\mathbb{R}}(\mu)} .$$

Therefore, the (linear) functional $\varphi$ from the Hilbert space $L^2_{\mathbb{R}}(\mu)$ to $\mathbb{R}$ is continuous. By Riesz’s theorem on the representation of linear functionals on $L^2_{\mathbb{R}}(\mu)$ (Theorem C.5.2), there exists a $g \in L^2_{\mathbb{R}}(\mu)$ such that

$$\varphi(f) = \langle f, g \rangle_{L^2_{\mathbb{R}}(\mu)} := \int_X f g \, d\mu ,$$

that is,

$$\int_X f \, d\nu = \int_X f g \, d\mu .$$

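In particular, $\nu(A) = \int_A g \, d\mu$ for all $A \in \mathcal{X}$. With $A = \{g \ge 1 + \varepsilon\}$ where $\varepsilon > 0$,

$$\mu(\{g \ge 1 + \varepsilon\}) \ge \nu(\{g \ge 1 + \varepsilon\}) = \int_{\{g \ge 1 + \varepsilon\}} g \, d\mu \ge (1 + \varepsilon) \mu(\{g \ge 1 + \varepsilon\}) ,$$

and therefore $\mu(\{g \ge 1 + \varepsilon\}) = 0$. Since $\varepsilon$ is an arbitrary positive number, this implies that $g \le 1$ μ-a.e. A similar argument shows that $g \ge 0$ μ-a.e. We may in fact suppose that $0 \le g(x) \le 1$ for all $x \in X$ by replacing if necessary $g$ by $g 1_{\{0 \le g \le 1\}}$.

STEP 2. We still assume that $\mu$ and $\nu$ are finite, but not that $\nu \le \mu$. However, since $\nu \le \mu + \nu$, we may apply the above results to $\nu$ and $\mu + \nu$, to obtain the existence of a measurable function $g$ such that $0 \le g \le 1$ and such that for all $f \in L^2_{\mathbb{R}}(\mu + \nu)$,

$$\int_X f \, d\nu = \int_X f g \, d(\mu + \nu) .$$

In particular, for any bounded measurable function $f : X \to \mathbb{R}$,

$$\int_X f (1 - g) \, d\nu = \int_X f g \, d\mu . \tag{$\star$}$$

By monotone convergence, this equality extends to all non-negative measurable functions $f$.

STEP 3. Replacing $f$ by $f 1_N$ in ($\star$), where $N := \{g = 1\}$, the left-hand side vanishes since $1 - g = 0$ on $N$, and we have that $\int_N f \, d\mu = 0$ for all non-negative measurable $f$, and therefore $\mu(N) = 0$. The measures $\nu_s := 1_N \nu$ and $\mu$ are therefore mutually singular.

STEP 4. Replacing in ($\star$) $f$ by $\frac{f}{1 - g} 1_{\overline{N}}$ gives that for all non-negative measurable functions $f$,

$$\int_{\overline{N}} f \, d\nu = \int_X f h \, d\mu ,$$

where

$$h := 1_{\overline{N}} \, \frac{g}{1 - g} .$$

It remains to define $\nu_a$ by $d\nu_a := 1_{\overline{N}} \, d\nu = h \, d\mu$ to conclude the existence part of the theorem, under the assumption that $\mu$ and $\nu$ are finite.

STEP 5. To prove the uniqueness of the pair $(\nu_a, \nu_s)$, consider another such pair $(\widetilde{\nu}_a, \widetilde{\nu}_s)$. For all $A \in \mathcal{X}$,

$$\nu_a(A) - \widetilde{\nu}_a(A) = -\nu_s(A) + \widetilde{\nu}_s(A) . \tag{$\dagger$}$$

Since the supports $N$ and $\widetilde{N}$ of $\nu_s$ and $\widetilde{\nu}_s$ respectively have a null μ-measure, and since $\nu_a \ll \mu$ and $\widetilde{\nu}_a \ll \mu$,

$$\begin{aligned}
\nu_s(A) - \widetilde{\nu}_s(A) &= \nu_s\big(A \cap (N \cup \widetilde{N})\big) - \widetilde{\nu}_s\big(A \cap (N \cup \widetilde{N})\big) \\
&= -\nu_a\big(A \cap (N \cup \widetilde{N})\big) + \widetilde{\nu}_a\big(A \cap (N \cup \widetilde{N})\big) = 0 .
\end{aligned}$$

Therefore $\nu_s \equiv \widetilde{\nu}_s$, and consequently, from ($\dagger$), $\nu_a \equiv \widetilde{\nu}_a$.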

STEP 6. To prove the uniqueness of h, just observe that if another measurable non-negative function $\tilde h$ satisfies
$$\nu_a(A) = \int_A h\,d\mu = \int_A \tilde h\,d\mu$$
for all $A \in \mathcal{X}$, then necessarily $\tilde h = h$ μ-a.e.

STEP 7. We get rid of the finiteness hypothesis for μ and ν, only assuming that these measures are σ-finite. Therefore, there exists a measurable partition $\{K_n\}_{n \ge 1}$ of X such that $\mu_n := 1_{K_n}\mu$ and $\nu_n := 1_{K_n}\nu$ are finite measures. Applying the above results to $\mu_n$ and $\nu_n$, and calling $\nu_{a,n}$, $\nu_{s,n}$ and $h_n$ the corresponding items of the decomposition, define
$$\nu_a = \sum_{n \ge 1} \nu_{a,n} , \qquad \nu_s = \sum_{n \ge 1} \nu_{s,n} , \qquad h = \sum_{n \ge 1} h_n .$$
The verification that $\nu_a$, $\nu_s$ and h satisfy the requirement of the theorem is straightforward. ∎

The Radon–Nikodým Derivative

Corollary 2.3.31 Let μ and ν be two σ-finite measures on $(X, \mathcal{X})$ such that $\nu \ll \mu$. Then there exists a non-negative function $h : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that
$$\nu(dx) = h(x)\,\mu(dx) .$$

Proof. From the uniqueness of the Lebesgue decomposition of Theorem 2.3.30 and the hypothesis $\nu \ll \mu$, it follows that $\nu_a = \nu$ and therefore $\nu_s \equiv 0$. ∎

The function h is called the Radon–Nikodým derivative of ν with respect to μ and is denoted dν/dμ. With such a notation, we have that
$$\int_X f(x)\,\nu(dx) = \int_X f(x)\,\frac{d\nu}{d\mu}(x)\,\mu(dx)$$
for all non-negative $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

Complementary reading
For the omitted proofs of existence and uniqueness of measures, see for instance [Royden, 1988].

2.4 Exercises

Exercise 2.4.1. The σ-field generated by a collection of sets
(1) Let $\{\mathcal{F}_i\}_{i \in I}$ be a non-empty family of σ-fields on some set Ω (the non-empty index set I is arbitrary). Show that the family $\mathcal{F} = \cap_{i \in I} \mathcal{F}_i$ is a σ-field ($A \in \mathcal{F}$ if and only if $A \in \mathcal{F}_i$ for all i ∈ I).
(2) Let $\mathcal{C}$ be a family of subsets of some set Ω. Show the existence of a smallest σ-field $\mathcal{F}$ containing $\mathcal{C}$. (This means, by definition, that $\mathcal{F}$ is a σ-field on Ω containing $\mathcal{C}$, such that if $\mathcal{F}'$ is a σ-field on Ω containing $\mathcal{C}$, then $\mathcal{F} \subseteq \mathcal{F}'$.)

Exercise 2.4.2. Simple functions
(1) Show that a Borel function $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B})$ taking a finite number of values is a simple function.
(2) Show that a function measurable with respect to the gross σ-field is a constant.

Exercise 2.4.3. $\mathcal{B}(\overline{\mathbb{R}})$
Recall that $\mathcal{B}(\overline{\mathbb{R}})$ is the σ-field on $\overline{\mathbb{R}}$ generated by the intervals of type (−∞, a] (a ∈ R). Describe $\mathcal{B}(\overline{\mathbb{R}})$ in terms of $\mathcal{B}(\mathbb{R})$.

Exercise 2.4.4. $\sigma(f^{-1}(\mathcal{C})) = f^{-1}(\sigma(\mathcal{C}))$
Let X and E be sets, f : X → E a function from X to E and $\mathcal{C}$ a collection of subsets of E. Prove that $\sigma(f^{-1}(\mathcal{C})) = f^{-1}(\sigma(\mathcal{C}))$.

Exercise 2.4.5. The smallest σ-field guaranteeing measurability
Let f : X → E be a function. Let $\mathcal{E}$ be a given σ-field on E. What is the smallest σ-field $\mathcal{X}$ on X such that f is measurable with respect to $\mathcal{X}$ and $\mathcal{E}$?

Exercise 2.4.6. The modulus of a function and measurability
Let f : X → E be a function. Is it true that if |f| is measurable with respect to $\mathcal{X}$ and $\mathcal{E}$, then so is f itself?

Exercise 2.4.7. Measurability with respect to the gross σ-field
Prove that a function f : E → R measurable with respect to the gross σ-field on E and the Borel σ-field on R is a constant.

Exercise 2.4.8. Decreasing sequences of measurable sets
Let $(X, \mathcal{X}, \mu)$ be a measure space and $\{B_n\}_{n \ge 1}$ a non-increasing sequence of sets of $\mathcal{X}$ such that $\mu(B_{n_0}) < \infty$ for some $n_0 \in \mathbb{N}_+$. Show that
$$\mu\left(\bigcap_{n=1}^{\infty} B_n\right) = \lim_{n \uparrow \infty} \downarrow \mu(B_n) .$$
Give a counterexample for the necessity of the condition $\mu(B_{n_0}) < \infty$ for some $n_0$.

Exercise 2.4.9. Almost-everywhere equal continuous functions
Prove that if two continuous functions f, g : R → R are ℓ-a.e. equal, they are everywhere equal.

Exercise 2.4.10. $\frac{\sin x}{x}$
Let $f(x) := \frac{\sin x}{x}$. Prove that the limit $\lim_{t \uparrow \infty} \int_0^t \frac{\sin x}{x}\,dx$ exists. Is f integrable on $\mathbb{R}_+$ with respect to the Lebesgue measure?

Exercise 2.4.11. From integral to series
Prove that for all a, b > 0,
$$\int_{\mathbb{R}_+} \frac{t\,e^{-at}}{1 - e^{-bt}}\,dt = \sum_{n=0}^{+\infty} \frac{1}{(a+nb)^2} .$$

Exercise 2.4.12. Fun Fubini
A bounded rectangle of $\mathbb{R}^2$ is said to have Property (A) if at least one of its sides "is an integer" (meaning: its length is an integer). Let Δ be a finite rectangle that is the union of a finite number of disjoint rectangles with Property (A). Show that Δ itself must have Property (A). Hint: $\int_I e^{2i\pi x}\,dx$ . . .

Exercise 2.4.13. Fourier transform
Let $f : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be integrable with respect to the Lebesgue measure. Show that for any ν ∈ R,
$$\hat f(\nu) = \int_{\mathbb{R}} f(t)\,e^{-2i\pi\nu t}\,dt$$
is well defined and that the function $\hat f$ is continuous and bounded. ($\hat f$ is called the Fourier transform of f.)

Exercise 2.4.14. Convolution
Let $f, g : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be integrable with respect to the Lebesgue measure ℓ and let $\hat f$, $\hat g$ be their respective Fourier transforms (see Exercise 2.4.13).
(1) Show that
$$\int_{\mathbb{R}} \int_{\mathbb{R}} |f(t-s)\,g(s)|\,dt\,ds < \infty .$$
(2) Deduce from this that for almost all t ∈ R, the function $s \mapsto f(t-s)g(s)$ is ℓ-integrable, and therefore that the convolution f ∗ g, where
$$(f * g)(t) = \int_{\mathbb{R}} f(t-s)\,g(s)\,ds ,$$
is almost everywhere well defined.
(3) For all t such that the last integral is not defined, set (f ∗ g)(t) = 0. Show that f ∗ g is ℓ-integrable and that its Fourier transform is $\widehat{f * g} = \hat f\,\hat g$.

Exercise 2.4.15. A Fubini counterexample
Let $(X_i, \mathcal{X}_i, \mu_i)$ (i = 1, 2) be two versions of the measure space $(X, \mathcal{X}, \mu)$, where X = {1, 2, . . .}, $\mathcal{X} = \mathcal{P}(X)$ and μ is the counting measure. Consider the function $f : X_1 \times X_2 \to \mathbb{Z}$ whose non-null values are f(m, m) = +1 and f(m + 1, m) = −1 (m ≥ 1). Show that
$$\sum_m \left(\sum_n f(m,n)\right) = 1 \quad \text{and} \quad \sum_n \left(\sum_m f(m,n)\right) = 0 .$$
Why don't we obtain the same values for both sums?

Exercise 2.4.16. Another Fubini counterexample
Define $f : [0,1]^2 \to \mathbb{R}$ by
$$f(x,y) = \frac{x^2 - y^2}{(x^2 + y^2)^2}\,1_{\{(x,y) \ne (0,0)\}} .$$
Compute $\int_{[0,1]} \left(\int_{[0,1]} f(x,y)\,dx\right) dy$ and $\int_{[0,1]} \left(\int_{[0,1]} f(x,y)\,dy\right) dx$. Is f Lebesgue integrable on $[0,1]^2$?

Exercise 2.4.17. Convolution of measures
The convolution product of two finite measures $\mu_1$ and $\mu_2$ on $\mathbb{R}^d$ is the measure ν on $\mathbb{R}^d$ that is the image of the product measure $\mu := \mu_1 \times \mu_2$ on $\mathbb{R}^d \times \mathbb{R}^d$ under the mapping $(x_1, x_2) \mapsto x_1 + x_2$. This measure will be denoted by $\mu_1 * \mu_2$.
(i) Show that for any non-negative measurable function $f : \mathbb{R}^d \to \mathbb{R}$,
$$\int_{\mathbb{R}^d} f(x)\,\nu(dx) = \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} f(x_1 + x_2)\,\mu_1(dx_1)\,\mu_2(dx_2) .$$
(ii) Let μ be a finite measure on $\mathbb{R}^d$ and let $\varepsilon_a$ be the Dirac measure (on $\mathbb{R}^d$) at point $a \in \mathbb{R}^d$. What is the convolution product $\mu * \varepsilon_a$?

Exercise 2.4.18. $L^1_{\mathbb{C}}(\ell)$ and $L^2_{\mathbb{C}}(\ell)$
Show that there exist functions in $L^1_{\mathbb{C}}(\ell)$ that are not in $L^2_{\mathbb{C}}(\ell)$ and vice versa.

Exercise 2.4.19. $\ell^p_{\mathbb{C}}$
Show that if p > q, then $\ell^q_{\mathbb{C}}(\mathbb{Z}) \subset \ell^p_{\mathbb{C}}(\mathbb{Z})$.

Exercise 2.4.20. The Lebesgue decomposition
Let μ and ν be measures on the measurable space $(X, \mathcal{X})$. Describe the Lebesgue decomposition in the following cases:
A. $(X, \mathcal{X}) = (\mathbb{Z}, \mathcal{P}(\mathbb{Z}))$.
B. $(X, \mathcal{X}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, μ(dx) = f(x) dx and ν(dx) = g(x) dx.

Chapter 3 Probability and Expectation

Although from a formal point of view a probability is just a measure with total mass equal to one, and expectation is nothing more than an integral with respect to this measure,¹ probability theory has two ingredients that make the difference: the notion of independence and that of conditional expectation. Probability theory has a specific terminology adapted to its goals, and therefore we begin with the "translation" of the theory of measure and integration into the theory of probability and expectation.

3.1 From Integral to Expectation

3.1.1 Translation

Recall that abstract (or axiomatic) probability theory features a "sample" space Ω and a collection $\mathcal{F}$ of its subsets that forms a σ-field, the σ-field of events. An element $A \in \mathcal{F}$ is called an event. A probability P on $(\Omega, \mathcal{F})$ is a measure on this measurable space with total mass 1. The results obtained in the previous chapter will now be recast in this specific framework. The probabilistic version of Theorem 2.1.42 is given below for future reference.

Theorem 3.1.1 Let $P_1$ and $P_2$ be two probability measures on $(\Omega, \mathcal{F})$ and let $\mathcal{S}$ be a π-system of measurable sets generating $\mathcal{F}$. If $P_1$ and $P_2$ agree on $\mathcal{S}$, they are identical.

Let $(E, \mathcal{E})$ be some measurable space. A measurable mapping (or function) $X : (\Omega, \mathcal{F}) \to (E, \mathcal{E})$ is called a random element with values in E. If $(E, \mathcal{E}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, it is called a random variable. If $(E, \mathcal{E}) = (\mathbb{R}^m, \mathcal{B}(\mathbb{R}^m))$, it is called a random vector. In view of Theorem 2.1.18, for a mapping $X = (X_1, \ldots, X_m)$ to be a random vector in $\mathbb{R}^m$, it suffices that $\{X_i \le a\} \in \mathcal{F}$ (1 ≤ i ≤ m, a ∈ R). From Corollary 2.1.17, we have that if X is a random element with values in the measurable space $(E, \mathcal{E})$ and if g is a measurable function from $(E, \mathcal{E})$ to another measurable space $(G, \mathcal{G})$, then g(X) is a random element with values in the measurable space $(G, \mathcal{G})$.

¹ But as the wise man said: "He who does not know measure from probability does not know sake from rice."

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/978-3-030-40183-2_3

Corollary 2.1.20 and Theorem 2.1.21 tell us that all ordinary operations on random variables (addition, multiplication, and quotient, if well defined) and the limit operations (limsup, liminf, and lim, if well defined) preserve the status of random variable. Since a random variable X is a measurable function, we can define, under certain circumstances, its integral with respect to the probability measure P, called the expectation of X. Therefore
$$E[X] = \int_\Omega X(\omega)\,P(d\omega) .$$
The main steps in the definition of the integral (here the expectation) are summarized below in the specific notation of probability theory. If $A \in \mathcal{F}$,
$$E[1_A] = P(A) ,$$
and more generally, if X is a simple random variable, that is, $X(\omega) = \sum_{i=1}^{N} \alpha_i 1_{A_i}(\omega)$, where $\alpha_i \in \mathbb{R}$, $A_i \in \mathcal{F}$ and N < ∞, then
$$E[X] = \sum_{i=1}^{N} \alpha_i P(A_i) .$$
For a non-negative random variable X, the expectation is always defined by
$$E[X] = \lim_{n \uparrow \infty} E[X_n] ,$$

where $\{X_n\}_{n \ge 1}$ is a non-decreasing sequence of non-negative simple random variables that converges to X. This definition is consistent, that is, it does not depend on the approximating sequence of non-negative simple random variables as long as it is non-decreasing and has X for limit. In particular, with the following special choice of the approximating sequence:
$$X_n = \sum_{k=0}^{n2^n - 1} \frac{k}{2^n}\,1_{A_{k,n}} + n\,1_{\{X \ge n\}} ,$$
where $A_{k,n} := \{k \times 2^{-n} \le X < (k+1) \times 2^{-n}\}$, we have for any non-negative random variable X, the "horizontal slice formula":
$$E[X] = \lim_{n \uparrow \infty} \left( \sum_{k=0}^{n2^n - 1} \frac{k}{2^n}\,P(A_{k,n}) + n\,P(X \ge n) \right) .$$
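The horizontal slice formula can be checked numerically. The following is a minimal sketch, not part of the text: it assumes a non-negative random variable with a continuous cdf (so that $P(A_{k,n})$ is a difference of cdf values and $P(X \ge n) = 1 - F(n)$) and uses the exponential law with parameter 1, whose mean is 1, as a test case. The names `slice_approximation` and `exp_cdf` are illustrative.

```python
import math


def slice_approximation(cdf, n):
    """E[X_n] = sum_{k < n 2^n} (k / 2^n) P(A_{k,n}) + n P(X >= n).

    Assumes X >= 0 with a continuous cdf, so that
    P(k 2^-n <= X < (k+1) 2^-n) = cdf((k+1) 2^-n) - cdf(k 2^-n)
    and P(X >= n) = 1 - cdf(n).
    """
    step = 2.0 ** (-n)
    total = 0.0
    for k in range(n * 2 ** n):
        total += k * step * (cdf((k + 1) * step) - cdf(k * step))
    return total + n * (1.0 - cdf(n))


def exp_cdf(x, lam=1.0):
    """cdf of the exponential distribution with parameter lam."""
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0


# E[X_n] for n = 1, ..., 10; the sequence is non-decreasing and tends to E[X] = 1.
approximations = [slice_approximation(exp_cdf, n) for n in range(1, 11)]
```

The values increase with n because $X_n \le X_{n+1}$ pointwise, which is exactly the monotonicity used in the definition of the expectation.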

If X is of arbitrary sign, the expectation is defined by $E[X] = E[X^+] - E[X^-]$ if $E[X^+]$ and $E[X^-]$ are not both infinite. If $E[X^+]$ and $E[X^-]$ are both infinite, the expectation is not defined. If E[|X|] < ∞, X is said to be integrable and E[X] is then a finite number.

The basic properties of the expectation are linearity and monotonicity: If $X_1$ and $X_2$ are integrable (resp. non-negative) random variables, then (linearity): for all $\lambda_1, \lambda_2 \in \mathbb{R}$ (resp. $\in \mathbb{R}_+$),
$$E[\lambda_1 X_1 + \lambda_2 X_2] = \lambda_1 E[X_1] + \lambda_2 E[X_2] , \qquad (3.1)$$
and (monotonicity):
$$X_1 \le X_2 \implies E[X_1] \le E[X_2] . \qquad (3.2)$$
It follows from monotonicity that if E[X] is well defined,
$$|E[X]| \le E[|X|] . \qquad (3.3)$$

Mean and Variance

The definitions of the mean $m_X$ and the variance $\sigma_X^2$ of a real-valued random variable X are given, when the corresponding expectations are meaningful, by
$$m_X := E[X] , \qquad \sigma_X^2 := E[(X - m_X)^2] = E[X^2] - m_X^2 .$$

Markov's Inequality

This inequality was given and proved in the specific framework of discrete random variables (Theorem 1.5.4).

Theorem 3.1.2 Let Z be a non-negative real random variable, and let a > 0. We then have
$$P(Z \ge a) \le \frac{E[Z]}{a} .$$

Proof. Reproduce verbatim the proof in the special case of discrete variables (Theorem 1.5.4). ∎

Specializing the Markov inequality of Theorem 3.1.2 to $Z = (X - m_X)^2$ and $a = \varepsilon^2 > 0$, we obtain as in the first chapter Chebyshev's inequality: For all ε > 0,
$$P(|X - m_X| \ge \varepsilon) \le \frac{\sigma_X^2}{\varepsilon^2} .$$
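As an illustration (a sketch under stated assumptions, not part of the text), both bounds can be compared with exact probabilities for an exponential random variable with parameter λ, for which $P(X \ge x) = e^{-\lambda x}$, $m_X = 1/\lambda$ and $\sigma_X^2 = 1/\lambda^2$. The function names are illustrative.

```python
import math


def exp_tail(x, lam):
    """P(X >= x) for an exponential random variable with parameter lam."""
    return math.exp(-lam * x) if x > 0 else 1.0


def markov_bound(mean, a):
    """Right-hand side of Markov's inequality, E[Z] / a."""
    return mean / a


def chebyshev_bound(variance, eps):
    """Right-hand side of Chebyshev's inequality, sigma^2 / eps^2."""
    return variance / eps ** 2


def exp_deviation_probability(eps, lam):
    """Exact value of P(|X - 1/lam| >= eps) for X exponential with parameter lam."""
    m = 1.0 / lam
    upper = exp_tail(m + eps, lam)                      # P(X >= m + eps)
    lower = 1.0 - exp_tail(m - eps, lam) if eps < m else 0.0  # P(X <= m - eps)
    return upper + lower
```

For instance, with λ = 1 and a = 5, `exp_tail(5.0, 1.0)` equals $e^{-5}$ while `markov_bound(1.0, 5.0)` equals 0.2: the bounds hold but can be far from tight.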

Jensen's Inequality

Jensen's inequality is also a simple consequence of the monotonicity of expectation and of the expectation formula for indicator functions.

Theorem 3.1.3 Let I be a general interval of R (closed, open, semi-closed, infinite, etc.) and let (a, b) be its interior, assumed non-empty. Let ϕ : I → R be a convex function. Let X be an integrable real-valued random variable such that P(X ∈ I) = 1. Assume moreover that either ϕ is non-negative, or that ϕ(X) is integrable. Then
$$E[\varphi(X)] \ge \varphi(E[X]) . \qquad (3.4)$$

Proof. Reproduce verbatim the proof in the special case of discrete random variables (Theorem 3.1.3). ∎

3.1.2 Probability Distributions

Definition 3.1.4 The distribution of a random element X with values in $(E, \mathcal{E})$ is, by definition, the probability measure $Q_X$ on $(E, \mathcal{E})$, the image of the probability measure P by the mapping X from $(\Omega, \mathcal{F})$ to $(E, \mathcal{E})$ (that is, for all $C \in \mathcal{E}$, $Q_X(C) = P(X \in C)$).

The next result is a rephrasing of Theorem 2.3.2 in the context of probability.

Theorem 3.1.5 If g is a measurable function from $(E, \mathcal{E})$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ that is non-negative, then
$$E[g(X)] = \int_E g(x)\,Q_X(dx) . \qquad (3.5)$$
If g is of arbitrary sign, and if one of the following two conditions is satisfied: (a) g(X) is P-integrable, or (b) g is $Q_X$-integrable, then the other one is also satisfied and equality (3.5) holds true.

Definition 3.1.6 If X is a random vector ($(E, \mathcal{E}) = (\mathbb{R}^m, \mathcal{B}(\mathbb{R}^m))$) whose probability distribution $Q_X$ is the product of a measurable function $f_X$ by the Lebesgue measure $\ell^m$, one calls $f_X$ the probability density function (pdf) of X.

Remark 3.1.7 The pdf is unique, in the sense that any other pdf $f_X'$ is such that $f_X'(x) = f_X(x)$ Lebesgue-almost everywhere. See Exercise 3.4.1.

Remark 3.1.8 The following is an "obvious" result (a proof is however required in Exercise 3.4.6):
$$P(f_X(X) = 0) = 0 .$$

Example 3.1.9: The case of a real random variable. In the particular case where $(E, \mathcal{E}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, taking C = (−∞, x], we have
$$Q_X((-\infty, x]) = P(X \le x) = F_X(x) ,$$
where $F_X$ is the cumulative distribution function (cdf) of X, and therefore
$$E[g(X)] = \int_{\mathbb{R}} g(x)\,dF_X(x) ,$$
by definition of the Stieltjes–Lebesgue integral (Definition 2.2.9).

Example 3.1.10: The case of a discrete random variable. In the particular case where $(E, \mathcal{E}) = (\mathbb{N}, \mathcal{P}(\mathbb{N}))$, $Q_X(\{n\}) = P(X = n)$ and
$$E[g(X)] = \sum_{n \in \mathbb{N}} g(n)\,P(X = n) .$$

Example 3.1.11: The case of a random vector with a probability density. If X is a random vector of $\mathbb{R}^n$ admitting a probability density $f_X$, then, by Theorem 2.3.28,
$$E[g(X)] = \int_{\mathbb{R}^n} g(x)\,f_X(x)\,dx .$$

The cumulative distribution function F of a real random variable X has the following properties:
(i) $F : \mathbb{R} \to [0, 1]$.
(ii) F is non-decreasing.
(iii) F is right-continuous.
(iv) For each x ∈ R there exists $F(x-) := \lim_{h \downarrow 0} F(x - h)$.
(v) $F(+\infty) := \lim_{a \uparrow \infty} F(a) = P(X < \infty)$.
(vi) $F(-\infty) := \lim_{a \downarrow -\infty} F(a) = P(X = -\infty)$.
(vii) $P(X = a) = F(a) - F(a-)$ for all a ∈ R.

Proof. (i) is obvious; (ii) If a ≤ b, then {X ≤ a} ⊆ {X ≤ b}, and therefore P(X ≤ a) ≤ P(X ≤ b); (iii) Let $B_n = \{X \le a + \frac{1}{n}\}$. Since $\cap_{n \ge 1} \{X \le a + \frac{1}{n}\} = \{X \le a\}$, we have, by sequential continuity,
$$\lim_{n \uparrow \infty} P\left(X \le a + \tfrac{1}{n}\right) = P(X \le a) .$$
(iv) We know from Analysis that a non-decreasing function from R to R has at any point a limit to the left; (v) Let $A_n = \{X \le n\}$ and observe that $\cup_{n=1}^{\infty} \{X \le n\} = \{X < \infty\}$. The result again follows by sequential continuity; (vi) Apply (1.6) with $B_n = \{X \le -n\}$ and observe that $\cap_{n=1}^{\infty} \{X \le -n\} = \{X = -\infty\}$. The result follows by sequential continuity. (vii) The sequence $B_n = \{a - \frac{1}{n} < X \le a\}$ is decreasing, and $\cap_{n=1}^{\infty} B_n = \{X = a\}$. Therefore, by sequential continuity,
$$P(X = a) = \lim_{n \uparrow \infty} P\left(a - \tfrac{1}{n} < X \le a\right) = \lim_{n \uparrow \infty} \left( F(a) - F\left(a - \tfrac{1}{n}\right) \right) ,$$
that is to say, P(X = a) = F(a) − F(a−). ∎

From (vii), we see that the cdf is continuous at a ∈ R if and only if P(X = a) = 0. Being a non-decreasing right-continuous function, F has at most a countable set of discontinuity points $\{d_n, n \ge 1\}$. Define the discontinuous part of F by
$$F_d(x) = \sum_{n \ge 1} \left( F(d_n) - F(d_n-) \right) 1_{\{d_n \le x\}} = \sum_{n \ge 1} P(X = d_n)\,1_{\{d_n \le x\}} .$$
In particular, when a random variable takes its values in a countable set, its cdf reduces to the discontinuous part $F_d$. For such (discrete) random variables, the probability distribution $\{p(d_n)\}_{n \ge 1}$, where $p(d_n) = P(X = d_n)$, suffices to describe the probabilistic behavior of X.

Famous Continuous Random Variables

An (absolutely) continuous random variable is by definition a real (no infinite values) random variable with a probability density, that is,
$$P(X \le x) = \int_{-\infty}^{x} f(y)\,dy ,$$
where f(y) ≥ 0, and since X is real, P(X < ∞) = 1, that is,
$$\int_{-\infty}^{+\infty} f(y)\,dy = 1 .$$

Definition 3.1.12 Let a and b be real numbers such that a < b. A real random variable X with probability density function
$$f(x) = \frac{1}{b-a}\,1_{[a,b]}(x) \qquad (3.6)$$
is called a uniform random variable on [a, b]. This is denoted by $X \sim \mathcal{U}([a, b])$.

Theorem 3.1.13 The mean and the variance of a uniform random variable on [a, b] are given by
$$E[X] = \frac{a+b}{2} , \qquad \mathrm{Var}(X) = \frac{(b-a)^2}{12} . \qquad (3.7)$$

Proof. Direct computation. ∎

Theorem 3.1.14 Let F be the cdf of a real random variable X and, for u ∈ (0, 1), let
$$F^{\leftarrow}(u) := \inf\{x \,;\, F(x) > u\} .$$
If U is a uniform random variable on (0, 1), then $F^{\leftarrow}(U)$ has the same probability distribution as X.

Proof. First note that for all u ∈ (0, 1), $F^{\leftarrow}(u) \le t$ implies F(t) ≥ u. Indeed, in this case, for all s > t there exists an x < s such that F(x) > u and therefore F(s) > u; and consequently, by right-continuity of F, F(t) ≥ u. Conversely, F(t) > u implies that $t \in \{x \,;\, F(x) > u\}$ and therefore $F^{\leftarrow}(u) \le t$. Taking all this into account,
$$F(t) = P(U < F(t)) \le P(F^{\leftarrow}(U) \le t) \le P(F(t) \ge U) = F(t) .$$
This forces $P(F^{\leftarrow}(U) \le t)$ to equal F(t). ∎
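Theorem 3.1.14 is the basis of sampling by inversion. The sketch below is an illustration, not part of the text: it assumes the exponential cdf $F(x) = 1 - e^{-\lambda x}$, for which $F^{\leftarrow}(u) = -\ln(1-u)/\lambda$, and compares the empirical cdf of the simulated values with F. The function names and the fixed seed are illustrative choices.

```python
import math
import random


def exp_quantile(u, lam):
    """Generalized inverse of F(x) = 1 - exp(-lam x) for u in [0, 1)."""
    return -math.log(1.0 - u) / lam


def sample_exponential(n, lam, seed=0):
    """Draw n values of F_inverse(U), with U uniform on [0, 1)."""
    rng = random.Random(seed)
    return [exp_quantile(rng.random(), lam) for _ in range(n)]


def empirical_cdf(samples, t):
    """Fraction of the samples that are <= t."""
    return sum(1 for x in samples if x <= t) / len(samples)


samples = sample_exponential(100_000, lam=2.0)
# Largest gap between the empirical cdf and F over a few test points.
max_gap = max(
    abs(empirical_cdf(samples, t) - (1.0 - math.exp(-2.0 * t)))
    for t in (0.1, 0.5, 1.0, 2.0)
)
```

With this many samples, `max_gap` is expected to be of order $1/\sqrt{n}$, consistent with the claim that $F^{\leftarrow}(U)$ has cdf F.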



Remark 3.1.15 Of course, if F is continuous and strictly increasing, $F^{\leftarrow}$ is the inverse $F^{-1}$ in the usual sense.

Definition 3.1.16 A real random variable X with pdf
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\frac{(x-m)^2}{\sigma^2}} , \qquad (3.8)$$
where m ∈ R and σ > 0, is called a Gaussian random variable. This is denoted by $X \sim \mathcal{N}(m, \sigma^2)$. One can check that E[X] = m and Var(X) = σ² (Exercise 3.4.18).

Definition 3.1.17 The tail distribution of a random variable X is, by definition, the quantity P(X > x).

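The following bounds for the tail distribution of a standard Gaussian variable are useful: for all x > 0,
$$\frac{x}{1+x^2}\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}x^2} \le \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-\frac{1}{2}y^2}\,dy \le \frac{1}{x}\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}x^2} .$$

Proof. We have
$$\frac{1}{x^2} \int_x^{\infty} e^{-\frac{1}{2}y^2}\,dy > \int_x^{\infty} \frac{1}{y^2}\,e^{-\frac{1}{2}y^2}\,dy = \frac{1}{x}\,e^{-\frac{1}{2}x^2} - \int_x^{\infty} e^{-\frac{1}{2}y^2}\,dy ,$$
where the equality is obtained by integration by parts. Moving the last integral to the left-hand side gives $\left(1 + \frac{1}{x^2}\right) \int_x^{\infty} e^{-\frac{1}{2}y^2}\,dy > \frac{1}{x}\,e^{-\frac{1}{2}x^2}$, which is the inequality on the left. Integration by parts again:
$$\int_x^{\infty} \frac{1}{y^2}\,e^{-\frac{1}{2}y^2}\,dy = \frac{1}{x}\,e^{-\frac{1}{2}x^2} - \int_x^{\infty} e^{-\frac{1}{2}y^2}\,dy ,$$
and therefore
$$\frac{1}{x}\,e^{-\frac{1}{2}x^2} = \int_x^{\infty} \left(1 + \frac{1}{y^2}\right) e^{-\frac{1}{2}y^2}\,dy \ge \int_x^{\infty} e^{-\frac{1}{2}y^2}\,dy ,$$
and this is the inequality on the right. ∎

Definition 3.1.18 A random variable X with pdf
$$f(x) = \lambda e^{-\lambda x}\,1_{\{x \ge 0\}} \qquad (3.9)$$
is called an exponential random variable with parameter λ. This is denoted by $X \sim \mathcal{E}(\lambda)$.

The cdf of the exponential random variable is $F(x) = (1 - e^{-\lambda x})\,1_{\{x \ge 0\}}$.

Theorem 3.1.19 The mean of an exponential random variable with parameter λ is
$$E[X] = \lambda^{-1} . \qquad (3.10)$$

Proof. Direct computation, or see the Gamma distribution below. ∎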

The exponential distribution lacks memory, in the following sense:

Theorem 3.1.20 Let $X \sim \mathcal{E}(\lambda)$. For all $t, t_0 \in \mathbb{R}_+$, we have
$$P(X \ge t_0 + t \mid X \ge t_0) = P(X \ge t) .$$

Proof.
$$P(X \ge t_0 + t \mid X \ge t_0) = \frac{P(X \ge t_0 + t ,\, X \ge t_0)}{P(X \ge t_0)} = \frac{P(X \ge t_0 + t)}{P(X \ge t_0)} = \frac{e^{-\lambda(t_0 + t)}}{e^{-\lambda t_0}} = e^{-\lambda t} = P(X \ge t) .$$
∎
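The lack of memory can be seen by computing the conditional tail directly. The following sketch (illustrative names, not from the text) does so for the exponential tail $e^{-\lambda t}$ and, for contrast, for the tail of a uniform variable on [0, 1], which does not have this property.

```python
import math


def exp_tail(t, lam):
    """P(X >= t) for X exponential with parameter lam, t >= 0."""
    return math.exp(-lam * t)


def uniform_tail(t):
    """P(X >= t) for X uniform on [0, 1]."""
    return min(1.0, max(0.0, 1.0 - t))


def conditional_tail(tail, t0, t):
    """P(X >= t0 + t | X >= t0), computed as tail(t0 + t) / tail(t0)."""
    return tail(t0 + t) / tail(t0)


# Exponential case: the conditional tail does not depend on t0.
exp_conditional = conditional_tail(lambda s: exp_tail(s, 1.5), 2.0, 0.7)
# Uniform case: having survived until 0.5 changes the chances of 0.25 more.
unif_conditional = conditional_tail(uniform_tail, 0.5, 0.25)
```

Here `exp_conditional` coincides with `exp_tail(0.7, 1.5)`, whereas `unif_conditional` is 0.5 while `uniform_tail(0.25)` is 0.75.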

Recall the definition of the gamma function Γ:
$$\Gamma(\alpha) := \int_0^{\infty} x^{\alpha-1} e^{-x}\,dx .$$
Integration by parts gives, for α > 0,
$$0 = \left. u^{\alpha} e^{-u} \right|_0^{\infty} = \int_0^{\infty} \alpha u^{\alpha-1} e^{-u}\,du - \int_0^{\infty} e^{-u} u^{\alpha}\,du = \alpha\Gamma(\alpha) - \Gamma(\alpha+1) .$$
Therefore
$$\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha) ,$$
from which it follows in particular, since $\Gamma(1) = \int_0^{\infty} e^{-x}\,dx = 1$, that for all integers n ≥ 1,
$$\Gamma(n) = (n-1)!$$

Definition 3.1.21 Let α and β be two strictly positive real numbers. A non-negative random variable X with the pdf
$$f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,x^{\alpha-1} e^{-\beta x}\,1_{\{x > 0\}} \qquad (3.11)$$

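is called a Gamma random variable with parameters α and β. This is denoted by X ∼ γ(α, β).

We must check that (3.11) defines a probability density (that is, the integral of f is 1). In fact:
$$\int_{-\infty}^{+\infty} f(x)\,dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_0^{\infty} x^{\alpha-1} e^{-\beta x}\,dx = \frac{1}{\Gamma(\alpha)} \int_0^{\infty} y^{\alpha-1} e^{-y}\,dy = \frac{\Gamma(\alpha)}{\Gamma(\alpha)} = 1 ,$$
where the second equality has been obtained via the change of variable y = βx.

Theorem 3.1.22 If X ∼ γ(α, β), then
$$E[X] = \frac{\alpha}{\beta} \quad \text{and} \quad \mathrm{Var}(X) = \frac{\alpha}{\beta^2} . \qquad (3.12)$$

Proof.
$$E[X] = \int_0^{\infty} x\,\frac{\beta^{\alpha}}{\Gamma(\alpha)}\,x^{\alpha-1} e^{-\beta x}\,dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_0^{\infty} x^{\alpha} e^{-\beta x}\,dx = \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)}\,\frac{1}{\beta} = \frac{\alpha}{\beta} .$$
Similarly,
$$E[X^2] = \frac{\Gamma(\alpha+2)}{\Gamma(\alpha)}\,\frac{1}{\beta^2} = \frac{\alpha(\alpha+1)}{\beta^2} .$$
Therefore
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \frac{\alpha(\alpha+1)}{\beta^2} - \left(\frac{\alpha}{\beta}\right)^2 = \frac{\alpha}{\beta^2} .$$
∎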
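The normalization of (3.11) and the moments of Theorem 3.1.22 can be confirmed by numerical integration. This is a rough sketch, not part of the text: it uses a midpoint rule on a truncated interval, and the truncation point `upper`, the number of steps and the parameter values α = 2.5, β = 1.5 are arbitrary illustrative choices.

```python
import math


def gamma_pdf(x, alpha, beta):
    """Density (3.11) of the Gamma distribution with parameters alpha, beta."""
    if x <= 0:
        return 0.0
    return beta ** alpha / math.gamma(alpha) * x ** (alpha - 1) * math.exp(-beta * x)


def gamma_moment(k, alpha, beta, upper=40.0, steps=100_000):
    """Midpoint-rule approximation of the integral of x^k f(x) over (0, upper)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x ** k * gamma_pdf(x, alpha, beta)
    return h * total


alpha, beta = 2.5, 1.5
mass = gamma_moment(0, alpha, beta)              # should be close to 1
mean = gamma_moment(1, alpha, beta)              # should be close to alpha / beta
variance = gamma_moment(2, alpha, beta) - mean ** 2  # close to alpha / beta^2
```

The part of the integral beyond `upper` is negligible here because of the factor $e^{-\beta x}$.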

The exponential distribution is a particular case of the Gamma distribution. In fact, $\gamma(1, \lambda) \equiv \mathcal{E}(\lambda)$.

The so-called chi-square distribution with n degrees of freedom, denoted by $\chi^2_n$, is just the $\gamma(\frac{n}{2}, \frac{1}{2})$ distribution. It therefore has the pdf
$$f(x) = \frac{1}{2^{\frac{n}{2}}\,\Gamma(\frac{n}{2})}\,x^{\frac{n}{2}-1} e^{-\frac{1}{2}x}\,1_{\{x > 0\}} . \qquad (3.13)$$
This is denoted by $X \sim \chi^2_n$.

Definition 3.1.23 A random variable X with pdf
$$f(x) = \frac{1}{\pi(1+x^2)} \qquad (3.14)$$
is called a Cauchy random variable.

It is important to observe that the mean of X is not defined since
$$\int_{\mathbb{R}} \frac{|x|}{\pi(1+x^2)}\,dx = +\infty .$$
Of course, a fortiori, its variance is not defined.
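To see the divergence concretely, note that the truncated integral has the closed form $\int_{-T}^{T} \frac{|x|}{\pi(1+x^2)}\,dx = \frac{1}{\pi}\ln(1+T^2)$, which grows without bound as T increases. The sketch below (illustrative names, not part of the text) evaluates this closed form and cross-checks it with a midpoint rule.

```python
import math


def truncated_abs_moment(T):
    """Closed form of the integral of |x| / (pi (1 + x^2)) over [-T, T]."""
    return math.log(1.0 + T * T) / math.pi


def truncated_abs_moment_numeric(T, steps=100_000):
    """Midpoint-rule approximation of the same integral, using symmetry."""
    h = T / steps
    half = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        half += x / (math.pi * (1.0 + x * x))
    return 2.0 * h * half


# The truncated integral keeps growing like (2 / pi) ln T.
growth = [truncated_abs_moment(T) for T in (1e3, 1e6, 1e12)]
```

The logarithmic growth is slow, which is why simulated sample means of Cauchy variables may look stable for a while before jumping.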

Change of Variables

Let $X = (X_1, \ldots, X_n)$ be a random vector with the probability density function $f_X$, and define the random vector Y = g(X), where $g : \mathbb{R}^n \to \mathbb{R}^n$. More explicitly,
$$\begin{cases} Y_1 = g_1(X_1, \ldots, X_n), \\ \quad \vdots \\ Y_n = g_n(X_1, \ldots, X_n). \end{cases}$$
Under smoothness assumptions on g, the random vector Y is absolutely continuous, and its probability density function can be explicitly computed from g and the probability density function $f_X$. The conditions allowing this are the following:

A1: The function g from U to $\mathbb{R}^n$, where U is an open subset of $\mathbb{R}^n$, is one-to-one (injective).
A2: The coordinate functions $g_i$ (1 ≤ i ≤ n) are continuously differentiable.
A3: Moreover, the Jacobian matrix of the function g,
$$J_g(x) := J_g(x_1, \ldots, x_n) := \left( \frac{\partial g_i}{\partial x_j}(x_1, \ldots, x_n) \right)_{1 \le i,j \le n} ,$$
satisfies the positivity condition
$$|\det J_g(x)| > 0 \qquad (x \in U) .$$

A standard result of Analysis says that V = g(U) is an open subset of $\mathbb{R}^n$, and that the invertible function g : U → V has an inverse $g^{-1} : V \to U$ with the same properties as the direct function g. In particular, on V, $|\det J_{g^{-1}}(y)| > 0$. Moreover,
$$J_{g^{-1}}(y) = J_g(g^{-1}(y))^{-1} .$$
Also, under the conditions A1–A3, for any non-negative measurable function $u : \mathbb{R}^n \to \mathbb{R}$,
$$\int_U u(x)\,dx = \int_{g(U)} u(g^{-1}(y))\,|\det J_{g^{-1}}(y)|\,dy .$$

Theorem 3.1.24 Under the conditions just stated for X, g, and U, and if moreover P(X ∈ U) = 1, then Y admits the probability density
$$f_Y(y) = f_X(g^{-1}(y))\,|\det J_g(g^{-1}(y))|^{-1}\,1_V(y) . \qquad (3.15)$$

Proof. The proof consists in checking that for any bounded measurable function $h : \mathbb{R}^n \to \mathbb{R}$,
$$E[h(Y)] = \int_{\mathbb{R}^n} h(y)\,\psi(y)\,dy , \qquad (3.16)$$
where ψ is the function on the right-hand side of (3.15). Indeed, taking $h(y) = 1_{\{y \le a\}} = 1_{\{y_1 \le a_1\}} \cdots 1_{\{y_n \le a_n\}}$, (3.16) reads
$$P(Y_1 \le a_1, \ldots, Y_n \le a_n) = \int_{-\infty}^{a_1} \cdots \int_{-\infty}^{a_n} \psi(y_1, \ldots, y_n)\,dy_1 \cdots dy_n .$$
To prove that (3.16) holds with the appropriate ψ, one just uses the basic rule of change of variables:
$$E[h(Y)] = E[h(g(X))] = \int_U h(g(x))\,f_X(x)\,dx = \int_V h(y)\,f_X(g^{-1}(y))\,|\det J_{g^{-1}}(y)|\,dy .$$
∎

Theorem 3.1.25 Let X be an n-dimensional random vector with probability density $f_X$. Let A be an invertible n × n real matrix and let b be an n-dimensional real vector. Then, the random vector Y = AX + b admits the density
$$f_Y(y) = f_X(A^{-1}(y - b))\,\frac{1}{|\det A|} . \qquad (3.17)$$

Proof. Here $U = \mathbb{R}^n$, g(x) = Ax + b and $|\det J_{g^{-1}}(y)| = \frac{1}{|\det A|}$. ∎
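Formula (3.17) can be checked on a case where the density of Y is known independently. The sketch below is an illustration, not part of the text: it assumes that X is a vector of two independent standard Gaussian variables and relies on the classical fact (not established in this section) that Y = AX + b is then Gaussian with mean b and covariance matrix $AA^T$. The 2 × 2 helper functions and the values of A and b are illustrative.

```python
import math


def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]


def inv2(M):
    """Inverse of an invertible 2 x 2 matrix."""
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]


def matvec(M, v):
    """Product of a 2 x 2 matrix with a vector of length 2."""
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]


def gram(M):
    """M M^T for a 2 x 2 matrix M."""
    return [[sum(M[i][k] * M[j][k] for k in range(2)) for j in range(2)] for i in range(2)]


def std_normal_2d(x):
    """Density of two independent standard Gaussian variables."""
    return math.exp(-0.5 * (x[0] ** 2 + x[1] ** 2)) / (2.0 * math.pi)


def affine_density(f_X, A, b):
    """Density of Y = A X + b according to formula (3.17)."""
    A_inv, scale = inv2(A), 1.0 / abs(det2(A))
    return lambda y: f_X(matvec(A_inv, [y[0] - b[0], y[1] - b[1]])) * scale


def gaussian_density(mean, cov):
    """Bivariate Gaussian density with the given mean and covariance matrix."""
    cov_inv = inv2(cov)
    norm = 1.0 / (2.0 * math.pi * math.sqrt(det2(cov)))

    def f(y):
        d = [y[0] - mean[0], y[1] - mean[1]]
        w = matvec(cov_inv, d)
        return norm * math.exp(-0.5 * (d[0] * w[0] + d[1] * w[1]))

    return f


A = [[2.0, 1.0], [0.5, 3.0]]
b = [1.0, -2.0]
f_Y = affine_density(std_normal_2d, A, b)
reference = gaussian_density(b, gram(A))
```

Evaluating `f_Y` and `reference` at the same points gives matching values, since $|\det A| = \sqrt{\det(AA^T)}$.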

Example 3.1.26: Polar coordinates. Let $(X_1, X_2)$ be a two-dimensional random vector with probability density $f_{X_1,X_2}(x_1, x_2)$ and let (R, Θ) be its polar coordinates. The probability density of (R, Θ) is given by the formula
$$f_{R,\Theta}(r, \theta) = f_{X_1,X_2}(r\cos\theta, r\sin\theta)\,r .$$

Proof. Here g is the bijective function from the open set U consisting of $\mathbb{R}^2$ without the half-line $\{(x_1, 0) \,;\, x_1 \ge 0\}$ to the open set V = (0, ∞) × (0, 2π). The inverse function is
$$x_1 = r\cos\theta , \qquad x_2 = r\sin\theta .$$
The Jacobian of $g^{-1}$ is
$$J_{g^{-1}}(r, \theta) = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}$$
of determinant $\det J_{g^{-1}}(r, \theta) = r$. Apply formula (3.15) to obtain the announced result. ∎

Covariance Matrices

Recall that $L^2_{\mathbb{C}}(P)$ (resp., $L^2_{\mathbb{R}}(P)$) is the set of square-integrable complex (resp., real) random variables, where two variables X and X′ such that P(X = X′) = 1 are not distinguished. Define for X, Y in $L^2_{\mathbb{C}}(P)$ or $L^2_{\mathbb{R}}(P)$
$$\langle X, Y \rangle = E[XY^*] . \qquad (3.18)$$
$L^2_{\mathbb{C}}(P)$ and $L^2_{\mathbb{R}}(P)$ are Hilbert spaces with scalar field C and R respectively (the Riesz–Fischer Theorem 2.3.25).

Definition 3.1.27 Two complex square-integrable random variables X and Y are said to be orthogonal if $E[XY^*] = 0$. They are said to be uncorrelated if $E[(X - m_X)(Y - m_Y)^*] = 0$.

Recall Schwarz's inequality for square-integrable random variables:
$$|E[XY]| \le E[|XY|] \le E[|Y|^2]^{\frac{1}{2}} \times E[|X|^2]^{\frac{1}{2}} . \qquad (3.19)$$
In particular, with Y = 1,
$$E[|X|] \le E[|X|^2]^{\frac{1}{2}} < \infty . \qquad (3.20)$$

Correlation Coefficient

Definition 3.1.28 The cross-variance of the two complex square-integrable variables X and Y is, by definition, the complex number $E[(X - m_X)(Y - m_Y)^*]$, denoted by $\sigma_{XY}$.

Definition 3.1.29 Let X and Y be square-integrable real random variables with respective means $m_X$ and $m_Y$, and respective variances $\sigma_X^2 > 0$ and $\sigma_Y^2 > 0$. Their correlation coefficient is the quantity
$$\rho_{XY} := \frac{\sigma_{XY}}{\sigma_X \sigma_Y} ,$$
where $\sigma_{XY}$ is the cross-variance.

By Schwarz's inequality, $|\sigma_{XY}| \le \sigma_X \sigma_Y$, and therefore $|\rho_{XY}| \le 1$, with equality if and only if X and Y are collinear. When $\rho_{XY} = 0$, X and Y are said to be uncorrelated. If $\rho_{XY} > 0$, they are said to be positively correlated, whereas if $\rho_{XY} < 0$, they are said to be negatively correlated.

Theorem 3.1.30 Let X and Y be square-integrable real random variables with $\sigma_X^2 > 0$. Among all variables Z = aX + b, where a and b are real numbers, the one that minimizes the error $E[(Z - Y)^2]$ is
$$\hat Y = m_Y + \frac{\sigma_{XY}}{\sigma_X^2}(X - m_X) ,$$
and the error is then
$$E[(\hat Y - Y)^2] = \sigma_Y^2 (1 - \rho_{XY}^2) .$$
(The proof is left as Exercise 3.4.43.)

Remark 3.1.31 We see that if the variables are not correlated, then the best prediction is the trivial one $\hat Y = m_Y$ and the (maximal) error is then $\sigma_Y^2$. In imprecise but suggestive terms, high correlation implies high predictability.

Notation: For vectors and matrices, an asterisk superscript ($^*$) denotes complex conjugates, a T superscript ($^T$) is for vector transposition, and the dagger superscript ($^\dagger$) is for conjugation-transposition. When x is a vector of $\mathbb{R}^n$, we always assume in the notation that it is a column vector, and therefore $x^T$ will be the corresponding row vector.

Definition 3.1.32 A random vector $X = (X_1, \ldots, X_n)^T$ such that $X_1, \ldots, X_n$ are square-integrable complex random variables is called a square-integrable complex vector.

In particular, for all 1 ≤ i, j ≤ n, by (3.20), $E[|X_i|] < \infty$, and by Schwarz's inequality (3.19), $E[|X_i X_j|] < \infty$. This allows us to define the mean of X
$$m_X := E[X] = (E[X_1], \ldots, E[X_n])^T$$
and the covariance matrix of X
$$\Gamma_X := E[(X - m_X)(X - m_X)^\dagger] = \left\{ E\left[(X_i - m_{X_i})(X_j - m_{X_j})^*\right] \right\}_{1 \le i,j \le n} = \left\{ \sigma_{X_i, X_j} \right\}_{1 \le i,j \le n} .$$

Theorem 3.1.33 The matrix $\Gamma_X$ is Hermitian, that is,
$$\Gamma_X^\dagger = \Gamma_X , \qquad (3.21)$$
and it is non-negative definite (denoted $\Gamma_X \ge 0$), that is,
$$\alpha^\dagger \Gamma_X \alpha \ge 0 \qquad (\alpha \in \mathbb{C}^n) . \qquad (3.22)$$

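Proof. For all $\alpha \in \mathbb{C}^n$,
$$\begin{aligned} \alpha^T \Gamma_X \alpha^* &= \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j^*\,E[(X_i - E[X_i])(X_j - E[X_j])^*] \\ &= E\left[ \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j^* (X_i - E[X_i])(X_j - E[X_j])^* \right] \\ &= E\left[ \left( \sum_{i=1}^{n} \alpha_i (X_i - E[X_i]) \right) \left( \sum_{j=1}^{n} \alpha_j (X_j - E[X_j]) \right)^* \right] \\ &= E[|\alpha^T (X - E[X])|^2] \ge 0 . \end{aligned}$$
Since α is arbitrary, replacing α by $\alpha^*$ gives $\alpha^\dagger \Gamma_X \alpha \ge 0$. ∎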
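The two properties of Theorem 3.1.33 can be observed on an empirical covariance matrix, which is the covariance matrix of the empirical distribution of the sample and therefore falls under the theorem. The following sketch is illustrative and not part of the text; the way the correlated complex vectors are generated and all function names are assumptions made for the example.

```python
import random


def sample_complex_vectors(count, seed=0):
    """Generate correlated 3-dimensional complex vectors (illustrative model)."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(count):
        z1 = complex(rng.gauss(0, 1), rng.gauss(0, 1))
        z2 = complex(rng.gauss(0, 1), rng.gauss(0, 1))
        vectors.append([z1, z1 + 0.5j * z2, 2.0 * z2 - z1])
    return vectors


def covariance_matrix(samples):
    """Empirical version of E[(X - m)(X - m)^dagger], entry (i, j)."""
    n, count = len(samples[0]), len(samples)
    mean = [sum(x[i] for x in samples) / count for i in range(n)]
    return [
        [
            sum((x[i] - mean[i]) * (x[j] - mean[j]).conjugate() for x in samples) / count
            for j in range(n)
        ]
        for i in range(n)
    ]


def quadratic_form(gamma, alpha):
    """alpha^dagger Gamma alpha."""
    n = len(alpha)
    return sum(
        alpha[i].conjugate() * gamma[i][j] * alpha[j] for i in range(n) for j in range(n)
    )


gamma = covariance_matrix(sample_complex_vectors(500))
```

In this example the three coordinates are linear combinations of two variables, so the matrix is degenerate: the quadratic form vanishes (up to rounding) in one direction and is positive elsewhere.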

Theorem 3.1.34 Let X be a square-integrable real random vector with a covariance matrix $\Gamma_X$ which is degenerate, that is, $\alpha^T \Gamma_X \alpha = 0$ for some α ≠ 0. Then X lies almost surely in a given hyperplane of $\mathbb{R}^n$ of dimension strictly less than n.

Proof. For such α, $E[|\alpha^T (X - E[X])|^2] = \alpha^T \Gamma_X \alpha = 0$, and therefore $\alpha^T (X - E[X]) = 0$ almost surely. ∎

Remark 3.1.35 Since X lies almost surely in a strict hyperplane of $\mathbb{R}^n$, it cannot have a probability density.

A vector X with degenerate covariance matrix is also called degenerate. If $\Gamma_X$ is non-degenerate, we write $\Gamma_X > 0$.

We now examine the effects of an affine transformation of a random vector on its covariance matrix. Let X be a square-integrable n-dimensional complex random vector, with mean $m_X$ and covariance matrix $\Gamma_X$. Let A be a (k × n)-dimensional complex matrix, and b a k-dimensional complex vector.

Theorem 3.1.36 Then the k-dimensional complex vector Z = AX + b has mean
$$m_Z = A\,m_X + b ,$$
and covariance matrix
$$\Gamma_Z = A\,\Gamma_X A^\dagger .$$

Proof. The formula giving the mean is immediate. As for the other one, it suffices to observe that $(Z - m_Z) = A(X - m_X)$ and to write
$$\begin{aligned} \Gamma_Z &= E\left[(Z - m_Z)(Z - m_Z)^\dagger\right] = E\left[A(X - m_X)(A(X - m_X))^\dagger\right] \\ &= E\left[A(X - m_X)(X - m_X)^\dagger A^\dagger\right] = A\,E\left[(X - m_X)(X - m_X)^\dagger\right] A^\dagger = A\,\Gamma_X A^\dagger . \end{aligned}$$
∎

Let X and Y be square-integrable complex random vectors of respective dimensions n and q. We define the inter-covariance matrix of X and Y (in this order) by
$$\Gamma_{XY} = E[(X - m_X)(Y - m_Y)^\dagger] .$$

Note that
$$\Gamma_{YX} = \Gamma_{XY}^\dagger .$$
Also if we define the (n + q)-dimensional vector Z by
$$Z = (X_1, \ldots, X_n, Y_1, \ldots, Y_q)^T ,$$
then its covariance takes the block form
$$\Gamma_Z = \begin{pmatrix} \Gamma_X & \Gamma_{XY} \\ \Gamma_{YX} & \Gamma_Y \end{pmatrix} .$$

3.1.3 Independence and the Product Formula

Recall the definition of independence for events. Two events A and B are said to be independent if
$$P(A \cap B) = P(A)P(B) . \qquad (3.23)$$
More generally, a family $\{A_i\}_{i \in I}$ of events, where I is an arbitrary index set, is called independent if for every finite subset J ⊆ I,
$$P\left( \bigcap_{j \in J} A_j \right) = \prod_{j \in J} P(A_j) .$$

Definition 3.1.37 Two random elements $X : (\Omega, \mathcal{F}) \to (E, \mathcal{E})$ and $Y : (\Omega, \mathcal{F}) \to (G, \mathcal{G})$ are called independent if for all $C \in \mathcal{E}$, $D \in \mathcal{G}$,
$$P(\{X \in C\} \cap \{Y \in D\}) = P(X \in C)\,P(Y \in D) . \qquad (3.24)$$
More generally, let I be an arbitrary index set. The family of random elements $\{X_i\}_{i \in I}$, where $X_i : (\Omega, \mathcal{F}) \to (E_i, \mathcal{E}_i)$ (i ∈ I), is called independent if for every finite subset J ⊆ I,
$$P\left( \bigcap_{j \in J} \{X_j \in C_j\} \right) = \prod_{j \in J} P(X_j \in C_j)$$
for all $C_j \in \mathcal{E}_j$ (j ∈ J).

Theorem 3.1.38 If the random elements X and Y taking their values in $(E, \mathcal{E})$ and $(G, \mathcal{G})$ respectively are independent, then so are the random elements ϕ(X) and ψ(Y), where $\varphi : (E, \mathcal{E}) \to (E', \mathcal{E}')$, $\psi : (G, \mathcal{G}) \to (G', \mathcal{G}')$.

Proof. For all $C' \in \mathcal{E}'$, $D' \in \mathcal{G}'$, the sets $C = \varphi^{-1}(C')$ and $D = \psi^{-1}(D')$ are in $\mathcal{E}$ and $\mathcal{G}$ respectively, since ϕ and ψ are measurable. We have
$$P(\varphi(X) \in C', \psi(Y) \in D') = P(X \in C, Y \in D) = P(X \in C)\,P(Y \in D) = P(\varphi(X) \in C')\,P(\psi(Y) \in D') .$$
∎

The above result is stated for two random elements for simplicity, and it extends in the obvious way to a finite number of independent random elements.

The next result simplifies the task of proving that two σ-fields are independent.

3.1. FROM INTEGRAL TO EXPECTATION

109

Theorem 3.1.39 Let (Ω, F, P ) be a probability space and let S1 and S2 be two π-systems of sets in F. If S1 and S2 are independent, then so are σ(S1 ) and σ(S2 ). Proof. Fix A ∈ S1 . Let V2 = {B ⊆ X; P (A ∩ B) = P (A)P (B)}. This is a d-system (easy to check) and S2 ⊆ V2 . Therefore d(S2 ) ⊆ d(V2 ) = V2 . On the other hand, by Dynkin’s theorem, d(S2 ) = σ(S2 ). We have therefore proved that S1 and σ(S2 ) are independent. Now fix B ∈ σ(S2 ). Let V1 = {A ⊆ X; P (A ∩ B) = P (A)P (∩B)}. This is a d-system and S1 ⊆ V1 . Therefore d(S1 ) ⊆ d(V1 ) = V1 . On the other hand, by Dynkin’s theorem, d(S1 ) = σ(S1 ). We therefore have proved that σ(S1 ) and σ(S2 ) are  independent. Corollary 3.1.40 Let (Ω, F, P ) be a probability space on which are given two real random variables X and Y . For these two random variables to be independent, it is necessary and sufficient that for all a, b ∈ R, P (X ≤ a, Y ≤ b) = P (X ≤ a)P (Y ≤ b). Proof. This follows from Theorem 3.1.39, remembering that the collection  {(−∞, a]; a ∈ R} is a π-system generating B(R). The independence of two random elements X and Y is equivalent to the factorization of their joint distribution: Q(X,Y ) = QX × QY , where Q(X,Y ) , QX , and QY are the distributions of, respectively, (X, Y ), X and Y . Indeed, for all sets of the form C × D, where C ∈ E and D ∈ G, Q(X,Y ) (C × D) = P ((X, Y ) ∈ C × D) = P (X ∈ C, Y ∈ D) = P (X ∈ C)P (Y ∈ D) = QX (C)QY (D), and therefore (Theorem 2.3.7) Q(X,Y ) is the product measure of QX and QY . In particular, the Fubini–Tonelli theorem immediately gives a result that we have already seen in the particular case of discrete random variables: the product formula for expectations (Formula (3.25) below). 
Theorem 3.1.41 Let the random variables $X$ and $Y$ taking their values in $(E, \mathcal{E})$ and $(G, \mathcal{G})$ respectively be independent, and let $g : (E, \mathcal{E}) \to (\mathbb{R}, \mathcal{B})$, $h : (G, \mathcal{G}) \to (\mathbb{R}, \mathcal{B})$ be such that either one of the following two conditions is satisfied: (i) $E[|g(X)|] < \infty$ and $E[|h(Y)|] < \infty$, or (ii) $g \ge 0$ and $h \ge 0$. Then
$$E[g(X)h(Y)] = E[g(X)]\,E[h(Y)]. \tag{3.25}$$


Proof. It suffices to give the proof in the non-negative case. We have:
$$\begin{aligned}
E[g(X)h(Y)] &= \int_E \int_G g(x)h(y)\, Q_{(X,Y)}(dx \times dy) \\
&= \int_E \int_G g(x)h(y)\, Q_X(dx)\, Q_Y(dy) \\
&= \left(\int_E g(x)\, Q_X(dx)\right)\left(\int_G h(y)\, Q_Y(dy)\right) \\
&= E[g(X)]\, E[h(Y)]. \qquad \square
\end{aligned}$$

Theorem 3.1.42 Let $X$ be a random vector of $\mathbb{R}^n$ admitting the pdf $f_X$. The (measurable) set of samples $\omega$ such that there exist $i, j$ ($i \ne j$) such that $X_i(\omega) = X_j(\omega)$ has a null probability.

Proof. Let $A$ be this set and let $C := \{(x_1, \ldots, x_n);\ x_i = x_j \text{ for some } i \ne j\}$. The set $C$ has null Lebesgue measure, and therefore, since $1_A(\omega) \equiv 1_C(X(\omega))$,
$$P(A) = E[1_C(X)] = \int_{\mathbb{R}^n} 1_C(x) f_X(x)\, dx = 0. \qquad \square$$

Order Statistics

Let $X_1, \ldots, X_n$ be independent random variables with the same pdf $f$. By Theorem 3.1.42, the probability that two or more among $X_1, \ldots, X_n$ take the same value is null. Therefore one can define unambiguously the random variables $Z_1, \ldots, Z_n$ obtained by arranging $X_1, \ldots, X_n$ in increasing order:
$$Z_i \in \{X_1, \ldots, X_n\}, \qquad Z_1 < Z_2 < \cdots < Z_n.$$
In particular, $Z_1 = \min(X_1, \ldots, X_n)$ and $Z_n = \max(X_1, \ldots, X_n)$.

Theorem 3.1.43 The probability density of the reordered vector $Z = (Z_1, \ldots, Z_n)$ (defined above) is
$$f_Z(z_1, \ldots, z_n) = n! \left(\prod_{j=1}^n f(z_j)\right) 1_C(z_1, \ldots, z_n), \tag{3.26}$$
where $C = \{(z_1, \ldots, z_n) \in \mathbb{R}^n;\ z_1 < z_2 < \cdots < z_n\}$.

Proof. Let $\sigma$ be the permutation of $\{1, \ldots, n\}$ that orders $X_1, \ldots, X_n$ in ascending order, that is, $X_{\sigma(i)} = Z_i$ (note that $\sigma$ is a random permutation). For any set $A \subseteq \mathbb{R}^n$,

$$P(Z \in A) = P(Z \in A \cap C) = P(X_\sigma \in A \cap C) = \sum_{\sigma_o} P(X_{\sigma_o} \in A \cap C,\ \sigma = \sigma_o),$$
where the sum is over all permutations of $\{1, \ldots, n\}$. Observing that $X_{\sigma_o} \in A \cap C$ implies $\sigma = \sigma_o$,
$$P(X_{\sigma_o} \in A \cap C,\ \sigma = \sigma_o) = P(X_{\sigma_o} \in A \cap C),$$
and therefore, since the probability distribution of $X_{\sigma_o}$ does not depend upon the fixed permutation $\sigma_o$ (here we need the independence and equidistribution assumptions for the $X_i$’s), $P(X_{\sigma_o} \in A \cap C) = P(X \in A \cap C)$. Therefore,
$$P(Z \in A) = \sum_{\sigma_o} P(X \in A \cap C) = n!\, P(X \in A \cap C) = n! \int_{A \cap C} f_X(x)\, dx = \int_A n!\, f_X(x) 1_C(x)\, dx. \qquad \square$$

Example 3.1.44: Volume of the right-angled pyramid. We shall apply the above result to prove the formula
$$\int_a^b \cdots \int_a^b 1_C(z_1, \ldots, z_n)\, dz_1 \cdots dz_n = \frac{(b - a)^n}{n!}. \tag{3.27}$$
Indeed, when the $X_i$’s are uniformly distributed over $[a, b]$,
$$f_Z(z_1, \ldots, z_n) = \frac{n!}{(b - a)^n}\, 1_{[a,b]^n}(z_1, \ldots, z_n)\, 1_C(z_1, \ldots, z_n). \tag{3.28}$$
The result follows since $\int_{\mathbb{R}^n} f_Z(z)\, dz = 1$.

Sampling from a Distribution

The problem that we address now (which arises in the simulation of stochastic systems) is to generate a random variable with prescribed cdf or, in other terms, to sample the said cdf. For this, one is allowed to use a random generator that produces a sequence $U_1, U_2, \ldots$ of independent real random variables, uniformly distributed on $[0, 1]$. In practice, the numbers that such random generators produce are not quite random, but they look as if they are (the generators are called pseudo-random generators). The topic of how to devise a good pseudo-random generator is out of our scope, and we shall admit that we can trust our favourite computer to provide us with an iid sequence of random variables uniformly distributed on $[0, 1]$ (from now on we call them random numbers). Given such a sequence, we are going to describe methods for constructing a random variable $Z$ with cdf $F(z) = P(Z \le z)$.

In the case where $Z$ is a discrete random variable with distribution $P(Z = a_i) = p_i$ ($0 \le i \le K$), the basic principle of the sampling algorithm is the following:

Draw $U \sim \mathcal{U}([0, 1])$.

Set $Z = a_\ell$ if $p_0 + p_1 + \cdots + p_{\ell-1} < U \le p_0 + p_1 + \cdots + p_\ell$.

This method is called the method of the inverse. A crude generation algorithm would successively perform the tests $U \le p_0$?, $U \le p_0 + p_1$?, . . ., until the answer is positive. The average number of iterations required would therefore be $\sum_{i \ge 0} (i + 1) p_i = 1 + E[Z]$ (when $a_i = i$). This number may be too large, but there are ways of improving it, as the example below will show for the Poisson random variable.

For absolutely continuous variables, the inverse method takes the following form. Draw a random number $U \sim \mathcal{U}([0, 1])$ and set
$$Z = F^{-1}(U),$$
where $F^{-1}$ is the inverse of $F$. Indeed,
$$P(Z \le z) = P(F^{-1}(U) \le z) = P(U \le F(z)) = F(z).$$

Example 3.1.45: Exponential distribution. We want to sample from $\mathcal{E}(\lambda)$. The corresponding cdf is
$$F(z) = 1 - e^{-\lambda z} \quad (z \ge 0).$$
The solution of $y = 1 - e^{-\lambda z}$ is $z = -\frac{1}{\lambda} \ln(1 - y) = F^{-1}(y)$, and therefore $Z = -\frac{1}{\lambda} \ln(1 - U)$ will do, or, since $U$ and $1 - U$ have the same distribution, $Z = -\frac{1}{\lambda} \ln U$.
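The two forms of the inverse method described above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the text: the function names `sample_discrete` and `sample_exponential` are hypothetical, the standard-library generator `random.Random` plays the role of the trusted source of random numbers, and a binary search over the cumulative sums replaces the crude sequential tests.

```python
import math
import random
from bisect import bisect_left
from itertools import accumulate


def sample_discrete(values, probs, u):
    """Inverse method: return values[l] with p_0+...+p_{l-1} < u <= p_0+...+p_l."""
    cumulative = list(accumulate(probs))
    # bisect_left returns the first index whose cumulative sum is >= u.
    idx = bisect_left(cumulative, u)
    # Guard against rounding error leaving the last cumulative sum below 1.
    return values[min(idx, len(values) - 1)]


def sample_exponential(lam, u):
    """Inverse method for E(lam): F^{-1}(u) = -ln(1 - u) / lam, for u in [0, 1)."""
    return -math.log(1.0 - u) / lam


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
    draws = [sample_exponential(2.0, rng.random()) for _ in range(20000)]
    print(sum(draws) / len(draws))  # empirical mean, to compare with 1/lam = 0.5
```

The form $-\frac{1}{\lambda}\ln(1 - U)$ is used rather than $-\frac{1}{\lambda}\ln U$ because `random.random()` may return exactly 0 but never 1.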

Remark 3.1.46 Both the discrete case and the absolutely continuous case are particular cases of the more general result of Theorem 3.1.14.

Example 3.1.47: Symmetric exponential distribution. This example features a simple trick. We want to sample from the symmetric exponential distribution with pdf
$$f(x) = \frac{1}{2} e^{-|x|}.$$
One way is to generate two independent random variables $Y$ and $Z$, where $Z \sim \mathcal{E}(1)$ and $P(Y = +1) = P(Y = -1) = \frac{1}{2}$. Taking $X = YZ$ we have that
$$P(X \le x) = P(Y = +1, Z \le x) + P(Y = -1, Z \ge -x) = \frac{1}{2}\left(F_Z(x) + 1 - F_Z(-x)\right),$$
and therefore, by differentiation,
$$f_X(x) = \frac{1}{2}\left(f_Z(x) + f_Z(-x)\right) = \frac{1}{2} f_Z(|x|).$$
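The trick $X = YZ$ of this example can be sketched as follows. The function name `sample_symmetric_exponential` is hypothetical, and the exponential variable is produced by the inverse method; this is an illustration under those assumptions rather than a prescribed implementation.

```python
import math
import random


def sample_symmetric_exponential(rng):
    """Return X = Y * Z with Z ~ E(1) and Y = +1 or -1 equiprobably, Y independent of Z."""
    z = -math.log(1.0 - rng.random())      # Z ~ E(1) by the inverse method
    y = 1 if rng.random() < 0.5 else -1    # independent random sign
    return y * z


if __name__ == "__main__":
    rng = random.Random(1)  # arbitrary seed, for reproducibility only
    xs = [sample_symmetric_exponential(rng) for _ in range(20000)]
    # For the density e^{-|x|}/2, E[X] = 0 and E[|X|] = 1.
    print(sum(xs) / len(xs), sum(abs(x) for x in xs) / len(xs))
```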


It is not always easy to compute the inverse of the cumulative distribution function of the random variable to be generated. An alternative method is the method of acceptance-rejection below. Let $\{Y_n\}_{n \ge 1}$ be a sequence of iid random variables with the probability density $g(x)$ satisfying for all $x \in \mathbb{R}$
$$\frac{f(x)}{g(x)} \le c \tag{3.29}$$
for some finite constant $c$ (necessarily larger than or equal to 1). Let $\{U_n\}_{n \ge 1}$ be a sequence of iid random variables uniformly distributed on $[0, 1]$.

Theorem 3.1.48 Let $\tau$ be the first index $n \ge 1$ for which
$$U_n \le \frac{f(Y_n)}{c\, g(Y_n)}$$
and let $Z = Y_\tau$. Then (a) $Z$ admits the probability density function $f$, and (b) $E[\tau] = c$.

Proof. We have
$$P(Z \le x) = P(Y_\tau \le x) = \sum_{n \ge 1} P(\tau = n, Y_n \le x).$$
Denote by $A_k$ the event $\left\{U_k > \frac{f(Y_k)}{c\, g(Y_k)}\right\}$. Then
$$P(\tau = n, Y_n \le x) = P(A_1, \ldots, A_{n-1}, \overline{A}_n, Y_n \le x) = P(A_1) \cdots P(A_{n-1})\, P(\overline{A}_n, Y_n \le x).$$
Now
$$P(\overline{A}_k) = \int_{\mathbb{R}} P\left(U_k \le \frac{f(y)}{c\, g(y)}\right) g(y)\, dy = \int_{\mathbb{R}} \frac{f(y)}{c\, g(y)}\, g(y)\, dy = \int_{\mathbb{R}} \frac{f(y)}{c}\, dy = \frac{1}{c},$$
and
$$P(\overline{A}_k, Y_k \le x) = \int_{\mathbb{R}} P\left(U_k \le \frac{f(y)}{c\, g(y)}\right) 1_{y \le x}\, g(y)\, dy = \int_{-\infty}^x \frac{f(y)}{c\, g(y)}\, g(y)\, dy = \int_{-\infty}^x \frac{f(y)}{c}\, dy = \frac{1}{c} \int_{-\infty}^x f(y)\, dy.$$
Therefore
$$P(Z \le x) = \sum_{n \ge 1} \left(1 - \frac{1}{c}\right)^{n-1} \frac{1}{c} \int_{-\infty}^x f(y)\, dy = \int_{-\infty}^x f(y)\, dy.$$
Also, using the above calculations,
$$P(\tau = n) = P(A_1, \ldots, A_{n-1}, \overline{A}_n) = P(A_1) \cdots P(A_{n-1})\, P(\overline{A}_n) = \left(1 - \frac{1}{c}\right)^{n-1} \frac{1}{c},$$
from which it follows that $E[\tau] = c$. $\square$

The method depends on one’s ability to easily generate random vectors with the probability density g. Such a pdf must satisfy (3.29) and c should be as small as possible under this constraint.
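A minimal sketch of the acceptance-rejection method of Theorem 3.1.48 is given below. The target density $f(x) = 6x(1 - x)$ on $[0, 1]$, the uniform proposal $g \equiv 1$ on $[0, 1]$ and the constant $c = 1.5$ are illustrative assumptions, not taken from the text, and the function name `rejection_sample` is hypothetical. The function also returns $\tau$, so that $E[\tau] = c$ can be checked empirically.

```python
import random


def rejection_sample(f, g_sampler, g_pdf, c, rng):
    """Return (Z, tau): Z = Y_tau, where tau is the first n with U_n <= f(Y_n) / (c g(Y_n))."""
    tau = 0
    while True:
        tau += 1
        y = g_sampler(rng)   # Y_n drawn from the proposal density g
        u = rng.random()     # U_n uniform on [0, 1), drawn independently of Y_n
        if u <= f(y) / (c * g_pdf(y)):
            return y, tau


if __name__ == "__main__":
    rng = random.Random(2)  # arbitrary seed, for reproducibility only
    f = lambda x: 6.0 * x * (1.0 - x)  # illustrative target density on [0, 1]
    results = [rejection_sample(f, lambda r: r.random(), lambda x: 1.0, 1.5, rng)
               for _ in range(20000)]
    print(sum(t for _, t in results) / len(results))  # empirical E[tau], compare with c = 1.5
    print(sum(z for z, _ in results) / len(results))  # empirical mean of Z, compare with 0.5
```

Here $\sup_x f(x)/g(x) = f(1/2) = 1.5$, so $c = 1.5$ is the smallest admissible constant for this pair.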


3.1.4 Characteristic Functions

Recall that for a complex-valued random variable $X = X_R + iX_I$, where $X_R$ and $X_I$ are real-valued integrable random variables, $E[X] = E[X_R] + iE[X_I]$ defines the expectation of $X$. The characteristic function $\varphi_X$ of a real-valued random variable $X$ is defined by
$$\varphi_X(u) = E\left[e^{iuX}\right].$$
Similarly, the characteristic function $\varphi_X : \mathbb{R}^d \to \mathbb{C}$ of a real random vector $X \in \mathbb{R}^d$ is defined by
$$\varphi_X(u) = E\left[e^{iu^T X}\right].$$

Theorem 3.1.49 Let $X \in \mathbb{R}^d$ be a random vector with characteristic function $\varphi$. Then for all $1 \le j \le d$, all $a_j, b_j \in \mathbb{R}$ such that $a_j < b_j$,
$$\lim_{c \uparrow +\infty} \frac{1}{(2\pi)^d} \int_{-c}^{+c} \cdots \int_{-c}^{+c} \left(\prod_{j=1}^d \frac{e^{-iu_j a_j} - e^{-iu_j b_j}}{iu_j}\right) \varphi(u_1, \ldots, u_d)\, du_1 \cdots du_d = E\left[\prod_{j=1}^d \left(\frac{1}{2} \cdots + 1_{\{a_j \cdots\}}\right)\right]$$

[…]

[…] $E[X_1]$, $h^+(a)$ is positive. Similarly to (4.15), we obtain that
$$P\left(\sum_{i=1}^n X_i \le na\right) \le e^{-n h^-(a)},$$

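where
$$h^-(a) = \sup_{t \le 0} \left\{at - \ln E\left[e^{tX_1}\right]\right\}.$$
Moreover, if $a < E[X_1]$, $h^-(a)$ is positive.

The Chernoff bound can be interpreted in terms of large deviations from the law of large numbers. Denote by $\mu$ the common mean of the $X_n$’s, and define for $\varepsilon > 0$ the (positive) quantities
$$H^+(\varepsilon) = \sup_{t \ge 0} \left\{\varepsilon t - \ln E\left[e^{t(X_1 - \mu)}\right]\right\}, \qquad H^-(\varepsilon) = \sup_{t \le 0} \left\{-\varepsilon t - \ln E\left[e^{t(X_1 - \mu)}\right]\right\}.$$
Then
$$P\left(\left|\frac{1}{n} \sum_{i=1}^n X_i - \mu\right| \ge \varepsilon\right) \le e^{-nH^+(\varepsilon)} + e^{-nH^-(\varepsilon)}.$$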

Remark 4.1.19 The computation of the supremum in (4.15) may be tedious. There are shortcuts leading to practical bounds that are not as good but nevertheless satisfactory for certain applications.

Example 4.1.20: Suppose for instance that $\{X_n\}_{n \ge 1}$ is iid, the $X_n$’s taking the values $-1$ and $+1$ equiprobably, so that $E\left[e^{tX_1}\right] = \frac{1}{2} e^{+t} + \frac{1}{2} e^{-t}$. We do not keep this expression as such but instead replace it by an upper bound, namely $e^{\frac{t^2}{2}}$, and therefore, for $a > 0$,
$$P\left(\sum_{i=1}^n X_i \ge na\right) \le e^{-n\left(at - \ln E\left[e^{tX_1}\right]\right)} \le e^{-n\left(at - \frac{1}{2} t^2\right)},$$
so that, with $t = a$,
$$P\left(\sum_{i=1}^n X_i \ge na\right) \le e^{-n \frac{1}{2} a^2}.$$
By symmetry of the distribution of $\sum_{i=1}^n X_i$, one would obtain for $a > 0$
$$P\left(\sum_{i=1}^n X_i \le -na\right) = P\left(\sum_{i=1}^n X_i \ge na\right) \le e^{-n \frac{1}{2} a^2},$$
and therefore, combining the two bounds,
$$P\left(\left|\sum_{i=1}^n X_i\right| \ge na\right) \le 2 e^{-n \frac{1}{2} a^2}.$$
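The two-sided bound of Example 4.1.20 can be compared with the exact tail probability, which is computable because $\sum_{i=1}^n X_i = 2B - n$ with $B \sim \mathcal{B}(n, \frac{1}{2})$. The sketch below uses hypothetical function names and arbitrary values of $n$ and $a$; it only illustrates how loose or tight the bound is.

```python
import math


def exact_two_sided_tail(n, a):
    """P(|X_1 + ... + X_n| >= n a) for iid +/-1 signs, using S_n = 2B - n with B ~ B(n, 1/2)."""
    favourable = sum(math.comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= n * a)
    return favourable / 2 ** n


def two_sided_bound(n, a):
    """The bound 2 exp(-n a^2 / 2) obtained in Example 4.1.20."""
    return 2.0 * math.exp(-n * a * a / 2.0)


if __name__ == "__main__":
    for n in (10, 50, 200):
        for a in (0.1, 0.3, 0.5):
            print(n, a, exact_two_sided_tail(n, a), two_sided_bound(n, a))
```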


4.2 Two Other Types of Convergence

These are (i) convergence in probability, the “parent pauvre” of almost-sure convergence, and (ii) convergence in the quadratic mean, that is, convergence in L2C (P ). (Convergences in distribution and in variation will be treated in the next section.)

4.2.1 Convergence in Probability

Recall the definition already given for discrete random variables in the first chapter:

Definition 4.2.1 A sequence $\{Z_n\}_{n \ge 1}$ of random variables is said to converge in probability to the random variable $Z$ if, for all $\varepsilon > 0$,
$$\lim_{n \uparrow \infty} P(|Z_n - Z| \ge \varepsilon) = 0. \tag{4.16}$$

Example 4.2.2: Bernstein’s polynomial approximation. This example is a particular instance of the fruitful interaction between probability and analysis. Here, we shall give a probabilistic proof of the fact that a continuous function $f$ from $[0, 1]$ into $\mathbb{R}$ can be uniformly approximated by a polynomial. More precisely, for all $x \in [0, 1]$,
$$f(x) = \lim_{n \uparrow \infty} P_n(x), \tag{$\star$}$$
where
$$P_n(x) = \sum_{k=0}^n f\left(\frac{k}{n}\right) \frac{n!}{k!(n - k)!}\, x^k (1 - x)^{n-k},$$
and the convergence in ($\star$) is uniform in $[0, 1]$. A proof of this classical theorem of analysis using probabilistic arguments, and in particular the notion of convergence in probability, is as follows. Let $S_n \sim \mathcal{B}(n, x)$. Then
$$E\left[f\left(\frac{S_n}{n}\right)\right] = \sum_{k=0}^n f\left(\frac{k}{n}\right) P(S_n = k) = \sum_{k=0}^n f\left(\frac{k}{n}\right) \frac{n!}{k!(n - k)!}\, x^k (1 - x)^{n-k} = P_n(x).$$
The function $f$ is continuous on the bounded closed interval $[0, 1]$ and therefore uniformly continuous on this interval. Therefore to any $\varepsilon > 0$, one can associate a number $\delta(\varepsilon)$ such that if $|y - x| \le \delta(\varepsilon)$, then $|f(x) - f(y)| \le \varepsilon$. Being continuous on $[0, 1]$, $f$ is bounded in absolute value on $[0, 1]$ by some finite number, say $M$. Now
$$|P_n(x) - f(x)| = \left|E\left[f\left(\frac{S_n}{n}\right) - f(x)\right]\right| \le E\left[\left|f\left(\frac{S_n}{n}\right) - f(x)\right|\right] = E\left[\left|f\left(\frac{S_n}{n}\right) - f(x)\right| 1_A\right] + E\left[\left|f\left(\frac{S_n}{n}\right) - f(x)\right| 1_{\overline{A}}\right],$$
where $A := \{|S_n(\omega)/n - x| \le \delta(\varepsilon)\}$. Since $|f(S_n/n) - f(x)|\, 1_{\overline{A}} \le 2M 1_{\overline{A}}$, we have
$$E\left[\left|f\left(\frac{S_n}{n}\right) - f(x)\right| 1_{\overline{A}}\right] \le 2M P(\overline{A}) \le 2M P\left(\left|\frac{S_n}{n} - x\right| \ge \delta(\varepsilon)\right).$$
Also, by definition of $A$ and $\delta(\varepsilon)$,
$$E\left[\left|f\left(\frac{S_n}{n}\right) - f(x)\right| 1_A\right] \le \varepsilon.$$
Therefore
$$|P_n(x) - f(x)| \le \varepsilon + 2M P\left(\left|\frac{S_n}{n} - x\right| \ge \delta(\varepsilon)\right).$$
But $x$ is the mean of $S_n/n$, and the variance of $S_n/n$ is $x(1 - x)/n \le 1/(4n)$. Therefore, by Tchebyshev’s inequality,
$$P\left(\left|\frac{S_n}{n} - x\right| \ge \delta(\varepsilon)\right) \le \frac{1}{4n[\delta(\varepsilon)]^2}.$$
Finally
$$|f(x) - P_n(x)| \le \varepsilon + \frac{M}{2n[\delta(\varepsilon)]^2}.$$
Since $\varepsilon > 0$ is otherwise arbitrary, this suffices to prove the convergence in ($\star$). The convergence is uniform since the right-hand side of the latter inequality does not depend on $x \in [0, 1]$.
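The Bernstein polynomial $P_n(x) = E[f(S_n/n)]$ is easy to evaluate numerically, which gives a concrete feel for the uniform convergence. The sketch below is an illustration only: the function names and the test function $f(x) = |x - \frac{1}{2}|$ are assumptions, and the supremum is approximated on a finite grid.

```python
from math import comb


def bernstein(f, n, x):
    """P_n(x) = sum_k f(k/n) C(n, k) x^k (1 - x)^(n - k), i.e. E[f(S_n / n)] with S_n ~ B(n, x)."""
    return sum(f(k / n) * comb(n, k) * x ** k * (1.0 - x) ** (n - k) for k in range(n + 1))


def sup_error(f, n, grid_size=50):
    """Approximate sup over [0, 1] of |f - P_n| on the grid {i / grid_size}."""
    points = [i / grid_size for i in range(grid_size + 1)]
    return max(abs(f(x) - bernstein(f, n, x)) for x in points)


if __name__ == "__main__":
    f = lambda x: abs(x - 0.5)  # continuous but not differentiable at 1/2
    for n in (25, 100, 400):
        print(n, sup_error(f, n))
```

For this choice of $f$ the error is largest near the kink at $x = \frac{1}{2}$ and decreases slowly with $n$, which is consistent with a bound of the form $\varepsilon$ plus a term of order $1/n$.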

There is a Cauchy-type criterion for convergence in probability.

Theorem 4.2.3 For a sequence $\{Z_n\}_{n \ge 1}$ of random variables to converge in probability to some random variable, it is necessary and sufficient that for all $\varepsilon > 0$,
$$\lim_{m,n \uparrow \infty} P(|Z_m - Z_n| \ge \varepsilon) = 0.$$

Proof. Necessity. We have the inclusion
$$\{|Z_m - Z_n| \ge \varepsilon\} \subseteq \left\{|Z_m - Z| \ge \tfrac{1}{2}\varepsilon\right\} \cup \left\{|Z_n - Z| \ge \tfrac{1}{2}\varepsilon\right\}$$
and therefore
$$P(|Z_m - Z_n| \ge \varepsilon) \le P\left(|Z_m - Z| \ge \tfrac{1}{2}\varepsilon\right) + P\left(|Z_n - Z| \ge \tfrac{1}{2}\varepsilon\right).$$

Sufficiency. Let $n_1 := 1$ and let, for $j \ge 2$,
$$n_j = \inf\left\{N > n_{j-1};\ P\left(|Z_r - Z_s| > \frac{1}{2^j}\right) < \frac{1}{3^j} \text{ if } r, s > N\right\}.$$
Then
$$\sum_j P\left(|Z_{n_j} - Z_{n_{j-1}}| > \frac{1}{2^{j-1}}\right) < \infty,$$
and therefore there exists a random variable $Z$ such that $Z_{n_j} \to Z$ almost surely as $j \uparrow \infty$. Now
$$P(|Z - Z_n| \ge \varepsilon) \le P\left(|Z_n - Z_{n_j}| \ge \tfrac{1}{2}\varepsilon\right) + P\left(|Z_{n_j} - Z| \ge \tfrac{1}{2}\varepsilon\right)$$
can be made arbitrarily close to 0 as $n \uparrow \infty$, by definition of the $n_j$’s and the fact that almost-sure convergence implies convergence in probability, as we shall see in Theorem 4.5.1. $\square$

In fact, there exists a distance between random variables that metrizes convergence in probability, namely
$$d(X, Y) := E[|X - Y| \wedge 1].$$
The verification that $d$ is indeed a metric is left as an exercise.


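Theorem 4.2.4 The sequence $\{X_n\}_{n \ge 1}$ converges in probability to the variable $X$ if and only if
$$\lim_{n \uparrow \infty} d(X_n, X) = 0.$$

Proof. If: By Markov’s inequality, for $\varepsilon \in (0, 1]$,
$$P(|X_n - X| \ge \varepsilon) = P(|X_n - X| \wedge 1 \ge \varepsilon) \le \frac{d(X_n, X)}{\varepsilon}.$$
Only if: For all $\varepsilon > 0$,
$$d(X_n, X) = \int_{\{|X_n - X| \ge \varepsilon\}} (|X_n - X| \wedge 1)\, dP + \int_{\{|X_n - X| < \varepsilon\}} (|X_n - X| \wedge 1)\, dP \le P(|X_n - X| \ge \varepsilon) + \varepsilon.$$
Therefore $\limsup_{n \uparrow \infty} d(X_n, X) \le \varepsilon$. Since $\varepsilon > 0$ is arbitrary, we have shown that $\lim_{n \uparrow \infty} d(X_n, X) = 0$. $\square$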

4.2.2 Convergence in $L^p$

Definition 4.2.5 Let $p$ be a positive integer. A sequence $\{Z_n\}_{n \ge 1}$ of complex random variables of $L^p_{\mathbb{C}}(P)$ is said to converge in $L^p$ to the complex random variable $Z \in L^p_{\mathbb{C}}(P)$ if
$$\lim_{n \uparrow \infty} E[|Z_n - Z|^p] = 0. \tag{4.17}$$

In the case $p = 2$, the sequence $\{Z_n\}_{n \ge 1}$ of square-integrable complex random variables is said to converge in the quadratic mean to $Z$. By the Riesz–Fischer theorem (Theorem 2.3.22):

Theorem 4.2.6 For the sequence $\{Z_n\}_{n \ge 1}$ of complex random variables of $L^p_{\mathbb{C}}(P)$ to converge in $L^p$ to some random variable $Z \in L^p_{\mathbb{C}}(P)$, it is necessary and sufficient that
$$\lim_{n,m \uparrow \infty} E[|Z_n - Z_m|^p] = 0. \tag{4.18}$$

Recall that $L^2_{\mathbb{C}}(P)$ is a Hilbert space with inner product $\langle X, Y \rangle = E[XY^*]$, which has the following property of continuity.

Theorem 4.2.7 Let $\{X_n\}_{n \ge 1}$ and $\{Y_n\}_{n \ge 1}$ be two sequences of square-integrable complex random variables that converge in quadratic mean to the square-integrable complex random variables $X$ and $Y$, respectively. Then,
$$\lim_{n,m \uparrow \infty} E[X_n Y_m^*] = E[XY^*]. \tag{4.19}$$


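Proof. We have
$$\begin{aligned}
|E[X_n Y_m^*] - E[XY^*]| &= |E[(X_n - X)(Y_m - Y)^*] + E[(X_n - X)Y^*] + E[X(Y_m - Y)^*]| \\
&\le |E[(X_n - X)(Y_m - Y)^*]| + |E[(X_n - X)Y^*]| + |E[X(Y_m - Y)^*]|
\end{aligned}$$
and the right-hand side of this inequality is, by Schwarz’s inequality, less than
$$\left(E[|X_n - X|^2]\right)^{\frac{1}{2}} \left(E[|Y_m - Y|^2]\right)^{\frac{1}{2}} + \left(E[|X_n - X|^2]\right)^{\frac{1}{2}} \left(E[|Y|^2]\right)^{\frac{1}{2}} + \left(E[|X|^2]\right)^{\frac{1}{2}} \left(E[|Y_m - Y|^2]\right)^{\frac{1}{2}},$$
which tends to 0 as $n, m \uparrow \infty$. $\square$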

Example 4.2.8: $L^2$-convergence of Series. Let $\{A_n\}_{n \in \mathbb{Z}}$ and $\{B_n\}_{n \in \mathbb{Z}}$ be two sequences of centered square-integrable complex random variables such that
$$\sum_{j \in \mathbb{Z}} E[|A_j|^2] < \infty, \qquad \sum_{j \in \mathbb{Z}} E[|B_j|^2] < \infty.$$
Suppose, moreover, that for all $i \ne j$,
$$E[A_i A_j^*] = E[B_i B_j^*] = E[A_i B_j^*] = 0.$$
Define
$$U_n = \sum_{j=-n}^n A_j, \qquad V_n = \sum_{j=-n}^n B_j.$$
Then $\{U_n\}_{n \ge 1}$ (resp., $\{V_n\}_{n \ge 1}$) converges in quadratic mean to some square-integrable random variable $U$ (resp., $V$) and
$$E[U] = E[V] = 0 \quad \text{and} \quad E[UV^*] = \sum_{j \in \mathbb{Z}} E[A_j B_j^*].$$

Proof. For $m > n$, we have
$$E[|U_m - U_n|^2] = E\left[\Big|\sum_{n < |j| \le m} A_j\Big|^2\right] = \sum_{n < |j| \le m}\ \sum_{n < |i| \le m} E[A_j A_i^*] = \sum_{n < |j| \le m} E[|A_j|^2]$$
since when $i \ne j$, $E[A_j A_i^*] = 0$. The conclusion then follows from the Cauchy criterion for convergence in quadratic mean, since
$$\lim_{m,n \uparrow \infty} E[|U_m - U_n|^2] = \lim_{m,n \uparrow \infty} \sum_{n < |j| \le m} E[|A_j|^2] = 0,$$
in view of the hypothesis $\sum_{j \in \mathbb{Z}} E[|A_j|^2] < \infty$. By continuity of the inner product in $L^2_{\mathbb{C}}(P)$,
$$E[UV^*] = \lim_{n \uparrow \infty} E[U_n V_n^*] = \lim_{n \uparrow \infty} \sum_{j=-n}^n \sum_{\ell=-n}^n E[A_j B_\ell^*] = \lim_{n \uparrow \infty} \sum_{j=-n}^n E[A_j B_j^*] = \sum_{j \in \mathbb{Z}} E[A_j B_j^*]. \qquad \square$$


4.2.3 Uniform Integrability

The monotone and dominated convergence theorems are not the only tools at our disposal giving conditions under which it is possible to exchange limits and expectations. Uniform integrability, which will be introduced now, is another such sufficient condition.

Definition 4.2.9 A collection $\{X_i\}_{i \in I}$ (where $I$ is an arbitrary index set) of integrable random variables is called uniformly integrable if
$$\lim_{c \uparrow \infty} \int_{\{|X_i| > c\}} |X_i|\, dP = 0 \quad \text{uniformly in } i \in I.$$

Example 4.2.10: Collection Dominated by an Integrable Variable. If, for some integrable random variable $X$, $P(|X_i| \le X) = 1$ for all $i \in I$, then $\{X_i\}_{i \in I}$ is uniformly integrable. Indeed, in this case,
$$\int_{\{|X_i| > c\}} |X_i|\, dP \le \int_{\{X > c\}} X\, dP$$
and by monotone convergence the right-hand side of the above inequality tends to 0 as $c \uparrow \infty$.

Remark 4.2.11 Clearly, if one adds a finite number of integrable variables to a uniformly integrable collection, the augmented collection will also be uniformly integrable.

Theorem 4.2.12 The collection $\{X_i\}_{i \in I}$ of integrable random variables is uniformly integrable if and only if

(a) $\sup_i E[|X_i|] < \infty$, and

(b) for every $\varepsilon > 0$, there exists a $\delta(\varepsilon) > 0$ such that
$$\sup_i \int_A |X_i|\, dP \le \varepsilon \quad \text{whenever } P(A) \le \delta(\varepsilon).$$
(In other words, $\int_A |X_i|\, dP \to 0$ uniformly in $i$ as $P(A) \to 0$.)

Proof. Assume uniform integrability. For any $\varepsilon > 0$, there exists a $c$ such that $\int_{\{|X_i| > c\}} |X_i|\, dP \le \frac{1}{2}\varepsilon$ for all $i \in I$. For all $A \in \mathcal{F}$, all $i \in I$,
$$\int_A |X_i|\, dP \le cP(A) + \int_{\{|X_i| > c\}} |X_i|\, dP \le cP(A) + \frac{1}{2}\varepsilon.$$
Therefore we have (b) by taking $\delta(\varepsilon) = \frac{\varepsilon}{2c}$, and (a) with $A = \Omega$.

Conversely, let $M := \sup_i E[|X_i|] < \infty$. Let $\varepsilon$ and $\delta(\varepsilon)$ be as in (b). Let $c_0 := \frac{M}{\delta(\varepsilon)}$. For all $c \ge c_0$ and all $i \in I$, $P(|X_i| > c) \le \frac{M}{c} \le \delta(\varepsilon)$ (Markov’s inequality). Apply (b) with $A = \{|X_i| > c\}$ to obtain that $\sup_i \int_{\{|X_i| > c\}} |X_i|\, dP \le \varepsilon$. $\square$

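Since the “collection” consisting of a single integrable variable $X$ is uniformly integrable, condition (b) of the theorem above reads
$$\sup_{A\,;\,P(A) < \delta} E[|X|\, 1_A] \to 0 \quad \text{as } \delta \to 0.$$

[…] for all $a > 0$,
$$E\left[|X_i|\, 1_{\{|X_i| \ge a\}}\right] \le E\left[Z_i\, 1_{\{Z_i \ge a\}}\right],$$
where $Z_i := E[|Y| \mid \mathcal{F}_i]$. By definition of conditional expectation, since $\{Z_i \ge a\} \in \mathcal{F}_i$,
$$E\left[(|Y| - Z_i)\, 1_{\{Z_i \ge a\}}\right] = 0$$
and therefore
$$E\left[|X_i|\, 1_{\{|X_i| \ge a\}}\right] \le E\left[|Y|\, 1_{\{Z_i \ge a\}}\right]. \tag{$\star$}$$
By Markov’s inequality,
$$P(Z_i \ge a) \le \frac{E[Z_i]}{a} = \frac{E[|Y|]}{a},$$
and therefore $P(Z_i \ge a) \to 0$ as $a \to \infty$ uniformly in $i$. Use (15.42) to obtain that $E\left[|Y|\, 1_{\{Z_i \ge a\}}\right] \to 0$ as $a \to \infty$ uniformly in $i$. Conclude with ($\star$).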

Theorem 4.2.14 A sufficient condition for the collection $\{X_i\}_{i \in I}$ of integrable random variables to be uniformly integrable is the existence of a non-negative non-decreasing function $G : \mathbb{R} \to \mathbb{R}$ such that
$$\lim_{t \uparrow \infty} \frac{G(t)}{t} = +\infty$$
and
$$\sup_i E[G(|X_i|)] < \infty.$$

Proof. Fix $\varepsilon > 0$ and let $a = \frac{M}{\varepsilon}$, where $M := \sup_i E[G(|X_i|)]$. Take $c$ large enough so that $G(t)/t \ge a$ for $t \ge c$. In particular, $|X_i| \le \frac{G(|X_i|)}{a}$ on $\{|X_i| > c\}$ and therefore
$$\int_{\{|X_i| > c\}} |X_i|\, dP \le \frac{1}{a} E\left[G(|X_i|)\, 1_{\{|X_i| > c\}}\right] \le \frac{M}{a} = \varepsilon$$
uniformly in $i$. $\square$

Example 4.2.15: Two Sufficient Conditions for Uniform Integrability. Two frequently used sufficient conditions guaranteeing uniform integrability are
$$\sup_i E\left[|X_i|^{1+\alpha}\right] < \infty \quad (\alpha > 0)$$
and
$$\sup_i E\left[|X_i| \log^+ |X_i|\right] < \infty.$$

Almost-sure convergence of a sequence of integrable random variables to an integrable random variable does not necessarily imply convergence in L1 . However:


Theorem 4.2.16 Let $\{X_n\}_{n \ge 1}$ be a sequence of integrable random variables and let $X$ be some random variable. The following are equivalent:

(a) $\{X_n\}_{n \ge 1}$ is uniformly integrable and $X_n \xrightarrow{Pr.} X$ as $n \to \infty$.

(b) $X$ is integrable and $X_n \xrightarrow{L^1} X$ as $n \to \infty$.

Proof. (a) implies (b): Since $X_n \xrightarrow{Pr.} X$, there exists a subsequence $\{X_{n_k}\}_{k \ge 1}$ such that $X_{n_k} \xrightarrow{a.s.} X$. By Fatou’s lemma,
$$E[|X|] \le \liminf_k E[|X_{n_k}|] \le \sup_{n_k} E[|X_{n_k}|] \le \sup_n E[|X_n|] < \infty.$$
Therefore $X \in L^1$. Also for fixed $\varepsilon > 0$,
$$E[|X_n - X|] \le \int_{\{|X_n - X| < \varepsilon\}} |X_n - X|\, dP + \int_{\{|X_n - X| \ge \varepsilon\}} |X_n|\, dP + \int_{\{|X_n - X| \ge \varepsilon\}} |X|\, dP$$

[…]

Let $\varepsilon > 0$ be given and let $n_0$ be such that $E[|X_n - X|] \le \varepsilon$ for all $n \ge n_0$. The random variables $X, X_1, \ldots, X_{n_0}$ being integrable, there exists a $\delta > 0$ such that if $P(A) \le \delta$, then $\int_A |X|\, dP \le \varepsilon$ and $\int_A |X_n|\, dP \le \varepsilon$ for $n \le n_0$. If $n \ge n_0$, by the triangle inequality,
$$\int_A |X_n|\, dP \le \int_A |X|\, dP + \int_A |X_n - X|\, dP \le 2\varepsilon,$$
and therefore (b) of Theorem 4.2.12 is satisfied. Condition (a) of Theorem 4.2.12 is also satisfied since $E[|X_n|] \le E[|X_n - X|] + E[|X|]$. $\square$

4.3 Zero-one Laws

4.3.1 Kolmogorov’s Zero-one Law

Definition 4.3.1 Let $\{X_n\}_{n \ge 1}$ be a sequence of random variables and let $\mathcal{F}_n^X := \sigma(X_1, \ldots, X_n)$. The σ-field $\mathcal{T}^X := \cap_{n \ge 1} \sigma(X_n, X_{n+1}, \ldots)$ is called the tail σ-field of this sequence.


Example 4.3.2: For any $a \in \mathbb{R}$, the event $\left\{\lim_{n \uparrow \infty} \frac{X_1 + \cdots + X_n}{n} \le a\right\}$ belongs to the tail σ-field, since the existence and the value of the limit of $\frac{X_1 + \cdots + X_n}{n}$ do not depend on any fixed finite number of terms of the sequence. More generally, any event concerning $\lim_{n \uparrow \infty} \frac{X_1 + \cdots + X_n}{n}$ such as, for instance, the event that such a limit exists, is in the tail σ-field.

Recall the notation $\mathcal{F}_\infty^X := \vee_{n \ge 1} \mathcal{F}_n^X$.

Theorem 4.3.3 The tail σ-field of a sequence $\{X_n\}_{n \ge 1}$ of independent random variables is trivial, that is, if $A \in \mathcal{T}^X$, then $P(A) = 0$ or 1.

Proof. The σ-fields $\mathcal{F}_n^X$ and $\sigma(X_{n+k}, X_{n+k+1}, \ldots)$ are independent for all $k \ge 1$ and therefore, since $\mathcal{T}^X = \cap_{k \ge 1} \sigma(X_{n+k}, X_{n+k+1}, \ldots)$, the σ-fields $\mathcal{F}_n^X$ and $\mathcal{T}^X$ are independent. Therefore the algebra $\cup_{n \ge 1} \mathcal{F}_n^X$ and $\mathcal{T}^X$ are independent, and consequently (Theorem 3.1.39) $\mathcal{F}_\infty^X$ and $\mathcal{T}^X$ are independent. But $\mathcal{F}_\infty^X \supseteq \mathcal{T}^X$, so that $\mathcal{T}^X$ is independent of itself. In particular, for all $A \in \mathcal{T}^X$, $P(A \cap A) = P(A)P(A)$, that is, $P(A) = P(A)^2$, which implies that $P(A) = 0$ or 1. $\square$

4.3.2 The Hewitt–Savage Zero-one Law

Let $(S, \mathcal{S})$ be some measurable space and let $\mu$ be a probability measure on it. We shall work on the canonical measurable space $(\Omega, \mathcal{F}) := (S^{\mathbb{N}}, \mathcal{S}^{\otimes \mathbb{N}})$ of $S$-valued random sequences, endowed with the probability measure $P := \mu^{\otimes \infty}$. In particular, an element of $\Omega$ has the form $\omega := x := (x_1, x_2, \ldots) \in S^{\mathbb{N}}$ and moreover, the sequence $\{X_n\}_{n \ge 1}$ defined by $X_n(\omega) := x_n$ ($\omega \in \Omega$, $n \ge 1$) is iid with common probability distribution $\mu$.

Definition 4.3.4 (a) A finite permutation of $\mathbb{N}$ is a permutation $\pi$ such that $\pi(i) = i$ for all but a finite number of indices $i \ge 1$. (b) An event $A \in \mathcal{F}$ such that $\pi^{-1} A = A$ for all finite permutations $\pi$ is called exchangeable. (c) The sub-σ-field $\mathcal{E}$ consisting of the collection of exchangeable events is called the exchangeable σ-field.

(Note that $X_n(\pi\omega) = X_{\pi(n)}(\omega)$ for any permutation $\pi$ on $\Omega$.)

Example 4.3.5: Tail events are exchangeable. All the events of the tail σ-field $\mathcal{T}$ are exchangeable. Indeed, for all $n \ge 1$, an event $B \in \sigma(X_{n+1}, X_{n+2}, \ldots)$ is unaltered by a permutation bearing on only the first $n$ coordinates. Therefore any event $B \in \cap_{n \ge 1} \sigma(X_{n+1}, X_{n+2}, \ldots)$ is unaltered by any finite permutation.


Example 4.3.6: There exist exchangeable events that are not tail events. In Example 4.3.5, we have seen that $\mathcal{T} \subset \mathcal{E}$. The current example shows that we do not have the reverse inclusion. Indeed, the event $A := \{X_1 + \cdots + X_n \in C \text{ i.o.}\}$ is exchangeable (if the finite permutation $\pi$ bears on only the first $K$ integers, then for all $n \ge K + 1$, $X_1 + \cdots + X_n = X_{\pi(1)} + \cdots + X_{\pi(n)}$). However, it is not a tail event.

Theorem 4.3.7 The events of the exchangeable σ-field are trivial, that is, for any $A \in \mathcal{E}$, $P(A) = 0$ or 1.

The proof depends on the following lemma of approximation of an element of $\mathcal{F}$ by an element of an algebra $\mathcal{A}$ generating $\mathcal{F}$. More precisely (recalling the notation $A \Delta B := (A - A \cap B) \cup (B - A \cap B)$):

Lemma 4.3.8 Let $\mathcal{A}$ be an algebra generating the σ-field $\mathcal{F}$ and let $P$ be a probability on $\mathcal{F}$. With any event $B \in \mathcal{F}$ and any $\varepsilon > 0$, one can associate an event $A \in \mathcal{A}$ such that $P(A \Delta B) \le \varepsilon$.

Proof. The collection of sets
$$\mathcal{G} := \{B \in \mathcal{F};\ \forall \varepsilon > 0,\ \exists A \in \mathcal{A} \text{ with } P(A \Delta B) \le \varepsilon\}$$
contains $\mathcal{A}$. It is moreover a σ-field, as we now show. First, $\Omega \in \mathcal{A} \subseteq \mathcal{G}$ and the stability of $\mathcal{G}$ under complementation is clear. For the stability of $\mathcal{G}$ under countable unions, let $B_n$ ($n \ge 1$) be in $\mathcal{G}$ and let $\varepsilon > 0$ be given. By definition of $\mathcal{G}$, there exist $A_n$’s in $\mathcal{A}$ such that $P(A_n \Delta B_n) \le 2^{-n-1}\varepsilon$. Therefore, for all $K \ge 1$,
$$P\left((\cup_{n=1}^K A_n) \Delta (\cup_{n=1}^K B_n)\right) \le \sum_{n=1}^K 2^{-n-1}\varepsilon \le \sum_{n \ge 1} 2^{-n-1}\varepsilon = 2^{-1}\varepsilon.$$
By the sequential continuity property of probability, there exists an integer $K = K(\varepsilon)$ such that $P(\cup_{n \ge 1} B_n - \cup_{n=1}^K B_n) \le 2^{-1}\varepsilon$. Therefore, for such an integer,
$$P\left((\cup_{n=1}^K A_n) \Delta (\cup_{n \ge 1} B_n)\right) \le \varepsilon.$$
The proof of stability of $\mathcal{G}$ under countable unions is completed since $\mathcal{A}$ is an algebra and therefore $\cup_{n=1}^K A_n \in \mathcal{A}$. Therefore $\mathcal{G}$ is a σ-field containing $\mathcal{A}$ and consequently contains the σ-field $\mathcal{F}$ generated by $\mathcal{A}$. $\square$

We now proceed to the proof of Theorem 4.3.7.

Proof. Let $A \in \mathcal{E}$. Lemma 4.3.8 guarantees that for any $n \ge 1$, there exists an $A_n \in \sigma(X_1, \ldots, X_n)$ such that
$$P(A_n \Delta A) \to 0.$$
Note that for all $n \ge 1$, $A_n = \{\omega;\ (x_1, \ldots, x_n) \in B_n\}$ for some $B_n \in \mathcal{S}^{\otimes n}$. Define the finite permutation $\pi_n = \pi$ by
$$\pi(j) = \begin{cases} j + n & \text{if } 1 \le j \le n, \\ j - n & \text{if } n + 1 \le j \le 2n, \\ j & \text{if } j \ge 2n + 1. \end{cases}$$
Note that $\pi^2$ is the identity, and in particular $\pi = \pi^{-1}$, and that by the iid assumption, the sequence obtained by finite permutation of an iid sequence is iid with the same distribution. Therefore
$$P(\omega;\ \omega \in A_n \Delta A) = P(\omega;\ \pi\omega \in A_n \Delta A). \tag{$\star$}$$
Now $\{\omega;\ \pi\omega \in A\} = \{\omega;\ \omega \in A\}$ by exchangeability of $A$, and
$$\{\omega;\ \pi\omega \in A_n\} = \{\omega;\ (x_{n+1}, \ldots, x_{2n}) \in B_n\}.$$
Therefore, denoting by $A_n'$ the event in the right-hand side of the above equality,
$$\{\omega;\ \pi\omega \in A_n \Delta A\} = \{\omega;\ \omega \in A_n' \Delta A\}. \tag{$\star\star$}$$
Combining ($\star$) and ($\star\star$):
$$P(A_n \Delta A) = P(A_n' \Delta A). \tag{†}$$
From the set inclusion $A \Delta C \subseteq (A \Delta B) \cup (B \Delta C)$, (†) and $P(A_n \Delta A) \to 0$,
$$P(A_n \Delta A_n') \le P(A_n \Delta A) + P(A \Delta A_n') \to 0. \tag{††}$$
Therefore
$$0 \le P(A_n) - P(A_n \cap A_n') \le P(A_n \cup A_n') - P(A_n \cap A_n') = P(A_n \Delta A_n') \to 0.$$
Therefore $P(A_n \cap A_n') \to P(A)$. Since $A_n$ and $A_n'$ are independent (and recalling that $P(A_n') = P(A_n)$),
$$P(A_n \cap A_n') = P(A_n)P(A_n') = P(A_n)^2 \to P(A)^2.$$
Comparing with (††), we see that $P(A)^2 = P(A)$, which implies that $P(A) = 0$ or $P(A) = 1$. $\square$

The Hewitt–Savage zero-one law will now be applied to the asymptotic behavior of random walks. By definition, a random walk on $\mathbb{R}$ is a sequence $\{S_n\}_{n \ge 1}$ of real-valued random variables of the form $S_n = X_1 + \cdots + X_n$, where $\{X_n\}_{n \ge 1}$ is an iid sequence of real-valued random variables.

Theorem 4.3.9 Discarding the trivial case where $P(X_1 = 0) = 1$, with probability one, one and only one of the following occurs:

(a) $\lim_n S_n = +\infty$,

(b) $\lim_n S_n = -\infty$,

(c) $-\infty = \liminf_n S_n < \limsup_n S_n = +\infty$.

If, moreover, the distribution of $X_1$ is symmetric around 0, (c) occurs with probability 1.


Proof. The random variable $\limsup_n S_n$ is exchangeable (its value is independent of any finite permutation of the $X_i$’s) and therefore is a constant $c$, possibly $+\infty$ or $-\infty$. Since $S_n = X_1 + S_n'$, where $S_n' = X_2 + \cdots + X_n$, we have that
$$\limsup_n S_n = X_1 + \limsup_n S_n',$$
where $\limsup_n S_n'$ has the same distribution as $\limsup_n S_n$ and is independent of $X_1$, and therefore
$$c = X_1 + c.$$
Since $P(X_1 \ne 0) > 0$, this implies that $\limsup_n S_n$ cannot be finite, and is therefore either $+\infty$ or $-\infty$. Similarly for $\liminf_n S_n$, which is therefore either $+\infty$ or $-\infty$. Since we cannot have simultaneously $\liminf_n S_n = +\infty$ and $\limsup_n S_n = -\infty$, the first part of the theorem is proved. In the symmetric case, only (c) is possible because one of the events (a) or (b) entails the other. $\square$

4.4 Convergence in Distribution and in Variation

4.4.1 The Role of Characteristic Functions

Let $\{X_n\}_{n \ge 1}$ and $X$ be real-valued random variables with respective cumulative distribution functions $\{F_n\}_{n \ge 1}$ and $F$. A natural definition of convergence in distribution of $\{X_n\}_{n \ge 1}$ to $X$ is the following:
$$\lim_{n \uparrow \infty} F_n(x) = F(x). \tag{$\star$}$$
We have not specified for what $x \in \mathbb{R}$ ($\star$) is required. If we want this to hold for all $x$, then we could not say that the “random” (actually deterministic) sequence of random variables $X_n \equiv a + \frac{1}{n}$, where $a \in \mathbb{R}$, converges to $X \equiv a$. In fact, ($\star$) holds in this case for all points of continuity of the cumulative distribution function of $X$, here $F(x) = 1_{x \ge a}$. It turns out that a “good” definition would be precisely that ($\star$) should hold for all continuity points of the target cdf $F$.

Example 4.4.1: Magnified Minimum. Let $\{Y_n\}_{n \ge 1}$ be a sequence of iid random variables uniformly distributed on $[0, 1]$. Then
$$X_n = n \min(Y_1, \ldots, Y_n) \xrightarrow{D} \mathcal{E}(1)$$
(the exponential distribution with mean 1). In fact, for all $x \in [0, n]$,
$$P(X_n > x) = P\left(\min(Y_1, \ldots, Y_n) > \frac{x}{n}\right) = \prod_{i=1}^n P\left(Y_i > \frac{x}{n}\right) = \left(1 - \frac{x}{n}\right)^n,$$
and therefore $\lim_{n \uparrow \infty} P(X_n > x) = e^{-x} 1_{\mathbb{R}_+}(x)$.
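The convergence in Example 4.4.1 can be checked numerically, both through the exact survival function $(1 - x/n)^n$ and by simulation. The following sketch uses hypothetical function names and arbitrary choices of $n$, of the number of replications and of the seed.

```python
import math
import random


def exact_survival(n, x):
    """P(X_n > x) for X_n = n min(Y_1, ..., Y_n), with Y_i iid uniform on [0, 1]."""
    if x < 0.0:
        return 1.0
    if x > n:
        return 0.0
    return (1.0 - x / n) ** n


def magnified_minimum(n, rng):
    """One sample of X_n = n min(Y_1, ..., Y_n)."""
    return n * min(rng.random() for _ in range(n))


if __name__ == "__main__":
    x = 1.0
    for n in (5, 50, 500, 5000):
        print(n, exact_survival(n, x), math.exp(-x))  # exact value versus the limit e^{-x}
    rng = random.Random(3)  # arbitrary seed, for reproducibility only
    samples = [magnified_minimum(200, rng) for _ in range(5000)]
    print(sum(s > x for s in samples) / len(samples))  # empirical P(X_200 > 1)
```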


For technical reasons, our starting point will differ from the definition (), properly modified. Denote by M + (Rd ) the collection of finite measures on (Rd , B(Rd )) and by Cb (Rd ) the collection of continuous bounded functions f : Rd → R. Definition 4.4.2 (a) The sequence {μn0}n≥1 in M + (Rd ) is said to converge weakly to μ if 0 limn↑∞ Rd f dμn = Rd f dμ for all f ∈ Cb (Rd ). This is denoted by w

μn → μ . (b) The sequence of random vectors {Xn }n≥1 of Rd with respective probability distributions {QXn }n≥1 is said to converge in distribution to the random vector X ∈ Rd w with distribution QX if QXn → QX . (In other words, for all continuous and d bounded functions f : R → R, limn↑∞ E[f (Xn )] = E[f (X)].) This is denoted by D Xn → X . Remark 4.4.3 Observe that X and the Xn ’s need not be defined on the same probability space. Convergence in distribution concerns only probability distributions. As a matter of fact, very often the Xn ’s are defined on the same probability space but there is no “visible” (that is, defined on the same probability space) limit random vector X. D Therefore one sometimes denotes convergence in distribution as follows: Xn → Q, where d Q is a probability distribution on R . If Q is a “famous” probability distribution, for instance a standard Gaussian variable, one then says that {Xn }n≥1 “converges in distriD

bution to a standard Gaussian distribution”. This is also denoted by Xn → N (0, 1). Let B o and B c be respectively the interior and the closure of the set B ∈ Rd and let ∂B be its boundary (:= B c \B o ). The following theorem is a major tool of the theory of convergence in distribution: Theorem 4.4.4 Let {μn }n≥1 and μ be probability distributions on Rd . The following conditions are equivalent: w

(i) μn → μ. (ii) For any open set G ⊆ Rd , lim inf n μn (G) ≥ μ(G). (iii) For any closed set F ⊆ Rd , lim supn μn (F ) ≤ μ(F ). (iv) For any measurable set B ⊆ Rd such that μ(∂B) = 0, limn μn (B) = μ(B). Proof. (i) ⇒ (ii). For any open set G ∈ Rd there exists a non-decreasing sequence {ϕk }k≥1 of non-negative functions of Cb (Rd ) such that 0 ≤ ϕk ≤ 1 and ϕk ↑ 1G (for 0 0 instance, ϕk (x) = 1 − e−kd(x,G) ). Since 1G dμn ≥ ϕk dμn (k ≥ 1), lim inf μn (G) = lim inf 1G dμn ≥ lim inf ϕk dμn . n

n

n

CHAPTER 4. CONVERGENCES

168 This being true for all k ≥ 1,

  lim inf μn (G) ≥ sup lim inf ϕk dμn n n k   = sup lim ϕk dμn = sup ϕk dμ = μ(G) . n

k

k

(ii) ⇔ (iii). Take complements. (ii) + (iii) ⇒ (iv). Indeed, by (ii) and (iii), lim sup μn (B) ≤ lim sup μn (B c ) ≤ μ(B c ) n

n

and lim inf μn (B) ≥ lim inf μn (B o ) ≥ μ(B o ) . n

n

o

c

But since μ(∂B) = 0, μ(B ) = μ(B ) = μ(B), and therefore (iv) is verified. 0 0 (iv) ⇒ (i). Let f ∈ Cb (Rd ). We must show that limn↑∞ Rd f dμn = Rd f dμ. It is enough to show this for f ≥ 0. Let K < ∞ be a bound of f . By Fubini, -

-

-

K

f (x) dμ(x) = Rd

-

Rd K

=

0

 1{t≤f (x)} dt dμ(x) -

K

μ({x ; t ≤ f (x)}) dt =

μ(Dtf ) dt ,

0

0

where $D_t^f := \{x \,;\, t \le f(x)\}$. Observe that $\partial D_t^f \subseteq \{x \,;\, t = f(x)\}$ and that the collection of positive $t$ such that $\mu(\{x \,;\, t = f(x)\}) > 0$ is at most countable (for each positive integer $k$ there are at most $k$ values of $t$ such that $\mu(\{x \,;\, t = f(x)\}) \ge \frac{1}{k}$). Therefore, by (iv), for almost all $t$ (with respect to the Lebesgue measure),
\[
\lim_n \mu_n(D_t^f) = \mu(D_t^f) ,
\]
and by dominated convergence,
\[
\lim_n \int f \, d\mu_n = \lim_n \int_0^K \mu_n(D_t^f) \, dt = \int_0^K \mu(D_t^f) \, dt = \int f \, d\mu .
\]
□
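The role of the boundary condition in (iv) can be seen on a small example. The sketch below is only an illustration, not part of the text: it assumes the sequence of Dirac measures at the points 1/n, which converges weakly to the Dirac measure at 0, and the helper names `dirac_mass` and `masses_along_sequence` are hypothetical. For the set B1 = (−1, 0] the limit measure charges the boundary and the masses do not converge to μ(B1); for B2 = (−1, 1/2] the boundary is negligible and they do.

```python
from fractions import Fraction


def dirac_mass(point, lower, upper):
    """Mass given by the Dirac measure at `point` to the interval (lower, upper]."""
    return 1 if lower < point <= upper else 0


def masses_along_sequence(lower, upper, n_max=1000):
    """Return mu_n((lower, upper]) for mu_n = Dirac measure at 1/n, n = 1..n_max."""
    return [dirac_mass(Fraction(1, n), lower, upper) for n in range(1, n_max + 1)]


ZERO = Fraction(0)

# B1 = (-1, 0]: the limit measure (Dirac mass at 0) charges the boundary point 0.
b1 = masses_along_sequence(Fraction(-1), ZERO)
mu_b1 = dirac_mass(ZERO, Fraction(-1), ZERO)

# B2 = (-1, 1/2]: the boundary {-1, 1/2} has mass 0 under the limit measure.
b2 = masses_along_sequence(Fraction(-1), Fraction(1, 2))
mu_b2 = dirac_mass(ZERO, Fraction(-1), Fraction(1, 2))

print("B1: last mu_n =", b1[-1], " mu =", mu_b1)
print("B2: last mu_n =", b2[-1], " mu =", mu_b2)
```

Exact rational arithmetic is used so that the comparison 1/n ≤ 1/2 is not affected by rounding.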



Paul Lévy’s Characterization

Theorem 4.4.5 A necessary and sufficient condition for the sequence $\{X_n\}_{n\ge 1}$ of random vectors of $\mathbb{R}^d$ to converge in distribution is that the sequence of their characteristic functions $\{\varphi_n\}_{n\ge 1}$ converges to some function $\varphi$ that is continuous at 0. In such a case, $\varphi$ is the characteristic function of the limit probability distribution.

The (technical) proof is postponed to Section 4.4.4.

4.4. CONVERGENCE IN DISTRIBUTION AND IN VARIATION


Corollary 4.4.6 Let $\{X_n\}_{n\ge 1}$ and $X$ be random vectors of $\mathbb{R}^d$ with respective characteristic functions $\{\varphi_n\}_{n\ge 1}$ and $\varphi$. The following two statements are equivalent.

(A) $X_n \overset{D}{\to} X$.
(B) $\lim_{n\uparrow\infty} \varphi_n = \varphi$.

Corollary 4.4.7 In the univariate case, denote by $F_n$ and $F$ the cumulative distribution functions of $X_n$ and $X$, respectively. Call a point $x \in \mathbb{R}$ a continuity point of $F$ if $F(x) = F(x-)$. Then $X_n \overset{D}{\to} X$ if and only if $\lim_n F_n(x) = F(x)$ for all continuity points $x$ of $F$.

Proof. Necessity. Let $Q_X$ be the probability distribution of $X$. If $x$ is a continuity point of $F$, the boundary of $C := (-\infty, x]$ is $\{x\}$, which has null $Q_X$-measure. Therefore by (iv) of Theorem 4.4.4, $\lim_n Q_{X_n}((-\infty, x]) = Q_X((-\infty, x])$, that is, $\lim_n F_n(x) = F(x)$.

Sufficiency. Let $f \in C_b(\mathbb{R})$ and let $M < \infty$ be an upper bound of $|f|$. For $\varepsilon > 0$, there exists a subdivision $-\infty < a = x_0 < x_1 < \cdots < x_k = b < +\infty$ formed by continuity points of $F$, such that $F(a) < \varepsilon$, $F(b) > 1 - \varepsilon$ and $|f(x) - f(x_i)| < \varepsilon$ on $[x_{i-1}, x_i]$. By hypothesis,
\[
S_n := \sum_{i=1}^k f(x_i)\,(F_n(x_i) - F_n(x_{i-1})) \to S := \sum_{i=1}^k f(x_i)\,(F(x_i) - F(x_{i-1})) .
\]
Also
\[
|E[f(X)] - S| \le \varepsilon + M F(a) + M(1 - F(b)) \le (2M+1)\varepsilon
\]
and
\[
|E[f(X_n)] - S_n| \le \varepsilon + M F_n(a) + M(1 - F_n(b)) \to \varepsilon + M F(a) + M(1 - F(b)) \le (2M+1)\varepsilon .
\]
Therefore,
\[
\limsup_n |E[f(X_n)] - E[f(X)]| \le \limsup_n |E[f(X_n)] - S_n| + \limsup_n |S_n - S| + |E[f(X)] - S| \le (4M+2)\varepsilon .
\]
Since $\varepsilon$ is arbitrary, $\lim_n |E[f(X_n)] - E[f(X)]| = 0$. □



Theorem 4.4.8 Let $\{X_n\}_{n\ge 1}$ and $\{Y_n\}_{n\ge 1}$ be sequences of random vectors of $\mathbb{R}^d$ such that $X_n \overset{D}{\to} X$ and $d(X_n, Y_n) \overset{Pr.}{\to} 0$, where $d$ denotes the euclidean distance. Then $Y_n \overset{D}{\to} X$.

Proof. By (iii) of Theorem 4.4.4, it suffices to show that for all closed sets $F$, $\limsup_n P(Y_n \in F) \le P(X \in F)$. For all $\varepsilon > 0$, define the closed set $F_\varepsilon = \{x \in \mathbb{R}^d \,;\, d(x, F) \le \varepsilon\}$. If $Y_n \in F$ and $d(X_n, Y_n) < \varepsilon$, then $X_n \in F_\varepsilon$. Hence
\[
P(Y_n \in F) \le P(d(X_n, Y_n) \ge \varepsilon) + P(X_n \in F_\varepsilon) ,
\]
and therefore
\[
\limsup_n P(Y_n \in F) \le \limsup_n P(d(X_n, Y_n) \ge \varepsilon) + \limsup_n P(X_n \in F_\varepsilon) = \limsup_n P(X_n \in F_\varepsilon) \le P(X \in F_\varepsilon) .
\]
Since $\varepsilon > 0$ is arbitrary and $\lim_{\varepsilon\downarrow 0} P(X \in F_\varepsilon) = P(X \in F)$, $\limsup_n P(Y_n \in F) \le P(X \in F)$. □

Corollary 4.4.9 Let $\{X_n\}_{n\ge 1}$ be a sequence of random vectors of $\mathbb{R}^d$ such that $X_n \overset{D}{\to} X$. If the sequence of real numbers $\{a_n\}_{n\ge 1}$ converges to the real number $a$, then $a_n X_n \overset{D}{\to} aX$.

Bochner’s Theorem

Bochner’s theorem will play a central role in the theory of wide-sense stationary processes of Chapter 12. The characteristic function $\varphi$ of a real random variable $X$ has the following properties:

A. it is hermitian symmetric (that is, $\varphi(-u) = \varphi(u)^*$) and uniformly bounded (in fact, $|\varphi(u)| \le \varphi(0)$);
B. it is uniformly continuous on $\mathbb{R}$; and
C. it is definite non-negative, in the sense that for all integers $n$, all $u_1, \ldots, u_n \in \mathbb{R}$, and all $z_1, \ldots, z_n \in \mathbb{C}$,
\[
\sum_{j=1}^n \sum_{k=1}^n \varphi(u_j - u_k)\, z_j z_k^* \ge 0
\]
(just observe that the left-hand side equals $E\left[ \left| \sum_{j=1}^n z_j e^{i u_j X} \right|^2 \right]$).

It turns out that Properties A, B and C characterize characteristic functions (up to a multiplicative constant). This is Bochner’s theorem:

Theorem 4.4.10 Let $\varphi : \mathbb{R} \to \mathbb{C}$ be a function satisfying properties A, B and C. Then there exists a constant $0 \le \beta < \infty$ and a real random variable $X$ such that for all $u \in \mathbb{R}$,
\[
\varphi(u) = \beta\, E\left[ e^{iuX} \right] .
\]

Proof. We henceforth eliminate the trivial case where $\varphi(0) = 0$ (implying, in view of condition A, that $\varphi$ is the null function). For any continuous function $z : \mathbb{R} \to \mathbb{C}$ and any $A \ge 0$,
\[
\int_0^A \int_0^A \varphi(u - v)\, z(u) z^*(v) \, du \, dv \ge 0 . \tag{$\star$}
\]
Indeed, since the integrand is continuous, the integral is the limit as $n \uparrow \infty$ of
\[
\frac{A^2}{4^n} \sum_{j=1}^{2^n} \sum_{k=1}^{2^n} \varphi\left( \frac{A(j-k)}{2^n} \right) z\left( \frac{Aj}{2^n} \right) z^*\left( \frac{Ak}{2^n} \right) ,
\]


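a non-negative quantity by condition C. From ($\star$) with $z(u) := e^{-ixu}$, we have that
\[
g(x, A) := \frac{1}{2\pi A} \int_0^A \int_0^A \varphi(u - v)\, e^{-ix(u-v)} \, du \, dv \ge 0 .
\]
Changing variables, we obtain for $g(x, A)$ the alternative expression
\[
g(x, A) = \frac{1}{2\pi} \int_{-A}^{A} \left( 1 - \frac{|u|}{A} \right) \varphi(u)\, e^{-iux} \, du = \frac{1}{2\pi} \int_{-\infty}^{+\infty} h\left( \frac{u}{A} \right) \varphi(u)\, e^{-iux} \, du ,
\]
where $h(u) = (1 - |u|)\, 1_{\{|u| \le 1\}}$. Let $M > 0$. We have
\[
\int_{-\infty}^{+\infty} h\left( \frac{x}{2M} \right) g(x, A) \, dx = \frac{1}{2\pi} \int_{-\infty}^{+\infty} h\left( \frac{u}{A} \right) \varphi(u) \left( \int_{-\infty}^{+\infty} h\left( \frac{x}{2M} \right) e^{-iux} \, dx \right) du = \frac{1}{\pi} M \int_{-\infty}^{+\infty} h\left( \frac{u}{A} \right) \varphi(u) \left( \frac{\sin Mu}{Mu} \right)^2 du .
\]
Therefore
\[
\int_{-\infty}^{+\infty} h\left( \frac{x}{2M} \right) g(x, A) \, dx \le \frac{1}{\pi} M \int_{-\infty}^{+\infty} h\left( \frac{u}{A} \right) |\varphi(u)| \left( \frac{\sin Mu}{Mu} \right)^2 du \le \varphi(0)\, \frac{1}{\pi} \int_{-\infty}^{+\infty} \left( \frac{\sin u}{u} \right)^2 du = \varphi(0) .
\]
By monotone convergence,
\[
\lim_{M\uparrow\infty} \int_{-\infty}^{+\infty} h\left( \frac{x}{2M} \right) g(x, A) \, dx = \int_{-\infty}^{+\infty} g(x, A) \, dx ,
\]
and therefore
\[
\int_{-\infty}^{+\infty} g(x, A) \, dx \le \varphi(0) .
\]
The function $x \to g(x, A)$ is therefore integrable and it is the Fourier transform of the integrable and continuous function $u \to h\left( \frac{u}{A} \right) \varphi(u)$. Therefore, by the Fourier inversion formula:
\[
h\left( \frac{u}{A} \right) \varphi(u) = \int_{-\infty}^{+\infty} g(x, A)\, e^{iux} \, dx .
\]
In particular, with $u = 0$, $\int_{-\infty}^{+\infty} g(x, A) \, dx = \varphi(0)$. Therefore, $f(x, A) := \frac{g(x, A)}{\varphi(0)}$ is the probability density of some real random variable with characteristic function $h\left( \frac{u}{A} \right) \frac{\varphi(u)}{\varphi(0)}$. But
\[
\lim_{A\uparrow\infty} h\left( \frac{u}{A} \right) \frac{\varphi(u)}{\varphi(0)} = \frac{\varphi(u)}{\varphi(0)} .
\]
This limit of a sequence of characteristic functions is continuous at 0 and is therefore a characteristic function (Paul Lévy’s criterion, Theorem 4.4.5). □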
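Property C can be probed numerically. The following sketch is an illustration under stated assumptions, not a proof: it evaluates the quadratic form of Property C for the standard Gaussian characteristic function exp(−u²/2) at randomly drawn points and weights, and contrasts it with the indicator of [−1, 1], which is bounded and symmetric but fails Property C. All function names are hypothetical.

```python
import math
import random


def gaussian_cf(u):
    """Characteristic function of a standard Gaussian variable."""
    return complex(math.exp(-0.5 * u * u), 0.0)


def indicator_cf(u):
    """Indicator of [-1, 1]: symmetric and bounded, but not a characteristic function."""
    return complex(1.0 if abs(u) <= 1.0 else 0.0, 0.0)


def quadratic_form(cf, us, zs):
    """Compute sum over j, k of cf(u_j - u_k) * z_j * conj(z_k)."""
    total = 0j
    for uj, zj in zip(us, zs):
        for uk, zk in zip(us, zs):
            total += cf(uj - uk) * zj * zk.conjugate()
    return total


def min_quadratic_form(cf, trials=200, size=5, seed=0):
    """Smallest real part of the quadratic form over random points and weights."""
    rng = random.Random(seed)
    smallest = float("inf")
    for _ in range(trials):
        us = [rng.uniform(-5.0, 5.0) for _ in range(size)]
        zs = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(size)]
        smallest = min(smallest, quadratic_form(cf, us, zs).real)
    return smallest


gaussian_min = min_quadratic_form(gaussian_cf)
# For the indicator, u = (0, 1, 2) and z = (1, -sqrt(2), 1) give a negative value.
indicator_value = quadratic_form(indicator_cf, [0.0, 1.0, 2.0],
                                 [1.0 + 0j, -math.sqrt(2.0) + 0j, 1.0 + 0j]).real

print("Gaussian: smallest value found =", gaussian_min)
print("Indicator: value at the chosen points =", indicator_value)
```

A random search can only fail to find a violation; it cannot establish Property C. The negative value for the indicator, on the other hand, is a genuine counterexample.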


4.4.2 The Central Limit Theorem

The emblematic theorem of Statistics is the so-called central limit theorem (CLT).

Theorem 4.4.11 Let $\{X_n\}_{n\ge 1}$ be an iid sequence of real random variables such that
\[
E[X_1^2] < \infty . \tag{4.21}
\]
(In particular, $E[|X_1|] < \infty$.) Then, for all $x \in \mathbb{R}$,
\[
\lim_{n\uparrow\infty} P\left( \frac{S_n - nE[X_1]}{\sigma_{X_1}\sqrt{n}} \le x \right) = P(N(0;1) \le x) , \tag{4.22}
\]
where $N(0;1)$ is a Gaussian variable with mean 0 and variance 1.

The proof depends in part on the following theorem, which says in particular that under certain conditions, the moments of a random variable can be extracted from its characteristic function:

Theorem 4.4.12 Let $X$ be a real random variable with characteristic function $\psi$, and suppose that $E[|X|^n] < \infty$ for some integer $n \ge 1$. Then for all integers $r \le n$, the $r$-th derivative $\psi^{(r)}$ of $\psi$ exists and is given by
\[
\psi^{(r)}(u) = i^r E\left[ X^r e^{iuX} \right] , \tag{4.23}
\]
and in particular $E[X^r] = \frac{\psi^{(r)}(0)}{i^r}$. Moreover,
\[
\psi(u) = \sum_{r=0}^n \frac{(iu)^r}{r!} E[X^r] + \frac{(iu)^n}{n!} \varepsilon_n(u) , \tag{4.24}
\]
where $\lim_{u\to 0} \varepsilon_n(u) = 0$ and $|\varepsilon_n(u)| \le 3E[|X|^n]$.

Proof. First we observe that for any non-negative real number $a$, and all integers $r \le n$, $a^r \le 1 + a^n$ (indeed, if $a \le 1$, then $a^r \le 1$, and if $a \ge 1$, then $a^r \le a^n$). In particular,
\[
E[|X|^r] \le E[1 + |X|^n] = 1 + E[|X|^n] < \infty .
\]
Suppose that for some $r < n$,
\[
\psi^{(r)}(u) = i^r E\left[ X^r e^{iuX} \right] .
\]
In
\[
\frac{\psi^{(r)}(u+h) - \psi^{(r)}(u)}{h} = i^r E\left[ X^r\, \frac{e^{i(u+h)X} - e^{iuX}}{h} \right] = i^r E\left[ X^r e^{iuX}\, \frac{e^{ihX} - 1}{h} \right] ,
\]
the quantity under the expectation sign tends to $i X^{r+1} e^{iuX}$ as $h \to 0$, and moreover, it is bounded in absolute value by an integrable function since
\[
\left| X^r e^{iuX}\, \frac{e^{ihX} - 1}{h} \right| \le \left| X^r\, \frac{e^{ihX} - 1}{h} \right| \le |X|^{r+1} .
\]


(For the last inequality, use the fact that $|e^{ia} - 1|^2 = 2(1 - \cos a) \le a^2$.) Therefore, by dominated convergence,
\[
\psi^{(r+1)}(u) = \lim_{h\to 0} \frac{\psi^{(r)}(u+h) - \psi^{(r)}(u)}{h} = i^r E\left[ \lim_{h\to 0} X^r e^{iuX}\, \frac{e^{ihX} - 1}{h} \right] = i^{r+1} E\left[ X^{r+1} e^{iuX} \right] .
\]
Equality (4.23) follows since the induction hypothesis is trivially true for $r = 0$.

We now prove (4.24). By Taylor’s formula, for $y \in \mathbb{R}$,
\[
e^{iy} = \cos y + i \sin y = \sum_{k=0}^{n-1} \frac{(iy)^k}{k!} + \frac{(iy)^n}{n!} \left( \cos(\theta_1 y) + i \sin(\theta_2 y) \right)
\]
for some $\theta_1, \theta_2 \in [-1, +1]$. Therefore
\[
e^{iuX} = \sum_{k=0}^{n-1} \frac{(iuX)^k}{k!} + \frac{(iuX)^n}{n!} \left( \cos(\theta_1 uX) + i \sin(\theta_2 uX) \right) ,
\]
where $\theta_1 = \theta_1(\omega)$, $\theta_2 = \theta_2(\omega) \in [-1, +1]$, and
\[
E\left[ e^{iuX} \right] = \sum_{k=0}^{n-1} \frac{(iu)^k}{k!} E[X^k] + \frac{(iu)^n}{n!} \left( E[X^n] + \varepsilon_n(u) \right) ,
\]
where
\[
\varepsilon_n(u) = E\left[ X^n \left( \cos \theta_1 uX + i \sin \theta_2 uX - 1 \right) \right] .
\]
Clearly $|\varepsilon_n(u)| \le 3E[|X|^n]$. Also, since the random variable $X^n(\cos \theta_1 uX + i \sin \theta_2 uX - 1)$ is bounded in absolute value by the integrable random variable $3|X|^n$ and tends to 0 as $u \to 0$, we have by dominated convergence $\lim_{u\to 0} \varepsilon_n(u) = 0$. □

We now proceed to the proof of Theorem 4.4.11.

Proof. Assume without loss of generality that $E[X_1] = 0$. Then call $\sigma^2$ the variance of $X_1$. By the characteristic function criterion for convergence in distribution, it suffices to show that
\[
\lim_{n\uparrow\infty} \varphi_n(u) = e^{-\sigma^2 u^2/2} ,
\]
where
\[
\varphi_n(u) = E\left[ \exp\left\{ iu\, \frac{\sum_{j=1}^n X_j}{\sqrt{n}} \right\} \right] = \prod_{j=1}^n E\left[ \exp\left\{ i \frac{u}{\sqrt{n}} X_j \right\} \right] = \psi\left( \frac{u}{\sqrt{n}} \right)^n ,
\]
where $\psi$ is the characteristic function of $X_1$. From the Taylor expansion of $\psi$ about zero,
\[
\psi(u) = 1 + \frac{\psi''(0)}{2!} u^2 + o(u^2) ,
\]


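we have, for fixed $u \in \mathbb{R}$,
\[
\psi\left( \frac{u}{\sqrt{n}} \right) = 1 - \frac{1}{n} \frac{\sigma^2 u^2}{2} + o\left( \frac{1}{n} \right) ,
\]
and therefore
\[
\lim_{n\uparrow\infty} \ln\{\varphi_n(u)\} = \lim_{n\uparrow\infty} n \ln\left\{ 1 - \frac{\sigma^2 u^2}{2n} + o\left( \frac{1}{n} \right) \right\} = -\frac{1}{2} \sigma^2 u^2 .
\]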

The result follows by Corollary 4.4.6. □

Remark 4.4.13 The random variable
\[
\frac{S_n - nE[X_1]}{\sigma \sqrt{n}}
\]
is obtained by centering the sum $S_n$ (subtracting its mean $nE[X_1]$) and then normalizing it (dividing by the square root of its variance to make the resulting variance equal to 1).

Theorem 4.4.14 Let $\{X_n\}_{n\ge 1}$ be a sequence of independent random vectors of dimension $d$, and let $\{a_n\}_{n\ge 1}$ be a sequence of real numbers such that $\lim_{n\uparrow\infty} a_n = \infty$. Suppose that
\[
X_n \overset{a.s.}{\to} m \quad \text{and} \quad \sqrt{a_n}\,(X_n - m) \overset{D}{\to} N(0, \Gamma) .
\]
Let $g : \mathbb{R}^d \to \mathbb{R}^q$ be twice continuously differentiable in a neighborhood $U$ of $m$. Then
\[
g(X_n) \overset{a.s.}{\to} g(m) \quad \text{and} \quad \sqrt{a_n}\,(g(X_n) - g(m)) \overset{D}{\to} N\left( 0, J_g(m)^T \Gamma J_g(m) \right) ,
\]
where $J_g(m)$ is the Jacobian matrix of $g$ evaluated at $m$.

Proof. $U$ can be chosen convex and compact. Let $g_j$ denote the $j$-th coordinate of $g$, and let $D^2 g_j$ denote the second differential matrix of $g_j$. By Taylor’s formula,
\[
g_j(x) - g_j(m) = (x - m)^T (\mathrm{grad}\, g_j(m)) + \frac{1}{2} (x - m)^T D^2 g_j(m^*)\, (x - m)
\]
for some $m^*$ in the closed segment linking $m$ to $x$, denoted $[m, x]$. Therefore, if $X_n \in U$,
\[
\sqrt{a_n}\,(g_j(X_n) - g_j(m)) = \sqrt{a_n}\,(X_n - m)^T (\mathrm{grad}\, g_j(m)) + \frac{1}{2}\, a_n (X_n - m)^T \frac{1}{\sqrt{a_n}} D^2 g_j(m_n^*)\, (X_n - m) ,
\]
where $m_n^* \in [m, X_n]$. Suppose $X_n \in U$. Since $U$ is convex and $m \in U$, also $m_n^* \in U$. Now since $U$ is compact, the continuous function $D^2 g_j$ is bounded in $U$. Therefore, since $a_n \uparrow \infty$, $\frac{1}{\sqrt{a_n}} D^2 g_j(m_n^*)\, 1_U(X_n) \overset{a.s.}{\to} 0$. Since $X_n \overset{a.s.}{\to} m$, we deduce from the above remarks that
\[
\sqrt{a_n}\,(g_j(X_n) - g_j(m)) - \sqrt{a_n}\,(X_n - m)^T (\mathrm{grad}\, g_j(m)) \overset{a.s.}{\to} 0 ,
\]
and therefore
\[
\sqrt{a_n}\,(g(X_n) - g(m)) - J_g(m) \sqrt{a_n}\,(X_n - m) \overset{a.s.}{\to} 0 .
\]
But $\sqrt{a_n}\,(X_n - m) \overset{D}{\to} N(0, \Gamma)$ and therefore
\[
\sqrt{a_n}\,(g(X_n) - g(m)) \overset{D}{\to} J_g(m) N(0, \Gamma) = N\left( 0, J_g(m)^T \Gamma J_g(m) \right) .
\]
□
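A one-dimensional simulation gives a feel for this result. The sketch below rests on assumptions that are not in the text: the variables are sample means of n iid exponential variables of mean 1 (so m = 1, Γ = 1 and a_n = n), and g(x) = x², so that the limiting variance should be g′(1)² · 1 = 4. The function name `delta_method_sample` is hypothetical.

```python
import random
import statistics


def delta_method_sample(n, replications, seed=0):
    """Draw replications of sqrt(n) * (g(mean) - g(1)) with g(x) = x**2."""
    rng = random.Random(seed)
    values = []
    for _ in range(replications):
        # The sum of n independent Exp(1) variables has the Gamma(n, 1) law,
        # so the sample mean is drawn directly instead of summing n draws.
        sample_mean = rng.gammavariate(n, 1.0) / n
        values.append((n ** 0.5) * (sample_mean ** 2 - 1.0))
    return values


values = delta_method_sample(n=2000, replications=20000)
empirical_mean = statistics.fmean(values)
empirical_variance = statistics.pvariance(values)

print("empirical mean:", empirical_mean)
print("empirical variance:", empirical_variance, "(limit value: 4)")
```

The empirical variance is only expected to be close to 4, up to Monte Carlo error and a finite-n correction.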

Statistical Applications

A basic methodology of Statistics is based on the notion of confidence interval and on the central limit theorem. The limit (4.22) of Theorem 4.4.11 implies that for $x \ge 0$,
\[
\lim_{n\uparrow\infty} P\left( E[X_1] - \frac{\sigma}{\sqrt{n}} x \le \frac{S_n}{n} \le E[X_1] + \frac{\sigma}{\sqrt{n}} x \right) = P(|N(0;1)| \le x) .
\]
Under the condition $E[|X_1|^3] < \infty$, this limit is uniform in $x \in \mathbb{R}$ (1) and therefore, with $\frac{\sigma}{\sqrt{n}} x = a$,
\[
\lim_{n\uparrow\infty} \left\{ P\left( E[X_1] - a \le \frac{S_n}{n} \le E[X_1] + a \right) - P\left( |N(0;1)| \le \frac{a\sqrt{n}}{\sigma} \right) \right\} = 0 .
\]
That is, for large $n$,
\[
P\left( E[X_1] - a \le \frac{S_n}{n} \le E[X_1] + a \right) \simeq P\left( |N(0;1)| \le \frac{a\sqrt{n}}{\sigma} \right) .
\]
In other words, for large $n$, the slln estimate of $E[X_1]$, that is $\frac{S_n}{n}$, lies within distance $a$ of $E[X_1]$ with probability $P\left( |N(0;1)| \le \frac{a\sqrt{n}}{\sigma} \right)$.

In statistical practice, this result is used in two manners.

(1) One wishes to know the number $n$ of experiments that guarantees that with probability, say, 0.99, the estimation error is less than $a$. Choose $n$ such that
\[
P\left( |N(0;1)| \le \frac{a\sqrt{n}}{\sigma} \right) = 0.99 .
\]
Since $P(|N(0;1)| \le 2.58) = 0.99$, we have
\[
2.58 = \frac{a\sqrt{n}}{\sigma} , \tag{4.25}
\]
and therefore
\[
n = \left( \frac{2.58\,\sigma}{a} \right)^2 .
\]

1 This is the content of the Berry–Esseen theorem. The proof is omitted.


(2) The (usually large) number $n$ of experiments is fixed. We want to determine the interval $\left[ \frac{S_n}{n} - a, \frac{S_n}{n} + a \right]$ within which the mean $E[X_1]$ lies with probability at least 0.99. From (4.25):
\[
a = \frac{2.58\,\sigma}{\sqrt{n}} .
\]
If the standard deviation $\sigma$ is unknown, it may be replaced by an slln estimate of it (but then of course. . . ), or the conservative method can be used, which consists of replacing $\sigma$ by an upper bound.

Example 4.4.15: Testing a coin. Consider the problem of estimating the bias $p$ of a coin. Here, $X_n$ takes two values, 1 and 0, with probability $p$ and $1 - p$ respectively, and in particular $E[X_1] = p$, $\mathrm{Var}(X_1) = \sigma^2 = p(1 - p)$. Clearly, since we are trying to estimate $p$, the standard deviation $\sigma$ is unknown. Here the upper bound of $\sigma$ is the maximum of $\sqrt{p(1 - p)}$ for $p \in [0, 1]$, which is attained for $p = \frac{1}{2}$. Thus $\sigma \le \frac{1}{2}$. Suppose the coin was tossed 10,000 times and that the experiment produced the estimate $\frac{S_n}{n} = 0.4925$. Can we “believe 99 percent” that the coin is unbiased? For this we would check that the corresponding confidence interval contains the value $\frac{1}{2}$. Using the conservative method (not a big problem since obviously the actual bias is not far from $\frac{1}{2}$), we have
\[
a = \frac{2.58\,\sigma}{\sqrt{n}} = 0.0129 ,
\]
and indeed $\frac{1}{2} \in [0.4925 - 0.0129,\ 0.4925 + 0.0129]$, so that we are at least 99 percent confident that the coin is unbiased.
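The arithmetic of the coin example can be reproduced in a few lines. This is a minimal sketch under the assumptions of the example (conservative bound σ = 1/2, quantile 2.58 for the 0.99 level); the function names `half_width` and `required_sample_size` are hypothetical.

```python
import math

Z_99 = 2.58  # P(|N(0;1)| <= 2.58) is approximately 0.99


def half_width(sigma, n, z=Z_99):
    """Half-width a of the confidence interval for fixed n: a = z * sigma / sqrt(n)."""
    return z * sigma / math.sqrt(n)


def required_sample_size(sigma, a, z=Z_99):
    """Smallest integer n with z * sigma / sqrt(n) <= a, up to rounding of the floats."""
    return math.ceil((z * sigma / a) ** 2)


# Coin example: n = 10,000 tosses, observed frequency 0.4925, sigma bounded by 1/2.
a = half_width(sigma=0.5, n=10_000)
estimate = 0.4925
interval = (estimate - a, estimate + a)
contains_half = interval[0] <= 0.5 <= interval[1]

print("half-width:", round(a, 4))
print("interval:", tuple(round(v, 4) for v in interval))
print("contains 1/2:", contains_half)
print("n needed for a = 0.01:", required_sample_size(sigma=0.5, a=0.01))
```

The same two functions cover both uses described above: solving for n when the precision a is prescribed, and solving for a when n is fixed.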

4.4.3 Convergence in Variation

The Variation Distance

Convergence in variation is convergence with respect to the variation distance, a notion that is now first introduced in the discrete case.2

Definition 4.4.16 Let $E$ be a countable space. The distance in variation between two probability distributions $\alpha$ and $\beta$ on $E$ is the quantity
\[
d_V(\alpha, \beta) := \frac{1}{2} \sum_{i \in E} |\alpha(i) - \beta(i)| . \tag{4.26}
\]
That $d_V$ is indeed a metric is clear.

Lemma 4.4.17 Let $\alpha$ and $\beta$ be two probability distributions on the same countable space $E$. Then
\[
d_V(\alpha, \beta) = \sup_{A \subseteq E} \{|\alpha(A) - \beta(A)|\} = \sup_{A \subseteq E} \{\alpha(A) - \beta(A)\} .
\]

2 Only the discrete case will be used in this book, in Chapter 6 on discrete-time Markov chains.


Proof. For the second equality observe that for each subset $A$ there is a subset $B$ such that $|\alpha(A) - \beta(A)| = \alpha(B) - \beta(B)$ (take $B = A$ or $\bar{A}$). For the first equality, write
\[
\alpha(A) - \beta(A) = \sum_{i \in E} 1_A(i) \{\alpha(i) - \beta(i)\}
\]
and observe that the right-hand side is maximal for
\[
A = \{i \in E \,;\, \alpha(i) > \beta(i)\} .
\]
Therefore, with $g(i) = \alpha(i) - \beta(i)$,
\[
\sup_{A \subseteq E} \{\alpha(A) - \beta(A)\} = \sum_{i \in E} g^+(i) = \frac{1}{2} \sum_{i \in E} |g(i)|
\]
since $\sum_{i \in E} g(i) = 0$. □

The distance in variation between two random variables $X$ and $Y$ with values in $E$ is the distance in variation between their probability distributions, and it is denoted (with a slight abuse of notation) by $d_V(X, Y)$. Therefore
\[
d_V(X, Y) := \frac{1}{2} \sum_{i \in E} |P(X = i) - P(Y = i)| .
\]
The distance in variation between a random variable $X$ with values in $E$ and a probability distribution $\alpha$ on $E$, denoted (again with a slight abuse of notation) by $d_V(X, \alpha)$, is defined by
\[
d_V(X, \alpha) := \frac{1}{2} \sum_{i \in E} |P(X = i) - \alpha(i)| .
\]
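On a finite space, the definition and the supremum characterization of Lemma 4.4.17 can be compared directly. The sketch below is an illustration on a small example of our own choosing; distributions are represented as dictionaries, and the function names are hypothetical.

```python
from itertools import combinations


def variation_distance(alpha, beta):
    """d_V(alpha, beta) = (1/2) * sum over i of |alpha(i) - beta(i)|."""
    support = set(alpha) | set(beta)
    return 0.5 * sum(abs(alpha.get(i, 0.0) - beta.get(i, 0.0)) for i in support)


def sup_over_subsets(alpha, beta):
    """Brute-force supremum of alpha(A) - beta(A) over all subsets A."""
    support = sorted(set(alpha) | set(beta))
    best = 0.0  # the empty set contributes 0
    for size in range(1, len(support) + 1):
        for subset in combinations(support, size):
            diff = sum(alpha.get(i, 0.0) - beta.get(i, 0.0) for i in subset)
            best = max(best, diff)
    return best


alpha = {"a": 0.5, "b": 0.3, "c": 0.2}
beta = {"a": 0.2, "b": 0.3, "c": 0.4, "d": 0.1}

d_v = variation_distance(alpha, beta)
d_sup = sup_over_subsets(alpha, beta)
print("half L1 distance:", d_v)
print("sup over subsets:", d_sup)
```

The brute-force search enumerates all 2^|E| subsets and is only meant for very small spaces; the lemma says the maximizing subset is {i ; α(i) > β(i)}.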

The Coupling Inequality

Coupling of two discrete probability distributions $\pi'$ on $E'$ and $\pi''$ on $E''$ consists, by definition, of the construction of a probability distribution $\pi$ on $E := E' \times E''$ such that the marginal distributions of $\pi$ on $E'$ and $E''$, respectively, are $\pi'$ and $\pi''$, that is,
\[
\sum_{j \in E''} \pi(i, j) = \pi'(i) \quad \text{and} \quad \sum_{i \in E'} \pi(i, j) = \pi''(j) .
\]
For two probability distributions $\alpha$ and $\beta$ on the countable set $E$, let $D(\alpha, \beta)$ be the collection of random vectors $(X, Y)$ taking their values in $E \times E$ and with given marginal distributions $\alpha$ and $\beta$, that is,
\[
P(X = i) = \alpha(i), \quad P(Y = i) = \beta(i) . \tag{4.27}
\]

Theorem 4.4.18 For any pair $(X, Y) \in D(\alpha, \beta)$, we have the fundamental coupling inequality
\[
d_V(\alpha, \beta) \le P(X \ne Y) ,
\]
and equality is attained by some pair $(X, Y) \in D(\alpha, \beta)$, which is then said to realize maximal coincidence.
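A pair realizing maximal coincidence can be written down explicitly on a finite space. The following sketch is one standard construction, given here as an illustration: it puts mass min(α(i), β(i)) on the diagonal and spreads the remaining mass off the diagonal in product form. The function name `maximal_coupling` and the example distributions are assumptions of this sketch.

```python
def maximal_coupling(alpha, beta):
    """Return (joint pmf, d_V) for a coupling with P(X != Y) = d_V(alpha, beta)."""
    support = sorted(set(alpha) | set(beta))
    a = {i: alpha.get(i, 0.0) for i in support}
    b = {i: beta.get(i, 0.0) for i in support}
    d_v = 0.5 * sum(abs(a[i] - b[i]) for i in support)
    joint = {}
    for i in support:
        for j in support:
            if i == j:
                # Diagonal: the two coordinates coincide as often as possible.
                mass = min(a[i], b[i])
            elif d_v > 0:
                # Off-diagonal: excess of alpha at i times excess of beta at j.
                mass = max(a[i] - b[i], 0.0) * max(b[j] - a[j], 0.0) / d_v
            else:
                mass = 0.0
            joint[(i, j)] = mass
    return joint, d_v


alpha = {"a": 0.5, "b": 0.3, "c": 0.2}
beta = {"a": 0.2, "b": 0.3, "c": 0.4, "d": 0.1}
joint, d_v = maximal_coupling(alpha, beta)

states = sorted(set(alpha) | set(beta))
row_sums = {i: sum(joint[(i, j)] for j in states) for i in states}
col_sums = {j: sum(joint[(i, j)] for i in states) for j in states}
prob_differ = sum(m for (i, j), m in joint.items() if i != j)

print("d_V:", d_v)
print("P(X != Y):", prob_differ)
```

The row and column sums recover α and β, and the off-diagonal mass equals the variation distance, so the coupling inequality holds with equality for this pair.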


Proof. For $A \subset E$,
\[
P(X \ne Y) \ge P(X \in A, Y \in \bar{A}) = P(X \in A) - P(X \in A, Y \in A) \ge P(X \in A) - P(Y \in A) ,
\]
and therefore
\[
P(X \ne Y) \ge \sup_{A \subset E} \{P(X \in A) - P(Y \in A)\} = d_V(\alpha, \beta) .
\]
We now construct $(X, Y) \in D(\alpha, \beta)$ realizing equality. Let $U$, $Z$, $V$, and $W$ be independent random variables; $U$ takes its values in $\{0, 1\}$, and $Z$, $V$, $W$ take their values in $E$. The distributions of these random variables are given by
\[
P(U = 1) = 1 - d_V(\alpha, \beta), \quad P(Z = i) = \frac{\alpha(i) \wedge \beta(i)}{1 - d_V(\alpha, \beta)}, \quad P(V = i) = \frac{(\alpha(i) - \beta(i))^+}{d_V(\alpha, \beta)}, \quad P(W = i) = \frac{(\beta(i) - \alpha(i))^+}{d_V(\alpha, \beta)} .
\]
Observe that $P(V = W) = 0$. Defining
\[
(X, Y) = \begin{cases} (Z, Z) & \text{if } U = 1 \\ (V, W) & \text{if } U = 0 \end{cases}
\]
we have
\[
P(X = i) = P(U = 1, Z = i) + P(U = 0, V = i) = P(U = 1) P(Z = i) + P(U = 0) P(V = i) = \alpha(i) \wedge \beta(i) + (\alpha(i) - \beta(i))^+ = \alpha(i) ,
\]
and similarly, $P(Y = i) = \beta(i)$. Therefore, $(X, Y) \in D(\alpha, \beta)$. Also, $P(X = Y) = P(U = 1) = 1 - d_V(\alpha, \beta)$. □

Example 4.4.19: Poisson’s law of rare events, take 2. Let $Y_1, \ldots, Y_n$ be independent random variables taking their values in $\{0, 1\}$, with $P(Y_i = 1) = \pi_i$, $1 \le i \le n$. Let $X := \sum_{i=1}^n Y_i$ and $\lambda := \sum_{i=1}^n \pi_i$. Let $p_\lambda$ be the Poisson distribution with mean $\lambda$. We wish to bound the variation distance between the distribution $q$ of $X$ and $p_\lambda$. For this we construct a coupling of the two distributions as follows. First we generate independent pairs $(Y_1, Y_1'), \ldots, (Y_n, Y_n')$ such that
\[
P(Y_i = j, Y_i' = k) = \begin{cases} 1 - \pi_i & \text{if } j = 0,\ k = 0, \\ e^{-\pi_i} \frac{\pi_i^k}{k!} & \text{if } j = 1,\ k \ge 1, \\ e^{-\pi_i} - (1 - \pi_i) & \text{if } j = 1,\ k = 0 . \end{cases}
\]
One verifies that for all $1 \le i \le n$, $P(Y_i = 1) = \pi_i$ and $Y_i' \sim \mathrm{Poi}(\pi_i)$. In particular, $X' := \sum_{i=1}^n Y_i'$ is a Poisson variable with mean $\lambda$. Now
\[
P(X \ne X') = P\left( \sum_{i=1}^n Y_i \ne \sum_{i=1}^n Y_i' \right) \le P(Y_i \ne Y_i' \text{ for some } i) \le \sum_{i=1}^n P(Y_i \ne Y_i') .
\]


But
\[
P(Y_i \ne Y_i') = e^{-\pi_i} - (1 - \pi_i) + P(Y_i' > 1) = \pi_i \left( 1 - e^{-\pi_i} \right) \le \pi_i^2 .
\]
Therefore $P(X \ne X') \le \sum_{i=1}^n \pi_i^2$, and by the coupling inequality
\[
d_V(q, p_\lambda) \le \sum_{i=1}^n \pi_i^2 .
\]
For instance, with $\pi_i = p := \frac{\lambda}{n}$, we have
\[
d_V(q, p_\lambda) \le \frac{\lambda^2}{n} .
\]
In other terms, the binomial distribution of size $n$ and mean $\lambda$ differs in variation by less than $\frac{\lambda^2}{n}$ from the Poisson distribution with the same mean. This is obviously a refinement of the Poisson approximation theorem since it gives exploitable estimates for finite $n$.
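The bound can be compared with the exact variation distance for moderate n. The sketch below is only an illustration; the function names are hypothetical, and the Poisson mass above n is obtained by complement rather than summed term by term.

```python
import math


def binomial_pmf(n, p, k):
    """P(Bin(n, p) = k)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def poisson_pmf(lam, k):
    """P(Poi(lam) = k)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def binomial_poisson_distance(n, lam):
    """Variation distance between Bin(n, lam / n) and Poi(lam)."""
    p = lam / n
    # Points k <= n are charged by both distributions.
    head = sum(abs(binomial_pmf(n, p, k) - poisson_pmf(lam, k)) for k in range(n + 1))
    # Points k > n are charged by the Poisson distribution only.
    tail = 1.0 - sum(poisson_pmf(lam, k) for k in range(n + 1))
    return 0.5 * (head + max(tail, 0.0))


lam = 2.0
results = []
for n in (10, 50, 100):
    distance = binomial_poisson_distance(n, lam)
    bound = lam ** 2 / n
    results.append((n, distance, bound))
    print(f"n = {n:3d}  d_V = {distance:.5f}  bound lambda^2/n = {bound:.5f}")
```

In such runs the exact distance is typically well below λ²/n, which is consistent with the coupling bound being an upper estimate rather than a sharp one.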

A More General Definition

The extension of the notions of the previous subsection to probability distributions on more general spaces is conceptually straightforward and necessitates only obvious adaptations.

Definition 4.4.20 Let $P_1$ and $P_2$ be two probability measures on the same measurable space $(X, \mathcal{X})$. The quantity
\[
d_V(P_1, P_2) := \sup_{A \in \mathcal{X}} |P_1(A) - P_2(A)|
\]
is called the distance in variation between $P_1$ and $P_2$.

Let $Q$ be a probability measure such that $P_i \ll Q$ ($i = 1, 2$), for instance $Q = \frac{P_1 + P_2}{2}$. Therefore there exist (Radon–Nikodým theorem) two non-negative measurable real functions $f_i$ ($i = 1, 2$) such that
\[
P_i(A) = \int_A f_i \, dQ \quad (A \in \mathcal{X}) .
\]

Theorem 4.4.21 We have that
\[
d_V(P_1, P_2) = \frac{1}{2} \int_X |f_1 - f_2| \, dQ . \tag{4.28}
\]

Proof. We first observe that
\[
\sup_{A \in \mathcal{X}} |P_1(A) - P_2(A)| = \sup_{A \in \mathcal{X}} (P_1(A) - P_2(A))
\]
since for any $A \in \mathcal{X}$ there exists a $B \in \mathcal{X}$ such that $P_1(A) - P_2(A) = -(P_1(B) - P_2(B))$ (take $B = \bar{A}$). Therefore


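\[
d_V(P_1, P_2) = \sup_{A \in \mathcal{X}} (P_1(A) - P_2(A)) = \sup_{A \in \mathcal{X}} \int_A (f_1 - f_2) \, dQ .
\]
The supremum is attained for $A = \{f_1 - f_2 \ge 0\}$:
\[
d_V(P_1, P_2) = \int_{\{f_1 - f_2 \ge 0\}} (f_1 - f_2) \, dQ .
\]
Since $\int_X (f_1 - f_2) \, dQ = 0$, we have that
\[
\int_{\{f_1 - f_2 \ge 0\}} (f_1 - f_2) \, dQ = \frac{1}{2} \int_X |f_1 - f_2| \, dQ .
\]
□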

It follows from the expression (4.28) that dV is indeed a metric.

A Bayesian Interpretation

Let $X \in \mathbb{R}^d$ be a random vector called the observation and $H \in \{1, 2\}$ be a random variable called the hypothesis. The joint law of $(X, H)$ is described as follows:
\[
P(X \in C \mid H = i) = \int_C f_i(x) \, dx \quad (i \in \{1, 2\},\ C \in \mathcal{B}(\mathbb{R}^d))
\]
and
\[
P(H = i) = \frac{1}{2} \quad (i \in \{1, 2\}) .
\]
We seek to devise a test based on the observation of $X$ alone that will help us to decide which is the value of $H$. In other words, we must select a measurable partition $\{A_1, A_2\}$ of $\mathbb{R}^d$ and decide $H = i$ if $X \in A_i$ ($i = 1, 2$). This partition is called a test. A probability of error (wrong guess) is associated with this test:
\[
P_E = P(H = 1) P(X \in A_2 \mid H = 1) + P(H = 2) P(X \in A_1 \mid H = 2) ,
\]
that is,
\[
P_E = \frac{1}{2} \int_{A_2} f_1(x) \, dx + \frac{1}{2} \int_{A_1} f_2(x) \, dx = \frac{1}{2} \int_{A_2} f_1(x) \, dx + \frac{1}{2} \left( 1 - \int_{A_2} f_2(x) \, dx \right) = \frac{1}{2} + \frac{1}{2} \int_{A_2} (f_1(x) - f_2(x)) \, dx .
\]
We seek to minimize this quantity (with respect to $A_2$) or, equivalently, to maximize the quantity
\[
\int_{A_2} (f_2(x) - f_1(x)) \, dx = P_2(A_2) - P_1(A_2) ,
\]
where $P_i(C) := \int_C f_i(x) \, dx$ ($i = 1, 2$). This is done by the choice $A_2 := \{x \,;\, f_2(x) \ge f_1(x)\}$ and the resulting (minimal) probability of error is then
\[
P_E^* = \frac{1}{2} (1 - d_V(P_1, P_2)) ,
\]
where $P_i(\cdot) := P(\cdot \mid H = i)$ ($i = 1, 2$).
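The relation between the minimal error probability and the variation distance can be checked on a concrete pair of densities. The sketch below assumes, purely for illustration, that the two hypotheses are the Gaussian densities N(0, 1) and N(1, 1) on the real line; the variation distance is approximated by a midpoint rule and compared with the error of the threshold test "decide H = 2 when x ≥ 1/2". All names are hypothetical.

```python
import math


def normal_pdf(x, mean):
    """Density of N(mean, 1)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def variation_distance_numeric(mean1, mean2, lo=-12.0, hi=12.0, steps=48_000):
    """Midpoint-rule approximation of (1/2) * integral of |f1 - f2|."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        total += abs(normal_pdf(x, mean1) - normal_pdf(x, mean2))
    return 0.5 * total * h


d_v = variation_distance_numeric(0.0, 1.0)
min_error = 0.5 * (1.0 - d_v)

# Direct computation for the test A2 = {x >= 1/2}, where f2 >= f1:
# error under H = 1 is P(N(0,1) >= 1/2); error under H = 2 is P(N(1,1) < 1/2).
direct_error = 0.5 * (1.0 - normal_cdf(0.5)) + 0.5 * normal_cdf(0.5 - 1.0)

print("d_V (numeric):", d_v)
print("(1 - d_V) / 2:", min_error)
print("error of the threshold test:", direct_error)
```

The two error values should agree up to the quadrature error, since for these densities the set {f2 ≥ f1} is exactly the half-line [1/2, ∞).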


Convergence in Variation

Definition 4.4.22 The sequence $\{P_n\}_{n\ge 1}$ of probability measures on $(X, \mathcal{X})$ is said to converge in variation to the probability $P$ on $(X, \mathcal{X})$ if
\[
\lim_{n\uparrow\infty} d_V(P_n, P) = 0 .
\]
This is denoted $P_n \overset{Var.}{\to} P$.

Let $Q$ be a probability measure such that $P_n \ll Q$ ($n \ge 1$); for instance,
\[
Q = \sum_{n\ge 1} \frac{1}{2^n} P_n
\]
defines such a probability measure. Denote by $f_n$ (resp., $f$) the Radon–Nikodým derivative of $P_n$ (resp., $P$) with respect to $Q$. By Theorem 4.4.21, $P_n \overset{Var.}{\to} P$ if and only if $f_n \overset{L^1}{\to} f$, where $L^1 = L^1_{\mathbb{C}}(Q)$. Note also that if $\varphi : X \to \mathbb{C}$ is a bounded function, then
\[
\int_X \varphi \, dP_n \to \int_X \varphi \, dP ,
\]
as follows from the fact that
\[
\int_X \varphi \, dP_n - \int_X \varphi \, dP = \int_X \varphi \times (f_n - f) \, dQ
\]
and dominated convergence.

Theorem 4.4.23 Let $P_n$, $Q$ and $f_n$ be defined as above. Suppose that there exists a non-negative measurable function $f$ from $(X, \mathcal{X})$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $Q$-a.e., $f_n \to f$. Then $P_n \overset{Var.}{\to} P$ where $P$ is the probability defined by $P(A) = \int_A f \, dQ$, $A \in \mathcal{X}$.

The proof is a direct consequence of Scheffé’s lemma:

Lemma 4.4.24 Let $f$ and $f_n$ ($n \ge 1$) be $Q$-integrable non-negative real functions from $(X, \mathcal{X})$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, with $\lim_{n\uparrow\infty} f_n = f$ $Q$-a.e. and $\lim_{n\uparrow\infty} \int_X f_n \, dQ = \int_X f \, dQ$. Then $\lim_{n\uparrow\infty} \int_X |f_n - f| \, dQ = 0$.

Proof. The function $\inf(f_n, f)$ is bounded by the (integrable) function $f$ (this is where the non-negativeness assumption is used). Moreover, it converges to $f$. Therefore, by dominated convergence, $\lim_{n\uparrow\infty} \int_X \inf(f_n, f) \, dQ = \int_X f \, dQ$. The rest of the proof follows from the identity $|f_n - f| = f_n + f - 2\inf(f_n, f)$, which gives
\[
\int_X |f_n - f| \, dQ = \int_X f_n \, dQ + \int_X f \, dQ - 2 \int_X \inf(f_n, f) \, dQ .
\]
□

Definition 4.4.25 A. Let $X_1$, $X_2$ be random elements with values in the measurable space $(E, \mathcal{E})$, with respective distributions $\alpha_1$, $\alpha_2$. The distance in variation between $X_1$ and $X_2$ is, by definition, the quantity $d_V(X_1, X_2) := d_V(\alpha_1, \alpha_2)$.


B. Let $X$ and $\{X_n\}_{n\ge 1}$ be random elements with values in the measurable space $(E, \mathcal{E})$, with respective distributions $\alpha$, $\{\alpha_n\}_{n\ge 1}$. The sequence $\{X_n\}_{n\ge 1}$ is said to converge in variation to $X$ if $\lim_{n\uparrow\infty} d_V(X_n, X) = 0$. This is denoted by $X_n \overset{Var.}{\to} X$.

Let $\{X_n\}_{n\ge 1}$ be random elements with values in some measurable space $(E, \mathcal{E})$ and let $\alpha$ be some probability distribution on $(E, \mathcal{E})$. The notation $X_n \overset{Var.}{\to} \alpha$ means, by convention, that $\alpha_n \overset{Var.}{\to} \alpha$, where $\alpha_n$ is the distribution of $X_n$. (This convention is similar to the one introduced above in the context of convergence in distribution.)

4.4.4 Proof of Paul Lévy’s Criterion

Radon Linear Forms

Denote by $C_0(\mathbb{R}^d)$ the set of continuous functions $\varphi : \mathbb{R}^d \to \mathbb{R}$ that vanish at infinity ($\lim_{|x|\uparrow\infty} \varphi(x) = 0$), endowed with the norm of uniform convergence $\|\varphi\| := \sup_{x \in \mathbb{R}^d} |\varphi(x)|$. Let $C_c(\mathbb{R}^d)$ be the set of continuous functions from $\mathbb{R}^d$ to $\mathbb{R}$ with compact support. In particular, $C_c(\mathbb{R}^d) \subset C_0(\mathbb{R}^d)$.

Definition 4.4.26 A linear form $L : C_c(\mathbb{R}^d) \to \mathbb{R}$ such that $L(f) \ge 0$ whenever $f \ge 0$ is called a positive Radon linear form.

We quote without proof the following fundamental result of Riesz:3

Theorem 4.4.27 Let $L : C_c(\mathbb{R}^d) \to \mathbb{R}$ be a positive Radon linear form. There exists a unique locally finite measure $\mu$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ such that for all $f \in C_c(\mathbb{R}^d)$,
\[
L(f) = \int_{\mathbb{R}^d} f \, d\mu .
\]
We shall need a slight extension of Riesz’s theorem (Theorem 4.4.27), Part (ii) of the following:

Theorem 4.4.28
(i) Let $\mu \in M^+(\mathbb{R}^d)$. The linear form $L : C_0(\mathbb{R}^d) \to \mathbb{R}$ defined by $L(f) := \int_{\mathbb{R}^d} f \, d\mu$ is positive ($L(f) \ge 0$ whenever $f \ge 0$) and continuous, and its norm is $\mu(\mathbb{R}^d)$.
(ii) Let $L : C_0(\mathbb{R}^d) \to \mathbb{R}$ be a positive continuous linear form. There exists a unique measure $\mu \in M^+(\mathbb{R}^d)$ such that for all $f \in C_0(\mathbb{R}^d)$,
\[
L(f) = \int_{\mathbb{R}^d} f \, d\mu .
\]

Proof. Part (i) is left as an exercise. We turn to the proof of (ii). The restriction of $L$ to $C_c$ is a positive Radon linear form, and therefore, according to Riesz’s Theorem 4.4.27, there exists a locally finite measure $\mu$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ such that for all $f \in C_c(\mathbb{R}^d)$, $L(f) = \int_{\mathbb{R}^d} f \, d\mu$.

The measure $\mu$ is a finite (not just locally finite) measure. If not, there would exist a sequence $\{K_m\}_{m\ge 1}$ of compact subsets of $\mathbb{R}^d$ such that $\mu(K_m) \ge 3^m$ for all $m \ge 1$. Let

3 See for instance [Rudin, 1986], Theorem 2.14.


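then $\{\varphi_m\}_{m\ge 1}$ be a sequence of non-negative functions in $C_c(\mathbb{R}^d)$ with values in $[0, 1]$ and such that for all $m \ge 1$, $\varphi_m(x) = 1$ for all $x \in K_m$. In particular, the function $\varphi := \sum_{m\ge 1} 2^{-m} \varphi_m$ is in $C_0(\mathbb{R}^d)$ and
\[
L(\varphi) \ge L\left( \sum_{m=1}^k 2^{-m} \varphi_m \right) = \sum_{m=1}^k 2^{-m} L(\varphi_m) = \sum_{m=1}^k 2^{-m} \int_{\mathbb{R}^d} \varphi_m \, d\mu \ge \sum_{m=1}^k 2^{-m} \mu(K_m) \ge \left( \frac{3}{2} \right)^k .
\]
Letting $k \uparrow \infty$ leads to $L(\varphi) = \infty$, a contradiction.

The mapping $L$ is continuous. Suppose it is not. One could then find a sequence $\{\varphi_m\}_{m\ge 1}$ of functions in $C_c(\mathbb{R}^d)$ such that $|\varphi_m| \le 1$ and $L(\varphi_m) \ge 3^m$. The function $\varphi := \sum_{m\ge 1} 2^{-m} \varphi_m$ is in $C_0(\mathbb{R}^d)$ and
\[
L(\varphi) \ge L\left( \sum_{m=1}^k 2^{-m} \varphi_m \right) = \sum_{m=1}^k 2^{-m} L(\varphi_m) = \sum_{m=1}^k 2^{-m} \int_{\mathbb{R}^d} \varphi_m \, d\mu \ge \left( \frac{3}{2} \right)^k .
\]
Letting $k \uparrow \infty$ again leads to $L(\varphi) = \infty$, a contradiction.

It remains to show that $L(f) = \int_{\mathbb{R}^d} f \, d\mu$ for all $f \in C_0(\mathbb{R}^d)$. For this, consider a sequence $\{f_m\}_{m\ge 1}$ of functions in $C_c(\mathbb{R}^d)$ converging uniformly to $f \in C_0(\mathbb{R}^d)$. We have, since $L$ is continuous, $\lim_{m\uparrow\infty} L(f_m) = L(f)$ and, since $\mu$ is finite, $\lim_{m\uparrow\infty} \int f_m \, d\mu = \int f \, d\mu$ by dominated convergence. Therefore, since $L(f_m) = \int f_m \, d\mu$ for all $m \ge 1$, $L(f) = \int f \, d\mu$. □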

Vague Convergence

Definition 4.4.29 The sequence $\{\mu_n\}_{n\ge 1}$ in $M^+(\mathbb{R}^d)$ is said to converge vaguely (resp., weakly) to $\mu$ if, for all $f \in C_0(\mathbb{R}^d)$ (resp., $f \in C_b(\mathbb{R}^d)$), $\lim_{n\uparrow\infty} \int_{\mathbb{R}^d} f \, d\mu_n = \int_{\mathbb{R}^d} f \, d\mu$.

Remark 4.4.30 When applied to probability measures, vague convergence is a weaker notion than weak convergence, because a continuous function vanishing at infinity is a particular case of a bounded continuous function.

Theorem 4.4.31 The sequence $\{\mu_n\}_{n\ge 1}$ in $M^+(\mathbb{R}^d)$ converges vaguely if and only if
(a) $\sup_n \mu_n(\mathbb{R}^d) < \infty$, and
(b) there exists a dense subset $E$ in $C_0(\mathbb{R}^d)$ such that for all $f \in E$, the limit $\lim_{n\uparrow\infty} \int_{\mathbb{R}^d} f \, d\mu_n$ exists.

Proof. Necessity. If the sequence converges vaguely, it obviously satisfies (b). As for (a), it is a consequence of the Banach–Steinhaus theorem.4 Indeed, $\mu_n(\mathbb{R}^d)$ is the norm of $L_n$, where $L_n$ is the continuous linear form $f \to \int_{\mathbb{R}^d} f \, d\mu_n$ from the Banach space $C_0(\mathbb{R}^d)$ (with the sup norm) to $\mathbb{R}$, and for all $f \in C_0(\mathbb{R}^d)$, $\sup_n \left| \int_{\mathbb{R}^d} f \, d\mu_n \right| < \infty$.

Sufficiency. Suppose the sequence satisfies (a) and (b). Let $f \in C_0(\mathbb{R}^d)$. For all $\varphi \in E$,
\[
\left| \int f \, d\mu_m - \int f \, d\mu_n \right| \le \left| \int \varphi \, d\mu_m - \int \varphi \, d\mu_n \right| + \left| \int f \, d\mu_m - \int \varphi \, d\mu_m \right| + \left| \int f \, d\mu_n - \int \varphi \, d\mu_n \right| \le \left| \int \varphi \, d\mu_m - \int \varphi \, d\mu_n \right| + 2 \sup_{x \in \mathbb{R}^d} |f(x) - \varphi(x)| \times \sup_n \mu_n(\mathbb{R}^d) .
\]
Since $\sup_{x \in \mathbb{R}^d} |f(x) - \varphi(x)|$ can be made arbitrarily small by a proper choice of $\varphi$, this shows that the sequence $\{\int f \, d\mu_n\}_{n\ge 1}$ is a Cauchy sequence. It therefore converges to some $L(f)$, and $L$ so defined is a positive linear form on $C_0(\mathbb{R}^d)$. Therefore, there exists a $\mu \in M^+(\mathbb{R}^d)$ such that $L(f) = \int_{\mathbb{R}^d} f \, d\mu$ and $\{\mu_n\}_{n\ge 1}$ converges vaguely to $\mu$. □

4 Let $E$ be a Banach space and $F$ be a normed vector space. Let $\{L_i\}_{i \in I}$ be a family of continuous linear mappings from $E$ to $F$ such that $\sup_{i \in I} \|L_i(x)\| < \infty$ for all $x \in E$. Then $\sup_{i \in I} \|L_i\| < \infty$. See for instance [Rudin, 1986], Thm. 5.8.

Helly’s Theorem

Theorem 4.4.32 From any bounded sequence of $M^+(\mathbb{R}^d)$, one can extract a vaguely convergent subsequence.

Proof. Let $\{\mu_n\}_{n\ge 1}$ be a bounded sequence of $M^+(\mathbb{R}^d)$. Let $\{f_n\}_{n\ge 1}$ be a dense sequence of $C_0(\mathbb{R}^d)$. Since the sequence $\{\int f_1 \, d\mu_n\}_{n\ge 1}$ is bounded, one can extract from it a convergent subsequence $\{\int f_1 \, d\mu_{1,n}\}_{n\ge 1}$. Since the sequence $\{\int f_2 \, d\mu_{1,n}\}_{n\ge 1}$ is bounded, one can extract from it a convergent subsequence $\{\int f_2 \, d\mu_{2,n}\}_{n\ge 1}$. This diagonal selection process is continued. At step $k$, since the sequence $\{\int f_{k+1} \, d\mu_{k,n}\}_{n\ge 1}$ is bounded, one can extract from it a convergent subsequence $\{\int f_{k+1} \, d\mu_{k+1,n}\}_{n\ge 1}$. The sequence $\{\nu_k\}_{k\ge 1}$ where $\nu_k = \mu_{k,k}$ (the “diagonal” sequence) is extracted from the original sequence and for all $f_n$, the sequence $\{\int f_n \, d\nu_k\}_{k\ge 1}$ converges. The conclusion follows from Theorem 4.4.31. □

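Fourier Transforms of Finite Measures

Definition 4.4.33 The Fourier transform of a measure $\mu \in M^+(\mathbb{R}^d)$ is the function $\hat{\mu} : \mathbb{R}^d \to \mathbb{C}$ defined by
\[
\hat{\mu}(\nu) = \int_{\mathbb{R}^d} e^{-2i\pi \langle \nu, x \rangle} \, \mu(dx) ,
\]
where $\langle \nu, x \rangle := \sum_{j=1}^d \nu_j x_j$.

Theorem 4.4.34 The Fourier transform of a measure $\mu \in M^+(\mathbb{R}^d)$ is bounded and uniformly continuous.

Proof. From the definition, we have that
\[
|\hat{\mu}(\nu)| \le \int_{\mathbb{R}^d} |e^{-2i\pi \langle \nu, x \rangle}| \, \mu(dx) = \int_{\mathbb{R}^d} \mu(dx) = \mu(\mathbb{R}^d) ,
\]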


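where the last term does not depend on $\nu$ and is finite. Also, for all $h \in \mathbb{R}^d$,
\[
|\hat{\mu}(\nu + h) - \hat{\mu}(\nu)| \le \int_{\mathbb{R}^d} \left| e^{-2i\pi \langle \nu + h, x \rangle} - e^{-2i\pi \langle \nu, x \rangle} \right| \mu(dx) = \int_{\mathbb{R}^d} \left| e^{-2i\pi \langle h, x \rangle} - 1 \right| \mu(dx) .
\]
The last term is independent of $\nu$ and tends to 0 as $h \to 0$ by dominated convergence (recall that $\mu$ is finite). □

Theorem 4.4.35 Let $\mu \in M^+(\mathbb{R}^d)$ and let $\hat{f}$ be the Fourier transform of $f \in L^1_{\mathbb{C}}(\mathbb{R}^d)$. Then
\[
\int_{\mathbb{R}^d} \hat{f} \, d\mu = \int_{\mathbb{R}^d} f \hat{\mu} \, dx .
\]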

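Proof. This follows from Fubini’s theorem. In fact,
\[
\int_{\mathbb{R}^d} \hat{f} \, d\mu = \int_{\mathbb{R}^d} \left( \int_{\mathbb{R}^d} f(x)\, e^{-2i\pi \langle \nu, x \rangle} \, dx \right) \mu(d\nu) = \int_{\mathbb{R}^d} f(x) \left( \int_{\mathbb{R}^d} e^{-2i\pi \langle \nu, x \rangle} \, \mu(d\nu) \right) dx .
\]
(The interchange of the order of integration is justified by the fact that the function $(x, \nu) \to |f(x)\, e^{-2i\pi \langle \nu, x \rangle}| = |f(x)|$ is integrable with respect to the product measure $dx \times \mu(d\nu)$. Recall that $\mu$ is finite.) □

For the next theorem, recall that $C_b(\mathbb{R}^d)$ denotes the collection of uniformly bounded and continuous functions from $\mathbb{R}^d$ to $\mathbb{R}$.

Theorem 4.4.36 The sequence $\{\mu_n\}_{n\ge 1}$ in $M^+(\mathbb{R}^d)$ converges weakly to $\mu$ if and only if
(i) it converges vaguely to $\mu$, and
(ii) $\lim_{n\uparrow\infty} \mu_n(\mathbb{R}^d) = \mu(\mathbb{R}^d)$.

Proof. Necessity. The necessity of (i) immediately follows from the observation that $C_0(\mathbb{R}^d) \subset C_b(\mathbb{R}^d)$. The necessity of (ii) follows from the fact that the function that is the constant 1 is in $C_b(\mathbb{R}^d)$ and therefore $\int 1 \, d\mu_n = \mu_n(\mathbb{R}^d)$ tends to $\int 1 \, d\mu = \mu(\mathbb{R}^d)$ as $n \uparrow \infty$.

Sufficiency. Suppose that (i) and (ii) are satisfied. To prove weak convergence, it suffices to prove that $\lim_{n\uparrow\infty} \int_{\mathbb{R}^d} f \, d\mu_n = \int_{\mathbb{R}^d} f \, d\mu$ for any non-negative function $f \in C_b(\mathbb{R}^d)$. Since the measure $\mu$ is of finite total mass, for any $\varepsilon > 0$ one can find a compact set $K_\varepsilon = K$ such that $\mu(\overline{K}) \le \varepsilon$, where $\overline{K}$ denotes the complement of $K$. Choose a continuous function with compact support $\varphi$ with values in $[0, 1]$ and such that $\varphi \ge 1_K$. Since $|f - f\varphi| \le \|f\| (1 - \varphi)$ (where $\|f\| = \sup_{x \in \mathbb{R}^d} |f(x)|$),
\[
\limsup_{n\uparrow\infty} \left| \int f \, d\mu_n - \int f\varphi \, d\mu_n \right| \le \limsup_{n\uparrow\infty} \|f\| \int (1 - \varphi) \, d\mu_n = \|f\| \left( \lim_{n\uparrow\infty} \int d\mu_n - \lim_{n\uparrow\infty} \int \varphi \, d\mu_n \right) = \|f\| \int (1 - \varphi) \, d\mu \le \varepsilon \|f\| .
\]
Similarly, $\left| \int f \, d\mu - \int f\varphi \, d\mu \right| \le \varepsilon \|f\|$. Moreover, since $f\varphi \in C_c(\mathbb{R}^d) \subset C_0(\mathbb{R}^d)$, vague convergence gives $\lim_{n\uparrow\infty} \int f\varphi \, d\mu_n = \int f\varphi \, d\mu$. Therefore, for all $\varepsilon > 0$,
\[
\limsup_{n\uparrow\infty} \left| \int f \, d\mu_n - \int f \, d\mu \right| \le 2\varepsilon \|f\| ,
\]
and this completes the proof. □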
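Condition (ii) cannot be dropped, as the classical example of mass escaping to infinity shows. The sketch below is an illustration of our own: it takes the Dirac measures at the integers n on the real line, which converge vaguely to the null measure, while their total mass stays equal to 1. Function names are hypothetical.

```python
import math


def integrate_against_dirac(f, point):
    """Integral of f with respect to the Dirac measure at `point`."""
    return f(point)


def vanishing_at_infinity(x):
    """A function of C_0(R): continuous and tending to 0 at infinity."""
    return math.exp(-x * x)


def constant_one(x):
    """A function of C_b(R) that does not vanish at infinity."""
    return 1.0


# mu_n = Dirac measure at n, for n = 1, ..., 30.
c0_values = [integrate_against_dirac(vanishing_at_infinity, n) for n in range(1, 31)]
cb_values = [integrate_against_dirac(constant_one, n) for n in range(1, 31)]

print("test function in C_0, n = 30:", c0_values[-1])
print("constant function 1, n = 30:", cb_values[-1])
```

Only one test function of C_0 is shown here; the same computation with any other function vanishing at infinity gives values tending to 0, whereas the constant function 1 detects that the total mass does not converge to that of the vague limit.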

The Proof of Paul Lévy’s Criterion

We shall in fact prove a slightly more general result:

Theorem 4.4.37 Let $\{\mu_n\}_{n\ge 1}$ be a sequence of $M^+(\mathbb{R}^d)$ such that for all $\nu \in \mathbb{R}^d$, the limit $\lim_{n\uparrow\infty} \hat{\mu}_n(\nu) = \varphi(\nu)$ exists for some function $\varphi$ that is continuous at 0. Then $\{\mu_n\}_{n\ge 1}$ converges weakly to a finite measure $\mu$ whose Fourier transform is $\varphi$.

We will be ready for the proof of this generalization of Paul Lévy’s criterion of convergence in distribution after a few preliminaries.

Definition 4.4.38 A family $\{\alpha_t\}_{t>0}$ of functions $\alpha_t : \mathbb{R}^d \to \mathbb{C}$ in $L^1_{\mathbb{C}}(\mathbb{R}^d)$ is called an approximation of the Dirac distribution in $\mathbb{R}^d$ if it satisfies the following three conditions:
(i) $\int_{\mathbb{R}^d} \alpha_t(x) \, dx = 1$;
(ii) $\sup_{t>0} \int_{\mathbb{R}^d} |\alpha_t(x)| \, dx := M < \infty$; and
(iii) for any compact neighborhood $V$ of $0 \in \mathbb{R}^d$, $\lim_{t\downarrow 0} \int_{\overline{V}} |\alpha_t(x)| \, dx = 0$, where $\overline{V}$ denotes the complement of $V$.

Lemma 4.4.39 Let $\{\alpha_t\}_{t>0}$ be an approximation of the Dirac distribution in $\mathbb{R}^d$. Let $f : \mathbb{R}^d \to \mathbb{C}$ be a bounded function continuous at all points of a compact $K \subset \mathbb{R}^d$. Then $\lim_{t\downarrow 0} f * \alpha_t = f$ uniformly in $K$.

Proof. We will show later that
\[
\lim_{y\to 0} \sup_{x \in K} |f(x - y) - f(x)| = 0 . \tag{$\star$}
\]
$V$ being a compact neighborhood of 0, we have that
\[
\sup_{x \in K} |f(x) - (f * \alpha_t)(x)| \le M \sup_{y \in V} \sup_{x \in K} |f(x - y) - f(x)| + 2 \sup_{x \in \mathbb{R}^d} |f(x)| \int_{\overline{V}} |\alpha_t(y)| \, dy .
\]
This quantity can be made smaller than any $\varepsilon > 0$ by choosing $V$ such that the first term is $< \frac{1}{2}\varepsilon$ (uniform continuity of $f$ on compact sets) and the second term can then be made $< \frac{1}{2}\varepsilon$ by letting $t \downarrow 0$ (condition (iii) of Definition 4.4.38).

Proof of ($\star$). Let $\varepsilon > 0$ be given. For all $x \in K$, there exists an open and symmetric neighborhood $V_x$ of 0 such that for all $y \in V_x$, $|f(x - y) - f(x)| \le \frac{1}{2}\varepsilon$. Also, one can find an open and symmetric neighborhood $W_x$ of 0 such that $W_x + W_x \subset V_x$. The union of open sets $\cup_{x \in K} \{x + W_x\}$ obviously covers $K$, and since the latter is a compact set, one can extract a finite covering of $K$: $\cup_{j=1}^m (x_j + W_{x_j})$. Define $W = \cap_{j=1}^m W_{x_j}$, an open neighborhood of 0. Let $y \in W$. Any $x \in K$ belongs to some $x_j + W_{x_j}$, and for such $j$,


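$$|f(x - y) - f(x)| \le |f(x_j) - f(x_j - (x_j - x))| + |f(x_j) - f(x_j - (x_j - x + y))| .$$

But $x_j - x \in W_{x_j}$ and $x_j - x + y \in W_{x_j} + W \subset V_{x_j}$. Therefore

$$|f(x - y) - f(x)| \le \tfrac{1}{2}\varepsilon + \tfrac{1}{2}\varepsilon = \varepsilon .$$

We can now prove Theorem 4.4.37.

Proof. The sequence $\{\mu_n\}_{n \ge 1}$ is bounded (that is, $\sup_n \mu_n(\mathbb{R}^d) < \infty$) since $\mu_n(\mathbb{R}^d) = \hat{\mu}_n(0)$ has a limit as $n \uparrow \infty$. In particular,

$$|\hat{\mu}_n(\nu)| \le \mu_n(\mathbb{R}^d) \le \sup_n \mu_n(\mathbb{R}^d) < \infty . \qquad (\dagger)$$

If $f \in L^1_{\mathbb{R}}(\mathbb{R}^d)$, then by Theorem 4.4.35, $\int \hat{f} \, d\mu_n = \int f \hat{\mu}_n \, dx$. By dominated convergence (using $(\dagger)$), $\lim_{n \uparrow \infty} \int f \hat{\mu}_n \, dx = \int f \varphi \, dx$. Therefore

$$\lim_{n \uparrow \infty} \int \hat{f} \, d\mu_n = \int f \varphi \, dx .$$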

One can replace in the above equality $\hat{f}$ by any function in $D(\mathbb{R}^d)$, since such a function is always the Fourier transform of some integrable function. Therefore, by Theorem 4.4.31, $\{\mu_n\}_{n \ge 1}$ converges vaguely to some finite measure $\mu$. We now show that it converges weakly to $\mu$. Let $f$ be an integrable function with integral 1 such that $f(x) = f(-x)$ and $\hat{f} \in D(\mathbb{R}^d)$. For $t > 0$, define $f_t(x) := t^{-d} f(t^{-1} x)$. Using Theorem 4.4.35, we have

$$\int \hat{f}(tx) \, \mu_n(dx) = \int f_t(x) \, \hat{\mu}_n(x) \, dx = (f_t * \hat{\mu}_n)(0) .$$

By dominated convergence,

$$\lim_{n \uparrow \infty} \int \hat{f}(tx) \, \mu_n(dx) = (f_t * \varphi)(0) ,$$

and by vague convergence,

$$\lim_{n \uparrow \infty} \int \hat{f}(tx) \, \mu_n(dx) = \int \hat{f}(tx) \, \mu(dx) .$$

Therefore, for all $t > 0$,

$$\int \hat{f}(tx) \, \mu(dx) = (f_t * \varphi)(0) .$$

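Since the function $\varphi$ is bounded and continuous at the origin, by Lemma 4.4.39, $\lim_{t \downarrow 0} (f_t * \varphi)(0) = \varphi(0)$. Also, by dominated convergence, $\lim_{t \downarrow 0} \int \hat{f}(tx) \, \mu(dx) = \mu(\mathbb{R}^d)$. Therefore,

$$\mu(\mathbb{R}^d) = \varphi(0) = \lim_{n \uparrow \infty} \mu_n(\mathbb{R}^d) .$$

Therefore, by Theorem 4.4.36, $\{\mu_n\}_{n \ge 1}$ converges weakly to $\mu$. Since the function $x \mapsto e^{-2i\pi \langle \nu, x \rangle}$ is continuous and bounded,

$$\hat{\mu}(\nu) = \int e^{-2i\pi \langle \nu, x \rangle} \, \mu(dx) = \lim_{n \uparrow \infty} \int e^{-2i\pi \langle \nu, x \rangle} \, \mu_n(dx) = \varphi(\nu) .$$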




4.5 The Hierarchy of Convergences

4.5.1 Almost-sure vs in Probability

Theorem 4.5.1
A. If the sequence $\{Z_n\}_{n \ge 1}$ of complex random variables converges almost surely to some complex random variable $Z$, it also converges in probability to the same random variable $Z$.
B. If the sequence of complex random variables $\{X_n\}_{n \ge 1}$ converges in probability to the complex random variable $X$, one can find a strictly increasing sequence of integers $\{n_k\}_{k \ge 1}$ such that $\{X_{n_k}\}_{k \ge 1}$ converges almost surely to $X$.

(B says, in other words: from a sequence converging in probability, one can extract a subsequence converging almost surely.)

Proof. A. Suppose almost-sure convergence. By Theorem 4.1.3, for all $\varepsilon > 0$, $P(|Z_n - Z| \ge \varepsilon \text{ i.o.}) = 0$, that is

$$P\left( \cap_{n \ge 1} \cup_{k=n}^{\infty} (|Z_k - Z| \ge \varepsilon) \right) = 0 ,$$

or (sequential continuity of probability)

$$\lim_{n \uparrow \infty} P\left( \cup_{k=n}^{\infty} (|Z_k - Z| \ge \varepsilon) \right) = 0 ,$$

which in turn implies that

$$\lim_{n \uparrow \infty} P(|Z_n - Z| \ge \varepsilon) = 0 .$$

B. By definition of convergence in probability, for all $\varepsilon > 0$,

$$\lim_{n \uparrow \infty} P(|X_n - X| \ge \varepsilon) = 0 .$$

Therefore one can find $n_1$ such that $P\left( |X_{n_1} - X| \ge \frac{1}{1} \right) \le \frac{1}{2}$. Then, one can find $n_2 > n_1$ such that $P\left( |X_{n_2} - X| \ge \frac{1}{2} \right) \le \left( \frac{1}{2} \right)^2$, and so on, until we have a strictly increasing sequence of integers $n_k$ ($k \ge 1$) such that

$$P\left( |X_{n_k} - X| \ge \frac{1}{k} \right) \le \left( \frac{1}{2} \right)^k .$$

It then follows from Theorem 4.1.2 that

$$\lim_{k \uparrow \infty} X_{n_k} = X \quad \text{a.s.}$$

Remark 4.5.2 Exercise 4.6.7 gives an example of a sequence converging in probability, but not almost surely. Thus, convergence in probability is a notion strictly weaker than almost-sure convergence.
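The gap between the two notions can be made concrete with exact arithmetic. The sketch below assumes, purely for illustration, independent {0, 1}-valued variables with P(X_n = 1) = 1/n (this choice and the function names are not taken from the text). Then P(X_n = 1) tends to 0, so the sequence converges to 0 in probability, while the probability of seeing at least one value 1 among the indices n, ..., N tends to 1 as N grows, which rules out almost-sure convergence to 0.

```python
from fractions import Fraction


def prob_one_at(n: int) -> Fraction:
    """P(X_n = 1) for the illustrative choice P(X_n = 1) = 1/n."""
    return Fraction(1, n)


def prob_some_one_between(n: int, big_n: int) -> Fraction:
    """P(X_k = 1 for at least one k in {n, ..., big_n}), assuming independence."""
    prob_all_zero = Fraction(1)
    for k in range(n, big_n + 1):
        # Independence: the probability that all variables vanish is a product.
        prob_all_zero *= 1 - prob_one_at(k)
    return 1 - prob_all_zero


if __name__ == "__main__":
    # The marginal probabilities go to 0 (convergence in probability to 0) ...
    print(float(prob_one_at(1000)))
    # ... but a value 1 still shows up after index 10 with probability close to 1.
    print(float(prob_some_one_between(10, 100_000)))
```

The product telescopes to (n - 1)/N, so the second probability equals 1 - (n - 1)/N exactly; exact fractions are used here so that this identity can be checked without rounding error.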


Theorem 4.5.3 If the sequence $\{Z_n\}_{n \ge 1}$ of square-integrable complex random variables converges in quadratic mean to the complex random variable $Z$, it also converges in probability to the same random variable.

Proof. It suffices to observe that, by Markov's inequality, for all $\varepsilon > 0$,

$$P(|Z_n - Z| \ge \varepsilon) \le \frac{1}{\varepsilon^2} E[|Z_n - Z|^2] .$$

4.5.2 The Rank of Convergence in Distribution

We now compare convergence in distribution to the other types of convergence. Convergence in distribution is weaker than almost-sure convergence:

Theorem 4.5.4 If the sequence $\{X_n\}_{n \ge 1}$ of random vectors of $\mathbb{R}^d$ converges almost surely to some random vector $X$, it also converges in distribution to the same vector $X$.

Proof. By dominated convergence, for all $u \in \mathbb{R}^d$,

$$\lim_{n \uparrow \infty} E\left[ e^{i \langle u, X_n \rangle} \right] = E\left[ e^{i \langle u, X \rangle} \right] ,$$

which implies, by Theorem 4.4.6, that $\{X_n\}_{n \ge 1}$ converges in distribution to $X$.

In fact, convergence in distribution is even weaker than convergence in probability.

Theorem 4.5.5 If the sequence $\{X_n\}_{n \ge 1}$ of random vectors of $\mathbb{R}^d$ converges in probability to some random vector $X$, it also converges in distribution to $X$.

Proof. If this were not the case, one could find a function $f \in C_b(\mathbb{R}^d)$ such that $E[f(X_n)]$ does not converge to $E[f(X)]$. In particular, there would exist a subsequence $\{n_k\}_{k \ge 1}$ and some $\varepsilon > 0$ such that $|E[f(X_{n_k})] - E[f(X)]| \ge \varepsilon$ for all $k$. As $\{X_{n_k}\}_{k \ge 1}$ converges in probability to $X$, one can extract from it (Theorem 4.5.1, part B) a subsequence $\{X_{n_{k_\ell}}\}_{\ell \ge 1}$ converging almost surely to $X$. In particular, since $f$ is bounded and continuous, $\lim_{\ell \uparrow \infty} E[f(X_{n_{k_\ell}})] = E[f(X)]$ by dominated convergence, a contradiction.

Combining Theorems 4.5.3 and 4.5.5, we have that convergence in distribution is weaker than convergence in the quadratic mean:

Theorem 4.5.6 If the sequence of real random variables $\{Z_n\}_{n \ge 1}$ converges in quadratic mean to some random variable $Z$, it also converges in distribution to the same random variable $Z$.

Theorem 4.5.6 can be refined in the Gaussian case, where the distribution of the limit can be proved to be Gaussian.


A Stability Property of Gaussian Vectors

Theorem 4.5.7 If $\{Z_n\}_{n \ge 1}$, where $Z_n = (Z_n^{(1)}, \ldots, Z_n^{(m)})$, is a sequence of Gaussian random vectors of fixed dimension $m$ that converges componentwise in quadratic mean to some vector $Z = (Z^{(1)}, \ldots, Z^{(m)})$, the latter vector is Gaussian.

Proof. In fact, by continuity of the inner product in $L^2_{\mathbb{R}}(P)$, for all $1 \le i, j \le m$, $\lim_{n \uparrow \infty} E[Z_n^{(i)} Z_n^{(j)}] = E[Z^{(i)} Z^{(j)}]$ and $\lim_{n \uparrow \infty} E[Z_n^{(i)}] = E[Z^{(i)}]$, that is,

$$\lim_{n \uparrow \infty} m_{Z_n} = m_Z , \qquad \lim_{n \uparrow \infty} \Gamma_{Z_n} = \Gamma_Z ,$$

and in particular, for all $u \in \mathbb{R}^m$,

$$\lim_{n \uparrow \infty} E\left[ e^{i u^T Z_n} \right] = \lim_{n \uparrow \infty} e^{i u^T m_{Z_n} - \frac{1}{2} u^T \Gamma_{Z_n} u} = e^{i u^T m_Z - \frac{1}{2} u^T \Gamma_Z u} .$$

The sequence $\{u^T Z_n\}_{n \ge 1}$, converging in quadratic mean to $u^T Z$, also converges in distribution to $u^T Z$. Therefore, $\lim_{n \uparrow \infty} E\left[ e^{i u^T Z_n} \right] = E[e^{i u^T Z}]$, and finally

$$E[e^{i u^T Z}] = e^{i u^T m_Z - \frac{1}{2} u^T \Gamma_Z u}$$

for all $u \in \mathbb{R}^m$. This shows that $Z$ is a Gaussian vector.

Therefore, limits in the quadratic mean preserve the Gaussian nature of random vectors. This is the stability property referred to in the title of this subsection. Note that the Gaussian nature of random vectors is also preserved by linear transformations, as we already know.

Convergence in distribution is weaker than convergence in variation:

Theorem 4.5.8 If the sequence of real random variables $\{X_n\}_{n \ge 1}$ converges in variation to $X$, it converges in distribution to the same random variable.

Proof. Indeed, for all $x$ (not just the continuity points of the distribution of $X$),

$$|P(X_n \le x) - P(X \le x)| \le d_V(X_n, X) \to 0 .$$
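The inequality used in the proof of Theorem 4.5.8 can be checked numerically on small discrete distributions. The sketch below is only an illustration: it assumes the distance in variation of two distributions on a countable set is half the sum of the absolute differences of their probabilities (consistent with the expressions given in Exercise 4.6.27), and the function names and the two example distributions are hypothetical.

```python
def variation_distance(alpha: dict, beta: dict) -> float:
    """Distance in variation, taken here as (1/2) * sum_i |alpha(i) - beta(i)|."""
    support = set(alpha) | set(beta)
    return 0.5 * sum(abs(alpha.get(i, 0.0) - beta.get(i, 0.0)) for i in support)


def cdf(dist: dict, x: float) -> float:
    """Cumulative distribution function P(X <= x) of a discrete distribution."""
    return sum(p for i, p in dist.items() if i <= x)


def max_cdf_gap(alpha: dict, beta: dict) -> float:
    """Largest |F_alpha(x) - F_beta(x)|; the atoms suffice since both cdfs are step functions."""
    support = sorted(set(alpha) | set(beta))
    return max(abs(cdf(alpha, x) - cdf(beta, x)) for x in support)


if __name__ == "__main__":
    # Hypothetical distributions on {0, 1, 2}, chosen only for illustration.
    alpha = {0: 0.5, 1: 0.5}
    beta = {0: 0.25, 1: 0.25, 2: 0.5}
    print(variation_distance(alpha, beta), max_cdf_gap(alpha, beta))
```

In this particular example the two quantities coincide, which shows that the bound cannot be improved in general.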

Complementary reading [Billingsley, 1979, 1992], [Kallenberg, 2002].

4.6 Exercises

Exercise 4.6.1. The telescope formula, take 2
Prove the following inequalities concerning a non-negative random variable $X$:

$$\sum_{n=1}^{\infty} P(X \ge n) \le E[X] \le 1 + \sum_{n=1}^{\infty} P(X \ge n) .$$


Exercise 4.6.2. A recurrence equation
Consider the recurrence equation $X_{n+1} = (X_n - 1)^+ + Z_{n+1}$, $n \ge 0$ ($a^+ := \sup(a, 0)$), where $X_0 = 0$ and where $\{Z_n\}_{n \ge 1}$ is an iid sequence of random variables with values in $\mathbb{N}$. Denote by $T_0$ the first index $n \ge 1$ such that $X_n = 0$ ($T_0 = \infty$ if such an index does not exist).
a) Show that if $E[Z_1] < 1$, then $P(T_0 < \infty) = 1$.
b) Show that if $E[Z_1] > 1$, there exists a (random) index $n_0$ such that $X_n > 0$ for all $n \ge n_0$.

Exercise 4.6.3. Poisson asymptotics
Let $\{S_n\}_{n \ge 1}$ be an iid sequence of real random variables such that $P(S_1 \in (0, \infty)) = 1$ and $E[S_1] < \infty$, and let for each $t \ge 0$, $N(t) = \sum_{n \ge 1} 1_{(0,t]}(T_n)$, where $T_n = S_1 + \cdots + S_n$. Prove that

$$\lim_{t \to \infty} \frac{N(t)}{t} = \frac{1}{E[S_1]} .$$

Exercise 4.6.4. SLLN and infinite expectation
Let $\{Z_n\}_{n \ge 1}$ be an iid sequence of non-negative random variables such that $E[Z_1] = \infty$. Show that

$$\lim_{n \uparrow \infty} \frac{Z_1 + \cdots + Z_n}{n} = \infty \ (= E[Z_1]) .$$

Exercise 4.6.5. Exchanging the order of expectation and summation
(a) Let $\{S_n\}_{n \ge 1}$ be a sequence of non-negative random variables. Show that

$$E\left[ \sum_{n=1}^{\infty} S_n \right] = \sum_{n=1}^{\infty} E[S_n] . \qquad (\star)$$

(b) Let $\{S_n\}_{n \ge 1}$ be a sequence of real random variables such that $\sum_{n \ge 1} E[|S_n|] < \infty$. Show that $(\star)$ holds as well.

Exercise 4.6.6. A sufficient condition for almost-sure convergence
Show that

$$\sum_{n \ge 1} P(|Z_n - Z| \ge \varepsilon) < \infty$$

for all $\varepsilon > 0$ is a sufficient condition for the sequence of random variables $\{Z_n\}_{n \ge 1}$ to converge almost surely to $Z$.

Exercise 4.6.7. Convergence almost sure vs convergence in probability
Let $\{X_n\}_{n \ge 1}$ be a sequence of independent random variables taking only 2 values, 0 and 1.
(A) Show that a necessary and sufficient condition of almost-sure convergence to 0 is that

$$\sum_{n \ge 1} P(X_n = 1) < \infty .$$


(B) Show that a necessary and sufficient condition of convergence in probability to 0 is that

$$\lim_{n \uparrow \infty} P(X_n = 1) = 0 .$$

(C) Deduce from the above that convergence in probability does not imply almost-sure convergence.

Exercise 4.6.8. Convergence in probability and in the quadratic mean
Let $\alpha > 0$, and let $\{Z_n\}_{n \ge 1}$ be a sequence of random variables such that

$$P(Z_n = 1) = 1 - \frac{\alpha}{n} , \qquad P(Z_n = n) = \frac{\alpha}{n} ,$$

where $\alpha < 1$. Show that $\{Z_n\}_{n \ge 1}$ converges in probability to some variable $Z$. For what values of $\alpha$ does $\{Z_n\}_{n \ge 1}$ converge to $Z$ in quadratic mean?

Exercise 4.6.9. Convergence in probability
Suppose the sequence of random variables $\{Z_n\}_{n \ge 1}$ converges to $a$ in probability. Let $g : \mathbb{R} \to \mathbb{R}$ be a continuous function. Show that $\{g(Z_n)\}_{n \ge 1}$ converges to $g(a)$ in probability.

Exercise 4.6.10. Inner product in $L^2(P)$
Prove the following: If the sequence $\{Z_n\}_{n \ge 1}$ of square-integrable complex random variables converges in quadratic mean to the complex random variable $Z$, then

$$\lim_{n \uparrow \infty} E[Z_n] = E[Z] \quad \text{and} \quad \lim_{n \uparrow \infty} E\left[ |Z_n|^2 \right] = E\left[ |Z|^2 \right] .$$

Exercise 4.6.11. Convergence in probability but not almost-sure
Let $\{Z_n\}_{n \ge 1}$ be an independent sequence of random variables such that

$$P(Z_n = \pm 1) = \frac{1}{2 n \log n} , \qquad P(Z_n = 0) = 1 - \frac{1}{n \log n} .$$

Let $S_n = \sum_{j=1}^{n} Z_j$. Show the limit in probability of $\frac{S_n}{n}$ exists, but not the almost-sure limit.

Exercise 4.6.12. Convergence in distribution but not in probability
Let $Z$ be a random variable with a symmetric distribution (that is, $Z$ and $-Z$ have the same distribution). Define the sequence $\{Z_n\}_{n \ge 1}$ as follows: $Z_n = Z$ if $n$ is odd, $Z_n = -Z$ if $n$ is even. In particular, $\{Z_n\}_{n \ge 1}$ converges in distribution to $Z$. Show that if $Z$ is not degenerate, then $\{Z_n\}_{n \ge 1}$ does not converge to $Z$ in probability.

Exercise 4.6.13. The unlimited gambler
This exercise anticipates the gambling situation described in Example 13.1.4 to which the reader is referred for the notation. Suppose that the stakes are bounded, say by $M$, and that the initial fortune of the gambler is $a$. The gambler can borrow whatever amount is needed, so that his “fortune” $Y_n$ at any time $n$ can take arbitrary values. Prove that

$$P(|Y_n - a| \ge \lambda) \le 2 \exp\left( - \frac{\lambda^2}{2 n M^2} \right) .$$

Exercise 4.6.14. Fair coin tosses
Consider a Bernoulli sequence of parameter $\frac{1}{2}$ representing a fair game of heads or tails. Let $X$ be the number of heads after $n$ tosses. Use Hoeffding's inequality to prove that

$$P(|X - E[X]| \ge \lambda) \le 2 \exp\left( - \frac{\lambda^2}{n} \right) .$$

Exercise 4.6.15. Empty bins
Consider the usual “balls and bins” setting with $n$ bins and $m$ balls (the multinomial distribution). Let $X$ be the number of empty bins. Prove that

$$P(|X - E[X]| \ge \lambda) \le 2 \exp\left( - \frac{\lambda^2}{m} \right) .$$

Exercise 4.6.16. Bernoulli is Borel
Let the sequence $\{X_n\}_{n \ge 1}$ be as in Theorem 1.1.6 (it is sometimes called a Bernoulli sequence). Prove that

$$P(\{X_n\}_{n \ge 1} \text{ is a Borel sequence}) = 1 .$$

A sequence $\{x_n\}_{n \ge 1}$ taking its values in the set $\{0, 1\}$ is called a Borel sequence if for all $k \ge 1$, all $a_1, \ldots, a_k$ in $\{0, 1\}$,

$$\lim_{n \uparrow \infty} \frac{1}{n} \sum_{j=1}^{n} 1_{\{x_{j+1} = a_1, \ldots, x_{j+k} = a_k\}} = \frac{1}{2^k} .$$

Exercise 4.6.17. Metrization of convergence in probability
Define for any two random variables $X$ and $Y$,

$$d(X, Y) := E\left[ \frac{|X - Y|}{1 + |X - Y|} \right] .$$

Prove that $d$ so defined is a metric. Prove the following variant of Theorem 4.2.4: The sequence $\{X_n\}_{n \ge 1}$ converges in probability to the variable $X$ if and only if

$$\lim_{n \uparrow \infty} d(X_n, X) = 0 .$$

Exercise 4.6.18. Poisson's law of rare events in the plane
Let $Z_1, \ldots, Z_M$ be $M$ bidimensional iid random vectors uniformly distributed on the square $[0, A] \times [0, A] = \Gamma_A$. For any measurable set $C \subseteq \Gamma_A$, define $N(C)$ to be the number of random vectors $Z_i$ that fall in $C$. Let $C_1, \ldots, C_K$ be measurable disjoint subsets of $\Gamma_A$.
i) Give the characteristic function of the vector $(N(C_1), \ldots, N(C_K))$.
ii) We now let $M$ be a function of $A$ such that

$$\frac{M(A)}{A^2} = \lambda > 0 .$$

Show that, as $A \uparrow \infty$, $(N(C_1), \ldots, N(C_K))$ converges in distribution. Identify the limit distribution.

Exercise 4.6.19. A characterization of the Gaussian distribution
Let $G$ be a cumulative distribution function on $\mathbb{R}$ with

$$\int_{\mathbb{R}} x \, dG(x) = 0 , \qquad \int_{\mathbb{R}} x^2 \, dG(x) = \sigma^2 < \infty .$$

In addition, suppose that $G$ has the following property: If $X_1$ and $X_2$ are independent random variables with the cdf $G$, then $\frac{X_1 + X_2}{\sqrt{2}}$ also admits $G$ as cdf. Prove that $G$ is the cdf of a Gaussian variable with mean 0 and variance $\sigma^2$.

Exercise 4.6.20. The central limit theorem
Prove, using the central limit theorem, that

$$\lim_{n \to \infty} \sum_{k=1}^{n} e^{-n} \frac{n^k}{k!} = \frac{1}{2} .$$

Exercise 4.6.21. A confidence interval
In Example 4.4.15 how would you find the best statement of the kind: “This coin is x percent guaranteed unbiased”? In other words, how would you obtain the largest x in this claim? (You are not required to give the actual value, just the method for obtaining it.)

Exercise 4.6.22. Maximal coincidence of biased coins
Find a pair of $\{0, 1\}$-valued random variables with prescribed marginals $P(X = 1) = a$, $P(Y = 1) = b$, where $a, b \in (0, 1)$, and such that $P(X = Y)$ is maximal.

Exercise 4.6.23. Functions of random variables and distance in variation
Let $(E, \mathcal{E})$ and $(G, \mathcal{G})$ be measurable spaces and let $f : (E, \mathcal{E}) \to (G, \mathcal{G})$ be some measurable function. For $\alpha$ and $\beta$, probability distributions on $(E, \mathcal{E})$, define the probability distribution $\alpha f^{-1}$ on $(G, \mathcal{G})$ by $\alpha f^{-1}(B) = \alpha(f^{-1}(B))$, and define likewise $\beta f^{-1}$. Prove that

$$d_V(\alpha, \beta) \ge d_V\left( \alpha f^{-1}, \beta f^{-1} \right) .$$

Exercise 4.6.24. The variation distance of two Poisson variables
Let $p_\lambda$ denote the Poisson distribution with mean $\lambda$. Let $\mu > 0$. Prove that

$$d_V(p_\lambda, p_\mu) \le 1 - e^{-|\mu - \lambda|} .$$


Exercise 4.6.25. Convexity of the distance in variation
Let $\alpha_i$ and $\beta_i$, $1 \le i \le K$, be probability distributions on the countable space $E$. Show that if $\lambda_i \in [0, 1]$ and $\sum_{i=1}^{K} \lambda_i = 1$, then

$$d_V\left( \sum_{i=1}^{K} \lambda_i \alpha_i , \sum_{i=1}^{K} \lambda_i \beta_i \right) \le \sum_{i=1}^{K} \lambda_i \, d_V(\alpha_i, \beta_i) .$$

State and prove the analogous result in the general case of an arbitrary measurable space $(E, \mathcal{E})$.

Exercise 4.6.26. An alternative expression of the distance in variation
Let $\alpha$ and $\beta$ be two probability distributions on some countable space $E$. Show that

$$d_V(\alpha, \beta) = \frac{1}{2} \sup_{\|f\| \le 1} \left| \sum_{i} f(i) \alpha(i) - \sum_{i} f(i) \beta(i) \right| ,$$

where $\|f\| := \sup_{i \in E} |f(i)|$. State and prove the analogous result in the general case of an arbitrary measurable space $(E, \mathcal{E})$.

Exercise 4.6.27. Another expression of the distance in variation
Let $\alpha$ and $\beta$ be two probability distributions on the same countable space $E$. Prove the following alternative expressions of the distance in variation:

$$d_V(\alpha, \beta) = 1 - \sum_{i \in E} \alpha(i) \wedge \beta(i) = \sum_{i \in E} (\alpha(i) - \beta(i))^+ = \sum_{i \in E} (\beta(i) - \alpha(i))^+ .$$

State and prove the analogous result in the general case of an arbitrary measurable space $(E, \mathcal{E})$.

Exercise 4.6.28. Convergence in probability and convergence in variation
Let $\{Z_n\}_{n \ge 0}$ be a sequence of $\{0, 1\}$-valued random variables. Show that it converges in variation to 0 if and only if it converges in probability to 0. Deduce from this that there exist sequences of random variables that converge in distribution but not in variation.

Exercise 4.6.29. Tricky Cauchy
Let $\{X_n\}_{n \ge 1}$ be a sequence of iid Cauchy random variables.
(a) What is the limit in distribution of $\frac{X_1 + \cdots + X_n}{n}$?
(b) Does $\frac{X_1 + \cdots + X_n}{n^2}$ converge in distribution?
(c) Does $\frac{X_1 + \cdots + X_n}{n}$ converge almost surely to a (non-random) constant?

II: STANDARD STOCHASTIC PROCESSES

Chapter 5 Generalities on Random Processes

A random process (or stochastic process) is a collection of random variables indexed by time, which may record the evolution of some phenomenon. This definition is generalized to accommodate models where space, and not just time, plays a role. This chapter addresses the basic issues concerning the distribution of such processes, their sample path properties and the various notions of measurability.

5.1 The Distribution of a Random Process

5.1.1 Kolmogorov's Theorem on Distributions

Let $T$ be an arbitrary index set. In general in this book, it will be one of the following: $\mathbb{N}$ (the natural numbers), $\mathbb{Z}$ (the integers), $\mathbb{R}$ (the real numbers) or $\mathbb{R}_+$ (the non-negative real numbers) (see, however, Example 5.1.24).

Random Processes as Collections of Random Variables

Definition 5.1.1 A stochastic process (or random process) is a family $\{X(t)\}_{t \in T}$ of random elements defined on the same probability space $(\Omega, \mathcal{F}, P)$ and taking their values in some given measurable space $(E, \mathcal{E})$. It is called a real (resp., complex) stochastic process if it takes real (resp., complex) values. It is called a continuous-time stochastic process when the index set is $\mathbb{R}$ or $\mathbb{R}_+$, and a discrete-time stochastic process when it is $\mathbb{N}$ or $\mathbb{Z}$. When the index set is $\mathbb{N}$ or $\mathbb{Z}$, we also use the notation $n$ instead of $t$ for the time index, and write $X_n$ instead of $X(t)$.

Example 5.1.2: Random Sinusoid. Let $A$ be some real non-negative random variable, let $\nu_0 \in \mathbb{R}$ be a positive constant and let $\Phi$ be a random variable with values in $[0, 2\pi]$. The formula

$$X(t) = A \sin(2\pi \nu_0 t + \Phi)$$

defines a stochastic process. For each sample $\omega \in \Omega$, the function $t \mapsto X(t, \omega)$ is a sinusoid with frequency $\nu_0$, random amplitude $A(\omega)$ and random phase $\Phi(\omega)$.

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/978-3-030-40183-2_5


Example 5.1.3: Counting Processes. A counting process $\{N(t)\}_{t \ge 0}$ is, by definition, an integer-valued stochastic process such that the functions $t \mapsto N(t, \omega)$ are almost surely integer-valued, non-decreasing, right-continuous with left-hand limits, and such that $N(t) - N(t-) \le 1$ ($t \ge 0$) and $N(0) = 0$. For instance, $N(t)$ could be the number of arrivals of cars at a highway toll in the interval $(0, t]$.
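One sample path of a counting process can be evaluated from a list of arrival times. The sketch below is illustrative only: the function name and the arrival times are hypothetical, and the arrival times are assumed to be strictly positive and pairwise distinct so that N(0) = 0 and all jumps have size 1.

```python
from bisect import bisect_right


def counting_process(arrival_times, t):
    """Return N(t), the number of arrivals in the interval (0, t].

    Assumes strictly positive, pairwise distinct arrival times.
    """
    times = sorted(arrival_times)
    # bisect_right counts the arrival times that are <= t, so an arrival
    # exactly at time t is included: the path is right-continuous.
    return bisect_right(times, t)


if __name__ == "__main__":
    arrivals = [0.5, 1.2, 3.0]  # hypothetical arrival times of cars at the toll
    for t in (0.0, 0.49, 0.5, 1.2, 10.0):
        print(t, counting_process(arrivals, t))
```

Counting the times that are at most t, rather than strictly less than t, is what makes the path right-continuous with left-hand limits, in line with the definition above.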

Finite-dimensional Distributions

One way of describing the probabilistic behavior of a stochastic process is by means of its finite-dimensional distribution.

Definition 5.1.4 By definition, the finite-dimensional (fidi) distribution of a stochastic process $\{X(t)\}_{t \in T}$ is the collection of probability distributions of the random vectors $(X(t_1), \ldots, X(t_k))$ for all $k \ge 1$ and all $t_1, \ldots, t_k \in T$.

Example 5.1.5: A Particular Counting Process. Let $\{N(t)\}_{t \ge 0}$ be a counting process. Suppose that for all $k \ge 1$, all $0 = t_0 \le t_1 < \cdots < t_k$, and all integers $m_1, \ldots, m_k$,

$$P\left( \cap_{j=1}^{k} \{N(t_j) - N(t_{j-1}) = m_j\} \right) = \prod_{j=1}^{k} e^{-\lambda (t_j - t_{j-1})} \frac{(\lambda (t_j - t_{j-1}))^{m_j}}{m_j!} .$$

The fact that this completely describes the fidi distribution of $\{N(t)\}_{t \ge 0}$ is obvious. The existence of such a process is at this point not proved but will be guaranteed by Theorem 5.1.7 below. We shall see later on that this is the counting process of a homogeneous Poisson process on the positive half-line of intensity $\lambda$.

There are two pending issues. Firstly, is there a stochastic process having a prescribed finite-dimensional distribution? Secondly, is it unique? The answer is rather simple and it is best given in the setting of canonical measurable spaces of functions.

Let $E^T$ be the set of functions $x : T \to E$. An element $x \in E^T$ is therefore a function from $T$ to $E$: $x := (x(t), t \in T)$, where $x(t) \in E$. Let $\mathcal{E}^T$ ¹ be the smallest σ-field containing all the sets of the form $\{x \in E^T ; x(t) \in C\}$, where $t$ ranges over $T$ and $C$ ranges over $\mathcal{E}$. The measurable space $(E^T, \mathcal{E}^T)$ so defined is called the canonical (measurable) space of stochastic processes indexed by $T$ with values in $(E, \mathcal{E})$ (we say: “with values in $E$” if the choice of the σ-field $\mathcal{E}$ is clear in the given context).

¹ The notation $\mathcal{E}^{\otimes T}$ is also used in order to distinguish this mathematical object from a collection of $\mathcal{E}$-valued functions.


Denote by $\pi_t$ the coordinate map at $t \in T$, that is, the mapping from $E^T$ to $E$ defined by

$$\pi_t(x) := x(t) \quad (x \in E^T) .$$

This is a random element since, when $C \in \mathcal{E}$, the set $\{x ; \pi_t(x) \in C\} = \{x ; x(t) \in C\}$ belongs to $\mathcal{E}^T$, by definition of the latter. The family $\{\pi_t\}_{t \in T}$ is called the coordinate process on the (canonical) measurable space $(E^T, \mathcal{E}^T)$.

The probability distribution of the vector $(X(t_1), \ldots, X(t_k))$ is a probability measure $Q_{(t_1, \ldots, t_k)}$ on $(E^k, \mathcal{E}^k)$. It satisfies the following obvious properties, called the compatibility conditions:

C1. For all $(t_1, \ldots, t_k) \in T^k$, and any permutation $\sigma$ on $T^k$,

$$Q_{\sigma(t_1, \ldots, t_k)} = Q_{(t_1, \ldots, t_k)} \circ \sigma^{-1} . \qquad (5.1)$$

C2. For all $(t_1, \ldots, t_k, t_{k+1}) \in T^{k+1}$ and all $A \in \mathcal{E}^k$,

$$Q_{(t_1, \ldots, t_k)}(A) = Q_{(t_1, \ldots, t_{k+1})}(A \times E) . \qquad (5.2)$$

Remark 5.1.6 Conditions C1 and C2 just acknowledge the obvious facts of the type $P(X(t_1) \in A_1, X(t_2) \in A_2) = P(X(t_2) \in A_2, X(t_1) \in A_1)$ and $P(X(t_1) \in A_1, X(t_2) \in E) = P(X(t_1) \in A_1)$.

Recall the definition of a Polish space.² It is a topological space whose topology is metrizable (generated by some metric), complete (with respect to this metric) and separable (there exists a countable dense subset).

Theorem 5.1.7 Let $E$ be a Polish space and let $\mathcal{E} := \mathcal{B}(E)$ be its Borel σ-field. Let $Q = \{Q_{(t_1, \ldots, t_k)} ; k \ge 1, (t_1, \ldots, t_k) \in T^k\}$ be a family of probability distributions on $(E^k, \mathcal{E}^k)$ satisfying the compatibility conditions C1 and C2. Then there exists a unique probability $P$ on the canonical measurable space $(E^T, \mathcal{E}^T)$ such that the coordinate process $\{\pi_t\}_{t \in T}$ admits the finite-dimensional distribution $Q$.

This is the Kolmogorov existence and uniqueness theorem.³

² For most applications in this book, the Polish space in question will be some $\mathbb{R}^m$ with the euclidean topology.
³ For a proof of Kolmogorov's distribution theorem, see for instance [Shiryaev, 1996].

Example 5.1.8: iid sequences. Take $E = \mathbb{R}$, $T = \mathbb{Z}$, and let the fidi distributions $Q_{(t_1, \ldots, t_k)}$ be of the form

$$Q_{(t_1, \ldots, t_k)} = Q_{t_1} \times \cdots \times Q_{t_k} ,$$

where for each $t \in T$, $Q_t$ is a probability distribution on $(E, \mathcal{E})$. This collection of finite-dimensional distributions obviously satisfies the compatibility conditions, and the resulting coordinate process is an independent random sequence indexed by the relative integers. It is an iid (that is: independent and identically distributed) sequence if $Q_t = Q$ for all $t \in T$. Note that the restriction to a Polish space $E$ is superfluous (Exercise 5.4.1).

Independence

Let $\{X(t)\}_{t \in T}$ be a stochastic process. The σ-field $\mathcal{F}^X := \sigma(X(t); t \in T)$ is called the global history of this process.

Definition 5.1.9 Two stochastic processes $\{X(t)\}_{t \in T}$ and $\{Y(t)\}_{t \in T}$ defined on the same probability space, with values in $(E, \mathcal{E})$ and $(E', \mathcal{E}')$ respectively, are called independent if the σ-fields $\mathcal{F}^X$ and $\mathcal{F}^Y$ are independent.

The verification of independence is simplified by the following result.

Theorem 5.1.10 For the stochastic processes $\{X(t)\}_{t \in T}$ and $\{Y(t)\}_{t \in T}$, with values in $(E, \mathcal{E})$ and $(E', \mathcal{E}')$ respectively, to be independent, it suffices that for all $t_1, \ldots, t_k \in T$ and all $s_1, \ldots, s_\ell \in T$, the vectors $(X(t_1), \ldots, X(t_k))$ and $(Y(s_1), \ldots, Y(s_\ell))$ be independent.

Proof. The collection of events of the type $\{X(t_1) \in C_1, \ldots, X(t_k) \in C_k\}$, where the $C_i$'s belong to $\mathcal{E}$, is a π-system generating $\mathcal{F}^X$, with a similar observation for $\mathcal{F}^Y$. The result then follows from these observations and Theorem 3.1.39.

Transfer to Canonical Spaces

Let a stochastic process $\{X(t)\}_{t \in T}$ be given with values in a Polish space $E$ and defined on the probability space $(\Omega, \mathcal{F}, P)$. Define the mapping $h : (\Omega, \mathcal{F}) \to (E^T, \mathcal{E}^T)$ by $h(\omega) = (X(t, \omega), t \in T)$. This mapping is measurable. To show this, it is enough to verify that $h^{-1}(C) \in \mathcal{F}$ for all $C \in \mathcal{C}$, where $\mathcal{C}$ is a collection of subsets of $E^T$ that generates $\mathcal{E}^T$. Here, we choose $\mathcal{C} = (\{x; x(t) \in A\}, t \in T, A \in \mathcal{E})$. But $h^{-1}(\{x; x(t) \in A\}) = \{\omega; X(t, \omega) \in A\} \in \mathcal{F}$ since $X(t)$ is a random variable.

Now, denote by $P_X$ the image of $P$ by $h$. The fidi distribution of the coordinate process of the canonical measurable space is the same as that of the original stochastic process.

Definition 5.1.11 The probability $P_X$ on $(E^T, \mathcal{E}^T)$ is called the distribution of $\{X(t)\}_{t \in T}$.

An immediate consequence of Kolmogorov’s (existence and) uniqueness theorem is:

Corollary 5.1.12 Two stochastic processes with the same fidi distribution have the same distribution.


Stationarity

In this subsubsection, the index set is one of the following: $\mathbb{N}$, $\mathbb{Z}$, $\mathbb{R}_+$, $\mathbb{R}$.

Definition 5.1.13 A stochastic process $\{X(t)\}_{t \in T}$ is called (strictly) stationary if for all $k \ge 1$, all $(t_1, \ldots, t_k) \in T^k$, the probability distribution of the random vector $(X(t_1 + h), \ldots, X(t_k + h))$ is independent of $h \in T$ such that $t_1 + h, \ldots, t_k + h \in T$.

A stochastic process with index set $\mathbb{R}_+$ (resp., $\mathbb{N}$) that is stationary can always be uniquely extended to $\mathbb{R}$ (resp., $\mathbb{Z}$) in such a way that stationarity is preserved. More precisely:

Theorem 5.1.14 Consider the canonical space $(E^T, \mathcal{E}^T)$ of stochastic processes with values in the Polish space $E$ and with index set $T = \mathbb{R}_+$ (resp., $T = \mathbb{N}$). Let $P_+$ be a probability measure on this canonical space that makes the canonical process stationary. Then there exists a (unique) probability measure $P$ on the canonical space of stochastic processes with values in $E$ with index set $\mathbb{R}$ (resp., $\mathbb{Z}$) such that the restriction of $P$ to $(E^{\mathbb{R}_+}, \mathcal{E}^{\mathbb{R}_+})$ (resp., $(E^{\mathbb{N}}, \mathcal{E}^{\mathbb{N}})$) is $P_+$.

Proof. Let $\{Q^+_{(t_1, \ldots, t_k)} ; k \ge 1, (t_1, \ldots, t_k) \in T^k\}$ be the finite-dimensional distributions relative to $P_+$. Define the collection of finite-dimensional distributions $\{Q_{(t_1, \ldots, t_k)} ; k \ge 1, (t_1, \ldots, t_k) \in T^k\}$, where the index set is now $\mathbb{R}$ (resp., $\mathbb{Z}$), by

$$Q_{(t_1, \ldots, t_k)} := Q^+_{(t_1 + h, \ldots, t_k + h)} ,$$

where $h$ is an element of $\mathbb{R}_+$ (resp., $\mathbb{N}$) such that $t_1 + h, \ldots, t_k + h \in \mathbb{R}_+$ (resp., $\in \mathbb{N}$). By stationarity under $P_+$, the right-hand side does not depend on the choice of such an $h$. Observe that this family satisfies the compatibility conditions (5.1) and (5.2). The result then follows from Kolmogorov's existence and uniqueness theorem.

5.1.2 Second-order Stochastic Processes

In this subsection, T represents any of the following index sets: R, R+ , Z and N. Definition 5.1.15 A measurable complex stochastic process {X(t)}t∈T satisfying the condition E[|X(t)|2 ] < ∞ (t ∈ T) is called a second-order stochastic process. In other words, for all t ∈ T, the complex random variable X(t) ∈ L2C (P ). This implies that the mean function m : T → C and the covariance function Γ : T × T → C are well defined by m(t) := E[X(t)] and

Γ(t, s) := cov (X(t), X(s)) = E[X(t)X(s)∗ ] − m(t)m(s)∗ .

When the mean function is the null function, the stochastic process is said to be centered.


Theorem 5.1.16 Let $\{X(t)\}_{t \in T}$ be a second-order stochastic process with mean function $m$ and covariance function $\Gamma$. Then, for all $s, t \in T$,

$$E[|X(t) - m(t)|] \le \Gamma(t, t)^{\frac{1}{2}}$$

and

$$|\Gamma(t, s)| \le \Gamma(t, t)^{\frac{1}{2}} \, \Gamma(s, s)^{\frac{1}{2}} .$$

Proof. Apply Schwarz's inequality

$$E[|X| |Y|] \le E\left[ |X|^2 \right]^{\frac{1}{2}} E\left[ |Y|^2 \right]^{\frac{1}{2}}$$

with $X := X(t) - m(t)$ and $Y := 1$ for the first inequality, and with $X := X(t) - m(t)$ and $Y := X(s) - m(s)$ for the second one.

For a stationary second-order stochastic process, for all $s, t \in T$,

$$m(t) \equiv m , \qquad (5.3)$$

where $m \in \mathbb{C}$, and

$$\Gamma(t, s) = C(t - s) \qquad (5.4)$$

for some function $C : T \to \mathbb{C}$, also called the covariance function of the process. The complex number $m$ is called the mean of the process.

Wide-sense Stationarity

A notion weaker than strict stationarity concerns second-order processes with values in $E = \mathbb{C}$:

Definition 5.1.17 If conditions (5.3) and (5.4) are satisfied for all $s, t \in T$, the complex second-order stochastic process $\{X(t)\}_{t \in T}$ is called wide-sense stationary.

Remark 5.1.18 There exist stochastic processes that are wide-sense stationary but not strictly stationary (Exercise 5.4.3).

In continuous time ($T = \mathbb{R}$ or $\mathbb{R}_+$) this appellation will be reserved in this book to wide-sense stationary processes that have in addition a continuous covariance function. For this condition to be satisfied, it suffices that the covariance function be continuous at the origin. This is in turn equivalent to continuity in the quadratic mean of the stochastic process, that is: For all $t \in T$,

$$\lim_{h \to 0} E\left[ |X(t + h) - X(t)|^2 \right] = 0 .$$

In fact, the covariance function is then uniformly continuous on $\mathbb{R}$.

Proof.

$$E\left[ |X(t + h) - X(t)|^2 \right] = E\left[ |X(t + h)|^2 \right] + E\left[ |X(t)|^2 \right] - E[X(t) X(t + h)^*] - E[X(t)^* X(t + h)] = 2C(0) - C(h) - C(h)^* ,$$

and therefore, uniform continuity in quadratic mean follows from the continuity at the origin of the autocovariance function. On the other hand,

$$|C(\tau + h) - C(\tau)| = |E[X(\tau + h) X(0)^*] - E[X(\tau) X(0)^*]| = |E[(X(\tau + h) - X(\tau)) X(0)^*]|$$
$$\le E\left[ |X(0)|^2 \right]^{\frac{1}{2}} \times E\left[ |X(\tau + h) - X(\tau)|^2 \right]^{\frac{1}{2}} = E\left[ |X(0)|^2 \right]^{\frac{1}{2}} \times E\left[ |X(h) - X(0)|^2 \right]^{\frac{1}{2}} ,$$

and therefore, uniform continuity of the autocovariance function follows from the continuity in quadratic mean at the origin.

Note that $C(0) = \sigma_X^2$, the variance of any of the random variables $X(t)$.

As an immediate corollary of Theorem 5.1.16, we have:

Corollary 5.1.19 Let $\{X(t)\}_{t \in T}$ be a wide-sense stationary stochastic process with mean $m$ and covariance function $C$. Then

$$E[|X(t) - m|] \le C(0)^{\frac{1}{2}} \quad \text{and} \quad |C(\tau)| \le C(0) .$$

Example 5.1.20: Harmonic processes. Let $\{U_k\}_{k \ge 1}$ be centered random variables of $L^2_{\mathbb{C}}(P)$ that are mutually uncorrelated. Let $\{\Phi_k\}_{k \ge 1}$ be completely random phases, that is, real random variables uniformly distributed on $[0, 2\pi]$. Suppose moreover that the $U$ variables are independent of the $\Phi$ variables. Finally, suppose that $\sum_{k=1}^{\infty} E[|U_k|^2] < \infty$. Then (Exercise 5.4.5): For all $t \in \mathbb{R}$, the series in the right-hand side of

$$X(t) = \sum_{k=1}^{\infty} U_k \cos(2\pi \nu_k t + \Phi_k) ,$$

where the $\nu_k$'s are real numbers (frequencies), is convergent in $L^2_{\mathbb{C}}(P)$ and defines a centered wss stochastic process (called a harmonic process) with covariance function

$$C(\tau) = \sum_{k=1}^{\infty} \frac{1}{2} E[|U_k|^2] \cos(2\pi \nu_k \tau) .$$
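The factor 1/2 in the covariance of a harmonic process comes from averaging a product of cosines over the uniform phase. The sketch below checks this for a single component by replacing the expectation over the phase with an average over equally spaced phases; this discretization and the function names are assumptions made for illustration only, not the computation asked for in Exercise 5.4.5.

```python
import math


def phase_averaged_product(nu, t, tau, n_phases=64):
    """Average of cos(2*pi*nu*(t+tau) + phi) * cos(2*pi*nu*t + phi) over
    n_phases equally spaced phases phi in [0, 2*pi).

    This stands in for the expectation over a phase uniform on [0, 2*pi].
    """
    total = 0.0
    for j in range(n_phases):
        phi = 2.0 * math.pi * j / n_phases
        total += (math.cos(2.0 * math.pi * nu * (t + tau) + phi)
                  * math.cos(2.0 * math.pi * nu * t + phi))
    return total / n_phases


def harmonic_covariance(powers, freqs, tau):
    """C(tau) = sum_k (1/2) * E[|U_k|^2] * cos(2*pi*nu_k*tau) for finitely many
    components; powers[k] plays the role of E[|U_k|^2]."""
    return sum(0.5 * p * math.cos(2.0 * math.pi * nu * tau)
               for p, nu in zip(powers, freqs))


if __name__ == "__main__":
    # The phase average depends on tau only, not on t (wide-sense stationarity).
    print(phase_averaged_product(1.5, 0.3, 0.2))
    print(phase_averaged_product(1.5, 7.0, 0.2))
    print(0.5 * math.cos(2.0 * math.pi * 1.5 * 0.2))
```

The three printed values agree up to rounding, which reflects that the phase average equals (1/2) cos(2πντ) whatever the value of t.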

Recall the definition of the correlation coefficient $\rho$ between two non-trivial real square-integrable random variables $X$ and $Y$ with respective means $m_X$ and $m_Y$ and respective variances $\sigma_X^2$ and $\sigma_Y^2$:
$$\rho = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}.$$
The variable $aX + b$ that minimizes the function $F(a, b) := E[(Y - aX - b)^2]$ is
$$\hat Y = m_Y + \frac{\mathrm{cov}(X, Y)}{\sigma_X^2}(X - m_X)$$
and moreover
$$E[(\hat Y - Y)^2] = (1 - \rho^2)\,\sigma_Y^2$$
(Exercise 3.4.43). This variable is called the best linear-quadratic estimate of $Y$ given $X$, or the linear regression of $Y$ on $X$.

For a wss stochastic process with covariance function $C$, the function
$$\rho(\tau) = \frac{C(\tau)}{C(0)}$$
is called the autocorrelation function. It is in fact, for any $t$, the correlation coefficient between $X(t)$ and $X(t + \tau)$. In particular, the best linear-quadratic estimate of $X(t + \tau)$ given $X(t)$ is
$$\hat X(t + \tau \mid t) := m + \rho(\tau)(X(t) - m).$$
The estimation error is then, according to the above,
$$E[(\hat X(t + \tau \mid t) - X(t + \tau))^2] = \sigma_X^2\,(1 - \rho(\tau)^2).$$

Remark 5.1.21 In the continuous time case, this shows that if the support of the covariance function is concentrated around τ = 0, the process tends to be “unpredictable”. We shall come back to this when we introduce the notion of white noise.
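The error formula for the best linear estimate can be illustrated by simulation. The sketch below draws pairs playing the role of (X(t), X(t + τ)) with a prescribed mean, variance and correlation, and compares the empirical mean square error of the predictor m + ρ(τ)(X(t) − m) with σ²(1 − ρ(τ)²). The Gaussian model and the numerical values are assumptions made for this illustration only; the formula itself requires only second-order properties.

```python
import math
import random

def simulate_pairs(rho, sigma, m, n, seed=0):
    """Draw n pairs (X(t), X(t+tau)) with common mean m, common variance
    sigma**2 and correlation rho (Gaussian pairs, an assumption of this sketch)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x_now = m + sigma * z1
        x_later = m + sigma * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        pairs.append((x_now, x_later))
    return pairs

def prediction_mse(pairs, rho, m):
    """Empirical mean square error of the predictor m + rho * (X(t) - m)."""
    return sum((m + rho * (x - m) - y) ** 2 for x, y in pairs) / len(pairs)

rho, sigma, m = 0.6, 2.0, 1.0          # hypothetical values
pairs = simulate_pairs(rho, sigma, m, n=200_000)
empirical = prediction_mse(pairs, rho, m)
theoretical = sigma ** 2 * (1.0 - rho ** 2)
print(empirical, theoretical)
```

With ρ close to 0 the error approaches the full variance σ², which is the sense in which a process with a sharply concentrated covariance function is hard to predict.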

5.1.3 Gaussian Processes

This particular type of stochastic process is an important one for many reasons, for instance: (1) because of its mathematical tractability due in particular to the stability of the Gaussianity of stochastic processes by linear transformations and limits, (2) because of its ubiquity due to the many forms of the central limit theorem for stochastic processes, (3) because the most famous example of a Gaussian process, Brownian motion (Chapter 11), is the basis of a very productive stochastic calculus (Chapter 14). Gaussian processes will at this point serve to substantiate the definitions and theoretical results of this chapter.

Definition 5.1.22 Let $T$ be an arbitrary index set. The real-valued stochastic process $\{X(t)\}_{t\in T}$ is called a Gaussian process if for all $n \ge 1$ and for all $t_1, \ldots, t_n \in T$, the random vector $(X(t_1), \ldots, X(t_n))$ is Gaussian.

In particular, its characteristic function is given by the formula
$$E\left[\exp\left\{ i \sum_{j=1}^{n} u_j X(t_j) \right\}\right] = \exp\left\{ i \sum_{j=1}^{n} u_j m(t_j) - \frac{1}{2} \sum_{j=1}^{n}\sum_{k=1}^{n} u_j u_k \Gamma(t_j, t_k) \right\}, \tag{5.5}$$
where $u_1, \ldots, u_n \in \mathbb R$, $m(t) := E[X(t)]$ and $\Gamma(t, s) := E[(X(t) - m(t))(X(s) - m(s))]$. The next result is an existence theorem.


Theorem 5.1.23 Let $\Gamma : T^2 \to \mathbb R$ be a non-negative definite function, that is, such that for all $t_1, \ldots, t_k \in T$ and all $u_1, \ldots, u_k \in \mathbb R$,
$$\sum_{i=1}^{k}\sum_{j=1}^{k} u_i u_j \Gamma(t_i, t_j) \ge 0. \tag{5.6}$$
Then, there exists a centered Gaussian process with covariance $\Gamma$.

Proof. By Theorem 3.2.5, for any $k \in \mathbb N_+$, any $t_1, \ldots, t_k \in T$, there exists a centered Gaussian vector with covariance matrix $\{\Gamma(t_i, t_j)\}_{1\le i,j\le k}$. Let $Q_{t_1,\ldots,t_k}$ be the probability distribution of this vector. The family $\{Q_{t_1,\ldots,t_k} ; t_1, \ldots, t_k \in T\}$ is obviously compatible and therefore, by Kolmogorov’s theorem (Theorem 5.1.7), a centered Gaussian process with covariance $\Gamma$ exists and is unique in distribution. □

Example 5.1.24: A Gaussian field on $\mathbb R^d$. Let $\mu$ be a locally finite measure on $\mathbb R^d$ and let $\mathcal B_b(\mathbb R^d)$ be the collection of bounded Borel sets of $\mathbb R^d$. There exists a unique (distributionwise) centered Gaussian process $\{X(A)\}_{A\in\mathcal B_b(\mathbb R^d)}$ with covariance function
$$\Gamma(A, B) = \mu(A \cap B) \qquad (A, B \in \mathcal B_b(\mathbb R^d)).$$
To prove this it suffices to verify condition (5.6). This is done by observing that
$$\sum_{i=1}^{k}\sum_{j=1}^{k} u_i u_j \Gamma(A_i, A_j) = \sum_{i=1}^{k}\sum_{j=1}^{k} u_i u_j \mu(A_i \cap A_j) = \int_{\mathbb R^d} \Big( \sum_{j=1}^{k} u_j 1_{A_j}(x) \Big)^2 \mu(dx) \ge 0.$$
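The identity used in Example 5.1.24 can be checked on a small case. The sketch below assumes d = 1, takes μ to be the Lebesgue measure and the sets A_j to be bounded intervals (all hypothetical choices), and compares the double sum in (5.6) with the integral of the squared step function.

```python
def overlap_length(a, b):
    """Lebesgue measure of the intersection of the intervals a and b."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def quadratic_form(u, sets):
    """Double sum of u_i u_j mu(A_i ∩ A_j), the left-hand side of (5.6)."""
    n = len(sets)
    return sum(u[i] * u[j] * overlap_length(sets[i], sets[j])
               for i in range(n) for j in range(n))

def integral_of_square(u, sets):
    """Integral of (sum_j u_j 1_{A_j}(x))^2 dx.

    The integrand is constant between consecutive interval endpoints,
    so the integral is an exact finite sum.
    """
    points = sorted({p for s in sets for p in s})
    total = 0.0
    for left, right in zip(points, points[1:]):
        mid = 0.5 * (left + right)
        value = sum(uj for uj, (lo, hi) in zip(u, sets) if lo < mid < hi)
        total += value ** 2 * (right - left)
    return total

# Hypothetical intervals A_1, ..., A_4 and coefficients u_1, ..., u_4.
SETS = [(0.0, 2.0), (1.0, 3.0), (2.5, 4.0), (0.5, 1.5)]
U = [1.0, -2.0, 0.5, 3.0]
LHS = quadratic_form(U, SETS)
RHS = integral_of_square(U, SETS)
print(LHS, RHS)
```

Both quantities agree and are non-negative whatever the signs of the coefficients, which is exactly what condition (5.6) requires.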

Theorem 5.1.25 For a Gaussian process with index set $T = \mathbb R^d$ or $\mathbb Z^d$ ($d \in \mathbb N_+$) to be stationary, it is necessary and sufficient that for some real number $m$ and some function $C : T \to \mathbb R$, $m(t) = m$ and $\Gamma(t, s) = C(t - s)$ for all $s, t \in T$.

Proof. The necessity is obvious, whereas the sufficiency is proved by replacing the $t_\ell$’s in (5.5) by $t_\ell + h$ to obtain the characteristic function of $(X(t_1 + h), \ldots, X(t_n + h))$, namely,
$$\exp\left\{ i \sum_{j=1}^{n} u_j m - \frac{1}{2} \sum_{j=1}^{n}\sum_{k=1}^{n} u_j u_k C(t_j - t_k) \right\},$$
and then observing that this quantity is independent of $h$. □



Gaussian Subspaces

With any second-order stochastic process is associated a Hilbert subspace of the Hilbert space of square-integrable random variables. More precisely: let $\{X_i\}_{i\in I}$, where $I$ is an arbitrary index set, be a collection of complex (resp., real) random variables in $L^2_{\mathbb C}(P)$ (resp., $L^2_{\mathbb R}(P)$). The Hilbert subspace of $L^2_{\mathbb C}(P)$ (resp., $L^2_{\mathbb R}(P)$) consisting of the closure in $L^2_{\mathbb C}(P)$ (resp., $L^2_{\mathbb R}(P)$) of the vector space of finite linear complex (resp., real) combinations of elements of $\{X_i\}_{i\in I}$ is called the complex (resp., real) Hilbert subspace generated by $\{X_i\}_{i\in I}$ and is denoted by $H_{\mathbb C}(X_i, i \in I)$ (resp., $H_{\mathbb R}(X_i, i \in I)$).

A collection $\{X_i\}_{i\in I}$ of real random variables defined on the same probability space, where $I$ is an arbitrary index set, is called a Gaussian family if for every finite set of indices $i_1, \ldots, i_k \in I$, the random vector $(X_{i_1}, \ldots, X_{i_k})$ is Gaussian.

Definition 5.1.26 A Hilbert subspace $G$ of the real Hilbert space $L^2_{\mathbb R}(P)$ is called a Gaussian (Hilbert) subspace if it is a Gaussian family.

Theorem 5.1.27 Let $\{X_i\}_{i\in I}$, where $I$ is an arbitrary index set, be a Gaussian family of random variables of $L^2_{\mathbb R}(P)$. Then the Hilbert subspace $H_{\mathbb R}(X_i, i \in I)$ generated by $\{X_i\}_{i\in I}$ is a Gaussian subspace of $L^2_{\mathbb R}(P)$.

Proof. By definition, the Hilbert subspace $H_{\mathbb R}(X_i, i \in I)$ consists of all the random variables in $L^2_{\mathbb R}(P)$ that are limits in quadratic mean of finite linear combinations of elements of the family $\{X_i\}_{i\in I}$. To prove the announced result, it suffices to show that if $\{Z_n\}_{n\ge 1}$, where $Z_n = (Z_n^{(1)}, \ldots, Z_n^{(m)})$, is a sequence of Gaussian random vectors of fixed dimension $m$ that converges componentwise in quadratic mean to some vector $Z = (Z^{(1)}, \ldots, Z^{(m)})$ of $H_{\mathbb R}(X_i, i \in I)$, then the latter vector is Gaussian. By continuity of the inner product in $L^2_{\mathbb R}(P)$, for all $1 \le i, j \le m$, $\lim_{n\uparrow\infty} E[Z_n^{(i)} Z_n^{(j)}] = E[Z^{(i)} Z^{(j)}]$ and $\lim_{n\uparrow\infty} E[Z_n^{(i)}] = E[Z^{(i)}]$, that is
$$\lim_{n\uparrow\infty} m_{Z_n} = m_Z, \qquad \lim_{n\uparrow\infty} \Gamma_{Z_n} = \Gamma_Z,$$
and in particular, for all $u \in \mathbb R^m$,
$$\lim_{n\uparrow\infty} E\left[e^{iu^T Z_n}\right] = \lim_{n\uparrow\infty} e^{iu^T m_{Z_n} - \frac{1}{2} u^T \Gamma_{Z_n} u} = e^{iu^T m_Z - \frac{1}{2} u^T \Gamma_Z u}.$$
The sequence $\{u^T Z_n\}_{n\ge 1}$ converges to $u^T Z$ in quadratic mean, and therefore in distribution (Theorem 4.5.6). In particular, $\lim_{n\uparrow\infty} E[e^{iu^T Z_n}] = E[e^{iu^T Z}]$, and finally
$$E\left[e^{iu^T Z}\right] = e^{iu^T m_Z - \frac{1}{2} u^T \Gamma_Z u} \qquad (u \in \mathbb R^m),$$
which shows that $Z$ is Gaussian. □

5.2 Random Processes as Random Functions

5.2.1 Versions and Modifications

For each ω ∈ Ω, the function t ∈ T → X(t, ω) ∈ E is called the ω-trajectory, or ω-sample path, of the stochastic process {X(t)}t∈T . A stochastic process can be viewed as a random function, associating to each ω ∈ Ω the trajectory t → X(t, ω). When the state space E is some Rm and the index set is R, for fixed ω ∈ Ω, we can then discuss the continuity properties of the associated sample path. For example, if for all ω ∈ Ω the ω-sample path is right-continuous, we call this stochastic process right-continuous. It is called P -a.s. right-continuous if the ω-sample paths are right-continuous for all ω ∈ Ω, except perhaps for ω ∈ N , where N is a P -negligible set. One defines similarly (P -a.s.) left-continuity, (P -a.s.) continuity, etc.


Definition 5.2.1 Two stochastic processes $\{X(t)\}_{t\in T}$ and $\{Y(t)\}_{t\in T}$ defined on the same probability space $(\Omega, \mathcal F, P)$ are said to be versions of one another if $P(\{\omega ; X(t, \omega) \ne Y(t, \omega)\}) = 0$ for all $t \in T$. They are said to be undistinguishable if $P(\{\omega ; X(t, \omega) = Y(t, \omega) \text{ for all } t \in T\}) = 1$, that is, if they have identical trajectories except on a $P$-null set.

Clearly two undistinguishable processes are versions of one another.

Example 5.2.2: Two Distinguishable Versions. The two processes
$$X(t) = 1 \quad (t \in [0, 1]) \qquad\text{and}\qquad Y(t) = 1_{\{U \ne t\}} \quad (t \in [0, 1]),$$
where $U$ is a random variable uniformly distributed on $[0, 1]$, have the same distribution and, since $P(X(t) \ne Y(t)) = P(U = t) = 0$ for every $t$, are versions of one another, but they are not undistinguishable: their trajectories differ at $t = U$.

It is useful to find conditions bearing only on the finite-dimensional distributions and guaranteeing that the sample paths have certain desired properties. This is not always feasible but, in certain cases, there is a version possessing the desired properties. One such result is Kolmogorov’s continuity theorem below.

5.2.2 Kolmogorov’s Continuity Condition

Theorem 5.2.3 Let $(E, d)$ be a complete metric space. Let $\{X(t)\}_{t\in[0,1]}$ be a stochastic process with values in $E$. Suppose that for some positive real numbers $\alpha, \beta, K$,
$$E[d(X(t), X(s))^{\alpha}] \le K|t - s|^{1+\beta},$$
for all $s, t \in [0, 1)$. Then there exists a version of this stochastic process whose sample paths are almost surely continuous.

Proof. We will encounter in the proof the following subsets of $[0, 1]$: the set $D$ of dyadic rationals of $[0, 1]$ and for each $n \ge 1$ the set $D_n := \{k2^{-n} ; k = 0, \ldots, 2^n\}$. The countable set $D$ is dense in $[0, 1]$.

Step 1. The original process is continuous in probability, that is, for all $\varepsilon > 0$,
$$\lim_{t_n \to t} P(d(X(t_n), X(t)) \ge \varepsilon) = 0.$$
This follows from Markov’s inequality:
$$P(d(X(t_n), X(t)) \ge \varepsilon) = P(d(X(t_n), X(t))^{\alpha} \ge \varepsilon^{\alpha}) \le \frac{E[d(X(t_n), X(t))^{\alpha}]}{\varepsilon^{\alpha}} \le \frac{K|t_n - t|^{1+\beta}}{\varepsilon^{\alpha}}. \tag{$\star$}$$


Step 2. The original process is uniformly continuous on $D$. To see this, let $\gamma \in (0, \beta/\alpha)$ and obtain from the Markov bound of Step 1, applied with $\varepsilon = 2^{-\gamma n}$, that
$$P\big(d(X(k2^{-n}), X((k+1)2^{-n})) \ge 2^{-\gamma n}\big) \le K2^{-n(1+\beta-\alpha\gamma)}.$$
In particular, with
$$A_n := \Big\{ \max_{1\le k\le 2^n} d(X(k2^{-n}), X((k-1)2^{-n})) \ge 2^{-\gamma n} \Big\}$$
and by sub-σ-additivity,
$$P(A_n) \le \sum_{k=1}^{2^n} P\big(d(X(k2^{-n}), X((k-1)2^{-n})) \ge 2^{-\gamma n}\big) \le 2^n K 2^{-n(1+\beta-\alpha\gamma)} = K2^{-n(\beta-\alpha\gamma)},$$
and therefore, since $\beta - \alpha\gamma > 0$,
$$\sum_{n} P(A_n) < \infty.$$
By the Borel–Cantelli lemma, there exists a $P$-negligible subset $\mathcal N$ and an almost surely finite random integer $N$ such that outside $\mathcal N$, for $n \ge N$ and $k = 1, \ldots, 2^n$,
$$d(X(k2^{-n}), X((k-1)2^{-n})) < 2^{-\gamma n}.$$
In particular,
$$K_\gamma := \sup_{n\ge 1}\ \sup_{1\le k\le 2^n} \frac{d(X((k-1)2^{-n}), X(k2^{-n}))}{2^{-\gamma n}} \tag{$\dagger$}$$
is an almost surely finite random variable. It will follow from this (Lemma 5.2.4 below) that for all $s, t \in D$, almost surely
$$d(X(t), X(s)) \le C_\gamma |t - s|^{\gamma}, \tag{5.7}$$
where
$$C_\gamma := \frac{2}{1 - 2^{-\gamma}} K_\gamma.$$
Therefore, almost surely, the mapping $t \mapsto X(t)$ is continuous, and in fact uniformly continuous, on $D$.

Step 3. Let $Y(t) = 0$ on $\mathcal N$ and, outside $\mathcal N$, let $Y(t) = X(t)$ for $t \in D$, and
$$Y(t) = \lim_{t_n \to t,\ t_n \in D} X(t_n) \qquad (t \notin D).$$
Outside $\mathcal N$, the function $t \in [0, 1) \mapsto Y(t)$ is a continuous extension of the uniformly continuous function $t \in D \mapsto X(t)$ to $[0, 1)$. Indeed, for any $s, t \in [0, 1)$, there exist a sequence in $D$, $t_k \to t$, and a sequence in $D$, $s_k \to s$, such that $Y(t) = \lim_{t_k \to t} X(t_k)$ and $Y(s) = \lim_{s_k \to s} X(s_k)$, so that
$$d(Y(t), Y(s)) \le d(Y(t), X(t_k)) + d(X(t_k), X(s_k)) + d(X(s_k), Y(s)).$$
One can choose the $t_k$’s and the $s_k$’s inside the interval $[s, t]$. With any $\varepsilon > 0$ one can then associate $\delta$ such that if $|t - s| \le \delta$, the middle term of the right-hand side is less than $\varepsilon/3$ whatever $k$ (by the uniform continuity of $t \in D \mapsto X(t)$) and such that the extreme terms are less than $\varepsilon/3$ (by an appropriate choice of $s_k$ and $t_k$, by construction of $t \in [0, 1) \mapsto Y(t)$) and therefore finally $d(Y(t), Y(s)) \le \varepsilon$. Therefore, outside $\mathcal N$, $t \in [0, 1) \mapsto Y(t)$ is a continuous function.

Step 4. We now show that $\{Y(t)\}_{t\in[0,1)}$ is a version of $\{X(t)\}_{t\in[0,1)}$, that is, $P(X(t) = Y(t)) = 1$ for all $t \in [0, 1)$. This follows from the fact that $\{X(t)\}_{t\in[0,1)}$ is continuous in probability and that limits in probability and almost-sure limits coincide when both exist.

Step 5. It remains to prove the inequality (5.7). This follows from Lemma 5.2.4 below. □

Lemma 5.2.4 Let $f$ be a mapping from $D$ to the metric space $(E, d)$. Suppose that there exists a finite constant $K$ such that for all $n \ge 1$ and $k = 1, \ldots, 2^n$,
$$d(f(k2^{-n}), f((k-1)2^{-n})) < K2^{-\gamma n}.$$
Then for all $s, t \in D$,
$$d(f(t), f(s)) \le \frac{2}{1 - 2^{-\gamma}} K|t - s|^{\gamma}.$$

Proof. Let $s, t \in D$, $s < t$. Let $p$ be the smallest integer such that $2^{-p} \le t - s$. Let $k$ be the smallest integer such that $k2^{-p} \ge s$. Then it is possible to write
$$s = k2^{-p} - \varepsilon_1 2^{-p-1} - \cdots - \varepsilon_\ell 2^{-p-\ell}, \qquad t = k2^{-p} + \varepsilon'_1 2^{-p-1} + \cdots + \varepsilon'_m 2^{-p-m}$$
for some non-negative integers $\ell$, $m$ and $\varepsilon$’s and $\varepsilon'$’s taking the values 0 or 1. Define
$$s_i = k2^{-p} - \varepsilon_1 2^{-p-1} - \cdots - \varepsilon_i 2^{-p-i} \quad (0 \le i \le \ell), \qquad t_j = k2^{-p} + \varepsilon'_1 2^{-p-1} + \cdots + \varepsilon'_j 2^{-p-j} \quad (0 \le j \le m).$$
Then, observing that $s = s_\ell$ and $t = t_m$, the triangle inequality gives
$$\begin{aligned}
d(f(s), f(t)) = d(f(s_\ell), f(t_m)) &\le d(f(s_0), f(t_0)) + \sum_{i=1}^{\ell} d(f(s_{i-1}), f(s_i)) + \sum_{j=1}^{m} d(f(t_{j-1}), f(t_j)) \\
&\le K2^{-p\gamma} + \sum_{i=1}^{\ell} K2^{-(p+i)\gamma} + \sum_{j=1}^{m} K2^{-(p+j)\gamma} \\
&\le 2K(1 - 2^{-\gamma})^{-1} 2^{-p\gamma} \le 2K(1 - 2^{-\gamma})^{-1} (t - s)^{\gamma}.
\end{aligned}$$
□

Remark 5.2.5 Going back to Example 5.1.5, Kolmogorov’s theorem guarantees the existence of an integer-valued process $\{N(t)\}_{t\ge 0}$ with the fidi distribution described in this example. It does not say, however, that it is a counting process in the sense of Example 5.1.3. It turns out that such a version exists, as we shall see in Chapter 8, where a completely different definition is given.
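The statement of Lemma 5.2.4 can be tested on a concrete function. The sketch below uses the illustrative choice f(t) = √t on the dyadic rationals of [0, 1] with γ = 1/2 (an assumption made here, not an example from the text). It computes the largest normalized dyadic increment up to a finite level, in the spirit of the constant K_γ of Step 2, and then checks the bound of the lemma for all pairs of dyadic points of that level.

```python
import math

def dyadic_points(level):
    """The set D_level = {k 2^-level ; k = 0, ..., 2^level}."""
    return [k / 2 ** level for k in range(2 ** level + 1)]

def increment_constant(f, gamma, max_level):
    """Largest ratio |f(k 2^-n) - f((k-1) 2^-n)| / 2^(-gamma n), 1 <= n <= max_level."""
    best = 0.0
    for n in range(1, max_level + 1):
        step = 2.0 ** (-n)
        for k in range(1, 2 ** n + 1):
            ratio = abs(f(k * step) - f((k - 1) * step)) / step ** gamma
            best = max(best, ratio)
    return best

def holder_bound_holds(f, gamma, max_level):
    """Check |f(t) - f(s)| <= 2 K / (1 - 2^-gamma) |t - s|^gamma on D_max_level."""
    k_gamma = increment_constant(f, gamma, max_level)
    c_gamma = 2.0 * k_gamma / (1.0 - 2.0 ** (-gamma))
    pts = dyadic_points(max_level)
    # A tiny slack absorbs floating-point rounding.
    return all(abs(f(t) - f(s)) <= c_gamma * abs(t - s) ** gamma + 1e-12
               for s in pts for t in pts)

GAMMA = 0.5        # Hoelder exponent of the square root
MAX_LEVEL = 7      # finite truncation of the supremum over n
K_GAMMA = increment_constant(math.sqrt, GAMMA, MAX_LEVEL)
BOUND_OK = holder_bound_holds(math.sqrt, GAMMA, MAX_LEVEL)
print(K_GAMMA, BOUND_OK)
```

Only finitely many dyadic levels are examined, so this is a consistency check rather than a verification of the hypothesis for all n.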


5.3 Measurability Issues

5.3.1 Measurable Processes and their Integrals

Viewing a stochastic process as a mapping $X : T \times \Omega \to E$, defined by $(t, \omega) \mapsto X(t, \omega)$, opens the way to various measurability concepts.

Definition 5.3.1 The stochastic process $\{X(t)\}_{t\in\mathbb R}$ is said to be measurable iff the mapping from $\mathbb R \times \Omega$ into $E$ defined by $(t, \omega) \mapsto X(t, \omega)$ is measurable with respect to $\mathcal B(\mathbb R) \otimes \mathcal F$ and $\mathcal E$.

In particular, for any $\omega \in \Omega$ the mapping $t \mapsto X(t, \omega)$ is measurable with respect to the σ-fields $\mathcal B(\mathbb R)$ and $\mathcal E$ (Lemma 2.3.6). Also, if $E = \mathbb R$ and if $X(t)$ is non-negative, one can define the Lebesgue integral
$$\int_{\mathbb R} X(t, \omega)\,dt$$
for each $\omega \in \Omega$, and also apply Tonelli’s theorem (Theorem 2.3.9) to obtain
$$E\left[\int_{\mathbb R} X(t)\,dt\right] = \int_{\mathbb R} E[X(t)]\,dt.$$
By Fubini’s theorem, the last equality also holds true for measurable stochastic processes of arbitrary sign such that $\int_{\mathbb R} E[|X(t)|]\,dt < \infty$.

Theorem 5.3.2 Let $\{X(t)\}_{t\in\mathbb R}$ be a second-order complex-valued measurable stochastic process with mean function $m$ and covariance function $\Gamma$. Let $f : \mathbb R \to \mathbb C$ be an integrable function such that
$$\int_{\mathbb R} |f(t)|\,E[|X(t)|]\,dt < \infty. \tag{5.8}$$
Then the integral $\int_{\mathbb R} f(t)X(t)\,dt$ is almost surely well defined and
$$E\left[\int_{\mathbb R} f(t)X(t)\,dt\right] = \int_{\mathbb R} f(t)m(t)\,dt.$$
Suppose in addition that $f$ satisfies the condition
$$\int_{\mathbb R} |f(t)|\,|\Gamma(t, t)|^{1/2}\,dt < \infty \tag{5.9}$$
and let $g : \mathbb R \to \mathbb C$ be a function with the same properties as $f$. Then $\int_{\mathbb R} f(t)X(t)\,dt$ is square-integrable and
$$\mathrm{cov}\left(\int_{\mathbb R} f(t)X(t)\,dt,\ \int_{\mathbb R} g(t)X(t)\,dt\right) = \int_{\mathbb R}\int_{\mathbb R} f(t)g^*(s)\Gamma(t, s)\,dt\,ds.$$

Proof. By Tonelli’s theorem
$$E\left[\int_{\mathbb R} |f(t)||X(t)|\,dt\right] = \int_{\mathbb R} |f(t)|\,E[|X(t)|]\,dt < \infty$$
and therefore, almost surely $\int_{\mathbb R} |f(t)||X(t)|\,dt < \infty$, so that almost surely the integral $\int_{\mathbb R} f(t)X(t)\,dt$ is well defined and finite. Also (Fubini)
$$E\left[\int_{\mathbb R} f(t)X(t)\,dt\right] = \int_{\mathbb R} E[f(t)X(t)]\,dt = \int_{\mathbb R} f(t)E[X(t)]\,dt.$$
Suppose now (without loss of generality) that the process is centered. By Tonelli’s theorem
$$E\left[\left(\int_{\mathbb R} |f(t)||X(t)|\,dt\right)\left(\int_{\mathbb R} |g(t)||X(t)|\,dt\right)\right] = \int_{\mathbb R}\int_{\mathbb R} |f(t)||g(s)|\,E[|X(t)||X(s)|]\,dt\,ds.$$
But (Schwarz’s inequality) $E[|X(t)||X(s)|] \le |\Gamma(t, t)|^{1/2}\,|\Gamma(s, s)|^{1/2}$, and therefore the right-hand side of the last equality is bounded by
$$\left(\int_{\mathbb R} |f(t)||\Gamma(t, t)|^{1/2}\,dt\right)\left(\int_{\mathbb R} |g(s)||\Gamma(s, s)|^{1/2}\,ds\right) < \infty.$$
One may therefore apply Fubini’s theorem to obtain
$$E\left[\left(\int_{\mathbb R} f(t)X(t)\,dt\right)\left(\int_{\mathbb R} g(t)X(t)\,dt\right)^*\right] = \int_{\mathbb R}\int_{\mathbb R} f(t)g^*(s)\,E[X(t)X(s)^*]\,dt\,ds.$$
□

Remark 5.3.3 Since $E[|X(t)|] \le E[1 + |X(t)|^2] = 1 + \Gamma(t, t)$, condition (5.8) is satisfied if $f$ is an integrable function such that $\int_{\mathbb R} |f(t)|\Gamma(t, t)\,dt < \infty$.

5.3.2 Histories and Stopping Times

In the following, the index set $T$ is any of the following: $\mathbb R$, $\mathbb R_+$, $\mathbb N$, $\mathbb Z$.

Definition 5.3.4 Let $(\Omega, \mathcal F)$ be a measurable space. The family $\{\mathcal F_t\}_{t\in T}$ of sub-σ-fields of $\mathcal F$ is called a history (or filtration) on $(\Omega, \mathcal F)$ if for all $s, t \in T$ such that $s \le t$, $\mathcal F_s \subseteq \mathcal F_t$.

In other words, a history is a non-decreasing family of sub-σ-fields of $\mathcal F$ indexed by $T$. In applications $\mathcal F_t$ often represents the information available at time $t$ to an observer. The σ-field $\mathcal F_\infty := \vee_{t\in T} \mathcal F_t$ is, by definition, the smallest σ-field that contains $\mathcal F_t$ for all $t \in T$.

Definition 5.3.5 Let $\{X(t)\}_{t\in T}$ be a stochastic process defined on $(\Omega, \mathcal F)$. The history $\{\mathcal F_t^X\}_{t\in T}$ defined by $\mathcal F_t^X = \sigma(X(s) ; s \le t)$ is called the internal history of $\{X(t)\}_{t\in T}$. Any history $\{\mathcal F_t\}_{t\in T}$ such that
$$\mathcal F_t \supseteq \mathcal F_t^X \qquad (t \in T)$$
is called a history of $\{X(t)\}_{t\in T}$. The stochastic process $\{X(t)\}_{t\in T}$ is then said to be adapted to the history $\{\mathcal F_t\}_{t\in T}$, or $\mathcal F_t$-adapted.

Definition 5.3.6 Let $T = \mathbb R$ or $\mathbb R_+$. Define for all $t \in T$
$$\mathcal F_{t+} := \cap_{s>t} \mathcal F_s.$$
The history $\{\mathcal F_t\}_{t\in T}$ is called right-continuous if for all $t \in T$, $\mathcal F_t = \mathcal F_{t+}$.


Progressive Measurability

Definition 5.3.7 A stochastic process $\{X(t)\}_{t\in\mathbb R_+}$ taking its values in the measurable space $(E, \mathcal E)$ is said to be $\mathcal F_t$-progressively measurable if for all $t \in \mathbb R_+$ the mapping $(s, \omega) \mapsto X(s, \omega)$ from $[0, t] \times \Omega$ into $E$ is measurable with respect to the σ-fields $\mathcal B([0, t]) \otimes \mathcal F_t$ and $\mathcal E$.

An $\mathcal F_t$-progressively measurable process $\{X(t)\}_{t\in\mathbb R_+}$ is then $\mathcal F_t$-adapted and measurable.

Theorem 5.3.8 Let $\{X(t)\}_{t\in\mathbb R_+}$ be a stochastic process, taking its values in a topological space $E$ endowed with its Borel σ-field $\mathcal E = \mathcal B(E)$, adapted to $\{\mathcal F_t\}_{t\in\mathbb R_+}$ and right-continuous (resp., left-continuous). Then $\{X(t)\}_{t\in\mathbb R_+}$ is $\mathcal F_t$-progressively measurable.

Proof. Let $t$ be a non-negative real number. For all $n \ge 0$ and all $s \in [0, t)$, let
$$X_n(s) := \sum_{k=0}^{2^n - 1} X((k + 1)2^{-n}t)\, 1_{[k2^{-n}t,\,(k+1)2^{-n}t)}(s),$$
and let $X_n(t) := X(t)$. This defines a function $[0, t] \times \Omega \to E$ which is measurable with respect to $\mathcal B([0, t]) \otimes \mathcal F_t$ and $\mathcal E$. If $t \mapsto X(t, \omega)$ is right-continuous, $X(s, \omega)$ is the limit of $X_n(s, \omega)$ for all $(s, \omega) \in [0, t] \times \Omega$, and therefore $(s, \omega) \mapsto X(s, \omega)$ is measurable with respect to $\mathcal B([0, t]) \otimes \mathcal F_t$ and $\mathcal E$ as a function of $[0, t] \times \Omega$ into $E$. The case of a left-continuous process is treated in a similar way. □

Theorem 5.3.9 If the non-negative stochastic process $\{X(t)\}_{t\in\mathbb R_+}$ is $\mathcal F_t$-progressively measurable, then, for each $t \in \mathbb R_+$, the random variable $\int_{(0,t]} X(s, \omega)\,ds$ is $\mathcal F_t$-measurable.

The proof is left as Exercise 5.4.2.

Stopping Times

A principal notion in the theory of stochastic processes is that of stopping time. In this subsection, the index set is $T = \mathbb N$ or $\mathbb R_+$, and $\overline T := T \cup \{+\infty\}$.

Definition 5.3.10 Let $\{\mathcal F_t\}_{t\in T}$ be a history. A $\overline T$-valued random variable $\tau$ is called an $\mathcal F_t$-stopping time iff for all $t \in T$,
$$\{\tau \le t\} \in \mathcal F_t.$$

Remark 5.3.11 In the case $T = \mathbb R_+$, the condition
$$\{\tau < t\} \in \mathcal F_t \qquad (t \ge 0)$$
does not guarantee that $\tau$ is an $\mathcal F_t$-stopping time. But since
$$\{\tau \le t\} = \cap_n \{\tau < t + \tfrac{1}{n}\} \in \cap_{s>t} \mathcal F_s = \mathcal F_{t+},$$
we have that $\tau$ is an $\mathcal F_{t+}$-stopping time, and therefore an $\mathcal F_t$-stopping time if the history is right-continuous.


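Example 5.3.12: Counterexample. Define a (not right-continuous) history by $\mathcal F_t = \{\emptyset, \Omega\}$ if $t \le 1$, and $\mathcal F_t = \sigma(A)$ if $t > 1$, where $A \notin \{\emptyset, \Omega\}$. The random variable $\tau := 1 + 1_A$ is such that $\{\tau < t\} \in \mathcal F_t$ for all $t \ge 0$, but it is not an $\mathcal F_t$-stopping time because $\{\tau \le 1\}$ is not in $\mathcal F_1$.

The following approximation of a stopping time by simple random variables will be of frequent use in the sequel.

Theorem 5.3.13 Let $\{\mathcal F_t\}_{t\in\mathbb R_+}$ be a history, let $\tau$ be an $\mathcal F_t$-stopping time, and let for all $n \ge 1$,
$$\tau(n, \omega) := \begin{cases} 0 & \text{if } \tau(\omega) = 0, \\ \frac{k+1}{2^n} & \text{if } \frac{k}{2^n} < \tau(\omega) \le \frac{k+1}{2^n}, \\ +\infty & \text{if } \tau(\omega) = \infty. \end{cases}$$
Then $\tau(n)$ is an $\mathcal F_t$-stopping time decreasing to $\tau$ as $n \uparrow \infty$.

Proof. In fact, for all $t \ge 0$,
$$\{\tau(n) \le t\} = \{\tau(n) = 0\} \cup \Big( \cup_{k\,;\,(k+1)2^{-n} \le t} \{\tau(n) = (k + 1)2^{-n}\} \Big) = \{\tau = 0\} \cup \Big( \cup_{k\,;\,(k+1)2^{-n} \le t} \{k2^{-n} < \tau \le (k + 1)2^{-n}\} \Big) \in \mathcal F_t.$$
The decreasing convergence to $\tau$ is obvious. □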
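The dyadic approximation of Theorem 5.3.13 amounts to rounding τ up to the next multiple of 2^{-n}. The following sketch computes τ(n) for a single deterministic value of τ, using exact rational arithmetic to avoid floating-point rounding at the interval endpoints. The function name `dyadic_upper` and the sample value 5/7 are hypothetical choices for this illustration.

```python
import math
from fractions import Fraction

def dyadic_upper(tau, n):
    """Value of tau(n): 0 if tau = 0, (k+1)/2^n if k/2^n < tau <= (k+1)/2^n,
    and +infinity if tau = +infinity."""
    if tau == math.inf:
        return math.inf
    if tau == 0:
        return Fraction(0)
    scaled = Fraction(tau) * 2 ** n
    # For tau > 0, the unique k with k < scaled <= k + 1 satisfies k + 1 = ceil(scaled).
    return Fraction(math.ceil(scaled), 2 ** n)

tau = Fraction(5, 7)                      # hypothetical value of the stopping time
approximations = [dyadic_upper(tau, n) for n in range(1, 11)]
for n, value in enumerate(approximations, start=1):
    print(n, value, float(value - tau))
```

Each approximation takes values in a countable set, which is what makes τ(n) convenient when a result is first proved for discrete stopping times and then extended by a limiting argument.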



Let {X(t)}t∈R+ be a stochastic process with values in E. For any set C ⊂ E, let τ (C) := inf{t ≥ 0 ; X(t) ∈ C} . Theorem 5.3.14 Let {X(t)}t∈R+ be a right-continuous stochastic process with values in a metric space E and adapted to the history {Ft }t≥0 . A. Let G be an open set of E. The random time τ (G) is an Ft+ -stopping time. B. Suppose moreover that {X(t)}t∈R+ has left limits for all t > 0. Let Γ be a closed set of E. Then, the random time τ (Γ) is an Ft -stopping time. Proof. A. This comes from the identity {τ (G) < t} = ∩r∈Q,r t} = {X(s) < c for all s ∈ [0, t]} is identical to % {X(kt/2n ) < c (k = 0, 1, . . . , 2n )}, n≥1

which is in Ft . The left-continuous case is similar. In the left-continuous case, suppose that for a given ω, X(τ (ω), ω) = c + ε > c. Then there exists a δ > 0 such that for all t ∈ [τ (ω) − δ, τ (ω)), X(t, ω) ≥ c + 21 ε > c, and this is in contradiction with the definition of τ . 

Remark 5.3.17 The situation depicted in the left-continuous time guarantees that the entrance time τn in [n, ∞) is such that the stopped process {X(t ∧ τn )}t∈R+ is bounded (by n). This remark will be of frequent use in the sequel. Theorem 5.3.18 Let T = N or R+ . Let {Ft }t∈T be a history. Let τ be an Ft -stopping time. The collection of events Fτ = {A ∈ F∞ | A ∩ {τ ≤ t} ∈ Ft , for all t ∈ T} is a σ-field, and τ is Fτ -measurable. Let {X(t)}t∈T be an E-valued Ft -adapted stochastic process, and let τ be a finite Ft -stopping time. Define the random variable X(τ ) by X(τ )(ω) := X(τ (ω), ω). Then, if T = N (resp., T = R+ and {X(t)}t∈R+ is Ft progressively measurable) X(τ ) is Fτ -measurable.

Proof. The verification that Fτ is a σ-field is straightforward. In order to show that τ is Fτ -measurable, it is enough to show that for all c ≥ 0, {τ ≤ c} ∈ Fτ , that is, for all t ∈ T, {τ ≤ c} ∩ {τ ≤ t} ∈ Ft . But this last event is just {τ ≤ c ∧ t} ∈ Fc∧t , by definition of an Ft -stopping time, and Fc∧t ⊆ Ft . We treat the case T = R+ . Let A ∈ E and a ≥ 0. The set {X(τ ) ∈ A} ∩ {τ ≤ a} is identical to ({X(S) ∈ A} ∩ {S < a}) ∪ ({X(a) ∈ A} ∩ {τ = a}), where S = τ ∧ a. Therefore, it suffices to prove that {X(S) ∈ A} is in Fa , as we now show. Indeed, the random variable S is an Ft -stopping time and it is also an Fa -measurable random variable. The Fa -measurability of X(S) follows from the fact that it is obtained by composition of ω → (S(ω), ω) from (Ω, Fa ) into ([0, a] × Ω, B([0, a]) ⊗ Fa ), and (s, ω) → X(s, ω) from ([0, a] × Ω, B([0, a]) ⊗ Fa ) into (E, E), which are measurable: the first by definition of Ft -stopping times and the second by definition of Ft -progressiveness. 


Theorem 5.3.19
(i) If $S$ and $T$ are $\mathcal F_t$-stopping times, then so are $S \wedge T$ and $S \vee T$.
(ii) An $\mathcal F_t$-stopping time $T$ is $\mathcal F_T$-measurable.
(iii) If $T$ is an $\mathcal F_t$-stopping time and $S$ is $\mathcal F_T$-measurable and such that $S \ge T$, then $S$ is an $\mathcal F_t$-stopping time.
(iv) If $S$ and $T$ are $\mathcal F_t$-stopping times and $A \in \mathcal F_S$, then $A \cap \{S \le T\} \in \mathcal F_T$.
(v) If $S$ and $T$ are $\mathcal F_t$-stopping times such that $S \le T$, then $\mathcal F_S \subseteq \mathcal F_T$.
(vi) If $\{T_n\}_{n\ge 1}$ is a sequence of $\mathcal F_t$-stopping times, then $\sup_n T_n$ is an $\mathcal F_t$-stopping time.

Proof. (i) and (ii) are left as exercises (Exercise 5.4.10).
(iii) By hypothesis $\{S \le t\} \in \mathcal F_T$ and therefore, by definition of $\mathcal F_T$, $\{S \le t\} \cap \{T \le t\} \in \mathcal F_t$. But, since $S \ge T$, the last intersection is just $\{S \le t\}$.
(iv) By definition of $\mathcal F_T$, we must check that
$$[A \cap \{S \le T\}] \cap \{T \le t\} \in \mathcal F_t \qquad (t \ge 0).$$
But this intersection is equal to
$$[A \cap \{S \le t\}] \cap \{T \le t\} \cap \{S \wedge t \le T \wedge t\},$$
and all the three sets therein are in $\mathcal F_t$, the first one because $A \in \mathcal F_S$, the second one because $T$ is an $\mathcal F_t$-stopping time and the last one because $S \wedge t$ and $T \wedge t$ are $\mathcal F_t$-measurable.
(v) Let $A \in \mathcal F_S$. According to (iv), $A = A \cap \{S \le T\} \in \mathcal F_T$.
(vi) Just observe that
$$\{\sup_n T_n \le t\} = \cap_n \{T_n \le t\} \in \mathcal F_t.$$
□

Remark 5.3.20 It is not true in general that $\inf_n T_n$ is an $\mathcal F_t$-stopping time. In fact, it is not always true that $\{\inf_n T_n \le t\} = \cup_n \{T_n \le t\}$. However,
$$\{\inf_n T_n < t\} = \cup_n \{T_n < t\} \in \mathcal F_t \qquad (t \ge 0),$$
and therefore, if $\{\mathcal F_t\}_{t\ge 0}$ is a right-continuous history, $\inf_n T_n$ is an $\mathcal F_t$-stopping time (Remark 5.3.11).


Theorem 5.3.21 Let $\{\mathcal F_t\}_{t\ge 0}$ be a right-continuous history and $\{T_n\}_{n\ge 1}$ be a sequence of $\mathcal F_t$-stopping times. Then
(i) $\liminf_n T_n$ and $\limsup_n T_n$ are $\mathcal F_t$-stopping times.
(ii) If $\{T_n\}_{n\ge 1}$ is decreasing, with limit $T$, then $\mathcal F_T = \cap_n \mathcal F_{T_n}$.

Proof. The proof of (i) is left as an exercise. It ensures that $T$ of (ii) is indeed an $\mathcal F_t$-stopping time. We have from (v) of Theorem 5.3.19 that $\mathcal F_T \subseteq \cap_n \mathcal F_{T_n}$. Now, let $A \in \cap_n \mathcal F_{T_n}$. Then $A \cap \{T_n < t\} \in \mathcal F_t$ ($t \ge 0$) and therefore
$$\cup_n (A \cap \{T_n < t\}) = A \cap \{T < t\} \in \mathcal F_t \qquad (t \ge 0),$$
which guarantees, because of the right-continuity hypothesis for the history and by the argument of Remark 5.3.11, that $A \cap \{T \le t\} \in \mathcal F_t$ for all $t \ge 0$, that is, $A \in \mathcal F_T$. □

Complementary reading [Meyer, 1975] (the first chapters) and [Durrett, 1996] are advanced references.

5.4 Exercises

Exercise 5.4.1. The case of iid sequences
Prove directly (without referring to Theorem 5.1.7) the last statement in Example 5.1.8.

Exercise 5.4.2. Why Progressive Measurability?
Prove that if the non-negative stochastic process $\{X(t)\}_{t\in\mathbb R_+}$ is $\mathcal F_t$-progressively measurable, then, for each $t \in \mathbb R_+$, the random variable $\int_{(0,t]} X(s)\,ds$ is $\mathcal F_t$-measurable.

Exercise 5.4.3. Wide-sense Stationary but not Stationary
Give a simple example of a discrete-time stochastic process that is wide-sense stationary, but not strictly stationary. Give a similar example in continuous time.

Exercise 5.4.4. Stationarization
Let $\{Y(t)\}_{t\in\mathbb R}$ be the stochastic process taking its values in $\{-1, +1\}$ defined by
$$Y(t) = Z \times (-1)^n \quad\text{on } (nT, (n + 1)T],$$
where $T$ is a positive real number and $Z$ is a random variable equidistributed on $\{-1, +1\}$.
(1) Show that $\{Y(t)\}_{t\in\mathbb R}$ is not a stationary (strictly or in the wide sense) stochastic process.
(2) Let now $U$ be a random variable uniformly distributed on $[0, T]$ and independent of $Z$. Define for all $t \in \mathbb R$, $X(t) = Y(t - U)$. Show that $\{X(t)\}_{t\in\mathbb R}$ is a wide-sense stationary stochastic process and compute its covariance function.


Exercise 5.4.5. A harmonic process
Let $\{U_k\}_{k\ge 1}$ be centered random variables of $L^2_{\mathbb C}(P)$ that are mutually uncorrelated. Let $\{\Phi_k\}_{k\ge 1}$ be completely random phases, that is, real random variables uniformly distributed on $[0, 2\pi]$. Suppose, moreover, that the $U$ variables are independent of the $\Phi$ variables. Finally, suppose that $\sum_{k=1}^{\infty} E[|U_k|^2] < \infty$. Prove that for all $t \in \mathbb R$, the series in the right-hand side of
$$X(t) = \sum_{k=1}^{\infty} U_k \cos(2\pi\nu_k t + \Phi_k),$$
where the $\nu_k$’s are real numbers (frequencies), is convergent in $L^2_{\mathbb C}(P)$ and defines a wss stochastic process. Give its covariance function.

Exercise 5.4.6. Just a joke
Let $\{X(t)\}_{t\in\mathbb R}$ be a centered Gaussian process, and let $t_1, t_2 \in \mathbb R$ be fixed times. Compute the probability that $X(t_1) > X(t_2)$.

Exercise 5.4.7. A clipped Gaussian process
Let $\{X(t)\}_{t\in\mathbb R}$ be a centered stationary Gaussian process with covariance function $C_X$. Define the clipped (or hard-limited) process
$$Y(t) = \operatorname{sign} X(t),$$
with the convention $\operatorname{sign} X(t) = 0$ if $X(t) = 0$ (note however that this occurs with null probability if $C_X(0) = \sigma_X^2 > 0$, which we assume to hold). Clearly this stochastic process is centered. Moreover, it is unchanged when $\{X(t)\}_{t\in\mathbb R}$ is multiplied by a positive constant. In particular, we may assume that the variance $C_X(0)$ equals 1, so that the covariance matrix of the vector $(X(0), X(\tau))^T$ is
$$\Gamma(\tau) = \begin{pmatrix} 1 & \rho_X(\tau) \\ \rho_X(\tau) & 1 \end{pmatrix},$$
where $\rho_X(\tau)$ is the correlation coefficient of $X(0)$ and $X(\tau)$. We assume that $\Gamma(\tau)$ is invertible, that is, $|\rho_X(\tau)| < 1$. Prove the following formula:
$$C_Y(\tau) = \frac{2}{\pi} \sin^{-1}\left(\frac{C_X(\tau)}{C_X(0)}\right).$$

Exercise 5.4.8. The Black and Scholes Formula
This formula concerns a certain type of financial product called the European call option. The value of a stock at time $t \ge 0$ is
$$V(t) = V(0) \exp\{\theta t + \sigma W(t)\}.$$
In particular,
$$E[V(t)] = E[V(0)] \exp\left\{\left(\theta + \tfrac{1}{2}\sigma^2\right)t\right\},$$
that is, one euro invested in this stock at time 0 will yield on average $\exp\{(\theta + \tfrac{1}{2}\sigma^2)t\}$ at time $t$. On the other hand, an investment of 1 euro in a risk-free instrument (bonds, saving account) returns $e^{rt}$ euros at time $t$, where $r$ is the fixed return rate of the risk-free investment. In a competitive market, the return rates are the same:⁴
$$\theta + \tfrac{1}{2}\sigma^2 = r.$$
Therefore
$$V(t) = V(0) \exp\left\{\left(r - \tfrac{1}{2}\sigma^2\right)t + \sigma W(t)\right\}.$$
The investor has the right to exercise the following option (the European call option). At some time $T$ in the future, called the expiration date, he can buy one share of the stock at a fixed price $K$ (the strike price) and immediately sell it at price $V(T)$ and therefore make a profit $V(T) - K$. If $V(T) < K$ he will not exercise the option and will do nothing, and therefore the profit will be
$$\max(V(T) - K, 0).$$
Of course, the investor must pay an entrance fee $C$ in order to enter the deal. The value of $C$ should be such that the return of a risk-free investment of $C$ equals the expected profit from the option:
$$C e^{rT} = E[\max(V(T) - K, 0)].$$
This is called the “no arbitrage” condition. Give an explicit formula for $C$ in terms of $r$, $\sigma$, $V(0)$, $K$ and $T$.

Exercise 5.4.9. An elementary ergodic theorem
Let $\{X(t)\}_{t\in\mathbb R}$ be a wss stochastic process with mean $m$ and covariance function $C(\tau)$. Prove that in order that
$$\lim_{T\uparrow\infty} \frac{1}{T}\int_0^T X(s)\,ds = m$$
holds in the quadratic mean, it is necessary and sufficient that
$$\lim_{T\uparrow\infty} \frac{1}{T}\int_0^T \left(1 - \frac{u}{T}\right) C(u)\,du = 0. \tag{5.10}$$
Show that this condition is satisfied in particular when the covariance function is integrable.

Exercise 5.4.10. About stopping times
Let $S$ and $T$ be $\mathcal F_t$-stopping times. Prove that (1) the events $\{S < T\}$, $\{S \le T\}$ and $\{S = T\}$ are in $\mathcal F_S \cap \mathcal F_T$, (2) $S \wedge T$ and $S \vee T$ are $\mathcal F_t$-stopping times, and (3) $T$ is $\mathcal F_T$-measurable.

4 We do not attempt here to define a “competitive market” or to prove the corresponding statement.

Chapter 6

Markov Chains, Discrete Time

A sequence $\{X_n\}_{n\ge 0}$ of random variables with values in a set $E$ is called a discrete-time stochastic process with state space $E$. According to such a definition, sequences of independent and identically distributed random variables are stochastic processes. However, in order to introduce more variability, one may wish to allow for some dependence on the past in the manner of deterministic recurrence equations. Discrete-time homogeneous Markov chains possess the required feature since they can always be represented (in a sense to be made precise) by a stochastic recurrence equation $X_{n+1} = f(X_n, Z_{n+1})$, where $\{Z_n\}_{n\ge 1}$ is an iid sequence independent of the initial state $X_0$. The probabilistic dependence on the past is only through the previous state, but this limited amount of memory suffices to produce enough varied and complex behavior to make Markov chains a most important source of models.

6.1 The Markov Property

6.1.1 The Markov Property on the Integers

Let $\{X_n\}_{n\ge 0}$ be a discrete-time stochastic process with countable state space $E$. The elements of the state space will be denoted by $i, j, k, \ldots$ If $X_n = i$, the process is said to be in state $i$ at time $n$, or to visit state $i$ at time $n$.

Definition 6.1.1 If for all integers $n \ge 0$ and all states $i_0, i_1, \ldots, i_{n-1}, i, j$,
$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i), \tag{6.1}$$
the above stochastic process is called a Markov chain, and a homogeneous Markov chain (hmc) if, in addition, the right-hand side of (6.1) is independent of $n$.

In the homogeneous case, the matrix $\mathbf P = \{p_{ij}\}_{i,j\in E}$, where
$$p_{ij} := P(X_{n+1} = j \mid X_n = i),$$
is called the transition matrix of the hmc. Since the entries are probabilities and since a transition from any state $i$ must be to some state, it follows that
$$p_{ij} \ge 0 \quad\text{and}\quad \sum_{k\in E} p_{ik} = 1 \qquad (i, j \in E).$$

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/978-3-030-40183-2_6


A matrix $\mathbf P$ indexed by $E$ and satisfying the above properties is called a stochastic matrix. The state space may be infinite, and therefore such a matrix is in general not of the kind studied in linear algebra. However, the basic operations of addition and multiplication will be defined by the same formal rules. The notation $x = \{x(i)\}_{i\in E}$ formally represents a column vector, and $x^T$ is the corresponding row vector. For instance,
$$(x^T \mathbf P)(j) = \sum_{k\in E} x(k)p_{kj}.$$
A transition matrix $\mathbf P$ is sometimes described by its transition graph $G$, that is, a graph having for nodes (or vertices) the states of $E$, and with an oriented edge from $i$ to $j$ if and only if $p_{ij} > 0$.

The Markov property (6.1) extends to
$$P(A \mid X_n = i, B) = P(A \mid X_n = i),$$
where
$$A := \{X_{n+1} = j_1, \ldots, X_{n+k} = j_k\}, \qquad B = \{X_0 = i_0, \ldots, X_{n-1} = i_{n-1}\}$$
(Exercise 6.6.1). This is in turn equivalent to
$$P(A \cap B \mid X_n = i) = P(A \mid X_n = i)P(B \mid X_n = i).$$
In other words, $A$ and $B$ are conditionally independent given $X_n = i$. Therefore, the future at time $n$ and the past at time $n$ are conditionally independent given the present state $X_n = i$. More generally:

Theorem 6.1.2 For all $n \ge 2$ and all $i \in E$, the σ-fields $\sigma(X_0, \ldots, X_{n-1})$ and $\sigma(X_{n+1}, X_{n+2}, \ldots)$ are independent given $X_n = i$.

Proof. This is a direct consequence of the above observations and of Theorem 3.1.39. □

Theorem 6.1.2 shows in particular that the Markov property is independent of the direction of time.

Notation. We shall from now on abbreviate $P(C \mid X_0 = i)$ as $P_i(C)$ ($C \in \mathcal F$). Also, if $\mu$ is a probability distribution on $E$, then $P_\mu(C)$ is the probability of $C$ given that the initial state $X_0$ is distributed according to $\mu$:
$$P_\mu(C) = \sum_{i\in E} \mu(i)P(C \mid X_0 = i) = \sum_{i\in E} \mu(i)P_i(C).$$

The distribution at time n of the chain is the vector νn indexed by E and defined by νn (i) := P (Xn = i)

(i ∈ E) .

From Bayes' rule of total causes,
$$P(X_{n+1} = j) = \sum_{k \in E} P(X_n = k)\,P(X_{n+1} = j \mid X_n = k),$$
that is, $\nu_{n+1}(j) = \sum_{k \in E} \nu_n(k)\,p_{kj}$. In matrix form: $\nu_{n+1}^T = \nu_n^T P$. Iteration of this equality yields
$$\nu_n^T = \nu_0^T P^n. \tag{6.2}$$
The matrix $P^m$ is called the m-step transition matrix because its general term is
$$p_{ij}(m) := P(X_{n+m} = j \mid X_n = i).$$
Indeed, the Bayes sequential rule and the Markov property give for the right-hand side of the latter equality
$$\sum_{i_1, \ldots, i_{m-1} \in E} p_{i i_1}\,p_{i_1 i_2} \cdots p_{i_{m-1} j},$$
which is the general term of the m-th power of P.

The probability distribution $\nu_0$ of the initial state $X_0$ is called the initial distribution. From Bayes' sequential rule, the homogeneous Markov property and the definition of the transition matrix,
$$P(X_0 = i_0, X_1 = i_1, \ldots, X_k = i_k) = \nu_0(i_0)\,p_{i_0 i_1} \cdots p_{i_{k-1} i_k}. \tag{6.3}$$
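Relation (6.2) and the m-step transition matrix can be checked on a small case. The sketch below is only an illustration: the 3-state matrix `P`, the initial distribution `nu0` and the helper names `mat_mul`, `mat_pow` and `distribution_at` are assumptions of this example, not objects from the text. Exact rational arithmetic is used so that equalities hold exactly.

```python
from fractions import Fraction as F

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, m):
    """m-th power of P (the identity matrix for m = 0)."""
    n = len(P)
    R = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(m):
        R = mat_mul(R, P)
    return R

def distribution_at(nu0, P, n):
    """Row vector nu_n^T = nu_0^T P^n, as in (6.2)."""
    Pn = mat_pow(P, n)
    size = len(P)
    return [sum(nu0[k] * Pn[k][j] for k in range(size)) for j in range(size)]

# Hypothetical 3-state transition matrix (each row sums to 1).
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 2), F(1, 4)],
     [F(0), F(1, 3), F(2, 3)]]
nu0 = [F(1), F(0), F(0)]          # the chain starts in state 0

P2 = mat_pow(P, 2)                # two-step transition matrix
nu2 = distribution_at(nu0, P, 2)  # distribution at time 2
```

Since the chain starts in state 0, `nu2` coincides with the first row of `P2`, and every row of `P2` still sums to 1.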

Therefore, by Theorem 5.1.7, we have the following

Theorem 6.1.3 The distribution of a discrete-time hmc is uniquely determined by its initial distribution and its transition matrix.

Many hmcs receive a natural description in terms of a recurrence equation.

Theorem 6.1.4 Let $\{Z_n\}_{n \ge 1}$ be an iid sequence of random variables with values in some measurable space $(G, \mathcal{G})$. Let E be a countable space and let $f : (E \times G, \mathcal{P}(E) \otimes \mathcal{G}) \to (E, \mathcal{P}(E))$ be some measurable function. Let $X_0$ be a random variable with values in E, independent of $\{Z_n\}_{n \ge 1}$. The recurrence equation
$$X_{n+1} = f(X_n, Z_{n+1}) \tag{6.4}$$
then defines an hmc.

Proof. Iteration of recurrence (6.4) shows that for all n ≥ 1, there is a function $g_n$ such that $X_n = g_n(X_0, Z_1, \ldots, Z_n)$, and therefore
$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(f(i, Z_{n+1}) = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(f(i, Z_{n+1}) = j),$$
since the event $\{X_0 = i_0, \ldots, X_{n-1} = i_{n-1}, X_n = i\}$ is expressible in terms of $X_0, Z_1, \ldots, Z_n$ and is therefore independent of $Z_{n+1}$. Similarly, $P(X_{n+1} = j \mid X_n = i) = P(f(i, Z_{n+1}) = j)$. We therefore have a Markov chain, and it is homogeneous since the right-hand side of the last equality does not depend on n. Explicitly:
$$p_{ij} = P(f(i, Z_1) = j). \tag{6.5}$$
□

Not all homogeneous Markov chains receive a “natural” description of the type featured in Theorem 6.1.4. However, it is always possible to find a “theoretical” description of this kind. More exactly, we have


Theorem 6.1.5 For any transition matrix P on E, there exists a homogeneous Markov chain with this transition matrix and with a representation such as in Theorem 6.1.4.

Proof. Since E is countable, we may identify it with N, which we do in this proof. Let $X_{n+1} = j$ if $\sum_{k=0}^{j-1} p_{X_n k} \le Z_{n+1}$
0. One of the challenges associated with Gibbs models is obtaining explicit formulas for averages, considering that it is generally hard to compute the partition function. This is feasible in exceptional cases (see Exercise 6.6.3). Such distributions are of interest to physicists when the energy is expressed in terms of a potential function describing the local interactions. The notion of clique then plays a central role.

6.1. THE MARKOV PROPERTY


Definition 6.1.15 Any singleton {v} ⊂ V is a clique. A subset C ⊆ V with more than one element is called a clique (with respect to ∼) if and only if any two distinct sites of C are mutual neighbors. A clique C is called maximal if for any site v ∉ C, C ∪ {v} is not a clique. The collection of cliques will be denoted by $\mathcal{C}$.

Definition 6.1.16 A Gibbs potential on $\Lambda^V$ relative to ∼ is a collection $\{V_C\}_{C \subseteq V}$ of functions $V_C : \Lambda^V \to \mathbb{R} \cup \{+\infty\}$ such that
(i) $V_C \equiv 0$ if C is not a clique, and
(ii) for all $x, x' \in \Lambda^V$ and all C ⊆ V, $x(C) = x'(C) \Rightarrow V_C(x) = V_C(x')$.
The energy function U is said to derive from the potential $\{V_C\}_{C \subseteq V}$ if
$$U(x) = \sum_{C} V_C(x).$$

The function $V_C$ depends only on the phases at the sites inside subset C. One could write more explicitly $V_C(x(C))$ instead of $V_C(x)$, but this notation will not be used. In this context, the distribution in (6.8) is called a Gibbs distribution (w.r.t. ∼).

Example 6.1.17: The Ising Model, take 1. In statistical physics, the following model is regarded as a qualitatively correct idealization of a piece of ferromagnetic material. Here $V = \mathbb{Z}_m^2 = \{(i, j) \in \mathbb{Z}^2,\ i, j \in [1, m]\}$ and Λ = {+1, −1}, where ±1 is the orientation of the magnetic spin at a given site. The figure below depicts two particular neighborhood systems, their respective cliques, and the boundary of a 2 × 2 square for both cases. The neighborhood system in the original Ising model is as in column (α) of the figure below, and the Gibbs potential is
$$V_{\{v\}}(x) = -\frac{H}{k}\,x(v), \qquad V_{\langle v, w \rangle}(x) = -\frac{J}{k}\,x(v)\,x(w),$$
where ⟨v, w⟩ is the 2-element clique (v ∼ w). For physicists, k is the Boltzmann constant, H is the external magnetic field, and J is the internal energy of an elementary magnetic dipole. The energy function corresponding to this potential is therefore
$$U(x) = -\frac{J}{k} \sum_{\langle v, w \rangle} x(v)\,x(w) - \frac{H}{k} \sum_{v \in V} x(v).$$
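The energy function of the Ising model can be evaluated directly on a small grid. The following sketch assumes the nearest-neighbor system (horizontal and vertical neighbors), which is our reading of column (α); the function name `ising_energy`, the default `k = 1.0` and the two test configurations are illustrative choices, not part of the text.

```python
from itertools import product

def ising_energy(x, m, J, H, k=1.0):
    """U(x) = -(J/k) * sum over neighbor pairs of x(v)x(w) - (H/k) * sum over sites of x(v).

    x maps each site (i, j), 1 <= i, j <= m, to a spin +1 or -1.
    Neighbor pairs are horizontal and vertical nearest neighbors (an assumption
    of this sketch); each unordered pair is counted once.
    """
    pair_sum = 0
    for i, j in product(range(1, m + 1), repeat=2):
        if i < m:
            pair_sum += x[(i, j)] * x[(i + 1, j)]
        if j < m:
            pair_sum += x[(i, j)] * x[(i, j + 1)]
    site_sum = sum(x.values())
    return -(J / k) * pair_sum - (H / k) * site_sum

m = 2
sites = list(product(range(1, m + 1), repeat=2))
all_up = {v: +1 for v in sites}                                # all spins aligned
checker = {(i, j): (+1 if (i + j) % 2 == 0 else -1) for (i, j) in sites}  # alternating spins
```

With J > 0, the aligned configuration has lower energy than the alternating one, which is the ferromagnetic behavior the model is meant to capture.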

[Figure: Two examples of neighborhoods, cliques, and boundaries. Columns (α) and (β); rows: neighborhoods, cliques (up to a rotation), and, in black, the boundary of the white square.]

Example 6.1.18: The autobinomial model [Besag, 1974]. For the purpose of image synthesis, one seeks Gibbs distributions describing pictures featuring various textures, lines separating patches with different textures (boundaries), lines per se (roads, rail tracks), randomly located objects (moon craters), etc. The following is an all-purpose texture model that may be used to describe the texture of various materials. The set of sites is $V = \mathbb{Z}_m^2$, and the phase space is Λ = {0, 1, . . . , L}. In the context of image processing, a site v is a pixel (PICTure ELement), and a phase λ ∈ Λ is a shade of grey, or a color. The neighborhood system is
$$N_v = \{w \in V;\ w \ne v;\ \|w - v\|^2 \le d\}, \tag{6.9}$$

where d is a fixed positive integer and where $\|w - v\|$ is the euclidean distance between v and w. In this model the only cliques participating in the energy function are singletons and pairs of mutual neighbors. The set of cliques appearing in the energy function is a disjoint sum of collections of cliques
$$\mathcal{C} = \sum_{j=1}^{m(d)} \mathcal{C}_j,$$
where $\mathcal{C}_1$ is the collection of singletons, and all pairs {v, w} in $\mathcal{C}_j$, 2 ≤ j ≤ m(d), have the same distance $\|w - v\|$ and the same direction, as shown in the figure below. The potential is given by
$$V_C(x) = \begin{cases} -\log \binom{L}{x(v)} + \alpha_1 x(v) & \text{if } C = \{v\} \in \mathcal{C}_1, \\ \alpha_j\, x(v)\,x(w) & \text{if } C = \{v, w\} \in \mathcal{C}_j, \end{cases}$$
where $\alpha_j \in \mathbb{R}$. For any clique C not of type $\mathcal{C}_j$, $V_C \equiv 0$. The terminology (“autobinomial”) is motivated by the fact that the local system has the form
$$\pi^v(x) = \binom{L}{x(v)}\, \tau^{x(v)} (1 - \tau)^{L - x(v)}, \tag{6.10}$$
where τ is a parameter depending on $x(N_v)$ as follows:
$$\tau = \tau(N_v) = \frac{e^{-\langle \alpha, b \rangle}}{1 + e^{-\langle \alpha, b \rangle}}. \tag{6.11}$$
Here ⟨α, b⟩ is the scalar product of $\alpha = (\alpha_1, \ldots, \alpha_{m(d)})$ and $b = (b_1, \ldots, b_{m(d)})$, where $b_1 = 1$, and for all j, 2 ≤ j ≤ m(d), $b_j = b_j(x(N_v)) = x(u) + x(w)$, where {v, u} and {v, w} are the two pairs in $\mathcal{C}_j$ containing v.

[Figure: Neighborhoods and cliques of three autobinomial models. Panels (α), (β), (γ) correspond to d = 1, 2, 4 with m(d) = 3, 5, 7; clique types C1, . . . , C7.]

Proof. From the explicit formula (6.12) giving the local characteristic at site v,
$$\pi^v(x) = \frac{\exp\left\{\log \binom{L}{x(v)} - \alpha_1 x(v) - \left(\sum_{j=2}^{m(d)} \alpha_j \sum_{w;\{v,w\} \in \mathcal{C}_j} x(w)\right) x(v)\right\}}{\sum_{\lambda \in \Lambda} \exp\left\{\log \binom{L}{\lambda} - \alpha_1 \lambda - \left(\sum_{j=2}^{m(d)} \alpha_j \sum_{w;\{v,w\} \in \mathcal{C}_j} x(w)\right) \lambda\right\}}.$$
The numerator equals
$$\binom{L}{x(v)}\, e^{-\langle \alpha, b \rangle x(v)},$$
and the denominator is
$$\sum_{\lambda \in \Lambda} \binom{L}{\lambda}\, e^{-\langle \alpha, b \rangle \lambda} = \sum_{\ell = 0}^{L} \binom{L}{\ell} \left(e^{-\langle \alpha, b \rangle}\right)^{\ell} = \left(1 + e^{-\langle \alpha, b \rangle}\right)^{L}.$$
Equality (6.10) then follows. □

Expression (6.10) shows that τ is the average level of grey at site v, given $x(N_v)$, and expression (6.11) shows that τ is a function of ⟨α, b⟩. The parameter $\alpha_j$ controls the bond in the direction and at the distance that characterize $\mathcal{C}_j$.
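The identity behind (6.10) can be checked numerically: normalizing the weights $\binom{L}{\lambda} e^{-\langle \alpha, b \rangle \lambda}$ over λ should give the binomial probabilities with parameter τ. The sketch below is an illustration only; the values of `L`, `alpha` and `b` are hypothetical, and the function names are ours.

```python
from math import comb, exp, isclose

def local_gibbs(lam, L, ab):
    """Local characteristic from the Gibbs form: binom(L, lam) * exp(-ab * lam), normalized."""
    weights = [comb(L, l) * exp(-ab * l) for l in range(L + 1)]
    return weights[lam] / sum(weights)

def local_binomial(lam, L, ab):
    """Binomial form (6.10) with tau = exp(-ab) / (1 + exp(-ab))."""
    tau = exp(-ab) / (1 + exp(-ab))
    return comb(L, lam) * tau**lam * (1 - tau)**(L - lam)

L = 5
alpha = (0.3, -0.2, 0.1)   # hypothetical (alpha_1, alpha_2, alpha_3)
b = (1, 4, 7)              # b_1 = 1; b_j = x(u) + x(w) for hypothetical neighbor phases
ab = sum(a * bj for a, bj in zip(alpha, b))   # scalar product <alpha, b>

gibbs = [local_gibbs(l, L, ab) for l in range(L + 1)]
binom = [local_binomial(l, L, ab) for l in range(L + 1)]
```

Both lists agree up to floating-point rounding, and each sums to 1.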

The Hammersley–Clifford Theorem

Gibbs distributions with an energy deriving from a Gibbs potential relative to a neighborhood system are distributions of Markov fields relative to the same neighborhood system.

Theorem 6.1.19 If X is a random field with a distribution π of the form $\pi(x) = \frac{1}{Z} e^{-U(x)}$, where the energy function U derives from a Gibbs potential $\{V_C\}_{C \subseteq V}$ relative to ∼, then X is a Markov random field with respect to ∼. Moreover, its local specification is given by the formula
$$\pi^v(x) = \frac{e^{-\sum_{C \ni v} V_C(x)}}{\sum_{\lambda \in \Lambda} e^{-\sum_{C \ni v} V_C(\lambda, x(V \setminus v))}}, \tag{6.12}$$
where the notation $\sum_{C \ni v}$ means that the sum extends over the sets C that contain the site v.

Proof. First observe that the right-hand side of (6.12) depends on x only through x(v) and $x(N_v)$. Indeed, $V_C(x)$ depends only on (x(w), w ∈ C), and for a clique C, if w ∈ C and v ∈ C, then either w = v or w ∼ v. Therefore, if it can be shown that $P(X(v) = x(v) \mid X(V \setminus v) = x(V \setminus v))$ equals the right-hand side of (6.12), then (see Exercise 6.6.6) the Markov property will be proved. By definition of conditional probability,
$$P(X(v) = x(v) \mid X(V \setminus v) = x(V \setminus v)) = \frac{\pi(x)}{\sum_{\lambda \in \Lambda} \pi(\lambda, x(V \setminus v))}. \tag{$\dagger$}$$
But
$$\pi(x) = \frac{1}{Z}\, e^{-\sum_{C \ni v} V_C(x) - \sum_{C \not\ni v} V_C(x)},$$
and similarly,
$$\pi(\lambda, x(V \setminus v)) = \frac{1}{Z}\, e^{-\sum_{C \ni v} V_C(\lambda, x(V \setminus v)) - \sum_{C \not\ni v} V_C(\lambda, x(V \setminus v))}.$$
If C is a clique and v is not in C, then $V_C(\lambda, x(V \setminus v)) = V_C(x)$ and is therefore independent of λ ∈ Λ. Therefore, after factoring out $\exp\{-\sum_{C \not\ni v} V_C(x)\}$, the right-hand side of (†) is found to be equal to the right-hand side of (6.12). □

The local energy at site v of configuration x is
$$U_v(x) = \sum_{C \ni v} V_C(x).$$
With this notation, (6.12) becomes
$$\pi^v(x) = \frac{e^{-U_v(x)}}{\sum_{\lambda \in \Lambda} e^{-U_v(\lambda, x(V \setminus v))}}.$$

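Example 6.1.20: The Ising Model, take 2. The local characteristics in the Ising model are
$$\pi_T^v(x) = \frac{e^{\frac{1}{kT}\left\{J \sum_{w; w \sim v} x(w) + H\right\} x(v)}}{e^{+\frac{1}{kT}\left\{J \sum_{w; w \sim v} x(w) + H\right\}} + e^{-\frac{1}{kT}\left\{J \sum_{w; w \sim v} x(w) + H\right\}}}.$$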
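The closed form above can be compared with the defining ratio π(x) / Σ_λ π(λ, x(V∖v)) by brute force on a small grid. The sketch below rests on two assumptions that are not stated in this passage: the distribution at temperature T is taken proportional to exp(−U(x)/T), and the neighborhood system is the nearest-neighbor one. All function names and numerical values are illustrative.

```python
from math import exp, isclose

def neighbors(v, m):
    """Nearest neighbors of site v inside the m x m grid (assumed neighborhood system)."""
    i, j = v
    cand = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b) for a, b in cand if 1 <= a <= m and 1 <= b <= m]

def energy(x, m, J, H, k):
    """Ising energy; the double loop counts each unordered pair twice, hence the division by 2."""
    pairs = sum(x[v] * x[w] for v in x for w in neighbors(v, m)) / 2
    return -(J / k) * pairs - (H / k) * sum(x.values())

def local_from_definition(x, v, m, J, H, k, T):
    """pi(x) divided by the sum of pi over the two possible spins at v (others fixed)."""
    def weight(spin):
        y = dict(x)
        y[v] = spin
        return exp(-energy(y, m, J, H, k) / T)   # assumes pi proportional to exp(-U/T)
    return weight(x[v]) / (weight(+1) + weight(-1))

def local_closed_form(x, v, m, J, H, k, T):
    """Closed form of Example 6.1.20."""
    a = (J * sum(x[w] for w in neighbors(v, m)) + H) / (k * T)
    return exp(a * x[v]) / (exp(a) + exp(-a))

m, J, H, k, T = 3, 1.0, 0.3, 1.0, 2.0            # hypothetical parameters
x = {(i, j): (+1 if (i + 2 * j) % 3 else -1)
     for i in range(1, m + 1) for j in range(1, m + 1)}
agree = all(isclose(local_from_definition(x, v, m, J, H, k, T),
                    local_closed_form(x, v, m, J, H, k, T)) for v in x)
```

The agreement reflects the cancellation in the proof of Theorem 6.1.19: all terms of the energy that do not involve the site v drop out of the ratio.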

Theorem 6.1.19 above is the direct part of the Gibbs–Markov equivalence theorem: A Gibbs distribution relative to a neighborhood system is the distribution of a Markov field with respect to the same neighborhood system. The converse part (Hammersley–Clifford theorem) is important from a theoretical point of view, since together with the direct part it concludes that Gibbs distributions and mrfs are essentially the same objects.

Theorem 6.1.21 Let π > 0 be the distribution of a Markov random field with respect to ∼. Then
$$\pi(x) = \frac{1}{Z}\, e^{-U(x)}$$
for some energy function U deriving from a Gibbs potential $\{V_C\}_{C \subseteq V}$ with respect to ∼.

The proof is omitted with little inconvenience since, in practice, the potential as well as the topology of V can be obtained directly from the expression of the energy, as the following example shows.

Example 6.1.22: Markov chains as Markov fields. Let V = {0, 1, . . . , N} and Λ = E, a finite space. A random field X on V with phase space Λ is therefore a vector X with values in $E^{N+1}$. Suppose that $X_0, \ldots, X_N$ is a homogeneous Markov chain with transition matrix $P = \{p_{ij}\}_{i,j \in E}$ and initial distribution $\nu = \{\nu_i\}_{i \in E}$. In particular, with $x = (x_0, \ldots, x_N)$,
$$\pi(x) = \nu_{x_0}\, p_{x_0 x_1} \cdots p_{x_{N-1} x_N},$$
that is, $\pi(x) = e^{-U(x)}$, where
$$U(x) = -\log \nu_{x_0} - \sum_{n=0}^{N-1} \log p_{x_n x_{n+1}}.$$
Clearly, this energy derives from a Gibbs potential associated with the nearest-neighbor topology for which the cliques are, besides the singletons, the pairs of adjacent sites. The potential functions are:
$$V_{\{0\}}(x) = -\log \nu_{x_0}, \qquad V_{\{n, n+1\}}(x) = -\log p_{x_n x_{n+1}}.$$
The local characteristic at site n, 2 ≤ n ≤ N − 1, can be computed from formula (6.12), which gives
$$\pi^n(x) = \frac{\exp(\log p_{x_{n-1} x_n} + \log p_{x_n x_{n+1}})}{\sum_{y \in E} \exp(\log p_{x_{n-1} y} + \log p_{y x_{n+1}})},$$
that is,
$$\pi^n(x) = \frac{p_{x_{n-1} x_n}\, p_{x_n x_{n+1}}}{p^{(2)}_{x_{n-1} x_{n+1}}},$$
where $p^{(2)}_{ij} = p_{ij}(2)$ is the general term of the two-step transition matrix $P^2$. Similar computations give $\pi^0(x)$ and $\pi^N(x)$. We note that, in view of the neighborhood structure, for 2 ≤ n ≤ N − 1, $X_n$ is independent of $X_0, \ldots, X_{n-2}, X_{n+2}, \ldots, X_N$ given $X_{n-1}$ and $X_{n+1}$.
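The local characteristic of Example 6.1.22 is easy to compute for a concrete chain. The sketch below uses a hypothetical 3-state transition matrix and exact fractions; the function names `two_step` and `local_characteristic` are ours.

```python
from fractions import Fraction as F

def two_step(P, i, j):
    """General term p_ij(2) of the two-step transition matrix."""
    return sum(P[i][y] * P[y][j] for y in range(len(P)))

def local_characteristic(P, a, b, c):
    """Probability of X_n = b given X_{n-1} = a and X_{n+1} = c: p_ab * p_bc / p_ac(2)."""
    return P[a][b] * P[b][c] / two_step(P, a, c)

# Hypothetical 3-state transition matrix.
P = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 4), F(1, 2), F(1, 4)],
     [F(0), F(1, 3), F(2, 3)]]

# Conditional law of the middle state given the two neighbors X_{n-1} = 0, X_{n+1} = 1.
middle = [local_characteristic(P, 0, b, 1) for b in range(3)]
```

As a conditional distribution, `middle` sums to 1; here the middle state is 0 or 1 with probability 1/2 each, and state 2 is impossible because $p_{02} = 0$.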

6.2 The Transition Matrix

6.2.1 Topological Notions

The notions introduced in this subsection (communication and periodicity) are of a topological nature, in the sense that they concern only the naked transition graph (without the labels).

Communication Classes

Definition 6.2.1 State j is said to be accessible from state i if there exists an integer M ≥ 0 such that $p_{ij}(M) > 0$. In particular, a state i is always accessible from itself, since $p_{ii}(0) = 1$. States i and j are said to communicate if i is accessible from j and j is accessible from i, and this is denoted by i ↔ j.

For M ≥ 1, $p_{ij}(M) = \sum_{i_1, \ldots, i_{M-1}} p_{i i_1} \cdots p_{i_{M-1} j}$, and therefore $p_{ij}(M) > 0$ if and only if there exists at least one path $i, i_1, \ldots, i_{M-1}, j$ from i to j such that $p_{i i_1} p_{i_1 i_2} \cdots p_{i_{M-1} j} > 0$, or, equivalently, if there is an oriented path from i to j in the transition graph G. Clearly,

i ↔ i (reflexivity),

i ↔ j ⇒ j ↔ i (symmetry),

i ↔ j, j ↔ k ⇒ i ↔ k (transitivity).

Therefore, the communication relation (↔) is an equivalence relation, and it generates a partition of the state space E into disjoint equivalence classes called communication classes. Definition 6.2.2 When there exists only one communication class, the chain, its transition matrix and its transition graph are said to be irreducible.
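For a finite state space, the communication classes can be found by a graph search on the transition graph. The following is a minimal sketch under that finiteness assumption; the function names and the 4-state matrix are illustrative.

```python
def reachable(P, i):
    """States accessible from i (i itself included, since p_ii(0) = 1)."""
    seen, stack = {i}, [i]
    while stack:
        u = stack.pop()
        for v, p in enumerate(P[u]):
            if p > 0 and v not in seen:   # oriented edge u -> v in the transition graph
                seen.add(v)
                stack.append(v)
    return seen

def communication_classes(P):
    """Partition of the states into classes of mutually accessible states."""
    n = len(P)
    reach = [reachable(P, i) for i in range(n)]
    classes, assigned = [], set()
    for i in range(n):
        if i not in assigned:
            cls = {j for j in reach[i] if i in reach[j]}
            classes.append(sorted(cls))
            assigned |= cls
    return classes

def is_irreducible(P):
    return len(communication_classes(P)) == 1

# Hypothetical chain: states 2 and 3 can reach {0, 1}, but not conversely.
P = [[0.5, 0.5, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0],
     [0.0, 0.3, 0.0, 0.7],
     [0.0, 0.0, 1.0, 0.0]]
classes = communication_classes(P)
```

Only the positions of the positive entries matter here, in line with the remark that these notions concern the transition graph without its labels.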

Example 6.2.3: Repair shop, take 2. Example 6.1.7 continued. A necessary and sufficient condition of irreducibility of the repair shop chain of Example 6.1.7 is that P (Z1 = 0) > 0 and P (Z1 ≥ 2) > 0 (Exercise 6.6.7).

Period

Consider the random walk on Z (Example 6.1.6). Since p ∈ (0, 1), it is irreducible. Observe that $E = C_0 + C_1$, where $C_0$ and $C_1$, the sets of even and odd integers respectively, have the following property. If you start from $i \in C_0$ (resp., $C_1$), then in one step you can go only to a state $j \in C_1$ (resp., $C_0$). The chain $\{X_n\}$ passes alternately from one cyclic class to the other. In this sense, the chain has a periodic behavior, corresponding to the period 2. More generally, for any irreducible Markov chain, one can find a partition of E into d classes $C_0, C_1, \ldots, C_{d-1}$ such that for all k and all $i \in C_k$,
$$\sum_{j \in C_{k+1}} p_{ij} = 1,$$
where by convention $C_d = C_0$. The proof follows directly from Theorem 6.2.7 below. The number d ≥ 1 is called the period of the chain (resp., of the transition matrix, of the transition graph). The classes $C_0, C_1, \ldots, C_{d-1}$ are called the cyclic classes.

[Figure: Cycles. The cyclic classes $C_0, C_1, \ldots, C_{d-2}, C_{d-1}$.]

The chain therefore moves from one class to the other at each transition, and this cyclically, as shown in the figure. We shall proceed to substantiate the above description of periodicity starting with the formal definition of period based on the notion of greatest common divisor (gcd) of a set of integers.

Definition 6.2.4 The period $d_i$ of state i ∈ E is, by definition,
$$d_i = \gcd\{n \ge 1;\ p_{ii}(n) > 0\},$$
with the convention $d_i = +\infty$ if there is no n ≥ 1 with $p_{ii}(n) > 0$. If $d_i = 1$, the state i is called aperiodic.

Remark 6.2.5 Very often aperiodicity follows from the following simple observation: An irreducible transition matrix P with at least one state i ∈ E such that $p_{ii} > 0$ is aperiodic (in fact, in this case $1 \in \{n \ge 1;\ p_{ii}(n) > 0\}$ and therefore $d_i = 1$).

Period is a (communication) class property in the following sense:


Theorem 6.2.6 Two states i and j which communicate have the same period.

Proof. As i and j communicate, there exist integers N and M such that $p_{ij}(M) > 0$ and $p_{ji}(N) > 0$. For any k ≥ 1,
$$p_{ii}(M + nk + N) \ge p_{ij}(M)\,(p_{jj}(k))^n\, p_{ji}(N)$$
(indeed, the trajectories $X_0, \ldots, X_{M+nk+N}$ such that $X_0 = i$, $X_M = j$, $X_{M+k} = j$, . . . , $X_{M+nk} = j$, $X_{M+nk+N} = i$ are a subset of the trajectories starting from i and returning to i in M + nk + N steps). Therefore, for any k ≥ 1 such that $p_{jj}(k) > 0$, we have that $p_{ii}(M + nk + N) > 0$ for all n ≥ 1. Therefore, $d_i$ divides M + nk + N for all n ≥ 1, and in particular, $d_i$ divides k. We have therefore shown that $d_i$ divides all k such that $p_{jj}(k) > 0$, and in particular, $d_i$ divides $d_j$. By symmetry, $d_j$ divides $d_i$, and therefore, finally, $d_i = d_j$. □

We may therefore henceforth speak of the period of a communication class or of an irreducible chain. The important result concerning periodicity is the following.

Theorem 6.2.7 Let P be an irreducible stochastic matrix with period d. Then for all states i, j there exist m ≥ 0 and $n_0 \ge 0$ (m and $n_0$ possibly depending on i, j) such that $p_{ij}(m + nd) > 0$ for all $n \ge n_0$.

Proof. It suffices to prove the theorem for i = j. Indeed, there exists an m such that $p_{ij}(m) > 0$, because j is accessible from i, the chain being irreducible, and therefore, if for some $n_0 \ge 0$ we have $p_{jj}(nd) > 0$ for all $n \ge n_0$, then $p_{ij}(m + nd) \ge p_{ij}(m)\,p_{jj}(nd) > 0$ for all $n \ge n_0$. The rest of the proof is an immediate consequence of a classical result of number theory. Indeed, the gcd of the set $A = \{k \ge 1;\ p_{jj}(k) > 0\}$ is d, and A is closed under addition. The set A therefore contains all but a finite number of the positive multiples of d. In other words, there exists an $n_0$ such that $n > n_0$ implies $p_{jj}(nd) > 0$. □
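For a finite chain, the period of a state can be computed directly from Definition 6.2.4 by tracking which states are reachable in exactly n steps. In the sketch below, the search horizon of 3|E| steps is an assumption of this example: for a finite state space it is enough to capture the gcd, since the period is generated by closed walks through i of length at most 3|E|. The function name and the three test matrices are illustrative.

```python
from math import gcd, inf

def period(P, i):
    """d_i = gcd{n >= 1 : p_ii(n) > 0}, with d_i = inf if i is never revisited."""
    n_states = len(P)
    current = {i}                      # states reachable from i in exactly n steps
    d = 0
    for n in range(1, 3 * n_states + 1):
        current = {v for u in current for v, p in enumerate(P[u]) if p > 0}
        if i in current:               # p_ii(n) > 0
            d = gcd(d, n)
    return d if d else inf

# Hypothetical examples.
walk4 = [[0, .5, 0, .5], [.5, 0, .5, 0], [0, .5, 0, .5], [.5, 0, .5, 0]]  # walk on a 4-cycle
cycle3 = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]                                # deterministic 3-cycle
lazy = [[.5, .5], [1, 0]]                                                 # p_00 > 0
```

In line with Theorem 6.2.6, all states of an irreducible chain return the same value; for `lazy`, the self-loop at state 0 makes every state aperiodic, as in Remark 6.2.5.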

6.2.2 Stationary Distributions and Reversibility

We now introduce the central notion of the stability theory of discrete-time hmcs.

Definition 6.2.8 A probability distribution π satisfying
$$\pi^T = \pi^T P \tag{6.13}$$
is called a stationary distribution (of the transition matrix P or of the corresponding hmc).

The so-called global balance equation (6.13) says that
$$\pi(i) = \sum_{j \in E} \pi(j)\,p_{ji} \qquad (i \in E).$$
Iteration of (6.13) gives $\pi^T = \pi^T P^n$ for all n ≥ 0, and therefore, in view of (6.2), if the initial distribution ν = π, then $\nu_n = \pi$ for all n ≥ 0. In particular, a chain starting with a stationary distribution keeps the same distribution forever. But there is more, because then,
$$P(X_n = i_0, X_{n+1} = i_1, \ldots, X_{n+k} = i_k) = P(X_n = i_0)\,p_{i_0 i_1} \cdots p_{i_{k-1} i_k} = \pi(i_0)\,p_{i_0 i_1} \cdots p_{i_{k-1} i_k}$$
does not depend on n. In this sense the chain is stationary. One also says that the chain is in a stationary regime, or in equilibrium, or in steady state. In summary:

Theorem 6.2.9 An hmc whose initial distribution is a stationary distribution is stationary.

Remark 6.2.10 The balance equation $\pi^T P = \pi^T$, together with the requirement that π is a probability vector, that is, $\pi^T 1 = 1$ (where 1 is a column vector with all its entries equal to 1), constitute, when E is finite, |E| + 1 equations for |E| unknown variables. One of the |E| equations in $\pi^T P = \pi^T$ is superfluous given the constraint $\pi^T 1 = 1$. In fact, summation of all equalities of $\pi^T P = \pi^T$ yields the equality $\pi^T P 1 = \pi^T 1$, that is, $\pi^T 1 = 1$.

Example 6.2.11: A two-state Markov Chain. The state space is E = {1, 2} and the transition matrix is
$$P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix},$$
where α, β ∈ (0, 1). The global balance equations are
$$\pi(1) = \pi(1)(1 - \alpha) + \pi(2)\beta, \qquad \pi(2) = \pi(1)\alpha + \pi(2)(1 - \beta).$$
This is a dependent system which reduces to the single equation π(1)α = π(2)β, to which must be added the equality π(1) + π(2) = 1 expressing that π is a probability vector. We obtain
$$\pi(1) = \frac{\beta}{\alpha + \beta}, \qquad \pi(2) = \frac{\alpha}{\alpha + \beta}.$$
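The two-state solution can be verified against the global balance equation $\pi^T = \pi^T P$ with exact arithmetic. The values of α and β below are hypothetical, and the function names are ours.

```python
from fractions import Fraction as F

def two_state_stationary(alpha, beta):
    """Stationary distribution (pi(1), pi(2)) of the two-state chain."""
    return (beta / (alpha + beta), alpha / (alpha + beta))

def step(pi, P):
    """Row vector pi^T P."""
    n = len(P)
    return tuple(sum(pi[i] * P[i][j] for i in range(n)) for j in range(n))

alpha, beta = F(1, 3), F(1, 5)            # hypothetical values in (0, 1)
P = [[1 - alpha, alpha],
     [beta, 1 - beta]]
pi = two_state_stationary(alpha, beta)    # (3/8, 5/8) for these values
```

Applying `step` to `pi` returns `pi` unchanged, which is the statement that a chain started from π keeps the same distribution forever.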

Example 6.2.12: The Ehrenfest Diffusion Model, take 2. The corresponding hmc was described in Example 6.1.9. The global balance equations are, for i ∈ [1, N − 1],
$$\pi(i) = \pi(i - 1)\left(1 - \frac{i - 1}{N}\right) + \pi(i + 1)\,\frac{i + 1}{N}$$
and, for the boundary states,
$$\pi(0) = \pi(1)\,\frac{1}{N}, \qquad \pi(N) = \pi(N - 1)\,\frac{1}{N}.$$
Leaving π(0) undetermined, one can solve the balance equations for i = 0, 1, . . . , N successively, to obtain
$$\pi(i) = \pi(0) \binom{N}{i}.$$
The value of π(0) is then determined by writing that π is a probability vector:
$$1 = \sum_{i=0}^{N} \pi(i) = \pi(0) \sum_{i=0}^{N} \binom{N}{i} = \pi(0)\,2^N.$$

This gives for π the binomial distribution of size N and parameter 1/2:
$$\pi(i) = \frac{1}{2^N} \binom{N}{i}.$$
This is the distribution one would obtain by placing independently each particle in the compartments, with probability 1/2 for each compartment.

There may be many stationary distributions. Take the identity as transition matrix. Then any probability distribution on the state space is a stationary distribution. Also there may well not exist any stationary distribution. (See Exercise 6.6.16.)

Remark 6.2.13 An immediate consequence of Theorem 5.1.14 is that if an hmc $\{X_n\}_{n \ge 0}$ is stationary, it may be extended to a stationary hmc $\{X_n\}_{n \in \mathbb{Z}}$ with the same distribution.

Recurrence equations can be used to obtain the stationary distribution when the latter exists and is unique. Generating functions sometimes usefully exploit the dynamics.

Example 6.2.14: Repair shop, take 3. Examples 6.1.7 and 6.2.3 continued. For any complex number z with modulus not larger than 1, it follows from the recurrence equation (6.6) that
$$z^{X_{n+1} + 1} = \left(z^{(X_n - 1)^+ + 1}\right) z^{Z_{n+1}} = \left(z^{X_n} 1_{\{X_n > 0\}} + z\,1_{\{X_n = 0\}}\right) z^{Z_{n+1}} = \left(z^{X_n} - 1_{\{X_n = 0\}} + z\,1_{\{X_n = 0\}}\right) z^{Z_{n+1}},$$
and therefore
$$z\,z^{X_{n+1}} - z^{X_n} z^{Z_{n+1}} = (z - 1)\,1_{\{X_n = 0\}}\, z^{Z_{n+1}}.$$
From the independence of $X_n$ and $Z_{n+1}$, $E[z^{X_n} z^{Z_{n+1}}] = E[z^{X_n}]\,g_Z(z)$, where $g_Z$ is the generating function of $Z_{n+1}$, and $E[1_{\{X_n = 0\}} z^{Z_{n+1}}] = \pi(0)\,g_Z(z)$, where $\pi(0) = P(X_n = 0)$. Therefore,
$$z\,E[z^{X_{n+1}}] - g_Z(z)\,E[z^{X_n}] = (z - 1)\,\pi(0)\,g_Z(z).$$
Suppose that the chain is in steady state, in which case $E[z^{X_{n+1}}] = E[z^{X_n}] = g_X(z)$, and therefore
$$g_X(z)\,(z - g_Z(z)) = \pi(0)\,(z - 1)\,g_Z(z). \tag{$\star$}$$
This gives the generating function $g_X(z) = \sum_{i=0}^{\infty} \pi(i) z^i$, as long as π(0) is available. To obtain π(0), differentiate (⋆):
$$g_X'(z)\,(z - g_Z(z)) + g_X(z)\,(1 - g_Z'(z)) = \pi(0)\left(g_Z(z) + (z - 1)\,g_Z'(z)\right),$$
and let z = 1, to obtain, taking into account the equalities $g_X(1) = g_Z(1) = 1$ and $g_Z'(1) = E[Z]$,
$$\pi(0) = 1 - E[Z]. \tag{$\star\star$}$$
Since π(0) must be non-negative, this immediately gives the necessary condition E[Z] ≤ 1. Actually, one must have, if the trivial case $Z_1 \equiv 1$ is excluded,
$$E[Z] < 1. \tag{6.14}$$
Indeed, if E[Z] = 1, implying π(0) = 0, it follows from (⋆) that $g_X(x)(x - g_Z(x)) = 0$ for all x ∈ [0, 1]. But, if the case $Z_1 \equiv 1$ (that is, $g_Z(x) \equiv x$) is excluded, equation $x - g_Z(x) = 0$ has only x = 1 for a solution when $g_Z'(1) = E[Z] \le 1$. Therefore, $g_X(x) \equiv 0$ for x ∈ [0, 1), and consequently $g_X(z) \equiv 0$ on {|z| < 1} ($g_Z$ is analytic inside the open unit disk centered at the origin). This leads to a contradiction, since the generating function of an integer-valued random variable cannot be identically null.

It turns out that E[Z] < 1 is also a sufficient condition for the existence of a steady state (Example 6.3.22). For the time being, we have from (⋆) and (⋆⋆) that, if the stationary distribution exists, then its generating function is given by the formula
$$\sum_{i=0}^{\infty} \pi(i) z^i = (1 - E[Z])\, \frac{(z - 1)\,g_Z(z)}{z - g_Z(z)}.$$
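As a worked illustration of this formula (not taken from the text), assume the hypothetical distribution $P(Z_1 = 0) = 2/3$, $P(Z_1 = 2) = 1/3$, so that $E[Z] = 2/3 < 1$ and $g_Z(z) = (2 + z^2)/3$. Under this assumption the generating function can be expanded explicitly:

```latex
\begin{align*}
z - g_Z(z) &= -\tfrac{1}{3}\,(z - 1)(z - 2), \\
g_X(z) &= \Bigl(1 - \tfrac{2}{3}\Bigr)\,\frac{(z - 1)\,g_Z(z)}{z - g_Z(z)}
        = \frac{2 + z^2}{3\,(2 - z)}
        = \frac{2 + z^2}{6} \sum_{n \ge 0} \frac{z^n}{2^n}, \\
\pi(0) &= \tfrac{1}{3} = 1 - E[Z], \qquad \pi(1) = \tfrac{1}{6}, \qquad \pi(i) = 2^{-i} \quad (i \ge 2).
\end{align*}
```

These values sum to 1. They also satisfy the relation $\pi(0) = (\pi(0) + \pi(1))\,P(Z_1 = 0)$, which follows from the recurrence $X_{n+1} = (X_n - 1)^+ + Z_{n+1}$ used in the derivation above.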

Reversibility

Let $\{X_n\}_{n \in \mathbb{Z}}$ be an hmc with transition matrix P and admitting a stationary distribution π > 0 (see Remark 6.2.13). Define the matrix Q, indexed by E, by
$$\pi(i)\,q_{ij} = \pi(j)\,p_{ji}. \tag{$\star$}$$
This matrix is stochastic, since
$$\sum_{j \in E} q_{ij} = \sum_{j \in E} \frac{\pi(j)}{\pi(i)}\,p_{ji} = \frac{1}{\pi(i)} \sum_{j \in E} \pi(j)\,p_{ji} = \frac{\pi(i)}{\pi(i)} = 1,$$
where the third equality uses the global balance equations. From Bayes' retrodiction formula,
$$P(X_n = j \mid X_{n+1} = i) = \frac{P(X_{n+1} = i \mid X_n = j)\,P(X_n = j)}{P(X_{n+1} = i)},$$
that is, in view of (⋆),
$$P(X_n = j \mid X_{n+1} = i) = q_{ij}. \tag{6.15}$$
Therefore Q is the transition matrix of the initial chain when time is reversed.

Theorem 6.2.15 Let P be a stochastic matrix indexed by a countable set E, and let π be a probability distribution on E. Define the matrix Q indexed by E by
$$\pi(i)\,q_{ij} = \pi(j)\,p_{ji}.$$
If Q is a stochastic matrix, then π is a stationary distribution of P.

Proof. Just verify that the global balance equation is satisfied. □
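The time-reversed matrix Q is straightforward to compute when E is finite and π > 0. The sketch below uses a hypothetical 3-state chain that mostly moves around a cycle; its matrix is doubly stochastic, so the uniform distribution is stationary. The function name `reversed_matrix` is ours.

```python
from fractions import Fraction as F

def reversed_matrix(P, pi):
    """q_ij = pi(j) * p_ji / pi(i), assuming pi(i) > 0 for every state i."""
    n = len(P)
    return [[pi[j] * P[j][i] / pi[i] for j in range(n)] for i in range(n)]

# Hypothetical chain favoring the direction 0 -> 1 -> 2 -> 0.
P = [[F(0), F(3, 4), F(1, 4)],
     [F(1, 4), F(0), F(3, 4)],
     [F(3, 4), F(1, 4), F(0)]]
pi = [F(1, 3)] * 3                     # uniform distribution, stationary for this P
Q = reversed_matrix(P, pi)
rows_ok = all(sum(row) == 1 for row in Q)
```

Here Q is the transpose of P: the reversed chain favors the opposite direction 0 → 2 → 1 → 0. Since Q differs from P, this chain is an example where the time-reversed dynamics are not the same as the original ones.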



Definition 6.2.16 One calls reversible a stationary Markov chain with initial distribution π (a stationary distribution) if for all i, j ∈ E, we have the so-called detailed balance equations π(i)pij = π(j)pji .


We then say that the pair (P, π) is reversible. In this case, $q_{ij} = p_{ij}$, and therefore the chain and the time-reversed chain are statistically the same, since the distribution of a homogeneous Markov chain is entirely determined by its initial distribution and its transition matrix. The following is an immediate corollary of Theorem 6.2.15.

Theorem 6.2.17 Let P be a transition matrix on the countable state space E, and let π be some probability distribution on E. If for all i, j ∈ E, the detailed balance equations are satisfied, then π is a stationary distribution of P.

Example 6.2.18: The Ehrenfest Diffusion Model, take 3. This example continues Examples 6.1.9 and 6.2.12. Recall that we obtained the expression
$$\pi(i) = \frac{1}{2^N} \binom{N}{i}$$
for the stationary distribution. We can also find this by checking the detailed balance equations
$$\pi(i)\,p_{i,i+1} = \pi(i + 1)\,p_{i+1,i}.$$
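The detailed balance check for the Ehrenfest model can be carried out mechanically. In the sketch below, the transition probabilities $p_{i,i+1} = (N - i)/N$ and $p_{i,i-1} = i/N$ are read off from the global balance equations of Example 6.2.12 (the defining Example 6.1.9 is not reproduced here, so this is an inference); the value of `N` and the function names are illustrative.

```python
from fractions import Fraction as F
from math import comb

def ehrenfest_matrix(N):
    """Transition matrix with p_{i,i+1} = (N - i)/N and p_{i,i-1} = i/N."""
    P = [[F(0)] * (N + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        if i < N:
            P[i][i + 1] = F(N - i, N)
        if i > 0:
            P[i][i - 1] = F(i, N)
    return P

def binomial_half(N):
    """pi(i) = binom(N, i) / 2^N."""
    return [F(comb(N, i), 2**N) for i in range(N + 1)]

N = 6
P, pi = ehrenfest_matrix(N), binomial_half(N)
detailed = all(pi[i] * P[i][i + 1] == pi[i + 1] * P[i + 1][i] for i in range(N))
global_balance = all(sum(pi[j] * P[j][i] for j in range(N + 1)) == pi[i]
                     for i in range(N + 1))
```

Both checks succeed, which illustrates Theorem 6.2.17: detailed balance implies global balance.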

Example 6.2.19: Random Walk on a Graph. Consider a finite non-oriented graph and denote by E the set of vertices, or nodes, of this graph. Let $d_i$ be the index of vertex i (the number of edges adjacent to i). Transform this graph into an oriented graph by splitting each edge into two oriented edges of opposite directions, and make it a transition graph by associating to the oriented edge from i to j the transition probability $1/d_i$ (see the figure below). It will be assumed, as is the case in the figure, that $d_i > 0$ for all states i (that is, the graph is connected).

[Figure: A random walk on a graph.]

This chain is irreducible due to the connectedness of the graph and the fact that $p_{ij} > 0$ whenever $p_{ji} > 0$. It admits the distribution
$$\pi(i) = \frac{d_i}{2\,|\text{edges}|} \qquad (i \in E),$$
where |edges| is the number of edges of the original graph. This follows from Theorem 6.2.17. In fact, if i and j are connected in the graph, $p_{ij} = 1/d_i$ and $p_{ji} = 1/d_j$, and therefore the detailed balance equation between these two states is
$$\pi(i)\,\frac{1}{d_i} = \pi(j)\,\frac{1}{d_j}.$$
This gives
$$\pi(i) = K d_i \qquad (i \in E),$$
where K is obtained by normalization: $K = \left(\sum_{j \in E} d_j\right)^{-1}$. But $\sum_{j \in E} d_j = 2\,|\text{edges}|$.
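The stationary distribution of the random walk on a graph can be checked on a small example. The edge list below is a hypothetical connected graph (it is not the graph of the figure), and the function name `random_walk` is ours.

```python
from fractions import Fraction as F

def random_walk(edges, n):
    """Transition matrix with p_ij = 1/d_i for each neighbor j of i, and the list of d_i."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = [len(a) for a in adj]
    P = [[F(1, deg[i]) if j in adj[i] else F(0) for j in range(n)] for i in range(n)]
    return P, deg

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]       # hypothetical connected graph on 4 vertices
P, deg = random_walk(edges, 4)
pi = [F(d, 2 * len(edges)) for d in deg]       # pi(i) = d_i / (2 |edges|)

stationary = all(sum(pi[j] * P[j][i] for j in range(4)) == pi[i] for i in range(4))
reversible = all(pi[i] * P[i][j] == pi[j] * P[j][i] for i in range(4) for j in range(4))
```

For this graph the degrees are 2, 2, 3, 1, so π = (1/4, 1/4, 3/8, 1/8), and both the global and the detailed balance equations hold.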

6.2.3 The Strong Markov Property

The Markov property relative to the past and future at a given instant n can be extended to the situation where this deterministic time is replaced by an $\mathcal{F}_n^X$-stopping time (where $\mathcal{F}_n^X := \sigma(X_0, \ldots, X_n)$), whose definition we recall below.

Definition 6.2.20 Let $\{\mathcal{F}_n\}_{n \in \mathbb{N}}$ be a non-decreasing sequence of sub-σ-fields of $\mathcal{F}$. A random variable τ taking its values in $\overline{\mathbb{N}}$ and such that, for all m ∈ N, the event {τ = m} is in $\mathcal{F}_m$ is called an $\mathcal{F}_n$-stopping time.

In other words, τ is an $\mathcal{F}_n^X$-stopping time if, for all m ∈ N, the event {τ = m} can be expressed as $1_{\{\tau = m\}} = \psi_m(X_0, \ldots, X_m)$, for some measurable function $\psi_m$ with values in {0, 1} (Theorem 3.3.18).

Example 6.2.21: Fixed Times and Delayed Stopping Times. A constant time is a stopping time. If τ is a stopping time and $n_0$ a non-negative deterministic time, then $\tau + n_0$ is a stopping time. Indeed, $\{\tau + n_0 = m\} \equiv \{\tau = m - n_0\}$ is expressible in terms of $X_0, X_1, \ldots, X_{m - n_0}$.

Example 6.2.22: Return Times and Hitting Times. In the theory of Markov chains, a typical and most important stopping time is the return time to state i ∈ E,
$$T_i = \inf\{n \ge 1;\ X_n = i\},$$
where $T_i = \infty$ if $X_n \ne i$ for all n ≥ 1. It is a stopping time, as we shall soon prove. Observe that $T_i \ge 1$, and in particular, $X_0 = i$ does not imply $T_i = 0$. This is why $T_i$ is called the return time to i, and not the hitting time of i. The latter is $S_i = T_i$ if $X_0 \ne i$, and $S_i = 0$ if $X_0 = i$. It is also a stopping time. More generally, let $\tau_1 = T_i, \tau_2, \ldots$ be the successive return times to state i. If there are only r returns to state i, let $\tau_{r+1} = \tau_{r+2} = \cdots = \infty$. These random times are stopping times with respect to $\{X_n\}_{n \ge 0}$, since for any m ≥ 1,
$$\{\tau_k = m\} \equiv \left\{\sum_{n=1}^{m} 1_{\{X_n = i\}} = k,\ X_m = i\right\}$$
is indeed expressible in terms of $X_0, \ldots, X_m$.
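On an observed trajectory, return times and hitting times can be read off directly. The sketch below works on a finite path, so a returned value of infinity only means that no visit occurred within the observed horizon; this truncation is a limitation of the example, and the function names and the sample path are illustrative.

```python
from math import inf

def return_times(path, i):
    """Successive return times tau_1 < tau_2 < ... to i: indices n >= 1 with X_n = i."""
    return [n for n, x in enumerate(path) if n >= 1 and x == i]

def return_time(path, i):
    """T_i = inf{n >= 1 : X_n = i}, or inf if i is not visited after time 0 on this path."""
    times = return_times(path, i)
    return times[0] if times else inf

def hitting_time(path, i):
    """S_i = 0 if X_0 = i, and S_i = T_i otherwise."""
    return 0 if path[0] == i else return_time(path, i)

path = [0, 1, 2, 0, 0, 3, 0]       # hypothetical observed values X_0, ..., X_6
```

For this path, the return times to 0 are 3, 4 and 6, while the hitting time of 0 is 0 because the path starts there. Deciding whether the k-th return occurs at time m only requires the values up to time m, which is the stopping-time property.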


Remark 6.2.23 For a given stopping time τ, one can decide whether τ = m just by observing $X_0, X_1, \ldots, X_m$. This is why stopping times are said to be nonanticipative. The random time $\tau = \inf\{n \ge 0;\ X_{n+1} = i\}$, where τ = ∞ if $X_{n+1} \ne i$ for all n ≥ 0, is anticipative because $\{\tau = m\} = \{X_1 \ne i, \ldots, X_m \ne i, X_{m+1} = i\}$ for all m ≥ 0. Knowledge of this random time provides information about the value of the process just after it. It is not a stopping time.

Let τ be a random time taking its values in N ∪ {+∞}, and let $\{X_n\}_{n \ge 0}$ be a stochastic process with values in the countable set E. In order to define $X_\tau$ when τ = ∞, one must decide how to define $X_\infty$. This is done by taking some element Δ not in E, and setting $X_\infty = \Delta$. By definition, the “process $\{X_n\}$ after τ” is the stochastic process $\{X_{n+\tau}\}_{n \ge 0}$. The “process $\{X_n\}$ before τ” is the process $\{X_{n \wedge (\tau - 1)}\}_{n \ge 0}$, where by convention $X_{n \wedge (0 - 1)} = X_0$.

The main result of the present subsection is the strong Markov property. It says that the Markov property, that is, the independence of past and future given the present state, extends to the situation where the present time is a stopping time. More precisely:

Theorem 6.2.24 Let $\{X_n\}_{n \ge 0}$ be an hmc with countable state space E and transition matrix P. If τ is an $\mathcal{F}_n^X$-stopping time, then given that $X_\tau = i \in E$ (in particular, τ < ∞, since i ≠ Δ),
(α) the process after τ and the process before τ are independent, and
(β) the process after τ is an hmc with transition matrix P.

Proof. (α) By Theorem 3.1.39 it suffices to show that for all times k ≥ 1, n ≥ 0, and all states $i_0, \ldots, i_n, i, j_1, \ldots, j_k$,
$$P(X_{\tau+1} = j_1, \ldots, X_{\tau+k} = j_k \mid X_\tau = i, X_{(\tau-1) \wedge 0} = i_0, \ldots, X_{(\tau-1) \wedge n} = i_n) = P(X_{\tau+1} = j_1, \ldots, X_{\tau+k} = j_k \mid X_\tau = i).$$
We shall prove a simplified version of the above equality, namely
$$P(X_{\tau+k} = j \mid X_\tau = i, X_{(\tau-1) \wedge n} = i_n) = P(X_{\tau+k} = j \mid X_\tau = i) \tag{$\star$}$$
(the general case is obtained by the same arguments). The left-hand side of the above equality is equal to
$$\frac{P(X_{\tau+k} = j, X_\tau = i, X_{(\tau-1) \wedge n} = i_n)}{P(X_\tau = i, X_{(\tau-1) \wedge n} = i_n)}.$$
The numerator can be expanded as
$$\sum_{r \ge 0} P(\tau = r, X_{r+k} = j, X_r = i, X_{(r-1) \wedge n} = i_n). \tag{6.16}$$
But
$$P(\tau = r, X_{r+k} = j, X_r = i, X_{(r-1) \wedge n} = i_n) = P(X_{r+k} = j \mid X_r = i, X_{(r-1) \wedge n} = i_n, \tau = r)\, P(\tau = r, X_{(r-1) \wedge n} = i_n, X_r = i),$$
and since (r − 1) ∧ n ≤ r and $\{\tau = r\} \in \mathcal{F}_r^X$, the event $B = \{X_{(r-1) \wedge n} = i_n, \tau = r\}$ is in $\mathcal{F}_r^X$. Therefore, by the Markov property,
$$P(X_{r+k} = j \mid X_r = i, X_{(r-1) \wedge n} = i_n, \tau = r) = P(X_{r+k} = j \mid X_r = i) = p_{ij}(k).$$
Finally, expression (6.16) reduces to
$$\sum_{r \ge 0} p_{ij}(k)\, P(\tau = r, X_{(r-1) \wedge n} = i_n, X_r = i) = p_{ij}(k)\, P(X_\tau = i, X_{(\tau-1) \wedge n} = i_n).$$
Therefore, the left-hand side of (⋆) is just $p_{ij}(k)$. Similar computations show that the right-hand side of (⋆) is also $p_{ij}(k)$, so that (α) is proved.

(β) We must show that for all states $i, j, k, i_{n-1}, \ldots, i_1$,
$$P(X_{\tau+n+1} = k \mid X_{\tau+n} = j, X_{\tau+n-1} = i_{n-1}, \ldots, X_\tau = i) = P(X_{\tau+n+1} = k \mid X_{\tau+n} = j) = p_{jk}.$$
But the first equality follows from the fact proved in (α) that for the stopping time τ′ = τ + n, the processes before and after τ′ are independent given $X_{\tau'} = j$. The second equality is obtained by the same calculations as in the proof of (α). □

Regenerative Cycles

Consider a Markov chain with a state conventionally denoted by 0 such that $P_0(T_0 < \infty) = 1$. As a consequence of the strong Markov property, the chain starting from state 0 will return infinitely often to this state. Let $\tau_1 = T_0, \tau_2, \ldots$ be the successive return times to 0, and set $\tau_0 \equiv 0$. By the strong Markov property, for any k ≥ 1, the process after $\tau_k$ is independent of the process before $\tau_k$ (observe that condition $X_{\tau_k} = 0$ is always satisfied), and the process after $\tau_k$ is a Markov chain with the same transition matrix as the original chain, and with initial state 0, by construction. Therefore, the pieces of the trajectory between the successive times of visit to 0,
$$\{X_{\tau_k}, X_{\tau_k + 1}, \ldots, X_{\tau_{k+1} - 1}\} \qquad (k \ge 0),$$
are independent and identically distributed. Such pieces are called the regenerative cycles of the chain between visits to state 0. Each random time $\tau_k$ is a regeneration time, in the sense that $\{X_{\tau_k + n}\}_{n \ge 0}$ is independent of the past $X_0, \ldots, X_{\tau_k - 1}$ and has the same distribution as $\{X_n\}_{n \ge 0}$. In particular, the sequence $\{\tau_k - \tau_{k-1}\}_{k \ge 1}$ is iid.
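The decomposition into regenerative cycles can be illustrated on a recorded trajectory. The sketch below only splits a given finite path; it says nothing about independence, which is a property of the chain and not of one sample. The function name and the sample path are hypothetical.

```python
def regenerative_cycles(path, state=0):
    """Split a trajectory starting at `state` into the pieces between successive visits."""
    if path[0] != state:
        raise ValueError("the trajectory must start from the regeneration state")
    starts = [n for n, x in enumerate(path) if x == state]   # tau_0 = 0, tau_1, tau_2, ...
    # Only complete cycles are returned; the unfinished tail after the last visit is dropped.
    return [path[a:b] for a, b in zip(starts, starts[1:])]

path = [0, 2, 1, 0, 0, 3, 1, 2, 0, 1]      # hypothetical observed trajectory
cycles = regenerative_cycles(path)
lengths = [len(c) for c in cycles]          # the differences tau_k - tau_{k-1}
```

Each cycle begins with a visit to 0 and ends just before the next one, so the cycle lengths are exactly the inter-return times.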

6.3 Recurrence and Transience

Consider a Markov chain taking its values in E = N. There is a possibility that for any initial state i ∈ N the chain will never visit i after some finite random time. This is often an undesirable feature. For example, if the chain counts the number of customers waiting in line at a service counter, such a behavior implies that the waiting line will eventually grow beyond the limits of the waiting room, whatever its size. In a sense, the corresponding system is unstable. The good notion of stability for an irreducible hmc is that of positive recurrence, when any given state is visited infinitely often and when, moreover, the average time between two successive visits to this state is finite.

6.3.1 Classification of States

Denote by
$$N_i := \sum_{n \ge 1} 1_{\{X_n = i\}}$$
the number of visits to state $i$ strictly after time 0.

Theorem 6.3.1 The distribution of $N_i$ given $X_0 = j$ is
$$P_j(N_i = r) = \begin{cases} f_{ji}\, f_{ii}^{\,r-1}(1 - f_{ii}) & \text{for } r \ge 1, \\ 1 - f_{ji} & \text{for } r = 0, \end{cases}$$
where $f_{ji} = P_j(T_i < \infty)$ and $T_i$ is the return time to $i$.

Proof. An informal proof goes like this: we first go from $j$ to $i$ (probability $f_{ji}$) and then, $r - 1$ times in succession, from $i$ to $i$ (each time with probability $f_{ii}$), and the last time, that is the $(r+1)$-st time, we leave $i$ never to return to it (probability $1 - f_{ii}$). By the independent cycle property, all these "jumps" are independent, so that the successive probabilities multiply.

Here is a formal proof if someone needs it. For $r = 0$, this is just the definition of $f_{ji}$. Now let $r \ge 1$, and suppose that $P_j(N_i = k) = f_{ji} f_{ii}^{\,k-1}(1 - f_{ii})$ is true for $k$ ($1 \le k \le r$). In particular, $P_j(N_i > r) = f_{ji} f_{ii}^{\,r}$. Denoting by $\tau_r$ the $r$th return time to state $i$,
$$P_j(N_i = r+1) = P_j(N_i = r+1, X_{\tau_{r+1}} = i) = P_j(\tau_{r+2} - \tau_{r+1} = \infty, X_{\tau_{r+1}} = i) = P_j(\tau_{r+2} - \tau_{r+1} = \infty \mid X_{\tau_{r+1}} = i)\, P_j(X_{\tau_{r+1}} = i).$$
But $P_j(\tau_{r+2} - \tau_{r+1} = \infty \mid X_{\tau_{r+1}} = i) = 1 - f_{ii}$ by the strong Markov property ($\tau_{r+2} - \tau_{r+1}$ is the return time to $i$ of the process after $\tau_{r+1}$). Also, $P_j(X_{\tau_{r+1}} = i) = P_j(N_i > r)$.

Therefore,
$$P_j(N_i = r+1) = P_i(T_i = \infty)\, P_j(N_i > r) = (1 - f_{ii})\, f_{ji}\, f_{ii}^{\,r}.$$
The result then follows by induction.

The distribution of $N_i$ given $X_0 = i$ is geometric: $P_i(N_i = r) = f_{ii}^{\,r}(1 - f_{ii})$ for $r \ge 0$. If $f_{ii} = 1$, $N_i$ is in fact equal to infinity, and in particular has an infinite mean. If $f_{ii} < 1$, however, it is almost surely finite and it has the finite mean $f_{ii}/(1 - f_{ii})$. From these remarks, we deduce that
$$P_i(T_i < \infty) = 1 \iff P_i(N_i = \infty) = 1$$
(in words: if starting from $i$ you almost surely return to $i$, then you will visit $i$ infinitely often) and
$$P_i(T_i < \infty) < 1 \iff E_i[N_i] < \infty.$$
We collect the results just obtained for future reference.

Theorem 6.3.2 For any state $i \in E$,
$$P_i(T_i < \infty) = 1 \iff P_i(N_i = \infty) = 1,$$
and
$$P_i(T_i < \infty) < 1 \iff P_i(N_i = \infty) = 0 \iff E_i[N_i] < \infty.$$
In particular, the event $\{N_i = \infty\}$ has $P_i$-probability 0 or 1.

We are now ready for the basic definitions concerning recurrence. First recall that $T_i$ denotes the return time to state $i$.

Definition 6.3.3 State $i \in E$ is called recurrent if
$$P_i(T_i < \infty) = 1,$$
and otherwise it is called transient. A recurrent state $i \in E$ is called positive recurrent if
$$E_i[T_i] < \infty,$$
and otherwise it is called null recurrent.

The Potential Matrix Criterion of Recurrence

In general, it is not easy to check whether a given state is transient or recurrent. One of the goals of the theory of Markov chains is to provide criteria of recurrence. Sometimes, one is happy with just a sufficient condition. The problem of finding useful (easy to check) conditions of recurrence is an active area of research. However, the theory has a few conditions that qualify as useful and are applicable to many practical situations. Although the next criterion is of theoretical rather than practical interest, it can be helpful in a few situations, for instance in the study of recurrence of random walks (Example 6.3.5). The potential matrix $G$ associated with the transition matrix $P$ is defined by
$$G := \sum_{n \ge 0} P^n.$$

Its general term
$$g_{ij} = \sum_{n=0}^{\infty} p_{ij}(n) = \sum_{n=0}^{\infty} P_i(X_n = j) = \sum_{n=0}^{\infty} E_i\big[1_{\{X_n = j\}}\big] = E_i\Big[\sum_{n=0}^{\infty} 1_{\{X_n = j\}}\Big]$$
is the average number of visits to state $j$, given that the chain starts from state $i$.

Theorem 6.3.4 State $i \in E$ is recurrent if and only if
$$\sum_{n=0}^{\infty} p_{ii}(n) = \infty.$$

Proof. This merely rephrases Theorem 6.3.2 since
$$\sum_{n \ge 1} p_{ii}(n) = E_i[N_i].$$
In fact,
$$E_i[N_i] = E_i\Big[\sum_{n \ge 1} 1_{\{X_n = i\}}\Big] = \sum_{n \ge 1} E_i\big[1_{\{X_n = i\}}\big] = \sum_{n \ge 1} P_i(X_n = i) = \sum_{n \ge 1} p_{ii}(n).$$
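As an illustration of Theorems 6.3.1 and 6.3.4, the following sketch compares the two expressions of $E_0[N_0]$ on a small transient example: the mean $f_{00}/(1 - f_{00})$ of the geometric law, and the partial sums of $\sum_{n\ge1} p_{00}(n)$. The three-state matrix `P` and the helper names are assumptions made for this illustration only; they do not come from the text.

```python
from fractions import Fraction as F

# Hypothetical 3-state chain (an assumption for illustration):
# states 0 and 1 are transient, state 2 is absorbing.
P = [
    [F(1, 2), F(1, 4), F(1, 4)],
    [F(1, 3), F(1, 3), F(1, 3)],
    [F(0), F(0), F(1)],
]

def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def partial_potential(P, i, n_max):
    """Partial sum p_ii(1) + ... + p_ii(n_max), which increases to E_i[N_i]."""
    total, Pn = F(0), P
    for _ in range(n_max):
        total += Pn[i][i]
        Pn = matmul(Pn, P)
    return total

# First-step analysis for f_00 = P_0(T_0 < infinity) in this particular chain:
# h1 = P_1(reach 0) solves h1 = p_10 + p_11 * h1 (state 2 is absorbing).
h1 = P[1][0] / (1 - P[1][1])
f00 = P[0][0] + P[0][1] * h1

mean_geometric = f00 / (1 - f00)      # mean of the law of N_0 given X_0 = 0
mean_potential = float(partial_potential(P, 0, 60))
print(f00, mean_geometric, mean_potential)
```

Since $f_{00} < 1$ here, state 0 is transient and both quantities are finite and agree up to the truncation error of the series.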

Example 6.3.5: Random Walks on Z. The corresponding Markov chain was described in Example 6.1.6. The nonzero terms of its transition matrix are
$$p_{i,i+1} = p, \qquad p_{i,i-1} = 1 - p,$$
where $p \in (0, 1)$. We shall study the nature (recurrent or transient) of any one of its states, say, 0. We have $p_{00}(2n+1) = 0$ and
$$p_{00}(2n) = \frac{(2n)!}{n!\,n!}\, p^n (1-p)^n.$$
By Stirling's equivalence formula $n! \sim (n/e)^n \sqrt{2\pi n}$, the above quantity is equivalent to
$$\frac{[4p(1-p)]^n}{\sqrt{\pi n}}, \qquad (6.17)$$
and the nature of the series $\sum_{n=0}^{\infty} p_{00}(n)$ (convergent or divergent) is that of the series with general term (6.17). If $p \neq \frac12$, in which case $4p(1-p) < 1$, the latter series converges. And if $p = \frac12$, in which case $4p(1-p) = 1$, it diverges. In summary, the states of the random walk on Z are transient if $p \neq \frac12$, recurrent if $p = \frac12$.

Example 6.3.6: Returns to zero of the symmetric random walk. Consider the symmetric ($p = \frac12$) 1-D random walk. Let $\tau_1 = T_0, \tau_2, \ldots$ be the successive return times to state 0. We just learnt in the previous example that $P_0(T_0 < \infty) = 1$. We will compute the generating function of $T_0$ given $X_0 = 0$, and show that the expected return time to 0 is infinite (and therefore the symmetric random walk on Z is null recurrent).

Observe that for $n \ge 1$,
$$P_0(X_{2n} = 0) = \sum_{k \ge 1} P_0(\tau_k = 2n),$$
and therefore, for all $z \in \mathbb{C}$ such that $|z| < 1$,
$$\sum_{n \ge 1} P_0(X_{2n} = 0)\, z^{2n} = \sum_{k \ge 1} \sum_{n \ge 1} P_0(\tau_k = 2n)\, z^{2n} = \sum_{k \ge 1} E_0[z^{\tau_k}].$$
But $\tau_k = \tau_1 + (\tau_2 - \tau_1) + \cdots + (\tau_k - \tau_{k-1})$ and therefore, in view of the iid property of the regenerative cycles, and since $\tau_1 = T_0$,
$$E_0[z^{\tau_k}] = \big(E_0[z^{T_0}]\big)^k.$$
In particular,
$$\sum_{n \ge 0} P_0(X_{2n} = 0)\, z^{2n} = \frac{1}{1 - E_0[z^{T_0}]}$$
(note that the latter sum includes the term for $n = 0$, that is, 1). Direct evaluation of the left-hand side yields
$$\sum_{n \ge 0} \frac{1}{2^{2n}} \frac{(2n)!}{n!\,n!}\, z^{2n} = \frac{1}{\sqrt{1 - z^2}}.$$
Therefore, the generating function of the return time to 0 given $X_0 = 0$ is
$$E_0[z^{T_0}] = 1 - \sqrt{1 - z^2}.$$
Its first derivative $\frac{z}{\sqrt{1 - z^2}}$ tends to $\infty$ as $z \to 1$ from below via real values. Therefore, by Abel's theorem, $E_0[T_0] = \infty$. We see that although the return time to state 0 is almost surely finite, it has an infinite expectation.
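The two examples above can be checked numerically. The sketch below, under the assumption that exact rational arithmetic is acceptable for moderate $n$, recovers the first-return probabilities $P_0(T_0 = 2n)$ from the coefficient form of the identity $\sum_n P_0(X_{2n}=0) z^{2n} = 1/(1 - E_0[z^{T_0}])$ and compares them with the Taylor coefficients of $1 - \sqrt{1 - z^2}$. The function names are illustrative and not taken from the text.

```python
from fractions import Fraction as F
from math import comb, pi, sqrt

def u(n):
    """u(n) = P_0(X_{2n} = 0) = C(2n, n) / 2^{2n} for the symmetric walk on Z."""
    return F(comb(2 * n, n), 4 ** n)

def first_return_probs(n_max):
    """f[n] = P_0(T_0 = 2n) for 1 <= n <= n_max, from the renewal equation
    u(n) = sum_{k=1}^{n} f[k] * u(n - k), which is the coefficient form of
    sum_n u(n) z^{2n} = 1 / (1 - E_0[z^{T_0}])."""
    f = [F(0)]                      # f[0] = 0 since T_0 >= 1
    for n in range(1, n_max + 1):
        f.append(u(n) - sum(f[k] * u(n - k) for k in range(1, n)))
    return f

def sqrt_series_coeff(n):
    """Coefficient of z^{2n} in 1 - sqrt(1 - z^2), for n >= 1."""
    return F(comb(2 * n, n), (2 * n - 1) * 4 ** n)

f = first_return_probs(40)
# Stirling check of Example 6.3.5 with p = 1/2: u(n) * sqrt(pi * n) should be near 1.
stirling_ratio = float(u(400)) * sqrt(pi * 400)
# Truncated means sum_{n <= m} 2n * P_0(T_0 = 2n): they keep growing with m.
partial_means = [float(sum(2 * n * f[n] for n in range(1, m + 1))) for m in (10, 20, 40)]
print(stirling_ratio, partial_means)
```

The growth of the truncated means is only suggestive of $E_0[T_0] = \infty$; the proof is the one given by Abel's theorem above.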

Example 6.3.7: 3-D symmetric random walk. The state space of this Markov chain is $E = \mathbb{Z}^3$. Denoting by $e_1$, $e_2$, and $e_3$ the canonical basis vectors of $\mathbb{R}^3$ (respectively $(1, 0, 0)$, $(0, 1, 0)$, and $(0, 0, 1)$), the non-null terms of the transition matrix of the 3-D symmetric random walk are given by
$$p_{x, x \pm e_i} = \frac16.$$
We elucidate the nature of state, say, $0 = (0, 0, 0)$. Clearly, $p_{00}(2n+1) = 0$ for all $n \ge 0$, and (exercise)
$$p_{00}(2n) = \sum_{0 \le i+j \le n} \frac{(2n)!}{(i!\,j!\,(n-i-j)!)^2} \left(\frac16\right)^{2n}.$$
This can be rewritten as
$$p_{00}(2n) = \sum_{0 \le i+j \le n} \frac{1}{2^{2n}} \binom{2n}{n} \left(\frac{n!}{i!\,j!\,(n-i-j)!}\right)^2 \left(\frac13\right)^{2n}.$$
Using the trinomial formula
$$\sum_{0 \le i+j \le n} \frac{n!}{i!\,j!\,(n-i-j)!} \left(\frac13\right)^n = 1,$$
we obtain the bound
$$p_{00}(2n) \le K_n\, \frac{1}{2^{2n}} \binom{2n}{n} \left(\frac13\right)^n,$$
where
$$K_n = \max_{0 \le i+j \le n} \frac{n!}{i!\,j!\,(n-i-j)!}.$$
For large values of $n$, $K_n$ is bounded as follows. Let $i_0$ and $j_0$ be the values of $i, j$ that maximize $n!/(i!\,j!\,(n-i-j)!)$ in the domain of interest $0 \le i+j \le n$. From the definition of $i_0$ and $j_0$, the quantities
$$\frac{n!}{(i_0-1)!\,j_0!\,(n-i_0-j_0+1)!}, \quad \frac{n!}{(i_0+1)!\,j_0!\,(n-i_0-j_0-1)!}, \quad \frac{n!}{i_0!\,(j_0-1)!\,(n-i_0-j_0+1)!}, \quad \frac{n!}{i_0!\,(j_0+1)!\,(n-i_0-j_0-1)!}$$
are bounded by
$$\frac{n!}{i_0!\,j_0!\,(n-i_0-j_0)!}.$$
The corresponding inequalities reduce to
$$n - i_0 - 1 \le 2j_0 \le n - i_0 + 1 \quad \text{and} \quad n - j_0 - 1 \le 2i_0 \le n - j_0 + 1,$$
and this shows that for large $n$, $i_0 \sim n/3$ and $j_0 \sim n/3$. Therefore, for large $n$, the right-hand side of the bound on $p_{00}(2n)$ is equivalent to
$$\frac{n!}{\big((n/3)!\big)^3\, 2^{2n}\, 3^n} \binom{2n}{n}.$$
By Stirling's equivalence formula, the latter quantity is in turn equivalent to $\frac{3\sqrt3}{2(\pi n)^{3/2}}$, the general term of a convergent series. By the potential matrix criterion (Theorem 6.3.4), state 0 is therefore transient.

A theoretical application of the potential matrix criterion is to the proof that recurrence is a (communication) class property.

Theorem 6.3.8 If $i$ and $j$ communicate, they are either both recurrent or both transient.

Proof. States $i$ and $j$ communicate if and only if there exist integers $M$ and $N$ such that $p_{ij}(M) > 0$ and $p_{ji}(N) > 0$. Going from $i$ to $j$ in $M$ steps, then from $j$ to $j$ in $n$ steps, then from $j$ to $i$ in $N$ steps, is just one way of going from $i$ back to $i$ in $M + n + N$ steps. Therefore, $p_{ii}(M + n + N) \ge p_{ij}(M)\, p_{jj}(n)\, p_{ji}(N)$. Similarly, $p_{jj}(N + n + M) \ge p_{ji}(N)\, p_{ii}(n)\, p_{ij}(M)$. Therefore, writing $\alpha = p_{ij}(M)\, p_{ji}(N)$ (a strictly positive quantity), we have $p_{ii}(M + N + n) \ge \alpha\, p_{jj}(n)$ and $p_{jj}(M + N + n) \ge \alpha\, p_{ii}(n)$. This implies that the series $\sum_{n=0}^{\infty} p_{ii}(n)$ and $\sum_{n=0}^{\infty} p_{jj}(n)$ either both converge or both diverge. Theorem 6.3.4 concludes the proof.
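As a numerical companion to Example 6.3.7, the following sketch evaluates $p_{00}(2n)$ exactly for the 3-D symmetric walk, together with the bound $K_n 2^{-2n}\binom{2n}{n} 3^{-n}$ and its Stirling equivalent $3\sqrt3/(2(\pi n)^{3/2})$. The function names are illustrative choices, and the values of $n$ are arbitrary.

```python
from fractions import Fraction as F
from math import comb, factorial, pi, sqrt

def trinomial(n, i, j):
    """Trinomial coefficient n! / (i! j! (n-i-j)!)."""
    return factorial(n) // (factorial(i) * factorial(j) * factorial(n - i - j))

def p00_3d(n):
    """Exact p_00(2n) for the symmetric walk on Z^3 (rewritten form of the example)."""
    s = sum(trinomial(n, i, j) ** 2 for i in range(n + 1) for j in range(n + 1 - i))
    return F(comb(2 * n, n) * s, 4 ** n * 9 ** n)

def upper_bound(n):
    """K_n * 2^{-2n} * C(2n, n) * 3^{-n}, with K_n the largest trinomial coefficient."""
    K = max(trinomial(n, i, j) for i in range(n + 1) for j in range(n + 1 - i))
    return F(K * comb(2 * n, n), 4 ** n * 3 ** n)

def asymptotic(n):
    """3*sqrt(3) / (2*(pi*n)^(3/2)), the Stirling equivalent of the bound."""
    return 3 * sqrt(3) / (2 * (pi * n) ** 1.5)

for n in (3, 30, 90):
    print(n, float(p00_3d(n)), float(upper_bound(n)), asymptotic(n))
```

The printed columns decay like $n^{-3/2}$, which is consistent with the convergence of $\sum_n p_{00}(n)$ and hence with transience.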

6.3.2 The Stationary Distribution Criterion

The notion of invariant measure extends the notion of stationary distribution and plays a technical role in the recurrence theory of Markov chains.

Definition 6.3.9 A nontrivial (that is, non-null) vector $x = \{x_i\}_{i \in E}$ of non-negative real numbers is called an invariant measure of the stochastic matrix $P = \{p_{ij}\}_{i,j \in E}$ if for all $i \in E$,
$$x_i = \sum_{j \in E} x_j\, p_{ji}. \qquad (6.18)$$
(In abbreviated notation, $0 \le x < \infty$ and $x^T P = x^T$.)

Theorem 6.3.10 Let $P$ be the transition matrix of an irreducible recurrent hmc $\{X_n\}_{n \ge 0}$. Let 0 be a state and let $T_0$ be the return time to 0. Define for all $i \in E$
$$x_i = E_0\Big[\sum_{n \ge 1} 1_{\{X_n = i\}}\, 1_{\{n \le T_0\}}\Big] \qquad (6.19)$$
(for $i \neq 0$, $x_i$ is therefore the expected number of visits to state $i$ before returning to 0). Then, for all $i \in E$,
$$x_i \in (0, \infty), \qquad (6.20)$$
and $x$ is an invariant measure of $P$.

Proof. We make two preliminary observations. First, when $1 \le n \le T_0$, $X_n = 0$ if and only if $n = T_0$. Therefore, $x_0 = 1$. Also,

$$\sum_{i \in E} \sum_{n \ge 1} 1_{\{X_n = i\}}\, 1_{\{n \le T_0\}} = \sum_{n \ge 1} \Big(\sum_{i \in E} 1_{\{X_n = i\}}\Big) 1_{\{n \le T_0\}} = \sum_{n \ge 1} 1_{\{n \le T_0\}} = T_0,$$
and therefore
$$\sum_{i \in E} x_i = E_0[T_0]. \qquad (6.21)$$
We now introduce the so-called taboo transition probability
$${}_0p_{0i}(n) := E_0\big[1_{\{X_n = i\}}\, 1_{\{n \le T_0\}}\big] = P_0(X_1 \neq 0, \ldots, X_{n-1} \neq 0, X_n = i),$$
the probability, starting from state 0, of visiting $i$ at time $n$ before returning to 0 (the "taboo" state). From the definition of $x$,
$$x_i = \sum_{n \ge 1} {}_0p_{0i}(n). \qquad (6.22)$$
We first prove (6.18). Observe that
$${}_0p_{0i}(1) = p_{0i}$$

and (first-step analysis) for all $n \ge 2$,
$${}_0p_{0i}(n) = \sum_{j \neq 0} {}_0p_{0j}(n-1)\, p_{ji}. \qquad (6.23)$$
Summing up all the above equalities, and taking (6.22) into account, we obtain
$$x_i = p_{0i} + \sum_{j \neq 0} x_j\, p_{ji},$$
that is, (6.18), since $x_0 = 1$.

Next we show that $x_i > 0$ for all $i \in E$. Indeed, iterating (6.18), we find $x^T = x^T P^n$, that is, since $x_0 = 1$,
$$x_i = \sum_{j \in E} x_j\, p_{ji}(n) = p_{0i}(n) + \sum_{j \neq 0} x_j\, p_{ji}(n).$$
If $x_i$ were null for some $i \in E$, $i \neq 0$, the latter equality would imply that $p_{0i}(n) = 0$ for all $n \ge 0$, which means that 0 and $i$ do not communicate, in contradiction to the irreducibility assumption.

It remains to show that $x_i < \infty$ for all $i \in E$. As before, we find that
$$1 = x_0 = \sum_{j \in E} x_j\, p_{j0}(n)$$
for all $n \ge 1$, and therefore if $x_i = \infty$ for some $i$, necessarily $p_{i0}(n) = 0$ for all $n \ge 1$, and this also contradicts irreducibility.

Theorem 6.3.11 The invariant measure of an irreducible recurrent stochastic matrix is unique up to a multiplicative factor.

Proof. In the proof of Theorem 6.3.10, we showed that for an invariant measure $y$ of an irreducible chain, $y_i > 0$ for all $i \in E$, and therefore, one can define, for all $i, j \in E$, the matrix $Q$ by
$$q_{ji} = \frac{y_i}{y_j}\, p_{ij}. \qquad (6.24)$$
It is a transition matrix, since $\sum_{i \in E} q_{ji} = \frac{1}{y_j} \sum_{i \in E} y_i\, p_{ij} = \frac{y_j}{y_j} = 1$. The general term of $Q^n$ is
$$q_{ji}(n) = \frac{y_i}{y_j}\, p_{ij}(n). \qquad (6.25)$$
Indeed, supposing (6.25) true for $n$,
$$q_{ji}(n+1) = \sum_{k \in E} q_{jk}\, q_{ki}(n) = \sum_{k \in E} \frac{y_k}{y_j}\, p_{kj}\, \frac{y_i}{y_k}\, p_{ik}(n) = \frac{y_i}{y_j} \sum_{k \in E} p_{ik}(n)\, p_{kj} = \frac{y_i}{y_j}\, p_{ij}(n+1),$$

and (6.25) follows by induction. Clearly, $Q$ is irreducible, since $P$ is irreducible (just observe that in view of (6.25) $q_{ji}(n) > 0$ if and only if $p_{ij}(n) > 0$). Also, $p_{ii}(n) = q_{ii}(n)$, and therefore $\sum_{n \ge 0} q_{ii}(n) = \sum_{n \ge 0} p_{ii}(n)$, which ensures that $Q$ is recurrent (potential matrix criterion). Call $g_{ji}(n)$ the probability, relative to the chain governed by the transition matrix $Q$, of returning to state $i$ for the first time at step $n$ when starting from $j$. First-step analysis gives
$$g_{i0}(n+1) = \sum_{j \neq 0} q_{ij}\, g_{j0}(n), \qquad (6.26)$$
that is, using (6.24),
$$y_i\, g_{i0}(n+1) = \sum_{j \neq 0} (y_j\, g_{j0}(n))\, p_{ji}.$$
Recall that ${}_0p_{0i}(n+1) = \sum_{j \neq 0} {}_0p_{0j}(n)\, p_{ji}$, or, equivalently,
$$y_0\, {}_0p_{0i}(n+1) = \sum_{j \neq 0} (y_0\, {}_0p_{0j}(n))\, p_{ji}.$$
We therefore see that the sequences $\{y_0\, {}_0p_{0i}(n)\}$ and $\{y_i\, g_{i0}(n)\}$ satisfy the same recurrence equation. Their first terms ($n = 1$), respectively $y_0\, {}_0p_{0i}(1) = y_0\, p_{0i}$ and $y_i\, g_{i0}(1) = y_i\, q_{i0}$, are equal in view of (6.24). Therefore, for all $n \ge 1$,
$${}_0p_{0i}(n) = \frac{y_i}{y_0}\, g_{i0}(n).$$
Summing with respect to $n \ge 1$ and using $\sum_{n \ge 1} g_{i0}(n) = 1$ ($Q$ is recurrent), we obtain the announced result $x_i = \frac{y_i}{y_0}$.

Equality (6.21) and the definition of positive recurrence give the following result:

Theorem 6.3.12 An irreducible recurrent hmc is positive recurrent if and only if its invariant measures $x$ satisfy
$$\sum_{i \in E} x_i < \infty. \qquad (6.27)$$
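The construction of Theorem 6.3.10 can be followed numerically on a small example. The sketch below accumulates the taboo probabilities ${}_0p_{0i}(n)$ through the recursion (6.23), sums them as in (6.22), and then checks the invariance (6.18) and the identity (6.21). The three-state matrix `P`, the truncation level `n_max` and the function name are assumptions made for illustration.

```python
# Hypothetical irreducible chain on E = {0, 1, 2} (an assumption for illustration).
P = [
    [1 / 2, 1 / 2, 0.0],
    [1 / 4, 1 / 4, 1 / 2],
    [1 / 3, 0.0, 2 / 3],
]

def excursion_measure(P, n_max=300):
    """x_i = sum_{n>=1} 0p0i(n), truncated at n_max terms."""
    m = len(P)
    taboo = list(P[0])                 # 0p0i(1) = p_0i
    x = list(taboo)
    for _ in range(n_max - 1):
        # (6.23): 0p0i(n) = sum over j != 0 of 0p0j(n-1) * p_ji
        taboo = [sum(taboo[j] * P[j][i] for j in range(1, m)) for i in range(m)]
        x = [a + b for a, b in zip(x, taboo)]
    return x

x = excursion_measure(P)
xP = [sum(x[j] * P[j][i] for j in range(len(P))) for i in range(len(P))]
mean_return_time = sum(x)              # (6.21): sum_i x_i = E_0[T_0]
pi_ = [v / mean_return_time for v in x]
print(x, xP, mean_return_time, pi_)
```

For this particular matrix, first-step analysis gives $E_0[T_0] = 8/3$ independently, which matches the sum of the computed $x_i$ up to truncation error.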

Remark 6.3.13 An hmc may well be irreducible and possess an invariant measure, and yet not be recurrent. The simplest example is the one-dimensional non-symmetric random walk, which was shown to be transient and yet admits $x_i \equiv 1$ as an invariant measure. It turns out, however, that the existence of a stationary probability distribution is necessary and sufficient for an irreducible chain (not a priori assumed recurrent) to be positive recurrent.

Theorem 6.3.14 An irreducible homogeneous Markov chain is positive recurrent if and only if there exists a stationary distribution. Moreover, the stationary distribution $\pi$ is, when it exists, unique, and $\pi > 0$.

Proof. The direct part follows from Theorems 6.3.10 and 6.3.12. For the converse part, assume the existence of a stationary distribution $\pi$. Iterating $\pi^T = \pi^T P$, we obtain $\pi^T = \pi^T P^n$, that is, for all $i \in E$,
$$\pi(i) = \sum_{j \in E} \pi(j)\, p_{ji}(n).$$
If the chain were transient, then, for all states $i, j$,
$$\lim_{n \uparrow \infty} p_{ji}(n) = 0.$$
Indeed $p_{ji}(n) = E_j[1_{\{X_n = i\}}]$, $\lim_{n \uparrow \infty} 1_{\{X_n = i\}} = 0$ ($i$ is transient, hence visited only finitely often), and $1_{\{X_n = i\}} \le 1$, so that, by dominated convergence, $\lim_{n \uparrow \infty} E_j[1_{\{X_n = i\}}] = 0$. Since $p_{ji}(n)$ is bounded by 1 uniformly in $j$ and $n$, we have by dominated convergence
$$\pi(i) = \lim_{n \uparrow \infty} \sum_{j \in E} \pi(j)\, p_{ji}(n) = \sum_{j \in E} \pi(j) \Big(\lim_{n \uparrow \infty} p_{ji}(n)\Big) = 0.$$
This contradicts the assumption that $\pi$ is a stationary distribution ($\sum_{i \in E} \pi(i) = 1$). The chain must therefore be recurrent, and by Theorem 6.3.12, it is positive recurrent. The stationary distribution $\pi$ of an irreducible positive recurrent chain is unique (use Theorem 6.3.11 and the fact that there is no choice for a multiplicative factor but 1). Also recall that $\pi(i) > 0$ for all $i \in E$ (see Theorem 6.3.10).

Theorem 6.3.15 Let $\pi$ be the unique stationary distribution of an irreducible positive recurrent chain, and let $T_i$ be the return time to state $i$. Then
$$\pi(i)\, E_i[T_i] = 1. \qquad (6.28)$$

Proof. This equality is a direct consequence of expression (6.19) for the invariant measure. Indeed, $\pi$ is obtained by normalization of $x$: for all $i \in E$,
$$\pi(i) = \frac{x_i}{\sum_{j \in E} x_j},$$
and in particular, for $i = 0$, recalling that $x_0 = 1$ and using (6.21),
$$\pi(0) = \frac{x_0}{\sum_{j \in E} x_j} = \frac{1}{E_0[T_0]}.$$
Since state 0 does not play a special role in the analysis, (6.28) is true for all $i \in E$.

The situation is extremely simple when the state space is finite.

Theorem 6.3.16 An irreducible hmc with finite state space is positive recurrent.

Proof. We first show recurrence. If the chain were transient, then, for all $i, j \in E$,
$$\lim_{n \uparrow \infty} p_{ij}(n) = 0$$
(see the argument in the proof of Theorem 6.3.14), and therefore, since the state space is finite,
$$\lim_{n \uparrow \infty} \sum_{j \in E} p_{ij}(n) = 0.$$
But for all $n \ge 0$,
$$\sum_{j \in E} p_{ij}(n) = 1,$$
a contradiction. Therefore, the chain is recurrent. By Theorem 6.3.10 it has an invariant measure $x$. Since $E$ is finite, $\sum_{i \in E} x_i < \infty$, and therefore the chain is positive recurrent, by Theorem 6.3.12.

Example 6.3.17: A Random Walk on Z Reflected at 0. This chain has the state space $E = \mathbb{N}$ and the transition graph of the figure below. It is assumed that $p_i$ (and therefore $q_i = 1 - p_i$) are in the open interval $(0, 1)$ for all $i \ge 1$, so that the chain is irreducible.

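[Figure: Reflected random walk. Transition graph on the states 0, 1, 2, ..., i, i + 1, ...: from 0 to 1 with probability p0 = 1, from i to i + 1 with probability pi, and from i to i − 1 with probability qi.]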

The invariant measure equation $x^T = x^T P$ takes in this case the form
$$x_0 = x_1 q_1, \qquad x_i = x_{i-1}\, p_{i-1} + x_{i+1}\, q_{i+1}, \quad i \ge 1,$$
with $p_0 = 1$. The general solution is, for $i \ge 1$,
$$x_i = x_0\, \frac{p_0 \cdots p_{i-1}}{q_1 \cdots q_i}.$$
The positive recurrence condition $\sum_{i \in E} x_i < \infty$ is
$$1 + \sum_{i \ge 1} \frac{p_0 \cdots p_{i-1}}{q_1 \cdots q_i} < \infty,$$
and if it is satisfied, the stationary distribution $\pi$ is obtained by normalization of the general solution. This gives
$$\pi(0) = \Big(1 + \sum_{i \ge 1} \frac{p_0 \cdots p_{i-1}}{q_1 \cdots q_i}\Big)^{-1},$$
and for $i \ge 1$,
$$\pi(i) = \pi(0)\, \frac{p_0 \cdots p_{i-1}}{q_1 \cdots q_i}.$$
In the special case where $p_i = p$, $q_i = q = 1 - p$ (for $i \ge 1$), the positive recurrence condition becomes $1 + \frac{1}{q} \sum_{j \ge 0} \big(\frac{p}{q}\big)^j < \infty$, that is to say $p < q$, or equivalently, $p < \frac12$.
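The special case of Example 6.3.17 lends itself to an exact check. The sketch below, assuming the homogeneous case $p_i = p$ for $i \ge 1$ with the hypothetical value $p = 1/3$, builds the first terms of the general solution with $x_0 = 1$, verifies the invariant measure equations term by term, and evaluates $\pi(0)$ from the closed form of the geometric series.

```python
from fractions import Fraction as F

def reflected_walk_measure(p, n_states):
    """x_0 = 1 and x_i = p_0...p_{i-1} / (q_1...q_i), with p_0 = 1 and
    p_i = p, q_i = 1 - p for i >= 1 (homogeneous special case)."""
    q = 1 - p
    x = [F(1)]
    for i in range(1, n_states):
        p_prev = F(1) if i == 1 else p     # p_{i-1}, with the convention p_0 = 1
        x.append(x[-1] * p_prev / q)
    return x

p = F(1, 3)                                 # hypothetical value with p < 1/2
q = 1 - p
x = reflected_walk_measure(p, 60)

# Invariant measure equations: x_0 = x_1 q_1 and x_i = x_{i-1} p_{i-1} + x_{i+1} q_{i+1}.
balance_ok = x[0] == x[1] * q and all(
    x[i] == x[i - 1] * (F(1) if i == 1 else p) + x[i + 1] * q
    for i in range(1, len(x) - 1)
)
# Closed form of 1 + (1/q) * sum_{j>=0} (p/q)^j, valid only when p < q.
total = 1 + 1 / (q - p)
pi0 = 1 / total
print(balance_ok, pi0, float(sum(x)))
```

With $p = 1/3$ the normalizing constant is 4, so $\pi(0) = 1/4$; for $p \ge 1/2$ the closed form above does not apply because the series diverges.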

Birth-and-death Markov Chains

Birth-and-death process models are omnipresent in operations research and, of course, in biology. We first define the birth-and-death process with a bounded population. The state space of such a chain is $E = \{0, 1, \ldots, N\}$ and its transition matrix is
$$P = \begin{pmatrix}
r_0 & p_0 & & & & & & \\
q_1 & r_1 & p_1 & & & & & \\
 & q_2 & r_2 & p_2 & & & & \\
 & & \ddots & \ddots & \ddots & & & \\
 & & & q_i & r_i & p_i & & \\
 & & & & \ddots & \ddots & \ddots & \\
 & & & & & q_{N-1} & r_{N-1} & p_{N-1} \\
 & & & & & & q_N & r_N
\end{pmatrix},$$
where $p_i > 0$ for all $i \in E \setminus \{N\}$, $q_i > 0$ for all $i \in E \setminus \{0\}$, $r_i \ge 0$ for all $i \in E$, and $p_i + q_i + r_i = 1$ for all $i \in E$ (with $q_0 = 0$ and $p_N = 0$). The positivity conditions placed on the $p_i$'s and $q_i$'s guarantee that the chain is irreducible. Since the state space is finite, it is positive recurrent (Theorem 6.3.16), and it has a unique stationary distribution. Motivated by the Ehrenfest hmc, which is reversible in the stationary state, we make the educated guess that the birth-and-death process considered has the same property. This will be the case if and only if there exists a probability distribution $\pi$ on $E$ satisfying the detailed balance equations, that is, such that
$$\pi(i-1)\, p_{i-1} = \pi(i)\, q_i \quad (1 \le i \le N).$$
Letting $w_0 = 1$ and
$$w_i = \prod_{k=1}^{i} \frac{p_{k-1}}{q_k} \quad (1 \le i \le N),$$
we find that
$$\pi(i) = \frac{w_i}{\sum_{j=0}^{N} w_j} \quad (0 \le i \le N) \qquad (6.29)$$
indeed satisfies the detailed balance equations and is therefore the (unique) stationary distribution of the chain.

We now treat the unbounded birth-and-death process with state space $E = \mathbb{N}$ and transition matrix as in the previous example (except that the state space is now "unbounded on the right"). We assume that the $p_i$'s and $q_i$'s are positive in order to guarantee irreducibility. The same reversibility argument as above applies with a little difference. In fact we can show that the $w_i$'s defined above satisfy the detailed balance equations and therefore the global balance equations. Therefore the vector $\{w_i\}_{i \in E}$ is the unique, up to a multiplicative factor, invariant measure of the chain. It can be normalized to a probability distribution if and only if
$$\sum_{j=0}^{\infty} w_j < \infty.$$
Therefore in this case, and only in this case, there exists a (unique) stationary distribution, also given by (6.29) (with the sum in the denominator extended to all $j \ge 0$).

Note that the stationary distribution, when it exists, does not depend on the $r_i$'s. The recurrence properties of the above unbounded birth-and-death process are therefore the same as those of the chain below, which is however not aperiodic. For aperiodicity of the original chain, it suffices to assume at least one of the $r_i$'s to be positive (Remark 6.2.5).

[Figure: transition graph on the states 0, 1, 2, ..., i − 1, i, i + 1, ...: from 0 to 1 with probability p0 = 1, from i to i + 1 with probability pi, and from i to i − 1 with probability qi.]

We now compute for the (bounded or unbounded) irreducible birth-and-death process the average time it takes to reach a state $b$ from a state $a < b$. In fact, we shall prove that
$$E_a[T_b] = \sum_{k=a+1}^{b} \frac{1}{q_k w_k} \sum_{j=0}^{k-1} w_j. \qquad (6.30)$$
Since obviously $E_a[T_b] = \sum_{k=a+1}^{b} E_{k-1}[T_k]$, it suffices to prove that
$$E_{k-1}[T_k] = \frac{1}{q_k w_k} \sum_{j=0}^{k-1} w_j. \qquad ()$$
For this, consider for any given $k \in \{0, 1, \ldots, N\}$ the truncated chain which moves on the state space $\{0, 1, \ldots, k\}$ as the original chain, except in state $k$ where it moves one step down with probability $q_k$ and stays still with probability $p_k + r_k$. Use $\tilde E$ to symbolize expectations with respect to the modified chain. The unique stationary distribution of this chain is
$$\tilde\pi_\ell = \frac{w_\ell}{\sum_{j=0}^{k} w_j} \quad (0 \le \ell \le k).$$
First-step analysis yields $\tilde E_k[T_k] = (r_k + p_k) \times 1 + q_k \big(1 + \tilde E_{k-1}[T_k]\big)$, that is,
$$\tilde E_k[T_k] = 1 + q_k\, \tilde E_{k-1}[T_k].$$
Also
$$\tilde E_k[T_k] = \frac{1}{\tilde\pi_k} = \frac{1}{w_k} \sum_{j=0}^{k} w_j,$$
and therefore, since $\tilde E_{k-1}[T_k] = E_{k-1}[T_k]$, we have ().

Example 6.3.18: Special cases. In the special case where $(p_j, q_j, r_j) = (p, q, r)$ for all $j \neq 0, N$, $(p_0, r_0, q_0) = (p, q + r, 0)$ and $(p_N, r_N, q_N) = (0, p + r, q)$, we have $w_i = \big(\frac{p}{q}\big)^i$, and for $1 \le k \le N$ (and $p \neq q$),
$$E_{k-1}[T_k] = \frac{1}{q \big(\frac{p}{q}\big)^k} \sum_{j=0}^{k-1} \Big(\frac{p}{q}\Big)^j = \frac{1}{p - q} \Big(1 - \Big(\frac{q}{p}\Big)^k\Big).$$
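The formulas (6.29) and (6.30) are easy to test on a small bounded chain. The sketch below uses exact rationals; the parameter lists `p`, `q`, `r` are hypothetical, and the recursion used in `mean_passage_up_recursive` is an independent first-step check written for this illustration, not the argument of the text.

```python
from fractions import Fraction as F

def weights(p, q):
    """w_0 = 1 and w_i = prod_{k=1}^{i} p_{k-1} / q_k."""
    w = [F(1)]
    for i in range(1, len(p)):
        w.append(w[-1] * p[i - 1] / q[i])
    return w

def stationary(p, q):
    """Stationary distribution (6.29) of the bounded chain."""
    w = weights(p, q)
    s = sum(w)
    return [wi / s for wi in w]

def is_stationary(pi, p, q, r):
    """Global balance pi P = pi for the tridiagonal matrix with entries q_i, r_i, p_i."""
    n = len(pi)
    for i in range(n):
        inflow = pi[i] * r[i]
        if i > 0:
            inflow += pi[i - 1] * p[i - 1]
        if i < n - 1:
            inflow += pi[i + 1] * q[i + 1]
        if inflow != pi[i]:
            return False
    return True

def mean_passage_up(p, q, k):
    """E_{k-1}[T_k] = (1 / (q_k w_k)) * sum_{j<k} w_j, the summand of (6.30)."""
    w = weights(p, q)
    return sum(w[:k]) / (q[k] * w[k])

def mean_passage_up_recursive(p, q, k):
    """Independent check: m_j = E_j[T_{j+1}] solves p_j m_j = 1 + q_j m_{j-1},
    with m_0 = 1 / p_0 (first-step analysis from state j)."""
    m = 1 / p[0]
    for j in range(1, k):
        m = (1 + q[j] * m) / p[j]
    return m

# Hypothetical chain on {0, ..., 4}; q_0 = 0 and p_N = 0 by convention.
p = [F(1, 2), F(1, 3), F(1, 4), F(1, 2), F(0)]
q = [F(0), F(1, 3), F(1, 2), F(1, 4), F(1, 2)]
r = [1 - a - b for a, b in zip(p, q)]

pi = stationary(p, q)
print(pi, [mean_passage_up(p, q, k) for k in range(1, 5)])
```

The mean time from $a$ to $b$ is then the sum of `mean_passage_up(p, q, k)` over $k = a+1, \ldots, b$, as in (6.30).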



6.3.3 Foster's Theorem

The stationary distribution criterion of positive recurrence of an irreducible chain requires solving the balance equations, and this is not always feasible in practice. The following result (Foster's theorem) gives a more tractable, and in fact quite powerful, sufficient condition.

Theorem 6.3.19 (3) Let the transition matrix $P$ on the countable state space $E$ be irreducible and suppose that there exists a function $h : E \to \mathbb{R}$ such that $\inf_i h(i) > -\infty$ and
$$\sum_{k \in E} p_{ik}\, h(k) < \infty \quad \text{for all } i \in F, \qquad (6.31)$$
$$\sum_{k \in E} p_{ik}\, h(k) \le h(i) - \varepsilon \quad \text{for all } i \notin F, \qquad (6.32)$$
for some finite set $F$ and some $\varepsilon > 0$. Then the corresponding hmc is positive recurrent.

Proof. Since $\inf_i h(i) > -\infty$, one may assume without loss of generality that $h \ge 0$, by adding a constant if necessary. Call $\tau$ the return time to $F$, and define Yn = h(Xn )1{n 0. The counting process $\{N(t)\}_{t \ge 0}$ is a continuous-time hmc with transition semigroup defined by
$$p_{ij}(t) = 1_{\{j \ge i\}}\, e^{-\lambda t}\, \frac{(\lambda t)^{j-i}}{(j-i)!}.$$

Proof. With $C := \{N(s_1) = i_1, \ldots, N(s_k) = i_k\}$, we have, for $j \ge i$,
$$P(N(t+s) = j \mid N(s) = i, C) = \frac{P(N(t+s) = j, N(s) = i, C)}{P(N(s) = i, C)} = \frac{P(N(s, s+t] = j - i, N(s) = i, C)}{P(N(s) = i, C)}.$$
But $N(s, s+t]$ is independent of $N(s)$ and of $C$, and therefore,
$$P(N(s, s+t] = j - i, N(s) = i, C) = P(N(s, s+t] = j - i)\, P(N(s) = i, C),$$
so that
$$P(N(t+s) = j \mid N(s) = i, C) = P(N(s, s+t] = j - i).$$
Similarly,
$$P(N(t+s) = j \mid N(s) = i) = P(N(s, s+t] = j - i) = e^{-\lambda t}\, \frac{(\lambda t)^{j-i}}{(j-i)!}.$$

Example 7.2.5: Flip-flop. Let $N$ be an hpp on $\mathbb{R}_+$ with intensity $\lambda > 0$. Define the flip-flop process with state space $\{+1, -1\}$ by
$$X(t) := X(0) \times (-1)^{N(t)},$$
where $X(0)$ is a $\{+1, -1\}$-valued random variable independent of the counting process $N$. In words: the flip-flop process switches between $-1$ and $+1$ at each event of $N$. It is a continuous-time hmc with transition semigroup
$$P(t) = \frac12 \begin{pmatrix} 1 + e^{-2\lambda t} & 1 - e^{-2\lambda t} \\ 1 - e^{-2\lambda t} & 1 + e^{-2\lambda t} \end{pmatrix}.$$
Proof. The value $X(t+s)$ depends on $N(s, s+t]$ and $X(s)$. Also, $N(s, s+t]$ is independent of $X(0), N(s_1), \ldots, N(s_k)$ when $s_\ell \le s$ ($1 \le \ell \le k$), and the latter random variables determine $X(s_1), \ldots, X(s_k)$. Therefore, $X(t+s)$ is independent of $X(s_1), \ldots, X(s_k)$ given $X(s)$, that is, $\{X(t)\}_{t \ge 0}$ is a Markov chain. Moreover,
$$P(X(t+s) = 1 \mid X(s) = -1) = P(N(s, s+t] \text{ is odd}) = \sum_{k=0}^{\infty} e^{-\lambda t}\, \frac{(\lambda t)^{2k+1}}{(2k+1)!} = \frac12 \big(1 - e^{-2\lambda t}\big),$$
that is, $p_{-1,+1}(t) = \frac12 (1 - e^{-2\lambda t})$. Similar computations give the announced result for $p_{+1,-1}(t)$.
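The flip-flop formulas can be checked numerically. The sketch below sums the odd Poisson terms and compares the result with the closed form, then checks the semigroup property $P(t+s) = P(t)P(s)$. The values of `lam`, `s`, `t` and the truncation level `terms` are arbitrary assumptions.

```python
from math import exp, factorial

def prob_odd(lam, t, terms=60):
    """P(N(s, s+t] is odd) for a Poisson count of mean lam*t (truncated series)."""
    return sum(exp(-lam * t) * (lam * t) ** (2 * k + 1) / factorial(2 * k + 1)
               for k in range(terms))

def flip_flop(lam, t):
    """Closed-form transition matrix of the flip-flop process at time t."""
    a = 0.5 * (1 + exp(-2 * lam * t))
    b = 0.5 * (1 - exp(-2 * lam * t))
    return [[a, b], [b, a]]

def matmul(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

lam, s, t = 1.5, 0.4, 0.8              # arbitrary illustrative values
series = prob_odd(lam, t)
closed = flip_flop(lam, t)[0][1]
lhs = flip_flop(lam, t + s)
rhs = matmul(flip_flop(lam, t), flip_flop(lam, s))
print(series, closed)
```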

The Uniform hmc

Definition 7.2.6 Let $\{\hat X_n\}_{n \ge 0}$ be a discrete-time hmc with countable state space $E$ and transition matrix $K = \{k_{ij}\}_{i,j \in E}$, and let $N$ be an hpp on $\mathbb{R}_+$ of intensity $\lambda > 0$ and associated time sequence $\{T_n\}_{n \ge 1}$. Suppose that $\{\hat X_n\}_{n \ge 0}$ and $N$ are independent. The stochastic process
$$X(t) = \hat X_{N(t)} \quad (t \ge 0)$$
is called a uniform Markov chain. The Poisson process $N$ is the clock, and the chain $\{\hat X_n\}_{n \ge 0}$ is the subordinated chain.

[Figure: Uniform Markov chain. Sample path of X(t) against t, taking the values X̂0, X̂1, ..., X̂7 on the successive intervals delimited by the times T0 = 0, T1, ..., T7.]

Remark 7.2.7 Observe that $X(T_n) = \hat X_n$ for all $n \ge 0$. Observe also that the discontinuity times of the uniform chain are all events of $N$ but that not all events of $N$ are discontinuity times, since it may well occur that $\hat X_{n-1} = \hat X_n$ (a "transition" of type $i \to i$ of the subordinated chain).

The process $\{X(t)\}_{t \ge 0}$ is a continuous-time hmc (Exercise 7.5.3). Its transition semigroup is
$$P(t) = \sum_{n=0}^{\infty} e^{-\lambda t}\, \frac{(\lambda t)^n}{n!}\, K^n, \qquad (7.15)$$
that is,
$$p_{ij}(t) = \sum_{n=0}^{\infty} e^{-\lambda t}\, \frac{(\lambda t)^n}{n!}\, k_{ij}(n).$$
Indeed,
$$P_i(X(t) = j) = P_i(\hat X_{N(t)} = j) = \sum_{n=0}^{\infty} P_i(N(t) = n, \hat X_n = j) = \sum_{n=0}^{\infty} P_i(N(t) = n)\, P_i(\hat X_n = j).$$
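Formula (7.15) can be evaluated directly by truncating the Poisson mixture. The sketch below is one way to do so in plain Python; the truncation level `terms`, the matrices `K_flip` and `K3`, and the numerical values are assumptions for illustration. With the two-state matrix that always switches state, the result should reproduce the flip-flop semigroup of Example 7.2.5.

```python
from math import exp

def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    m = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

def uniform_semigroup(K, lam, t, terms=100):
    """P(t) = sum_n e^{-lam t} (lam t)^n / n! * K^n, truncated after `terms` terms."""
    m = len(K)
    Kn = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]  # K^0 = I
    weight = exp(-lam * t)             # Poisson weight for n = 0
    Pt = [[0.0] * m for _ in range(m)]
    for n in range(terms):
        for i in range(m):
            for j in range(m):
                Pt[i][j] += weight * Kn[i][j]
        Kn = matmul(Kn, K)
        weight *= lam * t / (n + 1)    # Poisson weight for n + 1
    return Pt

K_flip = [[0.0, 1.0], [1.0, 0.0]]      # subordinated chain that switches at each step
K3 = [[1 / 2, 1 / 2, 0.0], [1 / 4, 1 / 4, 1 / 2], [1 / 3, 0.0, 2 / 3]]  # hypothetical

lam, s, t = 1.5, 0.4, 0.8
P_flip = uniform_semigroup(K_flip, lam, t)
lhs = uniform_semigroup(K3, lam, s + t)
rhs = matmul(uniform_semigroup(K3, lam, s), uniform_semigroup(K3, lam, t))
print(P_flip, lhs)
```

The truncation is harmless here because the Poisson weights decay factorially; for large $\lambda t$ more terms would be needed.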

Definition 7.2.8 The probability distribution $\pi$ on $E$ is called a stationary distribution of the continuous-time hmc, or of its transition semi-group, if
$$\pi^T P(t) = \pi^T \quad (t \ge 0).$$
From (7.12), we see that if the initial distribution of the chain is a stationary distribution $\pi$, then the distribution at any time $t \ge 0$ is $\pi$, and moreover, the chain is stationary, since for all $k \ge 1$, all $0 \le t_1 < \ldots < t_k$ and all states $i_1, \ldots, i_k$, the quantity
$$P(X(t_1 + t) = i_1, \ldots, X(t_k + t) = i_k) = \pi(i_1)\, p_{i_1, i_2}(t_2 - t_1) \cdots p_{i_{k-1}, i_k}(t_k - t_{k-1})$$
does not depend on $t \ge 0$. Therefore:

Theorem 7.2.9 A continuous-time hmc having for initial distribution a stationary distribution of the transition semi-group is stationary.

Example 7.2.10: Uniform hmc, take 2. In the case of the uniform hmc of Definition 7.2.6, if $\pi$ is a stationary distribution of the subordinated chain, $\pi^T K^n = \pi^T$, and therefore in view of (7.15), $\pi^T P(t) = \pi^T$. Conversely, if $\pi$ is a stationary distribution of the continuous-time hmc, then, by (7.15),
$$\pi^T = \sum_{n=0}^{\infty} e^{-\lambda t}\, \frac{(\lambda t)^n}{n!}\, \pi^T K^n,$$
and, moving the $n = 0$ term to the left-hand side, dividing by $\lambda t$ and letting $t \downarrow 0$, we obtain $\pi^T = \pi^T K$.

7.2.2 The Local Characteristics

Let $\{P(t)\}_{t \ge 0}$ be a transition semi-group on $E$, that is, for each $t, s \ge 0$,
(a) $P(t)$ is a stochastic matrix,
(b) $P(0) = I$,
(c) $P(t+s) = P(t)P(s)$.
Suppose moreover that the semi-group is continuous at the origin, that is,
(d) $\lim_{h \downarrow 0} P(h) = P(0) = I$,
where the convergence therein is pointwise and for each entry. In Exercise 7.5.4, the reader is invited to prove that continuity at the origin implies continuity at any time, that is, $\lim_{h \to 0} P(t+h) = P(t)$ for all $t > 0$. The result to follow is purely analytical: it does not require $\{P(t)\}_{t \ge 0}$ to be the transition semigroup of some continuous-time hmc.

Theorem 7.2.11 Let $\{P(t)\}_{t \ge 0}$ be a continuous transition semi-group on the countable state space $E$. For any state $i$, there exists
$$q_i := \lim_{h \downarrow 0} \frac{1 - p_{ii}(h)}{h} \in [0, \infty], \qquad (7.16)$$
and for any pair $i, j$ of different states, there exists
$$q_{ij} := \lim_{h \downarrow 0} \frac{p_{ij}(h)}{h} \in [0, \infty). \qquad (7.17)$$
Proof. For all $t \ge 0$ and all $n \ge 1$, we have $P(t) = \big[P\big(\tfrac{t}{n}\big)\big]^n$ and therefore $p_{ii}(t) \ge \big[p_{ii}\big(\tfrac{t}{n}\big)\big]^n$ ($i \in E$). Since $\lim_{h \downarrow 0} p_{ii}(h) = 1$, there exists an $\varepsilon > 0$ such that $p_{ii}(h) > 0$ for all $h \in [0, \varepsilon]$. For $n$ sufficiently large, $\tfrac{t}{n} \in [0, \varepsilon]$. Therefore, for all $t \ge 0$, $p_{ii}(t) > 0$, and the non-negative quantity
$$f_i(t) := -\log p_{ii}(t)$$
is finite. Also, $\lim_{h \downarrow 0} f_i(h) = 0$. Moreover, from $P(t)P(s) = P(t+s)$, we have that $p_{ii}(t+s) \ge p_{ii}(t)\, p_{ii}(s)$, and therefore, the function $f_i$ is subadditive, that is,
$$f_i(t+s) \le f_i(t) + f_i(s) \quad (s, t \in \mathbb{R}_+).$$

Define the (possibly infinite) non-negative real number
$$q_i := \sup_{t > 0} \frac{f_i(t)}{t}.$$
Then (Theorem B.5.1)
$$\lim_{h \downarrow 0} \frac{f_i(h)}{h} = q_i.$$
Therefore,
$$\lim_{h \downarrow 0} \frac{1 - p_{ii}(h)}{h} = \lim_{h \downarrow 0} \frac{1 - e^{-f_i(h)}}{f_i(h)}\, \frac{f_i(h)}{h} = q_i,$$

and this proves the first equality in (7.16). It now remains to prove (7.17). For this, take two different states $i$ and $j$. Since $p_{ii}(t)$ and $p_{jj}(t)$ tend to 1 as $t > 0$ tends to 0, there exists for any $c \in (\frac12, 1)$ a number $\delta > 0$ such that for $t \in [0, \delta]$, $p_{ii}(t) > c$ and $p_{jj}(t) > c$. Denote by $\{X_n\}_{n \ge 0}$ the discrete-time hmc defined by $X_n = X(nh)$, with transition matrix $P(h)$. Let $n > 0$ be an integer and $h > 0$ be such that $0 \le nh \le \delta$. One way to pass from state $i$ at time 0 to state $j$ at time $n$ is to pass through state $i$ at time $r$ for some $r$, $0 \le r \le n - 1$, without visiting state $j$ meanwhile, then to pass from $i$ at time $r$ to state $j$ at time $r + 1$, and finally to pass from $j$ at time $r + 1$ to state $j$ at time $n$. The paths corresponding to different values of $r$ are different, but they do not exhaust the possibilities of going from $X_0 = i$ to $X_n = j$. Therefore,
$$p_{ij}(nh) \ge \sum_{r=0}^{n-1} P(X_1 \neq j, \ldots, X_{r-1} \neq j, X_r = i \mid X_0 = i)\, p_{ij}(h)\, P(X_n = j \mid X_{r+1} = j).$$

The parameters $\delta$, $n$ and $h$ are such that $P(X_n = j \mid X_{r+1} = j) \ge c$. Also $P(X_1 \neq j, \ldots, X_{r-1} \neq j, X_r = i \mid X_0 = i) = P(X_r = i \mid X_0 = i) - \sum P(X_1 \neq j, \ldots, X_{k-1} \neq j, X_k = j \mid X_0 = i)\, P(X_r = i \mid X_k = j)$ k a | Xn = i) = e−qi a pij , (7.19) where $p_{ij} = \frac{q_{ij}}{q_i}$ if $q_i > 0$, $p_{ij} = 0$ if $q_i = 0$. (In particular, $\{X_n\}_{n \ge 0}$ is an hmc.)

B. If $P(\tau_\infty = \infty) = 1$, the process $\{X(t)\}_{t \ge 0}$ constructed as above is a regular jump hmc with infinitesimal generator $A$.

Proof. A. The $\tau_n$'s form a sequence of $\mathcal{G}_t$-stopping times where
$$\mathcal{G}_t := \sigma(X_0) \vee \Big(\bigvee_{i,j \in E} \mathcal{F}_t^{N_{ij}}\Big).$$
The announced result then follows from the strong Markov property for hpps (Theorem 7.1.5) and the competition theorem (Theorem 7.1.4).

B. By construction, for a given time $t$, the process after time $t$ depends only upon $X(t)$ and the hpps $S_t N_{ij}$ ($i, j \in E$, $i \neq j$). The homogeneous Markov property follows immediately from this observation. It remains to show that $A$ is indeed the infinitesimal generator of the hmc. We first check that, for $i \neq j$,
$$\lim_{t \downarrow 0} \frac1t\, P_i(X(t) = j) = q_{ij}.$$
For this, observe that when $X(t) \neq X(0)$, necessarily $\tau_1 < t$ and write
$$P_i(X(t) = j) = P_i(\tau_2 \le t, X(t) = j) + P_i(\tau_2 > t, X(t) = j)$$
$$= P_i(\tau_2 \le t, X(t) = j) + P_i(\tau_2 > t, X_1 = j, \tau_1 < t)$$
$$= P_i(\tau_2 \le t, X(t) = j) + P_i(X_1 = j, \tau_1 < t) - P_i(\tau_2 \le t, X_1 = j, \tau_1 < t).$$
By Theorem 7.1.4, $P_i(X_1 = j, \tau_1 < t) = (1 - e^{-q_i t})\, \frac{q_{ij}}{q_i}$ and then
$$\lim_{t \downarrow 0} \frac1t\, P_i(X_1 = j, \tau_1 < t) = q_{ij}.$$
It therefore remains to show that $P_i(\tau_2 \le t, X(t) = j)$ and $P_i(\tau_2 \le t, X_1 = j, \tau_1 \le t)$ are $o(t)$ (obvious if $q_i = 0$, and therefore we suppose $q_i > 0$). Both terms are bounded by $P_i(\tau_2 \le t)$, and
$$P_i(\tau_2 \le t) \le P_i(\tau_1 \le t, \tau_2 - \tau_1 \le t) = \sum_{k \in E,\, k \neq i} P_i(\tau_1 \le t, X_1 = k, \tau_2 - \tau_1 \le t) = \sum_{k \in E,\, k \neq i} (1 - e^{-q_i t})\, \frac{q_{ik}}{q_i}\, (1 - e^{-q_k t}) = (1 - e^{-q_i t}) \sum_{k \in E,\, k \neq i} \frac{q_{ik}}{q_i}\, (1 - e^{-q_k t}).$$

But $(1 - e^{-q_i t})$ is $O(t)$ (identically null if $q_i = 0$) and $\lim_{t \downarrow 0} \sum_{k \in E,\, k \neq i} \frac{q_{ik}}{q_i}\, (1 - e^{-q_k t}) = 0$ by dominated convergence. Therefore $P_i(\tau_2 \le t)$ is $o(t)$.

We now check that $\lim_{t \downarrow 0} \frac{1 - p_{ii}(t)}{t} = q_i$. We have
$$1 - p_{ii}(t) = 1 - P_i(X(t) = i) = 1 - P_i(X(t) = i, \tau_1 > t) - P_i(X(t) = i, \tau_1 \le t)$$
$$= 1 - P_i(\tau_1 > t) - P_i(X(t) = i, \tau_1 \le t, \tau_2 \le t) = 1 - e^{-q_i t} - P_i(X(t) = i, \tau_1 \le t, \tau_2 \le t),$$
and the announced result follows from the fact (proved above) that $P_i(\tau_2 \le t)$ is $o(t)$.

Let $Z_i(t) := 1_{\{X(t) = i\}}$. The construction of the state process on $[0, \tau_\infty)$ is summarized by the following equations: for all $i \in E$,
$$Z_i(t) = Z_i(0) + \sum_{j \in E;\, j \neq i} \int_{(0,t]} Z_j(s-)\, N_{ji}(ds) - \sum_{j \in E;\, j \neq i} \int_{(0,t]} Z_i(s-)\, N_{ij}(ds). \qquad (7.20)$$

Equations (7.20) constitute a system of stochastic differential equations driven by the Poisson processes Nij (i, j ∈ E , i = j) for the processes {Zi (t)}t≥0 (i ∈ E). It is sometimes convenient to state conclusion B of Theorem 7.2.18 as Theorem 7.2.19 Let {X(t)}t≥0 be a regular jump process with countable state space E, satisfying (7.20) where Zi (t) = 1{X(t)=i} and where Ni,j (i, j ∈ E , i = j) is a family of independent hpps with respective intensities qij (i, j ∈ E , i = j), and independent of the initial state X(0). Then {X(t)}t≥0 is a regular jump hmc with infinitesimal generator A.

Note that (7.20) is equivalent to the requirement that
$$f(X(t)) - f(X(0)) = \sum_{i,j\in E;\, i\ne j} \{f(j) - f(i)\} \int_{(0,t]} 1_{\{X(s-)=i\}}\, dN_{ij}(s) \tag{7.21}$$
for all non-negative functions $f : E \to \mathbb{R}$. We shall now exploit this canonical representation.

Aggregation of States

Consider a regular jump hmc $\{X(t)\}_{t\ge 0}$ with state space $E$ and infinitesimal generator $A$. Let $\tilde{E} = \{\alpha, \beta, \ldots\}$ be a partition of $E$, and define the process $\{\tilde{X}(t)\}_{t\ge 0}$ taking its values in $\tilde{E}$ by
$$\tilde{X}(t) = \alpha \iff X(t) \in \alpha. \tag{7.22}$$
The process $\{\tilde{X}(t)\}_{t\ge 0}$ is the aggregated chain of $\{X(t)\}_{t\ge 0}$ (with respect to the partition $\tilde{E}$).


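Theorem 7.2.20 Suppose that for all $\alpha, \beta \in \tilde{E}$ ($\alpha \ne \beta$),
$$\sum_{j\in\beta} q_{ij} = \tilde{q}_{\alpha\beta} \qquad (i \in \alpha). \tag{7.23}$$
(This equality not only defines the quantity in the right-hand side but also states the hypothesis that the left-hand side is independent of $i \in \alpha$.) Then $\{\tilde{X}(t)\}_{t\ge 0}$ is a regular jump hmc with state space $\tilde{E}$ and infinitesimal generator $\tilde{A}$, with off-diagonal terms given by (7.23).

Proof. This statement concerns the distribution of $\{X(t)\}_{t\ge 0}$ and therefore we may suppose that this process is of the form (7.21). Then, for $f : \tilde{E} \to \mathbb{R}$ and $s \le t$,
$$\begin{aligned}
f(\tilde{X}(t)) &= f(\tilde{X}(0)) + \sum_{i,j\in E;\, i\ne j} \int_{(0,t]} \{f(\tilde{X}(u)) - f(\tilde{X}(u-))\}\, 1_{\{X(u-)=i\}}\, dN_{ij}(u) \\
&= f(\tilde{X}(0)) + \sum_{\alpha,\beta\in\tilde{E};\, \alpha\ne\beta} \{f(\beta) - f(\alpha)\} \int_{(0,t]} \sum_{i\in\alpha} \Big( 1_{\{X(u-)=i\}} \Big( \sum_{j\in\beta} dN_{ij}(u) \Big)\Big).
\end{aligned}$$
Define for all $\alpha, \beta \in \tilde{E}$, $\alpha \ne \beta$, the point process $\tilde{N}_{\alpha\beta}$ by
$$\tilde{N}_{\alpha\beta}(0, t] = \int_{(0,t]} \sum_{i\in\alpha} \Big( 1_{\{X(s-)=i\}} \Big( \sum_{j\in\beta} dN_{ij}(s) \Big)\Big) + \int_{(0,t]} 1_{\{\tilde{X}(s-)\ne\alpha\}}\, d\hat{N}_{\alpha\beta}(s),$$
where the "dummy" point processes $\{\hat{N}_{\alpha\beta}\}_{\alpha,\beta\in\tilde{E};\, \alpha\ne\beta}$ form an independent family of hpps with intensities $\{\tilde{q}_{\alpha\beta}\}_{\alpha,\beta\in\tilde{E};\, \alpha\ne\beta}$, respectively, and are independent of $X(0)$ and $\{N_{ij}\}_{i,j\in E;\, i\ne j}$. Then
$$f(\tilde{X}(t)) = f(\tilde{X}(0)) + \sum_{\alpha,\beta\in\tilde{E};\, \alpha\ne\beta} (f(\beta) - f(\alpha)) \int_{(0,t]} 1_{\{\tilde{X}(u-)=\alpha\}}\, d\tilde{N}_{\alpha\beta}(u). \tag{7.24}$$
In view of the remark relative to (7.21), it suffices to prove that $\tilde{N}_{\alpha\beta}$ ($\alpha, \beta \in \tilde{E}$, $\alpha \ne \beta$) is a family of independent hpps with respective intensities $\tilde{q}_{\alpha\beta}$ ($\alpha, \beta \in \tilde{E}$, $\alpha \ne \beta$). For this, we apply Watanabe's theorem (Theorem 7.1.8). Let $\{Z(t)\}_{t\ge 0}$ be a left-continuous $\mathcal{F}_t$-adapted stochastic process, where
$$\mathcal{F}_t = \sigma(X(0)) \vee \mathcal{F}_t^{\tilde{N}_{\alpha\beta}} \vee \Big( \bigvee_{u,v\in\tilde{E};\, (u,v)\ne(\alpha,\beta)} \mathcal{F}_\infty^{\tilde{N}_{uv}} \Big).$$
We obtain
$$E\Big[\int_{(0,T]} Z(t)\, d\tilde{N}_{\alpha\beta}(t)\Big] = \sum_{i\in\alpha}\sum_{j\in\beta} E\Big[\int_{(0,T]} Z(t)\, 1_{\{X(t-)=i\}}\, dN_{ij}(t)\Big] + E\Big[\int_{(0,T]} Z(t)\, 1_{\{\tilde{X}(t-)\ne\alpha\}}\, d\hat{N}_{\alpha\beta}(t)\Big],$$
and this quantity is equal, by the smoothing formula (Theorem 7.1.7), to
$$\begin{aligned}
&\sum_{i\in\alpha}\sum_{j\in\beta} E\Big[\int_0^T Z(t)\, 1_{\{X(t-)=i\}}\, q_{ij}\, dt\Big] + E\Big[\int_0^T Z(t)\, 1_{\{\tilde{X}(t)\ne\alpha\}}\, \tilde{q}_{\alpha\beta}\, dt\Big] \\
&\qquad = E\Big[\int_0^T Z(t) \Big[ \Big(\sum_{j\in\beta} q_{ij}\Big) 1_{\{\tilde{X}(t-)=\alpha\}} + \tilde{q}_{\alpha\beta}\, 1_{\{\tilde{X}(t-)\ne\alpha\}} \Big] dt\Big] \\
&\qquad = E\Big[\int_0^T Z(t)\, \tilde{q}_{\alpha\beta}\, dt\Big].
\end{aligned}$$
Therefore, by Theorem 7.1.8, $\tilde{N}_{\alpha\beta}$ is an hpp with intensity $\tilde{q}_{\alpha\beta}$ independent of $\sigma(X(0)) \vee \big( \bigvee_{u,v\in\tilde{E};\, (u,v)\ne(\alpha,\beta)} \mathcal{F}_\infty^{\tilde{N}_{uv}} \big)$. □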
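Condition (7.23) can be checked mechanically on a finite generator. The following is a minimal sketch, not part of the original text: the function name `aggregate_generator`, the list-of-lists encoding of the generator and the numerical rates are all assumptions made for illustration. It returns the aggregated generator when the block sums in (7.23) do not depend on $i \in \alpha$, and `None` otherwise.

```python
from fractions import Fraction


def aggregate_generator(Q, partition):
    """Return the aggregated generator if condition (7.23) holds, else None.

    Q         : square list of lists, Q[i][j] = q_ij for i != j (hypothetical encoding)
    partition : list of blocks, each block a list of state indices
    """
    m = len(partition)
    Qt = [[Fraction(0)] * m for _ in range(m)]
    for a, block_a in enumerate(partition):
        for b, block_b in enumerate(partition):
            if a == b:
                continue
            # Total rate from each state i of block a into block b.
            rates = {sum(Fraction(Q[i][j]) for j in block_b) for i in block_a}
            if len(rates) != 1:
                return None  # the sum in (7.23) depends on i: hypothesis fails
            Qt[a][b] = rates.pop()
    # Diagonal terms chosen so that each row sums to zero (conservative generator).
    for a in range(m):
        Qt[a][a] = -sum(Qt[a][b] for b in range(m) if b != a)
    return Qt


# Illustrative rates only: states 1 and 2 both jump to state 0 at rate 4.
Q_ok = [[-3, 1, 2], [4, -5, 1], [4, 2, -6]]
# Here state 2 jumps to state 0 at rate 5, so the block {1, 2} cannot be aggregated.
Q_bad = [[-3, 1, 2], [4, -5, 1], [5, 2, -7]]
```

With the partition `[[0], [1, 2]]`, the first generator aggregates to a two-state generator with rates 3 and 4, while the second one violates (7.23).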

7.3 Regenerative Structure

7.3.1 The Strong Markov Property

A regular jump hmc has the strong Markov property. More precisely: define, similarly to the discrete-time case, the process after the random time $\tau$:
$$\{S_\tau X(t)\}_{t\ge 0} = \{X(t + \tau)\}_{t\ge 0}$$
(with the convention $X(\infty) = \Delta$, where $\Delta$ is an element not in $E$) and the process before $\tau$:
$$\{X^\tau(t)\}_{t\ge 0} = \{X(t \wedge \tau)\}_{t\ge 0}.$$

Theorem 7.3.1 Let $\{X(t)\}_{t\ge 0}$ be a right-continuous continuous-time hmc with countable state space $E$ and transition semigroup $\{P(t)\}_{t\ge 0}$, and let $\tau$ be a stopping time with respect to $\{X(t)\}_{t\ge 0}$. Let $k \in E$ be a state. Then,
(α) given $X(\tau) = k$, the chain after $\tau$ and the chain before $\tau$ are independent, and
(β) given $X(\tau) = k$, the chain after $\tau$ is a regular jump hmc with transition semigroup $\{P(t)\}_{t\ge 0}$.

Proof. Suppose we have proved that for all states $k$, all positive times $t_1, \ldots, t_n, s_1, \ldots, s_p$ and all real numbers $u_1, \ldots, u_n, v_1, \ldots, v_p$,
$$E\Big[e^{i\sum_{\ell=1}^n u_\ell X(\tau + t_\ell) + i\sum_{m=1}^p v_m X(\tau\wedge s_m)}\, 1_{\{X(\tau)=k\}}\Big] = E\Big[e^{i\sum_{\ell=1}^n u_\ell X(t_\ell)} \,\Big|\, X(0) = k\Big]\, E\Big[e^{i\sum_{m=1}^p v_m X(\tau\wedge s_m)}\, 1_{\{X(\tau)=k\}}\Big]. \tag{7.25}$$
Then, fixing $v_1 = \cdots = v_p = 0$, we obtain
$$\frac{E\Big[e^{i\sum_{\ell=1}^n u_\ell X(\tau + t_\ell)}\, 1_{\{X(\tau)=k\}}\Big]}{P(X(\tau) = k)} = E\Big[e^{i\sum_{\ell=1}^n u_\ell X(t_\ell)} \,\Big|\, X(0) = k\Big],$$
and this shows that given $X(\tau) = k$, $\{X(\tau + t)\}_{t\ge 0}$ has the same distribution as $\{X(t)\}_{t\ge 0}$ given $X(0) = k$. We therefore will have proved (β). For (α), it suffices to rewrite (7.25) as follows, using the previous equality:
$$E\Big[e^{i\sum_{\ell=1}^n u_\ell X(\tau + t_\ell) + i\sum_{m=1}^p v_m X(\tau\wedge s_m)} \,\Big|\, X(\tau) = k\Big] = E\Big[e^{i\sum_{\ell=1}^n u_\ell X(\tau + t_\ell)} \,\Big|\, X(\tau) = k\Big]\, E\Big[e^{i\sum_{m=1}^p v_m X(\tau\wedge s_m)} \,\Big|\, X(\tau) = k\Big]. \tag{7.26}$$

It remains to prove (7.25). For the sake of simplicity, we consider the case where $n = p = 1$, and let $u_1 = u$, $t_1 = t$, $v_1 = v$, $s_1 = s$. Suppose first that $\tau$ takes a countable number of finite values, denoted by $a_j$, and also, maybe, the value $+\infty$. Note that $X(\tau) = k \in E$ implies $\tau < \infty$. Then
$$E\big[e^{iuX(\tau+t) + ivX(\tau\wedge s)}\, 1_{\{X(\tau)=k\}}\big] = \sum_{j\ge 1} E\big[e^{iuX(a_j+t) + ivX(a_j\wedge s)}\, 1_{\{X(a_j)=k\}}\, 1_{\{\tau=a_j\}}\big].$$
For all $j \ge 1$,
$$\begin{aligned}
E\big[e^{iuX(a_j+t) + ivX(a_j\wedge s)}\, 1_{\{X(a_j)=k\}}\, 1_{\{\tau=a_j\}}\big] &= E\big[e^{iuX(a_j+t)}\, 1_{\{X(a_j)=k\}}\, e^{ivX(a_j\wedge s)}\, 1_{\{\tau=a_j\}}\big] \\
&= E\big[e^{iuX(a_j+t)} \mid X(a_j) = k\big]\, E\big[e^{ivX(a_j\wedge s)}\, 1_{\{\tau=a_j\}}\, 1_{\{X(a_j)=k\}}\big],
\end{aligned}$$
where for the last equality, we have used the fact that $1_{\{\tau=a_j\}}$ is $\mathcal{F}^X_{a_j}$-measurable and the Markov property at time $a_j$. Therefore, for all $j \ge 1$,
$$E\big[e^{iuX(a_j+t) + ivX(a_j\wedge s)}\, 1_{\{X(a_j)=k\}}\, 1_{\{\tau=a_j\}}\big] = E\big[e^{iuX(t)} \mid X(0) = k\big]\, E\big[e^{ivX(a_j\wedge s)}\, 1_{\{X(a_j)=k\}}\, 1_{\{\tau=a_j\}}\big].$$
Summing with respect to $j$, we obtain the equality corresponding to (7.25).

To pass from the case where the stopping time $\tau$ takes a countable number of values to the general case, let $\tau(n)$ be the approximation of $\tau$ of Theorem 5.3.13. The random time $\tau(n)$ is an $\mathcal{F}_t^X$-stopping time with a countable number of values such that $\lim_{n\uparrow\infty} \downarrow \tau(n, \omega) = \tau(\omega)$. In particular, $\lim_{n\uparrow\infty} X(\tau(n) \wedge a) = X(\tau \wedge a)$, $\lim_{n\uparrow\infty} X(\tau(n) + b) = X(\tau + b)$ and $\lim_{n\uparrow\infty} 1_{\{X(\tau(n))=k\}} = 1_{\{X(\tau)=k\}}$ (use the fact that a regular jump process is right-continuous). Therefore, letting $n$ go to $\infty$ in (7.25) with $\tau$ replaced by $\tau(n)$, we obtain the result for $\tau$ itself, by dominated convergence. □

Equality (7.26) says that, given $X(\tau) = i$, the random vectors $(X(\tau \wedge s_1), \ldots, X(\tau \wedge s_p))$ and $(X(\tau + t_1), \ldots, X(\tau + t_n))$ are independent. In particular, the events $\{X(\tau \wedge s_1) = i_1, \ldots, X(\tau \wedge s_p) = i_p\}$ and $\{X(\tau + t_1) = j_1, \ldots, X(\tau + t_n) = j_n\}$ are conditionally independent given $X(\tau) = i$. Since this is true for all $i_1, \ldots, i_p, j_1, \ldots, j_n$, and all $s_1, \ldots, s_p, t_1, \ldots, t_n$, this property extends (see Theorem 5.1.10) to events $A$ and $B$ respectively in $\mathcal{F}^{S_\tau X}$ and $\mathcal{F}^{X^\tau}$. And then, finally (see the discussion leading to (7.14)),
$$E^{X(\tau)}[Y \times Z] = E^{X(\tau)}[Y]\, E^{X(\tau)}[Z] \tag{7.27}$$
for all non-negative $Y$ and $Z$ that are respectively $\mathcal{F}^{S_\tau X}$-measurable and $\mathcal{F}^{X^\tau}$-measurable.

7.3.2 Imbedded Chain

Let {τn }n≥0 be the non-decreasing sequence of transition times of a regular jump process {X(t)}t≥0 , where τ0 = 0, and τn = ∞ if there are strictly fewer than n transitions in (0, ∞). Note that, for each n ≥ 0, τn is an FtX -stopping time.


The discrete-time stochastic process $\{X_n\}_{n\ge 0}$ with values in $E$ defined by $X_n := X(\tau_n)$ if $\tau_n < \infty$ and $X_n = X_{n-1}$ if $\tau_n = \infty$ is called the imbedded process of the jump process.

Theorem 7.3.2 Let $\{X(t)\}_{t\ge 0}$ be a continuous-time regular jump (therefore right-continuous) hmc, with infinitesimal generator $A$, transition times sequence $\{\tau_n\}_{n\ge 0}$, and imbedded process $\{X_n\}_{n\ge 0}$. Then
(α) $\{X_n\}_{n\ge 0}$ is a discrete-time hmc with transition matrix given by, for $j \ne i$,
$$p_{ij} = \frac{q_{ij}}{q_i}$$
if $q_i > 0$, and $p_{ij} = 0$ if $q_i = 0$.
(β) For all $n \ge 0$ and all $a \in \mathbb{R}_+$,
$$P(\tau_{n+1} - \tau_n \le a \mid X_0, \ldots, X_n, \tau_1, \ldots, \tau_n) = 1 - e^{-q_{X_n} a}. \tag{7.28}$$
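Part (α) of Theorem 7.3.2 translates directly into a computation on a finite generator. The sketch below is only an illustration under assumptions that are not in the text: the function name `embedded_transition_matrix`, the list-of-lists encoding, and the example rates are hypothetical. For a state with $q_i = 0$ the diagonal entry is set to 1, which is what the convention $X_n = X_{n-1}$ when $\tau_n = \infty$ gives.

```python
def embedded_transition_matrix(Q):
    """Transition matrix of the imbedded chain of a finite generator Q.

    Q is a square list of lists with Q[i][j] = q_ij for i != j and
    Q[i][i] = -q_i (hypothetical encoding chosen for this sketch).
    """
    n = len(Q)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        q_i = -Q[i][i]
        if q_i > 0:
            for j in range(n):
                if j != i:
                    P[i][j] = Q[i][j] / q_i  # p_ij = q_ij / q_i
        else:
            # q_i = 0: no further transition, the imbedded process stays at i.
            P[i][i] = 1.0
    return P


# Illustrative generator: state 2 has q_2 = 0.
Q_example = [[-2.0, 2.0, 0.0], [1.0, -3.0, 2.0], [0.0, 0.0, 0.0]]
P_example = embedded_transition_matrix(Q_example)
```

Each row of `P_example` sums to 1, and the sojourn time in state $i$ is then exponential with parameter $q_i$, as in (7.28).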

Proof. We begin with the following partial result.
(α′) $\{X_n\}_{n\ge 0}$ is a discrete-time hmc on $E$.
(β′) There exists for each $i \in E$ a finite real number $\lambda(i) \ge 0$ such that for all $n \ge 0$ and all $a \in \mathbb{R}_+$,
$$P(\tau_{n+1} - \tau_n \le a \mid X_0, \ldots, X_n, \tau_1, \ldots, \tau_n) = 1 - e^{-\lambda(X_n) a}.$$
It follows from the strong Markov property that given $X(\tau_n) = i \in E$, $\{X(\tau_n + t)\}_{t\ge 0}$ is independent of $\{X(\tau_n \wedge t)\}_{t\ge 0}$ and therefore, given $X_n = i$, the variables $(X_{n+1}, X_{n+2}, \ldots)$ are independent of $(X_0, \ldots, X_n)$, that is, $\{X_n\}_{n\ge 0}$ is a Markov chain. It is clearly homogeneous because the distribution of $\{X(\tau_n + t)\}_{t\ge 0}$ given $X(\tau_n) = i$ is independent of $n$, being identical with the distribution of $\{X(t)\}_{t\ge 0}$ given $X(0) = i$, again by the strong Markov property. We have therefore proved (α′).

Call $p_{ij} = P_i(X(\tau_1) = j)$ the transition probability of $\{X_n\}_{n\ge 0}$ from $i$ to $j$. To prove (β′), it suffices to show that
$$P_i(X_1 = i_1, \ldots, X_n = i_n, \tau_1 - \tau_0 > a_1, \ldots, \tau_n - \tau_{n-1} > a_n) = e^{-\lambda(i) a_1} p_{i i_1}\, e^{-\lambda(i_1) a_2} p_{i_1 i_2} \cdots e^{-\lambda(i_{n-1}) a_n} p_{i_{n-1} i_n}$$
for all $i, i_1, \ldots, i_n \in E$, all $a_1, \ldots, a_n \in \mathbb{R}_+$ and some function $\lambda : E \to \mathbb{R}_+$. In view of the strong Markov property, it suffices to show that for all $i, j \in E$, $a \in \mathbb{R}_+$, there exists a $\lambda(i) \in [0, \infty)$ such that
$$P_i(X_1 = j, \tau_1 - \tau_0 > a) = P_i(X_1 = j)\, e^{-\lambda(i) a}. \tag{7.29}$$
Define $g(t) = P_i(\tau_1 > t)$. For $t, s \ge 0$, using the obvious set identities,
$$\begin{aligned}
g(t + s) &= P_i(\tau_1 > t + s) = P_i(\tau_1 > t + s, \tau_1 > t, X(t) = i) \\
&= P_i(X(t + u) = i \text{ for all } u \in [0, s],\ \tau_1 > t,\ X(t) = i).
\end{aligned}$$
The last expression is, in view of the Markov property at time $t$ and using the fact that $\{\tau_1 > t\} \in \mathcal{F}_t^X$,
$$\begin{aligned}
&P_i(X(t + u) = i \text{ for all } u \in [0, s] \mid X(t) = i)\, P_i(\tau_1 > t, X(t) = i) \\
&\qquad = P_i(X(u) = i \text{ for all } u \in [0, s] \mid X(0) = i)\, P_i(\tau_1 > t) = P_i(\tau_1 > s)\, P_i(\tau_1 > t),
\end{aligned}$$
where the last two equalities again follow from the obvious set identities. Therefore, for all $s, t \ge 0$, $g(t + s) = g(t) g(s)$. Also, $g(t)$ is non-increasing, and $\lim_{t\downarrow 0} g(t) = 1$ (use the fact that the chain is assumed to be a jump process). It follows that there exists a $\lambda(i) \in [0, \infty)$ such that $g(t) = e^{-\lambda(i) t}$, that is, $P_i(\tau_1 > t) = e^{-\lambda(i) t}$, for all $t \ge 0$. Now, using the Markov property and appropriate set identities,
$$\begin{aligned}
P_i(X_1 = j, \tau_1 > t) &= P_i(X(\tau_1) = j, \tau_1 > t, X(t) = i) \\
&= P_i(\text{first jump of } \{X(t + s)\}_{s\ge 0} \text{ is } j,\ \tau_1 > t,\ X(t) = i) \\
&= P_i(\text{first jump of } \{X(t + s)\}_{s\ge 0} \text{ is } j \mid X(t) = i)\, P_i(\tau_1 > t, X(t) = i) \\
&= P_i(\text{first jump of } \{X(s)\}_{s\ge 0} \text{ is } j \mid X(0) = i)\, P_i(\tau_1 > t) \\
&= P_i(X(\tau_1) = j)\, P_i(\tau_1 > t),
\end{aligned}$$
and this is (7.29).

We have now proved (α) and (β), where $q_i$ is replaced by $\lambda(i) \in \mathbb{R}_+$ (only known to exist but not yet identified with $q_i$) and where $\frac{q_{ij}}{q_i}$ is replaced by $P_i(X(\tau_1) = j)$ (not yet identified with $\frac{q_{ij}}{q_i}$). We shall now proceed to the required identifications. For this, define the generator $A'$ on $E$ by
$$q_i' = \lambda(i), \qquad q_{ij}' = \lambda(i)\, P_i(X(\tau_1) = j).$$
This generator is stable and conservative, and we can therefore construct $\{X'(t)\}_{t\ge 0}$, a regular jump hmc associated with $A'$, via the construction of Section 7.2.3, up to $\tau_\infty'$. Then $\{X'(t)\}_{t\ge 0}$ and $\{X(t)\}_{t\ge 0}$ have the same regenerative structure, given by (α′) and (β′), and therefore, they have the same distribution (in particular, $\{X'(t)\}_{t\ge 0}$ is regular, since $\{X(t)\}_{t\ge 0}$ is regular, by assumption). Their respective infinitesimal generators $A'$ and $A$ are therefore identical. □

Theorem 7.3.3 A regular jump hmc is stable and conservative.

Proof. Indeed, a regular jump hmc is strongly Markovian, and therefore has the regenerative structure of Theorem 7.3.2. In the course of the proof of this theorem, we have identified $q_i$ with a certain finite quantity $\lambda(i)$, and therefore $q_i < \infty$. Therefore a regular jump hmc is stable. Also $q_{ij}$ was identified with $\lambda(i) P_i(X(\tau_1) = j)$, that is, $q_i P_i(X(\tau_1) = j)$, and therefore the conservation property is clear. □

Definition 7.3.4 A state $i \in E$ such that $q_i = 0$ is called permanent; otherwise, it is called essential.


In view of (7.28), if $X(\tau_n) = i$, a permanent state, then $\tau_{n+1} - \tau_n = \infty$; that is, there is no more transition at finite distance, hence the terminology.

Example 7.3.5: Uniform hmc, take 3. For the uniform hmc (see Definition 7.2.6), the imbedded process $\{X_n\}_{n\ge 0}$ is an hmc with state space $E$, and if $i \in E$ is not permanent (that is, in this case, if $k_{ii} < 1$), then for $j \ne i$,
$$p_{ij} = \frac{k_{ij}}{1 - k_{ii}}.$$
Indeed, $\{X_n\}_{n\ge 0}$ is obtained from $\{\hat{X}_n\}_{n\ge 0}$ by considering only the "real" transitions (exercise).

An immediate consequence of Theorem 7.3.2 is:

Corollary 7.3.6 Two regular jump hmcs with the same infinitesimal generator and the same initial distribution have the same distribution.

Another way to state this is as follows: two regular jump hmcs with the same infinitesimal generator have the same transition semigroup.

Example 7.3.7: Uniformization. A regular jump hmc with infinitesimal generator $A$ such that $\sup_{i\in E} q_i < \infty$ has the same transition semigroup as a uniform chain. Indeed: select any real number $\lambda > \sup_{i\in E} q_i$, and define the transition matrix $K$ by (7.18). One checks that it is indeed a stochastic matrix. The uniform chain corresponding to $(\lambda, K)$ has the infinitesimal generator $A$. Any pair $(\lambda, K)$ as above gives rise to a uniform version of the chain. The minimal uniform version is, by definition, that with $\lambda = \sup_{i\in E} q_i$.
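The uniformization step of Example 7.3.7 can be sketched on a finite generator. Relation (7.18) is not reproduced in this section; the code below assumes it is the usual $k_{ij} = q_{ij}/\lambda$ for $j \ne i$ and $k_{ii} = 1 - q_i/\lambda$, that is, $K = I + A/\lambda$. The function name `uniformize` and the example rates are hypothetical.

```python
def uniformize(Q, lam=None):
    """Return (lam, K) with K = I + Q / lam, assuming this is relation (7.18).

    Q is a finite generator as a square list of lists (hypothetical encoding).
    If lam is omitted, the minimal choice lam = max_i q_i is used.
    """
    n = len(Q)
    q_max = max(-Q[i][i] for i in range(n))
    if lam is None:
        lam = q_max  # minimal uniform version
    if lam <= 0 or lam < q_max:
        raise ValueError("lam must be positive and at least sup_i q_i")
    K = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    return lam, K


# Illustrative two-state generator with q_0 = 2 and q_1 = 1.
lam_min, K_min = uniformize([[-2.0, 2.0], [1.0, -1.0]])
```

Under this assumption, every row of `K_min` is non-negative and sums to 1, which is the stochastic-matrix check mentioned in the example; a larger `lam` only increases the diagonal entries, that is, the proportion of fictitious transitions.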

Definition 7.3.8 A continuous-time hmc with an infinitesimal generator such that
$$\sup_{i\in E} q_i < \infty \tag{7.30}$$
is called uniformizable.

7.3.3 Conditions for Regularity

Theorem 7.3.2 gives a way of constructing a regular jump hmc with values in a countable state space and admitting a given generator that is stable and conservative. (We shall also suppose for simplicity that this generator is essential ($q_i > 0$ for all $i \in E$).) It suffices to construct a sequence $\tau_0 = 0, X_0, \tau_1 - \tau_0, X_1, \tau_2 - \tau_1, X_2, \ldots$ according to
$$P(X_{n+1} = j, \tau_{n+1} - \tau_n \le x \mid X_0, \ldots, X_n, \tau_0, \ldots, \tau_n) = \frac{q_{X_n j}}{q_{X_n}}\, \big(1 - e^{-q_{X_n} x}\big), \tag{7.31}$$
the initial state $X_0$ being chosen at random, with arbitrary distribution. The value of $X(t)$ for $\tau_n \le t < \tau_{n+1}$ is then $X_n$. If $\tau_\infty := \lim_{n\uparrow\infty} \uparrow \tau_n = \infty$, we have obtained a regular jump hmc with $A$ as infinitesimal generator.

Definition 7.3.9 The generator $A$ is called non-explosive, or regular, if
$$P_i(\tau_\infty = \infty) = 1 \qquad (i \in E). \tag{7.32}$$
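The construction based on (7.31) is easy to turn into a simulation for a finite state space, where explosion cannot occur. The following sketch is an illustration only: the function name `simulate_jump_chain`, the list-of-lists encoding of the generator and the safety cap `max_jumps` are assumptions, not part of the text. Each step draws an exponential sojourn time with parameter $q_{X_n}$ and then the next state with probabilities $q_{X_n j}/q_{X_n}$.

```python
import random


def simulate_jump_chain(Q, x0, t_max, rng=None, max_jumps=100000):
    """Simulate the path on [0, t_max] as a list of (jump time, new state).

    Q is a finite, stable and conservative generator (square list of lists).
    The returned list starts with (0.0, x0); the state between two
    consecutive jump times is the state recorded at the earlier one.
    """
    rng = rng or random.Random()
    t, x = 0.0, x0
    path = [(0.0, x0)]
    for _ in range(max_jumps):
        q_x = -Q[x][x]
        if q_x == 0:
            break  # q_x = 0: no further transition
        t += rng.expovariate(q_x)  # sojourn time, exponential with rate q_x
        if t > t_max:
            break
        others = [j for j in range(len(Q)) if j != x]
        weights = [Q[x][j] for j in others]  # proportional to q_xj / q_x
        x = rng.choices(others, weights=weights)[0]
        path.append((t, x))
    return path
```

A call such as `simulate_jump_chain([[-2.0, 2.0, 0.0], [1.0, -3.0, 2.0], [4.0, 0.0, -4.0]], 0, 10.0, random.Random(0))` returns the successive transition times and visited states; on a countable state space the same loop would have to monitor the accumulation of jump times, since $\tau_\infty$ may then be finite.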


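Theorem 7.3.10 Let $A$ be a stable and conservative generator on $E$. It is regular if and only if for any real $\lambda > 0$, the system of equations
$$(\lambda + q_i)\, x_i = \sum_{j\in E;\, j\ne i} q_{ij}\, x_j \qquad (i \in E) \tag{7.33}$$
admits no non-negative bounded solution other than the trivial one.

Proof. Let $S_k := \tau_k - \tau_{k-1}$. In particular, $\tau_\infty = \sum_{k=1}^\infty S_k$. The number
$$g_i(\lambda) = E_i\Big[\exp\Big\{-\lambda \sum_{k=1}^\infty S_k\Big\}\Big]$$
is uniformly bounded in $\lambda > 0$ and $i \in E$, and if $P_i(\tau_\infty = \infty) < 1$, it is strictly positive. Also, $x_i := g_i(\lambda)$ ($i \in E$) is a solution of (7.33), as follows from the calculations below:
$$\begin{aligned}
g_i(\lambda) &= E_i\Big[\exp\{-\lambda S_1\} \exp\Big\{-\lambda \sum_{k=2}^\infty S_k\Big\}\Big] \\
&= \Big(\int_0^\infty e^{-\lambda t}\, q_i e^{-q_i t}\, dt\Big)\, E_i\Big[\exp\Big\{-\lambda \sum_{k=2}^\infty S_k\Big\}\Big] \\
&= \frac{q_i}{\lambda + q_i}\, E_i\Big[\exp\Big\{-\lambda \sum_{k=2}^\infty S_k\Big\}\Big]
\end{aligned}$$
and, by first-step analysis,
$$E_i\Big[\exp\Big\{-\lambda \sum_{k=2}^\infty S_k\Big\}\Big] = \sum_{j\in E;\, j\ne i} \frac{q_{ij}}{q_i}\, E_j\Big[\exp\Big\{-\lambda \sum_{k=1}^\infty S_k\Big\}\Big] = \sum_{j\in E;\, j\ne i} \frac{q_{ij}}{q_i}\, g_j(\lambda).$$
Therefore, if $A$ is explosive there exists a non-trivial bounded solution of (7.33).

We now prove the converse. Call $\{g_i(\lambda)\}_{i\in E}$ a bounded solution of (7.33) for a fixed real $\lambda > 0$. We have
$$g_i(\lambda) = E[\exp\{-\lambda S_1\}\, g_{X_1}(\lambda) \mid X_0 = i], \tag{7.34}$$
since first-step analysis shows that the right-hand side is equal to that of (7.33). We prove by induction that
$$g_i(\lambda) = E\Big[\exp\Big\{-\lambda \sum_{k=1}^n S_k\Big\}\, g_{X_n}(\lambda) \,\Big|\, X_0 = i\Big]. \tag{7.35}$$
For this, rewrite (7.34) as $g_i(\lambda) = E[\exp\{-\lambda S_{n+1}\}\, g_{X_{n+1}}(\lambda) \mid X_n = i]$, that is, $g_{X_n}(\lambda) = E[\exp\{-\lambda S_{n+1}\}\, g_{X_{n+1}}(\lambda) \mid X_n]$. Using this expression of $g_{X_n}(\lambda)$ in (7.35), we obtain
$$g_i(\lambda) = E\Big[\exp\Big\{-\lambda \sum_{k=1}^{n+1} S_k\Big\}\, g_{X_{n+1}}(\lambda) \,\Big|\, X_0 = i\Big]. \tag{7.36}$$
Therefore, (7.35) implies (7.36) (the forward step in the induction argument). Since (7.35) is true for $n = 1$ (Eqn. (7.34)), it is true for all $n \ge 1$ and therefore, since $K := \sup_{i\in E} |g_i(\lambda)| < \infty$,
$$|g_i(\lambda)| \le K\, E_i\Big[\exp\Big\{-\lambda \sum_{k=1}^\infty S_k\Big\}\Big].$$
Therefore, if $\{g_i(\lambda)\}_{i\in E}$ is not trivial, it must hold for some $i \in E$ that $P_i\big(\sum_{k=1}^\infty S_k < \infty\big) > 0$ or, equivalently, $P_i(\tau_\infty = \infty) < 1$. □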

Applied to a birth-and-death process, Theorem 7.3.10 gives Reuter's criterion.

Theorem 7.3.11 Let $A$ be a generator on $E = \mathbb{N}$ defined by $q_{n,n+1} = \lambda_n$ and $q_{n,n-1} = \mu_n 1_{\{n\ge 1\}}$, where the birth parameters $\lambda_n$ are strictly positive. A necessary and sufficient condition of non-explosion of this generator is
$$\sum_{n=1}^\infty \Big[\frac{1}{\lambda_n} + \frac{\mu_n}{\lambda_n \lambda_{n-1}} + \cdots + \frac{\mu_n \cdots \mu_1}{\lambda_n \cdots \lambda_1 \lambda_0}\Big] = \infty. \tag{7.37}$$

Proof. The system of equations (7.33) reads in the particular case of a birth-and-death generator
$$\begin{cases}
\lambda x_0 = -\lambda_0 x_0 + \lambda_0 x_1, \\
\lambda x_k = \mu_k x_{k-1} - (\lambda_k + \mu_k) x_k + \lambda_k x_{k+1} \quad (k \ge 1).
\end{cases} \tag{7.38}$$
For any fixed $x_0$, this system admits a unique solution that is identically null if and only if $x_0 = 0$. If $x_0 \ne 0$, the solution is such that $x_k / x_0$ does not depend on $x_0$, and therefore, only the case where $x_0 = 1$ needs to be treated. Writing $y_k = x_{k+1} - x_k$, we obtain from (7.38)
$$y_k = \frac{\lambda}{\lambda_k}\, x_k + \frac{\lambda \mu_k}{\lambda_k \lambda_{k-1}}\, x_{k-1} + \cdots + \frac{\lambda \mu_k \cdots \mu_2}{\lambda_k \cdots \lambda_2 \lambda_1}\, x_1 + \frac{\mu_k \cdots \mu_1}{\lambda_k \cdots \lambda_1}\, y_0 \tag{7.39}$$
and $y_0 = \frac{\lambda}{\lambda_0}$. From this we deduce that if $\lambda > 0$, then $y_k > 0$ and therefore $\{x_k\}_{k\ge 0}$ is a strictly increasing sequence. Therefore, using $y_0 = \frac{\lambda}{\lambda_0}$ in (7.39), we have (since $x_k \ge x_0 = 1$)
$$y_k \ge \lambda \Big[\frac{1}{\lambda_k} + \frac{\mu_k}{\lambda_k \lambda_{k-1}} + \cdots + \frac{\mu_k \cdots \mu_1}{\lambda_k \cdots \lambda_1 \lambda_0}\Big].$$
Therefore, a necessary condition for $\{x_k\}_{k\ge 0}$ to be bounded is that the left-hand side of (7.37) be finite. This proves the sufficiency of (7.37) for non-explosion.

We now turn to the proof of necessity. For $i \le k$, bounding in (7.39) $x_i$ by $x_k$ yields the majoration
$$y_k \le \Big[\frac{\lambda}{\lambda_k} + \cdots + \frac{\lambda \mu_k \cdots \mu_1}{\lambda_k \cdots \lambda_1 \lambda_0}\Big] x_k,$$
and therefore, since $y_k = x_{k+1} - x_k$,
$$x_{k+1} \le \Big(1 + \Big[\frac{\lambda}{\lambda_k} + \cdots + \frac{\lambda \mu_k \cdots \mu_1}{\lambda_k \cdots \lambda_1 \lambda_0}\Big]\Big) x_k \le x_k \exp\Big\{\lambda \Big[\frac{1}{\lambda_k} + \cdots + \frac{\mu_k \cdots \mu_1}{\lambda_k \cdots \lambda_0}\Big]\Big\}.$$
Since $x_0 = 1$, this leads to
$$x_n \le \exp\Big\{\lambda \sum_{k=1}^n \Big[\frac{1}{\lambda_k} + \cdots + \frac{\mu_k \cdots \mu_1}{\lambda_k \cdots \lambda_0}\Big]\Big\}.$$
Therefore, a sufficient condition for the solution $\{x_n\}_{n\ge 0}$ to be bounded is that the left-hand side of (7.37) be finite. □

Example 7.3.12: Pure birth. A pure birth generator $A$ is a birth-and-death generator with all $\mu_n = 0$. The necessary and sufficient condition of regularity (7.37) reads in this case
$$\sum_{n=0}^\infty \frac{1}{\lambda_n} = \infty. \tag{7.40}$$
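The bracketed term of (7.37), call it $r_n$, satisfies the recursion $r_n = 1/\lambda_n + (\mu_n/\lambda_n)\, r_{n-1}$ with $r_0 = 1/\lambda_0$, which makes partial sums of the series cheap to compute. The sketch below is only a numerical illustration: the function name `reuter_partial_sums` and the two rate families are assumptions, and a finite partial sum can suggest, but never prove, divergence or convergence.

```python
def reuter_partial_sums(lam, mu, n_max):
    """Partial sums of the series in (7.37), for n = 1, ..., n_max.

    lam, mu: functions n -> birth rate lambda_n (> 0) and death rate mu_n.
    Uses r_n = 1/lam(n) + (mu(n)/lam(n)) * r_{n-1}, with r_0 = 1/lam(0).
    """
    r = 1.0 / lam(0)
    total = 0.0
    sums = []
    for n in range(1, n_max + 1):
        r = 1.0 / lam(n) + (mu(n) / lam(n)) * r
        total += r
        sums.append(total)
    return sums


# Pure birth with linear rates (hypothetical example): harmonic-type growth.
linear = reuter_partial_sums(lambda n: n + 1.0, lambda n: 0.0, 1000)
# Pure birth with quadratic rates (hypothetical example): the sums stay bounded.
quadratic = reuter_partial_sums(lambda n: (n + 1.0) ** 2, lambda n: 0.0, 2000)
```

For the linear rates the partial sums keep growing like a logarithm, in line with (7.40) being satisfied, whereas for the quadratic rates they remain below $\pi^2/6 - 1$, so that (7.40) fails and the generator is explosive.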

Remark 7.3.13 There is a large class of hmcs for which the regularity is ensured without recourse to the regularity criterion above (Theorem 7.3.10): see Exercise 7.5.8.

7.4 Long-run Behavior

7.4.1 Recurrence

We shall define irreducibility, recurrence, transience, and positive recurrence for a regular jump hmc.

Definition 7.4.1 A regular jump hmc is called irreducible if and only if the imbedded discrete-time hmc is irreducible.

Definition 7.4.2 A state $i$ is called recurrent if and only if it is recurrent for the imbedded chain. Otherwise, it is called transient.

In order to define positive recurrence, we need the following definitions. The escape time from state $i$ is defined by
$$L_i := \inf\{t \ge 0;\ X(t) \ne i\}$$
($= \infty$ if $X(t) = i$ for all $t \ge 0$). The return time to $i$ is
$$R_i := \inf\{t > 0;\ t > L_i \text{ and } X(t) = i\}$$
($= \infty$ if $L_i = \infty$ or if $X(t) \ne i$ for all $t \ge L_i$). Clearly, $L_i$ and $R_i$ are $\mathcal{F}_t^X$-stopping times (exercise).

Definition 7.4.3 A recurrent state $i \in E$ is called t-positive recurrent if and only if $E_i[R_i] < \infty$, where $R_i$ is the return time to state $i$. Otherwise, it is called t-null recurrent.

Remark 7.4.4 We shall soon see that t-positive recurrence and n-positive recurrence (positive recurrence of the imbedded chain) are not equivalent concepts. Also, observe that recurrence of a given state implies that this state is essential. Finally, in the same vein, note that irreducibility implies that all states are essential.

Remark 7.4.5 Note that there is no notion of periodicity for a continuous-time hmc, for obvious reasons.


Invariant Measures of Recurrent Chains

Definition 7.4.6 A t-invariant measure is a finite non-trivial vector $\nu = \{\nu(i)\}_{i\in E}$ such that for all $t \ge 0$,
$$\nu^T P(t) = \nu^T. \tag{7.41}$$
Of course, an n-invariant measure is, by definition, an invariant measure for the imbedded chain.

Theorem 7.4.7 Let the regular jump hmc $\{X(t)\}_{t\ge 0}$ with infinitesimal generator $A$ be irreducible and recurrent. Then there exists a unique (up to a multiplicative factor) t-invariant measure such that $\nu(i) > 0$ for all $i \in E$. Moreover, $\nu$ is obtained in one of the following ways:
(1):
$$\nu(i) = E_0\Big[\int_0^{R_0} 1_{\{X(s)=i\}}\, ds\Big], \tag{7.42}$$
where 0 is a state and $R_0$ is the return time to state 0, or
(2):
$$\nu(i) = \frac{\mu(i)}{q_i} = \frac{E_0\Big[\sum_{n=1}^{T_0} 1_{\{X_n=i\}}\Big]}{q_i}, \tag{7.43}$$
where $\mu$ is the canonical invariant measure relative to state 0 of the imbedded chain and $T_0$ is the return time to 0 of the imbedded chain, or
(3): as a solution of:
$$\nu^T A = 0. \tag{7.44}$$
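The link between characterizations (2) and (3) can be checked by hand on a small example. The sketch below uses a hypothetical three-state birth-and-death generator (the rates, and the helper names `birth_death_generator` and `left_mult`, are assumptions made for illustration). A candidate $\nu$ is obtained from the detailed balance relations $\nu(i)\lambda_i = \nu(i+1)\mu_{i+1}$, which is a technique specific to birth-and-death chains and not the method of the theorem; the code then verifies (7.44) and that $\mu(i) = q_i \nu(i)$ is invariant for the imbedded chain, as (7.43) requires up to a multiplicative factor.

```python
from fractions import Fraction


def birth_death_generator(lam, mu):
    """Finite birth-and-death generator: lam[i] is the rate i -> i+1,
    mu[i] is the rate i+1 -> i (hypothetical encoding for this sketch)."""
    n = len(lam) + 1
    Q = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n - 1):
        Q[i][i + 1] = Fraction(lam[i])
        Q[i + 1][i] = Fraction(mu[i])
    for i in range(n):
        Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)
    return Q


def left_mult(v, M):
    """Row vector times matrix: (v^T M)_j = sum_i v_i M_ij."""
    n = len(M)
    return [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]


lam, mu = [2, 1], [3, 4]          # illustrative rates only
Q = birth_death_generator(lam, mu)

# Candidate invariant measure from detailed balance, normalized by nu(0) = 1.
nu = [Fraction(1)]
for i in range(len(lam)):
    nu.append(nu[-1] * Fraction(lam[i], mu[i]))

# Imbedded chain p_ij = q_ij / q_i and the measure mu_emb(i) = q_i * nu(i).
n = len(Q)
P = [[(Q[i][j] / -Q[i][i]) if j != i else Fraction(0) for j in range(n)]
     for i in range(n)]
mu_emb = [-Q[i][i] * nu[i] for i in range(n)]
```

Here `left_mult(nu, Q)` is the zero vector and `left_mult(mu_emb, P)` equals `mu_emb`, so the same vector $\nu$ satisfies (7.44) and is of the form $\mu(i)/q_i$ with $\mu$ invariant for the imbedded chain.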

Proof. (α) We first show that (7.42) defines an invariant measure, that is, for all $j \in E$ and all $t \ge 0$,
$$\nu(j) = \sum_{k\in E} \nu(k)\, p_{kj}(t).$$
The right-hand side of the above equality is equal to
$$\begin{aligned}
A &= \sum_{k\in E} E_0\Big[\int_0^\infty 1_{\{X(s)=k\}}\, 1_{\{s\le R_0\}}\, ds\Big]\, p_{kj}(t) \\
&= \int_0^\infty \sum_{k\in E} P_0(X(t+s) = j \mid X(s) = k)\, P_0(X(s) = k, s \le R_0)\, ds \\
&= \int_0^\infty \sum_{k\in E} P_0(X(t+s) = j \mid X(s) = k)\, P_0(s \le R_0 \mid X(s) = k)\, P_0(X(s) = k)\, ds \\
&= \int_0^\infty \sum_{k\in E} P_0(X(t+s) = j, s \le R_0 \mid X(s) = k)\, P_0(X(s) = k)\, ds \\
&= \int_0^\infty \sum_{k\in E} P_0(X(t+s) = j, s \le R_0, X(s) = k)\, ds \\
&= \int_0^\infty P_0(X(t+s) = j, s \le R_0)\, ds \\
&= \int_0^\infty E_0\big[1_{\{X(t+s)=j\}}\, 1_{\{s\le R_0\}}\big]\, ds \\
&= E_0\Big[\int_0^\infty 1_{\{X(t+s)=j\}}\, 1_{\{s\le R_0\}}\, ds\Big],
\end{aligned}$$
where we have used the Markov property for the fourth equality ($\{s \le R_0\} \in \mathcal{F}_s^X$). Therefore,
$$\begin{aligned}
A &= E_0\Big[\int_0^{R_0} 1_{\{X(t+s)=j\}}\, ds\Big] = E_0\Big[\int_t^{t+R_0} 1_{\{X(u)=j\}}\, du\Big] \\
&= E_0\Big[1_{\{t\le R_0\}} \int_t^{R_0} 1_{\{X(u)=j\}}\, du\Big] - E_0\Big[1_{\{t>R_0\}} \int_{R_0}^{t} 1_{\{X(u)=j\}}\, du\Big] + E_0\Big[\int_{R_0}^{R_0+t} 1_{\{X(u)=j\}}\, du\Big].
\end{aligned}$$
From the strong Markov property applied at $R_0$,
$$E_0\Big[\int_{R_0}^{R_0+t} 1_{\{X(u)=j\}}\, du\Big] = E_0\Big[\int_0^{t} 1_{\{X(u)=j\}}\, du\Big].$$
Therefore,
$$\begin{aligned}
A &= E_0\Big[1_{\{t\le R_0\}} \int_t^{R_0} \cdots\Big] - E_0\Big[1_{\{t>R_0\}} \int_{R_0}^{t} \cdots\Big] + E_0\Big[\int_0^{t} \cdots\Big] \\
&= E_0\Big[\int_0^{R_0} 1_{\{X(u)=j\}}\, du\Big] = \nu(j).
\end{aligned}$$

(β) We now show uniqueness. For this consider the skeleton chain $\{X(n)\}_{n\ge 0}$. For any state $i$, consider the sequence $Z_1, Z_2, \ldots$ of successive sojourn times in state $i$ of the state process. This sequence is infinite because the imbedded chain is recurrent, and it is iid with exponential distribution of mean $\frac{1}{q_i}$. In particular, the event $\{Z_n > 1\}$ occurs infinitely often, and this implies that $\{X(n) = i\}$ also occurs infinitely often. This is true for all states. Therefore, the skeleton is irreducible and recurrent. Consequently it has one and only one (up to a multiplicative factor) invariant measure. Since an invariant measure of the continuous-time chain is an invariant measure of the skeleton, the announced uniqueness of the invariant measure of the continuous-time hmc follows.


(γ) Call $T_0$ the return time to 0 of the imbedded chain. Then
$$\nu(i) = E_0\Big[\int_0^{R_0} 1_{\{X(s)=i\}}\, ds\Big] = E_0\Big[\sum_{n=0}^{T_0-1} S_{n+1}\, 1_{\{X_n=i\}}\Big] = E_0\Big[\sum_{n=0}^{\infty} S_{n+1}\, 1_{\{X_n=i\}}\, 1_{\{n<T_0\}}\Big]$$

[…]

$$P(N(A_1) > 0, \ldots, N(A_k) > 0, N(B) = 0) = P(N(A_1) > 0, \ldots, N(A_{k-1}) > 0, N(B) = 0) - P(N(A_1) > 0, \ldots, N(A_{k-1}) > 0, N(A_k \cup B) = 0),$$
and for $k = 1$,
$$P(N(A_1) > 0, N(B) = 0) = P(N(B) = 0) - P(N(B \cup A_1) = 0) = v(B) - v(B \cup A_1).$$

2 Named after Rényi, who introduced the notion and applied it to Poisson processes [Rényi, 1967].


This shows that $P(N(A_1) > 0, \ldots, N(A_k) > 0, N(B) = 0)$ can be recursively computed from the void probability function $v$.

Step 2. Suppose that we have a sequence $K_n = \{K_{n,i}\}_{i=0}^{k_n}$ of nested partitions of $E$ such that for any distinct $x, y \in E$, there exists an $n$ such that $x$ and $y$ belong to two distinct sets of the partition $K_n$ (in other words, the sequence of partitions $\{K_n\}_{n\ge 0}$ eventually separates the points of $E$). Define for $n \ge 1$ and $A \in \mathcal{B}(E)$,
$$H_n(A) = \sum_{i=0}^{k_n} H(A \cap K_{n,i}),$$
where $H(C) := 1_{\{N(C)>0\}}$ ($C \subseteq E$). Let $K_n \cap A := \{A \cap K_{n,i}\}_{i=0}^{k_n}$. Since the sequence of partitions $\{K_n \cap A\}_{n\ge 1}$ of $A$ eventually separates the points of $A$, and since $H_n(A)$ counts the number of sets of $K_n \cap A$ that contain at least one point of the point process, we have in view of the assumed simplicity of the point process
$$\lim_{n\uparrow\infty} H_n(A) = N(A), \quad \text{a.s.}$$

Step 3. The probability $P(H_n(A) = l)$ can be expressed in terms of the void probability function $v$ alone since (with $A_{n,i} = A \cap K_{n,i}$)
$$P(H_n(A) = l) = \sum_{\substack{i_0, \ldots, i_{k_n} \in \{0,1\} \\ \sum_{j=0}^{k_n} i_j = l}} P(H(A_{n,0}) = i_0, \ldots, H(A_{n,k_n}) = i_{k_n})$$
and for $i_0, \ldots, i_{k_n} \in \{0, 1\}$,
$$P(H(A_{n,0}) = i_0, \ldots, H(A_{n,k_n}) = i_{k_n}) = P\Big(\bigcap_{l;\, i_l = 1} \{N(A_{n,l}) > 0\} \cap \Big\{N\Big(\bigcup_{m;\, i_m = 0} A_{n,m}\Big) = 0\Big\}\Big),$$
a quantity which can be expressed in terms of the void function $v$ alone, as we saw in Step 1. More generally, for all $l_1, \ldots, l_k \in \mathbb{N}$, $P(H_n(A_1) = l_1, \ldots, H_n(A_k) = l_k)$ is expressible in terms of the void probability function, and the same is true of
$$P(H_n(A_1) \le n_1, \ldots, H_n(A_k) \le n_k) \qquad (n_1, \ldots, n_k \in \mathbb{N}).$$

Step 4. Finally, observe that
$$\{H_n(A_1) \le n_1, \ldots, H_n(A_k) \le n_k\} \downarrow \{N(A_1) \le n_1, \ldots, N(A_k) \le n_k\}$$
and therefore
$$\lim_{n\uparrow\infty} P(H_n(A_1) \le n_1, \ldots, H_n(A_k) \le n_k) = P(N(A_1) \le n_1, \ldots, N(A_k) \le n_k).$$

The proof is now almost done. It just remains to construct the sequence of partitions $\{K_n\}_{n\ge 1}$. Denote by $B(a, r)$ the closed ball of center $a$ and radius $r$. Since $(E, d)$ is separable, there exists a countable set $\{a_1, a_2, \ldots\}$ that is dense in $E$. The first partition $K_1$ consists of two sets
$$K_{11} := B(a_1, 1), \qquad K_{10} := E \setminus K_{11}.$$
Suppose that we have constructed $K_{n-1}$. The next partition $K_n$ is constructed as follows. Letting
$$B_{n,i} := B\big(a_i, 2^{-(n-i)}\big) \quad (i = 1, \ldots, n) \qquad \text{and} \qquad B_{n,0} := E \setminus \bigcup_{i=1}^n B_{n,i},$$
define a partition $C_n = \{C_{n,i}\}_{i=0}^n$ by
$$C_{n,0} := B_{n,0}; \qquad C_{n,1} := B_{n,1}; \qquad C_{n,i} := B_{n,i} \setminus \bigcup_{j=1}^{i-1} B_{n,j} \quad (i = 2, \ldots, n).$$
In order to obtain the partition $K_n$ nested in $K_{n-1}$, we intersect $C_n$ and $K_{n-1}$, that is,
$$K_n := \{C_{n,i} \cap K_{n-1,j} \quad (j = 0, \ldots, k_{n-1},\ i = 0, \ldots, n)\}.$$
□

Remark 8.1.42 In the case $E = \mathbb{R}^m$, a simple sequence of nested partitions could be the following, say for $m = 1$ for notational ease:
$$K_n = \big\{(i 2^{-n}, (i+1) 2^{-n}]\ ;\ i \in \mathbb{Z}\big\}.$$

Remark 8.1.43 Note that the assumption of simplicity is necessary in the above result. For instance, doubling the multiplicity of the points of a given point process leaves the avoidance function unchanged.
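The convergence $H_n(A) \to N(A)$ of Step 2 can be observed on a fixed point configuration with the dyadic partitions of Remark 8.1.42. The sketch below is only an illustration: the function name `occupied_dyadic_cells` and the point configurations are hypothetical. It counts the cells $(i2^{-n}, (i+1)2^{-n}]$ that contain at least one point.

```python
import math


def occupied_dyadic_cells(points, n):
    """Number of cells (i 2^-n, (i+1) 2^-n] containing at least one point.

    A point x lies in the cell of index ceil(x * 2^n) - 1, so counting the
    distinct indices gives H_n for the dyadic partition of level n.
    """
    return len({math.ceil(x * 2 ** n) - 1 for x in points})


# A simple configuration: four distinct points (illustrative values).
simple_points = [0.1, 0.11, 0.5, 0.9]
h_simple = [occupied_dyadic_cells(simple_points, n) for n in range(1, 11)]

# A non-simple configuration: the point 0.1 has multiplicity two.
double_points = [0.1, 0.1, 0.5]
h_double = [occupied_dyadic_cells(double_points, n) for n in range(1, 11)]
```

For the simple configuration the counts are non-decreasing in $n$ and reach the number of points once the cells are fine enough to separate 0.1 from 0.11. For the configuration with a double point the count never exceeds 2 although $N(A) = 3$, which is the phenomenon behind Remark 8.1.43.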

Example 8.1.44: The Avoidance Function of a Poisson Cluster Process. Consider the space-homogeneous cluster point process of Example 8.1.25, with the additional specification that the germ $N_0$ is a Poisson process. In order to compute its void probability function $v_N(C) = P(N(C) = 0)$, we observe that
$$P(N(C) = 0) = \lim_{t\uparrow\infty} E\big[e^{-tN(C)}\big].$$
We have
$$\begin{aligned}
E\big[e^{-tN(C)}\big] &= E\Big[e^{-t\sum_n Z_n(C - X_n)}\Big] = E\Big[\prod_n e^{-tZ_n(C - X_n)}\Big] \\
&= E\Big[E\Big[\prod_n e^{-tZ_n(C - X_n)} \,\Big|\, \mathcal{F}^{N_0}\Big]\Big] = E\Big[\prod_n E\big[e^{-tZ_n(C - X_n)} \mid \mathcal{F}^{N_0}\big]\Big] \\
&= E\Big[\prod_n E\big[e^{-tZ_1(C - X_n)}\big]\Big] = E\Big[\exp\Big\{\sum_n \log E\big[e^{-tZ_1(C - X_n)}\big]\Big\}\Big].
\end{aligned}$$
Since $N_0$ is a Poisson process with intensity measure $\nu_0$, the last term of the above sequence of equalities is
$$\exp\Big\{\int_{\mathbb{R}^m} \Big(E\big[e^{-tZ_1(C - x)}\big] - 1\Big)\, \nu_0(dx)\Big\}. \tag{$\star$}$$
But
$$\lim_{t\uparrow\infty} E\big[e^{-tZ_1(C - x)}\big] - 1 = v_Z(C - x) - 1.$$
Therefore taking the limit as $t \uparrow \infty$ in ($\star$) yields by dominated convergence:
$$v_N(C) = \exp\Big\{\int_{\mathbb{R}^m} (v_Z(C - x) - 1)\, \nu_0(dx)\Big\}.$$

8.2 Unmarked Spatial Poisson Processes

8.2.1 Construction

Recall the definition given in Example 8.1.12.

Definition 8.2.1 Let $\nu$ be a σ-finite measure on $E$. The point process $N$ on $E$ is called a Poisson process on $E$ with intensity measure $\nu$ if
(i) for all finite families of mutually disjoint sets $C_1, \ldots, C_K \in \mathcal{B}(E)$, the random variables $N(C_1), \ldots, N(C_K)$ are independent, and
(ii) for any set $C \in \mathcal{B}(E)$ such that $\nu(C) < \infty$,
$$P(N(C) = k) = e^{-\nu(C)}\, \frac{\nu(C)^k}{k!} \qquad (k \ge 0).$$

In the case $E = \mathbb{R}^m$, if $\nu$ is of the form $\nu(C) = \int_C \lambda(x)\, dx$ for some non-negative measurable function $\lambda : \mathbb{R}^m \to \mathbb{R}$, the Poisson process $N$ is said to admit the intensity function $\lambda(x)$. If in addition $\lambda(x) \equiv \lambda$, $N$ is called a homogeneous Poisson process (hpp) on $\mathbb{R}^m$ with intensity or rate $\lambda$.

We now construct the Poisson process. In other terms, we simulate the distribution of a Poisson process on $\mathbb{R}^m$ of given intensity measure $\nu$. The basic result is the following:

Theorem 8.2.2 Let $T$ be a Poisson random variable of mean $\theta$. Let $\{Z_n\}_{n\ge 1}$ be an iid sequence of random elements with values in $E$ and common distribution $Q$. Assume that $T$ is independent of $\{Z_n\}_{n\ge 1}$. The point process $N$ on $E$ defined by
$$N(C) = \sum_{n=1}^{T} 1_C(Z_n) \qquad (C \in \mathcal{B}(E))$$
is a Poisson process with intensity measure $\nu(\cdot) = \theta \times Q(\cdot)$.

Proof. It suffices to show that for any finite family $C_1, \ldots, C_K$ of pairwise disjoint measurable sets of $E$ with finite $\nu$-measure and all non-negative reals $t_1, \ldots, t_K$,
$$E\Big[e^{-\sum_{j=1}^K t_j N(C_j)}\Big] = \prod_{j=1}^K \exp\big\{\nu(C_j)(e^{-t_j} - 1)\big\}.$$
We have
$$\sum_{j=1}^K t_j N(C_j) = \sum_{j=1}^K t_j \Big(\sum_{n=1}^T 1_{C_j}(Z_n)\Big) = \sum_{n=1}^T \Big(\sum_{j=1}^K t_j 1_{C_j}(Z_n)\Big) = \sum_{n=1}^T Y_n,$$
where $Y_n = \sum_{j=1}^K t_j 1_{C_j}(Z_n)$. By Theorem 3.1.55,
$$E\Big[e^{-\sum_{n=1}^T Y_n}\Big] = g_T\big(E[e^{-Y_1}]\big),$$
where $g_T$ is the generating function of $T$. Here, since $T$ is Poisson with mean $\theta$,
$$g_T(z) = \exp\{\theta(z - 1)\}.$$
The random variable $Y_1$ takes the values $t_1, \ldots, t_K$ and 0 with the respective probabilities $Q(C_1), \ldots, Q(C_K)$ and $1 - \sum_{j=1}^K Q(C_j)$. Therefore
$$E[e^{-Y_1}] = \sum_{j=1}^K e^{-t_j} Q(C_j) + 1 - \sum_{j=1}^K Q(C_j) = 1 + \sum_{j=1}^K \big(e^{-t_j} - 1\big) Q(C_j),$$
from which we get the announced result. □
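The construction of Theorem 8.2.2 is directly usable as a simulation recipe: draw a Poisson number $T$ of points, then place them independently according to $Q$. The sketch below is an illustration under assumptions that are not in the text: the function names, the choice of $Q$ as the uniform distribution on the unit square, and the way the Poisson variable is sampled (by counting unit-rate exponential inter-arrival times that fit in $[0, \theta]$) are all choices made here.

```python
import random


def poisson_variate(theta, rng):
    """Poisson(theta) sample: number of arrivals of a unit-rate Poisson
    process on the line that fall in [0, theta]."""
    count, s = 0, rng.expovariate(1.0)
    while s <= theta:
        count += 1
        s += rng.expovariate(1.0)
    return count


def sample_poisson_process(theta, sample_q, rng):
    """Points of a Poisson process with intensity measure theta * Q,
    following Theorem 8.2.2; sample_q(rng) draws one point from Q."""
    return [sample_q(rng) for _ in range(poisson_variate(theta, rng))]


def uniform_unit_square(rng):
    """Hypothetical choice of Q: uniform distribution on [0, 1]^2."""
    return (rng.random(), rng.random())


rng_demo = random.Random(2020)
points_demo = sample_poisson_process(50.0, uniform_unit_square, rng_demo)
```

With this choice of $Q$ the result is a homogeneous Poisson process of intensity 50 on the unit square; the number of points falling in a set $C$ is then Poisson with mean $50 \times \text{area}(C)$, which can be checked empirically by averaging over many independent samples.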

The above is a special case of what is to be done, that is, to construct a Poisson process on $E$ with an intensity measure $\nu$ that is σ-finite (not just finite). Such a measure can be decomposed as
$$\nu(\cdot) = \sum_{j=1}^\infty \theta_j \times Q_j(\cdot),$$
where the $\theta_j$'s are positive real numbers and the $Q_j$'s are probability distributions on $E$. One can construct independent Poisson processes $N_j$ on $E$ with respective intensity measures $\theta_j Q_j(\cdot)$. The announced result then follows from the following theorem:

Theorem 8.2.3 Let $\nu$ be a σ-finite measure on $E$ of the form $\nu = \sum_{i=1}^\infty \nu_i$, where the $\nu_i$'s ($i \ge 1$) are σ-finite measures on $E$. Let $N_i$ ($i \ge 1$) be a family of independent Poisson processes on $E$ with respective intensity measures $\nu_i$ ($i \ge 1$). Then the point process
$$N = \sum_{j=1}^\infty N_j$$
is a Poisson process with intensity measure $\nu$.

Proof. For mutually disjoint measurable sets $C_1, \ldots, C_K$ of finite $\nu$-measures, and non-negative reals $t_1, \ldots, t_K$,
$$\begin{aligned}
E\Big[e^{-\sum_{\ell=1}^K t_\ell N(C_\ell)}\Big] &= E\Big[e^{-\sum_{\ell=1}^K t_\ell \left(\sum_{j=1}^\infty N_j(C_\ell)\right)}\Big] \\
&= E\Big[e^{-\lim_{n\uparrow\infty} \sum_{\ell=1}^K t_\ell \left(\sum_{j=1}^n N_j(C_\ell)\right)}\Big] \\
&= \lim_{n\uparrow\infty} E\Big[e^{-\sum_{\ell=1}^K t_\ell \left(\sum_{j=1}^n N_j(C_\ell)\right)}\Big],
\end{aligned}$$
by dominated convergence. But
$$\begin{aligned}
E\Big[e^{-\sum_{\ell=1}^K t_\ell \left(\sum_{j=1}^n N_j(C_\ell)\right)}\Big] &= E\Big[e^{-\sum_{j=1}^n \left(\sum_{\ell=1}^K t_\ell N_j(C_\ell)\right)}\Big] \\
&= \prod_{j=1}^n E\Big[e^{-\sum_{\ell=1}^K t_\ell N_j(C_\ell)}\Big] = \prod_{j=1}^n \prod_{\ell=1}^K E\big[e^{-t_\ell N_j(C_\ell)}\big] \\
&= \prod_{j=1}^n \prod_{\ell=1}^K \exp\big\{(e^{-t_\ell} - 1)\, \nu_j(C_\ell)\big\} \\
&= \prod_{j=1}^n \exp\Big\{\sum_{\ell=1}^K \big(e^{-t_\ell} - 1\big)\, \nu_j(C_\ell)\Big\} \\
&= \exp\Big\{\sum_{\ell=1}^K \big(e^{-t_\ell} - 1\big) \Big(\sum_{j=1}^n \nu_j(C_\ell)\Big)\Big\}.
\end{aligned}$$
Letting $n \uparrow \infty$ we obtain, by dominated convergence,
$$E\Big[e^{-\sum_{\ell=1}^K t_\ell N(C_\ell)}\Big] = \exp\Big\{\sum_{\ell=1}^K \big(e^{-t_\ell} - 1\big)\, \nu(C_\ell)\Big\}.$$
Therefore $N(C_1), \ldots, N(C_K)$ are independent Poisson random variables with respective means $\nu(C_1), \ldots, \nu(C_K)$. □

Theorem 8.2.4 Let $N$ be a Poisson process on $\mathbb{R}^m$ with intensity measure $\nu$.
(a) If $\nu$ is locally finite, then $N$ is locally finite.
(b) If $\nu$ is locally finite and non-atomic, then $N$ is simple.

Proof. (a) If $C$ is a bounded measurable set, it is of finite $\nu$-measure, and therefore $E[N(C)] = \nu(C) < \infty$, which implies that $N(C) < \infty$, $P$-almost surely.
(b) It suffices to show this for a finite intensity measure $\nu(\cdot) = \theta\, Q(\cdot)$, where $\theta$ is a positive real number and $Q$ is a non-atomic probability measure on $\mathbb{R}^m$, and then use the construction of Theorem 8.2.2. In turn, it suffices to show that for each $n \ge 1$,
$$P(Z_i = Z_j \text{ for some pair } (i, j)\ (1 \le i < j \le n) \mid N(\mathbb{R}^m) = n) = 0.$$
This is the case because for iid vectors $Z_1, \ldots, Z_n$ with a non-atomic probability distribution,
$$P(Z_i = Z_j \text{ for some pair } (i, j)\ (1 \le i < j \le n)) = 0.$$
□

Doubly Stochastic Poisson Processes Doubly stochastic Poisson processes are also called Cox processes.3 Definition 8.2.5 Let G be a σ-field containing F ν , where ν is a locally finite random measure on Rm . A point process N on Rm such that given G, N is a Poisson process on Rm with the intensity measure ν, is called a doubly stochastic Poisson process with respect to G with the (conditional) intensity measure ν. If the random measure ν is of the form ν(dx) = Λm (dx) , where Λ is a non-negative random variable, the corresponding Cox process is also called a mixed Poisson process.

8.2.2 Poisson Process Integrals

The Covariance Formula

Let $N$ be a Poisson process on $E$, with intensity measure $\nu$. Recall Campbell's theorem (Theorem 8.1.20). Let $\varphi : E \to \overline{\mathbb{R}}$ be a $\nu$-integrable measurable function. Then $N(\varphi)$ is a well-defined integrable random variable, and
$$
E\left[\int_E \varphi(x)\, N(dx)\right] = \int_E \varphi(x)\, \nu(dx) . \tag{8.11}
$$

³ [Cox, 1955].

CHAPTER 8. SPATIAL POISSON PROCESSES


Theorem 8.2.6 Let $N$ be as above. Let $\varphi, \psi : E \to \mathbb{C}$ be two $\nu$-integrable measurable functions such that moreover $|\varphi|^2$ and $|\psi|^2$ are $\nu$-integrable. Then $N(\varphi)$ and $N(\psi)$ are well-defined square-integrable random variables and
$$
\mathrm{cov}\left(\int_E \varphi(x)\, N(dx),\ \int_E \psi(x)\, N(dx)\right) = \int_E \varphi(x)\psi(x)^*\, \nu(dx) . \tag{8.12}
$$

Proof. It is enough to consider the case of real functions. First suppose that $\varphi$ and $\psi$ are simple non-negative Borel functions. We can always assume that
$$
\varphi := \sum_{h=1}^{K} a_h 1_{C_h}, \qquad \psi := \sum_{h=1}^{K} b_h 1_{C_h},
$$
where $C_1, \ldots, C_K$ are disjoint measurable subsets of $E$. In particular, $\varphi(x)\psi(x) = \sum_{h=1}^{K} a_h b_h 1_{C_h}(x)$. Using the facts that if $i \neq j$, $N(C_i)$ and $N(C_j)$ are independent, and that a Poisson random variable with mean $\theta$ has variance $\theta$,
$$
\begin{aligned}
E[N(\varphi)N(\psi)] &= \sum_{h,l=1}^{K} a_h b_l\, E[N(C_h)N(C_l)] \\
&= \sum_{\substack{h,l=1 \\ h \neq l}}^{K} a_h b_l\, E[N(C_h)N(C_l)] + \sum_{l=1}^{K} a_l b_l\, E[N(C_l)^2] \\
&= \sum_{\substack{h,l=1 \\ h \neq l}}^{K} a_h b_l\, E[N(C_h)]\,E[N(C_l)] + \sum_{l=1}^{K} a_l b_l\, E[N(C_l)^2] ,
\end{aligned}
$$
and therefore
$$
\begin{aligned}
E[N(\varphi)N(\psi)] &= \sum_{\substack{h,l=1 \\ h \neq l}}^{K} a_h b_l\, \nu(C_h)\nu(C_l) + \sum_{l=1}^{K} a_l b_l \left[\nu(C_l) + \nu(C_l)^2\right] \\
&= \sum_{h,l=1}^{K} a_h b_l\, \nu(C_h)\nu(C_l) + \sum_{l=1}^{K} a_l b_l\, \nu(C_l) \\
&= \nu(\varphi)\nu(\psi) + \nu(\varphi\psi) .
\end{aligned}
$$

Let now $\varphi, \psi$ be non-negative and let $\{\varphi_n\}_{n\geq 1}$, $\{\psi_n\}_{n\geq 1}$ be non-decreasing sequences of simple non-negative functions, with respective limits $\varphi$ and $\psi$. Letting $n$ go to $\infty$ in the equality
$$
E[N(\varphi_n)N(\psi_n)] = \nu(\varphi_n\psi_n) + \nu(\varphi_n)\nu(\psi_n)
$$
yields the announced result, by monotone convergence.

For any real-valued $\nu$-integrable function $\varphi$, we have
$$
E[N(\varphi)] = E[N(\varphi^+)] - E[N(\varphi^-)] = \nu(\varphi^+) - \nu(\varphi^-) = \nu(\varphi) .
$$
Also, by the result in the non-negative case, $E[N(|\varphi|)^2] = \nu(|\varphi|^2) + \nu(|\varphi|)^2 < \infty$. Therefore, since $|N(\varphi)| \leq N(|\varphi|)$, $N(\varphi)$ is a square-integrable variable, and so is $N(\psi)$ for the same reasons. Therefore, by Schwarz's inequality, $N(\varphi)N(\psi)$ is integrable. We have

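$$
\begin{aligned}
E[N(\varphi)N(\psi)] &= E\left[\left(N(\varphi^+) - N(\varphi^-)\right)\left(N(\psi^+) - N(\psi^-)\right)\right] \\
&= E[N(\varphi^+)N(\psi^+)] + E[N(\varphi^-)N(\psi^-)] - E[N(\varphi^+)N(\psi^-)] - E[N(\varphi^-)N(\psi^+)] \\
&= \left[\nu(\varphi^+\psi^+) + \nu(\varphi^+)\nu(\psi^+)\right] + \left[\nu(\varphi^-\psi^-) + \nu(\varphi^-)\nu(\psi^-)\right] \\
&\quad - \left[\nu(\varphi^+\psi^-) + \nu(\varphi^+)\nu(\psi^-)\right] - \left[\nu(\varphi^-\psi^+) + \nu(\varphi^-)\nu(\psi^+)\right] \\
&= \nu(\varphi\psi) + \nu(\varphi)\nu(\psi) ,
\end{aligned}
$$
from which (8.12) follows. □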
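The covariance formula (8.12) can be checked numerically on a small case. The sketch below is only an illustration under assumptions that are not in the text: a homogeneous Poisson process of intensity 5 on [0, 1] (simulated through exponential gaps), with the test functions φ(x) = x and ψ(x) = x². The helper names `hpp_points` and `empirical_cov` are hypothetical. For this choice, (8.12) gives cov(N(φ), N(ψ)) = 5 ∫₀¹ x³ dx = 1.25.

```python
import random

def hpp_points(rate, length, rng):
    """Points of a homogeneous Poisson process of intensity `rate` on [0, length]."""
    points, t = [], rng.expovariate(rate)
    while t <= length:
        points.append(t)
        t += rng.expovariate(rate)   # iid exponential gaps between successive points
    return points

def empirical_cov(rate, phi, psi, n_runs, seed=0):
    """Monte Carlo estimate of cov(N(phi), N(psi)) for an hpp on [0, 1]."""
    rng = random.Random(seed)
    sum_a = sum_b = sum_ab = 0.0
    for _ in range(n_runs):
        pts = hpp_points(rate, 1.0, rng)
        a = sum(phi(x) for x in pts)   # N(phi)
        b = sum(psi(x) for x in pts)   # N(psi)
        sum_a += a
        sum_b += b
        sum_ab += a * b
    return sum_ab / n_runs - (sum_a / n_runs) * (sum_b / n_runs)

rate = 5.0                             # illustrative intensity (assumption)
estimate = empirical_cov(rate, lambda x: x, lambda x: x * x, n_runs=20000)
exact = rate / 4.0                     # rate * integral of x^3 over [0, 1]
print(f"Monte Carlo: {estimate:.3f}   formula (8.12): {exact:.3f}")
```

The two printed numbers should agree up to Monte Carlo error, which shrinks like the inverse square root of `n_runs`.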

The Exponential Formula

We now turn to the exponential formula for Poisson processes. (It is sometimes called the second Campbell's formula. However, in this book, the appellation "Campbell's formula" will be reserved for the first one.)

Theorem 8.2.7 Let $N$ be a Poisson process on $E$ with intensity measure $\nu$. Let $\varphi : E \to \mathbb{R}$ be a non-negative measurable function. Then
$$
E\left[e^{-\int_E \varphi(x)\, N(dx)}\right] = \exp\left\{\int_E \left(e^{-\varphi(x)} - 1\right) \nu(dx)\right\}
$$
and
$$
E\left[e^{\int_E \varphi(x)\, N(dx)}\right] = \exp\left\{\int_E \left(e^{\varphi(x)} - 1\right) \nu(dx)\right\} .
$$

Proof. We prove the first formula, the proof of the second being similar. Suppose that $\varphi$ is simple and non-negative: $\varphi = \sum_{h=1}^{K} a_h 1_{C_h}$, where $C_1, \ldots, C_K$ are mutually disjoint measurable subsets of $E$. Then
$$
\begin{aligned}
E\left[e^{-N(\varphi)}\right] &= E\left[e^{-\sum_{h=1}^{K} a_h N(C_h)}\right] = E\left[\prod_{h=1}^{K} e^{-a_h N(C_h)}\right] \\
&= \prod_{h=1}^{K} E\left[e^{-a_h N(C_h)}\right] = \prod_{h=1}^{K} \exp\left\{(e^{-a_h} - 1)\,\nu(C_h)\right\} \\
&= \exp\left\{\sum_{h=1}^{K} (e^{-a_h} - 1)\,\nu(C_h)\right\} = \exp\left\{\nu(e^{-\varphi} - 1)\right\} .
\end{aligned}
$$
The formula is therefore true for non-negative simple functions. Take now a non-decreasing sequence $\{\varphi_n\}_{n\geq 1}$ of such functions converging to $\varphi$. For all $n \geq 1$,
$$
E\left[e^{-N(\varphi_n)}\right] = \exp\left\{\nu(e^{-\varphi_n} - 1)\right\} .
$$
By monotone convergence, the limit as $n$ tends to $\infty$ of $N(\varphi_n)$ is $N(\varphi)$. Consequently, by dominated convergence, the limit of the left-hand side is $E[e^{-N(\varphi)}]$. The function $g_n = -(e^{-\varphi_n} - 1)$ is a non-negative function increasing to $g = -(e^{-\varphi} - 1)$, and therefore, by monotone convergence, $\nu(e^{-\varphi_n} - 1) = -\nu(g_n)$ converges to $\nu(e^{-\varphi} - 1) = -\nu(g)$, which in turn implies that the right-hand side of the last displayed equality tends to $\exp\{\nu(e^{-\varphi} - 1)\}$ as $n$ tends to $\infty$. □
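For a simple function, both sides of the exponential formula of Theorem 8.2.7 can be evaluated directly, which gives a quick sanity check of the first step of the proof. The sketch below is illustrative only: the coefficients `a_vals` and the masses `nu_vals` (standing for ν(C_h)) are arbitrary assumed values, and the function names are hypothetical. The left-hand side is computed from the Poisson probability mass function by summing the series, using the independence of the counts N(C_h) on disjoint sets.

```python
import math

def poisson_laplace(theta, a, terms=200):
    """E[exp(-a * X)] for X ~ Poisson(theta), by summing the series term by term."""
    total, pmf = 0.0, math.exp(-theta)      # pmf holds P(X = k), starting at k = 0
    for k in range(terms):
        total += math.exp(-a * k) * pmf
        pmf *= theta / (k + 1)              # P(X = k + 1) obtained from P(X = k)
    return total

def lhs_simple(a_vals, nu_vals):
    """E[exp(-N(phi))] for phi = sum_h a_h 1_{C_h}, the C_h being disjoint."""
    return math.prod(poisson_laplace(nu, a) for a, nu in zip(a_vals, nu_vals))

def rhs_simple(a_vals, nu_vals):
    """exp{ nu(exp(-phi) - 1) } for the same simple function."""
    return math.exp(sum((math.exp(-a) - 1.0) * nu for a, nu in zip(a_vals, nu_vals)))

a_vals = [0.5, 1.0, 2.0]     # values a_h of phi (assumed for illustration)
nu_vals = [1.5, 0.7, 3.0]    # masses nu(C_h) (assumed for illustration)
print(lhs_simple(a_vals, nu_vals), rhs_simple(a_vals, nu_vals))
```

The series is truncated at 200 terms, which is far more than needed for masses of this size; larger values of ν(C_h) would call for more terms.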


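Remark 8.2.8 The covariance formula can of course be obtained from the exponential formula by differentiation of $t \mapsto E[e^{-tN(\varphi)}]$.

Example 8.2.9: The Maximum Formula. Let $N$ be a simple Poisson process on $E$ with intensity measure $\nu$ and let $\varphi : E \to \mathbb{R}$. Then
$$
P\left(\sup_{n\in\mathbb{N}} \varphi(X_n) \leq a\right) = \exp\left\{-\int_E 1_{\{\varphi(x) > a\}}\, \nu(dx)\right\} .
$$
A direct proof based on the construction of Poisson processes in Subsection 8.2.1 is possible (Exercise 8.5.21). We take another path and first prove that
$$
\lim_{\theta\uparrow\infty} E\left[e^{-\theta \sum_{n\in\mathbb{N}} 1_{\{\varphi(X_n) > a\}}}\right] = P\left(\sup_{n\in\mathbb{N}} \varphi(X_n) \leq a\right) . \tag{$\star$}
$$
Indeed, the sum $\sum_{n\in\mathbb{N}} 1_{\{\varphi(X_n) > a\}}$ is strictly positive, except when $\sup_{n\in\mathbb{N}} \varphi(X_n) \leq a$, in which case it is null. Therefore
$$
\lim_{\theta\uparrow\infty} e^{-\theta \sum_{n\in\mathbb{N}} 1_{\{\varphi(X_n) > a\}}} = 1_{\{\sup_{n\in\mathbb{N}} \varphi(X_n) \leq a\}} .
$$
Taking expectations yields ($\star$), by dominated convergence. Now, by Theorem 8.2.7,
$$
\begin{aligned}
E\left[e^{-\theta \sum_{n\in\mathbb{N}} 1_{\{\varphi(X_n) > a\}}}\right] &= \exp\left\{\int_E \left(e^{-\theta 1_{\{\varphi(x) > a\}}} - 1\right) \nu(dx)\right\} \\
&= \exp\left\{\left(e^{-\theta} - 1\right) \int_E 1_{\{\varphi(x) > a\}}\, \nu(dx)\right\} ,
\end{aligned}
$$
and the limit of the latter quantity as $\theta \uparrow \infty$ is $\exp\left\{-\int_E 1_{\{\varphi(x) > a\}}\, \nu(dx)\right\}$.

Example 8.2.10: The Laplace Functional of a Poisson Process. According to Theorem 8.2.7, the Laplace functional of a Poisson process $N$ on $E$ with intensity measure $\nu$ is
$$
L_N(\varphi) = \exp\left\{\nu\left(e^{-\varphi} - 1\right)\right\} .
$$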

Theorem 8.2.11 Let $N_i$ $(i \in J)$ be a finite collection of simple point processes on $E$. If for any collection $\varphi_i : E \to \mathbb{R}_+$ $(i \in J)$ of non-negative measurable functions,
$$
E\left[e^{-\sum_{i\in J} N_i(\varphi_i)}\right] = \prod_{i\in J} \exp\left\{\int_E \left(e^{-\varphi_i(x)} - 1\right) \nu_i(dx)\right\} , \tag{8.13}
$$
where $\nu_i$, $i \in J$, is a collection of σ-finite measures on $E$, then $N_i$, $i \in J$, is a family of independent Poisson processes with respective intensity measures $\nu_i$, $i \in J$.

Proof. Taking all the $\varphi_i$'s identically null except the first one, we have
$$
E\left[e^{-N_1(\varphi_1)}\right] = \exp\left\{\int_E \left(e^{-\varphi_1(x)} - 1\right) \nu_1(dx)\right\} ,
$$
and therefore $N_1$ is a Poisson process with intensity measure $\nu_1$. Similarly, for any $i \in J$, $N_i$ is a Poisson process with intensity measure $\nu_i$. Independence follows from Theorem 8.1.37. □

8.3 Marked Spatial Poisson Processes

8.3.1 As Unmarked Poisson Processes

Let
(α) $N$ be a simple and locally finite point process on $E$, with point sequence $\{X_n\}_{n\in\mathbb{N}}$, and
(β) $\{Z_n\}_{n\in\mathbb{N}}$ be a sequence of random elements taking their values in the measurable space $(K, \mathcal{K})$.

The sequence $\{X_n, Z_n\}_{n\in\mathbb{N}}$ is a marked point process, with the interpretation that $Z_n$ is the mark associated with the point $X_n$. $N$ is the base point process of the marked point process, and $\{Z_n\}_{n\in\mathbb{N}}$ is the associated sequence of marks. One also calls $N$ a simple and locally finite point process on $E$ with marks $\{Z_n\}_{n\in\mathbb{N}}$ in $K$. If moreover
(1) $N$ is a Poisson process with intensity measure $\nu$,
(2) $\{Z_n\}_{n\in\mathbb{N}}$ is an iid sequence, and
(3) $\{Z_n\}_{n\in\mathbb{N}}$ and $N$ are independent,
the corresponding marked point process is called a Poisson process on $E$ with independent iid marks.

This model can be slightly generalized by allowing the mark distribution to depend on the location of the marked point. More precisely, we replace (2) and (3) by
(2') $\{Z_n\}_{n\in\mathbb{N}}$ is, conditionally on $N$, an independent sequence,
(3') given $X_n$, the random vector $Z_n$ is independent of $X_k$ ($k \in \mathbb{N}$, $k \neq n$), and
(4') for all $n \in \mathbb{N}$ and all $L \in \mathcal{K}$, $P(Z_n \in L \mid X_n = x) = Q(x, L)$,
where $Q(\cdot, \cdot)$ is a stochastic kernel from $(E, \mathcal{B}(E))$ to $(K, \mathcal{K})$, that is, $Q$ is a function from $E \times \mathcal{K}$ to $[0, 1]$ such that for all $L \in \mathcal{K}$ the map $x \mapsto Q(x, L)$ is measurable, and for all $x \in E$, $Q(x, \cdot)$ is a probability measure on $(K, \mathcal{K})$.

Theorem 8.3.1 Let $\{X_n, Z_n\}_{n\in\mathbb{N}}$ be as in (α) and (β) above, and define the point process $\widetilde{N}$ on $E \times K$ by
$$
\widetilde{N}(A) = \sum_{n\in\mathbb{N}} 1_A(X_n, Z_n) \qquad (A \in \mathcal{B}(E) \otimes \mathcal{K}) . \tag{8.14}
$$
If conditions (1), (2'), (3'), and (4') above are satisfied, then $\widetilde{N}$ is a simple Poisson process with intensity measure $\widetilde{\nu}$ given by
$$
\widetilde{\nu}(C \times L) = \int_C Q(x, L)\, \nu(dx) \qquad (C \in \mathcal{B}(E),\ L \in \mathcal{K}) .
$$

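Proof. In view of Theorem 8.1.34, it suffices to show that the Laplace transform of $\widetilde{N}$ has the appropriate form, that is, for any non-negative measurable function $\widetilde{\varphi} : E \times K \to \mathbb{R}$,
$$
E\left[e^{-\widetilde{N}(\widetilde{\varphi})}\right] = \exp\left\{\int_E \int_K \left(e^{-\widetilde{\varphi}(t,z)} - 1\right) \widetilde{\nu}(dt \times dz)\right\} .
$$
By dominated convergence,
$$
E\left[e^{-\widetilde{N}(\widetilde{\varphi})}\right] = E\left[e^{-\sum_{n\in\mathbb{N}} \widetilde{\varphi}(X_n, Z_n)}\right] = \lim_{L\uparrow\infty} E\left[e^{-\sum_{n\leq L} \widetilde{\varphi}(X_n, Z_n)}\right] .
$$
For the time being, fix a positive integer $L$. Then, taking into account assumptions (2') and (3'),
$$
\begin{aligned}
E\left[e^{-\sum_{n\leq L} \widetilde{\varphi}(X_n, Z_n)}\right] &= E\left[\prod_{n\leq L} e^{-\widetilde{\varphi}(X_n, Z_n)}\right] \\
&= E\left[E\left[\prod_{n\leq L} e^{-\widetilde{\varphi}(X_n, Z_n)} \,\Big|\, X_j,\ j \leq L\right]\right] \\
&= E\left[e^{-\sum_{n\leq L} \psi(X_n)}\right] ,
\end{aligned}
$$
where $\psi(x) := -\log \int_K e^{-\widetilde{\varphi}(x,z)}\, Q(x, dz)$, a non-negative function. Letting $L \uparrow \infty$, we have, by dominated convergence,
$$
\begin{aligned}
E\left[e^{-\widetilde{N}(\widetilde{\varphi})}\right] &= E\left[e^{-\sum_{n\in\mathbb{N}} \psi(X_n)}\right] = E\left[e^{-N(\psi)}\right] \\
&= \exp\left\{\int_E \left(e^{-\psi(x)} - 1\right) \nu(dx)\right\} \\
&= \exp\left\{\int_E \left[\int_K e^{-\widetilde{\varphi}(x,z)}\, Q(x, dz) - 1\right] \nu(dx)\right\} \\
&= \exp\left\{\int_E \left[\int_K \left(e^{-\widetilde{\varphi}(x,z)} - 1\right) Q(x, dz)\right] \nu(dx)\right\} \\
&= \exp\left\{\int_E \int_K \left(e^{-\widetilde{\varphi}(x,z)} - 1\right) \widetilde{\nu}(dx \times dz)\right\} .
\end{aligned}
$$
□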

Example 8.3.2: The M/GI/∞ Model, take 1. The model of this example is of interest in queueing theory and in the traffic analysis of communications networks. We adopt the queueing interpretation. Let $N$ be an hpp on $\mathbb{R}$ with intensity $\lambda$, and $\{\sigma_n\}_{n\in\mathbb{Z}}$ be an iid sequence of random variables taking their values in $\mathbb{R}_+$ with common probability distribution $Q$. Assume moreover that $\{\sigma_n\}_{n\in\mathbb{Z}}$ and $N$ are independent. The $n$-th event time of $N$, $T_n$, is the arrival time of the $n$-th customer, and $\sigma_n$ is her service time request. Define the point process $\widetilde{N}$ on $\mathbb{R} \times \mathbb{R}_+$ by
$$
\widetilde{N}(C) = \sum_{n\in\mathbb{Z}} 1_C(T_n, \sigma_n)
$$
for all $C \in \mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathbb{R}_+)$. According to Theorem 8.3.1, $\widetilde{N}$ is a simple Poisson process with intensity measure
$$
\widetilde{\nu}(dt \times dz) = \lambda\, dt \times Q(dz) .
$$
In the M/GI/∞ model,⁴ a customer arriving at time $T_n$ is immediately served, and therefore departs from the "system" at time $T_n + \sigma_n$. The number $X(t)$ of customers present in the system at time $t$ is therefore given by the formula
$$
X(t) = \sum_{n\in\mathbb{Z}} 1_{(-\infty,t]}(T_n)\, 1_{(t,\infty)}(T_n + \sigma_n) .
$$
(The $n$-th customer is in the system at time $t$ if and only if she arrived at time $T_n \leq t$ and departed at time $T_n + \sigma_n > t$.) Assume that the service times have finite expectation: $E[\sigma_1] < \infty$. Then, for all $t \in \mathbb{R}$, $X(t)$ is a Poisson random variable with mean $\lambda E[\sigma_1]$.

⁴ "∞" represents the number of servers. This model is sometimes called a "queueing" system, although in reality there is no queueing, since customers are served immediately upon arrival and without interruption. It is in fact a "pure delay" system.

Proof. Observe that
$$
X(t) = \widetilde{N}(C(t)) ,
$$
where $C(t) := \{(s, \sigma);\ s \leq t,\ s + \sigma > t\} \subset \mathbb{R} \times \mathbb{R}_+$. In particular, $X(t)$ is a Poisson random variable with mean
$$
\begin{aligned}
\widetilde{\nu}(C(t)) &= \int_{\mathbb{R}} \int_{\mathbb{R}_+} 1_{\{s+\sigma>t\}} 1_{\{s\leq t\}}\, \widetilde{\nu}(ds \times d\sigma) \\
&= \int_{\mathbb{R}} \int_{\mathbb{R}_+} 1_{\{s+\sigma>t\}} 1_{\{s\leq t\}}\, \lambda\, ds \times Q(d\sigma) \\
&= \int_{\mathbb{R}} \left(\int_{\mathbb{R}_+} 1_{\{s+\sigma>t\}}\, Q(d\sigma)\right) 1_{\{s\leq t\}}\, \lambda\, ds \\
&= \lambda \int_{-\infty}^{t} Q((t-s, +\infty))\, ds \\
&= \lambda \int_0^\infty Q((s, +\infty))\, ds = \lambda \int_0^\infty P(\sigma_1 > s)\, ds = \lambda E[\sigma_1] .
\end{aligned}
$$
□

It can be shown that the departure process $D$ of departure times, defined by
$$
D(C) := \sum_{n\in\mathbb{Z}} 1_C(T_n + \sigma_n) ,
$$

is an hpp of intensity $\lambda$ (Exercise 8.5.13).

Formulas such as Campbell's first formula and the Poisson exponential formula are straightforwardly extended to marked point processes. In the situation prevailing in Theorem 8.3.1, consider sums of the type
$$
\widetilde{N}(\widetilde{\varphi}) := \sum_{n\in\mathbb{N}} \widetilde{\varphi}(X_n, Z_n) , \tag{8.15}
$$
for functions $\widetilde{\varphi} : E \times K \to \mathbb{R}$. Note that, denoting by $Z_1(x)$ any random element of $K$ with the distribution $Q(x, dz)$,
$$
\widetilde{\nu}(\widetilde{\varphi}) = \int_E \int_K \widetilde{\varphi}(x, z)\, Q(x, dz)\, \nu(dx) = \int_E E\left[\widetilde{\varphi}(x, Z_1(x))\right] \nu(dx) ,
$$
whenever the quantities involved have a meaning. Using this observation, the formulas obtained in the previous subsection can be applied in terms of marked point processes. The corollaries below do not require proofs, since they are reformulations of previous results, namely Theorem 8.2.6, Theorem 8.2.7 and Exercise 7.5.1.

Let $0 < p < \infty$. Recall that a measurable function $\widetilde{\varphi} : E \times K \to \mathbb{R}$ (resp. $\to \mathbb{C}$) is said to be in $L^p_{\mathbb{R}}(\widetilde{\nu})$ (resp. $L^p_{\mathbb{C}}(\widetilde{\nu})$) if
$$
\int_E \int_K |\widetilde{\varphi}(x, z)|^p\, \nu(dx)\, Q(x, dz) < \infty .
$$

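Corollary 8.3.3 Suppose that $\widetilde{\varphi} \in L^1_{\mathbb{C}}(\widetilde{\nu})$. Then the sum (8.15) is well defined, and moreover
$$
E\left[\sum_{n\in\mathbb{N}} \widetilde{\varphi}(X_n, Z_n)\right] = \int_E E\left[\widetilde{\varphi}(x, Z_1(x))\right] \nu(dx) .
$$
Let $\widetilde{\varphi}, \widetilde{\psi} : E \times K \to \mathbb{C}$ be two measurable functions in $L^1_{\mathbb{C}}(\widetilde{\nu}) \cap L^2_{\mathbb{C}}(\widetilde{\nu})$. Then
$$
\mathrm{cov}\left(\sum_{n\in\mathbb{N}} \widetilde{\varphi}(X_n, Z_n),\ \sum_{n\in\mathbb{N}} \widetilde{\psi}(X_n, Z_n)\right) = \int_E E\left[\widetilde{\varphi}(x, Z_1(x))\, \widetilde{\psi}(x, Z_1(x))^*\right] \nu(dx) .
$$

Corollary 8.3.4 Let $\widetilde{\varphi}$ be a non-negative function from $E \times K$ to $\mathbb{R}$. Then
$$
E\left[e^{-\sum_{n\in\mathbb{N}} \widetilde{\varphi}(X_n, Z_n)}\right] = \exp\left\{\int_E E\left[e^{-\widetilde{\varphi}(x, Z_1(x))} - 1\right] \nu(dx)\right\} .
$$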
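The number of customers present in the M/GI/∞ system of Example 8.3.2 is a sum of the type (8.15), so its mean λE[σ₁] is an instance of Corollary 8.3.3, and since it is a Poisson variable its variance should take the same value. The following simulation sketch checks this under assumptions that are not in the text: intensity λ = 3 and Uniform(0, 2) service times, chosen so that only arrivals in [-2, 0] can still be present at time 0. The function names are hypothetical.

```python
import random

def customers_present(rate, max_service, rng):
    """X(0): customers with arrival time T_n <= 0 and departure time T_n + sigma_n > 0.

    Service times are Uniform(0, max_service) (an illustrative assumption), so
    customers arriving before -max_service have already left and are not simulated.
    """
    count = 0
    t = -max_service + rng.expovariate(rate)    # first arrival after -max_service
    while t <= 0.0:
        sigma = rng.uniform(0.0, max_service)   # mark attached to this arrival
        if t + sigma > 0.0:                     # still in the system at time 0
            count += 1
        t += rng.expovariate(rate)
    return count

def mean_and_variance(samples):
    """Sample mean and unbiased sample variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return mean, var

rng = random.Random(0)
rate, max_service = 3.0, 2.0
samples = [customers_present(rate, max_service, rng) for _ in range(20000)]
mean, var = mean_and_variance(samples)
expected = rate * max_service / 2.0             # lambda * E[sigma_1]
print(f"mean {mean:.3f}, variance {var:.3f}, lambda*E[sigma_1] = {expected:.3f}")
```

Both the sample mean and the sample variance should be close to λE[σ₁], as expected for a Poisson variable.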

8.3.2 Operations on Poisson Processes

Thinning and Coloring

Thinning is the operation of randomly erasing points of a Poisson process. It is a particular case of the independent coloring operation, whereby the points of a Poisson process are independently colored, with the result of obtaining independent Poisson processes, each one corresponding to a different color.

Theorem 8.3.5 Consider the situation depicted in Theorem 8.3.1. Let $I$ be an arbitrary index set and let $\{L_i\}_{i\in I}$ be a family of disjoint measurable sets of $K$. Define for each $i \in I$ the simple point process $N_i$ on $\mathbb{R}^m$ by
$$
N_i(C) = \sum_{n\in\mathbb{N}} 1_C(X_n)\, 1_{L_i}(Z_n) .
$$
Then the family $N_i$ $(i \in I)$ is an independent family of Poisson processes with respective intensity measures $\nu_i$, $i \in I$, where
$$
\nu_i(dx) = Q(x, L_i)\, \nu(dx) .
$$

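Proof. According to the definition of independence, it suffices to consider a finite index set $I$. Define the simple point process $\widetilde{N}$ on $\mathbb{R}^m \times K$ as in (8.14). Then $\widetilde{N}$ is a Poisson process with intensity measure $\widetilde{\nu}(C \times L) = \int_C Q(x, L)\, \nu(dx)$. Defining $\widetilde{\varphi}(x, z) = \sum_{i\in I} \varphi_i(x) 1_{L_i}(z)$, we have $\sum_{i\in I} N_i(\varphi_i) = \widetilde{N}(\widetilde{\varphi})$. Therefore
$$
\begin{aligned}
E\left[e^{-\sum_{i\in I} N_i(\varphi_i)}\right] &= E\left[e^{-\widetilde{N}(\widetilde{\varphi})}\right] \\
&= \exp\left\{\int_{\mathbb{R}^m} \int_K \left(e^{-\widetilde{\varphi}(x,z)} - 1\right) \widetilde{\nu}(dx \times dz)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m} \int_K \left(e^{-\widetilde{\varphi}(x,z)} - 1\right) Q(x, dz)\, \nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m} \int_K \left(e^{-\sum_{i\in I} \varphi_i(x) 1_{L_i}(z)} - 1\right) Q(x, dz)\, \nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m} \int_K \sum_{i\in I} \left(e^{-\varphi_i(x)} - 1\right) 1_{L_i}(z)\, Q(x, dz)\, \nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m} \sum_{i\in I} \left(e^{-\varphi_i(x)} - 1\right) Q(x, L_i)\, \nu(dx)\right\} \\
&= \prod_{i\in I} \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\varphi_i(x)} - 1\right) Q(x, L_i)\, \nu(dx)\right\} .
\end{aligned}
$$
Therefore,
$$
E\left[e^{-\sum_{i\in I} N_i(\varphi_i)}\right] = \prod_{i\in I} \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\varphi_i(x)} - 1\right) \nu_i(dx)\right\} ,
$$
and the result follows from Theorem 8.2.11. □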
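A small simulation can illustrate Theorem 8.3.5 in the simplest thinning situation. The sketch below assumes, for illustration only, a homogeneous Poisson process of intensity 4 on [0, 1] whose points are kept with probability 0.3 independently of everything else (so that Q(x, L₁) = 0.3 and Q(x, L₂) = 0.7 do not depend on x). The names `hpp_points` and `thinned_counts` are hypothetical. The theorem predicts Poisson counts with means 1.2 and 2.8 and zero covariance between them.

```python
import random

def hpp_points(rate, length, rng):
    """Points of a homogeneous Poisson process of intensity `rate` on [0, length]."""
    points, t = [], rng.expovariate(rate)
    while t <= length:
        points.append(t)
        t += rng.expovariate(rate)
    return points

def thinned_counts(rate, keep_prob, rng):
    """Return (N_1([0, 1]), N_2([0, 1])): the numbers of kept and of erased points."""
    kept = erased = 0
    for _ in hpp_points(rate, 1.0, rng):
        if rng.random() < keep_prob:   # independent mark: 'kept' with prob. keep_prob
            kept += 1
        else:
            erased += 1
    return kept, erased

rng = random.Random(0)
rate, keep_prob, n_runs = 4.0, 0.3, 20000      # illustrative values (assumptions)
pairs = [thinned_counts(rate, keep_prob, rng) for _ in range(n_runs)]
mean_kept = sum(k for k, _ in pairs) / n_runs
mean_erased = sum(e for _, e in pairs) / n_runs
cov = sum(k * e for k, e in pairs) / n_runs - mean_kept * mean_erased
print(f"means {mean_kept:.3f}, {mean_erased:.3f}; empirical covariance {cov:.3f}")
```

A vanishing covariance is of course only a necessary condition for independence; the full statement is the factorization of the Laplace functional established in the proof.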

Transportation

This is the operation of moving the points of a Poisson process. More precisely, consider the situation depicted in Theorem 8.3.1. Form a point process $N^*$ on $K$ by associating to a point $X_n \in \mathbb{R}^m$ a point $Z_n \in K$:
$$
N^*(L) := \sum_{n\in\mathbb{N}} 1_L(Z_n) ,
$$
where $L \in \mathcal{K}$. We then say that $N^*$ is obtained by transporting $N$ via the stochastic kernel $Q(x, \cdot)$.

Theorem 8.3.6 $N^*$ is a Poisson process on $K$ with intensity measure $\nu^*$ given by
$$
\nu^*(L) = \int_{\mathbb{R}^m} \nu(dx)\, Q(x, L) .
$$

Proof. Let $\varphi^* : K \to \mathbb{R}$ be a non-negative measurable function. We have
$$
\begin{aligned}
E\left[e^{-N^*(\varphi^*)}\right] &= E\left[e^{-\sum_{n\in\mathbb{N}} \varphi^*(Z_n)}\right] \\
&= \exp\left\{\int_{\mathbb{R}^m} \int_K \left(e^{-\varphi^*(z)} - 1\right) \nu(dx)\, Q(x, dz)\right\} \\
&= \exp\left\{\int_K \left(e^{-\varphi^*(z)} - 1\right) \int_{\mathbb{R}^m} \nu(dx)\, Q(x, dz)\right\} .
\end{aligned}
$$
□

Example 8.3.7: Translation. Let $N$ be a Poisson process on $\mathbb{R}^m$ with intensity measure $\nu$ and let $\{V_n\}_{n\in\mathbb{N}}$ be an iid sequence of random vectors of $\mathbb{R}^m$ with common distribution $Q$. Form the point process $N^*$ on $\mathbb{R}^m$ by translating each point $X_n$ of $N$ by $V_n$. Formally,
$$
N^*(C) = \sum_{n\in\mathbb{N}} 1_C(X_n + V_n) .
$$
We are in the situation of Theorem 8.3.6 with $Z_n = X_n + V_n$. In particular, $Q(x, A) = Q(A - x)$. It follows that $N^*$ is a Poisson process on $\mathbb{R}^m$ with intensity measure
$$
\nu^*(L) = \int_{\mathbb{R}^m} Q(L - x)\, \nu(dx) ,
$$
the convolution of $\nu$ and $Q$.

Poisson Shot Noise

Let $N$ be a simple and locally finite point process on $\mathbb{R}^m$ with point sequence $\{X_n\}_{n\in\mathbb{N}}$ and with marks $\{Z_n\}_{n\in\mathbb{N}}$ in the measurable space $(K, \mathcal{K})$. Let $h : \mathbb{R}^m \times K \to \mathbb{C}$ be a measurable function. The complex-valued spatial stochastic process $\{X(y)\}_{y\in\mathbb{R}^m}$ given by
$$
X(y) := \sum_{n\in\mathbb{N}} h(y - X_n, Z_n) , \tag{8.16}
$$
where the right-hand side is assumed well defined (for instance, when $h$ takes real non-negative values), is called a spatial shot noise with random impulse response. If $N$ is a simple and locally finite Poisson process on $\mathbb{R}^m$ with independent iid marks $\{Z_n\}_{n\in\mathbb{N}}$, $\{X(y)\}_{y\in\mathbb{R}^m}$ is called a Poisson spatial shot noise with random impulse response and independent iid marks.

The following result is a direct application of Theorems 8.2.6 and 8.3.1.

Theorem 8.3.8 Consider the above Poisson spatial shot noise with random impulse response and independent iid marks. Suppose that for all $y \in \mathbb{R}^m$,
$$
\int_{\mathbb{R}^m} E\left[|h(y - x, Z_1)|\right] \nu(dx) < \infty
$$
and
$$
\int_{\mathbb{R}^m} E\left[|h(y - x, Z_1)|^2\right] \nu(dx) < \infty .
$$
Then the complex-valued spatial stochastic process $\{X(y)\}_{y\in\mathbb{R}^m}$ given by (8.16) is well defined, and for any $y, \xi \in \mathbb{R}^m$, we have
$$
E[X(y)] = \int_{\mathbb{R}^m} E\left[h(y - x, Z_1)\right] \nu(dx)
$$
and
$$
\mathrm{cov}(X(y + \xi), X(y)) = \int_{\mathbb{R}^m} E\left[h(y - x, Z_1)\, h^*(y + \xi - x, Z_1)\right] \nu(dx) .
$$
In the case where the base point process $N$ is an hpp with intensity $\lambda$, we find that
$$
E[X(y)] = \lambda \int_{\mathbb{R}^m} E\left[h(x, Z_1)\right] dx
$$
and
$$
\mathrm{cov}(X(y + \xi), X(y)) = \lambda \int_{\mathbb{R}^m} E\left[h(x, Z_1)\, h^*(\xi + x, Z_1)\right] dx .
$$
Observe that these quantities do not depend on $y \in \mathbb{R}^m$. The process $\{X(y)\}_{y\in\mathbb{R}^m}$ is for that reason called a wide-sense stationary process (see Chapter 9).
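The mean formula of Theorem 8.3.8, and the variance obtained by taking ξ = 0 in the covariance formula, can be illustrated by simulation. The sketch below rests on choices that are assumptions and not part of the text: dimension m = 1, a base process that is a homogeneous Poisson process of intensity 2 restricted to the window [-10, 10], marks uniform on (0, 1), and the impulse response h(u, z) = z e^{-|u|}. For these choices E[X(0)] = λ E[Z₁] ∫ e^{-|x|} dx and Var X(0) = λ E[Z₁²] ∫ e^{-2|x|} dx, the integrals being taken over the window.

```python
import math
import random

def shot_noise_at_origin(rate, half_width, rng):
    """X(0) = sum_n h(0 - X_n, Z_n) with h(u, z) = z * exp(-|u|), Z_n ~ Uniform(0, 1)."""
    total = 0.0
    x = -half_width + rng.expovariate(rate)     # first point to the right of -half_width
    while x <= half_width:
        z = rng.uniform(0.0, 1.0)               # independent mark of the point x
        total += z * math.exp(-abs(0.0 - x))
        x += rng.expovariate(rate)
    return total

rng = random.Random(0)
rate, half_width, n_runs = 2.0, 10.0, 5000      # illustrative values (assumptions)
values = [shot_noise_at_origin(rate, half_width, rng) for _ in range(n_runs)]
mean = sum(values) / n_runs
var = sum((v - mean) ** 2 for v in values) / (n_runs - 1)

# Closed forms on the window: E[Z] = 1/2, E[Z^2] = 1/3,
# integral of exp(-|x|) = 2 (1 - exp(-W)), integral of exp(-2|x|) = 1 - exp(-2 W).
exact_mean = rate * 0.5 * 2.0 * (1.0 - math.exp(-half_width))
exact_var = rate * (1.0 / 3.0) * (1.0 - math.exp(-2.0 * half_width))
print(f"mean {mean:.3f} vs {exact_mean:.3f}; variance {var:.3f} vs {exact_var:.3f}")
```

Truncating the base process to a window is harmless here because the restriction of a Poisson process to a measurable set is again a Poisson process, with the restricted intensity measure.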

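8.3.3 Change of Probability Measure

Let $(\Omega, \mathcal{F}, P)$ be a probability space on which is given a Poisson process $N$ on $\mathbb{R}^m$ with non-atomic and locally finite intensity measure $\nu$. We shall replace the probability $P$ by another probability $\hat{P}$ in such a way that with respect to this new probability, the same point process $N$ is a Poisson process, but with the intensity measure $\hat{\nu}$ given by
$$
\hat{\nu}(C) = \int_C \mu(x)\, \nu(dx) , \tag{8.17}
$$
for some non-negative measurable function $\mu : \mathbb{R}^m \to \mathbb{R}$.

The Case of Finite Intensity Measures

The above program is first carried out under the following hypotheses:
H1: $\nu$ is a finite measure, and
H2: $\mu$ is $\nu$-integrable (or equivalently $\hat{\nu}$ is finite).

The change of probability $P \to \hat{P}$ will be an absolutely continuous one, that is, for all $A \in \mathcal{F}$,
$$
\hat{P}(A) = E[L\, 1_A] , \tag{8.18}
$$
where $L$ is a non-negative random variable such that
$$
E[L] = 1 , \tag{8.19}
$$
called the Radon–Nikodým derivative of $\hat{P}$ with respect to $P$, and also denoted $\frac{d\hat{P}}{dP}$.

Lemma 8.3.9 Under hypotheses H1 and H2, the random variable
$$
L := \left(\prod_{n\in\mathbb{N}} \mu(X_n)\right) \exp\left\{-\int_{\mathbb{R}^m} (\mu(x) - 1)\, \nu(dx)\right\} \tag{8.20}
$$
satisfies (8.19).

Proof. Let $g(x) = \log(\mu(x))$ and decompose this function into its positive and negative parts, $g = g^+ - g^-$. By Theorem 8.2.7 we have that
$$
E\left[e^{-N(g^-)}\right] = \exp\left\{\int_{\mathbb{R}^m} \left(e^{-g^-(x)} - 1\right) \nu(dx)\right\}
$$
and
$$
E\left[e^{N(g^+)}\right] = \exp\left\{\int_{\mathbb{R}^m} \left(e^{g^+(x)} - 1\right) \nu(dx)\right\} .
$$
Let $B_1 = \{x \in \mathbb{R}^m;\ g(x) > 0\}$. By Theorem 8.3.5, the restrictions of $N$ to $B_1$ and $B_2 = \overline{B}_1$ are independent, and therefore the variables $e^{-N(g^-)}$ and $e^{N(g^+)}$ are independent. In particular, from the two last displays,
$$
\begin{aligned}
E\left[\prod_{n\in\mathbb{N}} \mu(X_n)\right] &= E\left[e^{N(\log(\mu))}\right] = E\left[e^{N(g)}\right] = E\left[e^{N(g^+) - N(g^-)}\right] \\
&= E\left[e^{-N(g^-)} e^{N(g^+)}\right] = E\left[e^{-N(g^-)}\right] E\left[e^{N(g^+)}\right] \\
&= \exp\left\{\int_{\mathbb{R}^m} \left(e^{-g^-(x)} - 1\right) \nu(dx)\right\} \exp\left\{\int_{\mathbb{R}^m} \left(e^{g^+(x)} - 1\right) \nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m} \left(e^{g(x)} - 1\right) 1_{\{g(x)\leq 0\}}\, \nu(dx)\right\} \exp\left\{\int_{\mathbb{R}^m} \left(e^{g(x)} - 1\right) 1_{\{g(x)>0\}}\, \nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m} \left(e^{g(x)} - 1\right) 1_{\{g(x)>0\}}\, \nu(dx) + \int_{\mathbb{R}^m} \left(e^{g(x)} - 1\right) 1_{\{g(x)\leq 0\}}\, \nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m} \left(e^{g(x)} - 1\right) \nu(dx)\right\} = \exp\left\{\int_{\mathbb{R}^m} (\mu(x) - 1)\, \nu(dx)\right\} .
\end{aligned}
$$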

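By assumptions H1 and H2 the last quantity is finite and therefore one can divide the first and last terms of the above chain of equalities by it, to obtain (8.19). □

Theorem 8.3.10 Under the assumptions H1 and H2, if we define probability $\hat{P}$ by (8.18) and (8.20), $N$ is under probability $\hat{P}$ a Poisson process with intensity measure $\hat{\nu}$ given by (8.17).

Proof. Denote expectation with respect to $\hat{P}$ by $\hat{E}$. It suffices to show that the Laplace transform of $N$ under probability $\hat{P}$ is that of a Poisson process with intensity measure $\hat{\nu}$, that is, for any bounded non-negative measurable function $\varphi : \mathbb{R}^m \to \mathbb{R}$,
$$
\hat{E}\left[e^{-N(\varphi)}\right] = \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\varphi(x)} - 1\right) \hat{\nu}(dx)\right\} .
$$
But
$$
\begin{aligned}
\hat{E}\left[e^{-N(\varphi)}\right] &= E\left[L\, e^{-N(\varphi)}\right] \\
&= E\left[e^{N(\log(\mu)) - \int_{\mathbb{R}^m} (\mu(x) - 1)\nu(dx)}\, e^{-N(\varphi)}\right] \\
&= E\left[e^{N(-\varphi + \log(\mu))}\, e^{-\int_{\mathbb{R}^m} (\mu(x) - 1)\nu(dx)}\right] \\
&= \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\varphi(x) + \log(\mu(x))} - 1\right) \nu(dx)\right\} \exp\left\{-\int_{\mathbb{R}^m} (\mu(x) - 1)\, \nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\varphi(x)}\mu(x) - 1\right) \nu(dx)\right\} \exp\left\{-\int_{\mathbb{R}^m} (\mu(x) - 1)\, \nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\varphi(x)} - 1\right) \mu(x)\, \nu(dx)\right\} = \exp\left\{\int_{\mathbb{R}^m} \left(e^{-\varphi(x)} - 1\right) \hat{\nu}(dx)\right\} .
\end{aligned}
$$
□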
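Lemma 8.3.9 and Theorem 8.3.10 can be illustrated by an importance-sampling style check. The sketch below assumes, purely for illustration, that under P the process N is a homogeneous Poisson process of intensity 2 on [0, 1] and that μ(x) = 1 + x; the helper names are hypothetical. It estimates E[L], which should be 1 by (8.19), and E[L N([0, 1])], which by (8.18) is the mean of N([0, 1]) under the new probability and should therefore equal the total mass of the new intensity measure (8.17), here 2 ∫₀¹ (1 + x) dx = 3.

```python
import math
import random

def hpp_points(rate, length, rng):
    """Points of a homogeneous Poisson process of intensity `rate` on [0, length]."""
    points, t = [], rng.expovariate(rate)
    while t <= length:
        points.append(t)
        t += rng.expovariate(rate)
    return points

def mu(x):
    """Density of the new intensity measure with respect to nu (illustrative choice)."""
    return 1.0 + x

def likelihood_ratio(points, rate):
    """L of (8.20): prod_n mu(X_n) * exp(-integral over [0, 1] of (mu(x) - 1) rate dx)."""
    integral = rate * 0.5                       # rate * integral of x over [0, 1]
    return math.prod(mu(x) for x in points) * math.exp(-integral)

rng = random.Random(0)
rate, n_runs = 2.0, 40000                       # illustrative values (assumptions)
sum_L = sum_LN = 0.0
for _ in range(n_runs):
    pts = hpp_points(rate, 1.0, rng)            # sampled under P
    L = likelihood_ratio(pts, rate)
    sum_L += L
    sum_LN += L * len(pts)
mean_L = sum_L / n_runs                         # estimates E[L]
mean_N_hat = sum_LN / n_runs                    # estimates the mean of N([0, 1]) under P-hat
nu_hat_total = rate * 1.5                       # rate * integral of (1 + x) over [0, 1]
print(f"E[L] ~ {mean_L:.3f}; E[L N] ~ {mean_N_hat:.3f} vs {nu_hat_total:.3f}")
```

Reweighting samples drawn under P by L is exactly what (8.18) prescribes; the check only covers the first moment of the count, not the full Poisson property.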

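The Mixed Poisson Case

Let $N$ be a Poisson process on $\mathbb{R}^m$ of finite intensity measure $\nu$ and let $\Lambda$ be a non-negative random variable independent of $N$, with cumulative distribution function $F$. Let
$$
L := \Lambda^{N(\mathbb{R}^m)} \exp\left\{-(\Lambda - 1)\,\nu(\mathbb{R}^m)\right\} .
$$
The arguments of the proof of Lemma 8.3.9 and Theorem 8.3.10 are immediately adaptable to show that $E_P[L] = 1$ and that under the probability measure $\hat{P}$ defined by $\frac{d\hat{P}}{dP} = L$, $N$ is a Cox process (here a mixed Poisson process) with $\sigma(\Lambda)$-conditional intensity measure $\Lambda\, \nu(dx)$.

Theorem 8.3.11 Under the above conditions, for any non-negative function $g : \mathbb{R}_+ \to \mathbb{R}$,
$$
E_{\hat{P}}\left[g(\Lambda) \mid \mathcal{F}^N\right] = \frac{\int_{\mathbb{R}_+} g(\lambda)\, \lambda^{N(\mathbb{R}^m)} e^{-\lambda \nu(\mathbb{R}^m)}\, F(d\lambda)}{\int_{\mathbb{R}_+} \lambda^{N(\mathbb{R}^m)} e^{-\lambda \nu(\mathbb{R}^m)}\, F(d\lambda)} . \tag{8.21}
$$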

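Proof. The proof is based on the following fundamental lemma:

Lemma 8.3.12 Let $P$ and $Q$ be two probability measures on the measurable space $(\Omega, \mathcal{F})$ such that $P \ll Q$ and let $L := \frac{dP}{dQ}$. Let $Z$ be a non-negative random variable. For any sub-σ-field $\mathcal{G}$ of $\mathcal{F}$,
$$
E_Q[L \mid \mathcal{G}]\, E_P[Z \mid \mathcal{G}] = E_Q[ZL \mid \mathcal{G}] \qquad Q\text{-a.s.} \tag{8.22}
$$
or, equivalently,
$$
E_P[Z \mid \mathcal{G}] = \frac{E_Q[ZL \mid \mathcal{G}]}{E_Q[L \mid \mathcal{G}]} \qquad P\text{-a.s.} \tag{8.23}
$$

Proof. By definition of conditional expectation, for all $A \in \mathcal{G}$,
$$
\int_A Z\, dP = \int_A E_P[Z \mid \mathcal{G}]\, dP .
$$
By definition of $L$ and of conditional expectation again,
$$
\int_A Z\, dP = \int_A ZL\, dQ = \int_A E_Q[ZL \mid \mathcal{G}]\, dQ .
$$
Also
$$
\int_A E_P[Z \mid \mathcal{G}]\, dP = \int_A E_P[Z \mid \mathcal{G}]\, L\, dQ = \int_A E_P[Z \mid \mathcal{G}]\, E_Q[L \mid \mathcal{G}]\, dQ .
$$
Therefore
$$
\int_A E_Q[ZL \mid \mathcal{G}]\, dQ = \int_A E_P[Z \mid \mathcal{G}]\, E_Q[L \mid \mathcal{G}]\, dQ ,
$$
which is, since $A$ is arbitrary in $\mathcal{G}$, equivalent to (8.22). Since $P \ll Q$ this equality also holds $P$-a.s. To obtain (8.23), it remains to show that $P(E_Q[L \mid \mathcal{G}] = 0) = 0$. But
$$
\begin{aligned}
P(E_Q[L \mid \mathcal{G}] = 0) &= \int 1_{\{E_Q[L \mid \mathcal{G}] = 0\}}\, dP = \int 1_{\{E_Q[L \mid \mathcal{G}] = 0\}}\, L\, dQ \\
&= \int 1_{\{E_Q[L \mid \mathcal{G}] = 0\}}\, E_Q[L \mid \mathcal{G}]\, dQ = 0 .
\end{aligned}
$$
□

We may now proceed to the proof of Theorem 8.3.11. By Lemma 8.3.12,
$$
E_{\hat{P}}\left[g(\Lambda) \mid \mathcal{F}^N\right] = \frac{E_P\left[g(\Lambda) L \mid \mathcal{F}^N\right]}{E_P\left[L \mid \mathcal{F}^N\right]}
$$
and therefore, since under $P$, $N$ and $\Lambda$ are independent,
$$
E_{\hat{P}}\left[g(\Lambda) \mid \mathcal{F}^N\right] = \frac{\int_{\mathbb{R}_+} g(\lambda)\, \lambda^{N(\mathbb{R}^m)} \exp\{-(\lambda - 1)\nu(\mathbb{R}^m)\}\, F(d\lambda)}{\int_{\mathbb{R}_+} \lambda^{N(\mathbb{R}^m)} \exp\{-(\lambda - 1)\nu(\mathbb{R}^m)\}\, F(d\lambda)} ,
$$
from which the result follows, since the factor $e^{\nu(\mathbb{R}^m)}$ cancels. □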

8.4 The Boolean Model

Stochastic geometry concerns the study of random shapes. The model considered below pertains to a particular sort of stochastic geometry, where the randomness of the shapes is dependent on the positions of the points of an underlying point process. Although there exists a sound mathematical theory of random sets, this theory will not be necessary as long as the random sets considered in the applications are "good" sets fully described by a random vector of finite dimension (circle, disk, polygon, line, segment, etc.). We then resort to what can be called the "poor man's random set theory", in which a random set is a set of the form $S(Z) \subseteq \mathbb{R}^m$, where $Z \in \mathbb{R}^d$ is a random vector and for each $z$, $S(z)$ is a measurable set and $S(Z)$ is also a measurable set. These are minimal requirements. We shall consider real-valued functions of $S$, for instance
$$
g(S) = \ell^m(S) , \qquad g(S) = 1_{a\in S} , \tag{$\star$}
$$
for which the expectation is well defined, as
$$
E[g(S)] := E[g(S(Z))] = \int_{\mathbb{R}^d} g(S(z))\, P(Z \in dz) . \tag{8.24}
$$
We only need to ensure that the function $z \in \mathbb{R}^d \mapsto g(S(z)) \in \mathbb{R}$ is measurable and its integral with respect to the distribution of $Z$ well defined (in the examples, this will generally be the case, because $g$ will be a non-negative function, as in ($\star$)). We shall use for (8.24) the abbreviated notation
$$
\int_{\mathcal{S}} g(s)\, Q(ds) ,
$$
thereby pretending that there exists a set $\mathcal{S}$ of shapes with a suitable σ-field $\mathcal{G}$ on it, and an adequate probability distribution $Q$ on $(\mathcal{S}, \mathcal{G})$.

Example 8.4.1: Random Disk. In this example, $S$ is the closed disk in $\mathbb{R}^2$ centered on the origin and with radius $Z$, a non-negative random variable. We have, with $g(S) = \ell^2(S)$,
$$
E[g(S)] = E\left[\ell^2(S(Z))\right] = E\left[\pi Z^2\right] ,
$$
and with $g(S) = 1_{a\in S}$,
$$
E[g(S)] = E\left[1_{a\in S(Z)}\right] = P(a \in S(Z)) = P(Z \geq \|a\|) .
$$

Definition 8.4.2 The capacity functional of the random set $S$ is the function $K \mapsto T_S(K)$ ($K$ compact) defined by
$$
T_S(K) := P(S \cap K \neq \emptyset) .
$$

Example 8.4.3: The Capacity Functional of a Point Process. A simple point process $N$ can also be viewed as a random set $S \equiv N$. In this case the capacity functional is determined by the void probability function $v_N$:
$$
T_N(K) := P(N \cap K \neq \emptyset) = P(N(K) \neq 0) = 1 - v_N(K) .
$$

We now introduce the Boolean model.⁵ Let $N$ be a Poisson process on $\mathbb{R}^m$ with a non-atomic σ-finite intensity measure $\nu$. Denote by $\{X_n\}_{n\in\mathbb{N}}$ its sequence of points. Let now $\{S_n\}_{n\in\mathbb{N}}$ be a sequence of random marks, iid and independent of $N$. Each $S_n$ is a compact random set. The $X_n$'s are called the germs whereas the $S_n$'s are called the grains. Recall the following notations: if $A$ and $B$ are subsets of $\mathbb{R}^m$ and $x \in \mathbb{R}^m$,
$$
x + A := \{x + y;\ y \in A\} , \qquad A \oplus B := \{x + y;\ x \in A,\ y \in B\} .
$$

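One of the quantities of interest in applications is the probability of intersection of the random set $\Sigma = \cup_{n\in\mathbb{N}} (X_n + S_n)$ with a given compact set $K \subset \mathbb{R}^m$, that is,
$$
T_\Sigma(K) := P(\Sigma \cap K \neq \emptyset) .
$$
In order to compute this quantity, we first observe that
$$
T_\Sigma(K) = P(N(K) \neq 0) , \quad \text{where } N(K) = \sum_{n\in\mathbb{N}} 1_{(X_n + S_n)\cap K \neq \emptyset}
$$
(here $N(K)$ denotes the number of shifted grains hitting $K$, not the number of germs in $K$). We show that $N(K)$ is a Poisson variable with mean
$$
\theta(K) := \int_{\mathbb{R}^m} P((x + S_1) \cap K \neq \emptyset)\, \nu(dx) \tag{$\star$}
$$
and therefore
$$
T_\Sigma(K) = 1 - \exp\left\{-\int_{\mathbb{R}^m} P((x + S_1) \cap K \neq \emptyset)\, \nu(dx)\right\} . \tag{8.25}
$$

Proof. The Laplace transform of the distribution of $N(K)$ is given by
$$
E\left[e^{-tN(K)}\right] = E\left[e^{-t\sum_{n\in\mathbb{N}} 1_{(X_n+S_n)\cap K\neq\emptyset}}\right] = E\left[e^{-t\sum_{n\in\mathbb{N}} f(X_n, S_n)}\right] ,
$$
where $t \geq 0$ and $f(x, s) := 1_{(x+s)\cap K\neq\emptyset}$. To compute it, use the formula in Corollary 8.3.4, which gives
$$
\begin{aligned}
E\left[e^{-t\sum_{n\in\mathbb{N}} f(X_n, S_n)}\right] &= \exp\left\{\int_{\mathbb{R}^m} E\left[e^{-tf(x, S_1)} - 1\right] \nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m} E\left[e^{-t 1_{(x+S_1)\cap K\neq\emptyset}} - 1\right] \nu(dx)\right\} \\
&= \exp\left\{(e^{-t} - 1) \int_{\mathbb{R}^m} P((x + S_1) \cap K \neq \emptyset)\, \nu(dx)\right\} .
\end{aligned}
$$
□

If the germ process is a homogeneous Poisson process of intensity $\lambda$, the mean value $\theta(K)$ of $N(K)$ takes the form
$$
\theta(K) = \lambda E\left[\ell^m((-S_1) \oplus K)\right] .
$$
Indeed, from ($\star$),
$$
\begin{aligned}
\theta(K) &= \lambda \int_{\mathbb{R}^m} P((x + S_1) \cap K \neq \emptyset)\, dx \\
&= \lambda E\left[\int_{\mathbb{R}^m} 1_{(x+S_1)\cap K\neq\emptyset}\, dx\right] = \lambda E\left[\int_{\mathbb{R}^m} 1_{(-S_1)\oplus K}(x)\, dx\right] .
\end{aligned}
$$
Therefore, in the homogeneous case,
$$
T_\Sigma(K) = 1 - \exp\left\{-\lambda E\left[\ell^m((-S_1) \oplus K)\right]\right\} . \tag{8.26}
$$

⁵ [Matheron, 1967, 1975], [Serra, 1982].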

Another quantity of interest when the germ point process is an hpp with intensity $\lambda$ is the volume fraction $p$ of the random set $\Sigma$, defined by
$$
p := \frac{E\left[\ell^m(\Sigma \cap B)\right]}{\ell^m(B)} .
$$
It is independent of $B$, and by translation invariance of the model, it is equal to
$$
p = \frac{E\left[\int_B 1_{x\in\Sigma}\, dx\right]}{\ell^m(B)} = P(0 \in \Sigma) = P(\Sigma \cap \{0\} \neq \emptyset) .
$$
Therefore,
$$
\begin{aligned}
p = T_\Sigma(\{0\}) &= 1 - e^{-\int_{\mathbb{R}^m} P((x+S_1)\cap\{0\}\neq\emptyset)\, \lambda\, dx} \\
&= 1 - e^{-\int_{\mathbb{R}^m} P(-x\in S_1)\, \lambda\, dx} \\
&= 1 - e^{-\lambda E[\ell^m(S_1)]} .
\end{aligned}
$$
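The volume fraction formula p = 1 − exp(−λE[ℓᵐ(S₁)]) lends itself to a direct simulation. The sketch below assumes, for illustration only, germs forming a homogeneous Poisson process of intensity 1 on the plane and grains that are closed disks centered at 0 with radius uniform on (0, 1); the function names are hypothetical. Only germs in the square [−1, 1]² can cover the origin, so the germ process is simulated on that square: a Poisson number of points placed independently and uniformly. For these grains E[ℓ²(S₁)] = π E[R²] = π/3.

```python
import math
import random

def poisson_sample(mean, rng):
    """Poisson(mean) variate: number of unit-rate exponential arrivals in [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def origin_covered(rate, r_max, rng):
    """True if the origin lies in the union of the disks X_n + B(0, R_n)."""
    n_germs = poisson_sample(rate * (2.0 * r_max) ** 2, rng)   # germs in [-r_max, r_max]^2
    for _ in range(n_germs):
        x = rng.uniform(-r_max, r_max)
        y = rng.uniform(-r_max, r_max)
        radius = rng.uniform(0.0, r_max)                       # grain attached to the germ
        if x * x + y * y <= radius * radius:
            return True
    return False

rng = random.Random(0)
rate, r_max, n_runs = 1.0, 1.0, 20000                          # illustrative values (assumptions)
hits = sum(origin_covered(rate, r_max, rng) for _ in range(n_runs))
estimate = hits / n_runs
exact = 1.0 - math.exp(-rate * math.pi * r_max ** 2 / 3.0)     # 1 - exp(-lambda E[area])
print(f"simulated P(0 in Sigma) = {estimate:.3f}, formula = {exact:.3f}")
```

By translation invariance the same estimate applies to any fixed point of the plane, which is why the probability of covering the origin coincides with the volume fraction.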

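The covariance function $C : \mathbb{R}^m \to \mathbb{R}$ of the random set $\Sigma$ is defined by
$$
C(x) := P(0 \in \Sigma,\ x \in \Sigma) .
$$
(This is the covariance function, in the usual sense, of the wide-sense stationary stochastic process $\{1_\Sigma(t)\}_{t\in\mathbb{R}^m}$.) In the homogeneous case,
$$
C(x) = 2p - 1 + (1 - p)^2 \exp\left\{\lambda E\left[\ell^m(S_1 \cap (S_1 - x))\right]\right\} .
$$

Proof.
$$
\begin{aligned}
C(x) &= P(0 \in \Sigma \cap (\Sigma - x)) \\
&= P(0 \in \Sigma) + P(x \in \Sigma) - P(0 \in \Sigma \cup (\Sigma - x)) \\
&= 2p - 1 + P(0 \notin \Sigma \cup (\Sigma - x)) \\
&= 2p - 1 + P(\Sigma \cap \{0, x\} = \emptyset) \\
&= 2p - 1 + \left(1 - T_\Sigma(\{0, x\})\right) .
\end{aligned}
$$
From (8.26),
$$
1 - T_\Sigma(\{0, x\}) = \exp\left\{-\lambda E\left[\ell^m((-S_1) \oplus \{0, x\})\right]\right\} .
$$
Now
$$
\begin{aligned}
E\left[\ell^m((-S_1) \oplus \{0, x\})\right] &= E\left[\ell^m((-S_1) \cup (-S_1 + x))\right] \\
&= E\left[\ell^m(-S_1)\right] + E\left[\ell^m(-S_1 + x)\right] - E\left[\ell^m((-S_1) \cap (-S_1 + x))\right] \\
&= 2E\left[\ell^m(S_1)\right] - E\left[\ell^m((-S_1) \cap (-S_1 + x))\right] .
\end{aligned}
$$
Combining the above equalities with the observations that $1 - p = \exp\{-\lambda E[\ell^m(S_1)]\}$ and that, by reflection, $\ell^m((-S_1) \cap (-S_1 + x)) = \ell^m(S_1 \cap (S_1 - x))$, gives the announced result. □

Example 8.4.4: The Boundary of a Poisson Cluster of Disks. Let $N$ be a homogeneous Poisson process on $\mathbb{R}^2$, of intensity $\lambda$. Let $\{X_n\}_{n\in\mathbb{N}}$ be its sequence of points. Draw around each point $X_n$ a closed disk of radius $a$. The area inside the square $[0, T] \times [0, T]$ that is not covered by a disk is delimited by a curve. We seek to compute its average length, excluding the parts on the boundaries of $[0, T] \times [0, T]$.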


The number of disks covering a given point y ∈ R2 is Z(y) =



1{||Xn −y||0} dD(s) .

Show that $\{X(t)\}_{t\geq 0}$ is the congestion process of an M/M/1/∞ queue with arrival rate $\lambda$ and service times with exponential distribution of mean $\mu$.

Exercise 9.4.4. M/M/1/∞ in heavy traffic
Consider the M/M/1/∞ queue with traffic intensity $\rho$ in equilibrium, and let $X_\rho = X_\rho(0)$ be the congestion process at time 0 (recall that the congestion process at any time $t \geq 0$ has a distribution independent of $t$ since the queue is assumed to be in equilibrium). Show that the random variable $(1 - \rho)X_\rho$ converges in distribution to an exponential random variable of mean 1.

Exercise 9.4.5. A generalization of Jackson's network
The following modification of the basic Jackson network is considered. For all $i$, $1 \leq i \leq K$, the server at station $i$ has a speed of service $\varphi_i(n_i)$ when there are $n_i$ customers present in station $i$, where $\varphi_i(k) > 0$ for all $k \geq 1$ and $\varphi_i(0) = 0$. The new infinitesimal generator is obtained from the standard one by replacing $\mu_i$ by $\mu_i \varphi_i(n_i)$. Check this and show that if for all $i$, $1 \leq i \leq K$,
$$
A_i := 1 + \sum_{n_i=1}^{\infty} \frac{\rho_i^{n_i}}{\prod_{k=1}^{n_i} \varphi_i(k)} < \infty ,
$$
where $\rho_i = \lambda_i/\mu_i$ and $\lambda_i$ is the solution of the traffic equation, then the network is ergodic with stationary distribution
$$
\pi(n) = \prod_{i=1}^{K} \pi_i(n_i) ,
$$
where

CHAPTER 9. QUEUEING PROCESSES

$$
\pi_i(n_i) = \frac{1}{A_i}\, \frac{\rho_i^{n_i}}{\prod_{k=1}^{n_i} \varphi_i(k)} .
$$

Exercise 9.4.6. Closed Jackson with a single customer
Consider the closed Jackson network of the theory, with $N = 1$ customer. Let $\{Y(t)\}_{t\geq 0}$ be the process giving the position of this customer, that is, $Y(t) = i$ if she is in station $i$ at time $t$. Show that $\{Y(t)\}_{t\geq 0}$ is a regular jump hmc, irreducible, and give its stationary distribution.

Exercise 9.4.7. ACK
Consider the closed Jackson network below where all service times at different queues have the same (exponential) distribution of mean 15 . Compute for $N = 5$ the average time spent by a customer to go from the leftmost point A to the rightmost point B, and the average number of customers passing by A per unit time.

Exercise 9.4.8. Suppressed transitions
Let $\{X(t)\}_{t\geq 0}$ be an irreducible positive recurrent hmc with infinitesimal generator $A$ and stationary distribution $\pi$. Suppose that $(A, \pi)$ is reversible. Let $S$ be a subset of the state space $E$. Define the infinitesimal generator $\tilde{A}$ by
$$
\tilde{q}_{ij} = \begin{cases} \alpha q_{ij} & \text{if } i \in S,\ j \in E - S , \\ q_{ij} & \text{otherwise} \end{cases}
$$
when $i \neq j$. The corresponding hmc is irreducible if $\alpha > 0$. If $\alpha = 0$, the state space will be reduced to $S$, to maintain irreducibility. Show that the continuous time hmc associated to $\tilde{A}$ admits the stationary distribution $\tilde{\pi}$ given by
$$
\tilde{\pi}_i = \begin{cases} C \times \pi(i) & \text{if } i \in S , \\ C \times \alpha \pi(i) & \text{if } i \in E - S , \end{cases}
$$
with the obvious modification when $\alpha = 0$.

Exercise 9.4.9. Loss networks (Kelly's networks)²
Consider a telecommunications network with $K$ relays. An incoming call chooses a "route" $r$ among a set $R$. The network then reserves the set of relays $r_1, \ldots, r_{k(r)}$ corresponding to this route, and processes the call to destination in an exponential time of mean $\mu_r^{-1}$. The incoming calls with route $r$ form a homogeneous Poisson process of intensity $\lambda_r$. It is assumed that all the relays are useful, in that they are part of at least one route in $R$.

² [Kelly, 1979].

9.4. EXERCISES


(1) The capacity of the system is for the time being assumed to be infinite, that is, the number $X_r(t)$ of calls on route $r$ at time $t$ can take any integer value. All the usual independence hypotheses are made: the processing times and the Poisson processes are independent. Give the stationary distribution of the continuous time hmc $\{X(t)\}_{t\geq 0}$, where $X(t) = (X_r(t),\ r \in R)$.

(2) The capacity of the system is now restricted as follows. Consider a given pair $(a, b)$ of relays. It represents a "link" in the network. This link has finite capacity $C_{ab}$. This means that the total number of calls using this link cannot exceed this capacity, with the consequence that an incoming call requiring a route passing through this link will be lost if the link is saturated when it arrives. The process $\{X(t)\}_{t\geq 0}$ therefore has for state space
$$
\tilde{E} = \Big\{n = (n_r;\ r \in R);\ \sum_{r\in R,\ (a,b)\in r} n_r \leq C_{ab} \text{ for all links } (a, b)\Big\} .
$$
What is the stationary distribution of the chain $\{X(t)\}_{t\geq 0}$?

Exercise 9.4.10. M/GI/1/∞/fifo
Show that the imbedded hmc of an M/GI/1/∞/fifo queue is irreducible (as long as the service times are not identically null).

Exercise 9.4.11. $N(0, \tau]$
Prove the statement in Subsection 9.3.3 concerning the sequence $\{Z_n\}_{n\geq 1}$, namely, that it is iid, independent of $X_0$ and distributed as $Z = N(0, \tau]$, where $N$ is an hpp with intensity $\mu$, $\tau$ and $N$ are independent and the cdf of $\tau$ is $F$.

Exercise 9.4.12. GI/M/1/∞
Show that the GI/M/1/∞ queue of Subsection 9.3.3 is transient if $\rho > 1$.

Exercise 9.4.13. The imbedded hmc of an M/GI/1/∞/fifo queue
Show that the imbedded hmc of an M/GI/1/∞/fifo queue is irreducible (as long as the service times are not identically null).

Exercise 9.4.14. Constant service times minimize congestion
Show that for a fixed traffic intensity $\rho$, constant service times minimize average congestion in the M/GI/1/∞ fifo queue.

Exercise 9.4.15. Workload of M/M/1/∞
Show that the stationary distribution of the workload process of an M/M/1/∞ queue with arrival rate $\lambda$ and mean service time $\mu^{-1}$ such that $\rho = \frac{\lambda}{\mu} < 1$ is
$$
F_W(x) = 1 - \rho \exp\{-(\mu - \lambda)x\} .
$$

Chapter 10 Renewal and Regenerative Processes

From the purely analytical point of view, renewal theory is concerned with the renewal equation
\[ f(t) = g(t) + \int_{[0,t]} f(t-s)\, dF(s), \]

where F is the cumulative distribution function of a finite measure on the positive real line. Its main concern is the asymptotic behavior of the solution f (the existence and uniqueness of which is not a real issue under mild conditions, as we shall see). Once embedded in the framework of point processes, renewal theory becomes a fundamental tool of probability theory that is useful in particular in the study of convergence of regenerative processes, a large and important class of stochastic processes that includes the recurrent continuous-time hmcs and the semi-Markov processes.

10.1 Renewal Point Processes

10.1.1 The Renewal Measure

Consider an iid sequence $\{S_n\}_{n\ge 1}$ of non-negative random variables with common cumulative distribution function $F(x) := P(S_n \le x)$. This cdf is called defective when $F(\infty) := P(S_1 < \infty) < 1$, and proper when $F(\infty) = 1$. The uninteresting case where $P(S_1 = 0) = 1$ is henceforth eliminated. The above sequence is called the inter-renewal sequence. The associated renewal sequence $\{T_n\}_{n\ge 0}$ is defined by
\[ T_n := T_{n-1} + S_n \qquad (n \ge 1), \]
where the initial delay $T_0$ is a finite non-negative random variable independent of the inter-renewal sequence. When $T_0 = 0$, the renewal sequence is called undelayed. Time $T_n$ is called a renewal time, or an event. The stochastic process
\[ N([0,t]) := \sum_{n\ge 0} 1_{\{T_n \le t\}} \qquad (t \ge 0) \]

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/978-3-030-40183-2_10


is the counting process of the renewal sequence; $N([0,t])$ counts the number of events in the closed interval $[0,t]$. Note that $T_0 \ge 0$ (this convention differs from the usual one, only for this chapter) and that the point at 0 is counted when there is one. Clearly, the random function $t \mapsto N([0,t])$ is almost surely right-continuous and has a limit on the left for each $t > 0$, namely $N([0,t))$.

[Figure: the renewal times T0, T1, T2, T3 marked on the time axis t.]

Theorem 10.1.1 For all $t \ge 0$, $E[N([0,t])] < \infty$. In particular, almost surely, $N([0,t]) < \infty$ for all $t \ge 0$.

Proof. It suffices to consider the undelayed case (Exercise 10.5.1). By Markov’s inequality,
\[ P(T_n \le t) = P\big(e^{-T_n} \ge e^{-t}\big) \le e^{t}\, E\big[e^{-T_n}\big]. \]
But since the inter-renewal sequence is iid,
\[ E\big[e^{-T_n}\big] = E\big[e^{-\sum_{k=1}^n S_k}\big] = \prod_{k=1}^n E\big[e^{-S_k}\big] = \alpha^n, \]
with $\alpha = E\big[e^{-S_1}\big] < 1$ since $P(S_1 = 0) < 1$. Therefore
\[ E[N([0,t])] - 1 = E\Big[\sum_{n\ge 1} 1_{\{T_n \le t\}}\Big] = \sum_{n\ge 1} E\big[1_{\{T_n \le t\}}\big] = \sum_{n\ge 1} P(T_n \le t) \le e^{t} \sum_{n\ge 1} \alpha^n < \infty. \]

The forward recurrence time process $\{A(t)\}_{t\ge 0}$ and the backward recurrence time process $\{B(t)\}_{t\ge 0}$ are defined as follows. Both processes are right-continuous with left-hand limits. For $n \ge 0$, they have linear trajectories in $(T_n, T_{n+1})$ with respective slopes $-1$ and $+1$, and at a renewal point $T_n$,
\[ A(T_n) = T_{n+1} - T_n, \quad A(T_{n+1}-) = 0, \quad B(T_n) = 0, \quad B(T_{n+1}-) = T_{n+1} - T_n. \]
For $0 \le t < T_0$, $A(t) = T_0 - t$ and $B(t) = t$.

[Figure: The forward recurrence time process]

[Figure: The backward recurrence time process]

Definition 10.1.2 The function R : R+ → R+ defined by R(t) := E[N ([0, t])], where N is the counting process of the undelayed renewal sequence, is called the renewal function.

The renewal function is right-continuous (Exercise 10.5.3) and non-decreasing. Therefore, one can associate with it a unique measure $\mu_R$ on $\mathbb{R}_+$ such that $\mu_R([0,a]) = R(a)$. This measure, called the renewal measure, will sometimes be denoted by $R$, the context avoiding confusion between the measure and its cumulative distribution function. Note that $\mu_R(\{0\}) = R(0) = 1$.

Example 10.1.3: The Poisson Process. Consider the case of exponential inter-event times (that is, $F(t) = 1 - e^{-\lambda t}$). The undelayed renewal process is then a homogeneous Poisson process of intensity $\lambda$ to which a point at time 0 is added, and $R(t) = 1 + \lambda t$.
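The value $R(t) = 1 + \lambda t$ of Example 10.1.3 can be checked numerically from the definition $R(t) = E[N([0,t])] = \sum_{n\ge 0} P(T_n \le t)$. The sketch below is only an illustration: it assumes the standard closed form of the Erlang cdf for a sum of $n$ independent exponential variables (a fact not stated in the text), and the function names and the truncation level `n_terms` are choices made here.

```python
import math

def erlang_cdf(n: int, lam: float, t: float) -> float:
    """P(S_1 + ... + S_n <= t) for iid exponential(lam) inter-event times."""
    if n == 0:
        return 1.0  # T_0 = 0 <= t for every t >= 0
    # Standard Erlang formula: 1 - sum_{k<n} e^{-lam t} (lam t)^k / k!
    term = math.exp(-lam * t)  # k = 0 term
    partial = term
    for k in range(1, n):
        term *= lam * t / k
        partial += term
    return 1.0 - partial

def renewal_function(lam: float, t: float, n_terms: int = 200) -> float:
    """Truncated series sum_{n=0}^{n_terms-1} P(T_n <= t) (undelayed case)."""
    return sum(erlang_cdf(n, lam, t) for n in range(n_terms))

lam, t = 3.0, 2.0
print(renewal_function(lam, t), 1 + lam * t)
```

The truncation is harmless here because $P(T_n \le t)$ decays faster than geometrically in $n$ for fixed $t$.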

It will be convenient to express the renewal function in terms of the common cumulative distribution function of the random variables $S_n$. For this, observe that in the undelayed case $T_n := S_1 + \cdots + S_n$ is the sum of $n$ independent random variables with common cumulative distribution function $F$ and therefore
\[ P(T_n \le t) = F^{*n}(t), \tag{10.1} \]
where $F^{*n}$ is the $n$-fold convolution of $F$, defined recursively by
\[ F^{*0}(t) = 1_{[0,\infty)}(t), \qquad F^{*n}(t) = \int_{[0,t]} F^{*(n-1)}(t-s)\, dF(s) \quad (n \ge 1). \tag{10.2} \]
(The role of 0 in the integration over $[0,t]$ is made precise by the following equality:
\[ \int_{[0,t]} \varphi(s)\, dF(s) = \varphi(0)F(0) + \int_{(0,t]} \varphi(s)\, dF(s).) \]
Writing the renewal function as $E[N([0,t])] = E\big[1 + \sum_{n\ge 1} 1_{\{T_n \le t\}}\big] = 1 + \sum_{n\ge 1} P(T_n \le t)$, we obtain the expression
\[ R(t) = \sum_{n=0}^{\infty} F^{*n}(t). \tag{10.3} \]


Theorem 10.1.4
\[ P(S_1 < \infty) < 1 \iff P(N([0,\infty)) < \infty) = 1 \iff E[N([0,\infty))] < \infty. \]

Proof. It suffices to prove the theorem in the undelayed case. For all $k \ge 1$,
\[ P(N([0,\infty)) = k) = P(S_1 < \infty, \ldots, S_{k-1} < \infty, S_k = \infty) = P(S_1 < \infty) \cdots P(S_{k-1} < \infty)\, P(S_k = \infty) = F(\infty)^{k-1}(1 - F(\infty)), \]
and
\[ P(N([0,\infty)) < \infty) = \sum_{k=1}^{\infty} F(\infty)^{k-1}(1 - F(\infty)). \]
In particular, $P(N([0,\infty)) < \infty) = 1$ if $F(\infty) < 1$ and $P(N([0,\infty)) < \infty) = 0$ if $F(\infty) = 1$. Also, if $F(\infty) < 1$,
\[ E[N([0,\infty))] = \sum_{k=1}^{\infty} k\, F(\infty)^{k-1}(1 - F(\infty)) = \frac{1}{1 - F(\infty)} < \infty, \]
whereas if $F(\infty) = 1$, $E[N([0,\infty))] = \infty$.
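The geometric law appearing in the proof of Theorem 10.1.4 is easy to simulate. In the following sketch (an illustration only; the value of `p_finite`, the seed and the sample size are arbitrary choices made here), each inter-renewal time is finite with probability $F(\infty)$, and the empirical mean of the total number of points is compared with $1/(1 - F(\infty))$.

```python
import random

def total_events(p_finite: float, rng: random.Random) -> int:
    """N([0, inf)) for an undelayed transient renewal sequence.

    Each inter-renewal time is finite with probability p_finite = F(inf).
    The point T_0 = 0 is counted, and the sequence stops at the first
    infinite inter-renewal time.
    """
    count = 1  # the point T_0 = 0
    while rng.random() < p_finite:
        count += 1
    return count

rng = random.Random(0)
p = 0.75
n_runs = 100_000
mean_count = sum(total_events(p, rng) for _ in range(n_runs)) / n_runs
print(mean_count, 1 / (1 - p))
```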



A renewal process (delayed or not) is called recurrent when $P(S_1 < \infty) = 1$ ($F$ is proper), and transient when $P(S_1 < \infty) < 1$ ($F$ is defective). The following result is called the elementary renewal theorem.

Theorem 10.1.5 We have
\[ \lim_{t\to\infty} \frac{N([0,t])}{t} = \frac{1}{E[S_1]} \qquad P\text{-a.s.}, \tag{10.4} \]
and
\[ \lim_{t\to\infty} \frac{E[N([0,t])]}{t} = \frac{1}{E[S_1]}. \tag{10.5} \]

Proof. For the proof of (10.4) see Exercise 10.5.2. Proof of (10.5): The transient case follows from the obvious bound
\[ \frac{E[N([0,t])]}{t} \le \frac{E[N([0,\infty))]}{t}, \]
since in this case $E[S_1] = \infty$ and $E[N([0,\infty))] < \infty$. For the recurrent case, a proof is required (the conditions of the dominated convergence theorem that would guarantee that (10.4) implies (10.5) are not satisfied). However, by Fatou’s lemma,
\[ \liminf_{t\to\infty} E\Big[\frac{N([0,t])}{t}\Big] \ge E\Big[\liminf_{t\to\infty} \frac{N([0,t])}{t}\Big] = \frac{1}{E[S_1]}, \]
and therefore it suffices to show that
\[ \limsup_{t\to\infty} \frac{E[N([0,t])]}{t} \le \frac{1}{E[S_1]}. \]
Define for finite $c > 0$
\[ T'_0 := T_0, \quad T'_1 := T'_0 + S_1 \wedge c, \quad T'_2 := T'_1 + S_2 \wedge c, \ \ldots \]
that is, $T'_n := T'_{n-1} + S'_n$ where $S'_n := S_n \wedge c$ ($n \ge 1$), and let $N'([0,t]) := \sum_{n\ge 0} 1_{\{T'_n \le t\}}$. Since $N'([0,t]) \ge N([0,t])$ for all $t \ge 0$,
\[ \limsup_{t\to\infty} \frac{E[N([0,t])]}{t} \le \limsup_{t\to\infty} \frac{E[N'([0,t])]}{t}. \]
Observe that $S'_1 + \cdots + S'_{N'([0,t])} \le t + c$ and therefore, by Wald’s lemma (Exercise 10.5.7),
\[ E[S'_1]\, E[N'([0,t])] = E\big[S'_1 + \cdots + S'_{N'([0,t])}\big] \le t + c, \]
so that
\[ \limsup_{t\to\infty} \frac{E[N'([0,t])]}{t} \le \limsup_{t\to\infty} \Big(1 + \frac{c}{t}\Big) \frac{1}{E[S'_1]} = \frac{1}{E[S'_1]}. \]
Therefore, for all $c > 0$,
\[ \limsup_{t\to\infty} \frac{E[N([0,t])]}{t} \le \frac{1}{E[S_1 \wedge c]}. \]
Since $\lim_{c\uparrow\infty} E[S_1 \wedge c] = E[S_1]$, we finally obtain the desired inequality
\[ \limsup_{t\to\infty} \frac{E[N([0,t])]}{t} \le \frac{1}{E[S_1]}. \]
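The almost-sure statement (10.4) of the elementary renewal theorem can be observed on a single long trajectory. The sketch below is an illustration under assumptions made here: the inter-renewal times are taken uniform on $(0, 2)$, so that $E[S_1] = 1$, and the horizon and seed are arbitrary.

```python
import random

def count_renewals(t: float, rng: random.Random) -> int:
    """N([0, t]) for an undelayed renewal sequence with Uniform(0, 2) inter-renewal times."""
    count, time = 1, 0.0  # T_0 = 0 is counted
    while True:
        time += rng.uniform(0.0, 2.0)
        if time > t:
            return count
        count += 1

rng = random.Random(1)
t = 5000.0
ratio = count_renewals(t, rng) / t
print(ratio)  # should be close to 1 / E[S_1] = 1
```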

Let F : R+ → R+ be a generalized cumulative distribution function on R+ , that is, F (x) = c G(x) where c > 0 and G is the cumulative distribution function of a nonnegative real random variable that is proper (G(∞) = 1).

10.1.2 The Renewal Equation

The basic object of renewal theory is the renewal equation $f = g + f * F$, that is, by definition of the convolution symbol $*$,
\[ f(t) = g(t) + \int_{[0,t]} f(t-s)\, dF(s) \qquad (t \ge 0), \tag{10.6} \]
where $g : \mathbb{R}_+ \to \mathbb{R}$ is a measurable function called the data. If $F(\infty) = 1$ one refers to the renewal equation as a proper renewal equation, or just a renewal equation. The renewal equation is called defective if $F(\infty) < 1$ and excessive if $F(\infty) > 1$. The first example features the basic method for obtaining renewal equations.

Example 10.1.6: Lifetime of a Transient Renewal Process, Take 1. The lifetime of a renewal sequence is the random variable $L := \sup\{T_n ;\ T_n < \infty\}$. Clearly if the renewal process is recurrent, $L$ is almost surely infinite. We therefore consider the transient case, for which $L$ is almost surely finite. In the undelayed case, the function $f(t) = P(L > t)$ satisfies the renewal equation
\[ f(t) = F(\infty) - F(t) + \int_{[0,t]} f(t-s)\, dF(s). \]

Proof. Define $\hat S_n = S_{n+1}$ ($n \ge 1$) and let $\{\hat T_n\}_{n\ge 0}$ be the associated undelayed renewal process, whose lifetime is denoted by $\hat L$.

[Figure: the renewal times T0 = 0, T1, T2, T3 = L with inter-renewal times S1, S2 = Ŝ1, S3 = Ŝ2, the shifted lifetime L̂, and no more events after T3.]

Clearly $L$ and $\hat L$ have the same distribution. Also
\[ 1_{\{L > t\}} = 1_{\{t < T_1\}} 1_{\{L > t\}} + 1_{\{t \ge T_1\}} 1_{\{L > t\}}. \]
Now, on $\{t \ge T_1\}$, $\{L > t\} \equiv \{\hat L > t - T_1\}$ and therefore
\[ 1_{\{L > t\}} = 1_{\{t < T_1\}} 1_{\{L > t\}} + 1_{\{t \ge T_1\}} 1_{\{\hat L > t - T_1\}}. \]
Taking expectations,
\[ P(L > t) = P(L > t,\ t < T_1) + P(\hat L > t - T_1,\ T_1 \le t). \]
Since $\hat L$ and $T_1$ are independent ($\hat L$ depends only on $S_2, S_3, \ldots$),
\[ P(\hat L > t - T_1,\ T_1 \le t) = \int_{[0,t]} P(\hat L > t - s)\, dF(s) = \int_{[0,t]} P(L > t - s)\, dF(s), \]
where we have used the fact that $L$ and $\hat L$ have the same distribution. Also,
\[ P(L > t,\ t < T_1) = P(t < T_1,\ T_1 < \infty) = P(t < T_1 < \infty) = F(\infty) - F(t). \]

Example 10.1.7: The Risk Model, Take 2. This example is a continuation of Example 7.1.2. We find a renewal equation for the probability of ruin corresponding to an initial capital $u$:
\[ \Psi(u) := P(u + X(t) < 0 \text{ for some } t > 0). \tag{10.7} \]
This function is non-increasing. It is convenient to work with the non-ruin probability $\Phi(u) := 1 - \Psi(u)$. Of course, $\Phi(u) = 0$ if $u < 0$. If the point process $N$ is stationary with average rate $\lambda$, the average profit of the insurance company at time $t$ is $E[X(t)] = (c - \lambda\mu)t$. As expected, insurance companies prefer that $c - \lambda\mu > 0$ or, equivalently, that
\[ \rho := \frac{c - \lambda\mu}{\lambda\mu} = \frac{c}{\lambda\mu} - 1 > 0, \tag{10.8} \]
where $\rho$ is the safety loading. In fact, by the strong law of large numbers, if the safety loading is negative, the probability of ruin is 1 whatever the initial capital. If $N$ is a homogeneous Poisson process,
\[ \Phi(u) = \Phi(0) + \frac{\lambda}{c} \int_0^u \Phi(u - z)(1 - G(z))\, dz. \tag{10.9} \]

Proof. Suppose $u \ge 0$. Since ruin cannot occur at a time $< S_1$,

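\[
\begin{aligned}
\Phi(u) = E[\Phi(u + cT_1 - Z_1)] &= \int_{(0,\infty)} \int_{(0,\infty)} \Phi(u + cs - z)\, \lambda e^{-\lambda s}\, ds\, dG(z) \\
&= \int_0^\infty \Big( \int_{(0,\,u+cs]} \Phi(u + cs - z)\, dG(z) \Big) \lambda e^{-\lambda s}\, ds \\
&= \frac{\lambda}{c}\, e^{\lambda u/c} \int_u^\infty \Big( \int_{(0,x]} \Phi(x - z)\, dG(z) \Big) e^{-\lambda x/c}\, dx .
\end{aligned}
\]
The right-hand side is differentiable; differentiation leads to the integro-differential equation
\[ \Phi'(u) = \frac{\lambda}{c}\, \Phi(u) - \frac{\lambda}{c} \int_{(0,u]} \Phi(u - z)\, dG(z). \tag{10.10} \]
Therefore,
\[
\begin{aligned}
\Phi(t) - \Phi(0) &= \frac{\lambda}{c} \int_0^t \Phi(u)\, du + \frac{\lambda}{c} \int_0^t \int_{(0,u]} \Phi(u - z)\, d(1 - G(z))\, du \\
&= \frac{\lambda}{c} \int_0^t \Phi(u)\, du + \frac{\lambda}{c} \int_0^t \Big[ \Phi(0)(1 - G(u)) - \Phi(u) + \int_0^u \Phi'(u - z)(1 - G(z))\, dz \Big] du \\
&= \frac{\lambda}{c}\, \Phi(0) \int_0^t (1 - G(u))\, du + \frac{\lambda}{c} \int_0^t \Big( \int_z^t \Phi'(u - z)\, du \Big) (1 - G(z))\, dz \\
&= \frac{\lambda}{c}\, \Phi(0) \int_0^t (1 - G(u))\, du + \frac{\lambda}{c} \int_0^t (1 - G(z))\big(\Phi(t - z) - \Phi(0)\big)\, dz ,
\end{aligned}
\]
that is, (10.9).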

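Simple arguments (Exercise 10.5.9) show that in the case of positive safety loading, $\Phi(\infty) = 1$. Letting $u \uparrow \infty$ in (10.9), we have by monotone convergence that $\Phi(\infty) = \Phi(0) + \frac{\lambda\mu}{c}\, \Phi(\infty)$ (recall that $\mu = \int_0^\infty (1 - G(z))\, dz$) and therefore the probability of ruin with zero initial capital is
\[ \Psi(0) = \frac{\lambda\mu}{c}. \]
From (10.9), when the safety loading is positive,
\[
\begin{aligned}
1 - \Psi(u) &= 1 - \frac{\lambda\mu}{c} + \frac{\lambda}{c} \int_0^u (1 - \Psi(u - z))(1 - G(z))\, dz \\
&= 1 - \frac{\lambda}{c} \Big[ \mu - \int_0^u (1 - G(z))\, dz + \int_0^u \Psi(u - z)(1 - G(z))\, dz \Big],
\end{aligned}
\]
that is,
\[ \Psi(u) = \frac{\lambda}{c} \int_u^\infty (1 - G(z))\, dz + \frac{\lambda}{c} \int_0^u \Psi(u - z)(1 - G(z))\, dz. \tag{10.11} \]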
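Equation (10.11) lends itself to a direct numerical solution on a grid. The following sketch is an illustration under assumptions that are not part of the text: the claims are taken exponential with mean $\mu$, so that $1 - G(z) = e^{-z/\mu}$, the convolution integral is discretized with the trapezoidal rule, and the result is compared with the classical closed form $\Psi(u) = \frac{\lambda\mu}{c} \exp\{-(1/\mu - \lambda/c)u\}$ known for exponential claims. All function and variable names are chosen here.

```python
import math

def ruin_probabilities(lam, c, mu, u_max, h):
    """Solve (10.11) on the grid 0, h, 2h, ..., u_max for exponential claims of mean mu.

    Assumes 1 - G(z) = exp(-z / mu); the trapezoidal rule is a discretization
    choice made for this sketch.
    """
    n = int(round(u_max / h))
    tail = [math.exp(-i * h / mu) for i in range(n + 1)]   # 1 - G(ih)
    kernel = [lam / c * x for x in tail]                   # (lam/c)(1 - G(ih))
    # First term of (10.11): (lam/c) int_u^inf (1 - G(z)) dz = (lam mu / c) exp(-u / mu)
    data = [lam * mu / c * x for x in tail]
    psi = [data[0]]                                        # Psi(0) = lam mu / c
    for i in range(1, n + 1):
        # Trapezoid for int_0^{ih} Psi(ih - z) k(z) dz:
        #   h * (k_0 Psi_i / 2 + sum_{j=1}^{i-1} k_j Psi_{i-j} + k_i Psi_0 / 2)
        inner = sum(kernel[j] * psi[i - j] for j in range(1, i))
        rhs = data[i] + h * (inner + 0.5 * kernel[i] * psi[0])
        psi.append(rhs / (1.0 - 0.5 * h * kernel[0]))      # solve for the unknown Psi_i
    return psi

lam, c, mu = 1.0, 1.5, 1.0   # safety loading c / (lam mu) - 1 = 0.5 > 0
h, u_max = 0.01, 5.0
psi = ruin_probabilities(lam, c, mu, u_max, h)
closed_form = [lam * mu / c * math.exp(-(1 / mu - lam / c) * i * h) for i in range(len(psi))]
max_error = max(abs(a - b) for a, b in zip(psi, closed_form))
print(psi[0], psi[-1], max_error)
```

For other claim distributions only the lists `tail` and `data` would change; the closed-form comparison is specific to the exponential case.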

Example 10.1.8: The Lotka–Volterra Population Model. This model features a population of women. A woman of age a gives birth to girls at the rate λ(a) (that is, a woman of age a will have on average λ(a) da daughters in the infinitesimal time interval (a, a + da) of her lifetime). A woman of age a is alive at time a + t with probability p(a, t). At the origin of time there are f0 (a) da women of age between a and a + da. The birth rate f (t) at time t ≥ 0 is the sum of the birth rate r(t) at time t due to women born after time 0 and of the birth rate g(t) due to women born before time 0.

Since women of age $a$ at time 0 are $a + t$ years old at time $t$,
\[ g(t) = \int_0^\infty f_0(a)\, p(a, t)\, \lambda(a + t)\, da. \]
Women born at time $t - s \ge 0$ contribute by $f(t - s)\, p(0, s)\, \lambda(s)$ to the birth rate at time $t$ and therefore
\[ r(t) = \int_0^t f(t - s)\, p(0, s)\, \lambda(s)\, ds. \]
Therefore
\[ f(t) = g(t) + \int_0^t f(t - s)\, p(0, s)\, \lambda(s)\, ds. \]
This is a renewal equation (called Lotka’s equation) with data $g$ and cumulative distribution function
\[ F(t) := \int_0^t p(0, s)\, \lambda(s)\, ds. \]
Note that
\[ F(\infty) = \int_0^\infty p(0, s)\, \lambda(s)\, ds \]
is the average number of daughters from a given mother’s lifetime, that is, the reproduction rate.

Theorem 10.1.9 The renewal function $R$ satisfies the so-called fundamental renewal equation
\[ R = 1 + R * F. \tag{10.12} \]

Proof. By (10.3),
\[ R * F = \Big( \sum_{n\ge 0} F^{*n} \Big) * F = \sum_{n\ge 0} (F^{*n} * F) = \sum_{n\ge 1} F^{*n} = R - F^{*0} = R - 1. \]

The following simple technical result will be needed later on.

Lemma 10.1.10 For all $t \ge b$, $R([t - b, t]) \le (1 - F(b))^{-1}$.

Proof. By (10.12) and for $t \ge b$,
\[
\begin{aligned}
1 = R(t) - \int_{[0,t]} F(t - s)\, dR(s) &= \int_{[0,t]} (1 - F(t - s))\, dR(s) \\
&\ge \int_{[t-b,\,t]} (1 - F(t - s))\, dR(s) \ge (1 - F(b))\, R([t - b, t]).
\end{aligned}
\]

Example 10.1.11: The Elementary Renewal Theorem: Correction Term, Take 1. We have seen that $\lim_{t\uparrow\infty} \frac{R(t)}{t} = \frac{1}{E[S_1]}$. In view of obtaining finer asymptotic results (see Example 10.2.14 below), we study the function
\[ f(t) := R(t) - \frac{t}{E[S_1]}. \]
In the case where $E[S_1] < \infty$, $f$ satisfies the renewal equation with data
\[ g(t) := \frac{1}{E[S_1]} \int_t^\infty (1 - F(x))\, dx. \]

Proof. Let $m := E[S_1]$. By (10.12),
\[
\begin{aligned}
(f * F)(t) &= (R * F)(t) - \frac{1}{m} \int_{[0,t]} (t - s)\, dF(s) \\
&= R(t) - 1 - \frac{1}{m} \int_{[0,t]} (t - s)\, dF(s) \\
&= R(t) - \frac{t}{m} - \Big[ 1 - \frac{t}{m} + \frac{1}{m} \int_{[0,t]} (t - s)\, dF(s) \Big] \\
&= f(t) - \frac{1}{m} \int_t^\infty (1 - F(x))\, dx,
\end{aligned}
\]
where the last equality is obtained by the following computations. Integration by parts (Theorem 2.3.12) gives
\[ t(1 - F(t)) = \int_0^t (1 - F(s))\, ds - \int_{[0,t]} s\, dF(s), \]
and therefore, since $m = \int_0^\infty (1 - F(s))\, ds$,
\[
\begin{aligned}
\frac{1}{m} \int_t^\infty (1 - F(s))\, ds &= \frac{1}{m} \int_0^\infty (1 - F(s))\, ds - \frac{1}{m} \int_0^t (1 - F(s))\, ds \\
&= 1 - \frac{1}{m} \int_0^t (1 - F(s))\, ds = 1 - \frac{1}{m} \Big[ t(1 - F(t)) + \int_{[0,t]} s\, dF(s) \Big] \\
&= 1 - \frac{1}{m} \Big[ t + \int_{[0,t]} (s - t)\, dF(s) \Big] = 1 - \frac{t}{m} + \frac{1}{m} \int_{[0,t]} (t - s)\, dF(s).
\end{aligned}
\]

Solution of the Renewal Equation

An expression of the solution of the renewal equation in terms of the renewal function is easy to obtain. Recall the following definition: A function $g : \mathbb{R}_+ \to \mathbb{R}$ is called locally bounded if for all $a \ge 0$, $\sup_{t\in[0,a]} |g(t)| < \infty$.

Theorem 10.1.12 If $F(\infty) \le 1$ and if the measurable data function $g : \mathbb{R}_+ \to \mathbb{R}$ is locally bounded, the renewal equation (10.6) admits a unique locally bounded solution $f : \mathbb{R}_+ \to \mathbb{R}$ given by $f = g * R$, that is,
\[ f(t) = \int_{[0,t]} g(t - s)\, dR(s). \tag{10.13} \]

Proof. The function $f = g * R$ is indeed locally bounded since $g$ is locally bounded and $R(t)$ is finite for all $t$. Also
\[ f * F = (g * R) * F = g * (R * F) = g * (R - 1) = g * R - g = f - g. \]
Therefore $f$ is indeed a solution of the renewal equation. Let $f_1$ be another locally bounded solution and let $h := f - f_1$. This is a locally bounded function which satisfies $h = h * F$. By iteration, $h = h * F^{*n}$. Therefore, for all $t \ge 0$,
\[ |h(t)| \le \Big( \sup_{s\in[0,t]} |h(s)| \Big) F^{*n}(t). \]
Since $R(t) = \sum_{n\ge 0} F^{*n}(t) < \infty$, we have $\lim_{n\to\infty} F^{*n}(t) = 0$, which implies in view of the last displayed inequality that $|h(t)| \equiv 0$.

The first asymptotic result on the solution of the renewal equation concerns the defective case:

Theorem 10.1.13 If $F$ is defective and if the measurable data function $g : \mathbb{R}_+ \to \mathbb{R}$ is bounded and has a limit $g(\infty) := \lim_{t\to\infty} g(t)$, the unique locally bounded solution of the renewal equation satisfies
\[ \lim_{t\to\infty} f(t) = \frac{g(\infty)}{1 - F(\infty)}. \]

Proof. From previous computations, we have
\[ E[N([0,\infty))] = R(\infty) = \frac{1}{1 - F(\infty)}, \]
and therefore
\[ \frac{g(\infty)}{1 - F(\infty)} = \int_{[0,\infty)} g(\infty)\, dR(s). \]
Also
\[ f(t) = \int_{[0,t]} g(t - s)\, dR(s), \]
and therefore
\[ f(t) - \frac{g(\infty)}{1 - F(\infty)} = \int_{[0,\infty)} \big( g(t - s) 1_{\{s \le t\}} - g(\infty) \big)\, dR(s). \]
The latter integrand is bounded in absolute value by $2 \times \sup |g(t)|$, a finite constant. Considered as a function, a constant is integrable with respect to the renewal measure because, in the defective case, the total mass of the renewal measure is $R(\infty) = E[N([0,\infty))] < \infty$. Now for fixed $s \ge 0$, $\lim_{t\to\infty} (g(t - s) 1_{\{s \le t\}} - g(\infty)) = 0$. Therefore, by dominated convergence, the integral converges to 0 as $t \to \infty$.
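Theorems 10.1.12 and 10.1.13 can be illustrated by successive approximations: starting from $f_0 = g$, the iteration $f_{k+1} = g + f_k * F$ produces $f_k = g * \sum_{n\le k} F^{*n}$, which converges to $g * R$. The sketch below is an illustration under assumptions made here: the defective distribution is $F(t) = p(1 - e^{-t})$ with $p = 1/2$ (so it has density $p\,e^{-s}$ and no atom at 0), the data is $g \equiv 1$, and the convolution is discretized with the trapezoidal rule.

```python
import math

def solve_by_iteration(g, density, t_max, h, n_iter):
    """Successive approximations f_{k+1} = g + f_k * F on the grid 0, h, ..., t_max.

    F is assumed to have the given density (no atom at 0); the convolution
    integral is approximated by the trapezoidal rule.
    """
    n = int(round(t_max / h))
    g_vals = [g(i * h) for i in range(n + 1)]
    d_vals = [density(i * h) for i in range(n + 1)]
    f = g_vals[:]                                   # f_0 = g
    for _ in range(n_iter):
        new_f = [g_vals[0]]                         # at t = 0 the integral vanishes
        for i in range(1, n + 1):
            inner = sum(f[i - j] * d_vals[j] for j in range(1, i))
            conv = h * (0.5 * f[i] * d_vals[0] + inner + 0.5 * f[0] * d_vals[i])
            new_f.append(g_vals[i] + conv)
        f = new_f
    return f

p = 0.5                                             # F(inf) = p < 1: defective case
f = solve_by_iteration(lambda t: 1.0, lambda s: p * math.exp(-s), 15.0, 0.05, 40)
limit = 1.0 / (1.0 - p)                             # g(inf) / (1 - F(inf))
print(f[0], f[-1], limit)
```

With these choices the iteration contracts at a rate close to $p$ per step, so 40 iterations are far more than the discretization error warrants.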

10.1.3 Stationary Renewal Processes

By a proper choice of the initial delay, a renewal process can be made stationary, in a sense to be made precise. Consider a renewal process
\[ T_0 = S_0, \quad T_1 = S_0 + S_1, \quad \ldots, \quad T_n = S_0 + \cdots + S_n, \]
where $0 \le S_0 < \infty$. Let $G$ be the cumulative distribution function of the initial delay $S_0 := T_0$ and suppose that $P(S_1 < \infty) = 1$ (the renewal process is proper). As usual, exclude trivialities by imposing the condition $P(S_1 = 0) < 1$. For $t \ge 0$, define
\[ S_0(t) := T_{N([0,t])} - t \quad \text{and} \quad S_n(t) := T_{N([0,t])+n} - T_{N([0,t])+n-1} \quad (n \ge 1). \tag{10.14} \]
In particular, $S_n(0) = S_n$ for all $n \ge 0$. Also observe that $S_0(t) = A(t)$, the forward recurrence time at $t$.

[Figure: the inter-renewal times S0, ..., S5 and, for a given time t, the shifted sequence S0(t), S1(t), S2(t).]

Definition 10.1.14 The delayed renewal process is called stationary if the distribution of the sequence $S_0(t), S_1(t), S_2(t), \ldots$ is independent of time $t \ge 0$.

It turns out that $S_0(t)$ is independent of $\{S_n(t)\}_{n\ge 1}$ and that the latter sequence has the same distribution as $\{S_n\}_{n\ge 1}$ (Exercise 10.5.11). Therefore:

Lemma 10.1.15 For a delayed renewal process to be stationary it is necessary and sufficient that for all $t \ge 0$, the distribution of $S_0(t) = A(t)$ be the same as that of $S_0$.

Lemma 10.1.16 If the delayed renewal process is stationary, then necessarily $E[S_1] < \infty$ and $E[N([0,t])] = \frac{t}{E[S_1]}$.

Proof. The measure $M$ on $\mathbb{R}_+$ defined by $M(C) := E[N(C)]$ is translation-invariant and therefore a multiple of the Lebesgue measure $\ell$ (Theorem 2.1.45), that is, $M(C) = K\,\ell(C)$ for some constant $K$, which is finite ($M$ is locally finite) and positive (the renewal process is not empty). By the elementary renewal theorem,
\[ K = \lim_{t\uparrow\infty} \frac{E[N((0,t])]}{t} = \frac{1}{E[S_1]}, \]
and therefore $\frac{1}{E[S_1]} > 0$.

Lemma 10.1.17 If the delayed renewal process is stationary, then necessarily $E[S_1] < \infty$ and the distribution of the initial delay $T_0$ is
\[ F_0(x) := \frac{1}{E[S_1]} \int_0^x (1 - F(y))\, dy, \tag{10.15} \]
called the stationary forward recurrence time distribution.


Proof. The finiteness of $E[S_1]$ was proved in the previous lemma. For all $u \in \mathbb{R}$, all $t \ge 0$, the following equation
\[ e^{iuA(t)} = e^{iuA(0)} + \sum_{n\ge 0} \big( e^{iuA(T_n)} - e^{iuA(T_n-)} \big) 1_{\{T_n \le t\}} + \int_0^t \frac{d}{ds}\, e^{iuA(s)}\, ds \]
is obtained by looking at what happens at the event times and between the event times.¹ Observing that $\frac{d}{ds}\, e^{iuA(s)} = -iu\, e^{iuA(s)}$, $A(0) = S_0$, $A(T_n-) = 0$, $A(T_n) = S_{n+1}$, we therefore have
\[ e^{iuA(t)} = e^{iuS_0} + \sum_{n\ge 0} \big( e^{iuS_{n+1}} - 1 \big) 1_{\{T_n \le t\}} - iu \int_0^t e^{iuA(s)}\, ds. \]
Therefore, taking into account the independence of $S_{n+1}$ and $T_n$,
\[ E\big[e^{iuA(t)}\big] = E\big[e^{iuS_0}\big] + \big( E\big[e^{iuS_1}\big] - 1 \big) E[N((0,t])] - iu \int_0^t E\big[e^{iuA(s)}\big]\, ds. \]
By the assumed stationarity, $E[N((0,t])] = \frac{t}{E[S_1]}$ and $E\big[e^{iuA(t)}\big] = E\big[e^{iuS_0}\big]$, and therefore
\[ \frac{E\big[e^{iuS_1}\big] - 1}{E[S_1]} - iu\, E\big[e^{iuS_0}\big] = 0, \]
that is,
\[ E\big[e^{iuS_0}\big] = \frac{E\big[e^{iuS_1}\big] - 1}{iu\, E[S_1]}. \]
But the right-hand side is the characteristic function of $F_0$, as the following computation shows:
\[
\begin{aligned}
\frac{1}{E[S_1]} \int_0^\infty e^{iux} (1 - F(x))\, dx &= \frac{1}{E[S_1]} \int_0^\infty e^{iux} P(S_1 > x)\, dx = \frac{1}{E[S_1]} \int_0^\infty e^{iux} E\big[1_{\{S_1 > x\}}\big]\, dx \\
&= \frac{1}{E[S_1]}\, E\Big[ \int_0^\infty e^{iux} 1_{\{S_1 > x\}}\, dx \Big] = \frac{1}{E[S_1]}\, E\Big[ \int_0^{S_1} e^{iux}\, dx \Big] = \frac{1}{E[S_1]}\, E\Big[ \frac{e^{iuS_1} - 1}{iu} \Big].
\end{aligned}
\]

¹ Or use the following. Let $f : \mathbb{R}_+ \to \mathbb{R}$ be a right-continuous function with left-hand limits, and with a set of discontinuity times that form a sequence $\{t_n\}_{n\ge 1}$ that is strictly increasing on $\mathbb{R}$. This sequence may be finite, even empty. If $n_0 \in \mathbb{N}$ is the cardinality of this sequence, one conventionally lets $t_n = \infty$ for all $n > n_0$. Suppose in addition that on the intervals $[t_n, t_{n+1})$ that lie in $\mathbb{R}_+$,
\[ f(t) = f(t_n) + \int_{t_n}^t f'(s)\, ds, \]
for some locally integrable function $f'$ (the derivative). Let now $G : \mathbb{R} \to \mathbb{R}$ be a differentiable function with derivative $G'$. Then for all $[a, b) \subset \mathbb{R}_+$,
\[ G(f(b)) = G(f(a)) + \sum_{t_n \in (a,b]} \big( G(f(t_n)) - G(f(t_n-)) \big) + \int_a^b f'(s)\, G'(f(s))\, ds. \]


Theorem 10.1.18 For a delayed renewal process to be stationary, it is necessary and sufficient that $E[S_1] < \infty$ and that
\[ P(T_0 \le x) = F_0(x), \]
where $F_0$ is the stationary forward recurrence time distribution (10.15).

Proof. The proof of necessity is contained in Lemmas 10.1.16 and 10.1.17. For sufficiency, we first show that
\[ R_{F_0}(t) := E[N([0,t])] = \frac{t}{E[S_1]} \]
(the notation emphasizes the role of the initial delay with cumulative distribution function $F_0$). We have
\[ R_{F_0}(t) = E\Big[ \sum_{n\ge 0} 1_{\{T_n \le t\}} \Big] = \sum_{n\ge 0} P(T_n \le t) = F_0(t) + (F_0 * F)(t) + (F_0 * F^{*2})(t) + \cdots, \]
that is, $R_{F_0} = F_0 + R_{F_0} * F$. Therefore, by Theorem 10.1.12, $R_{F_0}$ is the unique locally bounded solution of the renewal equation $f = F_0 + f * F$. It then suffices to show that $f(t) = \frac{t}{E[S_1]}$ is indeed a solution. To verify this, observe that for such $f$,
\[ (f * F)(t) = \frac{1}{E[S_1]} \int_{[0,t]} (t - s)\, dF(s), \]
and therefore
\[
\begin{aligned}
f(t) - (f * F)(t) &= \frac{t}{E[S_1]} - \frac{1}{E[S_1]} \int_{[0,t]} t\, dF(s) + \frac{1}{E[S_1]} \int_{[0,t]} s\, dF(s) \\
&= \frac{t}{E[S_1]} (1 - F(t)) + \frac{1}{E[S_1]} \int_{[0,t]} s\, dF(s).
\end{aligned}
\]
It remains to show that the right-hand side of the above equality is $F_0(t)$. Integration by parts does it:
\[ F_0(t) = \frac{1}{E[S_1]} \int_0^t (1 - F(s))\, ds = \frac{1}{E[S_1]} \Big[ t(1 - F(t)) + \int_{[0,t]} s\, dF(s) \Big]. \]
Having proved that $E[N([0,t])] = \frac{t}{E[S_1]}$, we are almost done. From computations in the proof of Lemma 10.1.17, we extract the identity
\[ E\big[e^{iuA(t)}\big] = E\big[e^{iuS_0}\big] + \big( E\big[e^{iuS_1}\big] - 1 \big) \frac{t}{E[S_1]} - iu \int_0^t E\big[e^{iuA(s)}\big]\, ds. \]
The function $z(t) := E\big[e^{iuA(t)}\big]$ is therefore a solution of the ordinary differential equation
\[ \frac{dz}{dt} = -iu\, z + \frac{1}{E[S_1]} \big( E\big[e^{iuS_1}\big] - 1 \big) \]
with initial condition $z(0) = E\big[e^{iuS_0}\big] = \frac{1}{iu\, E[S_1]} \big( E\big[e^{iuS_1}\big] - 1 \big)$, whose unique solution is
\[ E\big[e^{iuA(t)}\big] = E\big[e^{iuS_0}\big] = \frac{1}{iu\, E[S_1]} \big( E\big[e^{iuS_1}\big] - 1 \big). \]
Therefore, for all $t \ge 0$, $S_0(t)$ ($= A(t)$) has the same distribution as $S_0$. The conclusion then follows from Lemma 10.1.15.
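The identity $E[N([0,t])] = t/E[S_1]$ for the stationary delay can be checked by simulation. The sketch below is an illustration under assumptions made here: the inter-renewal times are uniform on $(0, 2)$, so that $E[S_1] = 1$ and (10.15) gives $F_0(x) = x - x^2/4$ on $[0, 2]$, which is inverted in closed form to sample $T_0$. The undelayed case is simulated alongside for contrast.

```python
import math
import random

def count_in_window(t: float, delayed: bool, rng: random.Random) -> int:
    """N([0, t]) with Uniform(0, 2) inter-renewal times (E[S_1] = 1).

    If delayed, T_0 is drawn from F_0(x) = x - x^2 / 4 on [0, 2] by inversion:
    solving x - x^2 / 4 = v gives x = 2 - 2 sqrt(1 - v). Otherwise T_0 = 0.
    """
    time = 2.0 - 2.0 * math.sqrt(1.0 - rng.random()) if delayed else 0.0
    count = 0
    while time <= t:
        count += 1
        time += rng.uniform(0.0, 2.0)
    return count

rng = random.Random(2)
t, n_runs = 0.5, 50_000
stationary_mean = sum(count_in_window(t, True, rng) for _ in range(n_runs)) / n_runs
undelayed_mean = sum(count_in_window(t, False, rng) for _ in range(n_runs)) / n_runs
print(stationary_mean, t / 1.0)   # stationary delay: close to t / E[S_1]
print(undelayed_mean)             # undelayed: noticeably larger, since the point at 0 is counted
```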

10.2 The Renewal Theorem

Renewal theory deals mainly with the limiting behavior of the solution of the renewal equation. The theory is rather simple when the renewal process is transient (Theorem 10.1.13) and becomes more involved in the recurrent case. A very simple example will give the flavor of such a result.

Example 10.2.1: The Renewal Theorem for a Poisson Process. If $\{T_n\}_{n\ge 1}$ is a Poisson process of intensity $\lambda > 0$, we know (Example 10.1.3) that $R(t) = 1 + \lambda t$. Therefore, the solution of the corresponding renewal equation is, when the data $g$ is non-negative,
\[ f(t) = \int_{[0,t]} g(t - s)\, R(ds) = g(t) + \lambda \int_0^t g(t - s)\, ds = g(t) + \lambda \int_0^t g(s)\, ds, \]
where the term $g(t)$ comes from the unit mass of the renewal measure at 0. Here $\lambda = 1/E[S_1]$. If we suppose that $g$ is integrable and that $g(t) \to 0$ as $t \to \infty$, then
\[ \lim_{t\to\infty} f(t) = \frac{\int_0^\infty g(s)\, ds}{E[S_1]}. \tag{$\star$} \]

Definition 10.2.2 Let $F$ be the cdf of a non-negative real random variable $S_1$. Both $F$ and $S_1$ are called non-lattice if there is no strictly positive real $a$ such that $\sum_{k\ge 0} P(S_1 = ka) = 1$.

10.2.1 The Key Renewal Theorem

It turns out that for non-lattice distributions ($\star$) is quite general, modulo a mild technical assumption on the data $g$: this function has to be directly Riemann integrable. This is the key renewal theorem (Theorem 10.2.11 below).

Direct Riemann Integrability

Let $g : \mathbb{R}_+ \to \mathbb{R}$ be a non-negative locally bounded function. Define for each $b > 0$ and each $t \ge 0$,
\[
\begin{aligned}
\bar g_b(t) &= \sup\{ g(s);\ nb \le s < (n+1)b \} \quad \text{on } [nb, (n+1)b), \\
\underline g_b(t) &= \inf\{ g(s);\ nb \le s < (n+1)b \} \quad \text{on } [nb, (n+1)b).
\end{aligned}
\]
The functions $\bar g_b$ and $\underline g_b$ are finite constants on the intervals $[nb, (n+1)b)$, for all $n \in \mathbb{N}$, and thus Lebesgue integrable² on bounded intervals.

Definition 10.2.3 The function $g \ge 0$ is said to be Riemann integrable (Ri) on the bounded interval $[0, a]$ if for some (and then for all) $b > 0$, the integral $\int_0^a \bar g_b(t)\, dt$ is finite, and
\[ \lim_{b\downarrow 0} \Big( \int_0^a \bar g_b(t)\, dt - \int_0^a \underline g_b(t)\, dt \Big) = 0. \tag{10.16} \]

² The theory of the Riemann integral predates that of the Lebesgue integral. We do not follow the historical development in the present treatment of Riemann integrals, and use Lebesgue integration theory, in particular the powerful Lebesgue dominated convergence theorem, which allows considerably simpler arguments.

(The fact that, when $\bar g_b$ is integrable for some $b > 0$, then $\bar g_{b'}$ is also integrable for all $b' > 0$, follows from the inequality
\[ \bar g_{b'}(t) \le \sum_{k=-n}^{+n} \bar g_b(t + kb), \]
where $n = \lceil b'/b \rceil$.)

Theorem 10.2.4 Let $g$ be a Riemann integrable function on the bounded interval $[0, a]$. Then:

(i) $g$ is bounded and almost everywhere continuous on $[0, a]$, and

(ii) the limit
\[ \lim_{b\downarrow 0} \int_0^a \bar g_b(t)\, dt \]
exists and is finite. This limit is by definition the R-integral (Riemann integral) of $g$ on $[0, a]$. It is denoted by
\[ R\text{-}\!\int_0^a g(t)\, dt = \lim_{b\downarrow 0} \int_0^a \bar g_b(t)\, dt. \]
It coincides with the Lebesgue integral of $g$ on $[0, a]$.

Proof. (i) Boundedness of $g$ is clear since $\sup_{x\in[0,a]} g(x) \le b^{-1} \int_0^a \bar g_b(t)\, dt$, which is finite by assumption.

We now show that the set of discontinuity points of $g$ has a null Lebesgue measure. Let $\bar g(x) = \limsup_{y\to x} g(y)$ and $\underline g(x) = \liminf_{y\to x} g(y)$. Both functions are measurable. In fact, more is true: $\bar g$ is upper semi-continuous, that is, for all $A \in \mathbb{R}_+$, the set $\{x : \bar g(x) \ge A\}$ is closed, while $\underline g$ is lower semi-continuous, that is, for all $A \in \mathbb{R}_+$, the set $\{x : \underline g(x) \le A\}$ is closed. We omit the (easy) proof of these facts. The set of discontinuity points of $g$ on $[0, a]$, that is, $\{x : \bar g(x) > \underline g(x)\}$, is therefore measurable. Suppose it is of positive Lebesgue measure. Then, there exists an $\varepsilon > 0$ such that the set $\{x : \bar g(x) - \underline g(x) > \varepsilon\}$ is also of positive Lebesgue measure, say $\delta$. Since for all $b > 0$ and for almost every $t \in [0, a]$,
\[ \bar g_b(t) \ge \bar g(t) \ge \underline g(t) \ge \underline g_b(t), \]
it follows that for all $b > 0$,
\[ \int_0^a \bar g_b(t)\, dt - \int_0^a \underline g_b(t)\, dt \ge \varepsilon\delta > 0, \]
which contradicts the assumption of Riemann integrability (10.16).

(ii) At a continuity point $x$ of $g$, it holds that
\[ \lim_{b\downarrow 0} \bar g_b(x) = \bar g(x) = g(x). \]
By (i), this convergence holds almost everywhere. By dominated convergence (with dominating function $\bar g_1(x) + \bar g_1(x - 1) + \bar g_1(x + 1)$),
\[ \lim_{b\downarrow 0} \int_0^a \bar g_b(t)\, dt = \int_0^a g(t)\, dt. \]

Definition 10.2.5 The function $g \ge 0$ is said to be Riemann integrable on $[0, \infty)$ if
\[ \lim_{a\uparrow\infty} R\text{-}\!\int_0^a g(t)\, dt \]
exists, and the limit is then, by definition, its Riemann integral on $[0, \infty)$.

Remark 10.2.6 A major result of Riemann’s integration theory is that the Riemann integral on the finite interval $[0, a]$ of a function exists if and only if this function is almost everywhere continuous and bounded on this interval, that is, Property (i) in the above theorem is not only necessary, but also sufficient for Riemann integrability. This result is mainly of theoretical interest and is not needed in this book.

We now turn to the definition of the direct Riemann integral. “Direct” means that this integral on $[0, \infty)$ is not defined as a limit of integrals over finite intervals, but “directly” on $[0, \infty)$.

Definition 10.2.7 The function $g \ge 0$ is said to be directly Riemann integrable (dRi) if $\int_0^\infty \bar g_b(t)\, dt < \infty$ for some (and then for all) $b > 0$, and if
\[ \lim_{b\downarrow 0} \Big( \int_0^\infty \bar g_b(t)\, dt - \int_0^\infty \underline g_b(t)\, dt \Big) = 0. \]

From the definitions, it is clear that for functions vanishing outside a bounded interval, the notions of Riemann integrability and of direct Riemann integrability are the same. Also, the following analog of Theorem 10.2.4 holds for direct Riemann integrability:

Theorem 10.2.8 Let $g \ge 0$ be dRi. Then:

(i) $g$ is bounded, and almost everywhere continuous on $\mathbb{R}_+$.

(ii) The limit
\[ \lim_{b\downarrow 0} \int_0^\infty \bar g_b(t)\, dt \]
exists and is finite. This limit is, by definition, the dR-integral (direct Riemann integral) of $g$ on $\mathbb{R}_+$, and is denoted by
\[ dR\text{-}\!\int_0^\infty g(t)\, dt = \lim_{b\downarrow 0} \int_0^\infty \bar g_b(t)\, dt. \]
It coincides with the Lebesgue integral of $g$ on $\mathbb{R}_+$.

The proof is identical to that of Theorem 10.2.4. The following example features a function that is Riemann integrable, but not directly Riemann integrable.

Example 10.2.9: A Counterexample. Let $\{a_n\}_{n\ge 1}$ and $\{b_n\}_{n\ge 1}$ be sequences of positive real numbers such that $1/2 > a_1 > a_2 > \cdots$, $\sum_{n\ge 1} b_n = \infty$ and $\sum_{n\ge 1} a_n b_n < \infty$. Let $g$ be null outside the union of the intervals $[n - a_n, n + a_n]$, $n \ge 1$, and such that for all $n \ge 1$, $g(n - a_n) = g(n + a_n) = 0$ and $g(n) = b_n$, and $g$ is linear in the intervals $[n - a_n, n]$ and $[n, n + a_n]$. Then, $g$ is Riemann integrable:
\[ R\text{-}\!\int_0^\infty g(t)\, dt = \sum_{n\ge 1} a_n b_n < \infty. \]
It is however not directly Riemann integrable since $\int_0^\infty \bar g_b(t)\, dt = \infty$ for all $b > 0$.
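The counterexample can be made concrete with a particular choice of the two sequences. The sketch below is an illustration with values chosen here, not taken from the text: $a_n = 1/(4n^2)$ and $b_n = 1$, so that $\sum b_n = \infty$ while $\sum a_n b_n = \pi^2/24 < \infty$. It computes the integral of the upper step function $\bar g_b$ over $[0, \text{horizon}]$ and shows that, for a fixed mesh $b$, it grows linearly with the horizon, whereas the area under $g$ stays bounded.

```python
import math

def half_width(n: int) -> float:
    return 1.0 / (4.0 * n * n)     # a_n: 1/2 > a_1 > a_2 > ...

def height(n: int) -> float:
    return 1.0                     # b_n: the heights sum to infinity

def g(t: float) -> float:
    """Tent function with peak b_n at t = n and support [n - a_n, n + a_n]."""
    n = round(t)
    if n < 1:
        return 0.0
    return max(0.0, height(n) * (1.0 - abs(t - n) / half_width(n)))

def upper_integral(mesh: float, horizon: float) -> float:
    """Integral over [0, horizon] of the upper step function with the given mesh."""
    total = 0.0
    for k in range(int(round(horizon / mesh))):
        lo, hi = k * mesh, (k + 1) * mesh
        # g is continuous and piecewise linear with peaks at the integers, so its
        # supremum on [lo, hi) is reached at an endpoint or at a peak in [lo, hi].
        sup = max(g(lo), g(hi))
        for n in range(max(1, math.ceil(lo)), math.floor(hi) + 1):
            sup = max(sup, height(n))
        total += mesh * sup
    return total

area = sum(half_width(n) * height(n) for n in range(1, 10_001))   # partial sum of a_n b_n
print(area, math.pi ** 2 / 24)
print(upper_integral(0.25, 100.0), upper_integral(0.25, 200.0))
```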

There exist however a few reassuring results:

Theorem 10.2.10
(a) If $g$ is directly Riemann integrable, it is Riemann integrable on $[0, \infty)$ and
\[ R\text{-}\!\int_0^\infty g(t)\, dt = dR\text{-}\!\int_0^\infty g(t)\, dt. \]
(b) Non-negative non-increasing functions are directly Riemann integrable if and only if they are Riemann integrable on $[0, \infty)$.
(c) A non-negative function that is Riemann integrable on all finite intervals, and such that $\int_0^\infty \bar g_1(t)\, dt < \infty$, is directly Riemann integrable. In particular, a non-negative almost everywhere continuous function such that $\int_0^\infty \bar g_1(t)\, dt < \infty$ is directly Riemann integrable.
(d) A non-negative function that is Riemann integrable and bounded above by a directly Riemann-integrable function is directly Riemann integrable.

Proof. (a) Since $g$ is directly Riemann integrable,
\[ 0 = \lim_{b\downarrow 0} \int_0^\infty (\bar g_b(t) - \underline g_b(t))\, dt \ge \lim_{b\downarrow 0} \int_0^a (\bar g_b(t) - \underline g_b(t))\, dt, \]
implying Riemann integrability on $[0, a]$. For all $a > 0$, recalling that $g$ is (Lebesgue) integrable on $[0, +\infty)$,
\[ \Big| dR\text{-}\!\int_0^\infty g(t)\, dt - R\text{-}\!\int_0^a g(t)\, dt \Big| = \Big| \int_0^\infty g(t)\, dt - \int_0^a g(t)\, dt \Big| = \int_a^\infty g(t)\, dt. \]
The right-hand side tends to zero as $a \to \infty$ by dominated convergence.

(b) The necessity follows from (a). In view of proving sufficiency, suppose that the non-negative non-increasing function $g$ is Riemann integrable on $[0, +\infty)$. It is in particular (Lebesgue) integrable on $[0, a]$ for all finite $a > 0$, and the integral $\int_0^a g(t)\, dt$ admits a finite limit as $a \to \infty$. Therefore, $g$ is (Lebesgue) integrable on $\mathbb{R}_+$, by monotone convergence. Since it is non-increasing, for all $b > 0$,
\[ \int_0^\infty \bar g_b(t)\, dt = \sum_{n\ge 0} b\, g(nb) \le b\, g(0) + \int_0^\infty g(t)\, dt < \infty. \]
Furthermore, since $\underline g_b \ge g((n+1)b)$ on $[nb, (n+1)b)$,
\[ \int_0^\infty \bar g_b(t)\, dt - \int_0^\infty \underline g_b(t)\, dt \le \sum_{n\ge 0} b\, g(nb) - \sum_{n>0} b\, g(nb) = b\, g(0). \]
The latter term vanishes as $b \to 0$, establishing that $g$ is directly Riemann integrable.

(c) Fix $\varepsilon > 0$ and select $a > 0$ such that $\int_a^\infty \bar g_1(t)\, dt \le \varepsilon$. For all $b \in (0, 1]$,
\[ \underline g_b(t) \le \bar g_b(t) \le \bar g_1(t - 1) + \bar g_1(t) + \bar g_1(t + 1). \]
It follows that
\[
\begin{aligned}
\int_0^\infty \bar g_b(t)\, dt - \int_0^\infty \underline g_b(t)\, dt &\le \int_0^{a+1} \big( \bar g_b(t) - \underline g_b(t) \big)\, dt + 3 \int_a^\infty \bar g_1(t)\, dt \\
&\le \int_0^{a+1} \big( \bar g_b(t) - \underline g_b(t) \big)\, dt + 3\varepsilon.
\end{aligned}
\]
As $g$ is assumed Riemann integrable on $[0, a + 1]$, the rightmost integral tends to zero as $b \to 0$. Since $\varepsilon > 0$ is arbitrary, we conclude that the left-hand side goes to zero as $b \to 0$. Hence, $g$ is directly Riemann integrable.

(d) Follows from (c) since $g$ is Riemann integrable on finite intervals, and calling $z$ the bounding function, we have, since $\bar g_1 \le \bar z_1$,
\[ \int_0^\infty \bar g_1(t)\, dt \le \int_0^\infty \bar z_1(t)\, dt, \]
a finite quantity because $z$ is directly Riemann integrable by assumption.



The Key Renewal Theorem

The renewal processes considered from now on are those with a renewal distribution that is non-lattice (Definition 10.2.2).

Theorem 10.2.11 Let $F$ be a non-lattice distribution function such that $F(\infty) = 1$ (with possibly infinite mean) and let $R$ be the associated renewal function. Then:

(α) Blackwell’s theorem:³ for all $\tau \ge 0$,
\[ \lim_{t\uparrow\infty} \{ R(t + \tau) - R(t) \} = \frac{\tau}{E[S_1]}. \tag{10.17} \]

(β) Key renewal theorem: if $g : \mathbb{R}_+ \to \mathbb{R}$ is a non-negative directly Riemann-integrable function,
\[ \lim_{t\uparrow\infty} (R * g)(t) = \frac{1}{E[S_1]} \int_0^\infty g(y)\, dy. \tag{10.18} \]

In fact, (α) and (β) are equivalent.

Remark 10.2.12 Example 10.3.5 below features a spectacular example showing that the direct Riemann integrability condition cannot be dispensed with in general. Theorem 10.2.10 above and Theorem 10.3.6 give practical ways to prove direct Riemann integrability.

³ [Blackwell, 1948].

10.2. THE RENEWAL THEOREM


Proof. (α) We shall admit it for the time being. All existing proofs are somewhat technical. A proof, based on the so-called “coupling method”, is given later (starting with Theorem 10.2.16) when $E[S_1] < \infty$.

(β) Recall that when g is locally bounded, f = R ∗ g is the unique locally bounded solution of the renewal equation

$$f(t) = g(t) + \int_{[0,t]} f(t-s)\,dF(s).$$

STEP 1. Case $g(t) = 1_{[(n-1)b,\,nb)}(t)$. Then $f(t) = R(t-(n-1)b) - R(t-nb)$, and the result is just Blackwell’s theorem.

STEP 2. Case $g(t) = \sum_{n\ge 1} c_n 1_{[(n-1)b,\,nb)}(t)$, where $c_n \ge 0$, $\sum_{n\ge 1} c_n < \infty$, and b is such that F(b) < 1. Then

$$f(t) = \sum_{n\ge 1} c_n\left(R(t-(n-1)b) - R(t-nb)\right).$$

By Lemma 10.1.10,

$$\sup_{t\ge 0}\left(R(t-(n-1)b) - R(t-nb)\right) \le (1-F(b))^{-1} < \infty .$$

In particular, by dominated convergence,

$$\lim_{t\uparrow\infty}\sum_{n\ge 1} c_n\left(R(t-(n-1)b) - R(t-nb)\right) = \sum_{n\ge 1} c_n\frac{b}{E[S_1]} = \frac{1}{E[S_1]}\int_0^\infty g(y)\,dy .$$

STEP 3. If g is directly Riemann integrable, the functions $\overline{g}_b$ and $\underline{g}_b$ previously defined are of the type considered in Step 2 since $\int_0^\infty \underline{g}_b \le \int_0^\infty \overline{g}_b < \infty$. But $\underline{g}_b \le g \le \overline{g}_b$ and therefore

$$\frac{1}{E[S_1]}\int_0^\infty \underline{g}_b(s)\,ds = \lim_{t\uparrow\infty}\int_{[0,t]} \underline{g}_b(t-s)\,R(ds) \le \liminf_{t\uparrow\infty}\int_{[0,t]} g(t-s)\,R(ds)$$
$$\le \limsup_{t\uparrow\infty}\int_{[0,t]} g(t-s)\,R(ds) \le \lim_{t\uparrow\infty}\int_{[0,t]} \overline{g}_b(t-s)\,R(ds) = \frac{1}{E[S_1]}\int_0^\infty \overline{g}_b(s)\,ds .$$

The result follows by letting b tend to 0. We showed that (α) implies (β). The converse implication follows by choosing $g(t) := 1_{[0,\tau]}(t)$. □

Here is a frequently encountered example of a directly Riemann-integrable function:

Example 10.2.13: Tail distribution of an integrable variable. Let F be the cdf of an integrable non-negative random variable $S_1$. Then 1 − F is directly Riemann integrable. This follows from (b) of Theorem 10.2.10 and $\int_0^\infty (1-F(t))\,dt = E[S_1] < \infty$.
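Blackwell’s theorem (10.17) can be observed numerically on a case where the convolution powers of F are explicit. The sketch below is an illustration, not part of the text: it assumes Erlang-2 inter-renewal times (sum of two independent exponentials of the same rate), for which the n-fold convolution of F evaluated at t equals P(N ≥ 2n) with N a Poisson variable of mean rate × t, and it assumes that the renewal function includes the n = 0 term. The function names are hypothetical.

```python
import math

def poisson_tails(mean, kmax):
    """Return the list tail[k] = P(N >= k), k = 0..kmax+1, for N ~ Poisson(mean)."""
    pmf = [math.exp(-mean)]
    for k in range(1, kmax + 1):
        pmf.append(pmf[-1] * mean / k)
    tail = [0.0] * (kmax + 2)
    for k in range(kmax, -1, -1):
        tail[k] = tail[k + 1] + pmf[k]
    return tail

def renewal_function_erlang2(t, rate):
    """R(t) = sum over n >= 0 of F^{*n}(t) for Erlang-2 inter-renewal times.

    F^{*n} is the Gamma(2n, rate) cdf, i.e. P(N >= 2n) with N ~ Poisson(rate * t).
    The series is truncated where the Poisson tail is numerically negligible.
    """
    mean = rate * t
    kmax = int(mean + 12.0 * math.sqrt(mean) + 40.0)
    tail = poisson_tails(mean, kmax)
    return sum(tail[2 * n] for n in range(kmax // 2 + 1))

rate, tau = 1.0, 1.5
mean_interarrival = 2.0 / rate                       # E[S_1] for Erlang-2
increment = renewal_function_erlang2(20.0 + tau, rate) - renewal_function_erlang2(20.0, rate)
blackwell_limit = tau / mean_interarrival
```

With these parameters, `increment` should already be very close to `blackwell_limit` at t = 20, since convergence is exponentially fast for this distribution.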


Example 10.2.14: Elementary Renewal Theorem: Correction Term, Take 2. In the recurrent case, we know (Theorem 10.1.5) that

$$\lim_{t\to\infty}\frac{R(t)}{t} = \frac{1}{E[S_1]} .$$

In order to obtain more information on the asymptotic behavior of the renewal function, we shall study the behavior of

$$f(t) := R(t) - \frac{t}{E[S_1]}$$

as t goes to ∞ in the non-lattice case when $S_1$ has finite first and second moments. Calling $\sigma^2$ the common variance of the inter-renewal times and letting $m := E[S_1]$, we have

$$\lim_{t\uparrow\infty}\left(R(t) - \frac{t}{E[S_1]}\right) = \frac{1}{2}\,\frac{E[S_1]^2 + \sigma^2}{E[S_1]^2} . \tag{10.19}$$

Proof. Recall from Example 10.1.11 that f satisfies the renewal equation f = g + F ∗ f with data

$$g(t) = \frac{1}{E[S_1]}\int_t^\infty (1-F(x))\,dx .$$

The function g is of the form $1 - F_0$ where $F_0$ is the cdf of a non-negative variable that is integrable. It is therefore directly Riemann integrable (Example 10.2.13). The key renewal theorem then gives

$$\lim_{t\to\infty}\left(R(t) - \frac{t}{m}\right) = \frac{1}{m}\int_0^\infty g(s)\,ds .$$

But

$$\frac{1}{m}\int_0^\infty g(s)\,ds = \frac{1}{m^2}\int_0^\infty\left(\int_s^\infty (1-F(x))\,dx\right)ds = \frac{1}{m^2}\int_0^\infty\left(\int_0^x (1-F(x))\,ds\right)dx$$
$$= \frac{1}{m^2}\int_0^\infty x(1-F(x))\,dx = \frac{1}{m^2}\,\frac{1}{2}\int_0^\infty x^2\,dF(x),$$

hence the result. (Proof of the last equality: for a non-negative random variable X with cdf F,

$$\frac{1}{2}E[X^2] = E\left[\int_0^X x\,dx\right] = E\left[\int_0^\infty x\,1_{\{x < X\}}\,dx\right] = \int_0^\infty x\,(1-F(x))\,dx .)$$

[…]

(i) For all a > 0, $\lim_{t\uparrow\infty}(R_G(t+a) - R_G(t)) = \mu^{-1}a$ (Blackwell’s theorem).

(ii) For all x ≥ 0, $\lim_{t\uparrow\infty} P(B(t) \le x) = F_0(x)$.

(iii) For all x ≥ 0, $\lim_{t\uparrow\infty} P(A(t) \le x) = F_0(x)$.

Proof. (ii) ⇔ (iii). Just observe that for t ≥ 0, x ≥ 0,

$$P(A(t) \le x) = P(N([0,t+x]) - N([0,t]) \ge 1) = P(B(t+x) \le x).$$

(i) ⇒ (ii). When $T_0 \equiv 0$, the function $t \to P(B(t) \le x)$ satisfies a renewal equation with data $g(t) = (1-F(t))1_{[0,x]}(t)$. This function is directly Riemann integrable, and therefore by the key renewal theorem (a consequence of Blackwell’s theorem)

$$\lim_{t\uparrow\infty} P(B(t) \le x) = \mu^{-1}\int_0^\infty g(t)\,dt = F_0(x).$$


The case of a non-null and proper initial delay follows by the usual argument (see the proof of Theorem 10.3.4).

(iii) ⇒ (i). Define $G_t(x) = P(A(t) \le x)$ and observe that

$$R_G(t+a) - R_G(t) = (G_t * R)(a) = \int_0^a R(a-s)\,G_t(ds) = \int_0^a G_t(a-s)\,R(ds)$$

and that since by hypothesis $\lim_{t\uparrow\infty} G_t(x) = F_0(x)$, we have by dominated convergence (R gives finite mass to bounded intervals)

$$\lim_{t\uparrow\infty}\int_0^a G_t(a-s)\,R(ds) = \int_0^a F_0(a-s)\,R(ds) = \mu^{-1}a . \qquad\square$$

In order to prove Blackwell’s theorem, it is enough to prove (iii) of Lemma 10.2.16. We do this in the case where μ < ∞.4 Here is the coupling argument.5 Consider two independent renewal sequences with the same interarrival distribution F. The first one is undelayed: $S_0 = 0, S_1, S_2, \ldots$ and the second one is stationary: $\tilde S_0, \tilde S_1, \tilde S_2, \ldots$ In particular, the distribution of $\tilde S_0$ is $F_0$ given by (10.22). Construct a renewal sequence $\{S_n^*\}_{n\ge 1}$ as follows. Take $S_n^* = S_n$ until the first time where two points of the tilded and untilded processes are ε-close, where ε is fixed. (In this case we say that ε-coupling was successful, which is not guaranteed in general. The technical part of the proof of Blackwell’s theorem is to show that ε-coupling is actually realizable with probability 1 when the interval distribution is non-lattice.) Then follow the tilded process. For instance, suppose that $T_5$ and $\tilde T_3$ are at a distance less than ε. Then $S_n^* = S_n$ for n = 1, 2, 3, 4, 5, and $S_{5+k}^* = \tilde S_{3+k}$ for k ≥ 1. Denote by $T = T_\varepsilon$ the first point of the tilded process which is ε-close to a point of the untilded process (in the example $T = \tilde T_3$).

[Figure: the undelayed sequence with $S_1 = S_1^*, \ldots, S_5 = S_5^*$ up to $T_5$, the stationary sequence $\tilde S_0, \ldots, \tilde S_3$ up to $\tilde T_3$, with $T_5$ and $\tilde T_3$ within ε of each other, and the subsequent starred inter-renewal times following the tilded process.]

Lemma 10.2.17 With the assumptions of Theorem 10.2.11, if ε-coupling happens almost surely, that is, if P(T < ∞) = 1, then Blackwell’s theorem is proved.

Proof. For simpler notation, let $T := T_\varepsilon$. Let $\{A(t)\}_{t\ge 0}$ and $\{\tilde A(t)\}_{t\ge 0}$ be the recurrence times corresponding to the undelayed starred renewal process and the (stationary) tilded renewal process. For all t ≥ T, we have $A(t) = \tilde A(t + D)$ where |D| ≤ ε. Let f be a continuous function bounded by 1, and define

$$M_\varepsilon(t) = \sup_{|s|\le\varepsilon}\left|f(\tilde A(t+s)) - f(\tilde A(t))\right| .$$

Note that $\lim_{\varepsilon\downarrow 0} E[M_\varepsilon(0)] = 0$. Now,

$$\left|E[f(A(t))] - E[f(\tilde A(t))]\right| \le E\left[\left|f(A(t)) - f(\tilde A(t))\right|\right]$$
$$= E\left[\left|f(A(t)) - f(\tilde A(t))\right| 1_{\{t < T\}}\right] + E\left[\left|f(A(t)) - f(\tilde A(t))\right| 1_{\{t \ge T\}}\right]$$
$$\le 2P(T > t) + E[M_\varepsilon(t)] = 2P(T > t) + E[M_\varepsilon(0)],$$

where the last equality follows from stationarity of $\tilde A$. Deduce from this that, since $E[f(\tilde A(t))] = E[f(\tilde S_0)]$,

$$\lim_{t\uparrow\infty} E[f(A(t))] = E[f(\tilde S_0)] .$$

In other words, since f is an arbitrary continuous function bounded by 1, A(t) tends in distribution to $\tilde S_0$ as t ↑ ∞. In particular, since the distribution $F_0$ of $\tilde S_0$ is continuous, for all $x \in \mathbb{R}_+$,

$$\lim_{t\uparrow\infty} P(A(t) \le x) = F_0(x).$$

The conclusion follows from (iii) of Lemma 10.2.16. □

4 For the extension to μ = ∞, see for instance [Lindvall, 1992], p. 76–77.
5 [Lindvall, 1977].



In order to prove ε-coupling, we first examine the role of the non-lattice assumption. Recall that a point x is said to be in the support of the distribution function F if F(x + ε) − F(x − ε) > 0 for all ε > 0. The set of all such points is called the support of F and is denoted by supp(F). The key implication of the non-lattice assumption is the following:

Lemma 10.2.18 Let F be a non-lattice cumulative distribution function. Let G denote the set of finite linear combinations of elements of supp(F) with coefficients in $\mathbb{N}$, that is

$$G = \bigcup_{n\in\mathbb{N}}\left\{\sum_{i=1}^n g_i \;;\; g_1,\ldots,g_n \in \mathrm{supp}(F)\right\} . \tag{10.23}$$

Then G is asymptotically dense in $\mathbb{R}_+$, that is

$$\lim_{x\to\infty} d(x, G) = 0,$$

where $d(x,G) = \inf_{g\in G}|x-g|$.

Observe that the set G as defined by (10.23) is the union of the supports of the cumulative distribution functions $F^{*n}$ ($n \in \mathbb{N}$) or, equivalently, the support of the renewal function $R = \sum_{n\in\mathbb{N}} F^{*n}$ associated to F.

Proof. Letting

$$\mu := \inf_{g,h\in G,\; g>h}\{g-h\}, \tag{10.24}$$

we first prove that μ = 0. Suppose in view of contradiction that μ > 0. The infimum in (10.24) is then necessarily attained, for otherwise there would exist sequences $g_n$, $h_n$ in G such that $g_n - h_n > g_{n+1} - h_{n+1}$, and $g_n - h_n \to \mu$ as n → ∞. Then, for n large enough, $g_n - h_n < \mu + \mu/2$. Consequently, letting $g := g_n + h_{n+1}$ and $h := h_n + g_{n+1}$, it holds that $g - h = (g_n - h_n) - (g_{n+1} - h_{n+1}) \in (0, \mu/2)$. This is a contradiction, in view of the definition of μ and the fact that g, h ∈ G. There must therefore exist g, h ∈ G such that g − h = μ. Since F is non-lattice, there exists z ∈ supp(F) such that, for some $k \in \mathbb{N}$, kμ < z < (k + 1)μ. Define then g′ := z + kh and h′ := kg. Both g′ and h′ belong to G. Furthermore, g′ − h′ = z − kμ ∈ (0, μ), again a contradiction. Necessarily then, μ = 0.

Therefore, for any ε > 0, there exist g, h ∈ G such that g − h ∈ (0, ε). Consider the subset G′ of G consisting of the elements kg + ℓh ($k, \ell \in \mathbb{N}$). We argue that $\limsup_{x\to\infty} d(x, G') \le \varepsilon$. Indeed, let $m = \lceil h/\varepsilon \rceil$. Let x > mh. Write x = nh + r, with $n \in \mathbb{N}$, n ≥ m, and r ∈ [0, h). Let $k \in \mathbb{N}$ be such that

$$(n-k)h + kg \le x < (n-k)h + kg + (g-h).$$

Necessarily k ≤ m since r < h. The term (n − k)h + kg thus belongs to G′, as n − k ≥ 0. Furthermore, it is at most ε apart from x, by the pair of inequalities displayed above. It follows that

$$\limsup_{x\to\infty} d(x, G) \le \limsup_{x\to\infty} d(x, G') \le \varepsilon .$$

As ε is arbitrary, this concludes the proof of the lemma. □



One says that ε-coupling holds for renewal processes with inter-renewal cdf F if for ε > 0 and fixed initial delays $t_1$, $t_1'$, one can construct jointly two renewal processes with the corresponding delays such that, with probability 1, there are indices m, n such that the corresponding renewal times $T_m$, $T_n'$ are less than ε apart.

Lemma 10.2.19 With the assumptions of Theorem 10.2.11, ε-coupling happens almost surely.

Proof. Let

$$Z_i := \min\{\tilde T_j - T_i \;;\; \tilde T_j - T_i \ge 0\} \quad (i \ge 0).$$

For fixed ε > 0, let

$$A_i := \{Z_j < \varepsilon \text{ for some } j \ge i\}.$$

Then

$$A_0 \supseteq \cdots \supseteq A_i \supseteq \cdots \supseteq \bigcap_{i=0}^\infty A_i = A_\infty := \{Z_i < \varepsilon \text{ i.o.}\}.$$

Since the sequence $\{T_{i+n} - T_i\}_{n\ge 1}$ has a distribution independent of i and is independent of $\tilde N \equiv \{\tilde T_n\}_{n\ge 0}$, and since $\tilde N$ is stationary, the sequence $\{Z_i\}_{i\ge 0}$ is also stationary. Therefore the events $A_i$ (i ≥ 0) have the same probability, and in particular

$$P(A_0) = P(A_\infty).$$

Conditionally on $\tilde S_0$, the event $A_\infty$ is an exchangeable event of the symmetric sequence $\{(S_n, \tilde S_n)\}_{n\ge 1}$. Therefore, by the Hewitt–Savage 0-1 law (Theorem 4.3.7), for all t > 0,

$$P(A_\infty \mid \tilde S_0 = t) = 0 \text{ or } 1 . \tag{$\star$}$$

Lemma 10.2.18 guarantees that for sufficiently large u and fixed ε,

$$P(u - t < \tilde T_j - \tilde S_0 < u - t + \varepsilon \text{ for some } j) > 0 .$$

Therefore $P(A_0 \mid \tilde S_0 = t) > 0$ for all t ≥ 0 and in particular

$$P(A_0) = \lambda\int_0^\infty P(A_0 \mid \tilde S_0 = t)(1-F(t))\,dt > 0 ,$$

which implies, since $P(A_0) = P(A_\infty)$,

$$P(A_\infty) = \lambda\int_0^\infty P(A_\infty \mid \tilde S_0 = t)(1-F(t))\,dt > 0 .$$

In view of (⋆), this implies that $P(A_\infty \mid \tilde S_0 = t) = 1$ for all t such that F(t) < 1. Therefore $P(A_\infty) = 1 = P(A_0)$, so that $P(Z_i < \varepsilon \text{ for some } i) = 1$. □

10.2.3 Defective and Excessive Renewal Equations

The key renewal theorem concerns proper renewal equations. In a number of situations though, one encounters renewal equations for which F(∞) < 1 or F(∞) > 1. However, when there exists an $\alpha \in \mathbb{R}$ such that

$$\int_{[0,\infty)} e^{\alpha t}\,dF(t) = 1 , \tag{10.25}$$

the asymptotics of the solution of the renewal equation can be obtained from the proper case. In fact, letting

$$\tilde g(t) := e^{\alpha t}g(t), \quad \tilde f(t) := e^{\alpha t}f(t) \quad\text{and}\quad \tilde F(t) := \int_{[0,t]} e^{\alpha s}\,dF(s) ,$$

the distribution $\tilde F$ is proper, and non-lattice if F itself is non-lattice. One immediately checks that $\tilde f$ satisfies the renewal equation $\tilde f = \tilde g + \tilde f * \tilde F$. The conclusion of the key renewal theorem is that when $\tilde g$ is directly Riemann integrable,

$$\lim_{t\to\infty} e^{\alpha t}f(t) = \frac{\int_0^\infty e^{\alpha t}g(t)\,dt}{\int_0^\infty t\,e^{\alpha t}\,dF(t)} . \tag{10.26}$$

Remark 10.2.20 A number α satisfying (10.25) always exists in the excessive case. Indeed, the function $\alpha \to \int_{[0,\infty)} e^{\alpha t}\,dF(t)$ is continuous on (−∞, 0] and strictly increases from 0 to $\int_{[0,\infty)} dF(t) > 1$. Therefore there is a unique α < 0 satisfying (10.25). In the defective case, such α, if it exists, is necessarily positive. But it may not exist. In fact, its existence implies exponential decay of the tail distribution 1 − F since by Markov’s inequality

$$P(S_1 > t) = P(e^{\alpha S_1} > e^{\alpha t}) \le e^{-\alpha t}E[e^{\alpha S_1}].$$
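Condition (10.25) is a one-dimensional root-finding problem, and the monotonicity noted in Remark 10.2.20 makes bisection a natural way to solve it. The sketch below is an illustration under an assumed model: dF(t) = mass × rate × exp(−rate t) dt, so that the left-hand side of (10.25) equals mass × rate / (rate − α) for α < rate. The names `solve_alpha` and `exponential_transform` are hypothetical.

```python
def solve_alpha(transform, lo, hi, tol=1e-12):
    """Bisection for transform(alpha) = 1.

    Assumes transform is increasing on [lo, hi] with
    transform(lo) < 1 < transform(hi).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if transform(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exponential_transform(mass, rate):
    """alpha -> integral of exp(alpha t) dF(t) for dF(t) = mass * rate * exp(-rate t) dt."""
    return lambda alpha: mass * rate / (rate - alpha)

# Defective case (total mass 0.7 < 1): the root is positive.
alpha_defective = solve_alpha(exponential_transform(0.7, 2.0), 0.0, 2.0 * (1.0 - 1e-9))

# Excessive case (total mass 1.5 > 1): the root is negative.
alpha_excessive = solve_alpha(exponential_transform(1.5, 2.0), -50.0, 0.0)
```

In this toy model the root is available in closed form, α = rate × (1 − mass), which gives a direct check of the two sign statements in the remark.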


Remark 10.2.21 Clearly, from (10.26), in the non-lattice defective case and assuming the existence of such α, the solution of the renewal equation decays exponentially fast as t → ∞, whereas in the non-lattice excessive case (for which α always exists) the solution of the renewal equation explodes exponentially fast as t → ∞.

Example 10.2.22: Asymptotics in the Transient Case. Suppose that F is defective (F(∞) < 1) and that there exists α (necessarily > 0) such that

$$\int_0^\infty e^{\alpha t}\,dF(t) = 1 . \tag{10.27}$$

By Theorem 10.1.13, when the data g is bounded and such that there exists $g(\infty) = \lim_{t\to\infty} g(t)$, the unique solution f of the renewal equation f = g + f ∗ F satisfies

$$\lim_{t\uparrow\infty} f(t) = \frac{g(\infty)}{1-F(\infty)} . \tag{10.28}$$

With the help of the defective renewal theorem additional information concerning the asymptotic behavior of f can be obtained. In fact, if the function $g_1$ defined by

$$g_1(t) = g(t) - g(\infty) + g(\infty)\,\frac{F(t)-F(\infty)}{1-F(\infty)}$$

is such that the function $t \to \tilde g_1(t) = e^{\alpha t}g_1(t)$ is directly Riemann integrable, then

$$\lim_{t\to\infty} e^{\alpha t}\left(f(t) - \frac{g(\infty)}{1-F(\infty)}\right) = C , \tag{10.29}$$

where

$$C = \frac{\int_0^\infty e^{\alpha t}[g(t)-g(\infty)]\,dt - \frac{g(\infty)}{\alpha}}{\int_0^\infty t\,e^{\alpha t}\,dF(t)} .$$

Proof. Define

$$f_1(t) := f(t) - \frac{g(\infty)}{1-F(\infty)} .$$

Straightforward computations using the identity R ∗ F = R − 1 show that $f_1 = R * g_1$. Therefore $f_1$ is a solution of the (defective) renewal equation $f_1 = g_1 + f_1 * F$. Since $\tilde g_1(t) = e^{\alpha t}g_1(t)$ is assumed to be directly Riemann integrable,

$$\lim_{t\to\infty} e^{\alpha t}f_1(t) = \frac{\int_0^\infty e^{\alpha t}g_1(t)\,dt}{\int_0^\infty t\,e^{\alpha t}\,dF(t)} .$$

But

$$\int_0^\infty e^{\alpha t}(F(\infty)-F(t))\,dt = \int_0^\infty e^{\alpha t}\left(\int_{(t,\infty)} dF(s)\right)dt = \int_0^\infty\left(\int_0^t e^{\alpha s}\,ds\right)dF(t) = \frac{1}{\alpha}(1-F(\infty)),$$

from which the above expression for C follows. □


Example 10.2.23: The Risk Model, Take 3. This example is a continuation of Example 10.1.7. Recall Eqn. (10.11) in the case where the safety loading is positive, and write this equation in the form

$$\Psi(u) = \frac{\lambda}{c}\int_u^\infty (1-G(z))\,dz + \int_0^u \Psi(u-z)\,dF(z) ,$$

where

$$F(x) := \frac{\lambda}{c}\int_0^x (1-G(z))\,dz .$$

This is a renewal equation, and it is defective since $\frac{\lambda}{c}\int_0^\infty (1-G(z))\,dz = \frac{\lambda\mu}{c} < 1$ under the positive safety loading condition. Assume the existence of α > 0 such that

$$\frac{\lambda}{c}\int_0^\infty e^{\alpha z}(1-G(z))\,dz = 1 .$$

By the defective renewal theorem, if

$$\frac{\lambda}{c}\int_0^\infty e^{\alpha u}\int_u^\infty (1-G(z))\,dz\,du < \infty ,$$

we have that

$$\lim_{u\uparrow\infty} e^{\alpha u}\Psi(u) = C := \frac{\int_0^\infty e^{\alpha u}\int_u^\infty (1-G(z))\,dz\,du}{\int_0^\infty z\,e^{\alpha z}(1-G(z))\,dz}$$

or, equivalently,

$$\Psi(u) = Ce^{-\alpha u} + o(e^{-\alpha u}) .$$
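The constant C can be evaluated numerically once a claim distribution is fixed. The sketch below is an illustration under an assumption not made in the text: claims are exponential with mean `mean_claim`, so that 1 − G(z) = exp(−z / mean_claim). In that case the equation for α solves to α = 1/mean_claim − claim_rate/premium_rate, and the ratio defining C should equal claim_rate × mean_claim / premium_rate. The function names, the truncation point `upper` and the trapezoidal rule are choices made here for illustration.

```python
import math

def trapezoid(func, lower, upper, steps):
    """Composite trapezoidal rule for func on [lower, upper]."""
    h = (upper - lower) / steps
    total = 0.5 * (func(lower) + func(upper))
    for i in range(1, steps):
        total += func(lower + i * h)
    return total * h

def lundberg_constants(claim_rate, premium_rate, mean_claim, upper=100.0, steps=100000):
    """Return (alpha, C) for exponential claims (assumed), by numerical integration."""
    alpha = 1.0 / mean_claim - claim_rate / premium_rate

    def integrated_tail(u):
        # Integral over [u, infinity) of (1 - G(z)) dz for exponential claims.
        return mean_claim * math.exp(-u / mean_claim)

    numerator = trapezoid(lambda u: math.exp(alpha * u) * integrated_tail(u), 0.0, upper, steps)
    denominator = trapezoid(
        lambda z: z * math.exp(alpha * z) * math.exp(-z / mean_claim), 0.0, upper, steps
    )
    return alpha, numerator / denominator

alpha, C = lundberg_constants(claim_rate=1.0, premium_rate=1.5, mean_claim=1.0)
```

With these parameters the safety loading is positive (1.0 × 1.0 / 1.5 < 1), and the computed pair can be compared with the closed-form values α = 1/3 and C = 2/3.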

10.3 Regenerative Processes

10.3.1 Examples

Let $(E, \mathcal{E})$ be a measurable space.

Definition 10.3.1 Let $\{X(t)\}_{t\ge 0}$ be a measurable E-valued stochastic process and let $\{T_n\}_{n\ge 0}$ be a proper recurrent renewal process, possibly delayed (recall, however, that the initial delay $T_0$ is always assumed finite). The process $\{X(t)\}_{t\ge 0}$ is said to be regenerative with respect to $\{T_n\}_{n\ge 0}$ if for all n ≥ 0,

(a) the distribution of the post-$T_n$ process $S_{n+1}, S_{n+2}, \ldots, \{X(t+T_n)\}_{t\ge 0}$ is independent of n ≥ 0, and

(b) the post-$T_n$ process is independent of $T_0, \ldots, T_n$.

The times $T_n$ are called regeneration times of the regenerative process.

Example 10.3.2: Continuous-time Markov Chains. Let $\{X(t)\}_{t\ge 0}$ be a recurrent continuous-time homogeneous Markov chain taking its values in the state space $E = \mathbb{N}$. Suppose that it starts from state 0 at time t = 0. By the strong Markov property, $\{X(t)\}_{t\ge 0}$ is regenerative with respect to the sequence $\{T_n\}_{n\ge 0}$ where $T_n$ is the n-th time of visit to state 0 of the chain.

Regenerative processes are the main sources of renewal equations.


Theorem 10.3.3 Let $\{X(t)\}_{t\ge 0}$ and $\{T_n\}_{n\ge 0}$ be as in Definition 10.3.1 except for the additional assumption $T_0 \equiv 0$ (undelayed renewal process) and let $h : E \to \mathbb{R}$ be a non-negative measurable function. The function $f : \mathbb{R}_+ \to \mathbb{R}$ defined by f(t) := E[h(X(t))] satisfies the renewal equation with data g(t) = E[h(X(t)) 1{t […]

[…] s) ds = $E[U_1]$ = a. In a reliability context, $\frac{a}{a+b}$ represents the availability of a given machine with mean lifetime a and mean repair time b.

Example 10.3.9: Forward and Backward Recurrence Times. Let $\{T_n\}_{n\ge 0}$ be an undelayed ($T_0 = 0$) renewal process. Clearly the forward and backward recurrence times are regenerative with respect to the renewal process $\{T_n\}_{n\ge 0}$. From Smith’s regenerative formula (10.34), in the non-lattice case

$$\lim_{t\to\infty} P(A(t) > x) = \frac{1}{m}\int_0^\infty P(A(s) > x,\; s < S_1)\,ds .$$

Since $P(A(s) > x,\; s < S_1) = P(S_1 > s + x) = 1 - F(s+x)$, we have

$$\int_0^\infty P(A(s) > x,\; s < S_1)\,ds = \int_0^\infty (1-F(s+x))\,ds = \int_x^\infty (1-F(s))\,ds ,$$

and therefore

$$\lim_{t\to\infty} P(A(t) > x) = \frac{1}{m}\int_x^\infty (1-F(s))\,ds . \tag{10.36}$$


Similar arguments yield for the backward recurrence time

$$\lim_{t\to\infty} P(B(t) > y) = \frac{1}{m}\int_y^\infty (1-F(s))\,ds . \tag{10.37}$$

(This time, the data function is $(1-F(t))1_{\{t>y\}}$.) One can prove directly the direct Riemann integrability of the data functions of this example, or use Theorem 10.3.6.
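The limit (10.36) lends itself to a quick Monte Carlo check. The sketch below is an illustration under assumptions chosen here: inter-renewal times uniform on (0, 2), so that m = 1 and the right-hand side of (10.36) equals (2 − x)²/4 for x in [0, 2]; the observation time t = 50 is taken as a stand-in for the limit. The function names and the seed are hypothetical.

```python
import random

def forward_recurrence_time(t, draw_interarrival):
    """Time from t to the first renewal strictly after t (undelayed process)."""
    clock = 0.0
    while clock <= t:
        clock += draw_interarrival()
    return clock - t

def estimate_forward_tail(t, x, samples, seed=0):
    """Monte Carlo estimate of P(A(t) > x) for uniform(0, 2) inter-renewal times."""
    rng = random.Random(seed)
    draw = lambda: rng.uniform(0.0, 2.0)
    hits = sum(forward_recurrence_time(t, draw) > x for _ in range(samples))
    return hits / samples

def limit_forward_tail(x):
    """Right-hand side of (10.36) for uniform(0, 2) inter-renewal times (m = 1)."""
    return (2.0 - x) ** 2 / 4.0 if x < 2.0 else 0.0

estimate = estimate_forward_tail(t=50.0, x=1.0, samples=20000)
target = limit_forward_tail(1.0)
```

The estimate fluctuates with the seed and the sample size, so only agreement up to Monte Carlo error should be expected.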

Example 10.3.10: The Bus Paradox. The sum A(t) + B(t) is the inter-event interval around time t. Interpreting t as the time at which you arrive at a bus stop, and the sequence $\{T_n\}_{n\ge 1}$ as the sequence of times at which buses arrive at (and immediately depart from) the bus stop, A(t) is your waiting time. If t is large enough, one can, in view of (10.36), assume that A(t) is distributed as a random variable A with the distribution

$$P(A > x) = \frac{1}{m}\int_x^\infty (1-F(s))\,ds . \tag{10.38}$$

Similarly, the time B(t) by which you missed the previous bus is approximately, when t is large, distributed as a random variable B with the same distribution as A. The bus paradox can be stated in several ways. One of them is: the mean time interval between the bus you missed and the bus you will catch is asymptotically as t → ∞ equal to E[A + B] = 2E[A], and is in general different from the mean of the interval between two successive buses n and n + 1, $E[S_1]$.

Let $\{X(t)\}_{t\in\mathbb{R}}$ be a stochastic process taking its values in a metric space E, having right-continuous paths and being regenerative relative to the (possibly delayed) renewal sequence $\{T_n\}_{n\ge 0}$ whose inter-event distribution is non-lattice with finite mean μ. Let $P_0$ and $E_0$ symbolize respectively the probability and the expectation corresponding to the undelayed version of the renewal sequence. One checks easily that

$$P^*(A) := \frac{1}{\mu}E_0\left[\int_0^{S_1} 1_A(X(s))\,ds\right]$$

defines a probability measure on $(E, \mathcal{B}(E))$.

Theorem 10.3.11 Under the above conditions, X(t) converges in distribution to $P^*$ as t ↑ ∞.

Proof. We must show that for all bounded (say, by 1) continuous functions $h : E \to \mathbb{R}$,

$$\lim_{t\uparrow\infty} E[h(X(t))] = \frac{1}{\mu}E_0\left[\int_0^{S_1} h(X(s))\,ds\right] . \tag{10.39}$$

By the usual renewal argument (conditioning on $T_0$, whose cumulative distribution function is denoted by $F_{T_0}$), E[h(X(t))] = E[h(X(t)) 1{t […] f(t − s) $dF_{T_0}(s)$, (⋆)

[…] 0, we have in the non-lattice case (Blackwell’s theorem)

$$\lim_{t\uparrow\infty}\frac{R_{ij}(t+a) - R_{ij}(t)}{a} = \frac{\nu_j}{\mu} . \tag{10.41}$$

The following result is similar to the one in the univariate case:

Theorem 10.4.1 If the data functions are non-negative and locally bounded, there exists a unique vector $f := \{f_i\}_{i\in E}$ of locally bounded measurable functions from $\mathbb{R}_+ \to \mathbb{R}$ satisfying the multivariate renewal equation (10.40), namely f = R ∗ g.

Proof. The fact that R ∗ g is well defined, locally bounded, and satisfies the renewal equation is proved in the same way as in the univariate case. Let now f and $\tilde f$ be two vectors of locally bounded functions satisfying the renewal equation, and let $h := f - \tilde f$. Then h = F ∗ h, and iteratively, $h = F^{*n} * h$ for all n ≥ 1, so that $|h| \le F^{*n} * |h|$. Let $\sup_i |h_i(t)| \le M(a) < \infty$ on [0, a], say M(a) = 1, without loss of generality, so that $|h| \le F^{*n} * 1$ on [0, a] and therefore, on [0, a],

$$|h_i(t)| \le \sum_{j\in E} F_{ij}^{*n}(t) = P_i(S_1 + \cdots + S_n \le t),$$

a quantity that tends to 0 as n ↑ ∞ since $T_\infty = \infty$ under the prevailing conditions (the embedded chain is irreducible recurrent). Therefore h ≡ 0 on all [0, a] ($a \in \mathbb{R}_+$) and therefore on $\mathbb{R}_+$. □

It follows from (10.41) that if $g_j$ is directly Riemann integrable

$$R_{ij} * g_j(t) \to \frac{\nu_j}{\mu}\int_0^\infty g_j(s)\,ds ,$$

and from this, in the case where E is finite,

$$f_i(t) = \sum_{j\in E} R_{ij} * g_j(t) \to \frac{1}{\mu}\sum_{j\in E}\nu_j\int_0^\infty g_j(s)\,ds .$$

The case when E is infinite requires further conditions and will not be treated here.8

Improper Multivariate Renewal Equations

The above results concern the case where $Q := \{\|F_{ij}\|\}_{i,j\in E}$ is a stochastic matrix, that is, the transition matrix of a homogeneous Markov chain (namely, the transition matrix P of the embedded Markov chain). However the renewal equations (10.40) make sense even if this is not the case. We now give results9 of the same kind as the ones in the defective or excessive univariate renewal equations when the state space is finite. The matrix Q is no longer a stochastic matrix, but still assumed irreducible. Define for some real β the matrix $A := \{a_{ij}\}_{i,j\in E}$ by

$$a_{ij} := \int_0^\infty e^{\beta t}\,dF_{ij}(t) .$$

Assume that β can be chosen such that A has spectral radius 1. In particular, there exist two positive vectors ν and h such that

$$\nu^T A = \nu^T \quad\text{and}\quad Ah = h .$$

The existence of ν and h is ensured by the Perron–Frobenius theorem. The following facts are easy. First the matrix

$$\tilde Q := \left\{\frac{1}{h_i}h_j a_{ij}\right\}_{i,j\in E}$$

is an (irreducible) stochastic matrix admitting the invariant measure $\tilde\nu$, given by $\tilde\nu_i = \nu_i h_i$. Let

$$\tilde F_{ij}(t) := \frac{h_j}{h_i}\int_0^t e^{\beta s}\,dF_{ij}(s) .$$

This defines a semi-Markov kernel for which $\tilde Q = \{\|\tilde F_{ij}\|\}_{i,j\in E}$ is irreducible and recurrent. Defining

$$\tilde f_i(t) := e^{\beta t}f_i(t)/h_i \quad\text{and}\quad \tilde g_i(t) := e^{\beta t}g_i(t)/h_i ,$$

we see, analogously to the univariate case, that $\tilde f = \tilde g + \tilde F * \tilde f$. Therefore, if $\tilde F$ is non-lattice and the $\tilde g_i$’s are locally bounded and integrable,

$$\lim_{t\uparrow\infty}\tilde f_i(t) = \frac{1}{\tilde\mu}\sum_{j\in E}\tilde\nu_j\int_0^\infty \tilde g_j(s)\,ds ,$$

that is

$$\lim_{t\uparrow\infty} e^{\beta t}f_i(t) = h_i\,\frac{\sum_{j\in E}\nu_j\int_0^\infty e^{\beta s}g_j(s)\,ds}{\sum_{k,j\in E}\nu_k h_j\int_0^\infty s\,e^{\beta s}\,dF_{kj}(s)} .$$

The existence of β is guaranteed, for instance,10 when the spectral radius of $\{\|F_{ij}\|\}_{i,j\in E}$ is strictly less than 1.

8 See, for instance, [Çinlar, 1975].
9 [Asmussen and Hering, 1977].
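The choice of β can be illustrated on a two-state toy kernel. The sketch below rests on assumptions made here for illustration only: every $F_{ij}$ is a mass $q_{ij}$ times an exponential cdf with a common rate, so that $a_{ij} = q_{ij} \times$ rate / (rate − β), and the spectral radius of a non-negative 2 × 2 matrix is computed from its trace and determinant. The function names are hypothetical.

```python
import math

def spectral_radius_2x2(matrix):
    """Largest eigenvalue of a 2x2 matrix with non-negative entries."""
    (a, b), (c, d) = matrix
    discriminant = (a - d) ** 2 + 4.0 * b * c   # non-negative when b, c >= 0
    return 0.5 * (a + d + math.sqrt(discriminant))

def transformed_matrix(beta, masses, rate):
    """a_ij = q_ij * rate / (rate - beta), valid for beta < rate (assumed model)."""
    factor = rate / (rate - beta)
    return [[q * factor for q in row] for row in masses]

def solve_beta(masses, rate, tol=1e-12):
    """Bisection for the beta in [0, rate) giving spectral radius 1.

    Assumes the spectral radius of `masses` is strictly less than 1.
    """
    lo, hi = 0.0, rate * (1.0 - 1e-9)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if spectral_radius_2x2(transformed_matrix(mid, masses, rate)) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

masses = [[0.2, 0.3], [0.4, 0.1]]    # spectral radius 0.5 (defective case)
beta = solve_beta(masses, rate=2.0)
```

In this model the spectral radius of A scales as rate / (rate − β), so the solution is β = rate × (1 − spectral radius of the mass matrix), which the bisection should reproduce.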

Complementary reading [Asmussen, 2003] for more theory and for applications to random walks and queues.

10.5 Exercises

Exercise 10.5.1. Theorem 10.1.1 true in the undelayed case
Prove that, since Theorem 10.1.1 is true in the undelayed case, it is also true in the delayed case (recall: with finite delay).

Exercise 10.5.2. The asymptotic counting rate
Prove (10.4), that is,

$$\lim_{t\to\infty}\frac{N([0,t])}{t} = \frac{1}{E[S_1]}, \quad P\text{-a.s.}$$

Exercise 10.5.3. Right-continuity of the renewal function
Show that the renewal function R is right-continuous.

Exercise 10.5.4. About the distribution of N([0, t])
In the undelayed case, compute P(N([0, t]) = n) for n ≥ 1 in terms of the convolution iterates of the inter-renewal distribution and show that $E[e^{\beta N([0,t])}] < \infty$ for all β ∈ [0, α), where $\alpha := E[e^{-S_1}]$.

Exercise 10.5.5. First event after a random time
In the undelayed case, let X be a strictly positive random variable, independent of $\{S_n\}_{n\ge 1}$, with cumulative distribution G. Let

$$\tilde T = \inf\{T_n \;;\; T_n > X\}.$$

Give an expression of the cumulative distribution of $\tilde T$ in terms of G, the renewal function R and the avoidance function v (defined in Section 8.1.3).

Exercise 10.5.6. Forward recurrence time
What is the limit distribution of the forward recurrence time of a renewal process (possibly delayed) when $S_1$ is deterministic, equal to a?

Exercise 10.5.7. Wald’s lemma for renewal processes
Let $\{S_n\}_{n\ge 1}$ be a renewal sequence and let $\{N([0,t])\}_{t\ge 0}$ be the counting process of the corresponding (possibly delayed) renewal process. Assume that $E[S_1] < \infty$. Show that for all t ≥ 0,

$$E[S_1 + \cdots + S_{N([0,t])}] = E[S_1]\,E[N([0,t])].$$

10 See [Asmussen, 1987], Problem 2.3, chap. X.

Exercise 10.5.8. Expected lifetime
In Example 10.1.6, compute the expectation of the lifetime L in the transient case.

Exercise 10.5.9. Safety load
This exercise refers to Example 7.1.2. Prove that in case of positive safety loading, Φ(∞) = 1.

Exercise 10.5.10. The bus paradox
See Example 10.3.10 for the context. Consider an undelayed renewal process with finite mean inter-renewal time $E[S_1]$. For t ≥ 0, consider the interval between the last renewal time before t and the first renewal time after t. Show that in the Poisson case (the interarrival distribution is exponential) the mean length of this interval is asymptotically $2E[S_1]$. Show that it is equal to $E[S_1]$ if and only if $S_1$ is a constant.

Exercise 10.5.11. First event after a fixed time
Refer to Definition 10.1.14 for the notation. Prove that $S_0(t)$ is independent of $\{S_n(t)\}_{n\ge 1}$ and that the latter sequence has the same distribution as $\{S_n\}_{n\ge 1}$.

Exercise 10.5.12. A limit theorem for continuous-time hmcs
Let $\{X(t)\}_{t\ge 0}$ be a positive recurrent continuous-time homogeneous Markov chain taking its values in the state space $E = \mathbb{N}$. Let $P_0$ and $E_0$ denote respectively probability and expectation given X(0) = 0. Let $T_0$ be the return time to 0 ($T_0 := \inf\{t > 0 \;;\; X(t) = 0,\; X(t-) \ne 0\}$). Recall that in the positive recurrent case, $E_0[T_0] < \infty$. Show that

$$\lim_{t\uparrow\infty} P(X(t) = i) = \frac{E_0\left[\int_0^{T_0} 1_{\{X(s)=i\}}\,ds\right]}{E_0[T_0]} .$$

Exercise 10.5.13. Lotka–Volterra asymptotics
In the Lotka–Volterra model, give the details concerning the asymptotics of the birth rate f in the cases F(∞) = 1 and F(∞) > 1. What can you say about the defective case F(∞) < 1?

Exercise 10.5.14. Backward and forward recurrence processes
Refer to Example 10.3.9. Compute $\lim_{t\to\infty} P(A(t) > x, B(t) > y)$ for x, y ≥ 0.

Exercise 10.5.15. Asymptotic variance of N((0, t])
For a proper renewal process with an interarrival distribution of finite variance, show that

$$\lim_{t\uparrow\infty}\frac{\mathrm{Var}\,N((0,t])}{t} = \frac{\mathrm{Var}\,S_1}{E[S_1]^3} .$$

Exercise 10.5.16. The age replacement policy
We interpret the random variables $S_1, S_2, \ldots$ as the lifetimes of machines successively put into service, a new machine immediately replacing a failed one. It will be assumed that $E[S_1] < \infty$, and therefore, by (10.4), $\frac{1}{E[S_1]}$ is the asymptotic failure rate per unit time. In some situations, the inconvenience caused by a failure is too important, and the failure rate must be controlled. The age replacement policy suggests that an engine should be replaced at failure time or at a fixed time T > 0, whichever occurs first. What is the asymptotic failure rate? (A replacement is not considered as a failure.)

Exercise 10.5.17. Another maintenance policy
A given machine can be in either one of three states: G (good), M (in maintenance), or R (in repair). Its successive periods where it is in state G (resp., M, R) form an independent and identically distributed sequence $\{S_n\}_{n\ge 0}$ (resp., $\{U_n\}_{n\ge 0}$, $\{V_n\}_{n\ge 0}$) with finite mean. All these sequences are assumed mutually independent. The maintenance policy uses a number T > 0. If the machine has age T and has not failed, it goes to state M. If it fails before it has reached age T, it enters state R. From states M and R, the next state is G. Find the steady state probability that the machine is operational. (Note that “good” does not mean “operational”. The machine can be “good” but, due to the operations policy, in maintenance, and therefore not operational. However, after a period of maintenance or of repair, we consider that the machine starts anew, and enters a G period.)

Exercise 10.5.18. A two state semi-Markov process
Let

$$\mathbf{P} = \begin{pmatrix} \alpha & 1-\alpha \\ 1-\beta & \beta \end{pmatrix},$$

where α, β ∈ (0, 1), and let $G_1$ and $G_2$ be two proper cumulative distribution functions. Let $\{X(t)\}_{t\ge 0}$ be the stochastic process evolving as follows. When in state i (i = 1, 2) it stays there for a random time with distribution $G_i$ (i = 1, 2) after which it moves to state j (possibly the same state) with the probability $p_{ij}$ (the (i, j)-entry of P). The successive sojourn times (in either state) are independent given the knowledge of the state the process is in (the reader will clarify this imprecise sentence). What is the asymptotic distribution of the process, that is, what is $\lim_{t\uparrow\infty} P(X(t) = 1)$?

Chapter 11

Brownian Motion

Brownian motion was originally introduced as an idealized representation of the chaotic motion of an isolated particle in water due to the steady bombardment by neighboring molecules. It plays a fundamental role in the theory of stochastic processes and various domains of application such as mathematical finance and communications theory. This chapter is an introduction to some of its more notable properties and to the Wiener–Doob stochastic integral, which is a fundamental tool in the theory of wide-sense stationary processes (Chapter 12).

11.1 Brownian Motion or Wiener Process

11.1.1 As a Rescaled Random Walk

Recall that two complex random variables X and Y in $L^2_{\mathbb{C}}(P)$ are called orthogonal if $E[XY^*] = 0$.

Definition 11.1.1 A stochastic process $\{X(t)\}_{t\in\mathbb{R}}$ is said to have independent (resp., orthogonal) increments if for all n ≥ 2 and for all mutually disjoint intervals $(a_1, b_1], \ldots, (a_n, b_n]$ of $\mathbb{R}$, the random variables $X(b_1) - X(a_1), \ldots, X(b_n) - X(a_n)$ are independent (resp., mutually orthogonal).

Clearly, a centered second-order stochastic process with independent increments has a fortiori orthogonal increments. Brownian motion is the fundamental example of a Gaussian process.

Definition 11.1.2 By definition, a standard Brownian motion, or standard Wiener process, is a continuous centered Gaussian process $\{W(t)\}_{t\in\mathbb{R}_+}$ with independent increments and such that W(0) = 0 and Var(W(b) − W(a)) = b − a ($[a, b] \subset \mathbb{R}_+$).

(The existence of a process with the required distribution is guaranteed by Theorem 5.1.23. The existence of a continuous version is proved in the forthcoming Theorem 11.2.7.) In particular, the pdfs of the vectors $(W(t_1), \ldots, W(t_k))$ ($0 < t_1 < \ldots < t_k$) are

$$\frac{1}{(\sqrt{2\pi})^k\sqrt{t_1(t_2-t_1)\cdots(t_k-t_{k-1})}}\;\exp\left\{-\frac{1}{2}\left(\frac{x_1^2}{t_1} + \frac{(x_2-x_1)^2}{t_2-t_1} + \cdots + \frac{(x_k-x_{k-1})^2}{t_k-t_{k-1}}\right)\right\} .$$

Note for future reference that (Exercise 11.5.2) for $s, t \in \mathbb{R}_+$,

$$E[W(t)W(s)] = t \wedge s . \tag{11.1}$$

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/978-3-030-40183-2_11

Definition 11.1.3 The stochastic process with values in Rk W (t) := (W1 (t), . . . , Wk (t))

(t ∈ R+ ),

where {Wj (t)}t∈R+ (1 ≤ j ≤ k) are independent (standard) Wiener processes, is called a k-dimensional (standard) Wiener process. The following extension of the definition of Brownian motion slightly enlarges the scope of the theory by introducing a history that is possibly larger than the internal history. Definition 11.1.4 Let {Ft }t≥0 be a filtration. A real stochastic process {W (t)}t≥0 is called a standard Ft -Brownian motion if (i) for P -almost all ω, the trajectories t → W (t, ω) are continuous, (ii) for all t ≥ 0, W (t) is Ft -measurable, and (iii) W (0) = 0 and for all 0 ≤ s ≤ t, W (t) − W (s) is a centered Gaussian random variable independent of Fs and with variance t − s. Consider a symmetric random walk on Z with initial state 0. It admits the representation n  Xn = Zk , k=1

where {Zn }n≥1 is an iid sequence of {−1, +1}-valued random variables with P (Z1 = ±1) = 12 . A time-continuous stochastic process is constructed from this sequence as follows. A time-step equal to 1 in the discrete-time model represents in the continuoustime model Δ units of time. The amplitude scale is also modified, a distance of 1 in the original discrete-time model representing δ state space units in the modified model. We are therefore considering the continuous-time process {X(t)}t≥0 given by t/Δ

X(t) := δXt/Δ = δ



Zk .

(11.2)

k=1

(The dependence on δ and Δ will not be made explicit in the notation for X(t).) Since the Z_k's are centered and of variance 1, E[X(t)] = 0 and Var(X(t)) = δ² × ⌊t/Δ⌋. Let now Δ and δ tend to 0 in such a way that the limit in distribution exists and is not trivial. With respect to this goal, the choice δ = Δ is not satisfactory since E[X(t)] = 0 and lim_{Δ↓0} Var(X(t)) = 0, leading to a null process. With the choice δ² = Δ, E[X(t)] = 0 and lim_{Δ↓0} Var(X(t)) = t. In this case, since by the central limit theorem

$$\frac{\sum_{k=1}^{n} Z_k}{\sqrt{n}} \xrightarrow{D} N(0,1),$$

we have that (using the fact that if a sequence of random variables X_n converges in distribution to some random variable X and if the sequence of real numbers a_n converges to the real number a, then a_n X_n converges in distribution to aX; Theorem 4.4.8)

$$\frac{X(t)}{\sqrt{t}} = \sqrt{\frac{\Delta \lfloor t/\Delta \rfloor}{t}} \; \frac{\sum_{k=1}^{\lfloor t/\Delta \rfloor} Z_k}{\sqrt{\lfloor t/\Delta \rfloor}} \xrightarrow{D} N(0,1).$$

Thus at the limit (in distribution) as Δ ↓ 0, X(t) is a centered Gaussian variable with variance t. In fact, for all t_1, . . . , t_k in R+ forming an increasing sequence, the limit distribution of the vector (X(t_1), . . . , X(t_k)) corresponds to a Brownian motion (Exercise 11.5.1).
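The scaling δ² = Δ can be checked numerically. The following is a minimal sketch, not part of the text: the function names `scaled_walk` and `sample_variance` are hypothetical, and only the Python standard library is assumed. It simulates X(t) from (11.2) and estimates Var(X(t)), which should be close to Δ⌊t/Δ⌋ ≈ t.

```python
import math
import random

def scaled_walk(t, delta_t, rng):
    """Return X(t) = delta * sum_{k=1}^{floor(t/delta_t)} Z_k with delta**2 = delta_t."""
    delta = math.sqrt(delta_t)          # amplitude scale chosen so that delta^2 = Delta
    steps = int(t / delta_t)            # floor(t / Delta) for positive arguments
    return delta * sum(rng.choice((-1, 1)) for _ in range(steps))

def sample_variance(t, delta_t, n_samples=2000, seed=0):
    """Empirical variance of X(t) over independent simulated walks."""
    rng = random.Random(seed)
    xs = [scaled_walk(t, delta_t, rng) for _ in range(n_samples)]
    mean = sum(xs) / n_samples
    return sum((x - mean) ** 2 for x in xs) / (n_samples - 1)
```

For instance, `sample_variance(2.0, 1 / 128)` should return a value near 2, up to Monte Carlo error.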

Behavior at Infinity

In view of the previous approximation of a Wiener process by a symmetric random walk on Z, the following result is expected.

Theorem 11.1.5 Let {W(t)}t≥0 be a standard Brownian motion. Then, P-a.s.,

$$\limsup_{t \uparrow +\infty} W(t) = +\infty, \qquad \liminf_{t \uparrow +\infty} W(t) = -\infty.$$

Proof. This follows from

$$\limsup_{n \uparrow +\infty} W(n) = +\infty, \qquad \liminf_{n \uparrow +\infty} W(n) = -\infty \qquad P\text{-a.s.},$$

which in turn is a direct consequence of Theorem 4.3.9 applied to the random walk S_n = W(n) of step X_n = W(n) − W(n − 1).

Corollary 11.1.6 For all a ∈ R, the F_t^W-stopping time T_a = inf{t ≥ 0 ; W(t) ≥ a} is almost surely finite. (An explicit expression for the distribution of T_a will be given in Subsection 11.2.1.)

11.1.2 Simple Operations on Brownian motion

These are the operations of

(i) symmetrization: X(t) = −W(t) (t ≥ 0),

(ii) delay: for a > 0, X(t) = W(t + a) − W(a) (t ≥ 0),

(iii) scaling: for c > 0, X(t) = √c W(t/c) (t ≥ 0),

(iv) time inversion: X(t) = tW(1/t) (t > 0) and X(0) = 0.

In each case, the process {X(t)}t≥0 is a standard Brownian motion if {W(t)}t≥0 is a standard Brownian motion. The fact that these processes have the same distribution as the standard Brownian motion is easily checked (Exercise 11.5.3). They are also continuous. This is obvious in all cases except for the continuity at 0 of the process obtained by time inversion. However, almost surely,

$$\lim_{t \downarrow 0} t\, W\!\left(\frac{1}{t}\right) = 0.$$

Proof. We prove the equivalent statement

$$\lim_{s \uparrow +\infty} \frac{1}{s} W(s) = 0.$$

First observe that since W(n) is the sum of n iid centered Gaussian variables, lim_{n↑∞} W(n)/n = 0 by the strong law of large numbers and a fortiori lim_{n↑∞} W(n)/n² = 0. In the following let n := n(s) be the largest integer less than or equal to s. Therefore, taking into account the first observation, we just have to show that

$$\frac{W(s)}{s} - \frac{W(n)}{n} \to 0 \quad \text{as } s \to \infty.$$

But

$$\begin{aligned}
\left|\frac{W(s)}{s} - \frac{W(n)}{n}\right|
&\le \left|\frac{W(s)}{s} - \frac{W(n)}{s}\right| + \left|\frac{W(n)}{s} - \frac{W(n)}{n}\right| \\
&\le \frac{1}{n} \sup_{s \in [n, n+1]} |W(s) - W(n)| + |W(n)| \left|\frac{1}{s} - \frac{1}{n}\right| \\
&\le \frac{Z_n}{n} + \frac{|W(n)|}{n^2},
\end{aligned}$$

where

$$Z_n := \sup_{s \in [0,1]} |W(s+n) - W(n)|$$

has the same distribution as sup_{s∈[0,1]} |W(s)| and the sequence {Z_n}n≥1 is iid. Since, as observed earlier, W(n)/n² → 0, it remains to show that Z_n/n → 0 or, equivalently (Theorem 4.1.3), that for any ε > 0,

$$P\left(\left|\frac{Z_n}{n}\right| > \varepsilon \ \text{i.o.}\right) = 0.$$

By the Borel–Cantelli lemma, it suffices to show that

$$\sum_n P(|Z_n| > n\varepsilon) < \infty$$

or, equivalently, since the Z_n's are identically distributed,

$$\sum_n P(|Z_1| > n\varepsilon) < \infty.$$

But this is true of any integrable random variable Z_1 (see Exercise 4.6.1).
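As a complement, the covariance computation behind the time-inversion case can be written out in one line. This is a short check based on E[W(t)W(s)] = t ∧ s, sketched here rather than taken from the text. For 0 < s ≤ t and X(t) = tW(1/t),

$$E[X(t)X(s)] = ts\,E\left[W\!\left(\tfrac{1}{t}\right)W\!\left(\tfrac{1}{s}\right)\right] = ts\left(\frac{1}{t} \wedge \frac{1}{s}\right) = ts \cdot \frac{1}{t} = s = t \wedge s.$$

Since the time-inverted process is centered and Gaussian, this covariance identifies its finite-dimensional distributions with those of a standard Brownian motion.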




The Brownian Bridge

This is the process {X(t)}t∈[0,1] obtained from the standard Brownian motion {W(t)}t∈[0,1] by

$$X(t) := W(t) - tW(1) \qquad (t \in [0,1]).$$

It is a Gaussian process since for all t_1, . . . , t_k ∈ [0, 1], the random vector (X(t_1), . . . , X(t_k)) is Gaussian, being a linear function of the Gaussian vector (W(t_1), . . . , W(t_k), W(1)). Since it is a centered Gaussian process, its distribution is entirely characterized by its covariance function, and a simple calculation (Exercise 11.5.2) gives

$$\operatorname{cov}(X(t), X(s)) = s(1-t) \qquad (0 \le s \le t \le 1).$$

In particular, X(0) = X(1) = 0. The Brownian bridge {X(t)}t∈[0,1] is, distributionwise, a Wiener process {W(t)}t∈[0,1] conditioned by W(1) = 0. This statement is problematic in that the conditioning event has null probability. However, it is true “at the limit”:

Theorem 11.1.7 Let f : R^k → R be a bounded and continuous function. Then, for any 0 ≤ t_1 < t_2 < · · · < t_k ≤ 1,

$$\lim_{\varepsilon \downarrow 0} E\left[f(W(t_1), \ldots, W(t_k)) \mid |W(1)| \le \varepsilon\right] = E\left[f(X(t_1), \ldots, X(t_k))\right].$$

Proof.

$$\begin{aligned}
E\left[f(W(t_1), \ldots, W(t_k)) \mid |W(1)| \le \varepsilon\right]
&= E\left[f(X(t_1) + t_1 W(1), \ldots, X(t_k) + t_k W(1)) \mid |W(1)| \le \varepsilon\right] \\
&= \frac{E\left[f(X(t_1) + t_1 W(1), \ldots, X(t_k) + t_k W(1))\, 1_{\{|W(1)| \le \varepsilon\}}\right]}{P(|W(1)| \le \varepsilon)}.
\end{aligned}$$

In view of the independence of {X(t)}t∈[0,1] and W(1) (Exercise 11.5.13), this last quantity equals

$$\frac{\int_{-\varepsilon}^{+\varepsilon} E\left[f(X(t_1) + t_1 x, \ldots, X(t_k) + t_k x)\right] e^{-\frac{1}{2}x^2}\,dx}{\int_{-\varepsilon}^{+\varepsilon} e^{-\frac{1}{2}x^2}\,dx},$$

which tends to E[f(X(t_1), . . . , X(t_k))] as ε ↓ 0.
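The covariance s(1 − t) of the bridge can be checked by simulation. The sketch below is illustrative only; the helper names `bridge_pair` and `empirical_cov` are hypothetical and not from the text. It samples (W(s), W(t), W(1)) through independent Gaussian increments and forms X(u) = W(u) − uW(1).

```python
import math
import random

def bridge_pair(s, t, rng):
    """Sample (X(s), X(t)) for X(u) = W(u) - u W(1), assuming 0 <= s <= t <= 1."""
    w_s = rng.gauss(0.0, math.sqrt(s))
    w_t = w_s + rng.gauss(0.0, math.sqrt(t - s))      # independent increment
    w_1 = w_t + rng.gauss(0.0, math.sqrt(1.0 - t))    # independent increment
    return w_s - s * w_1, w_t - t * w_1

def empirical_cov(s, t, n_samples=40000, seed=0):
    """Monte Carlo estimate of E[X(s) X(t)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x_s, x_t = bridge_pair(s, t, rng)
        total += x_s * x_t   # the bridge is centered, so no mean correction is needed
    return total / n_samples
```

With s = 0.3 and t = 0.7 the estimate should be close to 0.3 × (1 − 0.7) = 0.09.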

11.1.3 Gauss–Markov Processes

Definition 11.1.8 Let T be R+ or N. A real-valued stochastic process {X(t)}t∈T is called a Markov process if for all t ∈ T and all non-negative (or integrable) random variables Z that are σ(X(s) ; s ≥ t)-measurable,

$$E\left[Z \mid \mathcal{F}_t^X\right] = E\left[Z \mid \sigma(X(t))\right]. \tag{11.3}$$

Gaussian processes that are moreover Markovian are called Gauss–Markov processes. This class of models receives a simple description in terms of the Brownian motion.

Example 11.1.9: The Brownian motion is Gauss–Markov. The Brownian motion is a Gauss–Markov process (Exercise 11.5.6).

Gauss–Markov processes are characterized among Gaussian processes by a simple property of their covariance function.


Theorem 11.1.10 Let {X(t)}t≥0 be a centered Gaussian process with continuous covariance function Γ such that Γ(t, t) > 0 for all t ∈ R+. It is a Markov process if and only if there exist functions f and g such that for all s, t ∈ R+,

$$\Gamma(t, s) = f(t \vee s)\, g(t \wedge s). \tag{11.4}$$

The proof relies on the following lemma.

Lemma 11.1.11 Let {X(t)}t≥0 be a centered Gaussian process with covariance function Γ such that Γ(t, t) > 0 for all t ∈ R+. If in addition it is a Markov process, then for all t > s > t_0 ≥ 0,

$$\Gamma(t, t_0) = \frac{\Gamma(t, s)\,\Gamma(s, t_0)}{\Gamma(s, s)}. \tag{11.5}$$

Proof. By the Gaussian property, the conditional expectation of X(t) given X(t_0) is equal to the linear regression of X(t) on X(t_0):

$$E[X(t) \mid X(t_0)] = \frac{\Gamma(t, t_0)}{\Gamma(t_0, t_0)}\, X(t_0). \tag{$\star$}$$

Using this remark and the Markov property,

$$\begin{aligned}
E[X(t) \mid X(t_0)] &= E\left[E[X(t) \mid X(t_0), X(s)] \mid X(t_0)\right] = E\left[E[X(t) \mid X(s)] \mid X(t_0)\right] \\
&= E\left[\frac{\Gamma(t, s)}{\Gamma(s, s)}\, X(s) \,\Big|\, X(t_0)\right] \\
&= \frac{\Gamma(t, s)}{\Gamma(s, s)}\, E[X(s) \mid X(t_0)] = \frac{\Gamma(t, s)}{\Gamma(s, s)}\, \frac{\Gamma(s, t_0)}{\Gamma(t_0, t_0)}\, X(t_0).
\end{aligned}$$

Comparing with the right-hand side of (⋆), and since P(X(t_0) ≠ 0) > 0 (in fact = 1), we obtain (11.5).

We now turn to the proof of Theorem 11.1.10.

Proof. Necessity. Suppose the process is Gauss–Markov. Let

$$\rho(t, s) = \frac{\Gamma(t, s)}{\Gamma(t, t)^{1/2}\, \Gamma(s, s)^{1/2}}$$

be its autocorrelation function. By (11.5), for all t > s > t_0 ≥ 0,

$$\rho(t, t_0) = \rho(t, s)\, \rho(s, t_0). \tag{$\star\star$}$$

We show that ρ(t, s) > 0 for all t, s ∈ R+. Indeed, assuming s > t and using (⋆⋆) repeatedly, for all n ≥ 1,

$$\rho(t, s) = \prod_{k=0}^{n-1} \rho\left(t + \frac{(k+1)(s-t)}{n},\; t + \frac{k(s-t)}{n}\right),$$

and therefore, using the facts that ρ(u, u) = 1 for all u and that ρ is uniformly continuous on bounded intervals, n can be chosen large enough to guarantee that all the elements


in the above product are positive. Therefore, one may divide by ρ(s, t_0) and write the identity ρ(t, t_0) = ρ(t, s)ρ(s, t_0) as

$$\rho(t, s) = \frac{\rho(t, t_0)}{\rho(s, t_0)}$$

or

$$\Gamma(t, s) = \rho(t, t_0)\,\Gamma(t, t)^{1/2} \times \frac{\Gamma(s, s)^{1/2}}{\rho(s, t_0)},$$

from which we obtain the desired conclusion (here s = t ∧ s and t = t ∨ s).

Sufficiency. Suppose that the process is Gaussian and that (11.4) holds true. Assume t > s. Therefore Γ(t, s) = f(t)g(s). By Schwarz’s inequality, Γ(t, s) ≤ Γ(t, t)^{1/2} Γ(s, s)^{1/2} or, equivalently, f(t)g(s) ≤ (f(t)g(t)f(s)g(s))^{1/2}, from which it follows that f(t)g(s) ≤ g(t)f(s). Therefore, the function

$$\tau(t) := \frac{g(t)}{f(t)}$$

is monotone non-decreasing. In particular, the centered Gaussian process

$$Y(t) := f(t)\, W(\tau(t))$$

is a Markov process since the Brownian motion itself is a Markov process. Its covariance function is

$$E[Y(t)Y(s)] = f(t)f(s)\,E[W(\tau(t))W(\tau(s))] = f(t)f(s)\,(\tau(t) \wedge \tau(s)) = f(t)f(s)\,\tau(s) = f(t)g(s).$$

Since it has the same covariance as {X(t)}t≥0 and since both processes are centered and Gaussian, they have the same distribution. In particular, {X(t)}t≥0 is a Markov process.
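As an illustration of the factorization criterion (an example chosen here for concreteness, not taken from the text), consider the covariance Γ(t, s) = e^{−|t−s|}, for which Γ(t, t) = 1 > 0. It factorizes as

$$\Gamma(t, s) = e^{-(t \vee s)}\, e^{\,t \wedge s}, \qquad f(t) = e^{-t}, \quad g(t) = e^{t}, \quad \tau(t) = \frac{g(t)}{f(t)} = e^{2t},$$

so that the construction used in the sufficiency part gives the representation

$$Y(t) = e^{-t}\, W\!\left(e^{2t}\right), \qquad E[Y(t)Y(s)] = e^{-t-s}\left(e^{2t} \wedge e^{2s}\right) = e^{-|t-s|}.$$

A centered Gaussian process with this covariance is therefore Markovian.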

11.2 Properties of Brownian Motion

11.2.1 The Strong Markov Property

Theorem 11.2.1 Let {W(t)}t≥0 be a standard F_t-Brownian motion and let τ be a finite F_t-stopping time. Then {W(τ + t) − W(τ)}t≥0 is a standard F_{τ+t}-Brownian motion independent of F_τ.

A proof is given in Subsection 14.3.2 via Itô calculus.

The Reflection Principle

Theorem 11.2.2 Let {W(t)}t≥0 be a standard F_t-Brownian motion and let T_a be the first time it reaches the value a > 0. The stochastic process

$$Y(t) := W(t)\,1_{\{t < T_a\}} + (2a - W(t))\,1_{\{t \ge T_a\}} \qquad (t \ge 0)$$

is a standard Brownian motion.

In terms of the running maximum M(t) := sup_{s≤t} W(s), for a > 0 and y ≥ 0,

$$P(W(t) \le a - y,\ M(t) \ge a) = P(W(t) \ge a + y). \tag{11.7}$$

Proof. Observing that {M(t) ≥ a} ≡ {T_a ≤ t}, that T_a = inf{t ≥ 0 ; Y(t) = a} and that {W(t) ≥ a + y} ⊆ {T_a ≤ t}, and using Theorem 11.2.2,

$$\begin{aligned}
P(W(t) \le a - y,\ M(t) \ge a) &= P(W(t) \le a - y,\ T_a \le t) = P(Y(t) \le a - y,\ T_a \le t) \\
&= P(2a - W(t) \le a - y,\ T_a \le t) = P(W(t) \ge a + y,\ T_a \le t) \\
&= P(W(t) \ge a + y).
\end{aligned}$$

Corollary 11.2.4 For a > 0,

$$P(T_a \le t) = P(M(t) \ge a) = 2P(W(t) > a) = P(|W(t)| > a).$$

Proof. The last equality is by symmetry of Brownian motion. The first equality is a consequence of the identity {M(t) ≥ a} ≡ {T_a ≤ t}. It remains to prove the second equality. We have

$$\begin{aligned}
P(M(t) \ge a) &= P(M(t) \ge a,\ W(t) \le a) + P(M(t) \ge a,\ W(t) > a) \\
&= P(M(t) \ge a,\ W(t) \le a) + P(W(t) > a) \\
&= P(W(t) > a) + P(W(t) > a),
\end{aligned}$$

where it was observed that {W(t) > a} ⊆ {M(t) ≥ a} for the second equality, and where (11.7) was applied with y = 0 for the third one.

The above results immediately yield the distribution of T_a: for a ≥ 0 and t ≥ 0,

$$P(T_a \le t) = \frac{2}{\sqrt{2\pi}} \int_{a/\sqrt{t}}^{\infty} e^{-\frac{y^2}{2}}\,dy.$$

Since the law of T_{−a} is the same as that of T_a, the formula for any a is

$$P(T_a \le t) = \frac{2}{\sqrt{2\pi}} \int_{|a|/\sqrt{t}}^{\infty} e^{-\frac{y^2}{2}}\,dy.$$

Definition 11.2.5 A real stochastic process {X(t)}t≥0 with stationary increments is called recurrent if for all x, y ∈ R, this process starting at time 0 from x will almost surely reach y in finite random time. It is called null recurrent if it is recurrent but this random time is not integrable.
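The distribution function of T_a can be evaluated with the complementary error function, since (2/√(2π)) ∫_x^∞ e^{−y²/2} dy = erfc(x/√2). The following is a small sketch; the function name `hitting_time_cdf` is hypothetical, and the convention used for t ≤ 0 is an assumption made here for completeness.

```python
import math

def hitting_time_cdf(a, t):
    """P(T_a <= t) = (2/sqrt(2 pi)) * integral from |a|/sqrt(t) to infinity of exp(-y^2/2) dy."""
    if t <= 0.0:
        # Assumed convention: the level a = 0 is hit at time 0, any other level is not.
        return 1.0 if a == 0.0 else 0.0
    return math.erfc(abs(a) / math.sqrt(2.0 * t))
```

For a = 1 and t = 1 this gives 2P(W(1) > 1), and the value tends to 1 as t grows, in agreement with the almost-sure finiteness of T_a.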


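Corollary 11.2.6 Brownian motion is null recurrent.

Proof. It suffices to verify the conditions of null recurrence of the above definition for x = 0. Letting t ↑ ∞ in the expression of the cumulative distribution function of T_y (y ∈ R), we obtain P(T_y < ∞) = 1. On the other hand, for y ≠ 0, with α := 2/√(2π) and β := |y|, and since 1 − P(T_y ≤ t) = α ∫_0^{β/√t} e^{−u²/2} du,

$$\begin{aligned}
E[T_y] &= \int_0^\infty P(T_y > t)\,dt = \alpha \int_0^\infty \left(\int_0^{\beta/\sqrt{t}} e^{-\frac{1}{2}u^2}\,du\right) dt \\
&= \alpha \int_0^\infty \left(\int_0^{\beta^2/u^2} dt\right) e^{-\frac{1}{2}u^2}\,du = \alpha\beta^2 \int_0^\infty \frac{1}{u^2}\, e^{-\frac{1}{2}u^2}\,du = +\infty.
\end{aligned}$$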

11.2.2 Continuity

Theorem 11.2.7 Consider a stochastic process such as the Brownian motion of Definition 11.1.2, except that continuity of the trajectories is not assumed. There exists a version of it having almost surely continuous paths.

Proof. For any s, t ∈ R+, W(t) − W(s) has the same distribution as |t − s|^{1/2} Y where Y is a centered Gaussian variable with unit variance. In particular, for any α > 0,

$$E\left[|W(t) - W(s)|^{\alpha}\right] = |t - s|^{\alpha/2}\, E|Y|^{\alpha},$$

from which the result immediately follows by application of Theorem 5.2.3: take α > 2, β = α/2 − 1 and K = E|Y|^α.

11.2.3 Non-differentiability

Definition 11.1.2 of the Wiener process does not tell much about the qualitative behavior of its trajectories. Although the trajectories of the (standard) Brownian motion are almost surely continuous functions, their behavior is otherwise rather chaotic. First of all observe that, for fixed t_0 > 0, the random variable

$$\frac{W(t_0 + h) - W(t_0)}{h} \overset{D}{\sim} N\left(0, h^{-1}\right)$$

does not converge in distribution as h ↓ 0, and a fortiori does not converge almost surely. Therefore, for any t_0 > 0, P(t → W(t) is not differentiable at t_0) = 1. But the situation is even more dramatic:

Theorem 11.2.8 Almost all the paths of the Wiener process are nowhere differentiable.

Proof. We shall prove that

$$P\left(\limsup_{h \to 0} \frac{W(t + h) - W(t)}{h} = +\infty \ \text{ for all } t \in [0, 1]\right) = 1.$$


Fix β > 0. If a function f : [0, 1] → R has at some point s ∈ [0, 1] a derivative f′(s) of absolute value smaller than β, then there exists an integer n_0 such that for n ≥ n_0,

$$|f(t) - f(s)| < 2\beta\,|t - s| \quad \text{if } |t - s| \le \frac{2}{n}. \tag{$\star$}$$

Let

C_n := {f : [0, 1] → R ; there exists an s ∈ [0, 1] satisfying (⋆)}

and let

A_n := {ω ; the function t ∈ [0, 1] → W(t, ω) is in C_n}.

This event increases with n and its limit A includes all the samples ω corresponding to a trajectory t ∈ [0, 1] → W(t, ω) having at least at one point of [0, 1] a derivative of absolute value smaller than β. Therefore it suffices to show that P(A) = 0 for all β > 0. If ω ∈ A_n, letting k be the largest integer such that k/n ≤ s (where s is the point in the definition of C_n), and letting

$$Y_k(\omega) := \max_{j = -1, 0, +1} \left| W\!\left(\frac{k + j + 1}{n}, \omega\right) - W\!\left(\frac{k + j}{n}, \omega\right) \right|,$$

then Y_k(ω) ≤ 6β/n. Therefore,

$$A_n \subseteq B_n := \left\{\omega \,;\ \text{at least one } Y_k(\omega) \le \frac{6\beta}{n}\right\}.$$

In order to prove that P(A) = 0, it is then enough to show that lim_n P(B_n) = 0. But

$$B_n = \bigcup_{k=1}^{n-2} \left\{\omega \,;\ Y_k(\omega) \le \frac{6\beta}{n}\right\},$$

and by sub-σ-additivity,

$$P(B_n) \le \sum_{k=1}^{n-2} P\left(\max_{j = -1, 0, +1} \left| W\!\left(\frac{k + j + 1}{n}\right) - W\!\left(\frac{k + j}{n}\right) \right| \le \frac{6\beta}{n}\right).$$

By the independence property of the increments of a Wiener process and since all the variables involved are N(0, 1/n),

$$\begin{aligned}
P(B_n) &\le n\, P\left(\left|N\!\left(0, \tfrac{1}{n}\right)\right| \le \frac{6\beta}{n}\right)^{3}
= n \left(\sqrt{\frac{n}{2\pi}} \int_{-6\beta/n}^{+6\beta/n} e^{-\frac{n x^2}{2}}\,dx\right)^{3} \\
&= n \left(\frac{1}{\sqrt{2\pi n}} \int_{-6\beta}^{+6\beta} e^{-\frac{x^2}{2n}}\,dx\right)^{3} \to 0.
\end{aligned}$$

If a function f : [0, 1] → R is of bounded variation on [0, 1], it has a derivative almost everywhere (with respect to the Lebesgue measure) in [0, 1]. Therefore:

Corollary 11.2.9 Almost every trajectory of a Brownian motion is of unbounded variation on any interval.

11.2.4 Quadratic Variation

Let D := {0 = t_0 ≤ t_1 ≤ . . . ≤ t_n = t} be a division of the interval [0, t] with maximum gap

$$\Delta := \max_{1 \le k \le n} (t_k - t_{k-1}).$$

Let

$$V_W(t, D) := \sum_{k=1}^{n} |W(t_k) - W(t_{k-1})|.$$

By Corollary 11.2.9, the variation sup_D V_W(t, D) of the Wiener process on the interval [0, t] is almost surely infinite. However, the quadratic variation of this process on the interval [0, t], defined by

$$Q_W(t, D) := \sum_{k=1}^{n} (W(t_k) - W(t_{k-1}))^2,$$

is such that

$$E[Q_W(t, D)] = \sum_{k=1}^{n} E\left[(W(t_k) - W(t_{k-1}))^2\right] = \sum_{k=1}^{n} (t_k - t_{k-1}) = t.$$

But there is more:

Theorem 11.2.10
A. As Δ → 0, Q_W(t, D) → t in L²_R(P).
B. If {D_i}i≥0 is a sequence of subdivisions of [0, t] such that Δ_i = o(i^{−2}), then

$$\lim_{i \uparrow \infty} Q_W(t, D_i) = t, \qquad P\text{-a.s.}$$

Proof. A. Write

$$Q_W(t, D) - t = \sum_{k=1}^{n} Z_k,$$

where Z_k = (W(t_k) − W(t_{k−1}))² − (t_k − t_{k−1}), a centered random variable with variance E[Z_k²] = 2(t_k − t_{k−1})². Therefore, by independence of the Z_k’s,

$$E\left[(Q_W(t, D) - t)^2\right] = \sum_{k=1}^{n} E\left[Z_k^2\right] = 2 \sum_{k=1}^{n} (t_k - t_{k-1})^2 \le 2\Delta \sum_{k=1}^{n} (t_k - t_{k-1}) = 2\Delta t.$$


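B. Take Δ_i = ε_i/i² with lim_{i↑∞} ε_i = 0. By Markov’s inequality,

$$P\left(|Q_W(t, D_i) - t| > i\sqrt{2\Delta_i}\right) = P\left(|Q_W(t, D_i) - t|^2 > 2\varepsilon_i\right) \le \frac{E\left[|Q_W(t, D_i) - t|^2\right]}{2\varepsilon_i} \le \frac{2\Delta_i t}{2\varepsilon_i} = \frac{t}{i^2}.$$

The announced result then follows from Theorem 4.1.2.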
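The contrast between the two notions of variation is easy to observe on a simulated path. The sketch below is illustrative, not from the text: the function name `variations` is hypothetical and a uniform division of [0, t] is assumed. It returns V_W(t, D) and Q_W(t, D) for independent Gaussian increments of variance t/n.

```python
import math
import random

def variations(t, n, seed=0):
    """Return (V, Q): first and quadratic variation of a simulated Wiener path
    over the uniform division of [0, t] into n intervals."""
    rng = random.Random(seed)
    sd = math.sqrt(t / n)                      # each increment is N(0, t/n)
    increments = [rng.gauss(0.0, sd) for _ in range(n)]
    first = sum(abs(dw) for dw in increments)
    quadratic = sum(dw * dw for dw in increments)
    return first, quadratic
```

As n grows, the second component stays close to t (its mean-square error is at most 2Δt = 2t²/n), while the first has expectation √(2nt/π) and therefore grows without bound.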

11.3 The Wiener–Doob Integral

11.3.1 Construction

The Wiener stochastic integral

$$\int_{\mathbb{R}} f(t)\,dW(t) \tag{11.8}$$

will be defined for a certain class of measurable functions f. Note, however, that this integral will not be of the usual type. For instance, it cannot be defined pathwise as a Stieltjes–Lebesgue integral since the trajectories of the Brownian motion are of unbounded variation. This integral cannot be interpreted as ∫_R f(t)Ẇ(t) dt either (the dot denotes differentiation) since the Brownian motion does not have a derivative. Therefore, the integral in (11.8) will be defined in a radically different way. In fact, the Doob–Wiener stochastic integral is defined with respect to a stochastic process with centered and uncorrelated increments. This generalizes the original Wiener integral (which is defined with respect to the Brownian motion).

Let {Z(t)}t∈R be a complex-valued stochastic process such that

(i) for all intervals [t_1, t_2] ⊂ R, the increments Z(t_2) − Z(t_1) are centered and in L²_C(P), and

(ii) there exists a locally finite measure μ on (R, B) such that

$$E\left[(Z(t_2) - Z(t_1))(Z(t_4) - Z(t_3))^*\right] = \mu((t_1, t_2] \cap (t_3, t_4]) \tag{11.9}$$

for all [t_1, t_2] ⊂ R and all [t_3, t_4] ⊂ R.

Note in particular that if (t_1, t_2] ∩ (t_3, t_4] = ∅, Z(t_2) − Z(t_1) and Z(t_4) − Z(t_3) are orthogonal random variables of L²_C(P).

Definition 11.3.1 The above stochastic process {Z(t)}t∈R is called a stochastic process with centered and uncorrelated increments with structural measure μ.

Example 11.3.2: The structural measure of the Wiener process. The Wiener process {W (t)}t∈R is such a process, with structural measure equal to the Lebesgue measure.


Example 11.3.3: The structural measure of a compensated hpp. Let N be an hpp on R with intensity λ. Define {Z(t)}t∈R by Z(0) = 0 and, for all [a, b] ⊂ R, Z(b) − Z(a) = N((a, b]) − λ × (b − a). Then {Z(t)}t∈R is a stochastic process with centered and uncorrelated increments whose structural measure is λ times the Lebesgue measure.

The Wiener–Doob integral

$$\int_{\mathbb{R}} f(t)\,dZ(t)$$

is constructed for all f ∈ L²_C(μ) in the following manner. First of all, we define this integral for all f ∈ L, the vector subspace of L²_C(μ) formed by the finite complex linear combinations of interval indicator functions

$$f(t) = \sum_{i=1}^{N} \alpha_i\, 1_{(a_i, b_i]}(t).$$

For such functions, by definition,

$$\int_{\mathbb{R}} f(t)\,dZ(t) := \sum_{i=1}^{N} \alpha_i\, (Z(b_i) - Z(a_i)). \tag{$\star$}$$

One easily verifies that the linear mapping

$$\varphi : f \in L \mapsto \int_{\mathbb{R}} f(t)\,dZ(t) \in L^2_{\mathbb{C}}(P)$$

is an isometry, that is, for all f ∈ L,

$$\int_{\mathbb{R}} |f(t)|^2\,\mu(dt) = E\left[\left|\int_{\mathbb{R}} f(t)\,dZ(t)\right|^2\right].$$

Since L is dense in L²_C(μ), ϕ can be uniquely extended to an isometric linear mapping of L²_C(μ) into L²_C(P). We continue to call this extension ϕ and then define, for all f ∈ L²_C(μ),

$$\int_{\mathbb{R}} f(t)\,dZ(t) := \varphi(f).$$

The fact that ϕ is an isometry is expressed by Doob’s isometry formula:

$$E\left[\left(\int_{\mathbb{R}} f(t)\,dZ(t)\right)\left(\int_{\mathbb{R}} g(t)\,dZ(t)\right)^{*}\right] = \int_{\mathbb{R}} f(t)\,g^{*}(t)\,\mu(dt), \tag{11.10}$$

where f and g are in L²_C(μ). Note also that for all f ∈ L²_C(μ),

$$E\left[\int_{\mathbb{R}} f(t)\,dZ(t)\right] = 0, \tag{11.11}$$

since the Doob integral is the limit in L²_C(P) of random variables of the type Σ_{i=1}^N α_i(Z(b_i) − Z(a_i)) that have mean 0 (use the continuity of the inner product in L²_C(P)).

Remark 11.3.4 In the case where Z(t) := W(t) (t ∈ R+), a Wiener process, the right-hand side of (⋆) is in the Gaussian Hilbert subspace H(W), and so are the Wiener integrals, being limits in quadratic mean of elements of H(W).


Series Expansion of Wiener integrals

Let f be a function of L²_R([a, b]) and let {W(t)}t∈[a,b] be a standard Wiener process. Let {ϕ_n}n≥1 be an orthonormal basis of the Hilbert space L²_R([a, b]). In particular,

$$f = \sum_{n=1}^{\infty} \langle f, \varphi_n \rangle\, \varphi_n,$$

where the convergence of the series of the right-hand side is in L²_R([a, b]). Consider now the sequence of random variables

$$Z_n := \int_a^b \varphi_n(t)\,dW(t) \qquad (n \ge 1).$$

This is a Gaussian sequence (Remark 11.3.4). Moreover, the Z_n’s are uncorrelated since by the isometry formula for the Doob–Wiener integrals, if n ≠ k,

$$E[Z_n Z_k] = \int_a^b \varphi_n(t)\,\varphi_k(t)\,dt = 0.$$

Therefore {Z_n}n≥1 is an iid sequence of standard Gaussian variables (the same isometry formula gives E[Z_n²] = 1).

Theorem 11.3.5 For f ∈ L²_R([a, b]), we have the expansion

$$\int_a^b f(t)\,dW(t) = \sum_{n=1}^{\infty} \langle f, \varphi_n \rangle\, Z_n,$$

where the convergence of the series in the right-hand side is in L²_R(P) and almost surely.

Proof. It is enough to prove convergence in L²_R(P) since the statement about almost-sure convergence then follows from Theorem 4.1.15 and the fact that when a sequence of random variables converges almost surely and in L²_R(P), the respective limits are almost surely equal. For convergence in L²_R(P):

$$\begin{aligned}
E\left[\left(\int_a^b f(t)\,dW(t) - \sum_{n=1}^{N} \langle f, \varphi_n \rangle Z_n\right)^{2}\right]
&= E\left[\left(\int_a^b f(t)\,dW(t)\right)^{2}\right] - 2\sum_{n=1}^{N} \langle f, \varphi_n \rangle\, E\left[\int_a^b f(t)\,dW(t)\, Z_n\right] + \sum_{n=1}^{N} \langle f, \varphi_n \rangle^2\, E\left[Z_n^2\right] \\
&= \int_a^b f(t)^2\,dt - 2\sum_{n=1}^{N} \langle f, \varphi_n \rangle \int_a^b f(t)\,\varphi_n(t)\,dt + \sum_{n=1}^{N} \langle f, \varphi_n \rangle^2 \\
&= \int_a^b f(t)^2\,dt - 2\sum_{n=1}^{N} \langle f, \varphi_n \rangle^2 + \sum_{n=1}^{N} \langle f, \varphi_n \rangle^2 \\
&= \int_a^b f(t)^2\,dt - \sum_{n=1}^{N} \langle f, \varphi_n \rangle^2 \to 0.
\end{aligned}$$

Remark 11.3.6 For the purpose of sampling the integral ∫_a^b f(t) dW(t) (that is, of generating a random variable with the same distribution as ∫_a^b f(t) dW(t)) it is enough to use any sequence {Z_n}n≥1 of independent standard Gaussian random variables.
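The sampling recipe of the remark can be sketched as follows for one concrete case. The choices are assumptions made here for illustration, not taken from the text: f(t) = t on [0, 1], and the cosine basis ϕ_0 = 1, ϕ_n(t) = √2 cos(nπt), indexed from 0 rather than 1. The function names are hypothetical. The coefficients are ⟨f, ϕ_0⟩ = 1/2 and ⟨f, ϕ_n⟩ = √2((−1)^n − 1)/(nπ)².

```python
import math
import random

def cosine_coefficients(n_terms):
    """Coefficients <f, phi_n> of f(t) = t on [0, 1] in the orthonormal cosine
    basis phi_0 = 1, phi_n(t) = sqrt(2) cos(n pi t)."""
    coeffs = [0.5]  # <t, 1> = 1/2
    for n in range(1, n_terms):
        coeffs.append(math.sqrt(2.0) * ((-1) ** n - 1) / (n * math.pi) ** 2)
    return coeffs

def sample_integral(coeffs, rng):
    """One sample of the truncated series sum_n <f, phi_n> Z_n with iid N(0, 1) Z_n."""
    return sum(c * rng.gauss(0.0, 1.0) for c in coeffs)
```

The sum of the squared coefficients approaches ∫_0^1 t² dt = 1/3, which is the variance of the Wiener integral by the isometry formula, so samples of the truncated series should have empirical variance close to 1/3.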


A Characterization of the Wiener Integral

The following characterization of the Wiener integral will be useful:

Lemma 11.3.7 Let f ∈ L²_R(R+) and let {W(t)}t≥0 be a standard Wiener process. Denote by H(W) the Gaussian Hilbert space generated by this Wiener process. The Wiener integral Z := ∫_{R+} f(t) dW(t) is characterized by the following two properties:

(a) Z ∈ H(W), and

(b) E[ZW(s)] = ∫_0^s f(t) dt for all s ≥ 0.

Proof. Necessity: It was already noted that, by construction, ∫_{R+} f(t) dW(t) ∈ H(W). Also, (b) is just the isometry formula

$$E\left[\int_{\mathbb{R}_+} f(t)\,dW(t) \int_{\mathbb{R}_+} 1_{\{t \le s\}}\,dW(t)\right] = \int_0^s f(t)\,dt.$$

Sufficiency: Let Z satisfy (a) and (b). Since Z − ∫_{R+} f(t) dW(t) is in H(W), it suffices to show that this random variable is orthogonal to all the generators W(s) (s ∈ R+) of H(W) to obtain that Z − ∫_{R+} f(t) dW(t) = 0 P-a.s. But, by (b) and by the isometry formula,

$$E\left[\left(Z - \int_{\mathbb{R}_+} f(t)\,dW(t)\right) W(s)\right] = \int_0^s f(t)\,dt - \int_0^s f(t)\,dt = 0.$$

The next lemma features a kind of formula of integration by parts.

Lemma 11.3.8 Let {W(t)}t≥0 be a standard Wiener process. Let T be a positive real number and let f : [0, T] → R be a continuously differentiable function (in particular the function f(t)1_{t≤T} is in L²_R(R+) and therefore the integral ∫_0^T f(t) dW(t) is well defined). Then,

$$\int_0^T f(t)\,dW(t) + \int_0^T f'(t)\,W(t)\,dt = f(T)\,W(T). \tag{11.12}$$

Proof. By Lemma 11.3.7, it suffices to prove that for all s ∈ R+,

$$E\left[\left(f(T)W(T) - \int_0^T f'(t)\,W(t)\,dt\right) W(s)\right] = \int_0^s f(t)\,1_{\{t \le T\}}\,dt.$$

Using the equality E[W(a)W(b)] = a ∧ b, the latter reduces to

$$f(T)(T \wedge s) - E\left[\left(\int_0^T f'(t)\,W(t)\,dt\right) W(s)\right] = \int_0^s f(t)\,1_{\{t \le T\}}\,dt.$$

By Fubini:

$$E\left[\left(\int_0^T f'(t)\,W(t)\,dt\right) W(s)\right] = \int_0^T f'(t)\,E[W(t)W(s)]\,dt = \int_0^T f'(t)\,(t \wedge s)\,dt.$$

It therefore remains to check that

$$f(T)(T \wedge s) - \int_0^T f'(t)\,(t \wedge s)\,dt = \int_0^s f(t)\,1_{\{t \le T\}}\,dt.$$

When T ≤ s, this reduces to the identity

$$f(T)\,T - \int_0^T f'(t)\,t\,dt = \int_0^T f(t)\,dt,$$

which is verified by integration by parts, and when T ≥ s, it reduces to

$$f(T)\,s - \int_0^T f'(t)\,(s \wedge t)\,dt = \int_0^s f(t)\,dt.$$

This last identity is verified by noting that both sides are null for s = 0 and that their derivatives are equal for all s ≤ T:

$$f(T) - \int_s^T f'(t)\,dt = f(s).$$
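The integration by parts formula (11.12) can be observed numerically on a single simulated path. The following is a rough sketch under stated assumptions: left-point Riemann sums on a uniform grid stand in for both integrals, and the function name `integration_by_parts_gap` is hypothetical. The returned gap should shrink as the grid is refined.

```python
import math
import random

def integration_by_parts_gap(f, f_prime, T, n, seed=0):
    """Left-point sums on one simulated path; returns
    sum f dW + sum f' W dt - f(T) W(T), which (11.12) predicts to be close to 0."""
    rng = random.Random(seed)
    dt = T / n
    sd = math.sqrt(dt)
    w = 0.0                       # current value W(t_k), with W(0) = 0
    stochastic_part = 0.0         # approximates the Wiener integral of f
    drift_part = 0.0              # approximates the integral of f'(t) W(t) dt
    for k in range(n):
        t_k = k * dt
        dw = rng.gauss(0.0, sd)
        stochastic_part += f(t_k) * dw
        drift_part += f_prime(t_k) * w * dt
        w += dw
    return stochastic_part + drift_part - f(T) * w
```

For example, with f(t) = e^{−t} (the integrand that appears in the solution of Langevin’s equation with α = 1), the gap on a grid of 20000 points is small compared with the size of the individual terms.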



11.3.2 Langevin’s Equation

This is the equation

$$dV(t) + \alpha V(t)\,dt = \sigma\,dW(t), \tag{11.13}$$

where {W(t)}t≥0 is a standard Wiener process and α and σ are positive real numbers, with the following interpretation:

$$V(t) - V(0) + \alpha \int_0^t V(s)\,ds = \sigma W(t) \qquad (t \ge 0). \tag{11.14}$$

Remark 11.3.9 The motion of a particle of mass m on the line subjected at each instant t to an external force F(t) and to friction is governed by the differential equation mx″(t) = −αx′(t) + F(t), where α > 0 is the friction coefficient. (It is assumed that there is no potential energy field.) If the external force is due, as in the Brown experiment, to numerous tiny shocks, one may assume that ∫_a^b F(s) ds = σ(W(b) − W(a)) so that, letting V(t) = x′(t) and taking m = 1, we obtain equation (11.13).

Theorem 11.3.10 The unique solution of the Langevin equation (11.13) with initial value V(0) is

$$V(t) = e^{-\alpha t} V(0) + \int_0^t e^{-\alpha(t-s)}\,\sigma\,dW(s). \tag{11.15}$$

Proof. Using the integration by parts formula (11.12), (11.15) is found equivalent to

$$V(t) = e^{-\alpha t} V(0) + \sigma W(t) - \int_0^t \alpha e^{-\alpha(t-s)}\,\sigma W(s)\,ds. \tag{$\star$}$$

Integrating from 0 to u gives

$$\int_0^u V(t)\,dt = \frac{1}{\alpha}\left(1 - e^{-\alpha u}\right) V(0) + \int_0^u \sigma W(s)\,ds - \int_0^u \left(\int_0^t \alpha e^{-\alpha(t-s)}\,\sigma W(s)\,ds\right) dt. \tag{$\star\star$}$$

The last integral is equal to

$$\begin{aligned}
\int_0^\infty \int_0^\infty 1_{\{s \le t \le u\}}\, 1_{\{s \le u\}}\, \alpha e^{-\alpha(t-s)}\,\sigma W(s)\,ds\,dt
&= \int_0^\infty \left(\int_0^\infty 1_{\{s \le t \le u\}}\, \alpha e^{-\alpha(t-s)}\,dt\right) 1_{\{s \le u\}}\,\sigma W(s)\,ds \\
&= \int_0^u \left(\int_s^u \alpha e^{-\alpha(t-s)}\,dt\right) \sigma W(s)\,ds \\
&= \int_0^u \left(1 - e^{-\alpha(u-s)}\right) \sigma W(s)\,ds.
\end{aligned}$$

Replacing this in (⋆⋆) and multiplying by α gives

$$\alpha \int_0^u V(t)\,dt = \left(1 - e^{-\alpha u}\right) V(0) + \int_0^u \alpha e^{-\alpha(u-s)}\,\sigma W(s)\,ds,$$

and therefore

$$V(u) - V(0) + \alpha \int_0^u V(t)\,dt = V(u) - e^{-\alpha u} V(0) + \int_0^u \alpha e^{-\alpha(u-s)}\,\sigma W(s)\,ds = \sigma W(u),$$

where the last equality is (⋆) at t = u.

To prove uniqueness, let V′ be another solution of the Langevin equation with the same initial value. Letting U := V − V′, we have

$$U(t) = -\alpha \int_0^t U(s)\,ds,$$

whose unique solution is the null function (Gronwall’s lemma, Theorem B.6.1).
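The explicit solution (11.15) suggests an exact one-step recursion for simulation. Over a step of length h, V(t + h) = e^{−αh}V(t) plus a Wiener integral that, by the isometry formula, is centered Gaussian with variance σ²(1 − e^{−2αh})/(2α). The sketch below assumes this reading; the function name `langevin_path` is hypothetical.

```python
import math
import random

def langevin_path(v0, alpha, sigma, t_end, n_steps, rng):
    """Simulate V on a uniform grid of [0, t_end] using the exact transition
    implied by (11.15): V(t+h) = exp(-alpha h) V(t) + Gaussian noise."""
    h = t_end / n_steps
    decay = math.exp(-alpha * h)
    # Variance of the Wiener integral over one step, from the isometry formula.
    noise_sd = sigma * math.sqrt((1.0 - math.exp(-2.0 * alpha * h)) / (2.0 * alpha))
    v = v0
    path = [v]
    for _ in range(n_steps):
        v = decay * v + noise_sd * rng.gauss(0.0, 1.0)
        path.append(v)
    return path
```

For large t_end the terminal value should have mean close to e^{−α t_end} V(0) and variance close to σ²/(2α), which a batch of independent paths can confirm.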

11.3.3 The Cameron–Martin Formula

This result is of interest in communications theory. One will recognize the likelihood ratio associated with the hypothesis “signal plus white Gaussian noise” against the hypothesis “white Gaussian noise only”.

Theorem 11.3.11 Let {X(t)}t≥0 be, with respect to probability P, a Wiener process with variance σ² and let γ : R → R be in L²_R(R). For any T ∈ R+, the formula

$$\frac{dQ}{dP} = \exp\left\{\frac{1}{\sigma^2}\left(\int_0^T \gamma(t)\,dX(t) - \frac{1}{2}\int_0^T \gamma^2(t)\,dt\right)\right\} \tag{11.16}$$

defines a probability measure Q on (Ω, F) with respect to which

$$X(t) - \int_0^t \gamma(s)\,ds$$

is, on the interval [0, T], a Wiener process with variance σ².

The proof of Theorem 11.3.11 is based on the following preliminary result.

Lemma 11.3.12 Let {X(t)}t≥0 be a Wiener process with variance σ² and let ϕ : R → R be in L²_R(R). Then, for any T ∈ R+,

$$E\left[e^{\int_0^T \varphi(t)\,dX(t)}\right] = e^{\frac{1}{2}\sigma^2 \int_0^T \varphi^2(t)\,dt}. \tag{11.17}$$

Proof. First consider the case

$$\varphi(t) = \sum_{k=1}^{N} \alpha_k\, 1_{(a_k, b_k]}(t), \tag{11.18}$$

where α_k ∈ R and the intervals (a_k, b_k] are disjoint. For this special case, formula (11.17) reduces to

$$E\left[e^{\sum_{k=1}^{N} \alpha_k (X(b_k) - X(a_k))}\right] = e^{\frac{1}{2}\sigma^2 \sum_{k=1}^{N} \alpha_k^2 (b_k - a_k)},$$

and therefore follows directly from the independence of the increments of a Wiener process and from the Gaussian property of these increments, in particular, the formula giving the Laplace transform of the centered Gaussian variable X(b) − X(a) with variance σ²(b − a):

$$E\left[e^{\alpha(X(b) - X(a))}\right] = e^{\frac{1}{2}\sigma^2 \alpha^2 (b - a)}.$$

Let now {ϕ_n}n≥1 be a sequence of functions of type (11.18) converging in L²_R(R) to ϕ (in particular, lim_{n↑∞} ∫_0^T ϕ_n²(t) dt = ∫_0^T ϕ²(t) dt). Therefore,

$$\lim_{n \uparrow \infty} \int_0^T \varphi_n(t)\,dX(t) = \int_0^T \varphi(t)\,dX(t),$$

where the latter convergence is in L²_R(P). This convergence can be assumed to take place almost surely by taking if necessary a subsequence. From the equality

$$E\left[e^{\int_0^T \varphi_n(t)\,dX(t)}\right] = e^{\frac{1}{2}\sigma^2 \int_0^T \varphi_n^2(t)\,dt}$$

we can then deduce (11.17), at least if the sequence of random variables in the left-hand side is uniformly integrable. This is the case because the quantity

$$E\left[\left|e^{\int_0^T \varphi_n(t)\,dX(t)}\right|^2\right] = E\left[e^{2\int_0^T \varphi_n(t)\,dX(t)}\right] = e^{2\sigma^2 \int_0^T \varphi_n^2(t)\,dt}$$

is uniformly bounded, and therefore the uniform integrability claim follows from Theorem 4.2.14, with G(t) = t².

We may now turn to the proof of Theorem 11.3.11.

Proof. The fact that (11.16) properly defines a probability Q, that is, that the expectation of the right-hand side of (11.16) equals 1, follows from Lemma 11.3.12 with ϕ(t) = γ(t)/σ². Letting

$$Y(t) := X(t) - \int_0^t \gamma(s)\,ds,$$

we have to prove that, under Q, this stochastic process is on [0, T] a Wiener process with variance σ². To do this, we must show that

$$E_Q\left[e^{\sum_{k=1}^{N} \alpha_k (Y(b_k) - Y(a_k))}\right] = e^{\frac{1}{2}\sigma^2 \sum_{k=1}^{N} \alpha_k^2 (b_k - a_k)},$$

where α_k ∈ R and the intervals (a_k, b_k] ⊆ [0, T] are disjoint, that is, letting ψ(t) = Σ_{k=1}^N α_k 1_{(a_k, b_k]}(t),

$$E_Q\left[e^{\int_0^T \psi(t)\,dY(t)}\right] = e^{\frac{1}{2}\sigma^2 \int_0^T \psi^2(t)\,dt},$$

or equivalently,

$$E_P\left[\frac{dQ}{dP}\, e^{\int_0^T \psi(t)(dX(t) - \gamma(t)\,dt)}\right] = e^{\frac{1}{2}\sigma^2 \int_0^T \psi^2(t)\,dt},$$

that is,

$$E_P\left[e^{\frac{1}{\sigma^2}\left\{\int_0^T \gamma(t)\,dX(t) - \frac{1}{2}\int_0^T \gamma^2(t)\,dt\right\}}\, e^{\int_0^T \psi(t)(dX(t) - \gamma(t)\,dt)}\right] = e^{\frac{1}{2}\sigma^2 \int_0^T \psi^2(t)\,dt}.$$

Simplifying:

$$E_P\left[e^{\int_0^T \left(\psi(t) + \frac{1}{\sigma^2}\gamma(t)\right) dX(t) - \int_0^T \gamma(t)\psi(t)\,dt - \frac{1}{2}\int_0^T \frac{\gamma^2(t)}{\sigma^2}\,dt}\right] = e^{\frac{1}{2}\sigma^2 \int_0^T \psi^2(t)\,dt},$$

and using (11.17) with ϕ(t) = ψ(t) + γ(t)/σ², the left-hand side is equal to

$$\exp\left\{\frac{1}{2}\sigma^2 \int_0^T \left(\psi(t) + \frac{1}{\sigma^2}\gamma(t)\right)^{2} dt - \int_0^T \gamma(t)\psi(t)\,dt - \frac{1}{2}\int_0^T \frac{\gamma^2(t)}{\sigma^2}\,dt\right\}.$$

The proof is completed since

$$\frac{1}{2}\sigma^2 \int_0^T \left(\psi(t) + \frac{1}{\sigma^2}\gamma(t)\right)^{2} dt - \int_0^T \gamma(t)\psi(t)\,dt - \frac{1}{2}\int_0^T \frac{\gamma^2(t)}{\sigma^2}\,dt = \frac{1}{2}\sigma^2 \int_0^T \psi^2(t)\,dt.$$

Remark 11.3.13 A sweeping generalization of the Cameron–Martin theorem, the Girsanov theorem, will be given in Chapter 14 (Theorem 14.3.3). It will require the more advanced tools of the stochastic calculus associated with the Itô integral.
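The change of measure can be tried out in the simplest case of a constant drift. This is an illustrative sketch under assumptions made here, not from the text: γ(t) ≡ γ, so that ∫_0^T γ dX = γX(T) and the density (11.16) depends on the path only through X(T). The function name `cameron_martin_check` is hypothetical. Under P, X(T) is N(0, σ²T). Weighting by dQ/dP should give total mass 1 and E_Q[X(T)] = γT, since X(t) − γt is centered under Q.

```python
import math
import random

def cameron_martin_check(gamma, sigma, T, n_samples=100000, seed=0):
    """Monte Carlo estimates of E_P[dQ/dP] and E_P[(dQ/dP) X(T)] for a constant
    drift gamma, where dQ/dP = exp((gamma X(T) - gamma^2 T / 2) / sigma^2)."""
    rng = random.Random(seed)
    sd = sigma * math.sqrt(T)
    total_weight = 0.0
    weighted_x = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, sd)      # X(T) sampled under P
        weight = math.exp((gamma * x - 0.5 * gamma ** 2 * T) / sigma ** 2)
        total_weight += weight
        weighted_x += weight * x
    return total_weight / n_samples, weighted_x / n_samples
```

With γ = 0.5, σ = 1 and T = 1, the two returned values should be close to 1 and 0.5 respectively.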

11.4 Fractal Brownian Motion

The Wiener process {W(t)}t≥0 has the following property. If c is a positive constant, the process {W_c(t)}t≥0 := {c^{−1/2} W(ct)}t≥0 is also a Wiener process. It is indeed a centered Gaussian process with independent increments, null at the time origin, and for 0 < a < b,

$$E\left[|W_c(b) - W_c(a)|^2\right] = c^{-1} E\left[|W(cb) - W(ca)|^2\right] = c^{-1}(cb - ca) = b - a.$$

This is a particular instance of a self-similar stochastic process.

Definition 11.4.1 A real-valued stochastic process {Y(t)}t≥0 is called self-similar with (Hurst) self-similarity parameter H if for any c > 0,

$$\{Y(t)\}_{t \ge 0} \overset{D}{\sim} \{c^{-H}\, Y(ct)\}_{t \ge 0}.$$

The Wiener process is therefore self-similar with similarity parameter H = 1/2.

It follows from the definition that Y(t) has the same distribution as t^H Y(1), and therefore, if P(Y(1) ≠ 0) > 0:

If H < 0, Y(t) → 0 in distribution as t → ∞ and |Y(t)| → ∞ in distribution as t → 0.

If H > 0, Y(t) → 0 in distribution as t → 0 and |Y(t)| → ∞ in distribution as t → ∞.


If H = 0, Y (t) has a distribution independent of t. In particular, when H = 0, a self-similar process cannot be stationary (strictly or in the wide sense). We shall be interested in self-similar processes that have stationary increments. We must restrict attention to non-negative self-similarity parameters, because of the following negative result:1 for any strictly negative value of the self-similarity parameter, a self-similar stochastic process with independent increments is not measurable (except of course for the trivial case where the process is identically null). Theorem 11.4.2 Let {Y (t)}t≥0 be a self-similar stochastic process with stationary increments and self-similarity parameter H > 0 (in particular, Y (0) = 0). Its covariance function is given by  1  Γ(s, t) := cov (Y (s), Y (t)) = σ 2 t2H − |t − s|2H + s2H , 2     where σ 2 = E (Y (t + 1) − Y (t))2 = E Y (1)2 . Proof. Assume without loss of generality that the process is centered. Let 0 ≤ s ≤ t. Then     E (Y (t) − Y (s))2 = E (Y (t − s) − Y (0))2   = E (Y (t − s))2 = σ 2 (t − s)2H and       2E [Y (t)Y (s)] = E Y (t)2 + E Y (s)2 − E (Y (t) − Y (s))2 , 

hence the result.

Fractal Brownian motion2 is a Gaussian process that in a sense generalizes the Wiener process. Definition 11.4.3 A fractal Brownian motion on R+ with Hurst parameter H ∈ (0, 1) is a centered Gaussian process {BH (t)}t≥0 with continuous paths such that BH (0) = 0, and with covariance function E[BH (t)BH (s)] =

 1  2H |t| + |s|2H − |t − s|2H . 2

(11.19)

The existence of such process follows from Theorem 5.1.23 as soon as the right-hand side of (11.19) can be shown to be a non-negative definite function. This can be done directly, although we choose another path. We shall prove the existence of the fractal Brownian motion by constructing it as a Doob integral with respect to a Wiener process. More precisely, define for 0 < H < 1, wH (t, s) := 0 for t ≤ s, 1

wH (t, s) := (t − s)H− 2 for 0 ≤ s ≤ t and 1 2

[Vervaat, 1987]. [Mandelbrot and Van Ness, 1968].

11.5. EXERCISES

463 1

1

wH (t, s) := (t − s)H− 2 − (−s)H− 2 for s < 0. Observe that for any c > 0 1

wH (ct, s) = cH− 2 wH (t, sc−1 ). Define

BH (t) :=

R

wH (t, s) dW (s) .

The Doob integral of the right-hand side is, more explicitly, -

t

A − B :=

-

1

(t − s)H− 2 dW (s) −

0 −∞

0

  1 1 (t − s)H− 2 − (−s)H− 2 dW (s).

(11.20)

It is well defined and, with the change of variable $u = c^{-1}s$, the integral defining $B_H(ct)$ becomes
$$c^{H-\frac12} \int_{\mathbb{R}} w_H(t,u)\, dW(cu).$$
Using the self-similarity of the Wiener process, the process defined by the last display has the same distribution as the process defined by
$$c^{H-\frac12}\, c^{\frac12} \int_{\mathbb{R}} w_H(t,u)\, dW(u) = c^{H} B_H(t).$$

Therefore $\{B_H(t)\}_{t\ge 0}$ is self-similar with similarity parameter H. The fact that there is a version of this process with continuous paths can be proven using Theorem 5.2.3 along the lines of the proof of Theorem 11.2.7.

It is tempting to rewrite (11.20) as Z(t) − Z(0), where
$$Z(t) = \int_{-\infty}^t (t-s)^{H-\frac12}\, dW(s).$$
However, this last integral is not well defined as a Doob integral since, for all H > 0, the function $s \mapsto (t-s)^{H-\frac12}\, 1_{\{s\le t\}}$ is not in $L^2_{\mathbb{R}}(\mathbb{R})$.

Complementary reading Chapter 6 of [Resnick, 1992]. [Revuz and Yor, 1999] (more advanced).

11.5 Exercises

Exercise 11.5.1. Wiener as a limit Prove that for all t1 , . . . , tn in R+ forming an increasing sequence, the limit distribution of the vector (X(t1 ), . . . , X(tn )), where X(t) is defined by (11.2), is that corresponding to a Wiener process, that is, a centered Gaussian vector such that X(t1 ), X(t2 )−X(t1 ), . . . , X(tn ) − X(tn−1) are centered Gaussian variables with variances t1 , t2 − t1 ,. . . , tn − tn−1.


Exercise 11.5.2. A basic formula
Let $\{W(t)\}_{t\ge 0}$ be a standard Wiener process. Prove that for $s, t \in \mathbb{R}_+$,
$$E[W(t)W(s)] = t \wedge s.$$
Let $\{X(t)\}_{t\in[0,1]}$ be a Brownian bridge. Prove that
$$\mathrm{cov}\,(X(t), X(s)) = s(1-t) \qquad (0 \le s \le t \le 1).$$

Exercise 11.5.3. Transforming a Wiener process
Let $\{W(t)\}_{t\ge 0}$ be a standard Wiener process. Prove that the process $\{X(t)\}_{t\in[0,1]}$ is a standard Brownian motion in the following cases:
(i) $X(t) = -W(t)$,
(ii) $X(t) = W(t+a) - W(a)$ (a > 0),
(iii) $X(t) = \sqrt{c}\, W\!\left(\frac{t}{c}\right)$ (t ≥ 0) (c > 0),
(iv) $X(t) = t\, W\!\left(\frac{1}{t}\right)$ (t > 0) and X(0) = 0.
(Note that the continuity at 0 is already proved in Section 11.1.2.)

Exercise 11.5.4. Brownian bridges
Let $\{W(t)\}_{t\in[0,1]}$ be a Wiener process. Show that the Brownian bridge $\{X(t) := W(t) - tW(1)\}_{t\in[0,1]}$ is a Gaussian process independent of W(1) and compute its autocovariance function. Show that the process $\{X(1-t)\}_{t\in[0,1]}$ is a Brownian bridge.

Exercise 11.5.5.
Let $\{W(t)\}_{t\in[0,1]}$ be a Wiener process. Let
$$Z(t) := (1-t)\, W\!\left(\frac{t}{1-t}\right) \qquad (0 \le t < 1)$$
and Z(1) := 0. Show that $\{Z(t)\}_{t\in[0,1]}$ is continuous at t = 1 and that it has the same distribution as the Brownian bridge.

Exercise 11.5.6. Wiener is Gauss–Markov
Prove that a Wiener process is a Gauss–Markov process.

Exercise 11.5.7. An Ornstein–Uhlenbeck process
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. Show that $\{e^{-\alpha t}\, W(e^{2\alpha t})\}_{t\ge 0}$ is (has the same distribution as) an Ornstein–Uhlenbeck process.

Exercise 11.5.8. Exit time from a strip
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion, and define for a > 0 and b < 0 the stopping time
$$T_{a,b} := \inf\{t \ge 0\,;\, W(t) \in \{a, b\}\}.$$
(i) Compute $P(W(T_{a,b}) = a)$.

(ii) Show that $\{W(t)^2 - t\}_{t\ge 0}$ is an $\mathcal{F}_t^W$-martingale and deduce from this $E[T_{a,b}]$.

Exercise 11.5.9. The transience of Brownian motion with a positive drift
Let μ and σ > 0 be two real numbers. Let
$$X(t) := \sigma W(t) + \mu t \qquad (t \ge 0).$$
(i) Show that for all $u \in \mathbb{R}$,
$$Z(t) := \exp\left\{uX(t) - ut\left(\mu + \frac{u\sigma^2}{2}\right)\right\} \qquad (t \ge 0)$$
is an $\mathcal{F}_t^W$-martingale.
(ii) Take advantage of the choice $u = -\frac{2\mu}{\sigma^2}$ and of Doob's optional sampling theorem (the applicability of which you shall verify) to obtain that the probability $r_a$ that $\{X(t)\}_{t\ge 0}$ will reach a > 0 before it touches −b < 0 is given by
$$r_a = \frac{1 - e^{-\frac{2\mu b}{\sigma^2}}}{1 - e^{-\frac{2\mu(a+b)}{\sigma^2}}}.$$

(iii) Show that if μ > 0 (or μ < 0), $\{X(t)\}_{t\ge 0}$ is transient, that is, for all $a \in \mathbb{R}$, there exists an almost surely finite (random) time after which it does not visit a.

Exercise 11.5.10. Independent Brownian motions with a drift
Let for i = 1, 2,
$$X_i(t) := x_i + \mu_i t + \sigma_i W_i(t),$$
where $x_i, \mu_i \in \mathbb{R}$, $\sigma_i > 0$, and $\{W_1(t)\}_{t\ge 0}$ and $\{W_2(t)\}_{t\ge 0}$ are independent standard Brownian motions. Suppose moreover that $x_1 < x_2$. Compute the probability that $\{X_1(t)\}_{t\ge 0}$ and $\{X_2(t)\}_{t\ge 0}$ never meet.

Exercise 11.5.11. The Lebesgue integral of a Gaussian process
Let $\{X(t)\}_{t\in[0,1)}$ be a continuous Gaussian stochastic process. Prove that the random variable $\int_0^1 X(t)\, dt$ is Gaussian. Compute its mean and variance when $\{X(t)\}_{t\in[0,1)}$ is a Brownian bridge.

Exercise 11.5.12.
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. Show that the stochastic process

$$X(t) := \int_0^{\frac{t}{1-t}} \frac{1}{1-s}\, dW(s) \qquad (t \in [0,1))$$

is a Brownian motion.

Exercise 11.5.13. A representation of the Brownian bridge
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. Let for $t \in [0,1)$,
$$Y(t) := (1-t) \int_0^t \frac{dW(s)}{1-s}.$$


(i) Prove that the integral in the right-hand side is well defined on [0, 1) as a Wiener integral.
(ii) Prove that as t ↑ 1, Y(t) → 0 in quadratic mean.
(iii) Define Y(1) := 0. Show that $\{Y(t)\}_{t\in[0,1]}$ is a Gaussian process.
(iv) Show that $\{Y(t)\}_{t\in[0,1]}$ is (has the same distribution as) a Brownian bridge.

Exercise 11.5.14. Average occupation time
Let $\{W(t)\}_{t\ge 0}$ be a standard d-dimensional Brownian motion and let $A \in \mathcal{B}(\mathbb{R}^d)$ be of positive Lebesgue measure. Let, for ω ∈ Ω,
$$S_A(\omega) := \{t \ge 0\,;\, W(t,\omega) \in A\}$$
(the occupation time of A). Denoting by ℓ the Lebesgue measure, prove that $E[\ell(S_A)] = \infty$ if d ≤ 2, and that if d ≥ 3,
$$E[\ell(S_A)] = \frac{1}{2\pi^{d/2}}\, \Gamma\!\left(\frac{d}{2} - 1\right) \int_A \|x\|^{2-d}\, dx,$$
where $\Gamma(\alpha) := \int_0^\infty x^{\alpha-1} e^{-x}\, dx$ (α > 0) is the Gamma function.

Exercise 11.5.15. Micropulses and fractal Brownian motion³
Let $N^\varepsilon$ be a Poisson process on $\mathbb{R} \times \mathbb{R}_+$ with intensity measure
$$\nu(dt \times dz) = \frac{1}{2\varepsilon^2}\, z^{-1-\theta}\, dt \times dz \qquad (0 < \theta < 1,\ \varepsilon > 0).$$
For all t ≥ 0, let $S^+_{0,t} = \{(s,z) : 0 < s < t,\ t - s < z\}$ and $S^-_{0,t} = \{(s,z) : -\infty < s < 0,\ -s < z < t - s\}$, and let
$$X_\varepsilon(t) := \varepsilon \left(N^\varepsilon(S^+_{0,t}) - N^\varepsilon(S^-_{0,t})\right).$$

[Figure: the regions $S^+_{0,t}$ and $S^-_{0,t}$, with the points 0 and t marked on the horizontal axis.]

(1) Show that $X_\varepsilon(t)$ is well defined for all t ≥ 0.
(2) Compute for all $0 \le t_1 \le t_2 \le \ldots \le t_n$ the characteristic function of $(X_\varepsilon(t_1), \ldots, X_\varepsilon(t_n))$.
(3) Show that for all $0 \le t_1 \le t_2 \le \ldots \le t_n$, $(X_\varepsilon(t_1), \ldots, X_\varepsilon(t_n))$ converges as ε ↓ 0 in distribution to $(B_H(t_1), \ldots, B_H(t_n))$, where $\{B_H(t)\}_{t\ge 0}$ is a fractal Brownian motion with Hurst parameter $H = \frac{1-\theta}{2}$ and variance $E\left[B_H(1)^2\right] = \theta^{-1}(1-\theta)^{-1}$, that is, $\{B_H(t)\}_{t\ge 0}$ is a centered Gaussian process such that $B_H(0) = 0$ and with covariance function
$$E[B_H(t)B_H(s)] = \frac{1}{2}\left(|s|^{2H} + |t|^{2H} - |s-t|^{2H}\right) E\left[B_H(1)^2\right].$$

³ [Cioczek-Georges and Mandelbrot, 1995].

Chapter 12 Wide-sense Stationary Stochastic Processes

Wide-sense stationary stochastic processes are of interest in signal analysis and processing, as well as in physics. Their study rests on Bochner's representation of characteristic functions, which immediately leads to the fundamental notion of power spectral measure, and on the Doob–Wiener integral that permits a mathematical definition of white noise as well as the obtention of a spectral decomposition of the trajectories of such stochastic processes, the Cramér–Khinchin decomposition, a fundamental result with important consequences in signal processing.

12.1 The Power Spectral Measure

12.1.1 Covariance Functions and Characteristic Functions

Recall the simple facts about Fourier theory. Let $f : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be integrable with respect to the Lebesgue measure. Then, for any $\nu \in \mathbb{R}$,
$$\hat{f}(\nu) := \int_{\mathbb{R}} f(t)\, e^{-2i\pi\nu t}\, dt$$
is well defined and the function $\hat{f}$, called the Fourier transform of f, is continuous and bounded (Exercise 2.4.13). From classical Fourier analysis, we know that if moreover the function $\hat{f}$ is integrable with respect to the Lebesgue measure, then (Fourier inversion formula)
$$f(t) = \int_{\mathbb{R}} \hat{f}(\nu)\, e^{2i\pi\nu t}\, d\nu,$$
where this equality is true almost everywhere, and everywhere if f is continuous (Exercise 2.4.9).

Remark 12.1.1 The notion of Fourier transform does not in general apply as such to the trajectories of a wide-sense stationary stochastic process. Consider, for instance, a square-integrable ergodic process $\{X(t)\}_{t\in\mathbb{R}}$ not identically null. In particular, for p = 1 or p = 2, P-a.s.,
$$\lim_{t\uparrow\infty} \frac{1}{t} \int_0^t |X(s)|^p\, ds = E\left[|X(0)|^p\right] > 0,$$
from which it follows that almost all trajectories are neither in $L^1_{\mathbb{C}}(\mathbb{R})$ nor in $L^2_{\mathbb{C}}(\mathbb{R})$ and therefore do not have a Fourier transform in the usual $L^1$ or $L^2$ senses.

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/978-3-030-40183-2_12

Nevertheless, there exists a spectral decomposition for the trajectories of a wss stochastic process, called the Cramér–Khintchin decomposition, as we shall see in Section 12.2. To obtain such a decomposition, the starting point is the Fourier analysis of the covariance function. We begin with a few examples.

Two Particular Cases

Example 12.1.2: Absolutely continuous spectrum. Consider a wss random process with integrable and continuous covariance function C, in which case the Fourier transform f of the latter is well defined by
$$f(\nu) = \int_{\mathbb{R}} e^{-2i\pi\nu\tau}\, C(\tau)\, d\tau.$$
It is called the power spectral density (psd). It turns out, as we shall soon see when we consider the general case, that it is non-negative and integrable. Since it is integrable, the Fourier inversion formula
$$C(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau} f(\nu)\, d\nu \tag{12.1}$$
holds true for all $\tau \in \mathbb{R}$ since C is continuous. Also f is the unique integrable function such that (12.1) holds. Letting τ = 0 in this formula, we obtain, since $C(0) = \mathrm{Var}(X(t)) := \sigma^2$,
$$\sigma^2 = \int_{\mathbb{R}} f(\nu)\, d\nu.$$

Example 12.1.3: The Ornstein–Uhlenbeck process. The Ornstein–Uhlenbeck process is a centered Gaussian process with covariance function
$$\Gamma(t,s) = C(t-s) = e^{-\alpha|t-s|}.$$
The function C is integrable and therefore the power spectral density is the Fourier transform of the covariance function:
$$f(\nu) = \int_{\mathbb{R}} e^{-2i\pi\nu\tau}\, e^{-\alpha|\tau|}\, d\tau = \frac{2\alpha}{\alpha^2 + 4\pi^2\nu^2}.$$
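The closed form of the Ornstein–Uhlenbeck power spectral density can be checked against a direct numerical Fourier transform of the covariance. The sketch below uses the evenness of C, so that only the cosine part contributes, and a trapezoidal rule on a truncated interval; the truncation point, the step count and the function names are assumptions made for this example.

```python
import math


def ou_psd_closed_form(nu: float, alpha: float) -> float:
    """Closed form 2*alpha / (alpha^2 + 4*pi^2*nu^2) from Example 12.1.3."""
    return 2.0 * alpha / (alpha ** 2 + 4.0 * math.pi ** 2 * nu ** 2)


def ou_psd_numeric(nu: float, alpha: float,
                   upper: float = 40.0, steps: int = 200_000) -> float:
    """Trapezoidal value of 2 * int_0^upper exp(-alpha*tau) cos(2*pi*nu*tau) dtau.

    Since C is even, the Fourier transform reduces to twice the cosine
    integral over the half-line, truncated here at `upper`.
    """
    step = upper / steps

    def integrand(tau: float) -> float:
        return math.exp(-alpha * tau) * math.cos(2.0 * math.pi * nu * tau)

    total = 0.5 * (integrand(0.0) + integrand(upper))
    total += sum(integrand(k * step) for k in range(1, steps))
    return 2.0 * step * total
```

With, say, α = 1.5 and ν = 0.7, the two functions should agree to several decimal places, the residual being the quadrature and truncation error.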

Not all wss stochastic processes admit a power spectral density. For instance:

Example 12.1.4: Line spectrum. Consider a wide-sense stationary process with a covariance function of the form
$$C(\tau) = \sum_{k\in\mathbb{Z}} P_k\, e^{2i\pi\nu_k\tau},$$
where $P_k \ge 0$ and
$$\sum_{k\in\mathbb{Z}} P_k < \infty$$
(for instance, the harmonic process of Example 5.1.20). This covariance function is not integrable, and in fact there does not exist a power spectral density. In particular, a representation of the covariance function such as (12.1) is not available, at least if the function f is interpreted in the ordinary sense. However, there is a formula such as (12.1) if we consent, as is usually done in the engineering literature, to define the power spectral density in this case to be the pseudo-function
$$f(\nu) = \sum_{k\in\mathbb{Z}} P_k\, \delta(\nu - \nu_k),$$
where δ(ν − a) is the delayed Dirac pseudo-function informally defined by
$$\int_{\mathbb{R}} \varphi(\nu)\, \delta(\nu - a)\, d\nu = \varphi(a).$$
Indeed, with such a convention,
$$\int_{\mathbb{R}} f(\nu)\, e^{2i\pi\nu\tau}\, d\nu = \sum_{k\in\mathbb{Z}} P_k \int_{\mathbb{R}} e^{2i\pi\nu\tau}\, \delta(\nu - \nu_k)\, d\nu = \sum_{k\in\mathbb{Z}} P_k\, e^{2i\pi\nu_k\tau}.$$

We can (and perhaps should) however avoid recourse to Dirac pseudo-functions, and the general result to follow (Theorem 12.1.5) will tell us what to do. In general, it may happen that the covariance function is not integrable and/or that there does not exist a line spectrum. We now turn to the general theory.

The General Case

Remember that the characteristic function φ of a real random variable X has the following properties:
A. it is hermitian symmetric, that is, $\varphi(-u) = \varphi(u)^*$, and it is uniformly bounded: $|\varphi(u)| \le \varphi(0)$,
B. it is uniformly continuous on $\mathbb{R}$, and
C. it is definite non-negative, in the sense that for all integers n, all $u_1, \ldots, u_n \in \mathbb{R}$, and all $z_1, \ldots, z_n \in \mathbb{C}$,
$$\sum_{j=1}^n \sum_{k=1}^n \varphi(u_j - u_k)\, z_j z_k^* \ge 0$$
(just observe that the left-hand side equals $E\left[\left|\sum_{j=1}^n z_j e^{iu_j X}\right|^2\right]$).

It turns out that Properties A, B and C characterize characteristic functions (up to a multiplicative constant). This is Bochner's theorem (Theorem 4.4.10), which is now recalled for easier reference: Let $\varphi : \mathbb{R} \to \mathbb{C}$ be a function satisfying properties A, B and C. Then there exists a constant 0 ≤ β < ∞ and a real random variable X such that for all $u \in \mathbb{R}$,
$$\varphi(u) = \beta\, E\left[e^{iuX}\right].$$
Bochner's theorem is all that is needed to define the power spectral measure of a wide-sense stationary process continuous in the quadratic mean.

Theorem 12.1.5 Let $\{X(t)\}_{t\in\mathbb{R}}$ be a wss random process continuous in the quadratic mean, with covariance function C. Then, there exists a unique measure μ on $\mathbb{R}$ such that
$$C(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\, \mu(d\nu). \tag{12.2}$$
In particular, μ is a finite measure:
$$\mu(\mathbb{R}) = C(0) = \mathrm{Var}(X(0)) < \infty. \tag{12.3}$$

Proof. It suffices to observe that the covariance function of a wss stochastic process that is continuous in the quadratic mean shares the properties A, B and C of the characteristic function of a real random variable. Indeed,
(a) it is hermitian symmetric, and $|C(\tau)| \le C(0)$ (Schwarz's inequality),
(b) it is uniformly continuous, and
(c) it is definite non-negative, in the sense that for all integers n, all $\tau_1, \ldots, \tau_n \in \mathbb{R}$, and all $z_1, \ldots, z_n \in \mathbb{C}$,
$$\sum_{j=1}^n \sum_{k=1}^n C(\tau_j - \tau_k)\, z_j z_k^* \ge 0$$
(just observe that, for a centered process, the left-hand side is equal to $E\left[\left|\sum_{j=1}^n z_j X(\tau_j)\right|^2\right]$).
Therefore, by Theorem 4.4.10, the covariance function C is (up to a multiplicative constant) a characteristic function. This is exactly what (12.2) says, since μ thereof is a finite measure, that is, up to a multiplicative constant, a probability distribution. Uniqueness of the power spectral measure follows from the fact that a finite measure (up to a multiplicative constant: a probability) on $\mathbb{R}^d$ is characterized by its Fourier transform (Theorem 3.1.51). □

The case of an absolutely continuous spectrum corresponds to the situation where μ admits a density with respect to Lebesgue measure: μ(dν) = f(ν) dν. We then say that the wss stochastic process in question admits the power spectral density (psd) f. If such a power spectral density exists, it has the properties mentioned without proof in Example 12.1.2: it is non-negative and it is integrable.

The case of a line spectrum corresponds to a spectral measure that is a weighted sum of Dirac measures:
$$\mu(d\nu) = \sum_{k\in\mathbb{Z}} P_k\, \varepsilon_{\nu_k}(d\nu),$$
where the $P_k$'s are non-negative and have a finite sum, as in Example 12.1.4.

12.1.2 Filtering of wss Stochastic Processes

We recall a few standard results concerning the (convolutional) filtering of deterministic functions. Let $f, g : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be integrable functions with respective Fourier transforms $\hat{f}$ and $\hat{g}$. Then (Exercise 2.4.14),
$$\int_{\mathbb{R}} \int_{\mathbb{R}} |f(t-s)g(s)|\, dt\, ds < \infty,$$
and therefore, for almost all $t \in \mathbb{R}$, the function $s \mapsto f(t-s)g(s)$ is Lebesgue integrable. In particular, the convolution
$$(f * g)(t) := \int_{\mathbb{R}} f(t-s)g(s)\, ds$$
is almost everywhere well defined. For all t such that the last integral is not defined, set (f ∗ g)(t) = 0. Then f ∗ g is Lebesgue integrable and its Fourier transform is $\widehat{f * g} = \hat{f}\hat{g}$, where $\hat{f}$, $\hat{g}$ are the Fourier transforms of f and g, respectively (Exercise 2.4.14).

Let $h : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be an integrable function. The operation that associates to the integrable function $x : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ the integrable function
$$y(t) := \int_{\mathbb{R}} h(t-s)x(s)\, ds$$
is called a stable convolutional filter. The function h is called the impulse response of the filter, x and y are respectively the input and the output of this filter. The Fourier transform $\hat{h}$ of the impulse response is the transmittance of the filter.

Let now $\{X(t)\}_{t\in\mathbb{R}}$ be a wss random process with continuous covariance function $C_X$. We examine the effect of filtering on this process. The output process is the process defined by
$$Y(t) := \int_{\mathbb{R}} h(t-s)X(s)\, ds. \tag{12.4}$$
Note that the integral (12.4) is well defined under the integrability condition for the impulse response h. This follows from Theorem 5.3.2 according to which the integral
$$\int_{\mathbb{R}} f(s)X(s,\omega)\, ds$$
is well defined for P-almost all ω when f is integrable (in the special case of wss stochastic processes, m(t) = m and $\Gamma(t,t) = C(0) + |m|^2$, and therefore the conditions on f and g thereof reduce to integrability of these functions). Referring to the same theorem, we have
$$E\left[\int_{\mathbb{R}} f(t)X(t)\, dt\right] = \int_{\mathbb{R}} f(t)E[X(t)]\, dt = m \int_{\mathbb{R}} f(t)\, dt. \tag{12.5}$$

Let now $f, g : \mathbb{R} \to \mathbb{C}$ be integrable functions. As a special case of Theorem 5.3.2, we have
$$\mathrm{cov}\left(\int_{\mathbb{R}} f(t)X(t)\, dt\,,\, \int_{\mathbb{R}} g(s)X(s)\, ds\right) = \int_{\mathbb{R}} \int_{\mathbb{R}} f(t)g^*(s)\, C(t-s)\, dt\, ds. \tag{12.6}$$
We shall see that, in addition,
$$\mathrm{cov}\left(\int_{\mathbb{R}} f(t)X(t)\, dt\,,\, \int_{\mathbb{R}} g(s)X(s)\, ds\right) = \int_{\mathbb{R}} \hat{f}(-\nu)\, \hat{g}^*(-\nu)\, \mu(d\nu). \tag{12.7}$$

Proof. Assume without loss of generality that m = 0. From Bochner's representation of the covariance function, we obtain for the last double integral in (12.6)
$$\int_{\mathbb{R}} \int_{\mathbb{R}} f(t)g^*(s) \left(\int_{\mathbb{R}} e^{+2i\pi\nu(t-s)}\, \mu(d\nu)\right) dt\, ds = \int_{\mathbb{R}} \left(\int_{\mathbb{R}} f(t)e^{+2i\pi\nu t}\, dt\right) \left(\int_{\mathbb{R}} g(s)e^{+2i\pi\nu s}\, ds\right)^* \mu(d\nu).$$
Here again we have to justify the change of order of integration using Fubini's theorem. For this, it suffices to show that the function $(t,s,\nu) \mapsto \left|f(t)g^*(s)e^{+2i\pi\nu(t-s)}\right| = |f(t)|\, |g(s)|\, 1_{\mathbb{R}}(\nu)$ is integrable with respect to the product measure ℓ × ℓ × μ, where ℓ is the Lebesgue measure. This is indeed true, the integral being equal to $\left(\int_{\mathbb{R}} |f(t)|\, dt\right) \times \left(\int_{\mathbb{R}} |g(t)|\, dt\right) \times \mu(\mathbb{R})$. □

In view of the above results, the right-hand side of formula (12.4) is well defined. Moreover:

Theorem 12.1.6 When the input process $\{X(t)\}_{t\in\mathbb{R}}$ is a wss random process with power spectral measure $\mu_X$, the output $\{Y(t)\}_{t\in\mathbb{R}}$ of a stable convolutional filter of transmittance $\hat{h}$ is a wss random process with the power spectral measure
$$\mu_Y(d\nu) = |\hat{h}(\nu)|^2\, \mu_X(d\nu). \tag{12.8}$$

This formula will be referred to as the fundamental filtering formula in continuous time.

Proof. Just apply formulas (12.5) and (12.7) with the functions
$$f(u) = h(t-u), \qquad g(v) = h(s-v),$$
to obtain
$$E[Y(t)] = m \int_{\mathbb{R}} h(t)\, dt$$
and
$$E[(Y(t) - m)(Y(s) - m)^*] = \int_{\mathbb{R}} |\hat{h}(\nu)|^2\, e^{+2i\pi\nu(t-s)}\, \mu(d\nu).$$
□

Example 12.1.7: Two special cases. In particular, if the input process admits a psd $f_X$, the output process also admits a psd given by
$$f_Y(\nu) = |\hat{h}(\nu)|^2 f_X(\nu).$$
When the input process has a line spectrum, the power spectral measure of the output process takes the form
$$\mu_Y(d\nu) = \sum_{k=1}^{\infty} P_k\, |\hat{h}(\nu_k)|^2\, \varepsilon_{\nu_k}(d\nu).$$
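The line-spectrum case can be illustrated on a single deterministic component: filtering the exponential $e^{2i\pi\nu s}$ multiplies it by the transmittance at ν, so that the power attached to ν is multiplied by its squared modulus. The sketch below assumes a hypothetical causal impulse response $h(t) = e^{-t} 1_{\{t \ge 0\}}$, whose transmittance is $1/(1 + 2i\pi\nu)$; this filter, the truncation of the integral and all names are choices made for the example, not taken from the text.

```python
import cmath
import math


def transmittance(nu: float) -> complex:
    """Fourier transform of the assumed impulse response h(t) = exp(-t), t >= 0."""
    return 1.0 / (1.0 + 2j * math.pi * nu)


def filtered_exponential(t: float, nu: float,
                         upper: float = 40.0, steps: int = 200_000) -> complex:
    """Trapezoidal value of int_0^upper h(u) exp(2i*pi*nu*(t-u)) du."""
    step = upper / steps

    def integrand(u: float) -> complex:
        return math.exp(-u) * cmath.exp(2j * math.pi * nu * (t - u))

    total = 0.5 * (integrand(0.0) + integrand(upper))
    total += sum(integrand(k * step) for k in range(1, steps))
    return step * total


def output_line_powers(powers, freqs):
    """Weights P_k * |h_hat(nu_k)|^2 of the output line spectrum."""
    return [p * abs(transmittance(nu)) ** 2 for p, nu in zip(powers, freqs)]
```

Comparing `filtered_exponential(t, nu)` with `transmittance(nu) * cmath.exp(2j * math.pi * nu * t)` shows the multiplicative effect of the filter on each spectral line.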

12.1.3 White Noise

By analogy with Optics, one calls white noise any centered wss random process $\{B(t)\}_{t\in\mathbb{R}}$ with constant power spectral density $f_B(\nu) = N_0/2$.¹ Such a definition presents a theoretical difficulty, because
$$\int_{-\infty}^{+\infty} f_B(\nu)\, d\nu = +\infty,$$
which contradicts the finite power property of wide-sense stationary processes.

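A First Approach

From a pragmatic point of view, one could define a white noise to be a centered wss stochastic process whose psd is constant over a “large”, yet bounded, range of frequencies [−A, +A]. The calculations below show what happens as A tends to infinity. Let therefore $\{X(t)\}_{t\in\mathbb{R}}$ be a centered wss stochastic process with psd
$$f(\nu) = \frac{N_0}{2}\, 1_{[-A,+A]}(\nu).$$
Let $\varphi_1, \varphi_2 : \mathbb{R} \to \mathbb{C}$ be two functions in $L^1_{\mathbb{C}}(\mathbb{R}) \cap L^2_{\mathbb{C}}(\mathbb{R})$ with Fourier transforms $\hat\varphi_1$ and $\hat\varphi_2$, respectively. Then
$$\lim_{A\uparrow\infty} E\left[\left(\int_{\mathbb{R}} \varphi_1(t)X(t)\, dt\right) \left(\int_{\mathbb{R}} \varphi_2(t)X(t)\, dt\right)^*\right] = \frac{N_0}{2} \int_{\mathbb{R}} \varphi_1(t)\varphi_2^*(t)\, dt = \frac{N_0}{2} \int_{\mathbb{R}} \hat\varphi_1(\nu)\hat\varphi_2^*(\nu)\, d\nu.$$

Proof. We have
$$E\left[\left(\int_{\mathbb{R}} \varphi_1(t)X(t)\, dt\right) \left(\int_{\mathbb{R}} \varphi_2(t)X(t)\, dt\right)^*\right] = \int_{\mathbb{R}} \int_{\mathbb{R}} \varphi_1(u)\varphi_2(v)^*\, C_X(u-v)\, du\, dv.$$
The latter quantity is equal to
$$\frac{N_0}{2} \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \varphi_1(u)\varphi_2(v)^* \left(\int_{-A}^{+A} e^{2i\pi\nu(u-v)}\, d\nu\right) du\, dv$$
$$= \frac{N_0}{2} \int_{-A}^{+A} \left(\int_{-\infty}^{+\infty} \varphi_1(u)e^{2i\pi\nu u}\, du\right) \left(\int_{-\infty}^{+\infty} \varphi_2(v)^* e^{-2i\pi\nu v}\, dv\right) d\nu = \frac{N_0}{2} \int_{-A}^{+A} \hat\varphi_1(-\nu)\hat\varphi_2(-\nu)^*\, d\nu,$$
and the limit of this quantity as A ↑ ∞ is
$$\frac{N_0}{2} \int_{-\infty}^{+\infty} \hat\varphi_1(\nu)\hat\varphi_2^*(\nu)\, d\nu = \frac{N_0}{2} \int_{-\infty}^{+\infty} \varphi_1(t)\varphi_2(t)^*\, dt,$$
where the last equality is the Plancherel–Parseval identity. □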



¹ The notation $N_0/2$ comes from Physics and is a standard one in communications theory when dealing with the so-called additive white noise channels.

Let now $h : \mathbb{R} \to \mathbb{C}$ be in $L^1_{\mathbb{C}}(\mathbb{R}) \cap L^2_{\mathbb{C}}(\mathbb{R})$, and define
$$Y(t) = \int_{\mathbb{R}} h(t-s)X(s)\, ds.$$
Applying the above result with $\varphi_1(u) = h(t-u)$ and $\varphi_2(v) = h(t+\tau-v)$, we find that the covariance function $C_Y$ of this wss stochastic process is such that
$$\lim_{A\uparrow\infty} C_Y(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\, |\hat{h}(\nu)|^2\, \frac{N_0}{2}\, d\nu.$$
The limit is finite since $\hat{h} \in L^2_{\mathbb{C}}(\mathbb{R})$ and is a covariance function corresponding to a bona fide (that is, integrable) psd $f_Y(\nu) = |\hat{h}(\nu)|^2\, \frac{N_0}{2}$. With $f(\nu) = \frac{N_0}{2}$, we formally retrieve the usual filtering formula, $f_Y(\nu) = |\hat{h}(\nu)|^2 f(\nu)$.
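The passage to the limit A ↑ ∞ can be followed on a concrete filter. The sketch below again assumes the hypothetical impulse response $h(t) = e^{-t} 1_{\{t\ge 0\}}$, for which $|\hat{h}(\nu)|^2 = 1/(1 + 4\pi^2\nu^2)$ and $\int |h|^2 = 1/2$, so that the output power under band-limited noise is $\frac{N_0}{2}\cdot\frac{1}{\pi}\arctan(2\pi A)$ and tends to $N_0/4$. The value $N_0 = 2$ and the function names are assumptions for the illustration.

```python
import math


def output_power(band: float, n0: float = 2.0, steps: int = 200_000) -> float:
    """Midpoint-rule value of (N0/2) * int_{-band}^{band} |h_hat(nu)|^2 dnu."""
    step = 2.0 * band / steps
    total = 0.0
    for k in range(steps):
        nu = -band + (k + 0.5) * step
        total += 1.0 / (1.0 + 4.0 * math.pi ** 2 * nu ** 2)
    return 0.5 * n0 * step * total


def output_power_closed_form(band: float, n0: float = 2.0) -> float:
    """Same quantity via the arctangent primitive."""
    return 0.5 * n0 * math.atan(2.0 * math.pi * band) / math.pi


def limiting_power(n0: float = 2.0) -> float:
    """(N0/2) * int |h(t)|^2 dt = N0/4 for the assumed filter (Plancherel-Parseval)."""
    return 0.5 * n0 * 0.5
```

Evaluating `output_power_closed_form` for increasing values of `band` shows the output power settling at `limiting_power()`, even though the input power $N_0 A$ grows without bound.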

White Noise via the Doob–Wiener Integral

Another approach to white noise, more formal, consists in working right away “at the limit”. We do not attempt to define the white noise $\{B(t)\}_{t\in\mathbb{R}}$ directly (for good reasons since it does not exist as a bona fide wss stochastic process, as we noted earlier). Instead, we define directly the symbolic integral $\int_{\mathbb{R}} f(t)B(t)\, dt$ for integrands f to be described below, by
$$\int_{\mathbb{R}} f(t)B(t)\, dt := \sqrt{\frac{N_0}{2}} \int_{\mathbb{R}} f(t)\, dZ(t), \tag{12.9}$$
where $\{Z(t)\}_{t\in\mathbb{R}}$ is a centered stochastic process with uncorrelated increments with unit variance. We say that $\{B(t)\}_{t\in\mathbb{R}}$ is a white noise and that $\left\{\sqrt{\frac{N_0}{2}}\, Z(t)\right\}_{t\in\mathbb{R}}$ is an integrated white noise. For all $f, g \in L^2_{\mathbb{C}}(\mathbb{R})$, we have that
$$E\left[\int_{\mathbb{R}} f(t)B(t)\, dt\right] = 0,$$
and by the isometry formulas for the Doob–Wiener integral,
$$E\left[\left(\int_{\mathbb{R}} f(t)B(t)\, dt\right) \left(\int_{\mathbb{R}} g(t)B(t)\, dt\right)^*\right] = \frac{N_0}{2} \int_{\mathbb{R}} f(t)g(t)^*\, dt,$$
which can be formally rewritten, using the Dirac symbolism:
$$\int_{\mathbb{R}} \int_{\mathbb{R}} f(t)g(s)^*\, E[B(t)B^*(s)]\, dt\, ds = \frac{N_0}{2} \int_{\mathbb{R}} \int_{\mathbb{R}} f(t)g(s)^*\, \delta(t-s)\, dt\, ds.$$
Hence “the covariance function of the white noise $\{B(t)\}_{t\in\mathbb{R}}$ is a Dirac pseudo-function: $C_B(\tau) = \frac{N_0}{2}\delta(\tau)$”.

When $\{Z(t)\}_{t\in\mathbb{R}} \equiv \{W(t)\}_{t\in\mathbb{R}}$, a standard Brownian motion, $\{B(t)\}_{t\in\mathbb{R}}$ is called a Gaussian white noise. In this case, the Wiener–Doob integral is certainly not a Stieltjes–Lebesgue integral since the trajectories of the Wiener process are of unbounded variation on any finite interval (Corollary 11.2.9). Also, B(t) cannot be interpreted as the “derivative” $\frac{dW(t)}{dt}$ (Theorem 11.2.8).

Let $\{B(t)\}_{t\in\mathbb{R}}$ be a white noise with psd $N_0/2$. Let $h : \mathbb{R} \to \mathbb{C}$ be in $L^1_{\mathbb{C}} \cap L^2_{\mathbb{C}}$ and define the output of a filter with impulse response h when the white noise $\{B(t)\}_{t\in\mathbb{R}}$ is the input, by

$$Y(t) = \int_{\mathbb{R}} h(t-s)B(s)\, ds.$$
By the isometry formula for the Wiener–Doob integral,
$$E[Y(t)Y(s)^*] = \frac{N_0}{2} \int_{\mathbb{R}} h(t-s+u)\, h^*(u)\, du,$$
and therefore (Plancherel–Parseval equality)
$$C_Y(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\, |\hat{h}(\nu)|^2\, \frac{N_0}{2}\, d\nu.$$
The stochastic process $\{Y(t)\}_{t\in\mathbb{R}}$ is therefore centered and wss, with psd
$$f_Y(\nu) = |\hat{h}(\nu)|^2 f_B(\nu), \qquad \text{where } f_B(\nu) := \frac{N_0}{2}.$$

We therefore once more recover formally the fundamental equation of linear filtering of wss continuous-time stochastic processes.

The Approximate Derivative Approach

There is a third approach to white noise. The Brownian motion is approximated by the “finitesimal” derivative
$$B_h(t) = \frac{W(t+h) - W(t)}{h}$$
(here we take $N_0/2 = 1$). For fixed h > 0 this defines a proper centered wss stochastic process, with covariance function
$$C_h(\tau) = \frac{(h - |\tau|)^+}{h^2}$$
and power spectral density
$$f_h(\nu) = \left(\frac{\sin \pi\nu h}{\pi\nu h}\right)^2.$$
Note that, as h ↓ 0, the power spectral density tends to the constant function 1, the power spectral density of the “white noise”. At the same time, the covariance function “tends to the Dirac function” and the energy $C_h(0) = \frac{1}{h}$ tends to infinity. This is another feature of white noise: unpredictability. Indeed, for τ ≥ h, the value $B_h(t+\tau)$ cannot be predicted from the value $B_h(t)$, since both are independent random variables.

The connection with the second approach is the following. For all $f \in L^2_{\mathbb{C}}(\mathbb{R}_+) \cap L^1_{\mathbb{C}}(\mathbb{R}_+)$,
$$\lim_{h\downarrow 0} \int_{\mathbb{R}_+} f(t)B_h(t)\, dt = \int_{\mathbb{R}_+} f(t)\, dW(t)$$
in the quadratic mean. The proof is required in Exercise 12.4.4.
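The pair (covariance, spectral density) of the approximate derivative can be verified numerically: the Fourier transform of the triangular covariance is the squared cardinal sine, and it flattens to 1 as h ↓ 0. The following sketch uses a midpoint rule on [0, h] and the evenness of the covariance; step counts and names are assumptions for this example.

```python
import math


def psd_closed_form(nu: float, h: float) -> float:
    """(sin(pi*nu*h) / (pi*nu*h))^2, with the value 1 at nu*h = 0."""
    x = math.pi * nu * h
    return 1.0 if x == 0.0 else (math.sin(x) / x) ** 2


def psd_numeric(nu: float, h: float, steps: int = 20_000) -> float:
    """Midpoint value of 2 * int_0^h (h - tau)/h^2 * cos(2*pi*nu*tau) dtau."""
    step = h / steps
    total = 0.0
    for k in range(steps):
        tau = (k + 0.5) * step
        total += (h - tau) / h ** 2 * math.cos(2.0 * math.pi * nu * tau)
    return 2.0 * step * total
```

For a fixed frequency, calling `psd_closed_form(nu, h)` with smaller and smaller h shows the density approaching the constant 1, while $C_h(0) = 1/h$ blows up.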

12.2 Fourier Analysis of the Trajectories

12.2.1 The Cramér–Khintchin Decomposition

If one seeks a Fourier transform of the trajectories of a non-trivial wide-sense stationary process, there is a priori little chance that it will be a classical one, say, in $L^1$ or $L^2$ (Remark 12.1.1). However, under quite general conditions there exists a kind of Fourier decomposition of the trajectories of a wide-sense stationary process.

Theorem 12.2.1 Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered wss stochastic process, continuous in the quadratic mean and with power spectral measure μ. There exists a unique centered stochastic process $\{x(\nu)\}_{\nu\in\mathbb{R}}$ with uncorrelated increments and with structural measure μ such that P-a.s.
$$X(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\, dx(\nu) \qquad (t \in \mathbb{R}), \tag{12.10}$$
where the integral of the right-hand side is a Doob integral. Uniqueness is in the following sense: If there exists another centered stochastic process $\{\tilde{x}(\nu)\}_{\nu\in\mathbb{R}}$ with uncorrelated increments and with finite structural measure $\tilde\mu$, such that for all $t \in \mathbb{R}$, we have P-a.s., $X(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\, d\tilde{x}(\nu)$, then for all $a, b \in \mathbb{R}$, a ≤ b, $\tilde{x}(b) - \tilde{x}(a) = x(b) - x(a)$, P-a.s.

We will occasionally say: “dx(ν) is the Cramér–Khinchin decomposition” of the wss stochastic process.

Proof. 1. Denote by H(X) the vector subspace of $L^2_{\mathbb{C}}(P)$ formed by the finite complex linear combinations of the type
$$Z = \sum_{k=1}^K \lambda_k X(t_k),$$
and by φ the mapping of H(X) into $L^2_{\mathbb{C}}(\mu)$ defined by
$$\varphi : Z \mapsto \sum_{k=1}^K \lambda_k e^{2i\pi\nu t_k}.$$
We verify that it is a linear isometry of H(X) into $L^2_{\mathbb{C}}(\mu)$. In fact,
$$E\left[\left|\sum_{k=1}^K \lambda_k X(t_k)\right|^2\right] = \sum_{k=1}^K \sum_{\ell=1}^K \lambda_k \lambda_\ell^*\, E[X(t_k)X(t_\ell)^*] = \sum_{k=1}^K \sum_{\ell=1}^K \lambda_k \lambda_\ell^*\, C(t_k - t_\ell),$$
and using Bochner's theorem, this quantity is equal to
$$\sum_{k=1}^K \sum_{\ell=1}^K \lambda_k \lambda_\ell^* \int_{\mathbb{R}} e^{2i\pi\nu(t_k - t_\ell)}\, \mu(d\nu) = \int_{\mathbb{R}} \left(\sum_{k=1}^K \sum_{\ell=1}^K \lambda_k \lambda_\ell^*\, e^{2i\pi\nu(t_k - t_\ell)}\right) \mu(d\nu) = \int_{\mathbb{R}} \left|\sum_{k=1}^K \lambda_k e^{2i\pi\nu t_k}\right|^2 \mu(d\nu).$$

2. This isometric linear mapping can be uniquely extended to an isometric linear mapping (Theorem C.3.2), that we shall continue to call φ, from $\bar{H}(X)$, the closure of H(X), into $L^2_{\mathbb{C}}(\mu)$. As the combinations $\sum_{k=1}^K \lambda_k e^{2i\pi\nu t_k}$ are dense in $L^2_{\mathbb{C}}(\mu)$ when μ is a finite measure, φ is onto. Therefore, it is a linear isometric bijection between $\bar{H}(X)$ and $L^2_{\mathbb{C}}(\mu)$.

3. We shall define $x(\nu_0)$ to be the random variable in $\bar{H}(X)$ that corresponds in this isometry to the function $1_{(-\infty,\nu_0]}(\nu)$ of $L^2_{\mathbb{C}}(\mu)$. First, we observe that $E[x(\nu_2) - x(\nu_1)] = 0$ since $\bar{H}(X)$ is the closure in $L^2_{\mathbb{C}}(P)$ of a family of centered random variables. Also, by isometry,
$$E[(x(\nu_2) - x(\nu_1))(x(\nu_4) - x(\nu_3))^*] = \int_{\mathbb{R}} 1_{(\nu_1,\nu_2]}(\nu)\, 1_{(\nu_3,\nu_4]}(\nu)\, \mu(d\nu) = \mu((\nu_1,\nu_2] \cap (\nu_3,\nu_4]).$$
We can therefore define the Doob integral $\int_{\mathbb{R}} f(\nu)\, dx(\nu)$ for all $f \in L^2_{\mathbb{C}}(\mu)$.

4. Let now
$$Z_n(t) := \sum_{k\in\mathbb{Z}} e^{2i\pi t(k/2^n)} \left(x\!\left(\frac{k+1}{2^n}\right) - x\!\left(\frac{k}{2^n}\right)\right).$$
We have
$$\lim_{n\to\infty} Z_n(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\, dx(\nu)$$
(limit in $L^2_{\mathbb{C}}(P)$) because
$$Z_n(t) = \int_{\mathbb{R}} f_n(t,\nu)\, dx(\nu), \qquad \text{where } f_n(t,\nu) = \sum_{k\in\mathbb{Z}} e^{2i\pi t(k/2^n)}\, 1_{(k/2^n,(k+1)/2^n]}(\nu),$$
and therefore, by isometry,
$$E\left[\left|Z_n(t) - \int_{\mathbb{R}} e^{2i\pi\nu t}\, dx(\nu)\right|^2\right] = \int_{\mathbb{R}} \left|e^{2i\pi\nu t} - f_n(t,\nu)\right|^2 \mu(d\nu),$$
a quantity which tends to zero when n tends to infinity (by dominated convergence, using the fact that μ is a bounded measure). On the other hand, by definition of φ,
$$Z_n(t) \overset{\varphi}{\to} f_n(t,\nu).$$
Since, for fixed t, $\lim_{n\to\infty} Z_n(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\, dx(\nu)$ in $L^2_{\mathbb{C}}(P)$ and $\lim_{n\to\infty} f_n(t,\nu) = e^{2i\pi\nu t}$ in $L^2_{\mathbb{C}}(\mu)$,
$$\int_{\mathbb{R}} e^{2i\pi\nu t}\, dx(\nu) \overset{\varphi}{\to} e^{2i\pi\nu t}.$$
But, by definition of φ,
$$X(t) \overset{\varphi}{\to} e^{2i\pi\nu t}.$$
Therefore $X(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\, dx(\nu)$.

5. We now prove uniqueness. Suppose that there exists another spectral decomposition $d\tilde{x}(\nu)$. Denote by G the set of finite linear combinations of complex exponentials. Since by hypothesis
$$\int_{\mathbb{R}} e^{2i\pi\nu t}\, dx(\nu) = \int_{\mathbb{R}} e^{2i\pi\nu t}\, d\tilde{x}(\nu) \qquad (= X(t)),$$
we have
$$\int_{\mathbb{R}} f(\nu)\, dx(\nu) = \int_{\mathbb{R}} f(\nu)\, d\tilde{x}(\nu)$$
for all f ∈ G, and therefore, for all $f \in L^2_{\mathbb{C}}(\mu) \cap L^2_{\mathbb{C}}(\tilde\mu) \subseteq L^2_{\mathbb{C}}(\frac{1}{2}(\mu + \tilde\mu))$ because G is dense in $L^2_{\mathbb{C}}(\frac{1}{2}(\mu + \tilde\mu))$. In particular, with $f = 1_{(a,b]}$,
$$x(b) - x(a) = \tilde{x}(b) - \tilde{x}(a).$$
□

More details can be obtained as to the continuity properties (in quadratic mean) of the increments of the spectral decomposition. For instance, it is right-continuous in quadratic mean, and it admits a left-hand limit in quadratic mean at any point $\nu \in \mathbb{R}$. If such limit is denoted by x(ν−), then, for all $a \in \mathbb{R}$,
$$E\left[|x(a) - x(a-)|^2\right] = \mu(\{a\}).$$

Proof. The right-continuity follows from the continuity of the (finite) measure μ:
$$\lim_{h\downarrow 0} E\left[|x(a+h) - x(a)|^2\right] = \lim_{h\downarrow 0} \mu((a, a+h]) = \mu(\emptyset) = 0.$$

As for the existence of left-hand limits, it is guaranteed by the Cauchy criterion, since for all $a \in \mathbb{R}$, lim_{h,h′↓0, h […] 1/(2B),
$$X(t) = \lim_{N\uparrow\infty} \sum_{n=-N}^{+N} X(nT)\, \frac{\sin\left(\frac{\pi}{T}(t - nT)\right)}{\frac{\pi}{T}(t - nT)}, \tag{12.11}$$
where the limit is in $L^2_{\mathbb{C}}(P)$.

Proof. We have
$$X(t) = \int_{[-B,+B]} e^{2i\pi\nu t}\, dx(\nu).$$
Now,
$$e^{2i\pi\nu t} = \lim_{N\uparrow\infty} \sum_{n=-N}^{+N} e^{2i\pi\nu nT}\, \frac{\sin\left(\frac{\pi}{T}(t - nT)\right)}{\frac{\pi}{T}(t - nT)},$$
where the limit is uniform in [−B, +B] and bounded. Therefore the above limit is also in $L^2_{\mathbb{C}}(\mu)$ because μ is a finite measure. Consequently
$$X(t) = \lim_{N\uparrow\infty} \int_{[-B,+B]} \left\{\sum_{n=-N}^{+N} e^{2i\pi\nu nT}\, \frac{\sin\left(\frac{\pi}{T}(t - nT)\right)}{\frac{\pi}{T}(t - nT)}\right\} dx(\nu),$$
where the limit is in $L^2_{\mathbb{C}}(P)$. The result then follows by expanding the integral with respect to the sum. □
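The key analytic fact in this proof, the cardinal-sine expansion of $e^{2i\pi\nu t}$ for frequencies inside the band, can be observed numerically. The sketch below assumes a band limit B = 0.5 and a sampling period T = 0.5, so that 2BT < 1; these values, the truncation order and the function names are hypothetical choices for the illustration.

```python
import cmath
import math


def sinc_kernel(t: float, n: int, period: float) -> float:
    """sin(pi/T * (t - nT)) / (pi/T * (t - nT)), with the value 1 at t = nT."""
    x = math.pi * (t - n * period) / period
    return 1.0 if x == 0.0 else math.sin(x) / x


def shannon_partial_sum(t: float, nu: float, period: float, terms: int) -> complex:
    """Partial sum over n = -terms..terms of exp(2i*pi*nu*n*T) * sinc_kernel(t, n, T)."""
    return sum(
        cmath.exp(2j * math.pi * nu * n * period) * sinc_kernel(t, n, period)
        for n in range(-terms, terms + 1)
    )


def expansion_error(t: float, nu: float, period: float, terms: int) -> float:
    """Distance between the partial sum and exp(2i*pi*nu*t)."""
    target = cmath.exp(2j * math.pi * nu * t)
    return abs(shannon_partial_sum(t, nu, period, terms) - target)
```

With ν = 0.3 (inside the assumed band), t = 0.37 and T = 0.5, the error decreases slowly, roughly like 1/N, as the number of terms grows; this slow decay is why the limit is only taken in the $L^2$ sense for the process itself.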

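12.2.2 A Plancherel–Parseval Formula

The following result is the analog of the Plancherel–Parseval formula of classical Fourier analysis.

Theorem 12.2.6 Let $f : \mathbb{R} \to \mathbb{C}$ be in $L^1_{\mathbb{C}}(\mathbb{R})$ with Fourier transform $\hat{f}$. Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered wss stochastic process with power spectral measure μ and Cramér–Khintchin spectral decomposition dx(ν). Then:
$$\int_{\mathbb{R}} \hat{f}(\nu)^*\, dx(\nu) = \int_{\mathbb{R}} f(t)^* X(t)\, dt. \tag{12.12}$$

Proof. The function $\hat{f}$ is bounded and continuous (as the Fourier transform of an integrable function) and μ is a finite measure, so that $\hat{f} \in L^2_{\mathbb{C}}(\mu)$ and
$$\sum_{k=-n2^n}^{n2^n-1} \hat{f}\!\left(\frac{k}{2^n}\right) 1_{\left(\frac{k}{2^n}, \frac{k+1}{2^n}\right]} \to \hat{f} \quad \text{in } L^2_{\mathbb{C}}(\mu).$$
Therefore (all limits in the following sequence of equalities being in $L^2_{\mathbb{C}}(P)$):
$$\int_{\mathbb{R}} \hat{f}(\nu)^*\, dx(\nu) = \lim_{n\to\infty} \sum_{k=-n2^n}^{n2^n-1} \hat{f}\!\left(\frac{k}{2^n}\right)^* \left(x\!\left(\frac{k+1}{2^n}\right) - x\!\left(\frac{k}{2^n}\right)\right)$$
$$= \lim_{n\to\infty} \sum_{k=-n2^n}^{n2^n-1} \left(\int_{\mathbb{R}} f^*(t)\, e^{+2i\pi(k/2^n)t}\, dt\right) \left(x\!\left(\frac{k+1}{2^n}\right) - x\!\left(\frac{k}{2^n}\right)\right)$$
$$= \lim_{n\to\infty} \int_{\mathbb{R}} f^*(t) \left\{\sum_{k=-n2^n}^{n2^n-1} e^{+2i\pi(k/2^n)t} \left(x\!\left(\frac{k+1}{2^n}\right) - x\!\left(\frac{k}{2^n}\right)\right)\right\} dt = \lim_{n\to\infty} \int_{\mathbb{R}} f^*(t)X_n(t)\, dt,$$
where
$$X_n(t) = \sum_{k=-n2^n}^{n2^n-1} e^{+2i\pi(k/2^n)t} \left(x\!\left(\frac{k+1}{2^n}\right) - x\!\left(\frac{k}{2^n}\right)\right) \to X(t) \quad \text{in } L^2_{\mathbb{C}}(P).$$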

The announced result will then follow once we prove that
$$\lim_{n\to\infty} \int_{\mathbb{R}} f^*(t)X_n(t)\, dt = \int_{\mathbb{R}} f^*(t)X(t)\, dt,$$
where the limit is in $L^2_{\mathbb{C}}(P)$. In fact, with $Y_n(t) = X(t) - X_n(t)$,
$$E\left[\left|\int_{\mathbb{R}} f(t)Y_n(t)\, dt\right|^2\right] = \int_{\mathbb{R}} \int_{\mathbb{R}} f(t)f(s)^*\, E[Y_n(t)Y_n(s)^*]\, dt\, ds.$$
But for all $t \in \mathbb{R}$, $\lim_{n\uparrow\infty} Y_n(t) = 0$ (in $L^2_{\mathbb{C}}(P)$) and therefore $\lim_{n\uparrow\infty} E[Y_n(t)Y_n(s)^*] = 0$. Moreover, $E[Y_n(t)Y_n(s)^*]$ is uniformly bounded in n. Therefore, by dominated convergence,
$$\lim_{n\uparrow\infty} \int_{\mathbb{R}} \int_{\mathbb{R}} f(t)f(s)^*\, E[Y_n(t)Y_n(s)^*]\, dt\, ds = 0.$$
□

Example 12.2.7: Convolutional filtering. Let $h \in L^1_{\mathbb{C}}(\mathbb{R})$ and let $\hat{h}$ be its Fourier transform. Then
$$\int_{\mathbb{R}} h(t-s)X(s)\, ds = \int_{\mathbb{R}} \hat{h}(\nu)\, e^{2i\pi\nu t}\, dx(\nu). \tag{12.13}$$

Proof. It suffices to apply (12.12) to the function $s \mapsto h^*(t-s)$, whose Fourier transform is $\hat{h}(\nu)^* e^{-2i\pi\nu t}$. □

12.2.3 Linear Operations

A function $g : \mathbb{R} \to \mathbb{C}$ in $L^2_{\mathbb{C}}(\mu)$ defines a linear operation on the centered wss stochastic process $\{X(t)\}_{t\in\mathbb{R}}$ (called the input) by associating with it the centered stochastic process (called the output)
$$Y(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\, g(\nu)\, dx(\nu). \tag{12.14}$$
On the other hand, the calculation of the covariance function $C_Y(\tau) = E[Y(t)Y(t+\tau)^*]$ of the output gives (isometry formula for Doob's integral),
$$C_Y(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\, |g(\nu)|^2\, \mu_X(d\nu),$$
where $\mu_X$ is the power spectral measure of the input. The power spectral measure of the output process is therefore
$$\mu_Y(d\nu) = |g(\nu)|^2\, \mu_X(d\nu). \tag{12.15}$$
This is similar to the formula (12.8) obtained for the output of a stable convolutional filter with impulse response h. One then says that g is the transmittance of the “filter” (12.14). Note however that this filter is not necessarily of the convolutional type, since g may well not be the Fourier transform of an integrable function (for instance it may be unbounded, as the next example shows).

Example 12.2.8: Differentiation. Let $\{X(t)\}_{t\in\mathbb{R}}$ be a wss stochastic process with spectral measure $\mu_X$ such that
$$\int_{\mathbb{R}} |\nu|^2\, \mu_X(d\nu) < \infty. \tag{12.16}$$
Then
$$\lim_{h\to 0} \frac{X(t+h) - X(t)}{h} = \int_{\mathbb{R}} (2i\pi\nu)\, e^{2i\pi\nu t}\, dx(\nu),$$
where the limit is in the quadratic mean. The linear operation corresponding to the transmittance g(ν) = 2iπν is therefore the differentiation in quadratic mean.

Proof. Let $h \in \mathbb{R}$, h ≠ 0. From the equality
$$\frac{X(t+h) - X(t)}{h} - \int_{\mathbb{R}} (2i\pi\nu)\, e^{2i\pi\nu t}\, dx(\nu) = \int_{\mathbb{R}} e^{2i\pi\nu t} \left(\frac{e^{2i\pi\nu h} - 1}{h} - 2i\pi\nu\right) dx(\nu)$$
we have, by isometry,
$$\lim_{h\to 0} E\left[\left|\frac{X(t+h) - X(t)}{h} - \int_{\mathbb{R}} (2i\pi\nu)\, e^{2i\pi\nu t}\, dx(\nu)\right|^2\right] = \lim_{h\to 0} \int_{\mathbb{R}} \left|\frac{e^{2i\pi\nu h} - 1}{h} - 2i\pi\nu\right|^2 \mu_X(d\nu).$$
In view of hypothesis (12.16) and since $\left|\frac{e^{2i\pi\nu h} - 1}{h} - 2i\pi\nu\right|^2 \le 16\pi^2\nu^2$ (because $|e^{ix} - 1| \le |x|$ for real x), the latter limit is 0, by dominated convergence. □
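For a spectral measure made of finitely many atoms, the mean-square error of the difference quotient is a finite sum and can be evaluated directly. The sketch below computes $\int |(e^{2i\pi\nu h} - 1)/h - 2i\pi\nu|^2\, \mu_X(d\nu)$ for a hypothetical line spectrum; the frequencies, the weights and the function name are assumptions for this example.

```python
import cmath
import math


def difference_quotient_error(h: float, freqs, powers) -> float:
    """Mean-square error of (X(t+h) - X(t))/h as an approximation of the
    quadratic-mean derivative, for mu_X = sum_k P_k * delta_{nu_k}."""
    total = 0.0
    for nu, p in zip(freqs, powers):
        gap = (cmath.exp(2j * math.pi * nu * h) - 1.0) / h - 2j * math.pi * nu
        total += p * abs(gap) ** 2
    return total


# Hypothetical line spectrum used for the illustration.
FREQS = [-2.0, 0.5, 3.0]
POWERS = [1.0, 0.5, 0.25]
errors = [difference_quotient_error(h, FREQS, POWERS) for h in (1e-2, 1e-3, 1e-4)]
```

The entries of `errors` decrease roughly like h², in agreement with a second-order Taylor expansion of the exponential.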

“A line spectrum corresponds to a combination of sinusoids.” More precisely:

Theorem 12.2.9 Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered wss stochastic process with spectral measure
$$\mu_X(d\nu) = \sum_{k\in\mathbb{Z}} P_k\,\varepsilon_{\nu_k}(d\nu),$$
where $\varepsilon_{\nu_k}$ is the Dirac measure at $\nu_k \in \mathbb{R}$, $P_k \in \mathbb{R}_+$ and $\sum_{k\in\mathbb{Z}} P_k < \infty$. Then
$$X(t) = \sum_{k\in\mathbb{Z}} U_k e^{2i\pi\nu_k t},$$
where $\{U_k\}_{k\in\mathbb{Z}}$ is a sequence of centered uncorrelated square-integrable complex variables, and $E[|U_k|^2] = P_k$.

Proof. The function
$$g(\nu) = \sum_{k\in\mathbb{Z}} 1_{\{\nu_k\}}(\nu)$$
is in $L^2_{\mathbb{C}}(\mu_X)$, as well as the function $1-g$. Also $\int_{\mathbb{R}} |1-g(\nu)|^2\,\mu_X(d\nu) = 0$, and in particular $\int_{\mathbb{R}} (1-g(\nu))e^{2i\pi\nu t}\,dx(\nu) = 0$. Therefore
$$X(t) = \int_{\mathbb{R}} g(\nu)e^{2i\pi\nu t}\,dx(\nu) = \sum_{k\in\mathbb{Z}} e^{2i\pi\nu_k t}\,(x(\nu_k) - x(\nu_k-)).$$
The conclusion follows by defining $U_k := x(\nu_k) - x(\nu_k-)$. □

Linear Transformations of Gaussian Processes

Definition 12.2.10 A linear transformation of a wss stochastic process $\{X(t)\}_{t\in\mathbb{R}}$ is a transformation of it into the second-order process (not wss in general)
$$Y(t) = \int_{\mathbb{R}} g(t,\nu)\,dx(\nu), \tag{12.17}$$
where
$$\int_{\mathbb{R}} |g(t,\nu)|^2\,\mu_X(d\nu) < \infty$$
for all $t \in \mathbb{R}$.

Theorem 12.2.11 A linear transformation of a Gaussian wss stochastic process yields a Gaussian stochastic process.

Proof. Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered Gaussian wss process with Cramér–Khinchin decomposition $dx(\nu)$. For each $\nu \in \mathbb{R}$, the random variable $x(\nu)$ is in $H_{\mathbb{R}}(X)$, by construction. Now, if $\{X(t)\}_{t\in\mathbb{R}}$ is a Gaussian process, $H_{\mathbb{R}}(X)$ is a Gaussian subspace. But (Theorem 12.2.3) $H_{\mathbb{R}}(X) = H_{\mathbb{R}}(x)$. Therefore the process (12.17) is in $H_{\mathbb{C}}(X)$, hence Gaussian. □

Example 12.2.12: Convolutional filtering of a wss Gaussian process. In particular, if $\{X(t)\}_{t\in\mathbb{R}}$ is a Gaussian wss process with Cramér–Khinchin decomposition $dx(\nu)$ and if $g \in L^2_{\mathbb{C}}(\mu_X)$, the process
$$Y(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}g(\nu)\,dx(\nu) \quad (t \in \mathbb{R})$$
is Gaussian. A particular case is when $g = \hat h$, the Fourier transform of a filter with integrable impulse response $h$. The stochastic process $\{Y(t)\}_{t\in\mathbb{R}}$ is the one obtained by convolutional filtering of $\{X(t)\}_{t\in\mathbb{R}}$ with this filter.
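Theorem 12.2.9 can be illustrated on a finite line spectrum. The sketch below is a toy model and not the general construction: it assumes $U_k = \sqrt{P_k}\,s_k$ with independent fair signs $s_k \in \{-1,+1\}$, which are centered, uncorrelated and satisfy $E[|U_k|^2] = P_k$. It computes $E[X(t+\tau)X(t)^*]$ exactly by enumerating all sign patterns and compares it with $\sum_k P_k e^{2i\pi\nu_k\tau}$. The function names are hypothetical.

```python
import cmath
from itertools import product

def line_spectrum_covariance(freqs, powers, t, tau):
    """Exact E[X(t+tau) X(t)^*] for X(t) = sum_k U_k exp(2 i pi nu_k t),
    where U_k = sqrt(P_k) * s_k and the s_k are independent fair signs."""
    n = len(freqs)
    total = 0j
    for signs in product((-1, 1), repeat=n):
        def x(time):
            return sum(s * p ** 0.5 * cmath.exp(2j * cmath.pi * nu * time)
                       for s, p, nu in zip(signs, powers, freqs))
        total += x(t + tau) * x(t).conjugate()
    return total / 2 ** n  # each sign pattern has probability 2^{-n}

def spectral_formula(freqs, powers, tau):
    """C_X(tau) = sum_k P_k exp(2 i pi nu_k tau), the Fourier transform of the line spectrum."""
    return sum(p * cmath.exp(2j * cmath.pi * nu * tau) for p, nu in zip(powers, freqs))

freqs = [0.5, 1.25, -2.0]    # illustrative frequencies nu_k
powers = [1.0, 0.3, 2.0]     # illustrative powers P_k
tau = 0.4
cov_at_0 = line_spectrum_covariance(freqs, powers, 0.0, tau)
cov_at_t = line_spectrum_covariance(freqs, powers, 0.7, tau)
```

The two covariances agree with each other (no dependence on $t$) and with `spectral_formula(freqs, powers, tau)`, as the theorem predicts.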

12.3 Multivariate wss Stochastic Processes

12.3.1 The Power Spectral Matrix

Let
$$X(t) = (X_1(t), \ldots, X_L(t)) \quad (t \in \mathbb{R})$$
be a stochastic process with values in $E := \mathbb{C}^L$, where $L$ is an integer greater than or equal to 2. This process is assumed centered and of the second order, that is,
$$E[\|X(t)\|^2] < \infty \quad (t \in \mathbb{R}).$$

Furthermore, it will be assumed that it is wide-sense stationary, in the sense that the mean vector of $X(t)$ and the cross-covariance matrix of the vectors $X(t+\tau)$ and $X(t)$ do not depend upon $t$. The matrix-valued function $C$ defined by
$$C(\tau) := \mathrm{cov}\,(X(t+\tau), X(t)) \tag{12.18}$$
is called the (matrix) covariance function of the stochastic process. Its general entry is $C_{ij}(\tau) = \mathrm{cov}(X_i(t+\tau), X_j(t))$. The processes $\{X_i(t)\}_{t\in\mathbb{R}}$ ($1 \le i \le L$) are wss stochastic processes and, in addition, they are stationarily correlated or “jointly wss”. Such a vector-valued stochastic process $\{X(t)\}_{t\in\mathbb{R}}$ is called a multivariate wss stochastic process.

Example 12.3.1: Signal plus noise. The following model frequently appears in signal processing:
$$Y(t) = S(t) + B(t),$$
where $\{S(t)\}_{t\in\mathbb{R}}$ and $\{B(t)\}_{t\in\mathbb{R}}$ are two uncorrelated centered wss stochastic processes with respective covariance functions $C_S$ and $C_B$. Then $\{(Y(t), S(t))^T\}_{t\in\mathbb{R}}$ is a bivariate wss stochastic process. Owing to the assumption of non-correlation,
$$C(\tau) = \begin{pmatrix} C_S(\tau) + C_B(\tau) & C_S(\tau) \\ C_S(\tau) & C_S(\tau) \end{pmatrix}.$$

Theorem 12.3.2 Let $\{X(t)\}_{t\in\mathbb{R}}$ be an $L$-dimensional multivariate wss stochastic process. For all $r, s$ ($1 \le r, s \le L$) there exists a finite complex measure $\mu_{rs}$ such that
$$C_{rs}(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,\mu_{rs}(d\nu). \tag{12.19}$$

Proof. Say $r = 1$, $s = 2$. Let us consider the stochastic processes
$$Y(t) = X_1(t) + X_2(t), \qquad Z(t) = iX_1(t) + X_2(t).$$
These are wss stochastic processes with respective covariance functions
$$C_Y(\tau) = C_1(\tau) + C_2(\tau) + C_{12}(\tau) + C_{21}(\tau),$$
$$C_Z(\tau) = C_1(\tau) + C_2(\tau) + iC_{12}(\tau) - iC_{21}(\tau).$$
From these two equalities we deduce
$$C_{12}(\tau) = \frac{1}{2}\left\{[C_Y(\tau) - C_1(\tau) - C_2(\tau)] - i[C_Z(\tau) - C_1(\tau) - C_2(\tau)]\right\},$$
from which the result follows with
$$\mu_{12} = \frac{1}{2}\left\{[\mu_Y - \mu_1 - \mu_2] - i[\mu_Z - \mu_1 - \mu_2]\right\}. \qquad \square$$
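The algebra behind the proof of Theorem 12.3.2 is a polarization identity: a cross term is recovered from the "powers" of $a+b$ and $ia+b$. The following sketch checks the identity on two arbitrary complex vectors, with the Hermitian product $\langle u, v\rangle = \sum_j u_j v_j^*$ standing in for the covariance $E[X_1 X_2^*]$. The vectors and function names are illustrative assumptions.

```python
def inner(u, v):
    """Hermitian product <u, v> = sum_j u_j * conj(v_j)."""
    return sum(a * b.conjugate() for a, b in zip(u, v))

def polarization(a, b):
    """Recover <a, b> from the squared norms of a, b, a + b and i a + b."""
    y = [p + q for p, q in zip(a, b)]          # plays the role of Y = X1 + X2
    z = [1j * p + q for p, q in zip(a, b)]     # plays the role of Z = i X1 + X2
    n_a, n_b = inner(a, a), inner(b, b)
    return 0.5 * ((inner(y, y) - n_a - n_b) - 1j * (inner(z, z) - n_a - n_b))

a = [1 + 2j, -0.5j, 3.0 + 0j]
b = [0.25 - 1j, 2 + 2j, -1j]
direct = inner(a, b)
recovered = polarization(a, b)
```

Here `recovered` coincides with `direct` up to rounding, which is the finite-dimensional counterpart of the expression of $C_{12}$ in terms of $C_Y$, $C_Z$, $C_1$ and $C_2$.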


The matrix $M := \{\mu_{ij}\}_{1\le i,j\le k}$ (whose entries are finite complex measures) is the interspectral power measure matrix of the multivariate wss stochastic process $\{X(t)\}_{t\in\mathbb{R}}$. It is clear that for all $z = (z_1, \ldots, z_k) \in \mathbb{C}^k$, $U(t) = z^T X(t)$ defines a wss stochastic process with spectral measure $\mu_U = z M z^\dagger$ ($\dagger$ means transpose conjugate).

The link between the interspectral measure $\mu_{12}$ and the Cramér–Khinchin decompositions $dx_1(\nu)$ and $dx_2(\nu)$ is the following:
$$E[(x_1(\nu_2) - x_1(\nu_1))(x_2(\nu_4) - x_2(\nu_3))^*] = \mu_{12}((\nu_1, \nu_2] \cap (\nu_3, \nu_4]).$$
This is a particular case of the following result. For all functions $g_i : \mathbb{R} \to \mathbb{C}$, $g_i \in L^2_{\mathbb{C}}(\mu_i)$ ($i = 1, 2$),
$$E\left[\left(\int_{\mathbb{R}} g_1(\nu)\,dx_1(\nu)\right)\left(\int_{\mathbb{R}} g_2(\nu)\,dx_2(\nu)\right)^*\right] = \int_{\mathbb{R}} g_1(\nu)g_2(\nu)^*\,\mu_{12}(d\nu). \tag{12.20}$$
Indeed, equality (12.20) is true for $g_1(\nu) = e^{2i\pi t_1\nu}$ and $g_2(\nu) = e^{2i\pi t_2\nu}$ since it then reduces to
$$E[X_1(t_1)X_2(t_2)^*] = \int_{\mathbb{R}} e^{2i\pi(t_1 - t_2)\nu}\,\mu_{12}(d\nu).$$
This is therefore verified for $g_1 \in \mathcal{E}$, $g_2 \in \mathcal{E}$, where $\mathcal{E}$ is the set of finite linear combinations of functions of the type $\nu \mapsto e^{2i\pi t\nu}$, $t \in \mathbb{R}$. But $\mathcal{E}$ is dense in $L^2_{\mathbb{C}}(\mu_i)$ ($i = 1, 2$), and therefore the equality (12.20) is true for all $g_i \in L^2_{\mathbb{C}}(\mu_i)$ ($i = 1, 2$).

Theorem 12.3.3 The interspectral measure $\mu_{12}$ is absolutely continuous with respect to each spectral measure $\mu_1$ and $\mu_2$.

Proof. This means that $\mu_{12}(A) = 0$ whenever $\mu_1(A) = 0$ or $\mu_2(A) = 0$. Indeed,
$$\mu_{12}(A) = E\left[\left(\int_A dx_1\right)\left(\int_A dx_2\right)^*\right]$$
and $\mu_1(A) = 0$ implies $\int_A dx_1 = 0$ since
$$E\left[\left|\int_A dx_1\right|^2\right] = \mu_1(A). \qquad \square$$

Therefore, every spectral measure $\mu_{ij}$ is absolutely continuous with respect to the trace of the power spectral measure matrix
$$\mathrm{Tr}\,M := \sum_{j=1}^{k} \mu_j.$$
By the Radon–Nikodým theorem there exists a function $g_{ij} : \mathbb{R} \to \mathbb{C}$ such that
$$\mu_{ij}(A) = \int_A g_{ij}(\nu)\,\mathrm{Tr}\,M(d\nu).$$
The matrix $g(\nu) = \{g_{ij}(\nu)\}_{1\le i,j\le k}$ is called the canonical spectral density matrix of $\{X(t)\}_{t\in\mathbb{R}}$. It should be stressed that the stochastic processes $\{X_i(t)\}_{t\in\mathbb{R}}$, $1 \le i \le k$, are not required to admit power spectral densities. The correlation matrix $C(\tau)$ has, with the above notation, the representation
$$C(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau} g(\nu)\,\mathrm{Tr}\,M(d\nu).$$
If each one among the wss stochastic processes $\{X_i(t)\}_{t\in\mathbb{R}}$ admits a spectral density, $\{X(t)\}_{t\in\mathbb{R}}$ admits an interspectral density matrix $f(\nu) = \{f_{ij}(\nu)\}_{1\le i,j\le k}$, that is,
$$C_{ij}(\tau) = \mathrm{cov}\,(X_i(t+\tau), X_j(t)) = \int_{\mathbb{R}} e^{2i\pi\nu\tau} f_{ij}(\nu)\,d\nu.$$

Example 12.3.4: Interferences. Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered wss stochastic process with power spectral measure $\mu_X$. Let $h_1, h_2 : \mathbb{R} \to \mathbb{C}$ be integrable functions with respective Fourier transforms $\hat h_1$ and $\hat h_2$, denoted below by $T_1$ and $T_2$. Define for $i = 1, 2$,
$$Y_i(t) := \int_{\mathbb{R}} h_i(t-s)X(s)\,ds.$$
The wss stochastic processes $\{Y_1(t)\}_{t\in\mathbb{R}}$ and $\{Y_2(t)\}_{t\in\mathbb{R}}$ are stationarily correlated. In fact (assuming that they are centered, without loss of generality),
$$\begin{aligned}
E[Y_1(t+\tau)Y_2(t)^*] &= E\left[\left(\int_{\mathbb{R}} h_1(t+\tau-s)X(s)\,ds\right)\left(\int_{\mathbb{R}} h_2(t-s)X(s)\,ds\right)^*\right] \\
&= \int_{\mathbb{R}}\int_{\mathbb{R}} h_1(t+\tau-u)\,h_2^*(t-v)\,C_X(u-v)\,du\,dv \\
&= \int_{\mathbb{R}}\int_{\mathbb{R}} h_1(\tau-u)\,h_2^*(-v)\,C_X(u-v)\,du\,dv,
\end{aligned}$$
and this quantity depends only upon $\tau$. Replacing $C_X(u-v)$ by its expression in terms of the spectral measure $\mu_X$, one obtains
$$C_{Y_1Y_2}(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,T_1(\nu)T_2^*(\nu)\,\mu_X(d\nu).$$
The power spectral matrix of the bivariate process $\{Y_1(t), Y_2(t)\}_{t\in\mathbb{R}}$ is therefore
$$\mu_Y(d\nu) = \begin{pmatrix} |T_1(\nu)|^2 & T_1(\nu)T_2^*(\nu) \\ T_1^*(\nu)T_2(\nu) & |T_2(\nu)|^2 \end{pmatrix}\mu_X(d\nu).$$

12.3.2 Band-pass Stochastic Processes

Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered wss stochastic process with power spectral measure $\mu_X$ and Cramér–Khinchin decomposition $dx(\nu)$. This process is assumed real, and therefore
$$\mu_X(-d\nu) = \mu_X(d\nu), \qquad dx(-\nu) = dx(\nu)^*.$$

Definition 12.3.5 The above wss stochastic process is called band-pass $(\nu_0, B)$, where $\nu_0 > B > 0$, if the support of $\mu_X$ is contained in the frequency band $[-\nu_0 - B, -\nu_0 + B] \cup [\nu_0 - B, \nu_0 + B]$.

A band-pass stochastic process admits the following quadrature decomposition
$$X(t) = M(t)\cos 2\pi\nu_0 t - N(t)\sin 2\pi\nu_0 t, \tag{12.21}$$
where $\{M(t)\}_{t\in\mathbb{R}}$ and $\{N(t)\}_{t\in\mathbb{R}}$, called the quadrature components, are real base-band $(B)$ wss stochastic processes. To prove this, let $G(\nu) := -i\,\mathrm{sign}(\nu)$ ($= 0$ if $\nu = 0$). The function $G$ is the so-called Hilbert filter transmittance. The quadrature process associated with $\{X(t)\}_{t\in\mathbb{R}}$ is defined by
$$Y(t) = \int_{\mathbb{R}} G(\nu)e^{2i\pi\nu t}\,dx(\nu).$$

The right-hand side of the preceding equality is well defined since $\int_{\mathbb{R}} |G(\nu)|^2\,\mu_X(d\nu) = \mu_X(\mathbb{R}) < \infty$. Moreover, this stochastic process is real, since its spectral decomposition is hermitian symmetric. The analytic process associated with $\{X(t)\}_{t\in\mathbb{R}}$ is, by definition, the stochastic process
$$Z(t) = X(t) + iY(t) = \int_{\mathbb{R}} (1 + iG(\nu))e^{2i\pi\nu t}\,dx(\nu) = 2\int_{(0,\infty)} e^{2i\pi\nu t}\,dx(\nu).$$
Taking into account that $|G(\nu)|^2 = 1$, the preceding expressions and the Wiener isometry formulas lead to the following properties:
$$\mu_Y(d\nu) = \mu_X(d\nu), \qquad C_Y(\tau) = C_X(\tau), \qquad C_{XY}(\tau) = -C_{YX}(\tau),$$
$$\mu_Z(d\nu) = 4\,1_{\mathbb{R}_+}(\nu)\,\mu_X(d\nu), \qquad C_Z(\tau) = 2\{C_X(\tau) + iC_{YX}(\tau)\},$$
and
$$E[Z(t+\tau)Z(t)] = 0. \tag{$\star$}$$
Defining the complex envelope of $\{X(t)\}_{t\in\mathbb{R}}$ by
$$U(t) = Z(t)e^{-2i\pi\nu_0 t}, \tag{$\star\star$}$$
it follows from this definition that
$$C_U(\tau) = e^{-2i\pi\nu_0\tau}C_Z(\tau), \qquad \mu_U(d\nu) = \mu_Z(d\nu + \nu_0), \tag{$\dagger$}$$
whereas ($\star$) and ($\star\star$) give
$$E[U(t+\tau)U(t)] = 0. \tag{$\dagger\dagger$}$$

The quadrature components $\{M(t)\}_{t\in\mathbb{R}}$ and $\{N(t)\}_{t\in\mathbb{R}}$ of $\{X(t)\}_{t\in\mathbb{R}}$ are the real wss stochastic processes defined by
$$U(t) = M(t) + iN(t).$$
Since
$$X(t) = \mathrm{Re}\{Z(t)\} = \mathrm{Re}\{U(t)e^{2i\pi\nu_0 t}\},$$
we have the decomposition (12.21). Taking ($\dagger\dagger$) into account we obtain
$$C_M(\tau) = C_N(\tau) = \frac{1}{4}\{C_U(\tau) + C_U(\tau)^*\}$$
and
$$C_{MN}(\tau) = -C_{NM}(\tau) = \frac{1}{4i}\{C_U(\tau) - C_U(\tau)^*\}, \tag{$\diamond$}$$
and the corresponding relations for the spectra
$$\mu_M(d\nu) = \mu_N(d\nu) = \{\mu_X(d\nu - \nu_0) + \mu_X(d\nu + \nu_0)\}\,1_{[-B,+B]}(\nu).$$
From ($\diamond$) and the observation that $C_U(0) = C_U(0)^*$ (since $C_U(0) = E[|U(0)|^2]$ is real), we deduce $C_{MN}(0) = 0$, that is to say,
$$E[M(t)N(t)] = 0. \tag{12.22}$$
If, furthermore, the original process has a power spectral measure that is symmetric about $\nu_0$ in the band $[\nu_0 - B, \nu_0 + B]$, the same holds for the spectrum of the analytic process and, by ($\dagger$), the complex envelope has a spectral measure symmetric about 0, which implies $C_U(\tau) = C_U(\tau)^*$ and then, by ($\diamond$),
$$E[M(t)N(t+\tau)] = 0. \tag{12.23}$$

In summary: Theorem 12.3.6 Let {X(t)}t∈R be a centered real band-pass (ν0 , B) wss stochastic process. The values of its quadrature components at a given time are uncorrelated. Moreover, if the original stochastic process has a power spectral measure symmetric about ν0 , the quadrature component processes are uncorrelated. More can be said when the original process is Gaussian. In this case, the quadrature component processes are jointly Gaussian (being obtained from the original Gaussian process by linear operations). In particular, for all t ∈ R, M (t) and N (t) are jointly Gaussian and uncorrelated, and therefore independent. If moreover the original process has a spectrum symmetric about ν0 , then, by (12.23), M (t1 ) and N (t2 ) (t1 , t2 ∈ R) are uncorrelated jointly Gaussian variables, and therefore independent. In other words, the quadrature component processes are two independent centered Gaussian wss stochastic processes.
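The quadrature decomposition (12.21) can be checked numerically on a deterministic toy signal. The sketch below assumes a finite sum of sinusoids with positive frequencies inside the band; for such a signal the Hilbert filter $G(\nu) = -i\,\mathrm{sign}(\nu)$ maps $\cos(2\pi\nu t + \varphi)$ to $\sin(2\pi\nu t + \varphi)$, so the analytic signal is a sum of complex exponentials. The component list and function names are hypothetical.

```python
import cmath
import math

# Illustrative band-pass signal: (amplitude, frequency, phase), frequencies near nu0 = 10.
COMPONENTS = [(1.0, 9.5, 0.3), (0.5, 10.2, -1.1), (2.0, 10.8, 2.0)]
NU0 = 10.0

def signal(t):
    """X(t) = sum a cos(2 pi nu t + phi)."""
    return sum(a * math.cos(2 * math.pi * nu * t + phi) for a, nu, phi in COMPONENTS)

def analytic_signal(t):
    """Z(t) = X(t) + i Y(t) = sum a exp(i (2 pi nu t + phi)) when all nu > 0."""
    return sum(a * cmath.exp(1j * (2 * math.pi * nu * t + phi)) for a, nu, phi in COMPONENTS)

def quadrature_components(t):
    """Return (M(t), N(t)) from the complex envelope U(t) = Z(t) exp(-2 i pi nu0 t)."""
    u = analytic_signal(t) * cmath.exp(-2j * math.pi * NU0 * t)
    return u.real, u.imag

def reconstruct(t):
    """Right-hand side of (12.21)."""
    m, n = quadrature_components(t)
    return m * math.cos(2 * math.pi * NU0 * t) - n * math.sin(2 * math.pi * NU0 * t)
```

Evaluating `signal(t)` and `reconstruct(t)` at any time gives the same value up to rounding, since $X(t) = \mathrm{Re}\{U(t)e^{2i\pi\nu_0 t}\}$.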

Complementary reading

[Cramér and Leadbetter, 1967, 1995] is the classical reference. It emphasizes the study of level crossings by wide-sense stationary stochastic processes. [Brémaud, 2014] has a chapter on the spectral measure of point processes.

12.4 Exercises

Exercise 12.4.1. Symmetric power spectral measure
Show that the power spectral measure of a real wss stochastic process is symmetric.

Exercise 12.4.2. Products of independent wss stochastic processes
Let $\{X(t)\}_{t\in\mathbb{R}}$ and $\{Y(t)\}_{t\in\mathbb{R}}$ be two centered wss stochastic processes of respective covariance functions $C_X(\tau)$ and $C_Y(\tau)$.
1. Assume the two signals to be independent. Show that $Z(t) := X(t)Y(t)$ ($t \in \mathbb{R}$) is a wss stochastic process. Give its mean and covariance function.
2. Assume the same hypothesis as in the previous question, but now $\{X(t)\}_{t\in\mathbb{R}}$ is the harmonic process of Example 5.1.20. Suppose that $\{Y(t)\}_{t\in\mathbb{R}}$ admits a power spectral density $f_Y(\nu)$. Give the power spectral density $f_Z(\nu)$ of $\{Z(t)\}_{t\in\mathbb{R}}$.

Exercise 12.4.3. The approximate derivative of a Wiener process
Let $\{W(t)\}_{t\ge 0}$ be a Wiener process. Show that for $a > 0$, the stochastic process
$$X_a(t) := \frac{W(t+a) - W(t)}{a} \quad (t \in \mathbb{R})$$
is a wss stochastic process. Compute its mean, its covariance function and its power spectral density.

Exercise 12.4.4. Doob's integral and the finitesimal derivative of Brownian motion
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. Prove the following. For all $f \in L^2_{\mathbb{C}}(\mathbb{R}_+) \cap L^1_{\mathbb{C}}(\mathbb{R}_+)$,
$$\lim_{h\downarrow 0}\int_{\mathbb{R}_+} f(t)B_h(t)\,dt = \int_{\mathbb{R}_+} f(t)\,dW(t)$$
in the quadratic mean.

Exercise 12.4.5. The square of a band-limited white noise
Let $\{X(t)\}_{t\in\mathbb{R}}$ be a wide-sense stationary centered Gaussian process with covariance function $C_X(\tau)$ and with the power spectral density
$$f_X(\nu) = \frac{N_0}{2}\,1_{[-B,+B]}(\nu),$$
where $N_0 > 0$ and $B > 0$.
1. Let $Y(t) = X(t)^2$. Show that $\{Y(t)\}_{t\in\mathbb{R}}$ is a wide-sense stationary process.
2. Give its power spectral density $f_Y(\nu)$.

Exercise 12.4.6. Projection of white noise onto an orthonormal base
Let the set of square-integrable functions $\varphi_i : [0, T] \to \mathbb{R}$ ($1 \le i \le N$) be such that
$$\int_0^T \varphi_i(t)\varphi_j(t)\,dt = \delta_{ij} \quad (1 \le i, j \le N),$$
and let $\{B(t)\}_{t\in\mathbb{R}}$ be a Gaussian white noise with psd $N_0/2$. Show that the vector $B = (B_1, \ldots, B_N)^T$ defined by
$$B_i = \int_0^T B(t)\varphi_i(t)\,dt \quad (1 \le i \le N)$$
is a centered Gaussian vector with covariance matrix
$$\Gamma_B = \frac{N_0}{2}\,I. \tag{12.24}$$
(In particular, the components $B_1, \ldots, B_N$ are identically distributed, independent, and centered Gaussian random variables with common variance $N_0/2$.)

Exercise 12.4.7. An iid sequence carried by an hpp
Let $N$ be a homogeneous Poisson process on $\mathbb{R}_+$ of intensity $\lambda > 0$, and let $\{Z_n\}_{n\ge 0}$ be an iid sequence of integrable real random variables, centered, with finite variance $\sigma^2$, and independent of $N$.
1) Show that $\{X(t)\}_{t\ge 0}$, where $X(t) := Z_{N((0,t])}$, is a wide-sense stationary stochastic process and give its covariance function.
2) Give its power spectral density.
3) Compute $P(X(t_1) = X(t_2))$ and $P(X(t_1) > X(t_2))$.

Exercise 12.4.8. Poisson shot noises
Let $N_1$, $N_2$ and $N_3$ be three independent homogeneous Poisson processes on $\mathbb{R}$ with respective intensities $\theta_1 > 0$, $\theta_2 > 0$ and $\theta_3 > 0$. Let $\{X_1(t)\}_{t\in\mathbb{R}}$ be the shot noise constructed on $N_1 + N_3$ with an impulse function $h : \mathbb{R} \to \mathbb{R}$ that is bounded and with compact support (null outside a finite interval). Let $\{X_2(t)\}_{t\in\mathbb{R}}$ be the shot noise constructed on $N_2 + N_3$ with the same impulse function $h$. Compute the power spectral density of the wide-sense stationary process $\{X(t)\}_{t\in\mathbb{R}}$, where $X(t) = X_1(t) + X_2(t)$.

Exercise 12.4.9. Frequency modulation
Consider the so-called frequency modulated (or phase modulated) signal, a stochastic process $\{X(t)\}_{t\ge 0}$ defined by
$$X(t) = \cos(2\pi(\nu_0 t + \Phi(t) + \alpha)),$$
where
$$\Phi(t) = \int_0^t \nu(s)\,ds,$$
$\{\nu(t)\}_{t\ge 0}$ is a real-valued stochastic process, and $\alpha$ is a real-valued random variable. The following assumptions are made:
(a) $\{\nu(t)\}$ is a strictly stationary process.
(b) $\alpha$ and $\{\nu(t)\}$ are independent.
(c) $E[e^{2i\pi\alpha}] = E[e^{4i\pi\alpha}] = 0$.
Show that the covariance function of the frequency modulated signal is given by
$$C_X(\tau) = \frac{1}{2}\,\mathrm{Re}\left\{e^{2i\pi\nu_0\tau}\,E\left[e^{2i\pi\int_0^\tau \nu(s)\,ds}\right]\right\}.$$

Exercise 12.4.10. Gaussian frequency modulation
This exercise is a continuation of Exercise 12.4.9, to which the reader is referred for the notation and definitions. We now consider a particular case for which the computations are tractable: Gaussian frequency modulation. Here $\{\nu(t)\}_{t\ge 0}$ is a stationary Gaussian signal with mean $\bar\nu$ and covariance function $C_\nu$. Show that
$$C_X(\tau) = \frac{1}{2}\cos(2\pi(\nu_0 + \bar\nu)\tau)\,e^{-4\pi^2\int_0^\tau C_\nu(s)(\tau - s)\,ds}.$$

Exercise 12.4.11. Flip-flop
Let $N$ be an hpp on $\mathbb{R}_+$ with intensity $\lambda$. Define the (telegraph or flip-flop) process $\{X(t)\}_{t\ge 0}$ with state space $E = \{+1, -1\}$ by
$$X(t) = Z\,(-1)^{N(t)},$$
where $X(0) = Z$ is an $E$-valued random variable independent of the counting process $N$. (Thus the telegraph process switches between $-1$ and $+1$ at each event of $N$.) The probability distribution of $Z$ is arbitrary.
1. Compute $P(X(t+s) = j \mid X(s) = i)$ for all $t, s \ge 0$ and all $i, j \in E$.
2. Give, for all $i \in E$, the limit of $P(X(t) = i)$ as $t$ tends to $\infty$.
3. Show that when $P(Z = 1) = \frac{1}{2}$, the process is a stationary process and give its power spectral measure.

Exercise 12.4.12. Flip-flop with limited memory
Let $N$ be an hpp on $\mathbb{R}$ with intensity $\lambda > 0$. Define for all $t \in \mathbb{R}$
$$X(t) = (-1)^{N((t,t+a])}.$$
1. Show that $\{X(t)\}_{t\in\mathbb{R}}$ is a wss stochastic process.
2. Compute its power spectral density.
3. Give the best affine estimate of $X(t+\tau)$ in terms of $X(t)$, that is, find $\alpha, \beta$ minimizing $E\left[|X(t+\tau) - (\alpha + \beta X(t))|^2\right]$, when $\tau > 0$.

Exercise 12.4.13. Jumping phase
Define for each $t \in \mathbb{R}$, $t \ge 0$,
$$X(t) = e^{i\Phi_{N(t)}},$$
where $\{N(t)\}_{t\ge 0}$ is the counting process of a homogeneous Poisson process on $\mathbb{R}_+$ with intensity $\lambda > 0$, and $\{\Phi_n\}_{n\ge 0}$ is an iid sequence of random variables uniformly distributed on $[0, 2\pi]$, and independent of the Poisson process. Show that $\{X(t)\}_{t\ge 0}$ is a wide-sense stationary process, give its covariance function $C_X(\tau)$ and its power spectral measure.

III: ADVANCED TOPICS

Chapter 13 Martingales

A martingale is, for the general public, a clever way of gambling. In mathematics, it formalizes the notion of a fair game, and we shall see that martingale theory indeed has something to say about such games. However, the interest and scope of martingale theory extend far beyond gambling, and it has become a fundamental tool of the theory of stochastic processes. The present chapter is an introduction to this topic, featuring the two main pillars on which it rests: the optional sampling theorem and the convergence theory of martingales, in discrete as well as in continuous time.

13.1 Martingale Inequalities

13.1.1 The Martingale Property

Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\{\mathcal{F}_n\}_{n\ge 0}$ be a history (or filtration) defined on it, that is, a sequence of sub-σ-fields of $\mathcal{F}$ that is non-decreasing: $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$ ($n \ge 0$). The internal history of a random sequence $\{X_n\}_{n\ge 0}$ is the filtration $\{\mathcal{F}_n^X\}_{n\ge 0}$ defined by $\mathcal{F}_n^X := \sigma(X_0, \ldots, X_n)$.

Definition 13.1.1 A complex random sequence $\{Y_n\}_{n\ge 0}$ such that for all $n \ge 0$
(i) $Y_n$ is $\mathcal{F}_n$-measurable and
(ii) $E[|Y_n|] < \infty$
is called a $(P, \mathcal{F}_n)$-martingale (resp., submartingale, supermartingale) if, in addition, for all $n \ge 0$,
$$E[Y_{n+1} \mid \mathcal{F}_n] = Y_n \quad (\text{resp., } \ge Y_n,\ \le Y_n). \tag{13.1}$$

When the context is clear as to the choice of the underlying probability measure $P$, we shall abbreviate, saying for instance “$\mathcal{F}_n$-submartingale” instead of “$(P, \mathcal{F}_n)$-submartingale”. If the history is not mentioned, it is assumed to be the internal history. For instance, the phrase “$\{Y_n\}_{n\ge 0}$ is a martingale” means that it is an $\mathcal{F}_n^Y$-martingale. Of course an $\mathcal{F}_n$-martingale is an $\mathcal{F}_n$-submartingale and an $\mathcal{F}_n$-supermartingale. Condition (13.1) implies that for all $k \ge 1$, all $n \ge 0$,
$$E[Y_{n+k} \mid \mathcal{F}_n] = Y_n \quad (\text{resp., } \ge Y_n,\ \le Y_n).$$

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/978-3-030-40183-2_13


Proof. In the martingale case, for instance, by the rule of successive conditioning
$$E[Y_{n+k} \mid \mathcal{F}_n] = E[E[Y_{n+k} \mid \mathcal{F}_{n+k-1}] \mid \mathcal{F}_n] = E[Y_{n+k-1} \mid \mathcal{F}_n] = E[Y_{n+k-2} \mid \mathcal{F}_n] = \cdots = E[Y_n \mid \mathcal{F}_n] = Y_n. \qquad \square$$

In particular, taking expectations and letting $n = 0$,
$$E[Y_k] = E[Y_0] \quad (\text{resp., } \ge E[Y_0],\ \le E[Y_0]).$$
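On a finite example, condition (13.1) can be verified by exhaustive enumeration. The sketch below assumes independent fair steps $\pm 1$ and a process given as a function of the path observed so far; it checks $E[Y_{n+1} \mid X_1, \ldots, X_n] = Y_n$ exactly with rational arithmetic. The helper names are hypothetical and the check only covers a finite horizon.

```python
from fractions import Fraction
from itertools import product

STEPS = (-1, 1)  # fair +1/-1 steps, an assumption of this toy example

def conditional_expectation_next(y, prefix):
    """E[Y_{n+1} | X_1..X_n = prefix] when the next step is uniform on STEPS."""
    return sum(Fraction(y(prefix + (s,))) for s in STEPS) / len(STEPS)

def is_martingale(y, horizon):
    """Check E[Y_{n+1} | F_n] = Y_n on every path prefix of length < horizon."""
    for n in range(horizon):
        for prefix in product(STEPS, repeat=n):
            if conditional_expectation_next(y, prefix) != Fraction(y(prefix)):
                return False
    return True

def walk(path):
    return sum(path)                      # Y_n = X_1 + ... + X_n

def squared_walk(path):
    return sum(path) ** 2                 # Y_n^2: a submartingale, not a martingale

def compensated_square(path):
    return sum(path) ** 2 - len(path)     # Y_n^2 - n: a martingale again
```

With these definitions, `is_martingale(walk, 6)` and `is_martingale(compensated_square, 6)` hold, while `is_martingale(squared_walk, 6)` fails because $E[Y_{n+1}^2 \mid \mathcal{F}_n] = Y_n^2 + 1$.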

Example 13.1.2: Sums of iid random variables. Let $\{X_n\}_{n\ge 0}$ be an iid sequence of centered and integrable random variables. The random sequence
$$Y_n := X_0 + X_1 + \cdots + X_n \quad (n \ge 0)$$
is an $\mathcal{F}_n^X$-martingale. Indeed, for all $n \ge 0$, $Y_n$ is $\mathcal{F}_n^X$-measurable and
$$E[Y_{n+1} \mid \mathcal{F}_n^X] = E[Y_n \mid \mathcal{F}_n^X] + E[X_{n+1} \mid \mathcal{F}_n^X] = Y_n + E[X_{n+1}] = Y_n,$$
where the second equality is due to the fact that $\mathcal{F}_n^X$ and $X_{n+1}$ are independent (Theorem 3.3.20).

Example 13.1.3: Products of iids. Let $X = \{X_n\}_{n\ge 0}$ be an iid sequence of integrable random variables with mean 1. The random sequence
$$Y_n = \prod_{k=0}^{n} X_k \quad (n \ge 0)$$
is an $\mathcal{F}_n^X$-martingale. Indeed, for all $n \ge 0$, $Y_n$ is $\mathcal{F}_n^X$-measurable and
$$E[Y_{n+1} \mid \mathcal{F}_n^X] = E\left[X_{n+1}\prod_{k=0}^{n} X_k \,\Big|\, \mathcal{F}_n^X\right] = E[X_{n+1} \mid \mathcal{F}_n^X]\prod_{k=0}^{n} X_k = E[X_{n+1}]\prod_{k=0}^{n} X_k = 1 \times Y_n = Y_n,$$

where the replacement of $E[X_{n+1} \mid \mathcal{F}_n^X]$ by $E[X_{n+1}]$ is due to the fact that $\mathcal{F}_n^X$ and $X_{n+1}$ are independent (Theorem 3.3.20).

Example 13.1.4: Gambling. Consider the random sequence $\{Y_n\}_{n\ge 0}$ with values in $\mathbb{R}_+$ defined by $Y_0 = a \in \mathbb{R}_+$ and
$$Y_{n+1} = Y_n + X_{n+1}\,b_{n+1}(X_0^n) \quad (n \ge 0),$$
where $X_0^n := (X_0, \ldots, X_n)$, $X_0 = Y_0$, $\{X_n\}_{n\ge 1}$ is an iid sequence of random variables taking the values $+1$ or $-1$ with equal probability, and the family of functions $b_n : \{0, 1\}^n \to \mathbb{N}$ ($n \ge 1$) is the betting strategy, that is, $b_{n+1}(X_0^n)$ is the stake at time $n+1$ of a gambler given the observed history $X_0^n$ of the chance outcomes up to time $n$. Admissible bets must guarantee that the fortune $Y_n$ remains non-negative at all times $n$, that is, $b_{n+1}(X_0^n) \le Y_n$. The process so defined is an $\mathcal{F}_n^X$-martingale. Indeed, for all $n \ge 0$, $Y_n$ is $\mathcal{F}_n^X$-measurable and
$$E[Y_{n+1} \mid \mathcal{F}_n^X] = E[Y_n \mid \mathcal{F}_n^X] + E[X_{n+1}\,b_{n+1}(X_0^n) \mid \mathcal{F}_n^X] = Y_n + E[X_{n+1} \mid \mathcal{F}_n^X]\,b_{n+1}(X_0^n) = Y_n,$$
where the second equality uses Theorem 3.3.24. The integrability condition should be checked on each application. It is satisfied if the stakes $b_n(X_0^n)$ are uniformly bounded.

Example 13.1.5: Harmonic functions of an hmc. Let $\{X_n\}_{n\ge 0}$ be an hmc with countable space $E$ and transition matrix $\mathbf{P}$. A function $h : E \to \mathbb{R}$ is called harmonic (resp., subharmonic, superharmonic) if $\mathbf{P}h$ is well defined and $\mathbf{P}h = h$, that is,
$$\sum_{j\in E} p_{ij}h(j) = h(i) \quad (i \in E) \tag{13.2}$$
(resp., $\mathbf{P}h \ge h$, $\mathbf{P}h \le h$, that is, with $\ge h(i)$, $\le h(i)$ in place of $= h(i)$).

Superharmonic functions are also called excessive functions. Equation (13.2) is equivalent, in the harmonic case for instance, to
$$E[h(X_{n+1}) \mid X_n = i] = h(i) \quad (i \in E). \tag{$\star$}$$
In view of the Markov property, the left-hand side of the above equality is also equal to $E[h(X_{n+1}) \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0]$, and therefore ($\star$) is equivalent to
$$E[h(X_{n+1}) \mid \mathcal{F}_n^X] = h(X_n).$$
Therefore, if $E[|h(X_n)|] < \infty$ for all $n \ge 0$, the process $\{h(X_n)\}_{n\ge 0}$ is an $\mathcal{F}_n^X$-martingale. Similarly, for a subharmonic (resp. superharmonic) function $h$ such that $E[|h(X_n)|] < \infty$ for all $n \ge 0$, the process $\{h(X_n)\}_{n\ge 0}$ is an $\mathcal{F}_n^X$-submartingale (resp. $\mathcal{F}_n^X$-supermartingale).

Example 13.1.6: Radon–Nikodým sequences. Let be given on the measurable space $(\Omega, \mathcal{F})$ two probability measures $Q$ and $P$ and a filtration $\{\mathcal{F}_n\}_{n\ge 1}$. Let $Q_n$ and $P_n$ denote the restrictions of $Q$ and $P$ respectively to $(\Omega, \mathcal{F}_n)$. Suppose that for all $n \ge 1$, $Q_n \ll P_n$, in which case we say that $Q$ is locally absolutely continuous along $\{\mathcal{F}_n\}_{n\ge 1}$ with respect to $P$ and denote this by $Q \overset{\mathrm{loc.}}{\ll} P$. Let
$$L_n := \frac{dQ_n}{dP_n}.$$
Then $\{L_n\}_{n\ge 1}$ is a $(P, \mathcal{F}_n)$-martingale.

Proof. The integrability condition is satisfied since
$$E_P[L_n] = \int_\Omega L_n\,dP_n = \int_\Omega \frac{dQ_n}{dP_n}\,dP_n = \int_\Omega dQ_n = Q_n(\Omega) = 1.$$
Now, for all $A \in \mathcal{F}_n$ (and a fortiori $A \in \mathcal{F}_{n+1}$),
$$\int_A L_{n+1}\,dP = \int_A L_{n+1}\,dP_{n+1} = Q_{n+1}(A) = Q_n(A) = \int_A L_n\,dP_n = \int_A L_n\,dP. \qquad \square$$

Definition 13.1.7 Let $\{\mathcal{F}_n\}_{n\ge 0}$ be some filtration. A complex random sequence $\{X_n\}_{n\ge 0}$ such that for all $n \ge 0$
(a) $X_n$ is $\mathcal{F}_n$-measurable,
(b) $E[|X_n|] < \infty$ and $E[X_n] = 0$, and
(c) $E[X_{n+1} \mid \mathcal{F}_n] = 0$ (resp. $\ge 0$, $\le 0$)
is called a $(P, \mathcal{F}_n)$-martingale difference (resp., submartingale difference, supermartingale difference).

The notion of martingale difference generalizes that of centered iid sequences. Indeed, for such iid sequences, $X_{n+1}$ is independent of $\mathcal{F}_n^X$, and therefore (Theorem 3.3.20) $E[X_{n+1} \mid \mathcal{F}_n^X] = 0$.

Convex Functions of Martingales Theorem 13.1.8 Let I ⊆ R be an interval (closed, open, semi-closed, infinite, etc.) and let ϕ : I → R be a convex function. A. Let {Yn }n≥0 be an Fn -martingale such that P (Yn ∈ I) = 1 for all n ≥ 0. Assume that E [|ϕ(Yn )|] < ∞ for all n ≥ 0. Then, the process {ϕ(Yn )}n≥0 is an Fn submartingale. B. Assume moreover that ϕ is non-decreasing and suppose this time that {Yn }n≥0 is an Fn -submartingale. Then, the process {ϕ(Yn )}n≥0 is an Fn -submartingale. Proof. By Jensen’s inequality for conditional expectations (Exercise 3.4.50), E [ϕ(Yn+1 )|Fn ] ≥ ϕ(E [Yn+1 |Fn ]) . Therefore (case A) E [ϕ(Yn+1 )|Fn ] ≥ ϕ(E [Yn+1 |Fn ]) = ϕ(Yn ), and (case B) E [ϕ(Yn+1 )|Fn ] ≥ ϕ(E [Yn+1|Fn ]) ≥ ϕ(Yn ) . (For the last inequality, use the submartingale property E [Yn+1 |Fn ] ≥ Yn and the hypothesis that ϕ is non-decreasing.)  Example 13.1.9: Let {Yn }n≥0 be an Fn -martingale and let p ≥ 1. As a special case of Theorem 13.1.8 with the convex function x → |x|p , we have that if E [|Yn |p ] < ∞, {|Yn |p }n≥0 is an Fn -submartingale. Applying Theorem 13.1.8 with the convex function x → x+ , we have that {Yn+ }n≥0 is an Fn -submartingale.


Martingale Transforms and Stopped Martingales

Let $\{\mathcal{F}_n\}_{n\ge 0}$ be some filtration. The complex stochastic process $\{H_n\}_{n\ge 1}$ is called $\mathcal{F}_n$-predictable if $H_n$ is $\mathcal{F}_{n-1}$-measurable for all $n \ge 1$. Let $\{Y_n\}_{n\ge 0}$ be another complex stochastic process. The stochastic process
$$(H \circ Y)_n := \sum_{k=1}^{n} H_k(Y_k - Y_{k-1}) \quad (n \ge 1)$$
(with the convention $(H \circ Y)_0 := 0$) is called the transform of $Y$ by $H$.

Theorem 13.1.10
(a) Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale and let $\{H_n\}_{n\ge 1}$ be a bounded non-negative $\mathcal{F}_n$-predictable process. Then $\{(H \circ Y)_n\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale.
(b) If $\{Y_n\}_{n\ge 0}$ is an $\mathcal{F}_n$-martingale and if $\{H_n\}_{n\ge 1}$ is bounded and $\mathcal{F}_n$-predictable, then $\{(H \circ Y)_n\}_{n\ge 0}$ is an $\mathcal{F}_n$-martingale.

Proof. Conditions (i) and (ii) of Definition 13.1.1 are obviously satisfied. Moreover,
(a)
$$E[(H \circ Y)_{n+1} - (H \circ Y)_n \mid \mathcal{F}_n] = E[H_{n+1}(Y_{n+1} - Y_n) \mid \mathcal{F}_n] = H_{n+1}E[Y_{n+1} - Y_n \mid \mathcal{F}_n] \ge 0,$$
using Theorem 3.3.24 for the second equality.
(b)
$$E[(H \circ Y)_{n+1} - (H \circ Y)_n \mid \mathcal{F}_n] = H_{n+1}E[Y_{n+1} - Y_n \mid \mathcal{F}_n] = 0,$$
by the same token. □

Recall the definition of an $\mathcal{F}_n$-stopping time (Definition 6.2.20): a random variable $\tau$ taking its values in $\overline{\mathbb{N}}$ and such that for all $m \in \mathbb{N}$, the event $\{\tau = m\}$ is in $\mathcal{F}_m$. Theorem 13.1.10 immediately leads to the stopped martingale theorem:

Theorem 13.1.11 Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale (resp., martingale) and let $\tau$ be an $\mathcal{F}_n$-stopping time. Then $\{Y_{n\wedge\tau}\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale (resp., martingale). In particular,
$$E[Y_{n\wedge\tau}] \ge E[Y_0] \quad (\text{resp., } = E[Y_0]) \quad (n \ge 0). \tag{13.3}$$

Proof. Let $H_n := 1_{\{n \le \tau\}}$. The stochastic process $H$ is $\mathcal{F}_n$-predictable since $\{H_n = 0\} = \{\tau \le n-1\} \in \mathcal{F}_{n-1}$. We have
$$Y_{n\wedge\tau} = Y_0 + \sum_{k=1}^{n\wedge\tau}(Y_k - Y_{k-1}) = Y_0 + \sum_{k=1}^{n} 1_{\{k\le\tau\}}(Y_k - Y_{k-1}).$$
The result then follows by Theorem 13.1.10. □
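The equality $E[Y_{n\wedge\tau}] = E[Y_0]$ in (13.3) can be checked exactly on a small case. The sketch below assumes the symmetric $\pm 1$ walk started at 0 and the stopping time $\tau = \inf\{k : Y_k \ge \text{level}\}$; it enumerates all paths of length $n$ and averages the stopped value with rational arithmetic. The names are illustrative.

```python
from fractions import Fraction
from itertools import product

def stopped_value(steps, level):
    """Y_{n ∧ tau} for Y_k = x_1 + ... + x_k and tau = inf{k : Y_k >= level}."""
    y = 0
    for s in steps:
        if y >= level:   # tau has already occurred: the walk is frozen
            break
        y += s
    return y

def expected_stopped_value(n, level):
    """Exact E[Y_{n ∧ tau}] over the 2^n equally likely paths of length n."""
    paths = list(product((-1, 1), repeat=n))
    return sum(Fraction(stopped_value(p, level)) for p in paths) / len(paths)

def probability_stopped(n, level):
    """Exact P(tau <= n)."""
    paths = list(product((-1, 1), repeat=n))
    hits = sum(1 for p in paths if stopped_value(p, level) >= level)
    return Fraction(hits, len(paths))
```

For every horizon the stopped expectation stays at 0, even though the probability of having reached the level increases with $n$.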




13.1.2 Kolmogorov's Inequality

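It often occurs that a result proved for iid sequences also holds for martingale difference sequences. This is the case for Kolmogorov's inequality, originally proved in the iid case (Lemma 4.1.12).

Theorem 13.1.12 Let $\{S_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale. Then, for all $\lambda \in \mathbb{R}_+$,
$$\lambda P\left(\max_{0\le i\le n} S_i > \lambda\right) \le E\left[S_n 1_{\{\max_{0\le i\le n} S_i > \lambda\}}\right]. \tag{13.4}$$

Proof. Define the random time
$$\tau = \inf\{n \ge 0 \,;\, S_n > \lambda\}.$$
It is an $\mathcal{F}_n$-stopping time since
$$A_i := \{\tau = i\} = \left\{S_i > \lambda,\ \max_{0\le j\le i-1} S_j \le \lambda\right\} \in \mathcal{F}_i.$$
The $A_i$'s so defined are mutually disjoint and
$$A := \left\{\max_{0\le i\le n} S_i > \lambda\right\} = \bigcup_{i=0}^{n} A_i.$$
Since $\lambda 1_{A_i} \le S_i 1_{A_i}$,
$$\lambda P(A) = \lambda\sum_{i=0}^{n} P(A_i) \le \sum_{i=0}^{n} E[S_i 1_{A_i}].$$
For all $0 \le i \le n$, $A_i$ being $\mathcal{F}_i$-measurable, we have by the submartingale property that $E[S_n \mid \mathcal{F}_i] \ge S_i$ and therefore $\int_{A_i} S_i\,dP \le \int_{A_i} E[S_n \mid \mathcal{F}_i]\,dP$. Taking these observations into account,
$$\begin{aligned}
\lambda P(A) &\le \sum_{i=0}^{n} E[S_i 1_{A_i}] \le \sum_{i=0}^{n} E\left[E[S_n \mid \mathcal{F}_i]\,1_{A_i}\right] \\
&= \sum_{i=0}^{n} E\left[E[S_n 1_{A_i} \mid \mathcal{F}_i]\right] = \sum_{i=0}^{n} E[S_n 1_{A_i}] \\
&= E\left[S_n\sum_{i=0}^{n} 1_{A_i}\right] = E[S_n 1_A]. \qquad \square
\end{aligned}$$

Corollary 13.1.13 Let $\{M_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-martingale. Then, for all $p \ge 1$, all $\lambda \in \mathbb{R}_+$,
$$\lambda^p P\left(\max_{0\le i\le n} |M_i| > \lambda\right) \le E[|M_n|^p]. \tag{13.5}$$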


Proof. Let $S_n = |M_n|^p$. This defines an $\mathcal{F}_n$-submartingale (Example 13.1.9) to which one may apply Kolmogorov's inequality with $\lambda$ replaced by $\lambda^p$:
$$\lambda^p P\left(\max_{0\le i\le n} |M_i|^p > \lambda^p\right) \le E\left[|M_n|^p\,1_{\{\max_{0\le i\le n} |M_i|^p > \lambda^p\}}\right] \le E[|M_n|^p].$$
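Inequality (13.5) can be verified exactly on a small martingale. The sketch below assumes the symmetric $\pm 1$ walk $M_k = x_1 + \cdots + x_k$ with $M_0 = 0$ and an integer exponent $p$; both sides of (13.5) are computed by enumerating the $2^n$ paths. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import accumulate, product

def kolmogorov_sides(n, lam, p=2):
    """Exact (lhs, rhs) of lam^p P(max_i |M_i| > lam) <= E|M_n|^p for the fair walk."""
    paths = list(product((-1, 1), repeat=n))
    exceed = 0
    moment = Fraction(0)
    for path in paths:
        partial = [0] + list(accumulate(path))     # M_0, M_1, ..., M_n
        if max(abs(m) for m in partial) > lam:
            exceed += 1
        moment += abs(partial[-1]) ** p
    lhs = Fraction(lam) ** p * Fraction(exceed, len(paths))
    rhs = moment / len(paths)
    return lhs, rhs

lhs, rhs = kolmogorov_sides(n=8, lam=2, p=2)
```

For $p = 2$ the right-hand side equals $E[M_n^2] = n$, and the left-hand side stays below it for every threshold, including thresholds that the walk exceeds with high probability.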

 Remark 13.1.14 Note that Kolmogorov’s inequality is, as far as martingales are concerned, a considerable improvement with respect to what Markov’s inequality would have given: λp P (|Mi |p > λp ) ≤ E[|Mi |p ] ≤ E[|Mn |p ] (0 ≤ i ≤ n).

13.1.3 Doob's Inequality

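Recall the notation $\|X\|_p = (E[|X|^p])^{1/p}$.

Theorem 13.1.15 Let $\{M_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-martingale. For all $p > 1$,
$$\|M_n\|_p \le \left\|\max_{0\le i\le n} |M_i|\right\|_p \le q\,\|M_n\|_p, \tag{13.6}$$
where $q$ (the “conjugate” of $p$) is defined by $\frac{1}{p} + \frac{1}{q} = 1$.

Proof. The first inequality is trivial. For the second inequality, observe that for all non-negative random variables $X$, by Fubini's theorem,
$$E[X^p] = E\left[\int_0^X p x^{p-1}\,dx\right] = E\left[\int_0^\infty p x^{p-1} 1_{\{x < X\}}\,dx\right] = p\int_0^\infty x^{p-1}P(X > x)\,dx.$$
Therefore, applying this and Kolmogorov's inequality (13.4) to the submartingale $S_n = |M_n|$,
$$\begin{aligned}
E\left[\max_{0\le i\le n} |M_i|^p\right] &= E\left[\left(\max_{0\le i\le n} |M_i|\right)^p\right] = p\int_0^\infty x^{p-1}P\left(\max_{0\le i\le n} |M_i| > x\right)dx \\
&\le p\int_0^\infty x^{p-2}E\left[|M_n|\,1_{\{\max_{0\le i\le n} |M_i| > x\}}\right]dx \\
&= pE\left[\int_0^\infty x^{p-2}|M_n|\,1_{\{\max_{0\le i\le n} |M_i| > x\}}\,dx\right] \\
&= pE\left[|M_n|\int_0^{\max_{0\le i\le n} |M_i|} x^{p-2}\,dx\right] \\
&= \frac{p}{p-1}E\left[|M_n|\left(\max_{0\le i\le n} |M_i|\right)^{p-1}\right] = qE\left[|M_n|\left(\max_{0\le i\le n} |M_i|\right)^{p-1}\right].
\end{aligned}$$
By Hölder's inequality, and observing that $(p-1)q = p$,
$$E\left[|M_n|\left(\max_{0\le i\le n} |M_i|\right)^{p-1}\right] \le E[|M_n|^p]^{1/p}\,E\left[\left(\max_{0\le i\le n} |M_i|\right)^{(p-1)q}\right]^{1/q} = \|M_n\|_p\,E\left[\left(\max_{0\le i\le n} |M_i|\right)^p\right]^{1/q}.$$
Therefore
$$E\left[\max_{0\le i\le n} |M_i|^p\right] \le q\,\|M_n\|_p\,E\left[\left(\max_{0\le i\le n} |M_i|\right)^p\right]^{1/q},$$
or (eliminating the trivial case where $E[\max_{0\le i\le n} |M_i|^p] = \infty$)
$$E\left[\max_{0\le i\le n} |M_i|^p\right]^{1-\frac{1}{q}} \le q\,\|M_n\|_p,$$
that is, since $1 - \frac{1}{q} = \frac{1}{p}$,
$$\left\|\max_{0\le i\le n} |M_i|\right\|_p \le q\,\|M_n\|_p.$$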



13.1.4 Hoeffding's Inequality

Theorem 13.1.16 Let $\{M_n\}_{n\ge 0}$ be a real $\mathcal{F}_n$-martingale such that, for some sequence $c_1, c_2, \ldots$ of real numbers,
$$P(|M_n - M_{n-1}| \le c_n) = 1 \quad (n \ge 1).$$
Then, for all $x \ge 0$ and all $n \ge 1$,
$$P(|M_n - M_0| \ge x) \le 2\exp\left(-\frac{1}{2}\,x^2 \Big/ \sum_{i=1}^{n} c_i^2\right). \tag{13.7}$$

Proof. By convexity of $z \mapsto e^{az}$, for $|z| \le 1$ and all $a \in \mathbb{R}$,
$$e^{az} \le \frac{1}{2}(1-z)e^{-a} + \frac{1}{2}(1+z)e^{+a}.$$
In particular, if $Z$ is a centered random variable such that $P(|Z| \le 1) = 1$,
$$E[e^{aZ}] \le \frac{1}{2}(1 - E[Z])e^{-a} + \frac{1}{2}(1 + E[Z])e^{+a} = \frac{1}{2}e^{-a} + \frac{1}{2}e^{+a} \le e^{a^2/2}.$$
By similar arguments, for all $a \in \mathbb{R}$,
$$\begin{aligned}
E\left[e^{a\frac{M_n - M_{n-1}}{c_n}} \,\Big|\, \mathcal{F}_{n-1}\right] &\le \frac{1}{2}\left(1 - E\left[\frac{M_n - M_{n-1}}{c_n} \,\Big|\, \mathcal{F}_{n-1}\right]\right)e^{-a} + \frac{1}{2}\left(1 + E\left[\frac{M_n - M_{n-1}}{c_n} \,\Big|\, \mathcal{F}_{n-1}\right]\right)e^{+a} \\
&\le e^{a^2/2},
\end{aligned} \tag{13.8}$$
and, with $a$ replaced by $c_n a$,
$$E\left[e^{a(M_n - M_{n-1})} \,\big|\, \mathcal{F}_{n-1}\right] \le e^{a^2 c_n^2/2}.$$
Therefore,
$$\begin{aligned}
E\left[e^{a(M_n - M_0)}\right] &= E\left[e^{a(M_{n-1} - M_0)}e^{a(M_n - M_{n-1})}\right] \\
&= E\left[e^{a(M_{n-1} - M_0)}E\left[e^{a(M_n - M_{n-1})} \,\big|\, \mathcal{F}_{n-1}\right]\right] \\
&\le E\left[e^{a(M_{n-1} - M_0)}\right] \times e^{a^2 c_n^2/2},
\end{aligned}$$
and then by recurrence
$$E\left[e^{a(M_n - M_0)}\right] \le e^{\frac{1}{2}a^2\sum_{i=1}^{n} c_i^2}.$$
In particular, with $a > 0$, by Markov's inequality,
$$P(M_n - M_0 \ge x) \le e^{-ax}E\left[e^{a(M_n - M_0)}\right] \le e^{-ax + \frac{1}{2}a^2\sum_{i=1}^{n} c_i^2}.$$
Minimization of the right-hand side with respect to $a$ (the minimum is attained at $a = x / \sum_{i=1}^{n} c_i^2$) gives
$$P(M_n - M_0 \ge x) \le e^{-\frac{1}{2}x^2 / \sum_{i=1}^{n} c_i^2}.$$
The same argument with $M_0 - M_n$ instead of $M_n - M_0$ yields the bound
$$P(-(M_n - M_0) \ge x) \le e^{-\frac{1}{2}x^2 / \sum_{i=1}^{n} c_i^2}.$$
The announced bound then follows from these two bounds since for any random variable $X$, and all $x \in \mathbb{R}_+$, $P(|X| \ge x) \le P(X \ge x) + P(X \le -x)$. □

Example 13.1.17: The knapsack. There are $n$ objects, the $i$-th has a volume $V_i$ and is worth $W_i$. All these non-negative random variables form an independent family, the $V_i$'s have finite means and the means of the $W_i$'s are bounded by $M < \infty$. You have to choose integers $z_1, \ldots, z_n$ in such a way that the total volume $\sum_{i=1}^{n} z_i V_i$ does not exceed a given storage capacity $c$ and that the total worth $\sum_{i=1}^{n} z_i W_i$ is maximized. Call this maximal worth $Z$. We shall see that
$$P(|Z - E[Z]| \ge x) \le 2\exp\left(\frac{-x^2}{2nM^2}\right) \quad (x \ge 0).$$
For this consider the variables $Z_j$ which are the equivalent of $Z$ when the $j$-th object has been removed. Let now $M_j := E[Z \mid \mathcal{F}_j]$, where $\mathcal{F}_j := \sigma((V_k, W_k);\ 1 \le k \le j)$. Note that in view of the independence assumptions $E[Z_j \mid \mathcal{F}_j] = E[Z_j \mid \mathcal{F}_{j-1}]$. Clearly $Z_j \le Z \le Z_j + M$. Taking conditional expectations given $\mathcal{F}_j$ and then $\mathcal{F}_{j-1}$ in this last chain of inequalities reveals that $|M_j - M_{j-1}| \le M$. The rest is then just Hoeffding's inequality.
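To get a feeling for how tight (13.7) is, one can compare it with an exactly computable tail. The sketch below assumes the symmetric $\pm 1$ walk, for which $c_i = 1$ and $M_n - M_0 = 2K - n$ with $K$ binomial$(n, 1/2)$; the function names are illustrative.

```python
import math

def walk_two_sided_tail(n, x):
    """Exact P(|M_n - M_0| >= x) for the symmetric +1/-1 walk."""
    favourable = sum(math.comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= x)
    return favourable / 2 ** n

def hoeffding_bound(x, increment_bounds):
    """Right-hand side of (13.7): 2 exp(-x^2 / (2 sum c_i^2))."""
    return 2 * math.exp(-0.5 * x ** 2 / sum(c ** 2 for c in increment_bounds))

n = 50
comparison = [(x, walk_two_sided_tail(n, x), hoeffding_bound(x, [1.0] * n))
              for x in range(0, n + 1, 5)]
```

Each triple in `comparison` lists the deviation $x$, the exact tail and the bound; the exact tail never exceeds the bound, and the gap widens for large deviations.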

A General Framework of Application

Let $\mathcal{X}$ be a finite set, and let $f : \mathcal{X}^N \to \mathbb{R}$ be a given function. We introduce the notation $x = (x_1, \dots, x_N)$ and $x_1^k = (x_1, \dots, x_k)$. In particular, $x = x_1^N$. For $x \in \mathcal{X}^N$, $z \in \mathcal{X}$ and $1 \le k \le N$, let
\[ f_k(x, z) := f(x_1, \dots, x_{k-1}, z, x_{k+1}, \dots, x_N) . \]


The function $f$ is said to satisfy the Lipschitz condition with bound $c$ if for all $x \in \mathcal{X}^N$, all $z \in \mathcal{X}$ and all $1 \le k \le N$,
\[ |f_k(x, z) - f(x)| \le c . \]
Let $X_1, X_2, \dots, X_N$ be independent random variables with values in $\mathcal{X}$. Define the martingale $M_n = E[f(X) \mid X_1^n]$. By the independence assumption, with obvious notations,
\[ E[f(X) \mid X_1^n] = \sum_{x_{n+1}^N} f(X_1^{n-1}, X_n, x_{n+1}^N)\, P(X_{n+1}^N = x_{n+1}^N) \]
and
\[ E[f(X) \mid X_1^{n-1}] = \sum_{x_n} \sum_{x_{n+1}^N} f(X_1^{n-1}, x_n, x_{n+1}^N)\, P(X_n = x_n)\, P(X_{n+1}^N = x_{n+1}^N) . \]
Therefore
\[ |M_n - M_{n-1}| \le \sum_{x_n} \sum_{x_{n+1}^N} \left| f(X_1^{n-1}, x_n, x_{n+1}^N) - f(X_1^{n-1}, X_n, x_{n+1}^N) \right| P(X_n = x_n)\, P(X_{n+1}^N = x_{n+1}^N) \le c . \]

Example 13.1.18: Pattern matching. Take $f(x)$ to be the number of occurrences of the fixed pattern $b = (b_1, \dots, b_k)$ ($k \le N$) in the sequence $x = (x_1, \dots, x_N)$, that is,
\[ f(x) = \sum_{i=1}^{N-k+1} 1_{\{x_i = b_1, \dots, x_{i+k-1} = b_k\}} . \]
The mean number of matches in an iid sequence $X = (X_1, \dots, X_N)$ with uniform distribution on $\mathcal{X}$ is therefore
\[ E[f(X)] = \sum_{i=1}^{N-k+1} E\left[ 1_{\{X_i = b_1, \dots, X_{i+k-1} = b_k\}} \right] = \sum_{i=1}^{N-k+1} \left( \frac{1}{|\mathcal{X}|} \right)^k , \]
that is,
\[ E[f(X)] = (N - k + 1) \left( \frac{1}{|\mathcal{X}|} \right)^k . \]
The martingale $M_n := E[f(X) \mid X_1^n]$ is such that $M_0 = E[f(X)]$. Since changing the value of one coordinate of $x \in \mathcal{X}^N$ changes $f(x)$ by at most $k$, we can apply the bound of Theorem 13.8 with $c_i \equiv k$ to obtain the inequality
\[ P(|f(X) - E[f(X)]| \ge \lambda) \le 2 e^{-\frac{1}{2} \frac{\lambda^2}{N k^2}} . \]
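The two facts used in the pattern-matching example (the closed form of the mean and the Lipschitz bound $k$) can be checked by brute force on a small instance. The sketch below is an illustration under assumed parameters (binary alphabet, $N = 6$, pattern $(0, 1)$); the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def count_matches(x, b):
    """Number of positions i at which the pattern b occurs in the sequence x."""
    k = len(b)
    return sum(1 for i in range(len(x) - k + 1) if tuple(x[i:i + k]) == tuple(b))


def exact_mean(alphabet, N, b):
    """E[f(X)] for X uniform on alphabet^N, computed by full enumeration."""
    sequences = list(product(alphabet, repeat=N))
    return Fraction(sum(count_matches(x, b) for x in sequences), len(sequences))


def max_single_change(alphabet, N, b):
    """Largest |f(x) - f(y)| over pairs x, y that differ in one coordinate."""
    worst = 0
    for x in product(alphabet, repeat=N):
        fx = count_matches(x, b)
        for pos in range(N):
            for z in alphabet:
                y = x[:pos] + (z,) + x[pos + 1:]
                worst = max(worst, abs(count_matches(y, b) - fx))
    return worst


if __name__ == "__main__":
    alphabet, N, b = (0, 1), 6, (0, 1)
    k = len(b)
    print(exact_mean(alphabet, N, b), Fraction(N - k + 1, len(alphabet) ** k))
    print(max_single_change(alphabet, N, b), "<=", k)
```

Enumeration is only feasible for very small $N$; it is used here because it gives exact rational values to compare with $(N - k + 1)/|\mathcal{X}|^k$.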


Exposure Martingales in Erdős–Rényi Graphs

A random graph $G(n, p)$ (see Definition 1.3.44) with set of vertices $V_n$ of cardinality $n$ may be generated as follows. Enumerate the $N = \binom{n}{2}$ edges of the complete graph on $V_n$ from $i = 1$ to $i = N$. Generate a random vector $X = (X_1, \dots, X_N)$ with independent and identically distributed variables with values in $\{0, 1\}$ and common distribution $P(X_i = 1) = p$. Then include edge $i$ in $G(n, p)$ if and only if $X_i = 1$. Any functional of $G(n, p)$ can always be written as $f(X)$. The edge exposure martingale corresponding to this functional is the $\mathcal{F}_n^X$-martingale defined by $M_0 = E[f(X)]$ and, for $i \ge 1$,
\[ M_i := E\left[ f(X) \mid X_1^i \right] . \]
Since the $X_i$’s are independent, the general method of the previous subsection can be applied.

Another type of martingale related to a $G(n, p)$ graph is useful. Here $V_n$ is identified with $\{1, 2, \dots, n\}$. We denote similarly $\{1, 2, \dots, i\}$ by $V_i$. For $1 \le i \le n$, define the graph $G_i$ to be the restriction of $G(n, p)$ to $V_i$. Any functional of $G(n, p)$ can always be written as $f(G)$, where $G := (G_1, \dots, G_n)$. The vertex exposure martingale corresponding to this functional is the $G_1^i$-martingale defined by $M_0 = E[f(G)]$ and, for $i \ge 1$,
\[ M_i := E\left[ f(G) \mid G_1^i \right] . \]

Example 13.1.19: The chromatic number of an Erdős–Rényi graph. The chromatic number of a graph $G$ is the minimal number of colors needed to color the vertices in such a way that no adjacent vertices receive the same color. Call $f(G)$ the chromatic number of $G$. Since the difference between $f(G_1^{i-1}, G_i, g_{i+1}^n)$ and $f(G_1^{i-1}, g_i, g_{i+1}^n)$ for all $g_i$, $g_{i+1}^n$ is at most one, one can apply Hoeffding’s inequality to obtain
\[ P\left( |f(G) - E[f(G)]| \ge \lambda \sqrt{n} \right) \le 2 e^{-\frac{1}{2} \lambda^2} . \qquad (13.9) \]
But . . . the $G_i$’s are not independent! Nevertheless, the general method of the previous subsection can be applied modulo a slight change of point of view. Let $X_1$ be a constant, and for $2 \le i \le n$, let $X_i = \{X^{i,j},\ 1 \le j \le i - 1\}$ (recall the definition of $X^{u,v}$ in Definition 1.3.44). (Here the passage from subgraph $G_{i-1}$ to subgraph $G_i$ is represented by the “difference” $X_i$ between these two subgraphs.) Then $f(G)$ can be rewritten as $h(X) = h(X_1, \dots, X_n)$ and the general method applies since the $X_i$’s are independent.

13.2 Martingales and Stopping Times

13.2.1 Doob’s Optional Sampling Theorem

The first pillar of martingale theory is the optional sampling theorem. It has many versions and that given next is the most elementary one, sufficient for the elementary examples to be considered now. More general results are given later in this subsection.


Theorem 13.2.1 Let $\{M_n\}_{n \ge 0}$ be an $\mathcal{F}_n$-martingale, and let $\tau$ be an $\mathcal{F}_n$-stopping time (see Definition 6.2.20). Suppose that at least one of the following conditions holds:
(α) $P(\tau \le n_0) = 1$ for some $n_0 \ge 0$, or
(β) $P(\tau < \infty) = 1$ and $|M_n| \le K < \infty$ when $n \le \tau$.
Then
\[ E[M_\tau] = E[M_0] . \qquad (13.10) \]

Proof. (α) Just apply Theorem 13.1.11 (Formula (13.3) with $n = n_0$). (β) Apply the result of (α) to the $\mathcal{F}_n$-stopping time $\tau \wedge n_0$ to obtain $E[M_{\tau \wedge n_0}] = E[M_0]$. But, by dominated convergence,
\[ \lim_{n_0 \uparrow \infty} E[M_{\tau \wedge n_0}] = E\left[ \lim_{n_0 \uparrow \infty} M_{\tau \wedge n_0} \right] = E[M_\tau] . \]
□

Example 13.2.2: The ruin problem via martingales. The symmetric random walk $\{X_n\}_{n \ge 0}$ on $\mathbb{Z}$ with initial state 0 is an $\mathcal{F}_n^X$-martingale (Example 13.1.2). Let $\tau$ be the first time $n$ for which $X_n = -a$ or $+b$, where $a, b > 0$. This is an $\mathcal{F}_n^X$-stopping time and moreover $\tau < \infty$. Part (β) of the above result can be applied with $K = \sup(a, b)$ to obtain $0 = E[X_0] = E[X_\tau]$. Writing $v = P(-a \text{ is hit before } b)$, we have $E[X_\tau] = -av + b(1 - v)$, and therefore
\[ v = \frac{b}{a + b} . \]
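The value $v = b/(a+b)$ can be cross-checked without martingales by solving the first-step equations $h(i) = \frac{1}{2} h(i-1) + \frac{1}{2} h(i+1)$ with $h(-a) = 1$, $h(b) = 0$, where $h(i)$ denotes the probability of hitting $-a$ before $b$ when starting from $i$. The sketch below does this in exact arithmetic; the "shooting" parametrization by the unknown $t = h(-a+1)$ and the function name are choices made for this illustration, not part of the text.

```python
from fractions import Fraction


def ruin_probability(a: int, b: int) -> Fraction:
    """P(symmetric walk started at 0 hits -a before b), for integers a, b >= 1.

    Each h(i) is stored as a pair (alpha, beta) meaning alpha + beta * t,
    where t = h(-a + 1) is the unknown fixed by the condition h(b) = 0.
    """
    prev = (Fraction(1), Fraction(0))  # h(-a) = 1
    cur = (Fraction(0), Fraction(1))   # h(-a + 1) = t
    values = {-a: prev, -a + 1: cur}
    for i in range(-a + 1, b):
        # Harmonicity at i: h(i + 1) = 2 h(i) - h(i - 1).
        nxt = (2 * cur[0] - prev[0], 2 * cur[1] - prev[1])
        values[i + 1] = nxt
        prev, cur = cur, nxt
    alpha_b, beta_b = values[b]
    t = -alpha_b / beta_b              # enforce h(b) = 0
    alpha_0, beta_0 = values[0]
    return alpha_0 + beta_0 * t


if __name__ == "__main__":
    for a, b in ((1, 1), (2, 3), (5, 2)):
        print(a, b, ruin_probability(a, b), Fraction(b, a + b))
```

Both columns printed by the usage loop should agree, since the solution of the first-step equations is the linear function $h(i) = (b - i)/(a + b)$.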

Example 13.2.3: A counterexample. Consider the symmetric random walk of the previous example, but now define $\tau$ to be the hitting time of $b > 0$, an almost surely finite time since the symmetric walk on $\mathbb{Z}$ is recurrent. If the optional sampling theorem applied, one would have $0 = E[X_0] = E[X_\tau] = b$, an obvious contradiction. Of course, neither condition (α) nor (β) is satisfied.

A generalization of the elementary result given at the beginning of the present subsection will be proved after the following theorem.

Theorem 13.2.4 Let $\{\mathcal{F}_n\}_{n \ge 0}$ be a history and let $\mathcal{F}_\infty := \sigma(\cup_{n \ge 0} \mathcal{F}_n)$. Let $\tau$ be an $\mathcal{F}_n$-stopping time. The collection of events
\[ \mathcal{F}_\tau := \{A \in \mathcal{F}_\infty \mid A \cap \{\tau = n\} \in \mathcal{F}_n \text{ for all } n \ge 0\} \]
is a σ-field, and $\tau$ is $\mathcal{F}_\tau$-measurable. Let $\{X_n\}_{n \ge 0}$ be an $E$-valued $\mathcal{F}_n$-adapted random sequence, and let $\tau$ be a finite $\mathcal{F}_n$-stopping time. Then $X(\tau)$ is $\mathcal{F}_\tau$-measurable.


The proof is left as an exercise. A more general result is given in Theorem 13.2.4. If $\{\mathcal{F}_n\}_{n \ge 0}$ is the internal history of some random sequence $\{X_n\}_{n \ge 0}$, that is, if $\mathcal{F}_n = \mathcal{F}_n^X$ ($n \ge 0$), one may interpret $\mathcal{F}_\tau^X$ as the collection of events that are determined by the observation of the random sequence up to time $\tau$ (included).

We are now ready for the statement and proof of Doob’s optional sampling theorem.

Theorem 13.2.5 Let $\{Y_n\}_{n \ge 0}$ be an $\mathcal{F}_n$-submartingale (resp., martingale), and let $\tau_1$, $\tau_2$ be finite $\mathcal{F}_n$-stopping times such that $P(\tau_1 \le \tau_2) = 1$. If for $i = 1, 2$,
\[ E[|Y_{\tau_i}|] < \infty \qquad (13.11) \]
and
\[ \liminf_{n \uparrow \infty} E\left[ |Y_n| 1_{\{\tau_i > n\}} \right] = 0 , \qquad (13.12) \]
then, $P$-a.s.,
\[ E[Y_{\tau_2} \mid \mathcal{F}_{\tau_1}] \ge Y_{\tau_1} \quad (\text{resp., } = Y_{\tau_1}) . \qquad (13.13) \]

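Remark 13.2.6 In particular,
\[ E[Y_{\tau_2}] \ge E[Y_{\tau_1}] \quad (\text{resp., } = E[Y_{\tau_1}]) . \qquad (13.14) \]
More generally, if $\{\tau_n\}_{n \ge 1}$ is a non-decreasing sequence of finite $\mathcal{F}_n$-stopping times satisfying conditions (13.11) and (13.12), the sequence $\{Y_{\tau_n}\}_{n \ge 1}$ is an $\mathcal{F}_{\tau_n}$-submartingale (resp., martingale).

Proof. It suffices to give the proof for the submartingale case. The meaning of (13.13) is that, for all $A \in \mathcal{F}_{\tau_1}$,
\[ E[1_A Y_{\tau_2}] \ge E[1_A Y_{\tau_1}] . \]
It is sufficient to show that for all $n \ge 0$,
\[ E\left[ 1_{A \cap \{\tau_1 = n\}} Y_{\tau_2} \right] \ge E\left[ 1_{A \cap \{\tau_1 = n\}} Y_{\tau_1} \right] , \]
or, equivalently since $\tau_1 = n$ implies $\tau_2 \ge n$,
\[ E\left[ 1_{A \cap \{\tau_1 = n\} \cap \{\tau_2 \ge n\}} Y_{\tau_2} \right] \ge E\left[ 1_{A \cap \{\tau_1 = n\} \cap \{\tau_2 \ge n\}} Y_{\tau_1} \right] = E\left[ 1_{A \cap \{\tau_1 = n\} \cap \{\tau_2 \ge n\}} Y_n \right] . \]
Write this as
\[ E\left[ 1_{B \cap \{\tau_2 \ge n\}} Y_{\tau_2} \right] \ge E\left[ 1_{B \cap \{\tau_2 \ge n\}} Y_n \right] , \qquad (\star) \]
where $B := A \cap \{\tau_1 = n\}$. By definition of $\mathcal{F}_{\tau_1}$, $B \in \mathcal{F}_n$. It is therefore sufficient to show that for all $n \ge 0$ and all $B \in \mathcal{F}_n$, $(\star)$ holds. We have
\begin{align*}
E\left[ 1_{B \cap \{\tau_2 \ge n\}} Y_n \right] &= E\left[ 1_{B \cap \{\tau_2 = n\}} Y_n \right] + E\left[ 1_{B \cap \{\tau_2 \ge n+1\}} Y_n \right] \\
&\le E\left[ 1_{B \cap \{\tau_2 = n\}} Y_n \right] + E\left[ 1_{B \cap \{\tau_2 \ge n+1\}} E[Y_{n+1} \mid \mathcal{F}_n] \right] \\
&= E\left[ 1_{B \cap \{\tau_2 = n\}} Y_{\tau_2} \right] + E\left[ 1_{B \cap \{\tau_2 \ge n+1\}} Y_{n+1} \right] \\
&\le E\left[ 1_{B \cap \{n \le \tau_2 \le n+1\}} Y_{\tau_2} \right] + E\left[ 1_{B \cap \{\tau_2 \ge n+2\}} Y_{n+2} \right] \\
&\ \ \vdots \\
&\le E\left[ 1_{B \cap \{n \le \tau_2 \le m\}} Y_{\tau_2} \right] + E\left[ 1_{B \cap \{\tau_2 > m\}} Y_m \right] ,
\end{align*}
that is,
\[ E\left[ 1_{B \cap \{n \le \tau_2 \le m\}} Y_{\tau_2} \right] \ge E\left[ 1_{B \cap \{\tau_2 \ge n\}} Y_n \right] - E\left[ 1_{B \cap \{\tau_2 > m\}} Y_m \right] \]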


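for all $m \ge n$. Therefore, by dominated convergence and hypothesis (13.12),
\begin{align*}
E\left[ 1_{B \cap \{\tau_2 \ge n\}} Y_{\tau_2} \right] &= E\left[ \lim_{m \uparrow \infty} 1_{B \cap \{n \le \tau_2 \le m\}} Y_{\tau_2} \right] \\
&\ge E\left[ 1_{B \cap \{\tau_2 \ge n\}} Y_n \right] - \liminf_{m \uparrow \infty} E\left[ 1_{B \cap \{\tau_2 > m\}} Y_m \right] \\
&= E\left[ 1_{B \cap \{\tau_2 \ge n\}} Y_n \right] .
\end{align*}
□

Corollary 13.2.7 Let $\{Y_n\}_{n \ge 0}$ be an $\mathcal{F}_n$-submartingale (resp., martingale). Let $\tau_1$, $\tau_2$ be $\mathcal{F}_n$-stopping times such that $\tau_1 \le \tau_2 \le N$ a.s., for some constant $N < \infty$. Then (13.14) holds.

Proof. This is an immediate consequence of Theorem 13.2.5. □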



Corollary 13.2.8 Let $\{Y_n\}_{n \ge 0}$ be a uniformly integrable $\mathcal{F}_n$-submartingale (resp., martingale). Let $\tau_1$, $\tau_2$ be finite $\mathcal{F}_n$-stopping times. Then (13.13) holds.

Proof. In order to apply Theorem 13.2.5, we have to show that conditions (13.11) and (13.12) are satisfied when $\{Y_n\}_{n \ge 1}$ is uniformly integrable. Condition (13.12) follows from part (b) of Theorem 4.2.12 since the $\tau_i$’s are finite and therefore $P(\tau_i > n) \to 0$ as $n \uparrow \infty$. It remains to show that condition (13.11) is satisfied. Let $N < \infty$ be an integer. By Corollary 13.2.7, if $\tau$ is a stopping time (here $\tau_1$ or $\tau_2$), $E[Y_0] \le E[Y_{\tau \wedge N}]$ and therefore
\[ E[|Y_{\tau \wedge N}|] = 2E[Y_{\tau \wedge N}^+] - E[Y_{\tau \wedge N}] \le 2E[Y_{\tau \wedge N}^+] - E[Y_0] . \]
The submartingale $\{Y_n^+\}_{n \ge 0}$ satisfies
\begin{align*}
E[Y_{\tau \wedge N}^+] &= \sum_{j=0}^{N} E\left[ 1_{\{\tau = j\}} Y_j^+ \right] + E\left[ 1_{\{\tau > N\}} Y_N^+ \right] \\
&\le \sum_{j=0}^{N} E\left[ 1_{\{\tau = j\}} Y_N^+ \right] + E\left[ 1_{\{\tau > N\}} Y_N^+ \right] \\
&= E[Y_N^+] \le E[|Y_N|] .
\end{align*}
Therefore
\[ E[|Y_{\tau \wedge N}|] \le 2E[|Y_N|] + E[|Y_0|] \le 3 \sup_N E|Y_N| . \]
Since by Fatou’s lemma $E[|Y_\tau|] \le \liminf_{N \uparrow \infty} E[|Y_{\tau \wedge N}|]$, we have
\[ E[|Y_\tau|] \le 3 \sup_N E[|Y_N|] , \]
a finite quantity since $\{Y_n\}_{n \ge 1}$ is uniformly integrable. □




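Corollary 13.2.9 Let $\{Y_n\}_{n \ge 0}$ be an $\mathcal{F}_n$-submartingale (resp., martingale) and let $\tau$ be an $\mathcal{F}_n$-stopping time such that $E[\tau] < \infty$. Suppose moreover that there exists a constant $c < \infty$ such that, for all $n \ge 0$,
\[ E[|Y_{n+1} - Y_n| \mid \mathcal{F}_n] \le c , \quad P\text{-a.s. on } \{\tau \ge n\} . \]
Then $E[|Y_\tau|] < \infty$ and
\[ E[Y_\tau] \ge (\text{resp., } =)\ E[Y_0] . \]

Proof. In order to apply Theorem 13.2.5 with $\tau_1 = 0$, $\tau_2 = \tau$, one just has to check conditions (13.11) and (13.12) for $\tau$. Let $Z_0 := |Y_0|$. With $Z_n := |Y_n - Y_{n-1}|$ ($n \ge 1$),
\begin{align*}
E\left[ \sum_{j=0}^{\tau} Z_j \right] &= \sum_{n=0}^{\infty} E\left[ 1_{\{\tau = n\}} \sum_{j=0}^{n} Z_j \right] \\
&= \sum_{n=0}^{\infty} \sum_{j=0}^{n} E\left[ 1_{\{\tau = n\}} Z_j \right] \\
&= \sum_{j=0}^{\infty} \sum_{n=j}^{\infty} E\left[ 1_{\{\tau = n\}} Z_j \right] \\
&= \sum_{j=0}^{\infty} E\left[ 1_{\{\tau \ge j\}} Z_j \right] .
\end{align*}
For $j \ge 1$, $\{\tau \ge j\} = \overline{\{\tau \le j - 1\}} \in \mathcal{F}_{j-1}$ and therefore
\[ E\left[ 1_{\{\tau \ge j\}} Z_j \right] = E\left[ 1_{\{\tau \ge j\}} E[Z_j \mid \mathcal{F}_{j-1}] \right] \le c\,P(\tau \ge j) , \]
and
\[ E\left[ \sum_{j=0}^{\tau} Z_j \right] \le E[|Y_0|] + c \sum_{j=1}^{\infty} P(\tau \ge j) = E[|Y_0|] + cE[\tau] < \infty . \qquad (\star) \]
Therefore condition (13.11) is satisfied since $E[|Y_\tau|] \le E\left[ \sum_{j=0}^{\tau} Z_j \right]$. Moreover, if $\tau > n$,
\[ |Y_n| \le \sum_{j=0}^{n} Z_j \le \sum_{j=0}^{\tau} Z_j \]
and therefore
\[ E\left[ 1_{\{\tau > n\}} |Y_n| \right] \le E\left[ 1_{\{\tau > n\}} \sum_{j=0}^{\tau} Z_j \right] . \]
But, by $(\star)$, $E\left[ \sum_{j=0}^{\tau} Z_j \right] < \infty$. Also, $\{\tau > n\} \downarrow \emptyset$ as $n \uparrow \infty$. Therefore, by dominated convergence,
\[ \liminf_{n \uparrow \infty} E\left[ 1_{\{\tau > n\}} |Y_n| \right] \le \liminf_{n \uparrow \infty} E\left[ 1_{\{\tau > n\}} \sum_{j=0}^{\tau} Z_j \right] = 0 . \]
This is condition (13.12). □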




13.2.2 Wald’s Formulas

Wald’s Mean Formula

Theorem 13.2.10 Let $\{Z_n\}_{n \ge 1}$ be an iid sequence of real random variables such that $E[|Z_1|] < \infty$, and let $\tau$ be an $\mathcal{F}_n^Z$-stopping time with $E[\tau] < \infty$. Then
\[ E\left[ \sum_{n=1}^{\tau} Z_n \right] = E[Z_1]\,E[\tau] . \qquad (13.15) \]
If, moreover, $E[Z_1^2] < \infty$,
\[ \mathrm{Var}\left( \sum_{n=1}^{\tau} Z_n \right) = \mathrm{Var}(Z_1)\,E[\tau] . \qquad (13.16) \]

Proof. Let $X_0 := 0$, $X_n := (Z_1 + \cdots + Z_n) - nE[Z_1]$ ($n \ge 1$). Then $\{X_n\}_{n \ge 1}$ is an $\mathcal{F}_n^Z$-martingale such that
\[ E[|X_{n+1} - X_n| \mid \mathcal{F}_n^Z] = E[|Z_{n+1} - E[Z_1]| \mid \mathcal{F}_n^Z] = E|Z_1 - E[Z_1]| \le 2E[|Z_1|] < \infty . \]
Therefore Corollary 13.2.9 can be applied with $Y_n = \sum_{k=1}^{n} (Z_k - E[Z_1])$ to obtain (13.15). For the proof of (13.16), the same kind of argument works, this time with the martingale $Y_n = X_n^2 - n\,\mathrm{Var}(Z_1)$. □
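Wald’s mean formula (13.15) can be verified exactly on a small example by enumerating all paths. In the sketch below, the step distribution, the stopping rule and the cap $N$ on the stopping time (which makes the enumeration finite and keeps $\tau$ a stopping time) are all assumptions chosen for illustration; `wald_check` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import product


def wald_check(values, probs, N, stop):
    """Return (E[sum_{n <= tau} Z_n], E[Z_1] * E[tau]) by exact enumeration.

    tau is the first n <= N such that stop(Z_1, ..., Z_n) is true, and N if
    no such n exists; it depends only on the past, hence is a stopping time.
    """
    mean_z = sum(v * p for v, p in zip(values, probs))
    expected_sum = Fraction(0)
    expected_tau = Fraction(0)
    for indices in product(range(len(values)), repeat=N):
        path = [values[i] for i in indices]
        weight = Fraction(1)
        for i in indices:
            weight *= probs[i]          # probability of this path
        tau = N
        for n in range(1, N + 1):
            if stop(path[:n]):
                tau = n
                break
        expected_sum += weight * sum(path[:tau])
        expected_tau += weight * tau
    return expected_sum, mean_z * expected_tau


if __name__ == "__main__":
    values = (Fraction(-1), Fraction(2))
    probs = (Fraction(1, 2), Fraction(1, 2))
    lhs, rhs = wald_check(values, probs, 6, lambda prefix: sum(prefix) >= 2)
    print(lhs, rhs)
```

Because the stopping time is bounded, both sides are finite sums of rationals and should coincide exactly, not merely approximately.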

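Wald’s Exponential Formula

Theorem 13.2.11 Let $\{Z_n\}_{n \ge 1}$ be iid real random variables and let $S_n = Z_1 + \cdots + Z_n$. Let $\varphi_Z(t) := E[e^{tZ_1}]$ and suppose that $\varphi_Z(t_0)$ exists and is greater than or equal to 1 for some $t_0 \ne 0$. Let $\tau$ be an $\mathcal{F}_n^Z$-stopping time such that $E[\tau] < \infty$ and $|S_n| \le c$ on $\{\tau \ge n\}$ for some constant $c < \infty$. Then
\[ E\left[ \frac{e^{t_0 S_\tau}}{\varphi_Z(t_0)^\tau} \right] = 1 . \qquad (13.17) \]

Proof. Let $Y_0 := 1$ and, for $n \ge 1$,
\[ Y_n := \frac{e^{t_0 S_n}}{\varphi_Z(t_0)^n} . \]
By application of the result of Example 13.1.3 with $X_i := \frac{e^{t_0 Z_i}}{\varphi_Z(t_0)}$, we have that the sequence $\{Y_n\}_{n \ge 0}$ is an $\mathcal{F}_n^Z$-martingale. Moreover, on $\{\tau \ge n\}$,
\[ E[|Y_{n+1} - Y_n| \mid \mathcal{F}_n^Z] = Y_n E\left[ \left| \frac{e^{t_0 Z_{n+1}}}{\varphi_Z(t_0)} - 1 \right| \,\Big|\, \mathcal{F}_n^Z \right] = \frac{Y_n\, E\left[ |e^{t_0 Z_1} - \varphi_Z(t_0)| \right]}{\varphi_Z(t_0)} \le K < \infty \]
since $\varphi_Z(t_0) \ge 1$ and
\[ Y_n = \frac{e^{t_0 S_n}}{\varphi_Z(t_0)^n} \le \frac{e^{|t_0| c}}{\varphi_Z(t_0)^n} \le e^{|t_0| c} . \]
Therefore, Corollary 13.2.9 applies to give (13.17). □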



13.2.3 The Maximum Principle

The general approach to the absorption problem for hmcs of this subsection is in terms of harmonic functions. However, its implementation requires explicit forms of harmonic functions satisfying some boundary conditions, and this is not always easy. In contrast, the purely algebraic method given in the chapter on Markov chains can always be implemented in the finite state space case (of course at the cost of matrix computations).

Let $\{X_n\}_{n \ge 0}$ be an hmc with countable state space $E$ and transition matrix $\mathbf{P}$. Let $D$ be a subset of $E$, called the domain, and let $\overline{D} := E \setminus D$. Let $c : D \to \mathbb{R}$ and $\varphi : \overline{D} \to \mathbb{R}$ be non-negative functions called the unit time gain function and the final gain function, respectively. Let $\tau$ be the hitting time of $\overline{D}$. For each state $i \in E$, define



v(i) = E_i [ Σ c(X_k) + ϕ(X_τ) 1{τ […]

[…] τ_{2k} ; S_n = 0}, τ_{2k} = inf{n > τ_{2k+1} ; S_n ≥ b}, ···

For $i \ge 1$, let $\phi_i = 1$ if $\tau_m < i \le \tau_{m+1}$ for some odd $m$, and $\phi_i = 0$ if $\tau_m < i \le \tau_{m+1}$ for some even $m$.

¹ By definition, an upcrossing occurs at time $\ell$ if $S_k \le a$ and if there exists $\ell > k$ such that $S_j < b$ for $j = 1, \dots, \ell - 1$ and $S_\ell \ge b$.

Observe that
\[ \{\phi_i = 1\} = \bigcup_{\text{odd } m} \left( \{\tau_m < i\} \cap \overline{\{\tau_{m+1} < i\}} \right) \in \mathcal{F}_{i-1} \]
and that
\[ b\,\nu_n \le \sum_{i=1}^{n} \phi_i (S_i - S_{i-1}) . \]
Therefore
\begin{align*}
b\,E[\nu_n] \le E\left[ \sum_{i=1}^{n} \phi_i (S_i - S_{i-1}) \right] &= \sum_{i=1}^{n} E[\phi_i (S_i - S_{i-1})] \\
&= \sum_{i=1}^{n} E[\phi_i E[(S_i - S_{i-1}) \mid \mathcal{F}_{i-1}]] = \sum_{i=1}^{n} E[\phi_i (E[S_i \mid \mathcal{F}_{i-1}] - S_{i-1})] \\
&\le \sum_{i=1}^{n} E[(E[S_i \mid \mathcal{F}_{i-1}] - S_{i-1})] \le \sum_{i=1}^{n} (E[S_i] - E[S_{i-1}]) = E[S_n - S_0] .
\end{align*}
□

Theorem 13.3.2 Let $\{S_n\}_{n \ge 0}$ be an $\mathcal{F}_n$-submartingale. Suppose moreover that it is $L^1$-bounded, that is,
\[ \sup_{n \ge 0} E[|S_n|] < \infty . \qquad (13.27) \]
Then $\{S_n\}_{n \ge 0}$ converges $P$-a.s. to an integrable random variable $S_\infty$.

Remark 13.3.3 Condition (13.27) can be replaced by the equivalent condition
\[ \sup_{n \ge 0} E[S_n^+] < \infty . \]
Indeed, if $\{S_n\}_{n \ge 0}$ is an $\mathcal{F}_n$-submartingale,
\[ E[S_n^+] \le E[|S_n|] \le 2E[S_n^+] - E[S_n] \le 2E[S_n^+] - E[S_0] . \]

Remark 13.3.4 By changing signs, the same hypothesis leads to the same conclusion for a supermartingale $\{S_n\}_{n \ge 0}$. Similarly to the previous remark, condition (13.27) can be replaced by the equivalent condition
\[ \sup_{n \ge 0} E[S_n^-] < \infty . \]

Proof. The proof is based on the following observation concerning any deterministic sequence $\{x_n\}_{n \ge 1}$. If this sequence does not converge, then it is possible to find two rational numbers $a$ and $b$ such that
\[ \liminf_n x_n < a < b < \limsup_n x_n , \]
which implies that the number of upcrossings of $[a, b]$ by this sequence is infinite. Therefore, to prove convergence, it suffices to prove that any interval $[a, b]$ with rational extremities is crossed at most a finite number of times.


Let $\nu_n([a, b])$ be the number of upcrossings of an interval $[a, b]$ prior ($\le$) to time $n$ and let $\nu_\infty([a, b]) := \lim_{n \uparrow \infty} \nu_n([a, b])$. By (13.25),
\[ (b - a) E[\nu_n([a, b])] \le E[(S_n - a)^+] \le E[S_n^+] + |a| \le \sup_{k \ge 0} E[S_k^+] + |a| \le \sup_{k \ge 0} E[|S_k|] + |a| < \infty . \]
Therefore, letting $n \uparrow \infty$,
\[ (b - a) E[\nu_\infty([a, b])] < \infty . \]
In particular, $\nu_\infty([a, b]) < \infty$, $P$-a.s. Therefore, $P$-a.s. there is only a finite number of upcrossings of any rational interval $[a, b]$. Equivalently, in view of the observation made in the first lines of the proof, $\{S_n\}_{n \ge 0}$ converges $P$-a.s. to some random variable $S_\infty$. Therefore (by Fatou’s lemma for the first inequality):
\[ E[|S_\infty|] = E\left[ \lim_{n \uparrow \infty} |S_n| \right] \le \liminf_{n \uparrow \infty} E|S_n| \le \sup_{n \ge 0} E|S_n| < \infty . \]
□

Corollary 13.3.5 (a) Any non-positive submartingale $\{S_n\}_{n \ge 0}$ almost surely converges to an integrable random variable. (b) Any non-negative supermartingale almost surely converges to an integrable random variable.

Proof. (b) follows from (a) by changing signs. For (a), we have
\[ E[|S_n|] = -E[S_n] \le -E[S_0] = E[|S_0|] < \infty . \]
Therefore (13.27) is satisfied and the conclusion then follows from Theorem 13.3.2. □



An immediate application of the martingale convergence theorem is to gambling. The next example teaches us that a gambler in a “fair game” is eventually ruined.

Example 13.3.6: Fair game not so fair. Consider the situation in Example 13.1.4, assuming that the initial fortune $a$ is a positive integer and that the bets are also positive integers (that is, the functions $b_{n+1}(X_0^n) \in \mathbb{N}_+$ except if $Y_n = 0$, in which case the gambler is not allowed to bet anymore, or equivalently $b_n(X_0^{n-1} * 0) := b_n(X_0, X_1, \dots, X_n, 0) = 0$). In particular, $Y_n \ge 0$ for all $n \ge 0$. Therefore the process $\{Y_n\}_{n \ge 0}$ is a non-negative $\mathcal{F}_n^X$-martingale and by the martingale convergence theorem it almost surely has a finite limit. Since the bets are assumed positive integers when the fortune of the player is positive, this limit cannot be other than 0. Since $Y_n$ is a non-negative integer for all $n \ge 0$, this can happen only if the fortune of the gambler becomes null in finite time.

Example 13.3.7: Branching processes via martingales. The power of the concept of martingale will now be illustrated by revisiting the branching process. It is assumed that $P(Z = 0) < 1$ and $P(Z \ge 2) > 0$ (to get rid of trivialities). The stochastic process
\[ Y_n = \frac{X_n}{m^n} , \]
where $m$ is the average number of sons of a given individual, is an $\mathcal{F}_n^X$-martingale. Indeed, since each one among the $X_n$ members of the $n$th generation gives birth on average to $m$ sons and does this independently of the rest of the population, $E[X_{n+1} \mid X_n] = mX_n$ and
\[ E\left[ \frac{X_{n+1}}{m^{n+1}} \,\Big|\, \mathcal{F}_n^X \right] = E\left[ \frac{X_{n+1}}{m^{n+1}} \,\Big|\, X_n \right] = \frac{X_n}{m^n} . \]
By the martingale convergence theorem, almost surely
\[ \lim_{n \uparrow \infty} \frac{X_n}{m^n} = Y < \infty . \]
In particular, if $m < 1$, then $\lim_{n \uparrow \infty} X_n = 0$ almost surely. Since $X_n$ takes integer values, this implies that the branching process eventually becomes extinct. If $m = 1$, then $\lim_{n \uparrow \infty} X_n = X_\infty < \infty$ and it is easily argued that this limit must be 0. Therefore, in this case as well the process eventually becomes extinct.

For the case $m > 1$, we consider the unique solution in $(0, 1)$ of $x = g(x)$ ($g$ is the generating function of the typical progeny of a member of the population considered). Suppose we can show that $Z_n = x^{X_n}$ is a martingale. Then, by the martingale convergence theorem, $Z_n$ converges to a finite limit and therefore $X_n$ has a limit $X_\infty$, which however can be infinite. One can easily argue that this limit cannot be other than 0 (extinction) or $\infty$ (non-extinction). Since $\{Z_n\}_{n \ge 0}$ is a martingale, $x = E[Z_0] = E[Z_n]$ and therefore, by dominated convergence, $x = E[Z_\infty] = E[x^{X_\infty}] = P(X_\infty = 0)$. Therefore $x$ is the probability of extinction.

It remains to show that $\{Z_n\}_{n \ge 0}$ is an $\mathcal{F}_n^X$-martingale. For all $i \in \mathbb{N}$, with $x$ the above solution of $x = g(x)$, $E[x^{X_{n+1}} \mid X_n = i] = x^i$. This is obvious if $i = 0$. If $i > 0$, $X_{n+1}$ is the sum of $i$ independent random variables with the same generating function $g$, and therefore $E[x^{X_{n+1}} \mid X_n = i] = g(x)^i = x^i$. From this last result and the Markov property,
\[ E[x^{X_{n+1}} \mid \mathcal{F}_n^X] = E[x^{X_{n+1}} \mid X_n] = x^{X_n} . \]
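The fixed-point characterization of the extinction probability lends itself to a short numerical sketch. The code below iterates $x \mapsto g(x)$ from $x = 0$; the iterates are the probabilities $P(X_n = 0)$ when $X_0 = 1$, which increase to the smallest fixed point of $g$ in $[0, 1]$. The offspring distribution used in the usage lines is an arbitrary illustration, and the function names are hypothetical.

```python
def offspring_generating_function(probs):
    """Return g(x) = sum_k probs[k] * x**k for an offspring distribution."""
    def g(x):
        return sum(p * x ** k for k, p in enumerate(probs))
    return g


def extinction_probability(probs, tol=1e-12, max_iter=100_000):
    """Smallest fixed point of g in [0, 1], by iterating g from 0."""
    g = offspring_generating_function(probs)
    x = 0.0
    for _ in range(max_iter):
        nxt = g(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    return x  # not converged within max_iter (slow when the mean m equals 1)


if __name__ == "__main__":
    # P(Z=0)=1/4, P(Z=1)=1/4, P(Z=2)=1/2: m = 5/4 > 1 and x = g(x) reads
    # 2x^2 - 3x + 1 = 0, whose root in (0, 1) is 1/2.
    print(extinction_probability([0.25, 0.25, 0.5]))
    # P(Z=0)=1/2, P(Z=1)=1/4, P(Z=2)=1/4: m = 3/4 < 1, extinction is certain.
    print(extinction_probability([0.5, 0.25, 0.25]))
```

In the supercritical example the iteration converges geometrically at rate $g'(1/2) = 3/4$; in the critical case $m = 1$ convergence is much slower, which is why the sketch caps the number of iterations.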

Theorem 13.3.8 Let $\{M_n\}_{n \ge 0}$ be an $\mathcal{F}_n$-martingale such that for some $p \in (1, \infty)$,
\[ \sup_{n \ge 0} E|M_n|^p < \infty . \qquad (13.28) \]
Then $\{M_n\}_{n \ge 0}$ converges a.s. and in $L^p$ to some finite variable $M_\infty$.

Proof. By hypothesis, the martingale $\{M_n\}_{n \ge 0}$ is $L^p$-bounded and a fortiori $L^1$-bounded since $p > 1$. Therefore it converges almost surely. By Doob’s inequality, $E[\max_{0 \le i \le n} |M_i|^p] \le q^p E|M_n|^p$ and in particular,
\[ E\left[ \max_{0 \le i \le n} |M_i|^p \right] \le q^p \sup_k E|M_k|^p < \infty . \]
Letting $n \uparrow \infty$, we have in view of condition (13.28) that
\[ E\left[ \sup_{n \ge 0} |M_n|^p \right] < \infty . \qquad (13.29) \]
Therefore $\{|M_n|^p\}_{n \ge 0}$ is uniformly integrable (Theorem 4.2.14). In particular, since it converges almost surely, it also converges in $L^1$ (Theorem 4.2.16). In other words, $\{M_n\}_{n \ge 0}$ converges in $L^p$. □

Remark 13.3.9 The above result was proved for $p > 1$ (the proof depended on Doob’s inequality, which is true for $p > 1$). For $p = 1$, a similar result holds with an additional assumption of uniform integrability. Note however that the next result also applies to submartingales.

Theorem 13.3.10 A uniformly integrable $\mathcal{F}_n$-submartingale $\{S_n\}_{n \ge 0}$ converges a.s. and in $L^1$ to an integrable random variable $S_\infty$ and $E[S_\infty \mid \mathcal{F}_n] \ge S_n$.

Proof. By the uniform integrability hypothesis, $\sup_n E[|S_n|] < \infty$ and therefore, by Theorem 13.3.2, $S_n$ converges almost surely to some integrable random variable $S_\infty$. It also converges to this variable in $L^1$ since a uniformly integrable sequence that converges almost surely also converges in $L^1$ (Theorem 4.2.16). By the submartingale property, for all $A \in \mathcal{F}_n$, all $m \ge n$,
\[ E[1_A S_n] \le E[1_A S_m] . \]
Since convergence is in $L^1$,
\[ \lim_{m \uparrow \infty} E[1_A S_m] = E[1_A S_\infty] , \]
so that finally $E[1_A S_n] \le E[1_A S_\infty]$. This being true for all $A \in \mathcal{F}_n$, we have that $E[S_\infty \mid \mathcal{F}_n] \ge S_n$. □

The following result is Lévy’s continuity theorem for conditional expectations.

Corollary 13.3.11 Let $\{\mathcal{F}_n\}_{n \ge 1}$ be a filtration and let $\xi$ be an integrable random variable. Let $\mathcal{F}_\infty := \sigma\left( \cup_{n \ge 1} \mathcal{F}_n \right)$. Then
\[ \lim_{n \uparrow \infty} E[\xi \mid \mathcal{F}_n] = E[\xi \mid \mathcal{F}_\infty] . \qquad (13.30) \]

Proof. It suffices to treat the case where $\xi$ is non-negative. The sequence $\{M_n = E[\xi \mid \mathcal{F}_n]\}_{n \ge 1}$ is a uniformly integrable $\mathcal{F}_n$-martingale (Theorem 4.2.13) and by Theorem 13.3.10, it converges almost surely and in $L^1$ to some integrable random variable $M_\infty$. We have to show that $M_\infty = E[\xi \mid \mathcal{F}_\infty]$. For $m \ge n$ and $A \in \mathcal{F}_n$,
\[ E[1_A M_m] = E[1_A M_n] = E[1_A E[\xi \mid \mathcal{F}_n]] = E[1_A \xi] . \]
Since convergence is also in $L^1$, $\lim_{m \uparrow \infty} E[1_A M_m] = E[1_A M_\infty]$. Therefore
\[ E[1_A M_\infty] = E[1_A \xi] \qquad (13.31) \]
for all $A \in \mathcal{F}_n$ and therefore for all $A \in \cup_n \mathcal{F}_n$. The σ-finite measures $A \mapsto E[1_A M_\infty]$ and $A \mapsto E[1_A \xi]$ agreeing on the algebra $\cup_n \mathcal{F}_n$ also agree on the smallest σ-algebra containing it, that is $\mathcal{F}_\infty$. Therefore (13.31) holds for all $A \in \mathcal{F}_\infty$ (Theorem 2.1.50) and this implies
\[ E[1_A M_\infty] = E[1_A E[\xi \mid \mathcal{F}_\infty]] , \]
and finally, since $M_\infty$ is $\mathcal{F}_\infty$-measurable, $M_\infty = E[\xi \mid \mathcal{F}_\infty]$. □




Kakutani’s Theorem

Let $\{X_n\}_{n \ge 1}$ be an independent sequence of non-negative random variables with mean 1. Let $M_0 := 1$ and let
\[ M_n := \prod_{i=1}^{n} X_i \qquad (n \ge 1) . \]
Then (Example 13.1.3) $\{M_n\}_{n \ge 0}$ is a non-negative martingale and (Theorem 13.3.5) it converges almost surely to a finite random variable $M_\infty$. Let $a_n := E[X_n^{1/2}]$. (Note that $\prod_{n=1}^{\infty} a_n \le 1$.)

Theorem 13.3.12 The following conditions are equivalent:
(i) $\prod_{n=1}^{\infty} a_n > 0$.
(ii) $E[M_\infty] = 1$.
(iii) $M_n \to M_\infty$ in $L^1$.
(iv) $\{M_n\}_{n \ge 0}$ is uniformly integrable.

Proof. Note first that (iv) implies (iii) (since $M_n \to M_\infty$ a.s. and by Theorem 13.3.10) which in turn implies (ii) since $1 = E[M_n] \to E[M_\infty]$. The announced equivalences will be proved if one can show that (i) implies (iv) and that (ii) implies (i).

A. (i) implies (iv): let $m_0 := 1$ and
\[ m_n := \frac{\left( \prod_{i=1}^{n} X_i \right)^{1/2}}{\prod_{i=1}^{n} a_i} \qquad (n \ge 1) . \]
This is a martingale and an $L^2$-bounded one since
\[ E[m_n^2] = \frac{1}{(a_1 \cdots a_n)^2} \le \frac{1}{\left( \prod_{n=1}^{\infty} a_n \right)^2} < \infty . \]
By Doob’s inequality (Theorem 13.1.15) for $p = 2$,
\[ E\left[ \sup_n |m_n|^2 \right] \le 4 \sup_n E[m_n^2] < \infty . \]
Also, since $\prod_{n=1}^{\infty} a_n \le 1$, $M_n \le m_n^2$ and in particular
\[ E\left[ \sup_n |M_n| \right] \le E\left[ \sup_n |m_n|^2 \right] < \infty . \]
Therefore $M_n$ is uniformly dominated by the integrable random variable $\sup_n |M_n|$, which implies that it is uniformly integrable (Example 4.2.10).

B. (ii) implies (i) or, equivalently, if (i) is not true, then (ii) is not true. Therefore suppose in view of contradiction that $\prod_{n=1}^{\infty} a_n = 0$. Being a non-negative martingale, $\{m_n\}_{n \ge 0}$ converges to a finite limit. Since $\prod_{n=1}^{\infty} a_n = 0$, this can happen only if $M_\infty = 0$, a contradiction with (ii). □


13.3.2 Backwards (or Reverse) Martingales

In the following, pay attention to the indexation: the index set is the set of non-positive relative integers. Let $\{\mathcal{F}_n\}_{n \le 0}$ be a non-decreasing family of σ-fields, that is, $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$ for all $n \le -1$. There is nothing new in the definition of “backwards” or “reverse” martingales or submartingales, except that the index set is now $\{\dots, -2, -1, 0\}$. For instance, $\{Y_n\}_{n \le 0}$ is an $\mathcal{F}_n$-submartingale if $E[Y_n \mid \mathcal{F}_{n-1}] \ge Y_{n-1}$ for all $n \le 0$. The term “backwards” in fact refers to one of the uses that is made of this notion, that of discussing the limit of $Y_n$ as $n \downarrow -\infty$.

Reverse martingales or submartingales often appear in the following setting. Let $\{Z_k\}_{k \ge 0}$ be a sequence of integrable random variables. Suppose that
\[ E[Z_{k-1} \mid Z_k, Z_{k+1}, Z_{k+2}, \dots] = Z_k \qquad (k \ge 0) . \]
Clearly, the change of indexation $k \to -n$ gives a “backwards” martingale. The next example concerns that situation.

Example 13.3.13: Empirical mean of an iid sequence. Let $\{X_n\}_{n \ge 1}$ be an iid sequence of integrable random variables and let
\[ Z_k := \frac{1}{k} S_k , \]
where $S_k := X_1 + \cdots + X_k$. We shall prove that $E[Z_{k-1} \mid \mathcal{G}_k] = Z_k$, where $\mathcal{G}_k = \sigma(Z_k, Z_{k+1}, Z_{k+2}, \dots)$. It suffices to prove that for all $k \ge 1$,
\[ E[Z_1 \mid \mathcal{G}_k] = Z_k , \qquad (\star) \]
since it then follows that for $m \le k$,
\[ E[Z_m \mid \mathcal{G}_k] = E[E[Z_1 \mid \mathcal{G}_m] \mid \mathcal{G}_k] = E[Z_1 \mid \mathcal{G}_k] = Z_k . \]
By linearity,
\[ S_k = E[S_k \mid \mathcal{G}_k] = \sum_{j=1}^{k} E[X_j \mid \mathcal{G}_k] . \]
From the fact that $\mathcal{G}_k = \sigma(Z_k, Z_{k+1}, Z_{k+2}, \dots) = \sigma(S_k, X_{k+1}, X_{k+2}, \dots)$ and by the iid assumption for $\{X_n\}_{n \ge 1}$,
\[ S_k = \sum_{j=1}^{k} E[X_j \mid S_k, X_{k+1}, X_{k+2}, \dots] = \sum_{j=1}^{k} E[X_j \mid S_k] . \]
But the pairs $(X_j, S_k)$ ($1 \le j \le k$) have the same distribution, and therefore
\[ \sum_{j=1}^{k} E[X_j \mid S_k] = kE[X_1 \mid S_k] = kE[X_1 \mid \mathcal{G}_k] = kE[Z_1 \mid \mathcal{G}_k] , \]
from which $(\star)$ follows.


Theorem 13.3.14 Let $\{\mathcal{F}_n\}_{n \le 0}$ be a non-decreasing family of σ-fields. Let $\{S_n\}_{n \le 0}$ be an $\mathcal{F}_n$-submartingale. Then:
A. $S_n$ converges $P$-a.s. and in $L^1$ as $n \downarrow -\infty$ to an integrable random variable $S_{-\infty}$, and
B. with $\mathcal{F}_{-\infty} := \cap_{n \le 0} \mathcal{F}_n$,
\[ S_{-\infty} \le E[S_0 \mid \mathcal{F}_{-\infty}] , \]
with equality if $\{S_n\}_{n \le 0}$ is an $\mathcal{F}_n$-martingale.

Proof. First note that by the submartingale property, $S_n \le E[S_0 \mid \mathcal{F}_n]$ ($n \le 0$). In particular, $\{S_n\}_{n \le 0}$ is not only $L^1$-bounded, but also uniformly integrable (Theorem 4.2.13).

A. Denoting by $\nu_m = \nu_m([a, b])$ the number of upcrossings of $[a, b]$ by $\{S_n\}_{n \le 0}$ in the integer interval $[-m, 0]$ and by $\nu = \nu([a, b])$ the total number of upcrossings of $[a, b]$, the upcrossing inequality yields
\[ (b - a) E[\nu_m] \le E[(S_0 - a)^+] < \infty , \]
and letting $m \uparrow \infty$, $E[\nu] < \infty$. Almost-sure convergence to an integrable random variable $S_{-\infty}$ is then proved as in Theorem 13.3.2. Since $\{S_n\}_{n \le 0}$ is uniformly integrable, convergence to $S_{-\infty}$ is also in $L^1$.

B. Clearly, $S_{-\infty}$ is $\mathcal{F}_{-\infty}$-measurable. Also, by the submartingale property, $S_n \le E[S_0 \mid \mathcal{F}_n]$ ($n \le -1$), that is, for all $n \le -1$ and all $A \in \mathcal{F}_n$,
\[ \int_A S_n \, dP \le \int_A S_0 \, dP . \]
This is true for any $A \in \mathcal{F}_{-\infty}$ because $\mathcal{F}_{-\infty} \subseteq \mathcal{F}_n$ for all $n \le -1$. Since $S_n$ converges to $S_{-\infty}$ in $L^1$ as $n \downarrow -\infty$, $\int_A S_n \, dP \to \int_A S_{-\infty} \, dP$ and therefore
\[ \int_A S_{-\infty} \, dP \le \int_A S_0 \, dP \qquad (A \in \mathcal{F}_{-\infty}) , \]
which implies that $S_{-\infty} \le E[S_0 \mid \mathcal{F}_{-\infty}]$. The martingale case is obtained using the same proof with each $\le$ symbol replaced by $=$. □

Remark 13.3.15 Statement B says that $\{S_n\}_{n \in -\mathbb{N} \cup \{-\infty\}}$ is a submartingale relatively to the history $\{\mathcal{F}_n\}_{n \in -\mathbb{N} \cup \{-\infty\}}$.

Example 13.3.16: The Strong Law of Large Numbers. The situation is that of Example 13.3.13. By Theorem 13.3.14, $S_k/k$ converges almost surely. By Kolmogorov’s zero-one law (Theorem 4.3.3), $S_k/k \to a$, a deterministic number. It remains to identify $a$ with $E[X_1]$. We know from the first lines of the proof of Theorem 13.3.14 that $\{S_k/k\}_{k \ge 1}$ is uniformly integrable. Therefore, by Theorem 4.2.16,
\[ \lim_{k \uparrow \infty} E\left[ \frac{S_k}{k} \right] = a . \]
But for all $k \ge 1$, $E[S_k/k] = E[X_1]$.

The uniform integrability of the backwards submartingale in Theorem 13.3.14 followed directly from the submartingale property. This is not the case for a supermartingale unless one adds a condition.

Theorem 13.3.17 Let $\{\mathcal{F}_n\}_{n \le 0}$ be a filtration and let $\{S_n\}_{n \le 0}$ be an $\mathcal{F}_n$-supermartingale such that
\[ \sup_{n \le 0} E[S_n] < \infty . \qquad (13.32) \]
Then
A. $S_n$ converges $P$-a.s. and in $L^1$ as $n \downarrow -\infty$ to an integrable random variable $S_{-\infty}$, and
B. with $\mathcal{F}_{-\infty} := \cap_{n \le 0} \mathcal{F}_n$,
\[ S_{-\infty} \ge E[S_0 \mid \mathcal{F}_{-\infty}] \quad P\text{-a.s.} \]

Proof.$^2$ It suffices to prove uniform integrability, since the rest of the proof then follows the same lines as in Theorem 13.3.10. Fix $\varepsilon > 0$ and select $k \le 0$ such that
\[ \lim_{i \downarrow -\infty} E[S_i] - E[S_k] \le \varepsilon . \qquad (\star) \]
Then $0 \le E[S_n] - E[S_k] \le \varepsilon$ for all $n \le k$. We first show that for sufficiently large $\lambda > 0$,
\[ \int_{\{|S_n| > \lambda\}} |S_n| \, dP \le \varepsilon . \]
It is enough to prove this for sufficiently large $-n$, here for $-n \ge -k$. The previous integral is equal to
\[ -\int_{\{S_n < -\lambda\}} S_n \, dP + E[S_n] - \int_{\{S_n \le \lambda\}} S_n \, dP . \]
[…]
\[ P(|S_n| > \lambda) \le \frac{E[|S_n|]}{\lambda} \to 0 \]
uniformly in $n \le 0$, and therefore
\[ \int_{\{|S_n| > \lambda\}} |S_k| \, dP \to 0 \]
uniformly in $n$. □

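The following result is the backwards Lévy’s continuity theorem for conditional expectations.

Corollary 13.3.18 Let $\{\mathcal{F}_n\}_{n \le 0}$ be a history and let $\xi$ be an integrable random variable. Then, with $\mathcal{F}_{-\infty} := \cap_{n \le 0} \mathcal{F}_n$,
\[ \lim_{n \downarrow -\infty} E[\xi \mid \mathcal{F}_n] = E[\xi \mid \mathcal{F}_{-\infty}] . \qquad (13.33) \]

Proof. $M_n := E[\xi \mid \mathcal{F}_n]$ ($n \le 0$) is an $\mathcal{F}_n$-martingale and therefore, by the backwards martingale convergence theorem, it converges as $n \downarrow -\infty$ almost surely and in $L^1$ to some integrable variable $M_{-\infty}$ and
\[ M_{-\infty} = E[M_0 \mid \mathcal{F}_{-\infty}] = E[E[\xi \mid \mathcal{F}_0] \mid \mathcal{F}_{-\infty}] = E[\xi \mid \mathcal{F}_{-\infty}] \]
since $\mathcal{F}_{-\infty} \subseteq \mathcal{F}_0$. □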



Local Absolute Continuity

Recall the setting of Example 13.1.6. On the measurable space $(\Omega, \mathcal{F})$ are given two probability measures $Q$ and $P$ and a filtration $\{\mathcal{F}_n\}_{n \ge 1}$ such that $\mathcal{F} = \vee_{n \ge 1} \mathcal{F}_n := \mathcal{F}_\infty$. Let $Q_n$ and $P_n$ denote the restrictions to $(\Omega, \mathcal{F}_n)$ of $Q$ and $P$ respectively. Suppose that $Q_n \ll P_n$ ($n \ge 1$), in which case we say that $Q$ is locally absolutely continuous with respect to $P$ along $\{\mathcal{F}_n\}_{n \ge 1}$ and denote this by $Q \overset{\text{loc.}}{\ll} P$. Let then
\[ L_n := \frac{dQ_n}{dP_n} \]
denote the corresponding Radon–Nikodým derivative. The question is: under what circumstances can we assert that $Q \ll P$? And what can be said if this is not the case? That this is not always the case is clear from the following elementary example.

Example 13.3.19: Independent sequences of 0’s and 1’s. In this example $\mathcal{F}_n = \sigma(X_1, \dots, X_n)$, where $\{X_n\}_{n \ge 1}$ is an independent sequence of $\{0, 1\}$-valued random variables and
\[ Q(X_n = 1) = q_n > 0 \quad \text{and} \quad P(X_n = 1) = p_n > 0 , \]
where $\sum_{n \ge 1} q_n = \infty$ and $\sum_{n \ge 1} p_n < \infty$. Then, by the positivity condition on the $p_n$’s and the $q_n$’s, $Q \overset{\text{loc.}}{\ll} P$. However $Q$ and $P$ are mutually singular since $Q(X_n \to 0) = 0$ and $P(X_n \to 0) = 1$ (see Exercise 4.6.7).


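Theorem 13.3.20 The Radon–Nikodým sequence $\{L_n\}_{n \ge 1}$ converges $Q$-almost surely and $P$-almost surely to some random variable $L_\infty$ and
\[ dQ = 1_{\{L_\infty = \infty\}} \, dQ + L_\infty \, dP , \qquad (13.34) \]
where $P(L_\infty = \infty) = 0$.

Remark 13.3.21 In particular, the measures $d\lambda := 1_{\{L_\infty = \infty\}} \, dQ$ and $d\mu := L_\infty \, dP$ are mutually singular and $\mu$ is absolutely continuous with respect to $P$, so that (13.34) is the Lebesgue decomposition of $Q$ with respect to $P$ (Theorem 2.3.30).

Proof. Denote by $\nu$ (resp. $\nu_n$) the probability $\frac{1}{2}(P + Q)$ on $(\Omega, \mathcal{F})$ (resp. $\frac{1}{2}(P_n + Q_n)$ on $(\Omega, \mathcal{F}_n)$). Since $Q_n$ and $P_n$ are dominated by $\nu_n$, there exists for each $n \ge 1$ an ($\mathcal{F}_n$-measurable) Radon–Nikodým derivative $U_n := \frac{dQ_n}{d\nu_n}$. The sequence $\{U_n\}_{n \ge 1}$ is a $(\nu, \mathcal{F}_n)$-martingale, since for all $n \ge 1$ and all $A \in \mathcal{F}_n$ (and therefore also in $\mathcal{F}_{n+1}$),
\[ \int_A U_{n+1} \, d\nu = Q_{n+1}(A) = Q_n(A) = \int_A U_n \, d\nu . \]
Also, $U_n \le 2$ because $Q_n \le 2\,\frac{P_n + Q_n}{2} = 2\nu_n$. Being a bounded $(\nu, \mathcal{F}_n)$-martingale, $\{U_n\}_{n \ge 1}$ converges $\nu$-a.s. and in $L^1(\nu)$ to some random variable $U_\infty$. Therefore, for all $k \ge 1$, all $n \ge 0$ and all $A \in \mathcal{F}_k$,
\[ \int_A U_\infty \, d\nu = \lim_{n \uparrow \infty} \int_A U_{n+k} \, d\nu \]
and therefore, since for all $n \ge 0$ and all $A \in \mathcal{F}_k$, $\int_A U_{n+k} \, d\nu = \int_A U_k \, d\nu = Q(A)$,
\[ \int_A U_\infty \, d\nu = Q(A) . \]
This being true for all $A \in \mathcal{F}_k$, the probability measures $U_\infty \, d\nu$ and $dQ$ agree on $\mathcal{F}_k$. This being true for all $k \ge 1$, they agree on the algebra $\cup_{k \ge 1} \mathcal{F}_k$, and therefore on $\mathcal{F}_\infty = \vee_{k \ge 1} \mathcal{F}_k$ (Carathéodory’s theorem). We have just proved that $dQ = U_\infty \, d\frac{P + Q}{2}$, that is, $(2 - U_\infty) \, dQ = U_\infty \, dP$, from which it follows that $P(U_\infty = 2) = 0$ and that if $Q(U_\infty = 2) = 0$, then $Q \ll P$ and
\[ \frac{dQ}{dP} = \frac{U_\infty}{2 - U_\infty} . \]
Since $L_n = \frac{U_n}{2 - U_n}$, $L_n \to L_\infty = \frac{U_\infty}{2 - U_\infty}$ $(P + Q)$-a.s. and $P(L_\infty = \infty) = 0$. Now, $dQ_n = L_n \, dP_n$ and $\frac{2}{1 + L_\infty} \, dQ = \frac{2L_\infty}{1 + L_\infty} \, dP$. Hence the decomposition
\[ dQ = 1_{\{L_\infty = \infty\}} \, dQ + L_\infty \, dP \]
where $P(L_\infty = \infty) = 0$ and $Q(L_\infty = 0) = 0$. □

Theorem 13.3.22
\[ Q \ll P \iff E_P[L_\infty] = 1 \iff Q(L_\infty < \infty) = 1 , \]
\[ Q \perp P \iff E_P[L_\infty] = 0 \iff Q(L_\infty = \infty) = 1 . \]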



13.3. CONVERGENCE OF MARTINGALES


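Proof. Write (13.34) as
\[ Q(A) = \int_A 1_{\{L_\infty = \infty\}}\, dQ + \int_A L_\infty\, dP \qquad (A \in \mathcal{F}_\infty). \]
With $A = \Omega$, $1 = Q(L_\infty = \infty) + E_P[L_\infty]$, and therefore
\[ E_P[L_\infty] = 1 \iff Q(L_\infty < \infty) = 1, \qquad E_P[L_\infty] = 0 \iff Q(L_\infty = \infty) = 1. \]
If $Q(L_\infty = \infty) = 0$ it follows by (13.34) that $Q \ll P$. Conversely, if $Q \ll P$, then $Q(L_\infty = \infty) = 0$ since $P(L_\infty = \infty) = 0$.

If $Q \perp P$, there exists a $B \in \mathcal{F}$ such that $Q(B) = 1$ and $P(B) = 0$. In particular, from (13.34), $Q(B \cap \{L_\infty = \infty\}) = 1$ and therefore $Q(L_\infty = \infty) = 1$. Finally, if $Q(L_\infty = \infty) = 1$, then $Q \perp P$ since $P(L_\infty = \infty) = 0$.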
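
As a numerical illustration of Theorem 13.3.22 in the setting of Example 13.3.19, the following Python sketch evaluates the likelihood ratio $L_n$ along the all-zero trajectory. The specific choice $p_k = 1/(k+1)^2$, $q_k = 1/(k+1)$ and the function name are assumptions made here for illustration; they are not taken from the text. With this choice each factor equals $(1-q_k)/(1-p_k) = (k+1)/(k+2)$, so that $L_n = 2/(n+2) \to 0$ on this trajectory. Since $P$-almost every trajectory is eventually zero, it differs from the all-zero one in finitely many finite positive factors only, which is consistent with $L_\infty = 0$ $P$-a.s., that is, with $E_P[L_\infty] = 0$ and $Q \perp P$.

```python
def likelihood_ratio_on_zero_path(n):
    """Return L_n = dQ_n/dP_n evaluated at X_1 = ... = X_n = 0.

    Illustrative assumption (not from the text):
    p_k = P(X_k = 1) = 1/(k+1)^2 and q_k = Q(X_k = 1) = 1/(k+1).
    """
    ratio = 1.0
    for k in range(1, n + 1):
        p_k = 1.0 / (k + 1) ** 2   # summable sequence
        q_k = 1.0 / (k + 1)        # non-summable sequence
        # Factor contributed by the observation X_k = 0.
        ratio *= (1.0 - q_k) / (1.0 - p_k)
    return ratio


for n in (10, 100, 1000):
    # The second column is the closed form 2/(n+2) derived above.
    print(n, likelihood_ratio_on_zero_path(n), 2.0 / (n + 2))
```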

Remark 13.3.23 By Theorem 4.2.16, the condition $E_P[L_\infty] = 1$ is equivalent to the uniform $P$-integrability of $\{L_n\}_{n\ge 1}$. Therefore, in order to prove that $Q \ll P$, any condition guaranteeing uniform integrability is a sufficient condition for the absolute continuity of $Q$ with respect to $P$. For instance (Theorem 4.2.14 and Example 4.2.15),
\[ \sup_n E\left[L_n^{1+\alpha}\right] < \infty \qquad (\alpha > 1) \]
and
\[ \sup_n E\left[L_n \log^+ L_n\right] < \infty. \]

Example 13.3.24: Kakutani’s Dichotomy Theorem. Let $\{X_n\}_{n\ge 1}$ be a sequence of random elements with values in the measurable space $(E, \mathcal{E})$. We may suppose that it is the coordinate sequence of the canonical space $(\Omega, \mathcal{F}) := (E^{\mathbb{N}}, \mathcal{E}^{\otimes \mathbb{N}})$. Let $Q$ and $P$ be two probability measures on $(\Omega, \mathcal{F})$ such that the sequence is iid relatively to both. Let $Q_{X_n}$ and $P_{X_n}$ be the restrictions of $Q$ and $P$ respectively to $\sigma(X_n)$ and let $Q_n$ and $P_n$ be the restrictions of $Q$ and $P$ respectively to $\mathcal{F}_n := \sigma(X_1, \ldots, X_n)$. We assume that for all $n \ge 1$, $Q_{X_n} \ll P_{X_n}$ and denote the corresponding Radon–Nikodým derivative $\frac{dQ_{X_n}}{dP_{X_n}}$ by $f_n(X_n)$. Then for all $n \ge 1$, $Q_n \ll P_n$ and
\[ L_n = \frac{dQ_n}{dP_n} = \prod_{i=1}^n f_i(X_i). \]
Since
\[ \{L_\infty < \infty\} = \{\log L_\infty < \infty\} = \left\{ \sum_{i=1}^{\infty} \log f_i(X_i) < \infty \right\} \]
is a tail-event of the sequence, its probability is 0 or 1. Therefore, there are only two possibilities, either $Q \ll P$ or $Q \perp P$.


Example 13.3.25: Kakutani’s Condition. Kakutani’s theorem (Theorem 13.3.12) can be applied to the situation (analogous to that of Example 13.3.24 above) where
\[ L_n = \prod_{i=1}^n Z_i, \]
where $\{Z_n\}_{n\ge 1}$ is a sequence of iid non-negative random variables of mean 1. By this theorem, the criterion of absolute continuity of $Q$ with respect to $P$, $E_P[L_\infty] = 1$, of Theorem 13.3.22 is $\prod_{n=1}^{\infty} E\big[Z_n^{1/2}\big] > 0$. By the same argument as in Example 13.3.24, the only alternative to $Q \ll P$ is $Q \perp P$, and therefore a necessary and sufficient condition for the latter is $\prod_{n=1}^{\infty} E\big[Z_n^{1/2}\big] = 0$.
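
The product criterion above can be explored numerically. The sketch below is only an illustration under stated assumptions: it takes Bernoulli factors that are independent but not identically distributed (assuming, as in Kakutani’s theorem for independent factors, that the criterion still applies), with the hypothetical parameters $p_k = 1/(k+1)^2$ under $P$ and $q_k = 1/(k+1)$ under $Q$. In that case $Z_k = f_k(X_k)$ and $E_P[Z_k^{1/2}] = \sqrt{p_k q_k} + \sqrt{(1-p_k)(1-q_k)}$. The partial products decrease towards 0 (slowly, roughly like $n^{-1/2}$), which points to mutual singularity.

```python
import math


def kakutani_partial_product(n):
    """Partial product prod_{k=1}^n E_P[sqrt(Z_k)] for Bernoulli factors.

    Illustrative assumption (not from the text): under P, X_k is
    Bernoulli(p_k) with p_k = 1/(k+1)^2; under Q, X_k is Bernoulli(q_k)
    with q_k = 1/(k+1). Then
    E_P[sqrt(Z_k)] = sqrt(p_k q_k) + sqrt((1 - p_k)(1 - q_k)).
    """
    product = 1.0
    for k in range(1, n + 1):
        p_k = 1.0 / (k + 1) ** 2
        q_k = 1.0 / (k + 1)
        product *= math.sqrt(p_k * q_k) + math.sqrt((1.0 - p_k) * (1.0 - q_k))
    return product


for n in (10, 100, 1000, 10000):
    print(n, kakutani_partial_product(n))
```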

Harmonic Functions and Markov Chains

An application to Markov chain theory of the martingale convergence theorem concerns harmonic functions of hmcs and the study of recurrence of hmcs. The basic result is:

Theorem 13.3.26 An irreducible recurrent hmc $\{X_n\}_{n\ge 0}$ has no non-negative superharmonic or bounded subharmonic functions besides the constants.

Proof. If $h$ is non-negative superharmonic (resp., bounded subharmonic), the sequence $\{h(X_n)\}_{n\ge 0}$ is a non-negative supermartingale (resp., bounded submartingale) and therefore it converges to a finite limit $Y$. Since the chain visits any state $i \in E$ infinitely often, one must have $Y = h(i)$ almost surely for all $i \in E$. This can happen only if $h$ is a constant.

Corollary 13.3.27 A necessary and sufficient condition for an irreducible hmc to be transient is the existence of some state (henceforth denoted by 0) and of a bounded function $h : E \to \mathbb{R}$, not identically null and satisfying
\[ h(j) = \sum_{k \ne 0} p_{jk}\, h(k) \qquad (j \ne 0). \tag{13.35} \]

Proof. Let $T_0$ be the return time to state 0. First-step analysis shows that the (bounded) function $h$ defined by $h(j) := P_j(T_0 = \infty)$ satisfies (13.35). If the chain is transient, $h$ is nontrivial (not identically null). This proves necessity. Conversely, suppose that (13.35) holds for a not identically null bounded function. Define $\tilde{h}$ by $\tilde{h}(0) := 0$ and $\tilde{h}(j) := h(j)$ $(j \ne 0)$, and let $\alpha := \sum_{k\in E} p_{0k}\, \tilde{h}(k)$. Changing the sign of $\tilde{h}$ if necessary, $\alpha$ can be assumed non-negative. Then $\tilde{h}$ is subharmonic and bounded. If the chain were recurrent, then by Theorem 13.3.26, $\tilde{h}$ would be a constant. This constant would be equal to $\tilde{h}(0) = 0$, and this contradicts the assumed nontriviality of $h$.

Here is an application of the martingale convergence theorem in the vein of the previous results and of Foster’s theorem (Theorem 6.3.19).


Theorem 13.3.28 Let the hmc $\{X_n\}_{n\ge 0}$ with transition matrix $P$ be irreducible and let $h : E \to \mathbb{R}$ be a bounded function such that
\[ \sum_{k\in E} p_{ik}\, h(k) \le h(i) \quad \text{for all } i \notin F, \tag{13.36} \]
for some set $F$ (not assumed finite). Suppose, moreover, that there exists an $i \notin F$ such that
\[ h(i) < h(j) \quad \text{for all } j \in F. \tag{13.37} \]
Then the chain is transient.

Proof. Let $\tau$ be the return time in $F$ and let $i \notin F$ satisfy (13.37). Defining $Y_n = h(X_{n\wedge\tau})$, we have that, under $P_i$, $\{Y_n\}_{n\ge 0}$ is a (bounded) $\mathcal{F}^X_n$-supermartingale (same proof as in Theorem 6.3.19). By the martingale convergence theorem, the limit $Y_\infty$ of $Y_n = h(X_{n\wedge\tau})$ exists and is finite, $P_i$-almost surely. By dominated convergence, $E_i[Y_\infty] = \lim_{n\uparrow\infty} E_i[Y_n]$, and since $E_i[Y_n] \le E_i[Y_0] = h(i)$ (supermartingale property), we have $E_i[Y_\infty] \le h(i)$. If $\tau$ were $P_i$-a.s. finite, then $Y_n$ would eventually be frozen at a value $h(j)$ for some $j \in F$, and therefore by (13.37), $E_i[Y_\infty] > h(i)$, a contradiction with the last inequality. Therefore, $P_i(\tau < \infty) < 1$, which means that with a strictly positive probability, the chain starting from $i \notin F$ will not return to $F$. This is incompatible with irreducibility and recurrence.

13.3.3 The Robbins–Sigmund Theorem

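In applications, one often encounters random sequences that are not quite martingales, submartingales or supermartingales, but “nearly” so, up to “perturbations”. The statement of the result below will make this precise.

Theorem 13.3.29 Let $\{V_n\}_{n\ge 1}$, $\{\beta_n\}_{n\ge 1}$, $\{\gamma_n\}_{n\ge 1}$ and $\{\delta_n\}_{n\ge 1}$ be real non-negative sequences of random variables adapted to some filtration $\{\mathcal{F}_n\}_{n\ge 1}$ and such that
\[ E[V_{n+1} \mid \mathcal{F}_n] \le V_n (1 + \beta_n) + \gamma_n - \delta_n \qquad (n \ge 1). \tag{13.38} \]
Then, on the set
\[ \Gamma = \Big\{ \sum_{n\ge 1} \beta_n < \infty \Big\} \cap \Big\{ \sum_{n\ge 1} \gamma_n < \infty \Big\}, \tag{13.39} \]
the sequence $\{V_n\}_{n\ge 1}$ converges almost surely to a finite random variable and moreover $\sum_{n\ge 1} \delta_n < \infty$ $P$-almost surely.

Proof. 1. Let $\alpha_0 := 1$ and
\[ \alpha_n := \Big( \prod_{k=1}^n (1 + \beta_k) \Big)^{-1} \qquad (n \ge 1), \]
and let
\[ V'_n := \alpha_{n-1} V_n, \qquad \gamma'_n := \alpha_n \gamma_n, \qquad \delta'_n := \alpha_n \delta_n \qquad (n \ge 1). \]
Then
\[ E[V'_{n+1} \mid \mathcal{F}_n] = \alpha_n E[V_{n+1} \mid \mathcal{F}_n] \le \alpha_n V_n (1 + \beta_n) + \alpha_n \gamma_n - \alpha_n \delta_n, \]
that is, since $\alpha_n V_n (1 + \beta_n) = \alpha_{n-1} V_n$,
\[ E[V'_{n+1} \mid \mathcal{F}_n] \le V'_n + \gamma'_n - \delta'_n. \]
Therefore, the random sequence $\{Y_n\}_{n\ge 1}$ defined by
\[ Y_n := V'_n - \sum_{k=1}^{n-1} (\gamma'_k - \delta'_k) \]
is an $\mathcal{F}_n$-supermartingale.

2. For $a > 0$, let
\[ T_a := \inf\Big\{ n \ge 1 \,;\, \sum_{k=1}^{n-1} (\gamma'_k - \delta'_k) \ge a \Big\}. \]
The sequence $\{Y_{n\wedge T_a}\}_{n\ge 1}$ is an $\mathcal{F}_n$-supermartingale bounded from below by $-a$. It therefore converges to a finite limit. Therefore, on $\{T_a = \infty\}$, $\{Y_n\}_{n\ge 1}$ converges to a finite limit.

3. On $\Gamma$, $\prod_{k=1}^{\infty} (1 + \beta_k)$ converges almost surely to a (finite) positive limit and therefore $\lim_{n\uparrow\infty} \alpha_n > 0$. Therefore, condition $\sum_{n\ge 1} \gamma_n < \infty$ implies $\sum_{n\ge 1} \gamma'_n < \infty$.

4. By definition of $Y_n$,
\[ Y_n + \sum_{k=1}^{n-1} \gamma'_k = V'_n + \sum_{k=1}^{n-1} \delta'_k \ge \sum_{k=1}^{n-1} \delta'_k. \]
But on $\Gamma \cap \{T_a = \infty\}$, $\{Y_n\}_{n\ge 1}$ converges to a finite random variable, and therefore $\sum_{n\ge 1} \delta'_n < \infty$.

5. Since on $\Gamma \cap \{T_a = \infty\}$, $\sum_{n\ge 1} \gamma'_n < \infty$, $\sum_{n\ge 1} \delta'_n < \infty$ and $\{Y_n\}_{n\ge 1}$ converges to a finite random variable, it follows that $\{V'_n\}_{n\ge 1}$ converges to a finite limit. Since $\lim_{n\uparrow\infty} \alpha_n > 0$, it follows in turn that $\{V_n\}_{n\ge 1}$ converges to a finite limit and $\sum_{n\ge 1} \delta_n < \infty$ on $\Gamma \cap \{T_a = \infty\}$, and therefore on $\Gamma \cap (\cup_a \{T_a = \infty\}) = \Gamma$.

Corollary 13.3.30 Let $\{V_n\}_{n\ge 1}$, $\{\gamma_n\}_{n\ge 1}$ and $\{\delta_n\}_{n\ge 1}$ be real non-negative sequences of random variables adapted to some filtration $\{\mathcal{F}_n\}_{n\ge 1}$. Suppose that for all $n \ge 1$
\[ E[V_{n+1} \mid \mathcal{F}_n] \le V_n + \gamma_n - \delta_n. \tag{13.40} \]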

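Let $\{a_n\}_{n\ge 1}$ be a random sequence that is strictly positive and strictly increasing and let
\[ \widetilde{\Gamma} := \Big\{ \sum_{n\ge 1} \frac{\gamma_n}{a_n} < \infty \Big\} \]
[…] $> 0$, we have that $Z_n \ge 0$ $(n \ge 1)$. Also
\[ E[Z_{n+1} \mid \mathcal{F}_n] \le Z_n + \frac{\gamma_n}{a_n} - \frac{\delta_n}{a_n}. \]
Therefore, by Theorem 13.3.29, on $\widetilde{\Gamma}$, $\{Z_n\}_{n\ge 1}$ converges and $\sum_{n\ge 1} \frac{\delta_n}{a_n} < \infty$. Note that in particular
\[ \lim_{n\uparrow\infty} \frac{V_{n+1} - V_n}{a_n} = 0 \quad \text{on } \widetilde{\Gamma}. \tag{13.42} \]

2. If moreover $\lim_{n\uparrow\infty} a_n = a_\infty < \infty$, the convergence of $\sum_{n\ge 1} \frac{V_{n+1} - V_n}{a_n}$ implies that of $\frac{1}{a_\infty} \sum_{n\ge 1} (V_{n+1} - V_n)$, and therefore $\{V_n\}_{n\ge 1}$ converges.

3. If on the contrary $\lim_{n\uparrow\infty} a_n = \infty$, the convergence of $\sum_{n\ge 1} \frac{V_{n+1} - V_n}{a_n}$ implies that of $\frac{V_{n+1}}{a_n}$ (and therefore that of $\frac{V_n}{a_n}$, by (13.42)) to 0 (recall Kronecker’s lemma: if $a_n > 0$ and $a_n \uparrow \infty$, the convergence of $\sum_{n\ge 1} \frac{x_n}{a_n}$ implies that $\lim_{n\uparrow\infty} \frac{1}{a_n} \sum_{k=1}^n x_k = 0$).

13.3.4 Square-integrable Martingales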

Doob’s decomposition

Let $\{\mathcal{F}_n\}_{n\ge 0}$ be a filtration. Recall that a process $\{H_n\}_{n\ge 0}$ is called $\mathcal{F}_n$-predictable if for all $n \ge 1$, $H_n$ is $\mathcal{F}_{n-1}$-measurable.

Theorem 13.3.31 Let $\{S_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale. Then there exists a $P$-a.s. unique non-decreasing $\mathcal{F}_n$-predictable process $\{A_n\}_{n\ge 0}$ with $A_0 \equiv 0$ and a unique $\mathcal{F}_n$-martingale $\{M_n\}_{n\ge 0}$ such that for all $n \ge 0$,
\[ S_n = M_n + A_n. \]

Proof. Existence is proved by explicit construction. Let $M_0 := S_0$, $A_0 = 0$ and, for $n \ge 1$,
\[ M_n := S_0 + \sum_{j=0}^{n-1} \big( S_{j+1} - E[S_{j+1} \mid \mathcal{F}_j] \big), \qquad A_n := \sum_{j=0}^{n-1} \big( E[S_{j+1} \mid \mathcal{F}_j] - S_j \big). \]
Clearly, $\{M_n\}_{n\ge 0}$ and $\{A_n\}_{n\ge 0}$ have the announced properties. In order to prove uniqueness, let $\{M'_n\}_{n\ge 0}$ and $\{A'_n\}_{n\ge 0}$ be another such decomposition. In particular, for $n \ge 1$,
\[ A'_{n+1} - A'_n = (A_{n+1} - A_n) + (M_{n+1} - M_n) - (M'_{n+1} - M'_n). \]
Therefore
\[ E[A'_{n+1} - A'_n \mid \mathcal{F}_n] = E[A_{n+1} - A_n \mid \mathcal{F}_n], \]
and, since $A'_{n+1} - A'_n$ and $A_{n+1} - A_n$ are $\mathcal{F}_n$-measurable,
\[ A'_{n+1} - A'_n = A_{n+1} - A_n, \quad P\text{-a.s.} \qquad (n \ge 1), \]
from which it follows that $A'_n = A_n$ a.s. for all $n \ge 0$ (recall that $A'_0 = A_0$) and then $M'_n = M_n$ a.s. for all $n \ge 0$.

Definition 13.3.32 The sequence $\{A_n\}_{n\ge 0}$ in Theorem 13.3.31 is called the compensator of $\{S_n\}_{n\ge 0}$.

Definition 13.3.33 Let $\{M_n\}_{n\ge 0}$ be a square-integrable $\mathcal{F}_n$-martingale (that is, $E[M_n^2] < \infty$ for all $n \ge 0$). The compensator of the $\mathcal{F}_n$-submartingale $\{M_n^2\}_{n\ge 0}$ is denoted by $\{\langle M \rangle_n\}_{n\ge 0}$ and is called the bracket process of $\{M_n\}_{n\ge 0}$.

By the explicit construction in the proof of Theorem 13.3.31, $\langle M \rangle_0 := 0$ and for $n \ge 1$,
\[ \langle M \rangle_n := \sum_{j=0}^{n-1} \big( E[M_{j+1}^2 \mid \mathcal{F}_j] - M_j^2 \big) = \sum_{j=0}^{n-1} E[(M_{j+1} - M_j)^2 \mid \mathcal{F}_j]. \tag{13.43} \]
Also, for all $0 \le k \le n$,
\[ E[(M_n - M_k)^2 \mid \mathcal{F}_k] = E[M_n^2 - M_k^2 \mid \mathcal{F}_k] = E[\langle M \rangle_n - \langle M \rangle_k \mid \mathcal{F}_k]. \]
Therefore, $\{M_n^2 - \langle M \rangle_n\}_{n\ge 0}$ is an $\mathcal{F}_n$-martingale. In particular, if $M_0 = 0$, $E[M_n^2] = E[\langle M \rangle_n]$.

Example 13.3.34: Let $\{Z_n\}_{n\ge 0}$ be a sequence of iid centered random variables of finite variance. Let $M_0 := 0$ and $M_n := \sum_{j=1}^n Z_j$ for $n \ge 1$. Then, for $n \ge 1$,
\[ \langle M \rangle_n = \sum_{j=1}^n \mathrm{Var}(Z_j). \]

Theorem 13.3.35 If $E[\langle M \rangle_\infty] < \infty$, the square-integrable martingale $\{M_n\}_{n\ge 0}$ converges almost surely to a finite limit, and convergence takes place also in $L^2$.

Proof. This is Theorem 13.3.8 for the particular case $p = 2$. In fact, condition (13.28) thereof is satisfied since
\[ \sup_{n\ge 1} E\big[M_n^2\big] = \sup_{n\ge 1} E[\langle M \rangle_n] = E[\langle M \rangle_\infty] < \infty. \]
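
The identity $E[M_n^2] = E[\langle M \rangle_n]$ can be checked exactly on a small case. The following Python sketch is an illustration only: it assumes the symmetric $\pm 1$ random walk (so that $\mathrm{Var}(Z_j) = 1$ and $\langle M \rangle_n = n$ as in Example 13.3.34) and enumerates all $2^n$ equally likely trajectories, so no sampling error is involved. The function name is hypothetical.

```python
from itertools import product


def check_bracket_identity(n):
    """Return (E[M_n^2], <M>_n) for the symmetric +/-1 random walk.

    All 2^n equally likely step sequences are enumerated, so the
    expectation is computed exactly rather than estimated.
    """
    total = 0
    for steps in product((-1, 1), repeat=n):
        m_n = sum(steps)      # value of the martingale M_n on this path
        total += m_n ** 2
    second_moment = total / 2 ** n   # E[M_n^2]
    bracket = float(n)               # <M>_n = sum of Var(Z_j) = n
    return second_moment, bracket


print(check_bracket_identity(8))
```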




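The Martingale Law of Large Numbers

Theorem 13.3.36 Let $\{M_n\}_{n\ge 0}$ be a square-integrable $\mathcal{F}_n$-martingale. Then:
A. On $\{\langle M \rangle_\infty < \infty\}$, $M_n$ converges to a finite limit.
B. On $\{\langle M \rangle_\infty = \infty\}$, $M_n / \langle M \rangle_n \to 0$.

Proof. A. Let $K > 0$ be fixed. The random time
\[ \tau_K := \inf\{ n \ge 0 : \langle M \rangle_{n+1} > K \} \]
is an $\mathcal{F}_n$-stopping time since the bracket process is $\mathcal{F}_n$-predictable. Also $\langle M \rangle_{n\wedge\tau_K} \le K$ and therefore by Theorem 13.3.35, $\{M_{n\wedge\tau_K}\}_{n\ge 0}$ converges to a finite limit. Therefore $\{M_n\}_{n\ge 0}$ converges to a finite limit on the set $\{\langle M \rangle_\infty < K\}$ contained in $\{\tau_K = \infty\}$. Hence the result since
\[ \{\langle M \rangle_\infty < \infty\} = \bigcup_{K\ge 1} \{\tau_K = \infty\}. \]

B. Note that
\[ E[M_{n+1}^2 \mid \mathcal{F}_n] = M_n^2 + \langle M \rangle_{n+1} - \langle M \rangle_n. \]
Define
\[ V_n = M_n^2, \qquad \gamma_n = \langle M \rangle_{n+1} - \langle M \rangle_n, \qquad a_n = \langle M \rangle_{n+1}^2. \]
The result then follows from Part 3 of Corollary 13.3.30 (observe that there exists a $k_0$ such that $a_k \ge 1$ for $k \ge k_0$ and
\[ \sum_{k=k_0}^{\infty} \frac{\gamma_k}{a_k} = \sum_{k=k_0}^{\infty} \frac{\langle M \rangle_{k+1} - \langle M \rangle_k}{\langle M \rangle_{k+1}^2} \le \int_1^{\infty} x^{-2}\, dx < \infty), \]
which says, in particular, that $V_{n+1}/a_n = M_{n+1}^2 / \langle M \rangle_{n+1}^2$ converges to 0.

Remark 13.3.37 We do not have in general $\{\langle M \rangle_\infty < \infty\} = \{\{M_n\}_{n\ge 0} \text{ converges}\}$.

The following is a conditioned version of the Borel–Cantelli lemma. Note that, in this form, we have a necessary and sufficient condition.

Corollary 13.3.38 Let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration and let $\{A_n\}_{n\ge 1}$ be a sequence of events such that $A_n \in \mathcal{F}_n$ $(n \ge 1)$. Then
\[ \Big\{ \sum_{n\ge 1} P(A_n \mid \mathcal{F}_{n-1}) = \infty \Big\} \equiv \Big\{ \sum_{n\ge 1} 1_{A_n} = \infty \Big\}. \]

Proof. Define $\{M_n\}_{n\ge 0}$ by $M_0 := 0$ and for $n \ge 1$,
\[ M_n := \sum_{k=1}^n \big( 1_{A_k} - P(A_k \mid \mathcal{F}_{k-1}) \big). \]
This is a square-integrable $\mathcal{F}_n$-martingale, with bracket process
\[ \langle M \rangle_n = \sum_{k=1}^n P(A_k \mid \mathcal{F}_{k-1}) \big( 1 - P(A_k \mid \mathcal{F}_{k-1}) \big). \]
In particular,
\[ \langle M \rangle_n \le \sum_{k=1}^n P(A_k \mid \mathcal{F}_{k-1}). \]

A. Suppose that $\sum_{k=1}^{\infty} P(A_k \mid \mathcal{F}_{k-1}) < \infty$. Then, by the above inequality, $\langle M \rangle_\infty < \infty$, and therefore, by Part A of Theorem 13.3.36, $M_n$ converges. Since by hypothesis $\sum_{k=1}^{\infty} P(A_k \mid \mathcal{F}_{k-1}) < \infty$, this implies that $\sum_{k=1}^{\infty} 1_{A_k} < \infty$.

B. Suppose that $\sum_{k=1}^{\infty} P(A_k \mid \mathcal{F}_{k-1}) = \infty$ and $\langle M \rangle_\infty < \infty$. Then $M_n$ converges to a finite random variable and therefore
\[ \frac{M_n}{\sum_{k=1}^n P(A_k \mid \mathcal{F}_{k-1})} = \frac{\sum_{k=1}^n 1_{A_k}}{\sum_{k=1}^n P(A_k \mid \mathcal{F}_{k-1})} - 1 \to 0. \]

C. Suppose that $\sum_{k=1}^{\infty} P(A_k \mid \mathcal{F}_{k-1}) = \infty$ and $\langle M \rangle_\infty = \infty$. Then $\frac{M_n}{\langle M \rangle_n} \to 0$ and a fortiori,
\[ \frac{M_n}{\sum_{k=1}^n P(A_k \mid \mathcal{F}_{k-1})} \to 0, \]
that is,
\[ \frac{\sum_{k=1}^n 1_{A_k}}{\sum_{k=1}^n P(A_k \mid \mathcal{F}_{k-1})} \to 1. \]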

The Robbins–Monro algorithm

Consider an input-output relationship $u \in \mathbb{R} \to x \in \mathbb{R}$ of the form $x = g(u, \varepsilon)$, where $\varepsilon$ is a random variable, and let $\Phi(u) := E[g(u, \varepsilon)]$. We wish to determine $u^*$ such that $\Phi(u^*) = \alpha$, where $\alpha$ is given.

Remark 13.3.39 This is a dosage problem: $u$ is the dose and $\Phi(u)$ is the (average) effect produced by this dose; $u^*$ is the dose realizing the desired effect $\alpha$.

$\Phi$ is assumed non-decreasing, but is otherwise unknown. In order to determine $u^*$, one makes a series of experiments. Experiment $n \ge 0$ associates with the input $U_n$ (an experimental dose) the output $X_{n+1} = g(U_n, \varepsilon_{n+1})$, where $\{\varepsilon_n\}_{n\ge 1}$ is iid. The input $U_n$ is a function of the previous experimental results $X_1, \ldots, X_n$, and therefore
\[ E[X_{n+1} \mid \mathcal{F}_n] = \Phi(U_n), \]
where $\mathcal{F}_n = \sigma(X_1, \ldots, X_n)$. We want to choose $U_n$ as a function of $X_1, \ldots, X_n$ that converges almost surely to $u^*$. The following strategy is reasonable: reduce the dose if $X_{n+1} > \alpha$, augment it otherwise. This remark has led to the Robbins–Monro algorithm:
\[ U_{n+1} = U_n - \gamma_n (X_{n+1} - \alpha) \qquad (n \ge 0), \tag{13.44} \]
where $\gamma_n \ge 0$ for all $n \ge 0$. The question is: under what conditions does $\lim_{n\uparrow\infty} U_n = u^*$? We shall need the following deterministic lemma ([Duflo, 1997], Proposition 1.2.3):

Lemma 13.3.40 Let $f : \mathbb{R} \to \mathbb{R}$ be a continuous function such that for some $x^*$ and some $\alpha \in \mathbb{R}$, $f(x^*) = \alpha$ and, for all $x \ne x^*$, $(f(x) - \alpha)(x - x^*) < 0$ and $|f(x)| \le K(1 + |x|)$ for some constant $K > 0$. Let $\{\gamma_n\}_{n\ge 0}$ be a non-increasing non-negative deterministic sequence such that
\[ \sum_{n\ge 0} \gamma_n = \infty, \]
and let $\{\varepsilon_n\}_{n\ge 1}$ be a deterministic sequence such that
\[ \sum_{n\ge 0} \gamma_n \varepsilon_{n+1} \text{ converges}. \]
Then, the sequence $\{x_n\}_{n\ge 0}$ defined by
\[ x_{n+1} = x_n + \gamma_n \big( f(x_n) - \alpha + \varepsilon_{n+1} \big) \qquad (n \ge 0) \]
converges to $x^*$ for any initial condition $x_0$.

Let $\{X_n\}_{n\ge 0}$ and $\{Y_n\}_{n\ge 0}$ be sequences of square-integrable random vectors of dimension $d$ adapted to some filtration $\{\mathcal{F}_n\}_{n\ge 0}$. Let $\{\gamma_n\}_{n\ge 0}$ be a non-increasing sequence of non-negative random variables such that
\[ \lim_{n\uparrow\infty} \gamma_n = 0, \]
and $\gamma_0 \le C < \infty$ for some deterministic constant $C$. Suppose that
\[ X_{n+1} = X_n + \gamma_n Y_{n+1} \qquad (n \ge 0), \]
and that, moreover, for all $n \ge 0$,
\[ E[Y_{n+1} \mid \mathcal{F}_n] = f(X_n), \qquad E[|Y_{n+1} - f(X_n)|^2 \mid \mathcal{F}_n] = \sigma^2(X_n), \]
where the function $f : \mathbb{R} \to \mathbb{R}$ is continuous, such that for some $x^* \in \mathbb{R}^d$, $f(x^*) = 0$ and for all $x \in \mathbb{R}$ such that $x \ne x^*$,
\[ f(x) \times (x - x^*) < 0. \]

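Theorem 13.3.41 Suppose in addition that
\[ |f(x)| \le K(1 + |x|) \]
for some constant $K > 0$, and
\[ \sum_{n\ge 0} \gamma_n = \infty \quad\text{and}\quad \sum_{n\ge 0} \gamma_n^2\, \sigma^2(X_n) < \infty. \]
Then, $\lim_{n\uparrow\infty} X_n = x^*$.

Proof. As
\[ X_{n+1} = X_n + \gamma_n f(X_n) + \gamma_n \big( Y_{n+1} - f(X_n) \big) \qquad (n \ge 0), \]
the process $\{M_n\}_{n\ge 0}$ defined by $M_0 := 0$ and
\[ M_n := \sum_{k=1}^n \gamma_{k-1} \big( Y_k - f(X_{k-1}) \big) \qquad (n \ge 1) \]
is a square-integrable $\mathcal{F}_n$-martingale and
\[ \langle M \rangle_n = \sum_{k=1}^n \gamma_{k-1}^2\, \sigma^2(X_{k-1}). \]
In particular, since $\langle M \rangle_\infty < \infty$ by hypothesis, $\{M_n\}_{n\ge 0}$ converges to a finite limit. The result then follows from Lemma 13.3.40 with $\varepsilon_{n+1} = Y_{n+1} - f(X_n)$.

Example 13.3.42: Back to the dosage problem. Consider the algorithm (13.44). It is of the form considered in Theorem 13.3.41 with $Y_{n+1} = \alpha - X_{n+1}$, so that we apply this theorem with $f = \alpha - \Phi$. The conditions guaranteeing that $\lim_n U_n = u^*$ are therefore, besides $\Phi(u^*) = \alpha$ and $\Phi$ continuous, $(\Phi(u) - \alpha)(u - u^*) > 0$ for all $u \ne u^*$,
\[ \sum_{n\ge 0} \gamma_n = \infty, \qquad \sum_{n\ge 0} \gamma_n^2 < \infty \]
and, for some $K < \infty$,
\[ E[g(u, \varepsilon)^2] \le K(1 + |u|^2). \]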
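
The recursion (13.44) is easy to simulate. The sketch below is a minimal illustration under assumptions that are not in the text: the response is taken to be $g(u, \varepsilon) = u + \varepsilon$ with standard Gaussian noise (so that $\Phi(u) = u$ and $u^* = \alpha$), the gains are $\gamma_n = 1/(n+1)$ (for which $\sum \gamma_n = \infty$ and $\sum \gamma_n^2 < \infty$), and the names `robbins_monro` and `response` are hypothetical. For this particular choice one can check by induction that $U_n = \alpha - \frac{1}{n}\sum_{k=1}^n \varepsilon_k$, so the convergence to $u^*$ reduces to the strong law of large numbers.

```python
import random


def robbins_monro(g, alpha, u0, n_steps, seed=0):
    """Iterate U_{n+1} = U_n - gamma_n * (X_{n+1} - alpha),
    where X_{n+1} = g(U_n, eps_{n+1}) and gamma_n = 1/(n+1)."""
    rng = random.Random(seed)        # fixed seed for reproducibility
    u = u0
    for n in range(n_steps):
        gamma_n = 1.0 / (n + 1)      # sum gamma_n = inf, sum gamma_n^2 < inf
        eps = rng.gauss(0.0, 1.0)    # iid noise eps_{n+1}
        x_next = g(u, eps)           # noisy effect of the dose U_n
        u -= gamma_n * (x_next - alpha)
    return u


def response(u, eps):
    # Hypothetical response with Phi(u) = E[g(u, eps)] = u.
    return u + eps


# Target effect alpha = 1.0, so the sought dose is u* = 1.0.
print(robbins_monro(response, alpha=1.0, u0=5.0, n_steps=20000))
```

Slower gains such as $\gamma_n = (n+1)^{-a}$ with $1/2 < a \le 1$ also satisfy the two summability conditions; the choice made here is only the simplest one.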

13.4 Continuous-time Martingales

The definition of a martingale in continuous time is similar to the one in discrete time and we shall see that most of the results in discrete time find counterparts in continuous time. Let $\{\mathcal{F}_t\}_{t\ge 0}$ be a history (or filtration) on $\mathbb{R}_+$.

13.4. CONTINUOUS-TIME MARTINGALES


Definition 13.4.1 A complex stochastic process $\{Y(t)\}_{t\ge 0}$ such that for all $t \in \mathbb{R}_+$
(i) $Y(t)$ is $\mathcal{F}_t$-measurable, and
(ii) $E[|Y(t)|] < \infty$,
is called a $(P, \mathcal{F}_t)$-martingale (resp., submartingale, supermartingale) if for all $s, t \in \mathbb{R}_+$ such that $s \le t$,
\[ E[Y(t) \mid \mathcal{F}_s] = Y(s) \qquad (\text{resp., } \ge Y(s),\ \le Y(s)). \tag{13.45} \]
When $Y(t) \ge 0$ for all $t \in \mathbb{R}_+$, the integrability condition is not required.

Example 13.4.2: Compensated counting process. The counting process $\{N(t)\}_{t\ge 0}$ of Example 5.1.5 is such that
\[ Y(t) := N(t) - \lambda t \qquad (t \ge 0) \]
is an $\mathcal{F}^Y_t$-martingale (Exercise 13.5.24). This result admits a converse:

Theorem 13.4.3 ([Watanabe, 1964]) Let $N$ be a simple locally finite point process on $\mathbb{R}_+$ such that for some filtration $\{\mathcal{F}_t\}_{t\ge 0}$ the stochastic process
\[ M(t) := N(t) - \lambda t \qquad (t \in \mathbb{R}_+) \]
is an $\mathcal{F}_t$-martingale. Then $N$ is a homogeneous Poisson process with intensity $\lambda$, and for any interval $(a, b] \subset \mathbb{R}_+$, $N((a, b])$ is independent of $\mathcal{F}_a$.

Proof. In view of Theorem 7.1.8, it suffices to show that for all $T > 0$, and for all non-negative bounded real-valued stochastic processes $\{Z(t)\}_{t\ge 0}$ with left-continuous trajectories and adapted to $\{\mathcal{F}_t\}_{t\ge 0}$,
\[ E\left[ \int_{(0,T]} Z(t)\, N(dt) \right] = E\left[ \int_0^T Z(t)\, \lambda\, dt \right]. \tag{$\star$} \]
The proof then is along the same lines as that of Theorem 7.1.8. Equality ($\star$) is true for $Z(t, \omega) := 1_A(\omega)\, 1_{(a,b]}(t)$ for any interval $(a, b] \subset \mathbb{R}_+$ and any $A \in \mathcal{F}_a$, since in this case, ($\star$) reads
\[ E[1_A N((a, b])] = E[1_A\, \lambda (b - a)], \]
that is, since $A$ is arbitrary in $\mathcal{F}_a$, $E[N((a, b]) \mid \mathcal{F}_a] = \lambda (b - a)$, which is the martingale hypothesis. The extension to non-negative bounded real-valued stochastic processes $\{Z(t)\}_{t\ge 0}$ with left-continuous trajectories is then done as above via the approximation (7.5).


Theorem 13.4.4 A supermartingale with constant mean is a martingale.

Proof. This follows from the fact that two integrable random variables $X$ and $Y$ such that $X \le Y$ and $E[X] = E[Y]$ are almost surely equal.

Definition 13.4.5 A complex stochastic process $\{Y(t)\}_{t\ge 0}$ is called a $(P, \mathcal{F}_t)$-local martingale (resp., local submartingale, local supermartingale) if there exists a non-decreasing sequence of $\mathcal{F}_t$-stopping times $\{\tau_n\}_{n\ge 1}$ (the localizing sequence) such that
(a) $\lim_{n\uparrow\infty} \tau_n = \infty$, and
(b) for all $n \ge 1$, $\{Y(t \wedge \tau_n)\}_{t\ge 0}$ is a $(P, \mathcal{F}_t)$-martingale (resp., submartingale, supermartingale).

Theorem 13.4.6 A non-negative local martingale is a supermartingale.

Proof. Let $\{\tau_n\}_{n\ge 1}$ be a localizing sequence. By Fatou’s lemma,
\[ E[M(t) \mid \mathcal{F}_s] = E\big[ \lim_n M(t \wedge \tau_n) \mid \mathcal{F}_s \big] \le \liminf_n E[M(t \wedge \tau_n) \mid \mathcal{F}_s] = \liminf_n M(s \wedge \tau_n) = M(s). \]

The following characterization of the martingale property can be viewed as a kind of converse of Doob’s optional sampling theorem. It will be referred to as Komatsu’s lemma.

Lemma 13.4.7 Let $\{\mathcal{F}_t\}_{t\ge 0}$ be a history. A real-valued $\mathcal{F}_t$-progressive stochastic process $\{X(t)\}_{t\ge 0}$ with the property that, for all bounded $\mathcal{F}_t$-stopping times $T$, $X(T)$ is integrable and $E[X(T)] = E[X(0)]$, is an $\mathcal{F}_t$-martingale.

Proof. Exercise 13.5.26.

13.4.1 From Discrete Time to Continuous Time

Many among the results given for discrete-time martingales extend easily to continuous-time right-continuous martingales. For the extension of Kolmogorov’s and Doob’s inequalities to right-continuous martingales, it suffices to observe that for a right-continuous process $\{X(t)\}_{t\ge 0}$, $\sup_{t\in\mathbb{R}_+} |X(t)| = \sup_{t\in\mathbb{Q}\cap\mathbb{R}_+} |X(t)|$, where $\mathbb{Q}$ is the set of rational numbers.

Theorem 13.4.8 (Kolmogorov’s inequality) Let $\{Y(t)\}_{t\ge 0}$ be a right-continuous $\mathcal{F}_t$-submartingale. Then, for all $\lambda \in \mathbb{R}$ and all $a \in \mathbb{R}_+$,
\[ \lambda P\Big( \sup_{0\le t\le a} Y(t) > \lambda \Big) \le E\Big[ Y(a)\, 1_{\{\sup_{0\le t\le a} Y(t) > \lambda\}} \Big]. \tag{13.46} \]
In particular, if $\{M(t)\}_{t\ge 0}$ is a right-continuous $\mathcal{F}_t$-martingale, then, for all $p \ge 1$, all $\lambda \in \mathbb{R}$ and all $a \in \mathbb{R}_+$,
\[ \lambda^p P\Big( \sup_{0\le t\le a} |M(t)| > \lambda \Big) \le E[|M(a)|^p]. \tag{13.47} \]


Theorem 13.4.9 (Doob’s inequality) Let $\{M(t)\}_{t\ge 0}$ be a right-continuous $\mathcal{F}_t$-martingale. For all $p > 1$ and all $a \in \mathbb{R}_+$,
\[ \| M(a) \|_p \le \Big\| \sup_{0\le t\le a} |M(t)| \Big\|_p \le q\, \| M(a) \|_p, \tag{13.48} \]
where $q$ is defined by $\frac{1}{p} + \frac{1}{q} = 1$.

For the martingale convergence theorem, it suffices to show that the upcrossing inequality holds true. This is done in the proof of the following extension to continuous time of the discrete-time result. As above, one takes advantage of the right-continuity assumption.

Theorem 13.4.10 Let $\{Y(t)\}_{t\ge 0}$ be a right-continuous $\mathcal{F}_t$-submartingale, $L^1$-bounded, that is, such that
\[ \sup_{t\ge 0} E[|Y(t)|] < \infty. \tag{13.49} \]
Then $\{Y(t)\}_{t\ge 0}$ converges $P$-a.s. as $t \uparrow \infty$ to an integrable random variable $Y(\infty)$.

Proof. Let $D_n := \big\{ \frac{k}{2^n} \big\}_{k\in\mathbb{N}}$ and $D := \cup_{n\in\mathbb{N}} D_n$. For given $0 \le a < b$, let $\nu_n([a, b], K)$ and $\nu([a, b], K)$ be the number of upcrossings of $[a, b]$ respectively by $\{Y(t)\}_{t\in D_n\cap[0,K]}$ and $\{Y(t)\}_{t\in D\cap[0,K]}$. The upcrossing inequality for discrete-time submartingales gives
\[ (b - a)\, E[\nu_n([a, b], K)] \le E[(Y(K) - a)^+], \]
and therefore, passing to the limit as $n \uparrow \infty$,
\[ (b - a)\, E[\nu([a, b], K)] \le E[(Y(K) - a)^+]. \]
By the right-continuity assumption, $\nu([a, b]) = \sup_{K\in\mathbb{N}} \nu([a, b], K)$ and therefore,
\[ (b - a)\, E[\nu([a, b])] \le \sup_{K\in\mathbb{N}} E[(Y(K) - a)^+] < \infty. \]
The rest of the proof is then as in Theorem 13.3.2.

The continuous-time extensions of Theorems 13.3.8 and 13.3.14, and of Corollary 13.3.18, follow from Theorem 13.4.10 in the same way as their original discrete-time counterparts follow from Theorem 13.3.2. We leave to the reader the task of formulating these extensions. The next result is the extension of Theorem 13.3.10 to continuous time, and its proof is left for the reader.

Theorem 13.4.11 Let $\{Y(t)\}_{t\ge 0}$ be a right-continuous $\mathcal{F}_t$-submartingale, uniformly integrable. Then $\{Y(t)\}_{t\ge 0}$ converges a.s. and in $L^1$ to an integrable random variable denoted by $Y(\infty)$ and $E[Y(\infty) \mid \mathcal{F}_t] \ge Y(t)$.

The above theorems required only a slight adaptation of their discrete-time versions. For the results that are stated in terms of convergence in a metric space, the adaptation to continuous time is even more immediate, and is based on the following lemma of analysis.

Lemma 13.4.12 Let $(E, d)$ be a complete metric space. A family $\{x_t\}_{t\ge 0}$ of elements of $E$ converges to some element $x \in E$ as $t \uparrow \infty$ if and only if for any non-decreasing sequence of times $\{t_n\}_{n\ge 1}$ increasing to $\infty$ as $n \uparrow \infty$, the sequence $\{x_{t_n}\}_{n\ge 1}$ converges to $x$ as $n \uparrow \infty$.


Therefore any statement of convergence as $t \uparrow \infty$ that can be expressed in terms of convergence in a complete metric space can be obtained from the discrete-time version. This is the case for convergence in $L^p_{\mathbb{C}}(P)$ $(p \ge 1)$ and also for convergence in probability (which can indeed be expressed in terms of convergence in a complete metric space, see Theorem 4.2.4).

We now proceed to the statement and proof of Doob’s optional sampling theorem in continuous time, which requires a little more work and the use of the reverse martingale convergence theorem.

Theorem 13.4.13 Let $\{Y(t)\}_{t\ge 0}$ be a uniformly integrable right-continuous $\mathcal{F}_t$-submartingale (resp., martingale), and let $S$ and $T$ be $\mathcal{F}_t$-stopping times such that $S \le T$. Then
\[ E[Y(T) \mid \mathcal{F}_S] \ge (\text{resp., } =)\ Y(S). \]

Proof. For any $\mathcal{F}_t$-stopping time $\tau$, let $\tau(n)$ be the approximation of the stopping time $\tau$ given in Theorem 5.3.13. Recall that this is an $\mathcal{F}_t$-stopping time decreasing to $\tau$ as $n \uparrow \infty$. Now, fix $n$ and let $\mathcal{G}^{(n)}_k := \mathcal{F}_{k/2^n}$ $(k \in \mathbb{N})$. Observe that $\tau(n)$ is a $\mathcal{G}^{(n)}_k$-stopping time. From Doob’s optional sampling theorem applied to the $\mathcal{G}^{(n)}_k$-submartingale $\{Y(\frac{k}{2^n})\}_{k\ge 0}$,
\[ E\big[ Y(T(n)) \mid \mathcal{F}_{S(n)} \big] \ge Y(S(n)). \]
In particular, for all $A \in \mathcal{F}_S \subseteq \mathcal{F}_{S(n)} \subseteq \mathcal{F}_{T(n)}$ (by (v) of Theorem 5.3.19),
\[ E[1_A Y(T(n))] \ge E[1_A Y(S(n))]. \tag{$\star$} \]
Let $Z_{-n} := Y(T(n))$ and $\mathcal{A}_{-n} := \mathcal{F}_{T(n)}$ $(n \ge 0)$. By the reverse martingale convergence theorem (Theorem 13.3.14) applied to the submartingale $\{Z_n\}_{n\le 0}$ adapted to the filtration $\{\mathcal{A}_n\}_{n\le 0}$, the latter converges to $Y(T)$ almost surely (right-continuity hypothesis) and also in $L^1$ (Theorem 13.3.10). A similar statement holds for $S$ and therefore we can pass to the limit in ($\star$) to obtain
\[ E[1_A Y(T)] \ge (\text{resp., } =)\ E[1_A Y(S)]. \]

Example 13.4.14: The Risk Model, Take 4. The probability of ruin corresponding to an initial capital $u$ is
\[ \Psi(u) := P(u + X(t) < 0 \text{ for some } t > 0). \]
It is a simple exercise (Exercise 13.5.25) to show that
\[ E\big[ e^{-rX(t)} \big] = e^{t g(r)}, \]
where
\[ g(r) = \lambda h(r) - rc = \lambda \left( \int_0^{\infty} e^{rv}\, dG(v) - 1 \right) - rc. \]
For any $u \in \mathbb{R}_+$,
\[ M_u(t) := \frac{e^{-r(u + X(t))}}{e^{t g(r)}} \qquad (t \in \mathbb{R}_+) \tag{13.50} \]


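is an $\mathcal{F}^X_t$-martingale. Indeed, for $0 \le s \le t < \infty$,
\[ E\big[ M_u(t) \mid \mathcal{F}^X_s \big] = E\left[ \frac{e^{-r(u + X(t))}}{e^{t g(r)}} \,\Big|\, \mathcal{F}^X_s \right] = E\left[ \frac{e^{-r(u + X(s))}}{e^{s g(r)}}\, \frac{e^{-r(X(t) - X(s))}}{e^{(t-s) g(r)}} \,\Big|\, \mathcal{F}^X_s \right] = M_u(s)\, E\left[ \frac{e^{-r(X(t) - X(s))}}{e^{(t-s) g(r)}} \,\Big|\, \mathcal{F}^X_s \right] = M_u(s). \]
For all $u \ge 0$,
\[ T_u := \inf\{ t \ge 0 \,;\, u + X(t) < 0 \} \]
(with the usual convention that the infimum of an empty set is infinite) is an $\mathcal{F}^X_t$-stopping time. For any $t_0 < \infty$, since $T_u \wedge t_0$ is a bounded stopping time and since $\{M_u(t)\}_{t\ge 0}$ is a (positive) martingale, we may apply Doob’s optional stopping theorem:
\begin{align*}
e^{-ru} = M_u(0) &= E[M_u(T_u \wedge t_0)] \\
&= E[M_u(T_u \wedge t_0) \mid T_u \wedge t_0 < t_0]\, P(T_u \wedge t_0 < t_0) + E[M_u(T_u \wedge t_0) \mid T_u \wedge t_0 \ge t_0]\, P(T_u \wedge t_0 \ge t_0) \\
&\ge E[M_u(T_u \wedge t_0) \mid T_u \wedge t_0 < t_0]\, P(T_u \wedge t_0 < t_0) \\
&= E[M_u(T_u) \mid T_u < t_0]\, P(T_u < t_0).
\end{align*}
But $u + X(T_u) \le 0$ on $\{T_u < \infty\}$, and therefore
\[ P(T_u < t_0) \le \frac{e^{-ru}}{E[M_u(T_u) \mid T_u < t_0]} \le \frac{e^{-ru}}{E\big[ e^{-T_u g(r)} \mid T_u < t_0 \big]} \le e^{-ru} \sup_{0\le t\le t_0} e^{t g(r)}. \]
Letting $t_0 \to \infty$,
\[ \Psi(u) \le e^{-ru} \sup_{t\ge 0} e^{t g(r)}. \tag{$\star$} \]
We choose, under the assumption that $\sup_{t\ge 0} e^{t g(r)} < \infty$, the $r$ that makes the right-hand side of ($\star$) smallest, that is
\[ R = \sup\{ r \,;\, g(r) \le 0 \}, \]
i.e. the positive solution of
\[ h(r) = \frac{cr}{\lambda}. \]
This is the celebrated Lundberg’s inequality: $\Psi(u) \le e^{-Ru}$.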
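
To make the exponent $R$ concrete, the following Python sketch solves $h(r) = cr/\lambda$ numerically in one special case. The exponential claim-size distribution (rate `theta`), the bisection method and the function name are assumptions made for this illustration and are not taken from the text. For exponential claims, $h(r) = \int_0^\infty e^{rv}\, dG(v) - 1 = r/(\theta - r)$ for $r < \theta$, and the positive root is known in closed form, $R = \theta - \lambda/c$ (positive when $c > \lambda/\theta$), which gives a check on the numerical answer.

```python
import math


def adjustment_coefficient(lam, theta, c, tol=1e-12):
    """Positive root R of g(r) = lam*h(r) - c*r by bisection.

    Assumes (for illustration) exponential claims with rate theta, so
    h(r) = r / (theta - r) for 0 <= r < theta, and c > lam / theta.
    """
    def g(r):
        h = r / (theta - r)
        return lam * h - c * r

    # g is negative just above 0 and tends to +infinity as r -> theta.
    lo, hi = 1e-9, theta - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


R = adjustment_coefficient(lam=1.0, theta=1.0, c=2.0)
print(R)                      # closed form: theta - lam / c = 0.5
print(math.exp(-R * 10.0))    # Lundberg bound on Psi(u) for u = 10
```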

Theorem 13.4.15 Let $\{Y(t)\}_{t\ge 0}$ be an $\mathcal{F}_t$-martingale and let $T$ be an $\mathcal{F}_t$-stopping time. Then $\{Y(t \wedge T)\}_{t\ge 0}$ is an $\mathcal{F}_t$-martingale.

Proof. First suppose $\{Y(t)\}_{t\ge 0}$ is uniformly integrable. By Theorem 13.4.11, $E[Y(\infty) \mid \mathcal{F}_{t\vee T}] = Y(t \vee T)$, and therefore
\[ E[Y(\infty) - Y(T) \mid \mathcal{F}_{t\vee T}] = Y(t \vee T) - Y(T) = 1_{\{T \le t\}} \big( Y(t) - Y(t \wedge T) \big) = Y(t) - Y(t \wedge T). \]
This variable is $\mathcal{F}_t$-measurable and therefore
\[ E[Y(\infty) - Y(T) \mid \mathcal{F}_t] = Y(t) - Y(t \wedge T), \]
that is, since $E[Y(\infty) \mid \mathcal{F}_t] = Y(t)$,
\[ E[Y(T) \mid \mathcal{F}_t] = Y(t \wedge T). \]
We now get rid of the uniform integrability assumption. For any $a \ge 0$, the $\mathcal{F}_{t\wedge a}$-martingale $\{Y(t \wedge a)\}_{t\ge 0}$ is uniformly integrable (Theorem 4.2.13) and therefore, by Theorem 13.4.13, for $t \le a$,
\[ E[Y(T \wedge a) \mid \mathcal{F}_t] = Y(t \wedge T \wedge a) = Y(t \wedge T). \]

Predictable Quadratic Variation Processes

Definition 13.4.16 Let $\{\mathcal{F}_t\}_{t\ge 0}$ be a history. Let $\{M(t)\}_{t\ge 0}$ be a local $\mathcal{F}_t$-martingale. A non-decreasing $\mathcal{F}_t$-predictable stochastic process $\{\langle M \rangle(t)\}_{t\ge 0}$ such that $\{M(t)^2 - \langle M \rangle(t)\}_{t\ge 0}$ is a local $\mathcal{F}_t$-martingale is called the predictable quadratic variation process of the local martingale $\{M(t)\}_{t\ge 0}$.

The following result of martingale theory is quoted without proof.

Theorem 13.4.17 Let $\{\mathcal{F}_t\}_{t\ge 0}$ be a history. Let $\{M(t)\}_{t\ge 0}$ be a local square-integrable $\mathcal{F}_t$-martingale with predictable quadratic variation process $\{\langle M \rangle(t)\}_{t\ge 0}$.
a. If $\langle M \rangle(\infty) < \infty$, then $M(t)$ converges to a finite limit as $t \uparrow \infty$.
b. If $\langle M \rangle(\infty) = \infty$, then
\[ \lim_{t\uparrow\infty} \frac{M(t)}{\langle M \rangle(t)} = 0. \]

13.4.2 The Banach Space $\mathcal{M}_p$

Definition 13.4.18 Let $p \ge 1$. An $\mathcal{F}_t$-martingale $\{M(t)\}_{t\ge 0}$ is called $p$-integrable if
\[ \sup_{t\ge 0} E[|M(t)|^p] < \infty. \]
It is called $p$-integrable on the finite interval $[0, a]$ if $E[|M(a)|^p] < \infty$.

Remark 13.4.19 For $p > 1$, the condition of Definition 13.4.18 implies that this martingale is uniformly integrable (Theorem 4.2.14). Note that when $p \ge 1$, $E[|M(a)|^p] < \infty$ implies $\sup_{t\in[0,a]} E[|M(t)|^p] < \infty$ since $\{|M(t)|^p\}_{t\ge 0}$ is then an $\mathcal{F}_t$-submartingale.

For $p \ge 1$, let $\mathcal{M}_p([0, 1])$ be the collection of $p$-integrable $\mathcal{F}_t$-martingales over $[0, 1]$. We shall not distinguish between versions, that is to say, an element of $\mathcal{M}_p$ is an equivalence class for the equivalence $M \sim M'$ defined by $M(t) = M'(t)$ $P$-a.s. for all $t \in [0, 1]$.


Theorem 13.4.20 For $p \ge 1$, $\mathcal{M}_p([0, 1])$ is a Banach space for the norm
\[ \|M\|_p = E[|M(1)|^p]^{1/p}. \tag{13.51} \]

Proof. First, we verify that (13.51) defines a norm. Only the fact that $\|M\|_p = 0$ implies $M = 0$ (that is, $P(M(t) = 0) = 1$ for all $t \in [0, 1]$) is perhaps not obvious. By Jensen’s inequality for conditional expectations, for all $t \in [0, 1]$,
\[ E[|M(t)|^p] = E\big[ |E[M(1) \mid \mathcal{F}_t]|^p \big] \le E\big[ E[|M(1)|^p \mid \mathcal{F}_t] \big] = E[|M(1)|^p] = 0, \]
which implies in particular that $P(M(t) = 0) = 1$ for all $t \in [0, 1]$.

Let now $\{M_n\}_{n\ge 1}$ be a Cauchy sequence of $\mathcal{M}_p([0, 1])$, that is,
\[ \lim_{k,n\uparrow\infty} E[|M_n(1) - M_k(1)|^p] = 0. \]
By the same Jensen-type argument as above, for all $t \in [0, 1]$,
\[ \lim_{k,n\uparrow\infty} E[|M_n(t) - M_k(t)|^p] = 0. \]
Therefore, for all $t \in [0, 1]$, there exists a limit in $L^p_{\mathbb{R}}(P, \mathcal{F}_t)$ of the sequence of random variables $\{M_n(t)\}_{n\ge 1}$ that we call $M(t)$. It remains to show that the process $\{M(t)\}_{t\in[0,1]}$ so defined is an $\mathcal{F}_t$-martingale, that is, for all $[a, b] \subset [0, 1]$ and all $A \in \mathcal{F}_a$,
\[ E[1_A M(b)] = E[1_A M(a)]. \]
Using the assumption that for all $n \ge 1$, $\{M_n(t)\}_{t\ge 0}$ is an $\mathcal{F}_t$-martingale (and therefore $E[1_A M_n(b)] = E[1_A M_n(a)]$) and the fact that for all $t \in [0, 1]$, $M_n(t)$ tends to $M(t)$ in $L^p_{\mathbb{R}}(P)$, we have that $\lim_{n\uparrow\infty} E[1_A M_n(a)] = E[1_A M(a)]$ and $\lim_{n\uparrow\infty} E[1_A M_n(b)] = E[1_A M(b)]$.

13.4.3 Time Scaling

In this subsection, we define changes of the time scale “adapted” to a given history.

Definition 13.4.21 Let $\{\mathcal{F}_t\}_{t\ge 0}$ be a history. A process $A = \{A(t)\}_{t\ge 0}$ is called a standard non-decreasing stochastic process if it has non-decreasing right-continuous trajectories $t \to A(t, \omega)$ and if moreover $A(0, \omega) \equiv 0$. A standard non-decreasing process $\{T(t)\}_{t\ge 0}$ is called an $\mathcal{F}_t$-change of time if, for all $t \ge 0$, $T(t)$ is an $\mathcal{F}_t$-stopping time.

Theorem 13.4.22 Let $\{\mathcal{F}_t\}_{t\ge 0}$ be a right-continuous history and let $\{T(t)\}_{t\ge 0}$ be an $\mathcal{F}_t$-change of time. Then, the family $\{\mathcal{F}_{T(t)}\}_{t\ge 0}$ is a right-continuous history. Moreover, if the stochastic process $\{X(t)\}_{t\ge 0}$ is $\mathcal{F}_t$-progressive, then the stochastic process $\{Y(t)\}_{t\ge 0}$ defined by $Y(t) = X(T(t))\,1_{\{T(t) <}$ […] $> t\}$. If $A$ is adapted to the right-continuous history $\{\mathcal{F}_t\}_{t\ge 0}$, $C$ is an $\mathcal{F}_t$-change of time.

Proof. Indeed, for all $t \ge 0$, all $a \ge 0$,
\[ \{C(t) < a\} = \cup_{n\ge 1} \Big\{ A\big(a - \tfrac{1}{n}\big) > t \Big\} \in \vee_{n\ge 1} \mathcal{F}_{a - \frac{1}{n}} = \mathcal{F}_{a-} \subseteq \mathcal{F}_a. \]

Theorem 13.4.24 Let $X = \{X(t)\}_{t\ge 0}$ and $Y = \{Y(t)\}_{t\ge 0}$ be two stochastic processes adapted to the right-continuous history $\{\mathcal{F}_t\}_{t\ge 0}$, and such that for any $\mathcal{F}_t$-stopping time $S$, $E\big[X(S)\,1_{\{S <}$ […] $> 0$. Prove the following inequality:
\[ P\Big( \max_{0\le k\le n} X_k > \lambda \Big) \le \frac{E[X_n^2]}{E[X_n^2] + \lambda^2}. \]


Hint: With c > 0, work with the sequence {(Xn + c)2 }n≥0 and then select an appropriate c. Exercise 13.5.19. An extension of Hoeffding’s inequality Let M be a real FnX -martingale such that, for some sequence d1 , d2 , . . . of real numbers, P (Bn ≤ Mn − Mn−1 ≤ Bn + dn ) = 1,

n ≥ 1,

where for each n ≥ 1, Bn is a function of X0n−1. Prove that, for all x ≥ 0,  ? n 2 2 P (|Mn − M0 | ≥ x) ≤ 2 exp −2x di . i=1

Exercise 13.5.20. The derivative of a Lipschitz continuous function
Let $f : [0,1) \to \mathbb{R}$ satisfy a Lipschitz condition, that is,
$$|f(x) - f(y)| \le M|x - y| \quad (x, y \in [0,1)),$$
where $M < \infty$. Let $\Omega = [0,1)$, $\mathcal{F} = \mathcal{B}([0,1))$ and let $P$ be the Lebesgue measure on $[0,1)$. Let for all $n \ge 1$
$$\xi_n(\omega) := \sum_{k=1}^{2^n} (k-1)2^{-n}\, 1_{[(k-1)2^{-n},\, k2^{-n})}(\omega)$$
and $\mathcal{F}_n = \sigma(\xi_k \,;\, 1 \le k \le n)$.

(i) Show that $\mathcal{F}_n = \sigma(\xi_n)$ and $\vee_n \mathcal{F}_n = \mathcal{B}([0,1))$.

(ii) Let
$$X_n := \frac{f(\xi_n + 2^{-n}) - f(\xi_n)}{2^{-n}} .$$
Show that $\{X_n\}_{n\ge 1}$ is a uniformly integrable $\mathcal{F}_n$-martingale.

(iii) Show that there exists a measurable function $g : [0,1) \to \mathbb{R}$ such that $X_n \to g$ $P$-almost surely and that $X_n = E[g \mid \mathcal{F}_n]$.

(iv) Show that for all $n \ge 1$ and all $k$ ($1 \le k \le 2^n$)
$$f(k2^{-n}) - f(0) = \int_0^{k2^{-n}} g(x)\, dx$$
and deduce from this that
$$f(x) - f(0) = \int_0^x g(y)\, dy \quad (x \in [0,1)) .$$

Exercise 13.5.21. A non-uniformly integrable martingale
Let $\{X_n\}_{n\ge 0}$ be a sequence of iid random variables such that $P(X_n = 0) = P(X_n = 2) = \frac{1}{2}$ ($n \ge 0$). Define
$$Z_n := \prod_{j=1}^n X_j \quad (n \ge 0) .$$


Show that $\{Z_n\}_{n\ge 0}$ is an $\mathcal{F}_n^X$-martingale and prove that it is not uniformly integrable.

Exercise 13.5.22. The ballot problem via martingales
This exercise proposes an alternative proof for the ballot problem of Example 1.2.13. Let $k := a + b$ and let $D_n$ be the difference between the number of votes for A and the number of votes for B at time $n \ge 1$. Prove that
$$X_n = \frac{D_{k-n}}{k-n} \quad (1 \le n \le k)$$
is a martingale. Deduce from this that the probability that A leads throughout the voting process is $(a - b)/(a + b)$. Hint: $\tau := \inf\{n \,;\, X_n = 0\} \wedge (k - 1)$.

Exercise 13.5.23. A voting model
Let $G = (V, E)$ be a finite graph. Each vertex $v$ shelters a random variable $X_n(v)$ representing the opinion (0 or 1) at time $n$ of the voter located at this vertex. At each time $n$, an edge $\langle v, w\rangle$ is chosen at random, and one of the two vertices, again chosen at random (say $v$), reconsiders his opinion, passing from $X_n(v)$ to $X_{n+1}(v) = X_n(w)$. The initial opinions at time 0 are given. Let $Z_n$ be the total number of votes for 1 at time $n$. Show that $\{Z_n\}_{n\ge 1}$ is a martingale that converges in finite random time to a random variable $Z_\infty$ taking the values 0 or $|V|$, the probability that all opinions are eventually 1 being equal to the initial proportion of 1’s.

Exercise 13.5.24. The fundamental martingale of an hpp
Prove that for the counting process $\{N(t)\}_{t\ge 0}$ of Example 5.1.5, $\{N(t) - \lambda t\}_{t\ge 0}$ is an $\mathcal{F}_t^Y$-martingale.

Exercise 13.5.25. Compound Poisson Processes
Let $N$ be an hpp on $\mathbb{R}_+$ with intensity $\lambda$ and point sequence $\{T_n\}_{n\ge 1}$. Let $\{Z_n\}_{n\ge 1}$ be an iid real-valued sequence independent of $N$, with common cdf $F$. Define for all $t \ge 0$,
$$Y(t) = \sum_{n\ge 1} Z_n 1_{(0,t]}(T_n) .$$
(The process $\{Y(t)\}_{t\ge 0}$ is called a compound Poisson process.) Show that
$$E\big[e^{-rY(t)}\big] = e^{-\lambda t(1 - h(r))},$$
where $h(r) := E\big[e^{-rZ_1}\big] = \int_0^\infty e^{-rx}\, dF(x)$.

Exercise 13.5.26. Komatsu’s lemma
Prove the following: A right-continuous real stochastic process $\{X(t)\}_{t\ge 0}$ adapted to the filtration $\{\mathcal{F}_t\}_{t\ge 0}$ is an $\mathcal{F}_t$-martingale if and only if for all bounded $\mathcal{F}_t$-stopping times $\tau$, $E[X(\tau)] = E[X(0)]$. Hint: Show that for all $0 \le a \le b$ and all $A \in \mathcal{F}_a$, $\tau := a1_A + b1_{\bar{A}}$ (where $\bar{A}$ is the complement of $A$) defines an $\mathcal{F}_t$-stopping time.
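The Laplace transform of Exercise 13.5.25 can be checked numerically before attempting the proof. The sketch below is only an illustration under stated assumptions: the jumps are taken to be exponential with mean 1 (so that $h(r) = 1/(1+r)$), and the function names `compound_poisson`, `laplace_estimate` and `exp_jump` are hypothetical. It compares a Monte Carlo estimate of $E[e^{-rY(t)}]$ with $e^{-\lambda t(1-h(r))}$.

```python
import math
import random


def compound_poisson(t, lam, draw_jump, rng):
    """Sample Y(t): the sum of the marks Z_n over the points T_n <= t of an hpp."""
    total = 0.0
    clock = rng.expovariate(lam)          # first point T_1
    while clock <= t:
        total += draw_jump(rng)           # mark Z_n attached to the point T_n
        clock += rng.expovariate(lam)     # add the next inter-arrival time
    return total


def laplace_estimate(r, t, lam, draw_jump, n_samples=50_000, seed=0):
    """Monte Carlo estimate of E[exp(-r Y(t))]."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_samples):
        acc += math.exp(-r * compound_poisson(t, lam, draw_jump, rng))
    return acc / n_samples


def exp_jump(rng):
    """Assumed jump law for the illustration: Z ~ Exp(1), so h(r) = 1 / (1 + r)."""
    return rng.expovariate(1.0)


if __name__ == "__main__":
    r, t, lam = 1.0, 1.0, 2.0
    h = 1.0 / (1.0 + r)
    print("Monte Carlo:", laplace_estimate(r, t, lam, exp_jump))
    print("Formula    :", math.exp(-lam * t * (1.0 - h)))
```

Agreement of the two numbers is of course not a proof; it only confirms the sign of the exponent for one choice of the jump distribution.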


Exercise 13.5.27. 0 is an absorbing state for non-negative martingales
Prove that for a non-negative martingale $\{M(t)\}_{t\ge 0}$, $\{M(s) = 0\} \subseteq \{M(t) = 0\}$ whenever $0 \le s < t$.

Exercise 13.5.28. Avoiding 0
Let $\{M(t)\}_{t\ge 0}$ be a right-continuous martingale.
A. Let $\tau := \inf\{t \ge 0 \,;\, M(t) = 0\}$. Prove that $M(\tau) = 0$ on $\{\tau < \infty\}$.
B. Prove that if $P(M(T) > 0) = 1$ for some $T \ge 0$, then $P(M(t) > 0 \text{ for all } t \le T) = 1$. (Hint: use the stopping times $T$ and $T \wedge \tau$.)

Exercise 13.5.29. p-integrable martingales
If $\{M(t)\}_{t\ge 0}$ is right-continuous and $p$-integrable, then
$$\sup_{T \in \mathcal{T}} E[|M(T)|^p] < \infty,$$
where $\mathcal{T}$ is the collection of all finite $\mathcal{F}_t$-stopping times.

Exercise 13.5.30. Local martingales
Prove that if $\{M(t)\}_{t\ge 0}$ is a right-continuous local $\mathcal{F}_t$-martingale such that
$$E\Big[\sup_{0\le s\le t} |M(s)|\Big] < \infty \quad (t \ge 0),$$
it is in fact a martingale.

Chapter 14 A Glimpse at Itô’s Stochastic Calculus

The Itô integral is an extension of the Wiener integral to a class of non-deterministic integrands. It is the basic tool of the Itô stochastic calculus, of which this chapter is a brief introduction.

14.1 The Itô Integral

14.1.1 Construction

The Itô integral will be constructed via an isometric extension analogous to that used for the construction of the Wiener–Doob integral. Let $A(\mathbb{R}_+)$ be the collection of $\mathcal{F}_t$-progressive complex-valued stochastic processes $\varphi = \{\varphi(t)\}_{t\ge 0}$ such that
$$E\Big[\int_{\mathbb{R}_+} |\varphi(t)|^2\, dt\Big] < \infty .$$
We view $A(\mathbb{R}_+)$ as a complex Hilbert space with inner product
$$\langle \varphi_1, \varphi_2\rangle_{A(\mathbb{R}_+)} := E\Big[\int_{\mathbb{R}_+} \varphi_1(t)\varphi_2(t)^*\, dt\Big] .$$
(To be more precise, an element of $A(\mathbb{R}_+)$ is an equivalence class of such processes with respect to the equivalence relation $\varphi \sim \varphi'$ if and only if $\varphi(t,\omega) = \varphi'(t,\omega)$, $P(d\omega) \times dt$ a.e.)

For $T \ge 0$, let $A([0,T])$ be the collection of $\mathcal{F}_t$-progressive complex-valued stochastic processes $\varphi$ such that
$$E\Big[\int_0^T |\varphi(t)|^2\, dt\Big] < \infty,$$

and let $A_{loc}$ be the collection of $\mathcal{F}_t$-progressive complex-valued stochastic processes $\varphi$ such that $\varphi \in A([0,T])$ for all $T \ge 0$. Finally, let $B_{loc}$ be the collection of $\mathcal{F}_t$-progressive complex-valued stochastic processes $\varphi$ such that
$$\int_0^t |\varphi(s)|^2\, ds < \infty \quad P\text{-a.s.} \quad (t \ge 0) . \tag{14.1}$$

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/978-3-030-40183-2_14

Definition 14.1.1 A real stochastic process $\varphi := \{\varphi(t)\}_{t\ge 0}$ of the form
$$\varphi(t,\omega) = \sum_{i=1}^{K-1} Z_i(\omega) 1_{(t_i, t_{i+1}]}(t), \tag{14.2}$$
where $K \in \mathbb{N}_+$, $0 \le t_1 < t_2 < \cdots < t_K < \infty$ and where $Z_i$ ($1 \le i \le K$) is a complex square-integrable $\mathcal{F}_{t_i}$-measurable random variable, is called an elementary $\mathcal{F}_t$-predictable process. Such stochastic processes are in $A(\mathbb{R}_+)$.

Lemma 14.1.2 The vector subspace $G$ of $A(\mathbb{R}_+)$ consisting of the elementary $\mathcal{F}_t$-predictable processes is dense in $A(\mathbb{R}_+)$.

Proof. First, consider the operators $P_n$ ($n \ge 1$) acting on the functions $f \in L^2_{\mathbb{C}}(\mathbb{R}_+)$ as follows:
$$[P_n f](t) := \sum_{i=1}^{n^2} n \Big(\int_{(i-1)/n}^{i/n} f(s)\, ds\Big) 1_{(i/n, (i+1)/n]}(t) .$$

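By Schwarz’s inequality, for all $t \in (i/n, (i+1)/n]$,
$$|[P_n f](t)|^2 = \Big|\int_{(i-1)/n}^{i/n} n f(s)\, ds\Big|^2 \le \int_{(i-1)/n}^{i/n} n^2\, ds \times \int_{(i-1)/n}^{i/n} |f(s)|^2\, ds = n \int_{(i-1)/n}^{i/n} |f(s)|^2\, ds,$$
and therefore
$$\int_{\mathbb{R}_+} |[P_n f](t)|^2\, dt \le \int_{\mathbb{R}_+} |f(t)|^2\, dt . \tag{$\star$}$$
Also
$$P_n f \to f \text{ in } L^2_{\mathbb{R}}(\mathbb{R}_+) \tag{$\star\star$}$$
for all functions $f \in C_c^0$ (continuous with compact support), and therefore for all $f \in L^2_{\mathbb{R}}(\mathbb{R}_+)$ by density of $C_c^0$ in $L^2_{\mathbb{R}}(\mathbb{R}_+)$.

Let now $\varphi$ be in $A(\mathbb{R}_+)$. For fixed $\omega$, $[P_n\varphi](\cdot,\omega)$ is the function obtained by applying $P_n$ to the function $t \mapsto \varphi(t,\omega)$. By ($\star$),
$$E\Big[\int_{\mathbb{R}_+} |[P_n\varphi](t)|^2\, dt\Big] \le E\Big[\int_{\mathbb{R}_+} |\varphi(t)|^2\, dt\Big] < \infty,$$
and therefore the function $t \mapsto [P_n\varphi](t,\omega)$ is in $L^2_{\mathbb{C}}(\mathbb{R}_+)$ for $P$-almost all $\omega$. The stochastic process $\{P_n\varphi(t)\}_{t\ge 0}$ is in $G$ (note that by Theorem 5.3.9, $\int_{(i-1)/n}^{i/n} \varphi(s)\, ds$ is $\mathcal{F}_{i/n}$-measurable since $\{\varphi(t)\}_{t\ge 0}$ is $\mathcal{F}_t$-progressive).

As $n \uparrow \infty$, $\{[P_n\varphi](t)\}_{t\ge 0}$ converges in $A(\mathbb{R}_+)$ to $\{\varphi(t)\}_{t\ge 0}$. In fact, by ($\star\star$),
$$\int_{\mathbb{R}_+} |[P_n\varphi(\cdot,\omega)](t) - \varphi(t,\omega)|^2\, dt \to 0$$
and therefore
$$\|P_n\varphi - \varphi\|^2_{A(\mathbb{R}_+)} = E\Big[\int_{\mathbb{R}_+} |[P_n\varphi(\cdot,\omega)](t) - \varphi(t,\omega)|^2\, dt\Big] \to 0$$
(by dominated convergence since, by ($\star$) and the triangle inequality,
$$\int_{\mathbb{R}_+} |[P_n\varphi(\cdot,\omega)](t) - \varphi(t,\omega)|^2\, dt \le \Big(\|[P_n\varphi(\cdot,\omega)]\|_{L^2_{\mathbb{C}}(\mathbb{R}_+)} + \|\varphi(\cdot,\omega)\|_{L^2_{\mathbb{C}}(\mathbb{R}_+)}\Big)^2 \le 4\|\varphi(\cdot,\omega)\|^2_{L^2_{\mathbb{C}}(\mathbb{R}_+)}$$
and $E\big[\|\varphi(\cdot,\omega)\|^2_{L^2_{\mathbb{C}}(\mathbb{R}_+)}\big] = \|\varphi\|^2_{A(\mathbb{R}_+)} < \infty$). □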

Let $L^2_{0,\mathbb{C}}(P)$ be the Hilbert subspace of $L^2_{\mathbb{C}}(P)$ consisting of the complex centered square-integrable variables. Define the mapping $I : G \to L^2_{0,\mathbb{C}}(P)$ by
$$I(\varphi) := \sum_{i=1}^{K-1} Z_i \big(W(t_{i+1}) - W(t_i)\big) . \tag{14.3}$$

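One verifies that for all $\varphi \in G$, $E[I(\varphi)] = 0$. Also, for all $\varphi_1, \varphi_2 \in G$, $E[I(\varphi_1)I(\varphi_2)^*] = \langle \varphi_1, \varphi_2\rangle_{A(\mathbb{R}_+)}$.

Proof. By polarization, it suffices to treat the case $\varphi_1 = \varphi_2 = \varphi$ and to write
$$E\big[I(\varphi)^2\big] = E\Big[\sum_{i=1}^{K-1} |Z_i|^2 \big(W(t_{i+1}) - W(t_i)\big)^2\Big] + 2\sum E\big[Z_i Z_\ell \big(W(t_{i+1}) - W(t_i)\big)\big(W(t_{\ell+1}) - W(t_\ell)\big)\big]$$
i 0,
$$P\Big(\sup_{t\in[0,T]} |M_n(t) - M_m(t)| > a\Big) \le \frac{1}{a^2} E\Big[\int_0^T \big([P_n\varphi](s) - [P_m\varphi](s)\big)^2\, ds\Big],$$
a quantity that tends to 0 as $n, m \to \infty$. We can therefore find a sequence $\{n_k\}_{k\ge 1}$ strictly increasing to $\infty$ such that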


$$P\Big(\sup_{t\in[0,T]} |M_{n_k}(t) - M_{n_{k-1}}(t)| > 2^{-k}\Big) \le 2^{-k},$$
so that, by the Borel–Cantelli lemma,
$$P\Big(\sup_{t\in[0,T]} |M_{n_k}(t) - M_{n_{k-1}}(t)| > 2^{-k} \ \text{i.o.}\Big) = 0 .$$
Therefore for $P$-almost all $\omega$, there exists a finite integer $K(\omega)$ such that for $k > K(\omega)$,
$$\sup_{t\in[0,T]} |M_{n_k}(t) - M_{n_{k-1}}(t)| \le 2^{-k} .$$
This implies that for $P$-almost all $\omega$ the function $t \mapsto M_{n_k}(t,\omega)$ converges uniformly on $[0,T]$ to a function $t \mapsto M_\infty(t,\omega)$ which is continuous (as a uniform limit of continuous functions). Since $M_n(t) \to \int_0^t \varphi(s)\, dW(s)$ in quadratic mean and $M_{n_k}(t) \to M_\infty(t)$ $P$-a.s., both limits are $P$-a.s. equal, and therefore the stochastic process $\{M_\infty(t)\}_{t\ge 0}$ is a continuous version of $\big\{\int_0^t \varphi(s)\, dW(s)\big\}_{t\ge 0}$. □

14.1.3 Itô’s Integrals Defined as Limits in Probability

It is possible to define the integral $\int_0^t \varphi(s)\, dW(s)$ when $\varphi$ is only in $B_{loc}$, not necessarily in $A_{loc}$. For this, define for each $n \ge 1$ the $\mathcal{F}_t$-stopping time
$$T_n := \inf\Big\{t \,;\, \int_0^t |\varphi(s)|^2\, ds \ge n\Big\}$$
with the usual convention $\inf \emptyset = \infty$. Clearly, $P$-a.s., $T_n \uparrow \infty$ and $\int_0^{T_n} |\varphi(s)|^2\, ds \le n$. In particular, the stochastic process
$$\varphi_n(t) := \varphi(t)\, 1_{\{t \le T_n\}} \quad (t \ge 0) \tag{14.6}$$
is in $A(\mathbb{R}_+)$ and $I_n(t) := \int_0^t \varphi_n(s)\, dW(s)$ is therefore well defined. For any $n, m$ and for any $\varepsilon > 0$,
$$P\Big(\Big|\int_0^t \varphi_n(s)\, dW(s) - \int_0^t \varphi_m(s)\, dW(s)\Big| \ge \varepsilon\Big) \le P\Big(\int_0^t |\varphi(s)|^2\, ds \ge \min(n,m)\Big) \to 0 \quad \text{as } n, m \uparrow \infty .$$
By Cauchy’s criterion of convergence in probability (Theorem 4.2.3), for each $t \ge 0$, $I_n(t)$ converges in probability to some random variable denoted $I(\varphi, t)$.

If $\varphi \in A(\mathbb{R}_+)$, $\lim_{n\uparrow\infty} I_n(t) = \int_0^t \varphi(s)\, dW(s)$ in $L^2_{\mathbb{C}}(P)$ and therefore also in probability. Therefore, in this case $I(\varphi, t) = \int_0^t \varphi(s)\, dW(s)$. Therefore, for $\varphi \in B_{loc}$,
$$\int_0^t \varphi(s)\, dW(s) := \lim_{n\uparrow\infty} \int_0^t \varphi_n(s)\, dW(s),$$
where the limit is in probability, is an extension of the definition of the Itô integral from integrands in $A(\mathbb{R}_+)$ to integrands in $B_{loc}$. The following result is a direct consequence of Theorems 14.1.7 and 14.1.6 (Exercise 14.4.6).


Theorem 14.1.8 Let $\{W(t)\}_{t\ge 0}$ be an $\mathcal{F}_t$-Wiener process and let $\varphi \in B_{loc}$. The process $\{X(t)\}_{t\ge 0} := \big\{\int_0^t \varphi(s)\, dW(s)\big\}_{t\ge 0}$ is then an $\mathcal{F}_t$-local martingale which admits a continuous version. Moreover, for any $\mathcal{F}_t$-stopping time $\tau$,
$$X(t \wedge \tau) = \int_0^t \varphi(s) 1_{\{s \le \tau\}}\, dW(s) \quad (t \ge 0) . \tag{14.7}$$

14.2 Itô’s Differential Formula

14.2.1 Elementary Form

The ordinary rules of calculus do not apply to functions of Brownian motion, as the following example shows:

Example 14.2.1: Squared Brownian motion. Let $\{W(t)\}_{t\ge 0}$ be an $\mathcal{F}_t$-Wiener process. With $t_i := \frac{it}{n}$,
$$W(t)^2 = \sum_{i=1}^n \big(W(t_i)^2 - W(t_{i-1})^2\big) = 2\sum_{i=1}^n W(t_{i-1})\big(W(t_i) - W(t_{i-1})\big) + \sum_{i=1}^n \big(W(t_i) - W(t_{i-1})\big)^2 := A_n + B_n .$$
As $n \uparrow \infty$, using Remark 14.1.3, $A_n$ converges in $L^2(P)$ to $2\int_0^t W(s)\, dW(s)$, whereas $B_n$ converges to $t$ (Theorem 11.2.10). Therefore
$$W(t)^2 = 2\int_0^t W(s)\, dW(s) + t .$$
If the trajectories of the Wiener process were of bounded variation, integration by parts would give the (wrong) formula $W(t)^2 = 2\int_0^t W(s)\, dW(s)$.
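The decomposition $W(t)^2 = A_n + B_n$ of Example 14.2.1 is easy to observe on a simulated path. The following sketch is an illustration only (the function name `squared_bm_decomposition`, the grid size and the seed are arbitrary choices, and $W(0) = 0$ is assumed): it builds a discretized Brownian path from independent Gaussian increments, accumulates the two sums, and returns them together with $W(t)$.

```python
import math
import random


def squared_bm_decomposition(t=1.0, n=20_000, seed=0):
    """Return (W(t), A_n, B_n) for one simulated path on the grid t_i = i t / n."""
    rng = random.Random(seed)
    dt = t / n
    sqrt_dt = math.sqrt(dt)
    w = 0.0      # current value W(t_{i-1}), starting from W(0) = 0
    a_n = 0.0    # 2 * sum of W(t_{i-1}) * (W(t_i) - W(t_{i-1}))
    b_n = 0.0    # sum of (W(t_i) - W(t_{i-1}))^2
    for _ in range(n):
        dw = rng.gauss(0.0, sqrt_dt)   # Gaussian increment with variance dt
        a_n += 2.0 * w * dw            # left-point (Ito) evaluation
        b_n += dw * dw
        w += dw
    return w, a_n, b_n


if __name__ == "__main__":
    w, a_n, b_n = squared_bm_decomposition()
    print("W(t)^2     =", w * w)
    print("A_n + B_n  =", a_n + b_n)
    print("B_n (≈ t)  =", b_n)
```

The identity $W(t)^2 = A_n + B_n$ holds exactly for every $n$ (up to rounding), while $B_n$ is close to $t$ only for large $n$, its variance being $2t^2/n$. It is this second sum that a bounded-variation calculus would wrongly discard.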

Functions of Brownian Motion

Let $C_b^2$ denote the collection of functions $F : \mathbb{R} \to \mathbb{C}$ that are twice continuously differentiable and such that $F$, $F'$ and $F''$ are bounded.

Theorem 14.2.2 Let $\{W(t)\}_{t\ge 0}$ be a standard $\mathcal{F}_t$-Wiener process and let $F \in C_b^2$. Then
$$F(W(t)) = F(W(0)) + \int_0^t F'(W(s))\, dW(s) + \frac{1}{2}\int_0^t F''(W(s))\, ds . \tag{14.8}$$

Proof. It suffices to treat the case of real-valued functions $F$. Let $t_i := \frac{it}{n}$. Taylor’s formula at the second order gives
$$\begin{aligned} F(W(t)) - F(W(0)) &= \sum_{i=1}^n \big(F(W(t_i)) - F(W(t_{i-1}))\big) \\ &= \sum_{i=1}^n F'(W(t_{i-1}))\big(W(t_i) - W(t_{i-1})\big) + \frac{1}{2}\sum_{i=1}^n F''(\xi_i)\big(W(t_i) - W(t_{i-1})\big)^2 \\ &= \sum_{i=1}^n F'(W(t_{i-1}))\big(W(t_i) - W(t_{i-1})\big) + \frac{1}{2}\sum_{i=1}^n F''(W(\theta_i))\big(W(t_i) - W(t_{i-1})\big)^2, \end{aligned}$$


where $\xi_i \in [W(t_{i-1}), W(t_i)]$ (and therefore, since the trajectories of the Brownian motion are continuous, $\xi_i = W(\theta_i)$ for some $\theta_i = \theta_i(n,\omega) \in (t_{i-1}, t_i)$). The first term on the right-hand side tends in $L^2(P)$ to $\int_0^t F'(W(s))\, dW(s)$ (Remark 14.1.3). Let now
$$A_n := \sum_{i=1}^n F''(W(\theta_i))\big(W(t_i) - W(t_{i-1})\big)^2, \quad B_n := \sum_{i=1}^n F''(W(t_{i-1}))\big(W(t_i) - W(t_{i-1})\big)^2, \quad C_n := \sum_{i=1}^n F''(W(t_{i-1}))(t_i - t_{i-1}) .$$
By Schwarz’s inequality, dominated convergence and Theorem 11.2.10,
$$\begin{aligned} E[|A_n - B_n|] &\le E\Big[\sup_i \big|F''(W(\theta_i)) - F''(W(t_{i-1}))\big| \times \sum_i \big(W(t_i) - W(t_{i-1})\big)^2\Big] \\ &\le \Big(E\Big[\sup_i \big|F''(W(\theta_i)) - F''(W(t_{i-1}))\big|^2\Big] \times E\Big[\Big(\sum_i \big(W(t_i) - W(t_{i-1})\big)^2\Big)^2\Big]\Big)^{\frac{1}{2}} \\ &\to (0 \times t^2)^{\frac{1}{2}} = 0 . \end{aligned}$$
Now,
$$E\big[|B_n - C_n|^2\big] \le (\sup |F''|)^2 \times \sum_i E\Big[\big|\big(W(t_i) - W(t_{i-1})\big)^2 - (t_i - t_{i-1})\big|^2\Big] = (\sup |F''|)^2 \times 2\sum_i (t_i - t_{i-1})^2 \to 0 .$$
Finally, $C_n \to \int_0^t F''(W(s))\, ds$ in $L^2(P)$, and consequently in $L^1(P)$. Therefore, for all $t$, the announced equality holds in $L^1(P)$ and consequently $P$-almost surely. Since the stochastic processes on both sides of the equality are continuous, we have that $P$-almost surely, this equality holds for all $t \in \mathbb{R}_+$. □

Remark 14.2.3 Theorem 14.2.2 remains true if we only suppose that {W (t)}t≥0 is a real continuous Ft -martingale such that {W (t)2 − t}t≥0 is also an Ft -martingale. We shall not prove this, although the proof is very close to the one given in the special case.

Example 14.2.4: Itô’s rule for exponentials. Let $F(x,t) = e^x$ and $X(t) = \int_0^t \varphi(s)\, dW(s) - \frac{1}{2}\int_0^t \varphi(s)^2\, ds$ where $\varphi \in B_{loc}$. Application of rule (14.11) yields
$$L(t) := \exp\Big\{\int_0^t \varphi(s)\, dW(s) - \frac{1}{2}\int_0^t \varphi(s)^2\, ds\Big\} = 1 + \int_0^t L(s)\varphi(s)\, dW(s) . \tag{14.9}$$
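Identity (14.9) can be illustrated on a discretized path. The sketch below makes several simplifying assumptions that are not in the text: the integrand is a deterministic function (here $\varphi(s) = \cos s$ in the usage line), the integrals are replaced by left-point sums on a regular grid, and the name `stochastic_exponential` is hypothetical. It computes the closed form on the left of (14.9) and, separately, the recursion $L_{i+1} = L_i + L_i\varphi(t_i)\Delta W_i$ suggested by the right-hand side.

```python
import math
import random


def stochastic_exponential(phi, t=1.0, n=20_000, seed=1):
    """Return (closed_form, recursive) approximations of L(t) on one simulated path."""
    rng = random.Random(seed)
    dt = t / n
    sqrt_dt = math.sqrt(dt)
    log_l = 0.0      # int phi dW - (1/2) int phi^2 ds, as left-point sums
    recursive = 1.0  # 1 + int L phi dW, as a left-point (Ito) sum
    for i in range(n):
        p = phi(i * dt)                  # integrand frozen at the left endpoint
        dw = rng.gauss(0.0, sqrt_dt)     # Brownian increment over [t_i, t_{i+1}]
        recursive += recursive * p * dw
        log_l += p * dw - 0.5 * p * p * dt
    return math.exp(log_l), recursive


if __name__ == "__main__":
    closed, recursive = stochastic_exponential(math.cos)
    print("exp form :", closed)
    print("recursion:", recursive)
```

The two numbers agree only up to a discretization error that shrinks as the grid is refined. The correction term $-\frac{1}{2}\int_0^t \varphi(s)^2\, ds$ in the exponent is what makes them agree at all.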


Lévy’s Characterization of Brownian Motion

Theorem 14.2.5 A real continuous $\mathcal{F}_t$-adapted stochastic process $\{W(t)\}_{t\ge 0}$ that is an $\mathcal{F}_t$-martingale and such that $\{W(t)^2 - t\}_{t\ge 0}$ is also an $\mathcal{F}_t$-martingale is a standard $\mathcal{F}_t$-Wiener process.

Proof. Applying Itô’s differentiation rule¹ with $F(x) = e^{iux}$ ($u \in \mathbb{R}$),
$$e^{iuW(t)} - e^{iuW(s)} = iu\int_s^t e^{iuW(z)}\, dW(z) - \frac{1}{2}u^2\int_s^t e^{iuW(z)}\, dz \quad (0 \le s \le t) .$$
Now, since $E\big[\int_0^t |e^{iuW(s)}|^2\, ds\big] = t < \infty$, it follows that $\int_0^t e^{iuW(s)}\, dW(s)$ is an $\mathcal{F}_t$-martingale and therefore, for all $A \in \mathcal{F}_s$,
$$\int_A \big(e^{iuW(t)} - e^{iuW(s)}\big)\, dP = -\frac{1}{2}u^2 \int_A \int_s^t e^{iuW(z)}\, dz\, dP . \tag{14.10}$$
Dividing both sides of the above equation by $e^{iuW(s)}$ and applying Fubini’s theorem to the right-hand side,
$$\int_A e^{iu(W(t) - W(s))}\, dP = P(A) - \frac{1}{2}u^2 \int_s^t \int_A e^{iu(W(z) - W(s))}\, dP\, dz,$$
and therefore
$$\int_A e^{iu(W(t) - W(s))}\, dP = P(A)\, e^{-\frac{1}{2}u^2(t-s)} . \tag{$\star$}$$
This equality is valid for all $0 \le s \le t$, all $u \in \mathbb{R}$ and all $A \in \mathcal{F}_s$. With $A = \Omega$,
$$E\big[e^{iu(W(t) - W(s))}\big] = e^{-\frac{1}{2}u^2(t-s)},$$
that is, $W(t) - W(s)$ is a centered Gaussian variable with variance $t - s$. Equation ($\star$) then reads
$$E\big[1_A e^{iu(W(t) - W(s))}\big] = P(A)\, E\big[e^{iu(W(t) - W(s))}\big] \quad (A \in \mathcal{F}_s),$$
from which it follows that $W(t) - W(s)$ is independent of $\mathcal{F}_s$. □

14.2.2 Some Extensions

Theorem 14.2.2 can be extended in several directions that do not involve new ideas. The proofs, using arguments very similar to those in the proof of Theorem 14.2.2, are therefore omitted. Theorem 14.2.2 dealt with functions of the Brownian motion. We now consider functions of an Itô process, whose definition follows.

Definition 14.2.6 A stochastic process of the form
$$X(t) := X(0) + \int_0^t f(s)\, ds + \int_0^t \varphi(s)\, dW(s) \quad (t \ge 0), \tag{$\star$}$$
where $X(0)$ is an $\mathcal{F}_0$-measurable random variable and $\{\varphi(t)\}_{t\ge 0}$ and $\{f(t)\}_{t\ge 0}$ are $\mathcal{F}_t$-progressively measurable stochastic processes in $A_{loc}$ or $B_{loc}$, is called an Itô process.

¹ See Remark 14.2.3.


Theorem 14.2.7 Suppose that $\varphi, f \in A_{loc}$ and $X(0)$ is square integrable. Then, for all functions $F \in C^2$,
$$F(X(t)) = F(X(0)) + \int_0^t F'(X(s))\varphi(s)\, dW(s) + \int_0^t F'(X(s))f(s)\, ds + \frac{1}{2}\int_0^t F''(X(s))\varphi(s)^2\, ds .$$
With the notation $dX(s) := \varphi(s)\, dW(s) + f(s)\, ds$, this formula can be written as
$$F(X(t)) = F(X(0)) + \int_0^t F'(X(s))\, dX(s) + \frac{1}{2}\int_0^t F''(X(s))\varphi(s)^2\, ds .$$
Therefore, with
$$\langle X\rangle(t) := \int_0^t \varphi(s)^2\, ds$$
(the bracket process of the martingale part of the Itô process),
$$F(X(t)) = F(X(0)) + \int_0^t F'(X(s))\, dX(s) + \frac{1}{2}\int_0^t F''(X(s))\, d\langle X\rangle(s)$$
or, in differential form,
$$dF(X(t)) = F'(X(t))\, dX(t) + \frac{1}{2}F''(X(t))\, d\langle X\rangle(t) .$$

Example 14.2.8: The geometric Brownian motion. Let $Z(t) := \exp\{\sigma W(t) + \mu t\}$ ($t \ge 0$). By Itô’s differential rule,
$$Z(t) = 1 + \sigma\int_0^t Z(s)\, dW(s) + \Big(\mu + \frac{\sigma^2}{2}\Big)\int_0^t Z(s)\, ds .$$
In particular, with $\mu = -\frac{\sigma^2}{2}$, the process
$$Z(t) := \exp\Big\{\sigma W(t) - \frac{\sigma^2}{2}t\Big\} \quad (t \ge 0)$$
is a martingale, called the geometric Brownian motion.

Let $\{W(t)\}_{t\ge 0}$ be an $\mathcal{F}_t$-Wiener process. Let $\{\varphi(t)\}_{t\ge 0}$ and $\{f(t)\}_{t\ge 0}$ be in $B_{loc}$. Let $\{X(t)\}_{t\ge 0}$ be an Itô process.

Theorem 14.2.9 Let $F : (x,t) \in \mathbb{R}^2 \to F(x,t) \in \mathbb{C}$ be twice continuously differentiable in the first variable $x$ and once continuously differentiable in the second variable $t$. Then
$$F(X(t),t) = F(X(0),0) + \int_0^t \frac{\partial F}{\partial t}(X(s),s)\, ds + \int_0^t \frac{\partial F}{\partial x}(X(s),s)\, dX(s) + \frac{1}{2}\int_0^t \frac{\partial^2 F}{\partial x^2}(X(s),s)\, \varphi(s)^2\, ds, \tag{14.11}$$
where
$$\int_0^t \frac{\partial F}{\partial x}(X(s),s)\, dX(s) := \int_0^t \frac{\partial F}{\partial x}(X(s),s)\, \varphi(s)\, dW(s) + \int_0^t \frac{\partial F}{\partial x}(X(s),s)\, f(s)\, ds . \tag{14.12}$$
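A necessary consequence of the martingale property in Example 14.2.8 is that $E[\exp\{\sigma W(t) - \frac{\sigma^2}{2}t\}] = 1$ for all $t$. The sketch below checks this by Monte Carlo; it is an illustration only, with $W(0) = 0$ assumed, arbitrary parameter values, and a hypothetical function name `mean_exponential_martingale`. Only the marginal law $W(t) \sim N(0,t)$ is needed, so no path is simulated.

```python
import math
import random


def mean_exponential_martingale(sigma=0.5, t=1.0, n_samples=200_000, seed=2):
    """Monte Carlo estimate of E[exp(sigma W(t) - sigma^2 t / 2)]."""
    rng = random.Random(seed)
    sd = math.sqrt(t)                       # W(t) is centered Gaussian with variance t
    acc = 0.0
    for _ in range(n_samples):
        w_t = rng.gauss(0.0, sd)
        acc += math.exp(sigma * w_t - 0.5 * sigma * sigma * t)
    return acc / n_samples


if __name__ == "__main__":
    for horizon in (0.5, 1.0, 2.0):
        print(horizon, mean_exponential_martingale(t=horizon))
```

A constant mean does not by itself prove the martingale property. Without the drift correction, however, the mean would be $e^{\sigma^2 t/2}$, which grows with $t$.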


A Finite Number of Discontinuities

The Itô differentiation rule remains valid in situations where the functions $F$, $F'$ and $F''$ are $C^2$ in the $x$ argument only on $\mathbb{R}\backslash E$, where $E$ is a finite set $E = \{x_1, \ldots, x_k\}$. The method of proof is the same as the following one in the simple case of a function of the Brownian motion.

Theorem 14.2.10 Let $F : \mathbb{R} \to \mathbb{R}$ be $C^1$ on $\mathbb{R}$ and $C^2$ on $\mathbb{R}\backslash E$, where $E = \{x_1, \ldots, x_k\}$. If $F''$ remains bounded on a neighborhood of these points, the Itô formula applies:
$$F(W(t)) = F(W(0)) + \int_0^t F'(W(s))\, dW(s) + \frac{1}{2}\int_0^t F''(W(s))\, ds . \tag{14.13}$$
Concerning the last (Lebesgue) integral, note that the Lebesgue measure of $\{t \,;\, W(t,\omega) \in E\}$ is almost surely null.

Proof. We first show the existence of a sequence of functions $F_n$ that are $C^2$ in $\mathbb{R}$ with the following properties: (i) $F_n \to F$ and $F_n' \to F'$ uniformly in $\mathbb{R}$, and (ii) for all $x \notin E$, $F_n''(x) \to F''(x)$ and $F_n''$ is uniformly bounded on a neighborhood of $E$. Let $\alpha$ be some function in $C^\infty$ with a compact support, equal to 1 in a neighborhood of $E$. In particular, $(1 - \alpha)F$ is in $C^2$ and the Itô formula applies for this function. It remains to show that it also applies to $\alpha F$, or more generally to $F$ as before, but with compact support. For such a function, consider the approximation $F_n := F * \varphi_n$ where $\varphi_n(x) := n\varphi(nx)$ and $\varphi$ is $C^\infty$ with compact support and such that $0 \le \varphi \le 1$. Such a function satisfies requirements (i) and (ii) above, and therefore the Itô formula applies:
$$F_n(W(t)) = F_n(W(0)) + \int_0^t F_n'(W(s))\, dW(s) + \frac{1}{2}\int_0^t F_n''(W(s))\, ds .$$
The terms of this equality converge as $n \uparrow \infty$ to the corresponding terms of (14.13), the first two terms by uniform convergence of the $F_n$’s, the third by uniform convergence of the $F_n'$’s and the isometry formula. For the fourth term, observe that, by Schwarz’s inequality,
$$E\Big[\Big(\int_0^t F_n''(W(s))\, ds - \int_0^t F''(W(s))\, ds\Big)^2\Big] \le t\int_0^t E\big[|F_n''(W(s)) - F''(W(s))|^2\big]\, ds,$$
a quantity that tends to 0 by dominated convergence. □


The Vectorial Differentiation Rule

Let $C^{1,2}(\mathbb{R}_+ \times \mathbb{R}^d)$ denote the collection of functions $F : \mathbb{R}_+ \times \mathbb{R}^d \to \mathbb{R}$ that are once continuously differentiable in the first coordinate and twice continuously differentiable in the second coordinate. Let $\{W(t)\}_{t\ge 0}$ be a $k$-dimensional standard Wiener process. Let
$$\varphi(t) := \{\varphi_{i,j}(t)\}_{1\le i\le d,\, 1\le j\le k} \quad (t \ge 0)$$
be a real $d \times k$-matrix valued stochastic process such that for all $1 \le i \le d$ and all $1 \le j \le k$ the process $\{\varphi_{i,j}(t)\}_{t\ge 0}$ is $\mathcal{F}_t^W$-adapted and in $B_{loc}$. Denote by $\int_0^t \varphi(s)\, dW(s)$ ($t \ge 0$) the $d$-dimensional stochastic process whose $i$-th component is $\sum_{j=1}^k \int_0^t \varphi_{i,j}(s)\, dW_j(s)$ ($t \ge 0$). Let
$$\psi(t) := \{\psi_i(t)\}_{1\le i\le d} \quad (t \ge 0)$$
be a real $d$-dimensional stochastic process such that for all $1 \le i \le d$ the process $\{\psi_i(t)\}_{t\ge 0}$ is $\mathcal{F}_t$-adapted and in $B_{loc}$. Define the $d$-dimensional Itô process $\{X(t)\}_{t\ge 0}$ by
$$X(t) := X(0) + \int_0^t \varphi(s)\, dW(s) + \int_0^t \psi(s)\, ds,$$
where $X(0)$ is a vector of integrable random variables. Then:

Theorem 14.2.11 Under the above conditions, for $F : \mathbb{R}_+ \times \mathbb{R}^d \to \mathbb{R}$ once continuously differentiable in the first coordinate and twice continuously differentiable in the second coordinate, we have the formula
$$\begin{aligned} F(t, X(t)) = F(0, X(0)) &+ \int_0^t \frac{\partial}{\partial s}F(s, X(s))\, ds + \sum_{i=1}^d \int_0^t \frac{\partial}{\partial x_i}F(s, X(s))\, \psi_i(s)\, ds \\ &+ \sum_{i=1}^d \int_0^t \frac{\partial}{\partial x_i}F(s, X(s)) \sum_{j=1}^k \varphi_{i,j}(s)\, dW_j(s) \\ &+ \frac{1}{2}\sum_{i,\ell=1}^d \int_0^t \frac{\partial^2}{\partial x_i \partial x_\ell}F(s, X(s)) \Big(\sum_{j=1}^k \varphi_{i,j}(s)\varphi_{\ell,j}(s)\Big)\, ds . \end{aligned}$$

Example 14.2.12: An integration by parts formula. Let for $i = 1, 2$
$$X_i(t) = X_i(0) + \int_0^t \varphi_i(s)\, dW(s) + \int_0^t \psi_i(s)\, ds \quad (t \ge 0),$$
where the $\varphi_i$’s and $\psi_i$’s satisfy the conditions of Definition 14.2.6. Then
$$X_1(t)X_2(t) = X_1(0)X_2(0) + \int_0^t X_1(s)\, dX_2(s) + \int_0^t X_2(s)\, dX_1(s) + \int_0^t \varphi_1(s)\varphi_2(s)\, ds .$$
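The extra term $\int_0^t \varphi_1(s)\varphi_2(s)\, ds$ in Example 14.2.12 can be seen at the discrete level. The sketch below is an illustration under assumptions that are not in the text: $\varphi_1 \equiv 1$, $\varphi_2(s) = s$, $\psi_1 = \psi_2 = 0$, $X_1(0) = X_2(0) = 0$, left-point sums on a regular grid, and a hypothetical function name `product_rule_check`. With these choices the correction term at $t = 1$ is $\int_0^1 s\, ds = \frac{1}{2}$.

```python
import math
import random


def product_rule_check(t=1.0, n=20_000, seed=3):
    """Return (X1(t) X2(t), sum X1 dX2 + sum X2 dX1, sum dX1 dX2) on one path."""
    rng = random.Random(seed)
    dt = t / n
    sqrt_dt = math.sqrt(dt)
    x1 = 0.0
    x2 = 0.0
    ito_sums = 0.0   # left-point approximations of int X1 dX2 + int X2 dX1
    cross = 0.0      # sum of products of increments, approximating int phi1 phi2 ds
    for i in range(n):
        dw = rng.gauss(0.0, sqrt_dt)
        dx1 = dw                 # phi_1 = 1
        dx2 = (i * dt) * dw      # phi_2(s) = s, frozen at the left endpoint
        ito_sums += x1 * dx2 + x2 * dx1
        cross += dx1 * dx2
        x1 += dx1
        x2 += dx2
    return x1 * x2, ito_sums, cross


if __name__ == "__main__":
    product, ito_sums, cross = product_rule_check()
    print("X1 X2            :", product)
    print("Ito sums + cross :", ito_sums + cross)
    print("cross (≈ 1/2)    :", cross)
```

The algebraic identity $x_1' x_2' - x_1 x_2 = x_1\Delta x_2 + x_2\Delta x_1 + \Delta x_1\Delta x_2$ makes the first two printed numbers equal for every grid. The stochastic content is that the sum of the cross products does not vanish in the limit.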


14.3 Selected Applications

14.3.1 Square-integrable Brownian Functionals

Theorem 14.3.1 Let $\{W(t)\}_{t\in[0,1]}$ be an $\mathcal{F}_t^W$-Wiener process, and let $\{m(t)\}_{t\in[0,1]}$ be a real-valued $\mathcal{F}_t^W$-square-integrable martingale on $[0,1]$. Then there exists a real stochastic process $\{\varphi(t)\}_{t\in[0,1]} \in A([0,1])$ such that
$$m(t) = m(0) + \int_0^t \varphi(s)\, dW(s) \quad (t \in [0,1]) . \tag{14.14}$$

Lemma 14.3.2 Let $\{W(t)\}_{t\in[0,1]}$ be an $\mathcal{F}_t^W$-Wiener process.
A. The collection of random variables $H := \{e^X \,;\, X \in U\}$ where $U$ is the collection of random variables
$$\sum_{j=1}^k a_j W(t_j) \quad (k \in \mathbb{N},\ 0 \le t_1 < \cdots < t_k \le 1,\ a_1, \ldots, a_k \in \mathbb{R})$$

is total in the Hilbert space $L^2_{\mathbb{R}}(\mathcal{F}_1^W, P)$.
B. The collection of random variables that are linear combinations of elements of
$$K := \Big\{M^f(1) = \exp\Big(\int_0^1 f(s)\, dW(s) - \frac{1}{2}\int_0^1 f(s)^2\, ds\Big) \,;\ f : \mathbb{R} \to \mathbb{R} \text{ measurable and bounded}\Big\}$$
is dense in the Hilbert space $L^2_{\mathbb{R}}(\mathcal{F}_1^W, P)$.

Proof. Part B is an immediate consequence of Part A. For the proof of A, observe that $H \subset L^2_{\mathbb{R}}(\mathcal{F}_1^W, P)$ and that $1 \in H$. We have to prove that if $Y \in L^2_{\mathbb{R}}(\mathcal{F}_1^W, P)$ is such that $E[Ye^X] = 0$ for all $X \in U$, then $P(Y = 0) = 1$. As $1 \in H$, $E[Y] = 0$. Multiplying $Y$ by a constant if necessary, we may suppose that $E[|Y|] = 2$ and in particular, since $E[Y] = 0$, $E[Y^+] = 1$ and $E[Y^-] = 1$. Therefore $Q^+ := Y^+P$ and $Q^- := Y^-P$ are probability measures. Since $E[Ye^X] = 0$ for all $X \in U$, we have that $E[Y^+e^X] = E[Y^-e^X]$ or, equivalently, $E_{Q^+}[e^X] = E_{Q^-}[e^X]$. Therefore the Laplace transforms of the vectors of the type $(W(t_1), \ldots, W(t_k))$ are the same under $Q^+$ and $Q^-$. This implies in particular that $Q^+$ and $Q^-$ agree on $\mathcal{F}_1^W$. Therefore $E\big[1_{\{Y^+ > Y^-\}}(Y^+ - Y^-)\big] = 0$ and E 1{Y + ∞ .

0

Therefore, by Lemma 14.3.4, $E[L^\psi(T)] = 1$ or, equivalently,
$$E\Big[L(T)\exp\Big\{i\sum_{j=1}^p u_j X(t_j)\Big\}\Big] \times \exp\Big\{\frac{1}{2}\sum_{j,k=1}^p u_j u_k (t_j \wedge t_k)\Big\} = 1, \tag{$\star$}$$
that is, (14.19). It remains to get rid of the additional assumption. For this, we introduce the processes $\varphi_n(t) := \varphi(t)1_{[0,\tau_n]}(t)$ where
$$\tau_n := \inf\Big\{t \,;\, \int_0^t \varphi(s)^2\, ds \ge n\Big\} .$$
By Lemma 14.3.4, $E[L^{\varphi_n}(T)] = 1 = E[L^\varphi(T)]$ (the last equality is a hypothesis of the theorem). Also $L^{\varphi_n}(T) \to L^\varphi(T)$. Since moreover $L^{\varphi_n}(T) \ge 0$, the conditions of application of Scheffé’s lemma (Lemma 4.4.24) are satisfied, and therefore
$$L^{\varphi_n}(T) \to L^\varphi(T) \text{ in } L^1 . \tag{$\dagger$}$$
But ($\star$) is true for $\varphi_n$, that is, letting $X_n(t) := W(t) - \int_0^t \varphi_n(s)\, ds$,
$$E\Big[L^{\varphi_n}(T)\exp\Big\{i\sum_{j=1}^p u_j X_n(t_j)\Big\}\Big] \times \exp\Big\{\frac{1}{2}\sum_{j,k=1}^p u_j u_k (t_j \wedge t_k)\Big\} = 1 .$$
Moreover,
$$\begin{aligned} &L^{\varphi_n}(T)\exp\Big\{i\sum_{j=1}^p u_j X_n(t_j)\Big\} - L^\varphi(T)\exp\Big\{i\sum_{j=1}^p u_j X(t_j)\Big\} \\ &\quad = \big(L^{\varphi_n}(T) - L^\varphi(T)\big)\exp\Big\{i\sum_{j=1}^p u_j X_n(t_j)\Big\} + L^\varphi(T)\Big(\exp\Big\{i\sum_{j=1}^p u_j X_n(t_j)\Big\} - \exp\Big\{i\sum_{j=1}^p u_j X(t_j)\Big\}\Big) . \end{aligned}$$
Using ($\dagger$) and the fact that $\exp\big\{i\sum_{j=1}^p u_j X_n(t_j)\big\}$ is uniformly bounded and converges, we obtain by dominated convergence that
$$L^{\varphi_n}(T)\exp\Big\{i\sum_{j=1}^p u_j X_n(t_j)\Big\} \to L^\varphi(T)\exp\Big\{i\sum_{j=1}^p u_j X(t_j)\Big\} \text{ in } L^1,$$
which gives ($\star$). □

The following result is a sufficient condition for the hypothesis (14.16) to hold.

Lemma 14.3.5 If in Theorem 14.3.3 $\{\varphi(t)\}_{t\in[0,1]}$ is bounded, then (14.16) holds. Moreover, $\{L(t)\}_{t\in[0,1]}$ thereof is a $(P, \mathcal{F}_t)$-square-integrable martingale.

Proof. For each $n \ge 1$, let $S_n$ be the $\mathcal{F}_t$-stopping time defined by
$$S_n = \begin{cases} \inf\big\{t \mid \int_0^t \varphi(s)^2\, ds + L(t) \ge n\big\} & \text{if } \{\ldots\} \ne \emptyset, \\ +\infty & \text{otherwise} . \end{cases} \tag{14.20}$$
Then $\{L(t \wedge S_n)\}_{t\in[0,1]}$ is a square-integrable $\mathcal{F}_t$-martingale and, by isometry,
$$E\big[|L(t \wedge S_n) - 1|^2\big] = E\Big[\int_0^t |L(s)\varphi(s)|^2 1_{\{s \le S_n\}}\, ds\Big] .$$
In particular,
$$E\big[|L(t \wedge S_n)|^2\big] \le 1 + \int_0^t E\big[|L(s)\varphi(s)|^2 1_{\{s \le S_n\}}\big]\, ds \le 1 + \sup(\varphi^2)\int_0^t E\big[|L(s \wedge S_n)|^2\big]\, ds .$$
The rest follows by Gronwall’s lemma (Theorem B.6.1). □


Theorem 14.3.6 If $E[L(T)] = 1$, then $\{L(t)\}_{t\in[0,T]}$ is an $\mathcal{F}_t$-martingale.

Proof. Itô’s differentiation rule applied to $F(x) = e^x$ yields (14.9). Since $L(t)$ and $\int_0^t \varphi(s)^2\, ds$ are finite continuous processes, $S_n \uparrow \infty$, $P$-a.s. Also
$$\int_0^t |L(s)\varphi(s)1_{\{s \le S_n\}}|^2\, ds \le n^2,$$
and therefore $\{L(t \wedge S_n)\}_{t\ge 0}$ is a $(P, \mathcal{F}_t)$-square-integrable martingale, and therefore $\{L(t)\}_{t\ge 0}$ is a $(P, \mathcal{F}_t)$-local martingale. Being non-negative, it is also a supermartingale, and a $(P, \mathcal{F}_t)$-martingale on $[0,T]$ when (14.16) is satisfied. □

The Strong Markov Property of Brownian Motion

This result (Theorem 11.2.1) was admitted in Chapter 11. The following proof is based on stochastic calculus, more precisely, on the following characterization:

Lemma 14.3.7 For a continuous real stochastic process $\{Y(t)\}_{t\ge 0}$ adapted to the history $\{\mathcal{F}_t\}_{t\ge 0}$ to be an $\mathcal{F}_t$-Brownian motion, it is necessary and sufficient that for all $\lambda \in \mathbb{R}$, the process
$$e^{\lambda Y(t) - \frac{1}{2}\lambda^2 t} \quad (t \ge 0)$$
be an $\mathcal{F}_t$-martingale. In this case, it is independent of $\mathcal{F}_0$.

Proof. The necessity results from an elementary computation on Gaussian variables. For the sufficiency, observe that the martingale condition implies that for all intervals $[a,b] \subset \mathbb{R}$,
$$E\big[e^{\lambda(Y(b) - Y(a))} \mid \mathcal{F}_a\big] = e^{\frac{1}{2}\lambda^2(b-a)} .$$
By taking expectations it follows that $Y(b) - Y(a)$ is a centered Gaussian variable of variance $(b - a)$ and then that it is independent of $\mathcal{F}_a$, which implies the independence property of the increments as well as their independence from $\mathcal{F}_0$. □

We now prove Theorem 11.2.1.

Proof. The stochastic process
$$\varphi(t) := 1_A 1_{(\tau+a, \tau+b]}(t) \quad (t \ge 0),$$
where $A \in \mathcal{F}_{\tau+a}$, is $\mathcal{F}_t$-progressively measurable, and therefore by Lemma 14.3.4:
$$E\Big[\exp\Big\{\lambda 1_A\big(W(\tau+b) - W(\tau+a)\big) - \frac{1}{2}\lambda^2 1_A (b-a)\Big\}\Big] = 1$$
or, equivalently,
$$E\Big[\exp\Big\{\lambda\big(W(\tau+b) - W(\tau+a)\big) - \frac{1}{2}\lambda^2(b-a)\Big\} 1_A\Big] = P(A) .$$
Therefore, since $A$ is arbitrary in $\mathcal{F}_{\tau+a}$,
$$E\big[\exp\big\{\lambda\big(W(\tau+b) - W(\tau+a)\big)\big\} \mid \mathcal{F}_{\tau+a}\big] = \exp\Big\{\frac{1}{2}\lambda^2(b-a)\Big\},$$


from which we deduce as in Lemma 14.3.7 that $W(\tau+b) - W(\tau+a)$ is a centered Gaussian variable with variance $b - a$ independent of $\mathcal{F}_{\tau+a}$. In particular, it has independent increments and therefore $\{W(\tau+t)\}_{t\ge 0}$ is a Wiener process. Moreover, still by Lemma 14.3.7, this process is independent of $\mathcal{F}_\tau$. □

Remark 14.3.8 The above proof is similar to that of the strong Markov property of Poisson processes (Theorem 7.1.9).

14.3.3 Stochastic Differential Equations

This is a very brief introduction to a vast subject. Let $\{W(t)\}_{t\ge 0}$ be a standard $\mathcal{F}_t$-Brownian motion. We are going to discuss the existence and uniqueness of a measurable $\mathcal{F}_t$-adapted stochastic process $\{X(t)\}_{t\ge 0}$ such that almost surely
$$X(t) = X(0) + \int_0^t b(X(s))\, ds + \int_0^t \sigma(X(s))\, dW(s) \quad (t \ge 0), \tag{14.21}$$
where $b$ and $\sigma$ are measurable functions such that almost surely
$$\int_0^t \big(|b(X(s))|^2 + |\sigma(X(s))|^2\big)\, ds < \infty \quad (t \ge 0) . \tag{14.22}$$
One then calls $\{X(t)\}_{t\ge 0}$ the solution of the stochastic differential equation (14.21). Condition (14.22) guarantees in particular that the integrand of the Itô integral of (14.21) is in $A_{loc}$.

Theorem 14.3.9 If $X(0) \in L^2_{\mathbb{R}}(P)$ and if for some $K < \infty$
$$|b(x) - b(y)| + |\sigma(x) - \sigma(y)| \le K|x - y|, \tag{14.23}$$
there exists a unique solution of (14.21) satisfying condition (14.22).

Proof. A. Uniqueness. If $\{X(t)\}_{t\ge 0}$ and $\{Y(t)\}_{t\ge 0}$ are two solutions,
$$X(t) - Y(t) = \int_0^t \big(b(X(s)) - b(Y(s))\big)\, ds + \int_0^t \big(\sigma(X(s)) - \sigma(Y(s))\big)\, dW(s)$$
and therefore, taking into account the inequality $(a + b)^2 \le 2(a^2 + b^2)$, Schwarz’s inequality, the property of isometry of Itô’s integrals and the Lipschitz condition (14.23),
$$\begin{aligned} E\big[(X(t) - Y(t))^2\big] &\le 2E\Big[\Big(\int_0^t \big(b(X(s)) - b(Y(s))\big)\, ds\Big)^2\Big] + 2E\Big[\Big(\int_0^t \big(\sigma(X(s)) - \sigma(Y(s))\big)\, dW(s)\Big)^2\Big] \\ &\le 2tE\Big[\int_0^t \big(b(X(s)) - b(Y(s))\big)^2\, ds\Big] + 2E\Big[\int_0^t \big(\sigma(X(s)) - \sigma(Y(s))\big)^2\, ds\Big] \\ &\le 2(t+1)K^2 E\Big[\int_0^t |X(s) - Y(s)|^2\, ds\Big] \\ &\le 2(T+1)K^2 E\Big[\int_0^t |X(s) - Y(s)|^2\, ds\Big] \quad (t \le T) . \end{aligned}$$


By Gronwall’s lemma, for all $t \in [0,T]$, $E\big[(X(t) - Y(t))^2\big] = 0$, and therefore $P(X(t) = Y(t)) = 1$.

B. Existence. Define recursively the stochastic processes $\{X_n(t)\}_{t\ge 0}$ ($n \ge 0$) by $X_0(t) := X(0)$ ($t \ge 0$) and for $n \ge 0$ by
$$X_{n+1}(t) := X(0) + \int_0^t b(X_n(s))\, ds + \int_0^t \sigma(X_n(s))\, dW(s) . \tag{$\star$}$$
With arguments similar to those of Part A, for $n \ge 1$,
$$E\big[(X_{n+1}(t) - X_n(t))^2\big] \le 2(T+1)K^2 E\Big[\int_0^t |X_n(s) - X_{n-1}(s)|^2\, ds\Big] \quad (t \le T) .$$
Letting $C_T := 2(T+1)K^2$ and $a := \max_{t\le T} E\big[(X_1(t) - X(0))^2\big]$ (a finite quantity, less than a constant times $T^3 E[X(0)^2]$), one checks by recurrence that
$$E\big[(X_{n+1}(t) - X_n(t))^2\big] \le a\frac{C_T^n t^{n-1}}{(n-1)!} .$$
Therefore
$$\|X_{n+1} - X_n\|^2_{A[0,T]} \le a\frac{(C_T T)^n}{n!}$$
and then
$$\|X_{n+\ell} - X_n\|_{A[0,T]} \le a^{\frac{1}{2}}\sum_{k\ge n}\Big(\frac{(C_T T)^k}{k!}\Big)^{\frac{1}{2}},$$
a quantity that tends to 0 as $n \to \infty$. This shows that $\{X_n(t)\}_{t\ge 0}$ ($n \ge 0$) is a Cauchy sequence of the Hilbert space $A[0,T]$ and therefore converges in this space to some $\{X(t)\}_{t\ge 0}$. The Lipschitz condition (14.23) allows us to pass to the limit in ($\star$) to obtain (14.21). The property (14.22) is easily verified. □

Example 14.3.10: An explicit solution. It can be checked (Exercise 14.4.15) that the differential equation
$$dX(t) = \sqrt{1 + X(t)^2}\, dW(t) + \Big(\sqrt{1 + X(t)^2} + \frac{1}{2}X(t)\Big)\, dt$$
admits the stochastic process
$$X(t) := \sinh\big(W(t) + \sinh^{-1} X(0) + t\big) \quad (t \ge 0)$$
as a solution. Theorem 14.3.9 then guarantees its uniqueness.
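Example 14.3.10 offers a convenient test case for numerical schemes, since a closed-form solution is available. The sketch below uses the Euler–Maruyama scheme, which is not discussed in the text and is brought in here only for illustration; it reads the explicit solution as $X(t) = \sinh(W(t) + \sinh^{-1}X(0) + t)$ with $W(0) = 0$, which is an assumption of this sketch about the intended formula. The name `euler_vs_explicit` and all parameter values are hypothetical.

```python
import math
import random


def euler_vs_explicit(x0=0.0, t=1.0, n=5_000, seed=4):
    """Return (Euler-Maruyama value, explicit value) at time t on one Brownian path."""
    rng = random.Random(seed)
    dt = t / n
    sqrt_dt = math.sqrt(dt)
    x = x0     # Euler-Maruyama approximation of X
    w = 0.0    # the driving Brownian path, W(0) = 0
    for _ in range(n):
        dw = rng.gauss(0.0, sqrt_dt)
        root = math.sqrt(1.0 + x * x)            # sigma(x) = sqrt(1 + x^2)
        x += (root + 0.5 * x) * dt + root * dw   # b(x) = sqrt(1 + x^2) + x / 2
        w += dw
    exact = math.sinh(w + math.asinh(x0) + t)    # closed form driven by the same path
    return x, exact


if __name__ == "__main__":
    approx, exact = euler_vs_explicit()
    print("Euler-Maruyama:", approx)
    print("explicit      :", exact)
```

Both values are computed from the same Brownian increments, so they approximate the same strong solution. The gap between them is a discretization error and shrinks as the number of steps grows.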

Strong and Weak Solutions

So far we have considered strong solutions. This means that the problem was posed in terms of a preexisting Wiener process and given initial state, and that the solution took the general form
$$X(t) = F\big(t, X(0), \{W(s)\}_{0\le s\le t}\big) .$$
A weak solution associated with the parameters (functions) $b$ and $\sigma$ and a probability distribution $\pi$ on $\mathbb{R}$ consists of a probability space on which are given

1. a filtration $\{\mathcal{F}_t\}_{t\ge 0}$,
2. a standard $\mathcal{F}_t$-Wiener process $\{W_t\}_{t\ge 0}$,
3. a random variable $X(0)$ with a given distribution $\pi$ and independent of the above Wiener process, and finally
4. an $\mathcal{F}_t$-progressive stochastic process $\{X_t\}_{t\ge 0}$ such that
$$X(t) = X(0) + \int_0^t b(X(s))\, ds + \int_0^t \sigma(X(s))\, dW(s) .$$

In this definition, the Wiener process is part of the solution.

Example 14.3.11: Tanaka’s stochastic differential equation. In the equation
$$X(t) = \int_0^t \mathrm{sgn}(X(s))\, dW(s) \quad (t \ge 0),$$
where $\mathrm{sgn}(x) = +1$ if $x \ge 0$ and $\mathrm{sgn}(x) = -1$ if $x < 0$, the Lipschitz conditions of Theorem 14.3.9 are not satisfied. We shall give rough arguments showing that there exists a solution that cannot be a strong solution. First note that if there exists a solution, it is a Brownian motion. To see this it suffices to show that for all $\lambda > 0$, the process
$$\exp\Big\{\lambda X(t) - \frac{1}{2}\lambda^2 t\Big\} \quad (t \ge 0)$$
is a martingale (Lemma 14.3.7) (Exercise 14.4.17). Therefore we have uniqueness in law of the solution. Note that it is the best we can do concerning uniqueness since $\{-X(t)\}_{t\ge 0}$ is another solution. By the same arguments as above, for any solution $\{X(t)\}_{t\ge 0}$ (which is a Brownian motion), the process $\{W(t)\}_{t\ge 0}$ defined by
$$W(t) := \int_0^t \mathrm{sgn}(X(s))\, dX(s) \quad (t \ge 0) \tag{$\star$}$$
is a Brownian motion. By differentiation, $dW(t) = \mathrm{sgn}(X(t))\, dX(t)$, and therefore, since $\mathrm{sgn}(x)^{-1} = \mathrm{sgn}(x)$, $dX(t) = \mathrm{sgn}(X(t))\, dW(t)$. We have therefore obtained a solution. This solution cannot be a strong solution. Indeed, from ($\star$), we deduce that $W(t)$ is $\mathcal{F}_t^{|X|}$-measurable. If $\{X(t)\}_{t\ge 0}$ were a strong solution, it would be $\mathcal{F}_t^W$-adapted, and hence $\mathcal{F}_t^{|X|}$-adapted. But $\mathcal{F}_t^X$ contains more information than $\mathcal{F}_t^{|X|}$!

14.3.4 The Dirichlet Problem

This subsection gives a simple example of the interaction between the theory of stochastic differential equations and that of partial differential equations.

Definition 14.3.12 Let $u : \mathbb{R}^d \to \mathbb{R}$ be a function of class $C^2$ (twice differentiable with continuous derivatives) on an open set $O \subseteq \mathbb{R}^d$. Its Laplacian, defined in $O$, is the function
$$\Delta u(x) := \sum_{i=1}^d \frac{\partial^2}{\partial x_i^2}u(x) .$$


Definition 14.3.13 The function $u : \mathbb{R}^d \to \mathbb{R}$ of class $C^2$ on a domain (open and connected set) $D \subseteq \mathbb{R}^d$ is said to be harmonic on $D$ if
\[ \Delta u(x) = 0 \quad (x \in D) . \]

Example 14.3.14: Examples of harmonic functions. In dimension 2: $(x_1, x_2) \to \ln(x_1^2 + x_2^2)$, and $(x_1, x_2) \to e^{x_1}$ are harmonic on $D = \mathbb{R}^2$. In dimension 3: $x \to |x|^{2-d}$ is harmonic on $D = \mathbb{R}^3 \setminus \{0\}$.

Let $F : \mathbb{R}^d \to \mathbb{R}$ be of class $C^2$. In the following $\{W(t)\}_{t\ge 0}$ will represent the Brownian motion starting from $a \in \mathbb{R}^d$ (that is, $W(0) = a$). Itô's formula gives, denoting by $\nabla f$ the gradient of a function $f$,
\[ F(W(t)) = F(a) + \int_0^t \nabla F(W(s)) \cdot dW(s) + \frac{1}{2}\int_0^t \Delta F(W(s))\,ds , \qquad (14.24) \]
where
\[ M(t) := F(a) + \int_0^t \nabla F(W(s)) \cdot dW(s) \quad (t \ge 0) \]
is an $\mathcal{F}_t^W$-local martingale since $\int_0^t |\nabla F(W(s))|^2\,ds < \infty$ $(t \ge 0)$.
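As a quick sanity check of Itô's formula for a function of Brownian motion, here is a worked instance with the (assumed, purely illustrative) choice $F(x) = |x|^2$, using the standard normalization in which the Laplacian term carries a factor $\tfrac12$.

```latex
% Illustrative choice (not from the text): F(x) = |x|^2 on R^d.
\[
\nabla F(x) = 2x, \qquad \Delta F(x) = \sum_{i=1}^{d} \frac{\partial^2}{\partial x_i^2}\,|x|^2 = 2d .
\]
% Ito's formula with W(0) = a then reads
\[
|W(t)|^2 = |a|^2 + 2\int_0^t W(s)\cdot dW(s) + \frac{1}{2}\int_0^t 2d\,ds
         = |a|^2 + 2\int_0^t W(s)\cdot dW(s) + d\,t .
\]
% The stochastic integral is a true martingale here, so taking expectations gives
\[
E\big[\,|W(t)|^2\,\big] = |a|^2 + d\,t ,
\]
% which agrees with the direct computation from W(t) - a ~ N(0, t I_d).
```

Since this $F$ is not harmonic, the drift term $d\,t$ does not vanish; for a harmonic $F$ only the stochastic integral remains.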

Theorem 14.3.15 Let $u : \mathbb{R}^d \to \mathbb{R}$ be harmonic on the domain $D \subseteq \mathbb{R}^d$, and let $G \subset D$ be an open set whose closure $\mathrm{clos}\,G \subset D$. Let $\tau_G := \inf\{t \ge 0 \,;\, W(t) \notin G\}$ be the exit time of the Brownian motion from $G$. Then
\[ u(W(t \wedge \tau_G)) - u(a) \quad (t \ge 0) \]
is a centered $\mathcal{F}_t^W$-martingale.

Proof. Let $F$ be a function in $C^2(\mathbb{R}^d)$ whose restriction to $G$ is $u$, for instance
\[ F(x) := \big([1_{G_{2\delta}} * \alpha] \times u\big)(x)\, 1_D(x) , \]
where $G_{2\delta}$ is the $2\delta$-neighborhood of $G$, $4\delta = d(G, D^c)$ (the distance from $G$ to the complement of $D$) and $\alpha$ is a $C^\infty$ function of integral 1 on $\mathbb{R}^d$ and null outside $B(0, \delta)$. Then, by (14.24),
\[ F(W(t \wedge \tau_G)) = F(a) + \int_0^{t \wedge \tau_G} \nabla F(W(s)) \cdot dW(s) + \frac{1}{2}\int_0^{t \wedge \tau_G} \Delta F(W(s))\,ds , \]
and since $F(x) = u(x)$ is harmonic on $G$,
\[ u(W(t \wedge \tau_G)) = F(a) + \int_0^{t \wedge \tau_G} \nabla F(W(s)) \cdot dW(s) \quad (t \ge 0) , \]
a square-integrable $\mathcal{F}_t^W$-martingale (not just a local martingale, since $\nabla F$ is bounded on $\mathrm{clos}\,G$).

Definition 14.3.16 Let $D \subset \mathbb{R}^d$ be a bounded domain, and let $f : \partial D \to \mathbb{R}$ be a continuous function. The Dirichlet problem $(D, f)$ consists in finding a function $u$ that is harmonic on $D$ and equal to $f$ on $\partial D$.


Let $D_\varepsilon$ be the $\varepsilon$-interior of $D$. From Theorem 14.3.15 with $G = D_\varepsilon$ and $\varepsilon$ small enough,
\[ u(x) = E_x\big[u(W(t \wedge \tau_{D_\varepsilon}))\big] \quad (x \in D) , \]
where the notation $E_x$ denotes expectation given that the initial position of the Brownian motion is $x$. Since $D$ is bounded, $P(\tau_{D_\varepsilon} < \infty) = 1$, and therefore, letting $t \to \infty$, $u(x) = E_x[u(W(\tau_{D_\varepsilon}))]$ by dominated convergence. Let now $\varepsilon \to 0$. Since $\tau_D < \infty$ and $W(\tau_{D_\varepsilon}) \to W(\tau_D)$, $u(x) = E_x[u(W(\tau_D))]$ by dominated convergence. By the boundary condition,
\[ u(x) = E_x\big[f(W(\tau_D))\big] . \qquad (14.25) \]

Therefore the solution to the Dirichlet problem is unique, if it exists. We shall not prove existence, which can be obtained by analytical as well as probabilistic arguments. We shall be content with the fact that the solution has a probabilistic interpretation, given by (14.25).
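The probabilistic representation (14.25) suggests a Monte Carlo estimate of $u(x)$. The following is a minimal sketch, not taken from the text, under these assumptions: $D$ is the open unit disk in $\mathbb{R}^2$; the exit point is sampled by the "walk on spheres" technique (Brownian motion leaves a disk centred at its current position at a uniformly distributed boundary point); the walk is stopped within a tolerance `eps` of $\partial D$ and projected onto it. All function names are hypothetical.

```python
import math
import random

def walk_on_spheres(x, y, f, rng, eps=1e-4):
    """One sample of f(W(tau_D)) for D the open unit disk, W started at (x, y)."""
    while True:
        r = 1.0 - math.hypot(x, y)   # distance from the current point to the boundary
        if r < eps:
            break
        # Brownian motion exits the disk of radius r centred at (x, y)
        # at a uniformly distributed point of its boundary circle.
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x += r * math.cos(theta)
        y += r * math.sin(theta)
    norm = math.hypot(x, y)
    return f(x / norm, y / norm)     # project onto the unit circle

def dirichlet_estimate(x, y, f, n_paths=5000, seed=0):
    """Monte Carlo estimate of u(x, y) = E_(x,y)[ f(W(tau_D)) ]."""
    rng = random.Random(seed)
    return sum(walk_on_spheres(x, y, f, rng) for _ in range(n_paths)) / n_paths

def first_coordinate(x, y):
    """Boundary data f(x, y) = x, whose harmonic extension is u(x, y) = x."""
    return x

if __name__ == "__main__":
    # The exact solution at (0.5, 0) is 0.5; the estimate fluctuates around it.
    print(dirichlet_estimate(0.5, 0.0, first_coordinate))
```

The estimate carries a statistical error of order $1/\sqrt{n}$ and a small bias from the stopping tolerance; it illustrates (14.25) and says nothing about existence of the solution.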

Complementary reading [Kuo, 2006] and [Baldi, 2018]. The latter has a chapter on mathematical finance and a large collection of corrected exercises. [Oksendal, 1995] has many examples in diverse areas.

14.4

Exercises

Exercise 14.4.1. An $\mathcal{F}_t$-martingale
Let $\{W(t)\}_{t\ge 0}$ be an $\mathcal{F}_t$-Wiener process and let $\varphi, \psi \in \mathcal{A}(\mathbb{R}_+)$. Let $M(t) := \int_0^t \varphi(s)\,dW(s)$ and $N(t) := \int_0^t \psi(s)\,dW(s)$. Show that $M(t)N(t) - \int_0^t \varphi(s)\psi(s)\,ds$ $(t \ge 0)$ is an $\mathcal{F}_t$-martingale.

Exercise 14.4.2. Proof of the reflection principle
Prove Theorem 11.2.2 using Itô calculus. (Hint: see the proof of the strong Markov property of the Brownian motion given in Subsection 14.3.2.)

Exercise 14.4.3. Itô integrals as Lebesgue integrals, I
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion, and $0 \le a < b$. Prove that
\[ \int_a^b W(s)^3\,dW(s) = \frac{1}{4}\left(W(b)^4 - W(a)^4\right) - \frac{3}{2}\int_a^b W(s)^2\,ds . \]

Exercise 14.4.4. Itô integrals as Lebesgue integrals, II
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion, and $0 \le a < b$. Prove that
\[ \int_a^b e^{W(s)}\,dW(s) = e^{W(b)} - e^{W(a)} - \frac{1}{2}\int_a^b e^{W(s)}\,ds . \]


Exercise 14.4.5. Brownian motion on the circle
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. Let
\[ V(t) := \begin{pmatrix} \cos W(t) \\ \sin W(t) \end{pmatrix} . \]
Show that
\[ dV(t) = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} V(t)\,dW(t) - \frac{1}{2} V(t)\,dt \quad \text{with } V(0) = \begin{pmatrix} 1 \\ 0 \end{pmatrix} . \]

Exercise 14.4.6. Proof of Theorem 14.1.8
Give the details of the proof of Theorem 14.1.8.

Exercise 14.4.7. The area under a Brownian motion
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. Show that
\[ \int_0^t W(s)\,ds - \frac{1}{3} W(t)^3 \]
is an $\mathcal{F}_t^W$-martingale. Use this to compute the expected value of the area of the set
\[ \{(t, x) \,;\, t \in [0, T_{a,b}],\ x \in [0, |W(t)|]\} , \]
where $T_{a,b} := \inf\{t \ge 0 \,;\, W(t) \in \{-b, a\}\}$ and $a > 0$, $b > 0$.

Exercise 14.4.8. Continuous integrands for the Itô integral
Prove the statement of Remark 14.1.3.

Exercise 14.4.9. A martingale
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. (i) Show that the stochastic process
\[ Y(t) := tW(t) - \int_0^t W(s)\,ds \quad (t \ge 0) \]
is a martingale. (ii) Show that for all $u \in [a, b] \subset \mathbb{R}$, $Y(b) - Y(a)$ is orthogonal (in $L^2_{\mathbb{R}}(P)$) to $H(W(s); s \in [0, a])$.

Exercise 14.4.10. The n-th power of a Brownian motion
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. Show that
\[ W(t)^n - \frac{n(n-1)}{2}\int_0^t W(s)^{n-2}\,ds \quad (t \ge 0) \]
is a martingale.


Exercise 14.4.11. The product of independent Brownian motions
Let $\{W_1(t)\}_{t\ge 0}$ and $\{W_2(t)\}_{t\ge 0}$ be independent standard Brownian motions. Is the claim
\[ W_1(t)W_2(t) = \int_0^t W_1(s)\,dW_2(s) + \int_0^t W_2(s)\,dW_1(s) \]
resulting from a naive application of the formula of integration by parts true?

Exercise 14.4.12. A differential equation for the Brownian bridge
Using the results of Exercise 11.5.13, show that the Brownian bridge thereof satisfies the following equation:
\[ Z(t) = -\int_0^t \frac{Z(s)}{1 - s}\,ds + W(t) . \]

Exercise 14.4.13. Brownian motion on the circle
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. Show that the vector process $V(t) = (\cos W(t), \sin W(t))^T$ $(t \ge 0)$ satisfies a stochastic differential equation of the form
\[ dV(t) = AV(t)\,dW(t) - \frac{1}{2} V(t)\,dt \]
with initial condition $V(0) = (1, 0)^T$, where $A$ is a matrix to be identified.

Exercise 14.4.14. A motion on the cone
Let $\{W_1(t)\}_{t\ge 0}$ and $\{W_2(t)\}_{t\ge 0}$ be two independent standard Brownian motions. Show that the vector process $V(t) = (e^{W_1(t)}\cos(W_2(t)),\, e^{W_1(t)}\sin(W_2(t)),\, e^{W_1(t)})^T$ $(t \ge 0)$ satisfies a stochastic differential equation of the form
\[ dV(t) = AV(t)\,dW_1(t) + BV(t)\,dW_2(t) + CV(t)\,dt \]
with initial condition $V(0) = (1, 0, 1)^T$, where $A$, $B$ and $C$ are matrices to be identified.

Exercise 14.4.15. A stochastic differential equation
Prove that the stochastic process X(t) := sinh W (t) + sinh−1 W (t) + t (t ≥ 0) is a solution of the differential equation
\[ dX(t) = \sqrt{1 + X(t)^2}\,dW(t) + \left(\sqrt{1 + X(t)^2} + \frac{1}{2} X(t)\right) dt . \]

Exercise 14.4.16. The Vasicek model
Consider the stochastic differential equation
\[ dX(t) = (-bX(t) + c)\,dt + \sigma\,dW(t) . \]
Prove that the unique solution with initial state $X(0)$ is given by
\[ X(t) = \frac{c}{b} + \left(X(0) - \frac{c}{b}\right) e^{-bt} + \sigma \int_0^t e^{-b(t-s)}\,dW(s) . \]


Exercise 14.4.17. Tanaka's differential equation
Prove that any solution of the stochastic differential equation
\[ X(t) = \int_0^t \mathrm{sgn}(X(s))\,dW(s) \quad (t \ge 0) \]
is such that for all λ > 0, the process
\[ \exp\left\{\lambda X(t) - \frac{1}{2}\lambda^2 t\right\} \quad (t \ge 0) \]
is a martingale.

Exercise 14.4.18. Itô integrals as Riemann integrals
Let $f : \mathbb{R} \to \mathbb{R}$ be a continuous function with continuous derivative $f'$. Let $F(x) := \int_0^x f(t)\,dt$. Show that
\[ \int_0^t f(W(s))\,dW(s) = F(W(t)) - F(0) - \frac{1}{2}\int_0^t f'(W(s))\,ds . \]
Use this result to express the following Itô integrals in terms of Riemann integrals:
\[ \int_0^t W(s)e^{W(s)}\,dW(s), \quad \int_0^t \frac{1}{1 + W(s)^2}\,dW(s), \quad \int_0^t e^{W(s) - \frac{1}{2}s}\,dW(s), \quad \int_0^t \frac{W(s)}{1 + W(s)^2}\,dW(s). \]

Chapter 15

Point Processes with a Stochastic Intensity

Let $\{\mathcal{F}_t\}_{t\in\mathbb{R}}$ be some history of a simple locally finite point process $N$ on $\mathbb{R}$ (that is, a non-decreasing family of σ-fields such that for all $t \in \mathbb{R}$ and all $a \le b \le t$, $N((a, b])$ is $\mathcal{F}_t$-measurable). If it holds that for all $t \in \mathbb{R}$,
\[ \lim_{h \downarrow 0} \frac{1}{h}\, E\big[N((t, t + h]) \mid \mathcal{F}_t\big] = \lambda(t) \quad P\text{-a.s.} , \qquad (\star) \]

for some non-negative locally integrable Ft -adapted stochastic process {λ(t)}t∈R , the latter is called a stochastic Ft -intensity of N . This local definition of intensity is advantageously replaced by a global definition not involving a limiting derivative-type procedure and is more amenable to rigorous analysis. It opens a connection with the rich theory of martingales and offers among other things a unified view of stochastic systems driven by point processes. This point of view will reveal a striking analogy with the contents of Chapter 14, the first instance of which is found in Paul L´evy’s martingale characterization of Brownian motion (Theorem 14.2.5) and in Watanabe’s martingale characterization of the standard Poisson process. The proof of Theorem 13.4.3 is a first example of the point process “stochastic calculus”.

15.1

Stochastic Intensity

15.1.1

The Martingale Definition

For a Poisson process $N$ on the real line with locally integrable intensity function $\lambda(t)$, it holds that for all intervals $[c, d] \subset \mathbb{R}$,
\[ E\big[N((c, d]) \mid \mathcal{F}_c^N\big] = \int_c^d \lambda(s)\,ds \]
or, equivalently since the right-hand side is a deterministic quantity,
\[ E\big[N((c, d]) \mid \mathcal{F}_c^N\big] = E\left[\int_c^d \lambda(s)\,ds \,\Big|\, \mathcal{F}_c^N\right] . \]
This motivates the following definition of stochastic intensity.

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/978-3-030-40183-2_15

Definition 15.1.1 Let $N$ be a simple locally finite point process on $\mathbb{R}$, let $\{\mathcal{F}_t\}_{t\in\mathbb{R}}$ be a history of $N$ and let $\{\lambda(t)\}_{t\in\mathbb{R}}$ be a non-negative a.s. locally integrable real-valued $\mathcal{F}_t$-progressively measurable stochastic process. If for all $a \in \mathbb{R}$ and all intervals $(c, d] \subset (a, \infty)$,
\[ E\big[N((c \wedge T_n^{(a)}, d \wedge T_n^{(a)}]) \mid \mathcal{F}_c\big] = E\left[\int_{c \wedge T_n^{(a)}}^{d \wedge T_n^{(a)}} \lambda(s)\,ds \,\Big|\, \mathcal{F}_c\right] , \qquad (15.1) \]
where $\{T_n^{(a)}\}_{n\ge 1}$ is a non-decreasing sequence of $\mathcal{F}_t$-stopping times such that $T_n^{(a)} > a$, $\lim_{n\uparrow\infty} T_n^{(a)} = \infty$ and $E\big[N((a, T_n^{(a)}])\big] < \infty$, $N$ is then said to admit the stochastic $(P, \mathcal{F}_t)$-intensity $\{\lambda(t)\}_{t\in\mathbb{R}}$.

The connection between stochastic intensity (Definition 15.1.1) and martingales is the following:
\[ M(t) := N((0, t]) - \int_0^t \lambda(s)\,ds \]

is a local $(P, \mathcal{F}_t)$-martingale (Exercise 15.4.1), called the fundamental (local) $\mathcal{F}_t$-martingale of the point process $N$. When the choice of probability $P$ is clear from the context, one says "the $\mathcal{F}_t$-intensity" instead of "the $(P, \mathcal{F}_t)$-intensity".

Remark 15.1.2 The requirement of $\mathcal{F}_t$-progressiveness of the stochastic intensity guarantees that the integrated intensity process $\{\int_0^t \lambda(s)\,ds\}_{t\in\mathbb{R}}$ is measurable and $\mathcal{F}_t$-adapted (Exercise 5.4.2).

Remark 15.1.3 When considering point processes on the positive half-line, the intervention of the $a$'s is superfluous, and it suffices to require that (15.1) holds for $a = 0$. However, the slightly more complicated definition given above is needed to handle point processes on the whole real line, especially stationary point processes.

Remark 15.1.4 The reason why requirement (15.1) cannot be replaced by the simpler one,
\[ E\big[N((c, d]) \mid \mathcal{F}_c\big] = E\left[\int_c^d \lambda(s)\,ds \,\Big|\, \mathcal{F}_c\right] , \qquad (\dagger) \]
not involving the stopping times $T_n^{(a)}$, is that it may occur that both sides of $(\dagger)$ are infinite, in which case the information contained in $(\dagger)$ is nil. This happens for instance when $N$ is a homogeneous Cox process whose random intensity $\Lambda$ has an infinite expectation (see Example 15.1.5 below).

Example 15.1.5: Poisson and Cox Processes. Let $N$ be a Cox process on $\mathbb{R}_+$ with conditional intensity measure $\nu$ with respect to $\mathcal{G} \supseteq \mathcal{F}^\nu$ (see Definition 8.2.5) and suppose that $\nu(C) := \int_C \lambda(s)\,ds$ $(C \in \mathcal{B}(\mathbb{R}_+))$, where $\{\lambda(t)\}_{t\ge 0}$ is a locally integrable non-negative process. Then $N$ admits this process as an $\mathcal{F}_t$-intensity, where $\mathcal{F}_t := \mathcal{F}_t^N \vee \mathcal{G}$ $(t \ge 0)$ (Exercise 15.4.4).

Theorem 15.1.11 below gives a formula that can be considered both a refinement of Campbell's formula and an extension of the smoothing formula for hpps (Theorem 7.1.7). The notion of predictable process will be needed.
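The identity $E[N((c, d])] = \int_c^d \lambda(s)\,ds$ for a Poisson process with deterministic intensity can be checked numerically. The sketch below is an illustration only and makes its own assumptions: the intensity $\lambda(t) = 2 + \sin t$, the dominating rate 3 and the thinning construction (keep each point of a rate-3 homogeneous Poisson process with probability $\lambda(t)/3$) are choices made here, not taken from the text, and all names are hypothetical.

```python
import math
import random

def sample_points_by_thinning(intensity, bound, horizon, rng):
    """Points on (0, horizon] of a Poisson process with intensity(t) <= bound."""
    points, t = [], 0.0
    while True:
        t += rng.expovariate(bound)               # next candidate of a rate-`bound` hpp
        if t > horizon:
            return points
        if rng.random() * bound <= intensity(t):  # keep with probability intensity(t)/bound
            points.append(t)

def mean_count(intensity, bound, c, d, n_paths=4000, seed=0):
    """Empirical mean of N((c, d]) over independent simulated paths."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_paths):
        pts = sample_points_by_thinning(intensity, bound, d, rng)
        total += sum(1 for t in pts if c < t <= d)
    return total / n_paths

def lam(t):
    """Assumed intensity function, bounded by 3."""
    return 2.0 + math.sin(t)

if __name__ == "__main__":
    c, d = 1.0, 4.0
    exact = 2.0 * (d - c) - math.cos(d) + math.cos(c)   # integral of lam over (c, d]
    print(mean_count(lam, 3.0, c, d), exact)
```

The two printed numbers should be close, up to Monte Carlo error; this only tests the unconditional version of the identity, not the conditional statement with respect to a history.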


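Definition 15.1.6 Let $T = \mathbb{R}$ or $\mathbb{R}_+$. Let $\{\mathcal{F}_t\}_{t\in T}$ be a history. The predictable σ-field $\mathcal{P}(\mathcal{F}_\cdot)$ on $T \times \Omega$ is the σ-field generated by the collection of sets
\[ (a, b] \times A \quad ([a, b] \subset T,\ A \in \mathcal{F}_a) , \qquad (15.2) \]
to which one must add, in the case $T = \mathbb{R}_+$, the sets $\{0\} \times A$ $(A \in \mathcal{F}_0)$. A stochastic process $\{X(t)\}_{t\in T}$ taking its values in a measurable space $(E, \mathcal{E})$ is called an $\mathcal{F}_t$-predictable process if the mapping $(t, \omega) \to X(t, \omega)$ is $\mathcal{P}(\mathcal{F}_\cdot)$-measurable. For short, one then says: $\{X(t)\}_{t\in T}$ is in $\mathcal{P}(\mathcal{F}_\cdot)$.

Definition 15.1.7 Let $T = \mathbb{R}$ or $\mathbb{R}_+$ and let $\{\mathcal{F}_t\}_{t\in T}$ be a history. Let $(K, \mathcal{K})$ be some measurable space. Let $H : (T \times \Omega \times K, \mathcal{P}(\mathcal{F}_\cdot) \otimes \mathcal{K}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. One then says that $\{H(t, z)\}_{t\in T, z\in K}$ is an $\mathcal{F}_t$-predictable stochastic process indexed by $K$.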

Remark 15.1.8 An Ft -predictable process is Ft -progressive (Exercise 15.4.6).

Example 15.1.9: Left-continuity and Predictability. A complex-valued stochastic process $\{X(t)\}_{t\in\mathbb{R}}$ adapted to $\{\mathcal{F}_t\}_{t\in\mathbb{R}}$ and with left-continuous trajectories is $\mathcal{F}_t$-predictable. In fact, by left-continuity, $X(t, \omega) = \lim_{n\uparrow\infty} X_n(t, \omega)$, where
\[ X_n(t, \omega) := \sum_{k=-n2^n}^{+n2^n} X(k2^{-n}, \omega)\, 1_{(k2^{-n}, (k+1)2^{-n}]}(t) , \]
and since $X(k2^{-n})$ is $\mathcal{F}_{k2^{-n}}$-measurable, $(t, \omega) \to X_n(t, \omega)$ is $\mathcal{P}(\mathcal{F}_\cdot)$-measurable.
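The dyadic approximation used in this example can be made concrete for a single trajectory. The sketch below is illustrative only: it treats one fixed path $t \to X(t)$ as an ordinary function, and the function names and the sample path (the indicator of $(0.3, \infty)$, which is left-continuous) are assumptions made here.

```python
import math

def dyadic_approximation(x, n):
    """Return X_n: t -> sum_k x(k 2^-n) 1_{(k 2^-n, (k+1) 2^-n]}(t), |k| <= n 2^n."""
    def x_n(t):
        k = math.ceil(t * 2 ** n) - 1   # the unique k with k 2^-n < t <= (k+1) 2^-n
        if abs(k) > n * 2 ** n:         # t lies outside the range of the finite sum
            return 0.0
        return x(k / 2 ** n)            # value at the left endpoint of the dyadic cell
    return x_n

def step(t):
    """A left-continuous path: the indicator of (0.3, +infinity)."""
    return 1.0 if t > 0.3 else 0.0

if __name__ == "__main__":
    for n in (3, 6, 10):
        x_n = dyadic_approximation(step, n)
        print(n, x_n(0.3), x_n(0.31))
```

Because $X_n(t)$ only uses the value of the path at a dyadic point strictly to the left of $t$, convergence at every $t$ relies on left-continuity; at the jump point $t = 0.3$ the approximation stays at the left limit 0, which equals the value of the path there.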

Example 15.1.10: Another Typical Ft -predictable Process. Let S and τ be two Ft -stopping times such that S ≤ τ , and let ϕ : R+ × R → R be a measurable function. Then X(t, ω) = ϕ(S(ω), t)1{S(ω)0} ds .

0

n≥1

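Let $R_1$ be the first strictly positive time at which the system is empty ($\infty$ if the system never empties).

[Figure: the workload process $W(t)$ as a function of $t$, showing the service times $\sigma_0, \sigma_1, \sigma_2, \sigma_3, \sigma_4$, the arrival times $T_1, T_2, T_3, T_4$ and the time $R_1$.]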

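Clearly $R_1$ is an $\mathcal{F}_t^N$-stopping time, where $N$ is the point process on $\mathbb{R}_+ \times \mathbb{R}_+$ with point sequence $\{(T_n, \sigma_n)\}_{n\in\mathbb{N}}$. For all $M > 0$,
\[ R_1 \wedge M \le \sigma_0 + \sum_{k\ge 1} \sigma_k 1_{(0, R_1 \wedge M]}(T_k) = \sigma_0 + \int_0^\infty \int_{\mathbb{R}_+} \sigma\, 1_{(0, R_1 \wedge M]}(t)\, N(dt \times d\sigma) . \qquad (15.7) \]
The mapping $(t, \omega, \sigma) \to H(t, \omega, \sigma) = \sigma 1_{(0, R_1(\omega) \wedge M]}(t)$ is in $\mathcal{P}(\mathcal{F}_\cdot^N) \otimes \mathcal{B}(\mathbb{R}_+)$ (noting that it is left-continuous in the $t$-argument). Since the $\mathcal{F}_t^N$-intensity kernel of $N$ is $\lambda G(d\sigma)$, we obtain from (15.7) and Theorem 15.1.31
\[ E[R_1 \wedge M] \le E[\sigma_0] + E\left[\int_0^\infty \int_{\mathbb{R}_+} \sigma\, 1_{(0, R_1 \wedge M]}(t)\, \lambda G(d\sigma)\,dt\right] = E[\sigma_0] + \lambda E[\sigma_0] E[R_1 \wedge M] . \qquad (15.8) \]
In particular, $E[R_1 \wedge M](1 - \lambda E[\sigma_0]) \le E[\sigma_0]$, and therefore, if $\lambda E[\sigma_0] < 1$,
\[ E[R_1 \wedge M] \le \frac{E[\sigma_0]}{1 - \lambda E[\sigma_0]} \]
for all $M > 0$, and therefore $E[R_1] \le \frac{E[\sigma_0]}{1 - \lambda E[\sigma_0]} < \infty$. Reproducing the calculation with $R_1$ replacing $R_1 \wedge M$, we have, since $R_1$ is almost surely finite, the equality
\[ E[R_1] = \frac{E[\sigma_0]}{1 - \lambda E[\sigma_0]} . \]
In summary: $\lambda E[\sigma_0] < 1$ is a sufficient condition for $E[R_1]$ to be finite in an M/GI/1/∞ queue. It is also a necessary condition if $E[\sigma_0] > 0$, because when $R_1$ is finite,
\[ R_1 = \sigma_0 + \sum_{k\ge 1} \sigma_k 1_{(0, R_1]}(T_k) \]
and therefore $E[R_1] = E[\sigma_0] + \lambda E[\sigma_0] E[R_1]$, that is $E[R_1](1 - \lambda E[\sigma_0]) = E[\sigma_0]$, which implies that $1 - \lambda E[\sigma_0] > 0$.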
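The formula $E[R_1] = E[\sigma_0]/(1 - \lambda E[\sigma_0])$ can be checked by simulating the workload until it first reaches zero. The following is a rough sketch under assumptions made here for illustration: arrivals form a Poisson process of rate $\lambda = 1$, all service times (including $\sigma_0$) are i.i.d. exponential with mean $0.5$, so the predicted value is $0.5/(1 - 0.5) = 1$. The function names are hypothetical.

```python
import random

def busy_period(arrival_rate, draw_service, rng):
    """Simulate R1: the first time the workload, started at sigma_0, reaches 0."""
    workload = draw_service(rng)              # sigma_0, the work present at time 0
    elapsed = 0.0
    while True:
        gap = rng.expovariate(arrival_rate)   # time until the next Poisson arrival
        if gap >= workload:                   # the system empties before that arrival
            return elapsed + workload
        elapsed += gap
        workload += draw_service(rng) - gap   # serve during `gap`, then add the new work

def mean_busy_period(arrival_rate, draw_service, n_paths=20000, seed=0):
    """Empirical mean of R1 over independent replications."""
    rng = random.Random(seed)
    total = sum(busy_period(arrival_rate, draw_service, rng) for _ in range(n_paths))
    return total / n_paths

def exponential_service(rng):
    """Assumed service distribution: exponential with mean 0.5."""
    return rng.expovariate(2.0)

if __name__ == "__main__":
    # Predicted value: E[sigma_0] / (1 - lambda E[sigma_0]) = 0.5 / 0.5 = 1.
    print(mean_busy_period(1.0, exponential_service))
```

The simulation terminates almost surely only when $\lambda E[\sigma_0] < 1$; for loads at or above 1 the loop may run for an extremely long time, in line with the necessity part of the statement.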


Let $(N, Z)$ be a simple locally finite marked point process on $\mathbb{R}_+$ with marks in the measurable space $(K, \mathcal{K})$ and let $\{\mathcal{F}_t\}_{t\ge 0}$ be a history of $(N, Z)$. Let $(N, Z)$ admit the $\mathcal{F}_t$-intensity kernel $\lambda(t, dz)$. In the sequel, the following notation will be used
\[ \int_{(0,t]\times K} H(s, z)\, M_Z(ds \times dz) := \int_{(0,t]\times K} H(s, z)\, N_Z(ds \times dz) - \int_{(0,t]\times K} H(s, z)\, \lambda(s, dz)\,ds , \]
provided the right-hand side is well defined. Therefore, formally,
\[ M_Z(ds \times dz) := N_Z(ds \times dz) - \lambda(s, dz)\,ds . \]

Stochastic Integrals and Martingales

Theorem 15.1.33 Let $\{H(t, z)\}_{t\ge 0}$ be an $\mathcal{F}_t$-predictable real-valued stochastic process indexed by $K$ such that for all $t \ge 0$,
\[ E\left[\int_{(0,t]\times K} |H(s, z)|\, \lambda(s, dz)\,ds\right] < \infty . \qquad (15.9) \]
Then the stochastic process
\[ M(t) := \int_{(0,t]\times K} H(s, z)\, M_Z(ds \times dz) \qquad (15.10) \]
is a well-defined centered $\mathcal{F}_t$-martingale.

Proof. Condition (15.9) is equivalent to $E\big[\int_{(0,t]\times K} |H(s, z)|\, N_Z(ds \times dz)\big] < \infty$ for all $t \ge 0$ (Theorem 15.1.31). Therefore, for all $t \ge 0$, $\int_{(0,t]\times K} |H(s, z)|\, \lambda(s, dz)\,ds < \infty$ and $\int_{(0,t]\times K} |H(s, z)|\, N_Z(ds \times dz) < \infty$, $P$-a.s. In particular, $\{M(t)\}_{t\ge 0}$ is $P$-a.s. a well-defined and finite stochastic process. For all $a, b \in \mathbb{R}_+$ $(0 \le a \le b)$ and for all $A \in \mathcal{F}_a$,
\[ E\big[1_A (M(b) - M(a))\big] = E\left[\int_{\mathbb{R}_+\times K} H'(s, z)\, M_Z(ds \times dz)\right] , \]
where $H'(t, z) := H(t, z) 1_A 1_{(a,b]}(t)$ defines an $\mathcal{F}_t$-predictable real-valued stochastic process indexed by $K$. By Theorem 15.1.31,
\[ E\left[\int_{\mathbb{R}_+\times K} H'(s, z)\, N_Z(ds \times dz)\right] = E\left[\int_{\mathbb{R}_+\times K} H'(s, z)\, \lambda(s, dz)\,ds\right] \]
and therefore, for all $A \in \mathcal{F}_a$, $E[1_A (M(b) - M(a))] = 0$.



Corollary 15.1.34 Replacing assumption (15.9) of Theorem 15.1.33 by the condition that $P$-almost surely
\[ \int_{(0,t]\times K} |H(s, z)|\, \lambda(s, dz)\,ds < \infty \quad \text{for all } t \ge 0 , \qquad (15.11) \]
the stochastic process $\{M(t)\}_{t\ge 0}$ defined by (15.10) is then a local $\mathcal{F}_t$-martingale.

Proof. The random time
\[ S_n := \inf\Big\{t > 0 \,;\, \int_{(0,t]\times K} |H(s, z)|\, \lambda(s, dz)\,ds \ge n\Big\} , \]
with the usual convention $\inf \emptyset = +\infty$, is for each $n \ge 1$ an $\mathcal{F}_t$-stopping time, and $\lim_{n\uparrow\infty} S_n = \infty$. Moreover, $H(t, z) 1_{\{t \le S_n\}}$ satisfies condition (15.9). Therefore by Theorem 15.1.33, $\{M(t \wedge S_n)\}_{t\ge 0}$ is a well-defined $\mathcal{F}_t$-martingale.

Theorem 15.1.35 Let $H$ be an $\mathcal{F}_t$-predictable real-valued stochastic process indexed by $K$ such that for all $t \ge 0$,
\[ E\left[\int_{(0,t]\times K} |H(s, z)|^2\, \lambda(s, dz)\,ds\right] < \infty . \qquad (15.12) \]
Then the stochastic process
\[ M(t) := \int_{(0,t]\times K} H(s, z)\, M_Z(ds \times dz) \]
is well defined and a square-integrable $\mathcal{F}_t$-martingale. Moreover,
\[ E\big[M(t)^2\big] = E\left[\int_{(0,t]\times K} |H(s, z)|^2\, \lambda(s, dz)\,ds\right] . \qquad (15.13) \]

Proof. Let $T_n$ be the $n$-th event time of the base point process. The proof that $M(t)$ is well defined follows from Theorem 15.1.33. In fact, observing that $E\big[\int_{(0,T_n]} \lambda(s)\,ds\big] = E[N((0, T_n])] = n$,
\[ E\left[\int_{(0,t\wedge T_n]\times K} |H(s, z)|\, \lambda(s, dz)\,ds\right] \le E\left[\int_{(0,t\wedge T_n]\times K} (1 + |H(s, z)|^2)\, \lambda(s, dz)\,ds\right] \]
\[ \le E[N((0, T_n])] + E\left[\int_{(0,t\wedge T_n]\times K} |H(s, z)|^2\, \lambda(s, dz)\,ds\right] = n + E\left[\int_{(0,t\wedge T_n]\times K} |H(s, z)|^2\, \lambda(s, dz)\,ds\right] < \infty . \]
Therefore $\{M(t \wedge T_n)\}_{t\ge 0}$ is well defined, and so is $\{M(t)\}_{t\ge 0}$ since $\lim_{n\uparrow\infty} T_n = \infty$. We now turn to the proof of (15.13). By the product rule of Stieltjes–Lebesgue calculus,
\[ M(t)^2 = 2\int_{(0,t]} M(s-)\, dM(s) + \int_{(0,t]\times K} H(s, z)^2\, N_Z(ds \times dz) . \]
Since
\[ m(t) := 2\int_{(0,t]} M(s-)\, dM(s) = 2\int_{(0,t]\times K} M(s-) H(s, z)\, M_Z(ds \times dz) \]
is a local $\mathcal{F}_t$-martingale with respect to the localizing stopping times
\[ V_n := \inf\Big\{t \ge 0 \,;\, |M(t-)| + \int_{(0,t]\times K} |H(s, z)|\, \lambda(s, dz)\,ds \ge n\Big\} \wedge T_n , \]
we have that
\[ E\big[M(t \wedge V_n)^2\big] = E\left[\int_{(0,t\wedge V_n]\times K} H(s, z)^2\, N_Z(ds \times dz)\right] = E\left[\int_{(0,t\wedge V_n]\times K} H(s, z)^2\, \lambda(s, dz)\,ds\right] , \]
from which (15.13) follows, if we can show that $\lim_{n\uparrow\infty} E[M(t \wedge V_n)^2] = E[M(t)^2]$. This will be the case if (as will soon be proved) $M(t \wedge V_n)$ converges in $L^2_{\mathbb{C}}(P)$ to some limit, since this limit is necessarily $M(t)$, the almost sure limit of $M(t \wedge V_n)$. The $L^2_{\mathbb{C}}(P)$-convergence of $M(t \wedge V_n)$ to be proved follows from the Cauchy criterion since, by a computation similar to the one above, with $m \ge n$,
\[ E\big[(M(t \wedge V_m) - M(t \wedge V_n))^2\big] = E[m(t \wedge V_m) - m(t \wedge V_n)] + E\left[\int_{(t\wedge V_n, t\wedge V_m]\times K} H(s, z)^2\, \lambda(s, dz)\,ds\right] \]
\[ = E\left[\int_{(t\wedge V_n, t\wedge V_m]\times K} H(s, z)^2\, \lambda(s, dz)\,ds\right] , \]
a quantity that vanishes as $m, n \uparrow \infty$.



Corollary 15.1.36 If assumption (15.12) of Theorem 15.1.35 is replaced by
\[ \int_{(0,t]\times K} |H(s, z)|^2\, \lambda(s, dz)\,ds < \infty \quad P\text{-a.s.} \quad (t \ge 0) , \qquad (15.14) \]
the stochastic process
\[ M(t) := \int_{(0,t]\times K} H(s, z)\, M_Z(ds \times dz) \quad (t \ge 0) \]
is a square-integrable local $\mathcal{F}_t$-martingale.

Proof. This follows from Theorem 15.1.35 in the same manner as Corollary 15.1.34 followed from Theorem 15.1.33.

15.1.3

Martingales as Stochastic Integrals

Let (N, Z) be a marked point process on [0, 1] with marks in the measurable space (K, K) such that the associated lifted point process NZ is Poisson with intensity measure ν(dt × dz) := λ(t, z) dt Q(dz) , where Q is a probability measure on the measurable space (K, K) of the marks. We suppose that ν([0, 1] × K) < ∞.

Theorem 15.1.37 Let $\{m(t)\}_{t\in[0,1]}$ be a real-valued centered square-integrable $\mathcal{F}_t^{(N,Z)}$-martingale on $[0, 1]$. Then
\[ m(t) = \int_{[0,t]\times K} C(s, z)\, M_Z(ds \times dz) , \]
where $C$ is an $\mathcal{F}_t$-predictable real-valued stochastic process indexed by $K$ such that for all $t \ge 0$,
\[ E\left[\int_{(0,t]\times K} |C(s, z)|^2\, \lambda(s, dz)\,ds\right] < \infty . \qquad (15.15) \]

Lemma 15.1.38 A. The collection of random variables $\mathcal{H} := \{e^X \,;\, X \in \mathcal{U}\}$ where $\mathcal{U}$ is the collection of random variables
\[ \sum_{j=1}^{k} a_j N_Z(C_j \times L_j) \quad (k \in \mathbb{N},\ C_j \in \mathcal{B}([0, 1]),\ L_j \in \mathcal{K},\ a_1, \ldots, a_k \in \mathbb{R}) \]
is total in the Hilbert space $L^2_{\mathbb{R}}(\mathcal{F}_1^{(N,Z)}, P)$.

B. The collection of random variables that are linear combinations of elements of
\[ \mathcal{K}_0 := \left\{ M_f(1) := \exp\left\{ \int_{[0,1]\times K} f(s, z)\, N_Z(ds \times dz) - \int_{[0,1]\times K} \big(e^{f(s,z)} - 1\big) \lambda(s, z)\, Q(dz)\,ds \right\} \right\} , \]
where $f : [0, 1] \times K \to \mathbb{R}$ is a measurable function such that
\[ \int_{[0,1]\times K} \big(e^{f(s,z)} - 1\big)^2 \lambda(s, z)\,ds\, Q(dz) < \infty , \]
is dense in the Hilbert space $L^2_{\mathbb{R}}(\mathcal{F}_1^{(N,Z)}, P)$.

The proof is an immediate adaptation of Lemma 14.3.2. The proof of Theorem 15.1.37 is in turn an easy adaptation of the proof of Theorem 14.3.1 and is based on the following lemma:

Lemma 15.1.39 Let $f : [0, 1] \times K \to \mathbb{R}$ be a measurable function such that
\[ \int_{[0,1]\times K} \big(e^{f(s,z)} - 1\big)^2 \lambda(s, z)\,ds\, Q(dz) < \infty . \]
(This is the case if $f$ is non-positive or bounded.) Let for $t \in [0, 1]$
\[ M_f(t) := \exp\left\{ \int_{[0,t]\times K} f(s, z)\, N_Z(ds \times dz) - \int_{[0,t]\times K} \big(e^{f(s,z)} - 1\big) \lambda(s, z)\, Q(dz)\,ds \right\} . \]
Under the above conditions,

(a):
\[ M_f(t) = 1 + \int_{[0,t]\times K} M_f(s-) \big(e^{f(s,z)} - 1\big)\, M_Z(ds \times dz) , \qquad (\star) \]
(b): $\{M_f(t)\}_{t\in[0,1]}$ is a square-integrable martingale, and

(c): $E\big[M_f(t)^2\big] - 1 = E\Big[\int_{[0,t]\times K} M_f(s)^2 \big(e^{f(s,z)} - 1\big)^2 \lambda(s, z)\, Q(dz)\,ds\Big]$ $(t \in [0, 1])$.

Proof. (a): Observe that at an event time $t$ of the base point process with corresponding mark $z \in K$,
\[ M_f(t) - M_f(t-) = M_f(t-) \big(e^{f(t,z)} - 1\big) , \]
and that at a time $t$ between two event times
\[ dM_f(t) = -M_f(t)\,dt \int_K \big(e^{f(t,z)} - 1\big) \lambda(t, z)\, Q(dz) . \]
(b): It suffices to show, in view of Theorem 15.1.35, that
\[ E\left[\int_0^1 \int_K M_f(s)^2 \big(e^{f(s,z)} - 1\big)^2 \lambda(s, z)\, Q(dz)\,ds\right] < \infty . \qquad (15.16) \]
This is true when $M_f(t)$ is replaced by $M_f(t \wedge S_n)$, where $S_n := \inf\{t \ge 0 \,;\, M_f(t-) \ge n\}$. In particular,
\[ E\big[(M_f(t \wedge S_n) - 1)^2\big] = E\left[\int_0^t \int_K M_f(s)^2 1_{\{s \le S_n\}} \big(e^{f(s,z)} - 1\big)^2 \lambda(s, z)\, Q(dz)\,ds\right] . \qquad (15.17) \]
Now,
\[ E\big[(M_f(t \wedge S_n) - 1)^2\big] = E\big[M_f(t \wedge S_n)^2\big] - 1 \]
and moreover
\[ E\big[M_f(t \wedge S_n)^2\big] \ge E\big[M_f(t)^2 1_{\{t \le S_n\}}\big] . \]
Therefore
\[ E\big[M_f(t)^2 1_{\{t \le S_n\}}\big] \le 1 + E\left[\int_0^t \int_K M_f(s)^2 1_{\{s \le S_n\}} \big(e^{f(s,z)} - 1\big)^2 \lambda(s, z)\, Q(dz)\,ds\right] \]
\[ = 1 + \int_0^t E\big[M_f(s)^2 1_{\{s \le S_n\}}\big] \left(\int_K \big(e^{f(s,z)} - 1\big)^2 \lambda(s, z)\, Q(dz)\right) ds . \]
By Gronwall's lemma, for all $t \ge 0$,
\[ E\big[M_f(t)^2 1_{\{t \le S_n\}}\big] \le \exp\left\{\int_0^t \int_K \big(e^{f(s,z)} - 1\big)^2 \lambda(s, z)\, Q(dz)\,ds\right\} < \infty . \]
Since $\lim_{n\uparrow\infty} S_n = \infty$, by monotone convergence
\[ E\big[M_f(t)^2\big] \le \exp\left\{\int_0^1 \int_K \big(e^{f(s,z)} - 1\big)^2 \lambda(s, z)\, Q(dz)\,ds\right\} := C < \infty , \]
and therefore
\[ E\left[\int_0^1 \int_K M_f(s)^2 \big(e^{f(s,z)} - 1\big)^2 \lambda(s, z)\, Q(dz)\,ds\right] \le C \int_{[0,1]\times K} \big(e^{f(s,z)} - 1\big)^2 \lambda(s, z)\,ds\, Q(dz) , \]
a finite quantity by hypothesis. Therefore (15.16) is proved.

(c): Start from (15.17) and let $n \uparrow \infty$ to obtain
\[ E\big[M_f(t)^2\big] = 1 + E\left[\int_0^t \int_K M_f(s)^2 \big(e^{f(s,z)} - 1\big)^2 \lambda(s, z)\, Q(dz)\,ds\right] . \]
This is a differential equation in $E[M_f(t)^2]$ whose solution gives (c).
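The statements on $M_f$ can be checked by hand in the simplest case. The following worked example is an illustration under assumptions made here: no marks, a standard Poisson process $N$ (intensity 1), and a constant function $f \equiv a$; the compensator enters the exponent with a minus sign, as the martingale property requires.

```latex
% Assumed special case: standard Poisson process N, f(s) = a constant.
\[
M_f(t) = \exp\{\, a N(t) - (e^{a} - 1)\,t \,\} .
\]
% Martingale mean: since E[e^{aN(t)}] = exp{(e^a - 1) t},
\[
E[M_f(t)] = e^{(e^{a}-1)t}\, e^{-(e^{a}-1)t} = 1 .
\]
% Second moment: E[e^{2aN(t)}] = exp{(e^{2a} - 1) t}, hence
\[
E[M_f(t)^2] = \exp\{(e^{2a}-1)t - 2(e^{a}-1)t\} = \exp\{(e^{a}-1)^2\, t\} .
\]
% This is the solution of y'(t) = (e^a - 1)^2 y(t), y(0) = 1, i.e. of
\[
E[M_f(t)^2] = 1 + \int_0^t E[M_f(s)^2]\,(e^{a}-1)^2\,ds ,
\]
% in agreement with the integral equation obtained for the second moment.
```

In this case the Gronwall bound of the proof is attained with equality.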



Remark 15.1.40 The running analogy with Brownian motion stochastic calculus is perhaps more evident in the case of a standard Poisson process $N$ on $\mathbb{R}_+$. In this case, the results of Theorem 15.1.37 specialize as follows. Any centered square-integrable $\mathcal{F}_t^N$-martingale on $[0, 1]$ is of the form
\[ m(t) = \int_{(0,t]} C(s)\, dM(s) , \]
where $\{C(t)\}_{t\in[0,1]}$ is an $\mathcal{F}_t^N$-predictable process such that $\int_0^1 C(s)^2\,ds < \infty$.

15.1.4

The Regenerative Form of the Stochastic Intensity

Example 15.1.15 has shown that the notion of stochastic intensity is a generalization of that of hazard rate. Such a result can be extended to marked point processes, still in the special case where the history for which the stochastic intensity (kernel) is defined is the internal history "plus a prehistory". A few results on the structure of point process histories will be needed. These are intuitive results whose technical proofs have been omitted.

Let $\{T_n\}_{n\ge 1}$ be a simple point process on $\mathbb{R}_+$, that is, a non-decreasing sequence of positive random variables possibly taking the value $+\infty$, and strictly increasing on $\mathbb{R}_+$ ($T_n < \infty \Rightarrow T_n < T_{n+1}$). In particular, it may be a finite point process (if $T_n = \infty$ for some $n \in \mathbb{N}$) and it need not be locally finite ($T_\infty := \lim_{n\uparrow\infty} T_n$ may be finite). Set $T_0 \equiv 0$. Let $\{Z_n\}_{n\ge 1}$ be a sequence of random variables taking their values in the measurable space $(K, \mathcal{K})$. The sequence $\{(T_n, Z_n)\}_{n\ge 1}$ is called a marked point process on $\mathbb{R}_+$ with marks in $K$. For each $L \in \mathcal{K}$, define the (simple) point process $N^L$ on $\mathbb{R}_+$ by
\[ N^L(C) := \sum_{n\ge 1} 1_C(T_n)\, 1_L(Z_n) \quad (C \in \mathcal{B}(\mathbb{R}_+)) . \]
Note that $N^L(\{0\}) = 0$. Define the (simple) point process $N_Z$ on $\mathbb{R}_+ \times K$ by
\[ N_Z(C \times L) := N^L(C) \quad (C \in \mathcal{B}(\mathbb{R}_+),\ L \in \mathcal{K}) . \]
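The counting measures $N^L$ and $N_Z$ are easy to picture on a finite realization. The sketch below is only an illustration: the realization is a hand-written list of pairs $(T_n, Z_n)$, the sets $C$ and $L$ are represented by membership predicates, and the function name is hypothetical.

```python
def count_marked_points(points, in_c, in_l):
    """N_Z(C x L) = N^L(C): the number of n with T_n in C and Z_n in L."""
    return sum(1 for t, z in points if in_c(t) and in_l(z))

if __name__ == "__main__":
    # A made-up realization: event times T_n with marks Z_n in K = {"a", "b"}.
    realization = [(0.4, "a"), (1.1, "b"), (1.7, "a"), (2.5, "a")]
    # C = (0, 2], L = {"a"}
    print(count_marked_points(realization, lambda t: 0 < t <= 2, lambda z: z == "a"))
```

Taking $L = K$ (a predicate that is always true) recovers the count of the base point process on $C$.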


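Define the internal history $\{\mathcal{F}_t^{N,Z}\}_{t\ge 0}$ of $N_Z$ by
\[ \mathcal{F}_t^{N,Z} := \sigma\big(N_Z(C \times L) \,;\, C \in \mathcal{B}((0, t]),\ L \in \mathcal{K}\big) , \]
and the history $\{\mathcal{F}_t\}_{t\ge 0}$ by
\[ \mathcal{F}_t = \sigma(Z_0) \vee \mathcal{F}_t^{N,Z} , \qquad (15.18) \]
where $Z_0$ is a random element taking values in a measurable space $(L_0, \mathcal{K}_0)$ (for instance, a space of functions). It represents a "prehistory" of the marked point process, in the sense that $\mathcal{F}_0 = \sigma(Z_0)$ contains the information already gathered at time 0 that may influence its future behavior. It is intuitively clear that $\mathcal{F}_{T_n} = \sigma(Z_0) \vee \sigma(T_1, Z_1, \ldots, T_n, Z_n)$ and $\mathcal{F}_{T_n-} = \sigma(Z_0) \vee \sigma(T_1, Z_1, \ldots, T_n)$. Suppose that for all $n \ge 0$, all $L \in \mathcal{K}$, and all $C \in \mathcal{B}(\mathbb{R}_+)$,
\[ P(S_{n+1} \in C,\ Z_{n+1} \in L \mid \mathcal{F}_{T_n})(\omega) = \int_C g^{(n+1)}(\omega, x, L)\,dx := G^{(n+1)}(\omega, C, L) , \]
where for each $L \in \mathcal{K}$, the mapping $(\omega, x) \to g^{(n+1)}(\omega, x, L)$ is $\mathcal{F}_{T_n} \otimes \mathcal{B}(\mathbb{R}_+)$-measurable, and for each $(\omega, x)$, $L \to g^{(n+1)}(\omega, x, L)$ is a σ-finite measure on $(K, \mathcal{K})$. In particular,
\[ P(S_{n+1} \in C \mid \mathcal{F}_{T_n})(\omega) = \int_C g^{(n+1)}(\omega, x)\,dx := G^{(n+1)}(\omega, C) , \]
where $g^{(n+1)}(\omega, x) = g^{(n+1)}(\omega, x, K)$ and $G^{(n+1)}(\omega, C) = G^{(n+1)}(\omega, C, K)$.

Theorem 15.1.41 For $L \in \mathcal{K}$ and $t \ge 0$, let λ(t, L) :=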



g (n+1)(t − Tn , L) 1{Tn ≤t 0, and Φ(t, L) = 0 if λ(t) = 0. Since λ(Tn ) > 0 P -a.s. on {Tn < ∞}, Φ(Tn , L)1{Tn 0

(n ≥ 1),

P -a.s.

(15.24)

Proof. (a) follows from the uniqueness property (15.27). For (b), note that $H(t) = 1_{\{\lambda(t) = 0\}}$ is an $\mathcal{F}_t$-predictable process. Inserting this into the smoothing formula, we obtain
\[ E\left[\sum_{n\ge 1} 1_{\{\lambda(T_n) = 0\}}\right] = E\left[\int_{\mathbb{R}_+} 1_{\{\lambda(t) = 0\}}\, \lambda(t)\,dt\right] = 0 , \]
which implies (15.23).

In particular, if the locally finite simple marked point process $(N, Z)$ has the $\mathcal{F}_t$-predictable stochastic intensity kernel $\lambda(t)\Phi(t, dz)$, then, for all $L \in \mathcal{K}$, $\lambda(T_n)\Phi(T_n, L) > 0$ on $\{T_n < \infty\}$.

The above results extend straightforwardly to marked point processes and their stochastic intensity kernels as follows. Let the simple locally finite marked point process $(N, Z)$ have the $\mathcal{F}_t$-intensity kernel
\[ \lambda(t)\Phi(t, dz) = \lambda(t) h(t, z)\, Q(t, dz) \qquad (15.25) \]
for some $\mathcal{F}_t^{N,Z}$-predictable kernel $Q(t, dz)$. Let $\{\widetilde{\mathcal{F}}_t\}_{t\ge 0}$ be a history such that
\[ \mathcal{F}_t \supseteq \widetilde{\mathcal{F}}_t \supseteq \mathcal{F}_t^{N,Z} \quad (t \ge 0) . \qquad (15.26) \]

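It is possible that $\lambda(t)\Phi(t, dz)$ is not $\widetilde{\mathcal{F}}_t$-adapted and therefore cannot be the $\widetilde{\mathcal{F}}_t$-intensity kernel. Nevertheless, there still exists a stochastic $\widetilde{\mathcal{F}}_t$-intensity kernel, and it is obtained by "projection" of the initial stochastic intensity kernel on the smaller history, in a sense to be made precise now.

Recall the terminology: if the mapping $Y : (t, \omega, z) \to Y(t, \omega, z) \in \mathbb{R}$ is $\mathcal{B}(\mathbb{R}_+) \otimes \mathcal{F} \otimes \mathcal{K}$-measurable, one says that $Y(t, z)$ is a measurable process indexed by $K$. It is said to be $\mathcal{F}_t$-adapted (resp. $\mathcal{F}_t$-predictable) if moreover for all $z \in K$, the stochastic process $\{Y(t, z)\}_{t\ge 0}$ is $\mathcal{F}_t$-adapted (resp. $\mathcal{F}_t$-predictable).

Let the histories $\{\mathcal{F}_t\}_{t\ge 0}$ and $\{\widetilde{\mathcal{F}}_t\}_{t\ge 0}$ satisfy condition (15.26). Let $\{Y(t, z)\}_{t\ge 0}$ be a non-negative measurable process indexed by $K$. Let the sigma-finite measures $\mu_1$ and $\mu_2$ on $(\mathbb{R} \times \Omega \times K, \mathcal{P}(\widetilde{\mathcal{F}}_\cdot) \otimes \mathcal{K})$ be defined respectively by:
\[ \mu_1(H) := E\left[\int_{\mathbb{R}} \int_K H(t, z) Y(t, z)\, Q(t, dz)\,dt\right] \quad \text{and} \quad \mu_2(H) := E\left[\int_{\mathbb{R}} \int_K H(t, z)\, Q(t, dz)\,dt\right] \]
for all non-negative mappings $H : \mathbb{R} \times \Omega \times K \to \mathbb{R}$ that are $\mathcal{P}(\widetilde{\mathcal{F}}_\cdot) \otimes \mathcal{K}$-measurable. Note that $\mu_2$ is the product measure $P(d\omega) \times Q(t, \omega, dz)\,dt$ on $(\Omega \times \mathbb{R} \times K, \mathcal{P}(\widetilde{\mathcal{F}}_\cdot) \otimes \mathcal{K})$. Clearly $\mu_1 \ll \mu_2$, and therefore there exists a Radon–Nikodým derivative (rnd) $\widetilde{Y}(t, \omega, z) = \frac{d\mu_1}{d\mu_2}(t, \omega, z)$ that is $\mathcal{P}(\widetilde{\mathcal{F}}_\cdot) \otimes \mathcal{K}$-measurable and therefore defines an $\widetilde{\mathcal{F}}_t$-predictable process indexed by $K$, $\widetilde{Y}(t, z)$, such that
\[ \mu_1(H) = \mu_2(\widetilde{Y} H) = E\left[\int_{\mathbb{R}} \int_K H(t, z) \widetilde{Y}(t, z)\, Q(t, dz)\,dt\right] . \]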

Moreover, this rnd is $\mu_2$-unique, that is to say, if there exists another such rnd, say $\widetilde{Y}'$, then
\[ \widetilde{Y}(t, \omega, z) = \widetilde{Y}'(t, \omega, z) , \quad P(d\omega)\, Q(t, \omega, dz)\,dt \text{ a.e.} \qquad (15.27) \]

Definition 15.2.5 The above stochastic process $\widetilde{Y}(t, z)$ indexed by $K$ is called the predictable projection of $Y(t, z)$ on $\{\widetilde{\mathcal{F}}_t\}_{t\ge 0}$, or the $\widetilde{\mathcal{F}}_t$-predictable projection of $Y(t, z)$.

Theorem 15.2.6 Let the simple locally finite marked point process $(N, Z)$ on $\mathbb{R}_+$ have the stochastic $\mathcal{F}_t$-intensity kernel (15.25) for some $\mathcal{F}_t^{N,Z}$-predictable kernel $Q(t, dz)$. Let $\{\widetilde{\mathcal{F}}_t\}_{t\ge 0}$ be another history satisfying condition (15.26). Then $(N, Z)$ has the stochastic $\widetilde{\mathcal{F}}_t$-intensity kernel $\widetilde{\lambda}(t)\, \widetilde{h}(t, z)\, Q(t, dz)$, where $\{\widetilde{\lambda}(t)\}_{t\ge 0}$ is the $\widetilde{\mathcal{F}}_t$-predictable projection of $\{\lambda(t)\}_{t\ge 0}$ and $\widetilde{h}(t, z)$ is the $\widetilde{\mathcal{F}}_t$-predictable projection of $h(t, z)$.

Proof. Let $H(t, z)$ be a non-negative $\widetilde{\mathcal{F}}_t$-predictable indexed stochastic process. It is a fortiori an $\mathcal{F}_t$-predictable indexed stochastic process, and therefore
\[ E\left[\int_{\mathbb{R}} \int_K H(t, z)\, N(dt \times dz)\right] = E\left[\int_{\mathbb{R}} \int_K H(t, z)\, \lambda(t) h(t, z)\, Q(t, dz)\,dt\right] = E\left[\int_{\mathbb{R}} \int_K H(t, z)\, \widetilde{v}(t, z)\, Q(t, dz)\,dt\right] , \]

K

600CHAPTER 15. POINT PROCESSES WITH A STOCHASTIC INTENSITY Equivalently, "-  -

# v,(t, z) , H(t, z) Q(t, dz) λ(t) dt , λ(t) K  . 2- H(t, z) h(t, z) Q(t, dz) λ(t) dt . =E

E R

R

Now

2- E R

K

 . H(t, z) h(t, z) Q(t, dz) λ(t) dt K 2-  . =E H(t, z) h(t, z) Q(t, dz) N (dt)  2-R -K . , H(t, z) h(t, z) Q(t, dz) λ(t) dt , =E R

and therefore

K

"-  -

# v,(t, z) , H(t, z) Q(t, dz) λ(t) dt , λ(t) K  . 2- , dt . H(t, z) h(t, z) Q(t, dz) λ(t) =E

E R

R

Replacing H(t, z) by

K

H(t, z) , , λ(t)

"- E R

# v,(t, z) H(t, z) Q(t, dz) dt , λ(t) K  . 2- H(t, z) h(t, z) Q(t, dz) dt , =E R

which shows that

15.2.2

K

v,(t, z) , = h(t, z). , λ(t)



Absolutely Continuous Change of Probability

We now consider changes of intensity entailed by an absolutely continuous change of probability measure. This subsection is of special interest in statistics where the concept of likelihood ratio is of central importance, in particular in hypothesis testing. It is a sweeping generalization of the results of Section 8.3.3.

Let $(N, Z)$ be a simple and locally finite point process on $\mathbb{R}_+$ with marks in $K$ and associated lifted process $N_Z$ on $\mathbb{R}_+ \times K$. Let $\{\mathcal{F}_t\}_{t\ge 0}$ be a history of $N_Z$ and suppose that $N_Z$ admits the $(P, \mathcal{F}_t)$-local characteristics $(\lambda(t), \Phi(t, dz))$. Let $\{\mu(t)\}_{t\ge 0}$ be a non-negative $\mathcal{F}_t$-predictable process and let $\{h(t, z)\}_{t\ge 0, z\in K}$ be a non-negative $\mathcal{F}_t$-predictable $K$-indexed stochastic process, such that for all $t \ge 0$
\[ P\left(\int_0^t \lambda(s)\mu(s)\,ds < \infty\right) = 1 \qquad (15.28) \]
and
\[ P\left(\int_K h(t, z)\, \Phi(t, dz) = 1\right) = 1 . \qquad (15.29) \]
Define for each $t \ge 0$
\[ L(t) := L(0) \left(\prod_{T_n \in (0,t]} \mu(T_n) h(T_n, Z_n)\right) \exp\left\{-\int_{(0,t]} \int_K (\mu(s) h(s, z) - 1)\, \lambda(s)\, \Phi(s, dz)\,ds\right\} , \qquad (15.30) \]

where $L(0)$ is a non-negative $\mathcal{F}_0$-measurable random variable such that $E[L(0)] = 1$.

Theorem 15.2.7 Under the above conditions,
(1) $\{L(t)\}_{t\ge 0}$ is a non-negative $(P, \mathcal{F}_t)$-local martingale. If, moreover, $E[L(t)] = 1$ for all $t \ge 0$, it is a non-negative $(P, \mathcal{F}_t)$-martingale.
(2) If $E[L(T)] = 1$ for some $T > 0$, and if we define the probability $Q$ by the Radon–Nikodým derivative
\[ \frac{dQ}{dP} = L(T) , \qquad (15.31) \]
the marked point process $N_Z$ admits the $(Q, \mathcal{F}_t)$-local characteristics $(\mu(t)\lambda(t), h(t, z)\Phi(t, dz))$ on $[0, T]$.

Proof. (1) By the exponential rule of Stieltjes–Lebesgue calculus,
\[ L(t) = L(0) + \int_{(0,t]} \int_K (\mu(s) h(s, z) - 1)\, L(s-)\, M_Z(ds \times dz) , \]
where $M_Z(ds \times dz) := N_Z(ds \times dz) - \lambda(s)\Phi(s, dz)\,ds$. Let for $n \ge 1$,
\[ S_n = \inf\Big\{t \,;\, L(t-) + \int_0^t \mu(s)\lambda(s)\,ds \ge n\Big\} . \qquad (15.32) \]

Then, by Theorem 15.1.31, {L(t ∧ Sn)}t≥0 is a (P, Ft)-martingale, and since under conditions (15.28) and (15.29), P(limn↑∞ Sn = ∞) = 1, {L(t)}t≥0 is a (P, Ft)-local martingale. Being non-negative, it is also a (P, Ft)-supermartingale. But a supermartingale with constant mean is a martingale.

(2) We have to prove that for any non-negative Ft-predictable K-indexed stochastic process {H(t, z)}t≥0,z∈K and all t ∈ [0, T],

\[ E_Q\Bigl[\int_{(0,t]}\int_K H(s,z)\,N_Z(ds\times dz)\Bigr] = E_Q\Bigl[\int_{(0,t]}\int_K H(s,z)\mu(s)\lambda(s)h(s,z)\,\Phi(s,dz)\,ds\Bigr]. \]

This is done through the following sequence of equalities (with appropriate justifications at the end):

\[
\begin{aligned}
E_Q\Bigl[\int_{(0,t]}\int_K H(s,z)\,N_Z(ds\times dz)\Bigr]
&= E\Bigl[L(t)\int_{(0,t]}\int_K H(s,z)\,N_Z(ds\times dz)\Bigr] \\
&= E\Bigl[\int_{(0,t]}\int_K L(s)H(s,z)\,N_Z(ds\times dz)\Bigr] \\
&= E\Bigl[\int_{(0,t]}\int_K L(s-)H(s,z)\mu(s)h(s,z)\,N_Z(ds\times dz)\Bigr] \\
&= E\Bigl[\int_{(0,t]}\int_K L(s-)H(s,z)\mu(s)h(s,z)\lambda(s)\,\Phi(s,dz)\,ds\Bigr] \\
&= E\Bigl[\int_{(0,t]}\int_K L(s)H(s,z)\mu(s)h(s,z)\lambda(s)\,\Phi(s,dz)\,ds\Bigr] \\
&= E\Bigl[L(t)\int_{(0,t]}\int_K H(s,z)\mu(s)h(s,z)\lambda(s)\,\Phi(s,dz)\,ds\Bigr] \\
&= E_Q\Bigl[\int_{(0,t]}\int_K H(s,z)\mu(s)\lambda(s)h(s,z)\,\Phi(s,dz)\,ds\Bigr].
\end{aligned}
\]

The first equality follows from (15.31) and the fact that for a non-negative Ft-measurable random variable V(t), EQ[V(t)] = EP[L(t)V(t)]. The second equality follows from Theorem 13.4.25, the third one from the observation that at a point Tn < ∞ of N, L(Tn) = L(Tn−)μ(Tn)h(Tn, Zn). The fourth equality uses the smoothing theorem, the fifth holds because L(s) = L(s−) for Lebesgue-almost all s, the sixth is by Theorem 13.4.25 and the last one uses (15.31).

Remark 15.2.8 The main condition to verify when using Theorem 15.2.7 is EP[L(T)] = 1. A general method to do this consists in finding some γ > 1 such that for the sequence of stopping times {Sn}n≥1 defined by (15.32), supn≥1 EP[L(T ∧ Sn)^γ] < ∞. This implies that the sequence {L(T ∧ Sn)}n≥1 is uniformly integrable, and therefore

\[ \lim_{n\uparrow\infty} E\bigl[L(T\wedge S_n)\bigr] = E\Bigl[\lim_{n\uparrow\infty} L(T\wedge S_n)\Bigr]. \]

But, by Part (1) of Theorem 15.2.7, E[L(T ∧ Sn)] = 1 and Sn ↑ ∞. Therefore E[L(T)] = 1.
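The condition E[L(T)] = 1 can also be checked numerically in simple cases. The following is a minimal Monte Carlo sketch, under assumptions that are ours and not part of the text: N is a standard Poisson process under P (λ ≡ 1, no marks, h ≡ 1), L(0) = 1, and μ is the deterministic bounded function μ(t) = 1 + 0.5 sin t, so that (15.30) reduces to L(T) = ∏ μ(Tn) exp{−∫ (μ(s) − 1) ds}. The function names are hypothetical.

```python
import math
import random


def simulate_unit_poisson(T, rng):
    """Return the points of a rate-1 Poisson process on (0, T]."""
    points = []
    t = rng.expovariate(1.0)
    while t <= T:
        points.append(t)
        t += rng.expovariate(1.0)
    return points


def mu(t):
    """Illustrative (assumed) intensity multiplier, with values in [0.5, 1.5]."""
    return 1.0 + 0.5 * math.sin(t)


def mu_integral(T):
    """Closed form of the integral of mu over (0, T]."""
    return T + 0.5 * (1.0 - math.cos(T))


def likelihood_ratio(points, T):
    """L(T) of (15.30) with lambda = 1, h = 1 and L(0) = 1, computed in log-space."""
    log_l = sum(math.log(mu(t)) for t in points) - (mu_integral(T) - T)
    return math.exp(log_l)


def mean_likelihood_ratio(T=2.0, n_paths=100_000, seed=0):
    """Monte Carlo estimate of E_P[L(T)], which should be close to 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        total += likelihood_ratio(simulate_unit_poisson(T, rng), T)
    return total / n_paths
```

Calling `mean_likelihood_ratio()` should return a value close to 1, up to Monte Carlo error. This is only a sanity check in a bounded deterministic case; it does not replace the uniform integrability argument above.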

Example 15.2.9: The Likelihood Ratio for a Simple Point Process on a Finite Interval. Let N be under probability P an Ft-Poisson process of intensity 1. Let {λ(t)}t≥0 be a non-negative bounded Ft-predictable process. Define for all t ≥ 0

\[ L(t) = L(0)\Bigl(\prod_{n\ge 1:\,T_n\le t} \lambda(T_n)\Bigr)\exp\Bigl\{-\int_0^t (\lambda(s)-1)\,ds\Bigr\}, \]

where L(0) is a non-negative square-integrable random variable, that is, E[L(0)^2] < ∞. By the exponential formula of Stieltjes–Lebesgue calculus,

\[ L(t) = L(0) + \int_{(0,t]} L(s-)(\lambda(s)-1)\,(N(ds)-ds). \]

By the product rule of Stieltjes–Lebesgue calculus

\[
\begin{aligned}
L(t)^2 &= L(0)^2 + 2\int_{(0,t]} L(s-)\,dL(s) + \sum_{s\le t} L(s-)\Delta L(s) \\
&= L(0)^2 + 2\int_{(0,t]} L(s-)\,dL(s) + \int_{(0,t]} L(s-)^2(\lambda(s)-1)\,N(ds) && (15.33)\\
&= L(0)^2 + 2\int_{(0,t]} L(s-)\,dL(s) + \int_{(0,t]} L(s-)^2(\lambda(s)-1)\,(N(ds)-ds) + \int_{(0,t]} L(s)^2(\lambda(s)-1)\,ds && (15.34)
\end{aligned}
\]

(noting that for Lebesgue-almost all t, L(t) = L(t−)). Define for each n ≥ 1

\[ S_n := \inf\Bigl\{t \,;\ L(t-) + \int_0^t \lambda(s)\,ds \ge n\Bigr\} \wedge T_n, \]

an Ft-stopping time such that limn↑∞ Sn = ∞. In particular,

\[ \int_{(0,t]} L(s-)\,dL(s) = \int_{(0,t]} L(s-)^2(\lambda(s)-1)\,(N(ds)-ds) \]

is an Ft-local martingale (with localizing sequence {Sn}n≥1) of mean 0. Replacing in (15.34) t by t ∧ Sn and taking expectations,

\[ E\bigl[L(t\wedge S_n)^2\bigr] = E\bigl[L(0)^2\bigr] + E\Bigl[\int_{(0,t\wedge S_n]} L(s\wedge S_n)^2(\lambda(s)-1)\,ds\Bigr]. \]

In particular, in view of the boundedness assumption on the λ(t),

\[
\begin{aligned}
E\bigl[L(t\wedge S_n)^2\bigr] &\le E\bigl[L(0)^2\bigr] + E\Bigl[\int_{(0,t]} L(s\wedge S_n)^2(\lambda(s)+1)\,ds\Bigr] \\
&\le C_1 + \int_{(0,t]} E\bigl[L(s\wedge S_n)^2\bigr](C_2+1)\,ds
\end{aligned}
\]

for some finite positive C1 and C2. This implies, by Gronwall's lemma, that

\[ E\bigl[L(t\wedge S_n)^2\bigr] \le C_1\exp\Bigl\{\int_0^t (C_2+1)\,ds\Bigr\}. \]

In particular, for any T < ∞, supn≥1 E[L(T ∧ Sn)^2] < ∞.

Example 15.2.10: Likelihood Ratios for Continuous-time hmcs. Let {X(t)}t≥0 be, under P, a regular continuous-time homogeneous Markov chain with state space E and stable and conservative infinitesimal generator {qij}i,j∈E. Let αij (i ≠ j ∈ E) be non-negative numbers such that for all i ∈ E, $\tilde q_i := \sum_{j\ne i} \alpha_{ij}q_{ij} < \infty$. For all t ≥ 0, let $\mathcal F_t := \mathcal F_t^X$ and let

\[ L(t) := L(0)\Bigl(\prod_{i,j;\,i\ne j} \alpha_{ij}^{N_{i,j}(t)}\Bigr)\exp\Bigl\{-\int_0^t \sum_{i,j;\,i\ne j} (\alpha_{ij}-1)q_{ij}1_{\{X(s)=i\}}\,ds\Bigr\}. \]

Suppose that EP[L(0)] = 1, that EP[L(0)^2] < ∞ and that $\sum_{j\ne i}\alpha_{ij}q_{ij} < \infty$ (i ∈ E). Then EP[L(T)] = 1 and under the probability Q defined by dQ/dP = L(T), the process {X(t)}t≥0 is on the interval (0, T] a regular stable and conservative continuous-time homogeneous Markov chain with state space E and infinitesimal parameters $\tilde q_{ij} := \alpha_{ij}q_{ij}$ (j ≠ i).

Proof. At a discontinuity time t of the chain,

\[ L(t) - L(t-) = \sum_{i,j;\,j\ne i} L(t-)(\alpha_{ij}-1)\,\Delta N_{i,j}(t), \]

whereas for t strictly between two jumps of the chain,

\[ \frac{dL(t)}{dt} = -L(t)\sum_{i,j;\,j\ne i} (\alpha_{ij}-1)q_{ij}1_{\{X(t)=i\}}. \]

Therefore

\[ L(t) = L(0) + \int_{(0,t]} L(s-)\sum_{i\ne j} (\alpha_{ij}-1)\bigl(N_{i,j}(ds) - q_{ij}1_{\{X(s)=i\}}\,ds\bigr). \]

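By the product rule of Stieltjes–Lebesgue calculus,

\[
\begin{aligned}
L(t)^2 &= L(0)^2 + 2\int_{(0,t]} L(s-)\,dL(s) + \sum_{s\le t} L(s-)\Delta L(s) \\
&= L(0)^2 + 2\int_{(0,t]} L(s-)\,dL(s) + \int_{(0,t]} L(s-)^2\sum_{i\ne j} (\alpha_{ij}-1)\bigl(N_{i,j}(ds) - q_{ij}1_{\{X(s)=i\}}\,ds\bigr) \\
&\qquad + \int_{(0,t]} L(s)^2\sum_{i\ne j} (\alpha_{ij}-1)q_{ij}1_{\{X(s)=i\}}\,ds.
\end{aligned}
\]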

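Using the stopping times of type (15.32), we have that

\[ L(t)^2 = L(0)^2 + \text{local martingale} + \int_{(0,t]} L(s)^2\sum_{i\ne j} (\alpha_{ij}-1)q_{ij}1_{\{X(s)=i\}}\,ds. \]

Then,

\[ E\bigl[L(t\wedge S_n)^2\bigr] = E\bigl[L(0)^2\bigr] + E\Bigl[\int_{(0,t\wedge S_n]} \sum_{i\ne j} L(s\wedge S_n)^2(\alpha_{ij}-1)q_{ij}1_{\{X(s)=i\}}\,ds\Bigr] \]

and in particular

\[ E\bigl[L(t\wedge S_n)^2\bigr] \le E\bigl[L(0)^2\bigr] + \int_{(0,t]} E\bigl[L(s\wedge S_n)^2\bigr]\sum_{i\ne j} |\alpha_{ij}-1|\,q_{ij}\,ds. \]

Therefore, by Gronwall's lemma,

\[ E\bigl[L(t\wedge S_n)^2\bigr] \le E\bigl[L(0)^2\bigr]\exp\Bigl\{\Bigl(\sum_i \tilde q_i + \sum_i q_i\Bigr)t\Bigr\} \]

and finally

\[ \sup_{n\ge 1} E\bigl[L(T\wedge S_n)^2\bigr] < \infty. \]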



The above two examples can be generalized as follows.

Example 15.2.11: Likelihood Ratios for Marked Point Processes. Consider the general situation of Theorem 15.2.7. Let (N, Z) be a simple and locally finite point process on R+ with marks in K and associated lifted process NZ on R+ × K. Let {Ft}t≥0 be a history of (N, Z) and suppose that (N, Z) admits the (P, Ft)-local characteristics (λ(t), Φ(t, dz)). Let {μ(t)}t≥0 be a non-negative Ft-predictable process and let {h(t, z)}t≥0,z∈K be a non-negative Ft-predictable K-indexed stochastic process, such that for all t ≥ 0

\[ P\Bigl(\int_0^t \lambda(s)\mu(s)\,ds < \infty\Bigr) = 1 \tag{15.35} \]

and

\[ P\Bigl(\int_K h(t,z)\,\Phi(t,dz) = 1\Bigr) = 1. \tag{15.36} \]

For each t ≥ 0, define L(t) as in (15.30), with L(0) a non-negative F0-measurable random variable such that E[L(0)] = 1. We suppose in addition that L(0) is square-integrable and that

\[ (\mu(t)+1)h(t,z)\lambda(t) \le K(t), \]

where K : R+ → R+ is a deterministic function such that $\int_0^T K(s)\,ds < \infty$. Then E[L(T)] = 1. The proof follows the same lines as the proof in Example 15.2.9 and is left as an exercise.

Remark 15.2.12 One need not insist on the analogy with Girsanov's theorem as it is obvious. The proof of Girsanov's result was based on the Itô calculus for Brownian motion. In the case of point processes, the underlying calculus is just the ordinary Stieltjes–Lebesgue calculus.

The Reference Probability Method

Radon–Nikodým derivatives are of course of interest in statistics (where they are called likelihood ratios), and also in filtering. In the so-called reference probability method, the probability P actually governing the joint statistics of the observation and of the state process is obtained by an absolutely continuous change of probability measure Q → P. This method therefore relies on the Radon–Nikodým results of Subsection 15.2.2. The reference probability Q is chosen such that the observation and the state process are Q-independent and the observation has a simple structure under Q. The state process {X(t)}t≥0 takes its values in some measurable space (E, E) and the observation is a marked point process (N, Z) with time-events sequence {Tn}n≥1 and mark sequence {Zn}n≥1, the marks taking their values in a measurable space (K, K). Let

\[ \mathcal F_t := \mathcal F_t^{N,Z} \vee \mathcal F_\infty^X, \]

where $\{\mathcal F_t^X\}_{t\ge 0}$ is the internal history of the state process. The reference probability Q is such that (N, Z) is a (Q, Ft)-Poisson process of intensity 1 with independent iid marks of common probability distribution Q. Moreover, under Q, N and $\mathcal F_\infty^X$ are independent. Therefore, the stochastic Ft-kernel of the observation under the reference probability Q is λ(t, dz) = 1 × Q(dz). Let {μ(t)}t≥0 be a non-negative Ft-predictable process and let {h(t, z)}t≥0,z∈K be a non-negative Ft-predictable K-indexed stochastic process, such that for all t ≥ 0

\[ Q\Bigl(\int_0^T \mu(s)\,ds < \infty\Bigr) = 1 \tag{15.37} \]

and

\[ Q\Bigl(\int_K h(t,z)\,Q(dz) = 1\Bigr) = 1. \tag{15.38} \]

Define

\[ L(t) := \Bigl(\prod_{T_n\in(0,t]} \mu(T_n)h(T_n,Z_n)\Bigr)\exp\Bigl\{-\int_{(0,t]}\int_K \bigl(\mu(s)h(s,z)-1\bigr)\,Q(dz)\,ds\Bigr\}, \]

and suppose that EQ[L(T)] = 1. Then (Theorem 15.2.7) the marked point process (N, Z) admits on [0, T] the (P, Ft)-local characteristics (μ(t), h(t, z)Q(dz)). Moreover, the restrictions of P and Q to $\mathcal F_\infty^X$ are the same since L(0) = 1. In fact, by hypothesis, $L(0) := \frac{dP_0}{dQ_0} = 1$, that is, P and Q agree on $\mathcal F_0 := \mathcal F_\infty^X$. Let {Z(t)}t≥0 be an Ft-adapted real-valued stochastic process.

Lemma 15.2.13 For all t ≥ 0,

\[ E_Q\bigl[L(t)\mid\mathcal F_t^N\bigr]\,E_P\bigl[Z(t)\mid\mathcal F_t^N\bigr] = E_Q\bigl[Z(t)L(t)\mid\mathcal F_t^N\bigr], \qquad Q\text{-a.s.} \]

or, equivalently,

\[ E_P\bigl[Z(t)\mid\mathcal F_t^N\bigr] = \frac{E_Q\bigl[Z(t)L(t)\mid\mathcal F_t^N\bigr]}{E_Q\bigl[L(t)\mid\mathcal F_t^N\bigr]}, \qquad P\text{-a.s.} \]

Proof. This is just a rephrasing of Lemma 8.3.12.



Example 15.2.14: Estimating the Random Intensity of a Homogeneous Cox Process. The above lemma allows us to replace a filtering problem with respect to P by one with respect to Q, which may be a simplification when Q has a simple structure. For instance, if Q is a probability that makes the point process N Poisson with intensity 1, and if Λ is an integrable variable independent, under Q, of N, the measure P defined by

\[ \frac{dP_t}{dQ_t} = \Lambda^{N(t)}\exp\{(1-\Lambda)t\} \]

makes N a doubly stochastic process with intensity Λ. Then

\[ E\bigl[\Lambda\mid\mathcal F_t^N\bigr] = \frac{\int_0^\infty \lambda^{N(t)+1}e^{-\lambda t}\,dF(\lambda)}{\int_0^\infty \lambda^{N(t)}e^{-\lambda t}\,dF(\lambda)}. \]

(Exercise 15.4.16.)
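As an illustration of the ratio formula of Example 15.2.14, here is a minimal sketch for the case where the distribution F of Λ is discrete, so that both integrals become finite sums. The function name and the representation of F by the lists `atoms` and `weights` are assumptions made for this sketch; the computation is done in log-space to avoid overflow when N(t) is large.

```python
import math


def posterior_mean_intensity(n, t, atoms, weights):
    """E[Lambda | N(t) = n] when P(Lambda = atoms[k]) = weights[k].

    Implements sum_k w_k a_k^(n+1) e^(-a_k t) / sum_k w_k a_k^n e^(-a_k t).
    All atoms and weights are assumed strictly positive.
    """
    # log of the unnormalized posterior weight of each atom
    logs = [math.log(w) + n * math.log(a) - a * t for a, w in zip(atoms, weights)]
    shift = max(logs)  # subtract the maximum before exponentiating
    post = [math.exp(x - shift) for x in logs]
    return sum(a * p for a, p in zip(atoms, post)) / sum(post)
```

For example, `posterior_mean_intensity(0, 0.0, [1.0, 3.0], [0.5, 0.5])` returns the prior mean 2.0, and the estimate moves towards the larger atom when many points are observed in a short time.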

15.2.3 Changing the Time Scale

Recall the following elementary result concerning Poisson processes on the line with an intensity. If N is a Poisson process with intensity λ(t), defining τ(t) by $\int_0^{\tau(t)} \lambda(s)\,ds = t$, the point process Ñ defined by Ñ((0, t]) = N((0, τ(t)]) is a standard hpp. This result will be extended to the transformation of a point process with given stochastic intensity into a standard hpp. Let N be a simple locally finite point process on R+, with the Ft-predictable intensity {λ(t)}t≥0, and suppose that N(0, ∞) = ∞, P-a.s. or, equivalently (Theorem 15.1.16), $\int_0^\infty \lambda(s)\,ds = \infty$, P-a.s. Define, for each t ≥ 0, the non-negative random variable τ(t) by

\[ \int_0^{\tau(t)} \lambda(s)\,ds = t. \tag{15.39} \]

For each t ≥ 0, τ(t) is well defined since $\int_0^\infty \lambda(s)\,ds = \infty$ (Theorem 15.1.16). For each t ∈ R+, τ(t) is an Ft-stopping time. Indeed, for any a ∈ R+,

\[ \{\tau(t)\le a\} = \Bigl\{\int_0^a \lambda(s)\,ds \ge t\Bigr\} \in \mathcal F_a. \]

Define the simple locally bounded point process Ñ on R+ by

\[ \tilde N(0,t] := N(0,\tau(t)]. \tag{15.40} \]

Note that $\mathcal F_t^{\tilde N} \subseteq \mathcal F_{\tau(t)}$, since for all a ∈ R+, Ñ(0, a] := N(0, τ(a)] is $\mathcal F^N_{\tau(a)}$-measurable, and therefore $\mathcal F_{\tau(a)}$-measurable.

Theorem 15.2.15 Ñ has $\mathcal F_{\tau(t)}$-intensity 1 (and therefore $\mathcal F_t^{\tilde N}$-intensity 1).

Proof. Let [a, b] ⊂ R+. We must show that

\[ E\bigl[1_A\tilde N(a,b]\bigr] = E\bigl[1_A(b-a)\bigr] \qquad (A\in\mathcal F_{\tau(a)}). \]

But the left-hand side is just

\[ E\Bigl[1_A\int_{\tau(a)}^{\tau(b)} N(dt)\Bigr] = E\Bigl[\int_0^\infty 1_A 1_{(\tau(a),\tau(b)]}(t)\,N(dt)\Bigr]. \]

Since the process $1_A 1_{(\tau(a),\tau(b)]}$ is Ft-predictable (being Ft-adapted and left-continuous), the right-hand side of the above equality is, by the smoothing formula,

\[ E\Bigl[\int_0^\infty 1_A 1_{(\tau(a),\tau(b)]}(t)\lambda(t)\,dt\Bigr] = E\Bigl[1_A\int_{\tau(a)}^{\tau(b)} \lambda(t)\,dt\Bigr] = E\bigl[1_A(b-a)\bigr]. \]

Remark 15.2.16 By Watanabe's theorem (Theorem 7.1.8), Ñ is a homogeneous Poisson process of intensity 1. In addition, for all [a, b] ⊂ R+, Ñ(a, b] is independent of $\mathcal F_{\tau(a)}$.

Remark 15.2.17 The result of Theorem 15.2.15 may be used to test that a given point process admits a given hypothetical intensity: perform the corresponding change of time and see if the result is a standard Poisson process, by using any available statistical test to assess that a given finite sequence of random variables is iid and exponentially distributed with mean 1.

Remark 15.2.18 The analogous result in the Brownian motion calculus is the one that says roughly that a continuous martingale is a Brownian motion with a different time scale (see, for instance, Section 5.3.2 in [Legall, 2016]).
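The testing procedure of Remark 15.2.17 can be sketched as follows. This is an illustration under assumptions of our own: the hypothetical intensity is the deterministic function λ(t) = 1 + 0.5 sin t, the data are simulated by thinning a Poisson process of rate `lam_max`, and the goodness-of-fit check is a hand-written Kolmogorov–Smirnov statistic against the exponential distribution of mean 1. Any other test of exponentiality could be used instead, and all function names are hypothetical.

```python
import math
import random


def simulate_by_thinning(T, lam, lam_max, rng):
    """Points on (0, T] of a Poisson process with intensity lam(t) <= lam_max."""
    points, t = [], 0.0
    while True:
        t += rng.expovariate(lam_max)
        if t > T:
            return points
        if rng.random() * lam_max <= lam(t):
            points.append(t)


def rescaled_gaps(points, compensator):
    """Gaps between the time-changed points compensator(T_n)."""
    times = [compensator(t) for t in points]
    return [b - a for a, b in zip([0.0] + times[:-1], times)]


def ks_statistic_exp1(sample):
    """Kolmogorov-Smirnov distance between the sample and the Exp(1) law."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-x)
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def lam(t):
    return 1.0 + 0.5 * math.sin(t)          # assumed hypothetical intensity


def compensator(t):
    return t + 0.5 * (1.0 - math.cos(t))    # integral of lam over (0, t]
```

A small value of `ks_statistic_exp1(rescaled_gaps(points, compensator))`, of the order of one over the square root of the number of points, is compatible with the hypothetical intensity; a large value is evidence against it.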

Cryptology

A question arises naturally: how much information is lost in a change of time scale? Consider for instance the situation of Theorem 15.2.15. Since the change of time transforms the original point process into a homogeneous Poisson process, have we erased all the information previously contained in the stochastic intensity? The answer is: it depends. For instance, suppose that N is a Cox point process with intensity Λ, a non-negative real-valued random variable. In other words, N admits the stochastic Ft-intensity λ(t) ≡ Λ, where $\mathcal F_t := \mathcal F_t^N \vee \sigma(\Lambda)$. If we perform the time change τ(t) = t/Λ, the resulting process Ñ defined by (15.40) is a standard Poisson process. Moreover, for all 0 ≤ c ≤ d, Ñ(d) − Ñ(c) is independent of $\mathcal F_c = \mathcal F_c^N \vee \sigma(\Lambda)$ and in particular of Λ. In this sense, the time change has erased all information concerning Λ, whereas Λ could be recovered from N since, by the strong law of large numbers,

\[ \Lambda = \lim_{t\uparrow\infty} \frac{N(t)}{t}. \tag{$\star$} \]

In the case of an intrinsic change of time, things are dramatically different. The stochastic $\mathcal F_t^N$-intensity of N, $\hat\lambda(t) = E[\Lambda\mid\mathcal F_t^N]$, is given by (Example 15.2.3):

\[ \hat\lambda(t) = \frac{\int_0^\infty \lambda^{N(t)+1}e^{-\lambda t}\,dF(\lambda)}{\int_0^\infty \lambda^{N(t)}e^{-\lambda t}\,dF(\lambda)}. \]

To be even more specific, suppose that P(Λ = a) = P(Λ = b) = 1/2 for some 0 < a < b, in which case

\[ \hat\lambda(t) = a\,\frac{1+(b/a)^{N(t)+1}e^{(a-b)t}}{1+(b/a)^{N(t)}e^{(a-b)t}}. \]

Performing the time change

\[ \int_0^{\hat\tau(t)} \hat\lambda(s)\,ds = t, \]

we obtain a point process Ñ defined by $\tilde N(t) := N(\hat\tau(t))$, which is a standard Poisson process. However, this time, Λ can be entirely recovered from it. In fact, as we now show, N can be reconstructed from Ñ, and then Λ can be obtained by (⋆). Indeed, if $\hat T_n$ is the n-th point of Ñ, then

\[ \int_{T_n}^{T_{n+1}} \hat\lambda(t)\,dt = \hat T_{n+1} - \hat T_n \]

or, more explicitly, since N(t) = n for t between Tn and Tn+1,

\[ \hat T_{n+1} - \hat T_n = f(n,T_{n+1}) - f(n,T_n), \]

where

\[ f(n,t) = at - \ln\bigl(1+(b/a)^{n}e^{(a-b)t}\bigr). \]

Clearly then, the sequence {Tn}n≥1 can be recovered from $\{\hat T_n\}_{n\ge 1}$.
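The reconstruction of the original points from the time-changed ones can be sketched numerically in the two-point case P(Λ = a) = P(Λ = b) = 1/2. The sketch below makes the following assumptions, which are ours: time starts at T0 = 0 with no point there, the estimated intensity on the interval from Tn to Tn+1 is computed with N(t) = n, so that its primitive is t ↦ at − ln(1 + (b/a)^n e^{(a−b)t}), and the inversion is done by bisection, using the fact that the estimated intensity lies between a and b. The function names `f`, `encode` and `decode` are hypothetical.

```python
import math


def f(n, t, a, b):
    """Primitive in t of the estimated intensity when n points have been seen."""
    x = n * math.log(b / a) + (a - b) * t
    # numerically stable evaluation of log(1 + exp(x))
    softplus = x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))
    return a * t - softplus


def encode(points, a, b):
    """Map the points T_1 < T_2 < ... to the time-changed points."""
    hat_points, prev_t, prev_hat = [], 0.0, 0.0
    for n, t in enumerate(points):       # n = number of points before t
        prev_hat += f(n, t, a, b) - f(n, prev_t, a, b)
        hat_points.append(prev_hat)
        prev_t = t
    return hat_points


def decode(hat_points, a, b, iterations=100):
    """Recover the original points from the time-changed ones by bisection."""
    points, prev_t, prev_hat = [], 0.0, 0.0
    for n, h in enumerate(hat_points):
        delta = h - prev_hat
        target = f(n, prev_t, a, b) + delta
        # the slope of f lies in [a, b], so the gap lies in [delta/b, delta/a]
        lo, hi = prev_t + delta / b, prev_t + delta / a
        for _ in range(iterations):
            mid = 0.5 * (lo + hi)
            if f(n, mid, a, b) < target:
                lo = mid
            else:
                hi = mid
        prev_t, prev_hat = 0.5 * (lo + hi), h
        points.append(prev_t)
    return points
```

Note that `decode` needs the values a and b, that is, the distribution of Λ; without them the time-changed sequence looks like a standard Poisson process.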


Remark 15.2.19 An interpretation of the above results in terms of cryptography is the following. If the information is contained in Λ, the intrinsic time change yields a standard Poisson process from which Λ can be extracted only if one knows the "key", that is, the distribution of Λ. (Note however that from a finite trajectory of Ñ, one can only obtain an approximation of Λ. In this sense, secure transmission would be at the price of some unreliability. This unreliability can be controlled at the expense of transmission rate, which is acceptable if one is interested only in storage security.)

Remark 15.2.20 There is an analogous result in the Itô calculus, although this analogy is not a direct one as is the case, for instance, in Girsanov's theorem. We discuss it in very rough terms that will not be further detailed. Consider the "signal + noise" model

\[ X(t) = \int_0^t \varphi(s)\,ds + W(t) \qquad (t\ge 0), \]

where {W(t)}t≥0 is a standard Wiener process and {ϕ(t)}t≥0 is a locally integrable process (the integrated signal) independent of the Wiener process (the integrated noise). If one denotes by $\{\hat\varphi(t)\}_{t\ge 0}$ a suitable version of the estimated signal process $E[\varphi(t)\mid\mathcal F_t^X]$ (t ≥ 0), then

\[ \widetilde W(t) := X(t) - \int_0^t \hat\varphi(s)\,ds \qquad (t\ge 0) \]

is a standard Wiener process. A cryptographic interpretation of this result avails in full analogy with the simple Poissonian example of the previous remark.

15.3 Point Processes under a Poisson process

A non-homogeneous Poisson process with (deterministic) intensity function λ(t) can be obtained by projecting onto the time axis the points of a homogeneous Poisson process on R2 of intensity 1 which lie between the curve y = λ(t) and the time axis (Exercise 8.5.14). In fact, as we shall see in this section, any point process with stochastic Ft-intensity {λ(t)}t≥0 not only can be obtained in this way (the direct embedding theorems) but can always be thought of as having been obtained in this way (the inverse embedding theorems), in general at the cost of an extension of the probability space. The exact formulation and the mathematical details will be given in Subsection 15.3.2. For this the following preliminaries of intrinsic interest are needed.

15.3.1 An Extension of Watanabe's Theorem

The original version of Watanabe's theorem concerns homogeneous Poisson processes on the line. It will be extended, with a proof analogous to the proof of Theorem 7.1.8, to Poisson processes on product spaces of the type R × K. This new version will play a central role in the proof of the embedding theorems in Subsection 15.3.2. Some notation will be needed for the precise statement of this extension. Let N be a point process on R × K. The notation StN, where t ≥ 0, denotes the point process obtained by shifting N by t to the left (algebraically). Let StN+ denote the point process obtained by restricting the shifted process StN to R+ × K. Loosely speaking, StN+ is the future of N after time t, and more formally,

\[ S_tN^+([a,b]\times L) := N\bigl((\mathbb R_+\cap[a+t,b+t])\times L\bigr) \qquad ([a,b]\subset\mathbb R,\ L\in\mathcal K). \]

One can similarly define StN−, the restriction of StN to R− × K. The following version of Watanabe's theorem concerns in particular Cox processes on R2.

Theorem 15.3.1 Let (K, K) be a measurable space. Let G be some sub-σ-field of F. Let λ(t, dz) be a locally integrable kernel from R × Ω to (K, K) such that for all t ≥ 0, L ∈ K, λ(t, L) is G-measurable. Let N be a point process on R × K that admits the Ft-intensity kernel λ(t, ·), where Ft = Ht ∨ G and {Ht}t≥0 is a history of N. Then N is, conditionally on G, a Poisson process with the intensity measure ν(dt × dz) = λ(t, dz) × dt. Furthermore, conditionally on G, for all t ≥ 0, the future StN+ of N after time t is independent of Ht.

Proof. It suffices to show that for all a ∈ R, for any finite family of disjoint measurable sets C1, . . . , Cm ⊂ (a, +∞) × K, and all t1, . . . , tm ∈ R+,

\[ E\Bigl[\exp\Bigl(-\sum_{j=1}^m t_jN(C_j)\Bigr)\Bigm|\mathcal F_a\Bigr] = \prod_{j=1}^m \exp\bigl\{\nu(C_j)(e^{-t_j}-1)\bigr\}. \tag{15.41} \]

One may assume that C1 , . . . , Cm ⊂ (a, b] × Lk for some Lk as in Definition 15.1.20. Otherwise, replace the Cj ’s by Cj ∩ ((a, b] × Lk ) and let b and k go to infinity. Denote the above Lk by L. For all j (1 ≤ j ≤ m) and all t ≥ 0, let Cj (t) := Cj ∩ {(−∞, t] × K} and Cjt := {z ∈ K; (t, z) ∈ Cj }. Define for t ≥ a

\[ Z(t) := \exp\Bigl\{-\sum_{j=1}^m t_jN(C_j(t))\Bigr\}. \tag{15.42} \]

In particular,

\[ Z(b) = \exp\Bigl\{-\sum_{j=1}^m t_jN(C_j)\Bigr\}. \]

Also, since Z(a) = 1,

\[ Z(t) = 1 + \int_{(a,t]}\int_K Z(s-)\Bigl\{\sum_{j=1}^m (e^{-t_j}-1)1_{C_j^s}(z)\Bigr\}\,N(ds\times dz). \tag{15.43} \]

For the proof of this equality, observe that any trajectory t → Z(t) is piecewise constant with discontinuity times that are points of the simple point process NL(·) := N(· × L). Therefore

\[ Z(t) = Z(a) + \sum_{s\in(a,t]} \bigl(Z(s)-Z(s-)\bigr), \]

where Z(s) ≠ Z(s−) only if there is a point (s, z) of N that belongs to (at most) one of the Cj's. If (s, z) ∈ Cj and is in N, then $Z(s) = Z(s-)e^{-t_j}$. Now saying that (s, z) ∈ Cj is equivalent to saying that $z\in C_j^s$, and therefore, if (s, z) is a point of N,

\[ Z(s)-Z(s-) = \sum_{j=1}^m Z(s-)(e^{-t_j}-1)1_{C_j^s}(z), \]

which then gives (15.43) since Z(a) = 1. Let now A ∈ Fa. We have

\[ 1_AZ(t) = 1_A + \int_{(a,t]\times K} 1_AZ(s-)\Bigl(\sum_{j=1}^m (e^{-t_j}-1)1_{C_j^s}(z)\Bigr)\,N(ds\times dz). \]

The stochastic process indexed by K,

\[ H(t,\omega,z) := 1_A(\omega)1_{(a,t]}(t)Z(t-,\omega)\times\Bigl(\sum_{j=1}^m (e^{-t_j}-1)1_{C_j^t}(z)\Bigr), \]

is P(F·) ⊗ K-measurable and of constant sign (negative). Therefore,

\[
\begin{aligned}
E[1_AZ(t)] &= P(A) + E\Bigl[\int_{(a,t]}\int_K 1_AZ(s-)\Bigl\{\sum_{j=1}^m (e^{-t_j}-1)1_{C_j^s}(z)\Bigr\}\,N(ds\times dz)\Bigr] \\
&= P(A) + E\Bigl[\int_a^t\int_K 1_AZ(s)\Bigl\{\sum_{j=1}^m (e^{-t_j}-1)1_{C_j^s}(z)\Bigr\}\,\lambda(s,dz)\,ds\Bigr] \\
&= P(A) + E\Bigl[1_A\int_a^t\int_K Z(s)\Bigl\{\sum_{j=1}^m (e^{-t_j}-1)1_{C_j^s}(z)\Bigr\}\,\lambda(s,dz)\,ds\Bigr].
\end{aligned}
\]

Since A is arbitrary in Fa,

\[
\begin{aligned}
E[Z(t)\mid\mathcal F_a] &= 1 + E\Bigl[\int_a^t\int_K Z(s)\Bigl\{\sum_{j=1}^m (e^{-t_j}-1)1_{C_j^s}(z)\Bigr\}\,\lambda(s,dz)\,ds\Bigm|\mathcal F_a\Bigr] \\
&= 1 + \int_a^t\int_K E[Z(s)\mid\mathcal F_a]\Bigl\{\sum_{j=1}^m (e^{-t_j}-1)1_{C_j^s}(z)\Bigr\}\,\lambda(s,dz)\,ds.
\end{aligned}
\]

Therefore

\[ E[Z(t)\mid\mathcal F_a] = \exp\Bigl\{\sum_{j=1}^m (e^{-t_j}-1)\int_a^t\int_K 1_{C_j^s}(z)\,\lambda(s,dz)\,ds\Bigr\}. \]

Letting t = b gives the announced result.

15.3.2 Grigelionis' Embedding Theorem

We shall use the following slight extension of the definition of a Poisson process:

Definition 15.3.2 Let (K, K) be some measurable space. Given a history {Ft}t∈R, the point process N on R × K is called an Ft-Poisson process if the following conditions are satisfied:
(i) {Ft}t∈R is a history of N in the sense that Ft ⊇ σ(N(D); D ⊆ (−∞, t] × K);
(ii) N is a Poisson process; and
(iii) for any t ∈ R, StN+ and Ft are independent, where StN+ is defined by StN+(C × L) := N(((C + t) ∩ (t, +∞)) × L).

Theorem 15.3.3 Let (K, K) be some measurable space and let Q be some probability measure on it. Let N̄ be an Ft-Poisson process on R × K × R+ with intensity measure dt × Q(dz) × dσ. Let f : Ω × R × K → R be a non-negative function that is P(F·) ⊗ K-measurable and such that the kernel

\[ \lambda(t,dz) := f(t,z)\,Q(dz) \tag{15.44} \]

is locally integrable. The marked point process (N, Z) with marks in K defined by

\[ N(C\times L) := \int_{\mathbb R}\int_K\int_{\mathbb R_+} 1_C(t)1_L(z)1_{\{\sigma\le f(t,z)\}}\,\bar N(dt\times dz\times d\sigma) \qquad (C\in\mathcal B(\mathbb R),\ L\in\mathcal K) \tag{15.45} \]

admits the Ft-stochastic intensity kernel λ(t, dz).

Proof. We prove the theorem for the unmarked case. The general proof follows exactly the same lines. Let N̄ be a homogeneous Ft-Poisson process on R × R+ with average intensity 1. Let {λ(t)}t≥0 be a non-negative locally integrable Ft-predictable stochastic process. Define the point process N on R by the formula

\[ N(C) = \int_C\int_{\mathbb R_+} 1_{(0,\lambda(t)]}(z)\,\bar N(dt\times dz) \tag{15.46} \]

for all C ∈ B. Then, N has the Ft-intensity {λ(t)}t≥0.

1. We first show that (15.46) defines a locally finite point process, that is, N((0, b]) < ∞ a.s. for all b ∈ R. Define for all n ≥ 1

\[ \tau_n = \inf\Bigl\{t\ge 0 \,;\ \int_0^t \lambda(s)\,ds \ge n\Bigr\} \]

(= ∞ if {. . . } = ∅). By the local integrability assumption, limn↑∞ τn = ∞, a.s. Also τn is an Ft-stopping time. By the smoothing formula of Theorem 15.1.22,

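\[
\begin{aligned}
E\bigl[N((0,\tau_n\wedge b])\bigr] &= E\Bigl[\int_{(0,b]\times\mathbb R_+} 1_{(0,\tau_n]}(s)1_{(0,\lambda(s)]}(\sigma)\,\bar N(ds\times d\sigma)\Bigr] \\
&= E\Bigl[\int_{(0,b]\times\mathbb R_+} 1_{(0,\tau_n]}(s)1_{(0,\lambda(s)]}(\sigma)\,ds\,d\sigma\Bigr] \\
&= E\Bigl[\int_0^{\tau_n\wedge b} \lambda(s)\,ds\Bigr] < \infty.
\end{aligned}
\]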

Therefore, a.s., for all n ≥ 1, N(0, τn ∧ b] < ∞.

2. The simplicity is left as an exercise for the reader.

3. In order to prove that N has the Ft-intensity {λ(t)}t≥0 it suffices to show that for all H ∈ P(F·), H ≥ 0,

\[ E\Bigl[\int_{\mathbb R} H(t)\,N(dt)\Bigr] = E\Bigl[\int_{\mathbb R} H(t)\lambda(t)\,dt\Bigr]. \]

But the left-hand side of this equality reads

\[ E\Bigl[\int_{\mathbb R}\int_{\mathbb R_+} H(t)1_{(0,\lambda(t)]}(z)\,\bar N(dt\times dz)\Bigr]. \]

Since N̄ is assumed Ft-Poisson, it admits the Ft-intensity kernel λ(t, dz) = dz. Thus, by Theorem 15.1.22, this is also equal to

\[ E\Bigl[\int_{\mathbb R}\int_{\mathbb R_+} H(t)1_{(0,\lambda(t)]}(z)\,dt\,dz\Bigr] = E\Bigl[\int_{\mathbb R} H(t)\lambda(t)\,dt\Bigr]. \]

Theorem 15.3.4 Let (N, Z) be a locally finite marked point process with marks in the measurable space (K, K) and Ft-stochastic intensity kernel of the form (15.44), where f : Ω × R × K → R is a non-negative function that is P(F·) ⊗ K-measurable and Q is a probability measure on (K, K). Then, the probability space may be enlarged to accommodate an Ft-Poisson process N̄ on R × K × R+ with intensity measure dt × Q(dz) × dσ such that (15.45) holds.

Proof. The result will be proved for the unmarked case, the general case following exactly the same lines. The theorem in this simplified form is as follows. Let N be a simple point process on R with Ft-predictable intensity {λ(t)}t≥0. Then, there exists a homogeneous Poisson process N̄ on R × R+ with average intensity 1, such that (15.46) holds. Moreover, this process N̄ is an $\mathcal F_t\vee\mathcal F_t^{\bar N}$-Poisson process. As such, for all a ∈ R, SaN̄+ is independent of Fa.

Let {Un}n∈Z be an iid sequence of random variables uniformly distributed on [0, 1], and let N^1 be a homogeneous Poisson process on R × R+, of intensity 1, such that {Un}n∈Z, N^1 and F∞ are independent. Define N̄ by

\[ \bar N(A) = \int_A 1_{(\lambda(t),\infty)}(\sigma)\,N^1(dt\times d\sigma) + \sum_{n\in\mathbb Z} 1_A\bigl((T_n,U_n\lambda(T_n))\bigr) \]

for all A ∈ B ⊗ B(R+). If H is a non-negative function from R × Ω × R+ to R,

\[ \int_{\mathbb R}\int_{\mathbb R_+} H(t,\sigma)\,\bar N(dt\times d\sigma) = \int_{\mathbb R}\int_{\mathbb R_+} H(t,\sigma)1_{(\lambda(t),\infty)}(\sigma)\,N^1(dt\times d\sigma) + \sum_{n\in\mathbb Z} H(T_n,U_n\lambda(T_n)). \]

Denote by N^U the marked point process obtained by attaching the mark Un to point Tn of N. Let

\[ \mathcal G_t = \mathcal F_t\vee\mathcal F_t^{N^1}\vee\mathcal F_t^{N^U} \]

and suppose that H is P(Gt) ⊗ B(R+)-measurable and non-negative.

[Figure: the points (Tn, Unλ(Tn)), shown for n = −1, 0, 1, 2, plotted against the time axis t.]

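Since N^1 has the $\mathcal F_t^{N^1}$-intensity kernel λ^1(t, dσ) = dσ, the latter is also the Gt-intensity kernel of N^1 (recall that $\mathcal F_\infty^U$ and $\mathcal F_\infty$ are independent of $\mathcal F_\infty^{N^1}$). By the smoothing theorem

\[ E\Bigl[\int_{\mathbb R}\int_{\mathbb R_+} 1_{(\lambda(t),\infty)}(\sigma)H(t,\sigma)\,N^1(dt\times d\sigma)\Bigr] = E\Bigl[\int_{\mathbb R}\int_{\mathbb R_+} 1_{(\lambda(t),\infty)}(\sigma)H(t,\sigma)\,dt\,d\sigma\Bigr] \]

for any non-negative H ∈ P(F·) ⊗ B(R+). Now, for any non-negative H ∈ P(F·) ⊗ B(R+),

\[
\begin{aligned}
E\Bigl[\sum_{n\in\mathbb Z} H(T_n,U_n\lambda(T_n))\Bigr] &= E\Bigl[\int_{\mathbb R}\int_{[0,1]} H(t,u\lambda(t))\,N^U(dt\times du)\Bigr] \\
&= E\Bigl[\int_{\mathbb R}\int_0^1 H(t,u\lambda(t))\lambda(t)\,dt\,du\Bigr].
\end{aligned}
\]

(In fact, N^U admits the $\mathcal F_t\vee\mathcal F_t^{N^U}$-intensity kernel λ(t)1[0,1](u) du. This is also a Gt-stochastic kernel for N^U since N^1 is independent of N^U and F∞. The map (t, ω, u) → H(t, uλ(t)) is P(Gt) ⊗ B(R)-measurable in view of the Ft-predictability of {λ(t)}t≥0, and of the measurability assumptions on H. This justifies the above use of the smoothing theorem.) This term is also equal to

\[ E\Bigl[\int_{\mathbb R}\int_{\mathbb R_+} H(t,\sigma)1_{(0,\lambda(t)]}(\sigma)\,dt\,d\sigma\Bigr], \]

by the change of variables σ = uλ(t). Therefore

\[
\begin{aligned}
E\Bigl[\int_{\mathbb R}\int_{\mathbb R_+} H(t,\sigma)\,\bar N(dt\times d\sigma)\Bigr]
&= E\Bigl[\int_{\mathbb R}\int_{\mathbb R_+} H(t,\sigma)1_{(\lambda(t),\infty)}(\sigma)\,dt\,d\sigma\Bigr] + E\Bigl[\int_{\mathbb R}\int_{\mathbb R_+} H(t,\sigma)1_{(0,\lambda(t)]}(\sigma)\,dt\,d\sigma\Bigr] \\
&= E\Bigl[\int_{\mathbb R}\int_{\mathbb R_+} H(t,\sigma)\,dt\,d\sigma\Bigr]
\end{aligned}
\]

for all non-negative H ∈ P(G·) ⊗ B(R+). Therefore, by Theorem 15.3.1, N̄ is a homogeneous Gt-Poisson process on (R × R+, B(R) ⊗ B(R+)) with intensity 1. It is a fortiori an $\mathcal F_t\vee\mathcal F_t^{\bar N}$-Poisson process.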
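The direct embedding (15.46) of Theorem 15.3.3 can be sketched numerically in the simplest situation. The assumptions here are ours: the intensity is the deterministic function λ(t) = 1 + 0.5 sin t, and since it is bounded by `height`, it is enough to simulate the unit-rate Poisson process of the plane on the strip (0, T] × (0, height]. The function names are hypothetical.

```python
import math
import random


def strip_poisson(T, height, rng):
    """Unit-rate Poisson process on (0, T] x (0, height], as (t, sigma) pairs.

    The abscissas form a Poisson process of rate `height` on the line and the
    ordinates are independent and uniform on the interval of that height.
    """
    points, t = [], 0.0
    while True:
        t += rng.expovariate(height)
        if t > T:
            return points
        points.append((t, rng.uniform(0.0, height)))


def embed(points2d, lam):
    """Keep the abscissas of the points lying under the curve sigma = lam(t)."""
    return [t for t, sigma in points2d if sigma <= lam(t)]


def lam(t):
    return 1.0 + 0.5 * math.sin(t)   # assumed intensity, bounded by 1.5


def mean_count(T=5.0, height=1.5, n_runs=2000, seed=0):
    """Average number of embedded points on (0, T] over independent runs."""
    rng = random.Random(seed)
    total = sum(len(embed(strip_poisson(T, height, rng), lam)) for _ in range(n_runs))
    return total / n_runs
```

The value of `mean_count()` should be close to the integral of λ over (0, 5], that is 5 + 0.5(1 − cos 5), in agreement with the smoothing formula used in step 3 of the proof of Theorem 15.3.3.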

Variants of the Embedding Theorems

The following results are of the same kind as the previous ones, but they assume boundedness of the intensity kernel. We state them in the unmarked case. Their proofs are left as an exercise (Exercise 15.4.18).

Theorem 15.3.5 (Direct embedding) Let Ñ be an hpp on R with intensity $\tilde\lambda$, and let {Un}n∈Z be an iid sequence of marks, uniformly distributed on [0, 1]. Let $\tilde N_U$ be the associated lifted point process on R × [0, 1]. Let {Gt}t≥0 be a history independent of $\tilde N_U$, and let for all t ≥ 0,

\[ \mathcal F_t := \mathcal G_t\vee\mathcal F_t^{\tilde N_U}. \]

Let {λ(t)}t≥0 be a non-negative Ft-predictable process bounded by $\tilde\lambda < \infty$, and define a point process N on (R, B(R)) by

\[ N(C) = \sum_{n\in\mathbb Z} 1_C(\tilde T_n)\,1_{\{0\le U_n\le \lambda(\tilde T_n)/\tilde\lambda\}} \qquad (C\in\mathcal B(\mathbb R)). \tag{15.47} \]

Then N admits the Ft-intensity {λ(t)}t≥0.

Theorem 15.3.6 (Inverse embedding) Let N be a simple point process on (R, B(R)) with Ft-intensity {λ(t)}t≥0 (assumed Ft-predictable, without loss of generality). Suppose that there exists a constant $\tilde\lambda < \infty$ such that P-a.s.

\[ \lambda(t,\omega)\le\tilde\lambda \qquad (t\ge 0). \]

Then, there exists a compound Poisson process (Ñ, U) on (R, B(R)) with marks in ([0, 1], B([0, 1])), characteristics ($\tilde\lambda$, Q) where Q is the uniform distribution on [0, 1], and such that (15.47) holds.

Remark 15.3.7 Grigelionis' construction is of particular importance for coupling point processes. Usually one starts with a point process N1 from which one constructs N̄ via the inverse embedding theorem, and then using this same N̄, one constructs another point process N2.

Example 15.3.8: Lewis–Shedler–Ogata Simulation Algorithms. An immediate application of practical importance of the direct embedding theorems is to the simulation of point processes with a stochastic intensity. We start with a simple example of the methodology. Suppose that one wishes to simulate a point process on the positive half-line with a stochastic intensity of the form

\[ \lambda(t,\omega) = v\bigl(t,N_{[0,t)}(\omega)\bigr), \]

where v : R+ × Mp(R+) → R+ is measurable with respect to B(R+) ⊗ Mp(R+) and B(R+) and bounded, say, by K < ∞. For this we can use Theorem 15.3.5, which says that it suffices to thin a homogeneous Poisson process Ñ on R+ with intensity K, examining its points sequentially, keeping a point $\tilde T_n$ of it if and only if

\[ U_n \le \frac{v\bigl(\tilde T_n,N_{[0,\tilde T_n)}\bigr)}{K}, \]

where {Un}n≥1 is an iid sequence of random variables uniformly distributed on [0, 1] and independent of Ñ.

This method is easily adapted to the case of point processes with an $\mathcal F_t^N\vee\mathcal G_0$-intensity of the form

\[ \lambda(t,\omega) = v\bigl(t,X(t-,\omega),N_{[0,t)}(\omega)\bigr), \tag{$\star\star$} \]

where
(i) {X(t)}t≥0 is some corlol stochastic process with values in some measurable space (K, K) and $\mathcal G_0 = \mathcal F_\infty^X := \vee_{t\ge 0}\mathcal F_t^X$,
(ii) v : R+ × K × Mp(R+) → R+ is measurable with respect to B(R+) ⊗ K ⊗ Mp(R+) and B(R+), and is bounded by K (in other words, the point process we seek to simulate is a semi-Cox point process),
(iii) {X(t)}t≥0 is independent of Ñ, and
(iv) we suppose that we have at our disposition at any time t the value X(t).
A point $\tilde T_n$ of Ñ is kept if and only if

\[ U_n \le \frac{v\bigl(\tilde T_n,X(\tilde T_n-),N_{(0,\tilde T_n)}\bigr)}{K}. \]

We can make another step towards generalization assuming an $\mathcal F_t^N\vee\mathcal F_t^X$-intensity of the form (⋆⋆). But this time we must add the condition that, denoting by Tn the nth point of N, one can construct X(t) from Tn based on the knowledge of N[0,t). One example is when the process {X(t)}t≥0 is of the form

\[ X(t) = X(0) + \int_0^t \varphi(N(s-))\,ds + \int_0^t \sigma(N(s-))\,dW(s), \]

where {W(t)}t≥0 is a Wiener process independent of Ñ. Other examples, concerning for instance mutually exciting point processes, are left to the imagination of the reader. The relaxation of the hypothesis of boundedness of the stochastic intensity is not a problem since one can always vary the intensity of the Poisson process Ñ when needed.
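The sequential thinning of Example 15.3.8 can be sketched as follows for an intensity of the form v(t, N[0,t)) bounded by K. The particular function `v` below, a baseline plus exponentially decaying contributions of the past points, capped at `cap`, is only an assumed illustration of a history-dependent bounded intensity; it is not taken from the text, and the function names and parameter values are hypothetical.

```python
import math
import random


def thinning(T, v, K, rng):
    """Simulate on (0, T] a point process with intensity v(t, past) <= K.

    Candidate points come from a homogeneous Poisson process of intensity K;
    a candidate t is kept when U <= v(t, past) / K, where `past` is the list
    of points already kept, all of which are strictly before t.
    """
    kept, t = [], 0.0
    while True:
        t += rng.expovariate(K)              # next candidate point
        if t > T:
            return kept
        if rng.random() <= v(t, kept) / K:   # independent uniform mark U_n
            kept.append(t)


def v(t, history, base=0.5, jump=0.8, decay=2.0, cap=3.0):
    """Assumed bounded intensity: baseline plus decaying effect of past points."""
    excitation = sum(jump * math.exp(-decay * (t - s)) for s in history)
    return min(cap, base + excitation)
```

For instance, `thinning(50.0, v, 3.0, random.Random(42))` returns an increasing list of points in (0, 50]. The bound K must dominate v everywhere; here this holds because v is capped at 3.0.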

Complementary reading

[Last and Brandt, 1995] is an advanced text on the martingale theory of point processes on the line, with or without a stochastic intensity. See also [Brémaud, 2020].

15.4 Exercises

Exercise 15.4.1. The fundamental (local) martingale
Show that Definition 15.1.1 implies that for all a ∈ R,

\[ M_a(t) := N(a,t] - \int_a^t \lambda(s)\,ds \qquad (t\ge a) \]

is an Ft-local martingale.

Exercise 15.4.2. Connecting to the intuitive definition
Let the simple and locally finite point process N on R have the Ft-intensity {λ(t)}t≥0, and suppose that t → λ(t, ω) is for all ω ∈ Ω a right-continuous function, and that the function (t, ω) → λ(t, ω) is uniformly bounded. Show that

\[ \lim_{h\downarrow 0} \frac1h E\bigl[N(t,t+h]\mid\mathcal F_t\bigr] = \lambda(t), \qquad P\text{-a.s.} \]

Exercise 15.4.3. The traffic equations
Show that in equilibrium, the traffic equations of the Jackson network (see Chapter 9) receive the following interpretation:

\[ \lambda_i = E[A_i(0,1]], \qquad \lambda_ir_{ij} = E[A_{i,j}(0,1]], \]

where Ai is the point process counting the arrivals (external and internal) into station i, and Ai,j is the point process counting the transfers from station i to station j.

Exercise 15.4.4. Cox processes
Let N be a doubly stochastic Poisson process (Cox process) with respect to the σ-field G, with locally integrable stochastic intensity {λ(t)}t≥0. Remember that this means that $\mathcal G\supseteq\mathcal F_\infty^\lambda := \sigma(\lambda(s),\ s\in\mathbb R)$ and that whenever C1, . . . , CK are bounded disjoint measurable subsets of R,

\[ E\Bigl[\exp\Bigl\{i\sum_{j=1}^K u_jN(C_j)\Bigr\}\Bigm|\mathcal G\Bigr] = \exp\Bigl\{\sum_{j=1}^K (e^{iu_j}-1)\int_{C_j}\lambda(s)\,ds\Bigr\} \]

for all u1, . . . , uK ∈ R. Show that N admits the Ft-intensity {λ(t)}t≥0, where $\mathcal F_t = \mathcal G\vee\mathcal F_t^N$.

Exercise 15.4.5. Other histories
Let N be a simple and locally finite point process on R with Ft-intensity {λ(t)}t≥0. Prove the following:
(i) If {Gt}t≥0 is a history such that G∞ is independent of F∞, then {λ(t)}t≥0 is an Ft ∨ Gt-intensity of N.
(ii) If $\{\tilde{\mathcal F}_t\}_{t\ge 0}$ is a history of N such that $\mathcal F_t^\lambda\vee\mathcal F_t^N\subseteq\tilde{\mathcal F}_t\subseteq\mathcal F_t$, then {λ(t)}t≥0 is an $\tilde{\mathcal F}_t$-intensity of N.

Exercise 15.4.6. About predictability
(a) Show that a deterministic measurable process is Ft-predictable, and that an Ft-predictable process is Ft-progressively measurable, and in particular measurable and Ft-adapted.
(b) Let S and τ be two Ft-stopping times such that S ≤ τ, and let ϕ : R+ × R → R be a measurable function. Then X(t, ω) = ϕ(S(ω), t)1{S(ω) 0. Then for P-almost all ω ∈ A, θ^n ω ∈ A for an infinity of indices n ≥ 1.

Proof. Consider the measurable set

\[ F := \{\omega\in A \,;\ \theta^n\omega\notin A \text{ for all } n\ge 1\} = A\setminus\bigcup_{n\ge 1}\{\omega \,;\ \theta^n\omega\in A\}. \]

If m > n, θ^n F ∩ θ^m F = ∅. (Indeed, if θ^{−m}ω ∈ F ⊆ A, we have by definition of F that θ^{−n}ω = θ^{m−n}(θ^{−m}ω) ∉ A.) By the θ-invariance of P, P(F) = P(θ^n F) for all n ≥ 1, and therefore, for all M ≥ 1,

\[ P(F) = \frac1M\sum_{n=1}^M P(\theta^nF) = \frac1M P\Bigl(\bigcup_{n=1}^M \theta^nF\Bigr) \le \frac1M \]

and therefore P(F) = 0 since M is arbitrary. In other words, outside of a negligible set N1, for every point ω ∈ A, there exists an n1 such that θ^{n1}ω ∈ A. By the same argument applied to θ^k (k ≥ 1), outside of a negligible set Nk, for every point ω ∈ A, there exists an nk such that θnk ω ∈ A. In particular, for all ω outside the negligible set N := ∪k≥1 Nk, there is for all k ≥ 1 an nk such that θnk ω ∈ A ∩ Nnk. Therefore for all ω outside N, θ^n ω ∈ A for an infinity of n.

The following lemma will be useful on several occasions.

Lemma 16.1.5 Let (P, θ) be a stationary framework. Let Z be a non-negative P-a.s. finite random variable such that Z − Z ◦ θ ∈ $L^1_{\mathbb R}(P)$. Then E[Z − Z ◦ θ] = 0.

Proof. (The delicate point here is that Z is not assumed integrable.) For any C > 0, |Z ∧ C − (Z ∧ C) ◦ θ| ≤ |Z − Z ◦ θ|. By the θ-invariance of P, E[Z ∧ C − (Z ∧ C) ◦ θ] = 0 and the conclusion follows by dominated convergence, letting C ↑ ∞ in the last equality.

Let (P, θ) be a stationary framework on (Ω, F).

Definition 16.1.6 An event A ∈ F is called strictly θ-invariant if θ^{−1}(A) = A. It is called θ-invariant if P(A △ θ^{−1}(A)) = 0, where △ denotes the symmetric difference.

Definition 16.1.7 The discrete flow {θ^n}n∈Z is called ergodic (with respect to P) if all θ-invariant events are P-trivial (that is: of probability either 0 or 1). We shall also say: (P, θ) is ergodic.


Observe that for any θ-invariant event A, the event $B = \bigcap_{n\in\mathbb{Z}} \bigcup_{k\ge n} \theta^{-k}A$ is strictly θ-invariant and such that P(A) = P(B). Therefore, for any θ-invariant event, there exists a strictly θ-invariant event with the same probability. In particular, the flow is ergodic if and only if all strictly θ-invariant events are trivial.

Example 16.1.8: Irrational translations on the torus, take 1. Let (Ω, F) := ((0, 1], B((0, 1])), and let P be the Lebesgue measure on (0, 1]. Let d be some real number. Define θ : (Ω, F) → (Ω, F) by θ(ω) = ω + d mod 1. We show that (P, θ) is ergodic if and only if d is irrational.

Proof. Clearly, P is θ-invariant. Let A ∈ B((0, 1]) and consider the Fourier series development of $1_A$,

$$1_A(\omega) = \sum_{n} a_n e^{2i\pi n\omega}, \quad \ell\text{-almost everywhere}, \qquad \text{where } a_n = \int_A e^{-2i\pi n\omega}\,d\omega.$$

With $c := e^{2i\pi d}$, we have

$$a_n = \int_{\theta^{-1}(A)} e^{-2i\pi n\theta(\omega)}\,d\omega = c^{-n} \int_{\theta^{-1}(A)} e^{-2i\pi n\omega}\,d\omega.$$

Therefore the n-th Fourier coefficient of $1_{\theta^{-1}(A)}$ is $c^n a_n$. In particular,

$$1_{\theta^{-1}(A)}(\omega) = \sum_{n} c^n a_n e^{2i\pi n\omega}, \quad \ell\text{-almost everywhere}.$$

Therefore, A is θ-invariant if and only if $c^n a_n = a_n$ for all n. Now, when d is irrational, c is not a root of unity and then necessarily $a_n = 0$ for all n ≠ 0, that is, $1_A$ is a.s. a constant (= $a_0$), necessarily 0 or 1: A is a trivial set. The fact that (P, θ) is not ergodic if d is rational is left as an exercise (Exercise 16.4.1). □

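Theorem 16.1.9 If (P, θ) is ergodic and if A ∈ F satisfies either $A \subseteq \theta^{-1}A$ or $\theta^{-1}A \subseteq A$, then A is trivial.

In other words, if (P, θ) is ergodic, an event that is either contracted or expanded by θ is necessarily trivial.

Proof. Since P is θ-invariant, for all A ∈ F,

$$P(A \setminus (A\cap\theta^{-1}A)) = P(A) - P(A\cap\theta^{-1}A) = P(\theta^{-1}A) - P(A\cap\theta^{-1}A) = P(\theta^{-1}A \setminus (A\cap\theta^{-1}A)),$$

and therefore $P(A \,\triangle\, \theta^{-1}A) = 2P(A \cap \overline{\theta^{-1}A}) = 2P(\theta^{-1}A \cap \overline{A})$, where the overline denotes the complement. Therefore A is θ-invariant if and only if at least one (and then both) of $P(A \cap \overline{\theta^{-1}A})$ and $P(\theta^{-1}A \cap \overline{A})$ is null. In particular, if $A \subseteq \theta^{-1}A$, then $P(A \cap \overline{\theta^{-1}A}) = 0$ and therefore A is θ-invariant, and therefore trivial. The case $\theta^{-1}A \subseteq A$ is treated in the same way. □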


The main result of the current chapter is Birkhoff’s pointwise ergodic theorem:

Theorem 16.1.10 If (P, θ) is ergodic, then for all $f \in L^1_{\mathbb{R}}(P)$,

$$\lim_{N\uparrow\infty} \frac{1}{N}\sum_{n=1}^{N} f\circ\theta^n = E[f], \quad P\text{-a.s.} \tag{16.1}$$

The proof will be given in Section 16.3.

Remark 16.1.11 Theorem 16.1.10 entails a considerable improvement on the ergodic theorem for irreducible positive recurrent aperiodic hmcs, which are indeed ergodic (actually, mixing) in the sense given to this word in the present chapter.¹ At the price of transferring this chain to the canonical space of random sequences equipped with the canonical shift, we have that for any non-negative measurable function $g : E^{\mathbb{N}} \to \mathbb{R}$,²

$$\lim_{N\uparrow\infty} \frac{1}{N}\sum_{k=1}^{N} g(X_k, X_{k+1}, \ldots) = E_\pi[g(X_0, X_1, \ldots)], \quad P_\mu\text{-a.s.},$$

for any initial distribution μ of the chain, where $E_\pi$ denotes expectation when the initial distribution is π, the stationary distribution. Note that there is a slight difference with the ergodic theorem in that the initial distribution may be different from π. We may take this liberty because an irreducible positive recurrent aperiodic chain starting with an arbitrary distribution eventually couples with a stationary chain.
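As a numerical illustration of the time averages in (16.1), the following sketch iterates the translation of Example 16.1.8 and records how often the orbit visits the set (0, 1/4]. The function name, the starting point 0.1, the set (0, 1/4] and the two values of d are choices made here for illustration; they do not come from the text. A single starting point is used, whereas the theorem is a statement about P-almost all ω, so this is a plausibility check and not a proof.

```python
import math


def orbit_frequency(d, omega, n_steps, a=0.25):
    """Fraction of the first n_steps iterates of omega under
    theta(x) = x + d mod 1 that fall in the set (0, a]."""
    hits = 0
    x = omega
    for _ in range(n_steps):
        x = (x + d) % 1.0  # one application of theta
        if 0.0 < x <= a:
            hits += 1
    return hits / n_steps


N_STEPS = 100_000

# Irrational d: the time average should approach P((0, 1/4]) = 1/4.
freq_irrational = orbit_frequency(math.sqrt(2) - 1, 0.1, N_STEPS)

# Rational d = 1/2: the orbit of 0.1 only visits two points (near 0.6 and 0.1),
# so the time average is 1/2 and differs from P((0, 1/4]) = 1/4.
freq_rational = orbit_frequency(0.5, 0.1, N_STEPS)

print(freq_irrational, freq_rational)
```

For rational d the limit of the time average depends on the starting point, which is the behavior described by the non-ergodic version of the theorem (Theorem 16.3.6).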

16.1.2 Mixing

We now introduce a particular form of ergodicity.

Definition 16.1.12 The discrete flow $\{\theta^n\}_{n\in\mathbb{Z}}$ is called P-mixing if for all events A, B ∈ F,

$$\lim_{n\uparrow\infty} P(A \cap \theta^{-n}B) = P(A)P(B). \tag{16.2}$$

We shall also say: (P, θ) is mixing.

Mixing is a property of “forgetfulness of the initial conditions” since, when P(A) > 0, condition (16.2) is equivalent to

$$\lim_{n\uparrow\infty} P(\theta^{-n}B \mid A) = P(B).$$

Theorem 16.1.13 If (16.2) holds for all $A, B \in \mathcal{A}$, where $\mathcal{A}$ is an algebra generating F, then (P, θ) is mixing.

¹ There is an unfortunate tradition that reserves the term “ergodic” for an hmc that is irreducible positive recurrent and aperiodic. Such a chain is in fact more than ergodic, since it is mixing.
² For instance $g(X_0, X_1, \ldots)$ could be the number of consecutive visits of the chain to a given state i without visiting another given state j in between.


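Proof. To any fixed A, B ∈ F and any ε > 0, one can associate $A', B' \in \mathcal{A}$ such that $A \,\triangle\, A'$ and $B \,\triangle\, B'$ have probabilities less than ε (Lemma 4.3.8). The same is true of $(\theta^{-n}A) \,\triangle\, (\theta^{-n}A') = \theta^{-n}(A \,\triangle\, A')$ and of $(\theta^{-n}B) \,\triangle\, (\theta^{-n}B')$. In particular, $(A \cap \theta^{-n}B) \,\triangle\, (A' \cap \theta^{-n}B')$ has probability less than 2ε, and

$$P(A' \cap \theta^{-n}B') - 2\varepsilon \le P(A \cap \theta^{-n}B) \le P(A' \cap \theta^{-n}B') + 2\varepsilon.$$

Taking the lim sup and the lim inf (note that $|P(A')P(B') - P(A)P(B)| \le 2\varepsilon$), and then letting ε ↓ 0 yields the result. □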

Example 16.1.14: Mixing homogeneous Markov chains. This example continues Example 16.1.2, to which the reader is referred for the notation. The set E is now assumed countable. Let $\mathbf{P}$ be a transition matrix indexed by E that is irreducible positive recurrent, with (unique) stationary distribution π. Let P be the unique probability measure on $(E^{\mathbb{Z}}, \mathcal{E}^{\otimes\mathbb{Z}})$ that makes $\{X_n\}_{n\in\mathbb{Z}}$ a stationary hmc with transition matrix $\mathbf{P}$. Suppose moreover that $\mathbf{P}$ is aperiodic. Then (P, θ) is mixing. Indeed, it suffices, in view of Theorem 16.1.13, to verify (16.2) for sets A and B of the form

$$A = \{X_{k_1} = i_1, \ldots, X_{k_p} = i_p\}, \qquad B = \{X_{\ell_1} = j_1, \ldots, X_{\ell_q} = j_q\},$$

where $k_1 < \cdots < k_p$ and $\ell_1 < \cdots < \ell_q$. This is true since, for $n > k_p - \ell_1$,

$$P(A \cap \theta^{-n}B) = \Big(\pi(i_1)\, p_{i_1,i_2}(k_2 - k_1) \cdots p_{i_{p-1},i_p}(k_p - k_{p-1})\Big) \times \Big(p_{i_p,j_1}(n + \ell_1 - k_p)\, p_{j_1,j_2}(\ell_2 - \ell_1) \cdots p_{j_{q-1},j_q}(\ell_q - \ell_{q-1})\Big),$$

and $\lim_{n\uparrow\infty} p_{i_p,j_1}(n + \ell_1 - k_p) = \pi(j_1)$, so that

$$\lim_{n\uparrow\infty} p_{i_p,j_1}(n + \ell_1 - k_p) \cdots p_{j_{q-1},j_q}(\ell_q - \ell_{q-1}) = \pi(j_1)\, p_{j_1,j_2}(\ell_2 - \ell_1) \cdots p_{j_{q-1},j_q}(\ell_q - \ell_{q-1}) = P(B).$$

If A is strictly invariant for the mixing flow θ, then taking B = A in (16.2) gives $P(A) = P(A)^2$, so that P(A) is 0 or 1. Therefore:

Theorem 16.1.15 A mixing flow is ergodic.

However, an ergodic flow is not necessarily mixing:

Example 16.1.16: Irrational translations on the torus, take 2. The flow of Example 16.1.8 is not mixing even if d is irrational. To see this, take A = B = (0, 1/2]. Since the set {nd mod 1 ; n ≥ 1} is dense in (0, 1] when d is irrational (a consequence of the celebrated Weyl equidistribution theorem), $\theta^{-n}A$ and B arbitrarily nearly coincide for an infinite number of indices n. Therefore (16.2) cannot hold.
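The failure of (16.2) in Example 16.1.16 can be made quantitative. For A = B = (0, 1/2] and the translation by d, the set $\theta^{-n}A$ is a half-circle shifted by nd mod 1, and an elementary computation (done here, not in the text) gives $P(A \cap \theta^{-n}A) = |1/2 - s_n|$ with $s_n = nd \bmod 1$. The sketch below evaluates this closed form for d = √2 − 1, a value chosen for illustration; the function and variable names are likewise assumptions of this example.

```python
import math


def correlation(n, d):
    """P(A ∩ theta^{-n} A) for A = (0, 1/2] under omega -> omega + d mod 1.

    Two half-circles whose relative shift is s = n*d mod 1 overlap on a
    set of Lebesgue measure |1/2 - s|.
    """
    s = (n * d) % 1.0
    return abs(0.5 - s)


d = math.sqrt(2) - 1
N = 10_000
values = [correlation(n, d) for n in range(1, N + 1)]

# The sequence keeps oscillating between values near 0 and values near 1/2,
# so it has no limit, and in particular does not converge to P(A)P(A) = 1/4.
lowest, highest = min(values), max(values)

# Its Cesaro mean, on the other hand, is close to 1/4.
cesaro_mean = sum(values) / N

print(lowest, highest, cesaro_mean)
```

The convergence of the Cesàro means to P(A)P(B), without convergence of the individual terms, is what separates ergodicity from mixing; compare with the criterion (16.3).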

Theorem 16.1.17 For any algebra $\mathcal{A}$ generating F, (P, θ) is ergodic if and only if for all $A, B \in \mathcal{A}$,

$$\lim_{n\uparrow\infty} \frac{1}{n}\sum_{k=1}^{n} P(A \cap \theta^{-k}(B)) = P(A)P(B). \tag{16.3}$$


Proof. If (P, θ) is ergodic then, by the ergodic theorem, for all A, B ∈ F,

$$\lim_{n\uparrow\infty} \frac{1}{n}\sum_{k=1}^{n} 1_A(\omega)\, 1_B(\theta^k(\omega)) = 1_A(\omega) P(B), \quad P\text{-a.s.},$$

and therefore, taking expectations (dominated convergence applies since the averages are bounded by 1), (16.3) follows. Conversely, if (16.3) is true for all A, B ∈ F, then taking an invariant set A = B, we obtain that $P(A) = P(A)^2$, and therefore A has probability 0 or 1. The fact that we can restrict A and B to be in $\mathcal{A}$ is proved in the same way as in Theorem 16.1.13. □

Example 16.1.18: Ergodic but not mixing hmc. The setting is as in Example 16.1.14, except that we do not assume aperiodicity. If the period is ≥ 2, the shift is not mixing any more. To see this let $C_0$ and $C_1$ be two consecutive cyclic classes of the chain. Then it is not true that $\lim_{n\uparrow\infty} P(X_0 = i, X_n = j) = \pi(i)\pi(j)$ when $i \in C_0$ and $j \in C_1$. However, with a proof similar to that of Example 16.1.14, one can prove ergodicity using Theorem 16.1.17 and the fact that for a positive recurrent hmc

$$\lim_{n\uparrow\infty} \frac{1}{n}\sum_{k=1}^{n} p_{ij}(k) = \pi(j).$$

(Exercise 16.4.3.)
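The contrast between Examples 16.1.14 and 16.1.18 is already visible on two-state chains. The sketch below uses two transition matrices chosen here for illustration (they are not taken from the text): a deterministic flip-flop of period 2, and an aperiodic chain with stationary distribution (2/3, 1/3). It computes the n-step probabilities $p_{00}(n)$ by repeated matrix multiplication; the helper names are assumptions of this example.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def n_step_matrices(p, n_max):
    """Return the list [P^1, P^2, ..., P^n_max]."""
    powers = []
    current = p
    for _ in range(n_max):
        powers.append(current)
        current = mat_mul(current, p)
    return powers


periodic = [[0.0, 1.0], [1.0, 0.0]]   # period 2, stationary distribution (1/2, 1/2)
aperiodic = [[0.5, 0.5], [1.0, 0.0]]  # aperiodic, stationary distribution (2/3, 1/3)

N = 200
p00_periodic = [m[0][0] for m in n_step_matrices(periodic, N)]
p00_aperiodic = [m[0][0] for m in n_step_matrices(aperiodic, N)]

# Periodic chain: p00(n) alternates between 0 and 1 and has no limit,
# but its Cesaro mean equals pi(0) = 1/2.
cesaro_periodic = sum(p00_periodic) / N

# Aperiodic chain: p00(n) itself converges to pi(0) = 2/3.
limit_aperiodic = p00_aperiodic[-1]

print(p00_periodic[:4], cesaro_periodic, limit_aperiodic)
```

Since $P(X_0 = i, X_n = j) = \pi(i)p_{ij}(n)$ for the stationary chain, the first behavior corresponds to ergodicity without mixing and the second to mixing.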

The Stochastic Process Point of View

In applications, one speaks in terms of a stochastic process rather than flows. The connection between the two points of view is made via canonical spaces, as follows. To any stochastic process $\{X_n\}_{n\in\mathbb{Z}}$ taking values in the measurable space $(E, \mathcal{E})$ and defined on the probability space (Ω, F, P), one can associate a canonical version, by transporting the process on the canonical measurable space $(E^{\mathbb{Z}}, \mathcal{E}^{\otimes\mathbb{Z}})$ as explained after the statement of Theorem 5.1.7. Let $P_X$ denote the probability distribution of $\{X_n\}_{n\in\mathbb{Z}}$ (therefore a probability on the canonical space), and let S denote the shift on the canonical space: $S : (x_n, n \in \mathbb{Z}) \mapsto (y_n, n \in \mathbb{Z})$ where $y_n := x_{n+1}$.

Definition 16.1.19 The stochastic process $\{X_n\}_{n\in\mathbb{Z}}$ taking values in the measurable space $(E, \mathcal{E})$ is said to be ergodic (resp. mixing) iff $(P_X, S)$ is ergodic (resp. mixing).

Therefore, in discussing ergodicity of a stochastic process, it is best to assume that it is the coordinate process defined on the corresponding canonical space. This is the convention adopted in the sequel. In other words, when speaking of an ergodic stochastic process {Xn }n∈Z taking values in the measurable space (E, E), we implicitly assume that (Ω, F) = (E Z , E ⊗Z ) and that (P, θ) = (PX , S).


Remark 16.1.20 From the above discussion, we see that a way to decide if a given process is ergodic is to see if it can be obtained in the form $\{f(\ldots, x_{n-1}, x_n, x_{n+1}, \ldots)\}_{n\in\mathbb{Z}}$, where $\{x_n\}_{n\in\mathbb{Z}}$ is an ergodic process. The formalization of this is left for the reader. For instance, let $\{X_n\}_{n\in\mathbb{Z}}$ be an ergodic³ hmc. Then, the process $\{Y_n\}_{n\in\mathbb{Z}}$, where $Y_n$ is the number of times k ($\tau_n \le k \le n$) for which $X_k = 0$ and where $\tau_n$ is the last time before n where the hmc took the value 1, is ergodic.

16.1.3 The Convex Set of Ergodic Probabilities

The set of ergodic probabilities coincides with the set of extremal points of the convex set of stationary probabilities. More precisely:

Theorem 16.1.21 (P, θ) is ergodic if and only if there exists no decomposition

$$P = \alpha_1 P_1 + \alpha_2 P_2 \quad \text{with } \alpha_1 + \alpha_2 = 1 \text{ and } \alpha_1 > 0,\ \alpha_2 > 0, \tag{16.4}$$

where $P_1$ and $P_2$ are distinct θ-invariant probabilities.
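A small worked instance may help to picture a decomposition of the type (16.4). It is not taken from the text: it uses the translation of Example 16.1.8 with the rational value d = 1/2 and a set A chosen here for illustration.

```latex
\[
A := (0,\tfrac14] \cup (\tfrac12,\tfrac34], \qquad
\theta^{-1}(A) = \{\omega \,;\ \omega + \tfrac12 \bmod 1 \in A\}
              = (\tfrac12,\tfrac34] \cup (0,\tfrac14] = A, \qquad
P(A) = \tfrac12 ,
\]
\[
P = \tfrac12\, P(\,\cdot \mid A) + \tfrac12\, P(\,\cdot \mid \overline{A}) .
\]
```

Here $P_1 = P(\cdot \mid A)$ and $P_2 = P(\cdot \mid \overline{A})$ are θ-invariant because A and its complement are strictly θ-invariant, and they are distinct since $P_1(A) = 1$ while $P_2(A) = 0$. Consequently (P, θ) is not ergodic for d = 1/2, in agreement with Example 16.1.8.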

Proof. We need two lemmas.

Lemma 16.1.22 If $(P_1, \theta)$ and $(P_2, \theta)$ are both ergodic, then either $P_1 = P_2$ or $P_1 \perp P_2$.

Proof. If $P_1$ and $P_2$ do not coincide, there exists an A ∈ F such that $P_1(A) \ne P_2(A)$. In particular, $B_1 \cap B_2 = \emptyset$, where for i = 1, 2,

$$B_i = \Big\{\omega;\ \lim_{n\uparrow\infty} \frac{1}{n}\sum_{k=1}^{n} 1_A \circ \theta^k(\omega) = P_i(A)\Big\}.$$

Also, by ergodicity (Theorem 16.1.10), $P_1(B_1) = 1$ and $P_2(B_2) = 1$. Therefore $P_1 \perp P_2$. □

Lemma 16.1.23 (P, θ) is ergodic if and only if there exists no θ-invariant probability $P_1$ distinct from P such that $P_1 \ll P$.

Proof. If (P, θ) is ergodic and $P_1 \ll P$, then $(P_1, \theta)$ is ergodic. (In fact, if A is θ-invariant, then either P(A) = 0 or $P(\overline{A}) = 0$, and therefore, by the absolute continuity hypothesis, $P_1(A) = 0$ or $P_1(\overline{A}) = 0$.) By Lemma 16.1.22, only the possibility $P_1 = P$ remains. Suppose now (P, θ) not ergodic. This means there exists a non-trivial θ-invariant set A ∈ F: 0 < P(A) < 1. Define $P_1(B) = P(B \mid A)$. In particular, $P_1 \ll P$, and $P_1 \ne P$ since $P_1(A) = 1$. Also $P_1$ is θ-invariant. Indeed:

$$P_1(\theta^{-1}(B)) = \frac{P(\theta^{-1}(B) \cap A)}{P(A)} = \frac{P(\theta^{-1}(B) \cap \theta^{-1}(A))}{P(A)} = \frac{P(B \cap A)}{P(A)} = P(B \mid A) = P_1(B). \qquad \Box$$

³ In the sense of Markov chain theory, that is, irreducible, aperiodic and positive recurrent.

We are now ready to prove Theorem 16.1.21. Suppose (P, θ) is ergodic and that $P = \alpha_1 P_1 + \alpha_2 P_2$, where $\alpha_1 + \alpha_2 = 1$, $\alpha_1 > 0$, $\alpha_2 > 0$, and where $P_1$ and $P_2$ are distinct θ-invariant probabilities. In particular, $P_1 \ll P$. Moreover $P_1 \ne P$ (otherwise $P_2 = P = P_1$). By Lemma 16.1.23, this contradicts the ergodicity of (P, θ). Suppose now that (P, θ) is not ergodic. There exists an invariant set A ∈ F that is non-trivial: 0 < P(A) < 1. The decomposition

$$P(B) = P(A)P(B \mid A) + P(\overline{A})P(B \mid \overline{A}) = \alpha_1 P_1(B) + \alpha_2 P_2(B)$$

is such that $\alpha_1 + \alpha_2 = 1$, $\alpha_1 > 0$, $\alpha_2 > 0$, and $P_1$ and $P_2$ are distinct and θ-invariant. □

16.2 A Detour into Queueing Theory

We will provide a proof of Theorem 16.1.10 after taking a detour into queueing theory.

16.2.1 Lindley’s Sequence

Let (Ω, F, P) be a probability space and let θ : (Ω, F) → (Ω, F) be a bijective measurable map with measurable inverse. Suppose that (P, θ) is ergodic. Let σ and τ be integrable non-negative random variables defined on (Ω, F, P). A Lindley process associated with these random variables is a stochastic process $\{W_n\}_{n\in T}$, where T = N or Z, satisfying the recursion equation

$$W_{n+1} = (W_n + \sigma_n - \tau_n)^+ \qquad (n \in T), \tag{16.5}$$

where $\sigma_n = \sigma\circ\theta^n$ and $\tau_n = \tau\circ\theta^n$.

This equation will be interpreted in terms of queueing since this will greatly help our intuition in the forthcoming developments. Define the event times sequence $\{T_n\}_{n\in\mathbb{Z}}$, where $T_0 = 0$ and for all n ∈ Z, $T_{n+1} - T_n = \tau_n$. We interpret $T_n$ as the arrival time in a queueing system of customer n, and $\sigma_n$ as the amount of service (in time units) required by this customer. Define

$$\rho = \frac{E[\sigma]}{E[\tau]}.$$


If we interpret E[τ ]−1 as the rate of arrivals of customers, ρ is the traffic intensity, that is, the average amount of work brought into the system per unit of time. (However we shall not need this interpretation.) Service is provided at unit rate whenever there remains at least one customer. Otherwise there is no further prescription as to service discipline, priorities, and so on. If Wn is the total service remaining to be done just before customer n arrives (that is, at time Tn −), then, obviously, the Lindley recurrence (16.5) is satisfied. In this interpretation, the Lindley process is usually called the workload process.

16.2.2 Loynes’ Equation

When T = N, the Lindley process is recursively calculable from the initial workload $W_0$, but in the case T = Z, we have nowhere to start the recursion. This corresponds to the situation of a queueing system that has been operating from the infinite past. We may expect that under certain circumstances (of course, a good guess is that ρ < 1 will do) the workload process has a stationary version. One is therefore led to pose the problem in the following terms: exhibit a finite non-negative random variable W such that the Lindley recursion (16.5) is satisfied for $\{W_n := W\circ\theta^{n}\}_{n\in\mathbb{Z}}$. Equivalently: we try to find a finite non-negative random variable W such that

$$W\circ\theta = (W + \sigma - \tau)^+. \tag{16.6}$$

The above equation is called Loynes’ equation.

Theorem 16.2.1 (⁴) If ρ < 1, there exists a unique finite non-negative solution W of Loynes’ equation (16.6).

Proof. For n ≥ 0, define $M_n$ to be the workload found by customer 0 assuming that customer −n found an empty queue upon arrival. In particular, $M_0 = 0$. One checks by induction that

$$M_n = \Big(\max_{1\le m\le n} \sum_{i=1}^{m} (\sigma_{-i} - \tau_{-i})\Big)^+. \tag{16.7}$$

In particular, $M_n$ is integrable for all n ∈ N, being smaller than the integrable random variable $\sum_{i=1}^{n} |\sigma_{-i} - \tau_{-i}|$. Furthermore, the sequence $\{M_n\}_{n\ge 0}$ satisfies the recurrence relation

$$M_{n+1}\circ\theta = (M_n + \sigma - \tau)^+ \tag{16.8}$$

and (16.7) shows that it is non-decreasing. Denoting by $M_\infty$ the limit

$$M_\infty = \lim_{n\to\infty} \uparrow M_n = \Big(\sup_{n\ge 1} \sum_{i=1}^{n} (\sigma_{-i} - \tau_{-i})\Big)^+ \tag{16.9}$$

and letting n go to ∞ in (16.8), we see that $M_\infty$ is a non-negative random variable satisfying

$$M_\infty\circ\theta = (M_\infty + \sigma - \tau)^+. \tag{16.10}$$

The random variable $M_\infty$ is often referred to as Loynes’ variable, while the sequence $\{M_n\}_{n\ge 0}$ is called Loynes’ sequence. The random variable $M_\infty$ can take infinite values. Using the identity $(a - b)^+ = a - a\wedge b$, Equality (16.8) becomes

$$M_{n+1}\circ\theta = M_n - M_n\wedge(\tau - \sigma) \tag{16.11}$$

and therefore, since P is θ-invariant, and $\{M_n\}_{n\ge 1}$ is non-decreasing and integrable,

$$E[M_n\wedge(\tau - \sigma)] = E[M_n - M_{n+1}\circ\theta] = E[M_n - M_{n+1}] \le 0.$$

It follows by monotone convergence that

$$E[M_\infty\wedge(\tau - \sigma)] \le 0. \tag{16.12}$$

⁴ [Loynes, 1962].

Equality (16.10) shows that the event $\{M_\infty = \infty\}$ is θ-invariant (recall that σ and τ are finite). Therefore, by ergodicity, $P(M_\infty = \infty)$ is either 0 or 1. In view of (16.12), $P(M_\infty = \infty) = 1$ implies E[τ − σ] ≤ 0. Therefore, the condition E[σ] < E[τ] implies that $M_\infty < \infty$, P-a.s.

The solution of Loynes’ equation that we just gave (that is, $M_\infty$) is the minimal non-negative solution. In order to prove this, it suffices to show that $W \ge M_n$ for all n ≥ 0 (where W is a non-negative solution of Loynes’ equation) and then let n ↑ ∞ to obtain $W \ge M_\infty$. This is proved by induction. The first term of the induction is satisfied since $W \ge 0 = M_0$. Now $W \ge M_n$ implies $W \ge M_{n+1}$ (because $M_{n+1}\circ\theta = (M_n + \sigma - \tau)^+ \le (W + \sigma - \tau)^+ = W\circ\theta$).

It remains to prove uniqueness of a finite solution of (16.6) if ρ < 1. Let W be a finite solution, perhaps different from $M_\infty$. We have $\sigma - \tau \le W\circ\theta - W \le \sigma$, and in particular $W\circ\theta - W$ is integrable. Therefore, by Lemma 16.1.5,

$$E[W\circ\theta - W] = 0.$$

Since $M_\infty$ is the minimal solution, for any non-negative solution W, $\{W = 0\} \subseteq \{W = M_\infty\}$. The latter event is θ-contracting since both W and $M_\infty$ satisfy (16.6). Since (P, θ) is ergodic, we must then have $P(W = M_\infty) = 0$ or 1. It is therefore enough to show that P(W = 0) > 0 (which implies $P(W = M_\infty) = 1$, that is, uniqueness). The proof of this follows from the next lemma.

Lemma 16.2.2 If P(W = 0) = 0 for some finite solution W of (16.6), then ρ = 1.

Proof. Indeed, if W > 0 P-a.s. or (equivalently) W ∘ θ > 0 P-a.s., then W ∘ θ = W + σ − τ P-a.s., and E[W ∘ θ − W] = 0 in view of Lemma 16.1.5, and this implies E[σ] = E[τ]. □

This completes the proof of Theorem 16.2.1.
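The identity (16.7) between the workload obtained by running the Lindley recursion (16.5) from an empty queue and the maximum of partial sums can be checked numerically. The sketch below does this on one sample of integer-valued service and inter-arrival times; the distributions, the seed and the function names are arbitrary choices made for this illustration. The lists `sigma` and `tau` store $\sigma_{-i}$ and $\tau_{-i}$ at index i − 1.

```python
import random


def lindley_from_empty(sigma, tau):
    """Workload found by customer 0 when customer -n (n = len(sigma)) finds
    an empty queue, obtained by iterating W_{k+1} = (W_k + sigma_k - tau_k)^+
    for k = -n, ..., -1."""
    w = 0
    for i in range(len(sigma), 0, -1):  # customers -n, -n+1, ..., -1
        w = max(w + sigma[i - 1] - tau[i - 1], 0)
    return w


def loynes_formula(sigma, tau):
    """Right-hand side of (16.7): positive part of the maximum over m of the
    partial sums of sigma_{-i} - tau_{-i}, i = 1..m (0 when n = 0)."""
    best = 0
    partial = 0
    for s, t in zip(sigma, tau):
        partial += s - t
        best = max(best, partial)
    return best


rng = random.Random(0)
N_MAX = 50
sigma = [rng.randint(0, 5) for _ in range(N_MAX)]  # service times sigma_{-1}, sigma_{-2}, ...
tau = [rng.randint(0, 7) for _ in range(N_MAX)]    # inter-arrival times tau_{-1}, tau_{-2}, ...

# M[n] from the closed form, W[n] from the recursion, for n = 0, ..., N_MAX.
M = [loynes_formula(sigma[:n], tau[:n]) for n in range(N_MAX + 1)]
W = [lindley_from_empty(sigma[:n], tau[:n]) for n in range(N_MAX + 1)]

print(M == W, M[:10])
```

Integer inputs are used so that the two computations agree exactly, without floating-point tolerance. The list `M` is also non-decreasing in n, which is the monotonicity used to define $M_\infty$ in (16.9).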



A partial converse of Theorem 16.2.1 is the following:

Theorem 16.2.3 If E[σ] > E[τ], (16.6) admits no finite solution.

Proof. To prove this, it is enough to show that $M_\infty = \infty$, P-a.s., since $M_\infty$ is the minimal non-negative solution of (16.6). This follows from

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} (\sigma_{-i} - \tau_{-i}) = E[\sigma - \tau] > 0,$$

since this in turn implies

$$M_\infty = \Big(\sup_{n} \sum_{i=1}^{n} (\sigma_{-i} - \tau_{-i})\Big)^+ = \infty. \qquad \Box$$

At this stage, we have proved the following: for ρ > 1 there is no finite non-negative solution of (16.6), and for ρ < 1, $M_\infty$ is the unique non-negative finite solution of (16.6). In the critical case (ρ = 1) the existence of a finite non-negative solution of (16.6) depends on the distribution of the service and inter-arrival sequences. See Exercises 16.4.10 and 16.4.11.

16.3 Birkhoff’s Theorem

16.3.1 The Ergodic Case

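We can now proceed to the proof of Theorem 16.1.10. It will be given in the equivalent form:

Theorem 16.3.1 Whenever (P, θ) is ergodic and both σ and τ are non-negative, not identically null, and integrable,

$$\lim_{n\to\infty} \frac{\sum_{i=0}^{n} \sigma\circ\theta^{-i}}{\sum_{i=0}^{n} \tau\circ\theta^{-i}} = \frac{E[\sigma]}{E[\tau]}, \quad P\text{-a.s.}$$

Proof. According to (16.7),

$$\sum_{i=1}^{n} \sigma\circ\theta^{-i} \le \sum_{i=1}^{n} \tau\circ\theta^{-i} + M_n.$$

We know that if E[σ] < E[τ], $M_n \uparrow M_\infty < \infty$ P-a.s. Taking $\sigma = \frac12 E[\tau] > 0$, it follows that if E[τ] > 0,

$$\lim_{n\to\infty} \sum_{i=1}^{n} \tau\circ\theta^{-i} = \infty, \quad P\text{-a.s.}$$

Therefore, whenever E[τ] > 0 and E[σ] < E[τ],

$$\limsup_{n\to\infty} \frac{\sum_{i=0}^{n} \sigma\circ\theta^{-i}}{\sum_{i=0}^{n} \tau\circ\theta^{-i}} \le \lim_{n\to\infty} \frac{M_n}{\sum_{i=0}^{n} \tau\circ\theta^{-i}} + 1 = 1, \quad P\text{-a.s.}$$

(The terms of index i = 0 do not affect the limit since the denominator tends to infinity.) If E[τ] > 0, for some integrable σ, take any a > 0 such that aE[σ] < E[τ] and apply the previous inequality to aσ to obtain

$$\limsup_{n\to\infty} \frac{\sum_{i=0}^{n} \sigma\circ\theta^{-i}}{\sum_{i=0}^{n} \tau\circ\theta^{-i}} \le \frac{1}{a}$$

and therefore

$$\limsup_{n\to\infty} \frac{\sum_{i=0}^{n} \sigma\circ\theta^{-i}}{\sum_{i=0}^{n} \tau\circ\theta^{-i}} \le \inf\Big\{\frac{1}{a};\ aE[\sigma] < E[\tau]\Big\} = \frac{E[\sigma]}{E[\tau]}.$$

Interchanging the roles of σ and τ, we obtain similarly

$$\liminf_{n\to\infty} \frac{\sum_{i=0}^{n} \sigma\circ\theta^{-i}}{\sum_{i=0}^{n} \tau\circ\theta^{-i}} \ge \frac{E[\sigma]}{E[\tau]}.$$

Hence the result. □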

Remark 16.3.2 If a stochastic process is not stationary, the ergodic theorem does not apply directly to obtain almost sure convergence of the empirical means. However if there is convergence of some sort of such a process to stationarity, the convergence of the empirical mean is possible, for instance in the case of an irreducible positive recurrent aperiodic hmc, where convergence in variation is obtained via coupling. See Theorem 6.4.2.

16.3.2 The Non-ergodic Case

“Non-ergodic” refers to the situation where there are nontrivial invariant sets.⁵ Define $\mathcal{I} = \{A \in \mathcal{F};\ \theta^{-1}(A) = A\}$. $\mathcal{I}$ is a σ-field called the invariant σ-field.

Definition 16.3.3 A random variable X is called invariant if X = X ∘ θ.

Theorem 16.3.4 X is invariant if and only if it is $\mathcal{I}$-measurable.

Proof. Suppose X invariant. Then, for all a ∈ R, $\{X \le a\} = \{X\circ\theta \le a\} = \theta^{-1}\{X \le a\}$, and therefore $\{X \le a\} \in \mathcal{I}$. Conversely, suppose $X = 1_A$, where $A \in \mathcal{I}$. Then $X\circ\theta = 1_A\circ\theta = 1_{\theta^{-1}(A)} = 1_A = X$, and therefore indicators of sets in $\mathcal{I}$ are invariant, and so are the weighted sums of such indicator functions, as well as limits of the latter. Since a non-negative $\mathcal{I}$-measurable random variable is a limit of a sequence of weighted sums of indicators of sets in $\mathcal{I}$, it is invariant. For an arbitrary $\mathcal{I}$-measurable random variable, the proof is completed by considering its positive and negative parts as usual. □

Theorem 16.3.5 Let (P, θ) be a stationary framework. It is ergodic if and only if every invariant real-valued random variable is almost surely a constant.

Proof. Sufficiency: Let A be invariant. Then $X = 1_A$ is invariant and therefore almost surely a constant, which implies P(A) = 0 or 1. Necessity: Suppose ergodicity and let X be invariant. Then for all a ∈ R, P(X ≤ a) = 0 or 1. When a is sufficiently large, this must be 1 because $\lim_{a\uparrow\infty} P(X \le a) = 1$. Let $a_0 = \inf\{a \in \mathbb{R};\ P(X \le a) = 1\}$. Therefore, for all ε > 0, $P(a_0 - \varepsilon < X < a_0 + \varepsilon) = 1$. Let ε ↓ 0 to obtain $P(X = a_0) = 1$. □

⁵ Strictly speaking, the results in the ergodic case follow from those in the current subsection. The choice made in the order of treatment is motivated by the facts that “in practice” the ergodic case is the most interesting one for applications and that the proof of the ergodic theorem seized the opportunity of introducing the G/G/1:∞ queue.


Theorem 16.3.6 (⁶) Let (P, θ) be a stationary framework and let X be an integrable random variable. Then

$$\lim_{n\uparrow\infty} \frac{1}{n}\sum_{k=0}^{n-1} X\circ\theta^k = E[X \mid \mathcal{I}], \quad P\text{-a.s.}$$
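Theorem 16.3.6 can be checked by hand on a finite framework that is stationary but not ergodic. In the sketch below, which is an illustration constructed here and not an example from the text, Ω = {0, ..., 11} carries the uniform probability and θ(j) = j + 4 mod 12. The strictly invariant sets are the unions of orbits {j, j + 4, j + 8}, so that $E[X \mid \mathcal{I}](j)$ is the average of X over the orbit of j. The values of X and all names are arbitrary.

```python
from fractions import Fraction

G, STEP = 12, 4  # theta(j) = j + STEP mod G on {0, ..., G-1}, uniform P
X = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]  # arbitrary values of the random variable X


def theta(j):
    return (j + STEP) % G


def birkhoff_average(j, n):
    """(1/n) * sum_{k=0}^{n-1} X(theta^k j), as an exact fraction."""
    total = 0
    for _ in range(n):
        total += X[j]
        j = theta(j)
    return Fraction(total, n)


def conditional_expectation(j):
    """E[X | I](j): average of X over the orbit of j under theta."""
    orbit = []
    i = j
    while True:
        orbit.append(i)
        i = theta(i)
        if i == j:
            break
    return Fraction(sum(X[i] for i in orbit), len(orbit))


# n = 300 is a multiple of the orbit length 3, so the averages are exact.
averages = [birkhoff_average(j, 300) for j in range(G)]
limits = [conditional_expectation(j) for j in range(G)]
overall_mean = Fraction(sum(X), G)  # E[X]

print(averages == limits, limits, overall_mean)
```

The time averages coincide with $E[X \mid \mathcal{I}]$, which varies from orbit to orbit and differs from the constant E[X] on some orbits; this is the difference with the ergodic case (16.1).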

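The proof rests on Hopf’s lemma:

Theorem 16.3.7 Let (P, θ) be a stationary framework and let X be an integrable random variable. Define $S_k = X + X\circ\theta + \cdots + X\circ\theta^{k-1}$ and $M_n = \max(0, S_1, \ldots, S_n)$. Then

$$\int_{\{M_n > 0\}} X\,dP \ge 0.$$

Proof. For 1 ≤ k ≤ n, $M_n\circ\theta \ge S_k\circ\theta$, and therefore $X + M_n\circ\theta \ge X + S_k\circ\theta = S_{k+1}$. Hence $X \ge S_j - M_n\circ\theta$ for 2 ≤ j ≤ n. This is also true for j = 1 ($X \ge S_1 - M_n\circ\theta$ because $S_1 = X$ and $M_n\circ\theta \ge 0$). Therefore

$$X \ge \max(S_1, \ldots, S_n) - M_n\circ\theta.$$

In particular,

$$E\big[X\, 1_{\{M_n > 0\}}\big] \ge E\big[(\max(S_1, \ldots, S_n) - M_n\circ\theta)\, 1_{\{M_n > 0\}}\big].$$

But $\max(S_1, \ldots, S_n) = M_n$ on $\{M_n > 0\}$, and therefore, since $M_n - M_n\circ\theta \le 0$ on $\{M_n = 0\}$,

$$E\big[X\, 1_{\{M_n > 0\}}\big] \ge E\big[(M_n - M_n\circ\theta)\, 1_{\{M_n > 0\}}\big] \ge E[M_n - M_n\circ\theta] = 0. \qquad \Box$$

Proof. We can now give the proof of Theorem 16.3.6. We may suppose that $E[X \mid \mathcal{I}] = 0$, otherwise replace X by $X - E[X \mid \mathcal{I}]$. Define $\overline{X} = \limsup_{n\uparrow\infty} \frac{S_n}{n}$. This is an invariant random variable, and therefore the set $C := \{\overline{X} > \varepsilon\}$ is an invariant set for any fixed ε > 0. We show that P(C) = 0. Define $X^* = (X - \varepsilon)1_C$ and let $S_k^* = X^* + \cdots + X^*\circ\theta^{k-1}$ and $M_n^* = \max(0, S_1^*, \ldots, S_n^*)$. Then (Hopf’s lemma)

$$\int_{\{M_n^* > 0\}} X^*\,dP \ge 0.$$

The sets $H_n = \{M_n^* > 0\} = \{\max(S_1^*, \ldots, S_n^*) > 0\}$, n ≥ 1, form a non-decreasing sequence whose sequential limit is

$$H := \Big\{\sup_{k\ge 1} S_k^* > 0\Big\} = \Big\{\sup_{k\ge 1} \frac{S_k^*}{k} > 0\Big\} = \Big\{\sup_{k\ge 1} \frac{S_k}{k} > \varepsilon\Big\} \cap C.$$

Since $\sup_{k\ge 1} \frac{S_k}{k} \ge \overline{X}$ and $\overline{X} > \varepsilon$ on C, we have that H = C. Therefore

$$\lim_{n\uparrow\infty} \int_{H_n} X^*\,dP = \int_{H} X^*\,dP = \int_{C} X^*\,dP$$

by dominated convergence since $X^*$ is integrable ($E[|X^*|] \le E[|X|] + \varepsilon$). Using the fact that C is an invariant event and that $E[X \mid \mathcal{I}] = 0$,

$$0 \le \int_C X^*\,dP = \int_C (X - \varepsilon)1_C\,dP = \int_C X\,dP - \varepsilon P(C) = \int_C E[X \mid \mathcal{I}]\,dP - \varepsilon P(C) = -\varepsilon P(C).$$

This implies P(C) = 0, that is, $P(\overline{X} \le \varepsilon) = 1$. Since ε > 0 is arbitrary, $P(\overline{X} \le 0) = 1$, that is, almost surely

$$\limsup_{n\uparrow\infty} \frac{S_n}{n} \le 0.$$

The same arguments applied with −X instead of X give

$$-\limsup_{n\uparrow\infty} \frac{-S_n}{n} = \liminf_{n\uparrow\infty} \frac{S_n}{n} \ge 0.$$

Therefore $\limsup_{n\uparrow\infty} \frac{S_n}{n} = \liminf_{n\uparrow\infty} \frac{S_n}{n} = 0$. □

⁶ [Birkhoff, 1931].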
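Hopf’s lemma (Theorem 16.3.7) only uses the θ-invariance of P, so it can be tested exactly on a finite framework. The sketch below is an illustration set up here, not part of the text: Ω = {0, ..., 29} with the uniform probability, θ(j) = j + 7 mod 30 (a bijection, hence measure-preserving), and an integer-valued X drawn once with a fixed seed. It computes G times the integral of X over $\{M_n > 0\}$ for several n; all names and parameter values are arbitrary.

```python
import random

G, STEP = 30, 7  # theta(j) = j + STEP mod G on {0, ..., G-1}; uniform P is invariant
rng = random.Random(1)
X = [rng.randint(-5, 5) for _ in range(G)]  # arbitrary integer-valued X


def theta(j):
    return (j + STEP) % G


def hopf_integral(n):
    """G times the integral of X over {M_n > 0}, i.e. the sum of X(j) over
    the points j for which max(S_1(j), ..., S_n(j)) > 0."""
    total = 0
    for j in range(G):
        partial, best, i = 0, 0, j
        for _ in range(n):
            partial += X[i]  # S_k(j) = X(j) + X(theta j) + ... + X(theta^{k-1} j)
            best = max(best, partial)
            i = theta(i)
        if best > 0:  # j belongs to {M_n > 0}
            total += X[j]
    return total


integrals = [hopf_integral(n) for n in range(1, 41)]
print(integrals)
```

Every entry of `integrals` is non-negative, as the lemma predicts, even though the sum of X over the whole space may be negative. For n = 1 the statement reduces to the obvious fact that the sum of the positive values of X is non-negative.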

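Corollary 16.3.8 Let (P, θ) be a stationary framework, and let X be an integrable random variable. Then

$$\lim_{n\uparrow\infty} \frac{1}{n}\sum_{k=0}^{n-1} X\circ\theta^k = E[X \mid \mathcal{I}] \quad \text{in } L^1_{\mathbb{C}}(P).$$

Proof. Let for any K > 0,

$$X_K' := X 1_{\{|X| \le K\}}, \qquad X_K'' := X 1_{\{|X| > K\}}.$$

By the pointwise ergodic theorem,

$$\lim_{n\uparrow\infty} \frac{1}{n}\sum_{k=0}^{n-1} X_K'\circ\theta^k = E[X_K' \mid \mathcal{I}],$$

from which it follows by dominated convergence ($X_K'$ is bounded) that

$$\lim_{n\uparrow\infty} E\Big[\Big|\frac{1}{n}\sum_{k=0}^{n-1} X_K'\circ\theta^k - E[X_K' \mid \mathcal{I}]\Big|\Big] = 0.$$

Observing that

$$E\Big[\Big|\frac{1}{n}\sum_{k=0}^{n-1} X_K''\circ\theta^k\Big|\Big] \le \frac{1}{n}\sum_{k=0}^{n-1} E\big[|X_K''\circ\theta^k|\big] = E\big[|X_K''|\big]$$

and

$$E\big[\big|E[X_K'' \mid \mathcal{I}]\big|\big] \le E\big[E[|X_K''| \mid \mathcal{I}]\big] = E\big[|X_K''|\big],$$

we have that

$$E\Big[\Big|\frac{1}{n}\sum_{k=0}^{n-1} X_K''\circ\theta^k - E[X_K'' \mid \mathcal{I}]\Big|\Big] \le 2E\big[|X_K''|\big].$$

Therefore

$$\limsup_{n\uparrow\infty} E\Big[\Big|\frac{1}{n}\sum_{k=0}^{n-1} X\circ\theta^k - E[X \mid \mathcal{I}]\Big|\Big] \le 2E\big[|X_K''|\big].$$

It then suffices to let K tend to ∞. □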

16.3.3 The Continuous-time Ergodic Theorem

The extension of the discrete-time result begins with the introduction of the notion of measurable flow in continuous time.

Example 16.3.9: Shifts acting on functions, take 1. Let Ω be the space of continuous functions ω : R → R, and let F be the σ-field on Ω generated by the coordinate functions X(s) : Ω → R, s ∈ R, where X(s, ω) = ω(s). Define for each t ∈ R the mapping $\theta_t : \Omega \to \Omega$ by

$$\theta_t(\omega)(s) = \omega(s + t).$$

($\theta_t$ translates a function ω ∈ Ω by −t.)

For fixed t ∈ R, the mapping $\theta_t$ of the above example is measurable. However we shall need more measurability.

Definition 16.3.10 The family $\{\theta_t\}_{t\in\mathbb{R}}$ of measurable maps from the measurable space (Ω, F) into itself is called a shift on (Ω, F) if:
(a) $\theta_t$ is bijective for all t ∈ R, and
(b) $\theta_t \circ \theta_s = \theta_{t+s}$ for all t, s ∈ R.
This shift is called a (measurable) flow if, in addition,
(c) $(t, \omega) \mapsto \theta_t(\omega)$ is measurable from $(\mathbb{R} \times \Omega, \mathcal{B} \otimes \mathcal{F})$ to (Ω, F).
In particular, $\theta_0$ is the identity and $\theta_t^{-1} = \theta_{-t}$.

To simplify the notation, we shall write $\theta_t\omega$ instead of $\theta_t(\omega)$.

Definition 16.3.11 Given a shift as in Definition 16.3.10, a stochastic process $\{Z(t)\}_{t\in\mathbb{R}}$ is called compatible with the shift (for short: $\theta_t$-compatible) if for all t ∈ R,

$$Z(t) = Z(0)\circ\theta_t, \tag{16.13}$$

that is, for all ω ∈ Ω, $Z(t, \omega) = Z(0, \theta_t\omega)$.

Example 16.3.12: Shifts acting on functions, take 2. In Example 16.3.9, the coordinate process is $\theta_t$-compatible.

The stationarity of a compatible process is embodied in the invariance of the underlying probability with respect to the shifts. More precisely:


Definition 16.3.13 Let (Ω, F, P) be a probability space and let $\{\theta_t\}_{t\in\mathbb{R}}$ be a shift on (Ω, F). The probability P is called invariant with respect to this shift if

$$P\circ\theta_t^{-1} = P \qquad (t \in \mathbb{R}). \tag{16.14}$$

We then say: $(P, \theta_t)$ is a stationary framework (on R).

The continuous parameter set is now R. We repeat in this setting the definitions given for discrete-time flows. Let $(P, \theta_t)$ be a stationary framework on (Ω, F).

Definition 16.3.14 An event A ∈ F is called strictly $\theta_t$-invariant if $A = \theta_t^{-1}A$ for all t ∈ R. It is called $\theta_t$-invariant if $P(A \,\triangle\, \theta_t^{-1}A) = 0$ for all t ∈ R.

By an easy adaptation of the remark following Definition 16.1.7 to continuous-time flows, we see that in the following definition, “$\theta_t$-invariant” can be replaced by “strictly $\theta_t$-invariant”.

Definition 16.3.15 The flow $\{\theta_t\}_{t\in\mathbb{R}}$ is called P-ergodic if all $\theta_t$-invariant events are trivial. One then says: $(P, \theta_t)$ is ergodic.

Theorem 16.3.16 If $(P, \theta_t)$ is ergodic, then for all $f \in L^1(P)$,

$$\lim_{T\uparrow\infty} \frac{1}{T}\int_0^T (f\circ\theta_t)\,dt = E[f], \quad P\text{-a.s.} \tag{16.15}$$

Proof. Defining $\theta := \theta_1$, the pair (P, θ) is ergodic. It is enough to prove the theorem for non-negative $f \in L^1(P)$. In this case, defining n(T) by n(T) ≤ T < n(T) + 1, we have the bounds

$$\frac{n(T)}{n(T)+1}\,\frac{1}{n(T)}\int_0^{n(T)} f\circ\theta_t\,dt \le \frac{1}{T}\int_0^{T} f\circ\theta_t\,dt \le \frac{n(T)+1}{n(T)}\,\frac{1}{n(T)+1}\int_0^{n(T)+1} f\circ\theta_t\,dt. \tag{$\star$}$$

Defining $g := \int_0^1 f\circ\theta_t\,dt$, we have that

$$\int_0^n f\circ\theta_t\,dt = \sum_{k=0}^{n-1} g\circ\theta^k$$

and therefore

$$\lim_{n\uparrow\infty} \frac{1}{n}\int_0^n f\circ\theta_t\,dt = E[g] = E\Big[\int_0^1 f\circ\theta_t\,dt\Big] = E[f].$$

The conclusion then follows from ($\star$). □

Theorem 16.3.17 $(P, \theta_t)$ is ergodic if and only if there exists no decomposition

$$P = \beta_1 P_1 + \beta_2 P_2, \qquad \beta_1 + \beta_2 = 1,\ \beta_1 > 0,\ \beta_2 > 0, \tag{16.16}$$

where $P_1$ and $P_2$ are distinct probabilities that are $\theta_t$-invariant for all t.

Proof. The proof is analogous to that of Theorem 16.1.21. □


Complementary reading

[Billingsley, 1965] is the classic introduction to ergodic theory and to the theoretical aspects of information theory.

16.4 Exercises

Exercise 16.4.1. ω + d mod 1
Let Ω = (0, 1], and let P be the Lebesgue measure on Ω. Let d be some real number. Define θ : (Ω, F) → (Ω, F) by θ(ω) = ω + d mod 1. Show that (P, θ) is not ergodic if d is rational.

Exercise 16.4.2. θ ergodic, θ² not ergodic
Give an example where (P, θ) is ergodic and $(P, \theta^2)$ is not ergodic.

Exercise 16.4.3. Ergodic yet not mixing hmc
Give the details in the proof of Example 16.1.18.

Exercise 16.4.4. 2ω mod 1
Let (Ω, F) := ([0, 1), B([0, 1))). Let P be the Lebesgue measure on [0, 1). Consider the transformation

$$\theta(\omega) := \begin{cases} 2\omega & \text{if } \omega \in [0, \tfrac12), \\ 2\omega - 1 & \text{if } \omega \in [\tfrac12, 1). \end{cases}$$

Show that P is θ-invariant and that (P, θ) is mixing. Hint: The intervals of the form $[\frac{k}{2^n}, \frac{k+1}{2^n})$ generate B([0, 1)).

Exercise 16.4.5. Periodic hmc
Show that an irreducible positive recurrent discrete-time hmc with period ≥ 2 cannot be mixing.

Exercise 16.4.6. Product of mixing shifts
For i = 1, 2, let $(\Omega_i, \mathcal{F}_i, P_i)$ be a probability space endowed with the measurable shift $\theta_i$ such that $(P_i, \theta_i)$ is mixing. Define (Ω, F, P) to be the product of the above probability spaces. The product shift $\theta := \theta_1 \oplus \theta_2$ is defined in the obvious manner: $\theta((\omega_1, \omega_2)) := (\theta_1(\omega_1), \theta_2(\omega_2))$.
(1) Show that on the product of two probability spaces, each endowed with a mixing shift, the product shift is mixing.
(2) Give a counterexample when “mixing” is replaced by “ergodic” in the previous question.

Exercise 16.4.7. Irrational translations on the torus
Prove that the flow of Example 16.1.8 is not mixing (d rational or irrational).

Exercise 16.4.8. Ergodicity of Gaussian processes
Show that a stationary centered Gaussian sequence $\{X_n\}_{n\ge 0}$ such that $\lim_{n\uparrow\infty} E[X_0 X_n] = 0$ is ergodic.

Exercise 16.4.9. Invariant events of an hmc
In Example 16.1.18, identify the invariant events.

Exercise 16.4.10. Loynes: the critical case, I
In Loynes’ equation, assume that the random variables $\sigma_n - \tau_n$ are centered, iid, and with a positive finite variance. Prove that in this case there exists no finite solution Z of Loynes’ equation. Hint: apply the central limit theorem to $\{\sigma_{-n} - \tau_{-n}\}_{n\ge 1}$.

Exercise 16.4.11. Loynes: the critical case, II
Show that if ρ = 1, and if there is a finite solution, then for any c ≥ 0, $W = M_\infty + c$ is also a finite solution of (16.6).

Exercise 16.4.12. Lindley: recurrence to zero in the stable case
The stability condition ρ < 1 is assumed to hold. Let $W = M_\infty$ be the unique non-negative solution of (16.6). Show that there exists an infinity of negative (resp. positive) indices n such that $W\circ\theta^n = 0$.

Chapter 17 Palm Probability

Palm theory (in this chapter: on the line) links two types of stationarity for marked point processes: time-stationarity and event-stationarity. Two examples will illustrate this.

The first example is the renewal process, for which one distinguishes the time-stationary (necessarily delayed) version from the undelayed version whose distribution is invariant with respect to the shift that translates the first event time to the origin. It was shown in Chapter 10 that there exists a simple relation between the two versions, which are identical except for the distribution of the first event time. In the terminology of Palm theory, the undelayed version is the Palm version of the time-stationary version.

The second example is that of an irreducible positive recurrent continuous-time hmc whose embedded chain (the chain observed at the transition times) is also positive recurrent. When such a chain is (time-)stationary, it is not true in general that the embedded discrete-time Markov chain is stationary, even when the latter is assumed positive recurrent. However, there is a simple relation between the stationary distribution of the continuous-time chain and the stationary distribution of the embedded chain. The continuous-time chain starting with the stationary distribution of the embedded chain is the Palm version of the stationary continuous-time chain. We observe once more that the distribution of the Palm version is invariant with respect to the shift that translates the first event time (transition time) to the origin.

In general, Palm theory on the line is concerned with jointly stationary stochastic processes and point processes, and with the probabilistic situation at event times.
It is especially relevant in queueing theory applied to service systems, where there are two distinct points of view, that of the “operator”, who is interested in the behavior of a queue at arbitrary times, and that of the “customer”, who is generally interested in the situation found upon arrival. The corresponding issues will be treated in Chapter 9.

17.1 Palm Distribution and Palm Probability

17.1.1 Palm Distribution

© Springer Nature Switzerland AG 2020. P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/978-3-030-40183-2_17

The story begins with a new look at Campbell’s formula for stationary marked point processes. Let $N$ be a simple point process on $\mathbb{R}^m$ with point sequence $\{X_n\}_{n\in\mathbb{N}}$. Let $\{Z_n\}_{n\in\mathbb{N}}$ be a sequence of random variables with values in the measurable space $(K, \mathcal{K})$. Each $Z_n$ is considered as a mark of the corresponding point $X_n$. Recall that the point process and its sequence of marks are referred to as “the marked point process $(N, Z)$”, which can also be represented as a point process $N_Z$ on $\mathbb{R}^m \times K$:

$$N_Z(D) := \sum_{n\in\mathbb{N}} 1_D(X_n, Z_n) \qquad (D \in \mathcal{B}(\mathbb{R}^m) \otimes \mathcal{K}).$$

The marked point process $(N, Z)$ is called stationary if for all $x \in \mathbb{R}^m$, the random measure $S_x(N_Z)$ defined by

$$S_x(N_Z)(D) := \sum_{n\in\mathbb{N}} 1_D(X_n + x, Z_n) \qquad (D \in \mathcal{B}(\mathbb{R}^m) \otimes \mathcal{K})$$

has the same distribution as $N_Z$. The intensity of the (stationary) point process $N$ will henceforth be assumed positive and finite:

$$0 < \lambda := E[N((0, 1]^m)] < \infty. \tag{17.1}$$

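Define the σ-finite measure $\nu_Z$ on $(\mathbb{R}^m \times K, \mathcal{B}(\mathbb{R}^m) \otimes \mathcal{K})$ by

$$\nu_Z(D) := E[N_Z(D)] \qquad (D \in \mathcal{B}(\mathbb{R}^m) \otimes \mathcal{K}).$$

Recall the notation $C + x := \{y + x \,;\, y \in C\}$ ($C \subseteq \mathbb{R}^m$, $x \in \mathbb{R}^m$). By stationarity, for all $C \in \mathcal{B}(\mathbb{R}^m)$ and all $L \in \mathcal{K}$,

$$\nu_Z((C + x) \times L) = \nu_Z(C \times L) \qquad (x \in \mathbb{R}^m),$$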

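that is, the measure $C \mapsto \nu_Z(C \times L)$ is for fixed $L$ translation-invariant, and therefore (Theorem 2.1.45) a multiple of the Lebesgue measure $\ell^m$ on $(\mathbb{R}^m, \mathcal{B}(\mathbb{R}^m))$: $\nu_Z(C \times L) = \gamma(L)\,\ell^m(C)$, for some $\gamma(L)$. The mapping $L \mapsto \gamma(L)$ is a measure on $(K, \mathcal{K})$ that is finite since $\gamma(K) = \lambda$. In particular, $Q_N^0 := \lambda^{-1}\gamma$ is a probability measure on $(K, \mathcal{K})$ and

$$\nu_Z(C \times L) = \lambda\, Q_N^0(L)\, \ell^m(C).$$

Therefore

$$Q_N^0(L) = \frac{E\left[\sum_{n\in\mathbb{N}} 1_C(X_n)\, 1_L(Z_n)\right]}{\lambda\, \ell^m(C)} \tag{17.2}$$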

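is a probability on $(K, \mathcal{K})$. It is called the Palm distribution of the marks.

Theorem 17.1.1 Let $f : \mathbb{R}^m \times K \to \mathbb{R}$ be a non-negative measurable function. Then

$$E\left[\int_{\mathbb{R}^m \times K} f(x, z)\, N_Z(dx \times dz)\right] = \lambda \int_{\mathbb{R}^m} \int_K f(x, z)\, Q_N^0(dz)\, dx. \tag{17.3}$$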

Proof. Formula (17.3) is true for

$$f(x, z) := 1_C(x)\, 1_L(z) \qquad (C \in \mathcal{B}(\mathbb{R}^m),\ L \in \mathcal{K})$$

since it then reduces to (17.2). The general case again follows by the usual monotone class argument based on Dynkin’s Theorem 2.1.27. □

Formula (17.3) is the Palm–Campbell formula for stationary marked point processes.
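As a rough numerical illustration of (17.2), the sketch below estimates the Palm distribution of the marks for a marked Poisson process on the line (m = 1). The Poisson assumption, the i.i.d. marks on {0, 1, 2}, and all names (`MARK_PROBS`, `sample_marked_poisson`, `estimate_palm_mark_distribution`) are choices made for this illustration only. With i.i.d. marks independent of the points, the Palm distribution of the marks is simply the common mark distribution, which gives a value to compare against.

```python
import random

MARK_PROBS = {0: 0.5, 1: 0.3, 2: 0.2}  # assumed mark distribution (illustrative)

def sample_marked_poisson(rate, length, rng):
    """Points of a rate-`rate` Poisson process on (0, length], each with an i.i.d. mark."""
    points = []
    t = rng.expovariate(rate)
    while t <= length:
        mark = rng.choices(list(MARK_PROBS), weights=list(MARK_PROBS.values()))[0]
        points.append((t, mark))
        t += rng.expovariate(rate)
    return points

def estimate_palm_mark_distribution(rate, length, marks_in_L, n_runs, seed=0):
    """Monte Carlo version of (17.2) with C = (0, length]:
    average of sum_n 1_C(X_n) 1_L(Z_n), divided by rate * length."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_runs):
        sample = sample_marked_poisson(rate, length, rng)
        total += sum(1 for _, z in sample if z in marks_in_L)
    return total / n_runs / (rate * length)

estimate = estimate_palm_mark_distribution(rate=2.0, length=5.0,
                                           marks_in_L={1, 2}, n_runs=4000)
print(estimate)  # should be close to 0.3 + 0.2 = 0.5
```

The estimate does not depend (up to Monte Carlo error) on the length of the window C, in line with the fact that (17.2) holds for every admissible C.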

17.1.2 Stationary Frameworks

The passage from the Palm distribution of marks to Palm probability will be done in terms of measurable flows on abstract probability spaces.

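Measurable Flows

For convenience, we repeat Definition 16.3.10 of Chapter 16.

Definition 17.1.2 A family $\{\theta_x\}_{x\in\mathbb{R}^m}$ of measurable functions from the measurable space $(\Omega, \mathcal{F})$ into itself is called a shift on $(\Omega, \mathcal{F})$ if:

(a) $\theta_x$ is bijective for all $x \in \mathbb{R}^m$, and

(b) $\theta_x \circ \theta_y = \theta_{x+y}$ for all $x, y \in \mathbb{R}^m$.

The shift $\{\theta_x\}_{x\in\mathbb{R}^m}$ on $(\Omega, \mathcal{F})$ is called a measurable flow if in addition

(c) $(x, \omega) \mapsto \theta_x(\omega)$ is measurable from $\mathcal{B}(\mathbb{R}^m) \otimes \mathcal{F}$ to $\mathcal{F}$.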

Example 17.1.3: Measurable flow on a space of measures. The shift $\{S_x\}_{x\in\mathbb{R}^m}$ acting on $M(\mathbb{R}^m)$, the canonical space of locally finite measures on $\mathbb{R}^m$, and defined by

$$S_x(\mu)(C) := \mu(C + x) \qquad (x \in \mathbb{R}^m,\ C \in \mathcal{B}(\mathbb{R}^m)),$$

is a measurable flow. To prove this, it suffices to show that the mapping

$$(x, \mu) \mapsto (S_x\mu)(f) := \int_{\mathbb{R}^m} f(y - x)\, \mu(dy)$$

is measurable whenever $f : \mathbb{R}^m \to \mathbb{R}$ is a non-negative continuous function with compact support. This is the case since $(x, \mu) \mapsto g(x, \mu) := (S_x\mu)(f)$ is continuous in the first argument and measurable in the second. (Indeed, $g$ is the limit as $n \uparrow \infty$ of the measurable functions $\sum_{k\in\mathbb{N}} 1_{C_{k,n}}(x)\, g(x_{k,n}, \mu)$, where $\{x_{k,n}\}_{k\in\mathbb{N}}$ is an enumeration of the grid $n^{-1}\mathbb{Z}^m$ and $C_{k,n} = x_{k,n} + (-\frac{1}{2n}, \frac{1}{2n}]^m$.)

Example 17.1.4: The shift on marked point processes. Let $(H, \mathcal{H})$ be some measurable space. One may take $(\Omega, \mathcal{F}) = (M(\mathbb{R}^m \times H), \mathcal{M}(\mathbb{R}^m \times H))$ with $\theta_x = S_x$, where

$$S_x\mu(C \times L) := \mu((C + x) \times L) \qquad (C \in \mathcal{B}(\mathbb{R}^m),\ L \in \mathcal{H}).$$
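To make the action of the shift concrete, the following sketch represents a finite point measure on the line (m = 1) by the list of its atoms. The helper names `shift` and `count` are illustrative assumptions. The code only checks, on one example, the relation that the translated measure of C equals the original measure of C + x, and the flow property that shifting by y and then by x is the same as shifting by x + y.

```python
from fractions import Fraction

def shift(points, x):
    """Atoms of S_x(mu) when mu has atoms `points`: every atom moves from y to y - x."""
    return sorted(y - x for y in points)

def count(points, a, b):
    """mu((a, b]) for the point measure with atoms `points`."""
    return sum(1 for y in points if a < y <= b)

mu = [Fraction(n, 4) for n in (-7, -2, 1, 3, 10)]   # arbitrary atoms, exact arithmetic
x, y = Fraction(1, 2), Fraction(5, 4)

# S_x(mu)(C) = mu(C + x) for C = (0, 1]
lhs = count(shift(mu, x), 0, 1)
rhs = count(mu, 0 + x, 1 + x)

# Flow property: S_x o S_y = S_{x+y}
composed = shift(shift(mu, y), x)
direct = shift(mu, x + y)
print(lhs, rhs, composed == direct)  # prints: 1 1 True
```

Exact rational arithmetic is used so that the equality of the two shifted configurations can be tested without rounding issues.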

Compatibility

A central notion is that of compatibility with a flow.

Definition 17.1.5 Let $\{\theta_x\}_{x\in\mathbb{R}^m}$ be a measurable flow on $(\Omega, \mathcal{F})$. A stochastic process $\{Z(x)\}_{x\in\mathbb{R}^m}$ defined on $(\Omega, \mathcal{F})$ with values in the measurable space $(K, \mathcal{K})$ is called compatible with the flow $\{\theta_x\}_{x\in\mathbb{R}^m}$ (for short: $\theta_x$-compatible) if

$$Z(x, \omega) = Z(0, \theta_x(\omega)) \qquad (\omega \in \Omega,\ x \in \mathbb{R}^m),$$

that is, in shorter notation, $Z(x) = Z(0) \circ \theta_x$. A random measure $N$ on $\mathbb{R}^m$ is called compatible with the flow $\{\theta_x\}_{x\in\mathbb{R}^m}$ (for short: $\theta_x$-compatible) if

$$N(\theta_x(\omega), C) = N(\omega, C + x) \qquad (\omega \in \Omega,\ C \in \mathcal{B}(\mathbb{R}^m),\ x \in \mathbb{R}^m),$$

that is, in shorter notation, $N \circ \theta_x = S_x N$, where $S_x$ is the translation operator acting on measures (Example 16.3.9, (ii)). Note that we have three notations for the same object: $N \circ \theta_x$, $S_x(N)$, $N - x$. The latter is not to be confused with $N - \varepsilon_x$, which represents $N \setminus \{x\}$ if $x \in N$, and $N$ if $x \notin N$.

Stationary Frameworks

Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\{\theta_x\}_{x\in\mathbb{R}^m}$ be a measurable flow on $(\Omega, \mathcal{F})$.

Definition 17.1.6 The probability $P$ is called invariant with respect to the flow $\{\theta_x\}_{x\in\mathbb{R}^m}$ (for short, $\theta_x$-invariant) if for all $x \in \mathbb{R}^m$, $P \circ \theta_x^{-1} = P$. $(P, \theta_x)$ is then called a stationary framework on $\mathbb{R}^m$.

Example 17.1.7: Stationary point process. Let $(P, \theta_x)$ be a stationary framework on $\mathbb{R}^m$. If the point process $N$ on $\mathbb{R}^m$ is compatible with the shift, it is stationary. Indeed, letting $A = \{\omega \,;\, N(\omega, C_1) = k_1, \ldots, N(\omega, C_m) = k_m\}$, with $C_1, \ldots, C_m \in \mathcal{B}(\mathbb{R}^m)$, $k_1, \ldots, k_m \in \mathbb{N}$, we have, by definition,

$$\begin{aligned}
\theta_x^{-1} A &= \{\omega \,;\, \theta_x(\omega) \in A\} \\
&= \{\omega \,;\, N(\theta_x(\omega), C_1) = k_1, \ldots, N(\theta_x(\omega), C_m) = k_m\} \\
&= \{\omega \,;\, N(\omega, C_1 + x) = k_1, \ldots, N(\omega, C_m + x) = k_m\}.
\end{aligned}$$

Therefore, since $P \circ \theta_x^{-1} = P$,

$$P(N(C_1) = k_1, \ldots, N(C_m) = k_m) = P(N(C_1 + x) = k_1, \ldots, N(C_m + x) = k_m).$$

In the situation of Example 17.1.7, one sometimes says for short: $(N, \theta_x, P)$ is a stationary point process.

Example 17.1.8: Stationary stochastic process. Let $(P, \theta_x)$ be a stationary framework on $\mathbb{R}^m$. By the same argument as in the example above, a stochastic process $\{Z(x)\}_{x\in\mathbb{R}^m}$ with values in $(K, \mathcal{K})$ that is $\theta_x$-compatible is strictly stationary. For short: $(Z, \theta_x, P)$ is a stationary stochastic process. (Here $Z$ stands for $\{Z(x)\}_{x\in\mathbb{R}^m}$.)

17.1.3 Palm Probability and the Campbell–Mecke Formula

Let $((N, Z), \theta_x, P)$ be a stationary marked point process on $\mathbb{R}^m$ such that $N$ is simple and with finite positive intensity $\lambda$. In fact, the work needed for the definition of Palm probability has already been done in Subsection 17.1.1. It suffices to choose for measurable mark space $(K, \mathcal{K})$ the measurable space $(\Omega, \mathcal{F})$ itself. For each $n \in \mathbb{N}$, $\theta_{X_n}$ is a random element taking its values in the measurable space $(\Omega, \mathcal{F})$. To see this, write $\theta_x(\omega)$ as $f(x, \omega)$ and remember that the function $(x, \omega) \mapsto f(x, \omega)$ is measurable, and therefore, since the function $\omega \mapsto X_n(\omega)$ is measurable, so is the function $\omega \mapsto f(X_n(\omega), \omega)$. This defines $\theta_{X_n(\omega)}(\omega) := f(X_n(\omega), \omega)$. The sequence $\{\theta_{X_n}\}_{n\in\mathbb{N}}$ is the universal mark sequence.

Remark 17.1.9 If $(\Omega, \mathcal{F})$ is the canonical space of point processes on $\mathbb{R}^m$ and the measurable flow is just the shift on this space, the universal mark associated with the point $X_n$ is $N - X_n$, the canonical process $N$ shifted by $X_n$. In fact, this mark contains as much and no more information than the whole trajectory $N$!

Take in (17.2) $(K, \mathcal{K}) = (\Omega, \mathcal{F})$ and $Z_n(\omega) = \theta_{X_n(\omega)}(\omega)$. Denote in this case $Q_N^0$ by $P_N^0$. In particular, $P_N^0$ is a probability on $(\Omega, \mathcal{F})$. Formula (17.2) then reads, for all $C \in \mathcal{B}(\mathbb{R}^m)$ of positive Lebesgue measure,

$$P_N^0(A) := \frac{E\left[\sum_{n\in\mathbb{N}} 1_C(X_n)\, 1_A \circ \theta_{X_n}\right]}{\lambda\, \ell^m(C)} \qquad (A \in \mathcal{F}). \tag{17.4}$$

The probability $P_N^0$ defined by (17.4) is called the Palm probability associated with $P$ (or, more precisely, with $(N, \theta_x, P)$).

Remark 17.1.10 The definition (17.4) does not depend on the choice of $C \in \mathcal{B}(\mathbb{R}^m)$ of positive Lebesgue measure.

Theorem 17.1.11¹ Let $v : \mathbb{R}^m \times \Omega \to \mathbb{R}$ be a non-negative measurable function. Then

$$E\left[\int_{\mathbb{R}^m} (v(x) \circ \theta_x)\, N(dx)\right] = \lambda\, E_N^0\left[\int_{\mathbb{R}^m} v(x)\, dx\right]. \tag{17.5}$$

(The left-hand side of the above equality is just $E\left[\sum_{n\in\mathbb{N}} v(X_n, \theta_{X_n})\right]$.)

Proof. Formula (17.4) is therefore a special case of the announced equality for the choice $v(x, \omega) = 1_C(x)\, 1_A(\omega)$, from which the general case follows by the usual monotone class argument based on Dynkin’s Theorem 2.1.27. □

¹ [Mecke, 1967].

Formula (17.5) is the Campbell–Mecke formula. It is, as we have seen, a sophisticated avatar of Campbell’s formula. It is sometimes used in the alternative equivalent form

$$E\left[\int_{\mathbb{R}^m} v(x)\, N(dx)\right] = \lambda\, E_N^0\left[\int_{\mathbb{R}^m} (v(x) \circ \theta_{-x})\, dx\right]. \tag{17.6}$$

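Example 17.1.12: An expression of the renewal function. Let $(N, \theta_t, P)$ be a stationary simple locally finite point process on $\mathbb{R}$ with finite intensity $\lambda$ and let $P_N^0$ be the associated Palm probability. We show that for all $a \ge 0$,

$$E\left[N((0, a])^2\right] = \int_0^a \left(2 E_N^0[N((-t, 0])] - 1\right) \lambda\, dt,$$

and that, in the case of a renewal process,

$$E\left[N((0, a])^2\right] = \int_0^a (2R(t) - 1)\, \lambda\, dt,$$

where $R$ is the renewal function.

Proof. From the integration by parts formula (Theorem 2.3.12) (watch the parentheses),

$$\begin{aligned}
N((0, a])^2 &= \int_{(0,a]} N((0, t])\, N(dt) + \int_{(0,a]} N((0, t))\, N(dt) \\
&= 2 \int_{(0,a]} N((0, t])\, N(dt) - N((0, a]).
\end{aligned}$$

Therefore

$$E\left[N((0, a])^2\right] = 2 E\left[\int_{\mathbb{R}_+} N((0, t])\, 1_{(0,a]}(t)\, N(dt)\right] - \lambda a.$$

In view of (17.6) with $v(t) := N((0, t])\, 1_{(0,a]}(t)$ (and therefore $v(t) \circ \theta_{-t} = N((-t, 0])\, 1_{(0,a]}(t)$),

$$\begin{aligned}
E\left[\int_{\mathbb{R}_+} N((0, t])\, 1_{(0,a]}(t)\, N(dt)\right] &= E_N^0\left[\int_{\mathbb{R}_+} N((-t, 0])\, 1_{(0,a]}(t)\, \lambda\, dt\right] \\
&= \int_{\mathbb{R}_+} E_N^0[N((-t, 0])]\, 1_{(0,a]}(t)\, \lambda\, dt.
\end{aligned}$$

For a renewal process, observe that $E_N^0[N((-t, 0])] = E_N^0[N([0, t))]$, which coincides for almost every $t$ with the renewal function $R(t)$ (the point at the origin being counted).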
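As a sanity check of the renewal-case formula of Example 17.1.12, one can take a homogeneous Poisson process of intensity λ. The sketch below assumes the convention R(t) = 1 + λt for its renewal function (the point at the origin is counted; this convention is an assumption of the illustration) and uses the standard fact that N((0, a]) is then Poisson with mean λa, so that its second moment is λa + (λa)². Function names and parameter values are illustrative.

```python
def renewal_function_poisson(t, lam):
    """Renewal function of a Poisson process, counting the point at the origin
    (assumed convention)."""
    return 1.0 + lam * t

def second_moment_via_formula(a, lam, n_steps=10_000):
    """Midpoint-rule approximation of the integral over (0, a] of (2 R(t) - 1) * lam dt."""
    h = a / n_steps
    return sum(
        (2.0 * renewal_function_poisson((i + 0.5) * h, lam) - 1.0) * lam * h
        for i in range(n_steps)
    )

a, lam = 3.0, 1.5
formula_value = second_moment_via_formula(a, lam)
closed_form = lam * a + (lam * a) ** 2   # variance + mean^2 of a Poisson(lam * a) variable
print(formula_value, closed_form)
```

The two printed values agree up to floating-point error, since the midpoint rule is exact for the affine integrand that arises here.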



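Let $h : \mathbb{R}^m \times M(\mathbb{R}^m) \to \mathbb{R}$ be a non-negative measurable function. Taking $v(x, \omega) := h(x, N(\omega))$ in (17.5), we have

$$E\left[\sum_{n\in\mathbb{N}} h(X_n, N - X_n)\right] = \lambda\, E_N^0\left[\int_{\mathbb{R}^m} h(x, N)\, dx\right].$$

Specializing this to $h(x, N) := g(x)\, 1_\Gamma(N)$ ($\Gamma \in \mathcal{M}(\mathbb{R}^m)$) gives

$$E\left[\sum_{n\in\mathbb{N}} g(X_n)\, 1_\Gamma(N - X_n)\right] = \lambda\, P_N^0(N \in \Gamma) \int_{\mathbb{R}^m} g(x)\, dx.$$

With $g(x) := 1_C(x)$, where $C \in \mathcal{B}(\mathbb{R}^m)$ is of finite positive Lebesgue measure,

$$E\left[\sum_{n\in\mathbb{N}} 1_C(X_n)\, 1_\Gamma(N - X_n)\right] = \lambda\, \ell^m(C)\, P_N^0(N \in \Gamma). \tag{17.7}$$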

Remark 17.1.13 A set $\Gamma \in \mathcal{M}(\mathbb{R}^m)$ represents a property that a measure $\mu \in M(\mathbb{R}^m)$ may or may not possess, and $N - X_n \in \Gamma$ means that the point process seen by an observer placed at the point $X_n$ (that is, precisely, $N - X_n$) possesses this property. For instance, with $\Gamma = \{\mu \,;\, \mu(B(0, a) \setminus \{0\}) = 0\}$, where $B(x, a)$ is the open ball of radius $a \ge 0$ centered at $x$, $\{N - X_n \in \Gamma\} = \{(N - X_n)(B(0, a) \setminus \{0\}) = 0\}$, that is, $\{N(B(X_n, a) \setminus \{X_n\}) = 0\}$. The sum $\sum_{n\in\mathbb{N}} 1_C(X_n)\, 1_\Gamma(N - X_n)$ counts the points of $N$ lying in $C$ and whose nearest neighbor is at a distance $\ge a$. For a general $\Gamma \in \mathcal{M}(\mathbb{R}^m)$, the point process

$$N_\Gamma(C) := \sum_{n\in\mathbb{N}} 1_C(X_n)\, 1_\Gamma(N - X_n) \qquad (C \in \mathcal{B}(\mathbb{R}^m)) \tag{17.8}$$

has, in view of (17.7), intensity $\lambda_\Gamma = \lambda P_N^0(N \in \Gamma)$.

Theorem 17.1.14 Under the Palm probability, there is a point at 0 (the origin of $\mathbb{R}^m$), that is,

$$P_N^0(N(\{0\}) = 1) = 1.$$

Proof. With $g(x) = 1_C(x)$ and $\Gamma = \{\mu \,;\, \mu(\{0\}) = 1\}$, Equality (17.7) becomes, since $N - X_n$ always has exactly one point at the origin,

$$E\left[\sum_{n\in\mathbb{N}} 1_C(X_n)\right] = \lambda\, \ell^m(C)\, P_N^0(N(\{0\}) = 1).$$

Noting that the left-hand side is $\lambda\, \ell^m(C)$, the result is proved. □

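Example 17.1.15: Superposition of independent stationary point processes. Recall that $S_x$ is, for any $x \in \mathbb{R}^m$, the translation by $x$ applied to measures $\mu \in M(\mathbb{R}^m)$: $S_x(\mu)(C) = \mu(C + x)$. Let $\mathcal{P}$ be a probability measure on $(M(\mathbb{R}^m), \mathcal{M}(\mathbb{R}^m))$ such that $\mathcal{P} \circ S_x = \mathcal{P}$ for all $x \in \mathbb{R}^m$. Taking $N$ equal to $\Phi$, the identity map of $M(\mathbb{R}^m)$, we obtain a stationary random measure $(\Phi, S_x, \mathcal{P})$, which is said to be in canonical form.

Let $(M_i, \mathcal{M}_i, S_x^{(i)}, \Phi_i)$ ($1 \le i \le k$) be replicas of $(M(\mathbb{R}^m), \mathcal{M}(\mathbb{R}^m), S_x, \Phi)$ and let $\mathcal{P}_i$ be a probability on $(M_i, \mathcal{M}_i)$ which is $S_x^{(i)}$-invariant for all $x \in \mathbb{R}^m$. Suppose that for all $i$ ($1 \le i \le k$), $\Phi_i$ is $\mathcal{P}_i$-almost surely a simple point process with finite and positive intensity $\lambda_i$. Define the product space

$$(\Omega, \mathcal{F}, P) = \left(\prod_{i=1}^k M_i,\ \bigotimes_{i=1}^k \mathcal{M}_i,\ \bigotimes_{i=1}^k \mathcal{P}_i\right)$$

and, for each $x \in \mathbb{R}^m$, define $\theta_x := \otimes_{i=1}^k S_x^{(i)}$, with the meaning that $\theta_x(\omega) = (S_x^{(i)} \mu_i \,;\, 1 \le i \le k)$, where $\omega = (\mu_i \,;\, 1 \le i \le k)$. Define $N_i(\omega) := \mu_i$ and

$$N(\omega) := \sum_{i=1}^k \mu_i.$$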

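Then $(N, \theta_x, P)$ is a stationary point process, the superposition of the stationary point processes $(N_i, \theta_x, P)$ ($1 \le i \le k$). Denote by $\mathcal{P}_i^0$ the Palm probability associated to $(\Phi_i, \mathcal{P}_i)$. It will be proved below that

$$P_N^0 = \sum_{i=1}^k \frac{\lambda_i}{\lambda} \left( \Big(\bigotimes_{j=1}^{i-1} \mathcal{P}_j\Big) \otimes \mathcal{P}_i^0 \otimes \Big(\bigotimes_{j=i+1}^{k} \mathcal{P}_j\Big) \right), \tag{17.9}$$

where $\lambda = \sum_{i=1}^k \lambda_i$.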

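Remark 17.1.16 The interpretation of (17.9) is the following. With probability $\lambda_i / \lambda$ the point at the origin in the Palm version comes from the $i$-th point process and the probability distribution of the $i$-th process is then its Palm probability, whereas the other processes keep their stationary distributions. All the $k$ point processes remain independent.

Proof of (17.9): By definition, for $A = \prod_{i=1}^k A_i$, where $A_i \in \mathcal{M}_i$,

$$\begin{aligned}
P_N^0(A) &= \frac{1}{\lambda} E\left[\int_{(0,1]^m} (1_A \circ \theta_x)\, N(dx)\right] \\
&= \frac{1}{\lambda} \int_{M_1} \cdots \int_{M_k} \left\{ \int_{(0,1]^m} \Big(\prod_{i=1}^k 1_{A_i} \circ S_x^{(i)}\Big) \sum_{j=1}^k \Phi_j(dx) \right\} \mathcal{P}_1(d\mu_1) \cdots \mathcal{P}_k(d\mu_k) \\
&= \frac{1}{\lambda} \sum_{j=1}^k \int_{M_1} \cdots \int_{M_k} \left\{ \int_{(0,1]^m} \Big(\prod_{i=1}^k 1_{A_i} \circ S_x^{(i)}\Big) \Phi_j(dx) \right\} \mathcal{P}_1(d\mu_1) \cdots \mathcal{P}_k(d\mu_k).
\end{aligned}$$

But (Fubini and the definition of Palm probability $\mathcal{P}_j^0$)

$$\frac{1}{\lambda_j} \int_{M_1} \cdots \int_{M_k} \left\{ \int_{(0,1]^m} \prod_{i=1}^k (1_{A_i} \circ S_x^{(i)})\, \Phi_j(dx) \right\} \mathcal{P}_1(d\mu_1) \cdots \mathcal{P}_k(d\mu_k) = \mathcal{P}_j^0(A_j) \prod_{i=1,\, i \ne j}^k \mathcal{P}_i(A_i),$$

where we have taken into account the $S_x^{(i)}$-invariance of $\mathcal{P}_i$. Therefore

$$P_N^0\left(\prod_{i=1}^k A_i\right) = \sum_{1 \le j \le k} \frac{\lambda_j}{\lambda}\, \mathcal{P}_j^0(A_j) \prod_{i=1,\, i \ne j}^k \mathcal{P}_i(A_i),$$

which implies (17.9), by Theorem 2.1.42.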

Thinning and Conditioning

Let $(N, \theta_x, P)$ be a simple stationary point process on $\mathbb{R}^m$ with finite positive intensity. For $U \in \mathcal{F}$, define

$$N_U(\omega, C) = \int_C 1_U(\theta_x(\omega))\, N(\omega, dx) \qquad (C \in \mathcal{B}(\mathbb{R}^m)). \tag{17.10}$$

Such a point process is therefore obtained by thinning of $N$, a point $x \in N(\omega)$ being retained if and only if $\theta_x(\omega) \in U$.

Example 17.1.17: Mark selection. Let $((N, Z), \theta_x, P)$ be a stationary marked point process. Take $U = \{Z_0 \in L\}$ for some $L \in \mathcal{K}$. Then, since $Z_0(\theta_{X_n}(\omega)) = Z_n(\omega)$,

$$N_U(C) = \sum_{n\in\mathbb{N}} 1_L(Z_n)\, 1_C(X_n).$$

The point process $N_U$ is obtained by thinning $N$, only retaining the points $X_n$ with a mark $Z_n$ falling in $L$.

$(N_U, \theta_x, P)$ is obviously a stationary point process and it has a finite intensity (since $N_U \le N$). If the intensity $\lambda_U$ of $N_U$ is positive, its Palm probability is given by

$$P_{N_U}^0(A) = \frac{1}{\lambda_U} E\left[\int_{(0,1]^m} (1_A \circ \theta_x)\, N_U(dx)\right],$$

where

$$\lambda_U = E\left[\int_{(0,1]^m} (1_U \circ \theta_x)\, N(dx)\right] = \lambda P_N^0(U).$$

In addition,

$$E\left[\int_{(0,1]^m} (1_A \circ \theta_x)\, N_U(dx)\right] = E\left[\int_{(0,1]^m} (1_A \circ \theta_x)(1_U \circ \theta_x)\, N(dx)\right] = \lambda P_N^0(A \cap U).$$

Therefore

$$P_{N_U}^0(A) = \frac{P_N^0(A \cap U)}{P_N^0(U)} = P_N^0(A \mid U).$$
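The identity expressing the intensity of the thinned process as λ times the Palm probability of U can be checked by simulation in a simple case. The sketch below assumes a Poisson process on the line and takes for U the event that there is no point in (0, a]. By Slivnyak's theorem (a standard fact that is not proved in this section), the Palm probability of U is then exp(−λa). The function name and the parameter values are illustrative.

```python
import math
import random

def thinned_intensity_estimate(lam, a, horizon, seed=0):
    """Simulate a Poisson process and keep the points of (0, horizon] whose
    distance to the next point exceeds a; return (retained count) / horizon."""
    rng = random.Random(seed)
    points = []
    t = rng.expovariate(lam)
    while t <= horizon + a:      # go slightly beyond the window to see right neighbours
        points.append(t)
        t += rng.expovariate(lam)
    points.append(t)             # first point beyond horizon + a
    retained = 0
    for current, nxt in zip(points, points[1:]):
        if current <= horizon and nxt - current > a:
            retained += 1
    return retained / horizon

lam, a = 2.0, 0.5
estimate = thinned_intensity_estimate(lam, a, horizon=20_000.0)
print(estimate, lam * math.exp(-lam * a))   # empirical vs. lam * P0(U)
```

The empirical intensity of the retained points should be close to λ·exp(−λa), about 0.736 for the values chosen here.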

Note that the sequence of marks could take its values in $M(\mathbb{R}^m)$, for instance, $Z_n = N - X_n$. Recall that $N - X_n = S_{X_n}(N)$. Taking $U := \{\omega \,;\, N(\omega) \in \Gamma\}$ where $\Gamma \in \mathcal{M}(\mathbb{R}^m)$, we see that in this case $N_U \equiv N_\Gamma$, where the latter is defined by (17.8).

Example 17.1.18: Superposition of point processes. This generalizes Example 17.1.15. Let $N_i$ ($1 \le i \le k$) be point processes on $\mathbb{R}^m$, all compatible with the flow $\{\theta_x\}_{x\in\mathbb{R}^m}$, with positive finite intensities $\lambda_i$ ($1 \le i \le k$) respectively, but not necessarily independent. Call $N$ their superposition. $N$ is assumed simple. From Bayes’ rule:

$$P_N^0(A) = \sum_{i=1}^k P_N^0(N_i(\{0\}) = 1)\, P_N^0(A \mid N_i(\{0\}) = 1).$$

But

$$P_N^0(N_i(\{0\}) = 1) = \frac{1}{\lambda} E\left[\sum_{n\in\mathbb{N}} 1_{(0,1]^m}(X_n)\, 1_{\{N_i(\{X_n\}) = 1\}}\right] = \frac{1}{\lambda} E\left[N_i((0, 1]^m)\right] = \frac{\lambda_i}{\lambda}.$$

Let $U = \{N_i(\{0\}) = 1\}$. Since we have $N_U = N_i$ (with the notation of (17.10)), we obtain $P_N^0(A \mid N_i(\{0\}) = 1) = P_{N_i}^0(A)$. Therefore

$$P_N^0(A) = \sum_{i=1}^k \frac{\lambda_i}{\lambda}\, P_{N_i}^0(A).$$
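The weights in this mixture have a simple empirical meaning: they are the long-run fractions of points of the superposition contributed by each component. A minimal simulation sketch follows, assuming two independent Poisson processes on the line. Independence and the Poisson assumption are made only for this illustration (the formula itself does not require them), and the function names are hypothetical.

```python
import random

def poisson_points(lam, horizon, rng):
    """Points of a rate-lam Poisson process on (0, horizon]."""
    points, t = [], rng.expovariate(lam)
    while t <= horizon:
        points.append(t)
        t += rng.expovariate(lam)
    return points

def origin_fractions(rates, horizon, seed=0):
    """Fraction of the points of the superposition that come from each component."""
    rng = random.Random(seed)
    counts = [len(poisson_points(lam, horizon, rng)) for lam in rates]
    total = sum(counts)
    return [c / total for c in counts]

rates = [1.0, 3.0]
fractions = origin_fractions(rates, horizon=10_000.0)
print(fractions)   # should be close to [0.25, 0.75], i.e. lambda_i / lambda
```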

17.2 Basic Properties and Formulas

Attention will now be restricted to simple stationary point processes on the real line. The notation t instead of x will emphasize the fact that one is working on the real line.

17.2.1 Event-time Stationarity

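The Palm probability $P_N^0$ associated with the simple stationary point process $(N, \theta_t, P)$ has, as we saw for the general case in Theorem 17.1.14, its mass concentrated on $\Omega_0 := \{T_0 = 0\}$. Recall that the sequence of points $\{T_n\}_{n\in\mathbb{Z}}$ is defined in such a way that it is strictly increasing and such that $T_0 \le 0 < T_1$.

[Figure: the points $T_0, T_1, T_2, T_3$ of $N$ on the time axis, with a time $t$ marked, and below them the points $T_{-2} \circ \theta_t$, $T_{-1} \circ \theta_t$, $T_0 \circ \theta_t$, $T_1 \circ \theta_t$ of the shifted process $N \circ \theta_t = S_t N$ relative to the new origin 0.]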
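The indexing convention recalled above, and the way shifts act on it, can be sketched as follows. A finite sorted list of points stands in for a point process on the whole line, and the helper names (`shift`, `point`, `shift_to_point`) are illustrative assumptions. The example also shows that a shift by a random point re-indexes the sequence: moving the first point to the right of the origin to 0 twice in a row moves the second point to the origin, which is not the same as shifting by twice the location of the first point.

```python
from bisect import bisect_right

def shift(points, t):
    """Points of N o theta_t = S_t N: every point T becomes T - t."""
    return [p - t for p in points]

def point(points, n):
    """T_n under the convention T_0 <= 0 < T_1 (points must be sorted)."""
    index_of_T1 = bisect_right(points, 0)    # first point strictly to the right of 0
    return points[index_of_T1 - 1 + n]

def shift_to_point(points, n):
    """theta_{T_n}: move the n-th point to the origin."""
    return shift(points, point(points, n))

N = [-4, -1, 2, 3, 7, 12]    # T_0 = -1, T_1 = 2, T_2 = 3, T_3 = 7
twice = shift_to_point(shift_to_point(N, 1), 1)
same_as_T2 = twice == shift_to_point(N, 2)        # theta_{T_1} o theta_{T_1} = theta_{T_2}
same_as_2T1 = twice == shift(N, 2 * point(N, 1))  # compare with the shift by 2 * T_1
print(same_as_T2, same_as_2T1)   # prints: True False
```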

The mapping $\theta := \theta_{T_1}$, defined from $\Omega_0$ into $\Omega_0$, is a bijection, with inverse $\theta^{-1} = \theta_{T_{-1}}$. Also, on $\Omega_0$, $\theta_{T_n} = \theta^n$ for all $n \in \mathbb{Z}$.² Note that the above is not true on $\Omega$ (for instance, the inverse of $\theta_{T_1}$ is not $\theta_{T_{-1}}$; Exercise 17.5.9).

For mappings of the form $\theta_U$ with $U$ random, the composition rule (b) of Definition 17.1.2 is no longer valid, in that we do not have in general $\theta_U \circ \theta_V = \theta_{U+V}$ when $U$ and $V$ are random variables. The effect of $\theta_U \circ \theta_V$ on a point process $N$ with sequence of points $\{T_n\}_{n\in\mathbb{Z}}$ is best understood as follows. One first applies the shift $\theta_V$, obtaining a point process $N' = \theta_V N$ whose points are of the form $T_n - V$. However, the sequence of these points has to be reindexed to obtain the ordered sequence $\{T_n'\}_{n\in\mathbb{Z}}$ such that $T_0' \le 0 < T_1'$. (For instance, with $V = T_2$, $T_0' = 0$ and $T_1' = T_3 - T_2$.) Once this is done, one can reiterate the operation with $\theta_U$ to obtain a point process $N''$ whose sequence of points is $\{T_n''\}_{n\in\mathbb{Z}}$. But beware: this has now shifted $N'$ by $-U(\theta_V(\omega))$. For instance, if $V = T_2$ and $U = T_3$, you have to apply to $N'$ the shift of $-T_3'$. Indeed you must remember that $\theta_{T_3}$ means “the shift that moves to 0 the third point strictly to the right of 0”.

² It is perhaps worthwhile to emphasize the fact that $\theta_{T_n}$ is “the shift that moves the n-th point of a point process to the origin”.

The following result is referred to as the “event-time stationarity”.

Theorem 17.2.1 $P_N^0$ is $\theta$-invariant.

Proof. First observe that for all $A \in \mathcal{F}$,

$$1_{\theta^{-1}(A)} \circ \theta_{T_n} = 1(\theta_{T_n} \in \theta^{-1}(A)) = 1(\theta_{T_{n+1}} \in A).$$

Formula (17.4) with $A \in \mathcal{F}$ and $C = (0, t]$ yields

$$\begin{aligned}
\left| P_N^0(A) - P_N^0(\theta^{-1}(A)) \right| &\le \frac{1}{\lambda t} E\left| \sum_{n\in\mathbb{Z}} \left(1_A \circ \theta_{T_n} - 1_{\theta^{-1}(A)} \circ \theta_{T_n}\right) 1_{(0,t]}(T_n) \right| \\
&= \frac{1}{\lambda t} E\left| \sum_{n\in\mathbb{Z}} \left(1_A \circ \theta_{T_n} - 1_A \circ \theta_{T_{n+1}}\right) 1_{(0,t]}(T_n) \right| \le \frac{2}{\lambda t}.
\end{aligned}$$

Letting $t \to \infty$, we obtain $P_N^0(A) = P_N^0(\theta^{-1}(A))$. □

In particular, if $\{Z(t)\}_{t\in\mathbb{R}}$ is compatible with the flow $\{\theta_t\}_{t\in\mathbb{R}}$ and therefore stationary under $P$, the sequence $\{Z(T_n)\}_{n\in\mathbb{Z}}$ is, under $P_N^0$, a stationary sequence.

Example 17.2.2: Palm–Khinchin equations.³ Let, for $k \in \mathbb{N}$ and $t \ge 0$, $\varphi_k(t) := P_N^0(N((0, t]) = k)$. We have the Palm–Khinchin equations:

$$P(N((0, t]) > k) = \lambda \int_0^t \varphi_k(s)\, ds.$$

To prove this, observe that

$$1_{\{N((0,t]) > k\}} = \int_{(0,t]} 1_{\{N((s,t]) = k\}}\, N(ds),$$

and deduce from this and $N((s, t]) = N((0, t - s]) \circ \theta_s$ that

$$1_{\{N((0,t]) > k\}} = \int_{\mathbb{R}} 1_{(0,t]}(s) \left(1_{\{N((0,t-s]) = k\}} \circ \theta_s\right) N(ds).$$

By the Campbell–Mecke formula, the expectation of the right-hand side with respect to $P$ is equal to

$$\begin{aligned}
\lambda \int_{\mathbb{R}} 1_{(0,t]}(s)\, P_N^0(N((0, t - s]) = k)\, ds &= \lambda \int_0^t P_N^0(N((0, t - s]) = k)\, ds \\
&= \lambda \int_0^t P_N^0(N((0, s]) = k)\, ds.
\end{aligned}$$
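For a homogeneous Poisson process, the Palm–Khinchin equations can be verified numerically. The sketch below assumes Slivnyak's theorem (a standard fact, not established in this section), which gives φ_k(s) = exp(−λs)(λs)^k / k! for a Poisson process of intensity λ. The function names and the midpoint quadrature are illustrative choices.

```python
import math

def phi(k, s, lam):
    """phi_k(s) for a Poisson process, assuming Slivnyak's theorem."""
    return math.exp(-lam * s) * (lam * s) ** k / math.factorial(k)

def palm_khinchin_rhs(k, t, lam, n_steps=20_000):
    """lam * integral_0^t phi_k(s) ds, approximated by the midpoint rule."""
    h = t / n_steps
    return lam * h * sum(phi(k, (i + 0.5) * h, lam) for i in range(n_steps))

def prob_more_than_k(k, t, lam):
    """P(N((0, t]) > k) when N((0, t]) is Poisson with mean lam * t."""
    return 1.0 - sum(
        math.exp(-lam * t) * (lam * t) ** j / math.factorial(j) for j in range(k + 1)
    )

lam, t = 1.3, 2.0
for k in range(4):
    print(k, prob_more_than_k(k, t, lam), palm_khinchin_rhs(k, t, lam))
```

For each k the two printed columns should agree up to the quadrature error.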

Theorem 17.2.3 Let $(N, \theta_t, P)$ be a stationary point process on $\mathbb{R}$ with intensity $0 < \lambda < \infty$ and such that $P(N(\mathbb{R}) = 0) = 0$. Then

$$\lambda E_N^0[T_1] = 1.$$

Proof. From the Palm–Khinchin equation with $k = 0$,

$$P(N((0, t]) = 0) = 1 - \lambda \int_0^t \varphi_0(s)\, ds.$$

But $\varphi_0(s) = P_N^0(N((0, s]) = 0) = P_N^0(T_1 > s)$, and therefore

$$\begin{aligned}
0 = P(N(\mathbb{R}) = 0) &= \lim_{t \uparrow +\infty} P(N((0, t]) = 0) \\
&= 1 - \lambda \int_0^\infty P_N^0(T_1 > s)\, ds = 1 - \lambda E_N^0[T_1].
\end{aligned}$$ □

³ [Palm, 1943] for k = 0, [Khinchin, 1960] for k ≥ 1.

17.2.2 Inversion Formulas

How do we pass from the Palm probability to the stationary distribution? The formulas that do this are called inversion formulas.

Theorem 17.2.4 Let $(N, P, \theta_t)$ be a stationary simple point process on $\mathbb{R}$ with intensity $0 < \lambda < \infty$ and such that $P(N(\mathbb{R}) = 0) = 0$. For any non-negative random variable $f$,

$$E[f] = \lambda E_N^0\left[\int_0^{T_1} (f \circ \theta_s)\, ds\right].$$
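As an illustration, take for f the time from 0 to the next point. Under the Palm probability, shifting by s in the first inter-event interval reduces this time by s, so the inversion formula, normalised so that the constant function 1 gives back the identity of Theorem 17.2.3, expresses the stationary mean of this time as λ times half the Palm second moment of the inter-event time. The sketch below checks this on a renewal process with inter-event times uniform on (0, 2); this distribution, the function name, and the use of uniformly sampled instants on a long path as a stand-in for the stationary expectation (an ergodicity assumption) are all choices made for the example.

```python
import bisect
import itertools
import random

def inversion_check(n_gaps=100_000, n_samples=50_000, seed=0):
    rng = random.Random(seed)
    gaps = [rng.uniform(0.0, 2.0) for _ in range(n_gaps)]   # i.i.d. inter-event times
    event_times = list(itertools.accumulate(gaps))
    horizon = event_times[-1]
    lam = n_gaps / horizon                                  # empirical intensity
    # Stationary side: time to the next event, seen from uniformly chosen instants.
    total = 0.0
    for _ in range(n_samples):
        u = rng.uniform(0.0, horizon)
        next_event = event_times[bisect.bisect_left(event_times, u)]
        total += next_event - u
    stationary_side = total / n_samples
    # Palm side: lam * E0[ integral_0^{T1} (T1 - s) ds ] = lam * E0[T1^2] / 2.
    palm_side = lam * (sum(g * g for g in gaps) / n_gaps) / 2.0
    return stationary_side, palm_side

stationary_side, palm_side = inversion_check()
print(stationary_side, palm_side)   # both should be close to 2/3
```

Both values exceed half the mean inter-event time (here 1/2), which is the familiar inspection paradox: an arbitrary instant is more likely to fall in a long interval.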

One proof, among many others, makes use of the following conservation principle, whose intuition is that in a stationary state, “the smooth variation of a stochastic process is balanced by the variation due to jumps”. More precisely:

Theorem 17.2.5⁴ Let $(N, P, \theta_t)$ be a stationary simple point process on $\mathbb{R}$ with intensity $0 < \lambda < \infty$. Let $\{Y(t)\}_{t\in\mathbb{R}}$ be a real-valued stochastic process, right-continuous with left-hand limits, and let $\{Y'(t)\}_{t\in\mathbb{R}}$ be a real-valued stochastic process such that

$$Y(1) - Y(0) = \int_0^1 Y'(s)\, ds + \int_{(0,1]} (Y(s) - Y(s-))\, N(ds). \tag{17.11}$$

Suppose that the processes $Y$ and $Y'$ are compatible with the flow. Suppose moreover that

$$E[|Y'(0)|] < \infty \quad \text{and} \quad E_N^0[|Y(0) - Y(0-)|] < \infty. \tag{17.12}$$

Then

$$E[Y'(0)] + \lambda E_N^0[Y(0) - Y(0-)] = 0. \tag{17.13}$$

Proof. Observe that $E\left[\int_0^1 |Y'(s)|\, ds\right] = E[|Y'(0)|]$ and $E\left[\int_{(0,1]} |Y(s) - Y(s-)|\, N(ds)\right] = \lambda E_N^0[|Y(0) - Y(0-)|]$. Therefore, condition (17.12) guarantees that $Y(1) - Y(0)$ is $P$-integrable and by Lemma 16.1.5, $E[Y(1) - Y(0)] = 0$. Equating this to the expectation of the right-hand side of (17.11), we obtain the announced result since $E\left[\int_0^1 Y'(s)\, ds\right] = E[Y'(0)]$ and $E\left[\int_{(0,1]} (Y(s) - Y(s-))\, N(ds)\right] = \lambda E_N^0[Y(0) - Y$