English Pages 717 Year 2020
Table of contents :
Contents......Page 7
Introduction......Page 14
PART THREE: Advanced topics......Page 15
Acknowledgements......Page 16
I: PROBABILITY THEORY......Page 17
1.1 Sample Space, Events and Probability......Page 18
The Language of Probabilists......Page 19
The σ-field of Events......Page 20
1.1.2 Probability of Events......Page 21
Basic Formulas......Page 22
Negligible Sets......Page 23
1.2.1 Independent Events......Page 24
1.2.2 Bayes’ Calculus......Page 26
1.2.3 Conditional Independence......Page 28
1.3.1 Probability Distributions and Expectation......Page 29
Expectation for Discrete Random Variables......Page 31
Basic Properties of Expectation......Page 33
Independent Variables......Page 34
The Product Formula for Expectations......Page 36
The Binomial Distribution......Page 37
The Geometric Distribution......Page 38
The Poisson Distribution......Page 41
The Multinomial Distribution......Page 42
Random Graphs......Page 43
1.3.3 Conditional Expectation......Page 44
1.4.1 Generating Functions......Page 47
Moments from the Generating Function......Page 48
Counting with Generating Functions......Page 50
Random Sums......Page 51
1.4.2 Probability of Extinction......Page 53
1.5.1 The Borel–Cantelli Lemma......Page 55
1.5.2 Markov’s Inequality......Page 56
1.5.3 Proof of Borel’s Strong Law......Page 57
1.6 Exercises......Page 58
Chapter 2 Integration......Page 66
2.1.1 Measurable Functions......Page 67
Stability Properties of Measurable Functions......Page 71
Dynkin’s Systems......Page 73
2.1.2 Measure......Page 75
Negligible Sets......Page 77
Equality of Measures......Page 78
Existence of Measures......Page 79
2.2.1 Construction of the Integral......Page 81
2.2.2 Elementary Properties of the Integral......Page 86
More Elementary Properties......Page 87
2.2.3 Beppo Levi, Fatou and Lebesgue......Page 88
Differentiation under the integral sign......Page 89
2.3 The Other Big Theorems......Page 90
2.3.2 The Fubini–Tonelli Theorem......Page 91
Integration by Parts Formula......Page 96
2.3.3 The Riesz–Fischer Theorem......Page 98
Hölder’s Inequality......Page 99
The Riesz–Fischer Theorem......Page 100
The Product of a Measure by a Function......Page 103
Lebesgue’s decomposition......Page 104
2.4 Exercises......Page 106
3.1.1 Translation......Page 110
3.1.2 Probability Distributions......Page 112
Famous Continuous Random Variables......Page 114
Change of Variables......Page 118
Correlation Coefficient......Page 120
3.1.3 Independence and the Product Formula......Page 123
Order Statistics......Page 125
Sampling from a Distribution......Page 126
3.1.4 Characteristic Functions......Page 129
Ladder Random Variables......Page 131
Random Sums and Wald’s Identity......Page 132
3.1.5 Laplace Transforms......Page 133
3.2.1 Two Equivalent Definitions......Page 134
3.2.2 Independence and Noncorrelation......Page 136
3.2.3 The pdf of a Nondegenerate Gaussian Vector......Page 138
3.3.1 The Intermediate Theory......Page 140
Bayesian Tests of Hypotheses......Page 144
3.3.2 The General Theory......Page 146
Connection with the Intermediate Theory......Page 147
Properties of the Conditional Expectation......Page 148
3.3.3 The Doubly Stochastic Framework......Page 150
3.4 Exercises......Page 151
4.1.1 A Sufficient Condition and a Criterion......Page 159
A Criterion......Page 160
4.1.2 Beppo Levi, Fatou and Lebesgue......Page 162
4.1.3 The Strong Law of Large Numbers......Page 163
Large Deviations......Page 167
4.2.1 Convergence in Probability......Page 170
4.2.2 Convergence in Lp......Page 172
4.2.3 Uniform Integrability......Page 174
4.3.1 Kolmogorov’s Zero-one Law......Page 176
4.3.2 The Hewitt–Savage Zero-one Law......Page 177
4.4.1 The Role of Characteristic Functions......Page 180
Paul Lévy’s Characterization......Page 182
Bochner’s Theorem......Page 184
4.4.2 The Central Limit Theorem......Page 186
Statistical Applications......Page 189
The Variation Distance......Page 190
The Coupling Inequality......Page 191
A More General Definition......Page 193
A Bayesian Interpretation......Page 194
Convergence in Variation......Page 195
Radon Linear Forms......Page 196
Vague Convergence......Page 197
Fourier Transforms of Finite Measures......Page 198
The Proof of Paul Lévy’s criterion......Page 200
4.5.1 Almost-sure vs in Probability......Page 202
4.5.2 The Rank of Convergence in Distribution......Page 203
4.6 Exercises......Page 204
II: STANDARD STOCHASTIC PROCESSES......Page 210
Random Processes as Collections of Random Variables......Page 211
Finite-dimensional Distributions......Page 212
Transfer to Canonical Spaces......Page 214
5.1.2 Second-order Stochastic Processes......Page 215
Wide-sense Stationarity......Page 216
5.1.3 Gaussian Processes......Page 218
Gaussian Subspaces......Page 219
5.2.1 Versions and Modifications......Page 220
5.2.2 Kolmogorov’s Continuity Condition......Page 221
5.3.1 Measurable Processes and their Integrals......Page 224
5.3.2 Histories and Stopping Times......Page 225
Stopping Times......Page 226
5.4 Exercises......Page 230
6.1.1 The Markov Property on the Integers......Page 233
First-step Analysis......Page 237
Local characteristics......Page 239
Gibbs Distributions......Page 240
The Hammersley–Clifford Theorem......Page 245
Communication Classes......Page 247
Period......Page 248
6.2.2 Stationary Distributions and Reversibility......Page 249
Reversibility......Page 252
6.2.3 The Strong Markov Property......Page 254
Regenerative Cycles......Page 256
6.3.1 Classification of States......Page 257
The Potential Matrix Criterion of Recurrence......Page 258
6.3.2 The Stationary Distribution Criterion......Page 262
Birth-and-death Markov Chains......Page 266
6.3.3 Foster’s Theorem......Page 269
6.4.1 The Markov Chain Ergodic Theorem......Page 272
6.4.2 Convergence in Variation to Steady State......Page 274
6.4.3 Null Recurrent Case: Orey’s Theorem......Page 277
6.4.4 Absorption......Page 278
Before Absorption......Page 279
Time to Absorption......Page 281
Final Destination......Page 282
6.5.1 Basic Principle and Algorithms......Page 284
The Propp–Wilson Algorithm......Page 287
Sandwiching......Page 290
6.6 Exercises......Page 292
7.1.1 The Counting Process and the Interval Sequence......Page 300
The Counting Process......Page 301
Superposition of independent HPPS......Page 302
Strong Markov Property......Page 303
A Smoothing Formula for HPPS......Page 305
Watanabe’s Characterization......Page 307
The Strong Markov Property via Watanabe’s theorem......Page 308
7.2.1 The Infinitesimal Generator......Page 309
The Uniform HMC......Page 311
7.2.2 The Local Characteristics......Page 313
7.2.3 HMCS from HPPS......Page 317
Aggregation of States......Page 319
7.3.1 The Strong Markov Property......Page 321
7.3.2 Imbedded Chain......Page 322
7.3.3 Conditions for Regularity......Page 325
7.4.1 Recurrence......Page 328
Invariant Measures of Recurrent Chains......Page 329
The Stationary Distribution Criterion of Ergodicity......Page 332
Reversibility......Page 334
7.4.2 Convergence to Equilibrium......Page 335
7.5 Exercises......Page 336
8.1.1 Point Processes as Random Measures......Page 339
Points......Page 342
8.1.2 Point Process Integrals and the Intensity Measure......Page 344
Campbell’s Formula......Page 345
Cluster Point Processes......Page 346
Finite-dimensional Distributions......Page 348
The Laplace Functional......Page 349
The Avoidance Function......Page 352
8.2.1 Construction......Page 355
The Covariance Formula......Page 357
The Exponential Formula......Page 359
8.3.1 As Unmarked Poisson Processes......Page 361
Thinning and Coloring......Page 364
Transportation......Page 365
Poisson Shot Noise......Page 366
The Case of Finite Intensity Measures......Page 367
The Mixed Poisson Case......Page 369
8.4 The Boolean Model......Page 370
Isolated Points......Page 374
8.5 Exercises......Page 375
9.1.1 The Basic Example......Page 381
9.1.2 Multiple Access Communication......Page 382
The Instability of ALOHA......Page 383
Backlog Dependent Policies......Page 384
9.1.3 The Stack Algorithm......Page 385
9.2.1 Isolated Markovian Queues......Page 388
Congestion as a Birth-and-Death Process......Page 391
Burke’s Output Theorem......Page 395
Jackson Networks......Page 396
Gordon–Newell Networks......Page 400
9.3 Nonexponential Models......Page 401
9.3.1 M/GI/∞......Page 402
9.3.2 M/GI/1/∞/FIFO......Page 404
9.3.3 GI/M/1/∞/FIFO......Page 406
9.4 Exercises......Page 409
10.1.1 The Renewal Measure......Page 412
10.1.2 The Renewal Equation......Page 416
Solution of the Renewal Equation......Page 420
10.1.3 Stationary Renewal Processes......Page 422
Direct Riemann Integrability......Page 425
The Key Renewal Theorem......Page 429
Renewal Reward Processes......Page 431
10.2.2 The Coupling Proof of Blackwell’s Theorem......Page 433
10.2.3 Defective and Excessive Renewal Equations......Page 437
10.3.1 Examples......Page 439
10.3.2 The Limit Distribution......Page 440
10.4 Semi-Markov Processes......Page 444
Improper Multivariate Renewal Equations......Page 447
10.5 Exercises......Page 448
11.1.1 As a Rescaled Random Walk......Page 451
11.1.2 Simple Operations on Brownian motion......Page 453
11.1.3 Gauss–Markov Processes......Page 455
The Reflection Principle......Page 457
11.2.3 Nondifferentiability......Page 459
11.2.4 Quadratic Variation......Page 461
11.3.1 Construction......Page 462
Series Expansion of Wiener integrals......Page 464
A Characterization of the Wiener Integral......Page 465
11.3.2 Langevin’s Equation......Page 466
11.3.3 The Cameron–Martin Formula......Page 467
11.4 Fractal Brownian Motion......Page 469
11.5 Exercises......Page 471
12.1.1 Covariance Functions and Characteristic Functions......Page 475
Two Particular Cases......Page 476
The General Case......Page 477
12.1.2 Filtering of wss Stochastic Processes......Page 479
A First Approach......Page 481
White Noise via the Doob–Wiener Integral......Page 482
The Approximate Derivative Approach......Page 483
12.2.1 The Cramér–Khintchin Decomposition......Page 484
The Shannon–Nyquist Sampling Theorem......Page 487
12.2.2 A Plancherel–Parseval Formula......Page 488
12.2.3 Linear Operations......Page 489
12.3.1 The Power Spectral Matrix......Page 491
12.3.2 Bandpass Stochastic Processes......Page 495
Complementary reading......Page 496
12.4 Exercises......Page 497
III: ADVANCED TOPICS......Page 500
13.1.1 The Martingale Property......Page 501
Convex Functions of Martingales......Page 504
Martingale Transforms and Stopped Martingales......Page 505
13.1.2 Kolmogorov’s Inequality......Page 506
13.1.3 Doob’s Inequality......Page 507
13.1.4 Hoeffding’s Inequality......Page 508
A General Framework of Application......Page 509
13.2.1 Doob’s Optional Sampling Theorem......Page 511
Wald’s Exponential Formula......Page 516
13.2.3 The Maximum Principle......Page 517
13.3.1 The Fundamental Convergence Theorem......Page 520
Kakutani’s Theorem......Page 525
13.3.2 Backwards (or Reverse) Martingales......Page 526
Local Absolute Continuity......Page 529
Harmonic Functions and Markov Chains......Page 532
13.3.3 The Robbins–Sigmund Theorem......Page 533
Doob’s decomposition......Page 535
The Martingale Law of Large Numbers......Page 537
The Robbins–Monro algorithm......Page 538
13.4 Continuous-time Martingales......Page 540
13.4.1 From Discrete Time to Continuous Time......Page 542
13.4.2 The Banach Space MP......Page 546
13.4.3 Time Scaling......Page 547
13.5 Exercises......Page 549
14.1.1 Construction......Page 555
14.1.2 Properties of the Itô Integral Process......Page 558
14.1.3 Itô’s Integrals Defined as Limits in Probability......Page 561
Functions of Brownian Motion......Page 562
14.2.2 Some Extensions......Page 564
A Finite Number of Discontinuities......Page 566
The Vectorial Differentiation Rule......Page 567
14.3.1 Square-integrable Brownian Functionals......Page 568
14.3.2 Girsanov’s Theorem......Page 570
The Strong Markov Property of Brownian Motion......Page 574
14.3.3 Stochastic Differential Equations......Page 575
Strong and Weak Solutions......Page 576
14.3.4 The Dirichlet Problem......Page 577
14.4 Exercises......Page 579
15.1.1 The Martingale Definition......Page 583
15.1.2 Stochastic Intensity Kernels......Page 590
The Case of Marked Point Processes......Page 591
Stochastic Integrals and Martingales......Page 595
15.1.3 Martingales as Stochastic Integrals......Page 597
15.1.4 The Regenerative Form of the Stochastic Intensity......Page 600
15.2.1 Changing the History......Page 602
15.2.2 Absolutely Continuous Change of Probability......Page 606
The Reference Probability Method......Page 611
15.2.3 Changing the Time Scale......Page 613
Cryptology......Page 614
15.3.1 An Extension of Watanabe’s Theorem......Page 615
15.3.2 Grigelionis’ Embedding Theorem......Page 618
Variants of the Embedding Theorems......Page 621
15.4 Exercises......Page 622
16.1.1 Invariant Events and Ergodicity......Page 627
16.1.2 Mixing......Page 630
The Stochastic Process Point of View......Page 632
16.1.3 The Convex Set of Ergodic Probabilities......Page 633
16.2.1 Lindley’s Sequence......Page 634
16.2.2 Loynes’ Equation......Page 635
16.3.1 The Ergodic Case......Page 637
16.3.2 The Nonergodic Case......Page 638
16.3.3 The Continuous-time Ergodic Theorem......Page 641
16.4 Exercises......Page 643
17.1.1 Palm Distribution......Page 645
Compatibility......Page 647
Stationary Frameworks......Page 648
17.1.3 Palm Probability and the Campbell–Mecke Formula......Page 649
Thinning and Conditioning......Page 652
17.2.1 Event-time Stationarity......Page 654
17.2.2 Inversion Formulas......Page 656
Backward and Forward Recurrence Times......Page 658
17.2.3 The Exchange Formula......Page 659
17.2.4 From Palm to Stationary......Page 660
The G/G/1/∞ Queue in Continuous Time......Page 664
17.3.1 The Local Interpretation......Page 666
17.3.2 The Ergodic Interpretation......Page 668
17.4.1 The PSATA Property......Page 670
17.4.2 Queue Length at Departures or Arrivals......Page 674
17.4.3 Little’s Formula......Page 675
17.5 Exercises......Page 678
A.1 The Greatest Common Divisor......Page 682
A.2 Eigenvalues and Eigenvectors......Page 683
A.3 The Perron–Frobenius Theorem......Page 684
B.1 Infinite Products......Page 686
B.2 Abel’s Theorem......Page 687
Cesàro’s Lemma......Page 688
Toeplitz’s Lemma......Page 689
B.5 Subadditive Functions......Page 690
B.7 The Abstract Definition of Continuity......Page 691
B.8 Change of Time......Page 692
C.2 Schwarz’s Inequality......Page 694
C.3 Isometric Extension......Page 696
C.4 Orthogonal Projection......Page 697
C.5 Riesz’s Representation Theorem......Page 702
C.6 Orthonormal expansions......Page 703
Bibliography......Page 706
Index......Page 711
Universitext
Pierre Brémaud
Probability Theory and Stochastic Processes
Series Editors
Sheldon Axler, San Francisco State University
Carles Casacuberta, Universitat de Barcelona
John Greenlees, University of Warwick, Coventry
Angus MacIntyre, Queen Mary University of London
Kenneth Ribet, University of California, Berkeley
Claude Sabbah, École Polytechnique, CNRS, Université Paris-Saclay, Palaiseau
Endre Süli, University of Oxford
Wojbor A. Woyczyński, Case Western Reserve University
Universitext is a series of textbooks that presents material from a wide variety of mathematical disciplines at master’s level and beyond. The books, often well class-tested by their author, may have an informal, personal, even experimental approach to their subject matter. Some of the most successful and established books in the series have evolved through several editions, always following the evolution of teaching curricula, into very polished texts. Thus, as research topics trickle down into graduate-level teaching, first textbooks written for new, cutting-edge courses may make their way into Universitext.
More information about this series at http://www.springer.com/series/223
Pierre Brémaud Département d’Informatique INRIA, École Normale Supérieure Paris CX 5, France
ISSN 0172-5939 ISSN 2191-6675 (electronic)
Universitext
ISBN 978-3-030-40182-5 ISBN 978-3-030-40183-2 (eBook)
https://doi.org/10.1007/978-3-030-40183-2
© Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Pour Marion
Contents

Introduction......xv

Part One: Probability Theory......1

1 Warming Up......3
1.1 Sample Space, Events and Probability......3
1.1.1 Events......4
1.1.2 Probability of Events......6
1.2 Independence and Conditioning......9
1.2.1 Independent Events......9
1.2.2 Bayes’ Calculus......11
1.2.3 Conditional Independence......13
1.3 Discrete Random Variables......14
1.3.1 Probability Distributions and Expectation......14
1.3.2 Famous Discrete Probability Distributions......22
1.3.3 Conditional Expectation......29
1.4 The Branching Process......32
1.4.1 Generating Functions......32
1.4.2 Probability of Extinction......38
1.5 Borel’s Strong Law of Large Numbers......40
1.5.1 The Borel–Cantelli Lemma......40
1.5.2 Markov’s Inequality......41
1.5.3 Proof of Borel’s Strong Law......42
1.6 Exercises......43

2 Integration......51
2.1 Measurability and Measure......52
2.1.1 Measurable Functions......52
2.1.2 Measure......60
2.2 The Lebesgue Integral......66
2.2.1 Construction of the Integral......66
2.2.2 Elementary Properties of the Integral......71
2.2.3 Beppo Levi, Fatou and Lebesgue......73
2.3 The Other Big Theorems......75
2.3.1 The Image Measure Theorem......76
2.3.2 The Fubini–Tonelli Theorem......76
2.3.3 The Riesz–Fischer Theorem......83
2.3.4 The Radon–Nikodým Theorem......88
2.4 Exercises......91

3 Probability and Expectation......95
3.1 From Integral to Expectation......95
3.1.1 Translation......95
3.1.2 Probability Distributions......97
3.1.3 Independence and the Product Formula......108
3.1.4 Characteristic Functions......114
3.1.5 Laplace Transforms......118
3.2 Gaussian vectors......119
3.2.1 Two Equivalent Definitions......119
3.2.2 Independence and Noncorrelation......121
3.2.3 The pdf of a Nondegenerate Gaussian Vector......123
3.3 Conditional Expectation......125
3.3.1 The Intermediate Theory......125
3.3.2 The General Theory......131
3.3.3 The Doubly Stochastic Framework......135
3.3.4 The L2 theory of Conditional Expectation......136
3.4 Exercises......136

4 Convergences......145
4.1 Almost-sure Convergence......145
4.1.1 A Sufficient Condition and a Criterion......145
4.1.2 Beppo Levi, Fatou and Lebesgue......148
4.1.3 The Strong Law of Large Numbers......149
4.2 Two Other Types of Convergence......156
4.2.1 Convergence in Probability......156
4.2.2 Convergence in Lp......158
4.2.3 Uniform Integrability......160
4.3 Zero-one Laws......162
4.3.1 Kolmogorov’s Zero-one Law......162
4.3.2 The Hewitt–Savage Zero-one Law......163
4.4 Convergence in Distribution and in Variation......166
4.4.1 The Role of Characteristic Functions......166
4.4.2 The Central Limit Theorem......172
4.4.3 Convergence in Variation......176
4.4.4 Proof of Paul Lévy’s Criterion......182
4.5 The Hierarchy of Convergences......188
4.5.1 Almost-sure vs in Probability......188
4.5.2 The Rank of Convergence in Distribution......189
4.6 Exercises......190

Part Two: Standard Stochastic Processes......197

5 Generalities on Random Processes......199
5.1 The Distribution of a Random Process......199
5.1.1 Kolmogorov’s Theorem on Distributions......199
5.1.2 Second-order Stochastic Processes......203
5.1.3 Gaussian Processes......206
5.2 Random Processes as Random Functions......208
5.2.1 Versions and Modifications......208
5.2.2 Kolmogorov’s Continuity Condition......209
5.3 Measurability Issues......212
5.3.1 Measurable Processes and their Integrals......212
5.3.2 Histories and Stopping Times......213
5.4 Exercises......218

6 Markov Chains, Discrete Time......221
6.1 The Markov Property......221
6.1.1 The Markov Property on the Integers......221
6.1.2 The Markov Property on a Graph......227
6.2 The Transition Matrix......235
6.2.1 Topological Notions......235
6.2.2 Stationary Distributions and Reversibility......237
6.2.3 The Strong Markov Property......242
6.3 Recurrence and Transience......245
6.3.1 Classification of States......245
6.3.2 The Stationary Distribution Criterion......250
6.3.3 Foster’s Theorem......257
6.4 Long-run Behavior......260
6.4.1 The Markov Chain Ergodic Theorem......260
6.4.2 Convergence in Variation to Steady State......262
6.4.3 Null Recurrent Case: Orey’s Theorem......265
6.4.4 Absorption......266
6.5 Monte Carlo Markov Chain Simulation......272
6.5.1 Basic Principle and Algorithms......272
6.5.2 Exact Sampling......275
6.6 Exercises......280

7 Markov Chains, Continuous Time......289
7.1 Homogeneous Poisson Processes on the Line......289
7.1.1 The Counting Process and the Interval Sequence......289
7.1.2 Stochastic Calculus of hpps......294
7.2 The Transition Semigroup......298
7.2.1 The Infinitesimal Generator......298
7.2.2 The Local Characteristics......302
7.2.3 hmcs from hpps......306
7.3 Regenerative Structure......310
7.3.1 The Strong Markov Property......310
7.3.2 Imbedded Chain......311
7.3.3 Conditions for Regularity......314
7.4 Long-run Behavior......317
7.4.1 Recurrence......317
7.4.2 Convergence to Equilibrium......324
7.5 Exercises......325

8 Spatial Poisson Processes......329
8.1 Generalities on Point Processes......329
8.1.1 Point Processes as Random Measures......329
8.1.2 Point Process Integrals and the Intensity Measure......334
8.1.3 The Distribution of a Point Process......338
8.2 Unmarked Spatial Poisson Processes......345
8.2.1 Construction......345
8.2.2 Poisson Process Integrals......347
8.3 Marked Spatial Poisson Processes......351
8.3.1 As Unmarked Poisson Processes......351
8.3.2 Operations on Poisson Processes......354
8.3.3 Change of Probability Measure......357
8.4 The Boolean Model......360
8.5 Exercises......365

9 Queueing Processes......371
9.1 Discrete-time Markovian Queues......371
9.1.1 The Basic Example......371
9.1.2 Multiple Access Communication......372
9.1.3 The Stack Algorithm......375
9.2 Continuous-time Markovian Queues......378
9.2.1 Isolated Markovian Queues......378
9.2.2 Markovian Networks......385
9.3 Nonexponential Models......391
9.3.1 M/GI/∞......392
9.3.2 M/GI/1/∞/fifo......394
9.3.3 GI/M/1/∞/fifo......396
9.4 Exercises......399

10 Renewal and Regenerative Processes......403
10.1 Renewal Point processes......403
10.1.1 The Renewal Measure......403
10.1.2 The Renewal Equation......407
10.1.3 Stationary Renewal Processes......413
10.2 The Renewal Theorem......416
10.2.1 The Key Renewal Theorem......416
10.2.2 The Coupling Proof of Blackwell’s Theorem......424
10.2.3 Defective and Excessive Renewal Equations......428
10.3 Regenerative Processes......430
10.3.1 Examples......430
10.3.2 The Limit Distribution......431
10.4 Semi-Markov Processes......435
10.5 Exercises......439

11 Brownian Motion......443
11.1 Brownian Motion or Wiener Process......443
11.1.1 As a Rescaled Random Walk......443
11.1.2 Simple Operations on Brownian motion......445
11.1.3 Gauss–Markov Processes......447
11.2 Properties of Brownian Motion......449
11.2.1 The Strong Markov Property......449
11.2.2 Continuity......451
11.2.3 Nondifferentiability......451
11.2.4 Quadratic Variation......453
11.3 The Wiener–Doob Integral......454
11.3.1 Construction......454
11.3.2 Langevin’s Equation......458
11.3.3 The Cameron–Martin Formula......459
11.4 Fractal Brownian Motion......461
11.5 Exercises......463

12 Wide-sense Stationary Stochastic Processes......467
12.1 The Power Spectral Measure......467
12.1.1 Covariance Functions and Characteristic Functions......467
12.1.2 Filtering of wss Stochastic Processes......471
12.1.3 White Noise......473
12.2 Fourier Analysis of the Trajectories......476
12.2.1 The Cramér–Khintchin Decomposition......476
12.2.2 A Plancherel–Parseval Formula......480
12.2.3 Linear Operations......481
12.3 Multivariate wss Stochastic Processes......483
12.3.1 The Power Spectral Matrix......483
12.3.2 Bandpass Stochastic Processes......487
12.4 Exercises......489

Part Three: Advanced Topics......493

13 Martingales......495
13.1 Martingale Inequalities......495
13.1.1 The Martingale Property......495
13.1.2 Kolmogorov’s Inequality......500
13.1.3 Doob’s Inequality......501
13.1.4 Hoeffding’s Inequality......502
13.2 Martingales and Stopping Times......505
13.2.1 Doob’s Optional Sampling Theorem......505
13.2.2 Wald’s Formulas......510
13.2.3 The Maximum Principle......511
13.3 Convergence of Martingales......514
13.3.1 The Fundamental Convergence Theorem......514
13.3.2 Backwards (or Reverse) Martingales......520
CONTENTS
xii . . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
527 529 534 536 540 541 543
14 A Glimpse at Itˆ o’s Stochastic Calculus 14.1 The Itˆo Integral . . . . . . . . . . . . . . . . . . . . . 14.1.1 Construction . . . . . . . . . . . . . . . . . . 14.1.2 Properties of the Itˆo Integral Process . . . . . 14.1.3 Itˆo’s Integrals Deﬁned as Limits in Probability 14.2 Itˆo’s Diﬀerential Formula . . . . . . . . . . . . . . . . 14.2.1 Elementary Form . . . . . . . . . . . . . . . . 14.2.2 Some Extensions . . . . . . . . . . . . . . . . 14.3 Selected Applications . . . . . . . . . . . . . . . . . . 14.3.1 Squareintegrable Brownian Functionals . . . 14.3.2 Girsanov’s Theorem . . . . . . . . . . . . . . 14.3.3 Stochastic Diﬀerential Equations . . . . . . . 14.3.4 The Dirichlet Problem . . . . . . . . . . . . . 14.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
549 549 549 552 555 556 556 558 562 562 564 569 571 573
15 Point Processes with a Stochastic Intensity 15.1 Stochastic Intensity . . . . . . . . . . . . . . . . . . . . . 15.1.1 The Martingale Deﬁnition . . . . . . . . . . . . . 15.1.2 Stochastic Intensity Kernels . . . . . . . . . . . . 15.1.3 Martingales as Stochastic Integrals . . . . . . . . 15.1.4 The Regenerative Form of the Stochastic Intensity 15.2 Transformations of the Stochastic Intensity . . . . . . . . 15.2.1 Changing the History . . . . . . . . . . . . . . . . 15.2.2 Absolutely Continuous Change of Probability . . 15.2.3 Changing the Time Scale . . . . . . . . . . . . . . 15.3 Point Processes under a Poisson process . . . . . . . . . 15.3.1 An Extension of Watanabe’s Theorem . . . . . . 15.3.2 Grigelionis’ Embedding Theorem . . . . . . . . . 15.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
577 577 577 584 591 594 596 596 600 607 609 609 612 616
16 Ergodic Processes 16.1 Ergodicity and Mixing . . . . . . . . . . . . . . 16.1.1 Invariant Events and Ergodicity . . . . . 16.1.2 Mixing . . . . . . . . . . . . . . . . . . . 16.1.3 The Convex Set of Ergodic Probabilities 16.2 A Detour into Queueing Theory . . . . . . . . . 16.2.1 Lindley’s Sequence . . . . . . . . . . . . 16.2.2 Loynes’ Equation . . . . . . . . . . . . . 16.3 Birkhoﬀ’s Theorem . . . . . . . . . . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
621 621 621 624 627 628 628 629 631
13.3.3 The Robbins–Sigmund Theorem . . . . . 13.3.4 Squareintegrable Martingales . . . . . . 13.4 Continuoustime Martingales . . . . . . . . . . . 13.4.1 From Discrete Time to Continuous Time 13.4.2 The Banach Space Mp . . . . . . . . . . 13.4.3 Time Scaling . . . . . . . . . . . . . . . 13.5 Exercises . . . . . . . . . . . . . . . . . . . . . .
. . . . . . .
. . . . . . . .
. . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
CONTENTS
xiii
16.3.1 The Ergodic Case . . . . . . . . . . . . 16.3.2 The Nonergodic Case . . . . . . . . . 16.3.3 The Continuoustime Ergodic Theorem 16.4 Exercises . . . . . . . . . . . . . . . . . . . . .
. . . .
. . . .
17 Palm Probability 17.1 Palm Distribution and Palm Probability . . . . . 17.1.1 Palm Distribution . . . . . . . . . . . . . . 17.1.2 Stationary Frameworks . . . . . . . . . . . 17.1.3 Palm Probability and the Campbell–Mecke 17.2 Basic Properties and Formulas . . . . . . . . . . . 17.2.1 Eventtime Stationarity . . . . . . . . . . 17.2.2 Inversion Formulas . . . . . . . . . . . . . 17.2.3 The Exchange Formula . . . . . . . . . . . 17.2.4 From Palm to Stationary . . . . . . . . . . 17.3 Two Interpretations of Palm Probability . . . . . 17.3.1 The Local Interpretation . . . . . . . . . . 17.3.2 The Ergodic Interpretation . . . . . . . . . 17.4 General Principles of Queueing Theory . . . . . . 17.4.1 The pasta Property . . . . . . . . . . . . 17.4.2 Queue Length at Departures or Arrivals . 17.4.3 Little’s Formula . . . . . . . . . . . . . . . 17.5 Exercises . . . . . . . . . . . . . . . . . . . . . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
631 632 635 637
. . . . . . . . . . . . . . . Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .
639 639 639 641 643 648 648 650 653 654 660 660 662 664 664 668 669 672
. . . .
. . . .
. . . .
. . . .
A Number Theory and Linear Algebra 677 A.1 The Greatest Common Divisor . . . . . . . . . . . . . . . . . . . . . 677 A.2 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . 678 A.3 The Perron–Fr¨obenius Theorem . . . . . . . . . . . . . . . . . . . . 679 B Analysis B.1 Inﬁnite Products . . . . . . . . . . . . . . B.2 Abel’s Theorem . . . . . . . . . . . . . . . B.3 Tykhonov’s Theorem . . . . . . . . . . . . B.4 Ces`aro, Toeplitz and Kronecker’s Lemmas B.5 Subadditive Functions . . . . . . . . . . . B.6 Gronwall’s Lemma . . . . . . . . . . . . . B.7 The Abstract Deﬁnition of Continuity . . . B.8 Change of Time . . . . . . . . . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
681 681 682 683 683 685 686 686 687
C Hilbert Spaces C.1 Basic Deﬁnitions . . . . . . . . C.2 Schwarz’s Inequality . . . . . . C.3 Isometric Extension . . . . . . . C.4 Orthogonal Projection . . . . . C.5 Riesz’s Representation Theorem C.6 Orthonormal expansions . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
689 689 689 691 692 697 698
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
. . . . . .
Bibliography
701
Index
707
Introduction

This initiation to the theory of probability and random processes has three objectives. The first is to provide the theoretical background necessary for the student to feel comfortable and secure in the utilization of probabilistic models, particularly those involving stochastic processes. The second is to introduce the specific stochastic processes most often encountered in applications (operations research, insurance, finance, biology, physics, computer and communications sciences and signal processing), that is, Markov chains in discrete and continuous time, Poisson processes in time and space, renewal and regenerative processes, queues and their networks, wide-sense stationary processes, and Brownian motion. In addition to the above topics, which form the indispensable cultural background of an applied probabilist, this text gives (and this is the third objective) the advanced tools such as martingales, ergodic theory, Palm theory and stochastic integration, useful for the analysis of the more complex stochastic models. The only prerequisite is an elementary knowledge of calculus and of linear algebra usually acquired at the undergraduate or beginning graduate level, the occasional material beyond this being given in the Appendix or in the main text.
The above cartoon symbolizes the feeling of awe and distress that may invade the potential reader’s mind when considering the size of the book. The fact is that this text has been written for an ideal reader who has had no previous exposure to probability theory and who is willing to learn in a self-teaching mode the basics of the theory of stochastic processes that will constitute a solid platform both for applications and for more specialized studies. This implies that the material should be presented in a detailed, progressive and self-contained manner. However, there is sufficient modularity to devise a table of contents adapted to one’s appetite and interest. The material of this book has been divided into three parts roughly corresponding to the three objectives listed at the beginning of this introduction.
PART ONE: Probability theory

This part stands alone as a short course in probability at the intermediate level. It contains all the results referenced in the rest of the book, and the initiate can therefore skip it and use it as an appendix. The first chapter introduces the basic notions in the elementary framework of discrete random variables and gives a few tricks that permit us to obtain at an early stage nontrivial results such as the strong law of large numbers for coin tossing or the extinction probability of a branching process. One of its purposes is to persuade the neophyte of the power of a formal approach to probability while introducing the main concepts of expectation, independence and conditional expectation. Having acquired familiarity with the vocabulary and the spirit of probability theory, the reader will be ready for the development of this discipline in the framework of integration theory, of which the second chapter provides a detailed account. This theory requires more concentration from the beginner but the effort is worthwhile as it will provide her/him with enough confidence in the manipulation of stochastic models. Probability theory is usually developed in the more theoretical texts without a preliminary exposition of integration theory, whose results are then presented as need arises. But, as far as stochastic processes are concerned, the theory of integration with respect to a finite measure (such as a probability) is not sufficient. The third chapter translates the previous one into the probabilistic language and formalizes the concepts of distribution, independence and conditional expectation. The fourth chapter features the various notions of convergence of a sequence of random variables (almost-sure, in distribution, in variation, in probability and in the mean square) and their interconnections, and closes the first part on the probabilistic background directly useful in the rest of the book.
PART TWO: Standard stochastic processes

This part forms a basic course on stochastic processes. It begins with the pivotal Chapter 5, devoted to general issues such as trajectory continuity, measurability and stopping times. Chapter 6 and Chapter 7 introduce the stochastic processes that are the most popular and that can be treated at an elementary level, namely discrete-time homogeneous Markov chains, homogeneous Poisson processes on the line and continuous-time homogeneous Markov chains. Chapter 8 is devoted to Poisson processes in space, with or without marks, a versatile source of spatial models. Chapter 10 gives the essentials of renewal theory and its application to regenerative processes. Chapter 9 gives a panoramic view of the classical queues and their networks at the elementary level. Chapter 11 features the Brownian motion and the Doob–Wiener stochastic integral. The latter is, together with Bochner’s representation of characteristic functions, one of the foundations of the theory of wide-sense stationary stochastic processes of Chapter 12.
PART THREE: Advanced topics

This part complements the previous one in that it is not exclusively devoted to specific random processes, but rather to general classes of such processes. Only a taste of the topics treated here will be given since each one of them requires and deserves considerably more space.
Chapter 13 gives the basic theory of martingales, one of the most important all-purpose tools of probability theory. Chapter 14 is a short introduction to the Brownian motion stochastic calculus based on the Itô integral. It is a natural continuation of the chapters on Brownian motion and on martingales. Chapter 15, a novel item in the table of contents of a textbook on stochastic processes, introduces point processes on the line admitting a stochastic intensity and the associated stochastic calculus. Chapter 16 gives the essentials of ergodic theory. The presentation is not the classical one and gives the opportunity to extend the elementary results of Chapter 9 on queueing. Another novel item of the table of contents for a book at this level is Chapter 17 on Palm probability, a natural complement to the theory of renewal point processes.
Practical issues

The index gives the page(s) where a particular notation or abbreviation is used. These items will appear at the beginning of the list corresponding to their first letter. The special numbering of equations, such as (⋆), (†) and the like, is used only locally, inside proofs. Just before the Exercises section of each chapter there is a subsubsection entitled “Complementary reading” pointing at books (only books) where additional material connected with the current topic can be found. The selection is mainly based on two criteria: accessibility by a reader of this book and relevance to applications (with, of course, a natural bias towards the author’s own interests). No attempt at exhaustivity or proper crediting has been made, the reader being directed to the Bibliography or to the sporadic footnotes for this. In these subsubsections, only the year of the last edition is given. The full history is given in the Bibliography.
Acknowledgements

I wish to acknowledge the precious help of Eva Löcherbach (University of Paris I Sorbonne), Anne Bouillard (Nokia France), Paolo Baldi (University Roma II Tor Vergata) and Léo Miolane (Inria Paris). I warmly thank them and also, last but not least, Marina Reizakis, the patient and diligent editor of this book.

Pierre Brémaud
Paris, July 14, 2019
I: PROBABILITY THEORY
Chapter 1
Warming Up

We apparently live in a random world. There are events that we are unable to predict with absolute certainty. Is this world inherently random or does randomness just refer to our incapacity to solve the highly complex deterministic equations that rule the universe? In fact, probabilists do not attempt to take sides in this debate and just observe that there are some phenomena that look random and yet seem to exhibit some kind of regularity. The canonical example is a coin tossed over and over by a non-mischievous person: the result is an erratic sequence of heads and tails, yet there seems to be a balance between heads and tails. This regularity takes the form of the law of large numbers: the long-run proportion of heads is $\frac{1}{2}$, that is, as the number of tosses tends to infinity, the frequency of heads approaches $\frac{1}{2}$. Is this a physical law, or is it a mathematical theorem? At first sight, it is a physical law, and indeed it has to do with a complex process, but so complex that it is preferable to view the law of large numbers as a theorem resulting from a mathematical model. It took some time for the corresponding mathematical theory to emerge. The modern era, announced by the proof of the strong law of large numbers for coin tossing by Émile Borel in 1909, really started with Andreï Nikolaïevitch Kolmogorov, who axiomatized probability in 1933 in terms of the theory of measure and integration, which is presented in the next chapter. Before this, however, it is wise to introduce the terminology and the probabilistic concepts (expectation, independence, conditional expectation) in the elementary framework of discrete random variables. This is done in the current chapter, which contains the proofs of two results that demonstrate the power of probabilistic reasoning, already at an elementary level: Borel’s strong law of large numbers and the computation of the extinction probability of a branching process.

1.1 Sample Space, Events and Probability
The study of random phenomena requires a clear and precise language. That of probability theory features familiar mathematical objects such as points, sets and functions, which, however, receive a particular interpretation: points are outcomes (of an experiment), sets are events, functions are random numbers. The meaning of these terms will be given just after we recall the notation concerning the elementary operations on sets: union, intersection and complementation.

© Springer Nature Switzerland AG 2020. P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/9783030401832_1
1.1.1 Events
If $A$ and $B$ are subsets of some set $\Omega$, $A \cup B$ denotes their union and $A \cap B$ their intersection. In this book, $\overline{A}$ (rather than, for instance, $A^c$) denotes the complement of $A$ in $\Omega$. The notation $A + B$ (the sum of $A$ and $B$) implies by convention that $A$ and $B$ are disjoint, in which case it represents the union $A \cup B$. Similarly, the notation $\sum_{k=1}^{\infty} A_k$ is used for $\cup_{k=1}^{\infty} A_k$ only when the $A_k$’s are pairwise disjoint. The notation $A - B$ is used only if $B \subseteq A$, and it stands for $A \cap \overline{B}$. In particular, if $B \subseteq A$, then $A = B + (A - B)$. The symmetric difference of $A$ and $B$, that is, the set $(A \cup B) - (A \cap B)$, is denoted by $A \triangle B$. The indicator function of the subset $A \subseteq \Omega$ is the function $1_A : \Omega \to \{0, 1\}$ defined by
$$1_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{if } \omega \in \overline{A}. \end{cases}$$
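The set notation above can be tried out on a small finite sample space. The following is only an illustrative sketch: the sample space `OMEGA` and the helper names `complement`, `symmetric_difference` and `indicator` are assumptions made for this example and do not come from the text.

```python
# Illustrative sketch: events represented as Python sets.
# OMEGA and all helper names are hypothetical choices for this example.
OMEGA = frozenset(range(1, 7))  # a six-point sample space {1, ..., 6}


def complement(a):
    """Complement of the subset a in OMEGA."""
    return OMEGA - a


def symmetric_difference(a, b):
    """The set (A ∪ B) − (A ∩ B)."""
    return (a | b) - (a & b)


def indicator(a):
    """Return the indicator function of a, as a function on OMEGA."""
    return lambda omega: 1 if omega in a else 0


A = frozenset({1, 3, 5})
B = frozenset({1, 2})
one_A, one_B = indicator(A), indicator(B)
one_A_and_B = indicator(A & B)
one_not_A = indicator(complement(A))

# Two elementary identities, checked pointwise on OMEGA:
# the indicator of an intersection is the product of the indicators,
# and the indicator of a complement is one minus the indicator.
for omega in OMEGA:
    assert one_A_and_B(omega) == one_A(omega) * one_B(omega)
    assert one_not_A(omega) == 1 - one_A(omega)

print(sorted(symmetric_difference(A, B)))
```

Representing events as `frozenset` objects keeps them hashable, which is convenient when collections of events are manipulated later.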
Random phenomena are observed by means of experiments (performed either by man or nature). Each experiment results in an outcome. The collection of all possible outcomes $\omega$ is called the sample space $\Omega$. Any subset $A$ of the sample space $\Omega$ will be regarded for the time being¹ as a representation of some event.

Example 1.1.1: Tossing a die, take 1. The experiment consists in tossing a die once. The possible outcomes are $\omega = 1, 2, \ldots, 6$ and the sample space is the set $\Omega = \{1, 2, 3, 4, 5, 6\}$. The subset $A = \{1, 3, 5\}$ is the event “the result is odd.”

Example 1.1.2: Throwing a dart. The experiment consists in throwing a dart at a wall. The sample space can be chosen to be the plane $\mathbb{R}^2$. An outcome is the position $\omega = (x, y) \in \mathbb{R}^2$ hit by the dart. The subset $A = \{(x, y);\ x^2 + y^2 > 1\}$ is an event that could be named “you missed the dartboard” if the dartboard is a closed disk of radius 1 centered at 0.
Example 1.1.3: Heads or tails, take 1. The experiment is an infinite succession of coin tosses. One can take for the sample space the collection of all sequences $\omega = \{x_n\}_{n \ge 1}$, where $x_n = 1$ or $0$, depending on whether the $n$th toss results in heads or tails. The subset $A = \{\omega;\ x_k = 1 \text{ for } k = 1 \text{ to } 1000\}$ is a lucky event for anyone betting on heads!
The Language of Probabilists

Probabilists have their own dialect. They say that outcome $\omega$ realizes event $A$ if $\omega \in A$. For instance, in the die model of Example 1.1.1, the outcome $\omega = 1$ realizes the event “result is odd”, since $1 \in A = \{1, 3, 5\}$. Obviously, if $\omega$ does not realize $A$, it realizes $\overline{A}$. Event $A \cap B$ is realized by outcome $\omega$ if and only if $\omega$ realizes both $A$ and $B$. Similarly, $A \cup B$ is realized by $\omega$ if and only if at least one event among $A$ and $B$ is realized (both can be realized). Two events $A$ and $B$ are called incompatible when $A \cap B = \emptyset$. In other words, event $A \cap B$ is impossible: no outcome $\omega$ can realize both $A$ and $B$. For this reason one refers to the empty set $\emptyset$ as the impossible event. Naturally, $\Omega$ is called the certain event.

¹ See Definition 1.1.4.
Recall now that the notation $\sum_{k=1}^{\infty} A_k$ is used for $\cup_{k=1}^{\infty} A_k$ only when the subsets $A_k$ are pairwise disjoint. In the terminology of sets, the sets $A_1, A_2, \ldots$ form a partition of $\Omega$ if
$$\sum_{k=1}^{\infty} A_k = \Omega.$$
The probabilists then say that the events $A_1, A_2, \ldots$ are mutually exclusive and exhaustive. They are exhaustive in the sense that any outcome $\omega$ realizes at least one among them. They are mutually exclusive in the sense that any two distinct events among them are incompatible. Therefore, any $\omega$ realizes one and only one of the events $A_n$ ($n \ge 1$). If $B \subseteq A$, event $B$ is said to imply event $A$, since $\omega$ realizes $A$ when it realizes $B$.
The σ-field of Events

Probability theory assigns to each event a number, the probability of the said event. The collection $\mathcal{F}$ of events to which a probability is assigned is not always identical to the collection of all subsets of $\Omega$. The requirement on $\mathcal{F}$ is that it should be a σ-field:

Definition 1.1.4 Let $\mathcal{F}$ be a collection of subsets of $\Omega$, such that
(i) $\Omega$ is in $\mathcal{F}$,
(ii) if $A$ belongs to $\mathcal{F}$, then so does its complement $\overline{A}$, and
(iii) if $A_1, A_2, \ldots$ belong to $\mathcal{F}$, then so does their union $\cup_{k=1}^{\infty} A_k$.
One then calls $\mathcal{F}$ a σ-field on $\Omega$ (here the σ-field of events).

Note that the impossible event $\emptyset$, being the complement of the certain event $\Omega$, is in $\mathcal{F}$. Note also that if $A_1, A_2, \ldots$ belong to $\mathcal{F}$, then so does their intersection $\cap_{k=1}^{\infty} A_k$ (Exercise 1.6.5).

Example 1.1.5: Trivial σ-field, gross σ-field. These are respectively the collection $\mathcal{P}(\Omega)$ of all subsets of $\Omega$ and the σ-field with only two sets: $\{\Omega, \emptyset\}$. If the sample space $\Omega$ is finite or countable, one usually (but not always and not necessarily) considers any subset of $\Omega$ to be an event, that is, $\mathcal{F} = \mathcal{P}(\Omega)$.
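On a finite sample space, the three conditions of Definition 1.1.4 can be checked mechanically, and the smallest σ-field containing a few given sets can be computed by closing under complements and unions. The sketch below rests on that finiteness assumption (countable unions reduce to finite ones); the function names `generate_sigma_field` and `is_sigma_field` are hypothetical.

```python
from itertools import combinations


def generate_sigma_field(omega, generators):
    """Smallest collection containing the generators, omega and the empty set,
    closed under complement and pairwise union. On a finite omega this is the
    sigma-field generated by the given sets."""
    omega = frozenset(omega)
    field = {omega, frozenset()} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(field)  # snapshot, since field grows during the pass
        for a in current:
            if omega - a not in field:
                field.add(omega - a)
                changed = True
        for a, b in combinations(current, 2):
            if a | b not in field:
                field.add(a | b)
                changed = True
    return field


def is_sigma_field(omega, field):
    """Check conditions (i)-(iii) of the definition on a finite omega."""
    omega = frozenset(omega)
    return (
        omega in field
        and all(omega - a in field for a in field)
        and all(a | b in field for a in field for b in field)
    )


OMEGA = {1, 2, 3, 4, 5, 6}
F_odd = generate_sigma_field(OMEGA, [{1, 3, 5}])    # generated by "result is odd"
F_two = generate_sigma_field(OMEGA, [{1}, {1, 2}])  # generated by two sets

print(len(F_odd), len(F_two))
```

The second family has three atoms, $\{1\}$, $\{2\}$ and $\{3, 4, 5, 6\}$, so its σ-field has $2^3$ elements, strictly fewer than the $2^6$ subsets in $\mathcal{P}(\Omega)$.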
Example 1.1.6: Borel σ-field. The Borel σ-field on the $n$-dimensional euclidean space $\mathbb{R}^n$, denoted $\mathcal{B}(\mathbb{R}^n)$, is, by definition, the smallest σ-field on $\mathbb{R}^n$ that contains all rectangles, that is, all sets of the form $\prod_{j=1}^{n} I_j$, where the $I_j$’s are intervals of $\mathbb{R}$. This definition is not constructive and therefore one may wonder if there exist sets that are not Borel sets (that is, not sets in $\mathcal{B}(\mathbb{R}^n)$). The theory tells us that there are indeed such sets, but they are in a sense “pathological”: the proof of existence of non-Borel sets is not constructive, in the sense that it involves the axiom of choice. In any case, all the sets whose $n$-volume you have ever been able to compute in your early life are Borel sets. More about this in Chapter 2.
Example 1.1.7: Heads or tails, take 2. Take $\mathcal{F}$ to be the smallest σ-field that contains all the sets of the form $\{\omega;\ x_k = 1\}$ ($k \ge 1$). This σ-field also contains all the sets of the form $\{\omega;\ x_k = 0\}$ ($k \ge 1$) (pass to the complements) and therefore (take intersections) all the sets of the form $\{\omega;\ x_1 = a_1, \ldots, x_n = a_n\}$ for all $n \ge 1$ and all $a_1, \ldots, a_n \in \{0, 1\}$.

1.1.2 Probability of Events
The probability $P(A)$ of an event $A \in \mathcal{F}$ measures the likeliness of its occurrence. As a function defined on $\mathcal{F}$, the probability $P$ is required to satisfy a few properties, the axioms of probability.

Definition 1.1.8 A probability on $(\Omega, \mathcal{F})$ is a mapping $P : \mathcal{F} \to \mathbb{R}$ such that
(i) $0 \le P(A) \le 1$ for all $A \in \mathcal{F}$,
(ii) $P(\Omega) = 1$, and
(iii) $P\left(\sum_{k=1}^{\infty} A_k\right) = \sum_{k=1}^{\infty} P(A_k)$ for all sequences $\{A_k\}_{k \ge 1}$ of pairwise disjoint events in $\mathcal{F}$.

Property (iii) is called σ-additivity. The triple $(\Omega, \mathcal{F}, P)$ is called a probability space, or probability model.

Example 1.1.9: Tossing a die, take 2. An event $A$ is a subset of $\Omega = \{1, 2, 3, 4, 5, 6\}$. The formula $P(A) = \frac{|A|}{6}$, where $|A|$ is the cardinality of $A$ (the number of elements in $A$), defines a probability $P$.

Example 1.1.10: Heads or tails, take 3. Choose a probability $P$ such that for any event of the form $A = \{x_1 = a_1, \ldots, x_n = a_n\}$, where $a_1, \ldots, a_n$ are in $\{0, 1\}$,
$$P(A) = \frac{1}{2^n}.$$
Note that this does not define the probability of all events of $\mathcal{F}$. But the theory (wait until Chapter 3) tells us that there exists such a probability satisfying the above requirement and that this probability is unique.
Example 1.1.11: Random point in a square, take 1. The following is a possible model of a random point inside the unit square $[0, 1]^2 = [0, 1] \times [0, 1]$: $\Omega = [0, 1]^2$, $\mathcal{F}$ is the collection of sets in the Borel σ-field $\mathcal{B}(\mathbb{R}^2)$ that are contained in $[0, 1]^2$, and $P$ is required to assign to each rectangle $[a, b] \times [c, d] \subseteq [0, 1]^2$ its area $(b - a)(d - c)$. The theory tells us that there does indeed exist one and only one probability $P$ satisfying the above requirement, called the Lebesgue measure on $[0, 1]^2$, which formalizes the intuitive notion of “area”. (More about this in Chapters 2 and 3.)

The probability of Example 1.1.9 suggests an unbiased die, where each outcome 1, 2, 3, 4, 5, or 6 has the same probability. As we shall soon see, probability $P$ of Example 1.1.10 implies an unbiased coin and independent tosses (the emphasized terms will be defined later).

The axioms of probability are motivated by the heuristic interpretation of $P(A)$ as the empirical frequency of occurrence of event $A$. If $n$ “independent” experiments are performed, among which $n_A$ result in the realization of $A$, then the empirical frequency
$$F(A) = \frac{n_A}{n}$$
should be close to $P(A)$ if $n$ is “sufficiently large”. (This statement has to be made precise. It is in fact a loose expression of the law of large numbers that will be given later on.) Clearly, the empirical frequency function $F$ satisfies the axioms of probability.
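The frequency interpretation can be illustrated by simulating die tosses. The sketch below is a hedged illustration only: the seed, the number of tosses and the function name `empirical_frequency` are arbitrary choices, and Python’s pseudo-random generator merely stands in for “independent” experiments.

```python
import random
from fractions import Fraction


def empirical_frequency(samples, event):
    """F(A) = n_A / n: the fraction of the samples that realize the event."""
    n_a = sum(1 for omega in samples if omega in event)
    return Fraction(n_a, len(samples))


rng = random.Random(0)  # fixed seed: an arbitrary choice, for reproducibility only
samples = [rng.randint(1, 6) for _ in range(60_000)]  # simulated die tosses

OMEGA = {1, 2, 3, 4, 5, 6}
odd = {1, 3, 5}
two = {2}

f_odd = empirical_frequency(samples, odd)
f_two = empirical_frequency(samples, two)
f_union = empirical_frequency(samples, odd | two)

# F behaves like a probability: F(Omega) = 1 and F is additive on disjoint events.
assert empirical_frequency(samples, OMEGA) == 1
assert f_union == f_odd + f_two

# Compare with the model P(A) = |A| / 6 of the unbiased die.
print(float(f_odd), float(f_two))
```

Exact fractions are used so that the additivity of $F$ on disjoint events can be checked with equality rather than up to rounding.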
Basic Formulas

We shall now list the properties of probability that follow directly from the axioms:

Theorem 1.1.12 For any event $A$,
$$P(\overline{A}) = 1 - P(A), \tag{1.1}$$
and
$$P(\emptyset) = 0. \tag{1.2}$$

Proof. For a proof of (1.1), use additivity:
$$1 = P(\Omega) = P(A + \overline{A}) = P(A) + P(\overline{A}).$$
Applying (1.1) with $A = \Omega$ gives (1.2).

Theorem 1.1.13 Probability is monotone, that is, for any events $A$ and $B$,
$$A \subseteq B \implies P(A) \le P(B). \tag{1.3}$$

Proof. Observe that when $A \subseteq B$, $B = A + (B - A)$, and therefore $P(B) = P(A) + P(B - A) \ge P(A)$.

Theorem 1.1.14 Probability is sub-σ-additive: for any sequence $A_1, A_2, \ldots$ of events,
$$P\left(\cup_{k=1}^{\infty} A_k\right) \le \sum_{k=1}^{\infty} P(A_k). \tag{1.4}$$

Proof. Observe that
$$\cup_{k=1}^{\infty} A_k = \sum_{k=1}^{\infty} A'_k,$$
where $A'_1 := A_1$ and
$$A'_k := A_k \cap \overline{\cup_{i=1}^{k-1} A_i} \quad (k \ge 2).$$
Therefore,
$$P\left(\cup_{k=1}^{\infty} A_k\right) = P\left(\sum_{k=1}^{\infty} A'_k\right) = \sum_{k=1}^{\infty} P(A'_k).$$
But $A'_k \subseteq A_k$, and therefore $P(A'_k) \le P(A_k)$.
The next (very important) property is the sequential continuity of probability:

Theorem 1.1.15 Let $\{A_n\}_{n \ge 1}$ be a non-decreasing sequence of events, that is, $A_{n+1} \supseteq A_n$ for all $n \ge 1$. Then
$$P\left(\cup_{n=1}^{\infty} A_n\right) = \lim_{n \uparrow \infty} P(A_n). \tag{1.5}$$

Proof. Write $A_n = A_1 + (A_2 - A_1) + \cdots + (A_n - A_{n-1})$ and
$$\cup_{k=1}^{\infty} A_k = A_1 + (A_2 - A_1) + (A_3 - A_2) + \cdots.$$
Therefore,
$$P\left(\cup_{k=1}^{\infty} A_k\right) = P(A_1) + \sum_{j=2}^{\infty} P(A_j - A_{j-1}) = \lim_{n \uparrow \infty} \left\{ P(A_1) + \sum_{j=2}^{n} P(A_j - A_{j-1}) \right\} = \lim_{n \uparrow \infty} P(A_n).$$
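Corollary 1.1.16 Let $\{B_n\}_{n \ge 1}$ be a non-increasing sequence of events, that is, $B_{n+1} \subseteq B_n$ for all $n \ge 1$. Then
$$P\left(\cap_{n=1}^{\infty} B_n\right) = \lim_{n \uparrow \infty} P(B_n). \tag{1.6}$$

Proof. To obtain (1.6), write (using De Morgan’s identity; see Exercise 1.6.1):
$$P\left(\cap_{n=1}^{\infty} B_n\right) = 1 - P\left(\overline{\cap_{n=1}^{\infty} B_n}\right) = 1 - P\left(\cup_{n=1}^{\infty} \overline{B}_n\right),$$
and apply (1.5) with $A_n = \overline{B}_n$:
$$1 - P\left(\cup_{n=1}^{\infty} \overline{B}_n\right) = 1 - \lim_{n \uparrow \infty} P(\overline{B}_n) = \lim_{n \uparrow \infty} \left(1 - P(\overline{B}_n)\right) = \lim_{n \uparrow \infty} P(B_n).$$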
Negligible Sets

A central notion of probability is that of a negligible set.

Definition 1.1.17 A set $N \subset \Omega$ is called $P$-negligible if it is contained in an event $A \in \mathcal{F}$ of probability $P(A) = 0$.

Theorem 1.1.18 A countable union of negligible sets is a negligible set.

Proof. Let $N_k$ ($k \ge 1$) be $P$-negligible sets. By definition there exists a sequence $A_k$ ($k \ge 1$) of events of null probability such that $N_k \subseteq A_k$ ($k \ge 1$). We have
$$N := \cup_{k \ge 1} N_k \subseteq A := \cup_{k \ge 1} A_k,$$
and by the sub-σ-additivity property of probability, $P(A) = 0$.
Example 1.1.19: Random point in a square, take 2. (Example 1.1.11 continued) Recall the model of a random point inside the unit square $[0, 1]^2 = [0, 1] \times [0, 1]$. Each rational point therein has a null area and therefore a null probability. Therefore, the (countable) set of rational points of the square has null probability. In other words, the probability of drawing a rational point is, in this particular model, null.

1.2 Independence and Conditioning
In the frequency interpretation of probability, a situation where $n_{A \cap B}/n \approx (n_A/n) \times (n_B/n)$, or
$$\frac{n_{A \cap B}}{n_B} \approx \frac{n_A}{n}$$
(here $\approx$ is a non-mathematical symbol meaning “approximately equal”) suggests some kind of “independence” of $A$ and $B$, in the sense that statistics relative to $A$ do not vary when passing from a neutral sample of population to a selected sample characterized by the property $B$. For example, the proportion of people with a family name beginning with H is “approximately” the same among a large population with the usual mix of men and women as it would be among a “large” all-male population. Therefore, one’s gender is “independent” of the fact that one’s name begins with an H.²

1.2.1 Independent Events
The above discussion prompts us to give the following formal definition of independence, the single most important concept of probability theory.

Definition 1.2.1 Two events $A$ and $B$ are called independent if and only if
$$P(A \cap B) = P(A)P(B). \tag{1.7}$$

Remark 1.2.2 One should be aware that incompatibility is different from independence. As a matter of fact, two incompatible events $A$ and $B$ are independent if and only if at least one of them has null probability. Indeed, if $A$ and $B$ are incompatible, $P(A \cap B) = P(\emptyset) = 0$, and therefore (1.7) holds if and only if $P(A)P(B) = 0$.

The notion of independence carries over to families of events in the following manner.

Definition 1.2.3 A family $\{A_n\}_{n \in \mathbb{N}}$ of events is called independent if for any finite set of indices $i_1 < \ldots < i_r$ where $i_j \in \mathbb{N}$ ($1 \le j \le r$),
$$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_r}) = P(A_{i_1}) \times P(A_{i_2}) \times \cdots \times P(A_{i_r}).$$
One also says that the $A_n$’s ($n \in \mathbb{N}$) are jointly independent.

² As far as we know...
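Definition 1.2.3 asks for the product formula on every finite subfamily, which is more than pairwise independence. The following sketch checks this on a classical four-point example with the uniform probability; the sample space, the events and the function names are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset({1, 2, 3, 4})  # hypothetical sample space, uniform probability


def prob(event):
    """Uniform probability P(A) = |A| / |OMEGA|."""
    return Fraction(len(event & OMEGA), len(OMEGA))


def are_independent(events):
    """Check the product formula for every subfamily with at least two events."""
    for r in range(2, len(events) + 1):
        for subfamily in combinations(events, r):
            intersection = OMEGA
            product = Fraction(1)
            for e in subfamily:
                intersection = intersection & e
                product *= prob(e)
            if prob(intersection) != product:
                return False
    return True


A = frozenset({1, 2})
B = frozenset({1, 3})
C = frozenset({1, 4})

# Each pair satisfies (1.7), but the triple does not:
# P(A ∩ B ∩ C) = 1/4, whereas P(A) P(B) P(C) = 1/8.
print(are_independent([A, B]), are_independent([A, C]), are_independent([B, C]))
print(are_independent([A, B, C]))

# Incompatible events of positive probability are not independent (Remark 1.2.2).
print(are_independent([frozenset({1}), frozenset({2})]))
```

The example shows why the definition quantifies over all finite sets of indices: checking pairs alone would wrongly accept the family $\{A, B, C\}$.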
Example 1.2.4: The switches. Two locations A and B in a communications network are connected by three different paths, and each path contains a number of links that can fail. These are represented symbolically in the figure below by switches that are in the lifted position if the link is unable to operate. The number associated with a switch is the probability that the switch is lifted. The switches are lifted independently. What is the probability that A is accessible from B, that is, that there exists at least one available path for communications?

[Figure: three parallel paths between A and B. Upper path: two switches (0.25, 0.25). Middle path: one switch (0.4). Lower path: three switches (0.1, 0.1, 0.1).]

Let $U_1$ be the event “no switch lifted in the upper path”. Defining $U_2$ and $U_3$ similarly, we see that the probability to be computed is that of $U_1 \cup U_2 \cup U_3$, or by De Morgan’s law, that of the complement of $\overline{U}_1 \cap \overline{U}_2 \cap \overline{U}_3$:
$$1 - P(\overline{U}_1 \cap \overline{U}_2 \cap \overline{U}_3) = 1 - P(\overline{U}_1)P(\overline{U}_2)P(\overline{U}_3),$$
where the last equality follows from the independence assumption concerning the switches. Letting now $U_{11}$ = “switch 1 (first from left) in the upper path is not lifted” and $U_{12}$ = “switch 2 in the upper path is not lifted”, we have $U_1 = U_{11} \cap U_{12}$, therefore, in view of the independence assumption,
$$P(\overline{U}_1) = 1 - P(U_1) = 1 - P(U_{11})P(U_{12}).$$
We must now use the data $P(U_{11}) = 1 - 0.25$, $P(U_{12}) = 1 - 0.25$ to obtain $P(\overline{U}_1) = 1 - (0.75)^2$. Similarly $P(\overline{U}_2) = 1 - 0.6$ and $P(\overline{U}_3) = 1 - (0.9)^3$. The final result (of rather limited interest) is
$$1 - (0.4375)(0.4)(0.271) = 0.952575.$$
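The computation of this example can be reproduced in a few lines. The sketch below assumes, as in the example, that the switches are lifted independently; the list layout `paths` and the function name `path_available` are choices made for this illustration.

```python
from math import prod


def path_available(lift_probabilities):
    """Probability that no switch on the path is lifted (independent switches)."""
    return prod(1 - p for p in lift_probabilities)


# Lift probabilities of the switches on the upper, middle and lower paths.
paths = [[0.25, 0.25], [0.4], [0.1, 0.1, 0.1]]

# A is accessible from B unless every path is broken.
p_all_paths_broken = prod(1 - path_available(path) for path in paths)
p_connected = 1 - p_all_paths_broken

print(round(p_connected, 6))
```

Passing to the complement turns the probability of a union into a product, which is exactly where the independence assumption is used.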
Example 1.2.5: Is this number the larger one? Let $a$ and $b$ be two numbers in $\{1, 2, \ldots, 10\,000\}$. Nothing is known about these numbers, except that they are not equal, say $a > b$. Only one of these numbers is shown to you, secretly chosen at random and equiprobably. Call this random number $X$. Is there a good strategy for guessing if the number shown to you is the larger one? Of course, one would like to have a probability of success strictly larger than $\frac{1}{2}$. Perhaps surprisingly, there is such a strategy, which we now describe. Select at random, uniformly on $\{1, 2, \ldots, 10\,000\}$ and independently of $X$, a number $Y$. If $X \ge Y$, say that $X$ is the larger number ($= a$), otherwise say that it is the smaller one. Let us compute the probability $P_E$ of a wrong guess. An error occurs when either (i) $X \ge Y$ and $X = b$, or (ii) $X < Y$ and $X = a$. These events are exclusive of one another, and therefore
$$\begin{aligned}
P_E &= P(X \ge Y, X = b) + P(X < Y, X = a) \\
&= P(b \ge Y, X = b) + P(a < Y, X = a) \\
&= P(b \ge Y)P(X = b) + P(a < Y)P(X = a) \\
&= \tfrac{1}{2} P(b \ge Y) + \tfrac{1}{2} P(a < Y) = \tfrac{1}{2}\left(P(b \ge Y) + P(a < Y)\right) \\
&= \tfrac{1}{2}\left(1 - P(Y \in [b + 1, a])\right) = \tfrac{1}{2}\left(1 - \frac{a - b}{10\,000}\right) < \tfrac{1}{2}.
\end{aligned}$$
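The error probability of this strategy can be evaluated exactly and cross-checked by enumerating all equally likely pairs (shown number, threshold) on a smaller range. This is only a sketch: the function names and the small test range are assumptions made for the illustration.

```python
from fractions import Fraction

N = 10_000  # size of the range {1, ..., N} used in the example


def error_probability(a, b, n=N):
    """Closed form: P_E = (1/2) * (P(Y <= b) + P(Y > a)) for Y uniform on {1..n}."""
    assert 1 <= b < a <= n
    p_y_le_b = Fraction(b, n)      # wrong guess when the shown number is b
    p_y_gt_a = Fraction(n - a, n)  # wrong guess when the shown number is a
    return Fraction(1, 2) * (p_y_le_b + p_y_gt_a)


def brute_force_error(a, b, n):
    """Enumerate the 2n equally likely pairs (shown number x, threshold y)."""
    errors = 0
    for y in range(1, n + 1):
        for x in (a, b):
            guess_is_larger = x >= y
            if guess_is_larger != (x == a):
                errors += 1
    return Fraction(errors, 2 * n)


print(error_probability(7_500, 2_500))  # a far from b: error well below 1/2
print(error_probability(2, 1))          # a close to b: error just below 1/2
print(brute_force_error(15, 4, 20) == error_probability(15, 4, 20))
```

The gain over a blind guess is $(a - b)/(2 \times 10\,000)$: it is always positive, but small when $a$ and $b$ are close.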
1.2.2 Bayes’ Calculus
We continue the heuristic discussion of Subsection 1.2.1 in terms of empirical frequencies. Dependence between $A$ and $B$ occurs when $P(A \cap B) \neq P(A)P(B)$. In this case the relative frequency $n_{A \cap B}/n_B \approx P(A \cap B)/P(B)$ is different from the frequency $n_A/n$. This suggests the following definition.

Definition 1.2.6 The conditional probability of $A$ given $B$ is the number
$$P(A \mid B) := \frac{P(A \cap B)}{P(B)}, \tag{1.8}$$
defined when $P(B) > 0$.

Remark 1.2.7 The quantity $P(A \mid B)$ represents our expectation of $A$ being realized when the only available information is that $B$ is realized. Indeed, this expectation is based on the relative frequency $n_{A \cap B}/n_B$ alone. Of course, if $A$ and $B$ are independent, then $P(A \mid B) = P(A)$.

Probability theory is primarily concerned with the computation of probabilities of complex events. The following formulas, called Bayes’ rules, are useful for that purpose.

Theorem 1.2.8 With $P(A) > 0$, we have the Bayes rule of retrodiction:
$$P(B \mid A) = \frac{P(A \mid B)P(B)}{P(A)}. \tag{1.9}$$
Proof. Rewrite (1.8) symmetrically in $A$ and $B$:
$$P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A).$$

Theorem 1.2.9 Let $B_1, B_2, \ldots$ be events forming a partition of $\Omega$. Then for any event $A$, we have the Bayes rule of total causes:
$$P(A) = \sum_{i=1}^{\infty} P(A \mid B_i)P(B_i). \tag{1.10}$$
CHAPTER 1. WARMING UP
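Proof. Decompose $A$ as follows:
$$A = A \cap \Omega = A \cap \left(\bigcup_{i=1}^{\infty} B_i\right) = \bigcup_{i=1}^{\infty} (A \cap B_i).$$
Therefore (by $\sigma$-additivity and the definition of conditional probability):
$$P(A) = P\left(\bigcup_{i=1}^{\infty} (A \cap B_i)\right) = \sum_{i=1}^{\infty} P(A \cap B_i) = \sum_{i=1}^{\infty} P(A \mid B_i)P(B_i).$$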
Theorem 1.2.10 For any sequence of events $A_1, \ldots, A_n$, we have the Bayes sequential formula: for $2 \leq k \leq n$,
$$P\left(\cap_{i=1}^{k} A_i\right) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 \cap A_2) \cdots P\left(A_k \mid \cap_{i=1}^{k-1} A_i\right). \tag{1.11}$$

Proof. By induction. First observe that (1.11) is true for $k = 2$ by definition of conditional probability. Suppose that (1.11) is true for $k$. Write
$$P\left(\cap_{i=1}^{k+1} A_i\right) = P\left(\left(\cap_{i=1}^{k} A_i\right) \cap A_{k+1}\right) = P\left(A_{k+1} \mid \cap_{i=1}^{k} A_i\right) P\left(\cap_{i=1}^{k} A_i\right),$$
and replace $P\left(\cap_{i=1}^{k} A_i\right)$ by the assumed equality (1.11) to obtain the same equality with $k + 1$ replacing $k$.

Example 1.2.11: Should we always believe doctors? Doctors apply a test that gives a positive result in 99% of the cases where the patient is affected by the disease. However, it happens in 2% of the cases that a healthy patient has a positive test. Statistical data show that one individual out of 1000 has the disease. What is the probability that a patient with a positive test is affected by the disease?

Solution: Let $M$ be the event “patient is ill,” and let $+$ and $-$ be the events “test is positive” and “test is negative” respectively. We have the data
$$P(M) = 0.001, \quad P(+ \mid M) = 0.99, \quad P(+ \mid \overline{M}) = 0.02,$$
and we must compute $P(M \mid +)$. By the Bayes retrodiction formula,
$$P(M \mid +) = \frac{P(+ \mid M)P(M)}{P(+)}.$$
By the Bayes formula of total causes,
$$P(+) = P(+ \mid M)P(M) + P(+ \mid \overline{M})P(\overline{M}).$$
Therefore,
$$P(M \mid +) = \frac{(0.99)(0.001)}{(0.99)(0.001) + (0.02)(0.999)},$$
that is, approximately 0.047.
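The two Bayes rules used in this example translate directly into a few lines of arithmetic. The sketch below is only an illustration: the function name `posterior` and its parameter names are assumptions introduced here, not notation from the text. It evaluates the retrodiction formula with the denominator given by the formula of total causes.

```python
def posterior(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """P(M | +) from P(M), P(+ | M) and P(+ | not M)."""
    # Formula of total causes: P(+) = P(+ | M) P(M) + P(+ | not M) P(not M).
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    # Retrodiction formula: P(M | +) = P(+ | M) P(M) / P(+).
    return sensitivity * prior / p_positive


if __name__ == "__main__":
    # Data of Example 1.2.11.
    print(posterior(prior=0.001, sensitivity=0.99, false_positive_rate=0.02))
```

With the data of the example the value is about 0.047: fewer than one positive test in twenty corresponds to an ill patient, because healthy patients vastly outnumber ill ones.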
Remark 1.2.12 The quantitative result of the above example may be disquieting. In fact, this may happen with grouped blood tests (for instance in a prison or in the army) to detect, say, AIDS. A single affected individual will provoke a positive test alert for the whole group. Of course, the doctor in charge will then proceed to individual tests. See Exercise 1.6.19.
Example 1.2.13: The ballot problem. In an election, candidates I and II have obtained $a$ and $b$ votes, respectively. Candidate I won, that is, $a > b$. We seek to compute the probability that in the course of the vote counting procedure, candidate I has always had the lead. Let $p_{a,b}$ be the probability that I is always ahead. We have, by the formula of total causes, conditioning on the last vote:
$$\begin{aligned}
p_{a,b} &= P(\text{I always ahead} \mid \text{I gets the last vote})\,P(\text{I gets the last vote}) \\
&\quad + P(\text{I always ahead} \mid \text{II gets the last vote})\,P(\text{II gets the last vote}) \\
&= p_{a-1,b}\,\frac{a}{a+b} + p_{a,b-1}\,\frac{b}{a+b},
\end{aligned}$$
with the convention that for $a = b + 1$, $p_{a-1,b} = p_{b,b} = 0$. The result follows by induction on the total number of votes $a + b$:
$$p_{a,b} = \frac{a-b}{a+b}.$$

1.2.3 Conditional Independence
Definition 1.2.14 Let $A$, $B$ and $C$ be events, where $P(C) > 0$. One says that $A$ and $B$ are conditionally independent given $C$ if
$$P(A \cap B \mid C) = P(A \mid C)P(B \mid C). \tag{1.12}$$
In other words, $A$ and $B$ are independent with respect to the probability $P_C$ defined by $P_C(A) = P(A \mid C)$ (see Exercise 1.6.11).

Example 1.2.15: Cheap watches. Two factories $A$ and $B$ manufacture watches. Factory $A$ produces on average one defective item out of 100, and $B$ produces on average one bad watch out of 200. A retailer receives a container of watches from one of the two above factories, but he does not know which. He checks the first watch. It works! (a) What is the probability that the second watch he will check is good? (b) Are the states of the first two watches independent? You will need to invent reasonable hypotheses when needed.

Solution: (a) Let $X_n$ be the state of the $n$th watch in the container, with $X_n = 1$ if it works and $X_n = 0$ if it does not. Let $Y$ be the factory of origin. We express our a priori ignorance of where the container comes from by
$$P(Y = A) = P(Y = B) = \frac{1}{2}.$$
(Note that this is a hypothesis.) Also, we assume that given $Y = A$ (resp., $Y = B$), the states of the successive watches are independent. For instance,
$$P(X_1 = 1, X_2 = 0 \mid Y = A) = P(X_1 = 1 \mid Y = A)P(X_2 = 0 \mid Y = A).$$
We have the data
$$P(X_n = 0 \mid Y = A) = 0.01, \qquad P(X_n = 0 \mid Y = B) = 0.005.$$
We are required to compute
$$P(X_2 = 1 \mid X_1 = 1) = \frac{P(X_1 = 1, X_2 = 1)}{P(X_1 = 1)}.$$
By the Bayes formula of total causes, the numerator of this fraction equals
$$P(X_1 = 1, X_2 = 1 \mid Y = A)P(Y = A) + P(X_1 = 1, X_2 = 1 \mid Y = B)P(Y = B),$$
that is, $(0.5)(0.99)^2 + (0.5)(0.995)^2$, and the denominator is
$$P(X_1 = 1 \mid Y = A)P(Y = A) + P(X_1 = 1 \mid Y = B)P(Y = B),$$
that is, $(0.5)(0.99) + (0.5)(0.995)$. Therefore,
$$P(X_2 = 1 \mid X_1 = 1) = \frac{(0.99)^2 + (0.995)^2}{0.99 + 0.995}.$$

(b) The states of the two watches are not independent. Indeed, if they were, then
$$P(X_2 = 1 \mid X_1 = 1) = P(X_2 = 1) = (0.5)(0.99 + 0.995),$$
a result different from what we obtained.
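The two quantities compared in part (b) are close, so it helps to evaluate them exactly. The following is a small sketch under the hypotheses of the example (equal priors, conditional independence given the factory); the function name `watch_probabilities` and its parameters are illustrative and do not come from the text.

```python
from fractions import Fraction


def watch_probabilities(defect_a=Fraction(1, 100),
                        defect_b=Fraction(1, 200),
                        prior_a=Fraction(1, 2)):
    """Return (P(X2 = 1 | X1 = 1), P(X2 = 1)) for the two-factory model."""
    good_a, good_b = 1 - defect_a, 1 - defect_b
    prior_b = 1 - prior_a
    # Formula of total causes, conditioning on the factory of origin.
    p_first_good = good_a * prior_a + good_b * prior_b
    # Given the factory, the two watches are independent.
    p_both_good = good_a**2 * prior_a + good_b**2 * prior_b
    # By symmetry P(X2 = 1) = P(X1 = 1).
    return p_both_good / p_first_good, p_first_good


if __name__ == "__main__":
    conditional, marginal = watch_probabilities()
    print(float(conditional), float(marginal))
```

The conditional probability is slightly larger than the unconditional one: a first good watch makes the more reliable factory $B$ marginally more likely to be the origin of the container.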
Remark 1.2.16 The above example shows that two events $A$ and $B$ that are conditionally independent given some event $C$ and at the same time conditionally independent given $\overline{C}$, may yet not be independent.
1.3 Discrete Random Variables
The number of heads in a sequence of 1000 coin tosses, the number of days it takes until the next rain and the size of a genealogical tree are random numbers. All are functions of the outcome of a random experiment performed either by man or nature, and these outcomes take discrete values, that is, values in a countable set. These values are integers in the above examples, but they could be more complex mathematical objects. This section gives the basic theory of discrete random variables.
1.3.1 Probability Distributions and Expectation
Definition 1.3.1 Let $E$ be a countable set. A function $X : \Omega \to E$ such that for all $x \in E$,
$$\{\omega;\ X(\omega) = x\} \in \mathcal{F}$$
is called a discrete random variable. (Being in $\mathcal{F}$, the event $\{X = x\}$ can be assigned a probability.)
1.3. DISCRETE RANDOM VARIABLES
Remark 1.3.2 Calling an integer-valued random variable $X$ a random number is an innocuous habit as long as one is aware that it is not the function $X$ that is random, but the outcome $\omega$. This in turn makes the number $X(\omega)$ random.

Example 1.3.3: Tossing a die, take 3. The sample space is the set $\Omega = \{1, 2, 3, 4, 5, 6\}$. Take for $X$ the identity: $X(\omega) = \omega$. Therefore $X$ is a random number obtained by tossing a die.
Example 1.3.4: Heads or tails, take 4. (Example 1.1.10 continued.) The sample space $\Omega$ is the collection of all sequences $\omega = \{x_n\}_{n \geq 1}$, where $x_n = 1$ or $0$. Define a random variable $X_n$ by $X_n(\omega) = x_n$. It is the random number obtained at the $n$th toss. It is indeed a random variable since for all $a_n \in \{0, 1\}$, $\{\omega;\ X_n(\omega) = a_n\} = \{\omega;\ x_n = a_n\} \in \mathcal{F}$, by definition of $\mathcal{F}$.

The following are elementary remarks. Let $E$ and $F$ be countable sets. Let $X$ be a random variable with values in $E$, and let $f : E \to F$ be a function. Then $Y := f(X)$ is a random variable.

Proof. Let $y \in F$. The set $\{\omega;\ Y(\omega) = y\}$ is in $\mathcal{F}$ since it is a countable union of sets in $\mathcal{F}$, namely:
$$\{Y = y\} = \bigcup_{x \in E;\ f(x) = y} \{X = x\}.$$
Let $E_1$ and $E_2$ be countable sets. Let $X_1$ and $X_2$ be random variables with values in $E_1$ and $E_2$, respectively. Then $Y := (X_1, X_2)$ is a random variable with values in $E = E_1 \times E_2$.

Proof. Let $x = (x_1, x_2) \in E$. The set $\{\omega;\ Y(\omega) = x\}$ is in $\mathcal{F}$ since it is the intersection of sets in $\mathcal{F}$, namely:
$$\{Y = x\} = \{X_1 = x_1\} \cap \{X_2 = x_2\}.$$

Definition 1.3.5 Let $X$ be a discrete random variable taking its values in $E$. Its probability distribution function is the function $\pi : E \to [0, 1]$, where
$$\pi(x) := P(X = x) \qquad (x \in E).$$
Example 1.3.6: The gambler’s fortune. This is a continuation of the coin tosses example (Example 1.1.10). The number of occurrences of heads in $n$ tosses is $S_n = X_1 + \cdots + X_n$. This random variable is the fortune at time $n$ of a gambler systematically betting on heads. It takes integer values from $0$ to $n$. We have
$$P(S_n = k) = \binom{n}{k} \frac{1}{2^n}.$$

Proof. The event $\{S_n = k\}$ is “$k$ among $X_1, \ldots, X_n$ are equal to 1”. There are $\binom{n}{k}$ distinct ways of assigning $k$ values of 1 and $n - k$ values of 0 to $X_1, \ldots, X_n$, and all have the same probability $2^{-n}$.
Remark 1.3.7 One may have to prove that a random variable $X$, taking its values in $\overline{\mathbb{N}} := \mathbb{N} \cup \{\infty\}$ (and therefore for which the value $\infty$ is a priori possible), is in fact almost surely finite, that is, to prove that $P(X = \infty) = 0$ or, equivalently, that $P(X < \infty) = 1$. Since
$$\{X < \infty\} = \bigcup_{n=0}^{\infty} \{X = n\},$$
we have
$$P(X < \infty) = \sum_{n=0}^{\infty} P(X = n).$$
(This remark provides an opportunity to recall that in an expression such as $\sum_{n=0}^{\infty}$, the sum is over $\mathbb{N}$ and does not include $\infty$ as the notation seems to suggest. A less ambiguous notation would be $\sum_{n \in \mathbb{N}}$. If we want to sum over all integers plus $\infty$, we shall always use the notation $\sum_{n \in \overline{\mathbb{N}}}$.)
Expectation for Discrete Random Variables

Definition 1.3.8 Let $X$ be a discrete random variable taking its values in a countable set $E$ and let the function $g : E \to \mathbb{R}$ be either nonnegative or such that it satisfies the absolute summability condition
$$\sum_{x \in E} |g(x)| P(X = x) < \infty.$$
Then one defines $E[g(X)]$, the expectation of $g(X)$, by the formula
$$E[g(X)] := \sum_{x \in E} g(x) P(X = x).$$
If the absolute summability condition is satisfied, the random variable $g(X)$ is called integrable, and in this case the expectation $E[g(X)]$ is a finite number. If it is only assumed that $g$ is nonnegative, the expectation may well be infinite.

Example 1.3.9: The gambler’s fortune. This is a continuation of Example 1.3.6. Consider the random variable $S_n = X_1 + \cdots + X_n$ taking its values in $\{0, 1, \ldots, n\}$. Its expectation is $E[S_n] = n/2$, as the following straightforward computation shows:
$$\begin{aligned}
E[S_n] &= \sum_{k=0}^{n} k P(S_n = k) = \frac{1}{2^n} \sum_{k=1}^{n} k \frac{n!}{k!(n-k)!} \\
&= \frac{n}{2^n} \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!((n-1)-(k-1))!} \\
&= \frac{n}{2^n} \sum_{j=0}^{n-1} \frac{(n-1)!}{j!(n-1-j)!} = \frac{n}{2^n} 2^{n-1} = \frac{n}{2}.
\end{aligned}$$
Example 1.3.10: Finite random variables with infinite expectations. One should be aware that a discrete random variable taking finite values may have an infinite expectation. The canonical example is the random variable $X$ taking its values in $E = \mathbb{N}$ and with probability distribution
$$P(X = n) = \frac{1}{c n^2} \qquad (n \geq 1),$$
where the constant $c$ is chosen such that
$$P(X < \infty) = \sum_{n=1}^{\infty} P(X = n) = \sum_{n=1}^{\infty} \frac{1}{c n^2} = 1$$
(that is, $c = \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$). Indeed, the expectation of $X$ is
$$E[X] = \sum_{n=1}^{\infty} n P(X = n) = \sum_{n=1}^{\infty} n \frac{1}{c n^2} = \sum_{n=1}^{\infty} \frac{1}{c n} = \infty.$$
Remark 1.3.11 The above example seems artiﬁcial. It is however not pathological, and there are a lot more natural occurrences of the phenomenon. Consider for instance Example 1.3.9, and let T be the ﬁrst integer n (necessarily even) such that 2Sn − n = 0. (The quantity 2Sn − n is the fortune at time n of a gambler systematically betting on heads.) Then as it turns out and as we shall prove later (in Example 6.3.6), T is a ﬁnite random variable with inﬁnite expectation.
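The telescope formula below gives an alternative way of computing the expectation of an integer-valued random variable.

Theorem 1.3.12 For a random variable $X$ taking its values in $\mathbb{N}$,
$$E[X] = \sum_{n=1}^{\infty} P(X \geq n).$$

Proof.
$$\begin{aligned}
E[X] &= P(X = 1) + 2P(X = 2) + 3P(X = 3) + \cdots \\
&= P(X = 1) + P(X = 2) + P(X = 3) + \cdots \\
&\phantom{= P(X = 1)} + P(X = 2) + P(X = 3) + \cdots \\
&\phantom{= P(X = 1) + P(X = 2)} + P(X = 3) + \cdots
\end{aligned}$$
The first row of the last expression sums to $P(X \geq 1)$, the second to $P(X \geq 2)$, the third to $P(X \geq 3)$, and so on.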
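As a sanity check, the telescope formula can be compared with the defining sum on a distribution with finite support. The sketch below uses a fair die as test case; the dictionary representation of the distribution and the function names are choices made for this illustration only.

```python
from fractions import Fraction


def expectation_direct(pmf):
    """E[X] as the sum of n * P(X = n) over the support."""
    return sum(n * p for n, p in pmf.items())


def expectation_telescope(pmf):
    """E[X] as the sum over n >= 1 of P(X >= n), for X with values in N."""
    top = max(pmf)  # P(X >= n) = 0 beyond the largest value
    return sum(
        sum(p for m, p in pmf.items() if m >= n)  # P(X >= n)
        for n in range(1, top + 1)
    )


if __name__ == "__main__":
    die = {k: Fraction(1, 6) for k in range(1, 7)}
    print(expectation_direct(die), expectation_telescope(die))
```

For the die, the tail probabilities are $6/6, 5/6, \ldots, 1/6$, whose sum $21/6 = 7/2$ agrees with the usual mean.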
Basic Properties of Expectation
Let $A$ be some event. The expectation of the indicator random variable $X = 1_A$ is
$$E[1_A] = P(A).$$
(We call this the expectation formula for indicator functions.)

Proof. The random variable $X = 1_A$ takes the value 1 with probability $P(X = 1) = P(A)$ and the value 0 with probability $P(X = 0) = P(\overline{A}) = 1 - P(A)$. Therefore,
$$E[X] = 0 \times P(X = 0) + 1 \times P(X = 1) = P(X = 1) = P(A).$$

Theorem 1.3.13 Let $g_1$ and $g_2$ be functions from $E$ to $\mathbb{R}$ such that $g_1(X)$ and $g_2(X)$ are integrable (resp., nonnegative), and let $\lambda_1, \lambda_2 \in \mathbb{R}$ (resp., $\in \mathbb{R}_+$). Then
$$E[\lambda_1 g_1(X) + \lambda_2 g_2(X)] = \lambda_1 E[g_1(X)] + \lambda_2 E[g_2(X)]$$
(linearity of expectation). Also, if $g_1(x) \leq g_2(x)$ for all $x \in E$,
$$E[g_1(X)] \leq E[g_2(X)]$$
(monotonicity of expectation). Finally, we have the triangle inequality
$$|E[g(X)]| \leq E[|g(X)|].$$

Proof. These properties follow directly from the corresponding properties of series.

Example 1.3.14: The matching paradox. There are $n$ boxes $B_1, \ldots, B_n$ and $n$ objects $O_1, \ldots, O_n$. These objects are placed “at random” in the boxes, one and only one per box. What is the average number of matchings, that is, of boxes that receive an object with the same index? The problem will be stated mathematically in a way that gives meaning to the phrase “at random”. Let $\Pi_n$ be the set of permutations of $\{1, 2, \ldots, n\}$. Let $\sigma$ be a random permutation, that is, $P(\sigma = \sigma^{(0)}) = \frac{1}{n!}$ for all $\sigma^{(0)} \in \Pi_n$. The random placement is assimilated to such a random permutation, and a matching at position (box) $i$ is said to occur if $\sigma_i = i$. Let $X_i = 1$ if a matching occurs at position $i$, and $X_i = 0$ otherwise. The total number of matches is therefore $Z_n := \sum_{i=1}^{n} X_i$, so that the average number of matches is
$$E[Z_n] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} P(X_i = 1).$$
By symmetry, $P(X_i = 1) = P(X_1 = 1)$ $(1 \leq i \leq n)$, so that
$$E[Z_n] = n P(X_1 = 1).$$
But there are $(n-1)!$ permutations $\sigma^{(0)}$ such that $\sigma^{(0)}_1 = 1$, each one occurring with probability $\frac{1}{n!}$, so that
$$P(X_1 = 1) = \frac{(n-1)!}{n!} = \frac{1}{n}.$$
Therefore
$$E[Z_n] = n \times \frac{1}{n} = 1.$$
This number remains constant and does not increase with $n$ as one (maybe) expects!
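For small $n$ the average number of matchings can be obtained by brute force, which gives an independent check of the linearity argument. The following sketch enumerates all permutations; the function name `mean_matches` is an illustrative choice, and positions are indexed from 0 for convenience.

```python
from fractions import Fraction
from itertools import permutations


def mean_matches(n: int) -> Fraction:
    """Average number of fixed points over all n! equiprobable permutations."""
    perms = list(permutations(range(n)))
    total = sum(
        sum(1 for i, s in enumerate(sigma) if s == i)  # matchings in sigma
        for sigma in perms
    )
    return Fraction(total, len(perms))


if __name__ == "__main__":
    for n in range(1, 8):
        print(n, mean_matches(n))
```

The enumeration grows like $n!$, so it is only practical for small $n$; the point is that the average equals 1 in every case tried.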
Mean and Variance

Definition 1.3.15 Let $X$ be a random variable such that $E[|X|] < \infty$ ($X$ is integrable). In this case (and only in this case) the mean $\mu$ of $X$ is defined by
$$\mu := E[X] = \sum_{n=0}^{+\infty} n P(X = n).$$

From the inequality $|a| \leq 1 + a^2$, true for all $a \in \mathbb{R}$, we have that $|X| \leq 1 + X^2$, and therefore, by the monotonicity and linearity properties of expectation, $E[|X|] \leq 1 + E[X^2]$ (we also used the fact that $E[1] = 1$). Therefore if $E[X^2] < \infty$ (in which case we say that $X$ is square-integrable) then $X$ is integrable. The following definition then makes sense.

Definition 1.3.16 Let $X$ be a square-integrable random variable. Its variance is, by definition, the quantity
$$\sigma^2 := E[(X - \mu)^2] = \sum_{n=0}^{+\infty} (n - \mu)^2 P(X = n).$$
The variance is also denoted by $\mathrm{Var}(X)$. From the linearity of expectation, it follows that $E[(X - \mu)^2] = E[X^2] - 2\mu E[X] + \mu^2$, that is,
$$\mathrm{Var}(X) = E[X^2] - \mu^2.$$

The mean is the “center of inertia” of a random variable. More precisely,

Theorem 1.3.17 Let $X$ be a real random variable with mean $\mu$ and finite variance $\sigma^2$. Then, for all $a \in \mathbb{R}$, $a \neq \mu$,
$$E[(X - a)^2] > E[(X - \mu)^2] = \sigma^2.$$

Proof.
$$\begin{aligned}
E[(X - a)^2] &= E[((X - \mu) + (\mu - a))^2] \\
&= E[(X - \mu)^2] + (\mu - a)^2 + 2(\mu - a)E[X - \mu] \\
&= E[(X - \mu)^2] + (\mu - a)^2 > E[(X - \mu)^2].
\end{aligned}$$
Independent Variables

Definition 1.3.18 Two discrete random variables $X$ and $Y$ are called independent if
$$P(X = i, Y = j) = P(X = i)P(Y = j) \qquad (i, j \in E).$$

Remark 1.3.19 The left-hand side of the last display is $P(\{X = i\} \cap \{Y = j\})$. This is a general feature of the notational system: commas replace intersection signs. For instance, $P(A, B)$ is the probability that both events $A$ and $B$ occur.
Definition 1.3.18 extends to a finite number of random variables:

Definition 1.3.20 The discrete random variables $X_1, \ldots, X_k$ taking their values in $E_1, \ldots, E_k$ respectively are said to be independent if for all $i_1 \in E_1, \ldots, i_k \in E_k$,
$$P(X_1 = i_1, \ldots, X_k = i_k) = P(X_1 = i_1) \cdots P(X_k = i_k).$$

Definition 1.3.21 A sequence $\{X_n\}_{n \geq 1}$ of discrete random variables taking their values in the sets $\{E_n\}_{n \geq 1}$ respectively is called independent if any finite collection of distinct random variables $X_{i_1}, \ldots, X_{i_r}$ extracted from this sequence is independent. It is said to be iid (independent and identically distributed) if $E_n \equiv E$ for all $n \geq 1$, if it is independent and if the probability distribution function of $X_n$ does not depend on $n$.

Example 1.3.22: Heads or tails, take 5. (Example 1.3.4 continued) We are going to show that the sequence $\{X_n\}_{n \geq 1}$ is iid. Therefore, we have a model for independent tosses of an unbiased coin.

Proof. Event $\{X_k = a_k\}$ is the direct sum of the events $\{X_1 = a_1, \ldots, X_{k-1} = a_{k-1}, X_k = a_k\}$ for all possible values of $(a_1, \ldots, a_{k-1})$. Since there are $2^{k-1}$ such values and each one has probability $2^{-k}$, we have $P(X_k = a_k) = 2^{k-1} 2^{-k}$, that is,
$$P(X_k = 1) = P(X_k = 0) = \frac{1}{2}.$$
Therefore,
$$P(X_1 = a_1, \ldots, X_k = a_k) = 2^{-k} = P(X_1 = a_1) \cdots P(X_k = a_k)$$
for all $a_1, \ldots, a_k \in \{0, 1\}$, from which it follows by definition that $X_1, \ldots, X_k$ are independent random variables, and more generally that $\{X_n\}_{n \geq 1}$ is a family of independent random variables.

Definition 1.3.23 Let $\{X_n\}_{n \geq 1}$ and $\{Y_n\}_{n \geq 1}$ be sequences of discrete random variables taking their values in the sets $\{E_n\}_{n \geq 1}$ and $\{F_n\}_{n \geq 1}$, respectively. They are said to be independent if for any finite collection of random variables $X_{i_1}, \ldots, X_{i_r}$ and $Y_{j_1}, \ldots, Y_{j_s}$ extracted from their respective sequences, the discrete random variables $(X_{i_1}, \ldots, X_{i_r})$ and $(Y_{j_1}, \ldots, Y_{j_s})$ are independent. (This means that for all $a_1 \in E_{i_1}, \ldots, a_r \in E_{i_r}$, $b_1 \in F_{j_1}, \ldots, b_s \in F_{j_s}$,
$$P\left(\left(\cap_{\ell=1}^{r} \{X_{i_\ell} = a_\ell\}\right) \cap \left(\cap_{m=1}^{s} \{Y_{j_m} = b_m\}\right)\right) = P\left(\cap_{\ell=1}^{r} \{X_{i_\ell} = a_\ell\}\right) P\left(\cap_{m=1}^{s} \{Y_{j_m} = b_m\}\right).)$$

The notion of conditional independence for events (Definition 1.2.14) extends naturally to discrete random variables.

Definition 1.3.24 Let $X$, $Y$, $Z$ be random variables taking their values in the countable sets $E$, $F$, $G$, respectively. One says that $X$ and $Y$ are conditionally independent given $Z$ if for all $x$, $y$, $z$ in $E$, $F$, $G$, respectively, the events $\{X = x\}$ and $\{Y = y\}$ are conditionally independent given $\{Z = z\}$.

Recall that the events $\{X = x\}$ and $\{Y = y\}$ are said to be conditionally independent given $\{Z = z\}$ if
$$P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z)P(Y = y \mid Z = z).$$
The Product Formula for Expectations

Theorem 1.3.25 Let $Y$ and $Z$ be two independent discrete random variables with values in the countable sets $F$ and $G$, respectively, and let $v : F \to \mathbb{R}$, $w : G \to \mathbb{R}$ be functions that are either nonnegative or such that $v(Y)$ and $w(Z)$ are both integrable. Then
$$E[v(Y)w(Z)] = E[v(Y)]E[w(Z)].$$

Proof. Consider the discrete random variable $X$ with values in $E = F \times G$ defined by $X = (Y, Z)$, and consider the function $g : E \to \mathbb{R}$ defined by $g(x) = v(y)w(z)$, where $x = (y, z)$. We have, under the above stated conditions,
$$\begin{aligned}
E[v(Y)w(Z)] = E[g(X)] &= \sum_{x \in E} g(x)P(X = x) \\
&= \sum_{y \in F} \sum_{z \in G} v(y)w(z)P(Y = y, Z = z) \\
&= \sum_{y \in F} \sum_{z \in G} v(y)w(z)P(Y = y)P(Z = z) \\
&= \left(\sum_{y \in F} v(y)P(Y = y)\right)\left(\sum_{z \in G} w(z)P(Z = z)\right) \\
&= E[v(Y)]E[w(Z)].
\end{aligned}$$

For independent random variables, “variances add up”:

Corollary 1.3.26 Let $X_1, \ldots, X_n$ be independent integrable random variables with values in $\mathbb{N}$. Then
$$\sigma^2_{X_1 + \cdots + X_n} = \sigma^2_{X_1} + \cdots + \sigma^2_{X_n}. \tag{1.13}$$

Proof. Let $\mu_1, \ldots, \mu_n$ be the respective means of $X_1, \ldots, X_n$. The mean of the sum $X := X_1 + \cdots + X_n$ is $\mu := \mu_1 + \cdots + \mu_n$. By the product formula for expectations, if $i \neq k$,
$$E[(X_i - \mu_i)(X_k - \mu_k)] = E[X_i - \mu_i]\,E[X_k - \mu_k] = 0.$$
Therefore
$$\begin{aligned}
\mathrm{Var}(X) = E[(X - \mu)^2] &= E\left[\left(\sum_{i=1}^{n} (X_i - \mu_i)\right)^2\right] = E\left[\sum_{i=1}^{n} \sum_{k=1}^{n} (X_i - \mu_i)(X_k - \mu_k)\right] \\
&= \sum_{i=1}^{n} \sum_{k=1}^{n} E[(X_i - \mu_i)(X_k - \mu_k)] \\
&= \sum_{i=1}^{n} E[(X_i - \mu_i)^2] = \sum_{i=1}^{n} \mathrm{Var}(X_i).
\end{aligned}$$
Remark 1.3.27 Note that means always add up, even when the random variables are not independent. Let $X$ be an integrable random variable. Then, clearly, for any $a \in \mathbb{R}$, $aX$ is integrable and its variance is given by the formula
$$\mathrm{Var}(aX) = a^2\, \mathrm{Var}(X).$$

Example 1.3.28: Variance of the empirical mean. From this remark and Corollary 1.3.26, we immediately obtain that if $X_1, \ldots, X_n$ are independent and identically distributed integrable random variables with values in $\mathbb{N}$ with common variance $\sigma^2$, then
$$\mathrm{Var}\left(\frac{X_1 + \cdots + X_n}{n}\right) = \frac{\sigma^2}{n}.$$

1.3.2 Famous Discrete Probability Distributions
The Binomial Distribution

Consider an iid sequence $\{X_n\}_{n \geq 1}$ of random variables taking their values in the set $\{0, 1\}$ and with a common distribution given by
$$P(X_n = 1) = p \qquad (p \in (0, 1)).$$
This may be taken as a model for a game of heads and tails with a possibly biased coin (when $p \neq \frac{1}{2}$). Since $P(X_j = a_j) = p$ or $1 - p$ depending on whether $a_j = 1$ or $0$, and since there are exactly $h(a) := \sum_{j=1}^{k} a_j$ coordinates of $a = (a_1, \ldots, a_k)$ that are equal to 1,
$$P(X_1 = a_1, \ldots, X_k = a_k) = p^{h(a)} q^{k - h(a)}, \tag{1.14}$$
where $q := 1 - p$. (The integer $h(a)$ is called the Hamming weight of the binary vector $a$.) The heads and tails framework shelters two important discrete random variables: the binomial random variable and the geometric random variable.
The Binomial Distribution

Definition 1.3.29 A random variable $X$ taking its values in the set $E = \{0, 1, \ldots, n\}$ and with the probability distribution
$$P(X = i) = \binom{n}{i} p^i (1 - p)^{n - i} \qquad (0 \leq i \leq n)$$
is called a binomial random variable of size $n$ and parameter $p \in (0, 1)$. This is denoted by $X \sim \mathcal{B}(n, p)$.
Example 1.3.30: Number of heads in coin tossing. Define $S_n = X_1 + \cdots + X_n$. This random variable takes the values $0, 1, \ldots, n$. To obtain $S_n = i$, where $0 \leq i \leq n$, one must have $X_1 = a_1, \ldots, X_n = a_n$ with $\sum_{j=1}^{n} a_j = i$. There are $\binom{n}{i}$ distinct ways of having this, and each occurs with probability $p^i (1 - p)^{n - i}$. Therefore, for $0 \leq i \leq n$,
$$P(S_n = i) = \binom{n}{i} p^i (1 - p)^{n - i}.$$
Theorem 1.3.31 The mean and the variance of a binomial random variable $X$ of size $n$ and parameter $p$ are respectively
$$E[X] = np \quad \text{and} \quad \mathrm{Var}(X) = np(1 - p).$$

Proof. Consider the random variable $S_n$ of Example 1.3.30, which is a binomial random variable. We have, since expectations add up,
$$E[S_n] = \sum_{i=1}^{n} E[X_i] = n E[X_1],$$
and since the $X_i$’s are iid (and therefore in this case variances add up),
$$\mathrm{Var}(S_n) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = n\, \mathrm{Var}(X_1).$$
Now,
$$E[X_1] = 0 \times P(X_1 = 0) + 1 \times P(X_1 = 1) = P(X_1 = 1) = p,$$
and since $X_1^2 = X_1$,
$$E[X_1^2] = E[X_1] = p.$$
Therefore
$$\mathrm{Var}(X_1) = E[X_1^2] - E[X_1]^2 = p - p^2 = p(1 - p).$$
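The mean and variance of the binomial distribution can also be recovered directly from its probability distribution, which gives a check that does not rely on the decomposition into a sum of iid variables. The sketch below uses exact rationals; the function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, p):
    """List of P(X = i), 0 <= i <= n, for X binomial of size n and parameter p."""
    return [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]


def mean_and_variance(pmf):
    """Mean and variance of a distribution on {0, 1, ..., len(pmf) - 1}."""
    mean = sum(i * w for i, w in enumerate(pmf))
    second_moment = sum(i * i * w for i, w in enumerate(pmf))
    return mean, second_moment - mean**2  # Var(X) = E[X^2] - E[X]^2


if __name__ == "__main__":
    n, p = 10, Fraction(3, 10)
    print(mean_and_variance(binomial_pmf(n, p)))  # compare with n*p and n*p*(1-p)
```

Passing `p` as a `Fraction` keeps every intermediate quantity exact, so the comparison with $np$ and $np(1-p)$ is an equality rather than an approximation.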
The Geometric Distribution

Definition 1.3.32 A random variable $X$ taking its values in $\mathbb{N}_+ := \{1, 2, \ldots\}$ and with the distribution
$$P(X = k) = (1 - p)^{k - 1} p \qquad (k \geq 1),$$
where $0 < p < 1$, is called a geometric random variable with parameter $p$. This is denoted $X \sim \mathrm{Geo}(p)$.
Example 1.3.33: First “heads” in the sequence. Let $\{X_n\}_{n \geq 1}$ be an iid sequence of random variables taking their values in the set $\{0, 1\}$ with common distribution given by $P(X_n = 1) = p \in (0, 1)$. Define the random variable $T$ to be the first time of occurrence of 1 in this sequence, that is,
$$T = \inf\{n \geq 1;\ X_n = 1\},$$
with the convention that if $X_n = 0$ for all $n \geq 1$, then $T = \infty$. The event $\{T = k\}$ is exactly $\{X_1 = 0, \ldots, X_{k-1} = 0, X_k = 1\}$, and therefore,
$$P(T = k) = P(X_1 = 0) \cdots P(X_{k-1} = 0)P(X_k = 1),$$
that is,
$$P(T = k) = (1 - p)^{k - 1} p.$$
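Theorem 1.3.34 The mean of a geometric random variable $X$ with parameter $p \in (0, 1)$ is
$$E[X] = \frac{1}{p}.$$

Proof.
$$E[X] = \sum_{k=1}^{\infty} k (1 - p)^{k - 1} p.$$
But for $\alpha \in (0, 1)$,
$$\sum_{k=1}^{\infty} k \alpha^{k - 1} = \frac{d}{d\alpha}\left(\sum_{k=1}^{\infty} \alpha^k\right) = \frac{d}{d\alpha}\left(\frac{1}{1 - \alpha} - 1\right) = \frac{1}{(1 - \alpha)^2}.$$
Therefore, with $\alpha = 1 - p$,
$$E[X] = \frac{1}{p^2} \times p = \frac{1}{p}.$$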
Theorem 1.3.35 A geometric random variable $T$ with parameter $p \in (0, 1)$ is memoryless in the sense that for any integer $k_0 \geq 1$,
$$P(T = k + k_0 \mid T > k_0) = P(T = k) \qquad (k \geq 1).$$

Proof. We first compute
$$\begin{aligned}
P(T > k_0) &= \sum_{k = k_0 + 1}^{\infty} (1 - p)^{k - 1} p \\
&= p (1 - p)^{k_0} \sum_{n=0}^{\infty} (1 - p)^n = \frac{p (1 - p)^{k_0}}{1 - (1 - p)} = (1 - p)^{k_0}.
\end{aligned}$$
Therefore,
$$\begin{aligned}
P(T = k_0 + k \mid T > k_0) &= \frac{P(T = k_0 + k, T > k_0)}{P(T > k_0)} = \frac{P(T = k_0 + k)}{P(T > k_0)} \\
&= \frac{p (1 - p)^{k + k_0 - 1}}{(1 - p)^{k_0}} = p (1 - p)^{k - 1} = P(T = k).
\end{aligned}$$
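The memoryless property lends itself to an exact numerical check. In the sketch below (function names are illustrative choices), the tail $P(T > k_0)$ is obtained as one minus a finite sum, so that the closed form $(1 - p)^{k_0}$ is verified rather than assumed.

```python
from fractions import Fraction


def geo_pmf(k, p):
    """P(T = k) = (1 - p)^(k - 1) p for k >= 1."""
    return (1 - p) ** (k - 1) * p


def geo_tail(k0, p):
    """P(T > k0), computed as 1 - P(T <= k0)."""
    return 1 - sum(geo_pmf(k, p) for k in range(1, k0 + 1))


def conditional_pmf(k, k0, p):
    """P(T = k + k0 | T > k0); note that {T = k + k0} is included in {T > k0}."""
    return geo_pmf(k + k0, p) / geo_tail(k0, p)


if __name__ == "__main__":
    p = Fraction(1, 3)
    for k in range(1, 5):
        print(k, conditional_pmf(k, 4, p), geo_pmf(k, p))
```

Whatever the waiting time already elapsed, the conditional distribution of the remaining waiting time coincides with the original distribution.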
Example 1.3.36: The coupon collector. Each chocolate tablet of a certain brand contains a coupon, randomly and independently chosen among $n$ types, each type being equally likely. A prize may be claimed once the chocolate lover has gathered a collection containing all the types of coupons. What is the average value of the number $X$ of chocolate tablets bought when this happens for the first time?

Solution: Let $X_i$ $(0 \leq i \leq n - 1)$ be the number of tablets bought during the time where there are exactly $i$ different types of coupons in the collector’s box, so that
$$X = \sum_{i=0}^{n-1} X_i.$$
Each $X_i$ is a geometric random variable with parameter $p_i = 1 - \frac{i}{n}$, and therefore $E[X_i] = \frac{1}{p_i} = \frac{n}{n - i}$. In particular,
$$E[X] = \sum_{i=0}^{n-1} E[X_i] = n \sum_{i=1}^{n} \frac{1}{i}.$$
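We can have a more precise idea of how far away from its mean the random variable $X$ can be. Observing that $\left|\sum_{i=1}^{n} 1/i - \ln n\right| \leq 1$, we have that $|E[X] - n \ln n| \leq n$. We shall now prove that for all $c > 0$,
$$P(X > n \ln n + cn) \leq e^{-c}. \tag{$\star$}$$
For this, define $A_\alpha$ to be the event that no coupon of type $\alpha$ shows up in the first $n \ln n + cn$ tablets. Then (by sub-additivity)
$$\begin{aligned}
P(X > n \ln n + cn) = P\left(\cup_{\alpha=1}^{n} A_\alpha\right) &\leq \sum_{\alpha=1}^{n} P(A_\alpha) \\
&= \sum_{\alpha=1}^{n} \left(1 - \frac{1}{n}\right)^{n \ln n + cn} = n \left(1 - \frac{1}{n}\right)^{n \ln n + cn},
\end{aligned}$$
and therefore, since $1 + x \leq e^x$ for all $x \in \mathbb{R}$,
$$P(X > n \ln n + cn) \leq n \left(e^{-\frac{1}{n}}\right)^{n \ln n + cn} = n e^{-\ln n - c} = e^{-c}.$$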
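Remark 1.3.37 An inequality such as ($\star$) is called a concentration inequality. With $E[X]$ replaced by its approximation $n \ln n$, it reads
$$P\left(\frac{X - E[X]}{E[X]} > \frac{c}{\ln n}\right) \leq e^{-c},$$
which explains the terminology, since $\frac{X - E[X]}{E[X]}$ measures the relative dispersion of $X$ around its mean.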
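Both the mean $n \sum_{i=1}^{n} 1/i$ and the tail bound $e^{-c}$ of the coupon collector example can be explored by simulation. The following is a rough sketch, with a fixed seed for reproducibility; the function names, the choice $n = 20$ and the number of trials are arbitrary assumptions made for the illustration.

```python
import math
import random


def collect_once(n, rng):
    """Number of tablets bought until all n coupon types have been seen."""
    seen = set()
    bought = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))  # each type is equally likely
        bought += 1
    return bought


def simulate(n, trials, c, seed=0):
    """Empirical mean of X and empirical frequency of {X > n ln n + c n}."""
    rng = random.Random(seed)
    samples = [collect_once(n, rng) for _ in range(trials)]
    threshold = n * math.log(n) + c * n
    mean = sum(samples) / trials
    tail = sum(1 for x in samples if x > threshold) / trials
    return mean, tail


def exact_mean(n):
    """E[X] = n * (1 + 1/2 + ... + 1/n)."""
    return n * sum(1.0 / i for i in range(1, n + 1))


if __name__ == "__main__":
    mean, tail = simulate(n=20, trials=20_000, c=1.0)
    print(mean, exact_mean(20), tail, math.exp(-1.0))
```

The empirical tail frequency is typically somewhat below $e^{-c}$, which is consistent with the bound being obtained from the (non-tight) sub-additivity inequality.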
The Poisson Distribution
Definition 1.3.38 A random variable $X$ taking its values in $\mathbb{N}$ and such that for all $k \geq 0$,
$$P(X = k) = e^{-\theta} \frac{\theta^k}{k!}, \tag{1.15}$$
is called a Poisson random variable with parameter $\theta > 0$. This is denoted by $X \sim \mathrm{Poi}(\theta)$.

Example 1.3.39: Poisson’s law of rare events, take 1. A veterinary surgeon of the Prussian army collecting data relative to accidents due to horse kicks found that the yearly number of such casualties was approximately following a Poisson distribution. Here is an explanation of his findings. Suppose that you play “heads or tails” for a large number $n$ of (independent) tosses of a coin such that
$$P(X_i = 1) = \frac{\alpha}{n}.$$
In the example, $n$ is the (large) number of soldiers, $X_i = 1$ if the $i$th soldier was hurt and $X_i = 0$ otherwise. Let $S_n$ be the total number of heads (wounded soldiers) and let $p_n(k) := P(S_n = k)$. It turns out that
$$\lim_{n \uparrow \infty} p_n(k) = e^{-\alpha} \frac{\alpha^k}{k!} \tag{$\star$}$$
(with the convention $0! = 1$). (The average number of heads is $\alpha$ and the choice $P(X_i = 1) = \frac{\alpha}{n}$ guarantees this. Letting $n \uparrow \infty$ accounts for $n$ being large but unknown.) Here is the proof of this result, which is known as Poisson’s law of rare events. As we know, the random variable $S_n$ follows a binomial law:
$$P(S_n = k) = \binom{n}{k} \left(\frac{\alpha}{n}\right)^k \left(1 - \frac{\alpha}{n}\right)^{n - k}$$
of mean $n \times \frac{\alpha}{n} = \alpha$. With $p_n(k) = P(S_n = k)$, we see that $p_n(0) = \left(1 - \frac{\alpha}{n}\right)^n \to e^{-\alpha}$ as $n \uparrow \infty$. Also,
$$\frac{p_n(k + 1)}{p_n(k)} = \frac{\frac{n - k}{k + 1} \frac{\alpha}{n}}{1 - \frac{\alpha}{n}} \to \frac{\alpha}{k + 1}$$
as $n \uparrow \infty$. Therefore, ($\star$) holds true for all $k \geq 0$. The limit distribution is therefore a Poisson distribution of mean $\alpha$.

Theorem 1.3.40 The mean of a Poisson random variable with parameter $\theta > 0$ is given by
$$E[X] = \theta,$$
and its variance is
$$\mathrm{Var}(X) = \theta.$$

Proof. We have
$$\begin{aligned}
E[X] &= e^{-\theta} \sum_{k=1}^{\infty} \frac{k \theta^k}{k!} = e^{-\theta} \theta \sum_{k=1}^{\infty} \frac{\theta^{k-1}}{(k-1)!} \\
&= e^{-\theta} \theta \sum_{j=0}^{\infty} \frac{\theta^j}{j!} = e^{-\theta} \theta e^{\theta} = \theta.
\end{aligned}$$
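Also:
$$\begin{aligned}
E[X^2 - X] &= e^{-\theta} \sum_{k=0}^{\infty} (k^2 - k) \frac{\theta^k}{k!} = e^{-\theta} \sum_{k=2}^{\infty} k(k - 1) \frac{\theta^k}{k!} \\
&= e^{-\theta} \theta^2 \sum_{k=2}^{\infty} \frac{\theta^{k-2}}{(k-2)!} = e^{-\theta} \theta^2 \sum_{j=0}^{\infty} \frac{\theta^j}{j!} = e^{-\theta} \theta^2 e^{\theta} = \theta^2.
\end{aligned}$$
Therefore,
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = E[X^2 - X] + E[X] - E[X]^2 = \theta^2 + \theta - \theta^2 = \theta.$$

Theorem 1.3.41 Let $X_1$ and $X_2$ be two independent Poisson random variables with means $\theta_1 > 0$ and $\theta_2 > 0$, respectively. Then $X = X_1 + X_2$ is a Poisson random variable with mean $\theta = \theta_1 + \theta_2$.

Proof. For $k \geq 0$,
$$\begin{aligned}
P(X = k) = P(X_1 + X_2 = k) &= P\left(\cup_{i=0}^{k} \{X_1 = i, X_2 = k - i\}\right) \\
&= \sum_{i=0}^{k} P(X_1 = i, X_2 = k - i) \\
&= \sum_{i=0}^{k} P(X_1 = i)P(X_2 = k - i) \\
&= \sum_{i=0}^{k} e^{-\theta_1} \frac{\theta_1^i}{i!}\, e^{-\theta_2} \frac{\theta_2^{k-i}}{(k - i)!} \\
&= \frac{e^{-(\theta_1 + \theta_2)}}{k!} \sum_{i=0}^{k} \frac{k!}{i!(k - i)!} \theta_1^i \theta_2^{k-i} \\
&= e^{-(\theta_1 + \theta_2)} \frac{(\theta_1 + \theta_2)^k}{k!}.
\end{aligned}$$

Remark 1.3.42 See Example 1.4.8 for an alternative shorter proof using generating functions (defined in Subsection 1.4.1).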
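Two of the Poisson results of this subsection, the law of rare events (Example 1.3.39) and the stability under independent sums (Theorem 1.3.41), can be checked numerically. The sketch below works in floating point; the function names and the particular values of $n$, $\alpha$, $\theta_1$, $\theta_2$ are arbitrary choices for the illustration.

```python
from math import comb, exp, factorial, isclose


def poisson_pmf(k, theta):
    """P(X = k) for X Poisson with parameter theta."""
    return exp(-theta) * theta**k / factorial(k)


def binomial_pmf(k, n, p):
    """P(S_n = k) for S_n binomial of size n and parameter p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def convolution(k, theta1, theta2):
    """P(X1 + X2 = k) for independent Poisson X1, X2, by summing over X1 = i."""
    return sum(
        poisson_pmf(i, theta1) * poisson_pmf(k - i, theta2) for i in range(k + 1)
    )


if __name__ == "__main__":
    n, alpha = 10**6, 2.0
    for k in range(5):
        # Rare events: binomial(n, alpha / n) is close to Poisson(alpha).
        print(k, binomial_pmf(k, n, alpha / n), poisson_pmf(k, alpha))
    # Sum of independent Poisson variables.
    print(convolution(3, 1.5, 2.5), poisson_pmf(3, 4.0))
```

The binomial and Poisson values differ by a quantity of order $1/n$, whereas the convolution identity holds up to rounding error only.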
The Multinomial Distribution

Consider the random vector $X = (X_1, \ldots, X_n)$ where all the random variables $X_i$ take their values in the same countable space $E$ (this restriction is not essential, but it simplifies the notation). Let $\pi : E^n \to \mathbb{R}_+$ be a function such that
$$\sum_{x \in E^n} \pi(x) = 1.$$
The discrete random vector $X$ above is said to admit the probability distribution $\pi$ if
$$P(X = x) = \pi(x) \qquad (x \in E^n).$$
In fact, there is nothing new here with respect to previous definitions, since $X$ is a discrete random variable taking its values in the countable set $\mathcal{X} := E^n$.

Example 1.3.43: Multinomial random vector. We place, independently of one another, $k$ balls in $n$ boxes $B_1, \ldots, B_n$, with probability $p_i$ for a given ball to be assigned to box $B_i$. Of course,
$$\sum_{i=1}^{n} p_i = 1.$$
After placing all the balls in the boxes, there are $X_i$ balls in box $B_i$, where
$$\sum_{i=1}^{n} X_i = k.$$
The random vector $X = (X_1, \ldots, X_n)$ is a multinomial vector of size $(n, k)$ and parameters $p_1, \ldots, p_n$, that is, its probability distribution is
$$P(X_1 = m_1, \ldots, X_n = m_n) = \frac{k!}{\prod_{i=1}^{n} (m_i)!} \prod_{i=1}^{n} p_i^{m_i},$$
where $m_1 + \cdots + m_n = k$.

Proof. Observe that ($\alpha$): there are $k! / \prod_{i=1}^{n} (m_i)!$ distinct ways of placing $k$ balls in $n$ boxes in such a manner that $m_1$ balls are in box $B_1$, $m_2$ are in $B_2$, etc., and ($\beta$): each of these distinct ways occurs with the same probability $\prod_{i=1}^{n} p_i^{m_i}$.
Random Graphs

A graph is a discrete object and therefore random graphs are, from a purely formal point of view, discrete random variables. The random graphs considered below are in fact described by a finite collection of iid $\{0, 1\}$-valued random variables.

A (finite) graph $G = (V, E)$ consists of a finite collection $V$ of vertices $v$ and of a collection $E$ of unordered pairs of distinct vertices, $\langle u, v \rangle$, called the edges. If $\langle u, v \rangle \in E$, then $u$ and $v$ are called neighbors, and this is also denoted by $u \sim v$. The degree of vertex $v \in V$ is the number of edges stemming from it.

Definition 1.3.44 (Gilbert, 1959) Let $n$ be a fixed positive integer and let $V = \{1, 2, \ldots, n\}$ be a finite set of vertices. To each unordered pair of distinct vertices $\langle u, v \rangle$, associate a random variable $X_{\langle u, v \rangle}$ taking its values in $\{0, 1\}$ and suppose that all such variables are iid with probability $p \in (0, 1)$ for the value 1. This defines a random graph denoted by $G(n, p)$, a random element taking its values in the (finite) set of all graphs with vertices $\{1, 2, \ldots, n\}$ and admitting for an edge the unordered pair of vertices $\langle u, v \rangle$ if and only if $X_{\langle u, v \rangle} = 1$.
Note that $G(n, p)$ is indeed a discrete random variable (taking its values in the finite set consisting of the collection of graphs with vertex set $V = \{1, 2, \ldots, n\}$). Similarly, the set $E_{n,p}$ of edges of $G(n, p)$ is also a discrete random variable. If we call any unordered pair of vertices $\langle u, v \rangle$ a potential edge (there are $\binom{n}{2}$ such edges forming the set $E_n$), $G(n, p)$ is constructed by accepting a potential edge as one of its edges with probability $p$, independently of all other potential edges. The probability of occurrence of a given graph $G$ with exactly $m$ edges is then
$$P(G(n, p) = G) = p^m (1 - p)^{\binom{n}{2} - m}.$$

Note that the degree of a given vertex, that is, the number of edges stemming from it, is a binomial random variable $\mathcal{B}(n - 1, p)$. In particular, the average degree is $d = (n - 1)p$.

Another type of random graph is the Erdős–Rényi random graph (Definition 1.3.45 below). It is closely related to the Gilbert graph (Exercise 1.6.27).

Definition 1.3.45 (Erdős and Rényi, 1959) Consider the collection $\mathcal{G}_m$ of graphs $G = (V, E)$ where $V = \{1, 2, \ldots, n\}$ with exactly $m$ edges ($|E| = m$). There are $\binom{\binom{n}{2}}{m}$ such graphs. The Erdős–Rényi random graph $\mathcal{G}_{n,m}$ is a random graph uniformly distributed on $\mathcal{G}_m$.
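The construction of the Gilbert graph is easy to mimic on a computer: draw one independent Bernoulli variable per potential edge. The sketch below is only an illustration under that reading of Definition 1.3.44; the function names, the representation of edges as ordered tuples `(u, v)` with `u < v`, and the parameter values are assumptions made here.

```python
import random
from itertools import combinations


def sample_gilbert_graph(n, p, rng):
    """Edge set of one sample of G(n, p) on the vertices 1, ..., n."""
    # Each of the n(n-1)/2 potential edges is accepted independently with probability p.
    return {(u, v) for u, v in combinations(range(1, n + 1), 2) if rng.random() < p}


def average_degree(n, edges):
    """Average vertex degree; each edge contributes to the degree of two vertices."""
    return 2 * len(edges) / n


if __name__ == "__main__":
    rng = random.Random(1)
    n, p = 200, 0.1
    edges = sample_gilbert_graph(n, p, rng)
    print(average_degree(n, edges), (n - 1) * p)
```

On a single sample the empirical average degree fluctuates around $(n - 1)p$; the fluctuations shrink as $n$ grows with $p$ fixed.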
1.3.3 Conditional Expectation
This subsection introduces the concept of conditional expectation given a random element (variable or vector).³ Let $Z$ be a discrete random variable with values in $E$, and let $f : E \to \mathbb{R}$ be a nonnegative function. Let $A$ be some event of positive probability. The conditional expectation of $f(Z)$ given $A$, denoted by $E[f(Z) \mid A]$, is by definition the expectation when the distribution of $Z$ is replaced by its conditional distribution given $A$:
$$E[f(Z) \mid A] := \sum_{z} f(z) P(Z = z \mid A).$$
Let $\{A_i\}_{i \in \mathbb{N}}$ be a partition of the sample space. The following formula is then a direct consequence of Bayes' formula of total causes:
$$E[f(Z)] = \sum_{i \in \mathbb{N}} E[f(Z) \mid A_i]\, P(A_i).$$
³ The general theory of conditional expectation will be given in Section 3.3.

Example 1.3.46: The Poisson and multinomial distributions. Suppose we have $N$ bins in which we place balls in the following manner. The number $Y_i$ of balls in bin $i$ is a Poisson variable of mean $\frac{m}{N}$, and is independent of the numbers in the other bins. In particular, the total number of balls $Y_1 + \cdots + Y_N$ is, as the sum of independent Poisson random variables, a Poisson random variable whose mean is the sum of the means of the coordinates, that is, $m$. For a given integer $k$, we will compute the conditional probability that there are $k_1$ balls in bin 1, $k_2$ balls in bin 2, etc., given that the total number of balls is $k_1 + \cdots + k_N = k$:
$$\begin{aligned}
P(Y_1 = k_1, \ldots, Y_N = k_N \mid Y_1 + \cdots + Y_N = k)
&= \frac{P(Y_1 = k_1, \ldots, Y_N = k_N,\ Y_1 + \cdots + Y_N = k)}{P(Y_1 + \cdots + Y_N = k)} \\
&= \frac{P(Y_1 = k_1, \ldots, Y_N = k_N)}{P(Y_1 + \cdots + Y_N = k)}.
\end{aligned}$$
By independence of the $Y_i$'s, and since they are Poisson variables with mean $\frac{m}{N}$,
$$P(Y_1 = k_1, \ldots, Y_N = k_N) = \prod_{i=1}^{N} e^{-\frac{m}{N}} \frac{\left(\frac{m}{N}\right)^{k_i}}{k_i!}.$$
Also,
$$P(Y_1 + \cdots + Y_N = k) = e^{-m} \frac{m^k}{k!}.$$
Therefore
$$P(Y_1 = k_1, \ldots, Y_N = k_N \mid Y_1 + \cdots + Y_N = k) = \frac{k!}{k_1! \cdots k_N!} \left(\frac{1}{N}\right)^{k}.$$
But this is equal to $P(Z_1 = k_1, \ldots, Z_N = k_N)$, where $Z_i$ is the number of balls in bin $i$ when $k = k_1 + \cdots + k_N$ balls are placed independently and at random in the $N$ bins. Note that the above conditional probability does not depend on $m$.

The conditional expectation of some discrete random variable $Z$ given some other discrete random variable $Y$ is the expectation of $Z$ using the probability measure modified by the observation of $Y$. For instance, if $Y = y$, instead of the original probability assigning the mass $P(A)$ to the event $A$, we use the conditional probability given $Y = y$ assigning the mass $P(A \mid Y = y)$ to this event.

Definition 1.3.47 Let $X$ and $Y$ be two discrete random variables taking their values in the countable sets $F$ and $G$, respectively, and let $g : F \times G \to \mathbb{R}$ be either nonnegative, or such that $E[|g(X, Y)|] < \infty$. Define for each $y \in G$ such that $P(Y = y) > 0$,
$$\psi(y) = \sum_{x \in F} g(x, y) P(X = x \mid Y = y), \tag{1.16}$$
and if $P(Y = y) = 0$, let $\psi(y) = 0$. This quantity is called the conditional expectation of $g(X, Y)$ given $Y = y$, and is denoted by $E^{Y=y}[g(X, Y)]$, or $E[g(X, Y) \mid Y = y]$. The random variable $\psi(Y)$ is called the conditional expectation of $g(X, Y)$ given $Y$, and is denoted by $E^{Y}[g(X, Y)]$ or $E[g(X, Y) \mid Y]$.

The sum in (1.16) is well defined (possibly infinite however) when $g$ is nonnegative. Note that in the nonnegative case, we have that
$$\begin{aligned}
\sum_{y \in G} \psi(y) P(Y = y) &= \sum_{y \in G} \sum_{x \in F} g(x, y) P(X = x \mid Y = y) P(Y = y) \\
&= \sum_{x} \sum_{y} g(x, y) P(X = x, Y = y) = E[g(X, Y)].
\end{aligned}$$
In particular, if $E[g(X, Y)] < \infty$, then
$$\sum_{y \in G} \psi(y) P(Y = y) < \infty,$$
which implies that $\psi(y) < \infty$ for all $y \in G$ such that $P(Y = y) > 0$. We observe (for reference in a few lines) that in this case, $\psi(Y) < \infty$ almost surely, that is to say $P(\psi(Y) < \infty) = 1$ (in fact, $P(\psi(Y) = \infty) = \sum_{y;\, \psi(y) = \infty} P(Y = y) = 0$).

Let now $g : F \times G \to \mathbb{R}$ be a function of arbitrary sign such that $E[|g(X, Y)|] < \infty$, and in particular $E[g^{\pm}(X, Y)] < \infty$. Denote by $\psi^{\pm}$ the functions associated to $g^{\pm}$ as in (1.16). As we just saw, for all $y \in G$, $\psi^{\pm}(y) < \infty$, and therefore $\psi(y) = \psi^{+}(y) - \psi^{-}(y)$ is well defined (not an indeterminate $\infty - \infty$ form). Thus, the conditional expectation is also well defined in the integrable case. From the observation made a few lines above, in this case, $|E^{Y}[g(X, Y)]| < \infty$.

Example 1.3.48: Binomial example. Let $X_1$ and $X_2$ be independent binomial random variables of the same size $N$ and same parameter $p$. We are going to show that
$$E^{X_1 + X_2}[X_1] = h(X_1 + X_2) = \frac{X_1 + X_2}{2}.$$
We have
$$\begin{aligned}
P(X_1 = k \mid X_1 + X_2 = n) &= \frac{P(X_1 = k) P(X_2 = n - k)}{P(X_1 + X_2 = n)} \\
&= \frac{\binom{N}{k} p^k (1 - p)^{N - k} \binom{N}{n - k} p^{n - k} (1 - p)^{N - n + k}}{\binom{2N}{n} p^n (1 - p)^{2N - n}} = \frac{\binom{N}{k} \binom{N}{n - k}}{\binom{2N}{n}},
\end{aligned}$$
where we have used the fact that the sum of two independent binomial random variables with size $N$ and parameter $p$ is a binomial random variable with size $2N$ and parameter $p$. This is the hypergeometric distribution. The right-hand side of the last display is the probability of obtaining $k$ black balls when a sample of $n$ balls is randomly selected from an urn containing $N$ black balls and $N$ red balls. The mean of such a distribution is (by reason of symmetry) $\frac{n}{2}$, therefore
$$E^{X_1 + X_2 = n}[X_1] = \frac{n}{2} = h(n),$$
and this gives the announced result.
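As a sanity check of Definition 1.3.47 on the binomial example, the following sketch computes $\psi(y)$ directly from a joint distribution, in exact rational arithmetic. The helper names and the values $N = 4$, $p = 1/3$ are assumptions made for illustration only.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(size, p):
    """pmf of a binomial variable, as a dict value -> probability."""
    return {k: comb(size, k) * p**k * (1 - p) ** (size - k) for k in range(size + 1)}


def conditional_expectation(joint, g):
    """psi(y) = sum_x g(x, y) P(X = x | Y = y), for every y with P(Y = y) > 0."""
    marginal_y = {}
    for (x, y), prob in joint.items():
        marginal_y[y] = marginal_y.get(y, 0) + prob
    psi = {}
    for (x, y), prob in joint.items():
        if marginal_y[y] > 0:
            psi[y] = psi.get(y, 0) + g(x, y) * prob / marginal_y[y]
    return psi


N, p = 4, Fraction(1, 3)  # illustrative values
pmf = binomial_pmf(N, p)
# Joint law of (X, Y) = (X1, X1 + X2) for independent X1 and X2.
joint = {}
for k1, q1 in pmf.items():
    for k2, q2 in pmf.items():
        joint[(k1, k1 + k2)] = joint.get((k1, k1 + k2), 0) + q1 * q2

psi = conditional_expectation(joint, lambda x, y: x)
print(psi)
```

With these inputs every value `psi[n]` should equal `Fraction(n, 2)`, in agreement with $h(n) = n/2$.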
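Example 1.3.49: Poisson example. Let $X_1$ and $X_2$ be two independent Poisson random variables with respective means $\theta_1 > 0$ and $\theta_2 > 0$. We seek to compute $E^{X_1 + X_2}[X_1]$, that is $E^{Y}[X]$, where $X = X_1$, $Y = X_1 + X_2$. Following the instructions of Definition 1.3.47, we must first compute (only for $y \ge x$, why?)
$$\begin{aligned}
P(X = x \mid Y = y) &= \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{P(X_1 = x, X_1 + X_2 = y)}{P(X_1 + X_2 = y)} \\
&= \frac{P(X_1 = x, X_2 = y - x)}{P(X_1 + X_2 = y)} = \frac{P(X_1 = x) P(X_2 = y - x)}{P(X_1 + X_2 = y)} \\
&= \frac{e^{-\theta_1} \frac{\theta_1^{x}}{x!}\, e^{-\theta_2} \frac{\theta_2^{y - x}}{(y - x)!}}{e^{-(\theta_1 + \theta_2)} \frac{(\theta_1 + \theta_2)^{y}}{y!}} = \binom{y}{x} \left(\frac{\theta_1}{\theta_1 + \theta_2}\right)^{x} \left(\frac{\theta_2}{\theta_1 + \theta_2}\right)^{y - x}.
\end{aligned}$$
Therefore, letting $\alpha = \frac{\theta_1}{\theta_1 + \theta_2}$,
$$\psi(y) = E^{Y=y}[X] = \sum_{x=0}^{y} x \binom{y}{x} \alpha^{x} (1 - \alpha)^{y - x} = \alpha y.$$
Finally, $E^{Y}[X] = \psi(Y) = \alpha Y$, that is,
$$E^{X_1 + X_2}[X_1] = \frac{\theta_1}{\theta_1 + \theta_2} (X_1 + X_2).$$

1.4 The Branching Process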
The branching process is also known as the Galton–Watson process. Sir Francis Galton, a cousin of Darwin, was interested in the survival probability of a given line of English peerage. He posed the problem in the Educational Times in 1873. In the same year and the same journal, Reverend Watson proposed the method of solution that has become a textbook classic, and thereby initiated an important branch of probability. The elementary theory of branching processes of Subsection 1.4.2 provides the opportunity to introduce the tool of generating functions.
1.4.1 Generating Functions
The computation of probabilities in discrete probability models often requires an enumeration of all the possible outcomes realizing a given event. Generating functions are very useful for this task, and more generally, for obtaining distribution functions of integer-valued random variables. In order to introduce this versatile tool, we shall need to define the expectation of a complex-valued function of an integer-valued variable.

Let $X$ be a discrete random variable with values in $\mathbb{N}$, and let $\varphi : \mathbb{N} \to \mathbb{C}$ be a complex function with real and imaginary parts $\varphi_R$ and $\varphi_I$, respectively. The expectation $E[\varphi(X)]$ is naturally defined by
$$E[\varphi(X)] = E[\varphi_R(X)] + iE[\varphi_I(X)],$$
provided that the expectations on the right-hand side are well defined and finite.

Definition 1.4.1 Let $X$ be an $\mathbb{N}$-valued random variable. Its generating function (gf) is the function $g : \overline{D}(0; 1) := \{z \in \mathbb{C};\ |z| \le 1\} \to \mathbb{C}$ defined by
$$g(z) = E[z^{X}] = \sum_{k=0}^{\infty} P(X = k) z^{k}. \tag{1.17}$$
The power series associated with the sequence $\{P(X = n)\}_{n \ge 0}$ has a radius of convergence $R \ge 1$, since $\sum_{n=0}^{\infty} P(X = n) = 1 < \infty$. The domain of definition of $g$ could be, in specific cases, larger than the closed unit disk centered at the origin. In the next two examples, the domain of absolute convergence is the whole complex plane.
Example 1.4.2: The gf of the binomial variable. For the binomial random variable of size $n$ and parameter $p$,
$$\sum_{k=0}^{n} P(X = k) z^{k} = \sum_{k=0}^{n} \binom{n}{k} (zp)^{k} (1 - p)^{n - k},$$
and therefore
$$g(z) = (1 - p + pz)^{n}. \tag{1.18}$$

Example 1.4.3: The gf of the Poisson variable. For the Poisson random variable of mean $\theta$,
$$\sum_{k=0}^{\infty} P(X = k) z^{k} = e^{-\theta} \sum_{k=0}^{\infty} \frac{(\theta z)^{k}}{k!},$$
and therefore
$$g(z) = e^{\theta(z - 1)}. \tag{1.19}$$

Here is an example where the radius of convergence is finite.

Example 1.4.4: The gf of the geometric variable. For the geometric random variable of Definition 1.3.32,
$$\sum_{k=1}^{\infty} P(X = k) z^{k} = \sum_{k=1}^{\infty} p (1 - p)^{k - 1} z^{k},$$
and therefore, with $q = 1 - p$,
$$g(z) = \frac{pz}{1 - qz}.$$
The radius of convergence of this power series is $\frac{1}{q}$.
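The three closed forms above can be checked numerically by summing the defining power series (1.17) at a point of the closed unit disk. This is only a sketch: the parameter values, the test point `z` and the truncation levels are arbitrary assumptions.

```python
from cmath import exp as cexp
from math import comb, exp, factorial


def gf_from_pmf(pmf, z, terms):
    """Partial sum of sum_k P(X = k) z^k over k = 0, ..., terms - 1."""
    return sum(pmf(k) * z**k for k in range(terms))


n, p, theta = 6, 0.3, 2.0  # illustrative parameters
q = 1 - p
z = 0.7 + 0.2j  # a test point inside the closed unit disk

binomial = gf_from_pmf(lambda k: comb(n, k) * p**k * q ** (n - k), z, n + 1)
poisson = gf_from_pmf(lambda k: exp(-theta) * theta**k / factorial(k), z, 60)
geometric = gf_from_pmf(lambda k: p * q ** (k - 1) if k >= 1 else 0.0, z, 400)

err_binomial = abs(binomial - (1 - p + p * z) ** n)
err_poisson = abs(poisson - cexp(theta * (z - 1)))
err_geometric = abs(geometric - p * z / (1 - q * z))
print(err_binomial, err_poisson, err_geometric)
```

The Poisson and geometric series are truncated, so the printed errors are small but not necessarily exactly zero.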
Moments from the Generating Function

Generating functions are powerful computational tools. First of all, they can be used to obtain moments of a discrete random variable.

Theorem 1.4.5 We have
$$g'(1) = E[X] \tag{1.20}$$
and
$$g''(1) = E[X(X - 1)]. \tag{1.21}$$

Proof. Inside the open disk centered at the origin and of radius $R$, the power series defining the generating function $g$ is continuous, and differentiable at any order term by term. In particular, differentiating both sides of (1.17) twice inside the open disk $D(0; R)$ gives
$$g'(z) = \sum_{n=1}^{\infty} n P(X = n) z^{n - 1}, \tag{1.22}$$
and
$$g''(z) = \sum_{n=2}^{\infty} n(n - 1) P(X = n) z^{n - 2}. \tag{1.23}$$
When the radius of convergence $R$ is strictly larger than 1, we obtain the announced results by letting $z = 1$ in the previous identities. If $R = 1$, the same is basically true but the mathematical argument is more subtle. The difficulty is not with the right-hand side of (1.22), which is always well defined at $z = 1$, being equal to $\sum_{n=1}^{\infty} n P(X = n)$, a nonnegative and possibly infinite quantity. The difficulty is that $g$ may not be differentiable at $z = 1$, a border point of the disk (here of radius 1) on which it is defined. However, by Abel's theorem (Theorem B.2.3), the limit as the real variable $x$ increases to 1 of $\sum_{n=1}^{\infty} n P(X = n) x^{n - 1}$ is $\sum_{n=1}^{\infty} n P(X = n)$. Therefore $g'$, as a function on the real interval $[0, 1)$, can be extended to $[0, 1]$ by (1.20), and this extension preserves continuity. With this definition of $g'(1)$, Formula (1.20) holds true. Similarly, when $R = 1$, the function $g''$ defined on $[0, 1)$ by (1.23) is extended to a continuous function on $[0, 1]$ by defining $g''(1)$ by (1.21).

Theorem 1.4.6 The generating function characterizes the distribution of a random variable.

This means the following. Suppose that, without knowing the distribution of $X$, you have been able to compute its generating function $g$, and that, moreover, you are able to give its power series expansion in a neighborhood of the origin:⁴
$$g(z) = \sum_{n=0}^{\infty} a_n z^{n}.$$
Since $g(z)$ is the generating function of $X$,
$$g(z) = \sum_{n=0}^{\infty} P(X = n) z^{n},$$
and since the power series expansion around the origin is unique, the distribution of $X$ is identified as $P(X = n) = a_n$ for all $n \ge 0$. Similarly, if two $\mathbb{N}$-valued random variables $X$ and $Y$ have the same generating function, they have the same distribution. Indeed, the identity in a neighborhood of the origin of the power series:
$$\sum_{n=0}^{\infty} P(X = n) z^{n} = \sum_{n=0}^{\infty} P(Y = n) z^{n}$$
implies the identity of their coefficients.

⁴ This is a common situation; see Theorem 1.4.10 for instance.

Theorem 1.4.7 Let $X$ and $Y$ be two independent integer-valued random variables with respective generating functions $g_X$ and $g_Y$. Then the sum $X + Y$ has the gf
$$g_{X+Y}(z) = g_X(z) \times g_Y(z). \tag{1.24}$$
Proof. Use the product formula for expectations:
$$g_{X+Y}(z) = E\left[z^{X+Y}\right] = E\left[z^{X} z^{Y}\right] = E\left[z^{X}\right] E\left[z^{Y}\right].$$

Example 1.4.8: Sums of independent Poisson variables. Let $X$ and $Y$ be two independent Poisson random variables of means $\alpha$ and $\beta$ respectively. We shall prove that the sum $X + Y$ is a Poisson random variable with mean $\alpha + \beta$. Indeed, according to (1.24) and (1.19),
$$g_{X+Y}(z) = g_X(z) \times g_Y(z) = e^{\alpha(z - 1)} e^{\beta(z - 1)} = e^{(\alpha + \beta)(z - 1)},$$
and the assertion follows directly from Theorem 1.4.6 since $g_{X+Y}$ is the gf of a Poisson random variable with mean $\alpha + \beta$.
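The conclusion of Example 1.4.8 can also be checked without generating functions, by convolving the two Poisson laws directly. The sketch below does this numerically; the means and the range of values tested are arbitrary illustrative choices.

```python
from math import exp, factorial


def poisson_pmf(mean, k):
    """P(X = k) for a Poisson variable with the given mean."""
    return exp(-mean) * mean**k / factorial(k)


alpha, beta = 1.5, 2.5  # illustrative means
max_error = 0.0
for n in range(30):
    # P(X + Y = n) by direct convolution of the two laws.
    conv = sum(poisson_pmf(alpha, k) * poisson_pmf(beta, n - k) for k in range(n + 1))
    max_error = max(max_error, abs(conv - poisson_pmf(alpha + beta, n)))
print(max_error)
```

The printed value is the largest discrepancy found between the convolution and the Poisson law of mean $\alpha + \beta$; it should be at the level of floating-point rounding.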
Counting with Generating Functions

The following example is typical of the use of generating functions in combinatorics (the art of counting).

Example 1.4.9: The lottery. Let $X_1, X_2, X_3, X_4, X_5$, and $X_6$ be independent random variables uniformly distributed over $\{0, 1, \ldots, 9\}$. We shall compute the generating function of $Y = 27 + X_1 + X_2 + X_3 - X_4 - X_5 - X_6$ and use the result to obtain the probability that in a 6-digit lottery the sum of the first three digits equals the sum of the last three digits. We have
$$E[z^{X_i}] = \frac{1}{10}(1 + z + \cdots + z^{9}) = \frac{1}{10} \frac{1 - z^{10}}{1 - z},$$
$$E[z^{-X_i}] = \frac{1}{10}\left(1 + \frac{1}{z} + \cdots + \frac{1}{z^{9}}\right) = \frac{1}{10} \frac{1 - z^{-10}}{1 - z^{-1}} = \frac{1}{10} \frac{1}{z^{9}} \frac{1 - z^{10}}{1 - z},$$
and
$$E[z^{Y}] = E\left[z^{27 + \sum_{i=1}^{3} X_i - \sum_{i=4}^{6} X_i}\right] = E\left[z^{27} \prod_{i=1}^{3} z^{X_i} \prod_{i=4}^{6} z^{-X_i}\right] = z^{27} \prod_{i=1}^{3} E[z^{X_i}] \prod_{i=4}^{6} E[z^{-X_i}].$$
Therefore,
$$g_Y(z) = \frac{1}{10^{6}} \frac{(1 - z^{10})^{6}}{(1 - z)^{6}}.$$
But $P(X_1 + X_2 + X_3 = X_4 + X_5 + X_6) = P(Y = 27)$ is the coefficient of $z^{27}$ in the power series expansion of $g_Y(z)$. Since
$$(1 - z^{10})^{6} = 1 - \binom{6}{1} z^{10} + \binom{6}{2} z^{20} - \cdots$$
and
$$(1 - z)^{-6} = 1 + \binom{6}{5} z + \binom{7}{5} z^{2} + \binom{8}{5} z^{3} + \cdots$$
(negative binomial formula), we find that
$$P(Y = 27) = \frac{1}{10^{6}} \left( \binom{32}{5} - \binom{6}{1} \binom{22}{5} + \binom{6}{2} \binom{12}{5} \right).$$
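The closed form of the lottery example can be cross-checked by brute force. The sketch below evaluates the binomial expression and, independently, counts the 6-digit outcomes whose first three digits have the same sum as the last three; the variable names are illustrative.

```python
from collections import Counter
from itertools import product
from math import comb

# Closed form obtained from the coefficient of z^27 in g_Y(z), times 10^6.
closed_form = comb(32, 5) - comb(6, 1) * comb(22, 5) + comb(6, 2) * comb(12, 5)

# Brute force: number of digit triples for each possible sum, then pair them up.
triple_sums = Counter(sum(digits) for digits in product(range(10), repeat=3))
brute_force = sum(count**2 for count in triple_sums.values())

print(closed_form, brute_force, closed_form / 10**6)
```

Both counts equal 55252, so the probability is 0.055252.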
Random Sums

How can one compute the distribution of a random sum? Here again, generating functions help.

Theorem 1.4.10 Let $\{Y_n\}_{n \ge 1}$ be an iid sequence of integer-valued random variables with the common generating function $g_Y$. Let $T$ be another random variable, integer-valued, independent of the sequence $\{Y_n\}_{n \ge 1}$, and let $g_T$ be its generating function. The generating function of $X = \sum_{n=1}^{T} Y_n$, where by convention $\sum_{n=1}^{0} = 0$, is
$$g_X(z) = g_T(g_Y(z)). \tag{1.25}$$
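Proof. Since $\{T = k\}_{k \ge 0}$ is a sequence forming a partition of $\Omega$, we have (Exercise 1.6.3) $1 = \sum_{k=0}^{\infty} 1_{\{T = k\}}$. Therefore
$$z^{X} = z^{\sum_{n=1}^{T} Y_n} = z^{\sum_{n=1}^{T} Y_n} \sum_{k=0}^{\infty} 1_{\{T = k\}} = \sum_{k=0}^{\infty} z^{\sum_{n=1}^{T} Y_n} 1_{\{T = k\}} = \sum_{k=0}^{\infty} z^{\sum_{n=1}^{k} Y_n} 1_{\{T = k\}}.$$
Taking expectations,
$$E[z^{X}] = \sum_{k=0}^{\infty} E\left[1_{\{T = k\}} z^{\sum_{n=1}^{k} Y_n}\right] = \sum_{k=0}^{\infty} E[1_{\{T = k\}}]\, E\left[z^{\sum_{n=1}^{k} Y_n}\right],$$
where we have used independence of $T$ and $\{Y_n\}_{n \ge 1}$. Now, $E[1_{\{T = k\}}] = P(T = k)$, and $E\left[z^{\sum_{n=1}^{k} Y_n}\right] = g_Y(z)^{k}$, and therefore
$$E[z^{X}] = \sum_{k=0}^{\infty} P(T = k)\, g_Y(z)^{k} = g_T(g_Y(z)).$$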
Another useful result is Wald's identity below (Formula (13.2.10)), which gives the expectation of a random sum of independent and identically distributed integer-valued variables.

By taking derivatives in (1.25) of Theorem 1.4.10, and recalling that $g_Y(1) = 1$,
$$E[X] = g_X'(1) = g_Y'(1)\, g_T'(g_Y(1)) = E[Y_1] E[T].$$
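The composition formula (1.25) and the resulting identity for $E[X]$ can be verified exactly on a small case. In the sketch below, the laws of $Y_1$ and $T$ are hypothetical finite distributions chosen for illustration, and the law of $X$ is built by conditioning on $T = k$, as in the proof of Theorem 1.4.10.

```python
from fractions import Fraction


def convolve(a, b):
    """pmf of the sum of two independent variables with pmfs a and b (lists indexed by value)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def gf(pmf, z):
    """Generating function of a finite pmf evaluated at z."""
    return sum(prob * z**k for k, prob in enumerate(pmf))


def mean(pmf):
    return sum(k * prob for k, prob in enumerate(pmf))


pmf_Y = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]  # hypothetical law on {0, 1, 2}
pmf_T = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]  # on {0, 1, 2, 3}

# Law of X by conditioning on T = k: a mixture of k-fold convolutions of pmf_Y.
pmf_X = [Fraction(0)] * ((len(pmf_T) - 1) * (len(pmf_Y) - 1) + 1)
k_fold = [Fraction(1)]  # law of the empty sum, concentrated at 0
for prob_T in pmf_T:
    for value, prob in enumerate(k_fold):
        pmf_X[value] += prob_T * prob
    k_fold = convolve(k_fold, pmf_Y)

z = Fraction(2, 3)  # an arbitrary test point
print(gf(pmf_X, z) == gf(pmf_T, gf(pmf_Y, z)))
print(mean(pmf_X) == mean(pmf_Y) * mean(pmf_T))
```

Because the arithmetic is exact, both comparisons are genuine equalities rather than approximate ones.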
A stronger version of this result is often needed:

Theorem 1.4.11 Let $\{Y_n\}_{n \ge 1}$ be a sequence of integer-valued integrable random variables such that $E[Y_n] = E[Y_1]$ for all $n \ge 1$. Let $T$ be an integer-valued random variable such that for all $n \ge 1$, the event $\{T \ge n\}$ is independent of $Y_n$. Define $X = \sum_{n=1}^{T} Y_n$. Then
$$E[X] = E[Y_1] E[T]. \tag{1.26}$$

Proof. We have
$$E[X] = E\left[\sum_{n=1}^{\infty} Y_n 1_{\{n \le T\}}\right] = \sum_{n=1}^{\infty} E[Y_n 1_{\{n \le T\}}].$$
But $E[Y_n 1_{\{n \le T\}}] = E[Y_n] E[1_{\{n \le T\}}] = E[Y_1] P(n \le T)$. The result then follows from the telescope formula $E[T] = \sum_{n=1}^{\infty} P(T \ge n)$.
The following technical result will be needed in the next subsection on branching processes. It gives details concerning the shape of the generating function restricted to the interval $[0, 1]$.

Theorem 1.4.12
(α) Let $g : [0, 1] \to \mathbb{R}$ be defined by $g(x) = E[x^{X}]$, where $X$ is a nonnegative integer-valued random variable. Then $g$ is nondecreasing and convex. Moreover, if $P(X = 0) < 1$, then $g$ is strictly increasing, and if $P(X \le 1) < 1$, it is strictly convex.
(β) Suppose $P(X \le 1) < 1$. If $E[X] \le 1$, the equation $x = g(x)$ has a unique solution $x \in [0, 1]$, namely $x = 1$. If $E[X] > 1$, it has two solutions in $[0, 1]$, $x = 1$ and $x = x_0 \in (0, 1)$.

Proof. Just observe that for $x \in [0, 1]$,
$$g'(x) = \sum_{n=1}^{\infty} n P(X = n) x^{n - 1} \ge 0,$$
and therefore $g$ is nondecreasing, and
$$g''(x) = \sum_{n=2}^{\infty} n(n - 1) P(X = n) x^{n - 2} \ge 0,$$
and therefore $g$ is convex. For $g'(x)$ to be null for some $x \in (0, 1)$, it is necessary to have $P(X = n) = 0$ for all $n \ge 1$, and therefore $P(X = 0) = 1$. For $g''(x)$ to be null for some $x \in (0, 1)$, one must have $P(X = n) = 0$ for all $n \ge 2$, and therefore $P(X = 0) + P(X = 1) = 1$. The graph of $g : [0, 1] \to \mathbb{R}$ has, in the strictly increasing strictly convex case $P(X = 0) + P(X = 1) < 1$, the general shape shown in the figure, where we distinguish two cases: $E[X] = g'(1) \le 1$, and $E[X] = g'(1) > 1$. The rest of the proof is then easy.
[Figure: Two aspects of the generating function. Both panels show the graph of $g$ on $[0, 1]$, starting from $g(0) = P(X = 0)$. Left panel: $E[X] \le 1$. Right panel: $E[X] > 1$.]
1.4.2 Probability of Extinction

We shall first formally define the branching process. Let $Z_n = (Z_n^{(1)}, Z_n^{(2)}, \ldots)$, where the random variables $\{Z_n^{(j)}\}_{n \ge 1, j \ge 1}$ are iid and integer-valued. The recurrence equation
$$X_{n+1} = \sum_{k=1}^{X_n} Z_{n+1}^{(k)} \tag{1.27}$$
($X_{n+1} = 0$ if $X_n = 0$) may be interpreted as follows: $X_n$ is the number of individuals in the $n$th generation of a given population (humans, particles, etc.). Individual number $k$ of the $n$th generation gives birth to $Z_{n+1}^{(k)}$ descendants, and this accounts for Eqn. (1.27). The number $X_0$ of ancestors is assumed to be independent of $\{Z_n\}_{n \ge 1}$. The sequence of random variables $\{X_n\}_{n \ge 0}$ is called a branching process because of the genealogical tree that it generates (see the figure below).
[Figure: Sample tree of a branching process (one ancestor), with generation sizes $X_0 = 1$, $X_1 = 2$, $X_2 = 5$, $X_3 = 8$, $X_4 = 7$, $X_5 = 6$, $X_6 = 2$.]

The event $E$ = “an extinction occurs” is just “at least one generation is empty”, that is,
$$E = \bigcup_{n=1}^{\infty} \{X_n = 0\}.$$
We now proceed to the computation of the extinction probability when there is one ancestor. Let $Z$ denote any of the random variables $Z_n^{(k)}$, and let $g$ be their common generating function. We discard trivialities by supposing that $P(Z \le 1) < 1$. The generating function of the number of individuals in the $n$th generation is denoted
$$\psi_n(z) = E[z^{X_n}].$$
We prove successively that
(a) $P(X_{n+1} = 0) = g(P(X_n = 0))$,
(b) $P(E) = g(P(E))$, and
(c) if $E[Z] \le 1$, the probability of extinction is 1; and if $E[Z] > 1$, the probability of extinction is $< 1$ but nonzero.

Proof.
(a) In (1.27), $X_n$ is independent of the $Z_{n+1}^{(k)}$'s. Therefore, by Theorem 1.4.10, $\psi_{n+1}(z) = \psi_n(g(z))$. Iterating this equality, we obtain $\psi_{n+1}(z) = \psi_0(g^{(n+1)}(z))$, where $g^{(n)}$ is the $n$th iterate of $g$. Since there is only one ancestor, $\psi_0(z) = z$, and therefore $\psi_{n+1}(z) = g^{(n+1)}(z) = g(g^{(n)}(z))$, that is, $\psi_{n+1}(z) = g(\psi_n(z))$. In particular, since $\psi_n(0) = P(X_n = 0)$, (a) is proved.

(b) Since $X_n = 0$ implies $X_{n+1} = 0$, the sequence of events $\{X_n = 0\}$, $n \ge 1$, is nondecreasing, and therefore, by monotone sequential continuity,
$$P(E) = \lim_{n \uparrow \infty} P(X_n = 0).$$
The generating function $g$ is continuous, and therefore from (a) and the last equation, the probability of extinction satisfies (b).

(c) Since the trivial cases where $P(Z = 0) = 1$ or $P(Z \ge 2) = 0$ have been eliminated, by Theorem 1.4.12:
(α) If $E[Z] \le 1$, the only solution of $x = g(x)$ in $[0, 1]$ is 1, and therefore $P(E) = 1$. The branching process eventually becomes extinct.
(β) If $E[Z] > 1$, there are two solutions of $x = g(x)$ in $[0, 1]$, 1 and $x_0$ such that $0 < x_0 < 1$. From the strict convexity of $g : [0, 1] \to [0, 1]$, it follows that the sequence $y_n = P(X_n = 0)$ that satisfies $y_0 = 0$ and $y_{n+1} = g(y_n)$ converges to $x_0$. Therefore, when the mean number of descendants $E[Z]$ is strictly larger than 1, $P(E) \in (0, 1)$.
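The recursion $y_{n+1} = g(y_n)$, $y_0 = 0$, used in the proof lends itself to a direct numerical sketch. The two offspring distributions below are hypothetical examples: for the first, $x = g(x)$ reads $2x^2 - 3x + 1 = 0$, whose roots are $1/2$ and $1$; for the second, the only root in $[0, 1]$ is $1$.

```python
def extinction_probability(offspring_pmf, iterations=200):
    """Iterate y_{n+1} = g(y_n) from y_0 = 0, where g is the offspring generating function."""

    def g(x):
        return sum(prob * x**k for k, prob in enumerate(offspring_pmf))

    y = 0.0  # y_0 = P(X_0 = 0) = 0 with one ancestor
    for _ in range(iterations):
        y = g(y)  # y_n = P(X_n = 0)
    return y


supercritical = extinction_probability([0.25, 0.25, 0.5])  # E[Z] = 1.25 > 1
subcritical = extinction_probability([0.5, 0.25, 0.25])  # E[Z] = 0.75 < 1
print(supercritical, subcritical)
```

The number of iterations is an arbitrary choice; in both examples the iteration contracts at a geometric rate, so 200 steps are far more than needed. The first value approaches $x_0 = 1/2$ and the second approaches 1.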
1.5 Borel's Strong Law of Large Numbers
The empirical frequency of heads in a sequence of independent tosses of a fair coin is asymptotically $\frac{1}{2}$. This is a special case of Borel's strong law of large numbers:

Theorem 1.5.1 Let $\{X_n\}_{n \ge 1}$ be an iid sequence of $\{0, 1\}$-valued random variables taking the value 1 with probability $p \in [0, 1]$. Then
$$P\left(\lim_{n \uparrow \infty} \frac{1}{n} \sum_{k=1}^{n} X_k = p\right) = 1.$$
We then say: the sequence $\left\{\frac{1}{n} \sum_{k=1}^{n} X_k\right\}_{n \ge 1}$ converges almost surely to $p$.
For the proof, some preliminaries are in order.
1.5.1 The Borel–Cantelli Lemma
Consider a sequence of events $\{A_n\}_{n \ge 1}$. Let
$$\{A_n \text{ i.o.}\} := \{\omega;\ \omega \in A_n \text{ for an infinity of indices } n\}.$$
Here i.o. means infinitely often. We have the Borel–Cantelli lemma:

Theorem 1.5.2
$$\sum_{n=1}^{\infty} P(A_n) < \infty \implies P(A_n \text{ i.o.}) = 0.$$

Proof. First observe that
$$\{A_n \text{ i.o.}\} = \bigcap_{n=1}^{\infty} \bigcup_{k \ge n} A_k.$$
(Indeed, if $\omega$ belongs to the set on the right-hand side, then for all $n \ge 1$, $\omega$ belongs to at least one among $A_n, A_{n+1}, \ldots$, which implies that $\omega$ is in $A_n$ for an infinite number of indices $n$. Conversely, if $\omega$ is in $A_n$ for an infinite number of indices $n$, it is for all $n \ge 1$ in at least one of the sets $A_n, A_{n+1}, \ldots$.) The set $\bigcup_{k \ge n} A_k$ decreases as $n$ increases, so that by the sequential continuity property of probability,
$$P(A_n \text{ i.o.}) = \lim_{n \uparrow \infty} P\left(\bigcup_{k \ge n} A_k\right). \tag{1.28}$$
But by sub-σ-additivity,
$$P\left(\bigcup_{k \ge n} A_k\right) \le \sum_{k \ge n} P(A_k),$$
and by the summability assumption, the right-hand side of this inequality goes to 0 as $n \uparrow \infty$.

For the converse Borel–Cantelli lemma below, an additional assumption of independence is needed.
Theorem 1.5.3 Let $\{A_n\}_{n \ge 1}$ be a sequence of independent events. Then,
$$\sum_{n=1}^{\infty} P(A_n) = \infty \implies P(A_n \text{ i.o.}) = 1.$$

Proof. We may without loss of generality assume that $P(A_n) > 0$ for all $n \ge 1$ (why?). The divergence hypothesis implies that for all $n \ge 1$,
$$\prod_{k \ge n} (1 - P(A_k)) = 0.$$
This infinite product equals, in view of the independence assumption,
$$\prod_{k \ge n} P\left(\overline{A_k}\right) = P\left(\bigcap_{k=n}^{\infty} \overline{A_k}\right).$$
Passing to the complement and using De Morgan's identity,
$$P\left(\bigcup_{k \ge n} A_k\right) = 1.$$
Therefore, by (1.28),
$$P(A_n \text{ i.o.}) = \lim_{n \uparrow \infty} P\left(\bigcup_{k \ge n} A_k\right) = 1.$$
1.5.2 Markov's Inequality
Theorem 1.5.4 Let $Z$ be a nonnegative real random variable and let $a > 0$. Then,
$$P(Z \ge a) \le \frac{E[Z]}{a}. \tag{1.29}$$

Proof. From the inequality $Z \ge a 1_{\{Z \ge a\}}$, it follows by taking expectations that $E[Z] \ge a E[1_{\{Z \ge a\}}] = a P(Z \ge a)$.

Example 1.5.5: Chebyshev's inequality. Let $X$ be a real (discrete) random variable with mean $\mu$ and variance $\sigma^2$. Specializing the Markov inequality of Theorem 1.5.4 to $Z = (X - \mu)^2$, $a = \varepsilon^2 > 0$, we obtain Chebyshev's inequality: For all $\varepsilon > 0$,
$$P(|X - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}.$$
Example 1.5.6: The weak law of large numbers. Let $\{X_n\}_{n \ge 1}$ be an iid sequence of real square-integrable random variables with common mean $\mu$ and common variance $\sigma^2 < \infty$. Since the variance of the empirical mean $\frac{S_n}{n} := \frac{X_1 + \cdots + X_n}{n}$ is equal to $\frac{\sigma^2}{n}$, we have by Chebyshev's inequality, for all $\varepsilon > 0$,
$$P\left(\left|\frac{S_n}{n} - \mu\right| \ge \varepsilon\right) = P\left(\left|\frac{\sum_{i=1}^{n} (X_i - \mu)}{n}\right| \ge \varepsilon\right) \le \frac{\sigma^2}{n \varepsilon^2}.$$
In other words, the empirical mean $\frac{S_n}{n}$ converges to the mean $\mu$ in probability, which means exactly (by definition of the convergence in probability) that, for all $\varepsilon > 0$,
$$\lim_{n \uparrow \infty} P\left(\left|\frac{S_n}{n} - \mu\right| \ge \varepsilon\right) = 0.$$
This specific result is called the weak law of large numbers.
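For coin tossing, the deviation probability in the weak law can be computed exactly and compared with the Chebyshev bound $\sigma^2 / (n \varepsilon^2)$. The sketch below assumes $\{0, 1\}$-valued variables with $p = 1/2$ and $\varepsilon = 1/10$ (illustrative values) and uses rational arithmetic so that the comparison with $\varepsilon$ is exact.

```python
from fractions import Fraction
from math import comb


def deviation_probability(n, p, eps):
    """Exact P(|S_n / n - p| >= eps) for S_n binomial of size n and parameter p."""
    total = Fraction(0)
    for k in range(n + 1):
        if abs(Fraction(k, n) - p) >= eps:
            total += comb(n, k) * p**k * (1 - p) ** (n - k)
    return total


p, eps = Fraction(1, 2), Fraction(1, 10)  # illustrative values
results = []
for n in (10, 100, 1000):
    exact = deviation_probability(n, p, eps)
    bound = p * (1 - p) / (n * eps**2)  # sigma^2 / (n eps^2) with sigma^2 = p(1 - p)
    results.append((n, float(exact), float(bound)))
    print(n, float(exact), float(bound))
```

The exact probabilities decrease much faster than the bound, which illustrates that Chebyshev's inequality is sufficient for the weak law but far from sharp.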
1.5.3 Proof of Borel's Strong Law

Let $S_n := \sum_{k=1}^{n} X_k$ and $Z_n := \frac{1}{n} S_n$.
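Lemma 1.5.7 If
$$\sum_{n \ge 1} P(|Z_n - p| \ge \varepsilon_n) < \infty \tag{1.30}$$
for some sequence of positive numbers $\{\varepsilon_n\}_{n \ge 1}$ converging to 0, then the sequence $\{Z_n\}_{n \ge 1}$ converges $P$-a.s. to $p$.

Proof. If, for a given $\omega$, $|Z_n(\omega) - p| \ge \varepsilon_n$ finitely often (or f.o.; that is, for at most a finite number of indices $n$), then $\limsup_{n \uparrow \infty} |Z_n(\omega) - p| \le \lim_{n \uparrow \infty} \varepsilon_n = 0$. Therefore
$$P\left(\lim_{n \uparrow \infty} Z_n = p\right) \ge P(|Z_n - p| \ge \varepsilon_n \text{ f.o.}).$$
On the other hand,
$$\{|Z_n - p| \ge \varepsilon_n \text{ f.o.}\} = \overline{\{|Z_n - p| \ge \varepsilon_n \text{ i.o.}\}}.$$
Therefore
$$P(|Z_n - p| \ge \varepsilon_n \text{ f.o.}) = 1 - P(|Z_n - p| \ge \varepsilon_n \text{ i.o.}).$$
Hypothesis (1.30) implies (Borel–Cantelli lemma) that
$$P(|Z_n - p| \ge \varepsilon_n \text{ i.o.}) = 0.$$
By linking the above facts, we obtain
$$P\left(\lim_{n \uparrow \infty} Z_n = p\right) \ge 1,$$
and of course, the only possibility is $= 1$.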
In order to prove almost-sure convergence using Lemma 1.5.7, we must find some adequate upper bound for the general term of the series occurring in the left-hand side of (1.30). The basic tool for this is the Markov inequality:
$$P\left(\left|\frac{S_n}{n} - p\right| \ge \varepsilon\right) = P\left(\left(\frac{S_n}{n} - p\right)^{4} \ge \varepsilon^{4}\right) \le \frac{E\left[\left(\frac{S_n}{n} - p\right)^{4}\right]}{\varepsilon^{4}} \le \frac{E\left[\left(\sum_{i=1}^{n} Y_i\right)^{4}\right]}{n^{4} \varepsilon^{4}},$$
where $Y_i := X_i - p$. In view of the independence hypothesis, $E[Y_1 Y_2 Y_3 Y_4] = E[Y_1] E[Y_2] E[Y_3] E[Y_4] = 0$, $E[Y_1 Y_2^{3}] = E[Y_1] E[Y_2^{3}] = 0$, and the like. Finally, in the expansion
$$E\left[\left(\sum_{i=1}^{n} Y_i\right)^{4}\right] = \sum_{i,j,k,\ell=1}^{n} E[Y_i Y_j Y_k Y_\ell],$$
only the terms of the form $E[Y_i^{4}]$ and $E[Y_i^{2} Y_j^{2}]$, $i \ne j$, remain. There are $n$ terms of the first type and $3n(n - 1)$ terms of the second type. Therefore, $n E[Y_1^{4}] + 3n(n - 1) E[Y_1^{2} Y_2^{2}]$ remains, which is less than $K n^{2}$ for some finite $K$. Therefore
$$P\left(\left|\frac{S_n}{n} - p\right| \ge \varepsilon\right) \le \frac{K}{n^{2} \varepsilon^{4}},$$
and in particular, with $\varepsilon = n^{-\frac{1}{8}}$,
$$P\left(\left|\frac{S_n}{n} - p\right| \ge n^{-\frac{1}{8}}\right) \le \frac{K}{n^{\frac{3}{2}}},$$
from which it follows that
$$\sum_{n=1}^{\infty} P\left(\left|\frac{S_n}{n} - p\right| \ge n^{-\frac{1}{8}}\right) < \infty.$$
Therefore, by Lemma 1.5.7, $\left|\frac{S_n}{n} - p\right|$ converges almost surely to 0.
Complementary reading

[Brémaud, 2017] is entirely devoted to the main discrete probability models and methods, using only the elementary tools presented in this chapter.
1.6 Exercises
Exercise 1.6.1. De Morgan's rules
Let $\{A_n\}_{n \ge 1}$ be a sequence of subsets of $\Omega$. Prove De Morgan's identities:
$$\overline{\bigcap_{n=1}^{\infty} A_n} = \bigcup_{n=1}^{\infty} \overline{A_n} \quad \text{and} \quad \overline{\bigcup_{n=1}^{\infty} A_n} = \bigcap_{n=1}^{\infty} \overline{A_n}.$$

Exercise 1.6.2. Finitely often
Let $\{A_n\}_{n \ge 1}$ be a sequence of subsets of $\Omega$. Show that $\omega \in B := \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} \overline{A_k}$ if and only if there exists at most a finite number (depending on $\omega$) of indices $k$ such that $\omega \in A_k$. (The event $B$ is therefore the event that events $A_n$ occur finitely often.)

Exercise 1.6.3. Indicator functions
Show that for all subsets $A, B \subset \Omega$, where $\Omega$ is an arbitrary set, $1_{A \cap B} = 1_A \times 1_B$ and $1_{\overline{A}} = 1 - 1_A$. Show that if $\{A_n\}_{n \ge 1}$ is a partition of $\Omega$,
$$1 = \sum_{n \ge 1} 1_{A_n}.$$
Exercise 1.6.4. Small σ-fields
Is there a σ-field on $\Omega$ with 6 elements (including of course $\Omega$ and $\emptyset$)?

Exercise 1.6.5. Composed events
Let $\mathcal{F}$ be a σ-field on some set $\Omega$.
(1) Show that if $A_1, A_2, \ldots$ are in $\mathcal{F}$, then so is $\bigcap_{k=1}^{\infty} A_k$.
(2) Show that if $A_1, A_2$ are in $\mathcal{F}$, then so is their symmetric difference $A_1 \triangle A_2 := A_1 \cup A_2 - A_1 \cap A_2$.

Exercise 1.6.6. Set inverse functions
Let $f : U \to E$ be a function, where $U$ and $E$ are arbitrary sets. For any subset $A \subseteq E$, define
$$f^{-1}(A) = \{u \in U;\ f(u) \in A\}.$$
(i) Show that for all $u \in U$, $1_A(f(u)) = 1_{f^{-1}(A)}(u)$.
(ii) Prove that if $\mathcal{E}$ is a σ-field on $E$, then the collection of subsets of $U$
$$f^{-1}(\mathcal{E}) := \left\{f^{-1}(A);\ A \in \mathcal{E}\right\}$$
is a σ-field on $U$.

Exercise 1.6.7. Identities
Prove the identities
$$P(A \cup B) = 1 - P(\overline{A} \cap \overline{B})$$
and
$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

Exercise 1.6.8. Urns
1. An urn contains 17 red balls and 19 white balls. Balls are drawn in succession at random and without replacement. What is the probability that the first 2 balls are red?
2. An urn contains N balls numbered from 1 to N . Someone draws n balls (1 ≤ n ≤ N ) simultaneously from the urn. What is the probability that the lowest number drawn is k?
Exercise 1.6.9. Independence of a family of events
1. Give a simple example of a probability space $(\Omega, \mathcal{F}, P)$ with three events $A_1, A_2, A_3$ that are pairwise independent but not globally independent (the family $\{A_1, A_2, A_3\}$ is not independent).
2. If $\{A_i\}_{i \in \mathbb{N}}$ is an independent family of events, is it true that $\{\tilde{A}_i\}_{i \in \mathbb{N}}$ is also an independent family of events, where for each $i \in \mathbb{N}$, $\tilde{A}_i = A_i$ or $\overline{A_i}$ (your choice, for instance, $\tilde{A}_0 = A_0$, $\tilde{A}_1 = \overline{A_1}$, $\tilde{A}_3 = A_3$, ...)?
Exercise 1.6.10. Extension of the product formula for independent events
Let $\{C_n\}_{n \ge 1}$ be a sequence of independent events. Then
$$P\left(\bigcap_{n=1}^{\infty} C_n\right) = \prod_{n=1}^{\infty} P(C_n).$$
(This extends formula (1.7) to a countable number of sets.)

Exercise 1.6.11. Conditional independence and the Markov property
1. Let $(\Omega, \mathcal{F}, P)$ be a probability space. Define for a fixed event $C$ of positive probability, $P_C(A) := P(A \mid C)$. Show that $P_C$ is a probability on $(\Omega, \mathcal{F})$. (Note that $A$ and $B$ are independent with respect to this probability if and only if they are conditionally independent given $C$.)
2. Let $A_1, A_2, A_3$ be three events of positive probability. Show that events $A_1$ and $A_3$ are conditionally independent given $A_2$ if and only if the “Markov property” holds, that is, $P(A_3 \mid A_1 \cap A_2) = P(A_3 \mid A_2)$.
Exercise 1.6.12. Roll it!
You roll fairly and simultaneously three unbiased dice.
(i) What is the probability that you obtain the (unordered) outcome $\{1, 2, 4\}$?
(ii) What is the probability that some die shows 2, given that the sum of the 3 values equals 5?

Exercise 1.6.13. Heads or tails as usual
A person, A, tossing an unbiased coin $N$ times obtains $T_A$ tails. Another person, B, tossing her own unbiased coin $N + 1$ times has $T_B$ tails. What is the probability that $T_A \ge T_B$? Hint: Introduce $H_A$ and $H_B$, the number of heads obtained by A and B respectively, and use a symmetry argument.

Exercise 1.6.14. Apartheid University
In the renowned Social Apartheid University, students have been separated into three social groups for pedagogical purposes. In group A, one finds students who individually have a probability of passing equal to 0.95. In group B this probability is 0.75, and in group C only 0.65. The three groups are of equal size. What is the probability that a student passing the course comes from group A? B? C?

Exercise 1.6.15. A wise bet
There are three cards. The first one has both faces red, the second one has both faces white, and the third one is white on one face, red on the other. A card is drawn at random, and the color of a randomly selected face of this card is shown to you (the other remains hidden). What is the winning strategy if you must bet on the color of the hidden face?

Exercise 1.6.16. A sequence of liars
Consider a sequence of $n$ “liars” $L_1, \ldots, L_n$. The first liar $L_1$ receives information about the occurrence of some event in the form “yes or no”, and transmits it to $L_2$, who transmits it to $L_3$, etc. Each liar transmits what he hears with probability $p \in (0, 1)$, and the contrary with probability $q = 1 - p$. The decision of lying or not is made independently by each liar. What is the probability $x_n$ of obtaining the correct information from $L_n$? What is the limit of $x_n$ as $n$ increases to infinity?

Exercise 1.6.17. The campus library complaint
You are looking for a book in the campus libraries. Each library has it with probability 0.60 but the book of each given library may have been stolen with probability 0.25. If there are three libraries, what are your chances of obtaining the book?

Exercise 1.6.18. Professor Nebulous
Professor Nebulous travels from Los Angeles to Paris with stopovers in New York and London. At each stop his luggage is transferred from one plane to another. In each airport, including Los Angeles, the chances are that with probability $p$ his luggage is not placed in the right plane. Professor Nebulous finds that his suitcase has not reached Paris. What are the chances that the mishap took place in Los Angeles, New York, and London, respectively?

Exercise 1.6.19. Blood test
Give a mathematical model and invent data to corroborate the informal discussion of Remark 1.2.12.

Exercise 1.6.20. Safari butchers
Three tourists participate in a safari in Africa. Here comes an elephant, unaware of the rules of the game. The innocent beast is killed, having received two out of the three bullets simultaneously shot by the tourists. The tourists' hit probabilities are: Tourist A: $\frac{1}{4}$, Tourist B: $\frac{1}{2}$, Tourist C: $\frac{3}{4}$. Give for each tourist the probability that he was the one who missed.

Exercise 1.6.21. One is the sum of the other two
You perform three independent tosses of an unbiased die. What is the probability that one of these tosses results in a number that is the sum of the other two numbers?

Exercise 1.6.22. Poincaré's formula
Let $A_1, \ldots, A_n$ be events and let $X_1, \ldots, X_n$ be their indicator functions. By expanding the expression $E\left[\prod_{i=1}^{n} (1 - X_i)\right]$, deduce Poincaré's formula:
$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i) - \sum_{1 \le i < j \le n} P(A_i \cap A_j) + \sum_{1 \le i < j < k \le n} P(A_i \cap A_j \cap A_k) - \cdots$$
Exercise 1.6.23. No name
Let $X$ be a discrete random variable taking its values in $E$, with probability distribution $p(x)$ ($x \in E$). Let $A := \{\omega;\ p(X(\omega)) = 0\}$. What is the probability of this event?

Exercise 1.6.24. Null variance
Prove that a null variance implies that the random variable is almost surely constant.

Exercise 1.6.25. Moment inequalities
(a) Prove that for any integer-valued random variable $X$,
$$P(X \ne 0) \le E[|X|].$$
(b) Prove that for any square-integrable real-valued discrete random variable $X$,
$$P(X = 0) \le \frac{\mathrm{Var}(X)}{E[X]^{2}}.$$
Exercise 1.6.26. Checking conditional independence
Let $X$, $Y$ and $Z$ be three discrete random variables with values in $E$, $F$, and $G$, respectively. Prove the following: If for some function $g : E \times F \to [0, 1]$, $P(X = x \mid Y = y, Z = z) = g(x, y)$ for all $x, y, z$, then $P(X = x \mid Y = y) = g(x, y)$ for all $x, y$, and $X$ and $Z$ are conditionally independent given $Y$.

Exercise 1.6.27. G(n, p) with a given number of edges
Prove that the conditional distribution of $G(n, p)$ given that the number of edges is $m \le \binom{n}{2}$ is uniform on the set $\mathcal{G}_m$ of graphs $G = (V, E)$, where $V = \{1, 2, \ldots, n\}$ with exactly $m$ edges.

Exercise 1.6.28. Sum of geometric variables
Let $T_1$ and $T_2$ be two independent geometric random variables with the same parameter $p \in (0, 1)$. Give the probability distribution of their sum $X = T_1 + T_2$.

Exercise 1.6.29. The coupon collector, take 2
In the coupon collector problem of Example 1.3.36, show that the number $X$ of chocolate tablets bought when all the $n$ coupons have been collected for the first time satisfies the inequality
$$|E[X] - n \ln n| \le n.$$

Exercise 1.6.30. The coupon collector, take 3
In the coupon collector problem of Example 1.3.36, compute the variance $\sigma_X^2$ of $X$ (the number of chocolate tablets needed to complete the collection of the $n$ different coupons) and show that $\frac{\sigma_X^2}{n^2}$ has a limit (to be identified) as $n$ grows indefinitely.
Exercise 1.6.31. The coupon collector, take 4
In the coupon collector problem of Example 1.3.36, prove that for all $c > 0$, $P(X > n \ln n + cn) \le e^{-c}$. Hint: you might find it useful to define $A_i$ to be the event that the Type $i$ coupon has not shown up in the first $n \ln n + cn$ tablets.
Exercise 1.6.32. Factorial of Poisson
1. Let $X$ be a Poisson random variable with mean $\theta > 0$. Compute the mean of the random variable $X!$ (factorial, not exclamation mark!).
2. Compute $E\left[\theta^X\right]$.
Exercise 1.6.33. Even and odd Poisson
Let $X$ be a Poisson random variable with mean $\theta > 0$. What is the probability that $X$ is odd? even?
Exercise 1.6.34. A random sum
Let $\{X_n\}_{n \ge 1}$ be independent random variables taking the values 0 and 1 with probability $q = 1 - p$ and $p$, respectively, where $p \in (0, 1)$. Let $T$ be a Poisson random variable with mean $\theta > 0$, independent of $\{X_n\}_{n \ge 1}$. Define $S = X_1 + \cdots + X_T$. Show that $S$ is a Poisson random variable with mean $p\theta$.
Exercise 1.6.35. Multiplicative Bernoulli
Let $X_1, \ldots, X_{2n}$ be independent random variables taking the values 0 or 1, and such that for all $i$ ($1 \le i \le 2n$), $P(X_i = 1) = p \in [0, 1]$. Define $Z = \sum_{i=1}^n X_i X_{n+i}$. Compute $P(Z = k)$ ($1 \le k \le n$).
Exercise 1.6.36. The matchbox
A smoker has one matchbox with $n$ matches in each pocket. He reaches at random for one box or the other. What is the probability that, having eventually found an empty matchbox, there will be $k$ matches left in the other box?
Exercise 1.6.37. Means and variances via generating functions
(a) Compute the mean and variance of the binomial random variable $B$ of size $n$ and parameter $p$ from its generating function. Do the same for the Poisson random variable $P$ of mean $\theta$.
(b) What is the generating function $g_T$ of the geometric random variable $T$ with parameter $p \in (0, 1)$ (recall $P(T = n) = (1 - p)^{n-1} p$, $n \ge 1$)? Compute its first two derivatives and deduce from the result the variance of $T$.
Exercise 1.6.38. Factorial moment of Poisson
What is the $n$-th factorial moment ($E[X(X - 1) \cdots (X - n + 1)]$) of a Poisson random variable $X$ of mean $\theta > 0$?
Exercise 1.6.39. From generating function to probability distribution
What is the probability distribution of the integer-valued random variable $X$ with generating function $g(z) = \frac{1}{(2 - z)^2}$? Compute its variance.
Exercise 1.6.40. Negative binomial formula
Prove that for all $z \in \mathbb{C}$, $|z| < 1$,
$$(1 - z)^{-p} = 1 + \binom{p}{p-1} z + \binom{p+1}{p-1} z^2 + \binom{p+2}{p-1} z^3 + \cdots.$$
Exercise 1.6.41. Throw a die
You perform three independent tosses of an unbiased die. What is the probability that one of these tosses results in a number that is the sum of the other two numbers? (You are required to find a solution using generating functions.)
Exercise 1.6.42. Residual time
Let $X$ be a random variable with values in $\mathbb{N}$ and with finite mean $m$. We know (telescope formula; Theorem 1.3.12) that $p_n = \frac{1}{m} P(X > n)$, $n \in \mathbb{N}$, defines a probability distribution on $\mathbb{N}$. Compute its generating function.
Exercise 1.6.43. The blue pinko
The blue pinko (an extravagant Australian bird) lays $T$ eggs, each egg blue or pink, with probability $p$ that a given egg is blue. The colors of the successive eggs are independent, and independent of the number of eggs laid. Exercise 1.6.34 shows that if the number of eggs is Poisson with mean $\theta$, then the number of blue eggs is Poisson with mean $\theta p$ and the number of pink eggs is Poisson with mean $\theta q$. Show that the number of blue eggs and the number of pink eggs are independent random variables.
Exercise 1.6.44. The entomologist
Each individual of a specific breed of insects has, independently of the others, the probability $\theta$ of being a male. An entomologist seeks to collect exactly $M > 1$ males, and therefore stops hunting as soon as she captures $M$ males. She has to capture an insect in order to determine its gender. What is the distribution of $X$, the number of insects she must catch to collect exactly $M$ males?
Exercise 1.6.45. The return of the entomologist
The situation is as in Exercise 1.6.44. What is the distribution of $X$, the smallest number of insects that the entomologist must catch to collect at least $M$ males and $N$ females?
Exercise 1.6.46. The entomologist strikes again
This continues Exercise 1.6.44. What is the expectation of $X$? (In Exercise 1.6.44, you computed the distribution of $X$, from which you can of course compute the mean. However you can give the solution directly, and this is what is required in the present exercise.)
Exercise 1.6.47. A recurrence equation
Recall the notation $a^+ = \max(a, 0)$. Consider the recurrence equation
$$X_{n+1} = (X_n - 1)^+ + Z_{n+1} \qquad (n \ge 0),$$
where $X_0$ is a random variable taking its values in $\mathbb{N}$, and $\{Z_n\}_{n \ge 1}$ is a sequence of independent random variables taking their values in $\mathbb{N}$, and independent of $X_0$. Express the generating function $\psi_{n+1}$ of $X_{n+1}$ in terms of the generating function $\varphi$ of $Z_1$.
Exercise 1.6.48. Extinction of a branching process
Compute the probability of extinction of a branching process with one ancestor when the probabilities of having 0, 1, or 2 sons are respectively $\frac{1}{4}$, $\frac{1}{4}$, and $\frac{1}{2}$.
Exercise 1.6.49. Several ancestors
(a) Give the survival probability in the model of Section 1.4.2 with $k$ ancestors, $k > 1$.
(b) Give the mean and variance of $X_n$ in the model of Section 1.4.2 with one ancestor.
Exercise 1.6.50. Size of the branching tree
When the probability of extinction is 1 ($m < 1$), call $Y$ the size of the branching tree ($Y = \sum_{n \ge 0} X_n$). Prove that $g_Y(z) = z\, g_Z(g_Y(z))$.
Chapter 2 Integration
From a technical point of view, a probability is a particular kind of measure and expectation is integration with respect to that measure. However we shall see in the next chapter that this point of view is shortsighted because the probabilistic notions of independence and conditioning, which are absent from the theory of measure and integration, play a fundamental role in probability. Nevertheless, the foundations of probability theory rest on Lebesgue’s integration theory, of which the present chapter gives a detailed outline and the main results.
The reader is assumed to have a working knowledge of the Riemann integral. This type of integral is sufficient for many purposes, but it has a few weak points when compared to the Lebesgue integral. For instance:
(1) The class of Riemann-integrable functions is too narrow. As a matter of fact, there are functions that have a Lebesgue integral and yet are not Riemann integrable (see Example 2.2.14).
(2) The stability properties under the limit operation of the functions that admit a Riemann integral are too weak. Indeed, it often happens that such limits do not have a Riemann integral whereas the limit, for instance, of nonnegative functions for which the Lebesgue integral is well defined also admits a well-defined Lebesgue integral.
(3) The Riemann integral is defined with respect to the Lebesgue measure (length, area, volume, etc.) whereas Lebesgue’s integral can be defined with respect to a general abstract measure, a probability for instance.
This last advantage makes it worthwhile to invest a little time in order to understand the fundamental results of the Lebesgue integration theory, because the return is considerable. In fact, the Lebesgue integral of a function $f$ with respect to an abstract measure $\mu$ contains a variety of mathematical objects besides the usual Lebesgue integral on $\mathbb{R}^d$
$$\int_{\mathbb{R}^d} f(x_1, \ldots, x_d)\, dx_1 \cdots dx_d.$$
For instance, an infinite sum
$$\sum_{n \in \mathbb{Z}} f(n)$$
can also be regarded as a Lebesgue integral with respect to the counting measure on $\mathbb{Z}$. The Stieltjes–Lebesgue integral
$$\int_{\mathbb{R}} f(x)\, dF(x)$$
with respect to a right-continuous nondecreasing function $F$ is again a special case of the Lebesgue integral. Most importantly, the expectation $E[Z]$ of a random variable $Z$ will be recognized as an abstract integral. It is the latter type of integral that is of interest in this book and the purpose of the present chapter is to define the Lebesgue integral and to give its properties directly useful to probability theory.
© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/9783030401832_2
2.1 Measurability and Measure
This section describes the functions that Lebesgue’s integration theory admits for integrands and gives the formal definition and properties of the measure with respect to which such functions are integrated.
2.1.1 Measurable Functions
Recall the definition of a σ-field:
Definition 2.1.1 Denote by $\mathcal{P}(X)$ the collection of all subsets of a given set $X$. A collection of subsets $\mathcal{X} \subseteq \mathcal{P}(X)$ is called a σ-field on $X$ if:
(α) $X \in \mathcal{X}$,
(β) $A \in \mathcal{X} \implies \overline{A} \in \mathcal{X}$, and
(γ) $A_n \in \mathcal{X}$ for all $n \in \mathbb{N} \implies \bigcup_{n=0}^{\infty} A_n \in \mathcal{X}$.
One then says that $(X, \mathcal{X})$ is a measurable space. A set $A \in \mathcal{X}$ is called a measurable set.
Observe that (see Exercise 1.6.5):
(γ′) $A_n \in \mathcal{X}$ for all $n \in \mathbb{N} \implies \bigcap_{n=0}^{\infty} A_n \in \mathcal{X}$.
In fact, given the properties (α) and (β), properties (γ) and (γ′) are equivalent. Note also that $\varnothing \in \mathcal{X}$, being the complement of $X$. Therefore, a σ-field on $X$ is a collection of subsets of $X$ that contains $X$ and $\varnothing$, and is closed under countable unions, countable intersections and complementation.
The two simplest examples of σ-fields on $X$ are the gross σ-field $\mathcal{X} = \{\varnothing, X\}$ and the trivial σ-field $\mathcal{X} = \mathcal{P}(X)$.
Definition 2.1.2 The σ-field generated by a nonempty collection of subsets $\mathcal{C} \subseteq \mathcal{P}(X)$ is, by definition, the smallest σ-field on $X$ containing all the sets in $\mathcal{C}$ (see Exercise 2.4.1). It is denoted by $\sigma(\mathcal{C})$.
Of course, if $\mathcal{G}$ is a σ-field, $\sigma(\mathcal{G}) = \mathcal{G}$, a fact that will often be used.
We now review some basic notions of topology.
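Before turning to topology, the notion of generated σ-field of Definition 2.1.2 can be made concrete on a finite set, where countable unions reduce to finite ones and the generated σ-field can be computed by brute force. The Python sketch below is only an illustration under that finiteness assumption; the function name `generated_sigma_field` and the example collection are hypothetical choices, not notation from the text.

```python
from itertools import combinations


def generated_sigma_field(X, C):
    """Smallest sigma-field on the FINITE set X containing every set in C.

    On a finite set, closure under complements and pairwise unions is enough,
    since countable unions are finite unions and intersections follow by
    De Morgan's laws.
    """
    X = frozenset(X)
    field = {frozenset(), X} | {frozenset(c) for c in C}
    changed = True
    while changed:
        changed = False
        current = list(field)
        for A in current:                      # close under complementation
            if X - A not in field:
                field.add(X - A)
                changed = True
        for A, B in combinations(current, 2):  # close under pairwise unions
            if A | B not in field:
                field.add(A | B)
                changed = True
    return field


X = {1, 2, 3, 4}
sigma = generated_sigma_field(X, [{1}, {1, 2}])
for A in sorted(sigma, key=lambda s: (len(s), sorted(s))):
    print(set(A) if A else "{}")
```

In this example the generated σ-field consists of all unions of the three blocks {1}, {2} and {3, 4}; in particular {3} alone is not measurable. Feeding a σ-field back into the function returns it unchanged, which is the finite counterpart of the remark that σ(G) = G when G is a σ-field.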
Definition 2.1.3 A topology on a set $X$ is a collection $\mathcal{O}$ of subsets of $X$ satisfying the following properties:
(i) $X$ and $\varnothing$ belong to $\mathcal{O}$;
(ii) the union of an arbitrary collection of sets in $\mathcal{O}$ is a set in $\mathcal{O}$; and
(iii) the intersection of a finite number of sets in $\mathcal{O}$ is a set in $\mathcal{O}$.
The elements $O \in \mathcal{O}$ are called open sets and the pair $(X, \mathcal{O})$ is called a topological space. (Usually, when the context clearly defines the open sets, we say: “the topological space $X$”.)
Example 2.1.4: Metric spaces. Let $X$ be a set. A function $d : X \times X \to \mathbb{R}_+$ such that for all $x, y, z \in X$,
(i) $d(x, y) = 0 \iff x = y$,
(ii) $d(x, y) = d(y, x)$ and
(iii) $d(x, z) \le d(x, y) + d(y, z)$
is called a metric on $X$. The pair $(X, d)$ is called a metric space. Any metric induces a topology as follows. A set $O \subseteq X$ is called open if for all $x \in O$ there is an open ball $B(x, a) := \{y \in X;\ d(y, x) < a\}$ ($a > 0$) contained in $O$. The collection of such sets is indeed a topology, as can be straightforwardly checked.
Example 2.1.5: The euclidean topology. In $\mathbb{R}^n$, the usual euclidean distance defines a topology called the euclidean topology.
Topology concerns continuity:
Definition 2.1.6 Let $(X, \mathcal{O})$ and $(E, \mathcal{V})$ be two topological spaces. A function $f : X \to E$ is said to be continuous (with respect to these topologies) if
$$V \in \mathcal{V} \implies f^{-1}(V) \in \mathcal{O}.$$
In other terms, $f^{-1}(\mathcal{V}) \subseteq \mathcal{O}$.
This is a rather abstract definition (however it will turn out to be quite convenient, as we shall soon see). For metric spaces, continuity is more explicit, and is usually defined in terms of metrics:
Theorem 2.1.7 Let $(X, d)$ and $(E, \rho)$ be two metric spaces. A necessary and sufficient condition for a mapping $f : X \to E$ to be continuous according to the above abstract definition (and with respect to the topologies induced by the corresponding metrics) is that for any $\varepsilon > 0$ and any $x \in X$, there exists a $\delta > 0$ such that $d(y, x) \le \delta$ implies $\rho(f(y), f(x)) \le \varepsilon$.
The proof is given in Section B.7 of the appendix.
We shall now be more “concrete” about σ-fields.
Definition 2.1.8 Let $(X, \mathcal{O})$ be a topological space. The σ-field $\sigma(\mathcal{O})$ will be denoted by $\mathcal{B}(X)$ and called the Borel σ-field on $X$ (associated with topology $\mathcal{O}$). A set $B \in \mathcal{B}(X)$ is called a Borel set of $X$ (with respect to the topology $\mathcal{O}$).
When $X = \mathbb{R}^n$ is endowed with the euclidean topology, the Borel σ-field is denoted by $\mathcal{B}(\mathbb{R}^n)$. The next result gives a sometimes more convenient characterization of $\mathcal{B}(\mathbb{R}^n)$.
Theorem 2.1.9 $\mathcal{B}(\mathbb{R}^n)$ is generated by the collection $\mathcal{C}$ of all rectangles of the type $\prod_{i=1}^n (-\infty, a_i]$, where $a_i \in \mathbb{Q}$ for all $i \in \{1, \ldots, n\}$.
Proof. It suffices to show that $\mathcal{B}(\mathbb{R}^n)$ is generated by the collection $\mathcal{C}'$ of all rectangles $\prod_{i=1}^n (a_i, b_i)$ with rational endpoints (that is, such that $a_i, b_i \in \mathbb{Q}$ for all $i \in \{1, \ldots, n\}$). Note that $\mathcal{C}'$ is a countable collection and that all its elements are open sets for the euclidean topology (the latter we denote by $\mathcal{O}$). It follows that $\mathcal{C}' \subseteq \mathcal{O}$ and therefore $\sigma(\mathcal{C}') \subseteq \sigma(\mathcal{O}) = \mathcal{B}(\mathbb{R}^n)$. It remains to show that $\mathcal{O} \subseteq \sigma(\mathcal{C}')$, since this implies that $\sigma(\mathcal{O}) \subseteq \sigma(\mathcal{C}')$. For this it suffices to show that any set $O \in \mathcal{O}$ is a countable union of elements in $\mathcal{C}'$. Take $x \in O$. By definition of the euclidean topology, there exists a nonempty open ball $B(x, r)$ centered at $x$ and contained in $O$. Now we can always choose a rational rectangle $R_x \in \mathcal{C}'$ that contains $x$ and that is contained in $B(x, r)$. Clearly $\bigcup_{x \in O} R_x = O$. Since the $R_x$ are chosen in a countable family of sets, the union $\bigcup_{x \in O} R_x$ is in fact countable. As a countable union of sets in $\mathcal{C}'$ it is in $\sigma(\mathcal{C}')$. Therefore $O \in \sigma(\mathcal{C}')$.
Definition 2.1.10 $\mathcal{B}(\overline{\mathbb{R}})$ is the σ-field on $\overline{\mathbb{R}}$ generated by the intervals of type $[-\infty, a]$, $a \in \mathbb{R}$. If $I = \prod_{j=1}^n I_j$, where $I_j$ is an interval of $\overline{\mathbb{R}}$, the Borel σ-field $\mathcal{B}(I)$ on $I$ is the collection of all the Borel sets contained in $I$.
Remark 2.1.11 A question naturally arises at this point. How can we be sure that the Borel σ-field (of $\mathbb{R}$ for instance) is not just the trivial σ-field? In other terms, does there exist at least one subset of $\mathbb{R}$ that is not a Borel set?
The answer is yes: there exist such “pathological” subsets, but we shall not prove this here.
The central concept of Lebesgue’s integration theory is that of a measurable function.
Definition 2.1.12 Let $(X, \mathcal{X})$ and $(E, \mathcal{E})$ be two measurable spaces. A function $f : X \to E$ is said to be a measurable function with respect to $\mathcal{X}$ and $\mathcal{E}$ if $f^{-1}(C) \in \mathcal{X}$ for all $C \in \mathcal{E}$. In other terms, $f^{-1}(\mathcal{E}) \subseteq \mathcal{X}$.
This will be denoted by: $f : (X, \mathcal{X}) \to (E, \mathcal{E})$ or $f \in \mathcal{E}/\mathcal{X}$.
Let $(X, \mathcal{X})$ be a measurable space. A function $f : (X, \mathcal{X}) \to (\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ is called a Borel function from $X$ to $\mathbb{R}^k$. A function $f : (X, \mathcal{X}) \to (\overline{\mathbb{R}}, \mathcal{B}(\overline{\mathbb{R}}))$ is called an extended Borel function, or simply a Borel function. As for functions $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, they
are called real Borel functions. In general, in a sentence such as “$f$ is a Borel function defined on $X$”, the σ-field $\mathcal{X}$ is assumed to be the obvious one in the given context.
The key to the definition of the Lebesgue integral is the theorem of approximation of nonnegative measurable functions by simple Borel functions.
Definition 2.1.13 A function $f : X \to \mathbb{R}$ of the form
$$f(x) = \sum_{i=1}^{k} a_i 1_{A_i}(x),$$
where $k \in \mathbb{N}_+$, $a_1, \ldots, a_k \in \mathbb{R}$ and $A_1, \ldots, A_k$ are sets in $\mathcal{X}$, is called a simple Borel function (defined on $X$).
Theorem 2.1.14 Let $f : (X, \mathcal{X}) \to (\overline{\mathbb{R}}, \mathcal{B}(\overline{\mathbb{R}}))$ be a nonnegative Borel function. There exists a nondecreasing sequence $\{f_n\}_{n \ge 1}$ of nonnegative simple Borel functions converging pointwise to the function $f$.
Proof. Let
$$f_n(x) := \sum_{k=0}^{n 2^n - 1} k 2^{-n}\, 1_{A_{k,n}}(x) + n\, 1_{A_n}(x),$$
where
$$A_{k,n} := \{x \in X : k 2^{-n} < f(x) \le (k + 1) 2^{-n}\}, \qquad A_n := \{x \in X : f(x) > n\}.$$
This sequence of functions has the announced properties. In fact, for any $x \in X$ such that $f(x) < \infty$ and for $n$ large enough (namely $n \ge f(x)$), $0 \le f(x) - f_n(x) \le 2^{-n}$, and for any $x \in X$ such that $f(x) = \infty$, $f_n(x) = n$ indeed converges to $f(x) = +\infty$.
It seems difficult to prove measurability since σ-fields are often not defined explicitly (see the definition of $\mathcal{B}(\mathbb{R}^n)$, for instance). However, the following result renders the task feasible. The corollaries will provide examples of application.
Theorem 2.1.15 Let $(X, \mathcal{X})$ and $(E, \mathcal{E})$ be two measurable spaces, where $\mathcal{E} = \sigma(\mathcal{C})$ for some collection $\mathcal{C}$ of subsets of $E$. Let $f : X \to E$ be some function. Then $f : (X, \mathcal{X}) \to (E, \mathcal{E})$ if and only if $f^{-1}(C) \in \mathcal{X}$ for all $C \in \mathcal{C}$.
Proof. Only sufficiency requires a proof. We start with the following trivial observations. Let $X$ and $E$ be sets, let $\mathcal{G}$ be a σ-field on $E$, and let $\mathcal{C}_1$, $\mathcal{C}_2$ be nonempty collections of subsets of $E$. Then (i) $\sigma(\mathcal{G}) = \mathcal{G}$, and (ii) $\mathcal{C}_1 \subseteq \mathcal{C}_2 \implies \sigma(\mathcal{C}_1) \subseteq \sigma(\mathcal{C}_2)$. Let now $f : X \to E$ be a function from $X$ to $E$, and define $\mathcal{G} := \{C \subseteq E;\ f^{-1}(C) \in \mathcal{X}\}$. One checks that $\mathcal{G}$ is a σ-field. But by hypothesis, $\mathcal{C} \subseteq \mathcal{G}$. Therefore, by (ii) and (i), $\mathcal{E} = \sigma(\mathcal{C}) \subseteq \sigma(\mathcal{G}) = \mathcal{G}$, that is, $f^{-1}(C) \in \mathcal{X}$ for all $C \in \mathcal{E}$.
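Returning to the approximation of Theorem 2.1.14: the value $f_n(x)$ depends on $x$ only through $f(x)$, so the construction is easy to experiment with numerically. The Python sketch below evaluates $f_n$ from the value $f(x) \in [0, +\infty]$, with the dyadic levels $k2^{-n}$ running over $k = 0, \ldots, n2^n - 1$ so that they cover $(0, n]$. The name `dyadic_approx` and the sample values are hypothetical choices made for this illustration.

```python
import math


def dyadic_approx(value, n):
    """Value of the n-th simple approximation at a point x where f(x) = value.

    value is a nonnegative float, possibly math.inf.
    """
    if value > n:              # x belongs to A_n = {f > n}, including f(x) = +inf
        return float(n)
    if value == 0:             # x belongs to no A_{k,n}: every indicator vanishes
        return 0.0
    # value in (k 2^-n, (k+1) 2^-n]  <=>  k = ceil(value * 2^n) - 1
    k = math.ceil(value * 2 ** n) - 1
    return k / 2 ** n


for value in (0.3, math.pi, math.inf):
    print(value, [dyadic_approx(value, n) for n in range(1, 7)])
```

For a finite value the printed sequence increases to within $2^{-n}$ of the value once $n$ exceeds it, while for $f(x) = +\infty$ the approximation is simply $n$.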
Stability Properties of Measurable Functions
Measurability is a stable property, in the sense that all the usual operations on measurable functions preserve measurability. (This is not the case for continuity, which is in general not stable with respect to limits.) Also, the class of measurable functions is a rich one. In particular, “continuous functions are measurable”. More precisely:
Corollary 2.1.16 Let $X$ and $E$ be two topological spaces with respective Borel σ-fields $\mathcal{B}(X)$ and $\mathcal{B}(E)$. Any continuous function $f : X \to E$ is measurable with respect to $\mathcal{B}(X)$ and $\mathcal{B}(E)$.
Proof. By definition of continuity, the inverse image of an open set of $E$ is an open set of $X$ and is therefore in $\mathcal{B}(X)$. By Theorem 2.1.15, since the open sets of $E$ generate $\mathcal{B}(E)$, the function $f$ is measurable with respect to $\mathcal{B}(X)$ and $\mathcal{B}(E)$.
Corollary 2.1.17 Let $(X, \mathcal{X})$ be a measurable space and let $n \ge 1$ be an integer. Then $f = (f_1, \ldots, f_n) : (X, \mathcal{X}) \to (\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ if and only if for all $1 \le i \le n$, $\{f_i \le a_i\} \in \mathcal{X}$ for all $a_i \in \mathbb{Q}$ (the rational numbers).
Proof. Since by Theorem 2.1.9, $\mathcal{B}(\mathbb{R}^n)$ is generated by the sets $\prod_{i=1}^n (-\infty, a_i]$, where $a_i \in \mathbb{Q}$ for all $i \in \{1, \ldots, n\}$, it suffices by Theorem 2.1.15 to show that for all $a \in \mathbb{Q}^n$, $\{f \le a\} \in \mathcal{X}$. This is indeed the case since
$$\{f \le a\} = \bigcap_{i=1}^{n} \{f_i \le a_i\},$$
and therefore $\{f \le a\} \in \mathcal{X}$, being the intersection of a countable (actually: finite) number of sets in $\mathcal{X}$.
Measurability is closed under composition:
Theorem 2.1.18 Let $(X, \mathcal{X})$, $(Y, \mathcal{Y})$ and $(E, \mathcal{E})$ be measurable spaces and let $\varphi : (X, \mathcal{X}) \to (Y, \mathcal{Y})$, $g : (Y, \mathcal{Y}) \to (E, \mathcal{E})$. Then $g \circ \varphi : (X, \mathcal{X}) \to (E, \mathcal{E})$.
Proof. Let $f := g \circ \varphi$ (meaning: $f(x) = g(\varphi(x))$ for all $x \in X$). For all $C \in \mathcal{E}$, $f^{-1}(C) = \varphi^{-1}(g^{-1}(C)) = \varphi^{-1}(D) \in \mathcal{X}$, because $D = g^{-1}(C)$ is a set in $\mathcal{Y}$ since $g \in \mathcal{E}/\mathcal{Y}$, and therefore $\varphi^{-1}(D) \in \mathcal{X}$ since $\varphi \in \mathcal{Y}/\mathcal{X}$.
Corollary 2.1.19 Let $\varphi = (\varphi_1, \ldots, \varphi_n)$ be a measurable function from $(X, \mathcal{X})$ to $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ and let $g : \mathbb{R}^n \to \mathbb{R}$ be a continuous function. Then $g \circ \varphi : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$.
Proof. Follows directly from Theorem 2.1.18 and Corollary 2.1.16.
This corollary in turn allows us to show that addition, multiplication and quotients preserve measurability.
Corollary 2.1.20 Let $\varphi_1, \varphi_2 : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and let $\lambda \in \mathbb{R}$. Then $\varphi_1 \times \varphi_2$, $\varphi_1 + \varphi_2$, $\lambda \varphi_1$, $(\varphi_1 / \varphi_2) 1_{\{\varphi_2 \neq 0\}}$ are real Borel functions. Moreover, the set $\{\varphi_1 = \varphi_2\}$ is a measurable set.
Proof. For the first three functions, take in Corollary 2.1.19 $g(x_1, x_2) = x_1 \times x_2$, $= x_1 + x_2$, $= \lambda x_1$ successively. For $(\varphi_1 / \varphi_2) 1_{\{\varphi_2 \neq 0\}}$, let $\psi_2 := \frac{1_{\{\varphi_2 \neq 0\}}}{\varphi_2}$, check that the latter function is measurable, and use the just proved fact that the product $\varphi_1 \psi_2$ is then measurable. Finally, $\{\varphi_1 = \varphi_2\} = \{\varphi_1 - \varphi_2 = 0\} = (\varphi_1 - \varphi_2)^{-1}(\{0\})$ is a measurable set since $\varphi_1 - \varphi_2$ is a measurable function and $\{0\}$ is a measurable set (any singleton is in $\mathcal{B}(\mathbb{R})$; exercise).
Finally, taking the limit preserves measurability, as will now be proved. Unless otherwise explicitly mentioned, limits of functions must be understood as pointwise limits.
Theorem 2.1.21 Let $f_n : (X, \mathcal{X}) \to (\overline{\mathbb{R}}, \mathcal{B}(\overline{\mathbb{R}}))$ ($n \in \mathbb{N}$). Then $\liminf_{n \uparrow \infty} f_n$ and $\limsup_{n \uparrow \infty} f_n$ are Borel functions, and the set
$$\{\limsup_{n \uparrow \infty} f_n = \liminf_{n \uparrow \infty} f_n\} = \{\exists \lim_{n \uparrow \infty} f_n\}$$
belongs to $\mathcal{X}$. In particular, if $\{\exists \lim_{n \uparrow \infty} f_n\} = X$, the function $\lim_{n \uparrow \infty} f_n$ is a Borel function.
Proof. We first prove the result in the particular case when the sequence of functions is nondecreasing. Denote by $f$ the limit of this sequence. By Theorem 2.1.15 it suffices to show that for all $a \in \mathbb{R}$, $\{f \le a\} \in \mathcal{X}$. But since the sequence $\{f_n\}_{n \ge 1}$ is nondecreasing, we have that $\{f \le a\} = \bigcap_{n=1}^{\infty} \{f_n \le a\}$, which is indeed in $\mathcal{X}$, as a countable intersection of sets in $\mathcal{X}$. Now recall that by definition,
$$\liminf_{n \uparrow \infty} f_n := \lim_{n \uparrow \infty} g_n, \quad \text{where } g_n := \inf_{k \ge n} f_k.$$
The function $g_n$ is measurable since for all $a \in \mathbb{R}$, $\{\inf_{k \ge n} f_k < a\}$ is a measurable set, being the complement of $\{\inf_{k \ge n} f_k \ge a\} = \bigcap_{k \ge n} \{f_k \ge a\}$, a measurable set (as the countable intersection of measurable sets), and $\{g_n \le a\} = \bigcap_{m \ge 1} \{g_n < a + 1/m\}$. Since the sequence $\{g_n\}_{n \ge 1}$ is nondecreasing, the measurability of $\liminf_{n \uparrow \infty} f_n$ follows from the particular case of nondecreasing functions. Similarly, $\limsup_{n \uparrow \infty} f_n = -\liminf_{n \uparrow \infty} (-f_n)$ is measurable. The set $\{\limsup_{n \uparrow \infty} f_n = \liminf_{n \uparrow \infty} f_n\}$ is the set on which two measurable functions are equal, and therefore, by the last assertion of Corollary 2.1.20, it is a measurable set. Finally, if $\lim_{n \uparrow \infty} f_n$ exists, it is equal to $\limsup_{n \uparrow \infty} f_n$, which is, as we just proved, a measurable function.
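The proofs above repeatedly reduce measurability to a check on generators, as allowed by Theorem 2.1.15. On finite spaces this reduction can be verified exhaustively. The following Python sketch is an illustration only: the spaces, the maps `f` and `g` (stored as dictionaries) and the helper names `preimage` and `pulls_back_into` are all hypothetical.

```python
from itertools import combinations


def powerset(E):
    """All subsets of the finite set E, as frozensets."""
    items = sorted(E)
    return {frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)}


def preimage(f, X, C):
    """Inverse image of C under the map f, given as a dict on X."""
    return frozenset(x for x in X if f[x] in C)


def pulls_back_into(f, X, sigma_X, collection):
    """True if the inverse image of every set of the collection lies in sigma_X."""
    return all(preimage(f, X, C) in sigma_X for C in collection)


X = {1, 2, 3, 4}
sigma_X = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), frozenset(X)}

E = {"a", "b", "c"}
generators = [frozenset({"a"}), frozenset({"b"})]  # they generate all subsets of E
sigma_E = powerset(E)

f = {1: "a", 2: "a", 3: "b", 4: "b"}   # constant on the blocks {1,2} and {3,4}
g = {1: "a", 2: "b", 3: "b", 4: "c"}   # separates 1 from 2

# Checking the two generators gives the same verdict as checking all 8 sets.
print(pulls_back_into(f, X, sigma_X, generators), pulls_back_into(f, X, sigma_X, sigma_E))
print(pulls_back_into(g, X, sigma_X, generators), pulls_back_into(g, X, sigma_X, sigma_E))
```

The point of the design is that the first argument pair of each `print` inspects only the generating collection, whereas the second inspects the whole σ-field; the theorem says the two answers must agree.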
Dynkin’s Systems
Proving that a given property is common to all measurable functions, or to all measurable sets, may at times appear difficult because there is usually no constructive definition of the σ-fields involved (these are often defined as “the smallest σ-field containing a certain class of subsets”). There is however a technical tool that allows us to do this, the Dynkin theorem(s). The central notions in this respect are those of a π-system and of a d-system.
Definition 2.1.22 Let $X$ be a set. The collection $\mathcal{S} \subseteq \mathcal{P}(X)$ is called a π-system of $X$ if it is closed under finite intersections.
For instance, the collection of finite unions of intervals of the type $(a, b]$ is a π-system of $\mathbb{R}$.
Definition 2.1.23 Let $X$ be a set. A nonempty collection of sets $\mathcal{S} \subseteq \mathcal{P}(X)$ is called a d-system, or Dynkin system, of sets if
(a) $X, \varnothing \in \mathcal{S}$.
(b) $\mathcal{S}$ is closed under strict difference (that is, if $A, B \in \mathcal{S}$ and $A \subseteq B$, then $B - A \in \mathcal{S}$).
(c) $\mathcal{S}$ is closed under sequential nondecreasing limits (that is, the limit of a nondecreasing sequence of sets in $\mathcal{S}$ is in $\mathcal{S}$).
Theorem 2.1.24 Let $X$ be a set. If the collection $\mathcal{S} \subseteq \mathcal{P}(X)$ is a π-system and a d-system, it is a σ-field.
Proof. (i) $\mathcal{S}$ contains $X$ and $\varnothing$, by definition of a d-system. (ii) $\mathcal{S}$ is closed under complementation. (Apply (b) in Definition 2.1.23 with $B = X$ and $A \in \mathcal{S}$, to obtain that $\overline{A} \in \mathcal{S}$.) (iii) $\mathcal{S}$ is closed under countable unions. To prove this, we first show that it is closed under finite unions. Indeed, if $A, B \in \mathcal{S}$, then by (ii), $\overline{A}, \overline{B} \in \mathcal{S}$, and therefore, since $\mathcal{S}$ is a π-system, $\overline{A} \cap \overline{B} \in \mathcal{S}$. Taking the complement we obtain by (ii) that $A \cup B \in \mathcal{S}$. Now for countable unions, consider a sequence $\{A_n\}_{n \ge 1}$ of elements of $\mathcal{S}$. The union $\bigcup_{n \ge 1} A_n$ can be written as $\bigcup_{n \ge 1} \left(\bigcup_{k=1}^{n} A_k\right)$, which is a countable union of nondecreasing sets in $\mathcal{S}$, and therefore it is in $\mathcal{S}$, by (c) of Definition 2.1.23.
The smallest d-system containing a nonempty collection of sets $\mathcal{C} \subseteq \mathcal{P}(X)$ is denoted by $d(\mathcal{C})$. Observe that since a σ-field $\mathcal{G}$ is already a d-system, $d(\mathcal{G}) = \mathcal{G}$. In particular, for any collection of sets $\mathcal{C} \subseteq \mathcal{P}(X)$, $d(\mathcal{C}) \subseteq \sigma(\mathcal{C})$. Also note that if $\mathcal{C}_1 \subseteq \mathcal{C}_2$, then $d(\mathcal{C}_1) \subseteq d(\mathcal{C}_2)$. We now state Dynkin’s theorem (sometimes called the monotone class theorem).
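Theorem 2.1.25 Let $\mathcal{S}$ be a π-system defined on $X$. Then $d(\mathcal{S}) = \sigma(\mathcal{S})$.
Proof. Any σ-field containing $\mathcal{S}$ will contain $d(\mathcal{S})$. Therefore $\sigma(d(\mathcal{S})) = \sigma(\mathcal{S})$. If we can show that $d(\mathcal{S})$ is a σ-field, then $\sigma(d(\mathcal{S})) = d(\mathcal{S})$, so that $d(\mathcal{S}) = \sigma(\mathcal{S})$. To prove that $d(\mathcal{S})$ is a σ-field, it suffices by Theorem 2.1.24 to show that it is a π-system. For this, define
$$\mathcal{D}_1 := \{A \in d(\mathcal{S});\ A \cap C \in d(\mathcal{S}) \text{ for all } C \in \mathcal{S}\}.$$
One checks that $\mathcal{D}_1$ is a d-system and that $\mathcal{S} \subseteq \mathcal{D}_1$ since $\mathcal{S}$ is a π-system by assumption. Therefore $d(\mathcal{S}) \subseteq \mathcal{D}_1$. By definition of $\mathcal{D}_1$, $\mathcal{D}_1 \subseteq d(\mathcal{S})$. Therefore $\mathcal{D}_1 = d(\mathcal{S})$. Define now
$$\mathcal{D}_2 := \{A \in d(\mathcal{S});\ A \cap C \in d(\mathcal{S}) \text{ for all } C \in d(\mathcal{S})\}.$$
This is a d-system. Also, if $C \in \mathcal{S}$, then $A \cap C \in d(\mathcal{S})$ for all $A \in \mathcal{D}_1 = d(\mathcal{S})$, and therefore $\mathcal{S} \subseteq \mathcal{D}_2$. Therefore $d(\mathcal{S}) \subseteq d(\mathcal{D}_2) = \mathcal{D}_2$. Also by definition of $\mathcal{D}_2$, $\mathcal{D}_2 \subseteq d(\mathcal{S})$, so that finally $\mathcal{D}_2 = d(\mathcal{S})$. In particular, $d(\mathcal{S})$ is a π-system.
There is a functional form of Dynkin’s theorem. We first define a d-system of functions.
Definition 2.1.26 Let $\mathcal{H}$ be a collection of nonnegative functions $f : X \to \mathbb{R}_+$ such that
(α) $1 \in \mathcal{H}$,
(β) $\mathcal{H}$ is closed under monotone nondecreasing sequential limits (that is, if $\{f_n\}_{n \ge 1}$ is a nondecreasing sequence of functions in $\mathcal{H}$, then $\lim_{n \uparrow \infty} f_n \in \mathcal{H}$), and
(γ) if $f_1, f_2 \in \mathcal{H}$, then $\lambda_1 f_1 + \lambda_2 f_2 \in \mathcal{H}$ for all $\lambda_1, \lambda_2 \in \mathbb{R}$ such that $\lambda_1 f_1 + \lambda_2 f_2$ is a nonnegative function.
Then $\mathcal{H}$ is called a d-system, or Dynkin system, of functions.
Theorem 2.1.27 Let $\mathcal{H}$ be a family of nonnegative functions $f : X \to \mathbb{R}_+$ that form a d-system. Let $\mathcal{S}$ be a π-system on $X$ such that $1_C \in \mathcal{H}$ for all $C \in \mathcal{S}$. Then $\mathcal{H}$ contains all nonnegative functions $f : (X, \sigma(\mathcal{S})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$.
Proof. The collection $\mathcal{D} := \{A \subseteq X;\ 1_A \in \mathcal{H}\}$ is a d-system containing $\mathcal{S}$ by hypothesis. Therefore $d(\mathcal{S}) \subseteq d(\mathcal{D}) = \mathcal{D}$. Since $d(\mathcal{S}) = \sigma(\mathcal{S})$ by Dynkin’s theorem, $\sigma(\mathcal{S}) \subseteq \mathcal{D}$. This means that $\mathcal{H}$ contains the indicator functions of all the sets in $\sigma(\mathcal{S})$. Being a d-system of functions, it contains all the nonnegative simple $\sigma(\mathcal{S})$-measurable functions. The rest of the proof follows from the theorem of approximation of nonnegative measurable functions by nonnegative simple functions (Theorem 2.1.14) and property (γ) of Definition 2.1.26.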
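The identity $d(\mathcal{S}) = \sigma(\mathcal{S})$ of Theorem 2.1.25, and the role played by the π-system assumption, can be checked by brute force on a four-point set. In the Python sketch below (all function names and both example collections are hypothetical), condition (c) of Definition 2.1.23 is automatic because a nondecreasing sequence of subsets of a finite set is eventually constant, so only (a) and (b) need to be enforced.

```python
from itertools import combinations


def powerset(X):
    items = sorted(X)
    return {frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)}


def is_pi_system(S):
    """Closed under pairwise (hence finite) intersections."""
    return all((A & B) in S for A in S for B in S)


def generated_sigma_field(X, C):
    """Closure of C under complements and unions (finite X only)."""
    X = frozenset(X)
    F = {frozenset(), X} | {frozenset(c) for c in C}
    while True:
        new = {X - A for A in F} | {A | B for A in F for B in F}
        if new <= F:
            return F
        F |= new


def generated_d_system(X, C):
    """Closure of C, X and the empty set under differences B - A with A inside B."""
    X = frozenset(X)
    D = {frozenset(), X} | {frozenset(c) for c in C}
    while True:
        new = {B - A for A in D for B in D if A <= B}
        if new <= D:
            return D
        D |= new


X = {1, 2, 3, 4}

S = {frozenset({1, 2}), frozenset({2, 3}), frozenset({2})}   # a pi-system
print(is_pi_system(S), generated_d_system(X, S) == generated_sigma_field(X, S))

evens = {A for A in powerset(X) if len(A) % 2 == 0}           # d-system, not a pi-system
print(is_pi_system(evens), generated_d_system(X, evens) == generated_sigma_field(X, evens))
```

The second collection (subsets of even cardinality) is already a d-system but is not closed under intersection, and its generated σ-field is strictly larger; this shows that the π-system hypothesis cannot simply be dropped.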
2.1.2 Measure
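The next most important notion of integration theory after that of measurable sets and measurable functions is that of measure.
Definition 2.1.28 Let $(X, \mathcal{X})$ be a measurable space. A set function $\mu : \mathcal{X} \to [0, \infty]$ is called a measure on $(X, \mathcal{X})$ if $\mu(\varnothing) = 0$ and if for any countable sequence $\{A_n\}_{n \ge 0}$ of pairwise disjoint sets in $\mathcal{X}$, the following property (σ-additivity) is satisfied:
$$\mu\left(\sum_{n=0}^{\infty} A_n\right) = \sum_{n=0}^{\infty} \mu(A_n),$$
where $\sum_{n=0}^{\infty} A_n$ denotes the union of the pairwise disjoint sets $A_n$. The triple $(X, \mathcal{X}, \mu)$ is then called a measure space.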
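The next two properties have been proved in the first chapter. We repeat the proofs for self-containedness. First, the monotonicity property:
$$A \subseteq B \text{ and } A, B \in \mathcal{X} \implies \mu(A) \le \mu(B).$$
Indeed, $B = A + (B - A)$ and therefore, $\mu(B) = \mu(A) + \mu(B - A) \ge \mu(A)$. The sub-σ-additivity property:
$$A_n \in \mathcal{X} \text{ for all } n \in \mathbb{N} \implies \mu\left(\bigcup_{n=0}^{\infty} A_n\right) \le \sum_{n=0}^{\infty} \mu(A_n),$$
is obtained by writing
$$\mu\left(\bigcup_{n=0}^{\infty} A_n\right) = \mu\left(\sum_{n=0}^{\infty} A'_n\right),$$
where the sum denotes a union of pairwise disjoint sets, $A'_0 = A_0$ and for $n \ge 1$,
$$A'_n = A_n \cap \overline{\bigcup_{j=0}^{n-1} A_j} \subseteq A_n,$$
so that $\mu(A'_n) \le \mu(A_n)$ by the monotonicity property.
Example 2.1.29: The Dirac measure. Let $a \in X$. The measure $\varepsilon_a$ defined by $\varepsilon_a(C) = 1_C(a)$ is called the Dirac measure at $a \in X$. The set function $\mu : \mathcal{X} \to [0, \infty]$ defined by $\mu(C) = \sum_{i=0}^{\infty} \alpha_i 1_C(a_i)$, where $a_i \in X$, $\alpha_i \in \mathbb{R}_+$ for all $i \in \mathbb{N}$, is a measure denoted $\mu = \sum_{i=0}^{\infty} \alpha_i \varepsilon_{a_i}$.
Example 2.1.30: Weighted counting measure. Let $\{\alpha_n\}_{n \in \mathbb{Z}}$ be a sequence of $\mathbb{R}_+$. The set function $\mu : \mathcal{P}(\mathbb{Z}) \to [0, \infty]$ defined by $\mu(C) = \sum_{n \in C} \alpha_n$ is a measure on $(\mathbb{Z}, \mathcal{P}(\mathbb{Z}))$. In the case $\alpha_n \equiv 1$, it is called the counting measure on $\mathbb{Z}$ (then $\mu(C) = \mathrm{card}(C)$).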
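The two examples above, together with the disjointification step used to prove sub-σ-additivity, are easy to try out on finite subsets of the integers. The Python sketch below is a minimal illustration restricted to finite sets; the helper names `dirac`, `weighted_counting` and `disjointify`, the weights $2^{-|n|}$ and the sample sets are all assumptions made for the example.

```python
from fractions import Fraction


def dirac(a):
    """Dirac measure at a: C -> 1 if a belongs to C, else 0."""
    return lambda C: 1 if a in C else 0


def weighted_counting(alpha):
    """C -> sum of alpha(n) over n in C, for a weight function alpha >= 0 (finite C)."""
    return lambda C: sum(alpha(n) for n in C)


def disjointify(sets):
    """First set unchanged; each later set minus the union of the earlier ones."""
    seen, pieces = set(), []
    for A in sets:
        pieces.append(set(A) - seen)
        seen |= set(A)
    return pieces


mu = weighted_counting(lambda n: Fraction(1, 2 ** abs(n)))
counting = weighted_counting(lambda n: 1)

sets = [{-1, 0, 1}, {0, 1, 2}, {2, 3}, {-1, 3}]
pieces = disjointify(sets)
union = set().union(*sets)

print(mu(union) == sum(mu(P) for P in pieces))   # additivity over the disjoint pieces
print(mu(union) <= sum(mu(A) for A in sets))     # sub-additivity for the original sets
print(dirac(0)(union), counting(union))
```

Exact rational arithmetic (`Fraction`) is used so that the additivity check is an equality rather than a floating-point approximation.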
Example 2.1.31: The Lebesgue measure. There exists one and only one measure $\ell$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $\ell((a, b]) = b - a$. This measure is called the Lebesgue measure on $\mathbb{R}$. (Beware. The statement of this example is in fact a theorem, which is part of a more general result, Theorem 2.1.53 below.) More generally, the Lebesgue measure on $\mathbb{R}^n$, denoted by $\ell^n$, is the unique measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ such that
$$\ell^n\left(\prod_{i=1}^{n} (a_i, b_i]\right) = \prod_{i=1}^{n} (b_i - a_i).$$
The proof of existence of $\ell^n$ for $n \ge 2$, assuming the existence of $\ell$, is a consequence of the forthcoming Theorem 2.3.7.
Definition 2.1.32 Let $\mu$ be a measure on $(X, \mathcal{X})$. If $\mu(X) < \infty$ the measure $\mu$ is called a finite measure. If there exists a sequence $\{K_n\}_{n \ge 1}$ of $\mathcal{X}$ such that $\mu(K_n) < \infty$ for all $n \ge 1$ and $\bigcup_{n=1}^{\infty} K_n = X$, the measure $\mu$ is called a σ-finite measure. A measure $\mu$ on $(\mathbb{R}^m, \mathcal{B}(\mathbb{R}^m))$ such that $\mu(C) < \infty$ for all bounded Borel sets $C$ is called a locally finite measure. It is called nonatomic or diffuse if $\mu(\{a\}) = 0$ for all $a \in \mathbb{R}^m$.
Remark 2.1.33 Nonatomicity does not imply that $\mu$ is the null measure. Indeed, the “proof” that it is null, namely
$$\mu(C) = \sum_{a \in C} \mu(\{a\}) = 0,$$
is not valid when $C$ is not countable.
We shall single out the case of a probability measure:
Definition 2.1.34 A measure $P$ on a measurable space $(\Omega, \mathcal{F})$ such that $P(\Omega) = 1$ is called a probability measure (for short: a probability).
Example 2.1.35: A few examples. The Dirac measure $\varepsilon_a$ is a probability measure. The counting measure $\nu$ on $\mathbb{Z}$ is a σ-finite measure. Any locally finite measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ is σ-finite. Lebesgue measure is a locally finite measure.
Theorem 2.1.36 Let $(X, \mathcal{X}, \mu)$ be a measure space. Let $\{A_n\}_{n \ge 1}$ be a nondecreasing sequence of sets in $\mathcal{X}$ (that is, $A_n \subseteq A_{n+1}$ for all $n \ge 1$). Then
$$\mu\left(\bigcup_{n=1}^{\infty} A_n\right) = \lim_{n \uparrow \infty} \mu(A_n). \qquad (2.1)$$
Proof. Since $A_n = A_1 + \sum_{k=2}^{n} (A_k - A_{k-1})$ and $\bigcup_{n \ge 1} A_n = A_1 + \sum_{k \ge 2} (A_k - A_{k-1})$, where the sums denote unions of pairwise disjoint sets, by σ-additivity
$$\mu(A_n) = \mu(A_1) + \sum_{k=2}^{n} \mu(A_k - A_{k-1})$$
and
$$\mu\left(\bigcup_{n \ge 1} A_n\right) = \mu(A_1) + \sum_{k \ge 2} \mu(A_k - A_{k-1}),$$
from which the result follows. (A word of caution: see Exercise 2.4.8.)
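As a small sanity check of Theorem 2.1.36, take the weighted counting measure on $\mathbb{N}$ with weights $\alpha_n = 2^{-n}$ and the nondecreasing sets $A_n = \{0, 1, \ldots, n\}$, whose union is $\mathbb{N}$. The measure of $\mathbb{N}$ is the geometric series $\sum_{n \ge 0} 2^{-n} = 2$, and $\mu(A_n) = 2 - 2^{-n}$ increases to this value. The Python sketch below (the names `mu` and `A` and the choice of weights are assumptions of this illustration) verifies the telescoping decomposition used in the proof with exact arithmetic.

```python
from fractions import Fraction


def mu(C):
    """Weighted counting measure with weights 2^{-n}, for a finite set C of naturals."""
    return sum(Fraction(1, 2 ** n) for n in C)


def A(n):
    """Nondecreasing sets A_n = {0, 1, ..., n}."""
    return set(range(n + 1))


for n in (1, 2, 5, 10, 20):
    # mu(A_n) = mu(A_1) + sum over k = 2..n of mu(A_k - A_{k-1})
    telescoped = mu(A(1)) + sum(mu(A(k) - A(k - 1)) for k in range(2, n + 1))
    assert telescoped == mu(A(n))
    # gap between mu(A_n) and the closed-form limit 2
    print(n, mu(A(n)), float(2 - mu(A(n))))
```

The limit value 2 is supplied in closed form rather than computed, since the code only handles finite sets.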
Definition 2.1.37 Let $\mu$ be a measure on the measurable topological space $(X, \mathcal{B}(X))$. Its support $\mathrm{supp}(\mu)$ is, by definition, the closure of the subset of $X$ consisting of all the points $x$ such that for all open neighborhoods $N_x$ of $x$, $\mu(N_x) > 0$.
Negligible Sets
Definition 2.1.38 Let $(X, \mathcal{X}, \mu)$ be a measure space. A μ-negligible set is a set contained in a measurable set $N \in \mathcal{X}$ such that $\mu(N) = 0$. One says that some property $\mathcal{P}$ relative to the elements $x \in X$ holds μ-almost everywhere (μ-a.e.) if the set $\{x \in X : x \text{ does not satisfy } \mathcal{P}\}$ is a μ-negligible set.
For instance, if $f$ and $g$ are two Borel functions defined on $X$, the expression $f \le g$ μ-a.e. means that $\mu(\{x : f(x) > g(x)\}) = 0$.
Theorem 2.1.39 A countable union of μ-negligible sets is a μ-negligible set.
Proof. Let $A_n$, $n \ge 1$, be a sequence of μ-negligible sets, and let $N_n$, $n \ge 1$, be a sequence of measurable sets such that $\mu(N_n) = 0$ and $A_n \subseteq N_n$. Then $N = \bigcup_{n \ge 1} N_n$ is a measurable set containing $\bigcup_{n \ge 1} A_n$, and $N$ is of μ-measure 0, by the sub-σ-additivity property.
Example 2.1.40: The rationals are Lebesgue-negligible. Any singleton $\{a\}$, $a \in \mathbb{R}$, is a Borel set of Lebesgue measure 0. The set of rationals $\mathbb{Q}$ is a Borel set of Lebesgue measure 0.
Proof. The Borel σ-field $\mathcal{B}(\mathbb{R})$ is generated by the intervals $I_a = (-\infty, a]$, $a \in \mathbb{R}$ (Theorem 2.1.9), and therefore $\{a\} = \bigcap_{n \ge 1} (I_a - I_{a - 1/n})$ is also in $\mathcal{B}(\mathbb{R})$. Denoting by $\ell$ the Lebesgue measure, $\ell(I_a - I_{a - 1/n}) = 1/n$, and therefore $\ell(\{a\}) = \lim_{n \uparrow \infty} \ell(I_a - I_{a - 1/n}) = 0$. $\mathbb{Q}$ is a countable union of sets in $\mathcal{B}(\mathbb{R})$ (singletons) and is therefore in $\mathcal{B}(\mathbb{R})$. It has Lebesgue measure 0 as a countable union of sets of Lebesgue measure 0.
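Definition 2.1.41 The measure space $(X, \mathcal{X}, \mu)$ is called complete if $\mathcal{X}$ contains all the μ-negligible subsets of $X$.
Given a measure space $(X, \mathcal{X}, \mu)$ that is not necessarily complete, denote by $\mathcal{N}$ the collection of μ-negligible subsets of $X$ (note that it contains the empty set). Let $\overline{\mathcal{X}}$ be formed by the sets $A \cup N$ ($A \in \mathcal{X}$, $N \in \mathcal{N}$). This is obviously a σ-field. Let $\overline{\mu}$ be the function from $\overline{\mathcal{X}}$ to $\overline{\mathbb{R}}_+$ defined by
$$\overline{\mu}(A \cup N) = \mu(A) \qquad (A \in \mathcal{X},\ N \in \mathcal{N}).$$
Then $(X, \overline{\mathcal{X}}, \overline{\mu})$ is a complete measure space, called the completion of $(X, \mathcal{X}, \mu)$. It is an extension of $\mu$ in the sense that $\overline{\mu}(A) = \mu(A)$ for all $A \in \mathcal{X}$.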
Equality of Measures The forthcoming results, which are consequences of Dynkin’s theorem, help to prove that two measures are identical. Let P be a collection of subsets of X, and let μ : P → [0, ∞] be σadditive, that is, for any countable family{An }n≥1 of mutually disjoint sets in P such that ∪n≥1 An ∈ P, we have μ(∪n≥1 An ) = n≥1 μ(An ). Then μ is called a measure on P. Let C ⊆ P be a collection of subsets of X. A mapping μ : P → [0, ∞] is called σﬁnite on C if there exists a countable family {Cn }n≥1 of sets in C such that ∪n≥1 Cn = X and μ(Cn ) < ∞ for all n ≥ 1. Theorem 2.1.42 Let μ1 and μ2 be two measures on (X, X ) and let S be a πsystem of measurable sets generating X . Suppose that μ1 and μ2 are σﬁnite on S. If μ1 and μ2 agree on S (that is, μ1 (C) = μ2 (C) for all C ∈ S), then they are identical. Proof. Let C ∈ S be such that μ1 (C)(= μ2 (C)) < ∞. Consider the collection DC = {A ∈ X ; μ1 (C ∩ A) = μ2 (C ∩ A)}. One veriﬁes that DC is a dsystem (the ﬁniteness of μ1 (C) and μ2 (C) is needed here). Moreover, by hypothesis, S ⊆ DC , and therefore d(S) ⊆ d(DC ). But d(DC ) = DC and by Dynkin’s Theorem 2.1.25, d(S) = σ(S) = X . Therefore for all C ∈ S of ﬁnite measure, all A ∈ X , we have μ1 (C ∩ A) = μ2 (C ∩ A). By the assumed σﬁniteness of μ1 and μ2 on S there exists a countable family {Cn }n≥1 of sets in S such that ∪n≥1 Cn = X and μ1 (Cn ) = μ2 (Cn ) < ∞ for all n ≥ 1. For α = 1, 2 and all n ≥ 1 we have that n μα (∪i=1 (Ci ∩ B)) = μα (Ci ∩ B) − μα (Ci ∩ Cj ∩ B) + · · · 1≤i≤n
[...] $\{f > 0\}$, and therefore by sequential continuity, $\mu(\{f > 0\}) = 0$, that is, $f \le 0$, $\mu$-a.e. On the other hand, by hypothesis, $f \ge 0$, $\mu$-a.e. Therefore $f = 0$, $\mu$-a.e.

(f) With $A = \{f > 0\}$, $1_A f$ is a nonnegative measurable function. By (e), $1_A f = 0$, $\mu$-a.e. This implies that $1_A = 0$, $\mu$-a.e., that is to say $f \le 0$, $\mu$-a.e. Similarly, $f \ge 0$, $\mu$-a.e. Therefore, $f = 0$, $\mu$-a.e.

(g) It is enough to consider the case $f \ge 0$. Since $f \ge n 1_{\{f = \infty\}}$ for all $n \ge 1$, we have $n \mu(\{f = \infty\}) \le \mu(f) < \infty$ for all $n \ge 1$. Such a bound, uniform in $n$, is possible only if $\mu(\{f = \infty\}) = 0$.

The extension to complex Borel functions of the properties (a), (b), (d) and (f) is immediate.

Example 2.2.14: Lebesgue-integrable but not Riemann-integrable. The function $f$ defined by $f := 1_{\mathbb{Q}}$ ($\mathbb{Q}$ is the set of rational numbers) is a Borel function and it is Lebesgue integrable with its integral equal to zero because $\{f \neq 0\}$ is the set of rational numbers, which has null Lebesgue measure. However, $f$ is not Riemann integrable.
2.2.3 Beppo Levi, Fatou and Lebesgue
The following versions of the theorems of Beppo Levi, Fatou and Lebesgue differ from the previous ones by the introduction of "$\mu$-almost everywhere" in the statements of the conditions. No other proofs are needed since integrals of almost everywhere equal functions are equal and countable unions of negligible sets are negligible. Only a convention must be stated: if the limit of a sequence of real measurable functions exists $\mu$-almost everywhere, that is, outside a $\mu$-negligible set, then the limit is assigned some arbitrary value on this $\mu$-negligible set; a common choice is 0. Remember that we are looking for conditions guaranteeing that
\[ \int_X \lim_{n \uparrow \infty} f_n \, d\mu = \lim_{n \uparrow \infty} \int_X f_n \, d\mu . \tag{2.6} \]
We start by restating the Beppo Levi or monotone convergence theorem.
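Theorem 2.2.15 Let $f_n : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ ($n \ge 1$) be such that
(i) $f_n \ge 0$ $\mu$-a.e., and
(ii) $f_{n+1} \ge f_n$ $\mu$-a.e.
Then, there exists a nonnegative function $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that
\[ \lim_{n \uparrow \infty} f_n = f \quad \mu\text{-a.e.} , \]
and (2.6) holds true.

Next, we restate Fatou's lemma.

Theorem 2.2.16 Let $f_n : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ ($n \ge 1$) be such that $f_n \ge 0$ $\mu$-a.e. ($n \ge 1$). Then
\[ \int_X \big(\liminf_{n \uparrow \infty} f_n\big) \, d\mu \le \liminf_{n \uparrow \infty} \int_X f_n \, d\mu . \tag{2.7} \]

Finally, we restate the Lebesgue or dominated convergence theorem.

Theorem 2.2.17 Let $f_n : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ ($n \ge 1$) be such that, for some function $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and some $\mu$-integrable function $g : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$:
(i) $\lim_{n \uparrow \infty} f_n = f$, $\mu$-a.e., and
(ii) $|f_n| \le g$ $\mu$-a.e. for all $n \ge 1$.
Then, (2.6) holds true.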
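To see that the domination hypothesis (ii) of Theorem 2.2.17 cannot simply be dropped, the following standard example may help. It is not taken from the text: the space $([0,1], \mathcal{B}([0,1]), \ell)$, with $\ell$ the Lebesgue measure, and the functions $f_n$ are our own choice for illustration.

```latex
\[
f_n := n \, 1_{(0, 1/n)} \quad (n \ge 1), \qquad
\lim_{n \uparrow \infty} f_n(x) = 0 \ \text{ for all } x \in [0,1], \qquad
\int_{[0,1]} f_n \, d\ell = n \cdot \frac{1}{n} = 1 .
\]
\[
\int_{[0,1]} \lim_{n \uparrow \infty} f_n \, d\ell \;=\; 0 \;<\; 1 \;=\; \lim_{n \uparrow \infty} \int_{[0,1]} f_n \, d\ell ,
\qquad
\sup_{n \ge 1} f_n(x) \;\ge\; \frac{1}{2x} \quad (0 < x \le 1/2) .
\]
```

So (2.6) fails here. This is consistent with the theorem: any $g$ dominating all the $f_n$ must dominate $\sup_n f_n$, which is not integrable near 0. Fatou's inequality (2.7) still holds, with strict inequality $0 < 1$.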
Differentiation under the integral sign

Let $(X, \mathcal{X}, \mu)$ be a measure space and let $(a, b) \subseteq \mathbb{R}$. Let $f : (a, b) \times X \to \mathbb{R}$ and for all $t \in (a, b)$, define $f_t : X \to \mathbb{R}$ by $f_t(x) := f(t, x)$. Suppose that for all $t \in (a, b)$, $f_t$ is measurable with respect to $\mathcal{X}$, and define, when possible, the function $I : (a, b) \to \mathbb{R}$ by the formula
\[ I(t) = \int_X f(t, x) \, \mu(dx) . \tag{2.8} \]
Assume that for $\mu$-almost all $x$ the function $t \mapsto f(t, x)$ is continuous at $t_0 \in (a, b)$ and that there exists a $\mu$-integrable function $g : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $|f(t, x)| \le g(x)$ $\mu$-a.e. for all $t$ in a neighborhood $V$ of $t_0$. Then $I$ is well defined and is continuous at $t_0$.

If we furthermore assume that
(α) $t \mapsto f(t, x)$ is continuously differentiable on $V$ for $\mu$-almost all $x$, and
(β) for some $\mu$-integrable function $h : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and all $t \in V$,
\[ |(df/dt)(t, x)| \le h(x) \quad \mu\text{-a.e.} , \]
then $I$ is differentiable at $t_0$ and
\[ I'(t_0) = \int_X (df/dt)(t_0, x) \, \mu(dx) . \tag{2.9} \]

Proof. Let $\{t_n\}_{n \ge 1}$ be a sequence in $V \setminus \{t_0\}$ such that $\lim_{n \uparrow \infty} t_n = t_0$, and define $f_n(x) = f(t_n, x)$, $f(x) = f(t_0, x)$. By dominated convergence,
\[ \lim_{n \uparrow \infty} I(t_n) = \lim_{n \uparrow \infty} \mu(f_n) = \mu(f) = I(t_0) . \]
Also
\[ \frac{I(t_n) - I(t_0)}{t_n - t_0} = \int_X \frac{f(t_n, x) - f(t_0, x)}{t_n - t_0} \, \mu(dx) , \]
and for some $\theta \in (0, 1)$, possibly depending upon $n$,
\[ \left| \frac{f(t_n, x) - f(t_0, x)}{t_n - t_0} \right| \le \big| (df/dt)(t_0 + \theta(t_n - t_0), x) \big| . \]
The latter quantity is bounded by $h(x)$. Therefore, by dominated convergence,
\[ \lim_{n \uparrow \infty} \frac{I(t_n) - I(t_0)}{t_n - t_0} = \lim_{n \uparrow \infty} \int_X \frac{f(t_n, x) - f(t_0, x)}{t_n - t_0} \, \mu(dx) = \int_X (df/dt)(t_0, x) \, \mu(dx) . \]
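Formula (2.9) can be checked numerically on a simple case. The sketch below is an illustration under our own assumptions, not part of the text: we take $X = [0, \infty)$ with the Lebesgue measure and $f(t, x) = e^{-x}\cos(tx)$, so that $g(x) = e^{-x}$ and $h(x) = x e^{-x}$ are integrable dominating functions and $I(t) = 1/(1+t^2)$ in closed form. The helper `simpson`, the cutoff and the step count are arbitrary choices.

```python
import math

def simpson(func, lo, hi, steps):
    """Composite Simpson rule on [lo, hi]; `steps` must be even."""
    width = (hi - lo) / steps
    total = func(lo) + func(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * func(lo + k * width)
    return total * width / 3

CUTOFF = 40.0   # exp(-40) is negligible, so [0, CUTOFF] stands in for [0, infinity)
STEPS = 40000

def I(t):
    """I(t) = integral over x >= 0 of f(t, x) = exp(-x) * cos(t * x)."""
    return simpson(lambda x: math.exp(-x) * math.cos(t * x), 0.0, CUTOFF, STEPS)

def integral_of_derivative(t):
    """Right-hand side of (2.9): integral of (df/dt)(t, x) = -x exp(-x) sin(t x)."""
    return simpson(lambda x: -x * math.exp(-x) * math.sin(t * x), 0.0, CUTOFF, STEPS)

t0 = 0.7
eps = 1e-4
# Left-hand side of (2.9), approximated by a central difference quotient of I.
finite_difference = (I(t0 + eps) - I(t0 - eps)) / (2 * eps)
# Derivative of the closed form 1 / (1 + t^2).
closed_form = -2 * t0 / (1 + t0 ** 2) ** 2
print(finite_difference, integral_of_derivative(t0), closed_form)
```

The three printed numbers should agree to several decimal places; the difference quotient is the least accurate of the three because of cancellation.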
2.3 The Other Big Theorems
Besides Beppo Levi, Fatou and Lebesgue's theorems, the four main results of integration theory are (i) the image measure theorem, (ii) the Fubini–Tonelli theorem relative to the product measures (to be defined in a few lines), which gives conditions allowing one to "choose the order of integration in multiple integrals", (iii) the Riesz–Fischer theorem relative to the Hilbert space structure of the space of square-integrable functions, and (iv) the Radon–Nikodým theorem relative to the product of a measure by a function, more precisely a converse of Theorem 2.3.28.
2.3.1 The Image Measure Theorem
Definition 2.3.1 Let $(X, \mathcal{X})$ and $(E, \mathcal{E})$ be two measurable spaces, let $h : (X, \mathcal{X}) \to (E, \mathcal{E})$ be a measurable function and let $\mu$ be a measure on $(X, \mathcal{X})$. Define the set function $\mu \circ h^{-1} : \mathcal{E} \to [0, \infty]$ by
\[ (\mu \circ h^{-1})(C) := \mu(h^{-1}(C)) \qquad (C \in \mathcal{E}) . \tag{2.10} \]
Then, as one easily checks, $\mu \circ h^{-1}$ is a measure on $(E, \mathcal{E})$ called the image of $\mu$ by $h$.

Integrals can be computed in the original domain or in the image domain. More precisely:

Theorem 2.3.2 For a nonnegative $f : (E, \mathcal{E}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$,
\[ \int_X (f \circ h)(x) \, \mu(dx) = \int_E f(y) \, (\mu \circ h^{-1})(dy) . \tag{2.11} \]
For functions $f : (E, \mathcal{E}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ of arbitrary sign either one of the conditions
(a) $f \circ h$ is $\mu$-integrable,
(b) $f$ is $\mu \circ h^{-1}$-integrable,
implies the other, and equality (2.11) then holds.

Proof. The equality (2.11) is readily verified when $f$ is a nonnegative simple Borel function. In the general case one approximates $f$ by a nondecreasing sequence of nonnegative simple Borel functions $\{f_n\}_{n \ge 1}$ and (2.11) then follows from the same equality written with $f = f_n$, by letting $n \uparrow \infty$ and using the monotone convergence theorem. For the case of functions of arbitrary sign, apply (2.11) with $f^+$ and $f^-$.
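On a finite space both sides of (2.11) are finite sums, which makes the theorem easy to check by hand or by machine. The following sketch is an illustration only; the point masses in `mu`, the map `h` and the function `f` are hypothetical choices, not taken from the text.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical finite measure on X = {-2, -1, 0, 1, 2}, given by its point masses.
mu = {-2: Fraction(1, 2), -1: Fraction(1, 3), 0: Fraction(1),
      1: Fraction(2), 2: Fraction(1, 6)}

def h(x):
    """Measurable map h : X -> E = {0, 1, 4}."""
    return x * x

def f(y):
    """Nonnegative function defined on the image space E."""
    return 3 * y + 1

# Image measure as in (2.10): (mu o h^{-1})({y}) = mu(h^{-1}({y})).
image = defaultdict(Fraction)
for x, mass in mu.items():
    image[h(x)] += mass

lhs = sum(f(h(x)) * mass for x, mass in mu.items())   # integral of f o h against mu
rhs = sum(f(y) * mass for y, mass in image.items())   # integral of f against mu o h^{-1}
print(dict(image), lhs, rhs)
```

Exact rational arithmetic (`Fraction`) is used so that the two sides can be compared with `==` rather than up to rounding.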
2.3.2 The Fubini–Tonelli Theorem
The first task is to define products of measurable spaces and of measures.

Definition 2.3.3 Let $(X_i, \mathcal{X}_i)$ ($1 \le i \le n$) be measurable spaces and let $X := \prod_{i=1}^n X_i$. Define the product σ-field $\mathcal{X} := \otimes_{i=1}^n \mathcal{X}_i$ to be the smallest σ-field on $X$ containing all the so-called generalized measurable rectangles $\prod_{i=1}^n A_i$, where $A_i \in \mathcal{X}_i$ ($1 \le i \le n$).

If $(X_i, \mathcal{X}_i) = (Y, \mathcal{Y})$ ($1 \le i \le n$), $\prod_{i=1}^n X_i$ is denoted by $Y^n$ and $\otimes_{i=1}^n \mathcal{X}_i$ by $\mathcal{Y}^{\otimes n}$.

For the product of σ-fields, we have the rule of associativity. For instance (exercise):
\[ \mathcal{X}_1 \otimes \mathcal{X}_2 \otimes \mathcal{X}_3 = (\mathcal{X}_1 \otimes \mathcal{X}_2) \otimes \mathcal{X}_3 = \mathcal{X}_1 \otimes (\mathcal{X}_2 \otimes \mathcal{X}_3) . \]
The σ-field $\mathcal{B}(\mathbb{R})^{\otimes n}$ (denoted by $\mathcal{B}(\mathbb{R})^n$ for short) is the σ-field on $\mathbb{R}^n$ generated by the generalized measurable rectangles of $\mathbb{R}^n$. The question is: Is this σ-field identical to $\mathcal{B}(\mathbb{R}^n)$, the σ-field generated by the rectangles of the type $\prod_{i=1}^n (a_i, b_i]$, where $-\infty < a_i \le b_i < +\infty$ ($1 \le i \le n$)? The answer is positive and a consequence of the rule of associativity of the product of σ-fields and of the result below.
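Theorem 2.3.4 Let $E$ and $F$ be two separable metric spaces with respective Borel σ-fields $\mathcal{B}(E)$ and $\mathcal{B}(F)$. Then
\[ \mathcal{B}(E \times F) = \mathcal{B}(E) \otimes \mathcal{B}(F) . \]

Proof. (i) For the proof that $\mathcal{B}(E \times F) \supseteq \mathcal{B}(E) \otimes \mathcal{B}(F)$, separability is not needed. We just have to observe that the projections $\pi_1$ and $\pi_2$ from $E \times F$ to $E$ and $F$, respectively, are continuous, and therefore measurable functions from $(E \times F, \mathcal{B}(E \times F))$ to $(E, \mathcal{B}(E))$ and $(F, \mathcal{B}(F))$, respectively. In particular, if $C \in \mathcal{B}(E)$, then $C \times F = \pi_1^{-1}(C) \in \mathcal{B}(E \times F)$ and similarly, if $D \in \mathcal{B}(F)$, then $E \times D \in \mathcal{B}(E \times F)$. Therefore $C \times D = (C \times F) \cap (E \times D) \in \mathcal{B}(E \times F)$.

(ii) We now prove that $\mathcal{B}(E \times F) \subseteq \mathcal{B}(E) \otimes \mathcal{B}(F)$. By the separability assumption, there exists a dense countable subset of $E$. Consider the collection $\mathcal{U}$ consisting of all open balls with rational radius centered at some point of this dense set. It forms a base for the topology in the sense that any open set of $E$ is the union of sets in $\mathcal{U}$. Let $\mathcal{V}$ be a similar base for $F$. To any $(x, y) \in O$, an open set of $E \times F$, one can associate an open set $U \times V$, where $U \in \mathcal{U}$ and $V \in \mathcal{V}$, that contains $(x, y)$ and is contained in $O$. Therefore $O$ is the union (at most countable) of sets of the form $U \times V$, where $U \in \mathcal{U}$ and $V \in \mathcal{V}$. In particular, every open set of $E \times F$ is measurable with respect to $\mathcal{B}(E) \otimes \mathcal{B}(F)$.

Lemma 2.3.5 Let $X = X_1 \times X_2$ and $\mathcal{X} = \mathcal{X}_1 \otimes \mathcal{X}_2$. Let $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. For fixed $x_1 \in X_1$, let the function $f_{x_1} : X_2 \to \mathbb{R}$ be defined by $f_{x_1}(x_2) := f(x_1, x_2)$. Then $f_{x_1}$ is a measurable function from $(X_2, \mathcal{X}_2)$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

Proof. STEP 1. We first prove the result for the special case $f = 1_F$, where $F \in \mathcal{X}$. Let for fixed $x_1 \in X_1$
\[ F_{x_1} := \{x_2 \in X_2 \,;\ (x_1, x_2) \in F\} . \]
($F_{x_1}$ is called the section of $F$ at $x_1$.) We have $f_{x_1} = 1_{F_{x_1}}$. Therefore we have to prove that $F_{x_1} \in \mathcal{X}_2$. For this define
\[ \mathcal{C}_{x_1} = \{F \subseteq X \,;\ F_{x_1} \in \mathcal{X}_2\} . \]
We want to show that $\mathcal{C}_{x_1} \supseteq \mathcal{X}$. For this, we first observe that $\mathcal{C}_{x_1}$ is a σ-field since it contains $X$ and $\emptyset$, and
(i) if $F \in \mathcal{C}_{x_1}$, then $\overline{F} \in \mathcal{C}_{x_1}$. Indeed, since $F_{x_1} \in \mathcal{X}_2$, we have that $\overline{(F_{x_1})} \in \mathcal{X}_2$. But $\overline{(F_{x_1})} = (\overline{F})_{x_1}$. Therefore $(\overline{F})_{x_1} \in \mathcal{X}_2$.
(ii) if $F_n \in \mathcal{C}_{x_1}$ ($n \ge 1$), then $\cup_{n \ge 1} F_n \in \mathcal{C}_{x_1}$. Indeed, $(F_n)_{x_1} \in \mathcal{X}_2$ and therefore $\cup_{n \ge 1} (F_n)_{x_1} \in \mathcal{X}_2$. But $\cup_{n \ge 1} (F_n)_{x_1} = (\cup_{n \ge 1} F_n)_{x_1}$. Therefore $(\cup_{n \ge 1} F_n)_{x_1} \in \mathcal{X}_2$.

The σ-field $\mathcal{C}_{x_1}$ contains all the rectangles $A \times B$, where $A \in \mathcal{X}_1$, $B \in \mathcal{X}_2$. Indeed, $(A \times B)_{x_1} = B$ if $x_1 \in A$, and $= \emptyset$ if $x_1 \notin A$. We may now conclude that, since $\mathcal{C}_{x_1}$ is a σ-field containing the generators of $\mathcal{X}$, it contains $\mathcal{X}$.

STEP 2. Let now $f : (X, \mathcal{X}) \to (\mathbb{R}_+, \mathcal{B}(\mathbb{R}_+))$. It is the limit of some nondecreasing sequence $\{f_n\}_{n \ge 1}$ of nonnegative simple functions. In particular,
\[ f_{x_1} = \lim_{n \uparrow \infty} (f_n)_{x_1} . \]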
It therefore suffices to prove that for any nonnegative simple function $g = \sum_{i=1}^k a_i 1_{A_i}$, the function $g_{x_1}$ is $\mathcal{X}_2$-measurable. This is true since
\[ g_{x_1} = \sum_{i=1}^k a_i 1_{(A_i)_{x_1}} \]
and since the $(A_i)_{x_1} \in \mathcal{X}_2$ by the result in Step 1.

STEP 3. We now consider a general $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. We have, with the usual notation, $f = f^+ - f^-$, and therefore $f_{x_1} = (f^+)_{x_1} - (f^-)_{x_1}$. By Step 2, $(f^+)_{x_1}$ and $(f^-)_{x_1}$ are $\mathcal{X}_2$-measurable, and therefore so is $f_{x_1}$.

Lemma 2.3.6 Let $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be a nonnegative function. Then, if $\mu_2$ is σ-finite, the function $x_1 \mapsto \int_{X_2} f_{x_1}(x_2) \, \mu_2(dx_2)$ is measurable from $(X_1, \mathcal{X}_1)$ to $(\mathbb{R}_+, \mathcal{B}(\mathbb{R}_+))$.

Proof. For fixed $x_1 \in X_1$, the integral $\int_{X_2} f_{x_1} \, d\mu_2$ is well defined since $f_{x_1}$ is measurable (by the previous lemma) and nonnegative. Observe that the conclusion of the lemma is true for $f = 1_{A \times B}$, where $A \in \mathcal{X}_1$ and $B \in \mathcal{X}_2$. Indeed, in this case,
\[ \int_{X_2} f_{x_1} \, d\mu_2 = \int_{X_2} 1_A(x_1) 1_B(x_2) \, \mu_2(dx_2) = 1_A(x_1) \mu_2(B) . \]
We now prove that the lemma is true for $f = 1_F$ when $F \in \mathcal{X}$. Let
\[ g_F(x_1) := \int_{X_2} 1_F(x_1, x_2) \, \mu_2(dx_2) = \mu_2(F_{x_1}) . \]
Consider the collection of sets
\[ \mathcal{C} := \{F \in \mathcal{X} \,;\ g_F \text{ is } \mathcal{X}_1\text{-measurable}\} . \]
First suppose that $\mu_2$ is a finite measure. In this case, $\mathcal{C}$ is a Dynkin system, since
(i) if $A$ and $B$ are in $\mathcal{C}$ and $A \subseteq B$, then $\mu_2((B - A)_{x_1}) = \mu_2(B_{x_1}) - \mu_2(A_{x_1})$ (this is where we need finiteness of $\mu_2$) is $\mathcal{X}_1$-measurable, and
(ii) if $\{C_n\}_{n \ge 1}$ is a nondecreasing sequence of sets in $\mathcal{C}$,
\[ \mu_2((\cup_{n \ge 1} C_n)_{x_1}) = \lim_{n \uparrow \infty} \mu_2((C_n)_{x_1}) \]
is $\mathcal{X}_1$-measurable.
Since $\mathcal{C}$ contains the measurable rectangles $A \times B$, it contains $\mathcal{X}$, by Dynkin's theorem (Theorem 2.1.25).

If $\mu_2$ is not finite, but only σ-finite, there exists a sequence $\{K_n\}_{n \ge 1}$ of elements of $\mathcal{X}_2$ increasing to $X_2$ and such that the measure $\mu_{2,n}$ defined by $\mu_{2,n}(A) = \mu_2(A \cap K_n)$ is finite. Then $\mu_2(F_{x_1}) = \lim_{n \uparrow \infty} \mu_{2,n}(F_{x_1})$ is $\mathcal{X}_1$-measurable.

Finally, we pass from indicator functions of measurable sets to nonnegative measurable functions by the usual monotone convergence argument.
Theorem 2.3.7 Suppose $\mu_1$ and $\mu_2$ σ-finite. Then there exists a unique measure $\mu$ on $(X_1 \times X_2, \mathcal{X}_1 \otimes \mathcal{X}_2)$ such that
\[ \mu(A_1 \times A_2) = \mu_1(A_1) \mu_2(A_2) \tag{2.12} \]
for all $A_1 \in \mathcal{X}_1$, $A_2 \in \mathcal{X}_2$. This measure, denoted by $\mu_1 \otimes \mu_2$, or $\mu_1 \times \mu_2$, is called the product measure of $\mu_1$ by $\mu_2$.

Proof. Existence. Consider the set function $\mu : \mathcal{X} \to [0, \infty]$ defined by
\[ \mu(F) = \int_{X_1} \left( \int_{X_2} 1_F(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) . \]
It is a measure on $(X, \mathcal{X})$ (the monotone convergence theorem proves σ-additivity) that is obviously σ-finite and satisfies (2.12).

Uniqueness. Let $\mathcal{A}$ be the algebra consisting of the finite sums of disjoint measurable rectangles. Define (uniquely) the measure $\mu_0$ on $(X, \mathcal{A})$ by $\mu_0(A_1 \times A_2) = \mu_1(A_1) \mu_2(A_2)$. By Carathéodory's theorem (Theorem 2.1.50), there exists a unique extension of $\mu_0$ to $(X, \mathcal{X})$.

The above result extends in an obvious manner to a finite number of σ-finite measures.

Example 2.3.8: Lebesgue measure on $\mathbb{R}^n$. The typical example of a product measure is the Lebesgue measure on the space $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$: It is the unique measure $\ell^n$ on that space that is such that
\[ \ell^n\big(\textstyle\prod_{i=1}^n A_i\big) = \textstyle\prod_{i=1}^n \ell(A_i) \qquad \text{for all } A_1, \ldots, A_n \in \mathcal{B}(\mathbb{R}) . \]
Going back to the situation with two measure spaces (the case of a finite number of measure spaces is similar) we have the following result:

Theorem 2.3.9 Let $(X_1, \mathcal{X}_1, \mu_1)$ and $(X_2, \mathcal{X}_2, \mu_2)$ be two measure spaces in which $\mu_1$ and $\mu_2$ are σ-finite. Let $(X, \mathcal{X}, \mu) = (X_1 \times X_2, \mathcal{X}_1 \otimes \mathcal{X}_2, \mu_1 \otimes \mu_2)$.

(A) Tonelli. If $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is nonnegative, then, for $\mu_1$-almost all $x_1$, the function $x_2 \mapsto f(x_1, x_2)$ is measurable with respect to $\mathcal{X}_2$, and
\[ x_1 \mapsto \int_{X_2} f(x_1, x_2) \, \mu_2(dx_2) \]
is a measurable function with respect to $\mathcal{X}_1$. Furthermore,
\[ \int_X f \, d\mu = \int_{X_1} \left( \int_{X_2} f(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) = \int_{X_2} \left( \int_{X_1} f(x_1, x_2) \, \mu_1(dx_1) \right) \mu_2(dx_2) . \tag{2.14} \]
(B) Fubini. If $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is $\mu$-integrable, then, (a): for $\mu_1$-almost all $x_1$, the function $x_2 \mapsto f(x_1, x_2)$ is $\mu_2$-integrable, and (b): $x_1 \mapsto \int_{X_2} f(x_1, x_2) \, \mu_2(dx_2)$ is $\mu_1$-integrable, and (2.14) is true.

The global result is referred to as the Fubini–Tonelli theorem.

Remark 2.3.10 Part (A) (Tonelli) says that one can integrate a nonnegative $\mathcal{X}$-measurable function in any order of its variables. Part (B) (Fubini) says that the same is true of any $\mathcal{X}$-measurable function provided that function is $\mu$-integrable. In general, in order to apply Part (B) one must use Part (A) with $f = |f|$ to ascertain whether or not $\int_X |f| \, d\mu < \infty$.

Proof. (A) The σ-finite measures
\[ \nu(F) = \int_X 1_F \, d\mu \qquad \text{and} \qquad \mu(F) = \int_{X_1} \left( \int_{X_2} 1_F(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) \]
coincide on the algebra $\mathcal{A}$ consisting of the finite sums of disjoint generalized measurable rectangles. They are therefore identical, by Theorem 2.1.42. Therefore we have proved the theorem for $f$ of the form $1_F$, $F \in \mathcal{X}$. The general case of a nonnegative measurable function is obtained by the usual monotone convergence argument.

(B) Since $f$ is $\mu$-integrable, by Tonelli's theorem,
\[ \int_{X_1} \left( \int_{X_2} |f(x_1, x_2)| \, \mu_2(dx_2) \right) \mu_1(dx_1) < \infty \]
and in particular
\[ \int_{X_2} |f(x_1, x_2)| \, \mu_2(dx_2) < \infty , \quad \mu_1\text{-a.e.} \]
Therefore, outside a $\mu_1$-negligible set $N_1$,
\[ \int_{X_2} f^{\pm}(x_1, x_2) \, \mu_2(dx_2) < \infty . \]
We may suppose that the above inequalities are true everywhere because we may replace $f$, without changing its integral with respect to $\mu$, by a function $\mu$-almost everywhere equal to $f$. By Tonelli,
\[ \int_X f^{\pm} \, d\mu = \int_{X_1} \left( \int_{X_2} f^{\pm}(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) \]
and therefore
\[ \begin{aligned}
\int_X f \, d\mu &= \int_X f^+ \, d\mu - \int_X f^- \, d\mu \\
&= \int_{X_1} \left( \int_{X_2} f^+(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) - \int_{X_1} \left( \int_{X_2} f^-(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) \\
&= \int_{X_1} \left( \int_{X_2} f^+(x_1, x_2) \, \mu_2(dx_2) - \int_{X_2} f^-(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) \\
&= \int_{X_1} \left( \int_{X_2} f(x_1, x_2) \, \mu_2(dx_2) \right) \mu_1(dx_1) .
\end{aligned} \]
(All this fuss guarantees that at every step we do not encounter ∞ − ∞ forms.)
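The integrability requirement in Part (B) is not a technicality. A classical illustration, which is standard but not part of the text (the function below is our choice), takes $X_1 = X_2 = (0,1)$ with the Lebesgue measure $\ell$ on each factor:

```latex
\[
f(x,y) := \frac{x^2 - y^2}{(x^2 + y^2)^2}, \qquad
\frac{\partial}{\partial y}\left(\frac{y}{x^2+y^2}\right) = f(x,y), \qquad
\frac{\partial}{\partial x}\left(\frac{-x}{x^2+y^2}\right) = f(x,y).
\]
\[
\int_0^1 \left( \int_0^1 f(x,y)\, dy \right) dx = \int_0^1 \frac{dx}{1+x^2} = \frac{\pi}{4},
\qquad
\int_0^1 \left( \int_0^1 f(x,y)\, dx \right) dy = -\int_0^1 \frac{dy}{1+y^2} = -\frac{\pi}{4}.
\]
\[
\int_0^1 \left( \int_0^x f(x,y)\, dy \right) dx = \int_0^1 \frac{dx}{2x} = \infty
\quad\Longrightarrow\quad \int_{(0,1)^2} |f| \, d(\ell \otimes \ell) = \infty .
\]
```

The two iterated integrals exist but differ, which does not contradict Theorem 2.3.9: since $f > 0$ on $\{y < x\}$, the last line shows that $f$ is not integrable, so Part (B) does not apply, and $f$ is not nonnegative, so Part (A) does not apply either.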
Remark 2.3.11 Is the hypothesis of σ-finiteness superfluous? In fact it is not, as the following counterexample shows. Take $(X_i, \mathcal{X}_i) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ ($i = 1, 2$), let $\mu_1 = \ell$, the Lebesgue measure, and let $\mu_2 = \nu$, the measure that associates with a measurable set its cardinality (only finite sets have a finite measure, and therefore, since there is no sequence of finite sets increasing to $\mathbb{R}$, this measure is not σ-finite). Now, let $C = \{(x, x) \,;\ x \in \mathbb{R}\}$ (the diagonal of $\mathbb{R}^2$). Clearly $C_{x_1} = \{x_1\}$ and $C_{x_2} = \{x_2\}$, so that $\mu_2(C_{x_1}) = 1$ and therefore
\[ \int_{X_1} \mu_2(C_{x_1}) \, \mu_1(dx_1) = \int_{\mathbb{R}} 1 \, \ell(dx) = \infty . \]
On the other hand, $\mu_1(C_{x_2}) = 0$, and therefore,
\[ \int_{X_2} \mu_1(C_{x_2}) \, \mu_2(dx_2) = \int_{\mathbb{R}} 0 \, \nu(dx) = 0 . \]
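Integration by Parts Formula

Theorem 2.3.12 Let $\mu_1$ and $\mu_2$ be two σ-finite measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. For any interval $(a, b) \subseteq \mathbb{R}$,
\[ \mu_1((a, b]) \, \mu_2((a, b]) = \int_{(a,b]} \mu_1((a, t]) \, \mu_2(dt) + \int_{(a,b]} \mu_2((a, t)) \, \mu_1(dt) . \tag{2.15} \]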
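For purely atomic measures, both integrals in (2.15) reduce to finite sums, so the formula can be verified exactly. The sketch below is only an illustration: the atoms and masses of `mu1` and `mu2` are hypothetical, and the helper names are ours. The two measures share the atoms 2 and 4, which is exactly the situation where the choice between $(a, t]$ and $(a, t)$ matters.

```python
from fractions import Fraction

# Hypothetical purely atomic measures on R, stored as {atom: mass}.
mu1 = {1: Fraction(1, 2), 2: Fraction(3), 4: Fraction(1, 5)}
mu2 = {0: Fraction(7), 2: Fraction(1, 3), 3: Fraction(2), 4: Fraction(1)}

def mass(measure, lo, hi, closed_right):
    """measure((lo, hi]) if closed_right is True, else measure((lo, hi))."""
    return sum(m for t, m in measure.items()
               if lo < t and (t <= hi if closed_right else t < hi))

def both_sides(a, b):
    """Left-hand and right-hand sides of (2.15) for the interval (a, b]."""
    lhs = mass(mu1, a, b, True) * mass(mu2, a, b, True)
    # Integral over (a, b] of mu1((a, t]) against mu2: interval closed on the right.
    first = sum(mass(mu1, a, t, True) * m for t, m in mu2.items() if a < t <= b)
    # Integral over (a, b] of mu2((a, t)) against mu1: interval open on the right.
    second = sum(mass(mu2, a, t, False) * m for t, m in mu1.items() if a < t <= b)
    return lhs, first + second

lhs, rhs = both_sides(0, 4)
print(lhs, rhs)
```

Replacing `False` by `True` in the second sum would count the mass of the diagonal points $(2,2)$ and $(4,4)$ twice, and the two sides would no longer agree.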
Observe that the first integral features the interval $(a, t]$ (closed on the right), whereas in the second integral, the interval is of the type $(a, t)$ (open on the right).

Proof. The proof consists in computing the $\mu_1 \times \mu_2$ measure of the square $D := (a, b] \times (a, b]$ in two ways. The first one is obvious and gives the left-hand side of (2.15). The second one consists in observing that $\mu(D) = \mu(D_1) + \mu(D_2)$, where $D_1 = \{(x, y) \,;\ a < y \le b,\ a < x \le y\}$ and $D_2 = ((a, b] \times (a, b]) \cap \overline{D_1}$. Then $\mu(D_1)$ and $\mu(D_2)$ are computed using Tonelli's theorem. For instance,
\[ \mu(D_1) = \int_{\mathbb{R}} \left( \int_{\mathbb{R}} 1_{D_1}(x, y) \, \mu_1(dx) \right) \mu_2(dy) = \int_{\mathbb{R}} 1_{\{a < y \le b\}} \, \mu_1((a, y]) \, \mu_2(dy) = \int_{(a,b]} \mu_1((a, t]) \, \mu_2(dt) . \]

[...] $> 0$ (using property (b) of Theorem 2.2.12),
\[ f \,\mathcal{R}\, g \implies \int_X |f|^p \, d\mu = \int_X |g|^p \, d\mu . \]
The operations $+$, $\times$, $^*$ and multiplication by a scalar $\alpha \in \mathbb{C}$ are defined on the equivalence classes by
\[ \{f\} + \{g\} = \{f + g\} , \quad \{f\}\{g\} = \{fg\} , \quad \{f\}^* = \{f^*\} , \quad \alpha\{f\} = \{\alpha f\} . \]
The first equality means that $\{f\} + \{g\}$ is, by definition, the equivalence class consisting of the functions $f + g$, where $f$ and $g$ are members of $\{f\}$ and $\{g\}$, respectively. Similar interpretations hold for the other equalities.

By definition, for a given $p \ge 1$, $L^p_{\mathbb{C}}(\mu)$ is the collection of equivalence classes $\{f\}$ such that $\int_X |f|^p \, d\mu < \infty$. Clearly it is a vector space over $\mathbb{C}$ (for the proof recall that
\[ \left| \frac{f + g}{2} \right|^p \le \frac{1}{2} |f|^p + \frac{1}{2} |g|^p \]
since $t \mapsto t^p$ is a convex function when $p \ge 1$).

In order to avoid cumbersome notation, in this section and in general whenever we consider $L^p$ spaces, we shall write $f$ for $\{f\}$. This abuse of notation is harmless since two members of the same equivalence class have the same integral if that integral is defined. Therefore, using this loose notation, we may write
\[ L^p_{\mathbb{C}}(\mu) = \left\{ f \,:\ \int_X |f|^p \, d\mu < \infty \right\} . \tag{2.17} \]
When the measure is the counting measure on the set $\mathbb{Z}$ of relative integers, the traditional notation is $\ell^p_{\mathbb{C}}(\mathbb{Z})$. This is the space of complex sequences $\{x_n\}_{n \in \mathbb{Z}}$ such that
\[ \sum_{n \in \mathbb{Z}} |x_n|^p < \infty . \]
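The following is a simple and often used observation.

Theorem 2.3.16 Let $p$ and $q$ be positive real numbers such that $p > q$. If the measure $\mu$ on $(X, \mathcal{X})$ is finite, then $L^p_{\mathbb{C}}(\mu) \subseteq L^q_{\mathbb{C}}(\mu)$. In particular, $L^2_{\mathbb{C}}(\mu) \subseteq L^1_{\mathbb{C}}(\mu)$.

Proof. From the inequality $|a|^q \le 1 + |a|^p$, true for all $a \in \mathbb{C}$ (if $|a| \le 1$ the left-hand side is at most 1, and if $|a| > 1$ it is at most $|a|^p$), it follows that $\mu(|f|^q) \le \mu(1) + \mu(|f|^p)$. Since $\mu(1) = \mu(X) < \infty$, $\mu(|f|^q) < \infty$ whenever $\mu(|f|^p) < \infty$.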
Remark 2.3.17 This inclusion is not true in general if $\mu$ is not a finite measure. For instance, consider the Lebesgue measure $\ell$ on $\mathbb{R}$: there exist functions in $L^1_{\mathbb{C}}(\ell)$ that are not in $L^2_{\mathbb{C}}(\ell)$ and vice versa (Exercise 2.4.18).

In the case of the (not finite) counting measure on $\mathbb{Z}$, the order of inclusion is the reverse of the one concerning finite measures:

Theorem 2.3.18 ($\ell^p_{\mathbb{C}}$ inclusions.) If $p > q$, $\ell^q_{\mathbb{C}}(\mathbb{Z}) \subset \ell^p_{\mathbb{C}}(\mathbb{Z})$. In particular, $\ell^1_{\mathbb{C}}(\mathbb{Z}) \subset \ell^2_{\mathbb{C}}(\mathbb{Z})$.

Proof. Exercise 2.4.19.
Hölder's Inequality

Theorem 2.3.19 Let $p$ and $q$ be real numbers in $(1, \infty)$ such that
\[ \frac{1}{p} + \frac{1}{q} = 1 \]
($p$ and $q$ are then said to be conjugate) and let $f, g : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be nonnegative real functions. Then,
\[ \int_X f g \, d\mu \le \left( \int_X f^p \, d\mu \right)^{1/p} \left( \int_X g^q \, d\mu \right)^{1/q} . \tag{2.18} \]
In particular, if $f, g \in L^2_{\mathbb{C}}(\mu)$, then $f g \in L^1_{\mathbb{C}}(\mu)$.

Proof. Let
\[ A = \left( \int_X f^p \, d\mu \right)^{1/p} , \qquad B = \left( \int_X g^q \, d\mu \right)^{1/q} . \]
It may be assumed that $0 < A, B < \infty$, because otherwise Hölder's inequality is trivially satisfied. Let $F := f/A$, $G := g/B$, so that
\[ \int_X F^p \, d\mu = \int_X G^q \, d\mu = 1 . \]
Suppose that we have been able to prove that
\[ F(x) G(x) \le \frac{1}{p} F(x)^p + \frac{1}{q} G(x)^q . \tag{2.19} \]
Integrating this inequality yields
\[ \int_X (F G) \, d\mu \le \frac{1}{p} + \frac{1}{q} = 1 , \]
and this is just (2.18). Inequality (2.19) is trivially satisfied if $x$ is such that $F(x) = 0$ or $G(x) = 0$. It is also satisfied when $F(x) > 0$ and $G(x) > 0$. Indeed, letting
\[ s(x) := p \ln(F(x)) , \qquad t(x) := q \ln(G(x)) , \]
from the convexity of the exponential function and the assumption that $1/p + 1/q = 1$,
\[ e^{s(x)/p + t(x)/q} \le \frac{1}{p} e^{s(x)} + \frac{1}{q} e^{t(x)} , \]
and this is precisely inequality (2.19).

For the last assertion of the theorem, take $p = q = 2$, applied to $|f|$ and $|g|$.
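When the measure has finitely many atoms, (2.18) is an inequality between finite sums and can be explored numerically. The following is a minimal sketch under our own assumptions: the function name `holder_sides`, the weights and the values of `f` and `g` are hypothetical. It also shows the equality case $g^q$ proportional to $f^p$, which corresponds to equality in the convexity step of the proof.

```python
def holder_sides(f, g, weights, p):
    """Both sides of (2.18) for a measure with finitely many atoms.

    f, g: nonnegative values on the atoms; weights: mass of each atom;
    p > 1, with conjugate exponent q = p / (p - 1).
    """
    q = p / (p - 1)
    lhs = sum(w * a * b for a, b, w in zip(f, g, weights))
    norm_f = sum(w * a ** p for a, w in zip(f, weights)) ** (1 / p)
    norm_g = sum(w * b ** q for b, w in zip(g, weights)) ** (1 / q)
    return lhs, norm_f * norm_g

# Hypothetical data on a four-point space.
weights = [0.5, 1.0, 2.0, 0.25]
f = [1.0, 0.0, 3.0, 2.0]
g = [2.0, 5.0, 0.5, 1.0]

for p in (1.5, 2.0, 3.0):
    lhs, rhs = holder_sides(f, g, weights, p)
    print(p, lhs, rhs, lhs <= rhs)

# Equality case: g^q proportional to f^p, here g = f^(p - 1) since (p - 1) q = p.
p = 3.0
g_equal = [a ** (p - 1) for a in f]
print(holder_sides(f, g_equal, weights, p))
```

For $p = 2$ this is the Schwarz inequality for weighted finite sums.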
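Minkowski's Inequality

Theorem 2.3.20 Let $p \ge 1$ and let $f, g : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be nonnegative functions in $L^p_{\mathbb{C}}(\mu)$. Then,
\[ \left( \int_X (f + g)^p \, d\mu \right)^{1/p} \le \left( \int_X f^p \, d\mu \right)^{1/p} + \left( \int_X g^p \, d\mu \right)^{1/p} . \tag{2.20} \]

Proof. For $p = 1$ the inequality (in fact an equality) is obvious. Therefore, assume $p > 1$ and let $q$ be the conjugate exponent of $p$. From Hölder's inequality,
\[ \int_X f (f + g)^{p-1} \, d\mu \le \left( \int_X f^p \, d\mu \right)^{1/p} \left( \int_X (f + g)^{(p-1)q} \, d\mu \right)^{1/q} \]
and
\[ \int_X g (f + g)^{p-1} \, d\mu \le \left( \int_X g^p \, d\mu \right)^{1/p} \left( \int_X (f + g)^{(p-1)q} \, d\mu \right)^{1/q} . \]
Adding up the above two inequalities and observing that $(p - 1)q = p$, we obtain
\[ \int_X (f + g)^p \, d\mu \le \left[ \left( \int_X f^p \, d\mu \right)^{1/p} + \left( \int_X g^p \, d\mu \right)^{1/p} \right] \left( \int_X (f + g)^p \, d\mu \right)^{1/q} . \]
One may assume that the right-hand side of (2.20) is finite and that the left-hand side is positive (otherwise the inequality is trivial). Therefore $\int_X (f + g)^p \, d\mu \in (0, \infty)$ (finiteness holds because $L^p_{\mathbb{C}}(\mu)$ is a vector space) and we may divide both sides of the last display by $\left( \int_X (f + g)^p \, d\mu \right)^{1/q}$. Observing that $1 - 1/q = 1/p$ yields the announced inequality (2.20).

Theorem 2.3.21 Let $p \ge 1$. The mapping $\nu_p : L^p_{\mathbb{C}}(\mu) \to [0, \infty)$ defined by
\[ \nu_p(f) := \left( \int_X |f|^p \, d\mu \right)^{1/p} \tag{2.21} \]
is a norm on $L^p_{\mathbb{C}}(\mu)$.

Proof. Clearly, $\nu_p(\alpha f) = |\alpha| \nu_p(f)$ for all $\alpha \in \mathbb{C}$, $f \in L^p_{\mathbb{C}}(\mu)$. Also, $\nu_p(f) = 0$ if and only if $\left( \int_X |f|^p \, d\mu \right)^{1/p} = 0$, which in turn is equivalent to $f = 0$, $\mu$-a.e. Finally, $\nu_p(f + g) \le \nu_p(f) + \nu_p(g)$ for all $f, g \in L^p_{\mathbb{C}}(\mu)$, by Minkowski's inequality applied to $|f|$ and $|g|$.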
The Riesz–Fischer Theorem

Denoting $\nu_p(f)$ by $\|f\|_p$, $L^p_{\mathbb{C}}(\mu)$ is a normed vector space over $\mathbb{C}$, with the norm $\|\cdot\|_p$ and the induced metric $d_p(f, g) := \|f - g\|_p$.

Theorem 2.3.22 Let $p \ge 1$. The metric $d_p$ makes of $L^p_{\mathbb{C}}(\mu)$ a complete normed vector space. In other words, $L^p_{\mathbb{C}}(\mu)$ is a Banach space for the norm $\|\cdot\|_p$.

Proof. To show completeness one must prove that for any sequence $\{f_n\}_{n \ge 1}$ of $L^p_{\mathbb{C}}(\mu)$ that is a Cauchy sequence (that is, such that $\lim_{m,n \uparrow \infty} d_p(f_n, f_m) = 0$), there exists an $f \in L^p_{\mathbb{C}}(\mu)$ such that $\lim_{n \uparrow \infty} d_p(f_n, f) = 0$.
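Since $\{f_n\}_{n \ge 1}$ is a Cauchy sequence, one can select a subsequence $\{f_{n_i}\}_{i \ge 1}$ such that
\[ d_p(f_{n_{i+1}}, f_{n_i}) \le 2^{-i} . \tag{2.22} \]
Let
\[ g_k = \sum_{i=1}^{k} |f_{n_{i+1}} - f_{n_i}| , \qquad g = \sum_{i=1}^{\infty} |f_{n_{i+1}} - f_{n_i}| . \]
By (2.22) and Minkowski's inequality, $\|g_k\|_p \le 1$. Fatou's lemma applied to the sequence $\{g_k^p\}_{k \ge 1}$ gives $\|g\|_p \le 1$. In particular, any member of the equivalence class of $g$ is $\mu$-almost everywhere finite and therefore
\[ f_{n_1}(x) + \sum_{i=1}^{\infty} \big( f_{n_{i+1}}(x) - f_{n_i}(x) \big) \]
converges absolutely for $\mu$-almost all $x$. Call the corresponding limit $f(x)$ (set $f(x) = 0$ when this limit does not exist). Since
\[ f_{n_1} + \sum_{i=1}^{k-1} \big( f_{n_{i+1}} - f_{n_i} \big) = f_{n_k} \]
we see that
\[ f = \lim_{k \uparrow \infty} f_{n_k} \quad \mu\text{-a.e.} \]
One must show that $f$ is the limit in $L^p_{\mathbb{C}}(\mu)$ of $\{f_n\}_{n \ge 1}$. Let $\varepsilon > 0$. There exists an integer $N = N(\varepsilon)$ such that $\|f_n - f_m\|_p \le \varepsilon$ whenever $m, n \ge N$. For all $m > N$, by Fatou's lemma we have
\[ \int_X |f - f_m|^p \, d\mu \le \liminf_{i \to \infty} \int_X |f_{n_i} - f_m|^p \, d\mu \le \varepsilon^p . \]
Therefore $f - f_m \in L^p_{\mathbb{C}}(\mu)$ and consequently $f \in L^p_{\mathbb{C}}(\mu)$. It also follows from the last inequality that
\[ \lim_{m \to \infty} \|f - f_m\|_p = 0 . \]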
The next result is a byproduct of the proof of Theorem 2.3.22.

Theorem 2.3.23 Let $p \ge 1$ and let $\{f_n\}_{n \ge 1}$ be a convergent sequence in $L^p_{\mathbb{C}}(\mu)$. Let $f$ be the corresponding limit in $L^p_{\mathbb{C}}(\mu)$. Then, there exists a subsequence $\{f_{n_i}\}_{i \ge 1}$ such that
\[ \lim_{i \uparrow \infty} f_{n_i} = f \quad \mu\text{-a.e.} \tag{2.23} \]
Note that the statement in (2.23) is about functions and not about equivalence classes. The functions thereof are any members of the corresponding equivalence class. In particular, when a given sequence of functions converges $\mu$-a.e. to two functions, these two functions are necessarily equal $\mu$-a.e. Therefore,

Theorem 2.3.24 If $\{f_n\}_{n \ge 1}$ converges both to $f$ in $L^p_{\mathbb{C}}(\mu)$ and to $g$ $\mu$-a.e., then $f = g$ $\mu$-a.e.
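Passing to a subsequence in Theorem 2.3.23 is unavoidable in general. The following standard example (the "sliding indicator" sequence, not taken from the text; the indexing is our choice) on $([0,1], \mathcal{B}([0,1]), \ell)$ converges to 0 in every $L^p$ but at no point of $[0,1]$.

```latex
\[
f_n := 1_{[k 2^{-m},\, (k+1) 2^{-m}]} \qquad \text{for } n = 2^m + k, \quad m \ge 0, \quad 0 \le k < 2^m .
\]
\[
\| f_n \|_p = \left( \ell\big([k 2^{-m}, (k+1) 2^{-m}]\big) \right)^{1/p} = 2^{-m/p} \longrightarrow 0
\qquad (n \uparrow \infty) .
\]
\[
\limsup_{n \uparrow \infty} f_n(x) = 1, \qquad \liminf_{n \uparrow \infty} f_n(x) = 0 \qquad \text{for all } x \in [0,1],
\qquad \text{while} \qquad
f_{2^m} = 1_{[0, 2^{-m}]} \longrightarrow 0 \ \text{ on } (0,1] .
\]
```

For each $m \ge 2$, every $x \in [0,1]$ belongs to at least one and at most two of the $2^m$ dyadic intervals of level $m$, which gives the values of the upper and lower limits. The subsequence $\{f_{2^m}\}_{m \ge 0}$ converges to 0 outside the negligible set $\{0\}$, as Theorem 2.3.23 predicts.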
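Of special interest for applications is the space $L^2_{\mathbb{C}}(\mu)$ of complex measurable functions $f : X \to \mathbb{C}$ such that
\[ \int_X |f(x)|^2 \, \mu(dx) < \infty , \]
where two functions $f$ and $f'$ such that $f(x) = f'(x)$, $\mu$-a.e. are not distinguished. We have by the Riesz–Fischer theorem:

Theorem 2.3.25 $L^2_{\mathbb{C}}(\mu)$ is a vector space with scalar field $\mathbb{C}$, and when endowed with the inner product
\[ \langle f, g \rangle := \int_X f(x) g(x)^* \, \mu(dx) , \tag{2.24} \]
it is a Hilbert space.

The norm of a function $f \in L^2_{\mathbb{C}}(\mu)$ is
\[ \|f\| = \left( \int_X |f(x)|^2 \, \mu(dx) \right)^{1/2} \]
and the distance between two functions $f$ and $g$ in $L^2_{\mathbb{C}}(\mu)$ is
\[ d(f, g) = \left( \int_X |f(x) - g(x)|^2 \, \mu(dx) \right)^{1/2} . \]
The completeness property of $L^2_{\mathbb{C}}(\mu)$ reads in this case as follows. If $\{f_n\}_{n \ge 1}$ is a sequence of functions in $L^2_{\mathbb{C}}(\mu)$ such that
\[ \lim_{m,n \uparrow \infty} \int_X |f_n(x) - f_m(x)|^2 \, \mu(dx) = 0 , \]
then, there exists a function $f \in L^2_{\mathbb{C}}(\mu)$ such that
\[ \lim_{n \uparrow \infty} \int_X |f_n(x) - f(x)|^2 \, \mu(dx) = 0 . \]
In $L^2_{\mathbb{C}}(\mu)$, Schwarz's inequality reads as follows:
\[ \left| \int_X f(x) g(x)^* \, \mu(dx) \right| \le \left( \int_X |f(x)|^2 \, \mu(dx) \right)^{1/2} \left( \int_X |g(x)|^2 \, \mu(dx) \right)^{1/2} . \]

Example 2.3.26: Complex sequences. The set of complex sequences $a = \{a_n\}_{n \in \mathbb{Z}}$ such that
\[ \sum_{n \in \mathbb{Z}} |a_n|^2 < \infty \]
is, when endowed with the inner product
\[ \langle a, b \rangle = \sum_{n \in \mathbb{Z}} a_n b_n^* , \]
a Hilbert space, denoted by $\ell^2_{\mathbb{C}}(\mathbb{Z})$. This is indeed a particular case of a Hilbert space $L^2_{\mathbb{C}}(\mu)$, where $X = \mathbb{Z}$ and $\mu$ is the counting measure. In this example, Schwarz's inequality takes the form
\[ \left| \sum_{n \in \mathbb{Z}} a_n b_n^* \right| \le \left( \sum_{n \in \mathbb{Z}} |a_n|^2 \right)^{1/2} \times \left( \sum_{n \in \mathbb{Z}} |b_n|^2 \right)^{1/2} . \]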
2.3.4 The Radon–Nikodým Theorem
The Product of a Measure by a Function

Definition 2.3.27 Let $(X, \mathcal{X}, \mu)$ be a measure space and let $h : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be a nonnegative measurable function. Define the set function $\nu : \mathcal{X} \to [0, \infty]$ by
\[ \nu(C) = \int_C h(x) \, \mu(dx) . \]
Then $\nu$ is a measure on $(X, \mathcal{X})$ called the product of $\mu$ by the function $h$. This is denoted by $d\nu = h \, d\mu$.

That $\nu$ is a measure is easily checked. First of all, it is obvious that $\nu(\emptyset) = 0$. As for the σ-additivity property, write for any sequence of mutually disjoint measurable sets $\{A_n\}_{n \ge 1}$,
\[ \begin{aligned}
\nu(\cup_{n \ge 1} A_n) &= \int_{\cup_{n \ge 1} A_n} h \, d\mu = \int_X 1_{\cup_{n \ge 1} A_n} h \, d\mu \\
&= \int_X \Big( \sum_{n \ge 1} 1_{A_n} \Big) h \, d\mu = \int_X \lim_{k \uparrow \infty} \Big( \sum_{n=1}^{k} 1_{A_n} \Big) h \, d\mu \\
&= \lim_{k \uparrow \infty} \int_X \Big( \sum_{n=1}^{k} 1_{A_n} \Big) h \, d\mu = \lim_{k \uparrow \infty} \sum_{n=1}^{k} \int_X 1_{A_n} h \, d\mu \\
&= \lim_{k \uparrow \infty} \sum_{n=1}^{k} \nu(A_n) = \sum_{n \ge 1} \nu(A_n) ,
\end{aligned} \]
where the fifth equality is by monotone convergence.

Theorem 2.3.28 Let $\mu$, $h$ and $\nu$ be as in Definition 2.3.27.
(i) For nonnegative $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$,
\[ \int_X f(x) \, \nu(dx) = \int_X f(x) h(x) \, \mu(dx) . \tag{2.25} \]
(ii) If $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ has arbitrary sign, then either one of the following conditions
(a) $f$ is $\nu$-integrable,
(b) $f h$ is $\mu$-integrable,
implies the other, and the equality (2.25) then holds.

Proof. Verify (2.25) for elementary nonnegative functions and, approximating $f$ by a nondecreasing sequence of such functions, use the monotone convergence theorem as in the proof of (2.11). For the case of functions of arbitrary sign, apply (2.25) with $f = f^+$ and $f = f^-$.

Observe that in the situation of Theorem 2.3.28, for all $C \in \mathcal{X}$,
\[ \mu(C) = 0 \implies \nu(C) = 0 . \tag{2.26} \]
Definition 2.3.29 Let $\mu$ and $\nu$ be two measures on $(X, \mathcal{X})$.
(A) If (2.26) holds for all $C \in \mathcal{X}$, $\nu$ is said to be absolutely continuous with respect to $\mu$. This is denoted by $\nu \ll \mu$.
(B) The measures $\mu$ and $\nu$ on $(X, \mathcal{X})$ are said to be mutually singular if there exists a set $A \in \mathcal{X}$ such that $\nu(A) = \mu(\overline{A}) = 0$. This is denoted by $\mu \perp \nu$.

Lebesgue's decomposition

Theorem 2.3.30 Let $\mu$ and $\nu$ be two σ-finite measures on $(X, \mathcal{X})$. There exists a unique decomposition (called the Lebesgue decomposition)
\[ \nu = \nu_a + \nu_s \]
such that
\[ \nu_a \ll \mu \quad \text{and} \quad \nu_s \perp \mu , \]
and a nonnegative measurable function $g : X \to \mathbb{R}$ such that
\[ d\nu_a = g \, d\mu , \]
this function being $\mu$-essentially unique (that is, if $g'$ is such that $d\nu_a = g' \, d\mu$, then $g(x) = g'(x)$ $\mu$-a.e.).

Proof. STEP 1. We first assume that $\mu$ and $\nu$ are finite and that $\nu \le \mu$, that is, $\nu(A) \le \mu(A)$ for all $A \in \mathcal{X}$. Define a mapping $\varphi : L^2_{\mathbb{R}}(\mu) \to \mathbb{R}$ by
\[ \varphi(f) = \int_X f \, d\nu . \]
The latter integral is well defined since the hypothesis of finiteness of $\mu$ implies that $L^2_{\mathbb{R}}(\mu) \subseteq L^1_{\mathbb{R}}(\mu)$, and hypothesis $\nu \le \mu$ implies that $\int_X |f| \, d\nu \le \int_X |f| \, d\mu$. Also $\varphi(f)$ does not depend on the function chosen in the equivalence class of $L^2_{\mathbb{R}}(\mu)$. In fact, letting $f'$ be another such function, $f = f'$ $\mu$-a.e. implies that $f = f'$ $\nu$-a.e. and then $\int_X f \, d\nu = \int_X f' \, d\nu$. By Schwarz's inequality,
\[ |\varphi(f)| \le \left( \int_X f^2 \, d\nu \right)^{1/2} \nu(X)^{1/2} \le \left( \int_X f^2 \, d\mu \right)^{1/2} \nu(X)^{1/2} = \nu(X)^{1/2} \|f\|_{L^2_{\mathbb{R}}(\mu)} . \]
Therefore, the (linear) functional $\varphi$ from the Hilbert space $L^2_{\mathbb{R}}(\mu)$ to $\mathbb{R}$ is continuous. By Riesz's theorem on the representation of linear functionals on $L^2_{\mathbb{R}}(\mu)$ (Theorem C.5.2), there exists a $g \in L^2_{\mathbb{R}}(\mu)$ such that
\[ \varphi(f) = \langle f, g \rangle_{L^2_{\mathbb{R}}(\mu)} := \int_X f g \, d\mu , \]
that is,
\[ \int_X f \, d\nu = \int_X f g \, d\mu . \]
In particular, $\nu(A) = \int_A g \, d\mu$ for all $A \in \mathcal{X}$. With $A = \{g \ge 1 + \varepsilon\}$ where $\varepsilon > 0$,
\[ \mu(\{g \ge 1 + \varepsilon\}) \ge \nu(\{g \ge 1 + \varepsilon\}) = \int_{\{g \ge 1 + \varepsilon\}} g \, d\mu \ge (1 + \varepsilon) \mu(\{g \ge 1 + \varepsilon\}) , \]
and therefore $\mu(\{g \ge 1 + \varepsilon\}) = 0$. Since $\varepsilon$ is an arbitrary positive number, this implies that $g \le 1$ $\mu$-a.e. A similar argument shows that $g \ge 0$ $\mu$-a.e. We may in fact suppose that $0 \le g(x) \le 1$ for all $x \in X$ by replacing if necessary $g$ by $g 1_{\{0 \le g \le 1\}}$.

STEP 2. We still assume that $\mu$ and $\nu$ are finite, but not that $\nu \le \mu$. However, since $\nu \le \mu + \nu$, we may apply the above results to $\nu$ and $\mu + \nu$, to obtain the existence of a measurable function $g$ such that $0 \le g \le 1$ and such that for all $f \in L^2_{\mathbb{R}}(\mu + \nu)$,
\[ \int_X f \, d\nu = \int_X f g \, d(\mu + \nu) . \]
In particular, for any bounded measurable function $f : X \to \mathbb{R}$,
\[ \int_X f (1 - g) \, d\nu = \int_X f g \, d\mu . \tag{$\star$} \]
By monotone convergence, this equality extends to all nonnegative measurable functions $f$.

STEP 3. Replacing $f$ by $1_N f$ in ($\star$), where $N := \{g = 1\}$, we have that $\int_N f \, d\mu = 0$ for all nonnegative measurable $f$, and therefore $\mu(N) = 0$. The measures $\nu_s := 1_N \nu$ and $\mu$ are therefore mutually singular.

STEP 4. Replacing in ($\star$) $f$ by $\frac{f}{1 - g} 1_{\overline{N}}$ gives that for all nonnegative measurable functions $f$,
\[ \int_{\overline{N}} f \, d\nu = \int_X f h \, d\mu , \]
where
\[ h := 1_{\overline{N}} \, \frac{g}{1 - g} . \]
It remains to define $\nu_a$ by $d\nu_a := 1_{\overline{N}} \, d\nu = h \, d\mu$ to conclude the existence part of the theorem, under the assumption that $\mu$ and $\nu$ are finite.

STEP 5. To prove the uniqueness of the pair $(\nu_a, \nu_s)$, consider another such pair $(\tilde{\nu}_a, \tilde{\nu}_s)$. For all $A \in \mathcal{X}$,
\[ \nu_a(A) - \tilde{\nu}_a(A) = -\nu_s(A) + \tilde{\nu}_s(A) . \tag{$\dagger$} \]
Since the supports $N$ and $\tilde{N}$ of $\nu_s$ and $\tilde{\nu}_s$ respectively have a null $\mu$-measure, and since $\nu_a \ll \mu$ and $\tilde{\nu}_a \ll \mu$,
\[ \begin{aligned}
\nu_s(A) - \tilde{\nu}_s(A) &= \nu_s(A \cap (N \cup \tilde{N})) - \tilde{\nu}_s(A \cap (N \cup \tilde{N})) \\
&= -\nu_a(A \cap (N \cup \tilde{N})) + \tilde{\nu}_a(A \cap (N \cup \tilde{N})) = 0 .
\end{aligned} \]
Therefore $\nu_s \equiv \tilde{\nu}_s$, and consequently, from ($\dagger$), $\nu_a \equiv \tilde{\nu}_a$.
STEP 6. To prove the uniqueness of $h$, just observe that if another measurable nonnegative function $\tilde{h}$ satisfies
\[ \nu_a(A) = \int_A h \, d\mu = \int_A \tilde{h} \, d\mu \]
for all $A \in \mathcal{X}$, then necessarily $\tilde{h} = h$ $\mu$-a.e.

STEP 7. We get rid of the finiteness hypothesis for $\mu$ and $\nu$, only assuming that these measures are σ-finite. Therefore, there exists a measurable partition $\{K_n\}_{n \ge 1}$ of $X$ such that $\mu_n := 1_{K_n} \mu$ and $\nu_n := 1_{K_n} \nu$ are finite measures. Applying the above results to $\mu_n$ and $\nu_n$, and calling $\nu_{a,n}$, $\nu_{s,n}$ and $h_n$ the corresponding items of the decomposition, define
\[ \nu_a = \sum_{n \ge 1} \nu_{a,n} , \qquad \nu_s = \sum_{n \ge 1} \nu_{s,n} , \qquad h = \sum_{n \ge 1} h_n . \]
The verification that $\nu_a$, $\nu_s$ and $h$ satisfy the requirement of the theorem is straightforward.
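On a finite set every measure is a vector of point masses, and Steps 2 to 4 of the proof can be followed literally. The sketch below is an illustration under our own assumptions (the four-point space and the masses of `mu` and `nu` are hypothetical): it computes the density $g$ of $\nu$ with respect to $\mu + \nu$, the set $N = \{g = 1\}$, and the density $h = 1_{\overline{N}}\, g/(1-g)$.

```python
from fractions import Fraction

# Hypothetical finite measures on X = {"a", "b", "c", "d"}, given by point masses.
mu = {"a": Fraction(2), "b": Fraction(1, 2), "c": Fraction(0), "d": Fraction(3)}
nu = {"a": Fraction(1), "b": Fraction(0), "c": Fraction(5), "d": Fraction(6)}

# Steps 1-2: g is the density of nu with respect to mu + nu
# (its value is irrelevant at points where mu + nu vanishes; we put 0 there).
g = {x: nu[x] / (mu[x] + nu[x]) if mu[x] + nu[x] > 0 else Fraction(0) for x in mu}

# Step 3: N = {g = 1} carries the singular part, and mu(N) = 0.
N = {x for x in mu if g[x] == 1}

# Step 4: h = 1_{complement of N} * g / (1 - g) is the density of nu_a w.r.t. mu.
h = {x: Fraction(0) if x in N else g[x] / (1 - g[x]) for x in mu}

nu_s = {x: nu[x] if x in N else Fraction(0) for x in mu}
nu_a = {x: h[x] * mu[x] for x in mu}
print(N, h, nu_a, nu_s)
```

Here $\nu_s$ is concentrated on the single point where $\mu$ vanishes and $\nu$ does not, and on the complement of $N$ the density $h$ is simply the ratio of the point masses of $\nu$ and $\mu$.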
The Radon–Nikodým Derivative

Corollary 2.3.31 Let $\mu$ and $\nu$ be two σ-finite measures on $(X, \mathcal{X})$ such that $\nu \ll \mu$. Then there exists a nonnegative function $h : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that
\[ \nu(dx) = h(x) \, \mu(dx) . \]

Proof. From the uniqueness of the Lebesgue decomposition of Theorem 2.3.30 and the hypothesis $\nu \ll \mu$, it follows that $\nu_a = \nu$ and therefore $\nu_s \equiv 0$.

The function $h$ is called the Radon–Nikodým derivative of $\nu$ with respect to $\mu$ and is denoted $d\nu/d\mu$. With such a notation, we have that
\[ \int_X f(x) \, \nu(dx) = \int_X f(x) \, \frac{d\nu}{d\mu}(x) \, \mu(dx) \]
for all nonnegative $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$.

Complementary reading

For the omitted proofs of existence and uniqueness of measures, see for instance [Royden, 1988].
2.4 Exercises
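Exercise 2.4.1. The σ-field generated by a collection of sets
(1) Let $\{\mathcal{F}_i\}_{i \in I}$ be a nonempty family of σ-fields on some set $\Omega$ (the nonempty index set $I$ is arbitrary). Show that the family $\mathcal{F} = \cap_{i \in I} \mathcal{F}_i$ is a σ-field ($A \in \mathcal{F}$ if and only if $A \in \mathcal{F}_i$ for all $i \in I$).
(2) Let $\mathcal{C}$ be a family of subsets of some set $\Omega$. Show the existence of a smallest σ-field $\mathcal{F}$ containing $\mathcal{C}$. (This means, by definition, that $\mathcal{F}$ is a σ-field on $\Omega$ containing $\mathcal{C}$, such that if $\mathcal{F}'$ is a σ-field on $\Omega$ containing $\mathcal{C}$, then $\mathcal{F} \subseteq \mathcal{F}'$.)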
CHAPTER 2. INTEGRATION
Exercise 2.4.2. Simple functions
(1) Show that a Borel function $f : (X, \mathcal{X}) \to (\mathbb{R}, \mathcal{B})$ taking a finite number of values is a simple function. (2) Show that a function measurable with respect to the gross σ-field is a constant.

Exercise 2.4.3. $\mathcal{B}(\overline{\mathbb{R}})$
Recall that $\mathcal{B}(\overline{\mathbb{R}})$ is the σ-field on $\overline{\mathbb{R}}$ generated by the intervals of type (−∞, a] (a ∈ R). Describe $\mathcal{B}(\overline{\mathbb{R}})$ in terms of $\mathcal{B}(\mathbb{R})$.

Exercise 2.4.4. $\sigma(f^{-1}(\mathcal{C})) = f^{-1}(\sigma(\mathcal{C}))$
Let X and E be sets, f : X → E a function from X to E and $\mathcal{C}$ a collection of subsets of E. Prove that $\sigma(f^{-1}(\mathcal{C})) = f^{-1}(\sigma(\mathcal{C}))$.

Exercise 2.4.5. The smallest σ-field guaranteeing measurability
Let f : X → E be a function. Let $\mathcal{E}$ be a given σ-field on E. What is the smallest σ-field $\mathcal{X}$ on X such that f is measurable with respect to $\mathcal{X}$ and $\mathcal{E}$?

Exercise 2.4.6. The modulus of a function and measurability
Let f : X → E be a function. Is it true that if |f| is measurable with respect to $\mathcal{X}$ and $\mathcal{E}$, then so is f itself?

Exercise 2.4.7. Measurability with respect to the gross σ-field
Prove that a function f : E → R measurable with respect to the gross σ-field on E and the Borel σ-field on R is a constant.

Exercise 2.4.8. Decreasing sequences of measurable sets
Let $(X, \mathcal{X})$ be a measurable space and $\{B_n\}_{n\ge 1}$ a nonincreasing sequence of $\mathcal{X}$ such that $\mu(B_{n_0}) < \infty$ for some $n_0 \in \mathbb{N}_+$. Show that
$$\mu\Big(\bigcap_{n=1}^{\infty} B_n\Big) = \lim_{n\uparrow\infty}\downarrow \mu(B_n).$$
Give a counterexample for the necessity of condition $\mu(B_{n_0}) < \infty$ for some $n_0$.

Exercise 2.4.9. Almost-everywhere equal continuous functions
Prove that if two continuous functions f, g : R → R are a.e. equal, they are everywhere equal.

Exercise 2.4.10. $\frac{\sin x}{x}$
Let $f(x) := \frac{\sin x}{x}$. Prove that the limit $\lim_{t\uparrow\infty}\int_0^t \frac{\sin x}{x}\,dx$ exists. Is f integrable on $\mathbb{R}_+$ with respect to the Lebesgue measure?
Exercise 2.4.11. From integral to series
Prove that for all a, b ∈ R,
$$\int_{\mathbb{R}_+} \frac{t\,e^{-at}}{1-e^{-bt}}\,dt = \sum_{n=0}^{+\infty}\frac{1}{(a+nb)^2}.$$
Exercise 2.4.12. Fun Fubini
A bounded rectangle of $\mathbb{R}^2$ is said to have Property (A) if at least one of its sides “is an integer” (meaning: its length is an integer). Let Δ be a finite rectangle that is the union of a finite number of disjoint rectangles with Property (A). Show that Δ itself must have Property (A). Hint: $\int_I e^{2i\pi x}\,dx$ . . .

Exercise 2.4.13. Fourier transform
Let f : (R, B(R)) → (R, B(R)) be integrable with respect to the Lebesgue measure. Show that for any ν ∈ R,
$$\hat f(\nu) = \int_{\mathbb{R}} f(t)\,e^{-2i\pi\nu t}\,dt$$
is well defined and that the function $\hat f$ is continuous and bounded. ($\hat f$ is called the Fourier transform of f.)

Exercise 2.4.14. Convolution
Let f, g : (R, B(R)) → (R, B(R)) be integrable with respect to the Lebesgue measure and let $\hat f$, $\hat g$ be their respective Fourier transforms (see Exercise 2.4.13). (1) Show that
$$\int_{\mathbb{R}}\int_{\mathbb{R}} |f(t-s)\,g(s)|\,dt\,ds < \infty.$$
(2) Deduce from this that for almost all t ∈ R, the function s → f(t − s)g(s) is integrable, and therefore that the convolution f ∗ g, where
$$(f*g)(t) = \int_{\mathbb{R}} f(t-s)\,g(s)\,ds,$$
is almost everywhere well defined. (3) For all t such that the last integral is not defined, set (f ∗ g)(t) = 0. Show that f ∗ g is integrable and that its Fourier transform is $\widehat{f*g} = \hat f\,\hat g$.

Exercise 2.4.15. A Fubini counterexample
Let $(X_i, \mathcal{X}_i, \mu_i)$ (i = 1, 2) be two versions of the measure space $(X, \mathcal{X}, \mu)$, where X = {1, 2, . . .}, $\mathcal{X} = \mathcal{P}(X)$ and μ is the counting measure. Consider the function $f : (X_1 \times X_2) \to \mathbb{Z}$ whose non-null values are f(m, m) = +1 and f(m + 1, m) = −1 (m ≥ 1). Show that
$$\sum_{m}\sum_{n} f(m,n) = 1 \quad\text{and}\quad \sum_{n}\sum_{m} f(m,n) = 0.$$
Why don’t we obtain the same values for both sums?

Exercise 2.4.16. Another Fubini counterexample
Define $f : [0,1]^2 \to \mathbb{R}$ by
$$f(x,y) = \frac{x^2-y^2}{(x^2+y^2)^2}\,1_{\{(x,y)\neq(0,0)\}}.$$
Compute $\int_{[0,1]}\big(\int_{[0,1]} f(x,y)\,dx\big)\,dy$ and $\int_{[0,1]}\big(\int_{[0,1]} f(x,y)\,dy\big)\,dx$. Is f Lebesgue integrable on $[0,1]^2$?
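Exercise 2.4.17. Convolution of measures
The convolution product of two finite measures $\mu_1$ and $\mu_2$ on $\mathbb{R}^d$ is the measure ν on $\mathbb{R}^d$ that is the image of the product measure $\mu := \mu_1 \times \mu_2$ on $\mathbb{R}^d \times \mathbb{R}^d$ under the mapping $(x_1, x_2) \to x_1 + x_2$. This measure will be denoted by $\mu_1 * \mu_2$. (i) Show that for any nonnegative measurable function $f : \mathbb{R}^d \to \mathbb{R}$,
$$\int_{\mathbb{R}^d} f(x)\,\nu(dx) = \int_{\mathbb{R}^d}\int_{\mathbb{R}^d} f(x_1+x_2)\,\mu_1(dx_1)\,\mu_2(dx_2).$$
(ii) Let μ be a finite measure on $\mathbb{R}^d$ and let $\varepsilon_a$ be the Dirac measure (on $\mathbb{R}^d$) at point $a \in \mathbb{R}^d$. What is the convolution product $\mu * \varepsilon_a$?

Exercise 2.4.18. $L^1_{\mathbb{C}}(\ell)$ and $L^2_{\mathbb{C}}(\ell)$
Show that there exist functions in $L^1_{\mathbb{C}}(\ell)$ that are not in $L^2_{\mathbb{C}}(\ell)$ and vice versa.

Exercise 2.4.19. $\ell^p_{\mathbb{C}}(\mathbb{Z})$
Show that if p > q, then $\ell^q_{\mathbb{C}}(\mathbb{Z}) \subset \ell^p_{\mathbb{C}}(\mathbb{Z})$.

Exercise 2.4.20. The Lebesgue decomposition
Let μ and ν be measures on the measurable space $(X, \mathcal{X})$. Describe the Lebesgue decomposition in the following cases: A. $(X, \mathcal{X}) = (\mathbb{Z}, \mathcal{P}(\mathbb{Z}))$. B. $(X, \mathcal{X}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, μ(dx) = f(x) dx and ν(dx) = g(x) dx.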
Chapter 3 Probability and Expectation

Although from a formal point of view a probability is just a measure with total mass equal to one, and expectation is nothing more than an integral with respect to this measure,¹ probability theory has two ingredients that make the difference: the notion of independence and that of conditional expectation. Probability theory has a specific terminology adapted to its goals, and therefore we begin with the “translation” of the theory of measure and integration into the theory of probability and expectation.

3.1 From Integral to Expectation

3.1.1 Translation
Recall that abstract (or axiomatic) probability theory features a “sample” space Ω and a collection $\mathcal{F}$ of its subsets that forms a σ-field, the σ-field of events. An element $A \in \mathcal{F}$ is called an event. A probability P on $(\Omega, \mathcal{F})$ is a measure on this measurable space with total mass 1. The results obtained in the previous chapter will now be recast in this specific framework. The probabilistic version of Theorem 2.1.42 is given below for future reference.

Theorem 3.1.1 Let $P_1$ and $P_2$ be two probability measures on $(\Omega, \mathcal{F})$ and let $\mathcal{S}$ be a π-system of measurable sets generating $\mathcal{F}$. If $P_1$ and $P_2$ agree on $\mathcal{S}$, they are identical.

Let $(E, \mathcal{E})$ be some measurable space. A measurable mapping (or function) $X : (\Omega, \mathcal{F}) \to (E, \mathcal{E})$ is called a random element with values in E. If $E = \mathbb{R}$ and $\mathcal{E} = \mathcal{B}(\mathbb{R})$, it is called a random variable. If $E = \mathbb{R}^m$ and $\mathcal{E} = \mathcal{B}(\mathbb{R}^m)$, it is called a random vector. In view of Theorem 2.1.18, for a mapping $X = (X_1, \ldots, X_m)$ to be a random vector in $\mathbb{R}^m$, it suffices that $\{X_i \le a\} \in \mathcal{F}$ (1 ≤ i ≤ m, a ∈ R). From Corollary 2.1.17, we have that if X is a random element with values in the measurable space $(E, \mathcal{E})$ and if g is a measurable function from $(E, \mathcal{E})$ to another measurable space $(G, \mathcal{G})$, then g(X) is a random element with values in the measurable space $(G, \mathcal{G})$.

¹ But as the wise man said: “He who does not know measure from probability does not know sake from rice.”
© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/9783030401832_3
Corollary 2.1.20 and Theorem 2.1.21 tell us that all ordinary operations on random variables (addition, multiplication, and quotient, if well defined) and the limit operations (limsup, liminf, and lim, if well defined) preserve the status of random variable. Since a random variable X is a measurable function, we can define, under certain circumstances, its integral with respect to the probability measure P, called the expectation of X. Therefore
$$E[X] = \int_\Omega X(\omega)\,P(d\omega).$$
The main steps in the definition of the integral (here the expectation) are summarized below in the specific notation of probability theory. If $A \in \mathcal{F}$, $E[1_A] = P(A)$, and more generally, if X is a simple random variable, that is, $X(\omega) = \sum_{i=1}^N \alpha_i 1_{A_i}(\omega)$, where $\alpha_i \in \mathbb{R}$, $A_i \in \mathcal{F}$ and N < ∞, then
$$E[X] = \sum_{i=1}^N \alpha_i P(A_i).$$
For a nonnegative random variable X, the expectation is always defined by
$$E[X] = \lim_{n\uparrow\infty} E[X_n],$$
where $\{X_n\}_{n\ge 1}$ is a nondecreasing sequence of nonnegative simple random variables that converges to X. This definition is consistent, that is, it does not depend on the approximating sequence of nonnegative simple random variables as long as it is nondecreasing and has X for limit. In particular, with the following special choice of the approximating sequence:
$$X_n = \sum_{k=0}^{n2^n-1} \frac{k}{2^n}\,1_{A_{k,n}} + n\,1_{\{X\ge n\}},$$
where $A_{k,n} := \{k \times 2^{-n} \le X < (k+1) \times 2^{-n}\}$, we have for any nonnegative random variable X, the “horizontal slice formula”:
$$E[X] = \lim_{n\uparrow\infty}\left(\sum_{k=0}^{n2^n-1} \frac{k}{2^n}\,P(A_{k,n}) + n\,P(X\ge n)\right).$$
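The horizontal slice formula can be evaluated numerically. The following is a minimal sketch, assuming an illustrative choice that is not in the text: X is exponential with rate 1, so that P(X ≥ x) = e^{−x} and E[X] = 1. The helper names `slice_expectation` and `exp_survival` are hypothetical.

```python
import math

def slice_expectation(n, survival):
    """E[X_n] = sum_k (k/2^n) P(A_{k,n}) + n P(X >= n) for a nonnegative X.

    `survival(x)` must return P(X >= x); for a continuous X,
    P(A_{k,n}) = survival(k/2^n) - survival((k+1)/2^n).
    """
    step = 2.0 ** (-n)
    total = 0.0
    for k in range(n * 2 ** n):
        total += k * step * (survival(k * step) - survival((k + 1) * step))
    return total + n * survival(n)

# Illustrative choice (not from the text): X exponential with rate 1, E[X] = 1.
def exp_survival(x):
    return math.exp(-x)

approximations = [slice_expectation(n, exp_survival) for n in range(1, 9)]
for n, value in enumerate(approximations, start=1):
    print(n, round(value, 6))
```

The printed values increase with n and approach E[X] = 1 from below, as expected from a nondecreasing approximating sequence.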
If X is of arbitrary sign, the expectation is defined by $E[X] = E[X^+] - E[X^-]$ if $E[X^+]$ and $E[X^-]$ are not both infinite. If $E[X^+]$ and $E[X^-]$ are both infinite, the expectation is not defined. If $E[|X|] < \infty$, X is said to be integrable and E[X] is then a finite number. The basic properties of the expectation are linearity and monotonicity: If $X_1$ and $X_2$ are integrable (resp. nonnegative) random variables, then (linearity): for all $\lambda_1, \lambda_2 \in \mathbb{R}$ (resp. $\in \mathbb{R}_+$),
$$E[\lambda_1 X_1 + \lambda_2 X_2] = \lambda_1 E[X_1] + \lambda_2 E[X_2], \tag{3.1}$$
and (monotonicity):
$$X_1 \le X_2 \implies E[X_1] \le E[X_2]. \tag{3.2}$$
It follows from monotonicity that if E[X] is well defined,
$$|E[X]| \le E[|X|]. \tag{3.3}$$
Mean and Variance

The definitions of the mean $m_X$ and the variance $\sigma_X^2$ of a real-valued random variable X are given, when the corresponding expectations are meaningful, by
$$m_X := E[X], \qquad \sigma_X^2 := E[(X - m_X)^2] = E[X^2] - m_X^2.$$
Markov’s Inequality

This inequality was given and proved in the specific framework of discrete random variables (Theorem 1.5.4).

Theorem 3.1.2 Let Z be a nonnegative real random variable, and let a > 0. We then have
$$P(Z \ge a) \le \frac{E[Z]}{a}.$$
Proof. Reproduce verbatim the proof in the special case of discrete variables (Theorem 1.5.4).
Specializing the Markov inequality of Theorem 3.1.2 to $Z = (X - m_X)^2$ and $a = \varepsilon^2 > 0$, we obtain as in the first chapter Chebyshev’s inequality: For all ε > 0,
$$P(|X - m_X| \ge \varepsilon) \le \frac{\sigma_X^2}{\varepsilon^2}.$$
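As a quick sanity check of Chebyshev’s inequality, here is a small sketch under an assumed example not taken from the text: X uniform on [0, 1], for which $m_X = 1/2$, $\sigma_X^2 = 1/12$, and the tail probability has the closed form max(0, 1 − 2ε). The function names are hypothetical.

```python
def uniform_tail(eps):
    """Exact P(|X - 1/2| >= eps) for X uniform on [0, 1]."""
    return max(0.0, 1.0 - 2.0 * eps)

def chebyshev_bound(variance, eps):
    """Right-hand side sigma_X^2 / eps^2 of Chebyshev's inequality."""
    return variance / eps ** 2

variance = 1.0 / 12.0  # variance of U([0, 1])
for eps in (0.1, 0.25, 0.4, 0.5):
    print(eps, uniform_tail(eps), chebyshev_bound(variance, eps))
```

The bound holds for every ε but is far from tight here, which is typical: it uses only the variance of X.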
Jensen’s Inequality

Jensen’s inequality is also a simple consequence of the monotonicity of expectation and of the expectation formula for indicator functions.

Theorem 3.1.3 Let I be a general interval of R (closed, open, semi-closed, infinite, etc.) and let (a, b) be its interior, assumed nonempty. Let ϕ : I → R be a convex function. Let X be an integrable real-valued random variable such that P(X ∈ I) = 1. Assume moreover that either ϕ is nonnegative, or that ϕ(X) is integrable. Then
$$E[\varphi(X)] \ge \varphi(E[X]). \tag{3.4}$$
Proof. Reproduce verbatim the proof in the special case of discrete random variables (Theorem 3.1.3).
3.1.2 Probability Distributions
Deﬁnition 3.1.4 The distribution of a random element X with values in (E, E) is, by deﬁnition, the probability measure QX on (E, E), the image of the probability measure P by the mapping X from (Ω, F) to (E, E) (that is, for all C ∈ E, QX (C) = P (X ∈ C)). The next result is a rephrasing of Theorem 2.3.2 in the context of probability.
Theorem 3.1.5 If g is a measurable function from $(E, \mathcal{E})$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ that is nonnegative, then
$$E[g(X)] = \int_E g(x)\,Q_X(dx). \tag{3.5}$$
If g is of arbitrary sign, and if one of the following two conditions is satisfied: (a) g(X) is P-integrable, or (b) g is $Q_X$-integrable, then the other one is also satisfied and equality (3.5) holds true.

Definition 3.1.6 If X is a random vector ($(E, \mathcal{E}) = (\mathbb{R}^m, \mathcal{B}(\mathbb{R}^m))$) whose probability distribution $Q_X$ is the product of a measurable function $f_X$ by the Lebesgue measure $\ell^m$, one calls $f_X$ the probability density function (pdf) of X.

Remark 3.1.7 The pdf is unique, in the sense that any other pdf $f'_X$ is such that $f_X(x) = f'_X(x)$ Lebesgue-almost everywhere. See Exercise 3.4.1.
Remark 3.1.8 The following is an “obvious” result (a proof is however required in Exercise 3.4.6): P (fX (X) = 0) = 0 .
Example 3.1.9: The case of a real random variable. In the particular case where $(E, \mathcal{E}) = (\mathbb{R}, \mathcal{B}(\mathbb{R}))$, taking C = (−∞, x], we have $Q_X((-\infty, x]) = P(X \le x) = F_X(x)$, where $F_X$ is the cumulative distribution function (cdf) of X, and therefore
$$E[g(X)] = \int_{\mathbb{R}} g(x)\,dF(x),$$
by definition of the Stieltjes–Lebesgue integral (Definition 2.2.9).

Example 3.1.10: The case of a discrete random variable. In the particular case where $(E, \mathcal{E}) = (\mathbb{N}, \mathcal{P}(\mathbb{N}))$, $Q_X(\{n\}) = P(X = n)$ and
$$E[g(X)] = \sum_{n\in\mathbb{N}} g(n)\,P(X = n).$$

Example 3.1.11: The case of a random vector with a probability density. If X is a random vector admitting a probability density $f_X$, then, by Theorem 2.3.28,
$$E[g(X)] = \int_{\mathbb{R}^n} g(x)\,f_X(x)\,dx.$$
The cumulative distribution function F of a real random variable X has the following properties:
(i) F : R → [0, 1].
(ii) F is nondecreasing.
(iii) F is right-continuous.
(iv) For each x ∈ R there exists $F(x-) := \lim_{h\downarrow 0} F(x-h)$.
(v) $F(+\infty) := \lim_{a\uparrow\infty} F(a) = P(X < \infty)$.
(vi) $F(-\infty) := \lim_{a\downarrow-\infty} F(a) = P(X = -\infty)$.
(vii) P(X = a) = F(a) − F(a−) for all a ∈ R.

Proof. (i) is obvious; (ii) If a ≤ b, then {X ≤ a} ⊆ {X ≤ b}, and therefore P(X ≤ a) ≤ P(X ≤ b); (iii) Let $B_n = \{X \le a + \frac1n\}$. Since $\cap_{n\ge1}\{X \le a + \frac1n\} = \{X \le a\}$, we have, by sequential continuity,
$$\lim_{n\uparrow\infty} P\Big(X \le a + \frac1n\Big) = P(X \le a).$$
(iv) We know from Analysis that a nondecreasing function from R to R has at any point a limit to the left; (v) Let $A_n = \{X \le n\}$ and observe that $\cup_{n=1}^\infty \{X \le n\} = \{X < \infty\}$. The result again follows by sequential continuity; (vi) Apply (1.6) with $B_n = \{X \le -n\}$ and observe that $\cap_{n=1}^\infty \{X \le -n\} = \{X = -\infty\}$. The result follows by sequential continuity. (vii) The sequence $B_n = \{a - \frac1n < X \le a\}$ is decreasing, and $\cap_{n=1}^\infty B_n = \{X = a\}$. Therefore, by sequential continuity,
$$P(X = a) = \lim_{n\uparrow\infty} P\Big(a - \frac1n < X \le a\Big) = \lim_{n\uparrow\infty}\Big(F(a) - F\Big(a - \frac1n\Big)\Big),$$
that is to say, P(X = a) = F(a) − F(a−).

From (vii), we see that the cdf is continuous at a ∈ R if and only if P(X = a) = 0. Being a nondecreasing right-continuous function, F has at most a countable set of discontinuity points $\{d_n, n \ge 1\}$. Define the discontinuous part of F by
$$F_d(x) = \sum_{n\ge1} \big(F(d_n) - F(d_n-)\big)\,1_{\{d_n\le x\}} = \sum_{n\ge1} P(X = d_n)\,1_{\{d_n\le x\}}.$$
In particular, when a random variable takes its values in a countable set, its cdf reduces to the discontinuous part Fd . For such (discrete) random variables, the probability distribution {p(dn )}n≥1 , where p(dn ) = P (X = dn ), suﬃces to describe the probabilistic behavior of X.
Famous Continuous Random Variables

An (absolutely) continuous random variable is by definition a real (no infinite values) random variable with a probability density, that is,
$$P(X \le x) = \int_{-\infty}^{x} f(y)\,dy,$$
where f(y) ≥ 0, and since X is real, P(X < ∞) = 1, that is,
$$\int_{-\infty}^{+\infty} f(y)\,dy = 1.$$
Definition 3.1.12 Let a and b be real numbers with a < b. A real random variable X with probability density function
$$f(x) = \frac{1}{b-a}\,1_{[a,b]}(x) \tag{3.6}$$
is called a uniform random variable on [a, b]. This is denoted by X ∼ U([a, b]).

Theorem 3.1.13 The mean and the variance of a uniform random variable on [a, b] are given by
$$E[X] = \frac{a+b}{2}, \qquad \mathrm{Var}(X) = \frac{(b-a)^2}{12}. \tag{3.7}$$
Proof. Direct computation.

Theorem 3.1.14 Let F be the cdf of a real random variable X and let, for u ∈ (0, 1),
$$F^{\leftarrow}(u) := \inf\{x \,;\, F(x) > u\}.$$
If U is a uniform random variable on (0, 1), then $F^{\leftarrow}(U)$ has the same probability distribution as X.

Proof. First note that for all u ∈ (0, 1), $F^{\leftarrow}(u) \le t$ implies F(t) ≥ u. Indeed, in this case, for all s > t there exists an x < s such that F(x) > u and therefore F(s) > u; and consequently, by right-continuity of F, F(t) ≥ u. Conversely, F(t) > u implies that t ∈ {x ; F(x) > u} and therefore $F^{\leftarrow}(u) \le t$. Taking all this into account,
$$F(t) = P(U < F(t)) \le P(F^{\leftarrow}(U) \le t) \le P(F(t) \ge U) = F(t).$$
This forces $P(F^{\leftarrow}(U) \le t)$ to equal F(t).
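The generalized inverse of Theorem 3.1.14 is easy to compute when F is a step function. The sketch below assumes a hypothetical three-point distribution chosen only for illustration; `make_generalized_inverse` is not a name from the text. It relies on the standard-library `bisect.bisect_right`, which returns the index of the first cumulative sum strictly larger than u.

```python
import bisect
import random

def make_generalized_inverse(points, probs):
    """Return u -> inf{x : F(x) > u} for the step cdf F with jumps probs[i] at points[i].

    `points` must be sorted in increasing order and `probs` must sum to 1.
    """
    cumulative = []
    running = 0.0
    for p in probs:
        running += p
        cumulative.append(running)

    def inverse(u):
        # First index i with cumulative[i] > u, i.e. the smallest atom x with F(x) > u.
        i = bisect.bisect_right(cumulative, u)
        return points[min(i, len(points) - 1)]  # guard against rounding in the last sum

    return inverse

# Hypothetical three-point distribution, chosen only for illustration.
points, probs = [-1.0, 0.5, 2.0], [0.2, 0.5, 0.3]
inverse = make_generalized_inverse(points, probs)

rng = random.Random(0)
samples = [inverse(rng.random()) for _ in range(100_000)]
for x, p in zip(points, probs):
    print(x, p, samples.count(x) / len(samples))
```

The empirical frequencies should be close to the prescribed probabilities, in line with the statement that $F^{\leftarrow}(U)$ has cdf F.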
Remark 3.1.15 Of course, if F is continuous, $F^{\leftarrow}$ is the inverse $F^{-1}$ in the usual sense.

Definition 3.1.16 A real random variable X with pdf
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac12\frac{(x-m)^2}{\sigma^2}}, \tag{3.8}$$
where m ∈ R and σ ∈ R+ , is called a Gaussian random variable. This is denoted by X ∼ N (m, σ 2 ). One can check that E[X] = m and Var (X) = σ 2 (Exercise 3.4.18). Deﬁnition 3.1.17 The tail distribution of a random variable X is, by deﬁnition, the quantity P (X > x).
The following bounds for the tail distribution of a standard Gaussian variable are useful: for x > 0,
$$\frac{x}{1+x^2}\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac12 x^2} \le \frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-\frac12 y^2}\,dy \le \frac{1}{x}\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac12 x^2}.$$
Proof. We have
$$\frac{1}{x^2}\int_x^\infty e^{-\frac12 y^2}\,dy > \int_x^\infty \frac{1}{y^2}\,e^{-\frac12 y^2}\,dy = \frac{1}{x}\,e^{-\frac12 x^2} - \int_x^\infty e^{-\frac12 y^2}\,dy,$$
where the equality is obtained by integration by parts. This gives the inequality on the left. Integration by parts again:
$$\int_x^\infty \frac{1}{y^2}\,e^{-\frac12 y^2}\,dy = \frac{1}{x}\,e^{-\frac12 x^2} - \int_x^\infty e^{-\frac12 y^2}\,dy,$$
and therefore
$$\frac{1}{x}\,e^{-\frac12 x^2} = \int_x^\infty \Big(1 + \frac{1}{y^2}\Big)\,e^{-\frac12 y^2}\,dy \ge \int_x^\infty e^{-\frac12 y^2}\,dy,$$
and this is the inequality on the right.

Definition 3.1.18 A random variable X with pdf
$$f(x) = \lambda e^{-\lambda x}\,1_{\{x\ge0\}} \tag{3.9}$$
is called an exponential random variable with parameter λ. This is denoted by X ∼ E(λ). The cdf of the exponential random variable is $F(x) = (1 - e^{-\lambda x})\,1_{\{x\ge0\}}$.

Theorem 3.1.19 The mean of an exponential random variable with parameter λ is
$$E[X] = \lambda^{-1}. \tag{3.10}$$
Proof. Direct computation, or see the Gamma distribution below.
The exponential distribution lacks memory, in the following sense:

Theorem 3.1.20 Let X ∼ E(λ). For all $t, t_0 \in \mathbb{R}_+$, we have
$$P(X \ge t_0 + t \mid X \ge t_0) = P(X \ge t).$$
Proof.
$$P(X \ge t_0 + t \mid X \ge t_0) = \frac{P(X \ge t_0 + t,\, X \ge t_0)}{P(X \ge t_0)} = \frac{P(X \ge t_0 + t)}{P(X \ge t_0)} = \frac{e^{-\lambda(t_0+t)}}{e^{-\lambda t_0}} = e^{-\lambda t} = P(X \ge t).$$
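The lack of memory can be checked numerically from the definition of conditional probability. This is only a small sketch; the rate λ = 0.7 and the pairs $(t_0, t)$ are arbitrary illustrative values, and the function names are hypothetical.

```python
import math

def exp_survival(lam, t):
    """P(X >= t) for X ~ E(lam) and t >= 0."""
    return math.exp(-lam * t)

def conditional_survival(lam, t0, t):
    """P(X >= t0 + t | X >= t0), computed from the definition of conditional probability."""
    return exp_survival(lam, t0 + t) / exp_survival(lam, t0)

lam = 0.7  # illustrative rate
for t0, t in [(0.0, 1.0), (2.0, 1.0), (5.0, 0.3)]:
    print(t0, t, conditional_survival(lam, t0, t), exp_survival(lam, t))
```

For each pair, the two printed probabilities agree up to rounding, whatever the value of $t_0$.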
Recall the definition of the gamma function Γ:
$$\Gamma(\alpha) := \int_0^\infty x^{\alpha-1}e^{-x}\,dx.$$
Integration by parts gives, for α > 0,
$$0 = u^\alpha e^{-u}\Big|_0^\infty = \int_0^\infty \alpha u^{\alpha-1}e^{-u}\,du - \int_0^\infty e^{-u}u^\alpha\,du = \alpha\Gamma(\alpha) - \Gamma(\alpha+1).$$
Therefore
$$\Gamma(\alpha+1) = \alpha\,\Gamma(\alpha),$$
from which it follows in particular, since $\Gamma(1) = \int_0^\infty e^{-x}\,dx = 1$, that for all integers n ≥ 1, Γ(n) = (n − 1)!

Definition 3.1.21 Let α and β be two strictly positive real numbers. A nonnegative random variable X with the pdf
$$f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{\alpha-1}e^{-\beta x}\,1_{\{x>0\}} \tag{3.11}$$
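is called a Gamma random variable with parameters α and β. This is denoted by X ∼ γ(α, β). We must check that (3.11) defines a probability density (that is, the integral of f is 1). In fact:
$$\int_{-\infty}^{+\infty} f(x)\,dx = \frac{\beta^\alpha}{\Gamma(\alpha)}\int_0^\infty x^{\alpha-1}e^{-\beta x}\,dx = \frac{1}{\Gamma(\alpha)}\int_0^\infty y^{\alpha-1}e^{-y}\,dy = \frac{\Gamma(\alpha)}{\Gamma(\alpha)} = 1,$$
where the second equality has been obtained via the change of variable y = βx.

Theorem 3.1.22 If X ∼ γ(α, β), then
$$E[X] = \frac{\alpha}{\beta} \quad\text{and}\quad \mathrm{Var}(X) = \frac{\alpha}{\beta^2}.$$
Proof.
$$E[X] = \int_0^\infty x\,\frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{\alpha-1}e^{-\beta x}\,dx = \frac{\beta^\alpha}{\Gamma(\alpha)}\int_0^\infty x^{\alpha}e^{-\beta x}\,dx = \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)}\,\frac{1}{\beta} = \frac{\alpha}{\beta}.$$
Similarly,
$$E[X^2] = \frac{\Gamma(\alpha+2)}{\Gamma(\alpha)}\,\frac{1}{\beta^2} = \frac{\alpha(\alpha+1)}{\beta^2}.$$
Therefore
$$\mathrm{Var}(X) = E[X^2] - E[X]^2 = \frac{\alpha(\alpha+1)}{\beta^2} - \Big(\frac{\alpha}{\beta}\Big)^2 = \frac{\alpha}{\beta^2}. \tag{3.12}$$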
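The normalization of (3.11) and the moment formulas of Theorem 3.1.22 can be confirmed by direct numerical integration. The sketch below uses a plain midpoint rule and illustrative parameters α = 2.5, β = 1.5 (an assumption made for the example); β is a rate parameter, as in (3.11). The helper `moment` is hypothetical.

```python
import math

def gamma_pdf(x, alpha, beta):
    """Density (3.11) of the gamma(alpha, beta) distribution, with beta a rate parameter."""
    if x <= 0.0:
        return 0.0
    return beta ** alpha / math.gamma(alpha) * x ** (alpha - 1.0) * math.exp(-beta * x)

def moment(k, alpha, beta, upper=40.0, steps=100_000):
    """Midpoint-rule approximation of the integral of x^k f(x) over (0, upper)."""
    h = upper / steps
    return h * sum(((i + 0.5) * h) ** k * gamma_pdf((i + 0.5) * h, alpha, beta)
                   for i in range(steps))

alpha, beta = 2.5, 1.5  # illustrative parameters
mass = moment(0, alpha, beta)
mean = moment(1, alpha, beta)
variance = moment(2, alpha, beta) - mean ** 2
print(mass, mean, alpha / beta)
print(variance, alpha / beta ** 2)
```

Note that Python's own `random.gammavariate(alpha, beta)` takes a scale parameter, so it corresponds to γ(α, 1/β) in the notation used here.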
The exponential distribution is a particular case of the Gamma distribution. In fact, γ(1, λ) ≡ E(λ). The so-called chi-square distribution with n degrees of freedom, denoted by $\chi^2_n$, is just the $\gamma(\frac n2, \frac12)$ distribution. It therefore has the pdf
$$f(x) = \frac{1}{2^{\frac n2}\,\Gamma(\frac n2)}\,x^{\frac n2 - 1}e^{-\frac12 x}\,1_{\{x>0\}}. \tag{3.13}$$
This is denoted by $X \sim \chi^2_n$.

Definition 3.1.23 A random variable X with pdf
$$f(x) = \frac{1}{\pi(1+x^2)} \tag{3.14}$$
is called a Cauchy random variable.

It is important to observe that the mean of X is not defined since
$$\int_{\mathbb{R}} \frac{|x|}{\pi(1+x^2)}\,dx = +\infty.$$
Of course, a fortiori, its variance is not defined.
Change of Variables

Let $X = (X_1, \ldots, X_n)$ be a random vector with the probability density function $f_X$, and define the random vector Y = g(X), where $g : \mathbb{R}^n \to \mathbb{R}^n$. More explicitly,
$$\begin{cases} Y_1 = g_1(X_1, \ldots, X_n),\\ \quad\vdots\\ Y_n = g_n(X_1, \ldots, X_n).\end{cases}$$
Under smoothness assumptions on g, the random vector Y is absolutely continuous, and its probability density function can be explicitly computed from g and the probability density function $f_X$. The conditions allowing this are the following:
A1: The function g from U to $\mathbb{R}^n$, where U is an open subset of $\mathbb{R}^n$, is one-to-one (injective).
A2: The coordinate functions $g_i$ (1 ≤ i ≤ n) are continuously differentiable.
A3: Moreover, the Jacobian matrix of the function g,
$$J_g(x) := J_g(x_1, \ldots, x_n) := \Big(\frac{\partial g_i}{\partial x_j}(x_1, \ldots, x_n)\Big)_{1\le i,j\le n},$$
satisfies the positivity condition
$$|\det J_g(x)| > 0 \qquad (x \in U).$$
A standard result of Analysis says that V = g(U) is an open subset of $\mathbb{R}^n$, and that the invertible function g : U → V has an inverse $g^{-1} : V \to U$ with the same properties as the direct function g. In particular, on V, $|\det J_{g^{-1}}(y)| > 0$. Moreover,
$$J_{g^{-1}}(y) = J_g(g^{-1}(y))^{-1}.$$
Also, under the conditions A1–A3, for any function $u : \mathbb{R}^n \to \mathbb{R}^n$,
$$\int_U u(x)\,dx = \int_{g(U)} u(g^{-1}(y))\,|\det J_{g^{-1}}(y)|\,dy.$$

Theorem 3.1.24 Under the conditions just stated for X, g, and U, and if moreover P(X ∈ U) = 1, then Y admits the probability density
$$f_Y(y) = f_X(g^{-1}(y))\,|\det J_g(g^{-1}(y))|^{-1}\,1_V(y). \tag{3.15}$$
Proof. The proof consists in checking that for any bounded function $h : \mathbb{R}^n \to \mathbb{R}$,
$$E[h(Y)] = \int_{\mathbb{R}^n} h(y)\,\psi(y)\,dy, \tag{3.16}$$
where ψ is the function on the right-hand side of (3.15). Indeed, taking $h(y) = 1_{\{y\le a\}} = 1_{\{y_1\le a_1\}}\cdots 1_{\{y_n\le a_n\}}$, (3.16) reads
$$P(Y_1 \le a_1, \ldots, Y_n \le a_n) = \int_{-\infty}^{a_1}\cdots\int_{-\infty}^{a_n} \psi(y_1, \ldots, y_n)\,dy_1\cdots dy_n.$$
To prove that (3.16) holds with the appropriate ψ, one just uses the basic rule of change of variables:
$$E[h(Y)] = E[h(g(X))] = \int_U h(g(x))\,f_X(x)\,dx = \int_V h(y)\,f_X(g^{-1}(y))\,|\det J_{g^{-1}}(y)|\,dy.$$

Theorem 3.1.25 Let X be an n-dimensional random vector with probability density $f_X$. Let A be an invertible n × n real matrix and let b be an n-dimensional real vector. Then, the random vector Y = AX + b admits the density
$$f_Y(y) = f_X(A^{-1}(y-b))\,\frac{1}{|\det A|}. \tag{3.17}$$
Proof. Here $U = \mathbb{R}^n$, g(x) = Ax + b and $|\det J_{g^{-1}}(y)| = \frac{1}{|\det A|}$.
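Formula (3.17) can be illustrated in dimension n = 1, where A reduces to a nonzero scalar a. The sketch below assumes, purely for illustration, X ∼ E(1), a = −2 and b = 1; the negative slope shows why the absolute value of the determinant is needed. The cdf obtained by integrating the transformed density is compared with the one computed directly from the cdf of X. All function names are hypothetical.

```python
import math

def exp_pdf(x, lam=1.0):
    """Density (3.9) of the exponential distribution E(lam)."""
    return lam * math.exp(-lam * x) if x >= 0.0 else 0.0

def affine_pdf(y, a, b, pdf=exp_pdf):
    """Formula (3.17) for n = 1: density of Y = a X + b, with a != 0."""
    return pdf((y - b) / a) / abs(a)

def integrate(func, lower, upper, steps=200_000):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    h = (upper - lower) / steps
    return h * sum(func(lower + (i + 0.5) * h) for i in range(steps))

a, b = -2.0, 1.0  # illustrative values; a < 0 shows why the absolute value is needed
y = 0.0
cdf_from_density = integrate(lambda s: affine_pdf(s, a, b), -80.0, y)
cdf_direct = math.exp(-(y - b) / a)  # P(Y <= y) = P(X >= (y - b) / a) when a < 0
print(cdf_from_density, cdf_direct)
```

The lower integration limit −80 truncates a tail of negligible mass for this choice of parameters.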
Example 3.1.26: Polar coordinates. Let $(X_1, X_2)$ be a two-dimensional random vector with probability density $f_{X_1,X_2}(x_1, x_2)$ and let (R, Θ) be its polar coordinates. The probability density of (R, Θ) is given by the formula
$$f_{R,\Theta}(r, \theta) = f_{X_1,X_2}(r\cos\theta, r\sin\theta)\,r.$$
Proof. Here g is the bijective function from the open set U consisting of $\mathbb{R}^2$ without the half-line $\{(x_1, 0) \,;\, x_1 \ge 0\}$ to the open set V = (0, ∞) × (0, 2π). The inverse function is
$$x_1 = r\cos\theta, \qquad x_2 = r\sin\theta.$$
The Jacobian of $g^{-1}$ is
$$J_{g^{-1}}(r, \theta) = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}$$
of determinant $\det J_{g^{-1}}(r, \theta) = r$. Apply formula (3.15) to obtain the announced result.
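To see the polar-coordinate formula at work, the sketch below assumes that $X_1$ and $X_2$ are independent standard Gaussian variables (an illustrative choice, not part of the example above). The formula then gives $f_{R,\Theta}(r,\theta) = \frac{1}{2\pi}\,r\,e^{-r^2/2}$, so that $P(R \le r_0) = 1 - e^{-r_0^2/2}$. The code compares a numerical integral of the formula with a seeded Monte Carlo estimate.

```python
import math
import random

def joint_pdf(x1, x2):
    """Density of two independent standard Gaussian coordinates (illustrative choice)."""
    return math.exp(-0.5 * (x1 * x1 + x2 * x2)) / (2.0 * math.pi)

def polar_pdf(r, theta):
    """f_{R,Theta}(r, theta) = f_{X1,X2}(r cos(theta), r sin(theta)) * r."""
    return joint_pdf(r * math.cos(theta), r * math.sin(theta)) * r

def radius_cdf(r0, steps=2_000):
    """P(R <= r0): integrate polar_pdf over (0, r0) x (0, 2*pi) by the midpoint rule.

    For this Gaussian example the integrand does not depend on theta,
    so the theta-integral contributes a factor 2*pi.
    """
    h = r0 / steps
    return 2.0 * math.pi * h * sum(polar_pdf((i + 0.5) * h, 0.0) for i in range(steps))

rng = random.Random(1)
r0 = 1.5
n = 100_000
hits = sum(math.hypot(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) <= r0 for _ in range(n))
print(radius_cdf(r0), hits / n, 1.0 - math.exp(-0.5 * r0 * r0))
```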
Covariance Matrices

Recall that $L^2_{\mathbb{C}}(P)$ (resp., $L^2_{\mathbb{R}}(P)$) is the set of square-integrable complex (resp., real) random variables, where two variables X and X′ such that P(X = X′) = 1 are not distinguished. Define for X, Y in $L^2_{\mathbb{C}}(P)$ or $L^2_{\mathbb{R}}(P)$
$$\langle X, Y\rangle = E[XY^*]. \tag{3.18}$$
$L^2_{\mathbb{C}}(P)$ and $L^2_{\mathbb{R}}(P)$ are Hilbert spaces with scalar field C and R respectively (the Riesz–Fischer Theorem 2.3.25).

Definition 3.1.27 Two complex square-integrable random variables X and Y are said to be orthogonal if $E[XY^*] = 0$. They are said to be uncorrelated if $E[(X - m_X)(Y - m_Y)^*] = 0$.

Recall Schwarz’s inequality for square-integrable random variables:
$$|E[XY]| \le E[|XY|] \le E[|Y|^2]^{\frac12} \times E[|X|^2]^{\frac12}. \tag{3.19}$$
In particular, with Y = 1,
$$E[|X|] \le E[|X|^2]^{\frac12} < \infty. \tag{3.20}$$
Correlation Coefficient

Definition 3.1.28 The cross-variance of the two complex square-integrable variables X and Y is, by definition, the complex number $E[(X - m_X)(Y - m_Y)^*]$, denoted by $\sigma_{XY}$.

Definition 3.1.29 Let X and Y be square-integrable real random variables with respective means $m_X$ and $m_Y$, and respective variances $\sigma_X^2 > 0$ and $\sigma_Y^2 > 0$. Their correlation coefficient is the quantity
$$\rho_{XY} := \frac{\sigma_{XY}}{\sigma_X\sigma_Y},$$
where $\sigma_{XY}$ is the cross-variance. By Schwarz’s inequality, $|\sigma_{XY}| \le \sigma_X\sigma_Y$, and therefore $|\rho_{XY}| \le 1$, with equality if and only if X and Y are colinear. When $\rho_{XY} = 0$, X and Y are said to be uncorrelated. If $\rho_{XY} > 0$, they are said to be positively correlated, whereas if $\rho_{XY} < 0$, they are said to be negatively correlated.
Theorem 3.1.30 Let X and Y be square-integrable real random variables. Among all variables Z = aX + b, where a and b are real numbers, the one that minimizes the error $E[(Z - Y)^2]$ is
$$\hat Y = m_Y + \frac{\sigma_{XY}}{\sigma_X^2}\,(X - m_X),$$
and the error is then
$$E[(\hat Y - Y)^2] = \sigma_Y^2\,(1 - \rho_{XY}^2).$$
(The proof is left as Exercise 3.4.43.)

Remark 3.1.31 We see that if the variables are not correlated, then the best prediction is the trivial one $\hat Y = m_Y$ and the (maximal) error is then $\sigma_Y^2$. In imprecise but suggestive terms, high correlation implies high predictability.

Notation: For vectors and matrices, an asterisk superscript ($^*$) denotes complex conjugates, a T superscript ($^T$) is for vector transposition, and the dagger superscript ($^\dagger$) is for conjugation-transposition. When x is a vector of $\mathbb{R}^n$, we always assume in the notation that it is a column vector, and therefore $x^T$ will be the corresponding row vector.

Definition 3.1.32 A random vector $X = (X_1, \ldots, X_n)^T$ such that $X_1, \ldots, X_n$ are square-integrable complex random variables is called a square-integrable complex vector.

In particular, for all 1 ≤ i, j ≤ n, by (3.20), $E[|X_i|] < \infty$, and by Schwarz’s inequality (3.19), $E[|X_iX_j|] < \infty$. This allows us to define the mean of X
$$m_X := E[X] = (E[X_1], \ldots, E[X_n])^T$$
and the covariance matrix of X
$$\Gamma_X := E[(X - m_X)(X - m_X)^\dagger] = \big\{E[(X_i - m_{X_i})(X_j - m_{X_j})^*]\big\}_{1\le i,j\le n} = \big\{\sigma_{X_i,X_j}\big\}_{1\le i,j\le n}.$$

Theorem 3.1.33 The matrix $\Gamma_X$ is symmetric Hermitian, that is,
$$\Gamma_X^\dagger = \Gamma_X, \tag{3.21}$$
and it is nonnegative definite (denoted $\Gamma_X \ge 0$), that is,
$$\alpha^\dagger\,\Gamma_X\,\alpha \ge 0 \qquad (\alpha \in \mathbb{C}^n). \tag{3.22}$$
Proof. For all $\alpha \in \mathbb{C}^n$,
$$\begin{aligned}
\alpha^T\Gamma_X\alpha^* &= \sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j^*\,E[(X_i - E[X_i])(X_j - E[X_j])^*]\\
&= E\Big[\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j^*\,(X_i - E[X_i])(X_j - E[X_j])^*\Big]\\
&= E\Big[\Big(\sum_{i=1}^n \alpha_i(X_i - E[X_i])\Big)\Big(\sum_{j=1}^n \alpha_j(X_j - E[X_j])\Big)^*\Big]\\
&= E[|\alpha^T(X - E[X])|^2] \ge 0.
\end{aligned}$$
Since α is arbitrary, replacing α by $\alpha^*$ gives (3.22).
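In the real case, nonnegative definiteness can be observed on an empirical covariance matrix, which is the covariance matrix of the empirical distribution of the sample. The sketch below uses hypothetical correlated data generated with the standard library; `covariance_matrix` and `quadratic_form` are illustrative helper names, not from the text.

```python
import random

def covariance_matrix(samples):
    """Empirical covariance matrix of a list of real n-dimensional vectors."""
    count, dim = len(samples), len(samples[0])
    mean = [sum(x[i] for x in samples) / count for i in range(dim)]
    return [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in samples) / count
             for j in range(dim)] for i in range(dim)]

def quadratic_form(alpha, gamma):
    """alpha^T Gamma alpha for a real vector alpha."""
    dim = len(alpha)
    return sum(alpha[i] * gamma[i][j] * alpha[j] for i in range(dim) for j in range(dim))

rng = random.Random(2)
# Hypothetical correlated data: the third coordinate mixes the first two.
samples = []
for _ in range(5_000):
    u, v = rng.gauss(0.0, 1.0), rng.uniform(-1.0, 1.0)
    samples.append([u, v, u - 2.0 * v + 0.1 * rng.gauss(0.0, 1.0)])

gamma = covariance_matrix(samples)
forms = [quadratic_form([rng.uniform(-1, 1) for _ in range(3)], gamma) for _ in range(200)]
print(min(forms))
```

Each quadratic form equals the empirical variance of $\alpha^T X$, which is why none of them can be negative beyond rounding error.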
Theorem 3.1.34 Let X be a square-integrable real random vector with a covariance matrix $\Gamma_X$ which is degenerate, that is, $\alpha^T\Gamma_X\alpha = 0$ for some α ≠ 0. Then X lies almost surely in a given hyperplane of $\mathbb{R}^n$ of dimension strictly less than n.

Proof. For such α, $E[|\alpha^T(X - E[X])|^2] = \alpha^T\Gamma_X\alpha = 0$, and therefore $\alpha^T(X - E[X]) = 0$ almost surely.

Remark 3.1.35 Since X lies almost surely in a strict hyperplane of $\mathbb{R}^n$, it cannot have a probability density. A vector X with degenerate covariance matrix is also called degenerate. If $\Gamma_X$ is nondegenerate, we write $\Gamma_X > 0$.

We now examine the effects of an affine transformation of a random vector on its covariance matrix. Let X be a square-integrable n-dimensional complex random vector, with mean $m_X$ and covariance matrix $\Gamma_X$. Let A be a (k × n)-dimensional complex matrix, and b a k-dimensional complex vector.

Theorem 3.1.36 Then the k-dimensional complex vector Z = AX + b has mean $m_Z = A\,m_X + b$, and covariance matrix $\Gamma_Z = A\,\Gamma_X A^\dagger$.

Proof. The formula giving the mean is immediate. As for the other one, it suffices to observe that $(Z - m_Z) = A(X - m_X)$ and to write
$$\begin{aligned}
\Gamma_Z &= E[(Z - m_Z)(Z - m_Z)^\dagger] = E[A(X - m_X)(A(X - m_X))^\dagger]\\
&= E[A(X - m_X)(X - m_X)^\dagger A^\dagger] = A\,E[(X - m_X)(X - m_X)^\dagger]\,A^\dagger = A\,\Gamma_X A^\dagger.
\end{aligned}$$
Let X and Y be square-integrable complex random vectors of respective dimensions n and q. We define the inter-covariance matrix of X and Y, in this order, by
$$\Gamma_{XY} = E[(X - m_X)(Y - m_Y)^\dagger].$$
Note that
$$\Gamma_{YX} = \Gamma_{XY}^\dagger.$$
Also if we define the (n + q)-dimensional vector Z by $Z = (X_1, \ldots, X_n, Y_1, \ldots, Y_q)^T$ then its covariance takes the block form
$$\Gamma_Z = \begin{pmatrix} \Gamma_X & \Gamma_{XY} \\ \Gamma_{YX} & \Gamma_Y \end{pmatrix}.$$

3.1.3 Independence and the Product Formula
Recall the definition of independence for events. Two events A and B are said to be independent if
$$P(A \cap B) = P(A)P(B). \tag{3.23}$$
More generally, a family $\{A_i\}_{i\in I}$ of events, where I is an arbitrary index set, is called independent if for every finite subset J ⊆ I,
$$P\Big(\bigcap_{j\in J} A_j\Big) = \prod_{j\in J} P(A_j).$$

Definition 3.1.37 Two random elements $X : (\Omega, \mathcal{F}) \to (E, \mathcal{E})$ and $Y : (\Omega, \mathcal{F}) \to (G, \mathcal{G})$ are called independent if for all $C \in \mathcal{E}$, $D \in \mathcal{G}$,
$$P(\{X \in C\} \cap \{Y \in D\}) = P(X \in C)\,P(Y \in D). \tag{3.24}$$
More generally, let I be an arbitrary index set. The family of random elements $\{X_i\}_{i\in I}$, where $X_i : (\Omega, \mathcal{F}) \to (E_i, \mathcal{E}_i)$ (i ∈ I), is called independent if for every finite subset J ⊆ I,
$$P\Big(\bigcap_{j\in J} \{X_j \in C_j\}\Big) = \prod_{j\in J} P(X_j \in C_j)$$
for all $C_j \in \mathcal{E}_j$ (j ∈ J).

Theorem 3.1.38 If the random elements X and Y taking their values in $(E, \mathcal{E})$ and $(G, \mathcal{G})$ respectively are independent, then so are the random elements ϕ(X) and ψ(Y), where $\varphi : (E, \mathcal{E}) \to (E', \mathcal{E}')$, $\psi : (G, \mathcal{G}) \to (G', \mathcal{G}')$.

Proof. For all $C' \in \mathcal{E}'$, $D' \in \mathcal{G}'$, the sets $C = \varphi^{-1}(C')$ and $D = \psi^{-1}(D')$ are in $\mathcal{E}$ and $\mathcal{G}$ respectively, since ϕ and ψ are measurable. We have
$$P(\varphi(X) \in C', \psi(Y) \in D') = P(X \in C, Y \in D) = P(X \in C)\,P(Y \in D) = P(\varphi(X) \in C')\,P(\psi(Y) \in D').$$
The above result is stated for two random variables for simplicity, and it extends in the obvious way to a finite number of independent random variables. The next result simplifies the task of proving that two σ-fields are independent.
Theorem 3.1.39 Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\mathcal{S}_1$ and $\mathcal{S}_2$ be two π-systems of sets in $\mathcal{F}$. If $\mathcal{S}_1$ and $\mathcal{S}_2$ are independent, then so are $\sigma(\mathcal{S}_1)$ and $\sigma(\mathcal{S}_2)$.

Proof. Fix $A \in \mathcal{S}_1$. Let $\mathcal{V}_2 = \{B \in \mathcal{F} \,;\, P(A \cap B) = P(A)P(B)\}$. This is a d-system (easy to check) and $\mathcal{S}_2 \subseteq \mathcal{V}_2$. Therefore $d(\mathcal{S}_2) \subseteq d(\mathcal{V}_2) = \mathcal{V}_2$. On the other hand, by Dynkin’s theorem, $d(\mathcal{S}_2) = \sigma(\mathcal{S}_2)$. We have therefore proved that $\mathcal{S}_1$ and $\sigma(\mathcal{S}_2)$ are independent. Now fix $B \in \sigma(\mathcal{S}_2)$. Let $\mathcal{V}_1 = \{A \in \mathcal{F} \,;\, P(A \cap B) = P(A)P(B)\}$. This is a d-system and $\mathcal{S}_1 \subseteq \mathcal{V}_1$. Therefore $d(\mathcal{S}_1) \subseteq d(\mathcal{V}_1) = \mathcal{V}_1$. On the other hand, by Dynkin’s theorem, $d(\mathcal{S}_1) = \sigma(\mathcal{S}_1)$. We therefore have proved that $\sigma(\mathcal{S}_1)$ and $\sigma(\mathcal{S}_2)$ are independent.

Corollary 3.1.40 Let $(\Omega, \mathcal{F}, P)$ be a probability space on which are given two real random variables X and Y. For these two random variables to be independent, it is necessary and sufficient that for all a, b ∈ R, P(X ≤ a, Y ≤ b) = P(X ≤ a)P(Y ≤ b).

Proof. This follows from Theorem 3.1.39, remembering that the collection {(−∞, a]; a ∈ R} is a π-system generating $\mathcal{B}(\mathbb{R})$.

The independence of two random elements X and Y is equivalent to the factorization of their joint distribution:
$$Q_{(X,Y)} = Q_X \times Q_Y,$$
where $Q_{(X,Y)}$, $Q_X$, and $Q_Y$ are the distributions of, respectively, (X, Y), X and Y. Indeed, for all sets of the form C × D, where $C \in \mathcal{E}$ and $D \in \mathcal{G}$,
$$Q_{(X,Y)}(C \times D) = P((X, Y) \in C \times D) = P(X \in C, Y \in D) = P(X \in C)\,P(Y \in D) = Q_X(C)\,Q_Y(D),$$
and therefore (Theorem 2.3.7) $Q_{(X,Y)}$ is the product measure of $Q_X$ and $Q_Y$. In particular, the Fubini–Tonelli theorem immediately gives a result that we have already seen in the particular case of discrete random variables: the product formula for expectations (Formula (3.25) below).

Theorem 3.1.41 Let the random variables X and Y taking their values in $(E, \mathcal{E})$ and $(G, \mathcal{G})$ respectively be independent, and let $g : (E, \mathcal{E}) \to (\mathbb{R}, \mathcal{B})$, $h : (G, \mathcal{G}) \to (\mathbb{R}, \mathcal{B})$ be such that one of the following two conditions is satisfied: (i) $E[|g(X)|] < \infty$ and $E[|h(Y)|] < \infty$, or (ii) g ≥ 0 and h ≥ 0. Then
$$E[g(X)h(Y)] = E[g(X)]\,E[h(Y)]. \tag{3.25}$$
Proof. It suffices to give the proof in the nonnegative case. We have:
$$\begin{aligned}
E[g(X)h(Y)] &= \int_{E\times G} g(x)h(y)\,Q_{(X,Y)}(dx \times dy)\\
&= \int_E\int_G g(x)h(y)\,Q_X(dx)\,Q_Y(dy)\\
&= \Big(\int_E g(x)\,Q_X(dx)\Big)\Big(\int_G h(y)\,Q_Y(dy)\Big)\\
&= E[g(X)]\,E[h(Y)].
\end{aligned}$$

Theorem 3.1.42 Let X be a random vector of $\mathbb{R}^n$ admitting the pdf $f_X$. The (measurable) set of samples ω such that there exist i, j (i ≠ j) such that $X_i(\omega) = X_j(\omega)$ has a null probability.

Proof. Let A be this set and let $C := \{(x_1, \ldots, x_n) \,;\, x_i = x_j \text{ for some } i \ne j\}$. The set C has null Lebesgue measure, and therefore, since $1_A(\omega) \equiv 1_C(X(\omega))$,
$$P(A) = E[1_C(X)] = \int_{\mathbb{R}^n} 1_C(x)\,f_X(x)\,dx = 0.$$
Order Statistics

Let $X_1, \ldots, X_n$ be independent random variables with the same pdf f. By Theorem 3.1.42, the probability that two or more among $X_1, \ldots, X_n$ take the same value is null. Therefore one can define unambiguously the random variables $Z_1, \ldots, Z_n$ obtained by arranging $X_1, \ldots, X_n$ in increasing order:
$$Z_i \in \{X_1, \ldots, X_n\}, \qquad Z_1 < Z_2 < \cdots < Z_n.$$
In particular, $Z_1 = \min(X_1, \ldots, X_n)$ and $Z_n = \max(X_1, \ldots, X_n)$.

Theorem 3.1.43 The probability density of the reordered vector $Z = (Z_1, \ldots, Z_n)$ (defined above) is
$$f_Z(z_1, \ldots, z_n) = n!\,\Big(\prod_{j=1}^n f(z_j)\Big)\,1_C(z_1, \ldots, z_n), \tag{3.26}$$
where $C = \{(z_1, \ldots, z_n) \in \mathbb{R}^n \,;\, z_1 < z_2 < \cdots < z_n\}$.

Proof. Let σ be the permutation of {1, . . . , n} that orders $X_1, \ldots, X_n$ in ascending order, that is, $X_{\sigma(i)} = Z_i$ (note that σ is a random permutation). For any set $A \subseteq \mathbb{R}^n$,
$$P(Z \in A) = P(Z \in A \cap C) = P(X_\sigma \in A \cap C) = \sum_{\sigma_o} P(X_{\sigma_o} \in A \cap C,\, \sigma = \sigma_o),$$
where the sum is over all permutations of {1, . . . , n}. Observing that $X_{\sigma_o} \in A \cap C$ implies $\sigma = \sigma_o$,
$$P(X_{\sigma_o} \in A \cap C,\, \sigma = \sigma_o) = P(X_{\sigma_o} \in A \cap C)$$
and therefore since the probability distribution of $X_{\sigma_o}$ does not depend upon a fixed permutation $\sigma_o$ (here we need the independence and equidistribution assumptions for the $X_i$’s),
$$P(X_{\sigma_o} \in A \cap C) = P(X \in A \cap C).$$
Therefore,
$$P(Z \in A) = \sum_{\sigma_o} P(X \in A \cap C) = n!\,P(X \in A \cap C) = n!\int_{A\cap C} f_X(x)\,dx = \int_A n!\,f_X(x)\,1_C(x)\,dx.$$
Example 3.1.44: Volume of the right-angled pyramid. We shall apply the above result to prove the formula
$$\int_a^b\cdots\int_a^b 1_C(z_1, \ldots, z_n)\,dz_1\cdots dz_n = \frac{(b-a)^n}{n!}. \tag{3.27}$$
Indeed, when the $X_i$’s are uniformly distributed over [a, b],
$$f_Z(z_1, \ldots, z_n) = \frac{n!}{(b-a)^n}\,1_{[a,b]^n}(z_1, \ldots, z_n)\,1_C(z_1, \ldots, z_n). \tag{3.28}$$
The result follows since $\int_{\mathbb{R}^n} f_Z(z)\,dz = 1$.
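Formula (3.27) lends itself to a Monte Carlo check: the fraction of uniform points of $[a,b]^n$ with increasing coordinates should be close to 1/n!. The sketch below uses illustrative values a = 1, b = 3, n = 3 and a fixed seed; `ordered_fraction` is a hypothetical helper name.

```python
import math
import random

def ordered_fraction(n, trials, rng, a=0.0, b=1.0):
    """Fraction of uniform points of [a, b]^n with z_1 < z_2 < ... < z_n."""
    hits = 0
    for _ in range(trials):
        z = [rng.uniform(a, b) for _ in range(n)]
        hits += all(z[i] < z[i + 1] for i in range(n - 1))
    return hits / trials

rng = random.Random(3)
a, b, n = 1.0, 3.0, 3  # illustrative values
fraction = ordered_fraction(n, 100_000, rng, a, b)
volume_estimate = fraction * (b - a) ** n
print(volume_estimate, (b - a) ** n / math.factorial(n))
```

Multiplying the fraction by the volume $(b-a)^n$ of the cube gives an estimate of the left-hand side of (3.27).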
Sampling from a Distribution

The problem that we address now (which arises in the simulation of stochastic systems) is to generate a random variable with prescribed cdf or, in other terms, to sample the said cdf. For this, one is allowed to use a random generator that produces a sequence $U_1, U_2, \dots$ of independent real random variables, uniformly distributed on $[0, 1]$. In practice, the numbers that such random generators produce are not quite random, but they look as if they are (the generators are called pseudorandom generators). The topic of how to devise a good pseudorandom generator is out of our scope, and we shall admit that we can trust our favourite computer to provide us with an iid sequence of random variables uniformly distributed on $[0, 1]$ (from now on we call them random numbers). Given such a sequence, we are going to describe methods for constructing a random variable $Z$ with cdf $F(z) = P(Z \le z)$.

In the case where $Z$ is a discrete random variable with distribution $P(Z = a_i) = p_i$ $(0 \le i \le K)$, the basic principle of the sampling algorithm is the following:

Draw $U \sim \mathcal{U}([0, 1])$.

Set $Z = a_\ell$ if $p_0 + p_1 + \cdots + p_{\ell-1} < U \le p_0 + p_1 + \cdots + p_\ell$.

This method is called the method of the inverse. A crude generation algorithm would successively perform the tests $U \le p_0$?, $U \le p_0 + p_1$?, $\dots$, until the answer is positive. The average number of iterations required would therefore be $\sum_{i \ge 0} (i+1) p_i = 1 + E[Z]$ (when $a_i = i$). This number may be too large, but there are ways of improving it, as the example below will show for the Poisson random variable.

For absolutely continuous variables, the inverse method takes the following form. Draw a random number $U \sim \mathcal{U}([0, 1])$ and set
\[
Z = F^{-1}(U) ,
\]
where $F^{-1}$ is the inverse of $F$. Indeed,
\[
P(Z \le z) = P(F^{-1}(U) \le z) = P(U \le F(z)) = F(z) .
\]

Example 3.1.45: Exponential distribution. We want to sample from $\mathcal{E}(\lambda)$. The corresponding cdf is $F(z) = 1 - e^{-\lambda z}$ $(z \ge 0)$. The solution of $y = 1 - e^{-\lambda z}$ is $z = -\frac{1}{\lambda} \ln(1 - y) = F^{-1}(y)$, and therefore $Z = -\frac{1}{\lambda} \ln(1 - U)$ will do, or, since $U$ and $1 - U$ have the same distribution, $Z = -\frac{1}{\lambda} \ln U$.
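The two forms of the inverse method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `sample_discrete` and `sample_exponential` are hypothetical, and the standard-library generator `random.Random` plays the role of the trusted source of random numbers.

```python
import math
import random


def sample_discrete(values, probs, rng):
    """Inverse method, discrete case: return values[l] for the first index l
    such that U <= p_0 + ... + p_l (the crude sequential search)."""
    u = rng.random()
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p
        if u <= cumulative:
            return value
    return values[-1]  # guards against rounding when the p_i sum to 1


def sample_exponential(lam, rng):
    """Inverse method, continuous case: Z = -ln(1 - U) / lam has cdf 1 - exp(-lam z)."""
    u = rng.random()  # u lies in [0, 1), so 1 - u lies in (0, 1] and the log is finite
    return -math.log(1.0 - u) / lam


if __name__ == "__main__":
    rng = random.Random(0)
    n = 100_000
    draws = [sample_discrete([0, 1, 2], [0.2, 0.5, 0.3], rng) for _ in range(n)]
    print("empirical P(Z = 1):", draws.count(1) / n)
    mean = sum(sample_exponential(2.0, rng) for _ in range(n)) / n
    print("empirical mean of E(2):", mean)
```

The code uses $1 - U$ rather than $U$ inside the logarithm only because `random()` may return exactly 0; both choices have the same distribution.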
Remark 3.1.46 Both the discrete case and the absolutely continuous cases are particular cases of the more general result of Theorem 3.1.14.
Example 3.1.47: Symmetric exponential distribution. This example features a simple trick. We want to sample from the symmetric exponential distribution with pdf
\[
f(x) = \frac{1}{2} e^{-|x|} .
\]
One way is to generate two independent random variables $Y$ and $Z$, where $Z \sim \mathcal{E}(1)$ and $P(Y = +1) = P(Y = -1) = \frac{1}{2}$. Taking $X = YZ$ we have that
\[
P(X \le x) = P(Y = +1, Z \le x) + P(Y = -1, Z \ge -x) = \frac{1}{2}\left(F_Z(x) + 1 - F_Z(-x)\right) ,
\]
and therefore, by differentiation,
\[
f_X(x) = \frac{1}{2}\left(f_Z(x) + f_Z(-x)\right) = \frac{1}{2} f_Z(|x|) .
\]
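The trick $X = YZ$ can be sketched as follows. The function name `sample_symmetric_exponential` is hypothetical, and $Z$ is drawn by the inverse method for $\mathcal{E}(1)$.

```python
import math
import random


def sample_symmetric_exponential(rng):
    """Return X = Y * Z, where Z ~ E(1) and Y is an independent fair random sign."""
    z = -math.log(1.0 - rng.random())    # Z ~ E(1) by the inverse method
    y = 1 if rng.random() < 0.5 else -1  # P(Y = +1) = P(Y = -1) = 1/2
    return y * z


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [sample_symmetric_exponential(rng) for _ in range(100_000)]
    print("empirical mean:", sum(xs) / len(xs))                    # symmetric law, mean 0
    print("empirical E|X|:", sum(abs(x) for x in xs) / len(xs))    # |X| = Z has mean 1
```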
It is not always easy to compute the inverse of the cumulative distribution function of the random variable to be generated. An alternative method is the method of acceptance-rejection below.

Let $\{Y_n\}_{n \ge 1}$ be a sequence of iid random variables with the probability density $g(x)$ satisfying for all $x \in \mathbb{R}$
\[
\frac{f(x)}{g(x)} \le c \tag{3.29}
\]
for some finite constant $c$ (necessarily greater than or equal to 1). Let $\{U_n\}_{n \ge 1}$ be a sequence of iid random variables uniformly distributed on $[0, 1]$.

Theorem 3.1.48 Let $\tau$ be the first index $n \ge 1$ for which
\[
U_n \le \frac{f(Y_n)}{c\,g(Y_n)}
\]
and let $Z = Y_\tau$. Then (a) $Z$ admits the probability density function $f$, and (b) $E[\tau] = c$.

Proof. We have
\[
P(Z \le x) = P(Y_\tau \le x) = \sum_{n \ge 1} P(\tau = n,\ Y_n \le x) .
\]
Denote by $A_k$ the event $\left\{U_k > \frac{f(Y_k)}{c\,g(Y_k)}\right\}$ and by $\overline{A}_k$ its complement. Then
\[
P(\tau = n,\ Y_n \le x) = P(A_1, \dots, A_{n-1}, \overline{A}_n,\ Y_n \le x) = P(A_1) \cdots P(A_{n-1})\, P(\overline{A}_n,\ Y_n \le x) .
\]
Now
\[
P(\overline{A}_k) = \int_{\mathbb{R}} P\left(U_k \le \frac{f(y)}{c\,g(y)}\right) g(y)\,dy = \int_{\mathbb{R}} \frac{f(y)}{c\,g(y)}\, g(y)\,dy = \int_{\mathbb{R}} \frac{f(y)}{c}\,dy = \frac{1}{c} ,
\]
and
\[
P(\overline{A}_k,\ Y_k \le x) = \int_{\mathbb{R}} P\left(U_k \le \frac{f(y)}{c\,g(y)}\right) 1_{\{y \le x\}}\, g(y)\,dy = \int_{-\infty}^x \frac{f(y)}{c\,g(y)}\, g(y)\,dy = \frac{1}{c} \int_{-\infty}^x f(y)\,dy .
\]
Therefore
\[
P(Z \le x) = \sum_{n \ge 1} \left(1 - \frac{1}{c}\right)^{n-1} \frac{1}{c} \int_{-\infty}^x f(y)\,dy = \int_{-\infty}^x f(y)\,dy .
\]
Also, using the above calculations,
\[
P(\tau = n) = P(A_1, \dots, A_{n-1}, \overline{A}_n) = P(A_1) \cdots P(A_{n-1})\, P(\overline{A}_n) = \left(1 - \frac{1}{c}\right)^{n-1} \frac{1}{c} ,
\]
from which it follows that $E[\tau] = c$.
The method depends on one’s ability to easily generate random vectors with the probability density g. Such a pdf must satisfy (3.29) and c should be as small as possible under this constraint.
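As an illustration of Theorem 3.1.48, here is a minimal Python sketch of acceptance-rejection. The target density $f(x) = 6x(1-x)$ on $[0,1]$, the uniform proposal $g$ and the constant $c = 1.5$ are choices made for this example only, and all function names are hypothetical. The two sequences $\{Y_n\}$ and $\{U_n\}$ are drawn independently of each other.

```python
import random


def accept_reject(f, g_pdf, g_sampler, c, rng):
    """Return (Z, tau): tau is the first n with U_n <= f(Y_n) / (c g(Y_n)), Z = Y_tau.
    Assumes f(x) <= c * g(x) for all x, as in (3.29)."""
    tau = 0
    while True:
        tau += 1
        y = g_sampler(rng)  # Y_n drawn from the proposal density g
        u = rng.random()    # U_n uniform on [0, 1)
        # Same test as U_n <= f(Y_n) / (c g(Y_n)), written without a division.
        if u * c * g_pdf(y) <= f(y):
            return y, tau


def target_pdf(x):
    """Illustrative target density f(x) = 6 x (1 - x) on [0, 1]."""
    return 6.0 * x * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0


def uniform_pdf(x):
    """Proposal density g: uniform on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def uniform_sampler(rng):
    return rng.random()


C = 1.5  # sup of f / g, attained at x = 1/2

if __name__ == "__main__":
    rng = random.Random(0)
    runs = [accept_reject(target_pdf, uniform_pdf, uniform_sampler, C, rng)
            for _ in range(50_000)]
    print("empirical mean of Z:", sum(z for z, _ in runs) / len(runs))      # f has mean 1/2
    print("empirical mean of tau:", sum(t for _, t in runs) / len(runs))    # theorem: E[tau] = c
```

The empirical mean of $\tau$ shows why $c$ should be as small as possible: it is the average number of proposals consumed per accepted sample.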
3.1.4 Characteristic Functions
Recall that for a complex-valued random variable $X = X_R + iX_I$, where $X_R$ and $X_I$ are real-valued integrable random variables, $E[X] = E[X_R] + iE[X_I]$ defines the expectation of $X$. The characteristic function $\varphi_X$ of a real-valued random variable $X$ is defined by
\[
\varphi_X(u) = E\left[e^{iuX}\right] .
\]
Similarly, the characteristic function $\varphi_X : \mathbb{R}^d \to \mathbb{C}$ of a real random vector $X \in \mathbb{R}^d$ is defined by
\[
\varphi_X(u) = E\left[e^{iu^T X}\right] .
\]

Theorem 3.1.49 Let $X \in \mathbb{R}^d$ be a random vector with characteristic function $\varphi$. Then for all $1 \le j \le d$, all $a_j, b_j \in \mathbb{R}$ such that $a_j < b_j$,
\[
\lim_{c \uparrow +\infty} \frac{1}{(2\pi)^d} \int_{-c}^{+c} \cdots \int_{-c}^{+c} \left( \prod_{j=1}^d \frac{e^{-iu_j a_j} - e^{-iu_j b_j}}{iu_j} \right) \varphi(u_1, \dots, u_d)\,du_1 \cdots du_d = E\left[ \prod_{j=1}^d \left( \cdots \right) \right]
\]

[…]

[…] $a > E[X_1]$, $h^+(a)$ is positive. Similarly to (4.15), we obtain that
\[
P\left( \sum_{i=1}^n X_i \le na \right) \le e^{-n h^-(a)} ,
\]
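where
\[
h^-(a) = \sup_{t \le 0} \left\{ at - \ln E\left[e^{tX_1}\right] \right\} .
\]
Moreover, if $a < E[X_1]$, $h^-(a)$ is positive.

The Chernoff bound can be interpreted in terms of large deviations from the law of large numbers. Denote by $\mu$ the common mean of the $X_n$'s, and define for $\varepsilon > 0$ the (positive) quantities
\[
H^+(\varepsilon) = \sup_{t \ge 0} \left\{ \varepsilon t - \ln E\left[e^{t(X_1 - \mu)}\right] \right\} , \qquad
H^-(\varepsilon) = \sup_{t \le 0} \left\{ -\varepsilon t - \ln E\left[e^{t(X_1 - \mu)}\right] \right\} .
\]
(These are $h^+(a)$ and $h^-(a)$ for the centered variables $X_n - \mu$, with $a = +\varepsilon$ and $a = -\varepsilon$ respectively.) Then
\[
P\left( \left| \frac{1}{n} \sum_{i=1}^n X_i - \mu \right| \ge \varepsilon \right) \le e^{-nH^+(\varepsilon)} + e^{-nH^-(\varepsilon)} .
\]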
Remark 4.1.19 The computation of the supremum in (4.15) may be tedious. There are shortcuts leading to practical bounds that are not as good but nevertheless satisfactory for certain applications.

Example 4.1.20: Suppose for instance that $\{X_n\}_{n \ge 1}$ is iid, the $X_n$'s taking the values $-1$ and $+1$ equiprobably, so that $E[e^{tX_1}] = \frac{1}{2} e^{+t} + \frac{1}{2} e^{-t}$. We do not keep this expression as such but instead replace it by an upper bound, namely $e^{t^2/2}$, and therefore, for $a > 0$ and $t \ge 0$,
\[
P\left( \sum_{i=1}^n X_i \ge na \right) \le e^{-n\left(at - \ln E[e^{tX_1}]\right)} \le e^{-n\left(at - \frac{1}{2} t^2\right)} ,
\]
so that, with $t = a$,
\[
P\left( \sum_{i=1}^n X_i \ge na \right) \le e^{-\frac{1}{2} n a^2} .
\]
By symmetry of the distribution of $\sum_{i=1}^n X_i$, one would obtain for $a > 0$
\[
P\left( \sum_{i=1}^n X_i \le -na \right) = P\left( \sum_{i=1}^n X_i \ge na \right) \le e^{-\frac{1}{2} n a^2} ,
\]
and therefore, combining the two bounds,
\[
P\left( \left| \sum_{i=1}^n X_i \right| \ge na \right) \le 2 e^{-\frac{1}{2} n a^2} .
\]
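To see how loose or tight the bound $e^{-na^2/2}$ is, one can compare it with the exact tail, which is computable here because $\sum_{i=1}^n X_i = 2B - n$ with $B$ binomial of size $n$ and parameter $\frac12$. The following Python sketch does this; the function names are hypothetical, and $0 < a \le 1$ is assumed.

```python
import math


def exact_tail(n, a):
    """Exact P(X_1 + ... + X_n >= n a) for iid equiprobable signs:
    the sum equals 2B - n with B ~ Bin(n, 1/2), so the event is B >= n (1 + a) / 2."""
    k_min = max(0, math.ceil(n * (1 + a) / 2 - 1e-12))  # tolerance absorbs float rounding
    return sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2 ** n


def chernoff_bound(n, a):
    """Upper bound exp(-n a^2 / 2) obtained with t = a."""
    return math.exp(-n * a * a / 2)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        for a in (0.1, 0.3, 0.5):
            print(n, a, exact_tail(n, a), chernoff_bound(n, a))
```

In every row printed, the exact probability stays below the bound, as the inequality requires.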
CHAPTER 4. CONVERGENCES
4.2 Two Other Types of Convergence

These are (i) convergence in probability, the “parent pauvre” of almost-sure convergence, and (ii) convergence in the quadratic mean, that is, convergence in $L^2_{\mathbb{C}}(P)$. (Convergences in distribution and in variation will be treated in the next section.)

4.2.1 Convergence in Probability

Recall the definition already given for discrete random variables in the first chapter:

Definition 4.2.1 A sequence $\{Z_n\}_{n \ge 1}$ of random variables is said to converge in probability to the random variable $Z$ if, for all $\varepsilon > 0$,
\[
\lim_{n \uparrow \infty} P(|Z_n - Z| \ge \varepsilon) = 0 . \tag{4.16}
\]
Example 4.2.2: Bernstein's polynomial approximation. This example is a particular instance of the fruitful interaction between probability and analysis. Here, we shall give a probabilistic proof of the fact that a continuous function $f$ from $[0, 1]$ into $\mathbb{R}$ can be uniformly approximated by a polynomial. More precisely, for all $x \in [0, 1]$,
\[
f(x) = \lim_{n \uparrow \infty} P_n(x) , \tag{$\star$}
\]
where
\[
P_n(x) = \sum_{k=0}^n f\left(\frac{k}{n}\right) \frac{n!}{k!(n-k)!}\, x^k (1-x)^{n-k} ,
\]
and the convergence is uniform in $[0, 1]$. A proof of this classical theorem of analysis using probabilistic arguments, and in particular the notion of convergence in probability, is as follows. Let $S_n \sim \mathcal{B}(n, x)$. Then
\[
E\left[ f\left(\frac{S_n}{n}\right) \right] = \sum_{k=0}^n f\left(\frac{k}{n}\right) P(S_n = k) = \sum_{k=0}^n f\left(\frac{k}{n}\right) \frac{n!}{k!(n-k)!}\, x^k (1-x)^{n-k} = P_n(x) .
\]
The function $f$ is continuous on the closed bounded interval $[0, 1]$ and therefore uniformly continuous on this interval. Therefore to any $\varepsilon > 0$, one can associate a number $\delta(\varepsilon)$ such that if $|y - x| \le \delta(\varepsilon)$, then $|f(x) - f(y)| \le \varepsilon$. Being continuous on $[0, 1]$, $f$ is bounded on $[0, 1]$ by some finite number, say $M$. Now
\[
|P_n(x) - f(x)| = \left| E\left[ f\left(\frac{S_n}{n}\right) - f(x) \right] \right| \le E\left[ \left| f\left(\frac{S_n}{n}\right) - f(x) \right| \right]
= E\left[ \left| f\left(\frac{S_n}{n}\right) - f(x) \right| 1_A \right] + E\left[ \left| f\left(\frac{S_n}{n}\right) - f(x) \right| 1_{\overline{A}} \right] ,
\]
where $A := \{ |S_n(\omega)/n - x| \le \delta(\varepsilon) \}$ and $\overline{A}$ is its complement. Since $|f(S_n/n) - f(x)|\, 1_{\overline{A}} \le 2M 1_{\overline{A}}$, we have
\[
E\left[ \left| f\left(\frac{S_n}{n}\right) - f(x) \right| 1_{\overline{A}} \right] \le 2M P(\overline{A}) \le 2M P\left( \left| \frac{S_n}{n} - x \right| \ge \delta(\varepsilon) \right) .
\]
Also, by definition of $A$ and $\delta(\varepsilon)$,
\[
E\left[ \left| f\left(\frac{S_n}{n}\right) - f(x) \right| 1_A \right] \le \varepsilon .
\]
Therefore
\[
|P_n(x) - f(x)| \le \varepsilon + 2M P\left( \left| \frac{S_n}{n} - x \right| \ge \delta(\varepsilon) \right) .
\]
But $x$ is the mean of $S_n/n$, and the variance of $S_n/n$ is $x(1-x)/n \le 1/(4n)$. Therefore, by Tchebyshev's inequality,
\[
P\left( \left| \frac{S_n}{n} - x \right| \ge \delta(\varepsilon) \right) \le \frac{1}{4n[\delta(\varepsilon)]^2} .
\]
Finally
\[
|f(x) - P_n(x)| \le \varepsilon + \frac{M}{2n[\delta(\varepsilon)]^2} .
\]
Since $\varepsilon > 0$ is otherwise arbitrary, this suffices to prove the convergence in ($\star$). The convergence is uniform since the right-hand side of the latter inequality does not depend on $x \in [0, 1]$.
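The uniform convergence of the Bernstein polynomials can be observed numerically. The sketch below evaluates $P_n$ directly from its definition and measures the largest error on a finite grid; the test function $f(x) = |x - \frac12|$, the grid size and the function names are assumptions of this illustration.

```python
import math


def bernstein(f, n, x):
    """Bernstein polynomial P_n(x) = sum_k f(k/n) C(n, k) x^k (1 - x)^(n - k)."""
    return sum(f(k / n) * math.comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(n + 1))


def sup_error(f, n, grid=201):
    """Largest value of |P_n(x) - f(x)| over an evenly spaced grid of [0, 1]."""
    points = [i / (grid - 1) for i in range(grid)]
    return max(abs(bernstein(f, n, x) - f(x)) for x in points)


def kink(x):
    """Continuous but not differentiable at 1/2, so the convergence is slow."""
    return abs(x - 0.5)


if __name__ == "__main__":
    for n in (10, 100, 400):
        print(n, sup_error(kink, n))
```

For this $f$ the error decreases roughly like $n^{-1/2}$, which is the rate suggested by the standard deviation of $S_n/n$.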
There is a Cauchy-type criterion for convergence in probability.

Theorem 4.2.3 For a sequence $\{Z_n\}_{n \ge 1}$ of random variables to converge in probability to some random variable, it is necessary and sufficient that for all $\varepsilon > 0$,
\[
\lim_{m,n \uparrow \infty} P(|Z_m - Z_n| \ge \varepsilon) = 0 .
\]

Proof. Necessity. We have the inclusion
\[
\{ |Z_m - Z_n| \ge \varepsilon \} \subseteq \left\{ |Z_m - Z| \ge \tfrac{1}{2}\varepsilon \right\} \cup \left\{ |Z_n - Z| \ge \tfrac{1}{2}\varepsilon \right\}
\]
and therefore
\[
P(|Z_m - Z_n| \ge \varepsilon) \le P\left(|Z_m - Z| \ge \tfrac{1}{2}\varepsilon\right) + P\left(|Z_n - Z| \ge \tfrac{1}{2}\varepsilon\right) .
\]
Sufficiency. Let $n_1 := 1$ and let, for $j \ge 2$,
\[
n_j = \inf\left\{ N > n_{j-1}\ ;\ P\left(|Z_r - Z_s| > \frac{1}{2^j}\right) < \frac{1}{3^j} \text{ if } r, s > N \right\} .
\]
Then
\[
\sum_j P\left( |Z_{n_j} - Z_{n_{j-1}}| > \frac{1}{2^{j-1}} \right) < \infty ,
\]
and therefore there exists a random variable $Z$ such that $Z_{n_j} \xrightarrow{a.s.} Z$ as $j \uparrow \infty$. Now
\[
P(|Z - Z_n| \ge \varepsilon) \le P\left(|Z_n - Z_{n_j}| \ge \tfrac{1}{2}\varepsilon\right) + P\left(|Z_{n_j} - Z| \ge \tfrac{1}{2}\varepsilon\right)
\]
can be made arbitrarily close to 0 as $n \uparrow \infty$, by definition of the $n_j$'s and the fact that almost-sure convergence implies convergence in probability, as we shall see next, in Theorem 4.5.1.

In fact, there exists a distance between random variables that metrizes convergence in probability, namely
\[
d(X, Y) := E\left[ |X - Y| \wedge 1 \right] .
\]
The verification that $d$ is indeed a metric is left as an exercise.
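Theorem 4.2.4 The sequence $\{X_n\}_{n \ge 1}$ converges in probability to the variable $X$ if and only if
\[
\lim_{n \uparrow \infty} d(X_n, X) = 0 .
\]

Proof. If: By Markov's inequality, for $\varepsilon \in (0, 1]$,
\[
P(|X_n - X| \ge \varepsilon) = P(|X_n - X| \wedge 1 \ge \varepsilon) \le \frac{d(X_n, X)}{\varepsilon} .
\]
Only if: For all $\varepsilon > 0$,
\[
d(X_n, X) = \int_{\{|X_n - X| \ge \varepsilon\}} (|X_n - X| \wedge 1)\,dP + \int_{\{|X_n - X| < \varepsilon\}} (|X_n - X| \wedge 1)\,dP \le P(|X_n - X| \ge \varepsilon) + \varepsilon .
\]
Therefore $\limsup_{n \uparrow \infty} d(X_n, X) \le \varepsilon$. Since $\varepsilon > 0$ is arbitrary, we have shown that $\lim_{n \uparrow \infty} d(X_n, X) = 0$.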
4.2.2 Convergence in $L^p$

Definition 4.2.5 Let $p$ be a positive integer. A sequence $\{Z_n\}_{n \ge 1}$ of complex random variables of $L^p_{\mathbb{C}}(P)$ is said to converge in $L^p$ to the complex random variable $Z \in L^p_{\mathbb{C}}(P)$ if
\[
\lim_{n \uparrow \infty} E[|Z_n - Z|^p] = 0 . \tag{4.17}
\]
In the case $p = 2$, the sequence $\{Z_n\}_{n \ge 1}$ of square-integrable complex random variables is said to converge in the quadratic mean to $Z$. By the Riesz–Fischer theorem (Theorem 2.3.22):

Theorem 4.2.6 For the sequence $\{Z_n\}_{n \ge 1}$ of complex random variables of $L^p_{\mathbb{C}}(P)$ to converge in $L^p$ to some random variable $Z \in L^p_{\mathbb{C}}(P)$, it is necessary and sufficient that
\[
\lim_{n,m \uparrow \infty} E[|Z_n - Z_m|^p] = 0 . \tag{4.18}
\]

Recall that $L^2_{\mathbb{C}}(P)$ is a Hilbert space with inner product $\langle X, Y \rangle = E[XY^*]$, which has the following property of continuity.

Theorem 4.2.7 Let $\{X_n\}_{n \ge 1}$ and $\{Y_n\}_{n \ge 1}$ be two sequences of square-integrable complex random variables that converge in quadratic mean to the square-integrable complex random variables $X$ and $Y$, respectively. Then,
\[
\lim_{n,m \uparrow \infty} E[X_n Y_m^*] = E[XY^*] . \tag{4.19}
\]
Proof. We have
\[
|E[X_n Y_m^*] - E[XY^*]| = |E[(X_n - X)(Y_m - Y)^*] + E[(X_n - X)Y^*] + E[X(Y_m - Y)^*]|
\le |E[(X_n - X)(Y_m - Y)^*]| + |E[(X_n - X)Y^*]| + |E[X(Y_m - Y)^*]| ,
\]
and the right-hand side of this inequality is, by Schwarz's inequality, less than
\[
E[|X_n - X|^2]^{\frac{1}{2}} E[|Y_m - Y|^2]^{\frac{1}{2}} + E[|X_n - X|^2]^{\frac{1}{2}} E[|Y|^2]^{\frac{1}{2}} + E[|X|^2]^{\frac{1}{2}} E[|Y_m - Y|^2]^{\frac{1}{2}} ,
\]
which tends to 0 as $n, m \uparrow \infty$.
Example 4.2.8: $L^2$-convergence of Series. Let $\{A_n\}_{n \in \mathbb{Z}}$ and $\{B_n\}_{n \in \mathbb{Z}}$ be two sequences of centered square-integrable complex random variables such that
\[
\sum_{j \in \mathbb{Z}} E[|A_j|^2] < \infty , \qquad \sum_{j \in \mathbb{Z}} E[|B_j|^2] < \infty .
\]
Suppose, moreover, that for all $i \ne j$,
\[
E[A_i A_j^*] = E[B_i B_j^*] = E[A_i B_j^*] = 0 .
\]
Define
\[
U_n = \sum_{j=-n}^n A_j , \qquad V_n = \sum_{j=-n}^n B_j .
\]
Then $\{U_n\}_{n \ge 1}$ (resp., $\{V_n\}_{n \ge 1}$) converges in quadratic mean to some square-integrable random variable $U$ (resp., $V$), and
\[
E[U] = E[V] = 0 \quad \text{and} \quad E[UV^*] = \sum_{j \in \mathbb{Z}} E[A_j B_j^*] .
\]

Proof. For $m > n$, we have
\[
E[|U_n - U_m|^2] = E\left[ \left| \sum_{n < |j| \le m} A_j \right|^2 \right] = \sum_{n < |j| \le m}\ \sum_{n < |i| \le m} E[A_j A_i^*] = \sum_{n < |j| \le m} E[|A_j|^2] ,
\]
since when $i \ne j$, $E[A_j A_i^*] = 0$. The conclusion then follows from the Cauchy criterion for convergence in quadratic mean, since
\[
\lim_{m,n \uparrow \infty} E[|U_n - U_m|^2] = \lim_{m,n \uparrow \infty} \sum_{n < |j| \le m} E[|A_j|^2] = 0 ,
\]
in view of the hypothesis $\sum_{j \in \mathbb{Z}} E[|A_j|^2] < \infty$. By continuity of the inner product in $L^2_{\mathbb{C}}(P)$,
\[
E[UV^*] = \lim_{n \uparrow \infty} E[U_n V_n^*] = \lim_{n \uparrow \infty} \sum_{j=-n}^n \sum_{\ell=-n}^n E[A_j B_\ell^*] = \lim_{n \uparrow \infty} \sum_{j=-n}^n E[A_j B_j^*] = \sum_{j \in \mathbb{Z}} E[A_j B_j^*] .
\]
4.2.3 Uniform Integrability

The monotone and dominated convergence theorems are not the only tools at our disposal giving conditions under which it is possible to exchange limits and expectations. Uniform integrability, which will be introduced now, is another such sufficient condition.

Definition 4.2.9 A collection $\{X_i\}_{i \in I}$ (where $I$ is an arbitrary index set) of integrable random variables is called uniformly integrable if
\[
\lim_{c \uparrow \infty} \int_{\{|X_i| > c\}} |X_i|\,dP = 0 \quad \text{uniformly in } i \in I .
\]

Example 4.2.10: Collection Dominated by an Integrable Variable. If, for some integrable random variable $X$, $P(|X_i| \le |X|) = 1$ for all $i \in I$, then $\{X_i\}_{i \in I}$ is uniformly integrable. Indeed, in this case,
\[
\int_{\{|X_i| > c\}} |X_i|\,dP \le \int_{\{|X| > c\}} |X|\,dP
\]
and by monotone convergence the right-hand side of the above inequality tends to 0 as $c \uparrow \infty$.

Remark 4.2.11 Clearly, if one adds a finite number of integrable variables to a uniformly integrable collection, the augmented collection will also be uniformly integrable.

Theorem 4.2.12 The collection $\{X_i\}_{i \in I}$ of integrable random variables is uniformly integrable if and only if

(a) $\sup_i E[|X_i|] < \infty$, and

(b) for every $\varepsilon > 0$, there exists a $\delta(\varepsilon) > 0$ such that
\[
\sup_i \int_A |X_i|\,dP \le \varepsilon \quad \text{whenever } P(A) \le \delta(\varepsilon) .
\]
(In other words, $\int_A |X_i|\,dP \to 0$ uniformly in $i$ as $P(A) \to 0$.)
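Proof. Assume uniform integrability. For any $\varepsilon > 0$, there exists a $c$ such that $\int_{\{|X_i| > c\}} |X_i|\,dP \le \frac{1}{2}\varepsilon$ for all $i \in I$. For all $A \in \mathcal{F}$, all $i \in I$,
\[
\int_A |X_i|\,dP \le cP(A) + \int_{\{|X_i| > c\}} |X_i|\,dP \le cP(A) + \frac{1}{2}\varepsilon .
\]
Therefore we have (b) by taking $\delta(\varepsilon) = \frac{\varepsilon}{2c}$ and (a) with $A = \Omega$.

Conversely, let $M := \sup_i E[|X_i|] < \infty$. Let $\varepsilon$ and $\delta(\varepsilon)$ be as in (b). Let $c_0 := \frac{M}{\delta(\varepsilon)}$. For all $c \ge c_0$ and all $i \in I$, $P(|X_i| > c) \le \delta(\varepsilon)$ (Markov's inequality). Apply (b) with $A = \{|X_i| > c\}$ to obtain that $\sup_i \int_{\{|X_i| > c\}} |X_i|\,dP \le \varepsilon$.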
Since the “collection” consisting of a single integrable variable $X$ is uniformly integrable, condition (b) of the theorem above reads
\[
\sup_{A\,;\,P(A) < \delta} E[|X| 1_A] \to 0 \quad \text{as } \delta \to 0 .
\]

[…]

[…] $a > 0$, $E[|X_i| 1_{\{|X_i| \ge a\}}] \le E[Z_i 1_{\{Z_i \ge a\}}]$, where $Z_i := E[|Y| \mid \mathcal{F}_i]$. By definition of conditional expectation, since $\{Z_i \ge a\} \in \mathcal{F}_i$, $E[(|Y| - Z_i) 1_{\{Z_i \ge a\}}] = 0$ and therefore
\[
E[|X_i| 1_{\{|X_i| \ge a\}}] \le E[|Y| 1_{\{Z_i \ge a\}}] . \tag{$\star$}
\]
By Markov's inequality,
\[
P(Z_i \ge a) \le \frac{E[Z_i]}{a} = \frac{E[|Y|]}{a} ,
\]
and therefore $P(Z_i \ge a) \to 0$ as $a \to \infty$ uniformly in $i$. Use (15.42) to obtain that $E[|Y| 1_{\{Z_i \ge a\}}] \to 0$ as $a \to \infty$ uniformly in $i$. Conclude with ($\star$).
Theorem 4.2.14 A sufficient condition for the collection $\{X_i\}_{i \in I}$ of integrable random variables to be uniformly integrable is the existence of a non-negative non-decreasing function $G : \mathbb{R} \to \mathbb{R}$ such that
\[
\lim_{t \uparrow \infty} \frac{G(t)}{t} = +\infty
\]
and
\[
\sup_i E[G(|X_i|)] < \infty .
\]

Proof. Fix $\varepsilon > 0$ and let $a = \frac{M}{\varepsilon}$, where $M := \sup_i E[G(|X_i|)]$. Take $c$ large enough so that $G(t)/t \ge a$ for $t \ge c$. In particular, $|X_i| \le \frac{G(|X_i|)}{a}$ on $\{|X_i| > c\}$ and therefore
\[
\int_{\{|X_i| > c\}} |X_i|\,dP \le \frac{1}{a} E\left[ G(|X_i|) 1_{\{|X_i| > c\}} \right] \le \frac{M}{a} = \varepsilon
\]
uniformly in $i$.

Example 4.2.15: Two Sufficient Conditions for Uniform Integrability. Two frequently used sufficient conditions guaranteeing uniform integrability are
\[
\sup_i E\left[ |X_i|^{1+\alpha} \right] < \infty \quad (\alpha > 1)
\]
and
\[
\sup_i E\left[ |X_i| \log^+ |X_i| \right] < \infty .
\]
Almost-sure convergence of a sequence of integrable random variables to an integrable random variable does not necessarily imply convergence in $L^1$. However:
Theorem 4.2.16 Let $\{X_n\}_{n \ge 1}$ be a sequence of integrable random variables and let $X$ be some random variable. The following are equivalent:

(a) $\{X_n\}_{n \ge 1}$ is uniformly integrable and $X_n \xrightarrow{Pr} X$ as $n \to \infty$.

(b) $X$ is integrable and $X_n \xrightarrow{L^1} X$ as $n \to \infty$.

Proof. (a) implies (b): Since $X_n \xrightarrow{Pr} X$, there exists a subsequence $\{X_{n_k}\}_{k \ge 1}$ such that $X_{n_k} \xrightarrow{a.s.} X$. By Fatou's lemma,
\[
E[|X|] \le \liminf_k E[|X_{n_k}|] \le \sup_k E[|X_{n_k}|] \le \sup_n E[|X_n|] < \infty .
\]
Therefore $X \in L^1$. Also for fixed $\varepsilon > 0$,

[…]

[…] $> 0$ be given and let $n_0$ be such that $E[|X_n - X|] \le \varepsilon$ for all $n \ge n_0$. The random variables $X, X_1, \dots, X_{n_0}$ being integrable, there exists a $\delta > 0$ such that if $P(A) \le \delta$, $\int_A |X|\,dP \le \varepsilon$ and $\int_A |X_n|\,dP \le 2\varepsilon$ for $n \le n_0$. If $n \ge n_0$, by the triangle inequality,
\[
\int_A |X_n|\,dP \le \int_A |X|\,dP + \int_A |X_n - X|\,dP \le 2\varepsilon ,
\]
and therefore (b) of Theorem 4.2.12 is satisfied. Condition (a) of Theorem 4.2.12 is satisfied since $E[|X_n|] \le E[|X_n - X|] + E[|X|]$.
4.3 Zero-one Laws

4.3.1 Kolmogorov's Zero-one Law

Definition 4.3.1 Let $\{X_n\}_{n \ge 1}$ be a sequence of random variables and let $\mathcal{F}_n^X := \sigma(X_1, \dots, X_n)$. The σ-field $\mathcal{T}^X := \cap_{n \ge 1} \sigma(X_n, X_{n+1}, \dots)$ is called the tail σ-field of this sequence.

Example 4.3.2: For any $a \in \mathbb{R}$, the event $\left\{ \lim_{n \uparrow \infty} \frac{X_1 + \cdots + X_n}{n} \le a \right\}$ belongs to the tail σ-field, since the existence and the value of the limit of $\frac{X_1 + \cdots + X_n}{n}$ do not depend on any fixed finite number of terms of the sequence. More generally, any event concerning $\lim_{n \uparrow \infty} \frac{X_1 + \cdots + X_n}{n}$ such as, for instance, the event that such a limit exists, is in the tail σ-field.

Recall the notation $\mathcal{F}_\infty^X := \vee_{n \ge 1} \mathcal{F}_n^X$.

Theorem 4.3.3 The tail σ-field of a sequence $\{X_n\}_{n \ge 1}$ of independent random variables is trivial, that is, if $A \in \mathcal{T}^X$, then $P(A) = 0$ or 1.

Proof. The σ-fields $\mathcal{F}_n^X$ and $\sigma(X_{n+k}, X_{n+k+1}, \dots)$ are independent for all $k \ge 1$ and therefore, since $\mathcal{T}^X = \cap_{k \ge 1} \sigma(X_{n+k}, X_{n+k+1}, \dots)$, the σ-fields $\mathcal{F}_n^X$ and $\mathcal{T}^X$ are independent. Therefore the algebra $\cup_{n \ge 1} \mathcal{F}_n^X$ and $\mathcal{T}^X$ are independent, and consequently (Theorem 3.1.39) $\mathcal{F}_\infty^X$ and $\mathcal{T}^X$ are independent. But $\mathcal{F}_\infty^X \supseteq \mathcal{T}^X$, so that $\mathcal{T}^X$ is independent of itself. In particular, for all $A \in \mathcal{T}^X$, $P(A \cap A) = P(A)P(A)$, that is, $P(A) = P(A)^2$, which implies that $P(A) = 0$ or 1.
4.3.2 The Hewitt–Savage Zero-one Law

Let $(S, \mathcal{S})$ be some measurable space and let $\mu$ be a probability measure on it. We shall work on the canonical measurable space $(\Omega, \mathcal{F}) := (S^{\mathbb{N}}, \mathcal{S}^{\otimes \mathbb{N}})$ of $S$-valued random sequences, endowed with the probability measure $P := \mu^{\otimes \infty}$. In particular, an element of $\Omega$ has the form $\omega := x := (x_1, x_2, \dots) \in S^{\mathbb{N}}$ and moreover, the sequence $\{X_n\}_{n \ge 1}$ defined by $X_n(\omega) := x_n$ $(\omega \in \Omega, n \ge 1)$ is iid with common probability distribution $\mu$.

Definition 4.3.4 (a) A finite permutation of $\mathbb{N}$ is a permutation $\pi$ such that $\pi(i) = i$ for all but a finite number of indices $i \ge 1$. (b) An event $A \in \mathcal{F}$ such that $\pi^{-1}A = A$ for all finite permutations $\pi$ is called exchangeable. (c) The sub-σ-field $\mathcal{E}$ consisting of the collection of exchangeable events is called the exchangeable σ-field.

(Note that $X_n(\pi\omega) = X_{\pi(n)}(\omega)$ for any permutation $\pi$ on $\Omega$.)

Example 4.3.5: Tail events are exchangeable. All the events of the tail σ-field $\mathcal{T}$ are exchangeable. Indeed, for all $n \ge 1$, an event $B \in \sigma(X_{n+1}, X_{n+2}, \dots)$ is unaltered by a permutation bearing on only the first $n$ coordinates. Therefore any event $B \in \cap_{n \ge 1} \sigma(X_{n+1}, X_{n+2}, \dots)$ is unaltered by any finite permutation.

Example 4.3.6: There exist exchangeable events that are not tail events. In Example 4.3.5, we have seen that $\mathcal{T} \subset \mathcal{E}$. The current example shows that we do not have the reverse inclusion. Indeed, the event $A := \{X_1 + \cdots + X_n \in C \text{ i.o.}\}$ is exchangeable (if the finite permutation $\pi$ bears on only the first $K$ integers, then for all $n \ge K + 1$, $X_1 + \cdots + X_n = X_{\pi(1)} + \cdots + X_{\pi(n)}$). However, it is not a tail event.
Theorem 4.3.7 The events of the exchangeable σ-field are trivial, that is, for any $A \in \mathcal{E}$, $P(A) = 0$ or 1.

The proof depends on the following lemma of approximation of an element of $\mathcal{F}$ by an element of an algebra $\mathcal{A}$ generating $\mathcal{F}$. More precisely (recalling the notation $A \Delta B := (A - A \cap B) \cup (B - A \cap B)$):

Lemma 4.3.8 Let $\mathcal{A}$ be an algebra generating the σ-field $\mathcal{F}$ and let $P$ be a probability on $\mathcal{F}$. With any event $B \in \mathcal{F}$ and any $\varepsilon > 0$, one can associate an event $A \in \mathcal{A}$ such that $P(A \Delta B) \le \varepsilon$.

Proof. The collection of sets
\[
\mathcal{G} := \{ B \in \mathcal{F}\ ;\ \forall \varepsilon > 0,\ \exists A \in \mathcal{A} \text{ with } P(A \Delta B) \le \varepsilon \}
\]
contains $\mathcal{A}$. It is moreover a σ-field, as we now show. First, $\Omega \in \mathcal{A} \subseteq \mathcal{G}$ and the stability of $\mathcal{G}$ under complementation is clear. For the stability of $\mathcal{G}$ under countable unions, let $B_n$ $(n \ge 1)$ be in $\mathcal{G}$ and let $\varepsilon > 0$ be given. By definition of $\mathcal{G}$, there exist $A_n$'s in $\mathcal{A}$ such that $P(A_n \Delta B_n) \le 2^{-n-1}\varepsilon$. Therefore, for all $K \ge 1$,
\[
P\left( (\cup_{n=1}^K A_n) \Delta (\cup_{n=1}^K B_n) \right) \le \sum_{n=1}^K 2^{-n-1}\varepsilon \le \sum_{n \ge 1} 2^{-n-1}\varepsilon = 2^{-1}\varepsilon .
\]
By the sequential continuity property of probability, there exists an integer $K = K(\varepsilon)$ such that $P(\cup_{n \ge 1} B_n - \cup_{n=1}^K B_n) \le 2^{-1}\varepsilon$. Therefore, for such an integer,
\[
P\left( (\cup_{n=1}^K A_n) \Delta (\cup_{n \ge 1} B_n) \right) \le \varepsilon .
\]
The proof of stability of $\mathcal{G}$ under countable unions is completed since $\mathcal{A}$ is an algebra and therefore $\cup_{n=1}^K A_n \in \mathcal{A}$. Therefore $\mathcal{G}$ is a σ-field containing $\mathcal{A}$ and consequently contains the σ-field $\mathcal{F}$ generated by $\mathcal{A}$.

We now proceed to the proof of Theorem 4.3.7.

Proof. Let $A \in \mathcal{E}$. Lemma 4.3.8 guarantees that for any $n \ge 1$, there exists an $A_n \in \sigma(X_1, \dots, X_n)$ such that
\[
P(A_n \Delta A) \to 0 .
\]
Note that for all $n \ge 1$, $A_n = \{\omega\ ;\ (x_1, \dots, x_n) \in B_n\}$ for some $B_n \in \mathcal{S}^{\otimes n}$. Define the finite permutation $\pi_n = \pi$ by
\[
\pi(j) = \begin{cases} j + n & \text{if } 1 \le j \le n , \\ j - n & \text{if } n + 1 \le j \le 2n , \\ j & \text{if } j \ge 2n + 1 . \end{cases}
\]
Note that $\pi \circ \pi$ is the identity, and in particular $\pi = \pi^{-1}$, and that the sequence obtained by a finite permutation of an iid sequence is iid. Therefore
\[
P(\omega\ ;\ \omega \in A_n \Delta A) = P(\omega\ ;\ \pi\omega \in A_n \Delta A) . \tag{$\star$}
\]
Now $\{\omega\ ;\ \pi\omega \in A\} = \{\omega\ ;\ \omega \in A\}$ by exchangeability of $A$, and
\[
\{\omega\ ;\ \pi\omega \in A_n\} = \{\omega\ ;\ (x_{n+1}, \dots, x_{2n}) \in B_n\} .
\]
Therefore, denoting by $A_n'$ the event in the right-hand side of the above equality,
\[
\{\omega\ ;\ \pi\omega \in A_n \Delta A\} = \{\omega\ ;\ \omega \in A_n' \Delta A\} . \tag{$\star\star$}
\]
Combining ($\star$) and ($\star\star$):
\[
P(A_n \Delta A) = P(A_n' \Delta A) . \tag{$\dagger$}
\]
From the set inclusion $A \Delta C \subseteq (A \Delta B) \cup (B \Delta C)$, ($\dagger$) and $P(A_n \Delta A) \to 0$,
\[
P(A_n \Delta A_n') \le P(A_n \Delta A) + P(A \Delta A_n') \to 0 . \tag{$\dagger\dagger$}
\]
Therefore
\[
0 \le P(A_n) - P(A_n \cap A_n') \le P(A_n \cup A_n') - P(A_n \cap A_n') = P(A_n \Delta A_n') \to 0 .
\]
Therefore $P(A_n \cap A_n') \to P(A)$. Since $A_n$ and $A_n'$ are independent (and recalling that $P(A_n) = P(A_n')$),
\[
P(A_n \cap A_n') = P(A_n)P(A_n') = P(A_n)^2 \to P(A)^2 .
\]
Comparing the two limits of $P(A_n \cap A_n')$, we see that $P(A)^2 = P(A)$, which implies that $P(A) = 0$ or $P(A) = 1$.

The Hewitt–Savage zero-one law will now be applied to the asymptotic behavior of random walks. By definition, a random walk on $\mathbb{R}$ is a sequence $\{S_n\}_{n \ge 1}$ of real-valued random variables of the form $S_n = X_1 + \cdots + X_n$, where $\{X_n\}_{n \ge 1}$ is an iid sequence of real-valued random variables.

Theorem 4.3.9 Discarding the trivial case where $P(X_1 = 0) = 1$, with probability one, one and only one of the following occurs:

(a) $\lim_n S_n = +\infty$,

(b) $\lim_n S_n = -\infty$,

(c) $-\infty = \liminf_n S_n < \limsup_n S_n = +\infty$.

If, moreover, the distribution of $X_1$ is symmetric around 0, (c) occurs with probability 1.
Proof. The random variable $\limsup_n S_n$ is exchangeable (its value is unaffected by any finite permutation of the $X_i$'s) and therefore is a constant $c$, possibly $+\infty$ or $-\infty$. Since $S_n = X_1 + S_n'$, where $S_n' = X_2 + \cdots + X_n$, we have that
\[
\limsup_n S_n = X_1 + \limsup_n S_n' ,
\]
where $\limsup_n S_n'$ has the same distribution as $\limsup_n S_n$ and is independent of $X_1$, and therefore
\[
c = X_1 + c .
\]
Since $P(X_1 \ne 0) > 0$, this implies that $\limsup_n S_n$ cannot be finite, and is therefore either $+\infty$ or $-\infty$. Similarly for $\liminf_n S_n$, which is therefore either $+\infty$ or $-\infty$. Since we cannot have simultaneously $\liminf_n S_n = +\infty$ and $\limsup_n S_n = -\infty$, the first part of the theorem is proved. In the symmetric case, only (c) is possible because one of the events (a) or (b) entails the other.
4.4 Convergence in Distribution and in Variation

4.4.1 The Role of Characteristic Functions

Let $\{X_n\}_{n \ge 1}$ and $X$ be real-valued random variables with respective cumulative distribution functions $\{F_n\}_{n \ge 1}$ and $F$. A natural definition of convergence in distribution of $\{X_n\}_{n \ge 1}$ to $X$ is the following:
\[
\lim_{n \uparrow \infty} F_n(x) = F(x) . \tag{$\star$}
\]
We have not specified for what $x \in \mathbb{R}$ ($\star$) is required. If we want this to hold for all $x$, then we could not say that the “random” (actually deterministic) sequence of random variables $X_n \equiv a + \frac{1}{n}$, where $a \in \mathbb{R}$, converges to $X \equiv a$. In fact, ($\star$) holds in this case for all points of continuity of the cumulative distribution function of $X$, here $F(x) = 1_{\{x \ge a\}}$. It turns out that a “good” definition would be precisely that ($\star$) should hold for all continuity points of the target cdf $F$.

Example 4.4.1: Magnified Minimum. Let $\{Y_n\}_{n \ge 1}$ be a sequence of iid random variables uniformly distributed on $[0, 1]$. Then
\[
X_n = n \min(Y_1, \dots, Y_n) \xrightarrow{D} \mathcal{E}(1)
\]
(the exponential distribution with mean 1). In fact, for all $x \in [0, n]$,
\[
P(X_n > x) = P\left( \min(Y_1, \dots, Y_n) > \frac{x}{n} \right) = \prod_{i=1}^n P\left( Y_i > \frac{x}{n} \right) = \left( 1 - \frac{x}{n} \right)^n ,
\]
and therefore $\lim_{n \uparrow \infty} P(X_n > x) = e^{-x}$ for all $x \ge 0$.
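The convergence in Example 4.4.1 can be checked both exactly and by simulation. In the sketch below, `survival_exact` evaluates $(1 - x/n)^n$ and `survival_simulated` estimates $P(X_n > x)$ from pseudorandom draws; both names, and the sample sizes, are choices of this illustration.

```python
import math
import random


def survival_exact(n, x):
    """P(n min(Y_1, ..., Y_n) > x) for iid uniforms: (1 - x/n)^n on [0, n]."""
    if x < 0:
        return 1.0
    if x > n:
        return 0.0
    return (1.0 - x / n) ** n


def survival_simulated(n, x, trials, rng):
    """Monte Carlo estimate of P(n min(Y_1, ..., Y_n) > x)."""
    hits = 0
    for _ in range(trials):
        if n * min(rng.random() for _ in range(n)) > x:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    x = 1.5
    for n in (1, 10, 100, 10_000):
        print(n, survival_exact(n, x), "limit:", math.exp(-x))
    print("simulated, n = 50:", survival_simulated(50, 1.0, 20_000, random.Random(0)))
```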
For technical reasons, our starting point will differ from the definition ($\star$), properly modified. Denote by $M^+(\mathbb{R}^d)$ the collection of finite measures on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ and by $C_b(\mathbb{R}^d)$ the collection of continuous bounded functions $f : \mathbb{R}^d \to \mathbb{R}$.

Definition 4.4.2 (a) The sequence $\{\mu_n\}_{n \ge 1}$ in $M^+(\mathbb{R}^d)$ is said to converge weakly to $\mu$ if $\lim_{n \uparrow \infty} \int_{\mathbb{R}^d} f\,d\mu_n = \int_{\mathbb{R}^d} f\,d\mu$ for all $f \in C_b(\mathbb{R}^d)$. This is denoted by $\mu_n \xrightarrow{w} \mu$.

(b) The sequence of random vectors $\{X_n\}_{n \ge 1}$ of $\mathbb{R}^d$ with respective probability distributions $\{Q_{X_n}\}_{n \ge 1}$ is said to converge in distribution to the random vector $X \in \mathbb{R}^d$ with distribution $Q_X$ if $Q_{X_n} \xrightarrow{w} Q_X$. (In other words, for all continuous and bounded functions $f : \mathbb{R}^d \to \mathbb{R}$, $\lim_{n \uparrow \infty} E[f(X_n)] = E[f(X)]$.) This is denoted by $X_n \xrightarrow{D} X$.

Remark 4.4.3 Observe that $X$ and the $X_n$'s need not be defined on the same probability space. Convergence in distribution concerns only probability distributions. As a matter of fact, very often the $X_n$'s are defined on the same probability space but there is no “visible” (that is, defined on the same probability space) limit random vector $X$. Therefore one sometimes denotes convergence in distribution as follows: $X_n \xrightarrow{D} Q$, where $Q$ is a probability distribution on $\mathbb{R}^d$. If $Q$ is a “famous” probability distribution, for instance that of a standard Gaussian variable, one then says that $\{X_n\}_{n \ge 1}$ “converges in distribution to a standard Gaussian distribution”. This is also denoted by $X_n \xrightarrow{D} \mathcal{N}(0, 1)$.

Let $B^o$ and $B^c$ be respectively the interior and the closure of the set $B \subseteq \mathbb{R}^d$ and let $\partial B$ be its boundary ($:= B^c \backslash B^o$). The following theorem is a major tool of the theory of convergence in distribution:

Theorem 4.4.4 Let $\{\mu_n\}_{n \ge 1}$ and $\mu$ be probability distributions on $\mathbb{R}^d$. The following conditions are equivalent:
(i) $\mu_n \xrightarrow{w} \mu$.

(ii) For any open set $G \subseteq \mathbb{R}^d$, $\liminf_n \mu_n(G) \ge \mu(G)$.

(iii) For any closed set $F \subseteq \mathbb{R}^d$, $\limsup_n \mu_n(F) \le \mu(F)$.

(iv) For any measurable set $B \subseteq \mathbb{R}^d$ such that $\mu(\partial B) = 0$, $\lim_n \mu_n(B) = \mu(B)$.

Proof. (i) ⇒ (ii). For any open set $G \subseteq \mathbb{R}^d$ there exists a non-decreasing sequence $\{\varphi_k\}_{k \ge 1}$ of non-negative functions of $C_b(\mathbb{R}^d)$ such that $0 \le \varphi_k \le 1$ and $\varphi_k \uparrow 1_G$ (for instance, $\varphi_k(x) = 1 - e^{-k\,d(x, \overline{G})}$, where $\overline{G}$ denotes the complement of $G$). Since $\int 1_G\,d\mu_n \ge \int \varphi_k\,d\mu_n$ $(k \ge 1)$,
\[
\liminf_n \mu_n(G) = \liminf_n \int 1_G\,d\mu_n \ge \liminf_n \int \varphi_k\,d\mu_n .
\]
This being true for all $k \ge 1$,
\[
\liminf_n \mu_n(G) \ge \sup_k \liminf_n \int \varphi_k\,d\mu_n = \sup_k \lim_n \int \varphi_k\,d\mu_n = \sup_k \int \varphi_k\,d\mu = \mu(G) .
\]
(ii) ⇔ (iii). Take complements.

(ii) + (iii) ⇒ (iv). Indeed, by (ii) and (iii),
\[
\limsup_n \mu_n(B) \le \limsup_n \mu_n(B^c) \le \mu(B^c)
\]
and
\[
\liminf_n \mu_n(B) \ge \liminf_n \mu_n(B^o) \ge \mu(B^o) .
\]
But since $\mu(\partial B) = 0$, $\mu(B^o) = \mu(B^c) = \mu(B)$, and therefore (iv) is verified.

(iv) ⇒ (i). Let $f \in C_b(\mathbb{R}^d)$. We must show that $\lim_{n \uparrow \infty} \int_{\mathbb{R}^d} f\,d\mu_n = \int_{\mathbb{R}^d} f\,d\mu$. It is enough to show this for $f \ge 0$. Let $K < \infty$ be a bound of $f$. By Fubini,
\[
\int_{\mathbb{R}^d} f(x)\,d\mu(x) = \int_{\mathbb{R}^d} \left( \int_0^K 1_{\{t \le f(x)\}}\,dt \right) d\mu(x) = \int_0^K \mu(\{x\ ;\ t \le f(x)\})\,dt = \int_0^K \mu(D_t^f)\,dt ,
\]
where $D_t^f := \{x\ ;\ t \le f(x)\}$. Observe that $\partial D_t^f \subseteq \{x\ ;\ t = f(x)\}$ and that the collection of positive $t$ such that $\mu(\{x\ ;\ t = f(x)\}) > 0$ is at most countable (for each positive integer $k$ there are at most $k$ values of $t$ such that $\mu(\{x\ ;\ t = f(x)\}) \ge \frac{1}{k}$). Therefore, by (iv), for almost all $t$ (with respect to the Lebesgue measure),
\[
\lim_n \mu_n(D_t^f) = \mu(D_t^f) ,
\]
and by dominated convergence,
\[
\lim_n \int f\,d\mu_n = \lim_n \int_0^K \mu_n(D_t^f)\,dt = \int_0^K \mu(D_t^f)\,dt = \int f\,d\mu .
\]
Paul Lévy's Characterization

Theorem 4.4.5 A necessary and sufficient condition for the sequence $\{X_n\}_{n \ge 1}$ of random vectors of $\mathbb{R}^d$ to converge in distribution is that the sequence of their characteristic functions $\{\varphi_n\}_{n \ge 1}$ converges to some function $\varphi$ that is continuous at 0. In such a case, $\varphi$ is the characteristic function of the limit probability distribution.

The (technical) proof is postponed to Section 4.4.4.
Corollary 4.4.6 Let $\{X_n\}_{n \ge 1}$ and $X$ be random vectors of $\mathbb{R}^d$ with respective characteristic functions $\{\varphi_n\}_{n \ge 1}$ and $\varphi$. The following two statements are equivalent.

(A) $X_n \xrightarrow{D} X$.

(B) $\lim_{n \uparrow \infty} \varphi_n = \varphi$.

Corollary 4.4.7 In the univariate case, denote by $F_n$ and $F$ the cumulative distribution functions of $X_n$ and $X$, respectively. Call a point $x \in \mathbb{R}$ a continuity point of $F$ if $F(x) = F(x^-)$. Then $X_n \xrightarrow{D} X$ if and only if $\lim_n F_n(x) = F(x)$ for all continuity points $x$ of $F$.

Proof. Necessity. Let $Q_X$ be the probability distribution of $X$. If $x$ is a continuity point of $F$, the boundary of $C := (-\infty, x]$ is $\{x\}$, of null $Q_X$-measure. Therefore by (iv) of Theorem 4.4.4, $\lim_n Q_{X_n}((-\infty, x]) = Q_X((-\infty, x])$, that is, $\lim_n F_n(x) = F(x)$.

Sufficiency. Let $f \in C_b(\mathbb{R})$ and let $M < \infty$ be an upper bound of $|f|$. For $\varepsilon > 0$, there exists a subdivision $-\infty < a = x_0 < x_1 < \cdots < x_k = b < +\infty$ formed by continuity points of $F$, such that $F(a) < \varepsilon$, $F(b) > 1 - \varepsilon$ and $|f(x) - f(x_i)| < \varepsilon$ on $[x_{i-1}, x_i]$. By hypothesis,
\[
S_n := \sum_{i=1}^k f(x_i)(F_n(x_i) - F_n(x_{i-1})) \to S := \sum_{i=1}^k f(x_i)(F(x_i) - F(x_{i-1})) .
\]
Also
\[
|E[f(X)] - S| \le \varepsilon + M F(a) + M(1 - F(b)) \le (2M + 1)\varepsilon
\]
and
\[
|E[f(X_n)] - S_n| \le \varepsilon + M F_n(a) + M(1 - F_n(b)) \to \varepsilon + M F(a) + M(1 - F(b)) \le (2M + 1)\varepsilon .
\]
Therefore,
\[
\limsup_n |E[f(X_n)] - E[f(X)]| \le \limsup_n |E[f(X_n)] - S_n| + \limsup_n |S_n - S| + |E[f(X)] - S| \le (4M + 2)\varepsilon .
\]
Since $\varepsilon$ is arbitrary, $\lim_n |E[f(X_n)] - E[f(X)]| = 0$.
Theorem 4.4.8 Let $\{X_n\}_{n \ge 1}$ and $\{Y_n\}_{n \ge 1}$ be sequences of random vectors of $\mathbb{R}^d$ such that $X_n \xrightarrow{D} X$ and $d(X_n, Y_n) \xrightarrow{Pr} 0$, where $d$ denotes the euclidean distance. Then $Y_n \xrightarrow{D} X$.

Proof. By (iii) of Theorem 4.4.4, it suffices to show that for all closed sets $F$, $\limsup_n P(Y_n \in F) \le P(X \in F)$. For all $\varepsilon > 0$, define the closed set $F_\varepsilon = \{x \in \mathbb{R}^d\ ;\ d(x, F) \le \varepsilon\}$. Then
\[
P(Y_n \in F) \le P(d(X_n, Y_n) \ge \varepsilon) + P(X_n \in F_\varepsilon) ,
\]
and therefore
\[
\limsup_n P(Y_n \in F) \le \limsup_n P(d(X_n, Y_n) \ge \varepsilon) + \limsup_n P(X_n \in F_\varepsilon) = \limsup_n P(X_n \in F_\varepsilon) \le P(X \in F_\varepsilon) .
\]
Since $\varepsilon > 0$ is arbitrary and $\lim_{\varepsilon \downarrow 0} P(X \in F_\varepsilon) = P(X \in F)$, $\limsup_n P(Y_n \in F) \le P(X \in F)$.

Corollary 4.4.9 Let $\{X_n\}_{n \ge 1}$ be a sequence of random vectors of $\mathbb{R}^d$ such that $X_n \xrightarrow{D} X$. If the sequence of real numbers $\{a_n\}_{n \ge 1}$ converges to the real number $a$, then $a_n X_n \xrightarrow{D} aX$.
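Bochner’s Theorem

Bochner’s theorem will play a central role in the theory of wide-sense stationary processes of Chapter 12. The characteristic function $\varphi$ of a real random variable $X$ has the following properties:

A. it is hermitian symmetric (that is, $\varphi(-u) = \varphi(u)^*$) and uniformly bounded (in fact, $|\varphi(u)| \le \varphi(0)$);
B. it is uniformly continuous on $\mathbb{R}$; and
C. it is definite nonnegative, in the sense that for all integers $n$, all $u_1, \ldots, u_n \in \mathbb{R}$, and all $z_1, \ldots, z_n \in \mathbb{C}$,
\[ \sum_{j=1}^{n} \sum_{k=1}^{n} \varphi(u_j - u_k)\, z_j z_k^* \ge 0 \]
(just observe that the left-hand side equals $E\bigl[\bigl|\sum_{j=1}^{n} z_j e^{iu_j X}\bigr|^2\bigr]$).

It turns out that Properties A, B and C characterize characteristic functions (up to a multiplicative constant). This is Bochner’s theorem:

Theorem 4.4.10 Let $\varphi : \mathbb{R} \to \mathbb{C}$ be a function satisfying properties A, B and C. Then there exists a constant $0 \le \beta < \infty$ and a real random variable $X$ such that for all $u \in \mathbb{R}$,
\[ \varphi(u) = \beta\, E\bigl[e^{iuX}\bigr] . \]

Proof. We henceforth eliminate the trivial case where $\varphi(0) = 0$ (implying, in view of condition A, that $\varphi$ is the null function). For any continuous function $z : \mathbb{R} \to \mathbb{C}$ and any $A \ge 0$,
\[ \int_0^A \int_0^A \varphi(u - v)\, z(u)\, z^*(v)\, du\, dv \ge 0 . \qquad (\star) \]
Indeed, since the integrand is continuous, the integral is the limit as $n \uparrow \infty$ of
\[ \frac{A^2}{4^n} \sum_{j=1}^{2^n} \sum_{k=1}^{2^n} \varphi\!\left(\frac{A(j-k)}{2^n}\right) z\!\left(\frac{Aj}{2^n}\right) z\!\left(\frac{Ak}{2^n}\right)^{*} , \]
a nonnegative quantity by condition C. From $(\star)$ with $z(u) := e^{-ixu}$, we have that
\[ g(x, A) := \frac{1}{2\pi A} \int_0^A \int_0^A \varphi(u - v)\, e^{-ix(u-v)}\, du\, dv \ge 0 . \]
Changing variables, we obtain for $g(x, A)$ the alternative expression
\[ g(x, A) = \frac{1}{2\pi} \int_{-A}^{A} \left(1 - \frac{|u|}{A}\right) \varphi(u)\, e^{-iux}\, du = \frac{1}{2\pi} \int_{-\infty}^{+\infty} h\!\left(\frac{u}{A}\right) \varphi(u)\, e^{-iux}\, du , \]
where $h(u) = (1 - |u|)\, 1_{\{|u| \le 1\}}$. Let $M > 0$. We have
\[ \int_{-\infty}^{+\infty} h\!\left(\frac{x}{2M}\right) g(x, A)\, dx = \frac{1}{2\pi} \int_{-\infty}^{+\infty} h\!\left(\frac{u}{A}\right) \varphi(u) \left( \int_{-\infty}^{+\infty} h\!\left(\frac{x}{2M}\right) e^{-iux}\, dx \right) du = \frac{1}{\pi}\, M \int_{-\infty}^{+\infty} h\!\left(\frac{u}{A}\right) \varphi(u) \left(\frac{\sin Mu}{Mu}\right)^{2} du . \]
Therefore
\[ \int_{-\infty}^{+\infty} h\!\left(\frac{x}{2M}\right) g(x, A)\, dx \le \frac{1}{\pi}\, M \int_{-\infty}^{+\infty} h\!\left(\frac{u}{A}\right) |\varphi(u)| \left(\frac{\sin Mu}{Mu}\right)^{2} du \le \frac{1}{\pi}\, \varphi(0) \int_{-\infty}^{+\infty} \left(\frac{\sin u}{u}\right)^{2} du = \varphi(0) . \]
By monotone convergence,
\[ \lim_{M \uparrow \infty} \int_{-\infty}^{+\infty} h\!\left(\frac{x}{2M}\right) g(x, A)\, dx = \int_{-\infty}^{+\infty} g(x, A)\, dx , \]
and therefore
\[ \int_{-\infty}^{+\infty} g(x, A)\, dx \le \varphi(0) . \]
The function $x \mapsto g(x, A)$ is therefore integrable and it is the Fourier transform of the integrable and continuous function $u \mapsto h\!\left(\frac{u}{A}\right) \varphi(u)$. Therefore, by the Fourier inversion formula:
\[ h\!\left(\frac{u}{A}\right) \varphi(u) = \int_{-\infty}^{+\infty} g(x, A)\, e^{iux}\, dx . \]
In particular, with $u = 0$, $\int_{-\infty}^{+\infty} g(x, A)\, dx = \varphi(0)$. Therefore, $f(x, A) := \frac{g(x, A)}{\varphi(0)}$ is the probability density of some real random variable with characteristic function $h\!\left(\frac{u}{A}\right) \frac{\varphi(u)}{\varphi(0)}$. But
\[ \lim_{A \uparrow \infty} h\!\left(\frac{u}{A}\right) \frac{\varphi(u)}{\varphi(0)} = \frac{\varphi(u)}{\varphi(0)} . \]
This limit of a sequence of characteristic functions is continuous at 0 and is therefore a characteristic function (Paul Lévy’s criterion, Theorem 4.4.5).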
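Property C can be probed numerically. The following is a minimal sketch, assuming the standard Gaussian characteristic function $e^{-u^2/2}$ as the test case; the helper names `phi_gauss` and `quadratic_form` are illustrative and do not come from the text. It evaluates the double sum of Property C for random choices of $n$, $u_j$ and $z_j$ and records the smallest real part observed.

```python
import cmath
import random


def phi_gauss(u: float) -> complex:
    """Characteristic function of a standard Gaussian variable (assumed test case)."""
    return complex(cmath.exp(-u * u / 2))


def quadratic_form(phi, us, zs) -> complex:
    """Double sum over j, k of phi(u_j - u_k) * z_j * conj(z_k), as in Property C."""
    return sum(
        phi(uj - uk) * zj * zk.conjugate()
        for uj, zj in zip(us, zs)
        for uk, zk in zip(us, zs)
    )


rng = random.Random(0)
values = []
for _ in range(200):
    n = rng.randint(1, 6)
    us = [rng.uniform(-5, 5) for _ in range(n)]
    zs = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    values.append(quadratic_form(phi_gauss, us, zs))

# The form is Hermitian, so its value is real up to rounding, and nonnegative.
smallest_real = min(v.real for v in values)
largest_imag = max(abs(v.imag) for v in values)
print(smallest_real, largest_imag)
```

A random search of this kind can only fail to find a violation; it does not prove Property C, which follows from the expectation identity quoted in the text.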
4.4.2 The Central Limit Theorem

The emblematic theorem of Statistics is the so-called central limit theorem (CLT).

Theorem 4.4.11 Let $\{X_n\}_{n \ge 1}$ be an iid sequence of real random variables such that
\[ E[X_1^2] < \infty . \qquad (4.21) \]
(In particular, $E[|X_1|] < \infty$.) Then, for all $x \in \mathbb{R}$,
\[ \lim_{n \uparrow \infty} P\left( \frac{S_n - nE[X_1]}{\sigma_{X_1} \sqrt{n}} \le x \right) = P(N(0; 1) \le x) , \qquad (4.22) \]
where $S_n := X_1 + \cdots + X_n$, $\sigma_{X_1}$ is the standard deviation of $X_1$, and $N(0; 1)$ is a Gaussian variable with mean 0 and variance 1.

The proof depends in part on the following theorem, which says in particular that under certain conditions, the moments of a random variable can be extracted from its characteristic function:

Theorem 4.4.12 Let $X$ be a real random variable with characteristic function $\psi$, and suppose that $E[|X|^n] < \infty$ for some integer $n \ge 1$. Then for all integers $r \le n$, the $r$th derivative $\psi^{(r)}$ of $\psi$ exists and is given by
\[ \psi^{(r)}(u) = i^r E\bigl[X^r e^{iuX}\bigr] , \qquad (4.23) \]
and in particular $E[X^r] = \frac{\psi^{(r)}(0)}{i^r}$. Moreover,
\[ \psi(u) = \sum_{r=0}^{n} \frac{(iu)^r}{r!}\, E[X^r] + \frac{(iu)^n}{n!}\, \varepsilon_n(u) , \qquad (4.24) \]
where $\lim_{u \to 0} \varepsilon_n(u) = 0$ and $|\varepsilon_n(u)| \le 3E[|X|^n]$.

Proof. First we observe that for any nonnegative real number $a$, and all integers $r \le n$, $a^r \le 1 + a^n$ (indeed, if $a \le 1$, then $a^r \le 1$, and if $a \ge 1$, then $a^r \le a^n$). In particular,
\[ E[|X|^r] \le E[1 + |X|^n] = 1 + E[|X|^n] < \infty . \]
Suppose that for some $r < n$, $\psi^{(r)}(u) = i^r E\bigl[X^r e^{iuX}\bigr]$. In
\[ \frac{\psi^{(r)}(u + h) - \psi^{(r)}(u)}{h} = i^r E\left[ X^r\, \frac{e^{i(u+h)X} - e^{iuX}}{h} \right] = i^r E\left[ X^r e^{iuX}\, \frac{e^{ihX} - 1}{h} \right] , \]
the quantity under the expectation sign tends to $i X^{r+1} e^{iuX}$ as $h \to 0$, and moreover, it is bounded in absolute value by an integrable function since
\[ \left| X^r e^{iuX}\, \frac{e^{ihX} - 1}{h} \right| \le \left| X^r\, \frac{e^{ihX} - 1}{h} \right| \le |X|^{r+1} . \]
(For the last inequality, use the fact that $|e^{ia} - 1|^2 = 2(1 - \cos a) \le a^2$.) Therefore, by dominated convergence,
\[ \psi^{(r+1)}(u) = \lim_{h \to 0} \frac{\psi^{(r)}(u + h) - \psi^{(r)}(u)}{h} = i^r E\left[ \lim_{h \to 0} X^r e^{iuX}\, \frac{e^{ihX} - 1}{h} \right] = i^{r+1} E\bigl[X^{r+1} e^{iuX}\bigr] . \]
Equality (4.23) follows since the induction hypothesis is trivially true for $r = 0$.

We now prove (4.24). By Taylor’s formula, for $y \in \mathbb{R}$,
\[ e^{iy} = \cos y + i \sin y = \sum_{k=0}^{n-1} \frac{(iy)^k}{k!} + \frac{(iy)^n}{n!} \bigl(\cos(\theta_1 y) + i \sin(\theta_2 y)\bigr) \]
for some $\theta_1, \theta_2 \in [-1, +1]$. Therefore
\[ e^{iuX} = \sum_{k=0}^{n-1} \frac{(iuX)^k}{k!} + \frac{(iuX)^n}{n!} \bigl(\cos(\theta_1 uX) + i \sin(\theta_2 uX)\bigr) , \]
where $\theta_1 = \theta_1(\omega)$, $\theta_2 = \theta_2(\omega) \in [-1, +1]$, and
\[ E\bigl[e^{iuX}\bigr] = \sum_{k=0}^{n-1} \frac{(iu)^k}{k!}\, E[X^k] + \frac{(iu)^n}{n!} \bigl(E[X^n] + \varepsilon_n(u)\bigr) , \]
where
\[ \varepsilon_n(u) = E\bigl[X^n (\cos \theta_1 uX + i \sin \theta_2 uX - 1)\bigr] . \]
Clearly $|\varepsilon_n(u)| \le 3E[|X|^n]$. Also, since the random variable $X^n(\cos \theta_1 uX + i \sin \theta_2 uX - 1)$ is bounded in absolute value by the integrable random variable $3|X|^n$ and tends to 0 as $u \to 0$, we have by dominated convergence $\lim_{u \to 0} \varepsilon_n(u) = 0$.

We now proceed to the proof of Theorem 4.4.11.

Proof. Assume without loss of generality that $E[X_1] = 0$. Then call $\sigma^2$ the variance of $X_1$. By the characteristic function criterion for convergence in distribution, it suffices to show that
\[ \lim_{n \uparrow \infty} \varphi_n(u) = e^{-\sigma^2 u^2 / 2} , \]
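where
\[ \varphi_n(u) = E\left[ \exp\left\{ iu\, \frac{\sum_{j=1}^{n} X_j}{\sqrt{n}} \right\} \right] = \prod_{j=1}^{n} E\left[ \exp\left\{ i\, \frac{u}{\sqrt{n}}\, X_j \right\} \right] = \psi\!\left(\frac{u}{\sqrt{n}}\right)^{n} , \]
where $\psi$ is the characteristic function of $X_1$. From the Taylor expansion of $\psi$ about zero,
\[ \psi(u) = 1 + \frac{\psi''(0)}{2!}\, u^2 + o(u^2) \]
(the first-order term vanishes since $\psi'(0) = iE[X_1] = 0$, and $\psi''(0) = -E[X_1^2] = -\sigma^2$ by (4.23)), we have, for fixed $u \in \mathbb{R}$,
\[ \psi\!\left(\frac{u}{\sqrt{n}}\right) = 1 - \frac{1}{n}\, \frac{\sigma^2 u^2}{2} + o\!\left(\frac{1}{n}\right) , \]
and therefore
\[ \lim_{n \uparrow \infty} \ln\{\varphi_n(u)\} = \lim_{n \uparrow \infty} n \ln\left\{ 1 - \frac{\sigma^2 u^2}{2n} + o\!\left(\frac{1}{n}\right) \right\} = -\frac{1}{2}\, \sigma^2 u^2 . \]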
The result follows by Theorem 4.4.6.

Remark 4.4.13 The random variable
\[ \frac{S_n - nE[X_1]}{\sigma \sqrt{n}} \]
is obtained by centering the sum $S_n$ (subtracting its mean $nE[X_1]$) and then normalizing it (dividing by the square root of its variance to make the resulting variance equal to 1).

Theorem 4.4.14 Let $\{X_n\}_{n \ge 1}$ be a sequence of independent random vectors of dimension $d$, and let $\{a_n\}_{n \ge 1}$ be a sequence of real numbers such that $\lim_{n \uparrow \infty} a_n = \infty$. Suppose that
\[ X_n \xrightarrow{\text{a.s.}} m \quad \text{and} \quad \sqrt{a_n}\, (X_n - m) \xrightarrow{D} N(0, \Gamma) . \]
Let $g : \mathbb{R}^d \to \mathbb{R}^q$ be twice continuously differentiable in a neighborhood $U$ of $m$. Then
\[ g(X_n) \xrightarrow{\text{a.s.}} g(m) \quad \text{and} \quad \sqrt{a_n}\, (g(X_n) - g(m)) \xrightarrow{D} N\bigl(0, J_g(m)^T\, \Gamma\, J_g(m)\bigr) , \]
where $J_g(m)$ is the Jacobian matrix of $g$ evaluated at $m$.

Proof. $U$ can be chosen convex and compact. Let $g_j$ denote the $j$th coordinate of $g$, and let $D^2 g_j$ denote the second differential matrix of $g_j$. By Taylor’s formula,
\[ g_j(x) - g_j(m) = (x - m)^T (\operatorname{grad} g_j(m)) + \frac{1}{2}\, (x - m)^T D^2 g_j(m^*)\, (x - m) \]
for some $m^*$ in the closed segment linking $m$ to $x$, denoted $[m, x]$. Therefore, if $X_n \in U$,
\[ \sqrt{a_n}\, (g_j(X_n) - g_j(m)) = \sqrt{a_n}\, (X_n - m)^T (\operatorname{grad} g_j(m)) + \frac{1}{2}\, \sqrt{a_n}\, (X_n - m)^T \left( \frac{1}{\sqrt{a_n}}\, D^2 g_j(m_n^*) \right) \sqrt{a_n}\, (X_n - m) , \]
where $m_n^* \in [m, X_n]$. Suppose $X_n \in U$. Since $U$ is convex and $m \in U$, also $m_n^* \in U$. Now since $U$ is compact, the continuous function $D^2 g_j$ is bounded in $U$. Therefore, since $a_n \uparrow \infty$, $\frac{1}{\sqrt{a_n}}\, D^2 g_j(m_n^*)\, 1_U(X_n) \xrightarrow{\text{a.s.}} 0$. Since $X_n \xrightarrow{\text{a.s.}} m$, we deduce from the above remarks that
\[ \sqrt{a_n}\, (g_j(X_n) - g_j(m)) - \sqrt{a_n}\, (X_n - m)^T (\operatorname{grad} g_j(m)) \xrightarrow{\text{a.s.}} 0 , \]
and therefore
\[ \sqrt{a_n}\, (g(X_n) - g(m)) - J_g(m)\, \sqrt{a_n}\, (X_n - m) \xrightarrow{\text{a.s.}} 0 . \]
But $\sqrt{a_n}\, (X_n - m) \xrightarrow{D} N(0, \Gamma)$ and therefore
\[ \sqrt{a_n}\, (g(X_n) - g(m)) \xrightarrow{D} J_g(m)\, N(0, \Gamma) = N\bigl(0, J_g(m)^T\, \Gamma\, J_g(m)\bigr) . \]
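The conclusion of Theorem 4.4.14 can be illustrated by simulation in the scalar case. The sketch below rests on assumptions that are not in the text: $X_n$ is the empirical mean of $n$ Bernoulli($p$) variables, $a_n = n$, $g(x) = x^2$, so that $\Gamma = p(1-p)$ and the limiting variance predicted by the theorem is $(2p)^2 p(1-p)$. The function name `delta_method_demo` and all parameter values are illustrative choices.

```python
import random
import statistics


def delta_method_demo(p=0.3, n=400, reps=2000, seed=1):
    """Compare the empirical variance of sqrt(n) * (g(mean) - g(p)) with the
    prediction g'(p)^2 * p * (1 - p) of the theorem, for g(x) = x^2."""
    rng = random.Random(seed)

    def g(x):
        return x * x

    scaled = []
    for _ in range(reps):
        # Empirical mean of n Bernoulli(p) draws.
        mean = sum(rng.random() < p for _ in range(n)) / n
        scaled.append(n ** 0.5 * (g(mean) - g(p)))
    empirical = statistics.pvariance(scaled)
    predicted = (2 * p) ** 2 * p * (1 - p)
    return empirical, predicted


empirical_var, predicted_var = delta_method_demo()
print(empirical_var, predicted_var)
```

The two numbers should be close but not equal: the theorem is asymptotic, and the empirical variance carries Monte Carlo error of a few percent with these settings.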
Statistical Applications

A basic methodology of Statistics is based on the notion of confidence interval and on the central limit theorem. Theorem 4.4.11 implies that for $x \ge 0$
\[ \lim_{n \uparrow \infty} P\left( E[X_1] - \frac{\sigma}{\sqrt{n}}\, x \le \frac{S_n}{n} \le E[X_1] + \frac{\sigma}{\sqrt{n}}\, x \right) = P(|N(0; 1)| \le x) . \]
Under the condition $E[|X_1|^3] < \infty$, this limit is uniform in $x \in \mathbb{R}$ (footnote 1) and therefore, with $\frac{\sigma}{\sqrt{n}}\, x = a$,
\[ \lim_{n \uparrow \infty} \left| P\left( E[X_1] - a \le \frac{S_n}{n} \le E[X_1] + a \right) - P\left( |N(0; 1)| \le \frac{a \sqrt{n}}{\sigma} \right) \right| = 0 . \]
That is, for large $n$,
\[ P\left( E[X_1] - a \le \frac{S_n}{n} \le E[X_1] + a \right) \simeq P\left( |N(0; 1)| \le \frac{a \sqrt{n}}{\sigma} \right) . \]
In other words, for large $n$, the slln estimate of $E[X_1]$, that is $\frac{S_n}{n}$, lies within distance $a$ of $E[X_1]$ with probability approximately $P\bigl( |N(0; 1)| \le \frac{a \sqrt{n}}{\sigma} \bigr)$.

1 This is the content of the Berry–Esseen theorem. The proof is omitted.

In statistical practice, this result is used in two manners.

(1) One wishes to know the number $n$ of experiments that guarantees that with probability, say 0.99, the estimation error is less than $a$. Choose $n$ such that
\[ P\left( |N(0; 1)| \le \frac{a \sqrt{n}}{\sigma} \right) = 0.99 . \]
Since $P(|N(0; 1)| \le 2.58) = 0.99$, we have
\[ 2.58 = \frac{a \sqrt{n}}{\sigma} , \qquad (4.25) \]
and therefore
\[ n = \left( \frac{2.58\, \sigma}{a} \right)^{2} . \]

(2) The (usually large) number $n$ of experiments is fixed. We want to determine the interval $\left[ \frac{S_n}{n} - a, \frac{S_n}{n} + a \right]$ within which the mean $E[X_1]$ lies with probability at least 0.99. From (4.25):
\[ a = \frac{2.58\, \sigma}{\sqrt{n}} . \]
If the standard deviation $\sigma$ is unknown, it may be replaced by an slln estimate of it (but then of course. . . ), or the conservative method can be used, which consists of replacing $\sigma$ by an upper bound.

Example 4.4.15: Testing a coin. Consider the problem of estimating the bias $p$ of a coin. Here, $X_n$ takes two values, 1 and 0, with probability $p$ and $1 - p$ respectively, and in particular $E[X_1] = p$, $\operatorname{Var}(X_1) = \sigma^2 = p(1 - p)$. Clearly, since we are trying to estimate $p$, the standard deviation $\sigma$ is unknown. Here the upper bound of $\sigma$ is the maximum of $\sqrt{p(1 - p)}$ for $p \in [0, 1]$, which is attained for $p = \frac{1}{2}$. Thus $\sigma \le \frac{1}{2}$. Suppose the coin was tossed 10,000 times and that the experiment produced the estimate $\frac{S_n}{n} = 0.4925$. Can we “believe 99 percent” that the coin is unbiased? For this we would check that the corresponding confidence interval contains the value $\frac{1}{2}$. Using the conservative method (not a big problem since obviously the actual bias is not far from $\frac{1}{2}$), we have
\[ a = \frac{2.58\, \sigma}{\sqrt{n}} = 0.0129 , \]
and indeed $\frac{1}{2} \in [0.4925 - 0.0129,\ 0.4925 + 0.0129]$, so that we are at least 99 percent confident that the coin is unbiased.
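The two uses of (4.25) can be sketched with Python's standard library. This is an illustration only: the helper names `two_sided_quantile`, `required_sample_size` and `half_width` are hypothetical, and the quantile is computed from the Gaussian distribution function rather than taken as the rounded value 2.58 used in the text.

```python
import math
from statistics import NormalDist


def two_sided_quantile(level: float) -> float:
    """Return x such that P(|N(0;1)| <= x) = level."""
    return NormalDist().inv_cdf((1 + level) / 2)


def required_sample_size(a: float, sigma: float, level: float = 0.99) -> int:
    """Use (1): smallest n with a * sqrt(n) / sigma at least the quantile."""
    x = two_sided_quantile(level)
    return math.ceil((x * sigma / a) ** 2)


def half_width(n: int, sigma: float, level: float = 0.99) -> float:
    """Use (2): half-width a of the confidence interval for fixed n."""
    return two_sided_quantile(level) * sigma / math.sqrt(n)


# Example 4.4.15 with the conservative bound sigma <= 1/2.
x99 = two_sided_quantile(0.99)
a = half_width(10_000, 0.5)
interval = (0.4925 - a, 0.4925 + a)
contains_half = interval[0] <= 0.5 <= interval[1]
n_needed = required_sample_size(0.0129, 0.5)
print(x99, a, interval, contains_half, n_needed)
```

With the exact quantile in place of 2.58 the half-width is marginally smaller than 0.0129, which does not change the conclusion of the example.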
4.4.3 Convergence in Variation

The Variation Distance

Convergence in variation is convergence with respect to the variation distance, a notion that is now first introduced in the discrete case (footnote 2).

Definition 4.4.16 Let $E$ be a countable space. The distance in variation between two probability distributions $\alpha$ and $\beta$ on $E$ is the quantity
\[ d_V(\alpha, \beta) := \frac{1}{2} \sum_{i \in E} |\alpha(i) - \beta(i)| . \qquad (4.26) \]
That $d_V$ is indeed a metric is clear.

Lemma 4.4.17 Let $\alpha$ and $\beta$ be two probability distributions on the same countable space $E$. Then
\[ d_V(\alpha, \beta) = \sup_{A \subseteq E} |\alpha(A) - \beta(A)| = \sup_{A \subseteq E} \{\alpha(A) - \beta(A)\} . \]

2 Only the discrete case will be used in this book, in Chapter 6 on discrete-time Markov chains.

Proof. For the second equality observe that for each subset $A$ there is a subset $B$ such that $|\alpha(A) - \beta(A)| = \alpha(B) - \beta(B)$ (take $B = A$ or $\bar{A}$). For the first equality, write
\[ \alpha(A) - \beta(A) = \sum_{i \in E} 1_A(i) \{\alpha(i) - \beta(i)\} \]
and observe that the right-hand side is maximal for
\[ A = \{i \in E;\ \alpha(i) > \beta(i)\} . \]
Therefore, with $g(i) = \alpha(i) - \beta(i)$,
\[ \sup_{A \subseteq E} \{\alpha(A) - \beta(A)\} = \sum_{i \in E} g^+(i) = \frac{1}{2} \sum_{i \in E} |g(i)| \]
since $\sum_{i \in E} g(i) = 0$.

The distance in variation between two random variables $X$ and $Y$ with values in $E$ is the distance in variation between their probability distributions, and it is denoted (with a slight abuse of notation) by $d_V(X, Y)$. Therefore
\[ d_V(X, Y) := \frac{1}{2} \sum_{i \in E} |P(X = i) - P(Y = i)| . \]
The distance in variation between a random variable $X$ with values in $E$ and a probability distribution $\alpha$ on $E$, denoted (again with a slight abuse of notation) by $d_V(X, \alpha)$, is defined by
\[ d_V(X, \alpha) := \frac{1}{2} \sum_{i \in E} |P(X = i) - \alpha(i)| . \]
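On a small finite space, formula (4.26) and the supremum characterization of Lemma 4.4.17 can be compared directly. The sketch below is an illustration under assumptions: distributions are represented as dictionaries mapping states to probabilities, the two example distributions are arbitrary, and the brute-force search over all subsets is only practical for very small supports.

```python
from itertools import combinations


def variation_distance(alpha: dict, beta: dict) -> float:
    """Half the l1 distance between two distributions, as in (4.26)."""
    support = set(alpha) | set(beta)
    return 0.5 * sum(abs(alpha.get(i, 0.0) - beta.get(i, 0.0)) for i in support)


def sup_over_subsets(alpha: dict, beta: dict) -> float:
    """Brute-force sup over all subsets A of alpha(A) - beta(A)."""
    support = sorted(set(alpha) | set(beta))
    best = 0.0  # value for the empty set
    for r in range(1, len(support) + 1):
        for subset in combinations(support, r):
            diff = sum(alpha.get(i, 0.0) - beta.get(i, 0.0) for i in subset)
            best = max(best, diff)
    return best


alpha = {0: 0.5, 1: 0.3, 2: 0.2}
beta = {0: 0.2, 1: 0.3, 2: 0.4, 3: 0.1}
d = variation_distance(alpha, beta)
s = sup_over_subsets(alpha, beta)
print(d, s)
```

In this example the supremum is attained on the set where $\alpha(i) > \beta(i)$, here the single state 0, as in the proof of the lemma.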
The Coupling Inequality

Coupling of two discrete probability distributions $\pi'$ on $E'$ and $\pi''$ on $E''$ consists, by definition, of the construction of a probability distribution $\pi$ on $E := E' \times E''$ such that the marginal distributions of $\pi$ on $E'$ and $E''$, respectively, are $\pi'$ and $\pi''$, that is,
\[ \sum_{j \in E''} \pi(i, j) = \pi'(i) \quad \text{and} \quad \sum_{i \in E'} \pi(i, j) = \pi''(j) . \]
For two probability distributions $\alpha$ and $\beta$ on the countable set $E$, let $D(\alpha, \beta)$ be the collection of random vectors $(X, Y)$ taking their values in $E \times E$ and with given marginal distributions $\alpha$ and $\beta$, that is,
\[ P(X = i) = \alpha(i), \quad P(Y = i) = \beta(i) . \qquad (4.27) \]

Theorem 4.4.18 For any pair $(X, Y) \in D(\alpha, \beta)$, we have the fundamental coupling inequality
\[ d_V(\alpha, \beta) \le P(X \ne Y) , \]
and equality is attained by some pair $(X, Y) \in D(\alpha, \beta)$, which is then said to realize maximal coincidence.

Proof. For $A \subset E$,
\[ P(X \ne Y) \ge P(X \in A, Y \in \bar{A}) = P(X \in A) - P(X \in A, Y \in A) \ge P(X \in A) - P(Y \in A) , \]
and therefore
\[ P(X \ne Y) \ge \sup_{A \subset E} \{P(X \in A) - P(Y \in A)\} = d_V(\alpha, \beta) . \]
We now construct $(X, Y) \in D(\alpha, \beta)$ realizing equality. Let $U$, $Z$, $V$, and $W$ be independent random variables; $U$ takes its values in $\{0, 1\}$, and $Z$, $V$, $W$ take their values in $E$. The distributions of these random variables are given by
\[ P(U = 1) = 1 - d_V(\alpha, \beta) , \]
\[ P(Z = i) = \alpha(i) \wedge \beta(i) / (1 - d_V(\alpha, \beta)) , \]
\[ P(V = i) = (\alpha(i) - \beta(i))^+ / d_V(\alpha, \beta) , \]
\[ P(W = i) = (\beta(i) - \alpha(i))^+ / d_V(\alpha, \beta) . \]
Observe that $P(V = W) = 0$. Defining
\[ (X, Y) = \begin{cases} (Z, Z) & \text{if } U = 1, \\ (V, W) & \text{if } U = 0, \end{cases} \]
we have
\[ P(X = i) = P(U = 1, Z = i) + P(U = 0, V = i) = P(U = 1) P(Z = i) + P(U = 0) P(V = i) = \alpha(i) \wedge \beta(i) + (\alpha(i) - \beta(i))^+ = \alpha(i) , \]
and similarly, $P(Y = i) = \beta(i)$. Therefore, $(X, Y) \in D(\alpha, \beta)$. Also, $P(X = Y) = P(U = 1) = 1 - d_V(\alpha, \beta)$.

Example 4.4.19: Poisson’s law of rare events, take 2. Let $Y_1, \ldots, Y_n$ be independent random variables taking their values in $\{0, 1\}$, with $P(Y_i = 1) = \pi_i$, $1 \le i \le n$. Let $X := \sum_{i=1}^{n} Y_i$ and $\lambda := \sum_{i=1}^{n} \pi_i$. Let $p_\lambda$ be the Poisson distribution with mean $\lambda$. We wish to bound the variation distance between the distribution $q$ of $X$ and $p_\lambda$. For this we construct a coupling of the two distributions as follows. First we generate independent pairs $(Y_1, Y_1'), \ldots, (Y_n, Y_n')$ such that
\[ P(Y_i = j, Y_i' = k) = \begin{cases} 1 - \pi_i & \text{if } j = 0,\ k = 0, \\ e^{-\pi_i}\, \frac{\pi_i^k}{k!} & \text{if } j = 1,\ k \ge 1, \\ e^{-\pi_i} - (1 - \pi_i) & \text{if } j = 1,\ k = 0 . \end{cases} \]
One verifies that for all $1 \le i \le n$, $P(Y_i = 1) = \pi_i$ and $Y_i' \sim \operatorname{Poi}(\pi_i)$. In particular, $X' := \sum_{i=1}^{n} Y_i'$ is a Poisson variable with mean $\lambda$. Now
\[ P(X \ne X') = P\left( \sum_{i=1}^{n} Y_i \ne \sum_{i=1}^{n} Y_i' \right) \le P\bigl( Y_i \ne Y_i' \text{ for some } i \bigr) \le \sum_{i=1}^{n} P(Y_i \ne Y_i') . \]
But
\[ P(Y_i \ne Y_i') = e^{-\pi_i} - (1 - \pi_i) + P(Y_i' > 1) = \pi_i \bigl(1 - e^{-\pi_i}\bigr) \le \pi_i^2 . \]
Therefore $P(X \ne X') \le \sum_{i=1}^{n} \pi_i^2$, and by the coupling inequality
\[ d_V(q, p_\lambda) \le \sum_{i=1}^{n} \pi_i^2 . \]
For instance, with $\pi_i = p := \frac{\lambda}{n}$, we have
\[ d_V(q, p_\lambda) \le \frac{\lambda^2}{n} . \]
In other terms, the binomial distribution of size $n$ and mean $\lambda$ differs in variation by less than $\frac{\lambda^2}{n}$ from the Poisson distribution with the same mean. This is obviously a refinement of the Poisson approximation theorem since it gives exploitable estimates for finite $n$.
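The bound $d_V(q, p_\lambda) \le \lambda^2 / n$ of Example 4.4.19 can be compared with the actual distance for a few values of $n$. The sketch below is a numerical illustration under assumptions: the function names are hypothetical, probabilities are computed through logarithms (`math.lgamma`) to avoid overflow, and the sum in (4.26) is truncated at `k_max`, which can only make the computed distance slightly smaller than the true one.

```python
import math


def log_binomial_pmf(n: int, p: float, k: int) -> float:
    """Logarithm of the Binomial(n, p) probability of k, for 0 < p < 1."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )


def log_poisson_pmf(lam: float, k: int) -> float:
    """Logarithm of the Poisson(lam) probability of k."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)


def dv_binomial_poisson(n: int, lam: float, k_max: int = 400) -> float:
    """Truncated version of (4.26) for Binomial(n, lam/n) versus Poisson(lam)."""
    p = lam / n
    total = 0.0
    for k in range(k_max + 1):
        b = math.exp(log_binomial_pmf(n, p, k)) if k <= n else 0.0
        total += abs(b - math.exp(log_poisson_pmf(lam, k)))
    return 0.5 * total


lam = 2.0
results = {n: (dv_binomial_poisson(n, lam), lam ** 2 / n) for n in (10, 100, 1000)}
for n, (distance, bound) in results.items():
    print(n, distance, bound)
```

The computed distances sit below the bound in each case, as the example guarantees; how far below is not claimed by the text and depends on $\lambda$ and $n$.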
A More General Definition

The extension of the notions of the previous subsection to probability distributions on more general spaces is conceptually straightforward and necessitates only obvious adaptations.

Definition 4.4.20 Let $P_1$ and $P_2$ be two probability measures on the same measurable space $(X, \mathcal{X})$. The quantity
\[ d_V(P_1, P_2) := \sup_{A \in \mathcal{X}} |P_1(A) - P_2(A)| \]
is called the distance in variation between $P_1$ and $P_2$.

Let $Q$ be a probability measure such that $P_i \ll Q$ ($i = 1, 2$), for instance $Q = \frac{P_1 + P_2}{2}$. Therefore there exist (Radon–Nikodým theorem) two nonnegative measurable real functions $f_i$ ($i = 1, 2$) such that
\[ P_i(A) = \int_A f_i\, dQ \quad (A \in \mathcal{X}) . \]

Theorem 4.4.21 We have that
\[ d_V(P_1, P_2) = \frac{1}{2} \int_X |f_1 - f_2|\, dQ . \qquad (4.28) \]

Proof. We first observe that
\[ \sup_{A \in \mathcal{X}} |P_1(A) - P_2(A)| = \sup_{A \in \mathcal{X}} (P_1(A) - P_2(A)) \]
since for any $A \in \mathcal{X}$ there exists a $B \in \mathcal{X}$ such that $P_1(A) - P_2(A) = -(P_1(B) - P_2(B))$ (take $B = \bar{A}$). Therefore
\[ d_V(P_1, P_2) = \sup_{A \in \mathcal{X}} (P_1(A) - P_2(A)) = \sup_{A \in \mathcal{X}} \int_A (f_1 - f_2)\, dQ . \]
The supremum is attained for $A = \{f_1 - f_2 \ge 0\}$:
\[ d_V(P_1, P_2) = \int_{\{f_1 - f_2 \ge 0\}} (f_1 - f_2)\, dQ . \]
Since $\int_X (f_1 - f_2)\, dQ = 0$, we have that
\[ \int_{\{f_1 - f_2 \ge 0\}} (f_1 - f_2)\, dQ = \frac{1}{2} \int_X |f_1 - f_2|\, dQ . \]
It follows from the expression (4.28) that $d_V$ is indeed a metric.
A Bayesian Interpretation

Let $X \in \mathbb{R}^d$ be a random vector called the observation and $H \in \{1, 2\}$ be a random variable called the hypothesis. The joint law of $(X, H)$ is described as follows:
\[ P(X \in C \mid H = i) = \int_C f_i(x)\, dx \quad (i \in \{1, 2\},\ C \in \mathcal{B}(\mathbb{R}^d)) \]
and
\[ P(H = i) = \frac{1}{2} \quad (i \in \{1, 2\}) . \]
We seek to devise a test based on the observation of $X$ alone that will help us to decide which is the value of $H$. In other words, we must select a measurable partition $\{A_1, A_2\}$ of $\mathbb{R}^d$ and decide $H = i$ if $X \in A_i$ ($i = 1, 2$). This partition is called a test. A probability of error (wrong guess) is associated with this test:
\[ P_E = P(H = 1)\, P(X \in A_2 \mid H = 1) + P(H = 2)\, P(X \in A_1 \mid H = 2) , \]
that is,
\[ P_E = \frac{1}{2} \int_{A_2} f_1(x)\, dx + \frac{1}{2} \int_{A_1} f_2(x)\, dx = \frac{1}{2} \int_{A_2} f_1(x)\, dx + \frac{1}{2} \left( 1 - \int_{A_2} f_2(x)\, dx \right) = \frac{1}{2} + \frac{1}{2} \int_{A_2} (f_1(x) - f_2(x))\, dx . \]
We seek to minimize this quantity (with respect to $A_2$) or, equivalently, to maximize the quantity
\[ \int_{A_2} (f_2(x) - f_1(x))\, dx = P_2(A_2) - P_1(A_2) , \]
where $P_i(C) := \int_C f_i(x)\, dx$ ($i = 1, 2$). This is done by the choice $A_2 := \{x\,;\ f_2(x) \ge f_1(x)\}$ and the resulting (minimal) probability of error is then
\[ P_E^* = \frac{1}{2} \bigl(1 - d_V(P_1, P_2)\bigr) , \]
where $P_i(\cdot) := P(X \in \cdot \mid H = i)$ ($i = 1, 2$).
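The identity $P_E^* = \frac{1}{2}(1 - d_V(P_1, P_2))$ can be checked numerically in dimension one. The sketch below assumes, purely for illustration, that $f_1$ and $f_2$ are the densities of $N(0, 1)$ and $N(1, 1)$; the integration range, the step count and the helper `integrate` are arbitrary choices, and the integrals are midpoint-rule approximations.

```python
from statistics import NormalDist

# Assumed example densities (not from the text): N(0, 1) and N(1, 1).
f1 = NormalDist(0.0, 1.0).pdf
f2 = NormalDist(1.0, 1.0).pdf


def integrate(func, lo=-12.0, hi=13.0, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (i + 0.5) * h) for i in range(steps))


# Variation distance via formula (4.28), with Lebesgue measure as reference.
d_v = 0.5 * integrate(lambda x: abs(f1(x) - f2(x)))

# Error of the test A2 = {f2 >= f1}: decide H = 2 on A2 and H = 1 elsewhere,
# so the error density is f1/2 on A2 and f2/2 on its complement.
p_error = 0.5 * integrate(lambda x: f1(x) if f2(x) >= f1(x) else f2(x))

print(d_v, p_error, 0.5 * (1 - d_v))
```

For these two Gaussian densities the optimal region is the half-line $x \ge \frac{1}{2}$, so the printed error agrees with $P(N(0; 1) \ge \frac{1}{2})$ up to integration error.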
Convergence in Variation

Definition 4.4.22 The sequence $\{P_n\}_{n \ge 1}$ of probability measures on $(X, \mathcal{X})$ is said to converge in variation to the probability $P$ on $(X, \mathcal{X})$ if
\[ \lim_{n \uparrow \infty} d_V(P_n, P) = 0 . \]
This is denoted $P_n \xrightarrow{\text{Var.}} P$.

Let $Q$ be a probability measure such that $P_n \ll Q$ ($n \ge 1$); for instance,
\[ Q = \sum_{n \ge 1} \frac{1}{2^n}\, P_n \]
defines a probability measure $Q$ such that for all $n \ge 1$, $P_n \ll Q$. Denote by $f_n$ (resp., $f$) the Radon–Nikodým derivative of $P_n$ (resp., $P$) with respect to $Q$. By Theorem 4.4.21, $P_n \xrightarrow{\text{Var.}} P$ if and only if $f_n \xrightarrow{L^1} f$, where $L^1 = L^1_{\mathbb{C}}(Q)$. Note also that if $\varphi : X \to \mathbb{C}$ is a bounded function, then
\[ \int_X \varphi\, dP_n \to \int_X \varphi\, dP , \]
as follows from the fact that
\[ \int_X \varphi\, dP_n - \int_X \varphi\, dP = \int_X \varphi \times (f_n - f)\, dQ \]
and dominated convergence.

Theorem 4.4.23 Let $P_n$, $Q$ and $f_n$ be defined as above. Suppose that there exists a nonnegative measurable function $f$ from $(X, \mathcal{X})$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ such that $Q$-a.e., $f_n \to f$. Then $P_n \xrightarrow{\text{Var.}} P$ where $P$ is the probability defined by $P(A) = \int_A f\, dQ$, $A \in \mathcal{X}$.

The proof is a direct consequence of Scheffé’s lemma:

Lemma 4.4.24 Let $f$ and $f_n$ ($n \ge 1$) be $Q$-integrable nonnegative real functions from $(X, \mathcal{X})$ to $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$, with $\lim_{n \uparrow \infty} f_n = f$ $Q$-a.e. and $\lim_{n \uparrow \infty} \int_X f_n\, dQ = \int_X f\, dQ$. Then $\lim_{n \uparrow \infty} \int_X |f_n - f|\, dQ = 0$.

Proof. The function $\inf(f_n, f)$ is bounded by the (integrable) function $f$ (this is where the nonnegativeness assumption is used). Moreover, it converges to $f$. Therefore, by dominated convergence, $\lim_{n \uparrow \infty} \int_X \inf(f_n, f)\, dQ = \int_X f\, dQ$. The rest of the proof follows from
\[ \int_X |f_n - f|\, dQ = \int_X f_n\, dQ + \int_X f\, dQ - 2 \int_X \inf(f_n, f)\, dQ . \]

Definition 4.4.25
A. Let $X_1$, $X_2$ be random elements with values in the measurable space $(E, \mathcal{E})$, with respective distributions $\alpha_1$, $\alpha_2$. The distance in variation between $X_1$ and $X_2$ is, by definition, the quantity $d_V(X_1, X_2) := d_V(\alpha_1, \alpha_2)$.
B. Let $X$ and $\{X_n\}_{n \ge 1}$ be random elements with values in the measurable space $(E, \mathcal{E})$, with respective distributions $\alpha$, $\{\alpha_n\}_{n \ge 1}$. The sequence $\{X_n\}_{n \ge 1}$ is said to converge in variation to $X$ if $\lim_{n \uparrow \infty} d_V(X_n, X) = 0$. This is denoted by $X_n \xrightarrow{\text{Var.}} X$.

Let $\{X_n\}_{n \ge 1}$ be random elements with values in some measurable space $(E, \mathcal{E})$ and let $\alpha$ be some probability distribution on $(E, \mathcal{E})$. The notation $X_n \xrightarrow{\text{Var.}} \alpha$ means, by convention, that $\alpha_n \xrightarrow{\text{Var.}} \alpha$, where $\alpha_n$ is the distribution of $X_n$. (This convention is similar to the one introduced above in the context of convergence in distribution.)
4.4.4 Proof of Paul Lévy’s Criterion

Radon Linear Forms

Denote by $C_0(\mathbb{R}^d)$ the set of continuous functions $\varphi : \mathbb{R}^d \to \mathbb{R}$ that vanish at infinity ($\lim_{|x| \uparrow \infty} \varphi(x) = 0$), endowed with the norm of uniform convergence $\|\varphi\| := \sup_{x \in \mathbb{R}^d} |\varphi(x)|$. Let $C_c(\mathbb{R}^d)$ be the set of continuous functions from $\mathbb{R}^d$ to $\mathbb{R}$ with compact support. In particular, $C_c(\mathbb{R}^d) \subset C_0(\mathbb{R}^d)$.

Definition 4.4.26 A linear form $L : C_c(\mathbb{R}^d) \to \mathbb{R}$ such that $L(f) \ge 0$ whenever $f \ge 0$ is called a positive Radon linear form.

We quote without proof the following fundamental result of Riesz (footnote 3):

Theorem 4.4.27 Let $L : C_c(\mathbb{R}^d) \to \mathbb{R}$ be a positive Radon linear form. There exists a unique locally finite measure $\mu$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ such that for all $f \in C_c(\mathbb{R}^d)$,
\[ L(f) = \int_{\mathbb{R}^d} f\, d\mu . \]

3 See for instance [Rudin, 1986], Theorem 2.14.

We shall need a slight extension of Riesz’s theorem (Theorem 4.4.27), Part (ii) of the following:

Theorem 4.4.28
(i) Let $\mu \in M^+(\mathbb{R}^d)$. The linear form $L : C_0(\mathbb{R}^d) \to \mathbb{R}$ defined by $L(f) := \int_{\mathbb{R}^d} f\, d\mu$ is positive ($L(f) \ge 0$ whenever $f \ge 0$) and continuous, and its norm is $\mu(\mathbb{R}^d)$.
(ii) Let $L : C_0(\mathbb{R}^d) \to \mathbb{R}$ be a positive continuous linear form. There exists a unique measure $\mu \in M^+(\mathbb{R}^d)$ such that for all $f \in C_0(\mathbb{R}^d)$,
\[ L(f) = \int_{\mathbb{R}^d} f\, d\mu . \]

Proof. Part (i) is left as an exercise. We turn to the proof of (ii). The restriction of $L$ to $C_c(\mathbb{R}^d)$ is a positive Radon linear form, and therefore, according to Riesz’s Theorem 4.4.27, there exists a locally finite measure $\mu$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ such that for all $f \in C_c(\mathbb{R}^d)$, $L(f) = \int_{\mathbb{R}^d} f\, d\mu$.

The measure $\mu$ is a finite (not just locally finite) measure. If not, there would exist a sequence $\{K_m\}_{m \ge 1}$ of compact subsets of $\mathbb{R}^d$ such that $\mu(K_m) \ge 3^m$ for all $m \ge 1$. Let then $\{\varphi_m\}_{m \ge 1}$ be a sequence of nonnegative functions in $C_c(\mathbb{R}^d)$ with values in $[0, 1]$ and such that for all $m \ge 1$, $\varphi_m(x) = 1$ for all $x \in K_m$. In particular, the function $\varphi := \sum_{m \ge 1} 2^{-m} \varphi_m$ is in $C_0(\mathbb{R}^d)$ and
\[ L(\varphi) \ge L\left( \sum_{m=1}^{k} 2^{-m} \varphi_m \right) = \sum_{m=1}^{k} 2^{-m} L(\varphi_m) = \sum_{m=1}^{k} 2^{-m} \int_{\mathbb{R}^d} \varphi_m\, d\mu \ge \sum_{m=1}^{k} 2^{-m} \mu(K_m) \ge \left( \frac{3}{2} \right)^{k} . \]
Letting $k \uparrow \infty$ leads to $L(\varphi) = \infty$, a contradiction.

The mapping $L$ is continuous. Suppose it is not. One could then find a sequence $\{\varphi_m\}_{m \ge 1}$ of functions in $C_c(\mathbb{R}^d)$ such that $|\varphi_m| \le 1$ and $L(\varphi_m) \ge 3^m$. The function $\varphi := \sum_{m \ge 1} 2^{-m} \varphi_m$ is in $C_0(\mathbb{R}^d)$ and
\[ L(\varphi) \ge L\left( \sum_{m=1}^{k} 2^{-m} \varphi_m \right) = \sum_{m=1}^{k} 2^{-m} L(\varphi_m) = \sum_{m=1}^{k} 2^{-m} \int_{\mathbb{R}^d} \varphi_m\, d\mu \ge \left( \frac{3}{2} \right)^{k} . \]
Letting $k \uparrow \infty$ again leads to $L(\varphi) = \infty$, a contradiction.

It remains to show that $L(f) = \int_{\mathbb{R}^d} f\, d\mu$ for all $f \in C_0(\mathbb{R}^d)$. For this, consider a sequence $\{f_m\}_{m \ge 1}$ of functions in $C_c(\mathbb{R}^d)$ converging uniformly to $f \in C_0(\mathbb{R}^d)$. We have, since $L$ is continuous, $\lim_{m \uparrow \infty} L(f_m) = L(f)$ and, since $\mu$ is finite, $\lim_{m \uparrow \infty} \int f_m\, d\mu = \int f\, d\mu$ by dominated convergence. Therefore, since $L(f_m) = \int f_m\, d\mu$ for all $m \ge 1$, $L(f) = \int f\, d\mu$.
Vague Convergence

Definition 4.4.29 The sequence $\{\mu_n\}_{n \ge 1}$ in $M^+(\mathbb{R}^d)$ is said to converge vaguely (resp., weakly) to $\mu$ if, for all $f \in C_0(\mathbb{R}^d)$ (resp., $f \in C_b(\mathbb{R}^d)$), $\lim_{n \uparrow \infty} \int_{\mathbb{R}^d} f\, d\mu_n = \int_{\mathbb{R}^d} f\, d\mu$.

Remark 4.4.30 When applied to probability measures, vague convergence is a weaker notion than weak convergence, because a continuous function vanishing at infinity is a particular case of a bounded continuous function.

Theorem 4.4.31 The sequence $\{\mu_n\}_{n \ge 1}$ in $M^+(\mathbb{R}^d)$ converges vaguely if and only if
(a) $\sup_n \mu_n(\mathbb{R}^d) < \infty$, and
(b) there exists a dense subset $E$ in $C_0(\mathbb{R}^d)$ such that for all $f \in E$, there exists $\lim_{n \uparrow \infty} \int_{\mathbb{R}^d} f\, d\mu_n$.

Proof. Necessity. If the sequence converges vaguely, it obviously satisfies (b). As for (a), it is a consequence of the Banach–Steinhaus theorem (footnote 4). Indeed, $\mu_n(\mathbb{R}^d)$ is the norm of $L_n$, where $L_n$ is the continuous linear form $f \mapsto \int_{\mathbb{R}^d} f\, d\mu_n$ from the Banach space $C_0(\mathbb{R}^d)$ (with the sup norm) to $\mathbb{R}$, and for all $f \in C_0(\mathbb{R}^d)$, $\sup_n \left| \int_{\mathbb{R}^d} f\, d\mu_n \right| < \infty$.

4 Let $E$ be a Banach space and $F$ be a normed vector space. Let $\{L_i\}_{i \in I}$ be a family of continuous linear mappings from $E$ to $F$ such that $\sup_{i \in I} \|L_i(x)\| < \infty$ for all $x \in E$. Then $\sup_{i \in I} \|L_i\| < \infty$. See for instance [Rudin, 1986], Thm. 5.8.

Sufficiency. Suppose the sequence satisfies (a) and (b). Let $f \in C_0(\mathbb{R}^d)$. For all $\varphi \in E$,
\[ \left| \int f\, d\mu_m - \int f\, d\mu_n \right| \le \left| \int \varphi\, d\mu_m - \int \varphi\, d\mu_n \right| + \left| \int f\, d\mu_m - \int \varphi\, d\mu_m \right| + \left| \int f\, d\mu_n - \int \varphi\, d\mu_n \right| \le \left| \int \varphi\, d\mu_m - \int \varphi\, d\mu_n \right| + 2 \sup_{x \in \mathbb{R}^d} |f(x) - \varphi(x)| \times \sup_n \mu_n(\mathbb{R}^d) . \]
Since $\sup_{x \in \mathbb{R}^d} |f(x) - \varphi(x)|$ can be made arbitrarily small by a proper choice of $\varphi$, this shows that the sequence $\{\int f\, d\mu_n\}_{n \ge 1}$ is a Cauchy sequence. It therefore converges to some $L(f)$, and $L$ so defined is a positive linear form on $C_0(\mathbb{R}^d)$. Therefore, there exists a $\mu \in M^+(\mathbb{R}^d)$ such that $L(f) = \int_{\mathbb{R}^d} f\, d\mu$ and $\{\mu_n\}_{n \ge 1}$ converges vaguely to $\mu$.
Helly’s Theorem

Theorem 4.4.32 From any bounded sequence of $M^+(\mathbb{R}^d)$, one can extract a vaguely convergent subsequence.

Proof. Let $\{\mu_n\}_{n \ge 1}$ be a bounded sequence of $M^+(\mathbb{R}^d)$. Let $\{f_n\}_{n \ge 1}$ be a dense sequence of $C_0(\mathbb{R}^d)$. Since the sequence $\{\int f_1\, d\mu_n\}_{n \ge 1}$ is bounded, one can extract from it a convergent subsequence $\{\int f_1\, d\mu_{1,n}\}_{n \ge 1}$. Since the sequence $\{\int f_2\, d\mu_{1,n}\}_{n \ge 1}$ is bounded, one can extract from it a convergent subsequence $\{\int f_2\, d\mu_{2,n}\}_{n \ge 1}$. This diagonal selection process is continued. At step $k$, since the sequence $\{\int f_{k+1}\, d\mu_{k,n}\}_{n \ge 1}$ is bounded, one can extract from it a convergent subsequence $\{\int f_{k+1}\, d\mu_{k+1,n}\}_{n \ge 1}$. The sequence $\{\nu_k\}_{k \ge 1}$ where $\nu_k = \mu_{k,k}$ (the “diagonal” sequence) is extracted from the original sequence and for all $f_n$, the sequence $\{\int f_n\, d\nu_k\}_{k \ge 1}$ converges. The conclusion follows from Theorem 4.4.31.
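Fourier Transforms of Finite Measures

Definition 4.4.33 The Fourier transform of a measure $\mu \in M^+(\mathbb{R}^d)$ is the function $\hat{\mu} : \mathbb{R}^d \to \mathbb{C}$ defined by
\[ \hat{\mu}(\nu) = \int_{\mathbb{R}^d} e^{-2i\pi \langle \nu, x \rangle}\, \mu(dx) , \]
where $\langle \nu, x \rangle := \sum_{j=1}^{d} \nu_j x_j$.

Theorem 4.4.34 The Fourier transform of a measure $\mu \in M^+(\mathbb{R}^d)$ is bounded and uniformly continuous.

Proof. From the definition, we have that
\[ |\hat{\mu}(\nu)| \le \int_{\mathbb{R}^d} \left| e^{-2i\pi \langle \nu, x \rangle} \right| \mu(dx) = \int_{\mathbb{R}^d} \mu(dx) = \mu(\mathbb{R}^d) , \]
where the last term does not depend on $\nu$ and is finite. Also, for all $h \in \mathbb{R}^d$,
\[ |\hat{\mu}(\nu + h) - \hat{\mu}(\nu)| \le \int_{\mathbb{R}^d} \left| e^{-2i\pi \langle \nu + h, x \rangle} - e^{-2i\pi \langle \nu, x \rangle} \right| \mu(dx) = \int_{\mathbb{R}^d} \left| e^{-2i\pi \langle h, x \rangle} - 1 \right| \mu(dx) . \]
The last term is independent of $\nu$ and tends to 0 as $h \to 0$ by dominated convergence (recall that $\mu$ is finite).

Theorem 4.4.35 Let $\mu \in M^+(\mathbb{R}^d)$ and let $\hat{f}$ be the Fourier transform of $f \in L^1_{\mathbb{C}}(\mathbb{R}^d)$. Then
\[ \int_{\mathbb{R}^d} \hat{f}\, d\mu = \int_{\mathbb{R}^d} f \hat{\mu}\, dx . \]

Proof. This follows from Fubini’s theorem. In fact,
\[ \int_{\mathbb{R}^d} \hat{f}\, d\mu = \int_{\mathbb{R}^d} \left( \int_{\mathbb{R}^d} f(x)\, e^{-2i\pi \langle \nu, x \rangle}\, dx \right) \mu(d\nu) = \int_{\mathbb{R}^d} f(x) \left( \int_{\mathbb{R}^d} e^{-2i\pi \langle \nu, x \rangle}\, \mu(d\nu) \right) dx . \]
(The interchange of the order of integration is justified by the fact that the function $(x, \nu) \mapsto \left| f(x)\, e^{-2i\pi \langle \nu, x \rangle} \right| = |f(x)|$ is integrable with respect to the product measure $dx \times \mu(d\nu)$. Recall that $\mu$ is finite.)

For the next theorem, recall that $C_b(\mathbb{R}^d)$ denotes the collection of uniformly bounded and continuous functions from $\mathbb{R}^d$ to $\mathbb{R}$.

Theorem 4.4.36 The sequence $\{\mu_n\}_{n \ge 1}$ in $M^+(\mathbb{R}^d)$ converges weakly to $\mu$ if and only if
(i) it converges vaguely to $\mu$, and
(ii) $\lim_{n \uparrow \infty} \mu_n(\mathbb{R}^d) = \mu(\mathbb{R}^d)$.

Proof. The necessity of (i) immediately follows from the observation that $C_0(\mathbb{R}^d) \subset C_b(\mathbb{R}^d)$. The necessity of (ii) follows from the fact that the function that is the constant 1 is in $C_b(\mathbb{R}^d)$ and therefore $\int 1\, d\mu_n = \mu_n(\mathbb{R}^d)$ tends to $\int 1\, d\mu = \mu(\mathbb{R}^d)$ as $n \uparrow \infty$.

Sufficiency. Suppose that (i) and (ii) are satisfied. To prove weak convergence, it suffices to prove that $\lim_{n \uparrow \infty} \int_{\mathbb{R}^d} f\, d\mu_n = \int_{\mathbb{R}^d} f\, d\mu$ for any nonnegative function $f \in C_b(\mathbb{R}^d)$. Since the measure $\mu$ is of finite total mass, for any $\varepsilon > 0$ one can find a compact set $K_\varepsilon = K$ such that $\mu(\bar{K}) \le \varepsilon$, where $\bar{K}$ denotes the complement of $K$. Choose a continuous function with compact support $\varphi$ with values in $[0, 1]$ and such that $\varphi \ge 1_K$. Since $|f - f\varphi| \le \|f\| (1 - \varphi)$ (where $\|f\| = \sup_{x \in \mathbb{R}^d} |f(x)|$),
\[ \limsup_{n \uparrow \infty} \left| \int f\, d\mu_n - \int f\varphi\, d\mu_n \right| \le \limsup_{n \uparrow \infty} \int \|f\| (1 - \varphi)\, d\mu_n = \|f\| \left( \lim_{n \uparrow \infty} \int d\mu_n - \lim_{n \uparrow \infty} \int \varphi\, d\mu_n \right) = \|f\| \int (1 - \varphi)\, d\mu \le \varepsilon \|f\| . \]
Similarly, $\left| \int f\, d\mu - \int f\varphi\, d\mu \right| \le \varepsilon \|f\|$. Moreover, since $f\varphi \in C_c(\mathbb{R}^d) \subset C_0(\mathbb{R}^d)$, vague convergence gives $\lim_{n \uparrow \infty} \int f\varphi\, d\mu_n = \int f\varphi\, d\mu$. Therefore, for all $\varepsilon > 0$,
\[ \limsup_{n \uparrow \infty} \left| \int f\, d\mu_n - \int f\, d\mu \right| \le 2\varepsilon \|f\| , \]
and this completes the proof.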
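Condition (ii) of Theorem 4.4.36 cannot be dropped. A standard illustration, not taken from the text, is the sequence of Dirac measures $\mu_n = \delta_n$ on $\mathbb{R}$: for every $f \in C_0(\mathbb{R})$ one has $\int f\, d\delta_n = f(n) \to 0$, so the sequence converges vaguely to the null measure, while $\mu_n(\mathbb{R}) = 1$ for all $n$. The sketch below evaluates one test function of each kind; it illustrates the phenomenon for these particular functions and is not a proof.

```python
import math


def integral_against_dirac(f, location: float) -> float:
    """Integral of f with respect to the Dirac measure at `location`."""
    return f(location)


def f_vanishing(x: float) -> float:
    """A function in C_0(R): continuous and vanishing at infinity."""
    return math.exp(-x * x)


def f_constant(x: float) -> float:
    """The constant 1: bounded and continuous, but not in C_0(R)."""
    return 1.0


locations = (1, 5, 10, 50)
vague_side = [integral_against_dirac(f_vanishing, n) for n in locations]
total_masses = [integral_against_dirac(f_constant, n) for n in locations]
print(vague_side)    # decreases to 0, the integral against the null measure
print(total_masses)  # stays equal to 1, so condition (ii) fails
```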
The Proof of Paul Lévy’s Criterion

We shall in fact prove a slightly more general result:

Theorem 4.4.37 Let $\{\mu_n\}_{n \ge 1}$ be a sequence of $M^+(\mathbb{R}^d)$ such that for all $\nu \in \mathbb{R}^d$, there exists $\lim_{n \uparrow \infty} \hat{\mu}_n(\nu) = \varphi(\nu)$ for some function $\varphi$ that is continuous at 0. Then $\{\mu_n\}_{n \ge 1}$ converges weakly to a finite measure $\mu$ whose Fourier transform is $\varphi$.

We will be ready for the generalization of Paul Lévy’s criterion of convergence in distribution after a few preliminaries.

Definition 4.4.38 A family $\{\alpha_t\}_{t > 0}$ of functions $\alpha_t : \mathbb{R}^d \to \mathbb{C}$ in $L^1_{\mathbb{C}}(\mathbb{R}^d)$ is called an approximation of the Dirac distribution in $\mathbb{R}^d$ if it satisfies the following three conditions:
(i) $\int_{\mathbb{R}^d} \alpha_t(x)\, dx = 1$;
(ii) $\sup_{t > 0} \int_{\mathbb{R}^d} |\alpha_t(x)|\, dx := M < \infty$; and
(iii) for any compact neighborhood $V$ of $0 \in \mathbb{R}^d$, $\lim_{t \downarrow 0} \int_{\bar{V}} |\alpha_t(x)|\, dx = 0$, where $\bar{V}$ denotes the complement of $V$.

Lemma 4.4.39 Let $\{\alpha_t\}_{t > 0}$ be an approximation of the Dirac distribution in $\mathbb{R}^d$. Let $f : \mathbb{R}^d \to \mathbb{C}$ be a bounded function continuous at all points of a compact $K \subset \mathbb{R}^d$. Then $\lim_{t \downarrow 0} f * \alpha_t = f$ uniformly in $K$.

Proof. We will show later that
\[ \lim_{y \to 0} \sup_{x \in K} |f(x - y) - f(x)| = 0 . \qquad (\star) \]
$V$ being a compact neighborhood of 0, we have that
\[ \sup_{x \in K} |f(x) - (f * \alpha_t)(x)| \le M \sup_{y \in V} \sup_{x \in K} |f(x - y) - f(x)| + 2 \sup_{x \in \mathbb{R}^d} |f(x)| \int_{\bar{V}} |\alpha_t(y)|\, dy . \]
This quantity can be made smaller than any $\varepsilon > 0$ by choosing $V$ such that the first term is $< \frac{1}{2} \varepsilon$ (uniform continuity of $f$ on compact sets) and the second term can then be made $< \frac{1}{2} \varepsilon$ by letting $t \downarrow 0$ (condition (iii) of Definition 4.4.38).

Proof of $(\star)$. Let $\varepsilon > 0$ be given. For all $x \in K$, there exists an open and symmetric neighborhood $V_x$ of 0 such that for all $y \in V_x$, $|f(x - y) - f(x)| \le \frac{1}{2} \varepsilon$. Also, one can find an open and symmetric neighborhood $W_x$ of 0 such that $W_x + W_x \subset V_x$. The union of open sets $\cup_{x \in K} \{x + W_x\}$ obviously covers $K$, and since the latter is a compact set, one can extract a finite covering of $K$: $\cup_{j=1}^{m} (x_j + W_{x_j})$. Define $W = \cap_{j=1}^{m} W_{x_j}$, an open neighborhood of 0. Let $y \in W$. Any $x \in K$ belongs to some $x_j + W_{x_j}$, and for such $j$,
\[ |f(x - y) - f(x)| \le |f(x_j) - f(x_j - (x_j - x))| + |f(x_j) - f(x_j - (x_j - x + y))| . \]
But $x_j - x \in W_{x_j}$ and $x_j - x + y \in W_{x_j} + W \subset V_{x_j}$. Therefore
\[ |f(x - y) - f(x)| \le \frac{1}{2} \varepsilon + \frac{1}{2} \varepsilon = \varepsilon . \]

We can now prove Theorem 4.4.37.

Proof. The sequence $\{\mu_n\}_{n \ge 1}$ is bounded (that is, $\sup_n \mu_n(\mathbb{R}^d) < \infty$) since $\mu_n(\mathbb{R}^d) = \hat{\mu}_n(0)$ has a limit as $n \uparrow \infty$. In particular,
\[ |\hat{\mu}_n(\nu)| \le \mu_n(\mathbb{R}^d) \le \sup_n \mu_n(\mathbb{R}^d) < \infty . \qquad (\dagger) \]
If $f \in L^1_{\mathbb{R}}(\mathbb{R}^d)$, then by Theorem 4.4.35, $\int \hat{f}\, d\mu_n = \int f \hat{\mu}_n\, dx$. By dominated convergence (using $(\dagger)$), $\lim_{n \uparrow \infty} \int f \hat{\mu}_n\, dx = \int f \varphi\, dx$. Therefore
\[ \lim_{n \uparrow \infty} \int \hat{f}\, d\mu_n = \int f \varphi\, dx . \]
One can replace in the above equality $\hat{f}$ by any function in $D(\mathbb{R}^d)$, since such a function is always the Fourier transform of some integrable function. Therefore, by Theorem 4.4.31, $\{\mu_n\}_{n \ge 1}$ converges vaguely to some finite measure $\mu$.

We now show that it converges weakly to $\mu$. Let $f$ be an integrable function with integral 1 such that $f(x) = f(-x)$ and $\hat{f} \in D(\mathbb{R}^d)$. For $t > 0$, define $f_t(x) := t^{-d} f(t^{-1} x)$. Using Theorem 4.4.35, we have
\[ \int \hat{f}(tx)\, \mu_n(dx) = \int f_t(x)\, \hat{\mu}_n(x)\, dx = (f_t * \hat{\mu}_n)(0) . \]
By dominated convergence,
\[ \lim_{n \uparrow \infty} \int \hat{f}(tx)\, \mu_n(dx) = (f_t * \varphi)(0) , \]
and by vague convergence,
\[ \lim_{n \uparrow \infty} \int \hat{f}(tx)\, \mu_n(dx) = \int \hat{f}(tx)\, \mu(dx) . \]
Therefore, for all $t > 0$,
\[ \int \hat{f}(tx)\, \mu(dx) = (f_t * \varphi)(0) . \]
Since the function $\varphi$ is bounded and continuous at the origin, by Lemma 4.4.39, $\lim_{t \downarrow 0} (f_t * \varphi)(0) = \varphi(0)$. Also, by dominated convergence, $\lim_{t \downarrow 0} \int \hat{f}(tx)\, \mu(dx) = \mu(\mathbb{R}^d)$. Therefore,
\[ \mu(\mathbb{R}^d) = \varphi(0) = \lim_{n \uparrow \infty} \mu_n(\mathbb{R}^d) . \]
Therefore, by Theorem 4.4.36, $\{\mu_n\}_{n \ge 1}$ converges weakly to $\mu$. Since the function $x \mapsto e^{-2i\pi \langle \nu, x \rangle}$ is continuous and bounded,
\[ \hat{\mu}(\nu) = \int e^{-2i\pi \langle \nu, x \rangle}\, \mu(dx) = \lim_{n \uparrow \infty} \int e^{-2i\pi \langle \nu, x \rangle}\, \mu_n(dx) = \varphi(\nu) . \]
4.5 The Hierarchy of Convergences

4.5.1 Almost-sure vs in Probability

Theorem 4.5.1
A. If the sequence $\{Z_n\}_{n \ge 1}$ of complex random variables converges almost surely to some complex random variable $Z$, it also converges in probability to the same random variable $Z$.
B. If the sequence of complex random variables $\{X_n\}_{n \ge 1}$ converges in probability to the complex random variable $X$, one can find a sequence of integers $\{n_k\}_{k \ge 1}$, strictly increasing, such that $\{X_{n_k}\}_{k \ge 1}$ converges almost surely to $X$.

(B says, in other words: From a sequence converging in probability, one can extract a subsequence converging almost surely.)

Proof. A. Suppose almost-sure convergence. By Theorem 4.1.3, for all $\varepsilon > 0$,
\[ P(|Z_n - Z| \ge \varepsilon \ \text{i.o.}) = 0 , \]
that is
\[ P\bigl( \cap_{n \ge 1} \cup_{k=n}^{\infty} (|Z_k - Z| \ge \varepsilon) \bigr) = 0 , \]
or (sequential continuity of probability)
\[ \lim_{n \uparrow \infty} P\bigl( \cup_{k=n}^{\infty} (|Z_k - Z| \ge \varepsilon) \bigr) = 0 , \]
which in turn implies that
\[ \lim_{n \uparrow \infty} P(|Z_n - Z| \ge \varepsilon) = 0 . \]
B. By definition of convergence in probability, for all $\varepsilon > 0$,
\[ \lim_{n \uparrow \infty} P(|X_n - X| \ge \varepsilon) = 0 . \]
Therefore one can find $n_1$ such that $P\bigl( |X_{n_1} - X| \ge \frac{1}{1} \bigr) \le \frac{1}{2}$. Then, one can find $n_2 > n_1$ such that $P\bigl( |X_{n_2} - X| \ge \frac{1}{2} \bigr) \le \left(\frac{1}{2}\right)^2$, and so on, until we have a strictly increasing sequence of integers $n_k$ ($k \ge 1$) such that
\[ P\left( |X_{n_k} - X| \ge \frac{1}{k} \right) \le \left( \frac{1}{2} \right)^{k} . \]
It then follows from Theorem 4.1.2 that
\[ \lim_{k \uparrow \infty} X_{n_k} = X \quad \text{a.s.} \]

Remark 4.5.2 Exercise 4.6.7 gives an example of a sequence converging in probability, but not almost surely. Thus, convergence in probability is a notion strictly weaker than almost-sure convergence.
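A classical example of this phenomenon is the so-called typewriter sequence on $\Omega = [0, 1)$ with Lebesgue measure (it is a standard illustration and not necessarily the example of Exercise 4.6.7). Writing $n = 2^k + j$ with $0 \le j < 2^k$, let $X_n$ be the indicator of the interval $[j 2^{-k}, (j+1) 2^{-k})$. Then $P(X_n \ne 0) = 2^{-k} \to 0$, yet every $\omega$ falls in one interval of each level $k$, so $X_n(\omega) = 1$ infinitely often. Along the subsequence $n_k = 2^k$ the intervals are $[0, 2^{-k})$ and $X_{n_k}(\omega) \to 0$ for every $\omega > 0$, in line with part B of Theorem 4.5.1. The sketch below, with hypothetical function names, follows one sample point.

```python
def typewriter(n: int, omega: float) -> int:
    """X_n(omega) for n >= 1: indicator of [j / 2**k, (j + 1) / 2**k),
    where n = 2**k + j with 0 <= j < 2**k."""
    k = n.bit_length() - 1
    j = n - (1 << k)
    return int(j / 2 ** k <= omega < (j + 1) / 2 ** k)


def prob_nonzero(n: int) -> float:
    """P(X_n != 0), the length of the interval attached to n."""
    return 2.0 ** -(n.bit_length() - 1)


omega = 0.3
# Indices n < 2**12 at which X_n(omega) = 1: exactly one per level k = 0..11.
hits = [n for n in range(1, 2 ** 12) if typewriter(n, omega) == 1]
# Values along the subsequence n_k = 2**k, k = 1..11.
subsequence = [typewriter(2 ** k, omega) for k in range(1, 12)]
print(len(hits), hits[:5], subsequence, prob_nonzero(2 ** 11))
```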
Theorem 4.5.3 If the sequence $\{Z_n\}_{n \ge 1}$ of square-integrable complex random variables converges in quadratic mean to the complex random variable $Z$, it also converges in probability to the same random variable.

Proof. It suffices to observe that, by Markov’s inequality, for all $\varepsilon > 0$,
\[ P(|Z_n - Z| \ge \varepsilon) \le \frac{1}{\varepsilon^2}\, E\bigl[ |Z_n - Z|^2 \bigr] . \]
4.5.2
The Rank of Convergence in Distribution
We now compare convergence in distribution to the other types of convergence. Convergence in distribution is weaker than almostsure convergence: Theorem 4.5.4 If the sequence {Xn }n≥1 of random vectors of Rd converges almost surely to some random vector X, it also converges in distribution to the same vector X. Proof. By dominated convergence, for all u ∈ R, lim E ei u,Xn = E ei u,X n↑∞
which implies, by Theorem 4.4.6 that {Xn }n≥1 converges in distribution to X.
In fact, convergence in distribution is even weaker than convergence in probability. Theorem 4.5.5 If the sequence {Xn }n≥1 of random vectors of Rd converges in probability to some random vector X, it also converges in distribution to X. Proof. If this were not the case, one could ﬁnd a function f ∈ Cb (Rd ) such that E[f (Xn )] does not converge to E[f (X)]. In particular, there would exist a subsequence nk and some ε > 0 such that E[f (Xnk )]−E[f (X)] ≥ ε for all k. As {Xnk }k≥1 converges in probability to X, one can extract from it a subsequence {Xnk }≥1 converging almost surely to X. In particular, since f is bounded and continuous, lim E[f (Xnk ] = E[f (X)] by dominated convergence, a contradiction. Combining Theorems 4.5.3 and 4.5.5, we have that convergence in distribution is weaker than convergence in the quadratic mean: Theorem 4.5.6 If the sequence of real random variables {Zn }n≥1 converges in quadratic mean to some random variable Z, it also converges in distribution to the same random variable Z.
Theorem 4.5.6 can be reﬁned in the Gaussian case, where the distribution of the limit can be proved to be Gaussian.
A Stability Property of Gaussian Vectors
Theorem 4.5.7 If $\{Z_n\}_{n\geq 1}$, where $Z_n = (Z_n^{(1)}, \ldots, Z_n^{(m)})$, is a sequence of Gaussian random vectors of fixed dimension $m$ that converges componentwise in quadratic mean to some vector $Z = (Z^{(1)}, \ldots, Z^{(m)})$, the latter vector is Gaussian.
Proof. In fact, by continuity of the inner product in $L^2_{\mathbb{R}}(P)$, for all $1 \leq i, j \leq m$, $\lim_{n\uparrow\infty} E[Z_n^{(i)} Z_n^{(j)}] = E[Z^{(i)} Z^{(j)}]$ and $\lim_{n\uparrow\infty} E[Z_n^{(i)}] = E[Z^{(i)}]$, that is,
\[ \lim_{n\uparrow\infty} m_{Z_n} = m_Z, \qquad \lim_{n\uparrow\infty} \Gamma_{Z_n} = \Gamma_Z, \]
and in particular, for all $u \in \mathbb{R}^m$,
\[ \lim_{n\uparrow\infty} E\left[e^{iu^T Z_n}\right] = \lim_{n\uparrow\infty} e^{iu^T m_{Z_n} - \frac{1}{2} u^T \Gamma_{Z_n} u} = e^{iu^T m_Z - \frac{1}{2} u^T \Gamma_Z u}. \]
The sequence $\{u^T Z_n\}_{n\geq 1}$, converging in quadratic mean to $u^T Z$, also converges in distribution to $u^T Z$. Therefore, $\lim_{n\uparrow\infty} E[e^{iu^T Z_n}] = E[e^{iu^T Z}]$, and finally
\[ E\left[e^{iu^T Z}\right] = e^{iu^T m_Z - \frac{1}{2} u^T \Gamma_Z u} \]
for all $u \in \mathbb{R}^m$. This shows that $Z$ is a Gaussian vector.
Therefore, limits in the quadratic mean preserve the Gaussian nature of random vectors. This is the stability property referred to in the title of this subsection. Note that the Gaussian nature of random vectors is also preserved by linear transformations, as we already know.
Convergence in distribution is weaker than convergence in variation:
Theorem 4.5.8 If the sequence of real random variables $\{X_n\}_{n\geq 1}$ converges in variation to $X$, it converges in distribution to the same random variable.
Proof. Indeed, for all $x$ (not just the continuity points of the distribution of $X$),
\[ |P(X_n \leq x) - P(X \leq x)| \leq d_V(X_n, X) \to 0. \]
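The bound in the proof of Theorem 4.5.8 can be illustrated on a finite state space. The sketch below assumes distributions on $\{0, 1, \ldots, K-1\}$ given as lists of probabilities, and uses the expression $d_V(\alpha, \beta) = \frac{1}{2}\sum_i |\alpha(i) - \beta(i)|$, which is consistent with Exercise 4.6.27; the function names are hypothetical.

```python
from itertools import accumulate


def variation_distance(alpha, beta):
    """Distance in variation of two distributions on {0, ..., K-1}."""
    return 0.5 * sum(abs(a - b) for a, b in zip(alpha, beta))


def max_cdf_gap(alpha, beta):
    """Largest gap between the two cumulative distribution functions."""
    cdf_a, cdf_b = accumulate(alpha), accumulate(beta)
    return max(abs(fa - fb) for fa, fb in zip(cdf_a, cdf_b))


alpha = [0.2, 0.5, 0.3]
beta = [0.4, 0.1, 0.5]
# The cdf gap is dominated by the distance in variation.
print(max_cdf_gap(alpha, beta), variation_distance(alpha, beta))
```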
Complementary reading [Billingsley, 1979, 1992], [Kallenberg, 2002].
4.6 Exercises
Exercise 4.6.1. The telescope formula, take 2
Prove the following inequalities concerning a nonnegative random variable $X$:
\[ \sum_{n=1}^{\infty} P(X \geq n) \leq E[X] \leq 1 + \sum_{n=1}^{\infty} P(X \geq n). \]
Exercise 4.6.2. A recurrence equation
Consider the recurrence equation $X_{n+1} = (X_n - 1)^+ + Z_{n+1}$, $n \geq 0$ ($a^+ := \sup(a, 0)$), where $X_0 = 0$ and where $\{Z_n\}_{n\geq 1}$ is an iid sequence of random variables with values in $\mathbb{N}$. Denote by $T_0$ the first index $n \geq 1$ such that $X_n = 0$ ($T_0 = \infty$ if no such index exists).
a) Show that if $E[Z_1] < 1$, then $P(T_0 < \infty) = 1$.
b) Show that if $E[Z_1] > 1$, there exists a (random) index $n_0$ such that $X_n > 0$ for all $n \geq n_0$.
Exercise 4.6.3. Poisson asymptotics
Let $\{S_n\}_{n\geq 1}$ be an iid sequence of real random variables such that $P(S_1 \in (0, \infty)) = 1$ and $E[S_1] < \infty$, and let for each $t \geq 0$, $N(t) = \sum_{n\geq 1} 1_{(0,t]}(T_n)$, where $T_n = S_1 + \cdots + S_n$. Prove that
\[ \lim_{t\to\infty} \frac{N(t)}{t} = \frac{1}{E[S_1]}. \]
Exercise 4.6.4. SLLN and infinite expectation
Let $\{Z_n\}_{n\geq 1}$ be an iid sequence of nonnegative random variables such that $E[Z_1] = \infty$. Show that
\[ \lim_{n\uparrow\infty} \frac{Z_1 + \cdots + Z_n}{n} = \infty \quad (= E[Z_1]). \]
Exercise 4.6.5. Exchanging the order of expectation and summation
(a) Let $\{S_n\}_{n\geq 1}$ be a sequence of nonnegative random variables. Show that
\[ E\left[\sum_{n=1}^{\infty} S_n\right] = \sum_{n=1}^{\infty} E[S_n]. \tag{$\star$} \]
(b) Let $\{S_n\}_{n\geq 1}$ be a sequence of real random variables such that $\sum_{n\geq 1} E[|S_n|] < \infty$. Show that $(\star)$ holds as well.
Exercise 4.6.6. A sufficient condition for almost-sure convergence
Show that
\[ \sum_{n\geq 1} P(|Z_n - Z| \geq \varepsilon) < \infty \]
for all $\varepsilon > 0$ is a sufficient condition for the sequence of random variables $\{Z_n\}_{n\geq 1}$ to converge almost surely to $Z$.
Exercise 4.6.7. Convergence almost sure vs convergence in probability
Let $\{X_n\}_{n\geq 1}$ be a sequence of independent random variables taking only 2 values, 0 and 1.
(A) Show that a necessary and sufficient condition of almost-sure convergence to 0 is that
\[ \sum_{n\geq 1} P(X_n = 1) < \infty. \]
(B) Show that a necessary and sufficient condition of convergence in probability to 0 is that
\[ \lim_{n\uparrow\infty} P(X_n = 1) = 0. \]
(C) Deduce from the above that convergence in probability does not imply almost-sure convergence.
Exercise 4.6.8. Convergence in probability and in the quadratic mean
Let $\alpha > 0$, and let $\{Z_n\}_{n\geq 1}$ be a sequence of random variables such that
\[ P(Z_n = 1) = 1 - \frac{\alpha}{n}, \qquad P(Z_n = n) = \frac{\alpha}{n}, \]
where $\alpha < 1$. Show that $\{Z_n\}_{n\geq 1}$ converges in probability to some variable $Z$. For what values of $\alpha$ does $\{Z_n\}_{n\geq 1}$ converge to $Z$ in quadratic mean?
Exercise 4.6.9. Convergence in probability
Suppose the sequence of random variables $\{Z_n\}_{n\geq 1}$ converges to $a$ in probability. Let $g : \mathbb{R} \to \mathbb{R}$ be a continuous function. Show that $\{g(Z_n)\}_{n\geq 1}$ converges to $g(a)$ in probability.
Exercise 4.6.10. Inner product in $L^2(P)$
Prove the following: If the sequence $\{Z_n\}_{n\geq 1}$ of square-integrable complex random variables converges in quadratic mean to the complex random variable $Z$, then
\[ \lim_{n\uparrow\infty} E[Z_n] = E[Z] \quad \text{and} \quad \lim_{n\uparrow\infty} E\left[|Z_n|^2\right] = E\left[|Z|^2\right]. \]
Exercise 4.6.11. Convergence in probability but not almost-sure
Let $\{Z_n\}_{n\geq 1}$ be an independent sequence of random variables such that
\[ P(Z_n = \pm 1) = \frac{1}{2n \log n}, \qquad P(Z_n = 0) = 1 - \frac{1}{n \log n}. \]
Let $S_n = \sum_{j=1}^{n} Z_j$. Show that the limit in probability of $\frac{S_n}{n}$ exists, but not the almost-sure limit.
Exercise 4.6.12. Convergence in distribution but not in probability
Let $Z$ be a random variable with a symmetric distribution (that is, $Z$ and $-Z$ have the same distribution). Define the sequence $\{Z_n\}_{n\geq 1}$ as follows: $Z_n = Z$ if $n$ is odd, $Z_n = -Z$ if $n$ is even. In particular, $\{Z_n\}_{n\geq 1}$ converges in distribution to $Z$. Show that if $Z$ is not degenerate, then $\{Z_n\}_{n\geq 1}$ does not converge to $Z$ in probability.
Exercise 4.6.13. The unlimited gambler
This exercise anticipates the gambling situation described in Example 13.1.4 to which the reader is referred for the notation. Suppose that the stakes are bounded, say by $M$, and that the initial fortune of the gambler is $a$. The gambler can borrow whatever amount is needed, so that his "fortune" $Y_n$ at any time $n$ can take arbitrary values. Prove that
\[ P(|Y_n - a| \geq \lambda) \leq 2 \exp\left(-\frac{\lambda^2}{2nM^2}\right). \]
Exercise 4.6.14. Fair coin tosses
Consider a Bernoulli sequence of parameter $\frac{1}{2}$ representing a fair game of heads or tails. Let $X$ be the number of heads after $n$ tosses. Use Hoeffding's inequality to prove that
\[ P(|X - E[X]| \geq \lambda) \leq 2 \exp\left(-\frac{\lambda^2}{n}\right). \]
Exercise 4.6.15. Empty bins
Consider the usual "balls and bins" setting with $n$ bins and $m$ balls (the multinomial distribution). Let $X$ be the number of empty bins. Prove that
\[ P(|X - E[X]| \geq \lambda) \leq 2 \exp\left(-\frac{\lambda^2}{m}\right). \]
Exercise 4.6.16. Bernoulli is Borel
Let the sequence $\{X_n\}_{n\geq 1}$ be as in Theorem 1.1.6 (it is sometimes called a Bernoulli sequence). Prove that
\[ P(\{X_n\}_{n\geq 1} \text{ is a Borel sequence}) = 1. \]
A sequence $\{x_n\}_{n\geq 1}$ taking its values in the set $\{0, 1\}$ is called a Borel sequence if for all $k \geq 1$, all $a_1, \ldots, a_k$ in $\{0, 1\}$,
\[ \lim_{n\uparrow\infty} \frac{1}{n} \sum_{j=1}^{n} 1_{\{x_{j+1} = a_1, \ldots, x_{j+k} = a_k\}} = \frac{1}{2^k}. \]
Exercise 4.6.17. Metrization of convergence in probability
Define for any two random variables $X$ and $Y$,
\[ d(X, Y) := E\left[\frac{|X - Y|}{1 + |X - Y|}\right]. \]
Prove that $d$ so defined is a metric. Prove the following variant of Theorem 4.2.4: The sequence $\{X_n\}_{n\geq 1}$ converges in probability to the variable $X$ if and only if
\[ \lim_{n\uparrow\infty} d(X_n, X) = 0. \]
Exercise 4.6.18. Poisson's law of rare events in the plane
Let $Z_1, \ldots, Z_M$ be $M$ bidimensional iid random vectors uniformly distributed on the square $[0, A] \times [0, A] = \Gamma_A$. For any measurable set $C \subseteq \Gamma_A$, define $N(C)$ to be the number of random vectors $Z_i$ that fall in $C$. Let $C_1, \ldots, C_K$ be measurable disjoint subsets of $\Gamma_A$.
i) Give the characteristic function of the vector $(N(C_1), \ldots, N(C_K))$.
ii) We now let $M$ be a function of $A$ such that
\[ \frac{M(A)}{A^2} = \lambda > 0. \]
Show that, as $A \uparrow \infty$, $(N(C_1), \ldots, N(C_K))$ converges in distribution. Identify the limit distribution.
Exercise 4.6.19. A characterization of the Gaussian distribution
Let $G$ be a cumulative distribution function on $\mathbb{R}$ with
\[ \int_{\mathbb{R}} x \, dG(x) = 0, \qquad \int_{\mathbb{R}} x^2 \, dG(x) = \sigma^2 < \infty. \]
In addition, suppose that $G$ has the following property: If $X_1$ and $X_2$ are independent random variables with the cdf $G$, then $\frac{X_1 + X_2}{\sqrt{2}}$ also admits $G$ as cdf. Prove that $G$ is the cdf of a Gaussian variable with mean 0 and variance $\sigma^2$.
Exercise 4.6.20. The central limit theorem
Prove, using the central limit theorem, that
\[ \lim_{n\to\infty} e^{-n} \sum_{k=1}^{n} \frac{n^k}{k!} = \frac{1}{2}. \]
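A numerical look at the quantity in Exercise 4.6.20 may help build intuition before attempting the proof. The sketch below is an illustration only, not a solution: it evaluates $e^{-n}\sum_{k=1}^{n} n^k/k!$ term by term in log scale to avoid overflow, and the function name `poisson_partial_sum` is hypothetical.

```python
import math


def poisson_partial_sum(n):
    """Return exp(-n) * sum_{k=1}^{n} n**k / k!, computed termwise in log scale."""
    log_n = math.log(n)
    return sum(
        math.exp(-n + k * log_n - math.lgamma(k + 1))   # lgamma(k + 1) = log(k!)
        for k in range(1, n + 1)
    )


for n in (1, 10, 100, 10_000):
    print(n, poisson_partial_sum(n))   # the values approach 1/2 as n grows
```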
Exercise 4.6.21. A confidence interval
In Example 4.4.15 how would you find the best statement of the kind: "This coin is $x$ percent guaranteed unbiased"? In other words, how would you obtain the largest $x$ in this claim? (You are not required to give the actual value, just the method for obtaining it.)
Exercise 4.6.22. Maximal coincidence of biased coins
Find a pair of $\{0, 1\}$-valued random variables with prescribed marginals $P(X = 1) = a$, $P(Y = 1) = b$, where $a, b \in (0, 1)$, and such that $P(X = Y)$ is maximal.
Exercise 4.6.23. Functions of random variables and distance in variation
Let $(E, \mathcal{E})$ and $(G, \mathcal{G})$ be measurable spaces and let $f : (E, \mathcal{E}) \to (G, \mathcal{G})$ be some measurable function. For $\alpha$ and $\beta$, probability distributions on $(E, \mathcal{E})$, define the probability distribution $\alpha f^{-1}$ on $(G, \mathcal{G})$ by $\alpha f^{-1}(B) = \alpha(f^{-1}(B))$, and define likewise $\beta f^{-1}$. Prove that
\[ d_V(\alpha, \beta) \geq d_V\left(\alpha f^{-1}, \beta f^{-1}\right). \]
Exercise 4.6.24. The variation distance of two Poisson variables
Let $p_\lambda$ denote the Poisson distribution with mean $\lambda$. Let $\mu > 0$. Prove that
\[ d_V(p_\lambda, p_\mu) \leq 1 - e^{-|\mu - \lambda|}. \]
Exercise 4.6.25. Convexity of the distance in variation
Let $\alpha_i$ and $\beta_i$, $1 \leq i \leq K$, be probability distributions on the countable space $E$. Show that if $\lambda_i \in [0, 1]$ and $\sum_{i=1}^{K} \lambda_i = 1$, then
\[ d_V\left(\sum_{i=1}^{K} \lambda_i \alpha_i, \sum_{i=1}^{K} \lambda_i \beta_i\right) \leq \sum_{i=1}^{K} \lambda_i \, d_V(\alpha_i, \beta_i). \]
State and prove the analogous result in the general case of an arbitrary measurable space $(E, \mathcal{E})$.
Exercise 4.6.26. An alternative expression of the distance in variation
Let $\alpha$ and $\beta$ be two probability distributions on some countable space $E$. Show that
\[ d_V(\alpha, \beta) = \frac{1}{2} \sup_{|f| \leq 1} \left| \sum_{i} f(i)\alpha(i) - \sum_{i} f(i)\beta(i) \right|, \]
where $|f| := \sup_{i\in E} |f(i)|$. State and prove the analogous result in the general case of an arbitrary measurable space $(E, \mathcal{E})$.
Exercise 4.6.27. Another expression of the distance in variation
Let $\alpha$ and $\beta$ be two probability distributions on the same countable space $E$. Prove the following alternative expressions of the distance in variation:
\[ d_V(\alpha, \beta) = 1 - \sum_{i\in E} \alpha(i) \wedge \beta(i) = \sum_{i\in E} (\alpha(i) - \beta(i))^+ = \sum_{i\in E} (\beta(i) - \alpha(i))^+. \]
State and prove the analogous result in the general case of an arbitrary measurable space $(E, \mathcal{E})$.
Exercise 4.6.28. Convergence in probability and convergence in variation
Let $\{Z_n\}_{n\geq 0}$ be a sequence of $\{0, 1\}$-valued random variables. Show that it converges in variation to 0 if and only if it converges in probability to 0. Deduce from this that there exist sequences of random variables that converge in distribution but not in variation.
Exercise 4.6.29. Tricky Cauchy
Let $\{X_n\}_{n\geq 1}$ be a sequence of iid Cauchy random variables.
(a) What is the limit in distribution of $\frac{X_1 + \cdots + X_n}{n}$?
(b) Does $\frac{X_1 + \cdots + X_n}{n^2}$ converge in distribution?
(c) Does $\frac{X_1 + \cdots + X_n}{n}$ converge almost surely to a (nonrandom) constant?
II: STANDARD STOCHASTIC PROCESSES
Chapter 5 Generalities on Random Processes
A random process (or stochastic process) is a collection of random variables indexed by time, which may record the evolution of some phenomenon. This definition is generalized to accommodate models where space, and not just time, plays a role. This chapter addresses the basic issues concerning the distribution of such processes, their sample path properties and the various notions of measurability.
5.1 The Distribution of a Random Process
5.1.1 Kolmogorov's Theorem on Distributions
Let $\mathbb{T}$ be an arbitrary index set. In general in this book, it will be one of the following: $\mathbb{N}$ (the natural numbers), $\mathbb{Z}$ (the integers), $\mathbb{R}$ (the real numbers) or $\mathbb{R}_+$ (the nonnegative real numbers) (see, however, Example 5.1.24).
Random Processes as Collections of Random Variables
Definition 5.1.1 A stochastic process (or random process) is a family $\{X(t)\}_{t\in\mathbb{T}}$ of random elements defined on the same probability space $(\Omega, \mathcal{F}, P)$ and taking their values in some given measurable space $(E, \mathcal{E})$. It is called a real (resp., complex) stochastic process if it takes real (resp., complex) values. It is called a continuous-time stochastic process when the index set is $\mathbb{R}$ or $\mathbb{R}_+$, and a discrete-time stochastic process when it is $\mathbb{N}$ or $\mathbb{Z}$. When the index set is $\mathbb{N}$ or $\mathbb{Z}$, we also use the notation $n$ instead of $t$ for the time index, and write $X_n$ instead of $X(t)$.
Example 5.1.2: Random Sinusoid. Let $A$ be some real nonnegative random variable, let $\nu_0 \in \mathbb{R}$ be a positive constant and let $\Phi$ be a random variable with values in $[0, 2\pi]$. The formula
\[ X(t) = A \sin(2\pi\nu_0 t + \Phi) \]
defines a stochastic process. For each sample $\omega \in \Omega$, the function $t \mapsto X(t, \omega)$ is a sinusoid with frequency $\nu_0$, random amplitude $A(\omega)$ and random phase $\Phi(\omega)$.
© Springer Nature Switzerland AG 2020. P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/9783030401832_5
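The random sinusoid of Example 5.1.2 gives a concrete picture of a process as a random function: drawing $(A, \Phi)$ once fixes an entire sample path. The sketch below is a minimal illustration; the distributions chosen for $A$ (exponential) and $\Phi$ (uniform) and the name `sample_sinusoid` are assumptions made here, since the example leaves these laws unspecified.

```python
import math
import random


def sample_sinusoid(nu0, rng):
    """Draw (A, Phi) once and return them with the sample path t -> X(t)."""
    amplitude = rng.expovariate(1.0)           # assumption: A ~ Exp(1), nonnegative
    phase = rng.uniform(0.0, 2.0 * math.pi)    # assumption: Phi uniform on [0, 2*pi]

    def path(t):
        return amplitude * math.sin(2.0 * math.pi * nu0 * t + phase)

    return amplitude, phase, path


rng = random.Random(0)                          # fixed seed: one fixed sample omega
amplitude, phase, path = sample_sinusoid(nu0=2.0, rng=rng)
print([round(path(k / 8), 3) for k in range(5)])
```

Every call to `path` reuses the same $(A(\omega), \Phi(\omega))$, so the randomness lies in the choice of the path, not in its evaluation.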
Example 5.1.3: Counting Processes. A counting process $\{N(t)\}_{t\geq 0}$ is, by definition, an integer-valued stochastic process such that the functions $t \mapsto N(t, \omega)$ are almost surely integer-valued, nondecreasing, right-continuous with left-hand limits, such that $N(t) - N(t-) \leq 1$ ($t \geq 0$) and $N(0) = 0$. For instance, $N(t)$ could be the number of arrivals of cars at a highway toll in the interval $(0, t]$.
Finite-dimensional Distributions
One way of describing the probabilistic behavior of a stochastic process is by means of its finite-dimensional distribution.
Definition 5.1.4 By definition, the finite-dimensional (fidi) distribution of a stochastic process $\{X(t)\}_{t\in\mathbb{T}}$ is the collection of probability distributions of the random vectors $(X(t_1), \ldots, X(t_k))$ for all $k \geq 1$ and all $t_1, \ldots, t_k \in \mathbb{T}$.
Example 5.1.5: A Particular Counting Process. Let $\{N(t)\}_{t\geq 0}$ be a counting process. Suppose that for all $k \geq 1$, all $0 = t_0 \leq t_1 < \cdots < t_k$, and all nonnegative integers $m_1, \ldots, m_k$,
\[ P\left(\cap_{j=1}^{k} \{N(t_j) - N(t_{j-1}) = m_j\}\right) = \prod_{j=1}^{k} e^{-\lambda(t_j - t_{j-1})} \frac{(\lambda(t_j - t_{j-1}))^{m_j}}{m_j!}. \]
The fact that this completely describes the fidi distribution of $\{N(t)\}_{t\geq 0}$ is obvious. The existence of such a process is at this point not proved but will be guaranteed by Theorem 5.1.7 below. We shall see later on that this is the counting process of a homogeneous Poisson process on the positive half-line of intensity $\lambda$.
There are two pending issues. Firstly, is there a stochastic process having a prescribed finite-dimensional distribution? Secondly, is it unique? The answer is rather simple and is best given in the setting of canonical measurable spaces of functions.
Let $E^{\mathbb{T}}$ be the set of functions $x : \mathbb{T} \to E$. An element $x \in E^{\mathbb{T}}$ is therefore a function from $\mathbb{T}$ to $E$: $x := (x(t), t \in \mathbb{T})$, where $x(t) \in E$. Let $\mathcal{E}^{\mathbb{T}}$ (see footnote 1) be the smallest σ-field containing all the sets of the form $\{x \in E^{\mathbb{T}} ; x(t) \in C\}$, where $t$ ranges over $\mathbb{T}$ and $C$ ranges over $\mathcal{E}$. The measurable space $(E^{\mathbb{T}}, \mathcal{E}^{\mathbb{T}})$ so defined is called the canonical (measurable) space of stochastic processes indexed by $\mathbb{T}$ with values in $(E, \mathcal{E})$ (we say: "with values in $E$" if the choice of the σ-field $\mathcal{E}$ is clear in the given context).
Footnote 1: The notation $\mathcal{E}^{\otimes\mathbb{T}}$ is also used in order to distinguish this mathematical object from a collection of $\mathcal{E}$-valued functions.
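The product formula of Example 5.1.5 (a product of Poisson probabilities with means $\lambda(t_j - t_{j-1})$) is easy to evaluate, and one can check numerically that summing out the last increment recovers the lower-dimensional probability, as consistency of the finite-dimensional distributions requires. The function name `increment_probability` and the numerical values below are illustrative assumptions.

```python
import math


def increment_probability(lam, times, counts):
    """P(N(t_j) - N(t_{j-1}) = m_j for all j), with t_0 = 0, as in Example 5.1.5."""
    prob, previous = 1.0, 0.0
    for t, m in zip(times, counts):
        mean = lam * (t - previous)            # Poisson mean of the j-th increment
        prob *= math.exp(-mean) * mean ** m / math.factorial(m)
        previous = t
    return prob


lam = 2.0
# Sum over the second increment (truncated at 60 terms; the remainder is negligible).
joint = sum(increment_probability(lam, [0.5, 1.5], [1, m2]) for m2 in range(60))
marginal = increment_probability(lam, [0.5], [1])
print(joint, marginal)   # the two numbers agree up to rounding
```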
Denote by $\pi_t$ the coordinate map at $t \in \mathbb{T}$, that is, the mapping from $E^{\mathbb{T}}$ to $E$ defined by
\[ \pi_t(x) := x(t) \quad (x \in E^{\mathbb{T}}). \]
This is a random element since, for $C \in \mathcal{E}$, the set $\{x ; \pi_t(x) \in C\} = \{x ; x(t) \in C\}$ belongs to $\mathcal{E}^{\mathbb{T}}$, by definition of the latter. The family $\{\pi_t\}_{t\in\mathbb{T}}$ is called the coordinate process on the (canonical) measurable space $(E^{\mathbb{T}}, \mathcal{E}^{\mathbb{T}})$.
The probability distribution of the vector $(X(t_1), \ldots, X(t_k))$ is a probability measure $Q_{(t_1,\ldots,t_k)}$ on $(E^k, \mathcal{E}^k)$. It satisfies the following obvious properties, called the compatibility conditions:
C1. For all $(t_1, \ldots, t_k) \in \mathbb{T}^k$, and any permutation $\sigma$ on $\mathbb{T}^k$,
\[ Q_{\sigma(t_1,\ldots,t_k)} = Q_{(t_1,\ldots,t_k)} \circ \sigma^{-1}. \tag{5.1} \]
C2. For all $(t_1, \ldots, t_k, t_{k+1}) \in \mathbb{T}^{k+1}$ and all $A \in \mathcal{E}^k$,
\[ Q_{(t_1,\ldots,t_k)}(A) = Q_{(t_1,\ldots,t_{k+1})}(A \times E). \tag{5.2} \]
Remark 5.1.6 Conditions C1 and C2 just acknowledge the obvious facts of the type $P(X(t_1) \in A_1, X(t_2) \in A_2) = P(X(t_2) \in A_2, X(t_1) \in A_1)$ and $P(X(t_1) \in A_1, X(t_2) \in E) = P(X(t_1) \in A_1)$.
Recall the definition of a Polish space (see footnote 2). It is a topological space whose topology is metrizable (generated by some metric), complete (with respect to this metric) and separable (there exists a countable dense subset).
Theorem 5.1.7 Let $E$ be a Polish space and let $\mathcal{E} := \mathcal{B}(E)$ be its Borel σ-field. Let $Q = \{Q_{(t_1,\ldots,t_k)} ; k \geq 1, (t_1, \ldots, t_k) \in \mathbb{T}^k\}$ be a family of probability distributions on $(E^k, \mathcal{E}^k)$ satisfying the compatibility conditions C1 and C2. Then there exists a unique probability $P$ on the canonical measurable space $(E^{\mathbb{T}}, \mathcal{E}^{\mathbb{T}})$ such that the coordinate process $\{\pi_t\}_{t\in\mathbb{T}}$ admits the finite-dimensional distribution $Q$.
This is the Kolmogorov existence and uniqueness theorem (see footnote 3).
Example 5.1.8: iid sequences. Take $E = \mathbb{R}$, $\mathbb{T} = \mathbb{Z}$, and let the fidi distributions $Q_{(t_1,\ldots,t_k)}$ be of the form
\[ Q_{(t_1,\ldots,t_k)} = Q_{t_1} \times \cdots \times Q_{t_k}, \]
where for each $t \in \mathbb{T}$, $Q_t$ is a probability distribution on $(E, \mathcal{E})$. This collection of finite-dimensional distributions obviously satisfies the compatibility conditions, and the resulting coordinate process is an independent random sequence indexed by the integers. It is an iid (that is: independent and identically distributed) sequence if $Q_t = Q$ for all $t \in \mathbb{T}$. Note that the restriction to a Polish space $E$ is superfluous (Exercise 5.4.1).
Footnote 2: For most applications in this book, the Polish space in question will be some $\mathbb{R}^m$ with the euclidean topology.
Footnote 3: For a proof of Kolmogorov's distribution theorem, see for instance [Shiryaev, 1996].
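For the product distributions of Example 5.1.8, the compatibility conditions can be verified mechanically when the state space is finite. The sketch below is an illustration under that assumption: each marginal $Q_t$ is a dictionary mapping a value to its probability, and the helper names `product_fidi` and `drop_last_coordinate` are hypothetical.

```python
from itertools import product


def product_fidi(marginals):
    """Joint law Q_{t_1} x ... x Q_{t_k} as a dict mapping value tuples to probabilities."""
    joint = {}
    for values in product(*(m.keys() for m in marginals)):
        p = 1.0
        for marginal, v in zip(marginals, values):
            p *= marginal[v]
        joint[values] = p
    return joint


def drop_last_coordinate(joint):
    """Sum out the last coordinate, i.e. evaluate the joint law on sets A x E."""
    reduced = {}
    for values, p in joint.items():
        reduced[values[:-1]] = reduced.get(values[:-1], 0.0) + p
    return reduced


q1, q2, q3 = {0: 0.3, 1: 0.7}, {0: 0.5, 1: 0.5}, {0: 0.1, 1: 0.9}
lhs = product_fidi([q1, q2])
rhs = drop_last_coordinate(product_fidi([q1, q2, q3]))
print(all(abs(lhs[key] - rhs[key]) < 1e-12 for key in lhs))   # condition C2 on singletons
```

Condition C1 amounts to the observation that reordering the marginals reorders the coordinates of the joint law in the same way.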
Independence
Let $\{X(t)\}_{t\in\mathbb{T}}$ be a stochastic process. The σ-field $\mathcal{F}^X := \sigma(X(t); t \in \mathbb{T})$ is called the global history of this process.
Definition 5.1.9 Two stochastic processes $\{X(t)\}_{t\in\mathbb{T}}$ and $\{Y(t)\}_{t\in\mathbb{T}}$ defined on the same probability space, with values in $(E, \mathcal{E})$ and $(E', \mathcal{E}')$ respectively, are called independent if the σ-fields $\mathcal{F}^X$ and $\mathcal{F}^Y$ are independent.
The verification of independence is simplified by the following result.
Theorem 5.1.10 For the stochastic processes $\{X(t)\}_{t\in\mathbb{T}}$ and $\{Y(t)\}_{t\in\mathbb{T}}$, with values in $(E, \mathcal{E})$ and $(E', \mathcal{E}')$ respectively, to be independent, it suffices that for all $t_1, \ldots, t_k \in \mathbb{T}$ and all $s_1, \ldots, s_\ell \in \mathbb{T}$, the vectors $(X(t_1), \ldots, X(t_k))$ and $(Y(s_1), \ldots, Y(s_\ell))$ be independent.
Proof. The collection of events of the type $\{X(t_1) \in C_1, \ldots, X(t_k) \in C_k\}$, where the $C_i$'s belong to $\mathcal{E}$, is a π-system generating $\mathcal{F}^X$, with a similar observation for $\mathcal{F}^Y$. The result then follows from these observations and Theorem 3.1.39.
Transfer to Canonical Spaces
Let a stochastic process $\{X(t)\}_{t\in\mathbb{T}}$ be given with values in a Polish space $E$ and defined on the probability space $(\Omega, \mathcal{F}, P)$. Define the mapping $h : (\Omega, \mathcal{F}) \to (E^{\mathbb{T}}, \mathcal{E}^{\mathbb{T}})$ by
\[ h(\omega) = (X(t, \omega), t \in \mathbb{T}). \]
This mapping is measurable. To show this, it is enough to verify that $h^{-1}(C) \in \mathcal{F}$ for all $C \in \mathcal{C}$, where $\mathcal{C}$ is a collection of subsets of $E^{\mathbb{T}}$ that generates $\mathcal{E}^{\mathbb{T}}$. Here, we choose $\mathcal{C} = \{\{x; x(t) \in A\} ; t \in \mathbb{T}, A \in \mathcal{E}\}$. But $h^{-1}(\{x; x(t) \in A\}) = \{\omega; X(t, \omega) \in A\} \in \mathcal{F}$ since $X(t)$ is a random variable. Now, denote by $P_X$ the image of $P$ by $h$. The fidi distribution of the coordinate process of the canonical measurable space, under $P_X$, is the same as that of the original stochastic process.
Definition 5.1.11 The probability $P_X$ on $(E^{\mathbb{T}}, \mathcal{E}^{\mathbb{T}})$ is called the distribution of $\{X(t)\}_{t\in\mathbb{T}}$.
An immediate consequence of Kolmogorov’s (existence and) uniqueness theorem is:
Corollary 5.1.12 Two stochastic processes with the same ﬁdi distribution have the same distribution.
Stationarity
In this subsubsection, the index set is one of the following: $\mathbb{N}$, $\mathbb{Z}$, $\mathbb{R}_+$, $\mathbb{R}$.
Definition 5.1.13 A stochastic process $\{X(t)\}_{t\in\mathbb{T}}$ is called (strictly) stationary if for all $k \geq 1$, all $(t_1, \ldots, t_k) \in \mathbb{T}^k$, the probability distribution of the random vector $(X(t_1 + h), \ldots, X(t_k + h))$ is independent of $h \in \mathbb{T}$ such that $t_1 + h, \ldots, t_k + h \in \mathbb{T}$.
A stochastic process with index set $\mathbb{R}_+$ (resp., $\mathbb{N}$) that is stationary can always be uniquely extended to $\mathbb{R}$ (resp., $\mathbb{Z}$) in such a way that stationarity is preserved. More precisely:
Theorem 5.1.14 Consider the canonical space $(E^{\mathbb{T}}, \mathcal{E}^{\mathbb{T}})$ of stochastic processes with values in the Polish space $E$ and with index set $\mathbb{T} = \mathbb{R}_+$ (resp., $\mathbb{T} = \mathbb{N}$). Let $P_+$ be a probability measure on this canonical space that makes the canonical process stationary. Then there exists a (unique) probability measure $P$ on the canonical space of stochastic processes with values in $E$ with index set $\mathbb{R}$ (resp., $\mathbb{Z}$) such that the restriction of $P$ to $(E^{\mathbb{R}_+}, \mathcal{E}^{\mathbb{R}_+})$ (resp., $(E^{\mathbb{N}}, \mathcal{E}^{\mathbb{N}})$) is $P_+$.
Proof. Let $\{Q^+_{(t_1,\ldots,t_k)} ; k \geq 1, (t_1, \ldots, t_k) \in \mathbb{T}^k\}$ be the finite-dimensional distributions relative to $P_+$. Define the collection of finite-dimensional distributions $\{Q_{(t_1,\ldots,t_k)} ; k \geq 1, (t_1, \ldots, t_k) \in \mathbb{T}^k\}$, where the index set is now $\mathbb{R}$ (resp., $\mathbb{Z}$), by
\[ Q_{(t_1,\ldots,t_k)} := Q^+_{(t_1+h,\ldots,t_k+h)}, \]
where $h$ is an element of $\mathbb{R}_+$ (resp., $\mathbb{N}$) such that $t_1 + h, \ldots, t_k + h \in \mathbb{R}_+$ (resp., $\in \mathbb{N}$). By stationarity under $P_+$, the right-hand side does not depend on the choice of such an $h$. Observe that this family satisfies the compatibility conditions (5.1) and (5.2). The result then follows from Kolmogorov's existence and uniqueness theorem.
5.1.2 Second-order Stochastic Processes
In this subsection, $\mathbb{T}$ represents any of the following index sets: $\mathbb{R}$, $\mathbb{R}_+$, $\mathbb{Z}$ and $\mathbb{N}$.
Definition 5.1.15 A measurable complex stochastic process $\{X(t)\}_{t\in\mathbb{T}}$ satisfying the condition
\[ E\left[|X(t)|^2\right] < \infty \quad (t \in \mathbb{T}) \]
is called a second-order stochastic process.
In other words, for all $t \in \mathbb{T}$, the complex random variable $X(t) \in L^2_{\mathbb{C}}(P)$. This implies that the mean function $m : \mathbb{T} \to \mathbb{C}$ and the covariance function $\Gamma : \mathbb{T} \times \mathbb{T} \to \mathbb{C}$ are well defined by
\[ m(t) := E[X(t)] \]
and
\[ \Gamma(t, s) := \operatorname{cov}(X(t), X(s)) = E[X(t)X(s)^*] - m(t)m(s)^*. \]
When the mean function is the null function, the stochastic process is said to be centered.
Theorem 5.1.16 Let $\{X(t)\}_{t\in\mathbb{T}}$ be a second-order stochastic process with mean function $m$ and covariance function $\Gamma$. Then, for all $s, t \in \mathbb{T}$,
\[ E\left[|X(t) - m(t)|\right] \leq \Gamma(t, t)^{\frac{1}{2}} \]
and
\[ |\Gamma(t, s)| \leq \Gamma(t, t)^{\frac{1}{2}} \, \Gamma(s, s)^{\frac{1}{2}}. \]
Proof. Apply Schwarz's inequality
\[ E\left[|XY|\right] \leq E\left[|X|^2\right]^{\frac{1}{2}} E\left[|Y|^2\right]^{\frac{1}{2}} \]
with $X := X(t) - m(t)$ and $Y := 1$ for the first inequality, and with $X := X(t) - m(t)$ and $Y := X(s) - m(s)$ for the second one.
For a stationary second-order stochastic process, for all $s, t \in \mathbb{T}$,
\[ m(t) \equiv m, \tag{5.3} \]
where $m \in \mathbb{C}$, and
\[ \Gamma(t, s) = C(t - s) \tag{5.4} \]
for some function $C : \mathbb{T} \to \mathbb{C}$, also called the covariance function of the process. The complex number $m$ is called the mean of the process.
Wide-sense Stationarity
A notion weaker than strict stationarity concerns second-order processes with values in $E = \mathbb{C}$:
Definition 5.1.17 If conditions (5.3) and (5.4) are satisfied for all $s, t \in \mathbb{T}$, the complex second-order stochastic process $\{X(t)\}_{t\in\mathbb{T}}$ is called wide-sense stationary.
Remark 5.1.18 There exist stochastic processes that are wide-sense stationary but not strictly stationary (Exercise 5.4.3).
In continuous time ($\mathbb{T} = \mathbb{R}$ or $\mathbb{R}_+$) this appellation will be reserved in this book to wide-sense stationary processes that have in addition a continuous covariance function. For this condition to be satisfied, it suffices that the covariance function be continuous at the origin. This is in turn equivalent to continuity in the quadratic mean of the stochastic process, that is: For all $t \in \mathbb{T}$,
\[ \lim_{h\to 0} E\left[|X(t + h) - X(t)|^2\right] = 0. \]
In fact, the covariance function is then uniformly continuous on $\mathbb{R}$.
Proof. Assume without loss of generality that the process is centered (otherwise replace $X(t)$ by $X(t) - m$, which changes neither the covariance function nor the increments). Then
\[ E\left[|X(t + h) - X(t)|^2\right] = E\left[|X(t + h)|^2\right] + E\left[|X(t)|^2\right] - E[X(t)X(t + h)^*] - E[X(t)^* X(t + h)] = 2C(0) - C(h) - C(h)^*, \]
and therefore, uniform continuity in quadratic mean follows from the continuity at the origin of the autocovariance function. On the other hand,
\[ |C(\tau + h) - C(\tau)| = \left|E[X(\tau + h)X(0)^*] - E[X(\tau)X(0)^*]\right| = \left|E[(X(\tau + h) - X(\tau))X(0)^*]\right| \]
\[ \leq E\left[|X(0)|^2\right]^{\frac{1}{2}} \times E\left[|X(\tau + h) - X(\tau)|^2\right]^{\frac{1}{2}} = E\left[|X(0)|^2\right]^{\frac{1}{2}} \times E\left[|X(h) - X(0)|^2\right]^{\frac{1}{2}}, \]
and therefore, uniform continuity of the autocovariance function follows from the continuity in quadratic mean at the origin.
Note that $C(0) = \sigma_X^2$, the variance of any of the random variables $X(t)$.
As an immediate corollary of Theorem 5.1.16, we have:
Corollary 5.1.19 Let $\{X(t)\}_{t\in\mathbb{T}}$ be a wide-sense stationary stochastic process with mean $m$ and covariance function $C$. Then
\[ E\left[|X(t) - m|\right] \leq C(0)^{\frac{1}{2}} \]
and
\[ |C(\tau)| \leq C(0). \]
Example 5.1.20: Harmonic processes. Let $\{U_k\}_{k\geq 1}$ be centered random variables of $L^2_{\mathbb{C}}(P)$ that are mutually uncorrelated. Let $\{\Phi_k\}_{k\geq 1}$ be completely random phases, that is, real random variables uniformly distributed on $[0, 2\pi]$. Suppose moreover that the $U$ variables are independent of the $\Phi$ variables. Finally, suppose that $\sum_{k=1}^{\infty} E[|U_k|^2] < \infty$. Then (Exercise 5.4.5): For all $t \in \mathbb{R}$, the series on the right-hand side of
\[ X(t) = \sum_{k=1}^{\infty} U_k \cos(2\pi\nu_k t + \Phi_k), \]
where the $\nu_k$'s are real numbers (frequencies), is convergent in $L^2_{\mathbb{C}}(P)$ and defines a centered wss stochastic process (called a harmonic process) with covariance function
\[ C(\tau) = \sum_{k=1}^{\infty} \frac{1}{2} E\left[|U_k|^2\right] \cos(2\pi\nu_k \tau). \]
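The factor $\frac{1}{2}\cos(2\pi\nu_k\tau)$ in the covariance of a harmonic process comes from averaging over the uniform phase: for a single term, $E[\cos(2\pi\nu(t+\tau) + \Phi)\cos(2\pi\nu t + \Phi)] = \frac{1}{2}\cos(2\pi\nu\tau)$, whatever $t$. The sketch below checks this identity by averaging over an equally spaced grid of phases; the function name `phase_average` and the numerical values are illustrative assumptions.

```python
import math


def phase_average(nu, t, tau, n_grid=1000):
    """Average of cos(2*pi*nu*(t+tau) + phi) * cos(2*pi*nu*t + phi) over a uniform phase grid."""
    total = 0.0
    for j in range(n_grid):
        phi = 2.0 * math.pi * j / n_grid
        total += math.cos(2.0 * math.pi * nu * (t + tau) + phi) * math.cos(2.0 * math.pi * nu * t + phi)
    return total / n_grid


nu, tau = 1.5, 0.2
for t in (0.0, 0.3, 7.1):
    # The average does not depend on t, in agreement with wide-sense stationarity.
    print(t, phase_average(nu, t, tau), 0.5 * math.cos(2.0 * math.pi * nu * tau))
```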
Recall the definition of the correlation coefficient $\rho$ between two nontrivial real square-integrable random variables $X$ and $Y$ with respective means $m_X$ and $m_Y$ and respective variances $\sigma_X^2$ and $\sigma_Y^2$:
\[ \rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}. \]
The variable $aX + b$ that minimizes the function $F(a, b) := E\left[(Y - aX - b)^2\right]$ is
\[ \widehat{Y} = m_Y + \frac{\operatorname{cov}(X, Y)}{\sigma_X^2}(X - m_X), \]
and moreover
\[ E\left[\left(\widehat{Y} - Y\right)^2\right] = \left(1 - \rho^2\right)\sigma_Y^2 \]
(Exercise 3.4.43). This variable is called the best linear-quadratic estimate of $Y$ given $X$, or the linear regression of $Y$ on $X$.
For a wss stochastic process with covariance function $C$, the function
\[ \rho(\tau) = \frac{C(\tau)}{C(0)} \]
is called the autocorrelation function. It is in fact, for any $t$, the correlation coefficient between $X(t)$ and $X(t + \tau)$. In particular, the best linear-quadratic estimate of $X(t + \tau)$ given $X(t)$ is
\[ \widehat{X}(t + \tau \mid t) := m + \rho(\tau)(X(t) - m). \]
The estimation error is then, according to the above,
\[ E\left[\left(\widehat{X}(t + \tau \mid t) - X(t + \tau)\right)^2\right] = \sigma_X^2\left(1 - \rho(\tau)^2\right). \]
Remark 5.1.21 In the continuous time case, this shows that if the support of the covariance function is concentrated around $\tau = 0$, the process tends to be "unpredictable". We shall come back to this when we introduce the notion of white noise.
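The regression formulas recalled above can be verified on a small discrete example. The sketch below assumes a real pair $(X, Y)$ with finitely many values, given as a dictionary mapping $(x, y)$ to its probability; it computes the coefficients of the best estimate $aX + b$ and compares the resulting mean square error with $(1 - \rho^2)\sigma_Y^2$. The function name `linear_regression` and the joint law are hypothetical.

```python
def linear_regression(joint):
    """joint maps (x, y) to its probability; returns (a, b, mse, predicted_mse)."""
    m_x = sum(p * x for (x, _), p in joint.items())
    m_y = sum(p * y for (_, y), p in joint.items())
    var_x = sum(p * (x - m_x) ** 2 for (x, _), p in joint.items())
    var_y = sum(p * (y - m_y) ** 2 for (_, y), p in joint.items())
    cov = sum(p * (x - m_x) * (y - m_y) for (x, y), p in joint.items())
    a = cov / var_x                      # slope cov(X, Y) / sigma_X^2
    b = m_y - a * m_x                    # intercept, so that E[aX + b] = m_Y
    mse = sum(p * (y - a * x - b) ** 2 for (x, y), p in joint.items())
    rho_squared = cov ** 2 / (var_x * var_y)
    return a, b, mse, (1.0 - rho_squared) * var_y


joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 2): 0.4}
a, b, mse, predicted = linear_regression(joint)
print(a, b, mse, predicted)   # mse and predicted agree up to rounding
```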
5.1.3 Gaussian Processes
This particular type of stochastic process is an important one for many reasons, for instance: (1) because of its mathematical tractability due in particular to the stability of the Gaussianity of stochastic processes by linear transformations and limits, (2) because of its ubiquity due to the many forms of the central limit theorem for stochastic processes, (3) because the most famous example of a Gaussian process, Brownian motion (Chapter 11), is the basis of a very productive stochastic calculus (Chapter 14). Gaussian processes will at this point serve to substantiate the definitions and theoretical results of this chapter.
Definition 5.1.22 Let $\mathbb{T}$ be an arbitrary index set. The real-valued stochastic process $\{X(t)\}_{t\in\mathbb{T}}$ is called a Gaussian process if for all $n \geq 1$ and for all $t_1, \ldots, t_n \in \mathbb{T}$, the random vector $(X(t_1), \ldots, X(t_n))$ is Gaussian.
In particular, its characteristic function is given by the formula
\[ E\left[\exp\left\{i \sum_{j=1}^{n} u_j X(t_j)\right\}\right] = \exp\left\{i \sum_{j=1}^{n} u_j m(t_j) - \frac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} u_j u_k \Gamma(t_j, t_k)\right\}, \tag{5.5} \]
where $u_1, \ldots, u_n \in \mathbb{R}$, $m(t) := E[X(t)]$ and $\Gamma(t, s) := E[(X(t) - m(t))(X(s) - m(s))]$.
The next result is an existence theorem.
Theorem 5.1.23 Let $\Gamma : \mathbb{T}^2 \to \mathbb{R}$ be a nonnegative definite function, that is, such that for all $t_1, \ldots, t_k \in \mathbb{T}$ and all $u_1, \ldots, u_k \in \mathbb{R}$,
\[ \sum_{i=1}^{k} \sum_{j=1}^{k} u_i u_j \Gamma(t_i, t_j) \geq 0. \tag{5.6} \]
Then, there exists a centered Gaussian process with covariance $\Gamma$.
Proof. By Theorem 3.2.5, for any $k \in \mathbb{N}_+$, any $t_1, \ldots, t_k \in \mathbb{T}$, there exists a centered Gaussian vector with covariance matrix $\{\Gamma(t_i, t_j)\}_{1\leq i,j\leq k}$. Let $Q_{(t_1,\ldots,t_k)}$ be the probability distribution of this vector. The family $\{Q_{(t_1,\ldots,t_k)} ; t_1, \ldots, t_k \in \mathbb{T}\}$ is obviously compatible and therefore, by Kolmogorov's theorem (Theorem 5.1.7), a centered Gaussian process with covariance $\Gamma$ exists and is unique in distribution.
Example 5.1.24: A Gaussian field on $\mathbb{R}^d$. Let $\mu$ be a locally finite measure on $\mathbb{R}^d$ and let $\mathcal{B}_b(\mathbb{R}^d)$ be the collection of bounded Borel sets of $\mathbb{R}^d$. There exists a unique (distributionwise) centered Gaussian process $\{X(A)\}_{A\in\mathcal{B}_b(\mathbb{R}^d)}$ with covariance function
\[ \Gamma(A, B) = \mu(A \cap B) \quad (A, B \in \mathcal{B}_b(\mathbb{R}^d)). \]
To prove this it suffices to verify condition (5.6). This is done by observing that
\[ \sum_{i=1}^{k} \sum_{j=1}^{k} u_i u_j \Gamma(A_i, A_j) = \sum_{i=1}^{k} \sum_{j=1}^{k} u_i u_j \mu(A_i \cap A_j) = \int_{\mathbb{R}^d} \left(\sum_{j=1}^{k} u_j 1_{A_j}(x)\right)^2 \mu(dx) \geq 0. \]
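Condition (5.6) for $\Gamma(A, B) = \mu(A \cap B)$ can be observed numerically in the simplest setting. The sketch below assumes $d = 1$, takes $\mu$ to be the Lebesgue measure and the sets $A_i$ to be bounded intervals, so that $\mu(A_i \cap A_j)$ is the length of the overlap; the function names and the chosen intervals are illustrative assumptions.

```python
import random


def overlap_length(a, b):
    """Lebesgue measure of the intersection of the intervals a = (a0, a1) and b = (b0, b1)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def quadratic_form(intervals, u):
    """Double sum of u_i * u_j * mu(A_i intersect A_j), as in condition (5.6)."""
    return sum(
        u[i] * u[j] * overlap_length(intervals[i], intervals[j])
        for i in range(len(intervals))
        for j in range(len(intervals))
    )


intervals = [(0.0, 1.0), (0.5, 2.0), (1.5, 3.0), (-1.0, 0.25)]
rng = random.Random(1)
values = [
    quadratic_form(intervals, [rng.uniform(-1.0, 1.0) for _ in intervals])
    for _ in range(1000)
]
print(min(values))   # nonnegative, up to rounding error
```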
Theorem 5.1.25 For a Gaussian process with index set $\mathbb{T} = \mathbb{R}^d$ or $\mathbb{Z}^d$ ($d \in \mathbb{N}_+$) to be stationary, it is necessary and sufficient that for some real number $m$ and some function $C : \mathbb{T} \to \mathbb{R}$, $m(t) = m$ and $\Gamma(t, s) = C(t - s)$ for all $s, t \in \mathbb{T}$.
Proof. The necessity is obvious, whereas the sufficiency is proved by replacing the $t_\ell$'s in (5.5) by $t_\ell + h$ to obtain the characteristic function of $(X(t_1 + h), \ldots, X(t_n + h))$, namely,
\[ \exp\left\{i \sum_{j=1}^{n} u_j m - \frac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} u_j u_k C(t_j - t_k)\right\}, \]
and then observing that this quantity is independent of $h$.
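Theorems 5.1.23 and 5.1.25 can be made concrete by sampling a finite-dimensional marginal of a stationary Gaussian process. The sketch below uses a Cholesky factorization of the covariance matrix, a standard numerical device that is not the construction used in the text, and it assumes the covariance function $C(\tau) = e^{-|\tau|}$, for which the matrices $\{C(t_i - t_j)\}$ at distinct times are positive definite. All names are hypothetical.

```python
import math
import random


def cholesky(gamma):
    """Lower-triangular L with L L^T = gamma (gamma symmetric positive definite)."""
    n = len(gamma)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(gamma[i][i] - s)
            else:
                lower[i][j] = (gamma[i][j] - s) / lower[j][j]
    return lower


def sample_stationary_gaussian(times, mean, cov_fn, rng):
    """Sample (X(t_1), ..., X(t_n)) with constant mean and covariance cov_fn(t_i - t_j)."""
    gamma = [[cov_fn(t - s) for s in times] for t in times]
    lower = cholesky(gamma)
    noise = [rng.gauss(0.0, 1.0) for _ in times]      # independent standard Gaussians
    return [mean + sum(lower[i][k] * noise[k] for k in range(i + 1)) for i in range(len(times))]


def cov_fn(tau):
    return math.exp(-abs(tau))                         # assumed covariance function


times = [0.0, 0.5, 1.3]
print(sample_stationary_gaussian(times, 0.0, cov_fn, random.Random(0)))
```

Because the covariance matrix depends on the times only through their differences, shifting all the times by the same $h$ leaves the law of the sampled vector unchanged, which is the content of Theorem 5.1.25.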
Gaussian Subspaces
With any second-order stochastic process is associated a Hilbert subspace of the Hilbert space of square-integrable random variables. More precisely: let $\{X_i\}_{i\in I}$, where $I$ is an arbitrary index set, be a collection of complex (resp., real) random variables in $L^2_{\mathbb{C}}(P)$ (resp., $L^2_{\mathbb{R}}(P)$). The Hilbert subspace of $L^2_{\mathbb{C}}(P)$ (resp., $L^2_{\mathbb{R}}(P)$) consisting of the closure in $L^2_{\mathbb{C}}(P)$ (resp., $L^2_{\mathbb{R}}(P)$) of the vector space of finite linear complex (resp., real) combinations of elements of $\{X_i\}_{i\in I}$ is called the complex (resp., real) Hilbert subspace generated by $\{X_i\}_{i\in I}$ and is denoted by $H_{\mathbb{C}}(X_i, i \in I)$ (resp., $H_{\mathbb{R}}(X_i, i \in I)$).
A collection $\{X_i\}_{i\in I}$ of real random variables defined on the same probability space, where $I$ is an arbitrary index set, is called a Gaussian family if for every finite set of indices $i_1, \ldots, i_k \in I$, the random vector $(X_{i_1}, \ldots, X_{i_k})$ is Gaussian.
Definition 5.1.26 A Hilbert subspace $G$ of the real Hilbert space $L^2_{\mathbb{R}}(P)$ is called a Gaussian (Hilbert) subspace if it is a Gaussian family.
Theorem 5.1.27 Let $\{X_i\}_{i\in I}$, where $I$ is an arbitrary index set, be a Gaussian family of random variables of $L^2_{\mathbb{R}}(P)$. Then the Hilbert subspace $H_{\mathbb{R}}(X_i, i \in I)$ generated by $\{X_i\}_{i\in I}$ is a Gaussian subspace of $L^2_{\mathbb{R}}(P)$.
Proof. By definition, the Hilbert subspace $H_{\mathbb{R}}(X_i, i \in I)$ consists of all the random variables in $L^2_{\mathbb{R}}(P)$ that are limits in quadratic mean of finite linear combinations of elements of the family $\{X_i\}_{i\in I}$. To prove the announced result, it suffices to show that if $\{Z_n\}_{n\geq 1}$, where $Z_n = (Z_n^{(1)}, \ldots, Z_n^{(m)})$, is a sequence of Gaussian random vectors of fixed dimension $m$ that converges componentwise in quadratic mean to some vector $Z = (Z^{(1)}, \ldots, Z^{(m)})$ of $H_{\mathbb{R}}(X_i, i \in I)$, then the latter vector is Gaussian. By continuity of the inner product in $L^2_{\mathbb{R}}(P)$, for all $1 \leq i, j \leq m$, $\lim_{n\uparrow\infty} E[Z_n^{(i)} Z_n^{(j)}] = E[Z^{(i)} Z^{(j)}]$ and $\lim_{n\uparrow\infty} E[Z_n^{(i)}] = E[Z^{(i)}]$, that is,
\[ \lim_{n\uparrow\infty} m_{Z_n} = m_Z, \qquad \lim_{n\uparrow\infty} \Gamma_{Z_n} = \Gamma_Z, \]
and in particular, for all $u \in \mathbb{R}^m$,
\[ \lim_{n\uparrow\infty} E\left[e^{iu^T Z_n}\right] = \lim_{n\uparrow\infty} e^{iu^T m_{Z_n} - \frac{1}{2} u^T \Gamma_{Z_n} u} = e^{iu^T m_Z - \frac{1}{2} u^T \Gamma_Z u}. \]
The sequence $\{u^T Z_n\}_{n\geq 1}$ converges to $u^T Z$ in quadratic mean, and therefore in distribution (Theorem 4.5.6). In particular, $\lim_{n\uparrow\infty} E[e^{iu^T Z_n}] = E[e^{iu^T Z}]$, and finally
\[ E\left[e^{iu^T Z}\right] = e^{iu^T m_Z - \frac{1}{2} u^T \Gamma_Z u} \quad (u \in \mathbb{R}^m), \]
which shows that $Z$ is Gaussian.
5.2 Random Processes as Random Functions
5.2.1 Versions and Modifications
For each $\omega \in \Omega$, the function $t \in \mathbb{T} \mapsto X(t, \omega) \in E$ is called the $\omega$-trajectory, or $\omega$-sample path, of the stochastic process $\{X(t)\}_{t\in\mathbb{T}}$. A stochastic process can be viewed as a random function, associating to each $\omega \in \Omega$ the trajectory $t \mapsto X(t, \omega)$. When the state space $E$ is some $\mathbb{R}^m$ and the index set is $\mathbb{R}$, for fixed $\omega \in \Omega$, we can then discuss the continuity properties of the associated sample path. For example, if for all $\omega \in \Omega$ the $\omega$-sample path is right-continuous, we call this stochastic process right-continuous. It is called $P$-a.s. right-continuous if the $\omega$-sample paths are right-continuous for all $\omega \in \Omega$, except perhaps for $\omega \in N$, where $N$ is a $P$-negligible set. One defines similarly ($P$-a.s.) left-continuity, ($P$-a.s.) continuity, etc.
Definition 5.2.1 Two stochastic processes $\{X(t)\}_{t \in \mathbb{T}}$ and $\{Y(t)\}_{t \in \mathbb{T}}$ defined on the same probability space $(\Omega, \mathcal{F}, P)$ are said to be versions of one another if
\[ P(\{\omega \,;\, X(t, \omega) \neq Y(t, \omega)\}) = 0 \quad \text{for all } t \in \mathbb{T}. \]
They are said to be undistinguishable if
\[ P(\{\omega \,;\, X(t, \omega) = Y(t, \omega) \text{ for all } t \in \mathbb{T}\}) = 1, \]
that is, if they have identical trajectories except on a $P$-null set. Clearly two undistinguishable processes are versions of one another.

Example 5.2.2: Two Distinguishable Versions. The two processes
\[ X(t) = 1 \quad (t \in [0, 1]) \qquad \text{and} \qquad Y(t) = 1_{\{U \neq t\}} \quad (t \in [0, 1]), \]
where $U$ is a random variable uniformly distributed on $[0, 1]$, satisfy $P(X(t) = Y(t)) = P(U \neq t) = 1$ for all $t \in [0, 1]$. They are therefore versions of one another, and in particular they have the same finite-dimensional distributions. They are not undistinguishable, however, since $X(U) = 1$ whereas $Y(U) = 0$ for every $\omega$.

It is useful to find conditions bearing only on the finite-dimensional distributions and guaranteeing that the sample paths have certain desired properties. This is not always feasible but, in certain cases, there is a version possessing the desired properties. One such result is Kolmogorov's continuity theorem below.
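The distinction made in Example 5.2.2 can be checked numerically. The following is only an illustrative sketch, assuming the reading $X(t) = 1$ and $Y(t) = 1_{\{U \neq t\}}$ of the example; the function name `sample_paths`, the list `fixed_times` and the seed are hypothetical choices made for this illustration. For each draw of $U$, the two paths agree at every time fixed in advance, yet they differ at the random time $t = U$.

```python
import random

def sample_paths(u):
    """Return the two processes of Example 5.2.2 as functions of t, for a fixed draw u of U."""
    x = lambda t: 1                      # X(t) = 1 for all t
    y = lambda t: 1 if t != u else 0     # Y(t) = indicator of {U != t}
    return x, y

rng = random.Random(0)                   # hypothetical seed, for reproducibility only
fixed_times = [0.0, 0.25, 0.5, 1.0]      # times chosen before U is drawn

agree_at_fixed_times = True
differ_somewhere = True
for _ in range(1000):
    u = rng.random()                     # U uniform on [0, 1)
    x, y = sample_paths(u)
    # For a fixed t, the event {U = t} has probability zero.
    agree_at_fixed_times &= all(x(t) == y(t) for t in fixed_times)
    # The trajectories nevertheless differ at the random time t = U.
    differ_somewhere &= (x(u) != y(u))

print(agree_at_fixed_times, differ_somewhere)
```

The first flag reflects the "versions" property (a statement for each fixed $t$), the second the failure of undistinguishability (a statement about whole trajectories).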
5.2.2 Kolmogorov's Continuity Condition
Theorem 5.2.3 Let $(E, d)$ be a complete metric space. Let $\{X(t)\}_{t \in [0,1]}$ be a stochastic process with values in $E$. Suppose that for some positive real numbers $\alpha, \beta, K$,
\[ E\left[d(X(t), X(s))^{\alpha}\right] \leq K |t - s|^{1+\beta} \quad \text{for all } s, t \in [0, 1). \]
Then there exists a version of this stochastic process whose sample paths are almost surely continuous.

Proof. We will encounter in the proof the following subsets of $[0, 1]$: the set $D$ of dyadic rationals of $[0, 1]$ and for each $n \geq 1$ the set $D_n := \{k 2^{-n} \,;\, k = 0, \ldots, 2^n\}$. The countable set $D$ is dense in $[0, 1]$.

STEP 1. The original process is continuous in probability, that is, for all $\varepsilon > 0$,
\[ \lim_{t_n \to t} P(d(X(t_n), X(t)) \geq \varepsilon) = 0. \]
This follows from Markov's inequality:
\[ P(d(X(t_n), X(t)) \geq \varepsilon) = P(d(X(t_n), X(t))^{\alpha} \geq \varepsilon^{\alpha}) \leq \frac{E\left[d(X(t_n), X(t))^{\alpha}\right]}{\varepsilon^{\alpha}} \leq \frac{K |t_n - t|^{1+\beta}}{\varepsilon^{\alpha}}. \tag{$\star$} \]
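STEP 2. The original process is almost surely uniformly continuous on $D$. To see this, let $\gamma \in (0, \beta/\alpha)$ and obtain from the Markov bound of Step 1 (with $\varepsilon = 2^{-\gamma n}$) that
\[ P\left(d(X(k 2^{-n}), X((k+1) 2^{-n})) \geq 2^{-\gamma n}\right) \leq K 2^{-n(1 + \beta - \alpha\gamma)}. \]
In particular, with
\[ A_n := \left\{ \max_{1 \leq k \leq 2^n} d(X(k 2^{-n}), X((k-1) 2^{-n})) \geq 2^{-\gamma n} \right\} \]
and by sub-$\sigma$-additivity,
\[ P(A_n) \leq \sum_{k=1}^{2^n} P\left(d(X(k 2^{-n}), X((k-1) 2^{-n})) \geq 2^{-\gamma n}\right) \leq 2^n K 2^{-n(1 + \beta - \alpha\gamma)} = K 2^{-n(\beta - \alpha\gamma)}, \]
and therefore, since $\beta - \alpha\gamma > 0$,
\[ \sum_{n} P(A_n) < \infty. \]
By the Borel–Cantelli lemma, there exist a $P$-negligible subset $\mathcal{N}$ and an almost surely finite random integer $N$ such that outside $\mathcal{N}$, for $n \geq N$ and $k = 1, \ldots, 2^n$,
\[ d(X(k 2^{-n}), X((k-1) 2^{-n})) < 2^{-\gamma n}. \]
In particular,
\[ K_{\gamma} := \sup_{n \geq 1} \, \sup_{1 \leq k \leq 2^n} \frac{d(X((k-1) 2^{-n}), X(k 2^{-n}))}{2^{-\gamma n}} \tag{$\dagger$} \]
is an almost surely finite random variable. It will follow from this (Lemma 5.2.4 below) that for all $s, t \in D$, almost surely
\[ d(X(t), X(s)) \leq C_{\gamma} |t - s|^{\gamma}, \tag{5.7} \]
where
\[ C_{\gamma} := \frac{2}{1 - 2^{-\gamma}} K_{\gamma}. \]
Therefore, almost surely, the mapping $t \mapsto X(t)$ is continuous, and in fact uniformly continuous, on $D$.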
STEP 3. Let $Y(t) = 0$ on $\mathcal{N}$ and, outside $\mathcal{N}$, let $Y(t) = X(t)$ for $t \in D$, and
\[ Y(t) = \lim_{t_n \to t,\ t_n \in D} X(t_n) \qquad (t \notin D). \]
Outside $\mathcal{N}$, the function $t \in [0, 1) \mapsto Y(t)$ is a continuous extension of the uniformly continuous function $t \in D \mapsto X(t)$ to $[0, 1)$. Indeed, for any $s, t \in [0, 1)$, there exist a sequence in $D$, $t_k \to t$, and a sequence in $D$, $s_k \to s$, such that $Y(t) = \lim_{t_k \to t} X(t_k)$ and $Y(s) = \lim_{s_k \to s} X(s_k)$, so that
\[ d(Y(t), Y(s)) \leq d(Y(t), X(t_k)) + d(X(t_k), X(s_k)) + d(X(s_k), Y(s)). \]
One can choose the $t_k$'s and the $s_k$'s inside the interval $[s, t]$. With any $\varepsilon > 0$ one can then associate $\delta$ such that if $|t - s| \leq \delta$, the middle term of the right-hand side is less than $\varepsilon/3$ for all $k$ (by the uniform continuity of $t \in D \mapsto X(t)$) and such that the extreme terms are less than $\varepsilon/3$ (by an appropriate choice of $s_k$ and $t_k$, by construction of $t \in [0, 1) \mapsto Y(t)$), and therefore finally $d(Y(t), Y(s)) \leq \varepsilon$. Therefore, outside $\mathcal{N}$, $t \in [0, 1) \mapsto Y(t)$ is a continuous function.

STEP 4. We now show that $\{Y(t)\}_{t \in [0,1)}$ is a version of $\{X(t)\}_{t \in [0,1)}$, that is, $P(X(t) = Y(t)) = 1$ for all $t \in [0, 1)$. This follows from the fact that $\{X(t)\}_{t \in [0,1)}$ is continuous in probability and that limits in probability and almost-sure limits coincide when both exist.

STEP 5. It remains to prove the inequality (5.7). This follows from Lemma 5.2.4 below.

Lemma 5.2.4 Let $f$ be a mapping from $D$ to the metric space $(E, d)$. Suppose that there exists a finite constant $K$ such that for all $n \geq 1$ and $k = 1, \ldots, 2^n$,
\[ d(f(k 2^{-n}), f((k-1) 2^{-n})) \leq K 2^{-\gamma n}. \]
Then for all $s, t \in D$,
\[ d(f(t), f(s)) \leq \frac{2}{1 - 2^{-\gamma}} K |t - s|^{\gamma}. \]
Proof. Let $s, t \in D$, $s < t$. Let $p$ be the smallest integer such that $2^{-p} \leq t - s$. Let $k$ be the smallest integer such that $k 2^{-p} \geq s$. Then it is possible to write
\[ s = k 2^{-p} - \varepsilon_1 2^{-p-1} - \cdots - \varepsilon_{\ell} 2^{-p-\ell}, \qquad t = k 2^{-p} + \varepsilon'_1 2^{-p-1} + \cdots + \varepsilon'_m 2^{-p-m} \]
for some nonnegative integers $\ell$, $m$ and $\varepsilon$'s and $\varepsilon'$'s taking the values 0 or 1. Define
\[ s_i = k 2^{-p} - \varepsilon_1 2^{-p-1} - \cdots - \varepsilon_i 2^{-p-i} \quad (0 \leq i \leq \ell), \qquad t_j = k 2^{-p} + \varepsilon'_1 2^{-p-1} + \cdots + \varepsilon'_j 2^{-p-j} \quad (0 \leq j \leq m). \]
Then, observing that $s = s_{\ell}$ and $t = t_m$, the triangle inequality gives
\begin{align*}
d(f(s), f(t)) = d(f(s_{\ell}), f(t_m)) &\leq d(f(s_0), f(t_0)) + \sum_{i=1}^{\ell} d(f(s_{i-1}), f(s_i)) + \sum_{j=1}^{m} d(f(t_{j-1}), f(t_j)) \\
&\leq K 2^{-p\gamma} + \sum_{i=1}^{\ell} K 2^{-(p+i)\gamma} + \sum_{j=1}^{m} K 2^{-(p+j)\gamma} \\
&\leq 2K (1 - 2^{-\gamma})^{-1} 2^{-p\gamma} \leq 2K (1 - 2^{-\gamma})^{-1} (t - s)^{\gamma}.
\end{align*}

Remark 5.2.5 Going back to Example 5.1.5, Kolmogorov's theorem guarantees the existence of an integer-valued process $\{N(t)\}_{t \geq 0}$ with the fidi distribution described in this example. It does not say, however, that it is a counting process in the sense of Example 5.1.3. It turns out that such a version exists, as we shall see in Chapter 8, where a completely different definition is given.
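As an illustration of the moment condition of Theorem 5.2.3, consider a standard Brownian motion $W$. The sketch below relies on an assumption not established at this point of the text, namely that the increment $W(t) - W(s)$ is Gaussian $N(0, |t - s|)$. Under that assumption $E[|W(t) - W(s)|^4] = 3|t - s|^2$, so the condition holds with $\alpha = 4$, $\beta = 1$, $K = 3$. The function name, sample size and seed are hypothetical choices; the code merely compares a Monte Carlo estimate with the exact value.

```python
import random

def fourth_moment_of_increment(s, t, n_samples=200_000, seed=1):
    """Monte Carlo estimate of E|W(t) - W(s)|^4, assuming W(t) - W(s) ~ N(0, |t - s|)."""
    rng = random.Random(seed)
    sd = abs(t - s) ** 0.5               # standard deviation of the increment
    total = 0.0
    for _ in range(n_samples):
        total += rng.gauss(0.0, sd) ** 4
    return total / n_samples

s, t = 0.2, 0.7
estimate = fourth_moment_of_increment(s, t)
bound = 3 * abs(t - s) ** 2              # K |t - s|^(1 + beta) with K = 3, beta = 1
print(estimate, bound)
```

With $\alpha = 4$ and $\beta = 1$, any $\gamma \in (0, 1/4)$ is admissible in Step 2 of the proof.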
5.3 Measurability Issues
5.3.1 Measurable Processes and their Integrals
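Viewing a stochastic process as a mapping $X : \mathbb{T} \times \Omega \to E$, defined by $(t, \omega) \mapsto X(t, \omega)$, opens the way to various measurability concepts.

Definition 5.3.1 The stochastic process $\{X(t)\}_{t \in \mathbb{R}}$ is said to be measurable iff the mapping from $\mathbb{R} \times \Omega$ into $E$ defined by $(t, \omega) \mapsto X(t, \omega)$ is measurable with respect to $\mathcal{B}(\mathbb{R}) \otimes \mathcal{F}$ and $\mathcal{E}$.

In particular, for any $\omega \in \Omega$ the mapping $t \mapsto X(t, \omega)$ is measurable with respect to the $\sigma$-fields $\mathcal{B}(\mathbb{R})$ and $\mathcal{E}$ (Lemma 2.3.6). Also, if $E = \mathbb{R}$ and if $X(t)$ is nonnegative, one can define the Lebesgue integral
\[ \int_{\mathbb{R}} X(t, \omega)\, dt \]
for each $\omega \in \Omega$, and also apply Tonelli's theorem (Theorem 2.3.9) to obtain
\[ E\left[\int_{\mathbb{R}} X(t)\, dt\right] = \int_{\mathbb{R}} E[X(t)]\, dt. \]
By Fubini's theorem, the last equality also holds true for measurable stochastic processes of arbitrary sign such that $\int_{\mathbb{R}} E[|X(t)|]\, dt < \infty$.

Theorem 5.3.2 Let $\{X(t)\}_{t \in \mathbb{R}}$ be a second-order complex-valued measurable stochastic process with mean function $m$ and covariance function $\Gamma$. Let $f : \mathbb{R} \to \mathbb{C}$ be an integrable function such that
\[ \int_{\mathbb{R}} |f(t)|\, E[|X(t)|]\, dt < \infty. \tag{5.8} \]
Then the integral $\int_{\mathbb{R}} f(t) X(t)\, dt$ is almost surely well defined and
\[ E\left[\int_{\mathbb{R}} f(t) X(t)\, dt\right] = \int_{\mathbb{R}} f(t) m(t)\, dt. \]
Suppose in addition that $f$ satisfies the condition
\[ \int_{\mathbb{R}} |f(t)|\, \Gamma(t, t)^{\frac{1}{2}}\, dt < \infty \tag{5.9} \]
and let $g : \mathbb{R} \to \mathbb{C}$ be a function with the same properties as $f$. Then $\int_{\mathbb{R}} f(t) X(t)\, dt$ is square-integrable and
\[ \mathrm{cov}\left(\int_{\mathbb{R}} f(t) X(t)\, dt,\ \int_{\mathbb{R}} g(t) X(t)\, dt\right) = \int_{\mathbb{R}} \int_{\mathbb{R}} f(t) g^*(s) \Gamma(t, s)\, dt\, ds. \]

Proof. By Tonelli's theorem
\[ E\left[\int_{\mathbb{R}} |f(t)|\, |X(t)|\, dt\right] = \int_{\mathbb{R}} |f(t)|\, E[|X(t)|]\, dt < \infty \]
and therefore, almost surely $\int_{\mathbb{R}} |f(t)|\, |X(t)|\, dt < \infty$, so that almost surely the integral $\int_{\mathbb{R}} f(t) X(t)\, dt$ is well defined and finite. Also (Fubini)
\[ E\left[\int_{\mathbb{R}} f(t) X(t)\, dt\right] = \int_{\mathbb{R}} E[f(t) X(t)]\, dt = \int_{\mathbb{R}} f(t) E[X(t)]\, dt. \]
Suppose now (without loss of generality) that the process is centered. By Tonelli's theorem
\[ E\left[\left(\int_{\mathbb{R}} |f(t)|\, |X(t)|\, dt\right) \left(\int_{\mathbb{R}} |g(t)|\, |X(t)|\, dt\right)\right] = \int_{\mathbb{R}} \int_{\mathbb{R}} |f(t)|\, |g(s)|\, E[|X(t)|\, |X(s)|]\, dt\, ds. \]
But (Schwarz's inequality) $E[|X(t)|\, |X(s)|] \leq \Gamma(t, t)^{\frac{1}{2}} \Gamma(s, s)^{\frac{1}{2}}$, and therefore the right-hand side of the last equality is bounded by
\[ \left(\int_{\mathbb{R}} |f(t)|\, \Gamma(t, t)^{\frac{1}{2}}\, dt\right) \left(\int_{\mathbb{R}} |g(s)|\, \Gamma(s, s)^{\frac{1}{2}}\, ds\right) < \infty. \]
One may therefore apply Fubini's theorem to obtain
\[ E\left[\left(\int_{\mathbb{R}} f(t) X(t)\, dt\right) \left(\int_{\mathbb{R}} g(t) X(t)\, dt\right)^{*}\right] = \int_{\mathbb{R}} \int_{\mathbb{R}} f(t) g^*(s) E[X(t) X(s)^*]\, dt\, ds. \]

Remark 5.3.3 Since $E[|X(t)|] \leq E[1 + |X(t)|^2] = 1 + \Gamma(t, t)$ for a centered process, condition (5.8) is satisfied if $f$ is an integrable function such that $\int_{\mathbb{R}} |f(t)|\, \Gamma(t, t)\, dt < \infty$.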
5.3.2 Histories and Stopping Times
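In the following, the index set $\mathbb{T}$ is any of the following: $\mathbb{R}$, $\mathbb{R}_+$, $\mathbb{N}$, $\mathbb{Z}$.

Definition 5.3.4 Let $(\Omega, \mathcal{F})$ be a measurable space. The family $\{\mathcal{F}_t\}_{t \in \mathbb{T}}$ of sub-$\sigma$-fields of $\mathcal{F}$ is called a history (or filtration) on $(\Omega, \mathcal{F})$ if for all $s, t \in \mathbb{T}$ such that $s \leq t$, $\mathcal{F}_s \subseteq \mathcal{F}_t$.

In other words, a history is a nondecreasing family of sub-$\sigma$-fields of $\mathcal{F}$ indexed by $\mathbb{T}$. In applications $\mathcal{F}_t$ often represents the information available at time $t$ to an observer. The $\sigma$-field $\mathcal{F}_{\infty} := \vee_{t \in \mathbb{T}} \mathcal{F}_t$ is, by definition, the smallest $\sigma$-field that contains $\mathcal{F}_t$ for all $t \in \mathbb{T}$.

Definition 5.3.5 Let $\{X(t)\}_{t \in \mathbb{T}}$ be a stochastic process defined on $(\Omega, \mathcal{F})$. The history $\{\mathcal{F}_t^X\}_{t \in \mathbb{T}}$ defined by $\mathcal{F}_t^X = \sigma(X(s) \,;\, s \leq t)$ is called the internal history of $\{X(t)\}_{t \in \mathbb{T}}$. Any history $\{\mathcal{F}_t\}_{t \in \mathbb{T}}$ such that
\[ \mathcal{F}_t \supseteq \mathcal{F}_t^X \qquad (t \in \mathbb{T}) \]
is called a history of $\{X(t)\}_{t \in \mathbb{T}}$. The stochastic process $\{X(t)\}_{t \in \mathbb{T}}$ is then said to be adapted to the history $\{\mathcal{F}_t\}_{t \in \mathbb{T}}$, or $\mathcal{F}_t$-adapted.

Definition 5.3.6 Let $\mathbb{T} = \mathbb{R}$ or $\mathbb{R}_+$. Define for all $t \in \mathbb{T}$
\[ \mathcal{F}_{t+} := \cap_{s > t} \mathcal{F}_s. \]
The history $\{\mathcal{F}_t\}_{t \in \mathbb{T}}$ is called right-continuous if for all $t \in \mathbb{T}$, $\mathcal{F}_t = \mathcal{F}_{t+}$.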
Progressive Measurability

Definition 5.3.7 A stochastic process $\{X(t)\}_{t \in \mathbb{R}_+}$ taking its values in the measurable space $(E, \mathcal{E})$ is said to be $\mathcal{F}_t$-progressively measurable if for all $t \in \mathbb{R}_+$ the mapping $(s, \omega) \mapsto X(s, \omega)$ from $[0, t] \times \Omega$ into $E$ is measurable with respect to the $\sigma$-fields $\mathcal{B}([0, t]) \otimes \mathcal{F}_t$ and $\mathcal{E}$.

An $\mathcal{F}_t$-progressively measurable process $\{X(t)\}_{t \in \mathbb{R}_+}$ is then $\mathcal{F}_t$-adapted and measurable.

Theorem 5.3.8 Let $\{X(t)\}_{t \in \mathbb{R}_+}$ be a stochastic process, taking its values in a topological space $E$ endowed with its Borel $\sigma$-field $\mathcal{E} = \mathcal{B}(E)$, adapted to $\{\mathcal{F}_t\}_{t \in \mathbb{R}_+}$ and right-continuous (resp., left-continuous). Then $\{X(t)\}_{t \in \mathbb{R}_+}$ is $\mathcal{F}_t$-progressively measurable.

Proof. Let $t$ be a nonnegative real number. For all $n \geq 0$ and all $s \in [0, t)$, let
\[ X_n(s) := \sum_{k=0}^{2^n - 1} X((k+1) 2^{-n} t)\, 1_{[k 2^{-n} t,\, (k+1) 2^{-n} t)}(s), \]
and let $X_n(t) := X(t)$. This defines a function $[0, t] \times \Omega \to E$ which is measurable with respect to $\mathcal{B}([0, t]) \otimes \mathcal{F}_t$ and $\mathcal{E}$. If $t \mapsto X(t, \omega)$ is right-continuous, $X(s, \omega)$ is the limit of $X_n(s, \omega)$ for all $(s, \omega) \in [0, t] \times \Omega$, and therefore $(s, \omega) \mapsto X(s, \omega)$ is measurable with respect to $\mathcal{B}([0, t]) \otimes \mathcal{F}_t$ and $\mathcal{E}$ as a function of $[0, t] \times \Omega$ into $E$. The case of a left-continuous process is treated in a similar way.

Theorem 5.3.9 If the nonnegative stochastic process $\{X(t)\}_{t \in \mathbb{R}_+}$ is $\mathcal{F}_t$-progressively measurable, then, for each $t \in \mathbb{R}_+$, the random variable $\int_{(0,t]} X(s, \omega)\, ds$ is $\mathcal{F}_t$-measurable.

The proof is left as Exercise 5.4.2.
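Stopping Times

A principal notion in the theory of stochastic processes is that of stopping time. In this subsection, the index set is $\mathbb{T} = \mathbb{N}$ or $\mathbb{R}_+$, and $\overline{\mathbb{T}} := \mathbb{T} \cup \{+\infty\}$.

Definition 5.3.10 Let $\{\mathcal{F}_t\}_{t \in \mathbb{T}}$ be a history. A $\overline{\mathbb{T}}$-valued random variable $\tau$ is called an $\mathcal{F}_t$-stopping time iff for all $t \in \mathbb{T}$, $\{\tau \leq t\} \in \mathcal{F}_t$.

Remark 5.3.11 In the case $\mathbb{T} = \mathbb{R}_+$, the condition
\[ \{\tau < t\} \in \mathcal{F}_t \qquad (t \geq 0) \]
does not guarantee that $\tau$ is an $\mathcal{F}_t$-stopping time. But since
\[ \{\tau \leq t\} = \cap_{n} \left\{\tau < t + \frac{1}{n}\right\} \in \cap_{s > t} \mathcal{F}_s = \mathcal{F}_{t+}, \]
we have that $\tau$ is an $\mathcal{F}_{t+}$-stopping time, and therefore an $\mathcal{F}_t$-stopping time if the history is right-continuous.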
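Example 5.3.12: Counterexample. Define a (not right-continuous) history by $\mathcal{F}_t = \{\emptyset, \Omega\}$ if $t \leq 1$, and $\mathcal{F}_t = \sigma(A)$ if $t > 1$, where $A \notin \{\emptyset, \Omega\}$. The random variable $\tau := 1 + 1_A$ is such that $\{\tau < t\} \in \mathcal{F}_t$ for all $t \geq 0$, but it is not an $\mathcal{F}_t$-stopping time because $\{\tau \leq 1\}$ is not in $\mathcal{F}_1$.

The following approximation of a stopping time by simple random variables will be of frequent use in the sequel.

Theorem 5.3.13 Let $\{\mathcal{F}_t\}_{t \in \mathbb{R}_+}$ be a history, let $\tau$ be an $\mathcal{F}_t$-stopping time, and let for all $n \geq 1$,
\[ \tau(n, \omega) := \begin{cases} 0 & \text{if } \tau(\omega) = 0, \\ \dfrac{k+1}{2^n} & \text{if } \dfrac{k}{2^n} < \tau(\omega) \leq \dfrac{k+1}{2^n}, \\ +\infty & \text{if } \tau(\omega) = \infty. \end{cases} \]
Then $\tau(n)$ is an $\mathcal{F}_t$-stopping time decreasing to $\tau$ as $n \uparrow \infty$.

Proof. In fact, for all $t \geq 0$,
\begin{align*}
\{\tau(n) \leq t\} &= \{\tau = 0\} \cup \bigcup_{k \,;\, (k+1) 2^{-n} \leq t} \{\tau(n) = (k+1) 2^{-n}\} \\
&= \{\tau = 0\} \cup \bigcup_{k \,;\, (k+1) 2^{-n} \leq t} \{k 2^{-n} < \tau \leq (k+1) 2^{-n}\} \in \mathcal{F}_t.
\end{align*}
The decreasing convergence to $\tau$ is obvious.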
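The dyadic approximation of Theorem 5.3.13 can be computed for a given value of $\tau(\omega)$. The following is a minimal sketch; the function name `dyadic_upper_approx` is a hypothetical choice, and the computation uses the observation that the unique $k$ with $k/2^n < \tau \leq (k+1)/2^n$ satisfies $k + 1 = \lceil 2^n \tau \rceil$ when $0 < \tau < \infty$.

```python
import math

def dyadic_upper_approx(tau, n):
    """Return tau(n): the value (k+1)/2^n for which k/2^n < tau <= (k+1)/2^n.

    The values tau = 0 and tau = +infinity are left unchanged, as in the theorem.
    """
    if tau == 0 or math.isinf(tau):
        return tau
    return math.ceil(tau * 2**n) / 2**n

tau = 0.3                                   # one hypothetical value of tau(omega)
approximations = [dyadic_upper_approx(tau, n) for n in range(1, 11)]
print(approximations)
```

Each approximation is at least $\tau$ and within $2^{-n}$ of it, and the sequence is nonincreasing in $n$ because the dyadic grids are nested.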
Let {X(t)}t∈R+ be a stochastic process with values in E. For any set C ⊂ E, let τ (C) := inf{t ≥ 0 ; X(t) ∈ C} . Theorem 5.3.14 Let {X(t)}t∈R+ be a rightcontinuous stochastic process with values in a metric space E and adapted to the history {Ft }t≥0 . A. Let G be an open set of E. The random time τ (G) is an Ft+ stopping time. B. Suppose moreover that {X(t)}t∈R+ has left limits for all t > 0. Let Γ be a closed set of E. Then, the random time τ (Γ) is an Ft stopping time. Proof. A. This comes from the identity {τ (G) < t} = ∩r∈Q,r t} = {X(s) < c for all s ∈ [0, t]} is identical to % {X(kt/2n ) < c (k = 0, 1, . . . , 2n )}, n≥1
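which is in $\mathcal{F}_t$. The left-continuous case is similar. In the left-continuous case, suppose that for a given $\omega$, $X(\tau(\omega), \omega) = c + \varepsilon > c$. Then there exists a $\delta > 0$ such that for all $t \in [\tau(\omega) - \delta, \tau(\omega))$, $X(t, \omega) \geq c + \frac{1}{2}\varepsilon > c$, and this is in contradiction with the definition of $\tau$.

Remark 5.3.17 The situation depicted in the left-continuous case guarantees that the entrance time $\tau_n$ in $[n, \infty)$ is such that the stopped process $\{X(t \wedge \tau_n)\}_{t \in \mathbb{R}_+}$ is bounded (by $n$). This remark will be of frequent use in the sequel.

Theorem 5.3.18 Let $\mathbb{T} = \mathbb{N}$ or $\mathbb{R}_+$. Let $\{\mathcal{F}_t\}_{t \in \mathbb{T}}$ be a history. Let $\tau$ be an $\mathcal{F}_t$-stopping time. The collection of events
\[ \mathcal{F}_{\tau} = \{A \in \mathcal{F}_{\infty} \,;\, A \cap \{\tau \leq t\} \in \mathcal{F}_t \text{ for all } t \in \mathbb{T}\} \]
is a $\sigma$-field, and $\tau$ is $\mathcal{F}_{\tau}$-measurable. Let $\{X(t)\}_{t \in \mathbb{T}}$ be an $E$-valued $\mathcal{F}_t$-adapted stochastic process, and let $\tau$ be a finite $\mathcal{F}_t$-stopping time. Define the random variable $X(\tau)$ by $X(\tau)(\omega) := X(\tau(\omega), \omega)$. Then, if $\mathbb{T} = \mathbb{N}$ (resp., $\mathbb{T} = \mathbb{R}_+$ and $\{X(t)\}_{t \in \mathbb{R}_+}$ is $\mathcal{F}_t$-progressively measurable), $X(\tau)$ is $\mathcal{F}_{\tau}$-measurable.

Proof. The verification that $\mathcal{F}_{\tau}$ is a $\sigma$-field is straightforward. In order to show that $\tau$ is $\mathcal{F}_{\tau}$-measurable, it is enough to show that for all $c \geq 0$, $\{\tau \leq c\} \in \mathcal{F}_{\tau}$, that is, for all $t \in \mathbb{T}$, $\{\tau \leq c\} \cap \{\tau \leq t\} \in \mathcal{F}_t$. But this last event is just $\{\tau \leq c \wedge t\} \in \mathcal{F}_{c \wedge t}$, by definition of an $\mathcal{F}_t$-stopping time, and $\mathcal{F}_{c \wedge t} \subseteq \mathcal{F}_t$.

We treat the case $\mathbb{T} = \mathbb{R}_+$. Let $A \in \mathcal{E}$ and $a \geq 0$. The set $\{X(\tau) \in A\} \cap \{\tau \leq a\}$ is identical to $(\{X(S) \in A\} \cap \{S < a\}) \cup (\{X(a) \in A\} \cap \{\tau = a\})$, where $S = \tau \wedge a$. Therefore, it suffices to prove that $\{X(S) \in A\}$ is in $\mathcal{F}_a$, as we now show. Indeed, the random variable $S$ is an $\mathcal{F}_t$-stopping time and it is also an $\mathcal{F}_a$-measurable random variable. The $\mathcal{F}_a$-measurability of $X(S)$ follows from the fact that it is obtained by composition of $\omega \mapsto (S(\omega), \omega)$ from $(\Omega, \mathcal{F}_a)$ into $([0, a] \times \Omega, \mathcal{B}([0, a]) \otimes \mathcal{F}_a)$, and $(s, \omega) \mapsto X(s, \omega)$ from $([0, a] \times \Omega, \mathcal{B}([0, a]) \otimes \mathcal{F}_a)$ into $(E, \mathcal{E})$, which are measurable: the first by definition of $\mathcal{F}_t$-stopping times and the second by definition of $\mathcal{F}_t$-progressiveness.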
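In discrete time ($\mathbb{T} = \mathbb{N}$), the objects of Theorem 5.3.18 are easy to compute on a single trajectory. The sketch below is only an illustration: the trajectory `path`, the target set and the function names are hypothetical. It computes a hitting time $\tau(C) = \inf\{n \geq 0 \,;\, X(n) \in C\}$ and the value $X(\tau)(\omega) = X(\tau(\omega), \omega)$, and it shows that the event $\{\tau \leq n\}$ is decided by the trajectory up to time $n$ only, which is the stopping-time property for the internal history.

```python
def hitting_time(path, target):
    """tau(C) = inf{n >= 0 : X(n) in C} on a finite path; None if C is never hit."""
    for n, state in enumerate(path):
        if state in target:
            return n
    return None

def stopped_value(path, tau):
    """X(tau)(omega) := X(tau(omega), omega) for a finite stopping time tau."""
    return path[tau]

def tau_at_most(path, target, n):
    """Decide the event {tau <= n} using only X(0), ..., X(n)."""
    return any(state in target for state in path[: n + 1])

path = [0, 1, 2, 1, 3, 4, 3]          # one hypothetical trajectory X(0), ..., X(6)
target = {3, 4}

tau = hitting_time(path, target)
x_tau = stopped_value(path, tau)
print(tau, x_tau, tau_at_most(path, target, 2), tau_at_most(path, target, 4))
```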
Theorem 5.3.19
(i) If $S$ and $T$ are $\mathcal{F}_t$-stopping times, then so are $S \wedge T$ and $S \vee T$.
(ii) An $\mathcal{F}_t$-stopping time $T$ is $\mathcal{F}_T$-measurable.
(iii) If $T$ is an $\mathcal{F}_t$-stopping time and $S$ is $\mathcal{F}_T$-measurable and such that $S \geq T$, then $S$ is an $\mathcal{F}_t$-stopping time.
(iv) If $S$ and $T$ are $\mathcal{F}_t$-stopping times and $A \in \mathcal{F}_S$, then $A \cap \{S \leq T\} \in \mathcal{F}_T$.
(v) If $S$ and $T$ are $\mathcal{F}_t$-stopping times such that $S \leq T$, then $\mathcal{F}_S \subseteq \mathcal{F}_T$.
(vi) If $\{T_n\}_{n \geq 1}$ is a sequence of $\mathcal{F}_t$-stopping times, then $\sup_n T_n$ is an $\mathcal{F}_t$-stopping time.

Proof. (i) and (ii) are left as exercises (Exercise 5.4.10).
(iii) By hypothesis $\{S \leq t\} \in \mathcal{F}_T$ and therefore, by definition of $\mathcal{F}_T$, $\{S \leq t\} \cap \{T \leq t\} \in \mathcal{F}_t$. But, since $S \geq T$, the last intersection is just $\{S \leq t\}$.
(iv) By definition of $\mathcal{F}_T$, we must check that
\[ [A \cap \{S \leq T\}] \cap \{T \leq t\} \in \mathcal{F}_t \qquad (t \geq 0). \]
But this intersection is equal to
\[ [A \cap \{S \leq t\}] \cap \{T \leq t\} \cap \{S \wedge t \leq T \wedge t\}, \]
and all the three sets therein are in $\mathcal{F}_t$, the first one because $A \in \mathcal{F}_S$, the second one because $T$ is an $\mathcal{F}_t$-stopping time and the last one because $S \wedge t$ and $T \wedge t$ are $\mathcal{F}_t$-measurable.
(v) Let $A \in \mathcal{F}_S$. According to (iv), $A = A \cap \{S \leq T\} \in \mathcal{F}_T$.
(vi) Just observe that
\[ \{\sup_n T_n \leq t\} = \cap_n \{T_n \leq t\} \in \mathcal{F}_t. \]

Remark 5.3.20 It is not true in general that $\inf_n T_n$ is an $\mathcal{F}_t$-stopping time. In fact, it is not always true that $\{\inf_n T_n \leq t\} = \cup_n \{T_n \leq t\}$. However,
\[ \{\inf_n T_n < t\} = \cup_n \{T_n < t\} \in \mathcal{F}_t \qquad (t \geq 0), \]
and therefore, if $\{\mathcal{F}_t\}_{t \geq 0}$ is a right-continuous history, $\inf_n T_n$ is an $\mathcal{F}_t$-stopping time (Remark 5.3.11).
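A small worked example may help to see why the identity with non-strict inequalities can fail in Remark 5.3.20. The deterministic times $T_n := 1 + 1/n$ used below are a hypothetical choice made for this illustration (constants are stopping times for any history); the example only shows the failure of the set identity, not the failure of the stopping-time property itself.

```latex
% Deterministic times T_n := 1 + 1/n (n >= 1), so that inf_n T_n = 1.
\[
  \Bigl\{\inf_n T_n \le 1\Bigr\} = \Omega ,
  \qquad
  \bigcup_{n \ge 1} \{T_n \le 1\}
  = \bigcup_{n \ge 1} \Bigl\{1 + \tfrac{1}{n} \le 1\Bigr\} = \emptyset .
\]
% With strict inequalities the identity holds: for every t >= 0,
\[
  \Bigl\{\inf_n T_n < t\Bigr\} = \bigcup_{n \ge 1} \{T_n < t\}
  = \begin{cases} \Omega & \text{if } t > 1, \\ \emptyset & \text{if } t \le 1. \end{cases}
\]
```

The infimum need not be attained, which is exactly why only the strict-inequality events can be written as a countable union.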
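Theorem 5.3.21 Let $\{\mathcal{F}_t\}_{t \geq 0}$ be a right-continuous history and $\{T_n\}_{n \geq 1}$ be a sequence of $\mathcal{F}_t$-stopping times. Then
(i) $\liminf_n T_n$ and $\limsup_n T_n$ are $\mathcal{F}_t$-stopping times.
(ii) If $\{T_n\}_{n \geq 1}$ is decreasing, with limit $T$, then $\mathcal{F}_T = \cap_n \mathcal{F}_{T_n}$.

Proof. The proof of (i) is left as an exercise. It ensures that $T$ of (ii) is indeed an $\mathcal{F}_t$-stopping time. We have from (v) of Theorem 5.3.19 that $\mathcal{F}_T \subseteq \cap_n \mathcal{F}_{T_n}$. Now, let $A \in \cap_n \mathcal{F}_{T_n}$. Then $A \cap \{T_n < t\} \in \mathcal{F}_t$ $(t \geq 0)$ and therefore
\[ \cup_n (A \cap \{T_n < t\}) = A \cap \{T < t\} \in \mathcal{F}_t \qquad (t \geq 0), \]
which guarantees, because of the right-continuity hypothesis for the history and by the argument of Remark 5.3.11, that $A \cap \{T \leq t\} \in \mathcal{F}_t$ for all $t \geq 0$, that is, $A \in \mathcal{F}_T$.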
Complementary reading

[Meyer, 1975] (the first chapters) and [Durrett, 1996] are advanced references.
5.4 Exercises
Exercise 5.4.1. The case of iid sequences
Prove directly (without referring to Theorem 5.1.7) the last statement in Example 5.1.8.

Exercise 5.4.2. Why Progressive Measurability?
Prove that if the nonnegative stochastic process $\{X(t)\}_{t \in \mathbb{R}_+}$ is $\mathcal{F}_t$-progressively measurable, then, for each $t \in \mathbb{R}_+$, the random variable $\int_{(0,t]} X(s)\, ds$ is $\mathcal{F}_t$-measurable.

Exercise 5.4.3. Wide-sense Stationary but not Stationary
Give a simple example of a discrete-time stochastic process that is wide-sense stationary, but not strictly stationary. Give a similar example in continuous time.

Exercise 5.4.4. Stationarization
Let $\{Y(t)\}_{t \in \mathbb{R}}$ be the stochastic process taking its values in $\{-1, +1\}$ defined by
\[ Y(t) = Z \times (-1)^n \quad \text{on } (nT, (n+1)T], \]
where $T$ is a positive real number and $Z$ is a random variable equidistributed on $\{-1, +1\}$.
(1) Show that $\{Y(t)\}_{t \in \mathbb{R}}$ is not a stationary (strictly or in the wide sense) stochastic process.
(2) Let now $U$ be a random variable uniformly distributed on $[0, T]$ and independent of $Z$. Define for all $t \in \mathbb{R}$, $X(t) = Y(t - U)$. Show that $\{X(t)\}_{t \in \mathbb{R}}$ is a wide-sense stationary stochastic process and compute its covariance function.
Exercise 5.4.5. A harmonic process
Let $\{U_k\}_{k \geq 1}$ be centered random variables of $L^2_{\mathbb{C}}(P)$ that are mutually uncorrelated. Let $\{\Phi_k\}_{k \geq 1}$ be completely random phases, that is, real random variables uniformly distributed on $[0, 2\pi]$. Suppose, moreover, that the $U$ variables are independent of the $\Phi$ variables. Finally, suppose that $\sum_{k=1}^{\infty} E[|U_k|^2] < \infty$. Prove that for all $t \in \mathbb{R}$, the series in the right-hand side of
\[ X(t) = \sum_{k=1}^{\infty} U_k \cos(2\pi\nu_k t + \Phi_k), \]
where the $\nu_k$'s are real numbers (frequencies), is convergent in $L^2_{\mathbb{C}}(P)$ and defines a wss stochastic process. Give its covariance function.

Exercise 5.4.6. Just a joke
Let $\{X(t)\}_{t \in \mathbb{R}}$ be a centered Gaussian process, and let $t_1, t_2 \in \mathbb{R}$ be fixed times. Compute the probability that $X(t_1) > X(t_2)$.

Exercise 5.4.7. A clipped Gaussian process
Let $\{X(t)\}_{t \in \mathbb{R}}$ be a centered stationary Gaussian process with covariance function $C_X$. Define the clipped (or hard-limited) process
\[ Y(t) = \operatorname{sign} X(t), \]
with the convention $\operatorname{sign} X(t) = 0$ if $X(t) = 0$ (note however that this occurs with null probability if $C_X(0) = \sigma_X^2 > 0$, which we assume to hold). Clearly this stochastic process is centered. Moreover, it is unchanged when $\{X(t)\}_{t \in \mathbb{R}}$ is multiplied by a positive constant. In particular, we may assume that the variance $C_X(0)$ equals 1, so that the covariance matrix of the vector $(X(0), X(\tau))^T$ is
\[ \Gamma(\tau) = \begin{pmatrix} 1 & \rho_X(\tau) \\ \rho_X(\tau) & 1 \end{pmatrix}, \]
where $\rho_X(\tau)$ is the correlation coefficient of $X(0)$ and $X(\tau)$. We assume that $\Gamma(\tau)$ is invertible, that is, $|\rho_X(\tau)| < 1$. Prove the following formula:
\[ C_Y(\tau) = \frac{2}{\pi} \sin^{-1}\left(\frac{C_X(\tau)}{C_X(0)}\right). \]
Exercise 5.4.8. The Black and Scholes Formula
This formula concerns a certain type of financial product called the European call option. The value of a stock at time $t \geq 0$ is
\[ V(t) = V(0) \exp\{\theta t + \sigma W(t)\}. \]
In particular,
\[ E[V(t)] = E[V(0)] \exp\left\{\left(\theta + \frac{1}{2}\sigma^2\right) t\right\}, \]
that is, one euro invested in this stock at time 0 will yield on average $\exp\left\{\left(\theta + \frac{1}{2}\sigma^2\right) t\right\}$ at time $t$. On the other hand, an investment of 1 euro in a risk-free instrument (bonds, saving account) returns $e^{rt}$ euros at time $t$, where $r$ is the fixed return rate of the risk-free investment. In a competitive market, the return rates are the same:⁴
\[ \theta + \frac{1}{2}\sigma^2 = r. \]
Therefore
\[ V(t) = V(0) \exp\left\{\left(r - \frac{1}{2}\sigma^2\right) t + \sigma W(t)\right\}. \]
The investor has the right to exercise the following option (the European call option). At some time $T$ in the future, called the expiration date, he can buy one share of the stock at a fixed price $K$ (the strike price) and immediately sell it at price $V(T)$ and therefore make a profit $V(T) - K$. If he does not exercise the option he will do nothing, and therefore the profit will be
\[ \max(V(T) - K, 0). \]
Of course, the investor must pay an entrance fee $C$ in order to enter the deal. The value of $C$ should be such that the expected return of an investment $C$ in the risk-free instrument equals the expected return when exercising the option:
\[ C e^{rT} = E[\max(V(T) - K, 0)]. \]
This is called the "no arbitrage" condition. Give an explicit formula for $C$ in terms of $r$, $\sigma$, $K$, $V(0)$ and $T$.

Exercise 5.4.9. An elementary ergodic theorem
Let $\{X(t)\}_{t \in \mathbb{R}}$ be a wss stochastic process with mean $m_X$ and covariance function $C_X(\tau)$. Prove that in order that
\[ \lim_{T \uparrow \infty} \frac{1}{T} \int_0^T X(s)\, ds = m_X \]
holds in the quadratic mean, it is necessary and sufficient that
\[ \lim_{T \uparrow \infty} \frac{1}{T} \int_0^T \left(1 - \frac{u}{T}\right) C_X(u)\, du = 0. \tag{5.10} \]
Show that this condition is satisfied in particular when the covariance function is integrable.

Exercise 5.4.10. About stopping times
Let $S$ and $T$ be $\mathcal{F}_t$-stopping times. Prove that
(1) the events $\{S < T\}$, $\{S \leq T\}$ and $\{S = T\}$ are in $\mathcal{F}_S \cap \mathcal{F}_T$,
(2) $S \wedge T$ and $S \vee T$ are $\mathcal{F}_t$-stopping times, and
(3) $T$ is $\mathcal{F}_T$-measurable.

⁴ We do not attempt here to define a "competitive market" or to prove the corresponding statement.
Chapter 6
Markov Chains, Discrete Time

A sequence $\{X_n\}_{n \geq 0}$ of random variables with values in a set $E$ is called a discrete-time stochastic process with state space $E$. According to such a definition, sequences of independent and identically distributed random variables are stochastic processes. However, in order to introduce more variability, one may wish to allow for some dependence on the past in the manner of deterministic recurrence equations. Discrete-time homogeneous Markov chains possess the required feature since they can always be represented (in a sense to be made precise) by a stochastic recurrence equation $X_{n+1} = f(X_n, Z_{n+1})$, where $\{Z_n\}_{n \geq 1}$ is an iid sequence independent of the initial state $X_0$. The probabilistic dependence on the past is only through the previous state, but this limited amount of memory suffices to produce enough varied and complex behavior to make Markov chains a most important source of models.
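The stochastic recurrence $X_{n+1} = f(X_n, Z_{n+1})$ mentioned above translates directly into a simulation loop. The following is a minimal sketch under assumptions made only for this illustration: the state space is $\{0, \ldots, 4\}$, the driving sequence is iid uniform on $\{-1, +1\}$, and the update rule `f` (a random walk clipped at the boundaries) is a hypothetical example, not a model taken from the text.

```python
import random

def f(x, z, lo=0, hi=4):
    """Update rule: move by z, then clip to the state space {lo, ..., hi}."""
    return min(max(x + z, lo), hi)

def simulate(x0, n_steps, seed=0):
    """Iterate X_{n+1} = f(X_n, Z_{n+1}) with Z_1, Z_2, ... iid uniform on {-1, +1}."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n_steps):
        z = rng.choice((-1, 1))          # Z_{n+1}, drawn independently of the past
        path.append(f(path[-1], z))
    return path

path = simulate(x0=2, n_steps=20)
print(path)
```

Since the next state depends on the past only through the current state and a fresh independent input, the simulated sequence has the limited memory described in the paragraph above.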
6.1 The Markov Property
6.1.1 The Markov Property on the Integers
Let $\{X_n\}_{n \geq 0}$ be a discrete-time stochastic process with countable state space $E$. The elements of the state space will be denoted by $i, j, k, \ldots$ If $X_n = i$, the process is said to be in state $i$ at time $n$, or to visit state $i$ at time $n$.

Definition 6.1.1 If for all integers $n \geq 0$ and all states $i_0, i_1, \ldots, i_{n-1}, i, j$,
\[ P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i), \tag{6.1} \]
the above stochastic process is called a Markov chain, and a homogeneous Markov chain (hmc) if, in addition, the right-hand side of (6.1) is independent of $n$. In the homogeneous case, the matrix $\mathbf{P} = \{p_{ij}\}_{i,j \in E}$, where
\[ p_{ij} := P(X_{n+1} = j \mid X_n = i), \]
is called the transition matrix of the hmc.

Since the entries are probabilities and since a transition from any state $i$ must be to some state, it follows that
\[ p_{ij} \geq 0 \quad \text{and} \quad \sum_{k \in E} p_{ik} = 1 \qquad (i, j \in E). \]
© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/9783030401832_6
A matrix $\mathbf{P}$ indexed by $E$ and satisfying the above properties is called a stochastic matrix. The state space may be infinite, and therefore such a matrix is in general not of the kind studied in linear algebra. However, the basic operations of addition and multiplication will be defined by the same formal rules. The notation $x = \{x(i)\}_{i \in E}$ formally represents a column vector, and $x^T$ is the corresponding row vector. For instance,
\[ (x^T \mathbf{P})(j) = \sum_{k \in E} x(k) p_{kj}. \]
A transition matrix $\mathbf{P}$ is sometimes described by its transition graph $G$, that is, a graph having for nodes (or vertices) the states of $E$, and with an oriented edge from $i$ to $j$ if and only if $p_{ij} > 0$.

The Markov property (6.1) extends to
\[ P(A \mid X_n = i, B) = P(A \mid X_n = i), \]
where $A := \{X_{n+1} = j_1, \ldots, X_{n+k} = j_k\}$ and $B := \{X_0 = i_0, \ldots, X_{n-1} = i_{n-1}\}$ (Exercise 6.6.1). This is in turn equivalent to
\[ P(A \cap B \mid X_n = i) = P(A \mid X_n = i)\, P(B \mid X_n = i). \]
In other words, $A$ and $B$ are conditionally independent given $X_n = i$. Therefore, the future at time $n$ and the past at time $n$ are conditionally independent given the present state $X_n = i$. More generally:

Theorem 6.1.2 For all $n \geq 2$ and all $i \in E$, the $\sigma$-fields $\sigma(X_0, \ldots, X_{n-1})$ and $\sigma(X_{n+1}, X_{n+2}, \ldots)$ are independent given $X_n = i$.

Proof. This is a direct consequence of the above observations and of Theorem 3.1.39.

Theorem 6.1.2 shows in particular that the Markov property is independent of the direction of time.

Notation. We shall from now on abbreviate $P(C \mid X_0 = i)$ as $P_i(C)$ $(C \in \mathcal{F})$. Also, if $\mu$ is a probability distribution on $E$, then $P_{\mu}(C)$ is the probability of $C$ given that the initial state $X_0$ is distributed according to $\mu$:
\[ P_{\mu}(C) = \sum_{i \in E} \mu(i) P(C \mid X_0 = i) = \sum_{i \in E} \mu(i) P_i(C). \]
The distribution at time $n$ of the chain is the vector $\nu_n$ indexed by $E$ and defined by
\[ \nu_n(i) := P(X_n = i) \qquad (i \in E). \]
From Bayes' rule of total causes,
\[ P(X_{n+1} = j) = \sum_{k \in E} P(X_n = k)\, P(X_{n+1} = j \mid X_n = k), \]
that is, $\nu_{n+1}(j) = \sum_{k \in E} \nu_n(k) p_{kj}$. In matrix form: $\nu_{n+1}^T = \nu_n^T \mathbf{P}$. Iteration of this equality yields
\[ \nu_n^T = \nu_0^T \mathbf{P}^n. \tag{6.2} \]
The matrix $\mathbf{P}^m$ is called the $m$-step transition matrix because its general term is
\[ p_{ij}(m) := P(X_{n+m} = j \mid X_n = i). \]
Indeed, the Bayes sequential rule and the Markov property give for the right-hand side of the latter equality
\[ \sum_{i_1, \ldots, i_{m-1} \in E} p_{i i_1} p_{i_1 i_2} \cdots p_{i_{m-1} j}, \]
which is the general term of the $m$th power of $\mathbf{P}$.

The probability distribution $\nu_0$ of the initial state $X_0$ is called the initial distribution. From Bayes' sequential rule, the homogeneous Markov property and the definition of the transition matrix,
\[ P(X_0 = i_0, X_1 = i_1, \ldots, X_k = i_k) = \nu_0(i_0) p_{i_0 i_1} \cdots p_{i_{k-1} i_k}. \tag{6.3} \]
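Formulas (6.2) and (6.3) are straightforward to evaluate when the state space is finite. The sketch below is an illustration only: the two-state transition matrix `P`, the initial distribution `nu0` and the function names are hypothetical. It iterates $\nu_{n+1}^T = \nu_n^T \mathbf{P}$ and computes the probability of a given finite trajectory as the product $\nu_0(i_0) p_{i_0 i_1} \cdots p_{i_{k-1} i_k}$.

```python
def step(nu, P):
    """One application of nu_{n+1}^T = nu_n^T P for a finite state space."""
    size = len(P)
    return [sum(nu[k] * P[k][j] for k in range(size)) for j in range(size)]

def distribution_at(nu0, P, n):
    """Distribution nu_n of X_n, obtained by iterating step n times (formula (6.2))."""
    nu = list(nu0)
    for _ in range(n):
        nu = step(nu, P)
    return nu

def path_probability(nu0, P, path):
    """P(X_0 = i_0, ..., X_k = i_k) = nu_0(i_0) p_{i_0 i_1} ... p_{i_{k-1} i_k} (formula (6.3))."""
    prob = nu0[path[0]]
    for i, j in zip(path, path[1:]):
        prob *= P[i][j]
    return prob

P = [[0.9, 0.1],            # hypothetical transition matrix on E = {0, 1}
     [0.5, 0.5]]
nu0 = [1.0, 0.0]            # the chain starts in state 0

nu2 = distribution_at(nu0, P, 2)
prob_001 = path_probability(nu0, P, [0, 0, 1])
print(nu2, prob_001)
```

Iterating the one-step relation avoids forming the matrix power $\mathbf{P}^n$ explicitly, which is a design choice of this sketch rather than a requirement of (6.2).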
Therefore, by Theorem 5.1.7, we have the following

Theorem 6.1.3 The distribution of a discrete-time hmc is uniquely determined by its initial distribution and its transition matrix.

Many hmcs receive a natural description in terms of a recurrence equation.

Theorem 6.1.4 Let $\{Z_n\}_{n \geq 1}$ be an iid sequence of random variables with values in some measurable space $(G, \mathcal{G})$. Let $E$ be a countable space and let $f : (E \times G, \mathcal{P}(E) \otimes \mathcal{G}) \to (E, \mathcal{P}(E))$ be some measurable function. Let $X_0$ be a random variable with values in $E$, independent of $\{Z_n\}_{n \geq 1}$. The recurrence equation
\[ X_{n+1} = f(X_n, Z_{n+1}) \tag{6.4} \]
then defines an hmc.

Proof. Iteration of recurrence (6.4) shows that for all $n \geq 1$, there is a function $g_n$ such that $X_n = g_n(X_0, Z_1, \ldots, Z_n)$, and therefore
\begin{align*}
P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) &= P(f(i, Z_{n+1}) = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) \\
&= P(f(i, Z_{n+1}) = j),
\end{align*}
since the event $\{X_0 = i_0, \ldots, X_{n-1} = i_{n-1}, X_n = i\}$ is expressible in terms of $X_0, Z_1, \ldots, Z_n$ and is therefore independent of $Z_{n+1}$. Similarly, $P(X_{n+1} = j \mid X_n = i) = P(f(i, Z_{n+1}) = j)$. We therefore have a Markov chain, and it is homogeneous since the right-hand side of the last equality does not depend on $n$. Explicitly:
\[ p_{ij} = P(f(i, Z_1) = j). \tag{6.5} \]

Not all homogeneous Markov chains receive a "natural" description of the type featured in Theorem 6.1.4. However, it is always possible to find a "theoretical" description of this kind. More exactly, we have
Theorem 6.1.5 For any transition matrix P on E, there exists a homogeneous Markov chain with this transition matrix and with a representation such as in Theorem 6.1.4. Proof. Since E is countable, we may identify it with N, which we do in this proof. Let Xn+1 = j if
j−1
pXn k ≤ Zn+1
0. One of the challenges associated with Gibbs models is obtaining explicit formulas for averages, considering that it is generally hard to compute the partition function. This is feasible in exceptional cases (see Exercise 6.6.3). Such distributions are of interest to physicists when the energy is expressed in terms of a potential function describing the local interactions. The notion of clique then plays a central role.
Definition 6.1.15 Any singleton $\{v\} \subset V$ is a clique. A subset $C \subseteq V$ with more than one element is called a clique (with respect to $\sim$) if and only if any two distinct sites of $C$ are mutual neighbors. A clique $C$ is called maximal if for any site $v \notin C$, $C \cup \{v\}$ is not a clique.

The collection of cliques will be denoted by $\mathcal{C}$.

Definition 6.1.16 A Gibbs potential on $\Lambda^V$ relative to $\sim$ is a collection $\{V_C\}_{C \subseteq V}$ of functions $V_C : \Lambda^V \to \mathbb{R} \cup \{+\infty\}$ such that
(i) $V_C \equiv 0$ if $C$ is not a clique, and
(ii) for all $x, x' \in \Lambda^V$ and all $C \subseteq V$,
\[ x(C) = x'(C) \ \Rightarrow\ V_C(x) = V_C(x'). \]
The energy function $U$ is said to derive from the potential $\{V_C\}_{C \subseteq V}$ if
\[ U(x) = \sum_{C} V_C(x). \]

The function $V_C$ depends only on the phases at the sites inside subset $C$. One could write more explicitly $V_C(x(C))$ instead of $V_C(x)$, but this notation will not be used. In this context, the distribution in (6.8) is called a Gibbs distribution (w.r.t. $\sim$).

Example 6.1.17: The Ising Model, take 1. In statistical physics, the following model is regarded as a qualitatively correct idealization of a piece of ferromagnetic material. Here $V = \mathbb{Z}_m^2 = \{(i, j) \in \mathbb{Z}^2,\ i, j \in [1, m]\}$ and $\Lambda = \{+1, -1\}$, where $\pm 1$ is the orientation of the magnetic spin at a given site. The figure below depicts two particular neighborhood systems, their respective cliques, and the boundary of a $2 \times 2$ square for both cases. The neighborhood system in the original Ising model is as in column ($\alpha$) of the figure below, and the Gibbs potential is
\[ V_{\{v\}}(x) = -\frac{H}{k} x(v), \qquad V_{\langle v, w \rangle}(x) = -\frac{J}{k} x(v) x(w), \]
where $\langle v, w \rangle$ is the 2-element clique ($v \sim w$). For physicists, $k$ is the Boltzmann constant, $H$ is the external magnetic field, and $J$ is the internal energy of an elementary magnetic dipole. The energy function corresponding to this potential is therefore
\[ U(x) = -\frac{J}{k} \sum_{\langle v, w \rangle} x(v) x(w) - \frac{H}{k} \sum_{v \in V} x(v). \]
[Figure: Two examples of neighborhoods, cliques, and boundaries. Columns ($\alpha$) and ($\beta$); rows: neighborhoods, cliques (up to a rotation), and, in black, the boundary of the white square.]

Example 6.1.18: The autobinomial model.² For the purpose of image synthesis, one seeks Gibbs distributions describing pictures featuring various textures, lines separating patches with different textures (boundaries), lines per se (roads, rail tracks), randomly located objects (moon craters), etc. The following is an all-purpose texture model that may be used to describe the texture of various materials. The set of sites is $V = \mathbb{Z}_m^2$, and the phase space is $\Lambda = \{0, 1, \ldots, L\}$. In the context of image processing, a site $v$ is a pixel (PICTure ELement), and a phase $\lambda \in \Lambda$ is a shade of grey, or a color. The neighborhood system is
\[ N_v = \{w \in V \,;\, w \neq v \,;\, \|w - v\|^2 \leq d\}, \tag{6.9} \]
where $d$ is a fixed positive integer and where $\|w - v\|$ is the euclidean distance between $v$ and $w$. In this model the only cliques participating in the energy function are singletons and pairs of mutual neighbors. The set of cliques appearing in the energy function is a disjoint sum of collections of cliques
\[ \mathcal{C} = \sum_{j=1}^{m(d)} \mathcal{C}_j, \]
where $\mathcal{C}_1$ is the collection of singletons, and all pairs $\{v, w\}$ in $\mathcal{C}_j$, $2 \leq j \leq m(d)$, have the same distance $\|w - v\|$ and the same direction, as shown in the figure below. The potential is given by
\[ V_C(x) = \begin{cases} -\log \binom{L}{x(v)} + \alpha_1 x(v) & \text{if } C = \{v\} \in \mathcal{C}_1, \\ \alpha_j x(v) x(w) & \text{if } C = \{v, w\} \in \mathcal{C}_j, \end{cases} \]
where $\alpha_j \in \mathbb{R}$. For any clique $C$ not of type $\mathcal{C}_j$, $V_C \equiv 0$.

² [Besag, 1974].

[Figure: Neighborhoods and cliques of three autobinomial models. Panels ($\alpha$), ($\beta$), ($\gamma$) correspond to $d = 1, 2, 4$ and $m(d) = 3, 5, 7$ respectively; clique types $\mathcal{C}_1$ to $\mathcal{C}_7$.]

The terminology ("autobinomial") is motivated by the fact that the local system has the form
\[ \pi^v(x) = \binom{L}{x(v)} \tau^{x(v)} (1 - \tau)^{L - x(v)}, \tag{6.10} \]
where $\tau$ is a parameter depending on $x(N_v)$ as follows:
\[ \tau = \tau(N_v) = \frac{e^{-\langle \alpha, b \rangle}}{1 + e^{-\langle \alpha, b \rangle}}. \tag{6.11} \]
Here $\langle \alpha, b \rangle$ is the scalar product of $\alpha = (\alpha_1, \ldots, \alpha_{m(d)})$ and $b = (b_1, \ldots, b_{m(d)})$, where $b_1 = 1$, and for all $j$, $2 \leq j \leq m(d)$,
\[ b_j = b_j(x(N_v)) = x(u) + x(w), \]
where $\{v, u\}$ and $\{v, w\}$ are the two pairs in $\mathcal{C}_j$ containing $v$.

Proof. From the explicit formula (6.12) giving the local characteristic at site $v$,
\[ \pi^v(x) = \frac{\exp\left\{\log \binom{L}{x(v)} - \alpha_1 x(v) - \left(\sum_{j=2}^{m(d)} \alpha_j \sum_{w \,;\, \{v, w\} \in \mathcal{C}_j} x(w)\right) x(v)\right\}}{\sum_{\lambda \in \Lambda} \exp\left\{\log \binom{L}{\lambda} - \alpha_1 \lambda - \left(\sum_{j=2}^{m(d)} \alpha_j \sum_{w \,;\, \{v, w\} \in \mathcal{C}_j} x(w)\right) \lambda\right\}}. \]
The numerator equals
\[ \binom{L}{x(v)} e^{-\langle \alpha, b \rangle x(v)}, \]
and the denominator is
\[ \sum_{\lambda \in \Lambda} \binom{L}{\lambda} e^{-\langle \alpha, b \rangle \lambda} = \sum_{\ell = 0}^{L} \binom{L}{\ell} \left(e^{-\langle \alpha, b \rangle}\right)^{\ell} = \left(1 + e^{-\langle \alpha, b \rangle}\right)^{L}. \]
Equality (6.10) then follows.
Expression (6.10) shows that τ is the average level of grey at site v, given x(Nv ), and expression (6.11) shows that τ is a function of α, b. The parameter αj controls the bond in the direction and at the distance that characterize Cj .
The Hammersley–Clifford Theorem

Gibbs distributions with an energy deriving from a Gibbs potential relative to a neighborhood system are distributions of Markov fields relative to the same neighborhood system.

Theorem 6.1.19 If $X$ is a random field with a distribution $\pi$ of the form $\pi(x) = \frac{1}{Z} e^{-U(x)}$, where the energy function $U$ derives from a Gibbs potential $\{V_C\}_{C\subseteq V}$ relative to $\sim$, then $X$ is a Markov random field with respect to $\sim$. Moreover, its local specification is given by the formula
\[ \pi^v(x) = \frac{e^{-\sum_{C\ni v} V_C(x)}}{\sum_{\lambda\in\Lambda} e^{-\sum_{C\ni v} V_C(\lambda, x(V\setminus v))}}, \tag{6.12} \]
where the notation $\sum_{C\ni v}$ means that the sum extends over the sets $C$ that contain the site $v$.
Proof. First observe that the right-hand side of (6.12) depends on $x$ only through $x(v)$ and $x(N_v)$. Indeed, $V_C(x)$ depends only on $(x(w), w \in C)$, and for a clique $C$, if $w \in C$ and $v \in C$, then either $w = v$ or $w \sim v$. Therefore, if it can be shown that $P(X(v) = x(v) \mid X(V\setminus v) = x(V\setminus v))$ equals the right-hand side of (6.12), then (see Exercise 6.6.6) the Markov property will be proved. By definition of conditional probability,
\[ P(X(v) = x(v) \mid X(V\setminus v) = x(V\setminus v)) = \frac{\pi(x)}{\sum_{\lambda\in\Lambda} \pi(\lambda, x(V\setminus v))}. \tag{$\dagger$} \]
But
\[ \pi(x) = \frac{1}{Z}\, e^{-\sum_{C\ni v} V_C(x) - \sum_{C\not\ni v} V_C(x)}, \]
and similarly,
\[ \pi(\lambda, x(V\setminus v)) = \frac{1}{Z}\, e^{-\sum_{C\ni v} V_C(\lambda, x(V\setminus v)) - \sum_{C\not\ni v} V_C(\lambda, x(V\setminus v))}. \]
If $C$ is a clique and $v$ is not in $C$, then $V_C(\lambda, x(V\setminus v)) = V_C(x)$ and is therefore independent of $\lambda \in \Lambda$. Therefore, after factoring out $\exp\{-\sum_{C\not\ni v} V_C(x)\}$, the right-hand side of ($\dagger$) is found to be equal to the right-hand side of (6.12).

The local energy at site $v$ of configuration $x$ is
\[ U_v(x) = \sum_{C\ni v} V_C(x). \]
With this notation, (6.12) becomes
\[ \pi^v(x) = \frac{e^{-U_v(x)}}{\sum_{\lambda\in\Lambda} e^{-U_v(\lambda, x(V\setminus v))}}. \]
Example 6.1.20: The Ising Model, take 2. The local characteristics in the Ising model are
\[ \pi_T^v(x) = \frac{e^{\frac{1}{kT}\{J\sum_{w;w\sim v} x(w) + H\}x(v)}}{e^{+\frac{1}{kT}\{J\sum_{w;w\sim v} x(w) + H\}} + e^{-\frac{1}{kT}\{J\sum_{w;w\sim v} x(w) + H\}}}. \]
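The closed form of Example 6.1.20 can be checked against the definition of the local specification on a small instance. The following is only a sketch under stated assumptions: the 4-cycle of sites, the constants `J`, `H`, `KT`, and the energy $U_T(x) = -\frac{1}{kT}\{J\sum_{\langle v,w\rangle}x(v)x(w) + H\sum_v x(v)\}$ with phases in $\{-1,+1\}$ are illustrative choices, not taken from the text.

```python
import itertools
import math

# Hypothetical toy instance: a 4-cycle of sites with phases in {-1, +1}.
SITES = [0, 1, 2, 3]
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]   # the 2-element cliques <v, w>
J, H, KT = 1.0, 0.5, 2.0                   # illustrative constants (k*T lumped into KT)

def energy(x):
    """U_T(x) = -(J * sum_<v,w> x(v)x(w) + H * sum_v x(v)) / kT (assumed form)."""
    pair = sum(x[v] * x[w] for v, w in EDGES)
    return -(J * pair + H * sum(x)) / KT

def local_brute_force(x, v):
    """P(X(v) = x(v) | rest of the configuration), directly from pi proportional to exp(-U)."""
    def weight(lam):
        y = list(x)
        y[v] = lam
        return math.exp(-energy(y))
    return weight(x[v]) / (weight(+1) + weight(-1))

def local_closed_form(x, v):
    """Closed form of Example 6.1.20, using only the neighbors of v."""
    s = sum(x[w] for a, b in EDGES for w in (a, b) if v in (a, b) and w != v)
    h = (J * s + H) / KT
    return math.exp(h * x[v]) / (math.exp(h) + math.exp(-h))

# Largest discrepancy over all configurations and all sites.
max_gap = max(abs(local_brute_force(x, v) - local_closed_form(x, v))
              for x in itertools.product([-1, +1], repeat=len(SITES))
              for v in SITES)
```

The normalizing constant $Z$ never has to be computed, since it cancels in the ratio; this is the practical content of formula (6.12).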
Theorem 6.1.19 above is the direct part of the Gibbs–Markov equivalence theorem: A Gibbs distribution relative to a neighborhood system is the distribution of a Markov field with respect to the same neighborhood system. The converse part (Hammersley–Clifford theorem) is important from a theoretical point of view, since together with the direct part it concludes that Gibbs distributions and mrfs are essentially the same objects.

Theorem 6.1.21 Let $\pi > 0$ be the distribution of a Markov random field with respect to $\sim$. Then
\[ \pi(x) = \frac{1}{Z}\, e^{-U(x)} \]
for some energy function $U$ deriving from a Gibbs potential $\{V_C\}_{C\subseteq V}$ with respect to $\sim$.

The proof is omitted with little inconvenience since, in practice, the potential as well as the topology of $V$ can be obtained directly from the expression of the energy, as the following example shows.

Example 6.1.22: Markov chains as Markov fields. Let $V = \{0, 1, \ldots, N\}$ and $\Lambda = E$, a finite space. A random field $X$ on $V$ with phase space $\Lambda$ is therefore a vector $X$ with values in $E^{N+1}$. Suppose that $X_0, \ldots, X_N$ is a homogeneous Markov chain with transition matrix $P = \{p_{ij}\}_{i,j\in E}$ and initial distribution $\nu = \{\nu_i\}_{i\in E}$. In particular, with $x = (x_0, \ldots, x_N)$,
\[ \pi(x) = \nu_{x_0}\, p_{x_0 x_1} \cdots p_{x_{N-1} x_N}, \]
that is,
\[ \pi(x) = e^{-U(x)}, \quad\text{where}\quad U(x) = -\log \nu_{x_0} - \sum_{n=0}^{N-1} \log p_{x_n x_{n+1}}. \]
Clearly, this energy derives from a Gibbs potential associated with the nearest-neighbor topology for which the cliques are, besides the singletons, the pairs of adjacent sites. The potential functions are:
\[ V_{\{0\}}(x) = -\log \nu_{x_0}, \qquad V_{\{n,n+1\}}(x) = -\log p_{x_n x_{n+1}}. \]
The local characteristic at site $n$, $2 \le n \le N - 1$, can be computed from formula (6.12), which gives
\[ \pi^n(x) = \frac{\exp(\log p_{x_{n-1} x_n} + \log p_{x_n x_{n+1}})}{\sum_{y\in E} \exp(\log p_{x_{n-1} y} + \log p_{y x_{n+1}})}, \]
that is,
\[ \pi^n(x) = \frac{p_{x_{n-1} x_n}\, p_{x_n x_{n+1}}}{p^{(2)}_{x_{n-1} x_{n+1}}}, \]
where $p^{(2)}_{ij}$ is the general term of the two-step transition matrix $P^2$. Similar computations give $\pi^0(x)$ and $\pi^N(x)$. We note that, in view of the neighborhood structure, for $2 \le n \le N - 1$, $X_n$ is independent of $X_0, \ldots, X_{n-2}, X_{n+2}, \ldots, X_N$ given $X_{n-1}$ and $X_{n+1}$.
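As a sanity check of the local characteristic in Example 6.1.22, one can compare it with the conditional probability computed from the joint distribution. The sketch below assumes a hypothetical two-state transition matrix `P`, initial distribution `NU` and horizon `N`; exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical 2-state chain on E = {0, 1}, observed at times 0..N.
P = [[F(2, 3), F(1, 3)],
     [F(1, 4), F(3, 4)]]
NU = [F(1, 2), F(1, 2)]
E, N = range(2), 4

def pi(x):
    """pi(x) = nu(x_0) p(x_0, x_1) ... p(x_{N-1}, x_N)."""
    w = NU[x[0]]
    for a, b in zip(x, x[1:]):
        w *= P[a][b]
    return w

def local_brute_force(x, n):
    """P(X_n = x_n | all other coordinates), straight from the definition."""
    total = sum(pi(x[:n] + (y,) + x[n + 1:]) for y in E)
    return pi(x) / total

def local_formula(x, n):
    """p(x_{n-1}, x_n) p(x_n, x_{n+1}) / p^(2)(x_{n-1}, x_{n+1})."""
    two_step = sum(P[x[n - 1]][y] * P[y][x[n + 1]] for y in E)
    return P[x[n - 1]][x[n]] * P[x[n]][x[n + 1]] / two_step

# Compare the two expressions at every interior site of every path.
agree = all(local_brute_force(x, n) == local_formula(x, n)
            for x in product(E, repeat=N + 1)
            for n in range(1, N))
```

The conditional probability at an interior site involves only the two adjacent coordinates, which is the Markov field property with respect to the nearest-neighbor topology.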
6.2 The Transition Matrix

6.2.1 Topological Notions
The notions introduced in this subsection (communication and periodicity) are of a topological nature, in the sense that they concern only the naked transition graph (without the labels).
Communication Classes

Definition 6.2.1 State $j$ is said to be accessible from state $i$ if there exists an integer $M \ge 0$ such that $p_{ij}(M) > 0$. In particular, a state $i$ is always accessible from itself, since $p_{ii}(0) = 1$. States $i$ and $j$ are said to communicate if $i$ is accessible from $j$ and $j$ is accessible from $i$, and this is denoted by $i \leftrightarrow j$.

For $M \ge 1$, $p_{ij}(M) = \sum_{i_1,\ldots,i_{M-1}} p_{i i_1} \cdots p_{i_{M-1} j}$, and therefore $p_{ij}(M) > 0$ if and only if there exists at least one path $i, i_1, \ldots, i_{M-1}, j$ from $i$ to $j$ such that
\[ p_{i i_1}\, p_{i_1 i_2} \cdots p_{i_{M-1} j} > 0, \]
or, equivalently, if there is an oriented path from $i$ to $j$ in the transition graph $G$. Clearly,
\[ i \leftrightarrow i \quad \text{(reflexivity)}, \]
\[ i \leftrightarrow j \Rightarrow j \leftrightarrow i \quad \text{(symmetry)}, \]
\[ i \leftrightarrow j,\ j \leftrightarrow k \Rightarrow i \leftrightarrow k \quad \text{(transitivity)}. \]
Therefore, the communication relation ($\leftrightarrow$) is an equivalence relation, and it generates a partition of the state space $E$ into disjoint equivalence classes called communication classes.

Definition 6.2.2 When there exists only one communication class, the chain, its transition matrix and its transition graph are said to be irreducible.
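Since accessibility depends only on the transition graph, the communication classes of a finite chain can be obtained from a transitive closure. The following sketch uses Warshall's algorithm; the function names and the two example matrices are hypothetical illustrations.

```python
def communication_classes(P):
    """Partition the states 0..n-1 of a transition matrix into communication classes."""
    n = len(P)
    # reach[i][j] is True when j is accessible from i (M = 0 gives reach[i][i]).
    reach = [[i == j or P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):                       # Warshall's transitive closure
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    classes, seen = [], set()
    for i in range(n):
        if i not in seen:
            cls = [j for j in range(n) if reach[i][j] and reach[j][i]]
            classes.append(cls)
            seen.update(cls)
    return classes

def is_irreducible(P):
    """A chain is irreducible when there is a single communication class."""
    return len(communication_classes(P)) == 1

# Hypothetical example: states 0 and 1 communicate, state 2 is absorbing.
P_reducible = [[0.5, 0.5, 0.0],
               [0.3, 0.3, 0.4],
               [0.0, 0.0, 1.0]]
# Hypothetical example: a deterministic 3-cycle, which is irreducible.
P_cycle = [[0, 1, 0],
           [0, 0, 1],
           [1, 0, 0]]
```

Only the positions of the nonzero entries matter here, in line with the remark that these notions concern the transition graph without its labels.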
Example 6.2.3: Repair shop, take 2. Example 6.1.7 continued. A necessary and suﬃcient condition of irreducibility of the repair shop chain of Example 6.1.7 is that P (Z1 = 0) > 0 and P (Z1 ≥ 2) > 0 (Exercise 6.6.7).
Period
Consider the random walk on $\mathbb{Z}$ (Example 6.1.6). Since $p \in (0, 1)$, it is irreducible. Observe that $E = C_0 + C_1$, where $C_0$ and $C_1$, the sets of even and odd relative integers respectively, have the following property. If you start from $i \in C_0$ (resp., $C_1$), then in one step you can go only to a state $j \in C_1$ (resp., $C_0$). The chain $\{X_n\}$ passes alternately from one cyclic class to the other. In this sense, the chain has a periodic behavior, corresponding to the period 2. More generally, for any irreducible Markov chain, one can find a partition of $E$ into $d$ classes $C_0, C_1, \ldots, C_{d-1}$ such that for all $k$, $i \in C_k$,
\[ \sum_{j\in C_{k+1}} p_{ij} = 1, \]
where by convention $C_d = C_0$. The proof follows directly from Theorem 6.2.7 below. The number $d \ge 1$ is called the period of the chain (resp., of the transition matrix, of the transition graph). The classes $C_0, C_1, \ldots, C_{d-1}$ are called the cyclic classes.

[Figure: Cycles. The cyclic classes C0, C1, ..., Cd−2, Cd−1.]

The chain therefore moves from one class to the other at each transition, and this cyclically, as shown in the figure. We shall proceed to substantiate the above description of periodicity starting with the formal definition of period based on the notion of greatest common divisor (gcd) of a set of integers.

Definition 6.2.4 The period $d_i$ of state $i \in E$ is, by definition,
\[ d_i = \gcd\{n \ge 1;\ p_{ii}(n) > 0\}, \]
with the convention $d_i = +\infty$ if there is no $n \ge 1$ with $p_{ii}(n) > 0$. If $d_i = 1$, the state $i$ is called aperiodic.

Remark 6.2.5 Very often aperiodicity follows from the following simple observation: An irreducible transition matrix $P$ with at least one state $i \in E$ such that $p_{ii} > 0$ is aperiodic (in fact, in this case $1 \in \{n \ge 1;\ p_{ii}(n) > 0\}$ and therefore $d_i = 1$).

Period is a (communication) class property in the following sense:
Theorem 6.2.6 Two states $i$ and $j$ which communicate have the same period.

Proof. As $i$ and $j$ communicate, there exist integers $N$ and $M$ such that $p_{ij}(M) > 0$ and $p_{ji}(N) > 0$. For any $k \ge 1$,
\[ p_{ii}(M + nk + N) \ge p_{ij}(M)\,(p_{jj}(k))^n\, p_{ji}(N) \]
(indeed, the trajectories $X_0, \ldots, X_{M+nk+N}$ such that $X_0 = i$, $X_M = j$, $X_{M+k} = j$, \ldots, $X_{M+nk} = j$, $X_{M+nk+N} = i$ are a subset of the trajectories starting from $i$ and returning to $i$ in $M + nk + N$ steps). Therefore, for any $k \ge 1$ such that $p_{jj}(k) > 0$, we have that $p_{ii}(M + nk + N) > 0$ for all $n \ge 1$. Therefore, $d_i$ divides $M + nk + N$ for all $n \ge 1$, and in particular, $d_i$ divides $k$. We have therefore shown that $d_i$ divides all $k$ such that $p_{jj}(k) > 0$, and in particular, $d_i$ divides $d_j$. By symmetry, $d_j$ divides $d_i$, and therefore, finally, $d_i = d_j$.

We may therefore henceforth speak of the period of a communication class or of an irreducible chain. The important result concerning periodicity is the following.

Theorem 6.2.7 Let $P$ be an irreducible stochastic matrix with period $d$. Then for all states $i, j$ there exist $m \ge 0$ and $n_0 \ge 0$ ($m$ and $n_0$ possibly depending on $i, j$) such that
\[ p_{ij}(m + nd) > 0 \quad \text{for all } n \ge n_0. \]

Proof. It suffices to prove the theorem for $i = j$. Indeed, there exists an $m$ such that $p_{ij}(m) > 0$, because $j$ is accessible from $i$, the chain being irreducible, and therefore, if for some $n_0 \ge 0$ we have $p_{jj}(nd) > 0$ for all $n \ge n_0$, then $p_{ij}(m + nd) \ge p_{ij}(m)\,p_{jj}(nd) > 0$ for all $n \ge n_0$. The rest of the proof is an immediate consequence of a classical result of number theory. Indeed, the gcd of the set $A = \{k \ge 1;\ p_{jj}(k) > 0\}$ is $d$, and $A$ is closed under addition. The set $A$ therefore contains all but a finite number of the positive multiples of $d$. In other words, there exists an $n_0$ such that $n > n_0$ implies $p_{jj}(nd) > 0$.
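For a small finite chain, the period of Definition 6.2.4 can be computed by scanning the return times up to some horizon. This is a sketch only: the default horizon `2 * n_states**2` is an assumption that is generous for small examples, and `cycle_walk` is a hypothetical helper building the symmetric walk on a cycle of `m` states.

```python
from math import gcd

def period(P, i, horizon=None):
    """gcd of {n >= 1 : p_ii(n) > 0}, scanning n = 1..horizon (0 if no return is seen)."""
    n_states = len(P)
    horizon = horizon or 2 * n_states * n_states
    # support[j] is True when p_ij(n) > 0 for the current n; only the graph matters.
    support = [j == i for j in range(n_states)]
    d = 0
    for n in range(1, horizon + 1):
        support = [any(support[k] and P[k][j] > 0 for k in range(n_states))
                   for j in range(n_states)]
        if support[i]:
            d = gcd(d, n)
    return d

def cycle_walk(m, lazy=False):
    """Symmetric walk on a cycle of m states; lazy=True adds a self-loop at state 0."""
    P = [[0.0] * m for _ in range(m)]
    for i in range(m):
        P[i][(i + 1) % m] = P[i][(i - 1) % m] = 0.5
    if lazy:                       # p_00 > 0, the situation of Remark 6.2.5
        P[0] = [0.0] * m
        P[0][0], P[0][1], P[0][m - 1] = 0.5, 0.25, 0.25
    return P

periods_plain = [period(cycle_walk(6), i) for i in range(6)]
periods_lazy = [period(cycle_walk(6, lazy=True), i) for i in range(6)]
```

On the 6-cycle all states share the period 2, as Theorem 6.2.6 requires, and a single self-loop makes every state aperiodic.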
6.2.2 Stationary Distributions and Reversibility
We now introduce the central notion of the stability theory of discrete-time hmcs.

Definition 6.2.8 A probability distribution $\pi$ satisfying
\[ \pi^T = \pi^T P \tag{6.13} \]
is called a stationary distribution (of the transition matrix $P$ or of the corresponding hmc).

The so-called global balance equation (6.13) says that
\[ \pi(i) = \sum_{j\in E} \pi(j)\,p_{ji} \quad (i \in E). \]
Iteration of (6.13) gives $\pi^T = \pi^T P^n$ for all $n \ge 0$, and therefore, in view of (6.2), if the initial distribution $\nu = \pi$, then $\nu_n = \pi$ for all $n \ge 0$. In particular, a chain starting with a stationary distribution keeps the same distribution forever. But there is more, because then,
\[ P(X_n = i_0, X_{n+1} = i_1, \ldots, X_{n+k} = i_k) = P(X_n = i_0)\,p_{i_0 i_1} \cdots p_{i_{k-1} i_k} = \pi(i_0)\,p_{i_0 i_1} \cdots p_{i_{k-1} i_k} \]
does not depend on $n$. In this sense the chain is stationary. One also says that the chain is in a stationary regime, or in equilibrium, or in steady state. In summary:

Theorem 6.2.9 An hmc whose initial distribution is a stationary distribution is stationary.

Remark 6.2.10 The balance equation $\pi^T P = \pi^T$, together with the requirement that $\pi$ is a probability vector, that is, $\pi^T \mathbf{1} = 1$ (where $\mathbf{1}$ is a column vector with all its entries equal to 1), constitute, when $E$ is finite, $|E| + 1$ equations for $|E|$ unknown variables. One of the $|E|$ equations in $\pi^T P = \pi^T$ is superfluous given the constraint $\pi^T \mathbf{1} = 1$. In fact, summation of all equalities of $\pi^T P = \pi^T$ yields the equality $\pi^T P \mathbf{1} = \pi^T \mathbf{1}$, which, since $P\mathbf{1} = \mathbf{1}$, is the trivial identity $\pi^T \mathbf{1} = \pi^T \mathbf{1}$.

Example 6.2.11: A two-state Markov Chain. The state space is $E = \{1, 2\}$ and the transition matrix is
\[ P = \begin{pmatrix} 1 - \alpha & \alpha \\ \beta & 1 - \beta \end{pmatrix}, \]
where $\alpha, \beta \in (0, 1)$. The global balance equations are
\[ \pi(1) = \pi(1)(1 - \alpha) + \pi(2)\beta, \qquad \pi(2) = \pi(1)\alpha + \pi(2)(1 - \beta). \]
This is a dependent system which reduces to the single equation $\pi(1)\alpha = \pi(2)\beta$, to which must be added the equality $\pi(1) + \pi(2) = 1$ expressing that $\pi$ is a probability vector. We obtain
\[ \pi(1) = \frac{\beta}{\alpha + \beta}, \qquad \pi(2) = \frac{\alpha}{\alpha + \beta}. \]
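The solution of Example 6.2.11 can be verified exactly, and the recursion $\nu_{n+1}^T = \nu_n^T P$ can be iterated to see what happens from another initial distribution. This is a minimal sketch; the values of `alpha` and `beta` are illustrative, and the convergence observed at the end is a feature of this particular matrix, not a general claim.

```python
from fractions import Fraction as F

def step(nu, P):
    """One step of the distribution recursion: (nu^T P)(i) = sum_j nu(j) p_ji."""
    n = len(P)
    return [sum(nu[j] * P[j][i] for j in range(n)) for i in range(n)]

alpha, beta = F(1, 5), F(3, 10)          # illustrative values in (0, 1)
P = [[1 - alpha, alpha],
     [beta, 1 - beta]]
pi = [beta / (alpha + beta), alpha / (alpha + beta)]

is_stationary = step(pi, P) == pi        # global balance, checked exactly

# Starting from state 1 with probability one, iterate nu_n = nu^T P^n.
nu = [F(1), F(0)]
for _ in range(60):
    nu = step(nu, P)
gap = max(abs(float(a - b)) for a, b in zip(nu, pi))
```

Here `pi` is indexed from 0, so `pi[0]` and `pi[1]` stand for $\pi(1)$ and $\pi(2)$ of the example.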
Example 6.2.12: The Ehrenfest Diffusion Model, take 2. The corresponding hmc was described in Example 6.1.9. The global balance equations are, for $i \in [1, N - 1]$,
\[ \pi(i) = \pi(i - 1)\Big(1 - \frac{i - 1}{N}\Big) + \pi(i + 1)\,\frac{i + 1}{N}, \]
and, for the boundary states,
\[ \pi(0) = \pi(1)\,\frac{1}{N}, \qquad \pi(N) = \pi(N - 1)\,\frac{1}{N}. \]
Leaving $\pi(0)$ undetermined, one can solve the balance equations for $i = 0, 1, \ldots, N$ successively, to obtain
\[ \pi(i) = \pi(0)\binom{N}{i}. \]
The value of $\pi(0)$ is then determined by writing that $\pi$ is a probability vector:
\[ 1 = \sum_{i=0}^{N} \pi(i) = \pi(0)\sum_{i=0}^{N}\binom{N}{i} = \pi(0)\,2^N. \]
This gives for $\pi$ the binomial distribution of size $N$ and parameter $\frac12$:
\[ \pi(i) = \frac{1}{2^N}\binom{N}{i}. \]
This is the distribution one would obtain by placing independently each particle in the compartments, with probability $\frac12$ for each compartment.

There may be many stationary distributions. Take the identity as transition matrix. Then any probability distribution on the state space is a stationary distribution. Also there may well not exist any stationary distribution. (See Exercise 6.6.16.)

Remark 6.2.13 An immediate consequence of Theorem 5.1.14 is that if an hmc $\{X_n\}_{n\ge 0}$ is stationary, it may be extended to a stationary hmc $\{X_n\}_{n\in\mathbb{Z}}$ with the same distribution.

Recurrence equations can be used to obtain the stationary distribution when the latter exists and is unique. Generating functions sometimes usefully exploit the dynamics.

Example 6.2.14: Repair shop, take 3. Examples 6.1.7 and 6.2.3 continued. For any complex number $z$ with modulus not larger than 1, it follows from the recurrence equation (6.6) that
\[ z^{X_{n+1}+1} = z^{(X_n - 1)^+ + 1}\, z^{Z_{n+1}} = \big(z^{X_n} 1_{\{X_n > 0\}} + z\,1_{\{X_n = 0\}}\big) z^{Z_{n+1}} = \big(z^{X_n} - 1_{\{X_n = 0\}} + z\,1_{\{X_n = 0\}}\big) z^{Z_{n+1}}, \]
and therefore
\[ z\,z^{X_{n+1}} - z^{X_n} z^{Z_{n+1}} = (z - 1)\,1_{\{X_n = 0\}}\, z^{Z_{n+1}}. \]
From the independence of $X_n$ and $Z_{n+1}$, $E[z^{X_n} z^{Z_{n+1}}] = E[z^{X_n}]\,g_Z(z)$, where $g_Z$ is the generating function of $Z_{n+1}$, and $E[1_{\{X_n = 0\}} z^{Z_{n+1}}] = \pi(0)\,g_Z(z)$, where $\pi(0) = P(X_n = 0)$. Therefore,
\[ z\,E[z^{X_{n+1}}] - g_Z(z)\,E[z^{X_n}] = (z - 1)\,\pi(0)\,g_Z(z). \]
Suppose that the chain is in steady state, in which case $E[z^{X_{n+1}}] = E[z^{X_n}] = g_X(z)$, and therefore
\[ g_X(z)\,(z - g_Z(z)) = \pi(0)\,(z - 1)\,g_Z(z). \tag{$\star$} \]
This gives the generating function $g_X(z) = \sum_{i=0}^{\infty} \pi(i) z^i$, as long as $\pi(0)$ is available. To obtain $\pi(0)$, differentiate ($\star$):
\[ g_X'(z)\,(z - g_Z(z)) + g_X(z)\,(1 - g_Z'(z)) = \pi(0)\,\big(g_Z(z) + (z - 1)\,g_Z'(z)\big), \]
and let $z = 1$, to obtain, taking into account the equalities $g_X(1) = g_Z(1) = 1$ and $g_Z'(1) = E[Z]$,
\[ \pi(0) = 1 - E[Z]. \tag{$\star\star$} \]
Since $\pi(0)$ must be nonnegative, this immediately gives the necessary condition $E[Z] \le 1$. Actually, one must have, if the trivial case $Z_1 \equiv 1$ is excluded,
\[ E[Z] < 1. \tag{6.14} \]
Indeed, if $E[Z] = 1$, implying $\pi(0) = 0$, it follows from ($\star$) that
\[ g_X(x)\,(x - g_Z(x)) = 0 \]
for all $x \in [0, 1]$. But, if the case $Z_1 \equiv 1$ (that is, $g_Z(x) \equiv x$) is excluded, equation $x - g_Z(x) = 0$ has only $x = 1$ for a solution when $g_Z'(1) = E[Z] \le 1$. Therefore, $g_X(x) \equiv 0$ for $x \in [0, 1)$, and consequently $g_X(z) \equiv 0$ on $\{|z| < 1\}$ ($g_X$ is analytic inside the open unit disk centered at the origin). This leads to a contradiction, since the generating function of an integer-valued random variable cannot be identically null.

It turns out that $E[Z] < 1$ is also a sufficient condition for the existence of a steady state (Example 6.3.22). For the time being, we have from ($\star$) and ($\star\star$) that, if the stationary distribution exists, then its generating function is given by the formula
\[ \sum_{i=0}^{\infty} \pi(i) z^i = (1 - E[Z])\,\frac{(z - 1)\,g_Z(z)}{z - g_Z(z)}. \]
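The two conclusions of Example 6.2.14, namely $\pi(0) = 1 - E[Z]$ and the formula for the generating function, can be checked numerically. The sketch below makes several assumptions that are not in the text: a hypothetical arrival law `PZ` on $\{0, 1, 2\}$ with $E[Z] < 1$, a truncation of the state space at level `K`, and a fixed number of iterations of $\nu^T P$ that happens to be ample for this example.

```python
# Hypothetical arrival law: Z takes the values 0, 1, 2 with these probabilities.
PZ = [0.5, 0.3, 0.2]
MEAN_Z = sum(k * p for k, p in enumerate(PZ))        # E[Z] = 0.7 < 1
K = 60                                               # truncation level (an assumption)

def step(nu):
    """nu^T P for X_{n+1} = (X_n - 1)^+ + Z_{n+1}, overflow folded back onto state K."""
    out = [0.0] * (K + 1)
    for i, mass in enumerate(nu):
        base = max(i - 1, 0)
        for k, p in enumerate(PZ):
            out[min(base + k, K)] += mass * p
    return out

# Iterate the distribution recursion from X_0 = 0 until it has settled.
pi = [1.0] + [0.0] * K
for _ in range(3000):
    pi = step(pi)

def g(probs, z):
    """Generating function sum_i probs[i] z^i of a finitely supported vector."""
    return sum(p * z ** i for i, p in enumerate(probs))

z = 0.5
lhs = g(pi, z)                                               # g_X(z), computed numerically
rhs = (1 - MEAN_Z) * (z - 1) * g(PZ, z) / (z - g(PZ, z))     # closed formula
```

With these choices the numerical vector `pi` has `pi[0]` close to `1 - MEAN_Z`, and `lhs` and `rhs` agree to many digits; a different arrival law may need a larger `K` or more iterations.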
Reversibility

Let $\{X_n\}_{n\in\mathbb{Z}}$ be an hmc with transition matrix $P$ and admitting a stationary distribution $\pi > 0$ (see Remark 6.2.13). Define the matrix $Q$, indexed by $E$, by
\[ \pi(i)\,q_{ij} = \pi(j)\,p_{ji}. \tag{$\star$} \]
This matrix is stochastic, since
\[ \sum_{j\in E} q_{ij} = \sum_{j\in E} \frac{\pi(j)}{\pi(i)}\,p_{ji} = \frac{1}{\pi(i)} \sum_{j\in E} \pi(j)\,p_{ji} = \frac{\pi(i)}{\pi(i)} = 1, \]
where the third equality uses the global balance equations. From Bayes' retrodiction formula,
\[ P(X_n = j \mid X_{n+1} = i) = \frac{P(X_{n+1} = i \mid X_n = j)\,P(X_n = j)}{P(X_{n+1} = i)}, \]
that is, in view of ($\star$),
\[ P(X_n = j \mid X_{n+1} = i) = q_{ij}. \tag{6.15} \]
Therefore $Q$ is the transition matrix of the initial chain when time is reversed.

Theorem 6.2.15 Let $P$ be a stochastic matrix indexed by a countable set $E$, and let $\pi$ be a probability distribution on $E$. Define the matrix $Q$ indexed by $E$ by
\[ \pi(i)\,q_{ij} = \pi(j)\,p_{ji}. \]
If $Q$ is a stochastic matrix, then $\pi$ is a stationary distribution of $P$.

Proof. Just verify that the global balance equation is satisfied: $\sum_{j\in E} \pi(j)\,p_{ji} = \pi(i) \sum_{j\in E} q_{ij} = \pi(i)$.
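The time-reversed matrix is straightforward to compute once $\pi$ is known. In this sketch the matrix `P` is a hypothetical doubly stochastic example, chosen so that the uniform distribution is stationary without further computation.

```python
from fractions import Fraction as F

def reversed_matrix(P, pi):
    """q_ij = pi(j) p_ji / pi(i), assuming pi > 0."""
    n = len(P)
    return [[pi[j] * P[j][i] / pi[i] for j in range(n)] for i in range(n)]

# Hypothetical doubly stochastic matrix, so the uniform distribution is stationary.
P = [[F(0), F(7, 10), F(3, 10)],
     [F(3, 10), F(0), F(7, 10)],
     [F(7, 10), F(3, 10), F(0)]]
pi = [F(1, 3)] * 3
Q = reversed_matrix(P, pi)

rows_sum_to_one = all(sum(row) == 1 for row in Q)   # Q is stochastic
is_reversible = Q == P      # False here: the reversed chain circulates the other way
```

For a uniform $\pi$ the reversed matrix is simply the transpose of `P`; the chain mostly moves 0 → 1 → 2 → 0, while its time reversal mostly moves 0 → 2 → 1 → 0.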
Definition 6.2.16 One calls reversible a stationary Markov chain with initial distribution $\pi$ (a stationary distribution) if for all $i, j \in E$, we have the so-called detailed balance equations
\[ \pi(i)\,p_{ij} = \pi(j)\,p_{ji}. \]
We then say that the pair $(P, \pi)$ is reversible.

In this case, $q_{ij} = p_{ij}$, and therefore the chain and the time-reversed chain are statistically the same, since the distribution of a homogeneous Markov chain is entirely determined by its initial distribution and its transition matrix. The following is an immediate corollary of Theorem 6.2.15.

Theorem 6.2.17 Let $P$ be a transition matrix on the countable state space $E$, and let $\pi$ be some probability distribution on $E$. If for all $i, j \in E$, the detailed balance equations are satisfied, then $\pi$ is a stationary distribution of $P$.

Example 6.2.18: The Ehrenfest Diffusion Model, take 3. This example continues Examples 6.1.9 and 6.2.12. Recall that we obtained the expression
\[ \pi(i) = \frac{1}{2^N}\binom{N}{i} \]
for the stationary distribution. We can also find this by checking the detailed balance equations
\[ \pi(i)\,p_{i,i+1} = \pi(i + 1)\,p_{i+1,i}. \]
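The detailed balance check of Example 6.2.18 takes a few lines in exact arithmetic. The sketch assumes the transition probabilities $p_{i,i+1} = 1 - i/N$ and $p_{i,i-1} = i/N$, which are the ones implied by the global balance equations of Example 6.2.12; the value of `N` is illustrative.

```python
from fractions import Fraction as F
from math import comb

N = 10                                        # number of particles (illustrative)

def p(i, j):
    """Ehrenfest transitions: i -> i+1 w.p. 1 - i/N, i -> i-1 w.p. i/N."""
    if j == i + 1:
        return 1 - F(i, N)
    if j == i - 1:
        return F(i, N)
    return F(0)

pi = [F(comb(N, i), 2 ** N) for i in range(N + 1)]   # binomial(N, 1/2)

detailed_balance = all(pi[i] * p(i, i + 1) == pi[i + 1] * p(i + 1, i)
                       for i in range(N))
global_balance = all(pi[i] == sum(pi[j] * p(j, i) for j in range(N + 1))
                     for i in range(N + 1))
```

Detailed balance involves only pairs of adjacent states, which is why it is usually much quicker to verify than the global balance equations.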
Example 6.2.19: Random Walk on a Graph. Consider a finite non-oriented graph and denote by $E$ the set of vertices, or nodes, of this graph. Let $d_i$ be the index of vertex $i$ (the number of edges adjacent to $i$). Transform this graph into an oriented graph by splitting each edge into two oriented edges of opposite directions, and make it a transition graph by associating to the oriented edge from $i$ to $j$ the transition probability $\frac{1}{d_i}$ (see the figure below). It will be assumed, as is the case in the figure, that the graph is connected (in particular, $d_i > 0$ for all states $i$).

[Figure: A random walk on a graph.]

This chain is irreducible due to the connectedness of the graph and the fact that $p_{ij} > 0$ whenever $p_{ji} > 0$. It admits the stationary distribution
\[ \pi(i) = \frac{d_i}{2|\text{edges}|} \quad (i \in E), \]
where $|\text{edges}|$ is the number of edges of the original graph. This follows from Theorem 6.2.17. In fact, if $i$ and $j$ are connected in the graph, $p_{ij} = \frac{1}{d_i}$ and $p_{ji} = \frac{1}{d_j}$, and therefore the detailed balance equation between these two states is
\[ \pi(i)\,\frac{1}{d_i} = \pi(j)\,\frac{1}{d_j}. \]
This gives
\[ \pi(i) = K d_i \quad (i \in E), \]
where $K$ is obtained by normalization: $K = \big(\sum_{j\in E} d_j\big)^{-1}$. But $\sum_{j\in E} d_j = 2|\text{edges}|$.
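The stationary distribution of Example 6.2.19 can be checked on any small connected graph. The edge list below is a hypothetical example, not the graph of the figure.

```python
from fractions import Fraction as F

# Hypothetical connected graph on vertices 0..3, given by its undirected edges.
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3)]
VERTICES = range(4)

neighbors = {v: [] for v in VERTICES}
for a, b in EDGES:
    neighbors[a].append(b)
    neighbors[b].append(a)
deg = {v: len(neighbors[v]) for v in VERTICES}

def p(i, j):
    """Move from i to a uniformly chosen neighbor."""
    return F(1, deg[i]) if j in neighbors[i] else F(0)

pi = {v: F(deg[v], 2 * len(EDGES)) for v in VERTICES}   # d_i / (2 |edges|)

detailed_balance = all(pi[i] * p(i, j) == pi[j] * p(j, i)
                       for i in VERTICES for j in VERTICES)
stationary = all(pi[i] == sum(pi[j] * p(j, i) for j in VERTICES)
                 for i in VERTICES)
```

Both sides of the detailed balance equation equal $1/(2|\text{edges}|)$ on every edge, so the check does not depend on the shape of the graph.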
6.2.3 The Strong Markov Property
The Markov property relative to the past and future at a given instant $n$ can be extended to the situation where this deterministic time is replaced by an $\mathcal{F}_n^X$-stopping time (where $\mathcal{F}_n^X := \sigma(X_0, \ldots, X_n)$), whose definition we recall below.

Definition 6.2.20 Let $\{\mathcal{F}_n\}_{n\in\mathbb{N}}$ be a non-decreasing sequence of sub-σ-fields of $\mathcal{F}$. A random variable $\tau$ taking its values in $\overline{\mathbb{N}}$ and such that, for all $m \in \mathbb{N}$, the event $\{\tau = m\}$ is in $\mathcal{F}_m$ is called an $\mathcal{F}_n$-stopping time.

In other words, $\tau$ is an $\mathcal{F}_n^X$-stopping time if, for all $m \in \mathbb{N}$, the event $\{\tau = m\}$ can be expressed as
\[ 1_{\{\tau = m\}} = \psi_m(X_0, \ldots, X_m), \]
for some measurable function $\psi_m$ with values in $\{0, 1\}$ (Theorem 3.3.18).

Example 6.2.21: Fixed Times and Delayed Stopping Times. A constant time is a stopping time. If $\tau$ is a stopping time and $n_0$ a nonnegative deterministic time, then $\tau + n_0$ is a stopping time. Indeed, $\{\tau + n_0 = m\} \equiv \{\tau = m - n_0\}$ is expressible in terms of $X_0, X_1, \ldots, X_{m-n_0}$.

Example 6.2.22: Return Times and Hitting Times. In the theory of Markov chains, a typical and most important stopping time is the return time to state $i \in E$,
\[ T_i = \inf\{n \ge 1;\ X_n = i\}, \]
where $T_i = \infty$ if $X_n \neq i$ for all $n \ge 1$. It is a stopping time, as we shall soon prove. Observe that $T_i \ge 1$, and in particular, $X_0 = i$ does not imply $T_i = 0$. This is why $T_i$ is called the return time to $i$, and not the hitting time of $i$. The latter is $S_i = T_i$ if $X_0 \neq i$, and $S_i = 0$ if $X_0 = i$. It is also a stopping time. More generally, let $\tau_1 = T_i, \tau_2, \ldots$ be the successive return times to state $i$. If there are only $r$ returns to state $i$, let $\tau_{r+1} = \tau_{r+2} = \cdots = \infty$. These random times are stopping times with respect to $\{X_n\}_{n\ge 0}$, since for any $m \ge 1$,
\[ \{\tau_k = m\} \equiv \Big\{\sum_{n=1}^{m} 1_{\{X_n = i\}} = k,\ X_m = i\Big\} \]
is indeed expressible in terms of $X_0, \ldots, X_m$.
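The definitions of Example 6.2.22 translate directly into code operating on an observed trajectory. This is only a sketch: the function names and the sample `path` are hypothetical, and on a finite path the value `math.inf` merely means that no return has been observed yet.

```python
import math

def return_times(path, i):
    """Successive return times tau_1 < tau_2 < ... to i along a finite path x_0, ..., x_m."""
    return [n for n, x in enumerate(path) if n >= 1 and x == i]

def return_time(path, i):
    """T_i = inf{n >= 1 : X_n = i}; math.inf if no return is seen on this finite path."""
    times = return_times(path, i)
    return times[0] if times else math.inf

def hitting_time(path, i):
    """S_i = 0 if X_0 = i, and S_i = T_i otherwise."""
    return 0 if path[0] == i else return_time(path, i)

def tau_k_equals_m(path, i, k, m):
    """Indicator of {tau_k = m} for m >= 1; it reads only x_1, ..., x_m."""
    return sum(x == i for x in path[1:m + 1]) == k and path[m] == i

path = [2, 0, 1, 2, 2, 0, 2]      # a hypothetical trajectory started at state 2
```

The function `tau_k_equals_m` never looks beyond index `m`, which is the nonanticipative property required of a stopping time.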
Remark 6.2.23 For a given stopping time $\tau$, one can decide whether $\tau = m$ just by observing $X_0, X_1, \ldots, X_m$. This is why stopping times are said to be nonanticipative. The random time $\tau = \inf\{n \ge 0;\ X_{n+1} = i\}$, where $\tau = \infty$ if $X_{n+1} \neq i$ for all $n \ge 0$, is anticipative because
\[ \{\tau = m\} = \{X_1 \neq i, \ldots, X_m \neq i,\ X_{m+1} = i\} \]
for all $m \ge 0$. Knowledge of this random time provides information about the value of the process just after it. It is not a stopping time.

Let $\tau$ be a random time taking its values in $\mathbb{N}\cup\{+\infty\}$, and let $\{X_n\}_{n\ge 0}$ be a stochastic process with values in the countable set $E$. In order to define $X_\tau$ when $\tau = \infty$, one must decide how to define $X_\infty$. This is done by taking some element $\Delta$ not in $E$, and setting $X_\infty = \Delta$. By definition, the "process $\{X_n\}$ after $\tau$" is the stochastic process $\{X_{n+\tau}\}_{n\ge 0}$. The "process $\{X_n\}$ before $\tau$" is the process $\{X_{n\wedge(\tau-1)}\}_{n\ge 0}$, where by convention $X_{n\wedge(0-1)} = X_0$.

The main result of the present subsection is the strong Markov property. It says that the Markov property, that is, the independence of past and future given the present state, extends to the situation where the present time is a stopping time. More precisely:

Theorem 6.2.24 Let $\{X_n\}_{n\ge 0}$ be an hmc with countable state space $E$ and transition matrix $P$. If $\tau$ is an $\mathcal{F}_n^X$-stopping time, then given that $X_\tau = i \in E$ (in particular, $\tau < \infty$, since $i \neq \Delta$),
(α) the process after $\tau$ and the process before $\tau$ are independent, and
(β) the process after $\tau$ is an hmc with transition matrix $P$.

Proof. (α) By Theorem 3.1.39 it suffices to show that for all times $k \ge 1$, $n \ge 0$, and all states $i_0, \ldots, i_n, i, j_1, \ldots, j_k$,
\[ P(X_{\tau+1} = j_1, \ldots, X_{\tau+k} = j_k \mid X_\tau = i, X_{(\tau-1)\wedge 0} = i_0, \ldots, X_{(\tau-1)\wedge n} = i_n) = P(X_{\tau+1} = j_1, \ldots, X_{\tau+k} = j_k \mid X_\tau = i). \]
We shall prove a simplified version of the above equality, namely
\[ P(X_{\tau+k} = j \mid X_\tau = i, X_{(\tau-1)\wedge n} = i_n) = P(X_{\tau+k} = j \mid X_\tau = i) \tag{$\star$} \]
(the general case is obtained by the same arguments). The left-hand side of the above equality is equal to
\[ \frac{P(X_{\tau+k} = j, X_\tau = i, X_{(\tau-1)\wedge n} = i_n)}{P(X_\tau = i, X_{(\tau-1)\wedge n} = i_n)}. \]
The numerator can be expanded as
\[ \sum_{r\ge 0} P(\tau = r, X_{r+k} = j, X_r = i, X_{(r-1)\wedge n} = i_n). \tag{6.16} \]
But
\[ P(\tau = r, X_{r+k} = j, X_r = i, X_{(r-1)\wedge n} = i_n) = P(X_{r+k} = j \mid X_r = i, X_{(r-1)\wedge n} = i_n, \tau = r)\; P(\tau = r, X_{(r-1)\wedge n} = i_n, X_r = i), \]
and since $(r - 1)\wedge n \le r$ and $\{\tau = r\} \in \mathcal{F}_r^X$, the event $B = \{X_{(r-1)\wedge n} = i_n, \tau = r\}$ is in $\mathcal{F}_r^X$. Therefore, by the Markov property,
\[ P(X_{r+k} = j \mid X_r = i, X_{(r-1)\wedge n} = i_n, \tau = r) = P(X_{r+k} = j \mid X_r = i) = p_{ij}(k). \]
Finally, expression (6.16) reduces to
\[ \sum_{r\ge 0} p_{ij}(k)\,P(\tau = r, X_{(r-1)\wedge n} = i_n, X_r = i) = p_{ij}(k)\,P(X_\tau = i, X_{(\tau-1)\wedge n} = i_n). \]
Therefore, the left-hand side of ($\star$) is just $p_{ij}(k)$. Similar computations show that the right-hand side of ($\star$) is also $p_{ij}(k)$, so that (α) is proved.

(β) We must show that for all states $i, j, k, i_{n-1}, \ldots, i_1$,
\[ P(X_{\tau+n+1} = k \mid X_{\tau+n} = j, X_{\tau+n-1} = i_{n-1}, \ldots, X_\tau = i) = P(X_{\tau+n+1} = k \mid X_{\tau+n} = j) = p_{jk}. \]
But the first equality follows from the fact proved in (α) that for the stopping time $\tau' = \tau + n$, the processes before and after $\tau'$ are independent given $X_{\tau'} = j$. The second equality is obtained by the same calculations as in the proof of (α).
Regenerative Cycles

Consider a Markov chain with a state conventionally denoted by 0 such that $P_0(T_0 < \infty) = 1$. As a consequence of the strong Markov property, the chain starting from state 0 will return infinitely often to this state. Let $\tau_1 = T_0, \tau_2, \ldots$ be the successive return times to 0, and set $\tau_0 \equiv 0$. By the strong Markov property, for any $k \ge 1$, the process after $\tau_k$ is independent of the process before $\tau_k$ (observe that condition $X_{\tau_k} = 0$ is always satisfied), and the process after $\tau_k$ is a Markov chain with the same transition matrix as the original chain, and with initial state 0, by construction. Therefore, between the successive times of visit to 0, the pieces of the trajectory
\[ \{X_{\tau_k}, X_{\tau_k+1}, \ldots, X_{\tau_{k+1}-1}\} \quad (k \ge 0), \]
are independent and identically distributed. Such pieces are called the regenerative cycles of the chain between visits to state 0. Each random time $\tau_k$ is a regeneration time, in the sense that $\{X_{\tau_k+n}\}_{n\ge 0}$ is independent of the past $X_0, \ldots, X_{\tau_k-1}$ and has the same distribution as $\{X_n\}_{n\ge 0}$. In particular, the sequence $\{\tau_k - \tau_{k-1}\}_{k\ge 1}$ is iid.
6.3 Recurrence and Transience
Consider a Markov chain taking its values in E = N. There is a possibility that for any initial state i ∈ N the chain will never visit i after some ﬁnite random time. This is often an undesirable feature. For example, if the chain counts the number of customers waiting in line at a service counter, such a behavior implies that the waiting line will eventually grow beyond the limits of the waiting room, whatever its size. In a sense, the corresponding system is unstable. The good notion of stability for an irreducible hmc is that of positive recurrence, when any given state is visited inﬁnitely often and when, moreover, the average time between two successive visits to this state is ﬁnite.
6.3.1 Classification of States
Denote by
\[ N_i := \sum_{n\ge 1} 1_{\{X_n = i\}} \]
the number of visits to state $i$ strictly after time 0.

Theorem 6.3.1 The distribution of $N_i$ given $X_0 = j$ is
\[ P_j(N_i = r) = \begin{cases} f_{ji}\, f_{ii}^{\,r-1}\,(1 - f_{ii}) & \text{for } r \ge 1, \\ 1 - f_{ji} & \text{for } r = 0, \end{cases} \]
where $f_{ji} = P_j(T_i < \infty)$ and $T_i$ is the return time to $i$.

Proof. An informal proof goes like this: We first go from $j$ to $i$ (probability $f_{ji}$) and then, $r - 1$ times in succession, from $i$ to $i$ (each time with probability $f_{ii}$), and the last time, that is the $(r+1)$st time, we leave $i$ never to return to it (probability $1 - f_{ii}$). By the independent cycle property, all these "jumps" are independent, so that the successive probabilities multiply. Here is a formal proof if someone needs it. For $r = 0$, this is just the definition of $f_{ji}$. Now let $r \ge 1$, and suppose that $P_j(N_i = k) = f_{ji}\, f_{ii}^{\,k-1}\,(1 - f_{ii})$ is true for $k$ ($1 \le k \le r$). In particular,
\[ P_j(N_i > r) = f_{ji}\, f_{ii}^{\,r}. \]
Denoting by $\tau_r$ the $r$th return time to state $i$,
\[ P_j(N_i = r + 1) = P_j(N_i = r + 1, X_{\tau_{r+1}} = i) = P_j(\tau_{r+2} - \tau_{r+1} = \infty, X_{\tau_{r+1}} = i) = P_j(\tau_{r+2} - \tau_{r+1} = \infty \mid X_{\tau_{r+1}} = i)\,P_j(X_{\tau_{r+1}} = i). \]
But
\[ P_j(\tau_{r+2} - \tau_{r+1} = \infty \mid X_{\tau_{r+1}} = i) = 1 - f_{ii} \]
by the strong Markov property ($\tau_{r+2} - \tau_{r+1}$ is the return time to $i$ of the process after $\tau_{r+1}$). Also,
\[ P_j(X_{\tau_{r+1}} = i) = P_j(N_i > r). \]
Therefore,
\[ P_j(N_i = r + 1) = P_i(T_i = \infty)\,P_j(N_i > r) = (1 - f_{ii})\,f_{ji}\,f_{ii}^{\,r}. \]
The result then follows by induction.
Given $X_0 = i$, the distribution of $N_i$ is geometric: $P_i(N_i = r) = f_{ii}^{\,r}(1 - f_{ii})$ for $r \ge 0$. If $f_{ii} = 1$, $N_i$ is in fact almost surely equal to infinity, and in particular has an infinite mean. If $f_{ii} < 1$, however, it is almost surely finite and it has the finite mean $f_{ii}/(1 - f_{ii})$. From these remarks, we deduce that
\[ P_i(T_i < \infty) = 1 \iff P_i(N_i = \infty) = 1 \]
(in words: if starting from $i$ you almost surely return to $i$, then you will visit $i$ infinitely often) and
\[ P_i(T_i < \infty) < 1 \iff E_i[N_i] < \infty. \]
We collect the results just obtained for future reference.

Theorem 6.3.2 For any state $i \in E$,
\[ P_i(T_i < \infty) = 1 \iff P_i(N_i = \infty) = 1, \]
and
\[ P_i(T_i < \infty) < 1 \iff P_i(N_i = \infty) = 0 \iff E_i[N_i] < \infty. \]
In particular, the event $\{N_i = \infty\}$ has $P_i$-probability 0 or 1.

We are now ready for the basic definitions concerning recurrence. First recall that $T_i$ denotes the return time to state $i$.

Definition 6.3.3 State $i \in E$ is called recurrent if
\[ P_i(T_i < \infty) = 1, \]
and otherwise it is called transient. A recurrent state $i \in E$ is called positive recurrent if
\[ E_i[T_i] < \infty, \]
and otherwise it is called null recurrent.
The Potential Matrix Criterion of Recurrence

In general, it is not easy to check whether a given state is transient or recurrent. One of the goals of the theory of Markov chains is to provide criteria of recurrence. Sometimes, one is happy with just a sufficient condition. The problem of finding useful (easy to check) conditions of recurrence is an active area of research. However, the theory has a few conditions that qualify as useful and are applicable to many practical situations. Although the next criterion is of theoretical rather than practical interest, it can be helpful in a few situations, for instance in the study of recurrence of random walks (Example 6.3.5).

The potential matrix $G$ associated with the transition matrix $P$ is defined by
\[ G := \sum_{n\ge 0} P^n. \]
Its general term
\[ g_{ij} = \sum_{n=0}^{\infty} p_{ij}(n) = \sum_{n=0}^{\infty} P_i(X_n = j) = \sum_{n=0}^{\infty} E_i[1_{\{X_n = j\}}] = E_i\Big[\sum_{n=0}^{\infty} 1_{\{X_n = j\}}\Big] \]
is the average number of visits to state $j$, given that the chain starts from state $i$.

Theorem 6.3.4 State $i \in E$ is recurrent if and only if
\[ \sum_{n=0}^{\infty} p_{ii}(n) = \infty. \]

Proof. This merely rephrases Theorem 6.3.2 since
\[ \sum_{n\ge 1} p_{ii}(n) = E_i[N_i]. \]
In fact,
\[ E_i[N_i] = E_i\Big[\sum_{n\ge 1} 1_{\{X_n = i\}}\Big] = \sum_{n\ge 1} E_i[1_{\{X_n = i\}}] = \sum_{n\ge 1} P_i(X_n = i) = \sum_{n\ge 1} p_{ii}(n). \]
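The criterion of Theorem 6.3.4 can be illustrated on a chain where both behaviors are visible. In the hypothetical two-state chain below, state 0 is transient with $f_{00} = a$ and state 1 is absorbing, hence recurrent; the names `partial_sum_pii` and `a` are illustrative.

```python
def partial_sum_pii(P, i, n_max):
    """sum_{n=1}^{n_max} p_ii(n), obtained by propagating the row vector delta_i^T P^n."""
    size = len(P)
    row = [1.0 if j == i else 0.0 for j in range(size)]
    total = 0.0
    for _ in range(n_max):
        row = [sum(row[k] * P[k][j] for k in range(size)) for j in range(size)]
        total += row[i]
    return total

a = 0.6                               # hypothetical: f_00 = a for the chain below
P = [[a, 1 - a],                      # state 0 is left for good with probability 1 - a
     [0.0, 1.0]]                      # state 1 is absorbing

# Transient state: the partial sums approach a / (1 - a) = E_0[N_0].
transient_sum = partial_sum_pii(P, 0, 200)
# Recurrent state: p_11(n) = 1 for all n, so the partial sums grow without bound.
recurrent_sums = [partial_sum_pii(P, 1, n) for n in (10, 100)]
```

For the transient state the limit $a/(1-a)$ is the mean of the geometric distribution of $N_0$ given $X_0 = 0$, in agreement with Theorem 6.3.1.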
Example 6.3.5: Random Walks on $\mathbb{Z}$. The corresponding Markov chain was described in Example 6.1.6. The nonzero terms of its transition matrix are
\[ p_{i,i+1} = p, \qquad p_{i,i-1} = 1 - p, \]
where $p \in (0, 1)$. We shall study the nature (recurrent or transient) of any one of its states, say, 0. We have $p_{00}(2n + 1) = 0$ and
\[ p_{00}(2n) = \frac{(2n)!}{n!\,n!}\,p^n (1 - p)^n. \]
By Stirling's equivalence formula $n! \sim (n/e)^n \sqrt{2\pi n}$, the above quantity is equivalent to
\[ \frac{[4p(1 - p)]^n}{\sqrt{\pi n}}, \tag{6.17} \]
and the nature of the series $\sum_{n=0}^{\infty} p_{00}(n)$ (convergent or divergent) is that of the series with general term (6.17). If $p \neq \frac12$, in which case $4p(1 - p) < 1$, the latter series converges. And if $p = \frac12$, in which case $4p(1 - p) = 1$, it diverges. In summary, the states of the random walk on $\mathbb{Z}$ are transient if $p \neq \frac12$, recurrent if $p = \frac12$.

Example 6.3.6: Returns to zero of the symmetric random walk. Consider the symmetric ($p = \frac12$) 1-D random walk. Let $\tau_1 = T_0, \tau_2, \ldots$ be the successive return times to state 0. We just learnt in the previous example that $P_0(T_0 < \infty) = 1$. We will compute the generating function of $T_0$ given $X_0 = 0$, and show that the expected return time to 0 is infinite (and therefore the symmetric random walk on $\mathbb{Z}$ is null recurrent).
Observe that for $n \ge 1$,
\[ P_0(X_{2n} = 0) = \sum_{k\ge 1} P_0(\tau_k = 2n), \]
and therefore, for all $z \in \mathbb{C}$ such that $|z| < 1$,
\[ \sum_{n\ge 1} P_0(X_{2n} = 0)\,z^{2n} = \sum_{k\ge 1}\sum_{n\ge 1} P_0(\tau_k = 2n)\,z^{2n} = \sum_{k\ge 1} E_0[z^{\tau_k}]. \]
But $\tau_k = \tau_1 + (\tau_2 - \tau_1) + \cdots + (\tau_k - \tau_{k-1})$ and therefore, in view of the iid property of the regenerative cycles, and since $\tau_1 = T_0$,
\[ E_0[z^{\tau_k}] = (E_0[z^{T_0}])^k. \]
In particular,
\[ \sum_{n\ge 0} P_0(X_{2n} = 0)\,z^{2n} = \frac{1}{1 - E_0[z^{T_0}]} \]
(note that the latter sum includes the term for $n = 0$, that is, 1). Direct evaluation of the left-hand side yields
\[ \sum_{n\ge 0} \frac{1}{2^{2n}}\,\frac{(2n)!}{n!\,n!}\,z^{2n} = \frac{1}{\sqrt{1 - z^2}}. \]
Therefore, the generating function of the return time to 0 given $X_0 = 0$ is
\[ E_0[z^{T_0}] = 1 - \sqrt{1 - z^2}. \]
Its first derivative $\frac{z}{\sqrt{1 - z^2}}$ tends to $\infty$ as $z \to 1$ from below via real values. Therefore, by Abel's theorem,
\[ E_0[T_0] = \infty. \]
We see that although the return time to state 0 is almost surely finite, it has an infinite expectation.
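Both examples on the symmetric walk on $\mathbb{Z}$ lend themselves to a quick exact check. The relation $\sum_{n\ge 0} P_0(X_{2n}=0)z^{2n} = 1/(1 - E_0[z^{T_0}])$ is equivalent to the convolution identity $u_n = \sum_{k=1}^{n} f_k u_{n-k}$ between $u_n = P_0(X_{2n} = 0)$ and $f_n = P_0(T_0 = 2n)$. The sketch below (function names are illustrative) solves this identity for $f_n$ and compares the result with the power-series coefficients of $1 - \sqrt{1 - z^2}$, which are $\binom{2n}{n}/((2n-1)4^n)$.

```python
from fractions import Fraction as F
from math import comb, pi, sqrt

def u(n):
    """u_n = P_0(X_{2n} = 0) for the symmetric walk on Z."""
    return F(comb(2 * n, n), 4 ** n)

def first_return_probs(n_max):
    """f_n = P_0(T_0 = 2n), n = 1..n_max, from u_n = sum_{k=1}^{n} f_k u_{n-k}."""
    f = [F(0)]                                   # f[0] is a placeholder
    for n in range(1, n_max + 1):
        f.append(u(n) - sum(f[k] * u(n - k) for k in range(1, n)))
    return f

f = first_return_probs(40)

# Coefficients of z^{2n} in 1 - sqrt(1 - z^2): C(2n, n) / ((2n - 1) 4^n).
matches_series = all(f[n] == F(comb(2 * n, n), (2 * n - 1) * 4 ** n)
                     for n in range(1, 41))

# Ratio of p_00(2n) to the equivalent 1 / sqrt(pi n) of (6.17), for p = 1/2.
stirling_ratio = float(u(2000)) * sqrt(pi * 2000)
```

The probabilities $f_n$ decay like $n^{-3/2}$, so they sum to 1 while the terms $2n\,f_n$ decay only like $n^{-1/2}$ and their sum diverges, which is another way to see that $E_0[T_0] = \infty$.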
Example 6.3.7: 3D symmetric random walk. The state space of this Markov chain is $E = \mathbb{Z}^3$. Denoting by $e_1$, $e_2$, and $e_3$ the canonical basis vectors of $\mathbb{R}^3$ (respectively $(1, 0, 0)$, $(0, 1, 0)$, and $(0, 0, 1)$), the nonnull terms of the transition matrix of the 3D symmetric random walk are given by
\[ p_{x, x \pm e_i} = \frac16. \]
We elucidate the nature of state, say, $0 = (0, 0, 0)$. Clearly, $p_{00}(2n + 1) = 0$ for all $n \geq 0$, and (exercise)
\[ p_{00}(2n) = \sum_{0 \leq i + j \leq n} \frac{(2n)!}{(i!\,j!\,(n - i - j)!)^2} \left(\frac16\right)^{2n}. \]
This can be rewritten as
\[ p_{00}(2n) = \sum_{0 \leq i + j \leq n} \frac{1}{2^{2n}} \binom{2n}{n} \left(\frac{n!}{i!\,j!\,(n - i - j)!}\right)^2 \left(\frac13\right)^{2n}. \]
Using the trinomial formula
\[ \sum_{0 \leq i + j \leq n} \frac{n!}{i!\,j!\,(n - i - j)!} \left(\frac13\right)^{n} = 1, \]
we obtain the bound
\[ p_{00}(2n) \leq K_n \frac{1}{2^{2n}} \binom{2n}{n} \left(\frac13\right)^{n}, \]
where
\[ K_n = \max_{0 \leq i + j \leq n} \frac{n!}{i!\,j!\,(n - i - j)!}. \]
For large values of $n$, $K_n$ is bounded as follows. Let $i_0$ and $j_0$ be the values of $i$, $j$ that maximize $n!/(i!\,j!\,(n - i - j)!)$ in the domain of interest $0 \leq i + j \leq n$. From the definition of $i_0$ and $j_0$, the quantities
\[ \frac{n!}{(i_0 - 1)!\,j_0!\,(n - i_0 - j_0 + 1)!}, \quad \frac{n!}{(i_0 + 1)!\,j_0!\,(n - i_0 - j_0 - 1)!}, \quad \frac{n!}{i_0!\,(j_0 - 1)!\,(n - i_0 - j_0 + 1)!}, \quad \frac{n!}{i_0!\,(j_0 + 1)!\,(n - i_0 - j_0 - 1)!} \]
are bounded by $\frac{n!}{i_0!\,j_0!\,(n - i_0 - j_0)!}$. The corresponding inequalities reduce to
\[ n - i_0 - 1 \leq 2 j_0 \leq n - i_0 + 1 \quad \text{and} \quad n - j_0 - 1 \leq 2 i_0 \leq n - j_0 + 1, \]
and this shows that for large $n$, $i_0 \sim n/3$ and $j_0 \sim n/3$. Therefore, for large $n$, the upper bound on $p_{00}(2n)$ is equivalent to
\[ \frac{n!}{((n/3)!)^3\, 2^{2n}\, 3^{n}} \binom{2n}{n}. \]
By Stirling's equivalence formula, this quantity is in turn equivalent to $\frac{3\sqrt{3}}{2(\pi n)^{3/2}}$, the general term of a convergent series. State 0 is therefore transient.

A theoretical application of the potential matrix criterion is to the proof that recurrence is a (communication) class property.

Theorem 6.3.8 If $i$ and $j$ communicate, they are either both recurrent or both transient.
Proof. States $i$ and $j$ communicate if and only if there exist integers $M$ and $N$ such that $p_{ij}(M) > 0$, $p_{ji}(N) > 0$. Going from $i$ to $j$ in $M$ steps, then from $j$ to $j$ in $n$ steps, then from $j$ to $i$ in $N$ steps, is just one way of going from $i$ back to $i$ in $M + n + N$ steps. Therefore, $p_{ii}(M + n + N) \geq p_{ij}(M) p_{jj}(n) p_{ji}(N)$. Similarly, $p_{jj}(N + n + M) \geq p_{ji}(N) p_{ii}(n) p_{ij}(M)$. Therefore, writing $\alpha = p_{ij}(M) p_{ji}(N)$ (a strictly positive quantity), we have $p_{ii}(M + N + n) \geq \alpha p_{jj}(n)$ and $p_{jj}(M + N + n) \geq \alpha p_{ii}(n)$. This implies that the series $\sum_{n=0}^{\infty} p_{ii}(n)$ and $\sum_{n=0}^{\infty} p_{jj}(n)$ either both converge or both diverge. Theorem 6.3.4 concludes the proof.
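Returning to the 3D symmetric random walk of Example 6.3.7, the following sketch (hypothetical helper names, exact integer arithmetic followed by one floating-point division) evaluates the exact return probability $p_{00}(2n)$ from the trinomial sum and compares it with the bound $K_n 2^{-2n} \binom{2n}{n} 3^{-n}$. It also compares that bound with the Stirling equivalent $3\sqrt{3}/(2(\pi n)^{3/2})$ for one large $n$ divisible by 3.

```python
from math import comb, factorial, pi, sqrt


def p00_3d(n):
    """Exact P(X_{2n} = 0) for the 3D symmetric walk (trinomial sum)."""
    total = 0
    for i in range(n + 1):
        for j in range(n - i + 1):
            k = n - i - j
            denom = (factorial(i) * factorial(j) * factorial(k)) ** 2
            total += factorial(2 * n) // denom  # a multinomial coefficient
    return total / 6 ** (2 * n)


def bound_3d(n):
    """Upper bound K_n * 2^{-2n} * C(2n, n) * 3^{-n} on p00(2n)."""
    k_n = max(
        factorial(n) // (factorial(i) * factorial(j) * factorial(n - i - j))
        for i in range(n + 1)
        for j in range(n - i + 1)
    )
    return k_n * comb(2 * n, n) / (4 ** n * 3 ** n)


def stirling_equivalent(n):
    """The equivalent 3*sqrt(3) / (2 * (pi*n)^{3/2}) of the bound."""
    return 3 * sqrt(3) / (2 * (pi * n) ** 1.5)
```

Since the bound decays like $n^{-3/2}$, the series $\sum_n p_{00}(2n)$ converges, which is the transience statement of the example.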
6.3.2
The Stationary Distribution Criterion
The notion of invariant measure extends the notion of stationary distribution and plays a technical role in the recurrence theory of Markov chains.

Definition 6.3.9 A nontrivial (that is, nonnull) vector $x = \{x_i\}_{i \in E}$ of nonnegative real numbers is called an invariant measure of the stochastic matrix $\mathbf{P} = \{p_{ij}\}_{i,j \in E}$ if for all $i \in E$,
\[ x_i = \sum_{j \in E} x_j p_{ji}. \tag{6.18} \]
(In abbreviated notation, $0 \leq x < \infty$ and $x^T \mathbf{P} = x^T$.)

Theorem 6.3.10 Let $\mathbf{P}$ be the transition matrix of an irreducible recurrent hmc $\{X_n\}_{n \geq 0}$. Let 0 be a state and let $T_0$ be the return time to 0. Define for all $i \in E$
\[ x_i = E_0\left[\sum_{n \geq 1} 1_{\{X_n = i\}} 1_{\{n \leq T_0\}}\right] \tag{6.19} \]
(for $i \neq 0$, $x_i$ is therefore the expected number of visits to state $i$ before returning to 0). Then, for all $i \in E$,
\[ x_i \in (0, \infty), \tag{6.20} \]
and $x$ is an invariant measure of $\mathbf{P}$.

Proof. We make two preliminary observations. First, when $1 \leq n \leq T_0$, $X_n = 0$ if and only if $n = T_0$. Therefore, $x_0 = 1$. Also,
\[ \sum_{i \in E} \sum_{n \geq 1} 1_{\{X_n = i\}} 1_{\{n \leq T_0\}} = \sum_{n \geq 1} \left(\sum_{i \in E} 1_{\{X_n = i\}}\right) 1_{\{n \leq T_0\}} = \sum_{n \geq 1} 1_{\{n \leq T_0\}} = T_0, \]
and therefore
\[ \sum_{i \in E} x_i = E_0[T_0]. \tag{6.21} \]
We now introduce the so-called taboo transition probability
\[ {}_0p_{0i}(n) := E_0[1_{\{X_n = i\}} 1_{\{n \leq T_0\}}] = P_0(X_1 \neq 0, \ldots, X_{n-1} \neq 0, X_n = i), \]
the probability, starting from state 0, of visiting $i$ at time $n$ before returning to 0 (the "taboo" state). From the definition of $x$,
\[ x_i = \sum_{n \geq 1} {}_0p_{0i}(n). \tag{6.22} \]
We first prove (6.18). Observe that ${}_0p_{0i}(1) = p_{0i}$ and (first-step analysis) for all $n \geq 2$,
\[ {}_0p_{0i}(n) = \sum_{j \neq 0} {}_0p_{0j}(n - 1) p_{ji}. \tag{6.23} \]
Summing up all the above equalities, and taking (6.22) into account, we obtain
\[ x_i = p_{0i} + \sum_{j \neq 0} x_j p_{ji}, \]
that is, (6.18), since $x_0 = 1$.

Next we show that $x_i > 0$ for all $i \in E$. Indeed, iterating (6.18), we find $x^T = x^T \mathbf{P}^n$, that is, since $x_0 = 1$,
\[ x_i = \sum_{j \in E} x_j p_{ji}(n) = p_{0i}(n) + \sum_{j \neq 0} x_j p_{ji}(n). \]
If $x_i$ were null for some $i \in E$, $i \neq 0$, the latter equality would imply that $p_{0i}(n) = 0$ for all $n \geq 0$, which means that 0 and $i$ do not communicate, in contradiction to the irreducibility assumption.

It remains to show that $x_i < \infty$ for all $i \in E$. As before, we find that
\[ 1 = x_0 = \sum_{j \in E} x_j p_{j0}(n) \]
for all $n \geq 1$, and therefore if $x_i = \infty$ for some $i$, necessarily $p_{i0}(n) = 0$ for all $n \geq 1$, and this also contradicts irreducibility.

Theorem 6.3.11 The invariant measure of an irreducible recurrent stochastic matrix is unique up to a multiplicative factor.

Proof. In the proof of Theorem 6.3.10, we showed that for an invariant measure $y$ of an irreducible chain, $y_i > 0$ for all $i \in E$, and therefore, one can define, for all $i, j \in E$, the matrix $\mathbf{Q}$ by
\[ q_{ji} = \frac{y_i}{y_j} p_{ij}. \tag{6.24} \]
It is a transition matrix, since $\sum_{i \in E} q_{ji} = \frac{1}{y_j} \sum_{i \in E} y_i p_{ij} = \frac{y_j}{y_j} = 1$. The general term of $\mathbf{Q}^n$ is
\[ q_{ji}(n) = \frac{y_i}{y_j} p_{ij}(n). \tag{6.25} \]
Indeed, supposing (6.25) true for $n$,
\[ q_{ji}(n + 1) = \sum_{k \in E} q_{jk} q_{ki}(n) = \sum_{k \in E} \frac{y_k}{y_j} p_{kj} \frac{y_i}{y_k} p_{ik}(n) = \frac{y_i}{y_j} \sum_{k \in E} p_{ik}(n) p_{kj} = \frac{y_i}{y_j} p_{ij}(n + 1), \]
and (6.25) follows by induction. Clearly, $\mathbf{Q}$ is irreducible, since $\mathbf{P}$ is irreducible (just observe that in view of (6.25) $q_{ji}(n) > 0$ if and only if $p_{ij}(n) > 0$). Also, $p_{ii}(n) = q_{ii}(n)$, and therefore $\sum_{n \geq 0} q_{ii}(n) = \sum_{n \geq 0} p_{ii}(n)$, which ensures that $\mathbf{Q}$ is recurrent (potential matrix criterion). Call $g_{ji}(n)$ the probability, relative to the chain governed by the transition matrix $\mathbf{Q}$, of returning to state $i$ for the first time at step $n$ when starting from $j$. First-step analysis gives
\[ g_{i0}(n + 1) = \sum_{j \neq 0} q_{ij} g_{j0}(n), \tag{6.26} \]
that is, using (6.24),
\[ y_i g_{i0}(n + 1) = \sum_{j \neq 0} (y_j g_{j0}(n)) p_{ji}. \]
Recall that ${}_0p_{0i}(n + 1) = \sum_{j \neq 0} {}_0p_{0j}(n) p_{ji}$, or, equivalently,
\[ y_0\, {}_0p_{0i}(n + 1) = \sum_{j \neq 0} (y_0\, {}_0p_{0j}(n)) p_{ji}. \]
We therefore see that the sequences $\{y_0\, {}_0p_{0i}(n)\}$ and $\{y_i g_{i0}(n)\}$ satisfy the same recurrence equation. Their first terms ($n = 1$), respectively $y_0\, {}_0p_{0i}(1) = y_0 p_{0i}$ and $y_i g_{i0}(1) = y_i q_{i0}$, are equal in view of (6.24). Therefore, for all $n \geq 1$,
\[ {}_0p_{0i}(n) = \frac{y_i}{y_0} g_{i0}(n). \]
Summing with respect to $n \geq 1$ and using $\sum_{n \geq 1} g_{i0}(n) = 1$ ($\mathbf{Q}$ is recurrent), we obtain the announced result $x_i = \frac{y_i}{y_0}$.
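Theorem 6.3.10 can be illustrated on a small finite chain. The sketch below is an assumption-laden illustration, not the author's method: it truncates the series (6.22) after a fixed number of terms, builds the taboo probabilities ${}_0p_{0i}(n)$ with the recursion (6.23), and returns the resulting vector $x$. The function name and the truncation level are hypothetical choices.

```python
def excursion_measure(P, n_terms=2000):
    """Approximate x_i = sum_{n>=1} 0p0i(n) for a finite transition matrix P.

    P is a list of rows; state 0 plays the role of the taboo state.
    The series is truncated after n_terms terms (an assumption that is
    harmless when the taboo probabilities decay geometrically).
    """
    m = len(P)
    taboo = list(P[0])  # 0p0i(1) = p0i
    x = list(taboo)
    for _ in range(n_terms - 1):
        # (6.23): 0p0i(n) = sum over j != 0 of 0p0j(n-1) * p_ji
        taboo = [sum(taboo[j] * P[j][i] for j in range(1, m)) for i in range(m)]
        x = [xi + ti for xi, ti in zip(x, taboo)]
    return x


def apply_left(x, P):
    """Row vector times matrix: (x^T P)_i = sum_j x_j p_ji."""
    m = len(P)
    return [sum(x[j] * P[j][i] for j in range(m)) for i in range(m)]
```

For the three-state matrix with rows (1/2, 1/2, 0), (1/4, 1/2, 1/4), (0, 1/2, 1/2), one expects $x = (1, 2, 1)$, so that $\sum_i x_i = 4 = E_0[T_0]$, in agreement with (6.21).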
Equality (6.21) and the definition of positive recurrence give the following result:

Theorem 6.3.12 An irreducible recurrent hmc is positive recurrent if and only if its invariant measures $x$ satisfy
\[ \sum_{i \in E} x_i < \infty. \tag{6.27} \]

Remark 6.3.13 An hmc may well be irreducible and possess an invariant measure, and yet not be recurrent. The simplest example is the one-dimensional nonsymmetric random walk, which was shown to be transient and yet admits $x_i \equiv 1$ as an invariant measure. It turns out, however, that the existence of a stationary probability distribution is necessary and sufficient for an irreducible chain (not a priori assumed recurrent) to be positive recurrent.

Theorem 6.3.14 An irreducible homogeneous Markov chain is positive recurrent if and only if there exists a stationary distribution. Moreover, the stationary distribution $\pi$ is, when it exists, unique, and $\pi > 0$.

Proof. The direct part follows from Theorems 6.3.10 and 6.3.12. For the converse part, assume the existence of a stationary distribution $\pi$. Iterating $\pi^T = \pi^T \mathbf{P}$, we obtain $\pi^T = \pi^T \mathbf{P}^n$, that is, for all $i \in E$,
\[ \pi(i) = \sum_{j \in E} \pi(j) p_{ji}(n). \]
If the chain were transient, then, for all states $i, j$,
\[ \lim_{n \uparrow \infty} p_{ji}(n) = 0. \]
Indeed $p_{ji}(n) = E_j[1_{\{X_n = i\}}]$, $\lim_{n \uparrow \infty} 1_{\{X_n = i\}} = 0$ (state $i$ is transient, hence visited only finitely often), and $1_{\{X_n = i\}} \leq 1$, so that, by dominated convergence, $\lim_{n \uparrow \infty} E_j[1_{\{X_n = i\}}] = 0$. Since $p_{ji}(n)$ is bounded by 1 uniformly in $j$ and $n$, we have by dominated convergence
\[ \pi(i) = \lim_{n \uparrow \infty} \sum_{j \in E} \pi(j) p_{ji}(n) = \sum_{j \in E} \pi(j) \lim_{n \uparrow \infty} p_{ji}(n) = 0. \]
This contradicts the assumption that $\pi$ is a stationary distribution ($\sum_{i \in E} \pi(i) = 1$). The chain must therefore be recurrent, and by Theorem 6.3.12, it is positive recurrent.

The stationary distribution $\pi$ of an irreducible positive recurrent chain is unique (use Theorem 6.3.11 and the fact that there is no choice for a multiplicative factor but 1). Also recall that $\pi(i) > 0$ for all $i \in E$ (see Theorem 6.3.10).

Theorem 6.3.15 Let $\pi$ be the unique stationary distribution of an irreducible positive recurrent chain, and let $T_i$ be the return time to state $i$. Then
\[ \pi(i) E_i[T_i] = 1. \tag{6.28} \]
Proof. This equality is a direct consequence of expression (6.19) for the invariant measure. Indeed, $\pi$ is obtained by normalization of $x$: for all $i \in E$,
\[ \pi(i) = \frac{x_i}{\sum_{j \in E} x_j}, \]
and in particular, for $i = 0$, recalling that $x_0 = 1$ and using (6.21),
\[ \pi(0) = \frac{x_0}{\sum_{j \in E} x_j} = \frac{1}{E_0[T_0]}. \]
Since state 0 does not play a special role in the analysis, (6.28) is true for all $i \in E$.

The situation is extremely simple when the state space is finite.

Theorem 6.3.16 An irreducible hmc with finite state space is positive recurrent.

Proof. We first show recurrence. If the chain were transient, then, for all $i, j \in E$,
\[ \lim_{n \uparrow \infty} p_{ij}(n) = 0 \]
(see the argument in the proof of Theorem 6.3.14), and therefore, since the state space is finite,
\[ \lim_{n \uparrow \infty} \sum_{j \in E} p_{ij}(n) = 0. \]
But for all $n \geq 0$,
\[ \sum_{j \in E} p_{ij}(n) = 1, \]
a contradiction. Therefore, the chain is recurrent. By Theorem 6.3.10 it has an invariant measure $x$. Since $E$ is finite, $\sum_{i \in E} x_i < \infty$, and therefore the chain is positive recurrent, by Theorem 6.3.12.

Example 6.3.17: A Random Walk on Z Reflected at 0. This chain has the state space $E = \mathbb{N}$ and the transition graph of the figure below. It is assumed that $p_i$ (and therefore $q_i = 1 - p_i$) are in the open interval $(0, 1)$ for all $i \geq 1$, so that the chain is irreducible.
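[Figure: transition graph on the states 0, 1, 2, ..., i, i + 1, ..., with upward transition probabilities p_0 = 1, p_1, ..., p_i and downward transition probabilities q_1, q_2, ..., q_{i+1}. Caption: Reflected random walk]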
The invariant measure equation $x^T = x^T \mathbf{P}$ takes in this case the form
\[ x_0 = x_1 q_1, \qquad x_i = x_{i-1} p_{i-1} + x_{i+1} q_{i+1}, \quad i \geq 1, \]
with $p_0 = 1$. The general solution is, for $i \geq 1$,
\[ x_i = x_0 \frac{p_0 \cdots p_{i-1}}{q_1 \cdots q_i}. \]
The positive recurrence condition $\sum_{i \in E} x_i < \infty$ is
\[ 1 + \sum_{i \geq 1} \frac{p_0 \cdots p_{i-1}}{q_1 \cdots q_i} < \infty, \]
and if it is satisfied, the stationary distribution $\pi$ is obtained by normalization of the general solution. This gives
\[ \pi(0) = \left(1 + \sum_{i \geq 1} \frac{p_0 \cdots p_{i-1}}{q_1 \cdots q_i}\right)^{-1} \]
and for $i \geq 1$,
\[ \pi(i) = \pi(0) \frac{p_0 \cdots p_{i-1}}{q_1 \cdots q_i}. \]
In the special case where $p_i = p$, $q_i = q = 1 - p$ (for $i \geq 1$), the positive recurrence condition becomes $1 + \frac1q \sum_{j \geq 0} \left(\frac{p}{q}\right)^j < \infty$, that is to say $p < q$, or equivalently, $p < \frac12$.
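For the special case $p_i = p < \frac12$ of Example 6.3.17, summing the geometric series gives $\pi(0) = (q - p)/(q - p + 1)$ and $\pi(i) = \pi(0)\, q^{-1} (p/q)^{i-1}$ for $i \geq 1$. The sketch below (the function name is a hypothetical choice, and exact rationals are used to avoid rounding) evaluates this distribution so that the balance equations of the example can be checked term by term.

```python
from fractions import Fraction


def reflected_walk_pi(p, i):
    """Stationary probability pi(i) of the reflected walk with p_0 = 1 and
    p_i = p, q_i = 1 - p for i >= 1, assuming p < 1/2."""
    q = 1 - p
    pi0 = (q - p) / (q - p + 1)
    if i == 0:
        return pi0
    return pi0 * (p / q) ** (i - 1) / q
```

With $p = 1/3$ this gives $\pi(0) = 1/4$, $\pi(1) = 3/8$, $\pi(2) = 3/16$, and the probabilities decay geometrically with ratio $p/q = 1/2$.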
Birth-and-death Markov Chains

Birth-and-death process models are omnipresent in operations research and, of course, in biology. We first define the birth-and-death process with a bounded population. The state space of such a chain is $E = \{0, 1, \ldots, N\}$ and its transition matrix is
\[ \mathbf{P} = \begin{pmatrix}
r_0 & p_0 & & & & & & \\
q_1 & r_1 & p_1 & & & & & \\
 & q_2 & r_2 & p_2 & & & & \\
 & & \ddots & \ddots & \ddots & & & \\
 & & & q_i & r_i & p_i & & \\
 & & & & \ddots & \ddots & \ddots & \\
 & & & & & q_{N-1} & r_{N-1} & p_{N-1} \\
 & & & & & & q_N & r_N
\end{pmatrix}, \]
where $p_i > 0$ for all $i \in E \setminus \{N\}$, $q_i > 0$ for all $i \in E \setminus \{0\}$, $r_i \geq 0$ for all $i \in E$, and $p_i + q_i + r_i = 1$ for all $i \in E$ (with the conventions $q_0 = 0$ and $p_N = 0$). The positivity conditions placed on the $p_i$'s and $q_i$'s guarantee that the chain is irreducible. Since the state space is finite, it is positive recurrent (Theorem 6.3.16), and it has a unique stationary distribution. Motivated by the Ehrenfest hmc, which is reversible in the stationary state, we make the educated guess that the birth-and-death process considered has the same property. This will be the case if and only if there exists a probability distribution $\pi$ on $E$ satisfying the detailed balance equations, that is, such that
\[ \pi(i - 1) p_{i-1} = \pi(i) q_i \quad (1 \leq i \leq N). \]
Letting $w_0 = 1$ and
\[ w_i = \prod_{k=1}^{i} \frac{p_{k-1}}{q_k} \quad (1 \leq i \leq N), \]
we find that
\[ \pi(i) = \frac{w_i}{\sum_{j=0}^{N} w_j} \quad (0 \leq i \leq N) \tag{6.29} \]
indeed satisfies the detailed balance equations and is therefore the (unique) stationary distribution of the chain.

We now treat the unbounded birth-and-death process with state space $E = \mathbb{N}$ and transition matrix as in the previous example (except that the state space is now "unbounded on the right"). We assume that the $p_i$'s and $q_i$'s are positive in order to guarantee irreducibility. The same reversibility argument as above applies with a little difference. In fact we can show that the $w_i$'s defined above satisfy the detailed balance equations and therefore the global balance equations. Therefore the vector $\{w_i\}_{i \in E}$ is the unique, up to a multiplicative factor, invariant measure of the chain. It can be normalized to a probability distribution if and only if
\[ \sum_{j=0}^{\infty} w_j < \infty. \]
Therefore in this case, and only in this case, there exists a (unique) stationary distribution, also given by (6.29), with the sum in the denominator now extending over all $j \geq 0$.

Note that the stationary distribution, when it exists, does not depend on the $r_i$'s. The recurrence properties of the above unbounded birth-and-death process are therefore the same as those of the chain below, which is however not aperiodic. For aperiodicity of the original chain, it suffices to assume at least one of the $r_i$'s to be positive (Remark 6.2.5).
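[Figure: transition graph on the states 0, 1, 2, ..., i − 1, i, i + 1, ..., with upward transition probabilities p_0 = 1, p_1, p_2, ..., p_{i−1}, p_i and downward transition probabilities q_1, q_2, q_3, ..., q_i, q_{i+1}.]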
We now compute for the (bounded or unbounded) irreducible birth-and-death process the average time it takes to reach a state $b$ from a state $a < b$. In fact, we shall prove that
\[ E_a[T_b] = \sum_{k=a+1}^{b} \frac{1}{q_k w_k} \sum_{j=0}^{k-1} w_j. \tag{6.30} \]
Since obviously $E_a[T_b] = \sum_{k=a+1}^{b} E_{k-1}[T_k]$, it suffices to prove that
\[ E_{k-1}[T_k] = \frac{1}{q_k w_k} \sum_{j=0}^{k-1} w_j. \tag{$\star$} \]
For this, consider for any given $k \in \{0, 1, \ldots, N\}$ the truncated chain which moves on the state space $\{0, 1, \ldots, k\}$ as the original chain, except in state $k$ where it moves one step down with probability $q_k$ and stays still with probability $p_k + r_k$. Use $\tilde{E}$ to symbolize expectations with respect to the modified chain. The unique stationary distribution of this chain is
\[ \tilde{\pi}_\ell = \frac{w_\ell}{\sum_{j=0}^{k} w_j} \quad (0 \leq \ell \leq k). \]
First-step analysis yields $\tilde{E}_k[T_k] = (r_k + p_k) \times 1 + q_k \left(1 + \tilde{E}_{k-1}[T_k]\right)$, that is,
\[ \tilde{E}_k[T_k] = 1 + q_k \tilde{E}_{k-1}[T_k]. \]
Also
\[ \tilde{E}_k[T_k] = \frac{1}{\tilde{\pi}_k} = \frac{1}{w_k} \sum_{j=0}^{k} w_j, \]
and therefore, since $\tilde{E}_{k-1}[T_k] = E_{k-1}[T_k]$, we have ($\star$).

Example 6.3.18: Special cases. In the special case where $(p_j, q_j, r_j) = (p, q, r)$ for all $j \neq 0, N$, $(p_0, q_0, r_0) = (p, 0, q + r)$ and $(p_N, q_N, r_N) = (0, q, p + r)$, we have $w_i = \left(\frac{p}{q}\right)^i$, and for $1 \leq k \leq N$ (and $p \neq q$),
\[ E_{k-1}[T_k] = \frac{1}{q \left(\frac{p}{q}\right)^k} \sum_{j=0}^{k-1} \left(\frac{p}{q}\right)^j = \frac{1}{p - q} \left(1 - \left(\frac{q}{p}\right)^k\right). \]
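Formula (6.30) can be cross-checked against a direct first-step computation. In the sketch below (all names are hypothetical, and exact rationals are used), `hitting_time_formula` evaluates (6.30) from the weights $w_i$, while `hitting_time_first_step` uses the recursion $E_k[T_{k+1}] = (1 + q_k E_{k-1}[T_k]) / p_k$, which follows from conditioning on the first move out of state $k$ and is an independent route to the same quantity.

```python
from fractions import Fraction


def weights(p, q):
    """w_0 = 1 and w_i = prod_{k=1..i} p[k-1] / q[k]."""
    w = [Fraction(1)]
    for i in range(1, len(p)):
        w.append(w[-1] * p[i - 1] / q[i])
    return w


def stationary(p, q):
    """Stationary distribution (6.29) of the bounded birth-and-death chain."""
    w = weights(p, q)
    total = sum(w)
    return [wi / total for wi in w]


def hitting_time_formula(p, q, a, b):
    """E_a[T_b] from (6.30), for 0 <= a < b <= N."""
    w = weights(p, q)
    return sum(sum(w[:k]) / (q[k] * w[k]) for k in range(a + 1, b + 1))


def hitting_time_first_step(p, q, a, b):
    """E_a[T_b] as the sum of one-level-up times d_k = E_k[T_{k+1}]."""
    d = Fraction(1) / p[0]  # d_0: from state 0 the chain can only stay or go up
    total = d if a == 0 else Fraction(0)
    for k in range(1, b):
        d = (1 + q[k] * d) / p[k]
        if k >= a:
            total += d
    return total
```

Here `p` and `q` are lists indexed by state, with `q[0] = 0` and `p[N] = 0`; the holding probabilities are implied by $r_i = 1 - p_i - q_i$.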
6.3.3
Foster's Theorem
The stationary distribution criterion of positive recurrence of an irreducible chain requires solving the balance equation, and this is not always feasible in practice. The following result (Foster's theorem) gives a more tractable, and in fact quite powerful, sufficient condition.

Theorem 6.3.19 (3) Let the transition matrix $\mathbf{P}$ on the countable state space $E$ be irreducible and suppose that there exists a function $h : E \to \mathbb{R}$ such that $\inf_i h(i) > -\infty$ and
\[ \sum_{k \in E} p_{ik} h(k) < \infty \quad \text{for all } i \in F, \tag{6.31} \]
\[ \sum_{k \in E} p_{ik} h(k) \leq h(i) - \varepsilon \quad \text{for all } i \notin F, \tag{6.32} \]
for some finite set $F$ and some $\varepsilon > 0$. Then the corresponding hmc is positive recurrent.

Proof. Since $\inf_i h(i) > -\infty$, one may assume without loss of generality that $h \geq 0$, by adding a constant if necessary. Call $\tau$ the return time to $F$, and define Yn = h(Xn )1{n 0. The counting process $\{N(t)\}_{t \geq 0}$ is a continuous-time hmc with transition semigroup defined by
\[ p_{ij}(t) = 1_{\{j \geq i\}} e^{-\lambda t} \frac{(\lambda t)^{j-i}}{(j - i)!}. \]
Proof. With $C := \{N(s_1) = i_1, \ldots, N(s_k) = i_k\}$, we have, for $j \geq i$,
\[ P(N(t + s) = j \mid N(s) = i, C) = \frac{P(N(t + s) = j, N(s) = i, C)}{P(N(s) = i, C)} = \frac{P(N(s, s + t] = j - i, N(s) = i, C)}{P(N(s) = i, C)}. \]
But $N(s, s + t]$ is independent of $N(s)$ and of $C$, and therefore,
\[ P(N(s, s + t] = j - i, N(s) = i, C) = P(N(s, s + t] = j - i)\, P(N(s) = i, C), \]
so that
\[ P(N(t + s) = j \mid N(s) = i, C) = P(N(s, s + t] = j - i). \]
Similarly,
\[ P(N(t + s) = j \mid N(s) = i) = P(N(s, s + t] = j - i) = e^{-\lambda t} \frac{(\lambda t)^{j-i}}{(j - i)!}. \]
Example 7.2.5: Flip-flop. Let $N$ be an hpp on $\mathbb{R}_+$ with intensity $\lambda > 0$. Define the flip-flop process with state space $\{+1, -1\}$ by
\[ X(t) := X(0) \times (-1)^{N(t)}, \]
where $X(0)$ is a $\{+1, -1\}$-valued random variable independent of the counting process $N$. In words: the flip-flop process switches between $-1$ and $+1$ at each event of $N$. It is a continuous-time hmc with transition semigroup
\[ \mathbf{P}(t) = \frac12 \begin{pmatrix} 1 + e^{-2\lambda t} & 1 - e^{-2\lambda t} \\ 1 - e^{-2\lambda t} & 1 + e^{-2\lambda t} \end{pmatrix}. \]

Proof. The value $X(t + s)$ depends on $N(s, s + t]$ and $X(s)$. Also, $N(s, s + t]$ is independent of $X(0), N(s_1), \ldots, N(s_k)$ when $s_\ell \leq s$ ($1 \leq \ell \leq k$), and the latter random variables determine $X(s_1), \ldots, X(s_k)$. Therefore, $X(t + s)$ is independent of $X(s_1), \ldots, X(s_k)$ given $X(s)$, that is, $\{X(t)\}_{t \geq 0}$ is a Markov chain. Moreover,
\[ P(X(t + s) = 1 \mid X(s) = -1) = P(N(s, s + t] \text{ is odd}) = \sum_{k=0}^{\infty} e^{-\lambda t} \frac{(\lambda t)^{2k+1}}{(2k + 1)!} = \frac12 (1 - e^{-2\lambda t}), \]
that is, $p_{-1,+1}(t) = \frac12 (1 - e^{-2\lambda t})$. Similar computations give the announced result for $p_{+1,-1}(t)$.
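The two facts used in the flip-flop example, the value of the odd-terms Poisson series and the semigroup property, are easy to check numerically. The sketch below uses hypothetical function names and a fixed series truncation, which is an assumption that is adequate for moderate values of $\lambda t$.

```python
from math import exp, factorial


def odd_count_probability(lam, t, terms=60):
    """P(N(s, s+t] is odd) as a truncated Poisson series."""
    return sum(
        exp(-lam * t) * (lam * t) ** (2 * k + 1) / factorial(2 * k + 1)
        for k in range(terms)
    )


def flipflop_semigroup(lam, t):
    """Closed-form P(t) of the flip-flop process; rows/columns ordered (+1, -1)."""
    same = 0.5 * (1 + exp(-2 * lam * t))
    flip = 0.5 * (1 - exp(-2 * lam * t))
    return [[same, flip], [flip, same]]


def mat_mul(A, B):
    """Plain matrix product of two square matrices given as lists of rows."""
    m = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]
```

Multiplying `flipflop_semigroup(lam, t)` by `flipflop_semigroup(lam, s)` should reproduce `flipflop_semigroup(lam, t + s)` up to rounding.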
The Uniform hmc

Definition 7.2.6 Let $\{\hat{X}_n\}_{n \geq 0}$ be a discrete-time hmc with countable state space $E$ and transition matrix $\mathbf{K} = \{k_{ij}\}_{i,j \in E}$ and let $N$ be a hpp on $\mathbb{R}_+$ of intensity $\lambda > 0$ and associated time sequence $\{T_n\}_{n \geq 1}$. Suppose that $\{\hat{X}_n\}_{n \geq 0}$ and $N$ are independent. The stochastic process
\[ X(t) = \hat{X}_{N(t)} \quad (t \geq 0) \]
is called a uniform Markov chain. The Poisson process $N$ is the clock, and the chain $\{\hat{X}_n\}_{n \geq 0}$ is the subordinated chain.
[Figure: sample path of X(t) against t, taking the successive values X̂_0, X̂_1, ..., X̂_7 on the intervals delimited by the clock times T_0 = 0, T_1, ..., T_7. Caption: Uniform Markov chain]

Remark 7.2.7 Observe that $X(T_n) = \hat{X}_n$ for all $n \geq 0$. Observe also that the discontinuity times of the uniform chain are all events of $N$ but that not all events of $N$ are discontinuity times, since it may well occur that $\hat{X}_{n-1} = \hat{X}_n$ (a "transition" of type $i \to i$ of the subordinated chain).

The process $\{X(t)\}_{t \geq 0}$ is a continuous-time hmc (Exercise 7.5.3). Its transition semigroup is
\[ \mathbf{P}(t) = \sum_{n=0}^{\infty} e^{-\lambda t} \frac{(\lambda t)^n}{n!} \mathbf{K}^n, \tag{7.15} \]
that is,
\[ p_{ij}(t) = \sum_{n=0}^{\infty} e^{-\lambda t} \frac{(\lambda t)^n}{n!} k_{ij}(n). \]
Indeed,
\[ P_i(X(t) = j) = P_i(\hat{X}_{N(t)} = j) = \sum_{n=0}^{\infty} P_i(N(t) = n, \hat{X}_n = j) = \sum_{n=0}^{\infty} P_i(N(t) = n)\, P_i(\hat{X}_n = j). \]
Definition 7.2.8 The probability distribution $\pi$ on $E$ is called a stationary distribution of the continuous-time hmc, or of its transition semigroup, if
\[ \pi^T \mathbf{P}(t) = \pi^T \quad (t \geq 0). \]
From (7.12), we see that if the initial distribution of the chain is a stationary distribution $\pi$, then the distribution at any time $t \geq 0$ is $\pi$, and moreover, the chain is stationary, since for all $k \geq 1$, all $0 \leq t_1 < \ldots < t_k$ and all states $i_1, \ldots, i_k$, the quantity
\[ P(X(t_1 + t) = i_1, \ldots, X(t_k + t) = i_k) = \pi(i_1)\, p_{i_1, i_2}(t_2 - t_1) \cdots p_{i_{k-1}, i_k}(t_k - t_{k-1}) \]
does not depend on $t \geq 0$. Therefore

Theorem 7.2.9 A continuous-time hmc having for initial distribution a stationary distribution of the transition semigroup is stationary.

Example 7.2.10: Uniform hmc, take 2. In the case of the uniform hmc of Definition 7.2.6, if $\pi$ is a stationary distribution of the subordinated chain, $\pi^T \mathbf{K}^n = \pi^T$, and therefore in view of (7.15), $\pi^T \mathbf{P}(t) = \pi^T$. Conversely, if $\pi$ is a stationary distribution of the continuous-time hmc, then, by (7.15),
\[ \pi^T = \sum_{n=0}^{\infty} e^{-\lambda t} \frac{(\lambda t)^n}{n!} \pi^T \mathbf{K}^n, \]
and subtracting the $n = 0$ term $e^{-\lambda t} \pi^T$ from both sides, dividing by $t$ and letting $t \downarrow 0$, we obtain $\pi^T = \pi^T \mathbf{K}$.
7.2.2
The Local Characteristics
Let $\{\mathbf{P}(t)\}_{t \geq 0}$ be a transition semigroup on $E$, that is, for each $t, s \geq 0$,
(a) $\mathbf{P}(t)$ is a stochastic matrix,
(b) $\mathbf{P}(0) = I$,
(c) $\mathbf{P}(t + s) = \mathbf{P}(t) \mathbf{P}(s)$.
Suppose moreover that the semigroup is continuous at the origin, that is,
(d) $\lim_{h \downarrow 0} \mathbf{P}(h) = \mathbf{P}(0) = I$,
where the convergence therein is pointwise and for each entry. In Exercise 7.5.4, the reader is invited to prove that continuity at the origin implies continuity at any time, that is, $\lim_{h \to 0} \mathbf{P}(t + h) = \mathbf{P}(t)$ for all $t > 0$. The result to follow is purely analytical: it does not require $\{\mathbf{P}(t)\}_{t \geq 0}$ to be the transition semigroup of some continuous-time hmc.

Theorem 7.2.11 Let $\{\mathbf{P}(t)\}_{t \geq 0}$ be a continuous transition semigroup on the countable state space $E$. For any state $i$, there exists
\[ q_i := \lim_{h \downarrow 0} \frac{1 - p_{ii}(h)}{h} \in [0, \infty], \tag{7.16} \]
and for any pair $i, j$ of different states, there exists
\[ q_{ij} := \lim_{h \downarrow 0} \frac{p_{ij}(h)}{h} \in [0, \infty). \tag{7.17} \]
Proof. For all $t \geq 0$ and all $n \geq 1$, we have $\mathbf{P}(t) = \left[\mathbf{P}\left(\frac{t}{n}\right)\right]^n$ and therefore $p_{ii}(t) \geq \left[p_{ii}\left(\frac{t}{n}\right)\right]^n$ ($i \in E$). Since $\lim_{h \downarrow 0} p_{ii}(h) = 1$, there exists an $\varepsilon > 0$ such that $p_{ii}(h) > 0$ for all $h \in [0, \varepsilon]$. For $n$ sufficiently large, $\frac{t}{n} \in [0, \varepsilon]$. Therefore, for all $t \geq 0$, $p_{ii}(t) > 0$, and the nonnegative quantity
\[ f_i(t) := -\log p_{ii}(t) \]
is finite. Also, $\lim_{h \downarrow 0} f_i(h) = 0$. Moreover, from $\mathbf{P}(t) \mathbf{P}(s) = \mathbf{P}(t + s)$, we have that $p_{ii}(t + s) \geq p_{ii}(t) p_{ii}(s)$, and therefore, the function $f_i$ is subadditive, that is,
\[ f_i(t + s) \leq f_i(t) + f_i(s) \quad (s, t \in \mathbb{R}_+). \]
Define the (possibly infinite) nonnegative real number
\[ q_i := \sup_{t > 0} \frac{f_i(t)}{t}. \]
Then (Theorem B.5.1)
\[ \lim_{h \downarrow 0} \frac{f_i(h)}{h} = q_i. \]
Therefore,
\[ \lim_{h \downarrow 0} \frac{1 - p_{ii}(h)}{h} = \lim_{h \downarrow 0} \frac{1 - e^{-f_i(h)}}{f_i(h)} \frac{f_i(h)}{h} = q_i, \]
and this proves the first equality in (7.16). It now remains to prove (7.17). For this, take two different states $i$ and $j$. Since $p_{ii}(t)$ and $p_{jj}(t)$ tend to 1 as $t > 0$ tends to 0, there exists for any $c \in (\frac12, 1)$ a number $\delta > 0$ such that for $t \in [0, \delta]$, $p_{ii}(t) > c$ and $p_{jj}(t) > c$. Denote by $\{X_n\}_{n \geq 0}$ the discrete-time hmc defined by $X_n = X(nh)$, with transition matrix $\mathbf{P}(h)$. Let $n > 0$ be an integer and $h > 0$ be such that $0 \leq nh \leq \delta$. One way to pass from state $i$ at time 0 to state $j$ at time $n$ is to pass through state $i$ at time $r$ for some $r$, $0 \leq r \leq n - 1$, without visiting state $j$ meanwhile, then to pass from $i$ at time $r$ to state $j$ at time $r + 1$, and finally to pass from $j$ at time $r + 1$ to state $j$ at time $n$. The paths corresponding to different values of $r$ are different, but they do not exhaust the possibilities of going from $X_0 = i$ to $X_n = j$. Therefore,
\[ p_{ij}(nh) \geq \sum_{r=0}^{n-1} P(X_1 \neq j, \ldots, X_{r-1} \neq j, X_r = i \mid X_0 = i)\, p_{ij}(h)\, P(X_n = j \mid X_{r+1} = j). \]
The parameters δ, n and h are such that P (Xn = j  Xr+1 = j) ≥ c. Also P (X1 = j, . . . , Xr−1 = j, Xr = i  X0 = i) = P (Xr = i  X0 = i) − P (X1 = j, . . . , Xk−1 = j, Xk = j  X0 = i)P (Xr = i  Xk = j) k a  Xn = i) = e−qi a pij , (7.19) where pij =
$\frac{q_{ij}}{q_i}$ if $q_i > 0$, $p_{ij} = 0$ if $q_i = 0$. (In particular, $\{X_n\}_{n \geq 0}$ is an hmc.)
B. If $P(\tau_\infty = \infty) = 1$, the process $\{X(t)\}_{t \geq 0}$ constructed as above is a regular jump hmc with infinitesimal generator $\mathbf{A}$.

Proof. A. The $\tau_n$'s form a sequence of $\mathcal{G}_t$-stopping times, where
\[ \mathcal{G}_t := \sigma(X_0) \vee \Big(\bigvee_{i,j \in E} \mathcal{F}_t^{N_{ij}}\Big). \]
The announced result then follows from the strong Markov property for hpps (Theorem 7.1.5) and the competition theorem (Theorem 7.1.4).

B. By construction, for a given time $t$, the process after time $t$ depends only upon $X(t)$ and the hpps $S_t N_{ij}$ ($i, j \in E$, $i \neq j$). The homogeneous Markov property follows immediately from this observation. It remains to show that $\mathbf{A}$ is indeed the infinitesimal generator of the hmc. We first check that, for $i \neq j$,
\[ \lim_{t \downarrow 0} \frac1t P_i(X(t) = j) = q_{ij}. \]
For this, observe that when $X(t) \neq X(0)$, necessarily $\tau_1 \leq t$, and write
\begin{align*}
P_i(X(t) = j) &= P_i(\tau_2 \leq t, X(t) = j) + P_i(\tau_2 > t, X(t) = j) \\
&= P_i(\tau_2 \leq t, X(t) = j) + P_i(\tau_2 > t, X_1 = j, \tau_1 \leq t) \\
&= P_i(\tau_2 \leq t, X(t) = j) + P_i(X_1 = j, \tau_1 \leq t) - P_i(\tau_2 \leq t, X_1 = j, \tau_1 \leq t).
\end{align*}
By Theorem 7.1.4, $P_i(X_1 = j, \tau_1 \leq t) = (1 - e^{-q_i t}) \frac{q_{ij}}{q_i}$ and then
\[ \lim_{t \downarrow 0} \frac1t P_i(X_1 = j, \tau_1 \leq t) = q_{ij}. \]
It therefore remains to show that $P_i(\tau_2 \leq t, X(t) = j)$ and $P_i(\tau_2 \leq t, X_1 = j, \tau_1 \leq t)$ are $o(t)$ (obvious if $q_i = 0$, and therefore we suppose $q_i > 0$). Both terms are bounded by $P_i(\tau_2 \leq t)$, and
\begin{align*}
P_i(\tau_2 \leq t) &\leq P_i(\tau_1 \leq t, \tau_2 - \tau_1 \leq t) = \sum_{k \in E,\, k \neq i} P_i(\tau_1 \leq t, X_1 = k, \tau_2 - \tau_1 \leq t) \\
&= \sum_{k \in E,\, k \neq i} (1 - e^{-q_i t}) \frac{q_{ik}}{q_i} (1 - e^{-q_k t}) = (1 - e^{-q_i t}) \sum_{k \in E,\, k \neq i} \frac{q_{ik}}{q_i} (1 - e^{-q_k t}).
\end{align*}
But $(1 - e^{-q_i t})$ is $O(t)$ (identically null if $q_i = 0$) and $\lim_{t \downarrow 0} \sum_{k \in E,\, k \neq i} \frac{q_{ik}}{q_i} (1 - e^{-q_k t}) = 0$ by dominated convergence. Therefore $P_i(\tau_2 \leq t)$ is $o(t)$.

We now check that $\lim_{t \downarrow 0} \frac{1 - p_{ii}(t)}{t} = q_i$. From
\begin{align*}
1 - p_{ii}(t) &= 1 - P_i(X(t) = i) \\
&= 1 - P_i(X(t) = i, \tau_1 > t) - P_i(X(t) = i, \tau_1 \leq t) \\
&= 1 - P_i(\tau_1 > t) - P_i(X(t) = i, \tau_1 \leq t, \tau_2 \leq t) \\
&= 1 - e^{-q_i t} - P_i(X(t) = i, \tau_1 \leq t, \tau_2 \leq t),
\end{align*}
and the announced result follows from the fact (proved above) that $P_i(\tau_2 \leq t)$ is $o(t)$.

Let $Z_i(t) := 1_{\{X(t) = i\}}$. The construction of the state process on $[0, \tau_\infty)$ is summarized by the following equations: for all $i \in E$,
\[ Z_i(t) = Z_i(0) + \sum_{j \in E;\, j \neq i} \int_{(0,t]} Z_j(s-)\, N_{ji}(ds) - \sum_{j \in E;\, j \neq i} \int_{(0,t]} Z_i(s-)\, N_{ij}(ds). \tag{7.20} \]
Equations (7.20) constitute a system of stochastic differential equations driven by the Poisson processes $N_{ij}$ ($i, j \in E$, $i \neq j$) for the processes $\{Z_i(t)\}_{t \geq 0}$ ($i \in E$). It is sometimes convenient to state conclusion B of Theorem 7.2.18 as

Theorem 7.2.19 Let $\{X(t)\}_{t \geq 0}$ be a regular jump process with countable state space $E$, satisfying (7.20) where $Z_i(t) = 1_{\{X(t) = i\}}$ and where $N_{ij}$ ($i, j \in E$, $i \neq j$) is a family of independent hpps with respective intensities $q_{ij}$ ($i, j \in E$, $i \neq j$), and independent of the initial state $X(0)$. Then $\{X(t)\}_{t \geq 0}$ is a regular jump hmc with infinitesimal generator $\mathbf{A}$.

Note that (7.20) is equivalent to the requirement that
\[ f(X(t)) - f(X(0)) = \sum_{i,j \in E,\, i \neq j} \{f(j) - f(i)\} \int_{(0,t]} 1_{\{X(s-) = i\}}\, dN_{ij}(s) \tag{7.21} \]
for all nonnegative functions $f : E \to \mathbb{R}$. We shall now exploit this canonical representation.
Aggregation of States

Consider a regular jump hmc $\{X(t)\}_{t \geq 0}$ with state space $E$ and infinitesimal generator $\mathbf{A}$. Let $\tilde{E} = \{\alpha, \beta, \ldots\}$ be a partition of $E$, and define the process $\{\tilde{X}(t)\}_{t \geq 0}$ taking its values in $\tilde{E}$ by
\[ \tilde{X}(t) = \alpha \iff X(t) \in \alpha. \tag{7.22} \]
The hmc $\{\tilde{X}(t)\}_{t \geq 0}$ is the aggregated chain of $\{X(t)\}_{t \geq 0}$ (with respect to the partition $\tilde{E}$).
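Theorem 7.2.20 Suppose that for all $\alpha, \beta \in \tilde{E}$ ($\alpha \neq \beta$),
\[ \sum_{j \in \beta} q_{ij} = \tilde{q}_{\alpha\beta} \quad (i \in \alpha). \tag{7.23} \]
(This equality not only defines the quantity in the right-hand side but also states the hypothesis that the left-hand side is independent of $i \in \alpha$.) Then $\{\tilde{X}(t)\}_{t \geq 0}$ is a regular jump hmc with state space $\tilde{E}$ and infinitesimal generator $\tilde{\mathbf{A}}$, with off-diagonal terms given by (7.23).

Proof. This statement concerns the distribution of $\{X(t)\}_{t \geq 0}$ and therefore we may suppose that this process is of the form (7.21). Then, for $f : \tilde{E} \to \mathbb{R}$,
\begin{align*}
f(\tilde{X}(t)) &= f(\tilde{X}(0)) + \sum_{i,j \in E,\, i \neq j} \int_{(0,t]} \{f(\tilde{X}(u)) - f(\tilde{X}(u-))\}\, 1_{\{X(u-) = i\}}\, dN_{ij}(u) \\
&= f(\tilde{X}(0)) + \sum_{\alpha, \beta \in \tilde{E},\, \alpha \neq \beta} \{f(\beta) - f(\alpha)\} \int_{(0,t]} \sum_{i \in \alpha} 1_{\{X(u-) = i\}} \Big(\sum_{j \in \beta} dN_{ij}(u)\Big).
\end{align*}
Define for all $\alpha, \beta \in \tilde{E}$, $\alpha \neq \beta$, the point process $\tilde{N}_{\alpha\beta}$ by
\[ \tilde{N}_{\alpha\beta}(0, t] = \int_{(0,t]} \sum_{i \in \alpha} \Big(1_{\{X(s-) = i\}} \Big(\sum_{j \in \beta} dN_{ij}(s)\Big)\Big) + \int_{(0,t]} 1_{\{\tilde{X}(s-) \neq \alpha\}}\, d\hat{N}_{\alpha\beta}(s), \]
where the "dummy" point processes $\{\hat{N}_{\alpha\beta}\}_{\alpha, \beta \in \tilde{E},\, \alpha \neq \beta}$ form an independent family of hpps with intensities $\{\tilde{q}_{\alpha\beta}\}_{\alpha, \beta \in \tilde{E},\, \alpha \neq \beta}$, respectively, and are independent of $X(0)$ and $\{N_{ij}\}_{i,j \in E,\, i \neq j}$. Then
\[ f(\tilde{X}(t)) = f(\tilde{X}(0)) + \sum_{\alpha, \beta \in \tilde{E},\, \alpha \neq \beta} (f(\beta) - f(\alpha)) \int_{(0,t]} 1_{\{\tilde{X}(u-) = \alpha\}}\, d\tilde{N}_{\alpha\beta}(u). \tag{7.24} \]
In view of the remark relative to (7.21), it suffices to prove that $\tilde{N}_{\alpha\beta}$ ($\alpha, \beta \in \tilde{E}$, $\alpha \neq \beta$) is a family of independent hpps with respective intensities $\tilde{q}_{\alpha\beta}$ ($\alpha, \beta \in \tilde{E}$, $\alpha \neq \beta$). For this, we apply Watanabe's theorem (Theorem 7.1.8). Let $\{Z(t)\}_{t \geq 0}$ be a left-continuous $\mathcal{F}_t$-adapted stochastic process, where
\[ \mathcal{F}_t = \sigma(X(0)) \vee \mathcal{F}_t^{\tilde{N}_{\alpha\beta}} \vee \Big(\bigvee_{u,v \in \tilde{E};\, (u,v) \neq (\alpha,\beta)} \mathcal{F}_\infty^{\tilde{N}_{uv}}\Big). \]
We obtain
\[ E\left[\int_{(0,T]} Z(t)\, d\tilde{N}_{\alpha\beta}(t)\right] = \sum_{i \in \alpha} \sum_{j \in \beta} E\left[\int_{(0,T]} Z(t) 1_{\{X(t-) = i\}}\, dN_{ij}(t)\right] + E\left[\int_{(0,T]} Z(t) 1_{\{\tilde{X}(t-) \neq \alpha\}}\, d\hat{N}_{\alpha\beta}(t)\right], \]
and this quantity is equal, by the smoothing formula (Theorem 7.1.7), to
\begin{align*}
&\sum_{i \in \alpha} \sum_{j \in \beta} E\left[\int_0^T Z(t) 1_{\{X(t-) = i\}}\, q_{ij}\, dt\right] + E\left[\int_0^T Z(t) 1_{\{\tilde{X}(t-) \neq \alpha\}}\, \tilde{q}_{\alpha\beta}\, dt\right] \\
&\quad = E\left[\int_0^T Z(t) \left[\sum_{i \in \alpha} 1_{\{X(t-) = i\}} \Big(\sum_{j \in \beta} q_{ij}\Big) + \tilde{q}_{\alpha\beta} 1_{\{\tilde{X}(t-) \neq \alpha\}}\right] dt\right] \\
&\quad = E\left[\int_0^T Z(t)\, \tilde{q}_{\alpha\beta}\, dt\right].
\end{align*}
Therefore, by Theorem 7.1.8, $\tilde{N}_{\alpha\beta}$ is an hpp with intensity $\tilde{q}_{\alpha\beta}$ independent of $\sigma(X(0)) \vee \Big(\bigvee_{u,v \in \tilde{E};\, (u,v) \neq (\alpha,\beta)} \mathcal{F}_\infty^{\tilde{N}_{uv}}\Big)$.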
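Condition (7.23) is straightforward to test on a finite generator. The sketch below is a hypothetical helper, not part of the text: it takes a generator as a list of rows and a partition as a list of lists of state indices, checks that $\sum_{j \in \beta} q_{ij}$ does not depend on $i \in \alpha$, and returns the aggregated generator when the condition holds. The diagonal is filled in so that rows sum to zero, which is an assumption matching a conservative generator.

```python
def aggregate_generator(Q, partition, tol=1e-12):
    """Return the aggregated generator if condition (7.23) holds, else None.

    Q: square generator matrix (list of rows).
    partition: list of blocks, each a list of state indices of Q.
    """
    m = len(partition)
    Qt = [[0.0] * m for _ in range(m)]
    for a, alpha in enumerate(partition):
        for b, beta in enumerate(partition):
            if a == b:
                continue
            # Total rate from each state i of block alpha into block beta.
            rates = [sum(Q[i][j] for j in beta) for i in alpha]
            if max(rates) - min(rates) > tol:
                return None  # the left-hand side of (7.23) depends on i
            Qt[a][b] = rates[0]
        Qt[a][a] = -sum(Qt[a])  # conservative diagonal (entry was still 0)
    return Qt
```

For the generator with rows (−3, 1, 2), (2, −4, 2), (1, 1, −2), the partition {0, 1}, {2} satisfies (7.23), while the partition {0}, {1, 2} does not, since states 1 and 2 return to state 0 at different rates.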
7.3
Regenerative Structure
7.3.1
The Strong Markov Property
A regular jump hmc has the strong Markov property. More precisely: Define, similarly to the discrete-time case, the process after the random time $\tau$:
\[ \{S_\tau X(t)\}_{t \geq 0} = \{X(t + \tau)\}_{t \geq 0} \]
(with the convention $X(\infty) = \Delta$, where $\Delta$ is an element not in $E$) and the process before $\tau$:
\[ \{X^\tau(t)\}_{t \geq 0} = \{X(t \wedge \tau)\}_{t \geq 0}. \]

Theorem 7.3.1 Let $\{X(t)\}_{t \geq 0}$ be a right-continuous continuous-time hmc with countable state space $E$ and transition semigroup $\{\mathbf{P}(t)\}_{t \geq 0}$, and let $\tau$ be a stopping time with respect to $\{X(t)\}_{t \geq 0}$. Let $k \in E$ be a state. Then,
($\alpha$) given $X(\tau) = k$, the chain after $\tau$ and the chain before $\tau$ are independent, and
($\beta$) given $X(\tau) = k$, the chain after $\tau$ is a regular jump hmc with transition semigroup $\{\mathbf{P}(t)\}_{t \geq 0}$.

Proof. Suppose we have proved that for all states $k$, all positive times $t_1, \ldots, t_n$, $s_1, \ldots, s_p$ and all real numbers $u_1, \ldots, u_n$, $v_1, \ldots, v_p$,
\[ E\Big[e^{i \sum_{\ell=1}^{n} u_\ell X(\tau + t_\ell) + i \sum_{m=1}^{p} v_m X(\tau \wedge s_m)}\, 1_{\{X(\tau) = k\}}\Big] = E\Big[e^{i \sum_{\ell=1}^{n} u_\ell X(t_\ell)} \,\Big|\, X(0) = k\Big]\, E\Big[e^{i \sum_{m=1}^{p} v_m X(\tau \wedge s_m)}\, 1_{\{X(\tau) = k\}}\Big]. \tag{7.25} \]
Then, fixing $v_1 = \cdots = v_p = 0$, we obtain
\[ \frac{E\Big[e^{i \sum_{\ell=1}^{n} u_\ell X(\tau + t_\ell)}\, 1_{\{X(\tau) = k\}}\Big]}{P(X(\tau) = k)} = E\Big[e^{i \sum_{\ell=1}^{n} u_\ell X(t_\ell)} \,\Big|\, X(0) = k\Big], \]
and this shows that given $X(\tau) = k$, $\{X(\tau + t)\}_{t \geq 0}$ has the same distribution as $\{X(t)\}_{t \geq 0}$ given $X(0) = k$. We therefore will have proved ($\beta$). For ($\alpha$), it suffices to rewrite (7.25) as follows, using the previous equality:
\[ E\Big[e^{i \sum_{\ell=1}^{n} u_\ell X(\tau + t_\ell) + i \sum_{m=1}^{p} v_m X(\tau \wedge s_m)} \,\Big|\, X(\tau) = k\Big] = E\Big[e^{i \sum_{\ell=1}^{n} u_\ell X(\tau + t_\ell)} \,\Big|\, X(\tau) = k\Big]\, E\Big[e^{i \sum_{m=1}^{p} v_m X(\tau \wedge s_m)} \,\Big|\, X(\tau) = k\Big]. \tag{7.26} \]
It remains to prove (7.25). For the sake of simplicity, we consider the case where $n = p = 1$, and let $u_1 = u$, $t_1 = t$, $v_1 = v$, $s_1 = s$. Suppose first that $\tau$ takes a countable number of finite values, denoted by $a_j$, and also, maybe, the value $+\infty$. Note that $X(\tau) = k \in E$ implies $\tau < \infty$. Then
\[ E[e^{iuX(\tau + t) + ivX(\tau \wedge s)} 1_{\{X(\tau) = k\}}] = \sum_{j \geq 1} E[e^{iuX(a_j + t) + ivX(a_j \wedge s)} 1_{\{X(a_j) = k\}} 1_{\{\tau = a_j\}}]. \]
For all $j \geq 1$,
\begin{align*}
E[e^{iuX(a_j + t) + ivX(a_j \wedge s)} 1_{\{X(a_j) = k\}} 1_{\{\tau = a_j\}}] &= E[e^{iuX(a_j + t)} 1_{\{X(a_j) = k\}} e^{ivX(a_j \wedge s)} 1_{\{\tau = a_j\}}] \\
&= E[e^{iuX(a_j + t)} \mid X(a_j) = k]\, E[e^{ivX(a_j \wedge s)} 1_{\{\tau = a_j\}} 1_{\{X(a_j) = k\}}],
\end{align*}
where for the last equality, we have used the fact that $1_{\{\tau = a_j\}}$ is $\mathcal{F}^X_{a_j}$-measurable and the Markov property at time $a_j$. Therefore, for all $j \geq 1$,
\[ E[e^{iuX(a_j + t) + ivX(a_j \wedge s)} 1_{\{X(a_j) = k\}} 1_{\{\tau = a_j\}}] = E[e^{iuX(t)} \mid X(0) = k]\, E[e^{ivX(a_j \wedge s)} 1_{\{X(a_j) = k\}} 1_{\{\tau = a_j\}}]. \]
Summing with respect to $j$, we obtain the equality corresponding to (7.25).

To pass from the case where the stopping time $\tau$ takes a countable number of values to the general case, let $\tau(n)$ be the approximation of $\tau$ of Theorem 5.3.13. The random time $\tau(n)$ is an $\mathcal{F}^X_t$-stopping time with a countable number of values such that $\lim_{n \uparrow \infty} \downarrow \tau(n, \omega) = \tau(\omega)$. In particular, $\lim_{n \uparrow \infty} X(\tau(n) \wedge a) = X(\tau \wedge a)$, $\lim_{n \uparrow \infty} X(\tau(n) + b) = X(\tau + b)$ and $\lim_{n \uparrow \infty} 1_{\{X(\tau(n)) = k\}} = 1_{\{X(\tau) = k\}}$ (use the fact that a regular jump process is right-continuous). Therefore, letting $n$ go to $\infty$ in (7.25) with $\tau$ replaced by $\tau(n)$, we obtain the result for $\tau$ itself, by dominated convergence.

Equality (7.26) says that, given $X(\tau) = i$, the random vectors $(X(\tau \wedge s_1), \ldots, X(\tau \wedge s_p))$ and $(X(\tau + t_1), \ldots, X(\tau + t_n))$ are independent. In particular, the events $\{X(\tau \wedge s_1) = i_1, \ldots, X(\tau \wedge s_p) = i_p\}$ and $\{X(\tau + t_1) = j_1, \ldots, X(\tau + t_n) = j_n\}$ are conditionally independent given $X(\tau) = i$. Since this is true for all $i_1, \ldots, i_p$, $j_1, \ldots, j_n$, and all $s_1, \ldots, s_p$, $t_1, \ldots, t_n$, this property extends (see Theorem 5.1.10) to events $A$ and $B$ respectively in $\mathcal{F}^{S_\tau X}$ and $\mathcal{F}^{X^\tau}$. And then, finally (see the discussion leading to (7.14)),
\[ E^{X(\tau)}[Y \times Z] = E^{X(\tau)}[Y]\, E^{X(\tau)}[Z], \tag{7.27} \]
where $E^{X(\tau)}$ denotes conditional expectation given $X(\tau)$, for all nonnegative $Y$ and $Z$ that are respectively $\mathcal{F}^{S_\tau X}$-measurable and $\mathcal{F}^{X^\tau}$-measurable.

7.3.2
Imbedded Chain
Let {τn }n≥0 be the nondecreasing sequence of transition times of a regular jump process {X(t)}t≥0 , where τ0 = 0, and τn = ∞ if there are strictly fewer than n transitions in (0, ∞). Note that, for each n ≥ 0, τn is an FtX stopping time.
312
CHAPTER 7. MARKOV CHAINS, CONTINUOUS TIME
The discretetime stochastic process {Xn }n≥0 with values in E deﬁned by Xn := X(τn ) if τn < ∞ and Xn = Xn−1 if τn = ∞ is called the imbedded process of the jump process. Theorem 7.3.2 Let {X(t)}t≥0 be a continuoustime regular jump (therefore rightcontinuous) hmc, with inﬁnitesimal generator A, transition times sequence {τn }n≥0, and imbedded process {Xn }n≥0 . Then (α) {Xn }n≥0 is a discretetime hmc with transition matrix given by, for j = i, pij =
qij qi
if qi > 0, and pij = 0 if qi = 0. (β) For all n ≥ 0 and all a ∈ R+ , P (τn+1 − τn ≤ a  X0 , . . . , Xn , τ1 , . . . , τn ) = 1 − e−qXn a .
(7.28)
Proof. We begin with the following partial result.

(α′) $\{X_n\}_{n\ge 0}$ is a discrete-time hmc on $E$.

(β′) There exists for each $i \in E$ a finite real number $\lambda(i) \ge 0$ such that for all $n \ge 0$ and all $a \in \mathbb{R}_+$,
\[ P(\tau_{n+1} - \tau_n \le a \mid X_0, \ldots, X_n, \tau_1, \ldots, \tau_n) = 1 - e^{-\lambda(X_n) a}. \]

It follows from the strong Markov property that given $X(\tau_n) = i \in E$, $\{X(\tau_n + t)\}_{t\ge 0}$ is independent of $\{X(\tau_n \wedge t)\}_{t\ge 0}$ and therefore, given $X_n = i$, the variables $(X_{n+1}, X_{n+2}, \ldots)$ are independent of $(X_0, \ldots, X_n)$, that is, $\{X_n\}_{n\ge 0}$ is a Markov chain. It is clearly homogeneous because the distribution of $\{X(\tau_n + t)\}_{t\ge 0}$ given $X(\tau_n) = i$ is independent of $n$, being identical with the distribution of $\{X(t)\}_{t\ge 0}$ given $X(0) = i$, again by the strong Markov property. We have therefore proved (α′).

Call $p_{ij} = P_i(X(\tau_1) = j)$ the transition probability of $\{X_n\}_{n\ge 1}$ from $i$ to $j$. To prove (β′), it suffices to show that
\[ P_i(X_1 = i_1, \ldots, X_n = i_n, \tau_1 - \tau_0 > a_1, \ldots, \tau_n - \tau_{n-1} > a_n) = e^{-\lambda(i) a_1} p_{i i_1}\, e^{-\lambda(i_1) a_2} p_{i_1 i_2} \cdots e^{-\lambda(i_{n-1}) a_n} p_{i_{n-1} i_n} \]
for all $i, i_1, \ldots, i_n \in E$, all $a_1, \ldots, a_n \in \mathbb{R}_+$ and some function $\lambda : E \to \mathbb{R}_+$. In view of the strong Markov property, it suffices to show that for all $i, j \in E$, $a \in \mathbb{R}_+$, there exists a $\lambda(i) \in [0, \infty)$ such that
\[ P_i(X_1 = j, \tau_1 - \tau_0 > a) = P_i(X_1 = j)\, e^{-\lambda(i) a}. \tag{7.29} \]
Define $g(t) = P_i(\tau_1 > t)$. For $t, s \ge 0$, using the obvious set identities,
\[ g(t+s) = P_i(\tau_1 > t+s) = P_i(\tau_1 > t+s, \tau_1 > t, X(t) = i) = P_i(X(t+u) = i \text{ for all } u \in [0, s],\ \tau_1 > t,\ X(t) = i). \]
The last expression is, in view of the Markov property at time $t$ and using the fact that $\{\tau_1 > t\} \in \mathcal{F}_t^X$,
\[ P_i(X(t+u) = i \text{ for all } u \in [0,s] \mid X(t) = i)\, P_i(\tau_1 > t, X(t) = i) = P_i(X(u) = i \text{ for all } u \in [0,s] \mid X(0) = i)\, P_i(\tau_1 > t) = P_i(\tau_1 > s)\, P_i(\tau_1 > t), \]
where the last two equalities again follow from the obvious set identities. Therefore, for all $s, t \ge 0$, $g(t+s) = g(t) g(s)$. Also, $g(t)$ is nonincreasing, and $\lim_{t \downarrow 0} g(t) = 1$ (use the fact that the chain is assumed to be a jump process). It follows that there exists a $\lambda(i) \in [0, \infty)$ such that $g(t) = e^{-\lambda(i) t}$, that is, $P_i(\tau_1 > t) = e^{-\lambda(i) t}$, for all $t \ge 0$. Now, using the Markov property and appropriate set identities,
\begin{align*}
P_i(X_1 = j, \tau_1 > t) &= P_i(X(\tau_1) = j,\ \tau_1 > t,\ X(t) = i) \\
&= P_i(\text{first jump of } \{X(t+s)\}_{s\ge 0} \text{ is } j,\ \tau_1 > t,\ X(t) = i) \\
&= P_i(\text{first jump of } \{X(t+s)\}_{s\ge 0} \text{ is } j \mid X(t) = i)\, P_i(\tau_1 > t, X(t) = i) \\
&= P_i(\text{first jump of } \{X(s)\}_{s\ge 0} \text{ is } j \mid X(0) = i)\, P_i(\tau_1 > t) \\
&= P_i(X(\tau_1) = j)\, P_i(\tau_1 > t),
\end{align*}
and this is (7.29).

We have now proved (α) and (β), where $q_i$ is replaced by $\lambda(i) \in \mathbb{R}_+$ (only known to exist but not yet identified with $q_i$) and where $q_{ij}/q_i$ is replaced by $P_i(X(\tau_1) = j)$ (not yet identified with $q_{ij}/q_i$). We shall now proceed to the required identifications. For this, define the generator $A'$ on $E$ by
\[ q'_i = \lambda(i), \qquad q'_{ij} = \lambda(i)\, P_i(X(\tau_1) = j). \]
This generator is stable and conservative, and we can therefore construct $\{X'(t)\}_{t\ge 0}$, a regular jump hmc associated with $A'$, via the construction of Section 7.2.3, up to $\tau'_\infty$. Then $\{X'(t)\}_{t\ge 0}$ and $\{X(t)\}_{t\ge 0}$ have the same regenerative structure, given by (α′) and (β′), and therefore they have the same distribution (in particular, $\{X'(t)\}_{t\ge 0}$ is regular, since $\{X(t)\}_{t\ge 0}$ is regular, by assumption). Their respective infinitesimal generators $A'$ and $A$ are therefore identical.

Theorem 7.3.3 A regular jump hmc is stable and conservative.

Proof. Indeed, a regular jump hmc is strongly Markovian, and therefore has the regenerative structure of Theorem 7.3.2. In the course of the proof of this theorem, we have identified $q_i$ with a certain finite quantity $\lambda(i)$, and therefore $q_i < \infty$. Therefore a regular jump hmc is stable. Also $q_{ij}$ was identified with $\lambda(i) P_i(X(\tau_1) = j)$, that is, $q_i P_i(X(\tau_1) = j)$, and therefore the conservation property is clear.

Definition 7.3.4 A state $i \in E$ such that $q_i = 0$ is called permanent; otherwise, it is called essential.
In view of (7.28), if $X(\tau_n) = i$, a permanent state, then $\tau_{n+1} - \tau_n = \infty$; that is, there is no more transition at finite distance, hence the terminology.

Example 7.3.5: Uniform hmc, take 3. For the uniform hmc (see Definition 7.2.6), the imbedded process $\{X_n\}_{n\ge 0}$ is an hmc with state space $E$, and if $i \in E$ is not permanent (that is, in this case, if $k_{ii} < 1$), then for $j \ne i$,
\[ p_{ij} = \frac{k_{ij}}{1 - k_{ii}}. \]
Indeed, $\{X_n\}_{n\ge 0}$ is obtained from $\{\hat{X}_n\}_{n\ge 0}$ by considering only the "real" transitions (exercise).

An immediate consequence of Theorem 7.3.2 is:

Corollary 7.3.6 Two regular jump hmcs with the same infinitesimal generator and the same initial distribution have the same distribution.

Another way to state this is as follows: Two regular jump hmcs with the same infinitesimal generator have the same transition semigroup.

Example 7.3.7: Uniformization. A regular jump hmc with infinitesimal generator $A$ such that $\sup_{i\in E} q_i < \infty$ has the same transition semigroup as a uniform chain. Indeed: select any real number $\lambda \ge \sup_{i\in E} q_i$, and define the transition matrix $K$ by (7.18). One checks that it is indeed a stochastic matrix. The uniform chain corresponding to $(\lambda, K)$ has the infinitesimal generator $A$. Any pair $(\lambda, K)$ as above gives rise to a uniform version of the chain. The minimal uniform version is, by definition, that with $\lambda = \sup_{i\in E} q_i$.
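The uniformization step can be illustrated numerically. The sketch below assumes that (7.18) has the usual form $K = I + A/\lambda$ (that is, $k_{ij} = q_{ij}/\lambda$ for $j \ne i$ and $k_{ii} = 1 - q_i/\lambda$), which is not restated in this section; the function name `uniformize` and the three-state generator are hypothetical choices made for the example.

```python
def uniformize(A, lam=None):
    """Return (lam, K) with K = I + A/lam (assumed form of (7.18))."""
    n = len(A)
    q = [-A[i][i] for i in range(n)]           # q_i = -q_ii
    if lam is None:
        lam = max(q)                            # minimal uniform version
    if lam <= 0 or lam < max(q):
        raise ValueError("need lam >= sup q_i > 0")
    K = [[(1.0 if i == j else 0.0) + A[i][j] / lam for j in range(n)]
         for i in range(n)]
    return lam, K


# Hypothetical stable and conservative generator (each row sums to zero).
A = [[-2.0, 2.0, 0.0],
     [1.0, -3.0, 2.0],
     [0.0, 4.0, -4.0]]
lam, K = uniformize(A)

# Each row of K should be a probability vector, and k_ij / (1 - k_ii)
# should agree with q_ij / q_i, as in Example 7.3.5.
row_sums = [sum(row) for row in K]
```

With this choice of generator the minimal rate is the largest $q_i$, and the last state has $k_{ii} = 0$, so every transition of the uniform chain out of that state is a "real" one.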
Definition 7.3.8 A continuous-time hmc with an infinitesimal generator such that
\[ \sup_{i\in E} q_i < \infty \tag{7.30} \]
is called uniformizable.
7.3.3 Conditions for Regularity
Theorem 7.3.2 gives a way of constructing a regular jump hmc with values in a countable state space and admitting a given generator that is stable and conservative. (We shall also suppose for simplicity that this generator is essential, that is, $q_i > 0$ for all $i \in E$.) It suffices to construct a sequence $\tau_0 = 0, X_0, \tau_1 - \tau_0, X_1, \tau_2 - \tau_1, X_2, \ldots$ according to
\[ P(X_{n+1} = j,\ \tau_{n+1} - \tau_n \le x \mid X_0, \ldots, X_n, \tau_0, \ldots, \tau_n) = \frac{q_{X_n j}}{q_{X_n}} \left(1 - e^{-q_{X_n} x}\right), \tag{7.31} \]
the initial state $X_0$ being chosen at random, with arbitrary distribution. The value of $X(t)$ for $\tau_n \le t < \tau_{n+1}$ is then $X_n$. If $\tau_\infty := \lim_{n\uparrow\infty} \uparrow \tau_n = \infty$, we have obtained a regular jump hmc with $A$ as infinitesimal generator.

Definition 7.3.9 The generator $A$ is called nonexplosive, or regular, if
\[ P_i(\tau_\infty = \infty) = 1 \quad (i \in E). \tag{7.32} \]
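The construction (7.31) translates directly into a simulation recipe: in state $i$, draw an exponential holding time of rate $q_i$, then jump to $j \ne i$ with probability $q_{ij}/q_i$. The following is a minimal sketch for a finite essential generator; the function name `simulate_jump_chain`, the example generator and the seed are illustrative assumptions, not part of the text.

```python
import random


def simulate_jump_chain(A, x0, n_jumps, rng):
    """Simulate (tau_n, X_n) for n = 0..n_jumps following (7.31).

    A is a finite generator given as a list of rows, assumed stable,
    conservative and essential (q_i = -A[i][i] > 0 for every i).
    """
    times, states = [0.0], [x0]
    for _ in range(n_jumps):
        i = states[-1]
        q_i = -A[i][i]
        holding = rng.expovariate(q_i)           # tau_{n+1} - tau_n ~ Exp(q_i)
        targets = [j for j in range(len(A)) if j != i]
        weights = [A[i][j] for j in targets]     # proportional to q_ij / q_i
        j = rng.choices(targets, weights=weights)[0]
        times.append(times[-1] + holding)
        states.append(j)
    return times, states


# Hypothetical three-state generator used only for illustration.
A = [[-2.0, 2.0, 0.0],
     [1.0, -3.0, 2.0],
     [0.0, 4.0, -4.0]]
times, states = simulate_jump_chain(A, 0, 50, random.Random(1))
```

For a finite state space the rates $q_i$ are bounded, so explosion cannot occur; the question of whether $\tau_\infty = \infty$ only becomes delicate for infinite $E$.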
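Theorem 7.3.10 Let $A$ be a stable and conservative generator on $E$. It is regular if and only if for any real $\lambda > 0$, the system of equations
\[ (\lambda + q_i)\, x_i = \sum_{j \in E,\ j \ne i} q_{ij}\, x_j \quad (i \in E) \tag{7.33} \]
admits no nonnegative bounded solution other than the trivial one.

Proof. Let $S_k := \tau_k - \tau_{k-1}$. In particular, $\tau_\infty = \sum_{k=1}^\infty S_k$. The number
\[ g_i(\lambda) = E_i\left[\exp\Big\{-\lambda \sum_{k=1}^\infty S_k\Big\}\right] \]
is uniformly bounded in $\lambda > 0$ and $i \in E$, and if $P_i(\tau_\infty = \infty) < 1$, it is strictly positive. Also, $x_i := g_i(\lambda)$ $(i \in E)$ is a solution of (7.33), as follows from the calculations below:
\begin{align*}
g_i(\lambda) &= E_i\left[\exp\{-\lambda S_1\} \exp\Big\{-\lambda \sum_{k=2}^\infty S_k\Big\}\right] \\
&= \int_0^\infty e^{-\lambda t} q_i e^{-q_i t}\, dt \; E_i\left[\exp\Big\{-\lambda \sum_{k=2}^\infty S_k\Big\}\right] \\
&= \frac{q_i}{\lambda + q_i}\, E_i\left[\exp\Big\{-\lambda \sum_{k=2}^\infty S_k\Big\}\right]
\end{align*}
and, by first-step analysis,
\[ E_i\left[\exp\Big\{-\lambda \sum_{k=2}^\infty S_k\Big\}\right] = \sum_{j \in E,\ j \ne i} \frac{q_{ij}}{q_i}\, E_j\left[\exp\Big\{-\lambda \sum_{k=1}^\infty S_k\Big\}\right] = \sum_{j \in E,\ j \ne i} \frac{q_{ij}}{q_i}\, g_j(\lambda). \]
Therefore, if $A$ is explosive there exists a nontrivial bounded solution of (7.33).

We now prove the converse. Call $\{g_i(\lambda)\}_{i\in E}$ a bounded solution of (7.33) for a fixed real $\lambda > 0$. We have
\[ g_i(\lambda) = E[\exp\{-\lambda S_1\}\, g_{X_1}(\lambda) \mid X_0 = i], \tag{7.34} \]
since first-step analysis shows that the right-hand side is equal to that of (7.33) divided by $\lambda + q_i$. We prove by induction that
\[ g_i(\lambda) = E\left[\exp\Big\{-\lambda \sum_{k=1}^n S_k\Big\}\, g_{X_n}(\lambda) \,\Big|\, X_0 = i\right]. \tag{7.35} \]
For this, rewrite (7.34) as $g_i(\lambda) = E[\exp\{-\lambda S_{n+1}\}\, g_{X_{n+1}}(\lambda) \mid X_n = i]$, that is,
\[ g_{X_n}(\lambda) = E[\exp\{-\lambda S_{n+1}\}\, g_{X_{n+1}}(\lambda) \mid X_n]. \]
Using this expression of $g_{X_n}(\lambda)$ in (7.35), we obtain
\[ g_i(\lambda) = E\left[\exp\Big\{-\lambda \sum_{k=1}^{n+1} S_k\Big\}\, g_{X_{n+1}}(\lambda) \,\Big|\, X_0 = i\right]. \tag{7.36} \]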
Therefore, (7.35) implies (7.36) (the forward step in the induction argument). Since (7.35) is true for $n = 1$ (Eqn. (7.34)), it is true for all $n \ge 1$ and therefore, since $K := \sup_{i\in E} g_i(\lambda) < \infty$,
\[ g_i(\lambda) \le K\, E_i\left[\exp\Big\{-\lambda \sum_{k=1}^\infty S_k\Big\}\right]. \]
Therefore, if $\{g_i(\lambda)\}_{i\in E}$ is not trivial, it must hold for some $i \in E$ that $P_i\big(\sum_{k=1}^\infty S_k < \infty\big) > 0$ or, equivalently, $P_i(\tau_\infty = \infty) < 1$.
Applied to a birth-and-death process, Theorem 7.3.10 gives Reuter's criterion.

Theorem 7.3.11 Let $A$ be a generator on $E = \mathbb{N}$ defined by $q_{n,n+1} = \lambda_n$ and $q_{n,n-1} = \mu_n 1_{\{n\ge 1\}}$, where the birth parameters $\lambda_n$ are strictly positive. A necessary and sufficient condition of nonexplosion of this generator is
\[ \sum_{n=1}^\infty \left[ \frac{1}{\lambda_n} + \frac{\mu_n}{\lambda_n \lambda_{n-1}} + \cdots + \frac{\mu_n \cdots \mu_1}{\lambda_n \cdots \lambda_1 \lambda_0} \right] = \infty. \tag{7.37} \]
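Proof. The system of equations (7.33) reads in the particular case of a birth-and-death generator
\[ \begin{cases} \lambda x_0 = -\lambda_0 x_0 + \lambda_0 x_1, \\ \lambda x_k = \mu_k x_{k-1} - (\lambda_k + \mu_k) x_k + \lambda_k x_{k+1} \quad (k \ge 1). \end{cases} \tag{7.38} \]
For any fixed $x_0$, this system admits a unique solution that is identically null if and only if $x_0 = 0$. If $x_0 \ne 0$, the solution is such that $x_k / x_0$ does not depend on $x_0$, and therefore only the case where $x_0 = 1$ needs to be treated. Writing $y_k = x_{k+1} - x_k$, we obtain from (7.38)
\[ y_k = \frac{\lambda}{\lambda_k} x_k + \frac{\lambda \mu_k}{\lambda_k \lambda_{k-1}} x_{k-1} + \cdots + \frac{\lambda \mu_k \cdots \mu_2}{\lambda_k \cdots \lambda_2 \lambda_1} x_1 + \frac{\mu_k \cdots \mu_1}{\lambda_k \cdots \lambda_1} y_0 \tag{7.39} \]
and $y_0 = \lambda / \lambda_0$. From this we deduce that if $\lambda > 0$, then $y_k > 0$ and therefore $\{x_k\}_{k\ge 0}$ is a strictly increasing sequence. Therefore, using $y_0 = \lambda/\lambda_0$ in (7.39), we have (since $x_k \ge x_0 = 1$)
\[ y_k \ge \lambda \left[ \frac{1}{\lambda_k} + \frac{\mu_k}{\lambda_k \lambda_{k-1}} + \cdots + \frac{\mu_k \cdots \mu_1}{\lambda_k \cdots \lambda_1 \lambda_0} \right]. \]
Therefore, a necessary condition for $\{x_k\}_{k\ge 0}$ to be bounded is that the left-hand side of (7.37) be finite. This proves the sufficiency of (7.37) for nonexplosion.

We now turn to the proof of necessity. For $i \le k$, bounding in (7.39) $x_i$ by $x_k$ yields the upper bound
\[ y_k \le \left[ \frac{\lambda}{\lambda_k} + \cdots + \frac{\lambda \mu_k \cdots \mu_1}{\lambda_k \cdots \lambda_1 \lambda_0} \right] x_k, \]
and therefore, since $y_k = x_{k+1} - x_k$,
\[ x_{k+1} \le \left( 1 + \frac{\lambda}{\lambda_k} + \cdots + \frac{\lambda \mu_k \cdots \mu_1}{\lambda_k \cdots \lambda_1 \lambda_0} \right) x_k \le x_k \exp\left\{ \lambda \left[ \frac{1}{\lambda_k} + \cdots + \frac{\mu_k \cdots \mu_1}{\lambda_k \cdots \lambda_0} \right] \right\}. \]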
Since $x_0 = 1$, this leads to
\[ x_n \le \exp\left\{ \lambda \sum_{k=1}^{n} \left[ \frac{1}{\lambda_k} + \cdots + \frac{\mu_k \cdots \mu_1}{\lambda_k \cdots \lambda_0} \right] \right\}. \]
Therefore, a sufficient condition for the solution $\{x_n\}_{n\ge 0}$ to be bounded is that the left-hand side of (7.37) be finite.

Example 7.3.12: Pure birth. A pure birth generator $A$ is a birth-and-death generator with all $\mu_n = 0$. The necessary and sufficient condition of regularity (7.37) reads in this case
\[ \sum_{n=0}^\infty \frac{1}{\lambda_n} = \infty. \tag{7.40} \]
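Partial sums of the series in (7.37) are easy to evaluate, because the $n$-th bracket $r_n$ satisfies the recursion $r_n = 1/\lambda_n + (\mu_n/\lambda_n)\, r_{n-1}$ with $r_0 = 1/\lambda_0$. The sketch below uses this recursion; the function name `reuter_partial_sum` and the two pure-birth rate sequences are illustrative assumptions. A finite computation can only suggest convergence or divergence, it does not prove it.

```python
import math


def reuter_partial_sum(birth, death, n_terms):
    """Sum of the first n_terms brackets (n = 1..n_terms) of the series (7.37).

    birth(n) and death(n) return lambda_n > 0 and mu_n >= 0.
    """
    r = 1.0 / birth(0)                     # r_0 = 1 / lambda_0
    total = 0.0
    for n in range(1, n_terms + 1):
        r = 1.0 / birth(n) + death(n) / birth(n) * r
        total += r
    return total


no_death = lambda n: 0.0

# Pure birth with lambda_n = (n + 1)^2: the series converges (explosive case).
quadratic = reuter_partial_sum(lambda n: (n + 1.0) ** 2, no_death, 10_000)

# Pure birth with lambda_n = n + 1: harmonic-type series, which diverges.
linear = reuter_partial_sum(lambda n: n + 1.0, no_death, 10_000)
```

For the quadratic rates the partial sums stay below $\pi^2/6 - 1$, while for the linear rates they keep growing like $\log n$, in line with (7.40).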
Remark 7.3.13 There is a large class of hmcs for which the regularity is ensured without recourse to the regularity criterion above (Theorem 7.3.10): see Exercise 7.5.8.
7.4 Long-run Behavior

7.4.1 Recurrence
We shall define irreducibility, recurrence, transience, and positive recurrence for a regular jump hmc.

Definition 7.4.1 A regular jump hmc is called irreducible if and only if the imbedded discrete-time hmc is irreducible.

Definition 7.4.2 A state $i$ is called recurrent if and only if it is recurrent for the imbedded chain. Otherwise, it is called transient.

In order to define positive recurrence, we need the following definitions. The escape time from state $i$ is defined by
\[ L_i := \inf\{t \ge 0;\ X(t) \ne i\} \]
($= \infty$ if $X(t) = i$ for all $t \ge 0$). The return time to $i$ is
\[ R_i := \inf\{t > 0;\ t > L_i \text{ and } X(t) = i\} \]
($= \infty$ if $L_i = \infty$ or if $X(t) \ne i$ for all $t \ge L_i$). Clearly, $L_i$ and $R_i$ are $\mathcal{F}_t^X$-stopping times (exercise).

Definition 7.4.3 A recurrent state $i \in E$ is called t-positive recurrent if and only if $E_i[R_i] < \infty$, where $R_i$ is the return time to state $i$. Otherwise, it is called t-null recurrent.

Remark 7.4.4 We shall soon see that t-positive recurrence and n-positive recurrence (positive recurrence of the imbedded chain) are not equivalent concepts. Also, observe that recurrence of a given state implies that this state is essential. Finally, in the same vein, note that irreducibility implies that all states are essential.

Remark 7.4.5 Note that there is no notion of periodicity for a continuous-time hmc, for obvious reasons.
Invariant Measures of Recurrent Chains

Definition 7.4.6 A t-invariant measure is a finite nontrivial vector $\nu = \{\nu(i)\}_{i\in E}$ such that for all $t \ge 0$,
\[ \nu^T P(t) = \nu^T. \tag{7.41} \]
Of course, an n-invariant measure is, by definition, an invariant measure for the imbedded chain.
Theorem 7.4.7 Let the regular jump hmc $\{X(t)\}_{t\ge 0}$ with infinitesimal generator $A$ be irreducible and recurrent. Then there exists a unique (up to a multiplicative factor) t-invariant measure such that $\nu(i) > 0$ for all $i \in E$. Moreover, $\nu$ is obtained in one of the following ways:

(1):
\[ \nu(i) = E_0\left[ \int_0^{R_0} 1_{\{X(s) = i\}}\, ds \right], \tag{7.42} \]
where 0 is a state and $R_0$ is the return time to state 0, or

(2):
\[ \nu(i) = \frac{\mu(i)}{q_i} = \frac{E_0\left[ \sum_{n=1}^{T_0} 1_{\{X_n = i\}} \right]}{q_i}, \tag{7.43} \]
where $\mu$ is the canonical invariant measure relative to state 0 of the imbedded chain and $T_0$ is the return time to 0 of the imbedded chain, or

(3): as a solution of
\[ \nu^T A = 0. \tag{7.44} \]
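For a finite irreducible generator, characterizations (2) and (3) can be compared on a small example. The sketch below solves $\nu^T A = 0$ exactly with rational arithmetic under the normalization $\nu(0) = 1$, then checks that $\mu(i) := \nu(i) q_i$ is invariant for the imbedded chain with $p_{ij} = q_{ij}/q_i$. The function name `invariant_measure` and the generator are hypothetical; the normalization differs from that of (7.43) only by a multiplicative factor.

```python
from fractions import Fraction


def invariant_measure(A):
    """Solve nu^T A = 0 with nu[0] = 1 for a finite irreducible generator A."""
    n = len(A)
    m = n - 1
    # Unknowns nu[1..n-1]; one equation per column j = 1..n-1 of nu^T A = 0:
    # sum_{i>=1} nu_i A[i][j] = -A[0][j].
    M = [[Fraction(A[i][j]) for i in range(1, n)] + [Fraction(-A[0][j])]
         for j in range(1, n)]
    for col in range(m):                       # Gauss-Jordan elimination
        piv = next(r for r in range(col, m) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(m):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [Fraction(1)] + [M[r][m] for r in range(m)]


# Hypothetical irreducible generator on three states.
A = [[-2, 2, 0],
     [1, -3, 2],
     [0, 4, -4]]
nu = invariant_measure(A)
q = [-A[i][i] for i in range(3)]
mu = [nu[i] * q[i] for i in range(3)]          # candidate n-invariant measure
P = [[Fraction(0) if i == j else Fraction(A[i][j], q[i]) for j in range(3)]
     for i in range(3)]                         # imbedded chain, p_ij = q_ij / q_i
mu_P = [sum(mu[i] * P[i][j] for i in range(3)) for j in range(3)]
```

Here `nu` and `mu` are not proportional, which illustrates why t-invariance and n-invariance must be distinguished.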
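Proof. (α) We first show that (7.42) defines an invariant measure, that is, for all $j \in E$ and all $t \ge 0$,
\[ \nu(j) = \sum_{k\in E} \nu(k)\, p_{kj}(t). \]
The right-hand side of the above equality is equal to
\begin{align*}
A &= \sum_{k\in E} E_0\left[ \int_0^\infty 1_{\{X(s)=k\}} 1_{\{s \le R_0\}}\, ds \right] p_{kj}(t) \\
&= \int_0^\infty \sum_{k\in E} P_0(X(t+s) = j \mid X(s) = k)\, P_0(X(s) = k,\ s \le R_0)\, ds \\
&= \int_0^\infty \sum_{k\in E} P_0(X(t+s) = j \mid X(s) = k)\, P_0(s \le R_0 \mid X(s) = k)\, P_0(X(s) = k)\, ds \\
&= \int_0^\infty \sum_{k\in E} P_0(X(t+s) = j,\ s \le R_0 \mid X(s) = k)\, P_0(X(s) = k)\, ds \\
&= \int_0^\infty \sum_{k\in E} P_0(X(t+s) = j,\ s \le R_0,\ X(s) = k)\, ds \\
&= \int_0^\infty P_0(X(t+s) = j,\ s \le R_0)\, ds \\
&= \int_0^\infty E_0\left[ 1_{\{X(t+s)=j\}} 1_{\{s \le R_0\}} \right] ds \\
&= E_0\left[ \int_0^\infty 1_{\{X(t+s)=j\}} 1_{\{s \le R_0\}}\, ds \right],
\end{align*}
where we have used the Markov property for the fourth equality ($\{s \le R_0\} \in \mathcal{F}_s^X$). Therefore,
\begin{align*}
A &= E_0\left[ \int_0^{R_0} 1_{\{X(t+s)=j\}}\, ds \right] = E_0\left[ \int_t^{t+R_0} 1_{\{X(u)=j\}}\, du \right] \\
&= E_0\left[ 1_{\{t \le R_0\}} \int_t^{R_0} 1_{\{X(u)=j\}}\, du \right] - E_0\left[ 1_{\{t > R_0\}} \int_{R_0}^{t} 1_{\{X(u)=j\}}\, du \right] + E_0\left[ \int_{R_0}^{R_0+t} 1_{\{X(u)=j\}}\, du \right].
\end{align*}
From the strong Markov property applied at $R_0$,
\[ E_0\left[ \int_{R_0}^{R_0+t} 1_{\{X(u)=j\}}\, du \right] = E_0\left[ \int_0^{t} 1_{\{X(u)=j\}}\, du \right]. \]
Therefore,
\begin{align*}
A &= E_0\left[ 1_{\{t \le R_0\}} \int_t^{R_0} 1_{\{X(u)=j\}}\, du \right] - E_0\left[ 1_{\{t > R_0\}} \int_{R_0}^{t} 1_{\{X(u)=j\}}\, du \right] + E_0\left[ \int_0^{t} 1_{\{X(u)=j\}}\, du \right] \\
&= E_0\left[ \int_0^{R_0} 1_{\{X(u)=j\}}\, du \right] = \nu(j).
\end{align*}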
(β) We now show uniqueness. For this consider the skeleton chain $\{X(n)\}_{n\ge 0}$. For any state $i$, consider the sequence $Z_1, Z_2, \ldots$ of successive sojourn times in state $i$ of the state process. This sequence is infinite because the imbedded chain is recurrent, and it is iid with exponential distribution of mean $1/q_i$. In particular, the event $\{Z_n > 1\}$ occurs infinitely often, and this implies that $\{X(n) = i\}$ also occurs infinitely often. This is true for all states. Therefore, the skeleton is irreducible and recurrent. Consequently it has one and only one (up to a multiplicative factor) invariant measure. Since an invariant measure of the continuous-time chain is an invariant measure of the skeleton, the announced uniqueness of the invariant measure of the continuous-time hmc follows.
(γ) Call $T_0$ the return time to 0 of the imbedded chain. Then
\[ \nu(i) = E_0\left[ \int_0^{R_0} 1_{\{X(s)=i\}}\, ds \right] = E_0\left[ \sum_{n=0}^{T_0 - 1} S_{n+1} 1_{\{X_n = i\}} \right] = E_0\left[ \sum_{n=0}^{\infty} S_{n+1} 1_{\{X_n = i\}} 1_{\{n < T_0\}} \right] \]

[…]

\[ P(N(A_1) > 0, \ldots, N(A_k) > 0, N(B) = 0) = P(N(A_1) > 0, \ldots, N(A_{k-1}) > 0, N(B) = 0) - P(N(A_1) > 0, \ldots, N(A_{k-1}) > 0, N(A_k \cup B) = 0), \]
and for $k = 1$,
\[ P(N(A_1) > 0, N(B) = 0) = P(N(B) = 0) - P(N(B \cup A_1) = 0) = v(B) - v(B \cup A_1). \]

2 Named after Rényi, who introduced the notion and applied it to Poisson processes [Rényi, 1967].
This shows that $P(N(A_1) > 0, \ldots, N(A_k) > 0, N(B) = 0)$ can be recursively computed from the void probability function $v$.

Step 2. Suppose that we have a sequence $K_n = \{K_{n,i}\}_{i=0}^{k_n}$ of nested partitions of $E$ such that for any distinct $x, y \in E$, there exists an $n$ such that $x$ and $y$ belong to two distinct sets of the partition $K_n$ (in other words, the sequence of partitions $\{K_n\}_{n\ge 1}$ eventually separates the points of $E$). Define for $n \ge 1$ and $A \in \mathcal{B}(E)$,
\[ H_n(A) = \sum_{i=0}^{k_n} H(A \cap K_{n,i}), \]
where $H(C) := 1_{\{N(C) > 0\}}$ $(C \subseteq E)$. Let $K_n \cap A := \{A \cap K_{n,i}\}_{i=0}^{k_n}$. Since the sequence of partitions $\{K_n \cap A\}_{n\ge 1}$ of $A$ eventually separates the points of $A$, and since $H_n(A)$ counts the number of sets of $K_n \cap A$ that contain at least one point of the point process, we have in view of the assumed simplicity of the point process
\[ \lim_{n\uparrow\infty} H_n(A) = N(A), \quad \text{a.s.} \]
Step 3. The probability $P(H_n(A) = l)$ can be expressed in terms of the void probability function $v$ alone since (with $A_{n,i} = A \cap K_{n,i}$)
\[ P(H_n(A) = l) = \sum_{\substack{i_0, \ldots, i_{k_n} \in \{0,1\} \\ \sum_{j=0}^{k_n} i_j = l}} P(H(A_{n,0}) = i_0, \ldots, H(A_{n,k_n}) = i_{k_n}) \]
and for $i_0, \ldots, i_{k_n} \in \{0, 1\}$,
\[ P(H(A_{n,0}) = i_0, \ldots, H(A_{n,k_n}) = i_{k_n}) = P\Big( \bigcap_{l;\, i_l = 1} \{N(A_{n,l}) > 0\} \cap \Big\{ N\Big( \bigcup_{m;\, i_m = 0} A_{n,m} \Big) = 0 \Big\} \Big), \]
a quantity which can be expressed in terms of the void function $v$ alone, as we saw in Step 1. More generally, for all $l_1, \ldots, l_k \in \mathbb{N}$, $P(H_n(A_1) = l_1, \ldots, H_n(A_k) = l_k)$ is expressible in terms of the void probability function, and the same is true of
\[ P(H_n(A_1) \le n_1, \ldots, H_n(A_k) \le n_k) \quad (n_1, \ldots, n_k \in \mathbb{N}). \]

Step 4. Finally, observe that
\[ \{H_n(A_1) \le n_1, \ldots, H_n(A_k) \le n_k\} \downarrow \{N(A_1) \le n_1, \ldots, N(A_k) \le n_k\} \]
and therefore
\[ \lim_{n\uparrow\infty} P(H_n(A_1) \le n_1, \ldots, H_n(A_k) \le n_k) = P(N(A_1) \le n_1, \ldots, N(A_k) \le n_k). \]
The proof is now almost done. It just remains to construct the sequence of partitions $\{K_n\}_{n\ge 1}$. Denote by $B(a, r)$ the closed ball of center $a$ and radius $r$. Since $(E, d)$ is separable, there exists a countable set $\{a_1, a_2, \ldots\}$ that is dense in $E$. The first partition $K_1$ consists of two sets
\[ K_{11} := B(a_1, 1), \qquad K_{10} := E \setminus K_{11}. \]
Suppose that we have constructed $K_{n-1}$. The next partition $K_n$ is constructed as follows.
Letting
\[ B_{n,i} := B\big(a_i, 2^{-(n-i)}\big) \quad (i = 1, \ldots, n) \quad \text{and} \quad B_{n,0} := E \setminus \bigcup_{i=1}^{n} B_{n,i}, \]
define a partition $C_n = \{C_{n,i}\}_{i=0}^{n}$ by
\[ C_{n,0} := B_{n,0}; \qquad C_{n,1} := B_{n,1}; \qquad C_{n,i} := B_{n,i} \setminus \bigcup_{j=1}^{i-1} B_{n,j} \quad (i = 2, \ldots, n). \]
In order to obtain the partition $K_n$ nested in $K_{n-1}$, we intersect $C_n$ and $K_{n-1}$, that is,
\[ K_n := \{ C_{n,i} \cap K_{n-1,j} \quad (j = 0, \ldots, k_{n-1},\ i = 0, \ldots, n) \}. \]
Remark 8.1.42 In the case $E = \mathbb{R}^m$, a simple sequence of nested partitions could be the following, say for $m = 1$ for notational ease:
\[ K_n = \big\{ (i 2^{-n}, (i+1) 2^{-n}] \,;\ i \in \mathbb{Z} \big\}. \]

Remark 8.1.43 Note that the assumption of simplicity is necessary in the above result. For instance, doubling the multiplicity of the points of a given point process leaves the avoidance function unchanged.
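Example 8.1.44: The Avoidance Function of a Poisson Cluster Process. Consider the space-homogeneous cluster point process of Example 8.1.25, with the additional specification that the germ $N_0$ is a Poisson process. In order to compute its void probability function $v_N(C) = P(N(C) = 0)$, we observe that
\[ P(N(C) = 0) = \lim_{t\uparrow\infty} E\big[ e^{-tN(C)} \big]. \]
We have
\begin{align*}
E\big[ e^{-tN(C)} \big] &= E\Big[ e^{-t \sum_n Z_n(C - X_n)} \Big] = E\Big[ \prod_n e^{-t Z_n(C - X_n)} \Big] \\
&= E\Big[ E\Big[ \prod_n e^{-t Z_n(C - X_n)} \,\Big|\, \mathcal{F}^{N_0} \Big] \Big] = E\Big[ \prod_n E\big[ e^{-t Z_n(C - X_n)} \mid \mathcal{F}^{N_0} \big] \Big] \\
&= E\Big[ \prod_n E\big[ e^{-t Z_1(C - X_n)} \big] \Big] = E\Big[ \exp\Big\{ \sum_n \log E\big[ e^{-t Z_1(C - X_n)} \big] \Big\} \Big].
\end{align*}
Since $N_0$ is a Poisson process with intensity measure $\nu_0$, the last term of the above sequence of equalities is
\[ \exp\left\{ \int_{\mathbb{R}^m} \Big( E\big[ e^{-t Z_1(C - x)} \big] - 1 \Big)\, \nu_0(dx) \right\}. \tag{$\star$} \]
But
\[ \lim_{t\uparrow\infty} E\big[ e^{-t Z_1(C - x)} \big] - 1 = v_Z(C - x) - 1. \]
Therefore taking the limit as $t \uparrow \infty$ in ($\star$) yields by dominated convergence:
\[ v_N(C) = \exp\left\{ \int_{\mathbb{R}^m} \big( v_Z(C - x) - 1 \big)\, \nu_0(dx) \right\}. \]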
8.2 Unmarked Spatial Poisson Processes

8.2.1 Construction
Recall the definition given in Example 8.1.12.

Definition 8.2.1 Let $\nu$ be a σ-finite measure on $E$. The point process $N$ on $E$ is called a Poisson process on $E$ with intensity measure $\nu$ if

(i) for all finite families of mutually disjoint sets $C_1, \ldots, C_K \in \mathcal{B}(E)$, the random variables $N(C_1), \ldots, N(C_K)$ are independent, and

(ii) for any set $C \in \mathcal{B}(E)$ such that $\nu(C) < \infty$,
\[ P(N(C) = k) = e^{-\nu(C)} \frac{\nu(C)^k}{k!} \quad (k \ge 0). \]

In the case $E = \mathbb{R}^m$, if $\nu$ is of the form $\nu(C) = \int_C \lambda(x)\, dx$ for some nonnegative measurable function $\lambda : \mathbb{R}^m \to \mathbb{R}$, the Poisson process $N$ is said to admit the intensity function $\lambda(x)$. If in addition $\lambda(x) \equiv \lambda$, $N$ is called a homogeneous Poisson process (hpp) on $\mathbb{R}^m$ with intensity or rate $\lambda$.

We now construct the Poisson process. In other terms, we simulate the distribution of a Poisson process on $\mathbb{R}^m$ of given intensity measure $\nu$. The basic result is the following:

Theorem 8.2.2 Let $T$ be a Poisson random variable of mean $\theta$. Let $\{Z_n\}_{n\ge 1}$ be an iid sequence of random elements with values in $E$ and common distribution $Q$. Assume that $T$ is independent of $\{Z_n\}_{n\ge 1}$. The point process $N$ on $E$ defined by
\[ N(C) = \sum_{n=1}^{T} 1_C(Z_n) \quad (C \in \mathcal{B}(E)) \]
is a Poisson process with intensity measure $\nu(\cdot) = \theta \times Q(\cdot)$.

Proof. It suffices to show that for any finite family $C_1, \ldots, C_K$ of pairwise disjoint measurable sets of $E$ with finite $\nu$-measure and all nonnegative reals $t_1, \ldots, t_K$,
\[ E\Big[ e^{-\sum_{j=1}^K t_j N(C_j)} \Big] = \prod_{j=1}^K \exp\big\{ \nu(C_j)(e^{-t_j} - 1) \big\}. \]
We have
\[ \sum_{j=1}^K t_j N(C_j) = \sum_{j=1}^K t_j \Big( \sum_{n=1}^T 1_{C_j}(Z_n) \Big) = \sum_{n=1}^T \Big( \sum_{j=1}^K t_j 1_{C_j}(Z_n) \Big) = \sum_{n=1}^T Y_n, \]
where $Y_n = \sum_{j=1}^K t_j 1_{C_j}(Z_n)$. By Theorem 3.1.55,
\[ E\Big[ e^{-\sum_{n=1}^T Y_n} \Big] = g_T\big( E[e^{-Y_1}] \big), \]
where $g_T$ is the generating function of $T$. Here, since $T$ is Poisson with mean $\theta$, $g_T(z) = \exp\{\theta(z - 1)\}$.
The random variable $Y_1$ takes the values $t_1, \ldots, t_K$ and 0 with the respective probabilities $Q(C_1), \ldots, Q(C_K)$ and $1 - \sum_{j=1}^K Q(C_j)$. Therefore
\[ E[e^{-Y_1}] = \sum_{j=1}^K e^{-t_j} Q(C_j) + 1 - \sum_{j=1}^K Q(C_j) = 1 + \sum_{j=1}^K \big( e^{-t_j} - 1 \big) Q(C_j), \]
from which we get the announced result.
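The construction of Theorem 8.2.2 is a two-stage sampling scheme: draw $T$, then place $T$ iid points with distribution $Q$. The sketch below is one way to try it out, taking $E$ to be the unit square and $Q$ the uniform distribution (both hypothetical choices); the helper `poisson_sample` draws $T$ by counting rate-1 exponential arrivals in $[0, \theta]$, which is an implementation choice and not part of the theorem.

```python
import random


def poisson_sample(theta, rng):
    """Poisson(theta) sample: number of rate-1 exponential arrivals in [0, theta]."""
    n, t = 0, rng.expovariate(1.0)
    while t <= theta:
        n += 1
        t += rng.expovariate(1.0)
    return n


def sample_points(theta, draw_z, rng):
    """Points Z_1, ..., Z_T of the process N of Theorem 8.2.2."""
    return [draw_z(rng) for _ in range(poisson_sample(theta, rng))]


def count(points, indicator):
    """N(C) for the set C described by its indicator function."""
    return sum(1 for z in points if indicator(z))


rng = random.Random(0)
theta = 5.0
draw_uniform_square = lambda r: (r.random(), r.random())   # Q = uniform on [0,1)^2
left_half = lambda z: z[0] < 0.5                            # C with Q(C) = 1/2
right_half = lambda z: z[0] >= 0.5

reps = 2000
mean_left = sum(count(sample_points(theta, draw_uniform_square, rng), left_half)
                for _ in range(reps)) / reps
```

The empirical mean `mean_left` should be close to $\nu(C) = \theta\, Q(C) = 2.5$, up to Monte Carlo error.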
The above is a special case of what is to be done, that is, to construct a Poisson process on $E$ with an intensity measure $\nu$ that is σ-finite (not just finite). Such a measure can be decomposed as
\[ \nu(\cdot) = \sum_{j=1}^\infty \theta_j \times Q_j(\cdot), \]
where the $\theta_j$'s are positive real numbers and the $Q_j$'s are probability distributions on $E$. One can construct independent Poisson processes $N_j$ on $E$ with respective intensity measures $\theta_j Q_j(\cdot)$. The announced result then follows from the following theorem:

Theorem 8.2.3 Let $\nu$ be a σ-finite measure on $E$ of the form $\nu = \sum_{i=1}^\infty \nu_i$, where the $\nu_i$'s $(i \ge 1)$ are σ-finite measures on $E$. Let $N_i$ $(i \ge 1)$ be a family of independent Poisson processes on $E$ with respective intensity measures $\nu_i$ $(i \ge 1)$. Then the point process
\[ N = \sum_{j=1}^\infty N_j \]
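is a Poisson process with intensity measure $\nu$.

Proof. For mutually disjoint measurable sets $C_1, \ldots, C_K$ of finite $\nu$-measures, and nonnegative reals $t_1, \ldots, t_K$,
\begin{align*}
E\Big[ e^{-\sum_{\ell=1}^K t_\ell N(C_\ell)} \Big] &= E\Big[ e^{-\sum_{\ell=1}^K t_\ell \left( \sum_{j=1}^\infty N_j(C_\ell) \right)} \Big] \\
&= E\Big[ e^{-\lim_{n\uparrow\infty} \sum_{\ell=1}^K t_\ell \left( \sum_{j=1}^n N_j(C_\ell) \right)} \Big] \\
&= \lim_{n\uparrow\infty} E\Big[ e^{-\sum_{\ell=1}^K t_\ell \left( \sum_{j=1}^n N_j(C_\ell) \right)} \Big],
\end{align*}
by dominated convergence. But
\begin{align*}
E\Big[ e^{-\sum_{\ell=1}^K t_\ell \left( \sum_{j=1}^n N_j(C_\ell) \right)} \Big] &= E\Big[ e^{-\sum_{j=1}^n \left( \sum_{\ell=1}^K t_\ell N_j(C_\ell) \right)} \Big] \\
&= \prod_{j=1}^n E\Big[ e^{-\sum_{\ell=1}^K t_\ell N_j(C_\ell)} \Big] = \prod_{j=1}^n \prod_{\ell=1}^K E\Big[ e^{-t_\ell N_j(C_\ell)} \Big] \\
&= \prod_{j=1}^n \prod_{\ell=1}^K \exp\big\{ (e^{-t_\ell} - 1)\, \nu_j(C_\ell) \big\} \\
&= \prod_{j=1}^n \exp\Big\{ \sum_{\ell=1}^K (e^{-t_\ell} - 1)\, \nu_j(C_\ell) \Big\} \\
&= \exp\Big\{ \sum_{\ell=1}^K (e^{-t_\ell} - 1) \Big( \sum_{j=1}^n \nu_j(C_\ell) \Big) \Big\}.
\end{align*}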
Letting $n \uparrow \infty$ we obtain, by dominated convergence,
\[ E\Big[ e^{-\sum_{\ell=1}^K t_\ell N(C_\ell)} \Big] = \exp\Big\{ \sum_{\ell=1}^K (e^{-t_\ell} - 1)\, \nu(C_\ell) \Big\}. \]
Therefore $N(C_1), \ldots, N(C_K)$ are independent Poisson random variables with respective means $\nu(C_1), \ldots, \nu(C_K)$.

Theorem 8.2.4 Let $N$ be a Poisson process on $\mathbb{R}^m$ with intensity measure $\nu$.

(a) If $\nu$ is locally finite, then $N$ is locally finite.

(b) If $\nu$ is locally finite and nonatomic, then $N$ is simple.

Proof. (a) If $C$ is a bounded measurable set, it is of finite $\nu$-measure, and therefore $E[N(C)] = \nu(C) < \infty$, which implies that $N(C) < \infty$, $P$-almost surely.

(b) It suffices to show this for a finite intensity measure $\nu(\cdot) = \theta\, Q(\cdot)$, where $\theta$ is a positive real number and $Q$ is a nonatomic probability measure on $\mathbb{R}^m$, and then use the construction of Theorem 8.2.2. In turn, it suffices to show that for each $n \ge 1$,
\[ P(Z_i = Z_j \text{ for some pair } (i,j)\ (1 \le i < j \le n) \mid N(\mathbb{R}^m) = n) = 0. \]
This is the case because for iid vectors $Z_1, \ldots, Z_n$ with a nonatomic probability distribution,
\[ P(Z_i = Z_j \text{ for some pair } (i,j)\ (1 \le i < j \le n)) = 0. \]
Doubly Stochastic Poisson Processes

Doubly stochastic Poisson processes are also called Cox processes.$^3$

Definition 8.2.5 Let $\mathcal{G}$ be a σ-field containing $\mathcal{F}^\nu$, where $\nu$ is a locally finite random measure on $\mathbb{R}^m$. A point process $N$ on $\mathbb{R}^m$ such that, given $\mathcal{G}$, $N$ is a Poisson process on $\mathbb{R}^m$ with the intensity measure $\nu$, is called a doubly stochastic Poisson process with respect to $\mathcal{G}$ with the (conditional) intensity measure $\nu$. If the random measure $\nu$ is of the form $\nu(dx) = \Lambda\, \ell^m(dx)$, where $\Lambda$ is a nonnegative random variable, the corresponding Cox process is also called a mixed Poisson process.
8.2.2 Poisson Process Integrals

The Covariance Formula

Let $N$ be a Poisson process on $E$, with intensity measure $\nu$. Recall Campbell's theorem (Theorem 8.1.20). Let $\varphi : E \to \overline{\mathbb{R}}$ be a $\nu$-integrable measurable function. Then $N(\varphi)$ is a well-defined integrable random variable, and
\[ E\left[ \int_E \varphi(x)\, N(dx) \right] = \int_E \varphi(x)\, \nu(dx). \tag{8.11} \]

3 [Cox, 1955].
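Theorem 8.2.6 Let $N$ be as above. Let $\varphi, \psi : E \to \mathbb{C}$ be two $\nu$-integrable measurable functions such that moreover $|\varphi|^2$ and $|\psi|^2$ are $\nu$-integrable. Then $N(\varphi)$ and $N(\psi)$ are well-defined square-integrable random variables and
\[ \operatorname{cov}\left( \int_E \varphi(x)\, N(dx),\ \int_E \psi(x)\, N(dx) \right) = \int_E \varphi(x) \psi(x)^*\, \nu(dx). \tag{8.12} \]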
Proof. It is enough to consider the case of real functions. First suppose that $\varphi$ and $\psi$ are simple nonnegative Borel functions. We can always assume that
\[ \varphi := \sum_{h=1}^K a_h 1_{C_h}, \qquad \psi := \sum_{h=1}^K b_h 1_{C_h}, \]
where $C_1, \ldots, C_K$ are disjoint measurable subsets of $E$. In particular, $\varphi(x)\psi(x) = \sum_{h=1}^K a_h b_h 1_{C_h}(x)$. Using the facts that if $i \ne j$, $N(C_i)$ and $N(C_j)$ are independent, and that a Poisson random variable with mean $\theta$ has variance $\theta$,
\begin{align*}
E[N(\varphi) N(\psi)] &= \sum_{h,l=1}^K a_h b_l\, E[N(C_h) N(C_l)] \\
&= \sum_{\substack{h,l=1 \\ h \ne l}}^K a_h b_l\, E[N(C_h) N(C_l)] + \sum_{l=1}^K a_l b_l\, E[N(C_l)^2] \\
&= \sum_{\substack{h,l=1 \\ h \ne l}}^K a_h b_l\, E[N(C_h)]\, E[N(C_l)] + \sum_{l=1}^K a_l b_l\, E[N(C_l)^2],
\end{align*}
and therefore
\begin{align*}
E[N(\varphi) N(\psi)] &= \sum_{\substack{h,l=1 \\ h \ne l}}^K a_h b_l\, \nu(C_h) \nu(C_l) + \sum_{l=1}^K a_l b_l\, [\nu(C_l) + \nu(C_l)^2] \\
&= \sum_{h,l=1}^K a_h b_l\, \nu(C_h) \nu(C_l) + \sum_{l=1}^K a_l b_l\, \nu(C_l) \\
&= \nu(\varphi)\nu(\psi) + \nu(\varphi\psi).
\end{align*}
Let now $\varphi, \psi$ be nonnegative and let $\{\varphi_n\}_{n\ge 1}$, $\{\psi_n\}_{n\ge 1}$ be nondecreasing sequences of simple nonnegative functions, with respective limits $\varphi$ and $\psi$. Letting $n$ go to $\infty$ in the equality
\[ E[N(\varphi_n) N(\psi_n)] = \nu(\varphi_n \psi_n) + \nu(\varphi_n)\nu(\psi_n) \]
yields the announced results, by monotone convergence.

We have that for any $\nu$-integrable function $\varphi : E \to \mathbb{R}$,
\[ E[N(\varphi)] = E[N(\varphi^+)] - E[N(\varphi^-)] = \nu(\varphi^+) - \nu(\varphi^-) = \nu(\varphi). \]
Also by the result in the nonnegative case, $E[N(|\varphi|)^2] = \nu(|\varphi|^2) + \nu(|\varphi|)^2 < \infty$. Therefore, since $|N(\varphi)| \le N(|\varphi|)$, $N(\varphi)$ is a square-integrable variable, as well as $N(\psi)$ for the same reasons. Therefore, by Schwarz's inequality, $N(\varphi)N(\psi)$ is integrable. We have
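\begin{align*}
E[N(\varphi) N(\psi)] &= E\big[ \big( N(\varphi^+) - N(\varphi^-) \big) \big( N(\psi^+) - N(\psi^-) \big) \big] \\
&= E[N(\varphi^+) N(\psi^+)] + E[N(\varphi^-) N(\psi^-)] - E[N(\varphi^+) N(\psi^-)] - E[N(\varphi^-) N(\psi^+)] \\
&= \nu(\varphi^+ \psi^+) + \nu(\varphi^+)\nu(\psi^+) + \nu(\varphi^- \psi^-) + \nu(\varphi^-)\nu(\psi^-) \\
&\quad - \big[ \nu(\varphi^+ \psi^-) + \nu(\varphi^+)\nu(\psi^-) \big] - \big[ \nu(\varphi^- \psi^+) + \nu(\varphi^-)\nu(\psi^+) \big] \\
&= \nu(\varphi\psi) + \nu(\varphi)\nu(\psi),
\end{align*}
from which (8.12) follows.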
The Exponential Formula

We now turn to the exponential formula for Poisson processes. (It is sometimes called the second Campbell's formula. However, in this book, the appellation "Campbell's formula" will be reserved for the first one.)

Theorem 8.2.7 Let $N$ be a Poisson process on $E$ with intensity measure $\nu$. Let $\varphi : E \to \mathbb{R}$ be a nonnegative measurable function. Then,
\[ E\Big[ e^{-\int_E \varphi(x)\, N(dx)} \Big] = \exp\left\{ \int_E \big( e^{-\varphi(x)} - 1 \big)\, \nu(dx) \right\} \]
and
\[ E\Big[ e^{\int_E \varphi(x)\, N(dx)} \Big] = \exp\left\{ \int_E \big( e^{\varphi(x)} - 1 \big)\, \nu(dx) \right\}. \]
Proof. We prove the first formula, the proof of the second being similar. Suppose that $\varphi$ is simple and nonnegative: $\varphi = \sum_{h=1}^K a_h 1_{C_h}$ where $C_1, \ldots, C_K$ are mutually disjoint measurable subsets of $E$. Then
\begin{align*}
E\big[ e^{-N(\varphi)} \big] &= E\Big[ e^{-\sum_{h=1}^K a_h N(C_h)} \Big] = E\Big[ \prod_{h=1}^K e^{-a_h N(C_h)} \Big] \\
&= \prod_{h=1}^K E\big[ e^{-a_h N(C_h)} \big] = \prod_{h=1}^K \exp\big\{ (e^{-a_h} - 1)\, \nu(C_h) \big\} \\
&= \exp\Big\{ \sum_{h=1}^K (e^{-a_h} - 1)\, \nu(C_h) \Big\} = \exp\big\{ \nu(e^{-\varphi} - 1) \big\}.
\end{align*}
The formula is therefore true for nonnegative simple functions. Take now a nondecreasing sequence $\{\varphi_n\}_{n\ge 1}$ of such functions converging to $\varphi$. For all $n \ge 1$,
\[ E\big[ e^{-N(\varphi_n)} \big] = \exp\big\{ \nu(e^{-\varphi_n} - 1) \big\}. \]
By monotone convergence, the limit as $n$ tends to $\infty$ of $N(\varphi_n)$ is $N(\varphi)$. Consequently, by dominated convergence, the limit of the left-hand side is $E[e^{-N(\varphi)}]$. The function $g_n = -(e^{-\varphi_n} - 1)$ is a nonnegative function increasing to $g = -(e^{-\varphi} - 1)$, and therefore, by monotone convergence, $\nu(e^{-\varphi_n} - 1) = -\nu(g_n)$ converges to $\nu(e^{-\varphi} - 1) = -\nu(g)$, which in turn implies that the right-hand side of the last displayed equality tends to $\exp\{\nu(e^{-\varphi} - 1)\}$ as $n$ tends to $\infty$.
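Remark 8.2.8 The covariance formula can of course be obtained from the exponential formula by differentiation of $t \mapsto E[e^{-tN(\varphi)}]$.

Example 8.2.9: The Maximum Formula. Let $N$ be a simple Poisson process on $E$ with intensity measure $\nu$ and let $\varphi : E \to \mathbb{R}$. Then
\[ P\Big( \sup_{n\in\mathbb{N}} \varphi(X_n) \le a \Big) = \exp\left\{ -\int_E 1_{\{\varphi(x) > a\}}\, \nu(dx) \right\}. \]
A direct proof based on the construction of Poisson processes in Subsection 8.2.1 is possible (Exercise 8.5.21). We take another path and first prove that
\[ \lim_{\theta\uparrow\infty} E\Big[ e^{-\theta \sum_{n\in\mathbb{N}} 1_{\{\varphi(X_n) > a\}}} \Big] = P\Big( \sup_{n\in\mathbb{N}} \varphi(X_n) \le a \Big). \tag{$\star$} \]
Indeed, the sum $\sum_{n\in\mathbb{N}} 1_{\{\varphi(X_n) > a\}}$ is strictly positive, except when $\sup_{n\in\mathbb{N}} \varphi(X_n) \le a$, in which case it is null. Therefore
\[ \lim_{\theta\uparrow\infty} e^{-\theta \sum_{n\in\mathbb{N}} 1_{\{\varphi(X_n) > a\}}} = 1_{\{\sup_{n\in\mathbb{N}} \varphi(X_n) \le a\}}. \]
Taking expectations yields ($\star$), by dominated convergence. Now, by Theorem 8.2.7,
\begin{align*}
E\Big[ e^{-\theta \sum_{n\in\mathbb{N}} 1_{\{\varphi(X_n) > a\}}} \Big] &= \exp\left\{ \int_E \Big( e^{-\theta 1_{\{\varphi(x) > a\}}} - 1 \Big)\, \nu(dx) \right\} \\
&= \exp\left\{ \big( e^{-\theta} - 1 \big) \int_E 1_{\{\varphi(x) > a\}}\, \nu(dx) \right\}
\end{align*}
and the limit of the latter quantity as $\theta \uparrow \infty$ is $\exp\big\{ -\int_E 1_{\{\varphi(x) > a\}}\, \nu(dx) \big\}$.

Example 8.2.10: The Laplace Functional of a Poisson Process. According to Theorem 8.2.7, the Laplace functional of a Poisson process $N$ on $E$ with intensity measure $\nu$ is
\[ L_N(\varphi) = \exp\big\{ \nu\big( e^{-\varphi} - 1 \big) \big\}. \]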
Theorem 8.2.11 Let $N_i$ $(i \in J)$ be a finite collection of simple point processes on $E$. If for any collection $\varphi_i : E \to \mathbb{R}_+$ $(i \in J)$ of nonnegative measurable functions,
\[ E\Big[ e^{-\sum_{i\in J} N_i(\varphi_i)} \Big] = \prod_{i\in J} \exp\left\{ \int_E \big( e^{-\varphi_i(x)} - 1 \big)\, \nu_i(dx) \right\}, \tag{8.13} \]
where $\nu_i$, $i \in J$, is a collection of σ-finite measures on $E$, then $N_i$, $i \in J$, is a family of independent Poisson processes with respective intensity measures $\nu_i$, $i \in J$.

Proof. Taking all the $\varphi_i$'s identically null except the first one, we have
\[ E\big[ e^{-N_1(\varphi_1)} \big] = \exp\left\{ \int_E \big( e^{-\varphi_1(x)} - 1 \big)\, \nu_1(dx) \right\}, \]
and therefore $N_1$ is a Poisson process with intensity measure $\nu_1$. Similarly, for any $i \in J$, $N_i$ is a Poisson process with intensity measure $\nu_i$. Independence follows from Theorem 8.1.37.
8.3 Marked Spatial Poisson Processes

8.3.1 As Unmarked Poisson Processes
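Let

(α) $N$ be a simple and locally finite point process on $E$, with point sequence $\{X_n\}_{n\in\mathbb{N}}$, and

(β) $\{Z_n\}_{n\in\mathbb{N}}$ be a sequence of random elements taking their values in the measurable space $(K, \mathcal{K})$.

The sequence $\{X_n, Z_n\}_{n\in\mathbb{N}}$ is a marked point process, with the interpretation that $Z_n$ is the mark associated with the point $X_n$. $N$ is the base point process of the marked point process, and $\{Z_n\}_{n\in\mathbb{N}}$ is the associated sequence of marks. One also calls $N$ a simple and locally finite point process on $E$ with marks $\{Z_n\}_{n\in\mathbb{N}}$ in $K$. If moreover

(1) $N$ is a Poisson process with intensity measure $\nu$,

(2) $\{Z_n\}_{n\in\mathbb{N}}$ is an iid sequence, and

(3) $\{Z_n\}_{n\in\mathbb{N}}$ and $N$ are independent,

the corresponding marked point process is called a Poisson process on $E$ with independent iid marks. This model can be slightly generalized by allowing the mark distribution to depend on the location of the marked point. More precisely, we replace (2) and (3) by

(2′) $\{Z_n\}_{n\in\mathbb{N}}$ is, conditionally on $N$, an independent sequence,

(3′) given $X_n$, the random vector $Z_n$ is independent of $X_k$ $(k \in \mathbb{N},\ k \ne n)$, and

(4′) for all $n \in \mathbb{N}$ and all $L \in \mathcal{K}$, $P(Z_n \in L \mid X_n = x) = Q(x, L)$,

where $Q(\cdot, \cdot)$ is a stochastic kernel from $(E, \mathcal{B}(E))$ to $(K, \mathcal{K})$, that is, $Q$ is a function from $E \times \mathcal{K}$ to $[0, 1]$ such that for all $L \in \mathcal{K}$ the map $x \mapsto Q(x, L)$ is measurable, and for all $x \in E$, $Q(x, \cdot)$ is a probability measure on $(K, \mathcal{K})$.

Theorem 8.3.1 Let $\{X_n, Z_n\}_{n\in\mathbb{N}}$ be as in (α) and (β) above, and define the point process $\tilde{N}$ on $E \times K$ by
\[ \tilde{N}(A) = \sum_{n\in\mathbb{N}} 1_A(X_n, Z_n) \quad (A \in \mathcal{B}(E) \otimes \mathcal{K}). \tag{8.14} \]
If conditions (1), (2′), (3′), and (4′) above are satisfied, then $\tilde{N}$ is a simple Poisson process with intensity measure $\tilde{\nu}$ given by
\[ \tilde{\nu}(C \times L) = \int_C Q(x, L)\, \nu(dx) \quad (C \in \mathcal{B}(E),\ L \in \mathcal{K}). \]

Proof. In view of Theorem 8.1.34, it suffices to show that the Laplace transform of $\tilde{N}$ has the appropriate form, that is, for any nonnegative measurable function $\tilde{\varphi} : E \times K \to \mathbb{R}$,
\[ E\big[ e^{-\tilde{N}(\tilde{\varphi})} \big] = \exp\left\{ \int_E \int_K \big( e^{-\tilde{\varphi}(t,z)} - 1 \big)\, \tilde{\nu}(dt \times dz) \right\}. \]
By dominated convergence,
\[ E\big[ e^{-\tilde{N}(\tilde{\varphi})} \big] = E\Big[ e^{-\sum_{n\in\mathbb{N}} \tilde{\varphi}(X_n, Z_n)} \Big] = \lim_{L\uparrow\infty} E\Big[ e^{-\sum_{n\le L} \tilde{\varphi}(X_n, Z_n)} \Big]. \]
For the time being, fix a positive integer $L$. Then, taking into account assumptions (2′) and (3′),
\begin{align*}
E\Big[ e^{-\sum_{n\le L} \tilde{\varphi}(X_n, Z_n)} \Big] &= E\Big[ \prod_{n\le L} e^{-\tilde{\varphi}(X_n, Z_n)} \Big] \\
&= E\Big[ E\Big[ \prod_{n\le L} e^{-\tilde{\varphi}(X_n, Z_n)} \,\Big|\, X_j,\ j \le L \Big] \Big] \\
&= E\Big[ e^{-\sum_{n\le L} \psi(X_n)} \Big],
\end{align*}
where $\psi(x) := -\log \int_K e^{-\tilde{\varphi}(x,z)}\, Q(x, dz)$, a nonnegative function. Letting $L \uparrow \infty$, we have, by dominated convergence,
\begin{align*}
E\big[ e^{-\tilde{N}(\tilde{\varphi})} \big] &= E\Big[ e^{-\sum_{n\in\mathbb{N}} \psi(X_n)} \Big] = E\big[ e^{-N(\psi)} \big] \\
&= \exp\left\{ \int_E \big( e^{-\psi(x)} - 1 \big)\, \nu(dx) \right\} \\
&= \exp\left\{ \int_E \left[ \int_K e^{-\tilde{\varphi}(x,z)}\, Q(x, dz) - 1 \right] \nu(dx) \right\} \\
&= \exp\left\{ \int_E \left[ \int_K \big( e^{-\tilde{\varphi}(x,z)} - 1 \big)\, Q(x, dz) \right] \nu(dx) \right\} \\
&= \exp\left\{ \int_E \int_K \big( e^{-\tilde{\varphi}(x,z)} - 1 \big)\, \tilde{\nu}(dx \times dz) \right\}.
\end{align*}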
Example 8.3.2: The M/GI/∞ Model, take 1. The model of this example is of interest in queueing theory and in the traffic analysis of communications networks. We adopt the queueing interpretation. Let $N$ be an hpp on $\mathbb{R}$ with intensity $\lambda$, and $\{\sigma_n\}_{n\in\mathbb{Z}}$ be an iid sequence of random variables taking their values in $\mathbb{R}_+$ with probability distribution $Q$. Assume moreover that $\{\sigma_n\}_{n\in\mathbb{Z}}$ and $N$ are independent. The $n$th event time of $N$, $T_n$, is the arrival time of the $n$th customer, and $\sigma_n$ is her service time request. Define the point process $\tilde{N}$ on $\mathbb{R} \times \mathbb{R}_+$ by
\[ \tilde{N}(C) = \sum_{n\in\mathbb{Z}} 1_C(T_n, \sigma_n) \]
for all $C \in \mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathbb{R}_+)$. According to Theorem 8.3.1, $\tilde{N}$ is a simple Poisson process with intensity measure
\[ \tilde{\nu}(dt \times dz) = \lambda\, dt \times Q(dz). \]
In the M/GI/∞ model,$^4$ a customer arriving at time $T_n$ is immediately served, and therefore departs from the "system" at time $T_n + \sigma_n$. The number $X(t)$ of customers present in the system at time $t$ is therefore given by the formula
\[ X(t) = \sum_{n\in\mathbb{Z}} 1_{(-\infty, t]}(T_n)\, 1_{(t, \infty)}(T_n + \sigma_n). \]

4 "∞" represents the number of servers. This model is sometimes called a "queueing" system, although in reality there is no queueing, since customers are served immediately upon arrival and without interruption. It is in fact a "pure delay" system.
(The nth customer is in the system at time t if and only if she arrived at time Tn ≤ t and departed at time Tn + σn > t.) Assume that the service times have ﬁnite expectation: E [σ1 ] < ∞. Then, for all t ∈ R, X(t) is a Poisson random variable with mean λE [σ1 ]. Proof. Observe that
, (C(t)) , X(t) = N
where C(t) := {(s, σ); s ≤ t, s + σ > t} ⊂ R × R+ . In particular, X(t) is a Poisson random variable with mean  1{s+σ>t} 1{s≤t} ν,(ds × dσ) ν,(C(t)) = R R   + = 1{s+σ>t} 1{s≤t} λ ds × Q(dσ) R R+  1{s+σ>t} Q(dσ) 1{s≤t} λ ds = R+
R

t
=λ −∞ ∞
Q((t − s, +∞)) ds Q((s, +∞)) ds = λ
=λ 0
∞
P (σ1 > s)ds = λE[σ1 ] . 0
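The statement of Example 8.3.2 can be checked numerically. The following is a minimal simulation sketch, not part of the original text: it assumes exponential service times of mean 2 and intensity $\lambda = 3$ (both arbitrary choices), and it truncates the arrival process to a finite window $[-40, 0]$, which neglects customers arriving earlier (their expected number is of order $e^{-20}$ here). The function names `mginf_count` and `estimate` are hypothetical. Since $X(0)$ should be Poisson with mean $\lambda E[\sigma_1] = 6$, both the empirical mean and the empirical variance should be close to 6.

```python
import random

def mginf_count(lam, sample_service, horizon, rng):
    """Number of customers present at time 0; arrivals are simulated on [-horizon, 0]."""
    present = 0
    t = -horizon
    while True:
        t += rng.expovariate(lam)            # next arrival time of the hpp
        if t > 0.0:
            break
        if t + sample_service(rng) > 0.0:    # customer still in service at time 0
            present += 1
    return present

def estimate(lam=3.0, mean_service=2.0, horizon=40.0, runs=4000, seed=1):
    """Empirical mean and variance of X(0) over independent runs."""
    rng = random.Random(seed)
    service = lambda r: r.expovariate(1.0 / mean_service)  # assumed service law
    xs = [mginf_count(lam, service, horizon, rng) for _ in range(runs)]
    mean = sum(xs) / runs
    var = sum((x - mean) ** 2 for x in xs) / (runs - 1)
    return mean, var

mean, var = estimate()
print("empirical mean:", mean, "empirical variance:", var, "theory:", 3.0 * 2.0)
```

Any other service distribution with the same mean could be substituted for the exponential one; the result only depends on $E[\sigma_1]$.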
It can be shown that the departure process $D$ of departure times, defined by
$$D(C) := \sum_{n\in\mathbb{Z}} 1_C(T_n+\sigma_n),$$
is an hpp of intensity $\lambda$ (Exercise 8.5.13).

Formulas such as Campbell's first formula and the Poisson exponential formula are straightforwardly extended to marked point processes. In the situation prevailing in Theorem 8.3.1, consider sums of the type
$$\tilde N(\tilde\varphi) := \sum_{n\in\mathbb{N}} \tilde\varphi(X_n,Z_n) \tag{8.15}$$
for functions $\tilde\varphi : E\times K\to\mathbb{R}$. Note that, denoting by $Z_1(x)$ any random element of $K$ with the distribution $Q(x,dz)$,
$$\tilde\nu(\tilde\varphi) = \int_E\int_K \tilde\varphi(x,z)\,Q(x,dz)\,\nu(dx) = \int_E E\left[\tilde\varphi(x,Z_1(x))\right]\nu(dx),$$
whenever the quantities involved have a meaning. Using this observation, the formulas obtained in the previous subsection can be applied in terms of marked point processes. The corollaries below do not require proofs, since they are reformulations of previous results, namely Theorem 8.2.6, Theorem 8.2.7 and Exercise 7.5.1.

Let $0<p<\infty$. Recall that a measurable function $\tilde\varphi : E\times K\to\mathbb{R}$ (resp. $\to\mathbb{C}$) is said to be in $L^p_{\mathbb{R}}(\tilde\nu)$ (resp. $L^p_{\mathbb{C}}(\tilde\nu)$) if
$$\int_E\int_K |\tilde\varphi(x,z)|^p\,\nu(dx)\,Q(x,dz) < \infty.$$

Corollary 8.3.3 Suppose that $\tilde\varphi\in L^1_{\mathbb{C}}(\tilde\nu)$. Then the sum (8.15) is well defined, and moreover
$$E\left[\sum_{n\in\mathbb{N}}\tilde\varphi(X_n,Z_n)\right] = \int_E E\left[\tilde\varphi(x,Z_1(x))\right]\nu(dx).$$
Let $\tilde\varphi,\tilde\psi : E\times K\to\mathbb{C}$ be two measurable functions in $L^1_{\mathbb{C}}(\tilde\nu)\cap L^2_{\mathbb{C}}(\tilde\nu)$. Then
$$\operatorname{cov}\left(\sum_{n\in\mathbb{N}}\tilde\varphi(X_n,Z_n),\ \sum_{n\in\mathbb{N}}\tilde\psi(X_n,Z_n)\right) = \int_E E\left[\tilde\varphi(x,Z_1(x))\,\tilde\psi(x,Z_1(x))^*\right]\nu(dx).$$

Corollary 8.3.4 Let $\tilde\varphi$ be a nonnegative function from $E\times K$ to $\mathbb{R}$. Then,
$$E\left[e^{-\sum_{n\in\mathbb{N}}\tilde\varphi(X_n,Z_n)}\right] = \exp\left\{\int_E E\left[e^{-\tilde\varphi(x,Z_1(x))} - 1\right]\nu(dx)\right\}.$$
8.3.2 Operations on Poisson Processes

Thinning and Coloring

Thinning is the operation of randomly erasing points of a Poisson process. It is a particular case of the independent coloring operation, whereby the points of a Poisson process are independently colored, with the result of obtaining independent Poisson processes, each one corresponding to a different color.
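Theorem 8.3.5 Consider the situation depicted in Theorem 8.3.1. Let $I$ be an arbitrary index set and let $\{L_i\}_{i\in I}$ be a family of disjoint measurable sets of $K$. Define for each $i\in I$ the simple point process $N_i$ on $\mathbb{R}^m$ by
$$N_i(C) = \sum_{n\in\mathbb{N}} 1_C(X_n)\,1_{L_i}(Z_n).$$
Then the family $N_i$ ($i\in I$) is an independent family of Poisson processes with respective intensity measures $\nu_i$, $i\in I$, where
$$\nu_i(dx) = Q(x,L_i)\,\nu(dx).$$

Proof. According to the definition of independence, it suffices to consider a finite index set $I$. Define the simple point process $\tilde N$ on $\mathbb{R}^m\times K$ as in (8.14). Then $\tilde N$ is a Poisson process with intensity measure $\tilde\nu(C\times L) = \int_C Q(x,L)\,\nu(dx)$. Defining $\tilde\varphi(x,z) = \sum_{i\in I}\varphi_i(x)1_{L_i}(z)$, we have $\sum_{i\in I} N_i(\varphi_i) = \tilde N(\tilde\varphi)$. Therefore, using the disjointness of the sets $L_i$ in the fourth equality,
$$\begin{aligned}
E\left[e^{-\sum_{i\in I}N_i(\varphi_i)}\right] &= E\left[e^{-\tilde N(\tilde\varphi)}\right] \\
&= \exp\left\{\int_{\mathbb{R}^m}\int_K\left(e^{-\tilde\varphi(x,z)}-1\right)\tilde\nu(dx\times dz)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m}\int_K\left(e^{-\sum_{i\in I}\varphi_i(x)1_{L_i}(z)}-1\right)Q(x,dz)\,\nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m}\int_K\sum_{i\in I}\left(e^{-\varphi_i(x)}-1\right)1_{L_i}(z)\,Q(x,dz)\,\nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m}\sum_{i\in I}\left(e^{-\varphi_i(x)}-1\right)Q(x,L_i)\,\nu(dx)\right\} \\
&= \prod_{i\in I}\exp\left\{\int_{\mathbb{R}^m}\left(e^{-\varphi_i(x)}-1\right)Q(x,L_i)\,\nu(dx)\right\}.
\end{aligned}$$
Therefore,
$$E\left[e^{-\sum_{i\in I}N_i(\varphi_i)}\right] = \prod_{i\in I}\exp\left\{\int_{\mathbb{R}^m}\left(e^{-\varphi_i(x)}-1\right)\nu_i(dx)\right\},$$
and the result follows from Theorem 8.2.11.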
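As an illustration of Theorem 8.3.5, here is a small simulation sketch under assumptions that are not in the text: the base process is an hpp of intensity 10 on $[0,1]$, and each point is colored red with probability 0.3 and blue otherwise, independently of everything else (so $Q(x,\cdot)$ does not depend on $x$). The helper names are hypothetical. The theorem predicts that the red and blue counts are independent Poisson variables with means 3 and 7, so the empirical covariance should be close to 0.

```python
import random

def poisson_points(lam, rng):
    """Points of an hpp of intensity lam on [0, 1], built from exponential gaps."""
    pts, t = [], rng.expovariate(lam)
    while t <= 1.0:
        pts.append(t)
        t += rng.expovariate(lam)
    return pts

def colored_counts(lam, p_red, rng):
    """Color each point red with probability p_red, blue otherwise."""
    red = blue = 0
    for _ in poisson_points(lam, rng):
        if rng.random() < p_red:
            red += 1
        else:
            blue += 1
    return red, blue

def summarize(lam=10.0, p_red=0.3, runs=20000, seed=7):
    """Empirical means of the two counts and their empirical covariance."""
    rng = random.Random(seed)
    samples = [colored_counts(lam, p_red, rng) for _ in range(runs)]
    m_red = sum(r for r, _ in samples) / runs
    m_blue = sum(b for _, b in samples) / runs
    cov = sum((r - m_red) * (b - m_blue) for r, b in samples) / (runs - 1)
    return m_red, m_blue, cov

print(summarize())
```

A vanishing covariance is of course only a necessary condition for independence; the sketch is a sanity check, not a proof.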
Transportation

This is the operation of moving the points of a Poisson process. More precisely, consider the situation depicted in Theorem 8.3.1. Form a point process $N^*$ on $K$ by associating to a point $X_n\in\mathbb{R}^m$ a point $Z_n\in K$:
$$N^*(L) := \sum_{n\in\mathbb{N}} 1_L(Z_n),$$
where $L\in\mathcal{K}$. We then say that $N^*$ is obtained by transporting $N$ via the stochastic kernel $Q(x,\cdot)$.

Theorem 8.3.6 $N^*$ is a Poisson process on $K$ with intensity measure $\nu^*$ given by
$$\nu^*(L) = \int_{\mathbb{R}^m}\nu(dx)\,Q(x,L).$$

Proof. Let $\varphi^* : K\to\mathbb{R}$ be a nonnegative measurable function. We have
$$\begin{aligned}
E\left[e^{-N^*(\varphi^*)}\right] &= E\left[e^{-\sum_{n\in\mathbb{N}}\varphi^*(Z_n)}\right] \\
&= \exp\left\{\int_{\mathbb{R}^m}\int_K\left(e^{-\varphi^*(z)}-1\right)\nu(dx)\,Q(x,dz)\right\} \\
&= \exp\left\{\int_K\left(e^{-\varphi^*(z)}-1\right)\int_{\mathbb{R}^m}\nu(dx)\,Q(x,dz)\right\} = \exp\left\{\int_K\left(e^{-\varphi^*(z)}-1\right)\nu^*(dz)\right\}.
\end{aligned}$$

Example 8.3.7: Translation. Let $N$ be a Poisson process on $\mathbb{R}^m$ with intensity measure $\nu$ and let $\{V_n\}_{n\in\mathbb{N}}$ be an iid sequence of random vectors of $\mathbb{R}^m$ with common distribution $Q$. Form the point process $N^*$ on $\mathbb{R}^m$ by translating each point $X_n$ of $N$ by $V_n$. Formally,
$$N^*(C) = \sum_{n\in\mathbb{N}} 1_C(X_n+V_n).$$
We are in the situation of Theorem 8.3.6 with $Z_n = X_n+V_n$. In particular, $Q(x,A) = Q(A-x)$. It follows that $N^*$ is a Poisson process on $\mathbb{R}^m$ with intensity measure
$$\nu^*(L) = \int_{\mathbb{R}^m} Q(L-x)\,\nu(dx),$$
the convolution of $\nu$ and $Q$.
Poisson Shot Noise

Let $N$ be a simple and locally finite point process on $\mathbb{R}^m$ with point sequence $\{X_n\}_{n\in\mathbb{N}}$ and with marks $\{Z_n\}_{n\in\mathbb{N}}$ in the measurable space $(K,\mathcal{K})$. Let $h : \mathbb{R}^m\times K\to\mathbb{C}$ be a measurable function. The complex-valued spatial stochastic process $\{X(y)\}_{y\in\mathbb{R}^m}$ given by
$$X(y) := \sum_{n\in\mathbb{N}} h(y-X_n,Z_n), \tag{8.16}$$
where the right-hand side is assumed well defined (for instance, when $h$ takes real nonnegative values), is called a spatial shot noise with random impulse response. If $N$ is a simple and locally finite Poisson process on $\mathbb{R}^m$ with intensity measure $\nu$ and with independent iid marks $\{Z_n\}_{n\in\mathbb{N}}$, then $\{X(y)\}_{y\in\mathbb{R}^m}$ is called a Poisson spatial shot noise with random impulse response and independent iid marks. The following result is a direct application of Theorems 8.2.6 and 8.3.1.

Theorem 8.3.8 Consider the above Poisson spatial shot noise with random impulse response and independent iid marks. Suppose that for all $y\in\mathbb{R}^m$,
$$\int_{\mathbb{R}^m} E\left[|h(y-x,Z_1)|\right]\nu(dx) < \infty$$
and
$$\int_{\mathbb{R}^m} E\left[|h(y-x,Z_1)|^2\right]\nu(dx) < \infty.$$
Then the complex-valued spatial stochastic process $\{X(y)\}_{y\in\mathbb{R}^m}$ given by (8.16) is well defined, and for any $y,\xi\in\mathbb{R}^m$, we have
$$E[X(y)] = \int_{\mathbb{R}^m} E\left[h(y-x,Z_1)\right]\nu(dx)$$
and
$$\operatorname{cov}(X(y+\xi),X(y)) = \int_{\mathbb{R}^m} E\left[h(y-x,Z_1)\,h^*(y+\xi-x,Z_1)\right]\nu(dx).$$

In the case where the base point process $N$ is an hpp with intensity $\lambda$, we find that
$$E[X(y)] = \lambda\int_{\mathbb{R}^m} E\left[h(x,Z_1)\right]dx$$
and
$$\operatorname{cov}(X(y+\xi),X(y)) = \lambda\int_{\mathbb{R}^m} E\left[h(x,Z_1)\,h^*(\xi+x,Z_1)\right]dx.$$
Observe that these quantities do not depend on $y\in\mathbb{R}^m$. The process $\{X(y)\}_{y\in\mathbb{R}^m}$ is for that reason called a wide-sense stationary process (see Chapter 9).
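The mean and variance formulas for the homogeneous case can be checked on a one-dimensional example. The sketch below rests on assumptions that are not in the text: $m = 1$, $\lambda = 2$, marks uniform on $(0,1)$, and impulse response $h(x,z) = z\,e^{-|x|}$. With these choices $E[X(0)] = \lambda E[Z_1]\int e^{-|x|}dx = 2$ and $\operatorname{var}(X(0)) = \lambda E[Z_1^2]\int e^{-2|x|}dx = 2/3$. The process is simulated on the window $[-20,20]$, so contributions of points farther away (of order $e^{-20}$) are neglected; the function names are hypothetical.

```python
import math
import random

def shot_noise_at_zero(lam, half_width, rng):
    """X(0) = sum_n Z_n * exp(-|X_n|) for an hpp on [-half_width, half_width]."""
    total = 0.0
    x = -half_width
    while True:
        x += rng.expovariate(lam)        # next point of the hpp
        if x > half_width:
            break
        z = rng.random()                 # mark Z_n ~ Uniform(0, 1) (assumed law)
        total += z * math.exp(-abs(x))   # h(0 - X_n, Z_n), symmetric kernel
    return total

def shot_noise_moments(lam=2.0, half_width=20.0, runs=5000, seed=21):
    """Empirical mean and variance of X(0)."""
    rng = random.Random(seed)
    xs = [shot_noise_at_zero(lam, half_width, rng) for _ in range(runs)]
    mean = sum(xs) / runs
    var = sum((x - mean) ** 2 for x in xs) / (runs - 1)
    return mean, var

print(shot_noise_moments(), "theory:", (2.0, 2.0 / 3.0))
```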
8.3.3 Change of Probability Measure

Let $(\Omega,\mathcal{F},P)$ be a probability space on which is given a Poisson process $N$ on $\mathbb{R}^m$ with nonatomic and locally finite intensity measure $\nu$. We shall replace the probability $P$ by another probability $\hat P$ in such a way that with respect to this new probability, the same point process $N$ is a Poisson process, but with the intensity measure $\hat\nu$ given by
$$\hat\nu(C) = \int_C \mu(x)\,\nu(dx), \tag{8.17}$$
for some nonnegative measurable function $\mu : \mathbb{R}^m\to\mathbb{R}$.

The Case of Finite Intensity Measures

The above program is first carried out under the following hypotheses:
H1: $\nu$ is a finite measure, and
H2: $\mu$ is $\nu$-integrable (or equivalently $\hat\nu$ is finite).
The change of probability $P\to\hat P$ will be an absolutely continuous one, that is, for all $A\in\mathcal{F}$,
$$\hat P(A) = E[L\,1_A], \tag{8.18}$$
where $L$ is a nonnegative random variable such that
$$E[L] = 1, \tag{8.19}$$
called the Radon–Nikodým derivative of $\hat P$ with respect to $P$, and also denoted $\frac{d\hat P}{dP}$.

Lemma 8.3.9 Under hypotheses H1 and H2, the random variable
$$L := \left(\prod_{n\in\mathbb{N}}\mu(X_n)\right)\exp\left\{-\int_{\mathbb{R}^m}(\mu(x)-1)\,\nu(dx)\right\} \tag{8.20}$$
satisfies (8.19).

Proof. Let $g(x) = \log(\mu(x))$ and decompose this function into its positive and negative part, $g = g_+ - g_-$. By Theorem 8.2.7 we have that
$$E\left[e^{-N(g_-)}\right] = \exp\left\{\int_{\mathbb{R}^m}\left(e^{-g_-(x)}-1\right)\nu(dx)\right\}$$
and
$$E\left[e^{N(g_+)}\right] = \exp\left\{\int_{\mathbb{R}^m}\left(e^{g_+(x)}-1\right)\nu(dx)\right\}.$$
Let $B_1 = \{x\in\mathbb{R}^m\,;\ g(x)>0\}$. By Theorem 8.3.5, the restrictions of $N$ to $B_1$ and $B_2 = \bar B_1$ are independent, and therefore the variables $e^{-N(g_-)}$ and $e^{N(g_+)}$ are independent. In particular, from the two last displays,
$$\begin{aligned}
E\left[\prod_{n\in\mathbb{N}}\mu(X_n)\right] &= E\left[e^{N(\log(\mu))}\right] = E\left[e^{N(g)}\right] = E\left[e^{N(g_+)-N(g_-)}\right] \\
&= E\left[e^{-N(g_-)}e^{N(g_+)}\right] = E\left[e^{-N(g_-)}\right]E\left[e^{N(g_+)}\right] \\
&= \exp\left\{\int_{\mathbb{R}^m}\left(e^{-g_-(x)}-1\right)\nu(dx)\right\}\exp\left\{\int_{\mathbb{R}^m}\left(e^{g_+(x)}-1\right)\nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m}\left(e^{g(x)}-1\right)1_{\{g(x)\le 0\}}\,\nu(dx)\right\}\exp\left\{\int_{\mathbb{R}^m}\left(e^{g(x)}-1\right)1_{\{g(x)>0\}}\,\nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m}\left(e^{g(x)}-1\right)\nu(dx)\right\} = \exp\left\{\int_{\mathbb{R}^m}(\mu(x)-1)\,\nu(dx)\right\}.
\end{aligned}$$
By assumptions H1 and H2 the last quantity is finite and therefore one can divide the first and last terms of the above chain of equalities by it, to obtain (8.19).

Theorem 8.3.10 Under the assumptions H1 and H2, if we define probability $\hat P$ by (8.18) and (8.20), $N$ is under probability $\hat P$ a Poisson process with intensity measure $\hat\nu$ given by (8.17).

Proof. Denote expectation with respect to $\hat P$ by $\hat E$. It suffices to show that the Laplace transform of $N$ under probability $\hat P$ is that of a Poisson process with intensity measure $\hat\nu$, that is, for any bounded nonnegative measurable function $\varphi : \mathbb{R}^m\to\mathbb{R}$,
$$\hat E\left[e^{-N(\varphi)}\right] = \exp\left\{\int_{\mathbb{R}^m}\left(e^{-\varphi(x)}-1\right)\hat\nu(dx)\right\}.$$
But
$$\begin{aligned}
\hat E\left[e^{-N(\varphi)}\right] &= E\left[L\,e^{-N(\varphi)}\right] \\
&= E\left[e^{N(\log(\mu))-\int_{\mathbb{R}^m}(\mu(x)-1)\nu(dx)}\,e^{-N(\varphi)}\right] \\
&= E\left[e^{N(-\varphi+\log(\mu))}\right]e^{-\int_{\mathbb{R}^m}(\mu(x)-1)\nu(dx)} \\
&= \exp\left\{\int_{\mathbb{R}^m}\left(e^{-\varphi(x)+\log(\mu(x))}-1\right)\nu(dx)\right\}\exp\left\{-\int_{\mathbb{R}^m}(\mu(x)-1)\,\nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m}\left(e^{-\varphi(x)}\mu(x)-1\right)\nu(dx)\right\}\exp\left\{-\int_{\mathbb{R}^m}(\mu(x)-1)\,\nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m}\left(e^{-\varphi(x)}-1\right)\mu(x)\,\nu(dx)\right\} = \exp\left\{\int_{\mathbb{R}^m}\left(e^{-\varphi(x)}-1\right)\hat\nu(dx)\right\}.
\end{aligned}$$
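Lemma 8.3.9 and Theorem 8.3.10 lend themselves to a quick importance-sampling check. The following sketch makes assumptions that are not in the text: $N$ is an hpp of intensity $\lambda = 2$ restricted to $[0,1]$ (so $\nu$ is finite), and $\mu(x) = 1+x$, for which $\int(\mu(x)-1)\,\nu(dx) = \lambda/2$ and $\hat\nu([0,1]) = 3\lambda/2$. The function names are hypothetical. The average of $L$ under $P$ should be close to 1, and the $L$-weighted average of $N([0,1])$, which estimates $\hat E[N([0,1])]$, should be close to $\hat\nu([0,1]) = 3$.

```python
import math
import random

def hpp_points(lam, rng):
    """Points of an hpp of intensity lam on [0, 1]."""
    pts, t = [], rng.expovariate(lam)
    while t <= 1.0:
        pts.append(t)
        t += rng.expovariate(lam)
    return pts

def likelihood_ratio(points, lam):
    """L of (8.20) for the assumed density mu(x) = 1 + x on [0, 1]."""
    prod = 1.0
    for x in points:
        prod *= 1.0 + x
    return prod * math.exp(-lam / 2.0)   # integral of (mu - 1) d(nu) equals lam / 2

def check_change_of_measure(lam=2.0, runs=20000, seed=11):
    """Return estimates of E[L] and of E[L * N([0,1])] under P."""
    rng = random.Random(seed)
    sum_l = sum_ln = 0.0
    for _ in range(runs):
        pts = hpp_points(lam, rng)
        l = likelihood_ratio(pts, lam)
        sum_l += l
        sum_ln += l * len(pts)
    return sum_l / runs, sum_ln / runs

print(check_change_of_measure(), "theory:", (1.0, 3.0))
```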
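The Mixed Poisson Case

Let $N$ be a Poisson process on $\mathbb{R}^m$ of finite intensity measure $\nu$ and let $\Lambda$ be a nonnegative random variable independent of $N$, with probability distribution $F$. Let
$$L := \Lambda^{N(\mathbb{R}^m)}\exp\{-(\Lambda-1)\,\nu(\mathbb{R}^m)\}.$$
The arguments of the proof of Lemma 8.3.9 and Theorem 8.3.10 are immediately adaptable to show that $E_P[L] = 1$ and that under the probability measure $\hat P$ defined by $\frac{d\hat P}{dP} = L$, $N$ is a Cox process (here a mixed Poisson process) with $\sigma(\Lambda)$-conditional intensity measure $\Lambda\,\nu(dx)$.

Theorem 8.3.11 Under the above conditions, for any nonnegative function $g : \mathbb{R}_+\to\mathbb{R}$,
$$E_{\hat P}\left[g(\Lambda)\mid\mathcal{F}^N\right] = \frac{\int_{\mathbb{R}_+} g(\lambda)\,\lambda^{N(\mathbb{R}^m)}e^{-\lambda\nu(\mathbb{R}^m)}\,F(d\lambda)}{\int_{\mathbb{R}_+}\lambda^{N(\mathbb{R}^m)}e^{-\lambda\nu(\mathbb{R}^m)}\,F(d\lambda)}. \tag{8.21}$$

Proof. The proof is based on the following fundamental lemma:

Lemma 8.3.12 Let $P$ and $Q$ be two probability measures on the measurable space $(\Omega,\mathcal{F})$ such that $P\ll Q$ and let $L := \frac{dP}{dQ}$. Let $Z$ be a nonnegative random variable. For any sub-σ-field $\mathcal{G}$ of $\mathcal{F}$,
$$E_Q[L\mid\mathcal{G}]\,E_P[Z\mid\mathcal{G}] = E_Q[ZL\mid\mathcal{G}] \quad Q\text{-a.s.} \tag{8.22}$$
or, equivalently,
$$E_P[Z\mid\mathcal{G}] = \frac{E_Q[ZL\mid\mathcal{G}]}{E_Q[L\mid\mathcal{G}]} \quad P\text{-a.s.} \tag{8.23}$$

Proof. By definition of conditional expectation, for all $A\in\mathcal{G}$,
$$\int_A Z\,dP = \int_A E_P[Z\mid\mathcal{G}]\,dP.$$
By definition of $L$ and of conditional expectation again,
$$\int_A Z\,dP = \int_A ZL\,dQ = \int_A E_Q[ZL\mid\mathcal{G}]\,dQ.$$
Also
$$\int_A E_P[Z\mid\mathcal{G}]\,dP = \int_A E_P[Z\mid\mathcal{G}]\,L\,dQ = \int_A E_P[Z\mid\mathcal{G}]\,E_Q[L\mid\mathcal{G}]\,dQ.$$
Therefore
$$\int_A E_Q[ZL\mid\mathcal{G}]\,dQ = \int_A E_P[Z\mid\mathcal{G}]\,E_Q[L\mid\mathcal{G}]\,dQ,$$
which is, since $A$ is arbitrary in $\mathcal{G}$, equivalent to (8.22). Since $P\ll Q$ this equality also holds $P$-a.s. To obtain (8.23), it remains to show that $P(E_Q[L\mid\mathcal{G}] = 0) = 0$. But
$$P(E_Q[L\mid\mathcal{G}] = 0) = \int 1_{\{E_Q[L\mid\mathcal{G}]=0\}}\,dP = \int 1_{\{E_Q[L\mid\mathcal{G}]=0\}}\,L\,dQ = \int 1_{\{E_Q[L\mid\mathcal{G}]=0\}}\,E_Q[L\mid\mathcal{G}]\,dQ = 0.$$

We may now proceed to the proof of Theorem 8.3.11. By Lemma 8.3.12,
$$E_{\hat P}\left[g(\Lambda)\mid\mathcal{F}^N\right] = \frac{E_P\left[g(\Lambda)L\mid\mathcal{F}^N\right]}{E_P\left[L\mid\mathcal{F}^N\right]}$$
and therefore, since under $P$, $N$ and $\Lambda$ are independent,
$$E_{\hat P}\left[g(\Lambda)\mid\mathcal{F}^N\right] = \frac{\int_{\mathbb{R}_+} g(\lambda)\,\lambda^{N(\mathbb{R}^m)}\exp\{-(\lambda-1)\,\nu(\mathbb{R}^m)\}\,F(d\lambda)}{\int_{\mathbb{R}_+}\lambda^{N(\mathbb{R}^m)}\exp\{-(\lambda-1)\,\nu(\mathbb{R}^m)\}\,F(d\lambda)},$$
from which the result follows.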
8.4 The Boolean Model

Stochastic geometry concerns the study of random shapes. The model considered below pertains to a particular sort of stochastic geometry, where the randomness of the shapes is dependent on the positions of the points of an underlying point process. Although there exists a sound mathematical theory of random sets, this theory will not be necessary as long as the random sets considered in the applications are "good" sets fully described by a random vector of finite dimension (circle, disk, polygon, line, segment, etc.). We then resort to what can be called the "poor man's random set theory", in which a random set is a set of the form $S(Z)\subseteq\mathbb{R}^m$ where $Z\in\mathbb{R}^d$ is a random vector and for each $z$, $S(z)$ is a measurable set and $S(Z)$ is also a measurable set. These are minimal requirements. We shall consider real-valued functions of $S$, for instance
$$g(S) = \ell^m(S), \qquad g(S) = 1_{a\in S}, \tag{$\star$}$$
for which the expectation is well defined, as
$$E[g(S)] := E[g(S(Z))] = \int_{\mathbb{R}^d} g(S(z))\,P(Z\in dz). \tag{8.24}$$
We only need to ensure that the function $z\in\mathbb{R}^d\mapsto g(S(z))\in\mathbb{R}$ is measurable and its integral with respect to the distribution of $Z$ well defined (in the examples, this will generally be the case, because $g$ will be a nonnegative function, as in ($\star$)). We shall use for (8.24) the abbreviated notation
$$\int_{\mathcal{S}} g(s)\,Q(ds),$$
thereby pretending that there exists a set $\mathcal{S}$ of shapes with a suitable σ-field $\mathcal{G}$ on it, and an adequate probability distribution $Q$ on $(\mathcal{S},\mathcal{G})$.

Example 8.4.1: Random Disk. In this example, $S$ is the closed disk in $\mathbb{R}^2$ centered on the origin and with radius $Z$, a nonnegative random variable. We have, with $g(S) = \ell^2(S)$,
$$E[g(S)] = E\left[\ell^2(S(Z))\right] = E\left[\pi Z^2\right],$$
and with $g(S) = 1_{a\in S}$,
$$E[g(S)] = E\left[1_{a\in S(Z)}\right] = P(a\in S(Z)) = P(Z\ge|a|).$$

Definition 8.4.2 The capacity functional of the random set $S$ is the function $K\mapsto T_S(K)$ ($K$ compact) defined by
$$T_S(K) := P(S\cap K\ne\emptyset).$$

Example 8.4.3: The Capacity Functional of a Point Process. A simple point process $N$ can also be viewed as a random set $S\equiv N$. In this case the capacity functional is the complement of the void probability function:
$$T_N(K) := P(N\cap K\ne\emptyset) = P(N(K)>0) = 1 - v_N(K).$$
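We now introduce the Boolean model (see footnote 5). Let $N$ be a Poisson process on $\mathbb{R}^m$ with a nonatomic σ-finite intensity measure $\nu$. Denote by $\{X_n\}_{n\in\mathbb{N}}$ its sequence of points. Let now $\{S_n\}_{n\in\mathbb{N}}$ be a sequence of random marks, iid and independent of $N$. Each $S_n$ is a compact random set. The $X_n$'s are called the germs whereas the $S_n$'s are called the grains. Recall the following notations: if $A$ and $B$ are subsets of $\mathbb{R}^m$ and $x\in\mathbb{R}^m$,
$$x+A := \{x+y\,;\ y\in A\}, \qquad A\oplus B := \{x+y\,;\ x\in A,\ y\in B\}.$$

Footnote 5. [Matheron, 1967, 1975], [Serra, 1982].

One of the quantities of interest in applications is the probability of intersection of the random set $\Sigma = \cup_{n\in\mathbb{N}}(X_n+S_n)$ with a given compact set $K\subset\mathbb{R}^m$, that is,
$$T_\Sigma(K) := P(\Sigma\cap K\ne\emptyset).$$
In order to compute this quantity, we first observe that
$$T_\Sigma(K) = P(N(K)\ne 0),$$
where
$$N(K) := \sum_{n\in\mathbb{N}} 1_{(X_n+S_n)\cap K\ne\emptyset}$$
counts the grains that hit $K$. We show that $N(K)$ is a Poisson variable with mean
$$\theta(K) := \int_{\mathbb{R}^m} P((x+S_1)\cap K\ne\emptyset)\,\nu(dx) \tag{$\star\star$}$$
and therefore
$$T_\Sigma(K) = 1-\exp\left\{-\int_{\mathbb{R}^m} P((x+S_1)\cap K\ne\emptyset)\,\nu(dx)\right\}. \tag{8.25}$$

Proof. The Laplace transform of the distribution of $N(K)$ is given by
$$E\left[e^{-tN(K)}\right] = E\left[e^{-t\sum_{n\in\mathbb{N}}1_{(X_n+S_n)\cap K\ne\emptyset}}\right] = E\left[e^{-t\sum_{n\in\mathbb{N}}f(X_n,S_n)}\right],$$
where $t\ge 0$ and $f(x,s) := 1_{(x+s)\cap K\ne\emptyset}$. The formula in Corollary 8.3.4 gives
$$\begin{aligned}
E\left[e^{-t\sum_{n\in\mathbb{N}}f(X_n,S_n)}\right] &= \exp\left\{\int_{\mathbb{R}^m} E\left[e^{-tf(x,S_1)}-1\right]\nu(dx)\right\} \\
&= \exp\left\{\int_{\mathbb{R}^m} E\left[e^{-t1_{(x+S_1)\cap K\ne\emptyset}}-1\right]\nu(dx)\right\} \\
&= \exp\left\{(e^{-t}-1)\int_{\mathbb{R}^m} P((x+S_1)\cap K\ne\emptyset)\,\nu(dx)\right\},
\end{aligned}$$
which is the Laplace transform of a Poisson variable with mean $\theta(K)$.

If the germ process is a homogeneous Poisson process of intensity $\lambda$, the mean value $\theta(K)$ of $N(K)$ takes the form
$$\theta(K) = \lambda E\left[\ell^m((-S_1)\oplus K)\right].$$
Indeed, from ($\star\star$),
$$\theta(K) = \lambda\int_{\mathbb{R}^m} P((x+S_1)\cap K\ne\emptyset)\,dx = \lambda E\left[\int_{\mathbb{R}^m}1_{(x+S_1)\cap K\ne\emptyset}\,dx\right] = \lambda E\left[\int_{\mathbb{R}^m}1_{(-S_1)\oplus K}(x)\,dx\right].$$
Therefore, in the homogeneous case,
$$T_\Sigma(K) = 1-\exp\left\{-\lambda E\left[\ell^m((-S_1)\oplus K)\right]\right\}. \tag{8.26}$$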
Another quantity of interest when the germ point process is an hpp with intensity $\lambda$ is the volume fraction $p$ of the random set $\Sigma$, defined for a Borel set $B$ of positive and finite Lebesgue measure by
$$p := \frac{E\left[\ell^m(\Sigma\cap B)\right]}{\ell^m(B)}.$$
It is independent of $B$, and by translation invariance of the model, it is equal to
$$p = \frac{E\left[\int_B 1_{x\in\Sigma}\,dx\right]}{\ell^m(B)} = P(0\in\Sigma) = P(\Sigma\cap\{0\}\ne\emptyset).$$
Therefore,
$$p = T_\Sigma(\{0\}) = 1-e^{-\int_{\mathbb{R}^m}P((x+S_1)\cap\{0\}\ne\emptyset)\,\lambda\,dx} = 1-e^{-\int_{\mathbb{R}^m}P(-x\in S_1)\,\lambda\,dx} = 1-e^{-\lambda E[\ell^m(S_1)]}.$$
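The volume fraction formula $p = 1-e^{-\lambda E[\ell^m(S_1)]}$ can be illustrated by simulation. The sketch below makes assumptions that are not in the text: $m = 2$, germs forming an hpp of intensity $\lambda = 5$, and grains that are closed disks centered at the origin with radius uniform on $[0.1, 0.3]$, so that $E[\ell^2(S_1)] = \pi E[R^2]$. Since $p = P(0\in\Sigma)$ and no grain has radius larger than 0.3, it is enough to simulate the germs in the window $[-1,1]^2$. The function names are hypothetical, and the Poisson sampler is the classical multiplication method, adequate only for moderate means.

```python
import math
import random

def poisson_sample(mean, rng):
    """Poisson variate by multiplying uniforms until the product drops below exp(-mean)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def origin_covered(lam, half_side, r_min, r_max, rng):
    """True if some disk X_n + S_n with germ in the window contains the origin."""
    n = poisson_sample(lam * (2.0 * half_side) ** 2, rng)
    for _ in range(n):
        x = rng.uniform(-half_side, half_side)
        y = rng.uniform(-half_side, half_side)
        radius = rng.uniform(r_min, r_max)
        if x * x + y * y <= radius * radius:
            return True
    return False

def volume_fraction(lam=5.0, r_min=0.1, r_max=0.3, runs=20000, seed=3):
    """Return (Monte Carlo estimate of p, value given by the closed formula)."""
    rng = random.Random(seed)
    hits = sum(origin_covered(lam, 1.0, r_min, r_max, rng) for _ in range(runs))
    mean_area = math.pi * (r_max ** 3 - r_min ** 3) / (3.0 * (r_max - r_min))  # pi * E[R^2]
    return hits / runs, 1.0 - math.exp(-lam * mean_area)

print(volume_fraction())
```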
The covariance function $C : \mathbb{R}^m\to\mathbb{R}$ of the random set $\Sigma$ is defined by
$$C(x) := P(0\in\Sigma,\ x\in\Sigma).$$
(This is the covariance function, in the usual sense, of the wide-sense stationary stochastic process $\{1_\Sigma(t)\}_{t\in\mathbb{R}^m}$.) In the homogeneous case,
$$C(x) = 2p-1+(1-p)^2\exp\left\{\lambda E\left[\ell^m(S_1\cap(S_1-x))\right]\right\}.$$

Proof.
$$\begin{aligned}
C(x) &= P(0\in\Sigma\cap(\Sigma-x)) \\
&= P(0\in\Sigma)+P(x\in\Sigma)-P(0\in\Sigma\cup(\Sigma-x)) \\
&= 2p-1+P(0\notin\Sigma\cup(\Sigma-x)) \\
&= 2p-1+P(\Sigma\cap\{0,x\}=\emptyset) = 2p-T_\Sigma(\{0,x\}).
\end{aligned}$$
From (8.26), $T_\Sigma(\{0,x\}) = 1-\exp\left\{-\lambda E\left[\ell^m((-S_1)\oplus\{0,x\})\right]\right\}$. Now
$$\begin{aligned}
E\left[\ell^m((-S_1)\oplus\{0,x\})\right] &= E\left[\ell^m((-S_1)\cup(-S_1+x))\right] \\
&= E\left[\ell^m(-S_1)\right]+E\left[\ell^m(-S_1+x)\right]-E\left[\ell^m((-S_1)\cap(-S_1+x))\right] \\
&= 2E\left[\ell^m(S_1)\right]-E\left[\ell^m(S_1\cap(S_1-x))\right].
\end{aligned}$$
Combining the above equalities with the observation that $1-p = \exp\{-\lambda E[\ell^m(S_1)]\}$ gives the announced result.

Example 8.4.4: The Boundary of a Poisson Cluster of Disks. Let $N$ be a homogeneous Poisson process on $\mathbb{R}^2$, of intensity $\lambda$. Let $\{X_n\}_{n\in\mathbb{N}}$ be its sequence of points. Draw around each point $X_n$ a closed disk of radius $a$. The area inside the square $[0,T]\times[0,T]$ that is not covered by a disk is delimited by a curve. We seek to compute its average length, excluding the parts on the boundaries of $[0,T]\times[0,T]$.
The number of disks covering a given point y ∈ R2 is Z(y) =
1{Xn −y0} dD(s) .
Show that $\{X(t)\}_{t\ge 0}$ is the congestion process of an M/M/1/∞ queue with arrival rate $\lambda$ and service times with exponential distribution of mean $\mu$.

Exercise 9.4.4. M/M/1/∞ in heavy traffic
Consider the M/M/1/∞ queue with traffic intensity $\rho$ in equilibrium, and let $X_\rho = X_\rho(0)$ be the value of the congestion process at time 0 (recall that the congestion process at any time $t\ge 0$ has a distribution independent of $t$ since the queue is assumed to be in equilibrium). Show that, as $\rho\uparrow 1$, the random variable $(1-\rho)X_\rho$ converges in distribution to an exponential random variable of mean 1.

Exercise 9.4.5. A generalization of Jackson's network
The following modification of the basic Jackson network is considered. For all $i$, $1\le i\le K$, the server at station $i$ has a speed of service $\varphi_i(n_i)$ when there are $n_i$ customers present in station $i$, where $\varphi_i(k)>0$ for all $k\ge 1$ and $\varphi_i(0) = 0$. The new infinitesimal generator is obtained from the standard one by replacing $\mu_i$ by $\mu_i\varphi_i(n_i)$. Check this and show that if for all $i$, $1\le i\le K$,
$$A_i := 1+\sum_{n_i=1}^{\infty}\frac{\rho_i^{n_i}}{\prod_{k=1}^{n_i}\varphi_i(k)} < \infty,$$
where $\rho_i = \lambda_i/\mu_i$ and $\lambda_i$ is the solution of the traffic equation, then the network is ergodic with stationary distribution
$$\pi(n) = \prod_{i=1}^{K}\pi_i(n_i),$$
where
$$\pi_i(n_i) = \frac{1}{A_i}\,\frac{\rho_i^{n_i}}{\prod_{k=1}^{n_i}\varphi_i(k)}.$$
Exercise 9.4.6. Closed Jackson with a single customer Consider the closed Jackson network of the theory, with N = 1 customer. Let {Y (t)}t≥0 be the process giving the position of this customer, that is, Y (t) = i if she is in station i at time t. Show that {Y (t)}t≥0 is a regular jump hmc, irreducible, and give its stationary distribution. Exercise 9.4.7. ACK Consider the closed Jackson network below where all service times at diﬀerent queues have the same (exponential) distribution of mean 15 . Compute for N = 5 the average time spent by a customer to go from the leftmost point A to the rightmost point B, and the average number of customers passing by A per unit time.
Exercise 9.4.8. Suppressed transitions
Let $\{X(t)\}_{t\ge 0}$ be an irreducible positive recurrent hmc with infinitesimal generator $A$ and stationary distribution $\pi$. Suppose that $(A,\pi)$ is reversible. Let $S$ be a subset of the state space $E$. Define the infinitesimal generator $\tilde A$ by
$$\tilde q_{ij} = \begin{cases}\alpha q_{ij} & \text{if } i\in S,\ j\in E-S,\\ q_{ij} & \text{otherwise}\end{cases}$$
when $i\ne j$. The corresponding hmc is irreducible if $\alpha>0$. If $\alpha = 0$, the state space will be reduced to $S$, to maintain irreducibility. Show that the continuous-time hmc associated to $\tilde A$ admits the stationary distribution $\tilde\pi$ given by
$$\tilde\pi_i = \begin{cases}C\times\pi(i) & \text{if } i\in S,\\ C\times\alpha\pi(i) & \text{if } i\in E-S,\end{cases}$$
where $C$ is a normalizing constant, with the obvious modification when $\alpha = 0$.

Exercise 9.4.9. Loss networks (Kelly's networks) (see footnote 2)
Consider a telecommunications network with $K$ relays. An incoming call chooses a "route" $r$ among a set $R$. The network then reserves the set of relays $r_1,\ldots,r_{k(r)}$ corresponding to this route, and processes the call to destination in an exponential time of mean $\mu_r^{-1}$. The incoming calls with route $r$ form a homogeneous Poisson process of intensity $\lambda_r$. It is assumed that all the relays are useful, in that they are part of at least one route in $R$.

(1) The capacity of the system is for the time being assumed to be infinite, that is, the number $X_r(t)$ of calls on route $r$ at time $t$ can take any integer value. All the usual independence hypotheses are made: the processing times and the Poisson processes are independent. Give the stationary distribution of the continuous-time hmc $\{X(t)\}_{t\ge 0}$, where $X(t) = (X_r(t),\ r\in R)$.

(2) The capacity of the system is now restricted as follows. Consider a given pair $(a,b)$ of relays. It represents a "link" in the network. This link has finite capacity $C_{ab}$. This means that the total number of calls using this link cannot exceed this capacity, with the consequence that an incoming call requiring a route passing through this link will be lost if the link is saturated when it arrives. The process $\{X(t)\}_{t\ge 0}$ therefore has for state space
$$\tilde E = \Big\{n = (n_r\,;\ r\in R)\,;\ \sum_{r\in R,\ (a,b)\in r} n_r\le C_{ab}\ \text{for all links } (a,b)\Big\}.$$
What is the stationary distribution of the chain $\{X(t)\}_{t\ge 0}$?

Footnote 2. [Kelly, 1979].

Exercise 9.4.10. M/GI/1/∞/fifo
Show that the imbedded hmc of an M/GI/1/∞/fifo is irreducible (as long as the service times are not identically null).

Exercise 9.4.11. N(0, τ]
Prove the statement in Subsection 9.3.3 concerning the sequence $\{Z_n\}_{n\ge 1}$, namely, that it is iid, independent of $X_0$ and distributed as $Z = N(0,\tau]$, where $N$ is an hpp with intensity $\mu$, $\tau$ and $N$ are independent and the cdf of $\tau$ is $F$.

Exercise 9.4.12. GI/M/1/∞
Show that the GI/M/1/∞ queue of Subsection 9.3.3 is transient if $\rho>1$.

Exercise 9.4.13. The imbedded hmc of an M/GI/1/∞/fifo queue
Show that the imbedded hmc of an M/GI/1/∞/fifo is irreducible (as long as the service times are not identically null).

Exercise 9.4.14. Constant service times minimize congestion
Show that for a fixed traffic intensity $\rho$, constant service times minimize average congestion in the M/GI/1/∞ fifo queue.

Exercise 9.4.15. Workload of M/M/1/∞
Show that the stationary distribution of the workload process of an M/M/1/∞ queue with arrival rate $\lambda$ and mean service time $\mu^{-1}$ such that $\rho = \frac{\lambda}{\mu} < 1$ is
$$F_W(x) = 1-\rho\exp\{-(\mu-\lambda)x\}.$$
Chapter 10 Renewal and Regenerative Processes

From the purely analytical point of view, renewal theory is concerned with the renewal equation
$$f(t) = g(t)+\int_{[0,t]} f(t-s)\,dF(s),$$
where $F$ is the cumulative distribution function of a finite measure on the positive real line. Its main concern is the asymptotic behavior of the solution $f$ (the existence and uniqueness of which is not a real issue under mild conditions, as we shall see). Once embedded in the framework of point processes, renewal theory becomes a fundamental tool of probability theory that is useful in particular in the study of convergence of regenerative processes, a large and important class of stochastic processes that includes the recurrent continuous-time hmcs and the semi-Markov processes.

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/9783030401832_10

10.1 Renewal Point Processes

10.1.1 The Renewal Measure

Consider an iid sequence $\{S_n\}_{n\ge 1}$ of nonnegative random variables with common cumulative distribution function
$$F(x) := P(S_n\le x).$$
This cdf is called defective when $F(\infty) := P(S_1<\infty) < 1$, and proper when $F(\infty) = 1$. The uninteresting case where $P(S_1 = 0) = 1$ is henceforth eliminated. The above sequence is called the interrenewal sequence. The associated renewal sequence $\{T_n\}_{n\ge 0}$ is defined by
$$T_n := T_{n-1}+S_n \quad (n\ge 1),$$
where the initial delay $T_0$ is a finite nonnegative random variable independent of the interrenewal sequence. When $T_0 = 0$, the renewal sequence is called undelayed. Time $T_n$ is called a renewal time, or an event. The stochastic process
$$N([0,t]) := \sum_{n\ge 0} 1_{\{T_n\le t\}} \quad (t\ge 0)$$
is the counting process of the renewal sequence; $N([0,t])$ counts the number of events in the closed interval $[0,t]$. Note that $T_0\ge 0$ (this convention differs from the usual one, only for this chapter) and that the point at 0 is counted when there is one. Clearly, the random function $t\mapsto N([0,t])$ is almost surely right-continuous and has a limit on the left for each $t>0$, namely $N([0,t))$.
[Figure: the renewal times $T_0$, $T_1$, $T_2$, $T_3$ on the time axis.]
Theorem 10.1.1 For all $t\ge 0$, $E[N([0,t])] < \infty$. In particular, almost surely, $N([0,t]) < \infty$ for all $t\ge 0$.

Proof. It suffices to consider the undelayed case (Exercise 10.5.1). By Markov's inequality,
$$P(T_n\le t) = P\left(e^{-T_n}\ge e^{-t}\right)\le e^t E\left[e^{-T_n}\right].$$
But since the interrenewal sequence is iid,
$$E\left[e^{-T_n}\right] = E\left[e^{-\sum_{k=1}^n S_k}\right] = \prod_{k=1}^n E\left[e^{-S_k}\right] = \alpha^n,$$
with $\alpha = E\left[e^{-S_1}\right] < 1$ since $P(S_1 = 0) < 1$. Therefore
$$E[N([0,t])]-1 = E\left[\sum_{n\ge 1}1_{\{T_n\le t\}}\right] = \sum_{n\ge 1}E\left[1_{\{T_n\le t\}}\right] = \sum_{n\ge 1}P(T_n\le t)\le e^t\sum_{n\ge 1}\alpha^n < \infty.$$
The forward recurrence time process $\{A(t)\}_{t\ge 0}$ and the backward recurrence time process $\{B(t)\}_{t\ge 0}$ are defined as follows. Both processes are right-continuous with left-hand limits. For $n\ge 0$, they have linear trajectories in $(T_n,T_{n+1})$ with respective slopes $-1$ and $+1$, and at a renewal point $T_n$,
$$A(T_n) = T_{n+1}-T_n,\quad A(T_{n+1}-) = 0,\quad B(T_n) = 0,\quad B(T_{n+1}-) = T_{n+1}-T_n.$$
For $0\le t<T_0$, $A(t) = T_0-t$ and $B(t) = t$.

[Figure: The forward recurrence time process.]

[Figure: The backward recurrence time process.]
Deﬁnition 10.1.2 The function R : R+ → R+ deﬁned by R(t) := E[N ([0, t])], where N is the counting process of the undelayed renewal sequence, is called the renewal function.
The renewal function is right-continuous (Exercise 10.5.3) and nondecreasing. Therefore, one can associate with it a unique measure $\mu_R$ on $\mathbb{R}_+$ such that $\mu_R([0,a]) = R(a)$. This measure, called the renewal measure, will sometimes be denoted by $R$, the context avoiding confusion between the measure and its cumulative distribution function. Note that $\mu_R(\{0\}) = R(0) = 1$.

Example 10.1.3: The Poisson Process. Consider the case of exponential inter-event times (that is, $F(t) = 1-e^{-\lambda t}$). The undelayed renewal process is then a homogeneous Poisson process of intensity $\lambda$ to which a point at time 0 is added, and $R(t) = 1+\lambda t$.
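The value $R(t) = 1+\lambda t$ of Example 10.1.3 is easy to confirm by Monte Carlo. The sketch below is only an illustration: the parameter values $\lambda = 2$, $t = 3$ and the function names are arbitrary choices, and the sampler passed as `sample_gap` can be replaced by any other inter-renewal distribution to estimate the corresponding renewal function.

```python
import random

def count_renewals(t, sample_gap, rng):
    """N([0, t]) for the undelayed sequence: the point T_0 = 0 is counted."""
    count, time = 1, sample_gap(rng)
    while time <= t:
        count += 1
        time += sample_gap(rng)
    return count

def renewal_function(t, sample_gap, runs=20000, seed=5):
    """Monte Carlo estimate of R(t) = E[N([0, t])]."""
    rng = random.Random(seed)
    return sum(count_renewals(t, sample_gap, rng) for _ in range(runs)) / runs

lam, t = 2.0, 3.0
estimate_r = renewal_function(t, lambda r: r.expovariate(lam))
print("estimate:", estimate_r, "theory:", 1.0 + lam * t)
```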
It will be convenient to express the renewal function in terms of the common cumulative distribution function of the random variables $S_n$. For this, observe that in the undelayed case $T_n = S_1+\cdots+S_n$ is the sum of $n$ independent random variables with common cumulative distribution function $F$ and therefore
$$P(T_n\le t) = F^{*n}(t), \tag{10.1}$$
where $F^{*n}$ is the $n$-fold convolution of $F$, defined recursively by
$$F^{*0}(t) = 1_{[0,\infty)}(t),\qquad F^{*n}(t) = \int_{[0,t]}F^{*(n-1)}(t-s)\,dF(s)\quad(n\ge 1). \tag{10.2}$$
(The role of 0 in the integration over $[0,t]$ is made precise by the following equality:
$$\int_{[0,t]}\varphi(s)\,dF(s) = \varphi(0)F(0)+\int_{(0,t]}\varphi(s)\,dF(s).)$$
Writing the renewal function as $E[N([0,t])] = E\left[1+\sum_{n\ge 1}1_{\{T_n\le t\}}\right] = 1+\sum_{n\ge 1}P(T_n\le t)$, we obtain the expression:
$$R(t) = \sum_{n=0}^{\infty}F^{*n}(t). \tag{10.3}$$
Theorem 10.1.4
$$P(S_1<\infty)<1 \iff P(N([0,\infty))<\infty) = 1 \iff E[N([0,\infty))]<\infty.$$

Proof. It suffices to prove the theorem in the undelayed case. For all $k\ge 1$,
$$\begin{aligned}
P(N([0,\infty)) = k) &= P(S_1<\infty,\ldots,S_{k-1}<\infty,\ S_k = \infty) \\
&= P(S_1<\infty)\cdots P(S_{k-1}<\infty)\,P(S_k = \infty) = F(\infty)^{k-1}(1-F(\infty)),
\end{aligned}$$
and
$$P(N([0,\infty))<\infty) = \sum_{k=1}^{\infty}F(\infty)^{k-1}(1-F(\infty)).$$
In particular, $P(N([0,\infty))<\infty) = 1$ if $F(\infty)<1$ and $P(N([0,\infty))<\infty) = 0$ if $F(\infty) = 1$. Also, if $F(\infty)<1$,
$$E[N([0,\infty))] = \sum_{k=1}^{\infty}kF(\infty)^{k-1}(1-F(\infty)) = \frac{1}{1-F(\infty)}<\infty,$$
whereas if $F(\infty) = 1$, $E[N([0,\infty))] = \infty$.
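The geometric law found in the proof of Theorem 10.1.4 can be seen in a few lines of simulation. This is a minimal sketch under an assumed value $F(\infty) = 0.75$ (not from the text): only the finiteness of each inter-renewal time matters for the total number of events, so each $S_k$ is reduced to a coin toss. The function names are hypothetical. The estimate should be close to $1/(1-F(\infty)) = 4$.

```python
import random

def total_events(q, rng):
    """N([0, inf)) when each inter-renewal time is finite with probability q."""
    count = 1                    # the point T_0 = 0 of the undelayed sequence
    while rng.random() < q:      # S_k is finite: one more renewal occurs
        count += 1
    return count

def mean_total(q=0.75, runs=20000, seed=9):
    """Monte Carlo estimate of E[N([0, inf))]."""
    rng = random.Random(seed)
    return sum(total_events(q, rng) for _ in range(runs)) / runs

print("estimate:", mean_total(), "theory:", 1.0 / (1.0 - 0.75))
```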
A renewal process (delayed or not) is called recurrent when $P(S_1<\infty) = 1$ ($F$ is proper), and transient when $P(S_1<\infty)<1$ ($F$ is defective). The following result is called the elementary renewal theorem.

Theorem 10.1.5 We have
$$\lim_{t\to\infty}\frac{N([0,t])}{t} = \frac{1}{E[S_1]}\quad P\text{-a.s.}, \tag{10.4}$$
and
$$\lim_{t\to\infty}\frac{E[N([0,t])]}{t} = \frac{1}{E[S_1]}. \tag{10.5}$$

Proof. For the proof of (10.4) see Exercise 10.5.2. Proof of (10.5): The transient case follows from the obvious bound
$$\frac{E[N([0,t])]}{t}\le\frac{E[N([0,\infty))]}{t},$$
since in this case $E[S_1] = \infty$ and $E[N([0,\infty))]<\infty$. For the recurrent case, a proof is required (the conditions of the dominated convergence theorem that would guarantee that (10.4) implies (10.5) are not satisfied). However, by Fatou's lemma
$$\liminf_{t\to\infty}E\left[\frac{N([0,t])}{t}\right]\ge E\left[\liminf_{t\to\infty}\frac{N([0,t])}{t}\right] = \frac{1}{E[S_1]}$$
and therefore it suffices to show that $\limsup_{t\to\infty}E\left[\frac{N([0,t])}{t}\right]\le\frac{1}{E[S_1]}$. Define for finite $c>0$
$$T'_0 := T_0,\quad T'_1 := T'_0+S_1\wedge c,\quad T'_2 := T'_1+S_2\wedge c,\ \ldots$$
that is, $T'_n := T'_{n-1}+S'_n$ where $S'_n := S_n\wedge c$ ($n\ge 1$), and let $N'([0,t]) := \sum_{n\ge 0}1_{\{T'_n\le t\}}$. Since $N'([0,t])\ge N([0,t])$ for all $t\ge 0$,
$$\limsup_{t\to\infty}\frac{E[N([0,t])]}{t}\le\limsup_{t\to\infty}\frac{E[N'([0,t])]}{t}.$$
Observe that $S'_1+\cdots+S'_{N'([0,t])}\le t+c$ and therefore, by Wald's lemma (Exercise 10.5.7),
$$E[S'_1]\,E[N'([0,t])] = E\left[S'_1+\cdots+S'_{N'([0,t])}\right]\le t+c,$$
so that
$$\limsup_{t\to\infty}\frac{E[N'([0,t])]}{t}\le\limsup_{t\to\infty}\left(1+\frac{c}{t}\right)\frac{1}{E[S'_1]} = \frac{1}{E[S'_1]}.$$
Therefore, for all $c>0$,
$$\limsup_{t\to\infty}\frac{E[N([0,t])]}{t}\le\frac{1}{E[S_1\wedge c]}.$$
Since $\lim_{c\uparrow\infty}E[S_1\wedge c] = E[S_1]$, we finally obtain the desired inequality
$$\limsup_{t\to\infty}\frac{E[N([0,t])]}{t}\le\frac{1}{E[S_1]}.$$
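The convergence (10.5) can be observed numerically. The following sketch assumes, purely for illustration, an undelayed renewal sequence with inter-renewal times uniform on $(0,2)$, so that $E[S_1] = 1$ and the limit is 1; the function names are hypothetical. The printed ratios $E[N([0,t])]/t$ should approach 1 as $t$ grows, the point at 0 contributing a correction of order $1/t$.

```python
import random

def renewals_up_to(t, rng):
    """N([0, t]) for an undelayed sequence with Uniform(0, 2) inter-renewal times."""
    count, time = 1, rng.uniform(0.0, 2.0)
    while time <= t:
        count += 1
        time += rng.uniform(0.0, 2.0)
    return count

def mean_rate(t, runs=500, seed=13):
    """Monte Carlo estimate of E[N([0, t])] / t."""
    rng = random.Random(seed)
    return sum(renewals_up_to(t, rng) for _ in range(runs)) / (runs * t)

for horizon in (10.0, 100.0, 1000.0):
    print(horizon, mean_rate(horizon))
```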
Let F : R+ → R+ be a generalized cumulative distribution function on R+ , that is, F (x) = c G(x) where c > 0 and G is the cumulative distribution function of a nonnegative real random variable that is proper (G(∞) = 1).
10.1.2 The Renewal Equation

The basic object of renewal theory is the renewal equation $f = g+f*F$, that is, by definition of the convolution symbol $*$,
$$f(t) = g(t)+\int_{[0,t]}f(t-s)\,dF(s)\quad(t\ge 0), \tag{10.6}$$
where $g : \mathbb{R}_+\to\mathbb{R}$ is a measurable function called the data. If $F(\infty) = 1$ one refers to the renewal equation as a proper renewal equation, or just a renewal equation. The renewal equation is called defective if $F(\infty)<1$ and excessive if $F(\infty)>1$. The first example features the basic method for obtaining renewal equations.

Example 10.1.6: Lifetime of a Transient Renewal Process, Take 1. The lifetime of a renewal sequence is the random variable
$$L := \sup\{T_n\,;\ T_n<\infty\}.$$
Clearly if the renewal process is recurrent, $L$ is almost surely infinite. We therefore consider the transient case, for which $L$ is almost surely finite. In the undelayed case, the function $f(t) = P(L>t)$ satisfies the renewal equation
$$f(t) = F(\infty)-F(t)+\int_{[0,t]}f(t-s)\,dF(s).$$

Proof. Define $\hat S_n = S_{n+1}$ ($n\ge 1$) and let $\{\hat T_n\}_{n\ge 0}$ be the associated undelayed renewal process, whose lifetime is denoted by $\hat L$.

[Figure: a transient renewal sequence with $T_0 = 0$, inter-renewal times $S_1$, $S_2 = \hat S_1$, $S_3 = \hat S_2$, last event $T_3 = L$ (no more events afterwards), and the lifetime $\hat L$ measured from $T_1$.]

Clearly $L$ and $\hat L$ have the same distribution. Also
$$1_{\{L>t\}} = 1_{\{t<T_1\}}1_{\{L>t\}}+1_{\{t\ge T_1\}}1_{\{L>t\}}.$$
Now, on $\{t\ge T_1\}$, $\{L>t\}\equiv\{\hat L>t-T_1\}$ and therefore
$$1_{\{L>t\}} = 1_{\{t<T_1\}}1_{\{L>t\}}+1_{\{t\ge T_1\}}1_{\{\hat L>t-T_1\}}.$$
Taking expectations,
$$P(L>t) = P(L>t,\ t<T_1)+P(\hat L>t-T_1,\ T_1\le t).$$
Since $\hat L$ and $T_1$ are independent ($\hat L$ depends only on $S_2,S_3,\ldots$),
$$P(\hat L>t-T_1,\ T_1\le t) = \int_{[0,t]}P(\hat L>t-s)\,dF(s) = \int_{[0,t]}P(L>t-s)\,dF(s),$$
where we have used the fact that $L$ and $\hat L$ have the same distribution. Also,
$$P(L>t,\ t<T_1) = P(t<T_1,\ T_1<\infty) = P(t<T_1<\infty) = F(\infty)-F(t).$$
Example 10.1.7: The Risk Model, Take 2. This example is a continuation of Example 7.1.2. We find a renewal equation for the probability of ruin corresponding to an initial capital $u$:
$$\Psi(u) := P(u + X(t) < 0 \text{ for some } t > 0). \tag{10.7}$$
This function is nonincreasing. It is convenient to work with the nonruin probability $\Phi(u) := 1 - \Psi(u)$. Of course, $\Phi(u) = 0$ if $u < 0$. If the point process $N$ is stationary with average rate $\lambda$, the average profit of the insurance company at time $t$ is $E[X(t)] = (c - \lambda\mu)t$. As expected, insurance companies prefer that $c - \lambda\mu > 0$ or, equivalently, that
$$\rho := \frac{c - \lambda\mu}{\lambda\mu} = \frac{c}{\lambda\mu} - 1 > 0, \tag{10.8}$$
where $\rho$ is the safety loading. In fact, by the strong law of large numbers, if the safety loading is negative, the probability of ruin is 1 whatever the initial capital. If $N$ is a homogeneous Poisson process,
$$\Phi(u) = \Phi(0) + \frac{\lambda}{c}\int_0^u \Phi(u - z)(1 - G(z))\,dz. \tag{10.9}$$
Proof. Suppose $u \ge 0$. Since ruin cannot occur at a time $< S_1$,
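$$\begin{aligned}
\Phi(u) &= E[\Phi(u + cT_1 - Z_1)] \\
&= \int_{(0,\infty)} \int_0^\infty \Phi(u + cs - z)\,\lambda e^{-\lambda s}\,ds\,dG(z) \\
&= \int_0^\infty \left( \int_{(0,u+cs]} \Phi(u + cs - z)\,dG(z) \right) \lambda e^{-\lambda s}\,ds \\
&= \frac{\lambda}{c}\, e^{\lambda u/c} \int_u^\infty \left( \int_{(0,x]} \Phi(x - z)\,dG(z) \right) e^{-\lambda x/c}\,dx.
\end{aligned}$$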
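The right-hand side is differentiable; differentiation leads to the integro-differential equation
$$\Phi'(u) = \frac{\lambda}{c}\Phi(u) - \frac{\lambda}{c}\int_{(0,u]} \Phi(u - z)\,dG(z). \tag{10.10}$$
Therefore,
$$\begin{aligned}
\Phi(t) - \Phi(0) &= \frac{\lambda}{c}\int_0^t \Phi(u)\,du + \frac{\lambda}{c}\int_0^t \int_{(0,u]} \Phi(u - z)\,d(1 - G(z))\,du \\
&= \frac{\lambda}{c}\int_0^t \Phi(u)\,du + \frac{\lambda}{c}\int_0^t \left\{ \Phi(0)(1 - G(u)) - \Phi(u) + \int_0^u \Phi'(u - z)(1 - G(z))\,dz \right\} du \\
&= \frac{\lambda}{c}\Phi(0)\int_0^t (1 - G(u))\,du + \frac{\lambda}{c}\int_0^t \left( \int_z^t \Phi'(u - z)\,du \right)(1 - G(z))\,dz \\
&= \frac{\lambda}{c}\Phi(0)\int_0^t (1 - G(u))\,du + \frac{\lambda}{c}\int_0^t (1 - G(z))\bigl(\Phi(t - z) - \Phi(0)\bigr)\,dz,
\end{aligned}$$
that is, (10.9).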
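Simple arguments (Exercise 10.5.9) show that in the case of positive safety loading, $\Phi(\infty) = 1$. Letting $u \uparrow \infty$ in (10.9), we have by monotone convergence that $\Phi(\infty) = \Phi(0) + \frac{\lambda\mu}{c}\Phi(\infty)$ and therefore the probability of ruin with zero initial capital is
$$\Psi(0) = \frac{\lambda\mu}{c}.$$
From (10.9), when the safety loading is positive,
$$\begin{aligned}
1 - \Psi(u) &= 1 - \frac{\lambda\mu}{c} + \frac{\lambda}{c}\int_0^u (1 - \Psi(u - z))(1 - G(z))\,dz \\
&= 1 - \frac{\lambda}{c}\left\{ \mu - \int_0^u (1 - G(z))\,dz + \int_0^u \Psi(u - z)(1 - G(z))\,dz \right\},
\end{aligned}$$
that is,
$$\Psi(u) = \frac{\lambda}{c}\int_u^\infty (1 - G(z))\,dz + \frac{\lambda}{c}\int_0^u \Psi(u - z)(1 - G(z))\,dz. \tag{10.11}$$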
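Equation (10.11) can be solved numerically on a grid. The following is a minimal sketch, not a method taken from the text: it assumes exponentially distributed claims with mean $\mu$ (so that $1 - G(z) = e^{-z/\mu}$), discretizes the convolution by the trapezoidal rule, and compares the result with the closed form $\Psi(u) = \frac{\lambda\mu}{c}\exp\{-(1/\mu - \lambda/c)u\}$, which one can check satisfies (10.11) in this special case. The function name `ruin_probability` and the parameter values are illustrative choices.

```python
import math

def ruin_probability(lam, mu, c, u_max, h):
    """Solve (10.11) on the grid 0, h, 2h, ..., u_max by the trapezoidal rule.

    Assumption (illustration only): exponential claims with mean mu,
    i.e. 1 - G(z) = exp(-z / mu).
    """
    n_steps = int(round(u_max / h))
    # Kernel (lam / c) * (1 - G(z)) evaluated on the grid.
    kernel = [lam / c * math.exp(-j * h / mu) for j in range(n_steps + 1)]
    psi = [0.0] * (n_steps + 1)
    for n in range(n_steps + 1):
        u = n * h
        # First term of (10.11): (lam / c) * integral of (1 - G) over (u, infinity).
        data = lam * mu / c * math.exp(-u / mu)
        if n == 0:
            psi[0] = data
            continue
        # Trapezoidal sum for the convolution, leaving out the unknown psi[n] term.
        conv = 0.5 * kernel[n] * psi[0]
        conv += sum(kernel[j] * psi[n - j] for j in range(1, n))
        # Move the psi[n] term (weight h * kernel[0] / 2) to the left-hand side.
        psi[n] = (data + h * conv) / (1.0 - 0.5 * h * kernel[0])
    return psi

lam, mu, c = 1.0, 1.0, 2.0          # hypothetical parameters, safety loading rho = 1
psi = ruin_probability(lam, mu, c, u_max=5.0, h=0.01)
closed_form = lam * mu / c * math.exp(-(1.0 / mu - lam / c) * 5.0)
print(psi[0], psi[-1], closed_form)
```

The first grid value reproduces $\Psi(0) = \lambda\mu/c$ exactly, and the last one should agree with the closed form up to the discretization error of the trapezoidal rule.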
Example 10.1.8: The Lotka–Volterra Population Model. This model features a population of women. A woman of age $a$ gives birth to girls at the rate $\lambda(a)$ (that is, a woman of age $a$ will have on average $\lambda(a)\,da$ daughters in the infinitesimal time interval $(a, a + da)$ of her lifetime). A woman of age $a$ is alive at time $a + t$ with probability $p(a, t)$. At the origin of time there are $f_0(a)\,da$ women of age between $a$ and $a + da$. The birth rate $f(t)$ at time $t \ge 0$ is the sum of the birth rate $r(t)$ at time $t$ due to women born after time 0 and of the birth rate $g(t)$ due to women born before time 0.

Since women of age $a$ at time 0 are $a + t$ years old at time $t$,
$$g(t) = \int_0^\infty f_0(a)\,p(a, t)\,\lambda(a + t)\,da.$$
Women born at time $t - s \ge 0$ contribute by $f(t - s)\,p(0, s)\,\lambda(s)$ to the birth rate at time $t$ and therefore
$$r(t) = \int_0^t f(t - s)\,p(0, s)\,\lambda(s)\,ds.$$
Therefore
$$f(t) = g(t) + \int_0^t f(t - s)\,p(0, s)\,\lambda(s)\,ds.$$
This is a renewal equation (called Lotka's equation) with data $g$ and cumulative distribution function
$$F(t) := \int_0^t p(0, s)\,\lambda(s)\,ds.$$
Note that
$$F(\infty) = \int_0^\infty p(0, s)\,\lambda(s)\,ds$$
is the average number of daughters from a given mother's lifetime, that is, the reproduction rate.
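To see how the three cases (defective, proper, excessive) arise in Lotka's equation, consider the following worked example. The constant fertility rate $\beta$ and constant mortality rate $\delta$ are assumptions made here for illustration; they are not part of the model as stated in the text.

```latex
% Assumptions (illustration only): \lambda(s) = \beta > 0 and p(0, s) = e^{-\delta s},
% with \delta > 0.
\begin{align*}
F(t) &= \int_0^t e^{-\delta s}\,\beta\,ds = \frac{\beta}{\delta}\bigl(1 - e^{-\delta t}\bigr),
&
F(\infty) &= \frac{\beta}{\delta}.
\end{align*}
% Hence Lotka's equation is defective if \beta < \delta,
% proper if \beta = \delta, and excessive if \beta > \delta.
```

Under these assumptions the reproduction rate is simply the ratio of the fertility rate to the mortality rate.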
Theorem 10.1.9 The renewal function $R$ satisfies the so-called fundamental renewal equation
$$R = 1 + R * F. \tag{10.12}$$
Proof. By (10.3),
$$R * F = \Bigl(\sum_{n \ge 0} F^{*n}\Bigr) * F = \sum_{n \ge 0} (F^{*n} * F) = \sum_{n \ge 1} F^{*n} = R - F^{*0} = R - 1.$$
The following simple technical result will be needed later on.

Lemma 10.1.10 For all $t \ge b$, $R([t - b, t]) \le (1 - F(b))^{-1}$.

Proof. By (10.12) and for $t \ge b$,
$$\begin{aligned}
1 &= R(t) - \int_{[0,t]} F(t - s)\,dR(s) = \int_{[0,t]} (1 - F(t - s))\,dR(s) \\
&\ge \int_{[t-b,t]} (1 - F(t - s))\,dR(s) \ge (1 - F(b))\,R([t - b, t]).
\end{aligned}$$
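The bound of Lemma 10.1.10 can be checked by hand in the Poisson case, as in the following sketch. It assumes exponential inter-renewal times of rate $\lambda$, for which the renewal function is $R(t) = 1 + \lambda t$ (the unit mass at $0$ accounts for $T_0 = 0$).

```latex
% Assumption (illustration only): F(t) = 1 - e^{-\lambda t}, so that R(t) = 1 + \lambda t.
\begin{align*}
R([t-b, t]) &=
\begin{cases}
1 + \lambda b & \text{if } t = b \quad (\text{the atom of } R \text{ at } 0 \text{ is included}),\\
\lambda b & \text{if } t > b,
\end{cases}
\\
(1 - F(b))^{-1} &= e^{\lambda b} \;\ge\; 1 + \lambda b .
\end{align*}
```

In this example the bound is attained only in the limit $b \downarrow 0$, and it is far from tight for large $b$.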
Example 10.1.11: The Elementary Renewal Theorem: Correction Term, Take 1. We have seen that $\lim_{t \uparrow \infty} R(t)/t = 1/E[S_1]$. In view of obtaining finer asymptotic results (see Example 10.2.14 below), we study the function
$$f(t) := R(t) - \frac{t}{E[S_1]}.$$
In the case where $E[S_1] < \infty$, $f$ satisfies the renewal equation with data
$$g(t) := \frac{1}{E[S_1]}\int_t^\infty (1 - F(x))\,dx.$$
Proof. Let $E[S_1] := m$. By (10.12),
$$\begin{aligned}
(f * F)(t) &= (R * F)(t) - \frac{1}{m}\int_{[0,t]} (t - s)\,dF(s) \\
&= R(t) - 1 - \frac{1}{m}\int_{[0,t]} (t - s)\,dF(s) \\
&= R(t) - \frac{t}{m} - \left\{ 1 - \frac{t}{m} + \frac{1}{m}\int_{[0,t]} (t - s)\,dF(s) \right\} \\
&= f(t) - \frac{1}{m}\int_t^\infty (1 - F(x))\,dx,
\end{aligned}$$
where the last equality is obtained by the following computations. Integration by parts (Theorem 2.3.12) gives
$$t(1 - F(t)) = \int_0^t (1 - F(s))\,ds - \int_{[0,t]} s\,dF(s),$$
and therefore, since $m = \int_0^\infty (1 - F(s))\,ds$,
$$\begin{aligned}
\frac{1}{m}\int_t^\infty (1 - F(s))\,ds &= \frac{1}{m}\int_0^\infty (1 - F(s))\,ds - \frac{1}{m}\int_0^t (1 - F(s))\,ds \\
&= 1 - \frac{1}{m}\int_0^t (1 - F(s))\,ds = 1 - \frac{1}{m}\left\{ t(1 - F(t)) + \int_{[0,t]} s\,dF(s) \right\} \\
&= 1 - \frac{1}{m}\left\{ t + \int_{[0,t]} (s - t)\,dF(s) \right\} = 1 - \frac{t}{m} + \frac{1}{m}\int_{[0,t]} (t - s)\,dF(s).
\end{aligned}$$
Solution of the Renewal Equation

An expression of the solution of the renewal equation in terms of the renewal function is easy to obtain. Recall the following definition: A function $g : \mathbb{R}_+ \to \mathbb{R}$ is called locally bounded if for all $a \ge 0$, $\sup_{t \in [0,a]} |g(t)| < \infty$.

Theorem 10.1.12 If $F(\infty) \le 1$ and if the measurable data function $g : \mathbb{R}_+ \to \mathbb{R}$ is locally bounded, the renewal equation (10.6) admits a unique locally bounded solution $f : \mathbb{R}_+ \to \mathbb{R}$ given by $f = g * R$, that is,
$$f(t) = \int_{[0,t]} g(t - s)\,dR(s). \tag{10.13}$$
Proof. The function $f = g * R$ is indeed locally bounded since $g$ is locally bounded and $R(t)$ is finite for all $t$. Also
$$f * F = (g * R) * F = g * (R * F) = g * (R - 1) = g * R - g = f - g.$$
Therefore $f$ is indeed a solution of the renewal equation. Let $f_1$ be another locally bounded solution and let $h := f - f_1$. This is a locally bounded function which satisfies $h = h * F$. By iteration, $h = h * F^{*n}$. Therefore, for all $t \ge 0$,
$$|h(t)| \le \Bigl(\sup_{s \in [0,t]} |h(s)|\Bigr) F^{*n}(t).$$
Since $R(t) = \sum_{n \ge 0} F^{*n}(t) < \infty$, we have $\lim_{n \to \infty} F^{*n}(t) = 0$, which implies in view of the last displayed inequality that $h(t) \equiv 0$.

The first asymptotic result on the solution of the renewal equation concerns the defective case:

Theorem 10.1.13 If $F$ is defective and if the measurable data function $g : \mathbb{R}_+ \to \mathbb{R}$ is bounded and has a limit $g(\infty) := \lim_{t \to \infty} g(t)$, the unique locally bounded solution of the renewal equation satisfies
$$\lim_{t \to \infty} f(t) = \frac{g(\infty)}{1 - F(\infty)}.$$
Proof. From previous computations, we have
$$E[N([0, \infty))] = R(\infty) = \frac{1}{1 - F(\infty)},$$
and therefore
$$\frac{g(\infty)}{1 - F(\infty)} = \int_{[0,\infty)} g(\infty)\,dR(s).$$
Also
$$f(t) = \int_{[0,t]} g(t - s)\,dR(s),$$
and therefore
$$f(t) - \frac{g(\infty)}{1 - F(\infty)} = \int_{[0,\infty)} \bigl(g(t - s)1_{\{s \le t\}} - g(\infty)\bigr)\,dR(s).$$
The latter integrand is bounded in absolute value by $2 \sup_{t \ge 0} |g(t)|$, a finite constant. Considered as a function, a constant is integrable with respect to the renewal measure because, in the defective case, the total mass of the renewal measure is $R(\infty) = E[N([0, \infty))] < \infty$. Now for fixed $s \ge 0$, $\lim_{t \to \infty} \bigl(g(t - s)1_{\{s \le t\}} - g(\infty)\bigr) = 0$. Therefore, by dominated convergence, the integral converges to 0 as $t \to \infty$.
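Theorems 10.1.12 and 10.1.13 can be illustrated numerically. The sketch below is only one possible discretization, not a method from the text: the mass of $dF$ on each cell $((k-1)h, kh]$ is lumped at $kh$, which gives a first-order scheme. The defective distribution $F(t) = q(1 - e^{-t})$ with $q = 0.5$ and the constant data $g \equiv 1$ are hypothetical choices; for them the solution is $f = R$, and one can check that $R(t) = 2 - e^{-t/2}$.

```python
import math

def solve_renewal_equation(g, F, t_max, h):
    """Approximate the solution of f = g + f * F on the grid 0, h, ..., t_max.

    The mass of dF on ((k-1)h, kh] is lumped at kh (first-order scheme);
    F(0) = 0 is assumed.
    """
    n_steps = int(round(t_max / h))
    dF = [0.0] + [F(k * h) - F((k - 1) * h) for k in range(1, n_steps + 1)]
    f = []
    for n in range(n_steps + 1):
        # Discrete convolution: sum over k of f((n - k) h) * dF_k.
        conv = sum(f[n - k] * dF[k] for k in range(1, n + 1))
        f.append(g(n * h) + conv)
    return f

q = 0.5                                      # F(infinity), hypothetical value

def F(t):
    return q * (1.0 - math.exp(-t))          # defective exponential cdf

def g(t):
    return 1.0                               # bounded data with g(infinity) = 1

f = solve_renewal_equation(g, F, t_max=20.0, h=0.01)
limit = 1.0 / (1.0 - q)                      # g(infinity) / (1 - F(infinity))
print(f[200], 2.0 - math.exp(-1.0))          # value at t = 2 versus exact R(2)
print(f[-1], limit)                          # value at t = 20 versus the limit
```

With these choices the numerical solution at $t = 20$ should be close to the limit $g(\infty)/(1 - F(\infty)) = 2$ predicted by Theorem 10.1.13.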
10.1.3 Stationary Renewal Processes

By a proper choice of the initial delay, a renewal process can be made stationary, in a sense to be made precise. Consider a renewal process $T_0 = S_0$, $T_1 = S_0 + S_1$, ..., $T_n = S_0 + \cdots + S_n$, where $0 \le S_0 < \infty$. Let $G$ be the cumulative distribution function of the initial delay $S_0 := T_0$ and suppose that $P(S_1 < \infty) = 1$ (the renewal process is proper). As usual, exclude trivialities by imposing the condition $P(S_1 = 0) < 1$. For $t \ge 0$, define
$$S_0(t) := T_{N([0,t])} - t \quad\text{and}\quad S_n(t) := T_{N([0,t])+n} - T_{N([0,t])+n-1} \quad (n \ge 1). \tag{10.14}$$
In particular, $S_n(0) = S_n$ for all $n \ge 0$. Also observe that $S_0(t) = A(t)$, the forward recurrence time at $t$.
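[Figure: a delayed renewal process with inter-renewal times $S_0, S_1, \ldots, S_5$ and, for a time $t$ marked on the axis, the forward recurrence time $S_0(t)$ followed by $S_1(t)$ and $S_2(t)$.]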
Definition 10.1.14 The delayed renewal process is called stationary if the distribution of the sequence $S_0(t), S_1(t), S_2(t), \ldots$ is independent of time $t \ge 0$.

It turns out that $S_0(t)$ is independent of $\{S_n(t)\}_{n \ge 1}$ and that the latter sequence has the same distribution as $\{S_n\}_{n \ge 1}$ (Exercise 10.5.11). Therefore:

Lemma 10.1.15 For a delayed renewal process to be stationary it is necessary and sufficient that for all $t \ge 0$, the distribution of $S_0(t) = A(t)$ be the same as that of $S_0$.

Lemma 10.1.16 If the delayed renewal process is stationary, then necessarily $E[S_1] < \infty$ and $E[N([0, t])] = \frac{t}{E[S_1]}$.

Proof. The measure $M$ on $\mathbb{R}_+$ defined by $M(C) := E[N(C)]$ is translation-invariant and therefore a multiple of the Lebesgue measure $\ell$ (Theorem 2.1.45), that is, $M(C) = K\,\ell(C)$ for some constant $K$, which is finite ($M$ is locally finite) and positive (the renewal process is not empty). By the elementary renewal theorem,
$$K = \lim_{t \uparrow \infty} \frac{E[N((0, t])]}{t} = \frac{1}{E[S_1]},$$
and therefore $\frac{1}{E[S_1]} > 0$, that is, $E[S_1] < \infty$.

Lemma 10.1.17 If the delayed renewal process is stationary, then necessarily $E[S_1] < \infty$ and the distribution of the initial delay $T_0$ is
$$F_0(x) := \frac{1}{E[S_1]}\int_0^x (1 - F(y))\,dy, \tag{10.15}$$
called the stationary forward recurrence time distribution.
Proof. The finiteness of $E[S_1]$ was proved in the previous lemma. For all $u \in \mathbb{R}$, all $t \ge 0$, the following equation
$$e^{iuA(t)} = e^{iuA(0)} + \sum_{n \ge 0} \bigl(e^{iuA(T_n)} - e^{iuA(T_n-)}\bigr) 1_{\{T_n \le t\}} + \int_0^t \frac{d}{ds} e^{iuA(s)}\,ds$$
is obtained by looking at what happens at the event times and between the event times.¹ Observing that $\frac{d}{ds} e^{iuA(s)} = -iu\,e^{iuA(s)}$, $A(0) = S_0$, $A(T_n-) = 0$, $A(T_n) = S_{n+1}$, we therefore have
$$e^{iuA(t)} = e^{iuS_0} + \sum_{n \ge 0} \bigl(e^{iuS_{n+1}} - 1\bigr) 1_{\{T_n \le t\}} - iu\int_0^t e^{iuA(s)}\,ds.$$
Therefore, taking into account the independence of $S_{n+1}$ and $T_n$,
$$E\bigl[e^{iuA(t)}\bigr] = E\bigl[e^{iuS_0}\bigr] + \bigl(E\bigl[e^{iuS_1}\bigr] - 1\bigr) E[N((0, t])] - iu\int_0^t E\bigl[e^{iuA(s)}\bigr]\,ds.$$
By the assumed stationarity, $E[N((0, t])] = \frac{t}{E[S_1]}$ and $E\bigl[e^{iuA(t)}\bigr] = E\bigl[e^{iuS_0}\bigr]$, and therefore
$$\frac{E\bigl[e^{iuS_1}\bigr] - 1}{E[S_1]} - iu\,E\bigl[e^{iuS_0}\bigr] = 0,$$
that is,
$$E\bigl[e^{iuS_0}\bigr] = \frac{E\bigl[e^{iuS_1}\bigr] - 1}{iu\,E[S_1]}.$$
But the right-hand side is the characteristic function of $F_0$, as the following computation shows:
$$\begin{aligned}
\frac{1}{E[S_1]}\int_0^\infty e^{iux}(1 - F(x))\,dx &= \frac{1}{E[S_1]}\int_0^\infty e^{iux} P(S_1 > x)\,dx = \frac{1}{E[S_1]}\int_0^\infty e^{iux} E\bigl[1_{\{S_1 > x\}}\bigr]\,dx \\
&= \frac{1}{E[S_1]} E\left[\int_0^\infty e^{iux} 1_{\{S_1 > x\}}\,dx\right] = \frac{1}{E[S_1]} E\left[\int_0^{S_1} e^{iux}\,dx\right] = \frac{1}{E[S_1]} E\left[\frac{e^{iuS_1} - 1}{iu}\right].
\end{aligned}$$

¹ Or use the following. Let $f : \mathbb{R}_+ \to \mathbb{R}$ be a right-continuous function with left-hand limits, and with a set of discontinuity times that form a sequence $\{t_n\}_{n \ge 1}$ that is strictly increasing on $\mathbb{R}$. This sequence may be finite, even empty. If $n_0 \in \mathbb{N}$ is the cardinality of this sequence, one conventionally lets $t_n = \infty$ for all $n > n_0$. Suppose in addition that on the intervals $[t_n, t_{n+1})$ that lie in $\mathbb{R}_+$,
$$f(t) = f(t_n) + \int_{t_n}^t f'(s)\,ds,$$
for some locally integrable function $f'$ (the derivative). Let now $G : \mathbb{R} \to \mathbb{R}$ be a differentiable function with derivative $G'$. Then for all $[a, b) \subset \mathbb{R}_+$,
$$G(f(b)) = G(f(a)) + \sum_{t_n \in (a,b)} \bigl(G(f(t_n)) - G(f(t_n-))\bigr) + \int_a^b f'(s)\,G'(f(s))\,ds.$$
Theorem 10.1.18 For a delayed renewal process to be stationary, it is necessary and sufficient that $E[S_1] < \infty$ and that $P(T_0 \le x) = F_0(x)$, where $F_0$ is the stationary forward recurrence time distribution (10.15).

Proof. The proof of necessity is contained in Lemmas 10.1.16 and 10.1.17. For sufficiency, we first show that
$$R_{F_0}(t) := E[N([0, t])] = \frac{t}{E[S_1]}$$
(the notation emphasizes the role of the initial delay with cumulative distribution function $F_0$). We have
$$R_{F_0}(t) = E\Bigl[\sum_{n \ge 0} 1_{\{T_n \le t\}}\Bigr] = \sum_{n \ge 0} P(T_n \le t) = F_0(t) + (F_0 * F)(t) + (F_0 * F^{*2})(t) + \cdots,$$
that is, $R_{F_0} = F_0 + R_{F_0} * F$. Therefore, by Theorem 10.1.12, $R_{F_0}$ is the unique locally bounded solution of the renewal equation $f = F_0 + f * F$. It then suffices to show that $f(t) = \frac{t}{E[S_1]}$ is indeed a solution. To verify this, observe that for such $f$,
$$(f * F)(t) = \frac{1}{E[S_1]}\int_{[0,t]} (t - s)\,dF(s)$$
and therefore
$$\begin{aligned}
f(t) - (f * F)(t) &= \frac{t}{E[S_1]} - \frac{1}{E[S_1]}\int_{[0,t]} t\,dF(s) + \frac{1}{E[S_1]}\int_{[0,t]} s\,dF(s) \\
&= \frac{t}{E[S_1]}(1 - F(t)) + \frac{1}{E[S_1]}\int_{[0,t]} s\,dF(s).
\end{aligned}$$
It remains to show that the right-hand side of the above equality is $F_0(t)$. Integration by parts does it:
$$F_0(t) = \frac{1}{E[S_1]}\int_0^t (1 - F(s))\,ds = \frac{1}{E[S_1]}\left\{ t(1 - F(t)) + \int_{[0,t]} s\,dF(s) \right\}.$$
Having proved that $E[N([0, t])] = \frac{t}{E[S_1]}$, we are almost done. From computations in the proof of Lemma 10.1.17, we extract the identity
$$E\bigl[e^{iuA(t)}\bigr] = E\bigl[e^{iuS_0}\bigr] + \bigl(E\bigl[e^{iuS_1}\bigr] - 1\bigr)\frac{t}{E[S_1]} - iu\int_0^t E\bigl[e^{iuA(s)}\bigr]\,ds.$$
The function $z(t) := E\bigl[e^{iuA(t)}\bigr]$ is therefore a solution of the ordinary differential equation
$$\frac{dz}{dt} = -iu\,z + \frac{1}{E[S_1]}\bigl(E\bigl[e^{iuS_1}\bigr] - 1\bigr)$$
with initial condition $z(0) = E\bigl[e^{iuS_0}\bigr] = \frac{1}{iu\,E[S_1]}\bigl(E\bigl[e^{iuS_1}\bigr] - 1\bigr)$, whose unique solution is the constant
$$E\bigl[e^{iuA(t)}\bigr] = E\bigl[e^{iuS_0}\bigr] = \frac{1}{iu\,E[S_1]}\bigl(E\bigl[e^{iuS_1}\bigr] - 1\bigr).$$
Therefore, for all $t \ge 0$, $S_0(t)$ $(= A(t))$ has the same distribution as $S_0$. The conclusion then follows from Lemma 10.1.15.
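The sufficiency part of Theorem 10.1.18 lends itself to a quick Monte Carlo check. The following sketch assumes Uniform(0, 1) inter-renewal times, a choice made here for illustration, so that $E[S_1] = 1/2$ and (10.15) gives $F_0(x) = 2x - x^2$ on $[0, 1]$. The initial delay is sampled from $F_0$ by inversion, and the empirical distribution of $A(t)$ is compared with $F_0$ at a few times $t$. The function names and the seed are arbitrary.

```python
import math
import random

def forward_recurrence(t, rng):
    """Sample A(t) for a delayed renewal process with Uniform(0, 1) inter-renewal times.

    The initial delay T0 is drawn from F0(x) = 2x - x^2 on [0, 1] by inversion:
    F0(x) = 1 - (1 - x)^2 = U  gives  x = 1 - sqrt(1 - U).
    """
    event = 1.0 - math.sqrt(1.0 - rng.random())    # T0 ~ F0
    while event <= t:
        event += rng.random()                      # add the next inter-renewal time
    return event - t                               # first event after t, minus t

def empirical_cdf_at(x, t, n_samples, seed=0):
    """Fraction of simulated values of A(t) that are <= x."""
    rng = random.Random(seed)
    hits = sum(forward_recurrence(t, rng) <= x for _ in range(n_samples))
    return hits / n_samples

def F0(x):
    return 2.0 * x - x * x

for t in (0.0, 0.7, 3.3):
    print(t, empirical_cdf_at(0.3, t, 20000), F0(0.3))
```

If the process is indeed stationary, the empirical values should all be close to $F_0(0.3) = 0.51$, whatever the time $t$, up to Monte Carlo error.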
10.2 The Renewal Theorem

Renewal theory deals mainly with the limiting behavior of the solution of the renewal equation. The theory is rather simple when the renewal process is transient (Theorem 10.1.13) and becomes more involved in the recurrent case. A very simple example will give the flavor of such a result.

Example 10.2.1: The Renewal Theorem for a Poisson Process. If $\{T_n\}_{n \ge 1}$ is a Poisson process of intensity $\lambda > 0$, we know (Theorem 10.1.13) that $R(t) = 1 + \lambda t$. Therefore, the solution of the corresponding renewal equation is, when the data $g$ is nonnegative,
$$f(t) = \int_0^t g(t - s)\,R(ds) = \lambda\int_0^t g(t - s)\,ds = \lambda\int_0^t g(s)\,ds.$$
Here $\lambda = 1/E[S_1]$. If we suppose that $g$ is integrable, then
$$\lim_{t \to \infty} f(t) = \frac{\int_0^\infty g(s)\,ds}{E[S_1]}. \tag{$\star$}$$
Definition 10.2.2 Let $F$ be the cdf of a nonnegative real random variable $S_1$. Both $F$ and $S_1$ are called nonlattice if there is no strictly positive real $a$ such that $\sum_{k \ge 0} P(S_1 = ka) = 1$.

10.2.1 The Key Renewal Theorem

It turns out that for nonlattice distributions ($\star$) is quite general, modulo a mild technical assumption on the data $g$: this function has to be directly Riemann integrable. This is the key renewal theorem (Theorem 10.2.11 below).
Direct Riemann Integrability

Let $g : \mathbb{R}_+ \to \mathbb{R}$ be a nonnegative locally bounded function. Define for each $b > 0$ and each $t \ge 0$,
$$\overline{g}_b(t) = \sup\{g(s);\ nb \le s < (n+1)b\} \quad\text{on } [nb, (n+1)b),$$
$$\underline{g}_b(t) = \inf\{g(s);\ nb \le s < (n+1)b\} \quad\text{on } [nb, (n+1)b).$$
The functions $\overline{g}_b$ and $\underline{g}_b$ are finite constants on the intervals $[nb, (n+1)b)$, for all $n \in \mathbb{N}$, and thus Lebesgue integrable² on bounded intervals.

Definition 10.2.3 The function $g \ge 0$ is said to be Riemann integrable (Ri) on the bounded interval $[0, a]$ if for some (and then for all) $b > 0$, the integral $\int_0^a \overline{g}_b(t)\,dt$ is finite, and
$$\lim_{b \downarrow 0}\left\{ \int_0^a \overline{g}_b(t)\,dt - \int_0^a \underline{g}_b(t)\,dt \right\} = 0. \tag{10.16}$$

² The theory of the Riemann integral predates that of the Lebesgue integral. We do not follow the historical development in the present treatment of Riemann integrals, and use Lebesgue integration theory, in particular the powerful Lebesgue dominated convergence theorem, which allows considerably simpler arguments.

(The fact that, when $\overline{g}_b$ is integrable for some $b > 0$, then $\overline{g}_{b'}$ is also integrable for all $b' > 0$, follows from the inequality
$$\overline{g}_{b'}(t) \le \sum_{k=-n}^{+n} \overline{g}_b(t + kb),$$
where $n = \lceil b'/b \rceil$.)

Theorem 10.2.4 Let $g$ be a Riemann integrable function on the bounded interval $[0, a]$. Then:
(i) $g$ is bounded and almost everywhere continuous on $[0, a]$ and
(ii) the limit
$$\lim_{b \downarrow 0}\int_0^a \overline{g}_b(t)\,dt$$
exists and is finite. This limit is by definition the R-integral (Riemann integral) of $g$ on $[0, a]$. It is denoted by
$$\mathrm{R}\text{-}\!\int_0^a g(t)\,dt = \lim_{b \downarrow 0}\int_0^a \overline{g}_b(t)\,dt.$$
It coincides with the Lebesgue integral of $g$ on $[0, a]$.

Proof. (i) Boundedness of $g$ is clear since $\sup_{x \in [0,a]} g(x) \le b^{-1}\int_0^a \overline{g}_b(t)\,dt$, which is finite by assumption.
We now show that the set of discontinuity points of $g$ has a null Lebesgue measure. Let $\overline{g}(x) = \limsup_{y \to x} g(y)$ and $\underline{g}(x) = \liminf_{y \to x} g(y)$. Both functions are measurable. In fact, more is true: $\overline{g}$ is upper semicontinuous, that is, for all $A \in \mathbb{R}_+$, the set $\{x : \overline{g}(x) \ge A\}$ is closed, while $\underline{g}$ is lower semicontinuous, that is, for all $A \in \mathbb{R}_+$, the set $\{x : \underline{g}(x) \le A\}$ is closed. We omit the (easy) proof of these facts. The set of discontinuity points of $g$ on $[0, a]$, that is, $\{x : \overline{g}(x) > \underline{g}(x)\}$, is therefore measurable. Suppose it is of positive Lebesgue measure. Then, there exists an $\varepsilon > 0$ such that the set $\{x : \overline{g}(x) - \underline{g}(x) > \varepsilon\}$ is also of positive Lebesgue measure, say $\delta$. Since for all $b > 0$ and for almost every $t \in [0, a]$,
$$\overline{g}_b(t) \ge \overline{g}(t) \ge \underline{g}(t) \ge \underline{g}_b(t),$$
it follows that for all $b > 0$,
$$\int_0^a \overline{g}_b(t)\,dt - \int_0^a \underline{g}_b(t)\,dt \ge \varepsilon\delta > 0,$$
which contradicts the assumption of Riemann integrability (10.16).

(ii) At a continuity point $x$ of $g$, it holds that
$$\lim_{b \downarrow 0} \overline{g}_b(x) = \overline{g}(x) = g(x).$$
By (i), this convergence holds almost everywhere. By dominated convergence (with dominating function $\overline{g}_1(x) + \overline{g}_1(x - 1) + \overline{g}_1(x + 1)$),
$$\lim_{b \downarrow 0}\int_0^a \overline{g}_b(t)\,dt = \int_0^a g(t)\,dt.$$
Definition 10.2.5 The function $g \ge 0$ is said to be Riemann integrable on $[0, \infty)$ if
$$\lim_{a \uparrow \infty} \mathrm{R}\text{-}\!\int_0^a g(t)\,dt$$
exists and the limit is then, by definition, its Riemann integral on $[0, \infty)$.

Remark 10.2.6 A major result of Riemann's integration theory is that the Riemann integral on the finite interval $[0, a]$ of a function exists if and only if this function is almost everywhere continuous and bounded on this interval, that is, Property (i) in the above theorem is not only necessary, but also sufficient for Riemann integrability. This result is mainly of theoretical interest and is not needed in this book.

We now turn to the definition of the direct Riemann integral. "Direct" means that this integral on $[0, \infty)$ is not defined as a limit of integrals over finite intervals, but "directly" on $[0, \infty)$.

Definition 10.2.7 The function $g \ge 0$ is said to be directly Riemann integrable (dRi) if $\int_0^\infty \overline{g}_b(t)\,dt < \infty$ for some (and then for all) $b > 0$, and if
$$\lim_{b \downarrow 0}\left\{ \int_0^\infty \overline{g}_b(t)\,dt - \int_0^\infty \underline{g}_b(t)\,dt \right\} = 0.$$
From the definitions, it is clear that for functions vanishing outside a bounded interval, the notions of Riemann integrability and of direct Riemann integrability are the same. Also, the following analog of Theorem 10.2.4 holds for direct Riemann integrability:

Theorem 10.2.8 Let $g \ge 0$ be dRi. Then:
(i) $g$ is bounded, and almost everywhere continuous on $\mathbb{R}_+$.
(ii) The limit
$$\lim_{b \downarrow 0}\int_0^\infty \overline{g}_b(t)\,dt$$
exists and is finite. This limit is, by definition, the dR-integral (direct Riemann integral) of $g$ on $\mathbb{R}_+$, and is denoted by
$$\mathrm{dR}\text{-}\!\int_0^\infty g(t)\,dt = \lim_{b \downarrow 0}\int_0^\infty \overline{g}_b(t)\,dt.$$
It coincides with the Lebesgue integral of $g$ on $\mathbb{R}_+$.

The proof is identical to that of Theorem 10.2.4. The following example features a function that is Riemann integrable, but not directly Riemann integrable.

Example 10.2.9: A Counterexample. Let $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$ be sequences of positive real numbers such that $1/2 > a_1 > a_2 > \cdots$, $\sum_{n \ge 1} b_n = \infty$ and $\sum_{n \ge 1} a_n b_n < \infty$. Let $g$ be null outside the union of the intervals $[n - a_n, n + a_n]$, $n \ge 1$, and such that for all $n \ge 1$, $g(n - a_n) = g(n + a_n) = 0$ and $g(n) = b_n$, and $g$ is linear in the intervals $[n - a_n, n]$ and $[n, n + a_n]$. Then, $g$ is Riemann integrable:
$$\mathrm{R}\text{-}\!\int_0^\infty g(t)\,dt = \sum_{n \ge 1} a_n b_n < \infty.$$
It is however not directly Riemann integrable since $\int_0^\infty \overline{g}_b(t)\,dt = \infty$ for all $b > 0$.
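The counterexample can be made concrete with a short computation. The sketch below uses the particular sequences $a_n = 2^{-(n+1)}$ and $b_n = 1$, chosen here for illustration (they satisfy the three conditions of the example), and compares the integral of $g$ over $[0, N + 1/2]$ with the integral of the upper step function $\overline{g}_b$ over the same interval. The function names are hypothetical.

```python
import math

def tent(t):
    """g of Example 10.2.9 with a_n = 2**-(n + 1) and b_n = 1 (illustrative choice)."""
    n = round(t)                       # nearest integer: the only tent that can contain t
    if n < 1:
        return 0.0
    half_width = 2.0 ** -(n + 1)       # a_n
    return max(0.0, 1.0 - abs(t - n) / half_width)

def riemann_integral(n_tents):
    """Exact integral of g over [0, n_tents + 1/2]: the sum of the areas a_n * b_n."""
    return sum(2.0 ** -(n + 1) for n in range(1, n_tents + 1))

def upper_sum(b, n_tents):
    """Integral over [0, n_tents + 1/2] of the upper step function, for cell width b <= 1."""
    n_cells = int(round((n_tents + 0.5) / b))
    total = 0.0
    for k in range(n_cells):
        left, right = k * b, (k + 1) * b
        # g is piecewise linear, so its supremum over a cell is reached
        # at an endpoint or at a peak (an integer) inside the cell.
        sup = max(tent(left), tent(right))
        peak = math.ceil(left)         # the only integer that can lie in [left, right)
        if 1 <= peak < right:
            sup = max(sup, tent(peak))
        total += b * sup
    return total

for n_tents in (10, 20, 40):
    print(n_tents, riemann_integral(n_tents), upper_sum(0.25, n_tents))
```

The first column of results stays below $\sum_n a_n b_n = 1/2$, while the upper sums grow linearly with the number of tents, which is the mechanism behind the failure of direct Riemann integrability.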
There exist however a few reassuring results:

Theorem 10.2.10
(a) If $g$ is directly Riemann integrable, it is Riemann integrable on $[0, \infty)$ and
$$\mathrm{R}\text{-}\!\int_0^\infty g(t)\,dt = \mathrm{dR}\text{-}\!\int_0^\infty g(t)\,dt.$$
(b) Nonnegative nonincreasing functions are directly Riemann integrable if and only if they are Riemann integrable on $[0, \infty)$.
(c) A nonnegative function that is Riemann integrable on all finite intervals, and such that $\int_0^\infty \overline{g}_1(t)\,dt < \infty$, is directly Riemann integrable. In particular, a nonnegative almost everywhere continuous function such that $\int_0^\infty \overline{g}_1(t)\,dt < \infty$ is directly Riemann integrable.
(d) A nonnegative function that is Riemann integrable and bounded above by a directly Riemann integrable function is directly Riemann integrable.

Proof. (a) Since $g$ is directly Riemann integrable,
$$0 = \lim_{b \downarrow 0}\int_0^\infty (\overline{g}_b(t) - \underline{g}_b(t))\,dt \ge \lim_{b \downarrow 0}\int_0^a (\overline{g}_b(t) - \underline{g}_b(t))\,dt,$$
implying Riemann integrability on $[0, a]$. For all $a > 0$, recalling that $g$ is (Lebesgue) integrable on $[0, +\infty)$,
$$\left| \mathrm{dR}\text{-}\!\int_0^\infty g(t)\,dt - \mathrm{R}\text{-}\!\int_0^a g(t)\,dt \right| = \left| \int_0^\infty g(t)\,dt - \int_0^a g(t)\,dt \right| = \int_a^\infty g(t)\,dt.$$
The right-hand side tends to zero as $a \to \infty$ by dominated convergence.

(b) The necessity follows from (a). In view of proving sufficiency, suppose that the nonnegative nonincreasing function $g$ is Riemann integrable on $[0, +\infty)$. It is in particular (Lebesgue) integrable on $[0, a]$ for all finite $a > 0$, and the integral $\int_0^a g(t)\,dt$ admits a finite limit as $a \to \infty$. Therefore, $g$ is (Lebesgue) integrable on $\mathbb{R}_+$, by monotone convergence. Since it is nonincreasing, for all $b > 0$,
$$\int_0^\infty \overline{g}_b(t)\,dt = \sum_{n \ge 0} b\,g(nb) \le b\,g(0) + \int_0^\infty g(t)\,dt < \infty.$$
Furthermore,
$$\int_0^\infty \overline{g}_b(t)\,dt - \int_0^\infty \underline{g}_b(t)\,dt = \sum_{n \ge 0} b\,g(nb) - \sum_{n > 0} b\,g(nb) = b\,g(0).$$
The latter term vanishes as $b \to 0$, establishing that $g$ is directly Riemann integrable.

(c) Fix $\varepsilon > 0$ and select $a > 0$ such that $\int_a^\infty \overline{g}_1(t)\,dt \le \varepsilon$. For all $b \in (0, 1]$,
$$\underline{g}_b(t) \le \overline{g}_b(t) \le \overline{g}_1(t - 1) + \overline{g}_1(t) + \overline{g}_1(t + 1).$$
It follows that
$$\begin{aligned}
\int_0^\infty \overline{g}_b(t)\,dt - \int_0^\infty \underline{g}_b(t)\,dt &\le \int_0^{a+1} \bigl(\overline{g}_b(t) - \underline{g}_b(t)\bigr)\,dt + 3\int_a^\infty \overline{g}_1(t)\,dt \\
&\le \int_0^{a+1} \bigl(\overline{g}_b(t) - \underline{g}_b(t)\bigr)\,dt + 3\varepsilon.
\end{aligned}$$
As $g$ is assumed Riemann integrable on $[0, a + 1]$, the integral on the right-hand side tends to zero as $b \to 0$. Since $\varepsilon > 0$ is arbitrary, we conclude that the left-hand side goes to zero as $b \to 0$. Hence, $g$ is directly Riemann integrable.

(d) Follows from (c) since $g$ is Riemann integrable on finite intervals and, calling $z$ the bounding function, we have, since $\overline{g}_1 \le \overline{z}_1$,
$$\int_0^\infty \overline{g}_1(t)\,dt \le \int_0^\infty \overline{z}_1(t)\,dt,$$
a finite quantity because $z$ is directly Riemann integrable by assumption.
The Key Renewal Theorem

The renewal processes considered from now on are those with a renewal distribution that is nonlattice (Definition 10.2.2).

Theorem 10.2.11 Let $F$ be a nonlattice distribution function such that $F(\infty) = 1$ (with possibly infinite mean) and let $R$ be the associated renewal function. Then:
(α) Blackwell's theorem:³ for all $\tau \ge 0$,
$$\lim_{t \uparrow \infty} \{R(t + \tau) - R(t)\} = \frac{\tau}{E[S_1]}. \tag{10.17}$$
(β) Key renewal theorem: if $g : \mathbb{R}_+ \to \mathbb{R}$ is a nonnegative directly Riemann integrable function,
$$\lim_{t \uparrow \infty} (R * g)(t) = \frac{1}{E[S_1]}\int_0^\infty g(y)\,dy. \tag{10.18}$$
In fact, (α) and (β) are equivalent.

Remark 10.2.12 Example 10.3.5 below features a spectacular example showing that the direct integrability condition cannot be dispensed with in general. Theorem 10.2.10 above and Theorem 10.3.6 give practical ways to prove direct Riemann integrability.

³ [Blackwell, 1948].
Proof. (α) We shall admit it for the time being. All existing proofs are somewhat technical. A proof, based on the so-called "coupling method", is given later (starting with Theorem 10.2.16) when $E[S_1] < \infty$.

(β) Recall that when $g$ is locally bounded, $f = R * g$ is the unique locally bounded solution of the renewal equation
$$f(t) = g(t) + \int_{[0,t]} f(t - s)\,dF(s).$$
STEP 1. Case $g(t) = 1_{[(n-1)b,\,nb)}(t)$. Then $f(t) = R(t - (n-1)b) - R(t - nb)$, and the result is just Blackwell's theorem.

STEP 2. Case $g(t) = \sum_{n \ge 1} c_n 1_{[(n-1)b,\,nb)}(t)$, where $c_n \ge 0$, $\sum_{n \ge 1} c_n < \infty$, and $b$ is such that $F(b) < 1$. Then
$$f(t) = \sum_{n \ge 1} c_n \bigl(R(t - (n-1)b) - R(t - nb)\bigr).$$
By Lemma 10.1.10,
$$\sup_{t \ge 0} \bigl(R(t - (n-1)b) - R(t - nb)\bigr) \le (1 - F(b))^{-1} < \infty.$$
In particular, by dominated convergence,
$$\lim_{t \uparrow \infty} \sum_{n \ge 1} c_n \bigl(R(t - (n-1)b) - R(t - nb)\bigr) = \sum_{n \ge 1} c_n \frac{b}{E[S_1]} = \frac{1}{E[S_1]}\int_0^\infty g(y)\,dy.$$
STEP 3. If $g$ is directly Riemann integrable, the functions $\overline{g}_b$ and $\underline{g}_b$ previously defined are of the type considered in Step 2 since $\int \underline{g}_b \le \int \overline{g}_b < \infty$. But $\underline{g}_b \le g \le \overline{g}_b$ and therefore
$$\begin{aligned}
\frac{1}{E[S_1]}\int \underline{g}_b(s)\,ds &= \lim_{t \uparrow \infty}\int \underline{g}_b(t - s)\,R(ds) \le \liminf_{t \uparrow \infty}\int g(t - s)\,R(ds) \\
&\le \limsup_{t \uparrow \infty}\int g(t - s)\,R(ds) \\
&\le \lim_{t \uparrow \infty}\int \overline{g}_b(t - s)\,R(ds) = \frac{1}{E[S_1]}\int \overline{g}_b(s)\,ds.
\end{aligned}$$
The result follows by letting $b$ tend to 0.

We showed that (α) implies (β). The converse implication follows by choosing $g(t) := 1_{[0,\tau]}(t)$.

Here is a frequently encountered example of a directly Riemann integrable function:

Example 10.2.13: Tail distribution of an integrable variable. Let $F$ be the cdf of an integrable nonnegative random variable $S_1$. Then $1 - F$ is directly Riemann integrable. This follows from (b) of Theorem 10.2.10 and $\int_0^\infty (1 - F(t))\,dt = E[S_1] < \infty$.
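Blackwell's theorem (10.17) can be observed numerically. The sketch below approximates the renewal function from the fundamental renewal equation $R = 1 + R * F$ (10.12) on a grid, lumping the mass of $dF$ on each cell $((k-1)h, kh]$ at $kh$; this discretization and the Uniform(0, 1) inter-renewal distribution (for which $E[S_1] = 1/2$) are assumptions made for illustration. Because of the lumping, the computed increment is slightly biased (the discretized mean is $1/2 + h/2$), so only approximate agreement with $\tau / E[S_1]$ should be expected.

```python
def renewal_function(F, t_max, h):
    """Approximate R(t) = sum over n of F^{*n}(t) on the grid 0, h, ..., t_max.

    Uses R = 1 + R * F with the mass of dF on ((k-1)h, kh] lumped at kh;
    F(0) = 0 is assumed.
    """
    n_steps = int(round(t_max / h))
    dF = [0.0] + [F(k * h) - F((k - 1) * h) for k in range(1, n_steps + 1)]
    R = []
    for n in range(n_steps + 1):
        R.append(1.0 + sum(R[n - k] * dF[k] for k in range(1, n + 1)))
    return R

def uniform_cdf(t):
    return min(max(t, 0.0), 1.0)        # Uniform(0, 1): nonlattice, E[S1] = 1/2

h = 0.01
R = renewal_function(uniform_cdf, t_max=11.0, h=h)
tau = 1.0
increment = R[round(11.0 / h)] - R[round(10.0 / h)]   # R(t + tau) - R(t) at t = 10
print(increment, tau / 0.5)
```

Refining the grid step `h` reduces the bias; the cost of this simple scheme grows quadratically with the number of grid points.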
Example 10.2.14: Elementary Renewal Theorem: Correction Term, Take 2. In the recurrent case, we know (Theorem 10.1.5) that
$$\lim_{t \to \infty} \frac{R(t)}{t} = \frac{1}{E[S_1]}.$$
In order to obtain more information on the asymptotic behavior of the renewal function, we shall study the behavior of
$$f(t) := R(t) - \frac{t}{E[S_1]}$$
as $t$ goes to $\infty$ in the nonlattice case when $S_1$ has finite first and second moments. Calling $\sigma^2$ the common variance of the inter-renewal times and letting $m := E[S_1]$, we have
$$\lim_{t \uparrow \infty}\left\{ R(t) - \frac{t}{E[S_1]} \right\} = \frac{1}{2}\,\frac{E[S_1]^2 + \sigma^2}{E[S_1]^2}. \tag{10.19}$$
Proof. Recall from Example 10.1.11 that $f$ satisfies the renewal equation $f = g + F * f$ with data
$$g(t) = \frac{1}{E[S_1]}\int_t^\infty (1 - F(x))\,dx.$$
The function $g$ is of the form $1 - F_0$ where $F_0$ is the cdf of a nonnegative variable that is integrable. It is therefore directly Riemann integrable (Example 10.2.13). The key renewal theorem then gives
$$\lim_{t \to \infty}\left\{ R(t) - \frac{t}{m} \right\} = \frac{1}{m}\int_0^\infty g(s)\,ds.$$
But
$$\begin{aligned}
\frac{1}{m}\int_0^\infty g(s)\,ds &= \frac{1}{m^2}\int_0^\infty \int_s^\infty (1 - F(x))\,dx\,ds \\
&= \frac{1}{m^2}\int_0^\infty \int_0^x (1 - F(x))\,ds\,dx \\
&= \frac{1}{m^2}\int_0^\infty x(1 - F(x))\,dx = \frac{1}{m^2}\,\frac{1}{2}\int_0^\infty x^2\,dF(x),
\end{aligned}$$
hence the result. (Proof of the last equality: for a nonnegative random variable $X$ with cdf $F$, $E\bigl[\tfrac{1}{2}X^2\bigr] = E\bigl[\int_0^X x\,dx\bigr] = E\bigl[\int_0^\infty x\,1_{\{x < X\}}\,dx\bigr] = \int_0^\infty x(1 - F(x))\,dx$.)

[…]

(i) For all $a > 0$, $\lim_{t \uparrow \infty} (R_G(t + a) - R_G(t)) = \mu^{-1} a$ (Blackwell's theorem).
(ii) For all $x \ge 0$, $\lim_{t \uparrow \infty} P(B(t) \le x) = F_0(x)$.
(iii) For all $x \ge 0$, $\lim_{t \uparrow \infty} P(A(t) \le x) = F_0(x)$.

Proof. (ii) ⇔ (iii). Just observe that for $t \ge 0$, $x \ge 0$,
$$P(A(t) \le x) = P(N([0, t + x]) - N([0, t]) \ge 1) = P(B(t + x) \le x).$$
(i) ⇒ (ii). When $T_0 \equiv 0$, the function $t \to P(B(t) \le x)$ satisfies a renewal equation with data $g(t) = (1 - F(t))1_{[0,x]}(t)$. This function is directly Riemann integrable, and therefore by the key renewal theorem (a consequence of Blackwell's theorem)
$$\lim_{t \uparrow \infty} P(B(t) \le x) = \mu^{-1}\int_0^\infty g(t)\,dt = F_0(x).$$
The case of a nonnull and proper initial delay follows by the usual argument (see the proof of Theorem 10.3.4).

(iii) ⇒ (i). Define $G_t(x) = P(A(t) \le x)$ and observe that
$$R_G(t + a) - R_G(t) = (G_t * R)(a) = \int_0^a R(a - s)\,G_t(ds) = \int_0^a G_t(a - s)\,R(ds)$$
and that since by hypothesis $\lim_{t \uparrow \infty} G_t(x) = F_0(x)$, we have by dominated convergence ($R$ gives finite mass to bounded intervals)
$$\lim_{t \uparrow \infty}\int_0^a G_t(a - s)\,R(ds) = \int_0^a F_0(a - s)\,R(ds) = \mu^{-1} a.$$
In order to prove Blackwell's theorem, it is enough to prove (iii) of Lemma 10.2.16. We do this in the case where $\mu < \infty$.⁴ Here is the coupling argument.⁵ Consider two independent renewal sequences with the same interarrival distribution $F$. The first one is undelayed: $S_0 = 0, S_1, S_2, \ldots$ and the second one is stationary: $\tilde{S}_0, \tilde{S}_1, \tilde{S}_2, \ldots$. In particular, the distribution of $\tilde{S}_0$ is $F_0$ given by (10.22). Construct a renewal sequence $\{S_n^*\}_{n \ge 1}$ as follows. Take $S_n^* = S_n$ until the first time where two points of the tilded and untilded processes are ε-close, where $\varepsilon$ is fixed. (In this case we say that ε-coupling was successful, which is not granted in general. The technical part of the proof of Blackwell's theorem is to show that ε-coupling is actually realizable with probability 1 when the interval distribution is nonlattice.) Then follow the tilded process. For instance, suppose that $T_5$ and $\tilde{T}_3$ are at a distance less than $\varepsilon$. Then $S_n^* = S_n$ for $n = 1, 2, 3, 4, 5$, and $S_{5+k}^* = \tilde{S}_{3+k}$ for $k \ge 1$. Denote by $T = T^\varepsilon$ the first point of the tilded process which is ε-close to a point of the untilded process (in the example $T = \tilde{T}_3$).

[Figure: the undelayed sequence with $S_1 = S_1^*, \ldots, S_5 = S_5^*$ up to $T_5$, and the stationary sequence $\tilde{S}_0, \ldots, \tilde{S}_3$ up to $\tilde{T}_3$, with $T_5$ and $\tilde{T}_3$ at distance less than $\varepsilon$; from then on the starred sequence ($S_6^*, S_7^*, \ldots$) follows the tilded one.]
Lemma 10.2.17 With the assumptions of Theorem 10.2.11, if ε-coupling happens almost surely, that is, if $P(T < \infty) = 1$, then Blackwell's theorem is proved.

⁴ For the extension to $\mu = \infty$, see for instance [Lindvall, 1992], p. 76–77.
⁵ [Lindvall, 1977].

Proof. For simpler notation, let $T := T_\varepsilon$. Let $\{A(t)\}_{t \ge 0}$ and $\{\tilde{A}(t)\}_{t \ge 0}$ be the recurrence times corresponding to the undelayed starred renewal process and the (stationary) tilded renewal process. For all $t \ge T$, we have $A(t) = \tilde{A}(t + D)$ where $|D| \le \varepsilon$. Let $f$ be a continuous function bounded by 1, and define
$$M_\varepsilon(t) = \sup_{|s| \le \varepsilon} \bigl| f(\tilde{A}(t + s)) - f(\tilde{A}(t)) \bigr|.$$
Note that $\lim_{\varepsilon \downarrow 0} E[M_\varepsilon(0)] = 0$. Now,
$$\begin{aligned}
\bigl| E[f(A(t))] - E[f(\tilde{A}(t))] \bigr| &\le E\bigl[ | f(A(t)) - f(\tilde{A}(t)) | \bigr] \\
&= E\bigl[ | f(A(t)) - f(\tilde{A}(t)) |\, 1_{\{t < T\}} \bigr] + E\bigl[ | f(A(t)) - f(\tilde{A}(t)) |\, 1_{\{t \ge T\}} \bigr] \\
&\le 2P(T > t) + E[M_\varepsilon(t)] = 2P(T > t) + E[M_\varepsilon(0)],
\end{aligned}$$
where the last equality follows from stationarity of $\tilde{A}$. Deduce from this that, since $E[f(\tilde{A}(t))] = E[f(\tilde{S}_0)]$,
$$\lim_{t \uparrow \infty} E[f(A(t))] = E[f(\tilde{S}_0)].$$
In other words, since $f$ is an arbitrary continuous function bounded by 1, $A(t)$ tends in distribution to $\tilde{S}_0$ as $t \uparrow \infty$. In particular, since the distribution $F_0$ of $\tilde{S}_0$ is continuous, for all $x \in \mathbb{R}_+$,
$$\lim_{t \uparrow \infty} P(A(t) \le x) = F_0(x).$$
The conclusion follows from (iii) of Lemma 10.2.16.
In order to prove coupling, we first examine the role of the nonlattice assumption. Recall that a point $x$ is said to be in the support of the distribution function $F$ if $F(x + \varepsilon) - F(x - \varepsilon) > 0$ for all $\varepsilon > 0$. The set of all such points is called the support of $F$ and is denoted by $\mathrm{supp}(F)$. The key implication of the nonlattice assumption is the following:

Lemma 10.2.18 Let $F$ be a nonlattice cumulative distribution function. Let $G$ denote the set of finite linear combinations of elements of $\mathrm{supp}(F)$ with coefficients in $\mathbb{N}$, that is,
$$G = \bigcup_{n \in \mathbb{N}} \left\{ \sum_{i=1}^n g_i \,;\ g_1, \ldots, g_n \in \mathrm{supp}(F) \right\}. \tag{10.23}$$
Then $G$ is asymptotically dense in $\mathbb{R}_+$, that is,
$$\lim_{x \to \infty} d(x, G) = 0,$$
where $d(x, G) = \inf_{g \in G} |x - g|$.

Observe that the set $G$ as defined by (10.23) is the union of the supports of the cumulative distribution functions $F^{*n}$ $(n \in \mathbb{N})$ or, equivalently, the support of the renewal function $R = \sum_{n \in \mathbb{N}} F^{*n}$ associated to $F$.

Proof. Letting
$$\mu := \inf_{g, h \in G,\ g > h} \{g - h\}, \tag{10.24}$$
we first prove that $\mu = 0$. Suppose in view of contradiction that $\mu > 0$. The infimum in (10.24) is then necessarily attained, for otherwise there would exist sequences $g_n$, $h_n$ in $G$ such that $g_n - h_n > g_{n+1} - h_{n+1}$, and $g_n - h_n \to \mu$ as $n \to \infty$. Then, for $n$ large enough, $g_n - h_n < \mu + \mu/2$. Consequently, letting $g := g_n + h_{n+1}$ and $h := h_n + g_{n+1}$, it holds that $g - h = (g_n - h_n) - (g_{n+1} - h_{n+1}) \in (0, \mu/2)$. This is a contradiction, in view of the definition of $\mu$ and the fact that $g, h \in G$. There must therefore exist $g, h \in G$ such that $g - h = \mu$. Since $F$ is nonlattice, there exists $z \in \mathrm{supp}(F)$ such that, for some $k \in \mathbb{N}$, $k\mu < z < (k + 1)\mu$. Define then $g' := z + kh$ and $h' := kg$. Both $g'$ and $h'$ belong to $G$. Furthermore, $g' - h' = z - k\mu \in (0, \mu)$, again a contradiction. Necessarily then, $\mu = 0$.

Therefore, for any $\varepsilon > 0$, there exist $g, h \in G$ such that $g - h \in (0, \varepsilon)$. Consider the subset $G'$ of $G$ consisting of the elements $kg + \ell h$ $(k, \ell \in \mathbb{N})$. We argue that $\limsup_{x \to \infty} d(x, G') \le \varepsilon$. Indeed, let $m = \lceil h/(g - h) \rceil$. Let $x > mh$. Write $x = nh + r$, with $n \in \mathbb{N}$, $n \ge m$, and $r \in [0, h)$. Let $k \in \mathbb{N}$ be such that
$$(n - k)h + kg \le x < (n - k)h + kg + (g - h).$$
Necessarily $k \le m$ since $r < h$. The term $(n - k)h + kg$ thus belongs to $G'$, as $n - k \ge 0$. Furthermore, it is at most $\varepsilon$ apart from $x$, by the pair of inequalities displayed above. It follows that
$$\limsup_{x \to \infty} d(x, G) \le \limsup_{x \to \infty} d(x, G') \le \varepsilon.$$
As $\varepsilon$ is arbitrary, this concludes the proof of the lemma.
One says that ε-coupling holds for renewal processes with inter-renewal cdf $F$ if for $\varepsilon > 0$ and fixed initial delays $t_1$, $t_1'$, one can construct jointly two renewal processes with the corresponding delays such that, with probability 1, there are indices $m$, $n$ such that the corresponding renewal times $T_m$, $T_n'$ are less than $\varepsilon$ apart.

Lemma 10.2.19 With the assumptions of Theorem 10.2.11, ε-coupling happens almost surely.

Proof. Let
$$Z_i := \min\{\tilde{T}_j - T_i \,;\ \tilde{T}_j - T_i \ge 0\} \quad (i \ge 0).$$
For fixed $\varepsilon > 0$, let
$$A_i := \{Z_j < \varepsilon \text{ for some } j \ge i\}.$$
Then
$$A_0 \supseteq \cdots \supseteq A_i \supseteq \cdots \supseteq \bigcap_{i=0}^\infty A_i = A_\infty := \{Z_i < \varepsilon \text{ i.o.}\}.$$
Since the sequence $\{T_{i+n} - T_i\}_{n \ge 1}$ has a distribution independent of $i$ and is independent of $\tilde{N} \equiv \{\tilde{T}_n\}_{n \ge 0}$, and since $\tilde{N}$ is stationary, the sequence $\{Z_i\}_{i \ge 0}$ is also stationary. Therefore the events $A_i$ $(i \ge 0)$ have the same probability, and in particular
$$P(A_0) = P(A_\infty).$$
Conditionally on $\tilde{S}_0$, the event $A_\infty$ is an exchangeable event of the symmetric sequence $\{(S_n, \tilde{S}_n)\}_{n \ge 1}$. Therefore, by the Hewitt–Savage 0–1 law (Theorem 4.3.7), for all $t > 0$,
$$P(A_\infty \mid \tilde{S}_0 = t) = 0 \text{ or } 1. \tag{$\star$}$$
Lemma 10.2.18 guarantees that for sufficiently large $u$ and fixed $\varepsilon$,
$$P(u - t < \tilde{T}_j - \tilde{S}_0 < u - t + \varepsilon \text{ for some } j) > 0.$$
Therefore $P(A_0 \mid \tilde{S}_0 = t) > 0$ for all $t \ge 0$ and in particular
$$P(A_0) = \lambda\int_0^\infty P(A_0 \mid \tilde{S}_0 = t)(1 - F(t))\,dt > 0,$$
which implies, since $P(A_0) = P(A_\infty)$,
$$P(A_\infty) = \lambda\int_0^\infty P(A_\infty \mid \tilde{S}_0 = t)(1 - F(t))\,dt > 0.$$
In view of ($\star$), this implies that $P(A_\infty \mid \tilde{S}_0 = t) = 1$ for all $t$ such that $F(t) < 1$. Therefore $P(A_\infty) = 1 = P(A_0)$, so that $P(Z_i < \varepsilon \text{ for some } i) = 1$.
10.2.3
Defective and Excessive Renewal Equations
The key renewal theorem concerns proper renewal equations. In a number of situations though, one encounters renewal equations for which $F(\infty) < 1$ or $F(\infty) > 1$. However, when there exists an $\alpha \in \mathbb{R}$ such that
\[ \int_{[0,\infty)} e^{\alpha t}\, dF(t) = 1 , \tag{10.25} \]
the asymptotics of the solution of the renewal equation can be obtained from the proper case. In fact, letting
\[ \tilde g(t) := e^{\alpha t} g(t), \qquad \tilde f(t) := e^{\alpha t} f(t) \qquad\text{and}\qquad \tilde F(t) := \int_{[0,t]} e^{\alpha s}\, dF(s) , \]
the distribution $\tilde F$ is proper, and nonlattice if $F$ itself is nonlattice. One immediately checks that $\tilde f$ satisfies the renewal equation $\tilde f = \tilde g + \tilde f * \tilde F$. The conclusion of the key renewal theorem is that when $\tilde g$ is directly Riemann integrable,
\[ \lim_{t\to\infty} e^{\alpha t} f(t) = \frac{\int_0^\infty e^{\alpha t} g(t)\, dt}{\int_0^\infty t\, e^{\alpha t}\, dF(t)} . \tag{10.26} \]

Remark 10.2.20 A number $\alpha$ satisfying (10.25) always exists in the excessive case. Indeed, the function $\alpha \mapsto \int_{[0,\infty)} e^{\alpha t}\, dF(t)$ is continuous on $(-\infty, 0]$ and strictly increases from $0$ to $\int_{[0,\infty)} dF(t) > 1$. Therefore there is a unique $\alpha < 0$ satisfying (10.25). In the defective case, such $\alpha$, if it exists, is necessarily positive. But it may not exist. In fact, its existence implies exponential decay of the tail distribution $1 - F$ since, by Markov's inequality,
\[ P(S_1 > t) = P(e^{\alpha S_1} > e^{\alpha t}) \le e^{-\alpha t}\, E[e^{\alpha S_1}] . \]
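As a numerical illustration of condition (10.25), the following sketch solves for $\alpha$ by bisection. The family $F(t) = p(1 - e^{-\lambda t})$ (total mass $p$, hence defective for $p < 1$ and excessive for $p > 1$) is an assumption chosen only because its exponential moment has the closed form $p\lambda/(\lambda - \alpha)$; the function names are illustrative and not taken from the text.

```python
def tilted_mass(alpha, p, lam):
    """Integral of e^{alpha t} dF(t) for the assumed F(t) = p (1 - e^{-lam t}), alpha < lam."""
    return p * lam / (lam - alpha)


def solve_alpha(p, lam, iterations=200):
    """Bisection for the root of tilted_mass(alpha) = 1, i.e. condition (10.25)."""
    # alpha -> tilted_mass(alpha) is continuous and strictly increasing on (-inf, lam),
    # so a bracket with values below and above 1 contains exactly one root.
    lo = -2.0 * lam * max(p, 1.0)   # tilted_mass(lo) = p / (1 + 2 max(p, 1)) < 1
    hi = lam * (1.0 - 1e-9)         # tilted_mass(hi) is very large
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if tilted_mass(mid, p, lam) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


alpha_defective = solve_alpha(p=0.8, lam=2.0)   # closed form: lam * (1 - p) > 0
alpha_excessive = solve_alpha(p=1.5, lam=2.0)   # closed form: lam * (1 - p) < 0
```

For this family the root is $\alpha = \lambda(1 - p)$, positive in the defective case and negative in the excessive case, in agreement with Remark 10.2.20.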
Remark 10.2.21 Clearly, from (10.26), in the nonlattice defective case and assuming the existence of such $\alpha$, the solution of the renewal equation decays exponentially fast as $t \to \infty$, whereas in the nonlattice excessive case (for which $\alpha$ always exists) the solution of the renewal equation explodes exponentially fast as $t \to \infty$.

Example 10.2.22: Asymptotics in the Transient Case. Suppose that $F$ is defective ($F(\infty) < 1$) and that there exists $\alpha$ (necessarily $> 0$) such that
\[ \int_0^\infty e^{\alpha t}\, dF(t) = 1 . \tag{10.27} \]
By Theorem 10.1.13, when the data $g$ is bounded and such that there exists $g(\infty) = \lim_{t\to\infty} g(t)$, the unique solution $f$ of the renewal equation $f = g + f * F$ satisfies
\[ \lim_{t\uparrow\infty} f(t) = \frac{g(\infty)}{1 - F(\infty)} . \tag{10.28} \]
With the help of the defective renewal theorem, additional information concerning the asymptotic behavior of $f$ can be obtained. In fact, if the function $g_1$ defined by
\[ g_1(t) = g(t) - g(\infty) + g(\infty)\, \frac{F(t) - F(\infty)}{1 - F(\infty)} \]
is such that the function $t \mapsto \tilde g_1(t) = e^{\alpha t} g_1(t)$ is directly Riemann integrable, then
\[ \lim_{t\to\infty} e^{\alpha t} \left( f(t) - \frac{g(\infty)}{1 - F(\infty)} \right) = C , \tag{10.29} \]
where
\[ C = \frac{\int_0^\infty e^{\alpha t}\,[g(t) - g(\infty)]\, dt - \frac{g(\infty)}{\alpha}}{\int_0^\infty t\, e^{\alpha t}\, dF(t)} . \]

Proof. Define
\[ f_1(t) := f(t) - \frac{g(\infty)}{1 - F(\infty)} . \]
Straightforward computations using the identity $R * F = R - 1$ show that $f_1 = R * g_1$. Therefore $f_1$ is a solution of the (defective) renewal equation $f_1 = g_1 + f_1 * F$. Since $\tilde g_1(t) = e^{\alpha t} g_1(t)$ is assumed to be directly Riemann integrable,
\[ \lim_{t\to\infty} e^{\alpha t} f_1(t) = \frac{\int_0^\infty e^{\alpha t} g_1(t)\, dt}{\int_0^\infty t\, e^{\alpha t}\, dF(t)} . \]
But
\[ \int_0^\infty e^{\alpha t}\,(F(\infty) - F(t))\, dt = \int_0^\infty e^{\alpha t} \left( \int_{(t,\infty)} dF(s) \right) dt = \int_0^\infty \left( \int_0^t e^{\alpha s}\, ds \right) dF(t) = \frac{1}{\alpha}\,(1 - F(\infty)) , \]
from which the above expression for $C$ follows.
Example 10.2.23: The Risk Model, Take 3. This example is a continuation of Example 10.1.7. Recall Eqn. (10.11) in the case where the safety loading is positive, and write this equation in the form
\[ \Psi(u) = \frac{\lambda}{c} \int_u^\infty (1 - G(z))\, dz + \int_0^u \Psi(u - z)\, dF(z) , \]
where
\[ F(x) := \frac{\lambda}{c} \int_0^x (1 - G(z))\, dz . \]
This is a renewal equation, and it is defective since $\frac{\lambda}{c} \int_0^\infty (1 - G(z))\, dz = \frac{\lambda\mu}{c} < 1$ under the positive safety loading condition. Assume the existence of $\alpha > 0$ such that
\[ \frac{\lambda}{c} \int_0^\infty e^{\alpha z} (1 - G(z))\, dz = 1 . \]
By the defective renewal theorem, if
\[ \frac{\lambda}{c} \int_0^\infty e^{\alpha u} \int_u^\infty (1 - G(z))\, dz\, du < \infty , \]
we have that
\[ \lim_{u\uparrow\infty} e^{\alpha u}\, \Psi(u) = C := \frac{\int_0^\infty e^{\alpha u} \int_u^\infty (1 - G(z))\, dz\, du}{\int_0^\infty z\, e^{\alpha z} (1 - G(z))\, dz} \]
or, equivalently,
\[ \Psi(u) = C e^{-\alpha u} + o(e^{-\alpha u}) . \]
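To make the constants of this example concrete, here is a worked special case. The choice of exponential claim sizes, $1 - G(z) = e^{-z/\mu}$, is an assumption made for illustration and is not part of the example above.

```latex
% Assumption: exponential claims with mean \mu, so 1 - G(z) = e^{-z/\mu}.
% Condition on \alpha:
\frac{\lambda}{c}\int_0^\infty e^{\alpha z} e^{-z/\mu}\,dz
   = \frac{\lambda}{c}\,\frac{1}{1/\mu - \alpha} = 1
   \quad\Longrightarrow\quad \alpha = \frac{1}{\mu} - \frac{\lambda}{c} > 0
   \quad\text{(since } \lambda\mu/c < 1\text{)} .
% Numerator and denominator of C (note that 1/\mu - \alpha = \lambda/c):
\int_0^\infty e^{\alpha u}\int_u^\infty e^{-z/\mu}\,dz\,du
   = \mu\int_0^\infty e^{-(1/\mu-\alpha)u}\,du = \frac{\mu c}{\lambda},
\qquad
\int_0^\infty z\,e^{\alpha z}e^{-z/\mu}\,dz = \frac{1}{(1/\mu-\alpha)^2} = \frac{c^2}{\lambda^2} .
% Hence
C = \frac{\mu c/\lambda}{c^2/\lambda^2} = \frac{\lambda\mu}{c},
\qquad
\Psi(u) \sim \frac{\lambda\mu}{c}\,\exp\!\Big\{-\Big(\frac{1}{\mu}-\frac{\lambda}{c}\Big)u\Big\} .
```

Under this assumption the constant $C$ equals $\lambda\mu/c$, that is, the total mass $F(\infty)$ of the defective distribution.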
10.3 Regenerative Processes

10.3.1 Examples

Let $(E, \mathcal{E})$ be a measurable space.

Definition 10.3.1 Let $\{X(t)\}_{t\ge0}$ be a measurable $E$-valued stochastic process and let $\{T_n\}_{n\ge0}$ be a proper recurrent renewal process, possibly delayed (recall, however, that the initial delay $T_0$ is always assumed finite). The process $\{X(t)\}_{t\ge0}$ is said to be regenerative with respect to $\{T_n\}_{n\ge0}$ if for all $n \ge 0$,
(a) the distribution of the post-$T_n$ process $S_{n+1}, S_{n+2}, \ldots, \{X(t + T_n)\}_{t\ge0}$ is independent of $n \ge 0$, and
(b) the post-$T_n$ process is independent of $T_0, \ldots, T_n$.
The times $T_n$ are called regeneration times of the regenerative process.

Example 10.3.2: Continuous-time Markov Chains. Let $\{X(t)\}_{t\ge0}$ be a recurrent continuous-time homogeneous Markov chain taking its values in the state space $E = \mathbb{N}$. Suppose that it starts from state 0 at time $t = 0$. By the strong Markov property, $\{X(t)\}_{t\ge0}$ is regenerative with respect to the sequence $\{T_n\}_{n\ge0}$ where $T_n$ is the $n$-th time of visit to state 0 of the chain.

Regenerative processes are the main sources of renewal equations.
Theorem 10.3.3 Let $\{X(t)\}_{t\ge0}$ and $\{T_n\}_{n\ge0}$ be as in Definition 10.3.1 except for the additional assumption $T_0 \equiv 0$ (undelayed renewal process) and let $h : E \to \mathbb{R}$ be a nonnegative measurable function. The function $f : \mathbb{R}_+ \to \mathbb{R}$ defined by $f(t) := E[h(X(t))]$ satisfies the renewal equation with data g(t) = E h(X(t))1{t s) ds = a E[U1 ] = a. In a reliability context, $a/(a+b)$ represents the availability of a given machine with mean lifetime $a$ and mean repair time $b$.

Example 10.3.9: Forward and Backward Recurrence Times. Let $\{T_n\}_{n\ge0}$ be an undelayed ($T_0 = 0$) renewal process. Clearly the forward and backward recurrence times are regenerative with respect to the renewal process $\{T_n\}_{n\ge0}$. From Smith's regenerative formula (10.34), in the nonlattice case
\[ \lim_{t\to\infty} P(A(t) > x) = \frac{1}{m} \int_0^\infty P(A(s) > x,\ s < S_1)\, ds . \]
Since $P(A(s) > x,\ s < S_1) = P(S_1 > s + x) = 1 - F(s + x)$, we have
\[ \int_0^\infty P(A(s) > x,\ s < S_1)\, ds = \int_0^\infty (1 - F(s + x))\, ds = \int_x^\infty (1 - F(s))\, ds , \]
and therefore
\[ \lim_{t\to\infty} P(A(t) > x) = \frac{1}{m} \int_x^\infty (1 - F(s))\, ds . \tag{10.36} \]
Similar arguments yield for the backward recurrence time
\[ \lim_{t\to\infty} P(B(t) > y) = \frac{1}{m} \int_y^\infty (1 - F(s))\, ds . \tag{10.37} \]
(This time, the data function is (1 − F (t))1{t>y} .) One can prove directly the direct Riemann integrability of the data functions of this example, or use Theorem 10.3.6.
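The limit (10.36) can be checked by simulation. The sketch below is only an illustration under stated assumptions: interrenewal times are taken uniform on $(0, 2)$ (so that $m = 1$ and the right-hand side of (10.36) equals $(2 - x)^2/4$ for $0 \le x \le 2$), the time $t = 30$ is treated as "large", and all function names are hypothetical.

```python
import random


def forward_recurrence_time(t, draw_interarrival, rng):
    """Return A(t): the time from t to the first renewal strictly after t (T0 = 0)."""
    renewal = 0.0
    while renewal <= t:
        renewal += draw_interarrival(rng)
    return renewal - t


def estimate_tail(x, t, n_paths, draw_interarrival, seed=0):
    """Monte Carlo estimate of P(A(t) > x) from n_paths independent renewal paths."""
    rng = random.Random(seed)
    hits = sum(
        forward_recurrence_time(t, draw_interarrival, rng) > x
        for _ in range(n_paths)
    )
    return hits / n_paths


def uniform_0_2(rng):
    """Assumed interrenewal distribution: uniform on (0, 2), mean m = 1."""
    return rng.uniform(0.0, 2.0)


def limit_tail_uniform_0_2(x):
    """(1/m) * integral from x to 2 of (1 - s/2) ds, valid for 0 <= x <= 2."""
    return (2.0 - x) ** 2 / 4.0


estimate = estimate_tail(x=0.5, t=30.0, n_paths=20000, draw_interarrival=uniform_0_2)
target = limit_tail_uniform_0_2(0.5)
```

With these parameters the estimate should be close to `target`, up to Monte Carlo error of order $n^{-1/2}$ in the number of paths.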
Example 10.3.10: The Bus Paradox. The sum $A(t) + B(t)$ is the inter-event interval around time $t$. Interpreting $t$ as the time at which you arrive at a bus stop, and the sequence $\{T_n\}_{n\ge1}$ as the sequence of times at which buses arrive at (and immediately depart from) the bus stop, $A(t)$ is your waiting time. If $t$ is large enough, one can, in view of (10.36), assume that $A(t)$ is distributed as a random variable $A$ with the distribution
\[ P(A > x) = \frac{1}{m} \int_x^\infty (1 - F(s))\, ds . \tag{10.38} \]
Similarly, the time $B(t)$ by which you missed the previous bus is approximately, when $t$ is large, distributed as a random variable $B$ with the same distribution as $A$. The bus paradox can be stated in several ways. One of them is: the mean time interval between the bus you missed and the bus you will catch is asymptotically as $t \to \infty$ equal to $E[A + B] = 2E[A]$, and is in general different from the mean of the interval between two successive buses $n$ and $n + 1$, $E[S_1]$.

Let $\{X(t)\}_{t\in\mathbb{R}}$ be a stochastic process taking its values in a metric space $E$, having right-continuous paths and being regenerative relative to the (possibly delayed) renewal sequence $\{T_n\}_{n\ge0}$ with nonlattice inter-event distribution of finite mean $\mu$. Let $P_0$ and $E_0$ symbolize respectively the probability and the expectation corresponding to the undelayed version of the renewal sequence. One checks easily that
\[ P^*(A) := \frac{1}{\mu}\, E_0\left[ \int_0^{S_1} 1_A(X(s))\, ds \right] \]
defines a probability measure on $(E, \mathcal{B}(E))$.

Theorem 10.3.11 Under the above conditions, $X(t)$ converges in distribution to $P^*$ as $t \uparrow \infty$.

Proof. We must show that for all bounded (say, by 1) continuous functions $h : E \to \mathbb{R}$,
\[ \lim_{t\uparrow\infty} E[h(X(t))] = \frac{1}{\mu}\, E_0\left[ \int_0^{S_1} h(X(s))\, ds \right] . \tag{10.39} \]
By the usual renewal argument (conditioning on $T_0$, whose cumulative distribution function is denoted by $F_{T_0}$), f (t − s) dFT0 (s) , () E [h(X(t))] = E h(X(t))1{t 0, we have in the nonlattice case (Blackwell's theorem)
\[ \lim_{t\uparrow\infty} \frac{R_{ij}(t + a) - R_{ij}(t)}{a} = \frac{\nu_j}{\mu} . \tag{10.41} \]
The following result is similar to the one in the univariate case:

Theorem 10.4.1 If the data functions are nonnegative and locally bounded, there exists a unique vector $f := \{f_i\}_{i\in E}$ of locally bounded measurable functions from $\mathbb{R}_+$ to $\mathbb{R}$ satisfying the multivariate renewal equation (10.40), namely $f = R * g$.

Proof. The fact that $R * g$ is well defined, locally bounded, and satisfies the renewal equation is proved in the same way as in the univariate case. Let now $f$ and $\tilde f$ be two vectors of locally bounded functions satisfying the renewal equation, and let $h := f - \tilde f$. Then $h = F * h$, and iteratively, $h = F^{*n} * h$ for all $n \ge 1$, so that $|h| \le F^{*n} * |h|$. Let $\sup_i |h_i(t)| \le M(a) < \infty$ on $[0, a]$, say $M(a) = 1$, without loss of generality, so that $|h| \le F^{*n} * 1$ on $[0, a]$ and therefore, on $[0, a]$,
\[ |h_i(t)| \le \sum_{j\in E} F_{ij}^{*n}(t) = P_i(S_1 + \cdots + S_n \le t) , \]
a quantity that tends to 0 as $n \uparrow \infty$ since $T_\infty = \infty$ under the prevailing conditions (the embedded chain is irreducible recurrent). Therefore $h \equiv 0$ on all $[0, a]$ ($a \in \mathbb{R}_+$) and therefore on $\mathbb{R}_+$.

It follows from (10.41) that if $g_j$ is directly Riemann integrable,
\[ R_{ij} * g_j(t) \to \frac{\nu_j}{\mu} \int_0^\infty g_j(s)\, ds , \]
and from this, in the case where $E$ is finite,
\[ f_i(t) = \sum_{j\in E} R_{ij} * g_j(t) \to \frac{1}{\mu} \sum_{j\in E} \nu_j \int_0^\infty g_j(s)\, ds . \]
The case when $E$ is infinite requires further conditions and will not be treated here.8
Improper Multivariate Renewal Equations

The above results concern the case where $Q := \{F_{ij}\}_{i,j\in E}$ is a stochastic matrix, that is, the transition matrix of a homogeneous Markov chain (namely, the transition matrix $P$ of the embedded Markov chain). However the renewal equations (10.40) make sense even if this is not the case. We now give results9 of the same kind as the ones in the defective or excessive univariate renewal functions when the state space is finite. The matrix $Q$ is no longer a stochastic matrix, but still assumed irreducible. Define for some real $\beta$ the matrix $A := \{a_{ij}\}_{i,j\in E}$ by
\[ a_{ij} := \int_0^\infty e^{\beta t}\, dF_{ij}(t) . \]
Assume that $\beta$ can be chosen such that $A$ has spectral radius 1. In particular, there exist two positive vectors $\nu$ and $h$ such that
\[ \nu^T A = \nu^T \quad\text{and}\quad A h = h . \]
The existence of $\nu$ and $h$ is ensured by the Perron–Frobenius theorem. The following facts are easy. First the matrix
\[ \tilde Q := \left\{ \frac{1}{h_i}\, a_{ij}\, h_j \right\}_{i,j\in E} \]
is an (irreducible) stochastic matrix admitting the invariant measure $\tilde\nu$, given by $\tilde\nu_i = \nu_i h_i$. Let
\[ \tilde F_{ij}(t) := \frac{h_j}{h_i} \int_0^t e^{\beta s}\, dF_{ij}(s) . \]
This defines a semi-Markov kernel for which $\tilde Q = \{\tilde F_{ij}\}_{i,j\in E}$ is irreducible and recurrent. Defining
\[ \tilde f_i(t) := e^{\beta t} f_i(t)/h_i \quad\text{and}\quad \tilde g_i(t) := e^{\beta t} g_i(t)/h_i , \]
we see, analogously to the univariate case, that $\tilde f = \tilde g + \tilde F * \tilde f$. Therefore, if $\tilde F$ is nonlattice and the $\tilde g_i$'s are locally bounded and integrable,
\[ \lim_{t\uparrow\infty} \tilde f_i(t) = \frac{1}{\tilde\mu} \sum_{j\in E} \tilde\nu_j \int_0^\infty \tilde g_j(s)\, ds , \]
that is
\[ \lim_{t\uparrow\infty} e^{\beta t} f_i(t) = h_i\, \frac{\sum_{j\in E} \nu_j \int_0^\infty e^{\beta s} g_j(s)\, ds}{\sum_{k,j\in E} \nu_k h_j \int_0^\infty s\, e^{\beta s}\, dF_{kj}(s)} . \]
The existence of $\beta$ is guaranteed, for instance,10 when the spectral radius of $\{F_{ij}\}_{i,j\in E}$ is strictly less than 1.

8 See, for instance, [Çinlar, 1975].
9 [Asmussen and Hering, 1977].

Complementary reading

[Asmussen, 2003] for more theory and for applications to random walks and queues.
10.5 Exercises

Exercise 10.5.1. Theorem 10.1.1 true in the undelayed case
Prove that if Theorem 10.1.1 is true in the undelayed case, it is then true in the delayed case (recall: with finite delay).

Exercise 10.5.2. The asymptotic counting rate
Prove (10.4), that is,
\[ \lim_{t\to\infty} \frac{N([0, t])}{t} = \frac{1}{E[S_1]} , \quad P\text{-a.s.} \]

Exercise 10.5.3. Right-continuity of the renewal function
Show that the renewal function $R$ is right-continuous.

Exercise 10.5.4. About the distribution of $N([0, t])$
In the undelayed case, compute $P(N([0, t]) = n)$ for $n \ge 1$ in terms of the convolution iterates of the interrenewal distribution and show that $E[e^{\beta N([0,t])}] < \infty$ for all $\beta \in [0, \alpha)$, where $\alpha := E[e^{-S_1}]$.

Exercise 10.5.5. First event after a random time
In the undelayed case, let $X$ be a strictly positive random variable, independent of $\{S_n\}_{n\ge1}$, with cumulative distribution $G$. Let $\tilde T = \inf\{T_n \,;\ T_n > X\}$. Give an expression of the cumulative distribution of $\tilde T$ in terms of $G$, the renewal function $R$ and the avoidance function $v$ (defined in Section 8.1.3).

10 See [Asmussen, 1987], Problem 2.3, chap. X.
Exercise 10.5.6. Forward recurrence time
What is the limit distribution of the forward recurrence time of a renewal process (possibly delayed) when $S_1$ is deterministic, equal to $a$?

Exercise 10.5.7. Wald's lemma for renewal processes
Let $\{S_n\}_{n\ge1}$ be a renewal sequence and let $\{N([0, t])\}_{t\ge0}$ be the counting process of the corresponding (possibly delayed) renewal process. Assume that $E[S_1] < \infty$. Show that for all $t \ge 0$,
\[ E[S_1 + \cdots + S_{N([0,t])}] = E[S_1]\, E[N([0, t])] . \]

Exercise 10.5.8. Expected lifetime
In Example 10.1.6, compute the expectation of the lifetime $L$ in the transient case.

Exercise 10.5.9. Safety load
This exercise refers to Example 7.1.2. Prove that in case of positive safety loading, $\Phi(\infty) = 1$.

Exercise 10.5.10. The bus paradox
See Example 10.3.10 for the context. Consider an undelayed renewal process with finite mean interrenewal time $E[S_1]$. For $t \ge 0$, consider the interval between the last renewal time before $t$ and the first renewal time after $t$. Show that in the Poisson case (the interarrival distribution is exponential) the mean length of this interval is asymptotically $2E[S_1]$. Show that it is equal to $E[S_1]$ if and only if $S_1$ is a constant.

Exercise 10.5.11. First event after a fixed time
Refer to Definition 10.1.14 for the notation. Prove that $S_0(t)$ is independent of $\{S_n(t)\}_{n\ge1}$ and that the latter sequence has the same distribution as $\{S_n\}_{n\ge1}$.

Exercise 10.5.12. A limit theorem for continuous-time hmcs
Let $\{X(t)\}_{t\ge0}$ be a positive recurrent continuous-time homogeneous Markov chain taking its values in the state space $E = \mathbb{N}$. Let $P_0$ and $E_0$ denote respectively probability and expectation given $X(0) = 0$. Let $T_0$ be the return time to 0 ($T_0 := \inf\{t > 0 \,;\ X(t) = 0,\ X(t-) \ne 0\}$). Recall that in the positive recurrent case, $E_0[T_0] < \infty$. Show that
\[ \lim_{t\uparrow\infty} P(X(t) = i) = \frac{E_0\left[ \int_0^{T_0} 1_{\{X(s) = i\}}\, ds \right]}{E_0[T_0]} . \]

Exercise 10.5.13. Lotka–Volterra asymptotics
In the Lotka–Volterra model, give the details concerning the asymptotics of the birth rate $f$ in the cases $F(\infty) = 1$ and $F(\infty) > 1$. What can you say about the defective case $F(\infty) < 1$?

Exercise 10.5.14. Backward and forward recurrence processes
Refer to Example 10.3.9. Compute $\lim_{t\to\infty} P(A(t) > x,\ B(t) > y)$ for $x, y \ge 0$.
Exercise 10.5.15. Asymptotic variance of $N((0, t])$
For a proper renewal process with an interarrival distribution of finite variance, show that
\[ \lim_{t\uparrow\infty} \frac{\mathrm{Var}\, N((0, t])}{t} = \frac{\mathrm{Var}\, S_1}{E[S_1]^3} . \]

Exercise 10.5.16. The age replacement policy
We interpret the random variables $S_1, S_2, \ldots$ as the lifetimes of machines successively put into service, a new machine immediately replacing a failed one. It will be assumed that $E[S_1] < \infty$, and therefore, by (10.4), $1/E[S_1]$ is the asymptotic failure rate per unit time. In some situations, the inconvenience caused by a failure is too important, and the failure rate must be controlled. The age replacement policy suggests that an engine should be replaced at failure time or at a fixed time $T > 0$, whichever occurs first. What is the asymptotic failure rate? (A replacement is not considered as a failure.)

Exercise 10.5.17. Another maintenance policy
A given machine can be in either one of three states: G (good), M (in maintenance), or R (in repair). Its successive periods where it is in state G (resp., M, R) form an independent and identically distributed sequence $\{S_n\}_{n\ge0}$ (resp., $\{U_n\}_{n\ge0}$, $\{V_n\}_{n\ge0}$) with finite mean. All these sequences are assumed mutually independent. The maintenance policy uses a number $T > 0$. If the machine has age $T$ and has not failed, it goes to state M. If it fails before it has reached age $T$, it enters state R. From states M and R, the next state is G. Find the steady state probability that the machine is operational. (Note that "good" does not mean "operational". The machine can be "good" but, due to the operations policy, in maintenance, and therefore not operational. However, after a period of maintenance or of repair, we consider that the machine starts anew, and enters a G period.)

Exercise 10.5.18. A two state semi-Markov process
Let
\[ P = \begin{pmatrix} \alpha & 1 - \alpha \\ 1 - \beta & \beta \end{pmatrix} , \]
where $\alpha, \beta \in (0, 1)$, and let $G_1$ and $G_2$ be two proper cumulative distribution functions. Let $\{X(t)\}_{t\ge0}$ be the stochastic process evolving as follows. When in state $i$ ($i = 1, 2$) it stays there for a random time with distribution $G_i$ ($i = 1, 2$) after which it moves to state $j$ (possibly the same state) with the probability $p_{ij}$ (the $(i, j)$-entry of $P$). The successive sojourn times (in either state) are independent given the knowledge of the state the process is in (the reader will clarify this imprecise sentence). What is the asymptotic distribution of the process, that is, what is $\lim_{t\uparrow\infty} P(X(t) = 1)$?
Chapter 11 Brownian Motion

Brownian motion was originally introduced as an idealized representation of the chaotic motion of an isolated particle in water due to the steady bombardment by neighboring molecules. It plays a fundamental role in the theory of stochastic processes and various domains of application such as mathematical finance and communications theory. This chapter is an introduction to some of its more notable properties and to the Wiener–Doob stochastic integral, which is a fundamental tool in the theory of wide-sense stationary processes (Chapter 12).

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/9783030401832_11

11.1 Brownian Motion or Wiener Process

11.1.1 As a Rescaled Random Walk

Recall that two complex random variables $X$ and $Y$ in $L^2_{\mathbb{C}}(P)$ are called orthogonal if $E[XY^*] = 0$.

Definition 11.1.1 A stochastic process $\{X(t)\}_{t\in\mathbb{R}}$ is said to have independent (resp., orthogonal) increments if for all $n \ge 2$ and for all mutually disjoint intervals $(a_1, b_1], \ldots, (a_n, b_n]$ of $\mathbb{R}$, the random variables $X(b_1) - X(a_1), \ldots, X(b_n) - X(a_n)$ are independent (resp., mutually orthogonal).

Clearly, a centered second-order stochastic process with independent increments has a fortiori orthogonal increments. Brownian motion is the fundamental example of a Gaussian process.

Definition 11.1.2 By definition, a standard Brownian motion, or standard Wiener process, is a continuous centered Gaussian process $\{W(t)\}_{t\in\mathbb{R}_+}$ with independent increments and such that $W(0) = 0$ and $\mathrm{Var}(W(b) - W(a)) = b - a$ ($[a, b] \subset \mathbb{R}_+$).

(The existence of a process with the required distribution is guaranteed by Theorem 5.1.23. The existence of a continuous version is proved in the forthcoming Theorem 11.2.7.) In particular, the pdfs of the vectors $(W(t_1), \ldots, W(t_k))$ ($0 < t_1 < \ldots < t_k$) are
\[ \frac{1}{(\sqrt{2\pi})^k \sqrt{t_1 (t_2 - t_1) \cdots (t_k - t_{k-1})}} \exp\left\{ -\frac{1}{2} \left( \frac{x_1^2}{t_1} + \frac{(x_2 - x_1)^2}{t_2 - t_1} + \cdots + \frac{(x_k - x_{k-1})^2}{t_k - t_{k-1}} \right) \right\} . \]
Note for future reference that (Exercise 11.5.2) for s, t ∈ R+ , E[W (t)W (s)] = t ∧ s.
(11.1)
Definition 11.1.3 The stochastic process with values in $\mathbb{R}^k$
\[ W(t) := (W_1(t), \ldots, W_k(t)) \qquad (t \in \mathbb{R}_+) , \]
where $\{W_j(t)\}_{t\in\mathbb{R}_+}$ ($1 \le j \le k$) are independent (standard) Wiener processes, is called a $k$-dimensional (standard) Wiener process.

The following extension of the definition of Brownian motion slightly enlarges the scope of the theory by introducing a history that is possibly larger than the internal history.

Definition 11.1.4 Let $\{\mathcal{F}_t\}_{t\ge0}$ be a filtration. A real stochastic process $\{W(t)\}_{t\ge0}$ is called a standard $\mathcal{F}_t$-Brownian motion if
(i) for $P$-almost all $\omega$, the trajectories $t \mapsto W(t, \omega)$ are continuous,
(ii) for all $t \ge 0$, $W(t)$ is $\mathcal{F}_t$-measurable, and
(iii) $W(0) = 0$ and for all $0 \le s \le t$, $W(t) - W(s)$ is a centered Gaussian random variable independent of $\mathcal{F}_s$ and with variance $t - s$.

Consider a symmetric random walk on $\mathbb{Z}$ with initial state 0. It admits the representation
\[ X_n = \sum_{k=1}^n Z_k , \]
where $\{Z_n\}_{n\ge1}$ is an iid sequence of $\{-1, +1\}$-valued random variables with $P(Z_1 = \pm 1) = \frac{1}{2}$. A time-continuous stochastic process is constructed from this sequence as follows. A time-step equal to 1 in the discrete-time model represents in the continuous-time model $\Delta$ units of time. The amplitude scale is also modified, a distance of 1 in the original discrete-time model representing $\delta$ state space units in the modified model. We are therefore considering the continuous-time process $\{X(t)\}_{t\ge0}$ given by
\[ X(t) := \delta X_{\lfloor t/\Delta \rfloor} = \delta \sum_{k=1}^{\lfloor t/\Delta \rfloor} Z_k . \tag{11.2} \]
(The dependence on $\delta$ and $\Delta$ will not be made explicit in the notation for $X(t)$.) Since the $Z_k$'s are centered and of variance 1, $E[X(t)] = 0$ and $\mathrm{Var}(X(t)) = \delta^2 \times \lfloor t/\Delta \rfloor$. Let now $\Delta$ and $\delta$ tend to 0 in such a way that the limit in distribution exists and is not trivial. With respect to this goal, the choice $\delta = \Delta$ is not satisfactory since $E[X(t)] = 0$ and $\lim_{\Delta\downarrow0} \mathrm{Var}(X(t)) = 0$, leading to a null process. With the choice $\delta^2 = \Delta$, $E[X(t)] = 0$ and $\lim_{\Delta\downarrow0} \mathrm{Var}(X(t)) = t$. In this case, since by the central limit theorem
\[ \frac{\sum_{k=1}^n Z_k}{\sqrt{n}} \xrightarrow{D} N(0, 1) , \]
we have that (using the fact that if a sequence of random variables $X_n$ converges in distribution to some random variable $X$ and if the sequence of real numbers $a_n$ converges to the real number $a$, then $a_n X_n$ converges in distribution to $aX$; Theorem 4.4.8)
\[ \frac{X(t)}{\sqrt{t}} = \frac{\sum_{k=1}^{\lfloor t/\Delta \rfloor} Z_k}{\sqrt{\lfloor t/\Delta \rfloor}} \times \frac{\sqrt{\Delta \lfloor t/\Delta \rfloor}}{\sqrt{t}} \xrightarrow{D} N(0, 1) . \]
Thus at the limit (in distribution) as $\Delta \downarrow 0$, $X(t)$ is a centered Gaussian variable with variance $t$. In fact, for all $t_1, \ldots, t_k$ in $\mathbb{R}_+$ forming an increasing sequence, the limit distribution of the vector $(X(t_1), \ldots, X(t_k))$ corresponds to a Brownian motion (Exercise 11.5.1).
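The scaling $\delta^2 = \Delta$ can be illustrated numerically. The sketch below samples $X(t) = \delta \sum_{k \le \lfloor t/\Delta \rfloor} Z_k$ many times and compares the empirical mean and variance with $0$ and $t$. The parameter values ($t = 1$, $\Delta = 2^{-10}$, 4000 samples) and the function names are arbitrary choices for illustration; the number of $+1$ steps is drawn as the count of 1-bits among $n$ fair random bits, which is Binomial$(n, 1/2)$.

```python
import math
import random


def sample_scaled_walk(t, delta_t, rng):
    """Sample X(t) = delta * (Z_1 + ... + Z_n), n = floor(t / delta_t), delta = sqrt(delta_t)."""
    n = int(t / delta_t)
    if n == 0:
        return 0.0
    # Number of +1 steps among n fair coin flips: count the 1-bits of n random bits.
    plus_ones = bin(rng.getrandbits(n)).count("1")
    return math.sqrt(delta_t) * (2 * plus_ones - n)


def sample_mean_and_variance(t, delta_t, n_samples, seed=0):
    """Empirical mean and (unbiased) variance of independent samples of X(t)."""
    rng = random.Random(seed)
    xs = [sample_scaled_walk(t, delta_t, rng) for _ in range(n_samples)]
    mean = sum(xs) / n_samples
    var = sum((x - mean) ** 2 for x in xs) / (n_samples - 1)
    return mean, var


mean, var = sample_mean_and_variance(t=1.0, delta_t=2.0 ** -10, n_samples=4000)
```

The empirical variance should be close to $t = 1$ and the empirical mean close to 0, up to sampling error. Replacing `math.sqrt(delta_t)` by `delta_t` (the choice $\delta = \Delta$) would instead drive the variance to 0 as $\Delta \downarrow 0$.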
Behavior at Infinity

In view of the previous approximation of a Wiener process as a symmetric random walk on $\mathbb{Z}$, the following result is expected.

Theorem 11.1.5 Let $\{W(t)\}_{t\ge0}$ be a standard Brownian motion. Then, $P$-a.s.,
\[ \limsup_{t\uparrow+\infty} W(t) = +\infty , \qquad \liminf_{t\uparrow\infty} W(t) = -\infty . \]

Proof. This follows from
\[ \limsup_{n\uparrow+\infty} W(n) = +\infty , \qquad \liminf_{n\uparrow\infty} W(n) = -\infty \qquad P\text{-a.s.} \]
which in turn is a direct consequence of Theorem 4.3.9 applied to the random walk $S_n = W(n)$ of step $X_n = W(n) - W(n-1)$.

Corollary 11.1.6 For all $a \in \mathbb{R}$, the $\mathcal{F}^W_t$-stopping time $T_a = \inf\{t \ge 0 \,;\ W(t) \ge a\}$ is almost surely finite. (An explicit expression for the distribution of $T_a$ will be given in Subsection 11.2.1.)
11.1.2 Simple Operations on Brownian motion

These are the operations of
(i) symmetrization: $X(t) = -W(t)$ ($t \ge 0$),
(ii) delay: for $a > 0$, $X(t) = W(t + a) - W(a)$ ($t \ge 0$),
(iii) scaling: for $c > 0$, $X(t) = \sqrt{c}\, W\!\left(\frac{t}{c}\right)$ ($t \ge 0$),
(iv) time inversion: $X(t) = t\, W\!\left(\frac{1}{t}\right)$ ($t > 0$) and $X(0) = 0$.
In each case, the process $\{X(t)\}_{t\ge0}$ is a standard Brownian motion if $\{W(t)\}_{t\ge0}$ is a standard Brownian motion. The fact that these processes have the same distribution as the standard Brownian motion is easily checked (Exercise 11.5.3). They are also continuous. This is obvious in all cases except for the continuity at 0 of the process obtained by time inversion. However, almost surely:
\[ \lim_{t\downarrow0} t\, W\!\left(\frac{1}{t}\right) = 0 . \]

Proof. We prove the equivalent statement
\[ \lim_{s\uparrow+\infty} \frac{1}{s}\, W(s) = 0 . \]
First observe that since $W(n)$ is the sum of $n$ iid centered Gaussian variables, $\lim_{n\uparrow\infty} \frac{W(n)}{n} = 0$ by the strong law of large numbers and a fortiori $\lim_{n\uparrow\infty} \frac{W(n)}{n^2} = 0$. In the following let $n := n(s)$ be the largest integer less than or equal to $s$. Therefore, taking into account the first observation, we just have to show that
\[ \frac{W(s)}{s} - \frac{W(n)}{n} \to 0 \quad\text{as } s \to \infty . \]
But
\[ \left| \frac{W(s)}{s} - \frac{W(n)}{n} \right| \le \left| \frac{W(s)}{s} - \frac{W(n)}{s} \right| + \left| \frac{W(n)}{s} - \frac{W(n)}{n} \right| \le \frac{1}{n} \sup_{s\in[n,n+1]} |W(s) - W(n)| + |W(n)| \left| \frac{1}{s} - \frac{1}{n} \right| \le \frac{Z_n}{n} + \frac{|W(n)|}{n^2} , \]
where
\[ Z_n := \sup_{s\in[0,1]} |W(s + n) - W(n)| \]
has the same distribution as $\sup_{s\in[0,1]} |W(s)|$ and the sequence $\{Z_n\}_{n\ge1}$ is iid. Since, as observed earlier, $\frac{W(n)}{n^2} \to 0$, it remains to show that $\frac{Z_n}{n} \to 0$ or, equivalently (Theorem 4.1.3), that for any $\varepsilon > 0$,
\[ P\left( \left| \frac{Z_n}{n} \right| > \varepsilon \ \text{i.o.} \right) = 0 . \]
By the Borel–Cantelli lemma, it suffices to show that
\[ \sum_n P(|Z_n| > n\varepsilon) < \infty \]
or, equivalently, since the $Z_n$'s are identically distributed,
\[ \sum_n P(|Z_1| > n\varepsilon) < \infty . \]
But this is true of any integrable random variable $Z_1$ (see Exercise 4.6.1).
The Brownian Bridge

This is the process $\{X(t)\}_{t\in[0,1]}$ obtained from the standard Brownian motion $\{W(t)\}_{t\in[0,1]}$ by
\[ X(t) := W(t) - t\, W(1) \qquad (t \in [0, 1]) . \]
It is a Gaussian process since for all $t_1, \ldots, t_k \in [0, 1]$, the random vector $(X(t_1), \ldots, X(t_k))$ is Gaussian, being a linear function of the Gaussian vector $(W(t_1), \ldots, W(t_k), W(1))$. In particular, since it is a centered Gaussian process, its distribution is entirely characterized by its covariance function and a simple calculation (Exercise 11.5.2) gives
\[ \mathrm{cov}(X(t), X(s)) = s(1 - t) \qquad (0 \le s \le t \le 1) . \]
In particular, $X(0) = X(1) = 0$. The Brownian bridge $\{X(t)\}_{t\in[0,1]}$ is distribution-wise a Wiener process $\{W(t)\}_{t\in[0,1]}$ conditioned by $W(1) = 0$. This statement is problematic in that the conditioning event has a null probability. However, it is true "at the limit":

Theorem 11.1.7 Let $f : \mathbb{R}^k \to \mathbb{R}$ be a bounded and continuous function. Then, for any $0 \le t_1 < t_2 < \cdots < t_k \le 1$,
\[ \lim_{\varepsilon\downarrow0} E[f(W(t_1), \ldots, W(t_k)) \mid |W(1)| \le \varepsilon] = E[f(X(t_1), \ldots, X(t_k))] . \]

Proof.
\[ E[f(W(t_1), \ldots, W(t_k)) \mid |W(1)| \le \varepsilon] = E[f(X(t_1) + t_1 W(1), \ldots, X(t_k) + t_k W(1)) \mid |W(1)| \le \varepsilon] = \frac{E\left[ f(X(t_1) + t_1 W(1), \ldots, X(t_k) + t_k W(1))\, 1_{\{|W(1)| \le \varepsilon\}} \right]}{P(|W(1)| \le \varepsilon)} . \]
In view of the independence of $\{X(t)\}_{t\in[0,1]}$ and $W(1)$ (Exercise 11.5.13), this last quantity equals
\[ \frac{\int_{-\varepsilon}^{+\varepsilon} E[f(X(t_1) + t_1 x, \ldots, X(t_k) + t_k x)]\, e^{-\frac{1}{2}x^2}\, dx}{\int_{-\varepsilon}^{+\varepsilon} e^{-\frac{1}{2}x^2}\, dx} , \]
which tends to $E[f(X(t_1), \ldots, X(t_k))]$ as $\varepsilon \downarrow 0$.
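For reference, the covariance of the Brownian bridge stated before Theorem 11.1.7 can be obtained in one line from (11.1). The following is a sketch of that computation, not necessarily the solution the exercise intends.

```latex
% For 0 <= s <= t <= 1, using E[W(u)W(v)] = u \wedge v (Eqn. (11.1)):
\mathrm{cov}(X(t),X(s))
  = E\big[(W(t)-tW(1))(W(s)-sW(1))\big]
  = E[W(t)W(s)] - s\,E[W(t)W(1)] - t\,E[W(1)W(s)] + ts\,E[W(1)^2]
  = s - st - ts + ts
  = s(1-t).
```

Setting $s = t$ gives $\mathrm{Var}(X(t)) = t(1 - t)$, which vanishes at $t = 0$ and $t = 1$, consistent with $X(0) = X(1) = 0$.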
11.1.3 Gauss–Markov Processes

Definition 11.1.8 Let $T$ be $\mathbb{R}_+$ or $\mathbb{N}$. A real-valued stochastic process $\{X(t)\}_{t\ge0}$ is called a Markov process if for all $t \ge 0$ and all nonnegative (or integrable) random variables $Z$ that are $\sigma(X(s) \,;\ s \ge t)$-measurable
\[ E\left[ Z \mid \mathcal{F}^X_t \right] = E[Z \mid \sigma(X(t))] . \tag{11.3} \]
Gaussian processes that are moreover Markovian are called Gauss–Markov processes. This class of models receives a simple description in terms of the Brownian motion.

Example 11.1.9: The Brownian motion is Gauss–Markov. The Brownian motion is a Gauss–Markov process (Exercise 11.5.6).

Gauss–Markov processes are characterized among Gaussian processes by a simple property of their covariance function.
Theorem 11.1.10 Let $\{X(t)\}_{t\ge0}$ be a centered Gaussian process with continuous covariance function $\Gamma$ such that $\Gamma(t, t) > 0$ for all $t \in \mathbb{R}_+$. It is a Markov process if and only if there exist functions $f$ and $g$ such that for all $s, t \in \mathbb{R}_+$
\[ \Gamma(t, s) = f(t \vee s)\, g(t \wedge s) . \tag{11.4} \]
The proof relies on the following lemma.

Lemma 11.1.11 Let $\{X(t)\}_{t\ge0}$ be a centered Gaussian process with covariance function $\Gamma$ such that $\Gamma(t, t) > 0$ for all $t \in \mathbb{R}_+$. If in addition it is a Markov process, then for all $t > s > t_0 \ge 0$,
\[ \Gamma(t, t_0) = \frac{\Gamma(t, s)\, \Gamma(s, t_0)}{\Gamma(s, s)} . \tag{11.5} \]

Proof. By the Gaussian property, the conditional expectation of $X(t)$ given $X(t_0)$ is equal to the linear regression of $X(t)$ on $X(t_0)$:
\[ E[X(t) \mid X(t_0)] = \frac{\Gamma(t, t_0)}{\Gamma(t_0, t_0)}\, X(t_0) . \tag{$\star$} \]
Using this remark and the Markov property,
\[ E[X(t) \mid X(t_0)] = E[E[X(t) \mid X(t_0), X(s)] \mid X(t_0)] = E[E[X(t) \mid X(s)] \mid X(t_0)] = E\left[ \frac{\Gamma(t, s)}{\Gamma(s, s)}\, X(s) \,\Big|\, X(t_0) \right] = \frac{\Gamma(t, s)}{\Gamma(s, s)}\, E[X(s) \mid X(t_0)] = \frac{\Gamma(t, s)}{\Gamma(s, s)}\, \frac{\Gamma(s, t_0)}{\Gamma(t_0, t_0)}\, X(t_0) . \]
Comparing with the right-hand side of ($\star$), and since $P(X(t_0) \ne 0) > 0$ (in fact $= 1$), we obtain (11.5).

We now turn to the proof of Theorem 11.1.10.

Proof. Necessity. Suppose the process is Gauss–Markov. Let
\[ \rho(t, s) = \frac{\Gamma(t, s)}{(\Gamma(t, t))^{\frac{1}{2}}\, (\Gamma(s, s))^{\frac{1}{2}}} \]
be its autocorrelation function. By (11.5), for all $t > s > t_0 \ge 0$,
\[ \rho(t, t_0) = \rho(t, s)\, \rho(s, t_0) . \tag{$\star\star$} \]
We show that $\rho(t, s) > 0$ for all $t, s \in \mathbb{R}_+$. Indeed, assuming $s > t$ and using ($\star\star$) repeatedly, for all $n \ge 1$,
\[ \rho(t, s) = \prod_{k=0}^{n-1} \rho\left( t + \frac{(k+1)(s - t)}{n},\ t + \frac{k(s - t)}{n} \right) , \]
and therefore, using the facts that $\rho(u, u) = 1$ for all $u$ and that $\rho$ is uniformly continuous on bounded intervals, $n$ can be chosen large enough to guarantee that all the elements in the above product are positive. Therefore, one may divide by $\rho(s, t_0)$ and write ($\star\star$) as
\[ \rho(t, s) = \frac{\rho(t, t_0)}{\rho(s, t_0)} \]
or
\[ \Gamma(t, s) = \rho(t, t_0)\, \Gamma(t, t)^{\frac{1}{2}} \times \frac{\Gamma(s, s)^{\frac{1}{2}}}{\rho(s, t_0)} , \]
from which we obtain the desired conclusion (here $s = t \wedge s$ and $t = t \vee s$).

Sufficiency. Suppose that the process is Gaussian and that (11.4) holds true. Assume $t > s$. Therefore $\Gamma(t, s) = f(t) g(s)$. By Schwarz's inequality, $\Gamma(t, s) \le \Gamma(t, t)^{\frac{1}{2}} \Gamma(s, s)^{\frac{1}{2}}$ or, equivalently, $f(t) g(s) \le (f(t) g(t) f(s) g(s))^{\frac{1}{2}}$, from which it follows that $f(t) g(s) \le g(t) f(s)$. Therefore, the function
\[ \tau(t) := \frac{g(t)}{f(t)} \]
is monotone nondecreasing. In particular, the centered Gaussian process
\[ Y(t) := f(t)\, W(\tau(t)) \]
is a Markov process since the Brownian motion itself is a Markov process. Its covariance function is
\[ E[Y(t) Y(s)] = f(t) f(s)\, E[W(\tau(t)) W(\tau(s))] = f(t) f(s)\, (\tau(t) \wedge \tau(s)) = f(t) f(s)\, \tau(s) = f(t) g(s) . \]
Since it has the same covariance as $\{X(t)\}_{t\ge0}$ and since both processes are centered and Gaussian, they have the same distribution. In particular, $\{X(t)\}_{t\ge0}$ is a Markov process.
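The necessary condition (11.5) is easy to test numerically on explicit covariance functions. The sketch below checks it for three covariances of the product form (11.4): Brownian motion ($f \equiv 1$, $g(s) = s$), the Brownian bridge ($f(t) = 1 - t$, $g(s) = s$) and the stationary Ornstein–Uhlenbeck covariance $e^{-|t-s|}$ ($f(t) = e^{-t}$, $g(s) = e^{s}$). It also checks a squared-exponential covariance, for which the identity fails. The Ornstein–Uhlenbeck and squared-exponential examples are illustrative choices not taken from the text, and the function names are hypothetical.

```python
import itertools
import math


def bm_cov(t, s):
    """Brownian motion: Gamma(t, s) = t ^ s, i.e. f = 1 and g(s) = s."""
    return min(t, s)


def bridge_cov(t, s):
    """Brownian bridge on [0, 1]: Gamma(t, s) = (t ^ s)(1 - t v s)."""
    return min(t, s) * (1.0 - max(t, s))


def ou_cov(t, s):
    """Stationary Ornstein-Uhlenbeck covariance (assumed example): exp(-|t - s|)."""
    return math.exp(-abs(t - s))


def sq_exp_cov(t, s):
    """Squared-exponential covariance (assumed example): exp(-(t - s)^2)."""
    return math.exp(-((t - s) ** 2))


def satisfies_11_5(cov, t, s, t0, tol=1e-12):
    """Check Gamma(t, t0) = Gamma(t, s) Gamma(s, t0) / Gamma(s, s) for t > s > t0."""
    return abs(cov(t, t0) - cov(t, s) * cov(s, t0) / cov(s, s)) < tol


def holds_on_grid(cov, times):
    """Check (11.5) on all triples t0 < s < t taken from a sorted list of times."""
    return all(
        satisfies_11_5(cov, t, s, t0)
        for t0, s, t in itertools.combinations(times, 3)
    )


# Strictly positive times (inside (0, 1) for the bridge) keep Gamma(s, s) > 0.
positive_times = [0.5, 1.0, 1.7, 2.4, 3.0]
unit_times = [0.1, 0.3, 0.5, 0.8, 0.9]
```

For instance, `holds_on_grid(bm_cov, positive_times)`, `holds_on_grid(ou_cov, positive_times)` and `holds_on_grid(bridge_cov, unit_times)` evaluate to `True`, while `holds_on_grid(sq_exp_cov, positive_times)` is `False`, so by Lemma 11.1.11 a centered Gaussian process with the squared-exponential covariance is not Markov.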
11.2 Properties of Brownian Motion

11.2.1 The Strong Markov Property

Theorem 11.2.1 Let $\{W(t)\}_{t\ge0}$ be a standard $\mathcal{F}_t$-Brownian motion and let $\tau$ be a finite $\mathcal{F}_t$-stopping time. Then $\{W(\tau + t) - W(\tau)\}_{t\ge0}$ is a standard $\mathcal{F}_{\tau+t}$-Brownian motion independent of $\mathcal{F}_\tau$.

A proof is given in Subsection 14.3.2 via Itô calculus.

The Reflection Principle

Theorem 11.2.2 Let $\{W(t)\}_{t\ge0}$ be a standard $\mathcal{F}_t$-Brownian motion and let $T_a$ be the first time it reaches the value $a > 0$. The stochastic process Y (t) := W (t)1t 0 and y ≥ 0,
\[ P(W(t) \le a - y,\ M(t) \ge a) = P(W(t) \ge a + y) . \tag{11.7} \]

Proof. Observing that $\{M(t) \ge a\} \equiv \{T_a \le t\}$, that $T_a = \inf\{t \ge 0 \,;\ Y(t) = a\}$ and that $\{W(t) \ge a + y\} \subseteq \{T_a \le t\}$, and using Theorem 11.2.2,
\[ P(W(t) \le a - y,\ M(t) \ge a) = P(W(t) \le a - y,\ T_a \le t) = P(Y(t) \le a - y,\ T_a \le t) = P(2a - W(t) \le a - y,\ T_a \le t) = P(W(t) \ge a + y,\ T_a \le t) = P(W(t) \ge a + y) . \]

Corollary 11.2.4 For $a > 0$,
\[ P(T_a \le t) = P(M(t) \ge a) = 2P(W(t) > a) = P(|W(t)| > a) . \]

Proof. The last equality is by symmetry of Brownian motion. The first equality is a consequence of the identity $\{M(t) \ge a\} \equiv \{T_a \le t\}$. It remains to prove the second equality. We have
\[ P(M(t) \ge a) = P(M(t) \ge a,\ W(t) \le a) + P(M(t) \ge a,\ W(t) > a) = P(M(t) \ge a,\ W(t) \le a) + P(W(t) > a) = P(W(t) > a) + P(W(t) > a) , \]
where it was observed that $\{W(t) > a\} \subseteq \{M(t) \ge a\}$ for the second equality, and where (11.7) was applied with $y = 0$ for the third one.

The above results immediately yield the distribution of $T_a$: For $a \ge 0$ and $t \ge 0$,
\[ P(T_a \le t) = \frac{2}{\sqrt{2\pi}} \int_{a/\sqrt{t}}^\infty e^{-\frac{y^2}{2}}\, dy . \]
Since the law of $T_{-a}$ is the same as that of $T_a$, the formula for any $a$ is
\[ P(T_a \le t) = \frac{2}{\sqrt{2\pi}} \int_{|a|/\sqrt{t}}^\infty e^{-\frac{y^2}{2}}\, dy . \]

Definition 11.2.5 A real stochastic process $\{X(t)\}_{t\ge0}$ with stationary increments is called recurrent if for all $x, y \in \mathbb{R}$, this process starting at time 0 from $x$ will almost surely reach $y$ in finite random time. It is called null recurrent if it is recurrent but this random time is not integrable.
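Corollary 11.2.6 Brownian motion is null recurrent.

Proof. It suffices to verify the conditions of null recurrence of the above definition for $x = 0$. Letting $t \uparrow \infty$ in the expression of the cumulative distribution function of $T_y$ ($y \in \mathbb{R}$), we obtain $P(T_y < \infty) = 1$. On the other hand, with $\alpha := \frac{2}{\sqrt{2\pi}}$ and $\beta := |y|$,
\[ E[T_y] = \int_0^\infty P(T_y \ge t)\, dt = \alpha \int_0^\infty \left( \int_0^{\beta/\sqrt{t}} e^{-\frac{1}{2}u^2}\, du \right) dt = \alpha \int_0^\infty \left( \int_0^{\beta^2/u^2} dt \right) e^{-\frac{1}{2}u^2}\, du = \alpha\beta^2 \int_0^\infty \frac{1}{u^2}\, e^{-\frac{1}{2}u^2}\, du = +\infty . \]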
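The distribution of $T_a$ and the divergence of its mean can be explored numerically. The sketch below writes $P(T_a \le t) = 2P(W(t) > |a|)$ with the complementary error function, and approximates the truncated mean $E[\min(T_a, H)] = \int_0^H P(T_a > t)\, dt$ by a midpoint rule. The function names, the horizons and the number of integration steps are illustrative assumptions.

```python
import math


def hitting_time_cdf(a, t):
    """P(T_a <= t) = 2 P(W(t) > |a|) = erfc(|a| / sqrt(2 t)) for t > 0."""
    if t <= 0.0:
        return 1.0 if a == 0 else 0.0
    return math.erfc(abs(a) / math.sqrt(2.0 * t))


def truncated_mean(a, horizon, steps=20000):
    """Midpoint-rule approximation of E[min(T_a, horizon)] = int_0^horizon P(T_a > t) dt."""
    h = horizon / steps
    return h * sum(1.0 - hitting_time_cdf(a, (i + 0.5) * h) for i in range(steps))


p_one = hitting_time_cdf(1.0, 1.0)        # equals 2 * P(N(0, 1) > 1)
mean_100 = truncated_mean(1.0, 100.0)
mean_400 = truncated_mean(1.0, 400.0)
```

Since $P(T_a > t)$ decays only like $t^{-1/2}$, the truncated mean keeps growing roughly like $\sqrt{H}$ as the horizon $H$ increases (quadrupling $H$ about doubles it), which is the numerical counterpart of $E[T_a] = +\infty$.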
11.2.2 Continuity

Theorem 11.2.7 Consider a stochastic process such as the Brownian motion of Definition 11.1.2, except that continuity of the trajectories is not assumed. There exists a version of it having almost surely continuous paths.

Proof. For any $s, t \in \mathbb{R}_+$, $W(t) - W(s)$ has the same distribution as $|t - s|^{\frac{1}{2}}\, Y$ where $Y$ is a centered Gaussian variable with unit variance. In particular, for any $\alpha > 0$,
\[ E[|W(t) - W(s)|^\alpha] = |t - s|^{\frac{\alpha}{2}}\, E|Y|^\alpha , \]
from which the result immediately follows by application of Theorem 5.2.3: take $\alpha > 2$, $\beta = \frac{1}{2}\alpha - 1$ and $K = E|Y|^\alpha$.
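As a concrete instance of the choice of constants in this proof, take $\alpha = 4$. The sketch below only specializes the displayed moment identity; it assumes, as the constants above suggest, that Theorem 5.2.3 is applied to a bound with exponent $1 + \beta$ on $|t - s|$.

```latex
% Fourth moment of a standard Gaussian variable: E[Y^4] = 3. Hence, with \alpha = 4,
E\big[|W(t)-W(s)|^4\big] = |t-s|^{2}\,E[Y^4] = 3\,|t-s|^{2} = K\,|t-s|^{1+\beta},
\qquad \beta = \tfrac{1}{2}\alpha - 1 = 1, \quad K = 3 .
```

The exponent $\alpha = 2$ would give $E[|W(t) - W(s)|^2] = |t - s|$, that is $\beta = 0$, which is why the proof requires $\alpha > 2$.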
11.2.3
Nondiﬀerentiability
Definition 11.1.2 of the Wiener process does not tell much about the qualitative behavior of its trajectories. Although the trajectories of the (standard) Brownian motion are almost surely continuous functions, their behavior is otherwise rather chaotic. First of all observe that, for fixed $t_0 > 0$, the random variable
$$\frac{W(t_0 + h) - W(t_0)}{h} \overset{\mathcal{D}}{\sim} \mathcal{N}\left(0, h^{-1}\right)$$
does not converge in distribution as $h \downarrow 0$, and a fortiori does not converge almost surely. Therefore, for any $t_0 > 0$,
$$P(t \to W(t) \text{ is not differentiable at } t_0) = 1 .$$
But the situation is even more dramatic:

Theorem 11.2.8 Almost all the paths of the Wiener process are nowhere differentiable.

Proof. We shall prove that
$$P\left( \limsup_{h \to 0} \frac{|W(t + h) - W(t)|}{|h|} = +\infty \text{ for all } t \in [0, 1] \right) = 1 .$$
Fix $\beta > 0$. If a function $f : (0, 1) \to \mathbb{R}$ has at some point $s \in [0, 1]$ a derivative $f'(s)$ of absolute value smaller than $\beta$, then there exists an integer $n_0$ such that for $n \ge n_0$,
$$|f(t) - f(s)| < 2\beta\, |t - s| \quad \text{if } |t - s| \le \frac{2}{n} . \tag{$\star$}$$
Let
$$C_n := \{f : [0, 1] \to \mathbb{R} \,;\, \text{there exists an } s \in [0, 1] \text{ satisfying } (\star)\} .$$
Let
$$A_n := \{\omega \,;\, \text{the function } t \in [0, 1] \to W(t, \omega) \text{ is in } C_n\} .$$
This event increases with $n$ and its limit $A$ includes all the samples $\omega$ corresponding to a trajectory $t \in (0, 1) \to W(t, \omega)$ having at least at one point of $[0, 1]$ a derivative of absolute value smaller than $\beta$. Therefore it suffices to show that $P(A) = 0$ for all $\beta > 0$.

If $\omega \in A_n$, letting $k$ be the largest integer such that $\frac{k}{n} \le s$ (where $s$ is the point in the definition of $C_n$), and letting
$$Y_k(\omega) := \max_{j = -1, 0, +1} \left| W\left( \frac{k + j + 1}{n}, \omega \right) - W\left( \frac{k + j}{n}, \omega \right) \right| ,$$
then $Y_k(\omega) \le \frac{6\beta}{n}$. Therefore,
$$A_n \subseteq B_n := \left\{ \omega \,;\, \text{at least one } Y_k(\omega) \le \frac{6\beta}{n} \right\} .$$
In order to prove that $P(A) = 0$, it is then enough to show that $\lim_n P(B_n) = 0$. But
$$B_n = \bigcup_{k=1}^{n-2} \left\{ \omega \,;\, Y_k(\omega) \le \frac{6\beta}{n} \right\} ,$$
and by sub-σ-additivity,
$$P(B_n) \le \sum_{k=1}^{n-2} P\left( \max_{j = -1, 0, +1} \left| W\left( \frac{k + j + 1}{n} \right) - W\left( \frac{k + j}{n} \right) \right| \le \frac{6\beta}{n} \right) .$$
By the independence property of the increments of a Wiener process and since all the variables involved are $\mathcal{N}\left(0, \frac{1}{n}\right)$,
$$\begin{aligned}
P(B_n) &\le n \left( P\left( \left| \mathcal{N}\left(0, \tfrac{1}{n}\right) \right| \le \frac{6\beta}{n} \right) \right)^3 = n \left( \sqrt{\frac{n}{2\pi}} \int_{-\frac{6\beta}{n}}^{+\frac{6\beta}{n}} e^{-\frac{n x^2}{2}}\, dx \right)^3 \\
&= n \left( \frac{1}{\sqrt{2\pi n}} \int_{-6\beta}^{+6\beta} e^{-\frac{x^2}{2n}}\, dx \right)^3 \to 0 .
\end{aligned}$$
If a function f : [0, 1] → R is of bounded variation on [0, 1], it has a derivative almost everywhere (with respect to the Lebesgue measure) in [0, 1]. Therefore: Corollary 11.2.9 Almost every trajectory of a Brownian motion is of unbounded variation on any interval.
11.2.4
Quadratic Variation
Let $D := \{0 = t_0 \le t_1 \le \dots \le t_n = t\}$ be a division of the interval $[0, t]$ with maximum gap
$$\Delta := \max_{1 \le k \le n} (t_k - t_{k-1}) .$$
Let
$$V_W(t, D) := \sum_{k=1}^{n} |W(t_k) - W(t_{k-1})| .$$
By Corollary 11.2.9, the variation $\sup_D V_W(t, D)$ of the Wiener process on the interval $[0, t]$ is almost surely infinite. However, the quadratic variation of this process on the interval $[0, t]$, defined by
$$Q_W(t, D) := \sum_{k=1}^{n} (W(t_k) - W(t_{k-1}))^2 ,$$
is such that
$$E[Q_W(t, D)] = \sum_{k=1}^{n} E\left[ (W(t_k) - W(t_{k-1}))^2 \right] = \sum_{k=1}^{n} (t_k - t_{k-1}) = t .$$
But there is more:

Theorem 11.2.10
A. As $\Delta \to 0$, $Q_W(t, D) \to t$ in $L^2_{\mathbb{R}}(P)$.
B. If $\{D_i\}_{i \ge 0}$ is a sequence of subdivisions of $[0, t]$ such that $\Delta_i = o(i^{-2})$, then
$$\lim_{i \uparrow \infty} Q_W(t, D_i) = t, \quad P\text{-a.s.}$$

Proof. A. Write
$$Q_W(t, D) - t = \sum_{k=1}^{n} Z_k ,$$
where $Z_k = (W(t_k) - W(t_{k-1}))^2 - (t_k - t_{k-1})$, a centered random variable with variance $E[Z_k^2] = 2(t_k - t_{k-1})^2$. Therefore, by independence of the $Z_k$'s,
$$E\left[ (Q_W(t, D) - t)^2 \right] = \sum_{k=1}^{n} E\left[ Z_k^2 \right] = 2 \sum_{k=1}^{n} (t_k - t_{k-1})^2 \le 2\Delta \sum_{k=1}^{n} (t_k - t_{k-1}) = 2\Delta t .$$
B. Take $\Delta_i = \varepsilon_i / i^2$ with $\lim_{i \uparrow \infty} \varepsilon_i = 0$. By Markov's inequality,
$$P\left( |Q_W(t, D_i) - t| > i\sqrt{2\Delta_i} \right) = P\left( |Q_W(t, D_i) - t|^2 > 2\varepsilon_i \right) \le \frac{E\left[ |Q_W(t, D_i) - t|^2 \right]}{2\varepsilon_i} \le \frac{2\Delta_i t}{2\varepsilon_i} = \frac{t}{i^2} .$$
The announced result then follows from Theorem 4.1.2.
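The contrast between the two variations can be observed on simulated increments. The following sketch is an illustration under stated assumptions: it uses the uniform division of $[0, t]$ into $n$ intervals (so that $\Delta = t/n$), draws the independent $\mathcal{N}(0, t/n)$ increments with Python's standard generator, and returns both $V_W(t, D)$ and $Q_W(t, D)$. The function name and the seed are choices made for this example only.

```python
import math
import random


def variations(t: float, n: int, rng: random.Random) -> tuple:
    """Return (V_W(t, D), Q_W(t, D)) for the uniform division of [0, t] into n intervals."""
    sd = math.sqrt(t / n)  # each increment W(t_k) - W(t_{k-1}) is N(0, t/n)
    first_variation = 0.0
    quadratic_variation = 0.0
    for _ in range(n):
        dw = rng.gauss(0.0, sd)
        first_variation += abs(dw)
        quadratic_variation += dw * dw
    return first_variation, quadratic_variation


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 10_000, 1_000_000):
        v, q = variations(2.0, n, rng)
        # q should settle near t = 2 while v keeps growing, roughly like sqrt(n).
        print(n, v, q)
```

For the uniform division the variance of $Q_W(t, D)$ is $2t^2/n$, in agreement with the bound $2\Delta t$ of part A.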
11.3
The Wiener–Doob Integral
11.3.1
Construction
The Wiener stochastic integral
$$\int_{\mathbb{R}} f(t)\, dW(t) \tag{11.8}$$
will be defined for a certain class of measurable functions $f$. Note, however, that this integral will not be of the usual type. For instance, it cannot be defined pathwise as a Stieltjes–Lebesgue integral since the trajectories of the Brownian motion are of unbounded variation. This integral cannot be interpreted as $\int_{\mathbb{R}} f(t)\dot{W}(t)\, dt$ either (the dot denotes differentiation) since the Brownian motion does not have a derivative. Therefore, the integral in (11.8) will be defined in a radically different way. In fact, the Doob–Wiener stochastic integral is defined with respect to a stochastic process with centered and uncorrelated increments. This generalizes the original Wiener integral (which is defined with respect to the Brownian motion).

Let $\{Z(t)\}_{t \in \mathbb{R}}$ be a complex-valued stochastic process such that
(i) for all intervals $[t_1, t_2] \subset \mathbb{R}$, the increments $Z(t_2) - Z(t_1)$ are centered and in $L^2_{\mathbb{C}}(P)$, and
(ii) there exists a locally finite measure $\mu$ on $(\mathbb{R}, \mathcal{B})$ such that
$$E\left[ (Z(t_2) - Z(t_1))(Z(t_4) - Z(t_3))^* \right] = \mu((t_1, t_2] \cap (t_3, t_4]) \tag{11.9}$$
for all $[t_1, t_2] \subset \mathbb{R}$ and all $[t_3, t_4] \subset \mathbb{R}$.

Note in particular that if $(t_1, t_2] \cap (t_3, t_4] = \emptyset$, $Z(t_2) - Z(t_1)$ and $Z(t_4) - Z(t_3)$ are orthogonal random variables of $L^2_{\mathbb{C}}(P)$.

Definition 11.3.1 The above stochastic process $\{Z(t)\}_{t \in \mathbb{R}}$ is called a stochastic process with centered and uncorrelated increments with structural measure $\mu$.
Example 11.3.2: The structural measure of the Wiener process. The Wiener process {W (t)}t∈R is such a process, with structural measure equal to the Lebesgue measure.
Example 11.3.3: The structural measure of a compensated hpp. Let $N$ be an hpp on $\mathbb{R}$ with intensity $\lambda$. Define $\{Z(t)\}_{t \in \mathbb{R}}$ by $Z(0) = 0$ and, for all $[a, b] \subset \mathbb{R}$, $Z(b) - Z(a) = N((a, b]) - \lambda \times (b - a)$. Then $\{Z(t)\}_{t \in \mathbb{R}}$ is a stochastic process with centered and uncorrelated increments whose structural measure is $\lambda$ times the Lebesgue measure.

The Wiener–Doob integral
$$\int_{\mathbb{R}} f(t)\, dZ(t)$$
is constructed for all $f \in L^2_{\mathbb{C}}(\mu)$ in the following manner. First of all, we define this integral for all $f \in \mathcal{L}$, the vector subspace of $L^2_{\mathbb{C}}(\mu)$ formed by the finite complex linear combinations of interval indicator functions
$$f(t) = \sum_{i=1}^{N} \alpha_i 1_{(a_i, b_i]}(t) .$$
For such functions, by definition,
$$\int_{\mathbb{R}} f(t)\, dZ(t) := \sum_{i=1}^{N} \alpha_i (Z(b_i) - Z(a_i)) . \tag{$\star$}$$
One easily verifies that the linear mapping
$$\varphi : f \in \mathcal{L} \to \int_{\mathbb{R}} f(t)\, dZ(t) \in L^2_{\mathbb{C}}(P)$$
is an isometry, that is, for all $f \in \mathcal{L}$,
$$\int_{\mathbb{R}} |f(t)|^2\, \mu(dt) = E\left[ \left| \int_{\mathbb{R}} f(t)\, dZ(t) \right|^2 \right] .$$
Since $\mathcal{L}$ is dense in $L^2_{\mathbb{C}}(\mu)$, $\varphi$ can be uniquely extended to an isometric linear mapping of $L^2_{\mathbb{C}}(\mu)$ into $L^2_{\mathbb{C}}(P)$. We continue to call this extension $\varphi$ and then define, for all $f \in L^2_{\mathbb{C}}(\mu)$,
$$\int_{\mathbb{R}} f(t)\, dZ(t) := \varphi(f) .$$
The fact that $\varphi$ is an isometry is expressed by Doob's isometry formula:
$$E\left[ \left( \int_{\mathbb{R}} f(t)\, dZ(t) \right) \left( \int_{\mathbb{R}} g(t)\, dZ(t) \right)^* \right] = \int_{\mathbb{R}} f(t) g^*(t)\, \mu(dt) , \tag{11.10}$$
where $f$ and $g$ are in $L^2_{\mathbb{C}}(\mu)$. Note also that for all $f \in L^2_{\mathbb{C}}(\mu)$,
$$E\left[ \int_{\mathbb{R}} f(t)\, dZ(t) \right] = 0 , \tag{11.11}$$
since the Doob integral is the limit in $L^2_{\mathbb{C}}(P)$ of random variables of the type $\sum_{i=1}^{N} \alpha_i (Z(b_i) - Z(a_i))$ that have mean 0 (use the continuity of the inner product in $L^2_{\mathbb{C}}(P)$).

Remark 11.3.4 In the case where $Z(t) := W(t)$ ($t \in \mathbb{R}_+$), a Wiener process, the right-hand side of ($\star$) is in the Gaussian Hilbert subspace $H(W)$, and so are the Wiener integrals, being limits in quadratic mean of elements of $H(W)$.
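Series Expansion of Wiener integrals
Let $f$ be a function of $L^2_{\mathbb{R}}([a, b])$ and let $\{W(t)\}_{t \in [a, b]}$ be a standard Wiener process. Let $\{\varphi_n\}_{n \ge 1}$ be an orthonormal basis of the Hilbert space $L^2_{\mathbb{R}}([a, b])$. In particular,
$$f = \sum_{n=1}^{\infty} \langle f, \varphi_n \rangle \varphi_n ,$$
where the convergence of the series of the right-hand side is in $L^2_{\mathbb{R}}([a, b])$. Consider now the sequence of random variables
$$Z_n := \int_a^b \varphi_n(t)\, dW(t) \quad (n \ge 1) .$$
This is a Gaussian sequence (Remark 11.3.4). Moreover, the $Z_n$'s are uncorrelated since by the isometry formula for the Doob–Wiener integrals, if $n \ne k$,
$$E[Z_n Z_k] = \int_a^b \varphi_n(t) \varphi_k(t)\, dt = 0 .$$
Therefore $\{Z_n\}_{n \ge 1}$ is an iid sequence of standard Gaussian variables (the same isometry formula gives $E[Z_n^2] = \int_a^b \varphi_n(t)^2\, dt = 1$).

Theorem 11.3.5 For $f \in L^2_{\mathbb{R}}([a, b])$, we have the expansion
$$\int_a^b f(t)\, dW(t) = \sum_{n=1}^{\infty} \langle f, \varphi_n \rangle Z_n ,$$
where the convergence of the series in the right-hand side is in $L^2_{\mathbb{R}}(P)$ and almost surely.

Proof. It is enough to prove convergence in $L^2_{\mathbb{R}}(P)$ since the statement about almost-sure convergence then follows from Theorem 4.1.15 and the fact that when a sequence of random variables converges almost surely and in $L^2_{\mathbb{R}}(P)$, the respective limits are almost surely equal. For convergence in $L^2_{\mathbb{R}}(P)$, using $E[Z_n Z_k] = 0$ for $n \ne k$ and $E[Z_n^2] = 1$:
$$\begin{aligned}
E\left[ \left( \int_a^b f(t)\, dW(t) - \sum_{n=1}^{N} \langle f, \varphi_n \rangle Z_n \right)^2 \right]
&= E\left[ \left( \int_a^b f(t)\, dW(t) \right)^2 \right] - 2 \sum_{n=1}^{N} \langle f, \varphi_n \rangle E\left[ \int_a^b f(t)\, dW(t)\, Z_n \right] + \sum_{n=1}^{N} \langle f, \varphi_n \rangle^2 E\left[ Z_n^2 \right] \\
&= \int_a^b f(t)^2\, dt - 2 \sum_{n=1}^{N} \langle f, \varphi_n \rangle \int_a^b f(t) \varphi_n(t)\, dt + \sum_{n=1}^{N} \langle f, \varphi_n \rangle^2 \\
&= \int_a^b f(t)^2\, dt - 2 \sum_{n=1}^{N} \langle f, \varphi_n \rangle^2 + \sum_{n=1}^{N} \langle f, \varphi_n \rangle^2 \\
&= \int_a^b f(t)^2\, dt - \sum_{n=1}^{N} \langle f, \varphi_n \rangle^2 \to 0 .
\end{aligned}$$

Remark 11.3.6 For the purpose of sampling the integral $\int_a^b f(t)\, dW(t)$ (that is, of generating a random variable with the same distribution as $\int_a^b f(t)\, dW(t)$) it is enough to use any sequence $\{Z_n\}_{n \ge 1}$ of independent standard Gaussian random variables.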
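The sampling recipe of Remark 11.3.6 can be sketched as follows. This is a minimal illustration under assumptions that are not in the text: the interval is $[0, 1]$, the orthonormal basis is the cosine system $\varphi_1 = 1$, $\varphi_n(t) = \sqrt{2}\cos((n-1)\pi t)$, the coefficients $\langle f, \varphi_n \rangle$ are approximated by a midpoint rule, and the series is truncated after finitely many terms. All function names are hypothetical.

```python
import math
import random


def phi(n: int, t: float) -> float:
    """Cosine orthonormal basis of L2([0, 1]): phi_1 = 1, phi_n = sqrt(2) cos((n - 1) pi t)."""
    return 1.0 if n == 1 else math.sqrt(2.0) * math.cos((n - 1) * math.pi * t)


def coefficients(f, n_terms: int, grid: int = 2000) -> list:
    """Midpoint-rule approximation of <f, phi_n> for n = 1, ..., n_terms."""
    h = 1.0 / grid
    mids = [(k + 0.5) * h for k in range(grid)]
    return [h * sum(f(t) * phi(n, t) for t in mids) for n in range(1, n_terms + 1)]


def sample_wiener_integral(coefs: list, rng: random.Random) -> float:
    """One draw of the truncated series sum_n <f, phi_n> Z_n with Z_n iid N(0, 1)."""
    return sum(c * rng.gauss(0.0, 1.0) for c in coefs)


if __name__ == "__main__":
    coefs = coefficients(lambda t: t, n_terms=50)
    # Variance of the truncated series; it should be close to int_0^1 t^2 dt = 1/3.
    print(sum(c * c for c in coefs))
    print(sample_wiener_integral(coefs, random.Random(0)))
```

The variance of the truncated sum is $\sum_{n \le N} \langle f, \varphi_n \rangle^2$, which is exactly the quantity shown to converge to $\int_a^b f(t)^2\, dt$ in the proof of Theorem 11.3.5.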
A Characterization of the Wiener Integral
The following characterization of the Wiener integral will be useful:

Lemma 11.3.7 Let $f \in L^2_{\mathbb{R}}(\ell)$ (where $\ell$ denotes the Lebesgue measure) and let $\{W(t)\}_{t \ge 0}$ be a standard Wiener process. Denote by $H(W)$ the Gaussian Hilbert space generated by this Wiener process. The Wiener integral $Z := \int_{\mathbb{R}_+} f(t)\, dW(t)$ is characterized by the following two properties:
(a) $Z \in H(W)$, and
(b) $E[Z W(s)] = \int_0^s f(t)\, dt$ for all $s \ge 0$.

Proof. Necessity: It was already noted that, by construction, $\int_{\mathbb{R}_+} f(t)\, dW(t) \in H(W)$. Also, (b) is just the isometry formula
$$E\left[ \int_{\mathbb{R}_+} f(t)\, dW(t) \int_{\mathbb{R}_+} 1_{\{t \le s\}}\, dW(t) \right] = \int_0^s f(t)\, dt .$$
Sufficiency: Since $Z - \int_{\mathbb{R}_+} f(t)\, dW(t)$ is in $H(W)$, it suffices to show that this random variable is orthogonal to all the generators $W(s)$ ($s \in \mathbb{R}_+$) of $H(W)$ to obtain that $Z - \int_{\mathbb{R}_+} f(t)\, dW(t) = 0$ $P$-a.s. But, by (b) and by the isometry formula,
$$E\left[ \left( Z - \int_{\mathbb{R}_+} f(t)\, dW(t) \right) W(s) \right] = \int_0^s f(t)\, dt - \int_0^s f(t)\, dt = 0 .$$

The next lemma features a kind of formula of integration by parts.

Lemma 11.3.8 Let $\{W(t)\}_{t \ge 0}$ be a standard Wiener process. Let $T$ be a positive real number and let $f : [0, T] \to \mathbb{R}$ be a continuously differentiable function (in particular the function $f(t) 1_{\{t \le T\}}$ is in $L^2_{\mathbb{C}}(\mu)$ and therefore the integral $\int_0^T f(t)\, dW(t)$ is well defined). Then,
$$\int_0^T f(t)\, dW(t) + \int_0^T f'(t) W(t)\, dt = f(T) W(T) . \tag{11.12}$$

Proof. By Lemma 11.3.7, it suffices to prove that for all $s \in \mathbb{R}_+$,
$$E\left[ \left( f(T) W(T) - \int_0^T f'(t) W(t)\, dt \right) W(s) \right] = \int_0^s f(t) 1_{\{t \le T\}}\, dt .$$
Using the equality $E[W(a) W(b)] = a \wedge b$, the latter reduces to
$$f(T)(T \wedge s) - E\left[ \left( \int_0^T f'(t) W(t)\, dt \right) W(s) \right] = \int_0^s f(t) 1_{\{t \le T\}}\, dt .$$
By Fubini:
$$E\left[ \left( \int_0^T f'(t) W(t)\, dt \right) W(s) \right] = \int_0^T f'(t) E[W(t) W(s)]\, dt = \int_0^T f'(t)(t \wedge s)\, dt .$$
It therefore remains to check that
$$f(T)(T \wedge s) - \int_0^T f'(t)(t \wedge s)\, dt = \int_0^s f(t) 1_{\{t \le T\}}\, dt .$$
When $T \le s$, this reduces to the identity
$$f(T) T - \int_0^T f'(t)\, t\, dt = \int_0^T f(t)\, dt ,$$
which is verified by integration by parts, and when $T \ge s$, it reduces to
$$f(T) s - \int_0^T f'(t)(s \wedge t)\, dt = \int_0^s f(t)\, dt .$$
This last identity is verified by noting that both sides are null for $s = 0$ and that their derivatives are equal for all $s \le T$:
$$f(T) - \int_s^T f'(t)\, dt = f(s) .$$
11.3.2
Langevin’s Equation
This is the equation
$$dV(t) + \alpha V(t)\, dt = \sigma\, dW(t) , \tag{11.13}$$
where $\{W(t)\}_{t \ge 0}$ is a standard Wiener process and $\alpha$ and $\sigma$ are positive real numbers, with the following interpretation:
$$V(t) - V(0) + \alpha \int_0^t V(s)\, ds = \sigma W(t) \quad (t \ge 0) . \tag{11.14}$$

Remark 11.3.9 The motion of a particle of mass $m$ on the line subjected at each instant $t$ to an external force $F(t)$ and to friction is governed by the differential equation $m x''(t) = -\alpha x'(t) + F(t)$, where $\alpha > 0$ is the friction coefficient. (It is assumed that there is no potential energy field.) If the external force is due, as in the Brown experiment, to numerous tiny shocks, one may assume that $\int_a^b F(s)\, ds = \sigma(W(b) - W(a))$ so that, letting $V(t) = x'(t)$ and taking $m = 1$, we obtain equation (11.13).

Theorem 11.3.10 The unique solution of the Langevin equation (11.13) with initial value $V(0)$ is
$$V(t) = e^{-\alpha t} V(0) + \int_0^t e^{-\alpha(t - s)} \sigma\, dW(s) . \tag{11.15}$$

Proof. Using the integration by parts formula (11.12), (11.15) is found equivalent to
$$V(t) = e^{-\alpha t} V(0) + \sigma W(t) - \int_0^t \alpha e^{-\alpha(t - s)} \sigma W(s)\, ds . \tag{$\star$}$$
Integrating from 0 to $u$ gives
$$\int_0^u V(t)\, dt = \frac{1}{\alpha}\left(1 - e^{-\alpha u}\right) V(0) + \int_0^u \sigma W(s)\, ds - \int_0^u \left( \int_0^t \alpha e^{-\alpha(t - s)} \sigma W(s)\, ds \right) dt .$$
The last integral is equal to
$$\begin{aligned}
\int_0^\infty \int_0^\infty 1_{\{s \le t \le u\}} 1_{\{s \le u\}}\, \alpha e^{-\alpha(t - s)} \sigma W(s)\, ds\, dt
&= \int_0^\infty \left( \int_0^\infty 1_{\{s \le t \le u\}}\, \alpha e^{-\alpha(t - s)}\, dt \right) 1_{\{s \le u\}}\, \sigma W(s)\, ds \\
&= \int_0^u \left( \int_s^u \alpha e^{-\alpha(t - s)}\, dt \right) \sigma W(s)\, ds \\
&= \int_0^u \left(1 - e^{-\alpha(u - s)}\right) \sigma W(s)\, ds .
\end{aligned}$$
Replacing this in the preceding equality gives
$$\alpha \int_0^u V(t)\, dt = \left(1 - e^{-\alpha u}\right) V(0) + \int_0^u \alpha e^{-\alpha(u - s)} \sigma W(s)\, ds ,$$
and therefore, by ($\star$),
$$V(u) - V(0) + \alpha \int_0^u V(t)\, dt = V(u) - e^{-\alpha u} V(0) + \int_0^u \alpha e^{-\alpha(u - s)} \sigma W(s)\, ds = \sigma W(u) .$$
To prove unicity, let $V'$ be another solution of the Langevin equation with the same initial value. Letting $U := V - V'$, we have
$$U(t) = -\alpha \int_0^t U(s)\, ds ,$$
whose unique solution is the null function (Gronwall's lemma, Theorem B.6.1).
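The explicit solution (11.15) suggests a simple way of simulating $V$ on a time grid. The sketch below rests on a consequence of (11.15) and of the isometry formula that is not spelled out in the text: over a step of length $h$, $V(t + h) = e^{-\alpha h} V(t) + \xi$, where $\xi$ is a centered Gaussian variable independent of the past with variance $\sigma^2 (1 - e^{-2\alpha h}) / (2\alpha)$. The function name, the parameter values and the use of Python's standard generator are assumptions of this example.

```python
import math
import random


def ou_path(v0: float, alpha: float, sigma: float, horizon: float,
            n_steps: int, rng: random.Random) -> list:
    """Sample V at the times k * horizon / n_steps (k = 0..n_steps) from the solution (11.15)."""
    h = horizon / n_steps
    decay = math.exp(-alpha * h)
    # Standard deviation of int_t^{t+h} exp(-alpha (t + h - s)) sigma dW(s).
    noise_sd = sigma * math.sqrt((1.0 - decay * decay) / (2.0 * alpha))
    path = [v0]
    for _ in range(n_steps):
        path.append(decay * path[-1] + noise_sd * rng.gauss(0.0, 1.0))
    return path


if __name__ == "__main__":
    rng = random.Random(1)
    finals = [ou_path(0.0, 1.0, 1.0, 1.0, 10, rng)[-1] for _ in range(20000)]
    mean = sum(finals) / len(finals)
    var = sum((x - mean) ** 2 for x in finals) / len(finals)
    # Compare the empirical variance with sigma^2 (1 - exp(-2 alpha t)) / (2 alpha).
    print(var, (1.0 - math.exp(-2.0)) / 2.0)
```

Because the recursion uses the exact transition of (11.15) rather than a discretization of (11.13), the step size affects only the resolution of the sampled path, not its distribution at the grid times.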
11.3.3
The Cameron–Martin Formula
This result is of interest in communications theory. One will recognize the likelihood ratio associated with the hypothesis "signal plus white Gaussian noise" against the hypothesis "white Gaussian noise only".

Theorem 11.3.11 Let $\{X(t)\}_{t \ge 0}$ be, with respect to probability $P$, a Wiener process with variance $\sigma^2$ and let $\gamma : \mathbb{R} \to \mathbb{R}$ be in $L^2_{\mathbb{R}}(\ell)$. For any $T \in \mathbb{R}_+$, the formula
$$\frac{dQ}{dP} = e^{\frac{1}{\sigma^2} \left\{ \int_0^T \gamma(t)\, dX(t) - \frac{1}{2} \int_0^T \gamma^2(t)\, dt \right\}} \tag{11.16}$$
defines a probability measure $Q$ on $(\Omega, \mathcal{F})$ with respect to which
$$X(t) - \int_0^t \gamma(s)\, ds$$
is, on the interval $[0, T]$, a Wiener process with variance $\sigma^2$.

The proof of Theorem 11.3.11 is based on the following preliminary result.

Lemma 11.3.12 Let $\{X(t)\}_{t \ge 0}$ be a Wiener process with variance $\sigma^2$ and let $\varphi : \mathbb{R} \to \mathbb{R}$ be in $L^2_{\mathbb{R}}(\ell)$. Then, for any $T \in \mathbb{R}_+$,
$$E\left[ e^{\int_0^T \varphi(t)\, dX(t)} \right] = e^{\frac{1}{2} \sigma^2 \int_0^T \varphi^2(t)\, dt} . \tag{11.17}$$

Proof. First consider the case
$$\varphi(t) = \sum_{k=1}^{N} \alpha_k 1_{(a_k, b_k]}(t) , \tag{11.18}$$
where $\alpha_k \in \mathbb{R}$ and the intervals $(a_k, b_k]$ are disjoint. For this special case, formula (11.17) reduces to
$$E\left[ e^{\sum_{k=1}^{N} \alpha_k (X(b_k) - X(a_k))} \right] = e^{\frac{1}{2} \sigma^2 \sum_{k=1}^{N} \alpha_k^2 (b_k - a_k)} ,$$
and therefore follows directly from the independence of the increments of a Wiener process and from the Gaussian property of these increments, in particular, the formula giving the Laplace transform of the centered Gaussian variable $X(b) - X(a)$ with variance $\sigma^2 (b - a)$:
$$E\left[ e^{\alpha (X(b) - X(a))} \right] = e^{\frac{1}{2} \sigma^2 \alpha^2 (b - a)} .$$
Let now $\{\varphi_n\}_{n \ge 1}$ be a sequence of functions of type (11.18) converging in $L^2_{\mathbb{R}}(\ell)$ to $\varphi$ (in particular, $\lim_{n \uparrow \infty} \int_0^T \varphi_n^2(t)\, dt = \int_0^T \varphi^2(t)\, dt$). Therefore,
$$\lim_{n \uparrow \infty} \int_0^T \varphi_n(t)\, dX(t) = \int_0^T \varphi(t)\, dX(t) ,$$
where the latter convergence is in $L^2_{\mathbb{R}}(P)$. This convergence can be assumed to take place almost surely by taking if necessary a subsequence. From the equality
$$E\left[ e^{\int_0^T \varphi_n(t)\, dX(t)} \right] = e^{\frac{1}{2} \sigma^2 \int_0^T \varphi_n^2(t)\, dt}$$
we can then deduce (11.17), at least if the sequence of random variables in the left-hand side is uniformly integrable. This is the case because the quantity
$$E\left[ \left| e^{\int_0^T \varphi_n(t)\, dX(t)} \right|^2 \right] = E\left[ e^{2 \int_0^T \varphi_n(t)\, dX(t)} \right] = e^{2 \sigma^2 \int_0^T \varphi_n^2(t)\, dt}$$
is uniformly bounded, and therefore the uniform integrability claim follows from Theorem 4.2.14, with $G(t) = t^2$.

We may now turn to the proof of Theorem 11.3.11.

Proof. The fact that (11.16) properly defines a probability $Q$, that is, that the expectation of the right-hand side of (11.16) equals 1, follows from Lemma 11.3.12 with $\varphi(t) = \frac{1}{\sigma^2} \gamma(t)$. Letting
$$Y(t) := X(t) - \int_0^t \gamma(s)\, ds ,$$
we have to prove that this centered stochastic process is Gaussian. To do this, we must show that
$$E_Q\left[ e^{\sum_{k=1}^{N} \alpha_k (Y(b_k) - Y(a_k))} \right] = e^{\frac{1}{2} \sigma^2 \sum_{k=1}^{N} \alpha_k^2 (b_k - a_k)} ,$$
where $\alpha_k \in \mathbb{R}$ and the intervals $(a_k, b_k] \subseteq [0, T]$ are disjoint, that is, letting $\psi(t) = \sum_{k=1}^{N} \alpha_k 1_{(a_k, b_k]}(t)$,
$$E_Q\left[ e^{\int_0^T \psi(t)\, dY(t)} \right] = e^{\frac{1}{2} \sigma^2 \int_0^T \psi^2(t)\, dt} ,$$
or equivalently,
$$E_P\left[ \frac{dQ}{dP}\, e^{\int_0^T \psi(t) (dX(t) - \gamma(t)\, dt)} \right] = e^{\frac{1}{2} \sigma^2 \int_0^T \psi^2(t)\, dt} ,$$
that is,
$$E_P\left[ e^{\frac{1}{\sigma^2} \left\{ \int_0^T \gamma(t)\, dX(t) - \frac{1}{2} \int_0^T \gamma^2(t)\, dt \right\}}\, e^{\int_0^T \psi(t) (dX(t) - \gamma(t)\, dt)} \right] = e^{\frac{1}{2} \sigma^2 \int_0^T \psi^2(t)\, dt} .$$
Simplifying:
$$E_P\left[ e^{\int_0^T \left( \psi(t) + \frac{1}{\sigma^2} \gamma(t) \right) dX(t) - \int_0^T \gamma(t) \psi(t)\, dt - \frac{1}{2} \int_0^T \frac{\gamma^2(t)}{\sigma^2}\, dt} \right] = e^{\frac{1}{2} \sigma^2 \int_0^T \psi^2(t)\, dt} ,$$
and using (11.17) with $\varphi(t) = \psi(t) + \frac{1}{\sigma^2} \gamma(t)$, the left-hand side is equal to
$$e^{\frac{1}{2} \sigma^2 \int_0^T \left( \psi(t) + \frac{1}{\sigma^2} \gamma(t) \right)^2 dt - \int_0^T \gamma(t) \psi(t)\, dt - \frac{1}{2} \int_0^T \frac{\gamma^2(t)}{\sigma^2}\, dt} .$$
The proof is completed since
$$\frac{1}{2} \sigma^2 \int_0^T \left( \psi(t) + \frac{1}{\sigma^2} \gamma(t) \right)^2 dt - \int_0^T \gamma(t) \psi(t)\, dt - \frac{1}{2} \int_0^T \frac{\gamma^2(t)}{\sigma^2}\, dt = \frac{1}{2} \sigma^2 \int_0^T \psi^2(t)\, dt .$$

Remark 11.3.13 A sweeping generalization of the Cameron–Martin theorem, the Girsanov theorem, will be given in Chapter 14 (Theorem 14.3.3). It will require the more advanced tools of the stochastic calculus associated with the Itô integral.
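A quick Monte Carlo experiment can make the change of probability (11.16) more concrete. The sketch below treats only the simplest situation, which is an assumption of this example: a constant signal $\gamma(t) = c$ on $[0, T]$, so that $\int_0^T \gamma(t)\, dX(t) = c X(T)$ and $\int_0^T \gamma^2(t)\, dt = c^2 T$. It estimates $E_P[dQ/dP]$, which should be close to 1, and $E_P[(dQ/dP)(X(T) - cT)]$, which should be close to 0 since $X(T) - cT$ is centered under $Q$. Function names, parameter values and sample size are illustrative choices.

```python
import math
import random


def likelihood_ratio(x_T: float, c: float, T: float, sigma: float) -> float:
    """dQ/dP of (11.16) when gamma(t) = c on [0, T]; only X(T) is needed in that case."""
    return math.exp((c * x_T - 0.5 * c * c * T) / (sigma * sigma))


def weighted_moments(c: float, T: float, sigma: float, n: int, rng: random.Random) -> tuple:
    """Monte Carlo estimates of E_P[dQ/dP] and E_P[(dQ/dP) (X(T) - c T)]."""
    total_ratio = 0.0
    total_weighted = 0.0
    for _ in range(n):
        x_T = rng.gauss(0.0, sigma * math.sqrt(T))  # X(T) under P
        ratio = likelihood_ratio(x_T, c, T, sigma)
        total_ratio += ratio
        total_weighted += ratio * (x_T - c * T)
    return total_ratio / n, total_weighted / n


if __name__ == "__main__":
    print(weighted_moments(c=0.5, T=1.0, sigma=1.0, n=100_000, rng=random.Random(0)))
```

This checks only two moments at the single time $T$; it does not replace the argument above, which identifies the whole law of the process under $Q$.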
11.4
Fractal Brownian Motion
The Wiener process $\{W(t)\}_{t \ge 0}$ has the following property. If $c$ is a positive constant, the process $\{W_c(t)\}_{t \ge 0} := \{c^{-\frac{1}{2}} W(ct)\}_{t \ge 0}$ is also a Wiener process. It is indeed a centered Gaussian process with independent increments, null at the time origin, and for $0 < a < b$,
$$E\left[ |W_c(b) - W_c(a)|^2 \right] = c^{-1} E\left[ |W(cb) - W(ca)|^2 \right] = c^{-1}(cb - ca) = b - a .$$
This is a particular instance of a self-similar stochastic process.

Definition 11.4.1 A real-valued stochastic process $\{Y(t)\}_{t \ge 0}$ is called self-similar with (Hurst) self-similarity parameter $H$ if for any $c > 0$,
$$\{Y(t)\}_{t \ge 0} \overset{\mathcal{D}}{\sim} \{c^{-H} Y(ct)\}_{t \ge 0} .$$
The Wiener process is therefore self-similar with similarity parameter $H = \frac{1}{2}$.

It follows from the definition that $Y(t) \overset{\mathcal{D}}{\sim} t^H Y(1)$, and therefore, if $P(Y(1) \ne 0) > 0$:
If $H < 0$, $Y(t) \to 0$ in distribution as $t \to \infty$ and $|Y(t)| \to \infty$ in distribution as $t \to 0$.
If $H > 0$, $Y(t) \to 0$ in distribution as $t \to 0$ and $|Y(t)| \to \infty$ in distribution as $t \to \infty$.
If $H = 0$, $Y(t)$ has a distribution independent of $t$.
In particular, when $H \ne 0$, a self-similar process cannot be stationary (strictly or in the wide sense). We shall be interested in self-similar processes that have stationary increments. We must restrict attention to nonnegative self-similarity parameters, because of the following negative result:¹ for any strictly negative value of the self-similarity parameter, a self-similar stochastic process with independent increments is not measurable (except of course for the trivial case where the process is identically null).

Theorem 11.4.2 Let $\{Y(t)\}_{t \ge 0}$ be a self-similar stochastic process with stationary increments and self-similarity parameter $H > 0$ (in particular, $Y(0) = 0$). Its covariance function is given by
$$\Gamma(s, t) := \operatorname{cov}(Y(s), Y(t)) = \frac{1}{2} \sigma^2 \left( t^{2H} - |t - s|^{2H} + s^{2H} \right) ,$$
where $\sigma^2 = E\left[ (Y(t + 1) - Y(t))^2 \right] = E\left[ Y(1)^2 \right]$.

Proof. Assume without loss of generality that the process is centered. Let $0 \le s \le t$. Then
$$E\left[ (Y(t) - Y(s))^2 \right] = E\left[ (Y(t - s) - Y(0))^2 \right] = E\left[ Y(t - s)^2 \right] = \sigma^2 (t - s)^{2H}$$
and
$$2 E[Y(t) Y(s)] = E\left[ Y(t)^2 \right] + E\left[ Y(s)^2 \right] - E\left[ (Y(t) - Y(s))^2 \right] ,$$
hence the result.
Fractal Brownian motion² is a Gaussian process that in a sense generalizes the Wiener process.

Definition 11.4.3 A fractal Brownian motion on $\mathbb{R}_+$ with Hurst parameter $H \in (0, 1)$ is a centered Gaussian process $\{B_H(t)\}_{t \ge 0}$ with continuous paths such that $B_H(0) = 0$, and with covariance function
$$E[B_H(t) B_H(s)] = \frac{1}{2} \left( |t|^{2H} + |s|^{2H} - |t - s|^{2H} \right) . \tag{11.19}$$

The existence of such a process follows from Theorem 5.1.23 as soon as the right-hand side of (11.19) can be shown to be a nonnegative definite function. This can be done directly, although we choose another path. We shall prove the existence of the fractal Brownian motion by constructing it as a Doob integral with respect to a Wiener process. More precisely, define for $0 < H < 1$,
$$w_H(t, s) := \begin{cases} 0 & \text{for } t \le s , \\ (t - s)^{H - \frac{1}{2}} & \text{for } 0 \le s \le t , \\ (t - s)^{H - \frac{1}{2}} - (-s)^{H - \frac{1}{2}} & \text{for } s < 0 . \end{cases}$$
Observe that for any $c > 0$,
$$w_H(ct, s) = c^{H - \frac{1}{2}}\, w_H(t, s c^{-1}) .$$
Define
$$B_H(t) := \int_{\mathbb{R}} w_H(t, s)\, dW(s) .$$
The Doob integral of the right-hand side is, more explicitly,
$$A - B := \int_0^t (t - s)^{H - \frac{1}{2}}\, dW(s) - \int_{-\infty}^0 \left( (-s)^{H - \frac{1}{2}} - (t - s)^{H - \frac{1}{2}} \right) dW(s) . \tag{11.20}$$
It is well defined and, for $B_H(ct)$, with the change of variable $u = c^{-1} s$ it becomes
$$c^{H - \frac{1}{2}} \int_{\mathbb{R}} w_H(t, u)\, dW(cu) .$$
Using the self-similarity of the Wiener process, the process defined by the last display has the same distribution as the process defined by
$$c^{H - \frac{1}{2}}\, c^{\frac{1}{2}} \int_{\mathbb{R}} w_H(t, u)\, dW(u) .$$
Therefore $\{B_H(t)\}_{t \ge 0}$ is self-similar with similarity parameter $H$. The fact that there is a version of this process with continuous paths can be proven using Theorem 5.2.3 along the lines of the proof of Theorem 11.2.7.

It is tempting to rewrite (11.20) as $Z(t) - Z(0)$, where
$$Z(t) = \int_{-\infty}^t (t - s)^{H - \frac{1}{2}}\, dW(s) .$$
However this last integral is not well defined as a Doob integral since for all $H > 0$, the function $s \to (t - s)^{H - \frac{1}{2}} 1_{\{s \le t\}}$ is not in $L^2_{\mathbb{R}}(\mathbb{R})$.

¹ [Vervaat, 1987].
² [Mandelbrot and Van Ness, 1968].
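The remark that nonnegative definiteness of (11.19) "can be done directly" can be probed numerically on a finite grid. The sketch below is an illustration under assumptions of its own: it builds the covariance matrix of (11.19) at finitely many distinct, strictly positive times, factorizes it with a hand-written Cholesky routine (which succeeds only for positive definite matrices), and uses the factor to sample the corresponding Gaussian vector. This is not the Doob-integral construction of the text, and the function names are hypothetical.

```python
import math
import random


def fbm_cov(s: float, t: float, hurst: float) -> float:
    """Right-hand side of (11.19)."""
    return 0.5 * (abs(t) ** (2 * hurst) + abs(s) ** (2 * hurst) - abs(t - s) ** (2 * hurst))


def cholesky(matrix: list) -> list:
    """Lower-triangular L with L L^T = matrix; raises ValueError if not positive definite."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][i] = math.sqrt(acc)
            else:
                low[i][j] = acc / low[j][j]
    return low


def sample_fbm(times: list, hurst: float, rng: random.Random) -> list:
    """One draw of (B_H(t_1), ..., B_H(t_n)) at distinct, strictly positive times."""
    cov = [[fbm_cov(s, t, hurst) for t in times] for s in times]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in times]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(times))]


if __name__ == "__main__":
    grid = [0.1 * k for k in range(1, 9)]
    for hurst in (0.2, 0.5, 0.8):
        print(hurst, sample_fbm(grid, hurst, random.Random(0)))
```

For $H = \frac{1}{2}$ the covariance reduces to $s \wedge t$, so the same routine samples a standard Wiener process at the grid times.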
Complementary reading Chapter 6 of [Resnick, 1992]. [Revuz and Yor, 1999] (more advanced).
11.5
Exercises
Exercise 11.5.1. Wiener as a limit Prove that for all t1 , . . . , tn in R+ forming an increasing sequence, the limit distribution of the vector (X(t1 ), . . . , X(tn )), where X(t) is deﬁned by (11.2), is that corresponding to a Wiener process, that is, a centered Gaussian vector such that X(t1 ), X(t2 )−X(t1 ), . . . , X(tn ) − X(tn−1) are centered Gaussian variables with variances t1 , t2 − t1 ,. . . , tn − tn−1.
Exercise 11.5.2. A basic formula Let $\{W(t)\}_{t \ge 0}$ be a standard Wiener process. Prove that for $s, t \in \mathbb{R}_+$, $E[W(t) W(s)] = t \wedge s$. Let $\{X(t)\}_{t \in [0, 1]}$ be a Brownian bridge. Prove that
$$\operatorname{cov}(X(t), X(s)) = s(1 - t) \quad (0 \le s \le t \le 1) .$$

Exercise 11.5.3. Transforming a Wiener process Let $\{W(t)\}_{t \ge 0}$ be a standard Wiener process. Prove that the process $\{X(t)\}_{t \ge 0}$ is a standard Brownian motion in the following cases:
(i) $X(t) = -W(t)$,
(ii) $X(t) = W(t + a) - W(a)$ ($a > 0$),
(iii) $X(t) = \sqrt{c}\, W\left( \frac{t}{c} \right)$ ($t \ge 0$) ($c > 0$),
(iv) $X(t) = t\, W\left( \frac{1}{t} \right)$ ($t > 0$) and $X(0) = 0$. (Note that the continuity at 0 is already proved in Section 11.1.2.)

Exercise 11.5.4. Brownian bridges Let $\{W(t)\}_{t \in [0, 1]}$ be a Wiener process. Show that the Brownian bridge $\{X(t) := W(t) - tW(1)\}_{t \in [0, 1]}$ is a Gaussian process independent of $W(1)$ and compute its autocovariance function. Show that the process $\{X(1 - t)\}_{t \in [0, 1]}$ is a Brownian bridge.

Exercise 11.5.5. Let $\{W(t)\}_{t \ge 0}$ be a Wiener process. Let
$$Z(t) := (1 - t)\, W\left( \frac{t}{1 - t} \right) \quad (0 \le t < 1)$$
and $Z(1) := 0$. Show that $\{Z(t)\}_{t \in [0, 1]}$ is continuous at $t = 1$ and that it has the same distribution as the Brownian bridge.

Exercise 11.5.6. Wiener is Gauss–Markov Prove that a Wiener process is a Gauss–Markov process.

Exercise 11.5.7. An Ornstein–Uhlenbeck process Let $\{W(t)\}_{t \ge 0}$ be a standard Brownian motion. Show that $\{e^{-\alpha t} W(e^{2\alpha t})\}_{t \ge 0}$ is (has the same distribution as) an Ornstein–Uhlenbeck process.

Exercise 11.5.8. Exit time from a strip Let $\{W(t)\}_{t \ge 0}$ be a standard Brownian motion, and define for $a > 0$ and $b < 0$ the stopping time
$$T_{a,b} := \inf\{t \ge 0 \,;\, W(t) \in \{a, b\}\} .$$
(i) Compute $P(W(T_{a,b}) = a)$.
(ii) Show that $\{W(t)^2 - t\}_{t \ge 0}$ is an $\mathcal{F}^W_t$-martingale and deduce from this $E[T_{a,b}]$.

Exercise 11.5.9. The transience of Brownian motion with a positive drift Let $\mu$ and $\sigma > 0$ be two real numbers. Let
$$X(t) := \sigma W(t) + \mu t \quad (t \ge 0) .$$
(i) Show that for all $u \in \mathbb{R}$,
$$Z(t) := \exp\left\{ u X(t) - ut\left( \mu + \frac{u \sigma^2}{2} \right) \right\} \quad (t \ge 0)$$
is an $\mathcal{F}^W_t$-martingale.
(ii) Take advantage of the choice $u = -\frac{2\mu}{\sigma^2}$ and of Doob's optional sampling theorem (the applicability of which you shall verify) to obtain that the probability $r_a$ that $\{X(t)\}_{t \ge 0}$ will reach $a > 0$ before it touches $-b < 0$ is given by
$$r_a = \frac{1 - e^{-\frac{2\mu b}{\sigma^2}}}{1 - e^{-\frac{2\mu (a + b)}{\sigma^2}}} .$$
(iii) Show that if $\mu > 0$ (or $\mu < 0$), $\{X(t)\}_{t \ge 0}$ is transient, that is, for all $a \in \mathbb{R}$, there exists an almost surely finite (random) time after which it does not visit $a$.

Exercise 11.5.10. Independent Brownian motions with a drift Let for $i = 1, 2$,
$$X_i(t) := x_i + \mu_i t + \sigma_i W_i(t) ,$$
where $x_i, \mu_i \in \mathbb{R}$, $\sigma_i > 0$, and $\{W_1(t)\}_{t \ge 0}$ and $\{W_2(t)\}_{t \ge 0}$ are independent standard Brownian motions. Suppose moreover that $x_1 < x_2$. Compute the probability that $\{X_1(t)\}_{t \ge 0}$ and $\{X_2(t)\}_{t \ge 0}$ never meet.

Exercise 11.5.11. The Lebesgue integral of a Gaussian process Let $\{X(t)\}_{t \in [0, 1)}$ be a continuous Gaussian stochastic process. Prove that the random variable $\int_0^1 X(t)\, dt$ is Gaussian. Compute its mean and variance when $\{X(t)\}_{t \in [0, 1)}$ is a Brownian bridge.

Exercise 11.5.12. Let $\{W(t)\}_{t \ge 0}$ be a standard Brownian motion. Show that the stochastic process
t 1−t
X(t) := 0
1 dW (s) 1−s
(t ∈ [0, 1))
is a Brownian motion.

Exercise 11.5.13. A representation of the Brownian bridge Let $\{W(t)\}_{t \ge 0}$ be a standard Brownian motion. Let for $t \in [0, 1)$,
$$Y(t) := (1 - t) \int_0^t \frac{dW(s)}{1 - s} .$$
(i) Prove that the integral in the right-hand side is well defined on $[0, 1)$ as a Wiener integral.
(ii) Prove that as $t \uparrow 1$, $Y(t) \to 0$ in quadratic mean.
(iii) Define $Y(1) := 0$. Show that $\{Y(t)\}_{t \in [0, 1]}$ is a Gaussian process.
(iv) Show that $\{Y(t)\}_{t \in [0, 1]}$ is (has the same distribution as) a Brownian bridge.

Exercise 11.5.14. Average occupation time Let $\{W(t)\}_{t \ge 0}$ be a standard $d$-dimensional Brownian motion and let $A \in \mathcal{B}(\mathbb{R}^d)$ be of positive Lebesgue measure. Let, for $\omega \in \Omega$, $S_A(\omega) := \{t \ge 0 \,;\, W(t, \omega) \in A\}$ (the occupation time of $A$). Prove that $E[\ell(S_A)] = \infty$ if $d \le 2$, and that if $d \ge 3$,
$$E[\ell(S_A)] = \frac{1}{2\pi^{d/2}}\, \Gamma\left( \frac{d}{2} - 1 \right) \int_A \|x\|^{2 - d}\, dx ,$$
where $\Gamma(\alpha) := \int_0^\infty x^{\alpha - 1} e^{-x}\, dx$ ($\alpha > 0$) is the Gamma function.

Exercise 11.5.15. Micropulses and fractal Brownian motion (³) Let $N_\varepsilon$ be a Poisson process on $\mathbb{R} \times \mathbb{R}_+$ with intensity measure $\nu(dt \times dz) = \frac{1}{2\varepsilon^2}\, z^{-1-\theta}\, dt \times dz$ ($0 < \theta < 1$, $\varepsilon > 0$). For all $t \ge 0$, let $S^+_{0,t} = \{(s, z) : 0 < s < t,\ t - s < z\}$ and $S^-_{0,t} = \{(s, z) : -\infty < s < 0,\ -s < z < t - s\}$, and let
$$X_\varepsilon(t) := \varepsilon \left( N_\varepsilon(S^+_{0,t}) - N_\varepsilon(S^-_{0,t}) \right) .$$
[Figure: the regions $S^+_{0,t}$ and $S^-_{0,t}$, with the points 0 and $t$ marked on the horizontal axis.]
(1) Show that $X_\varepsilon(t)$ is well defined for all $t \ge 0$.
(2) Compute for all $0 \le t_1 \le t_2 \le \dots \le t_n$ the characteristic function of $(X_\varepsilon(t_1), \dots, X_\varepsilon(t_n))$.
(3) Show that for all $0 \le t_1 \le t_2 \le \dots \le t_n$, $(X_\varepsilon(t_1), \dots, X_\varepsilon(t_n))$ converges as $\varepsilon \downarrow 0$ in distribution to $(B_H(t_1), \dots, B_H(t_n))$, where $\{B_H(t)\}_{t \ge 0}$ is a fractal Brownian motion with Hurst parameter $H = \frac{1 - \theta}{2}$ and variance $E\left[ B_H(1)^2 \right] = \theta^{-1} (1 - \theta)^{-1}$, that is, $\{B_H(t)\}_{t \ge 0}$ is a centered Gaussian process such that $B_H(0) = 0$ and with covariance function
$$E[B_H(t) B_H(s)] = \frac{1}{2} \left( |s|^{2H} + |t|^{2H} - |s - t|^{2H} \right) E\left[ B_H(1)^2 \right] .$$

³ [Cioczek-Georges and Mandelbrot, 1995].
Chapter 12
Wide-sense Stationary Stochastic Processes
Wide-sense stationary stochastic processes are of interest in signal analysis and processing, as well as in physics. Their study rests on Bochner's representation of characteristic functions, which immediately leads to the fundamental notion of power spectral measure, and on the Doob–Wiener integral, which permits a mathematical definition of white noise and yields a spectral decomposition of the trajectories of such stochastic processes, the Cramér–Khinchin decomposition, a fundamental result with important consequences in signal processing.
12.1
The Power Spectral Measure
12.1.1
Covariance Functions and Characteristic Functions
Recall the simple facts about Fourier theory. Let $f : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be integrable with respect to the Lebesgue measure. Then, for any $\nu \in \mathbb{R}$,
$$\hat{f}(\nu) := \int_{\mathbb{R}} f(t)\, e^{-2i\pi\nu t}\, dt$$
is well defined and the function $\hat{f}$, called the Fourier transform of $f$, is continuous and bounded (Exercise 2.4.13). From classical Fourier analysis, we know that if moreover the function $\hat{f}$ is integrable with respect to the Lebesgue measure, then (Fourier inversion formula)
$$f(t) = \int_{\mathbb{R}} \hat{f}(\nu)\, e^{2i\pi\nu t}\, d\nu ,$$
where this equality is true almost everywhere, and everywhere if $f$ is continuous (Exercise 2.4.9).

Remark 12.1.1 The notion of Fourier transform does not in general apply as such to the trajectories of a wide-sense stationary stochastic process. Consider, for instance, a square-integrable ergodic process $\{X(t)\}_{t \in \mathbb{R}}$ not identically null. In particular, for $p = 1$ or $p = 2$, $P$-a.s.,
$$\lim_{t \uparrow \infty} \frac{1}{t} \int_0^t |X(s)|^p\, ds = E\left[ |X(0)|^p \right] > 0 ,$$
from which it follows that almost all trajectories are not in $L^1_{\mathbb{C}}(\mathbb{R})$ nor in $L^2_{\mathbb{C}}(\mathbb{R})$ and therefore do not have a Fourier transform in the usual $L^1$ or $L^2$ senses.

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/9783030401832_12

Nevertheless, there exists a spectral decomposition for the trajectories of a wss stochastic process, called the Cramér–Khinchin decomposition, as we shall see in Section 12.2. To obtain such a decomposition, the starting point is the Fourier analysis of the covariance function. We begin with a few examples.
Two Particular Cases
Example 12.1.2: Absolutely continuous spectrum. Consider a wss random process with integrable and continuous covariance function $C$, in which case the Fourier transform $f$ of the latter is well defined by
$$f(\nu) = \int_{\mathbb{R}} e^{-2i\pi\nu\tau}\, C(\tau)\, d\tau .$$
It is called the power spectral density (psd). It turns out, as we shall soon see when we consider the general case, that it is nonnegative and integrable. Since it is integrable, the Fourier inversion formula
$$C(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau} f(\nu)\, d\nu \tag{12.1}$$
holds true for all $\tau \in \mathbb{R}$ since $C$ is continuous. Also $f$ is the unique integrable function such that (12.1) holds. Letting $\tau = 0$ in this formula, we obtain, since $C(0) = \operatorname{Var}(X(t)) := \sigma^2$,
$$\sigma^2 = \int_{\mathbb{R}} f(\nu)\, d\nu .$$

Example 12.1.3: The Ornstein–Uhlenbeck process. The Ornstein–Uhlenbeck process is a centered Gaussian process with covariance function
$$\Gamma(t, s) = C(t - s) = e^{-\alpha |t - s|} .$$
The function $C$ is integrable and therefore the power spectral density is the Fourier transform of the covariance function:
$$f(\nu) = \int_{\mathbb{R}} e^{-2i\pi\nu\tau}\, e^{-\alpha |\tau|}\, d\tau = \frac{2\alpha}{\alpha^2 + 4\pi^2 \nu^2} .$$
Not all wss stochastic processes admit a power spectral density. For instance:

Example 12.1.4: Line spectrum. Consider a wide-sense stationary process with a covariance function of the form
$$C(\tau) = \sum_{k \in \mathbb{Z}} P_k\, e^{2i\pi\nu_k \tau} ,$$
where
$$P_k \ge 0 \quad \text{and} \quad \sum_{k \in \mathbb{Z}} P_k < \infty$$
(for instance, the harmonic process of Example 5.1.20). This covariance function is not integrable, and in fact there does not exist a power spectral density. In particular, a representation of the covariance function such as (12.1) is not available, at least if the function $f$ is interpreted in the ordinary sense. However, there is a formula such as (12.1) if we consent, as is usually done in the engineering literature, to define the power spectral density in this case to be the pseudo-function
$$f(\nu) = \sum_{k \in \mathbb{Z}} P_k\, \delta(\nu - \nu_k) ,$$
where $\delta(\nu - a)$ is the delayed Dirac pseudo-function informally defined by
$$\int_{\mathbb{R}} \varphi(\nu)\, \delta(\nu - a)\, d\nu = \varphi(a) .$$
Indeed, with such a convention,
$$\int_{\mathbb{R}} e^{2i\pi\nu\tau} f(\nu)\, d\nu = \sum_{k \in \mathbb{Z}} P_k \int_{\mathbb{R}} e^{2i\pi\nu\tau}\, \delta(\nu - \nu_k)\, d\nu = \sum_{k \in \mathbb{Z}} P_k\, e^{2i\pi\nu_k \tau} .$$
We can (and perhaps should) however avoid recourse to Dirac pseudo-functions, and the general result to follow (Theorem 12.1.5) will tell us what to do.

In general, it may happen that the covariance function is not integrable and/or that there does not exist a line spectrum. We now turn to the general theory.
The General Case
Remember that the characteristic function $\varphi$ of a real random variable $X$ has the following properties:
A. it is hermitian symmetric, that is, $\varphi(-u) = \varphi(u)^*$, and it is uniformly bounded: $|\varphi(u)| \le \varphi(0)$,
B. it is uniformly continuous on $\mathbb{R}$, and
C. it is definite nonnegative, in the sense that for all integers $n$, all $u_1, \dots, u_n \in \mathbb{R}$, and all $z_1, \dots, z_n \in \mathbb{C}$,
$$\sum_{j=1}^{n} \sum_{k=1}^{n} \varphi(u_j - u_k)\, z_j z_k^* \ge 0$$
(just observe that the left-hand side equals $E\left[ \left| \sum_{j=1}^{n} z_j e^{i u_j X} \right|^2 \right]$).

It turns out that Properties A, B and C characterize characteristic functions (up to a multiplicative constant). This is Bochner's theorem (Theorem 4.4.10), which is now recalled for easier reference: Let $\varphi : \mathbb{R} \to \mathbb{C}$ be a function satisfying properties A, B and C. Then there exists a constant $0 \le \beta < \infty$ and a real random variable $X$ such that for all $u \in \mathbb{R}$,
$$\varphi(u) = \beta\, E\left[ e^{iuX} \right] .$$
Bochner's theorem is all that is needed to define the power spectral measure of a wide-sense stationary process continuous in the quadratic mean.
Theorem 12.1.5 Let $\{X(t)\}_{t\in\mathbb{R}}$ be a wss random process continuous in the quadratic mean, with covariance function $C$. Then, there exists a unique measure $\mu$ on $\mathbb{R}$ such that
\[ C(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,\mu(d\nu). \tag{12.2} \]
In particular, $\mu$ is a finite measure:
\[ \mu(\mathbb{R}) = C(0) = \mathrm{Var}(X(0)) < \infty. \tag{12.3} \]
Proof. It suffices to observe that the covariance function of a wss stochastic process that is continuous in the quadratic mean shares the properties A, B and C of the characteristic function of a real random variable. Indeed,
(a) it is hermitian symmetric, and $|C(\tau)| \le C(0)$ (Schwarz’s inequality),
(b) it is uniformly continuous, and
(c) it is definite nonnegative, in the sense that for all integers $n$, all $\tau_1,\ldots,\tau_n \in \mathbb{R}$, and all $z_1,\ldots,z_n \in \mathbb{C}$,
\[ \sum_{j=1}^{n}\sum_{k=1}^{n} C(\tau_j-\tau_k)\, z_j z_k^* \ge 0 \]
(just observe that, for a centered process, the left-hand side is equal to $E\big[\,\big|\sum_{j=1}^{n} z_j X(\tau_j)\big|^2\,\big]$).
Therefore, by Theorem 4.4.10, the covariance function $C$ is (up to a multiplicative constant) a characteristic function. This is exactly what (12.2) says, since $\mu$ thereof is a finite measure, that is, up to a multiplicative constant, a probability distribution. Uniqueness of the power spectral measure follows from the fact that a finite measure (up to a multiplicative constant: a probability) on $\mathbb{R}^d$ is characterized by its Fourier transform (Theorem 3.1.51).

The case of an absolutely continuous spectrum corresponds to the situation where $\mu$ admits a density with respect to Lebesgue measure: $\mu(d\nu) = f(\nu)\,d\nu$. We then say that the wss stochastic process in question admits the power spectral density (psd) $f$. If such a power spectral density exists, it has the properties mentioned without proof in Example 12.1.2: it is nonnegative and it is integrable.

The case of a line spectrum corresponds to a spectral measure that is a weighted sum of Dirac measures:
\[ \mu(d\nu) = \sum_{k\in\mathbb{Z}} P_k\,\varepsilon_{\nu_k}(d\nu), \]
where the $P_k$’s are nonnegative and have a finite sum, as in Example 12.1.4.
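As an illustration of the line-spectrum case, the following Python sketch evaluates $C(\tau) = \sum_k P_k e^{2i\pi\nu_k\tau}$ for a finite set of lines and checks numerically two facts used above: $\mu(\mathbb{R}) = C(0)$ and the nonnegativity of the quadratic form in property (c). The function names and the numerical values of the weights, frequencies, lags and coefficients are arbitrary choices made for this example; they do not come from the text.

```python
import cmath

def line_spectrum_covariance(powers, freqs, tau):
    """C(tau) = sum_k P_k exp(2 i pi nu_k tau) for a finite line spectrum."""
    return sum(p * cmath.exp(2j * cmath.pi * nu * tau)
               for p, nu in zip(powers, freqs))

def quadratic_form(powers, freqs, taus, zs):
    """sum_j sum_k C(tau_j - tau_k) z_j conj(z_k), the form of property (c)."""
    total = 0j
    for tau_j, z_j in zip(taus, zs):
        for tau_k, z_k in zip(taus, zs):
            c = line_spectrum_covariance(powers, freqs, tau_j - tau_k)
            total += c * z_j * z_k.conjugate()
    return total

# Hypothetical example data (not from the text).
powers = [0.5, 1.5, 2.0]          # weights P_k >= 0 with finite sum
freqs = [-1.0, 0.25, 3.0]         # line positions nu_k
taus = [0.0, 0.3, 1.7, -2.2]      # lags tau_1, ..., tau_n
zs = [1 + 2j, -0.5j, 0.7 + 0j, -1.1 + 0.4j]

total_power = line_spectrum_covariance(powers, freqs, 0.0)  # equals mu(R)
q = quadratic_form(powers, freqs, taus, zs)                 # real and >= 0 up to rounding
```

The quadratic form can also be written as $\sum_m P_m \big|\sum_j z_j e^{2i\pi\nu_m\tau_j}\big|^2$, which makes its nonnegativity evident.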
12.1.2 Filtering of wss Stochastic Processes
We recall a few standard results concerning the (convolutional) filtering of deterministic functions. Let $f, g : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be integrable functions with respective Fourier transforms $\hat f$ and $\hat g$. Then (Exercise 2.4.14),
\[ \int_{\mathbb{R}}\int_{\mathbb{R}} |f(t-s)\,g(s)|\,dt\,ds < \infty, \]
and therefore, for almost all $t \in \mathbb{R}$, the function $s \mapsto f(t-s)g(s)$ is Lebesgue integrable. In particular, the convolution
\[ (f * g)(t) := \int_{\mathbb{R}} f(t-s)\,g(s)\,ds \]
is almost everywhere well defined. For all $t$ such that the last integral is not defined, set $(f*g)(t) = 0$. Then $f * g$ is Lebesgue integrable and its Fourier transform is $\widehat{f * g} = \hat f\,\hat g$ (Exercise 2.4.14).
Let $h : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ be an integrable function. The operation that associates to the integrable function $x : (\mathbb{R}, \mathcal{B}(\mathbb{R})) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ the integrable function
\[ y(t) := \int_{\mathbb{R}} h(t-s)\,x(s)\,ds \]
is called a stable convolutional filter. The function $h$ is called the impulse response of the filter, $x$ and $y$ are respectively the input and the output of this filter. The Fourier transform $\hat h$ of the impulse response is the transmittance of the filter.
Let now $\{X(t)\}_{t\in\mathbb{R}}$ be a wss random process with continuous covariance function $C_X$. We examine the effect of filtering on this process. The output process is the process defined by
\[ Y(t) := \int_{\mathbb{R}} h(t-s)\,X(s)\,ds. \tag{12.4} \]
Note that the integral (12.4) is well defined under the integrability condition for the impulse response $h$. This follows from Theorem 5.3.2 according to which the integral
\[ \int_{\mathbb{R}} f(s)\,X(s,\omega)\,ds \]
is well defined for $P$-almost all $\omega$ when $f$ is integrable (in the special case of wss stochastic processes, $m(t) = m$ and $\Gamma(t,t) = C(0) + |m|^2$, and therefore the conditions on $f$ and $g$ thereof reduce to integrability of these functions). Referring to the same theorem, we have
\[ E\Big[\int_{\mathbb{R}} f(t)X(t)\,dt\Big] = \int_{\mathbb{R}} f(t)\,E[X(t)]\,dt = m\int_{\mathbb{R}} f(t)\,dt. \tag{12.5} \]
Let now $f, g : \mathbb{R} \to \mathbb{C}$ be integrable functions. As a special case of Theorem 5.3.2, we have
\[ \mathrm{cov}\Big(\int_{\mathbb{R}} f(t)X(t)\,dt\,,\ \int_{\mathbb{R}} g(s)X(s)\,ds\Big) = \int_{\mathbb{R}}\int_{\mathbb{R}} f(t)\,g^*(s)\,C(t-s)\,dt\,ds. \tag{12.6} \]
We shall see that, in addition,
\[ \mathrm{cov}\Big(\int_{\mathbb{R}} f(t)X(t)\,dt\,,\ \int_{\mathbb{R}} g(s)X(s)\,ds\Big) = \int_{\mathbb{R}} \hat f(-\nu)\,\hat g^*(-\nu)\,\mu(d\nu). \tag{12.7} \]
Proof. Assume without loss of generality that $m = 0$. From Bochner’s representation of the covariance function, we obtain for the last double integral in (12.6)
\[ \int_{\mathbb{R}}\int_{\mathbb{R}} f(t)\,g^*(s)\Big(\int_{\mathbb{R}} e^{2i\pi\nu(t-s)}\,\mu(d\nu)\Big)dt\,ds = \int_{\mathbb{R}} \Big(\int_{\mathbb{R}} f(t)e^{2i\pi\nu t}\,dt\Big)\Big(\int_{\mathbb{R}} g(s)e^{2i\pi\nu s}\,ds\Big)^{*}\mu(d\nu). \]
Here again we have to justify the change of order of integration using Fubini’s theorem. For this, it suffices to show that the function $(t,s,\nu) \mapsto \big|f(t)g^*(s)e^{2i\pi\nu(t-s)}\big| = |f(t)|\,|g(s)|\,1_{\mathbb{R}}(\nu)$ is integrable with respect to the product measure $\ell \times \ell \times \mu$. This is indeed true, the integral being equal to $\big(\int_{\mathbb{R}} |f(t)|\,dt\big) \times \big(\int_{\mathbb{R}} |g(t)|\,dt\big) \times \mu(\mathbb{R})$.

In view of the above results, the right-hand side of formula (12.4) is well defined. Moreover:

Theorem 12.1.6 When the input process $\{X(t)\}_{t\in\mathbb{R}}$ is a wss random process with power spectral measure $\mu_X$, the output $\{Y(t)\}_{t\in\mathbb{R}}$ of a stable convolutional filter of transmittance $\hat h$ is a wss random process with the power spectral measure
\[ \mu_Y(d\nu) = |\hat h(\nu)|^2\,\mu_X(d\nu). \tag{12.8} \]
This formula will be referred to as the fundamental filtering formula in continuous time.
Proof. Just apply formulas (12.5) and (12.7) with the functions
\[ f(u) = h(t-u), \qquad g(v) = h(s-v), \]
to obtain
\[ m_Y := E[Y(t)] = m\int_{\mathbb{R}} h(t)\,dt \]
and
\[ E[(Y(t)-m_Y)(Y(s)-m_Y)^*] = \int_{\mathbb{R}} |\hat h(\nu)|^2\, e^{2i\pi\nu(t-s)}\,\mu(d\nu). \]
Example 12.1.7: Two special cases. In particular, if the input process admits a psd $f_X$, the output process also admits a psd given by
\[ f_Y(\nu) = |\hat h(\nu)|^2 f_X(\nu). \]
When the input process has a line spectrum, the power spectral measure of the output process takes the form
\[ \mu_Y(d\nu) = \sum_{k=1}^{\infty} P_k\,|\hat h(\nu_k)|^2\,\varepsilon_{\nu_k}(d\nu). \]
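The line-spectrum case of the filtering formula is easy to evaluate numerically. The sketch below is only an illustration under stated assumptions: it takes the one-sided exponential impulse response $h(t) = e^{-at}1_{\{t\ge 0\}}$ (a choice made here, not in the text), whose transmittance under the convention $\hat h(\nu) = \int h(t)e^{-2i\pi\nu t}\,dt$ is $1/(a + 2i\pi\nu)$, and multiplies each line weight $P_k$ by $|\hat h(\nu_k)|^2$. Function names and numerical values are hypothetical.

```python
import math

def exponential_transmittance(nu, a=1.0):
    """Fourier transform of h(t) = exp(-a t) for t >= 0 (0 otherwise),
    with the convention h_hat(nu) = int h(t) exp(-2 i pi nu t) dt."""
    return 1.0 / (a + 2j * math.pi * nu)

def filter_line_spectrum(powers, freqs, transmittance):
    """Output weights P_k |h_hat(nu_k)|^2; the line positions nu_k are unchanged."""
    return [p * abs(transmittance(nu)) ** 2 for p, nu in zip(powers, freqs)]

# Hypothetical input line spectrum.
input_powers = [1.0, 2.0, 0.5]
line_freqs = [0.0, 0.5, 2.0]
output_powers = filter_line_spectrum(input_powers, line_freqs,
                                     exponential_transmittance)
```

With $a = 1$ this filter has $|\hat h(\nu)| \le 1$, so each output weight is at most the corresponding input weight, and the line at $\nu = 0$ passes unchanged.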
12.1.3 White Noise
By analogy with Optics, one calls white noise any centered wss random process $\{B(t)\}_{t\in\mathbb{R}}$ with constant power spectral density $f_B(\nu) = N_0/2$.¹ Such a definition presents a theoretical difficulty, because
\[ \int_{-\infty}^{+\infty} f_B(\nu)\,d\nu = +\infty, \]
which contradicts the finite power property of wide-sense stationary processes.
A First Approach

From a pragmatic point of view, one could define a white noise to be a centered wss stochastic process whose psd is constant over a “large”, yet bounded, range of frequencies $[-A, +A]$. The calculations below show what happens as $A$ tends to infinity. Let therefore $\{X(t)\}_{t\in\mathbb{R}}$ be a centered wss stochastic process with psd
\[ f(\nu) = \frac{N_0}{2}\,1_{[-A,+A]}(\nu). \]
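Let $\varphi_1, \varphi_2 : \mathbb{R} \to \mathbb{C}$ be two functions in $L^1_{\mathbb{C}}(\mathbb{R}) \cap L^2_{\mathbb{C}}(\mathbb{R})$ with Fourier transforms $\hat\varphi_1$ and $\hat\varphi_2$, respectively. Then
\[ \lim_{A\uparrow\infty} E\Big[\Big(\int_{\mathbb{R}} \varphi_1(t)X(t)\,dt\Big)\Big(\int_{\mathbb{R}} \varphi_2(t)X(t)\,dt\Big)^{*}\Big] = \frac{N_0}{2}\int_{\mathbb{R}} \varphi_1(t)\varphi_2^*(t)\,dt = \frac{N_0}{2}\int_{\mathbb{R}} \hat\varphi_1(\nu)\hat\varphi_2^*(\nu)\,d\nu. \]
Proof. We have
\[ E\Big[\Big(\int_{\mathbb{R}} \varphi_1(t)X(t)\,dt\Big)\Big(\int_{\mathbb{R}} \varphi_2(t)X(t)\,dt\Big)^{*}\Big] = \int_{\mathbb{R}}\int_{\mathbb{R}} \varphi_1(u)\varphi_2(v)^*\,C_X(u-v)\,du\,dv. \]
The latter quantity is equal to
\[ \frac{N_0}{2}\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} \varphi_1(u)\varphi_2(v)^*\Big(\int_{-A}^{+A} e^{2i\pi\nu(u-v)}\,d\nu\Big)du\,dv \]
\[ = \frac{N_0}{2}\int_{-A}^{+A}\Big(\int_{-\infty}^{+\infty} \varphi_1(u)e^{2i\pi\nu u}\,du\Big)\Big(\int_{-\infty}^{+\infty} \varphi_2(v)^* e^{-2i\pi\nu v}\,dv\Big)d\nu = \frac{N_0}{2}\int_{-A}^{+A} \hat\varphi_1(-\nu)\hat\varphi_2(-\nu)^*\,d\nu, \]
and the limit of this quantity as $A \uparrow \infty$ is
\[ \frac{N_0}{2}\int_{-\infty}^{+\infty} \hat\varphi_1(\nu)\hat\varphi_2^*(\nu)\,d\nu = \frac{N_0}{2}\int_{-\infty}^{+\infty} \varphi_1(t)\varphi_2(t)^*\,dt, \]
where the last equality is the Plancherel–Parseval identity.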
¹ The notation $N_0/2$ comes from Physics and is a standard one in communications theory when dealing with the so-called additive white noise channels.

Let now $h : \mathbb{R} \to \mathbb{C}$ be in $L^1_{\mathbb{C}}(\mathbb{R}) \cap L^2_{\mathbb{C}}(\mathbb{R})$, and define
\[ Y(t) = \int_{\mathbb{R}} h(t-s)\,X(s)\,ds. \]
Applying the above result with $\varphi_1(u) = h(t+\tau-u)$ and $\varphi_2(v) = h(t-v)$, we find that the covariance function $C_Y(\tau) = E[Y(t+\tau)Y(t)^*]$ of this wss stochastic process is such that
\[ \lim_{A\uparrow\infty} C_Y(\tau) = \frac{N_0}{2}\int_{\mathbb{R}} e^{2i\pi\nu\tau}\,|\hat h(\nu)|^2\,d\nu. \]
The limit is finite since $\hat h \in L^2_{\mathbb{C}}(\mathbb{R})$ and is a covariance function corresponding to a bona fide (that is, integrable) psd $f_Y(\nu) = |\hat h(\nu)|^2\,\frac{N_0}{2}$. With $f(\nu) = \frac{N_0}{2}$, we formally retrieve the usual filtering formula,
\[ f_Y(\nu) = |\hat h(\nu)|^2 f(\nu). \]
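To see concretely why the limit $A \uparrow \infty$ cannot be taken on the process itself, one can compute the covariance of the band-limited process in closed form: integrating the psd $\frac{N_0}{2}1_{[-A,+A]}$ against $e^{2i\pi\nu\tau}$ gives $C_X(\tau) = \frac{N_0}{2}\,\frac{\sin(2\pi A\tau)}{\pi\tau}$, with $C_X(0) = N_0 A$. The Python sketch below evaluates this expression; the function name and the numerical values are illustrative assumptions, not part of the text.

```python
import math

def bandlimited_white_covariance(tau, n0=2.0, a=1.0):
    """C_X(tau) = int_{-A}^{A} (N0/2) exp(2 i pi nu tau) d nu
                = (N0/2) sin(2 pi A tau) / (pi tau), and N0 * A at tau = 0."""
    if tau == 0.0:
        return n0 * a
    return (n0 / 2.0) * math.sin(2.0 * math.pi * a * tau) / (math.pi * tau)

# The power C_X(0) = N0 * A grows without bound with the bandwidth A.
powers = [bandlimited_white_covariance(0.0, n0=2.0, a=a)
          for a in (1.0, 10.0, 100.0)]

# Samples taken 1/(2A) apart are uncorrelated (the covariance vanishes there).
cov_at_half_period = bandlimited_white_covariance(0.5, n0=2.0, a=1.0)
```

The growth of `powers` with $A$ is the finite-bandwidth counterpart of the infinite integral of $f_B$ noted above.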
White Noise via the Doob–Wiener Integral

Another approach to white noise, more formal, consists in working right away “at the limit”. We do not attempt to define the white noise $\{B(t)\}_{t\in\mathbb{R}}$ directly (for good reasons since it does not exist as a bona fide wss stochastic process, as we noted earlier). Instead, we define directly the symbolic integral $\int_{\mathbb{R}} f(t)B(t)\,dt$ for integrands $f$ to be described below, by
\[ \int_{\mathbb{R}} f(t)B(t)\,dt := \sqrt{\frac{N_0}{2}}\int_{\mathbb{R}} f(t)\,dZ(t), \tag{12.9} \]
where $\{Z(t)\}_{t\in\mathbb{R}}$ is a centered stochastic process with uncorrelated increments with unit variance. We say that $\{B(t)\}_{t\in\mathbb{R}}$ is a white noise and that $\big\{\sqrt{N_0/2}\,Z(t)\big\}_{t\in\mathbb{R}}$ is an integrated white noise. For all $f, g \in L^2_{\mathbb{C}}(\mathbb{R})$, we have that
\[ E\Big[\int_{\mathbb{R}} f(t)B(t)\,dt\Big] = 0, \]
and by the isometry formulas for the Doob–Wiener integral,
\[ E\Big[\Big(\int_{\mathbb{R}} f(t)B(t)\,dt\Big)\Big(\int_{\mathbb{R}} g(t)B(t)\,dt\Big)^{*}\Big] = \frac{N_0}{2}\int_{\mathbb{R}} f(t)g(t)^*\,dt, \]
which can be formally rewritten, using the Dirac symbolism:
\[ \int_{\mathbb{R}}\int_{\mathbb{R}} f(t)g(s)^*\,E[B(t)B^*(s)]\,dt\,ds = \frac{N_0}{2}\int_{\mathbb{R}}\int_{\mathbb{R}} f(t)g(s)^*\,\delta(t-s)\,dt\,ds. \]
Hence “the covariance function of the white noise $\{B(t)\}_{t\in\mathbb{R}}$ is a Dirac pseudofunction: $C_B(\tau) = \frac{N_0}{2}\delta(\tau)$”.
When $\{Z(t)\}_{t\in\mathbb{R}} \equiv \{W(t)\}_{t\in\mathbb{R}}$, a standard Brownian motion, $\{B(t)\}_{t\in\mathbb{R}}$ is called a Gaussian white noise. In this case, the Wiener–Doob integral is certainly not a Stieltjes–Lebesgue integral since the trajectories of the Wiener process are of unbounded variation on any finite interval (Corollary 11.2.9). Also, $B(t)$ cannot be interpreted as the “derivative” $\frac{dW(t)}{dt}$ (Theorem 11.2.8).
Let $\{B(t)\}_{t\in\mathbb{R}}$ be a white noise with psd $N_0/2$. Let $h : \mathbb{R} \to \mathbb{C}$ be in $L^1_{\mathbb{C}} \cap L^2_{\mathbb{C}}$ and define the output of a filter with impulse response $h$ when the white noise $\{B(t)\}_{t\in\mathbb{R}}$ is the input, by
\[ Y(t) = \int_{\mathbb{R}} h(t-s)\,B(s)\,ds. \]
By the isometry formula for the Wiener–Doob integral,
\[ E[Y(t)Y(s)^*] = \frac{N_0}{2}\int_{\mathbb{R}} h(t-s+u)\,h^*(u)\,du, \]
and therefore (Plancherel–Parseval equality)
\[ C_Y(\tau) = \frac{N_0}{2}\int_{\mathbb{R}} e^{2i\pi\nu\tau}\,|\hat h(\nu)|^2\,d\nu. \]
The stochastic process $\{Y(t)\}_{t\in\mathbb{R}}$ is therefore centered and wss, with psd
\[ f_Y(\nu) = |\hat h(\nu)|^2 f_B(\nu), \quad\text{where } f_B(\nu) := \frac{N_0}{2}. \]
We therefore once more recover formally the fundamental equation of linear filtering of wss continuous-time stochastic processes.
The Approximate Derivative Approach

There is a third approach to white noise. The Brownian motion is approximated by the “finitesimal” derivative
\[ B_h(t) = \frac{W(t+h) - W(t)}{h} \]
(here we take $N_0/2 = 1$). For fixed $h > 0$ this defines a proper centered wss stochastic process, with covariance function
\[ C_h(\tau) = \frac{(h - |\tau|)^+}{h^2} \]
and power spectral density
\[ f_h(\nu) = \Big(\frac{\sin \pi\nu h}{\pi\nu h}\Big)^2. \]
Note that, as $h \downarrow 0$, the power spectral density tends to the constant function 1, the power spectral density of the “white noise”. At the same time, the covariance function “tends to the Dirac function” and the energy $C_h(0) = \frac{1}{h}$ tends to infinity. This is another feature of white noise: unpredictability. Indeed, for $\tau \ge h$, the value $B_h(t+\tau)$ cannot be predicted from the value $B_h(t)$, since both are independent random variables.
The connection with the second approach is the following. For all $f \in L^2_{\mathbb{C}}(\mathbb{R}_+) \cap L^1_{\mathbb{C}}(\mathbb{R}_+)$,
\[ \lim_{h\downarrow 0}\int_{\mathbb{R}_+} f(t)B_h(t)\,dt = \int_{\mathbb{R}_+} f(t)\,dW(t) \]
in the quadratic mean. The proof is required in Exercise 12.4.4.
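The pair (covariance function, power spectral density) of the approximate derivative can be checked numerically. The following Python sketch, given only as an illustration, approximates $\int C_h(\tau)e^{-2i\pi\nu\tau}\,d\tau$ by a midpoint rule and compares it with the closed form $(\sin\pi\nu h/\pi\nu h)^2$. The function names, the step count and the test frequencies are choices made for this example.

```python
import math

def triangular_covariance(tau, h):
    """C_h(tau) = (h - |tau|)^+ / h^2."""
    return max(h - abs(tau), 0.0) / h ** 2

def psd_by_quadrature(nu, h, n=20000):
    """Midpoint-rule approximation of int C_h(tau) exp(-2 i pi nu tau) d tau
    over [-h, h]. C_h is even, so the imaginary part vanishes and only the
    cosine term is integrated."""
    step = 2.0 * h / n
    total = 0.0
    for k in range(n):
        tau = -h + (k + 0.5) * step
        total += triangular_covariance(tau, h) * math.cos(2.0 * math.pi * nu * tau)
    return total * step

def psd_closed_form(nu, h):
    """f_h(nu) = (sin(pi nu h) / (pi nu h))^2, equal to 1 at nu = 0."""
    x = math.pi * nu * h
    return 1.0 if x == 0.0 else (math.sin(x) / x) ** 2

h = 0.5
errors = [abs(psd_by_quadrature(nu, h) - psd_closed_form(nu, h))
          for nu in (0.0, 0.3, 1.0, 2.5)]
```

Decreasing `h` flattens `psd_closed_form` towards the constant 1 on any fixed frequency range, which is the behaviour described above.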
12.2 Fourier Analysis of the Trajectories

12.2.1 The Cramér–Khinchin Decomposition
If one seeks a Fourier transform of the trajectories of a nontrivial wide-sense stationary process, there is a priori little chance that it will be a classical one, say, in $L^1$ or $L^2$ (Remark 12.1.1). However, under quite general conditions there exists a kind of Fourier decomposition of the trajectories of a wide-sense stationary process.

Theorem 12.2.1 Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered wss stochastic process, continuous in the quadratic mean and with power spectral measure $\mu$. There exists a unique centered stochastic process $\{x(\nu)\}_{\nu\in\mathbb{R}}$ with uncorrelated increments and with structural measure $\mu$ such that $P$-a.s.
\[ X(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu) \quad (t \in \mathbb{R}), \tag{12.10} \]
where the integral of the right-hand side is a Doob integral.
Uniqueness is in the following sense: If there exists another centered stochastic process $\{\tilde x(\nu)\}_{\nu\in\mathbb{R}}$ with uncorrelated increments, and with finite structural measure $\tilde\mu$, such that for all $t \in \mathbb{R}$, we have $P$-a.s., $X(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,d\tilde x(\nu)$, then for all $a, b \in \mathbb{R}$, $a \le b$, $\tilde x(b) - \tilde x(a) = x(b) - x(a)$, $P$-a.s.
We will occasionally say: “$dx(\nu)$ is the Cramér–Khinchin decomposition” of the wss stochastic process.

Proof. 1. Denote by $H(X)$ the vector subspace of $L^2_{\mathbb{C}}(P)$ formed by the finite complex linear combinations of the type
\[ Z = \sum_{k=1}^{K} \lambda_k X(t_k), \]
and by $\varphi$ the mapping of $H(X)$ into $L^2_{\mathbb{C}}(\mu)$ defined by
\[ \varphi : Z \mapsto \sum_{k=1}^{K} \lambda_k e^{2i\pi\nu t_k}. \]
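We verify that it is a linear isometry of $H(X)$ into $L^2_{\mathbb{C}}(\mu)$. In fact,
\[ E\Big[\Big|\sum_{k=1}^{K} \lambda_k X(t_k)\Big|^2\Big] = \sum_{k=1}^{K}\sum_{\ell=1}^{K} \lambda_k\lambda_\ell^*\,E[X(t_k)X(t_\ell)^*] = \sum_{k=1}^{K}\sum_{\ell=1}^{K} \lambda_k\lambda_\ell^*\,C(t_k - t_\ell), \]
and using Bochner’s theorem, this quantity is equal to
\[ \sum_{k=1}^{K}\sum_{\ell=1}^{K} \lambda_k\lambda_\ell^*\int_{\mathbb{R}} e^{2i\pi\nu(t_k-t_\ell)}\,\mu(d\nu) = \int_{\mathbb{R}} \sum_{k=1}^{K}\sum_{\ell=1}^{K} \lambda_k\lambda_\ell^*\,e^{2i\pi\nu(t_k-t_\ell)}\,\mu(d\nu) = \int_{\mathbb{R}} \Big|\sum_{k=1}^{K} \lambda_k e^{2i\pi\nu t_k}\Big|^2\mu(d\nu). \]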
2. This isometric linear mapping can be uniquely extended to an isometric linear mapping (Theorem C.3.2), that we shall continue to call $\varphi$, from $H_{\mathbb{C}}(X)$, the closure of $H(X)$, into $L^2_{\mathbb{C}}(\mu)$. As the combinations $\sum_{k=1}^{K} \lambda_k e^{2i\pi\nu t_k}$ are dense in $L^2_{\mathbb{C}}(\mu)$ when $\mu$ is a finite measure, $\varphi$ is onto. Therefore, it is a linear isometric bijection between $H_{\mathbb{C}}(X)$ and $L^2_{\mathbb{C}}(\mu)$.

3. We shall define $x(\nu_0)$ to be the random variable in $H_{\mathbb{C}}(X)$ that corresponds in this isometry to the function $1_{(-\infty,\nu_0]}(\nu)$ of $L^2_{\mathbb{C}}(\mu)$. First, we observe that $E[x(\nu_2) - x(\nu_1)] = 0$ since $H_{\mathbb{C}}(X)$ is the closure in $L^2_{\mathbb{C}}(P)$ of a family of centered random variables. Also, by isometry,
\[ E[(x(\nu_2) - x(\nu_1))(x(\nu_4) - x(\nu_3))^*] = \int_{\mathbb{R}} 1_{(\nu_1,\nu_2]}(\nu)\,1_{(\nu_3,\nu_4]}(\nu)\,\mu(d\nu) = \mu((\nu_1,\nu_2] \cap (\nu_3,\nu_4]). \]
We can therefore define the Doob integral $\int_{\mathbb{R}} f(\nu)\,dx(\nu)$ for all $f \in L^2_{\mathbb{C}}(\mu)$.

4. Let now
\[ Z_n(t) := \sum_{k\in\mathbb{Z}} e^{2i\pi t(k/2^n)}\Big(x\Big(\frac{k+1}{2^n}\Big) - x\Big(\frac{k}{2^n}\Big)\Big). \]
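We have
\[ \lim_{n\to\infty} Z_n(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu) \]
(limit in $L^2_{\mathbb{C}}(P)$) because
\[ Z_n(t) = \int_{\mathbb{R}} f_n(t,\nu)\,dx(\nu), \quad\text{where } f_n(t,\nu) = \sum_{k\in\mathbb{Z}} e^{2i\pi t(k/2^n)}\,1_{(k/2^n,(k+1)/2^n]}(\nu), \]
and therefore, by isometry,
\[ E\Big[\Big|Z_n(t) - \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu)\Big|^2\Big] = \int_{\mathbb{R}} \big|e^{2i\pi\nu t} - f_n(t,\nu)\big|^2\mu(d\nu), \]
a quantity which tends to zero when $n$ tends to infinity (by dominated convergence, using the fact that $\mu$ is a bounded measure). On the other hand, by definition of $\varphi$,
\[ Z_n(t) \xrightarrow{\ \varphi\ } f_n(t,\nu). \]
Since, for fixed $t$, $\lim_{n\to\infty} Z_n(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu)$ in $L^2_{\mathbb{C}}(P)$ and $\lim_{n\to\infty} f_n(t,\nu) = e^{2i\pi\nu t}$ in $L^2_{\mathbb{C}}(\mu)$,
\[ \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu) \xrightarrow{\ \varphi\ } e^{2i\pi\nu t}. \]
But, by definition of $\varphi$,
\[ X(t) \xrightarrow{\ \varphi\ } e^{2i\pi\nu t}. \]
Therefore $X(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu)$.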
5. We now prove uniqueness. Suppose that there exists another spectral decomposition $d\tilde x(\nu)$. Denote by $G$ the set of finite linear combinations of complex exponentials. Since by hypothesis
\[ \int_{\mathbb{R}} e^{2i\pi\nu t}\,dx(\nu) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,d\tilde x(\nu) \quad (= X(t)), \]
we have
\[ \int_{\mathbb{R}} f(\nu)\,dx(\nu) = \int_{\mathbb{R}} f(\nu)\,d\tilde x(\nu) \]
for all $f \in G$, and therefore, for all $f \in L^2_{\mathbb{C}}(\mu) \cap L^2_{\mathbb{C}}(\tilde\mu) \subseteq L^2_{\mathbb{C}}(\tfrac12(\mu + \tilde\mu))$ because $G$ is dense in $L^2_{\mathbb{C}}(\tfrac12(\mu + \tilde\mu))$. In particular, with $f = 1_{(a,b]}$,
\[ x(b) - x(a) = \tilde x(b) - \tilde x(a). \]

More details can be obtained as to the continuity properties (in quadratic mean) of the increments of the spectral decomposition. For instance, it is right-continuous in quadratic mean, and it admits a left-hand limit in quadratic mean at any point $\nu \in \mathbb{R}$. If such limit is denoted by $x(\nu-)$, then, for all $a \in \mathbb{R}$,
\[ E[|x(a) - x(a-)|^2] = \mu(\{a\}). \]
Proof. The right-continuity follows from the continuity of the (finite) measure $\mu$:
\[ \lim_{h\downarrow 0} E[|x(a+h) - x(a)|^2] = \lim_{h\downarrow 0} \mu((a, a+h]) = \mu(\emptyset) = 0. \]
As for the existence of left-hand limits, it is guaranteed by the Cauchy criterion, since for all $a \in \mathbb{R}$, lim
h,h ↓0,h […] 1/(2B),
\[ X(t) = \lim_{N\uparrow\infty} \sum_{n=-N}^{+N} X(nT)\,\frac{\sin\big(\frac{\pi}{T}(t - nT)\big)}{\frac{\pi}{T}(t - nT)}, \tag{12.11} \]
where the limit is in $L^2_{\mathbb{C}}(P)$.
Proof. We have
\[ X(t) = \int_{[-B,+B]} e^{2i\pi\nu t}\,dx(\nu). \]
Now,
\[ e^{2i\pi\nu t} = \lim_{N\uparrow\infty} \sum_{n=-N}^{+N} e^{2i\pi\nu nT}\,\frac{\sin\big(\frac{\pi}{T}(t - nT)\big)}{\frac{\pi}{T}(t - nT)}, \]
where the limit is uniform in $[-B, +B]$ and bounded. Therefore the above limit is also in $L^2_{\mathbb{C}}(\mu)$ because $\mu$ is a finite measure. Consequently
\[ X(t) = \lim_{N\uparrow\infty} \int_{[-B,+B]} \Big\{\sum_{n=-N}^{+N} e^{2i\pi\nu nT}\,\frac{\sin\big(\frac{\pi}{T}(t - nT)\big)}{\frac{\pi}{T}(t - nT)}\Big\}\,dx(\nu), \]
where the limit is in $L^2_{\mathbb{C}}(P)$. The result then follows by expanding the integral with respect to the sum.
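The deterministic expansion of $e^{2i\pi\nu t}$ used in this proof can be observed numerically. The Python sketch below sums the truncated series for a frequency inside the band and compares it with the exponential. It assumes a sampling period $T < 1/(2B)$ (the text's condition on $T$ is incomplete in this copy, so this is an assumption of the example), and the function names and numerical values ($B = 0.4$, $T = 1$, $\nu = 0.3$) are illustrative choices. Convergence of the truncated sum is slow, of order $1/N$, hence the loose tolerance one should expect.

```python
import cmath
import math

def sinc_kernel(t, n, T):
    """sin(pi (t - nT) / T) / (pi (t - nT) / T), equal to 1 when t = nT."""
    x = math.pi * (t - n * T) / T
    return 1.0 if x == 0.0 else math.sin(x) / x

def reconstruct_exponential(nu, t, T, N):
    """Truncated series sum_{n=-N}^{N} exp(2 i pi nu n T) * sinc_kernel(t, n, T)."""
    return sum(cmath.exp(2j * math.pi * nu * n * T) * sinc_kernel(t, n, T)
               for n in range(-N, N + 1))

# Hypothetical setting: band [-B, B] with B = 0.4, sampling period T = 1 < 1/(2B).
nu, T = 0.3, 1.0
t_between = 0.37                      # a time strictly between two sample points
approx = reconstruct_exponential(nu, t_between, T, 2000)
exact = cmath.exp(2j * math.pi * nu * t_between)
error_between = abs(approx - exact)

# At a sample point t = nT every kernel but one vanishes (up to rounding).
error_at_sample = abs(reconstruct_exponential(nu, 2.0, T, 2000)
                      - cmath.exp(2j * math.pi * nu * 2.0))
```

Replacing each coefficient $e^{2i\pi\nu nT}$ by the sample $X(nT)$ gives the truncated version of (12.11).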
12.2.2 A Plancherel–Parseval Formula

The following result is the analog of the Plancherel–Parseval formula of classical Fourier analysis.

Theorem 12.2.6 Let $f : \mathbb{R} \to \mathbb{C}$ be in $L^1_{\mathbb{C}}(\mathbb{R})$ with Fourier transform $\hat f$. Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered wss stochastic process with power spectral measure $\mu$ and Cramér–Khinchin spectral decomposition $dx(\nu)$. Then:
\[ \int_{\mathbb{R}} \hat f(\nu)^*\,dx(\nu) = \int_{\mathbb{R}} f(t)^* X(t)\,dt. \tag{12.12} \]
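Proof. The function $\hat f$ is bounded and continuous (as the Fourier transform of an integrable function) and $\mu$ is a finite measure, so that $\hat f \in L^2_{\mathbb{C}}(\mu)$ and
\[ \sum_{k=-n2^n}^{n2^n-1} \hat f\Big(\frac{k}{2^n}\Big)\,1_{\left(\frac{k}{2^n},\frac{k+1}{2^n}\right]} \to \hat f \quad\text{in } L^2_{\mathbb{C}}(\mu). \]
Therefore (all limits in the following sequence of equalities being in $L^2_{\mathbb{C}}(P)$):
\[ \int_{\mathbb{R}} \hat f(\nu)^*\,dx(\nu) = \lim_{n\to\infty} \sum_{k=-n2^n}^{n2^n-1} \hat f\Big(\frac{k}{2^n}\Big)^{*}\Big(x\Big(\frac{k+1}{2^n}\Big) - x\Big(\frac{k}{2^n}\Big)\Big) \]
\[ = \lim_{n\to\infty} \sum_{k=-n2^n}^{n2^n-1} \Big(\int_{\mathbb{R}} f^*(t)\,e^{+2i\pi(k/2^n)t}\,dt\Big)\Big(x\Big(\frac{k+1}{2^n}\Big) - x\Big(\frac{k}{2^n}\Big)\Big) \]
\[ = \lim_{n\to\infty} \int_{\mathbb{R}} f^*(t)\Big\{\sum_{k=-n2^n}^{n2^n-1} e^{+2i\pi(k/2^n)t}\Big(x\Big(\frac{k+1}{2^n}\Big) - x\Big(\frac{k}{2^n}\Big)\Big)\Big\}\,dt = \lim_{n\to\infty} \int_{\mathbb{R}} f^*(t)X_n(t)\,dt, \]
where
\[ X_n(t) = \sum_{k=-n2^n}^{n2^n-1} e^{+2i\pi(k/2^n)t}\Big(x\Big(\frac{k+1}{2^n}\Big) - x\Big(\frac{k}{2^n}\Big)\Big) \to X(t) \quad\text{in } L^2_{\mathbb{C}}(P). \]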
The announced result will then follow once we prove that
\[ \lim_{n\to\infty} \int_{\mathbb{R}} f^*(t)X_n(t)\,dt = \int_{\mathbb{R}} f^*(t)X(t)\,dt, \]
where the limit is in $L^2_{\mathbb{C}}(P)$. In fact, with $Y_n(t) = X(t) - X_n(t)$,
\[ E\Big[\Big|\int_{\mathbb{R}} f(t)Y_n(t)\,dt\Big|^2\Big] = \int_{\mathbb{R}}\int_{\mathbb{R}} f(t)f(s)^*\,E[Y_n(t)Y_n(s)^*]\,dt\,ds. \]
But for all $t \in \mathbb{R}$, $\lim_{n\uparrow\infty} Y_n(t) = 0$ (in $L^2_{\mathbb{C}}(P)$) and therefore $\lim_{n\uparrow\infty} E[Y_n(t)Y_n(s)^*] = 0$. Moreover, $E[Y_n(t)Y_n(s)^*]$ is uniformly bounded in $n$. Therefore, by dominated convergence,
\[ \lim_{n\uparrow\infty} \int_{\mathbb{R}}\int_{\mathbb{R}} f(t)f(s)^*\,E[Y_n(t)Y_n(s)^*]\,dt\,ds = 0. \]
Example 12.2.7: Convolutional filtering. Let $h \in L^1_{\mathbb{C}}(\mathbb{R})$ and let $\hat h$ be its Fourier transform. Then
\[ \int_{\mathbb{R}} h(t-s)X(s)\,ds = \int_{\mathbb{R}} \hat h(\nu)\,e^{2i\pi\nu t}\,dx(\nu). \tag{12.13} \]
Proof. It suffices to apply (12.12) to the function $s \mapsto h^*(t-s)$, whose Fourier transform is $\hat h(\nu)^* e^{-2i\pi\nu t}$.
12.2.3 Linear Operations

A function $g : \mathbb{R} \to \mathbb{C}$ in $L^2_{\mathbb{C}}(\mu)$ defines a linear operation on the centered wss stochastic process $\{X(t)\}_{t\in\mathbb{R}}$ (called the input) by associating with it the centered stochastic process (called the output)
\[ Y(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,g(\nu)\,dx(\nu). \tag{12.14} \]
On the other hand, the calculation of the covariance function $C_Y(\tau) = E[Y(t+\tau)Y(t)^*]$ of the output gives (isometry formula for Doob’s integral),
\[ C_Y(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,|g(\nu)|^2\,\mu_X(d\nu), \]
where $\mu_X$ is the power spectral measure of the input. The power spectral measure of the output process is therefore
\[ \mu_Y(d\nu) = |g(\nu)|^2\,\mu_X(d\nu). \tag{12.15} \]
This is similar to the formula (12.8) obtained for the output of a stable convolutional filter with impulse response $h$. One then says that $g$ is the transmittance of the “filter” (12.14). Note however that this filter is not necessarily of the convolutional type, since $g$ may well not be the Fourier transform of an integrable function (for instance it may be unbounded, as the next example shows).

Example 12.2.8: Differentiation. Let $\{X(t)\}_{t\in\mathbb{R}}$ be a wss stochastic process with spectral measure $\mu_X$ such that
\[ \int_{\mathbb{R}} |\nu|^2\,\mu_X(d\nu) < \infty. \tag{12.16} \]
Then
\[ \lim_{h\to 0} \frac{X(t+h) - X(t)}{h} = \int_{\mathbb{R}} (2i\pi\nu)\,e^{2i\pi\nu t}\,dx(\nu), \]
where the limit is in the quadratic mean. The linear operation corresponding to the transmittance $g(\nu) = 2i\pi\nu$ is therefore the differentiation in quadratic mean.
Proof. Let $h \in \mathbb{R}$, $h \neq 0$. From the equality
\[ \frac{X(t+h) - X(t)}{h} - \int_{\mathbb{R}} (2i\pi\nu)\,e^{2i\pi\nu t}\,dx(\nu) = \int_{\mathbb{R}} e^{2i\pi\nu t}\Big(\frac{e^{2i\pi\nu h} - 1}{h} - 2i\pi\nu\Big)dx(\nu) \]
we have, by isometry,
\[ \lim_{h\to 0} E\Big[\Big|\frac{X(t+h) - X(t)}{h} - \int_{\mathbb{R}} (2i\pi\nu)\,e^{2i\pi\nu t}\,dx(\nu)\Big|^2\Big] = \lim_{h\to 0} \int_{\mathbb{R}} \Big|\frac{e^{2i\pi\nu h} - 1}{h} - 2i\pi\nu\Big|^2\mu_X(d\nu). \]
In view of hypothesis (12.16) and since $\big|\frac{e^{2i\pi\nu h} - 1}{h} - 2i\pi\nu\big|^2 \le 16\pi^2\nu^2$ (each of the two terms has modulus at most $2\pi|\nu|$), the latter limit is 0, by dominated convergence.
“A line spectrum corresponds to a combination of sinusoids.” More precisely:

Theorem 12.2.9 Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered wss stochastic process with spectral measure
\[ \mu_X(d\nu) = \sum_{k\in\mathbb{Z}} P_k\,\varepsilon_{\nu_k}(d\nu), \]
where $\varepsilon_{\nu_k}$ is the Dirac measure at $\nu_k \in \mathbb{R}$, $P_k \in \mathbb{R}_+$ and $\sum_{k\in\mathbb{Z}} P_k < \infty$. Then
\[ X(t) = \sum_{k\in\mathbb{Z}} U_k\,e^{2i\pi\nu_k t}, \]
where $\{U_k\}_{k\in\mathbb{Z}}$ is a sequence of centered uncorrelated square-integrable complex variables, and $E[|U_k|^2] = P_k$.
Proof. The function
\[ g(\nu) = \sum_{k\in\mathbb{Z}} 1_{\{\nu_k\}}(\nu) \]
is in $L^2_{\mathbb{C}}(\mu_X)$, as well as the function $1 - g$. Also $\int_{\mathbb{R}} |1 - g(\nu)|^2\,\mu_X(d\nu) = 0$, and in particular $\int_{\mathbb{R}} (1 - g(\nu))\,e^{2i\pi\nu t}\,dx(\nu) = 0$. Therefore
\[ X(t) = \int_{\mathbb{R}} g(\nu)\,e^{2i\pi\nu t}\,dx(\nu) = \sum_{k\in\mathbb{Z}} e^{2i\pi\nu_k t}\,(x(\nu_k) - x(\nu_k-)). \]
The conclusion follows by defining $U_k := x(\nu_k) - x(\nu_k-)$.
Linear Transformations of Gaussian Processes

Definition 12.2.10 A linear transformation of a wss stochastic process $\{X(t)\}_{t\in\mathbb{R}}$ is a transformation of it into the second-order process (not wss in general)
\[ Y(t) = \int_{\mathbb{R}} g(t,\nu)\,dx(\nu), \tag{12.17} \]
where
\[ \int_{\mathbb{R}} |g(t,\nu)|^2\,\mu_X(d\nu) < \infty \]
for all $t \in \mathbb{R}$.

Theorem 12.2.11 A linear transformation of a Gaussian wss stochastic process yields a Gaussian stochastic process.
Proof. Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered Gaussian wss stochastic process with Cramér–Khinchin decomposition $dx(\nu)$. For each $\nu \in \mathbb{R}$, the random variable $x(\nu)$ is in $H_{\mathbb{R}}(X)$, by construction. Now, if $\{X(t)\}_{t\in\mathbb{R}}$ is a Gaussian process, $H_{\mathbb{R}}(X)$ is a Gaussian subspace. But (Theorem 12.2.3) $H_{\mathbb{R}}(X) = H_{\mathbb{R}}(x)$. Therefore the process (12.17) is in $H_{\mathbb{C}}(X)$, hence Gaussian.

Example 12.2.12: Convolutional filtering of a wss Gaussian process. In particular, if $\{X(t)\}_{t\in\mathbb{R}}$ is a Gaussian wss process with Cramér–Khinchin decomposition $dx(\nu)$ and if $g \in L^2_{\mathbb{C}}(\mu_X)$, the process
\[ Y(t) = \int_{\mathbb{R}} e^{2i\pi\nu t}\,g(\nu)\,dx(\nu) \quad (t \in \mathbb{R}) \]
is Gaussian. A particular case is when $g = \hat h$, the Fourier transform of a filter with integrable impulse response $h$. The stochastic process $\{Y(t)\}_{t\in\mathbb{R}}$ is the one obtained by convolutional filtering of $\{X(t)\}_{t\in\mathbb{R}}$ with this filter.
12.3 Multivariate wss Stochastic Processes

12.3.1 The Power Spectral Matrix

Let
\[ X(t) = (X_1(t),\ldots,X_L(t)) \quad (t \in \mathbb{R}) \]
be a stochastic process with values in $E := \mathbb{C}^L$, where $L$ is an integer greater than or equal to 2. This process is assumed centered and of the second order, that is,
\[ E[\|X(t)\|^2] < \infty \quad (t \in \mathbb{R}). \]
Furthermore, it will be assumed that it is wide-sense stationary, in the sense that the mean vector of $X(t)$ and the cross-covariance matrix of the vectors $X(t+\tau)$ and $X(t)$ do not depend upon $t$. The matrix-valued function $C$ defined by
\[ C(\tau) := \mathrm{cov}(X(t+\tau), X(t)) \tag{12.18} \]
is called the (matrix) covariance function of the stochastic process. Its general entry is $C_{ij}(\tau) = \mathrm{cov}(X_i(t+\tau), X_j(t))$. The processes $\{X_i(t)\}_{t\in\mathbb{R}}$ $(1 \le i \le L)$ are wss stochastic processes and in addition, they are stationarily correlated or “jointly wss”. Such a vector-valued stochastic process $\{X(t)\}_{t\in\mathbb{R}}$ is called a multivariate wss stochastic process.

Example 12.3.1: Signal plus noise. The following model frequently appears in signal processing:
\[ Y(t) = S(t) + B(t), \]
where $\{S(t)\}_{t\in\mathbb{R}}$ and $\{B(t)\}_{t\in\mathbb{R}}$ are two uncorrelated centered wss stochastic processes with respective covariance functions $C_S$ and $C_B$. Then, $\{(Y(t), S(t))^T\}_{t\in\mathbb{R}}$ is a bivariate wss stochastic process. Owing to the assumption of non-correlation,
\[ C(\tau) = \begin{pmatrix} C_S(\tau) + C_B(\tau) & C_S(\tau) \\ C_S(\tau) & C_S(\tau) \end{pmatrix}. \]
Theorem 12.3.2 Let $\{X(t)\}_{t\in\mathbb{R}}$ be an $L$-dimensional multivariate wss stochastic process. For all $r, s$ $(1 \le r, s \le L)$ there exists a finite complex measure $\mu_{rs}$ such that
\[ C_{rs}(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,\mu_{rs}(d\nu). \tag{12.19} \]
Proof. Say $r = 1$, $s = 2$. Let us consider the stochastic processes
\[ Y(t) = X_1(t) + X_2(t), \qquad Z(t) = iX_1(t) + X_2(t). \]
These are wss stochastic processes with respective covariance functions
\[ C_Y(\tau) = C_1(\tau) + C_2(\tau) + C_{12}(\tau) + C_{21}(\tau), \]
\[ C_Z(\tau) = C_1(\tau) + C_2(\tau) + iC_{12}(\tau) - iC_{21}(\tau). \]
From these two equalities we deduce
\[ C_{12}(\tau) = \frac{1}{2}\big\{[C_Y(\tau) - C_1(\tau) - C_2(\tau)] - i[C_Z(\tau) - C_1(\tau) - C_2(\tau)]\big\}, \]
from which the result follows with
\[ \mu_{12} = \frac{1}{2}\big\{[\mu_Y - \mu_1 - \mu_2] - i[\mu_Z - \mu_1 - \mu_2]\big\}. \]
The matrix $M := \{\mu_{ij}\}_{1\le i,j\le k}$ (whose entries are finite complex measures) is the interspectral power measure matrix of the multivariate wss stochastic process $\{X(t)\}_{t\in\mathbb{R}}$. It is clear that for all $z = (z_1,\ldots,z_k) \in \mathbb{C}^k$, $U(t) = z^T X(t)$ defines a wss stochastic process with spectral measure $\mu_U = z\,M\,z^\dagger$ ($\dagger$ means transpose conjugate).
The link between the interspectral measure $\mu_{12}$ and the Cramér–Khinchin decompositions $dx_1(\nu)$ and $dx_2(\nu)$ is the following:
\[ E[(x_1(\nu_2) - x_1(\nu_1))(x_2(\nu_4) - x_2(\nu_3))^*] = \mu_{12}((\nu_1,\nu_2] \cap (\nu_3,\nu_4]). \]
This is a particular case of the following result. For all functions $g_i : \mathbb{R} \to \mathbb{C}$, $g_i \in L^2_{\mathbb{C}}(\mu_i)$ $(i = 1, 2)$,
\[ E\Big[\Big(\int_{\mathbb{R}} g_1(\nu)\,dx_1(\nu)\Big)\Big(\int_{\mathbb{R}} g_2(\nu)\,dx_2(\nu)\Big)^{*}\Big] = \int_{\mathbb{R}} g_1(\nu)g_2(\nu)^*\,\mu_{12}(d\nu). \tag{12.20} \]
Indeed, equality (12.20) is true for $g_1(\nu) = e^{2i\pi t_1\nu}$ and $g_2(\nu) = e^{2i\pi t_2\nu}$ since it then reduces to
\[ E[X_1(t_1)X_2(t_2)^*] = \int_{\mathbb{R}} e^{2i\pi(t_1-t_2)\nu}\,\mu_{12}(d\nu). \]
This is therefore verified for $g_1 \in \mathcal{E}$, $g_2 \in \mathcal{E}$, where $\mathcal{E}$ is the set of finite linear combinations of functions of the type $\nu \mapsto e^{2i\pi t\nu}$, $t \in \mathbb{R}$. But $\mathcal{E}$ is dense in $L^2_{\mathbb{C}}(\mu_i)$ $(i = 1, 2)$, and therefore the equality (12.20) is true for all $g_i \in L^2_{\mathbb{C}}(\mu_i)$ $(i = 1, 2)$.

Theorem 12.3.3 The interspectral measure $\mu_{12}$ is absolutely continuous with respect to each spectral measure $\mu_1$ and $\mu_2$.
Proof. This means that $\mu_{12}(A) = 0$ whenever $\mu_1(A) = 0$ or $\mu_2(A) = 0$. Indeed,
\[ \mu_{12}(A) = E\Big[\Big(\int_A dx_1\Big)\Big(\int_A dx_2\Big)^{*}\Big] \]
and $\mu_1(A) = 0$ implies $\int_A dx_1 = 0$ since
\[ E\Big[\Big|\int_A dx_1\Big|^2\Big] = \mu_1(A). \]
Therefore, every spectral measure $\mu_{ij}$ is absolutely continuous with respect to the trace of the power spectral measure matrix
\[ \mathrm{Tr}\,M := \sum_{j=1}^{k} \mu_j. \]
By the Radon–Nikodým theorem there exists a function $g_{ij} : \mathbb{R} \to \mathbb{C}$ such that
\[ \mu_{ij}(A) = \int_A g_{ij}(\nu)\,\mathrm{Tr}\,M(d\nu). \]
The matrix $g(\nu) = \{g_{ij}(\nu)\}_{1\le i,j\le k}$ is called the canonical spectral density matrix of $\{X(t)\}_{t\in\mathbb{R}}$. One should insist that it is not required that the stochastic processes $\{X_i(t)\}_{t\in\mathbb{R}}$, $1 \le i \le k$, admit power spectral densities. The correlation matrix $C(\tau)$ has, with the above notation, the representation
\[ C(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,g(\nu)\,\mathrm{Tr}\,M(d\nu). \]
If each one among the wss stochastic processes $\{X_i(t)\}_{t\in\mathbb{R}}$ admits a spectral density, $\{X(t)\}_{t\in\mathbb{R}}$ admits an interspectral density matrix $f(\nu) = \{f_{ij}(\nu)\}_{1\le i,j\le k}$, that is,
\[ C_{ij}(\tau) = \mathrm{cov}(X_i(t+\tau), X_j(t)) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,f_{ij}(\nu)\,d\nu. \]
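Example 12.3.4: Interferences. Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered wss stochastic process with power spectral measure $\mu_X$. Let $h_1, h_2 : \mathbb{R} \to \mathbb{C}$ be integrable functions with respective Fourier transforms $\hat h_1$ and $\hat h_2$. Define for $i = 1, 2$,
\[ Y_i(t) := \int_{\mathbb{R}} h_i(t-s)\,X(s)\,ds. \]
The wss stochastic processes $\{Y_1(t)\}_{t\in\mathbb{R}}$ and $\{Y_2(t)\}_{t\in\mathbb{R}}$ are stationarily correlated. In fact (assuming that they are centered, without loss of generality),
\[ E[Y_1(t+\tau)Y_2(t)^*] = E\Big[\Big(\int_{\mathbb{R}} h_1(t+\tau-s)X(s)\,ds\Big)\Big(\int_{\mathbb{R}} h_2(t-s)X(s)\,ds\Big)^{*}\Big] \]
\[ = \int_{\mathbb{R}}\int_{\mathbb{R}} h_1(t+\tau-u)\,h_2^*(t-v)\,C_X(u-v)\,du\,dv = \int_{\mathbb{R}}\int_{\mathbb{R}} h_1(\tau-u)\,h_2^*(-v)\,C_X(u-v)\,du\,dv, \]
and this quantity depends only upon $\tau$. Replacing $C_X(u-v)$ by its expression in terms of the spectral measure $\mu_X$, one obtains
\[ C_{Y_1Y_2}(\tau) = \int_{\mathbb{R}} e^{2i\pi\nu\tau}\,T_1(\nu)T_2^*(\nu)\,\mu_X(d\nu), \]
where $T_i := \hat h_i$ denotes the transmittance of the $i$-th filter. The power spectral matrix of the bivariate process $\{Y_1(t), Y_2(t)\}_{t\in\mathbb{R}}$ is therefore
\[ \mu_Y(d\nu) = \begin{pmatrix} |T_1(\nu)|^2 & T_1(\nu)T_2^*(\nu) \\ T_1^*(\nu)T_2(\nu) & |T_2(\nu)|^2 \end{pmatrix}\mu_X(d\nu). \]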
12.3.2 Bandpass Stochastic Processes

Let $\{X(t)\}_{t\in\mathbb{R}}$ be a centered wss stochastic process with power spectral measure $\mu_X$ and Cramér–Khinchin decomposition $dx(\nu)$. This process is assumed real, and therefore
\[ \mu_X(-d\nu) = \mu_X(d\nu), \qquad dx(-\nu) = dx(\nu)^*. \]
Definition 12.3.5 The above wss stochastic process is called bandpass $(\nu_0, B)$, where $\nu_0 > B > 0$, if the support of $\mu_X$ is contained in the frequency band $[-\nu_0 - B, -\nu_0 + B] \cup [\nu_0 - B, \nu_0 + B]$.

A bandpass stochastic process admits the following quadrature decomposition
\[ X(t) = M(t)\cos 2\pi\nu_0 t - N(t)\sin 2\pi\nu_0 t, \tag{12.21} \]
where $\{M(t)\}_{t\in\mathbb{R}}$ and $\{N(t)\}_{t\in\mathbb{R}}$, called the quadrature components, are real baseband $(B)$ wss stochastic processes.
To prove this, let $G(\nu) := -i\,\mathrm{sign}(\nu)$ $(= 0$ if $\nu = 0)$. The function $G$ is the so-called Hilbert filter transmittance. The quadrature process associated with $\{X(t)\}_{t\in\mathbb{R}}$ is defined by
\[ Y(t) = \int_{\mathbb{R}} G(\nu)\,e^{2i\pi\nu t}\,dx(\nu). \]
The right-hand side of the preceding equality is well defined since $\int_{\mathbb{R}} |G(\nu)|^2\,\mu_X(d\nu) = \mu_X(\mathbb{R}) < \infty$. Moreover, this stochastic process is real, since its spectral decomposition is hermitian symmetric. The analytic process associated with $\{X(t)\}_{t\in\mathbb{R}}$ is, by definition, the stochastic process
\[ Z(t) = X(t) + iY(t) = \int_{\mathbb{R}} (1 + iG(\nu))\,e^{2i\pi\nu t}\,dx(\nu) = 2\int_{(0,\infty)} e^{2i\pi\nu t}\,dx(\nu). \]
Taking into account that $|G(\nu)|^2 = 1$, the preceding expressions and the Wiener isometry formulas lead to the following properties:
\[ \mu_Y(d\nu) = \mu_X(d\nu), \qquad C_Y(\tau) = C_X(\tau), \qquad C_{XY}(\tau) = -C_{YX}(\tau), \]
\[ \mu_Z(d\nu) = 4\,1_{\mathbb{R}_+}(\nu)\,\mu_X(d\nu), \qquad C_Z(\tau) = 2\{C_X(\tau) + iC_{YX}(\tau)\}, \]
and
\[ E[Z(t+\tau)Z(t)] = 0. \tag{$*$} \]
Defining the complex envelope of $\{X(t)\}_{t\in\mathbb{R}}$ by
\[ U(t) = Z(t)\,e^{-2i\pi\nu_0 t}, \tag{$**$} \]
it follows from this definition that
\[ C_U(\tau) = e^{-2i\pi\nu_0\tau}\,C_Z(\tau), \qquad \mu_U(d\nu) = \mu_Z(d\nu + \nu_0), \tag{$\dagger$} \]
whereas ($*$) and ($**$) give
\[ E[U(t+\tau)U(t)] = 0. \tag{$\dagger\dagger$} \]
The quadrature components $\{M(t)\}_{t\in\mathbb{R}}$ and $\{N(t)\}_{t\in\mathbb{R}}$ of $\{X(t)\}_{t\in\mathbb{R}}$ are the real wss stochastic processes defined by
\[ U(t) = M(t) + iN(t). \]
Since
\[ X(t) = \mathrm{Re}\{Z(t)\} = \mathrm{Re}\{U(t)\,e^{2i\pi\nu_0 t}\}, \]
we have the decomposition (12.21). Taking ($\dagger\dagger$) into account we obtain:
\[ C_M(\tau) = C_N(\tau) = \frac{1}{4}\{C_U(\tau) + C_U(\tau)^*\}, \]
and
\[ C_{MN}(\tau) = C_{NM}(\tau) = \frac{1}{4i}\{C_U(\tau) - C_U(\tau)^*\}, \tag{$\diamond$} \]
and the corresponding relations for the spectra
\[ \mu_M(d\nu) = \mu_N(d\nu) = \{\mu_X(d\nu - \nu_0) + \mu_X(d\nu + \nu_0)\}\,1_{[-B,+B]}(\nu). \]
From ($\diamond$) and the observation that $C_U(0) = C_U(0)^*$ (since $C_U(0) = E[|U(0)|^2]$ is real), we deduce $C_{MN}(0) = 0$, that is to say,
\[ E[M(t)N(t)] = 0. \tag{12.22} \]
If, furthermore, the original process has a power spectral measure that is symmetric about $\nu_0$ in the band $[\nu_0 - B, \nu_0 + B]$, the same holds for the spectrum of the analytic process and, by ($\dagger$), the complex envelope has a spectral measure symmetric about 0, which implies $C_U(\tau) = C_U(\tau)^*$ and then, by ($\diamond$),
\[ E[M(t)N(t+\tau)] = 0. \tag{12.23} \]
In summary: Theorem 12.3.6 Let {X(t)}t∈R be a centered real bandpass (ν0 , B) wss stochastic process. The values of its quadrature components at a given time are uncorrelated. Moreover, if the original stochastic process has a power spectral measure symmetric about ν0 , the quadrature component processes are uncorrelated. More can be said when the original process is Gaussian. In this case, the quadrature component processes are jointly Gaussian (being obtained from the original Gaussian process by linear operations). In particular, for all t ∈ R, M (t) and N (t) are jointly Gaussian and uncorrelated, and therefore independent. If moreover the original process has a spectrum symmetric about ν0 , then, by (12.23), M (t1 ) and N (t2 ) (t1 , t2 ∈ R) are uncorrelated jointly Gaussian variables, and therefore independent. In other words, the quadrature component processes are two independent centered Gaussian wss stochastic processes.
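The construction of the quadrature components can be followed step by step on a deterministic example. The Python sketch below is an illustration only: it takes a finite sum of cosines with frequencies in $[\nu_0 - B, \nu_0 + B]$ (amplitudes, frequencies and phases are hypothetical values chosen here), forms the analytic signal by keeping twice the positive-frequency part, demodulates it to obtain the complex envelope $U(t) = Z(t)e^{-2i\pi\nu_0 t}$, and checks the decomposition (12.21) with $M = \mathrm{Re}\,U$ and $N = \mathrm{Im}\,U$.

```python
import cmath
import math

def bandpass_signal(t, amps, freqs, phases):
    """X(t) = sum_k a_k cos(2 pi f_k t + theta_k), all f_k > 0."""
    return sum(a * math.cos(2 * math.pi * f * t + p)
               for a, f, p in zip(amps, freqs, phases))

def analytic_signal(t, amps, freqs, phases):
    """Z(t): twice the positive-frequency part of X, so that X = Re Z."""
    return sum(a * cmath.exp(1j * (2 * math.pi * f * t + p))
               for a, f, p in zip(amps, freqs, phases))

def quadrature_components(t, nu0, amps, freqs, phases):
    """(M(t), N(t)) with U(t) = Z(t) exp(-2 i pi nu0 t) = M(t) + i N(t)."""
    u = analytic_signal(t, amps, freqs, phases) * cmath.exp(-2j * math.pi * nu0 * t)
    return u.real, u.imag

# Hypothetical bandpass example: nu0 = 10, B = 1, lines inside [9, 11].
nu0 = 10.0
amps = [1.0, 0.5, 2.0]
freqs = [9.2, 10.0, 10.9]
phases = [0.3, -1.1, 2.0]

def decomposition_error(t):
    m, n = quadrature_components(t, nu0, amps, freqs, phases)
    rebuilt = m * math.cos(2 * math.pi * nu0 * t) - n * math.sin(2 * math.pi * nu0 * t)
    return abs(bandpass_signal(t, amps, freqs, phases) - rebuilt)

max_error = max(decomposition_error(t) for t in (0.0, 0.13, 0.5, 1.7))
```

In this example $M$ and $N$ only contain the frequencies $f_k - \nu_0 \in [-B, +B]$, which is the baseband property stated for the quadrature components.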
Complementary reading

[Cramér and Leadbetter, 1967, 1995] is the classical reference. It emphasizes the study of level crossings by wide-sense stationary stochastic processes. [Brémaud, 2014] has a chapter on the spectral measure of point processes.
12.4 Exercises
Exercise 12.4.1. Symmetric power spectral measure. Show that the power spectral measure of a real wss stochastic process is symmetric.

Exercise 12.4.2. Products of independent wss stochastic processes. Let $\{X(t)\}_{t\in\mathbb{R}}$ and $\{Y(t)\}_{t\in\mathbb{R}}$ be two centered wss stochastic processes of respective covariance functions $C_X(\tau)$ and $C_Y(\tau)$.
1. Assume the two signals to be independent. Show that $Z(t) := X(t)Y(t)$ $(t \in \mathbb{R})$ is a wss stochastic process. Give its mean and covariance function.
2. Assume the same hypothesis as in the previous question, but now $\{X(t)\}_{t\in\mathbb{R}}$ is the harmonic process of Example 5.1.20. Suppose that $\{Y(t)\}_{t\in\mathbb{R}}$ admits a power spectral density $f_Y(\nu)$. Give the power spectral density $f_Z(\nu)$ of $\{Z(t)\}_{t\in\mathbb{R}}$.

Exercise 12.4.3. The approximate derivative of a Wiener process. Let $\{W(t)\}_{t\ge 0}$ be a Wiener process. Show that for $a > 0$, the stochastic process
$$X_a(t) := \frac{W(t+a) - W(t)}{a} \qquad (t \in \mathbb{R})$$
is a wss stochastic process. Compute its mean, its covariance function and its power spectral density.

Exercise 12.4.4. Doob's integral and the finitesimal derivative of Brownian motion. Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. Prove the following. For all $f \in L^2_{\mathbb{C}}(\mathbb{R}_+) \cap L^1_{\mathbb{C}}(\mathbb{R}_+)$,
$$\lim_{h\downarrow 0} \int_{\mathbb{R}_+} f(t) B_h(t)\, dt = \int_{\mathbb{R}_+} f(t)\, dW(t)$$
in the quadratic mean.

Exercise 12.4.5. The square of a band-limited white noise. Let $\{X(t)\}_{t\in\mathbb{R}}$ be a wide-sense stationary centered Gaussian process with covariance function $C_X(\tau)$ and with the power spectral density
$$f_X(\nu) = \frac{N_0}{2}\, 1_{[-B,+B]}(\nu),$$
where $N_0 > 0$ and $B > 0$.
1. Let $Y(t) = X(t)^2$. Show that $\{Y(t)\}_{t\in\mathbb{R}}$ is a wide-sense stationary process.
2. Give its power spectral density $f_Y(\nu)$.

Exercise 12.4.6. Projection of white noise onto an orthonormal base. Let the set of square-integrable functions $\varphi_i : [0, T] \to \mathbb{R}$ $(1 \le i \le N)$ be such that
$$\int_0^T \varphi_i(t)\varphi_j(t)\, dt = \delta_{ij} \qquad (1 \le i, j \le N),$$
and let $\{B(t)\}_{t\in\mathbb{R}}$ be a Gaussian white noise with psd $N_0/2$. Show that the vector $B = (B_1, \ldots, B_N)^T$ defined by
$$B_i = \int_0^T B(t)\varphi_i(t)\, dt \qquad (1 \le i \le N)$$
is a centered Gaussian vector with covariance matrix
$$\Gamma_B = \frac{N_0}{2}\, I. \tag{12.24}$$
(In particular, the components $B_1, \ldots, B_N$ are identically distributed, independent, and centered Gaussian random variables with common variance $N_0/2$.)

Exercise 12.4.7. An iid sequence carried by an hpp. Let $N$ be a homogeneous Poisson process on $\mathbb{R}_+$ of intensity $\lambda > 0$, and let $\{Z_n\}_{n\ge 0}$ be an iid sequence of integrable real random variables, centered, with finite variance $\sigma^2$, and independent of $N$.
1) Show that $\{Z_{N((0,t])}\}_{t\ge 0}$ is a wide-sense stationary stochastic process and give its covariance function.
2) Give its power spectral density.
3) Compute $P(X(t_1) = X(t_2))$ and $P(X(t_1) > X(t_2))$.

Exercise 12.4.8. Poisson shot noises. Let $N_1$, $N_2$ and $N_3$ be three independent homogeneous Poisson processes on $\mathbb{R}$ with respective intensities $\theta_1 > 0$, $\theta_2 > 0$ and $\theta_3 > 0$. Let $\{X_1(t)\}_{t\in\mathbb{R}}$ be the shot noise constructed on $N_1 + N_3$ with an impulse function $h : \mathbb{R} \to \mathbb{R}$ that is bounded and with compact support (null outside a finite interval). Let $\{X_2(t)\}_{t\in\mathbb{R}}$ be the shot noise constructed on $N_2 + N_3$ with the same impulse function $h$. Compute the power spectral density of the wide-sense stationary process $\{X(t)\}_{t\in\mathbb{R}}$, where $X(t) = X_1(t) + X_2(t)$.

Exercise 12.4.9. Frequency modulation. Consider the so-called frequency modulated (or phase modulated) signal, a stochastic process $\{X(t)\}_{t\ge 0}$ defined by
$$X(t) = \cos\big(2\pi(\nu_0 t + \Phi(t) + \alpha)\big),$$
where
$$\Phi(t) = \int_0^t \nu(s)\, ds,$$
$\{\nu(t)\}_{t\ge 0}$ is a real-valued stochastic process, and $\alpha$ is a real-valued random variable. The following assumptions are made:
(a) $\{\nu(t)\}$ is a strictly stationary process.
(b) $\alpha$ and $\{\nu(t)\}$ are independent.
(c) $E[e^{2i\pi\alpha}] = E[e^{4i\pi\alpha}] = 0$.
Show that the covariance function of the frequency modulated signal is given by
$$C_X(\tau) = \frac{1}{2}\,\mathrm{Re}\Big\{ e^{2i\pi\nu_0\tau}\, E\Big[ e^{2i\pi \int_0^\tau \nu(s)\, ds} \Big] \Big\}.$$

Exercise 12.4.10. Gaussian frequency modulation. This exercise is a continuation of Exercise 12.4.9, to which the reader is referred for the notation and definitions. We now consider a particular case for which the computations are tractable: Gaussian frequency modulation. Here $\{\nu(t)\}_{t\ge 0}$ is a stationary Gaussian signal with mean $\bar\nu$ and covariance function $C_\nu$. Show that
$$C_X(\tau) = \frac{1}{2} \cos\big(2\pi(\nu_0 + \bar\nu)\tau\big)\, e^{-4\pi^2 \int_0^\tau C_\nu(s)(\tau - s)\, ds}.$$

Exercise 12.4.11. Flip-flop. Let $N$ be an hpp on $\mathbb{R}_+$ with intensity $\lambda$. Define the (telegraph or flip-flop) process $\{X(t)\}_{t\ge 0}$ with state space $E = \{+1, -1\}$ by
$$X(t) = Z\,(-1)^{N(t)},$$
where $X(0) = Z$ is an $E$-valued random variable independent of the counting process $N$. (Thus the telegraph process switches between $-1$ and $+1$ at each event of $N$.) The probability distribution of $Z$ is arbitrary.
1. Compute $P(X(t+s) = j \mid X(s) = i)$ for all $t, s \ge 0$ and all $i, j \in E$.
2. Give, for all $i \in E$, the limit of $P(X(t) = i)$ as $t$ tends to $\infty$.
3. Show that when $P(Z = 1) = \frac{1}{2}$, the process is a stationary process and give its power spectral measure.

Exercise 12.4.12. Flip-flop with limited memory. Let $N$ be an hpp on $\mathbb{R}$ with intensity $\lambda > 0$. Define for all $t \in \mathbb{R}$
$$X(t) = (-1)^{N((t,t+a])}.$$
1. Show that $\{X(t)\}_{t\in\mathbb{R}}$ is a wss stochastic process.
2. Compute its power spectral density.
3. Give the best affine estimate of $X(t+\tau)$ in terms of $X(t)$, that is, find $\alpha, \beta$ minimizing $E\big[|X(t+\tau) - (\alpha + \beta X(t))|^2\big]$, when $\tau > 0$.

Exercise 12.4.13. Jumping phase. Define for each $t \in \mathbb{R}$, $t \ge 0$,
$$X(t) = e^{i\Phi_{N(t)}},$$
where $\{N(t)\}_{t\ge 0}$ is the counting process of a homogeneous Poisson process on $\mathbb{R}_+$ with intensity $\lambda > 0$, and $\{\Phi_n\}_{n\ge 0}$ is an iid sequence of random variables uniformly distributed on $[0, 2\pi]$, and independent of the Poisson process. Show that $\{X(t)\}_{t\ge 0}$ is a wide-sense stationary process, give its covariance function $C_X(\tau)$ and its power spectral measure.
III: ADVANCED TOPICS
Chapter 13 Martingales

A martingale is, for the general public, a clever way of gambling. In mathematics, it formalizes the notion of fair game, and we shall see that martingale theory indeed has something to say about such games. However, the interest and scope of martingale theory extend far beyond gambling, and it has become a fundamental tool of the theory of stochastic processes. The present chapter is an introduction to this topic, featuring the two main pillars on which it rests: the optional sampling theorem and the convergence theory of martingales, in discrete as well as in continuous time.
13.1 Martingale Inequalities

13.1.1 The Martingale Property
Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\{\mathcal{F}_n\}_{n\ge 0}$ be a history (or filtration) defined on it, that is, a sequence of sub-σ-fields of $\mathcal{F}$ that is non-decreasing: $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$ $(n \ge 0)$. The internal history of a random sequence $\{X_n\}_{n\ge 0}$ is the filtration $\{\mathcal{F}_n^X\}_{n\ge 0}$ defined by $\mathcal{F}_n^X := \sigma(X_0, \ldots, X_n)$.

Definition 13.1.1 A complex random sequence $\{Y_n\}_{n\ge 0}$ such that for all $n \ge 0$ (i) $Y_n$ is $\mathcal{F}_n$-measurable and (ii) $E[|Y_n|] < \infty$ is called a $(P, \mathcal{F}_n)$-martingale (resp., submartingale, supermartingale) if, in addition, for all $n \ge 0$,
$$E[Y_{n+1} \mid \mathcal{F}_n] = Y_n \qquad (\text{resp., } \ge Y_n,\ \le Y_n). \tag{13.1}$$
When the context is clear as to the choice of the underlying probability measure $P$, we shall abbreviate, saying for instance "$\mathcal{F}_n$-submartingale" instead of "$(P, \mathcal{F}_n)$-submartingale". If the history is not mentioned, it is assumed to be the internal history. For instance, the phrase "$\{Y_n\}_{n\ge 0}$ is a martingale" means that it is an $\mathcal{F}_n^Y$-martingale. Of course an $\mathcal{F}_n$-martingale is an $\mathcal{F}_n$-submartingale and an $\mathcal{F}_n$-supermartingale. Condition (13.1) implies that for all $k \ge 1$, all $n \ge 0$,
$$E[Y_{n+k} \mid \mathcal{F}_n] = Y_n \qquad (\text{resp., } \ge Y_n,\ \le Y_n).$$
© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/9783030401832_13
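Proof. In the martingale case, for instance, by the rule of successive conditioning,
$$E[Y_{n+k} \mid \mathcal{F}_n] = E\big[E[Y_{n+k} \mid \mathcal{F}_{n+k-1}] \mid \mathcal{F}_n\big] = E[Y_{n+k-1} \mid \mathcal{F}_n] = E[Y_{n+k-2} \mid \mathcal{F}_n] = \cdots = E[Y_n \mid \mathcal{F}_n] = Y_n.$$
In particular, taking expectations and letting $n = 0$,
$$E[Y_k] = E[Y_0] \qquad (\text{resp., } \ge E[Y_0],\ \le E[Y_0]).$$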
Example 13.1.2: Sums of iid random variables. Let $\{X_n\}_{n\ge 0}$ be an iid sequence of centered and integrable random variables. The random sequence
$$Y_n := X_0 + X_1 + \cdots + X_n \qquad (n \ge 0)$$
is an $\mathcal{F}_n^X$-martingale. Indeed, for all $n \ge 0$, $Y_n$ is $\mathcal{F}_n^X$-measurable and
$$E[Y_{n+1} \mid \mathcal{F}_n^X] = E[Y_n \mid \mathcal{F}_n^X] + E[X_{n+1} \mid \mathcal{F}_n^X] = Y_n + E[X_{n+1}] = Y_n,$$
where the second equality is due to the fact that $\mathcal{F}_n^X$ and $X_{n+1}$ are independent (Theorem 3.3.20).

Example 13.1.3: Products of iids. Let $X = \{X_n\}_{n\ge 0}$ be an iid sequence of integrable random variables with mean 1. The random sequence
$$Y_n = \prod_{k=0}^{n} X_k \qquad (n \ge 0)$$
is an $\mathcal{F}_n^X$-martingale. Indeed, for all $n \ge 0$, $Y_n$ is $\mathcal{F}_n^X$-measurable and
$$E[Y_{n+1} \mid \mathcal{F}_n^X] = E\Big[X_{n+1} \prod_{k=0}^{n} X_k \,\Big|\, \mathcal{F}_n^X\Big] = E[X_{n+1} \mid \mathcal{F}_n^X] \prod_{k=0}^{n} X_k = E[X_{n+1}] \prod_{k=0}^{n} X_k = 1 \times Y_n = Y_n,$$
where the second equality is due to the fact that $\mathcal{F}_n^X$ and $X_{n+1}$ are independent (Theorem 3.3.20).

Example 13.1.4: Gambling. Consider the random sequence $\{Y_n\}_{n\ge 0}$ with values in $\mathbb{R}_+$ defined by $Y_0 = a \in \mathbb{R}_+$ and
$$Y_{n+1} = Y_n + X_{n+1}\, b_{n+1}(X_0^n) \qquad (n \ge 0),$$
where $X_0^n := (X_0, \ldots, X_n)$, $X_0 = Y_0$, $\{X_n\}_{n\ge 1}$ is an iid sequence of random variables taking the values $+1$ or $-1$ with equal probability, and the family of functions $b_n : \{0, 1\}^n \to \mathbb{N}$ $(n \ge 1)$ is the betting strategy, that is, $b_{n+1}(X_0^n)$ is the stake at time $n+1$ of a gambler given the observed history $X_0^n$ of the chance outcomes up to time $n$. Admissible bets must guarantee that the fortune $Y_n$ remains non-negative at all times $n$, that is, $b_{n+1}(X_0^n) \le Y_n$. The process so defined is an $\mathcal{F}_n^X$-martingale. Indeed, for all $n \ge 0$, $Y_n$ is $\mathcal{F}_n^X$-measurable and
$$E[Y_{n+1} \mid \mathcal{F}_n^X] = E[Y_n \mid \mathcal{F}_n^X] + E[X_{n+1}\, b_{n+1}(X_0^n) \mid \mathcal{F}_n^X] = Y_n + E[X_{n+1} \mid \mathcal{F}_n^X]\, b_{n+1}(X_0^n) = Y_n,$$
where the second equality uses Theorem 3.3.24. The integrability condition should be checked on each application. It is satisfied if the stakes $b_{n+1}(X_0^n)$ are uniformly bounded.

Example 13.1.5: Harmonic functions of an hmc. Let $\{X_n\}_{n\ge 0}$ be an hmc with countable state space $E$ and transition matrix $\mathbf{P}$. A function $h : E \to \mathbb{R}$ is called harmonic (resp., subharmonic, superharmonic) if $\mathbf{P}h$ is well defined and
$$\mathbf{P}h = h \qquad (\text{resp., } \ge h,\ \le h),$$
that is,
$$\sum_{j\in E} p_{ij}\, h(j) = h(i) \qquad (\text{resp., } \ge h(i),\ \le h(i)) \qquad (i \in E). \tag{13.2}$$
Superharmonic functions are also called excessive functions. Equation (13.2) is equivalent, in the harmonic case for instance, to
$$E[h(X_{n+1}) \mid X_n = i] = h(i) \qquad (i \in E). \tag{$\star$}$$
In view of the Markov property, the left-hand side of the above equality is also equal to $E[h(X_{n+1}) \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0]$, and therefore $(\star)$ is equivalent to
$$E[h(X_{n+1}) \mid \mathcal{F}_n^X] = h(X_n).$$
Therefore, if $E[|h(X_n)|] < \infty$ for all $n \ge 0$, the process $\{h(X_n)\}_{n\ge 0}$ is an $\mathcal{F}_n^X$-martingale. Similarly, for a subharmonic (resp., superharmonic) function $h$ such that $E[|h(X_n)|] < \infty$ for all $n \ge 0$, the process $\{h(X_n)\}_{n\ge 0}$ is an $\mathcal{F}_n^X$-submartingale (resp., $\mathcal{F}_n^X$-supermartingale).

Example 13.1.6: Radon–Nikodým sequences. Let be given on the measurable space $(\Omega, \mathcal{F})$ two probability measures $Q$ and $P$ and a filtration $\{\mathcal{F}_n\}_{n\ge 1}$. Let $Q_n$ and $P_n$ denote the restrictions of $Q$ and $P$ respectively to $(\Omega, \mathcal{F}_n)$. Suppose that for all $n \ge 1$, $Q_n \ll P_n$, in which case we say that $Q$ is locally absolutely continuous along $\{\mathcal{F}_n\}_{n\ge 1}$ with respect to $P$ and denote this by $Q \overset{\text{loc.}}{\ll} P$. Let
$$L_n := \frac{dQ_n}{dP_n}.$$
Then $\{L_n\}_{n\ge 1}$ is a $(P, \mathcal{F}_n)$-martingale.

Proof. The integrability condition is satisfied since
$$E_P[L_n] = \int_\Omega L_n\, dP_n = \int_\Omega \frac{dQ_n}{dP_n}\, dP_n = \int_\Omega dQ_n = Q_n(\Omega) = 1.$$
Now, for all $A \in \mathcal{F}_n$ (and a fortiori $A \in \mathcal{F}_{n+1}$),
$$\int_A L_{n+1}\, dP = \int_A L_{n+1}\, dP_{n+1} = Q_{n+1}(A) = Q_n(A) = \int_A L_n\, dP_n = \int_A L_n\, dP.$$
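The martingale property of the gambling scheme of Example 13.1.4 can be checked exactly on a small horizon by enumerating all outcome sequences. The sketch below is only an illustration: the betting rule `stake` is a hypothetical strategy chosen for the example (it is not taken from the text), and all function names are assumptions.

```python
from fractions import Fraction
from itertools import product

def stake(fortune, history):
    """Hypothetical strategy: bet 2 after a win, 1 otherwise, never more than the fortune."""
    wanted = 2 if history and history[-1] == +1 else 1
    return min(wanted, fortune)

def fortune_after(a, outcomes):
    """Fortune Y_n reached from Y_0 = a along a given tuple of +1/-1 outcomes."""
    y = a
    for k, x in enumerate(outcomes):
        y += x * stake(y, outcomes[:k])   # the stake depends only on the past
    return y

def conditional_mean_next(a, outcomes):
    """E[Y_{n+1} | outcomes so far], averaging over the two equally likely next outcomes."""
    return Fraction(sum(fortune_after(a, outcomes + (x,)) for x in (+1, -1)), 2)

def martingale_holds(a, n):
    """Check E[Y_{m+1} | past] = Y_m along every path of length m < n."""
    return all(conditional_mean_next(a, path) == fortune_after(a, path)
               for m in range(n) for path in product((+1, -1), repeat=m))

def mean_fortune(a, n):
    """Exact E[Y_n], averaging over all 2^n equally likely paths."""
    paths = list(product((+1, -1), repeat=n))
    return Fraction(sum(fortune_after(a, p) for p in paths), len(paths))
```

Whatever admissible strategy is plugged into `stake`, the conditional mean of the next fortune equals the current fortune, so `mean_fortune(a, n)` stays equal to `a`.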
Definition 13.1.7 Let $\{\mathcal{F}_n\}_{n\ge 0}$ be some filtration. A complex random sequence $\{X_n\}_{n\ge 0}$ such that for all $n \ge 0$ (a) $X_n$ is $\mathcal{F}_n$-measurable, (b) $E[|X_n|] < \infty$ and $E[X_n] = 0$, and (c) $E[X_{n+1} \mid \mathcal{F}_n] = 0$ (resp., $\ge 0$, $\le 0$) is called a $(P, \mathcal{F}_n)$-martingale difference (resp., submartingale difference, supermartingale difference). The notion of martingale difference generalizes that of centered iid sequences. Indeed, for such iid sequences, $X_{n+1}$ is independent of $\mathcal{F}_n^X$, and therefore (Theorem 3.3.20) $E[X_{n+1} \mid \mathcal{F}_n^X] = 0$.
Convex Functions of Martingales

Theorem 13.1.8 Let $I \subseteq \mathbb{R}$ be an interval (closed, open, semi-closed, infinite, etc.) and let $\varphi : I \to \mathbb{R}$ be a convex function.
A. Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-martingale such that $P(Y_n \in I) = 1$ for all $n \ge 0$. Assume that $E[|\varphi(Y_n)|] < \infty$ for all $n \ge 0$. Then, the process $\{\varphi(Y_n)\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale.
B. Assume moreover that $\varphi$ is non-decreasing and suppose this time that $\{Y_n\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale. Then, the process $\{\varphi(Y_n)\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale.

Proof. By Jensen's inequality for conditional expectations (Exercise 3.4.50),
$$E[\varphi(Y_{n+1}) \mid \mathcal{F}_n] \ge \varphi(E[Y_{n+1} \mid \mathcal{F}_n]).$$
Therefore (case A)
$$E[\varphi(Y_{n+1}) \mid \mathcal{F}_n] \ge \varphi(E[Y_{n+1} \mid \mathcal{F}_n]) = \varphi(Y_n),$$
and (case B)
$$E[\varphi(Y_{n+1}) \mid \mathcal{F}_n] \ge \varphi(E[Y_{n+1} \mid \mathcal{F}_n]) \ge \varphi(Y_n).$$
(For the last inequality, use the submartingale property $E[Y_{n+1} \mid \mathcal{F}_n] \ge Y_n$ and the hypothesis that $\varphi$ is non-decreasing.)

Example 13.1.9: Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-martingale and let $p \ge 1$. As a special case of Theorem 13.1.8 with the convex function $x \mapsto |x|^p$, we have that if $E[|Y_n|^p] < \infty$, $\{|Y_n|^p\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale. Applying Theorem 13.1.8 with the convex function $x \mapsto x^+$, we have that $\{Y_n^+\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale.
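As a small numerical illustration of Theorem 13.1.8 and Example 13.1.9, one may take for the martingale the symmetric ±1 random walk (an assumption made for this sketch). For this walk the conditional expectation of $\varphi(Y_{n+1})$ given the past only depends on the current value $y$ and equals $\frac12\varphi(y+1) + \frac12\varphi(y-1)$, so the submartingale inequality can be checked state by state. The function name below is hypothetical.

```python
def one_step_inequality_holds(phi, states):
    """Check E[phi(Y_{n+1}) | Y_n = y] >= phi(y) for the symmetric +/-1 walk, for each y."""
    return all(0.5 * phi(y + 1) + 0.5 * phi(y - 1) >= phi(y) for y in states)

states = range(-20, 21)

# Convex functions from Example 13.1.9: |x|^p (here p = 1, 2, 3) and x^+.
convex_ok = [
    one_step_inequality_holds(abs, states),
    one_step_inequality_holds(lambda x: x * x, states),
    one_step_inequality_holds(lambda x: abs(x) ** 3, states),
    one_step_inequality_holds(lambda x: max(x, 0), states),
]

# A concave function fails the inequality (at y = 0).
concave_ok = one_step_inequality_holds(lambda x: -abs(x), states)
```

Here every entry of `convex_ok` is true while `concave_ok` is false, in line with the role played by convexity in Jensen's inequality.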
Martingale Transforms and Stopped Martingales

Let $\{\mathcal{F}_n\}_{n\ge 0}$ be some filtration. The complex stochastic process $\{H_n\}_{n\ge 1}$ is called $\mathcal{F}_n$-predictable if $H_n$ is $\mathcal{F}_{n-1}$-measurable for all $n \ge 1$. Let $\{Y_n\}_{n\ge 0}$ be another complex stochastic process. The stochastic process
$$(H \circ Y)_n := \sum_{k=1}^{n} H_k (Y_k - Y_{k-1}) \qquad (n \ge 1)$$
is called the transform of $Y$ by $H$.

Theorem 13.1.10 (a) Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale and let $\{H_n\}_{n\ge 0}$ be a bounded non-negative $\mathcal{F}_n$-predictable process. Then $\{(H \circ Y)_n\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale. (b) If $\{Y_n\}_{n\ge 0}$ is an $\mathcal{F}_n$-martingale and if $\{H_n\}_{n\ge 0}$ is bounded and $\mathcal{F}_n$-predictable, then $\{(H \circ Y)_n\}_{n\ge 0}$ is an $\mathcal{F}_n$-martingale.

Proof. Conditions (i) and (ii) of Definition 13.1.1 are obviously satisfied. Moreover,
(a)
$$E[(H \circ Y)_{n+1} - (H \circ Y)_n \mid \mathcal{F}_n] = E[H_{n+1}(Y_{n+1} - Y_n) \mid \mathcal{F}_n] = H_{n+1}\, E[Y_{n+1} - Y_n \mid \mathcal{F}_n] \ge 0,$$
using Theorem 3.3.24 for the second equality.
(b)
$$E[(H \circ Y)_{n+1} - (H \circ Y)_n \mid \mathcal{F}_n] = H_{n+1}\, E[Y_{n+1} - Y_n \mid \mathcal{F}_n] = 0,$$
by the same token.
Recall the definition of an $\mathcal{F}_n$-stopping time (Definition 6.2.20): a random variable $\tau$ taking its values in $\overline{\mathbb{N}}$ and such that for all $m \in \mathbb{N}$, the event $\{\tau = m\}$ is in $\mathcal{F}_m$. Theorem 13.1.10 immediately leads to the stopped martingale theorem:

Theorem 13.1.11 Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale (resp., martingale) and let $\tau$ be an $\mathcal{F}_n$-stopping time. Then $\{Y_{n\wedge\tau}\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale (resp., martingale). In particular,
$$E[Y_{n\wedge\tau}] \ge E[Y_0] \quad (\text{resp., } = E[Y_0]) \qquad (n \ge 0). \tag{13.3}$$

Proof. Let $H_n := 1_{\{n \le \tau\}}$. The stochastic process $H$ is $\mathcal{F}_n$-predictable since $\{H_n = 0\} = \{\tau \le n-1\} \in \mathcal{F}_{n-1}$. We have
$$Y_{n\wedge\tau} = Y_0 + \sum_{k=1}^{n\wedge\tau} (Y_k - Y_{k-1}) = Y_0 + \sum_{k=1}^{n} 1_{\{k\le\tau\}} (Y_k - Y_{k-1}).$$
The result then follows by Theorem 13.1.10.
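Formula (13.3) can be illustrated by an exact computation. The sketch below assumes the symmetric ±1 random walk started at 0 as the martingale and takes for $\tau$ the first hitting time of a level $b \ge 1$; both choices, and the function names, are assumptions made for the illustration. The law of the stopped walk $Y_{n\wedge\tau}$ is propagated step by step with exact rational arithmetic.

```python
from fractions import Fraction

def stopped_walk_law(n, b):
    """Exact law of Y_{n ^ tau}: symmetric +/-1 walk from 0, tau = first hitting time of b."""
    law = {0: Fraction(1)}
    for _ in range(n):
        new = {}
        for y, p in law.items():
            if y == b:                       # already stopped: frozen at b
                new[y] = new.get(y, 0) + p
            else:                            # still running: move up or down
                for step in (+1, -1):
                    new[y + step] = new.get(y + step, 0) + p / 2
        law = new
    return law

def mean(law):
    """Expectation of a law given as {value: probability}."""
    return sum(y * p for y, p in law.items())
```

For every horizon `n`, `mean(stopped_walk_law(n, b))` equals 0, the value of $E[Y_0]$, even though the mass sitting at `b` increases with `n`.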
13.1.2 Kolmogorov's Inequality
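It often occurs that a result proved for iid sequences also holds for martingale difference sequences. This is the case for the inequality originally proved in the iid case (Lemma 4.1.12).

Theorem 13.1.12 Let $\{S_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale. Then, for all $\lambda \in \mathbb{R}_+$,
$$\lambda\, P\Big(\max_{0\le i\le n} S_i > \lambda\Big) \le E\Big[S_n 1_{\{\max_{0\le i\le n} S_i > \lambda\}}\Big]. \tag{13.4}$$

Proof. Define the random time $\tau = \inf\{n \ge 0\,;\ S_n > \lambda\}$. It is an $\mathcal{F}_n$-stopping time since
$$A_i := \{\tau = i\} = \Big\{S_i > \lambda,\ \max_{0\le j\le i-1} S_j \le \lambda\Big\} \in \mathcal{F}_i.$$
The $A_i$'s so defined are mutually disjoint and
$$A := \Big\{\max_{0\le i\le n} S_i > \lambda\Big\} = \bigcup_{i=0}^{n} A_i.$$
Since $\lambda 1_{A_i} \le S_i 1_{A_i}$,
$$\lambda P(A) = \lambda \sum_{i=0}^{n} P(A_i) \le \sum_{i=0}^{n} E[S_i 1_{A_i}].$$
For all $0 \le i \le n$, $A_i$ being $\mathcal{F}_i$-measurable, we have by the submartingale property that $E[S_n \mid \mathcal{F}_i] \ge S_i$ and therefore $\int_{A_i} S_i\, dP \le \int_{A_i} E[S_n \mid \mathcal{F}_i]\, dP$. Taking these observations into account,
$$\begin{aligned}
\lambda P(A) &\le \sum_{i=0}^{n} E[S_i 1_{A_i}] \le \sum_{i=0}^{n} E\big[E^{\mathcal{F}_i}[S_n]\, 1_{A_i}\big] \\
&= \sum_{i=0}^{n} E\big[E^{\mathcal{F}_i}[S_n 1_{A_i}]\big] = \sum_{i=0}^{n} E[S_n 1_{A_i}] \\
&= E\Big[S_n \sum_{i=0}^{n} 1_{A_i}\Big] = E[S_n 1_A].
\end{aligned}$$

Corollary 13.1.13 Let $\{M_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-martingale. Then, for all $p \ge 1$, all $\lambda \in \mathbb{R}_+$,
$$\lambda^p\, P\Big(\max_{0\le i\le n} |M_i| > \lambda\Big) \le E[|M_n|^p]. \tag{13.5}$$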
Proof. Let $S_n = |M_n|^p$. This defines an $\mathcal{F}_n$-submartingale (Example 13.1.9) to which one may apply Kolmogorov's inequality with $\lambda$ replaced by $\lambda^p$:
$$\lambda^p\, P\Big(\max_{0\le i\le n} |M_i|^p > \lambda^p\Big) \le E\Big[|M_n|^p 1_{\{\max_{0\le i\le n} |M_i|^p > \lambda^p\}}\Big] \le E[|M_n|^p].$$

Remark 13.1.14 Note that Kolmogorov's inequality is, as far as martingales are concerned, a considerable improvement with respect to what Markov's inequality would have given:
$$\lambda^p\, P(|M_i|^p > \lambda^p) \le E[|M_i|^p] \le E[|M_n|^p] \qquad (0 \le i \le n).$$
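Inequality (13.5) can be verified exactly on a small example by enumerating paths. The sketch below assumes the symmetric ±1 random walk with $M_0 = 0$ as the martingale (an assumption for illustration only; the function name is hypothetical) and returns both sides of (13.5).

```python
from fractions import Fraction
from itertools import accumulate, product

def kolmogorov_sides(n, lam, p=2):
    """Return (lam^p * P(max_{0<=i<=n} |M_i| > lam), E|M_n|^p) for the symmetric walk.

    lam and p are taken to be integers so that all arithmetic stays exact.
    """
    exceed = moment = count = 0
    for steps in product((+1, -1), repeat=n):
        path = [0, *accumulate(steps)]                  # M_0, M_1, ..., M_n
        exceed += max(abs(m) for m in path) > lam       # running maximum crosses lam
        moment += abs(path[-1]) ** p
        count += 1
    return lam ** p * Fraction(exceed, count), Fraction(moment, count)
```

With `p = 2` the right-hand side is $E[M_n^2] = n$ for this walk, and the left-hand side stays below it for every level `lam`.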
13.1.3 Doob's Inequality
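Recall the notation $\|X\|_p = (E[|X|^p])^{1/p}$.

Theorem 13.1.15 Let $\{M_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-martingale. For all $p > 1$,
$$\|M_n\|_p \le \Big\|\max_{0\le i\le n} |M_i|\Big\|_p \le q\, \|M_n\|_p, \tag{13.6}$$
where $q$ (the "conjugate" of $p$) is defined by $\frac{1}{p} + \frac{1}{q} = 1$.

Proof. The first inequality is trivial. For the second inequality, observe that for all non-negative random variables $X$, by Fubini's theorem,
$$E[X^p] = E\Big[\int_0^X p x^{p-1}\, dx\Big] = E\Big[\int_0^\infty p x^{p-1} 1_{\{x < X\}}\, dx\Big] = p \int_0^\infty x^{p-1} P(X > x)\, dx.$$
Therefore, applying this and Kolmogorov's inequality (13.4) to the submartingale $S_n = |M_n|$,
$$\begin{aligned}
E\Big[\max_{0\le i\le n} |M_i|^p\Big] &\le E\Big[\Big(\max_{0\le i\le n} |M_i|\Big)^p\Big] = p \int_0^\infty x^{p-1} P\Big(\max_{0\le i\le n} |M_i| > x\Big)\, dx \\
&\le p \int_0^\infty x^{p-2} E\Big[|M_n| 1_{\{\max_{0\le i\le n} |M_i| > x\}}\Big]\, dx \\
&= p\, E\Big[\int_0^\infty x^{p-2} |M_n| 1_{\{\max_{0\le i\le n} |M_i| > x\}}\, dx\Big] \\
&= p\, E\Big[|M_n| \int_0^{\max_{0\le i\le n} |M_i|} x^{p-2}\, dx\Big] \\
&= \frac{p}{p-1}\, E\Big[|M_n| \Big(\max_{0\le i\le n} |M_i|\Big)^{p-1}\Big] = q\, E\Big[|M_n| \Big(\max_{0\le i\le n} |M_i|\Big)^{p-1}\Big].
\end{aligned}$$
By Hölder's inequality, and observing that $(p-1)q = p$,
$$E\Big[|M_n| \Big(\max_{0\le i\le n} |M_i|\Big)^{p-1}\Big] \le E[|M_n|^p]^{1/p}\, E\Big[\Big(\max_{0\le i\le n} |M_i|\Big)^{(p-1)q}\Big]^{1/q} = \|M_n\|_p\, E\Big[\Big(\max_{0\le i\le n} |M_i|\Big)^{p}\Big]^{1/q}.$$
Therefore
$$E\Big[\max_{0\le i\le n} |M_i|^p\Big] \le q\, \|M_n\|_p\, E\Big[\Big(\max_{0\le i\le n} |M_i|\Big)^{p}\Big]^{1/q},$$
or (eliminating the trivial case where $E[\max_{0\le i\le n} |M_i|^p] = \infty$)
$$E\Big[\max_{0\le i\le n} |M_i|^p\Big]^{1 - \frac{1}{q}} \le q\, \|M_n\|_p,$$
that is, since $1 - \frac{1}{q} = \frac{1}{p}$,
$$\Big\|\max_{0\le i\le n} |M_i|\Big\|_p \le q\, \|M_n\|_p.$$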
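The two inequalities in (13.6) can be observed numerically on a small example. The sketch below again assumes the symmetric ±1 random walk started at 0 (an illustrative choice, with a hypothetical function name) and computes the three quantities of (13.6) by enumerating all paths; floating-point arithmetic is used since $p$ need not be an integer.

```python
from itertools import accumulate, product

def doob_norms(n, p):
    """Return (||M_n||_p, ||max_i |M_i| ||_p, q * ||M_n||_p) for the symmetric +/-1 walk."""
    q = p / (p - 1)                      # conjugate exponent, 1/p + 1/q = 1
    paths = 2 ** n                       # all paths are equally likely
    end_moment = max_moment = 0.0
    for steps in product((+1, -1), repeat=n):
        path = [0, *accumulate(steps)]
        end_moment += abs(path[-1]) ** p / paths
        max_moment += max(abs(m) for m in path) ** p / paths
    end_norm = end_moment ** (1 / p)
    max_norm = max_moment ** (1 / p)
    return end_norm, max_norm, q * end_norm
```

For instance `doob_norms(10, 2)` returns three numbers in non-decreasing order, the first being $\sqrt{10}$ and the last $2\sqrt{10}$.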
13.1.4 Hoeffding's Inequality
Theorem 13.1.16 Let $\{M_n\}_{n\ge 0}$ be a real $\mathcal{F}_n$-martingale such that, for some sequence $c_1, c_2, \ldots$ of real numbers,
$$P(|M_n - M_{n-1}| \le c_n) = 1 \qquad (n \ge 1).$$
Then, for all $x \ge 0$ and all $n \ge 1$,
$$P(|M_n - M_0| \ge x) \le 2 \exp\Big(-\frac{1}{2}\, x^2 \Big/ \sum_{i=1}^{n} c_i^2\Big). \tag{13.7}$$
Proof. By convexity of $z \mapsto e^{az}$, for $|z| \le 1$ and all $a \in \mathbb{R}$,
$$e^{az} \le \frac{1}{2}(1 - z)e^{-a} + \frac{1}{2}(1 + z)e^{+a}.$$
In particular, if $Z$ is a centered random variable such that $P(|Z| \le 1) = 1$,
$$E[e^{aZ}] \le \frac{1}{2}(1 - E[Z])e^{-a} + \frac{1}{2}(1 + E[Z])e^{+a} = \frac{1}{2}e^{-a} + \frac{1}{2}e^{+a} \le e^{a^2/2}.$$
By similar arguments, for all $a \in \mathbb{R}$,
$$E\Big[e^{a\frac{M_n - M_{n-1}}{c_n}} \,\Big|\, \mathcal{F}_{n-1}\Big] \le \frac{1}{2}\Big(1 - E\Big[\frac{M_n - M_{n-1}}{c_n} \,\Big|\, \mathcal{F}_{n-1}\Big]\Big)e^{-a} + \frac{1}{2}\Big(1 + E\Big[\frac{M_n - M_{n-1}}{c_n} \,\Big|\, \mathcal{F}_{n-1}\Big]\Big)e^{+a} \le e^{a^2/2}, \tag{13.8}$$
and, with $a$ replaced by $c_n a$,
$$E\big[e^{a(M_n - M_{n-1})} \mid \mathcal{F}_{n-1}\big] \le e^{a^2 c_n^2/2}.$$
Therefore,
$$E\big[e^{a(M_n - M_0)}\big] = E\big[e^{a(M_{n-1} - M_0)} e^{a(M_n - M_{n-1})}\big] = E\Big[e^{a(M_{n-1} - M_0)}\, E\big[e^{a(M_n - M_{n-1})} \mid \mathcal{F}_{n-1}\big]\Big] \le E\big[e^{a(M_{n-1} - M_0)}\big] \times e^{a^2 c_n^2/2},$$
and then by recurrence
$$E\big[e^{a(M_n - M_0)}\big] \le e^{\frac{1}{2} a^2 \sum_{i=1}^{n} c_i^2}.$$
In particular, with $a > 0$, by Markov's inequality,
$$P(M_n - M_0 \ge x) \le e^{-ax} E\big[e^{a(M_n - M_0)}\big] \le e^{-ax + \frac{1}{2} a^2 \sum_{i=1}^{n} c_i^2}.$$
Minimization of the right-hand side with respect to $a$ gives
$$P(M_n - M_0 \ge x) \le e^{-\frac{1}{2} x^2 / \sum_{i=1}^{n} c_i^2}.$$
The same argument with $M_0 - M_n$ instead of $M_n - M_0$ yields the bound
$$P(-(M_n - M_0) \ge x) \le e^{-\frac{1}{2} x^2 / \sum_{i=1}^{n} c_i^2}.$$
The announced bound then follows from these two bounds since for any random variable $X$, and all $x \in \mathbb{R}_+$, $P(|X| \ge x) = P(X \ge x) + P(X \le -x)$.

Example 13.1.17: The knapsack. There are $n$ objects, the $i$th has a volume $V_i$ and is worth $W_i$. All these non-negative random variables form an independent family, the $V_i$'s have finite means and the means of the $W_i$'s are bounded by $M < \infty$. You have to choose integers $z_1, \ldots, z_n$ in such a way that the total volume $\sum_{i=1}^{n} z_i V_i$ does not exceed a given storage capacity $c$ and that the total worth $\sum_{i=1}^{n} z_i W_i$ is maximized. Call this maximal worth $Z$. We shall see that
$$P(|Z - E[Z]| \ge x) \le 2 \exp\Big(\frac{-x^2}{2nM^2}\Big) \qquad (x \ge 0).$$
For this consider the variables $Z_j$ which are the equivalent of $Z$ when the $j$th object has been removed. Let now $M_j := E[Z \mid \mathcal{F}_j]$, where $\mathcal{F}_j := \sigma((V_k, W_k);\ 1 \le k \le j)$. Note that in view of the independence assumptions $E[Z_j \mid \mathcal{F}_j] = E[Z_{j-1} \mid \mathcal{F}_j]$. Clearly $Z_j \le Z \le Z_j + M$. Taking conditional expectations given $\mathcal{F}_j$ and then $\mathcal{F}_{j-1}$ in this last chain of inequalities reveals that $|M_j - M_{j-1}| \le M$. The rest is then just Hoeffding's inequality.
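To get a feeling for the bound (13.7), it can be compared with an exact tail probability. The sketch below assumes the symmetric ±1 random walk with $M_0 = 0$, for which $c_i = 1$ and $M_n = 2K - n$ with $K$ binomial $(n, \frac12)$; this choice of martingale and the function names are assumptions made for illustration.

```python
from math import comb, exp

def walk_tail(n, x):
    """Exact P(|M_n| >= x) for the symmetric +/-1 walk, using M_n = 2*K - n, K ~ Bin(n, 1/2)."""
    favourable = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= x)
    return favourable / 2 ** n

def hoeffding_bound(x, c):
    """Right-hand side of (13.7) for the increment bounds c = (c_1, ..., c_n)."""
    return 2 * exp(-0.5 * x * x / sum(ci * ci for ci in c))
```

For example, `walk_tail(50, 20)` can be compared with `hoeffding_bound(20, [1] * 50)`; the exact tail never exceeds the bound, although the bound is far from tight for small `x`.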
A General Framework of Application

Let $\mathcal{X}$ be a finite set, and let $f : \mathcal{X}^N \to \mathbb{R}$ be a given function. We introduce the notation $x = (x_1, \ldots, x_N)$ and $x_1^k = (x_1, \ldots, x_k)$. In particular, $x = x_1^N$. For $x \in \mathcal{X}^N$, $z \in \mathcal{X}$ and $1 \le k \le N$, let
$$f_k(x, z) := f(x_1, \ldots, x_{k-1}, z, x_{k+1}, \ldots, x_N).$$
The function $f$ is said to satisfy the Lipschitz condition with bound $c$ if for all $x \in \mathcal{X}^N$, all $z \in \mathcal{X}$ and all $1 \le k \le N$,
$$|f_k(x, z) - f(x)| \le c.$$
Let $X_1, X_2, \ldots, X_N$ be independent random variables with values in $\mathcal{X}$. Define the martingale
$$M_n = E[f(X) \mid X_1^n].$$
By the independence assumption, with obvious notations,
$$E[f(X) \mid X_1^n] = \sum_{x_{n+1}^N} f(X_1^{n-1}, X_n, x_{n+1}^N)\, P(X_{n+1}^N = x_{n+1}^N)$$
and
$$E[f(X) \mid X_1^{n-1}] = \sum_{x_n} \sum_{x_{n+1}^N} f(X_1^{n-1}, x_n, x_{n+1}^N)\, P(X_n = x_n)\, P(X_{n+1}^N = x_{n+1}^N).$$
Therefore
$$|M_n - M_{n-1}| \le \sum_{x_n} \sum_{x_{n+1}^N} \big|f(X_1^{n-1}, X_n, x_{n+1}^N) - f(X_1^{n-1}, x_n, x_{n+1}^N)\big|\, P(X_n = x_n)\, P(X_{n+1}^N = x_{n+1}^N) \le c.$$
Example 13.1.18: Pattern matching. Take $f(x)$ to be the number of occurrences of the fixed pattern $b = (b_1, \ldots, b_k)$ $(k \le N)$ in the sequence $x = (x_1, \ldots, x_N)$, that is,
$$f(x) = \sum_{i=1}^{N-k+1} 1_{\{x_i = b_1, \ldots, x_{i+k-1} = b_k\}}.$$
The mean number of matches in an iid sequence $X = (X_1, \ldots, X_N)$ with uniform distribution on $\mathcal{X}$ is therefore
$$E[f(X)] = \sum_{i=1}^{N-k+1} E\big[1_{\{X_i = b_1, \ldots, X_{i+k-1} = b_k\}}\big] = \sum_{i=1}^{N-k+1} \Big(\frac{1}{|\mathcal{X}|}\Big)^k,$$
that is,
$$E[f(X)] = (N - k + 1) \Big(\frac{1}{|\mathcal{X}|}\Big)^k.$$
The martingale $M_n := E[f(X) \mid X_1^n]$ is such that $M_0 = E[f(X)]$. Since changing the value of one coordinate of $x \in \mathcal{X}^N$ changes $f(x)$ by at most $k$, we can apply the bound of Theorem 13.1.16 with $c_i \equiv k$ to obtain the inequality
$$P(|f(X) - E[f(X)]| \ge \lambda) \le 2 e^{-\frac{1}{2} \frac{\lambda^2}{N k^2}}.$$
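The mean formula and the Lipschitz property used in Example 13.1.18 can be checked by brute force on a small alphabet. The following sketch uses hypothetical function names and a toy alphabet and pattern chosen for illustration; it enumerates all sequences in $\mathcal{X}^N$, so it is only practical for small $N$.

```python
from fractions import Fraction
from itertools import product
from math import exp

def count_matches(x, b):
    """f(x): number of (possibly overlapping) occurrences of the pattern b in x."""
    k = len(b)
    return sum(tuple(x[i:i + k]) == tuple(b) for i in range(len(x) - k + 1))

def mean_matches(alphabet, N, b):
    """Exact E[f(X)] for X uniform on alphabet^N, by full enumeration."""
    seqs = list(product(alphabet, repeat=N))
    return Fraction(sum(count_matches(x, b) for x in seqs), len(seqs))

def deviation_bound(lam, N, k):
    """Right-hand side of the concentration inequality of Example 13.1.18."""
    return 2 * exp(-0.5 * lam ** 2 / (N * k ** 2))
```

For the alphabet `"ab"`, $N = 6$ and the pattern `"aba"`, the enumeration returns $(N - k + 1)/|\mathcal{X}|^k = 4/8$, in agreement with the closed formula.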
Exposure Martingales in Erdős–Rényi Graphs

A random graph $G(n, p)$ (see Definition 1.3.44) with set of vertices $V_n$ of cardinality $n$ may be generated as follows. Enumerate the $N = \binom{n}{2}$ edges of the complete graph on $V_n$ from $i = 1$ to $i = N$. Generate a random vector $X = (X_1, \ldots, X_N)$ with independent and identically distributed variables with values in $\{0, 1\}$ and common distribution $P(X_i = 1) = p$. Then include edge $i$ in $G(n, p)$ if and only if $X_i = 1$. Any functional of $G(n, p)$ can always be written as $f(X)$. The edge exposure martingale corresponding to this functional is the $\mathcal{F}_n^X$-martingale defined by $M_0 = E[f(X)]$ and for $i \ge 1$,
$$M_i := E[f(X) \mid X_1^i].$$
Since the $X_i$'s are independent, the general method of the previous subsection can be applied.

Another type of martingale related to a $G(n, p)$ graph is useful. Here $V_n$ is identified with $\{1, 2, \ldots, n\}$. We denote similarly $\{1, 2, \ldots, i\}$ by $V_i$. For $1 \le i \le n$, define the graph $G_i$ to be the restriction of $G(n, p)$ to $V_i$. Any functional of $G(n, p)$ can always be written as $f(G)$, where $G := (G_1, \ldots, G_n)$. The vertex exposure martingale corresponding to this functional is the $G_1^i$-martingale defined by $M_0 = E[f(G)]$ and for $i \ge 1$,
$$M_i := E[f(G) \mid G_1^i].$$
Example 13.1.19: The chromatic number of an Erdős–Rényi graph. The chromatic number of a graph $G$ is the minimal number of colors needed to color the vertices in such a way that no adjacent vertices receive the same color. Call $f(G)$ the chromatic number of $G$. Since the difference between $f(G_1^{i-1}, G_i, g_{i+1}^n)$ and $f(G_1^{i-1}, g_i, g_{i+1}^n)$ for all $g_i$, $g_{i+1}^n$ is at most one, one can apply Hoeffding's inequality (with $c_i \equiv 1$ and $x = \lambda\sqrt{n}$) to obtain
$$P\big(|f(G) - E[f(G)]| \ge \lambda\sqrt{n}\big) \le 2 e^{-\frac{1}{2}\lambda^2}. \tag{13.9}$$
But the $G_i$'s are not independent! Nevertheless, the general method of the previous subsection can be applied modulo a slight change of point of view. Let $X_1$ be a constant, and for $2 \le i \le n$, let $X_i = \{X_{i,j},\ 1 \le j \le i - 1\}$ (recall the definition of $X_{u,v}$ in Definition 1.3.44). (Here the passage from subgraph $G_{i-1}$ to subgraph $G_i$ is represented by the "difference" $X_i$ between these two subgraphs.) Then $f(G)$ can be rewritten as $h(X) = h(X_1, \ldots, X_n)$ and the general method applies since the $X_i$'s are independent.
13.2 Martingales and Stopping Times

13.2.1 Doob's Optional Sampling Theorem
The ﬁrst pillar of martingale theory is the optional sampling theorem. It has many versions and that given next is the most elementary one, suﬃcient for the elementary examples to be considered now. More general results are given later in this subsection.
Theorem 13.2.1 Let $\{M_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-martingale, and let $\tau$ be an $\mathcal{F}_n$-stopping time (see Definition 6.2.20). Suppose that at least one of the following conditions holds:
(α) $P(\tau \le n_0) = 1$ for some $n_0 \ge 0$, or
(β) $P(\tau < \infty) = 1$ and $|M_n| \le K < \infty$ when $n \le \tau$.
Then
$$E[M_\tau] = E[M_0]. \tag{13.10}$$

Proof. (α) Just apply Theorem 13.1.11 (Formula (13.3) with $n = n_0$). (β) Apply the result of (α) to the $\mathcal{F}_n$-stopping time $\tau \wedge n_0$ to obtain
$$E[M_{\tau\wedge n_0}] = E[M_0].$$
But, by dominated convergence,
$$\lim_{n_0\uparrow\infty} E[M_{\tau\wedge n_0}] = E\big[\lim_{n_0\uparrow\infty} M_{\tau\wedge n_0}\big] = E[M_\tau].$$
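The limiting argument in part (β) can be followed numerically. The sketch below assumes the symmetric ±1 random walk started at 0 and the exit time $\tau$ of the interval $(-a, b)$, for which $|X_n| \le \max(a, b)$ when $n \le \tau$; the function name is hypothetical. The exact law of $X_{\tau\wedge n}$ is computed for increasing $n$: its mean stays at $E[X_0] = 0$, and since the mass left inside the interval vanishes, the probability of exiting at $-a$ must approach $b/(a+b)$.

```python
from fractions import Fraction

def exit_law(n, a, b):
    """Exact law of X_{tau ^ n}: symmetric +/-1 walk from 0, tau = first time in {-a, b}."""
    law = {0: Fraction(1)}
    for _ in range(n):
        new = {}
        for x, p in law.items():
            # absorbed states stay put, interior states move to a neighbour
            targets = (x,) if x in (-a, b) else (x - 1, x + 1)
            for y in targets:
                new[y] = new.get(y, 0) + p / len(targets)
        law = new
    return law
```

For instance, with `a = 2` and `b = 3`, `float(exit_law(200, 2, 3)[-2])` is numerically indistinguishable from `3 / 5`.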
Example 13.2.2: The ruin problem via martingales. The symmetric random walk $\{X_n\}_{n\ge 0}$ on $\mathbb{Z}$ with initial state 0 is an $\mathcal{F}_n^X$-martingale (Example 13.1.2). Let $\tau$ be the first time $n$ for which $X_n = -a$ or $+b$, where $a, b > 0$. This is an $\mathcal{F}_n^X$-stopping time and moreover $\tau < \infty$. Part (β) of the above result can be applied with $K = \sup(a, b)$ to obtain $0 = E[X_0] = E[X_\tau]$. Writing $v = P(-a \text{ is hit before } b)$, we have $E[X_\tau] = -av + b(1 - v)$, and therefore
$$v = \frac{b}{a + b}.$$

Example 13.2.3: A counterexample. Consider the symmetric random walk of the previous example, but now define $\tau$ to be the hitting time of $b > 0$, an almost surely finite time since the symmetric walk on $\mathbb{Z}$ is recurrent. If the optional sampling theorem applied, one would have $0 = E[X_0] = E[X_\tau] = b$, an obvious contradiction. Of course, neither condition (α) nor (β) is satisfied.

The following generalization of the elementary result given at the beginning of the present subsection will now be proved after the following theorem.

Theorem 13.2.4 Let $\{\mathcal{F}_n\}_{n\ge 0}$ be a history and let $\mathcal{F}_\infty := \sigma(\cup_{n\ge 0} \mathcal{F}_n)$. Let $\tau$ be an $\mathcal{F}_n$-stopping time. The collection of events
$$\mathcal{F}_\tau := \{A \in \mathcal{F}_\infty \mid A \cap \{\tau = n\} \in \mathcal{F}_n, \text{ for all } n \ge 1\}$$
is a σ-field, and $\tau$ is $\mathcal{F}_\tau$-measurable. Let $\{X_n\}_{n\ge 0}$ be an $E$-valued $\mathcal{F}_n$-adapted random sequence, and let $\tau$ be a finite $\mathcal{F}_n$-stopping time. Then $X(\tau)$ is $\mathcal{F}_\tau$-measurable.
The proof is left as an exercise. A more general result is given in Theorem 13.2.4. If $\{\mathcal{F}_n\}_{n\ge 0}$ is the internal history of some random sequence $\{X_n\}_{n\ge 0}$, that is, if $\mathcal{F}_n = \mathcal{F}_n^X$ $(n \ge 0)$, one may interpret $\mathcal{F}_\tau^X$ as the collection of events that are determined by the observation of the random sequence up to time $\tau$ (included). We are now ready for the statement and proof of Doob's optional sampling theorem.

Theorem 13.2.5 Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale (resp., martingale), and let $\tau_1, \tau_2$ be finite $\mathcal{F}_n$-stopping times such that $P(\tau_1 \le \tau_2) = 1$. If for $i = 1, 2$,
$$E[|Y_{\tau_i}|] < \infty \tag{13.11}$$
and
$$\liminf_{n\uparrow\infty} E\big[|Y_n| 1_{\{\tau_i > n\}}\big] = 0, \tag{13.12}$$
then, $P$-a.s.,
$$E[Y_{\tau_2} \mid \mathcal{F}_{\tau_1}] \ge Y_{\tau_1} \qquad (\text{resp., } = Y_{\tau_1}). \tag{13.13}$$

Remark 13.2.6 In particular,
$$E[Y_{\tau_2}] \ge E[Y_{\tau_1}] \qquad (\text{resp., } = E[Y_{\tau_1}]). \tag{13.14}$$
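More generally, if $\{\tau_n\}_{n\ge 1}$ is a non-decreasing sequence of finite $\mathcal{F}_n$-stopping times satisfying conditions (13.11) and (13.12), the sequence $\{Y_{\tau_n}\}_{n\ge 1}$ is an $\mathcal{F}_{\tau_n}$-submartingale (resp., martingale).

Proof. It suffices to give the proof for the submartingale case. The meaning of (13.13) is that, for all $A \in \mathcal{F}_{\tau_1}$, $E[1_A Y_{\tau_2}] \ge E[1_A Y_{\tau_1}]$. It is sufficient to show that for all $n \ge 0$,
$$E\big[1_{A\cap\{\tau_1 = n\}} Y_{\tau_2}\big] \ge E\big[1_{A\cap\{\tau_1 = n\}} Y_{\tau_1}\big],$$
or, equivalently since $\tau_1 = n$ implies $\tau_2 \ge n$,
$$E\big[1_{A\cap\{\tau_1 = n\}\cap\{\tau_2 \ge n\}} Y_{\tau_2}\big] \ge E\big[1_{A\cap\{\tau_1 = n\}\cap\{\tau_2 \ge n\}} Y_{\tau_1}\big] = E\big[1_{A\cap\{\tau_1 = n\}\cap\{\tau_2 \ge n\}} Y_n\big].$$
Write this as
$$E\big[1_{B\cap\{\tau_2 \ge n\}} Y_{\tau_2}\big] \ge E\big[1_{B\cap\{\tau_2 \ge n\}} Y_n\big], \tag{$\star$}$$
where $B := A \cap \{\tau_1 = n\}$. By definition of $\mathcal{F}_{\tau_1}$, $B \in \mathcal{F}_n$. It is therefore sufficient to show that for all $n \ge 0$, all $B \in \mathcal{F}_n$, $(\star)$ holds. We have
$$\begin{aligned}
E\big[1_{B\cap\{\tau_2 \ge n\}} Y_n\big] &= E\big[1_{B\cap\{\tau_2 = n\}} Y_n\big] + E\big[1_{B\cap\{\tau_2 \ge n+1\}} Y_n\big] \\
&\le E\big[1_{B\cap\{\tau_2 = n\}} Y_n\big] + E\big[1_{B\cap\{\tau_2 \ge n+1\}} E[Y_{n+1} \mid \mathcal{F}_n]\big] \\
&= E\big[1_{B\cap\{\tau_2 = n\}} Y_{\tau_2}\big] + E\big[1_{B\cap\{\tau_2 \ge n+1\}} Y_{n+1}\big] \\
&\le E\big[1_{B\cap\{n \le \tau_2 \le n+1\}} Y_{\tau_2}\big] + E\big[1_{B\cap\{\tau_2 \ge n+2\}} Y_{n+2}\big] \\
&\;\;\vdots \\
&\le E\big[1_{B\cap\{n \le \tau_2 \le m\}} Y_{\tau_2}\big] + E\big[1_{B\cap\{\tau_2 > m\}} Y_m\big],
\end{aligned}$$
that is,
$$E\big[1_{B\cap\{n \le \tau_2 \le m\}} Y_{\tau_2}\big] \ge E\big[1_{B\cap\{\tau_2 \ge n\}} Y_n\big] - E\big[1_{B\cap\{\tau_2 > m\}} Y_m\big]$$
for all $m \ge n$. Therefore, by dominated convergence and hypothesis (13.12),
$$E\big[1_{B\cap\{\tau_2 \ge n\}} Y_{\tau_2}\big] = E\big[\lim_{m\uparrow\infty} 1_{B\cap\{n \le \tau_2 \le m\}} Y_{\tau_2}\big] \ge E\big[1_{B\cap\{\tau_2 \ge n\}} Y_n\big] - \liminf_{m\uparrow\infty} E\big[1_{B\cap\{\tau_2 > m\}} Y_m\big] = E\big[1_{B\cap\{\tau_2 \ge n\}} Y_n\big].$$

Corollary 13.2.7 Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale (resp., martingale). Let $\tau_1, \tau_2$ be $\mathcal{F}_n$-stopping times such that $\tau_1 \le \tau_2 \le N$ a.s., for some constant $N < \infty$. Then (13.14) holds.

Proof. This is an immediate consequence of Theorem 13.2.5.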
Corollary 13.2.8 Let $\{Y_n\}_{n\ge 0}$ be a uniformly integrable $\mathcal{F}_n$-submartingale (resp., martingale). Let $\tau_1, \tau_2$ be finite $\mathcal{F}_n$-stopping times. Then (13.13) holds.

Proof. In order to apply Theorem 13.2.5, we have to show that conditions (13.11) and (13.12) are satisfied when $\{Y_n\}_{n\ge 1}$ is uniformly integrable. Condition (13.12) follows from part (b) of Theorem 4.2.12 since the $\tau_i$'s are finite and therefore $P(\tau_i > n) \to 0$ as $n \uparrow \infty$. It remains to show that condition (13.11) is satisfied. Let $N < \infty$ be an integer. By Corollary 13.2.7, if $\tau$ is a stopping time (here $\tau_1$ or $\tau_2$), $E[Y_0] \le E[Y_{\tau\wedge N}]$ and therefore
$$E[|Y_{\tau\wedge N}|] = 2E[Y_{\tau\wedge N}^+] - E[Y_{\tau\wedge N}] \le 2E[Y_{\tau\wedge N}^+] - E[Y_0].$$
The submartingale $\{Y_n^+\}_{n\ge 0}$ satisfies
$$E[Y_{\tau\wedge N}^+] = \sum_{j=0}^{N} E\big[1_{\{\tau = j\}} Y_j^+\big] + E\big[1_{\{\tau > N\}} Y_N^+\big] \le \sum_{j=0}^{N} E\big[1_{\{\tau = j\}} Y_N^+\big] + E\big[1_{\{\tau > N\}} Y_N^+\big] = E[Y_N^+] \le E[|Y_N|].$$
Therefore
$$E[|Y_{\tau\wedge N}|] \le 2E[|Y_N|] + E[|Y_0|] \le 3 \sup_N E|Y_N|.$$
Since by Fatou's lemma $E[|Y_\tau|] \le \liminf_{N\uparrow\infty} E[|Y_{\tau\wedge N}|]$, we have
$$E[|Y_\tau|] \le 3 \sup_N E[|Y_N|],$$
a finite quantity since $\{Y_n\}_{n\ge 1}$ is uniformly integrable.
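Corollary 13.2.9 Let $\{Y_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale (resp., martingale) and let $\tau$ be an $\mathcal{F}_n$-stopping time such that $E[\tau] < \infty$. Suppose moreover that there exists a constant $c < \infty$ such that, for all $n \ge 0$,
$$E[|Y_{n+1} - Y_n| \mid \mathcal{F}_n] \le c, \qquad P\text{-a.s. on } \{\tau \ge n\}.$$
Then $E[|Y_\tau|] < \infty$ and
$$E[Y_\tau] \ge (\text{resp., } =)\ E[Y_0].$$

Proof. In order to apply Theorem 13.2.5 with $\tau_1 = 0$, $\tau_2 = \tau$, one just has to check conditions (13.11) and (13.12) for $\tau$. Let $Z_0 := |Y_0|$. With $Z_n := |Y_n - Y_{n-1}|$ $(n \ge 1)$,
$$E\Big[\sum_{j=0}^{\tau} Z_j\Big] = \sum_{n=0}^{\infty} E\Big[1_{\{\tau = n\}} \sum_{j=0}^{n} Z_j\Big] = \sum_{n=0}^{\infty} \sum_{j=0}^{n} E\big[1_{\{\tau = n\}} Z_j\big] = \sum_{j=0}^{\infty} \sum_{n=j}^{\infty} E\big[1_{\{\tau = n\}} Z_j\big] = \sum_{j=0}^{\infty} E\big[1_{\{\tau \ge j\}} Z_j\big].$$
For $j \ge 1$, $\{\tau \ge j\} = \{\tau \le j - 1\}^c \in \mathcal{F}_{j-1}$ and therefore
$$E\big[1_{\{\tau \ge j\}} Z_j\big] = E\big[1_{\{\tau \ge j\}} E[Z_j \mid \mathcal{F}_{j-1}]\big] \le c\, P(\tau \ge j),$$
and
$$E\Big[\sum_{j=0}^{\tau} Z_j\Big] \le E[|Y_0|] + c \sum_{j=1}^{\infty} P(\tau \ge j) = E[|Y_0|] + c\, E[\tau] < \infty. \tag{$\star$}$$
Therefore condition (13.11) is satisfied since $E[|Y_\tau|] \le E\big[\sum_{j=0}^{\tau} Z_j\big]$. Moreover, if $\tau > n$,
$$\sum_{j=0}^{n} Z_j \le \sum_{j=0}^{\tau} Z_j$$
and therefore
$$E\big[1_{\{\tau > n\}} |Y_n|\big] \le E\Big[1_{\{\tau > n\}} \sum_{j=0}^{\tau} Z_j\Big].$$
But, by $(\star)$, $E\big[\sum_{j=0}^{\tau} Z_j\big] < \infty$. Also, $\{\tau > n\} \downarrow \emptyset$ as $n \uparrow \infty$. Therefore, by dominated convergence,
$$\liminf_{n\uparrow\infty} E\big[1_{\{\tau > n\}} |Y_n|\big] \le \liminf_{n\uparrow\infty} E\Big[1_{\{\tau > n\}} \sum_{j=0}^{\tau} Z_j\Big] = 0.$$
This is condition (13.12).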
13.2.2 Wald's Formulas
Wald's Mean Formula

Theorem 13.2.10 Let $\{Z_n\}_{n\ge 1}$ be an iid sequence of real random variables such that $E[|Z_1|] < \infty$, and let $\tau$ be an $\mathcal{F}_n^Z$-stopping time with $E[\tau] < \infty$. Then
$$E\Big[\sum_{n=1}^{\tau} Z_n\Big] = E[Z_1]\, E[\tau]. \tag{13.15}$$
If, moreover, $E[Z_1^2] < \infty$,
$$\mathrm{Var}\Big(\sum_{n=1}^{\tau} Z_n\Big) = \mathrm{Var}(Z_1)\, E[\tau]. \tag{13.16}$$

Proof. Let $X_0 := 0$, $X_n := (Z_1 + \cdots + Z_n) - nE[Z_1]$ $(n \ge 1)$. Then $\{X_n\}_{n\ge 1}$ is an $\mathcal{F}_n^Z$-martingale such that
$$E[|X_{n+1} - X_n| \mid \mathcal{F}_n^Z] = E[|Z_{n+1} - E[Z_1]| \mid \mathcal{F}_n^Z] = E\big[|Z_{n+1} - E[Z_1]|\big] \le 2E[|Z_1|] < \infty.$$
Therefore Corollary 13.2.9 can be applied with $Y_n = \sum_{k=1}^{n} (Z_k - E[Z_1])$ to obtain (13.15). For the proof of (13.16), the same kind of argument works, this time with the martingale $Y_n = X_n^2 - n\, \mathrm{Var}(Z_1)$.
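Wald's mean formula (13.15) can be checked exactly on a toy case. In the sketch below, the law of $Z_1$ (uniform on $\{1, 2, 3\}$) and the stopping time $\tau = \inf\{n \ge 1;\ Z_1 + \cdots + Z_n \ge \ell\}$ are assumptions chosen for illustration, and the names are hypothetical. Both $E[\tau]$ and $E[\sum_{n=1}^{\tau} Z_n]$ are obtained by a first-step recursion on the current partial sum, using exact rational arithmetic.

```python
from fractions import Fraction
from functools import lru_cache

VALUES = (1, 2, 3)   # hypothetical law of Z_1: uniform on {1, 2, 3}

def wald_sides(level):
    """Return (E[S_tau], E[Z_1] * E[tau]) for tau = first n with S_n >= level."""

    @lru_cache(maxsize=None)
    def expected_tau(s):
        # expected number of additional steps when the current sum is s
        if s >= level:
            return Fraction(0)
        return 1 + sum(expected_tau(s + z) for z in VALUES) / len(VALUES)

    @lru_cache(maxsize=None)
    def expected_sum(s):
        # expected final value S_tau when the current sum is s
        if s >= level:
            return Fraction(s)
        return sum(expected_sum(s + z) for z in VALUES) / len(VALUES)

    mean_z = Fraction(sum(VALUES), len(VALUES))
    return expected_sum(0), mean_z * expected_tau(0)
```

For every threshold `level`, the two returned fractions coincide, as (13.15) predicts for this integrable stopping time.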
Wald's Exponential Formula

Theorem 13.2.11 Let $\{Z_n\}_{n\ge 1}$ be iid real random variables and let $S_n = Z_1 + \cdots + Z_n$. Let $\varphi_Z(t) := E[e^{tZ_1}]$ and suppose that $\varphi_Z(t_0)$ exists and is greater than or equal to 1 for some $t_0 \ne 0$. Let $\tau$ be an $\mathcal{F}_n^Z$-stopping time such that $E[\tau] < \infty$ and $|S_n| \le c$ on $\{\tau \ge n\}$ for some constant $c < \infty$. Then
$$E\Big[\frac{e^{t_0 S_\tau}}{\varphi_Z(t_0)^\tau}\Big] = 1. \tag{13.17}$$

Proof. Let $Y_0 := 1$ and for $n \ge 1$,
$$Y_n := \frac{e^{t_0 S_n}}{\varphi_Z(t_0)^n}.$$
By application of the result of Example 13.1.3 with $X_i := \frac{e^{t_0 Z_i}}{\varphi_Z(t_0)}$, we have that the sequence $\{Y_n\}_{n\ge 0}$ is an $\mathcal{F}_n^Z$-martingale. Moreover, on $\{\tau \ge n\}$,
$$E[|Y_{n+1} - Y_n| \mid \mathcal{F}_n^Z] = Y_n\, E\Big[\Big|\frac{e^{t_0 Z_{n+1}}}{\varphi_Z(t_0)} - 1\Big| \,\Big|\, \mathcal{F}_n^Z\Big] = Y_n\, \frac{E\big[|e^{t_0 Z_1} - \varphi_Z(t_0)|\big]}{\varphi_Z(t_0)} \le K < \infty,$$
since $\varphi_Z(t_0) \ge 1$ and
$$Y_n = \frac{e^{t_0 S_n}}{\varphi_Z(t_0)^n} \le \frac{e^{|t_0| c}}{\varphi_Z(t_0)^n} \le e^{|t_0| c}.$$
Therefore, Corollary 13.2.9 applies to give (13.17).
13.2.3 The Maximum Principle

The general approach to the absorption problem for hmcs of this subsection is in terms of harmonic functions. However, its implementation requires explicit forms of harmonic functions satisfying some boundary conditions, and this is not always easy. In contrast, the purely algebraic method given in the chapter on Markov chains can always be implemented in the finite state space case (of course at the cost of matrix computations). Let $\{X_n\}_{n\ge 0}$ be an hmc with countable state space $E$ and transition matrix $\mathbf{P}$. Let $D$ be a subset of $E$, called the domain, and let $\overline{D} := E \setminus D$. Let $c : D \to \mathbb{R}$ and $\varphi : \overline{D} \to \mathbb{R}$ be non-negative functions called the unit time gain function and the final gain function, respectively. Let $\tau$ be the hitting time of $\overline{D}$. For each state $i \in E$, define
⎡
v(i) = Ei ⎣
⎤
c(Xk ) + ϕ(Xτ )1{τ τ2k ; Sn = 0} τ2k = inf{n > τ2k+1 ; Sn ≥ b} ··· For i ≥ 1, let ϕi = 1 if τm < i ≤ τm+1 for some odd m = 0 if τm < i ≤ τm+1 for some even m. 1 By deﬁnition, an upcrossing occurs at time if Sk ≤ a and if there exists > k such that Sj < b for j = 1, . . . , − 1 and S ≥ b.
13.3. CONVERGENCE OF MARTINGALES
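Observe that
$$\{\varphi_i = 1\} = \bigcup_{\text{odd } m} \Big( \{\tau_m < i\} \cap \{\tau_{m+1} < i\}^c \Big) \in \mathcal{F}_{i-1}$$
and that
$$b\,\nu_n \le \sum_{i=1}^n \varphi_i (S_i - S_{i-1}) .$$
Therefore
$$\begin{aligned}
b\,E[\nu_n] &\le E\Big[\sum_{i=1}^n \varphi_i (S_i - S_{i-1})\Big] = \sum_{i=1}^n E[\varphi_i (S_i - S_{i-1})] \\
&= \sum_{i=1}^n E\big[\varphi_i\, E[(S_i - S_{i-1}) \mid \mathcal{F}_{i-1}]\big] = \sum_{i=1}^n E\big[\varphi_i \,(E[S_i \mid \mathcal{F}_{i-1}] - S_{i-1})\big] \\
&\le \sum_{i=1}^n E\big[E[S_i \mid \mathcal{F}_{i-1}] - S_{i-1}\big] \le \sum_{i=1}^n (E[S_i] - E[S_{i-1}]) = E[S_n - S_0] .
\end{aligned}$$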
Theorem 13.3.2 Let $\{S_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale. Suppose moreover that it is $L^1$-bounded, that is,
$$\sup_{n\ge 0} E[|S_n|] < \infty . \qquad (13.27)$$
Then $\{S_n\}_{n\ge 0}$ converges $P$-a.s. to an integrable random variable $S_\infty$.

Remark 13.3.3 Condition (13.27) can be replaced by the equivalent condition
$$\sup_{n\ge 0} E[S_n^+] < \infty .$$
Indeed, if $\{S_n\}_{n\ge 0}$ is an $\mathcal{F}_n$-submartingale,
$$E[S_n^+] \le E[|S_n|] \le 2E[S_n^+] - E[S_n] \le 2E[S_n^+] - E[S_0] .$$

Remark 13.3.4 By changing signs, the same hypothesis leads to the same conclusion for a supermartingale $\{S_n\}_{n\ge 0}$. Similarly to the previous remark, condition (13.27) can be replaced by the equivalent condition
$$\sup_{n\ge 0} E[S_n^-] < \infty .$$
Proof. The proof is based on the following observation concerning any deterministic sequence $\{x_n\}_{n\ge 1}$. If this sequence does not converge, then it is possible to find two rational numbers $a$ and $b$ such that
$$\liminf_n x_n < a < b < \limsup_n x_n ,$$
which implies that the number of upcrossings of $[a, b]$ by this sequence is infinite. Therefore, to prove convergence, it suffices to prove that any interval $[a, b]$ with rational extremities is crossed at most a finite number of times.
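The counting of upcrossings used in this argument can be made concrete for a finite sequence. The following is a minimal sketch; the function name `count_upcrossings` is hypothetical and the convention (a passage from a value $\le a$ to a later value $\ge b$ counts as one upcrossing) follows the definition recalled above.

```python
def count_upcrossings(xs, a, b):
    """Number of completed upcrossings of [a, b] by the finite sequence xs.

    An upcrossing is completed each time the sequence, after having taken
    a value <= a, takes a value >= b.
    """
    if not a < b:
        raise ValueError("need a < b")
    count = 0
    below = False  # True once a value <= a has been seen since the last upcrossing
    for x in xs:
        if not below:
            if x <= a:
                below = True
        elif x >= b:
            count += 1
            below = False
    return count


# A sequence that oscillates forever crosses [-0.5, 0.5] again and again,
# so the count grows with the length of the observed prefix.
oscillating = [(-1) ** n for n in range(10)]
print(count_upcrossings(oscillating, -0.5, 0.5))
```

A convergent sequence, in contrast, produces a count that stays bounded as the prefix grows, which is the content of the observation.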
CHAPTER 13. MARTINGALES
Let $\nu_n([a, b])$ be the number of upcrossings of an interval $[a, b]$ prior ($\le$) to time $n$ and let $\nu_\infty([a, b]) := \lim_{n\uparrow\infty} \nu_n([a, b])$. By (13.25),
$$(b - a)E[\nu_n([a, b])] \le E[(S_n - a)^+] \le E[S_n^+] + |a| \le \sup_{k\ge 0} E[S_k^+] + |a| \le \sup_{k\ge 0} E[|S_k|] + |a| < \infty .$$
Therefore, letting $n \uparrow \infty$,
$$(b - a)E[\nu_\infty([a, b])] < \infty .$$
In particular, $\nu_\infty([a, b]) < \infty$, $P$-a.s. Therefore, $P$-a.s., there is only a finite number of upcrossings of any rational interval $[a, b]$. Equivalently, in view of the observation made in the first lines of the proof, $\{S_n\}_{n\ge 0}$ converges $P$-a.s. to some random variable $S_\infty$. Finally, using Fatou's lemma for the inequality,
$$E[|S_\infty|] = E\big[\lim_{n\uparrow\infty} |S_n|\big] \le \liminf_{n\uparrow\infty} E|S_n| \le \sup_{n\ge 0} E|S_n| < \infty .$$
Corollary 13.3.5 (a) Any nonpositive submartingale $\{S_n\}_{n\ge 0}$ almost surely converges to an integrable random variable. (b) Any nonnegative supermartingale almost surely converges to an integrable random variable.

Proof. (b) follows from (a) by changing signs. For (a), we have
$$E[|S_n|] = -E[S_n] \le -E[S_0] = E[|S_0|] < \infty .$$
Therefore (13.27) is satisfied and the conclusion then follows from Theorem 13.3.2.
An immediate application of the martingale convergence theorem is to gambling. The next example teaches us that a gambler in a “fair game” is eventually ruined. Example 13.3.6: Fair game not so fair. Consider the situation in Example 13.1.4, assuming that the initial fortune a is a positive integer and that the bets are also positive integers (that is, the functions bn+1 (X0n ) ∈ N+ except if Yn = 0, in which case the gambler is not allowed to bet anymore, or equivalently bn (X0n−1 ∗0) := bn (X0 , X1 , . . . , Xn , 0) = 0). In particular, Yn ≥ 0 for all n ≥ 0. Therefore the process {Yn }n≥0 is a nonnegative FnX martingale and by the martingale convergence theorem it almost surely has a ﬁnite limit. Since the bets are assumed positive integers when the fortune of the player is positive, this limit cannot be other than 0. Since Yn is a nonnegative integer for all n ≥ 0, this can happen only if the fortune of the gambler becomes null in ﬁnite time.
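The ruin phenomenon is easy to observe by simulation. The sketch below makes assumptions that are not in the example: the gambler always bets one unit on a fair coin, the horizon is truncated at `max_steps`, and the function name and parameters are hypothetical. Ruin in finite time is certain, but it can take very long, so the observed fraction of ruined gamblers approaches 1 only slowly as the horizon grows.

```python
import random


def ruin_fraction(a=3, max_steps=10_000, runs=500, seed=1):
    """Fraction of simulated gamblers ruined within max_steps unit bets.

    Assumed strategy (illustrative): bet exactly one unit at each step of a
    fair game, starting from the integer fortune a, and stop at fortune 0.
    """
    rng = random.Random(seed)
    ruined = 0
    for _ in range(runs):
        fortune = a
        for _ in range(max_steps):
            fortune += 1 if rng.random() < 0.5 else -1
            if fortune == 0:
                ruined += 1
                break
    return ruined / runs


print(ruin_fraction())  # a large fraction of the gamblers is already ruined
```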
Example 13.3.7: Branching processes via martingales. The power of the concept of martingale will now be illustrated by revisiting the branching process. It is
assumed that $P(Z = 0) < 1$ and $P(Z \ge 2) > 0$ (to get rid of trivialities). The stochastic process
$$Y_n = \frac{X_n}{m^n} ,$$
where $m$ is the average number of sons of a given individual, is an $\mathcal{F}_n^X$-martingale. Indeed, since each one among the $X_n$ members of the $n$th generation gives birth on average to $m$ sons and does this independently of the rest of the population, $E[X_{n+1} \mid X_n] = mX_n$ and
$$E\left[\frac{X_{n+1}}{m^{n+1}} \,\Big|\, \mathcal{F}_n^X\right] = E\left[\frac{X_{n+1}}{m^{n+1}} \,\Big|\, X_n\right] = \frac{X_n}{m^n} .$$
By the martingale convergence theorem, almost surely
$$\lim_{n\uparrow\infty} \frac{X_n}{m^n} = Y < \infty .$$
In particular, if $m < 1$, then $\lim_{n\uparrow\infty} X_n = 0$ almost surely. Since $X_n$ takes integer values, this implies that the branching process eventually becomes extinct. If $m = 1$, then $\lim_{n\uparrow\infty} X_n = X_\infty < \infty$ and it is easily argued that this limit must be 0. Therefore, in this case as well, the process eventually becomes extinct.

For the case $m > 1$, we consider the unique solution in $(0, 1)$ of $x = g(x)$ ($g$ is the generating function of the typical progeny of a member of the population considered). Suppose we can show that $Z_n = x^{X_n}$ is a martingale. Then, by the martingale convergence theorem, $Z_n$ converges to a finite limit and therefore $X_n$ has a limit $X_\infty$, which however can be infinite. One can easily argue that this limit cannot be other than 0 (extinction) or $\infty$ (nonextinction). Since $\{Z_n\}_{n\ge 0}$ is a martingale, $x = E[Z_0] = E[Z_n]$ and therefore, by dominated convergence, $x = E[Z_\infty] = E[x^{X_\infty}] = P(X_\infty = 0)$. Therefore $x$ is the probability of extinction.

It remains to show that $\{Z_n\}_{n\ge 0}$ is an $\mathcal{F}_n^X$-martingale. For all $i \in \mathbb{N}$, $E[x^{X_{n+1}} \mid X_n = i] = x^i$. This is obvious if $i = 0$. If $i > 0$, $X_{n+1}$ is the sum of $i$ independent random variables with the same generating function $g$, and therefore $E[x^{X_{n+1}} \mid X_n = i] = g(x)^i = x^i$. From this last result and the Markov property,
$$E[x^{X_{n+1}} \mid \mathcal{F}_n^X] = E[x^{X_{n+1}} \mid X_n] = x^{X_n} .$$
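The identification of the extinction probability with the fixed point of $g$ in $(0, 1)$ can be illustrated on a small example. The offspring law below is an assumption chosen for convenience (it is not in the text): $P(Z=0) = P(Z=1) = 1/4$ and $P(Z=2) = 1/2$, so that $m = 1.25 > 1$ and $x = g(x)$ has the solutions $1/2$ and $1$. The simulation starts from one ancestor and declares survival once the population reaches a cap, which is an approximation.

```python
import random

# Assumed offspring law (illustrative): (number of sons, probability)
OFFSPRING = [(0, 0.25), (1, 0.25), (2, 0.5)]


def g(x):
    """Generating function of the assumed offspring law."""
    return sum(p * x ** k for k, p in OFFSPRING)


def smallest_fixed_point(iterations=200):
    """Iterate x -> g(x) from 0; the limit is the smallest root of x = g(x)."""
    x = 0.0
    for _ in range(iterations):
        x = g(x)
    return x


def draw_offspring(rng):
    """Sample one individual's number of sons from OFFSPRING."""
    u = rng.random()
    acc = 0.0
    for k, p in OFFSPRING:
        acc += p
        if u < acc:
            return k
    return OFFSPRING[-1][0]


def simulate_extinction(runs=4000, cap=200, seed=2):
    """Fraction of runs that die out before the population reaches cap."""
    rng = random.Random(seed)
    extinct = 0
    for _ in range(runs):
        population = 1
        while 0 < population < cap:
            population = sum(draw_offspring(rng) for _ in range(population))
        if population == 0:
            extinct += 1
    return extinct / runs


print(smallest_fixed_point(), simulate_extinction())
```

Both numbers should be close to $1/2$ for this law, the first one exactly up to rounding and the second one up to Monte Carlo error.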
Theorem 13.3.8 Let $\{M_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-martingale such that for some $p \in (1, \infty)$,
$$\sup_{n\ge 0} E|M_n|^p < \infty . \qquad (13.28)$$
Then $\{M_n\}_{n\ge 0}$ converges a.s. and in $L^p$ to some finite variable $M_\infty$.

Proof. By hypothesis, the martingale $\{M_n\}_{n\ge 0}$ is $L^p$-bounded and a fortiori $L^1$-bounded since $p > 1$. Therefore it converges almost surely. By Doob's inequality, $E[\max_{0\le i\le n} |M_i|^p] \le q^p E|M_n|^p$ and in particular,
$$E\big[\max_{0\le i\le n} |M_i|^p\big] \le q^p \sup_k E|M_k|^p < \infty .$$
Letting $n \uparrow \infty$, we have in view of condition (13.28) that
$$E\big[\sup_{n\ge 0} |M_n|^p\big] < \infty . \qquad (13.29)$$
Therefore $\{|M_n|^p\}_{n\ge 0}$ is uniformly integrable (Theorem 4.2.14). In particular, since it converges almost surely, it also converges in $L^1$ (Theorem 4.2.16). In other words, $\{M_n\}_{n\ge 0}$ converges in $L^p$.

Remark 13.3.9 The above result was proved for $p > 1$ (the proof depended on Doob's inequality, which is true for $p > 1$). For $p = 1$, a similar result holds with an additional assumption of uniform integrability. Note however that the next result also applies to submartingales.

Theorem 13.3.10 A uniformly integrable $\mathcal{F}_n$-submartingale $\{S_n\}_{n\ge 0}$ converges a.s. and in $L^1$ to an integrable random variable $S_\infty$ and $E[S_\infty \mid \mathcal{F}_n] \ge S_n$.

Proof. By the uniform integrability hypothesis, $\sup_n E[|S_n|] < \infty$ and therefore, by Theorem 13.3.2, $S_n$ converges almost surely to some integrable random variable $S_\infty$. It also converges to this variable in $L^1$ since a uniformly integrable sequence that converges almost surely also converges in $L^1$ (Theorem 4.2.16). By the submartingale property, for all $A \in \mathcal{F}_n$, all $m \ge n$,
$$E[1_A S_n] \le E[1_A S_m] .$$
Since convergence is in $L^1$,
$$\lim_{m\uparrow\infty} E[1_A S_m] = E[1_A S_\infty] ,$$
so that finally $E[1_A S_n] \le E[1_A S_\infty]$. This being true for all $A \in \mathcal{F}_n$, we have that $E[S_\infty \mid \mathcal{F}_n] \ge S_n$.

The following result is Lévy's continuity theorem for conditional expectations.

Corollary 13.3.11 Let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration and let $\xi$ be an integrable random variable. Let $\mathcal{F}_\infty := \sigma(\cup_{n\ge 1} \mathcal{F}_n)$. Then
$$\lim_{n\uparrow\infty} E[\xi \mid \mathcal{F}_n] = E[\xi \mid \mathcal{F}_\infty] . \qquad (13.30)$$
Proof. It suffices to treat the case where $\xi$ is nonnegative. The sequence $\{M_n = E[\xi \mid \mathcal{F}_n]\}_{n\ge 1}$ is a uniformly integrable $\mathcal{F}_n$-martingale (Theorem 4.2.13) and by Theorem 13.3.10, it converges almost surely and in $L^1$ to some integrable random variable $M_\infty$. We have to show that $M_\infty = E[\xi \mid \mathcal{F}_\infty]$. For $m \ge n$ and $A \in \mathcal{F}_n$,
$$E[1_A M_m] = E[1_A M_n] = E[1_A E[\xi \mid \mathcal{F}_n]] = E[1_A \xi] .$$
Since convergence is also in $L^1$, $\lim_{m\uparrow\infty} E[1_A M_m] = E[1_A M_\infty]$. Therefore
$$E[1_A M_\infty] = E[1_A \xi] \qquad (13.31)$$
for all $A \in \mathcal{F}_n$ and therefore for all $A \in \cup_n \mathcal{F}_n$. The $\sigma$-finite measures $A \mapsto E[1_A M_\infty]$ and $A \mapsto E[1_A \xi]$ agreeing on the algebra $\cup_n \mathcal{F}_n$ also agree on the smallest $\sigma$-algebra containing it, that is, $\mathcal{F}_\infty$. Therefore (13.31) holds for all $A \in \mathcal{F}_\infty$ (Theorem 2.1.50) and this implies
$$E[1_A M_\infty] = E[1_A E[\xi \mid \mathcal{F}_\infty]] ,$$
and finally, since $M_\infty$ is $\mathcal{F}_\infty$-measurable, $M_\infty = E[\xi \mid \mathcal{F}_\infty]$.
Kakutani's Theorem

Let $\{X_n\}_{n\ge 1}$ be an independent sequence of nonnegative random variables with mean 1. Let $M_0 := 1$ and let
$$M_n := \prod_{i=1}^n X_i \qquad (n \ge 1) .$$
Then (Example 13.1.3) $\{M_n\}_{n\ge 0}$ is a nonnegative martingale and (Corollary 13.3.5) it converges almost surely to a finite random variable $M_\infty$. Let $a_n := E[X_n^{1/2}]$. (Note that $\prod_{n=1}^\infty a_n \le 1$.)

Theorem 13.3.12 The following conditions are equivalent:
(i) $\prod_{n=1}^\infty a_n > 0$.
(ii) $E[M_\infty] = 1$.
(iii) $M_n \to M_\infty$ in $L^1$.
(iv) $\{M_n\}_{n\ge 0}$ is uniformly integrable.

Proof. Note first that (iv) implies (iii) (since $M_n \to M_\infty$ a.s. and by Theorem 13.3.10), which in turn implies (ii) since $1 = E[M_n] \to E[M_\infty]$. The announced equivalences will be proved if one can show that (i) implies (iv) and that (ii) implies (i).

A. (i) implies (iv): let $m_0 := 1$ and
$$m_n := \frac{\left(\prod_{i=1}^n X_i\right)^{1/2}}{\prod_{i=1}^n a_i} \qquad (n \ge 1) .$$
This is a martingale and an $L^2$-bounded one since
$$E[m_n^2] = \frac{1}{(a_1 \cdots a_n)^2} \le \frac{1}{\left(\prod_{n=1}^\infty a_n\right)^2} < \infty .$$
By Doob's inequality (Theorem 13.1.15) for $p = 2$,
$$E\big[\sup_n |m_n|^2\big] \le 4 \sup_n E[m_n^2] < \infty .$$
Also, since $\prod_{n=1}^\infty a_n \le 1$, $M_n \le m_n^2$ and in particular
$$E\big[\sup_n M_n\big] \le E\big[\sup_n |m_n|^2\big] < \infty .$$
Therefore $M_n$ is uniformly dominated by the integrable random variable $\sup_n M_n$, which implies that it is uniformly integrable (Example 4.2.10).

B. (ii) implies (i) or, equivalently, if (i) is not true, then (ii) is not true. Therefore suppose, in view of contradiction, that $\prod_{n=1}^\infty a_n = 0$. Being a nonnegative martingale, $\{m_n\}_{n\ge 0}$ converges to a finite limit. Since $\prod_{n=1}^\infty a_n = 0$, this can happen only if $M_\infty = 0$, a contradiction with (ii).
13.3.2 Backwards (or Reverse) Martingales
In the following, pay attention to the indexation: the index set is the set of nonpositive integers. Let $\{\mathcal{F}_n\}_{n\le 0}$ be a nondecreasing family of $\sigma$-fields, that is, $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$ for all $n \le -1$. There is nothing new in the definition of "backwards" or "reverse" martingales or submartingales, except that the index set is now $\{\ldots, -2, -1, 0\}$. For instance, $\{Y_n\}_{n\le 0}$ is an $\mathcal{F}_n$-submartingale if $E[Y_n \mid \mathcal{F}_{n-1}] \ge Y_{n-1}$ for all $n \le 0$. The term "backwards" in fact refers to one of the uses that is made of this notion, that of discussing the limit of $Y_n$ as $n \downarrow -\infty$.

Reverse martingales or submartingales often appear in the following setting. Let $\{Z_k\}_{k\ge 0}$ be a sequence of integrable random variables. Suppose that
$$E[Z_{k-1} \mid Z_k, Z_{k+1}, Z_{k+2}, \ldots] = Z_k \qquad (k \ge 0) .$$
Clearly, the change of indexation $k \to -n$ gives a "backwards" martingale. The next example concerns that situation.

Example 13.3.13: Empirical mean of an iid sequence. Let $\{X_n\}_{n\ge 1}$ be an iid sequence of integrable random variables and let
$$Z_k := \frac{1}{k} S_k ,$$
where $S_k := X_1 + \cdots + X_k$. We shall prove that
$$E[Z_{k-1} \mid \mathcal{G}_k] = Z_k ,$$
where $\mathcal{G}_k = \sigma(Z_k, Z_{k+1}, Z_{k+2}, \ldots)$. It suffices to prove that for all $k \ge 1$,
$$E[Z_1 \mid \mathcal{G}_k] = Z_k , \qquad (\star)$$
since it then follows that for $m \le k$,
$$E[Z_m \mid \mathcal{G}_k] = E[E[Z_1 \mid \mathcal{G}_m] \mid \mathcal{G}_k] = E[Z_1 \mid \mathcal{G}_k] = Z_k .$$
By linearity,
$$S_k = E[S_k \mid \mathcal{G}_k] = \sum_{j=1}^k E[X_j \mid \mathcal{G}_k] .$$
From the fact that $\mathcal{G}_k = \sigma(Z_k, Z_{k+1}, Z_{k+2}, \ldots) = \sigma(S_k, X_{k+1}, X_{k+2}, \ldots)$ and by the iid assumption for $\{X_n\}_{n\ge 1}$,
$$S_k = \sum_{j=1}^k E[X_j \mid S_k, X_{k+1}, X_{k+2}, \ldots] = \sum_{j=1}^k E[X_j \mid S_k] .$$
But the pairs $(X_j, S_k)$ $(1 \le j \le k)$ have the same distribution, and therefore
$$\sum_{j=1}^k E[X_j \mid S_k] = kE[X_1 \mid S_k] = kE[X_1 \mid \mathcal{G}_k] = kE[Z_1 \mid \mathcal{G}_k] ,$$
from which $(\star)$ follows.
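The key step of this example, namely that the conditional expectation of $X_1$ given the sum $S_k$ equals $S_k/k$ by symmetry, can be verified exactly in a small discrete case. The sketch below assumes iid Bernoulli($p$) variables (an assumption made only for the illustration) and enumerates all outcomes with exact rational arithmetic; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def conditional_mean_of_first(k, p):
    """Return {s: E[X_1 | S_k = s]} for iid Bernoulli(p) variables, exactly."""
    num = {}  # sum of weights * X_1 over outcomes with S_k = s
    den = {}  # sum of weights over outcomes with S_k = s
    for xs in product((0, 1), repeat=k):
        s = sum(xs)
        w = p ** s * (1 - p) ** (k - s)
        den[s] = den.get(s, 0) + w
        num[s] = num.get(s, 0) + w * xs[0]
    return {s: num[s] / den[s] for s in den}


table = conditional_mean_of_first(5, Fraction(3, 10))
for s in sorted(table):
    print(s, table[s])  # each value equals s / 5, whatever p is
```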
Theorem 13.3.14 Let $\{\mathcal{F}_n\}_{n\le 0}$ be a nondecreasing family of $\sigma$-fields. Let $\{S_n\}_{n\le 0}$ be an $\mathcal{F}_n$-submartingale. Then:
A. $S_n$ converges $P$-a.s. and in $L^1$ as $n \downarrow -\infty$ to an integrable random variable $S_{-\infty}$, and
B. with $\mathcal{F}_{-\infty} := \cap_{n\le 0} \mathcal{F}_n$,
$$S_{-\infty} \le E[S_0 \mid \mathcal{F}_{-\infty}] ,$$
with equality if $\{S_n\}_{n\le 0}$ is an $\mathcal{F}_n$-martingale.

Proof. First note that by the submartingale property, $S_n \le E[S_0 \mid \mathcal{F}_n]$ $(n \le 0)$. In particular, $\{S_n\}_{n\le 0}$ is not only $L^1$-bounded, but also uniformly integrable (Theorem 4.2.13).

A. Denoting by $\nu_m = \nu_m([a, b])$ the number of upcrossings of $[a, b]$ by $\{S_n\}_{n\le 0}$ in the integer interval $[-m, 0]$ and by $\nu = \nu([a, b])$ the total number of upcrossings of $[a, b]$, the upcrossing inequality yields
$$(b - a)E[\nu_m] \le E[(S_0 - a)^+] < \infty ,$$
and letting $m \uparrow \infty$, $E[\nu] < \infty$. Almost-sure convergence to an integrable random variable $S_{-\infty}$ is then proved as in Theorem 13.3.2. Since $\{S_n\}_{n\le 0}$ is uniformly integrable, convergence to $S_{-\infty}$ is also in $L^1$.

B. Clearly, $S_{-\infty}$ is $\mathcal{F}_{-\infty}$-measurable. Also, by the submartingale property, $S_n \le E[S_0 \mid \mathcal{F}_n]$ $(n \le -1)$, that is, for all $n \le -1$ and all $A \in \mathcal{F}_n$,
$$\int_A S_n \, dP \le \int_A S_0 \, dP .$$
This is true for any $A \in \mathcal{F}_{-\infty}$ because $\mathcal{F}_{-\infty} \subseteq \mathcal{F}_n$ for all $n \le -1$. Since $S_n$ converges to $S_{-\infty}$ in $L^1$ as $n \downarrow -\infty$, $\int_A S_n \, dP \to \int_A S_{-\infty} \, dP$ and therefore
$$\int_A S_{-\infty} \, dP \le \int_A S_0 \, dP \qquad (A \in \mathcal{F}_{-\infty}) ,$$
which implies that $S_{-\infty} \le E[S_0 \mid \mathcal{F}_{-\infty}]$. The martingale case is obtained using the same proof with each $\le$ symbol replaced by $=$.

Remark 13.3.15 Statement B says that $\{S_n\}_{n\in -\mathbb{N}\cup\{-\infty\}}$ is a submartingale relatively to the history $\{\mathcal{F}_n\}_{n\in -\mathbb{N}\cup\{-\infty\}}$.

Example 13.3.16: The Strong Law of Large Numbers. The situation is that of Example 13.3.13. By Theorem 13.3.14, $S_k/k$ converges almost surely. By Kolmogorov's zero-one law (Theorem 4.3.3), $S_k/k \to a$, a deterministic number. It remains to identify $a$ with $E[X_1]$. We know from the first lines of the proof of Theorem 13.3.14 that $\{S_k/k\}_{k\ge 1}$ is uniformly integrable. Therefore, by Theorem 4.2.16,
$$\lim_{k\uparrow\infty} E\left[\frac{S_k}{k}\right] = a .$$
But for all $k \ge 1$, $E[S_k/k] = E[X_1]$, and therefore $a = E[X_1]$.
The uniform integrability of the backwards submartingale in Theorem 13.3.14 followed directly from the submartingale property. This is not the case for a supermartingale unless one adds a condition.

Theorem 13.3.17 Let $\{\mathcal{F}_n\}_{n\le 0}$ be a filtration and let $\{S_n\}_{n\le 0}$ be an $\mathcal{F}_n$-supermartingale such that
$$\sup_{n\le 0} E[S_n] < \infty . \qquad (13.32)$$
Then
A. $S_n$ converges $P$-a.s. and in $L^1$ as $n \downarrow -\infty$ to an integrable random variable $S_{-\infty}$, and
B. with $\mathcal{F}_{-\infty} := \cap_{n\le 0} \mathcal{F}_n$,
$$S_{-\infty} \ge E[S_0 \mid \mathcal{F}_{-\infty}] \qquad P\text{-a.s.}$$
Proof. (2) It suffices to prove uniform integrability, since the rest of the proof then follows the same lines as in Theorem 13.3.10. Fix $\varepsilon > 0$ and select $k \le 0$ such that
$$\lim_{i\downarrow -\infty} E[S_i] - E[S_k] \le \varepsilon . \qquad (\star)$$
Then $0 \le E[S_n] - E[S_k] \le \varepsilon$ for all $n \le k$. We first show that for sufficiently large $\lambda > 0$,
$$\int_{\{|S_n| > \lambda\}} |S_n| \, dP \le \varepsilon .$$
It is enough to prove this for suﬃciently large −n, here for −n ≥ −k. The previous integral is equal to − Sn dP + E [Sn ] − Sn dP . {Sn λ) ≤
E [Sn ] →0 λ
uniformly in n ≤ 0, and therefore {Sn >λ}
Sk  dP → 0
uniformly in n.
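The following result is the backwards Lévy continuity theorem for conditional expectations.

Corollary 13.3.18 Let $\{\mathcal{F}_n\}_{n\le 0}$ be a history and let $\xi$ be an integrable random variable. Then, with $\mathcal{F}_{-\infty} := \cap_{n\le 0} \mathcal{F}_n$,
$$\lim_{n\downarrow -\infty} E[\xi \mid \mathcal{F}_n] = E[\xi \mid \mathcal{F}_{-\infty}] . \qquad (13.33)$$

Proof. $M_n := E[\xi \mid \mathcal{F}_n]$ $(n \le 0)$ is an $\mathcal{F}_n$-martingale and therefore, by the backwards martingale convergence theorem, it converges as $n \downarrow -\infty$ almost surely and in $L^1$ to some integrable variable $M_{-\infty}$ and
$$M_{-\infty} = E[M_0 \mid \mathcal{F}_{-\infty}] = E[E[\xi \mid \mathcal{F}_0] \mid \mathcal{F}_{-\infty}] = E[\xi \mid \mathcal{F}_{-\infty}]$$
since $\mathcal{F}_{-\infty} \subseteq \mathcal{F}_0$.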
Local Absolute Continuity

Recall the setting of Example 13.1.6. On the measurable space $(\Omega, \mathcal{F})$ are given two probability measures $Q$ and $P$ and a filtration $\{\mathcal{F}_n\}_{n\ge 1}$ such that $\mathcal{F} = \vee_{n\ge 1} \mathcal{F}_n =: \mathcal{F}_\infty$. Let $Q_n$ and $P_n$ denote the restrictions to $(\Omega, \mathcal{F}_n)$ of $Q$ and $P$ respectively. Suppose that $Q_n \ll P_n$ $(n \ge 1)$, in which case we say that $Q$ is locally absolutely continuous with respect to $P$ along $\{\mathcal{F}_n\}_{n\ge 1}$ and denote this by $Q \overset{\text{loc.}}{\ll} P$. Let then
$$L_n := \frac{dQ_n}{dP_n}$$
denote the corresponding Radon–Nikodým derivative. The question is: under what circumstances can we assert that $Q \ll P$? And what can be said if this is not the case? That this is not always the case is clear from the following elementary example.

Example 13.3.19: Independent sequences of 0's and 1's. In this example $\mathcal{F}_n = \sigma(X_1, \ldots, X_n)$, where $\{X_n\}_{n\ge 1}$ is, under both $Q$ and $P$, an independent sequence of $\{0, 1\}$-valued random variables and
$$Q(X_n = 1) = q_n > 0 \quad \text{and} \quad P(X_n = 1) = p_n > 0 ,$$
where $\sum_{n\ge 1} q_n = \infty$ and $\sum_{n\ge 1} p_n < \infty$. Then, by the positivity condition on the $p_n$'s and the $q_n$'s, $Q \overset{\text{loc.}}{\ll} P$. However $Q$ and $P$ are mutually singular since $Q(X_n \to 0) = 0$ and $P(X_n \to 0) = 1$ (see Exercise 4.6.7).
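Theorem 13.3.20 The Radon–Nikodým sequence $\{L_n\}_{n\ge 1}$ converges $Q$-almost surely and $P$-almost surely to some random variable $L_\infty$ and
$$dQ = 1_{\{L_\infty = \infty\}} \, dQ + L_\infty \, dP \qquad (13.34)$$
where $P(L_\infty = \infty) = 0$.

Remark 13.3.21 In particular, the measures $d\lambda := 1_{\{L_\infty = \infty\}} \, dQ$ and $d\mu := L_\infty \, dP$ are mutually singular and $\lambda$ is absolutely continuous with respect to $Q$, so that (13.34) is the Lebesgue decomposition of $Q$ with respect to $P$ (Theorem 2.3.30).

Proof. Denote by $\nu$ (resp. $\nu_n$) the probability $\frac12 (P + Q)$ on $(\Omega, \mathcal{F})$ (resp. $\frac12 (P_n + Q_n)$ on $(\Omega, \mathcal{F}_n)$). Since $Q_n$ and $P_n$ are dominated by $\nu_n$, there exists for each $n \ge 1$ an ($\mathcal{F}_n$-measurable) Radon–Nikodým derivative $U_n := \frac{dQ_n}{d\nu_n}$. The sequence $\{U_n\}_{n\ge 1}$ is a $(\nu, \mathcal{F}_n)$-martingale, since for all $n \ge 1$ and all $A \in \mathcal{F}_n$ (and therefore also in $\mathcal{F}_{n+1}$),
$$\int_A U_{n+1} \, d\nu = Q_{n+1}(A) = Q_n(A) = \int_A U_n \, d\nu .$$
Also, $U_n \le 2$ because $Q_n \le 2\,\frac{P_n + Q_n}{2} = 2\nu_n$. Being a bounded $(\nu, \mathcal{F}_n)$-martingale, $\{U_n\}_{n\ge 1}$ converges $\nu$-a.s. and in $L^1(\nu)$ to some random variable $U_\infty$. Therefore, for all $k \ge 1$ and all $A \in \mathcal{F}_k$,
$$\int_A U_\infty \, d\nu = \lim_{n\uparrow\infty} \int_A U_{n+k} \, d\nu$$
and therefore, since for all $n \ge 0$ and all $A \in \mathcal{F}_k$, $\int_A U_{n+k} \, d\nu = \int_A U_k \, d\nu = Q(A)$,
$$\int_A U_\infty \, d\nu = Q(A) .$$
This being true for all $A \in \mathcal{F}_k$, the probability measures $U_\infty \, d\nu$ and $dQ$ agree on $\mathcal{F}_k$. This being true for all $k \ge 1$, they agree on the algebra $\cup_{k\ge 1} \mathcal{F}_k$, and therefore on $\mathcal{F}_\infty = \vee_{k\ge 1} \mathcal{F}_k$ (Carathéodory's theorem). We have just proved that $dQ = U_\infty \, d\big(\frac{P+Q}{2}\big)$, that is,
$$(2 - U_\infty) \, dQ = U_\infty \, dP ,$$
from which it follows that $P(U_\infty = 2) = 0$ and that if $Q(U_\infty = 2) = 0$, then $Q \ll P$ and $\frac{dQ}{dP} = \frac{U_\infty}{2 - U_\infty}$.

Since $L_n = \frac{U_n}{2 - U_n}$, $L_n \to L_\infty = \frac{U_\infty}{2 - U_\infty}$ $(P + Q)$-a.s. and $P(L_\infty = \infty) = 0$. Now, $dQ_n = L_n \, dP_n$ and $\frac{2}{1 + L_\infty} \, dQ = \frac{2L_\infty}{1 + L_\infty} \, dP$. Hence the decomposition
$$dQ = 1_{\{L_\infty = \infty\}} \, dQ + L_\infty \, dP$$
where $P(L_\infty = \infty) = 0$ and $Q(L_\infty = 0) = 0$.

Theorem 13.3.22
$$Q \ll P \iff E_P[L_\infty] = 1 \iff Q(L_\infty < \infty) = 1 ,$$
$$Q \perp P \iff E_P[L_\infty] = 0 \iff Q(L_\infty = \infty) = 1 .$$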
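Proof. Write (13.34) as
$$Q(A) = \int_A 1_{\{L_\infty = \infty\}} \, dQ + \int_A L_\infty \, dP \qquad (A \in \mathcal{F}_\infty) .$$
With $A = \Omega$,
$$1 = Q(L_\infty = \infty) + E_P[L_\infty] ,$$
and therefore
$$E_P[L_\infty] = 1 \iff Q(L_\infty < \infty) = 1 , \qquad E_P[L_\infty] = 0 \iff Q(L_\infty = \infty) = 1 .$$
If $Q(L_\infty = \infty) = 0$, it follows by (13.34) that $Q \ll P$. Conversely, if $Q \ll P$, then $Q(L_\infty = \infty) = 0$ since $P(L_\infty = \infty) = 0$.

If $Q \perp P$, there exists a $B \in \mathcal{F}$ such that $Q(B) = 1$ and $P(B) = 0$. In particular, from (13.34), $Q(B \cap \{L_\infty = \infty\}) = 1$ and therefore $Q(L_\infty = \infty) = 1$. Finally, if $Q(L_\infty = \infty) = 1$, $Q \perp P$ since $P(L_\infty = \infty) = 0$.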
Remark 13.3.23 By Theorem 4.2.16, the condition $E_P[L_\infty] = 1$ is equivalent to the uniform $P$-integrability of $\{L_n\}_{n\ge 1}$. Therefore, in order to prove that $Q \ll P$, any condition guaranteeing uniform integrability is a sufficient condition for the absolute continuity of $Q$ with respect to $P$. For instance (Theorem 4.2.14 and Example 4.2.15)
$$\sup_n E\big[L_n^{1+\alpha}\big] < \infty \qquad (\alpha > 1)$$
and
$$\sup_n E\big[L_n \log^+ L_n\big] < \infty .$$
Example 13.3.24: Kakutani's Dichotomy Theorem. Let $\{X_n\}_{n\ge 1}$ be a sequence of random elements with values in the measurable space $(E, \mathcal{E})$. We may suppose that it is the coordinate sequence of the canonical space $(\Omega, \mathcal{F}) := (E^{\mathbb{N}}, \mathcal{E}^{\otimes \mathbb{N}})$. Let $Q$ and $P$ be two probability measures on $(\Omega, \mathcal{F})$ such that the sequence is iid relatively to both. Let $Q_{X_n}$ and $P_{X_n}$ be the restrictions of $Q$ and $P$ respectively to $\sigma(X_n)$ and let $Q_n$ and $P_n$ be the restrictions of $Q$ and $P$ respectively to $\mathcal{F}_n := \sigma(X_1, \ldots, X_n)$. We assume that for all $n \ge 1$, $Q_{X_n} \ll P_{X_n}$ and denote the corresponding Radon–Nikodým derivative $\frac{dQ_{X_n}}{dP_{X_n}}$ by $f_n(X_n)$. Then for all $n \ge 1$, $Q_n \ll P_n$ and
$$L_n = \frac{dQ_n}{dP_n} = \prod_{i=1}^n f_i(X_i) .$$
Since
$$\{L_\infty < \infty\} = \{\log L_\infty < \infty\} = \Big\{ \lim_n \sum_{i=1}^n \log f_i(X_i) < \infty \Big\}$$
is a tail event of the sequence, its probability is 0 or 1. Therefore, there are only two possibilities, either $Q \ll P$ or $Q \perp P$.
Example 13.3.25: Kakutani's Condition. Kakutani's theorem (Theorem 13.3.12) can be applied to the situation (analogous to that of Example 13.3.24 above) where
$$L_n = \prod_{i=1}^n Z_i ,$$
where $\{Z_n\}_{n\ge 1}$ is a sequence of iid nonnegative random variables of mean 1. By this theorem, the criterion of absolute continuity of $Q$ with respect to $P$, $E_P[L_\infty] = 1$, of Theorem 13.3.22 is $\prod_{n=1}^\infty E[Z_n^{1/2}] > 0$. By the same argument as in Example 13.3.24, the only alternative to $Q \ll P$ is $Q \perp P$, and therefore a necessary and sufficient condition for the latter is $\prod_{n=1}^\infty E[Z_n^{1/2}] = 0$.
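Kakutani's condition can be explored numerically in a Gaussian setting that is assumed here purely for illustration (it is not in the text): under $P$ the coordinates are independent $N(0,1)$, under $Q$ the $n$th coordinate is $N(\theta_n, 1)$, and $Z_n$ is the corresponding likelihood ratio. In that case $E_P[Z_n^{1/2}] = \exp(-\theta_n^2/8)$, so the infinite product is positive exactly when $\sum_n \theta_n^2 < \infty$. The function names below are hypothetical.

```python
import math


def affinity(theta):
    """E_P[sqrt(Z)] for Z = dN(theta, 1)/dN(0, 1): closed form exp(-theta^2 / 8)."""
    return math.exp(-theta * theta / 8.0)


def affinity_numeric(theta, half_width=12.0, steps=24_000):
    """Midpoint-rule check: integral of sqrt(phi(x - theta) * phi(x)) dx."""
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        total += math.exp(-((x - theta) ** 2 + x * x) / 4.0)
    return total * h / math.sqrt(2 * math.pi)


def partial_product(thetas):
    """Product of the affinities E_P[sqrt(Z_n)] over the given shifts."""
    prod = 1.0
    for t in thetas:
        prod *= affinity(t)
    return prod


n_terms = 10_000
summable = partial_product(1.0 / n for n in range(1, n_terms + 1))
not_summable = partial_product(n ** -0.25 for n in range(1, n_terms + 1))
print(summable, not_summable)
```

With $\theta_n = 1/n$ the partial products stabilize at a positive value (absolute continuity), while with $\theta_n = n^{-1/4}$ they collapse to 0 (singularity).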
Harmonic Functions and Markov Chains

An application to Markov chain theory of the martingale convergence theorem concerns harmonic functions of hmcs and the study of recurrence of hmcs. The basic result is:

Theorem 13.3.26 An irreducible recurrent hmc $\{X_n\}_{n\ge 0}$ has no nonnegative superharmonic or bounded subharmonic functions besides the constants.

Proof. If $h$ is nonnegative superharmonic (resp., bounded subharmonic), the sequence $\{h(X_n)\}_{n\ge 0}$ is a nonnegative supermartingale (resp., bounded submartingale) and therefore it converges to a finite limit $Y$. Since the chain visits any state $i \in E$ infinitely often, one must have $Y = h(i)$ almost surely for all $i \in E$. This can happen only if $h$ is a constant.

Corollary 13.3.27 A necessary and sufficient condition for an irreducible hmc to be transient is the existence of some state (henceforth denoted by 0) and of a bounded function $h : E \to \mathbb{R}$, not identically null and satisfying
$$h(j) = \sum_{k \ne 0} p_{jk} h(k) \qquad (j \ne 0) . \qquad (13.35)$$

Proof. Let $T_0$ be the return time to state 0. First-step analysis shows that the (bounded) function $h$ defined by
$$h(j) := P_j(T_0 = \infty)$$
satisfies (13.35). If the chain is transient, $h$ is nontrivial (not identically null). This proves necessity. Conversely, suppose that (13.35) holds for a not identically null bounded function. Define $\tilde h$ by $\tilde h(0) := 0$ and
$$\tilde h(j) := h(j) \qquad (j \ne 0) ,$$
and let $\alpha := \sum_{k\in E} p_{0k} \tilde h(k)$. Changing the sign of $\tilde h$ if necessary, $\alpha$ can be assumed nonnegative. Then $\tilde h$ is subharmonic and bounded. If the chain were recurrent, then by Theorem 13.3.26, $\tilde h$ would be a constant. This constant would be equal to $\tilde h(0) = 0$, and this contradicts the assumed nontriviality of $h$.

Here is an application of the martingale convergence theorem in the vein of the previous results and of Foster's theorem (Theorem 6.3.19).
Theorem 13.3.28 Let the hmc $\{X_n\}_{n\ge 0}$ with transition matrix $\mathbf{P}$ be irreducible and let $h : E \to \mathbb{R}$ be a bounded function such that
$$\sum_{k\in E} p_{ik} h(k) \le h(i) , \quad \text{for all } i \notin F , \qquad (13.36)$$
for some set $F$ (not assumed finite). Suppose, moreover, that there exists an $i \notin F$ such that
$$h(i) < h(j) , \quad \text{for all } j \in F . \qquad (13.37)$$
Then the chain is transient.

Proof. Let $\tau$ be the return time in $F$ and let $i \notin F$ satisfy (13.37). Defining $Y_n = h(X_{n\wedge\tau})$, we have that, under $P_i$, $Y$ is a (bounded) $\mathcal{F}_n^X$-supermartingale (same proof as in Theorem 6.3.19). By the martingale convergence theorem, the limit $Y_\infty$ of $Y_n = h(X_{n\wedge\tau})$ exists and is finite, $P_i$-almost surely. By dominated convergence, $E_i[Y_\infty] = \lim_{n\uparrow\infty} E_i[Y_n]$, and since $E_i[Y_n] \le E_i[Y_0] = h(i)$ (supermartingale property), we have $E_i[Y_\infty] \le h(i)$. If $\tau$ were $P_i$-a.s. finite, then $Y_n$ would eventually be frozen at a value $h(j)$ for $j \in F$, and therefore by (13.37), $E_i[Y_\infty] > h(i)$, a contradiction with the last inequality. Therefore, $P_i(\tau < \infty) < 1$, which means that with a strictly positive probability, the chain starting from $i \notin F$ will not return to $F$. This is incompatible with irreducibility and recurrence.
13.3.3 The Robbins–Siegmund Theorem
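In applications, one often encounters random sequences that are not quite martingales, submartingales or supermartingales, but "nearly" so, up to "perturbations". The statement of the result below will make this precise.

Theorem 13.3.29 Let $\{V_n\}_{n\ge 1}$, $\{\beta_n\}_{n\ge 1}$, $\{\gamma_n\}_{n\ge 1}$ and $\{\delta_n\}_{n\ge 1}$ be real nonnegative sequences of random variables adapted to some filtration $\{\mathcal{F}_n\}_{n\ge 1}$ and such that
$$E[V_{n+1} \mid \mathcal{F}_n] \le V_n(1 + \beta_n) + \gamma_n - \delta_n \qquad (n \ge 1) . \qquad (13.38)$$
Then, on the set
$$\Gamma = \Big\{ \sum_{n\ge 1} \beta_n < \infty \Big\} \cap \Big\{ \sum_{n\ge 1} \gamma_n < \infty \Big\} , \qquad (13.39)$$
the sequence $\{V_n\}_{n\ge 1}$ converges almost surely to a finite random variable and moreover $\sum_{n\ge 1} \delta_n < \infty$ $P$-almost surely.

Proof. 1. Let $\alpha_0 := 1$ and
$$\alpha_n := \Big( \prod_{k=1}^n (1 + \beta_k) \Big)^{-1} \qquad (n \ge 1) ,$$
and let
$$V_n' := \alpha_{n-1} V_n , \qquad \gamma_n' := \alpha_n \gamma_n , \qquad \delta_n' := \alpha_n \delta_n \qquad (n \ge 1) .$$
Then
$$E[V_{n+1}' \mid \mathcal{F}_n] = \alpha_n E[V_{n+1} \mid \mathcal{F}_n] \le \alpha_n V_n (1 + \beta_n) + \alpha_n \gamma_n - \alpha_n \delta_n ,$$
that is, since $\alpha_n V_n (1 + \beta_n) = \alpha_{n-1} V_n$,
$$E[V_{n+1}' \mid \mathcal{F}_n] \le V_n' + \gamma_n' - \delta_n' .$$
Therefore, the random sequence $\{Y_n\}_{n\ge 1}$ defined by
$$Y_n := V_n' - \sum_{k=1}^{n-1} (\gamma_k' - \delta_k')$$
is an $\mathcal{F}_n$-supermartingale.

2. For $a > 0$, let
$$T_a := \inf\Big\{ n \ge 1 \,;\, \sum_{k=1}^{n-1} (\gamma_k' - \delta_k') \ge a \Big\} .$$
The sequence $\{Y_{n\wedge T_a}\}_{n\ge 1}$ is an $\mathcal{F}_n$-supermartingale bounded from below by $-a$. It therefore converges to a finite limit. Therefore, on $\{T_a = \infty\}$, $\{Y_n\}_{n\ge 1}$ converges to a finite limit.

3. On $\Gamma$, $\prod_{k=1}^\infty (1 + \beta_k)$ converges to a finite positive limit and therefore $\lim_{n\uparrow\infty} \alpha_n > 0$. Since moreover $\alpha_n \le 1$, condition $\sum_{n\ge 1} \gamma_n < \infty$ implies $\sum_{n\ge 1} \gamma_n' < \infty$.

4. By definition of $Y_n$,
$$Y_n + \sum_{k=1}^{n-1} \gamma_k' = V_n' + \sum_{k=1}^{n-1} \delta_k' \ge \sum_{k=1}^{n-1} \delta_k' .$$
But on $\Gamma \cap \{T_a = \infty\}$, $\{Y_n\}_{n\ge 1}$ converges to a finite random variable, and therefore $\sum_{n\ge 1} \delta_n' < \infty$.

5. Since on $\Gamma \cap \{T_a = \infty\}$, $\sum_{n\ge 1} \gamma_n' < \infty$, $\sum_{n\ge 1} \delta_n' < \infty$ and $\{Y_n\}_{n\ge 1}$ converges to a finite random variable, it follows that $\{V_n'\}_{n\ge 1}$ converges to a finite limit. Since $\lim_{n\uparrow\infty} \alpha_n > 0$, it follows in turn that $\{V_n\}_{n\ge 1}$ converges to a finite limit and $\sum_{n\ge 1} \delta_n < \infty$ on $\Gamma \cap \{T_a = \infty\}$, and therefore on $\Gamma \cap (\cup_a \{T_a = \infty\}) = \Gamma$.

Corollary 13.3.30 Let $\{V_n\}_{n\ge 1}$, $\{\gamma_n\}_{n\ge 1}$ and $\{\delta_n\}_{n\ge 1}$ be real nonnegative sequences of random variables adapted to some filtration $\{\mathcal{F}_n\}_{n\ge 1}$. Suppose that for all $n \ge 1$,
$$E[V_{n+1} \mid \mathcal{F}_n] \le V_n + \gamma_n - \delta_n . \qquad (13.40)$$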
Let {an }n≥1 be a random sequence that is strictly positive and strictly increasing and let ⎧ ⎫ ⎨ γ ⎬ n , := Γ 0, we have that Zn ≥ 0 (n ≥ 1). Also E[Zn+1  Fn ] ≤ Zn +
γn δn − . an an
, {Zn }n≥1 converges and Therefore, by Theorem 13.3.29, on Γ, n≥1 in particular Vn+1 − Vn ,. = 0 on Γ lim n↑∞ an 2. If moreover limn↑∞ an = a∞ < ∞, the convergence of of a1∞ n≥1 (Vn+1 − Vn ), and therefore {Vn }n≥1 converges. 3. If on the contrary limn↑∞ an = ∞, the convergence of Vn+1 an
n≥1
Vn , by an
(and therefore that of of and an ↑ ∞, the convergence of
13.3.4
Vn . an
n≥1
δn an
< ∞. Note that (13.42)
Vn+1 −Vn an
implies that
Vn+1 −Vn an
implies that
(13.42)) to 0 (recall Kronecker’s lemma: if an > 0 xn 1 n n≥1 an implies that limn↑∞ an k=1 xk = 0).
Square-integrable Martingales
Doob's decomposition

Let $\{\mathcal{F}_n\}_{n\ge 0}$ be a filtration. Recall that a process $\{H_n\}_{n\ge 0}$ is called $\mathcal{F}_n$-predictable if for all $n \ge 1$, $H_n$ is $\mathcal{F}_{n-1}$-measurable.

Theorem 13.3.31 Let $\{S_n\}_{n\ge 0}$ be an $\mathcal{F}_n$-submartingale. Then there exists a $P$-a.s. unique nondecreasing $\mathcal{F}_n$-predictable process $\{A_n\}_{n\ge 0}$ with $A_0 \equiv 0$ and a unique $\mathcal{F}_n$-martingale $\{M_n\}_{n\ge 0}$ such that for all $n \ge 0$,
$$S_n = M_n + A_n .$$

Proof. Existence is proved by explicit construction. Let $M_0 := S_0$, $A_0 = 0$ and, for $n \ge 1$,
$$M_n := S_0 + \sum_{j=0}^{n-1} \big( S_{j+1} - E[S_{j+1} \mid \mathcal{F}_j] \big) , \qquad A_n := \sum_{j=0}^{n-1} \big( E[S_{j+1} \mid \mathcal{F}_j] - S_j \big) .$$
Clearly, $\{M_n\}_{n\ge 0}$ and $\{A_n\}_{n\ge 0}$ have the announced properties. In order to prove uniqueness, let $\{M_n'\}_{n\ge 0}$ and $\{A_n'\}_{n\ge 0}$ be another such decomposition. In particular, for $n \ge 0$,
$$A_{n+1}' - A_n' = (A_{n+1} - A_n) + (M_{n+1} - M_n) - (M_{n+1}' - M_n') .$$
Therefore
$$E[A_{n+1}' - A_n' \mid \mathcal{F}_n] = E[A_{n+1} - A_n \mid \mathcal{F}_n] ,$$
and, since $A_{n+1}' - A_n'$ and $A_{n+1} - A_n$ are $\mathcal{F}_n$-measurable,
$$A_{n+1}' - A_n' = A_{n+1} - A_n , \quad P\text{-a.s.} \qquad (n \ge 0) ,$$
from which it follows that $A_n' = A_n$ a.s. for all $n \ge 0$ (recall that $A_0' = A_0$) and then $M_n' = M_n$ a.s. for all $n \ge 0$.

Definition 13.3.32 The sequence $\{A_n\}_{n\ge 0}$ in Theorem 13.3.31 is called the compensator of $\{S_n\}_{n\ge 0}$.

Definition 13.3.33 Let $\{M_n\}_{n\ge 0}$ be a square-integrable $\mathcal{F}_n$-martingale (that is, $E[M_n^2] < \infty$ for all $n \ge 0$). The compensator of the $\mathcal{F}_n$-submartingale $\{M_n^2\}_{n\ge 0}$ is denoted by $\{\langle M \rangle_n\}_{n\ge 0}$ and is called the bracket process of $\{M_n\}_{n\ge 0}$.

By the explicit construction in the proof of Theorem 13.3.31, $\langle M \rangle_0 := 0$ and for $n \ge 1$,
$$\langle M \rangle_n := \sum_{j=0}^{n-1} \big( E[M_{j+1}^2 \mid \mathcal{F}_j] - M_j^2 \big) = \sum_{j=0}^{n-1} E[(M_{j+1}^2 - M_j^2) \mid \mathcal{F}_j] . \qquad (13.43)$$
Also, for all $0 \le k \le n$,
$$E[(M_n - M_k)^2 \mid \mathcal{F}_k] = E[M_n^2 - M_k^2 \mid \mathcal{F}_k] = E[\langle M \rangle_n - \langle M \rangle_k \mid \mathcal{F}_k] .$$
Therefore, $\{M_n^2 - \langle M \rangle_n\}_{n\ge 0}$ is an $\mathcal{F}_n$-martingale. In particular, if $M_0 = 0$, $E[M_n^2] = E[\langle M \rangle_n]$.

Example 13.3.34: Let $\{Z_n\}_{n\ge 0}$ be a sequence of iid centered random variables of finite variance. Let $M_0 := 0$ and $M_n := \sum_{j=1}^n Z_j$ for $n \ge 1$. Then, for $n \ge 1$,
$$\langle M \rangle_n = \sum_{j=1}^n \mathrm{Var}(Z_j) .$$

Theorem 13.3.35 If $E[\langle M \rangle_\infty] < \infty$, the square-integrable martingale $\{M_n\}_{n\ge 0}$ converges almost surely to a finite limit, and convergence takes place also in $L^2$.

Proof. This is Theorem 13.3.8 for the particular case $p = 2$. In fact, condition (13.28) thereof is satisfied since
$$\sup_{n\ge 1} E[M_n^2] = \sup_{n\ge 1} E[\langle M \rangle_n] = E[\langle M \rangle_\infty] < \infty .$$
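The bracket process can be checked by exhaustive enumeration in the simplest case. The sketch below assumes the symmetric $\pm 1$ random walk (an illustrative choice, consistent with Example 13.3.34, for which the bracket at time $n$ is $n$) and verifies exactly, path by path, that $M_n^2 - n$ has the martingale property; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def check_bracket(n):
    """Check that M_j^2 - j is a martingale for the +/-1 random walk, j < n.

    For every path prefix of length j, the average of M_{j+1}^2 - (j + 1)
    over the two equally likely next steps must equal M_j^2 - j.
    """
    for j in range(n):
        for prefix in product((-1, 1), repeat=j):
            m = sum(prefix)
            nxt = Fraction(sum((m + z) ** 2 - (j + 1) for z in (-1, 1)), 2)
            if nxt != m * m - j:
                return False
    return True


def second_moment(n):
    """E[M_n^2] for the +/-1 random walk, by enumeration of all 2^n paths."""
    paths = list(product((-1, 1), repeat=n))
    return Fraction(sum(sum(p) ** 2 for p in paths), len(paths))


print(check_bracket(8), second_moment(6))
```

The second quantity illustrates the identity $E[M_n^2] = E[\langle M \rangle_n]$ when $M_0 = 0$.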
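The Martingale Law of Large Numbers

Theorem 13.3.36 Let $\{M_n\}_{n\ge 0}$ be a square-integrable $\mathcal{F}_n$-martingale. Then:
A. On $\{\langle M \rangle_\infty < \infty\}$, $M_n$ converges to a finite limit.
B. On $\{\langle M \rangle_\infty = \infty\}$, $M_n / \langle M \rangle_n \to 0$.

Proof. A. Let $K > 0$ be fixed. The random time
$$\tau_K := \inf\{n \ge 0 : \langle M \rangle_{n+1} > K\}$$
is an $\mathcal{F}_n$-stopping time since the bracket process is $\mathcal{F}_n$-predictable. Also $\langle M \rangle_{n\wedge\tau_K} \le K$ and therefore, by Theorem 13.3.35, $\{M_{n\wedge\tau_K}\}_{n\ge 0}$ converges to a finite limit. Therefore $\{M_n\}_{n\ge 0}$ converges to a finite limit on the set $\{\langle M \rangle_\infty < K\}$ contained in $\{\tau_K = \infty\}$. Hence the result since
$$\{\langle M \rangle_\infty < \infty\} = \bigcup_{K\ge 1} \{\tau_K = \infty\} .$$

B. Note that
$$E[M_{n+1}^2 \mid \mathcal{F}_n] = M_n^2 + \langle M \rangle_{n+1} - \langle M \rangle_n .$$
Define
$$V_n = M_n^2 , \qquad \gamma_n = \langle M \rangle_{n+1} - \langle M \rangle_n , \qquad a_n = \langle M \rangle_{n+1}^2 .$$
The result then follows from Part 3 of Corollary 13.3.30 (observe that there exists a $k_0$ such that $a_k \ge 1$ for $k \ge k_0$ and
$$\sum_{k=k_0}^\infty \frac{\gamma_k}{a_k} = \sum_{k=k_0}^\infty \frac{\langle M \rangle_{k+1} - \langle M \rangle_k}{\langle M \rangle_{k+1}^2} \le \int_1^\infty x^{-2} \, dx < \infty \,)$$
which says, in particular, that $V_{n+1}/a_n = M_{n+1}^2 / \langle M \rangle_{n+1}^2$ converges to 0.

Remark 13.3.37 We do not have in general $\{\langle M \rangle_\infty < \infty\} = \{\{M_n\}_{n\ge 0} \text{ converges}\}$.

The following is a conditioned version of the Borel–Cantelli lemma. Note that, in this form, we have a necessary and sufficient condition.

Corollary 13.3.38 Let $\{\mathcal{F}_n\}_{n\ge 1}$ be a filtration and let $\{A_n\}_{n\ge 1}$ be a sequence of events such that $A_n \in \mathcal{F}_n$ $(n \ge 1)$. Then
$$\Big\{ \sum_{n\ge 1} P(A_n \mid \mathcal{F}_{n-1}) = \infty \Big\} \equiv \Big\{ \sum_{n\ge 1} 1_{A_n} = \infty \Big\} .$$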
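Proof. Define $\{M_n\}_{n\ge 0}$ by $M_0 := 0$ and, for $n \ge 1$,
$$M_n := \sum_{k=1}^n \big( 1_{A_k} - P(A_k \mid \mathcal{F}_{k-1}) \big) .$$
This is a square-integrable $\mathcal{F}_n$-martingale, with bracket process
$$\langle M \rangle_n = \sum_{k=1}^n P(A_k \mid \mathcal{F}_{k-1}) \big( 1 - P(A_k \mid \mathcal{F}_{k-1}) \big) .$$
In particular,
$$\langle M \rangle_n \le \sum_{k=1}^n P(A_k \mid \mathcal{F}_{k-1}) .$$

A. Suppose that $\sum_{k=1}^\infty P(A_k \mid \mathcal{F}_{k-1}) < \infty$. Then, by the above inequality, $\langle M \rangle_\infty < \infty$, and therefore, by Part A of Theorem 13.3.36, $M_n$ converges. Since by hypothesis $\sum_{k=1}^\infty P(A_k \mid \mathcal{F}_{k-1}) < \infty$, this implies that $\sum_{k=1}^\infty 1_{A_k} < \infty$.

B. Suppose that $\sum_{k=1}^\infty P(A_k \mid \mathcal{F}_{k-1}) = \infty$ and $\langle M \rangle_\infty < \infty$. Then $M_n$ converges to a finite random variable and therefore
$$\frac{M_n}{\sum_{k=1}^n P(A_k \mid \mathcal{F}_{k-1})} = \frac{\sum_{k=1}^n 1_{A_k}}{\sum_{k=1}^n P(A_k \mid \mathcal{F}_{k-1})} - 1 \to 0 .$$

C. Suppose that $\sum_{k=1}^\infty P(A_k \mid \mathcal{F}_{k-1}) = \infty$ and $\langle M \rangle_\infty = \infty$. Then $\frac{M_n}{\langle M \rangle_n} \to 0$ and, a fortiori,
$$\frac{M_n}{\sum_{k=1}^n P(A_k \mid \mathcal{F}_{k-1})} \to 0 ,$$
that is,
$$\frac{\sum_{k=1}^n 1_{A_k}}{\sum_{k=1}^n P(A_k \mid \mathcal{F}_{k-1})} \to 1 .$$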
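The ratio appearing at the end of this proof can be observed in a simulation. The sketch below assumes independent events with $P(A_k) = k^{-1/2}$ (an illustrative choice; by independence the conditional probabilities given the past reduce to these numbers, and their sum diverges), and compares the number of events that occur with the sum of the probabilities. The function name is hypothetical.

```python
import random


def occurrence_ratio(n=100_000, seed=3):
    """Ratio (number of A_k that occur) / (sum of P(A_k)) for k = 1..n.

    Assumed model (illustrative): independent events with P(A_k) = k ** -0.5,
    so that P(A_k | F_{k-1}) = P(A_k) and the series of probabilities diverges.
    """
    rng = random.Random(seed)
    hits = 0
    expected = 0.0
    for k in range(1, n + 1):
        p = k ** -0.5
        expected += p
        if rng.random() < p:
            hits += 1
    return hits / expected


print(occurrence_ratio())  # the corollary's proof predicts a ratio tending to 1
```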
The Robbins–Monro algorithm

Consider an input-output relationship $u \in \mathbb{R} \to x \in \mathbb{R}$ of the form
$$x = g(u, \varepsilon) ,$$
where $\varepsilon$ is a random variable, and let
$$\Phi(u) := E[g(u, \varepsilon)] .$$
We wish to determine $u^*$ such that $\Phi(u^*) = \alpha$, where $\alpha$ is given.

Remark 13.3.39 This is a dosage problem: $u$ is the dose and $\Phi(u)$ is the (average) effect produced by this dose; $u^*$ is the dose realizing the desired effect $\alpha$.

$\Phi$ is assumed nondecreasing, but is otherwise unknown. In order to determine $u^*$, one makes a series of experiments. Experiment $n \ge 0$ associates with the input $U_n$ (an experimental dose) the output $X_{n+1} = g(U_n, \varepsilon_{n+1})$, where $\{\varepsilon_n\}_{n\ge 1}$ is iid. The input $U_n$ is a function of the previous experimental results $X_1, \ldots, X_n$, and therefore
$$E[X_{n+1} \mid \mathcal{F}_n] = \Phi(U_n) ,$$
where $\mathcal{F}_n = \sigma(X_1, \ldots, X_n)$. We want to choose $U_n$ as a function of $X_1, \ldots, X_n$ that converges almost surely to $u^*$. The following strategy is reasonable: reduce the dose if $X_{n+1} > \alpha$, augment it otherwise. This remark has led to the Robbins–Monro algorithm:
Un+1 = Un − γn (Xn+1 − α),
n ≥ 0,
(13.44)
where $\gamma_n \ge 0$ for all $n \ge 0$. The question is: under what conditions does $\lim_{n \uparrow \infty} U_n = u^*$? We shall need the following deterministic lemma ([Duflo, 1997], Proposition 1.2.3).

Lemma 13.3.40 Let $f : \mathbb{R} \to \mathbb{R}$ be a continuous function such that for some $x^*$ and some $\alpha \in \mathbb{R}$, $f(x^*) = \alpha$ and, for all $x \ne x^*$, $(f(x) - \alpha)(x - x^*) < 0$ and $|f(x)| \le K(1 + |x|)$ for some constant $K > 0$. Let $\{\gamma_n\}_{n \ge 0}$ be a nonincreasing nonnegative deterministic sequence such that
\[ \sum_{n \ge 0} \gamma_n = \infty , \]
and let $\{\varepsilon_n\}_{n \ge 1}$ be a deterministic sequence such that
\[ \sum_{n \ge 0} \gamma_n \varepsilon_{n+1} \ \text{converges} . \]
Then, the sequence $\{x_n\}_{n \ge 0}$ defined by
\[ x_{n+1} = x_n + \gamma_n (f(x_n) - \alpha + \varepsilon_{n+1}) , \qquad n \ge 0, \]
converges to $x^*$ for any initial condition $x_0$.

Let $\{X_n\}_{n \ge 0}$ and $\{Y_n\}_{n \ge 0}$ be sequences of square-integrable random vectors of dimension $d$ adapted to some filtration $\{\mathcal{F}_n\}_{n \ge 0}$. Let $\{\gamma_n\}_{n \ge 0}$ be a nonincreasing sequence of nonnegative random variables such that
\[ \lim_{n \uparrow \infty} \gamma_n = 0 , \]
and $\gamma_0 \le C < \infty$ for some deterministic constant $C$. Suppose that
\[ X_{n+1} = X_n + \gamma_n Y_{n+1}, \qquad n \ge 0, \]
and that, moreover, for all $n \ge 0$,
\[ E[Y_{n+1} \mid \mathcal{F}_n] = f(X_n), \qquad E[\,|Y_{n+1} - f(X_n)|^2 \mid \mathcal{F}_n] = \sigma^2(X_n) , \]
where the function $f : \mathbb{R}^d \to \mathbb{R}^d$ is continuous and such that, for some $x^* \in \mathbb{R}^d$, $f(x^*) = 0$ and, for all $x \in \mathbb{R}^d$ such that $x \ne x^*$, $f(x) \cdot (x - x^*) < 0$.
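Theorem 13.3.41 Suppose in addition that $|f(x)| \le K(1 + |x|)$ for some constant $K > 0$, and
\[ \sum_{n \ge 0} \gamma_n = \infty \quad\text{and}\quad \sum_{n \ge 0} \gamma_n^2 \sigma^2(X_n) < \infty . \]
Then, $\lim_{n \uparrow \infty} X_n = x^*$.

Proof. As
\[ X_{n+1} = X_n + \gamma_n f(X_n) + \gamma_n (Y_{n+1} - f(X_n)) \qquad (n \ge 0) , \]
the process $\{M_n\}_{n \ge 0}$ defined by $M_0 := 0$ and
\[ M_n := \sum_{k=1}^{n} \gamma_{k-1} (Y_k - f(X_{k-1})) \qquad (n \ge 1) \]
is a square-integrable $\mathcal{F}_n$-martingale and
\[ \langle M \rangle_n = \sum_{k=1}^{n} \gamma_{k-1}^2 \sigma^2(X_{k-1}) . \]
In particular, since $\langle M \rangle_\infty < \infty$ by hypothesis, $\{M_n\}_{n \ge 0}$ converges to a finite limit. The result then follows from Lemma 13.3.40 with $\varepsilon_{n+1} = Y_{n+1} - f(X_n)$.

Example 13.3.42: Back to the dosage problem. Consider the algorithm (13.44). It is of the form $U_{n+1} = U_n + \gamma_n Y_{n+1}$ with $Y_{n+1} = \alpha - X_{n+1}$, so that we apply Theorem 13.3.41 with $f = \alpha - \Phi$. The conditions guaranteeing that $\lim_n U_n = u^*$ are therefore, besides $\Phi(u^*) = \alpha$ and $\Phi$ continuous, $(\Phi(u) - \alpha)(u - u^*) > 0$ for all $u \ne u^*$,
\[ \sum_{n \ge 0} \gamma_n = \infty , \qquad \sum_{n \ge 0} \gamma_n^2 < \infty \]
and, for some $K < \infty$, $E[g(u, \varepsilon)]^2 \le K(1 + u^2)$.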
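The recursion (13.44) can be tried numerically. The following is a minimal sketch, not a reference implementation: the response function `toy_response` (with $\Phi(u) = 2u + 1$), the Gaussian noise and the step sizes $\gamma_n = 1/(n+1)$ are assumptions chosen for illustration only. They satisfy $\sum \gamma_n = \infty$ and $\sum \gamma_n^2 < \infty$.

```python
import random


def robbins_monro(g, alpha, u0, n_steps, rng):
    """Run U_{n+1} = U_n - gamma_n * (X_{n+1} - alpha) and return the last iterate.

    g(u, eps) is the noisy response; gamma_n = 1 / (n + 1) is one step-size
    choice with sum(gamma_n) = inf and sum(gamma_n^2) < inf.
    """
    u = u0
    for n in range(n_steps):
        eps = rng.gauss(0.0, 1.0)   # assumed noise model: iid standard Gaussian
        x = g(u, eps)               # experimental output X_{n+1}
        gamma = 1.0 / (n + 1)
        u = u - gamma * (x - alpha)
    return u


def toy_response(u, eps):
    # Hypothetical dose-response: Phi(u) = E[g(u, eps)] = 2u + 1 (increasing).
    return 2.0 * u + 1.0 + eps


rng = random.Random(0)
u_hat = robbins_monro(toy_response, alpha=3.0, u0=0.0, n_steps=20000, rng=rng)
print(u_hat)  # close to u* = 1, the solution of Phi(u) = 3
```

With these choices the target is $u^* = 1$; other step sizes of the form $c/(n+1)$ satisfy the same two summability conditions.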
13.4 Continuous-time Martingales

The definition of a martingale in continuous time is similar to the one in discrete time, and we shall see that most of the results in discrete time find counterparts in continuous time. Let $\{\mathcal{F}_t\}_{t \ge 0}$ be a history (or filtration) on $\mathbb{R}_+$.
Definition 13.4.1 A complex stochastic process $\{Y(t)\}_{t \ge 0}$ such that for all $t \in \mathbb{R}_+$ (i) $Y(t)$ is $\mathcal{F}_t$-measurable, and (ii) $E[|Y(t)|] < \infty$, is called a $(P, \mathcal{F}_t)$-martingale (resp., submartingale, supermartingale) if for all $s, t \in \mathbb{R}_+$ such that $s \le t$,
\[ E[Y(t) \mid \mathcal{F}_s] = Y(s) \qquad (\text{resp., } \ge Y(s),\ \le Y(s)) . \tag{13.45} \]
When $Y(t) \ge 0$ for all $t \in \mathbb{R}_+$, the integrability condition is not required.

Example 13.4.2: Compensated counting process. The counting process $\{N(t)\}_{t \ge 0}$ of Example 5.1.5 is such that
\[ Y(t) := N(t) - \lambda t \qquad (t \ge 0) \]
is an $\mathcal{F}_t^Y$-martingale (Exercise 13.5.24).

This result admits a converse:

Theorem 13.4.3 ([Watanabe, 1964]) Let $N$ be a simple locally finite point process on $\mathbb{R}_+$ such that for some filtration $\{\mathcal{F}_t\}_{t \ge 0}$ the stochastic process
\[ M(t) := N(t) - \lambda t \qquad (t \in \mathbb{R}_+) \]
is an $\mathcal{F}_t$-martingale. Then $N$ is a homogeneous Poisson process with intensity $\lambda$, and for any interval $(a, b] \subset \mathbb{R}_+$, $N((a, b])$ is independent of $\mathcal{F}_a$.

Proof. In view of Theorem 7.1.8, it suffices to show that for all $T > 0$, and for all nonnegative bounded real-valued stochastic processes $\{Z(t)\}_{t \ge 0}$ with left-continuous trajectories and adapted to $\{\mathcal{F}_t\}_{t \ge 0}$,
\[ E\left[ \int_{(0,T]} Z(t)\, N(dt) \right] = E\left[ \int_0^T Z(t) \lambda \, dt \right] . \tag{$\star$} \]
The proof then is along the same lines as that of Theorem 7.1.8. Equality ($\star$) is true for $Z(t, \omega) := 1_A(\omega) 1_{(a,b]}(t)$ for any interval $(a, b] \subset \mathbb{R}_+$ and any $A \in \mathcal{F}_a$, since in this case ($\star$) reads
\[ E[1_A N((a, b])] = E[1_A \lambda (b - a)] , \]
that is, since $A$ is arbitrary in $\mathcal{F}_a$, $E[N((a, b]) \mid \mathcal{F}_a] = \lambda (b - a)$, which is the martingale hypothesis. The extension to nonnegative bounded real-valued stochastic processes $\{Z(t)\}_{t \ge 0}$ with left-continuous trajectories is then done as in the proof of Theorem 7.1.8, via the approximation (7.5).
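The martingale property of the compensated process $N(t) - \lambda t$ can be checked by simulation. The sketch below is an illustration under assumptions that are not in the text: the Poisson process is built from iid exponential gaps, and the values $\lambda = 2$, $s = 2$, $t = 5$ are arbitrary. It estimates $E[M(t)]$ and $E[(M(t) - M(s)) M(s)]$, both of which vanish for a square-integrable martingale started at 0.

```python
import random


def poisson_points(lam, horizon, rng):
    """Points of a homogeneous Poisson process on (0, horizon], from iid Exp(lam) gaps."""
    points, clock = [], rng.expovariate(lam)
    while clock <= horizon:
        points.append(clock)
        clock += rng.expovariate(lam)
    return points


def compensated_moments(lam, s, t, n_paths, rng):
    """Monte Carlo estimates of E[M(t)] and E[(M(t) - M(s)) M(s)], with M(u) = N(u) - lam * u."""
    mean_mt = 0.0
    mean_cross = 0.0
    for _ in range(n_paths):
        pts = poisson_points(lam, t, rng)
        n_s = sum(1 for p in pts if p <= s)   # N(s)
        m_s = n_s - lam * s
        m_t = len(pts) - lam * t
        mean_mt += m_t
        mean_cross += (m_t - m_s) * m_s
    return mean_mt / n_paths, mean_cross / n_paths


rng = random.Random(1)
mean_mt, mean_cross = compensated_moments(lam=2.0, s=2.0, t=5.0, n_paths=20000, rng=rng)
print(mean_mt, mean_cross)  # both should be near 0
```

The second estimate tests orthogonality of the increment $M(t) - M(s)$ to $M(s)$, which is a consequence of the martingale property but weaker than it.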
Theorem 13.4.4 A supermartingale with constant mean is a martingale.

Proof. This follows from the fact that two integrable random variables $X$ and $Y$ such that $X \le Y$ and $E[X] = E[Y]$ are almost surely equal.

Definition 13.4.5 A complex stochastic process $\{Y(t)\}_{t \ge 0}$ is called a $(P, \mathcal{F}_t)$-local martingale (resp., local submartingale, local supermartingale) if there exists a nondecreasing sequence of $\mathcal{F}_t$-stopping times $\{\tau_n\}_{n \ge 1}$ (the localizing sequence) such that (a) $\lim_{n \uparrow \infty} \tau_n = \infty$, and (b) for all $n \ge 1$, $\{Y(t \wedge \tau_n)\}_{t \ge 0}$ is a $(P, \mathcal{F}_t)$-martingale (resp., submartingale, supermartingale).

Theorem 13.4.6 A nonnegative local martingale is a supermartingale.

Proof. Let $\{T_n\}_{n \ge 1}$ be a localizing sequence of the local martingale $\{M(t)\}_{t \ge 0}$. By Fatou's lemma, for $s \le t$,
\[ E[M(t) \mid \mathcal{F}_s] = E\left[ \lim_n M(t \wedge T_n) \mid \mathcal{F}_s \right] \le \liminf_n E[M(t \wedge T_n) \mid \mathcal{F}_s] = \liminf_n M(s \wedge T_n) = M(s) . \]

The following characterization of the martingale property can be viewed as a kind of converse of Doob's optional sampling theorem. It will be referred to as Komatsu's lemma.

Lemma 13.4.7 Let $\{\mathcal{F}_t\}_{t \ge 0}$ be a history. A real-valued $\mathcal{F}_t$-progressive stochastic process $\{X(t)\}_{t \ge 0}$ with the property that, for all bounded $\mathcal{F}_t$-stopping times $T$ such that $X(T)$ is integrable, $E[X(T)] = E[X(0)]$, is an $\mathcal{F}_t$-martingale.

Proof. Exercise 13.5.26.
13.4.1 From Discrete Time to Continuous Time

Many of the results given for discrete-time martingales extend easily to continuous-time right-continuous martingales. For the extension of Kolmogorov's and Doob's inequalities to right-continuous martingales, it suffices to observe that for a right-continuous process $\{X(t)\}_{t \ge 0}$, $\sup_{t \in \mathbb{R}_+} X(t) = \sup_{t \in \mathbb{Q}_+} X(t)$, where $\mathbb{Q}_+$ is the set of nonnegative rational numbers.

Theorem 13.4.8 (Kolmogorov's inequality) Let $\{Y(t)\}_{t \ge 0}$ be a right-continuous $\mathcal{F}_t$-submartingale. Then, for all $\lambda \in \mathbb{R}$ and all $a \in \mathbb{R}_+$,
\[ \lambda P\left( \sup_{0 \le t \le a} Y(t) > \lambda \right) \le E\left[ Y(a) 1_{\{\sup_{0 \le t \le a} Y(t) > \lambda\}} \right] . \tag{13.46} \]
In particular, if $\{M(t)\}_{t \ge 0}$ is a right-continuous $\mathcal{F}_t$-martingale, then, for all $p \ge 1$, all $\lambda \in \mathbb{R}$ and all $a \in \mathbb{R}_+$,
\[ \lambda^p P\left( \sup_{0 \le t \le a} |M(t)| > \lambda \right) \le E[|M(a)|^p] . \tag{13.47} \]

Theorem 13.4.9 (Doob's inequality) Let $\{M(t)\}_{t \ge 0}$ be a right-continuous $\mathcal{F}_t$-martingale. For all $p > 1$ and all $a \in \mathbb{R}_+$,
\[ \| M(a) \|_p \le \left\| \sup_{0 \le t \le a} |M(t)| \right\|_p \le q \, \| M(a) \|_p , \tag{13.48} \]
where $q$ is defined by $\frac{1}{p} + \frac{1}{q} = 1$.
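Since the supremum over $[0, a]$ reduces to a supremum over a countable set of times, the two-sided bound in Doob's inequality can be illustrated on a discrete-time martingale. The sketch below is only an illustration under assumed choices (a simple symmetric random walk, $p = q = 2$, 200 steps, 5000 paths); it estimates both norms by Monte Carlo.

```python
import math
import random


def doob_l2_check(n_steps, n_paths, rng):
    """Estimate ||M_n||_2 and ||max_k |M_k| ||_2 for a simple symmetric random walk M."""
    sum_end_sq = 0.0
    sum_max_sq = 0.0
    for _ in range(n_paths):
        m, running_max = 0, 0
        for _ in range(n_steps):
            m += 1 if rng.random() < 0.5 else -1
            running_max = max(running_max, abs(m))
        sum_end_sq += m * m
        sum_max_sq += running_max * running_max
    return math.sqrt(sum_end_sq / n_paths), math.sqrt(sum_max_sq / n_paths)


rng = random.Random(2)
norm_end, norm_max = doob_l2_check(n_steps=200, n_paths=5000, rng=rng)
q = 2.0  # conjugate exponent of p = 2
print(norm_end, norm_max, q * norm_end)  # expected ordering: first <= second <= third
```

The left inequality holds path by path, since $|M_n| \le \max_k |M_k|$; only the right inequality uses the martingale structure.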
For the martingale convergence theorem, it suffices to show that the upcrossing inequality holds true. This is done in the proof of the following extension to continuous time of the discrete-time result. As above, one takes advantage of the right-continuity assumption.

Theorem 13.4.10 Let $\{Y(t)\}_{t \ge 0}$ be a right-continuous $\mathcal{F}_t$-submartingale, $L^1$-bounded, that is, such that
\[ \sup_{t \ge 0} E[|Y(t)|] < \infty . \tag{13.49} \]
Then $\{Y(t)\}_{t \ge 0}$ converges $P$-a.s. as $t \uparrow \infty$ to an integrable random variable $Y(\infty)$.

Proof. Let $D_n := \{ k/2^n \}_{k \in \mathbb{N}}$ and $D := \cup_{n \in \mathbb{N}} D_n$. For given $0 \le a < b$, let $\nu_n([a, b], K)$ and $\nu([a, b], K)$ be the number of upcrossings of $[a, b]$ respectively by $\{Y(t)\}_{t \in D_n \cap [0,K]}$ and $\{Y(t)\}_{t \in D \cap [0,K]}$. The upcrossing inequality for discrete-time submartingales gives
\[ (b - a) E[\nu_n([a, b], K)] \le E[(Y(K) - a)^+] , \]
and therefore, passing to the limit as $n \uparrow \infty$,
\[ (b - a) E[\nu([a, b], K)] \le E[(Y(K) - a)^+] . \]
By the right-continuity assumption, the number $\nu([a, b])$ of upcrossings of $[a, b]$ by $\{Y(t)\}_{t \ge 0}$ satisfies $\nu([a, b]) = \sup_{K \in \mathbb{N}} \nu([a, b], K)$ and therefore
\[ (b - a) E[\nu([a, b])] \le \sup_{K \in \mathbb{N}} E[(Y(K) - a)^+] < \infty . \]
The rest of the proof is then as in Theorem 13.3.2.

The continuous-time extensions of Theorems 13.3.8 and 13.3.14, and of Corollary 13.3.18, follow from Theorem 13.4.10 in the same way as their original discrete-time counterparts follow from Theorem 13.3.2. We leave to the reader the task of formulating these extensions. The next result is the extension of Theorem 13.3.10 to continuous time, and its proof is left to the reader.

Theorem 13.4.11 Let $\{Y(t)\}_{t \ge 0}$ be a right-continuous $\mathcal{F}_t$-submartingale, uniformly integrable. Then $\{Y(t)\}_{t \ge 0}$ converges a.s. and in $L^1$ to an integrable random variable denoted by $Y(\infty)$, and $E[Y(\infty) \mid \mathcal{F}_t] \ge Y(t)$.

The above theorems required only a slight adaptation of their discrete-time versions. For the results that are stated in terms of convergence in a metric space, the adaptation to continuous time is even more immediate, and is based on the following lemma of analysis.

Lemma 13.4.12 Let $(E, d)$ be a complete metric space. A family $\{x_t\}_{t \ge 0}$ of elements of $E$ converges to some element $x \in E$ as $t \uparrow \infty$ if and only if for any nondecreasing sequence of times $\{t_n\}_{n \ge 1}$ increasing to $\infty$ as $n \uparrow \infty$, the sequence $\{x_{t_n}\}_{n \ge 1}$ converges to $x$ as $n \uparrow \infty$.
Therefore any statement of convergence as $t \uparrow \infty$ that can be expressed in terms of convergence in a complete metric space can be obtained from the discrete-time version. This is the case for convergence in $L^p_{\mathbb{C}}(P)$ ($p \ge 1$) and also for convergence in probability (which can indeed be expressed in terms of convergence in a complete metric space, see Theorem 4.2.4).

We now proceed to the statement and proof of Doob's optional sampling theorem in continuous time, which requires a little more work and the use of the reverse martingale convergence theorem.

Theorem 13.4.13 Let $\{Y(t)\}_{t \ge 0}$ be a uniformly integrable right-continuous $\mathcal{F}_t$-submartingale (resp., martingale), and let $S$ and $T$ be $\mathcal{F}_t$-stopping times such that $S \le T$. Then
\[ E[Y(T) \mid \mathcal{F}_S] \ge (\text{resp., } =)\ Y(S) . \]

Proof. For any $\mathcal{F}_t$-stopping time $\tau$, let $\tau(n)$ be the approximation of the stopping time $\tau$ given in Theorem 5.3.13. Recall that this is an $\mathcal{F}_t$-stopping time decreasing to $\tau$ as $n \uparrow \infty$. Now, fix $n$ and let $\mathcal{G}_k^{(n)} := \mathcal{F}_{k/2^n}$ ($k \in \mathbb{N}$). Observe that $\tau(n)$ is a $\mathcal{G}_k^{(n)}$-stopping time. From Doob's optional sampling theorem applied to the $\mathcal{G}_k^{(n)}$-submartingale $\{Y(k/2^n)\}_{k \ge 0}$,
\[ E[Y(T(n)) \mid \mathcal{F}_{S(n)}] \ge Y(S(n)) . \]
In particular, for all $A \in \mathcal{F}_S \subseteq \mathcal{F}_{S(n)} \subseteq \mathcal{F}_{T(n)}$ (by (v) of Theorem 5.3.19),
\[ E[1_A Y(T(n))] \ge E[1_A Y(S(n))] . \]
Let $Z_{-n} := Y(T(n))$ and $\mathcal{A}_{-n} := \mathcal{F}_{T(n)}$ ($n \ge 0$). By the reverse martingale convergence theorem (Theorem 13.3.14) applied to the submartingale $\{Z_n\}_{n \le 0}$ adapted to the filtration $\{\mathcal{A}_n\}_{n \le 0}$, the latter converges to $Y(T)$ almost surely (right-continuity hypothesis) and also in $L^1$ (Theorem 13.3.10). A similar statement holds for $S$, and therefore we can pass to the limit in the inequality $E[1_A Y(T(n))] \ge E[1_A Y(S(n))]$ to obtain
\[ E[1_A Y(T)] \ge (\text{resp., } =)\ E[1_A Y(S)] . \]

Example 13.4.14: The Risk Model, Take 4. The probability of ruin corresponding to an initial capital $u$ is
\[ \Psi(u) := P(u + X(t) < 0 \text{ for some } t > 0) . \]
It is a simple exercise (Exercise 13.5.25) to show that
\[ E\left[ e^{-rX(t)} \right] = e^{t g(r)} , \]
where
\[ g(r) = \lambda h(r) - rc = \lambda \left( \int_0^\infty e^{rv} \, dG(v) - 1 \right) - rc . \]
For any $u \in \mathbb{R}_+$,
\[ M_u(t) := \frac{e^{-r(u + X(t))}}{e^{t g(r)}} \qquad (t \in \mathbb{R}_+) \tag{13.50} \]
is an $\mathcal{F}_t^X$-martingale. Indeed, for $0 \le s \le t < \infty$,
\begin{align*}
E\left[ M_u(t) \mid \mathcal{F}_s^X \right] &= E\left[ \frac{e^{-r(u + X(t))}}{e^{t g(r)}} \,\Big|\, \mathcal{F}_s^X \right] \\
&= E\left[ \frac{e^{-r(u + X(s))}}{e^{s g(r)}} \, \frac{e^{-r(X(t) - X(s))}}{e^{(t - s) g(r)}} \,\Big|\, \mathcal{F}_s^X \right] \\
&= M_u(s) \, E\left[ \frac{e^{-r(X(t) - X(s))}}{e^{(t - s) g(r)}} \,\Big|\, \mathcal{F}_s^X \right] = M_u(s) .
\end{align*}
For all $u \ge 0$,
\[ T_u := \inf\{ t \ge 0 \,;\, u + X(t) < 0 \} \]
(with the usual convention that the infimum of an empty set is infinite) is an $\mathcal{F}_t^X$-stopping time. For any $t_0 < \infty$, since $T_u \wedge t_0$ is a bounded stopping time and since $\{M_u(t)\}_{t \ge 0}$ is a (positive) martingale, we may apply Doob's optional stopping theorem:
\begin{align*}
e^{-ru} = M_u(0) &= E[M_u(T_u \wedge t_0)] \\
&= E[M_u(T_u \wedge t_0) \mid T_u \wedge t_0 < t_0] \, P(T_u \wedge t_0 < t_0) + E[M_u(T_u \wedge t_0) \mid T_u \wedge t_0 \ge t_0] \, P(T_u \wedge t_0 \ge t_0) \\
&\ge E[M_u(T_u \wedge t_0) \mid T_u \wedge t_0 < t_0] \, P(T_u \wedge t_0 < t_0) \\
&= E[M_u(T_u) \mid T_u < t_0] \, P(T_u < t_0) .
\end{align*}
But $u + X(T_u) \le 0$ on $\{T_u < \infty\}$, and therefore
\[ P(T_u < t_0) \le \frac{e^{-ru}}{E[M_u(T_u) \mid T_u < t_0]} \le \frac{e^{-ru}}{E\left[ e^{-T_u g(r)} \mid T_u < t_0 \right]} \le e^{-ru} \sup_{0 \le t \le t_0} e^{t g(r)} . \]
Letting $t_0 \to \infty$,
\[ \Psi(u) \le e^{-ru} \sup_{t \ge 0} e^{t g(r)} . \tag{$\star\star$} \]
We choose, under the assumption that $\sup_{t \ge 0} e^{t g(r)} < \infty$ (that is, $g(r) \le 0$), the $r$ minimizing the right-hand side of ($\star\star$), that is
\[ R = \sup\{ r \,;\, g(r) \le 0 \} , \]
i.e. the positive solution of
\[ h(r) = \frac{cr}{\lambda} . \]
This is the celebrated Lundberg's inequality: $\Psi(u) \le e^{-Ru}$.
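The exponent $R$ in Lundberg's inequality is the positive root of $g$, and it can be computed numerically once the claim size distribution $G$ is fixed. The sketch below assumes, purely for illustration, exponential claims with rate `beta` (so that $h(r) = \beta/(\beta - r) - 1$ for $r < \beta$) and the parameter values $\lambda = 1$, $c = 2$, $\beta = 1$; none of these choices comes from the text. Under this assumption the root has the closed form $R = \beta - \lambda/c$, which the bisection should reproduce.

```python
import math


def g(r, lam, c, beta):
    """g(r) = lam * h(r) - r * c for Exp(beta) claims, with h(r) = beta / (beta - r) - 1, r < beta."""
    h = beta / (beta - r) - 1.0
    return lam * h - r * c


def adjustment_coefficient(lam, c, beta, tol=1e-12):
    """Positive root R of g, found by bisection on (0, beta); needs c > lam / beta."""
    if c <= lam / beta:
        raise ValueError("net profit condition c > lam / beta is violated")
    lo, hi = 1e-9, beta - 1e-9      # g(lo) < 0 < g(hi) under the net profit condition
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid, lam, c, beta) <= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


lam, c, beta = 1.0, 2.0, 1.0
R = adjustment_coefficient(lam, c, beta)
u = 5.0
lundberg_bound = math.exp(-R * u)   # upper bound on the ruin probability Psi(u)
print(R, lundberg_bound)
```

For other claim distributions only the function `g` changes, provided the exponential moment $\int_0^\infty e^{rv}\,dG(v)$ is finite on the bracketing interval.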
Theorem 13.4.15 Let $\{Y(t)\}_{t \ge 0}$ be an $\mathcal{F}_t$-martingale and let $T$ be an $\mathcal{F}_t$-stopping time. Then $\{Y(t \wedge T)\}_{t \ge 0}$ is an $\mathcal{F}_t$-martingale.

Proof. First suppose $\{Y(t)\}_{t \ge 0}$ is uniformly integrable. By Theorem 13.4.11, $E[Y(\infty) \mid \mathcal{F}_{t \vee T}] = Y(t \vee T)$, and therefore
\[ E[Y(\infty) - Y(T) \mid \mathcal{F}_{t \vee T}] = Y(t \vee T) - Y(T) = 1_{\{T \le t\}} (Y(t) - Y(t \wedge T)) = Y(t) - Y(t \wedge T) . \]
This variable is $\mathcal{F}_t$-measurable and therefore
\[ E[Y(\infty) - Y(T) \mid \mathcal{F}_t] = Y(t) - Y(t \wedge T) , \]
that is, since $E[Y(\infty) \mid \mathcal{F}_t] = Y(t)$,
\[ E[Y(T) \mid \mathcal{F}_t] = Y(t \wedge T) . \]
We now get rid of the uniform integrability assumption. For any $a \ge 0$, the $\mathcal{F}_{t \wedge a}$-martingale $\{Y(t \wedge a)\}_{t \ge 0}$ is uniformly integrable (Theorem 4.2.13) and therefore, by Theorem 13.4.13, for $t \le a$,
\[ E[Y(T \wedge a) \mid \mathcal{F}_t] = Y(t \wedge T \wedge a) = Y(t \wedge T) . \]
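Predictable Quadratic Variation Processes

Definition 13.4.16 Let $\{\mathcal{F}_t\}_{t \ge 0}$ be a history. Let $\{M(t)\}_{t \ge 0}$ be a local $\mathcal{F}_t$-martingale. A nondecreasing $\mathcal{F}_t$-predictable stochastic process $\{\langle M \rangle(t)\}_{t \ge 0}$ such that $\{M(t)^2 - \langle M \rangle(t)\}_{t \ge 0}$ is a local $\mathcal{F}_t$-martingale is called the predictable quadratic variation process of the local martingale $\{M(t)\}_{t \ge 0}$.

The following result of martingale theory is quoted without proof.

Theorem 13.4.17 Let $\{\mathcal{F}_t\}_{t \ge 0}$ be a history. Let $\{M(t)\}_{t \ge 0}$ be a local square-integrable $\mathcal{F}_t$-martingale with predictable quadratic variation process $\{\langle M \rangle(t)\}_{t \ge 0}$.

a. If $\langle M \rangle(\infty) < \infty$, then $M(t)$ converges to a finite limit as $t \uparrow \infty$.

b. If $\langle M \rangle(\infty) = \infty$, then
\[ \lim_{t \uparrow \infty} \frac{M(t)}{\langle M \rangle(t)} = 0 . \]

13.4.2 The Banach Space $\mathcal{M}_p$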
Definition 13.4.18 Let $p \ge 1$. An $\mathcal{F}_t$-martingale $\{M(t)\}_{t \ge 0}$ is called $p$-integrable if
\[ \sup_{t \ge 0} E[|M(t)|^p] < \infty . \]
It is called $p$-integrable on the finite interval $[0, a]$ if $E[|M(a)|^p] < \infty$.

Remark 13.4.19 When $p > 1$, the condition of Definition 13.4.18 implies that this martingale is uniformly integrable (Theorem 4.2.14). Note that when $p \ge 1$, $E[|M(a)|^p] < \infty$ implies $\sup_{t \in [0,a]} E[|M(t)|^p] < \infty$ since $\{|M(t)|^p\}_{t \ge 0}$ is then an $\mathcal{F}_t$-submartingale.

For $p \ge 1$, let $\mathcal{M}_p([0, 1])$ be the collection of $p$-integrable $\mathcal{F}_t$-martingales over $[0, 1]$. We shall not distinguish between versions, that is to say, an element of $\mathcal{M}_p([0, 1])$ is an equivalence class for the equivalence $M \sim M'$ defined by $M(t) = M'(t)$ $P$-a.s. for all $t \in [0, 1]$.

Theorem 13.4.20 For $p \ge 1$, $\mathcal{M}_p([0, 1])$ is a Banach space for the norm
\[ \| M \|_p = \left( E[|M(1)|^p] \right)^{1/p} . \tag{13.51} \]

Proof. First, we verify that (13.51) defines a norm. Only the fact that $\| M \|_p = 0$ implies $M = 0$ (that is, $P(M(t) = 0) = 1$ for all $t \in [0, 1]$) is perhaps not obvious. By Jensen's inequality for conditional expectations, for all $t \in [0, 1]$,
\[ E[|M(t)|^p] = E\left[ |E[M(1) \mid \mathcal{F}_t]|^p \right] \le E\left[ E[|M(1)|^p \mid \mathcal{F}_t] \right] = E[|M(1)|^p] = 0 , \]
which implies in particular that $P(M(t) = 0) = 1$ for all $t \in [0, 1]$.

Let now $\{M_n\}_{n \ge 1}$ be a Cauchy sequence of $\mathcal{M}_p([0, 1])$, that is,
\[ \lim_{k, n \uparrow \infty} E[|M_n(1) - M_k(1)|^p] = 0 . \]
By the same Jensen-type argument as above, for all $t \in [0, 1]$,
\[ \lim_{k, n \uparrow \infty} E[|M_n(t) - M_k(t)|^p] = 0 . \]
Therefore, for all $t \in [0, 1]$, there exists a limit in $L^p_{\mathbb{R}}(P, \mathcal{F}_t)$ of the sequence of random variables $\{M_n(t)\}_{n \ge 1}$ that we call $M(t)$. It remains to show that the process $\{M(t)\}_{t \in [0,1]}$ so defined is an $\mathcal{F}_t$-martingale, that is, for all $[a, b] \subset [0, 1]$ and all $A \in \mathcal{F}_a$,
\[ E[1_A M(b)] = E[1_A M(a)] . \]
This follows from the assumption that for all $n \ge 1$, $\{M_n(t)\}_{t \ge 0}$ is an $\mathcal{F}_t$-martingale (and therefore $E[1_A M_n(b)] = E[1_A M_n(a)]$) and from the fact that for all $t \in [0, 1]$, $M_n(t)$ tends to $M(t)$ in $L^p_{\mathbb{R}}(P)$, so that $\lim_{n \uparrow \infty} E[1_A M_n(a)] = E[1_A M(a)]$ and $\lim_{n \uparrow \infty} E[1_A M_n(b)] = E[1_A M(b)]$.
13.4.3 Time Scaling
In this subsection, we deﬁne changes of the time scale “adapted” to a given history. Deﬁnition 13.4.21 Let {Ft }t≥0 be a history. A process A = {A(t)}t≥0 is called a standard nondecreasing stochastic process if it has nondecreasing rightcontinuous trajectories t → A(t, ω) and if moreover A(0, ω) ≡ 0. A standard nondecreasing process {T (t)}t≥0 is called an Ft change of time if, for all t ≥ 0, T (t) is an Ft stopping time. Theorem 13.4.22 Let {Ft }t≥0 be a rightcontinuous history and let {T (t)}t≥0 be an Ft change of time. Then, the family {FT (t) }t≥0 is a rightcontinuous history. Moreover, if the stochastic process {X(t)}t≥0 is Ft progressive, then the stochastic process {Y (t)}t≥0 deﬁned by Y (t) = X(T (t))1{T (t) t}. If A is adapted to the rightcontinuous history {Ft }t≥0 , C is an Ft change of time. Proof. Indeed, for all s ≥ 0, all a ≥ 0, {C(t) < a} = ∪n≥1 {A(a −
1 ) > t} ∈ ∨n≥1 Fa− 1 = Fa− ⊆ Fa . n n
Theorem 13.4.24 Let X = {X(t)}t≥0 and Y = {Y (t)}t≥0 be two stochastic processes adapted to the rightcontinuous history {Ft }t≥0 , and such that for any Ft stopping time S, E X(S)1{S 0. Prove the following inequality: E Xn2 . P max Xk > λ ≤ 0≤k≤n E [Xn2 ] + λ2
Hint: With $c > 0$, work with the sequence $\{(X_n + c)^2\}_{n \ge 0}$ and then select an appropriate $c$.

Exercise 13.5.19. An extension of Hoeffding's inequality
Let $\{M_n\}_{n \ge 0}$ be a real $\mathcal{F}_n^X$-martingale such that, for some sequence $d_1, d_2, \ldots$ of real numbers,
\[ P(B_n \le M_n - M_{n-1} \le B_n + d_n) = 1, \qquad n \ge 1, \]
where for each $n \ge 1$, $B_n$ is a function of $X_0^{n-1}$. Prove that, for all $x \ge 0$,
\[ P(|M_n - M_0| \ge x) \le 2 \exp\left( -2x^2 \Big/ \sum_{i=1}^{n} d_i^2 \right) . \]
Exercise 13.5.20. The derivative of a Lipschitz continuous function
Let $f : [0, 1) \to \mathbb{R}$ satisfy a Lipschitz condition, that is,
\[ |f(x) - f(y)| \le M |x - y| \qquad (x, y \in [0, 1)) , \]
where $M < \infty$. Let $\Omega = [0, 1)$, $\mathcal{F} = \mathcal{B}([0, 1))$ and let $P$ be the Lebesgue measure on $[0, 1)$. Let for all $n \ge 1$
\[ \xi_n(\omega) := \sum_{k=1}^{2^n} (k - 1) 2^{-n} \, 1_{[(k-1)2^{-n},\, k2^{-n})}(\omega) \]
and $\mathcal{F}_n = \sigma(\xi_k \,;\, 1 \le k \le n)$.

(i) Show that $\mathcal{F}_n = \sigma(\xi_n)$ and $\vee_n \mathcal{F}_n = \mathcal{B}([0, 1))$.

(ii) Let
\[ X_n := \frac{f(\xi_n + 2^{-n}) - f(\xi_n)}{2^{-n}} . \]
Show that $\{X_n\}_{n \ge 1}$ is a uniformly integrable $\mathcal{F}_n$-martingale.

(iii) Show that there exists a measurable function $g : [0, 1) \to \mathbb{R}$ such that $X_n \to g$ $P$-almost surely and that $X_n = E[g \mid \mathcal{F}_n]$.

(iv) Show that for all $n \ge 1$ and all $k$ ($1 \le k \le 2^n$)
\[ f(k 2^{-n}) - f(0) = \int_0^{k 2^{-n}} g(x) \, dx \]
and deduce from this that
\[ f(x) - f(0) = \int_0^x g(y) \, dy \qquad (x \in [0, 1)) . \]
Exercise 13.5.21. A non-uniformly integrable martingale
Let $\{X_n\}_{n \ge 0}$ be a sequence of iid random variables such that $P(X_n = 0) = P(X_n = 2) = \frac{1}{2}$ ($n \ge 0$). Define
\[ Z_n := \prod_{j=1}^{n} X_j \qquad (n \ge 0) . \]
Show that $\{Z_n\}_{n \ge 0}$ is an $\mathcal{F}_n^X$-martingale and prove that it is not uniformly integrable.

Exercise 13.5.22. The ballot problem via martingales
This exercise proposes an alternative proof for the ballot problem of Example 1.2.13. Let $k := a + b$ and let $D_n$ be the difference between the number of votes for A and the number of votes for B at time $n \ge 1$. Prove that
\[ X_n = \frac{D_{k-n}}{k - n} \qquad (1 \le n \le k) \]
is a martingale. Deduce from this that the probability that A leads throughout the voting process is $(a - b)/(a + b)$. Hint: $\tau := \inf\{ n \,;\, X_n = 0 \} \wedge (k - 1)$.

Exercise 13.5.23. A voting model
Let $G = (V, E)$ be a finite graph. Each vertex $v$ shelters a random variable $X_n(v)$ representing the opinion (0 or 1) at time $n$ of the voter located at this vertex. At each time $n$, an edge $\langle v, w \rangle$ is chosen at random, and one of the two vertices, again chosen at random (say $v$), reconsiders his opinion, passing from $X_n(v)$ to $X_{n+1}(v) = X_n(w)$. The initial opinions at time 0 are given. Let $Z_n$ be the total number of votes for 1 at time $n$. Show that $\{Z_n\}_{n \ge 1}$ is a martingale that converges in finite random time to a random variable $Z_\infty$ taking the values 0 or $|V|$, the probability that all opinions are eventually 1 being equal to the initial proportion of 1's.

Exercise 13.5.24. The fundamental martingale of an hpp
Prove that for the counting process $\{N(t)\}_{t \ge 0}$ of Example 5.1.5, $\{N(t) - \lambda t\}_{t \ge 0}$ is an $\mathcal{F}_t^Y$-martingale.

Exercise 13.5.25. Compound Poisson Processes
Let $N$ be an hpp on $\mathbb{R}_+$ with intensity $\lambda$ and point sequence $\{T_n\}_{n \ge 1}$. Let $\{Z_n\}_{n \ge 1}$ be an iid real-valued sequence independent of $N$, with common cdf $F$. Define for all $t \ge 0$,
\[ Y(t) = \sum_{n \ge 1} Z_n 1_{(0,t]}(T_n) . \]
(The process $\{Y(t)\}_{t \ge 0}$ is called a compound Poisson process.) Show that
\[ E\left[ e^{-rY(t)} \right] = e^{-\lambda t (1 - h(r))} , \]
where $h(r) := E\left[ e^{-rZ_1} \right] = \int_0^\infty e^{-rx} \, dF(x)$.

Exercise 13.5.26. Komatsu's lemma
Prove the following: A right-continuous real stochastic process $\{X(t)\}_{t \ge 0}$ adapted to the filtration $\{\mathcal{F}_t\}_{t \ge 0}$ is an $\mathcal{F}_t$-martingale if and only if for all bounded $\mathcal{F}_t$-stopping times $\tau$,
\[ E[X(\tau)] = E[X(0)] . \]
Hint: Show that for all $0 \le a \le b$ and all $A \in \mathcal{F}_a$, $\tau := a 1_A + b 1_{\overline{A}}$ defines an $\mathcal{F}_t$-stopping time.
Exercise 13.5.27. 0 is an absorbing state for nonnegative martingales
Prove that for a nonnegative martingale $\{M(t)\}_{t \ge 0}$, $\{M(s) = 0\} \subseteq \{M(t) = 0\}$ whenever $0 \le s < t$.

Exercise 13.5.28. Avoiding 0
Let $\{M(t)\}_{t \ge 0}$ be a right-continuous martingale.
A. Let $\tau := \inf\{ t \ge 0 \,;\, M(t) = 0 \}$. Prove that $M(\tau) = 0$ on $\{\tau < \infty\}$.
B. Prove that if $P(M(T) > 0) = 1$ for some $T \ge 0$, then $P(M(t) > 0 \text{ for all } t \le T) = 1$. (Hint: use the stopping times $T$ and $T \wedge \tau$.)

Exercise 13.5.29. p-integrable martingales
Prove that if $\{M(t)\}_{t \ge 0}$ is right-continuous and $p$-integrable, then
\[ \sup_{T \in \mathcal{T}} E[|M(T)|^p] < \infty , \]
where $\mathcal{T}$ is the collection of all finite $\mathcal{F}_t$-stopping times.

Exercise 13.5.30. Local martingales
Prove that if $\{M(t)\}_{t \ge 0}$ is a right-continuous local $\mathcal{F}_t$-martingale such that
\[ E\left[ \sup_{0 \le s \le t} |M(s)| \right] < \infty \qquad (t \ge 0) , \]
it is in fact a martingale.
Chapter 14
A Glimpse at Itô's Stochastic Calculus

The Itô integral is an extension of the Wiener integral to a class of nondeterministic integrands. It is the basic tool of the Itô stochastic calculus, of which this chapter is a brief introduction.

14.1 The Itô Integral

14.1.1 Construction

The Itô integral will be constructed via an isometric extension analogous to that used for the construction of the Wiener–Doob integral. Let $\mathcal{A}(\mathbb{R}_+)$ be the collection of $\mathcal{F}_t$-progressive complex-valued stochastic processes $\varphi = \{\varphi(t)\}_{t \ge 0}$ such that
\[ E\left[ \int_{\mathbb{R}_+} |\varphi(t)|^2 \, dt \right] < \infty . \]
We view $\mathcal{A}(\mathbb{R}_+)$ as a complex Hilbert space with inner product
\[ \langle \varphi_1, \varphi_2 \rangle_{\mathcal{A}(\mathbb{R}_+)} := E\left[ \int_{\mathbb{R}_+} \varphi_1(t) \varphi_2(t)^* \, dt \right] . \]
(To be more precise, an element of $\mathcal{A}(\mathbb{R}_+)$ is an equivalence class of such processes with respect to the equivalence relation $\varphi \sim \varphi'$ if and only if $\varphi(t, \omega) = \varphi'(t, \omega)$, $P(d\omega) \times dt$-a.e.)

For $T \ge 0$, let $\mathcal{A}([0, T])$ be the collection of $\mathcal{F}_t$-progressive complex-valued stochastic processes $\varphi$ such that
\[ E\left[ \int_0^T |\varphi(t)|^2 \, dt \right] < \infty , \]
and let $\mathcal{A}_{loc}$ be the collection of $\mathcal{F}_t$-progressive complex-valued stochastic processes $\varphi$ such that $\varphi \in \mathcal{A}([0, T])$ for all $T \ge 0$. Finally, let $\mathcal{B}_{loc}$ be the collection of $\mathcal{F}_t$-progressive complex-valued stochastic processes $\varphi$ such that
\[ \int_0^t |\varphi(s)|^2 \, ds < \infty \quad P\text{-a.s.} \qquad (t \ge 0) . \tag{14.1} \]

© Springer Nature Switzerland AG 2020. P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/9783030401832_14
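Definition 14.1.1 A stochastic process $\varphi := \{\varphi(t)\}_{t \ge 0}$ of the form
\[ \varphi(t, \omega) = \sum_{i=1}^{K-1} Z_i(\omega) 1_{(t_i, t_{i+1}]}(t) , \tag{14.2} \]
where $K \in \mathbb{N}_+$, $0 \le t_1 < t_2 < \cdots < t_K < \infty$ and where $Z_i$ ($1 \le i \le K - 1$) is a complex square-integrable $\mathcal{F}_{t_i}$-measurable random variable, is called an elementary $\mathcal{F}_t$-predictable process. Such stochastic processes are in $\mathcal{A}(\mathbb{R}_+)$.

Lemma 14.1.2 The vector subspace $\mathcal{G}$ of $\mathcal{A}(\mathbb{R}_+)$ consisting of the elementary $\mathcal{F}_t$-predictable processes is dense in $\mathcal{A}(\mathbb{R}_+)$.

Proof. First, consider the operators $P_n$ ($n \ge 1$) acting on the functions $f \in L^2_{\mathbb{C}}(\mathbb{R}_+)$ as follows:
\[ [P_n f](t) := \sum_{i=1}^{n^2} \left( n \int_{(i-1)/n}^{i/n} f(s) \, ds \right) 1_{(i/n, (i+1)/n]}(t) . \]
By Schwarz's inequality, for all $t \in (i/n, (i+1)/n]$,
\[ |[P_n f](t)|^2 = \left| \int_{(i-1)/n}^{i/n} n f(s) \, ds \right|^2 \le \int_{(i-1)/n}^{i/n} n^2 \, ds \times \int_{(i-1)/n}^{i/n} |f(s)|^2 \, ds = n \int_{(i-1)/n}^{i/n} |f(s)|^2 \, ds , \]
and therefore
\[ \int_{\mathbb{R}_+} |[P_n f](t)|^2 \, dt \le \int_{\mathbb{R}_+} |f(t)|^2 \, dt . \tag{$\star$} \]
Also
\[ P_n f \to f \ \text{in } L^2_{\mathbb{C}}(\mathbb{R}_+) \tag{$\star\star$} \]
for all functions $f \in C_c^0$ (continuous with compact support), and therefore for all $f \in L^2_{\mathbb{C}}(\mathbb{R}_+)$ by density of $C_c^0$ in $L^2_{\mathbb{C}}(\mathbb{R}_+)$.

Let now $\varphi$ be in $\mathcal{A}(\mathbb{R}_+)$. For fixed $\omega$, $[P_n \varphi](\cdot, \omega)$ is the function obtained by applying $P_n$ to the function $t \to \varphi(t, \omega)$. By ($\star$),
\[ E\left[ \int_{\mathbb{R}_+} |[P_n \varphi](t)|^2 \, dt \right] \le E\left[ \int_{\mathbb{R}_+} |\varphi(t)|^2 \, dt \right] < \infty , \]
and therefore the function $t \to [P_n \varphi](t, \omega)$ is in $L^2_{\mathbb{C}}(\mathbb{R}_+)$ for $P$-almost all $\omega$. The stochastic process $\{P_n \varphi(t)\}_{t \ge 0}$ is in $\mathcal{G}$ (note that by Theorem 5.3.9, $\int_{(i-1)/n}^{i/n} \varphi(s) \, ds$ is $\mathcal{F}_{i/n}$-measurable since $\{\varphi(t)\}_{t \ge 0}$ is $\mathcal{F}_t$-progressive).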
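As $n \uparrow \infty$, $\{[P_n \varphi](t)\}_{t \ge 0}$ converges in $\mathcal{A}(\mathbb{R}_+)$ to $\{\varphi(t)\}_{t \ge 0}$. In fact, by ($\star\star$), for $P$-almost all $\omega$,
\[ \int_{\mathbb{R}_+} |[P_n \varphi(\cdot, \omega)](t) - \varphi(t, \omega)|^2 \, dt \to 0 \]
and therefore
\[ \| P_n \varphi - \varphi \|^2_{\mathcal{A}(\mathbb{R}_+)} = E\left[ \int_{\mathbb{R}_+} |[P_n \varphi(\cdot, \omega)](t) - \varphi(t, \omega)|^2 \, dt \right] \to 0 \]
(by dominated convergence since, by ($\star$) and the triangle inequality,
\[ \int_{\mathbb{R}_+} |[P_n \varphi(\cdot, \omega)](t) - \varphi(t, \omega)|^2 \, dt \le \left( \| [P_n \varphi(\cdot, \omega)] \|_{L^2_{\mathbb{C}}(\mathbb{R}_+)} + \| \varphi(\cdot, \omega) \|_{L^2_{\mathbb{C}}(\mathbb{R}_+)} \right)^2 \le 4 \| \varphi(\cdot, \omega) \|^2_{L^2_{\mathbb{C}}(\mathbb{R}_+)} \]
and $E\left[ \| \varphi(\cdot, \omega) \|^2_{L^2_{\mathbb{C}}(\mathbb{R}_+)} \right] = \| \varphi \|^2_{\mathcal{A}(\mathbb{R}_+)} < \infty$).

Let $L^2_{0,\mathbb{C}}(P)$ be the Hilbert subspace of $L^2_{\mathbb{C}}(P)$ consisting of the complex centered square-integrable variables. Define the mapping $I : \mathcal{G} \to L^2_{0,\mathbb{C}}(P)$ by
\[ I(\varphi) := \sum_{i=1}^{K-1} Z_i (W(t_{i+1}) - W(t_i)) . \tag{14.3} \]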
One veriﬁes that for all ϕ ∈ G, E [I(ϕ)] = 0. Also, for all ϕ1 , ϕ2 ∈ G, E [I(ϕ1 )I(ϕ2 ) ] = ϕ1 , ϕ2 A(R+ ) . Proof. By polarization, it suﬃces to treat the case ϕ1 = ϕ2 = ϕ and to write K−1 E I(ϕ)2 = E Zi 2 (W (ti+1 ) − W (ti ))2 i=1 K−1
+2
E [Zi Z (W (ti+1 ) − W (ti ))(W (t+1 ) − W (t ))]
i 0, 2 T . 1 P sup Mn (t) − Mm (t) > a ≤ 2 E ([Pn ϕ](s) − [Pm ϕ](s))2 dt , a 0 t∈[0,T ] a quantity that tends to 0 as n, m → ∞. We can therefore ﬁnd a sequence {nk }k≥1 strictly increasing to ∞ such that
\[ P\left( \sup_{t \in [0,T]} |M_{n_k}(t) - M_{n_{k-1}}(t)| > 2^{-k} \right) \le 2^{-k} , \]
so that, by the Borel–Cantelli lemma,
\[ P\left( \sup_{t \in [0,T]} |M_{n_k}(t) - M_{n_{k-1}}(t)| > 2^{-k} \ \text{i.o.} \right) = 0 . \]
Therefore for $P$-almost all $\omega$, there exists a finite integer $K(\omega)$ such that for $k > K(\omega)$,
\[ \sup_{t \in [0,T]} |M_{n_k}(t) - M_{n_{k-1}}(t)| \le 2^{-k} . \]
This implies that for $P$-almost all $\omega$ the function $t \to M_{n_k}(t, \omega)$ converges uniformly on $[0, T]$ to a function $t \to M_\infty(t, \omega)$ which is continuous (as a uniform limit of continuous functions). Since $M_n(t) \to \int_0^t \varphi(s) \, dW(s)$ in quadratic mean and $M_{n_k}(t) \to M_\infty(t)$ $P$-a.s., both limits are $P$-a.s. equal, and therefore the stochastic process $\{M_\infty(t)\}_{t \ge 0}$ is a continuous version of $\{\int_0^t \varphi(s) \, dW(s)\}_{t \ge 0}$.
14.1.3 Itô's Integrals Defined as Limits in Probability

It is possible to define the integral $\int_0^t \varphi(s) \, dW(s)$ when $\varphi$ is only in $\mathcal{B}_{loc}$, not necessarily in $\mathcal{A}_{loc}$. For this, define for each $n \ge 1$ the $\mathcal{F}_t$-stopping time
\[ T_n := \inf\left\{ t \,;\, \int_0^t |\varphi(s)|^2 \, ds \ge n \right\} \]
with the usual convention $\inf \emptyset = \infty$. Clearly, $P$-a.s., $T_n \uparrow \infty$ and $\int_0^{T_n} |\varphi(s)|^2 \, ds \le n$. In particular, the stochastic process
\[ \varphi_n(t) := \varphi(t) 1_{\{t \le T_n\}} \qquad (t \ge 0) \tag{14.6} \]
is in $\mathcal{A}(\mathbb{R}_+)$ and $I_n(t) := \int_0^t \varphi_n(s) \, dW(s)$ is therefore well defined. For any $n, m$ and for any $\varepsilon > 0$,
\[ P\left( \left| \int_0^t \varphi_n(s) \, dW(s) - \int_0^t \varphi_m(s) \, dW(s) \right| \ge \varepsilon \right) \le P\left( \int_0^t |\varphi(s)|^2 \, ds \ge \min(n, m) \right) \to 0 \]
as $n, m \uparrow \infty$. By Cauchy's criterion of convergence in probability (Theorem 4.2.3), for each $t \ge 0$, $I_n(t)$ converges in probability to some random variable denoted $I(\varphi, t)$.

If $\varphi \in \mathcal{A}(\mathbb{R}_+)$, $\lim_{n \uparrow \infty} I_n(t) = \int_0^t \varphi(s) \, dW(s)$ in $L^2_{\mathbb{C}}(P)$ and therefore also in probability. Therefore, in this case $I(\varphi, t) = \int_0^t \varphi(s) \, dW(s)$. Therefore, for $\varphi \in \mathcal{B}_{loc}$,
\[ \int_0^t \varphi(s) \, dW(s) := \lim_{n \uparrow \infty} \int_0^t \varphi_n(s) \, dW(s) , \]
where the limit is in probability, is an extension of the definition of the Itô integral from integrands in $\mathcal{A}(\mathbb{R}_+)$ to integrands in $\mathcal{B}_{loc}$.

The following result is a direct consequence of Theorems 14.1.7 and 14.1.6 (Exercise 14.4.6).
Theorem 14.1.8 Let $\{W(t)\}_{t \ge 0}$ be an $\mathcal{F}_t$-Wiener process and let $\varphi \in \mathcal{B}_{loc}$. The process $\{X(t)\}_{t \ge 0} := \{\int_0^t \varphi(s) \, dW(s)\}_{t \ge 0}$ is then an $\mathcal{F}_t$-local martingale which admits a continuous version. Moreover, for any $\mathcal{F}_t$-stopping time $\tau$,
\[ X(t \wedge \tau) = \int_0^t \varphi(s) 1_{\{s \le \tau\}} \, dW(s) \qquad (t \ge 0) . \tag{14.7} \]
14.2 Itô's Differential Formula

14.2.1 Elementary Form

The ordinary rules of calculus do not apply to functions of Brownian motion, as the following example shows:

Example 14.2.1: Squared Brownian motion. Let $\{W(t)\}_{t \ge 0}$ be an $\mathcal{F}_t$-Wiener process. With $t_i := \frac{it}{n}$,
\begin{align*}
W(t)^2 &= \sum_{i=1}^{n} \left( W(t_i)^2 - W(t_{i-1})^2 \right) \\
&= 2 \sum_{i=1}^{n} W(t_{i-1}) (W(t_i) - W(t_{i-1})) + \sum_{i=1}^{n} (W(t_i) - W(t_{i-1}))^2 =: A_n + B_n .
\end{align*}
As $n \uparrow \infty$, using Remark 14.1.3, $A_n$ converges in $L^2(P)$ to $2 \int_0^t W(s) \, dW(s)$, whereas $B_n$ converges to $t$ (Theorem 11.2.10). Therefore
\[ W(t)^2 = 2 \int_0^t W(s) \, dW(s) + t . \]
If the trajectories of the Wiener process were of bounded variation, integration by parts would give the (wrong) formula $W(t)^2 = 2 \int_0^t W(s) \, dW(s)$.
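The decomposition $W(t)^2 = A_n + B_n$ of Example 14.2.1 can be observed on a simulated path. The sketch below is an illustration with assumed parameters ($t = 1$, $n = 20000$, a fixed seed): the identity $W(t)^2 = A_n + B_n$ is purely algebraic and holds up to rounding, while $B_n$ is close to $t$ rather than to 0, which is the source of the extra term in the formula.

```python
import math
import random


def ito_square_decomposition(t, n, rng):
    """Return (W(t)^2, A_n, B_n) for one simulated path on the grid t_i = i * t / n."""
    dt = t / n
    w = 0.0
    a_n = 0.0   # 2 * sum of W(t_{i-1}) * (W(t_i) - W(t_{i-1}))
    b_n = 0.0   # sum of (W(t_i) - W(t_{i-1}))^2
    for _ in range(n):
        dw = rng.gauss(0.0, math.sqrt(dt))   # Brownian increment over one grid step
        a_n += 2.0 * w * dw                  # uses the left endpoint value W(t_{i-1})
        b_n += dw * dw
        w += dw
    return w * w, a_n, b_n


rng = random.Random(3)
w_sq, a_n, b_n = ito_square_decomposition(t=1.0, n=20000, rng=rng)
print(w_sq, a_n + b_n, b_n)  # the first two agree; b_n is close to t = 1
```

Evaluating the integrand at the left endpoint $W(t_{i-1})$ is what makes $A_n$ an approximation of the Itô integral.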
Functions of Brownian Motion

Let $C_b^2$ denote the collection of functions $F : \mathbb{R} \to \mathbb{C}$ that are twice continuously differentiable and such that $F$, $F'$ and $F''$ are bounded.

Theorem 14.2.2 Let $\{W(t)\}_{t \ge 0}$ be a standard $\mathcal{F}_t$-Wiener process and let $F \in C_b^2$. Then
\[ F(W(t)) = F(W(0)) + \int_0^t F'(W(s)) \, dW(s) + \frac{1}{2} \int_0^t F''(W(s)) \, ds . \tag{14.8} \]

Proof. It suffices to treat the case of real-valued functions $F$. Let $t_i := \frac{it}{n}$. Taylor's formula at the second order gives
\begin{align*}
F(W(t)) - F(W(0)) &= \sum_{i=1}^{n} \left( F(W(t_i)) - F(W(t_{i-1})) \right) \\
&= \sum_{i=1}^{n} F'(W(t_{i-1})) (W(t_i) - W(t_{i-1})) + \frac{1}{2} \sum_{i=1}^{n} F''(\xi_i) (W(t_i) - W(t_{i-1}))^2 \\
&= \sum_{i=1}^{n} F'(W(t_{i-1})) (W(t_i) - W(t_{i-1})) + \frac{1}{2} \sum_{i=1}^{n} F''(W(\theta_i)) (W(t_i) - W(t_{i-1}))^2 ,
\end{align*}
where $\xi_i$ lies between $W(t_{i-1})$ and $W(t_i)$ (and therefore, since the trajectories of the Brownian motion are continuous, $\xi_i = W(\theta_i)$ for some $\theta_i = \theta_i(n, \omega) \in (t_{i-1}, t_i)$). The first term on the right-hand side tends in $L^2(P)$ to $\int_0^t F'(W(s)) \, dW(s)$ (Remark 14.1.3). Let now
\begin{align*}
A_n &:= \sum_{i=1}^{n} F''(W(\theta_i)) (W(t_i) - W(t_{i-1}))^2 , \\
B_n &:= \sum_{i=1}^{n} F''(W(t_{i-1})) (W(t_i) - W(t_{i-1}))^2 , \\
C_n &:= \sum_{i=1}^{n} F''(W(t_{i-1})) (t_i - t_{i-1}) .
\end{align*}
By Schwarz's inequality, dominated convergence and Theorem 11.2.10,
\begin{align*}
E[|A_n - B_n|] &\le E\left[ \sup_i \left| F''(W(\theta_i)) - F''(W(t_{i-1})) \right| \times \sum_i (W(t_i) - W(t_{i-1}))^2 \right] \\
&\le \left( E\left[ \sup_i \left| F''(W(\theta_i)) - F''(W(t_{i-1})) \right|^2 \right] \times E\left[ \Big( \sum_i (W(t_i) - W(t_{i-1}))^2 \Big)^2 \right] \right)^{1/2} \\
&\to (0 \times t^2)^{1/2} = 0 .
\end{align*}
Now,
\[ E\left[ |B_n - C_n|^2 \right] \le (\sup |F''|)^2 \times \sum_i E\left[ \left| (W(t_i) - W(t_{i-1}))^2 - (t_i - t_{i-1}) \right|^2 \right] = (\sup |F''|)^2 \times 2 \sum_i (t_i - t_{i-1})^2 \to 0 . \]
Finally, $C_n \to \int_0^t F''(W(s)) \, ds$ in $L^2(P)$, and consequently in $L^1(P)$. Therefore, for all $t$, the announced equality holds in $L^1(P)$ and consequently $P$-almost surely. Since the stochastic processes on both sides of the equality are continuous, we have that $P$-almost surely, this equality holds for all $t \in \mathbb{R}_+$.
Remark 14.2.3 Theorem 14.2.2 remains true if we only suppose that $\{W(t)\}_{t \ge 0}$ is a real continuous $\mathcal{F}_t$-martingale such that $\{W(t)^2 - t\}_{t \ge 0}$ is also an $\mathcal{F}_t$-martingale. We shall not prove this, although the proof is very close to the one given in the special case.

Example 14.2.4: Itô's rule for exponentials. Let $F(x, t) = e^x$ and $X(t) = \int_0^t \varphi(s) \, dW(s) - \frac{1}{2} \int_0^t \varphi(s)^2 \, ds$, where $\varphi \in \mathcal{B}_{loc}$. Application of rule (14.11) yields
\[ L(t) := \exp\left\{ \int_0^t \varphi(s) \, dW(s) - \frac{1}{2} \int_0^t \varphi(s)^2 \, ds \right\} = 1 + \int_0^t L(s) \varphi(s) \, dW(s) . \tag{14.9} \]
Lévy's Characterization of Brownian Motion

Theorem 14.2.5 A real continuous $\mathcal{F}_t$-adapted stochastic process $\{W(t)\}_{t \ge 0}$ that is an $\mathcal{F}_t$-martingale and such that $\{W(t)^2 - t\}_{t \ge 0}$ is also an $\mathcal{F}_t$-martingale is a standard $\mathcal{F}_t$-Wiener process.

Proof. Applying Itô's differentiation rule¹ with $F(x) = e^{iux}$ ($u \in \mathbb{R}$),
\[ e^{iuW(t)} - e^{iuW(s)} = iu \int_s^t e^{iuW(z)} \, dW(z) - \frac{1}{2} u^2 \int_s^t e^{iuW(z)} \, dz \qquad (0 \le s \le t) . \]
Now, since $E\left[ \int_0^t |e^{iuW(s)}|^2 \, ds \right] = t < \infty$, it follows that $\int_0^t e^{iuW(s)} \, dW(s)$ is an $\mathcal{F}_t$-martingale and therefore, for all $A \in \mathcal{F}_s$,
\[ \int_A \left( e^{iuW(t)} - e^{iuW(s)} \right) dP = -\frac{1}{2} u^2 \int_A \int_s^t e^{iuW(z)} \, dz \, dP . \tag{14.10} \]
Dividing both sides of the above equation by $e^{iuW(s)}$ and applying Fubini's theorem to the right-hand side,
\[ \int_A e^{iu(W(t) - W(s))} \, dP = P(A) - \frac{1}{2} u^2 \int_s^t \int_A e^{iu(W(z) - W(s))} \, dP \, dz , \]
and therefore
\[ \int_A e^{iu(W(t) - W(s))} \, dP = P(A) \, e^{-\frac{1}{2} u^2 (t - s)} . \tag{$\star$} \]
This equality is valid for all $0 \le s \le t$, all $u \in \mathbb{R}$ and all $A \in \mathcal{F}_s$. With $A = \Omega$,
\[ E\left[ e^{iu(W(t) - W(s))} \right] = e^{-\frac{1}{2} u^2 (t - s)} , \]
that is, $W(t) - W(s)$ is a centered Gaussian variable with variance $t - s$. Equation ($\star$) then reads
\[ E\left[ 1_A e^{iu(W(t) - W(s))} \right] = P(A) \, E\left[ e^{iu(W(t) - W(s))} \right] \qquad (A \in \mathcal{F}_s) , \]
from which it follows that $W(t) - W(s)$ is independent of $\mathcal{F}_s$.
Some Extensions
Theorem 14.2.2 can be extended in several directions that do not involve new ideas. The proofs, using arguments very similar to those in the proof of Theorem 14.2.2 are therefore omitted. Theorem 14.2.2 dealt with functions of the Brownian motion. We now consider functions of an Itˆo process, whose deﬁnition follows. Deﬁnition 14.2.6 A stochastic process of the form  t  t X(t) := X(0) + f (s) ds + ϕ(s) dW (s) 0
(t ≥ 0) ,
()
0
where X(0) is an F0 measurable random variable and {ϕ(t)}t≥0 and {f (t)}t≥0 are Ft o process. progressively measurable stochastic processes in Aloc or Bloc is called an Itˆ 1
See Remark 14.2.3.
ˆ DIFFERENTIAL FORMULA 14.2. ITO’S
559
Theorem 14.2.7 Suppose that $\varphi, f \in \mathcal{A}_{\mathrm{loc}}$ and $X(0)$ is square integrable. Then, for all functions $F \in C^2$,
$$F(X(t)) = F(X(0)) + \int_0^t F'(X(s))\varphi(s)\,dW(s) + \int_0^t F'(X(s))f(s)\,ds + \frac{1}{2}\int_0^t F''(X(s))\varphi(s)^2\,ds.$$
With the notation $dX(s) := \varphi(s)\,dW(s) + f(s)\,ds$, this formula can be written as
$$F(X(t)) = F(X(0)) + \int_0^t F'(X(s))\,dX(s) + \frac{1}{2}\int_0^t F''(X(s))\varphi(s)^2\,ds.$$
Therefore, with
$$\langle X\rangle(t) := \int_0^t \varphi(s)^2\,ds$$
(the bracket process of the martingale part of the Itô process),
$$F(X(t)) = F(X(0)) + \int_0^t F'(X(s))\,dX(s) + \frac{1}{2}\int_0^t F''(X(s))\,d\langle X\rangle(s)$$
or, in differential form,
$$dF(X(t)) = F'(X(t))\,dX(t) + \frac{1}{2}F''(X(t))\,d\langle X\rangle(t).$$

Example 14.2.8: The geometric Brownian motion. Let $Z(t) := \exp\{\sigma W(t) + \mu t\}$ ($t \ge 0$). By Itô's differential rule, and since $Z(0) = 1$,
$$Z(t) = 1 + \sigma\int_0^t Z(s)\,dW(s) + \left(\mu + \frac{\sigma^2}{2}\right)\int_0^t Z(s)\,ds.$$
In particular, with $\mu = -\frac{\sigma^2}{2}$, the process
$$Z(t) := \exp\left\{\sigma W(t) - \frac{1}{2}\sigma^2 t\right\} \qquad (t \ge 0)$$
is a martingale, called the geometric Brownian motion.

Let $\{W(t)\}_{t\ge 0}$ be an $\mathcal{F}_t$-Wiener process. Let $\{\varphi(t)\}_{t\ge 0}$ and $\{f(t)\}_{t\ge 0}$ be in $\mathcal{B}_{\mathrm{loc}}$. Let $\{X(t)\}_{t\ge 0}$ be an Itô process.

Theorem 14.2.9 Let $F : (x,t) \in \mathbb{R}^2 \to F(x,t) \in \mathbb{C}$ be twice continuously differentiable in the first variable $x$ and once continuously differentiable in the second variable $t$. Then
$$F(X(t),t) = F(X(0),0) + \int_0^t \frac{\partial F}{\partial t}(X(s),s)\,ds + \int_0^t \frac{\partial F}{\partial x}(X(s),s)\,dX(s) + \frac{1}{2}\int_0^t \frac{\partial^2 F}{\partial x^2}(X(s),s)\,\varphi(s)^2\,ds, \tag{14.11}$$
where
$$\int_0^t \frac{\partial F}{\partial x}(X(s),s)\,dX(s) := \int_0^t \frac{\partial F}{\partial x}(X(s),s)\,\varphi(s)\,dW(s) + \int_0^t \frac{\partial F}{\partial x}(X(s),s)\,f(s)\,ds. \tag{14.12}$$
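As a numerical illustration of Example 14.2.8, the following sketch estimates $E[Z(t)]$ for $Z(t) = \exp\{\sigma W(t) - \frac{1}{2}\sigma^2 t\}$ by plain Monte Carlo; the martingale property predicts the value $Z(0) = 1$. The function name, the parameter values and the sample size are illustrative choices, not taken from the text.

```python
import math
import random


def mean_exponential_martingale(sigma, t, n_samples, seed=0):
    """Monte Carlo estimate of E[exp(sigma*W(t) - sigma^2 * t / 2)]."""
    rng = random.Random(seed)
    std_dev = math.sqrt(t)  # W(t) is Gaussian with mean 0 and variance t
    total = 0.0
    for _ in range(n_samples):
        w_t = rng.gauss(0.0, std_dev)
        total += math.exp(sigma * w_t - 0.5 * sigma * sigma * t)
    return total / n_samples


# Illustrative parameters (assumptions of this sketch).
estimate = mean_exponential_martingale(sigma=0.5, t=1.0, n_samples=200_000)
print(f"estimated E[Z(1)] = {estimate:.4f} (the martingale property predicts 1)")
```

Only the marginal law of $W(t)$ is sampled here, so the sketch checks the constancy of the mean, which is a consequence of the martingale property rather than the property itself.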
CHAPTER 14. A GLIMPSE AT ITÔ'S STOCHASTIC CALCULUS
A Finite Number of Discontinuities

The Itô differentiation rule remains valid in situations where the function $F$ is $C^2$ in the $x$ argument only on $\mathbb{R}\setminus E$, where $E$ is a finite set $E = \{x_1, \ldots, x_k\}$. The method of proof is the same as the following one in the simple case of a function of the Brownian motion.

Theorem 14.2.10 Let $F : \mathbb{R} \to \mathbb{R}$ be $C^1$ on $\mathbb{R}$ and $C^2$ on $\mathbb{R}\setminus E$, where $E = \{x_1, \ldots, x_k\}$. If $F''$ remains bounded on a neighborhood of these points, the Itô formula applies:
$$F(W(t)) = F(W(0)) + \int_0^t F'(W(s))\,dW(s) + \frac{1}{2}\int_0^t F''(W(s))\,ds. \tag{14.13}$$
Concerning the last (Lebesgue) integral, note that the Lebesgue measure of $\{t\,;\,W(t,\omega) \in E\}$ is almost surely null.

Proof. We first show the existence of a sequence of functions $F_n$ that are $C^2$ in $\mathbb{R}$ with the following properties: (i) $F_n \to F$ and $F_n' \to F'$ uniformly in $\mathbb{R}$, and (ii) for all $x \notin E$, $F_n''(x) \to F''(x)$ and $F_n''$ is uniformly bounded on a neighborhood of $E$. Let $\alpha$ be some function in $C^\infty$ with a compact support, equal to 1 in a neighborhood of $E$. In particular, $(1-\alpha)F$ is in $C^2$ and the Itô formula applies for this function. It remains to show that it also applies to $\alpha F$, or more generally to $F$ as before, but with compact support. For such a function, consider the approximation $F_n := F * \varphi_n$ where $\varphi_n(x) := n\varphi(nx)$ and $\varphi$ is $C^\infty$ with compact support and such that $0 \le \varphi \le 1$. Such a function satisfies requirements (i) and (ii) above, and therefore the Itô formula applies:
$$F_n(W(t)) = F_n(W(0)) + \int_0^t F_n'(W(s))\,dW(s) + \frac{1}{2}\int_0^t F_n''(W(s))\,ds.$$
The terms of this equality converge as $n \uparrow \infty$ to the corresponding terms of (14.13), the first two terms by uniform convergence of the $F_n$'s, the third by uniform convergence of the $F_n'$'s and the isometry formula. For the last term, observe that, by Schwarz's inequality,
$$E\left[\left(\int_0^t F_n''(W(s))\,ds - \int_0^t F''(W(s))\,ds\right)^2\right] \le t\int_0^t E\left[\left|F_n''(W(s)) - F''(W(s))\right|^2\right] ds,$$
a quantity that tends to 0 by dominated convergence.
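To make Theorem 14.2.10 concrete, here is a minimal discretized check with $F(x) = x|x|$, which is $C^1$ on $\mathbb{R}$ and $C^2$ away from $E = \{0\}$, with $F'(x) = 2|x|$ and $F''(x) = 2\,\mathrm{sgn}(x)$ bounded. The choice of $F$, the left-point Riemann sums and the step count are assumptions of this sketch; the two sides of (14.13) should agree up to a small discretization error.

```python
import math
import random


def ito_formula_check(n_steps=200_000, t_end=1.0, seed=1):
    """Compare F(W(t)) with the discretized right-hand side of (14.13), F(x) = x|x|."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    std_dev = math.sqrt(dt)
    w = 0.0
    stochastic_integral = 0.0  # left-point sums of F'(W(s)) dW(s)
    lebesgue_integral = 0.0    # left-point sums of F''(W(s)) ds
    for _ in range(n_steps):
        dw = rng.gauss(0.0, std_dev)
        stochastic_integral += 2.0 * abs(w) * dw
        second_derivative = 2.0 if w > 0 else (-2.0 if w < 0 else 0.0)
        lebesgue_integral += second_derivative * dt
        w += dw
    left_side = w * abs(w)  # F(W(t)); note that F(W(0)) = 0
    right_side = stochastic_integral + 0.5 * lebesgue_integral
    return left_side, right_side


left_side, right_side = ito_formula_check()
print(f"F(W(1)) = {left_side:.4f}, discretized Ito formula = {right_side:.4f}")
```

The value assigned to $F''$ at the single point $0$ is irrelevant, in line with the remark that the time spent by the Brownian motion in $E$ has Lebesgue measure zero.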
The Vectorial Differentiation Rule

Let $C^{1,2}(\mathbb{R}_+ \times \mathbb{R}^d)$ denote the collection of functions $F : \mathbb{R}_+ \times \mathbb{R}^d \to \mathbb{R}$ that are once continuously differentiable in the first coordinate and twice continuously differentiable in the second coordinate. Let $\{W(t)\}_{t\ge 0}$ be a $k$-dimensional standard Wiener process. Let
$$\varphi(t) := \{\varphi_{i,j}(t)\}_{1\le i\le d,\,1\le j\le k} \qquad (t \ge 0)$$
be a real $d\times k$-matrix-valued stochastic process such that for all $1 \le i \le d$ and all $1 \le j \le k$ the process $\{\varphi_{i,j}(t)\}_{t\ge 0}$ is $\mathcal{F}_t^W$-adapted and in $\mathcal{B}_{\mathrm{loc}}$. Denote by $\int_0^t \varphi(s)\,dW(s)$ ($t \ge 0$) the $d$-dimensional stochastic process whose $i$th component is $\sum_{j=1}^k \int_0^t \varphi_{i,j}(s)\,dW_j(s)$ ($t \ge 0$). Let
$$\psi(t) := \{\psi_i(t)\}_{1\le i\le d} \qquad (t \ge 0)$$
be a real $d$-dimensional stochastic process such that for all $1 \le i \le d$ the process $\{\psi_i(t)\}_{t\ge 0}$ is $\mathcal{F}_t$-adapted and in $\mathcal{B}_{\mathrm{loc}}$. Define the $d$-dimensional Itô process $\{X(t)\}_{t\ge 0}$ by
$$X(t) := X(0) + \int_0^t \varphi(s)\,dW(s) + \int_0^t \psi(s)\,ds,$$
where $X(0)$ is a vector of integrable random variables. Then:

Theorem 14.2.11 Under the above conditions, for $F : \mathbb{R}_+ \times \mathbb{R}^d \to \mathbb{R}$ once continuously differentiable in the first coordinate and twice continuously differentiable in the second coordinate, we have the formula
$$\begin{aligned}
F(t,X(t)) = F(0,X(0)) &+ \int_0^t \frac{\partial}{\partial s}F(s,X(s))\,ds + \sum_{i=1}^d \int_0^t \frac{\partial}{\partial x_i}F(s,X(s))\,\psi_i(s)\,ds \\
&+ \sum_{i=1}^d \int_0^t \frac{\partial}{\partial x_i}F(s,X(s)) \sum_{j=1}^k \varphi_{i,j}(s)\,dW_j(s) \\
&+ \frac{1}{2}\sum_{i,\ell=1}^d \int_0^t \frac{\partial^2}{\partial x_i\,\partial x_\ell}F(s,X(s)) \left(\sum_{j=1}^k \varphi_{i,j}(s)\varphi_{\ell,j}(s)\right) ds.
\end{aligned}$$
Example 14.2.12: An integration by parts formula. Let for $i = 1, 2$
$$X_i(t) = X_i(0) + \int_0^t \varphi_i(s)\,dW(s) + \int_0^t \psi_i(s)\,ds \qquad (t \ge 0),$$
where the $\varphi_i$'s and $\psi_i$'s satisfy the conditions of Definition 14.2.6. Then
$$X_1(t)X_2(t) = X_1(0)X_2(0) + \int_0^t X_1(s)\,dX_2(s) + \int_0^t X_2(s)\,dX_1(s) + \int_0^t \varphi_1(s)\varphi_2(s)\,ds.$$
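The integration by parts formula can be illustrated in the simplest case $X_1 = X_2 = W$ (so $\varphi_i \equiv 1$, $\psi_i \equiv 0$), where it reads $W(t)^2 = 2\int_0^t W(s)\,dW(s) + t$. The sketch below, whose discretization and parameter values are illustrative assumptions, uses left-point sums: the discrete identity $W_n^2 = 2\sum_k W_k\,\Delta W_k + \sum_k (\Delta W_k)^2$ holds exactly, and the correction term $\sum_k (\Delta W_k)^2$ is close to $t$.

```python
import math
import random


def integration_by_parts_check(n_steps=100_000, t_end=1.0, seed=2):
    """Return W(t)^2, twice the left-point Ito sum, and the sum of squared increments."""
    rng = random.Random(seed)
    std_dev = math.sqrt(t_end / n_steps)
    w = 0.0
    ito_sum = 0.0              # approximates the Ito integral of W dW
    quadratic_variation = 0.0  # approximates the correction term, here equal to t
    for _ in range(n_steps):
        dw = rng.gauss(0.0, std_dev)
        ito_sum += w * dw
        quadratic_variation += dw * dw
        w += dw
    return w * w, 2.0 * ito_sum, quadratic_variation


w_squared, twice_ito_sum, quadratic_variation = integration_by_parts_check()
print(f"W(1)^2 = {w_squared:.5f}")
print(f"2 * Ito sum + correction = {twice_ito_sum + quadratic_variation:.5f}")
print(f"correction term = {quadratic_variation:.5f} (should be close to t = 1)")
```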
14.3 Selected Applications

14.3.1 Square-integrable Brownian Functionals

Theorem 14.3.1 Let $\{W(t)\}_{t\in[0,1]}$ be an $\mathcal{F}_t^W$-Wiener process, and let $\{m(t)\}_{t\in[0,1]}$ be a real-valued $\mathcal{F}_t^W$-square-integrable martingale on $[0,1]$. Then there exists a real stochastic process $\{\varphi(t)\}_{t\in[0,1]} \in \mathcal{A}([0,1])$ such that
$$m(t) = m(0) + \int_0^t \varphi(s)\,dW(s) \qquad (t \in [0,1]). \tag{14.14}$$
Lemma 14.3.2 Let $\{W(t)\}_{t\in[0,1]}$ be an $\mathcal{F}_t^W$-Wiener process.

A. The collection of random variables $\mathcal{H} := \{e^X\,;\,X \in \mathcal{U}\}$, where $\mathcal{U}$ is the collection of random variables
$$\sum_{j=1}^k a_j W(t_j) \qquad (k \in \mathbb{N},\ 0 \le t_1 < \cdots < t_k \le 1,\ a_1, \ldots, a_k \in \mathbb{R}),$$
is total in the Hilbert space $L^2_{\mathbb{R}}(\mathcal{F}_1^W, P)$.

B. The collection of random variables that are linear combinations of elements of
$$\mathcal{K} := \left\{M^f(1) = e^{\int_0^1 f(s)\,dW(s) - \frac{1}{2}\int_0^1 f(s)^2\,ds}\,;\ f : \mathbb{R} \to \mathbb{R} \text{ measurable and bounded}\right\}$$
is dense in the Hilbert space $L^2_{\mathbb{R}}(\mathcal{F}_1^W, P)$.

Proof. Part B is an immediate consequence of Part A. For the proof of A, observe that $\mathcal{H} \subset L^2_{\mathbb{R}}(\mathcal{F}_1^W, P)$ and that $1 \in \mathcal{H}$. We have to prove that if $Y \in L^2_{\mathbb{R}}(\mathcal{F}_1^W, P)$ is such that $E[Ye^X] = 0$ for all $X \in \mathcal{U}$, then $P(Y = 0) = 1$. As $1 \in \mathcal{H}$, $E[Y] = 0$. Multiplying $Y$ by a constant if necessary, we may suppose that $E[|Y|] = 2$ and in particular, since $E[Y] = 0$, $E[Y^+] = 1$ and $E[Y^-] = 1$. Therefore $Q^+ := Y^+ P$ and $Q^- := Y^- P$ are probability measures. Since $E[Ye^X] = 0$ for all $X \in \mathcal{U}$, we have that $E[Y^+ e^X] = E[Y^- e^X]$ or, equivalently, $E_{Q^+}[e^X] = E_{Q^-}[e^X]$. Therefore the Laplace transforms of the vectors of the type $(W(t_1), \ldots, W(t_k))$ are the same under $Q^+$ and $Q^-$. This implies in particular that $Q^+$ and $Q^-$ agree on $\mathcal{F}_1^W$. Therefore $E\left[1_{\{Y^+ > Y^-\}}(Y^+ - Y^-)\right] = 0$ and E 1{Y + ∞ .
0
Therefore, by Lemma 14.3.4, $E[L^{\psi}(T)] = 1$ or, equivalently,
$$E\left[L(T)\exp\left\{i\sum_{j=1}^p u_j X(t_j)\right\}\right] \times \exp\left\{\frac{1}{2}\sum_{j,k=1}^p u_j u_k (t_j \wedge t_k)\right\} = 1, \tag{$\star$}$$
that is, (14.19). It remains to get rid of the additional assumption. For this, we introduce the processes $\varphi_n(t) := \varphi(t)1_{[0,\tau_n]}(t)$, where
$$\tau_n := \inf\left\{t\,;\,\int_0^t \varphi(s)^2\,ds \ge n\right\}.$$
By Lemma 14.3.4, $E[L^{\varphi_n}(T)] = 1 = E[L^{\varphi}(T)]$ (the last equality is a hypothesis of the theorem). Also $L^{\varphi_n}(T) \to L^{\varphi}(T)$. Since moreover $L^{\varphi_n}(T) \ge 0$, the conditions of application of Scheffé's lemma (Lemma 4.4.24) are satisfied, and therefore
$$L^{\varphi_n}(T) \to L^{\varphi}(T) \text{ in } L^1. \tag{$\dagger$}$$
But ($\star$) is true for $\varphi_n$, that is, letting $X_n(t) := W(t) - \int_0^t \varphi_n(s)\,ds$,
$$E\left[L^{\varphi_n}(T)\exp\left\{i\sum_{j=1}^p u_j X_n(t_j)\right\}\right] \times \exp\left\{\frac{1}{2}\sum_{j,k=1}^p u_j u_k (t_j \wedge t_k)\right\} = 1.$$
Now,
$$\begin{aligned}
L^{\varphi_n}(T)\exp\left\{i\sum_{j=1}^p u_j X_n(t_j)\right\} &- L^{\varphi}(T)\exp\left\{i\sum_{j=1}^p u_j X(t_j)\right\} \\
&= \left(L^{\varphi_n}(T) - L^{\varphi}(T)\right)\exp\left\{i\sum_{j=1}^p u_j X_n(t_j)\right\} \\
&\quad + L^{\varphi}(T)\left(\exp\left\{i\sum_{j=1}^p u_j X_n(t_j)\right\} - \exp\left\{i\sum_{j=1}^p u_j X(t_j)\right\}\right).
\end{aligned}$$
Using ($\dagger$) and the fact that $\exp\left\{i\sum_{j=1}^p u_j X_n(t_j)\right\}$ is uniformly bounded and converges, we obtain by dominated convergence that
$$L^{\varphi_n}(T)\exp\left\{i\sum_{j=1}^p u_j X_n(t_j)\right\} \to L^{\varphi}(T)\exp\left\{i\sum_{j=1}^p u_j X(t_j)\right\} \text{ in } L^1,$$
which gives ($\star$).

The following result is a sufficient condition for the hypothesis (14.16) to hold.
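Lemma 14.3.5 If in Theorem 14.3.3 $\{\varphi(t)\}_{t\in[0,1]}$ is bounded, then (14.16) holds. Moreover, $\{L(t)\}_{t\in[0,1]}$ thereof is a $(P, \mathcal{F}_t)$-square-integrable martingale.

Proof. For each $n \ge 1$, let $S_n$ be the $\mathcal{F}_t$-stopping time defined by
$$S_n = \begin{cases} \inf\left\{t\,;\,\int_0^t \varphi(s)^2\,ds + L(t) \ge n\right\} & \text{if } \{\ldots\} \ne \emptyset, \\ +\infty & \text{otherwise.} \end{cases} \tag{14.20}$$
Then $\{L(t \wedge S_n)\}_{t\in[0,1]}$ is a square-integrable $\mathcal{F}_t$-martingale and, by isometry,
$$E\left[|L(t \wedge S_n) - 1|^2\right] = E\left[\int_0^t |L(s)\varphi(s)|^2\,1_{\{s \le S_n\}}\,ds\right].$$
In particular,
$$E\left[|L(t \wedge S_n)|^2\right] \le 1 + \int_0^t E\left[|L(s)\varphi(s)|^2\,1_{\{s \le S_n\}}\right] ds \le 1 + \sup(|\varphi|^2)\int_0^t E\left[|L(s \wedge S_n)|^2\right] ds.$$
The rest follows by Gronwall's lemma (Theorem B.6.1).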
Theorem 14.3.6 If $E[L(T)] = 1$, then $\{L(t)\}_{t\in[0,T]}$ is an $\mathcal{F}_t$-martingale.

Proof. Itô's differentiation rule applied to $F(x) = e^x$ yields (14.9). Since $L(t)$ and $\int_0^t \varphi(s)^2\,ds$ are finite continuous processes, $S_n \uparrow \infty$, $P$-a.s. Also
$$\int_0^t \left|L(s)\varphi(s)1_{\{s \le S_n\}}\right|^2 ds \le n^2,$$
and therefore $\{L(t \wedge S_n)\}_{t\ge 0}$ is a $(P, \mathcal{F}_t)$-square-integrable martingale, and therefore $\{L(t)\}_{t\ge 0}$ is a $(P, \mathcal{F}_t)$-local martingale. Being nonnegative, it is also a supermartingale, and a $(P, \mathcal{F}_t)$-martingale on $[0,T]$ when (14.16) is satisfied.
The Strong Markov Property of Brownian Motion

This result (Theorem 11.2.1) was admitted in Chapter 11. The following proof is based on stochastic calculus, more precisely, on the following characterization:

Lemma 14.3.7 For a continuous real stochastic process $\{Y(t)\}_{t\ge 0}$ adapted to the history $\{\mathcal{F}_t\}_{t\ge 0}$ to be an $\mathcal{F}_t$-Brownian motion, it is necessary and sufficient that for all $\lambda \in \mathbb{R}$, the process
$$e^{\lambda Y(t) - \frac{1}{2}\lambda^2 t} \qquad (t \ge 0)$$
be an $\mathcal{F}_t$-martingale. In this case, it is independent of $\mathcal{F}_0$.

Proof. The necessity results from an elementary computation on Gaussian variables. For the sufficiency, observe that the martingale condition implies that for all intervals $[a,b] \subset \mathbb{R}$,
$$E\left[e^{\lambda(Y(b)-Y(a))} \mid \mathcal{F}_a\right] = e^{\frac{1}{2}\lambda^2(b-a)}.$$
By taking expectations it follows that $Y(b) - Y(a)$ is a centered Gaussian variable of variance $(b-a)$ and then that it is independent of $\mathcal{F}_a$, which implies the independence property of the increments as well as their independence from $\mathcal{F}_0$.

We now prove Theorem 11.2.1.

Proof. The stochastic process
$$\varphi(t) := 1_A\,1_{(\tau+a,\,\tau+b]}(t) \qquad (t \ge 0),$$
where $A \in \mathcal{F}_{\tau+a}$, is $\mathcal{F}_t$-progressively measurable, and therefore by Lemma 14.3.4:
$$E\left[\exp\left\{\lambda 1_A\left(W(\tau+b) - W(\tau+a)\right) - \frac{1}{2}\lambda^2 1_A (b-a)\right\}\right] = 1$$
or, equivalently (the exponential equals 1 outside $A$),
$$E\left[\exp\left\{\lambda\left(W(\tau+b) - W(\tau+a)\right) - \frac{1}{2}\lambda^2(b-a)\right\}1_A\right] = P(A).$$
Therefore, since $A$ is arbitrary in $\mathcal{F}_{\tau+a}$,
$$E\left[\exp\left\{\lambda\left(W(\tau+b) - W(\tau+a)\right)\right\} \mid \mathcal{F}_{\tau+a}\right] = \exp\left\{\frac{1}{2}\lambda^2(b-a)\right\},$$
from which we deduce as in Lemma 14.3.7 that $W(\tau+b) - W(\tau+a)$ is a centered Gaussian variable with variance $b - a$ independent of $\mathcal{F}_{\tau+a}$. In particular, it has independent increments and therefore $\{W(\tau+t)\}_{t\ge 0}$ is a Wiener process. Moreover, still by Lemma 14.3.7, this process is independent of $\mathcal{F}_\tau$.

Remark 14.3.8 The above proof is similar to that of the strong Markov property of Poisson processes (Theorem 7.1.9).
14.3.3 Stochastic Differential Equations

This is a very brief introduction to a vast subject. Let $\{W(t)\}_{t\ge 0}$ be a standard $\mathcal{F}_t$-Brownian motion. We are going to discuss the existence and unicity of a measurable $\mathcal{F}_t$-adapted stochastic process $\{X(t)\}_{t\ge 0}$ such that almost surely
$$X(t) = X(0) + \int_0^t b(X(s))\,ds + \int_0^t \sigma(X(s))\,dW(s) \qquad (t \ge 0), \tag{14.21}$$
where $b$ and $\sigma$ are measurable functions such that almost surely
$$\int_0^t \left(|b(X(s))|^2 + |\sigma(X(s))|^2\right) ds < \infty \qquad (t \ge 0). \tag{14.22}$$
One then calls $\{X(t)\}_{t\ge 0}$ the solution of the stochastic differential equation (14.21). Condition (14.22) guarantees in particular that the integrand of the Itô integral of (14.21) is in $\mathcal{A}_{\mathrm{loc}}$.

Theorem 14.3.9 If $X(0) \in L^2_{\mathbb{R}}(P)$ and if for some $K < \infty$
$$|b(x) - b(y)| + |\sigma(x) - \sigma(y)| \le K|x - y|, \tag{14.23}$$
there exists a unique solution of (14.21) satisfying condition (14.22).

Proof. A. Uniqueness. If $\{X(t)\}_{t\ge 0}$ and $\{Y(t)\}_{t\ge 0}$ are two solutions,
$$X(t) - Y(t) = \int_0^t \left(b(X(s)) - b(Y(s))\right) ds + \int_0^t \left(\sigma(X(s)) - \sigma(Y(s))\right) dW(s)$$
and therefore, taking into account the inequality $(a+b)^2 \le 2(a^2 + b^2)$, Schwarz's inequality, the property of isometry of Itô's integrals and the Lipschitz condition (14.23),
$$\begin{aligned}
E\left[(X(t) - Y(t))^2\right] &\le 2E\left[\left(\int_0^t \left(b(X(s)) - b(Y(s))\right) ds\right)^2\right] + 2E\left[\left(\int_0^t \left(\sigma(X(s)) - \sigma(Y(s))\right) dW(s)\right)^2\right] \\
&\le 2tE\left[\int_0^t \left(b(X(s)) - b(Y(s))\right)^2 ds\right] + 2E\left[\int_0^t \left(\sigma(X(s)) - \sigma(Y(s))\right)^2 ds\right] \\
&\le 2(t+1)K^2 E\left[\int_0^t |X(s) - Y(s)|^2\,ds\right] \\
&\le 2(T+1)K^2 E\left[\int_0^T |X(s) - Y(s)|^2\,ds\right] \qquad (t \le T).
\end{aligned}$$
By Gronwall's lemma, for all $t \in [0,T]$, $E\left[(X(t) - Y(t))^2\right] = 0$, and therefore $P(X(t) = Y(t)) = 1$.

B. Existence. Define recursively the stochastic processes $\{X_n(t)\}_{t\ge 0}$ ($n \ge 0$) by $X_0(t) := X(0)$ ($t \ge 0$) and for $n \ge 0$ by
$$X_{n+1}(t) := X(0) + \int_0^t b(X_n(s))\,ds + \int_0^t \sigma(X_n(s))\,dW(s). \tag{$\star$}$$
With arguments similar to those of Part A,
$$E\left[(X_{n+1}(t) - X_n(t))^2\right] \le 2(T+1)K^2 E\left[\int_0^T |X_{n+1}(s) - X_n(s)|^2\,ds\right] \qquad (t \le T).$$
Letting $C_T := 2(T+1)K^2$ and $a := \max_{t\le T} E\left[(X_1(t) - X(0))^2\right]$ (a finite quantity, less than a constant times $T^3 E[X(0)^2]$), one checks by recurrence that
$$E\left[(X_{n+1}(t) - X_n(t))^2\right] \le a\,\frac{C_T^n\,t^{n-1}}{(n-1)!}.$$
Therefore
$$\|X_{n+1} - X_n\|_{\mathcal{A}[0,T]} \le a\,\frac{(C_T T)^n}{n!}$$
and then
$$\|X_{n+\ell} - X_n\|_{\mathcal{A}[0,T]} \le a^{\frac{1}{2}} \sum_{k\ge n} \left(\frac{(C_T T)^k}{k!}\right)^{\frac{1}{2}},$$
a quantity that tends to 0 as $n \to \infty$. This shows that $\{X_n(t)\}_{t\ge 0}$ ($n \ge 0$) is a Cauchy sequence of the Hilbert space $\mathcal{A}[0,T]$ and therefore converges in this space to some $\{X(t)\}_{t\ge 0}$. The Lipschitz condition (14.23) allows us to pass to the limit in ($\star$) to obtain (14.21). The property (14.22) is easily verified.

Example 14.3.10: An explicit solution. It can be checked (Exercise 14.4.15) that the differential equation
$$dX(t) = \sqrt{1 + X(t)^2}\,dW(t) + \left(\sqrt{1 + X(t)^2} + \frac{1}{2}X(t)\right) dt$$
admits the stochastic process
$$X(t) := \sinh\left(\sinh^{-1}(X(0)) + W(t) + t\right) \qquad (t \ge 0)$$
as a solution. Indeed, with $Y(t) := \sinh^{-1}(X(0)) + W(t) + t$, Itô's rule gives $d\sinh Y(t) = \cosh Y(t)\,(dW(t) + dt) + \frac{1}{2}\sinh Y(t)\,dt$, and $\cosh Y(t) = \sqrt{1 + X(t)^2}$. Theorem 14.3.9 then guarantees its unicity.
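The explicit solution of Example 14.3.10 can be compared numerically with a discretization of the equation. The sketch below uses the Euler–Maruyama scheme, which is not discussed in the text and is only one of several possible discretizations; it drives the scheme and the closed form $\sinh(\sinh^{-1}(X(0)) + W(t) + t)$ with the same simulated Brownian path. The initial value, horizon and step count are illustrative assumptions.

```python
import math
import random


def euler_versus_explicit(x0=0.0, t_end=1.0, n_steps=20_000, seed=3):
    """Euler-Maruyama for dX = sqrt(1+X^2) dW + (sqrt(1+X^2) + X/2) dt versus the closed form."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    std_dev = math.sqrt(dt)
    x = x0
    w = 0.0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, std_dev)
        root = math.sqrt(1.0 + x * x)
        x += root * dw + (root + 0.5 * x) * dt  # one Euler-Maruyama step
        w += dw                                  # same Brownian path for the closed form
    explicit = math.sinh(math.asinh(x0) + w + t_end)
    return x, explicit


euler_value, explicit_value = euler_versus_explicit()
print(f"Euler-Maruyama X(1) = {euler_value:.4f}, explicit X(1) = {explicit_value:.4f}")
```

The two values differ by a discretization error that shrinks as the number of steps grows; the comparison is a plausibility check, not a proof.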
Strong and Weak Solutions

So far we have considered strong solutions. This means that the problem was posed in terms of a preexisting Wiener process and given initial state, and that the solution took the general form
$$X(t) = F\left(t, X(0), \{W(s)\}_{0\le s\le t}\right).$$
A weak solution associated with the parameters (functions) $b$ and $\sigma$ and a probability distribution $\pi$ on $\mathbb{R}$ consists of a probability space on which are given

1. a filtration $\{\mathcal{F}_t\}_{t\ge 0}$,
2. a standard $\mathcal{F}_t$-Wiener process $\{W_t\}_{t\ge 0}$,
3. a random variable $X(0)$ with a given distribution $\pi$ and independent of the above Wiener process, and finally
4. an $\mathcal{F}_t$-progressive stochastic process $\{X_t\}_{t\ge 0}$ such that
$$X(t) = X(0) + \int_0^t b(X(s))\,ds + \int_0^t \sigma(X(s))\,dW(s).$$

In this definition, the Wiener process is part of the solution.

Example 14.3.11: Tanaka's stochastic differential equation. In the equation
$$X(t) = \int_0^t \mathrm{sgn}(X(s))\,dW(s) \qquad (t \ge 0),$$
where $\mathrm{sgn}(x) = +1$ if $x \ge 0$ and $\mathrm{sgn}(x) = -1$ if $x < 0$, the Lipschitz conditions of Theorem 14.3.9 are not satisfied. We shall give rough arguments showing that there exists a solution that cannot be a strong solution. First note that if there exists a solution, it is a Brownian motion. To see this it suffices to show that for all $\lambda > 0$, the process
$$\exp\left\{\lambda X(t) - \frac{1}{2}\lambda^2 t\right\} \qquad (t \ge 0)$$
is a martingale (Lemma 14.3.7) (Exercise 14.4.17). Therefore we have unicity in law of the solution. Note that it is the best we can do concerning unicity since $\{-X(t)\}_{t\ge 0}$ is another solution. By the same arguments as above, for any solution $\{X(t)\}_{t\ge 0}$ (which is a Brownian motion), the process $\{W(t)\}_{t\ge 0}$ defined by
$$W(t) := \int_0^t \mathrm{sgn}(X(s))\,dX(s) \qquad (t \ge 0) \tag{$\star$}$$
is a Brownian motion. By differentiation, $dW(t) = \mathrm{sgn}(X(t))\,dX(t)$, and therefore, since $\mathrm{sgn}(x)^{-1} = \mathrm{sgn}(x)$, $dX(t) = \mathrm{sgn}(X(t))\,dW(t)$. We have therefore obtained a solution. This solution cannot be a strong solution. Indeed, from ($\star$), we deduce that $W(t)$ is $\mathcal{F}_t^{|X|}$-measurable. If $\{X(t)\}_{t\ge 0}$ were a strong solution, it would be $\mathcal{F}_t^W$-adapted, that is, $\mathcal{F}_t^{|X|}$-adapted. But $\mathcal{F}_t^X$ contains more information than $\mathcal{F}_t^{|X|}$!
14.3.4 The Dirichlet Problem

This subsection gives a simple example of the interaction between the theory of stochastic differential equations and that of partial differential equations.

Definition 14.3.12 Let $u : \mathbb{R}^d \to \mathbb{R}$ be a function of class $C^2$ (twice differentiable with continuous derivatives) on an open set $O \subseteq \mathbb{R}^d$. Its Laplacian, defined in $O$, is the function
$$\Delta u(x) := \sum_{i=1}^d \frac{\partial^2}{\partial x_i^2}u(x).$$

Definition 14.3.13 The function $u : \mathbb{R}^d \to \mathbb{R}$ of class $C^2$ on a domain (open and connected set) $D \subseteq \mathbb{R}^d$ is said to be harmonic on $D$ if
$$\Delta u(x) = 0 \qquad (x \in D).$$
Example 14.3.14: Examples of harmonic functions. In dimension 2: $(x_1, x_2) \to \ln(x_1^2 + x_2^2)$, and $(x_1, x_2) \to e^{x_1}$ are harmonic on $D = \mathbb{R}^2$. In dimension 3: $x \to |x|^{2-d}$ is harmonic on $D = \mathbb{R}^3\setminus\{0\}$.

Let $F : \mathbb{R}^d \to \mathbb{R}$ be of class $C^2$. In the following $\{W(t)\}_{t\ge 0}$ will represent the Brownian motion starting from $a \in \mathbb{R}^d$ (that is, $W(0) = a$). Itô's formula gives, denoting by $\nabla f$ the gradient of a function $f$,
$$F(W(t)) = F(a) + \int_0^t \nabla F(W(s))\cdot dW(s) + \frac{1}{2}\int_0^t \Delta F(W(s))\,ds, \tag{14.24}$$
where
$$M(t) := F(a) + \int_0^t \nabla F(W(s))\cdot dW(s) \qquad (t \ge 0)$$
is an $\mathcal{F}_t^W$-local martingale since $\int_0^t |\nabla F(W(s))|^2\,ds < \infty$ ($t \ge 0$).
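Harmonicity claims such as those of Example 14.3.14 can be sanity-checked numerically away from singular points. The sketch below approximates the Laplacian by central second differences (a standard finite-difference formula, not taken from the text) and evaluates it for $\ln(x_1^2 + x_2^2)$ in dimension 2 and for $|x|^{-1}$ in dimension 3; the evaluation points and the step $h$ are arbitrary choices. For contrast, $|x|^2$ in dimension 2 has Laplacian 4.

```python
import math


def laplacian(f, x, h=1e-3):
    """Central second-difference approximation of the Laplacian of f at the point x."""
    center = f(x)
    total = 0.0
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        total += (f(forward) - 2.0 * center + f(backward)) / (h * h)
    return total


def log_radius_squared(x):
    return math.log(x[0] ** 2 + x[1] ** 2)


def inverse_radius(x):
    return 1.0 / math.sqrt(x[0] ** 2 + x[1] ** 2 + x[2] ** 2)


def radius_squared(x):
    return x[0] ** 2 + x[1] ** 2


lap_log = laplacian(log_radius_squared, [1.0, 0.5])
lap_inverse = laplacian(inverse_radius, [1.0, 0.5, -0.7])
lap_square = laplacian(radius_squared, [1.0, 0.5])
print(lap_log, lap_inverse, lap_square)
```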
Theorem 14.3.15 Let $u : \mathbb{R}^d \to \mathbb{R}$ be harmonic on the domain $D \subseteq \mathbb{R}^d$, and let $G \subset D$ be an open set whose closure $\mathrm{clos}\,G \subset D$. Let $\tau_G := \inf\{t \ge 0\,;\,W(t) \notin G\}$ be the entrance time of the Brownian motion in the complement of $G$. Then
$$u(W(t \wedge \tau_G)) - u(a) \qquad (t \ge 0)$$
is a centered $\mathcal{F}_t^W$-martingale.

Proof. Let $F$ be a function in $C^2(\mathbb{R}^d)$ whose restriction to $G$ is $u$, for instance
$$F(x) := \left([1_{G_{2\delta}} * \alpha] \times u\right)(x)\,1_D(x),$$
where $G_{2\delta}$ is the $2\delta$-neighborhood of $G$, $4\delta$ is the distance from $G$ to the complement of $D$, and $\alpha$ is a $C^\infty$ function of integral 1 on $\mathbb{R}^d$ and null outside $B(0,\delta)$. Then, by (14.24),
$$F(W(t \wedge \tau_G)) = F(a) + \int_0^{t\wedge\tau_G} \nabla F(W(s))\cdot dW(s) + \frac{1}{2}\int_0^{t\wedge\tau_G} \Delta F(W(s))\,ds,$$
and since $F = u$ is harmonic on $G$,
$$u(W(t \wedge \tau_G)) = F(a) + \int_0^{t\wedge\tau_G} \nabla F(W(s))\cdot dW(s) \qquad (t \ge 0),$$
a square-integrable $\mathcal{F}_t^W$-martingale (not just a local martingale, since $\nabla F$ is bounded on $\mathrm{clos}\,G$).

Definition 14.3.16 Let $D \subset \mathbb{R}^d$ be a bounded domain, and let $f : \partial D \to \mathbb{R}$ be a continuous function. The Dirichlet problem $(D, f)$ consists in finding a function $u$ that is harmonic on $D$ and equal to $f$ on $\partial D$.
Let $D_\varepsilon$ be the $\varepsilon$-interior of $D$. From Theorem 14.3.15 with $G = D_\varepsilon$ and $\varepsilon$ small enough,
$$u(x) = E_x\left[u(W(t \wedge \tau_{D_\varepsilon}))\right] \qquad (x \in D),$$
where the notation $E_x$ denotes expectation given that the initial position of the Brownian motion is $x$. Since $D$ is bounded, $P(\tau_{D_\varepsilon} < \infty) = 1$, and therefore, letting $t \to \infty$, $u(x) = E_x\left[u(W(\tau_{D_\varepsilon}))\right]$ by dominated convergence. Let now $\varepsilon \to 0$. Since $\tau_D < \infty$ and $W(\tau_{D_\varepsilon}) \to W(\tau_D)$, $u(x) = E_x\left[u(W(\tau_D))\right]$ by dominated convergence. By the boundary condition,
$$u(x) = E_x\left[f(W(\tau_D))\right]. \tag{14.25}$$
Therefore the solution to the Dirichlet problem is unique, if it exists. We shall not prove existence, which can be obtained by analytical as well as probabilistic arguments. We shall be content with the fact that the solution has a probabilistic interpretation, given by (14.25).
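The probabilistic representation (14.25) suggests a Monte Carlo method. The sketch below is a hedged illustration for the unit disk in dimension 2, using the walk-on-spheres technique, which is not part of the text: since Brownian motion started at the center of a disk exits it at a uniformly distributed point of the circle, one may jump directly to a uniform point of the largest circle centered at the current position and contained in $D$, until the boundary is reached within a tolerance. The boundary data $f(x_1, x_2) = x_1^2 - x_2^2$, the starting point, the tolerance and the number of walks are assumptions chosen so that the exact answer $u(x) = x_1^2 - x_2^2$ is known.

```python
import math
import random


def dirichlet_unit_disk(x1, x2, boundary_value, n_walks=20_000, tol=1e-4, seed=4):
    """Estimate E_x[f(W(tau_D))] for the unit disk D by the walk-on-spheres method."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_walks):
        p1, p2 = x1, x2
        while True:
            radius = 1.0 - math.hypot(p1, p2)  # distance to the unit circle
            if radius < tol:
                break
            angle = rng.uniform(0.0, 2.0 * math.pi)
            # Exit point of the largest disk centered at (p1, p2) inside D
            p1 += radius * math.cos(angle)
            p2 += radius * math.sin(angle)
        norm = math.hypot(p1, p2)
        total += boundary_value(p1 / norm, p2 / norm)  # project onto the boundary
    return total / n_walks


def boundary_data(y1, y2):
    return y1 * y1 - y2 * y2  # restriction to the circle of a harmonic polynomial


estimate = dirichlet_unit_disk(0.3, 0.2, boundary_data)
exact = 0.3 ** 2 - 0.2 ** 2
print(f"Monte Carlo u(0.3, 0.2) = {estimate:.4f}, exact value = {exact:.4f}")
```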
Complementary reading [Kuo, 2006] and [Baldi, 2018]. The latter has a chapter on mathematical ﬁnance and a large collection of corrected exercises. [Oksendal, 1995] has many examples in diverse areas.
14.4 Exercises
Exercise 14.4.1. An $\mathcal{F}_t$-martingale
Let $\{W(t)\}_{t\ge 0}$ be an $\mathcal{F}_t$-Wiener process and let $\varphi, \psi \in \mathcal{A}(\mathbb{R}_+)$. Let $M(t) := \int_0^t \varphi(s)\,dW(s)$ and $N(t) := \int_0^t \psi(s)\,dW(s)$. Show that $M(t)N(t) - \int_0^t \varphi(s)\psi(s)\,ds$ ($t \ge 0$) is an $\mathcal{F}_t$-martingale.

Exercise 14.4.2. Proof of the reflection principle
Prove Theorem 11.2.2 using Itô calculus. (Hint: see the proof of the strong Markov property of the Brownian motion given in Subsection 14.3.2.)

Exercise 14.4.3. Itô integrals as Lebesgue integrals, I
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion, and $0 \le a < b$. Prove that
$$\int_a^b W(s)^3\,dW(s) = \frac{1}{4}\left(W(b)^4 - W(a)^4\right) - \frac{3}{2}\int_a^b W(s)^2\,ds.$$

Exercise 14.4.4. Itô integrals as Lebesgue integrals, II
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion, and $0 \le a < b$. Prove that
$$\int_a^b e^{W(s)}\,dW(s) = e^{W(b)} - e^{W(a)} - \frac{1}{2}\int_a^b e^{W(s)}\,ds.$$
Exercise 14.4.5. Brownian motion on the circle
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. Let
$$V(t) := \begin{pmatrix} \cos W(t) \\ \sin W(t) \end{pmatrix}.$$
Show that
$$dV(t) = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} V(t)\,dW(t) - \frac{1}{2}V(t)\,dt \quad \text{with} \quad V(0) = \begin{pmatrix} 1 \\ 0 \end{pmatrix}.$$

Exercise 14.4.6. Proof of Theorem 14.1.8
Give the details of the proof of Theorem 14.1.8.

Exercise 14.4.7. The area under a Brownian motion
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. Show that
$$\int_0^t W(s)\,ds - \frac{1}{3}W(t)^3$$
is an $\mathcal{F}_t^W$-martingale. Use this to compute the expected value of the area of the set
$$\{(t,x)\,;\,t \in [0, T_{a,b}],\ x \in [0, W(t)]\},$$
where $T_{a,b} := \inf\{t \ge 0\,;\,W(t) \in \{-b, a\}\}$ and $a > 0$, $b > 0$.

Exercise 14.4.8. Continuous integrands for the Itô integral
Prove the statement of Remark 14.1.3.

Exercise 14.4.9. A martingale
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. (i) Show that the stochastic process
$$Y(t) := tW(t) - \int_0^t W(s)\,ds \qquad (t \ge 0)$$
is a martingale. (ii) Show that for all $[a,b] \subset \mathbb{R}_+$, $Y(b) - Y(a)$ is orthogonal (in $L^2_{\mathbb{R}}(P)$) to $H(W(s)\,;\,s \in [0,a])$.

Exercise 14.4.10. The $n$th power of a Brownian motion
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. Show that
$$W(t)^n - \frac{n(n-1)}{2}\int_0^t W(s)^{n-2}\,ds \qquad (t \ge 0)$$
is a martingale.
Exercise 14.4.11. The product of independent Brownian motions
Let $\{W_1(t)\}_{t\ge 0}$ and $\{W_2(t)\}_{t\ge 0}$ be independent standard Brownian motions. Is the claim
$$W_1(t)W_2(t) = \int_0^t W_1(s)\,dW_2(s) + \int_0^t W_2(s)\,dW_1(s)$$
resulting from a naive application of the formula of integration by parts true?

Exercise 14.4.12. A differential equation for the Brownian bridge
Using the results of Exercise 11.5.13, show that the Brownian bridge thereof satisfies the following equation:
$$Z(t) = -\int_0^t \frac{Z(s)}{1-s}\,ds + W(t).$$

Exercise 14.4.13. Brownian motion on the circle
Let $\{W(t)\}_{t\ge 0}$ be a standard Brownian motion. Show that the vector process $V(t) = (\cos W(t), \sin W(t))^T$ ($t \ge 0$) satisfies a stochastic differential equation of the form
$$dV(t) = AV(t)\,dW(t) - \frac{1}{2}V(t)\,dt$$
with initial condition $V(0) = (1, 0)^T$, where $A$ is a matrix to be identified.

Exercise 14.4.14. A motion on the cone
Let $\{W_1(t)\}_{t\ge 0}$ and $\{W_2(t)\}_{t\ge 0}$ be two independent standard Brownian motions. Show that the vector process
$$V(t) = \left(e^{W_1(t)}\cos(W_2(t)),\ e^{W_1(t)}\sin(W_2(t)),\ e^{W_1(t)}\right)^T \qquad (t \ge 0)$$
satisfies a stochastic differential equation of the form
$$dV(t) = AV(t)\,dW_1(t) + BV(t)\,dW_2(t) + CV(t)\,dt$$
with initial condition $V(0) = (1, 0, 1)^T$, where $A$, $B$ and $C$ are matrices to be identified.

Exercise 14.4.15. A stochastic differential equation
Prove that the stochastic process
$$X(t) := \sinh\left(\sinh^{-1}(X(0)) + W(t) + t\right) \qquad (t \ge 0)$$
is a solution of the differential equation
$$dX(t) = \sqrt{1 + X(t)^2}\,dW(t) + \left(\sqrt{1 + X(t)^2} + \frac{1}{2}X(t)\right) dt.$$

Exercise 14.4.16. The Vasicek model
Consider the stochastic differential equation
$$dX(t) = (-bX(t) + c)\,dt + \sigma\,dW(t).$$
Prove that the unique solution with initial state $X(0)$ is given by
$$X(t) = \frac{c}{b} + \left(X(0) - \frac{c}{b}\right)e^{-bt} + \sigma\int_0^t e^{-b(t-s)}\,dW(s).$$
Exercise 14.4.17. Tanaka's differential equation
Prove that any solution of the stochastic differential equation
$$X(t) = \int_0^t \mathrm{sgn}(X(s))\,dW(s) \qquad (t \ge 0)$$
is such that for all $\lambda > 0$, the process
$$\exp\left\{\lambda X(t) - \frac{1}{2}\lambda^2 t\right\} \qquad (t \ge 0)$$
is a martingale.

Exercise 14.4.18. Itô integrals as Riemann integrals
Let $f : \mathbb{R} \to \mathbb{R}$ be a continuous function with continuous derivative $f'$. Let $F(x) := \int_0^x f(t)\,dt$. Show that
$$\int_0^t f(W(s))\,dW(s) = F(W(t)) - F(0) - \frac{1}{2}\int_0^t f'(W(s))\,ds.$$
Use this result to express the following Itô integrals in terms of Riemann integrals:
$$\int_0^t W(s)e^{W(s)}\,dW(s), \qquad \int_0^t \frac{1}{1 + W(s)^2}\,dW(s), \qquad \int_0^t e^{W(s) - \frac{1}{2}s}\,dW(s), \qquad \int_0^t \frac{W(s)}{1 + W(s)^2}\,dW(s).$$
Chapter 15 Point Processes with a Stochastic Intensity

Let $\{\mathcal{F}_t\}_{t\in\mathbb{R}}$ be some history of a simple locally finite point process $N$ on $\mathbb{R}$ (that is, a non-decreasing family of $\sigma$-fields such that for all $t \in \mathbb{R}$ and all $a \le b \le t$, $N((a,b])$ is $\mathcal{F}_t$-measurable). If it holds that for all $t \in \mathbb{R}$,
$$\lim_{h\downarrow 0} \frac{1}{h}E\left[N((t, t+h]) \mid \mathcal{F}_t\right] = \lambda(t) \qquad P\text{-a.s.}, \tag{$\star$}$$
for some nonnegative locally integrable $\mathcal{F}_t$-adapted stochastic process $\{\lambda(t)\}_{t\in\mathbb{R}}$, the latter is called a stochastic $\mathcal{F}_t$-intensity of $N$. This local definition of intensity is advantageously replaced by a global definition that does not involve a limiting derivative-type procedure and is more amenable to rigorous analysis. It opens a connection with the rich theory of martingales and offers among other things a unified view of stochastic systems driven by point processes. This point of view will reveal a striking analogy with the contents of Chapter 14, the first instance of which is found in Paul Lévy's martingale characterization of Brownian motion (Theorem 14.2.5) and in Watanabe's martingale characterization of the standard Poisson process. The proof of Theorem 13.4.3 is a first example of the point process "stochastic calculus".
15.1 Stochastic Intensity

15.1.1 The Martingale Definition
For a Poisson process $N$ on the real line with locally integrable intensity function $\lambda(t)$, it holds that for all intervals $[c,d] \subset \mathbb{R}$,
$$E\left[N((c,d]) \mid \mathcal{F}_c^N\right] = \int_c^d \lambda(s)\,ds$$
or, equivalently since the right-hand side is a deterministic quantity,
$$E\left[N((c,d]) \mid \mathcal{F}_c^N\right] = E\left[\int_c^d \lambda(s)\,ds \mid \mathcal{F}_c^N\right].$$
This motivates the following definition of stochastic intensity.

© Springer Nature Switzerland AG 2020. P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/9783030401832_15
Definition 15.1.1 Let $N$ be a simple locally finite point process on $\mathbb{R}$, let $\{\mathcal{F}_t\}_{t\in\mathbb{R}}$ be a history of $N$ and let $\{\lambda(t)\}_{t\in\mathbb{R}}$ be a nonnegative a.s. locally integrable real-valued $\mathcal{F}_t$-progressively measurable stochastic process. If for all $a \in \mathbb{R}$ and all intervals $(c,d] \subset (a,\infty)$,
$$E\left[N\left((c \wedge T_n^{(a)},\, d \wedge T_n^{(a)}]\right) \mid \mathcal{F}_c\right] = E\left[\int_{c\wedge T_n^{(a)}}^{d\wedge T_n^{(a)}} \lambda(s)\,ds \mid \mathcal{F}_c\right], \tag{15.1}$$
where $\{T_n^{(a)}\}_{n\ge 1}$ is a non-decreasing sequence of $\mathcal{F}_t$-stopping times such that $T_n^{(a)} > a$, $\lim_{n\uparrow\infty} T_n^{(a)} = \infty$ and $E\left[N((a, T_n^{(a)}])\right] < \infty$, $N$ is then said to admit the stochastic $(P, \mathcal{F}_t)$-intensity $\{\lambda(t)\}_{t\in\mathbb{R}}$.

The connection between stochastic intensity (Definition 15.1.1) and martingales is the following:
$$M(t) := N((0,t]) - \int_0^t \lambda(s)\,ds$$
is a local $(P, \mathcal{F}_t)$-martingale (Exercise 15.4.1), called the fundamental (local) $\mathcal{F}_t$-martingale of the point process $N$. When the choice of probability $P$ is clear from the context, one says "the $\mathcal{F}_t$-intensity" instead of "the $(P, \mathcal{F}_t)$-intensity".

Remark 15.1.2 The requirement of $\mathcal{F}_t$-progressiveness of the stochastic intensity guarantees that the integrated intensity process $\left\{\int_0^t \lambda(s)\,ds\right\}_{t\in\mathbb{R}}$ is measurable and $\mathcal{F}_t$-adapted (Exercise 5.4.2).

Remark 15.1.3 When considering point processes on the positive half-line, the intervention of the $a$'s is superfluous, and it suffices to require that (15.1) holds for $a = 0$. However, the slightly more complicated definition given above is needed to handle point processes on the whole real line, especially stationary point processes.

Remark 15.1.4 The reason why requirement (15.1) cannot be replaced by the simpler one,
$$E\left[N((c,d]) \mid \mathcal{F}_c\right] = E\left[\int_c^d \lambda(s)\,ds \mid \mathcal{F}_c\right], \tag{$\dagger$}$$
not involving the stopping times $T_n^{(a)}$, is that it may occur that both sides of ($\dagger$) are infinite, in which case the information contained in ($\dagger$) is nil. This happens for instance when $N$ is a homogeneous Cox process whose random intensity $\Lambda$ has an infinite expectation (see Example 15.1.5 below).
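For a Poisson process with a deterministic intensity function, the identity $E[N((c,d])] = \int_c^d \lambda(s)\,ds$ that motivates Definition 15.1.1 can be checked by simulation. The sketch below generates the points by thinning a homogeneous Poisson process of rate `rate_bound` (the Lewis–Shedler method, which is not described in the text); the intensity $\lambda(t) = 2 + \cos t$, the interval $(1, 4]$ and the number of runs are illustrative assumptions.

```python
import math
import random


def count_points(c, d, rate, rate_bound, rng):
    """Number of points in (c, d] of a Poisson process with intensity `rate`, by thinning."""
    t = c
    count = 0
    while True:
        t += rng.expovariate(rate_bound)  # next point of the dominating homogeneous process
        if t > d:
            return count
        if rng.random() * rate_bound <= rate(t):  # keep with probability rate(t)/rate_bound
            count += 1


def rate(t):
    return 2.0 + math.cos(t)  # bounded above by 3


rng = random.Random(5)
c, d, n_runs = 1.0, 4.0, 20_000
mean_count = sum(count_points(c, d, rate, 3.0, rng) for _ in range(n_runs)) / n_runs
integral = 2.0 * (d - c) + math.sin(d) - math.sin(c)  # integral of rate over (c, d]
print(f"empirical mean of N((c,d]) = {mean_count:.3f}, integral of intensity = {integral:.3f}")
```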
Example 15.1.5: Poisson and Cox Processes. Let $N$ be a Cox process on $\mathbb{R}_+$ with conditional intensity measure $\nu$ with respect to $\mathcal{G} \supseteq \mathcal{F}^\nu$ (see Definition 8.2.5) and suppose that $\nu(C) := \int_C \lambda(s)\,ds$ ($C \in \mathcal{B}(\mathbb{R}_+)$), where $\{\lambda(t)\}_{t\ge 0}$ is a locally integrable nonnegative process. Then $N$ admits this process as an $\mathcal{F}_t$-intensity, where $\mathcal{F}_t := \mathcal{F}_t^N \vee \mathcal{G}$ ($t \ge 0$) (Exercise 15.4.4).

Theorem 15.1.11 below gives a formula that can be considered both a refinement of Campbell's formula and an extension of the smoothing formula for hpps (Theorem 7.1.7). The notion of predictable process will be needed.

Definition 15.1.6 Let $T = \mathbb{R}$ or $\mathbb{R}_+$. Let $\{\mathcal{F}_t\}_{t\in T}$ be a history. The predictable $\sigma$-field $\mathcal{P}(\mathcal{F}_\cdot)$ on $T \times \Omega$ is the $\sigma$-field generated by the collection of sets
$$(a,b] \times A \qquad ([a,b] \subset T,\ A \in \mathcal{F}_a), \tag{15.2}$$
to which one must add, in the case $T = \mathbb{R}_+$, the sets $\{0\} \times A$ ($A \in \mathcal{F}_0$). A stochastic process $\{X(t)\}_{t\in T}$ taking its values in a measurable space $(E, \mathcal{E})$ is called an $\mathcal{F}_t$-predictable process if the mapping $(t,\omega) \to X(t,\omega)$ is $\mathcal{P}(\mathcal{F}_\cdot)$-measurable. For short, one then says: $\{X(t)\}_{t\in T}$ is in $\mathcal{P}(\mathcal{F}_\cdot)$.

Definition 15.1.7 Let $T = \mathbb{R}$ or $\mathbb{R}_+$ and let $\{\mathcal{F}_t\}_{t\in T}$ be a history. Let $(K, \mathcal{K})$ be some measurable space. Let $H : (T \times \Omega \times K,\ \mathcal{P}(\mathcal{F}_\cdot) \otimes \mathcal{K}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$. One then says that $\{H(t,z)\}_{t\in T,\,z\in K}$ is an $\mathcal{F}_t$-predictable stochastic process indexed by $K$.

Remark 15.1.8 An $\mathcal{F}_t$-predictable process is $\mathcal{F}_t$-progressive (Exercise 15.4.6).

Example 15.1.9: Left-continuity and Predictability. A complex-valued stochastic process $\{X(t)\}_{t\in\mathbb{R}}$ adapted to $\{\mathcal{F}_t\}_{t\in\mathbb{R}}$ and with left-continuous trajectories is $\mathcal{F}_t$-predictable. In fact, by left-continuity, $X(t,\omega) = \lim_{n\uparrow\infty} X_n(t,\omega)$, where
$$X_n(t,\omega) := \sum_{k=-n2^n}^{+n2^n} X(k2^{-n},\omega)\,1_{(k2^{-n},\,(k+1)2^{-n}]}(t),$$
and since $X(k2^{-n})$ is $\mathcal{F}_{k2^{-n}}$-measurable, $(t,\omega) \to X_n(t,\omega)$ is $\mathcal{P}(\mathcal{F}_\cdot)$-measurable.
Example 15.1.10: Another Typical Ft predictable Process. Let S and τ be two Ft stopping times such that S ≤ τ , and let ϕ : R+ × R → R be a measurable function. Then X(t, ω) = ϕ(S(ω), t)1{S(ω)0} ds .
0
n≥1
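Let $R_1$ be the first strictly positive time at which the system is empty ($\infty$ if the system never empties).

[Figure: the workload process $W(t)$ plotted against time $t$, showing the initial workload $\sigma_0$, the arrival times $T_1, T_2, T_3, T_4$ with service requirements $\sigma_1, \sigma_2, \sigma_3, \sigma_4$, and the time $R_1$.]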
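Clearly $R_1$ is an $\mathcal{F}_t^N$-stopping time, where $N$ is the point process on $\mathbb{R}_+ \times \mathbb{R}_+$ with point sequence $\{(T_n, \sigma_n)\}_{n\in\mathbb{N}}$. For all $M > 0$,
$$R_1 \wedge M \le \sigma_0 + \sum_{k\ge 1} \sigma_k 1_{(0,\,R_1\wedge M]}(T_k) = \sigma_0 + \int_0^\infty \int_{\mathbb{R}_+} \sigma 1_{(0,\,R_1\wedge M]}(t)\,N(dt \times d\sigma). \tag{15.7}$$
The mapping $(t,\omega,\sigma) \to H(t,\omega,\sigma) = \sigma 1_{(0,\,R_1(\omega)\wedge M]}(t)$ is in $\mathcal{P}(\mathcal{F}_\cdot^N) \otimes \mathcal{B}(\mathbb{R}_+)$ (noting that it is left-continuous in the $t$-argument). Since the $\mathcal{F}_t^N$-intensity kernel of $N$ is $\lambda G(dz)$, we obtain from (15.7) and Theorem 15.1.31
$$E[R_1 \wedge M] \le E[\sigma_0] + E\left[\int_0^\infty \int_{\mathbb{R}_+} \sigma 1_{(0,\,R_1\wedge M]}(t)\,\lambda G(d\sigma)\,dt\right] = E[\sigma_0] + \lambda E[\sigma_0]E[R_1 \wedge M]. \tag{15.8}$$
In particular, $E[R_1 \wedge M](1 - \lambda E[\sigma_0]) \le E[\sigma_0]$, and therefore, if $\lambda E[\sigma_0] < 1$,
$$E[R_1 \wedge M] \le \frac{E[\sigma_0]}{1 - \lambda E[\sigma_0]}$$
for all $M > 0$, and therefore $E[R_1] \le \frac{E[\sigma_0]}{1 - \lambda E[\sigma_0]} < \infty$. Reproducing the calculation with $R_1$ replacing $R_1 \wedge M$, we have, since $R_1$ is almost surely finite, the equality
$$E[R_1] = \frac{E[\sigma_0]}{1 - \lambda E[\sigma_0]}.$$
In summary: $\lambda E[\sigma_0] < 1$ is a sufficient condition for $E[R_1]$ to be finite in an M/GI/1/∞ queue. It is also a necessary condition if $E[\sigma_0] > 0$, because when $R_1$ is finite,
$$R_1 = \sigma_0 + \sum_{k\ge 1} \sigma_k 1_{(0,\,R_1]}(T_k)$$
and therefore $E[R_1] = E[\sigma_0] + \lambda E[\sigma_0]E[R_1]$, that is $E[R_1](1 - \lambda E[\sigma_0]) = E[\sigma_0]$, which implies that $1 - \lambda E[\sigma_0] > 0$.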
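The formula $E[R_1] = E[\sigma_0]/(1 - \lambda E[\sigma_0])$ lends itself to a quick simulation check. The sketch below uses the representation $R_1 = \sigma_0 + \sum_{k\ge 1}\sigma_k 1_{(0,R_1]}(T_k)$: every arrival occurring before the current emptying time postpones it by its service requirement. Exponential service times with mean 1 (an M/M/1 special case) and the rate $\lambda = 0.5$ are assumptions of this illustration, for which the formula gives $E[R_1] = 2$.

```python
import random


def first_emptying_time(arrival_rate, mean_service, rng):
    """Simulate R_1 for a queue with Poisson arrivals and i.i.d. exponential service times."""
    emptying_time = rng.expovariate(1.0 / mean_service)  # initial workload sigma_0
    arrival_time = 0.0
    while True:
        arrival_time += rng.expovariate(arrival_rate)  # next arrival time T_k
        if arrival_time > emptying_time:
            return emptying_time  # the system emptied before this arrival
        emptying_time += rng.expovariate(1.0 / mean_service)  # add sigma_k


rng = random.Random(6)
arrival_rate, mean_service, n_runs = 0.5, 1.0, 100_000
empirical_mean = sum(
    first_emptying_time(arrival_rate, mean_service, rng) for _ in range(n_runs)
) / n_runs
predicted_mean = mean_service / (1.0 - arrival_rate * mean_service)
print(f"empirical E[R_1] = {empirical_mean:.3f}, formula = {predicted_mean:.3f}")
```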
Let $(N, Z)$ be a simple locally finite marked point process on $\mathbb{R}_+$ with marks in the measurable space $(K, \mathcal{K})$ and let $\{\mathcal{F}_t\}_{t\ge 0}$ be a history of $(N, Z)$. Let $(N, Z)$ admit the $\mathcal{F}_t$-intensity kernel $\lambda(t, dz)$. In the sequel, the following notation will be used:
$$\int_{(0,t]\times K} H(s,z)\,M_Z(ds \times dz) := \int_{(0,t]\times K} H(s,z)\,N_Z(ds \times dz) - \int_{(0,t]\times K} H(s,z)\,\lambda(s,dz)\,ds,$$
provided the right-hand side is well defined. Therefore, formally,
$$M_Z(ds \times dz) := N_Z(ds \times dz) - \lambda(s,dz)\,ds.$$
Stochastic Integrals and Martingales

Theorem 15.1.33 Let $\{H(t,z)\}_{t\ge 0}$ be an $\mathcal{F}_t$-predictable real-valued stochastic process indexed by $K$ such that for all $t \ge 0$,
$$E\left[\int_{(0,t]\times K} |H(s,z)|\,\lambda(s,dz)\,ds\right] < \infty. \tag{15.9}$$
Then the stochastic process
$$M(t) := \int_{(0,t]\times K} H(s,z)\,M_Z(ds \times dz) \tag{15.10}$$
is a well-defined centered $\mathcal{F}_t$-martingale.

Proof. Condition (15.9) is equivalent to $E\left[\int_{(0,t]\times K} |H(s,z)|\,N_Z(ds \times dz)\right] < \infty$ for all $t \ge 0$ (Theorem 15.1.31). Therefore, for all $t \ge 0$, $\int_{(0,t]\times K} |H(s,z)|\,\lambda(s,dz)\,ds < \infty$ and $\int_{(0,t]\times K} |H(s,z)|\,N_Z(ds \times dz) < \infty$, $P$-a.s. In particular, $\{M(t)\}_{t\ge 0}$ is $P$-a.s. a well-defined and finite stochastic process. For all $a, b \in \mathbb{R}_+$ ($0 \le a \le b$) and for all $A \in \mathcal{F}_a$,
$$E\left[1_A (M(b) - M(a))\right] = E\left[\int_{\mathbb{R}_+\times K} H'(s,z)\,M_Z(ds \times dz)\right],$$
where $H'(s,z) := H(s,z)1_A 1_{(a,b]}(s)$ defines an $\mathcal{F}_t$-predictable real-valued stochastic process indexed by $K$. By Theorem 15.1.31,
$$E\left[\int_{\mathbb{R}_+\times K} H'(s,z)\,N_Z(ds \times dz)\right] = E\left[\int_{\mathbb{R}_+\times K} H'(s,z)\,\lambda(s,dz)\,ds\right]$$
and therefore, for all $A \in \mathcal{F}_a$, $E\left[1_A (M(b) - M(a))\right] = 0$.

Corollary 15.1.34 Replacing assumption (15.9) of Theorem 15.1.33 by the condition that $P$-almost surely
$$\int_{(0,t]\times K} |H(s,z)|\,\lambda(s,dz)\,ds < \infty \quad \text{for all } t \ge 0, \tag{15.11}$$
the stochastic process $\{M(t)\}_{t\ge 0}$ defined by (15.10) is then a local $\mathcal{F}_t$-martingale.
Proof. The random time
$$S_n := \inf\left\{t > 0\,;\,\int_{(0,t]\times K} |H(s,z)|\,\lambda(s,dz)\,ds \ge n\right\},$$
with the usual convention $\inf\emptyset = +\infty$, is for each $n \ge 1$ an $\mathcal{F}_t$-stopping time, and $\lim_{n\uparrow\infty} S_n = \infty$. Moreover, $H(t,z)1_{\{t \le S_n\}}$ satisfies condition (15.9). Therefore by Theorem 15.1.33, $\{M(t \wedge S_n)\}_{t\ge 0}$ is a well-defined $\mathcal{F}_t$-martingale.

Theorem 15.1.35 Let $H$ be an $\mathcal{F}_t$-predictable real-valued stochastic process indexed by $K$ such that for all $t \ge 0$,
$$E\left[\int_{(0,t]\times K} |H(s,z)|^2\,\lambda(s,dz)\,ds\right] < \infty. \tag{15.12}$$
Then the stochastic process
$$M(t) := \int_{(0,t]\times K} H(s,z)\,M_Z(ds \times dz)$$
is well defined and a square-integrable $\mathcal{F}_t$-martingale. Moreover,
$$E\left[|M(t)|^2\right] = E\left[\int_{(0,t]\times K} |H(s,z)|^2\,\lambda(s,dz)\,ds\right]. \tag{15.13}$$
Proof. Let $T_n$ be the $n$th event time of the base point process. That $M(t)$ is well defined follows from Theorem 15.1.33. In fact, observing that $E\left[\int_{(0,T_n]} \lambda(s)\,ds\right] = E\left[N(0,T_n]\right] = n$,
$$\begin{aligned}
E\left[\int_{(0,t\wedge T_n]\times K} |H(s,z)|\,\lambda(s,dz)\,ds\right]
&\le E\left[\int_{(0,t\wedge T_n]\times K} \left(1 + |H(s,z)|^2\right)\lambda(s,dz)\,ds\right] \\
&\le E\left[N(0,T_n]\right] + E\left[\int_{(0,t\wedge T_n]\times K} |H(s,z)|^2\,\lambda(s,dz)\,ds\right] \\
&= n + E\left[\int_{(0,t\wedge T_n]\times K} |H(s,z)|^2\,\lambda(s,dz)\,ds\right] < \infty .
\end{aligned}$$
Therefore $\{M(t\wedge T_n)\}_{t\ge 0}$ is well defined, and so is $\{M(t)\}_{t\ge 0}$ since $\lim_{n\uparrow\infty} T_n = \infty$.

We now turn to the proof of (15.13). By the product rule of Stieltjes–Lebesgue calculus,
$$M(t)^2 = 2\int_{(0,t]} M(s-)\,dM(s) + \int_{(0,t]\times K} |H(s,z)|^2\, N_Z(ds\times dz) .$$
Since
$$m(t) := \int_{(0,t]} M(s-)\,dM(s) = \int_{(0,t]\times K} M(s-)\,H(s,z)\, M_Z(ds\times dz)$$
is a local $\mathcal{F}_t$-martingale with respect to the localizing stopping times
$$V_n := \inf\left\{t \ge 0 \,;\ |M(t-)| + \int_{(0,t]\times K} |H(s,z)|\,\lambda(s,dz)\,ds \ge n\right\} \wedge T_n ,$$
we have that
$$E\left[M(t\wedge V_n)^2\right] = E\left[\int_{(0,t\wedge V_n]\times K} |H(s,z)|^2\, N_Z(ds\times dz)\right] = E\left[\int_{(0,t\wedge V_n]\times K} |H(s,z)|^2\,\lambda(s,dz)\,ds\right],$$
from which (15.13) follows if we can show that $\lim_{n\uparrow\infty} E\left[M(t\wedge V_n)^2\right] = E\left[M(t)^2\right]$. This will be the case if (as will soon be proved) $M(t\wedge V_n)$ converges in $L^2_{\mathbb{C}}(P)$ to some limit, since this limit is necessarily $M(t)$, the almost-sure limit of $M(t\wedge V_n)$. The $L^2_{\mathbb{C}}(P)$-convergence of $M(t\wedge V_n)$ follows from the Cauchy criterion since, by a computation similar to the one above, with $m \ge n$,
$$\begin{aligned}
E\left[\left(M(t\wedge V_m) - M(t\wedge V_n)\right)^2\right]
&= 2\,E\left[m(t\wedge V_m) - m(t\wedge V_n)\right] + E\left[\int_{(t\wedge V_n,\,t\wedge V_m]\times K} |H(s,z)|^2\,\lambda(s,dz)\,ds\right] \\
&= E\left[\int_{(t\wedge V_n,\,t\wedge V_m]\times K} |H(s,z)|^2\,\lambda(s,dz)\,ds\right],
\end{aligned}$$
a quantity that vanishes as $m, n \uparrow \infty$.
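The isometry (15.13) can be illustrated numerically in the simplest unmarked case. The following is a minimal Monte Carlo sketch, not part of the original text: it assumes a homogeneous Poisson process of rate `rate` (so that $\lambda(s) \equiv$ `rate`), the deterministic integrand $H(s) = s$, and hypothetical helper names `sample_M` and `estimate_moments`. Under these assumptions $M(t) = \sum_{T_n \le t} T_n - \text{rate}\cdot t^2/2$, and (15.13) predicts $E[M(t)^2] = \text{rate}\cdot t^3/3$.

```python
import random

def sample_M(t, rate, rng):
    """One sample of M(t) = sum_{T_n <= t} H(T_n) - int_0^t H(s) * rate ds, with H(s) = s."""
    total = 0.0
    clock = rng.expovariate(rate)           # first event time T_1
    while clock <= t:
        total += clock                      # H(T_n) = T_n
        clock += rng.expovariate(rate)      # next event time
    return total - rate * t * t / 2.0       # subtract the compensator

def estimate_moments(t=1.0, rate=2.0, n_paths=40000, seed=12345):
    """Empirical first and second moments of M(t) over independent paths."""
    rng = random.Random(seed)
    samples = [sample_M(t, rate, rng) for _ in range(n_paths)]
    mean = sum(samples) / n_paths
    second = sum(x * x for x in samples) / n_paths
    return mean, second

mean, second = estimate_moments()
theory = 2.0 * 1.0 ** 3 / 3.0               # rate * t^3 / 3 for rate = 2, t = 1
print(mean, second, theory)
```

The empirical mean should be close to 0 (centering, Theorem 15.1.33) and the empirical second moment close to `theory`; both are estimates and fluctuate with the seed and the number of paths.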
Corollary 15.1.36 If assumption (15.12) of Theorem 15.1.35 is replaced by
$$\int_{(0,t]\times K} |H(s,z)|^2\,\lambda(s,dz)\,ds < \infty \quad P\text{-a.s.} \quad (t \ge 0) , \tag{15.14}$$
the stochastic process
$$M(t) := \int_{(0,t]\times K} H(s,z)\, M_Z(ds\times dz) \quad (t \ge 0)$$
is a square-integrable local $\mathcal{F}_t$-martingale.

Proof. This follows from Theorem 15.1.35 in the same manner as Corollary 15.1.34 followed from Theorem 15.1.33.

15.1.3 Martingales as Stochastic Integrals
Let (N, Z) be a marked point process on [0, 1] with marks in the measurable space (K, K) such that the associated lifted point process NZ is Poisson with intensity measure ν(dt × dz) := λ(t, z) dt Q(dz) , where Q is a probability measure on the measurable space (K, K) of the marks. We suppose that ν([0, 1] × K) < ∞.
Theorem 15.1.37 Let $\{m(t)\}_{t\in[0,1]}$ be a real-valued centered square-integrable $\mathcal{F}_t^{(N,Z)}$-martingale on $[0,1]$. Then
$$m(t) = \int_{[0,t]\times K} C(s,z)\, M_Z(ds\times dz) ,$$
where $C$ is an $\mathcal{F}_t$-predictable real-valued stochastic process indexed by $K$ such that for all $t \ge 0$,
$$E\left[\int_{(0,t]\times K} |C(s,z)|^2\,\lambda(s,dz)\,ds\right] < \infty . \tag{15.15}$$

Lemma 15.1.38 A. The collection of random variables $\mathcal{H} := \{e^X \,;\ X \in \mathcal{U}\}$, where $\mathcal{U}$ is the collection of random variables
$$\sum_{j=1}^{k} a_j\, N_Z(C_j \times L_j) \quad (k \in \mathbb{N},\ C_j \in \mathcal{B}([0,1]),\ L_j \in \mathcal{K},\ a_1,\ldots,a_k \in \mathbb{R}) ,$$
is total in the Hilbert space $L^2_{\mathbb{R}}(\mathcal{F}_1^{(N,Z)}, P)$.

B. The collection of random variables that are linear combinations of elements of
$$\mathcal{K}_0 := \left\{ M_f(1) := \exp\left\{\int_{[0,1]\times K} f(s,z)\, N_Z(ds\times dz) - \int_{[0,1]\times K} \left(e^{f(s,z)} - 1\right)\lambda(s,z)\,ds\,Q(dz)\right\} \right\},$$
where $f : [0,1]\times K \to \mathbb{R}$ ranges over the measurable functions such that
$$\int_{[0,1]\times K} \left(e^{f(t,z)} - 1\right)^2 \lambda(t,z)\,dt\,Q(dz) < \infty ,$$
is dense in the Hilbert space $L^2_{\mathbb{R}}(\mathcal{F}_1^{(N,Z)}, P)$.
The proof is an immediate adaptation of that of Lemma 14.3.2. The proof of Theorem 15.1.37 is in turn an easy adaptation of the proof of Theorem 14.3.1 and is based on the following lemma:

Lemma 15.1.39 Let $f : [0,1]\times K \to \mathbb{R}$ be a measurable function such that
$$\int_{[0,1]\times K} \left(e^{f(t,z)} - 1\right)^2 \lambda(t,z)\,dt\,Q(dz) < \infty .$$
(This is the case if $f$ is nonpositive or bounded.) Let, for $t \in [0,1]$,
$$M_f(t) := \exp\left\{\int_{[0,t]\times K} f(s,z)\, N_Z(ds\times dz) - \int_{[0,t]\times K} \left(e^{f(s,z)} - 1\right)\lambda(s,z)\,ds\,Q(dz)\right\}.$$
Under the above conditions,

(a): $$M_f(t) = 1 + \int_{[0,t]\times K} M_f(s-)\left(e^{f(s,z)} - 1\right) M_Z(ds\times dz) ,$$

(b): $\{M_f(t)\}_{t\in[0,1]}$ is a square-integrable martingale, and

(c): $$E\left[M_f(t)^2\right] - 1 = E\left[\int_{[0,t]\times K} M_f(s)^2 \left(e^{f(s,z)} - 1\right)^2 \lambda(s,z)\,ds\,Q(dz)\right] \quad (t \in [0,1]).$$

Proof. (a): Observe that at an event time $t$ of the base point process with corresponding mark $z \in K$,
$$M_f(t) - M_f(t-) = M_f(t-)\left(e^{f(t,z)} - 1\right),$$
whereas at a time $t$ between two event times,
$$dM_f(t) = -M_f(t)\,dt \int_K \left(e^{f(t,z)} - 1\right)\lambda(t,z)\,Q(dz) .$$
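(b): It suffices to show, in view of Theorem 15.1.35, that
$$E\left[\int_0^1 \int_K M_f(s)^2 \left(e^{f(s,z)} - 1\right)^2 \lambda(s,z)\,Q(dz)\,ds\right] < \infty . \tag{15.16}$$
This is true when $M_f(t)$ is replaced by $M_f(t\wedge S_n)$, where $S_n := \inf\{t \ge 0 \,;\ M_f(t-) \ge n\}$. In particular,
$$E\left[\left(M_f(t\wedge S_n) - 1\right)^2\right] = E\left[\int_0^t \int_K M_f(s)^2\, 1_{\{s \le S_n\}} \left(e^{f(s,z)} - 1\right)^2 \lambda(s,z)\,Q(dz)\,ds\right]. \tag{15.17}$$
Now,
$$E\left[\left(M_f(t\wedge S_n) - 1\right)^2\right] = E\left[M_f(t\wedge S_n)^2\right] - 1 ,$$
and moreover
$$E\left[M_f(t\wedge S_n)^2\right] \ge E\left[M_f(t)^2\, 1_{\{t \le S_n\}}\right].$$
Therefore
$$\begin{aligned}
E\left[M_f(t)^2\, 1_{\{t \le S_n\}}\right]
&\le 1 + E\left[\int_0^t \int_K M_f(s)^2\, 1_{\{s \le S_n\}} \left(e^{f(s,z)} - 1\right)^2 \lambda(s,z)\,Q(dz)\,ds\right] \\
&= 1 + \int_0^t E\left[M_f(s)^2\, 1_{\{s \le S_n\}}\right] \left(\int_K \left(e^{f(s,z)} - 1\right)^2 \lambda(s,z)\,Q(dz)\right) ds .
\end{aligned}$$
By Gronwall's lemma, for all $t \ge 0$,
$$E\left[M_f(t)^2\, 1_{\{t \le S_n\}}\right] \le \exp\left\{\int_0^t \int_K \left(e^{f(s,z)} - 1\right)^2 \lambda(s,z)\,Q(dz)\,ds\right\} < \infty .$$
Since $\lim_{n\uparrow\infty} S_n = \infty$, by monotone convergence,
$$E\left[M_f(t)^2\right] \le \exp\left\{\int_0^1 \int_K \left(e^{f(s,z)} - 1\right)^2 \lambda(s,z)\,Q(dz)\,ds\right\} := C < \infty ,$$
and therefore
$$E\left[\int_0^1 \int_K M_f(s)^2 \left(e^{f(s,z)} - 1\right)^2 \lambda(s,z)\,Q(dz)\,ds\right] \le C \int_{[0,1]\times K} \left(e^{f(t,z)} - 1\right)^2 \lambda(t,z)\,dt\,Q(dz) ,$$
a finite quantity by hypothesis. Therefore (15.16) is proved.

(c): Start from (15.17) and let $n \uparrow \infty$ to obtain
$$E\left[M_f(t)^2\right] = 1 + E\left[\int_0^t \int_K M_f(s)^2 \left(e^{f(s,z)} - 1\right)^2 \lambda(s,z)\,Q(dz)\,ds\right].$$
This is a differential equation in $E\left[M_f(t)^2\right]$ whose solution gives (c).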
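Parts (b) and (c) can be checked in closed form in a toy case. The sketch below is an illustration under stated assumptions, not the author's computation: it takes an unmarked homogeneous Poisson process of rate `rate`, a constant function $f \equiv c$, and the convention that the compensating term enters $M_f$ with a minus sign, so that $M_f(t) = \exp\{c\,N(t) - (e^c - 1)\,\text{rate}\cdot t\}$. The helper names `poisson_expectation` and `M_f` are hypothetical. Expectations are computed by summing the Poisson probability mass function, so no simulation is involved.

```python
import math

def poisson_expectation(g, mean, terms=200):
    """E[g(N)] for N ~ Poisson(mean), computed by a truncated series."""
    total = 0.0
    pmf = math.exp(-mean)                   # P(N = 0)
    for k in range(terms):
        total += g(k) * pmf
        pmf *= mean / (k + 1)               # P(N = k + 1) from P(N = k)
    return total

def M_f(k, c, rate, t):
    """Value of M_f(t) on the event {N(t) = k} when f is constant equal to c."""
    return math.exp(c * k - (math.exp(c) - 1.0) * rate * t)

c, rate, t = 0.7, 3.0, 1.0
first = poisson_expectation(lambda k: M_f(k, c, rate, t), rate * t)
second = poisson_expectation(lambda k: M_f(k, c, rate, t) ** 2, rate * t)
# Solution of the differential equation of part (c) when f is constant:
predicted_second = math.exp((math.exp(c) - 1.0) ** 2 * rate * t)
print(first, second, predicted_second)
```

In this special case the Gronwall bound of the proof holds with equality, since $M_f(s)^2$ is the only random factor in the integrand of (c).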
Remark 15.1.40 The running analogy with Brownian motion stochastic calculus is perhaps more evident in the case of a standard Poisson process $N$ on $\mathbb{R}_+$. In this case, the results of Theorem 15.1.37 specialize as follows. Any centered square-integrable $\mathcal{F}_t^N$-martingale on $[0,1]$ is of the form
$$m(t) = \int_{(0,t]} C(s)\,dM(s) ,$$
where $M(t) := N(t) - t$ and $\{C(t)\}_{t\in[0,1]}$ is an $\mathcal{F}_t^N$-predictable process such that $\int_0^1 C(s)^2\,ds < \infty$.

15.1.4 The Regenerative Form of the Stochastic Intensity
Example 15.1.15 has shown that the notion of stochastic intensity is a generalization of that of hazard rate. Such a result can be extended to marked point processes, still in the special case where the history for which the stochastic intensity (kernel) is defined is the internal history “plus a prehistory”. A few results on the structure of point process histories will be needed. These are intuitive results whose technical proofs have been omitted.

Let $\{T_n\}_{n\ge 1}$ be a simple point process on $\mathbb{R}_+$, that is, a nondecreasing sequence of positive random variables possibly taking the value $+\infty$, and strictly increasing on $\mathbb{R}_+$ ($T_n < \infty \Rightarrow T_n < T_{n+1}$). In particular, it may be a finite point process (if $T_n = \infty$ for some $n \in \mathbb{N}$) and it need not be locally finite ($T_\infty := \lim_{n\uparrow\infty} T_n$ may be finite). Set $T_0 \equiv 0$. Let $\{Z_n\}_{n\ge 1}$ be a sequence of random variables taking their values in the measurable space $(K, \mathcal{K})$. The sequence $\{(T_n, Z_n)\}_{n\ge 1}$ is called a marked point process on $\mathbb{R}_+$ with marks in $K$. For each $L \in \mathcal{K}$, define the (simple) point process $N^L$ on $\mathbb{R}_+$ by
$$N^L(C) := \sum_{n\ge 1} 1_C(T_n)\, 1_L(Z_n) \quad (C \in \mathcal{B}(\mathbb{R}_+)) .$$
Note that $N^L(\{0\}) = 0$. Define the (simple) point process $N_Z$ on $\mathbb{R}_+ \times K$ by
$$N_Z(C \times L) := N^L(C) \quad (C \in \mathcal{B}(\mathbb{R}_+),\ L \in \mathcal{K}) .$$
Define the internal history $\{\mathcal{F}_t^{N,Z}\}_{t\ge 0}$ of $N_Z$ by
$$\mathcal{F}_t^{N,Z} := \sigma\left(N_Z(C \times L) \,;\ C \in \mathcal{B}((0,t]),\ L \in \mathcal{K}\right),$$
and the history $\{\mathcal{F}_t\}_{t\ge 0}$ by
$$\mathcal{F}_t = \sigma(Z_0) \vee \mathcal{F}_t^{N,Z} , \tag{15.18}$$
where $Z_0$ is a random element taking values in a measurable space $(L_0, \mathcal{K}_0)$ (for instance, a space of functions). It represents a “prehistory” of the marked point process, in the sense that $\mathcal{F}_0 = \sigma(Z_0)$ contains the information already gathered at time 0 that may influence its future behavior. It is intuitively clear that $\mathcal{F}_{T_n} = \sigma(Z_0) \vee \sigma(T_1, Z_1, \ldots, T_n, Z_n)$ and $\mathcal{F}_{T_n-} = \sigma(Z_0) \vee \sigma(T_1, Z_1, \ldots, T_n)$.

Suppose that for all $n \ge 0$, all $L \in \mathcal{K}$, and all $C \in \mathcal{B}(\mathbb{R}_+)$,
$$P\left(S_{n+1} \in C,\ Z_{n+1} \in L \mid \mathcal{F}_{T_n}\right)(\omega) = \int_C g^{(n+1)}(\omega, x, L)\,dx := G^{(n+1)}(\omega, C, L) ,$$
where for each $L \in \mathcal{K}$, the mapping $(\omega, x) \mapsto g^{(n+1)}(\omega, x, L)$ is $\mathcal{F}_{T_n} \otimes \mathcal{B}(\mathbb{R}_+)$-measurable, and for each $(\omega, x)$, $L \mapsto g^{(n+1)}(\omega, x, L)$ is a $\sigma$-finite measure on $(K, \mathcal{K})$. In particular,
$$P\left(S_{n+1} \in C \mid \mathcal{F}_{T_n}\right)(\omega) = \int_C g^{(n+1)}(\omega, x)\,dx := G^{(n+1)}(\omega, C) ,$$
where $g^{(n+1)}(\omega, x) = g^{(n+1)}(\omega, x, K)$ and $G^{(n+1)}(\omega, C) = G^{(n+1)}(\omega, C, K)$.

Theorem 15.1.41 For L ∈ K and t ≥ 0, let λ(t, L) :=
g (n+1)(t − Tn , L) 1{Tn ≤t 0, and Φ(t, L) = 0 if λ(t) = 0. Since λ(Tn ) > 0 P a.s. on {Tn < ∞}, Φ(Tn , L)1{Tn 0
(n ≥ 1),
P a.s.
(15.24)
Proof. (a) follows from the uniqueness property (15.27). For (b), note that $H(t) = 1_{\{\lambda(t)=0\}}$ is an $\mathcal{F}_t$-predictable process. Inserting this into the smoothing formula, we obtain
$$E\left[\sum_{n\ge 0} 1_{\{\lambda(T_n)=0\}}\right] = E\left[\int_{\mathbb{R}_+} 1_{\{\lambda(t)=0\}}\,\lambda(t)\,dt\right] = 0 ,$$
which implies (15.23).
In particular, if the locally finite simple marked point process $(N, Z)$ has the $\mathcal{F}_t$-predictable stochastic intensity kernel $\lambda(t)\Phi(t, dz)$, then, for all $L \in \mathcal{K}$, $\lambda(T_n)\Phi(T_n, L) > 0$ on $\{T_n < \infty\}$.

The above results extend straightforwardly to marked point processes and their stochastic intensity kernels as follows. Let the simple locally finite marked point process $(N, Z)$ have the $\mathcal{F}_t$-intensity kernel $\lambda(t)\Phi(t, dz)$, where
$$\Phi(t, dz) = h(t,z)\,Q(t, dz) \tag{15.25}$$
for some $\mathcal{F}_t^{N,Z}$-predictable kernel $Q(t, dz)$. Let $\{\widetilde{\mathcal{F}}_t\}_{t\ge 0}$ be a history such that
$$\mathcal{F}_t \supseteq \widetilde{\mathcal{F}}_t \supseteq \mathcal{F}_t^{N,Z} \quad (t \ge 0). \tag{15.26}$$
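It is possible that $\lambda(t)\Phi(t, dz)$ is not $\widetilde{\mathcal{F}}_t$-adapted and therefore cannot be the $\widetilde{\mathcal{F}}_t$-intensity kernel. Nevertheless, there still exists a stochastic $\widetilde{\mathcal{F}}_t$-intensity kernel, and it is obtained by “projection” of the initial stochastic intensity kernel on the smaller history, in a sense to be made precise now.

Recall the terminology: if the mapping $Y : (t, \omega, z) \mapsto Y(t, \omega, z) \in \mathbb{R}$ is $\mathcal{B}(\mathbb{R}_+) \otimes \mathcal{F} \otimes \mathcal{K}$-measurable, one says that $Y(t, z)$ is a measurable process indexed by $K$. It is said to be $\mathcal{F}_t$-adapted (resp. $\mathcal{F}_t$-predictable) if moreover for all $z \in K$, the stochastic process $\{Y(t, z)\}_{t\ge 0}$ is $\mathcal{F}_t$-adapted (resp. $\mathcal{F}_t$-predictable).

Let the histories $\{\mathcal{F}_t\}_{t\ge 0}$ and $\{\widetilde{\mathcal{F}}_t\}_{t\ge 0}$ satisfy condition (15.26). Let $\{Y(t, z)\}_{t\ge 0}$ be a nonnegative measurable process indexed by $K$. Let the sigma-finite measures $\mu_1$ and $\mu_2$ on $(\mathbb{R} \times \Omega \times K, \mathcal{P}(\widetilde{\mathcal{F}}_\cdot) \otimes \mathcal{K})$ be defined respectively by
$$\mu_1(H) := E\left[\int_{\mathbb{R}} \int_K H(t,z)\,Y(t,z)\,Q(t,dz)\,dt\right] \quad \text{and} \quad \mu_2(H) := E\left[\int_{\mathbb{R}} \int_K H(t,z)\,Q(t,dz)\,dt\right]$$
for all nonnegative mappings $H : \mathbb{R} \times \Omega \times K \to \mathbb{R}$ that are $\mathcal{P}(\widetilde{\mathcal{F}}_\cdot) \otimes \mathcal{K}$-measurable. Note that $\mu_2$ is the product measure $P(d\omega) \times Q(t, \omega, dz)\,dt$ on $(\Omega \times \mathbb{R} \times K, \mathcal{P}(\widetilde{\mathcal{F}}_\cdot) \otimes \mathcal{K})$. Clearly $\mu_1 \ll \mu_2$, and therefore there exists a Radon–Nikodym (rnd) derivative
$$\widetilde{Y}(t, \omega, z) = \frac{d\mu_1}{d\mu_2}(t, \omega, z)$$
that is $\mathcal{P}(\widetilde{\mathcal{F}}_\cdot) \otimes \mathcal{K}$-measurable and therefore defines an $\widetilde{\mathcal{F}}_t$-predictable process indexed by $K$, $\widetilde{Y}(t, z)$, such that
$$\mu_1(H) = \mu_2(\widetilde{Y} H) = E\left[\int_{\mathbb{R}} \int_K H(t,z)\,\widetilde{Y}(t,z)\,Q(t,dz)\,dt\right].$$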
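Moreover, this rnd is $\mu_2$-unique, that is to say, if there exists another such rnd, say $\widetilde{Y}'$, then
$$\widetilde{Y}(t, \omega, z) = \widetilde{Y}'(t, \omega, z) , \quad P(d\omega)\,Q(t, \omega, dz)\,dt\text{-a.e.} \tag{15.27}$$

Definition 15.2.5 The above stochastic process $\widetilde{Y}(t, z)$ indexed by $K$ is called the predictable projection of $Y(t, z)$ on $\{\widetilde{\mathcal{F}}_t\}_{t\ge 0}$, or the $\widetilde{\mathcal{F}}_t$-predictable projection of $Y(t, z)$.

Theorem 15.2.6 Let the simple locally finite marked point process $(N, Z)$ on $\mathbb{R}_+$ have the stochastic $\mathcal{F}_t$-intensity kernel (15.25) for some $\mathcal{F}_t^{N,Z}$-predictable kernel $Q(t, dz)$. Let $\{\widetilde{\mathcal{F}}_t\}_{t\ge 0}$ be another history satisfying condition (15.26). Then $(N, Z)$ has the stochastic $\widetilde{\mathcal{F}}_t$-intensity kernel $\widetilde{\lambda}(t)\,\widetilde{h}(t, z)\,Q(t, dz)$, where $\{\widetilde{\lambda}(t)\}_{t\ge 0}$ is the $\widetilde{\mathcal{F}}_t$-predictable projection of $\{\lambda(t)\}_{t\ge 0}$ and $\widetilde{h}(t, z)$ is the $\widetilde{\mathcal{F}}_t$-predictable projection of $h(t, z)$.

Proof. Let $H(t, z)$ be a nonnegative $\widetilde{\mathcal{F}}_t$-predictable indexed stochastic process. It is a fortiori an $\mathcal{F}_t$-predictable indexed stochastic process, and therefore
$$\begin{aligned}
E\left[\int_{\mathbb{R}} \int_K H(t,z)\,N(dt \times dz)\right]
&= E\left[\int_{\mathbb{R}} \int_K H(t,z)\,\lambda(t)\,h(t,z)\,Q(t,dz)\,dt\right] \\
&= E\left[\int_{\mathbb{R}} \int_K H(t,z)\,\widetilde{v}(t,z)\,Q(t,dz)\,dt\right],
\end{aligned}$$
where $\widetilde{v}(t, z)$ is the $\widetilde{\mathcal{F}}_t$-predictable projection of $\lambda(t)h(t, z)$. Let now
$$\widehat{h}(t, \omega, z) := \frac{\widetilde{v}(t, \omega, z)}{\widetilde{\lambda}(t, \omega)} ,$$
a quantity that is $P(d\omega) \times N(\omega, dt \times dz)$- and $P(d\omega) \times \lambda(t, \omega)\,Q(t, \omega, dz)\,dt$-well defined in view of Theorem 15.2.4. We have that for all nonnegative $\widetilde{\mathcal{F}}_t$-predictable indexed processes $H$,
$$\begin{aligned}
E\left[\int_{\mathbb{R}} \int_K H(t,z)\,N(dt \times dz)\right]
&= E\left[\int_{\mathbb{R}} \int_K H(t,z)\,\widetilde{\lambda}(t)\,\frac{\widetilde{v}(t,z)}{\widetilde{\lambda}(t)}\,Q(t,dz)\,dt\right] \\
&= E\left[\int_{\mathbb{R}} \int_K H(t,z)\,\lambda(t)\,h(t,z)\,Q(t,dz)\,dt\right].
\end{aligned}$$
Equivalently,
$$E\left[\int_{\mathbb{R}} \left(\int_K H(t,z)\,\frac{\widetilde{v}(t,z)}{\widetilde{\lambda}(t)}\,Q(t,dz)\right) \widetilde{\lambda}(t)\,dt\right] = E\left[\int_{\mathbb{R}} \left(\int_K H(t,z)\,h(t,z)\,Q(t,dz)\right) \lambda(t)\,dt\right].$$
Now
$$\begin{aligned}
E\left[\int_{\mathbb{R}} \left(\int_K H(t,z)\,h(t,z)\,Q(t,dz)\right) \lambda(t)\,dt\right]
&= E\left[\int_{\mathbb{R}} \left(\int_K H(t,z)\,h(t,z)\,Q(t,dz)\right) N(dt)\right] \\
&= E\left[\int_{\mathbb{R}} \left(\int_K H(t,z)\,h(t,z)\,Q(t,dz)\right) \widetilde{\lambda}(t)\,dt\right],
\end{aligned}$$
and therefore
$$E\left[\int_{\mathbb{R}} \left(\int_K H(t,z)\,\frac{\widetilde{v}(t,z)}{\widetilde{\lambda}(t)}\,Q(t,dz)\right) \widetilde{\lambda}(t)\,dt\right] = E\left[\int_{\mathbb{R}} \left(\int_K H(t,z)\,h(t,z)\,Q(t,dz)\right) \widetilde{\lambda}(t)\,dt\right].$$
Replacing $H(t, z)$ by $H(t, z)/\widetilde{\lambda}(t)$,
$$E\left[\int_{\mathbb{R}} \int_K \frac{\widetilde{v}(t,z)}{\widetilde{\lambda}(t)}\,H(t,z)\,Q(t,dz)\,dt\right] = E\left[\int_{\mathbb{R}} \int_K H(t,z)\,h(t,z)\,Q(t,dz)\,dt\right],$$
which shows that
$$\frac{\widetilde{v}(t,z)}{\widetilde{\lambda}(t)} = \widetilde{h}(t, z) .$$

15.2.2 Absolutely Continuous Change of Probability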
We now consider changes of intensity entailed by an absolutely continuous change of probability measure. This subsection is of special interest in statistics where the concept of likelihood ratio is of central importance, in particular in hypothesis testing. It is a sweeping generalization of the results of Section 8.3.3.

Let $(N, Z)$ be a simple and locally finite point process on $\mathbb{R}_+$ with marks in $K$ and associated lifted process $N_Z$ on $\mathbb{R}_+ \times K$. Let $\{\mathcal{F}_t\}_{t\ge 0}$ be a history of $N_Z$ and suppose that $N_Z$ admits the $(P, \mathcal{F}_t)$-local characteristics $(\lambda(t), \Phi(t, dz))$. Let $\{\mu(t)\}_{t\ge 0}$ be a nonnegative $\mathcal{F}_t$-predictable process and let $\{h(t, z)\}_{t\ge 0, z\in K}$ be a nonnegative $\mathcal{F}_t$-predictable $K$-indexed stochastic process, such that for all $t \ge 0$,
$$P\left(\int_0^t \lambda(s)\mu(s)\,ds < \infty\right) = 1 \tag{15.28}$$
and
$$P\left(\int_K h(t, z)\,\Phi(t, dz) = 1\right) = 1 . \tag{15.29}$$
Define for each $t \ge 0$
$$L(t) := L(0) \left(\prod_{T_n \in (0,t]} \mu(T_n)\,h(T_n, Z_n)\right) \exp\left\{-\int_{(0,t]} \int_K \left(\mu(s)h(s,z) - 1\right)\lambda(s)\,\Phi(s, dz)\,ds\right\}, \tag{15.30}$$
where $L(0)$ is a nonnegative $\mathcal{F}_0$-measurable random variable such that $E[L(0)] = 1$.

Theorem 15.2.7 Under the above conditions,

(1) $\{L(t)\}_{t\ge 0}$ is a nonnegative $(P, \mathcal{F}_t)$-local martingale. If, moreover, $E[L(t)] = 1$ for all $t \ge 0$, it is a nonnegative $(P, \mathcal{F}_t)$-martingale.

(2) If $E[L(T)] = 1$ for some $T > 0$, and if we define the probability $Q$ by the Radon–Nikodym derivative
$$\frac{dQ}{dP} = L(T) , \tag{15.31}$$
the marked point process $N_Z$ admits the $(Q, \mathcal{F}_t)$-local characteristics $(\mu(t)\lambda(t), h(t, z)\Phi(t, dz))$ on $[0, T]$.
Proof. (1) By the exponential rule of Stieltjes–Lebesgue calculus,
$$L(t) = L(0) + \int_{(0,t]} \int_K \left(\mu(s)h(s,z) - 1\right) L(s-)\,M_Z(ds \times dz) ,$$
where $M_Z(ds \times dz) := N_Z(ds \times dz) - \lambda(s)\Phi(s, dz)\,ds$. Let, for $n \ge 1$,
$$S_n = \inf\left\{t \,;\ L(t-) + \int_0^t \mu(s)\lambda(s)\,ds \ge n\right\}. \tag{15.32}$$
Then, by Theorem 15.1.31, $\{L(t\wedge S_n)\}_{t\ge 0}$ is a $(P, \mathcal{F}_t)$-martingale, and since under conditions (15.28) and (15.29), $P(\lim_{n\uparrow\infty} S_n = \infty) = 1$, $\{L(t)\}_{t\ge 0}$ is a $(P, \mathcal{F}_t)$-local martingale. Being nonnegative, it is also a $(P, \mathcal{F}_t)$-supermartingale. But a supermartingale with constant mean is a martingale.

(2) We have to prove that for any nonnegative $\mathcal{F}_t$-predictable $K$-indexed stochastic process $\{H(t, z)\}_{t\ge 0, z\in K}$ and all $t \in [0, T]$,
$$E_Q\left[\int_{(0,t]} \int_K H(s,z)\,N_Z(ds \times dz)\right] = E_Q\left[\int_{(0,t]} \int_K H(s,z)\,\mu(s)\lambda(s)\,h(s,z)\,\Phi(s,dz)\,ds\right].$$
This is done through the following sequence of equalities (with appropriate justifications at the end):
$$\begin{aligned}
E_Q\left[\int_{(0,t]} \int_K H(s,z)\,N_Z(ds \times dz)\right]
&= E\left[L(t) \int_{(0,t]} \int_K H(s,z)\,N_Z(ds \times dz)\right] \\
&= E\left[\int_{(0,t]} \int_K L(s)\,H(s,z)\,N_Z(ds \times dz)\right] \\
&= E\left[\int_{(0,t]} \int_K L(s-)\,H(s,z)\,\mu(s)h(s,z)\,N_Z(ds \times dz)\right] \\
&= E\left[\int_{(0,t]} \int_K L(s-)\,H(s,z)\,\mu(s)h(s,z)\,\lambda(s)\Phi(s,dz)\,ds\right] \\
&= E\left[\int_{(0,t]} \int_K L(s)\,H(s,z)\,\mu(s)h(s,z)\,\lambda(s)\Phi(s,dz)\,ds\right] \\
&= E\left[L(t) \int_{(0,t]} \int_K H(s,z)\,\mu(s)h(s,z)\,\lambda(s)\Phi(s,dz)\,ds\right] \\
&= E_Q\left[\int_{(0,t]} \int_K H(s,z)\,\mu(s)\lambda(s)\,h(s,z)\,\Phi(s,dz)\,ds\right].
\end{aligned}$$
The first equality follows from (15.31) and the fact that for a nonnegative $\mathcal{F}_t$-measurable random variable $V(t)$, $E_Q[V(t)] = E_P[L(t)V(t)]$. The second equality follows from Theorem 13.4.25, the third one from the observation that at a point $T_n < \infty$ of $N$, $L(T_n) = L(T_n-)\,\mu(T_n)\,h(T_n, Z_n)$. The fourth equality uses the smoothing theorem, the fifth the fact that $L(s) = L(s-)$ for Lebesgue-almost all $s$, the sixth is by Theorem 13.4.25 and the last one uses (15.31).

Remark 15.2.8 The main condition to verify when using Theorem 15.2.7 is $E_P[L(T)] = 1$. A general method to do this consists in finding some $\gamma > 1$ such that for the sequence of stopping times $\{S_n\}_{n\ge 1}$ defined by (15.32), $\sup_{n\ge 1} E_P[L(T\wedge S_n)^\gamma] < \infty$. This implies that the sequence $\{L(T\wedge S_n)\}_{n\ge 1}$ is uniformly integrable, and therefore
$$\lim_{n\uparrow\infty} E\left[L(T\wedge S_n)\right] = E\left[\lim_{n\uparrow\infty} L(T\wedge S_n)\right].$$
But, by Part (1) of Theorem 15.2.7, $E[L(T\wedge S_n)] = 1$ and $S_n \uparrow \infty$. Therefore $E[L(T)] = 1$.
Example 15.2.9: The Likelihood Ratio for a Simple Point Process on a Finite Interval. Let $N$ be under probability $P$ an $\mathcal{F}_t$-Poisson process of intensity 1. Let $\{\lambda(t)\}_{t\ge 0}$ be a nonnegative bounded $\mathcal{F}_t$-predictable process. Define for all $t \ge 0$
$$L(t) = L(0) \left(\prod_{n\ge 1 \,:\, T_n \le t} \lambda(T_n)\right) \exp\left\{-\int_0^t (\lambda(s) - 1)\,ds\right\},$$
where $L(0)$ is a nonnegative square-integrable random variable. By the exponential formula of Stieltjes–Lebesgue calculus,
$$L(t) = L(0) + \int_{(0,t]} L(s-)(\lambda(s) - 1)\,(N(ds) - ds) .$$
By the product rule of Stieltjes–Lebesgue calculus,
$$\begin{aligned}
L(t)^2 &= L(0)^2 + 2\int_{(0,t]} L(s-)\,dL(s) + \sum_{s\le t} L(s-)\,\Delta L(s) \\
&= L(0)^2 + 2\int_{(0,t]} L(s-)\,dL(s) + \int_{(0,t]} L(s-)^2 (\lambda(s) - 1)\,N(ds) && (15.33) \\
&= L(0)^2 + 2\int_{(0,t]} L(s-)\,dL(s) + \int_{(0,t]} L(s-)^2 (\lambda(s) - 1)\,(N(ds) - ds) + \int_{(0,t]} L(s)^2 (\lambda(s) - 1)\,ds && (15.34)
\end{aligned}$$
(noting that for Lebesgue-almost all $t$, $L(t) = L(t-)$). Define for each $n \ge 1$
$$S_n := \inf\left\{t \,;\ L(t-) + \int_0^t \lambda(s)\,ds \ge n\right\} \wedge T_n ,$$
an $\mathcal{F}_t$-stopping time such that $\lim_{n\uparrow\infty} S_n = \infty$. In particular,
$$\int_{(0,t]} L(s-)\,dL(s) = \int_{(0,t]} L(s-)^2 (\lambda(s) - 1)\,(N(ds) - ds)$$
is an $\mathcal{F}_t$-local martingale (with localizing sequence $\{S_n\}_{n\ge 1}$) of mean 0. Replacing $t$ by $t\wedge S_n$ in (15.34) and taking expectations,
$$E\left[L(t\wedge S_n)^2\right] = E\left[L(0)^2\right] + E\left[\int_{(0,t\wedge S_n]} L(s\wedge S_n)^2 (\lambda(s) - 1)\,ds\right].$$
In particular, in view of the boundedness assumption on $\lambda(t)$,
$$\begin{aligned}
E\left[L(t\wedge S_n)^2\right] &\le E\left[L(0)^2\right] + E\left[\int_{(0,t]} L(s\wedge S_n)^2 (\lambda(s) + 1)\,ds\right] \\
&\le C_1 + \int_{(0,t]} E\left[L(s\wedge S_n)^2\right] (C_2 + 1)\,ds
\end{aligned}$$
for some finite positive $C_1$ and $C_2$. This implies, by Gronwall's lemma, that
$$E\left[L(t\wedge S_n)^2\right] \le C_1 \exp\left\{\int_0^t (C_2 + 1)\,ds\right\}.$$
In particular, for any $T < \infty$, $\sup_{n\ge 1} E\left[L(T\wedge S_n)^2\right] < \infty$.
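The conclusion $E_P[L(T)] = 1$ and part (2) of Theorem 15.2.7 can be checked exactly in the simplest instance of this example. The sketch below assumes a constant target intensity `lam` and $L(0) = 1$, so that $L(T) = \text{lam}^{N(T)} \exp\{-(\text{lam} - 1)T\}$ with $N(T)$ Poisson of mean $T$ under $P$; the function names `expect_under_P` and `likelihood_ratio` are hypothetical. It computes $E_P[L(T)]$ and $E_P[L(T)N(T)] = E_Q[N(T)]$ by summing the Poisson series.

```python
import math

def expect_under_P(g, T, terms=150):
    """E_P[g(N(T))] when N(T) ~ Poisson(T) under P (truncated series)."""
    total = 0.0
    pmf = math.exp(-T)                      # P(N(T) = 0)
    for k in range(terms):
        total += g(k) * pmf
        pmf *= T / (k + 1)                  # P(N(T) = k + 1) from P(N(T) = k)
    return total

def likelihood_ratio(k, lam, T):
    """L(T) on {N(T) = k} for a constant intensity lam and L(0) = 1."""
    return lam ** k * math.exp(-(lam - 1.0) * T)

lam, T = 2.5, 4.0
mass = expect_under_P(lambda k: likelihood_ratio(k, lam, T), T)
mean_under_Q = expect_under_P(lambda k: k * likelihood_ratio(k, lam, T), T)
print(mass, mean_under_Q, lam * T)
```

The total mass is 1, and the mean number of points under $Q$ equals `lam * T`, as expected for a process that has intensity `lam` under $Q$.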
Example 15.2.10: Likelihood Ratios for Continuous-time hmcs. Let $\{X(t)\}_{t\ge 0}$ be, under $P$, a regular continuous-time homogeneous Markov chain with state space $E$ and stable and conservative infinitesimal generator $\{q_{ij}\}_{i,j\in E}$. Let $\alpha_{ij}$ ($i \ne j \in E$) be nonnegative numbers such that for all $i \in E$, $\widetilde{q}_i := \sum_{j\in E,\, j\ne i} \alpha_{ij} q_{ij} < \infty$. For all $t \ge 0$, let $\mathcal{F}_t := \mathcal{F}_t^X$ and let
$$L(t) := L(0) \left(\prod_{i,j \,;\, i\ne j} \alpha_{ij}^{N_{i,j}(t)}\right) \exp\left\{-\int_0^t \sum_{i,j \,;\, i\ne j} (\alpha_{ij} - 1)\,q_{ij}\,1_{\{X(s)=i\}}\,ds\right\}.$$
Suppose that $E_P[L(0)] = 1$, that $E_P[L(0)^2] < \infty$ and that $\sum_{j\ne i} \alpha_{ij} q_{ij} < \infty$ ($i \in E$). Then $E_P[L(T)] = 1$ and, under the probability $Q$ defined by $\frac{dQ}{dP} = L(T)$, the process $\{X(t)\}_{t\ge 0}$ is on the interval $(0, T]$ a regular stable and conservative continuous-time homogeneous Markov chain with state space $E$ and infinitesimal parameters $\widetilde{q}_{ij} := \alpha_{ij} q_{ij}$ ($j \ne i$).

Proof. At a discontinuity time $t$ of the chain,
$$L(t) - L(t-) = L(t-) \sum_{i,j \,;\, j\ne i} (\alpha_{ij} - 1)\,\Delta N_{i,j}(t) ,$$
whereas for $t$ strictly between two jumps of the chain,
$$\frac{dL(t)}{dt} = -L(t) \sum_{i,j \,;\, j\ne i} (\alpha_{ij} - 1)\,q_{ij}\,1_{\{X(t)=i\}} .$$
Therefore
$$L(t) = L(0) + \int_{(0,t]} L(s-) \sum_{i\ne j} (\alpha_{ij} - 1)\left(N_{ij}(ds) - q_{ij}\,1_{\{X(s)=i\}}\,ds\right).$$
By the product rule of Stieltjes–Lebesgue calculus,
$$\begin{aligned}
L(t)^2 &= L(0)^2 + 2\int_{(0,t]} L(s-)\,dL(s) + \sum_{s\le t} L(s-)\,\Delta L(s) \\
&= L(0)^2 + 2\int_{(0,t]} L(s-)\,dL(s) + \int_{(0,t]} L(s-)^2 \sum_{i\ne j} (\alpha_{ij} - 1)\left(N_{ij}(ds) - q_{ij}\,1_{\{X(s)=i\}}\,ds\right) \\
&\quad + \int_{(0,t]} L(s)^2 \sum_{i\ne j} (\alpha_{ij} - 1)\,q_{ij}\,1_{\{X(s)=i\}}\,ds .
\end{aligned}$$
Using the stopping times of type (15.32), we have that
$$L(t)^2 = L(0)^2 + \text{local martingale} + \int_{(0,t]} L(s)^2 \sum_{i\ne j} (\alpha_{ij} - 1)\,q_{ij}\,1_{\{X(s)=i\}}\,ds .$$
Then,
$$E\left[L(t\wedge S_n)^2\right] = E\left[L(0)^2\right] + E\left[\int_{(0,t\wedge S_n]} \sum_{i\ne j} L(s\wedge S_n)^2 (\alpha_{ij} - 1)\,q_{ij}\,1_{\{X(s)=i\}}\,ds\right]$$
and in particular
$$E\left[L(t\wedge S_n)^2\right] \le E\left[L(0)^2\right] + \int_{(0,t]} E\left[L(s\wedge S_n)^2\right] \sum_{i\ne j} |\alpha_{ij} - 1|\,q_{ij}\,ds .$$
Therefore, by Gronwall's lemma,
$$E\left[L(t\wedge S_n)^2\right] \le E\left[L(0)^2\right] \exp\left\{\left(\sum_i \widetilde{q}_i + \sum_i q_i\right) t\right\}$$
and finally
$$\sup_{n\ge 1} E\left[L(T\wedge S_n)^2\right] < \infty .$$
The above two examples can be generalized as follows.

Example 15.2.11: Likelihood Ratios for Marked Point Processes. Consider the general situation of Theorem 15.2.7. Let $(N, Z)$ be a simple and locally finite point process on $\mathbb{R}_+$ with marks in $K$ and associated lifted process $N_Z$ on $\mathbb{R}_+ \times K$. Let $\{\mathcal{F}_t\}_{t\ge 0}$ be a history of $(N, Z)$ and suppose that $(N, Z)$ admits the $(P, \mathcal{F}_t)$-local characteristics $(\lambda(t), \Phi(t, dz))$. Let $\{\mu(t)\}_{t\ge 0}$ be a nonnegative $\mathcal{F}_t$-predictable process and let $\{h(t, z)\}_{t\ge 0, z\in K}$ be a nonnegative $\mathcal{F}_t$-predictable $K$-indexed stochastic process, such that for all $t \ge 0$,
$$P\left(\int_0^t \lambda(s)\mu(s)\,ds < \infty\right) = 1 \tag{15.35}$$
and
$$P\left(\int_K h(t, z)\,\Phi(t, dz) = 1\right) = 1 . \tag{15.36}$$
For each $t \ge 0$, define $L(t)$ as in (15.30), with $L(0)$ a nonnegative $\mathcal{F}_0$-measurable random variable such that $E[L(0)] = 1$. We suppose in addition that $L(0)$ is square-integrable and that
$$(\mu(t) + 1)\,h(t, z)\,\lambda(t) \le K(t) ,$$
where $K : \mathbb{R}_+ \to \mathbb{R}_+$ is a deterministic function such that $\int_0^T K(s)\,ds < \infty$. Then $E[L(T)] = 1$. The proof follows the same lines as the proof in Example 15.2.9 and is left as an exercise.
Remark 15.2.12 One need not insist on the analogy with Girsanov's theorem as it is obvious. The proof of Girsanov's result was based on the Itô calculus for Brownian motion. In the case of point processes, the underlying calculus is just the ordinary Stieltjes–Lebesgue calculus.
The Reference Probability Method

Radon–Nikodym derivatives are of course of interest in statistics (where they are called likelihood ratios), and also in filtering. In the so-called reference probability method, the probability $P$ actually governing the joint statistics of the observation and of the state process is obtained by an absolutely continuous change of probability measure $Q \to P$. This method therefore relies on the Radon–Nikodym results of Subsection 15.2.2. The reference probability $Q$ is chosen such that the observation and the state process are $Q$-independent and the observation has a simple structure under $Q$.

The state process $\{X(t)\}_{t\ge 0}$ takes its values in some measurable space $(E, \mathcal{E})$ and the observation is a marked point process $(N, Z)$ with time-events sequence $\{T_n\}_{n\ge 1}$ and mark sequence $\{Z_n\}_{n\ge 1}$, the marks taking their values in a measurable space $(K, \mathcal{K})$. Let
$$\mathcal{F}_t := \mathcal{F}_t^{N,Z} \vee \mathcal{F}_\infty^X ,$$
where $\{\mathcal{F}_t^X\}_{t\ge 0}$ is the internal history of the state process. The reference probability $Q$ is such that $(N, Z)$ is a $(Q, \mathcal{F}_t)$-Poisson process of intensity 1 with independent iid marks of common probability distribution $Q$. Moreover, under $Q$, $N$ and $\mathcal{F}_\infty^X$ are independent. Therefore, the stochastic $\mathcal{F}_t$-kernel of the observation under the reference probability $Q$ is $\lambda(t, dz) = 1 \times Q(dz)$.

Let $\{\mu(t)\}_{t\ge 0}$ be a nonnegative $\mathcal{F}_t$-predictable process and let $\{h(t, z)\}_{t\ge 0, z\in K}$ be a nonnegative $\mathcal{F}_t$-predictable $K$-indexed stochastic process, such that for all $t \ge 0$,
$$Q\left(\int_0^T \mu(s)\,ds < \infty\right) = 1 \tag{15.37}$$
and
$$Q\left(\int_K h(t, z)\,Q(dz) = 1\right) = 1 . \tag{15.38}$$
Define
$$L(t) := \left(\prod_{T_n \in (0,t]} \mu(T_n)\,h(T_n, Z_n)\right) \exp\left\{-\int_{(0,t]} \int_K \left(\mu(s)h(s,z) - 1\right) Q(dz)\,ds\right\},$$
suppose that $E_Q[L(T)] = 1$, and define $P$ by $\frac{dP}{dQ} = L(T)$. Then (Theorem 15.2.7) the marked point process $(N, Z)$ admits on $[0, T]$ the $(P, \mathcal{F}_t)$-local characteristics $(\mu(t), h(t, z)Q(dz))$. Moreover, the restrictions of $P$ and $Q$ to $\mathcal{F}_\infty^X$ are the same since $L(0) = 1$. In fact, by hypothesis, $L(0) := \frac{dP_0}{dQ_0} = 1$, that is, $P$ and $Q$ agree on $\mathcal{F}_0 := \mathcal{F}_\infty^X$.

Let $\{Z(t)\}_{t\ge 0}$ be an $\mathcal{F}_t$-adapted real-valued stochastic process.

Lemma 15.2.13 For all $t \ge 0$,
$$E_Q\left[L(t) \mid \mathcal{F}_t^N\right] E_P\left[Z(t) \mid \mathcal{F}_t^N\right] = E_Q\left[Z(t)L(t) \mid \mathcal{F}_t^N\right], \quad Q\text{-a.s.},$$
or, equivalently,
$$E_P\left[Z(t) \mid \mathcal{F}_t^N\right] = \frac{E_Q\left[Z(t)L(t) \mid \mathcal{F}_t^N\right]}{E_Q\left[L(t) \mid \mathcal{F}_t^N\right]} , \quad P\text{-a.s.}$$

Proof. This is just a rephrasing of Lemma 8.3.12.
Example 15.2.14: Estimating the Random Intensity of a Homogeneous Cox Process. The above lemma allows us to replace a filtering problem with respect to $P$ by one with respect to $Q$, which may be a simplification when $Q$ has a simple structure. For instance, if $Q$ is a probability that makes the point process $N$ Poisson with intensity 1, and if $\Lambda$ is a nonnegative integrable random variable independent of $N$ under $Q$, with cumulative distribution function $F$, the measure $P$ defined by
$$\frac{dP_t}{dQ_t} = \Lambda^{N(t)} \exp\{(1 - \Lambda)t\}$$
makes $N$ a doubly stochastic process with intensity $\Lambda$. Then
$$E\left[\Lambda \mid \mathcal{F}_t^N\right] = \frac{\int_0^\infty \lambda^{N(t)+1} e^{-\lambda t}\,dF(\lambda)}{\int_0^\infty \lambda^{N(t)} e^{-\lambda t}\,dF(\lambda)} .$$
(Exercise 15.4.16.)
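When the distribution $F$ of $\Lambda$ is concentrated on finitely many values, the two integrals above become finite sums. The following minimal sketch evaluates the ratio in that case; the function name `posterior_mean` and its arguments `support` and `weights` (the atoms of $F$ and their probabilities) are illustrative assumptions, and the computation is done in log-space only for numerical stability.

```python
import math

def posterior_mean(n_events, t, support, weights):
    """E[Lambda | N(t) = n_events] when Lambda has a discrete distribution.

    support: possible (positive) values of Lambda; weights: their probabilities.
    """
    # log of  weight * lambda^n * exp(-lambda * t)  for each atom
    logs = [math.log(w) + n_events * math.log(x) - x * t
            for x, w in zip(support, weights)]
    shift = max(logs)                       # avoid overflow or underflow
    post = [math.exp(v - shift) for v in logs]
    norm = sum(post)
    return sum(x * p for x, p in zip(support, post)) / norm

# Two equally likely intensities: the estimate moves towards the larger
# value as more events are observed on a window of the same length.
few = posterior_mean(0, 2.0, [1.0, 3.0], [0.5, 0.5])
many = posterior_mean(10, 2.0, [1.0, 3.0], [0.5, 0.5])
print(few, many)
```

The estimate depends on the trajectory only through $N(t)$, as the formula of the example indicates.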
15.2.3 Changing the Time Scale

Recall the following elementary result concerning Poisson processes on the line with an intensity. If $N$ is a Poisson process with intensity $\lambda(t)$, defining $\tau(t)$ by $\int_0^{\tau(t)} \lambda(s)\,ds = t$, the point process $\widetilde{N}$ defined by $\widetilde{N}((0, t]) = N((0, \tau(t)])$ is a standard hpp. This result will be extended to the transformation of a point process with given stochastic intensity into a standard hpp.

Let $N$ be a simple locally finite point process on $\mathbb{R}_+$, with the $\mathcal{F}_t$-predictable intensity $\{\lambda(t)\}_{t\ge 0}$, and suppose that $N(0, \infty) = \infty$, $P$-a.s. or, equivalently (Theorem 15.1.16), $\int_0^\infty \lambda(s)\,ds = \infty$, $P$-a.s. Define, for each $t \ge 0$, the nonnegative random variable $\tau(t)$ by
$$\int_0^{\tau(t)} \lambda(s)\,ds = t . \tag{15.39}$$
For each $t \ge 0$, $\tau(t)$ is well defined since $\int_0^\infty \lambda(s)\,ds = \infty$ (Theorem 15.1.16). For each $t \in \mathbb{R}_+$, $\tau(t)$ is an $\mathcal{F}_t$-stopping time. Indeed, for any $a \in \mathbb{R}_+$,
$$\{\tau(t) \le a\} = \left\{\int_0^a \lambda(s)\,ds \ge t\right\} \in \mathcal{F}_a .$$
Define the simple locally finite point process $\widetilde{N}$ on $\mathbb{R}_+$ by
$$\widetilde{N}(0, t] := N(0, \tau(t)] . \tag{15.40}$$
Note that $\mathcal{F}_t^{\widetilde{N}} \subseteq \mathcal{F}_{\tau(t)}$, since for all $a \in \mathbb{R}_+$, $\widetilde{N}(0, a] := N(0, \tau(a)]$ is $\mathcal{F}^N_{\tau(a)}$-measurable, and therefore $\mathcal{F}_{\tau(a)}$-measurable.

Theorem 15.2.15 $\widetilde{N}$ has $\mathcal{F}_{\tau(t)}$-intensity 1 (and therefore $\mathcal{F}_t^{\widetilde{N}}$-intensity 1).

Proof. Let $(a, b] \subset \mathbb{R}_+$. We must show that
$$E\left[1_A\,\widetilde{N}(a, b]\right] = E\left[1_A\,(b - a)\right] \quad (A \in \mathcal{F}_{\tau(a)}) .$$
But the left-hand side is just
$$E\left[1_A \int_{\tau(a)}^{\tau(b)} N(dt)\right] = E\left[\int_0^\infty 1_A\,1_{(\tau(a), \tau(b)]}(t)\,N(dt)\right].$$
Since the process $1_A\,1_{(\tau(a), \tau(b)]}$ is $\mathcal{F}_t$-predictable (being $\mathcal{F}_t$-adapted and left-continuous), the right-hand side of the above equality is, by the smoothing formula,
$$E\left[\int_0^\infty 1_A\,1_{(\tau(a), \tau(b)]}(t)\,\lambda(t)\,dt\right] = E\left[1_A \int_{\tau(a)}^{\tau(b)} \lambda(t)\,dt\right] = E\left[1_A\,(b - a)\right].$$

Remark 15.2.16 By Watanabe's theorem (Theorem 7.1.8), $\widetilde{N}$ is a homogeneous Poisson process of intensity 1. In addition, for all $(a, b] \subset \mathbb{R}_+$, $\widetilde{N}(a, b]$ is independent of $\mathcal{F}_{\tau(a)}$.
Remark 15.2.17 The result of Theorem 15.2.15 may be used to test that a given point process admits a given hypothetical intensity: perform the corresponding change of time and see if the result is a standard Poisson process, by using any available statistical test to assess that a given finite sequence of random variables is iid and exponentially distributed with mean 1.

Remark 15.2.18 The analogous result in the Brownian motion calculus is the one that says roughly that a continuous martingale is a Brownian motion with a different time scale (see, for instance, Section 5.3.2 in [Legall, 2016]).
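The testing procedure of Remark 15.2.17 can be sketched on simulated data. The example below is only an illustration under assumptions of its own: a deterministic intensity $\lambda(t) = 1 + \sin t$ (so that $\int_0^t \lambda(s)\,ds = t + 1 - \cos t$ is available in closed form), simulation by thinning of a rate-2 Poisson process, and crude moment checks instead of a formal statistical test. The points of the time-changed process are the images of the original points under $t \mapsto \int_0^t \lambda(s)\,ds$, the inverse of $\tau$.

```python
import math
import random

def intensity(t):
    return 1.0 + math.sin(t)                # hypothetical intensity, bounded by 2

def compensator(t):
    return t + 1.0 - math.cos(t)            # int_0^t intensity(s) ds, inverse of tau

def simulate_by_thinning(horizon, bound, rng):
    """Points on (0, horizon] of a Poisson process with the intensity above."""
    points, t = [], 0.0
    while True:
        t += rng.expovariate(bound)         # candidate from a rate-`bound` process
        if t > horizon:
            return points
        if rng.random() * bound <= intensity(t):
            points.append(t)                # keep with probability intensity / bound

rng = random.Random(2020)
points = simulate_by_thinning(3000.0, 2.0, rng)
rescaled = [compensator(t) for t in points]           # points of the time-changed process
gaps = [v - u for u, v in zip([0.0] + rescaled[:-1], rescaled)]

mean_gap = sum(gaps) / len(gaps)
var_gap = sum((g - mean_gap) ** 2 for g in gaps) / (len(gaps) - 1)
frac_long = sum(1 for g in gaps if g > 1.0) / len(gaps)
print(len(gaps), mean_gap, var_gap, frac_long)
```

For a standard Poisson process the gaps are iid exponential with mean 1, so the three statistics should be close to 1, 1 and $e^{-1} \approx 0.368$ respectively. A goodness-of-fit test on `gaps` would replace these informal comparisons in practice.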
Cryptology

A question arises naturally: how much information is lost in a change of time scale? Consider for instance the situation of Theorem 15.2.15. Since the change of time transforms the original point process into a homogeneous Poisson process, have we erased all the information previously contained in the stochastic intensity? The answer is: it depends.

For instance, suppose that $N$ is a Cox point process with intensity $\Lambda$, a nonnegative real-valued random variable. In other words, $N$ admits the stochastic $\mathcal{F}_t$-intensity $\lambda(t) \equiv \Lambda$, where $\mathcal{F}_t := \mathcal{F}_t^N \vee \sigma(\Lambda)$. If we perform the time change $\tau(t) = t/\Lambda$, the resulting process $\widetilde{N}$ defined by (15.40) is a standard Poisson process. Moreover, for all $0 \le c \le d$, $\widetilde{N}(d) - \widetilde{N}(c)$ is independent of $\mathcal{F}_{\tau(c)} \supseteq \sigma(\Lambda)$ and in particular of $\Lambda$. In this sense, the time change has erased all information concerning $\Lambda$, whereas $\Lambda$ could be recovered from $N$ since, by the strong law of large numbers,
$$\Lambda = \lim_{t\uparrow\infty} \frac{N(t)}{t} . \tag{$\star$}$$
In the case of an intrinsic change of time, things are dramatically different. The stochastic $\mathcal{F}_t^N$-intensity of $N$, $\widehat{\lambda}(t) = E\left[\Lambda \mid \mathcal{F}_t^N\right]$, is given by (Example 15.2.3):
$$\widehat{\lambda}(t) = \frac{\int_0^\infty \lambda^{N(t)+1} e^{-\lambda t}\,dF(\lambda)}{\int_0^\infty \lambda^{N(t)} e^{-\lambda t}\,dF(\lambda)} .$$
To be even more specific, suppose that $P(\Lambda = a) = P(\Lambda = b) = \frac{1}{2}$ for some $0 < a < b$, in which case
$$\widehat{\lambda}(t) = a\,\frac{1 + (b/a)^{N(t)+1} e^{(a-b)t}}{1 + (b/a)^{N(t)} e^{(a-b)t}} .$$
Performing the time change
$$\int_0^{\widehat{\tau}(t)} \widehat{\lambda}(s)\,ds = t ,$$
we obtain a point process $\widetilde{N}$ defined by $\widetilde{N}(t) := N(\widehat{\tau}(t))$, which is a standard Poisson process. However, this time, $\Lambda$ can be entirely recovered from it. In fact, as we now show, $N$ can be reconstructed from $\widetilde{N}$, and then $\Lambda$ can be obtained by ($\star$). Indeed, if $\widehat{T}_n$ is the $n$th point of $\widetilde{N}$, then
$$\int_{T_n}^{T_{n+1}} \widehat{\lambda}(t)\,dt = \widehat{T}_{n+1} - \widehat{T}_n$$
or, more explicitly, since $N(t) = n$ on $[T_n, T_{n+1})$,
$$\widehat{T}_{n+1} - \widehat{T}_n = f(n, T_{n+1}) - f(n, T_n) ,$$
where
$$f(n, t) = at - \ln\left(1 + (b/a)^{n} e^{(a-b)t}\right).$$
Clearly then, the sequence $\{T_n\}_{n\ge 1}$ can be recovered from $\{\widehat{T}_n\}_{n\ge 1}$.
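The recovery of $\{T_n\}$ from the time-changed points can be made concrete. The sketch below assumes the two-point prior of the text with illustrative values $a = 1$, $b = 3$, uses the antiderivative $f(n, t) = at - \ln(1 + (b/a)^n e^{(a-b)t})$ of the estimated intensity on an interval where $N(t) = n$, and inverts $t \mapsto f(n, t)$ by bisection (this map is increasing since its derivative lies between $a$ and $b$). The names `lam_hat`, `encode` and `decode` are hypothetical.

```python
import math

def lam_hat(n, t, a, b):
    """Estimated intensity E[Lambda | N(t) = n] for the two-point prior on {a, b}."""
    u = (b / a) ** n * math.exp((a - b) * t)
    return (a + b * u) / (1.0 + u)

def f(n, t, a, b):
    """Antiderivative in t of lam_hat(n, t, a, b)."""
    return a * t - math.log(1.0 + (b / a) ** n * math.exp((a - b) * t))

def encode(times, a, b):
    """Map the event times T_1 < T_2 < ... of N to the points of the time-changed process."""
    hat, prev_T, prev_hat = [], 0.0, 0.0
    for n, T in enumerate(times):           # n events have occurred before T
        prev_hat += f(n, T, a, b) - f(n, prev_T, a, b)
        hat.append(prev_hat)
        prev_T = T
    return hat

def decode(hat_times, a, b, iterations=200):
    """Recover the original event times from the time-changed points."""
    times, prev_T, prev_hat = [], 0.0, 0.0
    for n, That in enumerate(hat_times):
        target = f(n, prev_T, a, b) + (That - prev_hat)
        lo, hi = prev_T, prev_T + 1.0
        while f(n, hi, a, b) < target:      # enlarge the bracket until it contains the root
            hi = prev_T + 2.0 * (hi - prev_T)
        for _ in range(iterations):         # bisection on the increasing map t -> f(n, t)
            mid = 0.5 * (lo + hi)
            if f(n, mid, a, b) < target:
                lo = mid
            else:
                hi = mid
        times.append(0.5 * (lo + hi))
        prev_T, prev_hat = times[-1], That
    return times

a, b = 1.0, 3.0
original = [0.4, 1.1, 1.5, 3.0]             # an arbitrary increasing sequence of event times
encoded = encode(original, a, b)
recovered = decode(encoded, a, b)
print(encoded)
print(recovered)
```

Decoding requires the values $a$ and $b$, that is, the distribution of $\Lambda$: this is the "key" in the cryptographic reading of the construction.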
Remark 15.2.19 An interpretation of the above results in terms of cryptography is the following. If the information is contained in $\Lambda$, the intrinsic time change yields a standard Poisson process from which $\Lambda$ can be extracted only if one knows the “key”, that is, the distribution of $\Lambda$. (Note however that from a finite trajectory of $\widetilde{N}$, one can only obtain an approximation of $\Lambda$. In this sense, secure transmission would be at the price of some unreliability. This unreliability can be controlled at the expense of transmission rate, which is acceptable if one is interested only in storage security.)

Remark 15.2.20 There is an analogous result in the Itô calculus, although this analogy is not a direct one as is the case, for instance, in Girsanov's theorem. We discuss it in very rough terms that will not be further detailed. Consider the “signal + noise” model
$$X(t) = \int_0^t \varphi(s)\,ds + W(t) \quad (t \ge 0) ,$$
where $\{W(t)\}_{t\ge 0}$ is a standard Wiener process and $\{\varphi(t)\}_{t\ge 0}$ is a locally integrable process (the integrated signal) independent of the Wiener process (the integrated noise). If one denotes by $\{\widehat{\varphi}(t)\}_{t\ge 0}$ a suitable version of the estimated signal process $E\left[\varphi(t) \mid \mathcal{F}_t^X\right]$ ($t \ge 0$), then
$$\widehat{W}(t) := X(t) - \int_0^t \widehat{\varphi}(s)\,ds \quad (t \ge 0)$$
is a standard Wiener process. A cryptographic interpretation of this result is available in full analogy with the simple Poissonian example of the previous remark.
15.3 Point Processes under a Poisson Process

A nonhomogeneous Poisson process with (deterministic) intensity function $\lambda(t)$ can be obtained by projecting onto the time axis the points of a homogeneous Poisson process on $\mathbb{R}^2$ of intensity 1 which lie between the curve $y = \lambda(t)$ and the time axis (Exercise 8.5.14). In fact, as we shall see in this section, any point process with stochastic $\mathcal{F}_t$-intensity $\{\lambda(t)\}_{t\ge 0}$ not only can be obtained in this way (the direct embedding theorems) but can always be thought of as having been obtained in this way (the inverse embedding theorems), in general at the cost of an extension of the probability space. The exact formulation and the mathematical details will be given in Subsection 15.3.2. For this the following preliminaries of intrinsic interest are needed.
15.3.1 An Extension of Watanabe's Theorem

The original version of Watanabe's theorem concerns homogeneous Poisson processes on the line. It will be extended, with a proof analogous to the proof of Theorem 7.1.8, to Poisson processes on product spaces of the type $\mathbb{R} \times K$. This new version will play a central role in the proof of the embedding theorems in Subsection 15.3.2.

Some notation will be needed for the precise statement of this extension. Let $N$ be a point process on $\mathbb{R} \times K$. The notation $S_t N$, where $t \ge 0$, denotes the point process obtained by shifting $N$ by $t$ to the left (algebraically). Let $S_t N^+$ denote the point process obtained by restricting the shifted process $S_t N$ to $\mathbb{R}_+ \times K$. Loosely speaking, $S_t N^+$ is the future of $N$ after time $t$, and more formally,
$$S_t N^+([a, b] \times L) := N\left((\mathbb{R}_+ \cap [a + t, b + t]) \times L\right) \quad ([a, b] \subset \mathbb{R},\ L \in \mathcal{K}) .$$
One can similarly define $S_t N^-$, the restriction of $S_t N$ to $\mathbb{R}_- \times K$. The following version of Watanabe’s theorem concerns in particular Cox processes on $\mathbb{R}^2$.
Theorem 15.3.1 Let $(K, \mathcal{K})$ be a measurable space. Let $\mathcal{G}$ be some sub-σ-field of $\mathcal{F}$. Let $\lambda(t, dz)$ be a locally integrable kernel from $\mathbb{R} \times \Omega$ to $(K, \mathcal{K})$ such that for all $t \ge 0$, $L \in \mathcal{K}$, $\lambda(t, L)$ is $\mathcal{G}$-measurable. Let $N$ be a point process on $\mathbb{R} \times K$ that admits the $\mathcal{F}_t$-intensity kernel $\lambda(t, \cdot)$, where $\mathcal{F}_t = \mathcal{H}_t \vee \mathcal{G}$ and $\{\mathcal{H}_t\}_{t\ge 0}$ is a history of $N$. Then $N$ is, conditionally on $\mathcal{G}$, a Poisson process with the intensity measure $\nu(dt \times dz) = \lambda(t, dz)\,dt$. Furthermore, conditionally on $\mathcal{G}$, for all $t \ge 0$, the future $S_t N^+$ of $N$ after time $t$ is independent of $\mathcal{H}_t$.
Proof. It suffices to show that for all $a \in \mathbb{R}$, for any finite family of disjoint measurable sets $C_1, \dots, C_m \subset (a, +\infty) \times K$, and all $t_1, \dots, t_m \in \mathbb{R}_+$,
\[ E\Big[\exp\Big(-\sum_{j=1}^m t_j N(C_j)\Big) \,\Big|\, \mathcal{F}_a\Big] = \exp\Big\{\sum_{j=1}^m \nu(C_j)\,(e^{-t_j} - 1)\Big\}. \tag{15.41} \]
One may assume that $C_1, \dots, C_m \subset (a, b] \times L_k$ for some $L_k$ as in Definition 15.1.20. Otherwise, replace the $C_j$’s by $C_j \cap ((a, b] \times L_k)$ and let $b$ and $k$ go to infinity. Denote the above $L_k$ by $L$. For all $j$ ($1 \le j \le m$) and all $t \ge 0$, let $C_j(t) := C_j \cap \{(-\infty, t] \times K\}$ and $C_j^t := \{z \in K;\ (t, z) \in C_j\}$. Define for $t \ge a$
\[ Z(t) := \exp\Big\{-\sum_{j=1}^m t_j N(C_j(t))\Big\}. \tag{15.42} \]
In particular,
\[ Z(b) = \exp\Big\{-\sum_{j=1}^m t_j N(C_j)\Big\}. \]
Also, since $Z(a) = 1$,
\[ Z(t) = 1 + \int_{(a,t]} \int_K Z(s-) \Big\{\sum_{j=1}^m (e^{-t_j} - 1)\, 1_{C_j^s}(z)\Big\}\, N(ds \times dz). \tag{15.43} \]
For the proof of this equality, observe that any trajectory $t \mapsto Z(t)$ is piecewise constant with discontinuity times that are points of the simple point process $N_L(\cdot) := N(\cdot \times L)$. Therefore
\[ Z(t) = Z(a) + \sum_{s \in (a,t]} (Z(s) - Z(s-)), \]
where $Z(s) \ne Z(s-)$ only if there is a point $(s, z)$ of $N$ that belongs to one (and, the sets being disjoint, at most one) of the $C_j$’s. If $(s, z) \in C_j$ and is in $N$, $Z(s) = Z(s-)e^{-t_j}$. Now saying that $(s, z) \in C_j$ is equivalent to saying that $z \in C_j^s$, and therefore, if $(s, z)$ is a point of $N$,
\[ Z(s) - Z(s-) = \sum_{j=1}^m Z(s-)\,(e^{-t_j} - 1)\, 1_{C_j^s}(z), \]
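which then gives (15.43) since $Z(a) = 1$. Let now $A \in \mathcal{F}_a$. We have
\[ 1_A Z(t) = 1_A + \int_{(a,t] \times K} 1_A Z(s-) \Big(\sum_{j=1}^m (e^{-t_j} - 1)\, 1_{C_j^s}(z)\Big) N(ds \times dz). \]
The stochastic process indexed by $K$,
\[ H(s, \omega, z) := 1_A(\omega)\, 1_{(a,t]}(s)\, Z(s-, \omega) \Big(\sum_{j=1}^m (e^{-t_j} - 1)\, 1_{C_j^s}(z)\Big), \]
is $\mathcal{P}(\mathcal{F}_\cdot) \otimes \mathcal{K}$-measurable and of constant sign (nonpositive). Therefore,
\[ \begin{aligned} E[1_A Z(t)] &= P(A) + E\Big[\int_{(a,t]} \int_K 1_A Z(s-) \Big\{\sum_{j=1}^m (e^{-t_j} - 1)\, 1_{C_j^s}(z)\Big\} N(ds \times dz)\Big] \\ &= P(A) + E\Big[\int_a^t \int_K 1_A Z(s) \Big\{\sum_{j=1}^m (e^{-t_j} - 1)\, 1_{C_j^s}(z)\Big\} \lambda(s, dz)\, ds\Big] \\ &= P(A) + E\Big[1_A \int_a^t \int_K Z(s) \Big\{\sum_{j=1}^m (e^{-t_j} - 1)\, 1_{C_j^s}(z)\Big\} \lambda(s, dz)\, ds\Big]. \end{aligned} \]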
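Since $A$ is arbitrary in $\mathcal{F}_a$,
\[ \begin{aligned} E[Z(t) \mid \mathcal{F}_a] &= 1 + E\Big[\int_a^t \int_K Z(s) \Big\{\sum_{j=1}^m (e^{-t_j} - 1)\, 1_{C_j^s}(z)\Big\} \lambda(s, dz)\, ds \,\Big|\, \mathcal{F}_a\Big] \\ &= 1 + \int_a^t \int_K E[Z(s) \mid \mathcal{F}_a] \Big\{\sum_{j=1}^m (e^{-t_j} - 1)\, 1_{C_j^s}(z)\Big\} \lambda(s, dz)\, ds. \end{aligned} \]
Therefore
\[ E[Z(t) \mid \mathcal{F}_a] = \exp\Big\{\sum_{j=1}^m (e^{-t_j} - 1) \int_a^t \int_K 1_{C_j^s}(z)\, \lambda(s, dz)\, ds\Big\}. \]
Letting $t = b$ gives the announced result.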
15.3.2 Grigelionis’ Embedding Theorem
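We shall use the following slight extension of the definition of a Poisson process:
Definition 15.3.2 Let $(K, \mathcal{K})$ be some measurable space. Given a history $\{\mathcal{F}_t\}_{t\in\mathbb{R}}$, the point process $N$ on $\mathbb{R} \times K$ is called an $\mathcal{F}_t$-Poisson process if the following conditions are satisfied:
(i) $\{\mathcal{F}_t\}_{t\in\mathbb{R}}$ is a history of $N$ in the sense that $\mathcal{F}_t \supseteq \sigma(N(D);\ D \subseteq (-\infty, t] \times K)$;
(ii) $N$ is a Poisson process; and
(iii) for any $t \in \mathbb{R}$, $S_t N^+$ and $\mathcal{F}_t$ are independent, where $S_t N^+$ is defined by $S_t N^+(C \times L) := N\big(((C + t) \cap (t, +\infty)) \times L\big)$.
Theorem 15.3.3 Let $(K, \mathcal{K})$ be some measurable space and let $Q$ be some probability measure on it. Let $\overline{N}$ be an $\mathcal{F}_t$-Poisson process on $\mathbb{R} \times K \times \mathbb{R}_+$ with intensity measure $dt \times Q(dz) \times d\sigma$. Let $f : \Omega \times \mathbb{R} \times K \to \mathbb{R}$ be a nonnegative function that is $\mathcal{P}(\mathcal{F}_\cdot) \otimes \mathcal{K}$-measurable and such that the kernel
\[ \lambda(t, dz) := f(t, z)\, Q(dz) \tag{15.44} \]
is locally integrable. The marked point process $(N, Z)$ with marks in $K$ defined by
\[ N(C \times L) := \int_{\mathbb{R}} \int_K \int_{\mathbb{R}_+} 1_C(t)\, 1_L(z)\, 1_{\{\sigma \le f(t,z)\}}\, \overline{N}(dt \times dz \times d\sigma) \qquad (C \in \mathcal{B}(\mathbb{R}),\ L \in \mathcal{K}) \tag{15.45} \]
admits the $\mathcal{F}_t$-stochastic intensity kernel $\lambda(t, dz)$.
Proof. We prove the theorem for the unmarked case. The general proof follows exactly the same lines. Let $\overline{N}$ be a homogeneous $\mathcal{F}_t$-Poisson process on $\mathbb{R} \times \mathbb{R}_+$ with average intensity 1. Let $\{\lambda(t)\}_{t\ge 0}$ be a nonnegative locally integrable $\mathcal{F}_t$-predictable stochastic process. Define the point process $N$ on $\mathbb{R}$ by the formula
\[ N(C) = \int_C \int_{\mathbb{R}_+} 1_{(0, \lambda(t)]}(z)\, \overline{N}(dt \times dz) \tag{15.46} \]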
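for all $C \in \mathcal{B}$. Then, $N$ has the $\mathcal{F}_t$-intensity $\{\lambda(t)\}_{t\ge 0}$.
1. We first show that (15.46) defines a locally finite point process, that is, $N((0, b]) < \infty$ a.s. for all $b \in \mathbb{R}$. Define for all $n \ge 1$
\[ \tau_n = \inf\Big\{t \ge 0\,;\ \int_0^t \lambda(s)\, ds \ge n\Big\} \]
($= \infty$ if $\{\dots\} = \varnothing$). By the local integrability assumption, $\lim_{n\uparrow\infty} \tau_n = \infty$, a.s. Also $\tau_n$ is an $\mathcal{F}_t$-stopping time. By the smoothing formula of Theorem 15.1.22,
\[ \begin{aligned} E[N((0, \tau_n \wedge b])] &= E\Big[\int_{(0,b] \times \mathbb{R}_+} 1_{(0, \tau_n]}(s)\, 1_{(0, \lambda(s)]}(\sigma)\, \overline{N}(ds \times d\sigma)\Big] \\ &= E\Big[\int_{(0,b] \times \mathbb{R}_+} 1_{(0, \tau_n]}(s)\, 1_{(0, \lambda(s)]}(\sigma)\, ds\, d\sigma\Big] \\ &= E\Big[\int_0^{\tau_n \wedge b} \lambda(s)\, ds\Big] < \infty. \end{aligned} \]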
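Therefore, a.s., for all $n \ge 1$, $N((0, \tau_n \wedge b]) < \infty$, and since $\tau_n \uparrow \infty$, $N((0, b]) < \infty$ a.s.
2. The simplicity is left as an exercise for the reader.
3. In order to prove that $N$ has the $\mathcal{F}_t$-intensity $\{\lambda(t)\}_{t\ge 0}$ it suffices to show that for all $H \in \mathcal{P}(\mathcal{F}_\cdot)$, $H \ge 0$,
\[ E\Big[\int_{\mathbb{R}} H(t)\, N(dt)\Big] = E\Big[\int_{\mathbb{R}} H(t)\, \lambda(t)\, dt\Big]. \]
But the left-hand side of this equality reads
\[ E\Big[\int_{\mathbb{R}} \int_{\mathbb{R}_+} H(t)\, 1_{(0, \lambda(t)]}(z)\, \overline{N}(dt \times dz)\Big]. \]
Since $\overline{N}$ is assumed $\mathcal{F}_t$-Poisson, it admits the $\mathcal{F}_t$-intensity kernel $\lambda(t, dz) = dz$. Thus, by Theorem 15.1.22, this is also equal to
\[ E\Big[\int_{\mathbb{R}} \int_{\mathbb{R}_+} H(t)\, 1_{(0, \lambda(t)]}(z)\, dt\, dz\Big] = E\Big[\int_{\mathbb{R}} H(t)\, \lambda(t)\, dt\Big]. \]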
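The construction (15.46) can be illustrated numerically. The following is a minimal sketch, not part of the original text, under simplifying assumptions: the intensity is deterministic and bounded by a known constant `lam_max` on the simulation horizon, so that only the strip $(0, \text{horizon}] \times (0, \text{lam\_max}]$ of the plane matters. The function name `embed_under_curve` and all parameter names are hypothetical.

```python
import math
import random

def embed_under_curve(lam, lam_max, horizon, rng):
    """Project onto the time axis the points of a unit-rate planar Poisson
    process on (0, horizon] x (0, lam_max] that lie below the curve z = lam(t).
    Assumes 0 <= lam(t) <= lam_max on the horizon (illustrative assumption)."""
    kept, n_candidates = [], 0
    t = 0.0
    while True:
        # The time coordinates of the points in the strip form a Poisson
        # process of rate lam_max; their heights are iid uniform on the strip.
        t += rng.expovariate(lam_max)
        if t > horizon:
            break
        n_candidates += 1
        z = rng.uniform(0.0, lam_max)   # height of the planar point
        if 0.0 < z <= lam(t):           # indicator 1_{(0, lam(t)]}(z) of (15.46)
            kept.append(t)
    return kept, n_candidates

rng = random.Random(0)
points, total = embed_under_curve(lambda t: 1.0 + math.sin(t) ** 2, 2.0, 50.0, rng)
```

Points of the planar process above height `lam_max` can never fall under the curve, which is why the sketch ignores them.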
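Theorem 15.3.4 Let $(N, Z)$ be a locally finite marked point process with marks in the measurable space $(K, \mathcal{K})$ and $\mathcal{F}_t$-stochastic intensity kernel of the form (15.44), where $f : \Omega \times \mathbb{R} \times K \to \mathbb{R}$ is a nonnegative function that is $\mathcal{P}(\mathcal{F}_\cdot) \otimes \mathcal{K}$-measurable and $Q$ is a probability measure on $(K, \mathcal{K})$. Then, the probability space may be enlarged to accommodate an $\mathcal{F}_t$-Poisson process $\overline{N}$ on $\mathbb{R} \times K \times \mathbb{R}_+$ with intensity measure $dt \times Q(dz) \times d\sigma$ such that (15.45) holds.
Proof. The result will be proved for the unmarked case, the general case following exactly the same lines. The theorem in this simplified form is as follows. Let $N$ be a simple point process on $\mathbb{R}$ with $\mathcal{F}_t$-predictable intensity $\{\lambda(t)\}_{t\ge 0}$. Then, there exists a homogeneous Poisson process $\overline{N}$ on $\mathbb{R} \times \mathbb{R}_+$ with average intensity 1, such that (15.46) holds. Moreover, this process $\overline{N}$ is an $\mathcal{F}_t \vee \mathcal{F}_t^{\overline{N}}$-Poisson process. As such, for all $a \in \mathbb{R}$, $S_a \overline{N}^+$ is independent of $\mathcal{F}_a$.
Let $\{U_n\}_{n\in\mathbb{Z}}$ be an iid sequence of random variables uniformly distributed on $[0, 1]$, and let $N^1$ be a homogeneous Poisson process on $\mathbb{R} \times \mathbb{R}_+$, of intensity 1, such that $\{U_n\}_{n\in\mathbb{Z}}$, $N^1$ and $\mathcal{F}_\infty$ are independent. Define $\overline{N}$ by
\[ \overline{N}(A) = \int_A 1_{(\lambda(t), \infty)}(\sigma)\, N^1(dt \times d\sigma) + \sum_{n\in\mathbb{Z}} 1_A\big((T_n, U_n \lambda(T_n))\big) \]
for all $A \in \mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathbb{R}_+)$. If $H$ is a nonnegative function from $\mathbb{R} \times \Omega \times \mathbb{R}_+$ to $\mathbb{R}$,
\[ \int_{\mathbb{R}} \int_{\mathbb{R}_+} H(t, \sigma)\, \overline{N}(dt \times d\sigma) = \int_{\mathbb{R}} \int_{\mathbb{R}_+} H(t, \sigma)\, 1_{(\lambda(t), \infty)}(\sigma)\, N^1(dt \times d\sigma) + \sum_{n\in\mathbb{Z}} H(T_n, U_n \lambda(T_n)). \]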
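Denote by $N^U$ the marked point process obtained by attaching the mark $U_n$ to point $T_n$ of $N$. Let $\mathcal{G}_t = \mathcal{F}_t \vee \mathcal{F}_t^{N^1} \vee \mathcal{F}_t^{N^U}$ and suppose that $H$ is $\mathcal{P}(\mathcal{G}_\cdot) \otimes \mathcal{B}(\mathbb{R}_+)$-measurable and nonnegative.
[Figure: the points $(T_n, U_n \lambda(T_n))$, $n = -1, 0, 1, 2$, marked by crosses, with the times $T_{-1}, T_0, T_1, T_2$ on the $t$-axis.]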
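Since $N^1$ has the $\mathcal{F}_t^{N^1}$-intensity kernel $\lambda^1(t, d\sigma) = d\sigma$, the latter is also the $\mathcal{G}_t$-intensity kernel of $N^1$ (recall that $\mathcal{F}_\infty^U$ and $\mathcal{F}_\infty$ are independent of $\mathcal{F}_\infty^{N^1}$). By the smoothing theorem,
\[ E\Big[\int_{\mathbb{R}} \int_{\mathbb{R}_+} 1_{(\lambda(t), \infty)}(\sigma)\, H(t, \sigma)\, N^1(dt \times d\sigma)\Big] = E\Big[\int_{\mathbb{R}} \int_{\mathbb{R}_+} 1_{(\lambda(t), \infty)}(\sigma)\, H(t, \sigma)\, dt\, d\sigma\Big] \]
for any nonnegative $H \in \mathcal{P}(\mathcal{G}_\cdot) \otimes \mathcal{B}(\mathbb{R}_+)$. Now, for any such $H$,
\[ E\Big[\sum_{n\in\mathbb{Z}} H(T_n, U_n \lambda(T_n))\Big] = E\Big[\int_{\mathbb{R}} \int_{[0,1]} H(t, u\lambda(t))\, N^U(dt \times du)\Big] = E\Big[\int_{\mathbb{R}} \int_0^1 H(t, u\lambda(t))\, \lambda(t)\, dt\, du\Big]. \]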
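(In fact, $N^U$ admits the $\mathcal{F}_t \vee \mathcal{F}_t^{N^U}$-intensity kernel $\lambda(t)\, 1_{[0,1]}(u)\, du$. This is also a $\mathcal{G}_t$-stochastic kernel for $N^U$ since $N^1$ is independent of $N^U$ and $\mathcal{F}_\infty$. The map $(t, \omega, u) \mapsto H(t, u\lambda(t))$ is $\mathcal{P}(\mathcal{G}_\cdot) \otimes \mathcal{B}(\mathbb{R})$-measurable in view of the $\mathcal{F}_t$-predictability of $\{\lambda(t)\}_{t\ge 0}$, and of the measurability assumptions on $H$. This justifies the above use of the smoothing theorem.) This term is also equal to
\[ E\Big[\int_{\mathbb{R}} \int_{\mathbb{R}_+} H(t, \sigma)\, 1_{(0, \lambda(t)]}(\sigma)\, dt\, d\sigma\Big], \]
by the change of variables $\sigma = u\lambda(t)$. Therefore
\[ \begin{aligned} E\Big[\int_{\mathbb{R}} \int_{\mathbb{R}_+} H(t, \sigma)\, \overline{N}(dt \times d\sigma)\Big] &= E\Big[\int_{\mathbb{R}} \int_{\mathbb{R}_+} H(t, \sigma)\, 1_{(\lambda(t), \infty)}(\sigma)\, dt\, d\sigma\Big] + E\Big[\int_{\mathbb{R}} \int_{\mathbb{R}_+} H(t, \sigma)\, 1_{(0, \lambda(t)]}(\sigma)\, dt\, d\sigma\Big] \\ &= E\Big[\int_{\mathbb{R}} \int_{\mathbb{R}_+} H(t, \sigma)\, dt\, d\sigma\Big] \end{aligned} \]
for all nonnegative $H \in \mathcal{P}(\mathcal{G}_\cdot) \otimes \mathcal{B}(\mathbb{R}_+)$. Therefore, by Theorem 15.3.1, $\overline{N}$ is a homogeneous $\mathcal{G}_t$-Poisson process on $(\mathbb{R} \times \mathbb{R}_+, \mathcal{B}(\mathbb{R}) \otimes \mathcal{B}(\mathbb{R}_+))$ with intensity 1. It is a fortiori an $\mathcal{F}_t \vee \mathcal{F}_t^{\overline{N}}$-Poisson process.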
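Variants of the Embedding Theorems
The following results are of the same kind as the previous ones, but they assume boundedness of the intensity kernel. We state them in the unmarked case. Their proofs are left as an exercise (Exercise 15.4.18).
Theorem 15.3.5 (Direct embedding) Let $\widehat{N}$ be an hpp on $\mathbb{R}$ with intensity $\widehat{\lambda}$, and let $\{U_n\}_{n\in\mathbb{Z}}$ be an iid sequence of marks, uniformly distributed on $[0, 1]$. Let $\widehat{N}^U$ be the associated lifted point process on $\mathbb{R} \times [0, 1]$. Let $\{\mathcal{G}_t\}_{t\ge 0}$ be a history independent of $\widehat{N}^U$, and let for all $t \ge 0$, $\mathcal{F}_t := \mathcal{G}_t \vee \mathcal{F}_t^{\widehat{N}^U}$. Let $\{\lambda(t)\}_{t\ge 0}$ be a nonnegative $\mathcal{F}_t$-predictable process bounded by $\widehat{\lambda} < \infty$, and define a point process $N$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ by
\[ N(C) = \sum_{n\in\mathbb{Z}} 1_C(\widehat{T}_n)\, 1_{\{0 \le U_n \le \lambda(\widehat{T}_n)/\widehat{\lambda}\}} \qquad (C \in \mathcal{B}(\mathbb{R})). \tag{15.47} \]
Then $N$ admits the $\mathcal{F}_t$-intensity $\{\lambda(t)\}_{t\ge 0}$.
Theorem 15.3.6 (Inverse embedding) Let $N$ be a simple point process on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with $\mathcal{F}_t$-intensity $\{\lambda(t)\}_{t\ge 0}$ (assumed $\mathcal{F}_t$-predictable, without loss of generality). Suppose that there exists a constant $\widehat{\lambda} < \infty$ such that, $P$-a.s.,
\[ \lambda(t, \omega) \le \widehat{\lambda} \qquad (t \ge 0). \]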
Then, there exists a compound Poisson process $(\widehat{N}, U)$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with marks in $([0, 1], \mathcal{B}([0, 1]))$, characteristics $(\widehat{\lambda}, Q)$ where $Q$ is the uniform distribution on $[0, 1]$, and such that (15.47) holds.
Remark 15.3.7 Grigelionis’ construction is of particular importance for coupling point processes. Usually one starts with a point process $N_1$ from which one constructs $\overline{N}$ via the inverse embedding theorem, and then using this same $\overline{N}$, one constructs another point process $N_2$.
Example 15.3.8: Lewis–Shedler–Ogata Simulation Algorithms. An immediate application of practical importance of the direct embedding theorems is to the simulation of point processes with a stochastic intensity. We start with a simple example of the methodology. Suppose that one wishes to simulate a point process on the positive half-line with a stochastic intensity of the form
\[ \lambda(t, \omega) = v(t, N_{[0,t)}(\omega)), \]
where $v : \mathbb{R}_+ \times M_p(\mathbb{R}_+) \to \mathbb{R}_+$ is measurable with respect to $\mathcal{B}(\mathbb{R}_+) \otimes \mathcal{M}_p(\mathbb{R}_+)$ and $\mathcal{B}(\mathbb{R}_+)$ and bounded, say, by $K < \infty$. For this we can use Theorem 15.3.5, which says that it suffices to thin a homogeneous Poisson process $\widehat{N}$ on $\mathbb{R}_+$ with intensity $K$, examining its points sequentially, keeping a point $\widehat{T}_n$ of it if and only if
\[ U_n \le \frac{v(\widehat{T}_n, N_{[0, \widehat{T}_n)})}{K}, \]
where $\{U_n\}_{n\ge 1}$ is an iid sequence of random variables uniformly distributed on $[0, 1]$ and independent of $\widehat{N}$.
This method is easily adapted to the case of point processes with an $\mathcal{F}_t^N \vee \mathcal{G}_0$-intensity of the form
\[ \lambda(t, \omega) = v(t, X(t-, \omega), N_{[0,t)}(\omega)), \tag{$\star$} \]
where
(i) $\{X(t)\}_{t\ge 0}$ is some corlol stochastic process with values in some measurable space $(K, \mathcal{K})$ and $\mathcal{G}_0 = \mathcal{F}_\infty^X := \vee_{t\ge 0} \mathcal{F}_t^X$,
(ii) $v : \mathbb{R}_+ \times K \times M_p(\mathbb{R}_+) \to \mathbb{R}_+$ is measurable with respect to $\mathcal{B}(\mathbb{R}_+) \otimes \mathcal{K} \otimes \mathcal{M}_p(\mathbb{R}_+)$ and $\mathcal{B}(\mathbb{R}_+)$, and is bounded by $K$,
(iii) $\{X(t)\}_{t\ge 0}$ is independent of $\widehat{N}$, and
(iv) we suppose that we have at our disposal at any time $t$ the value $X(t)$.
(In other words, the point process we seek to simulate is a semi-Cox point process.) A point $\widehat{T}_n$ of $\widehat{N}$ is kept if and only if
\[ U_n \le \frac{v(\widehat{T}_n, X(\widehat{T}_n-), N_{[0, \widehat{T}_n)})}{K}. \]
We can make another step towards generalization by assuming an $\mathcal{F}_t^N \vee \mathcal{F}_t^X$-intensity of the same form, $\lambda(t, \omega) = v(t, X(t-, \omega), N_{[0,t)}(\omega))$. But this time we must add the condition that, denoting by $T_n$ the $n$th point of $N$, one can construct $X(t)$ from $T_n$ based on the knowledge of $N_{[0,t)}$. One example is when the process $\{X(t)\}_{t\ge 0}$ is of the form
\[ X(t) = X(0) + \int_0^t \varphi(N(s-))\, ds + \int_0^t \sigma(N(s-))\, dW(s), \]
where $\{W(t)\}_{t\ge 0}$ is a Wiener process independent of $\widehat{N}$. Other examples, concerning for instance mutually exciting point processes, are left to the imagination of the reader. The relaxation of the hypothesis of boundedness of the stochastic intensity is not a problem since one can always vary the intensity of the Poisson process $\widehat{N}$ when needed.
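The sequential thinning procedure of this example can be sketched in a few lines. The code below is only an illustration under stated assumptions: the intensity depends on the past through the list of previously accepted points, it is bounded by a known constant `bound` (the $K$ of the text), and the particular self-exciting intensity `self_exciting`, with its parameters `base`, `jump`, `window` and `cap`, is a hypothetical choice made for the demonstration.

```python
import random

def thinning(v, bound, horizon, rng):
    """Simulate on (0, horizon] a point process with intensity v(t, past points),
    assuming 0 <= v <= bound, by thinning a Poisson process of rate `bound`."""
    accepted = []
    t = 0.0
    while True:
        t += rng.expovariate(bound)      # next point of the dominating Poisson process
        if t > horizon:
            return accepted
        u = rng.random()                 # mark U_n, uniform on [0, 1)
        # `accepted` holds only points strictly before t, i.e. N restricted to [0, t).
        if u <= v(t, accepted) / bound:  # keep iff U_n <= v(T_n, N_[0,T_n)) / K
            accepted.append(t)

def self_exciting(t, past, base=0.5, jump=0.8, window=1.0, cap=3.0):
    """Hypothetical bounded intensity: a base rate plus a jump for each
    accepted point in the last `window` time units, capped at `cap`."""
    recent = sum(1 for s in past if t - s < window)
    return min(cap, base + jump * recent)

pts = thinning(self_exciting, 3.0, 20.0, random.Random(42))
```

Because the acceptance test at a candidate time uses only points accepted strictly earlier, the intensity passed to the test is predictable in the sense required by the direct embedding theorem.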
Complementary reading
[Last and Brandt, 1995] is an advanced text on the martingale theory of point processes on the line, with or without a stochastic intensity. See also [Brémaud, 2020].
15.4 Exercises
Exercise 15.4.1. The fundamental (local) martingale
Show that Definition 15.1.1 implies that for all $a \in \mathbb{R}$, $M_a(t) := N(a, t] - \int_a^t \lambda(s)\, ds$ ($t \ge a$) is an $\mathcal{F}_t$-local martingale.
Exercise 15.4.2. Connecting to the intuitive definition
Let the simple and locally finite point process $N$ on $\mathbb{R}$ have the $\mathcal{F}_t$-intensity $\{\lambda(t)\}_{t\ge 0}$, and suppose that $t \mapsto \lambda(t, \omega)$ is for all $\omega \in \Omega$ a right-continuous function, and that the function $(t, \omega) \mapsto \lambda(t, \omega)$ is uniformly bounded. Show that
\[ \lim_{h \downarrow 0} \frac{1}{h}\, E[N(t, t + h] \mid \mathcal{F}_t] = \lambda(t), \qquad P\text{-a.s.} \]
Exercise 15.4.3. The traffic equations
Show that in equilibrium, the traffic equations of the Jackson network (see Chapter 9) receive the following interpretation:
\[ \lambda_i = E[A_i(0, 1]], \qquad \lambda_i r_{ij} = E[A_{i,j}(0, 1]], \]
where $A_i$ is the point process counting the arrivals (external and internal) into station $i$, and $A_{i,j}$ is the point process counting the transfers from station $i$ to station $j$.
Exercise 15.4.4. Cox processes
Let $N$ be a doubly stochastic Poisson process (Cox process) with respect to the σ-field $\mathcal{G}$, with locally integrable stochastic intensity $\{\lambda(t)\}_{t\ge 0}$. Remember that this means that $\mathcal{G} \supseteq \mathcal{F}_\infty^\lambda := \sigma(\lambda(s),\ s \in \mathbb{R})$ and that whenever $C_1, \dots, C_K$ are bounded disjoint measurable subsets of $\mathbb{R}$,
\[ E\Big[\exp\Big\{i \sum_{j=1}^K u_j N(C_j)\Big\} \,\Big|\, \mathcal{G}\Big] = \exp\Big\{\sum_{j=1}^K (e^{iu_j} - 1) \int_{C_j} \lambda(s)\, ds\Big\} \]
for all $u_1, \dots, u_K \in \mathbb{R}$. Show that $N$ admits the $\mathcal{F}_t$-intensity $\{\lambda(t)\}_{t\ge 0}$, where $\mathcal{F}_t = \mathcal{G} \vee \mathcal{F}_t^N$.
Exercise 15.4.5. Other histories
Let $N$ be a simple and locally finite point process on $\mathbb{R}$ with $\mathcal{F}_t$-intensity $\{\lambda(t)\}_{t\ge 0}$. Prove the following:
(i) If $\{\mathcal{G}_t\}_{t\ge 0}$ is a history such that $\mathcal{G}_\infty$ is independent of $\mathcal{F}_\infty$, then $\{\lambda(t)\}_{t\ge 0}$ is an $\mathcal{F}_t \vee \mathcal{G}_t$-intensity of $N$.
(ii) If $\{\widetilde{\mathcal{F}}_t\}_{t\ge 0}$ is a history of $N$ such that $\mathcal{F}_t^\lambda \vee \mathcal{F}_t^N \subseteq \widetilde{\mathcal{F}}_t \subseteq \mathcal{F}_t$, then $\{\lambda(t)\}_{t\ge 0}$ is an $\widetilde{\mathcal{F}}_t$-intensity of $N$.
Exercise 15.4.6. About predictability
(a) Show that a deterministic measurable process is $\mathcal{F}_t$-predictable, and that an $\mathcal{F}_t$-predictable process is $\mathcal{F}_t$-progressively measurable, and in particular measurable and $\mathcal{F}_t$-adapted.
(b) Let $S$ and $\tau$ be two $\mathcal{F}_t$-stopping times such that $S \le \tau$, and let $\varphi : \mathbb{R}_+ \times \mathbb{R} \to \mathbb{R}$ be a measurable function. Then X(t, ω) = ϕ(S(ω), t)1{S(ω) 0. Then for $P$-almost all $\omega \in A$, $\theta^n \omega \in A$ for an infinity of indices $n \ge 1$.
Proof. Consider the measurable set
\[ F := \{\omega \in A\,;\ \theta^n \omega \notin A \text{ for all } n \ge 1\} = A \setminus \cup_{n\ge 1} \{\omega\,;\ \theta^n \omega \in A\}. \]
If $m > n$, $\theta^n F \cap \theta^m F = \varnothing$. (Indeed, if $\theta^{-m}\omega \in F \subseteq A$, we have by definition of $F$ that $\theta^{-n}\omega = \theta^{m-n}(\theta^{-m}\omega) \notin A$.) By the θ-invariance of $P$, $P(F) = P(\theta^n F)$ for all $n \ge 1$, and therefore, for all $M \ge 1$,
\[ P(F) = \frac{1}{M} \sum_{n=1}^M P(\theta^n F) = \frac{1}{M}\, P\big(\cup_{n=1}^M \theta^n F\big) \le \frac{1}{M}, \]
and therefore $P(F) = 0$ since $M$ is arbitrary. In other words, outside of a negligible set $N_1$, for every point $\omega \in A$, there exists an $n_1$ such that $\theta^{n_1}\omega \in A$. By the same argument applied to $\theta^k$ ($k \ge 1$), outside of a negligible set $N_k$, for every point $\omega \in A$, there exists an $n_k \ge k$ such that $\theta^{n_k}\omega \in A$. In particular, for all $\omega \in A$ outside the negligible set $N := \cup_{k\ge 1} N_k$, there is for all $k \ge 1$ an $n_k \ge k$ such that $\theta^{n_k}\omega \in A$. Therefore for all $\omega \in A$ outside $N$, $\theta^n \omega \in A$ for an infinity of $n$.
The following lemma will be useful on several occasions.
Lemma 16.1.5 Let $(P, \theta)$ be a stationary framework. Let $Z$ be a nonnegative $P$-a.s. finite random variable such that $Z - Z \circ \theta \in L^1_{\mathbb{R}}(P)$. Then $E[Z - Z \circ \theta] = 0$.
Proof. (The delicate point here is that $Z$ is not assumed integrable.) For any $C > 0$, $|Z \wedge C - (Z \wedge C) \circ \theta| \le |Z - Z \circ \theta|$. By the θ-invariance of $P$, $E[Z \wedge C - (Z \wedge C) \circ \theta] = 0$ and the conclusion follows by dominated convergence, letting $C \uparrow \infty$ in the last equality.
Let $(P, \theta)$ be a stationary framework on $(\Omega, \mathcal{F})$.
Definition 16.1.6 An event $A \in \mathcal{F}$ is called strictly θ-invariant if $\theta^{-1}(A) = A$. It is called θ-invariant if $P(A \,\Delta\, \theta^{-1}(A)) = 0$, where $\Delta$ denotes the symmetric difference.
Definition 16.1.7 The discrete flow $\{\theta^n\}_{n\in\mathbb{Z}}$ is called ergodic (with respect to $P$) if all θ-invariant events are $P$-trivial (that is: of probability either 0 or 1). We shall also say: $(P, \theta)$ is ergodic.
Observe that for any θ-invariant event $A$, the event $B = \cap_{n\in\mathbb{Z}} \cup_{k\ge n} \theta^{-k} A$ is strictly θ-invariant and such that $P(A) = P(B)$. Therefore, for any θ-invariant event, there exists a strictly θ-invariant event with the same probability. In particular, the flow is ergodic if and only if all strictly θ-invariant events are trivial.
Example 16.1.8: Irrational translations on the torus, take 1. Let $(\Omega, \mathcal{F}) := ((0, 1], \mathcal{B}((0, 1]))$, and let $P$ be the Lebesgue measure on $(0, 1]$. Let $d$ be some real number. Define $\theta : (\Omega, \mathcal{F}) \to (\Omega, \mathcal{F})$ by $\theta(\omega) = \omega + d \bmod 1$. We show that $(P, \theta)$ is ergodic if and only if $d$ is irrational.
Proof. Clearly, $P$ is θ-invariant. Let $A \in \mathcal{B}((0, 1])$ and consider the Fourier series development of $1_A$,
\[ 1_A(\omega) = \sum_n a_n e^{2i\pi n\omega}, \qquad \text{almost everywhere}, \]
where $a_n = \int_A e^{-2i\pi n\omega}\, d\omega$. With $c := e^{2i\pi d}$, we have
\[ a_n = \int_{\theta^{-1}(A)} e^{-2i\pi n\theta(\omega)}\, d\omega = c^{-n} \int_{\theta^{-1}(A)} e^{-2i\pi n\omega}\, d\omega. \]
Therefore the $n$th Fourier coefficient of $1_{\theta^{-1}(A)}$ is $c^n a_n$. In particular,
\[ 1_{\theta^{-1}(A)}(\omega) = \sum_n c^n a_n e^{2i\pi n\omega}, \qquad \text{almost everywhere}. \]
Therefore, $A$ is strictly θ-invariant if and only if $c^n a_n = a_n$ for all $n$. Now, when $d$ is irrational, $c$ is not a root of unity and then necessarily $a_n = 0$ for all $n \ne 0$, that is, $1_A$ is a.s. a constant ($= a_0$), necessarily 0 or 1: $A$ is a trivial set. The fact that $(P, \theta)$ is not ergodic if $d$ is rational is left as an exercise (Exercise 16.4.1).
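Theorem 16.1.9 If $(P, \theta)$ is ergodic and if $A \in \mathcal{F}$ satisfies either $A \subseteq \theta^{-1}A$ or $\theta^{-1}A \subseteq A$, then $A$ is trivial. In other words, if $(P, \theta)$ is ergodic, an event that is either contracted or expanded by θ is necessarily trivial.
Proof. Since $P$ is θ-invariant, for all $A \in \mathcal{F}$,
\[ P(A - A \cap \theta^{-1}A) = P(A) - P(A \cap \theta^{-1}A) = P(\theta^{-1}A) - P(A \cap \theta^{-1}A) = P(\theta^{-1}A - A \cap \theta^{-1}A) \]
and therefore $P(A \,\Delta\, \theta^{-1}A) = 2P(A \cap \overline{\theta^{-1}A}) = 2P(\theta^{-1}A \cap \overline{A})$. Therefore $A$ is θ-invariant if and only if at least one (and then both) of $P(A \cap \overline{\theta^{-1}A})$ and $P(\theta^{-1}A \cap \overline{A})$ is null. In particular, if $A \subseteq \theta^{-1}A$, then $P(A \cap \overline{\theta^{-1}A}) = 0$ and therefore $A$ is θ-invariant, and therefore trivial. The case $\theta^{-1}A \subseteq A$ is treated in the same way.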
The main result of the current chapter is Birkhoff’s pointwise ergodic theorem:
Theorem 16.1.10 If $(P, \theta)$ is ergodic, then for all $f \in L^1_{\mathbb{R}}(P)$,
\[ \lim_{N\uparrow\infty} \frac{1}{N} \sum_{n=1}^N (f \circ \theta^n) = E[f], \qquad P\text{-a.s.} \tag{16.1} \]
The proof will be given in Section 16.2.
Remark 16.1.11 Theorem 16.1.10 entails a considerable improvement on the ergodic theorem for irreducible positive recurrent aperiodic hmcs, which are indeed ergodic (actually, mixing) in the sense given to this word in the present chapter.¹ At the price of transferring this chain to the canonical space of random sequences equipped with the canonical shift, we have that for any nonnegative measurable function $g : E^{\mathbb{N}} \to \mathbb{R}$,²
\[ \lim_{N\uparrow\infty} \frac{1}{N} \sum_{k=1}^N g(X_k, X_{k+1}, \dots) = E_\pi[g(X_0, X_1, \dots)], \qquad P_\mu\text{-a.s.} \]
for any initial distribution $\mu$ of the chain, where $E_\pi$ denotes expectation with respect to the initial distribution $\pi$, the stationary distribution. Note that there is a slight difference with the ergodic theorem in that the initial distribution may be different from $\pi$. We may take this liberty because an irreducible positive recurrent aperiodic chain starting with an arbitrary distribution eventually couples with a stationary chain.
16.1.2 Mixing
We now introduce a particular form of ergodicity.
Definition 16.1.12 The discrete flow $\{\theta^n\}_{n\in\mathbb{Z}}$ is called $P$-mixing if for all events $A, B \in \mathcal{F}$,
\[ \lim_{n\uparrow\infty} P(A \cap \theta^{-n}B) = P(A)P(B). \tag{16.2} \]
We shall also say: $(P, \theta)$ is mixing.
Mixing is a property of “forgetfulness of the initial conditions” since condition (16.2) is equivalent to
\[ \lim_{n} P(\theta^{-n}B \mid A) = P(B) \]
for all $A$ with $P(A) > 0$.
Theorem 16.1.13 If (16.2) holds for all $A, B \in \mathcal{A}$, where $\mathcal{A}$ is an algebra generating $\mathcal{F}$, then $(P, \theta)$ is mixing.
1 There is an unfortunate tradition that reserves the term “ergodic” for an hmc that is irreducible positive recurrent and aperiodic. Such a chain is in fact more than ergodic, since it is mixing.
2 For instance $g(X_0, X_1, \dots)$ could be the number of consecutive visits of the chain to a given state $i$ without visiting another given state $j$ in between.
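Proof. To any fixed $A, B \in \mathcal{F}$ and any $\varepsilon > 0$, one can associate $A', B' \in \mathcal{A}$ such that $A \,\Delta\, A'$ and $B \,\Delta\, B'$ have probabilities less than $\varepsilon$ (Lemma 4.3.8). The same is true of $(\theta^{-n}A) \,\Delta\, (\theta^{-n}A') = \theta^{-n}(A \,\Delta\, A')$ and of $(\theta^{-n}B) \,\Delta\, (\theta^{-n}B')$. In particular, $(A \cap \theta^{-n}B) \,\Delta\, (A' \cap \theta^{-n}B')$ has probability less than $2\varepsilon$, and
\[ P(A' \cap \theta^{-n}B') - 2\varepsilon \le P(A \cap \theta^{-n}B) \le P(A' \cap \theta^{-n}B') + 2\varepsilon. \]
Taking the lim sup and the lim inf, and then letting $\varepsilon \downarrow 0$ yields the result.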
Example 16.1.14: Mixing homogeneous Markov chains. This example continues Example 16.1.2, to which the reader is referred for the notation. The set $E$ is now assumed countable. Let $\mathbf{P}$ be a transition matrix indexed by $E$ that is irreducible positive recurrent, with (unique) stationary distribution $\pi$. Let $P$ be the unique probability measure on $(E^{\mathbb{Z}}, \mathcal{E}^{\otimes\mathbb{Z}})$ that makes $\{X_n\}_{n\in\mathbb{Z}}$ a stationary hmc with transition matrix $\mathbf{P}$. Suppose moreover that $\mathbf{P}$ is aperiodic. Then $(P, \theta)$ is mixing. Indeed, it suffices, in view of Theorem 16.1.13, to verify (16.2) for sets $A$ and $B$ of the form
\[ A = \{X_{k_1} = i_1, \dots, X_{k_p} = i_p\}, \qquad B = \{X_{\ell_1} = j_1, \dots, X_{\ell_q} = j_q\}, \]
where $k_1 < \cdots < k_p$ and $\ell_1 < \cdots < \ell_q$. This is true since, for $n > k_p - \ell_1$,
\[ P(A \cap \theta^{-n}B) = \pi(i_1)\, p_{i_1,i_2}(k_2 - k_1) \cdots p_{i_{p-1},i_p}(k_p - k_{p-1}) \times p_{i_p,j_1}(n + \ell_1 - k_p)\, p_{j_1,j_2}(\ell_2 - \ell_1) \cdots p_{j_{q-1},j_q}(\ell_q - \ell_{q-1}), \]
and $\lim_{n\uparrow\infty} p_{i_p,j_1}(n + \ell_1 - k_p) = \pi(j_1)$, so that
\[ \lim_{n\uparrow\infty} p_{i_p,j_1}(n + \ell_1 - k_p) \cdots p_{j_{q-1},j_q}(\ell_q - \ell_{q-1}) = \pi(j_1)\, p_{j_1,j_2}(\ell_2 - \ell_1) \cdots p_{j_{q-1},j_q}(\ell_q - \ell_{q-1}) = P(B). \]
If $A$ is strictly invariant for the mixing flow θ, then, taking $B = A$ in (16.2), $P(A) = P(A)^2$, so that $P(A)$ is 0 or 1. Therefore:
Theorem 16.1.15 A mixing flow is ergodic.
However, an ergodic flow is not necessarily mixing:
Example 16.1.16: Irrational translations on the torus, take 2. The flow of Example 16.1.8 is not mixing even if $d$ is irrational. To see this, take $A = B = (0, \tfrac{1}{2}]$. Since the set $\{nd \bmod 1\,;\ n \ge 1\}$ is dense in $(0, 1]$ when $d$ is irrational (this is the celebrated Weyl’s equidistribution theorem), $\theta^{-n}A$ and $B$ arbitrarily nearly coincide for an infinite number of indices $n$. Therefore (16.2) cannot hold.
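The failure of mixing for the irrational rotation can be checked numerically. The sketch below is an illustration added here, not taken from the text: it uses the elementary fact that, for $A = (0, \tfrac{1}{2}]$ and $x = nd \bmod 1$, the Lebesgue measure of $A \cap \theta^{-n}A$ equals $|\tfrac{1}{2} - x|$ (the overlap of two half-circles shifted by $x$). The choice $d = \sqrt{2}$ and the function name `overlap` are assumptions of the illustration.

```python
import math

def overlap(n, d):
    """Lebesgue measure of A ∩ θ^{-n}A for A = (0, 1/2] and θ(ω) = ω + d mod 1."""
    x = (n * d) % 1.0          # total shift after n steps, modulo 1
    return abs(0.5 - x)        # overlap of two arcs of length 1/2 shifted by x

d = math.sqrt(2.0)             # an irrational translation (illustrative choice)
values = [overlap(n, d) for n in range(1, 20001)]
cesaro = sum(values) / len(values)
```

The sequence `values` keeps oscillating between values near 0 and values near 1/2, so it does not converge to $P(A)^2 = 1/4$, whereas its Cesàro mean `cesaro` is close to 1/4, in line with the characterization of ergodicity by averaged probabilities.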
Theorem 16.1.17 For any algebra $\mathcal{A}$ generating $\mathcal{F}$, $(P, \theta)$ is ergodic if and only if for all $A, B \in \mathcal{A}$,
\[ \lim_{n\uparrow\infty} \frac{1}{n} \sum_{k=1}^n P(A \cap \theta^{-k}(B)) = P(A)P(B). \tag{16.3} \]
Proof. If $(P, \theta)$ is ergodic then, by the ergodic theorem, for all $A, B \in \mathcal{F}$,
\[ \lim_{n\uparrow\infty} \frac{1}{n} \sum_{k=1}^n 1_A(\omega)\, 1_B(\theta^k(\omega)) = 1_A(\omega)P(B), \]
and therefore, taking expectations, (16.3) follows. Conversely, if (16.3) is true for all $A, B \in \mathcal{F}$, then taking an invariant set $A = B$, we obtain that $P(A) = P(A)^2$, and therefore $A$ has probability 0 or 1. The fact that we can restrict $A$ and $B$ to be in $\mathcal{A}$ is proved in the same way as in Theorem 16.1.13.
Example 16.1.18: Ergodic but not mixing hmc. The setting is as in Example 16.1.14, except that we do not assume aperiodicity. If the period is $\ge 2$, the shift is not mixing any more. To see this let $C_0$ and $C_1$ be two consecutive cyclic classes of the chain. Then it is not true that $\lim_{n\uparrow\infty} P(X_0 = i, X_n = j) = \pi(i)\pi(j)$ when $i \in C_0$ and $j \in C_1$. However, with a proof similar to that of Example 16.1.14, one can prove ergodicity using Theorem 16.1.17 and the fact that for a positive recurrent hmc
\[ \lim_{n\uparrow\infty} \frac{1}{n} \sum_{k=1}^n p_{ij}(k) = \pi(j) \]
(Exercise 16.4.3).
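The contrast between convergence of $p_{ij}(n)$ and convergence of its Cesàro means can be seen on the simplest periodic chain. The following sketch is an added illustration under explicit assumptions: the chain is the deterministic two-state flip (period 2, stationary distribution $(\tfrac{1}{2}, \tfrac{1}{2})$), and the helper names `mat_mul` and `cesaro_means` are hypothetical.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def cesaro_means(p, i, j, n):
    """Return the list of (1/m) * sum_{k=1}^m p_ij(k) for m = 1, ..., n."""
    power, total, means = p, 0.0, []
    for m in range(1, n + 1):
        total += power[i][j]        # power holds the m-step transition matrix
        means.append(total / m)
        power = mat_mul(power, p)
    return means

flip = [[0.0, 1.0], [1.0, 0.0]]     # irreducible, positive recurrent, period 2
means = cesaro_means(flip, 0, 0, 1000)
```

Here $p_{00}(k)$ alternates between 0 and 1 and has no limit, while the averages in `means` settle at $\pi(0) = 1/2$.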
The Stochastic Process Point of View
In applications, one speaks in terms of stochastic processes rather than flows. The connection between the two points of view is made via canonical spaces, as follows. To any stochastic process $\{X_n\}_{n\in\mathbb{Z}}$ taking values in the measurable space $(E, \mathcal{E})$ and defined on the probability space $(\Omega, \mathcal{F}, P)$, one can associate a canonical version, by transporting the process on the canonical measurable space $(E^{\mathbb{Z}}, \mathcal{E}^{\otimes\mathbb{Z}})$ as explained after the statement of Theorem 5.1.7. Let $P_X$ denote the probability distribution of $\{X_n\}_{n\in\mathbb{Z}}$ (therefore a probability on the canonical space), and let $S$ denote the shift on the canonical space: $S : (x_n, n \in \mathbb{Z}) \mapsto (y_n, n \in \mathbb{Z})$ where $y_n := x_{n+1}$.
Definition 16.1.19 The stochastic process $\{X_n\}_{n\in\mathbb{Z}}$ taking values in the measurable space $(E, \mathcal{E})$ is said to be ergodic (resp. mixing) iff $(P_X, S)$ is ergodic (resp. mixing).
Therefore, in discussing ergodicity of a stochastic process, it is best to assume that it is the coordinate process defined on the corresponding canonical space. This is the convention adopted in the sequel. In other words, when speaking of an ergodic stochastic process $\{X_n\}_{n\in\mathbb{Z}}$ taking values in the measurable space $(E, \mathcal{E})$, we implicitly assume that $(\Omega, \mathcal{F}) = (E^{\mathbb{Z}}, \mathcal{E}^{\otimes\mathbb{Z}})$ and that $(P, \theta) = (P_X, S)$.
Remark 16.1.20 From the above discussion, we see that a way to decide if a given process is ergodic is to see if it can be obtained in the form $\{f(\dots, x_{n-1}, x_n, x_{n+1}, \dots)\}_{n\in\mathbb{Z}}$. The formalization of this is left for the reader. Let $\{X_n\}_{n\in\mathbb{Z}}$ be an ergodic³ hmc. Then, the process $\{Y_n\}_{n\in\mathbb{Z}}$, where $Y_n$ is the number of times $k$ ($\tau_n \le k \le n$) for which $X_k = 0$ and where $\tau_n$ is the last time before $n$ where the hmc took the value 1, is ergodic.
16.1.3 The Convex Set of Ergodic Probabilities
The set of ergodic probabilities coincides with the set of extremal points of the convex set of stationary probabilities. More precisely:
Theorem 16.1.21 $(P, \theta)$ is ergodic if and only if there exists no decomposition
\[ P = \alpha_1 P_1 + \alpha_2 P_2 \quad \text{with } \alpha_1 + \alpha_2 = 1 \text{ and } \alpha_1 > 0,\ \alpha_2 > 0, \tag{16.4} \]
where $P_1$ and $P_2$ are distinct θ-invariant probabilities.
Proof. We need two lemmas.
Lemma 16.1.22 If $(P_1, \theta)$ and $(P_2, \theta)$ are both ergodic, then either $P_1 = P_2$ or $P_1 \perp P_2$.
Proof. If $P_1$ and $P_2$ do not coincide, there exists an $A \in \mathcal{F}$ such that $P_1(A) \ne P_2(A)$. In particular, $B_1 \cap B_2 = \varnothing$, where for $i = 1, 2$,
\[ B_i = \Big\{\omega\,;\ \lim_{n\uparrow\infty} \frac{1}{n} \sum_{k=1}^n 1_A \circ \theta^k = P_i(A)\Big\}. \]
Also, by ergodicity, $P_1(B_1) = 1$ and $P_2(B_2) = 1$. Therefore $P_1 \perp P_2$.
Lemma 16.1.23 $(P, \theta)$ is ergodic if and only if there exists no θ-invariant probability $P_1$ distinct from $P$ such that $P_1 \ll P$.
Proof. If $(P, \theta)$ is ergodic and $P_1 \ll P$, then $(P_1, \theta)$ is ergodic. (In fact, if $A$ is θ-invariant, then either $P(A) = 0$ or $P(\overline{A}) = 0$, and therefore, by the absolute continuity hypothesis, $P_1(A) = 0$ or $P_1(\overline{A}) = 0$.) By Lemma 16.1.22, only the possibility $P_1 = P$ remains.
Suppose now $(P, \theta)$ not ergodic. This means there exists a nontrivial θ-invariant set $A \in \mathcal{F}$: $0 < P(A) < 1$. Define $P_1(B) = P(B \mid A)$. In particular, $P_1 \ll P$, and $P_1$ is distinct from $P$ since $P_1(A) = 1$. Also $P_1$ is θ-invariant. Indeed:
\[ P_1(\theta^{-1}(B)) = \frac{P(\theta^{-1}(B) \cap A)}{P(A)} = \frac{P(\theta^{-1}(B) \cap \theta^{-1}(A))}{P(A)} = \frac{P(B \cap A)}{P(A)} = P(B \mid A) = P_1(B). \]
We are now ready to prove Theorem 16.1.21. Suppose $(P, \theta)$ is ergodic and that $P = \alpha_1 P_1 + \alpha_2 P_2$, where $\alpha_1 + \alpha_2 = 1$, $\alpha_1 > 0$, $\alpha_2 > 0$, and where $P_1$ and $P_2$ are distinct θ-invariant probabilities. In particular, $P_1 \ll P$. Since $P_1$ and $P$ are distinct, $(P, \theta)$ cannot be ergodic (Lemma 16.1.23), a contradiction.
Suppose $(P, \theta)$ is not ergodic. There exists an invariant set $A \in \mathcal{F}$ that is nontrivial: $0 < P(A) < 1$. The decomposition
\[ P(B) = P(A)P(B \mid A) + P(\overline{A})P(B \mid \overline{A}) = \alpha_1 P_1(B) + \alpha_2 P_2(B) \]
is such that $\alpha_1 + \alpha_2 = 1$, $\alpha_1 > 0$, $\alpha_2 > 0$, and $P_1$ and $P_2$ are distinct and θ-invariant.
3 In the sense of Markov chain theory, that is, irreducible, aperiodic and positive recurrent.
16.2 A Detour into Queueing Theory
We will provide a proof of Theorem 16.1.10 after taking a detour into queueing theory.
16.2.1 Lindley’s Sequence
Let $(\Omega, \mathcal{F}, P)$ be a probability space and let $\theta : (\Omega, \mathcal{F}) \to (\Omega, \mathcal{F})$ be a bijective measurable map with measurable inverse. Suppose that $(P, \theta)$ is ergodic. Let $\sigma$ and $\tau$ be integrable nonnegative random variables defined on $(\Omega, \mathcal{F}, P)$. A Lindley process associated with these random variables is a stochastic process $\{W_n\}_{n\in T}$, where $T = \mathbb{N}$ or $\mathbb{Z}$, satisfying the recursion equation
\[ W_{n+1} = (W_n + \sigma_n - \tau_n)^+ \qquad (n \in T), \tag{16.5} \]
where
\[ \sigma_n = \sigma \circ \theta^n, \qquad \tau_n = \tau \circ \theta^n. \]
This equation will be interpreted in terms of queueing since this will greatly help our intuition in the forthcoming developments. Define the event times sequence $\{T_n\}_{n\in\mathbb{Z}}$, where $T_0 = 0$ and for all $n \in \mathbb{Z}$, $T_{n+1} - T_n = \tau_n$. We interpret $T_n$ as the arrival time in a queueing system of customer $n$, and $\sigma_n$ as the amount of service (in time units) required by this customer. Define
\[ \rho = \frac{E[\sigma]}{E[\tau]}. \]
If we interpret E[τ ]−1 as the rate of arrivals of customers, ρ is the traﬃc intensity, that is, the average amount of work brought into the system per unit of time. (However we shall not need this interpretation.) Service is provided at unit rate whenever there remains at least one customer. Otherwise there is no further prescription as to service discipline, priorities, and so on. If Wn is the total service remaining to be done just before customer n arrives (that is, at time Tn −), then, obviously, the Lindley recurrence (16.5) is satisﬁed. In this interpretation, the Lindley process is usually called the workload process.
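The Lindley recursion (16.5) is straightforward to iterate from an initial workload. The sketch below is an added illustration in the special case of iid sequences (an assumption; the text only requires an ergodic framework), with exponential service and interarrival times whose means 0.8 and 1 are arbitrary choices giving $\rho = 0.8$. The function name `lindley` is hypothetical.

```python
import random

def lindley(w0, sigma, tau):
    """Return [W_0, W_1, ..., W_n] with W_{k+1} = (W_k + sigma_k - tau_k)^+."""
    w = [w0]
    for s, t in zip(sigma, tau):
        w.append(max(w[-1] + s - t, 0.0))   # positive part of the updated workload
    return w

rng = random.Random(1)
n = 10_000
sigma = [rng.expovariate(1.0 / 0.8) for _ in range(n)]   # service times, mean 0.8
tau = [rng.expovariate(1.0) for _ in range(n)]           # interarrival times, mean 1
workload = lindley(0.0, sigma, tau)
```

For instance, `lindley(0.0, [2.0, 1.0, 3.0], [1.0, 5.0, 1.0])` returns `[0.0, 1.0, 0.0, 2.0]`: the second customer finds one unit of work left, the third finds an empty system.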
16.2.2 Loynes’ Equation
When $T = \mathbb{N}$, the Lindley process is recursively calculable from the initial workload $W_0$, but in the case $T = \mathbb{Z}$, we have nowhere to start the recursion. This corresponds to the situation of a queueing system that has been operating from the infinite past. We may expect that under certain circumstances (of course, a good guess is that $\rho < 1$ will do) the workload process has a stationary version. One is therefore led to pose the problem in the following terms: exhibit a finite nonnegative random variable $W$ such that the Lindley recursion (16.5) is satisfied for $\{W_n := W \circ \theta^n\}_{n\in\mathbb{Z}}$. Equivalently: we try to find a finite nonnegative random variable $W$ such that
\[ W \circ \theta = (W + \sigma - \tau)^+. \tag{16.6} \]
The above equation is called Loynes’ equation.
Theorem 16.2.1⁴ If $\rho < 1$, there exists a unique finite nonnegative solution $W$ of Loynes’ equation (16.6).
Proof. For $n \ge 0$, define $M_n$ to be the workload found by customer 0 assuming that customer $-n$ found an empty queue upon arrival. In particular, $M_0 = 0$. One checks by induction that
\[ M_n = \max_{1\le m\le n} \Big(\sum_{i=1}^m (\sigma_{-i} - \tau_{-i})\Big)^+. \tag{16.7} \]
In particular, $M_n$ is integrable for all $n \in \mathbb{N}$, being smaller than the integrable random variable $\sum_{i=1}^n |\sigma_{-i} - \tau_{-i}|$. Furthermore, the sequence $\{M_n\}_{n\ge 0}$ satisfies the recurrence relation
\[ M_{n+1} \circ \theta = (M_n + \sigma - \tau)^+ \tag{16.8} \]
and (16.7) shows that it is nondecreasing. Denoting by $M_\infty$ the limit
\[ M_\infty = \lim_{n\to\infty} \uparrow M_n = \sup_{n\ge 1} \Big(\sum_{i=1}^n (\sigma_{-i} - \tau_{-i})\Big)^+ \tag{16.9} \]
and letting $n$ go to $\infty$ in (16.8), we see that $M_\infty$ is a nonnegative random variable satisfying
\[ M_\infty \circ \theta = (M_\infty + \sigma - \tau)^+. \tag{16.10} \]
The random variable $M_\infty$ is often referred to as Loynes’ variable, while the sequence $\{M_n\}_{n\ge 0}$ is called Loynes’ sequence. The random variable $M_\infty$ can take infinite values. Using the identity $(a - b)^+ = a - a \wedge b$, Equality (16.8) becomes
\[ M_{n+1} \circ \theta = M_n - M_n \wedge (\tau - \sigma) \tag{16.11} \]
and therefore, since $P$ is θ-invariant, and $\{M_n\}_{n\ge 1}$ is nondecreasing and integrable,
\[ E[M_n \wedge (\tau - \sigma)] = E[M_n - M_{n+1} \circ \theta] = E[M_n - M_{n+1}] \le 0. \]
It follows by monotone convergence that
\[ E[M_\infty \wedge (\tau - \sigma)] \le 0. \tag{16.12} \]
Equality (16.10) shows that the event $\{M_\infty = \infty\}$ is θ-invariant (recall that $\sigma$ and $\tau$ are finite). Therefore, by ergodicity, $P(M_\infty = \infty)$ is either 0 or 1. In view of (16.12), $P(M_\infty = \infty) = 1$ implies $E[\tau - \sigma] \le 0$. Therefore, the condition $E[\sigma] < E[\tau]$ implies that $M_\infty < \infty$, $P$-a.s.
The solution of Loynes’ equation that we just gave (that is, $M_\infty$) is the minimal nonnegative solution. In order to prove this, it suffices to show that $W \ge M_n$ for all $n \ge 0$ (where $W$ is a nonnegative solution of Loynes’ equation) and then let $n \uparrow \infty$ to obtain $W \ge M_\infty$. This is proved by induction. The first term of the induction is satisfied since $W \ge 0 = M_0$. Now $W \ge M_n$ implies $W \ge M_{n+1}$ (because $M_{n+1} \circ \theta = (M_n + \sigma - \tau)^+ \le (W + \sigma - \tau)^+ = W \circ \theta$).
It remains to prove uniqueness of a finite solution of (16.6) if $\rho < 1$. Let $W$ be a finite solution, perhaps different from $M_\infty$. We have
\[ \sigma - \tau \le W \circ \theta - W \le \sigma, \]
and in particular $W \circ \theta - W$ is integrable. Therefore, by Lemma 16.1.5, $E[W \circ \theta - W] = 0$.
4 [Loynes, 1962].
Since $M_\infty$ is the minimal solution, for any nonnegative solution $W$, $\{W = 0\} \subseteq \{W = M_\infty\}$. The latter event is θ-contracting since both $W$ and $M_\infty$ satisfy (16.6). Since $(P, \theta)$ is ergodic, we must then have $P(W = M_\infty) = 0$ or 1. It is therefore enough to show that $P(W = 0) > 0$ (which implies $P(W = M_\infty) = 1$, that is, uniqueness). The proof of this follows from the next lemma.
Lemma 16.2.2 If $P(W = 0) = 0$ for some finite solution $W$ of (16.6), then $\rho = 1$.
Proof. Indeed, if $W > 0$ $P$-a.s. or (equivalently) $W \circ \theta > 0$ $P$-a.s., then $W \circ \theta = W + \sigma - \tau$ $P$-a.s., and $E[W \circ \theta - W] = 0$ in view of Lemma 16.1.5, and this implies $E[\sigma] = E[\tau]$.
This completes the proof of Theorem 16.2.1.
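Loynes’ sequence can be computed directly from formula (16.7), and the recurrence (16.8) can be checked numerically. In the sketch below, added for illustration, a finite list `s` stands for a window of the two-sided sequence $\{\sigma_i - \tau_i\}$ (iid exponential differences with $\rho = 0.8$, an assumption of the example), and `loynes(s, c, n)` plays the role of $M_n \circ \theta^c$, the workload found by customer `c` when customer `c - n` found an empty queue. All names are hypothetical.

```python
import random

def loynes(s, c, n):
    """M_n seen by customer c: max over 1 <= m <= n of (sum_{i=1}^m s[c-i])^+,
    with M_0 = 0. Requires c - n >= 0 so that all indices are valid."""
    best, partial = 0.0, 0.0
    for i in range(1, n + 1):
        partial += s[c - i]          # add sigma_{c-i} - tau_{c-i}
        best = max(best, partial)    # starting from 0 gives the positive part
    return best

rng = random.Random(3)
# s[i] stands for sigma_i - tau_i, customers i = 0, ..., 199 (rho = 0.8).
s = [rng.expovariate(1.0 / 0.8) - rng.expovariate(1.0) for _ in range(200)]
c = 150
M = [loynes(s, c, n) for n in range(0, 101)]
```

The list `M` is nondecreasing in `n`, as (16.7) predicts, and shifting the observer by one customer reproduces (16.8): `loynes(s, c + 1, n + 1)` agrees with `max(loynes(s, c, n) + s[c], 0.0)` up to rounding.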
A partial converse of Theorem 16.2.1 is the following:

Theorem 16.2.3 If $E[\sigma] > E[\tau]$, (16.6) admits no finite solution.

Proof. To prove this, it is enough to show that $M_\infty = \infty$, $P$-a.s., since $M_\infty$ is the minimal nonnegative solution of (16.6). This follows from
\[ \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} (\sigma_{-i}-\tau_{-i}) = E[\sigma-\tau] > 0 , \]
since this in turn implies
\[ M_\infty = \sup_{n} \Big( \sum_{i=1}^{n} (\sigma_{-i}-\tau_{-i}) \Big)^+ = \infty . \]
At this stage, we have proved the following: for ρ > 1 there is no ﬁnite nonnegative solution of (16.6), and for ρ < 1, M∞ is the unique nonnegative ﬁnite solution of (16.6). In the critical case (ρ = 1) the existence of a ﬁnite nonnegative solution of (16.6) depends on the distribution of the service and interarrival sequences. See Exercises 16.4.10 and 16.4.11.
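The construction of Loynes' sequence can be illustrated numerically. The following is a minimal sketch, assuming only a finite window of past service and interarrival times (the function names and the sample data are illustrative and do not come from the text). It computes $M_n$ as the positive part of the largest partial sum of the increments $\sigma_{-i}-\tau_{-i}$, and checks that this agrees with $n$ steps of the recursion $w \mapsto (w+\sigma-\tau)^+$ started from an empty system at time $-n$.

```python
from itertools import accumulate

def loynes_sup(sigma_past, tau_past):
    """M_n = max(0, max_{1<=k<=n} sum_{i=1}^k (sigma_{-i} - tau_{-i})).

    sigma_past[i-1] and tau_past[i-1] hold sigma_{-i} and tau_{-i}."""
    increments = [s - t for s, t in zip(sigma_past, tau_past)]
    return max(0, max(accumulate(increments), default=0))

def loynes_recursion(sigma_past, tau_past):
    """Workload at time 0 when the system starts empty at time -n."""
    w = 0
    # Go forward in time: index -n first, index -1 last.
    for s, t in zip(reversed(sigma_past), reversed(tau_past)):
        w = max(w + s - t, 0)
    return w

# Illustrative data (hypothetical): sigma_{-1..-5} and tau_{-1..-5}.
sigma_past = [2, 1, 4, 1, 3]
tau_past = [3, 2, 1, 2, 2]

# Loynes' sequence M_1, ..., M_5 is nondecreasing in n.
loynes_sequence = [loynes_sup(sigma_past[:n], tau_past[:n]) for n in range(1, 6)]
```

With these data the two computations coincide for every window length, which is the finite-horizon content of (16.7) and (16.8).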
16.3 Birkhoff's Theorem

16.3.1 The Ergodic Case
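We can now proceed to the proof of Theorem 16.1.10. It will be given in the equivalent form:

Theorem 16.3.1 Whenever $(P,\theta)$ is ergodic and both $\sigma$ and $\tau$ are nonnegative, not identically null, and integrable,
\[ \lim_{n\to\infty} \frac{\sum_{i=0}^{n} \sigma\circ\theta^{-i}}{\sum_{i=0}^{n} \tau\circ\theta^{-i}} = \frac{E[\sigma]}{E[\tau]} , \quad P\text{-a.s.} \]

Proof. According to (16.7),
\[ \sum_{i=1}^{n} \sigma\circ\theta^{-i} \le \sum_{i=1}^{n} \tau\circ\theta^{-i} + M_n . \]
We know that if $E[\sigma] < E[\tau]$, $M_n \uparrow M_\infty < \infty$ $P$-a.s. Taking $\sigma = \frac{1}{2}E[\tau] > 0$, it follows that if $E[\tau] > 0$,
\[ \lim_{n\to\infty} \sum_{i=1}^{n} \tau\circ\theta^{-i} = \infty , \quad P\text{-a.s.} \]
Therefore, whenever $E[\tau] > 0$ and $E[\sigma] < E[\tau]$,
\[ \limsup_{n\to\infty} \frac{\sum_{i=0}^{n} \sigma\circ\theta^{-i}}{\sum_{i=0}^{n} \tau\circ\theta^{-i}} \le \lim_{n\to\infty} \frac{M_n}{\sum_{i=0}^{n} \tau\circ\theta^{-i}} + 1 = 1 , \quad P\text{-a.s.} \]
If $E[\tau] > 0$, for an arbitrary integrable $\sigma$, take any $a > 0$ such that $aE[\sigma] < E[\tau]$ to obtain from the previous inequality (applied to $a\sigma$)
\[ \limsup_{n\to\infty} \frac{\sum_{i=0}^{n} \sigma\circ\theta^{-i}}{\sum_{i=0}^{n} \tau\circ\theta^{-i}} \le \frac{1}{a} \]
and therefore
\[ \limsup_{n\to\infty} \frac{\sum_{i=0}^{n} \sigma\circ\theta^{-i}}{\sum_{i=0}^{n} \tau\circ\theta^{-i}} \le \inf\Big\{ \frac{1}{a} \,;\ aE[\sigma] < E[\tau] \Big\} = \frac{E[\sigma]}{E[\tau]} . \]
Interchanging the roles of $\sigma$ and $\tau$, we obtain similarly
\[ \liminf_{n\to\infty} \frac{\sum_{i=0}^{n} \sigma\circ\theta^{-i}}{\sum_{i=0}^{n} \tau\circ\theta^{-i}} \ge \frac{E[\sigma]}{E[\tau]} . \]
Hence the result.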
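The ratio limit of Theorem 16.3.1 can be observed on a concrete ergodic system. The sketch below is an illustration under assumptions that are not part of the text: it takes $\Omega = [0,1)$ with the Lebesgue measure, the translation $\theta(\omega) = \omega + d \bmod 1$ with $d$ the inverse golden ratio (an irrational number, for which this translation is ergodic), and the hypothetical choices $\sigma(\omega) = \omega$ and $\tau(\omega) = 1+\omega^2$, so that $E[\sigma]/E[\tau] = (1/2)/(4/3) = 3/8$.

```python
import math

def ratio_of_sums(sigma, tau, omega, d, n):
    """Ratio of sum_{i=0}^n sigma(theta^{-i} omega) to the same sum for tau,
    where theta(omega) = omega + d mod 1."""
    num = den = 0.0
    for i in range(n + 1):
        point = (omega - i * d) % 1.0  # theta^{-i}(omega)
        num += sigma(point)
        den += tau(point)
    return num / den

d = (math.sqrt(5) - 1) / 2        # irrational translation (illustrative choice)
sigma = lambda w: w               # E[sigma] = 1/2 under Lebesgue measure
tau = lambda w: 1.0 + w * w       # E[tau] = 4/3 under Lebesgue measure

estimate = ratio_of_sums(sigma, tau, omega=0.3, d=d, n=20000)
target = 0.5 / (4.0 / 3.0)        # E[sigma] / E[tau] = 0.375
```

For a rational $d$ the same computation would in general converge to a limit depending on the starting point $\omega$, in line with the nonergodicity asked for in Exercise 16.4.1.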
Remark 16.3.2 If a stochastic process is not stationary, the ergodic theorem does not apply directly to obtain almost sure convergence of the empirical means. However if there is convergence of some sort of such a process to stationarity, the convergence of the empirical mean is possible, for instance in the case of an irreducible positive recurrent aperiodic hmc, where convergence in variation is obtained via coupling. See Theorem 6.4.2.
16.3.2 The Nonergodic Case

"Nonergodic" refers to the situation where there are nontrivial invariant sets.5 Define $\mathcal{I} = \{A \in \mathcal{F} \,;\ \theta^{-1}(A) = A\}$. $\mathcal{I}$ is a $\sigma$-field called the invariant $\sigma$-field.

Definition 16.3.3 A random variable $X$ is called invariant if $X = X\circ\theta$.

Theorem 16.3.4 $X$ is invariant if and only if it is $\mathcal{I}$-measurable.

Proof. Suppose $X$ invariant. Then, for all $a\in\mathbb{R}$, $\{X\le a\} = \{X\circ\theta\le a\} = \theta^{-1}\{X\le a\}$, and therefore $\{X\le a\}\in\mathcal{I}$. Conversely, suppose $X = 1_A$, where $A\in\mathcal{I}$. Then $X\circ\theta = 1_A\circ\theta = 1_{\theta^{-1}(A)} = 1_A = X$, and therefore indicators of sets in $\mathcal{I}$ are invariant, and so are the weighted sums of such indicator functions, as well as limits of the latter. Since a nonnegative $\mathcal{I}$-measurable random variable is a limit of a sequence of weighted sums of indicators of sets in $\mathcal{I}$, it is invariant. For an arbitrary $\mathcal{I}$-measurable random variable, the proof is completed by considering its positive and negative parts as usual.

Theorem 16.3.5 Let $(P,\theta)$ be a stationary framework. It is ergodic if and only if every invariant real-valued random variable is almost surely a constant.

Proof. Sufficiency: Let $A$ be invariant. Then $X = 1_A$ is invariant and therefore almost surely a constant, which implies $P(A) = 0$ or 1. Necessity: Suppose ergodicity and let $X$ be invariant. Then for all $a\in\mathbb{R}$, $P(X\le a) = 0$ or 1. When $a$ is sufficiently large, this must be 1 because $\lim_{a\uparrow\infty} P(X\le a) = 1$. Let $a_0 = \inf\{a\in\mathbb{R} \,;\ P(X\le a) = 1\}$. Therefore, for all $\varepsilon > 0$, $P(a_0-\varepsilon < X < a_0+\varepsilon) = 1$. Let $\varepsilon\downarrow 0$ to obtain $P(X = a_0) = 1$.

5 Strictly speaking, the results in the ergodic case follow from those in the current subsection. The choice made in the order of treatment is motivated by the facts that "in practice" the ergodic case is the most interesting one for applications and that the proof of the ergodic theorem seized the opportunity of introducing the G/G/1:∞ queue.
Theorem 16.3.6 (6) Let $(P,\theta)$ be a stationary framework and let $X$ be an integrable random variable. Then
\[ \lim_{n\uparrow\infty} \frac{1}{n}\sum_{k=0}^{n-1} X\circ\theta^k = E[X \mid \mathcal{I}] , \quad P\text{-a.s.} \]

The proof rests on Hopf's lemma:

Theorem 16.3.7 Let $(P,\theta)$ be a stationary framework and let $X$ be an integrable random variable. Define $S_k = X + X\circ\theta + \cdots + X\circ\theta^{k-1}$ and $M_n = \max(0, S_1, \ldots, S_n)$. Then
\[ \int_{\{M_n>0\}} X \, dP \ge 0 . \]

Proof. For $n\ge k$, $M_n\circ\theta \ge S_k\circ\theta$, and therefore, for $1\le k\le n$, $X + M_n\circ\theta \ge X + S_k\circ\theta = S_{k+1}$. Also $X + M_n\circ\theta \ge S_1$ (because $S_1 = X$ and $M_n\circ\theta\ge 0$). Therefore
\[ X \ge \max(S_1,\ldots,S_n) - M_n\circ\theta . \]
In particular,
\[ E\big[X 1_{\{M_n>0\}}\big] \ge E\big[(\max(S_1,\ldots,S_n) - M_n\circ\theta) 1_{\{M_n>0\}}\big] . \]
But $\max(S_1,\ldots,S_n) = M_n$ on $\{M_n>0\}$, and therefore
\[ E\big[X 1_{\{M_n>0\}}\big] \ge E\big[(M_n - M_n\circ\theta) 1_{\{M_n>0\}}\big] \ge E[M_n - M_n\circ\theta] = 0 . \]

Proof of Theorem 16.3.6. We may suppose that $E[X\mid\mathcal{I}] = 0$, otherwise replace $X$ by $X - E[X\mid\mathcal{I}]$. Define $\overline{X} = \limsup_{n\uparrow\infty} \frac{S_n}{n}$. This is an invariant random variable, and therefore the set $C := \{\overline{X} > \varepsilon\}$ is an invariant set for any fixed $\varepsilon > 0$. We show that $P(C) = 0$. Define $X^* = (X-\varepsilon)1_C$ and let $S_k^* = X^* + \cdots + X^*\circ\theta^{k-1}$ and $M_n^* = \max(0, S_1^*, \ldots, S_n^*)$. Then (Hopf's lemma)
\[ \int_{\{M_n^*>0\}} X^* \, dP \ge 0 . \]
The sets $H_n = \{M_n^*>0\} = \{\max(S_1^*,\ldots,S_n^*) > 0\}$, $n\ge 1$, form a nondecreasing sequence whose sequential limit is
\[ H := \Big\{ \sup_{k\ge 1} S_k^* > 0 \Big\} = \Big\{ \sup_{k\ge 1} \frac{S_k^*}{k} > 0 \Big\} = \Big\{ \sup_{k\ge 1} \frac{S_k}{k} > \varepsilon \Big\} \cap C . \]
Since $\sup_{k\ge 1} \frac{S_k}{k} \ge \overline{X}$ and $\overline{X} > \varepsilon$ on $C$, we have that $H = C$. Therefore
\[ \lim_{n\uparrow\infty} \int_{H_n} X^* \, dP = \int_H X^* \, dP = \int_C X^* \, dP \]
by dominated convergence since $X^*$ is integrable ($E[|X^*|] \le E[|X|] + \varepsilon$). Using the fact that $C$ is an invariant event,
\[ 0 \le \int_C X^* \, dP = \int_C (X-\varepsilon)1_C \, dP = \int_C X \, dP - \varepsilon P(C) = \int_C E[X\mid\mathcal{I}] \, dP - \varepsilon P(C) = -\varepsilon P(C) . \]
This implies $P(C) = 0$, that is, $P(\overline{X}\le\varepsilon) = 1$. Since $\varepsilon > 0$ is arbitrary, $P(\overline{X}\le 0) = 1$, that is, almost surely
\[ \limsup_{n\uparrow\infty} \frac{S_n}{n} \le 0 . \]
The same arguments applied with $-X$ instead of $X$ give
\[ -\limsup_{n\uparrow\infty} \frac{-S_n}{n} = \liminf_{n\uparrow\infty} \frac{S_n}{n} \ge 0 . \]
Therefore $\limsup_{n\uparrow\infty} \frac{S_n}{n} = \liminf_{n\uparrow\infty} \frac{S_n}{n} = 0$.

6 [Birkhoff, 1931].
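Both Hopf's lemma and the identification of the limit as a conditional expectation given the invariant $\sigma$-field can be checked exactly on a small nonergodic example. The sketch below is an illustration, not a construction from the text: it assumes $\Omega = \{0,\ldots,5\}$ with the uniform probability, the shift $\theta(\omega) = \omega+2 \bmod 6$ (which leaves the sets of even and of odd points invariant), and an arbitrary hypothetical integer-valued $X$.

```python
from fractions import Fraction

N = 6
theta = lambda w: (w + 2) % N      # preserves the uniform law, not ergodic
X = [3, -1, -4, 2, -2, -5]         # arbitrary illustrative values of X

def partial_sums(w, n):
    """S_1(w), ..., S_n(w), where S_k = X + X∘θ + ... + X∘θ^{k-1}."""
    sums, total = [], 0
    for _ in range(n):
        total += X[w]
        sums.append(total)
        w = theta(w)
    return sums

def hopf_integral(n):
    """N times the integral of X over {M_n > 0}, M_n = max(0, S_1, ..., S_n)."""
    return sum(X[w] for w in range(N) if max(0, *partial_sums(w, n)) > 0)

def birkhoff_average(w, n):
    """(1/n) * sum_{k=0}^{n-1} X(θ^k w), as an exact fraction."""
    return Fraction(partial_sums(w, n)[-1], n)

def conditional_expectation(w):
    """E[X | I](w): average of X over the invariant class (parity) of w."""
    cls = [v for v in range(N) if v % 2 == w % 2]
    return Fraction(sum(X[v] for v in cls), len(cls))
```

Here `hopf_integral(n)` is nonnegative for every $n$, while `birkhoff_average(w, n)` equals `conditional_expectation(w)` whenever $n$ is a multiple of 3, the common period of the two invariant classes. The limit therefore depends on $\omega$ through its parity only.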
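Corollary 16.3.8 Let $(P,\theta)$ be a stationary framework, and let $X$ be an integrable random variable. Then
\[ \lim_{n\uparrow\infty} \frac{1}{n}\sum_{k=0}^{n-1} X\circ\theta^k = E[X\mid\mathcal{I}] \quad \text{in } L^1_{\mathbb{C}}(P) . \]

Proof. Let for any $K > 0$,
\[ X_K' := X 1_{\{|X|\le K\}} , \qquad X_K'' := X 1_{\{|X|>K\}} . \]
By the pointwise ergodic theorem,
\[ \lim_{n\uparrow\infty} \frac{1}{n}\sum_{k=0}^{n-1} X_K'\circ\theta^k = E[X_K'\mid\mathcal{I}] , \]
from which it follows by dominated convergence ($X_K'$ is bounded) that
\[ \lim_{n\uparrow\infty} E\Big[ \Big| \frac{1}{n}\sum_{k=0}^{n-1} X_K'\circ\theta^k - E[X_K'\mid\mathcal{I}] \Big| \Big] = 0 . \]
Observing that
\[ E\Big[ \Big| \frac{1}{n}\sum_{k=0}^{n-1} X_K''\circ\theta^k \Big| \Big] \le \frac{1}{n}\sum_{k=0}^{n-1} E\big[ |X_K''\circ\theta^k| \big] = E\big[|X_K''|\big] \]
and
\[ E\big[ \big| E[X_K''\mid\mathcal{I}] \big| \big] \le E\big[ E[|X_K''| \mid \mathcal{I}] \big] = E\big[|X_K''|\big] , \]
we have that
\[ E\Big[ \Big| \frac{1}{n}\sum_{k=0}^{n-1} X_K''\circ\theta^k - E[X_K''\mid\mathcal{I}] \Big| \Big] \le 2E\big[|X_K''|\big] . \]
Therefore
\[ \limsup_{n\uparrow\infty} E\Big[ \Big| \frac{1}{n}\sum_{k=0}^{n-1} X\circ\theta^k - E[X\mid\mathcal{I}] \Big| \Big] \le 2E\big[|X_K''|\big] . \]
It then suffices to let $K$ tend to $\infty$.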
16.3.3 The Continuous-time Ergodic Theorem

The extension of the discrete-time result begins with the introduction of the notion of measurable flow in continuous time.

Example 16.3.9: Shifts acting on functions, take 1. Let $\Omega$ be the space of continuous functions $\omega : \mathbb{R}\to\mathbb{R}$, and let $\mathcal{F}$ be the $\sigma$-field on $\Omega$ generated by the coordinate functions $X(s) : \Omega\to\mathbb{R}$, $s\in\mathbb{R}$, where $X(s,\omega) = \omega(s)$. Define for each $t\in\mathbb{R}$ the mapping $\theta_t : \Omega\to\Omega$ by
\[ \theta_t(\omega)(s) = \omega(s+t) . \]
($\theta_t$ translates a function $\omega\in\Omega$ by $-t$.)

For fixed $t\in\mathbb{R}$, the mapping $\theta_t$ of the above example is measurable. However we shall need more measurability.

Definition 16.3.10 The family $\{\theta_t\}_{t\in\mathbb{R}}$ of measurable maps from the measurable space $(\Omega,\mathcal{F})$ into itself is called a shift on $(\Omega,\mathcal{F})$ if:
(a) $\theta_t$ is bijective for all $t\in\mathbb{R}$, and
(b) $\theta_t\circ\theta_s = \theta_{t+s}$ for all $t,s\in\mathbb{R}$.
This shift is called a (measurable) flow if, in addition,
(c) $(t,\omega)\mapsto\theta_t(\omega)$ is measurable from $(\mathbb{R}\times\Omega, \mathcal{B}\otimes\mathcal{F})$ to $(\Omega,\mathcal{F})$.
In particular, $\theta_0$ is the identity and $\theta_t^{-1} = \theta_{-t}$. To simplify the notation, we shall write $\theta_t\omega$ instead of $\theta_t(\omega)$.

Definition 16.3.11 Given a shift as in Definition 16.3.10, a stochastic process $\{Z(t)\}_{t\in\mathbb{R}}$ is called compatible with the shift (for short: $\theta_t$-compatible) if for all $t\in\mathbb{R}$,
\[ Z(t) = Z(0)\circ\theta_t , \tag{16.13} \]
that is, for all $\omega\in\Omega$, $Z(t,\omega) = Z(0,\theta_t\omega)$.

Example 16.3.12: Shifts acting on functions, take 2. In Example 16.3.9, the coordinate process is $\theta_t$-compatible.

The stationarity of a compatible process is embodied in the invariance of the underlying probability with respect to the shifts. More precisely:

Definition 16.3.13 Let $(\Omega,\mathcal{F},P)$ be a probability space and let $\{\theta_t\}_{t\in\mathbb{R}}$ be a shift on $(\Omega,\mathcal{F})$. The probability $P$ is called invariant with respect to this shift if
\[ P\circ\theta_t^{-1} = P \quad (t\in\mathbb{R}) . \tag{16.14} \]
We then say: $(P,\theta_t)$ is a stationary framework (on $\mathbb{R}$).

The continuous parameter set is now $\mathbb{R}$. We repeat in this setting the definitions given for discrete-time flows. Let $(P,\theta_t)$ be a stationary framework on $(\Omega,\mathcal{F})$.

Definition 16.3.14 An event $A\in\mathcal{F}$ is called strictly $\theta_t$-invariant if $A = \theta_t^{-1}A$ for all $t\in\mathbb{R}$. It is called $\theta_t$-invariant if $P(A \,\triangle\, \theta_t^{-1}A) = 0$ for all $t\in\mathbb{R}$.

By an easy adaptation of the remark following Definition 16.1.7 to continuous-time flows, we see that in the following definition, "$\theta_t$-invariant" can be replaced by "strictly $\theta_t$-invariant".

Definition 16.3.15 The flow $\{\theta_t\}_{t\in\mathbb{R}}$ is called $P$-ergodic if all $\theta_t$-invariant events are trivial. One then says: $(P,\theta_t)$ is ergodic.

Theorem 16.3.16 If $(P,\theta_t)$ is ergodic, then for all $f\in L^1(P)$,
\[ \lim_{T\uparrow\infty} \frac{1}{T}\int_0^T (f\circ\theta_t)\,dt = E[f] , \quad P\text{-a.s.} \tag{16.15} \]
Proof. Defining $\theta := \theta_1$, the pair $(P,\theta)$ is ergodic. It is enough to prove the theorem for nonnegative $f\in L^1(P)$. In this case, defining $n(T)$ by $n(T)\le T < n(T)+1$, we have the bounds
\[ \frac{n(T)}{n(T)+1}\,\frac{1}{n(T)}\int_0^{n(T)} f\circ\theta_t\,dt \le \frac{1}{T}\int_0^T f\circ\theta_t\,dt \le \frac{n(T)+1}{n(T)}\,\frac{1}{n(T)+1}\int_0^{n(T)+1} f\circ\theta_t\,dt . \tag{$\star$} \]
Defining $g := \int_0^1 f\circ\theta_t\,dt$, we have that
\[ \int_0^n f\circ\theta_t\,dt = \sum_{k=1}^{n} g\circ\theta^{k-1} \]
and therefore
\[ \lim_{n\uparrow\infty} \frac{1}{n}\int_0^n f\circ\theta_t\,dt = E[g] = E\Big[\int_0^1 f\circ\theta_t\,dt\Big] = E[f] . \]
The conclusion then follows from ($\star$).

Theorem 16.3.17 $(P,\theta_t)$ is ergodic if and only if there exists no decomposition
\[ P = \beta_1 P_1 + \beta_2 P_2 , \quad \beta_1+\beta_2 = 1 ,\ \beta_1 > 0 ,\ \beta_2 > 0 , \tag{16.16} \]
where $P_1$ and $P_2$ are distinct probabilities that are $\theta_t$-invariant for all $t$.

Proof. The proof is analogous to that of Theorem 16.1.21.
Complementary reading

[Billingsley, 1965] is the classic introduction to ergodic theory and to the theoretical aspects of information theory.

16.4 Exercises

Exercise 16.4.1. ω + d mod 1
Let $\Omega = (0,1]$, and let $P$ be the Lebesgue measure on $\Omega$. Let $d$ be some real number. Define $\theta : (\Omega,\mathcal{F})\to(\Omega,\mathcal{F})$ by $\theta(\omega) = \omega + d \bmod 1$. Show that $(P,\theta)$ is not ergodic if $d$ is rational.

Exercise 16.4.2. θ ergodic, θ² not ergodic
Give an example where $(P,\theta)$ is ergodic and $(P,\theta^2)$ is not ergodic.

Exercise 16.4.3. Ergodic yet not mixing hmc
Give the details in the proof of Example 16.1.18.

Exercise 16.4.4. 2ω mod 1
Let $(\Omega,\mathcal{F}) := ([0,1), \mathcal{B}([0,1)))$. Let $P$ be the Lebesgue measure on $[0,1)$. Consider the transformation
\[ \theta(\omega) := \begin{cases} 2\omega & \text{if } \omega\in[0,\tfrac12), \\ 2\omega-1 & \text{if } \omega\in[\tfrac12,1). \end{cases} \]
Show that $P$ is $\theta$-invariant and that $(P,\theta)$ is mixing. Hint: The intervals of the form $[\frac{k}{2^n}, \frac{k+1}{2^n})$ generate $\mathcal{B}([0,1))$.

Exercise 16.4.5. Periodic hmc
Show that an irreducible positive recurrent discrete-time hmc with period $\ge 2$ cannot be mixing.

Exercise 16.4.6. Product of mixing shifts
For $i = 1,2$, let $(\Omega_i,\mathcal{F}_i,P_i)$ be a probability space endowed with the measurable shift $\theta_i$ such that $(P_i,\theta_i)$ is mixing. Define $(\Omega,\mathcal{F},P)$ to be the product of the above probability spaces. The product shift $\theta := \theta_1\otimes\theta_2$ is defined in the obvious manner: $\theta((\omega_1,\omega_2)) := (\theta_1(\omega_1),\theta_2(\omega_2))$.
(1) Show that on the product of two probability spaces, each endowed with a mixing shift, the product shift is mixing.
(2) Give a counterexample when "mixing" is replaced by "ergodic" in the previous question.

Exercise 16.4.7. Irrational translations on the torus
Prove that the flow of Example 16.1.8 is not mixing ($d$ rational or irrational).

Exercise 16.4.8. Ergodicity of Gaussian processes
Show that a stationary centered Gaussian sequence $\{X_n\}_{n\ge 0}$ such that $\lim_{n\uparrow\infty} E[X_0X_n] = 0$ is ergodic.

Exercise 16.4.9. Invariant events of an hmc
In Example 16.1.18, identify the invariant events.

Exercise 16.4.10. Loynes: the critical case, I
In Loynes' equation, assume that the random variables $\sigma_n-\tau_n$ are centered, iid, and with a positive finite variance. Prove that in this case there exists no finite solution $Z$ of Loynes' equation. Hint: apply the central limit theorem to $\{\sigma_{-n}-\tau_{-n}\}_{n\ge 1}$.

Exercise 16.4.11. Loynes: the critical case, II
Show that if $\rho = 1$, and if there is a finite solution, then for any $c\ge 0$, $W = M_\infty + c$ is also a finite solution of (16.6).

Exercise 16.4.12. Lindley: recurrence to zero in the stable case
The stability condition $\rho < 1$ is assumed to hold. Let $W = M_\infty$ be the unique nonnegative solution of (16.6). Show that there exists an infinity of negative (resp. positive) indices $n$ such that $W\circ\theta^n = 0$.
Chapter 17 Palm Probability

Palm theory (in this chapter: on the line) links two types of stationarity for marked point processes: time-stationarity and event-stationarity. Two examples will illustrate this.

The first example is the renewal process, for which one distinguishes the time-stationary (necessarily delayed) version from the undelayed version whose distribution is invariant with respect to the shift that translates the first event time to the origin. It was shown in Chapter 10 that there exists a simple relation between the two versions, which are identical except for the distribution of the first event time. In the terminology of Palm theory, the undelayed version is the Palm version of the time-stationary version.

The second example is that of an irreducible positive recurrent continuous-time hmc whose embedded chain (the chain observed at the transition times) is also positive recurrent. When such a chain is (time-)stationary, it is not true in general that the embedded discrete-time Markov chain is stationary, even when the latter is assumed positive recurrent. However, there is a simple relation between the stationary distribution of the continuous-time chain and the stationary distribution of the embedded chain. The continuous-time chain starting with the stationary distribution of the embedded chain is the Palm version of the stationary continuous-time chain. We observe once more that the distribution of the Palm version is invariant with respect to the shift that translates the first event time (transition time) to the origin.

In general, Palm theory on the line is concerned with jointly stationary stochastic processes and point processes, and with the probabilistic situation at event times. It is especially relevant in queueing theory applied to service systems, where there are two distinct points of view, that of the "operator", who is interested in the behavior of a queue at arbitrary times, and that of the "customer", who is generally interested in the situation found upon arrival. The corresponding issues will be treated in Chapter 9.
17.1 Palm Distribution and Palm Probability

17.1.1 Palm Distribution

The story begins with a new look at Campbell's formula for stationary marked point processes. Let $N$ be a simple point process on $\mathbb{R}^m$ with point sequence $\{X_n\}_{n\in\mathbb{N}}$. Let $\{Z_n\}_{n\in\mathbb{N}}$ be a sequence of random variables with values in the measurable space $(K,\mathcal{K})$. Each $Z_n$ is considered as a mark of the corresponding point $X_n$. Recall that the point process and its sequence of marks are referred to as "the marked point process $(N,Z)$", which can also be represented as a point process $N_Z$ on $\mathbb{R}^m\times K$:
\[ N_Z(D) := \sum_{n\in\mathbb{N}} 1_D(X_n,Z_n) \quad (D\in\mathcal{B}(\mathbb{R}^m)\otimes\mathcal{K}) . \]

© Springer Nature Switzerland AG 2020 P. Brémaud, Probability Theory and Stochastic Processes, Universitext, https://doi.org/10.1007/9783030401832_17

The marked point process $(N,Z)$ is called stationary if for all $x\in\mathbb{R}^m$, the random measure $S_x(N_Z)$ defined by
\[ S_x(N_Z)(D) := \sum_{n\in\mathbb{N}} 1_D(X_n+x,Z_n) \quad (D\in\mathcal{B}(\mathbb{R}^m)\otimes\mathcal{K}) \]
has the same distribution as $N_Z$. The intensity of the (stationary) point process $N$ will henceforth be assumed positive and finite:
\[ 0 < \lambda := E[N((0,1]^m)] < \infty . \tag{17.1} \]
Define the $\sigma$-finite measure $\nu_Z$ on $(\mathbb{R}^m\times K, \mathcal{B}(\mathbb{R}^m)\otimes\mathcal{K})$ by
\[ \nu_Z(D) := E[N_Z(D)] \quad (D\in\mathcal{B}(\mathbb{R}^m)\otimes\mathcal{K}) . \]
Recall the notation $C+x := \{y+x \,;\ y\in C\}$ ($C\subseteq\mathbb{R}^m$, $x\in\mathbb{R}^m$). By stationarity, for all $C\in\mathcal{B}(\mathbb{R}^m)$ and all $L\in\mathcal{K}$,
\[ \nu_Z((C+x)\times L) = \nu_Z(C\times L) \quad (x\in\mathbb{R}^m) , \]
that is, the measure $C\mapsto\nu_Z(C\times L)$ is for fixed $L$ translation-invariant, and therefore (Theorem 2.1.45) a multiple of the Lebesgue measure $\ell^m$ on $(\mathbb{R}^m,\mathcal{B}(\mathbb{R}^m))$:
\[ \nu_Z(C\times L) = \gamma(L)\,\ell^m(C) , \]
for some $\gamma(L)$. The mapping $L\mapsto\gamma(L)$ is a measure on $(K,\mathcal{K})$ that is finite since $\gamma(K) = \lambda$. In particular, $Q_N^0 := \lambda^{-1}\gamma$ is a probability measure on $(K,\mathcal{K})$ and
\[ \nu_Z(C\times L) = \lambda\,Q_N^0(L)\,\ell^m(C) . \]
Therefore
\[ Q_N^0(L) = \frac{E\big[\sum_{n\in\mathbb{N}} 1_C(X_n)1_L(Z_n)\big]}{\lambda\,\ell^m(C)} \tag{17.2} \]
is a probability on $(K,\mathcal{K})$. It is called the Palm distribution of the marks.

Theorem 17.1.1 Let $f : \mathbb{R}^m\times K\to\mathbb{R}$ be a nonnegative measurable function. Then
\[ E\Big[\int_{\mathbb{R}^m\times K} f(x,z)\,N_Z(dx\times dz)\Big] = \lambda\int_{\mathbb{R}^m}\int_K f(x,z)\,Q_N^0(dz)\,dx . \tag{17.3} \]

Proof. Formula (17.3) is true for
\[ f(x,z) := 1_C(x)\,1_L(z) \quad (C\in\mathcal{B}(\mathbb{R}^m),\ L\in\mathcal{K}) \]
since it then reduces to (17.2). The general case again follows by the usual monotone class argument based on Dynkin's Theorem 2.1.27.

Formula (17.3) is the Palm–Campbell formula for stationary marked point processes.
17.1.2 Stationary Frameworks

The passage from the Palm distribution of marks to Palm probability will be done in terms of measurable flows on abstract probability spaces.

Measurable Flows

For convenience, we repeat Definition 16.3.10 of Chapter 16.

Definition 17.1.2 A family $\{\theta_x\}_{x\in\mathbb{R}^m}$ of measurable functions from the measurable space $(\Omega,\mathcal{F})$ into itself is called a shift on $(\Omega,\mathcal{F})$ if:
(a) $\theta_x$ is bijective for all $x\in\mathbb{R}^m$, and
(b) $\theta_x\circ\theta_y = \theta_{x+y}$ for all $x,y\in\mathbb{R}^m$.
The shift $\{\theta_x\}_{x\in\mathbb{R}^m}$ on $(\Omega,\mathcal{F})$ is called a measurable flow if in addition
(c) $(x,\omega)\mapsto\theta_x(\omega)$ is measurable from $\mathcal{B}(\mathbb{R}^m)\otimes\mathcal{F}$ to $\mathcal{F}$.

Example 17.1.3: Measurable flow on a space of measures. The shift $\{S_x\}_{x\in\mathbb{R}^m}$ acting on $M(\mathbb{R}^m)$, the canonical space of locally finite measures on $\mathbb{R}^m$, and defined by
\[ S_x(\mu)(C) := \mu(C+x) \quad (x\in\mathbb{R}^m,\ C\in\mathcal{B}(\mathbb{R}^m)) , \]
is a measurable flow. To prove this, it suffices to show that the mapping
\[ (x,\mu)\mapsto (S_x\mu)(f) := \int_{\mathbb{R}^m} f(y-x)\,\mu(dy) \]
is measurable whenever $f : \mathbb{R}^m\to\mathbb{R}$ is a nonnegative continuous function with compact support. This is the case since $(x,\mu)\mapsto g(x,\mu) := (S_x\mu)(f)$ is continuous in the first argument and measurable in the second. (Indeed, $g$ is the limit as $n\uparrow\infty$ of the measurable functions $\sum_{k\in\mathbb{N}} 1_{C_{k,n}}(x)\,g(x_{k,n},\mu)$, where $\{x_{k,n}\}_{k\in\mathbb{N}}$ is an enumeration of the grid $n^{-1}\mathbb{Z}^m$ and $C_{k,n} = x_{k,n} + (-\frac{1}{2n},\frac{1}{2n}]^m$.)

Example 17.1.4: The shift on marked point processes. Let $(H,\mathcal{H})$ be some measurable space. One may take $(\Omega,\mathcal{F}) = (M(\mathbb{R}^m\times H), \mathcal{M}(\mathbb{R}^m\times H))$ with $\theta_x = S_x$, where
\[ S_x\mu(C\times L) := \mu((C+x)\times L) \quad (C\in\mathcal{B}(\mathbb{R}^m),\ L\in\mathcal{H}) . \]
Compatibility

A central notion is that of compatibility with a flow.

Definition 17.1.5 Let $\{\theta_x\}_{x\in\mathbb{R}^m}$ be a measurable flow on $(\Omega,\mathcal{F})$. A stochastic process $\{Z(x)\}_{x\in\mathbb{R}^m}$ defined on $(\Omega,\mathcal{F})$ with values in the measurable space $(K,\mathcal{K})$ is called compatible with the flow $\{\theta_x\}_{x\in\mathbb{R}^m}$ (for short: $\theta_x$-compatible) if
\[ Z(x,\omega) = Z(0,\theta_x(\omega)) \quad (\omega\in\Omega,\ x\in\mathbb{R}^m) , \]
that is, in shorter notation, $Z(x) = Z(0)\circ\theta_x$. A random measure $N$ on $\mathbb{R}^m$ is called compatible with the flow $\{\theta_x\}_{x\in\mathbb{R}^m}$ (for short: $\theta_x$-compatible) if
\[ N(\theta_x(\omega),C) = N(\omega,C+x) \quad (\omega\in\Omega,\ C\in\mathcal{B}(\mathbb{R}^m),\ x\in\mathbb{R}^m) , \]
that is, in shorter notation, $N\circ\theta_x = S_xN$, where $S_x$ is the translation operator acting on measures (Example 16.3.9, (ii)).

Note we have three notations for the same object: $N\circ\theta_x$, $S_x(N)$, $N-x$. The latter is not to be confused with $N-\varepsilon_x$, which represents $N\backslash\{x\}$ if $x\in N$, and $N$ if $x\notin N$.

Stationary Frameworks

Let $(\Omega,\mathcal{F},P)$ be a probability space and let $\{\theta_x\}_{x\in\mathbb{R}^m}$ be a measurable flow on $(\Omega,\mathcal{F})$.

Definition 17.1.6 The probability $P$ is called invariant with respect to the flow $\{\theta_x\}_{x\in\mathbb{R}^m}$ (for short, $\theta_x$-invariant) if for all $x\in\mathbb{R}^m$
\[ P\circ\theta_x^{-1} = P . \]
$(P,\theta_x)$ is then called a stationary framework on $\mathbb{R}^m$.

Example 17.1.7: Stationary point process. Let $(P,\theta_x)$ be a stationary framework on $\mathbb{R}^m$. If the point process $N$ on $\mathbb{R}^m$ is compatible with the shift, it is stationary. Indeed, letting $A = \{\omega \,;\ N(\omega,C_1) = k_1, \ldots, N(\omega,C_m) = k_m\}$, with $C_1,\ldots,C_m\in\mathcal{B}(\mathbb{R}^m)$, $k_1,\ldots,k_m\in\mathbb{N}$, we have, by definition,
\[ \theta_x^{-1}A = \{\omega \,;\ \theta_x(\omega)\in A\} = \{\omega \,;\ N(\theta_x(\omega),C_1) = k_1, \ldots, N(\theta_x(\omega),C_m) = k_m\} = \{\omega \,;\ N(\omega,C_1+x) = k_1, \ldots, N(\omega,C_m+x) = k_m\} . \]
Therefore, since $P\circ\theta_x^{-1} = P$,
\[ P(N(C_1) = k_1, \ldots, N(C_m) = k_m) = P(N(C_1+x) = k_1, \ldots, N(C_m+x) = k_m) . \]

In the situation of Example 17.1.7, one sometimes says for short: $(N,\theta_x,P)$ is a stationary point process.

Example 17.1.8: Stationary stochastic process. Let $(P,\theta_x)$ be a stationary framework on $\mathbb{R}^m$. By the same argument as in the example above, a stochastic process $\{Z(x)\}_{x\in\mathbb{R}^m}$ with values in $(K,\mathcal{K})$ that is $\theta_x$-compatible is strictly stationary. For short: $(Z,\theta_x,P)$ is a stationary stochastic process. (Here $Z$ stands for $\{Z(x)\}_{x\in\mathbb{R}^m}$.)
17.1.3 Palm Probability and the Campbell–Mecke Formula

Let $((N,Z),\theta_x,P)$ be a stationary marked point process on $\mathbb{R}^m$ such that $N$ is simple and with finite positive intensity $\lambda$. In fact, the work needed for the definition of Palm probability has already been done in Subsection 17.1.1. It suffices to choose for measurable mark space $(K,\mathcal{K})$ the measurable space $(\Omega,\mathcal{F})$ itself. For each $n\in\mathbb{N}$, $\theta_{X_n}$ is a random element taking its values in the measurable space $(\Omega,\mathcal{F})$. To see this, write $\theta_x(\omega)$ as $f(x,\omega)$ and remember that the function $(x,\omega)\mapsto f(x,\omega)$ is measurable, and therefore, since the function $\omega\mapsto X_n(\omega)$ is measurable, so is the function $\omega\mapsto f(X_n(\omega),\omega)$. This defines $\theta_{X_n(\omega)}(\omega) := f(X_n(\omega),\omega)$. The sequence $\{\theta_{X_n}\}_{n\in\mathbb{N}}$ is the universal mark sequence.

Remark 17.1.9 If $(\Omega,\mathcal{F})$ is the canonical space of point processes on $\mathbb{R}^m$ and the measurable flow is just the shift on this space, the universal mark associated with the point $X_n$ is $N-X_n$, the canonical process $N$ shifted by $X_n$. In fact, this mark contains as much and no more information than the whole trajectory $N$!

Take in (17.2) $(K,\mathcal{K}) = (\Omega,\mathcal{F})$ and $Z_n = \theta_{X_n}$. Denote in this case $Q_N^0$ by $P_N^0$. In particular, $P_N^0$ is a probability on $(\Omega,\mathcal{F})$. Formula (17.2) then reads, for all $C\in\mathcal{B}(\mathbb{R}^m)$ of positive Lebesgue measure,
\[ P_N^0(A) := \frac{E\big[\sum_{n\in\mathbb{N}} 1_C(X_n)\,1_A\circ\theta_{X_n}\big]}{\lambda\,\ell^m(C)} \quad (A\in\mathcal{F}) . \tag{17.4} \]
The probability $P_N^0$ defined by (17.4) is called the Palm probability associated with $P$ (or, more precisely, with $(N,\theta_x,P)$).

Remark 17.1.10 The definition (17.4) does not depend on the choice of $C\in\mathcal{B}(\mathbb{R}^m)$ of positive Lebesgue measure.

Theorem 17.1.11 (1) Let $v : \mathbb{R}^m\times\Omega\to\mathbb{R}$ be a nonnegative measurable function. Then
\[ E\Big[\int_{\mathbb{R}^m} (v(x)\circ\theta_x)\,N(dx)\Big] = \lambda E_N^0\Big[\int_{\mathbb{R}^m} v(x)\,dx\Big] . \tag{17.5} \]
(The left-hand side of the above equality is just $E\big[\sum_{n\in\mathbb{N}} v(X_n,\theta_{X_n})\big]$.)

Proof. Formula (17.4) is a special case of the announced equality for the choice
\[ v(x,\omega) = 1_C(x)\,1_A(\omega) , \]
from which the general case follows by the usual monotone class argument based on Dynkin's Theorem 2.1.27.

1 [Mecke, 1967].

Formula (17.5) is the Campbell–Mecke formula. It is, as we have seen, a sophisticated avatar of Campbell's formula. It is sometimes used in the alternative equivalent form
\[ E\Big[\int_{\mathbb{R}^m} v(x)\,N(dx)\Big] = \lambda E_N^0\Big[\int_{\mathbb{R}^m} (v(x)\circ\theta_{-x})\,dx\Big] . \tag{17.6} \]
Example 17.1.12: An expression of the renewal function. Let $(N,\theta_t,P)$ be a stationary simple locally finite point process on $\mathbb{R}$ with finite intensity $\lambda$ and let $P_N^0$ be the associated Palm probability. We show that for all $a\ge 0$,
\[ E\big[N((0,a])^2\big] = \int_0^a \big(2E_N^0[N((-t,0])] - 1\big)\lambda\,dt , \]
and that, in the case of a renewal process,
\[ E\big[N((0,a])^2\big] = \int_0^a (2R(t)-1)\lambda\,dt , \]
where $R$ is the renewal function.

Proof. From the integration by parts formula (Theorem 2.3.12) (watch the parentheses),
\[ N((0,a])^2 = \int_{(0,a]} N((0,t])\,N(dt) + \int_{(0,a]} N((0,t))\,N(dt) = 2\int_{(0,a]} N((0,t])\,N(dt) - N((0,a]) . \]
Therefore
\[ E\big[N((0,a])^2\big] = 2E\Big[\int_{\mathbb{R}_+} N((0,t])\,1_{(0,a]}(t)\,N(dt)\Big] - \lambda a . \]
In view of (17.6) with $v(t) := N((0,t])\,1_{(0,a]}(t)$ (and therefore $v(t)\circ\theta_{-t} = N((-t,0])\,1_{(0,a]}(t)$),
\[ E\Big[\int_{\mathbb{R}_+} N((0,t])\,1_{(0,a]}(t)\,N(dt)\Big] = E_N^0\Big[\int_{\mathbb{R}_+} N((-t,0])\,1_{(0,a]}(t)\,\lambda\,dt\Big] = \int_{\mathbb{R}_+} E_N^0[N((-t,0])]\,1_{(0,a]}(t)\,\lambda\,dt . \]
For a renewal process, observe that $E_N^0[N((-t,0])] = E_N^0[N([0,t))]$.
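The second-moment formula of Example 17.1.12 can be sanity-checked in a case where both sides are known in closed form. The sketch below assumes a homogeneous Poisson process of intensity $\lambda$, which is an illustrative choice not made in the text: its Palm version is the same Poisson process with an extra point at the origin, so that $E_N^0[N((-t,0])] = 1+\lambda t$, while $N((0,a])$ is a Poisson variable of mean $\lambda a$, with second moment $\lambda a + (\lambda a)^2$. The function name and the numerical values are hypothetical.

```python
def second_moment_via_palm(palm_mean_count, lam, a, steps=1000):
    """Midpoint-rule value of int_0^a (2 * E0[N((-t, 0])] - 1) * lam dt.

    palm_mean_count(t) must return E0[N((-t, 0])] for the process at hand."""
    h = a / steps
    return sum(
        (2 * palm_mean_count((j + 0.5) * h) - 1) * lam * h
        for j in range(steps)
    )

lam, a = 1.5, 2.0

# Poisson case (assumption): one point at the origin plus lam * t on average.
poisson_palm_mean = lambda t: 1 + lam * t

closed_form = lam * a + (lam * a) ** 2      # E[N((0, a])^2] for Poisson(lam * a)
via_palm = second_moment_via_palm(poisson_palm_mean, lam, a)
```

The two values agree up to rounding, since the midpoint rule is exact for the affine integrand $(1+2\lambda t)\lambda$.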
Let $h : \mathbb{R}^m\times M(\mathbb{R}^m)\to\mathbb{R}$ be a nonnegative measurable function. Taking $v(x,\omega) := h(x,N(\omega))$ in (17.5), we have
\[ E\Big[\sum_{n\in\mathbb{N}} h(X_n, N-X_n)\Big] = \lambda E_N^0\Big[\int_{\mathbb{R}^m} h(x,N)\,dx\Big] . \]
Specializing this to $h(x,N) := g(x)\,1_\Gamma(N)$ ($\Gamma\in\mathcal{M}(\mathbb{R}^m)$) gives
\[ E\Big[\sum_{n\in\mathbb{N}} g(X_n)\,1_\Gamma(N-X_n)\Big] = \lambda P_N^0(N\in\Gamma)\int_{\mathbb{R}^m} g(x)\,dx . \]
With $g(x) := 1_C(x)$, where $C\in\mathcal{B}(\mathbb{R}^m)$ is of finite positive Lebesgue measure,
\[ E\Big[\sum_{n\in\mathbb{N}} 1_C(X_n)\,1_\Gamma(N-X_n)\Big] = \lambda\,\ell^m(C)\,P_N^0(N\in\Gamma) . \tag{17.7} \]
Remark 17.1.13 A set $\Gamma\in\mathcal{M}(\mathbb{R}^m)$ represents a property that a measure $\mu\in M(\mathbb{R}^m)$ may or may not possess, and $N-X_n\in\Gamma$ means that the point process seen by an observer placed at the point $X_n$ (that is, precisely, $N-X_n$) possesses this property. For instance, with $\Gamma = \{\mu \,;\ \mu(B(0,a)\backslash\{0\}) = 0\}$, where $B(x,a)$ is the open ball of radius $a\ge 0$ centered at $x$, $\{N-X_n\in\Gamma\} = \{(N-X_n)(B(0,a)\backslash\{0\}) = 0\}$, that is, $\{N(B(X_n,a)\backslash\{X_n\}) = 0\}$. The sum $\sum_{n\in\mathbb{N}} 1_C(X_n)\,1_\Gamma(N-X_n)$ counts the points of $N$ lying in $C$ and whose nearest neighbor is at a distance $\ge a$. For a general $\Gamma\in\mathcal{M}(\mathbb{R}^m)$, the point process
\[ N_\Gamma(C) := \sum_{n\in\mathbb{N}} 1_C(X_n)\,1_\Gamma(N-X_n) \quad (C\in\mathcal{B}(\mathbb{R}^m)) \tag{17.8} \]
has, in view of (17.7), intensity $\lambda_\Gamma = \lambda P_N^0(N\in\Gamma)$.

Theorem 17.1.14 Under the Palm probability, there is a point at 0 (the origin of $\mathbb{R}^m$), that is,
\[ P_N^0(N(\{0\}) = 1) = 1 . \]

Proof. With $g(x) = 1_C(x)$ and $\Gamma = \{\mu \,;\ \mu(\{0\}) = 1\}$, Equality (17.7) becomes, since $N-X_n$ always has exactly one point at the origin,
\[ E\Big[\sum_{n\in\mathbb{N}} 1_C(X_n)\Big] = \lambda\,\ell^m(C)\,P_N^0(N(\{0\}) = 1) . \]
Noting that the left-hand side is $\lambda\,\ell^m(C)$, the result is proved.
Example 17.1.15: Superposition of independent stationary point processes. Recall that $S_x$ is, for any $x\in\mathbb{R}^m$, the translation by $x$ applied to measures $\mu\in M(\mathbb{R}^m)$:
\[ S_x(\mu)(C) = \mu(C+x) . \]
Let $\mathcal{P}$ be a probability measure on $(M(\mathbb{R}^m),\mathcal{M}(\mathbb{R}^m))$ such that $\mathcal{P}\circ S_x = \mathcal{P}$ for all $x\in\mathbb{R}^m$. Taking $N$ equal to $\Phi$, the identity map of $M(\mathbb{R}^m)$, we obtain a stationary random measure $(\Phi,S_x,\mathcal{P})$, which is said to be in canonical form.

Let $(M_i,\mathcal{M}_i,S_x^{(i)},\Phi_i)$ ($1\le i\le k$) be replicas of $(M(\mathbb{R}^m),\mathcal{M}(\mathbb{R}^m),S_x,\Phi)$ and let $\mathcal{P}_i$ be a probability on $(M_i,\mathcal{M}_i)$ which is $S_x^{(i)}$-invariant for all $x\in\mathbb{R}^m$. Suppose that for all $i$ ($1\le i\le k$), $\Phi_i$ is $\mathcal{P}_i$-almost surely a simple point process with finite and positive intensity $\lambda_i$. Define the product space
\[ (\Omega,\mathcal{F},P) = \Big(\prod_{i=1}^k M_i,\ \otimes_{i=1}^k\mathcal{M}_i,\ \otimes_{i=1}^k\mathcal{P}_i\Big) \]
and, for each $x\in\mathbb{R}^m$, define $\theta_x := \otimes_{i=1}^k S_x^{(i)}$, with the meaning that $\theta_x(\omega) = (S_x^{(i)}\mu_i \,;\ 1\le i\le k)$, where $\omega = (\mu_i \,;\ 1\le i\le k)$. Define $N_i(\omega) := \mu_i$ and
\[ N(\omega) := \sum_{i=1}^k \mu_i . \]
Then $(N,\theta_x,P)$ is a stationary point process, the superposition of the stationary point processes $(N_i,\theta_x,P)$ ($1\le i\le k$). Denote by $\mathcal{P}_i^0$ the Palm probability associated to $(\Phi_i,\mathcal{P}_i)$. It will be proved below that
\[ P_N^0 = \sum_{i=1}^k \frac{\lambda_i}{\lambda}\Big(\otimes_{j=1}^{i-1}\mathcal{P}_j\Big)\otimes\mathcal{P}_i^0\otimes\Big(\otimes_{j=i+1}^{k}\mathcal{P}_j\Big) , \tag{17.9} \]
where $\lambda = \sum_{i=1}^k\lambda_i$.

Remark 17.1.16 The interpretation of (17.9) is the following. With probability $\frac{\lambda_i}{\lambda}$ the point at the origin in the Palm version comes from the $i$th point process and the probability distribution of the $i$th process is then its Palm probability, whereas the other processes keep their stationary distributions. All the $k$ point processes remain independent.

Proof of (17.9): By definition, for $A = \prod_{i=1}^k A_i$, where $A_i\in\mathcal{M}_i$,
\[ P_N^0(A) = \frac{1}{\lambda}E\Big[\int_{(0,1]^m} (1_A\circ\theta_x)\,N(dx)\Big] = \frac{1}{\lambda}\int_{M_1}\cdots\int_{M_k}\int_{(0,1]^m}\prod_{i=1}^k\big(1_{A_i}\circ S_x^{(i)}\big)\sum_{j=1}^k\Phi_j(dx)\,\mathcal{P}_1(d\mu_1)\cdots\mathcal{P}_k(d\mu_k) \]
\[ = \sum_{j=1}^k\frac{1}{\lambda}\int_{M_1}\cdots\int_{M_k}\Big\{\int_{(0,1]^m}\prod_{i=1}^k\big(1_{A_i}\circ S_x^{(i)}\big)\,\Phi_j(dx)\Big\}\,\mathcal{P}_1(d\mu_1)\cdots\mathcal{P}_k(d\mu_k) . \]
But (Fubini and the definition of the Palm probability $\mathcal{P}_j^0$)
\[ \frac{1}{\lambda_j}\int_{M_1}\cdots\int_{M_k}\Big\{\int_{(0,1]^m}\prod_{i=1}^k\big(1_{A_i}\circ S_x^{(i)}\big)\,\Phi_j(dx)\Big\}\,\mathcal{P}_1(d\mu_1)\cdots\mathcal{P}_k(d\mu_k) = \mathcal{P}_j^0(A_j)\prod_{i=1,\,i\ne j}^k\mathcal{P}_i(A_i) , \]
where we have taken into account the $S_x^{(i)}$-invariance of $\mathcal{P}_i$. Therefore
\[ P_N^0\Big(\prod_{i=1}^k A_i\Big) = \sum_{j=1}^k\frac{\lambda_j}{\lambda}\,\mathcal{P}_j^0(A_j)\prod_{1\le i\le k,\ i\ne j}\mathcal{P}_i(A_i) , \]
which implies (17.9), by Theorem 2.1.42.
Thinning and Conditioning

Let $(N,\theta_x,P)$ be a simple stationary point process on $\mathbb{R}^m$ with finite positive intensity. For $U\in\mathcal{F}$, define
\[ N_U(\omega,C) = \int_C 1_U(\theta_x(\omega))\,N(\omega,dx) \quad (C\in\mathcal{B}(\mathbb{R}^m)) . \tag{17.10} \]
Such a point process is therefore obtained by thinning of $N$, a point $x\in N(\omega)$ being retained if and only if $\theta_x(\omega)\in U$.

Example 17.1.17: Mark selection. Let $(N,Z,\theta_x,P)$ be a stationary marked point process. Take $U = \{Z_0\in L\}$ for some $L\in\mathcal{K}$. Then, since $Z_0(\theta_{X_n}(\omega)) = Z_n(\omega)$,
\[ N_U(C) = \sum_{n\in\mathbb{N}} 1_L(Z_n)\,1_C(X_n) . \]
The point process $N_U$ is obtained by thinning $N$, only retaining the points $X_n$ with a mark $Z_n$ falling in $L$.

$(N_U,\theta_x,P)$ is obviously a stationary point process and it has a finite intensity (since $N_U\le N$). If the intensity $\lambda_U$ of $N_U$ is positive, its Palm probability is given by
\[ P_{N_U}^0(A) = \frac{1}{\lambda_U}E\Big[\int_{(0,1]^m} (1_A\circ\theta_x)\,N_U(dx)\Big] , \]
where
\[ \lambda_U = E\Big[\int_{(0,1]^m} (1_U\circ\theta_x)\,N(dx)\Big] = \lambda P_N^0(U) . \]
In addition,
\[ E\Big[\int_{(0,1]^m} (1_A\circ\theta_x)\,N_U(dx)\Big] = E\Big[\int_{(0,1]^m} (1_A\circ\theta_x)(1_U\circ\theta_x)\,N(dx)\Big] = \lambda P_N^0(A\cap U) . \]
Therefore
\[ P_{N_U}^0(A) = \frac{P_N^0(A\cap U)}{P_N^0(U)} = P_N^0(A\mid U) . \]
Note that the sequence of marks could take its values in $M(\mathbb{R}^m)$, for instance, $Z_n = N-X_n$. Recall that $N-X_n = S_{X_n}(N)$. Taking $U := \{\omega \,;\ N(\omega)\in\Gamma\}$ where $\Gamma\in\mathcal{M}(\mathbb{R}^m)$, we see that in this case $N_U\equiv N_\Gamma$, where the latter is defined by (17.8).

Example 17.1.18: Superposition of point processes. This generalizes Example 17.1.15. Let $N_i$ ($1\le i\le k$) be point processes on $\mathbb{R}^m$, all compatible with the flow $\{\theta_x\}_{x\in\mathbb{R}^m}$, with positive finite intensities $\lambda_i$ ($1\le i\le k$) respectively, but not necessarily independent. Call $N$ their superposition. $N$ is assumed simple. From Bayes' rule:
\[ P_N^0(A) = \sum_{i=1}^k P_N^0(N_i(\{0\}) = 1)\,P_N^0(A\mid N_i(\{0\}) = 1) . \]
But
\[ P_N^0(N_i(\{0\}) = 1) = \frac{1}{\lambda}E\Big[\sum_{n\in\mathbb{N}} 1_{(0,1]^m}(X_n)\,1_{\{N_i(\{X_n\})=1\}}\Big] = \frac{1}{\lambda}E\big[N_i((0,1]^m)\big] = \frac{\lambda_i}{\lambda} . \]
Let $U = \{N_i(\{0\}) = 1\}$. Since we have $N_U = N_i$ (with the notation of (17.10)), we obtain
\[ P_N^0(A\mid N_i(\{0\}) = 1) = P_{N_i}^0(A) . \]
Therefore
\[ P_N^0(A) = \sum_{i=1}^k\frac{\lambda_i}{\lambda}\,P_{N_i}^0(A) . \]
17.2
Basic Properties and Formulas
Attention will now be restricted to simple stationary point processes on the real line. The notation t instead of x will emphasize the fact that one is working on the real line.
17.2.1
Eventtime Stationarity
The Palm probability PN0 associated with the simple stationary point process (N, θt , P ) has, as we saw for the general case in Theorem 17.1.14, its mass concentrated on Ω0 := {T0 = 0}. Recall that the sequence of points {Tn }n∈Z is deﬁned in such a way that it is strictly increasing and such that T0 ≤ 0 < T1 . T0
T1
T2
t
T−1 ◦θt
T0 ◦θt
0
T3
N
0
T−2 ◦θt
T1 ◦θt
N ◦θt = St N
The mapping θ := θT1 , deﬁned from Ω0 into Ω0 , is a bijection, with inverse θ−1 = θT−1 . Also, on Ω0 , θTn := θn for all n ∈ Z 2 . Note that the above is not true on Ω (for instance, the inverse of θT1 is not θT−1 ; Exercise 17.5.9). For mappings of the form θU with U random, the composition rule (c) of Deﬁnition 17.1.2 is no longer valid in that we do not have in general θU ◦ θV = θU +V when U and V are random variables. The eﬀect of θU ◦ θV on a point process N with sequence of points {Tn }n∈Z is best understood as follows. One ﬁrst applies the shift θV , obtaining a point process N ′ = θV N whose points are of the form Tn − V . However the sequence of these points has to be reindexed to obtain the ordered sequence {Tn′ }n∈Z such that T0′ ≤ 0 < T1′ . (For instance, with U = T2 , T0′ = 0 and T1′ = T3 .) Once this is done, one can reiterate the operation with θU to obtain a point process N ′′ whose sequence of points is {Tn′′ }n∈Z . But beware because this has now shifted N ′ by −U (θV (ω) ). For instance, if U = T2 , V = T3 , you have to apply to N ′ the shift of −T3′ . Indeed you must remember that θT3 means “the shift that moves to 0 the third point strictly to the right of 0”. The following result is referred to as the “eventtime stationarity”. Theorem 17.2.1 PN0 is θinvariant. Proof. First observe that for all A ∈ F, 1θ−1 (A) ◦ θTn = 1(θTn ∈ θ−1 (A)) = 1(θTn+1 ∈ A) . Formula (17.4) with A ∈ F and C = (0, t] yields 2 It is perhaps worthwhile to emphasize the fact that θTn is “the shift that moves the nth point of a point process to the origin”.
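$$\left| P_N^0(A) - P_N^0(\theta^{-1}(A)) \right| \le \frac{1}{\lambda t}\, E\left| \sum_{n\in\mathbb{Z}} \left(1_A \circ \theta_{T_n} - 1_{\theta^{-1}(A)} \circ \theta_{T_n}\right) 1_{(0,t]}(T_n) \right| = \frac{1}{\lambda t}\, E\left| \sum_{n\in\mathbb{Z}} \left(1_A \circ \theta_{T_n} - 1_A \circ \theta_{T_{n+1}}\right) 1_{(0,t]}(T_n) \right| \le \frac{2}{\lambda t}.$$
Letting $t \to \infty$, we obtain $P_N^0(A) = P_N^0(\theta^{-1}(A))$.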
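The reindexing behind compositions of random shifts can be checked on a small example. The following Python sketch is only an illustration under stated assumptions: a point process is replaced by a finite sorted list of integer points, and the helper names `T`, `shift` and `theta_T` are hypothetical, not notation from the text. On one configuration of $\Omega_0$ it checks that $\theta_{T_1}\circ\theta_{T_1} = \theta_{T_2}$, that $\theta_{T_{-1}}\circ\theta_{T_1}$ is the identity, and that $\theta_{T_3}\circ\theta_{T_2}$ shifts by $T_5$ rather than by $T_2 + T_3$. It also checks that the inverse property fails for a configuration without a point at 0.

```python
from bisect import bisect_right

def T(points, n):
    """n-th point of a sorted configuration, indexed so that T_0 <= 0 < T_1."""
    i0 = bisect_right(points, 0) - 1      # index of the last point <= 0, i.e. T_0
    if i0 < 0 or not 0 <= i0 + n < len(points):
        raise IndexError("the finite window does not contain T_n")
    return points[i0 + n]

def shift(points, t):
    """Configuration seen from time t: every point p becomes p - t."""
    return [p - t for p in points]

def theta_T(points, n):
    """theta_{T_n}: the shift that moves the n-th point to the origin."""
    return shift(points, T(points, n))

# A configuration in Omega_0 (it has a point at 0); integers avoid rounding issues.
pts = [-7, -4, 0, 2, 3, 8, 10, 15]
# A configuration that is NOT in Omega_0 (no point at 0).
off = [-7, -4, -1, 2, 3, 8]

if __name__ == "__main__":
    # On Omega_0, theta_{T_1} applied twice is theta_{T_2}.
    print(theta_T(theta_T(pts, 1), 1) == theta_T(pts, 2))
    # On Omega_0, theta_{T_{-1}} undoes theta_{T_1}.
    print(theta_T(theta_T(pts, 1), -1) == pts)
    # theta_{T_3} o theta_{T_2} shifts by T_5, not by T_2 + T_3.
    print(theta_T(theta_T(pts, 2), 3) == shift(pts, T(pts, 5)))
    print(theta_T(theta_T(pts, 2), 3) == shift(pts, T(pts, 2) + T(pts, 3)))
    # Off Omega_0, theta_{T_{-1}} is not the inverse of theta_{T_1}.
    print(theta_T(theta_T(off, 1), -1) == off)
```

The second application of `theta_T` evaluates `T` on the already shifted list, which is exactly the reindexing step: the index of a point is always counted from the current origin.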
As a consequence of Theorem 17.2.1, if $\{Z(t)\}_{t\in\mathbb{R}}$ is compatible with the flow $\{\theta_t\}_{t\in\mathbb{R}}$ and therefore stationary under $P$, the sequence $\{Z(T_n)\}_{n\in\mathbb{Z}}$ is, under $P_N^0$, a stationary sequence.

Example 17.2.2: Palm–Khinchin equations. (³) For $k \in \mathbb{N}$ and $t \ge 0$, let $\varphi_k(t) := P_N^0(N((0, t]) = k)$. We have the Palm–Khinchin equations:
$$P(N((0, t]) > k) = \lambda \int_0^t \varphi_k(s)\, ds.$$
To prove this, observe that
$$1_{\{N((0,t]) > k\}} = \int_{(0,t]} 1_{\{N(s,t] = k\}}\, N(ds),$$
and deduce from this and $N(s, t] = N(0, t-s] \circ \theta_s$ that
$$1_{\{N((0,t]) > k\}} = \int_{\mathbb{R}} 1_{(0,t]}(s) \left(1_{\{N(0,t-s] = k\}} \circ \theta_s\right) N(ds).$$
By the Campbell–Mecke formula, the expectation of the right-hand side with respect to $P$ is equal to
$$\lambda \int_{\mathbb{R}} 1_{(0,t]}(s)\, P_N^0(N(0, t-s] = k)\, ds = \lambda \int_0^t P_N^0(N(0, t-s] = k)\, ds = \lambda \int_0^t P_N^0(N(0, s] = k)\, ds.$$
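As a sanity check of the Palm–Khinchin equations, consider a homogeneous Poisson process of intensity $\lambda$. The computation below relies on an assumption not derived in this section, namely the standard fact (Slivnyak's theorem) that under $P_N^0$ the points other than $T_0 = 0$ still form a Poisson process of intensity $\lambda$, so that $\varphi_k(s) = e^{-\lambda s} (\lambda s)^k / k!$. Then
$$\lambda \int_0^t e^{-\lambda s} \frac{(\lambda s)^k}{k!}\, ds = 1 - \sum_{j=0}^{k} e^{-\lambda t} \frac{(\lambda t)^j}{j!} = P(N((0, t]) > k),$$
where the first equality is the distribution function of the Erlang distribution with parameters $k+1$ and $\lambda$, and the second is the Poisson distribution of $N((0, t])$ under $P$.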
Theorem 17.2.3 Let $(N, \theta_t, P)$ be a stationary point process on $\mathbb{R}$ with intensity $0 < \lambda < \infty$ and such that $P(N(\mathbb{R}) = 0) = 0$. Then
$$\lambda E_N^0[T_1] = 1.$$

Proof. From the Palm–Khinchin equation with $k = 0$,
$$P(N((0, t]) = 0) = 1 - \lambda \int_0^t \varphi_0(s)\, ds.$$
But $\varphi_0(s) = P_N^0(N(0, s] = 0) = P_N^0(T_1 > s)$, and therefore
$$0 = P(N(\mathbb{R}) = 0) = \lim_{t \uparrow +\infty} P(N((0, t]) = 0) = 1 - \lambda \int_0^\infty P_N^0(T_1 > s)\, ds = 1 - \lambda E_N^0[T_1].$$
³ [Palm, 1943] for $k = 0$, [Khinchin, 1960] for $k \ge 1$.
17.2.2 Inversion Formulas
How do we pass from the Palm probability to the stationary distribution? The formulas that do this are called inversion formulas.

Theorem 17.2.4 Let $(N, P, \theta_t)$ be a stationary simple point process on $\mathbb{R}$ with intensity $0 < \lambda < \infty$ and such that $P(N(\mathbb{R}) = 0) = 0$. For any nonnegative random variable $f$,
$$E[f] = \lambda E_N^0\left[\int_0^{T_1} (f \circ \theta_s)\, ds\right].$$
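As a numerical illustration of Theorem 17.2.4, take $f = T_1$, the time to the next point. On $\Omega_0$ and for $0 \le s < T_1$ one has $T_1 \circ \theta_s = T_1 - s$, so the inversion formula reads $E[T_1] = \lambda E_N^0[T_1^2]/2$. The Python sketch below is not a proof and rests on two assumptions: the process is a renewal process with i.i.d. gaps uniform on $(0, 2)$ (a hypothetical choice), and it is ergodic, so that $P$-expectations can be estimated by observing one long path at uniformly chosen times and $P_N^0$-expectations by averaging over points. The function name `inversion_check` and its parameters are illustrative only.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def inversion_check(n_points=20_000, n_probes=20_000, seed=0):
    """Estimate both sides of E[T_1] = lambda * E_N^0[T_1^2] / 2 on one long path."""
    rng = random.Random(seed)
    # Palm view: a point at time 0 followed by i.i.d. gaps (assumed uniform on (0, 2)).
    gaps = [rng.uniform(0.0, 2.0) for _ in range(n_points)]
    times = list(accumulate(gaps))      # T_1 < T_2 < ... < T_n
    horizon = times[-1]
    lam = n_points / horizon            # empirical intensity

    # Right-hand side: lambda * E_N^0[int_0^{T_1} (T_1 - s) ds] = lambda * E_N^0[T_1^2] / 2,
    # with the Palm expectation replaced by an average over the observed gaps.
    palm_side = lam * sum(g * g for g in gaps) / (2 * n_points)

    # Left-hand side: E[T_1], estimated by the waiting time until the next point
    # as seen from uniformly chosen observation times u in [0, horizon].
    total_wait = 0.0
    for _ in range(n_probes):
        u = rng.uniform(0.0, horizon)
        i = min(bisect_right(times, u), n_points - 1)  # first point after u (clamped at the end)
        total_wait += times[i] - u
    stationary_side = total_wait / n_probes
    return stationary_side, palm_side

if __name__ == "__main__":
    print(inversion_check())
```

With these gaps, $\lambda = 1$ and $E_N^0[T_1^2] = 4/3$, so both estimates should be close to $2/3$, which is smaller than the Palm mean $E_N^0[T_1] = 1$.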
One proof of Theorem 17.2.4, among many others, makes use of the following conservation principle. The intuition behind it is that in a stationary state, "the smooth variation of a stochastic process is balanced by the variation due to jumps". More precisely:

Theorem 17.2.5 (⁴) Let $(N, P, \theta_t)$ be a stationary simple point process on $\mathbb{R}$ with intensity $0 < \lambda < \infty$. Let $\{Y(t)\}_{t\in\mathbb{R}}$ be a real-valued stochastic process, right-continuous with left-hand limits, and let $\{Y'(t)\}_{t\in\mathbb{R}}$ be a real-valued stochastic process such that
$$Y(1) - Y(0) = \int_0^1 Y'(s)\, ds + \int_{(0,1]} (Y(s) - Y(s-))\, N(ds). \tag{17.11}$$
Suppose that the processes $Y$ and $Y'$ are compatible with the flow. Suppose moreover that
$$E[|Y'(0)|] < \infty \quad \text{and} \quad E_N^0[|Y(0) - Y(0-)|] < \infty. \tag{17.12}$$
Then
$$E[Y'(0)] + \lambda E_N^0[Y(0) - Y(0-)] = 0. \tag{17.13}$$
Proof. Observe that $E[\int_0^1 Y'(s)\, ds] = E[Y'(0)]$ and $E[\int_{(0,1]} (Y(s) - Y(s-))\, N(ds)] = \lambda E_N^0[Y(0) - Y(0-)]$. Therefore, condition (17.12) guarantees that $Y(1) - Y(0)$ is $P$-integrable and, by Lemma 16.1.5, $E[Y(1) - Y(0)] = 0$. Equating this to the expectation of the right-hand side of (17.11), we obtain the announced result since $E[\int_0^1 Y'(s)\, ds] = E[Y'(0)]$ and $E[\int_{(0,1]} (Y(s) - Y(s-))\, N(ds)] = \lambda E_N^0[Y(0) - Y$