The Theory of Probability: Explorations and Applications (Instructor Solution Manual, Selected Solutions)


Problems for Solution from The Theory of Probability
and Selected Solutions

Santosh S. Venkatesh
University of Pennsylvania

Contents

Preface

A  ELEMENTS

I      Probability Spaces
II     Conditional Probability
III    A First Look at Independence
IV     Probability Sieves
V      Numbers Play a Game of Chance
VI     The Normal Law
VII    Probabilities on the Real Line
VIII   The Bernoulli Schema
IX     The Essence of Randomness
X      The Coda of the Normal

B  FOUNDATIONS

XI     Distribution Functions and Measure
XII    Random Variables
XIII   Great Expectations
XIV    Variations on a Theme of Integration
XV     Laplace Transforms
XVI    The Law of Large Numbers
XVII   From Inequalities to Concentration
XVIII  Poisson Approximation
XIX    Convergence in Law, Selection Theorems
XX     Normal Approximation

Index

Preface

Several generations of students and teaching assistants worked out problems and contributed to the selected solutions assembled here. I should especially acknowledge Wei Bi, Shao Chieh Fang, Gaurav Kasbekar, Jonathan Nukpezah, Alireza Tahbaz Salehi, Shahin Shahrampour, Evangelos Vergetis, and Zhengwei Wu. I must confess immediately that not all the solutions have seen the same level of scrutiny; and, while I have expended some effort to bring some uniformity of style and presentation to the material, it may occasionally lack some polish. It struck me, however, that a reader desirous of an early view of the results may be willing to forgive the deficiencies in the current crude assemblage and, accordingly, that it may be preferable to provide a rough and ready collection of solutions now, however unpolished, unvetted, and incomplete, and not wait till I had had the time to proof-read them carefully and present them in a uniform format and style while filling in the gaps.

References to ToP are, of course, to The Theory of Probability; references to Equations, Examples, Theorems, Sections, Chapters, and so on point to the main text; the referencing conventions are as in ToP. As a navigational aid, page headings are labelled by chapter and problem; the detailed index that I have provided may also be of use.

There are inevitably errors in the text lying in wait to be discovered. I can only hope that these are of the obvious kind that do not cause great consternation and apologise for these in advance. I would certainly very much appreciate receiving word of errors and ambiguities both here and in the main text of ToP.

January 1, 2015: The task of compiling solutions proceeds slowly. The current version of the palimpsest provides solutions for more than 75% of the problems: of the 560 problems assigned for solution in ToP, 428 now have solutions spelled out in detail. In many instances I have provided two or three different approaches to solution to illustrate different perspectives; and the solutions frequently flesh out themes and generalisations suggested by the patterns of attack. Table 1 below identifies the current list of problems whose solutions appear in this manuscript.

         Chapter titles                              Solutions provided                          Not ready for prime time

A   I      Probability Spaces                        1–30                                        31–34
    II     Conditional Probabilities                 1–19, 22–31                                 20, 21
    III    A First Look at Independence              1–20                                        21–23
    IV     Probability Sieves                        1–26, 29                                    27, 28
    V      Numbers Play a Game of Chance             1–18                                        19–21
    VI     The Normal Law                            1–11                                        12, 13
    VII    Probabilities on the Real Line            1–22
    VIII   The Bernoulli Schema                      1–26, 28–38                                 27
    IX     The Essence of Randomness                 1–36
    X      The Coda of the Normal                    1–12, 14–22, 24–33                          13, 23
B   XI     Distribution Functions and Measure        1–5, 7–11                                   6, 12, 13
    XII    Random Variables                          1–24                                        25, 26
    XIII   Great Expectations                        1–5, 7–23                                   6, 24–26
    XIV    Variations on a Theme of Integration      1–11, 13–22, 24–27, 30–36, 39–51, 54–56     12, 23, 28, 29, 37, 38, 52, 53
    XV     Laplace Transforms                        1–11, 17, 20–31                             12–16, 18, 19
    XVI    The Law of Large Numbers                  1–4, 11, 14–20                              5–10, 12, 13, 21–31
    XVII   From Inequalities to Concentration        1                                           2–35
    XVIII  Poisson Approximation                     4, 5, 20–23                                 1–3, 6–19
    XIX    Convergence in Law, Selection Theorems    1–4, 6, 11, 17, 19, 20                      5, 7–10, 12–16, 18
    XX     Normal Approximation                      1–3, 6–8, 10–14                             4, 5, 9, 15–19
C   XXI    Sequences, Functions, Spaces

Table 1: List of compiled solutions.

Part A

ELEMENTS

I  Probability Spaces

Notation for generalised binomial coefficients will turn out to be useful going forward and I will introduce them here for ease of reference. As a matter of convention, for real t and integer k, we define
$$\binom{t}{k} = \begin{cases} \dfrac{t(t-1)(t-2)\cdots(t-k+1)}{k!} & \text{if } k \ge 0, \\[4pt] 0 & \text{if } k < 0. \end{cases}$$
Problems 1–5 deal with these generalised binomial coefficients.

1. Pascal's triangle. Prove that $\binom{t}{k-1} + \binom{t}{k} = \binom{t+1}{k}$.

SOLUTION: The identity is a generalisation of Pascal's triangle which, for integer t, may be verified by an elementary combinatorial argument. Simple algebraic factorisation is all that is needed to verify the general case of the identity for real-valued t. The identity is obvious for k < 0 and, for k ≥ 0, we have
$$\binom{t}{k-1} + \binom{t}{k} = \frac{t(t-1)\cdots(t-k+2)}{(k-1)!} + \frac{t(t-1)\cdots(t-k+1)}{k!} = \frac{t(t-1)\cdots(t-k+2)}{k!}\bigl(k + (t-k+1)\bigr) = \binom{t+1}{k}.$$

2. If t > 0, show that $\binom{-t}{k} = (-1)^k\binom{t+k-1}{k}$ and hence that $\binom{-1}{k} = (-1)^k$ and $\binom{-2}{k} = (-1)^k(k+1)$ if k ≥ 0.

SOLUTION: By factoring out the terms −1 from the numerator, we have
$$\binom{-t}{k} = \frac{(-t)(-t-1)(-t-2)\cdots(-t-k+1)}{k!} = (-1)^k\,\frac{t(t+1)(t+2)\cdots(t+k-1)}{k!} = (-1)^k\binom{t+k-1}{k}.$$
If k ≥ 0, then, for t = 1, this reduces to the identity $\binom{-1}{k} = (-1)^k$ while, for t = 2, we obtain the useful result that $\binom{-2}{k} = (-1)^k(k+1)$.

3. Show $\binom{1/2}{k} = (-1)^{k-1}\frac{1}{k}\binom{2k-2}{k-1}2^{-2k+1}$ and $\binom{-1/2}{k} = (-1)^{k}\binom{2k}{k}2^{-2k}$.

SOLUTION: By factoring out 1/2 from each of the terms in the product in the numerator, we have
$$\binom{1/2}{k} = \frac{\frac12\bigl(\frac12-1\bigr)\bigl(\frac12-2\bigr)\cdots\bigl(\frac12-k+1\bigr)}{k!} = \Bigl(\frac12\Bigr)^{k}\frac{1(1-2)\cdots\bigl(1-2(k-1)\bigr)}{k(k-1)!} = (-1)^{k-1}\Bigl(\frac12\Bigr)^{k}\frac{1\cdot3\cdot5\cdot7\cdots(2k-3)}{k(k-1)!}$$
$$= (-1)^{k-1}\Bigl(\frac12\Bigr)^{k}\frac{1\cdot3\cdot5\cdot7\cdots(2k-3)}{k(k-1)!}\cdot\frac{2\cdot4\cdots(2k-2)}{2\cdot4\cdots(2k-2)} = (-1)^{k-1}\Bigl(\frac12\Bigr)^{k}\frac{(2k-2)!}{k\,2^{k-1}(k-1)!\,(k-1)!} = (-1)^{k-1}\frac1k\binom{2k-2}{k-1}2^{-2k+1}.$$
Similarly,
$$\binom{-1/2}{k} = \frac{-\frac12\bigl(-\frac12-1\bigr)\bigl(-\frac12-2\bigr)\cdots\bigl(-\frac12-k+1\bigr)}{k!} = \Bigl(\frac{-1}{2}\Bigr)^{k}\frac{1\cdot3\cdot5\cdots(2k-1)}{k!} = \Bigl(\frac{-1}{2}\Bigr)^{k}\frac{1\cdot3\cdot5\cdots(2k-1)}{k!}\cdot\frac{2\cdot4\cdots(2k)}{2\cdot4\cdots(2k)} = \Bigl(\frac{-1}{2}\Bigr)^{k}\frac{(2k)!}{2^k\,k!\,k!} = (-1)^{k}\binom{2k}{k}2^{-2k}.$$

4. Prove Newton's binomial theorem
$$(1+x)^t = 1 + \binom{t}{1}x + \binom{t}{2}x^2 + \binom{t}{3}x^3 + \cdots = \sum_{k=0}^{\infty}\binom{t}{k}x^k,$$
the series converging for all real t whenever |x| < 1. Thence, if t = n is any positive integer obtain the usual binomial formula $(a+b)^n = \sum_{k=0}^{n}\binom{n}{k}a^{n-k}b^k$.

SOLUTION: When t = n is a positive integer we have the classical binomial expansion $(1+x)^n = \binom{n}{0}x^0 + \binom{n}{1}x^1 + \cdots + \binom{n}{n}x^n$ known from antiquity. The generalisation of this venerable result to real powers t is due to Isaac Newton. This formula was the harbinger of the calculus. Taylor's theorem with remainder says that if f is n + 1 times continuously differentiable then
$$f(x) = \sum_{k=0}^{n} f^{(k)}(0)\,\frac{x^k}{k!} + R_n(x)$$
where the remainder is given by the expression
$$R_n(x) = \frac{1}{n!}\int_0^x t^n f^{(n+1)}(x-t)\,dt.$$
The derivatives are easy to compute for the given function $f(x) = (1+x)^t$ and are compactly expressed in terms of the falling factorial notation:
$$f^{(k)}(x) = t(t-1)(t-2)\cdots(t-k+1)(1+x)^{t-k} = t^{\underline{k}}(1+x)^{t-k} \qquad (k \ge 0).$$
It follows that
$$f(x) = \sum_{k=0}^{n}\frac{t^{\underline{k}}}{k!}x^k + R_n(x)$$
and we now only need to show that $R_n(x) \to 0$ for each x. But this is easy to see as the derivatives of f grow only exponentially fast with order. Indeed, if |x| < 1, then we may select a positive function M(x) so that
$$\bigl|f^{(n+1)}(t)\bigr| \le \max\{1, t^{\underline{n+1}}\}(1+|x|)^{t-n-1} \le M(x)^{n+1}$$
uniformly for all t in the closed and bounded interval |t| ≤ |x|. It follows that
$$|R_n(x)| \le \frac{M(x)^{n+1}}{n!}\int_0^x t^n\,dt = \frac{\bigl(M(x)x\bigr)^{n+1}}{(n+1)!},$$
and the bound on the right converges to zero as n tends to infinity because of the super-exponential growth of the factorial function. Allowing n → ∞ in Taylor's formula, we hence obtain Newton's binomial theorem
$$(1+x)^t = \sum_{k=0}^{\infty}\binom{t}{k}x^k \qquad (|x| < 1).$$
Specialising to the case when t = n is a positive integer, we see that $\binom{n}{k} = n^{\underline{k}}/k! = 0$ if k > n and the usual binomial formula results.
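The series is easy to sanity-check numerically. The short Python sketch below is my own illustration (not part of the text); the helper name gbinom is hypothetical. It evaluates the generalised binomial coefficient from the defining product and compares a truncated series against $(1+x)^t$ for a non-integer exponent.

    def gbinom(t, k):
        """Generalised binomial coefficient 't over k' for real t and integer k."""
        if k < 0:
            return 0.0
        out = 1.0
        for j in range(k):
            out *= (t - j) / (j + 1)
        return out

    t, x = 0.5, 0.3
    series = sum(gbinom(t, k) * x**k for k in range(60))
    print(series, (1 + x) ** t)   # both approximately 1.1401754...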

5. Continuation. With t = −1, the previous problem reduces to the geometric series
$$\frac{1}{1+x} = 1 - x + x^2 - x^3 + x^4 - \cdots$$
convergent for all |x| < 1. By integrating this expression termwise derive the Taylor expansion of the natural logarithm
$$\log(1+x) = x - \tfrac12 x^2 + \tfrac13 x^3 - \tfrac14 x^4 + \cdots$$
convergent for |x| < 1. (Unless explicitly noted otherwise, our logarithms are always to the natural or "Napier" base e.) Hence derive the alternative forms
$$\log(1-x) = -x - \tfrac12 x^2 - \tfrac13 x^3 - \tfrac14 x^4 - \cdots, \qquad \tfrac12\log\frac{1+x}{1-x} = x + \tfrac13 x^3 + \tfrac15 x^5 + \cdots.$$

SOLUTION: Setting t = −1 in Newton's binomial theorem we obtain the geometric series
$$\frac{1}{1+x} = 1 - x + x^2 - x^3 + x^4 - \cdots$$
which converges for |x| < 1. By a formal term by term integration of both sides from 0 to x, we obtain
$$\log(1+x) = x - \frac12 x^2 + \frac13 x^3 - \frac14 x^4 + \cdots, \tag{1}$$
the term-wise integration easily seen to be permissible as
$$\biggl|\int_0^x \bigl((-1)^n t^n + (-1)^{n+1} t^{n+1} + \cdots\bigr)\,dt\biggr| \le \int_0^x t^n\bigl(1 + t + t^2 + \cdots\bigr)\,dt = \int_0^x \frac{t^n}{1-t}\,dt \le \frac{1}{1-x}\int_0^x t^n\,dt = \frac{x^{n+1}}{(n+1)(1-x)} \to 0 \qquad (n \to \infty).$$
Setting x ← −x in (1), we obtain the companion series,
$$\log(1-x) = -x - \frac12 x^2 - \frac13 x^3 - \frac14 x^4 - \cdots, \tag{2}$$
and combining (1, 2) we see that
$$\frac12\log\frac{1+x}{1-x} = \frac12\bigl(\log(1+x) - \log(1-x)\bigr) = x + \frac13 x^3 + \frac15 x^5 + \cdots \qquad (|x| < 1),$$
as asserted.

Problems 6–18 deal with sample spaces in intuitive settings.

6. A fair coin is tossed repeatedly. What is the probability that on the nth toss: (a) a head appears for the first time? (b) heads and tails are balanced? (c) exactly two heads have occurred?

SOLUTION: (a) A head occurs for the first time on the nth toss if, and only if, the nth toss results in a head and the first n − 1 tosses result in tails. As all $2^n$ sequences of n heads and tails are equally likely for the given problem, the probability in question is $2^{-n}$.

(b) Heads and tails can only be balanced in n tosses of a coin if n is even. In this case, there are precisely $\binom{n}{n/2}$ ways of specifying the locations of the heads in the sequence and so, for even n, the probability that heads and tails are balanced is $\binom{n}{n/2}2^{-n}$. For odd n, the probability of balance is zero.

(c) There are $\binom{n}{2}$ sequences of n heads and tails containing precisely two heads. The probability of seeing precisely two heads is hence $\binom{n}{2}2^{-n}$.

7. Six cups and six saucers come in pairs, two pairs are red, two are white, and two are blue. If cups are randomly assigned to saucers find the probability that no cup is upon a saucer of the same colour.

SOLUTION: It is simplest to systematically account for all possibilities of arrangements. For purposes of enumeration we may consider the saucers to be labelled from 1 through 6, the cups likewise. Formally, each permutation $(\Pi_1, \ldots, \Pi_6)$ of (1, ..., 6) represents a sample point of the experiment and corresponds to the cup-saucer matchings $i \mapsto \Pi_i$. As $(\Pi_1, \ldots, \Pi_6)$ ranges over all permutations, we range over all 6! assignments of cups to saucers. Of these, those arrangements where all cups are on saucers of a different colour are of two types.

1. Both cups of a given colour are placed on matching saucers of a different colour. A little introspection shows that this means that each pair of cups of a given colour must be matched with a pair of saucers of a different colour. If, say, both red cups are placed on blue saucers then both blue cups must be placed on white saucers which means in turn that both white cups must be placed on blue saucers. The red cups may be placed on the blue saucers in 2! ways, the blue cups on the white saucers in 2! ways, and the white cups on the blue saucers in 2! ways, for a total of 2! × 2! × 2! = 8 possibilities with the assignments red → blue → white. Interchanging the rôles of blue and white, there are 8 more assignments of the form red → white → blue. There are hence a total of 16 arrangements with both cups of each colour placed on matching saucers of a different colour.

2. The cups of a given colour are placed on saucers of differing colours, neither of the original colour. We may systematically enumerate the possibilities as follows. The first red cup may be placed on one of the four saucers that are not red; once a saucer is selected, the second red cup has two possibilities in the saucers of the remaining colours; there are thus 8 ways in which we may deploy the red cups on one blue and one white saucer. Now both red cups are deployed, one on a blue saucer and one on a white saucer. The two blue cups must now be placed one on the remaining white saucer and one on a red saucer. The red saucer may be selected in two ways and the blue cup to be placed on it may be selected in two ways; there are thus 4 ways in which the blue cups may be placed on a white and a red saucer. Finally, there remain two white cups, one blue and one red saucer, and both deployments of cups on saucers give rise to valid arrangements. In total we have 8 × 4 × 2 = 64 arrangements of the cups on the saucers so that the cups of a given colour are placed on saucers of differing colours, neither of the original colour.

Combining the two cases, there are 16 + 64 = 80 arrangements of cups on saucers so that no cup is on a saucer of a matching colour. The associated probability is hence (16 + 64)/6! = 1/9, or about eleven percent.
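The count of 80 favourable arrangements is easy to confirm by exhausting all 6! assignments. The Python sketch below is my own illustrative check (not from the text); it labels cups and saucers 0 through 5 with colours R, R, W, W, B, B.

    from itertools import permutations
    from fractions import Fraction

    colours = ['R', 'R', 'W', 'W', 'B', 'B']   # colour of cup i and of saucer i

    # count assignments in which no cup sits on a saucer of its own colour
    good = sum(
        all(colours[cup] != colours[saucer] for saucer, cup in enumerate(perm))
        for perm in permutations(range(6))
    )
    print(good, Fraction(good, 720))   # expect 80 and 1/9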

8. Birthday paradox. In a group of n unrelated individuals, none born on a leap year, what is the probability that at least two share a birthday? Show that this probability exceeds one-half if n ≥ 23. [Make natural assumptions for the probability space.]

SOLUTION: We suppose that the year consists of 365 days (we disregard leap years), that individuals are equally likely to be born on any day (a careful analysis of data shows that this is not strictly true but it's certainly a reasonable starting point; our results are at least conservative), and that births are independent of one another (twins and triplets need not apply). The number of arrangements of n birthdays so that no two are the same is, in the falling factorial notation,
$$365^{\underline{n}} := 365\cdot(365-1)\cdot(365-2)\cdots(365-n+1).$$
As there are $365^n$ arrangements in total, the probability that no two birthdays coincide is given by $365^{\underline{n}}/365^n$ and, accordingly, the probability that two or more birthdays coincide is given by $1 - 365^{\underline{n}}/365^n$. For n = 23 this evaluates to approximately 0.5073 which many people find surprising. For n = 30, the probability is in excess of 70%.

An asymptotic estimate which is quite remarkably good may be arrived at for the probability. We begin by estimating
$$\frac{N^{\underline{n}}}{N^n} = \frac{N}{N}\cdot\frac{N-1}{N}\cdot\frac{N-2}{N}\cdots\frac{N-n+1}{N} = 1\cdot\Bigl(1-\frac1N\Bigr)\Bigl(1-\frac2N\Bigr)\cdots\Bigl(1-\frac{n-1}{N}\Bigr).$$
Taking logarithms of both sides, we see then that
$$\log\frac{N^{\underline{n}}}{N^n} = \sum_{k=1}^{n-1}\log\Bigl(1 - \frac{k}{N}\Bigr).$$
The Taylor series (2) for the logarithm shows that, for each fixed n and 1 ≤ k ≤ n − 1, we have $\log\bigl(1 - \frac{k}{N}\bigr) = -\frac{k}{N} - \xi_N$, where the order term $\xi_N \le k^2/N^2 < n^2/N^2$ for all sufficiently large N. It follows that
$$\log\frac{N^{\underline{n}}}{N^n} = -\sum_{k=1}^{n-1}\frac{k}{N} - \xi_N = -\frac{n(n-1)}{2N} - \xi_N$$
where $\xi_N < n^3/N^2$. We now have our desired estimate. Suppose $K_N$ is any positive sequence satisfying $K_N/N^{2/3} \to 0$ as N → ∞. If $n \le K_N$ then $N^{\underline{n}}/N^n \sim e^{-(n-1)n/2N}$ where the asymptotic equivalence is to be taken to mean that the ratio of the two sides tends to one as N → ∞. With N = 365 and n = 23, the asymptotic estimate for the probability that two or more birthdays coincide yields the approximation $1 - e^{-22\times23/(2\times365)} = 0.500\cdots$ which differs from the exact answer only in about 7 parts in 1000.

9. Lottery. A lottery specifies a random subset R of r out of the first n natural numbers by picking one at a time. Determine the probabilities of the following events: (a) there are no consecutive numbers, (b) there is exactly one pair of consecutive numbers, (c) the numbers are drawn in increasing order. Suppose that you have picked your own random set of r numbers. What is the probability that (d) your selection matches R?, (e) exactly k of your numbers match up with numbers in R?

SOLUTION: (a) $\binom{n-r+1}{r}\big/\binom{n}{r}$: one can show by a direct combinatorial argument that the number of r-sets with no consecutive numbers is given by $\binom{n-r+1}{r}$ but a demonstration by induction on n is even faster. The base of the induction is obvious as, if n < 2r − 1, then there are no r-sets without consecutive numbers, and, when n = 2r − 1, there is precisely one r-set consisting of the odd numbers {1, 3, ..., 2r − 1} which has no consecutive numbers: $1 = \binom{(2r-1)-r+1}{r}$. Now fix r ≥ 0 and consider the situation for a generic value of n ≥ 2r − 1. If we increase n to n + 1 then the r-sets with no consecutive numbers are of two types. Type (i) r-sets do not include n + 1: but then n + 1 is irrelevant and the situation reverts to that for n; by induction hypothesis, there are hence precisely $\binom{n-r+1}{r}$ type (i) r-sets. Type (ii) r-sets include n + 1: such r-sets cannot include n so that the remaining r − 1 numbers are selected from 1 through n − 1 with no consecutive numbers; by induction hypothesis again, there are hence precisely $\binom{(n-1)-(r-1)+1}{r-1}$ type (ii) r-sets. Thus, the number of r-sets selected from {1, ..., n, n + 1} so as to not include any consecutive numbers is given by
$$\binom{n-r+1}{r} + \binom{(n-1)-(r-1)+1}{r-1} = \binom{n-r+1}{r} + \binom{n-r+1}{r-1} = \binom{(n+1)-r+1}{r},$$
the final step by Pascal's triangle. This completes the induction. A brute-force check in code appears after this solution.

(b) $(r-1)\binom{n-r+1}{r-1}\big/\binom{n}{r}$: proof by induction on n and r. The base case for r = n = 2 is obvious. Now consider the effect of moving from n to n + 1. The r-sets of {1, ..., n − 1, n, n + 1} which contain precisely one pair of consecutive numbers are of three types. Type (i) r-sets do not include n + 1: but this situation reverts to that for n and so, by induction hypothesis, the number of type (i) r-sets is equal to $(r-1)\binom{n-(r-1)}{r-1}$. Type (ii) r-sets include n + 1 but not n: but then, we must select r − 1 integers from {1, ..., n − 1} in such a way that there is precisely one pair of consecutive numbers; by induction hypothesis, the number of type (ii) r-sets is equal to $(r-2)\binom{(n-1)-(r-2)}{r-2}$. Type (iii) r-sets include both n and n + 1: but then the remaining r − 2 numbers cannot include n − 1 and must be chosen from {1, ..., n − 2} in such a way as to obtain no pair of consecutive numbers; by part (a) there are hence exactly $\binom{(n-2)-(r-2)+1}{r-2}$ type (iii) r-sets. Accordingly, the total number of r-sets selected from {1, ..., n, n + 1} so as to contain precisely one pair of consecutive numbers is given by
$$(r-1)\binom{n-(r-1)}{r-1} + (r-2)\binom{(n-1)-(r-2)}{r-2} + \binom{(n-2)-(r-2)+1}{r-2} = (r-1)\binom{n-(r-1)}{r-1} + (r-1)\binom{(n-1)-(r-2)}{r-2}$$
$$= (r-1)\biggl[\binom{n-r+1}{r-1} + \binom{n-r+1}{r-2}\biggr] = (r-1)\binom{(n+1)-r+1}{r-1},$$
the final step again by Pascal's triangle. This concludes the induction.

(c) 1/r!: for any selection of r numbers, precisely one out of the r! equally likely arrangements is in increasing order.

(d) $1\big/\binom{n}{r}$: self-explanatory.

(e) $\binom{r}{k}\binom{n-r}{r-k}\big/\binom{n}{r}$: specify the k numbers that match and select the remaining r − k numbers from those excluded from R.
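For part (a) the closed form is easy to confirm by exhaustive enumeration over small n and r; the Python sketch below is mine and purely illustrative, comparing the brute-force count against $\binom{n-r+1}{r}$.

    from itertools import combinations
    from math import comb

    def count_no_consecutive(n, r):
        """Number of r-subsets of {1,...,n} containing no two consecutive integers."""
        return sum(
            all(b - a > 1 for a, b in zip(s, s[1:]))
            for s in combinations(range(1, n + 1), r)
        )

    for n in range(1, 13):
        for r in range(1, min(n, 5) + 1):
            assert count_no_consecutive(n, r) == comb(n - r + 1, r)
    print("part (a) formula agrees with brute force for n <= 12")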

10. Poker. Hands at poker (see Example 3.4) are classified in the following categories: a one pair is a hand containing two cards of the same rank and three cards of disparate ranks in three other ranks; a two pair contains a pair of one rank, another pair of another rank, and a card of rank other than that of the two pairs; a three-of-a-kind contains three cards of one rank and two cards with ranks differing from each other and the rank of the cards in the triple; a straight contains five cards of consecutive ranks, not all of the same suit; a flush has all five cards of the same suit, not in sequence; a full house contains three cards of one rank and two cards of another rank; a four-of-a-kind contains four cards of one rank and one other card; and a straight flush contains five cards in sequence in the same suit. Determine their probabilities.

SOLUTION: The sample space consists of the $\binom{52}{5} = 2{,}598{,}960$ selections of five cards from a standard pack, the probability measure naturally uniform. The probabilities of the various hands are obtained by systematic enumeration, the form of the solutions given below suggestive of the line of thought.

One pair: $\dfrac{13\cdot\binom{12}{3}\cdot 6\cdot 4^3}{\binom{52}{5}} = \dfrac{1760}{4165} = 0.422569$;

Two pair: $\dfrac{\binom{13}{2}\cdot 11\cdot 6\cdot 6\cdot 4}{\binom{52}{5}} = \dfrac{198}{4165} = 0.047539$;

Three-of-a-kind: $\dfrac{13\cdot\binom{12}{2}\cdot 4\cdot 4^2}{\binom{52}{5}} = \dfrac{88}{4165} = 0.021129$;

Straight: $\dfrac{9\cdot 4^5}{\binom{52}{5}} = \dfrac{192}{54145} = 0.003546$;

Full house: $\dfrac{13\cdot 12\cdot 4\cdot 6}{\binom{52}{5}} = \dfrac{6}{4165} = 0.001441$;

Four-of-a-kind: $\dfrac{13\cdot 12\cdot 4}{\binom{52}{5}} = \dfrac{1}{4165} = 0.000240$;

Straight flush: $\dfrac{9\cdot 4}{\binom{52}{5}} = \dfrac{3}{216580} = 0.000014$;

Royal flush: $\dfrac{4}{\binom{52}{5}} = \dfrac{1}{649740} = 1.53908\times 10^{-6}$.

11. Occupancy configurations, the Maxwell–Boltzmann distribution. An occupancy configuration of n balls placed in r urns is an arrangement $(k_1, \ldots, k_r)$ where urn i has $k_i$ balls in it. If the balls and urns are individually distinguishable, determine the number of distinguishable arrangements leading to a given occupancy configuration $(k_1, \ldots, k_r)$ where $k_i \ge 0$ for each i and $k_1 + \cdots + k_r = n$. If balls are distributed at random into the urns determine thence the probability that a particular occupancy configuration $(k_1, \ldots, k_r)$ is discovered. These numbers are called the Maxwell–Boltzmann statistics in statistical physics. While it was natural to posit that physical particles were distributed in this fashion it came as a nasty jar to physicists to discover that no known particles actually behaved in accordance with common sense and followed this law.¹

SOLUTION: The number of ways in which $k_1$ balls can be selected to be placed in the first urn is $\binom{n}{k_1}$. Of the remaining balls, the number of ways $k_2$ balls can be selected to be placed in the second urn is $\binom{n-k_1}{k_2}$. Of the remaining, the number of ways we can specify $k_3$ balls to be placed in the third urn is $\binom{n-k_1-k_2}{k_3}$. And, proceeding in this fashion, the number of ways the occupancy configuration $(k_1, \ldots, k_r)$ can be achieved is
$$\binom{n}{k_1}\binom{n-k_1}{k_2}\binom{n-k_1-k_2}{k_3}\cdots\binom{n-k_1-\cdots-k_{r-1}}{k_r} = \frac{n!}{k_1!(n-k_1)!}\cdot\frac{(n-k_1)!}{k_2!(n-k_1-k_2)!}\cdot\frac{(n-k_1-k_2)!}{k_3!(n-k_1-k_2-k_3)!}\cdots\frac{(n-k_1-\cdots-k_{r-1})!}{k_r!(n-k_1-\cdots-k_r)!}.$$
Factors cancel successively in the denominators and numerators of the terms of the expression on the right which is now seen to simplify to the form
$$\frac{n!}{k_1!k_2!\cdots k_r!}.$$
On the other hand, there are clearly $r^n$ distributions of n balls in r urns, the assumptions of the Maxwell–Boltzmann distribution giving each equal probability. Writing $P_{\mathrm{MB}}(k_1,\ldots,k_r)$ for the probability that the occupancy configuration $(k_1, \ldots, k_r)$ arises in the Maxwell–Boltzmann distribution, we see then that
$$P_{\mathrm{MB}}(k_1,\ldots,k_r) = \frac{n!}{k_1!k_2!\cdots k_r!}\,r^{-n} \qquad (k_1,\ldots,k_r \ge 0;\ k_1+\cdots+k_r = n).$$

¹ S. Weinberg, The Quantum Theory of Fields. Cambridge: Cambridge University Press, 2000.
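The multinomial count can be confirmed by direct enumeration for small n and r; the Python sketch below is my own check (not from the text), distributing n = 4 distinguishable balls over r = 3 urns and tallying occupancy configurations.

    from itertools import product
    from math import factorial
    from collections import Counter

    n, r = 4, 3
    # tally each occupancy configuration (k1,...,kr) over all r^n equally likely assignments
    counts = Counter(
        tuple(assignment.count(urn) for urn in range(r))
        for assignment in product(range(r), repeat=n)
    )
    for config, ways in counts.items():
        expected = factorial(n)
        for k in config:
            expected //= factorial(k)
        assert ways == expected
    print(len(counts), "occupancy configurations; multinomial counts verified")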

12. Continuation, the Bose–Einstein distribution. Suppose now that the balls are indistinguishable, the urns distinguishable. Let $A_{n,r}$ be the number of distinguishable arrangements of the n balls into the r urns. By lexicographic arrangement of the distinct occupancy configurations, show the validity of the recurrence
$$A_{n,r} = A_{n,r-1} + A_{n-1,r-1} + \cdots + A_{1,r-1} + A_{0,r-1}$$
with boundary conditions $A_{n,1} = 1$ for n ≥ 1 and $A_{1,r} = r$ for r ≥ 1. Solve the recurrence for $A_{n,r}$. [The standard mode of analysis is by a combinatorial trick by arrangement of the urns sequentially with sticks representing urn walls and identical stones representing balls. The alternative method suggested here provides a principled recurrence as a starting point instead.] The expression $1/A_{n,r}$ represents the probability that a given occupancy configuration is discovered assuming all occupancy configurations are equally likely. These are the Bose–Einstein statistics of statistical physics and have been found to apply to bosons such as photons, certain atomic nuclei like those of the carbon-12 and helium-4 atoms, gluons which underlie the strong nuclear force, and W and Z bosons which mediate the weak nuclear force.

SOLUTION: Guessing the pattern (not easy). With n = 5, a systematic enumeration shows that $A_{5,1} = 1$, $A_{5,2} = 6$ and $A_{5,3} = 21$. A diligent examination of such cases may lead one to a shrewd suspicion of the answer; and once the answer is guessed it is easy to verify by induction. But, it must be admitted, seeing the pattern is not easy. Generating functions provide a principled path to solution.

A recursive approach—with a generatingfunctionological solution: If there is only one urn then all the balls have to go into it, whence $A_{n,1} = 1$ for n ≥ 1. If, on the other hand, there is only one ball then it may go into any of the r urns, whence $A_{1,r} = r$ for r ≥ 1. This establishes the end-points of a recurrence. Now suppose urn 1 has $k_1 = k$ balls. Then the remaining n − k balls must be distributed in the remaining r − 1 urns. Summing over the possible values for k gives the recurrence
$$A_{n,r} = A_{n,r-1} + A_{n-1,r-1} + \cdots + A_{1,r-1} + A_{0,r-1}$$
for n, r ≥ 1. It will simplify summation limits to expand this recurrence to the entire integer lattice on the plane. To begin, it will be convenient to set boundary conditions
$$A_{0,0} = 1, \qquad A_{n,r} = 0 \quad (n < 0 \text{ or } r < 0),$$
which enables us to expand the basic recurrence to everywhere on the integer lattice excepting only the origin:
$$A_{n,r} = \sum_{k \le n} A_{k,r-1} \qquad \bigl((n,r) \ne (0,0)\bigr). \tag{†}$$
(We should verify immediately that the recurrence yields $A_{1,r} = r$ for r ≥ 1 and $A_{n,1} = 1$ for n ≥ 1, as it must.) Now introduce the sequence of generating functions
$$G_r(s) = \sum_n A_{n,r}\,s^n \quad\text{with}\quad G_0(s) = 1. \tag{†'}$$
Suppose r ≠ 0. Then the recurrence (†) is valid for all n. Multiplying both sides of (†) by $s^n$ and summing over all n we then obtain
$$G_r(s) = \sum_n A_{n,r}\,s^n = \sum_n\sum_{k \le n} A_{k,r-1}\,s^n \qquad (r \ne 0).$$
The interchange in the order of summation that is indicated is natural, right, and effective, and leads to a summable geometric series. For |s| < 1, we now obtain
$$G_r(s) = \sum_k A_{k,r-1}\sum_{n \ge k} s^n = \sum_k A_{k,r-1}\,\frac{s^k}{1-s} = \frac{G_{r-1}(s)}{1-s} \qquad (r \ne 0).$$
The recurrence is trivial for r < 0. For r > 0 we may repeatedly run the process and, by induction, obtain
$$G_r(s) = \frac{G_{r-1}(s)}{1-s} = \frac{G_{r-2}(s)}{(1-s)^2} = \cdots = \frac{G_0(s)}{(1-s)^r}.$$
As $G_0(s) = 1$, we see hence that
$$G_r(s) = (1-s)^{-r} = \sum_{n=0}^{\infty}\binom{-r}{n}(-s)^n = \sum_{n=0}^{\infty}\binom{n+r-1}{n}s^n \qquad (r > 0), \tag{†''}$$
the first step is by the binomial theorem (Problem 4), the second by the negative binomial identity (Problem 2). Comparing the sums in (†', †'') termwise we may write down by inspection the solution
$$A_{n,r} = \binom{n+r-1}{n},$$
the terms being non-zero only for n ≥ 0 and r ≥ 1 by virtue of the conventions for the binomial coefficients.

Another approach—combinatorial this time: Imagine the urns arranged sequentially from left to right, a vertical bar or "stick" representing the divide between two successive urns. The arrangement of n indistinguishable balls (or "stones") into r distinguishable urns may then be seen to be equivalent to an arrangement of sticks and stones as shown in Figure 1.

[Figure 1: Arrangement of n = 7 indistinguishable balls into r = 8 distinguishable urns. The thick vertical bars at either end represent immovable barriers; the thin vertical bars represent movable sticks corresponding to urn partitions.]

Each distinguishable deployment is in 1-1 correspondence with an arrangement of r − 1 sticks and n stones and there are precisely $\binom{r-1+n}{n}$ such arrangements.

Writing $P_{\mathrm{BE}}(k_1,\ldots,k_r)$ for the probability of the occupancy configuration under the Bose–Einstein statistics, we see that
$$P_{\mathrm{BE}}(k_1,\ldots,k_r) = \frac{1}{A_{n,r}} = \frac{1}{\binom{n+r-1}{n}} \qquad (k_1,\ldots,k_r \ge 0;\ k_1+\cdots+k_r = n).$$

13. Continuation, the Fermi–Dirac distribution. With balls indistinguishable and urns distinguishable, suppose that the only legal occupancy configurations $(k_1, \ldots, k_r)$ are those where each $k_i$ is either 0 or 1 only and $k_1 + \cdots + k_r = n$, and suppose further that all legal occupancy configurations are equally likely. The conditions impose the constraint n ≤ r and prohibit occupancy configurations where an urn is occupied by more than one ball. Now determine the probability of observing a legal occupancy configuration $(k_1, \ldots, k_r)$. These are the Fermi–Dirac statistics and have been found to apply to fermions such as electrons, neutrons, and protons.

SOLUTION: Since an urn can contain either no balls or precisely one ball, each valid Fermi–Dirac occupancy configuration $(k_1, \ldots, k_r)$ is completely specified by identification of the n occupied urns. There are thus precisely $\binom{r}{n}$ legal occupancy configurations. Writing $P_{\mathrm{FD}}(k_1,\ldots,k_r)$ for the probability of the occupancy configuration $(k_1, \ldots, k_r)$ under the Fermi–Dirac statistics, we see that
$$P_{\mathrm{FD}}(k_1,\ldots,k_r) = \frac{1}{\binom{r}{n}} \qquad \bigl(k_1,\ldots,k_r \in \{0,1\};\ k_1+\cdots+k_r = n;\ 0 \le n \le r\bigr).$$

14. Chromosome breakage and repair. Each of n sticks is broken into a long part and a short part, the parts jumbled up and recombined pairwise to form n new sticks. Find the probability (a) that the parts will be joined in the original order, and (b) that all long parts are paired with short parts.²

SOLUTION: (a) The total number of possibilities for recombination is
$$\frac{\binom{2n}{2}\binom{2n-2}{2}\cdots\binom{2}{2}}{n!} = (2n-1)(2n-3)\cdots(1).$$
The original order is just one outcome of these possibilities and so the probability that the sticks are recombined in their original order is simply
$$\frac{1}{(2n-1)(2n-3)\cdots(1)} = \frac{(2n)(2n-2)\cdots4\cdot2}{(2n)!} = \frac{2^n\,n!}{(2n)!}.$$

(b) Lay out, say, the short parts in any order. The long parts may then be paired with the short parts in n! ways and so the probability of recombination with a proper pairing of shorts and longs is given by
$$\frac{n!}{(2n-1)(2n-3)\cdots(1)} = \frac{2^n(n!)^2}{(2n)!} = \frac{2^n}{\binom{2n}{n}}.$$

² If sticks represent chromosomes broken by, say, X-ray irradiation, then a recombination of two long parts or two short parts causes cell death. See D. G. Catcheside, "The effect of X-ray dosage upon the frequency of induced structural changes in the chromosomes of Drosophila Melanogaster", Journal of Genetics, vol. 36, pp. 307–320, 1938.
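Both probabilities are simple to check by simulation for a small number of sticks; the Python sketch below is mine and only illustrative (the parameter choices are arbitrary). It shuffles the 2n parts, pairs them off consecutively, and compares empirical frequencies with the formulas.

    import random
    from math import comb, factorial

    n, trials = 4, 200_000
    parts = [('long', i) for i in range(n)] + [('short', i) for i in range(n)]
    rng = random.Random(0)

    original = mixed = 0
    for _ in range(trials):
        rng.shuffle(parts)
        pairs = [parts[2 * k: 2 * k + 2] for k in range(n)]
        if all(p[0][0] != p[1][0] for p in pairs):        # every long paired with a short
            mixed += 1
            if all(p[0][1] == p[1][1] for p in pairs):    # and with its original partner
                original += 1

    print(original / trials, 2**n * factorial(n) / factorial(2 * n))   # part (a)
    print(mixed / trials, 2**n / comb(2 * n, n))                       # part (b)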

15. Spread of rumours. In a small town of n people a person passes a titbit of information to another person. A rumour is now launched with each recipient of the information passing it on to a randomly chosen individual. What is the probability that the rumour is told r times without (a) returning to the originator, (b) being repeated to anyone. Generalisation: redo the calculations if each person tells the rumour to m randomly selected people.

SOLUTION: (a) The first person can pass on the rumour to any of n − 1 people. Starting with the first recipient, each person receiving the rumour can pass it on also in n − 1 ways of which n − 2 possibilities do not return the gossip to the originator. Thus, the probability that the rumour is told r times without returning to the originator is
$$\frac{(n-1)(n-2)^{r-1}}{(n-1)(n-1)^{r-1}} = \Bigl(1 - \frac{1}{n-1}\Bigr)^{r-1}.$$
If each person tells the rumour to m randomly selected people, by a similar argument the probability becomes
$$\frac{\binom{n-1}{m}\binom{n-2}{m}^{r-1}}{\binom{n-1}{m}\binom{n-1}{m}^{r-1}} = \biggl[\frac{(n-2)^{\underline{m}}}{(n-1)^{\underline{m}}}\biggr]^{r-1} = \biggl[\prod_{j=0}^{m-1}\Bigl(1 - \frac{1}{n-1-j}\Bigr)\biggr]^{r-1}.$$

(b) The chance that the rumour is not repeated to anyone is
$$\frac{(n-1)^{\underline{r}}}{(n-1)^r} = \prod_{j=0}^{r-1}\Bigl(1 - \frac{j}{n-1}\Bigr).$$
The generalisation when the rumour is told each time to m randomly selected people is
$$\frac{\binom{n-1}{m}\binom{n-1-(m+1)}{m}\binom{n-1-(2m+1)}{m}\cdots\binom{n-1-((r-1)m+1)}{m}}{\binom{n-1}{m}^r} = \prod_{j=0}^{r-1}\prod_{k=0}^{m-1}\Bigl(1 - \frac{jm+1}{n-1-k}\Bigr).$$

16. Keeping up with the Joneses. The social disease of keeping up with the Joneses may be parodied in this invented game of catch-up. Two protagonists are each provided with an n-sided die the faces of which show the numbers 1, 2, ..., n. Both social climbers start at the foot of the social ladder. Begin the game by having the first player roll her die and move as many steps up the ladder as show on the die face. The second player, envious of the progress of her rival, takes her turn next, rolls her die, and moves up the ladder as many steps as show on her die face. The game now progresses apace. At each turn, whichever player is lower on the ladder rolls her die and moves up as many steps as indicated on the die face. The game terminates at the first instant when both players end up on the same ladder step. (At which point, the two rivals realise the futility of it all and as the social honours are now even they resolve to remain friends and eschew social competition.) Call each throw of a die a turn and let N denote the number of turns before the game terminates. Determine the distribution of N. [Continued in Problem VII.7.]

SOLUTION: It is easy to get confused in this problem but if one keeps in mind the salient feature that, at each trial, the person who is lower on the social rung can match her competitor in exactly one out of the n possible outcomes for her throw then the situation clarifies. By the rules of the game, the rungs of the ladder at which the competitors dwell after each throw are distinct till the final rung at which parity is obtained. Suppose τ throws were made in total. There will then be τ − 1 distinct ladder rungs selected, in increasing order, say, $1 \le L_1 < L_2 < \cdots < L_{\tau-1}$, the final throw landing the competitor moving last onto the rung $L_{\tau-1}$ occupied by her rival. Suppose the first competitor throws her die $\tau_1$ times before parity is obtained, the outcomes being $k_1^{(1)}, k_2^{(1)}, \ldots, k_{\tau_1}^{(1)}$; likewise, suppose the second competitor throws her die a total of $\tau_2$ times with outcomes $k_1^{(2)}, k_2^{(2)}, \ldots, k_{\tau_2}^{(2)}$. Clearly, $\tau_1 + \tau_2 = \tau$. With $S_1^{(1)} = k_1^{(1)}$, the partial sums $S_{i+1}^{(1)} = S_i^{(1)} + k_{i+1}^{(1)}$ of the first competitor then select an increasing subsequence $(S_1^{(1)} =)\,L_{i_1} < L_{i_2} < \cdots < L_{i_{\tau_1-1}} < L_{i_{\tau_1}} = L_{\tau-1}\,(= S_{\tau_1}^{(1)})$ of the ladder rung sequence terminating in the rung $L_{\tau-1}$. Likewise, the corresponding sequence of partial sums for the second competitor, $S_1^{(2)} = k_1^{(2)}$, $S_{i+1}^{(2)} = S_i^{(2)} + k_{i+1}^{(2)}$, selects another increasing subsequence $(S_1^{(2)} =)\,L_{i_1'} < L_{i_2'} < \cdots < L_{i_{\tau_2-1}'} < L_{i_{\tau_2}'} = L_{\tau-1}\,(= S_{\tau_2}^{(2)})$ of the ladder rung sequence which interleaves the ladder rung sequence of the first competitor and intersects it only in the final step.

We now work systematically, moving a competitor up the ladder rung sequence until she leapfrogs her opponent then switching to her opponent and moving her up until she, in turn, leapfrogs her opponent, then switching back to the first competitor, and so on, until finally parity is achieved. Up through the penultimate step, the situation is as follows: the competitor currently under consideration is at ladder rung L, her opponent at ladder rung L' > L where 1 ≤ L' − L ≤ n − 1. The competitor under consideration now moves up to rung L + k where k is the outcome of her die throw. If L + k < L' we repeat the process; if L + k > L', we switch focus to her competitor. In either case, L + k ≠ L' and there are hence n − 1 legal values that k can assume. This situation persists from the second through the penultimate throws. The first throw which has n possibilities sets the initial bar; and in the final throw which determines parity, the competitor playing catch-up throws exactly the value L' − L to land on the same rung $L_{\tau-1}$ as her opponent. There are thus exactly $n\cdot(n-1)^{\tau-2}\cdot 1$ sequences of throws of the dice which result in parity achieved for the first time on the τth throw. As there are $n^\tau$ possibilities for τ throws, it follows that the distribution of the stopping time N of the game is given by
$$P\{N = \tau\} = \frac{n(n-1)^{\tau-2}}{n^\tau} = \Bigl(1 - \frac{1}{n}\Bigr)^{\tau-2}\frac{1}{n} \qquad (\tau \ge 2).$$
This is the geometric distribution with parameter 1/n.
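A quick Monte Carlo run bears the geometric law out; the simulation below is my own sketch of the game (not from the text), tracking the two positions explicitly.

    import random
    from collections import Counter

    def play(n, rng):
        """Number of turns until both players stand on the same ladder step."""
        a, b = rng.randint(1, n), 0    # player 1 has taken the opening turn
        turns = 1
        while a != b:
            turns += 1
            if a < b:
                a += rng.randint(1, n)
            else:
                b += rng.randint(1, n)
        return turns

    rng = random.Random(1)
    n, trials = 6, 200_000
    counts = Counter(play(n, rng) for _ in range(trials))
    for t in range(2, 7):
        # empirical frequency versus (1 - 1/n)^(t-2) / n
        print(t, round(counts[t] / trials, 4), round((1 - 1 / n) ** (t - 2) / n, 4))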

17. The hot hand. A particular basketball player historically makes one basket for every two shots she takes. During a game in which she takes very many shots there is a period during which she seems to hit every shot; at some point, say, she makes five shots in a row. This is clearly evidence that she is on a purple patch (has a "hot hand") where, temporarily at least, her chances of making a shot are much higher than usual, and so the team tries to funnel the ball to her to milk her run of successes as much as possible. Is this good thinking? Propose a model probability space for this problem and specify the event of interest.

SOLUTION: The natural sample space for this model problem is an unending sequence of fair coin tosses, each such sequence representing one sample point or "outcome" of this gedanken experiment. In a succession of coin tosses, say that a first success run of length r occurs at trial n if a succession of r successes occurs for the first time at trials n − r + 1, n − r + 2, ..., n. The occurrence of a first success run at epoch n triggers a renewal and we consider a statistically identical process to restart with trial n + 1 and, thereafter, each time a success run of length r is first observed. For instance, the renewal epochs for success runs of length 3 are identified by a comma followed by a little added space for visual delineation in the following sequence of coin flips with 1 representing success and 0 representing failure: 0110111, 1000010111, 0111, 111, 101 · · · . With this convention, success runs of length r determine a renewal process. A similar convention and terminology applies to failure runs, as well as to runs of either type. The sample space is continuous and may be identified with the unit interval as we have seen in Examples I.7.6 and I.7.7.


The event of interest is now identified with those sequences for which there exists (at least one) success run of length five in the first n trials. The probability measure is uniform in this interval (Examples I.7.7 and I.7.8).

18. Continuation, success run probability. What is the probability that, in the normal course of things, the player makes five (or more) shots in a row somewhere among a string of 50 consecutive shot attempts? If the chance is small then the occurrence of a run of successes somewhere in an observed sequence of attempts would suggest either that an unlikely event has transpired or that the player was temporarily in an altered state (or, "in the zone") while the run was in progress. Naturally enough, we would then attribute the observed run to a temporary change in odds (a hot hand) and not to the occurrence of an unlikely event. If, on the other hand, it turns out to be not at all unlikely that there will be at least one moderately long success run somewhere in a string of attempts, then one cannot give the hot hand theory any credence. [This problem has attracted critical interest³ and illustrates a surprising and counter-intuitive aspect of the theory of fluctuations. The analysis is elementary but not at all easy—Problems XV.20–27 provide a principled scaffolding on which problems of this type can be considered.]

SOLUTION: For each n, write $u_n$ for the probability that there is a success run of length 5 at trial n. Likewise, let $f_n$ represent the probability that there is a first success run of length 5 at trial n. Finally, let $q_n$ denote the probability of no success run of length 5 through trial n. By additivity, $q_n = 1 - \sum_{k \le n} f_k$ and so it suffices to determine $f_k$ for each k. It turns out to be more convenient to first determine $\{u_n,\ n \ge 1\}$.

Now, there will be five consecutive successes at trials n − 4, n − 3, n − 2, n − 1, and n if, and only if, for some k with 0 ≤ k ≤ 4, a success run of length five terminated at trial n − k and there were k consecutive successes that followed at trials n − k + 1, ..., n. By additivity, it follows that
$$2^{-5} = u_n 2^{-0} + u_{n-1}2^{-1} + \cdots + u_{n-4}2^{-4},$$
or, equivalently,
$$u_n = 2^{-5} - u_{n-1}2^{-1} - u_{n-2}2^{-2} - u_{n-3}2^{-3} - u_{n-4}2^{-4}.$$
The boundary conditions are clear: $u_1 = u_2 = u_3 = u_4 = 0$. We may now churn the recurrence through to evaluate $u_n$ systematically. Computing the first few values in the sequence, we have
$$u_5 = 2^{-5} = 0.03125,$$
$$u_6 = 2^{-5} - u_5 2^{-1} = 2^{-6} = 0.015625,$$
$$u_7 = 2^{-5} - u_6 2^{-1} - u_5 2^{-2} = 2^{-5} - 2^{-7} - 2^{-7} = 2^{-6} = 0.015625,$$
$$u_8 = 2^{-5} - u_7 2^{-1} - u_6 2^{-2} - u_5 2^{-3} = 2^{-5} - 2^{-7} - 2^{-8} - 2^{-8} = 2^{-6} = 0.015625,$$
$$u_9 = 2^{-5} - u_8 2^{-1} - u_7 2^{-2} - u_6 2^{-3} - u_5 2^{-4} = 2^{-5} - 2^{-7} - 2^{-8} - 2^{-9} - 2^{-9} = 2^{-6} = 0.015625,$$
$$u_{10} = 2^{-5} - u_9 2^{-1} - u_8 2^{-2} - u_7 2^{-3} - u_6 2^{-4} = 2^{-6}\bigl(2 - (1 - 2^{-4})\bigr) = 0.016602.$$
A few more iterations shows that the sequence quickly converges to the stationary solution satisfying $u = 2^{-5} - u2^{-1} - u2^{-2} - u2^{-3} - u2^{-4}$, or,
$$u = \frac{2^{-5}}{1 + 2^{-1} + 2^{-2} + 2^{-3} + 2^{-4}} = \frac{2^{-5}}{2 - 2^{-4}} = \frac{1}{62} = 0.016129.$$
Numerical evaluation shows that the sequence has essentially converged to its limiting value by n = 20 (with an error of less than one part in one million).

Now to determine the values $\{f_n,\ n \ge 1\}$. The boundary conditions are clear: $f_1 = f_2 = f_3 = f_4 = 0$. Now, if there is a success run of length five at trial n then there must exist some k ≤ n for which (i) there was a first success run of length five at k, and (ii) the next n − k trials result in a success run of length five culminating at trial n. Summing over the possibilities, we see that, for n ≥ 1, $u_n = f_1 u_{n-1} + f_2 u_{n-2} + \cdots + f_{n-1}u_1 + f_n$ or $f_n = u_n - f_1 u_{n-1} - f_2 u_{n-2} - \cdots - f_{n-1}u_1$. Churning through the first few values in the recurrence yields
$$f_5 = u_5 = 2^{-5} = 0.03125,$$
$$f_6 = u_6 - f_5 u_1 = u_6 = 2^{-6} = 0.015625,$$
$$f_7 = u_7 - f_6 u_1 - f_5 u_2 = u_7 = 2^{-6} = 0.015625,$$
$$f_8 = u_8 - f_7 u_1 - f_6 u_2 - f_5 u_3 = u_8 = 2^{-6} = 0.015625,$$
$$f_9 = u_9 - f_8 u_1 - f_7 u_2 - f_6 u_3 - f_5 u_4 = u_9 = 2^{-6} = 0.015625,$$
$$f_{10} = u_{10} - f_9 u_1 - f_8 u_2 - f_7 u_3 - f_6 u_4 - f_5 u_5 = 2^{-6}\bigl(2 - (1 - 2^{-4})\bigr) - 2^{-10} = 2^{-6} = 0.015625,$$
$$f_{11} = u_{11} - f_{10}u_1 - f_9 u_2 - f_8 u_3 - f_7 u_4 - f_6 u_5 = 0.0148926,$$
and numerical evaluation shows that
$$q_{50} = 1 - \sum_{k=1}^{50} f_k = \frac{126135883035101}{281474976710656} = 0.448125$$
rounded up to six decimal places. The chance that there is a success run of length five somewhere in a sequence of fifty tosses of a fair coin is in excess of 55%. It is not at all unlikely hence that the observed run is merely an aspect of chance fluctuations and no mysterious "hot hand" need be proposed to explain the phenomenon. Problems XV.20–27 sketch a more satisfying general, analytical solution.

³ T. Gilovich, R. Vallone, and A. Tversky, "The hot hand in basketball: On the misperception of random sequences", Cognitive Psychology, vol. 17, pp. 295–314, 1985. Their conclusion? That the hot hand theory is a widespread cognitive illusion affecting all beholders, players, coaches, and fans. Public reaction to the story was one of disbelief. When the celebrated cigar-puffing coach of the Boston Celtics, Red Auerbach, was told of Gilovich and his study, he grunted, "Who is this guy? So he makes a study. I couldn't care less". Auerbach's quote is reported in D. Kahneman, Thinking, Fast and Slow. New York: Farrar, Straus, and Giroux, 2011, p. 117.
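The two recurrences are painless to run by machine; the following Python sketch is a direct transcription of the computation above in my own variable names, using exact rational arithmetic, and reproduces $q_{50}$.

    from fractions import Fraction

    r, N = 5, 50
    half = Fraction(1, 2)
    u = [Fraction(0)] * (N + 1)   # u[n]: a success run of length r ends at trial n
    f = [Fraction(0)] * (N + 1)   # f[n]: the first such run ends at trial n

    for n in range(r, N + 1):
        u[n] = half**r - sum(u[n - k] * half**k for k in range(1, r))
        f[n] = u[n] - sum(f[k] * u[n - k] for k in range(1, n))

    q50 = 1 - sum(f[1:])
    print(q50, float(q50))   # 126135883035101/281474976710656, about 0.448125
    print(float(1 - q50))    # about 0.552: at least one run of five heads in 50 tosses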

The concluding Problems 19–34 are of a theoretical character.

19. De Morgan's laws. Show that $\bigl(\bigcup_\lambda A_\lambda\bigr)^\complement = \bigcap_\lambda A_\lambda^\complement$ and $\bigl(\bigcap_\lambda A_\lambda\bigr)^\complement = \bigcup_\lambda A_\lambda^\complement$ where λ takes values in an arbitrary index set Λ, possibly infinite.

SOLUTION: The set implications for the first of de Morgan's laws are given by
$$\omega \in \Bigl(\bigcup_\lambda A_\lambda\Bigr)^\complement \iff \omega \notin \bigcup_\lambda A_\lambda \iff \forall\lambda:\ \omega \notin A_\lambda \iff \forall\lambda:\ \omega \in A_\lambda^\complement \iff \omega \in \bigcap_\lambda A_\lambda^\complement,$$
and so $\bigl(\bigcup_\lambda A_\lambda\bigr)^\complement = \bigcap_\lambda A_\lambda^\complement$. Likewise, the corresponding set implications for the second of de Morgan's laws are given by
$$\omega \in \Bigl(\bigcap_\lambda A_\lambda\Bigr)^\complement \iff \omega \notin \bigcap_\lambda A_\lambda \iff \exists\lambda_0 \text{ s.t. } \omega \notin A_{\lambda_0} \iff \omega \in A_{\lambda_0}^\complement \iff \omega \in \bigcup_\lambda A_\lambda^\complement,$$
and so $\bigl(\bigcap_\lambda A_\lambda\bigr)^\complement = \bigcup_\lambda A_\lambda^\complement$.

20. Show that $\bigl(\bigcup_j A_j\bigr)\setminus\bigl(\bigcup_j B_j\bigr)$ and $\bigl(\bigcap_j A_j\bigr)\setminus\bigl(\bigcap_j B_j\bigr)$ are both subsets of $\bigcup_j (A_j \setminus B_j)$. When is there equality?

SOLUTION: By de Morgan's laws and the distributivity of unions and intersections,
$$\Bigl(\bigcup_j A_j\Bigr)\Bigm\backslash\Bigl(\bigcup_j B_j\Bigr) = \Bigl(\bigcup_j A_j\Bigr)\cap\Bigl(\bigcup_k B_k\Bigr)^\complement = \Bigl(\bigcup_j A_j\Bigr)\cap\Bigl(\bigcap_k B_k^\complement\Bigr) = \bigcup_j\Bigl(A_j\cap\bigcap_k B_k^\complement\Bigr) \overset{\text{(a)}}{\subseteq} \bigcup_j\bigl(A_j\cap B_j^\complement\bigr) = \bigcup_j\,(A_j\setminus B_j).$$
Equality holds in step (a) if, and only if, $A_j\setminus B_j$ is disjoint from $\bigcup_{k\ne j} B_k$ for every j.

By a similar argument,
$$\Bigl(\bigcap_j A_j\Bigr)\Bigm\backslash\Bigl(\bigcap_j B_j\Bigr) = \Bigl(\bigcap_k A_k\Bigr)\cap\Bigl(\bigcap_j B_j\Bigr)^\complement = \Bigl(\bigcap_k A_k\Bigr)\cap\Bigl(\bigcup_j B_j^\complement\Bigr) = \bigcup_j\Bigl(B_j^\complement\cap\bigcap_k A_k\Bigr) \overset{\text{(b)}}{\subseteq} \bigcup_j\bigl(B_j^\complement\cap A_j\bigr) = \bigcup_j\,(A_j\setminus B_j).$$
Equality holds in step (b) if, and only if, $A_j\setminus B_j$ is disjoint from $\bigcup_{k\ne j} A_k^\complement$ for every j.

21. σ-algebras containing two sets. Suppose A and B are non-empty subsets of Ω and A ∩ B ≠ ∅. If A ⊆ B determine the smallest σ-algebra containing both A and B. Repeat the exercise if $A \not\subseteq B$.

SOLUTION: If A ⊆ B:
$$\mathcal{F} = \bigl\{\varnothing,\ \Omega,\ A,\ B,\ A^\complement,\ B^\complement,\ A\cup B^\complement,\ A^\complement\cap B\bigr\}.$$
If $A \not\subseteq B$:
$$\mathcal{F} = \bigl\{\varnothing,\ \Omega,\ A,\ B,\ A^\complement,\ B^\complement,\ A\cup B,\ A\cap B,\ A^\complement\cup B^\complement,\ A^\complement\cap B^\complement,\ A\cup B^\complement,\ A^\complement\cup B,\ A^\complement\cap B,\ A\cap B^\complement,\ A\,\triangle\,B,\ (A\,\triangle\,B)^\complement\bigr\}.$$


22. Indicator functions. Suppose A is any subset of a universal set Ω. The indicator for A, denoted $1_A$, is the function $1_A(\omega)$ which takes value 1 when ω is in A and value 0 when ω is not in A. Indicators provide a very simple characterisation of the symmetric difference between sets. Recall that if A and B are subsets of some universal set Ω, we define $A\,\triangle\,B := (A\cap B^c)\cup(B\cap A^c) = (A\setminus B)\cup(B\setminus A)$. This is equivalent to the statement that $1_{A\triangle B} = 1_A + 1_B \pmod 2$ where, on the right-hand side, the addition is modulo 2. Likewise, intersection has the simple representation, $1_{A\cap B} = 1_A\cdot 1_B \pmod 2$. Verify the following properties of symmetric differences (and cast a thought to how tedious the verification would be otherwise): (a) $A\,\triangle\,B = B\,\triangle\,A$ (the commutative property), (b) $(A\,\triangle\,B)\,\triangle\,C = A\,\triangle\,(B\,\triangle\,C)$ (the associative property), (c) $(A\,\triangle\,B)\,\triangle\,(B\,\triangle\,C) = A\,\triangle\,C$, (d) $(A\,\triangle\,B)\,\triangle\,(C\,\triangle\,D) = (A\,\triangle\,C)\,\triangle\,(B\,\triangle\,D)$, (e) $A\,\triangle\,B = C$ if, and only if, $A = B\,\triangle\,C$, (f) $A\,\triangle\,B = C\,\triangle\,D$ if, and only if, $A\,\triangle\,C = B\,\triangle\,D$. In view of their indicator characterisations, it now becomes natural to identify symmetric difference with "addition", $A\oplus B := A\,\triangle\,B$, and intersection with "multiplication", $A\otimes B := A\cap B$.

SOLUTION: (a, b) The commutative and associative properties fall into our lap from the indicator representation as addition is commutative and associative: $1_A + 1_B = 1_B + 1_A \pmod 2$ and $1_A + (1_B + 1_C) = (1_A + 1_B) + 1_C \pmod 2$.

(c) As $1_B + 1_B = 0 \pmod 2$, it follows that
$$1_{(A\triangle B)\triangle(B\triangle C)} = 1_{A\triangle B} + 1_{B\triangle C} = 1_A + 1_B + 1_B + 1_C = 1_A + 1_C \pmod 2.$$

(d) Using the commutative and associative properties of addition,
$$1_{(A\triangle B)\triangle(C\triangle D)} = 1_{A\triangle B} + 1_{C\triangle D} = 1_A + 1_B + 1_C + 1_D = (1_A + 1_C) + (1_B + 1_D) = 1_{(A\triangle C)\triangle(B\triangle D)} \pmod 2.$$

(e) If $A\,\triangle\,B = C$ then $1_A + 1_B = 1_C \pmod 2$. By adding $1_B$ to both sides, we obtain $1_A = 1_A + 1_B + 1_B = 1_B + 1_C \pmod 2$.

(f) If $1_{A\triangle B} = 1_{C\triangle D}$, proceed similarly by adding $1_B$ and $1_C$ to both sides.

23. Why is the family of events called an algebra? Suppose F is a non-empty family of subsets of a universal set Ω that is closed under complementation and finite unions. First argue that F is closed under intersections, set differences, and symmetric differences. Now, for A and B in F, define A ⊗ B := A ∩ B and A ⊕ B := A M B. Show that equipped with the operations of “addition” (⊕) and “multiplication” (⊗), the system F may be identified as an algebraic system in the usual sense of the word. That is, addition is commutative, associative, and has ∅ as the zero element, i.e., A ⊕ B = B ⊕ A, A ⊕ (B ⊕ C) = (A ⊕ B) ⊕ C, and A⊕∅ = A, and multiplication is commutative, associative, distributes over addition, and has Ω as the unit element, i.e., A ⊗ B = B ⊗ A, A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C, A ⊗ (B ⊕ C) = (A ⊗ B) ⊕ (A ⊗ C), and A ⊗ Ω = A.


SOLUTION: Pick A, B ∈ F. We know that $A\cap B = \bigl(A^\complement\cup B^\complement\bigr)^\complement$. Since F is closed under complementation and finite unions, one can conclude that A ∩ B ∈ F. Again consider A, B ∈ F. By assumption, $B^\complement \in F$. By the previous result, $A\setminus B = A\cap B^\complement \in F$. Now consider the symmetric set difference $A\,\triangle\,B = (A\cup B)\setminus(A\cap B)$. By assumption, (A ∪ B) ∈ F, and the first part of the problem implies that (A ∩ B) ∈ F as well. Therefore, the second part implies that their difference is also in F. We now define
$$A\otimes B = A\cap B \qquad\text{and}\qquad A\oplus B = A\,\triangle\,B$$
and show that these operations define an algebra:
$$A\oplus\varnothing = A\,\triangle\,\varnothing = A, \qquad A\otimes\Omega = A\cap\Omega = A,$$
$$A\oplus B = A\,\triangle\,B = (A\cup B)\setminus(A\cap B) = (B\cup A)\setminus(B\cap A) = B\,\triangle\,A = B\oplus A,$$
$$(A\otimes B)\oplus(A\otimes C) = \bigl[(A\cap B)\cup(A\cap C)\bigr]\setminus(A\cap B\cap C) = \bigl[A\cap(B\cup C)\bigr]\setminus\bigl[A\cap(B\cap C)\bigr] = A\otimes(B\oplus C),$$
$$A\oplus(B\oplus C) = A\,\triangle\,\bigl[(B\cup C)\cap(B^\complement\cup C^\complement)\bigr] = \bigl[A\cup\bigl((B\cup C)\cap(B^\complement\cup C^\complement)\bigr)\bigr]\cap\bigl[A^\complement\cup\bigl((B^\complement\cap C^\complement)\cup(B\cap C)\bigr)\bigr]$$
$$= (A\cup B\cup C)\cap\bigl(A\cup B^\complement\cup C^\complement\bigr)\cap\bigl(A^\complement\cup B\cup C^\complement\bigr)\cap\bigl(A^\complement\cup B^\complement\cup C\bigr)$$
$$= \bigl[C\cup\bigl((A\cup B)\cap(A^\complement\cup B^\complement)\bigr)\bigr]\cap\bigl[C^\complement\cup\bigl((A\cup B^\complement)\cap(A^\complement\cup B)\bigr)\bigr] = \bigl[C\cup(A\,\triangle\,B)\bigr]\cap\bigl[C^\complement\cup(A\,\triangle\,B)^\complement\bigr] = (A\oplus B)\oplus C.$$
The remaining expressions are trivial to verify.

24. Show that a σ-algebra F is closed under countable intersections.

SOLUTION: Suppose $A_j \in \mathcal{F}$ for j ≥ 1. Then $A_j^\complement \in \mathcal{F}$ for each j (closure of F under complementation), hence $\bigcup_j A_j^\complement \in \mathcal{F}$ (closure of F under countable unions), and finally $\bigcap_j A_j = \bigl(\bigcup_j A_j^\complement\bigr)^\complement \in \mathcal{F}$ (de Morgan's laws and another appeal to closure of F under complementation).

25. If $\mathcal{A} = \{A_1, \ldots, A_n\}$ is a finite family of sets what is the maximum number of elements that $\sigma(\mathcal{A})$ can have?

SOLUTION: For each j, write $A_j^{(1)} = A_j$ and $A_j^{(2)} = A_j^\complement$. Then the sets of the form
$$B(i_1, i_2, \ldots, i_n) = A_1^{(i_1)} \cap A_2^{(i_2)} \cap \cdots \cap A_n^{(i_n)}$$
as $i_1, i_2, \ldots, i_n$ vary over {1, 2} form a (disjoint) partition B of Ω. The sets of the form $B(i_1, \ldots, i_n)$ are the "atomic" sets engendered by the "parents" $A_1, A_2, \ldots, A_n$. There are at most $2^n$ such atomic sets (clearly); there will be exactly $2^n$ atomic sets in the partition if each of the intersections is non-empty. Every element in the σ-algebra $\sigma(\mathcal{A})$ is a finite union of atomic sets of the form $B(i_1, \ldots, i_n) \in B$, each such union creating a distinct element of $\sigma(\mathcal{A})$. It follows that $\sigma(\mathcal{A})$ has at most $2^{2^{\operatorname{card}\mathcal{A}}}$ elements, with equality whenever the atomic partition B of Ω has $2^{\operatorname{card}\mathcal{A}}$ elements.

26. For any a < b show that the open interval (a, b) may be obtained via a countable number of operations on half-closed intervals by showing that (a, b) may be represented as the countable union of the half-closed intervals $\bigl(a, b - (b-a)/n\bigr]$ as n varies over all integers ≥ 1. Likewise show how the closed interval [a, b] and the reversed half-closed interval [a, b) may be obtained from half-closed intervals by a countable number of operations. Show how to generate the singleton point {a} by a countable number of operations on half-closed intervals.

SOLUTION: Generation of intervals of various types from half-closed intervals:
$$\text{Open intervals:}\quad (a, b) = \bigcup_n\,(a,\ b - 1/n];$$
$$\text{Closed intervals:}\quad [a, b] = \bigcap_n\,(a - 1/n,\ b];$$
$$\text{Left-closed, right-open intervals:}\quad [a, b) = \bigcap_n\,(a - 1/n,\ b) = \bigcap_n\bigcup_m\,(a - 1/n,\ b - 1/m];$$
$$\text{Singleton:}\quad \{a\} = \bigcap_n\,(a - 1/n,\ a].$$

27. Continuation. Show that intervals of any type are obtainable by countable operations on intervals of a given type.

SOLUTION: The following steps illustrate the sequence (a, b] → (a, b) → [a, b) → [a, b] → (a, b]:
$$(a, b) = \bigcup_n\,(a,\ b - 1/n];$$
$$[a, b) = \bigcap_n\,(a - 1/n,\ b);$$
$$[a, b] = \bigcap_n\,[a,\ b + 1/n);$$
$$(a, b] = \bigcup_n\,[a + 1/n,\ b].$$

28. Increasing sequences of sets. Suppose $\{A_n,\ n \ge 1\}$ is an increasing sequence of events, $A_n \subseteq A_{n+1}$ for each n. Let $A = \bigcup_{n\ge1} A_n$. Show that $P(A_n) \to P(A)$ as n → ∞.

SOLUTION: With the convention $A_0 = \varnothing$, define the sequence of sets
$$C_n = A_n \setminus A_{n-1} \qquad (n \ge 1).$$
(Observe that $C_1 = A_1$.) The sequence of sets $\{C_n,\ n \ge 1\}$ is pairwise disjoint and, by construction, it is clear that $A_n = \bigcup_{k=1}^{n} C_k$ and $A = \bigcup_{n\ge1} A_n = \bigcup_{k\ge1} C_k$. We may hence express A as a countable union of disjoint sets. Countable additivity now yields
$$P(A) = P\Bigl(\bigcup_{k\ge1} C_k\Bigr) = \sum_{k=1}^{\infty} P(C_k) = \lim_{n\to\infty}\sum_{k=1}^{n} P(C_k) = \lim_{n\to\infty} P\Bigl(\bigcup_{k=1}^{n} C_k\Bigr) = \lim_{n\to\infty} P(A_n).$$
In words, probability measure is continuous from below.

29. Decreasing sequences of sets. Suppose $\{B_n,\ n \ge 1\}$ is a decreasing sequence of events, $B_n \supseteq B_{n+1}$ for each n. Let $B = \bigcap_{n\ge1} B_n$. Show that $P(B_n) \to P(B)$ as n → ∞.

SOLUTION: If $\{B_n,\ n \ge 1\}$ is a decreasing sequence of sets then $B_n^\complement \subseteq B_{n+1}^\complement$ and we may apply the result of Problem 28 to prove continuity from above.

30. Subadditivity. Suppose that P is a set function on the σ-algebra F satisfying Axioms 1, 2, and 3 of probability measure. If it is true that for all sequences $B_1, B_2, \ldots, B_n, \ldots$ of sets in F satisfying $A \subseteq \bigcup_n B_n$ we have $P(A) \le \sum_n P(B_n)$ then show that P satisfies Axiom 4. It is frequently easier to verify the continuity axiom by this property.

SOLUTION: Suppose $A_n \downarrow \varnothing$ is a sequence of events decreasing monotonically to the empty set. As the monotone property of measure depends only on the additivity and positivity axioms, we see that $P(A_n) \ge P(A_{n+1})$ for each n, whence $\{P(A_n),\ n \ge 1\}$ is a bounded, decreasing sequence of positive numbers, hence has a (positive) limit $P(A_n) \downarrow p \ge 0$. We now proceed to show that in fact p = 0.

Fix any ε > 0. By definition of limit, we may now select n = n(ε) sufficiently large so that $p - \epsilon < P(A_{n+m}) \le p$ for all m ≥ 1. As $A_n \supseteq A_{n+m}$ for all m ≥ 1, we see that $A_n = (A_n \setminus A_{n+m}) \cup A_{n+m}$ is the union of disjoint sets and so, by the positivity and additivity axioms,
$$P(A_n \setminus A_{n+m}) = P(A_n) - P(A_{n+m}) < p - (p - \epsilon) = \epsilon \qquad (m \ge 1).$$
On the other hand,
$$A_n \setminus A_{n+m} = (A_n \setminus A_{n+1}) \cup (A_{n+1} \setminus A_{n+2}) \cup \cdots \cup (A_{n+m-1} \setminus A_{n+m})$$
is a union of disjoint sets and so, by additivity again,
$$P(A_n \setminus A_{n+m}) = \sum_{j=n}^{n+m-1} P(A_j \setminus A_{j+1}).$$
It follows hence that
$$\sum_{j=n}^{n+m-1} P(A_j \setminus A_{j+1}) < \epsilon$$
for every m ≥ 1 and, a fortiori, by passing to the limit,
$$\lim_{m\to\infty}\sum_{j=n}^{n+m-1} P(A_j \setminus A_{j+1}) = \sum_{j=n}^{\infty} P(A_j \setminus A_{j+1}) \le \epsilon.$$
On the other hand, $A_n = \bigcup_{j=n}^{\infty} (A_j \setminus A_{j+1})$, and by the assumed subadditivity of the set function P, we have
$$P(A_n) \le \sum_{j=n}^{\infty} P(A_j \setminus A_{j+1}).$$
We conclude that $0 \le P(A_n) \le \epsilon$, eventually, for a sufficiently large choice of n = n(ε). As ε > 0 may be chosen arbitrarily small, we are led inexorably to the conclusion that $P(A_n) \to 0$ as n → ∞.

31. A semiring of sets. Suppose Ω is the set of all rational points in the unit interval [0, 1] and let A be the set of all intersections of the set Ω with arbitrary open, closed, and half-closed subintervals of [0, 1]. Show that A has the following properties: (a) ∅ ∈ A; (b) A is closed under intersections; and (c) if $A_1$ and A are elements of A with $A_1 \subseteq A$ then A may be represented as a finite union $A = A_1 \cup A_2 \cup \cdots \cup A_n$ of pairwise disjoint sets in A, the given set $A_1$ being the first term in the expansion. A family of sets with the properties (a), (b), and (c) is called a semiring.

32. Continuation. For each selection of 0 ≤ a ≤ b ≤ 1, let $A_{a,b}$ be the set of points obtained by intersecting Ω with any of the intervals (a, b), [a, b], [a, b), or (a, b]. Define a set function Q on the sets of A by the formula $Q(A_{a,b}) = b - a$. Show that Q is additive but not countably additive, hence not continuous. [Hint: Although Q(Ω) = 1, Ω is a countable union of single-element sets, each of which has Q-measure zero.]

33. Continuation. Suppose A ∈ A and $A_1, A_2, \ldots, A_n, \ldots$ is a sequence of pairwise disjoint subsets of A, all belonging to A. Show that $\sum_n Q(A_n) \le Q(A)$.

34. Continuation. Show that there exists a sequence $B_1, B_2, \ldots, B_n, \ldots$ of sets in A satisfying $A \subseteq \bigcup_n B_n$ but $Q(A) > \sum_n Q(B_n)$. This should be contrasted with Problem 30.


II Conditional Probability

1. Dice. If three dice are thrown what is the probability that one shows a 6 given that no two show the same face? Repeat for n dice where 2 ≤ n ≤ 6.
SOLUTION: A CONDITIONAL APPROACH TO SOLUTION. If n dice are thrown, let Aj = Aj^(n) denote the event that die j shows a 6 and let A = A^(n) = A1^(n) ∪ A2^(n) ∪ · · · ∪ An^(n) denote the event that some die shows a 6. Let B = B^(n) denote the event that the dice all show different face values. We are interested in the conditional probability

P(A^(n) | B^(n)) = P(A1^(n) ∪ · · · ∪ An^(n) | B^(n)) = P((A1^(n) ∩ B^(n)) ∪ · · · ∪ (An^(n) ∩ B^(n))) / P(B^(n)).

The events Aj^(n) ∩ B^(n), 1 ≤ j ≤ n, are mutually exclusive as at most one die can show a 6 if all face values are different. Accordingly, by additivity,

P(A^(n) | B^(n)) = P(A1^(n) ∩ B^(n))/P(B^(n)) + · · · + P(An^(n) ∩ B^(n))/P(B^(n)) = n · P(A1^(n) ∩ B^(n))/P(B^(n))

as, by the symmetry inherent in the situation, each of the terms on the right contributes the same amount. A consideration of small values of n lays bare the general pattern.
THE CASE n = 1 is trivial as it is obvious that P(A^(1)) = 1/6, P(B^(1)) = 1, and P(A^(1) ∩ B^(1)) = P(A^(1)) = 1/6, whence

P(A^(1) | B^(1)) = P(A^(1) ∩ B^(1))/P(B^(1)) = 1/6.

WHEN n = 2 we see that P(B^(2)) = (6 × 5)/(6 × 6) and P(A1^(2) ∩ B^(2)) = (1 × 5)/(6 × 6), whence

P(A1^(2) | B^(2)) = P(A1^(2) ∩ B^(2))/P(B^(2)) = [(1 × 5)/(6 × 6)] / [(6 × 5)/(6 × 6)] = 1/6.

It follows that P(A^(2) | B^(2)) = 2/6.


WHEN n = 3 we see that P(B^(3)) = (6 × 5 × 4)/(6 × 6 × 6) and P(A1^(3) ∩ B^(3)) = (1 × 5 × 4)/(6 × 6 × 6), whence

P(A1^(3) | B^(3)) = P(A1^(3) ∩ B^(3))/P(B^(3)) = [(1 × 5 × 4)/(6 × 6 × 6)] / [(6 × 5 × 4)/(6 × 6 × 6)] = 1/6.

Thus, P(A^(3) | B^(3)) = 3/6.
THE CASE OF GENERAL 1 ≤ n ≤ 6: the pattern is now clear. We have P(B^(n)) = [6 · 5 · · · (7 − n)]/6^n and P(A1^(n) ∩ B^(n)) = [5 · 4 · · · (7 − n)]/6^n, the falling factorials helping consolidate expressions. It follows that

P(A1^(n) | B^(n)) = P(A1^(n) ∩ B^(n))/P(B^(n)) = [5 · 4 · · · (7 − n)]/[6 · 5 · · · (7 − n)] = 1/6,

and so

P(A^(n) | B^(n)) = n/6    (1 ≤ n ≤ 6).

A DIRECT COMBINATORIAL APPROACH. The simple form of the solution suggests that there may be a direct combinatorial pathway to the answer. And indeed there is. In a random permutation of (1, . . . , 6), the face value 6 is equally likely to occur at any of the six locations. Identifying the first n locations as the face values of the n dice (with 1 ≤ n ≤ 6), the probability that one die shows 6 given that all die faces are different is n/6.
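As a quick numerical sanity check (an editorial addition, not part of the original solution; it assumes a Python environment is to hand), the sketch below enumerates all ordered outcomes for n dice and confirms the conditional probability n/6 exactly.

```python
from itertools import product
from fractions import Fraction

# Exact check of P(some die shows a 6 | all faces distinct) = n/6 for 1 <= n <= 6.
for n in range(1, 7):
    distinct = six = 0
    for faces in product(range(1, 7), repeat=n):   # all 6**n equally likely outcomes
        if len(set(faces)) == n:                   # event B: all faces different
            distinct += 1
            if 6 in faces:                         # event A: some die shows a 6
                six += 1
    assert Fraction(six, distinct) == Fraction(n, 6)
    print(n, Fraction(six, distinct))
```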

2. Keys. An individual somewhat the worse for drink has a collection of n keys of which one fits his door. He tries the keys at random discarding keys as they fail. What is the probability that he succeeds on the rth trial? S OLUTION : The n keys may be tried in n! equally likely orders; the orders in which the correct key occupies the rth place number (n − 1)!. The probability that he succeeds on the rth trial is hence (n − 1)!/n! = 1/n.

3. Let π1 , . . . , πn be a random permutation of the numbers 1, . . . , n. If you are told that πk > π1 , . . . , πk > πk−1 , what is the probability that πk = n? S OLUTION : Write A for the event {πk = n} and B for the event {πk > π1 , . . . , πk > πk−1 }. As all permutations are equally likely, the digit n is equally likely to occur at any of the n locations and hence P(A) = 1/n. To compute the probability of B, we observe that the first k digits in the permutation take values in a sub-collection J = {j1 , . . . , jk } (in some order). Given that the first k digits take values in this set, the largest of these digits is equally likely to occur at any of the first k locations. This conditional probability is invariant with respect to the specification of the sub-collection J and so P(B) = 1/k (by a trivial exercise of total probability by summing over all possibilities for the subcollection J). As it is clear that the occurrence of A implies the occurrence of B, we see that P(A ∩ B) = P(A) = 1/n. It follows that P(A | B) =

P(A ∩ B)/P(B) = P(A)/P(B) = (1/n)/(1/k) = k/n.


4. Balls and urns. Four balls are placed successively in four urns, all arrangements being equally probable. Given that the first two balls are in different urns, what is the probability that one urn contains exactly three balls? S OLUTION : Write A for the event that the first two balls are in different urns. Then card A = 4 × 3 × 4² as there are four urns that the first ball can be placed in; once it is placed, the second ball can be placed in any of three urns; the last two balls may be placed in any urn. Consequently, P(A) = (4 × 3 × 4²)/4⁴. Let B be the event that one urn contains exactly three balls. The joint occurrence of A and B means that either the first ball is grouped with balls three and four in one urn and the second ball is in a separate urn or that the second ball is grouped with balls three and four in one urn and the first ball is in a separate urn. The probability of the first of these eventualities is (4 × 3)/4⁴, the second eventuality having the same probability by symmetry, and so P(A ∩ B) = (2 × 4 × 3)/4⁴. It follows that the conditional probability that one urn contains exactly three balls given that the first two balls are in different urns is given by

P(B | A) = P(A ∩ B)/P(A) = [(2 × 4 × 3)/4⁴] / [(4 × 3 × 4²)/4⁴] = 1/8.

5. Bridge. North and South have ten trumps between them, trumps being cards of a specified suit. (a) Determine the probability that the three remaining trumps are in the same hand, that is to say, either East or West has no trumps. (b) If it is known that the king of trumps is included among the missing three, what is the probability that he is “unguarded”, that is to say, one player has the king and the other the remaining two trumps? S OLUTION : (a) Condition on the hands distributed to North and South. There are then 26 cards remaining, of which three are trumps (in the specified suit). The number of ways these cards are split between East and West is C(26, 13) and of these the number of ways in which East, for definiteness, gets all three trumps is C(23, 10). Similarly, there are C(23, 10) ways for West to get all three trumps. Thus, the probability, given that North and South have ten trumps between them, that either East or West have the three remaining trumps is

2 C(23, 10) / C(26, 13) = 11/50.

There is about a one-in-five chance that one opposition hand contains all three trumps given that North-South have ten trumps.
(b) Again condition on the North-South hand combination containing ten trumps which now do not include the king. For each such North-South combination, the number of hands in which East gets the trump king and West the other two trumps is C(23, 12), an identical count holding for the number of hands in which West gets the trump king, East the other two trumps. Thus, the probability that the trump king is unguarded given that North and South have ten trumps between them not including the trump king is given by

2 C(23, 12) / C(26, 13) = 13/50.


There is about a one-in-four chance that the trump king is isolated given that North-South have ten trumps not including the king.
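The two counts above are easily confirmed numerically (an editorial check, not part of the original solution):

```python
from fractions import Fraction
from math import comb

# (a) one hand holds all three trumps; (b) the trump king is unguarded.
print(Fraction(2 * comb(23, 10), comb(26, 13)))   # 11/50
print(Fraction(2 * comb(23, 12), comb(26, 13)))   # 13/50
```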

6. Defect origin. Three semiconductor chip foundries, say, A, B, and C, produce 25%, 35%, and 40%, respectively, of the chips in the market. Defective chips are produced by these foundries at a rate of 5%, 4%, and 2%, respectively. A semiconductor chip is randomly selected from the market and found to be defective. What are the probabilities that it originated at A, B, and C, respectively? S OLUTION : In a slight abuse of notation, write A, B, and C for the events that the selected chip originated at foundries A, B, and C, respectively, and let D denote the event that a randomly selected semiconductor chip is defective. By total probability,

P(D) = P(D | A) P(A) + P(D | B) P(B) + P(D | C) P(C) = 0.05 × 0.25 + 0.04 × 0.35 + 0.02 × 0.4 = 0.0345.

Then

P(A | D) = P(A ∩ D)/P(D) = (0.25 × 0.05)/0.0345 ≈ 0.362,
P(B | D) = P(B ∩ D)/P(D) = (0.35 × 0.04)/0.0345 ≈ 0.406,
P(C | D) = P(C ∩ D)/P(D) = (0.40 × 0.02)/0.0345 ≈ 0.232.
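For completeness, a small Python sketch (an editorial addition, not part of the original solution) that reproduces the posterior computation:

```python
# Posterior P(foundry | defective) by Bayes' rule.
priors = {"A": 0.25, "B": 0.35, "C": 0.40}
defect_rate = {"A": 0.05, "B": 0.04, "C": 0.02}

p_defect = sum(priors[f] * defect_rate[f] for f in priors)           # total probability, 0.0345
posterior = {f: priors[f] * defect_rate[f] / p_defect for f in priors}
print(p_defect, posterior)   # approximately {'A': 0.362, 'B': 0.406, 'C': 0.232}
```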

7. The ballot problem. Suppose candidate A wins n votes and candidate B wins m votes in a two-party election. Let Pn,m denote the probability that A leads B throughout the count. Determine the probabilities Pn,1 and Pn,2 by direct combinatorial methods. S OLUTION : If m = 1 then A leads every step of the way if, and only if, the solitary ballot cast for B is not among the first two counted out of the n + 1 cast. Thus,

Pn,1 = 1 − 2/(n + 1) = (n − 1)/(n + 1).

If m = 2 and A leads throughout then either both ballots cast for B appear in positions 4 through n + 2 or one of the two ballots for B appears in position 3 and the other appears in positions 5 through n + 2. The number of (ordered) configurations supporting the first of these cases is (n − 1)(n − 2), while the number of configurations supporting the second of these cases is 2(n − 2). It follows that

Pn,2 = [(n − 1)(n − 2) + 2(n − 2)] / [(n + 2)(n + 1)] = (n − 2)/(n + 2).


8. Parity. A coin with success probability p is tossed repeatedly. Let Pn be the probability that there are an even number of successes in n tosses. Write down a recurrence for Pn by conditioning and hence determine Pn analytically for each n. Thence verify the identity ∑_{k even} C(n, k) = 2^{n−1}. S OLUTION : Set P0 = 1 to establish the base of a recurrence. By conditioning on whether there are an even or an odd number of successes after n − 1 tosses, we obtain

Pn = p(1 − Pn−1) + (1 − p)Pn−1    (n ≥ 1)

as, for there to be an even number of successes after n tosses, if there are an odd number of successes after n − 1 tosses then the nth toss must result in a success, while if there are an even number of successes after n − 1 tosses then the nth toss must result in a failure. Probabilities multiply as the trials are independent. By repeatedly churning through the recurrence we obtain

Pn = p + (1 − 2p)Pn−1 = p + p(1 − 2p) + (1 − 2p)²Pn−2 = · · ·
   = p + p(1 − 2p) + p(1 − 2p)² + · · · + p(1 − 2p)^{n−1} + (1 − 2p)^n P0
   = p · [1 − (1 − 2p)^n]/[1 − (1 − 2p)] + (1 − 2p)^n = ½[1 + (1 − 2p)^n].

By summing over all even integers we obtain the alternative expression

Pn = ∑_{0 ≤ k ≤ n, k even} C(n, k) p^k (1 − p)^{n−k}

and thence the combinatorial identity

∑_{0 ≤ k ≤ n, k even} C(n, k) p^k (1 − p)^{n−k} = ½[1 + (1 − 2p)^n].

When p = 1/2 this reduces to the claimed expression ∑_{0 ≤ k ≤ n, k even} C(n, k) = 2^{n−1}.
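A brief numerical check of the recurrence against the closed form and the binomial sum (an editorial addition, not part of the original solution):

```python
from math import comb

# Compare the recurrence, the closed form (1 + (1 - 2p)**n) / 2, and the binomial sum.
p = 0.3
P = 1.0                                   # P_0 = 1
for n in range(1, 11):
    P = p * (1 - P) + (1 - p) * P         # recurrence
    closed = 0.5 * (1 + (1 - 2 * p) ** n)
    binom_sum = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(0, n + 1, 2))
    assert abs(P - closed) < 1e-12 and abs(P - binom_sum) < 1e-12
print("recurrence, closed form and binomial sum agree for n = 1, ..., 10")
```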

9. Uncertain searches. The Snitch in the game of Quidditch is famously elusive. It is hidden in one of n boxes, location uncertain, and is in box k with probability pk . If it is in box k, a search of the box will only reveal it with probability αk . Determine the conditional probability that it is in box k given that a search of box j has not revealed it. [Hint: Consider the cases j = k and j 6= k separately.] S OLUTION : Write Aj for the event that the Snitch is in box j and Rj for the event that a search of box j unearths it. We are given that P(Aj ) = pj and P(Rj | Aj ) = αj so that, by total probability, P(Rj ) = P(Rj | Aj ) P(Aj ) + P(Rj | A{j ) P(A{j ) = αj pj + 0 · (1 − pj ) = αj pj


or, equivalently, P(R{j ) = 1 − αj pj for each j. C ASE 1: j = k. We have P(Ak | R{k ) =

P(R{k | Ak) P(Ak) / P(R{k) = (1 − αk)pk / (1 − αk pk).

C ASE 2: j 6= k. In this case, P(Ak | R{j ) =

P(R{j | Ak) P(Ak) / P(R{j) = pk / (1 − αj pj)

as it is certain that a search of box j will not discover the Snitch if it is in box k.

10. Total probability. Prove from first principles that P(A | H) = P(A | B ∩ H) P(B | H) + P(A | B{ ∩ H) P(B{ | H) whenever B ∩ H and B{ ∩ H have non-zero probability. S OLUTION : Starting with the right-hand side, we have

P(A | B ∩ H) P(B | H) + P(A | B{ ∩ H) P(B{ | H)
  = [P(A ∩ B ∩ H)/P(B ∩ H)] · [P(B ∩ H)/P(H)] + [P(A ∩ B{ ∩ H)/P(B{ ∩ H)] · [P(B{ ∩ H)/P(H)]
  = [P(A ∩ B ∩ H) + P(A ∩ B{ ∩ H)]/P(H) = P(A ∩ H)/P(H) = P(A | H).

11. Seven balls are distributed randomly in seven urns. If exactly two urns are empty, show that the conditional probability of a triple occupancy of some urn equals 1/4. S OLUTION : This problem is trickier than it appears. A systematic approach to it which avoids the Scylla and Charybdis of under- and over-counting is to model it as a double application of an occupancy problem originating from the Maxwell-Boltzmann distribution [see Problem I.11 in ToP; the approach is explained beautifully in W. Feller, An Introduction to Probability Theory and Its Applications, Volume I, Section II.5, pp. 39–40, New York: John Wiley & Sons, 1968]. Balls may represent particles, communication packets, arrivals, and so on, the corresponding urns representing quantum levels, stacks, queues, and so on. The key is to divide the problem into two phases: first identify the occupancy configuration of the urns and then determine the ball distribution. A distribution of balls into the urns with exactly two urns empty results in an occupancy configuration (k1, . . . , k7), where kj represents the number of balls in urn j, which is of exactly one of two types: the occupancy numbers kj written in decreasing order form either the sequence 3, 1, 1, 1, 1, 0, 0 or the sequence 2, 2, 1, 1, 1, 0, 0, these sequences representing the urn occupancies in some order. Consider first the occupancy sequence


3, 1, 1, 1, 1, 0, 0. The number of ways balls can be selected to achieve such a sequence is the multinomial coefficient

7!/(3! 1! 1! 1! 1! 0! 0!) = 7!/3!.

It only remains to identify the urns: there is one urn with triple occupancy, no urns with double occupancy, four urns with single occupancy, and two empty urns. Working systematically from high occupancy urns to low occupancy urns, these may be specified in

7!/(1! 0! 4! 2!) = 7!/(4! 2!)

ways. Thus, the number of ways in which the occupancy numbers 3, 1, 1, 1, 1, 0, 0 may be realised is given by

[7!/3!] · [7!/(4! 2!)] = 7!²/(2! 3! 4!).    (1)

Working similarly, the number of ways balls can be selected to achieve occupancy numbers 2, 2, 1, 1, 1, 0, 0 is

7!/(2! 2! 1! 1! 1! 0! 0!) = 7!/(2! 2!),

and their distribution into the urns results in no urns with triple occupancy, two urns with double occupancy, three urns with single occupancy, and two empty urns which may be specified in

7!/(0! 2! 3! 2!) = 7!/(2! 3! 2!)

ways. Thus, the number of ways in which the occupancy numbers 2, 2, 1, 1, 1, 0, 0 may be realised is given by

[7!/(2! 2!)] · [7!/(2! 3! 2!)] = 7!²/((2!)⁴ 3!).    (2)

As these two sequences of occupancy numbers are the only ones resulting in exactly two empty cells, the number of ways in which the balls may be distributed into the urns to result in exactly two empty urns is obtained by summing (1) and (2). As the seven balls may be distributed into the seven urns in 7⁷ ways in total, the probability that exactly two empty urns result is given by

[7!²/(2! 3! 4!) + 7!²/((2!)⁴ 3!)] · 7⁻⁷.

Of the configurations resulting in exactly two empty urns, those yielding an urn with triple occupancy are captured by (1) and so, the probability of obtaining a triple occupancy urn and exactly two empty urns is given by

[7!²/(2! 3! 4!)] · 7⁻⁷.


The conditional probability of obtaining a triple occupancy urn given that exactly two urns are empty is hence given by

[7!²/(2! 3! 4!)] 7⁻⁷ ÷ [7!²/(2! 3! 4!) + 7!²/((2!)⁴ 3!)] 7⁻⁷ = (2!)⁴ 3! / [(2!)⁴ 3! + 2! 3! 4!] = 1/4.

Extension: The general approach may be articulated as follows. Each occupancy configuration (k1, . . . , kn) corresponding to the distribution of n balls into n urns results in occupancy numbers νn, νn−1, . . . , ν1, ν0 in some order. The notation indicates that there are νn urns with occupancy n, νn−1 urns with occupancy n − 1, and so on, with ν1 single occupancy urns, and ν0 empty urns. The number of ways the balls can be grouped to achieve such a sequence is given by the multinomial coefficient

n! / [ (n!)^{νn} ((n − 1)!)^{νn−1} · · · (1!)^{ν1} (0!)^{ν0} ],

the occupancy value j appearing νj times in the lower index of the multinomial. (Of course, ν0 + ν1 + · · · + νn = n so that νn, for instance, may be either 0 or 1 only. But the notation is forgiving and implicitly takes into account these constraints.) The corresponding partition of urns with these occupancy numbers may be realised in

n! / (νn! νn−1! · · · ν1! ν0!)

ways. It follows that the number of occupancy configurations resulting in occupancy numbers νn, νn−1, . . . , ν1, ν0 is given by the product

n! / [ (n!)^{νn} ((n − 1)!)^{νn−1} · · · (1!)^{ν1} (0!)^{ν0} ] · n! / (νn! νn−1! · · · ν1! ν0!).

The probability of seeing these occupancy numbers is obtained by dividing the expression on the right by n^n. These calculations are illustrated in Table 1 for the case we started with: n = 7.

12. Die A has four red and two white faces; die B has two red and four white faces. A coin is tossed once privately, out of sight of the player; if it turns up heads then die A is selected and tossed repeatedly; if it turns up tails then die B is selected and tossed repeatedly. (a) What is the probability that the nth throw of the selected die results in a red face? (b) Given that the first two throws are red, what is the probability that the third throw results in a red face? (c) Suppose that the first n throws resulted in successive red faces. What is the probability that the selected die is A?


Occupancy numbers        Number of arrangements     Probability
1, 1, 1, 1, 1, 1, 1      7!²/7!                     0.006120
2, 1, 1, 1, 1, 1, 0      7!²/(2! 5!)                0.128518
2, 2, 1, 1, 1, 0, 0      7!²/((2!)⁴ 3!)             0.321295
2, 2, 2, 1, 0, 0, 0      7!²/((2!)³ (3!)²)          0.107098
3, 1, 1, 1, 1, 0, 0      7!²/(2! 3! 4!)             0.107098
3, 2, 1, 1, 0, 0, 0      7!²/((2!)² (3!)²)          0.214197
3, 2, 2, 0, 0, 0, 0      7!²/((2!)³ 3! 4!)          0.026775
3, 3, 1, 0, 0, 0, 0      7!²/(2! (3!)² 4!)          0.017850
4, 1, 1, 1, 0, 0, 0      7!²/((3!)² 4!)             0.035699
4, 2, 1, 0, 0, 0, 0      7!²/(2! (4!)²)             0.026775
4, 3, 0, 0, 0, 0, 0      7!²/(3! 4! 5!)             0.001785
5, 1, 1, 0, 0, 0, 0      7!²/(2! 4! 5!)             0.005355
5, 2, 0, 0, 0, 0, 0      7!²/(2! (5!)²)             0.001071
6, 1, 0, 0, 0, 0, 0      7!²/(5! 6!)                0.000357
7, 0, 0, 0, 0, 0, 0      7!²/(6! 7!)                0.000008

Table 1: Random distributions of 7 balls into 7 urns.
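The table is also easily reproduced by brute-force enumeration of all 7⁷ equally likely assignments; the short Python sketch below (an editorial check, not part of the original solution) tallies the occupancy patterns and recovers the probabilities listed above.

```python
from collections import Counter
from itertools import product

# Distribute 7 balls into 7 urns in all 7**7 equally likely ways and tally the
# occupancy pattern (urn counts sorted in decreasing order).
patterns = Counter()
for assignment in product(range(7), repeat=7):
    counts = [assignment.count(urn) for urn in range(7)]
    patterns[tuple(sorted(counts, reverse=True))] += 1

total = 7 ** 7
for pattern, ways in sorted(patterns.items()):
    print(pattern, ways, round(ways / total, 6))
```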


S OLUTION : (a) Let Rn denote the event that the nth throw of the selected die shows red. Then

P(Rn) = P(die results in red | coin showed heads) P(coin showed heads) + P(die results in red | coin showed tails) P(coin showed tails) = (2/3)(1/2) + (1/3)(1/2) = 1/2.

(b) With the notation of part (a),

P(R3 | R1 ∩ R2) = P(the first three throws are red)/P(the first two throws are red) = [½(2/3)³ + ½(1/3)³] / [½(2/3)² + ½(1/3)²] = 3/5.

(c) In a slight misuse of notation, write A for the event that the chosen die is A. Then

P(A | R1 ∩ · · · ∩ Rn) = P(R1 ∩ · · · ∩ Rn | A) P(A) / P(R1 ∩ · · · ∩ Rn) = (2/3)ⁿ · ½ / [(2/3)ⁿ · ½ + (1/3)ⁿ · ½] = 2ⁿ/(2ⁿ + 1) = 1/(1 + 2⁻ⁿ).

The result is quite intuitive—the right-hand side tends to one as n → ∞.

13. A man possesses five coins, two of which are double-headed, one is double-tailed, and two are normal. He shuts his eyes, picks a coin at random, and tosses it. (a) What is the probability that the lower face of the coin is a head? (b) He opens his eyes and sees that the coin is showing heads; what is the probability that the lower face is a head? He shuts his eyes again, and tosses the coin again. (c) What is the probability that the lower face is a head? (d) He opens his eyes and sees that the coin is showing heads; what is the probability that the lower face is a head? He discards the coin, picks another at random, and tosses it. (e) What is the probability that it showed heads? S OLUTION : It will be convenient to introduce notation for various events. Write H, T , and N to mean that the first selected coin was a double-headed coin, a double-tailed coin, and a normal coin, respectively. For j = 1, 2, 3, let Lj and Uj denote the events that on the jth toss the lower face showed heads and the upper face showed heads, respectively. (a) In notation, we are interested in the probability of the event L1 . By total probability, P(L1 ) = P(L1 | H) P(H) + P(L1 | T ) P(T ) + P(L1 | N) P(N) =1·

(2/5) + 0 · (1/5) + (1/2) · (2/5) = 3/5.

By the same argument (or by the symmetry of the situation), we also see that P(U1 ) = P(L1 ) = 3/5.


(b) In our notation, we are now interested in the probability of the event L1 given that U1 has occurred. But the event L1 ∩ U1 occurs if, and only if, the selected coin is double-headed, that is to say, the event H has occurred. Hence P(L1 | U1 ) =

P(L1 ∩ U1)/P(U1) = P(H)/P(U1) = (2/5)/(3/5) = 2/3.

(c) After the second toss of the chosen coin, results yet unseen, in notation, we are now interested in the occurrence of the event L2 given that U1 has occurred: P(L2 | U1 ) = P(L2 ∩ U1 )/ P(U1 ). We can simplify the total probability argument by the realisation that the event L2 ∩U1 cannot occur if the double-tailed coin has been selected, that is to say, given the event T . As the outcomes of the tosses are independent given the coin, P(L2 ∩ U1 ) = P(L2 ∩ U1 | H) P(H) + P(L2 ∩ U1 | T ) P(T ) + P(L2 ∩ U1 | N) P(N) = 1 ·

(2/5) + 0 · (1/5) + (1/4) · (2/5) = 1/2.

(As an aside of this calculation we observe, by an entirely similar line of reasoning, that P(U2 ∩ U1 ) = 1/2.) It follows that P(L2 | U1 ) =

(1/2)/(3/5) = 5/6.

(d) Once more, by total probability, P(U1 ∩ U2 ∩ L2 ) = P(U1 ∩ U2 ∩ L2 | H) P(H) + P(U1 ∩ U2 ∩ L2 | T ) P(T ) + P(U1 ∩ U2 ∩ L2 | N) P(N) = 1 ·

(2/5) + 0 · (1/5) + 0 · (2/5) = 2/5.

By conditioning on seeing an upper face of heads on each of the first two tosses, we hence obtain P(L2 | U1 ∩ U2) = P(U1 ∩ U2 ∩ L2)/P(U1 ∩ U2) = (2/5)/(1/2) = 4/5. (e) Conditioned on U1 ∩ U2, the event L2 occurs if, and only if, the coin is double-headed, that is to say, the event H has occurred. Thus, the probability that a double-headed coin is discarded is 4/5 and, as there is no possibility of his discarding a double-tailed coin given that he has observed two heads, the probability of discarding a normal coin is 1/5. Conditioned on the discard, we see then that P(U3 | U1 ∩ U2) = P(U3 | H ∩ U1 ∩ U2) P(H | U1 ∩ U2) + P(U3 | N ∩ U1 ∩ U2) P(N | U1 ∩ U2) = P(U3 | H) ·

(4/5) + P(U3 | N) · (1/5).

Given that a double-headed coin is discarded, the conditional probability of seeing a head on the toss of the next selected coin is 1 · (1/4) + 0 · (1/4) + (1/2) · (2/4) = 1/2 by conditioning further on which of the four remaining coins is selected next. Likewise, given that a


normal coin is discarded, the conditional probability of seeing a head on the toss of the next selected coin is 1 · (2/4) + 0 · (1/4) + (1/2) · (1/4) = 5/8. Collecting terms,

P(U3 | U1 ∩ U2) =

(1/2) · (4/5) + (5/8) · (1/5) = 21/40.
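A Monte Carlo check of the five answers (an editorial sketch, not part of the original solution; the exact values are 3/5, 2/3, 5/6, 4/5 and 21/40):

```python
import random

# Coins: two double-headed, one double-tailed, two normal.
COINS = [("H", "H"), ("H", "H"), ("T", "T"), ("H", "T"), ("H", "T")]

def toss(coin):
    i = random.randrange(2)          # which physical face lands up
    return coin[i], coin[1 - i]      # (upper face, lower face)

N = 200_000
c = {"L1": 0, "U1": 0, "U1L1": 0, "U1L2": 0, "U1U2": 0, "U1U2L2": 0, "U1U2U3": 0}
for _ in range(N):
    coins = COINS[:]
    random.shuffle(coins)
    first = coins.pop()                    # coin picked and tossed twice
    u1, l1 = toss(first)
    u2, l2 = toss(first)
    u3, _ = toss(random.choice(coins))     # first coin discarded, another picked
    c["L1"] += l1 == "H"
    c["U1"] += u1 == "H"
    c["U1L1"] += u1 == "H" and l1 == "H"
    c["U1L2"] += u1 == "H" and l2 == "H"
    c["U1U2"] += u1 == "H" and u2 == "H"
    c["U1U2L2"] += u1 == "H" and u2 == "H" and l2 == "H"
    c["U1U2U3"] += u1 == "H" and u2 == "H" and u3 == "H"

print("P(L1)          ~", c["L1"] / N)              # 3/5
print("P(L1 | U1)     ~", c["U1L1"] / c["U1"])      # 2/3
print("P(L2 | U1)     ~", c["U1L2"] / c["U1"])      # 5/6
print("P(L2 | U1,U2)  ~", c["U1U2L2"] / c["U1U2"])  # 4/5
print("P(U3 | U1,U2)  ~", c["U1U2U3"] / c["U1U2"])  # 21/40
```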

14. Random selection. In a school community families have at least one child and no more than k children. Let nj denote the number of families having j children and let n1 + · · · + nk = m be the number of families in the community. A child representative is picked by first selecting a family at random from the m families and then picking a child at random from the children of that family. What is the probability that the chosen child is a first born? If, alternatively, a child is chosen directly by random selection from the pool of children, what is the probability that the child is a first born? Show that the first procedure leads to a larger probability of selecting a first born. [Hint: It will be necessary to prove the identity ∑_{i=1}^{k} i ni ∑_{j=1}^{k} nj/j ≥ ∑_{i=1}^{k} ni ∑_{j=1}^{k} nj to establish the final part of the problem.] S OLUTION : Write P1 and P2 for the probabilities that the child is a first born under the first and second procedures, respectively. By random selection of families, a family with j children is picked with probability nj/m and the selection of a child at random from the chosen family results in the first born with probability 1/j. By summing over all possibilities we see by total probability that

P1 = ∑_j (nj/m)(1/j) = ∑_j (nj/j) / ∑_i ni.

The random selection of a child from the pool of children results in a child in a family with j children with probability j nj / ∑_i i ni and, of the j children in the family, the chances that she is the first born is 1/j. Again, by summing over all possibilities, we see by total probability that

P2 = ∑_j (j nj / ∑_i i ni)(1/j) = ∑_j nj / ∑_i i ni.

To show that P1 ≥ P2 we need to demonstrate that

∑_{j=1}^{k} (nj/j) ∑_{i=1}^{k} i ni ≥ ∑_{j=1}^{k} nj ∑_{i=1}^{k} ni.    (3)

We proceed by induction on k. The induction base k = 1 results in the obvious inequality n1² ≥ n1². As induction hypothesis, suppose now that (3) holds for some k and consider the situation for k + 1. By isolating the terms corresponding to the index k + 1 in the sums, we see that

∑_{j=1}^{k+1} (nj/j) ∑_{i=1}^{k+1} i ni − ∑_{j=1}^{k+1} nj ∑_{i=1}^{k+1} ni
  = ( ∑_{j=1}^{k} (nj/j) ∑_{i=1}^{k} i ni + ∑_{j=1}^{k} nj n_{k+1} (k+1)/j + ∑_{i=1}^{k} n_{k+1} ni i/(k+1) + n_{k+1}² )
      − ( ∑_{j=1}^{k} nj ∑_{i=1}^{k} ni + ∑_{j=1}^{k} nj n_{k+1} + ∑_{i=1}^{k} n_{k+1} ni + n_{k+1}² )
  = ( ∑_{j=1}^{k} (nj/j) ∑_{i=1}^{k} i ni − ∑_{j=1}^{k} nj ∑_{i=1}^{k} ni ) + ∑_{j=1}^{k} nj n_{k+1} ( (k+1)/j + j/(k+1) − 2 )
  = ( ∑_{j=1}^{k} (nj/j) ∑_{i=1}^{k} i ni − ∑_{j=1}^{k} nj ∑_{i=1}^{k} ni ) + ∑_{j=1}^{k} [ nj n_{k+1} / (j(k+1)) ] (k + 1 − j)² ≥ 0

as, by induction hypothesis, the first term on the right is positive, as is the second term because it is a sum of squares. This completes the induction.

15. Random number of dice. A random number of dice, say, N in number is selected from a large collection of dice. The number N has distribution P{N = k} = 2−k for k ≥ 1. The selected dice are thrown and their scores added to form the sum S. Determine the following probabilities: (a) N = 2 given that S = 4; (b) S = 4 given that N is even; (c) the largest number shown by any die is r. S OLUTION : (a) The event {S = 4} can occur only if N = 1, 2, 3, or 4. When N = 1 this can only happen if a four is thrown; when N = 2 this can only happen if one of the three outcomes (1, 3), (2, 2), and (3, 1) results; when N = 3 this can only happen if one of the three outcomes (2, 1, 1), (1, 2, 1), and (1, 1, 2) results; and when N = 4 this can only happen if all four dice show 1. Summing over the cases, we see by total probability that P{S = 4} = P{S = 4 | N = 1} P{N = 1} + P{S = 4 | N = 2} P{N = 2} + P{S = 4 | N = 3} P{N = 3} + P{S = 4 | N = 4} P{N = 4} =

(1/6) · 2⁻¹ + (3/6²) · 2⁻² + (3/6³) · 2⁻³ + (1/6⁴) · 2⁻⁴ = 2197/20736,

or about 10%. It follows that P{N = 2 | S = 4} =

P{S = 4 | N = 2} P{N = 2} / P{S = 4} = (3/6²) · 2⁻² ÷ (2197/20736) = 432/2197,

or about 20%. For later reference, we see that, by a similar process of calculation, P{N = 4 | S = 4} =

P{S = 4 | N = 4} P{N = 4} / P{S = 4} = (1/6⁴) · 2⁻⁴ ÷ (2197/20736) = 1/2197.

(b) We begin by observing that the probability that N takes an even value is given by the geometric series

P{N is even} = 2⁻² + 2⁻⁴ + 2⁻⁶ + · · · = 2⁻²/(1 − 2⁻²) = 1/3.

Now if S = 4 and N is even then N can only take values 2 or 4. Thus, by total probability,

P{S = 4 | N is even} = P{N is even | S = 4} P{S = 4} / P{N is even}
  = (P{N = 2 | S = 4} + P{N = 4 | S = 4}) P{S = 4} / P{N is even}
  = (432/2197 + 1/2197) · (2197/20736) / (1/3) = 433/6912,

or about 6%.
(c) Let X1, . . . , XN denote the face values of the dice thrown. Given that N = k, suppose that the largest number shown by any die is r. Numbering the dice from 1 through k, there must then be a smallest die index j for which the face value is r. This means that the dice numbered 1 through j − 1 show face values strictly less than r, die j shows face value r, and dice numbered j + 1 through k show face values no larger than r. The probability of this (conditioned on the event N = k) is

((r − 1)/6)^{j−1} · (1/6) · (r/6)^{k−j} = (r^{k−1}/6^k) (1 − 1/r)^{j−1}.

Summing over the values 1 ≤ j ≤ k, by additivity, we see that

P{max{X1, . . . , XN} = r | N = k} = (r^{k−1}/6^k) ∑_{j=1}^{k} (1 − 1/r)^{j−1}
  = (r^{k−1}/6^k) · [1 − (1 − 1/r)^k] / [1 − (1 − 1/r)] = (r^k/6^k) [1 − (1 − 1/r)^k].

Removing the conditioning by summing over N, we finally obtain by total probability that

Pr := P{max{X1, . . . , XN} = r} = ∑_{k=1}^{∞} P{max{X1, . . . , XN} = r | N = k} P{N = k}
  = ∑_{k=1}^{∞} [ (r/12)^k − ((r − 1)/12)^k ] = (r/12)/(1 − r/12) − ((r − 1)/12)/(1 − (r − 1)/12)
  = r/(12 − r) − (r − 1)/(12 − (r − 1)).

The values are shown tabulated in Table 2.

r     1      2      3      4      5      6
Pr    1/11   6/55   2/15   1/6    3/14   2/7

Table 2: The distribution of the maximum of a random number of dice.
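The closed form Pr = r/(12 − r) − (r − 1)/(12 − (r − 1)) is easily confirmed (an editorial check, not part of the original solution):

```python
from fractions import Fraction

# P_r = r/(12 - r) - (r - 1)/(13 - r); the six values should sum to 1.
vals = [Fraction(r, 12 - r) - Fraction(r - 1, 13 - r) for r in range(1, 7)]
print(vals)        # [1/11, 6/55, 2/15, 1/6, 3/14, 2/7]
print(sum(vals))   # 1
```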

16. Suppose A, B, and C are events of strictly positive probability in some probability space. If P(A | C) > P(B | C) and P(A | C{ ) > P(B | C{ ), is it true that P(A) > P(B)? If P(A | C) > P(A | C{ ) and P(B | C) > P(B | C{ ), is it


true that P(A ∩ B | C) > P(A ∩ B | C{ )? [Hint: Consider an experiment involving rolling a pair of dice.] S OLUTION : By the definition of conditional probability, we have P(A | C) > P(B | C) =⇒ P(A ∩ C) > P(B ∩ C), P(A | C{ ) > P(B | C{ ) =⇒ P(A ∩ C{ ) > P(B ∩ C{ ), and it follows that P(A ∩ C) + P(A ∩ C{ ) > P(B ∩ C) + P(B ∩ C{ ) =⇒ P(A) > P(B), so the first implication does hold. Now, consider the throw of a die and let A be the event that the die shows one, B the event that the die shows two, and C the event that the die shows a number less than four. Then P(A | C) = 1/3 > 0 = P(A | C{ ) and P(B | C) = 1/3 > 0 = P(B | C{ ), but P(A ∩ B | C) = P(A ∩ B | C{ ) = 0, so the second implication can fail.

17. Pólya’s urn scheme. In the urn model of Section 5, what is the probability that the first ball selected was black given that the second ball selected was black? S OLUTION : As in Section II.5 in ToP, let Bj denote the event that the jth ball drawn in Pólya’s urn scheme was black, and, in a slight superfluity of notation, Rj = B{j the event that the jth ball drawn was red. We are given P(B1 ) =

s/(r + s),   P(B2 | B1) = (s + a)/(r + s + a),   and   P(B2 | R1) = s/(r + s + a).

By Bayes’s rule for events, we hence have P(B1 | B2 ) =

P(B2 | B1) P(B1) / [P(B2 | B1) P(B1) + P(B2 | R1) P(R1)]
  = [ (s + a)/(r + s + a) · s/(r + s) ] / [ (s + a)/(r + s + a) · s/(r + s) + s/(r + s + a) · r/(r + s) ]
  = (s + a)/(r + s + a).

18. Continuation. Show by induction that the probability of selecting a black ball in the kth drawing is invariant with respect to k. S OLUTION : An inductive approach. We shall establish by induction that P(Bk ) = s/(r+s). The first step in Pólya’s urn scheme establishes the induction base: P(B1 ) = s/(r + s); and the solution to Problem 17 establishes that P(B2 ) = s/(r + s) as well. As induction hypothesis we now suppose that the claimed result is valid for some k for all choices of r, s, and a. Consider the effect of the choice in the first step: by total probability, P(Bk+1 ) = P(Bk+1 | B1 ) P(B1 ) + P(Bk+1 | R1 ) P(R1 ) s+a s s r s = · + · = . r+s+a r+s r+a+s r+s r+s

39

Conditional Probability

II.18

The second step follows by induction hypothesis as, if a black ball is selected in the first step then there will be r red balls and s + a black balls in the urn prior to the second step and there are k further steps; if, on the other hand, a red ball is selected in the first step then there will be r + a red balls and s black balls in the urn prior to the second step and, again, there are k further steps. This completes the induction. A direct calculation via Vandermonde’s convolution. We begin with the observation that Vandermonde’s convolution [see Problem VIII.1 in ToP] holds for non-integer arguments as well. VANDERMONDE ’ S CONVOLUTION , EXTENDED Suppose α and β are real (or even complex) numbers. Then „ « X„ «„ « α+β α β = . n j n−j j

[The reader should bear in mind that the sum on the right is only formally infinite: our binomial conventions ensure that the summands are identically zero if j < 0 or j > n. In this form the identity is also known as the Chu-Vandermonde identity which at least gives a nod to historical precedence: Shih-Chieh Chu wrote about it in 1303 more than four-and-a-half centuries before Vandermonde’s writings on the subject!] P ROOF : The combinatorial route that was available when α and β are positive integers is shut now but an algebraic approach is almost as easy. We only need to leverage Newton’s binomial theorem [see Problem 4]: if |x| < 1, then « ∞ „ X α+β n=0

n

xn = (1 + x)α+β = (1 + x)α (1 + x)β

=

∞ „ « X α j=0

j

(n←j+k)

=

xj

∞ „ « X β k=0

k

xk =

∞ „ «X ∞ „ « X α β j=0

j

k=0

k

xj+k

« « ∞ „ «X ∞ „ ∞ X n „ «„ X X α β α β xn = xn . j n−j j n−j j=0

n=j

n=0

j=0

The identity follows by comparison of the summands on the left and the right.

I

In the notation of Section II.5 in ToP, let Pn (i) denote the probability that the urn contains i red balls after n epochs. Write ρ = r/a and σ = s/a. If j black balls (and k−1−j red balls) are drawn through the first k−1 steps then, prior to the kth draw there will be s + ja black balls out of a total of r + s + (k − 1)a balls in the urn. By conditioning on the number of black balls drawn through the first k − 1 steps, by total probability, we may, by leveraging II.5.1 in ToP (see the discussion on page 51), write

P(Bk ) =

k−1 X j=0

` ´ Pk−1 r + (k − 1 − j)a ·

s + ja r + s + (k − 1)a

40

II.21

Conditional Probability

=

k−1 X„ j=0

k−1 k−1−j

«

` ´ ρ(ρ + 1) · · · ρ + (k − 1 − j) − 1 · σ(σ + 1) · · · (σ + j − 1) ` ´ (ρ + σ)(ρ + σ + 1) · · · ρ + σ + (k − 1) − 1 σ+j ρ + σ + (k − 1) ´`−(σ+1)´ ×

k−1 σ X = ρ + σ j=0

`ρ+(k−1−j)−1´`(σ+1)+j−1´ k−1−j j `ρ+(σ+1)+(k−1)−1´ k−1

k−1 σ X = ρ + σ j=0

But, by Vandermonde’s convolution, we see that ! ! k−1 X −ρ −(σ + 1) = k−1−j j j=0 and we conclude that P(Bk ) =

−ρ k−1−j j `−(ρ+σ+1)´ k−1

`

.

! −(ρ + σ + 1) , k−1

σ s = ρ+σ r+s

is invariant with respect to k.

19. Continuation. Suppose m < n. Show that the probability that black balls are drawn on the mth and nth trials is s(s + a)/(r + s)(r + s + a); the probability that a black ball is drawn on the mth trial and a red ball is drawn on the nth trial is rs/(r + s)(r + s + a). S OLUTION : Write Qm,n (r, s, a) for the probability that, in Pólya’s urn scheme with parameters r, s, and a, the mth and nth trials yield two black balls. Likewise write Pm,n (r, s, a) for the probability that the mth and nth trials yield a black ball and a red ball, respectively. We proceed by induction on m and n. As induction hypothesis, suppose Qm−1,n−1 (α, β, γ) = β(β+γ)/(α+β)(α+β+γ) for any choice of positive integers α, β, and γ. By conditioning on the result of the first trial, by total probability, we see that s r + Qm−1,n−1 (r + a, s, a) · r+s r+s (s + a)(s + 2a) s(s + a) s(s + a) s r = · + · = , (r + s + a)(r + s + 2a) r + s (r + s + a)(r + s + 2a) r + s (r + s)(r + s + a) Qm,n (r, s, a) = Qm−1,n−1 (r, s + a, a) ·

completing the induction. We may proceed by induction, likewise to establish the second part of the claim. Alternatively, by additivity, P(Bm ) = Pm,n (r, s, a) + Qm,n (r, s, a), and so Pm,n (r, s, a) =

s(s + a) s rs − = r + s (r + s)(r + s + a) (r + s)(r + s + a)

by Problem 18.

20. An urn contains n balls, each of a different colour. Balls are drawn randomly from the urn, with replacement. Assuming that each colour is equally likely to be selected, determine an explicit expression for the probability P = P(M, n) that after M ≥ n successive draws one or more of the colours has yet to be seen.

41

Conditional Probability

II.22

21. Continuation. Set M = bn log nc, the logarithm to base e. Find an asymptotic estimate for the probability P(M, n) as n → ∞. What can you conclude for large n? 22. The Bernoulli model of diffusion. Another model of diffusion was suggested by D. Bernoulli. As in the Ehrenfest model, consider two chambers separated by a permeable membrane. A total of N red balls and N black balls are distributed between the two chambers in such a way that both chambers contain exactly N balls. At each succeeding epoch a ball is randomly selected from each chamber and exchanged. The state of the system at any epoch may be represented by the number of red balls in the left chamber. Determine the transition probabilities pjk for this model of diffusion and thence the stationary probabilities uk . [Hint: Use the combinatorial identity of Problem VIII.2.] S OLUTION : Starting from a given state k, at each epoch the state may increase by one, remain the same, or decrease by one, only. Accordingly, for 1 ≤ k ≤ N − 1, the transition probabilities are given by pk,k−1 =

2k(N − k) (N − k)2 k2 , pk,k = , and pk,k+1 = . 2 2 N N N2

The boundary states have transition probabilities p0,1 = 1 and pN,N−1 = 1. We may think of the state evolution as a random walk on states 0, 1, . . . , N with reflecting (n) barriers at either end. Starting from some initial distribution of states, let uk represent the probability that the system is in state k at epoch n. This system represents a birthdeath chain with state evolution governed by (n)

uk

(n−1)

(n−1)

= uk−1 pk−1,k + uk

(n−1)

(0 ≤ k ≤ N).

pkk + uk+1 pk+1,k

(See 6.4)

Let { uk , 0 ≤ k ≤ N } denote the stationary distribution of this chain. We then have uk = uk−1 pk−1,k + uk pkk + uk+1 pk+1,k for each k. Solving the recurrence for k ≥ 1 yields 2

uk = u0

2

2

2

(N−1) N · · · (N−k+2) · (N−k+1) p01 p12 · · · pk−2,k−1 pk−1,k 2 · N2 N2 N2 = u0 N (k−1)2 k2 22 12 pk,k−1 pk−1,k−2 · · · p21 p10 · · · · · N2 N2 N2 N2 „ «2 „ «2 N(N − 1) · · · (N − k + 1) N = u0 = u0 (1 ≤ k ≤ N). (See 6.5) k k!

We may determine u0 by the normalisation condition

PN k=0

uk = 1, whence

„ «2 „ «2 „ «2 –−1 „ «−1 N N N 2N u0 = 1 + + + ··· + = 1 2 N N »

42

II.24

Conditional Probability

(see Problem VIII.2). The stationary distribution of the Bernoulli scheme is hence given by „ «2 N k (0 ≤ k ≤ N). uk = „ « 2N N

23. Three players a, b, and c take turns at a game in a sequence of rounds. In the first round a plays b while c waits for her turn. The winner of the first round plays c in the second round while the loser skips the round and waits on the sideline. This process is continued with the winner of each round going on to play the next round against the person on the sideline with the loser of the round skipping the next turn. The game terminates when a player wins two rounds in succession. Suppose that in each round each of the two participants, whoever they may be, has probability 1/2 of winning unaffected by the results of previous rounds. Let xn , yn , and zn be the conditional probabilities that the winner, loser, and bystander, respectively, in the nth round wins the game given that the game does not terminate at the nth round. (a) Show that xn = 1 1 1 1 2 + 2 yn+1 , yn = 2 zn+1 , and zn = 2 xn+1 . (b) By a direct argument conclude that, in reality, xn = x, yn = y, and zn = z are independent of n and determine them. (c) Conclude that a wins the game with probability 5/14. S OLUTION : (a) Condition on what happens in round n + 1: if the winner of round n wins that round as well then she wins the game directly; if she loses round n + 1 then she becomes a bystander and her chances of subsequently winning are, by definition, yn+1 . Accordingly, by total probability, xn = 12 + 12 yn+1 . Likewise, the loser of round n becomes a bystander in round n + 1, and she can only win eventually if the winner of round n loses in round n + 1: accordingly, yn = 12 zn+1 . Finally, the bystander in round n must win in round n + 1 to have a chance of eventually winning and so zn = 21 xn+1 . (b) The trials are independent, each trial resetting the situation, the winner, loser, and bystander of the trial having the same future chance of winning as the winner, loser, and bystander of the previous trial. It follows that xn = x, yn = y, and zn = z do not depend on n. We thus have a system of simultaneous equations x=

1 2

+ 21 y, y =

1 z, 2

and z =

1 x, 2

which leads to the solutions x = 4/7, y = 2x − 1 = 1/7, and z = x/2 = 2/7. (c) Conditioning on the result of the first trial, we see that for a to win she either wins the first trial and then proceeds to win the game eventually, or loses the first trial and eventually pulls out a win. By total probability again, the probability that a wins is ` ´ 5 given by 21 x + 12 y = 21 47 + 71 = 14 .

24. Gambler’s ruin. A gambler starts with k units of money and gambles in a sequence of trials. At each trial he either wins one unit or loses one unit of money, each of the possibilities having probability 1/2 independent of past history. The gambler plays until his fortune reaches N at which point he leaves

43

Conditional Probability

II.26

with his gains or until his fortune becomes 0 at which point he is ruined and leaves in tears swearing never to betray his mother’s trust again. Determine the probability qk that he is bankrupted. S OLUTION : Let pk denote the probability that the gambler’s fortune reaches N starting with an initial capital of k. Clearly, for 1 ≥ k ≤ N − 1, we have 1 p 2 k+1

pk =

+ 12 pk−1 ,

with boundary conditions p0 = 0 and pN = 1. Proceeding systematically, we see that p2 = 2p1 , p3 = 2p2 − p1 = 3p1 , and, proceeding inductively in this fashion, we obtain pk = kp1 . The boundary condition 1 = pN = Np1 allows us to determine that p1 = 1/N and hence qk = 1 − pk = (N − k)/N.

25. Families. Let the probability pn that a family has exactly n children be αpn when n ≥ 1 with p0 = 1 − αp(1 + p + p2 + · · · ). Suppose that all sex distributions of the n children have the same probability. Show that for k ≥ 1 the probability that a family has exactly k boys is 2αpk /(2 − p)k+1 . S OLUTION : Write bk for the probability that a family has exactly k boys. By total probability, bk =

∞ X

∞ X

„ « n · αpn k n=k « ∞ „ « “ ”n ∞ „ “ p ”k X X p n n + k “ p ”n =α =α . k k 2 2 2

P{k boys | n children} P{n children} =

n=k

2−n

n=0

n=k

We may massage the summands into the form of a negative binomial via the manipulations „ « „ « „ « „ « n+k k+n k+1+n−1 −(k + 1) = = = (−1)n . k n n n The binomial theorem allows us to finish the job: bk = α

« ∞ „ “ p ”k X −(k + 1) “ −p ”n 2

n=0

n

2



“ p ”k “ 2

1−

p ”−(k+1) 2αpk = . 2 (2 − p)k+1

26. Seating misadventures. A plane with seating for n passengers is fully booked with each of n passengers having a reserved seat. The passengers file in and the first one in sits at an incorrect seat selected at random from the n − 1 seats assigned to the other passengers. As each of the subsequent passengers file in they sit in their allotted seat if it is available and sit in a randomly selected empty seat if their seat is taken. What is the probability fn that the last passenger finds his allotted seat free? S OLUTION : By relabelling, if necessary, we may as well suppose that the seats are numbered 1 through n in the order of the arriving passengers.

44

II.26

Conditional Probability

A direct conditional argument: Let A be the event that passenger n takes his designated seat and B the event that passenger 1 takes the seat of passenger n. We wish to determine fn = P(A). By total probability, P(A) = P(A | B) P(B) + P(A | B{ ) P(B{ ) =

n−2 P(A | B{ ) n−1

because we know that P(A | B) = 0 and P(B{ ) = (n − 2)/(n − 1). So all that we need to do is to compute the value of P(A | B{ ), i.e., the probability that passenger n can have his seat given that passenger 1 has not taken it. We claim indeed that P(A | B{ ) = 1/2. The reason is that after the first n − 1 passengers have entered the plane, the only seats that can possibly be vacant are 1 and n (any given seat 2 ≤ k ≤ n − 1 cannot be vacant as the kth arrival would have found it free when he arrived and would have taken it). Moreover, once passenger 1 takes a seat other than 1 or n, when passengers 2 to n − 1 enter the plane, they see no difference between seats 1 and n, in other words there exists a perfect symmetry between the two seats. Therefore, the probability of one of them being vacant once passenger n enters the plane is exactly 1/2. It follows that fn = (n − 2)/2(n − 1). A recursive solution: Conditional arguments like the one given above are admittedly slippery. Here is a recursive approach for the reader who prefers more standard fare. Write gn−k+1 for the probability that passenger n occupies seat n given that passenger 1 occupies seat k. The subscript denotes the number of vacant seats seen by arriving passenger k (if the first passenger occupies seat k then passengers 2 through k − 1 occupy will find their seats vacant and will occupy them so that the kth arrival will see unoccupied seats 1, k + 1, k + 2, . . . , n). By total probability, fn =

n n−1 1 X 1 X gn−k+1 = gn−k+1 n − 1 k=2 n − 1 k=2

as gn−n+1 = 0; it is manifestly impossible for the nth passenger to get his seat if the first passenger has already claimed it. If the kth arrival takes seat 1 then all the remaining passengers (including the nth) get their own seats; if he takes seat j for some k + 1 ≤ j ≤ n − 1 then the situation is replicated (with a smaller number of available seats); and, of course, if he takes seat n then it is impossible for passenger n to get his own seat. By total probability again, gn−k+1 =

n X 1 1 ·1+ gn−j+1 n−k+1 n − k+1 j=k+1

=

„ « n−1 X 1 1+ gn−j+1 n−k+1 j=k+1

(1 ≤ k ≤ n − 1)

with boundary condition g1 = 0. By turning the recursion crank in reverse, we see that with k = n − 1, n − 2, n − 3, . . . , + g1 ) = 12 , ` ´ g3 = 31 (1 + g1 + g2 ) = 13 1 + 0 + 12 = 12 , ` ´ g4 = 41 (1 + g1 + g2 + g3 ) = 14 1 + 0 + 21 + 12 = g2 =

1 (1 2

45

1 , 2

Conditional Probability

II.28

and we are led to the delicious conjecture that gn−k+1 = 1/2 for 2 ≤ k ≤ n − 1. Verification is a simple matter of induction: ˆ ` ´ ˜ 1 1 · 2+(n−1)−k = 12 . 1 + (n − 1) − k · 12 = n−k+1 n−k+1 2 Pn−1 1 n−2 We conclude that fn = n−1 k=2 gn−k+1 = 2(n−1) .

27. Laplace’s law of succession. If r red balls and m − r black balls appear in m successive draws in the Laplace urn model, show that the probability that the (m + 1)th draw is red tends asymptotically with the number of urns to R1 (r + 1)/(m + 2). [Hint: Prove and use the identity I(n, m) := 0 xn (1 − x)m dx = n!m!/(n + m + 1)! by integrating by parts to obtain a recurrence and starting with the known base I(n, 0) = 1/(n + 1) to set up an induction over m.] S OLUTION : For each pair of positive integers µ and ν, define Z1 I(µ, ν) = xµ (1 − x)ν dx. 0

Then a trivial integration shows that Z1 xµ dx =

I(µ, 0) = 0

1 . µ+1

A general solution is now within our grasp as, by one integration by parts, Z1 xµ (1 − x)ν dx

I(µ, ν) := 0

˛1 Z1 xµ+1 (1 − x)ν ˛˛ ν ν = + xµ+1 (1 − x)ν−1 dx = I(µ + 1, ν − 1) ˛ µ+1 µ + 1 µ + 1 0 0 if ν ≥ 1. By induction, it follows quickly that I(µ, ν) =

ν−1 1 ν!µ! ν · ··· · I(µ + ν, 0) = . µ+1 µ+2 µ+ν (µ + ν + 1)!

Write A for the event that r red balls and m − r black balls are appear in m successive draws and let B be the event that the (m + 1)th draw is red. By conditioning on the urn selected, by total probability, we obtain „ « „ «r „ «m−r „ « N X 1 k k m m P(A) = 1− → I(r, m − r), r r N + 1 N N k=0 „ « „ «r „ «m−r „ « N X k k k 1 m m 1− P(A ∩ B) = · → I(r + 1, m − r), r r N + 1 N N N k=0 asymptotically as N → ∞. It follows that P(A ∩ B | A) =

P(A ∩ B) I(r, m − r) (r + 1)!(m − r)! → = P(A) I(r + 1, m − r) (m + 2)!

asymptotically as N → ∞.

46

ffi

r!(m − r)! r+1 = , (m + 1)! m+2

II.28

Conditional Probability

28. The marriage problem. A princess is advised that she has n suitors whose desirability as consort may be rank-ordered from low to high. She does not know their rank-ordering ahead of time but can evaluate the relative desirability of each suitor by an interview. She interviews the suitors sequentially in a random order and at the conclusion of each interview either accepts the suitor as consort or casts him into the wilderness. Her decisions are final; once a suitor is rejected she cannot reclaim him, and once a suitor is accepted then the remaining suitors are dismissed. If she has run through the list and rejected the first n − 1 suitors she sees then she accepts the last suitor whatever his relative desirability. How should see proceed? This is the prototypical problem of optimal stopping in sequential decision theory. A predetermined stopping point m has a chance only of 1/n of catching the best consort. But consider the following strategy: the princess fixes a number 1 ≤ r ≤ n − 1 in advance, discards the first r suitors while keeping track of the desirability of the best candidate in the discarded group, and then accepts the first suitor she encounters in the subsequent interviews whose desirability is higher than the best of the discarded set of r suitors (if there are any such else she accepts the last suitor on the list). Determine the probability Pn (r) that she selects the most desirable suitor. This problem is due to Merrill M. Flood.2 [Hint: Condition on the location of the best suitor in the sequence.] S OLUTION : We may suppose the suitors are ranked in desirability from 1 through n. As they appear in random order, the princess sequentially sees a random permutation Π1 , Π2 , . . . of desirabilities. If the princess has determined ahead of time to accept the kth suitor for some fixed 1 ≤ k ≤ n, then she secures the most eligible suitor if, and only if, Πk = n. The probability of occurrence of this event among all random permutations (Π1 , . . . , Πn ) of (1, . . . , n) is (n − 1)!/n! = 1/n. Consider now the suggested strategy: with 1 ≤ r ≤ n − 1 fixed in advance, the princess accepts the first suitor whose desirability is larger than max{Π1 , . . . , Πr }. Let Br be the event that this strategy captures the most desirable suitor. We work by conditioning on the location of the best suitor. For 1 ≤ k ≤ n, let Ak = {Πk = n} denote the event that the best suitor is the kth member of the sequence. By total probability, P(Br ) =

n X

P(Br | Ak ) P(Ak ).

k=r+1

We may start the sum at r + 1 instead of 1 because, if Πk = n for 1 ≤ k ≤ r, then the princess will have sadly discarded the best suitor with the rest of the first r suitors. Conditioned on the occurrence of Ak = {Πk = n}, the event Br occurs if, and only if, the best of the first k − 1 suitors is in the group of the first r suitors; in notation, max{Π1 , . . . , Πr } = max{Π1 , . . . , Πk−1 }. Now, for any random arrangement of k − 1 2 Variations on this theme have found applications in problems ranging from college admissions to spectrum sharing in cognitive radio. For a classical perspective see D. Gale and L. S. Shapley, “College admissions and the stability of marriage”, The American Mathematical Monthly, vol. 69, no. 1, pp. 9–15, 1962.

47

Conditional Probability

II.30

distinct numbers, the probability that the largest of these is in the first r positions is r/(k − 1). By averaging over all permissible subsequences (Π1 , . . . , Πk−1 we see that P(Br | Ak ) = r/(k − 1). It follows that n X

Pn (r) := P(Br ) =

k=r+1

n−1 r 1 r X1 · = . k−1 n n j=r j

(4)

29. Continuation. By approximating a sum by integrals show that, as n → ∞, she should select r as the closest integer to (n − 1)/e and that with this choice her probability of accepting the best consort is approximately 1/e, almost independent of n. S OLUTION : The function f(x) = 1/x decreases monotonically for x > 0 and so the sum appearing on the right in (4) may be usefully approximated by integrals. We have log

n = r

Zn r

n−1 “ X 1 Z n dx dx n n 1” < < = log = log − log 1 − . x j r−1 r r r−1 x j=r

The approximation gets better and better as r increases: the difference between the upper and lower bounds for the sum is bounded above by 2/r for r ≥ 2. It follows that P(Br ) =

r n log + ξr,n n r

where the error term ξr,n is bounded absolutely by 2/n. The results are non-trivial when r = rn = bαnc has a linear rate of growth with n. Here 0 < α < 1 is any fixed positive constant. In this case, we see that P(Bbαnc ) → −α log α as n → ∞. If we set g(α) = −α log α, we see that g 0 (α) = − log α − 1 and g 00 (α) = −1/α < 0 so that g achieves its maximum when − log α = 1 or α = e−1 . It follows that it is asymptotically optimal for the princess to select r = be−1 nc and with this choice her probability of landing the best suitor converges to e−1 as n → ∞. Surprisingly, she can achieve a constant (approximately one-thirds) chance of landing the best suitor even when n is very large. Compare with the naïve strategy with a pre-determined choice for which the probability of landing Mr. Right goes to zero with n.

30. Optimal decisions. A communications channel has input symbol set {0, 1} and output symbol set {a, b, c}. In a single use of the channel a random input symbol S governed by the a priori probabilities P{S = 0} = 0.7 and P{S = 1} = 0.3 is selected and transmitted. It engenders an output symbol R governed by the transition probabilities p(r | s) := P{R = r | S = s} given in the following table. Determine the optimal decision rule ρ∗ : {a, b, c} → {0, 1} that minimises the probability of decision error and determine this minimum error probability.

p(r | s)    a      b      c
0           0.7    0.2    0.1
1           0.3    0.2    0.5


S OLUTION : Write Err := {ρ(R) 6= S} for the event that there is a decision error. First of all note that P{Err | S = 0} = 0.7ρ(a) + 0.2ρ(b) + 0.1ρ(c), P{Err | S = 1} = 0.3(1 − ρ(a)) + 0.2(1 − ρ(b)) + 0.5(1 − ρ(c)). As a result, we have ˆ ˜ Pρ := P{Err} = 0.7 0.7ρ(a) + 0.2ρ(b) + 0.1ρ(c) ˆ ` ´ ` ´ ` ´˜ + 0.3 0.3 1 − ρ(a) + 0.2 1 − ρ(b) + 0.5 1 − ρ(c) = 0.4ρ(a) + 0.08ρ(b) − 0.08ρ(c) + 0.3. Therefore, the optimal decision rule will be ρ∗ (a) = ρ∗ (b) = 0 and ρ∗ (c) = 1. The minimum error probability over all decision rules is hence given by P∗ = minρ Pρ = 0.22.

31. Continuation, minimax decisions. There are eight possible decision rules for the communications channel of the previous problem. For the given matrix of channel conditional probabilities, allow the input symbol probability q := P{S = 0} to vary over all choices in the unit interval and plot the probability of error for each decision rule versus q. Each of the eight decision rules ρ has a maximum probability of error which occurs for some least favourable a priori probability qρ for the input symbol 0. The decision rule which has the smallest maximum probability of error is called the minimax decision rule. Which of the eight rules is minimax? S OLUTION : Writing Pρ = Pρ(q) to explicitly acknowledge the dependence of the probability of decision error on both the decision rule ρ and the a priori input symbol probability q, as in the previous problem, we may write

Pρ(q) = q [ρ(a) + 0.4ρ(b) + 0.6ρ(c) − 1] + [1 − 0.3ρ(a) − 0.2ρ(b) − 0.5ρ(c)]

and plot Pρ(q) as a function of q for the eight different decision rules (Figure 1). We may note that Pρ(q) is a linear function of q and is hence maximised either at q = 0 or q = 1 depending on the sign of the term α = ρ(a) + 0.4ρ(b) + 0.6ρ(c) − 1. Therefore, we have

max_q Pρ(q) = 1 − 0.3ρ(a) − 0.2ρ(b) − 0.5ρ(c) if α < 0,   and   max_q Pρ(q) = 0.7ρ(a) + 0.2ρ(b) + 0.1ρ(c) if α ≥ 0.

It is easy to see that this maximum error is minimised for the decision rule ρ(a) = 0, ρ(b) = ρ(c) = 1 and this yields a minimax error probability of 0.3.



Figure 1: Decision error probabilities for each of the eight possible decision rules plotted as a function of the a priori symbol probability q. Each decision rule is indexed by the binary triple of values (ρ(a), ρ(b), ρ(c)).
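The eight error lines of Figure 1 are easily recomputed; the sketch below (an editorial addition, not part of the original solution) evaluates each rule's worst-case error and recovers the minimax rule ρ(a) = 0, ρ(b) = ρ(c) = 1 with worst-case error 0.3.

```python
from itertools import product

# Channel transition probabilities p(r | s) for s in {0, 1} and r in {a, b, c}.
P = {0: {"a": 0.7, "b": 0.2, "c": 0.1}, 1: {"a": 0.3, "b": 0.2, "c": 0.5}}

def error(rule, q):
    """Probability of error for decision rule r -> rule[r] when P{S = 0} = q."""
    return sum((q if s == 0 else 1 - q) * P[s][r]
               for s in (0, 1) for r in "abc" if rule[r] != s)

best = None
for bits in product((0, 1), repeat=3):
    rule = dict(zip("abc", bits))
    worst = max(error(rule, q) for q in (0.0, 1.0))   # the error is linear in q
    print(bits, "worst-case error", round(worst, 3))
    if best is None or worst < best[1]:
        best = (bits, worst)
print("minimax rule", best)   # ((0, 1, 1), 0.3)
```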


III A First Look at Independence 1. Mutually exclusive events. Suppose A and B are events of positive probability. Show that if A and B are mutually exclusive then they are not independent. S OLUTION : If two events are mutually exclusive, then P(A ∩ B) = 0. If they are also to be independent, then P(A) P(B) = 0 which implies that P(A) = 0 or P(B) = 0. Thus, mutually exclusive events of strictly positive probability cannot be independent.

2. Conditional independence. Show that the independence of A and B given C neither implies, nor is implied by, the independence of A and B. S OLUTION : Example III.2.5 in ToP shows that conditional independence does not imply (unconditional) independence. Example III.3.7 in ToP shows that (unconditional) independence also does not imply conditional independence.

3. If A, B, and C are independent events, show that A ∪ B and C are independent. S OLUTION : As intersection distributes over union,

P((A ∪ B) ∩ C) = P((A ∩ C) ∪ (B ∩ C)).

Now, for any two events F and G, P(F ∪ G) = P(F) + P(G \ F) = P(F) + P(G) − P(F ∩ G) by two applications of additivity. Identifying F with A ∩ C and G with B ∩ C, we obtain

P((A ∪ B) ∩ C) = P(A ∩ C) + P(B ∩ C) − P(A ∩ B ∩ C)
  = P(A) P(C) + P(B) P(C) − P(A) P(B) P(C)
  = [P(A) + P(B) − P(A) P(B)] P(C) = P(A ∪ B) P(C)

which means that A ∪ B and C are independent.

4. Patterns. A coin turns up heads with probability p and tails with probability q = 1 − p. If the coin is tossed repeatedly, what is the probability that the pattern T, H, H, H occurs before the pattern H, H, H, H?


S OLUTION : Suppose the pattern H, H, H, H occurs for the first time at trial n. If n = 4 then the first occurrence of H, H, H, H precedes that of T, H, H, H. If n > 4 then the outcome of the (n − 4)th trial must be T (else the pattern H, H, H, H would already have been seen on trial n − 1). But then the pattern T, H, H, H will already have been seen at trial n − 1. It follows that the first occurrence of T, H, H, H precedes that of H, H, H, H whenever the first four trials do not all result in heads. The probability that the pattern T, H, H, H occurs before the pattern H, H, H, H is hence 1 − p⁴.
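A quick simulation provides a sanity check on the answer 1 − p⁴; the sketch below (with an arbitrary choice of p) tosses the coin until one of the two patterns appears.

import random

def thhh_first(p):
    """Toss a p-coin until THHH or HHHH appears; return True if THHH appears first."""
    window = []
    while True:
        window.append('H' if random.random() < p else 'T')
        window = window[-4:]
        if window == ['T', 'H', 'H', 'H']:
            return True
        if window == ['H', 'H', 'H', 'H']:
            return False

p, trials = 0.6, 100_000
estimate = sum(thhh_first(p) for _ in range(trials)) / trials
print(estimate, "vs", 1 - p**4)   # the two values agree to about two decimal places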

5. Runs. Let Qn denote the probability that in n tosses of a fair coin no run of three consecutive heads appears. Show that
Qn = (1/2)Qn−1 + (1/4)Qn−2 + (1/8)Qn−3,
with the recurrence valid for all n ≥ 3. The boundary conditions are Q0 = Q1 = Q2 = 1. Determine Q8 via the recurrence. [Hint: Consider the first tail.]
S OLUTION : Let Bn denote the event that in n tosses of a fair coin no run of three consecutive heads appears; Qn = P(Bn). The boundary conditions for a recurrence are given by the observation that Q0 = Q1 = Q2 = 1. Consider now a generic integer n ≥ 3. By conditioning on the outcomes of the first three tosses, we obtain
Qn = P{Bn | first toss is T} P(first toss is T)
+ P{Bn | first = H, second = T} P(first = H, second = T)
+ P{Bn | first, second = H, third = T} P(first, second = H, third = T)
+ P{Bn | first, second, third = H} P(first, second, third = H).
The last term on the right vanishes: if the first three tosses are all heads then a run of three consecutive heads has already appeared, so P{Bn | first, second, third = H} = 0. For each of the remaining terms, once the first tail has occurred the initial tosses have no further bearing on whether a run of three heads appears in the remaining tosses, and consequently, for n ≥ 3, we obtain
Qn = P(Bn−1) P(first toss is T) + P(Bn−2) P(first = H, second = T) + P(Bn−3) P(first, second = H, third = T)
= (1/2)Qn−1 + (1/4)Qn−2 + (1/8)Qn−3.
A numerical pass through the recurrence gives Q8 = 149/256 ≈ 0.5820 (and, two steps further along, Q10 = 63/128 ≈ 0.4922).
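Churning through the recurrence in exact arithmetic reproduces the values just quoted:

from fractions import Fraction

Q = [Fraction(1), Fraction(1), Fraction(1)]          # Q0 = Q1 = Q2 = 1
for n in range(3, 11):
    Q.append(Q[n-1] / 2 + Q[n-2] / 4 + Q[n-3] / 8)

print("Q8  =", Q[8], "=", float(Q[8]))    # 149/256 ≈ 0.5820
print("Q10 =", Q[10], "=", float(Q[10]))  # 63/128  ≈ 0.4922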

6. Equivalent conditions. In some abstract probability space, consider three events A1, B1, and C1, with complements A2, B2, and C2, respectively. Prove that A1, B1, and C1 are independent if, and only if, the eight equations
P{Ai ∩ Bj ∩ Ck} = P(Ai) P(Bj) P(Ck),   i, j, k ∈ {1, 2},   (1)

all hold. Does any subset of these equations imply the others? If so, determine a minimum subset with this property.
S OLUTION : Write A1 = A, B1 = B, and C1 = C. We recall that A, B, and C are independent if, and only if,
P(A ∩ B) = P(A) P(B),   P(A ∩ C) = P(A) P(C),   P(B ∩ C) = P(B) P(C),
P(A ∩ B ∩ C) = P(A) P(B) P(C).   (2)

As a preliminary step observe that P(Aᶜ ∩ B) = P(B) − P(A ∩ B) by additivity of probability measure. It follows that if A and B are independent then P(Aᶜ ∩ B) = P(B) − P(A) P(B) = P(B)[1 − P(A)] = P(Aᶜ) P(B), so that Aᶜ and B are independent. Likewise, A and Bᶜ are independent and Aᶜ and Bᶜ are independent. We will use this repeatedly.
a. Suppose (1) holds to begin with. Then
P(A ∩ B) = P(A ∩ B ∩ C) + P(A ∩ B ∩ Cᶜ) = P(A) P(B) P(C) + P(A) P(B) P(Cᶜ) = P(A) P(B)[P(C) + P(Cᶜ)] = P(A) P(B).
The remaining two conditions to be established in (2) follow analogously. Now suppose (2) holds. Since A is independent of B ∩ C, so is Aᶜ, and hence P(Aᶜ ∩ B ∩ C) = P(Aᶜ) P(B ∩ C) = P(Aᶜ) P(B) P(C). The product rule follows likewise for the events A ∩ Bᶜ ∩ C and A ∩ B ∩ Cᶜ. Entirely similar arguments take care of the remaining equations in (1). For instance,
P(Aᶜ ∩ Bᶜ ∩ C) = P(Aᶜ ∩ C) − P(Aᶜ ∩ B ∩ C) = P(Aᶜ) P(C) − P(Aᶜ) P(B) P(C) = P(Aᶜ) P(Bᶜ) P(C),
where the second equality follows because the independence of Aᶜ and C follows from that of A and C, while the product rule for Aᶜ ∩ B ∩ C has just been established. The product rule for the events Aᶜ ∩ B ∩ Cᶜ and A ∩ Bᶜ ∩ Cᶜ follows similarly. To finish off,
P(Aᶜ ∩ Bᶜ ∩ Cᶜ) = P(Aᶜ ∩ Bᶜ) − P(Aᶜ ∩ Bᶜ ∩ C) = P(Aᶜ) P(Bᶜ) − P(Aᶜ) P(Bᶜ) P(C) = P(Aᶜ) P(Bᶜ) P(Cᶜ),
and this establishes the validity of the system of equations (1).
b. A minimal subset of (1) that suffices to guarantee independence:
P(A ∩ B ∩ C) = P(A) P(B) P(C),
P(A ∩ Bᶜ ∩ C) = P(A) P(Bᶜ) P(C),
P(Aᶜ ∩ B ∩ C) = P(Aᶜ) P(B) P(C),
P(A ∩ B ∩ Cᶜ) = P(A) P(B) P(Cᶜ).   (3)
To show sufficiency, it is enough to show that (3) implies (2). And, indeed, P(A ∩ B) = P(A ∩ B ∩ C) + P(A ∩ B ∩ Cᶜ) = P(A) P(B), and likewise for the other probabilities. To argue that this is a minimal set, suppose there exists a set of three sufficient equations. Then all four of the equations in (2) would be obtainable by linear combinations of these three equations which, in turn, would imply that the four equations in (2) are linearly dependent. But, as seen by Bernstein's example [see Examples III.2.8,9 in ToP], these equations are in general independent of each other. It follows that we need at least four equations from (1) to establish (statistical) independence of A, B, and C.


7. Continuation. Let A1, . . . , An be events in some probability space. For each i, define the events A_i^0 = Ai and A_i^1 = Aiᶜ. Show that the events A1, . . . , An are independent if, and only if, for every sequence of values j1, . . . , jn in {0, 1}, we have
P(⋂_{i=1}^{n} A_i^{j_i}) = ∏_{i=1}^{n} P(A_i^{j_i}).   (4)
S OLUTION : To generalise, for each k it is requisite that
P(An1 ∩ An2 ∩ · · · ∩ Ank) = P(An1) P(An2) · · · P(Ank)   (5)
where each ni is either 1 or 2. With two cases for each ni we have 2^k conditions, but for independence of k events we only need independence of every subset of them with cardinality strictly greater than one, so we need 2^k − k − 1 conditions:
C(k, 2) + C(k, 3) + C(k, 4) + · · · + C(k, k) = 2^k − C(k, 1) − C(k, 0) = 2^k − k − 1.
To show that (4) is equivalent to (5) we may now proceed by induction using the cases n = 2 and n = 3 as base of the induction.

8. Random sets. Suppose A and B are independently selected random subsets of Ω = {1, . . . , n} (not excluding the empty set ∅ or Ω itself). Show that P(A ⊆ B) = (3/4)^n.
S OLUTION : A set of cardinality n has 2^n subsets, so the probability that B is a given subset of Ω is 2^−n. If B has cardinality k, then it has 2^k subsets and so, given that card(B) = k, the probability that A is a subset of B is 2^k · 2^−n. By conditioning on the size of B, by total probability,
P{A ⊆ B} = ∑_{k=0}^{n} P{A ⊆ B | card(B) = k} P{card(B) = k}
= ∑_{k=0}^{n} (2^k/2^n) · C(n, k)/2^n = (1/4^n) ∑_{k=0}^{n} C(n, k) 2^k = (1/4^n)(1 + 2)^n = (3/4)^n.
The penultimate step follows, of course, by the binomial theorem.

9. Continuation. With the random sets A and B as in the previous problem, determine the probability that A and B are disjoint.
S OLUTION : We again condition on the size of B. If card(B) = k, then card(Bᶜ) = n − k, and so if A and B are to be disjoint then A must be one of the 2^{n−k} subsets of Bᶜ. By conditioning on the size of B, we see then that
P{A ∩ B = ∅} = ∑_{k=0}^{n} P{A ∩ B = ∅ | card(B) = k} P{card(B) = k}
= ∑_{k=0}^{n} (2^{n−k}/2^n) · C(n, k)/2^n = (1/2^n) ∑_{k=0}^{n} C(n, k) 2^{−k} = (1/2^n)(1 + 2^{−1})^n = (3/4)^n.
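Both of the (3/4)^n identities are easily confirmed by exhaustive enumeration for small n; the sketch below runs over all 4^n ordered pairs of subsets.

from itertools import product

n = 6
subsets = list(product((0, 1), repeat=n))   # indicator vectors of subsets of {1, ..., n}

total = contained = disjoint = 0
for A in subsets:
    for B in subsets:
        total += 1
        if all(a <= b for a, b in zip(A, B)):
            contained += 1                   # A is a subset of B
        if all(not (a and b) for a, b in zip(A, B)):
            disjoint += 1                    # A and B are disjoint
print(contained / total, disjoint / total, (3 / 4) ** n)   # all three numbers coincide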


10. Non-identical trials. In a repeated sequence of independent trials, the kth trial produces a success with probability 1/(2k + 1). Determine the probability Pn that after n trials there have been an odd number of successes.
S OLUTION : We build intuition by computing the values of Pn for small values of n. By direct enumeration,
P1 = 1/3,
P2 = (1/3)(4/5) + (2/3)(1/5) = 2/5,
P3 = (1/3)(1/5)(1/7) + (1/3)(4/5)(6/7) + (2/3)(1/5)(6/7) + (2/3)(4/5)(1/7) = 3/7.
We now have a shrewd suspicion that Pn = n/(2n + 1). A proof by induction is indicated. Suppose as induction hypothesis that, for some n, Pn = n/(2n + 1). We have
Pn+1 = Pn · (2n + 2)/(2n + 3) + (1 − Pn) · 1/(2n + 3)
= [n/(2n + 1)] · [(2n + 2)/(2n + 3)] + [(n + 1)/(2n + 1)] · [1/(2n + 3)] = (n + 1)(2n + 1)/[(2n + 1)(2n + 3)] = (n + 1)/(2n + 3).

This concludes the induction.
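The closed form can also be checked against the recursion Pn+1 = Pn(2n + 2)/(2n + 3) + (1 − Pn)/(2n + 3) in exact arithmetic:

from fractions import Fraction

P = Fraction(1, 3)                          # P1 = 1/3
for n in range(1, 12):
    assert P == Fraction(n, 2 * n + 1)      # Pn = n/(2n + 1)
    # the parity of the success count flips with probability 1/(2(n+1)+1) on trial n+1
    P = P * Fraction(2 * n + 2, 2 * n + 3) + (1 - P) * Fraction(1, 2 * n + 3)
assert P == Fraction(12, 25)                # P12
print("recursion matches n/(2n+1) for n = 1, ..., 12")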

11. Permutations. Consider the sample space comprised of all k! permutations of the symbols a1, . . . , ak, together with the sequences of k symbols obtained by repeating a given symbol k times. Assign to each permutation the probability 1/(k²(k − 2)!) and to each repetition the probability 1/k². Let Ai be the event that the symbol ai occurs in the ith location. Determine if the events A1, . . . , Ak are pairwise independent and whether they are (jointly) independent.
S OLUTION : For the event Ai to occur either the symbol ai is repeated k times or the outcome of the experiment is a permutation of all k symbols with ai occurring in the ith position. It follows that
P(Ai) = (k − 1)!/(k²(k − 2)!) + 1/k² = (1/k²)[(k − 1) + 1] = 1/k.
The events Ai and Aj can occur jointly if, and only if, the outcome of the experiment is a permutation of the symbols with ai occurring in location i and aj occurring in location j. Accordingly,
P(Ai ∩ Aj) = (k − 2)!/(k²(k − 2)!) = (1/k) · (1/k) = P(Ai) P(Aj).
The events A1, . . . , Ak are hence pairwise independent. However, the joint occurrence of the events A1, . . . , Ak means that the outcome of the experiment results in the unique permutation of the symbols with all symbols in their original position. It follows hence that
P(A1 ∩ · · · ∩ Ak) = 1/(k²(k − 2)!).


The right-hand side equals (1/k)^k = P(A1) · · · P(Ak) only for k = 2, and so the events A1, . . . , Ak are pairwise independent but not jointly independent for k ≥ 3.

12. Independence over sets of prime cardinality. Suppose p is a prime number. The sample space is Ω = {1, . . . , p} equipped with probability measure P(A) = card(A)/p for every subset A of Ω. (As usual, card(A) stands for the cardinality or size of A.) Determine a necessary and sufficient condition for two sets A and B to be independent.
S OLUTION : Suppose the events A and B are independent, that is to say, P(A ∩ B) = P(A) P(B). Clearing denominators, this says that p · card(A ∩ B) = card(A) card(B). The prime p divides the left-hand side and must hence divide card(A) card(B); as p is prime, it divides card(A) or card(B). But card(X) ≤ p for every subset X of Ω, and so either card(A) ∈ {0, p} or card(B) ∈ {0, p}; in other words, at least one of the two sets must be the empty set ∅ or the whole space Ω. This condition is also sufficient: the sample space Ω is independent of every event, and if either of the two sets is empty then both sides of the product rule are zero. Thus A and B are independent if, and only if, at least one of them is trivial, that is to say, equal to ∅ or to Ω.
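An exhaustive check over a small prime makes the dichotomy vivid; for p = 5 the only independent pairs are those in which one of the two sets is empty or all of Ω.

from itertools import combinations

p = 5
omega = frozenset(range(1, p + 1))
subsets = [frozenset(s) for r in range(p + 1) for s in combinations(omega, r)]

def independent(A, B):
    # P(A & B) = P(A) P(B), with P(X) = card(X)/p, after clearing denominators
    return p * len(A & B) == len(A) * len(B)

for A in subsets:
    for B in subsets:
        if independent(A, B):
            assert len(A) in (0, p) or len(B) in (0, p)
print("independence holds exactly when one of the two sets is empty or equals Omega")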

13. Tournaments. Players of equal skill are pitted against each other in a succession of knock-out rounds in a tournament. Starting from a pool of 2^n individuals, players are paired randomly in the first round of the tourney with losers being eliminated. In successive rounds, the victors from the previous rounds are randomly paired and the losers are again eliminated. The nth round constitutes the final with the victor being crowned champion. Suppose A is a given player in the pool. Determine the probability of the event Ak that A plays exactly k games in the tournament. Let B be another player. What is the probability of the event E that A plays B in the tournament?
S OLUTION : Starting with the simple part: for k < n, the event Ak occurs precisely when A wins the first k − 1 games and loses the kth, so P(Ak) = 2^−k; the event An occurs precisely when A wins the first n − 1 games (the outcome of the final does not affect the number of games played), so P(An) = 2^−(n−1).
Now for the second part, place the 2^n individuals randomly in positions 1, 2, 3, . . . , 2^n. The player in position 1 plays the player in position 2, the player in position 3 plays the player in position 4, and so forth. Without loss of generality fix player A in the first position and consider the position of player B. If B is in the second position, then A and B will surely play in the first round. If B is in the third or fourth position, then A and B must both win their first match to go to the next round and play each other. If B is in the 5th, 6th, 7th, or 8th position, each of A and B must win two games in a row to play each other in the third round, and so forth. Define Ci to be the event that B is in a position between 2^{i−1} + 1 and 2^i. For example, C4 is the event that B is in one of the positions 9, 10, 11, . . . , 16. Given the event Ci, both A and B must win i − 1 games to play each other in the ith round. Let Di be the event that both A and B win their i − 1 games before playing each other in the ith round.

P(E) = ∑_{i=1}^{n} P(E | Ci) P(Ci) = ∑_{i=1}^{n} P(Di) P(Ci) = ∑_{i=1}^{n} (1/2^{2i−2}) · (2^{i−1}/(2^n − 1)) = (1/(2^n − 1)) ∑_{i=1}^{n} 1/2^{i−1} = 1/2^{n−1}.

The reader is encouraged to find the symmetry between the question and the way we looked at the problem. The essence of randomness is actually hidden in playing with the position of B.
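Both answers can also be checked by simulating the bracket directly; in the sketch below players 0 and 1 stand in for A and B and the victors are re-paired at random in each round (the choice n = 4 is arbitrary).

import random

def simulate(n, trials=100_000):
    size = 1 << n
    meets, games = 0, [0] * (n + 1)
    for _ in range(trials):
        field = list(range(size))
        random.shuffle(field)                 # random pairing
        played, met = 0, False
        while len(field) > 1:
            survivors = []
            for i in range(0, len(field), 2):
                x, y = field[i], field[i + 1]
                if 0 in (x, y):
                    played += 1
                    met = met or 1 in (x, y)
                survivors.append(x if random.random() < 0.5 else y)   # equal skill
            field = survivors
            random.shuffle(field)             # victors re-paired at random
        meets += met
        games[played] += 1
    print("P(E)  ~", meets / trials, " vs ", 1 / 2 ** (n - 1))
    for k in range(1, n + 1):
        expected = 2.0 ** -(k if k < n else n - 1)
        print("P(A plays exactly", k, "games) ~", games[k] / trials, " vs ", expected)

simulate(4)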

14. Guessing games. A fair coin is tossed. If the outcome is heads then a bent coin whose probability of heads is 5/6 (and probability of tails is 1/6) is tossed. Call this a type I bent coin. If the outcome of the fair coin toss is tails then another bent coin whose probability of tails is 5/6 (and probability of heads is 1/6) is tossed. Call this a type II bent coin. A gambler is shown the result of the toss of the bent coin and guesses that the original toss of the fair coin had the same outcome as that of the bent coin. What is the probability the guess is wrong?
S OLUTION : Write I for the event that the toss of the fair coin resulted in a head, II = Iᶜ for the event that the toss resulted in a tail. Write H1 for the event that the toss of the first bent coin resulted in heads, T1 = H1ᶜ for the event that the toss of the first bent coin resulted in tails. Let G1 be the event that the gambler's guess after the first toss of the bent coin is in error. Then G1 = (I ∩ T1) ∪ (II ∩ H1) and, by total probability,
P(G1) = P(T1 | I) P(I) + P(H1 | II) P(II) = (1/6)(1/2) + (1/6)(1/2) = 1/6.

15. Continuation, majority rule. In the setting of the previous problem, suppose that, following the toss of the fair coin, n type I coins are tossed or n type II coins are tossed depending on whether the outcome of the fair coin toss was heads or tails, respectively. Here n is an odd positive integer. The gambler is shown the results of the n bent coin tosses and uses the majority of the n tosses to guess the outcome of the original fair coin toss. What is the probability that the guess is wrong?
S OLUTION : Let Gn denote the event that the gambler's majority guess after n bent coin tosses is wrong, Gn,k the event that k of the tosses of the bent coin have an outcome opposed to that of the fair coin. The bent coin flips have outcomes that are conditionally independent given the result of the fair coin flip and so
P(Gn,k | I) = P(Gn,k | II) = C(n, k) (1/6)^k (5/6)^{n−k},
as, by the symmetry in the situation, given the outcome of the fair coin toss, the k bent coins with the opposed outcome may be specified in C(n, k) ways, and for each such selection there are k opposed outcomes and n − k outcomes in agreement, these being conditionally independent. By total probability,
P(Gn,k) = P(Gn,k | I) P(I) + P(Gn,k | II) P(II) = C(n, k) (1/6)^k (5/6)^{n−k}.
Now Gn = ⋃_{k>n/2} Gn,k and so, by additivity,
P(Gn) = ∑_{k>n/2} P(Gn,k) = ∑_{k>n/2} C(n, k) (1/6)^k (5/6)^{n−k}.

The law of large numbers [see Section V.6 and Problem V.17 in ToP] shows that P(Gn ) → 0 as n → ∞.
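For a sense of scale, the tail sum is quickly tabulated; the sketch below simply evaluates the displayed formula for a few odd n.

from math import comb

def majority_error(n, p=1/6):
    """P(Gn): probability that more than half of n conditionally i.i.d. tosses oppose the fair coin."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

for n in (1, 3, 5, 11, 21):
    print(n, round(majority_error(n), 6))
# n = 1 recovers the single-toss error 1/6; the error then decays rapidly with n.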

16. Continuation, majority rule with tie breaks. In the setting of Problem 14, suppose that four type I or type II coins are tossed depending on the outcome of the fair coin toss. The gambler again uses a majority rule to guess the outcome of the fair coin toss but asks for four more tosses of the bent coin in the event of a tie. This continues until she can make a clear majority decision. What is the probability of error?
S OLUTION : Let G denote the event that when the tie is finally broken the decision is in error. Given that a type I coin is being tossed, the probability that a clear majority of tails is obtained in four tosses is C(4, 3)(1/6)³(5/6) + C(4, 4)(1/6)⁴ = 7/432, while the probability of a tie is C(4, 2)(1/6)²(5/6)² = 25/216. The probability of k successive ties is hence, by independence, (25/216)^k. The number of ties prior to a clear majority decision is an integer k ≥ 0 and so, by summing over all possibilities, we obtain
P(G | I) = (7/432) ∑_{k=0}^{∞} (25/216)^k = (7/432)/(1 − 25/216) = 7/382.

A completely analogous calculation gives the same conditional probability by conditioning on the event II that the fair coin showed tails. It follows hence that P(G) = 7/382.

17. Noisy binary relays. A binary symmetric channel is a conduit for information which, given a bit x ∈ {0, 1} for transmission, produces a copy y ∈ {0, 1} which is equal to x with probability 1 − p and differs from x with probability p. A starting bit x = x0 is relayed through a succession of independent binary symmetric channels to create a sequence of bits, x0 7→ x1 7→ . . . 7→ xn 7→ · · · , with xn the bit produced by the nth channel in response to xn−1 . Let Pn denote the probability that the bit xn produced after n stages of transmission is equal to the initial bit x. By conditioning on the nth stage of transmission set up a recurrence for Pn in terms of Pn−1 and solve it. Verify your answer by a direct combinatorial argument.


S OLUTION : 1° An inductive approach. Let Cn denote the event that xn is equal to x. By total probability,
P(Cn) = P(Cn | Cn−1) P(Cn−1) + P(Cn | Cn−1ᶜ) P(Cn−1ᶜ).
If the bit is correctly replicated after n − 1 stages then it is correctly replicated after the nth stage if, and only if, an error is not made in the nth stage; if the bit is in error after n − 1 stages then it will be correctly replicated after the nth stage if, and only if, an error is made in the nth transmission. Identifying Pn = P(Cn), we obtain the recurrence
Pn = (1 − p)Pn−1 + p(1 − Pn−1) = p + (1 − 2p)Pn−1   (n ≥ 1)
with the natural boundary condition P0 = 1. Churning through the recurrence, we find
Pn = p + (1 − 2p)Pn−1 = p + (1 − 2p)[p + (1 − 2p)Pn−2] = p + p(1 − 2p) + (1 − 2p)²Pn−2
= p + p(1 − 2p) + (1 − 2p)²[p + (1 − 2p)Pn−3] = p + p(1 − 2p) + p(1 − 2p)² + (1 − 2p)³Pn−3 = · · · .
We discern an emergent geometric series and we conclude that for n ≥ 0 we have
Pn = p ∑_{k=0}^{n−1} (1 − 2p)^k + (1 − 2p)^n = p · [1 − (1 − 2p)^n]/[1 − (1 − 2p)] + (1 − 2p)^n = (1/2)[1 + (1 − 2p)^n],   (6)

as may be readily verified by induction.
2° A combinatorial argument. Write q = 1 − p. A transmitted bit is replicated correctly after n stages if, and only if, the number of bit errors in the sequence x1, . . . , xn is even. As the probability that there are exactly k bit errors is given by bn(k; p) = C(n, k) p^k q^{n−k}, by summing over the possibilities, we see that
Pn = ∑_{k even} C(n, k) p^k q^{n−k}.
To evaluate the sum, we observe per the binomial theorem that
1 = (q + p)^n = ∑_{k=0}^{n} C(n, k) p^k q^{n−k} = ∑_{k even} C(n, k) p^k q^{n−k} + ∑_{k odd} C(n, k) p^k q^{n−k},
(1 − 2p)^n = (q − p)^n = ∑_{k=0}^{n} C(n, k) (−1)^k p^k q^{n−k} = ∑_{k even} C(n, k) p^k q^{n−k} − ∑_{k odd} C(n, k) p^k q^{n−k}.
By adding the two expressions, we discover the combinatorial identity
1 + (1 − 2p)^n = 2 ∑_{k even} C(n, k) p^k q^{n−k},
which recovers (6).
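Formula (6) is also easily spot-checked numerically; the value p = 0.2 below is an arbitrary choice.

from math import comb, isclose

p = 0.2
q = 1 - p
for n in range(0, 15):
    even_sum = sum(comb(n, k) * p**k * q**(n - k) for k in range(0, n + 1, 2))
    closed = 0.5 * (1 + (1 - 2 * p)**n)
    assert isclose(even_sum, closed)
print("even-error binomial sum matches (1 + (1 - 2p)^n)/2 for n = 0, ..., 14")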


18. Gilbert channels.1 A wireless communication link may be crudely modelled at a given point in time by saying that it is either in a good state 0 or in a bad state 1. A chance-driven process causes the channel to fluctuate between good and bad states with possible transitions occurring at epochs 1, 2, . . . . The channel has no memory and, at any transition epoch, its next state depends only on its current state: given that it is currently in state 0 it remains in state 0 with probability p and transits to state 1 with probability q = 1 − p; given that it is currently in state 1 it remains in state 1 with probability p 0 and transits to state 0 with probability q 0 = 1 − p 0 . Initially, the channel is in state 0 with probability α and in state 1 with probability β = 1 − α. Let αn be the probability that the channel is in state 0 at epoch n. Set up a recurrence for αn and thence determine an explicit solution for it. Evaluate the long-time limit α := limn→∞ αn and interpret your result. Propose an urn model that captures the probabilistic features of this problem. S OLUTION : By conditioning on the state of the system at epoch n−1, by total probability, ` ´ αn = pαn−1 + q 0 (1 − αn−1 ) = q 0 + 1 − (q + q 0 ) αn−1 (n ≥ 1). By churning through the recurrence with the boundary condition α0 = α, we obtain ` ´ ` ´2 ` ´n−1 ` ´n αn = q 0 +q 0 1−(q+q 0 ) +q 0 1−(q+q 0 ) +· · ·+q 0 1−(q+q 0 ) + 1−(q+q 0 ) α ˆ ` ´n ˜ ` ´n q 0 1 − 1 − (q + q 0 ) qα − q 0 (1 − α) ` q0 0 ´n ` ´ + 1−(q+q ) α = + 1−(q+q 0 ) , = 0 q+q q + q0 1 − 1 − (q + q 0 ) ˛` ´˛ and as ˛ 1 − (q + q 0 ) ˛ < 1 (eschewing the trivial case when q and q 0 are both identically unit), we see that q0 αn → as n → ∞. q + q0 We may construct an urn model for this problem as follows. Suppose p and p 0 are rational forms: p = m/N and p 0 = m 0 /N. Consider two urns each containing N balls: the first urn contains m black balls and N − m white balls; the second urn contains m 0 white balls and N − m 0 black balls. An urn is initially selected by flipping a bent coin, the first urn selected with probability α, the second with probability 1 − α. A ball is selected randomly with replacement. If its colour is black then the next ball is drawn with replacement from urn 1; if its colour is white then the next ball is drawn with replacement from urn 2. The process is repeated. Then αn is the probability that, after n selections, the next selection is from urn 1.
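Taking the recurrence αn = q′ + (1 − (q + q′))αn−1 derived in the solution above at face value, a few lines of Python confirm the closed form and the limit q′/(q + q′); the numerical values of q, q′, and the initial probability below are arbitrary.

q, qp, alpha0 = 0.3, 0.2, 0.9          # q = 1 - p, q' = 1 - p', initial P{state 0}

alpha = alpha0
for n in range(1, 61):
    alpha = qp + (1 - (q + qp)) * alpha                               # the recurrence
    closed = qp / (q + qp) + (1 - (q + qp))**n * (alpha0 - qp / (q + qp))
    assert abs(alpha - closed) < 1e-9                                 # closed form agrees
print("alpha_60 =", round(alpha, 6), " limit q'/(q+q') =", round(qp / (q + qp), 6))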

19. Recessive and lethal genes. In the language of Section 3, suppose that individuals of type aa cannot breed. This is a simplification of the setting where the gene a is recessive and lethal; for instance, if aa-individuals are born but 1 Named after the communication theorist E. N. Gilbert who first examined these models for wireless communication in the presence of noise.


cannot survive; or, perhaps, via the enforcement of sterilisation laws in a draconian, eugenically obsessed society borrowing from Spartan principles. Suppose that in generation n the genotypes AA, Aa, and aa occur in the frequencies un : 2vn : wn in both male and female populations. Conditioned upon only genotypes AA and Aa being permitted to mate, the frequencies of the genes A and a in the mating population are given by pn = (un + vn)/(1 − wn) and qn = vn/(1 − wn), respectively. Assuming random mating determine the probabilities of the genotypes AA, Aa, and aa in generation n + 1. Assume that the actual frequencies un+1 : 2vn+1 : wn+1 of the three genotypes in generation n + 1 follow these calculated probabilities and thence show the validity of the recurrence 1/qn+1 = 1 + 1/qn for the evolution of the lethal gene a in the mating population of each generation. Show hence that the lethal gene dies out but very slowly.
S OLUTION : Conditioned on individuals of type aa being forbidden to mate, the genotypes AA and Aa appear in the mating pool with proportions un/(1 − wn) and 2vn/(1 − wn), respectively, for both males and females. The only (male, female) genotype pairs permitted to mate are (AA, AA), (AA, Aa), (Aa, AA), and (Aa, Aa), and given random mating of permissible genotypes, these have probabilities (conditioned on being permitted to mate) given in Table 1. The associated conditional genotype probabilities in the filial population are given in Table 2.

Permitted parental pairing in generation n | A priori conditional probability
(AA, AA) | [un/(1 − wn)] · [un/(1 − wn)]
(AA, Aa) | [un/(1 − wn)] · [2vn/(1 − wn)]
(Aa, AA) | [2vn/(1 − wn)] · [un/(1 − wn)]
(Aa, Aa) | [2vn/(1 − wn)] · [2vn/(1 − wn)]

Table 1: Permissible genotype combinations in the mating population in generation n when the gene aa is recessive and lethal.

Transition probabilities | AA | Aa | aa
(AA, AA) | 1 | 0 | 0
(AA, Aa) | 1/2 | 1/2 | 0
(Aa, AA) | 1/2 | 1/2 | 0
(Aa, Aa) | 1/4 | 1/2 | 1/4

Table 2: Conditional genotype probabilities in the filial generation.

By total probability, the probabilities that an offspring


in generation n + 1 is of genotype AA, Aa, and aa, respectively, are then given by
1 · un²/(1 − wn)² + (1/2) · 2unvn/(1 − wn)² + (1/2) · 2unvn/(1 − wn)² + (1/4) · 4vn²/(1 − wn)² = [(un + vn)/(1 − wn)]² = pn²,   (7)
(1/2) · 2unvn/(1 − wn)² + (1/2) · 2unvn/(1 − wn)² + (1/2) · 4vn²/(1 − wn)² = 2 · [(un + vn)/(1 − wn)] · [vn/(1 − wn)] = 2pnqn,   (8)
(1/4) · 4vn²/(1 − wn)² = [vn/(1 − wn)]² = qn².   (9)
Identifying (7), (8), and (9) with un+1, 2vn+1, and wn+1, respectively, we obtain the succinct (approximate) characterisations
un+1 = pn²,   2vn+1 = 2pnqn,   wn+1 = qn².
(These relations are not strictly speaking accurate as the actual proportions of the genotypes AA, Aa, and aa in generation n + 1 will deviate from their expected values given above. But we maintain this convenient fiction which at least has the virtue of being approximately true courtesy the law of large numbers to see where this line of thought leads us.) We hence obtain the recurrence
qn+1 = vn+1/(1 − wn+1) = pnqn/(1 − qn²) = pnqn/[(1 − qn)(1 + qn)] = qn/(1 + qn),
or, what is the same thing,
1/qn+1 = 1 + 1/qn   (n ≥ 0),
for the evolution of the proportion of the lethal gene a in the mating population. Writing q0 = q for the proportion of the lethal gene in the progenitor's population, we see that
1/qn = n + 1/q, or qn = 1/(n + q⁻¹)   (n ≥ 0),
and hence also
wn+1 = qn² = 1/(n + q⁻¹)²   (n ≥ 0).

The lethal gene a and the doomed genotype aa hence do die out but painfully slowly.
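The painfully slow 1/n decay is easy to exhibit numerically from the recurrence qn+1 = qn/(1 + qn); the starting value below is arbitrary.

q0 = 0.4
q = q0
for n in range(0, 51):
    assert abs(q - 1 / (n + 1 / q0)) < 1e-12   # closed form q_n = 1/(n + 1/q0)
    q = q / (1 + q)                            # q_{n+1} = q_n / (1 + q_n)
print("q_50 =", round(1 / (50 + 1 / q0), 6), "; halving q_n takes roughly 1/q_n further generations")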

20. Suppose that the genotypes AA, Aa, aa in a population have frequencies u = p2 , 2v = 2pq, and w = q2 , respectively. Given that a man is of genotype Aa, what is the conditional probability that his brother is of the same genotype? S OLUTION : The genotype distribution is stationary in the sense of Section III.3 of ToP and, accordingly, by Hardy’s law the probability of the event C that a child has genotype Aa is equal to P(C) = 2(uv + uw + v2 + vw) = 2pq, (This is the second line in the display equation under Table III.3.2 on page 80 of ToP.)


Table III.3.1 in ToP shows the a priori probabilities of the parental genotype pairings (G, G′) and the second column of Table III.3.2 in ToP shows the conditional probability of a filial Aa genotype occurring for each parental genotype pairing (G, G′). Write C1 and C2 for the events that the first child and the second child, respectively, are of genotype Aa. Then C1 and C2 are conditionally independent given the parental genotype pairing (G, G′),
P(C1 ∩ C2 | G, G′) = P(C1 | G, G′) P(C2 | G, G′),
the transition probabilities given by the square of the second column in Table III.3.2 in ToP. Table 3 below summarises the salient elements.

Parental pairing (G, G′) | A priori probability | P(C1 | G, G′) P(C2 | G, G′)
(AA, AA) | u² | 0
(AA, Aa) | 2uv | 1/4
(AA, aa) | uw | 1
(Aa, AA) | 2uv | 1/4
(Aa, Aa) | 4v² | 1/4
(Aa, aa) | 2vw | 1/4
(aa, AA) | uw | 1
(aa, Aa) | 2vw | 1/4
(aa, aa) | w² | 0

Table 3: Genotype combinations.

Summing over all possibilities by total probability, we obtain
P(C1 ∩ C2) = 0 · u² + (1/4) · 2uv + 1 · uw + (1/4) · 2uv + (1/4) · 4v² + (1/4) · 2vw + 1 · uw + (1/4) · 2vw + 0 · w²
= uv + 2uw + v² + vw = (uv + uw + v² + vw) + uw = pq + p²q² = pq(1 + pq).
It follows that
P(C2 | C1) = P(C1 ∩ C2)/P(C1) = pq(1 + pq)/(2pq) = (1 + pq)/2.
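The value (1 + pq)/2 can also be checked by a direct enumeration over the nine parental pairings; the sketch below encodes the Mendelian rule that an AA, Aa, or aa parent passes the allele A with probability 1, 1/2, or 0, respectively (the choice p = 3/10 is arbitrary).

from fractions import Fraction as F

p = F(3, 10)
q = 1 - p
freq = {'AA': p * p, 'Aa': 2 * p * q, 'aa': q * q}     # stationary genotype frequencies
passes_A = {'AA': F(1), 'Aa': F(1, 2), 'aa': F(0)}     # chance a parent passes allele A

def child_Aa(g1, g2):
    a1, a2 = passes_A[g1], passes_A[g2]
    return a1 * (1 - a2) + a2 * (1 - a1)               # exactly one A allele inherited

pC1 = sum(freq[g1] * freq[g2] * child_Aa(g1, g2) for g1 in freq for g2 in freq)
pC12 = sum(freq[g1] * freq[g2] * child_Aa(g1, g2) ** 2 for g1 in freq for g2 in freq)

assert pC1 == 2 * p * q                                # Hardy's law: P(C1) = 2pq
assert pC12 / pC1 == (1 + p * q) / 2                   # P(C2 | C1) = (1 + pq)/2
print("P(C2 | C1) =", pC12 / pC1)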

21. Consider the repeated throws of a die. Find a number n such that there is an even chance that a run of three consecutive sixes appears before a run of n consecutive non-six faces.

22. Consider repeated independent trials with three possible outcomes A, B, and C with probabilities p, q, and r, respectively. Determine the probability that a run of m consecutive A's will occur before a run of n consecutive B's.

23. Halmos's monotone class theorem. A monotone class M is a family of sets which is closed under monotone unions and intersections: if { An, n ≥ 1 } is an increasing sequence of sets in M then ⋃n An ∈ M; if { An, n ≥ 1 } is a decreasing sequence of sets in M then ⋂n An ∈ M. Suppose A is an algebra of events, M a monotone class. If A ⊆ M show that σ(A) ⊆ M. [Hint: Mimic the proof of Dynkin's π–λ theorem.]


IV Probability Sieves 1. A closet contains ten distinct pairs of shoes. Four shoes are selected at random. What is the probability that there is at least one pair among the four selected? S OLUTION : Number the pairs from 1 through n and let Aj denote the event that pair j is selected. If pair ` ´ j is selected, the locations of the shoes among the four selected may be specified in 42 ways, the two shoes of the pair may be arranged in two ways in these locations, and the remaining two shoes may be selected in 18 × 17 ways. The four shoes may be selected in 204 = 20 × 19 × 18 × 17 ways. Accordingly, `4´ × 2 × 18 × 17 3 = P(Aj ) = 2 , 204 95 and so S1 =

X

= 10 ·

1≤j≤10

6 3 = . 95 19

Likewise, for j 6= k, the probability that pairs j and k are both selected is given by `4´ ×2×2 P(Aj ∩ Ak ) = 2 204 as, if both pairs are to be selected, after placement of pair j, the remaining two shoes must be those of pair k in one of the two possible orders they may be selected. Thus, „ « `4´ X ×2×2 3 10 2 S2 = P(Aj ∩ Ak ) = = . 2 204 19 × 17 1≤j δ }. n

2

92

V.10

Numbers Play a Game of Chance

The volume of this region may be bounded by the Chebyshev artifice to obtain Z

Z

Z

[0,1]n

[0,1]n

A

n

1A (x) dx ≤

dx =

Vol(Aδ ) =

`1

Sn (x) − δ2

´ 1 2 2

dx

as the integrand on the right exceeds 1 over A and is non-negative elsewhere. Simplifying with the indicated change of variables ξk = 2xk − 1, we obtain Vol(Aδ ) ≤

1 4δ2 n2

1 = 2 2 4δ n =

Z Z

` ´2 2Sn (x) − n dx [0,1]n

` ´2 (2x1 − 1) + (2x2 − 1) + (2xn − 1) dx1 dx2 · · · dxn [0,1]n

1 4δ2 n2 2n

Z

(ξ1 + ξ2 + · · · + ξn )2 dξ1 dξ2 · · · dξn . [−1,1]n

the square in the integrand we obtain n squared terms of the form ξ2k and `Expanding ´ n cross-product terms of the form 2ξj ξk . It is easy to see that the contribution of each 2 of the squared terms to the integral is Z

Z1 ξ2k

n−1

ξ2k

dξ1 · · · dξn = 2

[−1,1]n

dξk = 2

−1

n−1

˛1 ξ3k ˛˛ 2 2n . = 2n−1 = ˛ 3 −1 3 3

The cross-product terms on the other hand contribute nothing to the integral as when j 6= k, Z1 Z1 Z ξj ξk dξ1 · · · dξn = 2n−2 ξj dξj ξk dξk = 0. [−1,1]n

−1

−1

In consequence, we obtain Vol(Aδ ) ≤

1 4δ2 n2 2n

·

n2n 1 = 3 12δ2 n

(2)

as advertised. Armed with this preliminary analysis, let us now turn to a consideration of the difference Z

» „ « „ « Z « „ «– „ 1 1 1 1 Sn (x) dx − f = f Sn (x) dx − f dx. f n 2 n 2 [0,1]n [0,1]n

To show that the difference is small, the idea suggested by (2) is to partition the region 1 of integration into one ` 1two regions: ´ ` ´ where n Sn (x) is close to 1/2 and, whereby, the function values f n Sn (x) and f 12 are close; and a second where the function values are boundedly far apart but over a very small volume. In keeping with this programme, select any tiny  > 0. As f(s) is uniformly continuous in the unit cube [0, 1]n , we may select a positive δ = δ(), independent of s so that |f(s + h) − f(s)| <  whenever |h| < δ.

93

Numbers Play a Game of Chance

V.11

With this choice of δ we may write ˛Z ˛ ˛ ˛

„ « „ «˛ ˛Z « „ «– ˛ » „ ˛ 1 1 ˛˛ ˛˛ 1 1 f Sn (x) dx − f Sn (x) − f f dx˛˛ ˛=˛ n 2 n 2 n n [0,1] [0,1] ˛Z » „ « „ «– « „ «– ˛ Z » „ ˛ ˛ 1 1 1 1 Sn (x) − f Sn (x) − f = ˛˛ f dx + dx˛˛ f n 2 n 2 Acδ Aδ « „ «˛ « „ «˛ Z ˛ „ Z ˛ „ ˛ ˛ 1 ˛ ˛ 1 ˛ ˛f 1 Sn (x) − f 1 ˛ dx ˛f S (x) − f dx + ≤ n ˛ ˛ ˛ n 2 n 2 ˛ Aδ Ac δ

1 by the triangle inequality. Within Acδ the values n Sn (x) differ from 1/2 by no more than δ so that the corresponding function values differ by no more than . Accordingly, we may bound the first term on the right by « „ «˛ Z ˛ „ Z Z ˛ 1 1 ˛˛ ˛f dx = . dx <  dx ≤  S (x) − f n ˛ n 2 ˛ Ac Ac [0,1]n δ

δ

For the second term, we can afford a crude estimate for the difference between the function values. As f is continuous on the unit interval it is bounded in absolute value there by, say, M. It follows that « „ «˛ Z ˛ „ Z ˛ 1 M 1 ˛˛ ˛f dx ≤ 2 , S (x) − f dx ≤ 2M n ˛ n 2 ˛ 6δ n Aδ Aδ with the upper bound on the right less than  eventually, for all sufficiently large n. Putting the pieces together, we obtain ˛Z „ « „ «˛ ˛ 1 1 ˛˛ ˛ f Sn (x) dx − f < 2 ˛ n 2 ˛ n [0,1] for all sufficiently large n. As  may be chosen arbitrarily small, the claimed result follows. Remember this partitioning approach to the problem. It turns out to be widely useful. A little introspection shows that the procedure actually points the way to a much more general result. T HEOREM Suppose f is bounded and continuous on the unit interval. ˆLet the ` 1Sn,t denote ´˜ number of successes in n tosses of a coin with success probability t. Then E f n Sn,t → f(t) uniformly in t. This is, in essence, Weierstrass’s famous theorem on the approximation of continuous functions by polynomials (see Section XVI.5). The probabilistic proof is due to S. N. Bernstein.

11. Using the formula |u| =

1 π

Z∞ −∞

1 − cos(ux) dx, x2 94

(3)

V.11

Numbers Play a Game of Chance

prove first that Z∞ Z 1/√n Z 1 X n n 1 1 − cos(x) 1 1 − cos(x)n rk (t) dt = dx > dx, √ 2 π −∞ x π −1/ n x2 0 k=1

and finally that

Z 1 X n √ dt > A n r (t) k 0 k=1

where the constant A may be chosen as A=

1 π

Z1 −1

1 − e−y y2

2

/2

dy.

P∞ P∞ [Hint: Recall that ex = k=0 xk /k! and cos(x) = k=0 (−1)k x2k /(2k)!; these are, of course, just the familiar Taylor series formulæ for the exponential and the cosine. And for the reader wondering where the formula (3) came from, rewrite the integral in the form u2 2π

Z∞  −∞

sin(ux/2) ux/2

2 dx,

and recognise the fact that the area under the sinc-squared function sin(ξ)2 /ξ2 is π. (Or, for the reader unfamiliar with Fourier analysis, peek ahead and apply Parseval’s equation from Section VI.3 to the rectangular function of Example VI.2.1.)] S OLUTION : The elementary trigonometric identity 1 − cos(t) = 2 sin(t)2 shows that 1 − cos(zx) 2 sin(zx/2)2 z2 sin(zx/2)2 = = · , 2 2 x x 2 (zx/2)2 the function of x on the right being non-negative and integrable for each z. (Indeed, Fourier theory tells us that this is the Fejér kernel, Z∞ „ −∞

sin(πt) πt

«2 dt = 1,

and the identity (3) follows by a change of variable.) With of z we may hence write

Pn k=1

rk (t) playing the rôle

P ˛ Z 1˛X Z1 Z∞ ˛ ˛ n 1 − cos n 1 k=1 xrk (t) ˛ ˛ dt = r (t) dx k ˛ ˛ 2 π x 0 k=1 0 −∞ „ « Z Z n X 1 ∞ 1 1 1 = − 2 cos xrk (t) dt dx, π −∞ x2 x 0 k=1

95

Numbers Play a Game of Chance

V.11

the change in the order of integration permissible because, as noted, the integrand is non-negative and integrable. The temptation to use Euler’s trigonometric formula for the cosine is irresistible, especially with the sum squirelled away inside the argument, and we obtain « „X „ X « Z1 Z Z n n n X 1 1 1 1 cos xrk (t) dt = xrk (t) dt + exp exp − xrk (t) dt. 2 0 2 0 0 k=1 k=1 k=1 All is now set for Viète. We have Z1 0

„ X « Z1 Y n n ` ´ exp ± xrk (t) dt = exp ±xrk (t) dt 0 k=1

k=1

=

n Z1 Y

` ´ exp ±xrk (t) dt = cos(x)n

k=1 0

via the independence of the binary digits! It follows that Z1 cos 0

n X

xrk (t) dt =

1 2

cos(x)n +

1 2

cos(x)n = cos(x)n

k=1

and tracking all the way back we’ve verified the first step in the indicated programme: ˛ Z 1˛X Z∞ ˛ n ˛ 1 − cos(x)n ˛ ˛ dt = 1 dx. r (t) k ˛ ˛ π −∞ x2 0 k=1 The Taylor series for the cosine suggests a profitable truncation in the range of integra2 tion. For sufficiently small x, cos x ≈ 1 − x2 /2 so that cos(x)n ≈ (1 − x‹2√ /2)n ≈ e−nx /2 . The exponential contributes ‹a√negligible amount if |x| much exceeds 1 n so we are led to consider the range |x| ≤ 1 n. The rest is technique. It is clear that we may truncate the range of integration on the right of the previous equation to −n−1/2 ≤ x ≤ n−1/2 and only occasion a decrease in the value of the integral as the integrand is palpably non-negative. Thus, ˛ Z 1˛X Z 1/√n ˛ ˛ n 1 − cos(x)n ˛ ˛ dt ≥ 1 r (t) dx. k ˛ ˛ √ π −1/ n x2 0 k=1 Beginning with the Taylor series x4 x6 x2 + 2 − 3 ,+··· 2 2 2! 2 3! x2 x4 x6 cos(x) = 1 − + − + ··· , 2 4! 6! 2

e−x

/2

=1−

and comparing term by term it is clear that the coefficients of the exponential series are in absolute value larger than the corresponding terms of the cosine series. While it is 2 true that the graph of cos(x) lies below that of e−x /2 for |x| ≤ π/2, a cruder bound

96

V.12

Numbers Play a Game of Chance

will suffice for our purposes. Subtracting the two Taylor series and dropping successive positive contributions after the first term yields the lower bound „ « „ « „ « 2 1 1 1 1 1 1 e−x /2 − cos(x) = x4 2 − − x6 3 − + x8 4 − ··· 2 2! 4! 2 3! 6! 2 4! 8! „ « 1 1 x6 x10 x14 ≥ x4 − − 3 − 5 − 7 − ··· 8 24 2 3! 2 5! 2 7! « „ x4 x6 23 3! 23 3! = − 1 + x4 5 + x8 7 + · · · 12 48 2 5! 2 7! x4 x6 − (1 + x4 + x8 + · · · ) 12 48 x4 x6 = − 12 48(1 − x4 ) ≥

for |x| < 1. It is now easy to verify that the right-hand side is ≥ 0 if s √ 65 1 = 0.939565 · · · . |x| ≤ − + 8 8 ‹√ ‹√ In particular, the intervals −1 n ≤ x ≤ 1 n with n ≥ 2 lie comfortably inside the indicated range and we obtain the further bound ˛ √ Z1 Z √ Z 1˛X 2 2 ˛ n ˛ √ n 1 − e−y /2 1 1/ n 1 − e−nx /2 ˛ ˛ dx = dy = A n rk (t)˛ dt ≥ ˛ 2 2 √ π x π y −1 0 k=1 −1/ n via the indicated change of variable nx = y. A numerical evaluation shows that A = 0.294 · · · , and thus, ˛ Z 1˛X ˛ n ˛ √ ˛ rk (t)˛˛ dt ≥ 0.294 n ˛ 0 k=1

for n ≥ 2. In probabilistic terms, we have shown the validity of the following assertion: if ` ´ √ Sn is the number of successes in n tosses of a fair coin, then E |2Sn − n| ≥ 0.294 n and the expected absolute deviation from the mean value increases as the square-root of n.

Problems 12–21 deal with biased coins and the properties of bent binary digits. 12. Partitions of the unit interval. Suppose 0 < p < 1; set q = 1 − p. The following recursive procedure partitions the unit interval I = [0, 1) into a sequence of finer and finer subintervals, 2 at the first step, 4 at the second, and so (1) (1) on, with 2k at step k. Begin at step 1 by setting I0 = [0, q) and I1 = [q, 1). (k) Now suppose that at step k the 2k intervals Ij1 ,...,jk have been specified where, in lexicographic notation, (j1 , . . . , jk ) sweeps through all 2k binary sequences in (k) {0, 1}k . Each subinterval Ij1 ,...,jk (the parent) at step k engenders two subinter(k+1)

vals (the children) at level k + 1, a left subinterval Ij1 ,...,jk ,0 whose length is a 97

Numbers Play a Game of Chance

V.13 (k+1)

fraction q of the parent and a right subinterval Ij1 ,...,jk ,1 whose length is a frac(k)

tion p of the parent. In more detail, suppose the subinterval Ij1 ,...,jk = [α, β) has left endpoint α and length β − α. Then the left subinterval engendered by   (k+1) it is specified by Ij1 ,...,jk ,0 = α, α + q(β − α) with the corresponding right   (k+1) subinterval given by Ij1 ,...,jk ,1 = α + q(β − α), β . In this fashion, the 2k+1 (k+1)

(k)

subintervals Ij1 ,...,jk ,jk+1 at step k + 1 are engendered. Show that Ij1 ,...,jk has length pj1 +···+jk qk−j1 −···−jk . S OLUTION : Much more real estate is expended in describing the setting of this problem than is needed to do it. The result is almost obvious by construction and a proof by (1) induction on k is both natural and easy. The induction base is trivial as, for k = 1, I0 (1) has length q and I1 has length p. For any binary string j1 , . . . , jk ∈ {0, 1} as induction (k) hypothesis suppose Ij1 ,...,jk has length pr qk−r where r = j1 + · · · + jk is the number (k+1)

of ones in the string and k − r is the number of zeros. By the construction, Ij1 ,...,jk ,1 (k+1)

then has length p · pr qk−r = pj1 +···+jk +1 q(k+1)−j1 −···−jk −1 and Ij1 ,...,jk ,1 has length q · pr qk−r = pj1 +···+jk +0 q(k+1)−j1 −···−jk −0 . This concludes the induction.

13. Continuation, bent binary digits. For each k ≥ 1, define the bent binary (k) digit zk (t; p) by setting zk (t; p) = 0 if t ∈ Ij1 ,...,jk−1 ,0 for some specification of (k)

bits j1 , . . . , jk−1 , and zk (t; p) = 1 if, instead, t ∈ Ij1 ,...,jk−1 ,1 . These functions coincide with the ordinary binary digits zk (t) if p = 1/2. Show that the bent binary digits zk (t; p) (k ≥ 1) are independent for each value of p. (k)

S OLUTION : We may relate the bent binary digits to the intervals Ij1 ,...,jk of the previous problem via the observation ´ ` (k) λ{ t : z1 (t; p) = j1 , . . . , zk (t; p) = jk } = λ Ij1 ,...,jk ´ ´ ` ´ ` ` = pj1 +···+jk qk−j1 −···−jk = pj1 q1−j1 × pj2 q1−j2 × · · · × pjk q1−jk .

(4)

Write j 0 = (j1 , . . . , ji−1 , 1, ji+1 , . . . , jk ) for a generic binary string with ith component fixed at 1. Then [ (k) { t : zi (t; p) = 1 } = Ij1 ,...,ji−1 ,1,ji+1 ,...,jk . j0 (k)

By the recursive construction of the intervals Ij1 ,...,jk it is apparent that the family of  (k) intervals Ij1 ,...,jk : j1 , . . . , jk ∈ {0, 1} is disjoint and partitions the unit interval [0, 1). Accordingly, X

λ{ t : zi (t; p) = 1 } =

` (k) ´ λ Ij1 ,...,ji−1 ,1,ji+1 ,...,jk

j1 ,...,ji−1 ,ji+1 ,...,jk

=

X

pj1 +···+ji−1 +1+ji+1 +···+jk qk−j1 −···−ji−1 −1−ji+1 −···−jk

j1 ,...,ji−1 ,ji+1 ,...,jk

98

V.14

Numbers Play a Game of Chance

=p

„X

pj1 q1−j1

« × ··· ×

„X

pji−1 q1−ji−1

«

ji−1

j1

×

„X

pji+1 q1−ji+1

«

j

„X

« pjk q1−jk .

jk

ji+1

P

× ··· ×

1−j

As j∈{0,1} p q = p + q = 1, each of the terms in round brackets on the right is identically equal to one and so λ{ t : zi (t; p) = 1 } = p. We may repeat the above argument for ji = 0 or, more simply, observe that [0, 1) = { t : zi (t; p) = 1 } ∪ { t : zi (t; p) = 0 } so that, by additivity, λ{ t : zi (t; p) = 0 } = λ[0, 1) − λ{ t : zi (t; p) = 1 } = 1 − p = q. Returning to (4), we hence see that λ{ t : z1 (t; p) = j1 , z2 (t; p) = j2 , . . . , zk (t; p) = jk } = λ{ t : z1 (t; p) = j1 } × λ{ t : z2 (t; p) = j2 } × · · · × λ{ t : zk (t; p) = jk }, or, in words, the bent binary digits are independent.

14. Non-symmetric binomial distribution. With zk (t; p) the bent binary digits of the previous problem, prove that the measure of the set of points t on which z1 (t; p) + · · · + zn (t; p) = k is given by Z 1 π −ikx e (peix + q)n dx 2π −π and explicitly evaluate this expression for each integer k. S OLUTION : Let A denote the set of points t for which z1 (t; p) + · · · + zn (t; p) = k. Then Z1 λ{ t : z1 (t; p) + · · · + zn (t; p) = k } = 1A (t) dt. 0

In view of the elementary integral identity  Z 1 1 2π imx e dx = 2π 0 0

if m = 0, if m ∈ Z \ {0},

we may write Z1 0

1 2π

Z 2π

Z Z 1 2π −ixk 1 ix(z1 (t;p)+···+zn (t;p)) e e dt dx 2π 0 0 Z Z n 1 2π −ixk 1 Y ixzk (t;p) = e e dt dx 2π 0 0 k=1 Z n Z 1 2π −ixk Y 1 ixzk (t;p) = e e dt dx, 2π 0 k=1 0

eix(z1 (t;p)+···+zn (t;p)−k) dx dt = 0

99

Numbers Play a Game of Chance

V.16

the independence of the binary digits permitting the interchange of integral and product in the final step. Now each bent binary digit takes value 1 on a set of Lebesgue measure p and value 0 on a set of Lebesgue measure q. It follows that Z1 eixzk (t;p) dt = peix + q 0

for each k and consequently 1 2π

Z 2π −ixk `

e 0

ix

pe

Z n „ « 1 2π −ixk X n + q dx = e pm eixm qn−m dx m 2π 0 m=1 „ « Z 2π n „ « X 1 n n = eix(k−m) dx = pk qn−k . pm qn−m k m 2π 0 ´n

m=1

15. Continuation, unfair coins. Explain how the bent binary digits zk (t; p) (k ≥ 1) can be used to construct a model for independent tosses of an unfair coin whose success probability is p and failure probability is q = 1 − p. S OLUTION : The correspondence is the natural “bent” analogue of Table 1 in Section V.4, page 149 in ToP. In the general case the labour involved in symmetrisation is not worth Terminology kth trial Sample point Basic event Event Measure

Unfair coin tosses ωk ∈ {T, H} ω = (ωk , k ≥ 1) Cylinder set: I Union of cylinder sets: A Event probability: P(A)

Bent binary digits zk (t; p) ∈ {0, 1} P t = k zk (t; p)(1 − p)k Interval: I Union of intervals: A Lebesgue measure: λ(A)

Table 1: A dictionary of translation modelling the repeated toss of an unfair or bent coin. the candle and we work directly with the bent binary digits instead of a symmetrised “Rademacher” variant.

16. Bernstein polynomials. Suppose f(t) is a continuous function on the unit interval. Show that Z1  n  k  n X z1 (t; x) + · · · + zn (t; x)  dt = f xk (1 − x)n−k . f k n n 0 k=0

For each n, the expression Bn (x) given by the sum on the right defines a function on the unit interval 0 ≤ x ≤ 1 [with the natural convention 00 = 1 giving the values Bn (0) = f(0) and Bn (1) = f(1).] The functions Bn (x) are the famous Bernstein polynomials.

100

V.17

Numbers Play a Game of Chance

S OLUTION : As t varies across the unit interval the sum z1 (t; x) + · · · + zn (t; x) takes values k` in´the` set ´ {0, 1,`. n. .´, n}. It follows that the integrand only takes a finite set of 0 1 values f n ,f n , . . . , f n . By summing over the possibilities, we see that Z1 “ n “k” X z1 (t; x) + · · · + zn (t; x) ” dt = f λ{ t : z1 (t; x) + · · · + zn (t; x) = k } f n n 0 k=0 n “ k ” „n« X xk (1 − x)n−k , = f k n k=0

the final step courtesy Problem 14.

17. Weak law of large numbers. Let Sn be the number of successes in n tosses of a bent coin whose success probability is p. By using Chebyshev’s idea,  1 1 Sn − p ≥ show that S converges in probability to p. In other words, P n n n  → 0 as n → ∞ for every choice of  > 0. ˛  ˛ S OLUTION : Let A = t : ˛z1 (t; p) + · · · + zn (t; p) − np˛ ≥ n . By using Chebyshev’s idea, we know for any  > 0, ˛ ˛ 1 Sn − p˛ ≥  = P{|Sn − np| > n} = P ˛n ≤

Z1 “ 0

Z1 1A (t) dt 0

z1 (t; p) + z2 (t; p) + · · · + zn (t; p) − np ”2 dt. n

(5)

Writing the integrand on the right in the form ` ´2 z1 (t; p) + z2 (t; p) + · · · + zn (t; p) − np ˆ` ´ ` ´ ` ´˜2 = z1 (t; p) − p + z2 (t; p) − p + · · · + zn (t; p) − p) ` ´2 `and expanding ´` the square´show that there are two kinds of elements, zi (t; p) − p and zi (t; p) − p zj (t; p) − p where i 6= j. For terms of the first type, we have Z1 0

` ´2 zi (t; p) − p dt =

Z1 ` 2 ´ zi (t; p) − 2pzi (t; p) + p2 dt 0

Z1

Z1 z2i (t; p)dt − 2

= 0

pzi (t; p)dt + p2 = p − p2 , 0

as, in view of Problems 12,13, Z1

Z1 z2i (t) dt =

0

zi (t) dt = p. 0

101

Numbers Play a Game of Chance

V.18

By a similar factoring inside the integral, if i 6= j, for terms of the second type, we have Z1

Z1 ` ´ zi (t; p)zj (t; p) − pzi (t; p) + p2 dt

` ´` ´ zi (t; p) − p zj (t; p) − p dt = 0

0

0 Z1

0

0

pzj (t; p) dt + p2

pzi (t; p) dt −

zi (t; p)zj (t; p) dt −

=

Z1

Z1

Z1

zi (t; p)zj (t; p) dt − p2 = 0,

= 0

as, by the independence of the bent binary digits (Problem 13), Z1 0

` ´ zi (t; p)zj (t; p) dt = λ { t : zi (t; p) = zj (t; p) = 1 } ` ´ ` ´ = λ { t : zi (t) = 1 } λ { t : zj (t) = 1 } = p2 .

[In the language of L2 -theory, zi (t; p) − p and zj (t; p) − p are orthogonal to each other.] There ` ´are precisely n terms of the first type in (5) and, as there is no contribution from the n terms of the second type, we see that 2 ˛ ˛ 1 P ˛n S n − p˛ ≥  ≤ +

2 2 n2

n Z 1 X 1 (zi (t; p) − p)2 dt 2 n2 j=1 0

X Z 1` ´` ´ p(1 − p) n(p − p2 ) = . zi (t; p) − p zj (t; p) − p dt = 2 n2 n2 0

(6)

1≤i ) → 0, for any  > 0, as n → ∞, or, what is the 1 same thing, n Sn → p in probability.

1 18. Strong law of large numbers. Prove that n Sn converges almost everywhere to p. This substantially strengthens the result of Problem 17.

S OLUTION : The bound on the right in (6) in the solution to Problem 17 decays like 1/n and, as the harmonic series is divergent, this is too slow a rate of decrease for us to be able to use Levi’s theorem (see page 157 of the text). To improve the rate, consider the function „ «4 z1 (t; p) + z2 (t; p) + · · · + zn (t; p) − np fn (t) = . n Expanding the fourth power by the multinomial theorem, we see that the terms of the expansion are of the following generic forms for distinct indices i, j, k, and l: ` ´4 zi (t; p) − p , ´3 ` ´` zi (t; p) − p zj (t; p) − p , ` ´2 ` ´2 zi (t; p) − p zj (t; p) − p , ` ´` ´` ´2 zi (t; p) − p zj (t; p) − p zk (t; p) − p , ` ´` ´` ´` ´ zi (t; p) − p zj (t; p) − p zk (t; p) − p zl (t; p) − p .

102

V.21

Numbers Play a Game of Chance

Proceeding as in the solution to Problem 17, by repeated appeals to orthogonality we obtain Z1 ` ´4 zi (t; p) − p dt = 4p4 − 8p3 + 4p2 , 0

Z1

Z1

` ´` ´3 zi (t; p) − p zj (t; p) − p dt = 0,

0

` ´2 ` ´2 zi (t; p) − p zj (t; p) − p dt = p4 − 2p3 + p2 ,

0

Z1

` ´` ´` ´2 zi (t; p) − p zj (t; p) − p zk (t; p) − p dt = 0,

0

Z1

` ´` ´` ´` ´ zi (t; p) − p zi (t; p) − p zk (t; p) − p zl (t; p) − p dt = 0, 0

and thus, ∞ Z1 X n=1 0

fn (t) dt =

∞ X n(4p4 − 8p3 + 4p2 ) + n4 n=1

`n´ 4 (p − 2p3 + p2 ) 2

.

The summands are of order 1/n2 and so the series is convergent, whence, by Levi’s theP orem, n fn (t) converges a.e. It follows a fortiori, that wherever the series is convergent ˆ ˜4 its terms must go to zero. And so fn (t) = z1 (t; p) + · · · + zn (t; p) − np /n4 → 0 almost ˆ ˜ 1 z1 (t; p) + · · · + zn (t; p) → p a.e., or, what everywhere. This is the same ˛as saying that n ˛ Sn ˛ ˛ is the same thing, P n − p ≥  i.o. = 0 for every  > 0.

19. Weierstrass’s approximation theorem. Use the results of Problem 17 to estimate the measure of the set on which z1 (t; x) + · · · + zn (t; x) − x >  n and prove thence that limn→∞ Bn (x) = f(x) uniformly in 0 ≤ x ≤ 1. This is Bernstein’s original proof of the celebrated theorem of Weierstrass on the approximation of continuous functions by polynomials. 20. Continuation, Lipschitz functions. We say that a real-valued function f is Lipschitz if there exists a constant M such that |f(x) − f(y)| ≤ M|x − y| for all choices of x andy. √If f: [0, 1] → R is Lipschitz with constant M, show that |f(x) − Bn (x)| ≤ M 2 n . 21. Continuation, tightness of the estimate. Let f(x) = |x − 1/2| on the unit interval 0 ≤ x ≤ 1. Use the result of Problem √11 to estimate |f(1/2) − Bn (1/2)| from below and thus show that the order 1 n error estimate of the previous problem is tight.

103

VI The Normal Law R∞ 1. The gamma function. The function Γ (t) = 0 xt−1 e−x dx, defined for all t > 0, plays a particularly important rôle in analysis and is called the gamma function. Evaluate Γ (1) and Γ (1/2). (The latter integral is made transparent by making the change of variable x = y2 /2 inside the integral.) S OLUTION : The first of the two integrals is trivial, Z∞ ˛∞ ˛ Γ (1) = e−x dx = −e−x ˛ = 1. 0

0

Evaluating the second of the integrals can be tricky but the suggested change of variable (admittedly not obvious) magically simplifies matters. As the variable of integration x varies over the positive half-line, the change of variable x = y2 /2 is legitimate, the chain rule showing that dx = y dy. Accordingly, Z∞ Z √ Z ∞ −1 −y2 /2 √ ∞ √ Γ (1/2) = x−1 e−x dx = 2 y e y dy = 2 π φ(y) dy = π 0

0

0

as, by the symmetry of the normal density, the integral in the penultimate step evaluates to 1/2.

2. Another proof of de Moivre’s theorem. Stirling’s formula for the factorial √ n  says that n! 2πn n → 1 as n → ∞. (The reader who does not know e this result will find two elementary proofs from rather different viewpoints in Section XIV.7 and again in Section XVI.6.) By repeated applications show that  −n .q 2 n limn→∞ dn/2e 2 πn = 1. Deduce the local limit theorem of Section 6 and thence de Moivre’s theorem. S OLUTION : In the notation of Section VI.6 of ToP, write  −n „ «  nn!2 n −n n!2 !( 2 )! n ( ) −n 2 β0 = bn (m + k) = ¨ n ˝ 2 = ¨n˝ ˚nˇ = −n ! 2 !  n−1n!2 n+1 2 2 ( 2 )!( 2 )!

if n is even, if n is odd.

The Normal Law

VI.4

for the central term of the binomial. If n approaches infinity through the even positive integers, then by Stirling’s formula, we obtain the asymptotic estimate r √ 2πn nn e−n 2−n 2 β0 ∼ `p . ` n ´n/2 ´2 = n πn −n/2 e 2π 2 2 If n approaches infinity through the odd positive integers then, likewise, √

β0 ∼ q

2π n−1 2

2πn nn e−n 2−n q ∼ ` n−1 ´(n−1)/2 ` n+1 ´(n+1)/2 e−(n−1)/2 · 2π n+1 e−(n+1)/2 2 2 2

r

2 . πn

p It follows that β0 ∼ 2/(πn) as n → ∞, recovering (VI.6.7) in ToP. The rest of the proof of de Moivre’s theorem follows the concluding portion of Section VI.6 in ToP verbatim.

3. Another derivation of the law of large numbers. Derive the weak law of large numbers for coin tosses (V.6.3) from de Moivre’s theorem. S OLUTION : Suppose Sn ∼ Binomial(n, 1/2) represents the accumulated number of suc‹√ cesses in n tosses of a fair coin. As usual, let S∗n = (Sn −n/2) 2n denote the centred and normalised number of successes. Fix any  > 0 and δ > 0. These are our R∞tolerance numbers; they may be chosen arbitrarily small. For the selected δ > 0, as −∞ φ(x) dx = 1, by the definition of the improper Riemann integral, we may select a = a(δ) > 0 determined only by δ so that Z a(δ) 1− φ(x) dx < δ. −a(δ)

But de Moivre’s theorem now tells us that we may select N1 = N1 (δ) sufficiently large (and determined only by the given δ) so that ˛ ˛ Z a(δ) ˛ ˛ ˛P{−a < S∗n < a} − ˛ which choice we see that X  bn1 (m1 + j)bn2 (m2 + k − j) < √ , ν √ |j|≥C

` 2√2 σ21 ´1/2 √ π σ2 

, with

ν

`√ ´ √ eventually, for all sufficiently large ν. When |j| < C ν then j = O ν and k − j = `√ ´ O ν and so the local limit theorem applies to both bn1 (m1 + j) and bn2 (m2 + k − j). It follows that X bn1 (m1 + j)bn2 (m2 + k − j) |j| 1/α and decreases exponentially if β < 1/α. What happens if β = 1/α? [Hint: Write Wn = Wn−1 α1−Xn βXn , where Xn = 0 or 1 only, each with probability 1/2, and take logarithms.] S OLUTION : Define Xn = 0, 1, where Xn = 0 means that there is downtick, and Xn = 1 means there is an uptick. From the problem, by iteration, we have Wn = W0 α(1−X1 )+(1−X2 )+···+(1−Xn ) βX1 +X2 +···+Xn = αn

„ «Sn β α

where, as usual, we write Sn = X1 + X2 + · · · + Xn . Taking the logarithm of both sides of the equation, we see that „ «−t√n/2 „ «t√n/2 β Wn β P < < α (αβ)n α

Zt Sn − n/2 1/α and decreases exponentially fast if β < 1/α. When β = 1/α, we see that log Wn = n log α + Sn log(β/α) = (Sn − n/2) log(β/α) is a zero-mean random variable symmetrically distributed around 0 and there is no systemic bias in either direction.

6. Stability of the Poisson distribution. Suppose X1 , X2 , . . . are the outcomes of repeated independent trials, each Xj taking positive integer values

109

The Normal Law

VI.7

according to the distribution p(k) = P{Xj = k} = e−1 /k! for k ≥ 0. Form the partial sums Sn = X1 + · · · + Xn . Determine the distribution of Sn for each n. [Hint: Induction.] S OLUTION : We prove by induction that Sn has a Poisson distribution with mean n. The induction base is trivial: when n = 1, S1 = X1 is given to be Poisson with mean 1. As induction hypothesis, suppose now that Sk has the Poisson distribution with mean k. As Sk = X1 + · · · + Xk and Xk+1 are independent, Sk+1 = Sk + Xk+1 is the sum of independent Poisson variates, whence

P(Sk+1 = m) =

m m X e−(k+1) X e−1 e−k km−t m! = km−t t! (m − t)! m! t!(m − t)! t=0 t=0 ! m e−(k+1) (k + 1)m e−(k+1) X m m−t t = k 1 = m! t=0 t m!

by the binomial theorem. So Sk+1 is Poisson distributed with mean k + 1 and this completes the induction.

7. Continuation, a central limit theorem for the Poisson distribution. Fix a < b and, for each At be the collection of positive integers k sat√ positive t, let √ isfying t + a t < k < t + b t. Use Markov’s method to prove Laplace’s Rb P k formula limt→∞ k∈At e−t tk! = a φ(x) dx. [Hint: Begin by allowing t to tend to infinity through the positive integers n. With S√n as in the previous problem, consider the normalised variable S∗n = (Sn − n) n and write down a sum for P{a < S∗n < b}. Now consider a decomposition of the summation index k in the form k = k1 + · · · + kn and utilise Markov’s method by exploiting the fact that the trials X1 , . . . , Xn are independent.] S OLUTION : We allow t to tend to infinity through the positive integers n. Writing Sn as sum of n independent Poisson random variables with mean 1, and letting „ f(x) = rect

x − (a + b)/2 b−a



« =

1 0

if a < x < b, otherwise,

we have

Sn − n P a< √ t + a t − P Sdte ≥ t + b t p p   = P Sbtc > btc + a btc + αt − P Sdte ≥ dte + b dte + βt and above by √ √   P Sdte > t + a t − P Sbtc ≥ t + b t p p   = P Sdte > dte + a dte + γt − P Sbtc ≥ btc + b btc + δt where αt , βt , γt , and δt are bounded. (For t sufficiently large we may take them to be less than 1.) By the just-proved result for integral values of t, we see that each of these two expressions tends, as n → ∞, to Z∞ Z∞ Zb φ(x) dx − φ(x) dx = φ(x) dx, a

b

√ √  and hence so does P t + a t < St < t + b t .

a

8. Weyl’s equidistribution theorem. For any real x, let (x) = x mod 1 = x − [x] denote the fractional part of x. (In standard notation, [x] represents the integer part of x; our notation for the fractional part is not standard.) As n steps through the positive integers, the values xn = (nx) then determine a sequence of points in the unit interval. In 1917, H. Weyl proved that if x is irrational then the sequence {xn } is equidistributed in (0, 1). In other words, fix any 0 ≤ a < b ≤ 1 and let ξn = 1 if a < xn < b and let ξn = 0 otherwise. For each n, let νn = (ξ1 + · · · + ξn )/n denote the fraction of the numbers x1 , . . . , xn which fall in (a, b). Then νn → b − a as n → ∞. Prove Weyl’s theorem by introducing the periodic function f(x) of period 1, with the form given by (VI.4.1) in the interval 0 ≤ x < 1, and using Fourier series instead of Fourier integrals. S OLUTION : Begin with some Fourier preliminaries. We work on the circle T of unit circumference obtained by “rolling up” the real line by identifying the points x, x + 1, x + 2, . . . , x − 1, x − 2, . . . and so on with one point on the circumference of T, that is to say, we work on the real line modulo 1 by identifying points on the line with the same fractional part. To each continuous function f : T → C we associate its Fourier coefficients Z b = f(x)ei2πkx dx f(k) (k ∈ Z). T

This is, of course, the discrete analogue of the Fourier transform introduced in Section VI.2 of ToP. Corresponding to THE SIMPLEST FOURIER INVERSION THEOREM of Section VI.3 in ToP, we now have its discrete analogue. THE SIMPLEST FOURIER INVERSION THEOREM FOR FUNCTIONS ON T: Suppose f is a continuous function on T and suppose additionally that the series Σ_k |\hat f(k)| converges. Then
$$ f(x) = \sum_{k}\hat f(k)e^{-i2\pi kx}, $$

the series converging uniformly and absolutely. The proof follows the same pattern as the one given in Section VI.3 but is a bit simpler and the interested reader should attempt to repeat the steps. Suppose a < b. (As we are working on the circle T, what we mean really is that a mod 1 < b mod 1. We henceforth interpret all orderings modulo 1.) Following Markov's method as outlined in Section VI.4 in ToP, introduce the function
$$ f(x) = \begin{cases} 1 & \text{if } a < x < b,\\ 0 & \text{otherwise.}\end{cases} $$
We may then identify ξ_j = f(jx) for j ≥ 1, whence
$$ \nu_n = \frac{1}{n}\sum_{j=1}^{n} f(jx). \tag{4} $$

Following Markov's method, we should attempt to replace f(jx) by its Fourier expansion. An easy integration shows that
$$ \hat f(k) = \int_a^b e^{i2\pi kx}\,dx = \frac{e^{i2\pi kb} - e^{i2\pi ka}}{i2\pi k}, $$
the usual interpretation via l'Hôpital's rule yielding \hat f(0) = b − a in accordance with a direct integration. Writing down a formal Fourier inversion in (4) and blithely changing the order of summation to spy out the lay of the land, we obtain a speculative sequence
$$ \nu_n = \frac{1}{n}\sum_{j=1}^{n} f(jx) \overset{?}{=} \frac{1}{n}\sum_{j=1}^{n}\sum_{k}\hat f(k)e^{-i2\pi kjx} \overset{?}{=} \sum_{k}\hat f(k)\Bigl(\frac{1}{n}\sum_{j=1}^{n}\bigl(e^{-i2\pi kx}\bigr)^{j}\Bigr) = \hat f(0) + \sum_{k\ne0}\hat f(k)\Bigl(\frac{e^{-i2\pi kx} - e^{-i2\pi k(n+1)x}}{n(1-e^{-i2\pi kx})}\Bigr) = \hat f(0) + \sum_{k\ne0}\hat f(k)e^{-i\pi k(n+1)x}\Bigl(\frac{\sin(\pi knx)}{n\sin(\pi kx)}\Bigr) \overset{?}{\approx} \hat f(0), \tag{5} $$

as, with luck, the term in round brackets in the penultimate expression should decay sufficiently fast with n, the absolute convergence of the series Σ_k \hat f(k) carrying the day. The difficulty with this procedure, as we discovered in Section VI.4 of ToP in the case of the transform, is that the Fourier coefficients of f are not particularly well-behaved—the sequence \hat f(k) decays too slowly with k (and, in particular, Σ_k |\hat f(k)| diverges). The problem can be traced to the abrupt corners of the rectangular function f(x) at x = a and x = b. The solution as in Section VI.5 of ToP is to finesse the baulky f altogether by smoothing it out to form continuous upper and lower trapezoidal approximations f+


and f−, the Lévy sandwich functions, for a suitably small 0 < ε < (b − a)/2, as sketched in Figure 1.
[Figure 1: The rectangular function f and its upper and lower trapezoidal approximations f+ and f− forming the Lévy sandwich.]
It is not difficult to compute the Fourier coefficients of f± but it will suffice for our purposes to observe that each trapezoidal function may be written as the difference of two triangular functions [see Example VI.2.3 in ToP]. Now, a straightforward, if mildly tedious, integration by parts shows that if 0 < a < 1/2 the Fourier coefficients of the triangular function
$$ \Delta_a(x) = \Delta\Bigl(\frac{x}{a}\Bigr) = \max\Bigl\{0,\, 1 - \Bigl|\frac{x}{a}\Bigr|\Bigr\} $$
(recall that we work modulo 1 so that this is equivalent to a periodic, triangle tooth function) are given by
$$ \hat\Delta_a(k) = a\Bigl(\frac{\sin(\pi ka)}{\pi ka}\Bigr)^{2} = a\,\operatorname{sinc}(\pi ka)^{2}. $$
As the trapezoidal functions f± are each given as the difference of two triangular functions, by additivity the Fourier coefficients of f± satisfy \hat f_±(k) = O(k^{−2}) and, in particular, Σ_k |\hat f_±(k)| converges for every choice of ε > 0. As the trapezoidal functions f± are continuous by design, by the simplest Fourier inversion theorem for functions on T, it follows that we have
$$ f_\pm(x) = \sum_{k}\hat f_\pm(k)e^{-i2\pi kx}, $$

the series converging uniformly and absolutely. In particular, we may select N = N(ε) sufficiently large so that
$$ \sum_{|k|>N}\bigl|\hat f_\pm(k)\bigr| < \varepsilon \qquad\text{and}\qquad \Bigl|f_\pm(x) - \sum_{k=-N}^{N}\hat f_\pm(k)e^{-i2\pi kx}\Bigr| < \varepsilon $$

for all x ∈ T. We may now completely rigorously modify the steps in (5) by working with the smooth approximations f±. Introduce the smoothed variations of (4):
$$ \nu_{\pm,n} = \frac{1}{n}\sum_{j=1}^{n} f_\pm(jx). $$
As f−(x) ≤ f(x) ≤ f+(x), we see that ν−,n ≤ ν_n ≤ ν+,n and we have neatly sandwiched the quantity to be estimated by smoothed versions. Drawing inspiration from (5), we estimate
$$ \bigl|\nu_{\pm,n} - \hat f_\pm(0)\bigr| = \Bigl|\frac{1}{n}\sum_{j=1}^{n} f_\pm(jx) - \hat f_\pm(0)\Bigr| $$

$$ = \Bigl|\frac{1}{n}\sum_{j=1}^{n}\Bigl(\sum_{k=-N}^{N}\hat f_\pm(k)e^{-i2\pi kjx} + \sum_{|k|>N}\hat f_\pm(k)e^{-i2\pi kjx}\Bigr) - \hat f_\pm(0)\Bigr| \le \Bigl|\frac{1}{n}\sum_{j=1}^{n}\sum_{k=-N}^{N}\hat f_\pm(k)e^{-i2\pi kjx} - \hat f_\pm(0)\Bigr| + \frac{1}{n}\sum_{j=1}^{n}\sum_{|k|>N}\bigl|\hat f_\pm(k)\bigr|. \tag{6} $$

The second of the two terms on the right is < ε by choice of N = N(ε). As for the first term on the right, we have
$$ \frac{1}{n}\sum_{j=1}^{n}\sum_{k=-N}^{N}\hat f_\pm(k)e^{-i2\pi kjx} - \hat f_\pm(0) = \frac{1}{n}\sum_{j=1}^{n}\sum_{\substack{k=-N\\ k\ne0}}^{N}\hat f_\pm(k)e^{-i2\pi kjx} = \sum_{\substack{k=-N\\ k\ne0}}^{N}\hat f_\pm(k)\Bigl(\frac{1}{n}\sum_{j=1}^{n}\bigl(e^{-i2\pi kx}\bigr)^{j}\Bigr) = \sum_{\substack{k=-N\\ k\ne0}}^{N}\hat f_\pm(k)\Bigl(\frac{e^{-i2\pi kx} - e^{-i2\pi k(n+1)x}}{n(1-e^{-i2\pi kx})}\Bigr) = \sum_{\substack{k=-N\\ k\ne0}}^{N}\hat f_\pm(k)e^{-i\pi k(n+1)x}\Bigl(\frac{\sin(\pi knx)}{n\sin(\pi kx)}\Bigr). $$

Set δ := min_{1≤k≤N} |sin(πkx)|. As x is given to be irrational, the fractional part (kx) remains bounded away from both zero and one for 1 ≤ k ≤ N and hence δ = δ(ε) > 0 is strictly positive. (The implicit dependence on the choice of ε > 0 arises through N = N(ε).) As |f±(x)| ≤ 1, we may trivially bound
$$ \bigl|\hat f_\pm(k)\bigr| = \Bigl|\int_{\mathbb T} f_\pm(x)e^{i2\pi kx}\,dx\Bigr| \le \int_{\mathbb T} |f_\pm(x)|\,dx \le \int_0^1 dx = 1 $$
for every k, and so the first term on the right in (6) is bounded in absolute value by 2N/(nδ), which tends to 0 as n → ∞. It follows that |ν_{±,n} − \hat f_±(0)| < 2ε for all sufficiently large n. As f− ≤ f ≤ f+, the central coefficients satisfy \hat f_−(0) ≤ b − a ≤ \hat f_+(0), and both differ from b − a by at most 2ε; the sandwich ν_{−,n} ≤ ν_n ≤ ν_{+,n} then shows that |ν_n − (b − a)| < 4ε for all sufficiently large n. Since ε > 0 may be chosen arbitrarily small, we conclude that ν_n → b − a as n → ∞.
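An empirical illustration of Weyl's theorem (not part of the manual; a simple Python sketch with x = √2 and one choice of interval):

# Sketch: fraction of the points (jx) mod 1 falling in (a, b) should approach b - a.
import math

x = math.sqrt(2.0)
a, b = 0.25, 0.60
for n in (10**3, 10**5, 10**6):
    hits = sum(1 for j in range(1, n + 1) if a < (j * x) % 1.0 < b)
    print(n, hits / n)        # tends to b - a = 0.35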

The succeeding problems build upon Problems V.12–15.

9. Bent coins. In repeated tosses of a bent coin with success probability p (and corresponding failure probability q = 1 − p) let S_n denote the number of successes after n tosses and let b_n(k; p) denote the probability that S_n = k. Starting with the expression derived in Problem V.14 show that b_n(k; p) = \binom{n}{k} p^k q^{n−k}.

SOLUTION: In the correspondence between bent coin tosses and the bent binary digits, we associate the jth toss with the jth bent digit and accumulated successes S_n with the sum z_1(t; p) + · · · + z_n(t; p). It follows via this correspondence that
$$ P\{S_n = k\} = \lambda\{\, t : z_1(t;p) + \cdots + z_n(t;p) = k \,\} $$
and Problem V.14 shows that the right-hand side is equal to \binom{n}{k} p^k q^{n−k}.

10. The central term of the binomial. With notation as in the previous problem, for each fixed n and p, show that there is a unique integer m such that b_n(k; p) ≥ b_n(k − 1; p) for all k ≤ m and show that (n + 1)p − 1 < m ≤ (n + 1)p. Under what condition is it true that b_n(m; p) = b_n(m − 1; p)?

SOLUTION: By a consideration of ratios, we see that
$$ \frac{b_n(k;p)}{b_n(k-1;p)} = \frac{\binom{n}{k}p^{k}q^{n-k}}{\binom{n}{k-1}p^{k-1}q^{n-k+1}} = \frac{(n-k+1)p}{kq} = \frac{(n-k+1)p}{k - kp}. $$
It follows that for the inequality b_n(k; p) ≥ b_n(k − 1; p) to hold it is necessary and sufficient that ⌊(n + 1)p⌋ ≥ k or, (n + 1)p − 1 < k ≤ (n + 1)p. Writing m = ⌊(n + 1)p⌋, we see that b_n(k; p) increases monotonically for 0 ≤ k ≤ m and then decreases monotonically thereafter. If b_n(m; p) = b_n(m − 1; p) then (n − m + 1)p = m − mp or, what is the same thing, m = (n + 1)p. For b_n(m; p) = b_n(m − 1; p), it is hence necessary and sufficient that (n + 1)p is an integer.
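A small numerical check (not from the manual; an illustrative Python sketch) that the maximising index is ⌊(n + 1)p⌋ when (n + 1)p is not an integer:

# Sketch: compare argmax of b_n(k;p) with floor((n+1)p) for a few (n, p) pairs.
import math

def b(n, k, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

for n, p in [(10, 0.3), (25, 0.6), (40, 0.17)]:      # (n+1)p non-integral in each case
    k_star = max(range(n + 1), key=lambda k: b(n, k, p))
    print(n, p, k_star, math.floor((n + 1) * p))      # the last two columns agree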

11. Laplace's theorem. With S_n as in Problem 9, denote the normalised number of successes by S_n^* = (S_n − np)/√(npq). For any a < b, using Markov's method show that P{a < S_n^* < b} → ∫_a^b φ(x) dx asymptotically as n → ∞. This extension of de Moivre's theorem to bent coins is due to Laplace.

SOLUTION: Let z_k(t; p) be the bent binary digits and R_n(t) = z_1(t; p) + · · · + z_n(t; p). Let S_n^* = (S_n − np)/√(npq). We want to calculate lim_{n→∞} P{a < S_n^* < b}. By the correspondence between the tosses of a bent coin and the bent binary digits, we have
$$ P\{a < S_n^* < b\} = \lambda\Bigl\{\, t : a < \frac{R_n(t) - np}{\sqrt{npq}} < b \,\Bigr\} = \int_0^1 \mathbf 1_A(t)\,dt $$

where A = { t : a < (R_n(t) − np)/√(npq) < b }.
N1 (ξ). (Re(z) denotes the real part of z.) So we can set M1 = 2 in (13) to get ˛ „ ˛« 3 «˛ „˛ ˛ ˛ ˛ ˛ ˛R1 √ξ ˛ ≤ 1 ˛ √ξp ˛ for n > N1 (ξ). ˛ ˛ ˛ 3 npq ˛ n

(14)

Similarly, for some integer N2 (ξ), ˛ „ ˛«3 «˛ „˛ ˛ ˛ ˛ − p) ˛˛ ˛R2 √ξ ˛ ≤ 1 ˛ ξ(1 for n > N2 (ξ). √ ˛ 3 ˛ npq ˛ n ˛


(15)



Now, from (10), we get |x| ≤

˛ „ ˛ „ «˛ «˛ ˛ ˛ ˛ ξ2 ξ ˛˛ ˛R1 √ξ ˛ + (1 − p) + p ˛˛R2 √ ˛ ˛ 2n n n ˛ ˛«3 ˛«3 „˛ „˛ p ˛˛ ξ(1 − p) ˛˛ ξ2 1 − p ˛˛ ξp ˛˛ ξ2 + ≤ + ≤ √ √ ˛ npq ˛ 2n 3 ˛ npq ˛ 3 n

for n large enough, say n > N3 (ξ). Now, for the Taylor expansion (11) of g(x) = log(1 − x), we can bound the remainder term by ˛ „ «˛ ˛ ˛ ˛R3 √ξ ˛ ≤ M3 |x|2 (16) ˛ 2 n ˛ “ ”2 1 where M3 is an upper bound on g00 (y) = 1−y for |y| ≤ |x|. To determine M3 , note that since |y| ≤ |x| ≤

ξ2 , n

for n large enough, say n > N4 (ξ), we have |y| ≤

|g (y)| = (1 − y) ≤ 4. Thus, we can set M3 = 4. Also, since |x| ≤ from (16) we see that ˛ „ «˛ 4 ˛ ˛  ˛R3 √ξ ˛ ≤ 2 ξ for n > max N3 (ξ), N4 (ξ) . ˛ ˛ n2 n 00

−2

2

ξ n

1 2

and hence

for n > N3 (ξ),

(17)

From (14), (15) and (17), we conclude that the last three terms in the right hand side of 2 (12) go to 0 as n → ∞. Thus, limn→∞ n log(1 − x) = − ξ2 and hence 2

lim G(n, ξ)n = lim exp(n log(1 − x)) = e−ξ

n→∞

n→∞

/2

.

Now, from (7) and the above expression, and following a procedure similar to the proof of de Moivre's theorem using Markov's method (coupled with a Lévy sandwich) we can show that
$$ \lim_{n\to\infty} P\{a < S_n^* < b\} = \frac{1}{\sqrt{2\pi}}\int_a^b e^{-x^2/2}\,dx. $$

12. A local limit theorem for bent coins. Let β_k = b_n(m + k; p) with m as in Problem 10. Suppose {K_n} is any sequence of positive integers satisfying K_n/(npq)^{2/3} → 0 as n → ∞. (Tacitly, p, hence also q, are allowed to depend on n.) Show that the asymptotic estimate β_k ∼ (npq)^{−1/2} φ(k/√(npq)) holds uniformly as n → ∞ for all k in the interval −K_n ≤ k ≤ K_n. [Note that the range of validity is reduced when the coin is unfair; symmetry has its advantages.]

13. A large deviation theorem for bent coins. Suppose the sequence {a_n} grows in such a way that a_n√(pq) → ∞ and a_n/(npq)^{1/6} → 0. Then P{S_n^* > a_n} ∼ ∫_{a_n}^{∞} φ(x) dx.


VII Probabilities on the Real Line

1. An expectation identity. Suppose the distribution p(k) of the arithmetic random variable X has support in the positive integers only. By rearranging the terms of the sum Σ_{k≥0} k p(k), verify the useful identity E(X) = Σ_{n=0}^{∞} P{X > n}.

SOLUTION: As X is positive, E(X) = 0 · p(0) + 1 · p(1) + 2p(2) + 3p(3) + 4p(4) + · · · . We may rearrange the terms of the expectation sum as all terms are positive and write the right-hand side in the form
p(1) + p(2) + p(3) + p(4) + · · ·
     + p(2) + p(3) + p(4) + · · ·
            + p(3) + p(4) + · · ·
                   + p(4) + · · ·
                          + · · ·
The associative property of addition! Each row in the sum above may be identified with a tail probability and so E(X) = P{X > 0} + P{X > 1} + P{X > 2} + P{X > 3} + · · · as asserted.

2. Urn problem. An urn contains b blue and r red balls. Balls are removed at random until the first blue ball is drawn. Determine the expected number of balls drawn.

SOLUTION: (i) First solution: a combinatorial approach. As the reader is well aware, sampling without replacement creates niggling dependencies; Problem 1, however, allows us to preserve our sang froid. The setting is that of Pólya's urn scheme of Section II.5. Let X be the number of balls drawn until the first blue ball is drawn. Then the event X > n occurs if, and only if, the first n balls drawn are red. Accordingly,
$$ P\{X > n\} = \frac{r}{b+r}\cdot\frac{r-1}{b+r-1}\cdots\frac{r-n+1}{b+r-n+1} = \frac{r_{(n)}}{(b+r)_{(n)}} = \binom{b+r-n}{r-n}\Big/\binom{b+r}{r} $$

for 0 ≤ n ≤ r. It follows that
$$ E(X) = \sum_{n\ge0}P\{X>n\} = \frac{1}{\binom{b+r}{r}}\sum_{n=0}^{r}\binom{b+r-n}{r-n} = \frac{1}{\binom{b+r}{r}}\sum_{k=0}^{r}\binom{b+k}{k}. $$

It is irresistible to apply Pascal's triangle to the terms of the sum on the right and we obtain
$$ \sum_{k=0}^{r}\binom{b+k}{k} = \sum_{k=0}^{r}\Bigl[\binom{b+k+1}{k} - \binom{b+k}{k-1}\Bigr] = \binom{b+r+1}{r} $$

as the sum telescopes. It follows that
$$ E(X) = \binom{b+r+1}{r}\Big/\binom{b+r}{r} = \frac{b+r+1}{b+1} = 1 + \frac{r}{b+1}, $$
an intuitively satisfying result. (ii) Second solution: conditioning on the first step. The first ball drawn is red with probability r/(b + r) and blue with probability b/(b + r). Write K(b, r) for the mean in question. Then, by conditioning on the colour of the first ball, by total probability, we obtain the recurrence
$$ K(b,r) = \bigl(1 + K(b, r-1)\bigr)\cdot\frac{r}{b+r} + 1\cdot\frac{b}{b+r}. $$

The natural boundary condition is K(b, 0) = 0. We may now verify directly (or by induction) that K(b, r) = 1 + r/(b + 1). (iii) Third solution: an approach illustrating the power of additivity. Continue the thought experiment until all balls are removed from the urn. Then every arrangement of red and blue balls is equally likely. In any such arrangement, the b blue balls occur at specific locations along the sequence and separate the r red balls into up to b + 1 sub-groups: we initially have a sequence of, say, N1 red balls followed by the first blue ball, then N2 red balls followed by the second blue ball, and so on, culminating with a sequence of Nb red balls followed by the last blue ball which is followed by a sequence of Nb+1 red balls. Of course, some of the values N1 , . . . , Nb+1 may be zero, but together they must account for all the red balls and so r = N1 + N2 + · · · + Nb+1 . Taking expectation of both sides, we see that r = E(N1 )+E(N2 )+· · ·+E(Nb+1 ) = (b+1) E(N1 ). (The variables Nj are clearly dependent. No matter! Expectation is always additive.) It follows that the expected number of contiguous reds drawn before the first blue is r/(b + 1); including the first blue drawn finishes up.
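A quick Monte Carlo corroboration (not part of the manual; an illustrative Python sketch):

# Sketch: simulate draws until the first blue ball and compare with 1 + r/(b+1).
import random

def draws_until_first_blue(b, r):
    balls = ['B'] * b + ['R'] * r
    random.shuffle(balls)
    return balls.index('B') + 1          # count includes the first blue ball drawn

b, r, trials = 4, 7, 200_000
est = sum(draws_until_first_blue(b, r) for _ in range(trials)) / trials
print(est, 1 + r / (b + 1))              # both ≈ 2.4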

3. Continuation. The balls are replaced and then removed at random until the remaining balls are of the same colour. Find the expected number remaining in the urn. S OLUTION : (i) First solution: reversing the sequence is a helpful tool. Consider a random arrangement of the b + r balls. The simple observation that, viewed from right to left, the reversed sequence has the same distribution as the original sequence reduces the

problem to the previous one. (Indeed, each realisation of a particular sequence from right to left may be placed in one-to-one correspondence with its counterpart sequence viewed from left to right.) Thus, the problem as stated is equivalent to a consideration of the number of balls of one colour that are drawn before a ball of opposing colour is drawn. There are two possibilities for the first ball drawn. It is red with probability r/(b + r). Conditioned on this event, we are now faced with a random arrangement of b blue and r − 1 red balls and, by Problem 2, the expected length of the sequence of red balls drawn before the first blue ball is (r − 1)/(b + 1). Adding in the first red ball, the expected number of red balls drawn in sequence before the first blue ball is drawn, given that the first ball drawn is red, is 1 + (r − 1)/(b + 1) = (b + r)/(b + 1). By an entirely similar argument, the expected number of blue balls drawn in sequence before the first red ball is drawn, given that the first ball drawn is blue, is 1 + (b − 1)/(r + 1) = (b + r)/(r + 1). By total probability, the expected number of balls of a common colour drawn before a ball of opposing colour is drawn is
$$ \frac{b+r}{b+1}\cdot\frac{r}{b+r} + \frac{b+r}{r+1}\cdot\frac{b}{b+r} = \frac{r}{b+1} + \frac{b}{r+1}. $$
(ii) Second solution: additivity, once more. In the notation introduced in the solution by additivity for Problem 2, let N_{b+1} denote the length of the red sequence to the right of the last blue ball and let M_{r+1} denote the length of the blue sequence to the right of the last red ball in the sequence. One of these two numbers is zero. Then N_{b+1} + M_{r+1} denotes the number of balls (of the same colour) that remain in the urn. In the previous problem we have seen that E(N_{b+1}) = r/(b + 1) and, by an entirely similar argument, E(M_{r+1}) = b/(r + 1). By additivity of expectation,
$$ E(N_{b+1} + M_{r+1}) = E(N_{b+1}) + E(M_{r+1}) = \frac{r}{b+1} + \frac{b}{r+1}. $$

4. Pepys's problem, redux. Is it more probable to obtain at least n sixes in 6n rolls of a die or to obtain at least n + 1 sixes in 6(n + 1) rolls of a die?

SOLUTION: This problem is harder than it appears. There are sophisticated solutions but they tend to be opaque to the novice reader. I will provide an elementary solution here which at least has the virtue of transparency even if the motivations behind the manipulations require a more subtle understanding of the binomial distribution. Consider an unending sequence of die throws and, for each n, let S_{6n} denote the number of sixes obtained in the first 6n throws. Then S_{6n} is governed by the binomial distribution which, in the notation introduced in Problem VI.9, we may write in the form
$$ P\{S_{6n} = k\} = \binom{6n}{k}\Bigl(\frac{1}{6}\Bigr)^{k}\Bigl(\frac{5}{6}\Bigr)^{6n-k} =: b_{6n}(k; 1/6). $$
Write P_{6n}(n) for the probability that, in 6n throws of a die, at least n sixes are obtained:
$$ P_{6n}(n) = \sum_{k\ge n} b_{6n}(k; 1/6). $$

The form of the right-hand side is not particularly illuminating but we can certainly compute it numerically to get a feel for how the tail probabilities evolve with n. Table 1 tabulates the first few values and a cursory examination of the values leads to the shrewd suspicion that P_{6n}(n) decreases monotonically with n. We now set about trying to prove it.

Table 1: A tabulation of the probabilities that, in 6n throws of a die, at least n sixes are obtained.
n        : 1        2        3        4        5        6
P_{6n}(n): 0.665102 0.618667 0.597346 0.584487 0.575661 0.569124

In view of the numerical evidence in Table 1, it is clear that P_{6n}(n) decreases at least through n = 6 and we may accordingly suppose n ≥ 7. The first key observation is that we may relate the distributions of S_{6n+6} and S_{6n} via a convolution. Indeed, writing S_6' for the number of sixes obtained in throws 6n + 1, . . . , 6n + 6, we see that S_{6n+6} = S_{6n} + S_6' where S_6' has the same binomial distribution b_6(·; 1/6) (with support only in {0, 1, . . . , 6}) as S_6. As S_{6n} and S_6' are determined by disjoint sets of trials it is clear that they are independent variables and, by total probability, we obtain the convolution relation
$$ P_{6n+6}(n+1) = \sum_{k=0}^{6} P_{6n}(n+1-k)\,b_6(k; 1/6) $$
as S_{6n+6} ≥ n + 1 if, and only if, for some 0 ≤ k ≤ 6, we have both S_6' = k and S_{6n} ≥ n + 1 − k. As the binomial probabilities b_6(k; 1/6) sum to one, we also have the relation
$$ P_{6n}(n) = \sum_{k=0}^{6} P_{6n}(n)\,b_6(k; 1/6), $$

and so, by grouping terms, we obtain
$$ P_{6n}(n) - P_{6n+6}(n+1) = \sum_{k=0}^{6}\bigl[P_{6n}(n) - P_{6n}(n+1-k)\bigr]b_6(k; 1/6) $$
$$ \begin{aligned} &= \bigl[b_{6n}(n;1/6)\bigr]\cdot b_6(0;1/6) + [0]\cdot b_6(1;1/6)\\ &\quad - \bigl[b_{6n}(n-1)\bigr]\cdot b_6(2;1/6)\\ &\quad - \bigl[b_{6n}(n-2) + b_{6n}(n-1)\bigr]\cdot b_6(3;1/6)\\ &\quad - \bigl[b_{6n}(n-3) + b_{6n}(n-2) + b_{6n}(n-1)\bigr]\cdot b_6(4;1/6)\\ &\quad - \bigl[b_{6n}(n-4) + b_{6n}(n-3) + b_{6n}(n-2) + b_{6n}(n-1)\bigr]\cdot b_6(5;1/6)\\ &\quad - \bigl[b_{6n}(n-5) + b_{6n}(n-4) + b_{6n}(n-3) + b_{6n}(n-2) + b_{6n}(n-1)\bigr]\cdot b_6(6;1/6). \end{aligned} $$

The second key observation is that the binomial distribution b_{6n}(k; 1/6) achieves its unique maximum value at k = n. (This is easily verified by taking a ratio of successive terms; see Problem VI.10.) It follows a fortiori that, for k ≥ 1, we have b_{6n}(n; 1/6) > b_{6n}(n − k; 1/6) and, accordingly, we may bound each of the expressions inside the square brackets on the right by
$$ b_{6n}(n-k+1) + b_{6n}(n-k+2) + \cdots + b_{6n}(n-1) < (k-1)\,b_{6n}(n) \qquad (2 \le k \le 6). $$
By factoring out the term b_{6n}(n; 1/6), we may hence bound
$$ P_{6n}(n) - P_{6n+6}(n+1) > b_{6n}(n)\Bigl[b_6(0; 1/6) - \sum_{k=2}^{6}(k-1)\,b_6(k; 1/6)\Bigr]. \tag{1} $$
A simple numerical computation shows that the expression in square brackets on the right is identically equal to zero which establishes the proposition that the sequence { P_{6n}(n), n ≥ 1 } decreases monotonically with n. For the reader who disdains numerical calculation, a direct analytical argument is just as easy if she bears in mind the expected value of the binomial: E(S_6) = Σ_{k=0}^{6} k b_6(k; 1/6) = 6 · (1/6) = 1. Keeping this in mind, we may massage the term in square brackets on the right into the form

$$ b_6(0;1/6) - \sum_{k=2}^{6}(k-1)\,b_6(k;1/6) = \sum_{\substack{k=0\\ k\ne1}}^{6} b_6(k;1/6) - \sum_{k=2}^{6} k\,b_6(k;1/6) $$
$$ = \Bigl(\sum_{k=0}^{6} b_6(k;1/6) - b_6(1;1/6)\Bigr) - \Bigl(\sum_{k=0}^{6} k\,b_6(k;1/6) - 0\cdot b_6(0;1/6) - 1\cdot b_6(1;1/6)\Bigr) = \bigl(1 - b_6(1;1/6)\bigr) - \bigl(6\cdot\tfrac{1}{6} - 0 - b_6(1;1/6)\bigr) = 0. $$

By substituting into (1), we see indeed that P6n (n) > P6n+6 (n + 1) for n ≥ 7 and hence, by virtue of the numerical evidence in Table 1, also for all n ≥ 1.
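The numerical evidence in Table 1 is easy to reproduce (not from the manual; an illustrative Python sketch):

# Sketch: tabulate P_{6n}(n) and confirm the monotone decrease for the first few n.
import math

def p6n(n):
    # probability of at least n sixes in 6n throws of a fair die
    return sum(math.comb(6 * n, k) * (1/6)**k * (5/6)**(6 * n - k) for k in range(n, 6 * n + 1))

vals = [p6n(n) for n in range(1, 13)]
print([round(v, 6) for v in vals])
print(all(a > b for a, b in zip(vals, vals[1:])))    # True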

5. Dice. Let X and Y denote the number of ones and sixes, respectively, in n throws of a die. Determine E(Y | X). S OLUTION : If we are given that exactly X = j dice show ones, then the outcomes of the remaining n − j dice are equally likely to take any of the remaining five values 2 through 6. Identifying “success’ with a six being thrown, the number of sixes among the remaining dice may be identified with the number of accumulated successes in n − X tosses of a coin with success probability 1/5. Accordingly, conditioned on there being X ones, the number Y of sixes has the binomial distribution bn−X (·; 1/5). It follows without further calculation that E(Y | X) = (n − X)/5.

6. Show that it is not possible to weight two dice so that the sum of face values is equally likely to take any value from 2 through 12. S OLUTION : Write X and Y for the face values shown by the two dice and let S = X + Y denote the sum of face values. Suppose P{X = k} = pk and P{Y = k} = qk for 1 ≤ k ≤ 6,

the dice so weighted that P{S = 2} = P{S = 3} = · · · = P{S = 12} = 1/11. In particular,
$$ \tfrac{1}{11} = P\{S=2\} = p_1 q_1, \qquad \tfrac{1}{11} = P\{S=12\} = p_6 q_6. $$
It is clear hence that p_1, q_1, p_6, q_6 are all strictly positive and so by equating the two right-hand sides and re-ordering terms we see that p_1/p_6 = q_6/q_1. Writing x = q_6/q_1 > 0, we have
$$ \tfrac{1}{11} = P\{S=7\} \ge p_1 q_6 + p_6 q_1 = p_1 q_1\, x + p_6 q_6\,\frac{1}{x} = \tfrac{1}{11}\Bigl(x + \frac{1}{x}\Bigr), $$
so that x + 1/x ≤ 1; but x + 1/x ≥ 2 > 1 whenever x > 0 and we obtain a contradiction. We conclude that there is no way to weight two dice so that the sum of face values is equally likely to take any value from 2 through 12.

7. With ace counting as 1, extract the 20 cards with face values 1 through 5 from a standard 52-card deck and shuffle them. Ask a friend to think of a number N1 between 1 and 5 but not to disclose it. Expose the cards one by one. Ask your friend to begin a count from the first card exposed and to take mental note of the face value, say, N2 of the N1 th card exposed; at which point she begins the count anew and takes note of the face value, say, N3 of the N2 th card from that point onwards (the (N1 + N2 )th card exposed overall); she proceeds in this fashion until all 20 cards are exposed at which point she, beginning with her private value N1 , has accumulated a privately known sequence N2 , . . . , Nk of face values on which she “lands”. Her final noted face value, say, Nk for some value of k, corresponds to a card near the end of the deck with fewer than Nk cards left to be exposed. Propose a procedure to guess the value of Nk . (Remember that you don’t know her starting number N1 .) Estimate by computer simulation your probability of success. I heard about this pretty problem from Aaron Wyner who credited it to Persi Diaconis. [Hint: Do Problem I.16 first.] S OLUTION : This problem is similar to Problem I.16 (Keeping up with the Joneses) with n = 5, the difference being that sampling is now without replacement and we deal with a finite, twenty card deck. In the current problem we begin with a random shuffle of 20 (distinguishable) cards (though only their face values matter) resulting in a random ` ´ (1) permutation of face values Π(1), . . . , Π(20) . Beginning with the face value N1 = N1 , ` (1) ´ the subsequence Nj , j ≥ 1 of face values generated in the problem may be identified ` (1) (1) (1) ´ recursively via Nj = Π N1 + · · · + Nj−1 for j ≥ 2. The sequence terminates with a

face value N_*^{(1)} at the largest value j for which N_1^{(1)} + · · · + N_{j−1}^{(1)} ≤ 20. Our experience with Problem I.16 points the way: we anticipate that two sequences (N_j^{(1)}, j ≥ 1) and (N_j^{(2)}, j ≥ 1) with different starting points N_1^{(1)} ≠ N_1^{(2)} will, with high probability, have the same end-points as the two sequences will coincide sooner or later with high probability. We simply begin with our own random or arbitrary starting point N_1^{(2)} ∈ {1, . . . , 5} and create a new subsequence of face values (N_j^{(2)}, j ≥ 1) terminating with a face value N_*^{(2)} at the largest value j for which N_1^{(2)} + · · · + N_{j−1}^{(2)} ≤ 20 and proffer N_*^{(2)} as our guess of N_*^{(1)}.

Here is some Mathematica code implementing the procedure (admittedly crude but I didn't bother fiddling with the documentation to find a fast and efficient way to do it).

m = 1000000; (*Number of iterations*)
f = 0; (*Number of matches*)
Do[
 s = RandomSample[{1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5}]; (*Select a random shuffle of Ace through 5*)
 n1 = RandomInteger[{1, 5}]; (*Sequence protagonist 1*)
 While[n1

1/2. For m > 1/2, normalisation of probability measure now dictates that C = I(m)^{−1} and it only remains to evaluate the integral which, by direct methods, is seen to be
$$ I(m) = \frac{\sqrt{\pi}\,\Gamma\bigl(m - \tfrac12\bigr)}{\Gamma(m)}. \tag{2} $$
And so, to properly normalise the density, C must be the reciprocal of the quantity on the right. In the case when m is a strictly positive integer, elementary methods suffice to compute the integral on the left in (2). A standard trigonometric transformation (substituting x = tan θ) shows that
$$ I(m) = \int_{-\pi/2}^{\pi/2} \frac{\sec(\theta)^{2}}{\sec(\theta)^{2m}}\,d\theta = \int_{-\pi/2}^{\pi/2} \cos(\theta)^{2m-2}\,d\theta, $$
and the standard integral on the right may be evaluated by a variety of methods [see the approaches outlined following (V.5.2) in ToP] to yield
$$ I(m) = \pi\,2^{-2m+2}\binom{2m-2}{m-1} = \pi\,b_{2m-2}(m-1; 1/2), $$
as may be directly verified by expressing the gamma functions on the right in (2) in terms of factorials. The special case m = 1 yields I(1) = π which results in the standard Cauchy density π^{−1}(1 + x²)^{−1}.

10. A particle of unit mass is split into two fragments of masses X and 1 − X. The density f of X has support in the unit interval and by reasons of symmetry f(x) = f(1 − x). Let X1 be the smaller of the two masses, X2 the larger. The fragment of mass X1 is split in like manner resulting in two fragments with masses X11 and X12 ; likewise, splitting X2 results in masses X21 and X22 . We again use subscript 1 to denote the smaller of two masses, subscript 2 denoting the larger. Assume splits are independent,  the split of a mass m governed by 1 x the appropriately scaled density m f m . Determine the joint density of X11 and X22 . Thence or otherwise determine the density of X11 . S OLUTION: What do we know? (1) Symmetry: We are given that f has support in the unit interval (0, 1) and that f(x) = f(1 − x) for 0 < x < 1. (2) Splitting: The split of a` mass ´ m into two segments, say, X and m − X, is governed by the scaled density 1 x f m . By the symmetry m ` x ´ of f at the midpoint of its support, X and m − X have the 1 common density m f m . (3) Conditional independence: The variables X11 and X22 are conditionally independent given that X1 = m. The recurring theme: Suppose a mass m is randomly split into two fragments X and m − X. Set Xmin = min{X, m − X} and Xmax = max{X, m − X} and let fmin and fmax ,

respectively, be their densities. It is clear that Xmin has support in the interval 0 < x ≤ m/2 while Xmax has support in the interval m/2 ≤ x < m. A standard exercise now shows that the d.f. of Xmin is given by Z m−x 1 “t” P{Xmin ≤ x} = 1 − P{Xmin > x} = 1 − P{x < X < m − x} = 1 − f dt. m m x By differentiating we see that fmin (x) =

x” 1 “x” 2 “x” 1 “ f 1− + f = f m m m m m m

(0 < x ≤ m/2).

(∗)

As Xmax = m − Xmin , no further calculations are required and we may directly write down 2 “ x” 2 “x” fmax (x) = fmin (m − x) = f 1 − = f (m/2 ≤ x < m). (∗∗) m m m m The first split: Suppose X1 has d.f. F1 and corresponding density f1 . As X1 = min{X, 1 − X}, it is clear that f1 (x) has support in 0 < x ≤ 1/2 only, and applying the recurring theme (∗) to the case m = 1, we see that (0 < m ≤ 1/2).

f1 (m) = 2f(m)

Of course, X2 = 1 − X1 has corresponding density f2 (m) = f1 (1 − m) with support in 1/2 ≤ m < 1. The second split: Write f11,22 (x, y | X1 = m) for the conditional density of the pair (X11 , X22 ) given that X1 = m and let f11 (x | X1 = m) and f22 (y | X1 = m) be the corresponding conditional marginal densities of X11 and X22 , respectively. Applying the recurring theme (∗), we see that f11 (x | X1 = m) =

2 “x” f m m

(0 < x ≤ m/2),

(∗ ∗ ∗)

and as, conditioning on the event X1 = m is the same as conditioning on the event X2 = 1 − m, applying (∗∗) with m replaced by 1 − m, we have „ « ` ´ 2 y f22 (y | X1 = m) = f (1 − m)/2 ≤ y < (1 − m) . 1−m 1−m As X11 and X22 are conditionally independent given that X1 = m, it follows that f11,22 (x, y | X1 = m) = f11 (x | X1 = m)f22 (y | X1 = m) and so the joint density of the triple (X11 , X22 , X1 ) is given by “x” “ y ” 8 f f f(m) (∗ ∗ ∗∗) m(1 − m) m 1−m ` with support only in the region in three-space defined by the inequalities 0 < m ≤ ´ 1/2; 0 < x ≤ m/2; (1 − m)/2 ≤ y < 1 − m . f(x, y, m) = f11,22 (x, y | X1 = m)f1 (m) =

The joint density of X11 , X22 is obtained by integrating out m in the expression (∗ ∗ ∗∗). Fix (x, y) in the support of the joint distribution given by the region in the

130

VII.11

Probabilities on the Real Line

` ´ plane defined by the inequalities 0 < x ≤ 1/4; (1 − 2x)/2 ≤ y < (1 − 2x) . For fixed x and y in this region, by integrating out over the relevant range of m, we obtain Z 1/2∧(1−y) “x” “ y ” 8 f f f(m) dm. m 1−m (2x)∨(1−2y) m(1 − m) The marginal density of X11 directly from (∗ ∗ ∗). As the density of the pair (X11 , X1 ) is given by 4 “x” f11,1 (x, m) = f f(m) (0 < m ≤ 1/2; 0 < x ≤ m/2), m m we may integrate out over the indicated range of m to obtain Z 1/2 4 “x” f f(m) dm (0 < x ≤ 1/4). f11 (x) = m 2x m We may verify this expression by integrating out over y and m in (∗∗∗∗). Fixing 0 ≤ x < 1/4, m may vary in the range 2x ≤ m ≤ 1/2 and y in the range (1−m)/2 ≤ y < (1−m). Integrating out, we obtain Z 1/2 Z 1−m “ y ” 4 “x” 2 f11 (x) = f f(m) f dy dm. m 1−m 2x m (1−m)/2 1 − m The change of variable y/(1 − m) = t reduces the inner integral to Z1

Z1 2f(u) du =

1/2

Z1 f(u) du +

1/2

Z 1/2

f(u) du 1/2

Z1

Z 1/2

f(1 − u) du +

= 0

f(u) du = 1/2

Z1 f(u) du +

0

f(u) du = 1 1/2

as it must.

 11. Let f(x, y) = (1+ax)(1+ay)−a e−x−y−axy for x, y > 0. If 0 < a < 1 show that f is a density. S OLUTION : As (1 + ax)(1 + ay) − a = (1 − a) + a(x + y + xy) is positive for all x > 0, y > 0, it suffices to show that f integrates to 1 for it to be a density. This will follow once the marginals are determined. Integrate out y to obtain the marginal density of X: f1 (x) = e−x „ = e−x 1 −

Z∞

Z∞ (1 + ay)(1 + ax)e−(1+ax)y dy − ae−x e−(1+ax)y dy 0 0 « Z∞ Z∞ a (1 + ax)e−(1+ax)y dy + ae−x y(1 + ax)e−(1+ax)y dy. 1 + ax 0 0

Recall that for positive α, the exponential density αe−αx (x > 0) has mean 1/α and variance 1/α2 . In the expressions above, 1 + ax plays the rôle of α and we immediately obtain « „ a ae−x −x f1 (x) = e 1− + = e−x (x > 0). 1 + ax 1 + ax

As the expression for f is symmetric in x and y it follows that the marginal density of Y is also the unit exponential density, f2 (y) = e−y for y > 0. A little extra: conditional density and expectation. By definition, for each positive x, f2 (y | X = x) =

˜ f(x, y) ˆ = (1 + ax)(1 + ay) − a e−(1+ax)y f1 (x)

(y > 0).

Taking expectation with respect to f2 (y | X = x) we obtain „ E(Y | X = x) = 1−

a 1 + ax

« Z∞

−(1+ax)y

y(1+ax)e

Z∞ dy+a

0

y2 (1+ax)e−(1+ax)y dy

0

1 2a a 1 + a + ax = + − = . 1 + ax (1 + ax)2 (1 + ax)2 (1 + ax)2 It follows that E(Y | X) = (1 + a + aX)/(1 + aX)2 .

12. The triangular density. Random variables X_1 and X_2 are independent, both uniformly distributed in the unit interval (0, 1). Let Y = X_1 + X_2. Determine the density of Y, its mean, and its variance.

SOLUTION: Let u(x) be the common marginal density of X_1 and X_2:
$$ u(x) = \begin{cases} 1 & \text{if } 0 < x < 1,\\ 0 & \text{otherwise.}\end{cases} $$
As X_1 and X_2 are independent, the density of Y = X_1 + X_2 is given by the convolution
$$ f(y) = (u \star u)(y) = \int_{-\infty}^{\infty} u(x)\,u(y-x)\,dx, $$
and as u is the unit rectangular function centred at x = 1/2, the convolution gives a triangular density centred at x = 1 shown in Figure 1.
[Figure 1: The triangular density.]
In notation:
$$ f(y) = \begin{cases} y & \text{if } 0 < y \le 1,\\ 2 - y & \text{if } 1 \le y < 2,\\ 0 & \text{otherwise.}\end{cases} \tag{3} $$
[See (IX.1.1) and (IX.1.2) in ToP.]

If X ∼ Uniform(0, 1), then E(X) = ∫_0^1 x dx = 1/2 and E(X²) = ∫_0^1 x² dx = 1/3, whence Var(X) = E(X²) − E(X)² = 1/12. By linearity of expectation it follows that
$$ E(Y) = E(X_1) + E(X_2) = \tfrac12 + \tfrac12 = 1, $$
and as the X_i are independent, the addition theorem for variances yields
$$ \operatorname{Var}(Y) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2) = \tfrac{1}{12} + \tfrac{1}{12} = \tfrac{1}{6}. $$
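A quick simulation check (not from the manual; an illustrative Python sketch):

# Sketch: Y = X1 + X2 for independent uniforms should have mean 1 and variance 1/6.
import random

n = 500_000
ys = [random.random() + random.random() for _ in range(n)]
mean = sum(ys) / n
var = sum((y - mean)**2 for y in ys) / n
print(mean, var)      # ≈ 1 and ≈ 0.1667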

13. The random variables X and Y have density f(x, y) = 24xy with support only in the region x ≥ 0, y ≥ 0, and x+y ≤ 1. Are X and Y independent? S OLUTION : Independence implies rectangular support. What this cryptic statement means is that, if X has support in (a Borel set) A and Y has support in (another Borel set) B, then, if X and Y are independent, the pair (X, Y) has support in the rectangle A × B. This observation provides the most elegant path to solution. By inspection, the variables X and Y each has (marginal) support in the unit interval but the pair (X, Y) has support only in the lower triangular region of the unit square. It follows that X and Y are not independent. Via marginals. Integrating out over the support of the joint density, we see X has support in the unit interval with marginal density Z 1−x f1 (x) = 24xy dy = 12x(1 − x)2 =: g(x) (0 ≤ x ≤ 1). 0

The symmetry of the situation shows that Y has same marginal density, f2 (y) = g(y). But the product of marginals f1 (x)f2 (y) = g(x)g(y) has support in the entire unit square (0, 1)2 and so is manifestly not equal to f(x, y) everywhere.

14. The random variables X and Y are uniformly distributed in the annulus a < √(x² + y²) < b. For each y, determine the conditional density of Y given that X = x.

SOLUTION: The (joint) density of the pair (X, Y) is given by
$$ f(x, y) = \frac{1}{\pi(b^2 - a^2)} \qquad (a^2 < x^2 + y^2 < b^2). $$

Z (b2 −x2 )1/2 (a2 −x2 )1/2

=

133

1 dy π(b2 − a2 )

` ´ 2 (b2 − x2 )1/2 − (a2 − x2 )1/2 . π(b2 − a2 )

Probabilities on the Real Line

VII.16

If a < |x| < b, then Z (b2 −x2 )1/2 f1 (x) = −(b2 −x2 )1/2

2(b2 − x2 )1/2 1 dy = . π(b2 − a2 ) π(b2 − a2 )

We may merge the ranges in the support in a compact notation and combine the two expressions in the form ˆ ` ´1/2 ˜ 2 (b2 − x2 )1/2 − (a2 − x2 )+ f1 (x) = π(b2 − a2 ) where, in the usual notation, t+ = max{t, 0} denotes the positive part of a real variable t. Now suppose |x| < b. Given that X = x, the conditional density f2 (y | X = x) ` ´1/2 of Y has support only in the interval (a2 − x2 )+ ≤ y < (b2 − x2 )1/2 and, taking ratios, we find that for y in this interval, f2 (y | X = x) =

f(x, y) 1 = ˆ ` ´1/2 ˜ . f1 (x) 2 (b2 − x2 )1/2 − (a2 − x2 )+

The expression on the right does not depend upon y and hence, given X = x, Y is uniformly distributed in the region of its support.

15. Let X and Y be independent random variables with a common density. You are informed that this density has support only within the interval [a, b] and that it is symmetric around (a + b)/2 (but you are not told anything more about the density). In addition, you are told that the sum Z = X + Y has density g+(t) (which is given to you). How can you determine the density g−(t) of the variable W = X − Y given this information?

SOLUTION: The key is to exploit the symmetry by a proper centring. Write f for the common density of X and Y. Centre the variables by setting X̃ = X − (a + b)/2 and Ỹ = Y − (a + b)/2. The common density f̃ of X̃ and Ỹ is related to f by a simple shift of origin,
$$ \tilde f(t) = f\bigl(t + \tfrac12(a+b)\bigr). $$
The key is the observation that f̃ is an even function, f̃(−t) = f̃(t), whence Ỹ and −Ỹ both have the density f̃. As X̃ and Ỹ are independent, it follows that X̃ + Ỹ and X̃ − Ỹ have the common density f̃ ⋆ f̃ obtained by convolving f̃ with itself. But X̃ + Ỹ = X + Y − (a + b) = Z − (a + b) so that, on the one hand,
$$ \tilde f \star \tilde f(t) = g_+\bigl(t + (a+b)\bigr), $$
while, on the other, X̃ − Ỹ = W, so that
$$ \tilde f \star \tilde f(t) = g_-(t). $$
It follows that g− is related to g+ by a simple shift of origin: g−(t) = g+(t + a + b).

16. Additivity of expectation. Suppose the random n-tuple (X1 , . . . , Xn ) has density f(x1 , . . . , xn ). Show by exploiting the linearity of integration that if the individual variables have expectation then E(X1 + X2 ) = E(X1 ) + E(X2 ) and hence that E(a1 X1 + · · · + an Xn ) = a1 E(X1 ) + · · · + an E(Xn ). S OLUTION : Starting with n = 2, write f(x1 , x2 ) for the density of the pair (X1 , X2 ). Let f1 (x1 ) and f2 (x2 ) be the corresponding marginals of X1 and X2 , respectively. By linearity of integration, we may immediately write ZZ (x1 + x2 )f(x1 , x2 ) dx2 dx1 E(X1 + X2 ) = R2 Z∞ Z∞ Z∞ Z∞ = x1 f(x1 , x2 ) dx2 dx1 + x2 f(x1 , x2 ) dx1 dx2 , −∞

−∞

−∞

−∞

where we may reverse the order of integration in the second of the two integrals under the assumption that the double integral is absolutely convergent. We identify the two inner integrals on the right with the marginal densities f1 (x1 ) and f2 (x2 ), respectively, and so Z∞ Z∞ E(X1 + X2 ) = x1 f1 (x1 ) dx1 + x2 f2 (x2 ) dx2 = E(X1 ) + E(X2 ). −∞

−∞

By induction it now follows quickly that E(X1 + · · · + Xn ) = E(X1 ) + · · · + E(Xn ). As E(aX) = a E(X) by the scaling of the density [see Example VII.4.1 in ToP], linearity quickly follows.

17. Continuation, additivity of variance. Suppose additionally that X1 , . . . , Xn are independent. First show that Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) and thence that Var(X1 + · · · + Xn ) = Var(X1 ) + · · · + Var(Xn ). S OLUTION : We verify the additivity property for the base case n = 2, induction doing the rest. Write f1 and f2 for the marginal densities of X1 and X2 , respectively; as X1 and X2 are independent, the joint density of the pair (X1 , X2 ) is hence f1 (x1 )f2 (x2 ). Write µ1 and µ2 for the means of X1 and X2 , respectively. By Problem 16, E(X1 + X2 ) = E(X1 ) + E(X2 ) = µ1 + µ2 and so ZZ ` ´2 Var(X1 + X2 ) = (x1 + x2 ) − (µ1 + µ2 ) f1 (x1 )f2 (x2 ) dx2 dx1 . R2

Expanding out the square in the integrand and separating terms, we obtain Z∞ Var(X1 + X2 ) =

Z∞ (x1 − µ1 )2 f1 (x1 ) dx1 f2 (x2 ) dx2 dx2 −∞ −∞ Z∞ Z∞ + f1 (x1 ) dx1 (x2 − µ2 )2 f2 (x2 ) dx2 −∞ −∞ Z∞ Z∞ +2 (x1 − µ1 )f1 (x1 ) dx1 (x2 − µ2 )f2 (x2 ) dx2 . −∞

−∞

By additivity, Z∞ Z∞ Z∞ (x1 − µ1 )f1 (x1 ) dx1 = x1 f1 (x1 ) dx1 − µ1 f1 (x1 ) dx1 = E(X1 ) − µ1 = 0, −∞

−∞

−∞

and so the cross-product term contributes naught to the variance. And as, by normalisation of the marginal densities, Z∞ Z∞ f2 (x2 ) dx2 = 1, f1 (x1 ) dx1 = −∞

−∞

we see hence that Z∞ Z∞ Var(X1 +X2 ) = (x1 −µ1 )2 f1 (x1 ) dx1 + (x2 −µ2 )2 f2 (x2 ) dx2 = Var(X1 )+Var(X2 ), −∞

−∞

whence variance is additive when the variables are independent. Induction completes the story.

18. Show that g(x, y) = e−x−y /(x + y) is a density concentrated in the first quadrant x > 0, y > 0 in R2 . Generalise: Suppose f is a density concentrated on the positive half-line (0, ∞). Then g(x, y) = f(x + y)/(x + y) is a density in R2 . Find its covariance matrix. S OLUTION : We may exploit the positivity of the variables by representing them as squared variables, u2 ← x and v2 ← y, and now the particular form of the density suggests √ that a further transformation to polar coordinates may be useful, r ← u2 + v2 and θ ← arctan(v/u). Following the suggested course of action, we obtain ZZ

Z∞ Z∞

R2

Z ∞ Z π/2

= 0

0

Z∞ Z∞

2

2

e−(u +v ) 4uv dudv u2 + v2 0 0 0 0 Z Z 2 ∞ π/2 ` ´ 2 e−r 4r2 sin(θ) cos(θ) · r dθdr = 4 re−r dr · sin(θ) d sin(θ) 2 r 0 0 ˛π/2 Z∞ Z∞ sin(θ)2 ˛˛ t←r2 −r2 =4 dr · = re e−t dt = 1, 2 ˛0 0 0

g(x, y) dxdy =

e−(x+y) dxdy = x+y

and so g is properly normalised. As g is patently positive, it is indeed a density. If the reader looks through the argument she will realise that the particular form of the exponential function had very little to do with the matter; all that mattered was that the exponential density is properly normalised to unit area. This suggests that we can effortlessly generalise the result. Suppose f(x, y) is a density concentrated in the first quadrant and, as given, g(x, y) = f(x + y)/(x + y). Following the same transformations as before, we obtain Z∞ Z∞ ZZ Z∞ Z∞ f(u2 + v2 ) f(x + y) dxdy = 4uv dudv g(x, y) dxdy = x+y u2 + v2 0 0 R2 0 0 Z ∞ Z π/2 Z Z ∞ π/2 ` ´ f(r2 ) 2 = 4r sin(θ) cos(θ) · r dθdr = 4 rf(r2 ) dr · sin(θ) d sin(θ) 2 r 0 0 0 0 ˛π/2 Z∞ Z∞ sin(θ)2 ˛˛ t←r2 =4 rf(r2 ) dr · = f(t) dt = 1. 2 ˛0 0 0

Thus, g is properly normalised and positive, hence a density. The same changes of variable rapidly permit the determination of various moments. Write µ and σ2 for the mean and variance, respectively, of the density f. For the mean of X, we now have ZZ

Z∞

E(X) =

xg(x, y) dxdy = 4 R2

Z π/2 r3 f(r2 ) dr ·

0

sin(θ) cos(θ)3 dθ 0

Z∞

=4 0

˛π/2 dt − cos(θ) ˛˛ µ tf(t) · = . ˛ 2 4 2 0

By the symmetry inherent in the situation, E(Y) = µ/2 as well. A similar calculation for the second moment shows that ZZ E(X2 ) =

x2 g(x, y) dxdy = 4 R2

Z∞

Z π/2 r5 f(r2 ) dr

0

Z∞

=4

sin(θ) cos(θ)5 dθ 0

t2 f(t)

0

˛π/2 dt − cos(θ)6 ˛˛ 1 = (µ2 + σ2 ). · ˛ 2 6 3 0

And traipsing down the transformation path once more for the mixed moment gives Z∞

ZZ E(XY) =

xyg(x, y) dxdy = 4 R2

Z π/2 r5 f(r2 ) dr

0

sin(θ)3 cos(θ)3 dθ. 0

An easy trigonometric integration shows that Z π/2 sin(θ)3 cos(θ)3 dθ = 0

=

1 16

1 8

Z π/2 sin(2θ)3 dθ

z←2θ

=

0

1 16

Zπ sin(z)2 · sin(z) d 0

theta



˛π « „ ` ´ ` ´ cos(z)3 ˛˛ 1 1 =− . 1 − cos(z)2 d cos(z) = cos(z) − 16 3 ˛0 12 0

It follows that E(XY) = −

1 3

Z∞

t←r2

r5 f(r2 ) dr = −

0

1 6

Z∞ 0

1 t2 f(t) dt = − (µ2 + σ2 ). 6

Putting the pieces together, E(X) =

µ , 2

“ µ ”2 1 2 1 1 2 (µ + σ2 ) − = σ2 + µ , 3 2 3 12 “ ” 1 µ 2 1 5 2 Cov(X, Y) = E(XY) − E(X) E(Y) = − (µ2 + σ2 ) − = − σ2 − µ , 6 2 6 12 Var(X) = E(X2 ) − E(X)2 =

and, a fortiori, X and Y are negatively correlated.

19. Suppose X1 , . . . , Xn are independent variables with a common density f and d.f. F. Let Y = min{X1 , . . . , Xn } and Z = max{X1 , . . . , Xn }. Determine the joint density of the pair (Y, Z). S OLUTION : As the event {Z ≤ z} occurs if, and only if, each of the events {X1 ≤ z}, . . . , {Xn ≤ z} occur, by the independence of X1 , . . . , Xn , we see that P{Z ≤ z} = P{X1 ≤ z, . . . , Xn ≤ z} = P{X1 ≤ z} × · · · × P{Xn ≤ z} = F(z)n . Differentiation shows that the marginal density of Z is given by nF(z)n−1 f(z). Likewise, the event {Y > y} occurs if, and only if, each of the events {X1 > y}, . . . , {Xn > y} occur, and so, by independence again, ` ´n P{Y > y} = P{X1 > y, . . . , Xn > y} = P{X1 > y} × · · · × P{Xn > y} = 1 − F(y) , whence, by total probability, the d.f. of Y is given by ` ´n P{Y ≤ y} = 1 − P{Y > y} = 1 − 1 − F(y) . ` ´n−1 Differentiation once more gives the marginal density of Y to be n 1 − F(y) f(y). A similar argument shows that ˆ ˜n P{Y > y, Z ≤ z} = P{y < X1 ≤ z, . . . , y < Xn ≤ z} = F(z) − F(y) . By additivity, we can now write down the joint d.f. G(y, z) of the pair (Y, Z): ˆ ˜n G(y, z) = P{Y ≤ y, Z ≤ z} = P{Z ≤ z} − P{Y > y, Z ≤ z} = F(z)n − F(z) − F(y) . Differentiation unearths the corresponding joint density: g(y, z) =

$$ g(y,z) = \frac{\partial^2}{\partial y\,\partial z}G(y,z) = n(n-1)\bigl[F(z) - F(y)\bigr]^{n-2} f(y)f(z). $$
These manoeuvres for maxima and minima of independent variables show up repeatedly.
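A simulation check of the joint distribution (not from the manual; an illustrative Python sketch for the uniform case, where F(z) − F(y) = z − y):

# Sketch: estimate P{Y > y0, Z <= z0} for n uniforms and compare with (z0 - y0)^n.
import random

n, trials = 5, 200_000
y0, z0 = 0.2, 0.7
hits = 0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    if min(xs) > y0 and max(xs) <= z0:
        hits += 1
print(hits / trials, (z0 - y0)**n)    # both ≈ 0.03125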

20. Positive variables. Suppose T is a positive random variable with d.f. F(t) and associated density f(t) with support in [0, ∞). If T has mean µ and variance σ², show that µ = ∫_0^∞ (1 − F(t)) dt and σ² + µ² = 2∫_0^∞ t(1 − F(t)) dt. [Hint: Write 1 − F(t) = ∫_t^∞ f(u) du.]

SOLUTION: Following the hint, for the mean we write
$$ \mu = \int_0^\infty u f(u)\,du = \int_0^\infty f(u)\int_0^u dt\,du = \int_0^\infty \int_t^\infty f(u)\,du\,dt = \int_0^\infty \bigl(1 - F(t)\bigr)\,dt. $$

0

The change in the order of integration in the penultimate step is justified as the integrand is positive and the integral converges. (Or, if you must, appeal to Fubini’s theorem.) A

138

VII.21

Probabilities on the Real Line

similar manipulation serves to establish a similar identity for the second moment. We have µ 2 + σ2 =

Z∞ 0

u2 f(u) du = 2

Z∞

Zu f(u)

0

t dt du Z∞ Z∞ Z∞ ` ´ =2 t f(u) du dt = 2 t 1 − F(t) dt 0

0

t

0
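The two identities are easy to confirm numerically (not from the manual; an illustrative Python sketch for the unit exponential, for which µ = 1 and σ² = 1):

# Sketch: Riemann-sum check of mu = ∫(1-F) dt and mu^2 + sigma^2 = 2∫ t(1-F) dt.
import math

def survival(t):                  # 1 - F(t) for the unit exponential density e^{-t}
    return math.exp(-t)

dt, T = 1e-4, 50.0
grid = [i * dt for i in range(int(T / dt))]
first = sum(survival(t) for t in grid) * dt             # ≈ mu = 1
second = 2 * sum(t * survival(t) for t in grid) * dt    # ≈ mu^2 + sigma^2 = 2
print(first, second)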

21. A waiting time problem. Trains are scheduled on the hour, every hour, but are subject to delays of up to one hour. We model the delays T_j as independent random variables with a common d.f. F(t) and associated density f(t) with support in the unit interval [0, 1). Let µ and σ² denote the mean and variance, respectively, of the delays. A passenger arrives at the station at a random time X. As, in units of one hour, only the fractional part of X matters we may suppose that X has been reduced modulo one and has the uniform density in the unit interval [0, 1). (The examples of Section IX.4 provide some justification for the assumption of a uniform arrival density.) The time the passenger has to wait for a train to arrive is a positive variable W with density g(t) and d.f. G(t). Show that the conditional density of W given that X = x is
$$ g(t \mid X = x) = \begin{cases} f(t + x) & \text{if } 0 \le t < 1 - x,\\ F(x)f(t + x - 1) & \text{if } 1 - x \le t < 2 - x.\end{cases} $$
Show hence that E(W) = 1/2 + σ². Thus, the expected waiting time of the passenger increases proportionately with the variance of the delay. [Problem 20 comes in handy.]

SOLUTION: The time the passenger waits depends on whether the train scheduled at the beginning of the hour in which he arrives has already departed or is yet to come. Accordingly,
$$ W = \begin{cases} T_1 - X & \text{if } T_1 > X,\\ 1 - X + T_2 & \text{if } T_1 \le X.\end{cases} $$
By additivity, the conditional distribution of the waiting time given X = x is hence given by
$$ G(t \mid X = x) = P\{W \le t \mid X = x\} = P\{W \le t, T_1 > X \mid X = x\} + P\{W \le t, T_1 \le X \mid X = x\} = P\{x < T_1 \le t + x\} + P\{T_1 \le x, T_2 \le t - (1 - x)\} = F(t + x) - F(x) + F(x)F(t - (1 - x)) $$
as T_1, T_2, and X are independent. As F(z) = 0 for z ≤ 0 and F(z) = 1 for z ≥ 1, we may identify two regimes of behaviour and write
$$ G(t \mid X = x) = \begin{cases} F(t + x) - F(x) & \text{if } 0 < t \le 1 - x,\\ 1 - F(x) + F(x)F(t - 1 + x) & \text{if } 1 - x < t \le 2 - x.\end{cases} \tag{4} $$

Taking derivatives, we obtain the conditional density  f(t + x) if 0 < t ≤ 1 − x, 0 g(t | X = x) = G (t | X = x) = F(x)f(t − 1 + x) if 1 − x < t ≤ 2 − x. Writing u for the uniform density in the unit interval, by taking expectation with respect to u we obtain ∞ ZZ

tg(t | X = x)u(x) dt dx =

E(W) =

tf(t+x) dt+ 0

−∞

Z 2−x

Z 1 »Z 1−x

– tF(x)f(t−1+x) dt dx.

1−x

0

Making the natural change of variable in the inner integrals on the right, changing the order of integration, and leveraging Problem 20, we obtain Z1 E(W) = 0

=

Zs f(s)

1 2

Z1

Z1 (s − x) dx ds +

0

0 Z1

Z1

s2 f(s) ds + 0

Z1 f(s)

f(s) 0

(s + 1 − x)F(x) dx ds 0

(s + 1 − x) dx ds 0

Z1

Z1 ` ´ (s + 1 − x) 1 − F(x) dx ds

f(s)

− 0

0

1 1 1 1 = (µ2 + σ2 ) + µ + − µ(µ + 1) + (µ2 + σ2 ) = + σ2 . 2 2 2 2
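A Monte Carlo corroboration of the formula E(W) = 1/2 + σ² (not from the manual; an illustrative Python sketch with uniform delays, so σ² = 1/12 and E(W) ≈ 0.5833):

# Sketch: simulate the passenger's waiting time under the case analysis above.
import random

trials, total = 200_000, 0.0
for _ in range(trials):
    x = random.random()                           # passenger arrival within the hour
    t1, t2 = random.random(), random.random()     # delays of this hour's and the next train
    w = t1 - x if t1 > x else 1 - x + t2
    total += w
print(total / trials, 0.5 + 1/12)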

22. Continuation. Determine g(t). What is G(1)? S OLUTION : We may re-express the conclusion of the previous problem in the form  f(t + x) if 0 < x ≤ 1 − t, 0 g(t | X = x) = G (t | X = x) = F(x)f(t − 1 + x) if 1 − t < x < 2 − t, to focus on the range of validity as x varies. As X is uniformly distributed in the unit interval, the joint density of the pair (W, X) is given by g(t | X = x)u(x) whence W has marginal density Z∞ g(t) = g(t | X = x)u(x) dx. −∞

It is clear that g has support only in 0 < t < 2 and there are now two cases depending on whether 0 < t ≤ 1 or 1 < t < 2 and we take these in turn. Suppose 0 < t ≤ 1. By integrating out x, we obtain Z 1−t g(t) =

Z1 f(t + x) dx +

0

F(x)f(t + x − 1) dx 1−t

Zt = 1 − F(t) +

F(u − t + 1)f(u) du 0

(0 < t ≤ 1).

Likewise, for 1 < t < 2, integrating out x yields Z 2−t g(t) =

Z1 F(x)f(t + x − 1) dx =

0

F(u − t + 1)f(u) du

(1 < t < 2).

t−1

If, for instance, f(x) = u(x) is uniform then  1 − t2 /2 g(t) = 2(1 − t) + t2 /2

if 0 < t ≤ 1, if 1 < t < 2.

By a similar process, by integrating out over the possibilities for x, we may formally write Z∞ G(t) = G(t | X = x)u(x) dx, −∞

and by leveraging (4) we obtain Z1

Z1 ` ´ 1 − F(x) + F(x)2 dx = 1 −

G(1) = 0

` ´ F(x) 1 − F(x) dx. 0

` ´ As 0 ≤ F(x) ≤ 1, we see that 0 ≤ F(x) 1 − F(x) ≤ 1/4 and so 3/4 ≤ G(1) ≤ 1. The lower bound is achieved if F has a jump of size 1/2 at 0 and another jump of size 1/2 at 1, that is to say, the train is equally likely to either arrive on time or be delayed by one hour. The upper bound is achieved if the train always arrives on time (or, equivalently for our purposes, always arrives one hour late). If, on the other hand, f(x) = u(x) is uniform in the unit interval then Z1 5 1 1 G(1) = 1 − x(1 − x) dx = 1 − + = . 2 3 6 0

VIII The Bernoulli Schema

1. Vandermonde's convolution. For any positive integers µ, ν, and n, prove by induction that
$$ \binom{\mu}{0}\binom{\nu}{n} + \binom{\mu}{1}\binom{\nu}{n-1} + \cdots + \binom{\mu}{n}\binom{\nu}{0} = \binom{\mu+\nu}{n}. \tag{1} $$
[First prove the result for µ = 1 and all ν.] Then provide a direct proof by a probabilistic argument. Alexandre Vandermonde wrote about this identity in 1792.

SOLUTION: An inductive argument. For the induction base, consider µ = 0 and any ν and n. Then both sides of (1) are equal to \binom{ν}{n}. To see the thrust of the induction, consider µ = 1: then (1) reduces to the claim that
$$ \binom{\nu}{n} + \binom{\nu}{n-1} = \binom{\nu+1}{n} $$
which, of course, is just Pascal's triangle. Suppose now as induction hypothesis that, for some µ ≥ 0,
$$ \sum_{k=0}^{n}\binom{m}{k}\binom{\nu}{n-k} = \binom{m+\nu}{n} $$
for all m ≤ µ and all ν and n. Then
$$ \sum_{k=0}^{n}\binom{\mu+1}{k}\binom{\nu}{n-k} \overset{(a)}{=} \sum_{k=0}^{n}\Bigl[\binom{\mu}{k} + \binom{\mu}{k-1}\Bigr]\binom{\nu}{n-k} = \sum_{k=0}^{n}\binom{\mu}{k}\binom{\nu}{n-k} + \sum_{k=0}^{n}\binom{\mu}{k-1}\binom{\nu}{n-k} \overset{(b)}{=} \binom{\mu+\nu}{n} + \binom{\mu+\nu}{n-1} \overset{(c)}{=} \binom{\mu+1+\nu}{n}. $$

Step (a) is by Pascal’s triangle, step (b) follows by two applications of the induction hypothesis, and step (c) is again Pascal’s triangle. This completes the induction. A combinatorial argument. Consider a population consisting of µ members of one class and ν members of another. If n members are selected from the entire population

then, for some k, there must be k members from the first class [who may be selected in \binom{\mu}{k} ways] and n − k members from the second [who may be selected in \binom{\nu}{n−k} ways]. Summing over all possibilities for k exhausts all possibilities for selecting n members from the entire population and (1) results.

2. Continuation. Show that, for every positive integer n,
$$ \binom{n}{0}^{2} + \binom{n}{1}^{2} + \binom{n}{2}^{2} + \cdots + \binom{n}{n}^{2} = \binom{2n}{n}. $$
SOLUTION: Substitute µ = ν = n in Vandermonde's convolution (1) and observe that \binom{n}{k} = \binom{n}{n−k}.
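The identity is trivially confirmed by computer (not from the manual; an illustrative Python sketch):

# Sketch: verify sum_k C(n,k)^2 = C(2n,n) for small n.
import math

for n in range(1, 12):
    lhs = sum(math.comb(n, k)**2 for k in range(n + 1))
    assert lhs == math.comb(2 * n, n)
print("checked")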

3. Closure of the binomial under convolutions. Show directly from the binomial sums that (bµ ? bν )(k) = bµ+ν (k) where, for each n, bn represents a binomial distribution corresponding to n tosses of a coin with fixed success probability p. Now provide a direct probabilistic argument. S OLUTION : A direct evaluation of the convolution sum. With success probability p fixed, we have X X“ µ ” j µ−j “ ν ” k−j ν−k+j ` ´ bµ (·; p)?bν (·; p) (k) = bµ (j; p)bν (k−j; p) = pq · p q j k−j j j X“ µ ”“ ν ” “ µ + ν ” k µ+ν−k = pk qµ+ν−k = p q = bµ+ν (k; p), j k−j k j the penultimate step justified by Vandermonde’s convolution. A probabilistic argument. Suppose X1 , . . . , Xµ , Xµ+1 , . . . , Xµ+ν represents a sequence of µ + ν Bernoulli trials with success probability p. Then S = X1 + · · · + Xµ ∼ Binomial(µ, p), T = Xµ+1 + · · · + Xµ+ν ∼ Binomial(ν, p), and S + T = X1 + · · · + Xµ+ν ∼ ` ´ Binomial(µ + ν; p). It follows that bµ (·; p) ∗ bν (·; p) (k) = bµ+ν (k; p).

4. Suppose X1 and X2 are independent Poisson variables of means λ1 and λ2 , respectively. Determine the conditional distribution of X1 given that X1 + X2 = n. S OLUTION : With X1 ∼ Poisson(λ1 ) and X2 ∼ Poisson(λ2 ), we know that X1 + X2 ∼ Poisson(λ1 + λ2 ) as the Poisson distribution is stable under convolutions. It follows that P{X1 = k, X2 = n − k} P{X1 = k, X1 + X2 = n} = P{X1 + X2 = n} P{X1 + X2 = n} n−k ffi k P{X1 = k} P{X2 = n − k} (λ1 + λ2 )n −λ1 λ1 −λ2 λ2 = e−(λ1 +λ2 ) =e ·e P{X1 + X2 = n} k! (n − k)! n! «k „ «n−k “n”„ λ ` ´ λ2 1 = = bn k; λ1 /(λ1 + λ2 ) . k λ1 + λ2 λ1 + λ2

P{X1 = k | X1 + X2 = n} =

In words: conditioned on the occurrence of the event X_1 + X_2 = n, the random variable X_1 has a binomial distribution corresponding to the number of accumulated successes in n tosses of a coin with success probability λ_1/(λ_1 + λ_2).

5. The multinomial distribution. Suppose {p1 , . . . , pr } forms a discrete probability distribution. Write p(k1 , k2 , . . . , kr ) =

n! pk1 pk2 · · · pkr r . k1 !k2 ! · · · kr ! 1 2

(2)

By summing over P all positive integers k1 , k2 , . . . , kr satisfying k1 +k2 +· · ·+kr = n, show that p(k1 , k2 , . . . , kr ) = 1 and hence that the values p(k1 , k2 , . . . , kr ) determine a discrete probability distribution on a lattice in r dimensions. This is the multinomial distribution; for r = 2 we recover the binomial distribution. S OLUTION : The terms on the right in (2) are patently positive. That they are properly normalised is trite by the multinomial theorem: X k1 ,k2 ...,kr ≥0 k1 +k2 +···+kr =n

n! k k p 1 p 2 · · · pkr r = (p1 + p2 + · · · + pr )n = 1. k1 !k2 ! · · · kr ! 1 2

When r = 2, writing p = p1 and p2 = q = 1 − p, we obtain p(k, n − k) =

n! pk qn−k = bn (k; p), k!(n − k)!

recovering the binomial distribution.

6. Central term of the multinomial. Show that the maximal term of the multinomial distribution (2) satisfies the inequalities npj − 1 < kj ≤ (n + r − 1)pj

(1 ≤ j ≤ r).

(3)

[Hint: Show that a maximal term of (2) satisfies pi kj ≤ pj (ki + 1) for every pair (i, j). Obtain the lower bound by summing over all j, the upper bound by summing over all i 6= j. W. Feller attributes the result and the neat proof to P. A. P. Moran.] S OLUTION : Suppose the multinomial distribution (2) achieves its maximum value for some selection of indices k1 , . . . , kn . Suppose j is any index with kj > 0. Let i be any other index. The replacements kj ← kj − 1 and ki ← ki + 1 while keeping the rest of the values fixed yields another valid lattice selection with a smaller probability (as k1 , . . . , kn are assumed to maximise the probability). Comparing the multinomial expressions in the two cases yields the inequality 1 1 k k +1 k −1 k p i pj j ≤ p ip j. (ki + 1)!(kj − 1)! i ki !kj ! i j

Clearing the denominators yields the inequality kj pi ≤ (ki + 1)pj .

(4)

While derived under the assumption that kj > 0, the inequality is trivially seen to hold also when kj = 0 and so (3) holds for all choices of i and j at any lattice point (k1 , . . . , kn ) at which the multinomial distribution attains its maximum. Summing both sides of (4) over all i with i 6= j yields X X kj (1 − pj ) = ki pj ≤ (ki + 1)pj = (n − kj + r − 1)pj i : i6=j

i : i6=j

which establishes the upper bound. On the other hand, for each i, the inequality in (4) is strict for at least one value of j. Summing both sides of (4) over all j hence yields X X npi = kj pi < (ki + 1)pj = ki + 1 j

j

which establishes the desired lower bound.

7. Central term of the trinomial. Show that the trinomial \frac{(3n)!}{j!\,k!\,(3n-j-k)!}\,3^{-3n} achieves its maximum value when j = k = n.

S OLUTION : With n ← 3n in (3), set r = 3 and p1 = p2 = p3 = 1/3. Then the lattice point where the maximum is attained satisfies 3n + 2 2 3n − 1 < ki ≤ or n − 1 < ki ≤ n + . 3 3 3 The interval (n − 1, n + 2/3] includes only one integral point which is n and so k1 = k2 = k3 = n specifies the unique location of the maximum of the trinomial.

8. In a variant of Pepys’s problem, which is more likely: that when twelve dice are tossed each of the face values 1 through 6 show exactly twice or that when six dice are tossed each of the face values show exactly once? S OLUTION : Consider first the case when six dice are thrown. Labelling the dice from 1 through 6, if each face value shows exactly once then, identifying outcomes by an ordered six-tuple of face values, the outcome (Π1 , . . . , Π6 ) may be identified with a random permutation of (1, . . . , 6). There are 66 possible outcomes for this experiment, each with equal probability, and of these precisely 6! constitute permutations of (1, . . . , 6). Accordingly, the probability that each of the six possible face values shows exactly once is given by 6! 5 = ≈ 0.015. 66 324 Now consider the case when twelve dice are thrown, each of the 612 outcomes being equally likely. Label the dice from 1 through 12 this time, the face values running from 1 through 6. With dice playing the rôle of balls and face values playing the rôle of urns, we have an urn problem, each outcome of the experiment corresponding to


a distribution of 12 labelled balls into 6 labelled urns. The event that each face value shows twice may be identified with a distribution of balls into urns, each urn receiving exactly two balls. The number of such arrangements is given by the product of binomial coefficients

$$\binom{12}{2}\binom{10}{2}\binom{8}{2}\binom{6}{2}\binom{4}{2}\binom{2}{2} = \frac{12!}{2^6} =: \binom{12}{2,2,2,2,2,2},$$

by first specifying the two balls out of the twelve that go into the first urn, then specifying the two balls out of the ten remaining that go into the second urn, and so on. The expression on the right is the multinomial coefficient. It follows that the probability that each face value shows twice is given by

$$\frac{12!}{2^6}\Big/6^{12} = \frac{1925}{559872} \approx 0.003.$$

It is thus almost five times more likely that each face value shows once in six throws of a die than that each face value shows twice in twelve throws of a die.

Generalisation: If n dice, each with n faces, are thrown, the probability that each face shows once is given by p_1(n) = n!/n^n, while, if 2n such dice are thrown, the probability that each face shows twice is given by p_2(n) = (2n)!/(2^n n^{2n}). It is not difficult to see that the ratio

$$r(n) = \frac{p_2(n)}{p_1(n)} = \frac{(2n)!}{(2n)^n\, n!}$$

decreases monotonically with n—Table 1 provides convincing numerical evidence (to four decimal places) if the reader does not wish to attempt to show this analytically.

  n      1    2     3      4      5      6      7      8      9
  r(n)   1.   0.75  0.5556 0.4102 0.3024 0.2228 0.1641 0.1208 0.0889

Table 1: Ratio of probabilities for Problem VIII.8.

Stirling's formula for the factorial, $m! \sim \sqrt{2\pi m}\, m^m e^{-m}$, makes the rate of asymptotic decay precise:

$$r(n) \sim \sqrt{2}\,\Big(\frac{2}{e}\Big)^n \qquad (n \to \infty).$$

The increased number of ways in which 2n dice could be distributed to achieve a fair representation of faces is dwarfed by the much smaller probabilities—the ratio of probabilities for the two cases decays exponentially fast.
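The exact probabilities and the ratio r(n) tabulated above are simple to reproduce by machine; here is a short Python check (a sketch only, with the helper names our own).

```python
from math import factorial, sqrt, e

def p1(n):
    """Each of n faces shows exactly once when n n-sided dice are thrown."""
    return factorial(n) / n ** n

def p2(n):
    """Each of n faces shows exactly twice when 2n n-sided dice are thrown."""
    return factorial(2 * n) / (2 ** n * n ** (2 * n))

print(p1(6), p2(6))                 # ~0.0154 and ~0.0034, a ratio of about 4.5
for n in range(1, 10):
    r = p2(n) / p1(n)
    print(n, round(r, 4), round(sqrt(2) * (2 / e) ** n, 4))  # exact vs Stirling estimate
```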

9. Ones and sixes. Let X and Y denote the number of 1s and 6s, respectively, that turn up in n independent throws of a fair die. What is the expected value of the product XY? [Problem VII.5 simplifies calculations.] S OLUTION : The number of ones and sixes, X and Y, respectively, individually have distribution Binomial(n, 1/6) but they are manifestly dependent. Conditioning allows us to seamlessly track down the dependencies.

147

The Bernoulli Schema

VIII.11

If we are given that exactly k sixes occur then each of the remaining n − k trials can take one of five equally likely values. Accordingly, conditioned on the event {Y = k} that exactly k sixes were observed, the number X of ones acquires the binomial distribution Binomial(n − k; 1/5) or, in other words, X | Y ∼ Binomial(n − Y; 1/5), whence E(X | Y) = (n − Y)/5. Total probability allows us to quickly stitch together the solution:

$$E(XY) = E\big(E(XY \mid Y)\big) = E\big(Y\,E(X \mid Y)\big) = E\Big(Y\cdot\frac{n-Y}{5}\Big) = \frac{n}{5}\,E(Y) - \frac{1}{5}\,E(Y^2) = \frac{n}{5}\,E(Y) - \frac{1}{5}\big[\mathrm{Var}(Y) + E(Y)^2\big].$$

As Y has mean E(Y) = n/6 and variance Var(Y) = 5n/36, we obtain

$$E(XY) = \frac{n}{5}\cdot\frac{n}{6} - \frac{1}{5}\Big[\frac{5n}{36} + \frac{n^2}{36}\Big] = \frac{n^2}{36} - \frac{n}{36}.$$

It follows that Cov(X, Y) = E(XY) − E(X) E(Y) = −n/36, or, with the proper normalisation, the correlation coefficient ρ is given by

$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = -\frac{n}{36}\Big/\frac{5n}{36} = -\frac{1}{5}.$$

The variables X and Y are indeed dependent, but weakly, with a small negative correlation.
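A Monte Carlo sanity check of these moments is straightforward; the Python sketch below (the sample size and n = 12 are arbitrary illustrative choices) estimates E(XY) and the correlation.

```python
import random

def trial(n):
    faces = [random.randint(1, 6) for _ in range(n)]
    return faces.count(1), faces.count(6)

n, m = 12, 200_000
xs, ys = zip(*(trial(n) for _ in range(m)))
exy = sum(x * y for x, y in zip(xs, ys)) / m
ex, ey = sum(xs) / m, sum(ys) / m
print("E(XY) ~", exy, "theory:", n * (n - 1) / 36)        # 12*11/36 = 3.667
cov = exy - ex * ey
var = lambda zs, mz: sum((z - mz) ** 2 for z in zs) / m
rho = cov / (var(xs, ex) * var(ys, ey)) ** 0.5
print("rho ~", rho, "theory: -0.2")
```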

10. Traffic. A pedestrian can cross a street at epochs k = 0, 1, 2, . . . . The event that a car will be passing the crossing at any given epoch is described by a Bernoulli trial with success probability p. The pedestrian can cross the street only if there will be no car passing over the next three epochs. Find the probability that the pedestrian has to wait exactly k = 0, 1, 2, 3, 4 epochs.

SOLUTION: Write q = 1 − p as usual and let P(k) denote the probability that the pedestrian can cross the street at epoch k. These are listed in Table 2 for 0 ≤ k ≤ 4.

  k      0     1      2      3      4
  P(k)   q³    pq³    pq³    pq³    pq³ − pq⁶

Table 2: Probabilities for pedestrian crossing at epoch k.

The case k = 0 is trivial as the pedestrian may cross immediately if, and only if, there are no cars crossing at epochs 1, 2, and 3. The remaining cases 1 ≤ k ≤ 4 may be verified by the observation that for the pedestrian to cross at epoch k it is necessary that the intersection be free at epochs k + 1, k + 2, and k + 3, and that the intersection be busy at epoch k (else the pedestrian could have crossed at epoch k − 1). For k = 1, 2, and 3, this condition is necessary and sufficient as the condition of the intersection at the previous 0, 1, and 2 epochs, respectively, is irrelevant given this condition. For the case k = 4, we have to rule out the possibility that the intersection is also free over the first three epochs. The


general formulæ are not at all obvious and are related to the theory of success runs. See Problems XV.20–27.
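The table entries are also easy to confirm by simulation; a minimal Python sketch follows, with p = 0.3 an arbitrary choice and the finite horizon a convenience of the simulation.

```python
import random

def wait_epochs(p, horizon=50):
    """First epoch k with no cars at epochs k+1, k+2, k+3; cars are Bernoulli(p)."""
    cars = [random.random() < p for _ in range(horizon)]
    for k in range(horizon - 3):
        if not any(cars[k + 1 : k + 4]):
            return k
    return None   # did not cross within the horizon (vanishingly rare here)

p, q, m = 0.3, 0.7, 200_000
counts = [0] * 5
for _ in range(m):
    k = wait_epochs(p)
    if k is not None and k < 5:
        counts[k] += 1
theory = [q**3, p*q**3, p*q**3, p*q**3, p*q**3 - p*q**6]
for k in range(5):
    print(k, counts[k] / m, round(theory[k], 4))
```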

11. Log convexity. We say that a function p defined on the integers is log convex if it satisfies p(k − 1)p(k + 1) ≤ p(k)² for all k. Show that the binomial distribution b_n(k; p) and the Poisson distribution p(k; λ) are both log convex.

SOLUTION: For the binomial distribution, we have

$$\frac{b_n(k-1; p)\,b_n(k+1; p)}{b_n(k; p)^2} = \frac{\binom{n}{k-1}\binom{n}{k+1}}{\binom{n}{k}^2} = \frac{k}{n-k+1}\cdot\frac{n-k}{k+1} \leq 1$$

as, by clearing the denominators, it is apparent that k(n − k) ≤ (k + 1)(n − k + 1) = k(n − k) + n + 1 for all k ≥ 1. Likewise, for the Poisson distribution, we have

$$\frac{p(k-1; \lambda)\,p(k+1; \lambda)}{p(k; \lambda)^2} = \frac{k!^2}{(k-1)!\,(k+1)!} = \frac{k}{k+1} \leq 1$$

for all k ≥ 1.

12. Continuation. Find a distribution p(k) on the positive integers for which equality holds in the log convex inequality, i.e., p(k − 1)p(k + 1) = p(k)² for all k ≥ 1.

SOLUTION: If equality holds in the log convex inequality then we must have p(k) = p(k − 1)²/p(k − 2) for all k ≥ 2. Writing p(0) = α and p(1) = β, we see that

$$p(2) = \frac{p(1)^2}{p(0)} = \frac{\beta^2}{\alpha}, \qquad p(3) = \frac{p(2)^2}{p(1)} = \frac{\beta^4}{\alpha^2}\cdot\frac{1}{\beta} = \frac{\beta^3}{\alpha^2}, \qquad p(4) = \frac{p(3)^2}{p(2)} = \frac{\beta^6}{\alpha^4}\cdot\frac{\alpha}{\beta^2} = \frac{\beta^4}{\alpha^3},$$

and, by induction, we readily verify that

$$p(k) = \frac{\beta^k}{\alpha^{k-1}} \qquad (k \geq 1).$$

(Induction shows that the result is true for k ≥ 2 and substituting k = 1 into the expression shows that the validity extends to k = 1 as well.) The probabilities must sum to one by normalisation of probability measure and so

$$1 = \sum_{k \geq 0} p(k) = p(0) + \sum_{k \geq 1} p(k) = \alpha + \sum_{k=1}^{\infty} \frac{\beta^k}{\alpha^{k-1}} = \alpha + \beta \sum_{k=1}^{\infty} \Big(\frac{\beta}{\alpha}\Big)^{k-1}.$$

The geometric series on the right converges if, and only if, β < α and, in this case, we obtain

$$1 = \alpha + \frac{\beta}{1 - \beta/\alpha}.$$

Clearing the denominator and solving for β we obtain β = α(1 − α) and substituting back shows that

$$p(k) = (1 - \alpha)^k \alpha \qquad (k \geq 0),$$

where, by inspection, the identity holds for k = 0 as well as its nominal range k ≥ 1. In other words, the geometric distribution is the unique distribution (up to the geometric type, that is to say, the parameter 0 < α ≤ 1) on the positive integers satisfying the log convex inequality with equality.

13. A book of 1000 pages contains 1000 misprints. Estimate the chances that a given page contains at least three misprints.

SOLUTION: Misprints occur at the mean rate of λ = 1000/1000 = 1 per page. This mean rate plays the rôle of np where n denotes the number of symbols on the page and p is the probability that a symbol is in error. Assuming symbol errors are independent, the number of errors on a given page has the binomial distribution b_n(·; p) which, by Poisson's approximation to the binomial, may be approximated by the Poisson distribution p(·; λ). Accordingly, the probability that the page contains at least three misprints is (approximately)

$$1 - p(0; 1) - p(1; 1) - p(2; 1) = 1 - e^{-1}\Big(1 + 1 + \frac{1}{2!}\Big) \approx 0.08,$$

or about 8%.

14. Poker. A royal flush in poker is a hand comprised of the cards ace, king, queen, jack, and ten, all from the same suit. Determine its probability. How large does n have to be so that the chances of at least one royal flush in n hands are at least 1 − e⁻¹ ≈ 2/3?

SOLUTION: A royal flush is uniquely determined once the suit is specified and, as there are four suits, the probability of obtaining a royal flush is

$$p = \frac{4}{\binom{52}{5}} = \frac{1}{649740} = 1.53908 \times 10^{-6}.$$

This expression plays the rôle of the success probability in a succession of coin tosses. By Poisson's approximation to the binomial, the probability that no royal flush is obtained in n hands is approximately e^{−np}. We require 1 − e^{−np} ≥ 1 − e^{−1} or n ≥ 1/p = 649,740.
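A two-line check (using comb and exp from Python's math module) confirms the numbers.

```python
from math import comb, exp

p = 4 / comb(52, 5)          # probability of a royal flush in a single hand
print(p, 1 / p)              # 1.539e-06 and 649740.0
n = 649_740
print(1 - exp(-n * p))       # ~0.632, i.e. about 1 - 1/e
```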

15. Show that the terms p(k; λ) = e^{−λ}λ^k/k! of the Poisson distribution attain their maximum value when k = ⌊λ⌋.

SOLUTION: The ratio of successive terms of the Poisson distribution of mean λ is

$$\frac{p(k; \lambda)}{p(k-1; \lambda)} = \frac{\lambda^k}{k!}\Big/\frac{\lambda^{k-1}}{(k-1)!} = \frac{\lambda}{k},$$

and so p(k; λ) ≥ p(k − 1; λ) if, and only if, λ ≥ k. Thus the terms of the Poisson distribution increase monotonically up till the point k = ⌊λ⌋ and decrease monotonically thereafter.

16. Show that the sequence α_k = b_n(k; p)/p(k; np) attains its maximum value for k = ⌊λ + 1⌋.

SOLUTION: Direct simplification shows that

$$\alpha_k = \frac{b_n(k; p)}{p(k; np)} = \frac{e^{np} q^n\, n!}{(nq)^k\,(n-k)!}$$

and so the ratio of successive terms in the sequence is given by

$$\frac{\alpha_k}{\alpha_{k-1}} = \frac{(nq)^{k-1}\,(n-k+1)!}{(nq)^k\,(n-k)!} = \frac{n-k+1}{nq}.$$

With λ = np, it follows that α_k ≥ α_{k−1} if, and only if, k ≤ λ + 1.
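Both maxima are easy to verify numerically. The Python sketch below is illustrative only: the parameter values are arbitrary, with non-integral λ chosen to avoid ties. It locates the argmax of the Poisson terms for Problem 15 and of the sequence α_k for Problem 16.

```python
from math import comb, exp, factorial, floor

def poisson(k, lam):
    return exp(-lam) * lam**k / factorial(k)

def binom(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

for lam in (2.7, 5.5, 9.3):
    k_star = max(range(40), key=lambda k: poisson(k, lam))
    print(lam, k_star, floor(lam))                # argmax equals floor(lam)

n, p = 30, 0.23                                   # arbitrary illustrative choice
lam = n * p
alpha = [binom(k, n, p) / poisson(k, lam) for k in range(n + 1)]
print(alpha.index(max(alpha)), floor(lam + 1))    # argmax equals floor(lam + 1)
```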

P∞ 17. Differentiating under the summation sign. Write S(q) = k=0 qk for 0 < q < 1. Show that it is permissible to differentiate under the summation sign to form a series representation for S 0 (q). Hence, by differentiating both sides of the geometric series formula obtain the mean of the geometric distribution w(k; p). S OLUTION : We may quote a theorem from analysis to justify differentiating under the summation sign but a first principles argument is almost as simple. Write S(q) = Pn P k k q = ˜qn+1 /(1 − q). Taking R (q) = n k>n k=0 q + Rn (q) where the remainder term ˆ derivatives, we see that Rn0 (q) = (1−q)−2 (n+1)qn (1−q)+qn+1 and so Rn0 (q) → 0 as P n → ∞ for every 0 < q < 1. As S(q) = k≥0 qk = (1 − q)−1 , we may now differentiate cheerfully under the summation sign and obtain S 0 (q) =

∞ X

kqk−1 =

k=0

1 . (1 − q)2

The mean of the geometric distribution follows quickly: X X k X k−1 kw(k; p) = kq p = pq kq = k

k≥0

k≥0

q pq = (1 − q)2 p

as q = 1 − p.

18. Continuation. By differentiating both sides of the geometric series formula twice derive the variance of w(k; p). S OLUTION P : The argument provided in the solution to Problem 17 above shows that with S(q) = k≥0 qk we may differentiate under the summation sign as many times as we wish. Taking two derivatives results in S 00 (q) =

X

k(k1 )qk−2 =

k≥0

151

2 . (1 − q)3

The Bernoulli Schema

VIII.20

For the variance of the geometric distribution, we may now write X

(k − q/p)2 w(k; p) =

k

= q2 p

X

k2 w(k; p) −

k≥0

X

k(k − 1)qk−2 + qp

k≥0

X k≥0

q2 p2

kqk−1 −

q2 2q2 p qp q2 q = + − 2 = 2. 2 3 2 p (1 − q) (1 − q) p p

19. Suppose X1 and X2 are independent with the common geometric distribution w(k; p). Determine the conditional distribution of X1 given that X1 + X2 = n. S OLUTION : As X1 and X2 are independent with the common geometric distribution k w(k; p) = q ` p for´ kn ≥2 0, the sumnX12 + X2 has the negative binomial distribution w2 (n; p) = 2+n−1 q p = (n + 1)q p for n ≥ 0. Accordingly, for k ∈ {0, 1, . . . , n}, n P{X1 = k | X1 + X2 = n} = =

P{X1 = k, X1 + X2 = n} P{X1 = k, X2 = n − k} = P{X1 + X2 = n} P{X1 + X2 = n}

P{X1 = k} P{X2 = n − k} w(k; p)w(n − k; p) qk p · qn−k p 1 = = = . P{X1 + X2 = n} w2 (n; p) (n + 1)qn p2 n+1

In other words, given that X1 + X2 = n, the conditional distribution of X1 is uniform over {0, 1, . . . , n}.
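The conditional uniformity shows up nicely in simulation; the following Python sketch (p = 0.3 and n = 6 are arbitrary choices, and the helper geometric is ours) tabulates the empirical conditional distribution of X₁ given X₁ + X₂ = n.

```python
import random
from collections import Counter

def geometric(p):
    """Number of failures before the first success, P{X = k} = q^k p."""
    k = 0
    while random.random() >= p:
        k += 1
    return k

p, n, m = 0.3, 6, 500_000
hits = Counter()
for _ in range(m):
    x1, x2 = geometric(p), geometric(p)
    if x1 + x2 == n:
        hits[x1] += 1
total = sum(hits.values())
print([round(hits[k] / total, 3) for k in range(n + 1)])  # each ~ 1/(n+1) = 0.143
```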

20. Waiting for Mr. Right. A game of chance proceeds by repeated throws of an n-sided die (with face values 1, . . . , n) resulting in a sequence of values X1 , X2 , . . . . A gambler pays a house fee f to enter the game and thereafter an additional fee a for every trial that she stays in the game. She may leave the game at any time and at departure receives as payoff the last face value seen before she quits. Thus, if she leaves after τ trials she will have paid f + aτ in accumulated fees and receives a payoff Xτ ; her winnings (negative winnings are losses) are Wτ = Xτ − (f + aτ). The gambler who plays a predetermined number of times has expected winnings (n + 1)/2 − (f + aτ) decreasing monotonically with τ so that there is no incentive to play for more than one trial. The reader who has done the marriage problem II.28, 29, however, may feel that our intrepid gambler can benefit from a properly selected waiting time strategy. Fix any integer 0 ≤ r ≤ n − 1 and let τr be the first trial for which Xτr > r. Determine the expected winnings E(Wτr ) of the gambler who plays till the first instant she sees a value r + 1 or higher and then quits, and optimise over r to maximise her expected winnings. Suppose n is large. Should she play the game if the house charges an entrance fee f = 9n/10 and a playing fee a = 1 for each trial? In this case, what would a fair entrance fee f be? If f = 0 what would a fair playing fee a be?


S OLUTION : Fix 0 ≤ r < n − 1 and let τ = min{ j : Xj > r }. Then τ has the geometric distribution P{τ = k} = w(k − 1; pr ) = (1 − pr )k−1 pr (k ≥ 1) where pr = (n − r)/n = 1 − r/n denotes the probability that a throw of the die yields a value > r. In particular, E(τ) = 1/pr = n/(n − r). The outcome of trial τ is special. Conditioned on the event that the gambler leaves the game after trial k, we have ´ 1 ` E(Xτ | τ = k) = E(Xj | Xj > r) = (r + 1) + (r + 2) + · · · + n n−r » – n(n + 1) r(r + 1) 1 n+r+1 n2 + n − r2 − r = − = . = n−r 2 2 2(n − r) 2 As this doesn’t depend on k, it follows that E(Xτ ) = (n+r+1)/2. Or, formally, summing over k, total probability tells us that E(Xτ ) =

∞ X

E(Xτ | τ = k) P{τ = k} =

k=1

∞ n+r+1 X n+r+1 . (1 − pr )k−1 pr = 2 2 k=1

The gambler’s winnings are given by Wτ = Xτ − (f + aτ). By additivity of expectation, it follows that „ « ` ´ n+r+1 an E(Wτ ) = E(Xτ ) − f + a E(τ) = − f+ =: g(r). (5) 2 n−r Viewing g(r) as a function of a real variable r, differentiation shows that g 0 (r) =

1 an −2an − and g 00 (r) = . 2 (n − r)2 (n − r)3

It is apparent hence √ g increases monotonically √ that in the interval [0, n − 1], the function for 0 ≤ r ≤ n − 2an and decreases monotonically for n − 2an ≤ r ≤ n − 1. It follows √ that (5) is maximised when the integer r is one of the two integers bracketing n − 2an. Ignoring the irritating integer round-off to keep the story clean, we see that max E(Wτ ) = r

n−

√ √ an n + 1 n − 2an 1 + −f− √ = n + − f − 2an. 2 2 2 2an

√ If f = 9n/10 and a = 1, for an optimal choice the gambler should select r = 2n resulting in expected winnings n+

1 2



9n 10

√ −

2n =

n 10



√ 2n +

1 2

which increases without bound as n increases. The gambling house has made a grave mistake and the player should take them to the cleaners. √ When f = 0 then maxr E(Wτ ) = n − 2an + 1/2. The game will be fair for the choice a = (2n + 1)2 /(8n) ∼ n/2 as one would anticipate; with such a choice the expected winnings for an optimal choice of strategy are zero.
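A simulation of the threshold strategy provides a useful cross-check; the Python sketch below (n = 1000, with the house fees of the problem) estimates the expected winnings at the threshold r = n − √(2an) suggested by the analysis and compares them with n + 1/2 − f − √(2an).

```python
import random

def winnings(n, r, f, a):
    """Play until the first face value exceeding r; return payoff minus fees."""
    trials = 0
    while True:
        trials += 1
        x = random.randint(1, n)
        if x > r:
            return x - (f + a * trials)

n, f, a = 1000, 900.0, 1.0
r = n - round((2 * a * n) ** 0.5)          # threshold from the analysis, ~955
m = 100_000
avg = sum(winnings(n, r, f, a) for _ in range(m)) / m
theory = n + 0.5 - f - (2 * a * n) ** 0.5
print(avg, theory)                          # both ~ 55.8
```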


21. Negative binomial. Determine the mean and the variance of the negative binomial distribution wr (·; p) by direct computation. Hence infer anew the mean of the geometric distribution. [Hint: The representa andkvariance r tion −r (−q) p for the negative binomial probabilities makes the computak tion simple.] S OLUTION : An application of the binomial theorem and a little elbow grease shows that the mean of the negative binomial distribution is given by X k

kwr (k; p) =

X „−r« (−q)k pr k k k X „−r − 1« rq (−q)k−1 = rqpr (1 − q)−r−1 = = (−r)(−q)pr . k−1 p k

Another application of the binomial theorem gives the variance: X

k(k − 1)wr (k; p) +

k

X

„ kwr (k; p) −

k

= (−r)(−r − 1)(−q)2 pr = r(r + 1)q p (1 − q)

−r−2

«2

X „−r − 2« k

2 r

rq p

+

k−2

(−q)k−2 +

rq r2 q2 − 2 p p

r(r + 1)q2 rq r2 q2 rq r2 q2 rq − 2 = + − 2 = 2. 2 p p p p p p

Substituting r = 1 gives the mean q/p and variance q/p2 of the geometric distribution.

22. In Banach’s match box problem (Example 7.4) determine the probability that at the moment the first pocket is emptied of matches (as opposed to the moment when the mathematician reaches in to find an empty pocket) the other contains exactly r matches. S OLUTION : The analysis is as in Example VIII.7.4 in ToP. With equal probability 1/2 of visiting the left and right pockets, at the moment the pocket on the right, say, is emptied there will have been n visits to the right (successes) and, if there are r matches left in the pocket on the left, then there will also have been n − r visits to the left (failures). The probability of this event is given by the negative binomial probability corresponding to a waiting time of n − r failures before the nth success and is hence given by „ « „ «n−r „ «n „ « 1 1 −n 2n − r − 1 −2n+r = 2 . wn (n − r; 1/2) = (−1)n−r n−r n−1 2 2 The situation of the left pocket emptying first has exactly the same probability and so, the probability that at the moment the first pocket is emptied the other pocket contains r matches is given by „ « 2n − r − 1 −2n+r+1 2wn (n − r; 1/2) = 2 . n−1
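The formula is readily corroborated by simulation; the Python sketch below (n = 8 matches per pocket, an arbitrary choice) estimates the distribution of the number of matches remaining in the other pocket when one pocket is first emptied and compares it with 2w_n(n − r; 1/2).

```python
import random
from math import comb

def leftover(n):
    """Matches remaining in the other pocket when one pocket is first emptied."""
    left = right = n
    while left > 0 and right > 0:
        if random.random() < 0.5:
            left -= 1
        else:
            right -= 1
    return left + right    # one of the two counts is zero

n, m = 8, 200_000
counts = [0] * (n + 1)
for _ in range(m):
    counts[leftover(n)] += 1
for r in range(1, n + 1):
    exact = comb(2 * n - r - 1, n - 1) * 2.0 ** (-2 * n + r + 1)
    print(r, round(counts[r] / m, 4), round(exact, 4))
```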


23. A sequence of Bernoulli trials with success probability p is continued until the rth success is obtained. Suppose X is the number of trials required. Evaluate E(r/X). (The infinite series that is obtained can be summed in closed form.) S OLUTION : The random variable X has the waiting time distribution „ « −r P{X = k} = wr (k − r; p) = (−q)k−r pr (k ≥ r). k−r It follows that „ « „ «r X « ∞ ∞ „ “r” X p r qk −r −r = (−q)k−r pr = r E (−1)k−r X k k−r q k=r k − r k k=r „ «r X « Zq ∞ „ p −r =r (−1)k−r xk−1 dx q k=r k − r 0 « „ «r Z q „ «r Z q ∞ „ X p p −r xr−1 (−x)k−r dx = r xr−1 (1 − x)−r dx, =r k − r q q 0 0 k=r

(6)

the final step following by the binomial theorem. We are now left with an elementary integral to evaluate. Induction provides the simplest path to solution. write Zq Iν = xν (1 − x)−ν−1 dx (ν ≥ 0). 0

The base of the induction is trivial as Zq dx = − log(1 − q) = − log(p). I0 = 1 −x 0 For ν ≥ 1, an integration by parts shows that ˛1 Z q „ «ν −xν (1 − x)−ν ˛˛ 1 q ν−1 −ν . Iν = − x (1 − x) dx = ˛ −ν ν p 0 0 Repeating the process it follows quickly by induction that Iν =

1 ν

„ «ν „ «ν−1 „ «ν−2 „ «1 (−1)ν−1 q q 1 q 1 q − + − ··· + + (−1)ν I0 p ν−1 p ν−2 p 1 p „ «ν−k+1 ν X ` ´ (−1)k−1 q = + (−1)ν − log(p) . ν − k + 1 p k=1

By substitution into (6) with ν = r − 1, it follows that E

“r” X

=

r−1 X k=1

(−1)k−1

r r−k

„ «k „ «r p p + (−1)r r log(p). q q


24. The problem of the points. In a succession of Bernoulli trials with success probability p, let Pm,n denote the probability that m successes occur before n failures. Show that Pm,n = pPm−1,n + qPm,n−1 for m, n ≥ 1 and solve the recurrence. This solution is due to Blaise Pascal. S OLUTION : We work by conditioning on the outcome of the first trial. If it is a success then in the subsequent trials we require m − 1 successes before n failures; if it is a failure then in the subsequent trials we require m successes before n − 1 failures. Accordingly, Pm,n = pPm−1,n + qPm,n−1

(m, n ≥ 1).

(7)

We may approach a solution of the recurrence in different ways. An inductive approach: It is always simplest to begin with the answer and in this case we already know it. It is now a simple matter to verify by induction that Pm,n =

n−1 X

wm (k; p) =

k=0

n−1 X„ k=0

« n−1 X „m + k − 1« k m −m (−q)k pm = q p . k k k=0

We begin with the right-hand side: by induction hypothesis, pPm−1,n + qPm,n−1 = p

n−1 X

wm−1 (k; p) + q

k=0

=

n−1 X„ k=0

= pm +

n−1 X »„ k=1

n−2 X

wm (k; p)

k=0

« n−2 X „m + k − 1« k+1 m m+k−2 qk pm + q p k k k=0

« „ «– n−1 X „m + k − 1« k m m+k−2 m+k−2 + qk pm = q p , k k−1 k k=0

where in the penultimate step we have simply grouped terms of like powers, the final step following by Pascal’s triangle. This concludes the induction. A generatingfunctionological1 approach: The boundary conditions for the recurrence (7) are given by  pm if m ≥ 0 and n = 1, Pm,n = (8) 1 − qn if m = 1 and n ≥ 0, as, when n = 1, each of the first m trials has to result in a success and, when m = 1, one of the first n trials has to result in a success. Extending the recurrence to all lattice points in the plane with integer coordinates (m, n) eliminates the need to deal with pernickety limits of summation. Accordingly, if we set  0 if n ≤ 0 (and m is any integer), Pm,n = (9) 1 if m ≤ 0 and n ≥ 1, 1 This

happy mouthful was coined by H. Wilf


then it is a simple matter to verify that the recurrence (7) is valid at all lattice points (m, n) excepting only on the half-line { (m, 1) : m ≤ 0 } = {(0, 1), (−1, 1), (−2, 1), . . . }. In particular, we may recover the original boundary conditions (8) from (7,9). A generatingfunctionological approach is now indicated. The recurrence (7) suggests at least three different generating functions, but all roads lead to Rome. Here is one. For each m, set up the generating function X gm (x) = Pm,n xn , n

the sum formally extending over all integers. Then g0 (x) =

X

P0,n xn =

n

∞ X

xn

n=1

[and we manfully resist the urge to reduce the geometric series on the right to x/(1 − x)]. For m ≥ 1, we multiply both sides of (7) by xn and, as the recurrence is valid without qualification for all n when m ≥ 1, we may sum over all n to obtain X X X gm (x) = Pm,n xn = p Pm−1,n xn + q Pm,n−1 xn = pgm−1 (x) + qxgm (x). n

n

n

Rearranging terms, we obtain „ gm (x) =

p 1 − qx

« gm−1 (x)

(m ≥ 1),

and, by repeatedly turning the recurrence crank we verify by induction that „ «m p gm (x) = g0 (x) (m ≥ 0). 1 − qx Expanding the power by the binomial theorem finishes off the job: for m ≥ 0, we have gm (x) = pm (1 − qx)−m g0 (x) =

X „−m« k

=

∞ XX k

n=1

k+n (ν←k+n)

wm (k; p)x

=

k

∞ X X k

ν=k+1

(−qx)k pm

∞ X

xn

n=1

wm (k; p)xν =

X„ X ν

« wm (k; p) xν ,

k≤ν−1

where, in the final step we simply swap the order of summation. We may now simply P read off the coefficient of xν to obtain Pm,ν = k≤ν−1 wm (k; p), as advertised.

25. Continuation. Argue that for m successes to occur before n failures it is necessary and sufficient that there be at least m successes in the first m + n − 1 trials and thence write down an expression for Pm,n . This solution is due to Pierre Fermat. S OLUTION : If there are m or more successes in the first m + n − 1 trials then, when the (m + n − 1)th trial is concluded, there will be fewer than n failures and at least m


successes, whence the mth success must occur before the nth failure. Conversely, if there are fewer than m successes in the first m + n − 1 trials then, when the (m + n − 1)th trial is concluded, there will have been at least n failures and fewer than m successes, so that n failures will have occurred before the mth success. Consequently, Pm,n =

m+n−1 X

bm+n−1 (j; p) =

j=m

m+n−1 X „ j=m

« m+n−1 pj qm+n−1−j . j

If she wishes, the reader may massage this expression into the form of the distribution function of the negative binomial as obtained in Pascal’s solution but I shall leave this to the reader to attempt.

26. Equally matched tennis players. Suppose players A and B alternate serve from point to point in a modified tie-break in a tennis match, with p_A and p_B, respectively, the probabilities of their winning the point on their serve. The tie-break is won by whichever player first takes a two-point lead. If p_A = 0.99 and p_B = 0.96 then the game will go on for a very long time with players holding serve. The first person to lose her serve will, with high probability, lose the game. Show that the probability that A wins is approximately 4/5. (This pretty observation is due to T. M. Cover.)

SOLUTION: Let A denote the event that A wins the tie-break. If A occurs then, for some r, there were r successive ties followed by two successive point wins for A. A tie occurs when, one after the other, the two players either both win or both lose their serve. Summing over all r ≥ 0, we see that

$$P(A) = \sum_{r=0}^{\infty} \big(p_A p_B + (1 - p_A)(1 - p_B)\big)^r\, p_A(1 - p_B) = \frac{p_A(1 - p_B)}{1 - p_A p_B - (1 - p_A)(1 - p_B)} = 0.804878.$$

The back-of-the-envelope calculation is fueled by the realisation that A and B win most of their serves so that the first person to lose her serve will (with high probability) lose the game. The odds of A winning are hence, approximately, 0.04/(0.01 + 0.04) = 4/5. The exact analysis takes into account the remote possibility of both players successively losing their serves.

27. Alternating servers in tennis. In singles play in tennis the player serving alternates from game to game. Model the fact that the server usually has a slight edge on points during her serve by setting A’s chances of winning a point as p1 on her serve and p2 on her opponent’s serve. Determine the game, set, and match win probabilities for A under this setting. 28. An approximation for the hypergeometric distribution. A large population of N elements is made up of two subpopulations, say, red and black, in the proportion p : q (where p + q = 1). A random sample of size n is taken without


replacement. Show that, as N → ∞, the probability that the random sample contains exactly k red elements tends to the binomial probability bn (k; p). S OLUTION : The falling factorial notation xm = x(x − 1) · · · (x − m + 1) simplifies presentation. The probability at hand [given by (VIII.10.1) in ToP] is `Np´` Nq ´

k!(n − k)! (Np)k (Nq)n−k · n! Nn n “ “

Q ” Q k−1 n−k−1 j „ « 1 − Np 1− j=1 j=1 n (Np)k (Nq)n−k

Q · = ` ´ n n−1 k N 1− j

hNp,N (k; n) =

k

`Nn−k ´ =

j=1

j Nq

” .

N

Simple bounds suffice. Dividing both sides by the binomial probability bn (k; p), we see that ”k−1 “ ”n−k−1 ”k−1 “ ”n−k−1 “ “ 1 1 1 − n−k−1 1 − Nq 1 − k−1 1 − Np Np Nq hNp,N (k; n) ≤ ≤ . ` ´ ` ´n−1 1 n−1 bn (k; p) 1− N 1 − n−1 N If p, q are fixed and non-zero and n is a fixed integer, the bookend bounds are both 1 + O(1/N) and so hNp,N (k; n) → bn (k; p) as N → ∞. Somewhat more can be gleaned as the argument shows that p = pN and n = nN may be allowed to depend on N: the ` ´ bounds capturing the relative error on either side are (conservatively) 1+O n2 /N(p ∧ q) where p ∧ q = min{p, q}. It follows that hNp,N (k; n) ∼ bn (k; p) as N → ∞ provided n2 /(p ∧ q)  N.

29. Capture–recapture. A subset of m animals out of a population of n has been captured, tagged, and released. Let R be the number of animals it is necessary to recapture (without re-release) in order to obtain k tagged animals. Show that R has distribution m  m − 1  n − m . n − 1  P{R = r} = n k−1 r−k r−1 and determine its expectation. S OLUTION : Recaptures stop with the capture of the kth tagged animal which may be specified in m ways. The probability that the rth capture is tagged is m/n. Given that the rth capture is tagged, the ` probability ´`n−m´‹`that ´it is the kth recapture in given by the n−1 hypergeometric probability m−1 . Accordingly, k−1 k−r k−1 `m−1´`n−m´ m `n−1r−k ´ · k−1 P{R = r} = n r−1 as, of the first r − 1 creatures sampled, precisely k − 1 must be tagged and the remaining r − k must be untagged. Now every sampling of the creatures until all n have been caught engenders a permutation of the animals. There will be m tagged animals in the sequence and these


punctuate sequences of untagged animals, some possibly empty. Let X1 , X2 , . . . , Xm , Xm+1 denote the number of untagged animals between successive tagged animals; X1 is the number of untagged animals in the sequence before the first tagged animal; Xm+1 is the number of untagged animals in the sequence after the last tagged animal; and for 2 ≤ j ≤ m, Xj is the number of untagged animals between the (j − 1)th and jth tagged animals). Then X1 + · · · + Xm+1 = n − m. Taking expectations of both sides, by additivity, E(X1 )+· · ·+E(Xm+1 ) = n−m. But the random variables X1 , . . . , Xm+1 have the same marginal distribution by symmetry. It follows that (m + 1) E(X1 ) = n − m, or E(X1 ) = (n − m)/(m + 1). The number of creatures recaptured until the kth tagged animal is caught may be expressed in the form R = X1 + · · · + Xk + k and so, by another application of additivity of expectation, » – ˆ ˜ k(n + 1) n−m E(R) = E(X1 ) + · · · + E(Xk ) + k = k E(X1 ) + 1 = k +1 = . m+1 m+1
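The expectation is easily checked by simulation; a brief Python sketch follows (n = 100 animals, m = 20 tagged, k = 5 are arbitrary illustrative choices, and the helper name is ours).

```python
import random

def recaptures_needed(n, m, k):
    """Sample without replacement until the k-th tagged animal appears."""
    animals = [1] * m + [0] * (n - m)     # 1 marks a tagged animal
    random.shuffle(animals)
    tagged = 0
    for r, a in enumerate(animals, start=1):
        tagged += a
        if tagged == k:
            return r
    raise ValueError("k exceeds the number of tagged animals")

n, m, k, trials = 100, 20, 5, 100_000
avg = sum(recaptures_needed(n, m, k) for _ in range(trials)) / trials
print(avg, k * (n + 1) / (m + 1))        # both ~ 24.05
```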

The following problems deal with random walks. 30. Enumerating paths. If a > 0 and b > 0, the number of sample paths of a random walk satisfying S1 > −b, . . . , Sn−1 > −b, and Sn = a equals Nn (a) − Nn (a + 2b). If b > a, how many paths are there whose first n − 1 steps lie strictly below b and whose nth step culminates in a? S OLUTION : To each path from (0, 0) to (a, n) which lies strictly above the horizontal line y = −b we may associate the reversed path from (a, n) to (0, 0). This reversed path provides the cleanest perspective for this problem. In the language of Section VIII.4 of ToP, there are precisely Nn (a) paths from (a, n) to (0, 0). These paths are of two kinds: those paths which lie strictly above the horizontal line y = −b (call these the type 1 paths; these are the ones of interest here) and those paths which intersect the horizontal line y = −b (call these the type 2 paths). The two situations are sketched in Figure 1. Proceeding left from (n, a) towards the origin, a

a

n

m

−b

−b

−2b

−2b A type 1 path above y = −b.

n

A type 2 path intersecting y = −b.

Figure 1: The two types of reversed paths from (n, a) to (0, 0). any type 2 path has a first (perhaps the only) point of intersection with the horizontal line y = −b at some point in the abscissa, say, x = m. If we reflect that portion of the reversed path from (m, −b) to (0, 0) about the line y = −b we obtain a path which


goes from (n, a) to (0, −2b) as shown by the dashed line in the figure. And as every path from (n, a) to (0, −2b) must intersect the line y = −b at some point it follows that every type 2 path may be put into one-to-one correspondence with a path from (n, a) to (0, −2b), and conversely. There are thus exactly Nn (a + 2b) type 2 paths. It follows that the number of type 1 paths is given by Nn (a) − Nn (a + 2b). For the second part of the problem we work again with the reflected path. Suppose b > a > 0. The Nn (a) reversed paths from (n, a) to (0, 0) are of two types: type 1’ paths which lie strictly below the line y = b (these are the paths of interest) and type 2’ paths which intersect the line y = b. For each type 2’ path, as we progress left from (n, a) towards the origin, there is a point in the abscissa, say, m 0 , at which the reversed path encounters the line y = b for the first time. If we reflect just the segment of the reversed path from (m 0 , 0) to (0, 0) about the line y = b we obtain a path from (n, a) to (0, 2b). And as each path from (n, a) to (0, 2b) must intersect the line y = b somewhere, it follows that the type 2’ paths may be put into one-to-one correspondence with the paths from (n, a) to (0, 2b). There are hence precisely Nn (2b − a) type 2’ paths. The number of type 1’ paths is hence given by Nn (a) − Nn (2b − a).

31. Regular parenthetical expressions. The use of parentheses in mathematics to group expressions and establish the order of operations is well understood. The rules governing what constitutes a proper collection of left and right parentheses in an expression are simple: (i) the number of left parentheses must equal the number of right parentheses; and (ii) as we go from left to right, the number of right parentheses encountered can never exceed the number of left parentheses previously encountered. We say that an expression consisting of left and right parentheses is regular (or well-formed) if it satisfies these two properties. Let Cn be the number of regular expressions of length 2n and set C0 = 1 for convenience. Show that Cn =

n X k=1

Ck−1 Cn−k =

n−1 X

Cj Cn−j−1

(n ≥ 1).

j=0

These are the Catalan numbers first described by Eugène Catalan in 1838. The widespread appearance of these numbers owes to the prevalence of this recurrence in applications. S OLUTION : Any regular parenthetical expression must begin with a left parenthesis and end with a right parenthesis (else it violates condition (ii)). Consider any regular expression of length 2n. Then there exists a unique, strictly positive integer 1 ≤ k ≤ n such that a regular expression is formed for the first time at location 2k. This implies the right parenthesis at location 2k is matched with the left parenthesis at location 1. (If the left parenthesis at location 1 is matched with an earlier right parenthesis then a regular expression would have been formed earlier.) If we strip the opening and closing parentheses at locations 1 and 2k then what remains is an imbedded expression of length 2k − 2. We argue that this imbedded expression must be a regular parenthetical expression in its own right. Indeed: (a) The imbedded expression must begin with a left

161

The Bernoulli Schema

VIII.32

parenthesis. (An opening right parenthesis would mean that the first two parentheses of the original expression already form a regular sub-expression; this is impossible as the first regular sub-expression encountered is assumed to be at location 2k). (b) The imbedded expression of length 2k − 2 satisfies (i). (Otherwise, there is a first location, say, j, in the imbedded sub-string at which the number of right parentheses is one more than the number of left parentheses. If we consider the sub-string from locations 1 through j + 1 in the original expression we see that both conditions (i) and (ii) are satisfied or, in other words, a regular sub-expression has already been formed at location j + 1 < 2k. But this is impossible by assumption that the first regular sub-expression occurs at location 2k.) (c) The imbedded expression of length 2k − 2 satisfies (ii). (Else, the number of left parentheses in the imbedded expression exceeds the number of right parentheses in the expression by at least 1. But then this means that, in the original expression, the substring from locations 1 through 2k is not balanced, contradicting the assumed regularity of the sub-string at that point.) Thus, the imbedded expression is a regular expression of length 2k − 2. By an entirely similar argument, if we excise the sub-string consisting of the first 2k parentheses from the original regular expression then what remains is a regular expression of length 2n − 2k making up the tail of the original regular expression. Any regular expression of length 2n with a first regular sub-expression occurring at location 2k has this form. And as any selection of these two imbedded regular expressions, the first of length 2k − 2 = 2(k − 1), the second of length 2n − 2k = 2(n − k), will result in a regular expression of length 2n with an imbedded first regular sub-expression at location 2k, it follows that there are precisely Ck−1 Cn−k regular expressions of length 2n with a first regular sub-expression at location 2k. Summing over the possible values of k, we obtain Cn =

n X

Ck−1 Cn−k =

k=1

n−1 X

Cj Cn−1−j

(n 6= 0)

(10)

j=0

with the natural initial condition C0 = 1. The recurrence is formally valid for n ≥ 1 but if we set Cn = 0 for n < 0 then we may extend its range of validity to all non-zero integers Z \ {0}.
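The recurrence is easy to exercise by machine; the Python sketch below generates the Catalan numbers from (10) and checks them against the closed form C_n = \binom{2n}{n}/(n+1) derived in the next problem.

```python
from math import comb

def catalan_by_recurrence(n_max):
    c = [1]                                   # C_0 = 1
    for n in range(1, n_max + 1):
        c.append(sum(c[j] * c[n - 1 - j] for j in range(n)))
    return c

cs = catalan_by_recurrence(10)
print(cs)                                      # 1, 1, 2, 5, 14, 42, 132, ...
assert all(c == comb(2 * n, n) // (n + 1) for n, c in enumerate(cs))
```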

32. Continuation. Identify ( with +1 and ) with −1. Argue that a necessary and sufficient condition for an expression of length 2n to be regular is that the corresponding walk returns to the origin in 2n steps through positive (i.e.,  2n 1 ≥ 0) values only and hence show that Cn = n+1 n . S OLUTION : 1◦ Via random walks. Each regular expression of length 2n may be placed in one-to-one correspondence with a segment of a sample path of a random walk satisfying S0 = 0, S1 = 1, S2 ≥ 0, . . . , S2n−2 ≥ 0, S2n−1 = 1, S2n = 0. If we shift this segment to the right by one unit and upward by one unit we see that this segment is in one-toone correspondence with a segment of a random walk which, starting from the origin (0, 0), moves to the point (2n + 1, 1) through strictly positive values. In the notation of Section VIII.4 of ToP, there are precisely N+ 2n+1 (1) such paths. Accordingly, Cn =

N+ 2n+1 (1)

1 = 2n + 1



2n + 1 2n+1+1 2

«

1 = 2n + 1

162

„ « „ « 1 2n + 1 2n = n+1 n+1 n

VIII.33

The Bernoulli Schema

by (VIII.4.1) in ToP. 2◦ Via generating functions. With the conventions C0 = 1 and Cn = 0 for n < 0, the recurrence (10) holds for all n 6= 0. Introduce the generating function X C(s) = Cn s n (11) n

where, as usual, unless explicitly constrained, the summation index is assumed to run through all integers. Multiplying both sides of (10) by sn and summing over the admissible range n 6= 0, the left-hand side becomes X X Cn sn = Cn sn − C0 s0 = C(s) − 1, n

n6=0

while the right-hand side is given by X n−1 X

(a)

Cj Cn−1−j sn =

n6=0 j=0

=s

X j

XX n6=0

Cj s

j

X

(b)

Cj Cn−1−j sn =

XX n

j

Cn−1−j s

n−1−j (k←n−1−j)

=

j

s

n

Cj Cn−1−j sn

X j

C j sj

X

Ck sk = sC(s)2 .

k

We may extend the range of inner summation in (a) to all integers j as for j < 0 we have Cj = 0 and for j ≥ n we have Cn−1−j = 0; likewise, we may extend the range of outer summation in (b) to all integers n by including the term n = 0 as Cj = 0 for j < 0 and C0−1−j = 0 for j ≥ 0. The generating function of the Catalan sequence hence satisfies the quadratic relation √ 1 ± 1 − 4s . sC(s)2 − C(s) + 1 = 0 or, equivalently, C(s) = 2s The boundary condition C(s) = C0 = 1 eliminates the root with the positive sign and so, by Newton’s binomial theorem [see Problem I.4], we obtain √ » « – ∞ „ X 1 − 4s 1 1/2 = 1− (−4s)m m 2s 2s m=0 „ « « ∞ ∞ „ X 1/2 (n←m−1) X 1/2 = (−1)m−1 22m−1 sm−1 = (−1)n 22n+1 sn . m n+1

C(s) =

1−

m=1

(12)

n=0

Comparing (11) and (12) termwise, we see that Cn = 0 for n < 0 (by fiat!) and, for n ≥ 0, „ « 1/2 Cn = (−1)n 22n+1 n+1 » „ « – „ « 1 1 2(n + 1) − 2 −2(n+1)+1 2n = (−1)(n+1)−1 2 (−1)n 22n+1 = . n + 1 (n + 1) − 1 n+1 n ` 1/2 ´ For the reduction of the fractional binomial coefficient n+1 in terms of the ordinary binomial coefficients, see Problem I.3.


33. Let u2n denote the probability that a random walk returns to the origin at step 2n, and let f2n denote the probability Pn that it returns to the origin for the first time at step 2n. Show that u2n = j=1 f2j u2n−2j and hence verify   that u2n = (−1)n −1/2 and f2n = (−1)n−1 1/2 n n . S OLUTION : The probability that there is a first return to the origin at step 2j followed by a subsequent return to the origin at step 2n is f2j u2n−2j as the portion of the walk from step 2j + 1 to 2n may be put in one-to-one correspondence with a random walk returning to the origin in 2n − 2j steps. If there is a return to the origin at all then there must be a first return to the origin at a step 2j for some 1 ≤ j ≤ n. Summing over j we see hence that n X u2n = f2j u2n−2j (n 6= 0) (13) j=1

with the natural boundary conditions u0 = 1 and f0 = 0. We may extend the validity of the recurrence to all non-zero integers by setting u2n = f2n = 0 for n < 0. We may verify by direct calculation that „ « „ « 2n −2n −1/2 u2n = 2 = (−1)n n n

(14)

[see Problem I.3] as there is a return to zero at step 2n if, and only if, there are n positive steps and n negative steps in some order. By our binomial conventions this identity is ` ´ for the valid for all integers n. It only remains to verify the solution f2n = (−1)n−1 1/2 n distribution of first returns to zero. 1◦ An inductive solution via Vandermonde’s convolution. To set up an inductive argument, we begin with the general form of Vandermonde’s convolution which asserts that „ « X « n „ «„ α+β α β = (15) n j n−j j=0

for all real α and β. [See the solution to Problem II.18.] Applying (13) to the case n = 1, as induction base we verify that „ « „ « „ « u2 1 −1/2 2 −2 1/2 f2 = = (−1)1 = 2 = = (−1)1−1 . 1 1 1 u0 2 As u0 = 1, we may rewrite (13) in the form f2n = u2n −

n−1 X

f2j u2n−2j .

(16)

j=1

As induction hypothesis we suppose now that f2j = (−1)j expression on the right in (16) then simplifies to u2n −

n−1 X j=1

f2j u2n−2j

`1/2´ j

for 1 ≤ j ≤ n − 1. The

„ « n−1 „ « „ « X −1/2 j−1 1/2 n−j −1/2 (−1) = (−1) − (−1) n j n−j n

j=1

164

VIII.33

= (−1)

The Bernoulli Schema

n

n−1 X„ j=0

«„ «„ « «„ „ « « n „ X −1/2 1/2 −1/2 1/2 −1/2 n n 1/2 . = (−1) − (−1) 0 n j n−j j n−j j=0

The sum on the right simplifies via Vandermonde’s convolution: «„ « „ « „ « n „ X 1/2 −1/2 1/2 − 1/2 0 = = =0 (n ≥ 1). j n−j n n j=0

It follows that u2n −

n−1 X

f2j u2n−2j = −(−1)n



j=1

1/2 n

«

= (−1)n−1

„ « 1/2 n

which, by (16), concludes the induction. 2◦ Via generating functions. Introduce the generating functions X X U(s) = u2n s2n and F(s) = f2n s2n . n

n

Beginning with (14), Newton’s binomial theorem shows that U(s) =

∞ X

(−1)n

n=0

„ « « ∞ „ −1/2 2n X −1/2 s = (−s2 )n = (1 − s2 )−1/2 . n n

(17)

n=0

To determine F(s), we begin with the recurrence (13). Multiplying both sides by s2n and summing over the entire range of validity n 6= 0, the left-hand side becomes X X u2n s2n = u2n s2n − u0 s0 = U(s) − 1. n

n6=0

The corresponding sum on the right evaluates to n XX

f2j u2n−2j s2n =

n6=0 j=1

=

X j

XX n6=0

f2j s

2j

X

f2j u2n−2j s2j s2n−2j =

j

u2n−2j s

2n−2j (k←n−j)

=

n

X

XX n

f2j s

f2j u2n−2j s2j s2n−2j

j

2j

j

X

u2k s2k = F(s)U(s)

k

as expanding the range of the inner sum to all integers j in the second step and, likewise, expanding the range of the outer sum to all integers n (by including the term n = 0) does not change the value of the sum in view of the convention u2n = f2n = 0 for n < 0. It follows that U(s) − 1 = F(s)U(s), or, equivalently, F(s) = 1 −

1 . U(s)

In view of (17), one more application of the binomial theorem shows that F(s) = 1 − (1 − s2 )1/2 = 1 −

« ∞ „ X 1/2 n=0

n

165

(−s2 )n =

∞ X n=1

(−1)n−1

„ « 1/2 2n s . n

The Bernoulli Schema As F(s) = clude that

VIII.35 P

n f2n s

2n

, we may simply read off the coefficients of the expansion to con 0 if n ≤ 0, ` ´ f2n = n−1 1/2 (−1) if n ≥ 1, n

as advertised.

34. Continuation. Hence show that u0 u2n + u2 u2n−2 + · · · + u2n u0 = 1. S OLUTION : 1◦ Via Vandermonde’s convolution. In view of (14), the given sum may be simplified via Vandermonde’s convolution (15) to the form n X

n X

« „ « −1/2 −1/2 n−k u2k u2n−2k = (−1) (−1) k n−k k=0 k=0 „ « „ « „ « „ « n X −1/2 −1/2 −1/2 − 1/2 −1 = (−1)n = (−1)n = (−1)n =1 k n−k n n k



k=0

for all n ≥ 0. 2◦ Via generating functions. Writing u2n = u(2n) for notational convenience, we recognise the given sum as the convolution of the sequence { u2n , n ∈ Z } with itself, u ? u(2n) :=

n X

u(2k)u(2n − 2k) =

X

k=0

u(2k)u(2n − 2k),

k

the convention u(2n) = 0 for n < 0 permitting us to extend the summation to all integers k. Let X U? (s) := u ? u(2n)s2n (18) n

denote the generating function of the sequence { u ? u(2n), n ∈ Z }. By expanding out the convolution and changing the order of summation we see that U? (s) =

XX n

u(2k)u(2n − 2k)s2n =

k (j←n−k)

=

X

X

u(2k)s2k

k

u(2k)s

k

2k

X

X

u(2n − 2k)s2n−2k

n

u(2j)s2j = U(s)2 = (1 − s2 )−1 ,

j

in view of (17). The compact expression on the right begs to be expanded in a geometric series and so ∞ X U? (s) = 1 · s2n . (19) n=0

Comparing the terms of the expansions (18,19), we see that u ? u(2n) = 1 for n ≥ 0, as averred.

35. Using the method of images show that the probability that S2n = 0 and the maximum of S1 , . . . , S2n−1 equals k is the same as P{S2n = k}−P{S2n = 2k + 2}.

166

VIII.37

The Bernoulli Schema

S OLUTION : Every sample path of a random walk which returns to the origin at epoch 2n and satisfies max{S1 , . . . , S2n−1 } ≥ k has a unique epoch, say, j, at which Sj = k for the first time. If we reflect that portion of the walk to the right of j about the line y = k, we obtain a unique walk from S0 = 0 to S2n = 2k, and vice versa. Accordingly, we see that  P S2n = 0, max{S1 , . . . , S2n−1 } ≥ k = P{S2n = 2k}. If max{S1 , . . . , S2n−1 } > k then the walk will intersect the line y = k + 1 for at least one epoch and by reflecting the portion of the walk to the right of the first such intersection we obtain a unique walk from S0 = 0 to S2n = 2k + 2. Accordingly,  P S2n = 0, max{S1 , . . . , S2n−1 } > k = P{S2n = 2k + 2}. By additivity, it follows that  P S2n = 0, max{S1 , . . . , S2n−1 } = k   = P S2n = 0, max{S1 , . . . , S2n−1 } ≥ k − P S2n = 0, max{S1 , . . . , S2n−1 } > k = P{S2n = 2k} − P{S2n = 2k + 2}.

36. For 0 ≤ k ≤ n, the probability that the first visit to S2n takes place at epoch 2k equals P{S2k = 0} P{S2n−2k = 0}. S OLUTION : Consider any walk for which the first visit to S2n occurs at epoch 2k. Move the origin to the point (n, S2n ) and view the walk in reverse. From this perspective, we see that the walk is equivalent to one which, over 2n steps, visits the origin for the last time at epoch 2n − 2k. This situation is covered by the arc sine law for last visits [see the displayed equation on p. 250 in ToP] and so the desired probability is given by P{S2k = 0} P{S2n−2k = 0}.

37. Maxima of random walks. Let Mn be the maximum of the values S0 , S1 , . . . , Sn . Show that P{Mn = r} = P{Sn = r} + P{Sn = r + 1} for each r ≥ 0. S OLUTION : Suppose Sn = s. If s ≥ r then it is certain that the event Mn ≥ r occurs. Consider now the case s < r. If Mn ≥ r then there must be a first epoch, say, j, at which Sj = r, the walk from this point onwards ultimately decreasing to the value Sn = s at epoch n. By reflecting this segment of the walk about the line y = r we see that the walk may be placed in one-to-one correspondence with a unique walk terminating at Sn = r + (r − s) = 2r − s. Accordingly, we see that  P{Sn = s} if s ≥ r, P{Mn ≥ r, Sn = s} = P{Sn = 2r − s} if s < r. Summing over the possible values of s, we see by additivity that P{Mn ≥ r} =

∞ X s=r

P{Sn = s} +

r−1 X

P{Sn = 2r − s}

s=−∞

= P{Sn ≥ r} +

∞ X

P{Sn = t} = P{Sn = r} + 2 P{Sn ≥ r + 1}.

t=r+1


One more application of additivity finishes up: P{Mn = r} = P{Mn ≥ r} − P{Mn ≥ r + 1} ˆ ˜ ˆ ˜ = P{Sn = r} + 2 P{Sn ≥ r + 1} − P{Sn = r + 1} + 2 P{Sn ≥ r + 2} = P{Sn = r} + P{Sn = r + 1}. At most one of the two terms on the right can be non-zero.
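For the symmetric walk the identity is easy to test by simulation; the Python sketch below (n = 12 steps, with fair ±1 increments assumed, as in this section) compares the empirical distribution of M_n with P{S_n = r} + P{S_n = r + 1}.

```python
import random
from math import comb

def simulate_max(n):
    s = m = 0
    for _ in range(n):
        s += random.choice((-1, 1))
        m = max(m, s)
    return m

def p_sn_equals(n, r):
    """P{S_n = r} for the symmetric walk: need (n + r)/2 positive steps."""
    if (n + r) % 2 or abs(r) > n:
        return 0.0
    return comb(n, (n + r) // 2) / 2 ** n

n, trials = 12, 400_000
counts = [0] * (n + 1)
for _ in range(trials):
    counts[simulate_max(n)] += 1
for r in range(6):
    print(r, counts[r] / trials, p_sn_equals(n, r) + p_sn_equals(n, r + 1))
```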

38. Continuation. Show that for i = 2k and i = 2k + 1, the probability that the walk reaches M2n for the first time at epoch i equals 12 P{S2k = 0} P{S2n−2k = 0}. S OLUTION : Work by reflection. The quoted result is given in (VIII.5.5) in ToP.


IX The Essence of Randomness 1. Intersecting chords. Four points X1 , X2 , X3 , X4 are chosen independently and at random on a circle (say of unit circumference for definiteness). Find the probability that the chords X1 X2 and X3 X4 intersect. S OLUTION : The combinatorial approach. Proceeding clockwise from X1 , the event of interest occurs if the remaining points are encountered in the order X3 X2 X4 or X4 X2 X3 . As all six orderings of X2 , X3 , and X4 are equally likely by symmetry, it follows that the probability that the chords X1 X2 and X3 X4 intersect is 2/6 = 1/3. (Formally, there are six permutations (i, j, k) of the indices (2, 3, 4) and the inequalities Xi ≤ Xj ≤ Xk partition the unit cube into six regions of equal volume.) An approach by unrolling the circle. Scatter the points on the circle. Break the circle at X1 and unroll it clockwise, say, for definiteness. The chords will intersect in the circle if, and only if, on the unrolled unit interval with the origin at X1 the points X3 and X4 lie on either side of X2 . By conditioning on the location x of X2 , the probability of the R1 ordering (0 = X1 , X3 , X2 , X4 ≤ 1) is hence 0 x(1 − x) dx = 1/6. As the ordering (0 = X1 , X4 , X2 , X3 ≤ 1) has the same probability, the probability that the chords intersect is 1/3. A solution by conditioning. Condition on the location of point X1 . Write Lk for the arc length (measured clockwise from X1 ) between the points X1 and Xk . Then L2 , L3 , and L4 are independent and uniformly distributed on [0, c] where c is the circumference of the circle. Conditioned on the event that L2 = t, the probability that the ordering X3 X2 X4 is encountered proceeding clockwise ` ´ from X1 is then given by P{L3 < t, L4 > t | L2 = t} = P{L3 < t} P{L4 > t} = ct 1 − ct . Take expectation with respect to L2 to get rid of the conditioning to obtain ˛1 ˛1 « Zc „ Z1 t t 1 x2 ˛˛ x3 ˛˛ 1 1 1 P{L3 < L2 < L4 } = 1− dt = x(1 − x) dx = − = − = . c c 2 ˛0 3 ˛0 2 3 6 0 c 0 Likewise, P{L4 < L2 < L3 } = 1 + 61 = 13 . 6

1 6

and it follows that the chords intersect with probability

The navvy way by working with directly with the four-dimensional sample space. The reader who finds conditioning arguments like this (and the hidden appeals to symmetry) slightly worrying can elect to work with the sample space directly at the cost of a

[Figure 1: Intersecting chords.]

little more algebra. The sample space is the space of real four-tuples (x1 , x2 , x3 , x4 ) with support in the four-dimensional unit cube [0, 1)4 , the distribution uniform (Lebesgue measure) in the cube. Identify differences modulo one and introduce the nonce notation (s − t)1 = (s − t) mod 1. Proceeding clockwise along the circle for definiteness, the two situations leading to chord intersection may now be identified with X3 lying between X1 and X2 and X4 between X2 and X1 or, vice versa, with X4 lying between X1 and X2 and X3 lying between X2 and X1 . In notation, the regions described by these constraints are defined by A1 = { (x1 , x2 , x3 , x4 ) : 0 ≤ x1 < 1, 0 ≤ x2 < 1, 0 ≤ (x3 − x1 )1 < (x2 − x1 )1 < (x4 − x1 )1 < 1 }, A2 = { (x1 , x2 , x3 , x4 ) : 0 ≤ x1 < 1, 0 ≤ x2 < 1, 0 ≤ (x4 − x1 )1 < (x2 − x1 )1 < (x3 − x1 )1 < 1 }. The event A that the cords X1 X2 and X3 X4 intersect occurs when either A1 occurs or A2 occurs, A = A1 ∪ A2 . As the regions A1 and A2 are disjoint, we see that P(A) = P(A1 ) + P(A2 ). To determine the probabilities on the right requires only a mildly irritating consideration of the cases x2 < x1 and x2 ≥ x1 . In the former case, for A1 to occur, we require x2 ≤ x3 < x1 and either 0 ≤ x4 < x2 or x1 ≤ x4 < 1; in the latter case, for A1 to occur, we require x1 ≤ x3 < x2 and either 0 ≤ x4 < x2 or x1 ≤ x4 < 1. »Z x1 Z x1 „Z x2 « – Z1 Z1 P(A1 ) = dx1 dx2 dx3 dx4 + dx4 0

0

x2

Z1 0

0

dx1 0 Z1

dx1

= 0

(1 − t)t dt + »

0 x21

dx4 x2

– dx2 (x2 − x1 )(1 − x2 + x1 )

x1

Z 1−x1

»Z x1



« Z x1 dx3

x2

Z1

dx2 (1 − x1 + x2 )(x1 − x2 ) +

0

Z1 =

0

0

Z1 dx3 +

»Z x1 dx1

=

x2

„Z x2 dx2

dx1

+ Z1

0

»Z x1

– t(1 − t) dt

0

(1 − x1 )2 (1 − x1 )3 x3 − 1 + − 2 3 2 3

– =

1 1 1 1 1 − + − = . 6 12 6 12 6

Of course, P(A2 ) can be computed in the same way and, in any case, is equal to P(A1 ) on grounds of symmetry. It follows that P(A) = 1/3 as the conditioning argument

170

IX.3

The Essence of Randomness

maximally exploiting symmetry gave us.

2. Find the probability that the quadratic λ2 −2Uλ+V has complex roots if the coefficients U and V are independent random variables with a common uniform density f(x) = 1/a for 0 < x < a. S OLUTION : The quadratic will have complex roots if and only if the discriminant is strictly negative, that is to say, ∆ = U2 − V < 0. As (U, V) has support in (0, a) × (0, a), the square of side a in the first quadrant, the event ∆ < 0 may be identified with a region in the square. There are two cases depending on whether a ≤ 1 or a > 1. C ASE 1: 0 < a ≤ 1. In this case 0 < a2 ≤ a and so we may identify the event {∆ < 0} with the region (u, v) : 0 < u < a, u2 < v < a . Integrating out with respect to the product of uniform measures, we obtain Za Za P{∆ < 0} = 0

u2

dv du = a a

« Z a„ a u2 du =1− . 1− a a 3 0

√ C ASE 2: a > 1. In this case 0 < √a < a and we may now identify the event {∆ < 0} with the region (u, v) : 0 < u < v, 0 < v < a . Integrating out again with respect to the the product of uniform measures (it will be convenient now to transpose the order of integration), we obtain Z a Z √v P{∆ < 0} = 0

0

1 du dv = 2 a a a

Za 0

√ 2 v dv = √ . 3 a

3. Suppose X1 , X2 , . . . is a sequence of independent random variables uniformly distributed in the unit interval [0, 1). Let 0 < t < 1 be fixed but arbitrary and let N denote the smallest integer n such that X1 + · · · + Xn > t. Show that P{N > n} = tn /n! and thence find the mean and variance of N. S OLUTION : As usual, S0 = 1 and let Sn = X1 + X2 + · · · + Xn for n ≥ 1. Fix 0 < t < 1 and let N = min{ n : Sn > t }. Then, for any n ≥ 0, it is now clear that N > n if, and only if, Sn ≤ t. An inductive proof. With the answer known it is natural to attempt a proof by induction. T HE BASE CASE : When n = 0 we have P{N > 0} = 1 = t0 /0! (trivially) and when n = 1 we see that P{N > 1} = P{S1 ≤ t} = P{X1 ≤ t} = t = t1 /1!. T HE INDUCTION STEP : For n ≥ 1, we may write Sn+1 = Sn + Xn+1 as the sum of two independent variables. Accordingly, by conditioning on Xn+1 , we see that P{Sn+1 ≤ t | Xn+1 = x} = P{Sn + Xn+1 ≤ t | Xn+1 = x} = P{Sn ≤ t − x} = (t − x)n /n!,

171

The Essence of Randomness

IX.3

the penultimate step follows as Sn and Xn+1 are independent, the final step by induction hypothesis. Integrating out with respect to the uniform density of Xn+1 , we obtain Z1 P{Sn+1 ≤ t | Xn+1 = x} dx

P{N > n + 1} = P{Sn+1 ≤ t} = 0

=

1 n!

Z1 (t − x)n dx = 0

tn+1 , (n + 1)!

completing the induction. A direct proof. Let x+ denote the positive part of x,  x if x > 0, x+ = 0 if x ≤ 0, n with xn + to be interpreted as (x+ ) bearing in mind the nonce convention that  1 if x > 0, 0 x+ = 0 if x ≤ 0.

By the Theorem of Section IX.1 [see Equation (IX.1.2)] in ToP, for each n ≥ 1, the partial sum Sn has density un (t) =

„ « n X 1 n (−1)k (t − k)n−1 . + k (n − 1)! k=0

Integrating out the summands termwise recovers the d.f. Un (t) of Sn . To begin, Zt

Zt (x − k)n−1 dx = +

0

Z t−k (x − k)n−1 dx = +

k

Accordingly,

Zt Un (t) =

un (x) dx = 0

un−1 du = + 0

(t − k)n + . n

„ « n 1 X n (−1)k (t − k)n +. k n! k=0

But (t − k)+ = 0 for all k ≥ 1 as we are given that 0 < t < 1. The only term that survives in the sum on the right hence is when k = 0 and so P{N > n} = P{Sn ≤ t} = Un (t) =

tn n!

(n ≥ 0).

Expectation and variance. The distribution of N is given by additivity to be P{N = n} = P{N > n − 1} − P{N > n} =

tn−1 tn − (n − 1)! n!

(n ≥ 1).

It is now not hard to simply sum out to determine the mean and variance. Alternatively, the following identities simplify the computation.

172

IX.4

The Essence of Randomness

T HEOREM Suppose N is a positive, arithmetic random variable. Then ∞ X

E(N) =

P{N > n},

(1)

n=0

∞ X

E(N2 ) − E(N) = 2

n P{N > n}.

(2)

n=0

P ROOF : The first of the claimed results is just Theorem XIII.5.4 in ToP [see also Problem VII.20]. The proof is elementary, the claimed result obtained simply by a rearrangement of the terms of the sum: ∞ X

∞ X

P{N > n} =

n=0

∞ X

P{N = m}

n=0 m=n+1

∞ m−1 X X

=

∞ X

P{N = m} =

m=1 n=0

m P{N = m} = E(N).

m=1

The verification of (2) follows a similar line. We have 2

∞ X

n P{N > n} = 2

n=0

∞ X

n

n=0

∞ X

P{N = n} = 2

m=n+1

∞ X

P{N = m}

m=1

=2

m−1 X

n

n=0

∞ X

P{N = m}

m=1

(m − 1)m 2

by summing the terms of the arithmetic series in the inner sum. The right-hand side simplifies to ∞ X

∞ X

m2 P{N = m} −

m=1

m P{N = m} = E(N2 ) − E(N),

m=1

as claimed.

I

Leveraging (1), we obtain E(N) =

∞ X

P{N > n} =

n=0

∞ n X t = et . n! n=0

To leverage (2), we first compute 2

∞ X

n P{N > n} = 2

n=0

∞ X ntn = 2tet . n! n=0

It follows that Var(N) = E(N2 ) − E(N)2 = 2

∞ X

n P{N > n} + E(N) − E(N)2 = 2tet + et − e2t .

n=0

173

The Essence of Randomness

IX.5

4. Suppose X is a random point on the unit circle in R2 . Determine the density of the length of its projection along the x-axis and find its mean. S OLUTION : We may represent X in polar coordinates by a random angle Θ measured counter-clockwise from the x-axis and uniformly distributed over [0, 2π). The projection X1 = cos Θ of X along the x-axis has support in (−1, 1) and has d.f. F(x) = P{cos Θ ≤ x} = 2 P{0 ≤ Θ < arccos x} =

1 π

arccos(x)

(−1 < x < 1).

Differentiation uncovers the density with support in (−1, 1): f(x) = F 0 (x) =

1 √ π 1 − x2

(−1 < x < 1).

No computations are needed to determine the mean as f is an even function, whence Z1 −1

x √ dx = 0. π 1 − x2

5. Interval lengths for random partitions. Suppose X1, . . . , Xn are independent and uniformly distributed over the interval [0, t). These points partition the interval into n + 1 subintervals whose lengths, taken in order from left to right, are L1, L2, . . . , Ln+1. Denote by pn(s; t) = P{min Lk > s} the probability that all n + 1 intervals are longer than s. Prove the recurrence relation
\[
p_n(s;t) = \frac{n}{t^n}\int_0^{t-s} x^{n-1}\, p_{n-1}(s;x)\,dx, \tag{3}
\]
and conclude that $p_n(s;t) = t^{-n}\bigl(t-(n+1)s\bigr)_+^n$.

SOLUTION: The problem is equivalent to the selection of n + 1 random points on a circle of perimeter t with L1, . . . , Ln+1 the successive spacings, say, in the clockwise orientation. By de Finetti's theorem [see Section IX.2 in ToP],
\[
P\{L_1 > u\} = \Bigl(1 - \frac{u}{t}\Bigr)_+^{n},
\]
and so L1 has density $\frac{n}{t}\bigl(1 - \frac{u}{t}\bigr)_+^{n-1}$ with support only in 0 ≤ u < t. The requirement that L1 > s restricts us to the range s ≤ u < t. Conditioned on L1 = u for s ≤ u < t, the first point to the right of X0 along the circumference is fixed and, by excising this interval and this point, the problem now reduces to the selection of n random points on a circle of perimeter t − u with the spacings all exceeding s. The conditional probability of this event is, by definition, equal to pn−1(s; t − u). Integrating out with respect to the distribution of L1, we see hence that
\[
p_n(s;t) = \frac{n}{t}\int_s^t \Bigl(1-\frac{u}{t}\Bigr)^{n-1} p_{n-1}(s;t-u)\,du,
\]


the change of variable x ← t − u reducing the right-hand side to the form (3). We proceed by induction to obtain a closed form. For the induction base it is simplest to begin with the case n = 0 which, on the circle of perimeter t, corresponds to picking a single point X0. The spacing length L1 is now just the circumference t of the circle and since we require L1 > s we must have s < t. Thus,
\[
p_0(s;t) = t^{-0}\Bigl(1-\frac{s}{t}\Bigr)_+^0 = \begin{cases} 0 & \text{if } s \ge t,\\ 1 & \text{if } 0 \le s < t.\end{cases}
\]
[Recall the convention for positive parts: we interpret $x_+^0$ to be identically equal to zero for $x \le 0$ and equal to 1 when $x > 0$; and for $n \ge 1$, we interpret $x_+^n = (x_+)^n$, the positive part operation taking priority over the exponentiation.] We may check that
\[
p_1(s;t) = \frac{1}{t}\int_0^{t-s} x^0\, p_0(s;x)\,dx = \begin{cases} 0 & \text{if } t \le 2s,\\ \frac{1}{t}\int_s^{t-s} dx = \frac{1}{t}(t-2s) & \text{if } t > 2s,\end{cases}
\]
which we may verify by direct computation as, for n = 1, the joint occurrence of the events L1 > s and L2 > s means that we require s < X1 < t − s, an event of probability $t^{-1}(t-2s)_+$.

As induction hypothesis, suppose accordingly that $p_{n-1}(s;x) = x^{-(n-1)}(x-ns)_+^{n-1}$. Then, for 0 ≤ s < t/(n + 1), we have
\[
p_n(s;t) = \int_0^{t-s} \frac{n x^{n-1}}{t^n}\, x^{-(n-1)}(x-ns)_+^{n-1}\,dx
= \frac{n}{t^n}\int_{ns}^{t-s} (x-ns)^{n-1}\,dx
\overset{y \leftarrow x-ns}{=} \frac{n}{t^n}\int_0^{t-(n+1)s} y^{n-1}\,dy
= \frac{1}{t^n}\bigl(t-(n+1)s\bigr)^n.
\]
This completes the induction.
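A quick Monte Carlo confirmation (an added sketch, not in the original text) of the closed form $t^{-n}(t-(n+1)s)_+^n$:

```python
import random

def all_spacings_exceed(n, s, t):
    """True if all n+1 spacings of [0, t) induced by n uniform points exceed s."""
    pts = sorted(random.uniform(0.0, t) for _ in range(n))
    gaps = [pts[0]] + [b - a for a, b in zip(pts, pts[1:])] + [t - pts[-1]]
    return min(gaps) > s

n, s, t, trials = 3, 0.05, 1.0, 200_000
emp = sum(all_spacings_exceed(n, s, t) for _ in range(trials)) / trials
exact = max(0.0, 1 - (n + 1) * s / t) ** n
print(emp, exact)
```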

6. An isosceles triangle is formed by a unit vector in the x-direction and another in a random direction. Determine the distribution of the length of the third side in R2 and in R3.

SOLUTION: We proceed algebraically though a geometric viewpoint is just as easy. Let L be the length of the third side. Write F2(t) and F3(t) for its d.f. in two and three dimensions, respectively, and let f2(t) and f3(t) denote the corresponding densities. It is clear from the geometry that 0 ≤ L ≤ 2 so that f2 and f3 have support only in the interval [0, 2].

In R2: With two vertices situated at (0, 0) and (1, 0), the third vertex has coordinates (cos Θ, sin Θ) where the angle Θ is uniformly distributed in the interval [0, 2π). The square of the length of the third side is hence given by
\[
L^2 = (1-\cos\Theta)^2 + \sin^2\Theta = 2 - 2\cos\Theta,
\]
a fact we could have deduced directly from the law of cosines. Then, for 0 ≤ t ≤ 2, we have
\[
F_2(t) = P\{L \le t\} = P\{2-2\cos\Theta \le t^2\} = 2\,P\{0 \le \Theta \le \arccos(1-t^2/2)\} = \tfrac{1}{\pi}\arccos\bigl(1-\tfrac{t^2}{2}\bigr).
\]


A routine differentiation shows hence that
\[
f_2(t) = \frac{2t}{\pi\sqrt{t^2(4-t^2)}} \qquad (0 \le t \le 2).
\]
In R3: The first two vertices are now situated at (0, 0, 0) and (1, 0, 0). Measuring “latitude” by angle 0 ≤ Θ < π with respect to the ray passing through the point (1, 0, 0) and “longitude” 0 ≤ Φ < 2π counter-clockwise along the equator with respect to the (y, z) plane, the coordinates of the third vertex are given by (cos Θ, sin Θ cos Φ, sin Θ sin Φ). The square of the length of the third side is hence given by
\[
L^2 = (1-\cos\Theta)^2 + \sin^2\Theta\cos^2\Phi + \sin^2\Theta\sin^2\Phi = (1-\cos\Theta)^2 + \sin^2\Theta = 2 - 2\cos\Theta,
\]
as could as easily have been intuited from the geometry as the longitude is irrelevant for the length. But now, per Lemma IX.3.1 in ToP, it is not Θ that is uniformly distributed but rather the projection cos Θ along the x-axis that is uniform over the interval [−1, 1]. It follows that
\[
F_3(t) = P\{L \le t\} = P\{2-2\cos\Theta \le t^2\} = P\{\cos\Theta \ge 1-t^2/2\} = \tfrac12\bigl(1-(1-t^2/2)\bigr) = \tfrac14 t^2.
\]
Differentiation yields the simple formula
\[
f_3(t) = \tfrac12 t \qquad (0 \le t \le 2).
\]
The moral? Projections have very different characters in two and three dimensions.
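The contrast is easy to see numerically; the following small simulation (added here as a sketch, not part of the original solution) compares the empirical d.f.s at t = 1 with the exact values 1/3 and 1/4.

```python
import random, math

def third_side_2d():
    th = random.uniform(0.0, 2.0 * math.pi)
    return math.sqrt(2 - 2 * math.cos(th))

def third_side_3d():
    c = random.uniform(-1.0, 1.0)        # cos(latitude) is uniform on [-1, 1]
    return math.sqrt(2 - 2 * c)

t, trials = 1.0, 200_000
emp2 = sum(third_side_2d() <= t for _ in range(trials)) / trials
emp3 = sum(third_side_3d() <= t for _ in range(trials)) / trials
print(emp2, math.acos(1 - t * t / 2) / math.pi)   # R^2: about 1/3
print(emp3, t * t / 4)                            # R^3: exactly 1/4
```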

7. A ray enters at the north pole of a unit circle, its angle with the positive x-axis being uniformly distributed over (−π/2, π/2). Find the distribution of the length of the chord within the circle.

SOLUTION: Write L for the length of the chord. Draw lines from each end of the chord to the centre of the circle and observe that we have constructed an isosceles triangle where the two equal sides have length equal to the radius of the circle (one) and the remaining side has length L. The angles opposite the equal sides are both equal to Θ. Elementary trigonometry now shows that L = 2 cos Θ. The d.f. of L is hence obtained via F(t) = P{L ≤ t} = P{cos Θ ≤ t/2}. Clearly, F(t) = 0 if t ≤ 0 and F(t) = 1 if t ≥ 2. Now suppose 0 < t < 2. As Θ is uniform in (−π/2, π/2) we obtain
\[
F(t) = \int_{\arccos(t/2)}^{\pi/2} \frac{d\theta}{\pi} + \int_{-\pi/2}^{-\arccos(t/2)} \frac{d\theta}{\pi}
= \frac{2}{\pi}\Bigl(\frac{\pi}{2} - \arccos\frac{t}{2}\Bigr) = \frac{2}{\pi}\arcsin\frac{t}{2}.
\]
By differentiating we obtain the density of the chord length
\[
f(t) = \begin{cases} \dfrac{2}{\pi\sqrt{4-t^2}} & \text{if } 0 < t < 2,\\[1ex] 0 & \text{otherwise.}\end{cases}
\]


8. A stick of unit-length is broken in n (random) places. Show that the probability that the n + 1 pieces can form a polygon is 1 − (n + 1)2^{−n}.

SOLUTION: This generalises Example IX.1.2 in ToP from triangles to polygons but the mode of analysis does not appear to extend easily and we cast about for a new principle. Consider n = 2 to begin. The selection of two random points in the unit interval engenders spacings L1, L2, and L3. A little thought now shows that these spacings can form the sides of a triangle if, and only if, the largest of the spacings does not exceed 1/2. Indeed, if the largest spacing is less than 1/2 then the sum of any two spacings exceeds 1/2 and hence the remaining spacing, whence the triangle inequality is satisfied for any combination of sides; conversely, if the largest spacing exceeds 1/2 then there is one spacing which exceeds the sum of the other two and the triangle inequality is violated. A geometric proof may also be crafted by placing the largest spacing, say, z as the base of a triangle to be constructed. If the remaining two spacings have lengths x and y, respectively, then draw a circle of radius x centred at one end of the baseline, and another circle of radius y centred at the other end. These circles must intersect as their centres are separated by a distance z < 1/2 while x + y = 1 − z > 1/2. Connecting a point of intersection of the circles with each of the two endpoints of the baseline yields the desired triangle with sides x, y, and z. This is essentially Euclid's proof of Proposition 1 (construction of an equilateral triangle) in the Elements! Thus, the probability that a triangle cannot be formed from the spacings L1, L2, and L3 is given by P{max{L1, L2, L3} > 1/2}. The events {L1 > 1/2}, {L2 > 1/2}, and {L3 > 1/2} are mutually exclusive and as, by de Finetti's theorem [Section IX.2 in ToP], the variables L1, L2, and L3 are exchangeable with the common distribution given by (IX.2.2) with n = 2 in ToP, we see that
\[
P\Bigl\{\max\{L_1,L_2,L_3\} > \tfrac12\Bigr\} = P\biggl(\,\bigcup_{j=1}^{3}\Bigl\{L_j > \tfrac12\Bigr\}\biggr) = \sum_{j=1}^{3} P\Bigl\{L_j > \tfrac12\Bigr\} = 3\,P\Bigl\{L_1 > \tfrac12\Bigr\} = 3\Bigl(1-\tfrac12\Bigr)^{2} = \tfrac34.
\]

The probability that a triangle can be formed is hence 1 − 3/4 = 1/4 which matches the result of Example IX.1.2 obtained by a more painstaking calculation in ToP. Now consider a selection of random points X1 , . . . , Xn in the unit interval. These engender spacings L1 , . . . , Ln+1 in the usual fashion and the question is whether these n + 1 spacings can form the sides of a polygon with n + 1 sides. The generalisation of the principle we had discovered for the triangle is immediate: the spacings can form a polygon if, and only if, the largest spacing does not exceed 1/2. Necessity is obvious. To verify sufficiency, we can construct a geometrical proof modelled after the one given for the triangle or work by induction on the number of sides in the polygon. Here is the inductive proof. Suppose as induction hypothesis that, given any lengths l, l1 , . . . , ln with l < l1 + · · · + ln , we can connect the two endpoints of a baseline AB of length l by connected segments of lengths l1 , . . . , ln to form a polygon on the baseline AB. The base of the induction has already been established for n = 2. Now suppose the spacings are arranged in increasing order L(1) < L(2) < · · · < L(n+1) (these are the order statistics of the spacings) and suppose further that L(n+1) < 1/2. Establish a baseline AB of length L(n+1) . With A as centre draw a circle of radius L(n) . The circle intersects the baseline at a point whose distance from B is L(n+1) − L(n) < L(1) + · · · + L(n−1) .


We may accordingly select a point, say, C on the circle with the line segment BC of length, say, l with l < L(1) + · · · + L(n−1). By induction hypothesis we may construct a polygon on this baseline with the remaining spacings. We can “convexify” the resulting figure (imagine a repulsive force pushing out the connected rods) at need to form a more standard polygon. Proceeding as in the case of the triangle, the events {L1 > 1/2}, . . . , {Ln+1 > 1/2} are mutually exclusive and the variables L1, . . . , Ln+1 exchangeable with common marginal distribution given by (IX.2.2) in ToP. Accordingly, the probability that a polygon cannot be formed from the spacings L1, . . . , Ln+1 is given by
\[
P\Bigl\{\max\{L_1,\dots,L_{n+1}\} > \tfrac12\Bigr\} = P\biggl(\,\bigcup_{j=1}^{n+1}\Bigl\{L_j > \tfrac12\Bigr\}\biggr) = \sum_{j=1}^{n+1} P\Bigl\{L_j > \tfrac12\Bigr\}
= (n+1)\,P\Bigl\{L_1 > \tfrac12\Bigr\} = (n+1)\Bigl(1-\tfrac12\Bigr)^{n} = (n+1)2^{-n}.
\]

The probability that the spacings can form a polygon is hence given by 1 − (n + 1)2−n . The exponential improvement in probability is striking: for n = 7 the probability that an octagon can be formed is already about 94%.
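The striking improvement is easy to observe in simulation; the sketch below (added here, not part of the original solution) checks the polygon probability 1 − (n + 1)2^{−n} for n = 7.

```python
import random

def forms_polygon(n):
    """True if the n+1 pieces of a unit stick broken at n uniform points form a polygon."""
    cuts = sorted(random.random() for _ in range(n))
    pieces = [cuts[0]] + [b - a for a, b in zip(cuts, cuts[1:])] + [1.0 - cuts[-1]]
    return max(pieces) <= 0.5

n, trials = 7, 200_000
emp = sum(forms_polygon(n) for _ in range(trials)) / trials
print(emp, 1 - (n + 1) * 2 ** (-n))     # about 0.9375 for n = 7
```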

9. An example of Robbins. The pair of random variables (X1, X2) is uniformly distributed in the shaded region of area 1/2 inside the unit square as shown in Figure 2. Let Y = X1 + X2. (The following questions can be answered geometrically without any calculation by referring to the figure. Of course, direct calculation works as well, albeit with a little more effort.) Determine the marginal densities of X1, X2, and Y. What are the mean and variance of Y? [Careful! Are X1 and X2 independent? Compare with (IX.1.1).]

[Figure 2: The Robbins example.]

SOLUTION: As the area of the shaded figure is 1/2, the joint density f(x1, x2) is equal to 1/(1/2) = 2 inside the shaded area and is equal to 0 outside it. A picture is now worth a thousand calculations. For the marginal density of X2, fix any 0 < x2 < 1 and consider the picture shown in Figure 3. Integrating out x1 from the joint density f(x1, x2), we find that the marginal density f2(x2) of X2 is determined by the length of the dashed line in the figure. And by symmetry this length is exactly 1/2. Thus,
\[
f_2(x_2) = \frac{\text{length of dashed line}}{1/2} = \frac{1/2}{1/2} = 1
\]


whenever 0 < x2 < 1, and clearly, f2(x2) is identically zero outside the unit interval. In other words, X2 is uniformly distributed in the unit interval. An identical visualisation shows that X1 is also uniformly distributed in the unit interval.

[Figure 3: The Robbins example: the marginal densities.]

For any value 0 < y < 2, the line x1 + x2 = y is shown dashed in Figure 4. Observe that as y increases from 0 to 2 the dashed lines move upwards at a 45° angle.

[Figure 4: The Robbins example: the distribution of the sum.]

The probability that Y ≤ y is then simply the ratio of the area of that portion of the shaded region below the dashed line to the area of the entire shaded region. By symmetry, this is just the ratio of the area of the portion of the unit square below the dashed line to the area of the entire unit square. (Observe how there is symmetry in the shaded and unshaded areas.) It follows that Y has the triangular density
\[
u_2(y) := \begin{cases} y & \text{if } 0 < y \le 1,\\ 2-y & \text{if } 1 < y < 2.\end{cases}
\]
We recall from (IX.1.1,2) and the convolutional theorem of Section IX.1 in ToP, however, that, if X1′ and X2′ are independent random variables with the common uniform distribution in the unit interval, then their sum Y′ = X1′ + X2′ also has the triangular density u2(·). It follows that the distribution of Y coincides with the distribution of Y′; they both have the same triangular density u2. The random variables X1 and X2 are clearly dependent, alas! While linearity of expectation continues to work unabated we cannot apply the addition theorem for


variances without further check to see whether X1 and X2 are uncorrelated. But we don't have to do this much work. As Y has the same distribution as Y′ it follows a fortiori that these variables have the same moments. In particular, E(Y) = E(Y′) = 1 and Var(Y) = Var(Y′) = 1/6. This cute example was constructed by Robbins. The moral: While the distribution of a sum of independent random variables is a convolution of the marginal distributions, the converse is not true in general.

10. Closure under convolution of the Cauchy density. Suppose X and Y are independent Cauchy random variables with density ca(x) = a/(π(a² + x²)). Show that the variables X + Y and X + X have the same density.

SOLUTION: To begin, the density of X + X = 2X is related by a simple scale to that of X and we see that X + X has density
\[
\tfrac12\, c_a\bigl(\tfrac{x}{2}\bigr) = \frac{2a}{\pi\bigl((2a)^2 + x^2\bigr)} = c_{2a}(x).
\]
On the other hand, the density of X + Y may be written as a convolutional integral
\[
c_a \star c_a(x) = \int_{-\infty}^{\infty} c_a(t)\, c_a(x-t)\,dt. \tag{4}
\]
The integral on the right, as we shall see shortly, also evaluates to c2a(x). Thus, X + X and X + Y both have the Cauchy density c2a with parameter 2a. This shows once more that a density that is expressible as a convolution of two densities need not arise via a sum of independent variables. So we are left with the need to evaluate the convolutional integral (4). The integral is unpleasant to compute directly (though partial fractions will work with sufficient elbow grease) but Fourier methods provide a quick path. Taking Fourier transforms of both sides, the convolution property of the Fourier transform [Theorem VI.2.1(4) in ToP] shows that the transform of ca ⋆ ca is ĉa(ξ)². Write c(x) = 1/(π(1 + x²)) for the standard Cauchy density with parameter 1. Then ca(x) = (1/a) c(x/a), whence the scale property of the Fourier transform [Theorem VI.2.1(3) in ToP] shows that ĉa(ξ) = ĉ(aξ). It follows that
\[
\widehat{c_a \star c_a}(\xi) = \hat c(a\xi)^2 \tag{5}
\]

and we now need to compute the Fourier transform
\[
\hat c(\xi) = \int_{-\infty}^{\infty} c(x)\, e^{i\xi x}\,dx.
\]
Two approaches to computing the Fourier transform are suggested.

Approach 1: Duality. This is the slick argument. Write g(x) = e^{−|x|}. A direct integration shows that
\[
\hat g(\xi) = \int_{-\infty}^{\infty} g(x)e^{i\xi x}\,dx = \int_{-\infty}^{0} e^{(1+i\xi)x}\,dx + \int_0^{\infty} e^{-(1-i\xi)x}\,dx
= \frac{e^{(1+i\xi)x}}{1+i\xi}\Big|_{-\infty}^{0} + \frac{e^{-(1-i\xi)x}}{-(1-i\xi)}\Big|_{0}^{\infty}
= \frac{1}{1+i\xi} + \frac{1}{1-i\xi} = \frac{2}{1+\xi^2}.
\]


It is easy to verify that ĝ is continuous (trivially) and integrable,
\[
\int_{-\infty}^{\infty} \bigl|\hat g(\xi)\bigr|\,d\xi = \int_{-\infty}^{\infty} \frac{2}{1+\xi^2}\,d\xi = 2\arctan(\xi)\Big|_{-\infty}^{\infty} = 2\pi,
\]
whence the inversion formula is in force [see the Simplest Fourier Inversion Theorem in Section VI.3 of ToP],
\[
e^{-|x|} = \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{2e^{-i\xi x}}{1+\xi^2}\,d\xi,
\]
the integral on the right converging uniformly in x. Exchanging the rôles of x and ξ we obtain
\[
\pi e^{-|\xi|} = \int_{-\infty}^{\infty} \frac{e^{-i\xi x}}{1+x^2}\,dx = \int_{-\infty}^{\infty} \frac{e^{i\xi x}}{1+x^2}\,dx,
\]
the final step following by replacing x by −x as the variable of integration. It follows that the function (1 + x²)⁻¹ has Fourier transform πe^{−|ξ|}, and hence ĉ(ξ) = e^{−|ξ|}.

Approach 2: Contour integration. We wish to compute
\[
\pi\,\hat c(\xi) = \int_{-\infty}^{\infty} \frac{e^{i\xi x}}{1+x^2}\,dx. \tag{6}
\]
For ξ = 0 the integration is elementary: the integral evaluates to
\[
\arctan(x)\Big|_{-\infty}^{\infty} = \pi.
\]
Cauchy's residue theorem now beckons. A little experimentation will show that we will need to consider the cases ξ < 0 and ξ > 0 separately. The function
\[
F(z) = \frac{e^{i\xi z}}{1+z^2} = \frac{e^{i\xi z}}{(z+i)(z-i)}
\]
defined on the complex plane has isolated singularities only in the pair of conjugate simple poles z = ±i on the imaginary axis and we observe that
\[
\operatorname{Res}_{z=i} F(z) = F(z)(z-i)\Big|_{z=i} = -\tfrac{i}{2}\,e^{-\xi}
\qquad\text{and}\qquad
\operatorname{Res}_{z=-i} F(z) = F(z)(z+i)\Big|_{z=-i} = \tfrac{i}{2}\,e^{\xi}.
\]

To evaluate the Fourier transform for ξ > 0 we use the upper semi-circular contour shown on the left of Figure 5.

[Figure 5: the closed semi-circular contours of radius R, the upper one (enclosing z = i) used for ξ > 0 and the lower one (enclosing z = −i) used for ξ < 0.]

To return to our evaluation, we partition the closed upper semi-circular contour Γ into the straight line segment Γ1 extending along the real axis from −R to R and the semi-circular segment Γ2. If R > 1 the contour Γ contains the simple pole z = i of F. It follows by Cauchy's residue theorem that
\[
\int_\Gamma F(z)\,dz = 2\pi i \operatorname{Res}_{z=i} F(z) = \pi e^{-\xi}.
\]
On the other hand,
\[
\int_\Gamma F(z)\,dz = \int_{\Gamma_1} F(z)\,dz + \int_{\Gamma_2} F(z)\,dz
\]

and we consider the contributions from the two segments in turn. Along the straight line segment the imaginary part of the integrand is identically zero and setting the variable of integration to z = x + i0, we obtain
\[
\int_{\Gamma_1} F(z)\,dz = \int_{-R+i0}^{R+i0} F(z)\,dz = \int_{-R}^{R} \frac{e^{i\xi x}}{1+x^2}\,dx
\]
and the right-hand side converges to the desired integral (6) as R → ∞. (The gimlet eyed reader may object. If she has a steel trap memory, she may recall that the improper Riemann integral on the right of (6) is defined as the limit of the integral
\[
\int_{-R_1}^{R_2} \frac{e^{i\xi x}}{1+x^2}\,dx
\]
as R1 and R2 tend independently to infinity. Okay, you caught me out. Accept the statement for the nonce; we'll return to it in a moment.) Now traipsing along the semi-circular segment Γ2, we may parametrise the value of z by setting z = Re^{iθ}; as θ varies from 0 to π we trace out the semi-circular arc counterclockwise. Thus,
\[
\int_{\Gamma_2} F(z)\,dz = \int_0^{\pi} \frac{\exp(i\xi Re^{i\theta})}{(Re^{i\theta}-1)(Re^{i\theta}+1)}\, iRe^{i\theta}\,d\theta.
\]
The integral is messy but we're not trying to solve it. All we need to argue is that it contributes a negligible amount for large R; we can afford to be fairly cavalier in our bounding. First, observe that
\[
\bigl|\exp(i\xi Re^{i\theta})\bigr| = \bigl|\exp(i\xi R\cos\theta)\bigr|\cdot\bigl|\exp(-\xi R\sin\theta)\bigr| = \exp(-\xi R\sin\theta) \le 1
\]
as sin θ is positive for 0 ≤ θ ≤ π and ξ > 0. As |z ± 1| ≥ |z| − 1 for any complex z, setting z = Re^{iθ} we obtain |Re^{iθ} ± 1| ≥ R − 1. It follows that
\[
\Bigl|\int_{\Gamma_2} F(z)\,dz\Bigr| \le \int_0^{\pi} \Bigl|\frac{\exp(i\xi Re^{i\theta})}{(Re^{i\theta}-1)(Re^{i\theta}+1)}\, iRe^{i\theta}\Bigr|\,d\theta \le \frac{\pi R}{(R-1)^2} \to 0 \qquad (R\to\infty).
\]
We've hence shown that, for ξ > 0,
\[
\int_{-R}^{R} \frac{e^{i\xi x}}{1+x^2}\,dx \to \pi e^{-\xi}
\]


as R → ∞. Now consider the case ξ < 0 and choose the lower semi-circular contour on the right of Figure 5. Again, if R > 1, the contour Γ contains the simple pole z = −i of F. Accordingly, by Cauchy's residue theorem,
\[
\int_\Gamma F(z)\,dz = 2\pi i \operatorname{Res}_{z=-i} F(z) = -\pi e^{\xi}.
\]
Making sure to adhere to the counterclockwise convention, we partition the closed lower semi-circular contour Γ into the straight line segment Γ1 extending along the real axis from R to −R and the semi-circular segment Γ2. We then have
\[
\int_\Gamma F(z)\,dz = \int_{\Gamma_1} F(z)\,dz + \int_{\Gamma_2} F(z)\,dz
\]
and we consider the contributions from the two segments in turn. Along the straight line segment the imaginary part of the integrand is identically zero and setting the variable of integration to z = x + i0, we obtain
\[
\int_{\Gamma_1} F(z)\,dz = \int_{R+i0}^{-R+i0} F(z)\,dz = \int_{R}^{-R} \frac{e^{i\xi x}}{1+x^2}\,dx = -\int_{-R}^{R} \frac{e^{i\xi x}}{1+x^2}\,dx.
\]
As before, we may parametrise the value of z along Γ2 by setting z = Re^{iθ}; as θ varies from π to 2π we trace out the semi-circular arc counterclockwise. Thus,
\[
\int_{\Gamma_2} F(z)\,dz = \int_{\pi}^{2\pi} \frac{\exp(i\xi Re^{i\theta})}{(Re^{i\theta}-1)(Re^{i\theta}+1)}\, iRe^{i\theta}\,d\theta.
\]
Proceeding as before, we first observe that
\[
\bigl|\exp(i\xi Re^{i\theta})\bigr| = \bigl|\exp(i\xi R\cos\theta)\bigr|\cdot\bigl|\exp(-\xi R\sin\theta)\bigr| = \exp(-\xi R\sin\theta) \le 1
\]
as sin θ ≤ 0 for π ≤ θ ≤ 2π and ξ < 0. As |z ± 1| ≥ |z| − 1 for any complex z, setting z = Re^{iθ} we obtain |Re^{iθ} ± 1| ≥ R − 1. It follows that
\[
\Bigl|\int_{\Gamma_2} F(z)\,dz\Bigr| \le \int_{\pi}^{2\pi} \Bigl|\frac{\exp(i\xi Re^{i\theta})}{(Re^{i\theta}-1)(Re^{i\theta}+1)}\, iRe^{i\theta}\Bigr|\,d\theta \le \frac{\pi R}{(R-1)^2} \to 0 \qquad (R\to\infty).
\]
And thus, for ξ < 0,
\[
\int_{-R}^{R} \frac{e^{i\xi x}}{1+x^2}\,dx \to \pi e^{\xi}
\]
as R → ∞. Putting the pieces together, we've established that, for every ξ ∈ R,
\[
\int_{-R}^{R} \frac{e^{i\xi x}}{1+x^2}\,dx \to \pi e^{-|\xi|}
\]


as R → ∞. Pick any R1, R2 > 1 and set R = min{R1, R2}. Then R → ∞ if both R1 and R2 increase without bound. It follows that
\[
\Bigl|\int_{-R_1}^{R_2} \frac{e^{i\xi x}}{1+x^2}\,dx - \int_{-R}^{R} \frac{e^{i\xi x}}{1+x^2}\,dx\Bigr|
= \Bigl|\int_{-R_1}^{-R} \frac{e^{i\xi x}}{1+x^2}\,dx + \int_{R}^{R_2} \frac{e^{i\xi x}}{1+x^2}\,dx\Bigr|
\le 2\int_R^{\infty} \frac{dx}{1+x^2} = 2\arctan(x)\Big|_R^{\infty} \to 0
\]
as arctan(R) → π/2 when R1 → ∞ and R2 → ∞. Consequently, we've shown that
\[
\pi\,\hat c(\xi) = \int_{-\infty}^{\infty} \frac{e^{i\xi x}}{1+x^2}\,dx = \lim_{R_1,R_2\to\infty}\int_{-R_1}^{R_2} \frac{e^{i\xi x}}{1+x^2}\,dx = \pi e^{-|\xi|}, \tag{7}
\]
as the duality principle so cheekily gave us. Returning to (5), by two applications of (7), we see that
\[
\widehat{c_a \star c_a}(\xi) = \hat c(a\xi)^2 = e^{-2a|\xi|} = \hat c(2a\xi).
\]
The scale property of the transform [Theorem VI.2.1(3) in ToP] shows that the right-hand side is the Fourier transform of (1/(2a)) c(x/(2a)) = c2a(x), and by the uniqueness of Fourier inversion [the Simplest Fourier Inversion Theorem in Section VI.3 of ToP], it follows that ca ⋆ ca(x) = c2a(x).
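A small numerical check (an added sketch, not part of the original solution): X + Y and 2X should both have the Cauchy d.f. with parameter 2a, namely F(x) = 1/2 + arctan(x/(2a))/π, and an empirical d.f. of sampled sums agrees with it.

```python
import random, math

def cauchy(a):
    # inverse-d.f. sampling of a Cauchy variable with parameter a
    return a * math.tan(math.pi * (random.random() - 0.5))

a, trials = 1.0, 200_000
sums = [cauchy(a) + cauchy(a) for _ in range(trials)]
for x in (-2.0, 0.0, 3.0):
    emp = sum(s <= x for s in sums) / trials
    exact = 0.5 + math.atan(x / (2 * a)) / math.pi
    print(x, emp, exact)
```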

11. Buffon's needle on a grid. Consider the plane scored by a grid of parallel lines, a units apart in the x-direction and b units apart in the y-direction. If M. Buffon drops his needle of length r (< min{a, b}) at random on the grid, show that the probability that the needle intersects a line equals r(2a + 2b − r)/(πab).

SOLUTION: Let (X, Y) be the location of the centre of the needle on the plane and let Θ be the angle it makes with the horizontal. The reduced variables X′, Y′, and Θ are assumed independent with X′ and Y′ uniform in the unit interval and Θ uniform in (−π/2, π/2]. (We carry out the computation for the reduced case r = a = b = 1, for which the claimed probability r(2a + 2b − r)/(πab) equals 3/π; the general case follows the same steps with P(Iv(θ)) = (r/a) cos θ and P(Ih(θ)) = (r/b)|sin θ|.) Let I be the event that the needle intersects any line. Conditioned on the event {Θ = θ}, let Ih(θ) represent the event that the needle intersects a horizontal line; likewise, given {Θ = θ}, let Iv(θ) represent the event that the needle intersects a vertical line. Observe that Ih(θ) and Iv(θ) are independent for each θ (as X′ and Y′ are independent). Furthermore, P(Iv(θ)) = |cos θ| = cos θ (as −π/2 < θ ≤ π/2) and P(Ih(θ)) = |sin θ|. It follows that
\[
P(I \mid \Theta = \theta) = P\bigl(I_h(\theta)\cup I_v(\theta)\bigr) = 1 - P\bigl(I_h^c(\theta)\bigr)\,P\bigl(I_v^c(\theta)\bigr)
= 1 - (1-\cos\theta)(1-|\sin\theta|) = \cos\theta + |\sin\theta| - \cos\theta\,|\sin\theta|.
\]
Observe that the right-hand side is an even function of θ. Remove conditioning by taking expectation to obtain
\[
P(I) = \frac{1}{\pi}\int_{-\pi/2}^{\pi/2} P(I\mid\Theta=\theta)\,d\theta = \frac{2}{\pi}\int_0^{\pi/2} (\cos\theta + \sin\theta - \cos\theta\sin\theta)\,d\theta
= \frac{2}{\pi}\biggl[\sin\theta\Big|_0^{\pi/2} - \cos\theta\Big|_0^{\pi/2} + \frac{\cos^2\theta}{2}\Big|_0^{\pi/2}\biggr] = \frac{3}{\pi}.
\]
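A Monte Carlo check of the general grid formula (an added sketch, not part of the original solution); the parameter values r, a, b below are arbitrary choices for illustration.

```python
import random, math

def crosses(r, a, b):
    """Drop a needle of length r on an a-by-b grid; True if it meets a line."""
    x = random.uniform(0.0, a)                 # centre reduced modulo the grid
    y = random.uniform(0.0, b)
    th = random.uniform(-math.pi / 2, math.pi / 2)
    dx, dy = (r / 2) * abs(math.cos(th)), (r / 2) * abs(math.sin(th))
    return (x - dx < 0) or (x + dx > a) or (y - dy < 0) or (y + dy > b)

r, a, b, trials = 0.7, 1.0, 1.5, 200_000
emp = sum(crosses(r, a, b) for _ in range(trials)) / trials
print(emp, r * (2 * a + 2 * b - r) / (math.pi * a * b))
```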


12. Buffon's cross. A cross comprised of two unit-length orthogonal segments welded together at their midpoints is dropped on the plane scored by parallel lines one unit apart. Let Z be the number of intersections of the cross with the parallel lines. Show that E(Z/2) = 2/π and Var(Z/2) = (1/π)(3 − √2) − 4/π². Comment on whether it is better to estimate π using the needle or the cross.

SOLUTION: With lines scored vertically, we suppose that the location of the cross along the abscissa and its rotational configuration are independent random variables. A picture helps smooth the way; see Figure 6. Let 0 ≤ X < 1 be the distance from the centre of the cross to the nearest line on the left and orient the coordinate axes at the cross centre. As there is always one arm of the cross in the first quadrant, let 0 ≤ Θ < π/2 be the angle it makes with the abscissa. We assume that X and Θ are independent with X uniformly distributed in the unit interval and Θ uniformly distributed in the interval [0, π/2).

[Figure 6: Buffon's cross.]

This is only one of many possible representations: this one has the virtue of exploiting symmetry in the four quadrants to simplify the computational task at hand.

1° No intersections. The needle whose arms lie in the first and third quadrants will not intersect a vertical line if, and only if, ½ cos Θ < X < 1 − ½ cos Θ, while the needle whose arms lie in the second and fourth quadrants will not intersect a vertical line if, and only if, ½ sin Θ < X < 1 − ½ sin Θ. It follows that Z = 0 if, and only if,
\[
\max\bigl\{\tfrac12\cos\Theta,\ \tfrac12\sin\Theta\bigr\} < X < \min\bigl\{1-\tfrac12\cos\Theta,\ 1-\tfrac12\sin\Theta\bigr\},
\]
or, equivalently,
\[
\begin{aligned}
\tfrac12\cos\Theta &< X < 1 - \tfrac12\cos\Theta &&\text{if } 0 \le \Theta < \pi/4,\\
\tfrac12\sin\Theta &< X < 1 - \tfrac12\sin\Theta &&\text{if } \pi/4 \le \Theta < \pi/2.
\end{aligned}
\]

Accordingly,
\[
\begin{aligned}
P\{Z = 0\} &= \frac{2}{\pi}\int_0^{\pi/4} \Bigl[\bigl(1-\tfrac12\cos\theta\bigr) - \tfrac12\cos\theta\Bigr]\,d\theta
+ \frac{2}{\pi}\int_{\pi/4}^{\pi/2} \Bigl[\bigl(1-\tfrac12\sin\theta\bigr) - \tfrac12\sin\theta\Bigr]\,d\theta\\
&= \frac{2}{\pi}\int_0^{\pi/4} \bigl[1-\cos\theta\bigr]\,d\theta + \frac{2}{\pi}\int_{\pi/4}^{\pi/2} \bigl[1-\sin\theta\bigr]\,d\theta\\
&= \frac{2}{\pi}\biggl[\theta\Big|_0^{\pi/4} - \sin\theta\Big|_0^{\pi/4} + \theta\Big|_{\pi/4}^{\pi/2} + \cos\theta\Big|_{\pi/4}^{\pi/2}\biggr]
= \frac{2}{\pi}\biggl[\frac{\pi}{4} - \frac{1}{\sqrt2} + \frac{\pi}{4} - \frac{1}{\sqrt2}\biggr] = 1 - \frac{2\sqrt2}{\pi}.
\end{aligned}
\]

2° One intersection. A single intersection must be either with the line left of centre or the line right of centre. By the symmetry inherent in the situation, these possibilities have equal probability. An intersection on the left occurs if there is either an intersection in the third quadrant or an intersection in the second quadrant; in other words, if ½ sin Θ < X ≤ ½ cos Θ for 0 ≤ Θ < π/4 or if ½ cos Θ < X ≤ ½ sin Θ for π/4 ≤ Θ < π/2. It follows that
\[
\begin{aligned}
P\{Z = 1\} &= \frac{4}{\pi}\int_0^{\pi/4} \tfrac12\bigl(\cos\theta - \sin\theta\bigr)\,d\theta + \frac{4}{\pi}\int_{\pi/4}^{\pi/2} \tfrac12\bigl(\sin\theta - \cos\theta\bigr)\,d\theta\\
&= \frac{2}{\pi}\sin\theta\Big|_0^{\pi/4} + \frac{2}{\pi}\cos\theta\Big|_0^{\pi/4} - \frac{2}{\pi}\cos\theta\Big|_{\pi/4}^{\pi/2} - \frac{2}{\pi}\sin\theta\Big|_{\pi/4}^{\pi/2}
= \frac{4\sqrt2}{\pi} - \frac{4}{\pi}.
\end{aligned}
\]

3° Two intersections. There will be two intersections of the needle arms with a vertical line if, and only if, either both needles intersect the line to the left of centre or both needles intersect the line to the right of centre; for the first we require X ≤ min{½ cos Θ, ½ sin Θ} while for the second we require X ≥ max{1 − ½ cos Θ, 1 − ½ sin Θ}. (There are no other possibilities for two intersections as geometrically it is clear that the cross cannot simultaneously intersect the line to the left of centre as well as the line to the right of centre: or, if you must, because ½ cos Θ + ½ sin Θ can never exceed one.) Again, by symmetry, the two possibilities for a double intersection have equal probability whence
\[
\begin{aligned}
P\{Z = 2\} &= 2\,P\bigl\{X \le \min\{\tfrac12\cos\Theta,\ \tfrac12\sin\Theta\}\bigr\}
= \frac{4}{\pi}\int_0^{\pi/4} \tfrac12\sin\theta\,d\theta + \frac{4}{\pi}\int_{\pi/4}^{\pi/2} \tfrac12\cos\theta\,d\theta\\
&= -\frac{2}{\pi}\cos\theta\Big|_0^{\pi/4} + \frac{2}{\pi}\sin\theta\Big|_{\pi/4}^{\pi/2}
= \frac{2}{\pi}\Bigl(1 - \frac{1}{\sqrt2}\Bigr) + \frac{2}{\pi}\Bigl(1 - \frac{1}{\sqrt2}\Bigr) = \frac{4}{\pi} - \frac{2\sqrt2}{\pi}.
\end{aligned}
\]
(If the symmetry arguments sound a little specious, write down the corresponding integrals and evaluate them along the same lines.)

4° Expectation. It follows that
\[
E\bigl(\tfrac12 Z\bigr) = 0\cdot P\{Z=0\} + \tfrac12\cdot P\{Z=1\} + \tfrac22\cdot P\{Z=2\}
= \frac{2\sqrt2}{\pi} - \frac{2}{\pi} + \frac{4}{\pi} - \frac{2\sqrt2}{\pi} = \frac{2}{\pi},
\]

matching what we had for a single needle. [For a direct verification, let A be the event that the needle lying in the first and third quadrants intersects a line, B the event that the needle lying in the second and fourth quadrants intersects a line. Then Z = 1A + 1B


whence E(Z) = E(1A) + E(1B) via additivity of expectation! It's true that the events A and B are dependent, but it makes little matter: expectation is always additive. But, by the analysis of Buffon's classical needle problem, the probability of a needle in isolation intersecting a line is 2/π whence E(1A) = E(1B) = E(Z/2) = 2/π. Using indicators like this is a powerful general trick. Remember it!]

5° Variance. Another easy calculation now shows that
\[
\operatorname{Var}\bigl(\tfrac12 Z\bigr) = \Bigl[0^2\cdot P\{Z=0\} + \bigl(\tfrac12\bigr)^2 P\{Z=1\} + \bigl(\tfrac22\bigr)^2 P\{Z=2\}\Bigr] - E\bigl(\tfrac12 Z\bigr)^2
= \frac{\sqrt2}{\pi} - \frac{1}{\pi} + \frac{4}{\pi} - \frac{2\sqrt2}{\pi} - \frac{4}{\pi^2} = \frac{1}{\pi}\bigl(3-\sqrt2\bigr) - \frac{4}{\pi^2}.
\]
6° Comparison. Let Y be the number of intersections of a standard Buffon needle with the system of parallel lines. Then Y is a Bernoulli trial with success probability 2/π and hence E(Y) = 2/π and Var(Y) = (2/π)(1 − 2/π). Consequently, Y and Z/2 both may serve as unbiased estimators of 2/π. However, it is clear that 3 − √2 < 2 so that Var(½Z) < Var(Y) and the cross provides a better estimate.
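The following simulation sketch (added here; not part of the original solution) drops the cross on vertical lines one unit apart and checks both the mean 2/π and the variance (3 − √2)/π − 4/π² of Z/2.

```python
import random, math

def crossings():
    """Number of line crossings Z for one drop of the unit cross."""
    x = random.random()                       # distance to the line on the left
    th = random.uniform(0.0, math.pi / 2)     # angle of the first-quadrant arm
    z = 0
    for phi in (th, th + math.pi / 2):        # the two orthogonal unit needles
        half = 0.5 * abs(math.cos(phi))       # horizontal half-extent
        if x < half or x > 1.0 - half:
            z += 1
    return z

trials = 200_000
zs = [crossings() for _ in range(trials)]
m = sum(z / 2 for z in zs) / trials
v = sum((z / 2 - m) ** 2 for z in zs) / trials
print(m, 2 / math.pi)
print(v, (3 - math.sqrt(2)) / math.pi - 4 / math.pi ** 2)
```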

13. Record values. Denote by X0 a speculator's financial loss at a downturn in the stock market. Suppose that his friends suffer losses X1, X2, . . . . We suppose that X0, X1, . . . are independent random variables with a common distribution. The nature of the distribution does not matter but as the exponential distribution serves as a model for randomness, we assume that the Xj are exponentially distributed with common density f(x) = αe^{−αx} for x > 0. Here α is positive and α⁻¹ is hence the mean of the Xj. To find a measure of our rash gambler's ill luck we ask how long it will be before a friend experiences worse luck: the waiting time N is the value of the smallest subscript n such that Xn > X0. Determine the distribution of N, that is to say, find p(n) := P{N = n} for each n = 1, 2, . . . . What can we say about the expected waiting time before we find a friend with worse luck? [The direct approach by conditioning requires a successive integration of an n-dimensional integral. It is not difficult but you will need to be careful about the limits of integration. There is also a simple direct path to the answer.]

SOLUTION: The event {N = n} occurs if, and only if, the n inequalities X0 ≥ X1, X0 ≥ X2, . . . , X0 ≥ Xn−1, and X0 < Xn are all simultaneously satisfied. We may simplify the algebra by realising that scaling all the variables by a positive constant does not affect the inequalities and hence, by replacing X0 ← αX0, X1 ← αX1, . . . , Xn ← αXn, . . . , we may as well consider the situation where X0, X1, . . . are independent variables with the common exponential distribution of mean one.

The direct approach by partitioning the sample space. By systematically integrating out the variables one at a time, we may write down the desired probability as an iterated integral over the first (positive) orthant in n + 1 dimensions in the form
\[
p(n) = \int_0^{\infty} dx_0\, e^{-x_0} \int_0^{x_0} dx_1\, e^{-x_1} \cdots \int_0^{x_0} dx_{n-1}\, e^{-x_{n-1}} \int_{x_0}^{\infty} dx_n\, e^{-x_n}. \tag{8}
\]

The inner integrals over x1 through xn−1 each evaluate to 1 − e^{−x₀}, and the innermost integral over xn evaluates to e^{−x₀}. It follows that
\[
p(n) = \int_0^{\infty} dx_0\, e^{-x_0}\bigl(1-e^{-x_0}\bigr)^{n-1} e^{-x_0}
\overset{(*)}{=} \int_0^1 dt\, t^{n-1}(1-t)
= \frac{t^n}{n}\Big|_0^1 - \frac{t^{n+1}}{n+1}\Big|_0^1 = \frac{1}{n} - \frac{1}{n+1} = \frac{1}{n(n+1)} \qquad (n \ge 1),
\]

the change of variable t = 1 − e^{−x₀}, dt = e^{−x₀} dx₀ in the step marked (∗) making the evaluation trite. The analysis is essentially unchanged if we replace the exponential distribution by any other density f with corresponding d.f. F. The relation (8) now merely becomes
\[
p(n) = \int_{-\infty}^{\infty} dx_0\, f(x_0) \int_{-\infty}^{x_0} dx_1\, f(x_1) \cdots \int_{-\infty}^{x_0} dx_{n-1}\, f(x_{n-1}) \int_{x_0}^{\infty} dx_n\, f(x_n)
= \int_{-\infty}^{\infty} dx_0\, f(x_0)\, F(x_0)^{n-1}\bigl(1-F(x_0)\bigr)
= \frac{F(x_0)^n}{n}\Big|_{-\infty}^{\infty} - \frac{F(x_0)^{n+1}}{n+1}\Big|_{-\infty}^{\infty} = \frac{1}{n} - \frac{1}{n+1},
\]

as before.

An approach by conditioning. The key to the simplification of the integrals is the realisation that the events X1 ≤ X0, . . . , Xn−1 ≤ X0, Xn > X0 are conditionally independent given the value of X0. In notation,
\[
P\{X_1 \le X_0, \dots, X_{n-1} \le X_0,\ X_n > X_0 \mid X_0 = x_0\} = F(x_0)^{n-1}\bigl(1 - F(x_0)\bigr).
\]
Integrating out with respect to the distribution of X0 yields
\[
p(n) = \int_{-\infty}^{\infty} F(x_0)^{n-1}\bigl(1-F(x_0)\bigr)\, f(x_0)\,dx_0 = \int_0^1 t^{n-1}(1-t)\,dt = \frac{1}{n} - \frac{1}{n+1} \qquad (n \ge 1),
\]

the penultimate step following by the natural change of variable t = F(x0), dt = f(x0) dx0, with the corresponding change in the limits of integration.

The combinatorial approach. The discovery that the solution is distribution independent suggests that there must be some underlying symmetries in the problem. And indeed there are: the distribution of (X0, X1, . . . , Xn) is invariant with respect to permutations of elements. Each realisation of (X0, X1, . . . , Xn) engenders a permutation of indices (0, 1, . . . , n) ↦ (Π0, Π1, . . . , Πn) corresponding to the induced ordering XΠ0 < XΠ1 < · · · < XΠn (we may ignore the possibilities of equality as these have probability zero). The symmetry of the underlying distribution with respect to transpositions of coordinates implies that each of the (n + 1)! possible orderings has equal probability. We may identify the event {N = n} with those orderings for which Πn−1 = 0 and Πn = n. As there are exactly (n − 1)! such orderings (obtained by distributing X1, . . . , Xn−1 among the first n − 1 locations), we obtain
\[
p(n) = P\{N = n\} = \frac{(n-1)!}{(n+1)!} = \frac{1}{n(n+1)} \qquad (n \ge 1).
\]
The persistence of bad luck. Direct substitution now shows that
\[
E(N) = \sum_{n=1}^{\infty} n\,p(n) = \sum_{n=1}^{\infty} \frac{n}{n(n+1)} = \frac12 + \frac13 + \frac14 + \cdots = \infty
\]

as the harmonic series diverges. Our unfortunate speculator will have to go through an infinite number of friends on average before he finds one with worse luck! And, of course, as W. Feller has pointed out, his disposition is not improved by the fact that all his friends bemoan their own bad luck arguing in exactly the same way. (One could call this the anti-Lake Wobegon effect—everyone is below average—the reader may recall that in Garrison Keillor’s Lake Wobegon, all the children are above average.) Naïve intuition could not have prepared us for the fact that the distribution of N has heavy tails, hence the infinite expectation.
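A simulation sketch (added here, not in the original solution) makes both points visible: the probabilities 1/(n(n+1)) are reproduced, while the sample mean of N is large and unstable because E(N) is infinite.

```python
import random

def waiting_time():
    """Smallest n with X_n > X_0 for i.i.d. unit-mean exponentials."""
    x0 = random.expovariate(1.0)
    n = 1
    while random.expovariate(1.0) <= x0:
        n += 1
    return n

trials = 200_000
draws = [waiting_time() for _ in range(trials)]
for n in (1, 2, 3):
    print(n, sum(d == n for d in draws) / trials, 1 / (n * (n + 1)))
print(sum(draws) / trials)       # large and run-dependent: the mean diverges
```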

14. Continuation. With X0 and X1 as in the previous problem, let R = X0/X1. Find the distribution function F(r) := P{R ≤ r} of R. Determine hence the probability that X0 > 3X1. Does the gambler have cause to complain of ill luck? What can we say about the expected value of R?

SOLUTION: The problem does not get any harder if we have varying exponential distributions and we may as well do it in general. Suppose X0 ∼ Exponential(α) and X1 ∼ Exponential(β). For r > 0, we have
\[
F(r) = P\Bigl\{X_1 \ge \frac{X_0}{r}\Bigr\} = \int_0^{\infty} dx\,\alpha e^{-\alpha x}\int_{x/r}^{\infty} dy\,\beta e^{-\beta y}
= \int_0^{\infty} dx\,\alpha e^{-\alpha x}\, e^{-\beta x/r} = \frac{\alpha}{\alpha + \frac{\beta}{r}} = \frac{r}{r + \frac{\beta}{\alpha}}. \tag{9}
\]
In particular, for the given case of an identical exponential distribution with mean 1/α, we have F(r) = r/(r + 1) and, in particular, P{X0 > 3X1} = 1 − F(3) = 1/4. It is not at all unlikely that a gambler's loss is in excess of three times that of another speculator under the same conditions. We can put this observation in sharp relief by computing the expected value: from (9) we obtain
\[
E(R) = \int_0^{\infty} \bigl(1 - F(r)\bigr)\,dr = \int_0^{\infty} \frac{\beta/\alpha}{r + \beta/\alpha}\,dr = \infty,
\]
for every α > 0 and β > 0. Surprisingly, the ratio has very heavy tails and hence infinite expectation for any value of α and β.

15. Suppose X, Y, and Z are independent with a common exponential distribution. Determine the density of (Y − X, Z − X).

SOLUTION: Suppose (X, Y, Z) ∼ f(x, y, z) and (U, V, W) ∼ g(u, v, w). We may write (U, V, W) = (X, Y, Z)A where the linear transformation A and its inverse are given by
\[
A = \begin{pmatrix} 1 & -1 & -1\\ 0 & 1 & 0\\ 0 & 0 & 1\end{pmatrix}
\qquad\text{and}\qquad
A^{-1} = \begin{pmatrix} 1 & 1 & 1\\ 0 & 1 & 0\\ 0 & 0 & 1\end{pmatrix}.
\]


We hence have

(x, y, z) = (u, v, w)A−1 = (u, u + v, u + w)

and as det A = 1, we obtain g(u, v, w) = f(u, u + v, u + w) = αe−αu · βe−β(u+v) · γe−γ(u+w) = αβγe−(α+β+γ)u−βv−γw with support only in the region in three-dimensional Euclidean space defined by the inequalities u > 0, u + v > 0, and u + w > 0.

16. Suppose X, Y, and Z are independent, exponentially distributed random variables with means λ⁻¹, µ⁻¹, and ν⁻¹, respectively. Determine P{X < Y < Z}.

SOLUTION: The navvy way. Condition on X = x then Y = y to obtain
\[
P\{X < Y < Z\} = \int_0^{\infty}\!\int_x^{\infty}\!\int_y^{\infty} \lambda e^{-\lambda x}\,\mu e^{-\mu y}\,\nu e^{-\nu z}\,dz\,dy\,dx.
\]
Recall the tail of the exponential, $\int_t^{\infty} \alpha e^{-\alpha s}\,ds = e^{-\alpha t}$, to rapidly simplify:
\[
P\{X < Y < Z\} = \int_0^{\infty}\!\int_x^{\infty} \lambda e^{-\lambda x}\,\mu e^{-\mu y}\, e^{-\nu y}\,dy\,dx
= \int_0^{\infty} \lambda e^{-\lambda x}\,\frac{\mu}{\mu+\nu}\, e^{-(\mu+\nu)x}\,dx
= \Bigl(\frac{\mu}{\mu+\nu}\Bigr)\Bigl(\frac{\lambda}{\lambda+\mu+\nu}\Bigr).
\]
A slick solution. Recall that min{Y, Z} is exponentially distributed with parameter µ + ν and if T1 and T2 are independent and exponentially distributed with parameters λ1 and λ2, respectively, then P{T1 < T2} = λ1/(λ1 + λ2). Now the event {Y < Z} is independent of the event {X < min{Y, Z}}. It follows that
\[
P\{X < Y < Z\} = P\bigl\{X < \min\{Y,Z\},\ Y < Z\bigr\} = P\bigl\{X < \min\{Y,Z\}\bigr\}\,P\{Y < Z\}
= \Bigl(\frac{\lambda}{\lambda+\mu+\nu}\Bigr)\Bigl(\frac{\mu}{\mu+\nu}\Bigr).
\]
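A two-line Monte Carlo check (an added sketch, not part of the original solution) of the product formula; the rate values below are arbitrary.

```python
import random

lam, mu, nu, trials = 1.0, 2.0, 3.0, 200_000
hits = sum(random.expovariate(lam) < random.expovariate(mu) < random.expovariate(nu)
           for _ in range(trials))
print(hits / trials, (lam / (lam + mu + nu)) * (mu / (mu + nu)))   # ~ 1/15 here
```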

17. Pure birth processes. Starting from an initial state E0 , a system passes through a sequence of states E0 7→ E1 7→ · · · 7→ En 7→ · · · , staying at Ej for a sojourn time Xj governed by the exponential distribution with mean α−1 j . Then Sn = X0 +X1 +· · ·+Xn is the epoch at which there is a transition En 7→ En+1 . Let Pn (t) denote the probability that at time t the system is in state En . Show that P00 (t) = −α0 P0 (t) and, for n ≥ 1, the process satisfies the differential equation Pn0 (t) = −αn Pn (t) + αn−1 Pn−1 (t). S OLUTION : To establish the base case for the recurrence, we observe that P0 (t) = P{X0 > t} = e−α0 t .


Taking derivatives of both sides we obtain P00 (t) = −α0 P0 (t). We now proceed as in Examples IX.6.1,2 in ToP by Crofton’s method of perturbation. For any t ≥ 0 and h > 0, introduce the nonce notation P(t, t + h; j, j + k) for the probability that there are k transitions in the interval [t, t + h) given that the system is in state Ej at epoch t (and hence is in epoch Ej+k at epoch t + h). Then, as h → 0, P(t, t + h; j, j + 0) = P{Xj > h} = e−αj h = 1 − αj h + o(h). Likewise, suppose that, starting in state Ej at time t, there is a single transition in the interval [t, t + h). By conditioning on the time τ of the transition, we see that, with Xj = τ − t, there will be no further transition in this interval if, and only if, Xj+1 > t + h − τ = h − (τ − t). As Xj+1 is independent of Xj , by integrating out the conditioning with respect to the exponential distribution of Xj , we see that Z t+h P{Xj+1 > t + h − τ | Xj = τ − t}αj e−αj (τ−t) dτ

P(t, t + h; j, j + 1) = t

Z t+h

P{Xj+1 > t + h − τ}αj e−αj (τ−t) dτ

=

x←τ−t

=

Zh αj e−αj+1 h

t

e−(αj −αj+1 )x dx. 0

If αj = αj+1 the integral on the right evaluates to P(t, t + h; j, j + 1) = αj he−αj h = αj h + o(h)

(h → 0),

and, when αj 6= αj+1 , we obtain ` ´ αj e−αj+1 h 1 − e−(αj −αj+1 )h αj − αj+1 ` ´` ´ 1 − αj+1 h + o(h) (αj − αj+1 )h + o(h) = αj h + o(h)

P(t, t + h; j, j + 1) = =

αj αj − αj+1

(h → 0).

Thus, for all choices of αj and αj+1 , we have P(t, t + h; j, j + 1) = αj h + o(h) As

P k

X

(h → 0).

P(t, t + h; j, j + k) = 1, it follows that P(t, t + h; j, j + k) = 1 − P(t, t + h; j, j + 0) − P(t, t + h; j, j + 1)

k≥2

` ´ ` ´ = 1 − 1 − αj h + o(h) − αj h + o(h) = o(h). Writing [0, t + h) = [0, t) ∪ [t, t + h), we see hence by additivity that ` ´ Pn (t + h) = Pn (t)P(t, t + h; n, n + 0) + Pn−1 (t)P t, t + h; n − 1, (n − 1) + 1 +

n X

` ´ Pn−k (t)P t, t + h; n − k, (n − k) + k

k=2

` ´ ` ´ = Pn (t) 1 − αn h + o(h) + Pn−1 (t) αn−1 h + o(h) + o(h). Rearranging terms and dividing throughout by h, we see that Pn (t + h) − Pn (t) − αn Pn (t) = αn−1 Pn−1 (t) + o(1), h


and passing to the limit as h → 0, we obtain the desired differential equation
\[
P_n'(t) + \alpha_n P_n(t) = \alpha_{n-1} P_{n-1}(t) \qquad (n \ge 1).
\]

18. The absent-minded professor and the tardy student. A professor (absent-minded by convention) gives each of two students an appointment at 12 noon. One of the students arrives on time; the second arrives 5 minutes late. The amount of time the first student spends closeted with the professor is X1; the second student engages the professor as soon as he is free and spends time X2 in conference with him. Suppose X1 and X2 are independent random variables with the common exponential distribution of mean 30 minutes. What is the expected duration from the arrival of the first student to the departure of the second? [Condition on X1.]

SOLUTION: Let X denote the total duration from the arrival of the first student at 12 noon to the departure of the second student. Two situations are possible depending on whether the first student departs before or after the arrival of the second student. Clearly,
\[
X = \begin{cases} 5 + X_2 & \text{if } X_1 \le 5,\\ X_1 + X_2 & \text{if } X_1 > 5,\end{cases}
\]
so that by the theorem of total probability,
\[
E(X) = E\{X \mid X_1 \le 5\}\,P\{X_1 \le 5\} + E\{X \mid X_1 > 5\}\,P\{X_1 > 5\}.
\]
Let's evaluate the various terms in turn. As X1 is exponentially distributed with mean 30, P{X1 > 5} = e^{−5/30} and P{X1 ≤ 5} = 1 − e^{−5/30}. Conditioned on the event that the first student leaves within 5 minutes, linearity of expectation gives us
\[
E\{X \mid X_1 \le 5\} = E\{5 + X_2 \mid X_1 \le 5\} = 5 + E\{X_2 \mid X_1 \le 5\} = 5 + E(X_2) = 35
\]
as the variables X1 and X2 are independent with a common exponential distribution with mean 30. On the other hand, if the first student's sojourn time with the professor exceeds 5 minutes then X = X1 + X2 = 5 + X1′ + X2 where the random variable X1′ = X1 − 5 denotes the residual time the first student spends with the professor following the arrival of the second student. Linearity of expectation yields
\[
E\{X \mid X_1 > 5\} = 5 + E\{X_1' \mid X_1 > 5\} + E\{X_2 \mid X_1 > 5\}.
\]
The independence of X1 and X2 again yields E{X2 | X1 > 5} = E(X2) = 30, while, by the memoryless property of the exponential distribution, the residual time X1′ conditioned on the event X1 > 5 is exponentially distributed with mean 30. Indeed,
\[
P\{X_1' > x \mid X_1 > 5\} = P\{X_1 > x+5 \mid X_1 > 5\} = \frac{P\{X_1 > x+5,\ X_1 > 5\}}{P\{X_1 > 5\}}
= \frac{P\{X_1 > x+5\}}{P\{X_1 > 5\}} = \frac{e^{-(x+5)/30}}{e^{-5/30}} = e^{-x/30},
\]


which just retraces the derivation of the memoryless property of the exponential. It follows that E{X1′ | X1 > 5} = 30, and, consequently, E{X | X1 > 5} = 65. Putting the various pieces together we hence obtain
\[
E(X) = 35\bigl(1 - e^{-1/6}\bigr) + 65\, e^{-1/6} = 35 + 30\, e^{-1/6},
\]
which evaluates to approximately 60.4 minutes.
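A direct simulation (an added sketch, not part of the original solution) of the scenario confirms the value 35 + 30e^{−1/6} ≈ 60.4 minutes.

```python
import random, math

trials, total = 200_000, 0.0
for _ in range(trials):
    x1 = random.expovariate(1 / 30)     # first student's conference time
    x2 = random.expovariate(1 / 30)     # second student's conference time
    total += (5 + x2) if x1 <= 5 else (x1 + x2)
print(total / trials, 35 + 30 * math.exp(-1 / 6))
```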

19. A waiting time distribution. Suppose X and T are independent random variables, X exponentially distributed with mean 1/α and T exponentially distributed with mean 1/β. Determine the density of X − T.

SOLUTION: Write g(x) = e^{−x}H0(x) for the exponential density of unit mean; here H0(x) is the Heaviside function which takes value 1 for x ≥ 0 and value zero for x < 0. As H0(γx) = H0(x) for any γ > 0, we see that X has marginal density gα(x) = αg(αx) = αe^{−αx}H0(x) with support in the positive half-line x ≥ 0 while −T has marginal density g_β^−(t) = βg(−βt) = βe^{βt}H0(−t) with support in the negative half-line t ≤ 0. Write S = X + (−T). As X and −T are independent, the density f of S is given by the convolution
\[
f(s) = (g_\alpha \star g_\beta^-)(s) = \int_{-\infty}^{\infty} g_\alpha(x)\, g_\beta^-(s-x)\,dx
= \alpha\beta\, e^{\beta s}\int_{-\infty}^{\infty} e^{-(\alpha+\beta)x}\, H_0(x)\, H_0(-s+x)\,dx.
\]
The evaluation is split into cases. If s ≥ 0, then H0(x)H0(−s + x) = 1 for x ≥ s and is zero otherwise, and so
\[
f(s) = \alpha\beta\, e^{\beta s}\int_s^{\infty} e^{-(\alpha+\beta)x}\,dx = \frac{\alpha\beta}{\alpha+\beta}\, e^{-\alpha s}.
\]
If s < 0, then H0(x)H0(−s + x) = 1 for x ≥ 0 and is zero otherwise, and so
\[
f(s) = \alpha\beta\, e^{\beta s}\int_0^{\infty} e^{-(\alpha+\beta)x}\,dx = \frac{\alpha\beta}{\alpha+\beta}\, e^{\beta s}.
\]
Putting the pieces together, we have
\[
f(s) = \begin{cases} \dfrac{\alpha\beta}{\alpha+\beta}\, e^{\beta s} & \text{if } s < 0,\\[1ex] \dfrac{\alpha\beta}{\alpha+\beta}\, e^{-\alpha s} & \text{if } s \ge 0.\end{cases}
\]
Of course, we may just as easily integrate the joint density
\[
g_\alpha(x)\, g_\beta^-(t) = \alpha\beta\, e^{-\alpha x + \beta t}\, H_0(x)\, H_0(-t)
\]

of the pair (X, T ) over the region in the first quadrant determined by the inequality x − t ≤ s to obtain the d.f. F(s) of S = X − T . Differentiation recovers the density.


20. Continuation. A customer arriving at time t = 0 occupies a server for an amount of time X. A second customer arriving at time T > 0 waits for service if T < X and otherwise receives attention immediately. Her waiting time is hence W = (X − T)+. With X and T as in the previous problem, determine the distribution of W and its mean.

SOLUTION: In the notation of the solution to the previous problem, W = S+. Then W is positive, has a mass at the origin of size
\[
P\{W = 0\} = P\{S \le 0\} = \int_{-\infty}^{0} \frac{\alpha\beta}{\alpha+\beta}\, e^{\beta s}\,ds = \frac{\alpha}{\alpha+\beta},
\]
and, for w > 0, has distribution
\[
P\{W \le w\} = P\{S \le 0\} + P\{0 < S \le w\} = \frac{\alpha}{\alpha+\beta} + \int_0^w \frac{\alpha\beta}{\alpha+\beta}\, e^{-\alpha s}\,ds
= \frac{\alpha}{\alpha+\beta} + \frac{\beta}{\alpha+\beta}\bigl(1 - e^{-\alpha w}\bigr) = 1 - \frac{\beta}{\alpha+\beta}\, e^{-\alpha w}.
\]
Its mean is hence given by
\[
E(W) = \int_0^{\infty} P\{W > w\}\,dw = \frac{\beta}{\alpha+\beta}\int_0^{\infty} e^{-\alpha w}\,dw = \frac{\beta}{\alpha(\alpha+\beta)}.
\]

If α = β = 1 the expected waiting time is 1/2 as is intuitive.
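A simulation sketch (added here, not part of the original solution) checks both the atom at zero and the mean of the waiting time for arbitrary rates α and β.

```python
import random

alpha, beta, trials = 1.0, 2.0, 200_000
ws = [max(random.expovariate(alpha) - random.expovariate(beta), 0.0)
      for _ in range(trials)]
print(sum(ws) / trials, beta / (alpha * (alpha + beta)))       # mean of W
print(sum(w == 0.0 for w in ws) / trials, alpha / (alpha + beta))  # P{W = 0}
```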

21. Spacings via the exponential distribution. Suppose X1 , . . . , Xn are independent variables with a common exponential distribution and let Sn = X1 + · · · + Xn . Consider the variables Uj = Xj /Sn for 1 ≤ j ≤ n − 1 and set Un = Sn . Show that (U1 , . . . , Un−1 ) has the same distribution as if Uj were the jth spacing in a random partition of the unit interval by n − 1 points. [Hint: Evaluate the Jacobian of the transformation.] S OLUTION : Let’s begin with an easy consequence of di Finetti’s theorem. Spacings engendered by random points in the unit interval: Suppose V1 , . . . , Vn−1 , Vn are the spacings engendered by n − 1 random points in the unit interval. The system of variables (V1 , . . . , Vn−1 , Vn ) is singular continuous in Rn and is concentrated on the unit simplex determined by the conditions v1 ≥ 0, . . . , vn−1 ≥ 0, vn ≥ 0, and v1 + · · · + vn−1 + vn = 1. The system is linearly dependendent with Vn = 1 − V1 − · · · − Vn−1 completely determined by the remaining variables and it will suffice to characterise the distribution of (V1 , . . . , Vn−1 ). This system of variables is concentrated in the polytope S determined by the inequalities v1 ≥ 0, . . . , vn−1 ≥ 0, and v1 + · · · + vn−1 ≤ 1. The characterisation of the distribution is now particularly simple in view of di Finetti’s theorem. T HEOREM The variables (V1 , . . . , Vn−1 ) are uniformly distributed in the polytope S.


PROOF: Write F(v1, . . . , vn−1) for the d.f. of (V1, . . . , Vn−1) and let
\[
f(v) = f(v_1,\dots,v_{n-1}) = \frac{\partial^{n-1} F(v_1,\dots,v_{n-1})}{\partial v_1\cdots\partial v_{n-1}} = \frac{\partial F(v)}{\partial v}
\]

be the corresponding density. With a view to using inclusion and exclusion, for 0 ≤ j ≤ n − 1, define X X (1−vi1 −· · ·−vij )n−1 , Tj = P{Vi1 > vi1 , . . . , Vij > vij } = 1≤i1 0. Bounding the remaining terms in the sum on the right, we obtain ` ´ G1 {τ} < MF(k) {τ − t} + MF(k) Zk \ {τ − t} = M for every τ ∈ Zk . It must hence follow under the reductio ad absurdum assumption that M > maxτ G1 {τ} = M and we have a contradiction. We conclude that G0 {t} = M for (k) every t ∈ V∗ and, by induction, that, for each j ≥ 0, we must have Gj {t} = M for (k) (k) all t ∈ V∗ . We have hence shown that there is a constant M such that F?n {t} → M (k) (k) (k) for every t ∈ V∗ . But then this constant must be equal to 1/ card V∗ and so F?n (k) (k) converges weakly to F∗ , the uniform distribution concentrated on V∗ . To complete the proof we remove our nonce assumption that V (k) is a subgroup of Zk to begin with. Suppose now that the generator V (k) is a proper subset of the


subgroup V∗ generated by it. As V (k) includes the point 0, we recall that the sequence  (k) S (k) (k) (k) of supports Vn , n ≥ 0 is increasing with limit set n Vn = V∗ . As card V∗ ≤ k, (k) (k) there must in fact exist an integer N such that the support VN of the distribution F?N (k) (k) (k) coincides with the subgroup V∗ . It follows that Vn = V∗ for all n ≥ N and, in (k) particular, the subsequence {F?nN , n ≥ 1} consists of distributions all concentrated on (k) (k) (k) (k) (k) (k) (k) the subgroup V∗ . As convolution is associative, F?2N = F?N ? F?N , F?3N = F?N ? F?2N , and so on, so that the entire subsequence can be built up by repeated convolutions of (k) the basic distribution F?N . By the just concluded proof of the simplified setting, the  (k) (k) sequence F?nN , n ≥ 1 converges vaguely to F∗ .  (k) Now consider the subsequence F?nN+1 , n ≥ 1 obtained by convolving each  (k) (k) member of the sequence F?nN , n ≥ 1 with F(k) . On the one hand, F?nN+1 = F(k) ? (k) F?nN , while, on the other, X (k) X (k) 1 1 F {s} = F(k) ? F(k) F∗ {t − s}F(k) {s} = ∗ {t} = (k) (k) card V∗ s∈V (k) card V∗ s∈Zk  (k) (k) (k) (k) for every t ∈ V∗ so that F(k) ? F∗ = F∗ . It follows that F?nN+1 , n ≥ 1 converges  (k) (k) (k) vaguely to F∗ and, by induction, F?nN+j , n ≥ 1 converges vaguely to F∗ for each  (k) (k) j ≥ 0. We conclude that F?n , n ≥ 1 converges vaguely to F∗ . I We may now quickly deduce convergence theorems. It is simplest to phrase them in terms of the total variation distance [see Problem XVII.2 for the definition specialised to arithmetic variables]. C OROLLARY For a given k ≥ 2, if F(k) places positive mass on both 0 and 1, or if F(k) places ` ´ (k) positive mass on 0, s, and t where s and t are relatively prime, then dTV L(Sn ), L(U(k) ) → 0 as n → ∞. P ROOF : Starting with 0 and 1, repeated additions of 1 generate all the elements of Zk . If s and t are relatively prime then by Bézout’s identity there exist positive integers m and n such that ms + nt ≡ 1 mod k. Thus, if F(k) places positive mass on 0, s, and t, (k) then F?m+n places positive mass on 0, 1, s, and t. Addition of 1 now recovers all the remaining elements of Zk . Thus, the set V (k) is a generator of Zk under either of the (k) given conditions. The limiting distribution F∗ is hence uniform on {0, 1, . . . , k − 1}. I ` ´ (k) C OROLLARY If F places positive mass on 0 and 1 then limn dTV L(Sn ), L(U(k) ) = 0 for every k ≥ 2. P ROOF : If F{0} > 0 and F{1} > 0 then it is the case that F(k) places non-zero mass on both 0 and 1 for any k ≥ 2. The previous corollary now establishes distributional convergence for every k ≥ 2. If we identify the half-closed unit interval [0, 1) with the circle T of unit length then, by wrapping the real line around T (in one or the other orientation) we may identify points x, x ± 1, x ± 2, . . . and so on. In this viewpoint the sequence of (normal (k) ized) partial sums, k1 Sn , n ≥ 1 represents a random walk on the vertices 0, 1/k,


. . . , (k − 1)/k of a regular polygon embedded in T. Our theorem is hence a statement about the weak convergence of a particular sequence of distributions concentrated on a regularly spaced set of points on the circle T. The conditions are reminiscent of Weyl’s equidistribution theorem. In that setting we begin with an atomic distribution F1 = F placing all its mass at a single irrational point x in T; we now construct a sequence of distributions {Fn , n ≥ 1} with Fn placing equal mass on the points x, 2x, . . . , nx in T. If the reader refers to Section XIX.8 of ToP she will find that the analysis now proceeds in exactly the same fashion with the conclusion now that {Fn , n ≥ 1} converges weakly to the uniform distribution on T or, in the language of analysis, the sequence {nx, n ≥ 1} is equidistributed on T. I

20. Continuation. Show that the preceding result need not hold if F does not have an atom at the origin. S OLUTION : In the notation of the previous problem, suppose F(k) is concentrated at the point 1. This means that the originating distribution F is concentrated on the set of points { jk + 1 : j ≥ 0 }. The simplest example is the shifted Heaviside distribution F(x) = H0 (x − 1) which puts all its mass on the point x = 1. Then F?2 (x) = F ? F(x) is a distribution which puts all its mass on points of the form (jk + 1) + (lk + 1) = (k) (j + l)k + 2 so that F?2 is concentrated at the point 2. Working inductively, it is clear that F?n is concentrated on the set of points of the form jk + n for integer j, and so (k) F is concentrated at the point n mod k in Zk . Thus, the sequence of distributions ?n(k) F?n , n ≥ 1 just cycles through the points in Zk = {0, 1, . . . , k − 1} and there is no convergence of any kind.


XX Normal Approximation

1. Repeated convolutions. Starting with the function f(x) which takes value 1 in the interval (−1/2, 1/2) and 0 elsewhere, recursively define the sequence of functions { fn, n ≥ 1 } as follows: set f1 = f and, for each n ≥ 2, set fn = fn−1 ⋆ f. Define the functions Gn via Gn(x) = ∫_x^∞ fn(t) dt. Estimate Gn(0), Gn(√n), and Gn(n/4).

SOLUTION: The easiest path is to realise that convolutions of densities can be associated with sums of independent random variables. Let { Xk, k ≥ 1 } be independent and identically distributed with common density f(x). For each n, form the random variable Sn = X1 + · · · + Xn. Then Sn has density fn = fn−1 ⋆ f = f^{⋆n} given by the n-fold convolution of f with itself. Observe that the Xk s are uniformly distributed over the centred unit interval (−1/2, 1/2), hence have mean 0 and variance 1/12. It follows that Sn has mean zero and variance n/12. With this for preparation, we may write
\[
G_n(x) = \int_x^{\infty} f_n(t)\,dt = P\{S_n > x\}.
\]
It is clear that Sn is continuous and symmetrically distributed around zero whence
\[
G_n(0) = P\{S_n > 0\} = 1/2.
\]
The √n term is the tip-off that the central limit theorem may be at work here. By symmetry, the left and right tails of Sn carry the same probability, whence (recall that the standard deviation of Sn is given by sn = √n/(2√3))
\[
G_n\bigl(\sqrt n\bigr) = P\bigl\{S_n > \sqrt n\bigr\} = P\Bigl\{\frac{2\sqrt3\, S_n}{\sqrt n} > 2\sqrt3\Bigr\} = P\Bigl\{\frac{2\sqrt3\, S_n}{\sqrt n} < -2\sqrt3\Bigr\} \to \Phi\bigl(-2\sqrt3\bigr)
\]
as n → ∞ because the central limit theorem is in force. At n/4 the deviations now are too large for the central limit theorem but we can still bound the probability in the tails via Chebyshev or Chernoff, for example. For instance, Chebyshev gives us
\[
G_n(n/4) = P\{S_n > n/4\} = P\Bigl\{\frac{S_n}{n} > \frac14\Bigr\} \le \frac{16\operatorname{Var}(S_n)}{n^2} = \frac{4}{3n}.
\]


The bound may be improved to an exponential bound via Chernoff.

2. The Poisson distribution with large mean. Suppose Sλ has the Poisson distribution with mean λ. Show that (Sλ − λ)/√λ converges in distribution to the standard normal. (Hint: The stability of the Poisson under convolution.)

SOLUTION: First, suppose λ = n is an integer and suppose X1, X2, . . . is a sequence of independent random variables with the common Poisson distribution of mean 1. From the stability of Poisson under convolution we know Sn = X1 + · · · + Xn has the Poisson distribution with mean n. On the other hand, by the plain vanilla central limit theorem [see the theorem of Section XX.1], the sequence (Sn − n)/√n converges to the standard normal in distribution. More generally, λ is not necessarily an integer. But then there is a unique integer n such that n ≤ λ < n + 1. As before, let X1, . . . , Xn be independent Poisson(1) variables and introduce the slack Poisson variable X′ ∼ Poisson(λ − n) independent of X1, . . . , Xn to clear up the irritating non-integral residue. Then Sλ = X1 + · · · + Xn + X′ is Poisson with mean n + (λ − n) = λ. We may then write
\[
\frac{S_\lambda - \lambda}{\sqrt\lambda} = \Bigl(\frac{X_1 + \cdots + X_n - n}{\sqrt n}\Bigr) A_\lambda + B_\lambda
\qquad\text{where}\qquad
A_\lambda = \frac{\sqrt n}{\sqrt\lambda} \quad\text{and}\quad B_\lambda = \frac{X' - (\lambda - n)}{\sqrt\lambda}.
\]
As Aλ = 1 + O(λ^{−1/2}) we see that Aλ → 1 as λ → ∞. As X′ ∼ Poisson(λ − n), Chebyshev's inequality shows that
\[
P\{|B_\lambda| \ge \epsilon\} \le \frac{\operatorname{Var} B_\lambda}{\epsilon^2} = \frac{\lambda - n}{\epsilon^2\lambda} \le \frac{1}{\epsilon^2\lambda} \to 0 \qquad (\lambda \to \infty)
\]
and so Bλ converges in distribution to 0. By Problem XIX.2 it follows that (Sλ − λ)/√λ converges in distribution to N(0, 1).

3. Continuation. Let Gλ(t) be the d.f. of Sλ/λ. Determine lim_{λ→∞} Gλ(t) if (i) t > 1, and (ii) t < 1. What can you say if t = 1?

SOLUTION: Suppose, to begin, that λ is an integer and suppose {Xn} is a sequence of independent random variables, each with the Poisson distribution of mean 1. By the stability of Poisson under convolution we know that Sλ = X1 + · · · + Xλ has the Poisson distribution with mean λ and so, by the strong law of large numbers, we conclude that Sλ/λ → 1 with probability one. It follows that, as λ → ∞, Gλ(t) → 1 if t > 1 and Gλ(t) → 0 if t < 1. Our conclusion is unchanged when λ tends to infinity, not necessarily through integer values, by approximating λ by ⌊λ⌋ as in the solution to Problem 2. To finish up, for t = 1, we have
\[
G_\lambda(1) = P\{S_\lambda \le \lambda\} = P\Bigl\{\frac{S_\lambda - \lambda}{\sqrt\lambda} \le 0\Bigr\} \to \Phi(0) = \frac12 \qquad (\lambda \to \infty)
\]
by Problem 2.
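A small illustration (an added sketch, not part of the original solutions) of the normal approximation for the Poisson with large mean: the probability P{Sλ ≤ λ} settles near Φ(0) = 1/2. The crude sampler below builds a Poisson count from unit-rate exponential interarrival times.

```python
import random

def poisson(lam):
    """Crude Poisson(lam) sampler via exponential interarrival times."""
    s, n = 0.0, 0
    while True:
        s += random.expovariate(1.0)
        if s > lam:
            return n
        n += 1

lam, trials = 400.0, 20_000
print(sum(poisson(lam) <= lam for _ in range(trials)) / trials)   # near 0.5
```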

4. Records. Let {Xk, k ≥ 1} be a sequence of independent random variables with a common distribution. We say that a record occurs at time k if max{X1, ..., Xk−1} < Xk. Prove a central limit theorem for the number Rn of records up to time n.

5. Inversions. Let Π : (1, ..., n) → (Π1, ..., Πn) be a random permutation. For each k, let Xk^(n) be the number of smaller elements, i.e., 1 through k − 1, to the right of k in the permutation. Prove a central limit theorem for Sn = X1^(n) + · · · + Xn^(n), the total number of inversions.

6. Khinchin's weak law of large numbers states that if {Xk, k ≥ 1} is an independent, identically distributed sequence of random variables with finite mean µ then, with Sn = X1 + · · · + Xn, for every ε > 0, we have P{|n^{−1}Sn − µ| > ε} → 0 as n → ∞. Prove this using operator methods.

SOLUTION: We may suppose without loss of generality that µ = 0 (else replace each Xk by Xk − µ). Write Fn(x) for the d.f. of Xi/n. Then Fn(x) = P{n^{−1}Xi ≤ x} = P{Xi ≤ nx} = F(nx). Let Fn be the operator corresponding to the d.f. Fn. By the Reductionist Theorem of Section XIX.5 in ToP, for every bounded, continuous function u, we have
$$\|F_n^n u - H_0^n u\| \le n\,\|F_n u - H_0 u\| \tag{1}$$
where H0 denotes the operator corresponding to the Heaviside distribution H0 concentrated at 0 and the operator power n denotes the operator corresponding to n-fold convolution. Now H0 u(t) = u(t), and so
$$|F_n u(t) - H_0 u(t)| = \Bigl|\int_{-\infty}^{\infty} u(t-x)\,dF_n(x) - u(t)\Bigr| = \Bigl|\int_{-\infty}^{\infty} \bigl(u(t-x) - u(t)\bigr)\,dF_n(x)\Bigr| \overset{(*)}{=} \Bigl|\int_{-\infty}^{\infty} \bigl(u(t-x) - u(t) + u'(t)x\bigr)\,dF_n(x)\Bigr|$$
$$\le \Bigl|\int_{|x| \le \delta} \bigl(u(t-x) - u(t) + u'(t)x\bigr)\,dF_n(x)\Bigr| + \Bigl|\int_{|x| > \delta} \bigl(u(t-x) - u(t) + u'(t)x\bigr)\,dF_n(x)\Bigr| \tag{2}$$
for any δ > 0. The only step that needs a little justification is the one marked (∗) and this follows because ∫ x dFn(x) = 0. We now consider functions u(t) that have bounded derivatives of all orders. Suppose |u′(t)| ≤ M and |u″(t)| ≤ M for all t, where M is a fixed constant. Fix ε > 0. Now, by Taylor's Theorem, for some ξ(x), we have
$$\Bigl|\int_{|x| > \delta} \bigl(u(t-x) - u(t) + u'(t)x\bigr)\,dF_n(x)\Bigr| = \Bigl|\int_{|x| > \delta} \bigl(u'(t) - u'(\xi(x))\bigr)x\,dF_n(x)\Bigr| \le 2M \int_{|x| > \delta} |x|\,dF_n(x)$$
$$= 2M \int_{|x| > \delta} |x|\,dF(nx) = 2M \int_{|y| > n\delta} \frac{|y|}{n}\,dF(y) = \frac{2M}{n} \int_{|y| > n\delta} |y|\,dF(y) \le \frac{2M\epsilon}{n} \tag{3}$$
for all sufficiently large n. The last step follows from the fact that, since the Xk are integrable with common d.f. F, the integral ∫ |y| dF(y) converges and hence ∫_{|y|>nδ} |y| dF(y) goes to 0 as n tends to ∞. By another application of Taylor's Theorem, for some ζ(x), we have
$$\Bigl|\int_{|x| \le \delta} \bigl(u(t-x) - u(t) + u'(t)x\bigr)\,dF_n(x)\Bigr| = \Bigl|\int_{|x| \le \delta} \frac{x^2}{2}\,u''\bigl(\zeta(x)\bigr)\,dF_n(x)\Bigr| \le \frac{M}{2} \int_{|x| \le \delta} |x|^2\,dF_n(x)$$
$$\overset{(**)}{\le} \frac{M\delta}{2} \int_{|x| \le \delta} |x|\,dF_n(x) = \frac{M\delta}{2} \int_{|x| \le \delta} |x|\,dF(nx) = \frac{M\delta}{2} \int_{|y| \le n\delta} \frac{|y|}{n}\,dF(y) \le \frac{M\delta}{2n}\,E(|X|). \tag{4}$$
The step (∗∗) is justified by the observation |x|² = |x| · |x| ≤ δ|x| when |x| ≤ δ. Putting (1), (2), (3) and (4) to use, we obtain
$$n\,\|F_n u - H_0 u\| \le n\Bigl(\frac{2M\epsilon}{n} + \frac{M\delta}{2n}\,E(|X|)\Bigr) = 2M\epsilon + \frac{M\delta}{2}\,E(|X|),$$
and the right-hand side can be made arbitrarily small, as ε and δ may be chosen at pleasure and E(|X|) is finite since X is integrable. As H0^n = H0, we obtain ‖Fn^n u − H0 u‖ → 0 and hence, by the Equivalence Theorem III of Section XIX.5 in ToP, Sn/n converges to 0 in distribution. As the limit is the constant 0, convergence in distribution is the same as convergence in probability, which is Khinchin's weak law (for µ = 0).
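As an editorial illustration of the statement being proved (not part of the solution): Khinchin's law requires only a finite mean. The hedged sketch below, assuming NumPy, samples Pareto variables of index 1.5, which have mean 3 but infinite variance, and watches Sn/n settle near the mean.

    import numpy as np

    rng = np.random.default_rng(0)
    a = 1.5                                   # Pareto index: finite mean a/(a-1) = 3, infinite variance
    for n in (10**3, 10**5, 10**7):
        x = 1.0 + rng.pareto(a, size=n)       # classical Pareto on (1, infinity)
        print(f"n={n:>8}  S_n/n = {x.mean():.4f}   (true mean = {a / (a - 1):.4f})")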

7. Cauchy distribution. Show that the Cauchy distribution is stable and determine its characteristic exponent. More verbosely, if X has a Cauchy distribution, and X1, ..., Xn are independent copies of X, then X1 + · · · + Xn has the same distribution as n^{1/α} X for some characteristic exponent 0 < α ≤ 2.

SOLUTION: We may suppose that X, X1, ..., Xn are independent with the common unit parameter Cauchy density c(x) = 1/(π(1 + x²)). [If they have the common Cauchy density c_a(x) = (1/a) c(x/a) for some parameter a > 0 then reduce the problem to that of unit parameter Cauchy densities by scaling each of the variables by a^{−1}.] The stability of the Cauchy distribution follows by the theorem of Section IX.10 in ToP: c_a ⋆ c_b = c_{a+b}. In particular, c ⋆ c = c_2 and, by induction, X1 + · · · + Xn has the Cauchy density c_n(x) = (1/n) c(x/n). But this is the same density as that of the scaled variable nX. It follows that the Cauchy distribution is stable with characteristic exponent 1.
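An editorial simulation sketch, assuming NumPy: the average of n standard Cauchy variables is again standard Cauchy, so its empirical d.f. should track the Cauchy d.f. rather than degenerate at a point.

    import numpy as np

    rng = np.random.default_rng(7)
    n, reps = 200, 50_000
    means = rng.standard_cauchy(size=(reps, n)).mean(axis=1)   # each average is again standard Cauchy

    cauchy_df = lambda t: 0.5 + np.arctan(t) / np.pi
    for t in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"t={t:+.0f}  empirical={np.mean(means <= t):.4f}  Cauchy d.f.={cauchy_df(t):.4f}")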

8. Pareto-type distribution. Suppose that X1, X2, ... are independent with common density f(x) = c/(|x|³ (log |x|)³) for |x| > 2 (and zero otherwise). Show that these variables do not satisfy the Liapounov condition but do satisfy the Lindeberg condition.

SOLUTION: It is easy to see that f is a density in good standing as for |x| ≥ 2 the function |x|^{−3}(log |x|)^{−3} is dominated by the integrable function |x|^{−3}(log 2)^{−3}. [Indeed, a numerical integration shows that
$$\int_2^\infty \frac{dx}{x^3 (\log x)^3} = 0.136821$$
and so the proper normalisation of the density is provided by setting c = 3.6544.] The integral
$$\int_0^\infty x f(x)\,dx = c \int_2^\infty \frac{dx}{x^2 (\log x)^3}$$
is clearly convergent (as over the range of integration the integrand is dominated by the integrable function x^{−2}(log 2)^{−3}) and so, by the symmetry of f, the given density has zero mean,
$$\mu = \int_{-\infty}^{\infty} x f(x)\,dx = 0.$$
A similar argument shows that the corresponding variance is finite. Indeed,
$$\sigma^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx = 2c \int_2^\infty \frac{x^2\,dx}{x^3 (\log x)^3} \overset{(t \leftarrow \log x)}{=} 2c \int_{\log 2}^\infty \frac{dt}{t^3} = \frac{c}{(\log 2)^2} = 7.60615.$$
The verification that the Lindeberg condition is satisfied is trite when the variables have a common distribution with a second moment. Indeed, in the notation of Section XX.5 in ToP, s_n² = nσ², m_n = n, and so
$$\frac{1}{s_n^2} \sum_{k=1}^{m_n} \int_{|x| \ge \epsilon s_n} x^2\,dF_k^{(n)}(x) = \frac{1}{n\sigma^2} \cdot n \int_{|x| \ge \epsilon\sigma\sqrt{n}} x^2 f(x)\,dx = \frac{1}{\sigma^2} \int_{|x| \ge \epsilon\sigma\sqrt{n}} x^2 f(x)\,dx \to 0$$
for every ε > 0. [No evaluation of the integral on the right is necessary as we know that ∫ x² f(x) dx = σ² is convergent and so the integral tails must go to zero.] On the other hand, if r > 2, then
$$\int_2^\infty \frac{x^r\,dx}{x^3 (\log x)^3} \overset{(t \leftarrow \log x)}{=} \int_{\log 2}^\infty \frac{e^{(r-2)t}}{t^3}\,dt$$
diverges. And so,
$$\frac{1}{s_n^r} \sum_{k=1}^{m_n} E\bigl(\bigl|X_k^{(n)}\bigr|^r\bigr) = \frac{1}{\sigma^r n^{r/2}} \cdot n \int_{-\infty}^{\infty} |x|^r f(x)\,dx = +\infty$$
for every r > 2 and the Liapounov condition is violated.
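A numerical companion to the solution (an editorial aside assuming SciPy's quad routine): it recovers the constants c and σ² quoted above and exhibits the divergence of the third absolute moment behind the failure of the Liapounov condition.

    import numpy as np
    from scipy.integrate import quad

    g = lambda x: 1.0 / (x**3 * np.log(x)**3)                 # one tail of the unnormalised density
    half_mass, _ = quad(g, 2, np.inf)
    c = 1.0 / (2.0 * half_mass)
    # Variance after the substitution t <- log x: 2c * integral of 1/t^3 over (log 2, infinity).
    variance, _ = quad(lambda t: 2.0 * c / t**3, np.log(2), np.inf)
    print(f"tail integral = {half_mass:.6f}, c = {c:.4f}, sigma^2 = {variance:.5f}")

    # Third absolute moment, truncated; in the variable t <- log x the integrand is 2c e^t / t^3.
    for T in (1e2, 1e4, 1e6, 1e8):
        partial, _ = quad(lambda t: 2.0 * c * np.exp(t) / t**3, np.log(2), np.log(T))
        print(f"E|X|^3 truncated at |x| <= {T:.0e}: {partial:,.0f}")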

9. A central limit theorem for runs. As in Section VIII.8, let Rn = Rn(ω) denote the length of the run of 0s starting at the nth place in the dyadic expansion of a point ω drawn at random from the unit interval. Show that Σ_{k=1}^n Rk is approximately normally distributed with mean n and variance 6n.

10. Suppose X1, X2, ... is a sequence of independent random variables. For each k, the variable Xk is concentrated at four points, taking values ±1 with probability ½(1 − k^{−2}) and values ±k with probability ½k^{−2}. By an easy truncation prove that Sn/√n behaves asymptotically in the same way as if Xk = ±1 with probability 1/2. Thus, the distribution of Sn/√n tends to the standard normal but Var(Sn/√n) → 2.


SOLUTION: We break up the analysis into a sequence of steps.

a) For each k, we have Var(Xk) = E(Xk²) = (1 − k^{−2}) · 1² + k^{−2} · k² = 2 − k^{−2}. It follows that
$$\operatorname{Var}\Bigl(\frac{S_n}{\sqrt{n}}\Bigr) = \frac{1}{n} \sum_{k=1}^{n} (2 - k^{-2}) = 2 - \frac{1}{n} \sum_{k=1}^{n} k^{-2} = 2 - O(n^{-1}),$$
and so limn Var(Sn/√n) = 2.

b) It is clear that
$$P\{|X_k| > M\} = \begin{cases} 0 & \text{if } 1 \le k \le M, \\ k^{-2} & \text{if } k \ge M + 1. \end{cases}$$
By Boole's inequality, it follows that
$$P\Bigl(\,\bigcup_{k \ge 1} \{|X_k| > M\}\Bigr) \le \sum_{k=M+1}^{\infty} P\{|X_k| > M\} = \sum_{k=M+1}^{\infty} \frac{1}{k^2} < \int_M^{\infty} \frac{dx}{x^2} = \frac{1}{M}. \tag{5}$$
We hence obtain P{supk |Xk| ≤ M} > 1 − M^{−1}.

c) Writing Sn = SM + S′_{n−M} (so that S′_{n−M} = X_{M+1} + · · · + Xn), we have
$$P\Bigl\{a < \frac{S_n}{\sqrt{n}} < b \,\Big|\, \sup_k |X_k| \le M\Bigr\} = P\Bigl\{\frac{\sqrt{n}}{\sqrt{n-M}}\Bigl(a - \frac{S_M}{\sqrt{n}}\Bigr) < \frac{S'_{n-M}}{\sqrt{n-M}} < \frac{\sqrt{n}}{\sqrt{n-M}}\Bigl(b - \frac{S_M}{\sqrt{n}}\Bigr) \,\Big|\, \sup_k |X_k| \le M\Bigr\}.$$
As each of X1, ..., XM is bounded in absolute value by M, for trivial reasons |SM| ≤ M². As √n/√(n − M) → 1 as n → ∞, for every ε > 0 and all sufficiently large n, the expression on the right is bounded below by the conditional probability, given supk |Xk| ≤ M, that S′_{n−M}/√(n − M) lies in the interval (a + ε, b − ε), and above by the conditional probability that S′_{n−M}/√(n − M) lies in the interval (a − ε, b + ε). Accordingly,
$$P\Bigl\{a + \epsilon < \frac{S'_{n-M}}{\sqrt{n-M}} < b - \epsilon \,\Big|\, \sup_k |X_k| \le M\Bigr\} < P\Bigl\{a < \frac{S_n}{\sqrt{n}} < b \,\Big|\, \sup_k |X_k| \le M\Bigr\} < P\Bigl\{a - \epsilon < \frac{S'_{n-M}}{\sqrt{n-M}} < b + \epsilon \,\Big|\, \sup_k |X_k| \le M\Bigr\}.$$

d) On the event {supk |Xk| ≤ M} the variables XM+1, XM+2, ... can take only the values ±1, the conditioning leaving them independent and, by symmetry, equally likely to be +1 or −1. The de Moivre–Laplace central limit theorem hence applies to S′_{n−M}/√(n − M) under the conditioning, so that the two flanking conditional probabilities in the display converge to ∫_{a+ε}^{b−ε} φ(x) dx and ∫_{a−ε}^{b+ε} φ(x) dx, respectively; as ε > 0 may be taken as small as desired, it follows that P{a < Sn/√n < b | supk |Xk| ≤ M} → ∫_a^b φ(x) dx. On the other hand, for any event A and conditioning event B we have |P(A) − P(A | B)| ≤ P(Bᶜ), so that the unconditional probability differs from the conditional probability by at most P{supk |Xk| > M}, and so, in view of part b),
$$\Bigl| P\Bigl\{a < \frac{S_n}{\sqrt{n}} < b\Bigr\} - P\Bigl\{a < \frac{S_n}{\sqrt{n}} < b \,\Big|\, \sup_k |X_k| \le M\Bigr\} \Bigr| \le P\{\sup_k |X_k| > M\} < 1/M.$$
By choosing M ≥ 1/ε the upper bound can be made ≤ ε and so, by allowing n to tend to infinity, we see that
$$\Bigl| \lim_n P\Bigl\{a < \frac{S_n}{\sqrt{n}} < b\Bigr\} - \int_a^b \phi(x)\,dx \Bigr| < \epsilon.$$
As ε may be chosen arbitrarily small, it follows that limn P{a < Sn/√n < b} = ∫_a^b φ(x) dx and so Sn/√n converges in distribution to the standard normal. The limiting distribution has unit variance! Thus, a sequence of random variables may converge in distribution but the limiting variance of the sequence need not be equal to the variance of the limiting distribution!
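An editorial simulation of the four-point distribution above, assuming NumPy: across replications the sample variance of Sn/√n sits near 2 while its empirical d.f. tracks the standard normal, as claimed.

    import numpy as np
    from math import erf, sqrt

    Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))
    rng = np.random.default_rng(3)
    n, reps, chunk = 500, 40_000, 4_000
    k = np.arange(1, n + 1)

    T = np.empty(reps)
    for i in range(0, reps, chunk):
        big = rng.random((chunk, n)) < k**-2.0           # |X_k| = k with probability k^{-2}, else 1
        mag = np.where(big, k, 1.0)
        sgn = rng.choice([-1.0, 1.0], size=(chunk, n))   # symmetric sign
        T[i:i + chunk] = (sgn * mag).sum(axis=1) / np.sqrt(n)

    print("sample variance of S_n/sqrt(n):", round(float(T.var()), 3))
    for t in (-1.0, 0.0, 1.0):
        print(f"t={t:+.0f}  empirical={np.mean(T <= t):.4f}  Phi(t)={Phi(t):.4f}")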

11. Continuation. Construct variants of Problem 10 where E(Xk²) = ∞ and yet the distribution of Sn/√n tends to the standard normal Φ.

SOLUTION: Modify the distribution of Problem 10 as follows. For k ≥ 1, set
$$C_k = \Bigl(\sum_{j=k}^{\infty} \frac{1}{j^3}\Bigr)^{-1}.$$
Equip Xk with the atomic distribution which places mass equal to ½(1 − 1/k²) on each of ±1 and mass equal to (1/(2k²)) · (Ck/j³) on each of ±j for j ≥ k. Only step (a) now needs modification in the solution to Problem 10. As
$$\sum_{j=k}^{\infty} j \cdot \frac{C_k}{2k^2 j^3} = \frac{C_k}{2k^2} \sum_{j=k}^{\infty} \frac{1}{j^2}$$
is convergent, it follows that Xk is integrable and by symmetry has zero expectation. But
$$\sum_{j=k}^{\infty} j^2 \cdot \frac{C_k}{2k^2 j^3} = \frac{C_k}{2k^2} \sum_{j=k}^{\infty} \frac{1}{j}$$
and the sum on the right diverges. It follows that Var(Xk) = +∞. No matter, the conditioning arguments in parts (b), (c), and (d) of the solution to Problem 10 carry over in toto and we conclude that Sn/√n converges in distribution to N(0, 1) though the individual variances are infinite!
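A tiny numerical aside (not part of the original solution; plain Python): for a fixed k the masses above sum to one while the truncated second moment grows without bound.

    # Problem 11's distribution for a fixed k (here k = 5), with the tail truncated at J.
    k, J = 5, 10**6
    Ck = 1.0 / sum(1.0 / j**3 for j in range(k, J))

    total_mass = (1.0 - 1.0 / k**2) + (Ck / k**2) * sum(1.0 / j**3 for j in range(k, J))
    print("total mass ~", round(total_mass, 8))

    for J_trunc in (10**2, 10**4, 10**6):
        m2 = (1.0 - 1.0 / k**2) + (Ck / k**2) * sum(1.0 / j for j in range(k, J_trunc))
        print(f"E[X_k^2] truncated at {J_trunc:>7}: {m2:.2f}")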

12. Once more, Stirling's formula. For every real number x define its negative part by x⁻ = −x if x ≤ 0 and x⁻ = 0 if x > 0. Let {Xk, k ≥ 1} be a sequence of independent random variables drawn from the common Poisson distribution with mean 1 and let {Sn, n ≥ 1} be the corresponding sequence of partial sums. Write Tn = (Sn − n)/√n. Show that E(Tn⁻) = n^{n+1/2} e^{−n}/n!.

SOLUTION: By the closure of the Poisson under convolution [see the theorem of Section VIII.6 in ToP], Sn = X1 + · · · + Xn has the Poisson distribution with mean n,
$$P\{S_n = k\} = p(k; n) = e^{-n} n^k / k! \qquad (k \ge 0).$$
As Tn⁻ = [n^{−1/2}(Sn − n)]⁻ is non-zero if, and only if, Sn ≤ n, we see that
$$E\Bigl(\Bigl(\frac{S_n - n}{\sqrt{n}}\Bigr)^-\Bigr) = \sum_{k=0}^{n} \frac{n-k}{\sqrt{n}}\,\frac{e^{-n} n^k}{k!} = \sqrt{n}\,e^{-n} \sum_{k=0}^{n} \frac{n^k}{k!} - \frac{e^{-n}}{\sqrt{n}} \sum_{k=1}^{n} \frac{n^k}{(k-1)!}$$
$$= \sqrt{n}\,e^{-n} \biggl[\,\sum_{k=0}^{n} \frac{n^k}{k!} - \sum_{k=1}^{n} \frac{n^{k-1}}{(k-1)!}\biggr] = \sqrt{n}\,e^{-n}\,\frac{n^n}{n!} \tag{6}$$
as the difference of sums in the square brackets telescopes.
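A short numerical confirmation of identity (6), an editorial aside using only the standard library: the sum on the left is evaluated term by term in log space and compared with n^{n+1/2} e^{−n}/n!.

    from math import exp, log, lgamma, sqrt

    def expected_negative_part(n):
        # sum_{k=0}^{n} ((n - k)/sqrt(n)) e^{-n} n^k / k!, each term built in log space
        return sum((n - k) / sqrt(n) * exp(k * log(n) - n - lgamma(k + 1)) for k in range(n + 1))

    for n in (5, 20, 100, 400):
        closed_form = exp((n + 0.5) * log(n) - n - lgamma(n + 1))   # n^{n+1/2} e^{-n} / n!
        print(n, round(expected_negative_part(n), 10), round(closed_form, 10))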

13. Continuation, convergence in distribution. Suppose Z is a standard normal random variable. Show that Tn⁻ converges in distribution to Z⁻.

SOLUTION: The independent Poisson(1) variables {Xj, j ≥ 1} have unit mean and unit variance and so, for each n, the partial sum Sn = X1 + · · · + Xn has mean n and variance n. Accordingly, the variable
$$T_n = S_n^* = \frac{S_n - n}{\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} (X_j - 1)$$
is suitably centred and scaled and so, by the central limit theorem for identical distributions [Section XX.1 in ToP], Sn* converges in distribution to a standard normal. In particular, P{Sn* < x} → Φ(x) or, equivalently, P{Sn* ≥ x} → 1 − Φ(x) for every real x.

Now write out the negative part of Tn explicitly:
$$T_n^- = \Bigl(\frac{S_n - n}{\sqrt{n}}\Bigr)^- = \begin{cases} n^{-1/2}(n - S_n) & \text{if } 0 \le S_n \le n, \\ 0 & \text{if } S_n > n. \end{cases}$$
Write Fn⁻ for the d.f. of the positive variable Tn⁻. Then, for t ≥ 0,
$$F_n^-(t) = P\{T_n^- \le t\} = P\{0 \le n^{-1/2}(n - S_n) \le t\} + P\{S_n > n\} = P\{0 \le n - S_n \le t\sqrt{n}\} + P\{S_n > n\} = P\{S_n \ge n - t\sqrt{n}\}.$$
By centring and scaling, we may identify the event on the right with the event {Sn* ≥ −t} and so we conclude that
$$F_n^-(t) = P\{S_n^* \ge -t\} \to \int_{-t}^{\infty} \phi(x)\,dx = 1 - \Phi(-t) \qquad (t \ge 0).$$
As the normal density has even symmetry, 1 − Φ(−t) = Φ(t) and so
$$F_n^-(t) \to \begin{cases} \Phi(t) & \text{if } t \ge 0, \\ 0 & \text{if } t < 0. \end{cases}$$
The right-hand side is the distribution of the negative part Z⁻ of a standard normal random variable Z and so we conclude that Tn⁻ converges in distribution to Z⁻.
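An optional numerical illustration, assuming SciPy: Fn⁻(t) = P{Sn ≥ n − t√n} can be computed exactly from the Poisson distribution and set against Φ(t).

    from math import ceil, erf, sqrt
    from scipy.stats import poisson

    Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))
    n = 10_000
    for t in (0.0, 0.5, 1.0, 2.0):
        m = ceil(n - t * sqrt(n))                 # P{S_n >= n - t*sqrt(n)} = P{S_n >= m}
        print(f"t={t:.1f}  F_n^-(t) = {poisson.sf(m - 1, n):.4f}   Phi(t) = {Phi(t):.4f}")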

14. Continuation, convergence of expectations. Now, here's the tricky part. Show that E(Tn⁻) → E(Z⁻) = (2π)^{−1/2} and pool answers to rediscover Stirling's formula. [Hint: Modify the equivalence theorem.]

SOLUTION: Robbins's lower bound for the factorial [see the display equation at the top of page 577, Section XVI.6 in ToP] shows that there is an absolute positive constant C such that
$$n! > e^C n^{n+1/2} e^{-n} \cdot e^{(12n+1)^{-1}} > e^C n^{n+1/2} e^{-n}$$
for every n. As Tn⁻ is a positive variable, by Problem XX.12 it follows that 0 ≤ E(Tn⁻) < e^{−C} for all n and so {Tn⁻, n ≥ 1} is a uniformly integrable sequence of positive random variables. By the dominated convergence theorem [see Example XIII.8.3 in ToP], it follows that, as M → ∞, we have E(Tn⁻ 1_{[M,∞)}(Tn⁻)) → 0 uniformly in n. In particular, for any ε > 0, we may select M = M(ε) determined only by ε so that 0 ≤ E(Tn⁻ 1_{[M,∞)}(Tn⁻)) < ε for all n.

An easier argument shows that the positive variable Z⁻ = −Z 1(Z ≤ 0) is integrable as well. Indeed, by Problem XIII.3, its expectation is given by
$$E(Z^-) = \int_0^\infty P\{Z^- > t\}\,dt = \int_0^\infty P\{Z < -t\}\,dt = \int_0^\infty \int_{-\infty}^{-t} \phi(x)\,dx\,dt = \int_0^\infty \int_t^\infty \phi(x)\,dx\,dt.$$
By Fubini's theorem [see Theorem XIV.5.3 in ToP], we may exchange the order of integration and so
$$E(Z^-) = \int_0^\infty \int_0^x \phi(x)\,dt\,dx = \int_0^\infty x\phi(x)\,dx \overset{(y \leftarrow x^2/2)}{=} \frac{1}{\sqrt{2\pi}} \int_0^\infty e^{-y}\,dy = \frac{1}{\sqrt{2\pi}}.$$
By Example XIII.8.3 in ToP again, it follows that E(Z⁻ 1_{[M,∞)}(Z⁻)) < ε by, if necessary, selecting a larger value of M.

All is now in readiness for the equivalence theorem [Section XIX.2 in ToP]. Introduce the continuous, bounded function
$$u_M(x) = \begin{cases} 0 & \text{if } x < 0, \\ x & \text{if } 0 \le x < M, \\ M & \text{if } x \ge M, \end{cases}$$
and write u∞(x) = x 1_{[0,∞)}(x). By Problem 13 we have Tn⁻ converging in distribution to Z⁻ and so by the equivalence theorem of Section XIX.2 we see that E(uM(Tn⁻)) → E(uM(Z⁻)) as n → ∞. In particular, we may select n = n(ε) sufficiently large so that
$$\bigl| E\bigl(u_M(T_n^-)\bigr) - E\bigl(u_M(Z^-)\bigr) \bigr| < \epsilon.$$
By addition and subtraction of terms, we may write
$$E(T_n^-) - E(Z^-) = E\bigl(u_\infty(T_n^-)\bigr) - E\bigl(u_\infty(Z^-)\bigr) = \bigl[E\bigl(u_\infty(T_n^-)\bigr) - E\bigl(u_M(T_n^-)\bigr)\bigr] + \bigl[E\bigl(u_M(T_n^-)\bigr) - E\bigl(u_M(Z^-)\bigr)\bigr] - \bigl[E\bigl(u_\infty(Z^-)\bigr) - E\bigl(u_M(Z^-)\bigr)\bigr].$$
By additivity, we have
$$E\bigl(u_\infty(T_n^-)\bigr) - E\bigl(u_M(T_n^-)\bigr) = E\bigl(u_\infty(T_n^-) - u_M(T_n^-)\bigr) = E\bigl(T_n^- 1_{[M,\infty)}(T_n^-) - M\,1_{[M,\infty)}(T_n^-)\bigr) = E\bigl(T_n^- 1_{[M,\infty)}(T_n^-)\bigr) - M\,P\{T_n^- \ge M\},$$
and, likewise,
$$E\bigl(u_\infty(Z^-)\bigr) - E\bigl(u_M(Z^-)\bigr) = E\bigl(Z^- 1_{[M,\infty)}(Z^-)\bigr) - M\,P\{Z^- \ge M\}.$$
It follows that
$$E(T_n^-) - E(Z^-) = \bigl[E\bigl(T_n^- 1_{[M,\infty)}(T_n^-)\bigr) - M\,P\{T_n^- \ge M\}\bigr] + \bigl[E\bigl(u_M(T_n^-)\bigr) - E\bigl(u_M(Z^-)\bigr)\bigr] - \bigl[E\bigl(Z^- 1_{[M,\infty)}(Z^-)\bigr) - M\,P\{Z^- \ge M\}\bigr]$$
and we may now finish up with repeated applications of the triangle inequality to obtain
$$\bigl| E(T_n^-) - E(Z^-) \bigr| \le E\bigl(T_n^- 1_{[M,\infty)}(T_n^-)\bigr) + M\,P\{T_n^- \ge M\} + \bigl| E\bigl(u_M(T_n^-)\bigr) - E\bigl(u_M(Z^-)\bigr) \bigr| + E\bigl(Z^- 1_{[M,\infty)}(Z^-)\bigr) + M\,P\{Z^- \ge M\} < 5\epsilon$$
for all sufficiently large values of n. As ε > 0 may be chosen arbitrarily small, it follows that E(Tn⁻) → E(Z⁻) = (2π)^{−1/2} as n → ∞. In conjunction with (6) we see hence that
$$\sqrt{n}\,e^{-n}\,\frac{n^n}{n!} \to \frac{1}{\sqrt{2\pi}}, \quad\text{or, equivalently,}\quad n! \sim \sqrt{2\pi}\,n^{n+1/2} e^{-n},$$
and we have rediscovered Stirling's formula!
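One last numerical footnote (editorial, standard library only): the quantity √n e^{−n} n^n/n! from (6) indeed drifts toward (2π)^{−1/2} ≈ 0.39894.

    from math import exp, lgamma, log, pi, sqrt

    for n in (10, 100, 1000, 10_000):
        ratio = exp((n + 0.5) * log(n) - n - lgamma(n + 1))   # sqrt(n) e^{-n} n^n / n!
        print(n, round(ratio, 6))
    print("1/sqrt(2 pi) =", round(1.0 / sqrt(2.0 * pi), 6))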


15. Central limit theorem for exchangeable variables.¹ Consider a family of distributions Fθ indexed by a real parameter θ. We suppose that, for each θ, the d.f. Fθ has mean zero and variance σ²(θ). A random value Θ is first chosen according to a distribution G and a sequence of variables X1, X2, ... is then drawn by independent sampling from FΘ. Write a² = E(σ²(Θ)). Show that the normalised variable Sn/(a√n) converges in distribution to the limiting distribution given by
$$\int_{-\infty}^{\infty} \Phi\Bigl(\frac{ax}{\sigma(\theta)}\Bigr)\,dG(\theta).$$
The limiting distribution is not normal unless G is concentrated at one point.

16. Galton's height data. Table 1 of Section VII.8 provides Galton's height data. Test the data against the hypothesis that they arise from a normal distribution.

17. Rotations. Suppose X = (X1, ..., Xn) has independent N(0, 1) components. If H is any orthogonal n × n matrix show that XH has the same distribution as X.

18. Continuation, the uniform distribution on the sphere. Deduce that X/‖X‖ has the uniform law λ_{n−1} on the Borel sets of the unit sphere S^{n−1} = {x ∈ Rⁿ : ‖x‖ = 1} equipped with the geodesic distance. More specifically, λ_{n−1}(A) = λ_{n−1}(AH) for every orthogonal n × n matrix H and every Borel set A in B(S^{n−1}).

19. Continuation, the infinite-dimensional sphere. In vector notation, let R^(n) = (R1^(n), ..., Rn^(n)) be a point chosen in S^{n−1} according to the distribution λ_{n−1}. Show that
$$P\bigl\{\sqrt{n}\,R_1^{(n)} \le t\bigr\} \to \Phi(t),$$
and, likewise,
$$P\bigl\{\sqrt{n}\,R_1^{(n)} \le t_1,\ \sqrt{n}\,R_2^{(n)} \le t_2\bigr\} \to \Phi(t_1)\Phi(t_2).$$
This beautiful theorem connecting the normal to the infinite-dimensional sphere is due to E. Borel.² This famous paper has sparked a huge literature and is important for Brownian motion and Fock space constructions in quantum mechanics. [Hint: Use Problems X.20 and XVII.5.]

¹ J. R. Blum, H. Chernoff, M. Rosenblatt, and H. Teicher, "Central limit theorems for interchangeable processes", Canadian Journal of Mathematics, vol. 10, pp. 222–229, 1958.
² E. Borel, "Sur les principes de la théorie cinétique des gaz", Annales Scientifiques de l'École Normale Supérieure, vol. 23, pp. 9–32, 1906.
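Problems 17–19 are stated without solutions here; purely as an editorial illustration (assuming NumPy), the sketch below generates uniform points on S^{n−1} by normalising a standard Gaussian vector, the construction of Problem 18, and checks that √n R1^(n) is approximately standard normal with the first two coordinates nearly independent.

    import numpy as np
    from math import erf, sqrt

    Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))
    rng = np.random.default_rng(5)
    n, reps = 200, 50_000

    X = rng.standard_normal((reps, n))
    R = X / np.linalg.norm(X, axis=1, keepdims=True)        # uniformly distributed on S^{n-1}
    U, V = sqrt(n) * R[:, 0], sqrt(n) * R[:, 1]

    for t in (-1.0, 0.0, 1.0):
        print(f"t={t:+.0f}  P{{sqrt(n) R_1 <= t}} ~ {np.mean(U <= t):.4f}   Phi(t) = {Phi(t):.4f}")
    print("joint, t1 = t2 = 1:", round(float(np.mean((U <= 1.0) & (V <= 1.0))), 4),
          "  Phi(1)^2 =", round(Phi(1.0)**2, 4))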


Index 1-trick (use with Cauchy– Schwarz inequality), 316 Absolutely continuous distribution, 268, 340 Accumulated entrance fee, 363 Additive group, 403 Additivity of expectation, 80, 133, 135, 136, 153, 160, 211, 214, 218, 219, 223, 227–230, 233, 234, 239, 242, 279, 280, 283–288, 295–297, 299, 304–306, 308, 316, 319, 324, 325, 328, 329, 338, 353, 354, 360, 377, 416 of measure, 17, 23, 25, 38, 41, 51, 53, 58, 88, 107, 138, 167, 168, 172, 191, 251, 253–255, 266, 275, 276, 294, 347, 413 of variance, 133, 135, 180, 214, 229 Additivity of measure, 99 Algebra of events, 20 Alm, S. E., 388n Almost everywhere, 241, 287 Altitude, 307 Ancestor (of vertex), 389 Anomalous numbers, 91 Antheil, G., 84n Approximation by simple functions, 285 of measure, 254 of sum by integrals, 48, 382 Arc sine law, 167 Arithmetic distribution, 331, 333

Asymptotic expansion, 212 Atoms of a discrete distribution, 293 Axiom of choice, 252 Azuma’s inequality, 378 Baire function, 399 Baldi, P., 307n Ball in n dimensions, 230 Ballot problem, 28 Banach’s match boxes, 154 Bayes’s rule for events, 39 Beardwood, J., 382n Beardwood–Halton– Hammersley theorem, 382 Bell, R., 370n Benford, F., 91 Bennett’s inequality, 378 Bent binary digit, 98–100, 102, 116 coin, 116, 119, 270 Bernoulli model of diffusion, 42, 43 trials, 144, 202, 268, 270, 294, 308, 310, 318, 319, 322, 346, 377, 402 Bernstein polynomial, 100, 361 in two dimensions, 361 Bernstein’s inequality, 378 Bernstein, S. N., 53, 94, 103, 361 Bessel function of the first kind modified, 213 Beta density, 197, 216 Bézout’s identity, 405

Bi, W., vii Bin-packing problem, 381 Binary channel, 58 Binning number, 381 Binomial coefficient, 3, 163 distribution, 99, 107, 116, 123, 125, 144, 145, 147, 149, 150, 159, 202, 206, 312, 313, 357, 358 theorem, 4, 40, 59, 154, 155, 157, 163, 165, 345 Birth-death chain, 42 Birthday paradox, 7, 84 Bivariate normal density, 215, 221, 225–228 Bluetooth protocols, 84 Blum, R., 417n Boole’s inequality, 358, 359, 364, 379, 391, 398 sieve, 84 Boole’s inequality, 75, 273, 283 Boppana, R., 390n Borel set, 250, 252, 254, 255 set (on the sphere), 417 sigma-algebra, 252 Borel’s law of normal numbers, 363 Borel, E., 417n Borel-Cantelli lemma, 363 Bose–Einstein distribution, 11, 13, 67 Boston Celtics, 17 Box, G. E. P., 274n Box–Muller construction, 274 Branching process, 341 Bridge hand high pairs, 76


trump distribution, 27 Brownian bridge, 238–241 motion, 239–243, 417 Buffon’s cross, 185 needle problem, 184 Campbell’s theorem, 402 Cantelli’s strong law of large numbers, 362 Cantor diagonalisation, 249 distribution, 266–268, 283, 310 set, 266 card, see Cardinality of a set Cardinal arithmetic, 250 Cardinality of a set, 56 Catalan number, 161, 345 Catalan, E., 161 Catcheside, D. G., 77n Cauchy density, 129, 180, 201, 410 density in R3 , 200 density in the plane, 200, 201 distribution, 410 sequence in L2 , 323 Cauchy–Schwarz inequality, 304, 315, 316, 320 Cayley’s formula, 83 Central limit theorem, 109, 110, 407 limit theorem for exchangeable variables, 417 limit theorem for identical distributions, 371, 408, 414 limit theorem for inversions, 409 limit theorem for order statistics, 203 limit theorem for records, 409 limit theorem for runs, 411 limit theorem for sample median, 204 limit theorem for triangular arrays, 204 term of the binomial, 106, 107, 116, 358, 359

term of the multinomial, 145 term of the trinomial, 146 Chain rule for conditional probabilities, 198, 390, 391 Channel capacity, 379 Characteristic exponent, 410 function, 352, 401 Chebyshev’s inequality, 90, 92, 101, 228, 229, 357, 358, 362, 363, 407 Chebyshev’s inequality, 107, 359, 408 Chernoff’s inequality, 360, 407, 408 Chernoff, H., 417n Chi-squared density, 224, 228 Chromosome breakage and repair, 13 matching, 77 Chu, Shih-Chieh, 40n Chu-Vandermonde identity, 40 Chung, K. L., 283 Circular scan statistic, 388 Clique, 85, 361, 395 Closed disc, 373 rectangle, 373 Coarser σ-algebra, 288 Code book, 314, 379 Code division multiple access, 84 Cognitive radio, 47 College admissions, 47 Collisions in transmissions, 84 Complete graph, 85, 395 inner product space, 322 measure, 255 Completion of measurable space, 255 Complex random variable, 353 Component of Rn , 374 Compound Poisson distribution, 338

420

Concentration function, 381 of measure, 379 Conditional expectation, 286, 288, 322, 325 Conditionally independent events, 51 Configuration function, 383 Conjugate exponent, 317 Connected graph, 82 Continuity of measure, 266, 343 Continuous distribution, 301 Convention for binomial coefficients, 40 for binomial coefficients, 3, 13, 164 for logarithms, 5 for positive parts, 175 for runs, 346 Convergence a.e., 102, 378, 382 in distribution, 399, 400, 408, 410, 414 in probability, 101, 102, 378 vague, 400, 405 Convex function, 317, 383 set, 373 Convolution of Cauchy densities, 180 of exponential densities, 208 of functions, 407 of gamma densities, 197, 200 of uniform densities, 132 property of characteristic functions, 354 property of transforms, 180, 331, 333, 334, 339 Correlation coefficient, 148 function, 240, 241 inequality, 318, 390 Countable set, 249, 250 Coupling, 387, 388 Coupon collector’s problem, 261 Cover

Index

of a set, 254 Cover, T. M., 158, 370n Cramér–Rao bound, 320 Craps, 333 Crofton’s method of perturbations, 191 Crystallography, 316 de Finetti’s theorem, 174, 177, 194, 296 de Moivre’s theorem, 105, 106, 109, 116, 119 de Moivre, A., 109 de Moivre–Laplace theorem, 203 de Moivre-Laplace theorem, 203, 204 de Montmort, P. R., 70, 78, 386, 387 de Morgan’s laws, 19 Decision rule, 314 minimum distance, 314 Decreasing family of sequences, 389 sequence of sets, 23 Degenerate random variable, 288 Degrees of freedom, 224, 225 Denial of service attack, 128 Density of digits, 91 di Finetti’s theorem, 194, 195 Diaconis, P., 126 Differentiation property of transforms, 332, 339 Direct sequence spread spectrum, 84n Dirichlet density, 199, 200 Discrepancy, 272, 273, 345 Distribution function, 255 Dominated convergence theorem, 284, 290, 291, 399 Dominated convergence theorem, 415 Doob’s maximal inequality, 243 DoS, see Denial of service Double sample, 273, 375 Dyadic expansion, 400 Dynkin, E. B., 64

Edge of a graph, 383 Ehrenfest model of diffusion, 42 Empirical distribution, 272, 345 Energy, 307 Entropy, 369 Equivalence class, 252 theorem, 397, 399, 401, 415, 416 Equivalent random variables, 317 Error prediction, 304 Estimator least-squares, 305 linear, 305, 307 Euclid, 177 Euclid’s Elements, 177 Exchangeable random variables, 177, 178, 295, 417 Expectation, 381 and median, 377 identity, 121, 281 is homogeneous, 284, 289, 327, 354 is linear, 122, 123, 179, 192, 217, 288, 289 is monotone, 283–286, 288, 303, 304, 317 Exponential density, 195, 198, 207, 216 distribution, 187–189, 192, 194, 198, 205, 207–209, 216, 278, 387, 399 generating function, 72 Extreme value distribution, 128, 400 Factorisation property, 289, 327–329 Failure run, 16, 346 Fair game, 152, 363 Falling factorial, 4, 7, 26, 68, 159 Fan, C. T., 128n Fang, S. C., vii Fatemieh, O., 128n Fatou’s lemma, 290 Feller, W., 30n Feller, W., 145, 189, 344, 363n Fermat, P., 157


Fermi–Dirac distribution, 13 Fiancée problem, see Marriage problem Filtration, 326–328 Finer σ-algebra, 288 Finite-dimensional distribution, 241, 243 Fisher information, 320 FKG inequality, 319 Flood, M. M., 47 Fluctuation theory, 17 Flush (poker), 9 Fock space, 417 Fortuin, C. M., 319n Four-of-a-kind (poker), 10 Fourier coefficient, 112–114 inversion, 112–114 series, 112 transform, 112, 113, 352 Fourier–Lebesgue transform, 352 Fourth moment, 362 Fractional part of real number, 112 Free energy, 382 Frequency-hopping spread spectrum, 84 Fubini’s theorem, 138, 299, 301, 416 fails for outer integrals, 301 Full house (poker), 10 Function space, 374 Gale, D., 47n Galton’s height data, 417 Gambler’s ruin, 43 Gamma density, 197–200, 216 distribution, 197, 209, 210 function, 105, 225, 229, 312 Gaussian channel, 379 noise, 379 process, 232, 233, 240–243 General position, 374 Generating function, 11, 12, 157 definition, 72 of artithmetic distributions, 331, 333 of Catalan sequence, 163 of return to origin, 165


of return to the origin, 166 two-sided, 332 Geodesic metric, 417 Geometric distribution, 16, 150–152, 198, 199, 278, 331, 340, 370 Gilbert channel, 60 Gilbert, E. N., 60n Gilovich, T., 17n Ginibre, J., 319n Glivenko–Cantelli theorem, 375 Gnedenko, B. V., 345n Growth function, 374 Gumbel distribution, 400 Gunter, C., 128n Half-space, 373, 374 affine, 374 Halmos, P., 63 Halton, J. H., 382n Hammersley, J. M., 382n Hamming distance, 308, 380, 381 Hardy’s law, 62 Harker–Kasper inequality, 316 Harmonic series, 189, 314, 315 Heaviside distribution, 397, 406, 409 function, 193, 263 Helly’s selection principle, 401, 404 Heyde, C. C., 344 Hoeffding’s inequality, 360, 379 for the binomial, 359, 379 Hölder’s inequality, 317 Horse race, 368 Hot hand in basketball, 16– 18 Hotelling’s theorem, 231 Hypergeometric distribution, 158, 159, 357, 358 function, 67 tail bound, 357, 358, 375 Hyperplane, 374 I0 (·), Ik (·), see Bessel function of the first kind, modified Importance

density, 360, 361 sampling, 360 Inclusion–exclusion, 65–67, 70, 74–78, 81, 82, 260, 265, 385, 386 Increasing family of sequences, 389 sequence of sets, 22 Increment, 206 Independence, 64 number of graph, 383 set of vertices (events), 390 sieve, 389 Independent binary digits, 98, 99, 102, 117 increments, 239 random variables, 289, 354 sampling, 381, 382 trials, 109, 110 Indicator function, 20 random variable, 80, 227, 280, 284, 308, 325 Inequality of arithmetic and geometric means, 316 Infinite-dimensional sphere, 417 Inner envelope, 278 product in L2 , 322 product of Euclidean vectors, 269 Integer part of real number, 112 Interchange in order of derivative and sum, 151 sum and integral, 111 Inversions in permutation, 409 Ising spin glass, 307 Isolated point in geometric graph, 388 Iterated integrals, 299 Jacobian, 199, 232 Janson’s inequality, 390, 393– 395 Janson, S., 390n Jensen’s inequality, 304, 314, 315, 321, 326 Jump discontinuity, 259


Kahneman, D., 17n Kantorovich’s inequality, 316 Kasbekar, G., vii Kasteleyn, P. W., 319n Khan, F., 128n Khanna, S., 128n Kleitman’s inequality, 391 Kleitman’s lemma, 389, 391 Kleitman, D. J., 390n Kolmogorov, A. N., 363, 371 Koroljuk, V. S., 346n Kullback–Leibler divergence, 315, 321, 322, 377 Kullback-Leibler divergence, 369 Ladder index, 274, 345 Lamarr, H., 84n Laplace transform, 332, 339, 401 Laplace’s formula, 110 law of succession, 46 theorem, 116 Laplace, P. S., 116 Large deviation theorem, 119 Le problème des ménages, 80, 387 rencontres, 70, 78, 387 Lea, D. E., 77n Lebesgue integral, 301 measurable set, 250–253 measure, 170, 250, 299, 400 Lebesgue’s decomposition theorem, 264, 265 Lebsegue measurable set, 252 Length of random chain, 298 Level set, 230, 268, 272, 273, 295 Levi’s theorem, 102, 103, 286 Lévy sandwich, 111, 114, 119 Liapounov condition, 410, 411 Lindeberg condition, 410, 411 Linear congruential generator, 274 Linear transform diagonalisation, 229 of normal variables, 229


Lipschitz condition, 381 function, 103, 381 Local limit theorem, 105, 107, 108, 119 Log convex function, 149 Log sum inequality, 315, 378 Lognormal distribution, 343, 344 Longest common subsequence, 383 Lottery, 8 Lucas, E., 80, 387 Luczak, T., 390n Marginal density (distribution), 247, 321 Markov’s method, 110, 113, 116, 119 Marriage problem, 47, 152 Martingale, 326, 338 bounded-difference, 378, 379 transform, 329 Matchings (le problème des rencontres), 387 multiple, 385 Maximum likelihood principle, 379 of normal variables, 378 of random walk, 167 Maxwell–Boltzmann distribution, 10, 11, 30 McCarthy, J., 234n Mean recurrence time of runs, 347 Median, 302–305, 381–383 Memoryless property of exponential distribution, 193, 209 Mengoli’s inequality, 314 Mengoli, P., 314 Method of images, 166 Metric, 317, 318, 402 Minimax decision, 49 Minkowski’s inequality, 323 Modulus inequality, 353, 377 Moments, 339 definition, 301 determine distribution, 343, 401

Monotone class of sets, 63 class theorem, 63 convergence theorem, 281, 290, 290, 325, 326 Monotone convergence theorem, 302 Monotonicity of measure, 23, 106, 253, 254, 391 Moran, P. A. P., 145 Muller, M. E., 128n, 274n Multilevel sequences, 379 Multinomial distribution, 145, 145, 146, 147 theorem, 89, 102 Mutual information, 321 Mutually exclusive events, 51 n-server queue, 208 Negative binomial distribution, 154, 199 Negative binomial distribution, 152, 154, 198, 331 identity, 12 Negatively related indicator variables, 389 Neural computation, 307 Non-measurable function, 291 set, 252 Norm Lp , 316, 317 of Euclidean vector, 269 of function, 402 of operator, 400 Normal density, 218 distribution, 204, 213, 220– 222, 230, 294, 371, 378, 408, 411, 414 number, 91 tail bound, 108, 378 Normalisation of measure, 42, 129, 136, 149, 254, 353, 361 Nowhere differentiable function, 234 Nukpezah, J., vii Null


set, 273 Occupancy configuration, 10, 11, 13, 30, 32 One pair (poker), 9 Open set, 372 Optimal decision, 48 stopping, 47 tour, 382 Optimum receiver, 379 Order statistics, 177, 201, 201, 203–205, 295 Orthant, 187, 231, 311, 373 Orthogonal functions, 102 projection, 322, 323 random variables, 323 transformation, 417 Orthonormal basis of eigenvectors, 222, 223 system of functions, 92 Outer envelope, 278 integral, 291, 301 measure, 251, 254, 255, 291 π–λ theorem, 64 Pairwise independence, 55, 56 Parallelogram law, 322, 323 Pareto density, 410 Parseval’s equation, 95 Partial order, 318 Partition, 81, 97 function, 382 Pascal’s triangle, 3, 9, 74, 122, 143, 156, 202, 205, 260, 312 Pascal, B., 156, 158 Path in a graph, 385 Pepys’s problem, 123, 146 Percolation, 382 Permanent of a matrix, 74 Permutation, 55, 70, 275 Phase transition, 394 Pigeon-hole principle, 382 Pinsker’s inequality, 377 Plucked string function, 234 Poisson


approximation, 70, 85 approximation to the binomial, 150 distribution, 109, 144, 149, 150, 260, 408, 414 paradigm, 85, 386, 387, 395 process, 198, 402 sieve, 389 Poissonisation, 382 Poker, 9, 150 Polar coordinates, 211, 220, 221 Pollaczek–Khinchin formulæ, 344 Pólya’s urn scheme, 39, 41, 327, 328 Portfolio, 109 Positive random variable, 138, 289 Positively related indicator variables, 388 Positivity of measure, 23, 254 Problem of the points, 156 Product measure, 380 space, 380 Pure birth process, 190 Pythagoras’s theorem, 311, 323, 324 Quadrant, 373 Quadratic form, 374 of a normal, 222 Quantum mechanics, 417 Quarantining, 389 Queuing theory, 217 Rademacher functions, 89, 363 are orthogonal, 87, 92 Radon–Nikodým theorem, 287 Ramsey number, 73, 85 Ramsey’s theorem, 74 Random direction, see Random point on circle (sphere) flight, 298 geometric graph, 385 graph, 85, 383 permutation, 26, 146, 375, 409

point in interval, 174 point in unit square, 381 point on circle, 174, 268, 388 point on sphere, 269 selection, 36 set, 8, 54, 56 walk, 42, 160, 167, 345 Randomisation, 198 Rational form, 347, 351 Ray, 374 Rayleigh density, 211 Rayleigh’s density, 218, 233, 274 Record values, 187, 409 Rectangular function, 95, 113, 132 Red Auerbach, disdain for basketball statistics, 17n Reductionist theorem, 409 Reflected walk, 168 Regular expressions, 161, 162 Renewal epoch, 16, 346 equation, 344 process, 16, 346 Reservoir sampling, 127 Residual time, 209 Returns to the origin, 164 Reversed walk, 160, 161, 167 Rezucha, I., 128n Rice’s density, 213 Robbins’s bounds for the factorial, 359, 415 Robbins, H. E., 178 Rosenblatt, M., 417n Rotation transform, 417 Row sum, 402 Royal flush (poker), 150 Rucinski, ´ A., 390n Rumours (spread of), 14 Run length, 278 Runs, 16, 52, 63, 346, 351, 411 Salehi, A. T., vii Sample median, 204 size, 375 variance, 297 Scale property of 180

transforms,


Schläfli’s theorem, 373 Schläfli, L., 373n Second moment, 227, 308 Secretary problem, see Marriage problem Semiring of sets, 24 Sequential decision theory, 47 Set function, 23, 24, 286 of measure zero, 400 Shahrampour, S., vii Shannon, C. E., 321n Shapley, L. S., 47n Sherrington–Kirkpatrick spin glass model, 382 Shot noise, 402 Signal vector, 314, 379, 380 Simple random variable, 281, 287, 289, 290 Simultaneous diagonalisation, 230 Sinc function, 95 Singular continuous distribution, 268 Slogans, 361 Sojourn time, 344 Space of square-integrable functions, 322 Space-filling curve, 381 Spacings, 194, 296, 387, 388 small, 389 Spencer, J., 390n Spherical coordinates, 311 Spin interaction, 382 quantum state, 382 Square-root (of a positive definite matrix), 229, 230 St. Petersburg game with shares, 370 Stability of normal, 214 Stability under convolution of Cauchy, 410 of normal, 107, 109 of Poisson, 109, 408, 414 Stable distributions, 222 Standard normal density, 382 State (of chain, system), 42


Stationary distribution, 42, 43, 62 process, 232, 233 Statistical tests for capture–recapture, 159 periodogram analysis, 231 success runs (hot hand), 17, 346 Steele’s theorem, 374 Steele, J. M., 374n Steepest descent, 316 Stirling numbers of the second kind, 81, 83 Stirling’s formula, 68, 69, 74, 105, 106, 147, 225, 229, 313, 322, 358, 414, 416 Stopping time, 338 Straight, straight flush (poker), 9, 10 Stratification, 198 Strong law of large numbers, 102, 362, 363, 368, 371, 372, 378, 408 Student’s density, 224 Subadditivity of measure, 23 Success run, 16, 17, 66, 149, 277, 278, 346, 347 Successor (of vertex), 389 Symmetric distribution, 344 Symmetrisation by pairwise exchanges, 375 by permutation of double sample, 375 Talagrand’s convex distance, 381 induction method, 380 Talagrand, M., 379 Teicher, H., 417n Temperature, 382 Tennis ranking, 158 tie-breaks, 158 Ternary digit, 88 Test for independence, 321 Theorem of inclusion-exclusion, 75, 83, 195, 260 total probability, 26, 28– 30, 34–41, 43–47, 54, 57– 61, 63, 122–124, 126, 138,

148, 153, 192, 289, 319, 338, 339, 341, 342, 413 Thoday, J. M., 77n Three-of-a-kind (poker), 9 Threshold function, 385, 388, 389, 394 Tian, L., 333n Toroidal metric, 385 Total variation distance, 377, 387, 402, 405 Tournament, 56 Tower property, 288, 327 Transition probability, 42, 48 Translate of a set, 250 Translation invariance of Lebesgue measure, 250, 253 Trapezoidal function, 113, 114 Travelling salesman problem, 382 Tree, 83 Treize, 387 Triangle in random graph, 85, 393– 395 indquality, 398 inequality, 238, 247, 251, 285, 317, 317, 318, 399, 402, 416 inequalty, 399 Triangular array, 402 density, 132, 179, 269 function, 114 system of wavelets, 238 Trigonometric polynomial, 231 Trinomial distribution, 146, 146 Trivial σ-algebra, 288 Tversky, A., 17n Two pair (poker), 9 Two-dimensional distribution, 247 Type of distribution (density), 150 Unbiased estimator, 297, 320, 360 Uncorrelated random variables, 234 Uniform


atomic distribution, 403 density, 139, 141, 171, 172, 218, 271, 293, 360, 361 distribution, 140, 179, 201, 205, 216, 221, 268, 271, 273, 278, 293, 302, 310, 328, 360, 379, 400 distribution in a cube, 170 distribution in polytope, 195, 196 distribution in unit square, 381, 382 distribution on subgroup, 404 distribution on the sphere, 417 Uniformly continuous function, 92 Unit cube in n-dimensions, 92, 311 sphere, 378, 417 Unusual dice, 333 Urn problem, 10, 27, 30, 41, 60, 121, 146, 259, 279, 280 indistinguishable balls, 11 Valley (energy minimum), 307–309 Vallone, R., 17n Vandermonde’s convolution, 40, 41, 143, 144, 164–166 Vandermonde, A., 143 Vapnik–Chervonenkis class, 374, 375 dimension, 375, 383 theorem, 375 Venkatesh, S. S., 128n, 307n Vergetis, E., vii Vertex of a graph, 383 Viète’s formula, 88 Volume of a ball, 312 of a set, 92 of polytope, 195 von Neumann, J., 277n, 278 Waiting time, 139, 152, 154, 187, 199, 331, 370 distribution, 155, 193 Wald’s equation, 338, 341 Walsh–Kaczmarz function, 91


Weak law of large numbers, 58, 62, 92, 101, 106, 272, 409 for stationary sequences, 362 Weierstrass’s approximation theorem, 94, 103, 361 Weight vector, 374

Weinberg, S., 10n seeRegular expressions, 161 Wendel, J. G., 374n Weyl’s equidistribution theorem, 112 Weyl, H., 112 Wilf, H., 156n Wilf, H., 336


Wu, Z., vii Wyner, A., 126

You, J., 333n

Zk , see Additive group