Citation preview

Lectures on Probability Theory and Mathematical Statistics Second Edition Marco Taboga

ii

Contents I

Mathematical tools

1 Set 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8

theory Sets . . . . . . . . Set membership . . Set inclusion . . . Union . . . . . . . Intersection . . . . Complement . . . . De Morgan’s Laws Solved exercises . .

1 . . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

3 3 4 4 5 6 6 7 7

2 Permutations 2.1 Permutations without repetition . . . . . . . . . . . 2.1.1 De…nition of permutation without repetition 2.1.2 Number of permutations without repetition . 2.2 Permutations with repetition . . . . . . . . . . . . . 2.2.1 De…nition of permutation with repetition . . 2.2.2 Number of permutations with repetition . . . 2.3 Solved exercises . . . . . . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

9 9 9 10 11 11 11 12

3 k-permutations 3.1 k-permutations without repetition . . . . . . . . . . . 3.1.1 De…nition of k-permutation without repetition 3.1.2 Number of k-permutations without repetition . 3.2 k-permutations with repetition . . . . . . . . . . . . . 3.2.1 De…nition of k-permutation with repetition . . 3.2.2 Number of k-permutations with repetition . . . 3.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

15 15 15 16 17 17 18 19

4 Combinations 4.1 Combinations without repetition . . . . . . . . . . . 4.1.1 De…nition of combination without repetition 4.1.2 Number of combinations without repetition . 4.2 Combinations with repetition . . . . . . . . . . . . . 4.2.1 De…nition of combination with repetition . . 4.2.2 Number of combinations with repetition . . . 4.3 More details . . . . . . . . . . . . . . . . . . . . . . . 4.3.1 Binomial coe¢ cients and binomial expansions 4.3.2 Recursive formula for binomial coe¢ cients . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

21 21 21 22 22 23 23 25 25 25

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

iii

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . . .

iv

CONTENTS 4.4

Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5 Partitions into groups 5.1 De…nition of partition into groups 5.2 Number of partitions into groups . 5.3 More details . . . . . . . . . . . . . 5.3.1 Multinomial expansions . . 5.4 Solved exercises . . . . . . . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

27 27 28 29 29 30

6 Sequences and limits 6.1 De…nition of sequence . . . . . . . . . . . . . . 6.2 Countable and uncountable sets . . . . . . . . . 6.3 Limit of a sequence . . . . . . . . . . . . . . . . 6.3.1 The limit of a sequence of real numbers 6.3.2 The limit of a sequence in general . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

31 31 32 32 32 34

7 Review of di¤erentiation rules 7.1 Derivative of a constant function . . . . 7.2 Derivative of a power function . . . . . . 7.3 Derivative of a logarithmic function . . . 7.4 Derivative of an exponential function . . 7.5 Derivative of a linear combination . . . 7.6 Derivative of a product of functions . . . 7.7 Derivative of a composition of functions 7.8 Derivatives of trigonometric functions . 7.9 Derivative of an inverse function . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

39 39 39 40 40 41 41 42 43 43

8 Review of integration rules 8.1 Inde…nite integrals . . . . . . . . . . . . . . . . . . . . . . . . 8.1.1 Inde…nite integral of a constant function . . . . . . . . 8.1.2 Inde…nite integral of a power function . . . . . . . . . 8.1.3 Inde…nite integral of a logarithmic function . . . . . . 8.1.4 Inde…nite integral of an exponential function . . . . . 8.1.5 Inde…nite integral of a linear combination of functions 8.1.6 Inde…nite integrals of trigonometric functions . . . . . 8.2 De…nite integrals . . . . . . . . . . . . . . . . . . . . . . . . . 8.2.1 Fundamental theorem of calculus . . . . . . . . . . . . 8.2.2 De…nite integral of a linear combination of functions . 8.2.3 Change of variable . . . . . . . . . . . . . . . . . . . . 8.2.4 Integration by parts . . . . . . . . . . . . . . . . . . . 8.2.5 Exchanging the bounds of integration . . . . . . . . . 8.2.6 Subdividing the integral . . . . . . . . . . . . . . . . . 8.2.7 Leibniz integral rule . . . . . . . . . . . . . . . . . . . 8.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

45 45 46 46 47 47 47 48 48 48 49 50 51 51 51 52 52

9 Special functions 9.1 Gamma function . . . . . . . . . . . . . 9.1.1 De…nition . . . . . . . . . . . . . 9.1.2 Recursion . . . . . . . . . . . . . 9.1.3 Relation to the factorial function 9.1.4 Values of the Gamma function .

. . . . .

. . . . .

. . . . .

. . . . .

55 55 55 56 56 57

. . . . .

. . . . .

. . . . .

. . . . .

. . . . . . . . .

. . . . .

. . . . .

. . . . . . . . .

. . . . .

. . . . .

. . . . . . . . .

. . . . .

. . . . . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

CONTENTS

9.2

9.3

II

v

9.1.5 Lower incomplete Gamma Beta function . . . . . . . . . . . 9.2.1 De…nition . . . . . . . . . 9.2.2 Integral representations . Solved exercises . . . . . . . . . .

function . . . . . . . . . . . . . . . . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

Fundamentals of probability

67

10 Probability 10.1 Sample space, sample points and events . . . . 10.2 Probability . . . . . . . . . . . . . . . . . . . . 10.3 Properties of probability . . . . . . . . . . . . . 10.3.1 Probability of the empty set . . . . . . . 10.3.2 Additivity and sigma-additivity . . . . . 10.3.3 Probability of the complement . . . . . 10.3.4 Probability of a union . . . . . . . . . . 10.3.5 Monotonicity of probability . . . . . . . 10.4 Interpretations of probability . . . . . . . . . . 10.4.1 Classical interpretation of probability . 10.4.2 Frequentist interpretation of probability 10.4.3 Subjectivist interpretation of probability 10.5 More rigorous de…nitions . . . . . . . . . . . . . 10.5.1 A more rigorous de…nition of event . . . 10.5.2 A more rigorous de…nition of probability 10.6 Solved exercises . . . . . . . . . . . . . . . . . . 11 Zero-probability events 11.1 De…nition and discussion . . . . 11.2 Almost sure and almost surely 11.3 Almost sure events . . . . . . . 11.4 Solved exercises . . . . . . . . .

58 59 59 59 61

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

69 69 70 71 71 72 72 73 73 74 74 74 74 75 75 76 76

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

79 79 80 81 82

12 Conditional probability 12.1 Introduction . . . . . . . . . . . . . . . . 12.2 The case of equally likely sample points 12.3 A more general approach . . . . . . . . 12.4 Tackling division by zero . . . . . . . . . 12.5 More details . . . . . . . . . . . . . . . . 12.5.1 The law of total probability . . . 12.6 Solved exercises . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

85 85 85 87 90 90 90 91

. . . .

. . . .

. . . .

. . . .

13 Bayes’rule 95 13.1 Statement of Bayes’rule . . . . . . . . . . . . . . . . . . . . . . . . . 95 13.2 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 13.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

vi 14 Independent events 14.1 De…nition of independent event . . . . . . 14.2 Mutually independent events . . . . . . . 14.3 Zero-probability events and independence 14.4 Solved exercises . . . . . . . . . . . . . . .

CONTENTS

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

99 99 100 101 101

15 Random variables 15.1 De…nition of random variable . . . . . . . . . . . . . . . 15.2 Discrete random variables . . . . . . . . . . . . . . . . . 15.3 Absolutely continuous random variables . . . . . . . . . 15.4 Random variables in general . . . . . . . . . . . . . . . . 15.5 More details . . . . . . . . . . . . . . . . . . . . . . . . . 15.5.1 Derivative of the distribution function . . . . . . 15.5.2 Continuous variables and zero-probability events 15.5.3 A more rigorous de…nition of random variable . . 15.6 Solved exercises . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

105 105 106 107 108 109 109 109 109 109

16 Random vectors 16.1 De…nition of random vector . . . . . . . . . . . . . . . . . . 16.2 Discrete random vectors . . . . . . . . . . . . . . . . . . . . 16.3 Absolutely continuous random vectors . . . . . . . . . . . . 16.4 Random vectors in general . . . . . . . . . . . . . . . . . . . 16.5 More details . . . . . . . . . . . . . . . . . . . . . . . . . . . 16.5.1 Random matrices . . . . . . . . . . . . . . . . . . . . 16.5.2 Marginal distribution of a random vector . . . . . . 16.5.3 Marginalization of a joint distribution . . . . . . . . 16.5.4 Marginal distribution of a discrete random vector . . 16.5.5 Marginalization of a discrete distribution . . . . . . 16.5.6 Marginal distribution of a continuous random vector 16.5.7 Marginalization of a continuous distribution . . . . . 16.5.8 Partial derivative of the distribution function . . . . 16.5.9 A more rigorous de…nition of random vector . . . . . 16.6 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

115 115 116 117 118 119 119 119 120 120 120 120 121 121 121 121

17 Expected value 17.1 De…nition of expected value . . . . . . . . . . 17.2 Discrete random variables . . . . . . . . . . . 17.3 Continuous random variables . . . . . . . . . 17.4 The Riemann-Stieltjes integral . . . . . . . . 17.4.1 Intuition . . . . . . . . . . . . . . . . . 17.4.2 Some rules . . . . . . . . . . . . . . . 17.5 The Lebesgue integral . . . . . . . . . . . . . 17.6 More details . . . . . . . . . . . . . . . . . . . 17.6.1 The transformation theorem . . . . . . 17.6.2 Linearity of the expected value . . . . 17.6.3 Expected value of random vectors . . 17.6.4 Expected value of random matrices . . 17.6.5 Integrability . . . . . . . . . . . . . . . 17.6.6 Lp spaces . . . . . . . . . . . . . . . . 17.6.7 Other properties of the expected value

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

127 127 128 129 130 131 132 133 134 134 134 136 136 136 136 136

. . . .

. . . .

. . . .

. . . . . . . . . . . . . . .

. . . .

. . . . . . . . . . . . . . .

. . . .

. . . . . . . . . . . . . . .

. . . .

. . . . . . . . . . . . . . .

. . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

CONTENTS

vii

17.7 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 18 Expected value and the Lebesgue integral 141 18.1 Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 18.2 Linearity of the Lebesgue integral . . . . . . . . . . . . . . . . . . . . 143 18.3 A more rigorous de…nition . . . . . . . . . . . . . . . . . . . . . . . . 144 19 Properties of the expected value 19.1 Linearity of the expected value . . . . . . . . . . . . . . . 19.1.1 Scalar multiplication of a random variable . . . . . 19.1.2 Sums of random variables . . . . . . . . . . . . . . 19.1.3 Linear combinations of random variables . . . . . . 19.1.4 Addition of a constant and a random matrix . . . 19.1.5 Multiplication of a constant and a random matrix 19.2 Other properties . . . . . . . . . . . . . . . . . . . . . . . 19.2.1 Expectation of a positive random variable . . . . . 19.2.2 Preservation of almost sure inequalities . . . . . . 19.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

147 . 147 . 147 . 147 . 148 . 148 . 149 . 150 . 150 . 150 . 151

20 Variance 20.1 De…nition of variance . . . . . . . . . . . 20.2 Interpretation of variance . . . . . . . . 20.3 Computation of variance . . . . . . . . . 20.4 Variance formula . . . . . . . . . . . . . 20.5 Example . . . . . . . . . . . . . . . . . . 20.6 More details . . . . . . . . . . . . . . . . 20.6.1 Variance and standard deviation 20.6.2 Addition to a constant . . . . . . 20.6.3 Multiplication by a constant . . 20.6.4 Linear transformations . . . . . . 20.6.5 Square integrability . . . . . . . 20.7 Solved exercises . . . . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

155 155 155 155 156 156 157 157 157 158 158 159 159

21 Covariance 21.1 De…nition of covariance . . . . . . . . . . . . . . . . 21.2 Interpretation of covariance . . . . . . . . . . . . . . 21.3 Covariance formula . . . . . . . . . . . . . . . . . . . 21.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . 21.5 More details . . . . . . . . . . . . . . . . . . . . . . . 21.5.1 Covariance of a random variable with itself . 21.5.2 Symmetry . . . . . . . . . . . . . . . . . . . . 21.5.3 Bilinearity . . . . . . . . . . . . . . . . . . . . 21.5.4 Variance of the sum of two random variables 21.5.5 Variance of the sum of n random variables . . 21.6 Solved exercises . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

163 163 163 164 164 166 166 166 166 167 168 169

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

viii

CONTENTS

22 Linear correlation 22.1 De…nition of linear correlation coe¢ cient . . . . . . 22.2 Interpretation . . . . . . . . . . . . . . . . . . . . . 22.3 Terminology . . . . . . . . . . . . . . . . . . . . . . 22.4 Example . . . . . . . . . . . . . . . . . . . . . . . . 22.5 More details . . . . . . . . . . . . . . . . . . . . . . 22.5.1 Correlation of a random variable with itself 22.5.2 Symmetry . . . . . . . . . . . . . . . . . . . 22.6 Solved exercises . . . . . . . . . . . . . . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

177 177 177 178 178 180 180 180 181

23 Covariance matrix 23.1 De…nition . . . . . . . . . . . . . . . . . . . . . . . 23.2 Structure of the covariance matrix . . . . . . . . . 23.3 Covariance matrix formula . . . . . . . . . . . . . . 23.4 More details . . . . . . . . . . . . . . . . . . . . . . 23.4.1 Addition to a constant vector . . . . . . . . 23.4.2 Multiplication by a constant matrix . . . . 23.4.3 Linear transformations . . . . . . . . . . . . 23.4.4 Symmetry . . . . . . . . . . . . . . . . . . . 23.4.5 Semi-positive de…niteness . . . . . . . . . . 23.4.6 Covariance between linear transformations . 23.4.7 Cross-covariance . . . . . . . . . . . . . . . 23.5 Solved exercises . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

189 189 189 190 190 191 191 191 192 192 192 193 193

24 Indicator function 24.1 De…nition . . . . . . . . . . . . . . . . . . . 24.2 Properties of the indicator function . . . . . 24.2.1 Powers . . . . . . . . . . . . . . . . . 24.2.2 Expected value . . . . . . . . . . . . 24.2.3 Variance . . . . . . . . . . . . . . . . 24.2.4 Intersections . . . . . . . . . . . . . 24.2.5 Indicators of zero-probability events 24.3 Solved exercises . . . . . . . . . . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

197 197 198 198 198 198 198 199 199

25 Conditional probability as a random variable 25.1 Partitions of events . . . . . . . . . . . . . . . . . . 25.2 Probabilities conditional on a partition . . . . . . . 25.3 The fundamental property . . . . . . . . . . . . . . 25.4 The fundamental property as a de…nition . . . . . 25.5 More details . . . . . . . . . . . . . . . . . . . . . . 25.5.1 Conditioning with respect to sigma-algebras 25.5.2 Regular conditional probabilities . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

201 202 203 204 205 206 206 207

26 Conditional probability distributions 26.1 Conditional probability mass function . . . . . . . 26.2 Conditional probability density function . . . . . . 26.3 Conditional distribution function . . . . . . . . . . 26.4 More details . . . . . . . . . . . . . . . . . . . . . . 26.4.1 Conditional distribution of a random vector 26.4.2 Joint equals marginal times conditional . . 26.5 Solved exercises . . . . . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

209 210 213 215 216 216 216 216

. . . . . . . .

. . . . . . . .

. . . . . . . .

CONTENTS

ix

27 Conditional expectation 27.1 De…nition . . . . . . . . . . . . . . . . . . . 27.2 Discrete random variables . . . . . . . . . . 27.3 Absolutely continuous random variables . . 27.4 Conditional expectation in general . . . . . 27.5 More details . . . . . . . . . . . . . . . . . . 27.5.1 Properties of conditional expectation 27.5.2 Law of iterated expectations . . . . 27.6 Solved exercises . . . . . . . . . . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

221 221 221 223 224 225 225 225 226

28 Independent random variables 28.1 De…nition . . . . . . . . . . . . . . . . . . . . . 28.2 Independence criterion . . . . . . . . . . . . . . 28.3 Independence between discrete variables . . . . 28.4 Independence between continuous variables . . 28.5 More details . . . . . . . . . . . . . . . . . . . . 28.5.1 Mutually independent random variables 28.5.2 Mutual independence via expectations . 28.5.3 Independence and zero covariance . . . 28.5.4 Independent random vectors . . . . . . 28.5.5 Mutually independent random vectors . 28.6 Solved exercises . . . . . . . . . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

229 229 229 231 232 233 233 234 234 235 235 236

III

. . . . . . . .

29 Probabilistic inequalities 29.1 Markov’s inequality . . 29.2 Chebyshev’s inequality 29.3 Jensens’s inequality . . 29.4 Solved exercises . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

239 . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

30 Legitimate probability mass functions 30.1 Properties of probability mass functions . . . . . . . . . . . . . . . 30.2 Identi…cation of a legitimate pmf . . . . . . . . . . . . . . . . . . . 30.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . .

241 241 242 243 244

247 . 247 . 248 . 249

31 Legitimate probability density functions 251 31.1 Properties of probability density functions . . . . . . . . . . . . . . . 251 31.2 Identi…cation of a legitimate pdf . . . . . . . . . . . . . . . . . . . . 252 31.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253 32 Factorization of joint probability mass functions 257 32.1 The factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257 32.2 A factorization method . . . . . . . . . . . . . . . . . . . . . . . . . . 257 33 Factorization of joint probability density functions 261 33.1 The factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261 33.2 A factorization method . . . . . . . . . . . . . . . . . . . . . . . . . . 261

x

CONTENTS

34 Functions of random variables and their distribution 34.1 Strictly increasing functions . . . . . . . . . . . . . . . . . . 34.1.1 Strictly increasing functions of a discrete variable . . 34.1.2 Strictly increasing functions of a continuous variable 34.2 Strictly decreasing functions . . . . . . . . . . . . . . . . . . 34.2.1 Strictly decreasing functions of a discrete variable . 34.2.2 Strictly decreasing functions of a continuous variable 34.3 Invertible functions . . . . . . . . . . . . . . . . . . . . . . . 34.3.1 One-to-one functions of a discrete variable . . . . . . 34.3.2 One-to-one functions of a continuous variable . . . . 34.4 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

265 265 267 268 269 270 271 272 273 273 274

35 Functions of random vectors and their distribution 35.1 One-to-one functions . . . . . . . . . . . . . . . . . . 35.1.1 One-to-one function of a discrete vector . . . 35.1.2 One-to-one function of a continuous vector . 35.2 Independent sums . . . . . . . . . . . . . . . . . . . 35.3 Known moment generating function . . . . . . . . . 35.4 Known characteristic function . . . . . . . . . . . . . 35.5 Solved exercises . . . . . . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

277 . 277 . 277 . 278 . 280 . 281 . 281 . 281

36 Moments and cross-moments 36.1 Moments . . . . . . . . . . . . . . . . . . 36.1.1 De…nition of moment . . . . . . . . 36.1.2 De…nition of central moment . . . 36.2 Cross-moments . . . . . . . . . . . . . . . 36.2.1 De…nition of cross-moment . . . . 36.2.2 De…nition of central cross-moment

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

285 285 285 285 285 285 287

37 Moment generating function of a random 37.1 De…nition . . . . . . . . . . . . . . . . . . 37.2 Moments and mgfs . . . . . . . . . . . . . 37.3 Distributions and mgfs . . . . . . . . . . . 37.4 More details . . . . . . . . . . . . . . . . . 37.4.1 Mgf of a linear transformation . . 37.4.2 Mgf of a sum . . . . . . . . . . . . 37.5 Solved exercises . . . . . . . . . . . . . . .

variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

289 289 290 291 293 293 293 294

. . . . . . . .

297 . 297 . 299 . 300 . 301 . 301 . 302 . 302 . 303

. . . . . .

. . . . . .

. . . . . .

. . . . . .

38 Moment generating function of a random vector 38.1 De…nition . . . . . . . . . . . . . . . . . . . . . . . . . 38.2 Cross-moments and joint mgfs . . . . . . . . . . . . . . 38.3 Joint distributions and joint mgfs . . . . . . . . . . . . 38.4 More details . . . . . . . . . . . . . . . . . . . . . . . . 38.4.1 Joint mgf of a linear transformation . . . . . . 38.4.2 Joint mgf of a vector with independent entries 38.4.3 Joint mgf of a sum . . . . . . . . . . . . . . . . 38.5 Solved exercises . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

CONTENTS

xi

39 Characteristic function of a random variable 39.1 De…nition . . . . . . . . . . . . . . . . . . . . . . . 39.2 Moments and cfs . . . . . . . . . . . . . . . . . . . 39.3 Distributions and cfs . . . . . . . . . . . . . . . . . 39.4 More details . . . . . . . . . . . . . . . . . . . . . . 39.4.1 Cf of a linear transformation . . . . . . . . 39.4.2 Cf of a sum . . . . . . . . . . . . . . . . . . 39.4.3 Computation of the characteristic function 39.5 Solved exercises . . . . . . . . . . . . . . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

307 307 308 309 310 310 310 311 312

40 Characteristic function of a random vector 40.1 De…nition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40.2 Cross-moments and joint cfs . . . . . . . . . . . . . . . . . . 40.3 Joint distributions and joint cfs . . . . . . . . . . . . . . . . 40.4 More details . . . . . . . . . . . . . . . . . . . . . . . . . . . 40.4.1 Joint cf of a linear transformation . . . . . . . . . . 40.4.2 Joint cf of a random vector with independent entries 40.4.3 Joint cf of a sum . . . . . . . . . . . . . . . . . . . . 40.5 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

315 315 315 317 317 317 318 318 319

41 Sums of independent random variables 41.1 Distribution function of a sum . . . . . . . . . 41.2 Probability mass function of a sum . . . . . . . 41.3 Probability density function of a sum . . . . . . 41.4 More details . . . . . . . . . . . . . . . . . . . . 41.4.1 Sum of n independent random variables 41.5 Solved exercises . . . . . . . . . . . . . . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

323 323 325 327 329 329 329

IV

. . . . . .

. . . . . .

. . . . . . . .

. . . . . .

. . . . . . . .

. . . . . .

. . . . . . . .

. . . . . .

. . . . . . . .

. . . . . .

. . . . . .

Probability distributions

42 Bernoulli distribution 42.1 De…nition . . . . . . . . . . . . 42.2 Expected value . . . . . . . . . 42.3 Variance . . . . . . . . . . . . . 42.4 Moment generating function . . 42.5 Characteristic function . . . . . 42.6 Distribution function . . . . . . 42.7 More details . . . . . . . . . . . 42.7.1 Relation to the binomial 42.8 Solved exercises . . . . . . . . .

333 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . distribution . . . . . . .

43 Binomial distribution 43.1 De…nition . . . . . . . . . . . . . . . . 43.2 Relation to the Bernoulli distribution . 43.3 Expected value . . . . . . . . . . . . . 43.4 Variance . . . . . . . . . . . . . . . . . 43.5 Moment generating function . . . . . . 43.6 Characteristic function . . . . . . . . . 43.7 Distribution function . . . . . . . . . . 43.8 Solved exercises . . . . . . . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

335 335 336 336 336 337 337 337 337 337

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

341 341 342 344 344 345 346 346 347

xii

CONTENTS

44 Poisson distribution 44.1 De…nition . . . . . . . . . . . . . . . . . 44.2 Relation to the exponential distribution 44.3 Expected value . . . . . . . . . . . . . . 44.4 Variance . . . . . . . . . . . . . . . . . . 44.5 Moment generating function . . . . . . . 44.6 Characteristic function . . . . . . . . . . 44.7 Distribution function . . . . . . . . . . . 44.8 Solved exercises . . . . . . . . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

349 350 350 352 353 354 355 355 356

45 Uniform distribution 45.1 De…nition . . . . . . . . . . 45.2 Expected value . . . . . . . 45.3 Variance . . . . . . . . . . . 45.4 Moment generating function 45.5 Characteristic function . . . 45.6 Distribution function . . . . 45.7 Solved exercises . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

359 359 359 360 361 361 362 362

46 Exponential distribution 46.1 De…nition . . . . . . . . . . . . . . . . . . . . 46.2 The rate parameter and its interpretation . . 46.3 Expected value . . . . . . . . . . . . . . . . . 46.4 Variance . . . . . . . . . . . . . . . . . . . . . 46.5 Moment generating function . . . . . . . . . . 46.6 Characteristic function . . . . . . . . . . . . . 46.7 Distribution function . . . . . . . . . . . . . . 46.8 More details . . . . . . . . . . . . . . . . . . . 46.8.1 Memoryless property . . . . . . . . . . 46.8.2 Sums of exponential random variables 46.9 Solved exercises . . . . . . . . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

365 365 366 368 368 369 369 371 371 371 372 372

47 Normal distribution 47.1 The standard normal distribution . . . . . . . . . . . 47.1.1 De…nition . . . . . . . . . . . . . . . . . . . . 47.1.2 Expected value . . . . . . . . . . . . . . . . . 47.1.3 Variance . . . . . . . . . . . . . . . . . . . . . 47.1.4 Moment generating function . . . . . . . . . . 47.1.5 Characteristic function . . . . . . . . . . . . . 47.1.6 Distribution function . . . . . . . . . . . . . . 47.2 The normal distribution in general . . . . . . . . . . 47.2.1 De…nition . . . . . . . . . . . . . . . . . . . . 47.2.2 Relation to the standard normal distribution 47.2.3 Expected value . . . . . . . . . . . . . . . . . 47.2.4 Variance . . . . . . . . . . . . . . . . . . . . . 47.2.5 Moment generating function . . . . . . . . . . 47.2.6 Characteristic function . . . . . . . . . . . . . 47.2.7 Distribution function . . . . . . . . . . . . . . 47.3 More details . . . . . . . . . . . . . . . . . . . . . . . 47.3.1 Multivariate normal distribution . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

375 376 376 377 377 378 379 380 381 381 382 382 382 383 383 384 384 384

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

CONTENTS

xiii

47.3.2 Linear combinations of normal random variables . . . . . . . 384 47.3.3 Quadratic forms involving normal random variables . . . . . 385 47.4 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385 48 Chi-square distribution 48.1 De…nition . . . . . . . . . . . . . . . . . . . . . . . . . . . 48.2 Expected value . . . . . . . . . . . . . . . . . . . . . . . . 48.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48.4 Moment generating function . . . . . . . . . . . . . . . . . 48.5 Characteristic function . . . . . . . . . . . . . . . . . . . . 48.6 Distribution function . . . . . . . . . . . . . . . . . . . . . 48.7 More details . . . . . . . . . . . . . . . . . . . . . . . . . . 48.7.1 Sums of independent Chi-square random variables 48.7.2 Relation to the standard normal distribution . . . 48.7.3 Relation to the standard normal distribution (2) . 48.8 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

387 387 388 388 389 390 391 392 392 393 395 395

49 Gamma distribution 49.1 De…nition . . . . . . . . . . . . . . . . . . . . . 49.2 Expected value . . . . . . . . . . . . . . . . . . 49.3 Variance . . . . . . . . . . . . . . . . . . . . . . 49.4 Moment generating function . . . . . . . . . . . 49.5 Characteristic function . . . . . . . . . . . . . . 49.6 Distribution function . . . . . . . . . . . . . . . 49.7 More details . . . . . . . . . . . . . . . . . . . . 49.7.1 Relation to the Chi-square distribution . 49.7.2 Multiplication by a constant . . . . . . 49.7.3 Relation to the normal distribution . . . 49.8 Solved exercises . . . . . . . . . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

397 397 398 398 399 400 402 402 402 403 404 404

50 Student’s t distribution 50.1 The standard Student’s t distribution . . . . . . . . . . 50.1.1 De…nition . . . . . . . . . . . . . . . . . . . . . . 50.1.2 Relation to the normal and Gamma distributions 50.1.3 Expected value . . . . . . . . . . . . . . . . . . . 50.1.4 Variance . . . . . . . . . . . . . . . . . . . . . . . 50.1.5 Higher moments . . . . . . . . . . . . . . . . . . 50.1.6 Moment generating function . . . . . . . . . . . . 50.1.7 Characteristic function . . . . . . . . . . . . . . . 50.1.8 Distribution function . . . . . . . . . . . . . . . . 50.2 The Student’s t distribution in general . . . . . . . . . . 50.2.1 De…nition . . . . . . . . . . . . . . . . . . . . . . 50.2.2 Relation to the standard Student’s t distribution 50.2.3 Expected value . . . . . . . . . . . . . . . . . . . 50.2.4 Variance . . . . . . . . . . . . . . . . . . . . . . . 50.2.5 Moment generating function . . . . . . . . . . . . 50.2.6 Characteristic function . . . . . . . . . . . . . . . 50.2.7 Distribution function . . . . . . . . . . . . . . . . 50.3 More details . . . . . . . . . . . . . . . . . . . . . . . . . 50.3.1 Convergence to the normal distribution . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . .

407 407 408 408 410 411 413 414 414 414 415 415 415 416 416 416 417 417 417 417

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

xiv

CONTENTS 50.3.2 Non-central t distribution . . . . . . . . . . . . . . . . . . . . 418 50.4 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418

51 F distribution 51.1 De…nition . . . . . . . . . . . . . . . . 51.2 Relation to the Gamma distribution . 51.3 Relation to the Chi-square distribution 51.4 Expected value . . . . . . . . . . . . . 51.5 Variance . . . . . . . . . . . . . . . . . 51.6 Higher moments . . . . . . . . . . . . 51.7 Moment generating function . . . . . . 51.8 Characteristic function . . . . . . . . . 51.9 Distribution function . . . . . . . . . . 51.10Solved exercises . . . . . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

52 Multinomial distribution 52.1 The special case of one experiment . . . . 52.1.1 De…nition . . . . . . . . . . . . . . 52.1.2 Expected value . . . . . . . . . . . 52.1.3 Covariance matrix . . . . . . . . . 52.1.4 Joint moment generating function 52.1.5 Joint characteristic function . . . . 52.2 Multinomial distribution in general . . . . 52.2.1 De…nition . . . . . . . . . . . . . . 52.2.2 Representation as a sum of simpler 52.2.3 Expected value . . . . . . . . . . . 52.2.4 Covariance matrix . . . . . . . . . 52.2.5 Joint moment generating function 52.2.6 Joint characteristic function . . . . 52.3 Solved exercises . . . . . . . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

421 421 422 424 425 426 427 428 428 428 429

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . multinomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

431 431 431 432 432 433 433 434 434 434 435 435 436 436 437

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

439 439 439 440 441 441 442 442 443 443 444 445 445 445 446 446 446 446 447

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

53 Multivariate normal distribution 53.1 The standard MV-N distribution . . . . . . . . . . . 53.1.1 De…nition . . . . . . . . . . . . . . . . . . . . 53.1.2 Relation to the univariate normal distribution 53.1.3 Expected value . . . . . . . . . . . . . . . . . 53.1.4 Covariance matrix . . . . . . . . . . . . . . . 53.1.5 Joint moment generating function . . . . . . 53.1.6 Joint characteristic function . . . . . . . . . . 53.2 The MV-N distribution in general . . . . . . . . . . 53.2.1 De…nition . . . . . . . . . . . . . . . . . . . . 53.2.2 Relation to the standard MV-N distribution . 53.2.3 Expected value . . . . . . . . . . . . . . . . . 53.2.4 Covariance matrix . . . . . . . . . . . . . . . 53.2.5 Joint moment generating function . . . . . . 53.2.6 Joint characteristic function . . . . . . . . . . 53.3 More details . . . . . . . . . . . . . . . . . . . . . . . 53.3.1 The univariate normal as a special case . . . 53.3.2 Mutual independence and joint normality . . 53.3.3 Linear combinations and transformations . .

. . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

CONTENTS

xv

53.3.4 Quadratic forms . . . . . . . . . . . . . . . . . . . . . . . . . 447 53.3.5 Partitioned vectors . . . . . . . . . . . . . . . . . . . . . . . . 447 53.4 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447 54 Multivariate Student’s t distribution 54.1 The standard MV Student’s t distribution . . . . . . . . . . . 54.1.1 De…nition . . . . . . . . . . . . . . . . . . . . . . . . . 54.1.2 Relation to the univariate Student’s t distribution . . 54.1.3 Relation to the Gamma and MV normal distributions 54.1.4 Marginals . . . . . . . . . . . . . . . . . . . . . . . . . 54.1.5 Expected value . . . . . . . . . . . . . . . . . . . . . . 54.1.6 Covariance matrix . . . . . . . . . . . . . . . . . . . . 54.2 The MV Student’s t distribution in general . . . . . . . . . . 54.2.1 De…nition . . . . . . . . . . . . . . . . . . . . . . . . . 54.2.2 Relation to the standard MV Student’s t distribution 54.2.3 Expected value . . . . . . . . . . . . . . . . . . . . . . 54.2.4 Covariance matrix . . . . . . . . . . . . . . . . . . . . 54.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

451 451 451 452 452 454 455 455 457 457 457 458 459 459

55 Wishart distribution 55.1 De…nition . . . . . . . . . . . . . . . . . 55.2 Relation to the MV normal distribution 55.3 Expected value . . . . . . . . . . . . . . 55.4 Covariance matrix . . . . . . . . . . . . 55.5 Review of some facts in matrix algebra . 55.5.1 Outer products . . . . . . . . . . 55.5.2 Symmetric matrices . . . . . . . 55.5.3 Positive de…nite matrices . . . . 55.5.4 Trace of a matrix . . . . . . . . . 55.5.5 Vectorization of a matrix . . . . 55.5.6 Kronecker product . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

461 461 462 462 463 465 465 465 465 466 466 466

V

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

56 Linear combinations of normals 56.1 Linear transformation of a MV-N vector . . . . . . . 56.1.1 Sum of two independent variables . . . . . . 56.1.2 Sum of more than two independent variables 56.1.3 Linear combinations of independent variables 56.1.4 Linear transformation of a variable . . . . . . 56.1.5 Linear combinations of independent vectors . 56.2 Solved exercises . . . . . . . . . . . . . . . . . . . . .

467 . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

57 Partitioned multivariate normal vectors 57.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57.2 Normality of the sub-vectors . . . . . . . . . . . . . . . . . . . . . . 57.3 Independence of the sub-vectors . . . . . . . . . . . . . . . . . . . .

. . . . . . .

469 469 470 471 471 472 473 473

477 . 477 . 478 . 478

xvi

CONTENTS

58 Quadratic forms in normal vectors 58.1 Review of relevant facts in matrix algebra 58.1.1 Orthogonal matrices . . . . . . . . 58.1.2 Symmetric matrices . . . . . . . . 58.1.3 Idempotent matrices . . . . . . . . 58.1.4 Symmetric idempotent matrices . . 58.1.5 Trace of a matrix . . . . . . . . . . 58.2 Quadratic forms in normal vectors . . . . 58.3 Independence of quadratic forms . . . . . 58.4 Examples . . . . . . . . . . . . . . . . . .

VI

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

Asymptotic theory

. . . . . . . . .

481 481 481 482 482 482 483 483 484 485

489

59 Sequences of random variables 59.1 Terminology . . . . . . . . . . . . . . . . 59.1.1 Realization of a sequence . . . . 59.1.2 Sequences on a sample space . . 59.1.3 Independent sequences . . . . . . 59.1.4 Identically distributed sequences 59.1.5 IID sequences . . . . . . . . . . . 59.1.6 Stationary sequences . . . . . . . 59.1.7 Weakly stationary sequences . . 59.1.8 Mixing sequences . . . . . . . . . 59.1.9 Ergodic sequences . . . . . . . . 59.2 Limit of a sequence of random variables

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

491 491 492 492 492 492 492 492 493 494 494 495

60 Sequences of random vectors 60.1 Terminology . . . . . . . . . . . . . . . . 60.1.1 Realization of a sequence . . . . 60.1.2 Sequences on a sample space . . 60.1.3 Independent sequences . . . . . . 60.1.4 Identically distributed sequences 60.1.5 IID sequences . . . . . . . . . . . 60.1.6 Stationary sequences . . . . . . . 60.1.7 Weakly stationary sequences . . 60.1.8 Mixing sequences . . . . . . . . . 60.1.9 Ergodic sequences . . . . . . . . 60.2 Limit of a sequence of random vectors .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

497 497 497 497 497 498 498 498 499 499 499 500

61 Pointwise convergence 501 61.1 Sequences of random variables . . . . . . . . . . . . . . . . . . . . . 501 61.2 Sequences of random vectors . . . . . . . . . . . . . . . . . . . . . . 502 61.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503 62 Almost sure convergence 505 62.1 Sequences of random variables . . . . . . . . . . . . . . . . . . . . . 505 62.2 Sequences of random vectors . . . . . . . . . . . . . . . . . . . . . . 507 62.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507

CONTENTS

xvii

63 Convergence in probability 511 63.1 Sequences of random variables . . . . . . . . . . . . . . . . . . . . . 511 63.2 Sequences of random vectors . . . . . . . . . . . . . . . . . . . . . . 513 63.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514 64 Mean-square convergence 519 64.1 Sequences of random variables . . . . . . . . . . . . . . . . . . . . . 519 64.2 Sequences of random vectors . . . . . . . . . . . . . . . . . . . . . . 521 64.3 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522 65 Convergence in distribution 65.1 Sequences of random variables . . . 65.2 Sequences of random vectors . . . . 65.3 More details . . . . . . . . . . . . . . 65.3.1 Proper distribution functions 65.4 Solved exercises . . . . . . . . . . . .

. . . . .

. . . . .

. . . . .

66 Relations between modes of convergence 66.1 Almost sure ) Probability . . . . . . . . 66.2 Probability ) Distribution . . . . . . . . 66.3 Almost sure ) Distribution . . . . . . . . 66.4 Mean square ) Probability . . . . . . . . 66.5 Mean square ) Distribution . . . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

527 527 529 529 529 530

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

533 . 533 . 533 . 534 . 534 . 534

67 Laws of Large Numbers 67.1 Weak Laws of Large Numbers . . . . . . . . . . . . . 67.1.1 Chebyshev’s WLLN . . . . . . . . . . . . . . 67.1.2 Chebyshev’s WLLN for correlated sequences 67.2 Strong Laws of Large numbers . . . . . . . . . . . . 67.2.1 Kolmogorov’s SLLN . . . . . . . . . . . . . . 67.2.2 Ergodic theorem . . . . . . . . . . . . . . . . 67.3 Laws of Large numbers for random vectors . . . . . 67.4 Solved exercises . . . . . . . . . . . . . . . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

535 535 535 537 540 540 541 541 542

68 Central Limit Theorems 68.1 Examples of Central Limit Theorems . 68.1.1 Lindeberg-Lévy CLT . . . . . . 68.1.2 A CLT for correlated sequences 68.2 Multivariate generalizations . . . . . . 68.3 Solved exercises . . . . . . . . . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

545 546 546 548 549 550

69 Convergence of transformations 69.1 Continuous mapping theorem . . . . . . . . . . . . . . . . 69.1.1 Convergence in probability of sums and products . 69.1.2 Almost sure convergence of sums and products . . 69.1.3 Convergence in distribution of sums and products 69.2 Slutski’s Theorem . . . . . . . . . . . . . . . . . . . . . . 69.3 More details . . . . . . . . . . . . . . . . . . . . . . . . . . 69.3.1 Convergence of ratios . . . . . . . . . . . . . . . . 69.3.2 Random matrices . . . . . . . . . . . . . . . . . . . 69.4 Solved exercises . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

555 555 555 556 556 557 557 557 558 558

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

xviii

VII

CONTENTS

Fundamentals of statistics

70 Statistical inference 70.1 Samples . . . . . . . . . . 70.2 Statistical models . . . . . 70.2.1 Parametric models 70.3 Statistical inferences . . . 70.4 Decision theory . . . . . .

561 . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

563 563 565 565 566 567

71 Point estimation 71.1 Estimate and estimator . . . . . . . 71.2 Estimation error, loss and risk . . . . 71.3 Other criteria to evaluate estimators 71.3.1 Unbiasedness . . . . . . . . . 71.3.2 Consistency . . . . . . . . . . 71.4 Examples . . . . . . . . . . . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

569 569 569 571 571 571 572

72 Point estimation of the mean 72.1 Normal IID samples . . . . . . . . . . . 72.1.1 The sample . . . . . . . . . . . . 72.1.2 The estimator . . . . . . . . . . . 72.1.3 Expected value of the estimator . 72.1.4 Variance of the estimator . . . . 72.1.5 Distribution of the estimator . . 72.1.6 Risk of the estimator . . . . . . . 72.1.7 Consistency of the estimator . . 72.2 IID samples . . . . . . . . . . . . . . . . 72.2.1 The sample . . . . . . . . . . . . 72.2.2 The estimator . . . . . . . . . . . 72.2.3 Expected value of the estimator . 72.2.4 Variance of the estimator . . . . 72.2.5 Distribution of the estimator . . 72.2.6 Risk of the estimator . . . . . . . 72.2.7 Consistency of the estimator . . 72.2.8 Asymptotic normality . . . . . . 72.3 Solved exercises . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . .

573 573 573 573 573 574 574 575 575 575 575 576 576 576 576 576 577 577 577

73 Point estimation of the variance 73.1 Normal IID samples - Known mean . . . 73.1.1 The sample . . . . . . . . . . . . 73.1.2 The estimator . . . . . . . . . . . 73.1.3 Expected value of the estimator . 73.1.4 Variance of the estimator . . . . 73.1.5 Distribution of the estimator . . 73.1.6 Risk of the estimator . . . . . . . 73.1.7 Consistency of the estimator . . 73.2 Normal IID samples - Unknown mean . 73.2.1 The sample . . . . . . . . . . . . 73.2.2 The estimator . . . . . . . . . . . 73.2.3 Expected value of the estimator . 73.2.4 Variance of the estimator . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

579 579 579 579 580 580 581 581 582 582 582 582 583 585

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

CONTENTS 73.2.5 73.2.6 73.2.7 73.3 Solved

xix Distribution of the estimator Risk of the estimator . . . . . Consistency of the estimator exercises . . . . . . . . . . . .

. . . .

. . . .

. . . .

. . . .

74 Set estimation 74.1 Con…dence set . . . . . . . . . . . . . . . . 74.2 Coverage probability - con…dence coe¢ cient 74.3 Size of a con…dence set . . . . . . . . . . . . 74.4 Other criteria to evaluate set estimators . . 74.5 Examples . . . . . . . . . . . . . . . . . . . 75 Set estimation of the mean 75.1 Normal IID samples - Known variance . 75.1.1 The sample . . . . . . . . . . . . 75.1.2 The interval estimator . . . . . . 75.1.3 Coverage probability . . . . . . . 75.1.4 Con…dence coe¢ cient . . . . . . 75.1.5 Size . . . . . . . . . . . . . . . . 75.1.6 Expected size . . . . . . . . . . . 75.2 Normal IID samples - Unknown variance 75.2.1 The sample . . . . . . . . . . . . 75.2.2 The interval estimator . . . . . . 75.2.3 Coverage probability . . . . . . . 75.2.4 Con…dence coe¢ cient . . . . . . 75.2.5 Size . . . . . . . . . . . . . . . . 75.2.6 Expected size . . . . . . . . . . . 75.3 Solved exercises . . . . . . . . . . . . . . 76 Set estimation of the variance 76.1 Normal IID samples - Known mean . . 76.1.1 The sample . . . . . . . . . . . 76.1.2 The interval estimator . . . . . 76.1.3 Coverage probability . . . . . . 76.1.4 Con…dence coe¢ cient . . . . . 76.1.5 Size . . . . . . . . . . . . . . . 76.1.6 Expected size . . . . . . . . . . 76.2 Normal IID samples - Unknown mean 76.2.1 The sample . . . . . . . . . . . 76.2.2 The interval estimator . . . . . 76.2.3 Coverage probability . . . . . . 76.2.4 Con…dence coe¢ cient . . . . . 76.2.5 Size . . . . . . . . . . . . . . . 76.2.6 Expected size . . . . . . . . . . 76.3 Solved exercises . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . .

. . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . .

. . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . .

. . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . .

. . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . .

. . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . .

. . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . .

. . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . .

. . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . .

. . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . .

. . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . .

. . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . .

. . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . .

. . . .

585 587 588 589

. . . . .

591 . 591 . 592 . 592 . 593 . 593

. . . . . . . . . . . . . . .

595 . 595 . 595 . 595 . 596 . 596 . 597 . 597 . 597 . 597 . 597 . 598 . 601 . 601 . 602 . 603

. . . . . . . . . . . . . . .

607 . 607 . 607 . 607 . 608 . 608 . 609 . 609 . 609 . 609 . 610 . 610 . 611 . 611 . 611 . 611

xx 77 Hypothesis testing 77.1 Null hypothesis . . . . . 77.2 Alternative hypothesis . 77.3 Types of errors . . . . . 77.4 Critical region . . . . . . 77.5 Test statistic . . . . . . 77.6 Power function . . . . . 77.7 Size of a test . . . . . . 77.8 Criteria to evaluate tests 77.9 Examples . . . . . . . .

CONTENTS

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

615 616 616 616 616 617 617 617 618 618

78 Hypothesis tests about the mean 78.1 Normal IID samples - Known variance . 78.1.1 The sample . . . . . . . . . . . . 78.1.2 The null hypothesis . . . . . . . 78.1.3 The alternative hypothesis . . . . 78.1.4 The test statistic . . . . . . . . . 78.1.5 The critical region . . . . . . . . 78.1.6 The power function . . . . . . . 78.1.7 The size of the test . . . . . . . . 78.2 Normal IID samples - Unknown variance 78.2.1 The sample . . . . . . . . . . . . 78.2.2 The null hypothesis . . . . . . . 78.2.3 The alternative hypothesis . . . . 78.2.4 The test statistic . . . . . . . . . 78.2.5 The critical region . . . . . . . . 78.2.6 The power function . . . . . . . 78.2.7 The size of the test . . . . . . . . 78.3 Solved exercises . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

619 619 619 619 620 620 620 620 621 621 621 622 622 622 622 623 625 626

79 Hypothesis tests about the variance 79.1 Normal IID samples - Known mean . . 79.1.1 The sample . . . . . . . . . . . 79.1.2 The null hypothesis . . . . . . 79.1.3 The alternative hypothesis . . . 79.1.4 The test statistic . . . . . . . . 79.1.5 The critical region . . . . . . . 79.1.6 The power function . . . . . . 79.1.7 The size of the test . . . . . . . 79.2 Normal IID samples - Unknown mean 79.2.1 The sample . . . . . . . . . . . 79.2.2 The null hypothesis . . . . . . 79.2.3 The alternative hypothesis . . . 79.2.4 The test statistic . . . . . . . . 79.2.5 The critical region . . . . . . . 79.2.6 The power function . . . . . . 79.2.7 The size of the test . . . . . . . 79.3 Solved exercises . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . .

629 629 629 629 630 630 630 630 631 631 631 631 632 632 632 632 633 633

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . . . . . . . . . .

Preface This book is a collection of lectures on probability theory and mathematical statistics that I have been publishing on the website StatLect.com since 2010. Visitors to the website have been constantly increasing and, while I have received positive feedback from many of them, some have suggested that the lectures would be easier to study if they were collected and printed as a traditional paper textbook. I followed their suggestion and started to work on this book. The painful editing work needed to convert the webpages to book chapters took a signi…cant portion of my spare time for almost one year. I hope the result is vaguely satisfactory, but I am sure that not all typos and mistakes have been eliminated. For this, I humbly ask the forgiveness of my readers. There were two main reasons why I started writing these lectures. First of all, I thought it was di¢ cult to …nd a thorough yet accessible treatment of the basics of probability theory and mathematical statistics. While there are many excellent textbooks on these subjects, the easier ones often do not touch on many important topics, while the more complete ones frequently require a level of mathematical sophistication not possessed by most people. In these lectures I tried to give an accessible introduction to topics that are not usually found in elementary books. Secondly, I tried to collect in these lectures results and proofs (especially on probability distributions) that are hard to …nd in standard references and are scattered here and there in more specialistic books. I hope this will help my readers save some precious time in their study of probability and statistics. The plan of the book is as follows: Part 1 is a review of some elementary mathematical tools that are needed to understand the lectures; Part 2 introduces the fundamentals of probability theory; Part 3 presents additional topics in probability theory; Part 4 deals with special probability distributions; Part 5 contains more details about the normal distribution; Part 6 discusses the basics of asymptotic theory (sequences of random variables and their convergence); Part 7 is an introduction to mathematical statistics.

Preface to second edition Besides some minor editing, the main changes I introduced in the second edition are as follows: I have expanded the lectures on the Gamma and Beta functions, and on the multinomial distribution; I have entirely rewritten the lecture on the binomial distribution; I have added solved exercises to several chapters that did not have any. xxi

xxii

Dedication This book is dedicated to Emanuela and Anna.

PREFACE

Part I

Mathematical tools

1

Chapter 1

Set theory This lecture introduces the basics of set theory.

1.1

Sets

A set is a collection of objects. Sets are usually denoted by a letter and the objects (or elements) belonging to a set are usually listed within curly brackets. Example 1 Denote by the letter S the set of the natural numbers less than or equal to 5. Then, we can write S = f1; 2; 3; 4; 5g Example 2 Denote by the letter A the set of the …rst …ve letters of the alphabet. Then, we can write A = fa; b; c; d; eg Note that a set is an unordered collection of objects, i.e., the order in which the elements of a set are listed does not matter. Example 3 The two sets fa; b; c; d; eg and fb; d; a; c; eg are considered identical. Sometimes a set is de…ned in terms of one or more properties satis…ed by its elements. For example, the set S = f1; 2; 3; 4; 5g could be equivalently de…ned as S = fn 2 N : n

5g

which reads as follows: "S is the set of all natural numbers n such that n is less than or equal to 5", where the colon symbol (:) means "such that" and precedes a list of conditions that the elements of the set need to satisfy. 3

4

CHAPTER 1. SET THEORY

Example 4 The set

n o n S= n2N: 2N 4 is the set of all natural numbers n such that n divided by 4 is also a natural number, i.e., S = f4; 8; 12; : : :g

1.2

Set membership

When an element a belongs to a set A, we write a2A which reads "a belongs to A" or "a is a member of A". On the contrary, when an element a does not belong to a set A, we write a2 =A which reads "a does not belong to A" or "a is not a member of A". Example 5 Let the set S be de…ned as follows: A = f2; 4; 6; 8; 10g Then, for example, 42A and 72 =A

1.3

Set inclusion

If A and B are two sets, and if every element of A also belongs to B, then we write A

B

B

A

which reads "A is included in B", or

and we read "B includes A". We also say that A is a subset of B. Example 6 The set A = f2; 3g is included in the set B = f1; 2; 3; 4g because all the elements of A also belong to B. Thus, we can write A

B

1.4. UNION

5

When A B but A is not the same as B, i.e., there are elements of B that do not belong to A, then we write A B which reads "A is strictly included in B", or B

A

We also say that A is a proper subset of B. Example 7 Given the sets A = f2; 3g B = f1; 2; 3; 4g C = f2; 3g we have that A A

B C

but we cannot write A

1.4

C

Union

The union of two sets A and B is the set of all elements that belong to at least one of them, and it is denoted by A[B Example 8 De…ne two sets A and B as follows: A = fa; b; c; dg B = fc; d; e; f g Their union is A [ B = fa; b; c; d; e; f g If A1 , A2 , . . . , An are n sets, their union is the set of all elements that belong to at least one of them, and it is denoted by n [

i=1

Ai = A1 [ A2 [ : : : [ An

Example 9 De…ne three sets A1 , A2 and A3 as follows: A1 A2 A3 Their union is

3 [

i=1

= fa; b; c; dg = fc; d; e; f g = fc; f; gg

Ai = A1 [ A2 [ A3 = fa; b; c; d; e; f; gg

6

1.5

CHAPTER 1. SET THEORY

Intersection

The intersection of two sets A and B is the set of all elements that belong to both of them, and it is denoted by A\B Example 10 De…ne two sets A and B as follows: A = fa; b; c; dg B = fc; d; e; f g Their intersection is A \ B = fc; dg If A1 , A2 , . . . , An are n sets, their intersection is the set of all elements that belong to all of them, and it is denoted by n \

i=1

Ai = A1 \ A2 \ : : : \ An

Example 11 De…ne three sets A1 , A2 and A3 as follows: A1 A2 A3 Their intersection is

3 \

i=1

1.6

= fa; b; c; dg = fc; d; e; f g = fc; f; gg

Ai = A1 \ A2 \ A3 = fcg

Complement

Suppose that our attention is con…ned to sets that are all included in a larger set , called universal set. Let A be one of these sets. The complement of A is the set of all elements of that do not belong to A and it is indicated by Ac Example 12 De…ne the universal set

as follows:

= fa; b; c; d; e; f; g; hg and the two sets A = fb; c; dg B = fc; d; eg The complements of A and B are Ac Bc

= fa; e; f; g; hg = fa; b; f; g; hg

1.7. DE MORGAN’S LAWS

1.7

7

De Morgan’s Laws

De Morgan’s Laws are c

= Ac \ B c = Ac [ B c

(A [ B) c (A \ B)

and can be extended to collections of more than two sets as follows: !c n n [ \ Ai = Aci i=1 n \

Ai

i=1

1.8

i=1

!c

=

n [

Aci

i=1

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 De…ne the following sets A1 A2 A3 A4

= = = =

fa; b; cg fb; c; d; e; f g fb; f g fa; b; dg

List all the elements belonging to the set A=

4 [

Ai

i=2

Solution The union can be written as A = A2 [ A3 [ A4 The union of the three sets A2 , A3 and A4 is the set of all elements that belong to at least one of them: A = A2 [ A3 [ A4 = fa; b; c; d; e; f g

Exercise 2 Given the sets de…ned in the previous exercise, list all the elements belonging to the set 4 \ A= Ai i=1

8

CHAPTER 1. SET THEORY

Solution The intersection can be written as A = A1 \ A2 \ A3 \ A4 The intersection of the four sets A1 , A2 , A3 and A4 is the set of elements that are members of all the four sets: A = A1 \ A2 \ A3 \ A4 = fbg

Exercise 3 Suppose that A and B are two subsets of a universal set Ac Bc

= fa; b; cg = fb; c; dg

List all the elements belonging to the set c

(A [ B) Solution Using De Morgan’s laws, we obtain c

(A [ B)

= Ac \ B c = fa; b; cg \ fb; c; dg = fb; cg

, and that

Chapter 2

Permutations This lecture introduces permutations, one of the most important concepts in combinatorial analysis. We …rst deal with permutations without repetition, also called simple permutations, and then with permutations with repetition.

2.1

Permutations without repetition

A permutation without repetition of n objects is one of the possible ways of ordering the n objects. A permutation without repetition is also simply called a permutation. The following subsections give a slightly more formal de…nition of permutation and deal with the problem of counting the number of possible permutations of n objects.

2.1.1

De…nition of permutation without repetition

Let a1 , a2 ,. . . , an be n objects. Let s1 , s2 , . . . , sn be n slots to which the n objects can be assigned. A permutation (or permutation without repetition, or simple permutation) of a1 , a2 ,. . . , an is one of the possible ways to …ll each of the n slots with one and only one of the n objects, with the proviso that each object can be assigned to only one slot. Example 13 Consider three objects a1 , a2 and a3 . There are three slots, s1 , s2 and s3 , to which we can assign the three objects. There are six possible permutations of the three objects, that is, six possible ways to …ll the three slots with the three objects: Slots s1 s2 s3 Permutation 1 a1 a2 a3 Permutation 2 a1 a3 a2 Permutation 3 a2 a1 a3 Permutation 4 a2 a3 a1 Permutation 5 a3 a2 a1 Permutation 6 a3 a1 a2 9

10

CHAPTER 2. PERMUTATIONS

2.1.2

Number of permutations without repetition

Denote by Pn the number of possible permutations of n objects. How much is Pn in general? In other words, how do we count the number of possible permutations of n objects? We can derive a general formula for Pn by using a sequential argument: 1. First, we assign an object to the …rst slot. There are n objects that can be assigned to the …rst slot, so there are n possible ways to …ll the …rst slot 2. Then, we assign an object to the second slot. There were n objects, but one has already been assigned to a slot. So, we are left with n 1 objects that can be assigned to the second slot. Thus, there are n

1 possible ways to …ll the second slot

and n (n

1) possible ways to …ll the …rst two slots

3. Then, we assign an object to the third slot. There were n objects, but two have already been assigned to a slot. So, we are left with n 2 objects that can be assigned to the third slot. Thus, there are n

2 possible ways to …ll the third slot

and n (n

1) (n

2) possible ways to …ll the …rst three slots

4. An so on, until only one object and one free slot remain. 5. Finally, when only one free slot remains, we assign the remaining object to it. There is only one way to do this. Thus, there is 1 possible way to …ll the last slot and n (n

1) (n

2) : : : 2 1 possible ways to …ll all the n slots

Therefore, by the above sequential argument, the total number of possible permutations of n objects is Pn = n (n

1) (n

2) : : : 2 1

The number Pn is usually indicated as follows: Pn = n! where n! is read "n factorial", with the convention that 0! = 1 Example 14 The number of possible permutations of 5 objects is P5 = 5! = 5 4 3 2 1 = 120

2.2. PERMUTATIONS WITH REPETITION

2.2

11

Permutations with repetition

A permutation with repetition of n objects is one of the possible ways of selecting another set of n objects from the original one. The selection rules are: 1. each object can be selected more than once; 2. the order of selection matters (the same n objects selected in di¤erent orders are regarded as di¤erent permutations). Thus, the di¤erence between simple permutations and permutations with repetition is that objects can be selected only once in the former, while they can be selected more than once in the latter. The following subsections give a slightly more formal de…nition of permutation with repetition and deal with the problem of counting the number of possible permutations with repetition.

2.2.1

De…nition of permutation with repetition

Let a1 , a2 ,. . . , an be n objects. Let s1 , s2 , . . . , sn be n slots to which the n objects can be assigned. A permutation with repetition of a1 , a2 ,. . . , an is one of the possible ways to …ll each of the n slots with one and only one of the n objects, with the proviso that an object can be assigned to more than one slot. Example 15 Consider two objects, a1 and a2 . There are two slots to …ll, s1 and s2 . There are four possible permutations with repetition of the two objects, that is, four possible ways to assign an object to each slot, being allowed to assign the same object to more than one slot: Slots Permutation Permutation Permutation Permutation

2.2.2

1 2 3 4

s1 a1 a1 a2 a2

s2 a1 a2 a1 a2

Number of permutations with repetition

Denote by Pn0 the number of possible permutations with repetition of n objects. How much is Pn0 in general? In other words, how do we count the number of possible permutations with repetition of n objects? We can derive a general formula for Pn0 by using a sequential argument: 1. First, we assign an object to the …rst slot. There are n objects that can be assigned to the …rst slot, so there are n possible ways to …ll the …rst slot 2. Then, we assign an object to the second slot. Even if one object has been assigned to a slot in the previous step, we can still choose among n objects, because we are allowed to choose an object more than once. So, there are n objects that can be assigned to the second slot and n possible ways to …ll the second slot

12

CHAPTER 2. PERMUTATIONS and n n possible ways to …ll the …rst two slots 3. Then, we assign an object to the third slot. Even if two objects have been assigned to a slot in the previous two steps, we can still choose among n objects, because we are allowed to choose an object more than once. So, there are n objects that can be assigned to the third slot and n possible ways to …ll the third slot and n n n possible ways to …ll the …rst three slots 4. An so on, until we are left with only one free slot (the n-th). 5. When only one free slot remains, we assign one of the n objects to it. Thus, there are n possible ways to …ll the last slot and |n n {z: : : n} possible ways to …ll the n available slots n tim es

Therefore, by the above sequential argument, the total number of possible permutations with repetition of n objects is Pn0 = nn Example 16 The number of possible permutations with repetition of 3 objects is P30 = 33 = 27

2.3

Solved exercises

This exercise set contains some solved exercises on permutations.

Exercise 1 There are 5 seats around a table and 5 people to be seated at the table. In how many di¤erent ways can they seat themselves? Solution Sitting 5 people at the table is a sequential problem. We need to assign a person to the …rst chair. There are 5 possible ways to do this. Then we need to assign a person to the second chair. There are 4 possible ways to do this, because one person has already been assigned. An so on, until there remain one free chair and one person to be seated. Therefore, the number of ways to seat the 5 people at the table is equal to the number of permutations of 5 objects (without repetition). If we denote it by P5 , then P5 = 5! = 5 4 3 2 1 = 120

2.3. SOLVED EXERCISES

13

Exercise 2 Bob, John, Luke and Tim play a tennis tournament. The rules of the tournament are such that at the end of the tournament a ranking will be made and there will be no ties. How many di¤erent rankings can there be? Solution Ranking 4 people is a sequential problem. We need to assign a person to the …rst place. There are 4 possible ways to do this. Then we need to assign a person to the second place. There are 3 possible ways to do this, because one person has already been assigned. An so on, until there remains one person to be assigned. Therefore, the number of ways to rank the 4 people participating in the tournament is equal to the number of permutations of 4 objects (without repetition). If we denote it by P4 , then P4 = 4! = 4 3 2 1 = 24

Exercise 3 A byte is a number consisting of 8 digits that can be equal either to 0 or to 1. How many di¤erent bytes are there? Solution To answer this question we need to follow a line of reasoning similar to the one we followed when we derived the number of permutations with repetition. There are 2 possible ways to choose the …rst digit and 2 possible ways to choose the second digit. So, there are 4 possible ways to choose the …rst two digits. There are 2 possible ways two choose the third digit and 4 possible ways to choose the …rst two. Thus, there are 8 possible ways to choose the …rst three digits. An so on, until we have chosen all digits. Therefore, the number of ways to choose the 8 digits is equal to : : 2} = 28 = 256 |2 :{z 8 tim es

14

CHAPTER 2. PERMUTATIONS

Chapter 3

k-permutations This lecture introduces the concept of k-permutation, which is a slight generalization of the concept of permutation1 . We …rst deal with k-permutations without repetition and then with kpermutations with repetition.

3.1

k-permutations without repetition

A k-permutation without repetition of n objects is a way of selecting k objects from a list of n. The selection rules are: 1. the order of selection matters (the same k objects selected in di¤erent orders are regarded as di¤erent k-permutations); 2. each object can be selected only once. A k-permutation without repetition is also simply called a k-permutation. The following subsections give a slightly more formal de…nition of kpermutation and deal with the problem of counting the number of possible kpermutations.

3.1.1

De…nition of k-permutation without repetition

Let a1 , a2 ,. . . , an be n objects. Let s1 , s2 , . . . , sk be k (k n) slots to which k of the n objects can be assigned. A k-permutation (or k-permutation without repetition or simple k-permutation) of n objects from a1 , a2 ,. . . , an is one of the possible ways to choose k of the n objects and …ll each of the k slots with one and only one object. Each object can be chosen only once. Example 17 Consider three objects, a1 , a2 and a3 . There are two slots, s1 and s2 , to which we can assign two of the three objects. There are six possible 2permutations of the three objects, that is, six possible ways to choose two objects 1 See

p. 9.

15

16

CHAPTER 3. K-PERMUTATIONS

and …ll the two slots with the two objects: Slots 2-permutation 2-permutation 2-permutation 2-permutation 2-permutation 2-permutation

3.1.2

1 2 3 4 5 6

s1 a1 a1 a2 a2 a3 a3

s2 a2 a3 a1 a3 a1 a2

Number of k-permutations without repetition

Denote by Pn;k the number of possible k-permutations of n objects. How much is Pn;k in general? In other words, how do we count the number of possible kpermutations of n objects? We can derive a general formula for Pn;k by using a sequential argument: 1. First, we assign an object to the …rst slot. There are n objects that can be assigned to the …rst slot, so there are n possible ways to …ll the …rst slot 2. Then, we assign an object to the second slot. There were n objects, but one has already been assigned to a slot. So, we are left with n 1 objects that can be assigned to the second slot. Thus, there are n

1 possible ways to …ll the second slot

and n (n

1) possible ways to …ll the …rst two slots

3. Then, we assign an object to the third slot. There were n objects, but two have already been assigned to a slot. So, we are left with n 2 objects that can be assigned to the third slot. Thus, there are n

2 possible ways to …ll the third slot

and n (n

1) (n

2) possible ways to …ll the …rst three slots

4. An so on, until we are left with n k-th).

k + 1 objects and only one free slot (the

5. Finally, when only one free slot remains, we assign one of the remaining n k + 1 objects to it. Thus, there are n

k + 1 possible ways to …ll the last slot

and n (n

1) (n

2) : : : (n

k + 1) possible ways to …ll the k available slots

3.2. K-PERMUTATIONS WITH REPETITION

17

Therefore, by the above sequential argument, the total number of possible k-permutations of n objects is Pn;k = n (n

1) (n

2) : : : (n

k + 1)

Pn;k can be written as Pn;k =

n (n

1) (n

2) : : : (n k + 1) (n k) (n (n k) (n k 1) : : : 2 1

k

1) : : : 2 1

Remembering the de…nition of factorial2 , we can see that the numerator of the above ratio is n! while the denominator is (n k)!, so the number of possible k-permutations of n objects is Pn;k =

n! (n

k)!

The number Pn;k is usually indicated as follows: Pn;k = nk Example 18 The number of possible 3-permutations of 5 objects is P5;3 =

3.2

5! 5 4 3 2 1 = = 5 4 3 = 60 2! 2 1

k-permutations with repetition

A k-permutation with repetition of n objects is a way of selecting k objects from a list of n. The selection rules are: 1. the order of selection matters (the same k objects selected in di¤erent orders are regarded as di¤erent k-permutations); 2. each object can be selected more than once. Thus, the di¤erence between k-permutations without repetition and kpermutations with repetition is that objects can be selected more than once in the latter, while they can be selected only once in the former. The following subsections give a slightly more formal de…nition of kpermutation with repetition and deal with the problem of counting the number of possible k-permutations with repetition.

3.2.1

De…nition of k-permutation with repetition

Let a1 , a2 ,. . . , an be n objects. Let s1 , s2 , . . . , sk be k (k n) slots to which k of the n objects can be assigned. A k-permutation with repetition of n objects from a1 , a2 ,. . . , an is one of the possible ways to choose k of the n objects and …ll each of the k slots with one and only one object. Each object can be chosen more than once. 2 See

p. 10.

18

CHAPTER 3. K-PERMUTATIONS

Example 19 Consider three objects a1 , a2 and a3 and two slots, s1 and s2 . There are nine possible 2-permutations with repetition of the three objects, that is, nine possible ways to choose two objects and …ll the two slots with the two objects, being allowed to pick the same object more than once: Slots 2-permutation 2-permutation 2-permutation 2-permutation 2-permutation 2-permutation 2-permutation 2-permutation 2-permutation

3.2.2

1 2 3 4 5 6 7 8 9

s1 a1 a1 a1 a2 a2 a2 a3 a3 a3

s2 a1 a2 a3 a1 a2 a3 a1 a2 a3

Number of k-permutations with repetition

0 Denote by Pn;k the number of possible k-permutations with repetition of n objects. 0 How much is Pn;k in general? In other words, how do we count the number of possible k-permutations with repetition of n objects? 0 We can derive a general formula for Pn;k by using a sequential argument:

1. First, we assign an object to the …rst slot. There are n objects that can be assigned to the …rst slot, so there are n possible ways to …ll the …rst slot 2. Then, we assign an object to the second slot. Even if one object has been assigned to a slot in the previous step, we can still choose among n objects, because we are allowed to choose an object more than once. So, there are n objects that can be assigned to the second slot and n possible ways to …ll the second slot and n n possible ways to …ll the …rst two slots 3. Then, we assign an object to the third slot. Even if two objects have been assigned to a slot in the previous two steps, we can still choose among n objects, because we are allowed to choose an object more than once. So, there are n objects that can be assigned to the second slot and n possible ways to …ll the third slot and n n n possible ways to …ll the …rst three slots 4. An so on, until we are left with only one free slot (the k-th). 5. When only one free slot remains, we assign one of the n objects to it. Thus, there are n possible ways to …ll the last slot

3.3. SOLVED EXERCISES

19

and |n n {z: : : n} possible ways to …ll the k available slots k tim es

Therefore, by the above sequential argument, the total number of possible k-permutations with repetition of n objects is 0 Pn;k = nk

Example 20 The number of possible 2-permutations of 4 objects is 0 = 42 = 16 P4;2

3.3

Solved exercises

This exercise set contains some solved exercises on k-permutations.

Exercise 1 There is a basket of fruit containing an apple, a banana and an orange and there are …ve girls who want to eat one fruit. How many ways are there to give three of the …ve girls one fruit each and leave two of them without a fruit to eat? Solution Giving the three fruits to three of the …ve girls is a sequential problem. We …rst give the apple to one of the girls. There are 5 possible ways to do this. Then we give the banana to one of the remaining girls. There are 4 possible ways to do this, because one girl has already been given a fruit. Finally, we give the orange to one of the remaining girls. There are 3 possible ways to do this, because two girls have already been given a fruit. Summing up, the number of ways to assign the three fruits is equal to the number of 3-permutations of 5 objects (without repetition). If we denote it by P5;3 , then P5;3

= =

5! 5 4 3 2 1 = (5 3)! 2 1 5 4 3 = 60

Exercise 2 An hexadecimal number is a number whose digits can take sixteen di¤erent values: either one of the ten numbers from 0 to 9, or one of the six letters from A to F. How many di¤erent 8-digit hexadecimal numbers are there, if an hexadecimal number is allowed to begin with any number of zeros? Solution Choosing the 8 digits of the hexadecimal number is a sequential problem. There are 16 possible ways to choose the …rst digit and 16 possible ways to choose the second digit. So, there are 16 16 possible ways to choose the …rst two digits. There are 16 possible ways two choose the third digit and 16 16 possible ways to

20

CHAPTER 3. K-PERMUTATIONS

choose the …rst two. Thus, there are 16 16 16 possible ways to choose the …rst three digits. An so on, until we have chosen all digits. Therefore, the number of ways to choose the 8 digits is equal to the number of 8-permutations with repetition of 16 objects: 0 P16;8 = 168

Exercise 3 An urn contains ten balls, each representing one of the ten numbers from 0 to 9. Three balls are drawn at random from the urn and the corresponding numbers are written down to form a 3-digit number, writing down the digits from left to right in the order in which they have been extracted. When a ball is drawn from the urn it is set aside, so that it cannot be extracted again. If one were to write down all the 3-digit numbers that could possibly be formed, how many would they be? Solution The 3 balls are drawn sequentially. At the …rst draw there are 10 balls, hence 10 possible values for the …rst digit of our 3-digit number. At the second draw there are 9 balls left, hence 9 possible values for the second digit of our 3-digit number. At the third and last draw there are 8 balls left, hence 8 possible values for the third digit of our 3-digit number. Summing up, the number of possible 3-digit numbers is equal to the number of 3-permutations of 10 objects (without repetition). If we denote it by P10;3 , then P10;3

= =

10 9 : : : 2 1 10! = (10 3)! 7 6 ::: 2 1 10 9 8 = 720

Chapter 4

Combinations This lecture introduces combinations, one of the most important concepts in combinatorial analysis. Before reading this lecture, you should be familiar with the concept of permutation1 . We …rst deal with combinations without repetition and then with combinations with repetition.

4.1

Combinations without repetition

A combination without repetition of k objects from n is a way of selecting k objects from a list of n. The selection rules are: 1. the order of selection does not matter (the same objects selected in di¤erent orders are regarded as the same combination); 2. each object can be selected only once. A combination without repetition is also called a simple combination or, simply, a combination. The following subsections give a slightly more formal de…nition of combination and deal with the problem of counting the number of possible combinations.

4.1.1

De…nition of combination without repetition

Let a1 , a2 ,. . . , an be n objects. A simple combination (or combination without repetition) of k objects from the n objects is one of the possible ways to form a set containing k of the n objects. To form a valid set, any object can be chosen only once. Furthermore, the order in which the objects are chosen does not matter. Example 21 Consider three objects, a1 , a2 and a3 . There are three possible combinations of two objects from a1 , a2 and a3 , that is, three possible ways to choose two objects from this set of three: Combination 1 Combination 2 Combination 3 1 See

the lecture entitled Permutations (p. 9).

21

a1 and a2 a1 and a3 a2 and a3

22

CHAPTER 4. COMBINATIONS

Other combinations are not possible, because, for example, fa2 ; a1 g is the same as fa1 ; a2 g.

4.1.2

Number of combinations without repetition

Denote by Cn;k the number of possible combinations of k objects from n. How much is Cn;k in general? In other words, how do we count the number of possible combinations of k objects from n? To answer this question, we need to recall the concepts of permutation and k-permutation introduced in previous lectures2 . Like a combination, a k-permutation of n objects is one of the possible ways of choosing k of the n objects. However, in a k-permutation the order of selection matters: two k-permutations are regarded as di¤erent if the same k objects are chosen, but they are chosen in a di¤erent order. On the contrary, in the case of combinations, the order in which the k objects are chosen does not matter: two combinations that contain the same objects are regarded as equal. Despite this di¤erence between k-permutations and combinations, it is very easy to derive the number of possible combinations (Cn;k ) from the number of possible k-permutations (Pn;k ). Consider a combination of k objects from n. This combination will be repeated many times in the set of all possible k-permutations. It will be repeated one time for each possible way of ordering the k objects. So, it will be repeated Pk = k! times3 . Therefore, if each combination is repeated Pk times in the set of all possible k-permutations, dividing the total number of k-permutations (Pn;k ) by Pk , we obtain the number of possible combinations: Cn;k =

n! Pn;k = Pk (n k)!k!

The number of possible combinations is often denoted by Cn;k = and

n k

n k

is called binomial coe¢ cient.

Example 22 The number of possible combinations of 3 objects from 5 is C5;3 =

4.2

5! 5 4 3 2 1 5 4 = = = 10 2!3! (2 1) (3 2 1) 2 1

Combinations with repetition

A combination with repetition of k objects from n is a way of selecting k objects from a list of n. The selection rules are: 1. the order of selection does not matter (the same objects selected in di¤erent orders are regarded as the same combination); 2. each object can be selected more than once. 2 See

the lectures entitled Permutations (p. 9) and k-permutations (p. 15). is the number of all possible ways to order the k objects - the number of permutations of k objects. 3P k

4.2. COMBINATIONS WITH REPETITION

23

Thus, the di¤erence between simple combinations and combinations with repetition is that objects can be selected only once in the former, while they can be selected more than once in the latter. The following subsections give a slightly more formal de…nition of combination with repetition and deal with the problem of counting the number of possible combinations with repetition.

4.2.1

De…nition of combination with repetition

A more rigorous de…nition of combination with repetition involves the concept of multiset, which is a generalization of the notion of set4 . Roughly speaking, the di¤erence between a multiset and a set is the following: the same object is allowed to appear more than once in the list of members of a multiset, while the same object is allowed to appear only once in the list of members of an ordinary set. Thus, for example, the collection of objects fa; b; c; ag is a valid multiset, but not a valid set, because the letter a appears more than once. Like sets, multisets are unordered collections of objects, i.e. the order in which the elements of a multiset are listed does not matter. Let a1 , a2 ,. . . , an be n objects. A combination with repetition of k objects from the n objects is one of the possible ways to form a multiset containing k objects taken from the set fa1 ; a2 ; : : : ; an g. Example 23 Consider three objects, a1 , a2 and a3 . There are six possible combinations with repetition of two objects from a1 , a2 and a3 , that is, six possible ways to choose two objects from this set of three, allowing for repetitions: Combination Combination Combination Combination Combination Combination

1 2 3 4 5 6

a1 and a2 a1 and a3 a2 and a3 a1 and a1 a2 and a2 a3 and a3

Other combinations are not possible, because, for example, fa2 ; a1 g is the same as fa1 ; a2 g.

4.2.2

Number of combinations with repetition

0 Denote by Cn;k the number of possible combinations with repetition of k objects 0 from n. How much is Cn;k in general? In other words, how do we count the number of possible combinations with repetition of k objects from n? To answer this question, we need to use a slightly unusual procedure, which is introduced by the next example.

Example 24 We need to order two scoops of ice cream, choosing among four ‡avours: chocolate, pistachio, strawberry and vanilla. It is possible to order two scoops of the same ‡avour. How many di¤ erent combinations can we order? The 4 See

the lecture entitled Set theory (p. 3).

24

CHAPTER 4. COMBINATIONS

number of di¤ erent combinations we can order is equal to the number of possible combinations with repetition of 2 objects from 4. Let us represent an order as a string of crosses ( ) and vertical bars (j), where a vertical bar delimits two adjacent ‡avours and a cross denotes a scoop of a given ‡avour. For example, jj

jjj

jj

j jjj

1 1 2 j 2

chocolate, 1 vanilla strawberry, 1 vanilla chocolate strawberry

where the …rst vertical bar (the leftmost one) delimits chocolate and pistachio, the second one delimits pistachio and strawberry and the third one delimits strawberry and vanilla. Each string contains three vertical bars, one less than the number of ‡avours, and two crosses, one for each scoop. Therefore, each string contains a total of …ve symbols. Making an order is equivalent to choosing which two of the …ve symbols will be a cross (the remaining will be vertical bars). So, to make an order, we need to choose 2 objects from 5. The number of possible ways to choose 2 objects from 5 is equal to the number of possible combinations without repetition5 of 2 objects from 5. Therefore, there are 5 2

=

(5

5! = 10 2)!2!

di¤ erent orders we can make. In general, choosing k objects from n with repetition is equivalent to writing a string with n + k 1 symbols, of which n 1 are vertical bars (j) and k are crosses ( ). In turn, this is equivalent to choose the k positions in the string (among the available n + k 1) that will contain a cross (the remaining ones will contain vertical bars). But choosing k positions from n + k 1 is like choosing a combination without repetition of k objects from n+k 1. Therefore, the number of possible combinations with repetition is 0 Cn;k

n+k 1 k (n + k 1)! (n + k 1)! = (n + k 1 k)!k! (n 1)!k!

= Cn+k =

1;k

=

The number of possible combinations with repetition is often denoted by 0 Cn;k =

and

n k

n k

is called a multiset coe¢ cient.

Example 25 The number of possible combinations with repetition of 3 objects from 5 is (5 + 3 1)! 7! 0 C5;3 = = (5 1)!3! 4!3! 7 6 5 4 3 2 1 = (4 3 2 1) (3 2 1) 7 6 5 = = 7 5 = 35 3 2 1 5 See

p. 21.

4.3. MORE DETAILS

4.3 4.3.1

25

More details Binomial coe¢ cients and binomial expansions

The binomial coe¢ cient is so called because it appears in the binomial expansion: n X n k n k n (a + b) = a b k k=0

where n 2 N.

4.3.2

Recursive formula for binomial coe¢ cients

The following is a useful recursive formula for computing binomial coe¢ cients: n+1 k

=

n n + k k 1

Proof. It is proved as follows:

= = =

4.4

n n n! + = + k! (n k)! k k 1 n! + (k 1)! (n k)!k (k 1)! (n n! (n + 1 k + k) (k 1)! (n k)!k (n + 1 k) n! (n + 1) (n + 1)! = k! (n + 1 k)! k! (n + 1 k)!

n! (k 1)! (n + 1 k)! n! k)! (n + 1 k)

=

n+1 k

Solved exercises

This exercise set contains some solved exercises on combinations.

Exercise 1 3 cards are drawn from a standard deck of 52 cards. How many di¤erent 3-card hands can possibly be drawn? Solution First of all, the order in which the 3 cards are drawn does not matter, that is, the same cards drawn in di¤erent orders are regarded as the same 3-card hand. Furthermore, each card can be drawn only once. Therefore the number of di¤erent 3-card hands that can possibly be drawn is equal to the number of possible combinations without repetition of 3 objects from 52. If we denote it by C52;3 , then C52;3

= =

52 52! 52! = = 3 (52 3)!3! 49!3! 52 51 50 52 51 50 = = 22100 3! 3 2 1

26

CHAPTER 4. COMBINATIONS

Exercise 2 John has got one dollar, with which he can buy green, red and yellow candies. Each candy costs 50 cents. John will spend all the money he has on candies. How many di¤erent combinations of green, red and yellow candies can he buy? Solution First of all, the order in which the 3 di¤erent colors are chosen does not matter. Furthermore, each color can be chosen more than once. Therefore, the number of di¤erent combinations of colored candies John can choose is equal to the number of possible combinations with repetition of 2 objects from 3. If we denote it by 0 , then C3;2 0 C3;2

= =

3 3+2 1 4 = = 2 2 2 4! 4! 4 3 4 3 = = = =6 (4 2)!2! 2!2! 2! 2 1

Exercise 3 The board of directors of a corporation comprises 10 members. An executive board, formed by 4 directors, needs to be elected. How many possible ways are there to form the executive board? Solution First of all, the order in which the 4 directors are selected does not matter. Furthermore, each director can be elected to the executive board only once. Therefore, the number of di¤erent ways to form the executive board is equal to the number of possible combinations without repetition of 4 objects from 10. If we denote it by C10;4 , then C10;4

= =

10 10! 10! = = 4 (10 4)!4! 6!4! 10 9 8 7 10 9 8 7 = = 210 4! 4 3 2 1

Chapter 5

Partitions into groups This lecture introduces partitions into groups. Before reading this lecture, you should read the lectures entitled Permutations (p. 9) and Combinations (p. 21). A partition of n objects into k groups is one of the possible ways of subdividing the n objects into k groups (k n). The rules are: 1. the order in which objects are assigned to a group does not matter; 2. each object can be assigned to only one group. The following subsections give a slightly more formal de…nition of partition into groups and deal with the problem of counting the number of possible partitions into groups.

5.1

De…nition of partition into groups

Let a1 , a2 ,. . . , an be n objects. Let g1 , g2 , . . . , gk be k (with k n) groups to which we can assign the n objects. Moreover, n1 objects need to be assigned to group g1 , n2 objects need to be assigned to group g2 , and so on. The numbers n1 , n2 , . . . , nk are such that n1 + n2 + : : : + nk = n A partition of a1 , a2 ,. . . , an into the k groups g1 , g2 , . . . , gk is one of the possible ways to assign the n objects to the k groups. Example 26 Consider three objects, a1 , a2 and a3 , and two groups, g1 and g2 , with n1 n2

= 2 = 1

There are three possible partitions of the three objects into the two groups: Groups Partition 1 Partition 2 Partition 3

g1 fa1 ; a2 g fa1 ; a3 g fa2 ; a3 g

g2 a3 a2 a1

Note that the order of objects belonging to a group does not matter, so, for example, fa1 ; a2 g in Partition 1 is the same as fa2 ; a1 g. 27

28

CHAPTER 5. PARTITIONS INTO GROUPS

5.2

Number of partitions into groups

Denote by Pn1 ;n2 ;:::;nk the number of possible partitions into the k groups (where group i contains ni objects). How much is Pn1 ;n2 ;:::;nk in general? The number Pn1 ;n2 ;:::;nk can be derived using a sequential argument: 1. First, we assign n1 objects to the …rst group. There is a total of n objects to choose from. The number of possible ways to choose n1 of the n objects is equal to the number of combinations1 of n1 elements from n. So there are n n1

=

n! n1 ! (n n1 )!

possible ways to form the …rst group. 2. Then, we assign n2 objects to the second group. There were n objects, but n1 have already been assigned to the …rst group. So, there are n n1 objects left, that can be assigned to the second group. The number of possible ways to choose n2 of the remaining n n1 objects is equal to the number of combinations of n2 elements from n n1 . So there are n

n1 n2

=

(n n1 )! n2 ! (n n1 n2 )!

possible ways to form the second group and n n1

n

n1 n2

(n n1 )! n! n1 ! (n n1 )! n2 ! (n n1 n2 )! n! n1 !n2 ! (n n1 n2 )!

= =

possible ways to form the …rst two groups. 3. Then, we assign n3 objects to the third group. There were n objects, but n1 + n2 have already been assigned to the …rst two groups. So, there are n n1 n2 objects left, that can be assigned to the third group. The number of possible ways to choose n3 of the remaining n n1 n2 objects is equal to the number of combinations of n3 elements from n n1 n2 . So there are n

n1 n3

n2

=

(n n1 n2 )! n3 ! (n n1 n2 n3 )!

possible ways to form the third group and n n1

= =

n

n1

n

n1 n3

n2 n! (n n1 n2 )! n1 !n2 ! (n n1 n2 )! n3 ! (n n1 n2 n3 )! n! n1 !n2 !n3 ! (n n1 n2 n3 )!

possible ways to form the …rst three groups. 1 See

n2

the lecture entitled Combinations (p. 21).

5.3. MORE DETAILS

29

4. An so on, until we are left with nk objects and the last group. There is only one way to form the last group, which can also be written as: n

n1

n2 : : : nk

nk

1

=

(n n1 n2 : : : nk 1 )! nk ! (n n1 n2 : : : nk )!

As a consequence, there are n n1

=

= = =

n

n1

n

n1 n2 n n1 ::: n2 n3 n! n1 !n2 ! : : : nk 1 ! (n n1 n2 : : : nk 1 )! (n n1 n2 : : : nk 1 )! nk ! (n n1 n2 : : : nk )! n! n1 !n2 ! : : : nk ! (n n1 n2 : : : nk )! n! n1 !n2 ! : : : nk !0! n! n1 !n2 ! : : : nk !

n2 : : : nk

nk

1

possible ways to form all the groups. Therefore, by the above sequential argument, the total number of possible partitions into the k groups is Pn1 ;n2 ;:::;nk =

n! n1 !n2 ! : : : nk !

The number Pn1 ;n2 ;:::;nk is often indicated as follows: Pn1 ;n2 ;:::;nk =

n n1 ; n 2 ; : : : ; n k

and n1 ;n2n;:::;nk is called a multinomial coe¢ cient. Sometimes the following notation is also used: Pn1 ;n2 ;:::;nk = (n1 ; n2 ; : : : ; nk )! Example 27 The number of possible partitions of 4 objects into 2 groups of 2 objects is 4 4! 4 3 2 1 P2;2 = = = =6 2; 2 2!2! (2 1) (2 1)

5.3 5.3.1

More details Multinomial expansions

The multinomial coe¢ cient is so called because it appears in the multinomial expansion: X n n (x1 + x2 + : : : + xk ) = xn1 xn2 : : : xnk k n1 ; n 2 ; : : : ; n k 1 2 where n 2 N and the summation is over all the k-tuples n1 ; n2 ; : : : ; nk such that n1 + n2 + : : : + nk = n

30

5.4

CHAPTER 5. PARTITIONS INTO GROUPS

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 John has a basket of fruit containing one apple, one banana, one orange and one kiwi. He wants to give one fruit to each of his two little sisters and two fruits to his big brother. In how many di¤erent ways can he do this? Solution John needs to decide how to partition 4 objects into 3 groups. The …rst two groups will contain one object and the third one will contain two objects. The total number of partitions is P1;1;2

= =

4 4! = 1!1!2! 1; 1; 2 4 3 2 1 24 = = 12 1 1 2 1 2

Exercise 2 Ten friends want to play basketball. They need to divide into two teams of …ve players. In how many di¤erent ways can they do this? Solution They need to decide how to partition 10 objects into 2 groups. Each group will contain 5 objects. The total number of partitions is P5;5

= = = =

10 10! = 5!5! 5; 5 10 9 8 7 6 5 4 3 2 1 (5 4 3 2 1) (5 4 3 2 1) 9 8 7 6 10 9 8 7 6 = 5 4 3 2 1 4 3 9 2 7 2 = 252

Chapter 6

Sequences and limits This lecture discusses the concepts of sequence and limit of a sequence.

6.1

De…nition of sequence

Let A be a set of objects, for example, real numbers. A sequence of elements of A is a function from the set of natural numbers N to the set A, that is, a correspondence that associates one and only one element of A to each natural number n 2 N. In other words, a sequence of elements of A is an ordered list of elements of A, where the ordering is provided by the natural numbers. A sequence is usually indicated by enclosing a generic element of the sequence in curly brackets: fan g where an is the n-th element of the sequence. Alternative notations are 1

fan gn=1 fa1 ; a2 ; : : : ; an ; : : :g an ; n 2 N a1 ; a2 ; : : : ; an ; : : : Thus, if fan g is a sequence, a1 is its …rst element, a2 is its second element, an is its n-th element, and so on. Example 28 De…ne a sequence fan g by characterizing its n-th element an as follows: 1 an = n fan g is a sequence of rational numbers. The elements of the sequence are a1 = 1, a2 = 21 , a3 = 13 , a4 = 14 , and so on. Example 29 De…ne a sequence fan g by characterizing its n-th element an as follows: 1 if n is even an = 0 if n is odd fan g is a sequence of 0 and 1. The elements of the sequence are a1 = 0, a2 = 1, a3 = 0, a4 = 1, and so on. 31

32

CHAPTER 6. SEQUENCES AND LIMITS

Example 30 De…ne a sequence fan g by characterizing its n-th element an as follows: 1 1 an = ; n+1 n fan g is a sequence of closed subintervals of the interval [0; 1]. The elements of the sequence are a1 = 12 ; 1 , a2 = 13 ; 12 , a3 = 14 ; 13 , a4 = 15 ; 14 , and so on.

6.2

Countable and uncountable sets

A set of objects A is a countable set if all its elements can be arranged into a sequence, that is, if there exists a sequence fan g such that 8a 2 A; 9n 2 N : an = a In other words, A is a countable set if there exists at least one sequence fan g such that every element of A belongs to the sequence. A is an uncountable set if such a sequence does not exist. The most important example of an uncountable set is the set of real numbers R.

6.3

Limit of a sequence

This section introduces the notion of limit of a sequence fan g. We start from the simple case in which fan g is a sequence of real numbers, then we deal with the general case in which fan g is a sequence of objects that are not necessarily real numbers.

6.3.1

The limit of a sequence of real numbers

We give …rst an informal and then a more formal de…nition of the limit of a sequence of real numbers. Informal de…nition of limit - Sequences of real numbers Let fan g be a sequence of real numbers. Let n0 2 N. Denote by fan gn>n0 a subsequence of fan g obtained by dropping the …rst n0 terms of fan g, i.e., fan gn>n0 = fan0 +1 ; an0 +2 ; an0 +3 ; : : :g The following is an intuitive de…nition of limit of a sequence. De…nition 31 (informal) We say that a real number a is a limit of a sequence fan g of real numbers, if, by appropriately choosing n0 , the distance between a and any term of the subsequence fan gn>n0 can be made as close to zero as we like. If a is a limit of the sequence fan g, we say that the sequence fan g is a convergent sequence and that it converges to a. We indicate the fact that a is a limit of fan g by a = lim an n!1

Thus, a is a limit of fan g if, by dropping a su¢ ciently high number of initial terms of fan g, we can make the remaining terms of fan g as close to a as we like. Intuitively, a is a limit of fan g if an becomes closer and closer to a by letting n go to in…nity.

6.3. LIMIT OF A SEQUENCE

33

Formal de…nition of limit - Sequences of real numbers The distance between two real numbers is the absolute value of their di¤erence. For example, if a 2 R and an is a term of a sequence fan g, the distance between an and a, denoted by d (an ; a), is d (an ; a) = jan

aj

Using the concept of distance, the above informal de…nition can be made rigorous. De…nition 32 (formal) We say that a 2 R is a limit of a sequence fan g of real numbers if 8" > 0; 9n0 2 N : d (an ; a) < "; 8n > n0 If a is a limit of the sequence fan g, we say that the sequence fan g is a convergent sequence and that it converges to a. We indicate the fact that a is a limit of fan g by a = lim an n!1

For those unfamiliar with the universal quanti…ers 8 (any) and 9 (exists), the notation 8" > 0; 9n0 2 N : d (an ; a) < "; 8n > n0 reads as follows: "For any arbitrarily small number ", there exists a natural number n0 such that the distance between an and a is less than " for all the terms an with n > n0 ", which can also be restated as "For any arbitrarily small number ", you can …nd a subsequence fan gn>n0 such that the distance between any term of the subsequence and a is less than "", or as "By dropping a su¢ ciently high number of initial terms of fan g, you can make the remaining terms as close to a as you wish". It is possible to prove that a convergent sequence has a unique limit, that is, if fan g has a limit a, then a is the unique limit of fan g. Example 33 De…ne a sequence fan g by characterizing its n-th element an as follows: 1 an = n The elements of the sequence are a1 = 1, a2 = 12 , a3 = 13 , a4 = 14 , and so on. The higher n is, the smaller an is and the closer it gets to 0. Therefore, intuitively, the limit of the sequence should be lim an = 0 n!1

It is straightforward to prove that 0 is indeed a limit of fan g by using De…nition 32. Choose any " > 0. We need to …nd an n0 2 N such that all terms of the subsequence fan gn>n0 have distance from zero less than ": d (an ; 0) < "; 8n > n0 The distance between a generic term of the sequence an and 0 is d (an ; 0) = jan

0j = jan j = an

(6.1)

34

CHAPTER 6. SEQUENCES AND LIMITS

where the last equality holds because all the terms of the sequence are positive and hence equal to their absolute values. Therefore, we need to …nd an n0 2 N such that all the terms of the subsequence fan gn>n0 satisfy an < "; 8n > n0

(6.2)

Since an < an0 ; 8n > n0

condition (6.2) is satis…ed if an0 < ", which is equivalent to n10 < ". As a consequence, it su¢ ces to pick any n0 such that n0 > 1" to satisfy condition (6.1). Summing up, we have just shown that, for any ", we are able to …nd n0 2 N such that all terms of the subsequence fan gn>n0 have distance from zero less than ", which implies that 0 is the limit of the sequence fan g.

6.3.2

The limit of a sequence in general

We now deal with the more general case in which the terms of the sequence fan g are not necessarily real numbers. As before, we …rst give an informal de…nition, and then a more formal one. Informal de…nition of limit - The general case Let A be a set of objects and let fan g be a sequence of elements of A. The limit of fan g is de…ned as follows. De…nition 34 (informal) Let a 2 A. We say that a is a limit of a sequence fan g of elements of A, if, by appropriately choosing n0 , the distance between a and any term of the subsequence fan gn>n0 can be made as close to zero as we like. If a is a limit of the sequence fan g, we say that the sequence fan g is a convergent sequence and that it converges to a. We indicate the fact that a is a limit of fan g by a = lim an n!1

The de…nition is the same given in De…nition 31, except for the fact that now both a and the terms of the sequence fan g belong to a generic set of objects A. Metrics and the de…nition of distance In De…nition 34 we have implicitly assumed that the concept of distance between elements of A is well-de…ned. Thus, for this de…nition to make any sense, we need to properly de…ne distance. We need a function d : A A ! R that associates to any couple of elements of A a real number measuring how far these two elements are. For example, if a and a0 are two elements of A, d (a; a0 ) needs to be a real number measuring the distance between a and a0 . A function d : A A ! R is considered a valid distance function if it satis…es the properties listed in the following de…nition. De…nition 35 Let A be a set of objects. Let d : A A ! R. d is considered a valid distance function, in which case it is called a metric on A, if the following conditions are satis…ed for any a, a0 and a00 belonging to A:

6.3. LIMIT OF A SEQUENCE 1. non-negativity: d (a; a0 )

35 0;

2. identity of indiscernibles: d (a; a0 ) = 0 if and only if a = a0 ; 3. symmetry: d (a; a0 ) = d (a0 ; a); 4. triangle inequality: d (a; a0 ) + d (a0 ; a00 )

d (a; a00 ).

All four properties are very intuitive: property 1) says that the distance between two points cannot be a negative number; property 2) says that the distance between two points is zero if and only if the two points coincide; property 3) says that the distance from a to a0 is the same as the distance from a0 to a; property 4) says that the distance you cover when you go from a to a00 directly is less than or equal to the distance you cover when you go from a to a00 passing from a third point a0 (in other words, if a0 is not on the way from a to a00 , you are increasing the distance covered). Example 36 (Euclidean distance) Consider the set of K-dimensional real vectors RK . The metric usually employed to measure the distance between elements of RK is the so-called Euclidean distance. If a and b are two vectors belonging to RK , then their Euclidean distance is q 2 2 2 d (a; b) = (a1 b1 ) + (a2 b2 ) + : : : + (aK bK )

where a1 ; : : : ; aK are the K components of a and b1 ; : : : ; bK are the K components of b. It is possible to prove that the Euclidean distance satis…es all the four properties that a metric needs to satisfy. Furthermore, when K = 1, it becomes q 2 d (a; b) = (a b) = ja bj which coincides with the de…nition of distance between real numbers already given above.

Whenever we are faced with a sequence of objects and we want to assess whether it is convergent, we need to de…ne a distance function on the set of objects to which the terms of the sequence belong, and verify that the proposed distance function satis…es all the properties of a proper distance function (a metric). For example, in probability theory and statistics we often deal with sequences of random variables. To assess whether these sequences are convergent, we need to de…ne a metric to measure the distance between two random variables. As we will see in the lecture entitled Sequences of random variables (see p. 491), there are several ways of de…ning the concept of distance between two random variables. All these ways are legitimate and are useful in di¤erent situations. Formal de…nition of limit - The general case Having de…ned the concept of a metric, we are now ready to state the formal de…nition of a limit of a sequence. De…nition 37 (formal) Let A be a set of objects. Let d : A A ! R be a metric on A. We say that a 2 A is a limit of a sequence fan g of objects belonging to A if 8" > 0; 9n0 2 N : d (an ; a) < "; 8n > n0

36

CHAPTER 6. SEQUENCES AND LIMITS

If a is a limit of the sequence fan g, we say that the sequence fan g is a convergent sequence and that it converges to a. We indicate the fact that a is a limit of fan g by a = lim an n!1

Also in this case, it is possible to prove that a convergent sequence has a unique limit. Proposition 38 If fan g has a limit a, then a is the unique limit of fan g. Proof. The proof is by contradiction. Suppose that a and a0 are two limits of a sequence fan g and a 6= a0 . By combining property 1) and 2) of a metric (see above), we obtain d (a; a0 ) > 0 i.e., d (a; a0 ) = d, where d is a strictly positive constant. Pick any term an of the sequence. By property 4) of a metric (the triangle inequality), we have d (a; an ) + d (an ; a0 )

d (a; a0 )

Since d (a; a0 ) = d, the previous inequality becomes d (a; an ) + d (a0 ; an )

d>0

Now, take any " < d. Since a is a limit of the sequence, we can …nd n0 such that d (a; an ) < "; 8n > n0 , which means that " + d (a0 ; an )

d (a; an ) + d (a0 ; an )

d > 0; 8n > n0

and d (a0 ; an )

d

" > 0; 8n > n0

0

Therefore, d (a ; an ) can not be made smaller than d cannot be a limit of the sequence.

", and, as a consequence, a0

Convergence criterion In practice, it is usually di¢ cult to assess the convergence of a sequence using De…nition 37. Instead, convergence can be assessed using the following criterion. Proposition 39 Let A be a set of objects. Let d : A A ! R be a metric on A. Let fan g be a sequence of objects belonging to A, and a 2 A. The sequence fan g converges to a if and only if lim d (an ; a) = 0

n!1

Proof. This is easily proved by de…ning a sequence of real numbers fdn g whose generic term is dn = d (an ; a) and noting that the de…nition of convergence of fan g to a, which is 8" > 0; 9n0 2 N : d (an ; a) < "; 8n > n0

6.3. LIMIT OF A SEQUENCE

37

can be written as 8" > 0; 9n0 2 N : jdn

0j < "; 8n > n0

which is the de…nition of convergence of fdn g to 0. So, in practice, the problem of assessing the convergence of a generic sequence of objects is simpli…ed as follows: 1. …nd a metric d (an ; a) to measure the distance between the terms of the sequence an and the candidate limit a; 2. de…ne a new sequence fdn g, where dn = d (an ; a); 3. study the convergence of the sequence fdn g, which is a simple problem, because fdn g is a sequence of real numbers.

38

CHAPTER 6. SEQUENCES AND LIMITS

Chapter 7

Review of di¤erentiation rules This lecture contains a summary of di¤erentiation rules, i.e. of rules for computing the derivative of a function. This review is neither detailed nor rigorous and it is not meant to be a substitute for a proper lecture on di¤erentiation. Its only purpose it to serve as a quick review of di¤erentiation rules. d f (x) will In what follows, f (x) will denote a function of one variable and dx denote its …rst derivative.

7.1

Derivative of a constant function

If f (x) is a constant function: f (x) = c where c 2 R, then its …rst derivative is d f (x) = 0 dx

7.2

Derivative of a power function

If f (x) is a power function: f (x) = xn where n 2 R, then its …rst derivative is d f (x) = nxn dx

1

Example 40 De…ne f (x) = x5 The derivative of f (x) is d f (x) = 5x5 dx 39

1

= 5x4

40

CHAPTER 7. REVIEW OF DIFFERENTIATION RULES

Example 41 De…ne f (x) =

p 3

x4

The derivative of f (x) is d d p d 4 3 f (x) = x4 = x4=3 = x4=3 dx dx dx 3

7.3

1

=

4 1=3 x 3

Derivative of a logarithmic function

If f (x) is the natural logarithm of x: f (x) = ln (x) then its …rst derivative is

1 d f (x) = dx x If f (x) is the logarithm to base b of x: f (x) = logb (x)

then its …rst derivative is1

d 1 f (x) = dx x ln (b)

Example 42 De…ne f (x) = log2 (x) The derivative of f (x) is d 1 f (x) = dx x ln (2)

7.4

Derivative of an exponential function

If f (x) is the exponential function f (x) = exp (x) then its …rst derivative is

d f (x) = exp(x) dx If the exponential function f (x) does not have the natural base e, but another positive base b: f (x) = bx then its …rst derivative is2

d f (x) = ln (b) bx dx

Example 43 De…ne f (x) = 5x The derivative of f (x) is d f (x) = ln (5) 5x dx 1 Remember 2 Remember

ln(x)

that logb (x) = ln(b) . that bx = exp (x ln (b)).

7.5. DERIVATIVE OF A LINEAR COMBINATION

7.5

41

Derivative of a linear combination

If f1 (x) and f2 (x) are two functions and c1 ; c2 2 R are two constants, then d d d (c1 f1 (x) + c2 f2 (x)) = c1 f1 (x) + c2 f2 (x) dx dx dx In other words, the derivative of a linear combination is equal to the linear combinations of the derivatives. This property is called "linearity of the derivative". Two special cases of this rule are 1. Multiplication by a constant d d (c1 f1 (x)) = c1 f1 (x) dx dx 2. Addition

d d d (f1 (x) + f2 (x)) = f1 (x) + f2 (x) dx dx dx

Example 44 De…ne f (x) = 2 + exp (x) The derivative of f (x) is d d d d (2 + exp (x)) = (2) + (exp (x)) f (x) = dx dx dx dx The …rst summand is d (2) = 0 dx because the derivative of a constant is 0. The second summand is d (exp (x)) = exp (x) dx by the rule for di¤ erentiating exponentials. Therefore d d d f (x) = (2) + (exp (x)) = 0 + exp (x) = exp (x) dx dx dx

7.6

Derivative of a product of functions

If f1 (x) and f2 (x) are two functions, then the derivative of their product is d (f1 (x) f2 (x)) = dx

d f1 (x) f2 (x) + f1 (x) dx

d f2 (x) dx

Example 45 De…ne f (x) = x ln (x) The derivative of f (x) is d f (x) dx

= =

d d d (x ln (x)) = (x) ln (x) + x (ln (x)) dx dx dx 1 1 ln (x) + x = ln (x) + 1 x

42

7.7

CHAPTER 7. REVIEW OF DIFFERENTIATION RULES

Derivative of a composition of functions

If g (y) and h (x) are two functions, then the derivative of their composition is ! d d d (g (h (x))) = g (y) h (x) dx dy dx y=h(x) What does this chain rule mean in practice? It means that …rst you need to compute the derivative of g (y): d g (y) dy Then, you substitute y with h (x): d g (y) dy

y=h(x)

Finally, you multiply it by the derivative of h (x): d h (x) dx Example 46 De…ne f (x) = ln x2 The function f (x) is a composite function: f (x) = g (h (x)) where g (y) = ln (y) and h (x) = x2 The derivative of h (x) is d d h (x) = x2 = 2x dx dx The derivative of g (y) is d d 1 g (y) = (ln (y)) = dy dy y which, evaluated at y = h (x) = x2 , gives d g (y) dy

= y=h(x)

1 1 = 2 h (x) x

Therefore d (g (h (x))) = dx

d g (y) dy

y=h(x)

!

d 1 2 h (x) = 2 2x = dx x x

7.8. DERIVATIVES OF TRIGONOMETRIC FUNCTIONS

7.8

43

Derivatives of trigonometric functions

The trigonometric functions have the following derivatives: d sin (x) = cos (x) dx d cos (x) = sin (x) dx d 1 tan (x) = 2 dx cos (x) while the inverse trigonometric functions have the following derivatives: d arcsin (x) = dx d arccos (x) = dx d arctan (x) = dx

p

1

x2 1 p 1 x2 1 1 + x2 1

Example 47 De…ne f (x) = cos x2 We need to use the chain rule for the derivative of a composite function: ! d d d (g (h (x))) = g (y) h (x) dx dy dx y=h(x) The derivative of h (x) is d d h (x) = x2 = 2x dx dx The derivative of g (y) is d d g (y) = (cos (y)) = dy dy

sin (y)

which, evaluated at y = h (x) = x2 , gives d g (y) dy

=

sin (h (x)) =

sin x2

y=h(x)

Therefore d (g (h (x))) = dx

7.9

d g (y) dy

y=h(x)

!

d h (x) = dx

sin x2

Derivative of an inverse function

If y = f (x) is a function with derivative d f (x) dx

2x

44

CHAPTER 7. REVIEW OF DIFFERENTIATION RULES 1

then its inverse x = f

(y) has derivative

d f dy

1

(y) =

d f (x) dx

1 (y)

x=f

!

1

Example 48 De…ne f (x) = exp (3x) Its inverse is f

1

1 ln (y) 3

(y) =

The derivative of f (x) is d f (x) = 3 exp (3x) dx As a consequence d f (x) dx

x=f

1 (y)

= 3 exp (3x)jx= 1 ln(y) = 3 exp 3 3

and d f dy

1

(y) =

d f (x) dx

x=f

1 (y)

!

1 ln (y) 3

1

= (3y)

1

= 3y

Chapter 8

Review of integration rules This lecture contains a summary of integration rules, i.e. of rules for computing de…nite and inde…nite integrals of a function. This review is neither detailed nor rigorous and it is not meant to be a substitute for a proper lecture on integration. Its only purpose it to serve as a quick review of integration rules. d f (x) will In what follows, f (x) will denote a function of one variable and dx denote its …rst derivative.

8.1

Inde…nite integrals

If f (x) is a function of one variable, an inde…nite integral of f (x) is a function F (x) whose …rst derivative is equal to f (x): d F (x) = f (x) dx An inde…nite integral F (x) is denoted by Z F (x) = f (x) dx Inde…nite integrals are also called antiderivatives or primitives. Example 49 Let f (x) = x3 The function F (x) =

1 4 x 4

is an inde…nite integral of f (x) because d d F (x) = dx dx

1 4 x 4

=

1 d 1 x4 = 4x3 = x3 4 dx 4

Also the function G (x) =

1 1 4 + x 2 4

45

46

CHAPTER 8. REVIEW OF INTEGRATION RULES

is an inde…nite integral of f (x) because d G (x) dx

=

d dx

1 1 4 + x 2 4

=

0+

1 4x3 = x3 4

=

d dx

1 2

+

1 d x4 4 dx

Note that if a function F (x) is an inde…nite integral of f (x) then also the function G (x) = F (x) + c is an inde…nite integral of f (x) for any constant c 2 R, because d G (x) dx

d d d (F (x) + c) = (F (x)) + (c) dx dx dx = f (x) + 0 = f (x) =

This is also the reason why the adjective inde…nite is used: because inde…nite integrals are de…ned only up to a constant. The following subsections contain some rules for computing the inde…nite integrals of functions that are frequently encountered in probability theory and statistics. In all these subsections, c will denote a constant and the integration rules will be reported without a proof. Proofs are trivial and can be easily performed by the reader: it su¢ ces to compute the …rst derivative of F (x) and verify that it equals f (x).

8.1.1

Inde…nite integral of a constant function

If f (x) is a constant function: f (x) = a where a 2 R, then an inde…nite integral of f (x) is F (x) = ax + c

8.1.2

Inde…nite integral of a power function

If f (x) is a power function: f (x) = xn then an inde…nite integral of f (x) is F (x) = when n 6=

1. When n =

1 xn+1 + c n+1

1, i.e. when f (x) =

1 x

the integral is F (x) = ln (x) + c

8.1. INDEFINITE INTEGRALS

8.1.3

47

Inde…nite integral of a logarithmic function

If f (x) is the natural logarithm of x: f (x) = ln (x) then its inde…nite integral is F (x) = x ln (x)

x+c

If f (x) is the logarithm to base b of x: f (x) = logb (x) then its inde…nite integral is1 F (x) =

8.1.4

1 (x ln (x) ln (b)

x) + c

Inde…nite integral of an exponential function

If f (x) is the exponential function: f (x) = exp (x) then its inde…nite integral is F (x) = exp(x) + c If the exponential function f (x) does not have the natural base e, but another positive base b: f (x) = bx then its inde…nite integral is2 F (x) =

8.1.5

1 x b +c ln (b)

Inde…nite integral of a linear combination of functions

If f1 (x) and f2 (x) are two functions and c1 ; c2 2 R are two constants, then: Z Z Z (c1 f1 (x) + c2 f2 (x)) dx = c1 f1 (x) dx + c2 f2 (x) dx In other words, the integral of a linear combination is equal to the linear combinations of the integrals. This property is called "linearity of the integral". Two special cases of this rule are: 1. Multiplication by a constant: Z Z c1 f1 (x) dx = c1 f1 (x) dx 2. Addition:

1 Remember 2 Remember

Z

(f1 (x) + f2 (x)) dx = ln(x)

that logb (x) = ln(b) . that bx = exp (x ln (b)).

Z

f1 (x) dx +

Z

f2 (x) dx

48

8.1.6

CHAPTER 8. REVIEW OF INTEGRATION RULES

Inde…nite integrals of trigonometric functions

The trigonometric functions have the following inde…nite integrals: Z sin (x) dx = cos (x) + c Z cos (x) dx = sin (x) + c Z 1 tan (x) dx = ln +c cos (x)

8.2

De…nite integrals

Let f (x) be a function of one variable and [a; b] an interval of real numbers. The de…nite integral (or, simply, the integral) from a to b of f (x) is the area of the region in the xy-plane bounded by the graph of f (x), the x-axis and the vertical lines x = a and x = b, where regions below the x-axis have negative sign and regions above the x-axis have positive sign. The integral from a to b of f (x) is denoted by Z

b

f (x) dx

a

f (x) is called the integrand function and a and b are called the upper bound of integration and the lower bound of integration. The following subsections contain some properties of de…nite integrals, which are also often utilized to actually compute de…nite integrals.

8.2.1

Fundamental theorem of calculus

The fundamental theorem of calculus provides the link between de…nite and indefinite integrals. It has two parts. On the one hand, if you de…ne Z x f (t) dt F (x) = a

then the …rst derivative of F (x) is equal to f (x), i.e. d F (x) = f (x) dx In other words, if you di¤erentiate a de…nite integral with respect to its upper bound of integration, then you obtain the integrand function. Example 50 De…ne F (x) =

Z

x

exp (2t) dt

a

Then:

d F (x) = exp (2x) dx

8.2. DEFINITE INTEGRALS

49

On the other hand, if F (x) is an inde…nite integral (an antiderivative) of f (x), then Z b

f (x) dx = F (b)

F (a)

a

In other words, you can use the inde…nite integral to compute the de…nite integral. The following notation is often used: Z b b f (x) dx = [F (x)]a a

where

b

[F (x)]a = F (b)

F (a)

Sometimes the variable of integration x is explicitly speci…ed and we write x=b

[F (x)]x=a Example 51 Consider the de…nite integral Z 1 x2 dx 0

The integrand function is

f (x) = x2 An inde…nite integral of f (x) is 1 3 x 3 Therefore, the de…nite integral from 0 to 1 can be computed as follows: Z 1 1 1 3 1 3 1 1 3 x2 dx = x = 1 0 = 3 3 3 3 0 0 F (x) =

8.2.2

De…nite integral of a linear combination of functions

Like inde…nite integrals, also de…nite integrals are linear. If f1 (x) and f2 (x) are two functions and c1 ; c2 2 R are two constants, then: Z b Z b Z b (c1 f1 (x) + c2 f2 (x)) dx = c1 f1 (x) dx + c2 f2 (x) dx a

a

a

with the two special cases:

1. Multiplication by a constant: Z b Z c1 f1 (x) dx = c1 a

Z

b

(f1 (x) + f2 (x)) dx =

a

b

f1 (x) dx

a

Z

b

f1 (x) dx +

a

Example 52 For example: Z 1 Z 3x + 2x2 dx = 3 0

0

Z

b

f2 (x) dx

a

1

xdx + 2

Z

0

1

x2 dx

50

CHAPTER 8. REVIEW OF INTEGRATION RULES

8.2.3

Change of variable

If f (x) and g (x) are two functions, then the following integral Z

b

f (g (x))

a

d g (x) dx dx

can be computed by a change of variable, using the variable t = g (x) The change of variable is performed in the following steps: 1. Di¤erentiate the change of variable formula t = g (x) and obtain dt =

d g (x) dx dx

2. Recompute the bounds of integration: x = a ) t = g (x) = g (a) x = b ) t = g (x) = g (b) d dx g (x) dx

3. Substitute g (x) and Z

b

d g (x) dx = dx

f (g (x))

a

Example 53 The integral

in the integral:

Z

2

1

Z

g(b)

g(a)

ln (x) dx x

can be computed performing the change of variable t = ln (x) Di¤ erentiating the change of variable formula, we obtain dt =

d 1 ln (x) dx = dx dx x

The new bounds of integration are x = x =

1 ) t = ln (1) = 0 2 ) t = ln (2)

Therefore, the integral can be written as follows: Z

1

2

ln (x) dx = x

Z

0

ln(2)

tdt

f (t) dt

8.2. DEFINITE INTEGRALS

8.2.4

51

Integration by parts

Let f (x) and g (x) be two functions and F (x) and G (x) their inde…nite integrals. The following integration by parts formula holds: Z

b

f (x) G (x) dx = [F

a

Example 54 The integral

Z

Z

b (x) G (x)]a

b

F (x) g (x) dx

a

1

exp (x) xdx

0

can be integrated by parts, by setting f (x) G (x)

= exp (x) = x

An inde…nite integral of f (x) is F (x) = exp (x) and G (x) is an inde…nite integral of g (x) = 1 or, said di¤ erently, g (x) = 1 is the derivative of G (x) = x. Therefore: Z

0

1

Z

1

exp (x) xdx = [exp (x) x]0 = exp (1)

1

exp (x) dx

0 Z 1

0

exp (x) dx

0

= exp (1) = exp (1)

8.2.5

1

[exp (x)]0 [exp (1) 1] = 1

Exchanging the bounds of integration

Given the integral

Z

b

f (x) dx

a

exchanging its bounds of integration is equivalent to changing its sign: Z

a

f (x) dx =

b

8.2.6

Z

b

f (x) dx

a

Subdividing the integral

Given the two bounds of integration a and b, with a that a m b, then Z

a

b

f (x) dx =

Z

a

m

f (x) dx +

Z

b, and a third point m such b

m

f (x) dx

52

CHAPTER 8. REVIEW OF INTEGRATION RULES

8.2.7

Leibniz integral rule

Given a function of two variables f (x; y) and the integral I (y) =

Z

b(y)

f (x; y) dx

a(y)

where both the lower bound of integration a and the upper bound of integration b may depend on y, under appropriate technical conditions (not discussed here) the …rst derivative of the function I (y) with respect to y can be computed as follows: d I (y) dy

where

@ @y f

=

d b (y) f (b (y) ; y) dy Z b(y) @ f (x; y) dx + @y a(y)

d a (y) f (a (y) ; y) dy

(x; y) is the …rst partial derivative of f (x; y) with respect to y.

Example 55 The derivative of the integral I (y) =

Z

y 2 +1

exp (xy) dx

y2

is d I (y) dy

=

d y2 + 1 dy

exp

y2 + 1 y

d 2 y exp y 2 y + dy =

2y exp y 3 + y

Z

y 2 +1

y2

2y exp y 3

@ (exp (xy)) dx @y Z y2 +1 + x exp (xy) dx y2

8.3

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Compute the following integral: Z 1

cos (x) exp ( x) dx

0

Hint: perform two integrations by parts. Solution Performing two integrations by parts, we obtain: Z 1 cos (x) exp ( x) dx 0

8.3. SOLVED EXERCISES A

B

53

Z 1 1 [sin (x) exp ( x)]0 sin (x) ( exp ( x)) dx 0 Z 1 = 0 0+ sin (x) exp ( x) dx 0 Z 1 1 = [ cos (x) exp ( x)]0 ( cos (x)) ( exp ( x)) dx 0 Z 1 = 0 ( 1) cos (x) exp ( x) dx 0 Z 1 = 1 cos (x) exp ( x) dx =

0

where integration by parts has been performed in steps A and B . Therefore Z

1

cos (x) exp ( x) dx = 1

0

Z

1

cos (x) exp ( x) dx

0

which can be rearranged to yield Z 1 2 cos (x) exp ( x) dx = 1 0

or

Z

1

cos (x) exp ( x) dx =

0

1 2

Exercise 2 Use Leibniz integral rule to compute the derivative with respect to y of the following integral: Z y2 I (y) = exp ( xy) dx 0

Solution Leibniz integral rule is d dy

Z

b(y)

f (x; y) dx =

a(y)

d b (y) f (b (y) ; y) dy Z b(y) @ + f (x; y) dx a(y) @y

d a (y) f (a (y) ; y) dy

We can apply it as follows: d I (y) dy

= = =

d dy

Z

y2

exp ( xy) dx

0

d 2 y exp dy 2y exp

y3

2

y y + Z

0

Z

0

y2

y2

@ exp ( xy) dx @y

x exp ( xy) dx

54

CHAPTER 8. REVIEW OF INTEGRATION RULES A

=

= =

2y exp

2y exp 3y exp

y

3

(

1 exp ( xy) y

x

y 3 + y exp y

3

1 + 2 exp y

y

0

1 + y

Z

y2

y2 0

1 y2

3

where: in step A we have performed an integration by parts.

Exercise 3 Compute the following integral: Z

1

2

x 1 + x2

dx

0

Solution This integral can be solved using the change of variable technique: Z 1 Z 1 1 2 2 2 (1 + t) dt x 1+x dx = 0 0 2 =

1 (1 + t) 2

1

1

= 0

1 1 1 1 + 1= 2 2 2 4

where the change of variable is t = x2 and the di¤erential is dt = 2xdx so that we can substitute xdx with 21 dt.

)

exp ( xy) dx

0

1 exp ( xy) y

1 y

y3

y2

Chapter 9

Special functions This chapter brie‡y introduces some special functions that are frequently used in probability and statistics.

9.1

Gamma function

The Gamma function is a generalization of the factorial function1 to non-integer numbers. Recall that if n 2 N, its factorial n! is n! = 1 2 : : : (n

1) n

so that n! satis…es the recursion n! = (n The Gamma function

1)! n

(z) satis…es a similar recursion: (z) =

(z

1) (z

1)

but it is de…ned also when z is not an integer.

9.1.1

De…nition

The following is a possible de…nition of the Gamma function. De…nition 56 The Gamma function is a function : R++ ! R++ satisfying the equation Z 1 (z) = xz 1 exp ( x) dx 0

While the domain of de…nition of the Gamma function can be extended beyond the set R++ of strictly positive real numbers, for example, to complex numbers, the somewhat restrictive de…nition given above is more than su¢ cient to address all the problems involving the Gamma function that are found in these lectures. 1 See

p. 10.

55

56

CHAPTER 9. SPECIAL FUNCTIONS

9.1.2

Recursion

The next proposition states a recursive property that is used to derive several other properties of the Gamma function. Proposition 57 The Gamma function satis…es the recursion (z) =

(z

1) (z

1)

(9.1)

Proof. By integrating by parts2 , we get (z)

=

Z

1

xz

1

exp ( x) dx

0

xz

= =

(0

1

exp ( x)

0) + (z

1)

1 0

Z

+

Z

1

(z

1) xz

2

exp ( x) dx

0

1

x(z

1) 1

exp ( x) dx

0

=

9.1.3

(z

1) (z

1)

Relation to the factorial function

The next proposition states the relation between the Gamma and factorial functions. Proposition 58 When the argument of the Gamma function is an integer n 2 N, then its value is equal to the factorial of n 1: (n) = (n

1)!

Proof. First of all, we need to compute a starting value: (1) =

Z

1

x1

1

exp ( x) dx =

0

Z

1

0

1

exp ( x) dx = [ exp ( x)]0 = 1

By using the recursion (9.1), we obtain (1) (2) (3) (4)

(n)

2 See

p.51.

= = = =

1 = 0! (2 1) (3 1) (4 1) .. . = (n 1)

(2 (3 (4

1) = 1) = 1) =

(1) 1 = 1 = 1! (2) 2 = 1 2 = 2! (3) 3 = 1 2 3 = 3!

(n

1) = 1 2 3 : : : (n

1) = (n

1)!

9.1. GAMMA FUNCTION

9.1.4

57

Values of the Gamma function

The next proposition states a well-known fact, which is often used in probability theory and statistics. Proposition 59 The Gamma function is such that 1 2

p

=

Proof. Using the de…nition of Gamma function and performing a change of variable, we obtain 1 2

=

Z

1

x1=2

1

exp ( x) dx

0

=

Z

1

1=2

x

exp ( x) dx

0

A

=

2

Z

1

t2 dt

exp

0

=

2

Z

1

t2 dt

exp

0

=

2

Z

=

2

B

=

2

Z

1

=

2

=

2

=

2

0

Z

1

exp

s2 ds

1=2

0

1

1

1

1=2

exp

t2

s2 dtds

exp

s2 u2

1=2

s2 sduds 1=2

1 + u2 s2 sdsdu

exp

0

1

1 exp 2 (1 + u2 )

0

Z

t2 dt

0

Z

1

0

Z

1=2

exp

0

Z

1

0

Z

t2 dt

exp

0

Z

1

0

1

0

Z

Z

1

0+

Z

1

1 du 2 (1 + u2 )

1 du 1 + u2

=

21=2

=

1=2

2

=

21=2 (arctan (1)

=

21=2

0

1 + u2 s2

1

1=2

du

0 1=2

1=2

1 1=2

([arctan (u)]0 )

1=2

arctan (0))

1=2

2

0

=

1=2

where: in step A we have made the change of variable t = x1=2 ; in step B we have performed the change of variable t = su. By using the result stated in the previous proposition, it is possible to derive other values of the Gamma function.

58

CHAPTER 9. SPECIAL FUNCTIONS

Proposition 60 The Gamma function is such that n+

1 2

=

p nY1

j+

j=0

1 2

for n 2 N. Proof. The result is obtained by iterating the recursion formula (9.1): n+ =

n

=

n

1 2

1 2 1 1+ 2 1+

n

1+

n

2+

n

2+

1 2

1 2

n

2+

1 2

.. . =

n

=

n

=

1 2 1 1+ 2

1+

p nY1 j=0

j+

n

1 2 1 2+ 2

::: n 1 2

:::

1 2 1 2

n+

n

n+

1 2

1 2

There are also other special cases in which the value of the Gamma function can be derived analytically, but it is not possible to express (z) in terms of elementary functions for every z. As a consequence, one often needs to resort to numerical algorithms to compute (z). For example, the Matlab command gamma(z) returns the value of the Gamma function at the point z. For a thorough discussion of a number of algorithms that can be employed to compute numerical approximations of (z) see Abramowitz and Stegun3 (1965).

9.1.5

Lower incomplete Gamma function

The de…nition of the Gamma function Z 1 (z) = xz

1

exp ( x) dx

0

can be generalized by substituting the upper bound of integration, equal to in…nity, with a variable y: Z y

xz

(z; y) =

1

exp ( x) dx

0

The function 3 Abramowitz,

(z; y) thus obtained is called lower incomplete Gamma function.

M. and I. A. Stegun (1965) Hanbook of mathematical functions: with formulas, graphs, and mathematical tables, Courier Dover Publications.

9.2. BETA FUNCTION

9.2

59

Beta function

The Beta function is a function of two variables that is often found in probability theory and mathematical statistics. We report here some basic facts about the Beta function.

9.2.1

De…nition

The following is a possible de…nition of the Beta function. De…nition 61 The Beta function is a function B : R2++ ! R++ de…ned by B (x; y) = where

(x) (y) (x + y)

() is the Gamma function.

While the domain of de…nition of the Beta function can be extended beyond the set R2++ of couples of strictly positive real numbers, for example, to couples of complex numbers, the somewhat restrictive de…nition given above is more than su¢ cient to address all the problems involving the Beta function that are found in these lectures.

9.2.2

Integral representations

The Beta function has several integral representations, which are sometimes also used as a de…nition of the Beta function in place of the de…nition we have given above. We report here two often used representations. Integral between zero and in…nity The …rst representation involves an integral from zero to in…nity. Proposition 62 The Beta function has the integral representation Z 1 x y B (x; y) = tx 1 (1 + t) dt

(9.2)

0

Proof. Given the de…nition of the Beta function as a ratio of Gamma functions, the equality holds if and only if Z 1 (x) (y) x y tx 1 (1 + t) dt = (x + y) 0 or (x + y)

Z

1

tx

1

x y

(1 + t)

dt =

(x) (y)

0

That the latter equality indeed holds is proved as follows:

A

=

Z

(x) (y)

0

1

ux

1

exp ( u) du

Z

0

1

vy

1

exp ( v) dv

60

CHAPTER 9. SPECIAL FUNCTIONS =

Z

1

v

y 1

Z

exp ( v)

0

B

= =

Z

1

Z0 1

=

Z

vy vy

1

1

Z

exp ( v)

0

v x+y Z0 1 Z 1 0

=

0

1

tx

1

0

C

= =

Z

1

Z0 1

exp ( v)

1

Z

1

exp ( v)

=

exp ( u) dudv

x 1

(vt)

v x tx

Z

1

tx

exp ( vt) vdtdv

1

exp ( vt) dtdv

1

exp ( vt) dtdv

v x+y Z 1

1 x 1

t

v x+y

1

exp ( (1 + t) v) dtdv exp ( (1 + t) v) dvdt

0

1

Z0 1 Z

1

0

tx

1

Z

1

s 1+t

0

t

x 1

x y

(1 + t)

0

D

ux

0

0

=

1

1

x+y 1

exp ( s) Z

1

sx+y

1

1 dsdt 1+t

exp ( s) dsdt

0

tx

x y

1

(1 + t) Z 1 (x + y) tx

(x + y) dt

0

=

1

(1 + t)

x y

dt

0

where: in step A we have used the de…nition of Gamma function; in step B we have performed the change of variable u = vt; in step C we have made the change of variable s = (1 + t) v; in step D we have again used the de…nition of Gamma function. Integral between zero and one Another representation involves an integral from zero to one. Proposition 63 The Beta function has the integral representation B (x; y) =

Z

1

tx

1

y 1

(1

t)

dt

0

Proof. This can be obtained from the previous integral representation: Z 1 x y B (x; y) = tx 1 (1 + t) dt 0

by performing a change of variable. The change of variable is s=

t =1 1+t

1 1+t

Before performing it, note that lim

t!1

t =1 1+t

(9.3)

9.3. SOLVED EXERCISES

61

and that

1 s 1= 1 s 1 s Furthermore, by di¤erentiating the previous expression, we obtain t=

2

1

dt =

1

ds

s

We are now ready to perform the change of variable: Z 1 x y B (x; y) = tx 1 (1 + t) dt 0

=

Z

1

1

0

=

Z

1

Z

x 1

1 1

sx

1

Z

sx

1

sx

1

1

1

Z

1

x y

s

2

1 1

ds

s

s

ds

ds 1 y

ds

s

1

(1

s

s 1

1

1

2

1

x 1 x y+2

1

0

=

1

x y

s

1

s

0

=

1+

s s

0

=

x 1

s

y 1

s)

ds

0

Note that the two representations (9.2) and (9.3) involve improper integrals that converge if x > 0 and y > 0. This might help you to see why the arguments of the Beta function are required to be strictly positive in De…nition 61.

9.3

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Compute the ratio 16 3 10 3

Solution We need to repeatedly apply the recursive formula (z) = (z

1) (z

1)

to the numerator of the ratio, as follows: 16 3 10 3

= =

16 3

16 3

1

1

10 3

13 3

13 3

13 3

1 10 3

= 1

13 3

13 3 10 3

=

13 10 3 3

10 3 10 3

=

130 9

62

CHAPTER 9. SPECIAL FUNCTIONS

Exercise 2 Compute (5) Solution We need to use the relation of the Gamma function to the factorial function: (n) = (n

1)!

which for n = 5 becomes (5) = (5

1)! = 4! = 4 3 2 1 = 24

Exercise 3 Express the integral

Z

1

1 x dx 2

x9=2 exp

0

in terms of the Gamma function. Solution This is accomplished as follows: Z

1

1 x dx 2

x9=2 exp

0

A

=

Z

1

9=2

(2t) exp ( t) 2dt Z 1 11=2 2 t11=2 1 exp ( t) dt 0

=

0

B

=

211=2 (11=2)

where: in step A we have performed a change of variable (t = 12 x); in step B we have used the de…nition of Gamma function.

Exercise 4 Compute the product 5 2 where

B

3 ;1 2

() is the Gamma function and B () is the Beta function.

Solution We need to write the Beta function in terms of Gamma functions: 5 2

B

3 ;1 2

=

5 2

3 2 3 2

(1) +1

9.3. SOLVED EXERCISES

63

=

B

=

(1)

3 2

3 2

1

1

1 1 2 2 1p 2

= C

(1) 5 2

3 2 3 2

= A

3 2

5 2

=

=

where: in step A we have used the fact that (1) = 1; in step B we have used the recursive formula for the Gamma function; in step C we have used the fact that p 1 = 2

Exercise 5 Compute the ratio 7 9 2; 2 5 11 2; 2

B B where B () is the Beta function. Solution

This is achieved by rewriting the numerator of the ratio in terms of Gamma functions: B B = A B

C

= =

7 9 2; 2 5 11 2; 2 7 2 7 2

1 5 11 B 2; 2 7 2

1 B

+

9 2 9 2 7 2

1 7 2

5 11 2; 2

1 5 11 B 2; 2

5 2

=

52 1 2 9 B 52 ; 11 2

=

5 1 B 5 9 B 2 ; 11 2

5 2

+ 11 2

7 2 5 2 5 2

1+ +

5 11 ; 2 2

9 2

1 9 2 11 2 9 2

1

+1

11 2 11 2

=

5 9

where: in steps A and B we have used the the recursive formula for the Gamma function; in step C we have used the de…nition of Beta function.

64

CHAPTER 9. SPECIAL FUNCTIONS

Exercise 6 Compute the integral

Z

1

x3=2 (1 + 2x)

5

dx

0

Solution We …rst express the integral in terms of the Beta function: Z 1 5 x3=2 (1 + 2x) dx 0

A

=

Z

1

0

= = B

=

1 2

1 t 2 5=2 Z

3=2

(1 + t) 1

5

1 dt 2

t3=2 (1 + t)

5

dt

0

1 2

5=2

1 2

5=2

Z

1

t5=2

1

5=2 5=2

(1 + t)

dt

0

5 5 ; 2 2

B

where: in step A we have performed a change of variable (t = 2x); in step B we have used the integral representation of Beta function. Now, write the Beta function in terms of Gamma functions: B

5 5 ; 2 2

5 2 5 2

5 2 + 52 5 2 2

= =

A

= =

B

=

(5) 31 22

1 2

2

4 3 2 1 2 1 3 24 4 9 384

1 2

2

where: in step A we have used the recursive formula for the Gamma function; in step B we have used the fact that 1 2

=

p

Substituting the above number into the previous expression for the integral, we obtain Z 1 5=2 5=2 1 5 5 1 9 5 x3=2 (1 + 2x) dx = B ; = 2 2 2 2 384 0 3 p 9 p 1p 9 2 2 = = 2 = 1024 384 3072 8

9.3. SOLVED EXERCISES

65

If you wish, you can check the above result using the MATLAB commands syms x f = (x^(3=2)) ((1 + 2 x)^ int(f; 0; Inf)

5)

66

CHAPTER 9. SPECIAL FUNCTIONS

Part II

Fundamentals of probability

67

Chapter 10

Probability Probability is used to quantify the likelihood of things that can happen, when it is not yet known whether they will happen. Sometimes probability is also used to quantify the likelihood of things that could have happened in the past, when it is not yet known whether they actually happened. Since we usually speak of the "probability of an event", the next section introduces a formal de…nition of the concept of event. We then discuss the properties that probability needs to satisfy. Finally, we discuss some possible interpretations of the concept of probability.

10.1

Sample space, sample points and events

Let be a set of things that can happen1 . We say that is a sample space, or space of all possible outcomes, if it satis…es the following two properties: 1. Mutually exclusive outcomes. Only one of the things in will happen. That is, when we learn that ! 2 has happened, then we also know that none of the things in the set f! 2 : ! 6= !g has happened. 2. Exhaustive outcomes. At least one of the things in

will happen.

An element ! 2 is called a sample point, or a possible outcome. When (and if) we learn that ! 2 has happened, ! is called the realized outcome. A subset E is called an event (you will see below2 that not every subset of is, strictly speaking, an event; however, on a …rst reading you can be happy with this de…nition). Note that itself is an event, because every set is a subset of itself, and the empty set ; is also an event, because it can be considered a subset of . Example 64 Suppose that we toss a die. Six numbers, from 1 to 6, can appear face up, but we do not yet know which one of them will appear. The sample space is = f1; 2; 3; 4; 5; 6g 1 In this lecture we are going to use the Greek letter probability theory. is upper case, while ! is lower case. 2 P. 75.

69

(Omega), which is often used in

70

CHAPTER 10. PROBABILITY

Each of the six numbers is a sample point. The outcomes are mutually exclusive, because only one number at a time can appear face up. The outcomes are also exhaustive, because at least one of the six numbers in will appear face up, after we toss the die. De…ne E = f1; 3; 5g E is an event (a subset of ). In words, the event E can be described as "an odd number appears face up". Now, de…ne F = f6g Also F is an event (a subset of number 6 appears face up".

10.2

). In words, the event F can be described as "the

Probability

The probability of an event is a real number, attached to the event, that tells us how likely that event is. Suppose E is an event. We denote the probability of E by P (E). Probability needs to satisfy the following properties: 1. Range. For any event E, 0

P (E)

1.

2. Sure thing. P ( ) = 1. 3. Sigma-additivity (or countable additivity). Let fE1 ; E2 ; : : : ; En ; : : :g be a sequence3 of events. Let all the events in the sequence be mutually exclusive, i.e., Ei \ Ej = ; if i 6= j. Then, P

1 [

n=1

En

!

=

1 X

P (En )

n=1

The …rst property is self-explanatory. It just means that the probability of an event is a real number between 0 and 1. The second property is also intuitive. It says that with probability 1 at least one of the possible outcomes will happen. The third property is a bit more cumbersome. It can be proved (see p. 72) that if sigma-additivity holds, then also the following holds: If E and F are two events and E \F = ;, then P (E [ F ) = P (E)+P (F ) (10.1) This property, called …nite additivity, while very similar to sigma-additivity, is easier to interpret. It says that if two events are disjoint, then the probability that either one or the other happens is equal to the sum of their individual probabilities. Example 65 Suppose that we ‡ip a coin. The possible outcomes are either tail (T ) or head (H), i.e., = fT; Hg 3 See

p. 31.

10.3. PROPERTIES OF PROBABILITY

71

There are a total of four subsets of (events): itself, the empty set ;, the event fT g and the event fHg. The following assignment of probabilities satis…es the properties enumerated above: 1 1 , P (fHg) = 2 2

P ( ) = 1, P (;) = 0, P (fT g) =

All these probabilities are between 0 and 1, so the range property is satis…ed. P ( ) = 1, so the sure thing property is satis…ed. Also sigma-additivity is satis…ed, because P (fT g [ fHg)

=

P (f g [ f;g)

=

P (fT g [ f;g)

=

P (fHg [ f;g)

=

1 1 + = P (fT g) + P (fHg) 2 2 P ( ) = 1 = 1 + 0 = P (f g) + P (f;g) 1 1 P (T ) = = + 0 = P (fT g) + P (f;g) 2 2 1 1 P (H) = = + 0 = P (fHg) + P (f;g) 2 2 P( ) = 1 =

and the four couples (T; H), ( ; ;), (T; ;), (H; ;) are the only four possible couples of disjoint sets. Before ending this section, two remarks are in order. First, we have not discussed the interpretations of probability, but below you can …nd a brief discussion of the interpretations of probability. Second, we have been somewhat sloppy in de…ning events and probability, but you can …nd a more rigorous de…nition of probability below.

10.3

Properties of probability

The following subsections discuss some of the properties enjoyed by probability.

10.3.1

Probability of the empty set

The empty set has zero probability: P (;) = 0 Proof. De…ne a sequence of events as follows: E1 = , E2 = ;, ..., En = ;, ... The sequence is a sequence of disjoint events, because the empty set is disjoint from any other set. Then, 1 = =

P( ) ! 1 [ P En n=1

=

1 X

P (En )

n=1

=

P( ) +

1 X

n=2

P (;)

72

CHAPTER 10. PROBABILITY

that is, P( ) +

1 X

P (;) = 1

n=2

Since P ( ) = 1, we have that 1 X

P (;) = 1

1=0

n=2

which implies P (;) = 0.

10.3.2

A sigma-additive function is also additive (see 10.1). Proof. Let E and F be two events and E \ F = ;. De…ne a sequence of events as follows: E1 = E, E2 = F , E3 = ;, ..., En = ;, ... The sequence is a sequence of disjoint events, because the empty set is disjoint from any other set. Then, P (E [ F )

=

P

1 [

En

n=1

=

1 X

!

P (En )

n=1

=

P (E) + P (F ) +

1 X

P (;)

n=3

=

P (E) + P (F )

since P (;) = 0.

10.3.3

Probability of the complement

Let E be an event and E c its complement (i.e., the set of all elements of do not belong to E). Then, P (E c ) = 1

that

P (E)

In words, the probability that an event does not occur (P (E c )) is equal to one minus the probability that it occurs. Proof. Note that = E [ Ec

(10.2)

and that E and E c are disjoint sets. Then, using the sure thing property and …nite additivity, we obtain 1 = P ( ) = P (E [ E c ) = P (E) + P (E c ) By rearranging the terms of this equality, we obtain 10.2.

10.3. PROPERTIES OF PROBABILITY

10.3.4

73

Probability of a union

Let E and F be two events. We have already seen how to compute P (E [ F ) in the special case in which E and F are disjoint. In the more general case (E and F are not necessarily disjoint), the formula is P (E [ F ) = P (E) + P (F )

P (E \ F )

Proof. First, note that P (E)

= P (E \ ) = P (E \ (F [ F c )) = P ((E \ F ) [ (E \ F c )) = P (E \ F ) + P (E \ F c )

P (F )

= P (F \ ) = P (F \ (E [ E c )) = P ((F \ E) [ (F \ E c )) = P (F \ E) + P (F \ E c )

and

so that P (E \ F c ) = P (E) P (F \ E c ) = P (F )

P (E \ F ) P (E \ F )

Furthermore, the event E [ F can be written as follows: E [ F = (E \ F ) [ (E \ F c ) [ (F \ E c ) and the three events on the right hand side are disjoint. Thus, P (E [ F ) = P ((E \ F ) [ (E \ F c ) [ (F \ E c )) = P (E \ F ) + P (E \ F c ) + P (F \ E c ) = P (E \ F ) + P (E) P (E \ F ) + P (F ) = P (E) + P (F ) P (E \ F )

10.3.5

P (E \ F )

Monotonicity of probability

If two events E and F are such that E P (E)

F , then P (F )

In other words, if E occurs less often than F , because F contemplates more occurrences, then the probability of E must be less than the probability of F . Proof. This is easily proved by using additivity: P (F )

= P ((F \ E) [ (F \ E c )) = P (F \ E) + P (F \ E c ) = P (E) + P (F \ E c )

Since, by the range property, P (F \ E c )

0, it follows that

P (F ) = P (E) + P (F \ E c )

P (E)

74

CHAPTER 10. PROBABILITY

10.4

Interpretations of probability

This subsection brie‡y discusses some common interpretations of probability. Although none of these interpretations is su¢ cient per se to clarify the meaning of probability, they all touch upon important aspects of probability.

10.4.1

Classical interpretation of probability

According to the classical de…nition of probability, when all the possible outcomes of an experiment are equally likely, the probability of an event is the ratio between the number of outcomes that are favorable to the event and the total number of possible outcomes. While intuitive, this de…nition has two main drawbacks: 1. it is circular, because it uses the concept of probability to de…ne probability: it is based on the assumption of "equally likely" outcomes, where equally likely means "having the same probability"; 2. it is limited in scope, because it does not allow to de…ne probability when the possible outcomes are not all equally likely.

10.4.2

Frequentist interpretation of probability

According to the frequentist de…nition of probability, the probability of an event is the relative frequency of the event itself, observed over a large number of repetitions of the same experiment. In other words, it is the limit to which the ratio number of occurrences of the event total number of repetitions of the experiment converges when the number of repetitions of the experiment tends to in…nity. Despite its intuitive appeal, also this de…nition of probability has some important drawbacks: 1. it assumes that all probabilistic experiments can be repeated many times, which is false; 2. it is also somewhat circular, because it implicitly relies on a Law of Large Numbers4 , which can be derived only after having de…ned probability.

10.4.3

Subjectivist interpretation of probability

According to the subjectivist de…nition of probability, the probability of an event is related to the willingness of an individual to accept bets on that event. Suppose a lottery ticket pays o¤ 1 dollar in case the event occurs and 0 in case the event does not occur. An individual is asked to set a price for this lottery ticket, at which she must be indi¤erent between being a buyer or a seller of the ticket. The subjective probability of the event is de…ned to be equal to the price thus set by the individual. Also this de…nition of probability has some drawbacks: 1. di¤erent individuals can set di¤erent prices, therefore preventing an objective assessment of probabilities; 4 See

the lecture entitled Laws of Large Numbers (p. 535).

10.5. MORE RIGOROUS DEFINITIONS

75

2. the price an individual is willing to pay to participate in a lottery can be in‡uenced by other factors that have nothing to do with probability; for example, an individual’s betting behavior can be in‡uenced by her preferences.

10.5

More rigorous de…nitions

10.5.1

A more rigorous de…nition of event

The de…nition of event given above is not entirely rigorous. Often, statisticians work with probability models where some subsets of are not considered events. This happens mainly for the following two reasons: 1. sometimes, is a really complicated set; to make things simpler, attention is restricted to only some subsets of ; 2. sometimes, it is possible to assign probabilities only to some subsets of ; in these cases, only the subsets to which probabilities can be assigned are considered events. Denote by F the set of subsets of which are considered events. F is called the space of events. In rigorous probability theory, F is required to be a sigmaalgebra on . F is a sigma-algebra on if it is a set of subsets of satisfying the following three properties: 1. Whole set.

2 F.

2. Closure under complementation. If E 2 F then also E c 2 F (E c , the complement of E with respect to , is the set of all elements of that do not belong to E). 3. Closure under countable unions. If E1 , E2 , ..., En ,... are a sequence of subsets of belonging to F, then ! 1 [ En 2 F n=1

Why is a space of events required to satisfy these properties? Besides a number of mathematical reasons, it seems pretty intuitive that they must be satis…ed. Property a) means that the space of events must include the event "something will happen", quite a trivial requirement! Property b) means that if "one of the things in the set E will happen" is considered an event, then also "none of the things in the set E will happen" is considered an event. This is quite natural: if you are considering the possibility that an event will happen, then, by necessity, you must also be simultaneously considering the possibility that the same event will not happen. Property c) is a bit more complex. However, the following property, implied by c), is probably easier to interpret: If E 2 F and F 2 F, then (E [ F ) 2 F It means that if "one of the things in E will happen" and "one of the things in F will happen" are considered two events, then also "one of the things in E or one of the things in F will happen" must be considered an event. This simply means

76

CHAPTER 10. PROBABILITY

that if you are able to separately assess the possibility of two events E and F happening, then, of course, you must be able to assess the possibility of one or the other happening. Property c) simply extends this intuitive property to countable5 collection of events: the extension is needed for mathematical reasons, to derive certain continuity properties of probability measures.

10.5.2

A more rigorous de…nition of probability

The de…nition of probability given above was not entirely rigorous. Now that we have de…ned sigma-algebras and spaces of events, we can make it completely rigorous. Let be a sample space. Let F be a sigma-algebra on (a space of events). A function P : F ! [0; 1] is a probability measure if and only if it satis…es the following two properties: 1. Sure thing. P ( ) = 1. 2. Sigma-additivity. Let fE1 ; E2 ; : : : ; En ; : : :g be any sequence of elements of F such that i 6= j implies Ei \ Ej = ;. Then, ! 1 1 [ X P En = P (En ) n=1

n=1

Nothing new has been added to the de…nition given above. This de…nition just clari…es that a probability measure is a function de…ned on a sigma-algebra of events. Hence, it is not possible to properly speak of probability for subsets of that do not belong to the sigma-algebra. A triple ( ; F,P) is called a probability space and the sets belonging to the sigma-algebra F are called measurable sets.

10.6

Solved exercises

This exercise set contains some solved exercises on probability and events.

Exercise 1 A ball is drawn at random from an urn containing colored balls. The balls can be either red or blue (no other colors are possible). The probability of drawing a blue ball is 1=3. What is the probability of drawing a red ball? Solution The sample space F:

can be represented as the union of two disjoint events E and =E[F

where the event E can be described as "a red ball is drawn" and the event F can be described as "a blue ball is drawn". Note that E is the complement of F : E = Fc 5 See

p. 32.

10.6. SOLVED EXERCISES

77

We know P (F ), the probability of drawing a a blue ball: P (F ) =

1 3

We need to …nd P (E), the probability of drawing a a red ball. Using the formula for the probability of a complement: P (E) = P (F c ) = 1

P (F ) = 1

1 2 = 3 3

Exercise 2 Consider a sample space

comprising three possible outcomes: = f! 1 ; ! 2 ; ! 3 g

Suppose the probabilities assigned to the three possible outcomes are P (f! 1 g) = 1=4

P (f! 2 g) = 1=4

P (f! 3 g) = 1=2

Can you …nd an event whose probability is 3=4? Solution There are two events whose probability is 3=4. The …rst one is E = f! 1 ; ! 3 g By using the formula for the probability of a union of disjoint events, we get P (E)

= =

P (f! 1 g [ f! 3 g) = P (f! 1 g) + P (f! 3 g) 3 1 1 + = 4 2 4

The second one is E = f! 2 ; ! 3 g By using the formula for the probability of a union of disjoint events, we obtain: P (E)

= =

P (f! 2 g [ f! 3 g) = P (f! 2 g) + P (f! 3 g) 3 1 1 + = 4 2 4

Exercise 3 Consider a sample space

comprising four possible outcomes: = f! 1 ; ! 2 ; ! 3 ; ! 4 g

Consider the three events E, F and G de…ned as follows: E F G

= f! 1 g = f! 1 ; ! 2 g = f! 1 ; ! 2 ; ! 3 g

78

CHAPTER 10. PROBABILITY Suppose their probabilities are P (E) =

1 10

P (F ) =

5 10

P (G) =

7 10

Now, consider a fourth event H de…ned as follows: H = f! 2 ; ! 4 g Find P (H). Solution First note that, by additivity, P (H) = P (f! 2 g [ f! 4 g) = P (f! 2 g) + P (f! 4 g) Therefore, in order to compute P (H), we need to compute P (f! 2 g) and P (f! 4 g). P (f! 2 g) is found using additivity on F : 5 10

=

P (F ) = P (f! 1 g [ f! 2 g) = P (f! 1 g) + P (f! 2 g)

=

P (E) + P (f! 2 g) =

1 + P (f! 2 g) 10

so that

5 1 4 = 10 10 10 P (f! 4 g) is found using the fact that one minus the probability of an event is equal to the probability of its complement and the fact that f! 4 g = Gc : P (f! 2 g) =

P (f! 4 g) = P (Gc ) = 1

P (G) = 1

7 3 = 10 10

As a consequence, P (H) = P (f! 2 g) + P (f! 4 g) =

3 7 4 + = 10 10 10

Chapter 11

Zero-probability events The notion of a zero-probability event plays a special role in probability theory and statistics, because it underpins the important concepts of almost sure property and almost sure event. In this lecture, we de…ne zero-probability events and discuss some counterintuitive aspects of their apparently simple de…nition, in particular the fact that a zero-probability event is not an event that never happens: there are common probabilistic settings where zero-probability events do happen all the time! After discussing this matter, we introduce the concepts of almost sure property and almost sure event.

11.1

De…nition and discussion

Tautologically, zero-probability events are events whose probability is equal to zero. De…nition 66 Let E be an event1 and denote its probability by P (E). We say that E is a zero-probability event if and only if P (E) = 0 Despite the simplicity of this de…nition, there are some features of zeroprobability events that might seem paradoxical. We illustrate these features with the following example. Example 67 Consider a probabilistic experiment whose set of possible outcomes, called sample space and denoted by , is the unit interval = [0; 1] It is possible to assign probabilities in such a way that each sub-interval has probability equal to its length: [a; b]

[0; 1] ) P ([a; b]) = b

a

The proof that such an assignment of probabilities can be consistently performed is beyond the scope of this example, but you can …nd it in any elementary measure 1 See

the lecture entitled Probability (p. 69) for a de…nition of sample space and event.

79

80

CHAPTER 11. ZERO-PROBABILITY EVENTS

theory book (e.g., Williams2 - 1991). As a direct consequence of this assignment, all the possible outcomes ! 2 have zero probability: 8! 2 ; P (f!g) = P ([!; !]) = !

!=0

Stated di¤ erently, every possible outcome is a zero-probability event. This might seem counterintuitive. In everyday language, a zero-probability event is an event that never happens. However, this example illustrates that a zero-probability event can indeed happen. Since the sample space provides an exhaustive description of the possible outcomes, one and only one of the sample points3 ! 2 will be the realized outcome4 . But we have just demonstrated that all the sample points are zero-probability events; as a consequence, the realized outcome can only be a zeroprobability event. Another apparently paradoxical aspect of this probability model is that the sample space can be obtained as the union of disjoint zero-probability events: [ = f!g !2

where each ! 2 is a zero-probability event and all events in the union are disjoint. If we forgot that the additivity property of probability applies only to countable collections of subsets, we would mistakenly deduce that ! [ X P( ) = P f!g = P (!) = 0 !2

!2

and we would come to a contradiction: P ( ) = 0, when, by the properties of probability5 , it should be P ( ) = 1. Of course, the fallacy in such an argument is that is not a countable set, and hence the additivity property cannot be used. The main lesson to be taken from this example is that a zero-probability event is not an event that never happens: in some probability models, where the sample space is not countable, zero-probability events do happen all the time!

11.2

Almost sure and almost surely

Zero-probability events are of paramount importance in probability and statistics. Often, we want to prove that some property is almost always satis…ed, or something happens almost always. "Almost always" means that the property is satis…ed for all sample points, except possibly for a negligible set of sample points. The concept of zero-probability event is used to determine which sets are negligible: if a set is included in a zero-probability event, then it is negligible. De…nition 68 Let be some property that a sample point ! 2 can either satisfy or not satisfy. Let F be the set of all sample points that satisfy the property: F = f! 2 2 Williams, 3 See

p. 69. 4 See p. 69. 5 See p. 70.

: ! satis…es property

g

D. (1991) Probability with martingales, Cambridge University Press.

11.3. ALMOST SURE EVENTS

81

Denote its complement, that is, the set of all points not satisfying property , by F c . Property is said to be almost sure if there exists a zero-probability event E such that6 F c E. An almost sure property is said to hold almost surely (often abbreviated as a.s.). Sometimes, an almost sure property is also said to hold with probability one (abbreviated as w.p.1).

11.3

Almost sure events

Remember7 that some subsets of the sample space may not be considered events. The above de…nition of almost sure property allows us to consider also sets F that are not, strictly speaking, events. However, in the case in which F is an event, F is called an almost sure event and we say that F happens almost surely. Furthermore, since there exists an event E such that F c E and P (E) = 0, we can apply the monotonicity of probability8 : Fc

E ) P (F c )

P (E)

which in turn implies P (F c ) = 0. Finally, recalling the formula for the probability of a complement9 , we obtain P (F ) = 1

P (F c ) = 1

0=1

Thus, an almost sure event is an event that happens with probability 1. Example 69 Consider the sample space = [0; 1] and the assignment of probabilities introduced in the previous example: [a; b]

[0; 1] ) P ([a; b]) = b

a

We want to prove that the event E = f! 2

: ! is a rational numberg

is a zero-probability event. Since the set of rational numbers is countable10 and E is a subset of the set of rational numbers, E is countable. This implies that the elements of E can be arranged into a sequence: E = f! 1 ; : : : ; ! n ; : : :g Furthermore, E can be written as a countable union: E=

1 [

n=1

f! n g

6 In other words, the set F c of all points that do not satisfy the property is included in a zero-probability event. 7 See the lecture entitled Probability (p. 69). 8 See p. 73 9 See p. 72. 1 0 See p. 32.

82

CHAPTER 11. ZERO-PROBABILITY EVENTS

Applying the countable additivity property of probability11 , we obtain ! 1 1 X [ f! n g = P (f! n g) = 0 P (E) = P n=1

n=1

since P (f! n g) = 0 for every n. Therefore, E is a zero-probability event. This might seem surprising: in this probability model there are also zero-probability events comprising in…nitely many sample points! It can also easily be proved that the event F = f! 2 : ! is an irrational numberg is an almost sure event. In fact F = Ec and applying the formula for the probability of a complement, we get P (F ) = P (E c ) = 1

11.4

P (E) = 1

0=1

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let E and F be two events. Let E c be a zero-probability event and P (F ) = Compute P (E [ F ).

2 3.

Solution E c is a zero-probability event, which means that P (E c ) = 0 Furthermore, using the formula for the probability of a complement, we obtain P (E) = 1 Since (E [ F )

P (E c ) = 1

0=1

E, by monotonicity we obtain P (E [ F )

P (E)

Since P (E) = 1 and probabilities cannot be greater than 1, this implies P (E [ F ) = 1

Exercise 2 Let E and F be two events. Let E c be a zero-probability event and P (F ) = Compute P (E \ F ). 1 1 See

p. 72.

1 2.

11.4. SOLVED EXERCISES

83

Solution E c is a zero-probability event, which means that P (E c ) = 0 Furthermore, using the formula for the probability of a complement, we obtain P (E) = 1

P (E c ) = 1

0=1

It is also true that P (E \ F )

Since (E [ F )

= P (E) + P (F ) P (E [ F ) 3 1 P (E [ F ) = P (E [ F ) = 1+ 2 2

E, by monotonicity, we obtain P (E [ F )

P (E)

Since P (E) = 1 and probabilities cannot be greater than 1, this implies P (E [ F ) = 1 Thus, putting pieces together, we get P (E \ F ) =

3 2

P (E [ F ) =

3 2

1=

1 2

84

CHAPTER 11. ZERO-PROBABILITY EVENTS

Chapter 12

Conditional probability This lecture introduces the concept of conditional probability. A more sophisticated treatment of conditional probability can be found in the lecture entitled Conditional probability as a random variable (p. 201). Before reading this lecture, make sure you are familiar with the concepts of sample space, sample point, event, possible outcome, realized outcome and probability (see the lecture entitled Probability - p. 69).

12.1

Introduction

Let be a sample space and let P (E) denote the probability assigned to the events E . Suppose that, after assigning probabilites P (E) to the events in , we receive new information about the things that will happen (the possible outcomes). In particular, suppose that we are told that the realized outcome will belong to a set I . How should we revise the probabilities assigned to the events in , to properly take the new information into account? Denote by P (E jI ) the revised probability assigned to an event E after learning that the realized outcome will be an element of I. P (E jI ) is called the conditional probability of E given I. Despite being an intuitive concept, conditional probability is quite di¢ cult to de…ne in a rigorous way. We take a gradual approach in this lecture. We …rst discuss conditional probability for the very special case in which all the sample points are equally likely. We then give a more general de…nition. Finally, we refer the reader to other lectures where conditional probability is de…ned in even more abstract ways.

12.2

The case of equally likely sample points

Suppose a sample space

has a …nite number n of sample points ! 1 ; : : : ; ! n : = f! 1 ; : : : ; ! n g

Suppose also that each sample point is assigned the same probability: P (f! 1 g) = : : : = P (f! n g) = 85

1 n

86

CHAPTER 12. CONDITIONAL PROBABILITY

In such a simple space, the probability of a generic event E is obtained as P (E) =

card (E) card ( )

where card denotes the cardinality of a set, i.e. the number of its elements. In other words, the probability of an event E is obtained in two steps: 1. counting the number of "cases that are favorable to the event E", i.e. the number of elements ! i belonging to E; 2. dividing the number thus obtained by the number of "all possible cases", i.e. the number of elements ! i belonging to . For example, if E = f! 1 ; ! 2 g then P (E) =

card (f! 1 ; ! 2 g) 2 = card ( ) n

When we learn that the realized outcome will belong to a set I apply the rule probability of an event =

, we still

number of cases that are favorable to the event number of all possible cases

However, the number of all possible cases is now equal to the number of elements of I, because only the outcomes beloning to I are still possible. Furthermore, the number of favorable cases is now equal to the number of elements of E \ I, because the outcomes in E \ I c are no longer possible. As a consequence: P (E jI ) =

card (E \ I) card (I)

Dividing numerator and denominator by card ( ) one obtains P (E jI ) =

P (E \ I) card (E \ I) =card ( ) = card (I) =card ( ) P (I)

Therefore, when all sample points are equally likely, conditional probabilities are computed as P (E \ I) P (E jI ) = P (I) Example 70 Suppose that we toss a die. Six numbers (from 1 to 6) can appear face up, but we do not yet know which one of them will appear. The sample space is = f1; 2; 3; 4; 5; 6g Each of the six numbers is a sample point and is assigned probability 1=6. De…ne the event E as follows: E = f1; 3; 5g where the event E could be described as "an odd number appears face up". De…ne the event I as follows: I = f4; 5; 6g

12.3. A MORE GENERAL APPROACH

87

where the event I could be described as "a number greater than 3 appears face up". The probability of I is P (I) = P (f4g) + P (f5g) + P (f6g) =

1 1 1 1 + + = 6 6 6 2

Suppose we are told that the realized outcome will belong to I. How do we have to revise our assessment of the probability of the event E, according to the rules of conditional probability? First of all, we need to compute the probability of the event E \ I: 1 P (E \ I) = P (f5g) = 6 Then, the conditional probability of E given I is P (E jI ) =

P (E \ I) = P (I)

1 6 1 2

=

2 1 = 6 3

In the next section, we will show that the conditional probability formula P (E jI ) =

P (E \ I) P (I)

is valid also for more general cases (i.e. when the sample points are not all equally likely). However, this formula already allows us to understand why de…ning conditional probability is a challenging task. In the conditional probability formula, a division by P (I) is performed. This division is impossible when I is a zeroprobability event1 . If we want to be able to de…ne P (E jI ) also when P (I) = 0, then we need to give a more complicated de…nition of conditional probability. We will return to this point later.

12.3

A more general approach

In this section we give a more general de…nition of conditional probability, by taking an axiomatic approach. First, we list the properties that we would like conditional probability to satisfy. Then, we prove that the conditional probability formula introduced above satis…es these properties. The discussion of the case in which the conditional probability formula cannot be used because P (I) = 0 is postponed to the next section. The conditional probability P (E jI ) is required to satisfy the following properties: 1. Probability measure. P (E jI ) has to satisfy all the properties of a probability measure2 . 2. Sure thing. P (I jI ) = 1. 3. Impossible events. If E

I c 3 , then P (E jI ) = 0.

1 I.e.

P (I) = 0; see p. 79. p. 70. 3 I c , the complement of I with respect to to I. 2 See

, is the set of all elements of

that do not belong

88

CHAPTER 12. CONDITIONAL PROBABILITY 4. Constant likelihood ratios on I. If E

I, F

I and P (E) > 0, then:

P (F jI ) P (F ) = P (E jI ) P (E) These properties are very intutitve: 1. Probability measure. This property requires that also conditional probability measures satisfy the fundamental properties that any other probability measure needs to satisfy. 2. Sure thing. This property says that the probability of a sure thing must be 1: since we know that only things belonging to the set I can happen, then the probability of I must be 1. 3. Impossible events. This property says that the probability of an impossible thing must be 0: since we know that things not belonging to the set I will not happen, then the probability of the events that are disjoint from I must be 0. 4. Constant likelihood ratios on I. This property is a bit more complex: it says that if F I is - say - two times more likely than E I before receiving the information I, then F remains two times more likely than E, also after reiceiving the information, because all the things in E and F remain possible (can still happen) and, hence, there is no reason to expect that the ratio of their likelihoods changes. It is possible to prove that: Proposition 71 Whenever P (I) > 0, P (E jI ) satis…es the four above properties if and only if P (E \ I) P (E jI ) = P (I) Proof. We …rst show that P (E jI ) =

P (E \ I) P (I)

satis…es the four properties whenever P (I) > 0. As far as property 1) is concerned, we have to check that the three requirements for a probabilitiy measure are satis…ed. The …rst requirement for a probability measure is that 0 P (E jI ) 1. Since (E \ I) I, by the monotonicity of probability4 we have that: P (E \ I) Hence:

P (E \ I) P (I)

Furthermore, since P (E \ I)

0 and P (I) P (E \ I) P (I)

4 See

p. 73.

P (I)

1 0, also 0

12.3. A MORE GENERAL APPROACH

89

The second requirement for a probability measure is that P ( jI ) = 1. This is satis…ed because P ( \ I) P (I) P ( jI ) = = =1 P (I) P (I) The third requirement for a probability measure is that for any sequence of disjoint sets fE1 ; E2 ; : : : ; En ; : : :g the following holds: ! 1 1 [ X P En jI = P (En jI ) n=1

n=1

But

P

1 [

n=1

En jI

!

1 S

P

= = =

\I

P (I)

P =

En

n=1

=

1 S

n=1

(En \ I)

P (I) n=1 P (En \ I) P (I) 1 X P (En \ I) P1

n=1 1 X

n=1

P (I)

P (En jI )

so that also the third requirement is satis…ed. Property 2) is trivially satis…ed: P (I) P (I \ I) = =1 P (I) P (I)

P (I jI ) = Property 3) is veri…ed because, if E P (E jI ) =

P (E \ I) P (;) = =0 P (I) P (I)

Property 4) is veri…ed because, if E P (F jI ) P (E jI )

I c , then

= =

I, F

I and P (E) > 0, then

P (F \ I) P (I) P (I) P (E \ I) P (F \ I) P (F ) = P (E \ I) P (E)

So, the "if" part has been proved. Now we prove the "only if" part. We prove it by contradiction. Suppose there exist another conditional probability P that satis…es the four properties. Then, there exists an event E, such that P (E jI ) 6= P (E jI ) It can not be that E

I, otherwise we would have

P (E jI ) P (E jI ) P (E jI ) P (E \ I) P (E) = 6= = = 1 1 P (I) P (I) P (I jI )

90

CHAPTER 12. CONDITIONAL PROBABILITY

which would be a contradiction, since if P was a conditional probability it would satisfy P (E jI ) P (E) = P (I) P (I jI ) If E is not a subset of I then P (E jI ) 6= P (E jI ) implies also P (E \ I jI ) 6= P (E \ I jI ) because P (E jI )

= P ((E \ I) [ (E \ I c ) jI ) = P (E \ I jI ) + P (E \ I c jI ) = P (E \ I jI )

P (E jI )

= P ((E \ I) [ (E \ I c ) jI ) = P (E \ I jI ) + P (E \ I c jI ) = P (E \ I jI )

and

but this would also lead to a contradiction, because (E \ I)

12.4

I.

Tackling division by zero

In the previous section we have generalized the concept of conditional probability. However, we have not been able to de…ne the conditional probability P (E jI ) for the case in which P (I) = 0. This case is discussed in the lectures entitled Conditional probability as a random variable (p. 201) and Conditional probability distributions (p. 209).

12.5

More details

12.5.1

The law of total probability

Let I1 , . . . , In be n events having the following characteristics: 1. they are mutually disjoint: Ij \ Ik = ; whenever j 6= k; 2. they cover all the sample space: =

n [

Ij

j=1

3. they have strictly positive probability: P (Ij ) > 0 for any j. I1 , . . . , In is a partition of . The law of total probability states that, for any event E, the following holds: P (E) = P (E jI1 ) P (I1 ) + : : : + P (E jIn ) P (In )

12.6. SOLVED EXERCISES

91

which can, of course, also be written as P (E) =

n X j=1

P (E jIj ) P (Ij )

Proof. The law of total probability is proved as follows: P (E)

= P (E \ ) 0 0 P @E \ @

=

0

= [email protected]

n [

j=1

A

B

n X

=

j=1 n X

=

j=1

n [

j=1

11

Ij AA 1

(E \ Ij )A

P (E \ Ij ) P (E jIj ) P (Ij )

where: in step A we have used the additivity of probability; in step B we have used the conditional probability formula.

12.6

Solved exercises

Some solved exercises on conditional probability can be found below.

Exercise 1 Consider a sample space

comprising three possible outcomes ! 1 , ! 2 , ! 3 : = f! 1 ; ! 2 ; ! 3 g

Suppose the three possible outcomes are assigned the following probabilities: P (! 1 ) = P (! 2 )

=

P (! 3 )

=

1 5 2 5 2 5

De…ne the events E F

= f! 1 ; ! 2 g = f! 1 ; ! 3 g

and denote by E c the complement of E. Compute P (F jE c ), the conditional probability of F given E c .

92

CHAPTER 12. CONDITIONAL PROBABILITY

Solution We need to use the conditional probability formula P (F jE c ) =

P (F \ E c ) P (E c )

The numerator is P (F \ E c ) = P (f! 1 ; ! 3 g \ f! 3 g) = P (f! 3 g) = and the denominator is P (E c ) = P (f! 3 g) =

2 5

2 5

As a consequence: P (F jE c ) =

P (F \ E c ) 2=5 = =1 P (E c ) 2=5

Exercise 2 Consider a sample space

comprising four possible outcomes ! 1 , ! 2 , ! 3 , ! 4 : = f! 1 ; ! 2 ; ! 3 ; ! 4 g

Suppose the four possible outcomes are assigned the following probabilities: 1 P (! 1 ) = 10 4 P (! 2 ) = 10 3 P (! 3 ) = 10 2 P (! 4 ) = 10 De…ne two events E F

= f! 1 ; ! 2 g = f! 2 ; ! 3 g

Compute P (E jF ), the conditional probability of E given F . Solution We need to use the formula P (E jF ) =

P (E \ F ) P (F )

But P (E \ F ) = P (f! 2 g) =

4 10

while, using additivity: P (F ) = P (f! 2 ; ! 3 g) = P (f! 2 g) + P (f! 3 g) = Therefore: P (E jF ) =

4 3 7 + = 10 10 10

P (E \ F ) 4=10 4 = = P (F ) 7=10 7

12.6. SOLVED EXERCISES

93

Exercise 3 The Census Bureau has estimated the following survival probabilities for men: 1. probability that a man lives at least 70 years: 80%; 2. probability that a man lives at least 80 years: 50%. What is the conditional probability that a man lives at least 80 years given that he has just celebrated his 70th birthday? Solution Given an hypothetical sample space E F

= f! 2 = f! 2

, de…ne the two events:

: man lives at least 70 yearsg : man lives at least 80 yearsg

We need to …nd the following conditional probability P (F jE ) =

P (F \ E) P (E)

The denominator is known: P (E) = 80% =

4 5

As far as the numerator is concerned, note that F E (if you live at least 80 years then you also live at least 70 years). But F E implies F \E =F Therefore: P (F \ E) = P (F ) = 50% = Thus: P (F jE ) =

1 2

P (F \ E) 1=2 5 = = P (E) 4=5 8

94

CHAPTER 12. CONDITIONAL PROBABILITY

Chapter 13

Bayes’rule This lecture introduces Bayes’rule. Before reading this lecture, make sure you are familiar with the concept of conditional probability (p. 85).

13.1

Statement of Bayes’rule

Bayes’ rule, named after the English mathematician Thomas Bayes, is a rule for computing conditional probabilities. Let A and B be two events. Denote their probabilities by P (A) and P (B) and suppose that both P (A) > 0 and P (B) > 0. Denote by P (A jB ) the conditional probability of A given B and by P (B jA ) the conditional probability of B given A. Bayes’rule states that: P (A jB ) =

P (B jA ) P (A) P (B)

Proof. Take the conditional probability formulae P (A jB ) =

P (A \ B) P (B)

P (B jA ) =

P (A \ B) P (A)

and

Re-arrange the second formula: P (A \ B) = P (B jA ) P (A) and plug it into the …rst formula: P (A jB ) =

P (A \ B) P (B jA ) P (A) = P (B) P (B)

The following example shows how Bayes’ rule can be applied in a practical situation. 95

96

CHAPTER 13. BAYES’RULE

Example 72 An HIV test gives a positive result with probability 98% when the patient is indeed a¤ ected by HIV, while it gives a negative result with 99% probability when the patient is not a¤ ected by HIV. If a patient is drawn at random from a population in which 0,1% of individuals are a¤ ected by HIV and he is found positive, what is the probability that he is indeed a¤ ected by HIV? In probabilistic terms, what we know about this problem can be formalized as follows: P (positive jHIV ) = P (positive jNO HIV ) = P (HIV) = P (NO HIV) =

0:98 1 0:99 = 0:01 0:001 1 0:001 = 0:999

The unconditional probability of being found positive can be derived using the law of total probability1 : P (positive)

= P (positive jHIV ) P (HIV) + P (positive jNO HIV ) P (NO HIV) = 0:98 0:001 + 0:01 0:999 = 0:00098 + 0:00999 = 0:01097

Using Bayes’ rule: P (positive jHIV ) P (HIV) P (positive) 0:98 0:001 0:00098 = = 0:01097 0:01097 ' 0:08933

P (HIV jpositive ) =

Therefore, even if the test is conditionally very accurate, the unconditional probability of being a¤ ected by HIV when found positive is less than 10 per cent!

13.2

Terminology

The quantities involved in Bayes’rule P (A jB ) =

P (B jA ) P (A) P (B)

often take the following names: 1. P (A) is called prior probability or, simply, prior; 2. P (B jA ) is called conditional probability or likelihood; 3. P (B) is called marginal probability; 4. P (A jB ) is called posterior probability or, simply, posterior.

13.3

Solved exercises

Below you can …nd some exercises with explained solutions. 1 See

p. 90.

13.3. SOLVED EXERCISES

97

Exercise 1 There are two urns containing colored balls. The …rst urn contains 50 red balls and 50 blue balls. The second urn contains 30 red balls and 70 blue balls. One of the two urns is randomly chosen (both urns have probability 50% of being chosen) and then a ball is drawn at random from one of the two urns. If a red ball is drawn, what is the probability that it comes from the …rst urn? Solution In probabilistic terms, what we know about this problem can be formalized as follows: 1 2 3 P (red jurn 2 ) = 10 1 P (urn1) = 2 1 P (urn 2) = 2

P (red jurn 1 ) =

The unconditional probability of drawing a red ball can be derived using the law of total probability: P (red)

= =

P (red jurn 1 ) P (urn 1) + P (red jurn 2 ) P (urn 2) 1 1 3 1 1 3 5+3 2 + = + = = 2 2 10 2 4 20 20 5

Using Bayes’rule we obtain: P (urn 1 jred )

= =

P (red jurn 1 ) P (urn 1) P (red) 1 2

1 2 2 5

=

5 1 5 = 4 2 8

Exercise 2 An economics consulting …rm has created a model to predict recessions. The model predicts a recession with probability 80% when a recession is indeed coming and with probability 10% when no recession is coming. The unconditional probability of falling into a recession is 20%. If the model predicts a recession, what is the probability that a recession will indeed come? Solution What we know about this problem can be formalized as follows: P (rec. pred. jrec. coming ) =

8 10

P (rec. pred. jrec. not coming ) = P (rec. coming) =

2 10

1 10

98

CHAPTER 13. BAYES’RULE P (rec. not coming) = 1

P (rec. coming) = 1

2 8 = 10 10

The unconditional probability of predicting a recession can be derived using the law of total probability: P (rec. pred.)

=

P (rec. pred. jrec. coming ) P (rec. coming) +P (rec. pred. jrec. not coming ) P (rec. not coming) 8 2 1 8 24 = + = 10 10 10 10 100

Using Bayes’rule we obtain: P (rec. coming jrec. pred. )

= =

P (rec. pred. jrec. coming ) P (rec. coming) P (rec. pred.) 8 10

2 10

24 100

=

16 100 2 = 100 24 3

Exercise 3 Alice has two coins in her pocket, a fair coin (head on one side and tail on the other side) and a two-headed coin. She picks one at random from her pocket, tosses it and obtains head. What is the probability that she ‡ipped the fair coin? Solution What we know about this problem can be formalized as follows: 1 2 P (head junfair coin ) = 1 1 P (fair coin) = 2 1 P (unfair coin) = 2 P (head jfair coin ) =

The unconditional probability of obtaining head can be derived using the law of total probability: P (head)

=

P (head jfair coin ) P (fair coin) +P (head junfair coin ) P (unfair coin) 1 1 1 1 2 3 = +1 = + = 2 2 2 4 4 4

Using Bayes’rule we obtain: P (fair coin jhead )

= =

P (head jfair coin ) P (fair coin) P (head) 1 2

1 2 3 4

=

1 4 1 = 4 3 3

Chapter 14

Independent events This lecture introduces the notion of independent event. Before reading this lecture, make sure you are familiar with the concept of conditional probability1 .

14.1

De…nition of independent event

Two events E and F are said to be independent if the occurrence of E makes it neither more nor less probable that F occurs and, conversely, if the occurrence of F makes it neither more nor less probable that E occurs. In other words, after receiving the information that E will happen, we revise our assessment of the probability that F will happen, computing the conditional probability of F given E; if F is independent of E, the probability of F remains the same as it was before receiving the information: P (F jE ) = P (F ) (14.1) Conversely, P (E jF ) = P (E)

(14.2)

In standard probability theory, rather than characterizing independence by the above two properties, independence is characterized in a more compact way. De…nition 73 Two events E and F are independent if and only if P (E \ F ) = P (E) P (F ) This de…nition implies properties (14.1) and (14.2) above: if E and F are independent, and (say) P (E) > 0, then P (F jE ) =

P (E \ F ) P (E) P (F ) = = P (F ) P (E) P (E)

Example 74 An urn contains four balls B1 , B2 , B3 and B4 . We draw one of them at random. The sample space is = fB1 ; B2 ; B3 ; B4 g 1 See

p. 85.

99

100

CHAPTER 14. INDEPENDENT EVENTS

Each of the four balls has the same probability of being drawn, equal to 14 , i.e., P (fB1 g) = P (fB2 g) = P (fB3 g) = P (fB4 g) =

1 4

De…ne the events E and F as follows: E F

= fB1 ; B2 g = fB2 ; B3 g

Their respective probabilities are P (E)

=

P (F )

=

1 1 1 + = 4 4 2 1 1 1 P (fB2 g [ fB3 g) = P (fB2 g) + P (fB3 g) = + = 4 4 2 P (fB1 g [ fB2 g) = P (fB1 g) + P (fB2 g) =

The probability of the event E \ F is P (E \ F ) = P (fB1 ; B2 g \ fB2 ; B3 g) = P (fB2 g) =

1 4

Hence, 1 11 = = P (E \ F ) 22 4 As a consequence, E and F are independent. P (E) P (F ) =

14.2

Mutually independent events

The de…nition of independence can be extended also to collections of more than two events. De…nition 75 Let E1 , . . . , En be n events. E1 , . . . , En are jointly independent (or mutually independent) if and only if, for any sub-collection of k events (k n) Ei1 , . . . , Eik , we have that 0 1 k k \ Y @ A P Eij = P Eij j=1

j=1

Let E1 , . . . , En be a collection of n events. It is important to note that even if all the possible couples of events are independent (i.e., Ei is independent of Ej for any j 6= i), this does not imply that the events E1 , . . . , En are jointly independent. This is proved with a simple counter-example:

Example 76 Consider the experiment presented in the previous example (extracting a ball from an urn that contains four balls). De…ne the events E, F and G as follows: E F G

= fB1 ; B2 g = fB2 ; B3 g = fB2 ; B4 g

14.3. ZERO-PROBABILITY EVENTS AND INDEPENDENCE

101

It is immediate to show that 1 P (E \ F ) = = P (E) P (F ) =) E and F are independent 4 1 P (E \ G) = = P (E) P (G) =) E and G are independent 4 1 = P (F ) P (G) =) F and G are independent P (F \ G) = 4 Thus, all the possible couple of events in the collection E, F , G are independent. However, the three events are not jointly independent because 1 1 6= = P (E) P (F ) P (G) 4 8 On the contrary, it is obviously true that if E1 , . . . , En are jointly independent, then Ei is independent of Ej for any j 6= i. P (E \ F \ G) = P (fB2 g) =

14.3

Zero-probability events and independence

Proposition 77 If E is a zero-probability event2 , then E is independent of any other event F . Proof. Note that (E \ F )

E

As a consequence, by the monotonicity of probability3 , we have that P (E \ F )

P (E)

But P (E) = 0, so P (E \ F ) 0. Since probabilities cannot be negative, it must be P (E \ F ) = 0. The latter fact implies independence: P (E \ F ) = 0 = 0 P (F ) = P (E) P (F )

14.4

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Suppose that we toss a die. Six numbers (from 1 to 6) can appear face up, but we do not yet know which one of them will appear. The sample space is = f1; 2; 3; 4; 5; 6g Each of the six numbers is a sample point and is assigned probability the events E and F as follows: E F

= f1; 3; 4g = f3; 4; 5; 6g

Prove that E and F are independent. 2 See 3 See

p. 79. p. 73.

1 6.

De…ne

102

CHAPTER 14. INDEPENDENT EVENTS

Solution The probability of E is P (E)

= =

P (f1g) + P (f3g) + P (f4g) 1 1 1 1 + + = 6 6 6 2

The probability of F is P (F )

=

P (f3g) + P (f4g) + P (f5g) + P (f6g) 1 1 1 1 4 2 + + + = = 6 6 6 6 6 3

= The probability of E \ F is P (E \ F )

= =

P (f1; 3; 4g \ f3; 4; 5; 6g) = P (f3; 4g) 1 1 2 1 P (f3g) + P (f4g) = + = = 6 6 6 3

The events E and F are independent because P (E \ F ) =

1 1 2 = = P (E) P (F ) 3 2 3

Exercise 2 A …rm undertakes two projects, A and B. The probabilities of having a successful outcome are 43 for project A and 12 for project B. The probability that both 7 projects will have a successful outcome is 16 . Are the outcomes of the two projects independent? Solution Denote by E the event "project A is successful", by F the event "project B is successful" and by G the event "both projects are successful". The event G can be expressed as G=E\F If E and F are independent, it must be that P (G)

= =

P (E \ F ) = P (E) P (F ) 3 1 3 7 = 6= 4 2 8 16

Therefore, the outcomes of the two projects are not independent.

Exercise 3 A …rm undertakes two projects, A and B. The probabilities of having a successful outcome are 23 for project A and 45 for project B. What is the probability that neither of the two projects will have a successful outcome if their outcomes are independent?

14.4. SOLVED EXERCISES

103

Solution Denote by E the event "project A is successful", by F the event "project B is successful" and by G the event "neither of the two projects is successful". The event G can be expressed as G = Ec \ F c

where E c and F c are the complements of E and F . Thus, the probability that neither of the two projects will have a successful outcome is P (G)

= P (E c \ F c )

A

=

B

P ((E [ F ) )

=

1

C

P (E [ F )

= =

1 1

D

(P (E) + P (F ) P (E \ F )) P (E) P (F ) + P (E \ F )

=

1

= =

c

P (E) P (F ) + P (E) P (F ) 2 4 2 4 1 + 3 5 3 5 15 10 12 + 8 1 = 15 15

where: in step A we have used De Morgan’s law4 ; in step B we have used the formula for the probability of a complement5 ; in step C we have used the formula for the probability of a union6 ; in step D we have used the fact that E and F are independent.

4 See

p. 7. p. 72. 6 See p. 73. 5 See

104

CHAPTER 14. INDEPENDENT EVENTS

Chapter 15

Random variables This lecture introduces the concept of random variable. Before reading this lecture, make sure you are familiar with the concepts of sample space, sample point, event, possible outcome, realized outcome, and probability (see the lecture entitled Probability - p. 69).

15.1

De…nition of random variable

A random variable is a variable whose value depends on the outcome of a probabilistic experiment. Its value is a priori unknown, but it becomes known once the outcome of the experiment is realized. Denote by the sample space, that is, the set of all possible outcomes of the experiment. A random variable associates a real number to each element of , as stated by the following de…nition. De…nition 78 A random variable X is a function from the sample space the set of real numbers R: X: !R

to

In rigorous (measure-theoretic) probability theory, the function X is also required to be measurable1 . The real number X (!) associated to a sample point ! 2 is called a realization of the random variable. The set of all possible realizations is called support and it is denoted by RX . Some remarks on notation are in order: 1. the dependence of X on ! is often omitted, i.e., we simply write X instead of X (!); 2. if A

R, the exact meaning of the notation P (X 2 A) is the following: P (X 2 A) = P (f! 2

3. if A

: X (!) 2 Ag)

R, we sometimes use the notation PX (A) with the following meaning: PX (A) = P (X 2 A) = P (f! 2

1 See

: X (!) 2 Ag)

below the subsection entitled A more rigorous de…nition of random variable (p. 109).

105

106

CHAPTER 15. RANDOM VARIABLES In this case, PX is to be interpreted as a probability measure on the set of real numbers, induced by the random variable X. Often, statisticians construct probabilistic models where a random variable X is de…ned by directly specifying PX , without specifying the sample space .

Example 79 Suppose that we ‡ip a coin. The possible outcomes are either tail (T ) or head (H), i.e., = fT; Hg The two outcomes are assigned equal probabilities: P (fT g) = P (fHg) =

1 2

If tail (T ) is the outcome, we win one dollar, if head (H) is the outcome we lose one dollar. The amount X we win (or lose) is a random variable, de…ned as 1 if ! = T 1 if ! = H

X (!) = The probability of winning one dollar is P (X = 1) = P (f! 2

: X (!) = 1g) = P (fT g) =

1 2

The probability of losing one dollar is P (X =

1) = P (f! 2

: X (!) =

1g) = P (fHg) =

1 2

The probability of losing two dollars is P (X =

2) = P (f! 2

: X (!) =

2g) = P (;) = 0

Most of the time, statisticians deal with two special kinds of random variables: 1. discrete random variables; 2. absolutely continuous random variables. We de…ne these two kinds of random variables below.

15.2

Discrete random variables

Discrete random variables are de…ned as follows. De…nition 80 A random variable X is discrete if 1. its support RX is a countable set2 ; 2. there is a function pX : R ! [0; 1], called the probability mass function (or pmf or probability function) of X, such that, for any x 2 R: pX (x) = 2 See

P (X = x) if x 2 RX 0 if x 2 = RX

the lecture entitled Sequences and Limits (p. 32) for a de…nition of countable set.

15.3. ABSOLUTELY CONTINUOUS RANDOM VARIABLES

107

The following is an example of a discrete random variable. Example 81 Let X be a discrete random variable that can take only two values: 1 with probability q and 0 with probability 1 q, where 0 q 1. Its support is RX = f0; 1g Its probability mass function is 8 < q 1 pX (x) = : 0

q

if x = 1 if x = 0 otherwise

The properties of probability mass functions are discussed in more detail in the lecture entitled Legitimate probability mass functions (p. 247). We anticipate here that probability mass functions are characterized by two fundamental properties: 1. non-negativity: pX (x)

0 for any x 2 R; P 2. sum over the support equals 1: x2RX pX (x) = 1.

It turns out not only that any probability mass function must satisfy these two properties, but also that any function satisfying these two properties is a legitimate probability mass function. You can …nd a detailed discussion of this fact in the aforementioned lecture.

15.3

Absolutely continuous random variables

Absolutely continuous random variables are de…ned as follows. De…nition 82 A random variable X is absolutely continuous if 1. its support RX is not countable; 2. there is a function fX : R ! [0; 1], called the probability density function (or pdf or density function) of X, such that, for any interval [a; b] R, P (X 2 [a; b]) =

Z

b

fX (x) dx

a

Absolutely continuous random variables are often called continuous random variables, omitting the adverb absolutely. The following is an example of an absolutely continuous random variable. Example 83 Let X be an absolutely continuous random variable that can take any value in the interval [0; 1]. All sub-intervals of equal length are equally likely. Its support is RX = [0; 1] Its probability density function is fX (x) =

1 if x 2 [0; 1] 0 otherwise

108

CHAPTER 15. RANDOM VARIABLES

The probability that the realization of X belongs, for example, to the interval is Z 3=4 Z 3=4 P (X 2 [a; b]) = fX (x) dx = dx 1=4

=

3=4

[x]1=4

1 3 4; 4

1=4

3 = 4

1 1 = 4 2

The properties of probability density functions are discussed in more detail in the lecture entitled Legitimate probability density functions (p. 251). We anticipate here that probability density functions are characterized by two fundamental properties: 1. non-negativity: fX (x)

0 for any x 2 R; R1 2. integral over R equals 1: f (x) dx = 1. 1 X

It turns out not only that any probability density function must satisfy these two properties, but also that any function satisfying these two properties is a legitimate probability density function. You can …nd a detailed discussion of this fact in the aforementioned lecture.

15.4

Random variables in general

Random variables, also those that are neither discrete nor absolutely continuous, are often characterized in terms of their distribution function. De…nition 84 Let X be a random variable. The distribution function (or cumulative distribution function or cdf ) of X is a function FX : R ! [0; 1] such that FX (x) = P (X x) ; 8x 2 R If we know the distribution function of a random variable X, then we can easily compute the probability that X belongs to an interval (a; b] R, as P (a < X

b) = FX (b)

FX (a)

Proof. Note that ( 1; b] = ( 1; a] [ (a; b]

where the two sets on the right hand side are disjoint. Hence, by additivity, we get FX (b)

= = = = = =

P (X 2 ( 1; b]) PX (( 1; b]) PX (( 1; a] [ (a; b]) PX (( 1; a]) + PX ((a; b]) P (X a) + P (a < X x) FX (a) + P (a < X x)

Rearranging terms, we obtain P (a < X

x) = FX (b)

FX (a)

15.5. MORE DETAILS

109

15.5

More details

15.5.1

Derivative of the distribution function

If X is an absolutely continuous random variable, then its distribution function can be written as Z x FX (x) = fX (t) dt 1

Hence, by taking the derivative with respect to x of both sides of the above equation, we obtain dFX (x) = fX (x) dx

15.5.2

Continuous variables and zero-probability events

If X is an absolutely continuous random variable, then the probability that X takes on any speci…c value x 2 RX is equal to zero: Z x P (X = x) = fX (t) dt = 0 x

Thus, the event f! : X (!) = xg is a zero-probability event for any x 2 RX . The lecture entitled Zero-probability events (p. 79) contains a thorough discussion of this apparently paradoxical fact: although it can happen that X (!) = x, the event f! : X (!) = xg has zero probability of happening.

15.5.3

A more rigorous de…nition of random variable

Random variables can be de…ned in a more rigorous manner using the terminology of measure theory. Let ( ; F; P) be a probability space3 . Let X be a function X : ! R. Let B (R) be the Borel sigma-algebra of R, i.e. the smallest sigma-algebra containing all the open subsets of R. If, for any B 2 B (R), f! 2

: X (!) 2 Bg 2 F

then X is a random variable on . If X satis…es this property, then it is possible to de…ne P (X 2 B) := P (f! 2

: X (!) 2 Bg)

for any B 2 B (R) and the probability on the right hand side is well-de…ned because the set f! 2 : X (!) 2 Bg is measurable by the very de…nition of random variable.

15.6

Solved exercises

Below you can …nd some exercises with explained solutions. 3 See

p. 76.

110

CHAPTER 15. RANDOM VARIABLES

Exercise 1 Let X be a discrete random variable. Let its support RX be RX = f0; 1; 2; 3; 4g Let its probability mass function pX (x) be 1=5 if x 2 RX 0 if x 2 = RX

pX (x) = Compute P (1

X < 4)

Solution By using the additivity of probability, we have P (1

X < 4) = P (fX = 1g [ fX = 2g [ fX = 3g) = P (fX = 1g) + P (fX = 2g) + P (fX = 3g) 1 1 1 3 = pX (1) + pX (2) + pX (3) = + + = 5 5 5 5

Exercise 2 Let X be a discrete random variable. Let its support RX be the set of the …rst 20 natural numbers: RX = f1; 2; : : : ; 19; 20g Let its probability mass function pX (x) be pX (x) =

x=210 if x 2 RX 0 if x 2 = RX

Compute the probability P (X > 17) Solution By the additivity of probability, we have P (X > 17)

= P (fX = 18g [ fX = 19g [ fX = 20g) = P (fX = 18g) + P (fX = 19g) + P (fX = 20g) = pX (18) + pX (19) + pX (20) 19 20 57 19 18 + + = = = 210 210 210 210 70

Exercise 3 Let X be a discrete random variable. Let its support RX be RX = f0; 1; 2; 3g

15.6. SOLVED EXERCISES

111

Let its probability mass function pX (x) be 1 x 4

3 x

pX (x) =

0

where

3 x

=

3 3 x 4

if x 2 RX if x 2 = RX

3! x! (3 x)!

is a binomial coe¢ cient4 . Calculate the probability P (X < 3) Solution First note that, by additivity, we have P (X < 3)

= P (fX = 0g [ fX = 1g [ fX = 2g) = P (fX = 0g) + P (fX = 1g) + P (fX = 2g) = pX (0) + pX (1) + pX (2)

Therefore, in order to compute P (X < 3), we need to evaluate the probability mass function at the three points x = 0 , x = 1 and x = 2: pX (0) =

pX (1)

3 0

0

3 4

3 0

=

1

3 1 4 4 27 1 9 = 3 4 16 64 3 1

= =

pX (2) =

1 4

3 2

1 4

2

3 4

3! 27 27 1 = 0!3! 64 64

3 1

=

3 2

=

3! 1 9 1!2! 4 16

3! 1 3 9 = 2!1! 16 4 64

Finally, P (X < 3)

= pX (0) + pX (1) + pX (2) 27 27 9 63 = + + = 64 64 64 64

Exercise 4 Let X be an absolutely continuous random variable. Let its support RX be RX = [0; 1] Let its probability density function fX (x) be 1 if x 2 RX 0 if x 2 = RX

fX (x) = Compute P 4 See

p. 22.

1 2

X

2

112

CHAPTER 15. RANDOM VARIABLES

Solution The probability that an absolutely continuous random variable takes a value in a given interval is equal to the integral of the probability density function over that interval: Z 2 1 1 P X 2 = P X2 ;2 = fX (x) dx 2 2 1=2 Z 1 1 1 1 = = dx = [x]1=2 = 1 2 2 1=2

Exercise 5 Let X be an absolutely continuous random variable. Let its support RX be RX = [0; 1] Let its probability density function fX (x) be 2x if x 2 RX 0 if x 2 = RX

fX (x) = Compute 1 4

P

1 2

X

Solution The probability that an absolutely continuous random variable takes a value in a given interval is equal to the integral of the probability density function over that interval: Z 1=2 1 1 1 1 P X = P X2 ; = fX (x) dx 4 2 4 2 1=4 Z 1=2 1 3 1 1=2 = = 2xdx = x2 1=4 = 4 16 16 1=4

Exercise 6 Let X be an absolutely continuous random variable. Let its support RX be RX = [0; 1) Let its probability density function fX (x) be fX (x) =

exp (

x) if x 2 RX if x 2 = RX

P (X

1)

0

where > 0. Compute

15.6. SOLVED EXERCISES

113

Solution The probability that an absolutely continuous random variable takes a value in a given interval is equal to the integral of the probability density function over that interval: Z 1 fX (x) dx P (X 1) = P (X 2 [1; 1)) = 1 Z 1 1 = exp ( x) dx = [ exp ( x)]1 1

=

0

( exp (

)) = exp (

)

114

CHAPTER 15. RANDOM VARIABLES

Chapter 16

Random vectors This lecture introduces the concept of random vector, which is a multidimensional generalization of the concept of random variable. Before reading this lecture, make sure you are familiar with the concepts of sample space, sample point, event, possible outcome, realized outcome and probability (see the lecture entitled Probability - p. 69) and with the concept of random variable (see the lecture entitled Random variables - p. 105).

16.1

De…nition of random vector

Suppose that we conduct a probabilistic experiment and that the possible outcomes of the experiment are described by a sample space . A random vector is a vector whose value depends on the outcome of the experiment, as stated by the following de…nition. De…nition 85 Let be a sample space. A random vector X is a function from the sample space to the set of K-dimensional real vectors RK : X:

! RK

In rigorous probability theory, the function X is also required to be measurable1 . The real vector X (!) associated to a sample point ! 2 is called a realization of the random vector. The set of all possible realizations is called support and it is denoted by RX . Denote by P (E) the probability of an event E . When dealing with random vectors, the following conventions are used: If A

RK , we often write P (X 2 A) with the meaning P (X 2 A) = P (f! 2

If A

: X (!) 2 Ag)

RK , we sometimes use the notation PX (A) with the meaning PX (A) = P (X 2 A) = P (f! 2

: X (!) 2 Ag)

In applied work, it is very commonplace to build statistical models where a random vector X is de…ned by directly specifying PX , omitting the speci…cation of the sample space altogether. 1 See

below the subsection entitled A more rigorous de…nition of random vector (p. 121).

115

116

CHAPTER 16. RANDOM VECTORS We often write X instead of X (!), omitting the dependence on !.

Example 86 Two coins are tossed. The possible outcomes of each toss can be either tail (T ) or head (H). The sample space is = fT T; T H; HT; HHg The four possible outcomes are assigned equal probabilities: P (fT T g) = P (fT Hg) = P (fHT g) = P (fHHg) =

1 4

If tail (T ) is the outcome, we win one dollar, if head (H) is the outcome we lose one dollar. A 2-dimensional random vector X indicates the amount we win (or lose) on each toss: 8 1 1 if ! = T T > > < 1 1 if ! = T H X (!) = 1 1 if ! = HT > > : 1 1 if ! = HH The probability of winning one dollar on both tosses is P X=

1

1

= =

P

!2

: X (!) = 1 P (fT T g) = 4

1

1

The probability of losing one dollar on the second toss is P (X2 =

1)

=

P (f! 2

: X2 (!) =

1g)

=

P (fT H; HHg) = P (fT Hg) + P (fHHg) =

1 2

The next sections deal with discrete random vectors and absolutely continuous random vectors, two kinds of random vectors that have special properties and are often found in applications.

16.2

Discrete random vectors

Discrete random vectors are de…ned as follows. De…nition 87 A random vector X is discrete if: 1. its support RX is a countable set2 ; 2. there is a function pX : RK ! [0; 1], called the joint probability mass function (or joint pmf, or joint probability function) of X, such that, for any x 2 RK : P (X = x) if x 2 RX pX (x) = 0 if x 2 = RX 2 See

the lecture entitled Sequences and Limits (p. 32) for a de…nition of countable set.

16.3. ABSOLUTELY CONTINUOUS RANDOM VECTORS

117

The following notations are used interchangeably to indicate the joint probability mass function: pX (x) = pX (x1 ; : : : ; xK ) = pX1 ;:::;XK (x1 ; : : : ; xK ) In the second and third notation the K components of the random vector X are explicitly indicated. Example 88 Suppose X is a 2-dimensional random vector whose components X1 and X2 can take only two values: 1 or 0. Furthermore, the four possible combinations of 0 and 1 are all equally likely. X is an example of a discrete random vector. Its support is 1 1 0 0 RX = ; ; ; 1 0 1 0 Its probability mass function is

16.3

8 1=4 if x = 1 > > > > > < 1=4 if x = 1 pX (x) = 1=4 if x = 0 > > > > > : 1=4 if x = 0 0 otherwise

1 0 1 0

> > > >

Absolutely continuous random vectors

Absolutely continuous random vectors are de…ned as follows. De…nition 89 A random vector X is absolutely continuous (or, simply, continuous) if: 1. its support RX is not countable; 2. there is a function fX : RK ! [0; 1], called the joint probability density function (or joint pdf or joint density function) of X, such that, for any set A RK where A = [a1 ; b1 ] : : : [aK ; bK ] the probability that X belongs to A can be calculated as follows: P (X 2 A) =

Z

b1

a1

:::

Z

bK

fX (x1 ; : : : ; xK ) dxK : : : dx1

aK

provided the above multiple integral is well de…ned. The following notations are used interchangeably to indicate the joint probability density function: fX (x) = fX (x1 ; : : : ; xK ) = fX1 ;:::;XK (x1 ; : : : ; xK ) In the second and third notation the K components of the random vector X are explicitly indicated.

118

CHAPTER 16. RANDOM VECTORS

Example 90 Suppose X is a 2-dimensional random variable whose components X1 and X2 are independent uniform3 random variables on the interval [0; 1]. X is an example of an absolutely continuous random variable. Its support is RX = [0; 1]

[0; 1]

Its probability density function is fX (x) =

1 if x 2 [0; 1] 0 otherwise

[0; 1]

The probability that the realization of X falls in the rectangle [0; 1=2] [0; 1=2] is Z 1=2 Z 1=2 1 1 P X 2 0; 0; = fX (x1 ; x2 ) dx2 dx1 2 2 0 0 Z 1=2 Z 1=2 Z 1=2 1=2 = dx2 dx1 = [x2 ]0 dx1 0

= =

16.4

Z

0

1=2

1 1 dx1 = 2 2 0 1 1 1 = 2 2 4

Z

0

1=2

dx1 =

0

1 1=2 [x1 ]0 2

Random vectors in general

Random vectors, also those that are neither discrete nor absolutely continuous, are often characterized in terms of their joint distribution function: De…nition 91 Let X be a random vector. The joint distribution function (or joint df, or joint cumulative distribution function, or joint cdf ) of X is a function FX : RK ! [0; 1] such that FX (x) = P (X1

x1 ; : : : ; XK

xK ) ; 8x 2 RK

where the components of X and x are denoted by Xk and xk respectively, for k = 1; : : : ; K. The following notations are used interchangeably to indicate the joint distribution function: FX (x) = FX (x1 ; : : : ; xK ) = FX1 ;:::;XK (x1 ; : : : ; xK ) In the second and third notation the K components of the random vector X are explicitly indicated. Sometimes, we talk about the joint distribution of a random vector, without specifying whether we are referring to the joint distribution function, or to the joint probability mass function (in the case of discrete random vectors), or to the joint probability density function (in the case of absolutely continuous random vectors). This ambiguity is legitimate, since: 1. the joint probability mass function completely determines (and is completely determined by) the joint distribution function of a discrete random vector; 3 See

p. 359.

16.5. MORE DETAILS

119

2. the joint probability density function completely determines (and is completely determined by) the joint distribution function of an absolutely continuous random vector. In the remainder of this lecture, we use the term joint distribution when we are making statements that apply both to the distribution function and to the probability mass (or density) function of a random vector.

16.5

More details

16.5.1

Random matrices

A random matrix is a matrix whose entries are random variables. It is not necessary to develop a separate theory for random matrices, because a random matrix can always be written as a random vector. Given a K L random matrix A, its vectorization, denoted by vec (A), is the KL 1 random vector obtained by stacking the columns of A on top of each other. Example 92 Let A be the following 2 A=

2 random matrix:

A11 A21

The vectorization of A is the following 4

A12 A22 1 random vector:

2

3 A11 6 A21 7 7 vec (A) = 6 4 A12 5 A22 When vec (A) is a discrete random vector, then we say that A is a discrete random matrix and the joint probability mass function of A is just the joint probability mass function of vec (A). By the same token, when vec (A) is an absolutely continuous random vector, then we say that A is an absolutely continuous random matrix and the joint probability density function of A is just the joint probability density function of vec (A).

16.5.2

Marginal distribution of a random vector

Let Xi be the i-th component of a K-dimensional random vector X. The distribution function FXi (x) of Xi is called marginal distribution function of Xi . If X is discrete, then Xi is a discrete random variable4 and its probability mass function pXi (x) is called marginal probability mass function of Xi . If X is absolutely continuous, then Xi is an absolutely continuous random variable5 and its probability density function fXi (x) is called marginal probability density function of Xi . 4 See 5 See

p. 106. p. 107.

120

16.5.3

CHAPTER 16. RANDOM VECTORS

Marginalization of a joint distribution

The process of deriving the distribution of a component Xi of a random vector X from the joint distribution of X is known as marginalization. Marginalization can also have a broader meaning: it can refer to the act of deriving the joint distribution of a subset of the set of components of X from the joint distribution of X. For example, if X is a random vector having three components X1 , X2 and X3 , we can marginalize the joint distribution of X1 , X2 and X3 to …nd the joint distribution of X1 and X2 ; in this case we say that X3 is marginalized out of the joint distribution of X1 , X2 and X3 .

16.5.4

Marginal distribution of a discrete random vector

Let Xi be the i-th component of a K-dimensional discrete random vector X. The marginal probability mass function of Xi can be derived from the joint probability mass function of X as follows: X pXi (x) = pX (x1 ; : : : ; xK ) (x1 ;:::;xK )2RX :xi =x

In other words, the probability that Xi = x is obtained summing the probabilities of all the vectors of RX whose i-th component is equal to x.

16.5.5

Marginalization of a discrete distribution

Let Xi be the i-th component of a discrete random vector X. Marginalizing Xi out of the joint distribution of X, we obtain the joint distribution of the remaining components of X, i.e. we obtain the joint distribution of the random vector X i de…ned as follows: X i = [X1 : : : Xi 1 Xi+1 : : : XK ] The joint probability mass function of X i is X pX i (x1 ; : : : ; xi 1 ; xi+1 ; : : : ,xK ) = pX (x1 ; : : : ; xi

1 ; xi ; xi+1 ; : : : ,xK )

xi 2RXi

In other words, the joint probability mass function of X i is obtained summing the joint probability mass function of X over all values xi that belong to the support of Xi .

16.5.6

Marginal distribution of a continuous random vector

Let Xi be the i-th component of a K-dimensional absolutely continuous random vector X. The marginal probability density function of Xi can be derived from the joint probability density function of X as follows: Z 1 Z 1 fX (x1 ; : : : ; xi 1 ; x; xi+1 ; : : : ; xK ) dxK : : : dxi+1 dxi 1 : : : dx1 ::: fXi (x) = 1

1

In other words, the joint probability density function, evaluated at xi = x, is integrated with respect to all variables except xi (so it is integrated a total of K 1 times).

16.6. SOLVED EXERCISES

16.5.7

121

Marginalization of a continuous distribution

Let Xi be the i-th component of an absolutely continuous random vector X. Marginalizing Xi out of the joint distribution of X, we obtain the joint distribution of the remaining components of X, i.e. we obtain the joint distribution of the random vector X i de…ned as follows: X

i

= [X1 : : : Xi

1

Xi+1 : : : XK ]

The joint probability density function of X i is Z 1 fX i (x1 ; : : : ; xi 1 ; xi+1 ; : : : ,xK ) = fX (x1 ; : : : ; xi

1 ; xi ; xi+1 ; : : : ,xK ) dxi

1

In other words, the joint probability density function of X i is obtained integrating the joint probability density function of X with respect to xi .

16.5.8

Partial derivative of the distribution function

Note that, if X is absolutely continuous, then Z x1 Z xK FX (x) = ::: fX (t1 ; : : : ; tK ) dtK : : : dt1 1

1

Hence, by taking the K th -order cross-partial derivative with respect to x1 ; : : : ; xK of both sides of the above equation, we obtain @ K FX (x) = fX (x) @x1 : : : @xK

16.5.9

A more rigorous de…nition of random vector

Random vectors can be de…ned in a more rigorous manner using the terminology of measure theory: De…nition 93 Let ( ; F; P) be a probability space6 . Let X be a function X : ! RK . Let B RK be the Borel -algebra of RK (i.e. the smallest -algebra containing all open hyper-rectangles in RK ). If f! 2

: X (!) 2 Bg 2 F

for any B 2 B RK , then X is a random vector on

.

Thus, if X satis…es this property, we are allowed to de…ne P (X 2 B) := P (f! 2 because the set f! 2 vector.

16.6

: X (!) 2 Bg) ; 8B 2 B RK

: X (!) 2 Bg is measurable by the very de…nition of random

Solved exercises

Some solved exercises on random vectors can be found below. 6 See

p. 76 for a de…nition of probability space and measurable sets.

122

CHAPTER 16. RANDOM VECTORS

Exercise 1 Let X be a 2 1 discrete random vector and denote its components by X1 and X2 . Let the support of X be the set of all 2 1 vectors such that their entries belong to the set of the …rst three natural numbers, i.e., n o > RX = x = [x1 x2 ] : x1 2 N3 and x2 2 N3 where

N3 = f1; 2; 3g Let the joint probability mass function of X be ( > 1 if [x1 x2 ] 2 RX 36 x1 x2 pX (x1 ; x2 ) = > 0 if [x1 x2 ] 2 = RX Find P (X1 = 2 and X2 = 3). Solution Trivially, we need to evaluate the joint probability mass function at the point (2; 3), i.e., 1 6 1 P (X1 = 2 and X2 = 3) = pX (2; 3) = 2 3= = 36 36 6

Exercise 2 Let X be a 2 1 discrete random vector and denote its components by X1 and X2 . Let the support of X be the set of all 2 1 vectors such that their entries belong to the set of the …rst three natural numbers, i.e., n o > RX = x = [x1 x2 ] : x1 2 N3 and x2 2 N3 where

N3 = f1; 2; 3g Let the joint probability mass function of X be ( > 1 36 (x1 + x2 ) if [x1 x2 ] 2 RX pX (x1 ; x2 ) = > 0 if [x1 x2 ] 2 = RX Find P (X1 + X2 = 3). Solution There are only two possible cases that give rise to the occurrence X1 + X2 = 3. These cases are > X = [1 2] and >

X = [2 1]

16.6. SOLVED EXERCISES

123

Since these two cases are disjoint events, we can use the additivity property of probability7 : n o n o > > P (X1 + X2 = 3) = P X = [1 2] [ X = [2 1] n o n o > > = P X = [1 2] + P X = [2 1] = pX (1; 2) + pX (2; 1) 1 6 1 1 = (1 + 2) + (2 + 1) = = 36 36 36 6

Exercise 3 Let X be a 2 1 discrete random vector and denote its components by X1 and X2 . Let the support of X be n o > > > RX = [1 1] ; [2 0] ; [0 0] and its joint probability mass function be 8 1=3 > > < 1=3 pX (x) = > > : 1=3 0

>

if x = [1 1] > if x = [2 0] > if x = [0 0] otherwise

Derive the marginal probability mass functions of X1 and X2 . Solution The support of X1 is RX1 = f0; 1; 2g We need to compute the probability of each element of the support of X1 : pX1 (0)

=

X

pX (x1 ; x2 ) = pX (0; 0) =

1 3

pX (x1 ; x2 ) = pX (1; 1) =

1 3

pX (x1 ; x2 ) = pX (2; 0) =

1 3

f(x1 ;x2 )2RX :x1 =0g

pX1 (1)

=

X

f(x1 ;x2 )2RX :x1 =1g

pX1 (2)

=

X

f(x1 ;x2 )2RX :x1 =2g

Thus, the probability mass function of X1 is 8 1=3 > > < X 1=3 pX1 (x) = pX (x1 ; x2 ) = 1=3 > > f(x1 ;x2 )2RX :x1 =xg : 0

The support of X2 is

RX2 = f0; 1g 7 See

p. 72.

if x = 0 if x = 1 if x = 2 otherwise

124

CHAPTER 16. RANDOM VECTORS

We need to compute the probability of each element of the support of X2 : X 2 pX2 (0) = pX (x1 ; x2 ) = pX (2; 0) + pX (0; 0) = 3 f(x1 ;x2 )2RX :x2 =0g

pX2 (1)

X

=

pX (x1 ; x2 ) = pX (1; 1) =

f(x1 ;x2 )2RX :x2 =1g

1 3

Thus, the probability mass function of X2 is

8 < 2=3 if x = 0 1=3 if x = 1 pX2 (x) = pX (x1 ; x2 ) = : 0 otherwise f(x1 ;x2 )2RX :x2 =xg X

Exercise 4

Let X be a 2 1 absolutely continuous random vector and denote its components by X1 and X2 . Let the support of X be RX = [0; 2]

[0; 3]

i.e. the set of all 2 1 vectors such that the …rst component belongs to the interval [0; 2] and the second component belongs to the interval [0; 3]. Let the joint probability density function of X be 1=6 if x 2 RX 0 otherwise

fX (x) = Compute P (1

X1

3; 1

X2

1).

Solution By the very de…nition of joint probability density function: P (1 X1 3; 1 X2 1) Z 3Z 1 = fX (x1 ; x2 ) dx2 dx1 1

=

Z

2

1

=

1 6

Z

Z

1 1

0 2

1

1 1 dx2 dx1 = 6 6

Z

2

1

1

[x2 ]0 dx1

1 1 2 1dx1 = [x1 ]1 = 6 6

Exercise 5 Let X be a 2 1 absolutely continuous random vector and denote its components by X1 and X2 . Let the support of X be RX = [0; 1)

[0; 2]

i.e. the set of all 2 1 vectors such that the …rst component belongs to the interval [0; 1) and the second component belongs to the interval [0; 2]. Let the joint probability density function of X be fX (x) = fX (x1 ; x2 ) = Compute P (X1 + X2

3).

exp ( 2x1 ) if x 2 RX 0 otherwise

16.6. SOLVED EXERCISES

125

Solution First of all note that X1 + X2 3 if and only if X2 3 X1 . Using the de…nition of joint probability density function, we obtain n o > P (X1 + X2 3) = P [x1 x2 ] : x1 2 R; x2 2 ( 1; 3 x1 ] Z 1 Z 3 x1 fX (x1 ; x2 ) dx2 dx1 = 1

1

When x1 2 [0; 1), the inner integral is Z 3 x1 fX (x1 ; x2 ) dx2 1

8 < 0 R3 = : R02 0

if 3 if 0 if 3

x1

exp ( 2x1 ) dx2 exp ( 2x1 ) dx2

x1 < 0, i.e. if x1 > 3 3 x1 2, i.e. if 1 x1 x1 > 2, i.e. if x1 < 1

3

Therefore,

= =

P (X1 + X2 Z 1 Z 3 x1

Z

fX (x1 ; x2 ) dx2 dx1

1 1 1Z 2

exp ( 2x1 ) dx2 dx1 +

0

=

3)

Z

0

1

exp ( 2x1 )

0

=

Z

Z

2

dx2 dx1 +

exp ( 2x1 ) 2dx1 +

0

=

1

[ exp ( 2x1 )]0 + 3

=

Z

Z

Z

Z

3 x1

exp ( 2x1 ) dx2 dx1

0

3

exp ( 2x1 )

1

3

Z

3 x1

dx2 dx1

0

exp ( 2x1 ) (3

1

3

exp ( 2x1 ) dx1

1

x1 ) dx1 Z

3

x1 exp ( 2x1 ) dx1

1

1 exp ( 2x1 ) exp ( 2) + 1 + 3 2 ( Z 3 3 1 x1 exp ( 2x1 ) 2 1 1

=

=

3

1

0

1

Z

3 1

1 exp ( 2x1 ) dx1 2

)

3 3 3 exp ( 6) + exp ( 2) + exp ( 6) 2 2 2 3 1 1 exp ( 2) + exp ( 2x1 ) 2 4 1 1 1 1 + exp ( 6) exp ( 2) 4 4 1

exp ( 2)

Exercise 6 Let X be a 2 1 absolutely continuous random vector and denote its components by X1 and X2 . Let the support of X be RX = R2+

126

CHAPTER 16. RANDOM VECTORS

i.e., the set of all 2-dimensional vectors with positive entries. Let its joint probability density function be fX (x) = fX (x1 ; x2 ) =

exp ( x1 0

x2 ) if x1 0 and x2 otherwise

0

Derive the marginal probability density functions of X1 and X2 . Solution The support of X1 is RX1 = R+ We can …nd the marginal density by integrating the joint density with respect to x2 : Z 1 fX1 (x) = fX (x; x2 ) dx2 1

When x < 0, then fX (x; x2 ) = 0 and the above integral is trivially equal to 0. Thus, when x < 0, then fX1 (x) = 0. When x > 0, then Z 1 Z 0 Z 1 fX1 (x) = fX (x; x2 ) dx2 = fX (x; x2 ) dx2 + fX (x; x2 ) dx2 1

1

0

but the …rst of the two integrals is zero since fX (x; x2 ) = 0 when x2 < 0; as a consequence, Z 0 Z 1 fX1 (x) = fX (x; x2 ) dx2 + fX (x; x2 ) dx2 1 0 Z 1 Z 1 = fX (x; x2 ) dx2 = exp ( x x2 ) dx2 0 0 Z 1 Z 1 = exp ( x) exp ( x2 ) dx2 = exp ( x) exp ( x2 ) dx2 0

=

0

1

exp ( x) [ exp ( x2 )]0 = exp ( x) ( 0

( 1)) = exp ( x)

So, putting pieces together, the marginal density function of X1 is fX1 (x) =

exp ( x) if x 0 0 otherwise

Obviously, by symmetry, the marginal density function of X2 is fX2 (x) =

exp ( x) if x 0 0 otherwise

Chapter 17

Expected value The concept of expected value of a random variable1 is one of the most important concepts in probability theory. It was …rst devised in the 17th century to analyze gambling games and answer questions such as: how much do I gain - or lose - on average, if I repeatedly play a given gambling game? how much can I expect to gain - or lose - by performing a certain bet? If the possible outcomes of the game (or the bet) and their associated probabilities are described by a random variable, then these questions can be answered by computing its expected value, which is equal to a weighted average of the outcomes, in which each outcome is weighted by its probability. For example, if you play a game where you gain 2\$ with probability 1=2 and you lose 1\$ with probability 1=2, then the expected value of the game is half a dollar: 2\$ (1=2) + ( 1\$) (1=2) = 1=2\$ What does this mean? Roughly speaking, it means that if you play this game many times and the number of times each of the two possible outcomes occurs is proportional to its probability, then, on average you gain 1=2\$ each time you play the game. For instance, if you play the game 100 times, win 50 times and lose the remaining 50, then your average winning is equal to the expected value: (2\$ 50 + ( 1\$) 50) =100 = 1=2\$ In general, giving a rigorous de…nition of expected value requires quite a heavy mathematical apparatus. To keep things simple, we provide an informal de…nition of expected value and we discuss its computation in this lecture, while we relegate a more rigorous de…nition to the (optional) lecture entitled Expected value and the Lebesgue integral (p. 141).

17.1

De…nition of expected value

The following is an informal de…nition of expected value: De…nition 94 (informal) The expected value of a random variable X is the weighted average of the values that X can take on, where each possible value is weighted by its respective probability. 1 See

p. 105.

127

128

CHAPTER 17. EXPECTED VALUE

The expected value of a random variable X is denoted by E [X] and it is often called the expectation of X or the mean of X. The following sections discuss how the expected value of a random variable is computed.

17.2

Discrete random variables

When X is a discrete random variable having support RX and probability mass function pX (x), the formula for computing its expected value is a straightforward implementation of the informal de…nition given above: the expected value of X is the weighted average of the values that X can take on (the elements of RX ), where each possible value x 2 RX is weighted by its respective probability pX (x). De…nition 95 Let X be a discrete random variable with support RX and probability mass function pX (x). The expected value of X is: X E [X] = xpX (x) x2RX

provided that:

X

x2RX

jxj pX (x) < 1

The symbol

X

x2RX

indicates summation over all the elements of the support RX . So, for example, if RX = f1; 2; 3g then:

X

xpX (x) = 1 pX (1) + 2 pX (2) + 3 pX (3)

x2RX

The requirement that

X

x2RX

jxj pX (x) < 1

is called absolute summability and ensures that the summation X xpX (x)

(17.1)

x2RX

is well-de…ned also when the support RX contains in…nitely many elements. When summing in…nitely many terms, the order in which you sum them can change the result of the sum. However, if the terms are absolutely summable, then the order in which you sum becomes irrelevant. In the above de…nition of expected value, the order of the sum X xpX (x) x2RX

is not speci…ed, therefore the requirement of absolute summability is introduced in order to ensure that the expected value is well-de…ned. When the absolute summability condition is not satis…ed, we say that the expected value of X is not well-de…ned or that it does not exist.

17.3. CONTINUOUS RANDOM VARIABLES

129

Example 96 Let X be a random variable with support RX = f0; 1g and probability mass function: 8 < 1=2 if x = 1 1=2 if x = 0 pX (x) = : 0 otherwise

Its expected value is: E [X]

X

=

xpX (x) = 1 pX (1) + 0 pX (0)

x2RX

=

17.3

1 1 1 +0 = 2 2 2

1

Continuous random variables

When X is an absolutely continuous random variable with probability density function fX (x), the formula for computing its expected value involves an integral: De…nition 97 Let X be an absolutely continuous random variable with probability density function fX (x). The expected value of X is: Z 1 E [X] = xfX (x) dx 1

provided that:

Z

1 1

jxj fX (x) dx < 1

This integral can be thought of as the limiting case of the sum (17.1) found in the discrete case. Here pX (x) probability R 1is replaced by fX (x) dx (the in…nitesimal P of x) and the integral sign 1 replaces the summation sign x2RX . The requirement that Z 1

1

jxj fX (x) dx < 1

is called absolute integrability and ensures that the improper integral Z 1 xfX (x) dx 1

is well-de…ned. This improper integral is a shorthand for: lim

t! 1

Z

t

0

xfX (x) dx + lim

t!1

Z

t

xfX (x) dx

0

and it is well-de…ned only if both limits are …nite. Absolute integrability guarantees that the latter condition is met and that the expected value is well-de…ned. When the absolute integrability condition is not satis…ed, we say that the expected value of X is not well-de…ned or that it does not exist.

130

CHAPTER 17. EXPECTED VALUE

Example 98 Let X be an absolutely continuous random variable with support RX = [0; 1) and probability density function: exp (

fX (x) = where

x) if x 2 [0; 1) otherwise

0

> 0. Its expected value is: Z 1 E [X] = xfX (x) dx 1 Z 1 = x exp ( x) dx 0 Z 1 A = 1 t exp ( t) dt 0

B

= = =

1 1 1

[ t exp ( f0

1 t)]0

+

Z

1

exp ( t) dt

0 1

0 + [ exp ( t)]0 g

f0 + 1g =

1

where: in step A we have made a change of variable (t = x); in step B we have integrated by parts.

17.4

The Riemann-Stieltjes integral

This section introduces a general formula for computing the expected value of a random variable X. The formula, which does not require X to be discrete or absolutely continuous and is applicable to any random variable, involves an integral called Riemann-Stieltjes integral (see below for an introduction). While we brie‡y discuss this formula for the sake of completeness, no deep understanding of this formula or of the Riemann-Stieltjes integral is required to understand the other lectures. De…nition 99 Let X be a random variable having distribution function2 FX (x). The expected value of X is: Z 1 E [X] = xdFX (x) 1

where the integral is a Riemann-Stieltjes integral and the expected value exists and is well-de…ned only as long as the integral is well-de…ned. Also this integral is the limiting case of formula (17.1) for the expected value of a discrete random variable. Here dFX (x) replaces pXP (x) (the probability of x) R1 and the integral sign 1 replaces the summation sign x2RX . 2 See

p. 108.

17.4. THE RIEMANN-STIELTJES INTEGRAL

131

The following section contains a brief and informal introduction to the RiemannStieltjes integral and an explanation of the above formula. Less technically oriented readers can safely skip it: when they encounter a Riemann-Stieltjes integral, they can just think of it as a formal notation which allows a uni…ed treatment of discrete and absolutely continuous random variables and can be treated as a sum in one case and as an ordinary Riemann integral in the other.

17.4.1

Intuition

As we have already seen above, the expected value of a discrete random variable is straightforward to compute: the expected value of a discrete variable X is the weighted average of the values that X can take on (the elements of the support RX ), where each possible value x is weighted by its respective probability pX (x): X E [X] = xpX (x) x2RX

or, written in a slightly di¤erent fashion: X E [X] = xP (X = x) x2RX

When X is not discrete the above summation does not make any sense. However, there is a workaround that allows to extend the formula to random variables that are not discrete. The workaround entails approximating X with discrete variables that can take on only …nitely many values. Let x0 , x1 , . . . , xn be n + 1 real numbers (n 2 N) such that: x0 < x1 < : : : < xn De…ne a new random variable Xn (function of X) as follows: 8 x1 when x0 < X x1 > > > < x2 when x1 < X x2 Xn = .. .. > . . > > : xn when xn 1 < X xn

As the number n of points increases and the points become closer and closer (the maximum distance between two successive points tends to zero), Xn becomes a very good approximation of X, until, in the limit, it is indistinguishable from X. The expected value of Xn is easy to compute: E [Xn ]

= = =

n X

i=1 n X

i=1 n X

xi P (Xn = xi ) xi P (X 2 (xi xi [FX (xi )

i=1

where FX (x) is the distribution function of X.

1 ; xi ])

FX (xi

1 )]

132

CHAPTER 17. EXPECTED VALUE

The expected value of X is then de…ned as the limit of E [Xn ] when n tends to in…nity (i.e. when the approximation becomes better and better): E [X] = lim E [Xn ] = lim n!1

n!1

n X

xi [FX (xi )

FX (xi

1 )]

i=1

When the latter limit exists and is well-de…ned, it is called the Riemann-Stieltjes integral of x with respect to FX (x) and it is indicated as follows: Z

1

xdFX (x) = lim

n!1

1

n X

xi [FX (xi )

FX (xi

1 )]

i=1

R1 can be thought of as a shorthand Roughly speaking, the integral notation 1 Pn for limn!1 i=1 and the di¤erential notation dFX (x) can be thought of as a shorthand for [FX (xi ) FX (xi 1 )].

17.4.2

Some rules

We present here some rules for computing the Riemann-Stieltjes integral when the integrator function is the distribution function of a random variable X, i.e. we limit attention to integrals of the kind: Z

b

g (x) dFX (x)

a

where FX (x) is the distribution function of a random variable X and g : R ! R. Before stating the rules, note that the above integral does not necessarily exist or is not necessarily well-de…ned. Roughly speaking, for the integral to exist the integrand function g must be well-behaved. For example, if g is continuous on [a; b], then the integral exists and is well-de…ned. That said, we are ready to present the calculation rules: 1. FX (x) is continuously di¤erentiable on [a; b]. If FX (x) is continuously di¤erentiable on [a; b] and fX (x) is its …rst derivative, then: Z

b

g (x) dFX (x) =

a

Z

b

g (x) fX (x) dx

a

2. FX (x) is continuously di¤erentiable on [a; b] except at a …nite number of points. Suppose FX (x) is continuously di¤erentiable on [a; b] except at a …nite number of points c1 , . . . , cn such that: a < c1 < c2 < : : : < cn

b

Denote the derivative of FX (x) (where it exists) by fX (x). Then: Z

b

g (x) dFX (x)

a

=

Z

a

c1

"

g (x) fX (x) dx + g (c1 ) FX (c1 )

#

lim FX (x)

x!c1 x 0 and the exponential function is strictly positive, fX (x) 0 for any x 2 R, so the non-negativity property is satis…ed. The integral property is also satis…ed, because Z 1 Z 1 fX (x) dx = exp ( x) dx 1

0

= =

1

[ exp ( x)]0 0 ( 1) = 1

Exercise 2 Consider the function fX (x) =

1 u l

0

if x 2 [l; u] if x 2 = [l; u]

where l; u 2 R and l < u. Prove that fX (x) is a legitimate probability density function.

254

CHAPTER 31. LEGITIMATE PDFS

Solution l < u implies u1 l > 0, so fX (x) 0 for any x 2 R and the non-negativity property is satis…ed. The integral property is also satis…ed, because Z 1 Z u 1 fX (x) dx = dx u l 1 l Z u 1 = dx u l l 1 u = [x] u l l 1 = (u l) = 1 u l

Exercise 3 Consider the function fX (x) =

2 0

n=2

( (n=2))

1

xn=2

1

exp

1 2x

if x 2 [0; 1) if x 2 = [0; 1)

where n 2 N and () is the Gamma function4 . Prove that fX (x) is a legitimate probability density function. Solution Remember the de…nition of Gamma function: Z 1 xz 1 exp ( x) dx (z) = 0

(z) is obviously strictly positive for any z, since exp ( x) is strictly positive and xz 1 is strictly positive on the interval of integration (except at 0, where it is 0). Therefore, fX (x) satis…es the non-negativity property, because the four factors in the product 1 1 2 n=2 ( (n=2)) xn=2 1 exp x 2 are all non-negative on the interval [0; 1). The integral property is also satis…ed, because Z 1 Z 1 1 1 fX (x) dx = 2 n=2 ( (n=2)) xn=2 1 exp x dx 2 1 0 Z 1 1 1 = 2 n=2 ( (n=2)) xn=2 1 exp x dx 2 0 Z 1 1 n=2 1 A = 2 n=2 ( (n=2)) 1 (2t) exp 2t 2dt 2 0 Z 1 1 n=2 1 = 2 n=2 ( (n=2)) (2) 2 tn=2 1 exp ( t) dt 0 Z 1 1 = ( (n=2)) tn=2 1 exp ( t) dt 0

4 See

p. 55.

31.3. SOLVED EXERCISES B

= =

( (n=2)) 1

255 1

(n=2)

where: in step A we have made a change of variable (x = 2t); in step B we have used the de…nition of Gamma function.

256

CHAPTER 31. LEGITIMATE PDFS

Chapter 32

Factorization of joint probability mass functions This lecture discusses how to factorize the joint probability mass function1 of two discrete random variables X and Y into two factors: 1. the conditional probability mass function2 of X given Y = y; 2. the marginal probability mass function3 of Y .

32.1

The factorization

The factorization, which has already been discussed in the lecture entitled Conditional probability distributions (p. 209), is formally stated in the following proposition. Proposition 177 (factorization) Let [X Y ] be a discrete random vector with support RXY and joint probability mass function pXY (x; y). Denote by pXjY =y (x) the conditional probability mass function of X given Y = y and by pY (y) the marginal probability mass function of Y . Then: pXY (x; y) = pXjY =y (x) pY (y) for any x and y.

32.2

A factorization method

When we know the joint probability mass function pXY (x; y) and we need to factorize it into the conditional probability mass function pXjY =y (x) and the marginal probability mass function pY (y), we usually proceed in two steps: 1. marginalize pXY (x; y) by summing it over all possible values of x and obtain the marginal probability mass function pY (y); 1 See

p. 116. p. 210. 3 See p. 120. 2 See

257

258

CHAPTER 32. FACTORIZATION OF JOINT PMFS

2. divide pXY (x; y) by pY (y) and obtain the conditional probability mass function pXjY =y (x) (of course this step makes sense only when pY (y) > 0). In some cases, the …rst step (marginalization) can be di¢ cult to perform. In these cases, it is possible to avoid the marginalization step, by making a guess about the factorization of pXY (x; y) and verifying whether the guess is correct with the help of the following proposition: Proposition 178 (factorization method) Suppose there are two functions h (y) and g (x; y) such that: 1. for any x and y, the following holds: pXY (x; y) = g (x; y) h (y) 2. for any …xed y, g (x; y), considered as a function of x, is a probability mass function with the same support of X (i.e. RX ). Then: pXjY =y (x) pY (y)

= g (x; y) = h (y)

Proof. The marginal probability mass function of Y satis…es: X pY (y) = pXY (x; y) x2RX

Therefore, by property 1: pY (y)

=

X

fXY (x; y)

x2RX

=

X

g (x; y) h (y)

x2RX

A

= h (y)

X

g (x; y)

x2RX

B

= h (y)

where: in step A we have used the fact that h (y) does not depend on x; in step B we have used the fact that, for any …xed y, g (x; y), considered as a function of x, is a probability mass function and the sum4 of a probability mass function over its support equals 1. Therefore, pXY (x; y) = g (x; y) h (y) = g (x; y) pY (y) ; 8 (x; y) Since we also have that pXY (x; y) = pXjY =y (x) pY (y) ; 8 (x; y) 4 See

p. 247.

32.2. A FACTORIZATION METHOD

259

then, by necessity, it must be that: g (x; y) = pXjY =y (x) Thus, whenever we are given a formula for the joint probability mass function pXY (x; y) and we want to …nd the marginal and the conditional functions, we have to manipulate the formula and express it as the product of: 1. a function of x and y that is a probability mass function in x for all values of y; 2. a function of y that does not depend on x. Example 179 Let X be a 3 1 random vector having a multinomial distribution5 with parameters p1 , p2 and p3 (the probabilities of the three possible outcomes of each trial) and n (the number of trials). The probabilities are strictly positive numbers such that: p 1 + p2 + p3 = 1 The support of X is RX = x 2 Z3+ : x1 + x2 + x3 = n where x1 , x2 , x3 denote the components of the vector x. The joint probability mass function of X is pX (x1 ; x2 ; x3 ) =

x1 x2 x3 n! x1 !x2 !x3 ! p1 p2 p2

0

if (x1 ; x2 ; x3 ) 2 RX otherwise

Note that:

=

A

n! px1 px2 px3 x1 !x2 !x3 ! n! px1 1 px2 2 x3 n (n x3 )! p p x1 !x2 ! (n x3 )!x3 ! pn3 x3 3 3

x3

=

(n x3 )! px1 1 px2 2 x1 !x2 ! pn3 x3

(n

n! px3 pn x3 )!x3 ! 3 3

=

(n x3 )! px1 1 px2 2 x1 !x2 ! px3 1 +x2

(n

n! px3 pn x3 )!x3 ! 3 3

=

(n x3 )! x x (p1 =p3 ) 1 (p2 =p3 ) 2 x1 !x2 !

(n

x3

x3

n! px3 pn x3 )!x3 ! 3 3

where in step A we have used the fact that x1 + x2 + x3 = n Therefore, the joint probability mass function can be factorized as: pX (x1 ; x2 ; x3 ) = g (x1 ; x2 ; x3 ) h (x3 ) 5 See

p. 431.

x3

260

CHAPTER 32. FACTORIZATION OF JOINT PMFS

where g (x1 ; x2 ; x3 ) = and:

8 < :

h (x3 ) =

(n x3 )! x1 !x2 !

x1

(p1 =p3 )

0 x3 n x3 n! (n x3 )!x3 ! p3 p3

0

x2

(p2 =p3 )

if (x1 ; x2 ) 2 Z2+ and x1 + x2 = n otherwise

if x3 2 Z+ and x3 otherwise

x3

n

For any x3 n, g (x1 ; x2 ; x3 ) is the probability mass function of a multinomial distribution with parameters p1 =p3 , p2 =p3 and n x3 . Therefore: pX1 ;X2 jX3 =x3 (x1 ; x2 ) = g (x1 ; x2 ; x3 ) pX3 (x3 ) = h (x3 )

Chapter 33

Factorization of joint probability density functions This lecture discusses how to factorize the joint probability density function1 of two absolutely continuous random variables (or random vectors) X and Y into two factors: 1. the conditional probability density function2 of X given Y = y; 2. the marginal probability density function3 of Y .

33.1

The factorization

The factorization, which has already been discussed in the lecture entitled Conditional probability distributions (p. 209), is formally stated in the following proposition. Proposition 180 (factorization) Let [X Y ] be an absolutely continuous random vector with support RXY and joint probability density function fXY (x; y). Denote by fXjY =y (x) the conditional probability density function of X given Y = y and by fY (y) the marginal probability density function of Y . Then fXY (x; y) = fXjY =y (x) fY (y) for any x and y.

33.2

A factorization method

When we know the joint probability density function fXY (x; y) and we need to factorize it into the conditional probability density function fXjY =y (x) and the marginal probability density function fY (y), we usually proceed in two steps: 1. marginalize fXY (x; y) by integrating it with respect to x and obtain the marginal probability density function fY (y); 1 See

p. 116. p. 213. 3 See p. 120. 2 See

261

262

CHAPTER 33. FACTORIZATION OF JOINT PDFS

2. divide fXY (x; y) by fY (y) and obtain the conditional probability density function fXjY =y (x) (of course this step makes sense only when fY (y) > 0). In some cases, the …rst step (marginalization) can be di¢ cult to perform. In these cases, it is possible to avoid the marginalization step, by making a guess about the factorization of fXY (x; y) and verifying whether the guess is correct with the help of the following proposition: Proposition 181 (factorization method) Suppose there are two functions h (y) and g (x; y) such that: 1. for any x and y, the following holds: fXY (x; y) = g (x; y) h (y) 2. for any …xed y, g (x; y), considered as a function of x, is a probability density function. Then: fXjY =y (x) fY (y)

= g (x; y) = h (y)

Proof. The marginal probability density of Y satis…es Z 1 fY (y) = fXY (x; y) dx 1

Thus, by property 1 above: fY (y)

= =

A

Z Z

1

fXY (x; y) dx

1 1

g (x; y) h (y) dx Z 1 = h (y) g (x; y) dx 1

1

B

= h (y)

where: in step A we have used the fact that h (y) does not depend on x; in step B we have used the fact that, for any …xed y, g (x; y), considered as a function of x, is a probability density function and the integral of a probability density function over R equals 1 (see p. 251). Therefore fXY (x; y) = g (x; y) h (y) = g (x; y) fY (y) which, in turn, implies g (x; y) =

fXY (x; y) = fXjY =y (x) fY (y)

Thus, whenever we are given a formula for the joint density function fXY (x; y) and we want to …nd the marginal and the conditional functions, we have to manipulate the formula and express it as the product of:

33.2. A FACTORIZATION METHOD

263

1. a function of x and y that is a probability density function in x for all values of y; 2. a function of y that does not depend on x. Example 182 Let the joint density function of X and Y be fXY (x; y) =

1 2 y exp (

0

yx) if x 2 [0; 1) and y 2 [1; 3] otherwise

The joint density can be factorized as follows: fXY (x; y) = g (x; y) h (y) where g (x; y) =

y exp ( yx) if x 2 [0; 1) 0 otherwise

and h (y) =

1 2

0

if y 2 [1; 3] otherwise

Note that g (x; y) is a probability density function in x for any …xed y (it is the probability density function of an exponential random variable4 with parameter y). Therefore: fXjY =y (x) fY (y)

4 See

p. 365.

= g (x; y) = h (y)

264

CHAPTER 33. FACTORIZATION OF JOINT PDFS

Chapter 34

Functions of random variables and their distribution Let X be a random variable with known distribution. Let another random variable Y be a function of X: Y = g (X) where g : R ! R. How do we derive the distribution of Y from the distribution of X? There is no general answer to this question. However, there are several special cases in which it is easy to derive the distribution of Y . We discuss these cases below.

34.1

Strictly increasing functions

When the function g is strictly increasing on the support of X, i.e. 8x1 ; x2 2 RX ; x1 > x2 ) g (x1 ) > g (x2 ) then g admits an inverse de…ned on the support of Y , i.e. a function g that: X = g 1 (Y )

1

(y) such

Furthermore g 1 (y) is itself strictly increasing. The distribution function of a strictly increasing function of a random variable can be computed as follows: Proposition 183 (cdf of an increasing function) Let X be a random variable with support RX and distribution function1 FX (x). Let g : R ! R be strictly increasing on the support of X. Then, the support of Y = g (X) is RY = fy = g (x) : x 2 RX g 1 See

p. 108.

265

266

CHAPTER 34. FUNCTIONS OF RANDOM VARIABLES

and the distribution function of Y is 8 < 0 FX g FY (y) = : 1

1

if y < ; 8 2 RY if y 2 RY if y > ; 8 2 RY

(y)

Proof. Of course, the support RY is determined by g (x) and by all the values X can take. The distribution function of Y can be derived as follows: if y is lower than than the lowest value Y can take on, then P (Y so: FY (y) = 0 if y < ; 8 2 RY

y) = 0,

if y belongs to the support of Y , then FY (y) can be derived as follows: FY (y) A

=

P (Y

B

=

P (g (X)

C

=

P g

= D

y)

1

P X

= FX g

y)

(g (X)) g 1

1

g

1

(y)

(y)

(y)

where: in step A we have used the de…nition of distribution function of Y ; in step B we have used the de…nition of Y ; in step C we have used the fact that g 1 exists and is strictly increasing on the support of Y ; in step D we have used the de…nition of distribution function of X; if y is higher than than the highest value Y can take on, then P (Y so: FY (y) = 1 if y > ; 8 2 RY

y) = 1,

Therefore, in the case of an increasing function, knowledge of g 1 and of the upper and lower bounds of the support of Y is all we need to derive the distribution function of Y from the distribution function of X. Example 184 Let X be a random variable with support RX = [1; 2] and distribution function FX (x) = Let

8 < 0 :

1 2x

1

if x < 1 if 1 x if x > 2

Y = X2

2

34.1. STRICTLY INCREASING FUNCTIONS

267

The function g (x) = x2 is strictly increasing and it admits an inverse on the support of X: p g 1 (y) = y The support of Y is RY = [1; 4]. The distribution function of Y is 8 if y < ; 8 2 RY , i.e. if y < 1 < 0 p FX g 1 (y) = 12 y if y 2 RY , i.e. if 1 y 4 FY (y) = : 1 if y > ; 8 2 RY , i.e. if y > 4

In the cases in which X is either discrete or absolutely continuous there are specialized formulae for the probability mass and probability density functions, which are reported below.

34.1.1

Strictly increasing functions of a discrete variable

When X is a discrete random variable, the probability mass function of Y = g (X) can be computed as follows: Proposition 185 (pmf of an increasing function) Let X be a discrete random variable with support RX and probability mass function pX (x). Let g : R ! R be strictly increasing on the support of X. Then, the support of Y = g (X) is RY = fy = g (x) : x 2 RX g and its probability mass function is pX g 0

pY (y) =

1

(y)

if y 2 RY if y 2 = RY

Proof. This proposition is a trivial consequence of the fact that a strictly increasing function is invertible: pY (y)

= P (Y = y) = P (g (X) = y) = P X = g 1 (y) = pX g

1

(y)

Example 186 Let X be a discrete random variable with support RX = f1; 2; 3g and probability mass function pX (x) =

1 6x

0

if x 2 RX if x 2 = RX

Let Y = g (X) = 3 + X 2 The support of Y is RY = f4; 7; 12g

268

CHAPTER 34. FUNCTIONS OF RANDOM VARIABLES

The function g is strictly increasing and its inverse is p g 1 (y) = y 3 The probability mass function of Y is: 1p 6 y pY (y) = 0

34.1.2

3 if y 2 RY if y 2 = RY

Strictly increasing functions of a continuous variable

When X is an absolutely continuous random variable and g is di¤erentiable, then also Y is absolutely continuous and its probability density function can be easily computed as follows: Proposition 187 (pdf of an increasing function) Let X be an absolutely continuous random variable with support RX and probability density function fX (x). Let g : R ! R be strictly increasing and di¤ erentiable on the support of X. Then, the support of Y = g (X) is RY = fy = g (x) : x 2 RX g and its probability density function is ( fX g 1 (y) fY (y) = 0

dg

1 (y) dy

if y 2 RY if y 2 = RY

Proof. This proposition is a trivial consequence of the fact that the density function is the …rst derivative of the distribution function2 : it can be obtained by di¤erentiating the expression for the distribution function FY (y) found above. Example 188 Let X be an absolutely continuous random variable with support RX = (0; 1] and probability density function 2x if x 2 RX 0 if x 2 = RX

fX (x) = Let

Y = g (X) = ln (X) The support of Y is RY = ( 1; 0]

The function g is strictly increasing and its inverse is g with derivative

(y) = exp (y)

1

(y) = exp (y) dy The probability density function of Y is fY (y) = 2 See

p. 109.

dg

1

2 exp (y) exp (y) = 2 exp (2y) if y 2 RY 0 if y 2 = RY

34.2. STRICTLY DECREASING FUNCTIONS

34.2

269

Strictly decreasing functions

When the function g is strictly decreasing on the support of X, i.e. 8x1 ; x2 2 RX ; x1 > x2 ) g (x1 ) < g (x2 ) then g admits an inverse de…ned on the support of Y , i.e. a function g that X = g 1 (Y )

1

(y) such

Furthermore g 1 (y) is itself strictly decreasing. The distribution function of a strictly decreasing function of a random variable can be computed as follows: Proposition 189 (cdf of a decreasing function) Let X be a random variable with support RX and distribution function FX (x). Let g : R ! R be strictly decreasing on the support of X. Then, the support of Y = g (X) is: RY = fy = g (x) : x 2 RX g and the distribution function of Y is: 8 < 0 1 FX g 1 (y) + P X = g FY (y) = : 1

1

(y)

if y < ; 8 2 RY if y 2 RY if y > ; 8 2 RY

Proof. Of course, the support RY is determined by g (x) and by all the values X can take. The distribution function of y can be derived as follows: if y is lower than than the lowest value Y can take on, then P (Y so: FY (y) = 0 if y < ; 8 2 RY

y) = 0,

if y belongs to the support of Y , then FY (y) can be derived as follows: FY (y) A

= =

P (Y y) 1 P (Y > y)

B

=

1

P (g (X) > y)

C

=

1

P g

= = D

1 1

1

(g (X)) < g

P X 2

1 2x

1

X2

Y = The function g (x) = support of X:

2

x2 is strictly decreasing and it admits an inverse on the g

1

(y) =

p

y

The support of Y is RY = [ 4; 1]. The distribution function of Y is 8 if y < ; 8 2 RY , > > 0 > > i.e. if y < 4 > > < if y 2 RY , 1 FX g 1 (y) + P X = g 1 (y) FY (y) = 1 1p i.e. if 4 y 1 y + 1 = 1 > 2 2 fy= 1g > > > if y > ; 8 2 R , > Y > : 1 i.e. if y > 1

where 1fy= 1g equals 1 when y = 1 and 0 otherwise (because P X = g always zero except when y = 1 and g 1 (y) = 1).

1

(y) is

We report below the formulae for the special cases in which X is either discrete or absolutely continuous.

34.2.1

Strictly decreasing functions of a discrete variable

When X is a discrete random variable, the probability mass function of Y = g (X) can be computed as follows: Proposition 191 (pmf of a decreasing function) Let X be a discrete random variable with support RX and probability mass function pX (x). Let g : R ! R be strictly decreasing on the support of X. Then, the support of Y = g (X) is RY = fy = g (x) : x 2 RX g and its probability mass function is pY (y) =

pX g 0

1

(y)

if y 2 RY if y 2 = RY

34.2. STRICTLY DECREASING FUNCTIONS

271

Proof. The proof of this proposition is identical to the proof of the proposition for strictly increasing functions. In fact, the only property that matters is that a strictly decreasing function is invertible: pY (y) = P (Y = y) = P (g (X) = y) = P X = g 1 (y) 1

= pX g

(y)

Example 192 Let X be a discrete random variable with support RX = f1; 2; 3g and probability mass function 1 2 14 x

pX (x) =

0

if x 2 RX if x 2 = RX

Let Y = g (X) = 1

2X

The support of Y is RY = f 5; 3; 1g The function g is strictly decreasing and its inverse is g

1

(y) =

1 2

1 y 2

The probability mass function of Y is pY (y) =

34.2.2

1 14

0

1 2

2 1 2y

if y 2 RY if y 2 = RY

Strictly decreasing functions of a continuous variable

When X is an absolutely continuous random variable and g is di¤erentiable, then also Y is absolutely continuous and its probability density function is derived as follows: Proposition 193 (pdf of a decreasing function) Let X be an absolutely continuous random variable with support RX and probability density function fX (x). Let g : R ! R be strictly decreasing and di¤ erentiable on the support of X. Then, the support of Y = g (X) is RY = fy = g (x) : x 2 RX g and its probability density function is ( fX g fY (y) = 0

1

(y)

dg

1 (y) dy

if y 2 RY if y 2 = RY

272

CHAPTER 34. FUNCTIONS OF RANDOM VARIABLES

Proof. This proposition is easily derived: 1) remembering that the probability that an absolutely continuous random variable takes on any speci…c value3 is 0 and, as a consequence, P X = g 1 (y) = 0 for any y; 2) using the fact that the density function is the …rst derivative of the distribution function; 3) di¤erentiating the expression for the distribution function FY (y) found above. Example 194 Let X be a uniform random variable4 on the interval [0; 1], i.e. an absolutely continuous random variable with support RX = [0; 1] and probability density function fX (x) =

1 if x 2 RX 0 if x 2 = RX

Let Y = g (X) = where

1

ln (X)

2 R++ is a constant. The support of Y is RY = [0; 1)

where we can safely ignore the fact that g (0) = 1, because fX = 0g is a zeroprobability event5 . The function g is strictly decreasing and its inverse is 1

g

(y) = exp (

y)

with derivative dg

1

(y) = dy

exp (

y)

The probability density function of Y is fY (y) =

exp ( 0

y) if y 2 RY if y 2 = RY

Therefore, Y has an exponential distribution with parameter entitled Exponential distribution - p. 365).

34.3

(see the lecture

Invertible functions

In the case in which the function g (x) is neither strictly increasing nor strictly decreasing, the formulae given in the previous sections for discrete and absolutely continuous random variables are still applicable, provided g (x) is one-to-one and hence invertible. We report these formulae below. 3 See

p. 109. p. 359. 5 See p. 109. 4 See

34.3. INVERTIBLE FUNCTIONS

34.3.1

273

One-to-one functions of a discrete variable

When X is a discrete random variable the probability mass function of Y = g (X) is given by the following: Proposition 195 (pmf of a one-to-one function) Let X be a discrete random variable with support RX and probability mass function pX (x). Let g : R ! R be one-to-one on the support of X. Then, the support of Y = g (X) is RY = fy = g (x) : x 2 RX g and its probability mass function is pY (y) =

pX g 0

1

(y)

if y 2 RY if y 2 = RY

Proof. The proof of this proposition is identical to the proof of the propositions for strictly increasing and stricly decreasing functions found above: pY (y) = P (Y = y) = P (g (X) = y) = P X = g 1 (y) = pX g

34.3.2

1

(y)

One-to-one functions of a continuous variable

When X is an absolutely continuous random variable and g is di¤erentiable, then also Y is absolutely continuous and its probability density function is given by the following: Proposition 196 (pdf of a one-to-one function) Let X be an absolutely continuous random variable with support RX and probability density function fX (x). Let g : R ! R be one-to-one and di¤ erentiable on the support of X. Then, the support of Y = g (X) is RY = fy = g (x) : x 2 RX g If d g dy

1

(y) 6= 0 ; 8y 2 RY

then the probability density function of Y is ( 1 fX g 1 (y) dg dy(y) fY (y) = 0

if y 2 RY if y 2 = RY

Proof. For a proof of this proposition see: Poirier6 (1995). 6 Poirier, D. J. (1995) Intermediate statistics and econometrics: a comparative approach, MIT Press.

274

34.4

CHAPTER 34. FUNCTIONS OF RANDOM VARIABLES

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let X be an absolutely continuous random variable with support RX = [0; 2] and probability density function: 3 2 8x

fX (x) =

if x 2 RX if x 2 = RX

0

Let Y = g (X) =

p

X +1

Find the probability density function of Y . Solution The support of Y is RY =

hp p i 1; 3

The function g is strictly increasing and its inverse is 1

g with derivative

dg

(y) = y 2

1

1

(y) = 2y dy

The probability density function of Y is 3 8

fY (y) =

y2

1

2

2y

0

if y 2 RY if y 2 = RY

Exercise 2 Let X be an absolutely continuous random variable with support RX = [0; 2] and probability density function fX (x) =

1 2

0

if x 2 RX if x 2 = RX

Let Y = g (X) = Find the probability density function of Y .

X2

34.4. SOLVED EXERCISES

275

Solution The support of Y is: RY = [ 4; 0] The function g is strictly decreasing and its inverse is 1

g with derivative

1=2

(y) = ( y)

1

dg

(y) = dy

1 ( y) 2

1=2

The probability density function of Y is fY (y) =

11 22

1=2

( y)

=

0

1 4

1=2

( y)

if y 2 RY if y 2 = RY

Exercise 3 Let X be a discrete random variable with support RX = f1; 2; 3; 4g and probability mass function 1 2 30 x

pX (x) =

if x 2 RX if x 2 = RX

0

Let Y = g (X) = X

1

Find the probability mass function of Y . Solution The support of Y is RY = f0; 1; 2; 3g The function g is strictly increasing and its inverse is g

1

(y) = y + 1

The probability mass function of Y is pY (y) =

1 30

0

2

(y + 1)

if y 2 RY if y 2 = RY

276

CHAPTER 34. FUNCTIONS OF RANDOM VARIABLES

Chapter 35

Functions of random vectors and their distribution Let X be a K 1 random vector with known distribution, and let a L vector Y be a function of X: Y = g (X)

1 random

where g : RK ! RL . How do we derive the distribution of Y from the distribution of X? Although there is no general answer to this question, there are some special cases in which the distribution of Y can be easily derived from the distribution of X. This lecture discusses some of these special cases.

35.1

One-to-one functions

When the function g (x) is one-to-one and hence invertible, and the random vector X is either discrete or absolutely continuous, there are readily applicable formulae for the distribution of Y , which we report below.

35.1.1

One-to-one function of a discrete vector

When X is a discrete random vector, the joint probability mass function1 of Y = g (X) is given by the following proposition. Proposition 197 Let X be a K 1 discrete random vector with support RX and joint probability mass function pX (x). Let g : RK ! RK be one-to-one on the support of X. Then, the random vector Y = g (X) has support RY = fy = g (x) : x 2 RX g and probability mass function pY (y) = 1 See

pX g 0

1

(y)

p. 116.

277

if y 2 RY if y 2 = RY

278

CHAPTER 35. FUNCTIONS OF RANDOM VECTORS

Proof. If y 2 RY , then pY (y) = P (Y = y) = P (g (X) = y) = P X = g

1

(y) = pX g

1

(y)

where we have used the fact that g is one-to-one on the support of Y , and hence it possesses an inverse g 1 (y). If y 2 = RY , then, trivially, pY (y) = 0. Example 198 Let X be a 2 1 discrete random vector and denote its components by X1 and X2 . Let the support of X be n o > > RX = [1 1] ; [2 0]

and its joint probability mass function be

8 > < 1=3 if x = [1 1] > pX (x) = 2=3 if x = [2 0] : 0 otherwise

Let

Y = g (X) = 2X The support of Y is

The inverse function is

n o > > RY = [2 2] ; [4 0] x=g

1

(y) =

1 y 2

The joint probability mass function of Y is

35.1.2

8 > < pX 21 y if y = [2 2] > 1 pY (y) = p y if y = [4 0] : X 2 0 otherwise 8 > < pX (1; 1) if y = [2 2] > = p (2; 0) if y = [4 0] : X 0 otherwise 8 > < 1=3 if y = [2 2] > = 2=3 if y = [4 0] : 0 otherwise

One-to-one function of a continuous vector

When X is an absolutely continuous random vector and g is di¤erentiable, then also Y is absolutely continuous, and its joint probability density function2 is given by the following proposition. Proposition 199 Let X be a K 1 absolutely continuous random vector with support RX and joint probability density function fX (x). Let g : RK ! RK be 2 See

p. 117.

35.1. ONE-TO-ONE FUNCTIONS

279

one-to-one and di¤ erentiable on the support of X. Denote matrix of g 1 (y), i.e., 2 @x @x1 @x1 1 : : : @y @y @y2 K 6 @x12 @x2 @x2 : : : @y 6 @y1 @y2 K Jg 1 (y) = 6 .. .. 6 .. . 4 . . @xK @xK @xK : : : @y1 @y2 @yK

by Jg

1

(y) the Jacobian

3 7 7 7 7 5

where yi is the i-th component of y, and xi is the i-th component of x = g Then, the support of Y = g (X) is

1

(y).

RY = fy = g (x) : x 2 RX g If the determinant of the Jacobian matrix satis…es det Jg

1

(y) 6= 0 ; 8y 2 RY

then the joint probability density function of Y is fY (y) =

fX g 0

1

(y) det Jg

1

(y)

if y 2 RY if y 2 = RY

Proof. See e.g. Poirier3 (1995). A special case of the above proposition obtains when the function g is a linear one-to-one mapping. Proposition 200 Let X be a K 1 absolutely continuous random vector with joint probability density fX (x). Let Y be a K 1 random vector such that Y =

+ X

where is a constant K 1 vector and is a constant K K invertible matrix. Then, Y is an absolutely continuous random vector whose probability density function fY (y) satis…es fY (y) =

1 jdet( )j fX

1

(y

)

if y 2 RY if y 2 = RY

0

where det ( ) is the determinant of

.

Proof. In this case the inverse function is g

1

(y) =

1

(y

)

The Jacobian matrix is Jg

1

(y) =

1

When y 2 RY , the joint density of Y is fX g

1

(y) det Jg

1

(y)

= fX

1

(y

) det

= fX

1

(y

) jdet ( )j

1 1

3 Poirier, D. J. (1995) Intermediate statistics and econometrics: a comparative approach, MIT Press.

280

CHAPTER 35. FUNCTIONS OF RANDOM VECTORS

Example 201 Let X be a 2

1 random vector with support RX = [1; 2]

[0; 1)

and joint probability density function x1 exp ( x1 x2 ) if x1 2 [1; 2] and x2 2 [0; 1) 0 otherwise

fX (x) =

where x1 and x2 are the two components of x. De…ne a 2 Y = g (X) with components Y1 and Y2 as follows: Y1 Y2 The inverse function g

1

= =

Jg

1

3X1 X2

(y) is de…ned by x1 x2

The Jacobian matrix of g

1 random vector

1

(y) is "

(y) =

= y1 =3 = y2

@x1 @y2 @x2 @y2

@x1 @y1 @x2 @y1

#

=

1=3 0

0 1

0 0=

1 3

Its determinant is det Jg

1

(y) =

1 ( 1) 3

The support of Y is n > RY = [y1 y2 ] : y1 = 3x1 ; y2 = =

[3; 6]

( 1; 0]

o x2 ; x1 2 [1; 2] ; x2 2 [0; 1)

For y 2 RY , the joint probability density function of Y is fY (y)

= fX g

1

(y) det Jg

1

(y)

=

(y1 =3) exp ( (y1 =3) ( y2 ))

=

1 y1 exp (y1 y2 =3) 9

1 3

while, for y 2 = RY , the joint probability density function is fY (y) = 0.

35.2

Independent sums

When the components of X are independent and g (x) = x1 + : : : + xK then the distribution of Y = g (X) can be derived by using the convolution formulae illustrated in the lecture entitled Sums of independent random variables (p. 323).

35.3. KNOWN MOMENT GENERATING FUNCTION

35.3

281

Known moment generating function

The joint moment generating function4 of Y = g (X), provided it exists, can be computed as MY (t) = E exp t> Y = E exp t> g (X) by using the transformation theorem5 . If MY (t) is recognized as the joint moment generating function of a known distribution, then such a distribution is the distribution of Y , because two random vectors have the same distribution if and only if they have the same joint moment generating function, provided the latter exists.

35.4

Known characteristic function

The joint characteristic function6 of Y = g (X) can be computed as 'Y (t) = E exp it> Y

= E exp it> g (X)

by using the transformation theorem. If 'Y (t) is recognized as the joint characteristic function of a known distribution, then such a distribution is the distribution of Y , because two random vectors have the same distribution if and only if they have the same joint characteristic function.

35.5

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let X1 be a uniform random variable7 with support RX1 = [1; 2] and probability density function 1 if x1 2 RX1 0 if x1 2 = RX1

fX1 (x1 ) =

Let X2 be an absolutely continuous random variable, independent of X1 , with support RX2 = [0; 2] and probability density function fX2 (x2 ) =

3 2 8 x2

0

Let Y1 4 See

p. p. 6 See p. 7 See p. 5 See

297. 134. 315. 359.

= X12

if x2 2 RX2 if x2 2 = RX2

282

CHAPTER 35. FUNCTIONS OF RANDOM VECTORS Y2

= X1 + X2

Find the joint probability density function of the random vector >

Y = [Y1 Y2 ] Solution

Since X1 and X2 are independent, their joint probability density function is equal to the product of their marginal density functions: 3 2 8 x2

fX (x1 ; x2 ) = fX1 (x1 ) fX2 (x2 ) =

if x1 2 [1; 2] and x2 2 [0; 2] otherwise

0

The support of Y is n o > RY = [y1 y2 ] : y1 = x21 ; y2 = x1 + x2 ; x1 2 [1; 2] ; x2 2 [0; 2] o n p > = [y1 y2 ] : y1 2 [1; 4] ; y2 2 [ y1 ; 4] The function y = g (x) is one-to-one on RY and its inverse g p x1 = y1 p x2 = y2 y1

1

(y) is de…ned by

with Jacobian matrix Jg

1

(y) =

"

@x1 @y1 @x2 @y1

@x1 @y2 @x2 @y2

#

=

"

1=2 1 2 y1 1=2 1 2 y1

0 1

#

The determinant of the Jacobian matrix is det Jg

1

(y) =

1 1=2 y 1 2 1

1 1=2 y 2 1

0

=

1 1=2 y 2 1

which is di¤erent from zero for any y belonging to RY . For y 2 RY , the joint probability density function of Y is fY (y)

= fX g =

3 (y2 8

1

(y) det Jg p

2

y1 )

1

p = fX ( y1 ; y2

(y)

1 1=2 3 1=2 y (y2 y = 2 1 16 1

p

p

1 1=2 y1 ) y1 2

2

y1 )

while, for y 2 = RY , the joint probability density function is fY (y) = 0.

Exercise 2 Let X be a 2

1 random vector with support RX = [0; 1)

[0; 1)

and joint probability density function fX (x) =

exp ( x1 0

x2 ) if x1 2 [0; 1) and x2 2 [0; 1) otherwise

35.5. SOLVED EXERCISES

283

where x1 and x2 are the two components of x. De…ne a 2 Y = g (X) with components Y1 and Y2 as follows: Y1 Y2

1 random vector

= 2X1 = X1 + X2

Find the joint probability density function of the random vector Y . Solution The inverse function g

1

(y) is de…ned by x1 x2

The Jacobian matrix of g Jg

1

1

(y) is "

(y) =

= y1 =2 = y2 y1 =2

@x1 @y1 @x2 @y1

#

@x1 @y2 @x2 @y2

=

1=2 1=2

0 1

0

1 2

1 2

Its determinant is det Jg

1

(y) =

1 1 2

=

The support of Y is n o > RY = [y1 y2 ] : y1 = 2x1 ; y2 = x1 + x2 ; x1 2 [0; 1) ; x2 2 [0; 1) n o > = [y1 y2 ] : y1 2 [0; 1) ; y2 2 [y1 =2; 1)

For y 2 RY , the joint probability density function of Y is fY (y)

= fX g

1

(y) det Jg

= fX (y1 =2; y2 =

1 exp ( y1 =2 2

1

(y)

1 y1 =2) 2 y2 + y1 =2) =

1 exp ( y2 ) 2

while, for y 2 = RY , the joint probability density function is fY (y) = 0.

284

CHAPTER 35. FUNCTIONS OF RANDOM VECTORS

Chapter 36

Moments and cross-moments This lecture introduces the notions of moment of a random variable and crossmoment of a random vector.

36.1

Moments

36.1.1

De…nition of moment

The n-th moment of a random variable is the expected value of its n-th power: De…nition 202 Let X be a random variable. Let n 2 N. If X

(n) = E [X n ]

exists and is …nite, then X is said to possess a …nite n-th moment and X (n) is called the n-th moment of X. If E [X n ] is not well-de…ned, then we say that X does not possess the n-th moment.

36.1.2

De…nition of central moment

The n-th central moment of a random variable X is the expected value of the n-th power of the deviation of X from its expected value: De…nition 203 Let X be a random variable. Let n 2 N. If X

(n) = E [(X

n

E [X]) ]

exists and is …nite, then X is said to possess a …nite n-th central moment and X (n) is called the n-th central moment of X.

36.2 36.2.1

Cross-moments De…nition of cross-moment

Let X be a K 1 random vector. A cross-moment of X is the expected value of the product of integer powers of the entries of X: nK ] E [X1n1 X2n2 : : : XK

285

286

CHAPTER 36. MOMENTS AND CROSS-MOMENTS

where Xi is the i-th entry of X and n1 ; n2 ; : : : ; nK 2 Z+ are non-negative integers. The following is a formal de…nition of cross-moment: De…nition 204 Let X be a K non-negative integers and

1 random vector. Let n1 , n2 , . . . , nK be K

n=

K X

nk

(36.1)

k=1

If X

nK (n1 ; n2 ; : : : ; nK ) = E [X1n1 X2n2 : : : XK ]

(36.2)

exists and is …nite, then it is called a cross-moment of X of order n. If all cross-moments of order n exist and are …nite, i.e. if (36.2) exists and is …nite for all K-tuples of non-negative integers n1 , n2 , . . . , nK such that condition (36.1) is satis…ed, then X is said to possess …nite cross-moments of order n. The following example shows how to compute a cross-moment of a discrete random vector: Example 205 Let X be a 3 1 discrete random vector and denote its components by X1 , X2 and X3 . Let the support of X be: 82 3 2 3 2 39 2 3 = < 1 RX = 4 2 5 ; 4 1 5 ; 4 3 5 : ; 1 3 2

and its joint probability mass function1 be:

pX (x) =

8 > > > < > > > :

1 3 1 3 1 3

0

if x = 1 if x = 2 if x = 3 otherwise

2 1 3

1 3 2

> > >

The following is a cross-moment of X of order 4: X

(1; 2; 1) = E X1 X22 X3

which can be computed using the transformation theorem2 : X

(1; 2; 1)

= =

E X1 X22 X3 X x1 x22 x3 pX (x1 ; x2 ; x3 )

(x1 ;x2 ;x3 )2RX

1 22 1 pX (1; 2; 1) + 2 12 3 pX (2; 1; 3) +3 32 2 pX (3; 3; 2) 1 1 1 64 = 4 +6 + 54 = 3 3 3 3 =

1 See 2 See

p. 117. p. 134.

36.2. CROSS-MOMENTS

36.2.2

287

De…nition of central cross-moment

The central cross-moments of a random vector X are just the cross-moments of the random vector of deviations X E [X]: De…nition 206 Let X be a K non-negative integers and

1 random vector. Let n1 , n2 , . . . , nK be K n=

K X

nk

(36.3)

k=1

If X

(n1 ; n2 ; : : : ; nK ) = E

"

K Y

(Xk

nk

E [Xk ])

k=1

#

(36.4)

exists and is …nite, then it is called a central cross-moment of X of order n. If all central cross-moments of order n exist and are …nite, i.e. if (36.4) exists and is …nite for all K-tuples of non-negative integers n1 , n2 , . . . , nK such that condition (36.3) is satis…ed, then X is said to possess …nite central cross-moments of order n. The following example shows how to compute a central cross-moment of a discrete random vector: Example 207 Let X be a 3 1 discrete random vector and denote its components by X1 , X2 and X3 . Let the support of X be: 82 3 2 3 2 39 2 1 = < 4 RX = 4 2 5 ; 4 1 5 ; 4 3 5 : ; 4 1 2 and its joint probability mass function be: 8 1 > if x = 4 > > < 25 if x = 2 5 pX (x) = 2 > > if x = 1 > : 5 0 otherwise

2 1 3

4 1 2

> > >

The expected values of the three components of X are: E [X1 ] E [X2 ] E [X3 ]

1 2 +2 +1 5 5 1 2 = 2 +1 +3 5 5 1 2 = 4 +1 +2 5 5

= 4

2 =2 5 2 =2 5 2 =2 5

The following is a central cross-moment of X of order 3: h i 2 E [X1 ]) (X2 E [X2 ]) X (2; 1; 0) = E (X1 which can be computed using the transformation theorem: h i 2 (X E [X ]) (X E [X ]) (2; 1; 0) = E 1 1 2 2 X

288

CHAPTER 36. MOMENTS AND CROSS-MOMENTS X

=

(x1

2

2)

(x2

2) pX (x1 ; x2 ; x3 )

(x1 ;x2 ;x3 )2RX

=

(4

2

2)

+ (2

(1

(1

2) pX (2; 1; 1)

(3

2) pX (1; 3; 2) 2 2 2) = 5 5

2) 2

2)

2) pX (4; 2; 4)

2

2)

+ (1 =

(2 2

(3

Chapter 37

Moment generating function of a random variable The distribution of a random variable is often characterized in terms of its moment generating function (mgf), a real function whose derivatives at zero are equal to the moments1 of the random variable. Mgfs have great practical relevance not only because they can be used to easily derive moments, but also because a probability distribution is uniquely determined by its mgf, a fact that, coupled with the analytical tractability of mgfs, makes them a handy tool to solve several problems, such as deriving the distribution of a sum of two or more random variables. It must be mentioned that not all random variables possess an mgf. However, all random variables possess a characteristic function2 , another transform that enjoys properties similar to those enjoyed by the mgf.

37.1

De…nition

We start this lecture by giving a de…nition of mgf. De…nition 208 Let X be a random variable. If the expected value E [exp (tX)] exists and is …nite for all real numbers t belonging to a closed interval [ h; h] R, with h > 0, then we say that X possesses a moment generating function and the function MX : [ h; h] ! R de…ned by MX (t) = E [exp (tX)] is called the moment generating function of X. The following example shows how the mgf of an exponential random variable is derived. 1 See 2 See

p. 285. p. 307.

289

290

CHAPTER 37. MGF OF A RANDOM VARIABLE

Example 209 Let X be an exponential random variable3 with parameter Its support is the set of positive real numbers

2 R++ .

RX = [0; 1) and its probability density function is exp (

fX (x) =

x) if x 2 RX if x 2 = RX

0

Its mgf is computed as follows: E [exp (tX)]

= =

Z

Z

1 1 1

exp (tx) fX (x) dx exp (tx) exp (

x) dx

0

A

=

Z

1

exp ((t

) x) dx

0

= =

1

exp ((t

t 0

) x)

1 0

1

=

t

t

where: in step A we have assumed that t < , which is necessary for the integral to be …nite. Therefore, the expected value exists and is …nite for t 2 [ h; h] if h is such that 0 < h < , and X possesses an mgf MX (t) =

37.2

t

Moments and mgfs

The mgf takes its name by the fact that it can be used to derive the moments of X, as stated in the following proposition. Proposition 210 If a random variable X possesses an mgf MX (t), then, for any n 2 N, the n-th moment of X, denoted by X (n), exists and is …nite. Furthermore, X

(n) = E [X n ] =

dn MX (t) dtn

t=0

n

X (t) where d M is the n-th derivative of MX (t) with respect to t, evaluated at dtn t=0 the point t = 0.

Proof. Proving the above proposition is quite complicated, because a lot of analytical details must be taken care of (see, e.g., Pfei¤er4 - 1978). The intuition, 3 See

p. 365. P. E. (1978) Concepts of probability theory, Courier Dover Publications.

4 Pfei¤er,

37.3. DISTRIBUTIONS AND MGFS

291

however, is straightforward: since the expected value is a linear operator and differentiation is a linear operation, under appropriate conditions we can di¤erentiate through the expected value, as follows: dn MX (t) dn dn = E [exp (tX)] = E exp (tX) = E [X n exp (tX)] dtn dtn dtn which, evaluated at the point t = 0, yields dn MX (t) dtn

= E [X n exp (0 X)] = E [X n ] =

X

(n)

t=0

The following example shows how this proposition can be applied. Example 211 In Example 209 we have demonstrated that the mgf of an exponential random variable is MX (t) =

t

The expected value of X can be computed by taking the …rst derivative of the mgf: dMX (t) = dt (

2

t)

and evaluating it at t = 0: E [X] =

dMX (t) dt

= t=0

2

(

0)

=

1

The second moment of X can be computed by taking the second derivative of the mgf: 2 d2 MX (t) = 3 2 dt ( t) and evaluating it at t = 0: E X2 =

d2 MX (t) dt2

= t=0

2 (

3

0)

=

2 2

And so on for the higher moments.

37.3

Distributions and mgfs

The following proposition states the most important property of the mgf. Proposition 212 (equality of distributions) Let X and Y be two random variables. Denote by FX (x) and FY (y) their distribution functions5 , and by MX (t) and MY (t) their mgfs. X and Y have the same distribution, i.e., FX (x) = FY (x) for any x, if and only if they have the same mgfs, i.e., MX (t) = MY (t) for any t. 5 See

p. 108.

292

CHAPTER 37. MGF OF A RANDOM VARIABLE

Proof. For a fully general proof of this proposition see, e.g., Feller6 (2008). We just give an informal proof for the special case in which X and Y are discrete random variables taking only …nitely many values. The "only if" part is trivial. If X and Y have the same distribution, then MX (t) = E [exp (tX)] = E [exp (tY )] = MY (t) The "if" part is proved as follows. Denote by RX and RY the supports of X and Y , and by pX (x) and pY (y) their probability mass functions7 . Denote by A the union of the two supports: A = R X [ RY and by a1 ; : : : ; an the elements of A. The mgf of X can be written as MX (t) E [exp (tX)] X = exp (tx) pX (x) =

A

B

=

x2RX n X

exp (tai ) pX (ai )

i=1

where: in step A we have used the de…nition of expected value; in step B we have used the fact that pX (ai ) = 0 if ai 2 = RX . By the same token, the mgf of Y can be written as n X MY (t) = exp (tai ) pY (ai ) i=1

If X and Y have the same mgf, then, for any t belonging to a closed neighborhood of zero, MX (t) = MY (t) and

n X

exp (tai ) pX (ai ) =

i=1

n X

exp (tai ) pY (ai )

i=1

By rearranging terms, we obtain n X

exp (tai ) [pX (ai )

pY (ai )] = 0

i=1

This can be true for any t belonging to a closed neighborhood of zero only if pX (ai )

pY (ai ) = 0

for every i. It follows that that the probability mass functions of X and Y are equal. As a consequence, also their distribution functions are equal. It must be stressed that this proposition is extremely important and relevant from a practical viewpoint: in many cases where we need to prove that two distributions are equal, it is much easier to prove equality of the mgfs than to prove equality of the distribution functions. 6 Feller, 7 See

W. (2008) An introduction to probability theory and its applications, Volume 2, Wiley. p. 106.

37.4. MORE DETAILS

293

Also note that equality of the distribution functions can be replaced in the proposition above by equality of the probability mass functions8 if X and Y are discrete random variables, or by equality of the probability density functions9 if X and Y are absolutely continuous random variables.

37.4

More details

37.4.1

Mgf of a linear transformation

The next proposition gives a formula for the mgf of a linear transformation. Proposition 213 Let X be a random variable possessing an mgf MX (t). De…ne Y = a + bX where a; b 2 R are two constants and b 6= 0. Then, the random variable Y possesses an mgf MY (t) and MY (t) = exp (at) MX (bt) Proof. Using the de…nition of mgf, we obtain MY (t)

= E [exp (tY )] = E [exp (at + btX)] = E [exp (at) exp (btX)] = exp (at) E [exp (btX)] = exp (at) MX (bt)

If MX (t) is de…ned on a closed interval [ h; h], then MY (t) is de…ned on the h h interval b; b .

37.4.2

Mgf of a sum

The next proposition shows how to derive the mgf of a sum of independent random variables. Proposition 214 Let X1 ; : : : ; Xn be n mutually independent10 random variables. Let Z be their sum: n X Z= Xi i=1

Then, the mgf of Z is the product of the mgfs of X1 ; : : : ; Xn : MZ (t) =

n Y

MXi (t)

i=1

provided the latter exist. Proof. This is proved as follows: MZ (t) = 8 See

p. 106. p. 107. 1 0 See p. 233. 9 See

E [exp (tZ)]

294

CHAPTER 37. MGF OF A RANDOM VARIABLE "

= E exp t

= E exp

= E A B

= =

n Y

i=1 n Y

Xi

i=1 n X

"

"

n X

tXi

i=1

n Y

!#

!#

#

exp (tXi )

i=1

E [exp (tXi )] MXi (t)

i=1

where: in step A we have used the properties of mutually independent variables11 ; in step B we have used the de…nition of mgf.

37.5

Solved exercises

Some solved exercises on mgfs can be found below.

Exercise 1 Let X be a discrete random variable having a Bernoulli distribution12 . Its support is RX = f0; 1g and its probability mass function13 is 8 < p 1 pX (x) = : 0

p

if x = 1 if x = 0 if x 2 = RX

where p 2 (0; 1) is a constant. Derive the mgf of X, if it exists. Solution Using the de…nition of mgf, we get MX (t)

=

E [exp (tX)] =

X

exp (tx) pX (x)

x2RX

= =

exp (t 1) pX (1) + exp (t 0) pX (0) exp (t) p + 1 (1 p) = 1 p + p exp (t)

The mgf exists and it is well-de…ned because the above expected value exists for any t 2 R. 1 1 See

p. 234. p. 335. 1 3 See p. 106. 1 2 See

37.5. SOLVED EXERCISES

295

Exercise 2 Let X be a random variable with mgf MX (t) =

1 (1 + exp (t)) 2

Derive the variance of X. Solution We can use the following formula for computing the variance14 : 2

Var [X] = E X 2

E [X]

The expected value of X is computed by taking the …rst derivative of the mgf: dMX (t) 1 = exp (t) dt 2 and evaluating it at t = 0: dMX (t) dt

E [X] =

= t=0

1 1 exp (0) = 2 2

The second moment of X is computed by taking the second derivative of the mgf: d2 MX (t) 1 = exp (t) dt2 2 and evaluating it at t = 0: E X2 =

d2 MX (t) dt2

= t=0

1 1 exp (0) = 2 2

Therefore, Var [X]

= E X2 =

1 2

2

E [X] =

1 2

1 2

2

1 1 = 4 4

Exercise 3 A random variable X is said to have a Chi-square distribution15 with n degrees of freedom if its mgf is de…ned for any t < 12 and it is equal to MX (t) = (1

2t)

n=2

De…ne Y = X1 + X2 where X1 and X2 are two independent random variables having Chi-square distributions with n1 and n2 degrees of freedom respectively. Prove that Y has a Chi-square distribution with n1 + n2 degrees of freedom. 1 4 See 1 5 See

p. 156. p. 387.

296

CHAPTER 37. MGF OF A RANDOM VARIABLE

Solution The mgfs of X1 and X2 are MX1 (t)

= (1

MX2 (t) =

(1

2t)

n1 =2

2t)

n2 =2

The mgf of a sum of independent random variables is the product of the mgfs of the summands: MY (t) = (1

2t)

n1 =2

(1

2t)

n2 =2

= (1

2t)

(n1 +n2 )=2

Therefore, MY (t) is the mgf of a Chi-square random variable with n1 + n2 degrees of freedom. As a consequence, Y has a Chi-square distribution with n1 +n2 degrees of freedom.

Chapter 38

Moment generating function of a random vector The concept of joint moment generating function (joint mgf) is a multivariate generalization of the concept of moment generating function (mgf). Similarly to the univariate case, the joint mgf uniquely determines the joint distribution of its associated random vector, and it can be used to derive the cross-moments1 of the distribution by partial di¤erentiation. If you are not familiar with the univariate concept, you are advised to …rst read the lecture entitled Moment generating functions (p. 289).

38.1

De…nition

Let us start with a formal de…nition. De…nition 215 Let X be a K E exp t> X

1 random vector. If the expected value

= E [exp (t1 X1 + t2 X2 + : : : + tK XK )]

exists and is …nite for all K such that H = [ h1 ; h 1 ]

1 real vectors t belonging to a closed rectangle H [ h2 ; h 2 ]

:::

[ hK ; h K ]

RK

with hi > 0 for all i = 1; : : : ; K, then we say that X possesses a joint moment generating function and the function MX : H ! R de…ned by MX (t) = E exp t> X is called the joint moment generating function of X. As an example, we derive the joint mgf of a standard multivariate normal random vector. Example 216 Let X be a K Its support is

1 standard multivariate normal random vector2 . R X = RK

1 See 2 See

p. 285. p. 439.

297

298

CHAPTER 38. JOINT MGF OF A RANDOM VECTOR

and its joint probability density function3 is fX (x) = (2 )

K=2

1 | x x 2

exp

As explained in the lecture entitled Multivariate normal distribution (p. 439), the K components of X are K mutually independent4 standard normal random variables, because the joint probability density function of X can be written as fX (x) = f (x1 ) f (x2 ) : : : f (xK ) where xi is the i-th entry of x, and f (xi ) is the probability density function of a standard normal random variable: f (xi ) = (2 )

1=2

1 2 x 2 i

exp

The joint mgf of X can be derived as follows: MX (t)

= E exp t> X = E [exp (t1 X1 + t2 X2 + : : : + tK XK )] "K # Y = E exp (ti Xi ) i=1

A

=

K Y

E [exp (ti Xi )]

i=1

B

=

K Y

MXi (ti )

i=1

where: in step A we have used the fact that the entries of X are mutually independent5 ; in step B we have used the de…nition of mgf of a random variable6 . Since the mgf of a standard normal random variable is7 MXi (ti ) = exp

1 2 t 2 i

the joint mgf of X is MX (t)

=

K Y

MXi (ti ) =

i=1

=

exp

1 2

K X i=1

t2i

!

K Y

exp

i=1

= exp

1 2 t 2 i 1 > t t 2

Note that the mgf MXi (ti ) of a standard normal random variable is de…ned for any ti 2 R. As a consequence, the joint mgf of X is de…ned for any t 2 RK . 3 See

p. p. 5 See p. 6 See p. 7 See p. 4 See

117. 233. 234. 289. 378.

38.2. CROSS-MOMENTS AND JOINT MGFS

38.2

299

Cross-moments and joint mgfs

The next proposition shows how the joint mgf can be used to derive the crossmoments of a random vector. Proposition 217 If a K 1 random vector X possesses a joint mgf MX (t), then it possesses …nite cross-moments of order n for any n 2 N. Furthermore, if you de…ne a cross-moment of order n as X

nK ] (n1 ; n2 ; : : : ; nK ) = E [X1n1 X2n2 : : : XK

where n1 ; n2 ; : : : ; nK 2 Z+ and n = X

(n1 ; n2 ; : : : ; nK ) =

PK

k=1

nk , then

@ n1 +n2 +:::+nK MX (t1 ; t2 ; : : : ; tK ) @tn1 1 @tn2 2 : : : @tnKK

t1 =0;t2 =0;:::;tK =0

where the derivative on the right-hand side is an n-th order cross-partial derivative of MX (t) evaluated at the point t1 = 0, t2 = 0, . . . , tK = 0. Proof. We do not provide a rigorous proof of this proposition, but see, e.g., Pfei¤er8 (1978) and DasGupta9 (2010). The intuition of the proof, however, is straightforward: since the expected value is a linear operator and di¤erentiation is a linear operation, under appropriate conditions one can di¤erentiate through the expected value, as follows: @ n1 +n2 +:::+nK MX (t1 ; t2 ; : : : ; tK ) @tn1 1 @tn2 2 : : : @tnKK @ n1 +n2 +:::+nK = E [exp (t1 X1 + t2 X2 + : : : + tK XK )] @tn1 1 @tn2 2 : : : @tnKK @ n1 +n2 +:::+nK = E exp (t1 X1 + t2 X2 + : : : + tK XK ) @tn1 1 @tn2 2 : : : @tnKK nK = E [X1n1 X2n2 : : : XK exp (t1 X1 + t2 X2 + : : : + tK XK )] which, evaluated at the point t1 = 0; t2 = 0; : : : ; tK = 0, yields @ n1 +n2 +:::+nK MX (t1 ; t2 ; : : : ; tK ) @tn1 1 @tn2 2 : : : @tnKK = = =

t1 =0;t2 =0;:::;tK =0

nK E [X1n1 X2n2 : : : XK exp (0 X1 + 0 X2 + : : : + 0 XK )] nK n1 n2 E [X1 X2 : : : XK ] X (n1 ; n2 ; : : : ; nK )

The following example shows how the above proposition can be applied. Example 218 Let us continue with the previous example. The joint mgf of a 2 1 standard normal random vector X is MX (t) = exp 8 Pfei¤er,

1 > t t 2

= exp

1 2 1 2 t + t 2 1 2 2

P. E. (1978) Concepts of probability theory, Courier Dover Publications. A. (2010) Fundamentals of probability: a …rst course, Springer.

9 DasGupta,

300

CHAPTER 38. JOINT MGF OF A RANDOM VECTOR

The second cross-moment of X can be computed by taking the second cross-partial derivative of the joint mgf: X

(1; 1)

= =

E [X1 X2 ] @2 exp @t1 @t2

=

@ @t1

=

@ @t1

=

38.3

1 2 1 2 t + t 2 1 2 2

@ exp @t2 t2 exp

t1 t2 exp

t1 =0;t2 =0

1 2 1 2 t + t 2 1 2 2 1 2 1 2 t + t 2 1 2 2

1 2 1 2 t + t 2 1 2 2

t1 =0;t2 =0

t1 =0;t2 =0

=0 t1 =0;t2 =0

Joint distributions and joint mgfs

One of the most important properties of the joint mgf is that it completely characterizes the joint distribution of a random vector. Proposition 219 (equality of distributions) Let X and Y be two K 1 random vectors, possessing joint mgfs MX (t) and MY (t). Denote by FX (x) and FY (y) their joint distribution functions10 . X and Y have the same distribution, i.e., FX (x) = FY (x) for any x 2 RK , if and only if they have the same mgfs, i.e., MX (t) = MY (t) for any t 2 H RK . Proof. The reader may refer to Feller11 (2008) for a rigorous proof. The informal proof given here is almost identical to that given for the univariate case12 . We con…ne our attention to the case in which X and Y are discrete random vectors taking only …nitely many values. As far as the left-to-right direction of the implication is concerned, it su¢ ces to note that if X and Y have the same distribution, then MX (t) = E exp t> X

= E exp t> Y

= MY (t)

The right-to-left direction of the implication is proved as follows. Denote by RX and RY the supports of X and Y , and by pX (x) and pY (y) their joint probability mass functions13 . De…ne the union of the two supports: A = R X [ RY and denote its members by a1 ; : : : ; an . The joint mgf of X can be written as MX (t) A

B

= =

=

E exp t> X X exp t> x pX (x)

x2RX n X

exp t> ai pX (ai )

i=1

1 0 See

p. 118. W. (2008) An introduction to probability theory and its applications, Volume 2, Wiley. 1 2 See p. 291. 1 3 See p. 116. 1 1 Feller,

38.4. MORE DETAILS

301

where: in step A we have used the de…nition of expected value; in step B we have used the fact that pX (ai ) = 0 if ai 2 = RX . By the same line of reasoning, the joint mgf of Y can be written as MY (t) =

n X

exp t> ai pY (ai )

i=1

If X and Y have the same joint mgf, then MX (t) = MY (t) for any t belonging to a closed rectangle where the two mgfs are well-de…ned, and n X

exp t> ai pX (ai ) =

i=1

n X

exp t> ai pY (ai )

i=1

By rearranging terms, we obtain n X

exp t> ai [pX (ai )

pY (ai )] = 0

i=1

This equality can be veri…ed for every t only if pX (ai )

pY (ai ) = 0

for every i. As a consequence, the joint probability mass functions of X and Y are equal, which implies that also their joint distribution functions are equal. This proposition is used very often in applications where one needs to demonstrate that two joint distributions are equal. In such applications, proving equality of the joint mgfs is often much easier than proving equality of the joint distribution functions (see also the comments to Proposition 212).

38.4

More details

38.4.1

Joint mgf of a linear transformation

The next proposition gives a formula for the joint mgf of a linear transformation. Proposition 220 Let X be a K De…ne

1 random vector possessing a joint mgf MX (t). Y = A + BX

where A is a L 1 constant vector and B is an L K constant matrix. Then, the L 1 random vector Y possesses a joint mgf MY (t), and MY (t) = exp t> A MX B > t Proof. Using the de…nition of joint mgf, we obtain MY (t)

=

E exp t> Y

=

E exp t> A + t> BX

=

E exp t> A exp t> BX

302

CHAPTER 38. JOINT MGF OF A RANDOM VECTOR exp t> A E exp t> BX h i > = exp t> A E exp B > t X

=

= exp t> A MX B > t

If MX (t) is de…ned on a closed rectangle H, then MY (t) is de…ned on another closed rectangle whose shape and location depend on A and B.

38.4.2

Joint mgf of a vector with independent entries

The next proposition shows how to derive the joint mgf of a vector whose components are independent random variables. Proposition 221 Let X be a K 1 random vector. Let its entries X1 ; : : : ; XK be K mutually independent random variables possessing an mgf. Denote the mgf of the i-th entry of X by MXi (ti ). Then, the joint mgf of X is MX (t1 ; : : : ; tK ) =

K Y

MXi (ti )

i=1

Proof. This is proved as follows: MX (t)

E exp t> X " !# K X E exp ti X i

= =

=

E

"K Y

i=1

exp (ti Xi )

i=1

A

K Y

=

#

E [exp (ti Xi )]

i=1

B

K Y

=

MXi (ti )

i=1

where: in step A we have used the fact that the entries of X are mutually independent; in step B we have used the de…nition of mgf of a random variable.

38.4.3

Joint mgf of a sum

The next proposition shows how to derive the joint mgf of a sum of independent random vectors. Proposition 222 Let X1 , . . . , Xn be n mutually independent random vectors14 , all of dimension K 1. Let Z be their sum: Z=

n X i=1

1 4 See

p. 235.

Xi

38.5. SOLVED EXERCISES

303

Then, the joint mgf of Z is the product of the joint mgfs of X1 , . . . , Xn : MZ (t) =

n Y

MXi (t)

i=1

provided the latter exist. Proof. This is proved as follows: E exp t> Z " !# n X > = E exp t Xi

MZ (t) =

"

= E exp

= E A B

= =

"

n Y

i=1 n Y

i=1 n X >

t Xi

i=1

n Y

>

exp t Xi

i=1

!#

#

E exp t> Xi MXi (t)

i=1

where: in step A we have used the fact that the vectors Xi are mutually independent; in step B we have used the de…nition of joint mgf.

38.5

Solved exercises

Some solved exercises on joint mgfs can be found below.

Exercise 1 Let X be a 2 1 discrete random vector and denote its components by X1 and X2 . Let the support of X be n o > > > RX = [1 1] ; [2 0] ; [0 0] and its joint probability mass function be 8 1=3 > > < 1=3 pX (x) = > > : 1=3 0

Derive the joint mgf of X, if it exists.

>

if x = [1 1] > if x = [2 0] > if x = [0 0] otherwise

304

CHAPTER 38. JOINT MGF OF A RANDOM VECTOR

Solution By using the de…nition of joint mgf, we get MX (t)

= =

E exp t> X = E [exp (t1 X1 + t2 X2 )] X exp (t1 x1 + t2 x2 ) pX (x1 ; x2 )

(x1 ;x2 )2RX

=

exp (t1 1 + t2 1) pX (1; 1) + exp (t1 2 + t2 0) pX (2; 0) + exp (t1 0 + t2 0) pX (0; 0) 1 1 1 + exp (2t1 ) + exp (0) = exp (t1 + t2 ) 3 3 3 1 = [1 + exp (2t1 ) + exp (t1 + t2 )] 3 Obviously, the joint mgf exists and it is well-de…ned because the above expected value exists for any t 2 R2 .

Exercise 2 Let >

X = [X1 X2 ] be a 2

1 random vector with joint mgf MX1 ;X2 (t1 ; t2 ) =

1 2 + exp (t1 + 2t2 ) 3 3

Derive the expected value of X1 . Solution The mgf of X1 is MX1 (t1 )

=

E [exp (t1 X1 )] = E [exp (t1 X1 + 0 X2 )] 1 2 = MX1 ;X2 (t1 ; 0) = + exp (t1 + 2 0) 3 3 1 2 = + exp (t1 ) 3 3

The expected value of X1 is obtained by taking the …rst derivative of its mgf: dMX1 (t1 ) 2 = exp (t1 ) dt1 3 and evaluating it at t1 = 0: E [X1 ] =

dMX1 (t1 ) dt1

= t1 =0

2 2 exp (0) = 3 3

Exercise 3 Let >

X = [X1 X2 ]

38.5. SOLVED EXERCISES be a 2

305

1 random vector with joint mgf MX1 ;X2 (t1 ; t2 ) =

1 [1 + exp (t1 + 2t2 ) + exp (2t1 + t2 )] 3

Derive the covariance between X1 and X2 . Solution We can use the following covariance formula: Cov [X1 ; X2 ] = E [X1 X2 ]

E [X1 ] E [X2 ]

The mgf of X1 is MX1 (t1 )

=

E [exp (t1 X1 )] = E [exp (t1 X1 + 0 X2 )] 1 = MX1 ;X2 (t1 ; 0) = [1 + exp (t1 + 2 0) + exp (2t1 + 0)] 3 1 = [1 + exp (t1 ) + exp (2t1 )] 3

The expected value of X1 is obtained by taking the …rst derivative of its mgf: dMX1 (t1 ) 1 = [exp (t1 ) + 2 exp (2t1 )] dt1 3 and evaluating it at t1 = 0: E [X1 ] =

dMX1 (t1 ) dt1

= t1 =0

1 [exp (0) + 2 exp (0)] = 1 3

The mgf of X2 is MX2 (t2 )

=

E [exp (t2 X2 )] = E [exp (0 X1 + t2 X2 )] 1 = MX1 ;X2 (0; t2 ) = [1 + exp (0 + 2t2 ) + exp (2 0 + t2 )] 3 1 = [1 + exp (2t2 ) + exp (t2 )] 3

To compute the expected value of X2 we take the …rst derivative of its mgf: dMX2 (t2 ) 1 = [2 exp (2t2 ) + exp (t2 )] dt2 3 and we evaluate it at t2 = 0: E [X2 ] =

dMX2 (t2 ) dt2

= t2 =0

1 [2 exp (0) + exp (0)] = 1 3

The second cross-moment of X is computed by taking the second cross-partial derivative of the joint mgf: @ 2 MX1 ;X2 (t1 ; t2 ) @t1 @t2

= =

@ @t1 @ @t1

@ 1 [1 + exp (t1 + 2t2 ) + exp (2t1 + t2 )] @t2 3 1 [2 exp (t1 + 2t2 ) + exp (2t1 + t2 )] 3

306

CHAPTER 38. JOINT MGF OF A RANDOM VECTOR =

1 [2 exp (t1 + 2t2 ) + 2 exp (2t1 + t2 )] 3

and evaluating it at (t1 ; t2 ) = (0; 0): E [X1 X2 ]

= =

@ 2 MX1 ;X2 (t1 ; t2 ) @t1 @t2 t1 =0;t2 =0 1 4 [2 exp (0) + 2 exp (0)] = 3 3

Therefore, Cov [X1 ; X2 ]

= E [X1 X2 ] E [X1 ] E [X2 ] 4 1 = 1 1= 3 3

Chapter 39

Characteristic function of a random variable In the lecture entitled Moment generating function (p. 289), we have explained that the distribution of a random variable can be characterized in terms of its moment generating function, a real function that enjoys two important properties: it uniquely determines its associated probability distribution, and its derivatives at zero are equal to the moments of the random variable. We have also explained that not all random variables possess a moment generating function. The characteristic function (cf) enjoys properties that are almost identical to those enjoyed by the moment generating function, but it has an important advantage: all random variables possess a characteristic function.

39.1

De…nition

We start this lecture by giving a de…nition of characteristic function. De…nition 223 Let X be a random variable. Let i = The function ' : R ! C de…ned by

p

1 be the imaginary unit.

'X (t) = E [exp (itX)] is called the characteristic function of X. The …rst thing to be noted is that the characteristic function 'X (t) exists for any t. This can be proved as follows: 'X (t)

= E [exp (itX)] = E [cos (tX) + i sin (tX)] = E [cos (tX)] + iE [sin (tX)]

and the last two expected values are well-de…ned, because the sine and cosine functions are bounded in the interval [ 1; 1]. 307

308

CHAPTER 39. CF OF A RANDOM VARIABLE

39.2

Moments and cfs

Like the moment generating function of a random variable, the characteristic function can be used to derive the moments of X, as stated in the following proposition. Proposition 224 Let X be a random variable and 'X (t) its characteristic function. Let n 2 N. If the n-th moment of X, denoted by X (n), exists and is …nite, then 'X (t) is n times continuously di¤ erentiable and X

(n) = E [X n ] =

1 dn 'X (t) in dtn

t=0

n

'X (t) where d dt is the n-th derivative of 'X (t) with respect to t, evaluated at n t=0 the point t = 0.

Proof. The proof of the above proposition is quite complex (see, e.g., Resnick1 1999). The intuition, however, is straightforward: since the expected value is a linear operator and di¤erentiation is a linear operation, under appropriate conditions one can di¤erentiate through the expected value, as follows: dn 'X (t) dtn

dn E [exp (itX)] dtn dn = E exp (itX) dtn n = E [(iX) exp (itX)] = in E [X n exp (itX)] =

which, evaluated at the point t = 0, yields dn 'X (t) dtn

= in E [X n exp (0 iX)] = in E [X n ] = in

X

(n)

t=0

In practice, the proposition above is not very useful when one wants to compute a moment of a random variable, because it requires to know in advance whether the moment exists or not. A much more useful statement is provided by the next proposition. Proposition 225 Let X be a random variable and 'X (t) its characteristic function. If 'X (t) is n times di¤ erentiable at the point t = 0, then 1. if n is even, the k-th moment of X exists and is …nite for any k

n;

2. if n is odd, the k-th moment of X exists and is …nite for any k < n. In both cases, the following holds: X 1 Resnick,

(k) = E X k =

1 dk 'X (t) ik dtk

S. I. (1999) A Probability Path, Birkhauser.

t=0

39.3. DISTRIBUTIONS AND CFS

309

Proof. See e.g. Ushakov2 (1999). The following example shows how this proposition can be used to compute the second moment of an exponential random variable. Example 226 Let X be an exponential random variable with parameter Its support is RX = [0; 1)

2 R++ .

and its probability density function is exp (

fX (x) =

0

x) if x 2 RX if x 2 = RX

Its characteristic function is 'X (t) = E [exp (itX)] =

it

which is proved in the lecture entitled Exponential distribution (p. 365). Note that dividing by ( it) does not pose any division-by-zero problem, because > 0 and the denominator is di¤ erent from 0 also when t = 0. The …rst derivative of the characteristic function is i d'X (t) = 2 dt ( it) The second derivative of the characteristic function is d2 'X (t) = dt2

2 (

3

it)

Evaluating it at t = 0, we obtain d2 'X (t) dt2

=

2 2

t=0

Therefore, the second moment of X exists and is …nite. Furthermore, it can be computed as 2 1 d2 'X (t) 1 2 = 2 = 2 E X2 = 2 2 2 i dt i t=0

39.3

Distributions and cfs

Characteristic functions, like moment generating functions, can also be used to characterize the distribution of a random variable. Proposition 227 (equality of distributions) Let X and Y be two random variables. Denote by FX (x) and FY (y) their distribution functions3 and by 'X (t) and 'Y (t) their characteristic functions. X and Y have the same distribution, i.e., FX (x) = FY (x) for any x, if and only if they have the same characteristic function, i.e., 'X (t) = 'Y (t) for any t. 2 Ushakov, 3 See

N. G. (1999) Selected topics in characteristic functions, VSP. p. 108.

310

CHAPTER 39. CF OF A RANDOM VARIABLE

Proof. For a formal proof, see, e.g., Resnick4 (1999). An informal proof for the special case in which X and Y have a …nite support can be provided along the same lines of the proof of Proposition 212, which concerns the moment generating function. This is left as an exercise (just replace exp (tX) and exp (tY ) in that proof with exp (itX) and exp (itY )). This property is analogous to the property of joint moment generating functions stated in Proposition 212. The same comments we made about that proposition also apply to this one.

39.4

More details

39.4.1

Cf of a linear transformation

The next proposition gives a formula for the characteristic function of a linear transformation. Proposition 228 Let X be a random variable with characteristic function 'X (t). De…ne Y = a + bX where a; b 2 R are two constants and b 6= 0. Then, the characteristic function of Y is 'Y (t) = exp (iat) 'X (bt) Proof. Using the de…nition of characteristic function, we get 'Y (t)

39.4.2

= = = = =

E [exp (itY )] E [exp (iat + ibtX)] E [exp (iat) exp (ibtX)] exp (iat) E [exp (ibtX)] exp (iat) 'X (bt)

Cf of a sum

The next proposition shows how to derive the characteristic function of a sum of independent random variables. Proposition 229 Let X1 , . . . , Xn be n mutually independent random variables5 . Let Z be their sum: n X Z= Xj j=1

Then, the characteristic function of Z is the product of the characteristic functions of X1 , . . . , Xn : n Y 'Z (t) = 'Xj (t) j=1

4 Resnick, 5 See

S. I. (1999) A Probability Path, Birkhauser. p. 233.

39.4. MORE DETAILS

311

Proof. This is proved as follows: 'Z (t)

= E [exp (itZ)] 13 2 0 n X X j A5 = E 4exp @it =

2

0

E 4exp @ 2

= E4

n Y

j=1

A

B

=

=

n Y

j=1 n Y

j=1

n X j=1

13

itXj A5 3

exp (itXj )5

E [exp (itXj )] 'Xj (t)

j=1

where: in step A we have used the properties of mutually independent variables6 ; in step B we have used the de…nition of characteristic function.

39.4.3

Computation of the characteristic function

When X is a discrete random variable with support RX and probability mass function pX (x), its characteristic function is X 'X (t) = E [exp (itX)] = exp (itx) pX (x) x2RX

Thus, the computation of the characteristic function is pretty straightforward: all we need to do is to sum the complex numbers exp (itx) pX (x) over all values of x belonging to the support of X. When X is an absolutely continuous random variable with probability density function fX (x), its characteristic function is Z 1 'X (t) = E [exp (itX)] = exp (itx) fX (x) dx 1

The right-hand side integral is a contour integral of a complex function along the real axis. As people reading these lecture notes are usually not familiar with contour integration (a topic in complex analysis), we avoid it altogether in the rest of this book. We instead exploit the fact that exp (itx) = cos(tx) + i sin (tx) to rewrite the contour integral as the complex sum of two ordinary integrals: Z 1 Z 1 Z 1 exp (itx) fX (x) dx = cos(tx)fX (x) dx + i sin(tx)fX (x) dx 1

1

and to compute the two integrals separately. 6 See

p. 234.

1

312

39.5

CHAPTER 39. CF OF A RANDOM VARIABLE

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let X be a discrete random variable having support RX = f0; 1; 2g and probability mass function 8 1=3 > > < 1=3 pX (x) = 1=3 > > : 0

if if if if

x=0 x=1 x=2 x2 = RX

Derive the characteristic function of X. Solution

By using the de…nition of characteristic function, we obtain X 'X (t) = E [exp (itX)] = exp (itx) pX (x) x2RX

= =

exp (it 0) pX (0) + exp (it 1) pX (1) + exp (it 2) pX (2) 1 1 1 1 + exp (it) + exp (2it) = [1 + exp (it) + exp (2it)] 3 3 3 3

Exercise 2 Use the characteristic function found in the previous exercise to derive the variance of X. Solution We can use the following formula for computing the variance: Var [X] = E X 2

2

E [X]

The expected value of X is computed by taking the …rst derivative of the characteristic function: d'X (t) 1 = [i exp (it) + 2i exp (2it)] dt 3 evaluating it at t = 0, and dividing it by i: E [X] =

1 d'X (t) i dt

= t=0

11 [i exp (i 0) + 2i exp (2i 0)] = 1 i3

The second moment of X is computed by taking the second derivative of the characteristic function: d2 'X (t) 1 2 = i exp (it) + 4i2 exp (2it) 2 dt 3

39.5. SOLVED EXERCISES

313

evaluating it at t = 0, and dividing it by i2 : E X2 =

1 d2 'X (t) i2 dt2

= t=0

5 11 2 i exp (i 0) + 4i2 exp (2i 0) = i2 3 3

Therefore, Var [X] = E X 2

2

E [X] =

5 3

12 =

2 3

Exercise 3 Read and try to understand how the characteristic functions of the uniform and exponential distributions are derived in the lectures entitled Uniform distribution (p. 359) and Exponential distribution (p. 365).

314

CHAPTER 39. CF OF A RANDOM VARIABLE

Chapter 40

Characteristic function of a random vector This lecture introduces the notion of joint characteristic function (joint cf) of a random vector, which is a multivariate generalization of the concept of characteristic function of a random variable. Before reading this lecture, you are advised to …rst read the lecture entitled Characteristic function (p. 307).

40.1

De…nition

Let us start this lecture with a de…nition. p 1 be the imaginary De…nition 230 Let X be a K 1 random vector. Let i = unit. The function ' : RK ! C de…ned by 2 0 13 K X 'X (t) = E exp it> X = E 4exp @i tj Xj A5 j=1

is called the joint characteristic function of X.

The …rst thing to be noted is that the joint characteristic function 'X (t) exists for any t 2 RK . This can be proved as follows: 'X (t)

=

E exp it> X

=

E cos t> X + i sin t> X

=

E cos t> X

+ iE sin t> X

and the last two expected values are well-de…ned, because the sine and cosine functions are bounded in the interval [ 1; 1].

40.2

Cross-moments and joint cfs

Like the joint moment generating function1 of a random vector, the joint characteristic function can be used to derive the cross-moments2 of X, as stated in the 1 See 2 See

p. 297. p. 285.

315

316

CHAPTER 40. JOINT CF

following proposition. Proposition 231 Let X be a random vector and 'X (t) its joint characteristic function. Let n 2 N. De…ne a cross-moment of order n as follows: X

nK ] (n1 ; n2 ; : : : ; nK ) = E [X1n1 X2n2 : : : XK

where n1 ; n2 ; : : : ; nK 2 Z+ and n=

K X

nk

k=1

If all cross-moments of order n exist and are …nite, then all the n-th order partial derivatives of 'X (t) exist and X

(n1 ; n2 ; : : : ; nK ) =

1 @ n1 +n2 +:::+nK 'X (t1 ; t2 ; : : : ; tK ) in @tn1 1 @tn2 2 : : : @tnKK

t1 =0;t2 =0;:::;tK =0

where the partial derivative on the right-hand side of the equation is evaluated at the point t1 = 0, t2 = 0, . . . , tK = 0. Proof. See Ushakov3 (1999). In practice, the proposition above is not very useful when one wants to compute a cross-moment of a random vector, because the proposition requires to know in advance whether the cross-moment exists or not. A much more useful proposition is the following. Proposition 232 Let X be a random vector and 'X (t) its joint characteristic function. If all the n-th order partial derivatives of 'X (t) exist, then: 1. if n is even, for any m=

K X

mk

n

k=1

all m-th cross-moments of X exist and are …nite; 2. if n is odd, for any m=

K X

mk < n

k=1

all m-th cross-moments of X exist and are …nite. In both cases, X

(m1 ; m2 ; : : : ; mK ) =

1 @ m1 +m2 +:::+mK 'X (t1 ; t2 ; : : : ; tK ) mK m2 1 in @tm 1 @t2 : : : @tK

t1 =0;t2 =0;:::;tK =0

Proof. See Ushakov (1999). 3 Ushakov,

N. G. (1999) Selected topics in characteristic functions, VSP.

40.3. JOINT DISTRIBUTIONS AND JOINT CFS

40.3

317

Joint distributions and joint cfs

The next proposition states the most important property of the joint characteristic function. Proposition 233 (equality of distributions) Let X and Y be two K 1 random vectors. Denote by FX (x) and FY (y) their joint distribution functions4 and by 'X (t) and 'Y (t) their joint characteristic functions. X and Y have the same distribution, i.e., FX (x) = FY (x) for any x 2 RK , if and only if they have the same characteristic functions, i.e., 'X (t) = 'Y (t) for any t 2 RK . Proof. See Ushakov (1999). An informal proof for the special case in which X and Y have a …nite support can be provided along the same lines of the proof of Proposition 219, which concerns the joint moment generating function. This is left as an exercise (just replace exp t> X and exp t> Y in that proof with exp it> X and exp it> Y ). This property is analogous to the property of joint moment generating functions stated in Proposition 219. The same comments we made about that proposition also apply to this one.

40.4

More details

40.4.1

Joint cf of a linear transformation

The next proposition gives a formula for the joint characteristic function of a linear transformation. Proposition 234 Let X be a K 'X (t). De…ne

1 random vector with characteristic function Y = A + BX

where A is a L 1 constant vector and B is a L characteristic function of Y is

K constant matrix. Then, the

'Y (t) = exp it> A 'X B > t Proof. By using the de…nition of characteristic function, we obtain 'Y (t)

=

E exp it> Y

=

E exp it> A + it> BX

=

E exp it> A exp it> BX

=

exp it> A E exp it> BX h exp it> A E exp i B > t

= =

4 See

p. 118.

exp it> A 'X B > t

>

X

i

318

CHAPTER 40. JOINT CF

40.4.2

Joint cf of a random vector with independent entries

The next proposition shows how to derive the joint characteristic function of a vector whose components are independent random variables. Proposition 235 Let X be a K 1 random vector. Let its entries X1 ; : : : ; XK be K mutually independent random variables. Denote the characteristic function of the j-th entry of X by 'Xj (tj ). Then, the joint characteristic function of X is 'X (t1 ; : : : ; tK ) =

K Y

'Xj (tj )

j=1

Proof. This is proved as follows: E exp it> X 2 0 13 K X = E 4exp @i tj Xj A5

'X (t) =

2

= E4

j=1

K Y

j=1

A

=

K Y

3

exp (itj Xj )5

E [exp (itj Xj )]

j=1

B

=

K Y

'Xj (tj )

j=1

where: in step A we have used the fact that the entries of X are mutually independent5 ; in step B we have used the de…nition of characteristic function of a random variable6 .

40.4.3

Joint cf of a sum

The next proposition shows how to derive the joint characteristic function of a sum of independent random vectors. Proposition 236 Let X1 , . . . , Xn be n mutually independent random vectors. Let Z be their sum: n X Z= Xj j=1

Then, the joint characteristic function of Z is the product of the joint characteristic functions of X1 ; : : : ; Xn : n Y 'Z (t) = 'Xj (t) j=1

5 In

particular, see the mutual independence via expectations property (p. 234). p. 307.

6 See

40.5. SOLVED EXERCISES

319

Proof. This is proved as follows: 'Z (t)

= E exp it> Z 2 0

= E 4exp @it> 2

n X j=1

13

Xj A5

0 13 n X = E 4exp @ it> Xj A5 2

= E4

j=1

n Y

j=1

A

B

=

=

n Y

j=1 n Y

3

exp it> Xj 5

E exp it> Xj 'Xj (t)

j=1

where: in step A we have used the fact that the vectors Xj are mutually independent; in step B we have used the de…nition of joint characteristic function of a random vector given above.

40.5

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let Z1 and Z2 be two independent standard normal random variables7 . Let X be a 2 1 random vector whose components are de…ned as follows: X1 X2

= Z12 = Z12 + Z22

Derive the joint characteristic function of X. Hint: use the fact that Z12 and Z22 are two independent Chi-square random variables8 having characteristic function 'Z12 (t) = 'Z22 (t) = (1

2it)

1=2

Solution By using the de…nition of characteristic function, we get 'X (t) 7 See 8 See

= E exp it> X = E [exp (it1 X1 + it2 X2 )]

p. 376. p. 387.

320

CHAPTER 40. JOINT CF =

E exp it1 Z12 + it2 Z12 + Z22

=

E exp i (t1 + t2 ) Z12 + it2 Z22

=

E exp i (t1 + t2 ) Z12 exp it2 Z22

A

=

E exp i (t1 + t2 ) Z12

B

= 'Z12 (t1 + t2 ) 'Z22 (t2 ) 1=2

E exp it2 Z22

=

(1

2it1

2it2 )

(1

=

[(1

2it1

2it2 ) (1

=

[1

2it1

2it2

2it2 1

=

1

2it1

4it2

4t1 t2

1=2

2it2 ) 1=2

2it2 )]

2it2 ( 2it1 ) 4t22

2it2 ( 2it2 )]

1=2

1=2

where: in step A we have used the fact that Z1 and Z2 are independent; in step B we have used the de…nition of characteristic function.

Exercise 2 Use the joint characteristic function found in the previous exercise to derive the expected value and the covariance matrix of X. Solution We need to compute the partial derivatives of the joint characteristic function: @' @t1 @' @t2 @2' @t21 @2' @t22

= = = =

1 1 2 1 1 2 3 1 4 3 1 4 +4

@2' @t1 @t2

=

+2

4it2

4t1 t2

4t22

3=2

2it1

4it2

4t1 t2

4t22

3=2

2it1

4it2

4t1 t2

4t22

5=2

2it1

4it2

4t1 t2

4t22

5=2

1

3 1 4

2it1

2it1 2it1

1

2it1

4it2 4it2 4it2

4t22

4t1 t2 4t1 t2

4t22

4t1 t2

4t2 )

( 4i

4t1

8t2 )

2

( 2i

4t2 )

( 4i

4t1

2

8t2 )

3=2

5=2

4t22

( 2i

( 2i

4t2 ) ( 4i

4t1

8t2 )

3=2

All partial derivatives up to the second order exist and are well de…ned. As a consequence, all cross-moments up to the second order exist and are …nite and they can be computed from the above partial derivatives: E [X1 ] E [X2 ]

=

1 @' i @t1

=

1 @' i @t2

=

1 i=1 i

=

1 2i = 2 i

t1 =0;t2 =0

t1 =0;t2 =0

40.5. SOLVED EXERCISES E X12

=

321

1 @2' i2 @t21

=

1 3i2 = 3 i2

=

1 i2

t1 =0;t2 =0

2

E X22 E [X1 X2 ]

=

1 @ ' i2 @t22

=

1 @2' i2 @t1 @t2

t1 =0;t2 =0

= t1 =0;t2 =0

12i2 + 4 = 8 1 i2

6i2 + 2 = 4

The covariances are derived as follows: Var [X1 ]

= E X12

Var [X2 ] Cov [X1 ; X2 ]

X22

2

E [X1 ] = 3

1=2

2

= E E [X2 ] = 8 4 = 4 = E [X1 X2 ] E [X1 ] E [X2 ] = 4 2 = 2

Summing up, we have E [X2 ]

>

E [X]

=

E [X1 ]

Var [X]

=

Var [X1 ] Cov [X1 ; X2 ] Cov [X1 ; X2 ] Var [X2 ]

=

1

2

>

=

2 2

2 4

Exercise 3 Read and try to understand how the joint characteristic function of the multinomial distribution is derived in the lecture entitled Multinomial distribution (p. 431).

322

CHAPTER 40. JOINT CF

Chapter 41

Sums of independent random variables This lecture discusses how to derive the distribution of the sum of two independent random variables1 . We explain …rst how to derive the distribution function2 of the sum and then how to derive its probability mass function3 (if the summands are discrete) or its probability density function4 (if the summands are continuous).

41.1

Distribution function of a sum

The following proposition characterizes the distribution function of the sum in terms of the distribution functions of the two summands: Proposition 237 Let X and Y be two independent random variables and denote by FX (x) and FY (y) their respective distribution functions. Let Z =X +Y and denote the distribution function of Z by FZ (z). The following holds: FZ (z) = E [FX (z

Y )]

FZ (z) = E [FY (z

X)]

or

Proof. The …rst formula is derived as follows: FZ (z) A

1 See

p. p. 3 See p. 4 See p. 2 See

= = =

P (Z z) P (X + Y z) P (X z Y )

229. 108. 106. 107.

323

324

CHAPTER 41. SUMS OF INDEPENDENT RANDOM VARIABLES B

=

E [P (X

z

C

=

E [FX (z

Y )]

Y jY = y )]

where: in step A we have used the de…nition of distribution function; in step B we have used the law of iterated expectations; in step C we have used the fact that X and Y are independent. The second formula is symmetric to the …rst. The following example illustrates how the above proposition can be used. Example 238 Let X be a uniform random variable5 with support RX = [0; 1] and probability density function 1 if x 2 RX 0 otherwise

fX (x) =

and Y another uniform random variable, independent of X, with support RY = [0; 1] and probability density function 1 if y 2 RY 0 otherwise

fY (y) = The distribution function of X is FX (x) =

Z

8 < 0 if x 0 x if 0 < x fX (t) dt = : 1 1 if x > 1

x

1

The distribution function of Z = X + Y is FZ (z)

=

E [FX (z Y )] Z 1 = FX (z y) fY (y) dy =

Z

1 1

FX (z

y) dy

0

A B

= =

Z

Z

z 1

FX (t) dt

z

z

FX (t) dt

z 1

where: in step A we have made a change of variable6 (t = z y); in step B we have exchanged the bounds of integration7 . There are four cases to consider: 5 See

p. 45. p. 50. 7 See p. 51. 6 See

41.2. PROBABILITY MASS FUNCTION OF A SUM 1. If z

0, then FZ (z) =

Z

z

FX (t) dt =

z 1

2. If 0 < z

z

0dt = 0

z 1

1, then FZ (z)

= =

Z

z

FX (t) dt =

z 1 Z 0

0dt +

Z

FX (t) dt =

z 1

3. If 1 < z

Z

325

Z

0

FX (t) dt +

z 1

Z

z

Z

z

FX (t) dt

0

tdt = 0 +

0

z

1 2 t 2

= 0

1 2 z 2

2, then FZ (z)

= =

z

z 1 Z 1

tdt +

z 1

= = =

Z

Z

1

FX (t) dt +

z 1

z

1dt =

1

Z

z

FX (t) dt

1

1 2 t 2

1

z

+ [t]1 z 1

1 2 1 2 1 (z 1) + z 1 2 2 1 2 1 1 2 1 z + 2z 1 +z 2 2 2 2 1 2 z + 2z 1 2

1

4. If z > 2, then FZ (z)

= =

Z

z

FX (t) dt =

z 1 z [t]z 1

Z

z

1dt

z 1

=z

(z

1) = 1

Therefore, combining these four possible cases, we obtain 8 0 if z 0 > > < 1 2 z if 0 < z 1 2 FZ (z) = 1 2 z + 2z 1 if 1 > : 2 1 if z > 2

41.2

Probability mass function of a sum

When the two summands are discrete random variables, the probability mass function of their sum can be derived as follows: Proposition 239 Let X and Y be two independent discrete random variables and denote by pX (x) and pY (y) their respective probability mass functions and by RX and RY their supports. Let Z =X +Y and denote the probability mass function of Z by pZ (z). The following holds: X pZ (z) = pX (z y) pY (y) y2RY

326

CHAPTER 41. SUMS OF INDEPENDENT RANDOM VARIABLES

or

X

pZ (z) =

pY (z

x) pX (x)

x2RX

Proof. The …rst formula is derived as follows: pZ (z) A

= = =

P (Z = z) P (X + Y = z) P (X = z Y )

B

=

E [P (X = z

C

=

D

=

E [pX (z Y )] X pX (z y) pY (y)

Y jY = y )]

y2RY

where: in step A we have used the de…nition of probability mass function; in step B we have used the law of iterated expectations; in step C we have used the fact that X and Y are independent; in step D we have used the de…nition of expected value. The second formula is symmetric to the …rst. The two summations above are called convolutions (of two probability mass functions). Example 240 Let X be a discrete random variable with support RX = f0; 1g and probability mass function pX (x) =

1=2 if x 2 RX 0 otherwise

and Y another discrete random variable, independent of X, with support RY = f0; 1g and probability mass function pY (y) =

1=2 if y 2 RY 0 otherwise

De…ne Z =X +Y Its support is RY = f0; 1; 2g The probability mass function of Z, evaluated at z = 0 is X pZ (0) = pX (0 y) pY (y) = pX (0 0) pY (0) + pX (0 y2RY

1) pY (1)

41.3. PROBABILITY DENSITY FUNCTION OF A SUM =

327

1 1 1 1 +0 = 2 2 2 4

Evaluated at z = 1, it is X pZ (1) = pX (1

y) pY (y) = pX (1

0) pY (0) + pX (1

1) pY (1)

0) pY (0) + pX (2

1) pY (1)

y2RY

=

1 1 1 1 1 + = 2 2 2 2 2

Evaluated at z = 2, it is X pZ (2) = pX (2

y) pY (y) = pX (2

y2RY

=

0

1 1 1 1 + = 2 2 2 4

Therefore, the probability mass function of 8 1=4 > > < 1=2 pZ (z) = 1=4 > > : 0

41.3

Z is if z = 0 if z = 1 if z = 2 otherwise

Probability density function of a sum

When the two summands are absolutely continuous random variables, the probability density function of their sum can be derived as follows: Proposition 241 Let X and Y be two independent absolutely continuous random variables and denote by fX (x) and fY (y) their respective probability density functions. Let Z =X +Y and denote the probability density function of Z by fZ (z). The following holds: Z 1 fZ (z) = fX (z y) fY (y) dy 1

or fZ (z) =

Z

1

fY (z

x) fX (x) dx

1

Proof. As stated in Proposition 237, the distribution function of a sum of independent variables is FZ (z) = E [FX (z Y )] Di¤erentiating both sides and using the fact that the density function is the derivative of the distribution function8 , we obtain fZ (z) d = E [FX (z dz 8 See

p. 109.

Y )]

328

CHAPTER 41. SUMS OF INDEPENDENT RANDOM VARIABLES A

B

d FX (z Y ) dz = E [fX (z Y )] Z 1 = fX (z y) fY (y) dy =

E

1

where: in step A we have interchanged di¤erentiation and expectation; in step B we have used the de…nition of expected value. The second formula is symmetric to the …rst. The two integrals above are called convolutions (of two probability density functions). Example 242 Let X be an exponential random variable9 with support RX = [0; 1) and probability density function fX (x) =

exp ( x) if x 2 RX 0 otherwise

and Y another exponential random variable, independent of X, with support RY = [0; 1) and probability density function fY (y) =

exp ( y) if y 2 RY 0 otherwise

De…ne Z =X +Y The support of Z is RZ = [0; 1) When z 2 RZ , the probability density function of Z is Z 1 Z 1 fZ (z) = fX (z y) fY (y) dy = fX (z y) exp ( y) dy 1 0 Z 1 = exp ( (z y)) 1fz y 0g exp ( y) dy 0 Z 1 = exp ( z + y) 1fy zg exp ( y) dy Z0 z Z z = exp ( z + y) exp ( y) dy = exp( z) dy = z exp ( z) 0

0

Therefore, the probability density function of Z is fZ (z) = 9 See

p. 365.

z exp ( z) if z 2 RZ 0 otherwise

41.4. MORE DETAILS

329

41.4

More details

41.4.1

Sum of n independent random variables

We have discussed above how to derive the distribution of the sum of two independent random variables. How do we derive the distribution of the sum of more than two mutually independent10 random variables? Suppose X1 , X2 , . . . , Xn are n mutually independent random variables and let Z be their sum: Z = X1 + : : : + Xn The distribution of Z can be derived recursively, using the results for sums of two random variables given above: 1. …rst, de…ne Y2 = X1 + X2 and compute the distribution of Y2 ; 2. then, de…ne Y3 = Y2 + X3 and compute the distribution of Y3 ; 3. and so on, until the distribution of Z can be computed from Z = Yn = Yn

41.5

1

+ Xn

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let X be a uniform random variable with support RX = [0; 1] and probability density function fX (x) =

1 if x 2 RX 0 otherwise

and Y an exponential random variable, independent of X, with support RY = [0; 1) and probability density function fY (y) =

exp ( y) if y 2 RY 0 otherwise

Derive the probability density function of the sum Z =X +Y 1 0 See

p. 233.

330

CHAPTER 41. SUMS OF INDEPENDENT RANDOM VARIABLES

Solution The support of Z is RZ = [0; 1) When z 2 RZ , the probability density function of Z is Z 1 Z 1 fZ (z) = fX (z y) fY (y) dy = fX (z y) exp ( y) dy 1 0 Z 1 Z 1 = 1f0 z y 1g exp ( y) dy = 1f 1 y z 0g exp ( y) dy 0 0 Z 1 Z z = 1fz 1 y zg exp ( y) dy = exp ( y) dy 0

=

[ exp (

z 1

z y)]z 1

=

exp ( z) + exp (1

z)

Therefore, the probability density function of Z is fZ (z) =

exp ( z) + exp (1 0

z) if z 2 RZ otherwise

Exercise 2 Let X be a discrete random variable with support RX = f0; 1; 2g and probability mass function: pX (x) =

1=3 if x 2 RX 0 otherwise

and Y another discrete random variable, independent of X, with support RY = f1; 2g and probability mass function: pY (y) =

y=3 if y 2 RY 0 otherwise

Derive the probability mass function of the sum Z =X +Y Solution The support of Z is: RZ = f1; 2; 3; 4g The probability mass function of Z, evaluated at z = 1 is: X pX (1 y) pY (y) = pX (1 1) pY (1) + pX (1 pZ (1) = y2RY

=

1 1 2 1 +0 = 3 3 3 9

2) pY (2)

41.5. SOLVED EXERCISES Evaluated at z = 2, it is: X pZ (2) = pX (2

331

y) pY (y) = pX (2

1) pY (1) + pX (2

2) pY (2)

1) pY (1) + pX (3

2) pY (2)

1) pY (1) + pX (4

2) pY (2)

y2RY

=

1 1 1 2 1 + = 3 3 3 3 3

Evaluated at z = 3, it is: X pZ (3) = pX (3

y) pY (y) = pX (3

y2RY

=

1 1 1 1 2 + = 3 3 3 3 3

Evaluated at z = 4, it is: X pZ (4) = pX (4

y) pY (y) = pX (4

y2RY

=

0

1 1 2 2 + = 3 3 3 9

Therefore, the probability mass function of 8 1=9 > > > > < 1=3 1=3 pZ (z) = > > 2=9 > > : 0

Z is: if z = 1 if z = 2 if z = 3 if z = 4 otherwise

332

CHAPTER 41. SUMS OF INDEPENDENT RANDOM VARIABLES

Part IV

Probability distributions

333

Chapter 42

Bernoulli distribution Suppose you perform an experiment with two possible outcomes: either success or failure. Success happens with probability p, while failure happens with probability 1 p. A random variable that takes value 1 in case of success and 0 in case of failure is called a Bernoulli random variable (alternatively, it is said to have a Bernoulli distribution).

42.1

De…nition

Bernoulli random variables are characterized as follows: De…nition 243 Let X be a discrete random variable. Let its support be RX = f0; 1g Let p 2 (0; 1). We say that X has a Bernoulli distribution with parameter p if its probability mass function1 is 8 if x = 1 < p 1 p if x = 0 pX (x) = : 0 if x 2 = RX

Note that, by the above de…nition, any indicator function2 is a Bernoulli random variable. The following is a proof that pX (x) is a legitimate probability mass function3 : Proof. Non-negativity is obvious. We need to prove that the sum of pX (x) over its support equals 1. This is proved as follows: X pX (x) = pX (1) + pX (0) xeRX

= p + (1

1 See

p. 106. p. 197. 3 See p. 247. 2 See

335

p) = 1

336

42.2

CHAPTER 42. BERNOULLI DISTRIBUTION

Expected value

The expected value of a Bernoulli random variable X is E [X] = p Proof. It can be derived as follows: E [X]

=

X

xpX (x)

x2RX

= 1 pX (1) + 0 pX (0) = 1 p + 0 (1 p) = p

42.3

Variance

The variance of a Bernoulli random variable X is Var [X] = p (1

p)

Proof. It can be derived thanks to the usual formula for computing the variance4 : X E X2 = x2 pX (x) x2RX

= =

2

E [X]

Var [X]

42.4

12 pX (1) + 02 pX (0) 1 p + 0 (1 p) = p

= p2 = E X2

2

E [X] = p

p2 = p (1

p)

Moment generating function

The moment generating function of a Bernoulli random variable X is de…ned for any t 2 R: MX (t) = 1 p + p exp (t) Proof. Using the de…nition of moment generating function: MX (t)

= =

E [exp (tX)] X exp (tx) pX (x)

x2RX

= exp (t 1) pX (1) + exp (t 0) pX (0) = exp (t) p + 1 (1 p) = 1 p + p exp (t) Obviously, the above expected value exists for any t 2 R. 4 Var [X]

= E X2

E [X]2 . See p. 156.

42.5. CHARACTERISTIC FUNCTION

42.5

337

Characteristic function

The characteristic function of a Bernoulli random variable X is 'X (t) = 1

p + p exp (it)

Proof. Using the de…nition of characteristic function: 'X (t) = =

E [exp (itX)] X exp (itx) pX (x)

x2RX

= exp (it 1) pX (1) + exp (it 0) pX (0) = exp (it) p + 1 (1 p) = 1 p + p exp (it)

42.6

Distribution function

The distribution function of a Bernoulli random variable X is 8 if x < 0 < 0 1 p if 0 x < 1 FX (x) = : 1 if x 1

Proof. Remember the de…nition of distribution function: FX (x) = P (X

x)

and the fact that X can take either value 0 or value 1. If x < 0, then P (X x) = 0, because X can not take values strictly smaller than 0. If 0 x < 1, then P (X x) = 1 p, because 0 is the only value strictly smaller than 1 that X can take. Finally, if x 1, then P (X x) = 1, because all values X can take are smaller than or equal to 1.

42.7

More details

In the following subsections you can …nd more details about the Bernoulli distribution.

42.7.1

Relation to the binomial distribution

A sum of independent Bernoulli random variables is a binomial random variable. This is discussed and proved in the lecture entitled Binomial distribution (p. 341).

42.8

Solved exercises

Below you can …nd some exercises with explained solutions.

338

CHAPTER 42. BERNOULLI DISTRIBUTION

Exercise 1 Let X and Y be two independent Bernoulli random variables with parameter p. Derive the probability mass function of their sum: Z =X +Y Solution The probability mass function of X is 8 < p 1 pX (x) = : 0 The probability mass function of Y is 8 < p 1 pY (y) = : 0

p

if x = 1 if x = 0 otherwise

p

if y = 1 if y = 0 otherwise

The support of Z (the set of values Z can take) is RY = f0; 1; 2g The formula for the probability mass function of a sum of two independent variables is5 X pZ (z) = pX (z y) pY (y) y2RY

where RY is the support of Y . When z = 0, the formula gives: X pZ (0) = pX ( y) pY (y) y2RY

= pX ( 0) pY (0) + pX ( 1) pY (1)

=

(1

p) (1

p) + 0 p = (1

When z = 1, the formula gives: X pZ (1) = pX (1

y) pY (y)

When z = 2, the formula gives: X pZ (2) = pX (2

y) pY (y)

2

p)

y2RY

= pX (1 = p (1

0) pY (0) + pX (1 1) pY (1) p) + (1 p) p = 2p (1 p)

y2RY

= pX (2 = 0 (1

0) pY (0) + pX (2 p) + p p = p2

Therefore, the probability mass function of Z 8 2 (1 p) > > < 2p (1 p) pZ (z) = 2 p > > : 0 5 See

p. 325.

1) pY (1)

is if z = 0 if z = 1 if z = 2 otherwise

42.8. SOLVED EXERCISES

339

Exercise 2 Let X be a Bernoulli random variable with parameter p = 1=2. Find its tenth moment. Solution The moment generating function of X is MX (t) =

1 1 + exp (t) 2 2

The tenth moment of X is equal to the tenth derivative of its moment generating function6 , evaluated at t = 0: 10 = X (10) = E X

d10 MX (t) dt10

But dMX (t) dt d2 MX (t) dt2 d10 MX (t) dt10

1 exp (t) 2 1 = exp (t) 2 .. . 1 = exp (t) 2 =

so that: X

(10)

= =

6 See

p. 290.

d10 MX (t) dt10 t=0 1 1 exp (0) = 2 2

t=0

340

CHAPTER 42. BERNOULLI DISTRIBUTION

Chapter 43

Binomial distribution Consider an experiment having two possible outcomes: either success or failure. Suppose the experiment is repeated several times and the repetitions are independent of each other. The total number of experiments where the outcome turns out to be a success is a random variable whose distribution is called binomial distribution. The distribution has two parameters: the number n of repetitions of the experiment, and the probability p of success of an individual experiment. A binomial distribution can be seen as a sum of mutually independent Bernoulli random variables1 that take value 1 in case of success of the experiment and value 0 otherwise. This connection between the binomial and Bernoulli distributions will be illustrated in detail in the remainder of this lecture and will be used to prove several properties of the binomial distribution.

43.1

De…nition

The binomial distribution is characterized as follows. De…nition 244 Let X be a discrete random variable. Let n 2 N and p 2 (0; 1). Let the support of X be2 RX = f0; 1; : : : ; ng We say that X has a binomial distribution with parameters n and p if its probability mass function3 is pX (x) = where

n x

=

n! x!(n x)!

n x

px (1

n x

p)

0

if x 2 RX if x 2 = RX

is the binomial coe¢ cient4 .

The two parameters of the distribution are the number of experiments n and the probability of success p of an individual experiment. The following is a proof that pX (x) is a legitimate probability mass function5 . 1 See

p. 335. other words, RX is the set of the …rst n natural numbers and 0. 3 See p. 106. 4 See p. 22. 5 See p. 247. 2 In

341

342

CHAPTER 43. BINOMIAL DISTRIBUTION

Proof. Non-negativity is obvious. We need to prove that the sum of pX (x) over the support of X equals 1. This is proved as follows: X

xeRX

pX (x) =

n X n x p (1 x x=0

n x

p)

= [p + (1

n

p)] = 1n = 1

where we have used the formula for binomial expansions6 n

(a + b) =

43.2

n X n x n a b x x=0

x

Relation to the Bernoulli distribution

The binomial distribution is intimately related to the Bernoulli distribution. The following propositions show how. Proposition 245 A random variable has a binomial distribution with parameters n and p, with n = 1, if and only if it has a Bernoulli distribution with parameter p. Proof. We demonstrate that the two distributions are equivalent by showing that they have the same probability mass function. The probability mass function of a binomial distribution with parameters n and p, with n = 1, is pX (x) =

1 x

1 x

px (1

p)

0

if x 2 f0; 1g if x 2 = f0; 1g

but pX (0) =

1 0 p (1 0

1 0

p)

=

1! (1 0!1!

p) = 1

p

and pX (1) =

1 1 p (1 1

1 1

p)

=

1! p=p 1!0!

Therefore, the probability mass function can be written as 8 if x = 1 < p 1 p if x = 0 pX (x) = : 0 otherwise

which is the probability mass function of a Bernoulli random variable. Proposition 246 A random variable has a binomial distribution with parameters n and p if and only if it can be written as a sum of n jointly independent Bernoulli random variables with parameter p. 6 See

p. 25.

43.2. RELATION TO THE BERNOULLI DISTRIBUTION

343

Proof. We prove it by induction. So, we have to prove that it is true for n = 1 and for a generic n, given that it is true for n 1. For n = 1, it has been proved in Proposition 245. Now, suppose the claim is true for a generic n 1. We have to verify that Yn is a binomial random variable, where Yn = X1 + X2 + : : : + Xn and X1 , X2 , : : :, Xn are independent Bernoulli random variables. Since the claim is true for n 1, this is tantamount to verifying that Yn = Yn

1

+ Xn

is a binomial random variable, where Yn 1 has a binomial distribution with parameters n 1 and p. By performing a convolution7 , we can compute the probability mass function of Yn : pYn (yn ) X

=

yn

= yn

= yn

= yn

pXn (yn

1 2RYn

X

I ((yn

1 2RYn

yn

yn

X

1 2RYn

n

1

(yn

1)

1)

2 f0; 1g) pyn

yn

1

1 yn +yn

(1

p)

1

1

1 2RYn

1

n 1 yn yn 1 +yn 1 1 y +y +n 1 yn 1 p (1 p) n n 1 yn 1 X n 1 yn I (yn 2 fyn 1 ; yn 1 + 1g) p (1 yn 1 1 2RYn

n yn

p)

1

X

n yn

p)

1 2RYn

yn

yn

1 ) pYn

n 1 yn 1 n 1 yn 1 p (1 p) yn 1 X I (yn 2 fyn 1 ; yn 1 + 1g)

= pyn (1

If 1

yn

1

I (yn 2 fyn

1 ; yn 1

+ 1g)

n 1 yn 1

1

1, then

I (yn 2 fyn

1 ; yn 1

n 1 yn 1

+ 1g)

n

=

1 yn

+

n yn

1 1

=

n yn

1

where the last equality is the recursive formula for binomial coe¢ cients8 . If yn = 0, then

yn

= 7 See 8 See

p. 326. p. 25.

X

1 2RYn

n

1 yn

I (yn 2 fyn

1 ; yn 1

+ 1g)

n 1 yn 1

1

=

n

1 0

=1=

n 0

=

n yn

344

CHAPTER 43. BINOMIAL DISTRIBUTION

Finally, if yn = n, then X

1 2RYn

yn

=

n yn

1 1

I (yn 2 fyn

+ 1g)

n 1 yn 1

n n

=

n yn

n 1 yn 1

=

1 ; yn 1

1

n n

=

1 1

=1=

Therefore, for yn 2 RYn , we have yn

X

1 2RYn

I (yn 2 fyn

1 ; yn 1

+ 1g)

n yn

1

and pYn (yn ) =

n yn

n yn

pyn (1

p)

0

if yn 2 RYn otherwise

which is the probability mass function of a binomial random variable with parameters n and p. This completes the proof.

43.3

Expected value

The expected value of a binomial random variable X is E [X] = np Proof. It can be derived as follows:

A

E [X] " n # X = E Yi i=1

B C

= =

n X i=1 n X

E [Yi ] p

i=1

= np

where: in step A we have used the fact that X can be represented as a sum of n independent Bernoulli random variables Y1 ; : : : ; Yn ; in step B we have used the linearity of the expected value; in step C we have used the formula for the expected value of a Bernoulli random variable9 .

43.4

Variance

The variance of a binomial random variable X is Var [X] = np (1 9 See

p. 336.

p)

43.5. MOMENT GENERATING FUNCTION

345

Proof. It can be derived as follows:

A

Var [X] " n # X = Var Yi i=1

B C

= =

n X i=1 n X

Var [Yi ] p (1

p)

i=1

= np (1

p)

where: in step A we have used the fact that X can be represented as a sum of n independent Bernoulli random variables Y1 ; : : : ; Yn ; in step B we have used the formula for the variance of the sum of jointly independent random variables; in step C we have used the formula for the variance of a Bernoulli random variable10 .

43.5

Moment generating function

The moment generating function of a binomial random variable X is de…ned for any t 2 R: n MX (t) = (1 p + p exp (t)) Proof. This is proved as follows: MX (t) A

=

E [exp (tX)]

B

= =

E [exp (t (Y1 + : : : + Yn ))] E [exp (tY1 ) : : : exp (tYn )]

C

=

E [exp (tY1 )] : : : E [exp (tYn )]

D

= MY1 (t) : : : MYn (t)

E

= (1 = (1

p + p exp (t)) : : : (1 n p + p exp (t))

p + p exp (t))

where: in step A we have used the de…nition of moment generating function; in step B we have used the fact that X can be represented as a sum of n independent Bernoulli random variables Y1 ; : : : ; Yn ; in step C we have used the fact that Y1 ; : : : ; Yn are jointly independent; in step D we have used the de…nition of moment generating function of Y1 ; : : : ; Yn ; in step E we have used the formula for the moment generating function of a Bernoulli random variable11 . Since the moment generating function of a Bernoulli random variable exists for any t 2 R, also the moment generating function of a binomial random variable exists for any t 2 R. 1 0 See 1 1 See

p. 336. p. 336.

346

43.6

CHAPTER 43. BINOMIAL DISTRIBUTION

Characteristic function

The characteristic function of a binomial random variable X is n

'X (t) = (1

p + p exp (it))

Proof. Similar to the previous proof: 'X (t) A

=

E [exp (itX)]

B

= =

E [exp (it (Y1 + : : : + Yn ))] E [exp (itY1 ) : : : exp (itYn )]

C

=

E [exp (itY1 )] : : : E [exp (itYn )]

D

= 'Y1 (t) : : : 'Yn (t)

E

= (1 = (1

p + p exp (it)) : : : (1 n p + p exp (it))

p + p exp (it))

where: in step A we have used the de…nition of characteristic function; in step B we have used the fact that X can be represented as a sum of n independent Bernoulli random variables Y1 ; : : : ; Yn ; in step C we have used the fact that Y1 ; : : : ; Yn are jointly independent; in step D we have used the de…nition of characteristic function of Y1 ; : : : ; Yn ; in step E we have used the formula for the characteristic function of a Bernoulli random variable12 .

43.7

Distribution function

The distribution function of a binomial random variable X is 8 if x < 0 < 0 Pbxc n s n s FX (x) = p (1 p) if 0 x n s=0 s : 1 if x > n

where bxc is the ‡oor of x, i.e. the largest integer not greater than x. Proof. For x < 0, FX (x) = 0, because X cannot be smaller than 0. For x > n, FX (x) = 1, because X is always smaller than or equal to n. For 0 x n: FX (x) A B

= =

P (X bxc

X

x)

P (X = s)

s=0

C

=

bxc X s=0

1 2 See

p. 337.

pX (s)

43.8. SOLVED EXERCISES

347

=

bxc X n s p (1 s s=0

n s

p)

where: in step A we have used the de…nition of distribution function; in step B we have used the fact that X can take only integer values; in step C we have used the de…nition of probability mass function of X. Values of FX (x) are usually computed by means of computer algorithms. For example, the MATLAB command binocdf(x,n,p) returns the value of the distribution function at the point x when the parameters of the distribution are n and p.

43.8

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Suppose you independently ‡ip a coin 4 times and the outcome of each toss can be either head (with probability 1=2) or tails (also with probability 1=2). What is the probability of obtaining exactly 2 tails? Solution Denote by X the number of times the outcome is tails (out of the 4 tosses). X has a binomial distribution with parameters n = 4 and p = 1=2. The probability of obtaining exactly 2 tails can be computed from the probability mass function of X as follows: pX (2)

= =

4 n 2 n 2 p (1 p) = 2 2 4! 1 1 4 3 2 1 1 = = 2!2! 4 4 2 1 2 1 16

2

1 1 2 6 3 = 16 8

1 2

4 2

Exercise 2 Suppose you independently throw a dart 10 times. Each time you throw a dart, the probability of hitting the target is 3=4. What is the probability of hitting the target less than 5 times (out of the 10 total times you throw a dart)? Solution Denote by X the number of times you hit the target. X has a binomial distribution with parameters n = 10 and p = 3=4. The probability of hitting the target less than 5 times can be computed from the distribution function of X as follows: P (X < 5)

=

P (X

4) = FX (4)

348

CHAPTER 43. BINOMIAL DISTRIBUTION

=

=

4 X n s p (1 s s=0

4 X 10 s s=0

3 4

n s

p) s

1 4

10 s

' 0:0197

where FX is the distribution function of X and the value of FX (4) can be calculated with a computer algorithm, for example with the MATLAB command binocdf(4,10,3/4)

Chapter 44

Poisson distribution The Poisson distribution is related to the exponential distribution1 . Suppose an event can occur several times within a given unit of time. When the total number of occurrences of the event is unknown, we can think of it as a random variable. This random variable has a Poisson distribution if and only if the time elapsed between two successive occurrences of the event has an exponential distribution and it is independent of previous occurrences. A classical example of a random variable having a Poisson distribution is the number of phone calls received by a call center. If the time elapsed between two successive phone calls has an exponential distribution and it is independent of the time of arrival of the previous calls, then the total number of calls received in one hour has a Poisson distribution. 12

Exponential distribution

Number of phone calls

10

8

6 Exponential distribution Poisson distribution

4

2

0

0

10

20

30

40

Minutes

50

60

70

80

90

The concept is illustrated by the plot above, where the number of phone calls received is plotted as a function of time. The graph of the function makes an upward jump each time a phone call arrives. The time elapsed between two successive phone calls is equal to the length of each horizontal segment and it has an 1 See

p. 365.

349

350

CHAPTER 44. POISSON DISTRIBUTION

exponential distribution. The number of calls received in 60 minutes is equal to the length of the segment highlighted by the vertical curly brace and it has a Poisson distribution. The following sections provide a more formal treatment of the main characteristics of the Poisson distribution.

44.1

De…nition

The Poisson distribution is characterized as follows: De…nition 247 Let X be a discrete random variable. Let its support be the set of non-negative integer numbers: RX = Z+ Let 2 (0; 1). We say that X has a Poisson distribution with parameter its probability mass function2 is exp ( 0

pX (x) =

1 ) x!

x

if

if x 2 RX if x 2 = RX

where x! is the factorial3 of x.

44.2

Relation to the exponential distribution

The relation between the Poisson distribution and the exponential distribution is summarized by the following proposition: Proposition 248 The number of occurrences of an event within a unit of time has a Poisson distribution with parameter if and only if the time elapsed between two successive occurrences of the event has an exponential distribution with parameter and it is independent of previous occurrences. Proof. Denote by:

2

time elapsed before …rst occurrence time elapsed between …rst and second occurrence

n

time elapsed between (n

1

.. . 1) -th and n-th occurrence

.. . and by X the number of occurrences of the event. Since X x if and only if + : : : + 1 (convince yourself of this fact), the proposition is true if and only 1 x if: P (X x) = P ( 1 + : : : + x 1) for any x 2 RX . To verify that the equality holds, we need to separately compute the two probabilities. We start with: P( 2 See 3 See

p. 106. p. 10.

1

+ ::: +

x

1)

44.2. RELATION TO THE EXPONENTIAL DISTRIBUTION

351

Denote by Z the sum of waiting times: Z=

+ ::: +

1

x

Since the sum of independent exponential random variables4 with common parameter is a Gamma random variable5 with parameters 2x and x , then Z is a Gamma random variable with parameters 2x and x , i.e. its probability density function is cz x 1 exp ( z) if z 2 [0; 1) fZ (z) = 0 if z 2 = [0; 1) where

x

c=

x

=

(x)

(x

1)!

and the last equality stems from the fact that we are considering only integer values of x. We need to integrate the density function to compute the probability that Z is less than 1: P(

1

+ ::: +

1)

x

=

P (Z Z 1

=

Z

=

1) fZ (z) dz

1 1

cz x

1

exp (

z) dz

1

exp (

z) dz

0

= c

Z

1

zx

0

The last integral can be computed integrating by parts x Z

1

zx

1

exp (

z) dz

0

1

=

1

z

x 1

exp (

z)

+

1

exp (

) + (x

1)

Z

1

1) z x

(x

2

1

exp (

z) dz

0

0

=

1 times:

1

Z

1

zx

2

exp (

z) dz

0

1

= +

exp (

Z

) + (x

1)

1

(x

2) z x

3

1

(

1

1

1

zx

2

exp (

z) 0

exp (

z) dz

0

=

1

exp (

+ (x

1) (x

= ::: x X1 (x = (x i=1 4 See 5 See

p. 372. p. 397.

)

(x 2)

1 2

1) Z 1

1 2

zx

exp ( 3

)

exp (

z) dz

0

1)! 1 exp ( i)! i

)+

(x

1)! 1

1 x 1

Z

0

1

exp (

z) dz

352

CHAPTER 44. POISSON DISTRIBUTION

=

x X1 i=1

=

x X1 i=1

(x (x

1)! 1 exp ( i)! i

)+

(x (x

1)! 1 exp ( i)! i

)

(x

1

1)! x 1

1

exp (

z) 0

(x

1)!

exp (

x

)+

(x

1)! x

Multiplying by c, we obtain: c

Z

1

zx

1

0

=

(x

x

1)!

x X1

= =

1

1

zx

1

exp (

z) dz

0

(x

i=1 x X

i)!

(x

j!

+ ::: +

x

j=0

exp (

)

exp (

)+1

x i

x X1

1

z) dz

x i

i=1

=

exp (

Z

i)!

exp (

)

j

exp (

)

Thus, we have obtained: P(

1

x X1

1) = 1

j=0

j

j!

exp (

)

Now, we need to compute the probability that X is greater than or equal to x: P (X

x)

= 1 = 1 =

1

P (X < x) P (X x 1) x X1

P (X = j)

j=0

=

1

x X1

pX (j)

j=0

=

1

x X1 j=0

=

P(

1

j

j!

exp (

+ ::: +

x

which is exactly what we needed to prove.

44.3

Expected value

The expected value of a Poisson random variable X is: E [X] =

) 1)

44.4. VARIANCE

353

Proof. It can be derived as follows: X E [X] = xpX (x) =

x2RX 1 X

x exp (

)

x=0

A

=

0+

1 X

1 x!

x exp (

x

1 x!

)

x=1

B

=

1 X

(y + 1) exp (

)

1 (y + 1)!

(y + 1) exp (

)

1 (y + 1) y!

y=0

C

=

=

1 X

y=0 1 X

exp (

)

y=0

D

=

x

1 X

1 y!

y+1

y

y

pY (y)

y=0

E

=

where: in step A we have used the fact that the …rst term of the sum is zero, because x = 0; in step B we have made a change of variable y = x 1; in step C we have used the fact that (y + 1)! = (y + 1) y!; in step D we have de…ned pY (y) = exp (

)

1 y!

y

where pY (y) is the probability mass function of a Poisson random variable with parameter ; in step E we have used the fact that the sum of a probability mass function over its support equals 1.

44.4

Variance

The variance of a Poisson random variable X is: Var [X] = Proof. It can be derived thanks to the usual formula for computing the variance6 : X E X2 = x2 pX (x) =

x2RX 1 X

x2 exp (

)

x=0

A

=

0+

1 X

x=1 6 Var [X]

= E X2

E [X]2 . See p. 156.

x2 exp (

1 x!

x

)

1 x!

x

354

CHAPTER 44. POISSON DISTRIBUTION B

C

=

=

= D

=

1 X y=0 1 X

2

)

1 (y + 1)!

2

)

1 (y + 1) y!

(y + 1) exp (

(y + 1) exp (

y=0 1 X y=0 1 X

(y + 1) exp (

)

1 y!

(y + 1) pY (y)

(1 X

ypY (y) +

y=0

E

=

F

= =

y

y

y=0

=

y+1

1 X

)

pY (y)

y=0

fE [Y ] + 1g f + 1g +

2

where: in step A we have used the fact that the …rst term of the sum is zero, because x = 0; in step B we have made a change of variable y = x 1; in step C we have used the fact that (y + 1)! = (y + 1) y!; in step D we have de…ned pY (y) = exp (

)

1 y!

y

where pY (y) is the probability mass function of a Poisson random variable with parameter ; in step E we have used the fact that the sum of a probability mass function over its support equals 1; in step F we have used the fact that the expected value of a Poisson random variable with parameter is . Finally, 2

E [X] =

2

and Var [X]

= E X2 =

44.5

2

+

2

E [X] 2

=

Moment generating function

The moment generating function of a Poisson random variable X is de…ned for any t 2 R: MX (t) = exp ( [exp (t) 1]) Proof. Using the de…nition of moment generating function: MX (t)

=

E [exp (tX)]

44.6. CHARACTERISTIC FUNCTION =

=

X

355

exp (tx) pX (x)

x2RX 1 X

x

[exp (t)] exp (

1 x!

)

x=0

)

1 x X ( exp (t)) x! x=0

=

exp (

= =

exp ( ) exp ( exp (t)) exp ( [exp (t) 1])

where exp ( exp (t)) =

x

1 x X ( exp (t)) x! x=0

is the usual Taylor series expansion of the exponential function. Furthermore, since the series converges for any value of t, the moment generating function of a Poisson random variable exists for any t 2 R.

44.6

Characteristic function

The characteristic function of a Poisson random variable X is: 'X (t) = exp ( [exp (it)

1])

Proof. Using the de…nition of characteristic function: 'X (t)

= =

E [exp (itX)] X exp (itx) pX (x)

x2RX

=

X

x

[exp (it)] exp (

)

x2RX

)

exp (

= =

exp ( ) exp ( exp (it)) exp ( [exp (it) 1])

exp ( exp (it)) =

x

1 x X ( exp (it)) x! x=0

=

where:

1 x!

1 x X ( exp (it)) x! x=0

is the usual Taylor series expansion of the exponential function (note that the series converges for any value of t).

44.7

Distribution function

The distribution function of a Poisson random variable X is: Pbxc 1 s if x 0 exp ( ) s=0 s! FX (x) = 0 otherwise

356

CHAPTER 44. POISSON DISTRIBUTION

where bxc is the ‡oor of x, i.e. the largest integer not greater than x. Proof. Using the de…nition of distribution function: FX (x) A B

=

P (X bxc

X

=

x)

P (X = s)

s=0

C

bxc X

=

pX (s)

s=0

bxc X

=

exp (

)

s=0

=

exp (

)

1 s!

bxc X 1 s! s=0

s

s

where: in step A we have used the de…nition of distribution function; in step B we have used the fact that X can take only positive integer values; in step C we have used the de…nition of probability mass function of X. Values of FX (x) are usually computed by computer algorithms. For example, the MATLAB command: poisscdf(x,lambda) returns the value of the distribution function at the point x when the parameter of the distribution is equal to lambda.

44.8

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 The time elapsed between the arrival of a customer at a shop and the arrival of the next customer has an exponential distribution with expected value equal to 15 minutes. Furthermore, it is independent of previous arrivals. What is the probability that more than 6 customers will arrive at the shop during the next hour? Solution If a random variable has an exponential distribution with parameter , then its expected value is equal to 1= . Here 1

= 0:25 hours

Therefore, = 4. If inter-arrival times are independent exponential random variables with parameter , then the number of arrivals during a unit of time has a Poisson distribution with parameter . Thus, the number of customers that will

44.8. SOLVED EXERCISES

357

arrive at the shop during the next hour (denote it by X) is a Poisson random variable with parameter = 4. The probability that more than 6 customers arrive at the shop during the next hour is: P (X > 6) =

1

= 1

P (X

6) = 1

exp ( 4)

6 X s=0

FX (6)

4s ' 0:1107 s!

The value of FX (6) can be calculated with a computer algorithm, for example with the MATLAB command: poisscdf(6,4)

Exercise 2 At a call center, the time elapsed between the arrival of a phone call and the arrival of the next phone call has an exponential distribution with expected value equal to 15 seconds. Furthermore, it is independent of previous arrivals. What is the probability that less than 50 phone calls arrive during the next 15 minutes? Solution If a random variable has an exponential distribution with parameter , then its expected value is equal to 1= . Here 1

=

1 1 minutes = quarters of hour 4 60

where, in the last equality, we have taken 15 minutes as the unit of time. Therefore, = 60. If inter-arrival times are independent exponential random variables with parameter , then the number of arrivals during a unit of time has a Poisson distribution with parameter . Thus, the number of phone calls that will arrive during the next 15 minutes (denote it by X) is a Poisson random variable with parameter = 60. The probability that less than 50 phone calls arrive during the next 15 minutes is: P (X < 50)

=

P (X

49) = FX (49)

=

exp ( 60)

49 X 60s s=0

s!

' 0:0844

The value of FX (49) can be calculated with a computer algorithm, for example with the MATLAB command: poisscdf(49,60)

358

CHAPTER 44. POISSON DISTRIBUTION

Chapter 45

Uniform distribution A continuous random variable has a uniform distribution if all the values belonging to its support have the same probability density.

45.1

De…nition

The uniform distribution is characterized as follows: De…nition 249 Let X be an absolutely continuous random variable. Let its support be a closed interval of real numbers: RX = [l; u] We say that X has a uniform distribution on [l; u] if its probability density function1 is 1 if x 2 RX u l fX (x) = 0 if x 2 = RX Sometimes, we also say that X has a rectangular distribution or that X is a rectangular random variable.

45.2

Expected value

The expected value of a uniform random variable X is u+l 2

E [X] = Proof. It can be derived as follows: Z E [X] = Z =

1

xfX (x) dx

1 u

x

l

= 1 See

u

1 l

1 u Z

l

p. 107.

359

l

dx

u

xdx

360

CHAPTER 45. UNIFORM DISTRIBUTION = = =

45.3

u

1

1 2 x u l 2 l 1 1 2 u l2 u l2 (u l) (u + l) u+l = 2 (u l) 2

Variance

The variance of a uniform random variable X is Var [X] =

2

(u

l) 12

Proof. It can be derived thanks to the usual formula for computing the variance2 : Z 1 E X2 = x2 fX (x) dx 1 Z u 1 = x2 dx u l l Z u 1 x2 dx = u l l u 1 1 3 = x u l 3 l 1 1 3 = u l3 u l3 (u l) u2 + ul + l2 = 3 (u l) 2 u + ul + l2 = 3 2

E [X] =

Var [X]

= E X2 = = = =

2 Var [X]

= E X2

u+l 2

2

=

u2 + 2ul + l2 4 2

E [X]

u2 + ul + l2 u2 + 2ul + l2 3 4 4u2 + 4ul + 4l2 3u2 6ul 3l2 12 (4 3) u2 + (4 6) ul + (4 3) l2 12 2 u2 2ul + l2 (u l) = 12 12

E [X]2 . See p. 156.

45.4. MOMENT GENERATING FUNCTION

45.4

361

Moment generating function

The moment generating function of a uniform random variable X is de…ned for any t 2 R: 1 exp (tl)] if t 6= 0 (u l)t [exp (tu) MX (t) = 1 if t = 0 Proof. Using the de…nition of moment generating function: MX (t)

= = = = =

E [exp (tX)] Z 1 exp (tx) fX (x) dx 1 Z u 1 exp (tx) dx u l l u 1 1 exp (tx) u l t l exp (tu) exp (tl) (u l) t

Note that the above derivation is valid only when t 6= 0. However, when t = 0: MX (0) = E [exp (0 X)] = E [1] = 1 Furthermore, it is easy to verify that lim MX (t) = MX (0)

t!0

When t 6= 0, the integral above is well-de…ned and …nite for any t 2 R. Thus, the moment generating function of a uniform random variable exists for any t 2 R.

45.5

Characteristic function

The characteristic function of a uniform random variable X is 'X (t) =

1 (u l)it

1

[exp (itu)

exp (itl)] if t 6= 0 if t = 0

Proof. Using the de…nition of characteristic function: 'X (t)

= E [exp (itX)] = E [cos (tX)] + iE [sin (tX)] Z 1 Z 1 cos (tx) fX (x) dx + i sin (tx) fX (x) dx = 1 1 Z u Z u 1 1 = cos (tx) dx + i sin (tx) dx u l u l l l Z u Z u 1 = cos (tx) dx + i sin (tx) dx u l l l u u 1 1 1 = sin (tx) + i cos (tx) u l t t l l

362

CHAPTER 45. UNIFORM DISTRIBUTION = = = =

1 (u

l) t

fsin (tu)

1 (u

l) it

fi sin (tu)

sin (tl)

i cos (tu) + i cos (tl)g

i sin (tl) + cos (tu)

1 f[cos (tu) + i sin (tu)] (u l) it exp (itu) exp (itl) (u l) it

cos (tl)g

[cos (tl) + i sin (tl)]g

Note that the above derivation is valid only when t 6= 0. However, when t = 0: 'X (0) = E [exp (i 0 X)] = E [1] = 1 Furthermore, it is easy to verify that lim 'X (t) = 'X (0)

t!0

45.6

Distribution function

The distribution function of a uniform random variable X is 8 if x < l < 0 (x l) = (u l) if l x u FX (x) = : 1 if x > u

Proof. If x < l, then:

FX (x) = P (X

x) = 0

because X can not take on values smaller than l. If l FX (x)

x

= P (X x) Z x = fX (t) dt 1 Z x 1 = dt l l u 1 x = [t] u l l = (x l) = (u l)

If x > u, then: FX (x) = P (X

x) = 1

because X can not take on values greater than u.

45.7

Solved exercises

Below you can …nd some exercises with explained solutions.

u, then:

45.7. SOLVED EXERCISES

363

Exercise 1 Let X be a uniform random variable with support RX = [5; 10] Compute the following probability: P (7

X

9)

Solution We can compute this probability either using the probability density function or the distribution function of X. Using the probability density function: P (7

X

9)

Z

=

9

fX (x) dx =

7

9

1

dx

10 2 7) = 5

5

9 5 10 5

7 5 10 5

7

1 9 1 [x] = (9 5 7 5

=

Z

Using the distribution function: P (7

X

9)

= FX (9) =

4 5

FX (7) =

2 2 = 5 5

Exercise 2 Suppose the random variable X has a uniform distribution on the interval [ 2; 4]. Compute the following probability: P (X > 2) Solution This probability can be easily computed using the distribution function of X: P (X > 2)

=

1

=

1

P (X 2) = 1 FX (2) 2 ( 2) 4 1 =1 = 4 ( 2) 6 3

Exercise 3 Suppose the random variable X has a uniform distribution on the interval [0; 1]. Compute the third moment3 of X, i.e.: X 3 See

p. 36.

(3) = E X 3

364

CHAPTER 45. UNIFORM DISTRIBUTION

Solution We can compute the third moment of X using the transformation theorem4 : Z 1 Z 1 E X3 = x3 fX (x) dx = x3 dx 1

=

4 See

p. 134.

1 4 x 4

0

1 0

1 = 4

Chapter 46

Exponential distribution How much time will elapse before an earthquake occurs in a given region? How long do we need to wait before a customer enters our shop? How long will it take before a call center receives the next phone call? How long will a piece of machinery work without breaking down? Questions such as these are often answered in probabilistic terms using the exponential distribution. All these questions concern the time we need to wait before a certain event occurs. If this waiting time is unknown, it is often appropriate to think of it as a random variable having an exponential distribution. Roughly speaking, the time X we need to wait before an event occurs has an exponential distribution if the probability that the event occurs during a certain time interval is proportional to the length of that time interval. More precisely, X has an exponential distribution if the conditional probability P (t < X

t+

t jX > t )

is approximately proportional to the length t of the time interval comprised between the times t and t + t, for any time instant t. In most practical situations this property is very realistic and this is the reason why the exponential distribution is so widely used to model waiting times. The exponential distribution is also related to the Poisson distribution. When the event can occur more than once and the time elapsed between two successive occurrences is exponentially distributed and independent of previous occurrences, the number of occurrences of the event within a given unit of time has a Poisson distribution. See the lecture entitled Poisson distribution (p. 349) for a more detailed explanation and an intuitive graphical representation of this fact.

46.1

De…nition

The exponential distribution is characterized as follows. De…nition 250 Let X be an absolutely continuous random variable. Let its support be the set of positive real numbers: RX = [0; 1) 365

366

CHAPTER 46. EXPONENTIAL DISTRIBUTION

Let 2 R++ . We say that X has an exponential distribution with parameter if its probability density function1 is fX (x) = The parameter

exp ( 0

x) if x 2 RX if x 2 = RX

is called rate parameter.

The following is a proof that fX (x) is a legitimate probability density function2 . Proof. Non-negativity is obvious. We need to prove that the integral of fX (x) over R equals 1. This is proved as follows: Z 1 Z 1 1 fX (x) dx = exp ( x) dx = [ exp ( x)]0 = 0 ( 1) = 1 1

46.2

0

The rate parameter and its interpretation

We have mentioned that the probability that the event occurs between two dates t and t + t is proportional to t, conditional on the information that it has not occurred before t. The rate parameter is the constant of proportionality: P (t < X

t+

t jX > t ) =

t + o ( t)

where o ( t) is an in…nitesimal of higher order than t, i.e., a function of t that goes to zero more quickly than t does. The above proportionality condition is also su¢ cient to completely characterize the exponential distribution. Proposition 251 The proportionality condition P (t < X

t+

t jX > t ) =

t + o ( t)

is satis…ed only if X has an exponential distribution. Proof. The conditional probability P (t < X P (t < X

t+

t jX > t ) =

t+

t jX > t ) can be written as

P (t < X t + P (t < X t + t; X > t) = P (X > t) P (X > t)

Denote by FX (x) the distribution function3 of X: FX (x) = P (X

x)

and by SX (x) its survival function: SX (x) = 1

FX (x) = P (X > x)

Then, P (t < X t + P (X > t) 1 See

p. 107. p. 251. 3 See p. 108. 2 See

t)

=

FX (t + t) FX (t) = 1 FX (t)

SX (t + t) SX (t) SX (t)

t)

46.2. THE RATE PARAMETER AND ITS INTERPRETATION = Dividing both sides by SX (t +

t) t

367

t + o ( t)

t, we obtain SX (t)

1 = SX (t)

t t

+o

=

+ o (1)

where o (1) is a quantity that tends to 0 when t tends to 0. Taking limits on both sides, we obtain SX (t + t) SX (t) 1 lim = t!0 t SX (t) or, by the de…nition of derivative: dSX (t) 1 = dt SX (t) This di¤erential equation is easily solved using the chain rule: dSX (t) 1 d ln (SX (t)) = = dt SX (t) dt Taking the integral from 0 to x of both sides Z x d ln (SX (t)) dt = dt 0 we obtain

x

[ln (SX (t))]0 =

Z

x

dt

0

x

[ t]0

or ln (SX (x)) = ln (SX (0))

x

But X cannot take negative values. So SX (0) = 1

FX (0) = 1

which implies ln (SX (x)) =

x

Exponentiating both sides, we get SX (x) = exp (

x)

Therefore, 1

FX (x) = exp (

x)

or FX (x) = 1

exp (

x)

Since the density function is the …rst derivative of the distribution function4 , we obtain dFX (x) fX (x) = = exp ( x) dx which is the density of an exponential random variable. Therefore, the proportionality condition is satis…ed only if X is an exponential random variable. 4 See

p. 109.

368

46.3

CHAPTER 46. EXPONENTIAL DISTRIBUTION

Expected value

The expected value of an exponential random variable X is 1

E [X] =

Proof. It can be derived as follows: Z 1 E [X] = xfX (x) dx 0 Z 1 = x exp ( x) dx 0 Z A = [ x exp ( x)]1 + 0

1

exp (

x) dx

0

=

(0

1

0) +

exp (

1

x)

0

=

0+ 0+

1

1

=

where: in step A we have performed an integration by parts5 .

46.4

Variance

The variance of an exponential random variable X is 1

Var [X] =

2

Proof. The second moment of X is Z 1 E X2 = x2 exp ( x) dx 0 Z 1 A = x2 exp ( x) 0 +

1

2x exp (

0

B

=

(0

2

0) +

x exp (

x)

1

+

2

(0

0) +

2

1

exp (

x)

Z

1

exp (

x) dx

0

0

=

x) dx

1

=

2

2

0

where: in step A and B we have performed two integrations by parts. Furthermore, 2 1 1 2 E [X] = = 2 The usual formula for computing the variance6 gives Var [X] = E X 2

5 See 6 See

p. 51. p. 156.

2

E [X] =

2 2

1 2

=

1 2

46.5. MOMENT GENERATING FUNCTION

46.5

369

Moment generating function

The moment generating function of an exponential random variable X is de…ned for any t < and it is equal to MX (t) =

t

Proof. Using the de…nition of moment generating function, we obtain Z 1 MX (t) = E [exp (tX)] = exp (tx) fX (x) dx 1 Z 1 Z 1 = exp (tx) exp ( x) dx = exp ((t ) x) dx 0

0

1

=

exp ((t

t

) x)

1

=

t

0

Of course, the above integrals converge only if (t ) < 0, i.e., only if t < . Therefore, the moment generating function of an exponential random variable exists for all t < .

46.6

Characteristic function

The characteristic function of an exponential random variable X is 'X (t) =

it

Proof. Using the de…nition of characteristic function and the fact that exp (itx) = cos (tx) + i sin (tx) we can write 'X (t)

Z 1 E [exp (itX)] = exp (itx) fX (x) dx 1 Z 1 = exp (itx) exp ( x) dx 0 Z 1 Z 1 = cos (tx) exp ( x) dx + i sin (tx) exp ( =

0

0

We now compute separately the two integrals. The …rst integral is Z 1 cos (tx) exp ( x) dx 0

A

1

=

=

t

1 sin (tx) exp ( x) t 0 Z 1 1 sin (tx) ( exp ( x)) dx t Z0 1 sin (tx) exp ( x) dx 0

x) dx

370

CHAPTER 46. EXPONENTIAL DISTRIBUTION B

=

= =

1

1 cos (tx) exp ( x) t t 0 Z 1 1 cos (tx) ( exp ( x)) dx t 0 Z 1 1 cos (tx) exp ( x) dx t t t 0 2 Z 1 cos (tx) exp ( x) dx t2 t2 0

where in step A and B we have performed two integrations by parts. Therefore, Z 1 2 Z 1 cos (tx) exp ( x) dx cos (tx) exp ( x) dx = 2 t t2 0 0 which can be rearranged to yield Z 1 2 cos (tx) exp ( 1+ 2 t 0

or

Z

1

2

cos (tx) exp (

x) dx =

0

t2

1+

x) dx = 1

=

t2

t2

t2 t2 t2 +

2

=

t2 +

2

The second integral is Z

1

sin (tx) exp (

x) dx

0

A

=

= B

=

1 t 1 t

=

1 t

=

1 t

1

1 cos (tx) exp ( x) t 0 Z 1 1 cos (tx) ( exp ( x)) dx t 0 Z 1 cos (tx) exp ( x) dx t 0 1 1 sin (tx) exp ( x) t t 0 Z 1 1 sin (tx) ( exp ( x)) dx t 0 Z 1 sin (tx) exp ( x) dx t t 0 2 Z 1 sin (tx) exp ( x) dx t2 0

where in step A and B we have performed two integrations by parts. Therefore, Z 1 2 Z 1 1 sin (tx) exp ( x) dx = sin (tx) exp ( x) dx t t2 0 0 which can be rearranged to yield Z 1 2 1+ 2 sin (tx) exp ( t 0

x) dx =

1 t

46.7. DISTRIBUTION FUNCTION or

Z

1

sin (tx) exp (

371

x) dx =

0

Putting pieces together, we get Z 1 'X (t) = cos (tx) exp (

1+

x) dx + i

0

= =

46.7

t2 + 2 + it 2 t + 2

t t2 + it = it

+i

1

2

1 t

=

t2 Z

1

t t2 +

2

sin (tx) exp (

x) dx

0

+ it t2 + 2

=

2

it

Distribution function

The distribution function of an exponential random variable X is 0 1

FX (x) =

if x < 0 x) if x 0

exp (

Proof. If x < 0, then FX (x) = P (X

x) = 0

because X can not take on negative values. If x 0, then Z x Z x FX (x) = P (X x) = fX (t) dt = exp ( =

46.8

[ exp (

x

1

t)]0 =

t) dt

0

exp (

x) + 1

More details

In the following subsections you can …nd more details about the exponential distribution.

46.8.1

Memoryless property

One of the most important properties of the exponential distribution is the memoryless property: P (X

x + y jX > x ) = P (X

y)

for any x 0. Proof. This is proved as follows: P (X

x + y jX > x )

= =

P (X

x + y and X > x) P (X > x) P (x < X x + y) P (X > x)

372

CHAPTER 46. EXPONENTIAL DISTRIBUTION FX (x + y) FX (x) 1 FX (x) 1 exp ( (x + y)) (1 exp ( x)) exp ( x) exp ( x) exp ( (x + y)) exp ( x) 1 exp ( y) = FX (y) = P (X y)

= = = =

Remember that X is the time we need to wait before a certain event occurs. The memoryless property states that the probability that the event happens during a time interval of length y is independent of how much time x has already elapsed without the event happening.

46.8.2

Sums of exponential random variables

If X1 , X2 , . . . , Xn are n mutually independent7 random variables having an exponential distribution with parameter , then the sum Z=

n X

Xi

i=1

has a Gamma distribution8 with parameters 2n and n= . Proof. This is proved using moment generating functions9 : MZ (t) =

n Y

MXi (t) =

i=1

=

1

n Y

n

t

i=1

1

n

t

=

1

=

2 (n= ) t (2n)

t (2n)=2

The latter is the moment generating function of a Gamma distribution10 with parameters 2n and n= . So Z has a Gamma distribution, because two random variables have the same distribution when they have the same moment generating function. The random variable Z is also sometimes said to have an Erlang distribution. The Erlang distribution is just a special case of the Gamma distribution: a Gamma random variable is also an Erlang random variable when it can be written as a sum of exponential random variables.

46.9

Solved exercises

Below you can …nd some exercises with explained solutions. 7 See

p. 233. p. 397. 9 Remember that the moment generating function of a sum of mutually independent random variables is just the product of their moment generating functions (see p. 293). 1 0 See p. 399. 8 See

46.9. SOLVED EXERCISES

373

Exercise 1 The probability that a new customer enters a shop during a given minute is approximately 1%, irrespective of how many customers have entered the shop during the previous minutes. Assume that the total time we need to wait before a new customer enters the shop (denote it by X) has an exponential distribution. What is the probability that no customer enters the shop during the next hour? Solution Time is measured in minutes. Therefore, the probability that no customer enters the shop during the next hour is P (X > 60) = 1

P (X

60) = 1

FX (60)

where FX (x) is the distribution function of X. Since X is an exponential random variable with rate parameter 1%, its distribution function is FX (x) = 1

exp ( 0:01 x)

Therefore, P (X > 60)

= 1 FX (60) = 1 (1 exp ( 0:01 60)) = exp ( 0:01 60) = exp ( 0:6)

Exercise 2 Let X be an exponential random variable with parameter probability P (2 X 4)

= ln (3). Compute the

Solution First of all we can write the probability as P (2

X

4)

= P (fX = 2g [ f2 < X = P (X = 2) + P (2 < X

4g) 4) = P (2 < X

4)

where we have used the fact that the probability that an absolutely continuous random variable takes on any speci…c value is equal to zero11 . Now, the probability can be written in terms of the distribution function of X as P (2

X

4)

= P (2 < X 4) = FX (4) FX (2) = [1 exp ( ln (3) 4)] [1 exp ( ln (3) 2)] = exp ( ln (3) 2) exp ( ln (3) 4) = 3 2 3

4

Exercise 3 Suppose the random variable X has an exponential distribution with parameter = 1. Compute the probability P (X > 2) 1 1 See

p. 109.

374

CHAPTER 46. EXPONENTIAL DISTRIBUTION

Solution The above probability can be easily computed using the distribution function of X: P (X > 2) = 1

P (X

2) = 1

FX (2) = 1

[1

exp ( 2)] = exp ( 2)

Exercise 4 What is the probability that a random variable X is less than its expected value, if X has an exponential distribution with parameter ? Solution The expected value of an exponential random variable with parameter E [X] =

is

1

The probability above can be computed using the distribution function of X: P (X

E [X])

= P X =

1

exp

1

= FX 1

=1

1

exp ( 1)

Chapter 47

Normal distribution The normal distribution is one of the cornerstones of probability theory and statistics, because of the role it plays in the Central Limit Theorem1 , because of its analytical tractability and because many real-world phenomena involve random quantities that are approximately normal (e.g. errors in scienti…c measurement). It is often called Gaussian distribution, in honor of Carl Friedrich Gauss (17771855), an eminent German mathematician who gave important contributions towards a better understanding of the normal distribution. Sometimes it is also referred to as "bell-shaped distribution", because the graph of its probability density function resembles the shape of a bell. 0.45

0.4

0.35

0.3

0.25

0.2

0.15

0.1

0.05

0

0.05

4

3

2

1

0

1

2

3

4

As you can see from the above plot of the density of a normal distribution, the density is symmetric around the mean (indicated by the vertical line at zero). As a consequence, deviations from the mean having the same magnitude, but di¤erent signs, have the same probability. The density is also very concentrated around the mean and becomes very small by moving from the center to the left or to the right of the distribution (the so called "tails" of the distribution). This means that 1 See

p. 545.

375

376

CHAPTER 47. NORMAL DISTRIBUTION

the further a value is from the center of the distribution, the less probable it is to observe that value. The remainder of this lecture gives a formal presentation of the main characteristics of the normal distribution, dealing …rst with the special case in which the distribution has zero mean and unit variance, then with the general case, in which mean and variance can take any value.

47.1

The standard normal distribution

The adjective "standard" indicates the special case in which the mean is equal to zero and the variance is equal to one.

47.1.1

De…nition

The standard normal distribution is characterized as follows: De…nition 252 Let X be an absolutely continuous random variable. Let its support be the whole set of real numbers: RX = R We say that X has a standard normal distribution if its probability density function2 is: 1 1 2 fX (x) = p exp x 2 2 The following is a proof that fX (x) is indeed a legitimate probability density function3 . Proof. fX (x) is a legitimate probability density function if it is non-negative and if its integral over the support equals 1. The former property is obvious, while the latter can be proved as follows: Z 1 fX (x) dx 1 Z 1 1 2 1=2 = (2 ) exp x dx 2 1 Z 1 1 2 A = (2 ) 1=2 2 exp x dx 2 0 Z 1 Z 1 1=2 1 2 1 2 1=2 = (2 ) 2 exp x dx exp x dx 2 2 0 0 Z 1 Z 1 1=2 1 2 1 2 1=2 = (2 ) 2 exp x dx exp y dy 2 2 0 0 Z 1Z 1 1=2 1 2 1=2 = (2 ) 2 exp x + y 2 dydx 2 0 0 Z 1Z 1 1=2 1 2 B = (2 ) 1=2 2 exp x + s2 x2 xdsdx 2 0 0 2 See 3 See

p. 107. p. 251.

47.1. THE STANDARD NORMAL DISTRIBUTION

= (2 )

1=2

2

Z

1

0

=

(2 )

1=2

2

Z

Z

1

1 exp 1 + s2

0

=

(2 )

1=2

2

Z

1

Z

1

1 ds 1 + s2

(2 )

1=2

=

(2 )

1=2

2 ([arctan (s)]0 )

=

(2 )

1=2

2 (arctan (1)

=

(2 )

1=2

2

=

2

0

1

1 2 x 1 + s2 2

1=2

ds

0

ds

1=2

1 1=2 1=2

arctan (0))

1=2

2

xdxds

1=2

1 0+ 1 + s2

0

1=2

1 2 x 1 + s2 2

exp

0

1

377

0

=2

1=2

1=2

1=2

2

2

1=2

=1

where: in step A we have used the fact that the integrand is even; in step B we have made a change of variable (y = xs).

47.1.2

Expected value

The expected value of a standard normal random variable X is: E [X] = 0 Proof. It can be derived as follows: Z 1 E [X] = xfX (x) dx 1 Z 1 1=2 = (2 ) x exp =

(2 )

1=2

+ (2 )

Z

1=2

1 0

x exp 1 1

Z

x exp

0

=

(2 )

1=2

+ (2 ) = =

47.1.3

(2 )

1=2

(2 )

1=2

1 2 x dx 2 1 2 x dx 2

exp

1 2 x 2

[ 1 + 0] + (2 ) (2 )

0

1 2 x 2

exp

1=2

1 2 x dx 2

1=2

1 1

1=2

=0

Variance

The variance of a standard normal random variable X is: Var [X] = 1

0

[0 + 1]

378

CHAPTER 47. NORMAL DISTRIBUTION

Proof. It can be proved with the usual formula for computing the variance4 : E X2 Z 1 = x2 fX (x) dx 1 Z 1 1=2 = (2 ) x2 exp =

A

=

=

= B

=

1 0

Z

1 2 x dx 2

1 2 (2 ) x x exp x dx 2 1 Z 1 1 2 + x x exp x dx 2 0 ( Z 0 0 1 2 1 2 1=2 (2 ) x exp x + exp x dx 2 2 1 1 Z 1 1 1 2 1 2 x + exp x dx + x exp 2 2 0 0 Z 0 1 2 1=2 x dx (2 ) (0 0) + (0 0) + exp 2 1 Z 1 1 2 + exp x dx 2 0 Z 1 1 2 1=2 (2 ) exp x dx 2 1 Z 1 fX (x) dx = 1 1=2

1

where: in step A we have performed an integration by parts5 ; in step B we have used the fact that the integral of a probability density function over its support is equal to 1. Finally: 2 E [X] = 02 = 0 and Var [X] = E X 2

47.1.4

2

E [X] = 1

0=1

Moment generating function

The moment generating function of a standard normal random variable X is de…ned for any t 2 R: 1 2 MX (t) = exp t 2 Proof. Using the de…nition of moment generating function: MX (t)

=

E [exp (tX)] Z 1 = exp (tx) fX (x) dx 1

4 Var [X] 5 See

=E p. 51.

X2

E [X]2 . See p. 156.

47.1. THE STANDARD NORMAL DISTRIBUTION = (2 ) = =

(2 ) (2 )

=

(2 )

=

exp

A

=

exp

B

=

exp

C

=

exp

1=2

1=2

1=2

1=2

1 2 t 2 1 2 t 2 1 2 t 2 1 2 t 2

Z Z

Z Z

1 1 1 1 1 1 1

1 2 x dx 2

exp (tx) exp exp exp

379

1 2 x 2 1 2 x 2

2tx

dx

2tx + t2

1 2 t exp 2 1 Z 1 1=2 (2 ) exp 1 Z 1 1=2 (2 ) exp 1 Z 1 fZ (z) dz exp

t2

dx

1 2 x 2tx + t2 2 1 2 (x t) dx 2 1 2 z dz 2

dx

1

where: in step A we have performed a change of variable (z = x we have de…ned 1 2 fZ (z) = exp z 2

t); in step B

where fZ (z) is the probability density function of a standard normal random variable; in step C we have used the fact that the integral of a probability density function over its support is equal to 1. The integral above is well-de…ned and …nite for any t 2 R. Thus, the moment generating function of a standard normal random variable exists for any t 2 R.

47.1.5

Characteristic function

The characteristic function of a standard normal random variable X is: 'X (t) = exp

1 2 t 2

Proof. Using the de…nition of characteristic function: 'X (t)

A

= E [exp (itX)] = E [cos (tX)] + iE [sin (tX)] Z 1 Z = cos (tx) fX (x) dx + i 1 Z 1 = cos (tx) fX (x) dx

1

sin (tx) fX (x) dx

1

1

where: in step A we have used the fact that sin (tx) fX (x) is an odd function of x. Now, take the derivative with respect to t of the characteristic function: d ' (t) dt X

=

d E [exp (itX)] dt

380

CHAPTER 47. NORMAL DISTRIBUTION = = = = A

= = = =

B

= =

d exp (itX) dt E [iX exp (itX)] iE [X cos (tX)] E [X sin (tX)] Z 1 Z 1 i x cos (tx) fX (x) dx x sin (tx) fX (x) dx 1 1 Z 1 x sin (tx) fX (x) dx 1 Z 1 1 2 1 x dx x sin (tx) p exp 2 2 1 Z 1 d 1 1 2 p exp x dx sin (tx) 2 dx 2 1 Z 1 d sin (tx) fX (x) dx dx 1 Z 1 1 [sin (tx) fX (x)] 1 t cos (tx) fX (x) dx 1 Z 1 t cos (tx) fX (x) dx E

1

where: in step A we have used the fact that x cos (tx) fX (x) is an odd function of x; in step B we have performed an integration by parts. Putting together the previous two results, we obtain: d ' (t) = dt X

t'X (t)

The only function that satis…es this ordinary di¤erential equation (subject to the condition 'X (0) = E [exp (i 0 X)] = 1) is: 'X (t) = exp

47.1.6

1 2 t 2

Distribution function

There is no simple formula for the distribution function FX (x) of a standard normal random variable X, because the integral Z x FX (x) = fX (t) dt 1

cannot be expressed in terms of elementary functions. Therefore, it is necessary to resort to computer algorithms to compute the values of the distribution function of a standard normal random variable. For example, the MATLAB command: normcdf(x) returns the value of the distribution function at the point x.

47.2. THE NORMAL DISTRIBUTION IN GENERAL

381

Some values of the distribution function of X are used very frequently and people usually learn them by heart: FX ( FX ( FX ( FX (

2:576) = 0:005 2:326) = 0:01 1:96) = 0:025 1:645) = 0:05

FX (2:576) = 0:995 FX (2:326) = 0:99 FX (1:96) = 0:975 FX (1:645) = 0:95

Note also that: FX ( x) = 1

FX (x)

which is due to the symmetry around 0 of the standard normal density and is often used in calculations. In the past, when computers were not widely available, people used to look up the values of FX (x) in normal distribution tables. A normal distribution table is a table where FX (x) is tabulated for several values of x. For values of x that are not tabulated, approximations of FX (x) can be computed by interpolating the two tabulated values that are closest to x. For example, if x is not tabulated, x1 is the greatest tabulated number smaller than x and x2 is the smallest tabulated number greater than x, the approximation is as follows: FX (x) = FX (x1 ) +

47.2

FX (x2 ) x2

FX (x1 ) (x x1

x1 )

The normal distribution in general

While in the previous section we restricted our attention to the normal distribution with zero mean and unit variance, we now deal with the general case.

47.2.1

De…nition

The normal distribution with mean

and variance

2

is characterized as follows:

De…nition 253 Let X be an absolutely continuous random variable. Let its support be the whole set of real numbers: RX = R Let 2 R and and variance

2 R++ . We say that X has a normal distribution with mean , if its probability density function is

2

1 1 fX (x) = p exp 2

2

1 (x 2

2

) 2

!

We often indicate that X has a normal distribution with mean by: X

N

;

2

and variance

382

47.2.2

CHAPTER 47. NORMAL DISTRIBUTION

Relation to the standard normal distribution

A random variable X having a normal distribution with mean is just a linear function of a standard normal random variable: Proposition 254 If X has a normal distribution with mean then: X= + Z

and variance and variance

2

2

,

where Z is a random variable having a standard normal distribution. Proof. This can be easily proved using the formula for the density of a function6 of an absolutely continuous variable: fX (x)

= fZ g = fZ =

1

(x)

x

1

dg

(x) dx

1

1 p exp 2

1 2

2

x

!

1

Obviously, then, a standard normal distribution is just a normal distribution with mean = 0 and variance 2 = 1.

47.2.3

Expected value

The expected value of a normal random variable X is: E [X] = Proof. It is an immediate consequence of the fact that X = + Z (where Z has a standard normal distribution) and the linearity of the expected value7 : E [X] = E [ + Z] =

47.2.4

+ E [Z] =

+

0=

Variance

The variance of a normal random variable X is: Var [X] =

2

Proof. It can be derived using the formula for the variance of linear transformations8 on X = + Z (where Z has a standard normal distribution): Var [X] = Var [ + Z] =

6 See p. 265. Note that X = g (Z) = strictly positive. 7 See p. 134. 8 See p. 158.

2

Var [Z] =

2

+ Z is a strictly increasing function of Z, since

is

47.2. THE NORMAL DISTRIBUTION IN GENERAL

47.2.5

383

Moment generating function

The moment generating function of a normal random variable X is de…ned for any t 2 R: 1 2 2 MX (t) = exp t+ t 2 Proof. Recall that X = + Z (where Z has a standard normal distribution) and that the moment generating function of a standard normal random variable is: MZ (t) = exp

1 2 t 2

We can use the formula for the moment generating function of a linear transformation9 : MX (t)

=

exp ( t) MZ ( t) 1 2 = exp ( t) exp ( t) 2 1 2 2 = exp t+ t 2

It is de…ned for any t 2 R because the moment generating function of Z is de…ned for any t 2 R.

47.2.6

Characteristic function

The characteristic function of a normal random variable X is: 1 2

'X (t) = exp i t

2 2

t

Proof. Recall that X = + Z (where Z has a standard normal distribution) and that the characteristic function of a standard normal random variable is: 'Z (t) = exp

1 2 t 2

We can use the formula for the characteristic function of a linear transformation10 : 'X (t)

9 See 1 0 See

p. 293. p. 310.

=

exp (i t) 'Z ( t)

=

exp (i t) exp

=

exp i t

1 2

1 2 ( t) 2 2 2

t

384

47.2.7

CHAPTER 47. NORMAL DISTRIBUTION

Distribution function

The distribution function FX (x) of a normal random variable X can be written as: x FX (x) = FZ where FZ (z) is the distribution function of a standard normal random variable Z. Proof. Remember that any normal random variable X with mean and variance 2 can be written as: X= + Z where Z is a standard normal random variable. Using this fact, we obtain the following relation between the distribution function of Z and the distribution function of X: FX (x)

= P (X x) = P( + Z x = P Z = FZ

x)

x

Therefore, if we know how to compute the values of the distribution function of a standard normal distribution (see above), we also know how to compute the values of the distribution function of a normal distribution with mean and variance 2 . Example 255 If we need to compute the value FX 12 of a normal random variable X with mean = 1 and variance 2 = 1, we can compute it using the distribution function of a standard normal random variable Z: FX

47.3

1 2

= FZ

1=2

= FZ

1=2 1 1

= FZ

1 2

More details

More details about the normal distribution can be found in the following subsections.

47.3.1

Multivariate normal distribution

A multivariate generalization of the normal distribution is introduced in the lecture entitled Multivariate normal distribution (p. 439).

47.3.2

Linear combinations of normal random variables

The lecture entitled Linear combinations of normals (p. 469) explains and proves one of the most important facts about the normal distribution: the linear combination of jointly normal random variables also has a normal distribution.

47.4. SOLVED EXERCISES

47.3.3

385

Quadratic forms involving normal random variables

The lecture entitled Quadratic forms in normal vectors (p. 481) discusses quadratic forms involving normal random variables.

47.4

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let X be a normal random variable with mean Compute the following probability: P ( 0:92

X

= 3 and variance

2

= 4.

6:92)

Solution First of all, we need to express the above probability in terms of the distribution function of X: P ( 0:92 X 6:92) = P (X 6:92) P (X < A

0:92)

= P (X 6:92) P (X 0:92) = FX (6:92) FX ( 0:92)

where: in step A we have used the fact that the probability that an absolutely continuous random variable takes on any speci…c value is equal to zero11 . Then, we need to express the distribution function of X in terms of the distribution function of a standard normal random variable Z: FX (x) = FZ

x

= FZ

x

3 2

Therefore, the above probability can be expressed as: P ( 0:92 X 6:92) = FX (6:92) FX ( 0:92) 6:92 3 0:92 3 = FZ FZ 2 2 = FZ (1:96) FZ ( 1:96) = 0:975 0:025 = 0:95 where we have used the fact that FZ (1:96) = 1 which has been discussed above. 1 1 See

p. 109.

FZ ( 1:96) = 0:975

386

CHAPTER 47. NORMAL DISTRIBUTION

Exercise 2 Let X be a random variable having a normal distribution with mean variance 2 = 16. Compute the following probability:

= 1 and

P (X > 9) Solution We need to use the same technique used in the previous exercise and express the probability in terms of the distribution function of a standard normal random variable: P (X > 9) =

1

= 1 =

1

P (X

9) = 1 FX (9) 9 1 FZ p = 1 FZ (2) 16 0:9772 = 0:0228

where the value FZ (2) can be found with a computer algorithm, for example with the MATLAB command normcdf(2)

Exercise 3 Suppose the random variable X has a normal distribution with mean variance 2 = 1. De…ne the random variable Y as follows: Y = exp (2 + 3X) Compute the expected value of Y . Solution The moment generating function of X is: MX (t)

=

E [exp (tX)] = exp

=

1 exp t + t2 2

t+

1 2

2 2

t

Using the linearity of the expected value, we obtain: E [Y ]

= E [exp (2 + 3X)] = E [exp (2) exp (3X)] = exp (2) E [exp (3X)] = exp (2) MX (3) 1 19 = exp (2) exp 3 + 9 = exp 2 2

= 1 and

Chapter 48

Chi-square distribution A random variable X has a Chi-square distribution if it can be written as a sum of squares: X = Y12 + : : : + Yn2 where Y1 , . . . , Yn are n mutually independent1 standard normal random variables2 . The importance of the Chi-square distribution stems from the fact that sums of this kind are encountered very often in statistics, especially in the estimation of variance and in hypothesis testing.

48.1

De…nition

Chi-square random variables are characterized as follows. De…nition 256 Let X be an absolutely continuous random variable. Let its support be the set of positive real numbers: RX = [0; 1) Let n 2 N. We say that X has a Chi-square distribution with n degrees of freedom if its probability density function3 is fX (x) =

cxn=2 0

where c is a constant: c= and

1

exp

1 2x

if x 2 RX if x 2 = RX

1 2n=2 (n=2)

() is the Gamma function4 .

The following notation is often employed to indicate that a random variable X has a Chi-square distribution with n degrees of freedom: X

2

(n)

where the symbol means "is distributed as" and distribution with n degrees of freedom. 1 See

p. p. 3 See p. 4 See p. 2 See

233. 376. 107. 55.

387

2

(n) indicates a Chi-square

388

CHAPTER 48. CHI-SQUARE DISTRIBUTION

48.2

Expected value

The expected value of a Chi-square random variable X is E [X] = n Proof. It can be derived as follows: Z 1 xfX (x) dx E [X] = 0 Z 1 1 = xcxn=2 1 exp x dx 2 0 Z 1 1 x dx = c xn=2 exp 2 0 Z 1 1 1 n n=2 1 A = c xn=2 2 exp x + x 2 exp 2 2 0 0 Z 1 1 x dx = c (0 0) + n xn=2 1 exp 2 0 Z 1 1 = n cxn=2 1 exp x dx 2 0 Z 1 = n fX (x) dx

1 x dx 2

0

B

= n

where: in step A we have performed an integration by parts5 ; in step B we have used the fact that the integral of a probability density function over its support is equal to 1.

48.3

Variance

The variance of a Chi-square random variable X is Var [X] = 2n Proof. The second moment of X is Z 1 E X2 = x2 fX (x) dx 0 Z 1 1 = x2 cxn=2 1 exp x dx 2 0 Z 1 1 x dx = c xn=2+1 exp 2 0 1 1 A = c xn=2+1 2 exp x 2 0 Z 1 n 1 n=2 x dx + + 1 x 2 exp 2 2 0 5 See

p. 51.

48.4. MOMENT GENERATING FUNCTION Z

1 1 0) + (n + 2) xn=2 exp x dx 2 0 Z 1 1 x dx = c (n + 2) xn=2 exp 2 0 1 1 x = c (n + 2) xn=2 2 exp 2 0 Z 1 n n=2 1 1 + x 2 exp x dx 2 2 0 Z 1 1 = c (n + 2) (0 0) + n xn=2 1 exp x dx 2 0 Z 1 1 = (n + 2) n cxn=2 1 exp x dx 2 Z0 1 = (n + 2) n fX (x) dx

= c (0

B

389

0

C

=

(n + 2) n

where: in step A and B we have performed an integration by parts; in step C we have used the fact that the integral of a probability density function over its support is equal to 1. Furthermore, 2

E [X] = n2 By employing the usual formula for computing the variance6 , we obtain Var [X]

= E X2 =

48.4

(n + 2) n

2

E [X]

n2 = n (n + 2

n) = 2n

Moment generating function

The moment generating function of a Chi-square random variable X is de…ned for any t < 21 : n=2 MX (t) = (1 2t) Proof. By using the de…nition of moment generating function, we get MX (t)

= = = =

A 6 See

p. 156.

=

E [exp (tX)] Z 1 exp (tx) fX (x) dx 1 Z 1 c exp (tx) xn=2 1 exp 0 Z 1 1 c xn=2 1 exp t 2 0 Z 1 n=2 1 2 c y exp ( 1 2t 0

1 x dx 2 x dx y)

2 1

2t

dy

390

CHAPTER 48. CHI-SQUARE DISTRIBUTION

= c

Z

1

1

0

B C

1

= =

n=2

2t

1

2n=2

n=2

1

1

exp ( y) dy

y n=2

1

exp ( y) dy

(n=2)

2t 1 (n=2)

n=2

2 1

(n=2)

2t

2n=2

1 2n=2 (1 (1

Z

y n=2

0

2

= c =

2t

2

= c

n=2

2

2t)

n=2

2t) n=2

where: in step A we have performed the change of variable y=

1 2

t x

in step B we have used the de…nition of Gamma function7 ; in step C we have used the de…nition of c. The integral above is well-de…ned and …nite only when 1 t > 0, i.e., when t < 12 . Thus, the moment generating function of a Chi-square 2 random variable exists for any t < 12 .

48.5

Characteristic function

The characteristic function of a Chi-square random variable X is 'X (t) = (1

2it)

n=2

Proof. Using the de…nition of characteristic function, we get 'X (t)

= = =

A

= = =

E [exp (itX)] Z 1 exp (itx) fX (x) dx 1 Z 1 1 c exp (itx) xn=2 1 exp x 2 0 ! Z 1 X 1 1 k (itx) xn=2 1 exp c k! 0 k=0 Z 1 1 X 1 k c (it) xk xn=2 1 exp k! 0 k=0 Z 1 1 X 1 k c (it) xk+n=2 1 exp k! 0

= c

k=0 1 X

k=0 7 See

p. 55.

1 k (it) 2k+n=2 (k + n=2) k!

dx 1 x dx 2 1 x dx 2 1 x dx 2

48.6. DISTRIBUTION FUNCTION Z

1

1 xk+n=2 (k + n=2)

2k+n=2

0

391 1

1 x dx 2

exp

Z 1 1 X 1 k = c (it) 2k+n=2 (k + n=2) fk (x) dx k! 0

B C

= c

k=0 1 X k=0

D

= = =

2n=2

1

X 1 1 k (it) 2k+n=2 (k + n=2) (n=2) k=0 k!

1 X 1 (k + n=2) k (it) 2k k! (n=2)

k=0 1 X

k=0

=

1 k (it) 2k+n=2 (k + n=2) k!

1 k (2it) k!

(k + n=2) (n=2)

1 kY1 X 1 n k 1+ (2it) +j k! 2 j=0 k=1

E

=

(1

2it)

n=2

where: in step A we have substituted the Taylor series expansion of exp (itx); in step B we have de…ned fk (x) =

2k+n=2

1 xk+n=2 (k + n=2)

1

exp

1 x 2

where fk (x) is the probability density function of a Chi-square random variable with 2k + n degrees of freedom; in step C we have used the fact that the integral of a probability density function over its support is equal to 1; in step D we have used the de…nition of c; in step E we have used the fact that 1+

1 kY1 X 1 n k (2it) +j k! 2 j=0

k=1

is the Taylor series expansion of (1 the expansion yourself.

48.6

n=2

2it)

, which you can verify by computing

Distribution function

The distribution function of a Chi-square random variable is: FX (x) = where the function (z; y) =

Z

(n=2; x=2) (n=2)

y

sz 1

1

exp ( s) ds

392

CHAPTER 48. CHI-SQUARE DISTRIBUTION

is called lower incomplete Gamma function8 and is usually computed by means of specialized computer algorithms. Proof. This is proved as follows: Z x FX (x) = fX (t) dt 1 Z x 1 ctn=2 1 exp = t dt 2 1 Z x=2 n=2 1 A = c (2s) exp ( s) 2ds 1

= c2n=2 B

= = =

Z

x=2

sn=2 1

1

Z 1 n=2 2 2n=2 (n=2) Z x=2 1 sn=2 (n=2) 1 (n=2; x=2) (n=2)

exp ( s) ds x=2

sn=2

1

exp ( s) ds

1 1

exp ( s) ds

where: in step A we have performed a change of variable (s = t=2); in step B we have used the de…nition of c. Usually, it is possible to resort to computer algorithms that directly compute the values of FX (x). For example, the MATLAB command chi2cdf(x,n) returns the value at the point x of the distribution function of a Chi-square random variable with n degrees of freedom. In the past, when computers were not widely available, people used to look up the values of FX (x) in Chi-square distribution tables. A Chi-square distribution table is a table where FX (x) is tabulated for several values of x and n. For values of x that are not tabulated, approximations of FX (x) can be computed by interpolation, with the same procedure described for the normal distribution (p. 380).

48.7

More details

In the following subsections you can …nd more details about the Chi-square distribution.

48.7.1

Sums of independent Chi-square random variables

Let X1 be a Chi-square random variable with n1 degrees of freedom and X2 another Chi-square random variable with n2 degrees of freedom. If X1 and X2 are independent, then their sum has a Chi-square distribution with n1 + n2 degrees of 8 See

p. 58.

48.7. MORE DETAILS

393

freedom: 2 2 X1 (n1 ) , X2 (n2 ) =) X1 + X2 X1 and X2 are independent

2

(n1 + n2 )

This can be generalized to sums of more than two Chi-square random variables, provided they are mutually independent: ! k k 2 X X Xi (ni ) for i = 1; : : : ; k 2 =) Xi ni X1 ; X2 ; : : : ; Xk are mutually independent i=1

i=1

Proof. This can be easily proved using moment generating functions. The moment generating function of Xi is MXi (t) = (1 De…ne X=

2t)

k X

ni =2

Xi

i=1

The moment generating function of a sum of mutually independent random variables is just the product of their moment generating functions9 : MX (t) =

k Y

MXi (t)

i=1

=

k Y

(1

i=1

= (1

2t)

= (1

2t)

where n=

k X

ni =2

2t) Pk

i=1

ni =2

n=2

ni

i=1

Therefore, the moment generating function of X is the moment generating function of a Chi-square random variable with n degrees of freedom. As a consequence, X is a Chi-square random variable with n degrees of freedom.

48.7.2

Relation to the standard normal distribution

Let Z be a standard normal random variable10 and let X be its square: X = Z2 Then X is a Chi-square random variable with 1 degree of freedom. Proof. For x 0, the distribution function of X is FX (x) 9 See 1 0 See

p. 293. p. 376.

394

CHAPTER 48. CHI-SQUARE DISTRIBUTION A

=

P (X

x)

B

=

P Z2

x

C

=

P Z

D

=

x1=2

Z

x1=2

x1=2

fZ (z) dz x1=2

where: in step A we have used the de…nition of distribution function; in step B we have used the de…nition of X; in step C we have taken the square root on both sides of the inequality; in step D fZ (z) is the probability density function of a standard normal random variable: 1 fZ (z) = p exp 2

1 2 z 2

For x < 0, FX (x) = 0, because X, being a square, cannot be negative. By using Leibniz integral rule11 and the fact that the density function is the derivative of the distribution function12 , the probability density function of X, denoted by fX (x), is obtained as follows (for x 0): fX (x)

= =

dFX (x) dx Z x1=2 d fZ (z) dz dx x1=2

d x1=2 fZ dx 1 1 1=2 2 p exp x 2 2 1 1 p exp x1=2 2 2 1 1 1=2 1 p x exp x 2 2 2 1 1 p x 1=2 exp x 2 2 1 x1=2 1 exp 21=2 (1=2)

= fZ x1=2 =

= = =

x1=2 1 x 2

1 x1=2 1 21=2 (1=2)

0

x1=2 dx

1=2

2

1 x 2 1 1 +p x 2 2

1=2

1=2

exp

1 x 2

1 x 2

where in the last step we have used the fact that13 trivially holds that fX (x) = 0. Therefore, fX (x) =

d

exp

1 2x

(1=2) =

p

. For x < 0, it

if x 0 if x < 0

which is the probability density function of a Chi-square random variable with 1 degree of freedom. 1 1 See

p. 52. p. 109. 1 3 See p. 57. 1 2 See

48.8. SOLVED EXERCISES

48.7.3

395

Relation to the standard normal distribution (2)

Combining the results obtained in the previous two subsections, we obtain that the sum of squares of n independent standard normal random variables is a Chi-square random variable with n degrees of freedom.

48.8

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let X be a Chi-square random variable with 3 degrees of freedom. Compute the probability P (0:35 X 7:81) Solution First of all, we need to express the above probability in terms of the distribution function of X: P (0:35

X

7:81) =

P (X

7:81)

P (X < 0:35)

A

= P (X 7:81) P (X 0:35) = FX (7:81) FX (0:35)

B

=

0:95

0:05 = 0:90

where: in step A we have used the fact that the probability that an absolutely continuous random variable takes on any speci…c value is equal to zero14 ; in step B the values FX (7:81) = FX (0:35) =

0:95 0:05

can be computed with a computer algorithm or found in a Chi-square distribution table.

Exercise 2 Let X1 and X2 be two independent normal random variables having mean and variance 2 = 16. Compute the probability P X12 + X22 > 8 Solution First of all, the two variables X1 and X2 can be written as X1 1 4 See

p. 109.

=

4Z1

=0

396

CHAPTER 48. CHI-SQUARE DISTRIBUTION X2

=

4Z2

where Z1 and Z2 are two standard normal random variables. Thus, we can write P X12 + X22 > 8

=

P 16Z12 + 16Z22 > 8 8 16 1 P Z12 + Z22 > 2

= P Z12 + Z22 > =

but the sum Z12 + Z22 has a Chi-square distribution with 2 degrees of freedom. Therefore, P X12 + X22 > 8

1 2

=

P Z12 + Z22 >

=

1

P Z12 + Z22

=

1

FY

1 2

1 2

where FY (1=2) is the distribution function of a Chi-square random variable Y with 2 degrees of freedom, evaluated at the point y = 21 . Using any computer program, we can …nd 1 FY = 0:2212 2

Exercise 3 Suppose the random variable X has a Chi-square distribution with 5 degrees of freedom. De…ne the random variable Y as follows: Y = exp (1

X)

Compute the expected value of Y . Solution The expected value of Y can be easily calculated using the moment generating function of X: 5=2 MX (t) = E [exp (tX)] = (1 2t) Now, exploiting the linearity of the expected value, we obtain E [Y ]

= E [exp (1 X)] = E [exp (1) exp ( X)] = exp (1) E [exp ( X)] = exp (1) MX ( 1) = exp (1) 3 5=2

Chapter 49

Gamma distribution The Gamma distribution can be thought of as a generalization of the Chi-square distribution1 . If a random variable Z has a Chi-square distribution with n degrees of freedom and h is a strictly positive constant, then the random variable X de…ned as h X= Z n has a Gamma distribution with parameters n and h.

49.1

De…nition

Gamma random variables are characterized as follows: De…nition 257 Let X be an absolutely continuous random variable. Let its support be the set of positive real numbers: RX = [0; 1) Let n; h 2 R++ . We say that X has a Gamma distribution with parameters n and h if its probability density function2 is fX (x) =

cxn=2 0

1

exp

n1 h 2x

if x 2 RX if x 2 = RX

where c is a constant: n=2

c= and

(n=h) n=2 2 (n=2)

() is the Gamma function3 .

A random variable having a Gamma distribution is also called a Gamma random variable. 1 See

p. 387. p. 107. 3 See p. 55. 2 See

397

398

CHAPTER 49. GAMMA DISTRIBUTION

49.2

Expected value

The expected value of a Gamma random variable X is E [X] = h Proof. It can be derived as follows: Z 1 E [X] = xfX (x) dx Z0 1 n1 = xcxn=2 1 exp x dx h2 0 Z 1 n1 = c xn=2 exp x dx h2 0 1 h n1 A = c xn=2 2 exp x n h2 0 Z 1 n n=2 1 h n1 + x 2 exp x dx 2 n h2 0 Z 1 n1 x dx = c (0 0) + h xn=2 1 exp h2 0 Z 1 n1 = h cxn=2 1 exp x dx h2 Z0 1 = h fX (x) dx 0

B

= h

where: in step A we have performed an integration by parts4 ; in step B we have used the fact that the integral of a probability density function over its support is equal to 1.

49.3

Variance

The variance of a Gamma random variable X is Var [X] = 2

h2 n

Proof. It can be derived thanks to the usual formula for computing the variance5 : Z 1 2 E X = x2 fX (x) dx 0 Z 1 n1 x dx = x2 cxn=2 1 exp h2 0 Z 1 n1 = c xn=2+1 exp x dx h2 0 4 See

p. 51. = E X2

5 Var [X]

E [X]2 . See p. 156.

49.4. MOMENT GENERATING FUNCTION A

xn=2+1 2

= c Z

= = B

=

= = = C

=

h exp n

399 1

n1 x h2

0

1

n h n1 + 1 xn=2 2 exp x dx + 2 n h2 0 Z n1 h 1 n=2 x exp x dx c (0 0) + (n + 2) n 0 h2 Z 1 h n1 c (n + 2) xn=2 exp x dx n h2 0 1 h h n1 c (n + 2) xn=2 2 exp x n n h2 0 Z 1 n n=2 1 h n1 + x 2 exp x dx 2 n h2 0 Z 1 n1 h c (n + 2) (0 0) + h xn=2 1 exp x dx n h2 0 Z n1 h2 1 n=2 1 (n + 2) cx exp x dx n 0 h2 Z h2 1 (n + 2) fX (x) dx n 0 h2 (n + 2) n

where: in step A and B we have performed an integration by parts; in step C we have used the fact that the integral of a probability density function over its support is equal to 1. Finally: 2 E [X] = h2 and Var [X]

= E X2 =

49.4

(n + 2

2

E [X] = (n + 2) n)

h2 n

h2

h2 h2 =2 n n

Moment generating function

The moment generating function of a Gamma random variable X is de…ned for n any t < 2h : n=2 2h MX (t) = 1 t n Proof. Using the de…nition of moment generating function: MX (t)

=

E [exp (tX)] Z 1 = exp (tx) fX (x) dx =

Z

0

1

1

n=2

exp (tx)

(n=h) xn=2 2n=2 (n=2)

1

exp

n1 x dx h2

400

CHAPTER 49. GAMMA DISTRIBUTION

=

Z

(n=h) xn=2 2n=2 (n=2)

1

(n=h) xn=2 n=2 2 (n=2)

1

(1=h) nn=2 n=2 x 2n=2 (n=2)

0

=

Z

0

=

Z

0

=

n=2

1

n1 x + tx dx h2

1

exp

1n x h2

n=2

1 h

n=2

1 h

1

0

=

exp

n=2

(1=h) Z

1

1

1 h

n=2

1 h

exp

2t n

n x dx 2

1 h

2t n

n=2

2t n

2t n=2 n=2 n n xn=2 1 2n=2 (n=2)

(1=h)

2t n x dx n 2

exp

n x dx 2

n=2

2t n

where the integral equals 1 because it is the integral of the probability density 1 function of a Gamma random variable with parameters n and h1 2t . Thus: n MX (t)

(1=h)

=

(h)

=

1

1 h

n=2

n=2

2t n

1 h

n=2

=

2t n

n=2

n=2

2h t n

n Of course, the above integrals converge only if h1 2t n > 0, i.e. only if t < 2h . Therefore, the moment generating function of a Gamma random variable exists for n . all t < 2h

49.5

Characteristic function

The characteristic function of a Gamma random variable X is: 'X (t) =

1

2h it n

n=2

Proof. Using the de…nition of characteristic function:

= = = A

=

'X (t) E [exp (itX)] Z 1 exp (itx) fX (x) dx 1 Z 1 c exp (itx) xn=2 1 exp 0 ! Z 1 X 1 1 k c (itx) xn=2 k! 0 k=0

n1 x dx h2 1

exp

n1 x dx h2

49.5. CHARACTERISTIC FUNCTION Z 1 1 X 1 k (it) xk xn=2 k! 0 k=0 Z 1 1 X 1 k = c (it) xk+n=2 k! 0 = c

= c

k=0 1 X

k=0

Z

C

= c = c = =

1

exp

1n x dx 2h

exp

! 1 2k + n x dx 2 h 2k+n n

(n=h) xk+n=2 k+n=2 2 (k + n=2)

1

exp

1 X 1 k (it) 2k+n=2 (k + n=2) (n=h) k!

k=0 1 X

k n=2

k+n=2

1

k=0

D

1

1 k (it) 2k+n=2 (k + n=2) (n=h) k!

0

B

401

1 k (it) 2k+n=2 (k + n=2) (n=h) k!

k

! 1 2k + n x dx 2 h 2k+n n Z 1 n=2 fk (x) dx 0

k n=2

1

X 1 (n=h) k (it) 2k+n=2 (k + n=2) (n=h) n=2 2 (n=2) k=0 k! n=2

1

X 1 1 k (it) 2k (k + n=2) (n=h) (n=2) k!

k n=2

k

k=0

= =

1 X 1 k (it) k!

k=0 1 X

k=0

=

1+

1 k!

2h it n

1 X 1 k!

k=1

E

=

1

2h n

2h it n

k

k

2h it n n=2

(k + n=2) (n=2)

(k + n=2) (n=2) k kY1 j=0

n +j 2

where: in step A we have substituted exp (itx) with its Taylor series expansion; in step B we have have de…ned ! k+n=2 1 2k + n (n=h) k+n=2 1 x exp x fk (x) = k+n=2 2 h 2k+n 2 (k + n=2) n where fk (x) is the probability density function of a Gamma random variable with parameters 2k + n and h 2k+n n ; in step C we have used the fact that probability density functions integrate to 1; in step D we have used the de…nition of c; in step E we have recognized that 1+

1 X 1 k!

k=1

is the Taylor series expansion of 1

2h it n 2h n it

k kY1 j=0

n=2

.

n +j 2

402

CHAPTER 49. GAMMA DISTRIBUTION

49.6

Distribution function

The distribution function of a Gamma random variable is: (n=2; nx=2h) (n=2)

FX (x) = where the function (z; y) =

Z

y

sz

1

exp ( s) ds

1

is called lower incomplete Gamma function6 and is usually evaluated using specialized computer algorithms. Proof. This is proved as follows: Z x FX (x) = fX (t) dt 1 Z x n1 = ctn=2 1 exp t dt h2 1 Z nx=2h n=2 1 2h 2h A = c s exp ( s) ds n n 1 Z n=2 nx=2h 2h = c sn=2 1 exp ( s) ds n 1 n=2 Z nx=2h n=2 (n=h) 2h B = sn=2 1 exp ( s) ds 2n=2 (n=2) n 1 Z nx=2h 1 = sn=2 1 exp ( s) ds (n=2) 1 (n=2; nx=2h) = (n=2) where: in step A we have performed a change of variable (s = we have used the de…nition of c.

49.7

n 2h t);

in step B

More details

In the following subsections you can …nd more details about the Gamma distribution.

49.7.1

Relation to the Chi-square distribution

The Gamma distribution is a scaled Chi-square distribution. Proposition 258 If a variable X has a Gamma distribution with parameters n and h, then: h X= Z n where Z has a Chi-square distribution7 with n degrees of freedom. 6 See 7 See

p. 58. p. 387.

49.7. MORE DETAILS

403

Proof. This can be easily proved using the formula for the density of a function8 of an absolutely continuous variable: fX (x)

1

= fZ g

(x)

dg

1

(x) dx

n n x h h The density function of a Chi-square random variable with n degrees of freedom is = fZ

fZ (z) =

kz n=2 0

1

where k=

1 2z

exp

if x 2 [0; 1) otherwise

1 2n=2 (n=2)

Therefore: fX (x)

n n = fZ x h h ( n n=2 n=2 x k h = 0

1

1n 2 hx

exp

if x 2 [0; 1) otherwise

which is the density of a Gamma distribution with parameters n and h. Thus, the Chi-square distribution is a special case of the Gamma distribution, because, when h = n, we have: h n Z= Z=Z n n In other words, a Gamma distribution with parameters n and h = n is just a Chi square distribution with n degrees of freedom. X=

49.7.2

Multiplication by a constant

Multiplying a Gamma random variable by a strictly positive constant one obtains another Gamma random variable. Proposition 259 If X is a Gamma random variable with parameters n and h, then the random variable Y de…ned as Y = cX

(c 2 R++ )

has a Gamma distribution with parameters n and ch. Proof. This can be easily seen using the result from the previous subsection: h Z n where Z has a Chi-square distribution with n degrees of freedom. Therefore: X=

Y = cX = c

h Z n

=

ch Z n

In other words, Y is equal to a Chi-square random variable with n degrees of freedom, divided by n and multiplied by ch. Therefore, it has a Gamma distribution with parameters n and ch. 8 See p. 265. Note that X = g (Z) = strictly positive

h Z n

is a strictly increasing function of Z, since

h n

is

404

49.7.3

CHAPTER 49. GAMMA DISTRIBUTION

Relation to the normal distribution

In the lecture entitled Chi-square distribution (p. 387) we have explained that a Chi-square random variable Z with n 2 N degrees of freedom can be written as a sum of squares of n independent normal random variables W1 , . . . ,Wn having mean 0 and variance 1: Z = W12 + : : : + Wn2 In the previous subsections we have seen that a variable X having a Gamma distribution with parameters n and h can be written as X=

h Z n

where Z has a Chi-square distribution with n degrees of freedom. Putting these two things together, we obtain X

= =

h h Z= W12 + : : : + Wn2 n n !2 !2 r r h h Wn W1 + ::: + n n

= Y12 + : : : + Yn2 where we have de…ned Yi =

r

h Wi n

; i = 1; : : : ; n

But the variables Yi are normal random variables with mean 0 and variance nh . Therefore, a Gamma random variable with parameters n and h can be seen as a sum of squares of n independent normal random variables having mean 0 and variance h=n.

49.8

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let X1 and X2 be two independent Chi-square random variables having 3 and 5 degrees of freedom respectively. Consider the following random variables: Y1 Y2 Y3 What distribution do they have?

=

2X1 1 = X2 3 = 3X1 + 3X2

49.8. SOLVED EXERCISES

405

Solution Being multiples of Chi-square random variables, the variables Y1 , Y2 and Y3 all have a Gamma distribution. The random variable X1 has n = 3 degrees of freedom and the random variable Y1 can be written as h X1 n where h = 6. Therefore Y1 has a Gamma distribution with parameters n = 3 and h = 6. The random variable X2 has n = 5 degrees of freedom and the random variable Y2 can be written as h Y2 = X2 n where h = 5=3. Therefore Y2 has a Gamma distribution with parameters n = 5 and h = 5=3. The random variable X1 + X2 has a Chi-square distribution with n = 3 + 5 = 8 degrees of freedom, because X1 and X2 are independent9 , and the random variable Y3 can be written as Y1 =

h (X1 + X2 ) n where h = 24. Therefore Y3 has a Gamma distribution with parameters n = 8 and h = 24. Y3 =

Exercise 2 Let X be a random variable having a Gamma distribution with parameters n = 4 and h = 2. De…ne the following random variables: Y1 Y2 Y3

1 X 2 = 5X = 2X

=

What distribution do these variables have? Solution Multiplying a Gamma random variable by a strictly positive constant one still obtains a Gamma random variable. In particular, the random variable Y1 is a Gamma random variable with parameters n = 4 and 1 =1 2 The random variable Y2 is a Gamma random variable with parameters n = 4 and h=2

h = 2 5 = 10 The random variable Y3 is a Gamma random variable with parameters n = 4 and h=2 2=4 The random variable Y3 is also a Chi-square random variable with 4 degrees of freedom (remember that a Gamma random variable with parameters n and h is also a Chi-square random variable when n = h). 9 See

the lecture entitled Chi-square distribution (p. 387).

406

CHAPTER 49. GAMMA DISTRIBUTION

Exercise 3 Let X1 , X2 and X3 be mutually independent normal random variables having mean = 0 and variance 2 = 3. Consider the random variable X = 2X12 + 2X22 + 2X32 What distribution does X have? Solution The random variable X can be written as X

p

2

p

=

2

=

18 2 Z1 + Z22 + Z32 3

3Z1

+

2

3Z2

+

p

2

3Z3

where Z1 , Z2 and Z3 are mutually independent standard normal random variables. The sum Z12 + Z22 + Z32 has a Chi-square distribution with 3 degrees of freedom. Therefore X has a Gamma distribution with parameters n = 3 and h = 18.

Chapter 50

Student’s t distribution A random variable X has a standard Student’s t distribution with n degrees of freedom if it can be written as a ratio: Y X=p Z between a standard normal random variable1 Y and the square root of a Gamma random variable2 Z with parameters n and h = 1, independent of Y . Equivalently, we can write X=p

Y 2 =n n

where 2n is a Chi-square random variable3 with n degrees of freedom (dividing by n a Chi-square random variable with n degrees of freedom, one obtains a Gamma random variable with parameters n and h = 1 - see the lecture entilled Gamma distribution - p. 397). A random variable X has a non-standard Student’s t distribution with mean , scale 2 and n degrees of freedom if it can be written as a linear transformation of a standard Student’s t random variable: Y X= + p Z where Y and Z are de…ned as before. The importance of Student’s t distribution stems from the fact that ratios and linearly transformed ratios of this kind are encountered very often in statistics (see e.g. the lecture entitled Hypothesis tests about the mean - p. 619). We …rst introduce the standard Student’s t distribution. We then deal with the non-standard Student’s t distribution.

50.1

The standard Student’s t distribution

The standard Student’s t distribution is a special case of Student’s t distribution. By …rst explaining this special case, the exposition of the more general case is greatly facilitated. 1 See

p. 375. p. 397. 3 See p. 387. 2 See

407

408

CHAPTER 50. STUDENT’S T DISTRIBUTION

50.1.1

De…nition

The standard Student’s t distribution is characterized as follows: De…nition 260 Let X be an absolutely continuous random variable. Let its support be the whole set of real numbers: RX = R Let n 2 R++ . We say that X has a standard Student’s t distribution with n degrees of freedom if its probability density function4 is fX (x) = c 1 + where c is a constant:

(n+1)=2

x2 n

1 1 c= p n n B 2 ; 12

and B () is the Beta function5 . Usually the number of degrees of freedom is integer (n 2 N), but it can also be real (n 2 R++ ).

50.1.2

Relation to the normal and Gamma distributions

A standard Student’s t random variable can be written as a normal random variable whose variance is equal to the reciprocal of a Gamma random variable, as shown by the following proposition: Proposition 261 (Integral representation) The probability density function of X can be written as Z 1 fX (x) = fXjZ=z (x) fZ (z) dz 0

where: 1. fXjZ=z (x) is the probability density function of a normal distribution with mean 0 and variance 2 = z1 : fXjZ=z (x)

=

2

=

(2 )

2

1=2

1=2

exp

z 1=2 exp

1 x2 2 2 1 2 zx 2

2. fZ (z) is the probability density function of a Gamma random variable with parameters n and h = 1: fZ (z) = cz n=2 where c= 4 See 5 See

p. 107. p. 59.

1

exp

nn=2 (n=2)

2n=2

1 n z 2

50.1. THE STANDARD STUDENT’S T DISTRIBUTION

409

Proof. We need to prove that fX (x) =

Z

1

fXjZ=z (x) fZ (z) dz

0

where 1=2

fXjZ=z (x) = (2 ) and

fZ (z) = cz n=2

1

1 2 zx 2

z 1=2 exp 1 n z 2

exp

Let us start from the integrand function: fXjZ=z (x) fZ (z) 1=2

=

(2 )

=

(2 )

1=2

(2 )

1=2

=

=

1 2 zx cz n=2 2

z 1=2 exp

1

exp

1 n z 2

1 z 2 1 0 n + 1 1 1=2 zA (2 ) cz (n+1)=2 1 exp @ n+1 2 x2 +n 1 0 n + 1 1 1 1=2 zA (2 ) c c2 z (n+1)=2 1 exp @ n+1 2 c2 2 cz (n+1)=2

1

x2 + n

exp

x +n

=

1 c fZjX=x (z) c2

where (n + 1) = c2

= =

n+1 x2 +n

(n+1)=2

2(n+1)=2 ((n + 1) =2) x2 + n 2n=2 21=2

(n+1)=2 n 2

+

1 2

and fZjX=x (z) is the probability density function of a random variable having a Gamma distribution with parameters n + 1 and xn+1 2 +n . Therefore, Z = A

=

B

= =

1

Z0 1

fXjZ=z (x) fZ (z) dz

1 c fZjX=x (z) dz c 2 0 Z 1 1=2 1 (2 ) c fZjX=x (z) dz c2 0 1=2 1 (2 ) c c2 nn=2 1=2 (2 ) 2n=2 21=2 n=2 2 (n=2) (2 )

1=2

n 1 + 2 2

x2 + n

(n+1)=2

410

CHAPTER 50. STUDENT’S T DISTRIBUTION

=

(2 )

1=2

=

(2 )

1=2

nn=2 1=2 2 (n=2) nn=2 (n=2)

n=2 1=2

n =

(2 )

=

2

+ 12 (n=2)

= n

1=2

C

= n

1=2

D

= n

p

n 2

1+

+ 12 (1=2) (n=2)

1 B (n=2; 1=2) = fX (x) 1=2

1=2

n

+ 12 (n=2)

n 2

(n+1)=2

1 2 x n

n 1 + 2 2

1=2

1 2

n 2

+ 12 (n=2)

n 1+

(n+1)=2

1 2 x n

1+

1=2

1 n 2

1=2

1 2

n 2

1=2

n 1 + 2 2

1+

1+

1 2 x n

(n+1)=2

(n+1)=2

1 2 x n (n+1)=2

1 2 x n 1+

1+

(n+1)=2

1 2 x n

(n+1)=2

1 2 x n

where: in step A we have used the fact that c and c2 do not depend on z; in step B we have used the fact that the integral of a density function over its support p is equal to 1; in step C we have used the fact that = (1=2); in step D we have used the de…nition of Beta function. Since X is a zero-mean normal random variable with variance 1=z, conditional on Z = z, then we can also think of it as a ratio Y X=p Z where Y has a standard normal distribution, Z has a Gamma distribution and Y and Z are independent.

50.1.3

Expected value

The expected value of a standard Student’s t random variable X is well-de…ned only for n > 1 and it is equal to E [X] = 0 Proof. It follows from the fact that the density function is symmetric around 0: Z 1 E [X] = xfX (x) dx = A

=

Z

1 0

xfX (x) dx + 1 Z 0 1

Z

1

0

( t) fX ( t) dt +

xfX (x) dx Z

0

1

xfX (x) dx

50.1. THE STANDARD STUDENT’S T DISTRIBUTION Z

=

0

tfX ( t) dt +

1

B

=

1

0

Z

1

tfX ( t) dt +

Z0 1

= C

Z

= 0

xfX (x) dx

Z

1

xfX (x) dx

0

Z

xfX ( x) dx + 0 Z 1 Z xfX (x) dx + 0

=

411

1

0 1

xfX (x) dx

xfX (x) dx

0

where: in step A we have performed a change of variable in the …rst integral (t = x); in step B we have exchanged the bounds of integration; in step C we have used the fact that fX ( x) = fX (x) The above integrals are …nite (and so the expected value is well-de…ned) only if n > 1, because Z 1 xfX (x) dx 0

=

Z

lim

u!1

u

0

= c lim

u!1

"

cn

=

n

1

1 2 (n+1)

x2 xc 1 + n n

1 2 (n

x2 1+ n

n (

dx

1

0

1 2 (n

2

lim

u!1

# 1) u

u 1+ n

1)

1

)

and the above limit is …nite only if n > 1.

50.1.4

Variance

The variance of a standard Student’s t random variable X is well-de…ned only for n > 2 and it is equal to n Var [X] = n 2 Proof. It can be derived thanks to the usual formula for computing the variance6 and to the integral representation of the Beta function: Z 1 2 E X = x2 fX (x) dx = A B

Z

1 0

x2 fX (x) dx + 1 Z 0

= E X2

0

1

x2 fX (x) dx

Z 1 t2 fX ( t) dt + x2 fX (x) dx 1 0 Z 1 Z 1 2 = t fX ( t) dt + x2 fX (x) dx =

0

6 Var [X]

Z

E [X]2 . See p. 156.

0

412

CHAPTER 50. STUDENT’S T DISTRIBUTION C

=

Z

1

t2 fX (t) dt +

0

=

2

Z

1

x2 fX (x) dx

0

Z

1

x2 fX (x) dx

0

=

Z

2c

1

x

0

D

=

Z

2c

1

0

3=2

= cn

1

t3=2

1

dx

n=2 1=2

nt (1 + t)

Z

1 2 (n+1)

x2 1+ n

2

p

n 1 p dt 2 t

3=2 (n=2 1)

(1 + t)

dt

0

E

= cn3=2 B

F

=

G

= n

1 1 p n B n2 ; 12

= n H

= n =

3 n ; 2 2

n n

1 1 n + 1; 2 2

n3=2 B

n 1 2 + 2 n 1 2 2 1 + 1 2 n 2 1 1 n 2 2 2 n 2

1 2

+1 n 2

n 2 1 2

+

n 2 1 2

1 1

1

2 n 2 1 2

2

where: in step A we have performed a change of variable in the …rst integral (t = x); in step B we have exchanged the bounds of integration; in step C we have used the fact that fX ( t) = fX (t) 2

in step D we have performed a change of variable (t = xn ); in step E we have used the integral representation of the Beta function; in step F we have used the de…nition of c; in step G we have used the de…nition of Beta function; in step H we have used the fact that (z) =

(z

1) (z

1)

Finally: 2

E [X] = 0 and: Var [X] = E X 2

n

2

E [X] =

n

2

From the above derivation, it should be clear that the variance is well-de…ned only when n > 2. Otherwise, if n 2, the above improper integrals do not converge (and the Beta function is not well-de…ned).

50.1. THE STANDARD STUDENT’S T DISTRIBUTION

50.1.5

413

Higher moments

The k-th moment of a standard Student’s t random variable X is well-de…ned only for k < n and it is equal to X

k+1 2

nk=2 0

(k) =

n k 2

n 2

1 2

if k is even if k is odd

Proof. Using the de…nition of moment: X

E Xk Z 1 = xk fX (x) dx

(k)

=

= A

=

B

=

C

= =

Z

1 0

xk fX (x) dx + 1 Z 0

Z

1

xk fX (x) dx

0

Z 1 k ( t) fX ( t) dt + xk fX (x) dx 1 0 Z 1 Z 1 k k ( 1) t fX ( t) dt + xk fX (x) dx 0 0 Z 1 Z 1 k k ( 1) t fX (t) dt + xk fX (x) dx 0 0 Z 1 k 1 + ( 1) xk fX (x) dx 0

where: in step A we have performed a change of variable in the …rst integral (t = x); in step B we have exchanged the bounds of integration; in step C we have used the fact that fX ( t) = fX (t) Therefore, to compute the k-th moment and to verify whether it exists and is …nite, we need to study the following integral: Z 1 xk fX (x) dx 0

= c

Z

1

x

0

A

= c = =

B

=

C

=

D

=

Z

1

1 2 (n+1)

x2 1+ n

k

k=2

(nt)

(1 + t)

dx p

n=2 1=2

n 1 p dt 2 t

0 Z 1 (k+1)=2 1 k=2 1=2 n=2 1=2 c n t (1 + t) dt 2 0 Z 1 1 (k+1)=2 (n c n(k+1)=2 t(k+1)=2 1 (1 + t) 2 0 1 (k+1)=2 k+1 n k c n B ; 2 2 2 1 1 k+1 n k 1 p n(k+1)=2 B ; 2 n B n2 ; 12 2 2

1 k=2 n 2

n 2

1 2

n+1 2

1

k)=2

dt

414

CHAPTER 50. STUDENT’S T DISTRIBUTION k+1 2 =

1 k=2 n 2

n

k

n+1 2

2

k+1 2 n 2

n k 2 1 2

where: in step A we have performed a change of variable in the …rst integral 2 (t = xn ); in step B we have used the integral representation of the Beta function; in step C we have used the de…nition of c; in step D we have used the de…nition of Beta function. From the above derivation, it should be clear that the k-th moment is well-de…ned only when n > k. Otherwise, if n k, the above improper integrals do not converge (the integrals involve the Beta function, which is wellde…ned and converges only when its arguments are strictly positive - in this case only if n 2 k > 0). Therefore, the k-th moment of X is: X

(k)

=

k

1 + ( 1)

Z

1

xk fX (x) dx

0

= =

50.1.6

k

1 + ( 1) nk=2 0

k+1 2 n 2

1 k=2 n 2

k+1 2

n k 2

n k 2 1 2 n 2

1 2

if k is even if k is odd

Moment generating function

A standard Student’s t random variable X does not possess a moment generating function. Proof. When a random variable X possesses a moment generating function, then the k-th moment of X exists and is …nite for any k 2 N. But we have proved above that the k-th moment of X exists only for k < n. Therefore, X can not have a moment generating function.

50.1.7

Characteristic function

There is no simple expression for the characteristic function of the standard Student’s t distribution. It can be expressed in terms of a Modi…ed Bessel function of the second kind (a solution of a certain di¤erential equation, called modi…ed Bessel’s di¤erential equation). The interested reader can consult Sutradhar7 (1986).

50.1.8

Distribution function

There is no simple formula for the distribution function FX (x) of a standard Student’s t random variable X, because the integral Z x FX (x) = fX (t) dt 1

B. C. (1986) On the characteristic function of multivariate Student t-distribution, Canadian Journal of Statistics, 14, 329-337.

50.2. THE STUDENT’S T DISTRIBUTION IN GENERAL

415

cannot be expressed in terms of elementary functions. Therefore, it is usually necessary to resort to computer algorithms to compute the values of FX (x). For example, the MATLAB command: tcdf(x,n) returns the value of the distribution function at the point x when the degrees of freedom parameter is equal to n.

50.2

The Student’s t distribution in general

While in the previous section we restricted our attention to the Student’s t distribution with zero mean and unit scale, we now deal with the general case.

50.2.1

De…nition

The Student’s t distribution is characterized as follows: De…nition 262 Let X be an absolutely continuous random variable. Let its support be the whole set of real numbers: RX = R Let 2 R, 2 2 R++ and n 2 R++ . We say that X has a Student’s t distribution with mean , scale 2 and n degrees of freedom if its probability density function is: ! 12 (n+1) 2 (x ) 1 1+ fX (x) = c n 2 where c is a constant:

1 1 c= p n n B 2 ; 12

and B () is the Beta function. We indicate that X has a t distribution with mean , scale of freedom by: X T ; 2; n

50.2.2

2

and n degrees

Relation to the standard Student’s t distribution

A random variable X which has a t distribution with mean , scale 2 and n degrees of freedom is just a linear function of a standard Student’s t random variable8 : Proposition 263 If X

T

;

2

; n , then: X=

+ Z

where Z is a random variable having a standard t distribution. 8 See

p. 407.

416

CHAPTER 50. STUDENT’S T DISTRIBUTION

Proof. This can be easily proved using the formula for the density of a function9 of an absolutely continuous variable: fX (x)

1

= fZ g x

= fZ = c

dg

(x)

1

1

(x) dx

1

1+

2

(x

) n

2

!

1 2 (n+1)

Obviously, then, a standard t distribution is just a normal distribution with mean = 0 and scale 2 = 1.

50.2.3

Expected value

The expected value of a Student’s t random variable X is well-de…ned only for n > 1 and it is equal to E [X] = Proof. It is an immediate consequence of the fact that X = + Z (where Z has a standard t distribution) and the linearity of the expected value10 : E [X] = E [ + Z] =

+ E [Z] =

+

0=

As we have seen above, E [Z] is well-de…ned only for n > 1 and, as a consequence, also E [X] is well-de…ned only for n > 1.

50.2.4

Variance

The variance of a Student’s t random variable X is well-de…ned only for n > 2 and it is equal to n 2 Var [X] = n 2 Proof. It can be derived using the formula for the variance of linear transformations11 on X = + Z (where Z has a standard t distribution): Var [X] = Var [ + Z] =

2

Var [Z] =

n

2

n

2

As we have seen above, Var [Z] is well-de…ned only for n > 2 and, as a consequence, also Var [X] is well-de…ned only for n > 2.

50.2.5

Moment generating function

A Student’s t random variable X does not possess a moment generating function. Proof. It is a consequence of the fact that X = + Z (where Z has a standard t distribution) and of the fact that a standard Student’s t random variable does not possess a moment generating function (see above). 9 See p. 265. Note that X = g (Z) = strictly positive. 1 0 See p. 134. 1 1 See p. 158.

+ Z is a strictly increasing function of Z, since

is

50.3. MORE DETAILS

50.2.6

417

Characteristic function

There is no simple expression for the characteristic function of the Student’s t distribution (see the comments above, for the standard case).

50.2.7

Distribution function

As for the standard t distribution (see above), there is no simple formula for the distribution function FX (x) of a Student’s t random variable X and it is usually necessary to resort to computer algorithms to compute the values of FX (x). Most computer programs provide only routines for the computation of the standard t distribution function (denote it by FZ (z)). In these cases we need to make a conversion, as follows: FX (x) = P (X x) = P( + Z x = P Z

x)

x

= FZ For example, the MATLAB command:

tcdf((x-mu)/sigma,n) returns the value at the point x of the distribution function of a Student’s t random variable with mean mu, scale sigma and n degrees of freedom.

50.3

More details

50.3.1

Convergence to the normal distribution

A Student’s t distribution with mean , scale 2 and n degrees of freedom converges in distribution12 to a normal distribution with mean and variance 2 when the number of degrees of freedom n becomes large (converges to in…nity). Proof. As explained before, if Xn has a t distribution, it can be written as: + p

Xn =

Y 2 =n n

where Y is a standard normal random variable, and 2n is a Chi-square random variable with n degrees of freedom, independent of Y . Moreover, as explained in the lecture entitled Chi-square distribution (p. 387), 2n can be written as a sum of squares of n independent standard normal random variables Z1 ; : : : ; Zn : 2 n

=

n X

Zi2

i=1

When n tends to in…nity, the ratio 2 n

n 1 2 See

p. 527.

n

=

1X 2 Z n i=1 i

418

CHAPTER 50. STUDENT’S T DISTRIBUTION

converges in probability to E Zi2 = 1, by the Law of Large Numbers13 . As a consequence, by Slutski’s theorem14 , Xn converges in distribution to X=

+ Y

which is a normal random variable with mean

50.3.2

and variance

2

.

Non-central t distribution

As discussed above, if Y has a standard normal distribution, Z has a Gamma distribution with parameters n and h = 1 and Y and Z are independent, then the random variable X de…ned as Y X=p Z has a standard Student’s t distribution with n degrees of freedom. Given the same assumptions on Y and Z, de…ne a random variable W as follows: Y +c W = p Z where c 2 R is a constant. W is said to have a non-central standard Student’s t distribution with n degrees of freedom and non-centrality parameter c. We do not discuss the details of this distribution here, but be aware that this distribution is sometimes used in statistical theory (also in elementary problems) and that routines to compute its moments and its distribution function can be found in most statistical software packages.

50.4

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let X1 be a normal random variable with mean = 0 and variance 2 = 4. Let X2 be a Gamma random variable with parameters n = 10 and h = 3, independent of X1 . Find the distribution of the ratio X1 X=p X2 Solution We can write:

2 Y X1 X=p =p p X2 3 Z where Y = X1 =2 has a standard normal distribution and Z = X2 =3 has a Gamma distribution with parameters n = 10 and h = 1. Therefore, the ratio Y p Z 1 3 See 1 4 See

p. 535. p. 557.

50.4. SOLVED EXERCISES

419

has a standard Student’s t distribution with n = 10 degrees of freedom and X has a Student’s t distribution with mean = 0, scale 2 = 4=3 and n = 10 degrees of freedom.

Exercise 2 Let X1 be a normal random variable with mean = 3 and variance 2 = 1. Let X2 be a Gamma random variable with parameters n = 15 and h = 2, independent of X1 . Find the distribution of the random variable r 2 X= (X1 3) X2 Solution We can write: X=

r

Y 3) = p Z

2 (X1 X2

where Y = X1 3 has a standard normal distribution and Z = X2 =2 has a Gamma distribution with parameters n = 15 and h = 1. Therefore, the ratio Y p Z has a standard Stutent’s t distribution with n = 15 degrees of freedom.

Exercise 3 Let X be a Student’s t random variable with mean degrees of freedom. Compute: P (0

X

= 1, scale

2

= 4 and n = 6

1)

Solution First of all, we need to write the probability in terms of the distribution function of X:

=

P (0 P (X

A

=

P (X

B

= FX (1)

X 1) 1) P (X < 0) 1)

P (X

0)

FX (0)

where: in step A we have used the fact that any speci…c value of X has probability zero; in step B we have used the de…nition of distribution function. Then, we express the distribution function FX (x) in terms of the distribution function of a standard Student’s t random variable Z with n = 6 degrees of freedom: FX (x) = FZ

x

1 2

420

CHAPTER 50. STUDENT’S T DISTRIBUTION

so that: P (0

X

1) = FX (1)

FX (0) = FZ (0)

FZ

1 2

= 0:1826

where the di¤erence FZ (0) FZ ( 1=2) can be computed with a computer algorithm, for example using the MATLAB command tcdf(0,6)-tcdf(-1/2,6)

Chapter 51

F distribution A random variable X has an F distribution if it can be written as a ratio: X=

Y1 =n1 Y2 =n2

between a Chi-square random variable1 Y1 with n1 degrees of freedom and a Chisquare random variable Y2 , independent of Y1 , with n2 degrees of freedom (where each of the two random variables has been divided by its degrees of freedom). The importance of the F distribution stems from the fact that ratios of this kind are encountered very often in statistics.

51.1

De…nition

The F distribution is characterized as follows: De…nition 264 Let X be an absolutely continuous random variable. Let its support be the set of positive real numbers: RX = [0; 1) Let n1 ; n2 2 N. We say that X has an F distribution with n1 and n2 degrees of freedom if its probability density function2 is fX (x) = cxn1 =2

1

1+

n1 x n2

where c is a constant: c=

n1 n2

n1 =2

1 B

and B () is the Beta function3 . 1 See

p. 387. p. 107. 3 See p. 59. 2 See

421

n1 n2 2 ; 2

(n1 +n2 )=2

422

51.2

CHAPTER 51. F DISTRIBUTION

Relation to the Gamma distribution

An F random variable can be written as a Gamma random variable4 with parameters n1 and h1 , where the parameter h1 is equal to the reciprocal of another Gamma random variable, independent of the …rst one, with parameters n2 and h2 = 1: Proposition 265 (Integral representation) The probability density function of X can be written as Z 1 fX (x) = fXjZ=z (x) fZ (z) dz 0

where: 1. fXjZ=z (x) is the probability density function of a Gamma random variable with parameters n1 and h1 = z1 : n =2

fXjZ=z (x)

=

(n1 =h1 ) 1 xn1 =2 2n1 =2 (n1 =2)

=

(n1 z) 1 xn1 =2 n 2 1 =2 (n1 =2)

1

exp

n1 1 x h1 2

1

exp

1 n1 z x 2

n =2

2. fZ (z) is the probability density function of a Gamma random variable with parameters n2 and h2 = 1: n2 =2

fZ (z) =

(n2 ) 2n2 =2

(n2 =2)

z n2 =2

1

1 n2 z 2

exp

Proof. We need to prove that fX (x) =

Z

1

fXjZ=z (x) fZ (z) dz

0

where

n =2

fXjZ=z (x) =

(n1 z) 1 xn1 =2 n 2 1 =2 (n1 =2)

and

1

exp

n =2

fZ (z) =

n2 2 z n2 =2 2n2 =2 (n2 =2)

1

exp

Let us start from the integrand function: fXjZ=z (x) fZ (z) n =2

=

(n1 z) 1 xn1 =2 n 2 1 =2 (n1 =2)

1

exp

1 n1 z x 2

exp

1 n2 z 2

n =2

n2 2 z n2 =2 2n2 =2 (n2 =2) 4 See

p. 397.

1

1 n1 z x 2 1 n2 z 2

51.2. RELATION TO THE GAMMA DISTRIBUTION n =2

n =2

=

423

n2 2 (n1 z) 1 2n1 =2 (n1 =2) 2n2 =2 (n2 =2) xn1 =2

1 n2 =2 1

z

1 (n1 x + n2 ) z 2

exp

n =2 n =2

=

n1 1 n 2 2 2(n1 +n2 )=2 (n1 =2) (n2 =2) xn1 =2

1 (n1 +n2 )=2 1

z

1 (n1 x + n2 ) z 2

exp

n =2 n =2

=

n1 1 n2 2 xn1 =2 2(n1 +n2 )=2 (n1 =2) (n2 =2) n1 + n2 n1 x + n2

exp

1

n1 1 n2 2 xn1 =2 2(n1 +n2 )=2 (n1 =2) (n2 =2)

where

z

1 (n1 + n2 ) z 2

n =2 n =2

=

1 (n1 +n2 )=2 1

11

c

!

fZjX=x (z)

(n +n )=2

c=

(n1 x + n2 ) 1 2 (n 2 1 +n2 )=2 ((n1 + n2 ) =2)

and fZjX=x (z) is the probability density function of a random variable having a Gamma distribution with parameters n1 + n 2 and

n1 + n 2 n1 x + n2

Therefore: Z

1

fXjZ=z (x) fZ (z) dz

0

=

Z

n =2 n =2

1

n1 1 n2 2 1 xn1 =2 1 fZjX=x (z) dz (n +n )=2 1 2 c 2 (n1 =2) (n2 =2) 0 Z 1 n1 =2 n2 =2 n1 n2 1 xn1 =2 1 fZjX=x (z) dz c 0 2(n1 +n2 )=2 (n1 =2) (n2 =2)

A

=

B

=

n1 1 n2 2 xn1 =2 2(n1 +n2 )=2 (n1 =2) (n2 =2)

11

=

n =2 n =2 n1 1 n2 2 2(n1 +n2 )=2 (n1 =2)

(n1 +n2 )=2 12

n =2 n =2

= =

(n2 =2)

xn1 =2

c

((n1 + n2 ) =2) n1 =2 n2 =2 n1 =2 n n2 x (n1 =2) (n2 =2) 1 (n1 =2) (n2 =2) ((n1 + n2 ) =2) 1+

n1 x n2

((n1 + n2 ) =2) (n1 +n2 )=2

(n1 x + n2 ) 1

(n1 x + n2 )

(n1 +n2 )=2

1

(n1 +n2 )=2

n =2 n =2

n1 1 n2 2 xn1 =2

1

n2

(n1 +n2 )=2

424

CHAPTER 51. F DISTRIBUTION

C

(n1 =2) (n2 =2) ((n1 + n2 ) =2)

1

=

(n1 =2) (n2 =2) ((n1 + n2 ) =2)

1

=

1 B (n1 =2; n2 =2) = fX (x) =

n1 n2

n =2

n1 1 n 2 n1 n2

n1 =2 n1 =2 1

x

1+

n1 =2

xn1 =2

n1 =2

xn1 =2

1

1+

1

n1 x n2

1+

(n1 +n2 )=2

n1 x n2

n1 x n2

(n1 +n2 )=2

(n1 +n2 )=2

where: in step A we have used the fact that c does not depend on z; in step B we have used the fact that the integral of a density function over its support is equal to 1; in step C we have used the de…nition of Beta function.

51.3

Relation to the Chi-square distribution

In the introduction, we have stated (without a proof) that a random variable X has an F distribution with n1 and n2 degrees of freedom if it can be written as a ratio: Y1 =n1 X= Y2 =n2 where: 1. Y1 is a Chi-square random variable with n1 degrees of freedom; 2. Y2 is a Chi-square random variable, independent of Y1 , with n2 degrees of freedom. The statement can be proved as follows. Proof. It is a consequence of Proposition 265 above: X can be thought of as a Gamma random variable with parameters n1 and h1 , where the parameter h1 is equal to the reciprocal of another Gamma random variable Z, independent of the …rst one, with parameters n2 and h2 = 1. The equivalence can be proved as follows. Since a Gamma random variable with parameters n1 and h1 is just the product between the ratio h1 =n1 and a Chi-square random variable with n1 degrees of freedom (see the lecture entitled Gamma distribution - p. 397), we can write: X=

h1 Y1 n1

where Y1 is a Chi-square random variable with n1 degrees of freedom. Now, we know that h1 is equal to the reciprocal of another Gamma random variable Z, independent of Y1 , with parameters n2 and h2 = 1. Therefore: X=

Y1 =n1 Z

But a Gamma random variable with parameters n2 and h2 = 1 is just the product between the ratio 1=n2 and a Chi-square random variable with n2 degrees of freedom (see the lecture entitled Gamma distribution - p. 397). Therefore, we can write Y1 =n1 X= Y2 =n2

51.4. EXPECTED VALUE

51.4

425

Expected value

The expected value of an F random variable X is well-de…ned only for n2 > 2 and it is equal to n2 E [X] = n2 2 Proof. It can be derived thanks to the integral representation of the Beta function: Z 1 E [X] = xfX (x) dx =

Z

1

1

xcxn1 =2

1

1+

0

= c

Z

1

xn1 =2 1 +

0

A

= c

Z

n2 t n1

n2 n1

n1 =2+1

n2 n1

n1 =2+1

n2 n1

n1 =2+1

0

= c = c B

= c

C

= =

D

= = =

E

= =

n1 n2

n1 x n2

n1 =2

1

(1 + t) Z

1

(n1 +n2 )=2

n1 x n2

dx

(n1 +n2 )=2

dx (n1 +n2 )=2

tn1 =2 (1 + t)

n2 dt n1

n1 =2 n2 =2

dt

0

Z

1

t(n1 =2+1)

1

(1 + t)

(n1 =2+1) (n2 =2 1)

dt

0

B n1 =2

n1 n2 + 1; 2 2

n2 n1 B n21 ; n22 n2 1 n1 n2 B + 1; n1 B n21 ; n22 2 2 1

1

n1 =2+1

B

n1 n2 + 1; 2 2

1

1

n2 (n1 =2 + n2 =2) (n1 =2 + 1) (n2 =2 1) n1 (n1 =2) (n2 =2) (n1 =2 + 1 + n2 =2 1) n2 (n1 =2 + n2 =2) (n1 =2 + 1) (n2 =2 1) n1 (n1 =2) (n2 =2) (n1 =2 + n2 =2) n2 (n1 =2 + 1) (n2 =2 1) n1 (n1 =2) (n2 =2) n2 1 (n1 =2) n1 n2 =2 1 n2 n2 2

where: in step A we have performed a change of variable (t =

n1 n2 x);

in step B

we have used the integral representation of the Beta function; in step C we have used the de…nition of c; in step D we have used the de…nition of Beta function; in step E we have used the following property of the Gamma function: (z) =

(z

1) (z

1)

It is also clear that the expected value is well-de…ned only when n2 > 2: when n2 2, the above improper integrals do not converge (both arguments of the Beta

426

CHAPTER 51. F DISTRIBUTION

function must be strictly positive).

51.5

Variance

The variance of an F random variable X is well-de…ned only for n2 > 4 and it is equal to 2n22 (n1 + n2 2) Var [X] = 2 n1 (n2 2) (n2 4) Proof. It can be derived thanks to the usual formula for computing the variance5 and to the integral representation of the Beta function: Z 1 E X2 = x2 fX (x) dx =

Z

1

1

x2 cxn1 =2

1

1+

0

= c

Z

1

xn1 =2+1 1 +

0

A

= c

Z

= c = c

(n1 +n2 )=2

n1 x n2

dx

n2 t n1

n2 n1

n1 =2+2

n2 n1

n1 =2+2

n2 n1

n1 =2+2

(n1 +n2 )=2

(1 + t) Z

1

tn1 =2+1 (1 + t)

n2 dt n1

n1 =2 n2 =2

dt

0

Z

1

t(n1 =2+2)

1

(1 + t)

(n1 =2+2) (n2 =2 2)

0

n1 n2 + 2; 2 2

B

= c

C

n1 n2

n1 =2

=

n2 n1

2

=

n2 n1

2

=

(n1 =2 + n2 =2) (n1 =2) (n2 =2)

(n1 =2 + 2) (n2 =2 2) (n1 =2 + 2 + n2 =2 2)

n2 n1

2

=

(n1 =2 + n2 =2) (n1 =2) (n2 =2)

(n1 =2 + 2) (n2 =2 (n1 =2 + n2 =2)

n2 n1

2

=

(n1 =2 + 2) (n1 =2)

D

E

B 1 B

n1 n2 2 ; 2

1 B

n1 n2 2 ; 2

2

= = =

5 Var [X]

dx

n1 =2+1

1

0

(n1 +n2 )=2

n1 x n2

= E X2

B

n2 n1

n1 =2+2

B

n1 n2 + 2; 2 2

n1 n2 + 2; 2 2

2

(n2 =2 2) (n2 =2)

n2 (n1 =2 + 1) (n1 =2) n1 (n2 =2 2 n2 (n1 + 2) n1 n21 (n2 2) (n2 4) n22 (n1 + 2) n1 (n2 2) (n2 4) E [X]2 . See p. 156.

2

1 1) (n2 =2

2)

2)

2

dt

51.6. HIGHER MOMENTS

427

where: in step A we have performed a change of variable (t =

n1 n2 x);

in step B

we have used the integral representation of the Beta function; in step C we have used the de…nition of c; in step D we have used the de…nition of Beta function; in step E we have used the following property of the Gamma function: (z) =

(z

1) (z

1)

Finally: 2

n2

2

E [X] =

n2

2

and: 2

= E X2

Var [X]

E [X]

n22 (n1 + 2) n1 (n2 2) (n2 4)

= =

n22 ((n1 + 2) (n2

=

n22

n22 2

(n2 2) n1 (n2 4))

2) 2

n1 (n2 2) (n2 4) (n1 n2 2n1 + 2n2 4 n1 n2 + 4n1 ) 2

n1 (n2 2) (n2 4) n22 (2n1 + 2n2 4) 2n22 (n1 + n2 2) = 2 2 n1 (n2 2) (n2 4) n1 (n2 2) (n2 4)

=

It is also clear that the expected value is well-de…ned only when n2 > 4: when n2 4, the above improper integrals do not converge (both arguments of the Beta function must be strictly positive).

51.6

Higher moments

The k-th moment of an F random variable X is well-de…ned only for n2 > 2k and it is equal to k (n1 =2 + k) (n2 =2 k) n2 X (k) = n1 (n1 =2) (n2 =2) Proof. Using the de…nition of moment: Z 1 k = xk fX (x) dx X (k) = E X =

Z

1

1

xk cxn1 =2

1+

n1 x n2

(n1 +n2 )=2

1

1+

n1 x n2

(n1 +n2 )=2

1

0

= c

Z

1

xn1 =2+k

0

A

= c

Z

= c

dx

n1 =2+k 1

1

n2 t n1

n2 n1

n1 =2+k

0

dx

(1 + t) Z

0

1

tn1 =2+k

1

(n1 +n2 )=2

(1 + t)

n2 dt n1

n1 =2 n2 =2

dt

428

CHAPTER 51. F DISTRIBUTION

= c

n2 n1

n1 =2+k

n2 n1

n1 =2+k

Z

1

t(n1 =2+k)

1

(1 + t)

(n1 =2+k) (n2 =2 k)

n1 n2 + k; 2 2

B

= c

C

n1 n2

n1 =2

=

n2 n1

k

=

n2 n1

k

=

(n1 =2 + n2 =2) (n1 =2) (n2 =2)

(n1 =2 + k) (n2 =2 k) (n1 =2 + k + n2 =2 k)

n2 n1

k

=

(n1 =2 + n2 =2) (n1 =2) (n2 =2)

(n1 =2 + k) (n2 =2 (n1 =2 + n2 =2)

n2 n1

k

=

(n1 =2 + k) (n1 =2)

D

dt

0

B 1

n1 n2 2 ; 2

B 1 B

n1 n2 2 ; 2

B

n2 n1

k

n1 =2+k

B

n1 n2 + k; 2 2

n1 n2 + k; 2 2

k

k

k)

(n2 =2 k) (n2 =2)

where: in step A we have performed a change of variable (t =

n1 n2 x);

in step B

we have used the integral representation of the Beta function; in step C we have used the de…nition of c; in step D we have used the de…nition of Beta function. It is also clear that the expected value is well-de…ned only when n2 > 2k: when n2 2k, the above improper integrals do not converge (both arguments of the Beta function must be strictly positive).

51.7

Moment generating function

An F random variable X does not possess a moment generating function. Proof. When a random variable X possesses a moment generating function, then the k-th moment of X exists and is …nite for any k 2 N. But we have proved above that the k-th moment of X exists only for k < n2 =2. Therefore, X can not have a moment generating function.

51.8

Characteristic function

There is no simple expression for the characteristic function of the F distribution. It can be expressed in terms of the con‡uent hypergeometric function of the second kind (a solution of a certain di¤erential equation, called con‡uent hypergeometric di¤erential equation). The interested reader can consult Phillips6 (1982).

51.9

Distribution function

The distribution function of an F random variable is: Z n1 x=n2 1 sn1 =2 1 (1 + s) FX (x) = B n21 ; n22 1

n1 =2 n2 =2

ds

6 Phillips, P. C. B. (1982) "The true characteristic function of the F distribution", Biometrika, 69, 261-264.

51.10. SOLVED EXERCISES

429

where the integral Z

n1 x=n2

sn1 =2

1

(1 + s)

n1 =2 n2 =2

ds

1

is known as incomplete Beta function and is usually computed numerically with the help of a computer algorithm. Proof. This is proved as follows: FX (x) Z x = fX (t) dt = A

Z

= c

1

x

ctn1 =2 Z

= c B

1

1

1

n2 n1

1 n2 s n1 n1 =2 Z n1 x=n2

=

(n1 =n2 ) 1 B n21 ; n22 1 B

n1 n2 2 ; 2

(n1 +n2 )=2

n1 t n2

dt

n =2 1

n1 x=n2

n =2

=

1+

Z

1

n2 n1

(1 + s) sn1 =2

n1 =2

n1 x=n2

Z

1

n1 =2 n2 =2

(1 + s)

n1 x=n2

n2 ds n1

n1 =2 n2 =2

sn1 =2

1

ds

(1 + s)

n1 =2 n2 =2

sn1 =2

1

(1 + s)

n1 =2 n2 =2

ds

1

where: in step A we have performed a change of variable (s = we have used the de…nition of c.

51.10

ds

1

n1 n2 t);

in step B

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let X1 be a Gamma random variable with parameters n1 = 3 and h1 = 2. Let X2 be another Gamma random variable, independent of X1 , with parameters n2 = 5 and h1 = 6. Find the expected value of the ratio: X1 X2 Solution We can write: X1 X2

= =

2Z1 6Z2

where Z1 and Z2 are two independent Gamma random variables, the parameters of Z1 are n1 = 3 and h1 = 1 and the parameters of Z2 are n2 = 5 and h2 = 1

430

CHAPTER 51. F DISTRIBUTION

(see the lecture entitled Gamma distribution - p. 397). Using this fact, the ratio becomes: X1 2 Z1 1 Z1 = = X2 6 Z2 3 Z2 where Z1 =Z2 has an F distribution with parameters n1 = 3 and n2 = 5. Therefore: E

X1 X2

=

1 Z1 1 Z1 = E 3 Z2 3 Z2 1 n2 1 5 5 = = 3 n2 2 35 2 9

E

=

Exercise 2 Find the third moment of an F random variable with parameters n1 = 6 and n2 = 18. Solution We need to use the formula for the k-th moment of an F random variable: X (k) =

n2 n1

k

(n1 =2 + k) (n1 =2)

(n2 =2 k) (n2 =2)

Plugging in the parameter values, we obtain: 3

X (3)

18 (3 + 3) (9 3) (6) = 33 6 (3) (9) (3) 5! 5! 1 = 27 = 27 (5 4 3) 2! 8! 8 7 6 1 135 = = 27 5 2 7 2 28 =

(6) (9)

where we have used the relation between the Gamma function7 and the factorial function.

7 See

p. 55.

Chapter 52

Multinomial distribution The multinomial distribution is a generalization of the binomial distribution1 . If you perform n times an experiment that can have only two outcomes (either success or failure), then the number of times you obtain one of the two outcomes (success) is a binomial random variable. If you perform n times an experiment that can have K outcomes (K can be any natural number) and you denote by Xi the number of times that you obtain the i-th outcome, then the random vector >

X = [X1 X2 . . . XK ]

is a multinomial random vector. In this lecture we will …rst present the special case in which there is only one experiment (n = 1), and we will then employ the results obtained for this simple special case to discuss the more general case of many experiments (n 1).

52.1

The special case of one experiment

In this case, one experiment is performed, having K possible outcomes with probabilities p1 ; : : : ; pK . When the i-th outcome is obtained, the i-th entry of the multinomial random vector X takes value 1, while all other entries take value 0.

52.1.1

De…nition

The distribution is characterized as follows. De…nition 266 Let X be a K 1 discrete random vector. Let the support of X be the set of K 1 vectors having one entry equal to 1 and all other entries equal to 0: 8 9 K < = X K RX = x 2 f0; 1g : xj = 1 : ; j=1

Let p1 , . . . , pK be K strictly positive numbers such that K X

pj = 1

j=1

1 See

p. 341.

431

432

CHAPTER 52. MULTINOMIAL DISTRIBUTION

We say that X has a multinomial distribution with probabilities p1 , . . . , pK and number of trials n = 1 if its joint probability mass function2 is pX (x1 ; : : : ; xK ) =

QK

j=1

x

pj j

0

if (x1 ; : : : ; xK ) 2 RX otherwise

If you are puzzled by the above de…nition of the joint pmf, note that when (x1 ; : : : ; xK ) 2 RX and xi is equal to 1, because the i-th outcome has been obtained, then all other entries are equal to 0 and K Y

x

pj j

x

x

i+1 = px1 1 : : : pi i 11 pxi i pi+1 : : : pxKK

j=1

52.1.2

= p01 : : : p0i 1 p1i p0i+1 : : : p0K = 1 : : : 1 p1i 1 : : : 1 = pi

Expected value

The expected value of X is E [X] = p where the K

(52.1)

1 vector p is de…ned as follows: |

p = [p1 p2 . . . pK ]

Proof. The i-th entry of X, denoted by Xi , is an indicator function3 of the event "the i-th outcome has happened". Therefore, its expected value is equal to the probability of the event it indicates: E [Xi ] = pi

52.1.3

Covariance matrix

The covariance matrix of X is Var [X] = where

is a K

K matrix whose generic entry is ij

=

pi (1 pi ) if j = i pi pj if j 6= i

(52.2)

Proof. We need to use the formula4 ij

= Cov [Xi ; Xj ] = E [Xi Xj ]

E [Xi ] E [Xj ]

If j = i, then ii 2 See

=

E [Xi Xi ]

E [Xi ] E [Xi ] = E [Xi ]

p. 116. p. 197. 4 See the lecture entitled Covariance matrix - p. 189. 3 See

2

E [Xi ]

52.1. THE SPECIAL CASE OF ONE EXPERIMENT = pi

p2i = pi (1

433

pi )

where we have used the fact that Xi2 = Xi because Xi can take only values 0 and 1. If j 6= i, then ij

= =

E [Xi Xj ] E [Xi ] E [Xj ] E [Xi ] E [Xj ] = pi pj

where we have used the fact that Xi Xj = 0, because Xi and Xj cannot be both equal to 1 at the same time.

52.1.4

Joint moment generating function

The joint moment generating function5 of X is de…ned for any t 2 RK : MX (t) =

K X

pj exp (tj )

(52.3)

j=1

Proof. If the j-th outcome is obtained, then Xi = 0 for i 6= j and Xi = 1 for i = j. As a consequence exp (t1 X1 + : : : + tK XK ) = exp (tj ) and the joint moment generating function is MX (t)

= =

E exp t> X K X

= E [exp (t1 X1 + t2 X2 + : : : + tK XK )]

pj exp (tj )

j=1

52.1.5

Joint characteristic function

The joint characteristic function6 of X is 'X (t) =

K X

pj exp (itj )

(52.4)

j=1

Proof. If the j-th outcome is obtained, then Xi = 0 for i 6= j and Xi = 1 for i = j. As a consequence exp (it1 X1 + it2 X2 + : : : + itK XK ) = exp (itj ) and the joint characteristic function is 'X (t)

= =

E exp it> X K X j=1

5 See 6 See

p. 297. p. 315.

pj exp (itj )

= E [exp (it1 X1 + it2 X2 + : : : + itK XK )]

434

CHAPTER 52. MULTINOMIAL DISTRIBUTION

52.2

Multinomial distribution in general

We now deal with the general case, in which the number of experiments can take any value n 1.

52.2.1

De…nition

Multinomial random vectors are characterized as follows. De…nition 267 Let X be a K 1 discrete random vector. Let n 2 N. Let the support of X be the set of K 1 vectors having non-negative integer entries summing up to n: ( ) K X K RX = x 2 f0; 1; 2; : : : ; ng : xi = n i=1

Let p1 , . . . , pK be K strictly positive numbers such that K X

pi = 1

i=1

We say that X has a multinomial distribution with probabilities p1 , . . . , pK and number of trials n, if its joint probability mass function is pX (x1 ; : : : ; xK ) = where

n x1 ;x2 ;:::;xK

52.2.2

n x1 ;x2 ;:::;xK

0

QK

i=1

pxi i

if (x1 ; : : : ; xK ) 2 RX otherwise

is the multinomial coe¢ cient7 .

Representation as a sum of simpler multinomials

The connection between the general case (n 1) and the simpler case illustrated above (n = 1) is given by the following proposition. Proposition 268 A random vector X having a multinomial distribution with parameters p1 ; : : : ; pK and n can be written as X = Y1 + : : : + Yn where Y1 ; : : : ; Yn are n independent random vectors all having a multinomial distribution with parameters p1 ; : : : ; pK and 1. >

Proof. The sum Y1 + : : : + Yn is equal to the vector [x1 ; : : : ; xK ] when: 1) x1 terms of the sum have their …rst entry equal to 1 and all other entries equal to 0; 2) x2 terms of the sum have their second entry equal to 1 and all other entries equal to 0; . . . ; K) xK terms of the sum have their K-th entry equal to 1 and all PK other entries equal to 0. Provided xi 0 for each i and i=1 xi = n, there are several di¤erent realizations of the matrix [Y1 : : : Yn ] 7 See

p. 29.

52.2. MULTINOMIAL DISTRIBUTION IN GENERAL

435

satisfying these conditions. Since Y1 ; : : : ; Yn are independent, each of these realizations has probability px1 1 : : : pxKK Furthermore, their number is equal to the number of partitions of n objects into K groups8 having numerosities x1 ; : : : ; xK , which in turn is equal to the multinomial coe¢ cient n x1 ; x2 ; : : : ; xK Therefore P Y1 + : : : + Yn = [x1

:::

>

xK ]

n px1 : : : pxKK = pX (x1 ; : : : ; xK ) x1 ; x2 ; : : : ; xK 1

=

which proves that X and Y1 + : : : + Yn have the same distribution.

52.2.3

Expected value

The expected value of a multinomial random vector X is E [X] = np where the K

1 vector p is de…ned as follows: |

p = [p1 p2 . . . pK ]

Proof. Using the fact that X can be written as a sum of n multinomials with parameters p1 ; : : : ; pK and 1, we obtain E [X]

= E [Y1 + : : : + Yn ] = E [Y1 ] + : : : + E [Yn ] = p + : : : + p = np

where the result E [Yj ] = p has been derived in the previous section (formula 52.1).

52.2.4

Covariance matrix

The covariance matrix of a multinomial random vector X is Var [X] = n where

is a K

K matrix whose generic entry is ij

=

pi (1 pi ) if j = i pi pj if j 6= i

Proof. Since X can be represented as a sum of n independent multinomials with parameters p1 ; : : : ; pK and 1, we obtain Var [X] 8 See

the lecture entitled Partitions - p. 27.

436

CHAPTER 52. MULTINOMIAL DISTRIBUTION =

Var [Y1 + : : : + Yn ]

A

=

Var [Y1 ] + : : : + Var [Yn ]

B

= nVar [Y1 ]

C

= n

where: in step A we have used the fact that Y1 ; : : : ; Yn are mutually independent; in step B we have used the fact that Y1 ; : : : ; Yn have the same distribution; in step C we have used formula (52.2) for the covariance matrix of Y1 .

52.2.5

Joint moment generating function

The joint moment generating function of a multinomial random vector X is de…ned for any t 2 RK : 0 1n K X pj exp (tj )A MX (t) = @ j=1

Proof. Writing X as a sum of n independent multinomial random vectors with parameters p1 ; : : : ; pK and 1, the joint moment generating function of X is derived from that of the summands: MX (t)

=

E exp t> X

=

E exp t> (Y1 + : : : + Yn )

=

E exp t> Y1 + : : : + t> Yn "n # Y E exp t> Yl

=

l=1

A

=

n Y

E exp t> Yl

l=1

B

=

n Y

MYl (t)

l=1

C

0 1n K X = @ pj exp (tj )A j=1

where: in step A we have used the fact that Y1 ; : : : ; Yn are mutually independent; in step B we have used the de…nition of moment generating function of Yl ; in step C we have used formula (52.3) for the moment generating function of Y1 .

52.2.6

Joint characteristic function

The joint characteristic function of X is 0 1n K X 'X (t) = @ pj exp (itj )A j=1

52.3. SOLVED EXERCISES

437

Proof. The derivation is similar to the derivation of the joint moment generating function: 'X (t)

=

E exp it> X

=

E exp it> (Y1 + : : : + Yn )

=

E exp it> Y1 + : : : + it> Yn "n # Y E exp it> Yl

=

l=1

A

=

n Y

E exp it> Yl

l=1

B

=

n Y

'Yl (t)

l=1

C

0 1n K X = @ pj exp (itj )A j=1

where: in step A we have used the fact that Y1 ; : : : ; Yn are mutually independent; in step B we have used the de…nition of characteristic function of Yl ; in step C we have used formula (52.4) for the characteristic function of Y1 .

52.3

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 A shop selling two items, labeled A and B, needs to construct a probabilistic model of the sales that will be generated by its next 10 customers. Each time a customer arrives, only three outcomes are possible: 1) nothing is sold; 2) one unit of item A is sold; 3) one unit of item B is sold. It has been estimated that the probabilities of these three outcomes are 0.50, 0.25 and 0.25 respectively. Furthermore, the shopping behavior of a customer is independent of the shopping behavior of all other customers. Denote by X a 3 1 vector whose entries X1 ; X2 and X3 are equal to the number of times each of the three outcomes occurs. Derive the expected value and the covariance matrix of X. Solution The vector X has a multinomial distribution with parameters 1 2

p=

1 4

1 4

>

and n = 10. Therefore, its expected value is E [X] = np = 10

1 2

1 4

1 4

>

= 5

5 2

5 2

>

438

CHAPTER 52. MULTINOMIAL DISTRIBUTION

and its covariance matrix is 2 Var [X]

3 p1 (1 p1 ) p1 p2 p 1 p3 p1 p 2 p2 (1 p2 ) p 2 p3 5 = n 4 p1 p 3 p2 p3 p3 (1 p3 ) 2 1 1 3 1 1 1 1

=

=

10 4 2

10 4

2 2 2 4 1 1 1 3 2 4 4 4 1 1 1 1 2 4 4 4 1 1 1 4 8 8 1 3 1 8 16 16 1 1 3 8 16 16

2 4 1 1 4 4 1 3 4 4

3

2

5=4

5

5 2 5 4 5 4

5 4 15 8 5 8

5 4 5 8 15 8

3 5

Exercise 2 Given the assumptions made in the previous exercise, suppose that item A costs \$1,000 and item B costs \$2,000. Derive the expected value and the variance of the total revenue generated by the 10 customers. Solution The total revenue Y can be written as a linear transformation of the vector X: Y = AX where A = [0

1; 000

2; 000]

By the linearity of the expected value operator, we obtain E [Y ]

= E [AX] = AE [X] = [0 =

1; 000

2

3 5 2; 000] 4 5=2 5 5=2

0 5 + 1; 000 5=2 + 2; 000 5=2 = 7; 500

By using the formula for the covariance matrix of a linear transformation, we obtain Var [Y ]

= Var [AX] = AVar [X] A> 2 5

=

=

=

= =

[0

1; 000

2; 000] 4

2

[0 1; 000 2; 000] 4 2

[0 1; 000 2; 000] 4 2

[0 1; 000 2; 000] 4

2 5 4 5 4

5 4 15 8 5 8

5 4 5 8 15 8

32

3 0 5 4 1; 000 5 2; 000

3 (5=2) 0 (5=4) 1; 000 (5=4) 2; 000 (5=4) 0 + (15=8) 1; 000 (5=8) 2; 000 5 (5=4) 0 (5=8) 1; 000 + (15=8) 2; 000 3 1; 250 2; 500 1; 875 1; 250 5 1; 250 + 3; 750 3 3; 750 625 5 2; 500

0 ( 3; 750) + 1; 000 625 + 2; 000 2; 500 = 5; 625; 000

Chapter 53

Multivariate normal distribution The multivariate normal (MV-N) distribution is a multivariate generalization of the one-dimensional normal distribution1 . In its simplest form, which is called the "standard" MV-N distribution, it describes the joint distribution of a random vector whose entries are mutually independent univariate normal random variables, all having zero mean and unit variance. In its general form, it describes the joint distribution of a random vector that can be represented as a linear transformation of a standard MV-N vector. It is a common mistake to think that any set of normal random variables, when considered together, form a multivariate normal distribution. This is not the case. In fact, it is possible to construct random vectors that are not MV-N, but whose individual elements have normal distributions. The latter fact is very well-known in the theory of copulae (a theory which allows to specify the distribution of a random vector by …rst specifying the distribution of its components and then linking the univariate distributions through a function called copula). The remainder of this lecture illustrates the main characteristics of the multivariate normal distribution, dealing …rst with the "standard" case and then with the more general case.

53.1

The standard MV-N distribution

The adjective "standard" is used to indicate that the mean of the distribution is equal to zero and its covariance matrix is equal to the identity matrix.

53.1.1

De…nition

Standard MV-N normal random vectors are characterized as follows. De…nition 269 Let X be a K 1 absolutely continuous random vector. Let its support be the set of K-dimensional real vectors: R X = RK 1 See

p. 375.

439

440

CHAPTER 53. MULTIVARIATE NORMAL DISTRIBUTION

We say that X has a standard multivariate normal distribution if its joint probability density function2 is K=2

fX (x) = (2 )

53.1.2

1 | x x 2

exp

Relation to the univariate normal distribution

Denote the i-th component of x by xi . The joint probability density function can be written as ! K 1X 2 K=2 fX (x) = (2 ) exp x 2 i=1 i K Y

=

1=2

(2 )

1 2 x 2 i

exp

i=1 K Y

=

f (xi )

i=1

where 1=2

f (xi ) = (2 )

1 2 x 2 i

exp

is the probability density function of a standard normal random variable3 . Therefore, the K components of X are K mutually independent4 standard normal random variables. A more detailed proof follows. Proof. As we have seen, the joint probability density function can be written as fX (x) =

K Y

f (xi )

i=1

where f (xi ) is the probability density function of a standard normal random variable. But f (xi ) is also the marginal probability density function5 of the i-th component of X: fXi (xi ) Z 1 Z = ::: =

Z

1

1

:::

1

= f (xi ) Z 1 1

= f (xi ) 2 See

p. p. 4 See p. 5 See p. 3 See

117. 376. 233. 120.

Z

Z

1 1 1

1

fX (x1 ; : : : ; xi K Y

1 ; xi ; xi+1 ; : : : ; xK ) dx1

f (xj ) dx1 : : : dxi

1 j=1

f (x1 ) dx1 : : :

1

f (xi+1 ) dxi+1 : : :

Z

Z

1

1 dxi+1

f (xi

1 1

: : : dxK

1 ) dxi 1

f (xK ) dxK

1

: : : dxi

1 dxi+1

: : : dxK

53.1. THE STANDARD MV-N DISTRIBUTION

441

where, in the last step, we have used the fact that all the integrals are equal to 1, because they are integrals of probability density functions over their respective supports. Therefore, the joint probability density function of X is equal to the product of its marginals, which implies that the components of X are mutually independent.

53.1.3

Expected value

The expected value of a standard MV-N random vector X is E [X] = 0 Proof. All the components of X are standard normal random variables and a standard normal random variable has mean 0.

53.1.4

Covariance matrix

The covariance matrix of a standard MV-N random vector X is Var [X] = I where I is the K K identity matrix, i.e., a K K matrix whose diagonal entries are equal to 1 and whose o¤-diagonal entries are equal to 0. Proof. This is proved using the structure of the covariance matrix6 : 3 2 Var [X1 ] Cov [X1 ; X2 ] : : : Cov [X1 ; XK ] 6 Cov [X1 ; X2 ] Var [X2 ] Cov [X2 ; XK ] 7 ::: 6 7 Var [X] = 6 7 .. .. .. . . 4 5 . . . . Cov [X1 ; XK ]

Cov [X2 ; XK ] : : :

Var [XK ]

where Xi is the i-th component of X. Since the components of X are all standard normal random variables, their variances are all equal to 1: Var [X1 ] = : : : = Var [XK ] = 1 Furthermore, since the components of X are mutually independent and independence implies zero-covariance7 , all the covariances are equal to 0: Cov [Xi ; Xj ] = 0 if i 6= j

Therefore,

Var [X]

2

6 6 = 6 4 2

6 6 = 6 4 6 See 7 See

p. 189. p. 234.

Var [X1 ] Cov [X1 ; X2 ] .. . Cov [X1 ; XK ]

Cov [X1 ; X2 ] Var [X2 ] .. .

: : : Cov [X1 ; XK ] Cov [X2 ; XK ] ::: .. .. . .

Cov [X2 ; XK ] : : : 3

1 0 ::: 0 0 1 ::: 0 7 7 .. .. . . . 7=I . .. 5 . . 0 0 ::: 1

Var [XK ]

3 7 7 7 5

442

53.1.5

CHAPTER 53. MULTIVARIATE NORMAL DISTRIBUTION

Joint moment generating function

The joint moment generating function of a standard MV-N random vector X is de…ned for any t 2 RK : 1 > MX (t) = exp t t 2 Proof. The K components of X are K mutually independent standard normal random variables (see 53.1.2). As a consequence, the joint mgf of X can be derived as follows: MX (t)

= E exp t> X = E [exp (t1 X1 + t2 X2 + : : : + tK XK )] 2 3 K Y = E4 exp (tj Xj )5 j=1

A

=

K Y

E [exp (tj Xj )]

j=1

B

=

K Y

MXj (tj )

j=1

where: in step A we have used the fact that the components of X are mutually independent8 ; in step B we have used the de…nition of moment generating function9 . The moment generating function of a standard normal random variable10 is 1 2 MXj (tj ) = exp t 2 j which implies that the joint mgf of X is MX (t)

=

K Y

j=1

=

MXj (tj ) = 0

exp @

1 2

K X j=1

1

K Y

exp

j=1

t2j A = exp

1 2 t 2 j 1 > t t 2

The mgf MXj (tj ) of a standard normal random variable is de…ned for any tj 2 R. As a consequence, the joint mgf of X is de…ned for any t 2 RK .

53.1.6

Joint characteristic function

The joint characteristic function of a standard MV-N random vector X is 'X (t) = exp 8 See

1 > t t 2

Mutual independence via expectations (p. 234). p. 297. 1 0 See p. 378. 9 See

53.2. THE MV-N DISTRIBUTION IN GENERAL

443

Proof. The K components of X are K mutually independent standard normal random variables (see 53.1.2). As a consequence, the joint characteristic function of X can be derived as follows: 'X (t)

= E exp it> X = E [exp (it1 X1 + it2 X2 + : : : + itK XK )] 2 3 K Y = E4 exp (itj Xj )5 j=1

A

K Y

=

E [exp (itj Xj )]

j=1

B

K Y

=

'Xj (tj )

j=1

where: in step A we have used the fact that the components of X are mutually independent; in step B we have used the de…nition of joint characteristic function11 . The characteristic function of a standard normal random variable is12 1 2 t 2 j

'Xj (tj ) = exp

which implies that the joint characteristic function of X is 'X (t)

=

K Y

j=1

=

53.2

'Xj (tj ) = 0

exp @

K Y

exp

j=1

1 2

K X j=1

1

t2j A = exp

1 2 t 2 j 1 > t t 2

The MV-N distribution in general

While in the previous section we restricted our attention to the multivariate normal distribution with zero mean and unit variance, we now deal with the general case.

53.2.1

De…nition

MV-N random vectors are characterized as follows. De…nition 270 Let X be a K 1 absolutely continuous random vector. Let its support be the set of K-dimensional real vectors: R X = RK 1 1 See 1 2 See

p. 315. p. 379.

444

CHAPTER 53. MULTIVARIATE NORMAL DISTRIBUTION

Let be a K 1 constant vector and V a K K symmetric and positive de…nite matrix. We say that X has a multivariate normal distribution with mean and covariance V if its joint probability density function is fX (x) = (2 )

K=2

1=2

jdet (V )j

1 (x 2

exp

|

1

) V

(x

)

We indicate that X has a multivariate normal distribution with mean covariance V by X N ( ;V )

and

The K random variables X1 : : : ; XK constituting the vector X are said to be jointly normal.

53.2.2

Relation to the standard MV-N distribution

The next proposition states that a multivariate normal random vector with arbitrary mean and covariance is just a linear transformation of a standard MV-N vector. Proposition 271 Let X be a K 1 random vector having a multivariate normal distribution with mean and covariance V . Then, X= where Z is a standard MV-N K | such that V = = | .

+ Z

(53.1)

1 vector and

is a K

K invertible matrix

Proof. This is proved using the formula for the joint density of a linear function13 of an absolutely continuous random vector: fX (x) 1 fZ = jdet ( )j 1 (2 ) = jdet ( )j = jdet ( )j

1 2

1

(x

K=2

) 1 2

exp 1 2

jdet ( )j

(2 )

=

(2 )

K=2

jdet ( ) det ( )j

=

(2 )

K=2

jdet ( ) det (

=

(2 )

K=2

jdet (

=

(2 )

K=2

jdet (V )j

|

)j 1 2

|

1 2

1 2

)j

1

K=2

1 2

exp 1 (x 2 1 (x 2

+

1

1 (x 2

1 (x 2 1 (x 2

|

1 |

|

)

|

=

|

(x |

)

1

|

) (

1

)

|

|

|

|

(x

) ) (

) ( ) V

The existence of a matrix satisfying V = that V is symmetric and positive de…nite. 1 3 See p. 279. Note that X = g (Z) = invertible.

|

)

exp

exp

exp

exp

(x

1

)

(x

1

1

1

(x

(x

(x

)

) )

)

) is guaranteed by the fact

X is a linear one-to-one mapping because

is

53.2. THE MV-N DISTRIBUTION IN GENERAL

53.2.3

445

Expected value

The expected value of a MV-N random vector X is E [X] = Proof. This is an immediate consequence of (53.1) and of the linearity of the expected value14 : E [X] = E [ + Z] =

53.2.4

+ E [Z] =

+ 0=

Covariance matrix

The covariance matrix of a MV-N random vector X is Var [X] = V Proof. This is an immediate consequence of (53.1) and of the properties of covariance matrices15 : Var [X]

53.2.5

|

= Var [ + Z] = Var [Z] | = I |= =V

Joint moment generating function

The joint moment generating function of a MV-N random vector X is de…ned for any t 2 RK : 1 MX (t) = exp t> + t> V t 2 Proof. This is an immediate consequence of (53.1), of the fact that is a K K | invertible matrix such that V = , and of the rule for deriving the joint mgf of 16 a linear transformation : MX (t)

=

exp t>

MZ

>

exp

1 > t 2

t

=

exp t>

=

1 exp t> + t> V t 2

where MZ (t) = exp

>

t

1 > t t 2

1 4 See, in particular the Addition to constant matrices (p. 148) and Multiplication by constant matrices (p. 149) properties of the expected value of a random vector. 1 5 See, in particular, the Addition to constant vectors (p. 191) and Multiplication by constant matrices (p. 191) properties. 1 6 See p. 301.

446

53.2.6

CHAPTER 53. MULTIVARIATE NORMAL DISTRIBUTION

Joint characteristic function

The joint characteristic function of a MV-N random vector X is 1 > t Vt 2

'X (t) = exp it>

Proof. This is an immediate consequence of (53.1), of the fact that is a K | K invertible matrix such that V = , and of the rule for deriving the joint characteristic function of a linear transformation17 : 'X (t)

=

exp it>

'Z

>

=

exp it>

exp

=

exp it>

t 1 > t 2

>

t

1 > t Vt 2

where

1 > t t 2

'Z (t) = exp

53.3

More details

53.3.1

The univariate normal as a special case

The univariate normal distribution18 is just a special case of the multivariate normal distribution: setting K = 1 in the joint density function of the multivariate normal distribution one obtains the density function of the univariate normal distribution19 .

53.3.2

Mutual independence and joint normality

Let X1 ; : : : ; XK be K mutually independent random variables all having a normal distribution. Denote by i the mean of Xi and by 2i its variance. Then the K 1 random vector X de…ned as |

X = [X1 : : : XK ] has a multivariate normal distribution with mean =[ and covariance matrix

1 7 See

2

6 6 V =6 4

2 1

0 .. . 0

1

| K]

::: 0 2 2

.. . 0

::: :::

..

. :::

0 0 .. . 2 K

3 7 7 7 5

p. 317. p. 375. 1 9 Remember that the determinant and the transpose of a scalar are equal to the scalar itself. 1 8 See

53.4. SOLVED EXERCISES

447

In other words, mutually independent normal random variables are also jointly normal. Proof. This can be proved by showing that the product of the probability density functions of X1 ; : : : ; XK is equal to the joint probability density function of X (this is left as an exercise).

53.3.3

Linear combinations and transformations

Linear transformations and combinations of multivariate normal random vectors are also multivariate normal. This is explained and proved in the lecture entitled Linear combinations of normals (p. 469).

53.3.4

The lecture entitled Quadratic forms in normal vectors (p. 481) discusses quadratic forms involving standard normal random vectors.

53.3.5

Partitioned vectors

The lecture entitled Partitioned multivariate normal vectors (p. 477) discusses partitions of normal random vectors into sub-vectors.

53.4

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let X = [X1

>

X2 ]

be a multivariate normal random vector with mean = [1

>

2]

and covariance matrix V =

3 1

1 2

Prove that the random variable Y = X1 + X2 has a normal distribution with mean equal to 3 and variance equal to 7. Hint: use the joint moment generating function of X and its properties. Solution The random variable Y can be written as Y = BX where B = [1

1]

448

CHAPTER 53. MULTIVARIATE NORMAL DISTRIBUTION

By using the formula for the joint moment generating function of a linear transformation of a random vector20 MY (t) = MX B > t and the fact that the mgf of a multivariate normal vector X is 1 MX (t) = exp t> + t> V t 2 we obtain 1 = MX B > t = exp t> B + t> BV B | t 2 1 = exp B t + BV B | t2 2

MY (t)

where, in the last step, we have also used the fact that t is a scalar, because Y is unidimensional. Now, B = [1

1]

=

1 1

3 1

=

1

4 3

1 2

=1 1+1 2=3

and BV B |

1

1 2

1 1

=

1 1

3 1+1 1 1 1+2 1

=1 4+1 3=7

Plugging the values just obtained into the formula for the mgf of Y , we get 1 MY (t) = exp B t + BV B | t2 2

7 = exp 3t + t2 2

But this is the moment generating function of a normal random variable with mean equal to 3 and variance equal to 7 (see the lecture entitled Normal distribution p. 375). Therefore, Y is a normal random variable with mean equal to 3 and variance equal to 7 (remember that a distribution is completely characterized by its moment generating function).

Exercise 2 Let X = [X1

>

X2 ]

be a multivariate normal random vector with mean = [2

>

3]

and covariance matrix V =

2 1

1 2

Using the joint moment generating function of X, derive the cross-moment21 E X12 X2 2 0 See 2 1 See

p. 301. p. 285.

53.4. SOLVED EXERCISES

449

Solution The joint mgf of X is MX (t)

1 = exp t> + t> V t 2 1 = exp 2t1 + 3t2 + 2t21 + 2t22 + 2t1 t2 2 exp 2t1 + 3t2 + t21 + t22 + t1 t2

=

The third-order cross-moment we want to compute is equal to a third partial derivative of the mgf, evaluated at zero: E X12 X2 =

@ 3 MX (t1 ; t2 ) @t21 @t2

t1 =0;t2 =0

The partial derivatives are @MX (t1 ; t2 ) @t1 @ 2 MX (t1 ; t2 ) @t21

=

(2 + 2t1 + t2 ) exp 2t1 + 3t2 + t21 + t22 + t1 t2

=

@ @t1

=

2 exp 2t1 + 3t2 + t21 + t22 + t1 t2

@MX (t1 ; t2 ) @t1 2

+ (2 + 2t1 + t2 ) exp 2t1 + 3t2 + t21 + t22 + t1 t2 @ 3 MX (t1 ; t2 ) @t21 @t2

@ 2 MX (t1 ; t2 ) @t21

=

@ @t2

=

2 (3 + 2t2 + t1 ) exp 2t1 + 3t2 + t21 + t22 + t1 t2 +2 (2 + 2t1 + t2 ) exp 2t1 + 3t2 + t21 + t22 + t1 t2 2

+ (2 + 2t1 + t2 ) (3 + 2t2 + t1 ) exp 2t1 + 3t2 + t21 + t22 + t1 t2 Thus, E X12 X2

= =

@ 3 MX (t1 ; t2 ) = 2 3 1 + 2 2 1 + 22 3 1 @t21 @t2 t1 =0;t2 =0 6 + 4 + 12 = 22

450

CHAPTER 53. MULTIVARIATE NORMAL DISTRIBUTION

Chapter 54

Multivariate Student’s t distribution This lecture deals with the multivariate (MV) Student’s t distribution. We …rst introduce the special case in which the mean is equal to zero and the scale matrix is equal to the identity matrix. We then deal with the more general case.

54.1

The standard MV Student’s t distribution

The adjective "standard" is used for a multivariate Student’s t distribution having zero mean and unit scale matrix.

54.1.1

De…nition

Standard multivariate Student’s t random vectors are characterized as follows. De…nition 272 Let X be a K 1 absolutely continuous random vector. Let its support be the set of K-dimensional real vectors: R X = RK Let n 2 R++ . We say that X has a standard multivariate Student’s t distribution with n degrees of freedom if its joint probability density function1 is fX (x) = c 1 +

1 | x x n

(n+K)=2

where c = (n ) and

K=2

(n=2 + K=2) (n=2)

() is the Gamma function2 .

1 See 2 See

p. 117. p. 55.

451

452

CHAPTER 54. MULTIVARIATE STUDENT’S T DISTRIBUTION

54.1.2

Relation to the univariate Student’s t distribution

When K = 1, the de…nition of the standard multivariate Student’s t distribution coincides with the de…nition of the standard univariate Student’s t distribution3 : fX (x) =

A

(n )

= n

1=2

= n

1=2

=

p

K=2

(n=2 + K=2) (n=2)

1+

1 | x x n

(n=2 + 1=2) (n=2)

1+

x2 n

1=2

x2 1+ n

(n=2 + 1=2) (1=2) (n=2)

1 nB (n=2; 1=2)

1+

54.1.3

(n+1)=2

(n+1)=2

(n+1)=2

x2 n

where: in step A we have used the fact that4

(n+K)=2

(1=2) =

1=2

:

Relation to the Gamma and MV normal distributions

A standard multivariate Student’s t random vector can be written as a multivariate normal random vector5 whose covariance matrix is scaled by the reciprocal of a Gamma random variable6 , as shown by the following proposition. Proposition 273 (Integral representation) The joint probability density function of X can be written as Z 1 fX (x) = fXjZ=z (x) fZ (z) dz 0

where: 1. fXjZ=z (x) is the joint probability density function of a multivariate normal distribution with mean 0 and covariance V = z1 I (where I is the K K identity matrix): fXjZ=z (x)

= c1 jdet (V )j = c1

1 z

1=2

1 | x V 2

exp 1=2

K

det (I)

= c1 z K=2 exp

exp

1 z x| x 2

where c1 = (2 ) 3 See

p. p. 5 See p. 6 See p. 4 See

407. 57. 439. 397.

K=2

1

x

1 | x 2

1 z

1

I

1

!

x

54.1. THE STANDARD MV STUDENT’S T DISTRIBUTION

453

2. fZ (z) is the probability density function of a Gamma random variable with parameters n and h = 1: fZ (z) = c2 z n=2 where c2 =

1

1 n z 2

exp

nn=2 2n=2 (n=2)

Proof. We need to prove that fX (x) =

Z

1

fXjZ=z (x) fZ (z) dz

0

where

1 z x| x 2

fXjZ=z (x) = c1 z K=2 exp and fZ (z) = c2 z n=2

1

1 n z 2

exp

We start from the integrand function: fXjZ=z (x) fZ (z)

= c1 z K=2 exp

1 z x| x c2 z n=2 2

1

1 n z 2

exp

1 (x| x + n) z 2 1 0 1 n + K zA = c1 c2 z (n+K)=2 1 exp @ n+K 2 x| x+n 1 0 n + K 1 1 zA = c1 c2 c3 z (n+K)=2 1 exp @ n+K 2 c3 | = c1 c2 z (n+K)=2

1

exp

x x+n

= c1 c2

1 fZjX=x (z) c3

where (n + K) = c3 =

n+K x| x+n

(n+K)=2

2(n+K)=2 ((n + K) =2)

(x| x + n) 2n=2 2K=2

(n+K)=2

=

n 2

+

K 2

and fZjX=x (z) is the probability density function of a random variable having a Gamma distribution with parameters n + K and xn+K | x+n . Therefore, Z

Z0 1

fXjZ=z (x) fZ (z) dz

1 c1 c2 fZjX=x (z) dz c 3 0 Z 1 1 = c1 c2 fZjX=x (z) dz c3 0

= A

1

454

CHAPTER 54. MULTIVARIATE STUDENT’S T DISTRIBUTION B

= c1 c2

1 c3

=

(2 )

K=2

=

(2 )

K=2

=

(2 )

K=2

n

nn=2 2n=2 2K=2 2n=2 (n=2) nn=2 K=2 2 (n=2) nn=2 (n=2)

n=2 K=2

K=2

=

(2 )

=

2

=

(n )

1 n 2 K=2

1+

K=2

1 | x x n

(n+K)=2 (n+K)=2

n K + 2 2

n

n 2

+K 2 (n=2)

+K 2 (n=2)

n 1+

K=2

1 2

n 2

(x| x + n)

(n+K)=2

1 | x x n

+K 2 (n=2)

K=2

n K + 2 2

1 2

n 2

n K + 2 2

1+

1+

K=2

1 | x x n

1 | x x n

1+

1 | x x n

(n+K)=2

(n+K)=2

(n+K)=2

= fX (x) where: in step A we have used the fact that c1 , c2 and c3 do not depend on z; in step B we have used the fact that the integral of a probability density function over its support is 1. Since X has a multivariate normal distribution with mean 0 and covariance V = z1 I , conditional on Z = z, then we can also think of it as a ratio 1 X=p Y Z where Y has a standard multivariate normal distribution, Z has a Gamma distribution and Y and Z are independent.

54.1.4

Marginals

The marginal distribution of the i-th component of X (denote it by Xi ) is a standard Student’s t distribution with n degrees of freedom. It su¢ ces to note that the marginal probability density function7 of Xi can be written as Z 1 fXi (xi ) = fXi jZ=z (xi ) fZ (z) dz 0

where fXi jZ=z (xi ) is the marginal density of Xi jZ = z , i.e., the density of a normal random variable8 with mean 0 and variance z1 : fXi jZ=z (xi ) = (2 ) 7 See 8 See

p. 120. p. 375.

1=2

z 1=2 exp

1 z x2i 2

54.1. THE STANDARD MV STUDENT’S T DISTRIBUTION

455

Proof. This is obtained by exchanging the order of integration: fXi (xi ) Z 1 Z = ::: 1 Z Z 1 ::: = 1

1 1 1 1

fX (x1 ; : : : ; xi 1 ; xi ; xi+1 ; : : : ; xK ) dx1 : : : dxi 1 dxi+1 : : : dxK Z 1 fXjZ=z (x1 ; : : : ; xi 1 ; xi ; xi+1 ; : : : ; xK ) fZ (z) dz 0

dx1 : : : dxi 1 xi+1 : : : dxK Z 1Z 1 Z 1 = ::: fXjZ=z (x1 ; : : : ; xi 1

0

1 ; xi ; xi+1 ; : : : ; xK )

1

dx1 : : : dxi 1 dxi+1 : : : dxK fZ (z) dz Z 1 = fXi jZ=z (xi ) fZ (z) dz 0

But, by Proposition 273, the fact that Z 1 fXi (xi ) = fXi jZ=z (xi ) fZ (z) dz 0

implies that Xi has a standard multivariate Student’s t distribution with n degrees of freedom (hence a standard univariate Student’s t distribution with n degrees of freedom, because the two are the same thing when K = 1).

54.1.5

Expected value

The expected value of a standard multivariate Student’s t random vector X is well-de…ned only when n > 1 and it is E [X] = 0

(54.1)

Proof. E [X] = 0 if E [Xi ] = 0 for all K components Xi . But the marginal distribution of Xi is a standard Student’s t distribution with n degrees of freedom. Therefore, E [Xi ] = 0 provided n > 1.

54.1.6

Covariance matrix

The covariance matrix of a standard multivariate Student’s t random vector X is well-de…ned only when n > 2 and it is Var [X] =

n n

2

I

where I is the K K identity matrix. Proof. First of all, we can rewrite the covariance matrix as follows: Var [X] A

=

E XX >

B

=

E XX >

>

E [X] E [X]

456

CHAPTER 54. MULTIVARIATE STUDENT’S T DISTRIBUTION C

=

D

=

E

= = =

E E XX > jZ = z h E E XX > jZ = z

>

E [X jZ = z ] E [X jZ = z ]

E [Var [X jZ = z ]] 1 E I z 1 E I z

i

where: in step A we have used the formula for computing the covariance matrix9 ; in step B we have used the fact that E [X] = 0 (eq. 54.1); in step C we have used the Law of Iterated Expectations10 ; in step D we have used the fact that E [X jZ = z ] = 0 because X has a normal distribution with zero mean, conditional on Z = z; in step E we have used the fact that, by the de…nition of variance, Var [X jZ = z ] = E XX > jZ = z

>

E [X jZ = z ] E [X jZ = z ]

But

=

1 E z Z 1 0

= = =

A

=

B

=

C

= =

9 See 1 0 See

p. 190. p. 225.

Z

1 fZ (z) dz z

1

1 nn=2 1 z n=2 1 exp n z dz n=2 z2 2 (n=2) 0 Z 1 n=2 1 n z (n 2)=2 1 exp n z dz 2 2n=2 (n=2) 0

2(n 2)=2 ((n 2) =2) nn=2 2n=2 (n=2) n(n 2)=2 Z 1 n(n 2)=2 z (n 2)=2 1 exp 2(n 2)=2 ((n 2) =2) 0 Z 2(n 2)=2 nn=2 ((n 2) =2) 1 ' (z) dz (n=2) 2n=2 n(n 2)=2 0

2(n 2)=2 nn=2 2n=2 n(n 2)=2 1 1 n 2 (n 2) =2 n n 2

((n 2) =2) (n=2)

n

21 z dz 2

n 2 n

54.2. THE MV STUDENT’S T DISTRIBUTION IN GENERAL

457

where: in step A we have de…ned ' (z) =

n(n 2(n 2)=2

2)=2

((n

2) =2)

z (n

2)=2 1

exp

n

21 z 2

n 2 n

which is the density of a Gamma random variable with parameters n 2 and nn 2 ; in step B we have used the fact that the integral of a probability density function over its support is equal to 1; in step C we have used the properties of the Gamma function. As a consequence, Var [X] = E

54.2

n 1 I= I z n 2

The MV Student’s t distribution in general

While in the previous section we restricted our attention to the multivariate Student’s t distribution with zero mean and unit scale matrix, we now deal with the general case.

54.2.1

De…nition

Multivariate Student’s t random vectors are characterized as follows. De…nition 274 Let X be a K 1 absolutely continuous random vector. Let its support be the set of K-dimensional real vectors: R X = RK Let n 2 R++ , let be a K 1 vector and let V be a K K symmetric and positive de…nite matrix. We say that X has a multivariate Student’s t distribution with mean , scale matrix V and n degrees of freedom if its joint probability density function is (n+K)=2 1 | ) V 1 (x ) fX (x) = c 1 + (x n where c = (n )

K=2

(n=2 + K=2) jdet (V )j (n=2)

1=2

We indicate that X has a multivariate Student’s t distribution with mean , scale matrix V and n degrees of freedom by X

54.2.2

T ( ; V; n)

Relation to the standard MV Student’s t distribution

If X T ( ; V; n), then X is a linear function of a standard Student’s t random vector.

458

CHAPTER 54. MULTIVARIATE STUDENT’S T DISTRIBUTION

Proposition 275 Let X

T ( ; V; n). Then, X=

+ Z

(54.2)

where Z is a K 1 vector having a standard multivariate Student’s t distribution | with n degrees of freedom and is a K K invertible matrix such that V = = | . Proof. This is proved using the formula for the joint density of a linear function11 of an absolutely continuous random vector: fX (x)

= =

1 fZ jdet ( )j 1 (n ) jdet ( )j 1+

=

(n )

K=2

1+ =

(n )

(n )

=

(n )

1+ =

(n )

1+

(n=2 + K=2) (n=2)

K=2

(x

1

(x

) 1 2

jdet ( )j

1

(x

)

(n=2 + K=2) jdet ( ) det ( )j (n=2) |

) (

|

1 2

(n+K)=2 1

)

1

(x

)

(n=2 + K=2) jdet ( ) det ( (n=2) |

|

) (

|

|

) (

|

)j

1 2

(n+K)=2

)

1

(x

)

(n=2 + K=2) jdet ( (n=2)

|

)j

1 2

(n+K)=2

)

1

(x

)

(n=2 + K=2) jdet (V )j (n=2)

1 (x n

1 2

(n+K)=2

1 |

1 2

(n+K)=2

|

1

) V

The existence of a matrix satisfying V = that V is symmetric and positive de…nite.

54.2.3

(n+K)=2

|

)

|

1 (x n

K=2

)

)

1 (x n

K=2

(x

(n=2 + K=2) jdet ( )j (n=2)

1 (x n

K=2

1+

1

1 (x n

K=2

1+ =

1 n

1

(x

) |

=

|

is guaranteed by the fact

Expected value

The expected value of a multivariate Student’s t random vector X is E [X] = 1 1 See

p. 279. Note that X = g (Z) = invertible.

+

X is a linear one-to-one mapping since

is

54.3. SOLVED EXERCISES

459

Proof. This is an immediate consequence of (54.2) and of the linearity of the expected value12 : E [X] = E [ + Z] =

54.2.4

+ E [Z] =

+ 0=

Covariance matrix

The covariance matrix of a multivariate Student’s t random vector X is Var [X] =

n n

2

V

Proof. This is an immediate consequence of (54.2) and of the properties of covariance matrices13 : Var [X]

54.3

= Var [ + Z] = Var [Z] | n = I | n 2 n | = n 2 n V = n 2

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let X be a multivariate normal random vector with mean = 0 and covariance matrix V . Let Z1 ; : : : ; Zn be n normal random variables having zero mean and variance 2 . Suppose that Z1 ; : : : ; Zn are mutually independent, and also independent of X. Find the distribution of the random vector Y de…ned as 2

Y =p

Z12

Solution

+ : : : + Zn2

X

We can write Y

=

2

p

Z12 + : : : + Zn2

X

1 2 See, in particular the Addition to constant matrices (p. 148) and Multiplication by constant matrices (p. 149) properties of the expected value of a random vector. 1 3 See, in particular, the Addition to constant vectors (p. 191) and Multiplication by constant matrices (p. 191) properties.

460

CHAPTER 54. MULTIVARIATE STUDENT’S T DISTRIBUTION = =

p

n

2 p n

2 X p 2 (W1 + : : : + Wn2 ) =n 1 p Q 2 (W1 + : : : + Wn2 ) =n

where W1 ; : : : ; Wn are standard normal random variables, Q has a standard mul> tivariate normal distribution and = V (if you are wondering about the stan2 2 2 dardizations Zi = Wi and X = Q, revise the lectures entitled Normal distribution - p. 375, and Multivariate normal distribution - p. 439). Now, the sum W12 + : : : + Wn2 has a Chi-square distribution with n degrees of freedom and the ratio W12 + : : : + Wn2 n has a Gamma distribution with parameters n and h = 1 (see the lectures entitled Chi-square distribution - p. 387, and Gamma distribution - p. 397). As a consequence, by the results in Subsection 54.1.3, the ratio 1

p

(W12

+ : : : + Wn2 ) =n

Q

has a standard multivariate Student’s t distribution with n degrees of freedom. Finally, by equation (54.2), Y has a multivariate Student’s t distribution with mean 0 and scale matrix 2 p n

2 p n

>

=

4 n 2

>

=

4 V n 2

Chapter 55

Wishart distribution This lecture provides a brief introduction to the Wishart distribution, which is a multivariate generalization of the Gamma distribution1 . In previous lectures we have explained that: 1. a Chi-square random variable2 with n degrees of freedom can be seen as a sum of squares of n independent normal random variables having mean 0 and variance 1; 2. a Gamma random variable with parameters n and h can be seen as a sum of squares of n independent normal random variables having mean 0 and variance h=n. A Wishart random matrix3 with parameters n and H can be seen as a sum of outer products of n independent multivariate normal random vectors4 having mean 0 and covariance matrix n1 H. In this sense, the Wishart distribution can be considered a generalization of the Gamma distribution (take point 2 above and substitute normal random variables with multivariate normal random vectors, squares with outer products and the variance h=n with the covariance matrix n1 H). At the end of this lecture you can …nd a brief review of some basic concepts in matrix algebra that will be helpful in understanding the remainder of this lecture.

55.1

De…nition

Wishart random matrices are characterized as follows: De…nition 276 Let W be a K K absolutely continuous random matrix. Let its support be the set of all K K symmetric and positive de…nite real matrices: R W = w 2 RK

K

: w is symmetric and positive de…nite

Let n be a constant such that n > K 1 and let H be a symmetric and positive de…nite matrix. We say that W has a Wishart distribution with parameters n 1 See

p. p. 3 See p. 4 See p. 2 See

397. 387. 119. 439.

461

462

CHAPTER 55. WISHART DISTRIBUTION

and H if its joint probability density function5 is n=2 (K+1)=2

fW (w) = c [det (w)] where c= and

n=2

2nK=2 [det (H)]

K(K

() is the Gamma function6 .

exp

nn=2 QK 1)=4

j=1

n tr H 2

1

(n=2 + (1

w

j) =2)

The parameter n needs not be an integer, but, when n is not an integer, W can no longer be interpreted as a sum of outer products of multivariate normal random vectors.

55.2

Relation to the MV normal distribution

The following proposition provides the link between the multivariate normal distribution and the Wishart distribution: Proposition 277 Let X1 ; : : : ; Xn be n independent K 1 random vectors all having a multivariate normal distribution with mean 0 and covariance matrix n1 H. Let K n. De…ne: n X W = Xi Xi> i=1

Then W has a Wishart distribution with parameters n and H.

Proof. The proof of this proposition is quite lengthy and complicated. The interested reader might have a look at Ghosh and Sinha7 (2002).

55.3

Expected value

The expected value of a Wishart random matrix W is E [W ] = H Proof. We do not provide a fully general proof, but we prove this result only for the special case in which n is integer and W can be written as W =

n X

Xi Xi>

i=1

(see subsection 55.2 above). In this case: " n # X > E [W ] = E Xi Xi i=1

5 See

p. 117. p. 55. 7 Ghosh, M. and Sinha, B. K. (2002) "A simple derivation of the Wishart distribution", The American Statistician, 56, 100-101. 6 See

55.4. COVARIANCE MATRIX

=

463

n X

E Xi Xi>

i=1

A

=

n X

Var [Xi ] + E [Xi ] E Xi>

i=1

B

= =

n X i=1 n X i=1

Var [Xi ] 1 1 H=n H=H n n

where: in step A we have used the fact that the covariance matrix of X can be written as8 Var [Xi ] = E Xi Xi> E [Xi ] E Xi> and in step B we have used the fact that E [Xi ] = 0.

55.4

Covariance matrix

The concept of covariance matrix is well-de…ned only for random vectors. However, when dealing with a random matrix, one might want to compute the covariance matrix of its associated vectorization (if you are not familiar with the concept of vectorization, see the review of matrix algebra below for a de…nition). Therefore, in the case of a Wishart random matrix W , we might want to compute the following covariance matrix: Var [vec (W )] Since vec (W ), the vectorization of W , is a K 2 1 random vector, V is a K 2 K 2 matrix. It is possible to prove that: Var [vec (W )] =

1 I + Pvec(W ) (H n

H)

where denotes the Kronecker product and Pvec(W ) is the transposition-permutation matrix associated to vec (W ). Proof. The proof of this formula can be found in Muirhead9 (2005). There is a simpler expression for the covariances between the diagonal entries of W : 2 2 Cov [Wii ; Wjj ] = Hij n Proof. Again, we do not provide a fully general proof, but we prove this result only for the special case in which n is integer and W can be written as: W =

n X

Xi Xi>

i=1

(see above). To compute this covariance, we …rst need to compute the following fourth cross-moment: 2 2 E Xsi Xsj 8 See

p. 190. R.J. (2005) Aspects of multivariate statistical theory, Wiley.

464

CHAPTER 55. WISHART DISTRIBUTION

where Xsi denotes the i-th component (i = 1; : : : ; K) of the random vector Xs (s = 1; : : : ; n). This cross-moment can be computed by taking the fourth crosspartial derivative of the joint moment generating function10 of Xsi and Xsj and evaluating it at zero. While this is not complicated, the algebra is quite tedious. I recommend doing it with computer algebra, for example utilizing the Matlab Symbolic Toolbox and the following four commands: syms t1 t2 s1 s2 s12; f=exp(0.5*(s1^2)*(t1^2)+0.5*(s2^2)*(t2^2)+s12*t1*t2); d4f=diff(diff(f,t1,2),t2,2); subs(d4f,{t1,t2},{0,0}) The result of the computations is 2 2 E Xsi Xsj

2

=

Var [Xsi ] Var [Xsj ] + 2 (Cov [Xsi ; Xsj ]) 1 Hii n

=

1 Hjj n

+2

1 Hij n

2

Using this result, the covariance between Wii and Wjj is derived as follows: Cov [Wii ; Wjj ] " n # n X X 2 2 Xsi ; Xtj = Cov s=1

A

=

n X n X

t=1

2 2 Cov Xsi ; Xtj

s=1 t=1

B

=

n X

2 2 Cov Xsi ; Xsj

s=1

C

=

n X

2 2 E Xsi Xsj

2 2 E Xsi E Xsj

s=1

= =

n X

s=1

=

n X s=1

= n = =

E

s=1 " n X

2 2 Xsi Xsj

n X

2 2 E Xsi E Xsj

s=1

1 Hii n

1 Hjj n

+2

1 Hij n

1 2 Hii Hjj + 2Hij n 2 2 H n ij

#

n X s=1

1 Hii n

1 Hjj n

n X 1 H H 2 ii jj n s=1

1 1 2 Hii Hjj + 2 2Hij n2 n

1 1 2 Hii Hjj + 2 2Hij 2 n n

2

n

1 Hii Hjj n2

Hii Hjj

where: in step A we have used the bilinearity of covariance; in step B we have used the fact that 2 2 Cov Xsi ; Xtj = 0 for s 6= t 1 0 See

p. 297.

55.5. REVIEW OF SOME FACTS IN MATRIX ALGEBRA

465

in step C we have used the usual covariance formula11 .

55.5

Review of some facts in matrix algebra

55.5.1

Outer products

As the Wishart distribution involves outer products of multivariate normal random vectors, we brie‡y review here the concept of outer product. If X is a K 1 column vector, the outer product of X with itself is the K K matrix A obtained from the multiplication of X with its transpose: A = XX > Example 278 If X is the 2

1 random vector X1 X2

X= then its outer product XX > is the 2 XX >

55.5.2 AK

2 random matrix

=

X1 X2

X1

=

X12 X2 X1

X1 X2 X22

X2

Symmetric matrices

K matrix A is symmetric if and only if A = A>

i.e. if and only if A equals its transpose.

55.5.3 AK

Positive de…nite matrices

K matrix A is said to be positive de…nite if and only if x> Ax > 0

for any K 1 real vector x such that x 6= 0. All positive de…nite matrices are also invertible. Proof. The proof is by contradiction. Suppose a positive de…nite matrix A were not invertible. Then A would not be full rank, i.e. there would be a vector x 6= 0 such that Ax = 0 which, premultiplied by x> , would yield x> Ax = x> 0 = 0 But this is a contradiction. 1 1 See

p. 164.

466

55.5.4

CHAPTER 55. WISHART DISTRIBUTION

Trace of a matrix

Let A be a K K matrix and denote by Aij the (i; j)-th entry of A (i.e. the entry at the intersection of the i-th row and the j-th column). The trace of A, denoted by tr (A), is the sum of all the diagonal entries of A: tr (A) =

K X

Aii

i=1

55.5.5

Vectorization of a matrix

Given a K L matrix A, its vectorization, denoted by vec (A), is the KL vector obtained by stacking the columns of A on top of each other. Example 279 If A is a 2

1

2 matrix: A=

the vectorization of A is the 4

A11 A21

A12 A22

1 random vector: 2 3 A11 6 A21 7 7 vec (A) = 6 4 A12 5 A22

For a given matrix A, the vectorization of A will in general be di¤erent from the vectorization of its transpose A> . The transposition permutation matrix associated to vec (A) is the KL KL matrix Pvec(A) such that: vec A> = Pvec(A) vec (A)

55.5.6

Kronecker product

Given a K L matrix A and a M N matrix B, the Kronecker product of A and B, denoted by A B, is a KM LN matrix having the following structure: 2 3 A11 B A12 B : : : A1N B 6 A21 B A22 B : : : A2N B 7 6 7 A B=6 7 .. .. .. 4 5 . . . AM 1 B

where Aij is the (i; j)-th entry of A.

AM 2 B

: : : AM N B

Part V

467

Chapter 56

Linear combinations of normals One property that makes the normal distribution extremely tractable from an analytical viewpoint is its closure under linear combinations: the linear combination of two independent random variables having a normal distribution also has a normal distribution. This lecture presents a multivariate generalization of this elementary property and then discusses some special cases.

56.1

Linear transformation of a MV-N vector

A linear transformation of a multivariate normal random vector1 also has a multivariate normal distribution, as illustrated by the following: Proposition 280 Let X be a K 1 multivariate normal random vector with mean and covariance matrix V . Let A be an L 1 real vector and B an L K full-rank real matrix. Then the L 1 random vector Y de…ned by Y = A + BX has a multivariate normal distribution with mean E [Y ] = A + B and covariance matrix

Var [Y ] = BV B |

Proof. This is proved using the formula for the joint moment generating function of the linear transformation of a random vector2 . The joint moment generating function of X is 1 MX (t) = exp t> + t> V t 2 Therefore, the joint moment generating function of Y is MY (t) 1 See 2 See

=

exp t> A MX B > t

p. 439. p. 301.

469

470

CHAPTER 56. LINEAR COMBINATIONS OF NORMALS 1 = exp t> A exp t> B + t> BV B | t 2 1 = exp t> (A + B ) + t> BV B | t 2

which is the moment generating function of a multivariate normal distribution with mean A + B and covariance matrix BV B | . Note that BV B | needs to be positive de…nite in order to be the covariance matrix of a proper multivariate normal distribution, but this is implied by the assumption that B is full-rank. Therefore, Y has a multivariate normal distribution with mean A + B and covariance matrix BV B | , because two random vectors have the same distribution when they have the same joint moment generating function. The following subsections present some important special cases of the above property.

56.1.1

Sum of two independent variables

The sum of two independent normal random variables has a normal distribution, as stated in the following: Proposition 281 Let X1 be a random variable having a normal distribution with mean 1 and variance 21 . Let X2 be a random variable, independent of X1 , having a normal distribution with mean 2 and variance 22 . Then, the random variable Y de…ned as: Y = X1 + X2 has a normal distribution with mean E [Y ] =

+

1

2

and variance Var [Y ] =

2 1

2 2

+

Proof. First of all, we need to use the fact that mutually independent normal random variables are jointly normal3 : the 2 1 random vector X de…ned as |

X = [X1 X2 ]

has a multivariate normal distribution with mean E [X] = [ and covariance matrix Var [X] =

| 2]

1

2 1

0

0 2 2

We can write: Y = X1 + X2 = BX where B = [1 1] 3 See

p. 446.

56.1. LINEAR TRANSFORMATION OF A MV-N VECTOR

471

Therefore, according to Proposition 280, Y has a normal distribution with mean E [Y ] = BE [X] = and variance

56.1.2

1

+

Var [Y ] = BVar [X] B | =

2

2 1

+

2 2

Sum of more than two independent variables

The sum of more than two independent normal random variables also has a normal distribution, as shown in the following: Proposition 282 Let X1 ; : : : ; XK be K mutually independent normal random variables, having means 1 ; : : : ; K and variances 21 ; : : : ; 2K . Then, the random variable Y de…ned as: K X Y = Xi i=1

has a normal distribution with mean

E [Y ] =

K X

i

i=1

and variance Var [Y ] =

K X

2 i

i=1

Proof. This can be obtained, either generalizing the proof of Proposition 281, or using Proposition 281 recursively (starting from the …rst two components of X, then adding the third one and so on).

56.1.3

Linear combinations of independent variables

The properties illustrated in the previous two examples can be further generalized to linear combinations of K mutually independent normal random variables: Proposition 283 Let X1 ; : : : ; XK be K mutually independent normal random variables, having means 1 ; : : : ; K and variances 21 ; : : : ; 2K . Let b1 ; : : : ; bK be K constants. Then, the random variable Y de…ned as: Y =

K X

bi X i

i=1

has a normal distribution with mean E [Y ] =

K X

bi

i

i=1

and variance Var [Y ] =

K X i=1

b2i

2 i

472

CHAPTER 56. LINEAR COMBINATIONS OF NORMALS

Proof. First of all, we need to use the fact that mutually independent normal random variables are jointly normal: the K 1 random vector X de…ned as |

X = [X1 : : : XK ] has a multivariate normal distribution with mean E [X] = [

1

| K]

:::

and covariance matrix 2

We can write:

6 6 Var [X] = 6 4 Y =

2 1

0 .. . 0

K X

0

:::

2 2

.. . 0

0 0 .. .

:::

..

. :::

2 K

3 7 7 7 5

bi Xi = BX

i=1

where B = [b1 : : : bK ] Therefore, according to Proposition 280, Y has a (multivariate) normal distribution with mean: K X E [Y ] = BE [X] = bi i i=1

and variance: Var [Y ] = BVar [X] B | =

K X

b2i

2 i

i=1

56.1.4

Linear transformation of a variable

A special case of Proposition 280 obtains when X has dimension 1 random variable):

1 (i.e. it is a

Proposition 284 Let X be a normal random variable with mean and variance 2 . Let a and b be two constants (with b 6= 0). Then the random variable Y de…ned by: Y = a + bX has a normal distribution with mean E [Y ] = a + b and variance Var [Y ] = b2

2

Proof. This is just a special case (for K = 1) of Proposition 280.

56.2. SOLVED EXERCISES

56.1.5

473

Linear combinations of independent vectors

The property illustrated in Example 3 can be generalized to linear combinations of mutually independent normal random vectors. Proposition 285 Let X1 ; : : : ; Xn be n mutually independent K 1 normal random vectors, having means 1 ; : : : ; n and covariance matrices V1 ; : : : ; Vn . Let B1 ; : : : ; Bn be n real L K full-rank matrices. Then, the L 1 random vector Y de…ned as: n X Y = B i Xi i=1

has a normal distribution with mean

E [Y ] =

n X

Bi

i

i=1

and covariance matrix Var [Y ] =

n X

Bi Vi Bi|

i=1

Proof. This is a consequence of the fact that mutually independent normal random vectors are jointly normal: the Kn 1 random vector X de…ned as |

X = [X1| : : : Xn| ] has a multivariate normal distribution with mean E [X] = [ and covariance matrix

2

6 6 Var [X] = 6 4

| 1

| | n]

:::

V1 0 .. .

0 V2 .. .

0

0

::: :::

..

0 0 .. .

. : : : Vn

3 7 7 7 5

Therefore, we can apply Proposition 280 to the vector X.

56.2

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let X = [X1 be a 2

>

X2 ]

1 normal random vector with mean = [1

>

3]

and covariance matrix 2 1 1 3 Find the distribution of the random variable Z de…ned as V =

Z = X1 + 2X2

474

CHAPTER 56. LINEAR COMBINATIONS OF NORMALS

Solution We can write Z = BX where B = [1

2]

Being a linear transformation of a multivariate normal random vector, Z is also multivariate normal. Actually, it is univariate normal, because it is a scalar. Its mean is E [Z] = B = 1 1 + 2 3 = 7 and its variance is Var [Z]

= BV B > = [1

2]

2 1

2 1+1 2 1 1+3 2 = 1 4 + 2 7 = 18

=

[1

2]

1 3

1 2

= [1

2]

4 7

Exercise 2 Let X1 , . . . , Xn be n mutually independent standard normal random variables. Let b 2 (0; 1) be a constant. Find the distribution of the random variable Y de…ned as n X Y = bi X i i=1

Solution Being a linear combination of mutually independent normal random variables, Y has a normal distribution with mean E [Y ] =

n X

bi E [Xi ] =

i=1

n X

bi 0 = 0

i=1

and variance Var [Y ]

= =

A

=

n X i=1 n X

i=1 n X

bi

2

b2

i

Var [Xi ] 1

ci

i=1

= c + c2 + : : : + cn = c 1 + c + : : : + cn = c 1 + c + : : : + cn =

c 1

c

(1

cn )

1 1

1 1

c c

56.2. SOLVED EXERCISES

475 = =

cn+1 1 c b2 b2n+2 1 b2 c

where: in step A we have de…ned c = b2 .

476

CHAPTER 56. LINEAR COMBINATIONS OF NORMALS

Chapter 57

Partitioned multivariate normal vectors Let X be a K 1 multivariate normal random vector1 with mean and covariance matrix V . In this lecture we present some useful facts about partitionings of X, that is, about subdivisions of X into two sub-vectors Xa and Xb such that X=

Xa Xb

where Xa and Xb have dimensions Ka 1 and Kb 1 respectively and Ka +Kb = K.

57.1

Notation

In what follows, we will denote by a

= E [Xa ] the mean of Xa ;

b

= E [Xb ] the mean of Xb ;

Va = Var [Xa ] the covariance matrix of Xa ; Vb = Var [Xb ] the covariance matrix of Xb ; Vab = Cov [Xa ; Xb ] the cross-covariance2 between Xa and Xb . Given this notation, it follows that a

=

b

and V = 1 See 2 See

Va Vab

p. 439. p. 193.

477

| Vab Vb

478

CHAPTER 57. PARTITIONED MULTIVARIATE NORMAL VECTORS

57.2

Normality of the sub-vectors

The following proposition states a …rst fact about the two sub-vectors. Proposition 286 Both Xa and Xb have a multivariate normal distribution, i.e., Xa Xb

N( N(

a ; Va ) b ; Vb )

Proof. The random vector Xa can be written as a linear transformation of X: Xa = AX where A is a Ka K matrix whose entries are either zero or one. Thus, Xa has a multivariate normal distribution, because it is a linear transformation of the multivariate normal random vector X, and multivariate normality is preserved3 by linear transformations. By the same token, also Xb has a multivariate normal distribution, because it can be written as a linear transformation of X: Xb = BX where B is a Kb K matrix whose entries are either zero or one. This, of course, implies that any sub-vector of X is multivariate normal when X is multivariate normal.

57.3

Independence of the sub-vectors

The following proposition states a necessary and su¢ cient condition for the independence of the two sub-vectors. Proposition 287 Xa and Xb are independent if and only if Vab = 0. Proof. Xa and Xb are independent if and only if their joint moment generating function is equal to the product of their individual moment generating functions4 . Since Xa is multivariate normal, its joint moment generating function is MXa (ta ) = exp t> a

a

1 + t> V a ta 2 a

The joint moment generating function of Xb is MXb (tb ) = exp t> b

b

1 + t> V b tb 2 b

The joint moment generating function of Xa and Xb , which is just the joint moment generating function of X, is MXa ;Xb (ta ; tb ) = MX (t) 3 See 4 See

p. 469. p. 297.

57.3. INDEPENDENCE OF THE SUB-VECTORS =

1 exp t> + t> V t 2

=

exp

=

exp t> a

=

exp t> a

=

exp t> a

| 1 > > Va Vab ta tb [ta tb ] Vab Vb 2 b 1 > 1 > 1 > 1 > > > a + tb b + ta Va ta + tb Vb tb + tb Vab ta + ta Vab tb 2 2 2 2 1 > 1 > > > a + ta Va ta + tb b + tb Vb tb + tb Vab ta 2 2 1 > 1 > > > a + ta Va ta exp tb b + tb Vb tb exp tb Vab ta 2 2

> t> a tb

a

+

= MXa (ta ) MXb (tb ) exp t> b Vab ta from which it is obvious that MXa ;Xb (ta ; tb ) = MXa (ta ) MXb (tb ) if and only if Vab = 0.

479

480

CHAPTER 57. PARTITIONED MULTIVARIATE NORMAL VECTORS

Chapter 58

Quadratic forms in normal vectors Let X be a K 1 multivariate normal random vector1 with mean and covariance matrix V . In this lecture we present some useful facts about quadratic forms in X, i.e. about forms of the kind Q = X > AX where A is a K K matrix and > denotes transposition. Before discussing quadratic forms in X, we review some facts about matrix algebra that are needed to understand this lecture.

58.1

Review of relevant facts in matrix algebra

58.1.1

Orthogonal matrices

AK

K real matrix A is orthogonal if A> = A

1

which also implies A> A = AA> = I where I is the identity matrix. Of course, if A is orthogonal also A> is orthogonal. An important property of orthogonal matrices is the following: Proposition 288 Le X be a K 1 standard multivariate normal random vector, i.e. X N (0; I). Let A be an orthogonal K K real matrix. De…ne Y = AX Then also Y has a standard multivariate normal distribution, i.e. Y

N (0; 1).

Proof. The random vector Y has a multivariate normal distribution, because it is a linear transformation of another multivariate normal random vector (see the 1 See

p. 439.

481

482

CHAPTER 58. QUADRATIC FORMS IN NORMAL VECTORS

lecture entitled Linear combinations of normal random variables - p. 469). Y is standard normal because its expected value is E [Y ] = E [AX] = AE [X] = A0 = 0 and its covariance matrix is = Var [AX] = AVar [X] A| = AIA| = AA| = I

Var [Y ]

where the de…nition of orthogonal matrix has been used in the last step.

58.1.2 AK

Symmetric matrices

K real matrix A is symmetric if A = A>

i.e. A equals its transpose. Real symmetric matrices have the property that they can be decomposed as A = P DP > where P is an othogonal matrix and D is a diagonal matrix (i.e. a matrix whose o¤-diagonal entries are zero). The diagonal elements of D, which are all real, are the eigenvalues of A and the columns of P are the eigenvectors of A.

58.1.3 AK

Idempotent matrices

K real matrix A is idempotent if A2 = A

which implies An = A for any n 2 N.

58.1.4

Symmetric idempotent matrices

If a matrix A is both symmetric and idempotent then its eigenvalues are either zero or one. In other words, the diagonal entries of the diagonal matrix D in the decomposition A = P DP > are either zero or one. Proof. This can be easily seen as follows: P DP >

= = = =

A = An = A : : : A P DP > : : : P DP > P D(P > P ) : : : (P > P )DP > P Dn P >

which implies D = Dn

8n 2 N

But this is possible only if the diagonal entries of D are either zero or one.

58.2. QUADRATIC FORMS IN NORMAL VECTORS

58.1.5

483

Trace of a matrix

Let A be a K K real matrix and denote by Aij the (i; j)-th entry of A (i.e. the entry at the intersection of the i-th row and the j-th column). The trace of A, denoted by tr (A) is K X tr (A) = Aii i=1

In other words, the trace is equal to the sum of all the diagonal entries of A. The trace of A enjoys the following important property: tr (A) =

K X

i

i=1

where

58.2

1; : : : ;

K

are the K eigenvalues of A.

The following proposition shows that certain quadratic forms in standard normal random vectors have a Chi-square distribution2 . Proposition 289 Le X be a K 1 standard multivariate normal random vector, i.e. X N (0; I). Let A be a symmetric and idempotent matrix. Let tr (A) be the trace of A. De…ne: Q = X > AX Then Q has a Chi-square distribution with tr (A) degrees of freedom. Proof. Since A is symmetric, it can be decomposed as A = P DP > where P is orthogonal and D is diagonal. The quadratic form can be written as Q = X > AX = X > P DP > X =

P >X

>

D P > X = Y > DY

where we have de…ned Y = P >X By the above theorem on orthogonal transformations of standard multivariate normal random vectors, the orthogonality of P > implies that Y N (0; 1). Since D is diagonal, we can write the quadratic form as Q = Y > DY =

K X

Djj Yj2

j=1

where Yj is the j-th component of Y and Djj is the j-th diagonal entry of D. Since A is symmetric and idempotent, the diagonal entries of D are either zero or one. Denote by J the set J = fj K : Djj = 1g 2 See

p. 387.

484

CHAPTER 58. QUADRATIC FORMS IN NORMAL VECTORS

and by r its cardinality, i.e. the number of diagonal entries of D that are equal to 1. Since Dij 6= 1 ) Dij = 0, we can write Q=

K X

Djj Yj2 =

j=1

X

Djj Yj2 =

j2J

X

Yj2

j2J

But the components of a standard normal random vector are mutually independent standard normal random variables. Therefore, Q is the sum of the squares of r independent standard normal random variables. Hence, it has a Chi-square distribution with r degrees of freedom3 . Finally, by the properties of idempotent matrices and of the trace of a matrix (see above), r is not only the sum of the number of diagonal entries of D that are equal to 1, but it is also the sum of the eigenvalues of A. Since the trace of a matrix is equal to the sum of its eigenvalues, then r = tr (A).

58.3

We start this section with a proposition on independence between linear transformations: Proposition 290 Le X be a K 1 standard multivariate normal random vector, i.e. X N (0; I). Let A be a LA K matrix and B be a LB K matrix. De…ne: T1 T2

= AX = BX

Then T1 and T2 are two independent random vectors if and only if AB | = 0. Proof. First of all, note that T1 and T2 are linear transformations of the same multivariate normal random vector X. Therefore, they are jointly normal (see the lecture entitled Linear combinations of normal random variables - p. 469). Their cross-covariance4 is h i > Cov [X; Y ] = E (T1 E [T1 ]) (T2 E [T2 ]) h i > = E (AX E [AX]) (BX E [BX]) h i A = AE (X E [X]) (X E [X])> B | B

= AVar [X] B |

C

= AB |

where: in step A we have used the linearity of the expected value; in step B we have used the de…nition of covariance matrix; in step C we have used the fact that Var [X] = I. But, as we explained in the lecture entitled Partitioned multivariate normal vectors (p. 477), two jointly normal random vectors are independent if and only if their cross-covariance is equal to 0. In our case, the cross-covariance is equal to zero if and only if AB | = 0, which proves the proposition. 3 See 4 See

p. 395. p. 193.

58.4. EXAMPLES

485

The following proposition gives a necessary and su¢ cient condition for the independence of two quadratic forms in the same standard multivariate normal random vector. Proposition 291 Le X be a K 1 standard multivariate normal random vector, i.e. X N (0; I). Let A and B be two K K symmetric and idempotent matrices. De…ne Q1 Q2

= X > AX = X > BX

Then Q1 and Q2 are two independent random variables if and only if AB = 0. Proof. Since A and B are symmetric and idempotent, we can write >

Q1

=

(AX) (AX)

Q2

= (BX) (BX)

>

from which it is apparent that Q1 and Q2 can be independent only as long as AX and BX are independent. But, by the above proposition on the independence between linear transformations of jointly normal random vectors, AX and BX are independent if and only if AB | = 0. Since B is symmetric, this is the same as AB = 0. The following proposition gives a necessary and su¢ cient condition for the independence between a quadratic form and a linear transformation involving the same standard multivariate normal random vector. Proposition 292 Le X be a K 1 standard multivariate normal random vector, i.e. X N (0; I). Let A be a L K vector and B a K K symmetric and idempotent matrix. De…ne T = AX Q = X > BX Then T and Q are independent if and only if AB = 0. Proof. Since B is symmetric and idempotent, we can write T = AX > Q = (BX) (BX) from which it is apparent that T and Q can be independent only as long as AX and BX are independent. But, by the above proposition on the independence between linear transformations of jointly normal random vectors, AX and AB are independent if and only if AB | = 0. Since B is symmetric, this is the same as AB = 0.

58.4

Examples

We discuss here some quadratic forms that are commonly found in statistics.

486

CHAPTER 58. QUADRATIC FORMS IN NORMAL VECTORS

Sample variance as a quadratic form Let X1 , . . . , Xn be n independent random variables, all having a normal distribution with mean and variance 2 . Let their sample mean5 X n be de…ned as n

Xn =

1X Xi n i=1

and their adjusted sample variance6 be de…ned as s2 =

1 n

1

n X

Xi

Xn

2

i=1

De…ne the following matrix: 1 > {{ n

M =I

where I is the n-dimensional identity matrix other words, M has the following structure: 2 1 1 n1 n 1 6 1 n1 n 6 M =6 4 1 n

1 n

and { is a n 1 n 1 n

::: ::: .. . ::: 1

1 n

1 vector of ones. In

3 7 7 7 5

M is a symmetric matrix. By computing the product M M , it can also be easily veri…ed that M is idempotent. Denote by X the n 1 random vector whose i-th entry is equal to Xi and note that X has a multivariate normal distribution with mean { and covariance matrix 2 I . The matrix M can be used to write the sample variance as s2

= = =

A B

= =

1 n

1 1

n

1 1

n

1 1

n

1 1

n

1

n X

Xi

Xn

2

i=1

>

(M X) M X X >M >M X X >M M X X >M X

where: in step A we have used the fact that M is symmetric; in step B we have used the fact that M is idempotent. Now de…ne a new random vector Z= 5 See 6 See

p. 573. p. 583.

1

(X

{)

58.4. EXAMPLES

487

and note that Z has a standard (mean zero and covariance I) multivariate normal distribution (see the lecture entitled Linear combinations of normal random variables - p. 469). The sample variance can be written as s2

= =

1 n

1 1

n

1 2

=

n

X >M X (X

>

{ + {) M (X

X

{

Z+

{

1 2

=

n

1 2

=

n

1

+ >

>

{

X

M

M Z+

Z >M Z +

{ + {) {

+

{

{

Z >M { +

{> M Z +

2

{> M {

The last three terms in the sum are equal to zero, because M{ = 0 which can be veri…ed by directly performing the multiplication of M and {. Therefore, the sample variance s2 =

2

n

1

Z >M Z

is proportional to a quadratic form in a standard normal random vector (Z > M Z) and the quadratic form is obtained from a symmetric and idempotent matrix (M ). Thanks to the propositions above, we know that the quadratic form Z > M Z has a Chi-square distribution with tr (M ) degrees of freedom, where tr (M ) is the trace of M . But the trace of M is tr (M )

=

n X

Mii =

i=1

n X

1 n

1

i=1

1 n

= n 1

=n

1

So, the quadratic form Z > M Z has a Chi-square distribution with n 1 degrees of freedom. Multiplying a Chi-square random variable with n 1 degrees of freedom 2 by n 1 one obtains a Gamma random variable with parameters n 1 and 2 (see the lecture entitled Gamma distribution - p. 403 - for more details). So, summing up, the adjusted sample variance s2 has a Gamma distribution with parameters n 1 and 2 . Futhermore, the adjusted sample variance s2 is independent of the sample mean X n , which is proved as follows. The sample mean can be written as Xn =

1 > { X n

and the sample variance can be written as s2 =

1 n

1

X >M X

488

CHAPTER 58. QUADRATIC FORMS IN NORMAL VECTORS

Using Proposition 292, verifying the independence of X n and s2 boils down to verifying that 1 1 > { M =0 n n 1 which can be easily checked by directly performing the multiplication of {> and M.

Part VI

Asymptotic theory

489

Chapter 59

Sequences of random variables One of the central topics in probability theory and statistics is the study of sequences of random variables, i.e. of sequences1 fXn g whose generic element Xn is a random variable. There are several reasons why sequences of random variables are important: 1. Often, we need to analyze a random variable X, but for some reasons X is too complex to analyze directly. What we usually do in this case is to approximate X by simpler random variables Xn that are easier to study; these approximating random variables are arranged into a sequence fXn g and they become better and better approximations of X as n increases. This is exactly what we did when we introduced the Lebesgue integral2 . 2. In statistical theory, Xn is often an estimate of an unknown quantity whose value and whose properties depend on the sample size n (the sample size is the number of observations used to compute the estimate). Usually, we are able to analyze the properties of Xn only asymptotically (i.e. when n tends to in…nity). In this case, fXn g is a sequence of estimates and we analyze the properties of the limit of fXn g, in the hope that a large sample (the one we observe) and an in…nite sample (the one we analyze by taking the limit of Xn ) have a similar behavior. 3. In many applications a random variable is observed repeatedly through time (for example, the price of a stock is observed every day). In this case fXn g is the sequence of observations on the random variable and n is a time-index (in the stock price example, Xn is the price observed in the n-th period).

59.1

Terminology

In this lecture, we introduce some terminology related to sequences of random variables. 1 See 2 See

p. 31. p. 141.

491

492

59.1.1

CHAPTER 59. SEQUENCES OF RANDOM VARIABLES

Realization of a sequence

Let fxn g be a sequence of real numbers and fXn g a sequence of random variables. If the real number xn is a realization3 of the random variable Xn for every n, then we say that the sequence of real numbers fxn g is a realization of the sequence of random variables fXn g and we write fXn g = fxn g

59.1.2

Sequences on a sample space

Let be a sample space4 . Let fXn g be a sequence of random variables. We say that fXn g is a sequence of random variables de…ned on the sample space if all the random variables Xn belonging to the sequence fXn g are functions from to R.

59.1.3

Independent sequences

Let fXn g be a sequence of random variables de…ned on a sample space . We say that fXn g is an independent sequence of random variables (or a sequence of independent random variables) if every …nite subset of fXn g (i.e. every …nite set of random variables belonging to the sequence) is a set of mutually independent random variables5 .

59.1.4

Identically distributed sequences

Let fXn g be a sequence of random variables. Denote by Fn (x) the distribution function6 of a generic element of the sequence Xn . We say that fXn g is a sequence of identically distributed random variables if any two elements of the sequence have the same distribution function: 8x 2 R; 8i; j 2 N; Fi (x) = Fj (x)

59.1.5

IID sequences

Let fXn g be a sequence of random variables de…ned on a sample space . We say that fXn g is a sequence of independent and identically distributed random variables (or an IID sequence of random variables), if fXn g is both a sequence of independent random variables (see 59.1.3) and a sequence of identically distributed random variables (see 59.1.4).

59.1.6

Stationary sequences

Let fXn g be a sequence of random variables de…ned on a sample space . Take a …rst group of q successive terms of the sequence Xn+1 , . . . , Xn+q . Now take a 3 See

p. p. 5 See p. 6 See p. 4 See

105. 69. 233. 118.

59.1. TERMINOLOGY

493

second group of q successive terms of the sequence Xn+k+1 , . . . , Xn+k+q . The second group is located k positions after the …rst group. Denote the joint distribution function7 of the …rst group of terms by Fn+1;:::;n+q (x1; : : : ; xq ) and the joint distribution function of the second group of terms by Fn+k+1;:::;n+k+q (x1; : : : ; xq ) The sequence fXn g is said to be stationary (or strictly stationary) if and only if Fn+1;:::;n+q (x1; : : : ; xq ) = Fn+k+1;:::;n+k+q (x1; : : : ; xq ) for any n; k; q 2 N and for any vector (x1; : : : ; xq ) 2 Rq . In other words, a sequence is strictly stationary if and only if the two random vectors [Xn+1 : : : Xn+q ] and [Xn+k+1 : : : Xn+k+q ] have the same distribution for any n, k and q. Requiring strict stationarity is weaker than requiring that a sequence be IID (see 59.1.5): if fXn g is an IID sequence, then it is also strictly stationary, while the converse is not necessarily true.

59.1.7

Weakly stationary sequences

Let fXn g be a sequence of random variables de…ned on a sample space . We say that fXn g is a covariance stationary sequence (or weakly stationary sequence) if 9 2 R : E [Xn ] = ; 8n > 0 8j 0; 9 j 2 R : Cov [Xn ; Xn

j]

=

j ; 8n

>j

(1) (2)

where n and j are, of course, integers. Property (1) means that all the random variables belonging to the sequence fXn g have the same mean. Property (2) means that the covariance between a term Xn of the sequence and the term that is located j positions before it (Xn j ) is always the same, irrespective of how Xn has been chosen. In other words, Cov [Xn ; Xn j ] depends only on j and not on n. Property (2) also implies that all the random variables in the sequence have the same variance8 : 9

0

2 R : Var [Xn ] =

0 ; 8n

2N

Strictly stationarity (see 59.1.6) implies weak stationarity only if the mean E [Xn ] and all the covariances Cov [Xn ; Xn j ] exist and are …nite. Covariance stationarity does not imply strict stationarity, because the former imposes restrictions only on …rst and second moments, while the latter imposes restrictions on the whole distribution. 7 See

p. 118.

8 Remember

that Cov [Xn ; Xn ] = Var [Xn ].

494

CHAPTER 59. SEQUENCES OF RANDOM VARIABLES

59.1.8

Mixing sequences

Let fXn g be a sequence of random variables de…ned on a sample space . Intuitively, fXn g is a mixing sequence if any two groups of terms of the sequence that are far apart from each other are approximately independent (and the further the closer to being independent). Take a …rst group of q successive terms of the sequence Xn+1 , . . . , Xn+q . Now take a second group of q successive terms of the sequence Xn+k+1 , . . . , Xn+k+q . The second group is located k positions after the …rst group. The two groups of terms are independent if and only if E [f (Xn+1 ; : : : ; Xn+q ) g (Xn+k+1 ; : : : ; Xn+k+q )] = E [f (Xn+1 ; : : : ; Xn+q )] E [g (Xn+k+1 ; : : : ; Xn+k+q )]

(59.1)

for any two functions f and g. This is just the de…nition of independence9 between the two random vectors [Xn+1 : : : Xn+q ] (59.2) and [Xn+k+1 : : : Xn+k+q ]

(59.3)

Trivially, condition (59.1) can be written as E [f (Xn+1 ; : : : ; Xn+q ) g (Xn+k+1 ; : : : ; Xn+k+q )] E [f (Xn+1 ; : : : ; Xn+q )] E [g (Xn+k+1 ; : : : ; Xn+k+q )] = 0 If this condition is true asymptotically (i.e. when k ! 1), then we say that the sequence fXn g is mixing: De…nition 293 We say that a sequence of random variables fXn g is mixing (or strongly mixing) if and only if limk!1

fE [f (Xn+1 ; : : : ; Xn+q ) g (Xn+k+1 ; : : : ; Xn+k+q )] E [f (Xn+1 ; : : : ; Xn+q )] E [g (Xn+k+1 ; : : : ; Xn+k+q )]g = 0

for any two functions f and g and for any n and q. In other words, a sequence is strongly mixing if and only if the two random vectors (59.2) and (59.3) tend to become more and more independent by increasing k (for any n and q). This is a milder requirement than the requirement of independence: if fXn g is an independent sequence (see 59.1.3), all its terms are independent from one another; if fXn g is a mixing sequence, its terms can be dependent, but they become less and less dependent as the distance between their locations in the sequence increases. Of course, an independent sequence is also a mixing sequence, while the converse is not necessarily true.

59.1.9

Ergodic sequences

In this section we discuss ergodicity. Roughly speaking, ergodicity is a weak concept of independence for sequences of random variables. In the subsections above we have discussed other two concepts of independence for sequences of random variables: 9 See

p. 235.

59.2. LIMIT OF A SEQUENCE OF RANDOM VARIABLES

495

1. independent sequences are sequences of random variables whose terms are mutually independent; 2. mixing sequences are sequences of random variables whose terms can be dependent but become less and less dependent as their distance increases (by distance we mean how far apart they are located in the sequence). Requiring that a sequence be mixing is weaker than requiring that a sequence be independent: in fact, an independent sequence is also mixing, but the converse is not true. Requiring that a sequence be ergodic is even weaker than requiring that a sequence be mixing (mixing implies ergodicity but not vice versa). This is probably all you need to know if you are not studying asymptotic theory at an advanced level, because ergodicity is quite a complicated topic and the de…nition of ergodicity is fairly abstract. Nevertheless, we give here a quick de…nition of ergodicity for the sake of completeness. Denote by RN the set of all possible sequences of real numbers. When fxn g is a sequence of real numbers, denote by fxn gn>1 the subsequence obtained by dropping the …rst term of fxn g, i.e. fxn gn>1 = fx2 ; x3 ; : : :g We say that a subset A RN is a shift invariant set if and only if fxn gn>1 belongs to A whenever fxn g belongs to A. In symbols: De…nition 294 A set A

RN is shift invariant if and only if fxn g 2 A =) fxn gn>1 2 A

Shift invariance is used to de…ne ergodicity: De…nition 295 A sequence of random variables fXn g is said to be an ergodic sequence if an olny if either P (fXn g 2 A) = 0 or P (fXn g 2 A) = 1 whenever A is a shift invariant set.

59.2

Limit of a sequence of random variables

As we explained in the lecture entitled Sequences and limits (p. 31), whenever we want to assess whether a sequence is convergent (and …nd its limit), we need to de…ne a distance function (or metric) to measure the distance between the terms of the sequence. Intuitively, a sequence converges to a limit if, by dropping a su¢ ciently high number of initial terms of the sequence, the remaining terms can be made as close to each other as we wish. The problem is how to de…ne "close to each other". As we have explained, the concept of "close to each other" can be made fully rigorous using the notion of a metric. Therefore, discussing convergence of a sequence of random variables boils down to discussing what metrics can be used to measure the distance between two random variables. In the following lectures, we introduce several di¤erent notions of convergence of a sequence of random variables: to each di¤erent notion corresponds a di¤erent way of measuring the distance between two random variables.

496

CHAPTER 59. SEQUENCES OF RANDOM VARIABLES

The notions of convergence (also called modes of convergence) introduced in the following lectures are: 1. Pointwise convergence (p. 501). 2. Almost sure convergence (p. 505). 3. Convergence in probability (p. 511). 4. Mean-square convergence (p. 519). 5. Convergence in distribution (p. 527).

Chapter 60

Sequences of random vectors In this lecture, we generalize the concepts introduced in the lecture entitled Sequences of random variables (p. 491). We no longer consider sequences whose elements are random variables, but we now consider sequences fXn g whose generic element Xn is a K 1 random vector. The generalization is straightforward, as the terminology and the basic concepts are almost the same used for sequences of random variables.

60.1

Terminology

60.1.1

Realization of a sequence

Let fxn g be a sequence of K 1 real vectors and fXn g a sequence of K 1 random vectors. If the real vector xn is a realization1 of the random vector Xn for every n, then we say that the sequence of real vectors fxn g is a realization of the sequence of random vectors fXn g and we write fXn g = fxn g

60.1.2

Sequences on a sample space

Let be a sample space2 . Let fXn g be a sequence of random vectors. We say that fXn g is a sequence of random vectors de…ned on the sample space if all the random vectors Xn belonging to the sequence fXn g are functions from to RK .

60.1.3

Independent sequences

Let fXn g be a sequence of random vectors de…ned on a sample space . We say that fXn g is an independent sequence of random vectors (or a sequence of independent random vectors) if every …nite subset of fXn g (i.e. every …nite subset of random vectors belonging to the sequence) is a set of mutually independent random vectors3 . 1 See

p. 105. p. 69. 3 See p. 235. 2 See

497

498

CHAPTER 60. SEQUENCES OF RANDOM VECTORS

60.1.4

Identically distributed sequences

Let fXn g be a sequence of random vectors. Denote by Fn (x) the joint distribution function4 of a generic element of the sequence Xn . We say that fXn g is a sequence of identically distributed random vectors if any two elements of the sequence have the same joint distribution function: 8x 2 RK ; 8i; j 2 N; Fi (x) = Fj (x)

60.1.5

IID sequences

Let fXn g be a sequence of random vectors de…ned on a sample space . We say that fXn g is a sequence of independent and identically distributed random vectors (or an IID sequence of random vectors), if fXn g is both a sequence of independent random vectors (see above) and a sequence of identically distributed random vectors (see above).

60.1.6

Stationary sequences

Let fXn g be a sequence of random vectors de…ned on a sample space . Take a …rst group of q successive terms of the sequence Xn+1 , . . . , Xn+q . Now take a second group of q successive terms of the sequence Xn+k+1 , . . . , Xn+k+q . The second group is located k positions after the …rst group. Denote the joint distribution function of the …rst group of terms by Fn+1;:::;n+q (x1; : : : ; xq ) and the joint distribution function of the second group of terms by Fn+k+1;:::;n+k+q (x1; : : : ; xq ) The sequence fXn g is said to be stationary (or strictly stationary) if and only if Fn+1;:::;n+q (x1 ; : : : ; xq ) = Fn+k+1;:::;n+k+q (x1; : : : ; xq ) >

> 2 RKq . for any n; k; q 2 N and for any vector x> 1 : : : xq In other words, a sequence is strictly stationary if and only if the two random vectors > > Xn+1 : : : Xn+q

>

and > > Xn+k+1 : : : Xn+k+q

>

have the same distribution (for any n, k and q). Requiring strict stationarity is weaker than requiring that a sequence be IID (see the subsection IID sequences above): if fXn g is an IID sequence, then it is also strictly stationary, while the converse is not necessarily true. 4 See

p. 118.

60.1. TERMINOLOGY

60.1.7

499

Weakly stationary sequences

Let fXn g be a sequence of random vectors de…ned on a sample space . We say that fXn g is a covariance stationary sequence (or weakly stationary sequence) if 9 2 RK : E [Xn ] = ; 8n 2 N 8j 0; 9 j 2 RK K : Cov [Xn ; Xn

j]

=

j ; 8n

>j

(1) (2)

where n and j are, of course, integers. Property (1) means that all the random vectors belonging to the sequence fXn g have the same mean. Property (2) means that the cross-covariance5 between a term Xn of the sequence and the term that is located j positions before it (Xn j ) is always the same, irrespective of how Xn has been chosen. In other words, Cov [Xn ; Xn j ] depends only on j and not on n. Note also that property (2) implies that all the random vectors in the sequence have the same covariance matrix (because Cov [Xn ; Xn ] = Var [Xn ]): 9

60.1.8

0

2 RK

K

: Var [Xn ] =

0 ; 8n

2N

Mixing sequences

The de…nition of mixing sequence of random vectors is a straightforward generalization of the de…nition of mixing sequence of random variables, which has been discussed in the lecture entitled Sequences of random variables (p. 491). Therefore, we report here the de…nition of mixing sequence of random vectors without further comments and we refer the reader to the aforementioned lecture for an explanation of the concept of mixing sequence. De…nition 296 We say that a sequence of random vectors fXn g is mixing (or strongly mixing) if and only if limk!1

fE [f (Xn+1 ; : : : ; Xn+q ) g (Xn+k+1 ; : : : ; Xn+k+q )] E [f (Xn+1 ; : : : ; Xn+q )] E [g (Xn+k+1 ; : : : ; Xn+k+q )]g = 0

for any two functions f and g and for any n and q.

60.1.9

Ergodic sequences

As in the previous section, we report here a de…nition of ergodic sequence of random vectors, which is a straightforward generalization of the de…nition of ergodic sequence of random variables, and we refer the reader to the lecture entitled Sequences of random variables (p. 491) for explanations of the concept of ergodicity. N Denote by RK the set of all possible sequences of real K 1 vectors. When fxn g is a sequence of real vectors, denote by fxn gn>1 the subsequence obtained by dropping the …rst term of fxn g, i.e.: fxn gn>1 = fx2 ; x3 ; : : :g N

We say that a subset A RK is a shift invariant set if and only if fxn gn>1 belongs to A whenever fxn g belongs to A. In symbols: 5 See

p. 193.

500

CHAPTER 60. SEQUENCES OF RANDOM VECTORS

De…nition 297 A set A

RK

N

is shift invariant if and only if

fxn g 2 A =) fxn gn>1 2 A Shift invariance is used to de…ne ergodicity. De…nition 298 A sequence of random vectors fXn g is said to be an ergodic sequence if an olny if either P (fXn g 2 A) = 0 or P (fXn g 2 A) = 1 whenever A is a shift invariant set.

60.2

Limit of a sequence of random vectors

Similarly to what happens for sequences of random variables, there are several di¤erent notions of convergence also for sequences of random vectors. In particular, all the modes of convergence found for random variables can be generalized to random vectors: 1. Pointwise convergence (p. 501). 2. Almost sure convergence (p. 505). 3. Convergence in probability (p. 511). 4. Mean-square convergence (p. 519). 5. Convergence in distribution (p. 527).

Chapter 61

Pointwise convergence This lecture discusses pointwise convergence. We deal …rst with pointwise convergence of sequences of random variables and then with pointwise convergence of sequences of random vectors.

61.1

Sequences of random variables

Let fXn g be a sequence of random variables de…ned on a sample space1 . Let us consider a single sample point2 ! 2 and a generic random variable Xn belonging to the sequence. Xn is a function Xn : ! R. However, once we …x !, the realization Xn (!) associated to the sample point ! is just a real number. By the same token, once we …x !, the sequence fXn (!)g is just a sequence of real numbers3 . Therefore, for a …xed !, it is very easy to assess whether the sequence fXn (!)g is convergent; this is done employing the usual de…nition of convergence of a sequence of real numbers. If, for a …xed !, the sequence fXn (!)g is convergent, we denote its limit by X (!), to underline that the limit depends on the speci…c ! we have …xed. A sequence of random variables is said to be pointwise convergent if and only if the sequence fXn (!)g is convergent for any choice of !: De…nition 299 Let fXn g be a sequence of random variables de…ned on a sample space . We say that fXn g is pointwise convergent to a random variable X de…ned on if and only if fXn (!)g converges to X (!) for all ! 2 . X is called the pointwise limit of the sequence and convergence is indicated by: Xn ! X pointwise Roughly speaking, using pointwise convergence we somehow circumvent the problem of de…ning the concept of distance between random variables: by …xing !, we reduce ourselves to the familiar problem of measuring distance between two real numbers, so that we can employ the usual notion of convergence of sequences of real numbers. Example 300 Let = f! 1 ,! 2 g be a sample space with two sample points (! 1 and ! 2 ). Let fXn g be a sequence of random variables such that a generic term Xn of 1 See

p. 492. p. 69. 3 See p. 33. 2 See

501

502

CHAPTER 61. POINTWISE CONVERGENCE

the sequence satis…es: Xn (!) =

1 n

1+

if ! = ! 1 if ! = ! 2

2 n

We need to check the convergence of the sequences fXn (!)g for all ! 2 , i.e. for ! = ! 1 and for ! = ! 2 : (1) the sequence fXn (! 1 )g, whose generic term is Xn (! 1 ) =

1 n

is a sequence of real numbers converging to 0; (2) the sequence fXn (! 2 )g, whose generic term is 2 Xn (! 2 ) = 1 + n is a sequence of real numbers converging to 1. Therefore, the sequence of random variables fXn g converges pointwise to the random variable X, where X is de…ned as follows: 0 if ! = ! 1 X (!) = 1 if ! = ! 2

61.2

Sequences of random vectors

The above notion of convergence generalizes to sequences of random vectors in a straightforward manner. Let fXn g be a sequence of random vectors de…ned on a sample space4 , where each random vector Xn has dimension K 1. If we …x a single sample point ! 2 , the sequence fXn (!)g is a sequence of real K 1 vectors. By the standard criterion for convergence5 , the sequence of real vectors fXn (!)g is convergent to a vector X (!) if lim d (Xn (!) ; X (!)) = 0 n!1

where d (Xn (!) ; X (!)) is the distance between a generic term of the sequence Xn (!) and the limit X (!). The distance between Xn (!) and X (!) is de…ned to be equal to the Euclidean norm of their di¤erence: d (Xn (!) ; X (!)) = kXn (!) X (!)k q 2 [Xn;1 (!) X ;1 (!)] + : : : + [Xn;K (!) =

X

2

;K

(!)]

where the second subscript is used to indicate the individual components of the vectors Xn (!) and X (!). Thus, for a …xed !, the sequence of real vectors fXn (!)g is convergent to a vector X (!) if lim kXn (!)

n!1

X (!)k = 0

A sequence of random vectors fXn g is said to be pointwise convergent if and only if the sequence fXn (!)g is convergent for any choice of !: 4 See 5 See

p. 497. p. 36.

61.3. SOLVED EXERCISES

503

De…nition 301 Let fXn g be a sequence of random vectors de…ned on a sample space . We say that fXn g is pointwise convergent to a random vector X de…ned on if and only if fXn (!)g converges to X (!) for all ! 2 , i.e. 8!; lim kXn (!) n!1

X (!)k = 0

X is called the pointwise limit of the sequence and convergence is indicated by: Xn ! X pointwise Now, denote by fXn;i g the sequence of the i-th components of the vectors Xn . It can be proved that the sequence of random vectors fXn g is pointwise convergent if and only if all the K sequences of random variables fXn;i g are pointwise convergent: Proposition 302 Let fXn g be a sequence of random vectors de…ned on a sample space . Denote by fXn;i g the sequence of random variables obtained by taking the i-th component of each random vector Xn . The sequence fXn g converges pointwise to the random vector X if and only if fXn;i g converges pointwise to the random variable X ;i (the i-th component of X) for each i = 1; : : : ; K.

61.3

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let the sample space

be6 : = [0; 1]

De…ne a sequence of random variables fXn g as follows: Xn (!) =

! 2n

8! 2

Find the pointwise limit of the sequence fXn g. Solution For a …xed sample point !, the sequence of real numbers fXn (!)g has limit: lim Xn (!) = lim

n!1

n!1

! =0 2n

Therefore, the sequence of random variables fXn g converges pointwise to the random variable X de…ned as follows: X (!) = 0 6 In

other words, the sample space

8! 2

is the set of all real numbers between 0 and 1.

504

CHAPTER 61. POINTWISE CONVERGENCE

Exercise 2 Suppose the sample space

is as in the previous exercise: = [0; 1]

De…ne a sequence of random variables fXn g as follows: Xn (!) = 1 +

n

! n

8! 2

Find the pointwise limit of the sequence fXn g. Solution For a given sample point !, the sequence of real numbers fXn (!)g has limit7 : lim Xn (!) = lim

n!1

n!1

1+

! n

n

= exp (!)

Thus, the sequence of random variables fXn g converges pointwise to the random variable X de…ned as follows: X (!) = exp (!)

8! 2

Exercise 3 Suppose the sample space

is as in the previous exercises: = [0; 1]

De…ne a sequence of random variables fXn g as follows: Xn (!) = ! n

8! 2

De…ne a random variable X as follows: X (!) = 0

8! 2

Does the sequence fXn g converge pointwise to the random variable X? Solution For ! 2 [0; 1), the sequence of real numbers fXn (!)g has limit: lim Xn (!) = lim ! n = 0

n!1

n!1

However, for ! = 1, the sequence of real numbers fXn (!)g has limit: lim Xn (!) = lim 1n = 1

n!1

n!1

Thus, the sequence of random variables fXn g does not converge pointwise to the random variable X, but it converges pointwise to the random variable Y de…ned as follows: 0 if ! 2 [0; 1) Y (!) = 1 if ! = 1 7 Note that this limit is encountered very frequently and you can …nd a proof of it in most calculus textbooks.

Chapter 62

Almost sure convergence This lecture introduces the concept of almost sure convergence. In order to understand this lecture, you should …rst understand the concepts of almost sure property and almost sure event1 and the concept of pointwise convergence of a sequence of random variables2 . We deal …rst with almost sure convergence of sequences of random variables and then with almost sure convergence of sequences of random vectors.

62.1

Sequences of random variables

Let fXn g be a sequence of random variables de…ned on a sample space3 . The concept of almost sure convergence (or a.s. convergence) is a slight variation of the concept of pointwise convergence. As we have seen, a sequence of random variables fXn g is pointwise convergent if and only if the sequence of real numbers fXn (!)g is convergent for all ! 2 . Achieving convergence for all ! 2 is a very stringent requirement. Therefore, this requirement is usually weakened, by requiring the convergence of fXn (!)g for a large enough subset of , and not necessarily for all ! 2 . In particular, fXn (!)g is usually required to be a convergent sequence almost surely: if F is the set of all sample points ! for which the sequence fXn (!)g is convergent, its complement F c must be included in a zero-probability event: F = f! 2 : fXn (!)g is a convergent sequenceg E is a zero-probability event Fc E In other words, almost sure convergence requires that the sequences fXn (!)g converge for all sample points ! 2 , except, possibly, for a very small set F c of sample points (F c must be included in a zero-probability event). De…nition 303 Let fXn g be a sequence of random variables de…ned on a sample space . We say that fXn g is almost surely convergent (a.s. convergent) to a random variable X de…ned on if and only if the sequence of real numbers 1 See

the lecture entitled Zero-probability events (p. 79). p. 501. 3 See p. 492. 2 See

505

506

CHAPTER 62. ALMOST SURE CONVERGENCE

fXn (!)g converges to X (!) almost surely, i.e. if there exists a zero-probability event E such that: f! 2

: fXn (!)g does not converge to X (!)g

E

X is called the almost sure limit of the sequence and convergence is indicated by: a:s: Xn ! X The following is an example of a sequence that converges almost surely: Example 304 Suppose the sample space

is:

= [0; 1] It is possible to build a probability measure P on , such that P assigns to each sub-interval of [0; 1] a probability equal to its length4 : if 0

a

b

1 and E = [a; b] , then P (E) = b

a

Remember that in this probability model all the sample points ! 2 are assigned zero probability (each sample point, when considered as an event, is a zeroprobability event): 8! 2 ; P (f!g) = P ([!; !]) = !

!=0

Now, consider a sequence of random variables fXn g de…ned as follows: Xn (!) =

1 1 n

if ! = 0 if ! = 6 0

When ! 2 (0; 1], the sequence of real numbers fXn (!)g converges to 0, because: lim Xn (!) = lim

n!1

n!1

1 =0 n

However, when ! = 0, the sequence of real numbers fXn (!)g is not convergent to 0, because: lim Xn (!) = lim 1 = 1 n!1

n!1

De…ne a constant random variable X: X (!) = 0; 8! 2 [0; 1] We have that: f! 2

: fXn (!)g does not converge to X (!)g = f0g

But P (f0g) = 0 because: P (f0g) = P ([0; 0]) = 0

0=0

which means that the event f! 2

: fXn (!)g does not converge to X (!)g

is a zero-probability event. Therefore, the sequence fXn g converges to X almost surely, but it does not converge pointwise to X because fXn (!)g does not converge to X (!) for all ! 2 . 4 See

the lecture entitled Zero-probability events (p. 79).

62.2. SEQUENCES OF RANDOM VECTORS

62.2

507

Sequences of random vectors

The above notion of convergence generalizes to sequences of random vectors in a straightforward manner. Let fXn g be a sequence of random vectors de…ned on a sample space5 , where each random vector Xn has dimension K 1. Also in the case of random vectors, the concept of almost sure convergence is obtained from the concept of pointwise convergence by relaxing the assumption that the sequence fXn (!)g converges for all ! 2 (remember that the sequence of real vectors fXn (!)g converges to a real vector X (!) if and only if limn!1 kXn (!) X (!)k = 0). Instead, it is required that the sequence fXn (!)g converges for almost all ! (i.e. almost surely). De…nition 305 Let fXn g be a sequence of random vectors de…ned on a sample space . We say that fXn g is almost surely convergent to a random vector X de…ned on if and only if the sequence of real vectors fXn (!)g converges to the real vector X (!) almost surely, i.e. if there exists a zero-probability event E such that: f! 2 : fXn (!)g does not converge to X (!)g E X is called the almost sure limit of the sequence and convergence is indicated by: a:s: Xn ! X Now, denote by fXn;i g the sequence of the i-th components of the vectors Xn . It can be proved that the sequence of random vectors fXn g is almost surely convergent if and only if all the K sequences of random variables fXn;i g are almost surely convergent. Proposition 306 Let fXn g be a sequence of random vectors de…ned on a sample space . Denote by fXn;i g the sequence of random variables obtained by taking the i-th component of each random vector Xn . The sequence fXn g converges almost surely to the random vector X if and only if fXn;i g converges almost surely to the random variable X ;i (the i-th component of X) for each i = 1; : : : ; K.

62.3

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let the sample space

be: = [0; 1]

Sub-intervals of [0; 1] are assigned a probability equal to their length: if 0

a

b

1 and E = [a; b] , then P (E) = b

De…ne a sequence of random variables fXn g as follows: Xn (!) = ! n 5 See

p. 497.

8! 2

a

508

CHAPTER 62. ALMOST SURE CONVERGENCE

De…ne a random variable X as follows: X (!) = 0

8! 2

Does the sequence fXn g converge almost surely to X? Solution For a …xed sample point ! 2 [0; 1), the sequence of real numbers fXn (!)g has limit: lim Xn (!) = lim ! n = 0 n!1

n!1

For ! = 1, the sequence of real numbers fXn (!)g has limit: lim Xn (!) = lim ! n = lim 1n = 1

n!1

n!1

n!1

Therefore, the sequence of random variables fXn g does not converge pointwise to X, because lim Xn (!) 6= X (!) n!1

for ! = 1. However, the set of sample points ! such that fXn (!)g does not converge to X (!) is a zero-probability event: P (f! 2

: fXn (!)g does not converge to X (!)g) = P (f1g) = 0

Therefore, the sequence fXn g converges almost surely to X.

Exercise 2 Let fXn g and fYn g be two sequences of random variables de…ned on a sample space . Let X and Y be two random variables de…ned on such that: a:s:

Xn ! X a:s: Yn ! Y Prove that

a:s:

Xn + Yn ! X + Y Solution Denote by FX the set of sample points for which fXn (!)g converges to X (!): FX = f! 2

: fXn (!)g converges to X (!)g

The fact that fXn g converges almost surely to X implies that c FX

EX

where P (EX ) = 0. Denote by FY the set of sample points for which fYn (!)g converges to Y (!): FY = f! 2

: fYn (!)g converges to Y (!)g

62.3. SOLVED EXERCISES

509

The fact that fYn g converges almost surely to Y implies that FYc

EY

where P (EY ) = 0. Now, denote by FXY the set of sample points for which fXn (!) + Yn (!)g converges to X (!) + Y (!): FXY = f! 2

: fXn (!) + Yn (!)g converges to X (!) + Y (!)g

Observe that if ! 2 FX \FY then fXn (!) + Yn (!)g converges to X (!)+Y (!), because the sum of two sequences of real numbers is convergent if the two sequences are convergent. Therefore FX \ FY FXY Taking the complement of both sides, we obtain c

c FXY

A B

(FX \ FY ) c = FX [ FYc

EX [ EY

where: in step A we have used De Morgan’s law; in step B we have used the c fact that FX EX and FYc EY . But P (EX [ EY )

=

P (EX ) + P (EY ) P (EX \ EY ) P (EX ) + P (EY ) = 0 + 0 = 0

c and, as a consequence, P (EX [ EY ) = 0. Thus, the set FXY of sample points ! such that fXn (!) + Yn (!)g does not converge to X (!) + Y (!) is included in the zero-probability event EX [ EY , which means that a:s:

Xn + Yn ! X + Y

Exercise 3 Let the sample space

be: = [0; 1]

Sub-intervals of [0; 1] are assigned a probability equal to their length, as in Exercise 1 above. De…ne a sequence of random variables fXn g as follows: Xn (!) =

1 n

if ! 2 0; 1 otherwise

1 n

Find an almost sure limit of the sequence. Solution If ! = 0 or ! = 1, then the sequence of real numbers fXn (!)g is not convergent: lim Xn (!) = lim n = 1

n!1

n!1

510

CHAPTER 62. ALMOST SURE CONVERGENCE

For ! 2 (0; 1), the sequence of real numbers fXn (!)g has limit: lim Xn (!) = 1

n!1

because for any ! we can …nd n0 such that ! 2 0; 1 n1 for any n n0 (as a consequence Xn (!) = 1 for any n n0 ). Thus, the sequence of random variables fXn g converges almost surely to the random variable X de…ned as: X (!) = 1

8! 2

because the set of sample points ! such that fXn (!)g does not converge to X (!) is a zero-probability event: P (f! 2 : fXn (!)g does not converge to X (!)g) = P (f0; 1g) = P (f0g) + P (f1g) = 0 + 0 = 0

Chapter 63

Convergence in probability This lecture discusses convergence in probability. We deal …rst with convergence in probability of sequences of random variables and then with convergence in probability of sequences of random vectors.

63.1

Sequences of random variables

As we have discussed in the lecture entitled Sequences of random variables (p. 495), di¤erent concepts of convergence are based on di¤erent ways of measuring the distance between two random variables (how "close to each other" two random variables are). The concept of convergence in probability is based on the following intuition: two random variables are "close to each other" if there is a high probability that their di¤erence is very small. Let fXn g be a sequence of random variables de…ned on a sample space1 . Let X be a random variable and " a strictly positive number. Consider the following probability: P (jXn Xj > ") (63.1) Intuitively, Xn is considered far from X when jXn Xj > "; therefore, (63.1) is the probability that Xn is far from X. If fXn g converges to X, then (63.1) should become smaller and smaller as n increases. In other words, the probability of Xn being far from X should go to zero when n increases. Formally, we should have: lim P (jXn

n!1

Xj > ") = 0

(63.2)

Note that the sequence fP (jXn

Xj > ")g

is a sequence of real numbers, therefore the limit in (63.2) is the usual limit of a sequence of real numbers. Furthermore, condition (63.2) should be satis…ed for any " (also for very small ", which means that we are very restrictive on our criterion for deciding whether Xn is far from X). This leads us to the following de…nition of convergence: 1 See

p. 492.

511

512

CHAPTER 63. CONVERGENCE IN PROBABILITY

De…nition 307 Let fXn g be a sequence of random variables de…ned on a sample space . We say that fXn g is convergent in probability to a random variable X de…ned on if and only if lim P (jXn

Xj > ") = 0

n!1

for any " > 0. X is called the probability limit of the sequence and convergence is indicated by P Xn ! X or by plim Xn = X n!1

The following example illustrates the concept of convergence in probability: Example 308 Let X be a discrete random variable with support RX = f0; 1g and probability mass function2 : 8 < 1=3 if x = 1 2=3 if x = 0 pX (x) = : 0 otherwise

Consider a sequence of random variables fXn g whose generic term is: Xn =

1+

1 n

X

We want to prove that fXn g converges in probability to X. Take any " > 0. Note that: jXn A

= =

B

=

Xj 1 1+ X n

X

1 X n 1 X n

where: in step A we have used the de…ntion of Xn ; in step B we have used the fact that X cannot be negative. When X = 0, which happens with probability 32 , we have that: 1 jXn Xj = X = 0 n and, of course, jXn have that

Xj

". When X = 1, which happens with probability 13 , we jXn

2 See

p. 106.

Xj =

1 1 X= n n

63.2. SEQUENCES OF RANDOM VECTORS and jXn

Xj

" if and only if P (jXn

1 n

Xj

1 " ).

" (or n ") =

513 Therefore: 1 " 1 "

2=3 if n < 1 if n

and P (jXn

Xj > ") = 1

P (jXn

Xj

") =

1=3 if n < 0 if n

1 " 1 "

Thus, P (jXn Xj > ") trivially converges to 0, because it is identically equal to 1 0 for all n such that n " . Since " was arbitrary, we have obtained the desired result: lim P (jXn Xj > ") = 0 n!1

for any " > 0.

63.2

Sequences of random vectors

The above notion of convergence generalizes to sequences of random vectors in a straightforward manner. Let fXn g be a sequence of random vectors de…ned on a sample space3 , where each random vector Xn has dimension K 1. We have explained above that a sequence of random variables fXn g converges in probability if and only if lim P (d (Xn ; X) > ") = 0

n!1

for any " > 0, where 4

d (Xn ; X) = jXn

Xj

is the distance of Xn from X. In the case of random vectors, the de…nition of convergence in probability remains the same, but distance is measured by the Euclidean norm of the di¤erence between the two vectors: d (Xn ; X)

= kXn Xk q = [Xn;1 (!)

X

2

;1

(!)] + : : : + [Xn;K (!)

X

2

;K

(!)]

where the second subscript is used to indicate the individual components of the vectors Xn (!) and X (!). The following is a formal de…nition: De…nition 309 Let fXn g be a sequence of random vectors de…ned on a sample space . We say that fXn g is convergent in probability to a random vector X de…ned on if and only if lim P (kXn

n!1

Xk > ") = 0

for any " > 0. X is called the probability limit of the sequence and convergence is indicated by P Xn ! X 3 See 4 See

p. 497. p. 34.

514

CHAPTER 63. CONVERGENCE IN PROBABILITY

or by plim Xn = X n!1

Now, denote by fXn;i g the sequence of the i-th components of the vectors Xn . It can be proved that the sequence of random vectors fXn g is convergent in probability if and only if all the K sequences of random variables fXn;i g are convergent in probability: Proposition 310 Let fXn g be a sequence of random vectors de…ned on a sample space . Denote by fXn;i g the sequence of random variables obtained by taking the i-th component of each random vector Xn . The sequence fXn g converges in probability to the random vector X if and only if the sequence fXn;i g converges in probability to the random variable X ;i (the i-th component of X) for each i = 1; : : : ; K.

63.3

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let U be a random variable having a uniform distribution5 on the interval [0; 1]. In other words, U is an absolutely continuous random variable with support RU = [0; 1] and probability density function fU (u) =

1 if u 2 [0; 1] 0 if u 2 = [0; 1]

Now, de…ne a sequence of random variables fXn g as follows: X1 X2 X4 X8 X16

= 1fU 2[0;1]g = 1fU 2[0;1=2]g X3 = 1fU 2[1=2;1]g = 1fU 2[0;1=4]g X5 = 1fU 2[1=4;2=4]g X6 = 1fU 2[2=4;3=4]g X7 = 1fU 2[3=4;1]g = 1fU 2[0;1=8]g X9 = 1fU 2[1=8;2=8]g X10 = 1fU 2[2=8;3=8]g : : : = 1fU 2[0;1=16]g X17 = 1fU 2[1=16;2=16]g X18 = 1fU 2[2=16;3=16]g : : : .. .

where 1fU 2[a;b]g is the indicator function6 of the event fU 2 [a; b]g. Find the probability limit (if it exists) of the sequence fXn g. 5 See 6 See

p. 359. p. 197.

63.3. SOLVED EXERCISES

515

Solution A generic term Xn of the sequence, being an indicator function, can take only two values: it can take value 1 with probability: j j+1 ; m m

P (Xn = 1) = P U 2

=

1 m

where m is an integer satisfying n ")

lim P (Xn > ")

n!1

lim P (Xn = 1)

n!1

1 =0 m!1 m lim

where: in step A we have used the fact that Xn is positive; in step B we have used the fact that Xn can take only value 0 or value 1.

Exercise 2 Does the sequence in the previous exercise also converge almost surely7 ? 7 See

p. 505.

516

CHAPTER 63. CONVERGENCE IN PROBABILITY

Solution We can identify the sample space8

with the support of U : = RU = [0; 1]

and the sample points ! 2 with the realizations of U : when the realization is U = u, then ! = u. Almost sure convergence requires that n oc ! 2 : lim Xn (!) = X (!) E n!1

where E is a zero-probability event9 and the superscript c denotes the complement of a set. In other words, the set of sample points ! for which the sequence fXn (!)g does not converge to X (!) must be included in a zero-probability event E. In our case, it is easy to see that, for any …xed sample point ! = u 2 [0; 1], the sequence fXn (!)g does not converge to X (!) = 0, because in…nitely many terms in the sequence are equal to 1. Therefore: n oc P ! 2 : lim Xn (!) = X (!) =1 n!1

and, trivially, there does not exist a zero-probability event including the set n oc ! 2 : lim Xn (!) = X (!) n!1

Thus, the sequence does not converge almost surely to X.

Exercise 3 Let fXn g be an IID sequence10 of continuous random variables having a uniform distribution with support 1 1 ; RXn = n n and probability density function fXn (x) =

n 2

0

if x 2 if x 2 =

1 n; 1 n;

1 n 1 n

Find the probability limit (if it exists) of the sequence fXn g. Solution As n tends to in…nity, the probability density tends to become concentrated around the point x = 0. Therefore, it seems reasonable to conjecture that the sequence fXn g converges in probability to the constant random variable X (!) = 0; 8! 2 To rigorously verify this claim we need to use the formal de…nition of convergence in probability. For any " > 0, lim P (jXn

n!1 8 See

p. 69. p. 79. 1 0 See p. 492. 9 See

Xj > ")

=

lim P (jXn

n!1

0j > ")

63.3. SOLVED EXERCISES

517 =

lim [1

= 1 = 1 A

P ( " Xn ")] Z " lim fXn (x) dx

n!1

=

1

=

1

=

0

n!1

lim

n!1

lim

n!1

Z

" min(";1=n)

max( "; 1=n)

Z

1=n 1=n

n dx 2

n dx 2

lim 1

n!1

where: in step A we have used the fact that

1 n

< " when n becomes large.

518

CHAPTER 63. CONVERGENCE IN PROBABILITY

Chapter 64

Mean-square convergence This lecture discusses mean-square convergence. We deal …rst with mean-square convergence of sequences of random variables and then with mean-square convergence of sequences of random vectors.

64.1

Sequences of random variables

In the lecture entitled Sequences of random variables (p. 495) we have stressed the fact that di¤erent concepts of convergence are based on di¤erent ways of measuring the distance between two random variables (how "close to each other" two random variables are). The concept of mean-square convergence, or convergence in mean-square, is based on the following intuition: two random variables are "close to each other" if the square of their di¤erence is on average small. Let fXn g be a sequence of random variables de…ned on a sample space1 . Let X be a random variable. The sequence fXn g is said to converge to X in mean-square if fXn g converges to X according to the metric2 d (Xn ; X) de…ned as follows: h i 2 d (Xn ; X) = E (Xn X) (64.1) Note that d (Xn ; X) is well-de…ned only if the expected value on the right hand side exists. Usually, Xn and X are required to be square integrable3 , which ensures that (64.1) is well-de…ned and …nite. Intuitively, for a …xed sample point4 !, the squared di¤erence (Xn (!)

2

X (!))

between the two realizations of Xn and X provides a measure of how di¤erent those two realizations are. The mean squared di¤erence (64.1) provides a measure of how di¤erent those two realizations are on average (as ! varies): if it becomes smaller and smaller by increasing n, then the sequence fXn g converges to X. We summarize the concept of mean-square convergence in the following: 1 See

p. 492. you do not understand what it means to "converge according to a metric", you need to revise the material discussed in the lecture entitled Sequences and limits (p. 34). 3 See p. 159. 4 See p. 69. 2 If

519

520

CHAPTER 64. MEAN-SQUARE CONVERGENCE

De…nition 311 Let fXn g be a sequence of square integrable random variables de…ned on a sample space . We say that fXn g is mean-square convergent, or convergent in mean-square, if and only if there exists a square integrable random variable X such that fXn g converges to X, according to the metric h i 2 d (Xn ; X) = E (Xn X)

i.e.

h lim E (Xn

2

X)

n!1

i

=0

(64.2)

X is called the mean-square limit of the sequence and convergence is indicated by m:s: Xn ! X or by L2

Xn ! X L2

Note that (64.2) is just the usual criterion for convergence5 , while Xn ! X indicates that convergence is in the Lp space6 L2 , because both fXn g and X have been required to be square integrable. The following example illustrates the concept of mean-square convergence: Example 312 Let fXn g be a covariance stationary7 sequence of random variables such that all the random variables in the sequence have the same expected value , the same variance 2 and zero covariance with each other. De…ne the sample mean X n as follows: n 1X Xn = Xi n i=1

and de…ne a constant random variable X = term of the sequence X n and X is d X n; X = E But

h

Xn

X

2

. The distance between a generic

i

=E

h

Xn

is equal to the expected value of X n , because " n # n 1X 1X E Xn = E Xi = E [Xi ] n i=1 n i=1 n

=

1X n i=1

=

1 n = n

Therefore d X n; X

5 See

p. 36. p. 136. 7 See p. 493. 6 See

=

E

=

E

h h

Xn Xn

X

2 2

i

i

2

i

64.2. SEQUENCES OF RANDOM VECTORS = E =

h

Xn

521 E Xn

2

Var X n

i

by the very de…nition of variance. In turn, the variance of X n is " n # 1X Var X n = Var Xi n i=1 " n # X A = 1 Var Xi n2 i=1 B

= =

n 1 X Var [Xi ] n2 i=1 n 1 X n2 i=1

2

=

1 n n2

2

2

=

n

where: in step A we have used the properties of variance8 ; in step B we have used the fact that the variance of a sum is equal to the sum of the variances when the random variables in the sum have zero covariance with each other9 . Thus: h i 2 2 d X n; X = E X n X = Var X n = n and

lim E

n!1

h

Xn

X

2

i

2

= lim

n!1

n

=0

But this is just the de…nition of mean square convergence of X n to X. Therefore, the sequence X n converges in mean-square to the constant random variable X = .

64.2

Sequences of random vectors

The above notion of convergence generalizes to sequences of random vectors in a straightforward manner. Let fXn g be a sequence of random vectors de…ned on a sample space10 , where each random vector Xn has dimension K 1. The sequence of random vectors fXn g is said to converge to a random vector X in mean-square if fXn g converges to X according to the metric11 d (Xn ; X) de…ned as follows: h i 2 d (Xn ; X) = E kXn Xk (64.3) h i 2 2 = E (Xn;1 X ;1 ) + : : : + (Xn;K X ;K )

where kXn Xk is the Euclidean norm of the di¤erence between Xn and X and the second subscript is used to indicate the individual components of the vectors Xn and X. 8 See,

in particular, the property Multiplication by a constant (p. 158). p. 168. 1 0 See p. 497. 1 1 See p. 34. 9 See

522

CHAPTER 64. MEAN-SQUARE CONVERGENCE

Of course, d (Xn ; X) is well-de…ned only if the expected value on the right hand side exists. A su¢ cient condition for (64.3) to be well-de…ned is that all the components of Xn and X be square integrable random variables. Intuitively, for a …xed sample point !, the square of the Euclidean norm 2

kXn (!)

X (!)k

of the di¤erence between the two realizations of Xn and X provides a measure of how di¤erent those two realizations are. The mean of the square of the Euclidean norm (formula 64.3 above) provides a measure of how di¤erent those two realizations are on average (as ! varies): if it becomes smaller and smaller by increasing n, then the sequence of random vectors fXn g converges to the vector X. The following is a formal de…nition of mean-square convergence for random vectors: De…nition 313 Let fXn g be a sequence of random vectors de…ned on a sample space , whose components are square integrable random variables. We say that fXn g is mean-square convergent,or convergent in mean-square, if and only if there exists a random vector X with square integrable components such that fXn g converges to X, according to the metric h i 2 d (Xn ; X) = E kXn Xk i.e.

h lim E kXn

n!1

2

Xk

i

=0

(64.4)

X is called the mean-square limit of the sequence and convergence is indicated by m:s: Xn ! X or by: L2

Xn ! X L2

Note that (64.4) is just the usual criterion for convergence, while Xn ! X indicates that convergence is in the Lp space L2 , because both fXn g and X have been required to have square integrable components. Now, denote by fXn;i g the sequence of the i-th components of the vectors Xn . It can be proved that the sequence of random vectors fXn g is convergent in meansquare if and only if all the K sequences of random variables fXn;i g are convergent in mean-square: Proposition 314 Let fXn g be a sequence of random vectors de…ned on a sample space , such that their components are square integrable random variables. Denote by fXn;i g the sequence of random variables obtained by taking the i-th component of each random vector Xn . The sequence fXn g converges in mean-square to the random vector X if and only if fXn;i g converges in mean-square to the random variable X ;i (the i-th component of X) for each i = 1; : : : ; K.

64.3

Solved exercises

Below you can …nd some exercises with explained solutions.

64.3. SOLVED EXERCISES

523

Exercise 1 Let U be a random variable having a uniform distribution12 on the interval [1; 2]. In other words, U is an absolutely continuous random variable with support RU = [1; 2] and probability density function 1 if u 2 [1; 2] 0 if u 2 = [1; 2]

fU (u) =

Consider a sequence of random variables fXn g whose generic term is Xn = 1fU 2[1;2

1=n]g

where 1fU 2[1;2 1=n]g is the indicator function13 of the event fU 2 [1; 2 Find the mean-square limit (if it exists) of the sequence fXn g.

1=n]g.

Solution When n tends to in…nity, the interval [1; 2 [1; 2], because lim

1 n

2

n!1

1=n] becomes similar to the interval =2

Therefore, we conjecture that the indicators 1fU 2[1;2 1=n]g converge in meansquare to the indicator 1fU 2[1;2]g . But 1fU 2[1;2]g is always equal to 1, so our conjecture is that the sequence fXn g converges in mean square to 1. To verify our conjecture, we need to verify that h i 2 lim E (Xn 1) = 0 n!1

The expected value can be computed as follows: Z 1 h i 2 2 E (Xn 1) = 1fu2[1;2 1=n]g 1 fU (u) du =

Z

1 2

1

=

Z

Z

Z

=

12fu2[1;2

1=n]g

+ 12

1fu2[1;2

1=n]g

+1

2

1fu2[1;2

1

Z

1

1 2 See 1 3 See

p. 359. p. 197.

1=n]g du +

2 1=n

du +

1

=

du 2 1fu2[1;2

1=n]g 1

du

2

1

=

1=n]g

2

1

=

2

1fu2[1;2

Z

1

2 1=n

[u]1

2

+ [u]1

Z

2 1fu2[1;2 2

du

2

1

2

du

2

Z

Z

1

2 1=n

1 2 1=n 2 [u]1

du

1=n]g

du

2

1fu2[1;2

1=n]g du

524

CHAPTER 64. MEAN-SQUARE CONVERGENCE =

1 n

2

1+2

1

1 n

2 2

1

=

1 n

Thus, the sequence fXn g converges in mean-square to 1 because h lim E (Xn

n!1

2

1)

i

= lim

n!1

1 =0 n

Exercise 2 Let fXn g be a sequence of discrete random variables. Let the probability mass function of a generic term of the sequence Xn be 8 if xn = n < 1=n 1 1=n if xn = 0 pXn (xn ) = : 0 otherwise Find the mean-square limit (if it exists) of the sequence fXn g.

Solution Note that lim P (Xn = 0) = lim

n!1

1

n!1

1 n

=1

Therefore, one would expect that the sequence fXn g converges to the constant random variable X = 0. However, the sequence fXn g does not converge in meansquare to 0. The distance of a generic term of the sequence from 0 is h E (Xn

2

0)

Thus

i

1 +0 n

= E Xn2 = n2 h lim E (Xn

n!1

2

0)

i

1

1 n

=n

=1

while, if fXn g was convergent, we would have h i 2 lim E (Xn 0) = 0 n!1

Exercise 3 Does the sequence in the previous exercise converge in probability? Solution The sequence fXn g converges in probability to the constant random variable X = 0 because for any " > 0 lim P (jXn

Xj > ")

lim P (jXn

0j > ")

n!1

= A

=

n!1

lim P (Xn > ")

n!1

64.3. SOLVED EXERCISES B

525 = =

lim P (Xn = n)

n!1

lim

n!1

1 =0 n

where: in step A we have used the fact that Xn is positive; in step B we have used the fact that Xn can take only value 0 or value n.

526

CHAPTER 64. MEAN-SQUARE CONVERGENCE

Chapter 65

Convergence in distribution This lecture discusses convergence in distribution. We deal …rst with convergence in distribution of sequences of random variables and then with convergence in distribution of sequences of random vectors.

65.1

Sequences of random variables

In the lecture entitled Limit of a sequence of random variables (p. 495) we explained that di¤erent concepts of convergence are based on di¤erent ways of measuring the distance between two random variables (how "close to each other" two random variables are). The concept of convergence in distribution is based on the following intuition: two random variables are "close to each other" if their distribution functions1 are "close to each other" . Let fXn g be a sequence of random variables2 . Let us consider a generic random variable Xn belonging to the sequence. Denote by Fn (x) its distribution function. Fn (x) is a function Fn : R ! [0; 1]. Once we …x x, the value Fn (x) associated to the point x is a real number. By the same token, once we …x x, the sequence fFn (x)g is a sequence of real numbers. Therefore, for a …xed x, it is very easy to assess whether the sequence fFn (x)g is convergent; this is done employing the usual de…nition of convergence of sequences of real numbers3 . If, for a …xed x, the sequence fFn (x)g is convergent, we denote its limit by FX (x) (note that the limit depends on the speci…c x we have …xed). A sequence of random variables fXn g is said to be convergent in distribution if and only if the sequence fFn (x)g is convergent for any choice of x (except, possibly, for some "special values" of x where FX (x) is not continuous in x): De…nition 315 Let fXn g be a sequence of random variables. Denote by Fn (x) the distribution function of Xn . We say that fXn g is convergent in distribution (or convergent in law) if and only if there exists a distribution function FX (x) such that the sequence fFn (x)g converges to FX (x) for all points x 2 R where FX (x) is continuous. If a random variable X has distribution function FX (x), then X is called the limit in distribution (or limit in law) of the sequence and convergence 1 See

p. 108. p. 491. 3 See p. 33. 2 See

527

528

CHAPTER 65. CONVERGENCE IN DISTRIBUTION

is indicated by d

Xn ! X Note that convergence in distribution only involves the distribution functions of the random variables belonging to the sequence fXn g and that these random variables need not be de…ned on the same sample space4 . On the contrary, the modes of convergence we have discussed in previous lectures (pointwise convergence, almost sure convergence, convergence in probability, mean-square convergence) require that all the variables in the sequence be de…ned on the same sample space. Example 316 Let fXn g be a sequence of IID5 random variables all having a uniform distribution6 on the interval [0; 1], i.e. the distribution function of Xn is 8 < 0 if x < 0 x if 0 x < 1 FXn (x) = : 1 if x 1

De…ne:

Yn = n 1

max Xi

1 i n

The distribution function of Yn is FYn (y)

=

P (Yn

y)

=

P n 1

=

P

=

1

=

1

A

=

1

B

=

1

C

=

1

D

=

1

max Xi

max Xi

1 i n

P

y

1 i n

1

y n

max Xi < 1

1 i n

y n

y y ; X2 < 1 ; : : : ; Xn < 1 n n y y P X1 < 1 P X2 < 1 ::: P n n y y P X1 1 P X2 1 ::: P n n y y FX1 1 FX2 1 : : : FXn 1 n n h y in FXn 1 n P X1 < 1

y n Xn < 1 Xn

1

y n y n

y n

where: in step A we have used the fact that the variables Xi are mutually independent; in step B we have used the fact that the variables Xi are absolutely continuous; in step C we have used the de…nition of distribution function; in step D we have used the fact that the variables Xi have identical distributions. Thus: 8 if y < 0 < 0 n FYn (y) = 1 1 ny if 0 y < n : 1 if y n 4 See

p. 69. p. 492. 6 See p. 359. 5 See

65.2. SEQUENCES OF RANDOM VECTORS Since lim

n!1

y n

1

529

n

= exp ( y)

we have 0 1

lim FYn (y) = FY (y) =

n!1

if y < 0 exp ( y) if y 0

where FY (y) is the distribution function of an exponential random variable7 . Therefore, the sequence fYn g converges in law to an exponential distribution.

65.2

Sequences of random vectors

The de…nition of convergence in distribution of a sequence of random vectors is almost identical; we just need to replace distribution functions in the above de…nition with joint distribution functions8 : De…nition 317 Let fXn g be a sequence of K 1 random vectors. Denote by Fn (x) the joint distribution function of Xn . We say that fXn g is convergent in distribution (or convergent in law) if and only if there exists a joint distribution function FX (x) such that the sequence fFn (x)g converges to FX (x) for all points x 2 RK where FX (x) is continuous. If a random vector X has joint distribution function FX (x), then X is called the limit in distribution (or limit in law) of the sequence and convergence is indicated by d

Xn ! X

65.3

More details

65.3.1

Proper distribution functions

Let fXn g be a sequence of random variables and denote by Fn (x) the distribution function of Xn . Suppose that we …nd a function FX (x) such that FX (x) = lim Fn (x) n!1

for all x 2 R where FX (x) is continuous. How do we check that FX (x) is a proper distribution function, so that we can say that the sequence fXn g converges in distribution? FX (x) is a proper distribution function if it satis…es the following four properties: 1. Increasing. FX (x) is increasing, i.e. FX (x1 )

FX (x2 ) if x1 < x2 .

2. Right-continuous. FX (x) is right-continuous, i.e. lim FX (t) = FX (x)

t!x t x

for any x 2 R. 7 See 8 See

p. 365. p. 118.

530

CHAPTER 65. CONVERGENCE IN DISTRIBUTION

3. Limit at minus in…nity. FX (x) satis…es lim F (x) = 0

x! 1

4. Limit at plus in…nity. FX (x) satis…es lim F (x) = 1

x!1

65.4

Solved exercises

Below you can …nd some exercises with explained solutions.

Exercise 1 Let fXn g be a sequence of random variables having distribution functions 8 0 if x 0 > > < n x + 1 x2 if 0 < x 1 2n+1 4n+2 Fn (x) = n 1 2 x 4n+2 x 4x + 2 if 1 < x 2 > > : 2n+1 1 if x > 2 Find the limit in distribution (if it exists) of the sequence fXn g.

Solution If 0 < x

1, then lim Fn (x)

n!1

If 1 < x

1 n x+ x2 2n + 1 4n + 2 n + x2 lim = x lim n!1 2n + 1 n!1 1 1 + x2 0 = x = x 2 2 =

lim

n!1

1 4n + 2

2, then

n 1 x x2 4x + 2 2n + 1 4n + 2 n = x lim + x2 4x + 2 lim n!1 2n + 1 n!1 1 1 = x + x2 4x + 2 0 = x 2 2 We now need to verify that the function 8 if x 0 < 0 1 x if 0 2 lim Fn (x)

n!1

=

lim

n!1

1 4n + 2

is a proper distribution function. The function is increasing, continuous, its limit at minus in…nity is 0 and its limit at plus in…nity is 1, hence it satis…es the four properties that a proper distribution function needs to satisfy. This implies that fXn g converges in distribution to a random variable X having distribution function FX (x).

65.4. SOLVED EXERCISES

531

Exercise 2 Let fXn g be a sequence of random variables having distribution functions 8 if x < 0 < 0 n 1 (1 x) if 0 x 1 Fn (x) = : 1 if x > 1 Find the limit in distribution (if it exists) of the sequence fXn g.

Solution If x = 0, then lim Fn (x)

n!1

=

lim [1

n!1

= 1 If 0 < x

n

(1

x) ] = 1

(1

x) ] = 1

n

lim (1

0)

lim (1

x)

n!1

1=0

1, then lim Fn (x)

n!1

= =

lim [1

n!1

1

n

n!1

n

0=1

Therefore, the distribution functions Fn (x) converge to the function GX (x) = lim Fn (x) = n!1

0 if x 0 1 if x > 0

which is not a proper distribution function, because it is not right-continuous at the point x = 0. However, note that the function FX (x) =

0 if x < 0 1 if x 0

is a proper distribution function and it is equal to limn!1 Fn (x) at all points except at the point x = 0. But this is a point of discontinuity of FX (x). As a consequence, the sequence fXn g converges in distribution to a random variable X having distribution function FX (x).

Exercise 3 Let fXn g be a sequence of random variables having distribution functions 8 if x < 0 < 0 nx if 0 x 1=n Fn (x) = : 1 if x > 1=n Find the limit in distribution (if it exists) of the sequence fXn g.

Solution The distribution functions Fn (x) converge to the function GX (x) = lim Fn (x) = n!1

0 if x 0 1 if x > 0

532

CHAPTER 65. CONVERGENCE IN DISTRIBUTION

This is the same limiting function found in the previous exercise. As a consequence, the sequence fXn g converges in distribution to a random variable X having distribution function 0 if x < 0 FX (x) = 1 if x 0

Chapter 66

Relations between modes of convergence In the previous lectures, we have introduced several notions of convergence of a sequence of random variables (also called modes of convergence). There are several relations between the various modes of convergence, which are discussed in the following subsections and are summarized by the following diagram (an arrow denotes implication in the arrow’s direction): Almost sure convergence

Mean square convergence &

66.1

Convergence in probability # Convergence in distribution

.

Almost sure ) Probability

Proposition 318 If a sequence of random variables fXn g converges almost surely to a random variable X, then fXn g also converges in probability to X. Proof. See e.g. Resnick1 (1999).

66.2

Probability ) Distribution

Proposition 319 If a sequence of random variables fXn g converges in probability to a random variable X, then fXn g also converges in distribution to X. Proof. See e.g. Resnick (1999). 1 Resnick,

S.I. (1999) "A Probability Path", Birkhauser.

533

534

66.3

CHAPTER 66. MODES OF CONVERGENCE - RELATIONS

Almost sure ) Distribution

Proposition 320 If a sequence of random variables fXn g converges almost surely to a random variable X, then fXn g also converges in distribution to X. Proof. This is obtained putting together Propositions (318) and (319) above.

66.4

Mean square ) Probability

Proposition 321 If a sequence of random variables fXn g converges in mean square to a random variable X, then fXn g also converges in probability to X. Proof. Weocan apply Markov’s inequality2 to a generic term of the sequence n 2 (Xn X) : h i 2 E (Xn X) 2 P (Xn X) c2 c2 for any strictly positive real number c. Taking the square root of both sides of the left-hand inequality, we obtain h i 2 E (Xn X) P (jXn Xj c) c2 Taking limits on both sides, we get h i h i 2 2 E (Xn X) limn!1 E (Xn X) lim P (jXn Xj c) lim = =0 n!1 n!1 c2 c2 where we have used the fact that, by the very de…nition of convergence in mean square: h i 2 lim E (Xn X) = 0 n!1

Since, by the very de…nition of probability, it must be that P (jXn

Xj

c)

0

then it must be that also lim P (jXn

n!1

Xj

c) = 0

Note that this holds for any arbitrarily small c. By the de…nition of convergence in probability, this means that Xn converges in probability to X (if you are wondering about strict and weak inequalities here and in the de…nition of convergence in probability, note that jXn Xj c implies jXn Xj > " for any strictly positive " < c).

66.5

Mean square ) Distribution

Proposition 322 If a sequence of random variables fXn g converges in mean square to a random variable X, then fXn g also converges in distribution to X. Proof. This is obtained putting together Propositions (321) and (319) above. 2 See

p. 241.

Chapter 67

Laws of Large Numbers Let fXn g be a sequence of random variables1 . Let X n be the sample mean of the …rst n terms of the sequence: n

1X Xi n i=1

Xn =

A Law of Large Numbers (LLN) is a proposition stating a set of conditions that are su¢ cient to guarantee the convergence of the sample mean X n to a constant, as the sample size n increases. It is called a Weak Law of Large Numbers (WLLN) if the sequence X n converges in probability2 and a Strong Law of Large Numbers (SLLN) if the sequence X n converges almost surely3 . There are literally dozens of Laws of Large Numbers. We report some examples below.

67.1

Weak Laws of Large Numbers

67.1.1

Chebyshev’s WLLN

Probably, the best known Law of Large Numbers is Chebyshev’s: Proposition 323 (Chebyshev’s WLLN) Let fXn g be an uncorrelated and covariance stationary sequence of random variables4 : 9 2 R : E [Xn ] = ; 8n 2 N 9 2 2 R+ : Var [Xn ] = 2 ; 8n 2 N Cov [Xn ; Xn+k ] = 0; 8n; k 2 N Then, a Weak Law of Large Numbers applies to the sample mean: plim X n = n!1 1 See

p. 491. p. 511. 3 See p. 505. 4 In other words, all the random variables in the sequence have the same mean , the same variance 2 and zero covariance with each other. See p. 493 for a de…nition of covariance stationary sequence. 2 See

535

536

CHAPTER 67. LAWS OF LARGE NUMBERS

where plim denotes a probability limit5 . Proof. The expected value of the sample mean X n is: " n # n 1X 1X E Xn = E Xi = E [Xi ] n i=1 n i=1 n

1X n i=1

=

=

1 n = n

The variance of the sample mean X n is: " Var X n

=

A

=

B

=

# n 1X Var Xi n i=1 " n # X 1 Var Xi n2 i=1 n 1 X Var [Xi ] n2 i=1 n 1 X n2 i=1

=

2

=

1 n n2

2

2

=

n

where: in step A we have used the properties of variance6 ; in step B we have used the fact that the variance of a sum is equal to the sum of the variances when the random variables in the sum have zero covariance with each other7 . Now we can apply Chebyshev’s inequality8 to the sample mean X n : P Xn

E Xn

Var X n k2

k

for any strictly positive real number k. Plugging in the values for the expected value and the variance derived above, we obtain: 2

P Xn

k

Since

nk 2

2

lim

n!1

nk 2

=0

and P Xn

k

0

then it must also be that: lim P X n

n!1

k =0

Note that this holds for any arbitrarily small k. By the very de…nition of convergence in probability, this means that X n converges in probability to (if you are 5 See

p. 511. in particular, the Multiplication by a constant property (p. 158). 7 See p. 168. 8 See p. 242. 6 See,

67.1. WEAK LAWS OF LARGE NUMBERS

537

wondering about strict and weak inequalities here and in the de…nition of convergence in probability, note that X n k implies X n > " for any strictly positive " < k). Note that it is customary to state Chebyshev’s Weak Law of Large Numbers as a result on the convergence in probability of the sample mean: plim X n = n!1

However, the conditions of the above theorem guarantee the mean square convergence9 of the sample mean to : m:s:

Xn !

Proof. In the above proof of Chebyshev’s Weak Law of Large Numbers, it is proved that: 2

Var X n =

n

and that E Xn = This implies that E

h

Xn

As a consequence:

2

i

=E

lim E

n!1

h

h

Xn

Xn

E Xn

2

i

2

i

2

= Var X n =

n

2

= lim

n!1

n

=0

but this is just the de…nition of mean square convergence of X n to . Hence, in Chebyshev’s Weak Law of Large Numbers, convergence in probability is just a consequence of the fact that convergence in mean square implies convergence in probability10 .

67.1.2

Chebyshev’s WLLN for correlated sequences

Chebyshev’s Weak Law of Large Numbers (see above) sets forth the requirement that the terms of the sequence fXn g have zero covariance with each other. By relaxing this requirement and allowing for some correlation between the terms of the sequence fXn g, a more general version of Chebyshev’s Weak Law of Large Numbers can be obtained: Proposition 324 (Chebyshev’s WLLN for correlated sequences) Let fXn g be a covariance stationary sequence of random variables11 : 9 2 R : E [Xn ] = ; 8n > 0 8j 0; 9 j 2 R : Cov [Xn ; Xn 9 See

j]

=

j ; 8n

>j

p. 519. p. 534. 1 1 In other words, all the random variables in the sequence have the same mean , the same variance 0 and the covariance between a term Xn of the sequence and the term that is located j positions before it (Xn j ) is always the same ( j ), irrespective of how Xn has been chosen. 1 0 See

538

CHAPTER 67. LAWS OF LARGE NUMBERS

If covariances tend to be zero on average, i.e. if n

1X lim n!1 n i=0

i

=0

(67.1)

then a Weak Law of Large Numbers applies to the sample mean: plim X n = n!1

Proof. For a full proof see e.g. Karlin and Taylor12 (1975). We give here a proof based on the assumption that covariances are absolutely summable: 1 X j=0

j