Introduction
Measures
Measures
Probability measures
Measurable functions and random variables
Measurable functions
Constructing new measures
Random variables
Convergence of measurable functions
Tail events
Integration
Definition and basic properties
Integrals and limits
New measures from old
Integration and differentiation
Product measures and Fubini's theorem
Inequalities and Lp spaces
Four inequalities
Lp spaces
Orthogonal projection in L2
Convergence in L1(P) and uniform integrability
Fourier transform
The Fourier transform
Convolutions
Fourier inversion formula
Fourier transform in L2
Properties of characteristic functions
Gaussian random variables
Ergodic theory
Ergodic theorems
Big theorems
The strong law of large numbers
Central limit theorem
Index

##### Citation preview

Part II — Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua

Michaelmas 2016 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures. They are nowhere near accurate representations of what was actually lectured, and in particular, all errors are almost surely mine.

Analysis II is essential Measure spaces, σ-algebras, π-systems and uniqueness of extension, statement *and proof* of Carath´eodory’s extension theorem. Construction of Lebesgue measure on R. The Borel σ-algebra of R. Existence of non-measurable subsets of R. Lebesgue-Stieltjes measures and probability distribution functions. Independence of events, independence of σ-algebras. The Borel–Cantelli lemmas. Kolmogorov’s zero-one law. [6] Measurable functions, random variables, independence of random variables. Construction of the integral, expectation. Convergence in measure and convergence almost everywhere. Fatou’s lemma, monotone and dominated convergence, differentiation under the integral sign. Discussion of product measure and statement of Fubini’s theorem. [6] Chebyshev’s inequality, tail estimates. Jensen’s inequality. Completeness of Lp for 1 ≤ p ≤ ∞. The H¨ older and Minkowski inequalities, uniform integrability. [4] L2 as a Hilbert space. Orthogonal projection, relation with elementary conditional probability. Variance and covariance. Gaussian random variables, the multivariate normal distribution. [2] The strong law of large numbers, proof for independent random variables with bounded fourth moments. Measure preserving transformations, Bernoulli shifts. Statements *and proofs* of maximal ergodic theorem and Birkhoff’s almost everywhere ergodic theorem, proof of the strong law. [4] The Fourier transform of a finite measure, characteristic functions, uniqueness and inversion. Weak convergence, statement of L´evy’s convergence theorem for characteristic functions. The central limit theorem. [2]

1

Contents

II Probability and Measure

Contents 0 Introduction

3

1 Measures 1.1 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Probability measures . . . . . . . . . . . . . . . . . . . . . . . . .

5 5 16

2 Measurable functions and random variables 2.1 Measurable functions . . . . . . . . . . . . . . 2.2 Constructing new measures . . . . . . . . . . 2.3 Random variables . . . . . . . . . . . . . . . . 2.4 Convergence of measurable functions . . . . . 2.5 Tail events . . . . . . . . . . . . . . . . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

20 20 23 25 29 34

3 Integration 3.1 Definition and basic properties . . . . . 3.2 Integrals and limits . . . . . . . . . . . . 3.3 New measures from old . . . . . . . . . 3.4 Integration and differentiation . . . . . . 3.5 Product measures and Fubini’s theorem

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

36 36 42 44 46 48

4 Inequalities and Lp spaces 4.1 Four inequalities . . . . . . . . . . . . . . . . . 4.2 Lp spaces . . . . . . . . . . . . . . . . . . . . . 4.3 Orthogonal projection in L2 . . . . . . . . . . . 4.4 Convergence in L1 (P) and uniform integrability

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

54 54 59 61 65

5 Fourier transform 5.1 The Fourier transform . . . . . . . . 5.2 Convolutions . . . . . . . . . . . . . 5.3 Fourier inversion formula . . . . . . 5.4 Fourier transform in L2 . . . . . . . 5.5 Properties of characteristic functions 5.6 Gaussian random variables . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

69 69 70 72 77 79 80

. . . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . .

. . . . . .

. . . . . .

6 Ergodic theory 83 6.1 Ergodic theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 7 Big theorems 90 7.1 The strong law of large numbers . . . . . . . . . . . . . . . . . . 90 7.2 Central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . 92 Index

94

2

0

Introduction

0

II Probability and Measure

Introduction

In measure theory, the main idea is that we want to assign “sizes” to different sets. For example, we might think [0, 2] ⊆ R has size 2, while perhaps Q ⊆ R has size 0. This is known as a measure. One of the main applications of a measure is that we can use it to come up with a new definition of an integral. The idea is very simple, but it is going to be very powerful mathematically. Recall that if f : [0, 1] → R is continuous, then the Riemann integral of f is defined as follows: (i) Take a partition 0 = t0 < t1 < · · · < tn = 1 of [0, 1]. (ii) Consider the Riemann sum n X

f (tj )(tj − tj−1 )

j=1

(iii) The Riemann integral is Z f = Limit of Riemann sums as the mesh size of the partition → 0. y

··· ··· tk tk+1 · · · 1

0 t1 t2 t3

x

The idea of measure theory is to use a different approximation scheme. Instead of partitioning the domain, we partition the range of the function. We fix some numbers r0 < r1 < r2 < · · · < rn . We then approximate the integral of f by n X

rj · (“size of f −1 ([rj−1 , rj ])”).

j=1

We then define the integral as the limit of approximations of this type as the mesh size of the partition → 0.

3

0

Introduction

II Probability and Measure

y

x We can make an analogy with bankers — If a Riemann banker is given a stack of money, they would just add the values of the money in order. A measuretheoretic banker will sort the bank notes according to the type, and then find the total value by multiplying the number of each type by the value, and adding up. Why would we want to do so? It turns out this leads to a much more general theory of integration on much more general spaces. Instead of integrating functions [a, b] → R only, we can replace the domain with any measure space. Even in the context of R, this theory of integration is much much more powerful than the Riemann sum, and can integrate a much wider class of functions. While you probably don’t care about those pathological functions anyway, being able to integrate more things means that we can state more general theorems about integration without having to put in funny conditions. That was all about measures. What about probability? It turns out the concepts we develop for measures correspond exactly to many familiar notions from probability if we restrict it to the particular case where the total measure of the space is 1. Thus, when studying measure theory, we are also secretly studying probability!

4

1

Measures

1

II Probability and Measure

Measures

In the course, we will write fn % f for “fn converges to f monotonically increasingly”, and fn & f similarly. Unless otherwise specified, convergence is taken to be pointwise.

1.1

Measures

The starting point of all these is to come up with a function that determines the “size” of a given set, known as a measure. It turns out we cannot sensibly define a size for all subsets of [0, 1]. Thus, we need to restrict our attention to a collection of “nice” subsets. Specifying which subsets are “nice” would involve specifying a σ-algebra. This section is mostly technical. Definition (σ-algebra). Let E be a set. A σ-algebra E on E is a collection of subsets of E such that (i) ∅ ∈ E. (ii) A ∈ E implies that AC = X \ A ∈ E. (iii) For any sequence (An ) in E, we have that [ An ∈ E. n

The pair (E, E) is called a measurable space. Note that the axioms imply that σ-algebras are also closed under countable intersections, as we have A ∩ B = (AC ∪ B C )C . Definition (Measure). A measure on a measurable space (E, E) is a function µ : E → [0, ∞] such that (i) µ(∅) = 0 (ii) Countable additivity: For any disjoint sequence (An ) in E, then ! ∞ [ X µ An = µ(An ). n

n=1

Example. Let E be any countable set, and E = P (E) be the set of all subsets of E. A mass function is any function m : E → [0, ∞]. We can then define a measure by setting X µ(A) = m(x). x∈A

In particular, if we put m(x) = 1 for all x ∈ E, then we obtain the counting measure.

5

1

Measures

II Probability and Measure

Countable spaces are nice, because we can always take E = P (E), and the measure can be defined on all possible subsets. However, for “bigger” spaces, we have to be more careful. The set of all subsets is often “too large”. We will see a concrete and also important example of this later. In general, σ-algebras are often described on large spaces in terms of a smaller set, known as the generating sets. Definition (Generator of σ-algebra). Let E be a set, and that A ⊆ P (E) be a collection of subsets of E. We define σ(A) = {A ⊆ E : A ∈ E for all σ-algebras E that contain A}. In other words σ(A) is the smallest sigma algebra that contains A. This is known as the sigma algebra generated by A. Example. Take E = Z, and A = {{x} : x ∈ Z}. Then σ(A) is just P (E), since every subset of E can be written as a countable union of singletons. Example. Take E = Z, and let A = {{x, x + 1, x + 2, x + 3, · · · } : x ∈ E}. Then again σ(E) is the set of all subsets of E. The following is the most important σ-algebra in the course: Definition (Borel σ-algebra). Let E = R, and A = {U ⊆ R : U is open}. Then σ(A) is known as the Borel σ-algebra, which is not the set of all subsets of R. ˜ We can equivalently define this by A˜ = {(a, b) : a < b, a, b ∈ Q}. Then σ(A) is also the Borel σ-algebra. Often, we would like to prove results that allow us to deduce properties about the σ-algebra just by checking it on a generating set. However, usually, we cannot just check it on an arbitrary generating set. Instead, the generating set has to satisfy some nice closure properties. We are now going to introduce a bunch of many different definitions that you need not aim to remember (except when exams are near). Definition (π-system). Let A be a collection of subsets of E. Then A is called a π-system if (i) ∅ ∈ A (ii) If A, B ∈ A, then A ∩ B ∈ A. Definition (d-system). Let A be a collection of subsets of E. Then A is called a d-system if (i) E ∈ A (ii) If A, B ∈ A and A ⊆ B, then B \ A ∈ A (iii) For all increasing sequences (An ) in A, we have that

S

n

An ∈ A.

The point of d-systems and π-systems is that they separate the axioms of a σ-algebra into two parts. More precisely, we have Proposition. A collection A is a σ-algebra if and only if it is both a π-system and a d-system. 6

1

Measures

II Probability and Measure

This follows rather straightforwardly from the definitions. The following definitions are also useful: Definition (Ring). A collection of subsets A is a ring on E if ∅ ∈ A and for all A, B ∈ A, we have B \ A ∈ A and A ∪ B ∈ A. Definition (Algebra). A collection of subsets A is an algebra on E if ∅ ∈ A, and for all A, B ∈ A, we have AC ∈ A and A ∪ B ∈ A. So an algebra is like a σ-algebra, but it is just closed under finite unions only, rather than countable unions. While the names π-system and d-system are rather arbitrary, we can make some sense of the names “ring” and “algebra”. Indeed, a ring forms a ring (without unity) in the algebraic sense with symmetric difference as “addition” and intersection as “multiplication”. Then the empty set acts as the additive identity, and E, if present, acts as the multiplicative identity. Similarly, an algebra is a boolean subalgebra under the boolean algebra P (E). A very important lemma about these things is Dynkin’s lemma: Lemma (Dynkin’s π-system lemma). Let A be a π-system. Then any d-system which contains A contains σ(A). This will be very useful in the future. If we want to show that all elements of σ(A) satisfy a particular property for some generating π-system A, we just have to show that the elements of A satisfy that property, and that the collection of things that satisfy the property form a d-system. While this use case might seem rather contrived, it is surprisingly common when we have to prove things. Proof. Let D be the intersection of all d-systems containing A, i.e. the smallest d-system containing A. We show that D contains σ(A). To do so, we will show that D is a π-system, hence a σ-algebra. There are two steps to the proof, both of which are straightforward verifications: (i) We first show that if B ∈ D and A ∈ A, then B ∩ A ∈ D. (ii) We then show that if A, B ∈ D, then A ∩ B ∈ D. Then the result immediately follows from the second part. We let D0 = {B ∈ D : B ∩ A ∈ D for all A ∈ A}. We note that D0 ⊇ A because A is a π-system, and is hence closed under intersections. We check that D0 is a d-system. It is clear that E ∈ D0 . If we have B1 , B2 ∈ D0 , where B1 ⊆ B2 , then for any A ∈ A, we have (B2 \ B1 ) ∩ A = (B2 ∩ A) \ (B1 ∩ A). By definition of D0 , we know B2 ∩ A and B1 ∩ A are elements of D. Since D is a d-system, we know this intersection is in D. So B2 \ B1 ∈ D0 . S Finally, suppose that (Bn ) is an increasing sequence in D0 , with B = Bn . Then for every A ∈ A, we have that [  [ Bn ∩ A = (Bn ∩ A) = B ∩ A ∈ D. 7

1

Measures

II Probability and Measure

Therefore B ∈ D0 . Therefore D0 is a d-system contained in D, which also contains A. By our choice of D, we know D0 = D. We now let D00 = {B ∈ D : B ∩ A ∈ D for all A ∈ D}. Since D0 = D, we again have A ⊆ D00 , and the same argument as above implies that D00 is a d-system which is between A and D. But the only way that can happen is if D00 = D, and this implies that D is a π-system. After defining all sorts of things that are “weaker versions” of σ-algebras, we now defined a bunch of measure-like objects that satisfy fewer properties. Again, no one really remembers these definitions: Definition (Set function). Let A be a collection of subsets of E with ∅ ∈ A. A set function function µ : A → [0, ∞] such that µ(∅) = 0. Definition (Increasing set function). A set function is increasing if it has the property that for all A, B ∈ A with A ⊆ B, we have µ(A) ≤ µ(B). Definition (Additive set function). A set function is additive if whenever A, B ∈ A and A ∪ B ∈ A, A ∩ B = ∅, then µ(A ∪ B) = µ(A) + µ(B). Definition (Countably additive set function). A set function is countably additive if whenever An is a sequence of disjoint sets in A with ∪An ∈ A, then ! [ X µ(An ). µ An = n

n

Under these definitions, a measure is just a countable additive set function defined on a σ-algebra. Definition (Countably subadditive set function). A set function is countably S subadditive if whenever (An ) is a sequence of sets in A with n An ∈ A, then ! [ X µ An ≤ µ(An ). n

n

The big theorem that allows us to construct measures is the Caratheodory extension theorem. In particular, this will help us construct the Lebesgue measure on R. Theorem (Caratheodory extension theorem). Let A be a ring on E, and µ a countably additive set function on A. Then µ extends to a measure on the σ-algebra generated by A. Proof. (non-examinable) We start by defining what we want our measure to be. For B ⊆ E, we set ( ) X [ ∗ µ (B) = inf µ(An ) : (An ) ∈ A and B ⊆ An . n

8

1

Measures

II Probability and Measure

If it happens that there is no such sequence, we set this to be ∞. This measure is known as the outer measure. It is clear that µ∗ (φ) = 0, and that µ∗ is increasing. We say a set A ⊆ E is µ∗ -measurable if µ∗ (B) = µ∗ (B ∩ A) + µ∗ (B ∩ AC ) for all B ⊆ E. We let M = {µ∗ -measurable sets}. We will show the following: (i) M is a σ-algebra containing A. (ii) µ∗ is a measure on M with µ∗ |A = µ. Note that it is not true in general that M = σ(A). However, we will always have M ⊇ σ(A). We are going to break this up into five nice bite-size chunks. Claim. µ∗ is countably subadditive. S P Suppose B ⊆ n Bn . We need to show that µ∗ (B) ≤ n µ∗ (Bn ). We can wlog assume that µ∗ (Bn ) is finite for all n, or else the inequality is trivial. Let ε > 0. Then by definition of the outer measure, for each n, we can find a sequence (Bn,m )∞ m=1 in A with the property that [ Bn ⊆ Bn,m m

and µ∗ (Bn ) +

X ε ≥ µ(Bn,m ). 2n m

Then we have B⊆

[

Bn ⊆

n

[

Bn,m .

n,m

Thus, by definition, we have X X X ε  µ∗ (B) ≤ µ∗ (Bn ). µ∗ (Bn,m ) ≤ µ∗ (Bn ) + n = ε + 2 n n,m n Since ε was arbitrary, we are done. Claim. µ∗ agrees with µ on A. In the first example sheet, we will show that if A is a ring and µ is a countably additive set function on µ, then µ is in fact countably subadditive S and increasing. Assuming this, suppose that A, (An ) are in A and A ⊆ n An . Then by subadditivity, we have X X µ(A) ≤ µ(A ∩ An ) ≤ µ(An ), n

n

9

1

Measures

II Probability and Measure

using that µ is countably subadditivity and increasing. Note that we have to do this in two steps, S rather than just applying countable subadditivity, since we did not assume that n An ∈ A. Taking the infimum over all sequences, we have µ(A) ≤ µ∗ (A). Also, we see by definition that µ(A) ≥ µ∗ (A), since A covers A. So we get that µ(A) = µ∗ (A) for all A ∈ A. Claim. M contains A. Suppose that A ∈ A and B ⊆ E. We need to show that µ∗ (B) = µ∗ (B ∩ A) + µ∗ (B ∩ AC ). Since µ∗ is countably subadditive, we immediately have µ∗ (B) ≤ µ∗ (B ∩ A) + µ∗ (B ∩ AC ). For the other inequality, we first observe that it is trivial if µ∗ (B) is infinite. If it is finite, S then by definition, given ε > 0, we can find some (Bn ) in A such that B ⊆ n Bn and X µ∗ (B) + ε ≥ µ(Bn ). n

Then we have B∩A⊆

[ (Bn ∩ A) n

[ B ∩ A ⊆ (Bn ∩ AC ) C

n

We notice that Bn ∩ AC = Bn \ A ∈ A. Thus, by definition of µ∗ , we have X X µ∗ (B ∩ A) + µ∗ (B ∩ Ac ) ≤ µ(Bn ∩ A) + µ(Bn ∩ AC ) n

n

X = (µ(Bn ∩ A) + µ(Bn ∩ AC )) n

=

X

µ(Bn )

n ∗

≤ µ (Bn ) + ε. Since ε was arbitrary, the result follows. Claim. We show that M is an algebra. We first show that E ∈ M. This is true since we obviously have µ∗ (B) = µ∗ (B ∩ E) + µ∗ (B ∩ E C ) for all B ⊆ E. Next, note that if A ∈ M, then by definition we have, for all B, µ∗ (B) = µ∗ (B ∩ A) + µ∗ (B ∩ AC ). Now note that this definition is symmetric in A and AC . So we also have AC ∈ M . 10

1

Measures

II Probability and Measure

Finally, we have to show that M is closed under intersection (which is equivalent to being closed under union when we have complements). Suppose A1 , A2 ∈ M and B ⊆ E. Then we have µ∗ (B) = µ∗ (B ∩ A1 ) + µ∗ (B ∩ AC 1) ∗ C = µ∗ (B ∩ A1 ∩ A2 ) + µ∗ (B ∩ A1 ∩ AC 2 ) + µ (B ∩ A1 )

= µ∗ (B ∩ (A1 ∩ A2 )) + µ∗ (B ∩ (A1 ∩ A2 )C ∩ A1 ) + µ∗ (B ∩ (A1 ∩ A2 )C ∩ AC 1) = µ∗ (B ∩ (A1 ∩ A2 )) + µ∗ (B ∩ (A1 ∩ A2 )C ). So we have A1 ∩ A2 ∈ M. So M is an algebra. Claim. M is a σ-algebra, and µ∗ is a measure on M. To show that M is a σ-algebra, we need to show that it is closed under countable unions. We let of sets in M, then we S (An ) be a disjoint collection P want to show that A = n An ∈ M and µ∗ (A) = n µ∗ (An ). Suppose that B ⊆ E. Then we have µ∗ (B) = µ∗ (B ∩ A1 ) + µ∗ (B ∩ AC 1) Using the fact that A2 ∈ M and A1 ∩ A2 = ∅, we have C = µ∗ (B ∩ A1 ) + µ∗ (B ∩ A2 ) + µ∗ (B ∩ AC 1 ∩ A2 )

= ··· n X C = µ∗ (B ∩ Ai ) + µ∗ (B ∩ AC 1 ∩ · · · ∩ An ) i=1

n X

µ∗ (B ∩ Ai ) + µ∗ (B ∩ AC ).

i=1

Taking the limit as n → ∞, we have µ∗ (B) ≥

∞ X

µ∗ (B ∩ Ai ) + µ∗ (B ∩ AC ).

i=1

By the countable-subadditivity of µ∗ , we have µ∗ (B ∩ A) ≤

∞ X

µ∗ (B ∩ Ai ).

i=1

Thus we obtain µ∗ (B) ≥ µ∗ (B ∩ A) + µ∗ (B ∩ AC ). By countable subadditivity, we also have inequality in the other direction. So equality holds. So A ∈ M. So M is a σ-algebra. To see that µ∗ is a measure on M, note that the above implies that µ∗ (B) =

∞ X

(B ∩ Ai ) + µ∗ (B ∩ AC ).

i=1

11

1

Measures

II Probability and Measure

Taking B = A, this gives µ∗ (A) =

∞ ∞ X X (A ∩ Ai ) + µ∗ (A ∩ AC ) = µ∗ (Ai ). i=1

i=1

Note that when A itself is actually a σ-algebra, the outer measure can be simply written as µ∗ (B) = inf{µ(A) : A ∈ A, B ⊆ A}. Caratheodory gives us the existence of some measure extending the set function on A. Could there be many? In general, there could. However, in the special case where the measure is finite, we do get uniqueness. Theorem. Suppose that µ1 , µ2 are measures on (E, E) with µ1 (E) = µ2 (E) < ∞. If A is a π-system with σ(A) = E, and µ1 agrees with µ2 on A, then µ1 = µ2 . Proof. Let D = {A ∈ E : µ1 (A) = µ2 (A)} We know that D ⊇ A. By Dynkin’s lemma, it suffices to show that D is a d-system. The things to check are: (i) E ∈ D — this follows by assumption. (ii) If A, B ∈ D with A ⊆ B, then B \ A ∈ D. Indeed, we have the equations µ1 (B) = µ1 (A) + µ1 (B \ A) < ∞ µ2 (B) = µ2 (A) + µ2 (B \ A) < ∞. Since µ1 (B) = µ2 (B) and µ1 (A) = µ2 (A), we must have µ1 (B \ A) = µ2 (B \ A). S (iii) Let (An ) ∈ D be an increasing sequence with An = A. Then µ1 (A) = lim µ1 (An ) = lim µ2 (An ) = µ2 (A). n→∞

n→∞

So A ∈ D. The assumption that µ1 (E) = µ2 (E) < ∞ is necessary. The theorem does not necessarily hold without it. We can see this from a simple counterexample: Example. Let E = Z, and let E = P (E). We let A = {{x, x + 1, x + 2, · · · } : x ∈ E} ∪ {∅}. This is a π-system with σ(A) = E. We let µ1 (A) be the number of elements in A, and µ2 = 2µ1 (A). Then obviously µ1 = 6 µ2 , but µ1 (A) = ∞ = µ2 (A) for A ∈ A. Definition (Borel σ-algebra). Let E be a topological space. We define the Borel σ-algebra as B(E) = σ({U ⊆ E : U is open}). We write B for B(R).

12

1

Measures

II Probability and Measure

Definition (Borel measure and Radon measure). A measure µ on (E, B(E)) is called a Borel measure. If µ(K) < ∞ for all K ⊆ E compact, then µ is a Radon measure. The most important example of a Borel measure we will consider is the Lebesgue measure. Theorem. There exists a unique Borel measure µ on R with µ([a, b]) = b − a. Proof. We first show uniqueness. Suppose µ ˜ is another measure on B satisfying the above property. We want to apply the previous uniqueness theorem, but our measure is not finite. So we need to carefully get around that problem. For each n ∈ Z, we set µn (A) = µ(A ∩ (n, n + 1])) µ ˜n (A) = µ ˜(A ∩ (n, n + 1])) Then µn and µ ˜n are finite measures on B which agree on the π-system of intervals of the form (a, b] with a, b ∈ R, a < b. Therefore we have µn = µ ˜n for all n ∈ Z. Now we have X X X µ(A) = µ(A ∩ (n, n + 1]) = µn (A) = µ ˜n (A) = µ ˜(A) n∈Z

n∈Z

n∈Z

for all Borel sets A. To show existence, we want to use the Caratheodory extension theorem. We let A be the collection of finite, disjoint unions of the form A = (a1 , b1 ] ∪ (a2 , b2 ] ∪ · · · ∪ (an , bn ]. Then A is a ring of subsets of R, and σ(A) = B (details are to be checked on the first example sheet). We set n X µ(A) = (bi − ai ). i=1

We note that µ is well-defined, since if A = (a1 , b1 ] ∪ · · · ∪ (an , bn ] = (˜ a1 , ˜b1 ] ∪ · · · ∪ (˜ an , ˜bn ], then

n X

(bi − ai ) =

i=1

n X (˜bi − a ˜i ). i=1

Also, if µ is additive, A, B ∈ A, A ∩ B = ∅ and A ∪ B ∈ A, we obviously have µ(A ∪ B) = µ(A) + µ(B). So µ is additive. Finally, we have to show that µ isSin fact countably additive. Let (An ) be a ∞ disjoint P sequence in A, and let A = i=1 An ∈ A. Then we need to show that ∞ µ(A) = n=1 µ(An ). Since µ is additive, we have µ(A) = µ(A1 ) + µ(A \ A1 ) = µ(A1 ) + µ(A2 ) + µ(A \ A1 ∪ A2 ) ! n n X [ = µ(Ai ) + µ A \ Ai i=1

i=1

13

1

Measures

II Probability and Measure

To finish the proof, we show that µ A\

n [

! → 0 as n → ∞.

Ai

i=1

We are going to reduce this to the finite intersection property of compact Tn sets in R: if (Kn ) is a sequence of compact sets in R with the property that m=1 Km 6= ∅ T∞ for all n, then m=1 Km 6= ∅. We first introduce some new notation. We let n [

Bn = A \

Am .

m=1

We now suppose, for contradiction, that µ(Bn ) 6→ 0 as n → ∞. Since the Bn ’s are decreasing, there must exist ε > 0 such that µ(Bn ) ≥ 2ε for every n. For each n, we take Cn ∈ A with the property that Cn ⊆ Bn and µ(Bn \Cn ) ≤ ε . This is possible since each Bn is just a finite union of intervals. Thus we n 2 have ! ! n n \ \ µ(Bn ) − µ C m = µ Bn \ Cm m=1

m=1 n [

≤µ

! (Bm \ Cm )

m=1

≤ ≤

n X

µ(Bm \ Cm )

m=1 n X

ε 2m m=1

≤ ε. On the other hand, we also know that µ(Bn ) ≥ 2ε. ! n \ µ Cm ≥ ε m=1

Tn

for all n. We now let that Kn = m=1 Cm . Then µ(Kn ) ≥ ε, and in particular Kn 6= ∅ for all n. Thus, the finite intersection property says ∅= 6

∞ \

Kn ⊆

n=1

∞ \

Bn = ∅.

n=1

This is a contradiction. So we have µ(Bn ) → 0 as n → ∞. So done. Definition (Lebesgue measure). The Lebesgue measure is the unique Borel measure µ on R with µ([a, b]) = b − a. Note that the Lebesgue measure is not a finite measure, since µ(R) = ∞. However, it is a σ-finite measure. 14

1

Measures

II Probability and Measure

Definition (σ-finite measure). Let (E, E) be a measurable space, and µ a measure. We say µ is σ-finite if there exists a sequence (En ) in E such that S E = E and µ(En ) < ∞ for all n. n n This is the next best thing we can hope after finiteness, and often proofs that involve finiteness carry over to σ-finite measures. Proposition. The Lebesgue measure is translation invariant, i.e. µ(A + x) = µ(A) for all A ∈ B and x ∈ R, where A + x = {y + x, y ∈ A}. Proof. We use the uniqueness of the Lebesgue measure. We let µx (A) = µ(A + x) for A ∈ B. Then this is a measure on B satisfying µx ([a, b]) = b − a. So the uniqueness of the Lebesgue measure shows that µx = µ. It turns out that translation invariance actually characterizes the Lebesgue measure. ˜ be a Borel measure on R that is translation invariant and Proposition. Let µ µ([0, 1]) = 1. Then µ ˜ is the Lebesgue measure. Proof. We show that any such measure must satisfy µ([a, b]) = b − a. By additivity and translation invariance, we can show that µ([p, q]) = q − p for all rational p < q. By considering µ([p, p + 1/n]) for all n and using the increasing property, we know µ({p}) = 0. So µ(([p, q)) = µ((p, q]) = µ((p, q)) = q − p for all rational p, q. Finally, by countable additivity, we can extend this to all real intervals. Then the result follows from the uniqueness of the Lebesgue measure. In the proof of the Caratheodory extension theorem, we constructed a measure µ∗ on the σ-algebra M of µ∗ -measurable sets which contains A. This contains B = σ(A), but could in fact be bigger than it. We call M the Lebesgue σ-algebra. Indeed, it can be given by M = {A ∪ N : A ∈ B, N ⊆ B ∈ B with µ(B) = 0}. If A ∪ N ∈ M, then µ(A ∪ N ) = µ(A). The proof is left for the example sheet. It is also true that M is strictly larger than B, so there exists A ∈ M with A 6∈ B. Construction of such a set was on last year’s exam (2016). On the other hand, it is also true that not all sets are Lebesgue measurable. This is a rather funny construction.

15

1

Measures

II Probability and Measure

Example. For x, y ∈ [0, 1), we say x ∼ y if x − y is rational. This defines an equivalence relation on [0, 1). By the axiom of choice, we pick a representative of each equivalence class, and put them into a set S ⊆ [0, 1). We will show that S is not Lebesgue measurable. Suppose that S were Lebesgue measurable. We are going to get a contradiction to the countable additivity of the Lebesgue measure. For each rational r ∈ [0, 1) ∩ Q, we define Sr = {s + r mod 1 : s ∈ S}. By translation invariance, we know Sr is also Lebesgue measurable, and µ(Sr ) = µ(S). S Also, by construction of S, we know (Sr )r∈Q is disjoint, and r∈Q Sr = [0, 1). Now by countable additivity, we have   [ X X 1 = µ([0, 1)) = µ  Sr  = µ(Sr ) = µ(S), r∈Q

r∈Q

r∈Q

which is clearly not possible. Indeed, if µ(S) = 0, then this says 1 = 0; If µ(S) > 0, then this says 1 = ∞. Both are absurd.

1.2

Probability measures

Since the course is called “probability and measure”, we’d better start talking about probability! It turns out the notions we care about in probability theory are very naturally just special cases of the concepts we have previously considered. Definition (Probability measure and probability space). Let (E, E) be a measure space with the property that µ(E) = 1. Then we often call µ a probability measure, and (E, E, µ) a probability space. Probability spaces are usually written as (Ω, F, P) instead. Definition (Sample space). In a probability space (Ω, F, P), we often call Ω the sample space. Definition (Events). In a probability space (Ω, F, P), we often call the elements of F the events. Definition (Probaiblity). In a probability space (Ω, F, P), if A ∈ F, we often call P[A] the probability of the event A. These are exactly the same things as measures, but with different names! However, thinking of them as probabilities could make us ask different questions about these measure spaces. For example, in probability, one is often interested in independence. Definition (Independence of events). A sequence of events (An ) is said to be independent if " # \ Y P An = P[An ] n∈J

n∈J

for all finite subsets J ⊆ N. 16

1

Measures

II Probability and Measure

However, it turns out that talking about independence of events is usually too restrictive. Instead, we want to talk about the independence of σ-algebras: Definition (Independence of σ-algebras). A sequence of σ-algebras (An ) with An ⊆ F for all n is said to be independent if the following is true: If (An ) is a sequence where An ∈ An for all n, them (An ) is independent. Proposition. Events (An ) are independent iff the σ-algebras σ(An ) are independent. While proving this directly would be rather tedious (but not too hard), it is an immediate consequence of the following theorem: Theorem. Suppose A1 and A2 are π-systems in F. If P[A1 ∩ A2 ] = P[A1 ]P[A2 ] for all A1 ∈ A1 and A2 ∈ A2 , then σ(A1 ) and σ(A2 ) are independent. Proof. This will follow from two applications of the fact that a finite measure is determined by its values on a π-system which generates the entire σ-algebra. We first fix A1 ∈ A1 . We define the measures µ(A) = P[A ∩ A1 ] and ν(A) = P[A]P[A1 ] for all A ∈ F. By assumption, we know µ and ν agree on A2 , and we have that µ(Ω) = P[A1 ] = ν(Ω) ≤ 1 < ∞. So µ and ν agree on σ(A2 ). So we have P[A1 ∩ A2 ] = µ(A2 ) = ν(A2 ) = P[A1 ]P[A2 ] for all A2 ∈ σ(A2 ). So we have now shown that if A1 and A2 are independent, then A1 and σ(A2 ) are independent. By symmetry, the same argument shows that σ(A1 ) and σ(A2 ) are independent. Say we are rolling a dice. Instead of asking what the probability of getting a 6, we might be interested instead in the probability of getting a 6 infinitely often. Intuitively, the answer is “it happens with probability 1”, because in each dice roll, we have a probability of 16 of getting a 6, and they are all independent. We would like to make this precise and actually prove it. It turns out that the notions of “occurs infinitely often” and also “occurs eventually” correspond to more analytic notions of lim sup and lim inf. Definition (limsup and liminf). Let (An ) be a sequence of events. We define \ [ lim sup An = Am n m≥n

lim inf An =

[ \ n m≥n

17

Am .

1

Measures

II Probability and Measure

To parse these definitions more easily, we can read ∩ as “for all”, and ∪ as “there exits”. For example, we can write lim sup An = ∀n, ∃m ≥ n such that Am occurs = {x : ∀n, ∃m ≥ n, x ∈ Am } = {Am occurs infinitely often} = {Am i.o.} Similarly, we have lim inf An = ∃n, ∀m ≥ n such that Am occurs = {x : ∃n, ∀m ≥ n, x ∈ Am } = {Am occurs eventually} = {Am e.v.} We are now going to prove two “obvious” results, known as the Borel–Cantelli lemmas. These give us necessary conditions for an event to happen infinitely often, and in the case where the events are independent, the condition is also sufficient. Lemma (Borel–Cantelli lemma). If X P[An ] < ∞, n

then P[An i.o.] = 0. Proof. For each k, we have   \ [ P[An i.o] = P  Am  n m≥n

 ≤ P

 [

Am 

m≥k ∞ X

P[Am ]

m=k

→0 as k → ∞. So we have P[An i.o.] = 0. Note that we did not need to use the fact that we are working with a probability measure. So in fact this holds for any measure space. Lemma (Borel–Cantelli lemma II). Let (An ) be independent events. If X P[An ] = ∞, n

then P[An i.o.] = 1. 18

1

Measures

II Probability and Measure

Note that independence is crucial. IfPwe flip a fair Pcoin, and we set all the An to be equal to “getting a heads”, then n P[An ] = n 12 = ∞, but we certainly do not have P[An i.o.] = 1. Instead it is just 21 . Proof. By example sheet, if (An ) is independent, then so is (AC n ). Then we have " N # N \ Y C P Am = P[AC m] m=n

m=n

=

N Y

(1 − P[Am ])

m=n

N Y

exp(−P[Am ])

m=n

= exp −

N X

! P[Am ]

m=n

→0 P as N → ∞, as we assumed that n P[An ] = ∞. So we have " ∞ # \ C P Am = 0. m=n

By countable subadditivity, we have " # ∞ [ \ C P Am = 0. n m=n

This in turn implies that " # " # ∞ ∞ \ [ [ \ C P Am = 1 − P Am = 1. n m=n

n m=n

So we are done.

19

2

Measurable functions and random variables

2

II Probability and Measure

Measurable functions and random variables

We’ve had enough of measurable sets. As in most of mathematics, not only should we talk about objects, but also maps between objects. Here we want to talk about maps between measure spaces, known as measurable functions. In the case of a probability space, a measurable function is a random variable! In this chapter, we are going to start by defining a measurable function and investigate some of its basic properties. In particular, we are going to prove the monotone class theorem, which is the analogue of Dynkin’s lemma for measurable functions. Afterwards, we turn to the probabilistic aspects, and see how we can make sense of the independence of random variables. Finally, we are going to consider different notions of “convergence” of functions.

2.1

Measurable functions

The definition of a measurable function is somewhat like the definition of a continuous function, except that we replace “open” with “in the σ-algebra”. Definition (Measurable functions). Let (E, E) and (G, G) be measure spaces. A map f : E → G is measurable if for every A ∈ G, we have f −1 (A) = {x ∈ E : f (x) ∈ E} ∈ E. If (G, G) = (R, B), then we will just say that f is measurable on E. If (G, G) = ([0, ∞], B), then we will just say that f is non-negative measurable. If E is a topological space and E = B(E), then we call f a Borel function. How do we actually check in practice that a function is measurable? It turns out we are lucky. We can simply check that f −1 (A) ∈ E for A in any generating set Q of G. Lemma. Let (E, E) and (G, G) be measurable spaces, and G = σ(Q) for some Q. If f −1 (A) ∈ E for all A ∈ Q, then f is measurable. Proof. We claim that {A ⊆ G : f −1 (A) ∈ E} is a σ-algebra on G. Then the result follows immediately by definition of σ(Q). Indeed, this follows from the fact that f −1 preserves everything. More precisely, we have ! [ [ −1 f An = f −1 (An ), f −1 (AC ) = (f −1 (A))C , f −1 (∅) = ∅. n

n

So if, say, all An ∈ A, then so is

S

n

An .

Example. In the particular case where we have a function f : E → R, we know that B = B(R) is generated by (−∞, y] for y ∈ R. So we just have to check that {x ∈ E : f (x) ≤ y} = f −1 ((−∞, y])) ∈ E.

20

2

Measurable functions and random variables

II Probability and Measure

Example. Let E, F be topological spaces, and f : E → F be continuous. We will see that f is a measurable function (under the Borel σ-algebras). Indeed, by definition, whenever U ⊆ F is open, we have f −1 (U ) open as well. So f −1 (U ) ∈ B(E) for all U ⊆ F open. But since B(F ) is the σ-algebra generated by the open sets, this implies that f is measurable. This is one very important example. We can do another very important example. Example. Suppose that A ⊆ E. The indicator function of A is 1A (x) : E → {0, 1} given by ( 1 x∈A 1A (x) = . 0 x 6∈ A Suppose we give {0, 1} the non-trivial measure. Then 1A is a measurable function iff A ∈ E. Example. The identity function is always measurable. Example. Composition of measurable functions are measurable. More precisely, if (E, E), (F, F) and (G, G) are measurable spaces, and the functions f : E → F and g : F → G are measurable, then the composition g ◦f : E → G is measurable. Indeed, if A ∈ G, then g −1 (A) ∈ F, so f −1 (g −1 (A)) ∈ E. But f −1 (g −1 (A)) = (g ◦ f )−1 (A). So done. Definition (σ-algebra generated by functions). Now suppose we have a set E, and a family of real-valued functions {fi : i ∈ I} on E. We then define σ(fi : i ∈ I) = σ(fi−1 (A) : A ∈ B, i ∈ I). This is the smallest σ-algebra on E which makes all the fi ’s measurable. This is analogous to the notion of initial topologies for topological spaces. If we want to construct more measurable functions, the following definition will be rather useful: Definition (Product measurable space). Let (E, E) and (G, G) be measure spaces. We define the product measure space as E × G whose σ-algebra is generated by the projections E×G π1

.

π2

E

G

More explicitly, the σ-algebra is given by E ⊗ G = σ({A × B : A ∈ E, B ∈ G}). More generally, if (Ei , Ei )Q is a collection of measure spaces, the product measure space has Q underlying set i Ei , and the σ-algebra generated by the projection maps πi : j Ej → Ei . This satisfies the following property:

21

2

Measurable functions and random variables

II Probability and Measure

Proposition. Q Let fi : E → Fi be functions. Then {fi } are all measurable iff (fi ) : E → Fi is measurable, where the function (fi ) is defined by setting the ith component of (fi )(x) to be fi (x). Proof. If the map (fi ) is measurable, then by composition with the projections πi , we know that each fi is measurable. Q Conversely, if all fi are measurable, then since the σ-algebra of Fi is −1 generated by sets of the form πj (A) : A ∈ Fj , and the pullback of such sets along (fi ) is exactly fj−1 (A), we know the function (fi ) is measurable. Using this, we can prove that a whole lot more functions are measurable. Proposition. Let (E, E) be a measurable space. Let (fn : n ∈ N) be a sequence of non-negative measurable functions on E. Then the following are measurable: f1 + f2 , inf fn , n

max{f1 , f2 },

f1 f2 , sup fn ,

lim inf fn ,

n

n

min{f1 , f2 }, lim sup fn . n

The same is true with “real” replaced with “non-negative”, provided the new functions are real (i.e. not infinity). Proof. This is an (easy) exercise on the example sheet. For example, the sum f1 + f2 can be written as the following composition. E

(f1 ,f2 )

+

[0, ∞]2

[0, ∞].

We know the second map is continuous, hence measurable. The first function is also measurable since the fi are. So the composition is also measurable. The product follows similarly, but for the infimum and supremum, we need to check explicitly that the corresponding maps [0, ∞]N → [0, ∞] is measurable. Notation. We will write f ∧ g = min{f, g},

f ∨ g = max{f, g}.

We are now going to prove the monotone class theorem, which is a “Dynkin’s lemma” for measurable functions. As in the case of Dynkin’s lemma, it will sound rather awkward but will prove itself to be very useful. Theorem (Monotone class theorem). Let (E, E) be a measurable space, and A ⊆ E be a π-system with σ(A) = E. Let V be a vector space of functions such that (i) The constant function 1 = 1E is in V. (ii) The indicator functions 1A ∈ V for all A ∈ A (iii) V is closed under bounded, monotone limits. More explicitly, if (fn ) is a bounded non-negative sequence in V, fn % f (pointwise) and f is also bounded, then f ∈ V. Then V contains all bounded measurable functions.

22

2

Measurable functions and random variables

II Probability and Measure

Note that the conditions for V is pretty like the conditions for a d-system, where taking a bounded, monotone limit is something like taking increasing unions. Proof. We first deduce that 1A ∈ V for all A ∈ E. D = {A ∈ E : 1A ∈ V}. We want to show that D = E. To do this, we have to show that D is a d-system. (i) Since 1E ∈ V, we know E ∈ D. (ii) If 1A ∈ V , then 1 − 1A = 1E\A ∈ V. So E \ A ∈ D. (iii) If (An ) is an increasing sequence in D, then 1An → 1S An monotonically increasingly. So 1S An is in D. So, by Dynkin’s lemma, we know D = E. So V contains indicators of all measurable sets. We will now try to obtain any measurable function by approximating. Suppose that f is bounded and non-negative measurable. We want to show that f ∈ V. To do this, we approximate it by letting fn = 2−n b2n f c =

∞ X

k2−n 1{k2−n ≤f g(y) ⇔ f (x) > y. Explicitly, for x ∈ I, we define f (x) = inf{y ∈ R : x ≤ g(y)}. Proof. We just have to verify that it works. For x ∈ I, consider Jx = {y ∈ R : x ≤ g(y)}. Since g is non-decreasing, if y ∈ Jx and y 0 ≥ y, then y 0 ∈ Jx . Since g is right-continuous, if yn ∈ Jx is such that yn & y, then y ∈ Jx . So we have Jx = [f (x), ∞). Thus, for f ∈ R, we have x ≤ g(y) ⇔ f (x) ≤ y. So we just have to prove the remaining properties of f . Now for x ≤ x0 , we have Jx ⊆ Jx0 . So f (x) ≤ f (x0 ). So f is non-decreasing. T Similarly, if xn % x, then we have Jx = n Jxn . So f (xn ) → f (x). So this is left continuous. Example. If g is given by the function

then f is given by

24

2

Measurable functions and random variables

II Probability and Measure

This allows us to construct new measures on R with ease. Theorem. Let g : R → R be non-constant, non-decreasing and right continuous. Then there exists a unique Radon measure dg on B such that dg((a, b]) = g(b) − g(a). Moreover, we obtain all non-zero Radon measures on R in this way. We have already seen an instance of this when we g was the identity function. Given the lemma, this is very easy. Proof. Take I and f as in the previous lemma, and let µ be the restriction of the Lebesgue measure to Borel subsets of I. Now f is measurable since it is left continuous. We define dg = µ ◦ f −1 . Then we have dg((a, b]) = µ({x ∈ I : a < f (x) ≤ b}) = µ({x ∈ I : g(a) < x ≤ g(b)}) = µ((g(a), g(b)]) = g(b) − g(a). So dg is a Radon measure with the required property. There are no other such measures by the argument used for uniqueness of the Lebesgue measure. To show we get all non-zero Radon measures this way, suppose we have a Radon measure ν on R, we want to produce a g such that ν = dg. We set ( −ν((y, 0]) y ≤ 0 g(y) = . ν((0, y]) y>0 Then ν((a, b]) = g(b) − g(a). We see that ν is non-zero, so g is non-constant. It is also easy to see it is non-decreasing and right continuous. So ν = dg by continuity.

2.3

Random variables

We are now going to look at these ideas in the context of probability. It turns out they are concepts we already know and love! Definition (Random variable). Let (Ω, F, P) be a probability space, and (E, E) a measurable space. Then an E-valued random variable is a measurable function X : Ω → E. By default, we will assume the random variables are real. Usually, when we have a random variable X, we might ask questions like “what is the probability that X ∈ A?”. In other words, we are asking for the “size” of the set of things that get sent to A. This is just the image measure! Definition (Distribution/law). Given a random variable X : Ω → E, the distribution or law of X is the image measure µx : P ◦ X −1 . We usually write P(X ∈ A) = µx (A) = P(X −1 (A)).

25

2

Measurable functions and random variables

II Probability and Measure

If E = R, then µx is determined by its values on the π-system of intervals (−∞, y]. We set FX (x) = µX ((−∞, x]) = P(X ≤ x) This is known as the distribution function of X. Proposition. We have ( 0 FX (x) → 1

x → −∞ . x → +∞

Also, FX (x) is non-decreasing and right-continuous. We call any function F with these properties a distribution function. Definition (Distribution function). A distribution function is a non-decreasing, right continuous function f : R → [0, 1] satisfying ( 0 x → −∞ FX (x) → . 1 x → +∞ We now want to show that every distribution function is indeed a distribution. Proposition. Let F be any distribution function. Then there exists a probability space (Ω, F, P) and a random variable X such that FX = F . Proof. Take (Ω, F, P) = ((0, 1), B(0, 1), Lebesgue). We take X : Ω → R to be X(ω) = inf{x : ω ≤ f (x)}. Then we have X(ω) ≤ x ⇐⇒ w ≤ F (x). So we have FX (x) = P[X ≤ x] = P[(0, F (x)]] = F (x). Therefore FX = F . This construction is actually very useful in practice. If we are writing a computer program and want to sample a random variable, we will use this procedure. The computer usually comes with a uniform (pseudo)-random number generator. Then using this procedure allows us to produce random variables of any distribution from a uniform sample. The next thing we want to consider is the notion of independence of random variables. Recall that for random variables X, Y , we used to say that they are independent if for any A, B, we have P[X ∈ A, Y ∈ B] = P[X ∈ A]P[Y ∈ B]. But this is exactly the statement that the σ-algebras generated by X and Y are independent! Definition (Independence of random variables). A family (Xn ) of random variables is said to be independent if the family of σ-algebras (σ(Xn )) is independent. 26

2

Measurable functions and random variables

II Probability and Measure

Proposition. Two real-valued random variables X, Y are independent iff P[X ≤ x, Y ≤ y] = P[X ≤ x]P[Y ≤ y]. More generally, if (Xn ) is a sequence of real-valued random variables, then they are independent iff P[x1 ≤ x1 , · · · , xn ≤ xn ] =

n Y

P[Xj ≤ xj ]

j=1

for all n and xj . Proof. The ⇒ direction is obvious. For the other direction, we simply note that {(−∞, x] : x ∈ R} is a generating π-system for the Borel σ-algebra of R. In probability, we often say things like “let X1 , X2 , · · · be iid random variables”. However, how can we guarantee that iid random variables do indeed exist? We start with the less ambitious goal of finding iid Bernoulli(1/2) random variables: Proposition. Let (Ω, F, P) = ((0, 1), B(0, 1), Lebesgue). be our probability space. Then there exists as sequence Rn of independent Bernoulli(1/2) random variables. Proof. Suppose we have ω ∈ Ω = (0, 1). Then we write ω as a binary expansion ω=

∞ X

ωn 2−n ,

n=1

where ωn ∈ {0, 1}. We make the binary expansion unique by disallowing infinite sequences of zeroes. We define Rn (ω) = ωn . We will show that Rn is measurable. Indeed, we can write R1 (ω) = ω1 = 1(1/2,1] (ω), where 1(1/2,1] is the indicator function. Since indicator functions of measurable sets are measurable, we know R1 is measurable. Similarly, we have R2 (ω) = 1(1/4,1/2] (ω) + 1(3/4,1] (ω). So this is also a measurable function. More generally, we can do this for any Rn (ω): we have n−1 2X Rn (ω) = 1(2−n (2j−1),2−n (2j)] (ω). j=1

So each Rn is a random variable, as each can be expressed as a sum of indicators of measurable sets. Now let’s calculate P[Rn = 1] =

2n−1 X

−n

2

((2j) − (2j − 1)) =

j=1

2n−1 X j=1

27

2−n =

1 . 2

2

Measurable functions and random variables

II Probability and Measure

Then we have P[Rn = 0] = 1 − P[Rn = 1] =

1 2

as well. So Rn ∼ Bernoulli(1/2). We can straightforwardly check that (Rn ) is an independent sequence, since for n 6= m, we have P[Rn = 0 and Rm = 0] =

1 = P[Rn = 0]P[Rm = 0]. 4

We will now use the (Rn ) to construct any independent sequence for any distribution. Proposition. Let (Ω, F, P) = ((0, 1), B(0, 1), Lebesgue). Given any sequence (Fn ) of distribution functions, there is a sequence (Xn ) of independent random variables with FXn = Fn for all n. Proof. Let m : N2 → N be any bijection, and relabel Yk,n = Rm(k,n) , where the Rj are as in the previous random variable. We let Yn =

∞ X

2−k Yk,n .

k=1

Then we know that (Yn ) is an independent sequence of random variables, and each is uniform on (0, 1). As before, we define Gn (y) = inf{x : y ≤ Fn (x)}. We set Xn = Gn (Yn ). Then (Xn ) is a sequence of random variables with FXn = Fn . We end Pn the section with a random fact: let (Ω, F, P) and Rj be as above. Then n1 j=1 Rj is the average of n independent of Bernoulli(1/2) random variables. The weak law of large numbers says for any ε > 0, we have   X 1 n 1 P  Rj − ≥ ε → 0 as n → ∞. 2 n j=1 The strong law of large numbers, which we will prove later, says that   n  X 1 1  P ω : Rj → = 1.  n j=1 2 So “almost every number” in (0, 1) has an equal proportion of 0’s and 1’s in its binary expansion. This is known as the normal number theorem.

28

2

Measurable functions and random variables

2.4

II Probability and Measure

Convergence of measurable functions

The next thing to look at is the convergence of measurable functions. In measure theory, wonderful things happen when we talk about convergence. In analysis, most of the time we had to require uniform convergence, or even stronger notions, if we want limits to behave well. However, in measure theory, the kinds of convergence we talk about are somewhat pointwise in nature. In fact, it will be weaker than pointwise convergence. Yet, we are still going to get good properties out of them. Definition (Convergence almost everywhere). Suppose that (E, E, µ) is a measure space. Suppose that (fn ), f are measurable functions. We say fn → f almost everywhere (a.e.) if µ({x ∈ E : fn (x) 6→ f (x)}) = 0. If (E, E, µ) is a probability space, this is called almost sure convergence. To see this makes sense, i.e. the set in there is actually measurable, note that {x ∈ E : fn (x) 6→ f (x)} = {x ∈ E : lim sup |fn (x) − f (x)| > 0}. We have previously seen that lim sup |fn − f | is non-negative measurable. So the set {x ∈ E : lim sup |fn (x) − f (x)| > 0} is measurable. Another useful notion of convergence is convergence in measure. Definition (Convergence in measure). Suppose that (E, E, µ) is a measure space. Suppose that (fn ), f are measurable functions. We say fn → f in measure if for each ε > 0, we have µ({x ∈ E : |fn (x) − f (x)| ≥ ε}) → 0 as n → ∞, then we say that fn → f in measure. If (E, E, µ) is a probability space, then this is called convergence in probability. In the case of a probability space, this says P(|Xn − X| ≥ ε) → 0 as n → ∞ for all ε, which is how we state the weak law of large numbers in the past. After we define integration, we can consider the norms of a function f by Z kf kp =

|f (x)|p dx

1/p .

Then in particular, if kfn − f kp → 0, then fn → f in measure, and this provides an easy way to see that functions converge in measure. In general, neither of these notions imply each other. However, the following theorem provides us with a convenient dictionary to translate between the two notions. Theorem. (i) If µ(E) < ∞, then fn → f a.e. implies fn → f in measure. 29

2

Measurable functions and random variables

II Probability and Measure

(ii) For any E, if fn → f in measure, then there exists a subsequence (fnk ) such that fnk → f a.e. Proof. (i) First suppose µ(E) < ∞, and fix ε > 0. Consider µ({x ∈ E : |fn (x) − f (x)| ≤ ε}). We use the result from the first example sheet that for any sequence of events (An ), we have lim inf µ(An ) ≥ µ(lim inf An ). Applying to the above sequence says lim inf µ({x : |fn (x) − f (x)| ≤ ε}) ≥ µ({x : |fm (x) − f (x)| ≤ ε eventually}) ≥ µ({x ∈ E : |fm (x) − f (x)| → 0}) = µ(E). As µ(E) < ∞, we have µ({x ∈ E : |fn (x) − f (x)| > ε}) → 0 as n → ∞. (ii) Suppose that fn → f in measure. We pick a subsequence (nk ) such that   1 ≤ 2−k . µ x ∈ E : |fnk (x) − f (x)| > k Then we have   X ∞ ∞ X 1 µ x ∈ E : fnk (x) − f (x)| > 2−k = 1 < ∞. ≤ k k=1

k=1

By the first Borel–Cantelli lemma, we know   1 µ x ∈ E : |fnk (x) − f (x)| > i.o. = 0. k So fnk → f a.e. It is important that we assume that µ(E) < ∞ for the first part. Example. Consider (E, E, µ) = (R, B, Lebesgue). Take fn (x) = 1[n,∞) (x). Then fn (x) → 0 for all x, and in particular almost everywhere. However, we have   1 µ x ∈ R : |fn (x)| > = µ([n, ∞)) = ∞ 2 for all n. There is one last type of convergence we are interested in. We will only first formulate it in the probability setting, but there is an analogous notion in measure theory known as weak convergence, which we will discuss much later on in the course.

30

2

Measurable functions and random variables

II Probability and Measure

Definition (Convergence in distribution). Let (Xn ), X be random variables with distribution functions FXn and FX , then we say Xn → X in distribution if FXn (x) → FX (x) for all x ∈ R at which FX is continuous. Note that here we do not need that (Xn ) and X live on the same probability space, since we only talk about the distribution functions. But why do we have the condition with continuity points? The idea is that if the resulting distribution has a “jump” at x, it doesn’t matter which side of the jump FX (x) is at. Here is a simple example that tells us why this is very important: Example. Let Xn to be uniform on [0, 1/n]. Intuitively, this should converge to the random variable that is always zero. We can compute   x≤0 0 FXn (x) = nx 0 < x < 1/n .   1 x ≥ 1/n We can also compute the distribution of the zero random variable as ( 0 x 0. Since x ∈ S, this implies that there is some δ > 0 such that ε 2 ε FX (x + δ) ≤ FX (x) + . 2 FX (x − δ) ≥ FX (x) −

31

2

Measurable functions and random variables

II Probability and Measure

We fix N large such that n ≥ N implies P[|Xn − X| ≥ δ] ≤ 2ε . Then FXn (x) = P[Xn ≤ x] = P[(Xn − X) + X ≤ x] We now notice that {(Xn − X) + X ≤ x} ⊆ {X ≤ x + δ} ∪ {|Xn − X| > δ}. So we have ≤ P[X ≤ x + δ] + P[|Xn − X| > δ] ε ≤ FX (x + δ) + 2 ≤ FX (x) + ε. We similarly have FXn (x) = P[Xn ≤ x] ≥ P[X ≤ x − δ] − P[|Xn − X| > δ] ε ≥ FX (x − δ) − 2 ≥ FX (x) − ε. Combining, we have that n ≥ N implying |Fxn (x) − FX (x)| ≤ ε. Since ε was arbitrary, we are done. (ii) Suppose Xn → X in distribution. We again let (Ω, F, B) = ((0, 1), B((0, 1)), Lebesgue). We let ˜ n (ω) = inf{x : ω ≤ FX (x)}, X n ˜ X(ω) = inf{x : ω ≤ FX (x)}. ˜ n has the same distribution function as Xn for Recall from before that X ˜ has the same distribution as X. Moreover, we have all n, and X ˜ n (ω) ≤ x ⇔ ω ≤ FX (x) X n ˜ x < Xn (ω) ⇔ FX (x) < ω, n

and similarly if we replace Xn with X. ˜n → X ˜ We are now going to show that with this particular choice, we have X a.s. ˜ is a non-decreasing function (0, 1) → R. Then by general Note that X ˜ has at most countably many discontinuities. We write analysis, X ˜ is continuous at ω0 }. Ω0 = {ω ∈ (0, 1) : X Then (0, 1) \ Ω0 is countable, and hence has Lebesgue measure 0. So P[Ω0 ] = 1. 32

2

Measurable functions and random variables

II Probability and Measure

˜ n (ω) → X(ω) ˜ We are now going to show that X for all ω ∈ Ω0 . Note that FX is a non-decreasing function, and hence the points of discontinuity R \ S is also countable. So S is dense in R. Fix ω ∈ Ω0 and ε > 0. ˜ n (ω) − X(ω)| ˜ We want to show that |X ≤ ε for all n large enough. Since S is dense in R, we can find x− , x+ in S such that ˜ x− < X(ω) < x+ and x+ − x− < ε. What we want to do is to use the characteristic property ˜ and FX to say that this implies of X FX (x− ) < ω < FX (x+ ). Then since FXn → FX at the points x− , x+ , for sufficiently large n, we have FXn (x− ) < ω < FXn (x+ ). Hence we have ˜ n (ω) < x+ . x− < X ˜ n (ω) − X(ω)| ˜ Then it follows that |X < ε. ˜ However, this doesn’t work, since X(ω) < x+ only implies ω ≤ FX (x+ ), and our argument will break down. So we do a funny thing where we introduce a new variable ω + . ˜ is continuous at ω, we can find ω + ∈ (ω, 1) such that X(ω ˜ + ) ≤ x+ . Since X ˜ X(ω)

x+ x−

ε

ω ω+ Then we have ˜ ˜ + ) < x+ . x− < X(ω) ≤ X(ω Then we have FX (x− ) < ω < ω + ≤ FX (x+ ). So for sufficiently large n, we have FXn (x− ) < ω < FXn (x+ ). So we have ˜ n (ω) ≤ x+ , x− < X and we are done.

33

2

Measurable functions and random variables

2.5

II Probability and Measure

Tail events

Finally, we are going to quickly look at tail events. These are events that depend only on the asymptotic behaviour of a sequence of random variables. Definition (Tail σ-algebra). Let (Xn ) be a sequence of random variables. We let Tn = σ(Xn+1 , Xn+2 , · · · ), and \

T =

Tn .

n

Then T is the tail σ-algebra. Then T -measurable events and random variables only depend on the asymptotic behaviour of the Xn ’s. Example. Let (Xn ) be a sequence of real-valued random variables. Then n

lim sup n→∞

1X Xj , n j=1

n

lim inf n→∞

1X Xj n j=1

are T -measurable random variables. Finally,   n   1X lim Xj exists ∈T, n→∞ n  j=1 since this is just the set of all points where the previous two things agree. Theorem (Kolmogorov 0-1 law). Let (Xn ) be a sequence of independent (realvalued) random variables. If A ∈ T , then P[A] = 0 or 1. Moreover, if X is a T -measurable random variable, then there exists a constant c such that P[X = c] = 1. Proof. The proof is very funny the first time we see it. We are going to prove the theorem by checking something that seems very strange. We are going to show that if A ∈ T , then A is independent of A. It then follows that P[A] = P[A ∩ A] = P[A]P[A], so P[A] = 0 or 1. In fact, we are going to prove that T is independent of T . Let Fn = σ(X1 , · · · , Xn ). This σ-algebra is generated by the π-system of events of the form A = {X1 ≤ x1 , · · · , Xn ≤ xn }. Similarly, Tn = σ(Xn+1 , Xn+2 , · · · ) is generated by the π-system of events of the form B = {Xn+1 ≤ xn+1 , · · · , Xn+k ≤ xn+k }, where k is any natural number. 34

2

Measurable functions and random variables

II Probability and Measure

Since the Xn are independent, we know for any such A and B, we have P[A ∩ B] = P[A]P[B]. Since this is true T for all A and B, it follows that Fn is independent of Tn . SinceST = k Tk ⊆ Tn for each n, we know Fn is independent of T . Now k Fk is a π-system, which generates the σ-algebra F∞ = σ(X1 , X2 , · · · ). S We know that if A ∈ n Fn , then there has to exist an index n such that A ∈ Fn . So A is independent of T . So F∞ is independent of T . Finally, note that T ⊆ F∞ . So T is independent of T . To find the constant, suppose that X is T -measurable. Then P[X ≤ x] ∈ {0, 1} for all x ∈ R since {X ≤ x} ∈ T . Now take c = inf{x ∈ R : P[X ≤ x] = 1}. Then with this particular choice of c, it is easy to see that P[X = c] = 1. This completes the proof of the theorem.

35

3

Integration

3 3.1

II Probability and Measure

Integration Definition and basic properties

We are now going to work towards defining the integral of a measurable function on a measure space (E, E, µ). Different sources use different notations for the integral. The following notations are all commonly used: Z Z Z µ(f ) = f dµ = f (x) dµ(x) = f (x)µ(dx). E

E

E

In the case where (E, E, µ) = (R, B, Lebesgue), people often just write this as Z µ(f ) = f (x) dx. R

On the other hand, if (E, E, µ) = (Ω, F, P) is a probability space, and X is a random variable, then people write the integral as E[X], the expectation of X. So how are we going to define the integral? There are two steps to defining the integral. The idea is that we first define the integral on simple functions, and then extend the definition to more general measurable functions by taking the limit. When we do the definition for simple functions, it will be obvious that the definition satisfies the nice properties, and we will have to check that they are preserved when we take the limit. Definition (Simple function). A simple function is a measurable function that can be written as a finite non-negative linear combination of indicator functions of measurable sets, i.e. n X ak 1Ak f= k=1

for some Ak ∈ E and ak ≥ 0. Note that some sources do not assume that ak ≥ 0, but assuming this makes our life easier. It is obvious that Proposition. A function is simple iff it is measurable, non-negative, and takes on only finitely-many values. Definition (Integral of simple function). The integral of a simple function f=

n X

ak 1Ak

k=1

is given by µ(f ) =

n X

ak µ(Ak ).

k=1

Note that it can be that µ(Ak ) = ∞, but ak = 0. When this happens, we are just going to declare that 0 · ∞ = 0 (this makes sense because this means we are ignoring all 0 · 1A terms for any A). After we do this, we can check the integral is well-defined. 36

3

Integration

II Probability and Measure

We are now going to extend this definition to non-negative measurable functions by a limiting procedure. Once we’ve done this, we are going to extend the definition to measurable functions by linearity of the integral. Then we would have a definition of the integral, and we are going to deduce properties of the integral using approximation. Definition (Integral). Let f be a non-negative measurable function. We set µ(f ) = sup{µ(g) : g ≤ f, g is simple}. For arbitrary f , we write f = f + − f − = (f ∨ 0) + (f ∧ 0). We put |f | = f + + f − . We say f is integrable if µ(|f |) < ∞. In this case, set µ(f ) = µ(f + ) − µ(f − ). If only one of µ(f + ), µ(f− ) < ∞, then we can still make the above definition, and the result will be infinite. In the case where we are integrating over (a subset of) the reals, we call it the Lebesgue integral . Proposition. Let f : [0, 1] → R be Riemann integrable. Then it is also Lebesgue integrable, and the two integrals agree. We will not prove this, but this immediately gives us results like the fundamental theorem of calculus, and also helps us to actually compute the integral. However, note that this does not hold for infinite domains, as you will see in the second example sheet. But the Lebesgue integrable functions are better. A lot of functions are Lebesgue integrable but not Riemann integrable. Example. Take the standard non-Riemann integrable function f = 1[0,1]\Q . Then f is not Riemann integrable, but it is Lebesgue integrable, since µ(f ) = µ([0, 1] \ Q) = 1. We are now going to study some basic properties of the integral. We will first look at the properties of integrals of simple functions, and then extend them to general integrable functions. For f, g simple, and α, β ≥ 0, we have that µ(αf + βg) = αµ(f ) + βµ(g). So the integral is linear. Another important property is monotonicity — if f ≤ g, then µ(f ) ≤ µ(g). Finally, we have f = 0 a.e. iff µ(f ) = 0. It is absolutely crucial here that we are talking about non-negative functions. Our goal is to show that these three properties are also satisfied for arbitrary non-negative measurable functions, and the first two hold for integrable functions. In order to achieve this, we prove a very important tool — the monotone convergence theorem. Later, we will also learn about the dominated convergence theorem and Fatou’s lemma. These are the main and very important results about exchanging limits and integration. 37

3

Integration

II Probability and Measure

Theorem (Monotone convergence theorem). Suppose that (fn ), f are nonnegative measurable with fn % f . Then µ(fn ) % µ(f ). In the proof we will use the fact that the integral is monotonic, which we shall prove later. Proof. We will split the proof into five steps. We will prove each of the following in turn: (i) If fn and f are indicator functions, then the theorem holds. (ii) If f is an indicator function, then the theorem holds. (iii) If f is simple, then the theorem holds. (iv) If f is non-negative measurable, then the theorem holds. Each part follows rather straightforwardly from the previous one, and the reader is encouraged to try to prove it themself. We first consider the case where fn = 1An and f = 1A . Then fn % f is true iff An % A. On the other hand, µ(fn ) % µ(f ) iff µ(An ) % µ(A). For convenience, we let A0 = ∅. We can write ! [ µ(A) = µ An \ An−1 n

=

∞ X

µ(An \ An−1 )

n=1

= lim

N →∞

N X

µ(An \ An−1 )

n=1

= lim µ(AN ). N →∞

So done. We next consider the case where f = 1A for some A. Fix ε > 0, and set An = {fn > 1 − ε} ∈ E. Then we know that An % A, as fn % f . Moreover, by definition, we have (1 − ε)1An ≤ fn ≤ f = 1A . As An % A, we have that (1 − ε)µ(f ) = (1 − ε) lim µ(An ) ≤ lim µ(fn ) ≤ µ(f ) n→∞

n→∞

since fn ≤ f . Since ε is arbitrary, we know that lim µ(fn ) = µ(f ).

n→∞

38

3

Integration

II Probability and Measure

Next, we consider the case where f is simple. We write f=

m X

ak 1Ak ,

k=1

where ak > 0 and Ak are pairwise disjoint. Since fn % f , we know a−1 k fn 1Ak % 1Ak . So we have µ(fn ) =

m X

µ(fn 1Ak ) =

k=1

m X

ak µ(a−1 k fn 1Ak ) →

k=1

m X

ak µ(Ak ) = µ(f ).

k=1

Suppose f is non-negative measurable. Suppose g ≤ f is a simple function. As fn % f , we know fn ∧ g % f ∧ g = g. So by the previous case, we know that µ(fn ∧ g) → µ(g). We also know that µ(fn ) ≥ µ(fn ∧ g). So we have lim µ(fn ) ≥ µ(g)

n→∞

for all g ≤ f . This is possible only if lim µ(fn ) ≥ µ(f )

n→∞

by definition of the integral. However, we also know that µ(fn ) ≤ µ(f ) for all n, again by definition of the integral. So we must have equality. So we have µ(f ) = lim µ(fn ). n→∞

Theorem. Let f, g be non-negative measurable, and α, β ≥ 0. We have that (i) µ(αf + βg) = αµ(f ) + βµ(g). (ii) f ≤ g implies µ(f ) ≤ µ(g). (iii) f = 0 a.e. iff µ(f ) = 0. Proof. (i) Let fn = 2−n b2n f c ∧ n gn = 2−n b2n gc ∧ n. Then fn , gn are simple with fn % f and gn % g. Hence µ(fn ) % µ(f ) and µ(gn ) % µ(g) and µ(αfn + βgn ) % µ(αf + βg), by the monotone convergence theorem. As fn , gn are simple, we have that µ(αfn + βgn ) = αµ(fn ) + βµ(gn ). Taking the limit as n → ∞, we get µ(αf + βg) = αµ(f ) + βµ(g). 39

3

Integration

II Probability and Measure

(ii) We shall be careful not to use the monotone convergence theorem. We have µ(g) = sup{µ(h) : h ≤ g simple} ≥ sup{µ(h) : h ≤ f simple} = µ(f ). (iii) Suppose f 6= 0 a.e. Let  An =

1 x : f (x) > n

 .

Then {x : f (x) 6= 0} =

[

An .

n

Since the left hand set has non-negative measure, it follows that there is some An with non-negative measure. For that n, we define h=

1 1A . n n

Then µ(f ) ≥ µ(h) > 0. So µ(f ) 6= 0. Conversely, suppose f = 0 a.e. We let fn = 2−n b2n f c ∧ n be a simple function. Then fn % f and fn = 0 a.e. So µ(f ) = lim µ(fn ) = 0. n→∞

We now prove the analogous statement for general integrable functions. Theorem. Let f, g be integrable, and α, β ≥ 0. We have that (i) µ(αf + βg) = αµ(f ) + βµ(g). (ii) f ≤ g implies µ(f ) ≤ µ(g). (iii) f = 0 a.e. implies µ(f ) = 0. Note that in the last case, the converse is no longer true, as one can easily see from the sign function sgn : [−1, 1] → R. Proof. (i) We are going to prove these by applying the previous theorem. By definition of the integral, we have µ(−f ) = −µ(f ). Also, if α ≥ 0, then µ(αf ) = µ(αf + ) − µ(αf − ) = αµ(f + ) − αµ(f − ) = αµ(f ). Combining these two properties, it then follows that if α is a real number, then µ(αf ) = αµ(f ). 40

3

Integration

II Probability and Measure

To finish the proof of (i), we have to show that µ(f + g) = µ(f ) + µ(g). We know that this is true for non-negative functions, so we need to employ a little trick to make this a statement about the non-negative version. If we let h = f + g, then we can write this as h+ − h− = (f + − f − ) + (g + − g − ). We now rearrange this as h+ f − + g − = f + + g + + h− . Now everything is non-negative measurable. So applying µ gives µ(f + ) + µ(f − ) + µ(g − ) = µ(f + ) + µ(g + ) + µ(h− ). Rearranging, we obtain µ(h+ ) − µ(h− ) = µ(f + ) − µ(f − ) + µ(g + ) − µ(g − ). This is exactly the same thing as saying µ(f + g) = µ(h) = µ(f ) = µ(g). (ii) If f ≤ g, then g − f ≥ 0. So µ(g − f ) ≥ 0. By (i), we know µ(g) − µ(f ) ≥ 0. So µ(g) ≥ µ(f ). (iii) If f = 0 a.e., then f + , f − = 0 a.e. So µ(f + ) = µ(f − ) = 0. So µ(f ) = µ(f + ) − µ(f − ) = 0. As mentioned, the converse to (iii) is no longer true. However, we do have the following partial converse: Proposition. If A is a π-system with E ∈ A and σ(A) = E, and f is an integrable function that µ(f 1A ) = 0 for all A ∈ A. Then µ(f ) = 0 a.e. Proof. Let D = {A ∈ E : µ(f 1A ) = 0}. It follows immediately from the properties of the integral that D is a d-system. So D = E by Dynkin’s lemma. Let A+ = {x ∈ E : f (x) > 0}, A− = {x ∈ E : f (x) < 0}. Then A± ∈ E, and µ(f 1A+ ) = µ(f 1A− ) = 0. So f 1A+ and f 1A− vanish a.e. So f vanishes a.e. Proposition. Suppose that (gn ) is a sequence of non-negative measurable functions. Then we have ! ∞ ∞ X X µ gn = µ(gn ). n=1

n=1

41

3

Integration

II Probability and Measure

Proof. We know N X

! gn

%

n=1

∞ X

! gn

n=1

as N → ∞. So by the monotone convergence theorem, we have ! ! N N ∞ X X X µ(gn ) = µ gn % µ gn . n=1

But we also know that

n=1

N X

n=1

µ(gn ) %

n=1

∞ X

µ(gn )

n=1

by definition. So we are done. So for non-negative measurable functions, we can always switch the order of integration and summation. Note that we can consider summation as integration. We let E = N and E = {all subsets of N}. We let µ be the counting measure, so that µ(A) is the size of A. Then integrability (and having a finite integral) is the same as absolute convergence. Then if it converges, then we have Z f dµ =

∞ X

f (n).

n=1

So we can just view our proposition as proving that we can swap the order of two integrals. The general statement is known as Fubini’s theorem.

3.2

Integrals and limits

We are now going to prove more things about exchanging limits and integrals. These are going to be extremely useful in the future, as we want to exchange limits and integrals a lot. Theorem (Fatou’s lemma). Let (fn ) be a sequence of non-negative measurable functions. Then µ(lim inf fn ) ≤ lim inf µ(fn ). Note that a special case was proven in the first example sheet, where we did it for the case where fn are indicator functions. Proof. We start with the trivial observation that if k ≥ n, then we always have that inf fm ≤ fk . m≥n

By the monotonicity of the integral, we know that   µ inf fm ≤ µ(fk ). m≥n

for all k ≥ n.

42

3

Integration

II Probability and Measure

So we have  µ

 ≤ inf µ(fk ) ≤ lim inf µ(fm ).

inf fm

m≥n

m

k≥n

It remains to show that the left hand side converges to µ(lim inf fm ). Indeed, we know that inf fm % lim inf fm . m

m≥n

Then by monotone convergence, we have     µ inf fm % µ lim inf fm . m

m≥n

So we have

  µ lim inf fm ≤ lim inf µ(fm ). m

m

No one ever remembers which direction Fatou’s lemma goes, and this leads to many incorrect proofs and results, so it is helpful to keep the following example in mind: Example. We let (E, E, µ) = (R, B, Lebesgue). We let fn = 1[n,n+1] . Then we have lim inf fn = 0. n

So we have µ(fn ) = 1 for all n. So we have lim inf µ(fn ) = 1, So we have

µ(lim inf fn ) = 0.

  µ lim inf fm ≤ lim inf µ(fm ). m

m

The next result we want to prove is the dominated convergence theorem. This is like the monotone convergence theorem, but we are going to remove the increasing and non-negative measurable condition, and add in something else. Theorem (Dominated convergence theorem). Let (fn ), f be measurable with fn (x) → f (x) for all x ∈ E. Suppose that there is an integrable function g such that |fn | ≤ g for all n, then we have µ(fn ) → µ(f ) as n → ∞.

43

3

Integration

II Probability and Measure

Proof. Note that |f | = lim |f |n ≤ g. n

So we know that µ(|f |) ≤ µ(g) < ∞. So we know that f , fn are integrable. Now note also that 0 ≤ g + fn ,

0 ≤ g − fn

for all n. We are now going to apply Fatou’s lemma twice with these series. We have that µ(g) + µ(f ) = µ(g + f )   = µ lim inf (g + fn ) n

≤ lim inf µ(g + fn ) n

= lim inf (µ(g) + µ(fn )) n

= µ(g) + lim inf µ(fn ). n

Since µ(g) is finite, we know that µ(f ) ≤ lim inf µ(fn ). n

We now do the same thing with g − fn . We have µ(g) − µ(f ) = µ(g − f )   = µ lim inf (g − fn ) n

≤ lim inf µ(g − fn ) n

= lim inf (µ(g) − µ(fn )) n

= µ(g) − lim sup µ(fn ). n

Again, since µ(g) is finite, we know that µ(f ) ≥ lim sup µ(fn ). n

These combine to tell us that µ(f ) ≤ lim inf µ(fn ) ≤ lim sup µ(fn ) ≤ µ(f ). n

n

So they must be all equal, and thus µ(fn ) → µ(f ).

3.3

New measures from old

We have previously considered several ways of constructing measures from old ones, such as the image measure. We are now going to study a few more ways of constructing new measures, and see how integrals behave when we do these. 44

3

Integration

II Probability and Measure

Definition (Restriction of measure space). Let (E, E, µ) be a measure space, and let A ∈ E. The restriction of the measure space to A is (A, EA , µA ), where EA = {B ∈ E : B ⊆ A}, and µA is the restriction of µ to EA , i.e. µA (B) = µ(B) for all B ∈ EA . It is easy to check the following: Lemma. For (E, E, µ) a measure space and A ∈ E, the restriction to A is a measure space. Proposition. Let (E, E, µ) and (F, F, µ0 ) be measure spaces and A ∈ E. Let f : E → F be a measurable function. Then f |A is EA -measurable. Proof. Let B ∈ F. Then (f |A )−1 (B) = f −1 (B) ∩ A ∈ EA . Similarly, we have Proposition. If f is integrable, then f |A is µA -integrable and µA (f |A ) = µ(f 1A ). Note that means we have Z µ(f 1A ) =

Z f 1A dµ =

E

f dµA . A

Usually, we are lazy and just write Z µ(f 1A ) =

f dµ. A

In the particular case of Lebesgue integration, if A is an interval with left and right end points a, b (i.e. it can be open, closed, half open or half closed), then we write Z Z b f dµ = f (x) dx. A

a

There is another construction we would be interested in. Definition (Pushforward/image of measure). Let (E, E) and (G, G) be measure spaces, and f : E → G a measurable function. If µ is a measure on (E, E), then ν = µ ◦ f −1 is a measure on (G, G), known as the pushforward or image measure. We have already seen this before, but we can apply this to integration as follows:

45

3

Integration

II Probability and Measure

Proposition. If g is a non-negative measurable function on G, then ν(g) = µ(g ◦ f ). Proof. Exercise using the monotone class theorem (see example sheet). Finally, we can specify a measure by specifying a density. Definition (Density). Let (E, E, µ) be a measure space, and f be a non-negative measurable function. We define ν(A) = µ(f 1A ). Then ν is a measure on (E, E). Proposition. The ν defined above is indeed a measure. Proof. (i) ν(φ) = µ(f 1∅ ) = µ(0) = 0. (ii) If (An ) is a disjoint sequence in E, then [   X  X X ν An = µ(f 1S An ) = µ f 1An = µ (f 1An ) = ν(f ). Definition (Density). Let X be a random variable. We say X has a density if its law µX has a density with respect to the Lebesgue measure. In other words, there exists fX non-negative measurable so that Z µX (A) = P[X ∈ A] = fX (x) dx. A

In this case, for any non-negative measurable function, for any non-negative measurable g, we have that Z E[g(X)] = g(x)fX (x) dx. R

3.4

Integration and differentiation

In “normal” calculus, we had three results involving both integration and differentiation. One was the fundamental theorem of calculus, which we already stated. The others are the change of variables formula, and differentiating under the integral sign. We start by proving the change of variables formula. Proposition (Change of variables formula). Let φ : [a, b] → R be continuously differentiable and increasing. Then for any bounded Borel function g, we have Z

φ(b)

Z

b

g(y) dy = φ(a)

g(φ(x))φ0 (x) dx.

a

We will use the monotone class theorem.

46

(∗)

3

Integration

II Probability and Measure

Proof. We let V = {Borel functions g such that (∗) holds}. We will want to use the monotone class theorem to show that this includes all bounded functions. We already know that (i) V contains 1A for all A in the π-system of intervals of the form [u, v] ⊆ [a, b]. This is just the fundamental theorem of calculus. (ii) By linearity of the integral, V is indeed a vector space. (iii) Finally, let (gn ) be a sequence in V , and gn ≥ 0, gn % g. Then we know that Z φ(b) Z b gn (y) dy = gn (φ(x))φ0 (x) dx. φ(a)

a

By the monotone convergence theorem, these converge to Z

φ(b)

Z g(y) dy =

φ(a)

b

g(φ(x))φ0 (x) dx.

a

Then by the monotone class theorem, V contains all bounded Borel functions. The next problem is differentiation under the integral sign. We want to know when we can say Z Z d ∂f f (x, t) dx = (x, t) dx. dt ∂t Theorem (Differentiation under the integral sign). Let (E, E, µ) be a space, and U ⊆ R be an open set, and f : U × E → R. We assume that (i) For any t ∈ U fixed, the map x 7→ f (t, x) is integrable; (ii) For any x ∈ E fixed, the map t 7→ f (t, x) is differentiable; (iii) There exists an integrable function g such that ∂f (t, x) ≤ g(x) ∂t for all x ∈ E and t ∈ U . Then the map ∂f (t, x) ∂t is integrable for all t, and also the function Z F (t) = f (t, x)dµ x 7→

E

is differentiable, and 0

Z

F (t) = E

∂f (t, x) dµ. ∂t

47

3

Integration

II Probability and Measure

The reason why we want the derivative to be bounded is that we want to apply the dominated convergence theorem. Proof. Measurability of the derivative follows from the fact that it is a limit of measurable functions, and then integrability follows since it is bounded by g. Suppose (hn ) is a positive sequence with hn → 0. Then let gn (x) =

f (t + hn , x) − f (t, x) ∂f − (t, x). hn ∂t

Since f is differentiable, we know that gn (x) → 0 as n → ∞. Moreover, by the mean value theorem, we know that |gn (x)| ≤ 2g(x). On the other hand, by definition of F (t), we have Z Z F (t + hn ) − F (t) ∂f − (t, x) dµ = gn (x) dx. hn E ∂t By dominated convergence, we know the RHS tends to 0. So we know Z F (t + hn ) − F (t) ∂f lim (t, x) dµ. → n→∞ hn E ∂t Since hn was arbitrary, it follows that F 0 (t) exists and is equal to the integral.

3.5

Product measures and Fubini’s theorem

Recall the following definition of the product σ-algebra. Definition (Product σ-algebra). Let (E1 , E1 , µ1 ) and (E2 , E2 , µ2 ) be finite measure spaces. We let A = {A1 × A2 : A2 × E1 , A2 × E2 }. Then A is a π-system on E1 × E2 . The product σ-algebra is E = E1 ⊗ E2 = σ(A). We now want to construct a measure on the product σ-algebra. We can, of course, just apply the Caratheodory extension theorem, but we would want a more explicit description of the integral. The idea is to define, for A ∈ E1 ⊗ E2 ,  Z Z µ(A) = 1A (x1 , x2 ) µ2 (dx2 ) µ1 (dx1 ). E1

E2

Doing this has the advantage that it would help us in a step of proving Fubini’s theorem. However, before we can make this definition, we need to do some preparation to make sure the above statement actually makes sense: Lemma. Let E = E1 × E2 be a product of σ-algebras. Suppose f : E → R is E-measurable function. Then 48

3

Integration

II Probability and Measure

(i) For each x2 ∈ E2 , the function x1 7→ f (x1 , x2 ) is E1 -measurable. (ii) If f is bounded or non-negative measurable, then Z f2 (x2 ) = f (x1 , x2 ) µ1 (dx1 ) E1

is E2 -measurable. Proof. The first part follows immediately from the fact that for a fixed x2 , the map ι1 : E1 → E given by ι1 (x1 ) = (x1 , x2 ) is measurable, and that the composition of measurable functions is measurable. For the second part, we use the monotone class theorem. We let V be R the set of all measurable functions f such that x2 7→ E1 f (x1 , x2 )µ1 (dx1 ) is E2 -measurable. (i) It is clear that 1E , 1A ∈ V for all A ∈ A (where A is as in the definition of the product σ-algebra). (ii) V is a vector space by linearity of the integral. (iii) Suppose (fn ) is a non-negative sequence in V and fn % f , then     Z Z x2 7→ fn (x1 , x2 ) µ1 (dx1 ) % x2 7→ f (x1 , x2 ) µ(dx1 ) E1

E1

by the monotone convergence theorem. So f ∈ V . So the monotone class theorem tells us V contains all bounded measurable functions. Now if f is a general non-negative measurable function, then f ∧ n is bounded and measurable, hence f ∧ n ∈ V . Therefore f ∈ V by the monotone convergence theorem. Theorem. There exists a unique measurable function µ = µ1 ⊗ µ2 on E such that µ(A1 × A2 ) = µ(A1 )µ(A2 ) for all A1 × A2 ∈ A. Here it is crucial that the measure space is finite. Actually, everything still works for σ-finite measure spaces, as we can just reduce to the finite case. However, things start to go wrong if we don’t have σ-finite measure spaces. Proof. One might be tempted to just apply the Caratheodory extension theorem, but we have a more direct way of doing it here, by using integrals. We define  Z Z µ(A) = 1A (x1 , x2 ) µ2 (dx2 ) µ1 (dx1 ). E1

E2

Here the previous lemma is very important. It tells us that these integrals actually make sense! We first check that this is a measure: (i) µ(∅) = 0 is immediate since 1∅ = 0. 49

3

Integration

II Probability and Measure

S (ii) Suppose (An ) is a disjoint sequence and A = An . Then we have  Z Z µ(A) = 1A (x1 , x2 ) µ2 (dx2 ) µ1 (dx1 ) E1 E2 ! Z Z X = 1An (x1 , x2 ) µ2 (dx2 ) µ1 (dx1 ) E1

E2

n

We now use the fact that integration commutes with the sum of nonnegative measurable functions to get ! Z X Z = 1A (x1 , x2 ) µ2 (dx2 ) µ1 (dx1 ) E1

= =

Z

E1

n

X

E2

n

XZ

 1An (x1 , x2 ) µ2 (dx2 ) µ1 (dx1 )

E2

µ(An ).

n

So we have a working measure, and it clearly satisfies µ(A1 × A2 ) = µ(A1 )µ(A2 ). Uniqueness follows because µ is finite, and is thus characterized by its values on the π-system A that generates E. Exercise. Show the non-uniqueness of the product Lebesgue measure on [0, 1] and the counting measure on [0, 1]. Note that we could as well have defined the measure as  Z Z µ(A) = 1A (x1 , x2 ) µ1 (dx1 ) µ2 (dx2 ). E2

E1

The same proof would go through, so we have another measure on the space. However, by uniqueness, we know they must be the same! Fubini’s theorem generalizes this to arbitrary functions. Theorem (Fubini’s theorem). (i) If f is non-negative measurable, then  Z Z µ(f ) = f (x1 , x2 ) µ2 (dx2 ) µ1 (dx1 ). E1

(∗)

E2

In particular, we have   Z Z Z Z f (x1 , x2 ) µ2 (dx2 ) µ1 (dx1 ) = f (x1 , x2 ) µ1 (dx1 ) µ2 (dx2 ). E1

E2

E2

E1

This is sometimes known as Tonelli’s theorem.

50

3

Integration

II Probability and Measure

(ii) If f is integrable, and  Z A = x1 ∈ E :

 |f (x1 , x2 )|µ2 (dx2 ) < ∞ .

E2

then µ1 (E1 \ A) = 0. If we set

(R f1 (x1 ) =

E2

f (x1 , x2 ) µ2 (dx2 )

0

x1 ∈ A , x1 6∈ A

then f1 is a µ1 integrable function and µ1 (f1 ) = µ(f ). Proof. (i) Let V be the set of all measurable functions such that (∗) holds. Then V is a vector space since integration is linear. (a) By definition of µ, we know 1E and 1A are in V for all A ∈ A. (b) The monotone convergence theorem on both sides tell us that V is closed under monotone limits of the form fn % f , fn ≥ 0. By the monotone class theorem, we know V contains all bounded measurable functions. If f is non-negative measurable, then (f ∧ n) ∈ V , and monotone convergence for f ∧ n % f gives that f ∈ V . (ii) Assume that f is µ-integrable. Then Z x1 7→ |f (x1 , x2 )| µ(dx2 ) E2

is E1 -measurable, and, by (i), is µ1 -integrable. So A1 , being the inverse image of ∞ under that map, lies in E1 . Moreover, µ1 (E1 \ A1 ) = 0 because integrable functions can only be infinite on sets of measure 0. We set f1+ (x1 )

Z

f + (x1 , x2 ) µ2 (dx2 )

= E2

f1− (x1 )

Z

f − (x1 , x2 ) µ2 (dx2 ).

= E2

Then we have f1 = (f1+ − f1− )1A1 . So the result follows since µ(f ) = µ(f + ) − µ(f − ) = µ(f1+ ) − µ1 (f1− ) = µ1 (f1 ). by (i).

51

3

Integration

II Probability and Measure

Since R is σ-finite, we know that we can sensibly talk about the d-fold product of the Lebesgue measure on R to obtain the Lebesgue measure on Rd . What σ-algebra is the Lebesgue measure on Rd defined on? We know the Lebesgue measure on R is defined on B. So the Lebesgue measure is defined on B × · · · × B = σ(B1 × · · · × Bd : Bi ∈ B). By looking at the definition of the product topology, we see that this is just the Borel σ-algebra on Rd ! Recall that when we constructed the Lebesgue measure, the Caratheodory extension theorem yields a measure on the “Lebesgue σ-algebra” M, which was strictly bigger than the Borel σ-algebra. It was shown in the first example sheet that M is complete, i.e. if we have A ⊆ B ⊆ R with B ∈ M, µ(B) = 0, then A ∈ M. We can also take the Lebesgue measure on Rd to be defined on M ⊗ · · · ⊗ M. However, it happens that M ⊗ M together with the Lebesgue measure on R2 is no longer complete (proof is left as an exercise for the reader). We now turn to probability. Recall that random variables X1 , · · · , Xn are independent iff the σ-algebras σ(X1 ), · · · , σ(Xn ) are independent. We will show that random variables are independent iff their laws are given by the product measure. Proposition. Let X1 , · · · , Xn be random variables on (Ω, F, P) with values in (E1 , E1 ), · · · , (En , En ) respectively. We define E = E1 × · · · × En ,

E = E1 ⊗ · · · ⊗ En .

Then X = (X1 , · · · , Xn ) is E-measurable and the following are equivalent: (i) X1 , · · · , Xn are independent. (ii) µX = µX1 ⊗ · · · ⊗ µXn . (iii) For any f1 , · · · , fn bounded and measurable, we have " n # n Y Y E fk (Xk ) = E[fk (Xk )]. k=1

k=1

Proof. – (i) ⇒ (ii): Let ν = µX1 × · · · ⊗ µXn . We want to show that ν = µX . To do so, we just have to check that they agree on a π-system generating the entire σ-algebra. We let A = {A1 × · · · × An : A1 ∈ E1 , · · · , Ak ∈ Ek }. Then A is a generating π-system of E. Moreover, if A = A1 × · · · × An ∈ A, then we have µX (A) = P[X ∈ A] = P[X1 ∈ A1 , · · · , Xn ∈ An ]

52

3

Integration

II Probability and Measure

By independence, we have =

n Y

P[Xk ∈ Ak ]

k=1

= ν(A). So we know that µX = ν = µX1 ⊗ · · · ⊗ µXn on E. – (ii) ⇒ (iii): By assumption, we can evaluate the expectation " n # Z n Y Y E fk (Xk ) = fk (xk )µ(dxk ) E k=1

k=1

= =

n Z Y k=1 n Y

f (xk )µk (dxk )

Ek

E[fk (Xk )].

k=1

Here in the middle we have used Fubini’s theorem. – (iii) ⇒ (i): Take fk = 1Ak for Ak ∈ Ek . Then we have " n # Y P[X1 ∈ A1 , · · · , Xn ∈ An ] = E 1Ak (Xk ) k=1

= =

n Y k=1 n Y k=1

So X1 , · · · , Xn are independent.

53

E[1Ak (Xk )] P[Xk ∈ Ak ]

Inequalities and Lp spaces

4

II Probability and Measure

Inequalities and Lp spaces

4

Eventually, we will want to define the Lp spaces as follows: Definition (Lp spaces). Let (E, E, µ) be a measurable space. For 1 ≤ p < ∞, we define Lp = Lp (E, E, µ) to be the set of all measurable functions f such that Z kf kp =

1/p |f | dµ < ∞. p

For p = ∞, we let L∞ = L∞ (E, E, µ) to be the space of functions with kf k∞ = inf{λ ≥ 0 : |f | ≤ λ a.e.} < ∞. However, it is not clear that this is a norm. First of all, kf kp = 0 does not imply that f = 0. It only means that f = 0 a.e. But this is easy to solve. We simply quotient out the vector space by functions that differ on a set of measure zero. The more serious problem is that we don’t know how to prove the triangle inequality. To do so, we are going to prove some inequalities. Apart from enabling us to show that k · kp is indeed a norm, they will also be very helpful in the future when we want to bound integrals.

4.1

Four inequalities

The four inequalities we are going to prove are the following: (i) Chebyshev/Markov inequality (ii) Jensen’s inequality (iii) H¨ older’s inequality (iv) Minkowski’s inequality. So let’s start proving the inequalities. Proposition (Chebyshev’s/Markov’s inequality). Let f be non-negative measurable and λ > 0. Then µ({f ≥ λ}) ≤

1 µ(f ). λ

This is often used when this is a probability measure, so that we are bounding the probability that a random variable is big. The proof is essentially one line. Proof. We write f ≥ f 1f ≥λ ≥ λ1f ≥λ . Taking µ gives the desired answer. This is incredibly simple, but also incredibly useful! The next inequality is Jensen’s inequality. To state it, we need to know what a convex function is. 54

4

Inequalities and Lp spaces

II Probability and Measure

Definition (Convex function). Let I ⊆ R be an interval. Then c : I → R is convex if for any t ∈ [0, 1] and x, y ∈ I, we have c(tx + (1 − t)y) ≤ tc(x) + (1 − t)c(y).

(1 − t)f (x) + tc(y)

x

(1 − t)x + ty

y

Note that if c is twice differentiable, then this is equivalent to c00 > 0. Proposition (Jensen’s inequality). Let X be an integrable random variable with values in I. If c : I → R is convex, then we have E[c(X)] ≥ c(E[X]). It is crucial that this only applies to a probability space. We need the total mass of the measure space to be 1 for it to work. Just being finite is not enough. Jensen’s inequality will be an easy consequence of the following lemma: Lemma. If c : I → R is a convex function and m is in the interior of I, then there exists real numbers a, b such that c(x) ≥ ax + b for all x ∈ I, with equality at x = m. φ

ax + b

m

If the function is differentiable, then we can easily extract this from the derivative. However, if it is not, then we need to be more careful. Proof. If c is smooth, then we know c00 ≥ 0, and thus c0 is non-decreasing. We are going to show an analogous statement that does not mention the word “derivative”. Consider x < m < y with x, y, m ∈ I. We want to show that c(m) − c(x) c(y) − c(m) ≤ . m−x y−m

55

4

Inequalities and Lp spaces

II Probability and Measure

To show this, we turn off our brains and do the only thing we can do. We can write m = tx + (1 − t)y for some t. Then convexity tells us c(m) ≤ tc(x) + (1 − t)c(y). Writing c(m) = tc(m) + (1 − t)c(m), this tells us t(c(m) − c(x)) ≤ (1 − t)(c(y) − c(m)). To conclude, we simply have to compute the actual value of t and plug it in. We have m−x y−m , 1−t= . t= y−x y−x So we obtain

y−m m−x (c(m) − c(x)) ≤ (c(y) − c(m)). y−x y−x

Cancelling the y − x and dividing by the factors gives the desired result. Now since x and y are arbitrary, we know there is some a ∈ R such that c(m) − c(x) c(y) − c(m) ≤a≤ . m−x y−m for all x < m < y. If we rearrange, then we obtain c(t) ≥ a(t − m) + c(m) for all t ∈ I. Proof of Jensen’s inequality. To apply the previous result, we need to pick a right m. We take m = E[X]. To apply this, we need to know that m is in the interior of I. So we assume that X is not a.s. constant (that case is boring). By the lemma, we can find some a, b ∈ R such that c(X) ≥ aX + b. We want to take the expectation of the LHS, but we have to make sure the E[c(X)] is a sensible thing to talk about. To make sure it makes sense, we show that E[c(X)− ] = E[(−c(X)) ∨ 0] is finite. We simply bound [c(X)]− = [−c(X)] ∨ 0 ≤ |a||X| + |b|. So we have E[c(X)− ] ≤ |a|E|X| + |b| < ∞ since X is integrable. So E[c(X)] makes sense. We then just take E[c(X)] ≥ E[aX + b] = aE[X] + b = am + b = c(m) = c(E[X]). So done. 56

4

Inequalities and Lp spaces

II Probability and Measure

We are now going to use Jensen’s inequality to prove H¨older’s inequality. Before that, we take note of the following definition: Definition (Conjugate). Let p, q ∈ [1, ∞]. We say that they are conjugate if 1 1 + = 1, p q where we take 1/∞ = 0. Proposition (H¨older’s inequality). Let p, q ∈ (1, ∞) be conjugate. Then for f, g measurable, we have µ(|f g|) = kf gk1 ≤ kf kp kgkq . When p = q = 2, then this is the Cauchy-Schwarz inequality. We will provide two different proofs. Proof. We assume that kf kp > 0 and kf kp < ∞. Otherwise, there is nothing to prove. By scaling, we may assume that kf kp = 1. We make up a probability measure by Z P[A] = Since we know Z kf kp =

|f |p 1A dµ.

1/p |f |p dµ = 1,

we know P[ · ] is a probability measure. Then we have µ(|f g|) = µ(|f g|1{|f |>0} )   |g| p 1{|f |>0} |f | =µ |f |p−1   |g| =E 1{|f |>0} |f |p−1 Now use the fact that (E|X|)q ≤ E[|X|q ] since x 7→ xq is convex for q > 1. Then we obtain   1/q |g|q ≤ E 1{|f |>0} . |f |(p−1)q The key realization now is that 1q + p1 = 1 means that q(p − 1) = p. So this becomes  q 1/q |g| E 1 = µ(|g|q )1/q = kgkq . {|f |>0} |f |p Using the fact that kf kp = 1, we obtain the desired result. Alternative proof. We wlog 0 < kf kp , kgkq < ∞, or else there is nothing to prove. By scaling, we wlog kf kp = kgkq = 1. Then we have to show that Z |f ||g| dµ ≤ 1. 57

4

Inequalities and Lp spaces

To do so, we notice if we have

1 p

II Probability and Measure

+ 1q = 1, then the concavity of log tells us for any a, b > 0,   1 1 a b log a + log b ≤ log + . p q p q

Replacing a with ap ; b with bp and then taking exponentials tells us ab ≤

ap bq + . p q

While we assumed a, b > 0 when deriving, we observe that it is also valid when some of them are zero. So we have  Z Z  p |g|q 1 1 |f | |f ||g| dµ ≤ + dµ = + = 1. p q p q Just like Jensen’s inequality, this is very useful when bounding integrals, and it is also theoretically very important, because we are going to use it to prove the Minkowski inequality. This tells us that the Lp norm is actually a norm. Before we prove the Minkowski inequality, we prove the following tiny lemma that we will use repeatedly: Lemma. Let a, b ≥ 0 and p ≥ 1. Then (a + b)p ≤ 2p (ap + bp ). This is a terrible bound, but is useful when we want to prove that things are finite. Proof. We wlog a ≤ b. Then (a + b)p ≤ (2b)p ≤ 2p bp ≤ 2p (ap + bp ). Theorem (Minkowski inequality). Let p ∈ [1, ∞] and f, g measurable. Then kf + gkp ≤ kf kp + kgkp . Again the proof is magic. Proof. We do the boring cases first. If p = 1, then Z Z Z Z kf + gk1 = |f + g| ≤ (|f | + |g|) = |f | + |g| = kf k1 + kgk1 . The proof of the case of p = ∞ is similar. Now note that if kf + gkp = 0, then the result is trivial. On the other hand, if kf + gkp = ∞, then since we have |f + g|p ≤ (|f | + |g|)p ≤ 2p (|f |p + |g|p ), we know the right hand side is infinite as well. So this case is also done.

58

4

Inequalities and Lp spaces

II Probability and Measure

Let’s now do the interesting case. We compute µ(|f + g|p ) = µ(|f + g||f + g|p−1 ) ≤ µ(|f ||f + g|p−1 ) + µ(|g||f + g|p−1 ) ≤ kf kp k|f + g|p−1 kq + kgkp k|f + g|p−1 kq = (kf kp + kgkp )k|f + g|p−1 kq = (kf kp + kgkp )µ(|f + g|(p−1)q )1−1/p = (kf kp + kgkp )µ(|f + g|p )1−1/p . So we know µ(|f + g|p ) ≤ (kf kp + kgkp )µ(|f + g|p )1−1/p . Then dividing both sides by (µ(|f + g|p )1−1/p tells us µ(|f + g|p )1/p = kf + gkp ≤ kf kp + kgkp . Given these inequalities, we can go and prove some properties of Lp spaces.

4.2

Lp spaces

Recall the following definition: Definition (Norm of vector space). Let V be a vector space. A norm on V is a function k · k : V → R≥0 such that (i) ku + vk ≤ kuk + kvk for all U, v ∈ V . (ii) kαvk = |α|kvk for all v ∈ V and α ∈ R (iii) kvk = 0 implies v = 0. Definition (Lp spaces). Let (E, E, µ) be a measurable space. For 1 ≤ p < ∞, we define Lp = Lp (E, E, µ) to be the set of all measurable functions f such that Z kf kp =

1/p |f | dµ < ∞. p

For p = ∞, we let L∞ = L∞ (E, E, µ) to be the space of functions with kf k∞ = inf{λ ≥ 0 : |f | ≤ λ a.e.} < ∞. By Minkowski’s inequality, we know Lp is a vector space, and also (i) holds. By definition, (ii) holds obviously. However, (iii) does not hold for k · kp , because kf kp = 0 does not imply that f = 0. It merely implies that f = 0 a.e. To fix this, we define an equivalence relation as follows: for f, g ∈ Lp , we say that f ∼ g iff f − g = 0 a.e. For any f ∈ Lp , we let [f ] denote its equivalence class under this relation. In other words, [f ] = {g ∈ Lp : f − g = 0 a.e.}.

59

4

Inequalities and Lp spaces

II Probability and Measure

Definition (Lp space). We define Lp = {[f ] : f ∈ Lp }, where [f ] = {g ∈ Lp : f − g = 0 a.e.}. This is a normed vector space under the k · kp norm. One important property of Lp is that it is complete, i.e. every Cauchy sequence converges. Definition (Complete vector space/Banach spaces). A normed vector space (V, k · k) is complete if every Cauchy sequence converges. In other words, if (vn ) is a sequence in V such that kvn − vm k → 0 as n, m → ∞, then there is some v ∈ V such that kvn − vk → 0 as n → ∞. A complete vector space is known as a Banach space. Theorem. Let 1 ≤ p ≤ ∞. Then Lp is a Banach space. In other words, if (fn ) is a sequence in Lp , with the property that kfn − fm kp → 0 as n, m → ∞, then there is some f ∈ Lp such that kfn − f kp → 0 as n → ∞. Proof. We will only give the proof for p < ∞. The p = ∞ case is left as an exercise for the reader. Suppose that (fn ) is a sequence in Lp with kfn − fm kp → 0 as n, m → ∞. Take a subsequence (fnk ) of (fn ) with kfnk+1 − fnk kp ≤ 2−k for all k ∈ N. We then find that

M M

X X

kfnk+1 − fnk kp ≤ 1. |fnk+1 − fnk | ≤

k=1

p

k=1

We know that M X

|fnk+1 − fnk | %

k=1

∞ X

|fnk+1 − fnk | as M → ∞.

k=1

So applying the monotone convergence theorem, we know that

∞ ∞

X

X

|f − f | ≤ kfnk+1 − fnk kp ≤ 1.

nk+1 nk

k=1

In particular,

p

∞ X

k=1

|fnk+1 − fnk | < ∞ a.e.

k=1

So fnk (x) converges a.e., since the real line is complete. So we set ( limk→∞ fnk (x) if the limit exists f (x) = 0 otherwise 60

4

Inequalities and Lp spaces

II Probability and Measure

By an exercise on the first example sheet, this function is indeed measurable. Then we have kfn − f kpp = µ(|fn − f |p )   = µ lim inf |fn − fnk |p k→∞

≤ lim inf µ(|fn − fnk |p ), k→∞

which tends to 0 as n → ∞ since the sequence is Cauchy. So f is indeed the limit. Finally, we have to check that f ∈ Lp . We have µ(|f |p ) = µ(|f − fn + fn |p ) ≤ µ((|f − fn | + |fn |)p ) ≤ µ(2p (|f − fn |p + |fn |p )) = 2p (µ(|f − fn |p ) + µ(|fn |p )2 ) We know the first term tends to 0, and in particular is finite for n large enough, and the second term is also finite. So done.

4.3

Orthogonal projection in L2

In the particular case p = 2, we have an extra structure on L2 , namely an inner product structure, given by Z hf, gi = f g dµ. This inner product induces the L2 norm by kf k22 = hf, f i. Recall the following definition: Definition (Hilbert space). A Hilbert space is a vector space with a complete inner product. So L2 is not only a Banach space, but a Hilbert space as well. Somehow Hilbert spaces are much nicer than Banach spaces, because you have an inner product structure as well. One particular thing we can do is orthogonal complements. Definition (Orthogonal functions). Two functions f, g ∈ L2 are orthogonal if hf, gi = 0, Definition (Orthogonal complement). Let V ⊆ L2 . We then set V ⊥ = {f ∈ L2 : hf, vi = 0 for all v ∈ V }.

61

4

Inequalities and Lp spaces

II Probability and Measure

Note that we can always make these definitions for any inner product space. However, the completeness of the space guarantees nice properties of the orthogonal complement. Before we proceed further, we need to make a definition of what it means for a subspace of L2 to be closed. This isn’t the usual definition, since L2 isn’t really a normed vector space, so we need to accommodate for that fact. Definition (Closed subspace). Let V ⊆ L2 . Then V is closed if whenever (fn ) is a sequence in V with fn → f , then there exists v ∈ V with v ∼ f . Thee main thing that makes L2 nice is that we can use closed subspaces to decompose functions orthogonally. Theorem. Let V be a closed subspace of L2 . Then each f ∈ L2 has an orthogonal decomposition f = u + v, where v ∈ V and u ∈ V ⊥ . Moreover, kf − vk2 ≤ kf − gk2 for all g ∈ V with equality iff g ∼ v. To prove this result, we need two simple identities, which can be easily proven by writing out the expression. Lemma (Pythagoras identity). kf + gk2 = kf k2 + kgk2 + 2hf, gi. Lemma (Parallelogram law). kf + gk2 + kf − gk2 = 2(kf k2 + kgk2 ). To prove the existence of orthogonal decomposition, we need to use a slight trick involving the parallelogram law. Proof of orthogonal decomposition. Given f ∈ L2 , we take a sequence (gn ) in V such that kf − gn k2 → d(f, V ) = inf kf − gk2 . g

We now want to show that the infimum is attained. To do so, we show that gn is a Cauchy sequence, and by the completeness of L2 , it will have a limit. If we apply the parallelogram law with u = f − gn and v = f − gm , then we know ku + vk22 + ku − vk22 = 2(kuk22 + kvk22 ). Using our particular choice of u and v, we obtain

  2

2 f − gn + gm + kgn − gm k22 = 2(kf − gn k22 + kf − gm k22 ).

2 2 So we have kgn −

gm k22

= 2(kf −

gn k22

+ kf −

62

gm k22 )

2

gn + gm

. − 4 f −

2 2

4

Inequalities and Lp spaces

II Probability and Measure

The first two terms on the right hand side tend to d(f, V )2 , and the last term is bounded below in magnitude by 4d(f, V ). So as n, m → ∞, we must have kgn − gm k2 → 0. By completeness of L2 , there exists a g ∈ L2 such that gn → g. Now since V is assumed to be closed, we can find a v ∈ V such that g = v a.e. Then we know kf − vk2 = lim kf − gn k2 = d(f, V ). n→∞

So v attains the infimum. To show that this gives us an orthogonal decomposition, we want to show that u = f − v ∈ V ⊥. Suppose h ∈ V . We need to show that hu, hi = 0. We need to do another funny trick. Suppose t ∈ R. Then we have d(f, V )2 ≤ kf − (v + th)k22 = kf − vk2 + t2 khk22 − 2thf − v, hi. We think of this as a quadratic in t, which is minimized when t=

hf − v, hi . khk22

But we know this quadratic is minimized when t = 0. So hf − v, hi = 0. We are now going to look at the relationship between conditional expectation and orthogonal projection. Definition (Conditional expectation). Suppose we have a probability space S (Ω, F, P), and (Gn ) is a collection of pairwise disjoint events with n Gn = Ω. We let G = σ(Gn : n ∈ N). The conditional expectation of X given G is the random variable Y =

∞ X

E[X | Gn ]1Gn ,

n=1

where E[X | Gn ] =

E[X1Gn ] for P[Gn ] > 0. P[Gn ]

In other words, given any x ∈ Ω, say x ∈ Gn , then Y (x) = E[X | Gn ]. If X ∈ L2 (P), then Y ∈ L2 (P), and it is clear that Y is G-measurable. We claim that this is in fact the projection of X onto the subspace L2 (G, P) of G-measurable L2 random variables in the ambient space L2 (P). Proposition. The conditional expectation of X given G is the projection of X onto the subspace L2 (G, P) of G-measurable L2 random variables in the ambient space L2 (P). In some sense, this tells us Y is our best prediction of X given only the information encoded in G.

63

4

Inequalities and Lp spaces

II Probability and Measure

Proof. Let Y be the conditional expectation. It suffices to show that E[(X −W )2 ] is minimized for W = Y among G-measurable random variables. Suppose that W is a G-measurable random variable. Since G = σ(Gn : n ∈ N), it follows that W =

∞ X

an 1Gn .

n=1

where an ∈ R. Then  E[(X − W )2 ] = E 

∞ X

!2  (X − an )1Gn

n=1

"

# X

=E

2

(X +

a2n

2

a2n

− 2an X)1Gn

n

" =E

# X

(X +

− 2an E[X | Gn ])1Gn

n

We now optimize the quadratic X 2 + a2n − 2an E[X | Gn ] over an . We see that this is minimized for an = E[X | Gn ]. Note that this does not depend on what X is in the quadratic, since it is in the constant term. Therefore we know that E[X | Gn ] is minimized for W = Y . We can also rephrase variance and covariance in terms of the L2 spaces. Suppose X, Y ∈ L2 (P) with mX = E[X],

mY = E[Y ].

Then variance and covariance just correspond to L2 inner product and norm. In fact, we have var(X) = E[(X − mX )2 ] = kX − mX k22 , cov(X, Y ) = E[(X − mX )(Y − mY )] = hX − mX , Y − mY i. More generally, the covariance matrix of a random vector X = (X1 , · · · , Xn ) is given by var(X) = (cov(Xi , Xj ))ij . On the example sheet, we will see that the covariance matrix is a positive definite matrix.

64

4

Inequalities and Lp spaces

4.4

II Probability and Measure

Convergence in L1 (P) and uniform integrability

What we are looking at here is the following question — suppose (Xn ), X are random variables and Xn → X in probability. Under what extra assumptions is it true that Xn also converges to X in L1 , i.e. E[Xn − X] → 0 as X → ∞? This is not always true. Example. If we take (Ω, F, P) = ((0, 1), B((0, 1)), Lebesgue), and Xn = n1(0,1/n) . Then Xn → 0 in probability, and in fact Xn → 0 almost surely. However, E[|Xn − 0|] = E[Xn ] = n ·

1 = 1, n

which does not converge to 1. We see that the problem with this series is that there is a lot of “stuff” concentrated near 0, and indeed the functions can get unbounded near 0. We can easily curb this problem by requiring our functions to be bounded: Theorem (Bounded convegence theorem). Suppose X, (Xn ) are random variables. Assume that there exists a (non-random) constant C > 0 such that |Xn | ≤ C. If Xn → X in probability, then Xn → X in L1 . The proof is a rather standard manipulation. Proof. We first show that |X| ≤ C a.e. Let ε > 0. We then have P[|X| > C + ε] ≤ P[|X − Xn | + |Xn | > C + ε] ≤ P[|X − Xn | > ε] + P[|Xn | > C] We know the second term vanishes, while the first term → 0 as n → ∞. So we know P[|X| > C + ε] = 0 for all ε. Since ε was arbitrary, we know |X| ≤ C a.s. Now fix an ε > 0. Then   E[|Xn − X| = E |Xn − X|(1|Xn −X|≤ε + 1|Xn −X|>ε ) ≤ ε + 2C P [|Xn − X| > ε] . Since Xn → X in probability, for N sufficiently large, the second term is ≤ ε. So E[|Xn − X|] ≤ 2ε, and we have convergence in L1 . But we can do better than that. We don’t need the functions to be actually bounded. We just need that the functions aren’t concentrated in arbitrarily small subsets of Ω. Thus, we make the following definition: Definition (Uniformly integrable). Let X be a family of random variables. Define IX (δ) = sup{E[|X|1A ] : X ∈ X , A ∈ F with P[A] < δ}. Then we say X is uniformly integrable if X is L1 -bounded (see below), and IX (δ) → 0 as δ → 0. 65

4

Inequalities and Lp spaces

II Probability and Measure

Definition (Lp -bounded). Let X be a family of random variables. Then we say X is Lp -bounded if sup{kXkp : X ∈ X } < ∞. In some sense, this is “uniform continuity for integration”. It is immediate that Proposition. Finite unions of uniformly integrable sets are uniformly integrable. How can we find uniformly integrable families? The following proposition gives us a large class of such families. Proposition. Let X be an Lp -bounded family for some p > 1. Then X is uniformly integrable. Proof. We let C = sup{kXkp : X ∈ X } < ∞. Suppose that X ∈ X and A ∈ F. We then have E[|X|1A ] =≤ E[|X|p ]1/p P[A]1/q ≤ CP[A]1/q . by H¨ older’s inequality, where p, q are conjugates. This is now a uniform bound depending only on P[A]. So done. This is the best we can get. L1 boundedness is not enough. Indeed, our earlier example Xn = n1(0,1/n) , is L1 bounded but not uniformly integrable. So L1 boundedness is not enough. For many practical purposes, it is convenient to rephrase the definition of uniform integrability as follows: Lemma. Let X be a family of random variables. Then X is uniformly integrable if and only if sup{E[|X|1|X|>k ] : X ∈ X } → 0 as k → ∞. Proof. (⇒) Suppose that χ is uniformly integrable. For any k, and X ∈ X by Chebyshev inequality, we have P[|X| ≥ k] ≤

E[X] . k

Given ε > 0, we pick δ such that P[|X|1A ] < ε for all A with µ(A) < δ. Then pick k sufficiently large such that kδ < sup{E[X] : X ∈ X }. Then P[|X| ≥ k] < δ, and hence E[|X|1|X|>k ] < ε for all X ∈ X . (⇐) Suppose that the condition in the lemma holds. We first show that X is L1 -bounded. We have E[|X|] = E[|X|(1|X|≤k + 1|X|>k )] ≤ k + E[|X|1|X|>k ] < ∞ by picking a large enough k. 66

4

Inequalities and Lp spaces

II Probability and Measure

Next note that for any measurable A and X ∈ X , we have E[|X|1A ] = E[|X|1A (1|X|>k + 1|X|≤k )] ≤ E[|X|1|X|>k ] + kP[A]. Thus, for any ε > 0, we can pick k sufficiently large such that the first ε term is < 2ε for all X ∈ X by assumption. Then when P[A] < 2k , we have E|X|1A ] ≤ ε. As a corollary, we find that Corollary. Let X = {X}, where X ∈ L1 (P). Then X is uniformly integrable. Hence, a finite collection of L1 functions is uniformly integrable. Proof. Note that E[|X|] =

∞ X

E[|X|1X∈[k,k+1) ].

k=0

Since the sum is finite, we must have E[|X|1|X|≥K ] =

∞ X

E[|X|1X∈[k,k+1) ] → 0.

k=K

With all that preparation, we now come to the main theorem on uniform integrability. Theorem. Let X, (Xn ) be random variables. Then the following are equivalent: (i) Xn , X ∈ L1 for all n and Xn → X in L1 . (ii) {Xn } is uniformly integrable and Xn → X in probability. The (i) ⇒ (ii) direction is just a standard manipulation. The idea of the (ii) ⇒ (i) direction is that we use uniformly integrability to cut off Xn and X at some large value K, which gives us a small error, then apply bounded convergence. Proof. We first assume that Xn , X are L1 and Xn → X in L1 . We want to show that {Xn } is uniformly integrable and Xn → X in probability. We first show that Xn → X in probability. This is just going to come from the Chebyshev inequality. For ε > 0. Then we have P[|X − Xn | > ε] ≤

E[|X − Xn |] →0 ε

as n → ∞. Next we show that {Xn } is uniformly integrable. Fix ε > 0. Take N such that n ≥ N implies E[|X − Xn |] ≤ 2ε . Since finite families of L1 random variables are uniformly integrable, we can pick δ > 0 such that A ∈ F and P[A] < δ implies ε E[X1A ], E[|Xn |1A ] ≤ 2 for n = 1, · · · , N .

67

4

Inequalities and Lp spaces

II Probability and Measure

Now when n > N and A ∈ F with P[A] ≤ δ, then we have E[|Xn |1A ] ≤ E[|X − Xn |1A ] + E[|X|1A ] ε ≤ E[|X − Xn |] + 2 ε ε ≤ + 2 2 = ε. So {Xn } is uniformly integrable. Assume that {Xn } is uniformly integrable and Xn → X in probability. The first step is to show that X ∈ L1 . We want to use Fatou’s lemma, but to do so, we want almost sure convergence, not just convergence in probability. Recall that we have previously shown that there is a subsequence (Xnk ) of (Xn ) such that Xnk → X a.s. Then we have   E[|X|] = E lim inf |Xnk | ≤ lim inf E[|Xnk |] < ∞ k→∞

k→∞

since uniformly integrable families are L1 bounded. So E[|X|] < ∞, hence X ∈ L1 . Next we want to show that Xn → X in L1 . Take ε > 0. Then there exists K ∈ (0, ∞) such that     ε E |X|1{|X|>K} , E |Xn |1{|Xn |>K} ≤ . 3 To set things up so that we can use the bounded convergence theorem, we have to invent new random variables XnK = (Xn ∨ −K) ∧ K,

X K = (X ∨ −K) ∧ K.

Since Xn → X in probability, it follows that XnK → X K in probability. Now bounded convergence tells us that there is some N such that n ≥ N implies ε E[|XnK − X K |] ≤ . 3 Combining, we have for n ≥ N that E[|Xn − X|] ≤ E[|XnK − X K |] + E[|X|1{|X|≥K} ] + E[|Xn |1{|Xn |≥K} ] ≤ ε. So we know that Xn → X in L1 . The main application is that when {Xn } is a type of stochastic process known as a martingale. This will be done in III Advanced Probability and III Stochastic Calculus.

68

5

Fourier transform

5

II Probability and Measure

Fourier transform

We now turn to the exciting topic of the Fourier transform. There are two main questions we want to ask — when does the Fourier transform exist, and when we can recover a function from its Fourier transform. Of course, not only do we want to know if the Fourier transform exists. We also want to know if it lies in some nice space, e.g. L2 . It turns out that when we want to prove things about Fourier transforms, it is often helpful to “smoothen” the function by doing what is known as a Gaussian convolution. So after defining the Fourier transform and proving some really basic properties, we are going to investigate convolutions and Gaussians for a bit (convolutions are also useful on their own, since they correspond to sums of independent random variables). After that, we can go and prove the actual important properties of the Fourier transform.

5.1

The Fourier transform

When talking about Fourier transforms, we will mostly want to talk about functions Rd → C. So from now on, we will write Lp for complex valued Borel functions on Rd with Z  1/p

|f |p

kf kp =

< ∞.

Rd

The integrals of complex-valued function are defined on the real and imaginary parts separately, and satisfy the properties we would expect them to. The details are on the first example sheet. Definition (Fourier transform). The Fourier transform fˆ : Rd → C of f ∈ L1 (Rd ) is given by Z ˆ f (u) = f (x)ei(u,x) dx, Rd

where u ∈ Rd and (u, x) denotes the inner product, i.e. (u, x) = u1 x1 + · · · + ud xd . Why do we care about Fourier transforms? Many computations are easier with fˆ in place of f , especially computations that involve differentiation and convolutions (which are relevant to sums of independent random variables). In particular, we will use it to prove the central limit theorem. More generally, we can define the Fourier transform of a measure: Definition (Fourier transform of measure). The Fourier transform of a finite measure µ on Rd is the function µ ˆ : Rd → C given by Z µ ˆ(u) = ei(u,x) µ(dx). Rd

In the context of probability, we give these things a different name: Definition (Characteristic function). Let X be a random variable. Then the characteristic function of X is the Fourier transform of its law, i.e. φX (u) = E[ei(u,X) ] = µ ˆX (u), where µX is the law of X. 69

5

Fourier transform

II Probability and Measure

We now make the following (trivial) observations: Proposition. kfˆk∞ ≤ kf k1 ,

kˆ µk∞ ≤ µ(Rd ).

Less trivially, we have the following result: Proposition. The functions fˆ, µ ˆ are continuous. Proof. If un → u, then f (x)ei(un ,x) → f (x)ei(u,x) . Also, we know that |f (x)ei(un ,x) | = |f (x)|. So we can apply dominated convergence theorem with |f | as the bound.

5.2

Convolutions

To actually do something useful about the Fourier transforms, we need to talk about convolutions. Definition (Convolution of random variables). Let µ, ν be probability measures. Their convolution µ ∗ ν is the law of X + Y , where X has law µ and Y has law ν, and X, Y are independent. Explicitly, we have µ ∗ ν(A) = P[X + Y ∈ A] ZZ = 1A (x + y) µ(dx) ν(dy) Let’s suppose that µ has a density function f with respect to the Lebesgue measure. Then we have ZZ µ ∗ ν(A) = 1A (x + y)f (x) dx ν(dy) ZZ = 1A (x)f (x − y) dx ν(dy) Z  Z = 1A (x) f (x − y) ν(dy) dx. So we know that µ ∗ ν has law Z f (x − y) ν(dy). This thing has a name. Definition (Convolution of function with measure). Let f ∈ Lp and ν a probability measure. Then the convolution of f with µ is Z f ∗ ν(x) = f (x − y) ν(dy) ∈ Lp .

70

5

Fourier transform

II Probability and Measure

Note that we do have to treat the two cases of convolutions separately, since a measure need not have a density, and a function need not specify a probability measure (it may not integrate to 1). We check that it is indeed in Lp . Since ν is a probability measure, Jensen’s inequality says we have p Z Z p kf ∗ νkp = |f (x − y)|ν(dy) dx ZZ ≤ |f (x − y)|p ν(dy) dx ZZ = |f (x − y)|p dx ν(dy) = kf kpp < ∞. In fact, from this computation, we see that Proposition. For any f ∈ Lp and ν a probability measure, we have kf ∗ νkp ≤ kf kp . The interesting thing happens when we try to take the Fourier transform of a convolution. Proposition. f[ ∗ ν(u) = fˆ(u)ˆ ν (u). Proof. We have Z Z f[ ∗ ν(u) =

 f (x − y)ν(dy) ei(u,x) dx

ZZ

f (x − y)ei(u,x) dx ν(dy)  Z Z i(u,x−y) f (x − y)e d(x − y) ei(u,y) µ(dy) =  Z Z i(u,x) = f (x)e d(x) ei(u,y) µ(dy) Z = fˆ(u)ei(u,x) µ(dy) Z = fˆ(u) ei(u,x) µ(dy) =

= fˆ(u)ˆ ν (u). In the context of random variables, we have a similar result: Proposition. Let µ, ν be probability measures, and X, Y be independent variables with laws µ, ν respectively. Then µ[ ∗ ν(u) = µ ˆ(u)ˆ ν (u). Proof. We have µ[ ∗ ν(u) = E[ei(u,X+Y ) ] = E[ei(u,X) ]E[ei(u,Y ) ] = µ ˆ(u)ˆ ν (u). 71

5

Fourier transform

5.3

II Probability and Measure

Fourier inversion formula

We now want to work towards proving the Fourier inversion formula: Theorem (Fourier inversion formula). Let f, fˆ ∈ L1 . Then Z 1 f (x) = fˆ(u)e−i(u,x) du a.e. (2π)d Our strategy is as follows: (i) Show that the Fourier inversion formula holds for a Gaussian distribution by direct computations. (ii) Show that the formula holds for Gaussian convolutions, i.e. the convolution of an arbitrary function with a Gaussian. (iii) We show that any function can be approximated by a Gaussian convolution. Note that the last part makes a lot of sense. If√X is a random variable, then convolving with a Gaussian is just adding X + tZ, and if we take t → 0, we recover the original function. What we have to do is to show that this behaves sufficiently well with the Fourier transform and the Fourier inversion formula that we will actually get the result we want. Gaussian densities Before we start, we had better start by defining the Gaussian distribution. Definition (Gaussian density). The Gaussian density with variance t is d/2 2 1 e−|x| /2t . 2πt √ This is equivalently the density of tZ, where Z = (Z1 , · · · , Zd ) with Zi ∼ N (0, 1) independent. 

gt (x) =

We now want to compute the Fourier transformation directly and show that the Fourier inversion formula works for this. We start off by working in the case d = 1 and Z ∼ N (0, 1). We want to compute the Fourier transform of the law of this guy, i.e. its characteristic function. We will use a nice trick. Proposition. Let Z ∼ N (0, 1). Then φZ (a) = e−u

2

/2

.

We see that this is in fact a Gaussian up to a factor of 2π. Proof. We have φZ (u) = E[eiuZ ] Z 2 1 √ = eiux e−x /2 dx. 2π 72

5

Fourier transform

II Probability and Measure

We now notice that the function is bounded, so we can differentiate under the integral sign, and obtain φ0Z (u) = E[iZeiuZ ] Z 2 1 =√ ixeiux e−x /2 dx 2π = −uφZ (u), where the last equality is obtained by integrating by parts. So we know that φZ (u) solves φ0Z (u) = −uφZ (u). This is easy to solve, since we can just integrate this. We find that 1 log φZ (u) = − u2 + C. 2 So we have

2

φZ (u) = Ae−u

/2

.

We know that A = 1, since φZ (0) = 1. So we have 2

φZ (u) = e−u

/2

.

We now do this problem in general. Proposition. Let Z = (Z1 , · · · , Zd ) with Zj ∼ N (0, 1) independent. Then has density 2 1 gt (x) = e−|x| /(2t) . d/2 (2πt) with

2

gˆt (u) = e−|u|

t/2

tZ

.

Proof. We have gˆt (u) = E[ei(u, =

d Y

tZ)

] √

E[ei(uj ,

tZj )

]

j=1

=

d Y

√ φZ ( tuj )

j=1

=

d Y

2

e−tuj /2

j=1 2

= e−|u|

t/2

.

Again, gt and gˆt are almost the same, apart form the factor of (2πt)−d/2 and the position of t shifted. We can thus write this as gˆt (u) = (2π)d/2 t−d/2 g1/t (u).

73

5

Fourier transform

II Probability and Measure

So this tells us that gˆ ˆt (u) = (2π)d gt (u). This is not exactly the same as saying the Fourier inversion formula works, because in the Fourier inversion formula, we integrated against e−i(u,x) , not ei(u,x) . However, we know that by the symmetry of the Gaussian distribution, we have  d Z 1 −d ˆ gt (x) = gt (−x) = (2π) gˆt (−x) = gˆt (u)e−i(u,x) du. 2π So we conclude that Lemma. The Fourier inversion formula holds for the Gaussian density function. Gaussian convolutions Definition (Gaussian convolution). Let f ∈ L1 . Then a Gaussian convolution of f is a function of the form f ∗ gt . We are now going to do a little computation that shows that functions of this type also satisfy the Fourier inversion formula. Before we start, we make some observations about the Gaussian convolution. By general theory of convolutions, we know that we have Proposition. kf ∗ gt k1 ≤ kf k1 . We also have a pointwise bound Z  d/2 1 −|y|2 /(2t) |f ∗ gt (x)| = f (x − y)e dy 2πt Z ≤ (2πt)−d/2 |f (x − y)| dx ≤ (2πt)−d/2 kf k1 . This tells us that in fact Proposition. kf ∗ gt k∞ ≤ (2πt)−d/2 kf k1 . So in fact the convolution is pointwise bounded. We see that the bound gets worse as t → 0, and we will see that this is because as t → 0, the convolution f ∗ gt becomes a better and better approximation of f , and we did not assume that f is bounded. Similarly, we can compute that Proposition. kf\ ∗ gt k1 = kfˆgˆt k1 ≤ (2π)d/2 t−d/2 kfˆk1 , and kf\ ∗ gt k∞ ≤ kfˆk∞ .

74

5

Fourier transform

II Probability and Measure

Now given these bounds, it makes sense to write down the Fourier inversion formula for a Gaussian convolution. Lemma. The Fourier inversion formula holds for Gaussian convolutions. We are going to reduce this to the fact that the Gaussian distribution itself satisfies Fourier inversion. Proof. We have Z f ∗ gt (x) =

f (x − y)gt (y) dy   Z Z 1 −i(u,y) = f (x − y) gˆt (u)e du dy (2π)d  d Z Z 1 = f (x − y)ˆ gt (u)e−i(u,y) du dy 2π  d Z Z  1 f (x − y)e−i(u,x−y) dy gˆt (u)e−i(u,x) du = 2π  d Z 1 = fˆ(u)ˆ gt (u)e−i(u,x) du 2π  d Z 1 = f\ ∗ gt (u)e−i(u,x) du 2π

So done. The proof Finally, we are going to extend the Fourier inversion formula to the case where f, fˆ ∈ L2 . Theorem (Fourier inversion formula). Let f ∈ L1 and Z Z 2 ft (x) = (2π)−d fˆ(u)e−|u| t/2 e−i(u,x) du = (2π)−d f\ ∗ gt (u)e−i(u,x) du. Then kft −f k1 → 0, as t → 0, and the Fourier inversion holds whenever f, fˆ ∈ L1 . To prove this, we first need to show that the Gaussian convolution is indeed a good approximation of f : Lemma. Suppose that f ∈ Lp with p ∈ [1, ∞). Then kf ∗ gt − f kp → 0 as t → 0. Note that this cannot hold for p = ∞. Indeed, if p = ∞, then the ∞-norm is the uniform norm. But we know that f ∗ gt is always continuous, and the uniform limit of continuous functions is continuous. So the formula cannot hold if f is not already continuous. Proof. We fix ε > 0. By a question on the example sheet, we can find h which is continuous and with compact support such that kf − hkp ≤ 3ε . So we have kf ∗ gt − h ∗ gt kp = k(f − h) ∗ gt kp ≤ kf − hkp ≤ 75

ε . 3

5

Fourier transform

II Probability and Measure

So it suffices for us to work with a continuous function h with compact support. We let Z e(y) = |h(x − y) − h(x)|p dx. We first show that e is a bounded function: Z e(y) ≤ 2p (|h(x − y)|p + |h(x)|p ) dx = 2p+1 khkpp . Also, since h is continuous and bounded, the dominated convergence theorem tells us that e(y) → 0 as y → 0. R Moreover, using the fact that gt (y) dy = 1, we have p Z Z p kh ∗ gt − hkp = (h(x − y) − h(x))gt (y) dy dx Since gt (y) dy is a probability measure, by Jensen’s inequality, we can bound this by ZZ ≤ |h(x − y) − h(x)|p gt (y) dy dx  Z Z p = |h(x − y) − h(x)| dx gt (y) dy Z = e(y)gt (y) dy Z √ = e( ty)g1 (y) dy, where we used the definition of g and substitution. We know that this tends to 0 as t → 0 by the bounded convergence theorem, since we know that e is bounded. Finally, we have kf ∗ gt − f kp ≤ kf ∗ gt − h ∗ gt kp + kh ∗ gt − hkp + kh − f kp ε ε ≤ + + kh ∗ gt − hkp 3 3 2ε + kh ∗ gt − hkp . = 3 Since we know that kh ∗ gt − hkp → 0 as t → 0, we know that for all sufficiently small t, the function is bounded above by ε. So we are done. With this lemma, we can now prove the Fourier inversion theorem. Proof of Fourier inversion theorem. The first part is just a special case of the previous lemma. Indeed, recall that 2 f\ ∗ gt (u) = fˆ(u)e−|u| t/2 .

Since Gaussian convolutions satisfy Fourier inversion formula, we know that ft = f ∗ gt . 76

5

Fourier transform

II Probability and Measure

So the previous lemma says exactly that kft − f k1 → 0. Suppose now that fˆ ∈ L1 as well. Then looking at the integrand of Z 2 ft (x) = (2π)−d fˆ(u)e−|u| t/2 e−i(u,x) du, we know that

2 ˆ f (u)e−|u| t/2 e−i(u,x) ≤ |fˆ|.

Then by the dominated convergence theorem with dominating function |fˆ|, we know that this converges to Z ft (x) → (2π)−d fˆ(u)e−i(u,x) du as t → 0. By the first part, we know that kft − f k1 → 0 as t → 0. So we can find a sequence (tn ) with tn > 0, tn → 0 so that ftn → f a.e. Combining these, we know that Z f (x) = fˆ(u)e−i(u,x) du a.e. So done.

5.4

Fourier transform in L2

It turns out wonderful things happen when we take the Fourier transform of an L2 function. Theorem (Plancherel identity). For any function f ∈ L1 ∩ L2 , the Plancherel identity holds: kfˆk2 = (2π)d/2 kf k2 . As we are going to see in a moment, this is just going to follow from the Fourier inversion formula plus a clever trick. Proof. We first work with the special case where f, fˆ ∈ L1 , since the Fourier inversion formula holds for f . We then have Z kf k22 = f (x)f (x) dx  Z Z 1 ˆ(u)e−i(u,x) du f (x) dx = f (2π)d Z   1 ˆ(u) f (x)e−i(u,x) dx du = f (2π)d Z  1 = fˆ(u) f (x)ei(u,x) dx du (2π)d Z 1 = fˆ(u)fˆ(u) du (2π)d 1 = kfˆ(u)k22 . (2π)d So the Plancherel identity holds for f .

77

5

Fourier transform

II Probability and Measure

To prove it for the general case, we use this result and an approximation argument. Suppose that f ∈ L1 ∩ L2 , and let ft = f ∗ gt . Then by our earlier lemma, we know that kft k2 → kf k2 as t → 0. Now note that

2 fˆt (u) = fˆ(u)ˆ gt (u) = fˆ(u)e−|u| t/2 . 2

The important thing is that e−|u| t/2 % 1 as t → 0. Therefore, we know Z Z 2 kfˆt k22 = |fˆ(u)|2 e−|u| t du → |fˆ(u)|2 du = kfˆk22 as t → 0, by monotone convergence. Since ft , fˆt ∈ L1 , we know that the Plancherel identity holds, i.e. kfˆt k2 = (2π)d/2 kft k2 . Taking the limit as t → 0, the result follows. What is this good for? It turns out that the Fourier transform gives as a bijection from L2 to itself. While it is not true that the Fourier inversion formula holds for everything in L2 , it holds for enough of them that we can just approximate everything else by the ones that are nice. Then the above tells us that in fact this bijection is a norm-preserving automorphism. Theorem. There exists a unique Hilbert space automorphism F : L2 → L2 such that F ([f ]) = [(2π)−d/2 fˆ] whenever f ∈ L1 ∩ L2 . Here [f ] denotes the equivalence class of f in L2 , and we say F : L2 → L2 is a Hilbert space automorphism if it is a linear bijection that preserves the inner product. Note that in general, there is no guarantee that F sends a function to its Fourier transform. We know that only if it is a well-behaved function (i.e. in L1 ∩ L2 ). However, the formal property of it being a bijection from L2 to itself will be convenient for many things. Proof. We define F0 : L1 ∩ L2 → L2 by F0 ([f ]) = [(2π)−d/2 fˆ]. By the Plancherel identity, we know F0 preserves the L2 norm, i.e. kF0 ([f ])k2 = k[f ]k2 . Also, we know that L1 ∩ L2 is dense in L2 , since even the continuous functions with compact support are dense. So we know F0 extends uniquely to an isometry F : L2 → L2 . Since it preserves distance, it is in particular injective. So it remains to show that the map is surjective. By Fourier inversion, the subspace V = {[f ] ∈ L2 : f, fˆ ∈ L1 } 78

5

Fourier transform

II Probability and Measure

is sent to itself by the map F . Also if f ∈ V , then F 4 [f ] = [f ] (note that applying it twice does not suffice, because we actually have F 2 [f ](x) = [f ](−x)). So V is contained in the image F , and also V is dense in L2 , again because it contains all Gaussian convolutions (we have fˆt = fˆgˆt , and fˆ is bounded and gˆt is decaying exponentially). So we know that F is surjective.

5.5

Properties of characteristic functions

We are now going to state a bunch of theorems about characteristic functions. Since the proofs are not examinable (but the statements are!), we are only going to provide a rough proof sketch. Theorem. The characteristic function φX of a distribution µX of a random ˜ are random variables variable X determines µX . In other words, if X and X and φX = φX˜ , then µX = µX˜ Proof sketch. Use the Fourier inversion to show that φX determines µX (g) = E[g(X)] for any bounded, continuous g. Theorem. If φX is integrable, then µX has a bounded, continuous density function Z fX (x) = (2π)−d φX (u)e−i(u,x) du. √ Proof sketch. Let Z ∼ N (0, 1) be independent of X. Then X + tZ has a bounded continuous density function which, by Fourier inversion, is Z 2 −d ft (x) = (2π) φX (u)e−|u| t/2 e−i(u,x) du. Sending t → 0 and using the dominated convergence theorem with dominating function |φX |. The next theorem relates to the notion of weak convergence. Definition (Weak convergence of measures). Let µ, (µn ) be Borel probability measures. We say that µn → µ weakly if and only if µn (g) → µ(g) for all bounded continuous g. Similarly, we can define weak convergence of random variables. Definition (Weak convergence of random variables). Let X, (Xn ) be random variables. We say Xn → X weakly iff µXn → µX weakly, iff E[g(Xn )] → E[g(X)] for all bounded continuous g. This is related to the notion of convergence in distribution, which we defined long time ago without talking about it much. It is an exercise on the example sheet that weak convergence of random variables in R is equivalent to convergence in distribution. It turns out that weak convergence is very useful theoretically. One reason is that they are related to convergence of characteristic functions. Theorem. Let X, (Xn ) be random variables with values in Rd . If φXn (u) → φX (u) for each u ∈ Rd , then µXn → µX weakly. 79

5

Fourier transform

II Probability and Measure

The main application of this that will appear later is that this is the fact that allows us to prove the central limit theorem. Proof sketch. By the example sheet, it suffices to show that E[g(Xn )] → E[g(X)] for all compactly supported g ∈ C ∞ . We then use Fourier inversion and convergence of characteristic functions to check that √ √ E[g(Xn + tZ)] → E[g(X + tZ)] for all t >√0 for Z ∼ N (0, 1) independent of X, (Xn ). Then we check that E[g(Xn + tZ)] is close to E[g(Xn )] for t > 0 small, and similarly for X.

5.6

Gaussian random variables

Recall that in the proof of the Fourier inversion theorem, we used these things called Gaussians, but didn’t really say much about them. These will be useful later on when we want to prove the central limit theorem, because the central limit theorem says that in the long run, things look like Gaussians. So here we lay out some of the basic definitions and properties of Gaussians. Definition (Gaussian random variable). Let X be a random variable on R. This is said to be Gaussian if there exists µ ∈ R and σ ∈ (0, ∞) such that the density of X is   1 (x − µ)2 fX (x) = √ exp − . 2σ 2 2πσ 2 A constant random variable X = µ corresponds to σ = 0. We say this has mean µ and variance σ 2 . When this happens, we write X ∼ N (µ, σ 2 ). For completeness, we record some properties of Gaussian random variables. Proposition. Let X ∼ N (µ, σ 2 ). Then E[X] = µ,

var(X) = σ 2 .

Also, for any a, b ∈ R, we have aX + b ∼ N (aµ + b, a2 σ 2 ). Lastly, we have 2

φX (u) = e−iµu−u

σ 2 /2

.

Proof. All but the last of them follow from direct calculation, and can be found in IA Probability. For the last part, if X ∼ N (µ, σ 2 ), then we can write X = σZ + µ, where Z ∼ N (0, 1). Recall that we have previously found that the characteristic function of a N (0, 1) function is 2

φZ (u) = e−|u|

80

/2

.

5

Fourier transform

II Probability and Measure

So we have φX (u) = E[eiu(σZ+µ) ] = eiuµ E[eiuσZ ] = eiuµ φZ (iuσ) 2

= eiuµ−u

σ 2 /2

.

What we are next going to do is to talk about the corresponding facts for the Gaussian in higher dimensions. Before that, we need to come up with the definition of a higher-dimensional Gaussian distribution. This might be different from the one you’ve seen previously, because we want to allow some degeneracy in our random variable, e.g. some of the dimensions can be constant. Definition (Gaussian random variable). Let X be a random variable. We say that X is a Gaussian on Rn if (u, X) is Gaussian on R for all u ∈ Rn . We are now going to prove a version of our previous theorem to higher dimensional Gaussians. Theorem. Let X be Gaussian on Rn , and le tA be an m × n matrix and b ∈ Rm . Then (i) AX + b is Gaussian on Rm . (ii) X ∈ L2 and its law µX is determined by µ = E[X] and V = var(X), the covariance matrix. (iii) We have φX (u) = ei(u,µ)−(u,V u)/2 . (iv) If V is invertible, then X has a density of   1 fX (x) = (2π)−n/2 (det V )−1/2 exp − (x − µ, V −1 (x − µ)) . 2 (v) If X = (X1 , X2 ) where Xi ∈ Rni , then cov(X1 , X2 ) = 0 iff X1 and X2 are independent. Proof. (i) If u ∈ Rm , then we have (AX + b, u) = (AX, u) + (b, u) = (X, AT u) + (b, u). Since (X, AT u) is Gaussian and (b, u) is constant, it follows that (AX +b, u) is Gaussian. (ii) We know in particular that each component of X is a Gaussian random variable, which are in L2 . So X ∈ L2 . We will prove the second part of (ii) with (iii)

81

5

Fourier transform

II Probability and Measure

(iii) If µ = E[X] and V = var(X), then if u ∈ Rn , then we have E[(u, X)] = (u, µ),

var((u, X)) = (u, V u).

So we know (u, X) ∼ N ((u, µ), (u, V u)). So it follows that φX (u) = E[ei(u,X) ] = ei(u,µ)−(u,V u)/2 . So µ and V determine the characteristic function of X, which in turn determines the law of X. (iv) We start off with a boring Gaussian vector Y = (Y1 , · · · , Yn ), where the Yi ∼ N (0, 1) are independent. Then the density of Y is 2

fY (y) = (2π)−n/2 e−|y|

/2

.

We are now going to construct X from Y . We define ˜ = V 1/2 Y + µ. X ˜ is This makes sense because V is always non-negative definite. Then X ˜ = µ and var(X) ˜ = V . Therefore X has the same Gaussian with E[X] ˜ Since V is assumed to be invertible, we can compute distribution as X. ˜ using the change of variables formula. the density of X (v) It is clear that if X1 and X2 are independent, then cov(X1 , X2 ) = 0. Conversely, let X = (X1 , X2 ), where cov(X1 , X2 ) = 0. Then we have   V11 0 V = var(X) = . 0 V22 Then for u = (u1 , u2 ), we have (u, V u) = (u1 V11 u1 ) + (u2 , V22 u2 ), where V11 = var(X1 ) and V22 var(X2 ). Then we have φX (u) = eiµu−(u,V u)/2 = eiµ1 u1 −(u1 ,V11 u1 )/2 eiµ2 u2 −(u2 ,V22 u2 )/2 = φX1 (u1 )φX2 (u2 ). So it follows that X1 and X2 are independent.

82

6

6

Ergodic theory

II Probability and Measure

Ergodic theory

We are now going to study a new topic — ergodic theory. This is the study the “long run behaviour” of system under the evolution of some Θ. Due to time constraints, we will not do much with it. We are going to prove two ergodic theorems that tell us what happens in the long run, and this will be useful when we prove our strong law of large numbers at the end of the course. The general settings is that we have a measure space (E, E, µ) and a measurable map Θ : E → E that is measure preserving, i.e. µ(A) = µ(Θ−1 (A)) for all A ∈ E. Example. Take (E, E, µ) = ([0, 1), B([0, 1)), Lebesgue). For each a ∈ [0, 1), we can define Θa (x) = x + a mod 1. By what we’ve done earlier in the course, we know this translation map preserves the Lebesgue measure on [0, 1). Our goal is to try to understand the “long run averages” of the system when we apply Θ many times. One particular quantity we are going to look at is the following: Let f be measurable. We define Sn (f ) = f + f ◦ Θ + · · · + f ◦ Θn−1 . We want to know what is the long run behaviour of Snn(f ) as n → ∞. The ergodic theorems are going to give us the answer in a certain special case. Finally, we will apply this in a particular case to get the strong law of large numbers. Definition (Invariant subset). We say A ∈ E is invariant for Θ if A = Θ−1 (A). Definition (Invariant function). A measurable function f is invariant if f = f ◦ Θ. Definition (EΘ ). We write EΘ = {A ∈ E : A is invariant}. It is easy to show that EΘ is a σ-algebra, and f : E → R is invariant iff it is EΘ measurable. Definition (Ergodic). We say Θ is ergodic if A ∈ EΘ implies µ(A) = 0 or µ(AC ) = 0. Example. For the translation map on [0, 1), we have Θa is ergodic iff a is irrational. Proof is left on example sheet 4. Proposition. If f is integrable and Θ is measure-preserving. Then f ◦ Θ is integrable and Z Z f ◦ Θdµ = f dµ. E

It turns out that if Θ is ergodic, then there aren’t that many invariant functions. 83

6

Ergodic theory

II Probability and Measure

Proposition. If Θ is ergodic and f is invariant, then there exists a constant c such that f = c a.e. The proofs of these are left as an exercise on example sheet 4. We are now going to spend a little bit of time studying a particular example, because this will be needed to prove the strong law of large numbers. Example (Bernoulli shifts). Let m be a probability distribution on R. Then there exists an iid sequence Y1 , Y2 , · · · with law m. Recall we constructed this in a really funny way. Now we are going to build it in a more natural way. We let E = RN be the set of all real sequences (xn ). We define the σ-algebra E to be the σ-algebra generated by the projections Xn (x) = xn . In other words, this is the smallest σ-algebra such that all these functions are measurable. Alternatively, this is the σ-algebra generated by the π-system ( ) Y A= An , An ∈ B for all n and An = R eventually . n∈N

Finally, to define the measure µ, we let Y = (Y1 , Y2 , · · · ) : Ω → E where Yi are iid random variables defined earlier, and Ω is the sample space of the Yi . Then Y is a measurable map because each of the Yi ’s is a random variable. We let µ = P ◦ Y −1 . By the independence of Yi ’s, we have that Y µ(A) = m(An ) n∈N

for any A = A1 × A2 × · · · × An × R × · · · × R. Note that the product is eventually 1, so it is really a finite product. This (E, E, µ) is known as the canonical space associated with the sequence of iid random variables with law m. Finally, we need to define Θ. We define Θ : E → E by Θ(x) = Θ(x1 , x2 , x3 , · · · ) = (x2 , x3 , x4 , · · · ). This is known as the shift map. Why do we care about this? Later, we are going to look at the function f (x) = f (x1 , x2 , · · · ) = x1 . Then we have Sn (f ) = f + f ◦ Θ + · · · + f ◦ Θn−1 = x1 + · · · + xn . So Snn(f ) will the average of the first n things. So ergodic theory will tell us about the long-run behaviour of the average. 84

6

Ergodic theory

II Probability and Measure

Theorem. The shift map Θ is an ergodic, measure preserving transformation. Proof. It is an exercise to show that Θ is measurable and measure preserving. To show that Θ is ergodic, recall the definition of the tail σ-algebra \ Tn = σ(Xm : m ≥ n + 1), T = Tn . n

Suppose that A ∈

Q

n∈N

An ∈ A. Then

Θ−n (A) = {Xn+k ∈ Ak for all k} ∈ Tn . Since Tn is a σ-algebra, we and Θ−n (A) ∈ TN for all A ∈ A and σ(A) = E, we know Θ−n (A) ∈ TN for all A ∈ E. So if A ∈ EΘ , i.e. A = Θ−1 (A), then A ∈ TN for all N . So A ∈ T . From the Kolmogorov 0-1 law, we know either µ[A] = 1 or µ[A] = 0. So done.

6.1

Ergodic theorems

The proofs in this section are non-examinable. Instead of proving the ergodic theorems directly, we first start by proving the following magical lemma: Lemma (Maximal ergodic lemma). Let f be integrable, and S ∗ = sup Sn (f ) ≥ 0, n≥0

where S0 (f ) = 0 by convention. Then Z f dµ ≥ 0. {S ∗ >0}

Proof. We let Sn∗ = max Sm 0≤m≤n

and An = {Sn∗ > 0}. Now if 1 ≤ m ≤ n, then we know Sm = f + Sm−1 ◦ Θ ≤ f + Sn∗ ◦ Θ. Now on An , we have

Sn∗ = max Sm , 1≤m≤n

since S0 = 0. So we have On AC n , we have

Sn∗ ≤ f + Sn∗ ◦ Θ. Sn∗ = 0 ≤ Sn∗ ◦ Θ.

85

6

Ergodic theory

II Probability and Measure

So we know Z

Sn∗ dµ =

E

Z

Z

Sn∗ dµ +

AC n

An

Z

Z

Sn∗ ◦ Θ dµ +

f dµ + An

An

Z

Z

=

Z AC n

Sn∗ ◦ Θ dµ

Sn∗ ◦ Θ dµ

f dµ + An

E

Z

Z

=

Sn∗ dµ

Sn∗ dµ

f dµ + An

E

So we know

Z f dµ ≥ 0. An

Taking the limit as n → ∞ gives the result by dominated convergence with dominating function f . We are now going to prove the two ergodic theorems, which tell us the limiting behaviour of Sn (f ). Theorem (Birkhoff’s ergodic theorem). Let (E, E, µ) be σ-finite and f be integrable. There exists an invariant function f¯ such that µ(|f¯|) ≤ µ(|f |), and

Sn (f ) → f¯ a.e. n If Θ is ergodic, then f¯ is a constant.

Note that the theorem only gives µ(|f¯|) ≤ µ(|f |). However, in many cases, we can use some integration theorems such as dominated convergence to argue that they must in fact be equal. In particular, in the ergodic case, this will allow us to find the value of f¯. Theorem (von Neumann’s ergodic theorem). Let (E, E, µ) be a finite measure space. Let p ∈ [1, ∞) and assume that f ∈ Lp . Then there is some function f¯ ∈ Lp such that Sn (f ) → f¯ in Lp . n Proof of Birkhoff ’s ergodic theorem. We first note that lim sup n

Sn , n

lim sup n

Sn n

are invariant functions, Indeed, we know Sn ◦ Θ = f ◦ Θ + f ◦ Θ2 + · · · + f ◦ Θn = Sn+1 − f So we have lim sup n→∞

Sn ◦ Θ Sn f Sn = lim sup + → lim sup . n n n n n→∞ n→∞ 86

6

Ergodic theory

II Probability and Measure

Exactly the same reasoning tells us the lim inf is also invariant. What we now need to show is that the set of points on which lim sup and lim inf do not agree have measure zero. We set a < b. We let   Sn (x) Sn (x) D = D(a, b) = x ∈ E : lim inf < a < b < lim sup . n→∞ n n n→∞ Now if lim sup Snn(x) 6= lim inf Snn(x) , then there is some a, b ∈ Q such that x ∈ D(a, b). So by countable subadditivity, it suffices to show that µ(D(a, b)) = 0 for all a, b. We now fix a, b, and just write D. Since lim sup Snn and lim inf Snn are both invariant, we have that D is invariant. By restricting to D, we can assume that D = E. Suppose that B ∈ E and µ(G) < ∞. We let g = f − b1B . Then g is integrable because f is integrable and µ(B) < ∞. Moreover, we have Sn (g) = Sn (f − b1B ) ≥ Sn (f ) − nb. Since we know that lim supn Sn (g) > 0. So we know that

Sn (f ) n

> b by definition, we can find an n such that

S ∗ (g)(x) = sup Sn (g)(x) > 0 n

for all x ∈ D. By the maximal ergodic lemma, we know Z Z Z 0≤ g dµ = f − b1B dµ = f dµ − bµ(B). D

D

D

If we rearrange this, we know Z bµ(B) ≤

f dµ. D

for all measurable sets B ∈ E with finite measure. Since our space is σ-finite, we can find Bn % D such µ(Bn ) < ∞ for all n. So taking the limit above tells Z bµ(D) ≤ f dµ. (†) D

Now we can apply the same argument with (−a) in place of b and (−f ) in place of f to get Z (−a)µ(D) ≤ − f dµ. (‡) D

Now note that since b > a, we know that at least one of b > 0 and a < 0 has to be true. In the first case, (†) tells us that µ(D) is finite, since f is integrable. Then combining with (‡), we see that Z bµ(D) ≤ f dµ ≤ aµ(D). D

87

6

Ergodic theory

II Probability and Measure

But a < b. So we must have µ(D) = 0. The second case follows similarly (or follows immediately by flipping the sign of f ). We are almost done. We can now define ( lim Sn (f )/n the limit exists f¯(x) = 0 otherwise Then by the above calculations, we have Sn (f ) → f¯ a.e. n Also, we know f¯ is invariant, because lim Sn (f )/n is invariant, and so is the set where the limit exists. Finally, we need to show that µ(f¯) ≤ µ(|f |). This is since µ(|f ◦ Θn |) = µ(|f |) as Θn preserves the metric. So we have that µ(|Sn |) ≤ nµ(|f |) < ∞. So by Fatou’s lemma, we have   Sn ¯ µ(|f |) ≤ µ lim inf n n   Sn ≤ lim inf µ n n ≤ µ(|f |)

The proof of the von Neumann ergodic theorem follows easily from Birkhoff’s ergodic theorem. Proof of von Neumann ergodic theorem. It is an exercise on the example sheet to show that Z Z p p kf ◦ Θkp = |f ◦ Θ| dµ = |f |p dµ = kf kpp . So we have

Sn

= 1 kf + f ◦ Θ + · · · + f ◦ Θn−1 k ≤ kf kp

n n p by Minkowski’s inequality. So let ε > 0, and take M ∈ (0, ∞) so that if g = (f ∨ (−M )) ∧ M,

88

6

Ergodic theory

II Probability and Measure

then kf − gkp
0 such that E[Xn ] = µ, E[Xn4 ] ≤ M for all n. With Sn = X1 + · · · + Xn , we have that Sn → µ a.s. as n → ∞. n Note that in this version, we do not require that the Xn are iid. We simply need that they are independent and have the same mean. The proof is completely elementary. Proof. We reduce to the case that µ = 0 by setting Yn = Xn − µ. We then have E[Yn ] = 0,

E[Yn4 ] ≤ 24 (E[µ4 + Xn4 ]) ≤ 24 (µ4 + M ).

So it suffices to show that the theorem holds with Yn in place of Xn . So we can assume that µ = 0. By independence, we know that for i 6= j, we have E[Xi Xj3 ] = E[Xi ]E[Xj3 ] = 0. Similarly, for all i, j, k, ` distinct, we have E[Xi Xj Xk2 ] = E[Xi Xj Xk X` ] = 0. 90

7

Big theorems

II Probability and Measure

Hence we know that  E[Sn4 ] = E 

n X

 Xk4 + 6

X

Xi2 Xj2  .

1≤i