Statistics


Part IB — Statistics
Based on lectures by D. Spiegelhalter
Notes taken by Dexter Chua

Lent 2015

These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures. They are nowhere near accurate representations of what was actually lectured, and in particular, all errors are almost surely mine.

Estimation
Review of distribution and density functions, parametric families. Examples: binomial, Poisson, gamma. Sufficiency, minimal sufficiency, the Rao–Blackwell theorem. Maximum likelihood estimation. Confidence intervals. Use of prior distributions and Bayesian inference. [5]

Hypothesis testing
Simple examples of hypothesis testing, null and alternative hypothesis, critical region, size, power, type I and type II errors, Neyman–Pearson lemma. Significance level of outcome. Uniformly most powerful tests. Likelihood ratio, and use of generalised likelihood ratio to construct test statistics for composite hypotheses. Examples, including t-tests and F-tests. Relationship with confidence intervals. Goodness-of-fit tests and contingency tables. [4]

Linear models
Derivation and joint distribution of maximum likelihood estimators, least squares, Gauss–Markov theorem. Testing hypotheses, geometric interpretation. Examples, including simple linear regression and one-way analysis of variance. Use of software. [7]

Contents

0 Introduction
1 Estimation
  1.1 Estimators
  1.2 Mean squared error
  1.3 Sufficiency
  1.4 Likelihood
  1.5 Confidence intervals
  1.6 Bayesian estimation
2 Hypothesis testing
  2.1 Simple hypotheses
  2.2 Composite hypotheses
  2.3 Tests of goodness-of-fit and independence
    2.3.1 Goodness-of-fit of a fully-specified null distribution
    2.3.2 Pearson’s chi-squared test
    2.3.3 Testing independence in contingency tables
  2.4 Tests of homogeneity, and connections to confidence intervals
    2.4.1 Tests of homogeneity
    2.4.2 Confidence intervals and hypothesis tests
  2.5 Multivariate normal theory
    2.5.1 Multivariate normal distribution
    2.5.2 Normal random samples
  2.6 Student’s t-distribution
3 Linear models
  3.1 Linear models
  3.2 Simple linear regression
  3.3 Linear models with normal assumptions
  3.4 The F distribution
  3.5 Inference for β
  3.6 Simple linear regression
  3.7 Expected response at x∗
  3.8 Hypothesis testing
    3.8.1 Hypothesis testing
    3.8.2 Simple linear regression
    3.8.3 One way analysis of variance with equal numbers in each group

0 Introduction

Statistics is a set of principles and procedures for gaining and processing quantitative evidence in order to help us make judgements and decisions. In this course, we focus on formal statistical inference: we assume that we have some data generated from some unknown probability model, and we aim to use the data to learn about certain properties of the underlying probability model.

In particular, we perform parametric inference. We assume that we have a random variable X that follows a particular known family of distributions (e.g. the Poisson distribution). However, we do not know the parameters of the distribution. We then attempt to estimate the parameters from the data given. For example, we might know that X ∼ Poisson(µ) for some µ, and we want to figure out what µ is.

Usually we repeat the experiment (or observation) many times, so we have X1, X2, · · · , Xn iid with the same distribution as X. We call the set X = (X1, X2, · · · , Xn) a simple random sample. This is the data we have. We will use the observed X = x to make inferences about the parameter θ, such as
– giving an estimate θ̂(x) of the true value of θ;
– giving an interval estimate (θ̂1(x), θ̂2(x)) for θ;
– testing a hypothesis about θ, e.g. whether θ = 0.

1 Estimation

1.1 Estimators

The goal of estimation is as follows: we are given iid X1, · · · , Xn, and we know that their probability density/mass function is fX(x; θ) for some unknown θ. We know fX but not θ. For example, we might know that they follow a Poisson distribution, but we do not know what the mean is. The objective is to estimate the value of θ.

Definition (Statistic). A statistic is an estimate of θ. It is a function T of the data. If we write the data as x = (x1, · · · , xn), then our estimate is written as θ̂ = T(x). T(X) is an estimator of θ. The distribution of T = T(X) is the sampling distribution of the statistic.

Note that we adopt the convention where capital X denotes a random variable and x is an observed value. So T(X) is a random variable and T(x) is a particular value we obtain after experiments.

Example. Let X1, · · · , Xn be iid N(µ, 1). A possible estimator for µ is

    T(X) = (1/n) Σ Xi.

Then for any particular observed sample x, our estimate is

    T(x) = (1/n) Σ xi.

What is the sampling distribution of T? Recall from IA Probability that in general, if Xi ∼ N(µi, σi²) independently, then Σ Xi ∼ N(Σ µi, Σ σi²), which is something we can prove by considering moment generating functions. So we have T(X) ∼ N(µ, 1/n). Note that by the Central Limit Theorem, even if the Xi were not normal, we would still have approximately T(X) ∼ N(µ, 1/n) for large values of n, but here we get exactly the normal distribution even for small values of n.

The estimator (1/n) Σ Xi we had above is a rather sensible estimator. Of course, we can also have silly estimators such as T(X) = X1, or even T(X) = 0.32 always. One way to decide if an estimator is silly is to look at its bias.

Definition (Bias). Let θ̂ = T(X) be an estimator of θ. The bias of θ̂ is the difference between its expected value and the true value:

    bias(θ̂) = Eθ(θ̂) − θ.

Note that the subscript θ does not represent the random variable, but the thing we want to estimate. This is inconsistent with the use for, say, the probability mass function. An estimator is unbiased if it has no bias, i.e. Eθ(θ̂) = θ.

To find out Eθ(T), we can either find the distribution of T and find its expected value, or evaluate T as a function of X directly and find its expected value.

Example. In the above example, Eµ(T) = µ. So T is unbiased for µ.
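The sampling distribution T(X) ∼ N(µ, 1/n) can be checked empirically. The following is a minimal simulation sketch (not from the lectures; the values µ = 2, n = 25 are illustrative):

```python
# Simulate the sampling distribution of T(X) = (1/n) * sum(X_i) for
# X_i iid N(mu, 1), and check it is close to N(mu, 1/n).
import random
import statistics

def sample_T(mu, n, rng):
    xs = [rng.gauss(mu, 1.0) for _ in range(n)]
    return sum(xs) / n

rng = random.Random(0)
mu, n, reps = 2.0, 25, 20000
ts = [sample_T(mu, n, rng) for _ in range(reps)]

print(statistics.mean(ts))      # close to mu = 2.0
print(statistics.variance(ts))  # close to 1/n = 0.04
```

The empirical mean and variance of the replicated estimates match µ and 1/n, as the theory predicts.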

1.2 Mean squared error

Given an estimator, we want to know how good the estimator is. We have just come up with the concept of the bias above. However, this is generally not a good measure of how good the estimator is.

For example, if we do 1000 random trials X1, · · · , X1000, we can pick our estimator as T(X) = X1. This is an unbiased estimator, but is really bad because we have just wasted the data from the other 999 trials. On the other hand, T′(X) = 0.01 + (1/1000) Σ Xi is biased (with a bias of 0.01), but is in general much more trustworthy than T. In fact, at the end of the section, we will construct cases where the only possible unbiased estimator is a completely silly estimator to use. Instead, a commonly used measure is the mean squared error.

Definition (Mean squared error). The mean squared error of an estimator θ̂ is Eθ[(θ̂ − θ)²]. Sometimes we use the root mean squared error, which is the square root of the above.

We can express the mean squared error in terms of the variance and bias:

    Eθ[(θ̂ − θ)²] = Eθ[(θ̂ − Eθ(θ̂) + Eθ(θ̂) − θ)²]
                 = Eθ[(θ̂ − Eθ(θ̂))²] + [Eθ(θ̂) − θ]² + 2[Eθ(θ̂) − θ] Eθ[θ̂ − Eθ(θ̂)]
                 = var(θ̂) + bias²(θ̂),

where the cross term vanishes since Eθ[θ̂ − Eθ(θ̂)] = 0.

If we are aiming for a low mean squared error, sometimes it could be preferable to have a biased estimator with a lower variance. This is known as the “bias–variance trade-off”.

For example, suppose X ∼ binomial(n, θ), where n is given and θ is to be determined. The standard estimator is TU = X/n, which is unbiased. TU has variance

    varθ(TU) = varθ(X)/n² = θ(1 − θ)/n.

Hence the mean squared error of the usual estimator is given by

    mse(TU) = varθ(TU) + bias²(TU) = θ(1 − θ)/n.

Consider an alternative estimator

    TB = (X + 1)/(n + 2) = w · (X/n) + (1 − w) · (1/2),

where w = n/(n + 2). This can be interpreted as a weighted average (by the sample size) of the sample mean and 1/2. We have

    Eθ(TB) − θ = (nθ + 1)/(n + 2) − θ = (1 − w)(1/2 − θ),

so TB is biased. The variance is given by

    varθ(TB) = varθ(X)/(n + 2)² = w² · θ(1 − θ)/n.


Hence the mean squared error is

    mse(TB) = varθ(TB) + bias²(TB) = w² · θ(1 − θ)/n + (1 − w)² · (1/2 − θ)².
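The two closed-form MSEs above are easy to compare numerically. A small sketch (not from the notes; the function names are mine) for the n = 10 case discussed next:

```python
# Compare mse(T_U) and mse(T_B) for X ~ Binomial(n, theta), n = 10:
# T_U = X/n (unbiased) versus the shrunk estimator T_B = (X+1)/(n+2).
def mse_unbiased(theta, n):
    return theta * (1 - theta) / n

def mse_biased(theta, n):
    w = n / (n + 2)
    return w ** 2 * theta * (1 - theta) / n + (1 - w) ** 2 * (0.5 - theta) ** 2

n = 10
for theta in [0.1, 0.3, 0.5, 0.7, 0.9, 0.99]:
    print(theta, mse_unbiased(theta, n), mse_biased(theta, n))
```

For moderate θ the biased estimator wins (e.g. at θ = 0.5 its MSE is w² times the unbiased one), while for θ very close to 0 or 1 the squared bias term dominates and TU wins.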

We can plot the mean squared error of each estimator against the possible values of θ; here we take n = 10.

[Figure: mse against θ for n = 10, showing the curves for the unbiased estimator and the biased estimator; the vertical axis runs from 0 to about 0.03.]

The biased estimator has smaller MSE unless θ takes extreme values. So we see that biased estimators can sometimes give better mean squared errors.

In some cases, unbiased estimators are not merely worse; they can be complete nonsense. Suppose X ∼ Poisson(λ), and for whatever reason we want to estimate θ = [P(X = 0)]² = e^{−2λ}. Then any unbiased estimator T(X) must satisfy Eθ(T(X)) = θ, or equivalently,

    Eλ(T(X)) = e^{−λ} Σ_{x=0}^∞ T(x) λ^x / x! = e^{−2λ}.

The only function T that can satisfy this equation is T(X) = (−1)^X, since matching the power series of e^{−λ} term by term forces T(x) λ^x/x! = (−λ)^x/x!. Thus the unbiased estimator would estimate e^{−2λ} to be 1 if X is even, and −1 if X is odd. This is clearly nonsense.
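The identity behind this can be checked numerically. A sketch (illustrative, not from the notes; λ = 1.5 is arbitrary and the series is truncated at 100 terms):

```python
# Check that E[(-1)^X] = e^{-2*lambda} for X ~ Poisson(lambda), i.e.
# e^{-lambda} * sum_x (-1)^x lambda^x / x! = e^{-2*lambda}.
import math

def expected_T(lam, terms=100):
    # E[(-1)^X] computed from the Poisson pmf, truncated after `terms` terms
    return sum((-1) ** x * math.exp(-lam) * lam ** x / math.factorial(x)
               for x in range(terms))

lam = 1.5
print(expected_T(lam), math.exp(-2 * lam))  # the two agree
```

So T(X) = (−1)^X really is unbiased for e^{−2λ}, despite only ever taking the values ±1.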

1.3 Sufficiency

Often, we do experiments just to find out the value of θ. For example, we might want to estimate what proportion of the population supports some political candidate. We are seldom interested in the data points themselves; we just want to learn about the big picture. This leads us to the concept of a sufficient statistic: a statistic T(X) that contains all the information we have about θ in the sample.

Example. Let X1, · · · , Xn be iid Bernoulli(θ), so that P(Xi = 1) = 1 − P(Xi = 0) = θ for some 0 < θ < 1. Then

    fX(x | θ) = Π_{i=1}^n θ^{xi} (1 − θ)^{1−xi} = θ^{Σ xi} (1 − θ)^{n − Σ xi}.

This depends on the data only through T(X) = Σ xi, the total number of ones.


Suppose we are now given that T(X) = t. Then what is the distribution of X? We have

    f_{X|T=t}(x) = Pθ(X = x, T = t) / Pθ(T = t) = Pθ(X = x) / Pθ(T = t),

where the last equality holds because if X = x, then T must equal t = T(x). This is equal to

    θ^{Σ xi} (1 − θ)^{n − Σ xi} / [ (n choose t) θ^t (1 − θ)^{n−t} ] = 1 / (n choose t).

So the conditional distribution of X given T = t does not depend on θ. So if we know T, then additional knowledge of x does not give more information about θ.

Definition (Sufficient statistic). A statistic T is sufficient for θ if the conditional distribution of X given T does not depend on θ.

There is a convenient theorem that allows us to find sufficient statistics.

Theorem (The factorization criterion). T is sufficient for θ if and only if

    fX(x | θ) = g(T(x), θ) h(x)

for some functions g and h.

Proof. We first prove the discrete case. Suppose fX(x | θ) = g(T(x), θ)h(x). If T(x) = t, then

    f_{X|T=t}(x) = Pθ(X = x, T(X) = t) / Pθ(T = t)
                 = g(T(x), θ) h(x) / Σ_{y : T(y) = t} g(T(y), θ) h(y)
                 = g(t, θ) h(x) / [ g(t, θ) Σ h(y) ]
                 = h(x) / Σ h(y),

which does not depend on θ. So T is sufficient.

The continuous case is similar. If fX(x | θ) = g(T(x), θ)h(x) and T(x) = t, then

    f_{X|T=t}(x) = g(t, θ) h(x) / ∫_{y : T(y) = t} g(t, θ) h(y) dy = h(x) / ∫ h(y) dy,

which does not depend on θ.

Now suppose T is sufficient, so that the conditional distribution of X | T = t does not depend on θ. Then

    Pθ(X = x) = Pθ(X = x, T = T(x)) = Pθ(X = x | T = T(x)) Pθ(T = T(x)).

The first factor does not depend on θ by assumption; call it h(x). Let the second factor be g(t, θ), and so we have the required factorisation.
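The Bernoulli computation above can be verified by brute force over all sequences. A sketch (not from the notes; the small values n = 4, t = 2 are illustrative):

```python
# Given T = sum(x_i) = t, every 0/1 sequence with t ones should have
# conditional probability 1/(n choose t), whatever theta is.
import itertools
import math

def conditional_probs(n, t, theta):
    seqs = [s for s in itertools.product([0, 1], repeat=n) if sum(s) == t]
    joint = [theta ** sum(s) * (1 - theta) ** (n - sum(s)) for s in seqs]
    total = sum(joint)
    return [p / total for p in joint]

n, t = 4, 2
for theta in [0.2, 0.7]:
    ps = conditional_probs(n, t, theta)
    # maximum deviation from 1/(n choose t); essentially zero for both thetas
    print(max(abs(p - 1 / math.comb(n, t)) for p in ps))
```

The conditional probabilities are identical for θ = 0.2 and θ = 0.7, illustrating that the conditional distribution given T carries no information about θ.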


Example. Continuing the above example,

    fX(x | θ) = θ^{Σ xi} (1 − θ)^{n − Σ xi}.

Take g(t, θ) = θ^t (1 − θ)^{n−t} and h(x) = 1 to see that T(X) = Σ Xi is sufficient for θ.

Example. Let X1, · · · , Xn be iid U[0, θ]. Write 1[A] for the indicator function of an arbitrary set A. We have

    fX(x | θ) = Π_{i=1}^n (1/θ) 1[0 ≤ xi ≤ θ] = (1/θ^n) 1[max_i xi ≤ θ] 1[min_i xi ≥ 0].

If we let T = max_i xi, then

    fX(x | θ) = (1/θ^n) 1[t ≤ θ] · 1[min_i xi ≥ 0],

where the first factor is g(t, θ) and the second is h(x). So T = max_i xi is sufficient.

Note that sufficient statistics are not unique. If T is sufficient for θ, then so is any 1-1 function of T. X itself is always sufficient for θ as well, but it is not of much use. How can we decide if a sufficient statistic is “good”?

Given any statistic T, we can partition the sample space X^n into sets {x ∈ X^n : T(x) = t}. Then after an experiment, instead of recording the actual value of x, we can simply record the partition that x falls into. If there are fewer partitions than possible values of x, then effectively there is less information we have to store. If T is sufficient, then this data reduction does not lose any information about θ. The “best” sufficient statistic would be one that achieves the maximum possible reduction. This is known as the minimal sufficient statistic. The formal definition we take is the following:

Definition (Minimal sufficiency). A sufficient statistic T(X) is minimal if it is a function of every other sufficient statistic, i.e. if T′(X) is also sufficient, then T′(X) = T′(Y) ⇒ T(X) = T(Y).

Again, we have a handy theorem to find minimal sufficient statistics:

Theorem. Suppose T = T(X) is a statistic such that fX(x; θ)/fX(y; θ) does not depend on θ if and only if T(x) = T(y). Then T is minimal sufficient for θ.

Proof. First we have to show sufficiency. We will use the factorization criterion to do so. Firstly, for each possible t, pick a favourite x_t such that T(x_t) = t. Now let x ∈ X^n and let T(x) = t. Then T(x) = T(x_t), so by the hypothesis, fX(x; θ)/fX(x_t; θ) does not depend on θ. Let this be h(x). Let g(t, θ) = fX(x_t; θ). Then

    fX(x; θ) = fX(x_t; θ) · fX(x; θ)/fX(x_t; θ) = g(t, θ) h(x).


So T is sufficient for θ.

To show that T is minimal, suppose that S(X) is also sufficient. By the factorization criterion, there exist functions gS and hS such that

    fX(x; θ) = gS(S(x), θ) hS(x).

Now suppose that S(x) = S(y). Then

    fX(x; θ)/fX(y; θ) = [gS(S(x), θ) hS(x)] / [gS(S(y), θ) hS(y)] = hS(x)/hS(y).

This means that the ratio fX(x; θ)/fX(y; θ) does not depend on θ. By the hypothesis, this implies that T(x) = T(y). So we know that S(x) = S(y) implies T(x) = T(y). So T is a function of S. So T is minimal sufficient.

Example. Suppose X1, · · · , Xn are iid N(µ, σ²). Then

    fX(x | µ, σ²)/fX(y | µ, σ²)
      = [(2πσ²)^{−n/2} exp(−(1/2σ²) Σ_i (xi − µ)²)] / [(2πσ²)^{−n/2} exp(−(1/2σ²) Σ_i (yi − µ)²)]
      = exp( −(1/2σ²)(Σ_i xi² − Σ_i yi²) + (µ/σ²)(Σ_i xi − Σ_i yi) ).

This is a constant function of (µ, σ²) iff Σ_i xi² = Σ_i yi² and Σ_i xi = Σ_i yi. So T(X) = (Σ_i Xi, Σ_i Xi²) is minimal sufficient for (µ, σ²).

As mentioned, sufficient statistics allow us to store the results of our experiments in the most efficient way. It turns out that if we have a sufficient statistic, then we can use it to improve any estimator.

Theorem (Rao–Blackwell theorem). Let T be a sufficient statistic for θ and let θ̃ be an estimator for θ with E(θ̃²) < ∞ for all θ. Let θ̂(x) = E[θ̃(X) | T(X) = T(x)]. Then for all θ,

    E[(θ̂ − θ)²] ≤ E[(θ̃ − θ)²].

The inequality is strict unless θ̃ is a function of T.

Here we have to be careful with our definition of θ̂. It is defined as a conditional expectation of θ̃(X), and this could potentially depend on the actual value of θ. Fortunately, since T is sufficient for θ, the conditional distribution of X given T = t does not depend on θ. Hence θ̂ = E[θ̃(X) | T] does not depend on θ, and so is a genuine estimator. In fact, the proof does not use sufficiency directly; we only require T to be sufficient so that we can compute θ̂.

Using this theorem, given any estimator, we can find one that is a function of a sufficient statistic and is at least as good in terms of mean squared error of estimation. Moreover, if the original estimator θ̃ is unbiased, so is the new θ̂. Also, if θ̃ is already a function of T, then θ̂ = θ̃.

Proof. By the conditional expectation formula, we have E(θ̂) = E[E(θ̃ | T)] = E(θ̃). So they have the same bias. By the conditional variance formula,

    var(θ̃) = E[var(θ̃ | T)] + var[E(θ̃ | T)] = E[var(θ̃ | T)] + var(θ̂).

Hence var(θ̃) ≥ var(θ̂). So mse(θ̃) ≥ mse(θ̂), with equality only if var(θ̃ | T) = 0.
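The Rao–Blackwell improvement can be seen numerically. A simulation sketch (not from the notes; it anticipates the U[0, θ] example below, where conditioning the crude unbiased estimator 2X1 on T = max Xi gives ((n + 1)/n) max Xi, and the values θ = 3, n = 5 are arbitrary):

```python
# Compare the MSE of the crude estimator 2*X_1 with the MSE of its
# Rao-Blackwellised version (n+1)/n * max(X_i) for iid U[0, theta] data.
import random

rng = random.Random(1)
theta, n, reps = 3.0, 5, 50000
mse_crude = mse_rb = 0.0
for _ in range(reps):
    xs = [rng.uniform(0, theta) for _ in range(n)]
    mse_crude += (2 * xs[0] - theta) ** 2
    mse_rb += ((n + 1) / n * max(xs) - theta) ** 2

# the Rao-Blackwellised estimator has much smaller mean squared error
print(mse_crude / reps, mse_rb / reps)
```

Both estimators are unbiased, but the conditioned one has far smaller variance, exactly as the theorem guarantees.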


Example. Suppose X1, · · · , Xn are iid Poisson(λ), and let θ = e^{−λ}, which is the probability that X1 = 0. Then

    pX(x | λ) = e^{−nλ} λ^{Σ xi} / Π xi!,

so in terms of θ,

    pX(x | θ) = θ^n (−log θ)^{Σ xi} / Π xi!.

We see that T = Σ Xi is sufficient for θ, and Σ Xi ∼ Poisson(nλ).

We start with an easy estimator, θ̃ = 1[X1 = 0], which is unbiased (i.e. if we observe nothing in the first observation period, we estimate the event to be impossible). Then

    E[θ̃ | T = t] = P(X1 = 0 | Σ_{i=1}^n Xi = t)
                 = P(X1 = 0) P(Σ_{i=2}^n Xi = t) / P(Σ_{i=1}^n Xi = t)
                 = ((n − 1)/n)^t.

So θ̂ = (1 − 1/n)^{Σ xi}. This is approximately (1 − 1/n)^{n x̄} ≈ e^{−x̄} = e^{−λ̂}, which makes sense.

Example. Let X1, · · · , Xn be iid U[0, θ], and suppose that we want to estimate θ. We have shown above that T = max Xi is sufficient for θ. Let θ̃ = 2X1, an unbiased estimator. Then

    E[θ̃ | T = t] = 2E[X1 | max Xi = t]
                 = 2E[X1 | max Xi = t, X1 = max Xi] P(X1 = max Xi)
                   + 2E[X1 | max Xi = t, X1 ≠ max Xi] P(X1 ≠ max Xi)
                 = 2( t · (1/n) + (t/2) · ((n − 1)/n) )
                 = ((n + 1)/n) t.

So θ̂ = ((n + 1)/n) max Xi is our new estimator.

1.4 Likelihood

There are many different estimators we can pick, and we have just come up with some criteria to determine whether an estimator is “good”. However, these do not give us a systematic way of coming up with an estimator to actually use. In practice, we often use the maximum likelihood estimator.

Let X1, · · · , Xn be random variables with joint pdf/pmf fX(x | θ). We observe X = x.

Definition (Likelihood). For any given x, the likelihood of θ is like(θ) = fX(x | θ), regarded as a function of θ. The maximum likelihood estimator (mle) of θ is an estimator that picks the value of θ that maximizes like(θ).


Often there is no closed form for the mle, and we have to find θ̂ numerically. When we can find the mle explicitly, in practice we often maximize the log-likelihood instead of the likelihood. In particular, if X1, · · · , Xn are iid, each with pdf/pmf fX(x | θ), then

    like(θ) = Π_{i=1}^n fX(xi | θ),
    log like(θ) = Σ_{i=1}^n log fX(xi | θ).

Example. Let X1, · · · , Xn be iid Bernoulli(p). Then

    l(p) = log like(p) = (Σ xi) log p + (n − Σ xi) log(1 − p).

Thus

    dl/dp = (Σ xi)/p − (n − Σ xi)/(1 − p).

This is zero when p = Σ xi / n. So p̂ = Σ xi / n is the maximum likelihood estimator (and is unbiased).
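The closed-form answer can be cross-checked by maximizing the log-likelihood directly. A sketch (not from the notes; the sample and the grid resolution are arbitrary illustrative choices):

```python
# Maximise the Bernoulli log-likelihood
# l(p) = sum(x) * log(p) + (n - sum(x)) * log(1 - p)
# on a grid, and compare with the closed-form MLE sum(x)/n.
import math

xs = [1, 0, 0, 1, 1, 0, 1, 1]   # an arbitrary illustrative sample
n, s = len(xs), sum(xs)

def loglike(p):
    return s * math.log(p) + (n - s) * math.log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=loglike)
print(p_hat, s / n)  # both 0.625
```

Since l(p) is concave, the grid maximizer coincides with the analytic solution s/n.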

Example. Let X1, · · · , Xn be iid N(µ, σ²), and suppose we want to estimate θ = (µ, σ²). Then

    l(µ, σ²) = log like(µ, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − (1/2σ²) Σ (xi − µ)².

This is maximized when ∂l/∂µ = ∂l/∂σ² = 0. We have

    ∂l/∂µ = (1/σ²) Σ (xi − µ),
    ∂l/∂σ² = −n/(2σ²) + (1/2σ⁴) Σ (xi − µ)².

So the solution, hence maximum likelihood estimator, is (µ̂, σ̂²) = (x̄, Sxx/n), where x̄ = (1/n) Σ xi and Sxx = Σ (xi − x̄)². We shall see later that SXX/σ² = nσ̂²/σ² ∼ χ²_{n−1}, and so E(σ̂²) = (n − 1)σ²/n, i.e. σ̂² is biased.
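A quick sanity check of the normal MLEs (a sketch, not from the notes; the five data points are hypothetical):

```python
# Compute mu_hat = x_bar and sigma2_hat = S_xx / n for a small sample,
# then check that nearby parameter values give a smaller log-likelihood.
import math

xs = [2.1, 1.9, 2.5, 2.3, 1.7]   # hypothetical sample
n = len(xs)
mu_hat = sum(xs) / n                                   # x_bar
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n    # S_xx / n

def loglike(mu, s2):
    return (-n / 2 * math.log(2 * math.pi * s2)
            - sum((x - mu) ** 2 for x in xs) / (2 * s2))

print(mu_hat, sigma2_hat)
print(loglike(mu_hat, sigma2_hat) > loglike(2.0, sigma2_hat))   # True
print(loglike(mu_hat, sigma2_hat) > loglike(mu_hat, 0.2))       # True
```

Perturbing either parameter away from (x̄, Sxx/n) lowers the log-likelihood, consistent with the stationarity conditions above.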

Example (German tank problem). Suppose the American army discovers some German tanks that are sequentially numbered, i.e. the first tank is numbered 1, the second is numbered 2, etc. If θ tanks are produced, then we model the observed tank number as U(0, θ). Suppose we have discovered n tanks whose numbers are x1, x2, · · · , xn, and we want to estimate θ, the total number of tanks produced, by maximum likelihood. We have

    like(θ) = (1/θ^n) 1[max xi ≤ θ] 1[min xi ≥ 0].

So for θ ≥ max xi, like(θ) = 1/θ^n, which is decreasing as θ increases, while for θ < max xi, like(θ) = 0. Hence the value θ̂ = max xi maximizes the likelihood.


Is θ̂ unbiased? First we need to find the distribution of θ̂. For 0 ≤ t ≤ θ, the cumulative distribution function of θ̂ is

    F_θ̂(t) = P(θ̂ ≤ t) = P(Xi ≤ t for all i) = (P(X1 ≤ t))^n = (t/θ)^n.

Differentiating with respect to t, we find the pdf f_θ̂(t) = n t^{n−1}/θ^n. Hence

    E(θ̂) = ∫₀^θ t · (n t^{n−1}/θ^n) dt = nθ/(n + 1).

So θ̂ is biased, but asymptotically unbiased.

Example. Smarties come in k equally frequent colours, but suppose we do not know k. Assume that we sample with replacement (although this is unhygienic). Suppose our first Smarties are Red, Purple, Red, Yellow. Then

    like(k) = Pk(1st is a new colour) Pk(2nd is a new colour) Pk(3rd matches 1st) Pk(4th is a new colour)
            = 1 × ((k − 1)/k) × (1/k) × ((k − 2)/k)
            = (k − 1)(k − 2)/k³.

The maximum likelihood estimate is k̂ = 5 (by trial and error), even though it is not much likelier than the other values.

How does the mle relate to sufficient statistics? Suppose that T is sufficient for θ. Then the likelihood is g(T(x), θ)h(x), which depends on θ only through T(x). To maximise this as a function of θ, we only need to maximize g. So the mle θ̂ is a function of the sufficient statistic.

Note that if ϕ = h(θ) with h injective, then the mle of ϕ is given by h(θ̂). For example, if the mle of the standard deviation σ is σ̂, then the mle of the variance σ² is σ̂². This is rather useful in practice, since we can use it to simplify a lot of computations.
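The bias formula E(θ̂) = nθ/(n + 1) is easy to confirm by simulation. A sketch (not from the notes; θ = 100 and the sample sizes are illustrative):

```python
# Simulate theta_hat = max(x_i) for U[0, theta] samples: its mean is
# n*theta/(n+1), so it is biased low but asymptotically unbiased.
import random

def mean_mle(n, theta, reps, rng):
    # average of theta_hat = max(sample) over many simulated samples
    return sum(max(rng.uniform(0, theta) for _ in range(n))
               for _ in range(reps)) / reps

rng = random.Random(2)
theta, reps = 100.0, 40000
for n in [5, 50]:
    print(n, mean_mle(n, theta, reps, rng), n * theta / (n + 1))
```

For n = 5 the mean is near 83.3 rather than 100; for n = 50 it is near 98, illustrating how the bias shrinks as n grows.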

1.5 Confidence intervals

Definition. A 100γ% (0 < γ < 1) confidence interval for θ is a random interval (A(X), B(X)) such that P(A(X) < θ < B(X)) = γ, no matter what the true value of θ may be. It is also possible to have confidence intervals for vector parameters.

Notice that it is the endpoints of the interval that are random quantities, while θ is a fixed constant we want to find out. We can interpret this in terms of repeat sampling: if we calculate (A(x), B(x)) for a large number of samples x, then approximately 100γ% of them will cover the true value of θ.

It is important to know that, having observed some data x and calculated a 95% confidence interval, we cannot say that θ has a 95% chance of being within the interval. Apart from the standard objection that θ is a fixed value that either is or is not in the interval, and hence we cannot assign probabilities to this event, we will later construct an example where even though we have got a 50% confidence interval, we are 100% sure that θ lies in that interval.


Example. Suppose X1, · · · , Xn are iid N(θ, 1). Find a 95% confidence interval for θ.

We know X̄ ∼ N(θ, 1/n), so that √n(X̄ − θ) ∼ N(0, 1). Let z1, z2 be such that Φ(z2) − Φ(z1) = 0.95, where Φ is the standard normal distribution function. We have

    P[z1 < √n(X̄ − θ) < z2] = 0.95,

which can be rearranged to give

    P(X̄ − z2/√n < θ < X̄ − z1/√n) = 0.95,

so (X̄ − z2/√n, X̄ − z1/√n) is a 95% confidence interval for θ.

[...]

We take a prior π(θ) ∝ θ^{a−1}(1 − θ)^{b−1} for some a, b > 0. Then the posterior is

    π(θ | x) ∝ fX(x | θ) π(θ) ∝ θ^{Σ xi + a − 1} (1 − θ)^{n − Σ xi + b − 1}.

We recognize this as beta(Σ xi + a, n − Σ xi + b). So

    π(θ | x) = θ^{Σ xi + a − 1} (1 − θ)^{n − Σ xi + b − 1} / B(Σ xi + a, n − Σ xi + b).

In practice, we need to find a beta prior distribution that matches our information from other hospitals. It turns out that the beta(a = 3, b = 27) prior distribution has mean 0.1 and P(0.03 < θ
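The beta–binomial update above reduces to simple parameter arithmetic. A sketch (the prior beta(3, 27) is from the text; the data, 4 ones in 20 Bernoulli trials, are hypothetical):

```python
# Conjugate beta update for Bernoulli data:
# prior beta(a, b), data sum(x) ones in n trials,
# posterior beta(sum(x) + a, n - sum(x) + b).
a, b = 3, 27            # prior with mean a/(a+b) = 0.1
n, s = 20, 4            # hypothetical data: 4 ones in 20 trials
post_a, post_b = s + a, n - s + b
post_mean = post_a / (post_a + post_b)
print(post_a, post_b, post_mean)  # 7 43 0.14
```

The posterior mean 0.14 sits between the prior mean 0.1 and the sample proportion 0.2, weighted by the relative sizes of the prior pseudo-counts and the data.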