A Non-Least Squares Approach to Linear Models (ISBN 1527592448, 9781527592445)


Table of contents :
Table of Contents
Preface
Chapter 1
1.1 Random Vectors and Matrices
1.2 Expectation Vectors and Matrices
1.3 Covariance Matrices
1.4 The Multivariate Normal Distribution
1.5 The Chi-Squared Distribution
1.6 Quadratic Forms in Normal Random Vectors
1.7 Other Distributions of Interest
1.8 Problems for Chapter 1
Chapter 2
2.1 Introduction
2.2 The Basic Linear Model
2.3 Preliminary Notions
2.4 Identifiability and Estimability of Parametric Vectors
2.5 Best Linear Unbiased Estimation when cov(Y) = σ²V
2.6 The Gauss-Markov Property
2.7 Least Squares, Gauss-Markov, Residuals and Maximum Likelihood Estimation when cov(Y) = σ²In
2.8 Generalized Least Squares, Gauss-Markov, Residuals and Maximum Likelihood Estimation when cov(Y) = σ²V
2.9 Models with Nonhomogeneous Constraints
2.10 Sampling Distributions of Estimators
2.11 Problems for Chapter 2
Chapter 3
3.1 Introduction
3.2 Correspondence
3.3 Correspondence and Full Rank Parameterizations
3.4 Problems for Chapter 3
Chapter 4
4.1 Introduction
4.2 Some Preliminaries
4.3 An Intuitive Approach to Testing
4.4 The Likelihood Ratio Test
4.5 Formulating Linear Hypotheses
4.6 ANOVA Tables for Testing Linear Hypotheses
4.7 An Alternative Test Statistic for Testing Parametric Vectors
4.8 Tests of Hypotheses in Models with Nonhomogeneous Constraints
4.9 Testing Parametric Vectors in Models with Nonhomogeneous Constraints
4.10 Problems for Chapter 4
Chapter 5
5.1 Introduction and Preliminaries
5.2 A Single Covariance Matrix
5.3 Some Further Results on Estimation
5.4 More on Estimation
5.5 Problems for Chapter 5
Appendices
Appendix A1
Appendix A2
Appendix A3
Appendix A4
Appendix A5
Appendix A6
Appendix A7
Appendix A8
Appendix A9
Appendix A10
Appendix A11
Appendix A12
Appendix A13
Appendix A14
Appendix A15
References
Index

Citation preview

A Non-Least Squares Approach to Linear Models

A Non-Least Squares Approach to Linear Models By

Mike Jacroux

A Non-Least Squares Approach to Linear Models

By Mike Jacroux

This book first published 2023

Cambridge Scholars Publishing
Lady Stephenson Library, Newcastle upon Tyne, NE6 2PA, UK

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Copyright © 2023 by Mike Jacroux

All rights for this book reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner.

ISBN (10): 1-5275-9244-8
ISBN (13): 978-1-5275-9244-5

To Justus Seely for his mentoring and friendship over the years.

TABLE OF CONTENTS

Preface
Chapter 1: Probability and Statistics Preliminaries
1.1 Random Vectors
1.2 Expectation Vectors and Matrices
1.3 Covariance Matrices
1.4 The Multivariate Normal Distribution
1.5 The Chi-Squared Distribution
1.6 Quadratic Forms in Normal Random Vectors
1.7 Other Distributions of Interest
1.8 Problems for Chapter 1
Chapter 2: Estimation
2.1 Introduction
2.2 The Basic Linear Model
2.3 Preliminary Notions
2.4 Identifiability and Estimability
2.5 Best Linear Unbiased Estimation
2.6 The Gauss-Markov Property
2.7 Least Squares, Gauss-Markov, Residuals and Maximum Likelihood Estimation when cov(Y) = σ²I
2.8 Generalized Least Squares, Gauss-Markov, Residuals and Maximum Likelihood Estimation when cov(Y) = σ²V
2.9 Models with Nonhomogeneous Constraints
2.10 Sampling Distribution of Estimators
2.11 Problems for Chapter 2
Chapter 3: Parameterizations and Correspondence
3.1 Introduction
3.2 Correspondence
3.3 Correspondence and Maximal Rank Parameterizations
3.4 Problems for Chapter 3
Chapter 4: Testing Linear Hypotheses
4.1 Introduction
4.2 Some Preliminaries
4.3 An Intuitive Approach to Testing
4.4 The Likelihood Ratio Test
4.5 Formulating Linear Hypotheses
4.6 ANOVA Tables for Testing Linear Hypotheses
4.7 An Alternative Test Statistic for Testing Parametric Vectors
4.8 Testing Hypotheses in Models Having Nonhomogeneous Constraints
4.9 Testing Parametric Vectors in Models with Nonhomogeneous Constraints
4.10 Problems for Chapter 4
Chapter 5: The General Theory of Linear Estimation
5.1 Introduction and Preliminaries
5.2 A Single Covariance Matrix
5.3 Some Further Results on Estimation
5.4 More on Estimation
5.5 Problems for Chapter 5
Appendices
Appendix A1: Matrices and Matrix Operations
Appendix A2: Vector Spaces - Rn
Appendix A3: Set Arithmetic in Rn
Appendix A4: The Euclidean Inner Product
Appendix A5: Matrices as Linear Transformations
Appendix A6: The Transpose of a Matrix A (A')
Appendix A7: Inverses
Appendix A8: Products of Matrices
Appendix A9: Partitioned Matrices
Appendix A10: Eigenvalues and Eigenvectors
Appendix A11: Projections
Appendix A12: Generalized Inverses
Appendix A13: Linear Equations and Affine Sets
Appendix A14: Multivariate Distributions
Appendix A15: Problems for the Appendices
References
Index

PREFACE

I took a Ph.D. level course in linear models from Justus Seely in 1974. At the time Justus was writing a set of notes that he eventually hoped to turn into a textbook on linear models. Unfortunately, Justus died in 2002 before he was able to finalize his notes into a textbook. In addition, the notes that Justus was developing changed over time as the clientele in his class changed. This text is based on a set of notes that Justus used to teach his 1989 class on linear models. While a good deal of what Justus wrote has been modified to include my own views and prejudices on linear models, the current text relies on ideas Justus developed in his notes. With the kind permission of the Seely family, most of what appears in this text is based on or taken directly from the notes that Justus wrote. The examples and problems that are due to Justus are referenced as Seely (1989).

There are two aspects of the current text that the author believes make it significantly different from most texts on linear models. The first and perhaps most meaningful difference is that whereas most textbooks on linear models initially introduce least squares estimation and then use that as the basis for the development of the theory of best linear unbiased estimation, that is not the approach taken here. Rather, the theory of linear estimation developed here is based on a well-known theorem in mathematical statistics which basically says that if an unbiased estimator for a parameter has zero covariance with all unbiased estimators of zero, then the estimator is a minimum variance unbiased estimator. The reasons for this approach are several. First, it is more statistical in nature. Second, this approach easily allows estimation theory to be developed under the more general assumption of a σ²V covariance structure, where σ² is an unknown positive constant and V is a known positive definite matrix, rather than the σ²In covariance structure which is typically assumed in most linear models texts. Lastly, in the author's view, the approach used here simplifies the proofs of many of the main results given, thus making the text easier to read. Thus the approach towards linear estimation used here has the dual benefits of initially allowing for the consideration of a much wider variety of models and at the same time making the text simpler to read. The second major difference between this and most other texts on linear models is found in chapter 3 of the current text. In this chapter a systematic approach is given for studying relationships between different parameterizations for a given expectation space. While such relationships have been alluded to in other texts, this is the only formal approach to studying such relationships known to the author and is primarily due to Justus Seely (1989).

Usage of the book

The material presented in this book provides a unifying framework for using many types of models arising in applications, such as regression, analysis of variance, analysis of covariance and variance component models, to analyze data generated from experiments. The author has used the material in the current text to teach a semester-long course in linear models at Washington State University to both undergraduate and graduate students majoring in mathematics and statistics. The minimal background required by students to read the text includes three semesters of calculus, an introductory course in mathematical statistics, an undergraduate course in linear algebra and preferably a course on regression or analysis of variance. By having this background, the reader should have gained familiarity with basic concepts in probability and mathematical statistics such as multi-dimensional random variables, expectation, covariance, point estimation, confidence interval estimation and hypothesis testing. However, before the reader embarks on studying this text, it is strongly recommended that the linear algebra material that is provided in the appendices (Appendix A1 through A13) be studied in detail, because the material presented in the main text is highly dependent on and freely uses the results presented there. These appendices contain a great deal of material on linear algebra not usually contained in an undergraduate course on linear algebra, particularly in appendices A5 through A13. In fact, the author typically spends the first three weeks covering topics in the appendices such as direct sums of subspaces, projection matrices, generalized inverses, affine sets, etc. With this knowledge in hand, the student can then proceed linearly through the book from section to section. The first four chapters of the book comprise what I consider to be a basic course in linear models and can be covered easily in a semester. Chapter 5 presents some additional topics of interest on estimation theory that might be covered if time allows. There are a number of problems at the end of each chapter that an instructor can choose from to make assignments that will enhance the students' understanding of the material in that chapter. The level of difficulty of these problems ranges from easy to challenging. There are also a number of numerical applied-type problems in each chapter that can be assigned to give the students a feel for dealing with data. It is my hope that by studying this text the reader will gain an appreciation of linear models and all its applications.



CHAPTER 1 PROBABILITY AND STATISTICAL PRELIMINARIES

1.1 Random Vectors and Matrices

In this chapter, we introduce some of the basic mathematical and statistical fundamentals required to study linear models. We begin by introducing the ideas of a random vector and a random matrix. To this end, let Y1,…,Yn be a set of n random variables. In this text we only consider continuous random variables, hence we associate with Y1,…,Yn the joint probability density function f(y1,…,yn).

Definition 1.1.1. An n-dimensional vector Y is called a continuous random vector if the n components of Y are all continuous random variables, i.e., Y = (Y1,…,Yn)' is a continuous random vector if Y1,…,Yn are all continuous random variables.

Because in this text we only consider continuous random variables, whenever we refer to a random variable or a random vector it will be assumed to be continuous; thus we shall no longer use the term continuous to describe it. If Y is a random vector, we can use more concise notation to describe the joint density function of Y, such as f(y) or fY(y), where the subscript on fY(y) may be omitted if the random vector Y being considered is clear from the context. More generally, we can extend the idea of a random vector to that of a random matrix.

Definition 1.1.2. An mxn matrix W = (Wij)mxn is called a continuous random matrix if its mn components are all continuous random variables, i.e., W = (Wij)mxn is a continuous random matrix if Wij, i = 1,…,m, j = 1,…,n are all continuous random variables.


As with random variables and vectors, we shall no longer use the term continuous when referring to a random matrix.

1.2 Expectation Vectors and Matrices

In this section we define what we mean by the expectation of a random vector or matrix. So let Y = (Y1,…,Yn)' be an n-dimensional random vector with joint density function fY(y). Then the expectation of Yi, denoted by E(Yi) or μi, is computed as

E(Yi) = μi = ∫_{-∞}^{∞} … ∫_{-∞}^{∞} yi fY(y) dy1…dyn

provided the above integral exists.

Definition 1.2.1. Let Y = (Y1,…,Yn)' be a random vector. Then the expectation vector of Y, denoted by E(Y) = μY = μ, is defined as E(Y) = (E(Y1),…,E(Yn))' = (μ1,…,μn)' provided all expectations exist.

We extend the definition of an expectation vector to the expectation of a random matrix.

Definition 1.2.2. Let W = (Wij)mxn be a random matrix. The expectation matrix of W, denoted by E(W), is defined to be E(W) = [E(Wij)]mxn provided all expectations exist.

The expectation operator associated with random vectors and matrices has some properties that are useful in connection with studying linear models. Some of these properties are given in the following theorems and corollaries.

Theorem 1.2.3. Let A = (aij)lxm, B = (bij)nxp, and C = (cij)lxp be matrices of real numbers and let Z = (Zij)mxn be a random matrix. Then E(AZB + C) = AE(Z)B + C.

Proof. Let W = AZB + C = (Wij)lxp. Then Wij = Σ_{r=1}^{m} Σ_{s=1}^{n} air Zrs bsj + cij. Thus

E(W) = (E(Wij))lxp = (E[Σ_{r=1}^{m} Σ_{s=1}^{n} air Zrs bsj + cij])lxp

= ([Σ_{r=1}^{m} Σ_{s=1}^{n} air E(Zrs) bsj + cij])lxp

= [(AE(Z)B)ij]lxp + [(cij)]lxp = AE(Z)B + C.

Corollary 1.2.4. If X = (X1,…,Xn)' is an n-dimensional random vector and A = (aij)mxn and C = (ci)mx1 are matrices of real numbers, then E(AX + C) = AE(X) + C.

Proof. Let B = In in theorem 1.2.3 above.

Example 1.2.5. Let Y = (Y1,…,Yn)' be an n-dimensional random vector where the Yi's are independent random variables. Let E(Y) = 1nμ where μ is an unknown parameter. Then we can write E(Y) in the form E(Y) = Xμ where X = 1n. Let μ̂ = (X'X)^{-1}X'Y = Σ_{i=1}^{n} Yi/n. Then

E(μ̂) = E[(X'X)^{-1}X'Y] = (X'X)^{-1}X'E(Y) = (X'X)^{-1}(X'X)μ = μ.

Proposition 1.2.6. Let A = (aij)mxn and B = (bij)mxn be matrices of real numbers and let X and Y be n-dimensional random vectors. Then E(AX + BY) = AE(X) + BE(Y).

Proof. Let W = (Wi)mx1 = AX + BY where Wi = Σ_{j=1}^{n} aij Xj + Σ_{j=1}^{n} bij Yj. Then

E(W) = [E(Wi)]mx1 = [E(Σ_{j=1}^{n} aij Xj + Σ_{j=1}^{n} bij Yj)]mx1

= [Σ_{j=1}^{n} aij E(Xj) + Σ_{j=1}^{n} bij E(Yj)]mx1

= [Σ_{j=1}^{n} aij E(Xj)]mx1 + [Σ_{j=1}^{n} bij E(Yj)]mx1 = AE(X) + BE(Y).
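The linearity rules above are easy to check numerically. The following sketch is not from the text; it is a minimal Python/NumPy illustration, with arbitrarily chosen matrices A, B, C and a random matrix Z whose mean M is known, that approximates E(AZB + C) by Monte Carlo and compares it with AE(Z)B + C as in Theorem 1.2.3.

import numpy as np

rng = np.random.default_rng(0)

A = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])   # l x m = 3 x 2
B = np.array([[2.0, 0.0, 1.0], [1.0, -1.0, 0.0]])     # n x p = 2 x 3
C = np.ones((3, 3))                                   # l x p = 3 x 3
M = np.array([[1.0, -2.0], [0.5, 4.0]])               # E(Z), chosen arbitrarily

reps = 100_000
acc = np.zeros((3, 3))
for _ in range(reps):
    Z = M + rng.standard_normal((2, 2))               # random matrix with E(Z) = M
    acc += A @ Z @ B + C

print(acc / reps)          # Monte Carlo estimate of E(AZB + C)
print(A @ M @ B + C)       # AE(Z)B + C (Theorem 1.2.3); the two should agree closely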

1.3 Covariance Matrices

Let Y = (Y1,…,Yn)' be a random vector with joint density fY(y). Assume E(Y) = (E(Y1),…,E(Yn))' = (μ1,…,μn)' exists. Then the covariance between Yi and Yj, denoted by cov(Yi,Yj) = σij, is computed as

cov(Yi,Yj) = σij = E[(Yi - μi)(Yj - μj)] = ∫_{-∞}^{∞} … ∫_{-∞}^{∞} (yi - μi)(yj - μj) fY(y) dy1…dyn

provided the above integral exists. Also, the variance of Yi, denoted by var(Yi) = σi², is defined as var(Yi) = σi² = cov(Yi,Yi) = E[(Yi – μi)²].
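As a quick numerical illustration (not part of the original text; a minimal Python/NumPy sketch with an arbitrarily chosen joint distribution), the covariance σij defined above can be approximated by averaging (Yi - μi)(Yj - μj) over simulated draws and compared with its analytic value.

import numpy as np

rng = np.random.default_rng(1)

# Build a correlated pair (Y1, Y2) from independent standard normals:
# Y1 = Z1 and Y2 = Z1 + 2*Z2, so cov(Y1, Y2) = 1 and var(Y2) = 5 analytically.
reps = 500_000
Z = rng.standard_normal((reps, 2))
Y1 = Z[:, 0]
Y2 = Z[:, 0] + 2.0 * Z[:, 1]

cov_hat = np.mean((Y1 - Y1.mean()) * (Y2 - Y2.mean()))
var_hat = np.mean((Y2 - Y2.mean()) ** 2)
print(cov_hat, var_hat)   # should be close to 1 and 5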


Definition 1.3.1. Let X = (X1,…,Xm)' and Y = (Y1,…,Yn)' be random vectors. Then the covariance matrix between X and Y, denoted by cov(X,Y), is defined as cov(X,Y) = (cov(Xi,Yj))mxn = (σij)mxn provided all covariances exist.

Several properties associated with covariance matrices are given below.

Proposition 1.3.2. Let X and Y be mx1 and nx1 random vectors such that E(X) = μX and E(Y) = μY. Then cov(X,Y) = E[(X – μX)(Y – μY)'].

Proof. Observe that

cov(X,Y) = (cov(Xi,Yj))mxn = (E[(Xi – μXi)(Yj – μYj)])mxn = E[(Xi – μXi)(Yj – μYj)]mxn = E[(X – μX)(Y – μY)'].

Definition 1.3.3. Let Y be an nx1 random vector. Then cov(Y,Y), denoted by cov(Y) = V = VY, is called the dispersion or covariance matrix of Y.

Thus if Y = (Y1,…,Yn)' is an nx1 random vector, cov(Y) is an nxn matrix having cov(Yi,Yj) as its off-diagonal elements for all i ≠ j and var(Yi) as its diagonal elements for i = 1,…,n.

Proposition 1.3.4. Let Y be an nx1 random vector such that E(Y) = μY. Then cov(Y) = E[(Y – μY)(Y – μY)'].

Proof. This follows directly from Proposition 1.3.2.

Suppose Y = (Y1,…,Yn)' is a random vector. If a ∈ Rn, then a'Y = Σ_{i=1}^{n} aiYi is called a linear combination of Y1,…,Yn. Random variables of the form a'Y are fundamental in linear model theory and it is convenient to have matrix expressions for the mean and variance of such random variables. Suppose Y is an n-dimensional random vector. If E(Y) = μ exists and a ∈ Rn, then

E(a'Y) = E(Σ_{i=1}^{n} aiYi) = Σ_{i=1}^{n} aiE(Yi) = Σ_{i=1}^{n} aiμi = a'μ.

An expression for the variance can also be obtained whenever cov(Yi,Yj) exists for all i,j by observing that

var(a'Y) = var(Σ_{i=1}^{n} aiYi) = Σ_{i=1}^{n} Σ_{j=1}^{n} aiajcov(Yi,Yj) = a'cov(Y)a.

Proposition 1.3.5. Suppose Y is an n-dimensional random vector such that E(Y) = μ and cov(Y) = V exists. Then:

(a) cov(a'Y,b'Y) = a'Vb for all a,b ∈ Rn.
(b) V is a positive semi-definite matrix.
(c) V is the only matrix satisfying statement (a).

Proof. (a) To prove (a), observe that

cov(a'Y,b'Y) = E[(a'Y – E(a'Y))(b'Y – E(b'Y))] = E[(a'Y – a'E(Y))(b'Y – b'E(Y))]

= E[(a'Y - a'μ)(b'Y - b'μ)] = E[a'(Y – μ)b'(Y – μ)]

= E[(Σ_{i=1}^{n} ai(Yi - μi))(Σ_{j=1}^{n} bj(Yj - μj))]

= E[Σ_{i=1}^{n} Σ_{j=1}^{n} aibj(Yi - μi)(Yj - μj)]

= Σ_{i=1}^{n} Σ_{j=1}^{n} aibjE[(Yi - μi)(Yj - μj)]

= Σ_{i=1}^{n} Σ_{j=1}^{n} aibj cov(Yi,Yj) = a'Vb.

(b) For (b), note that cov(Yi,Yj) = cov(Yj,Yi) implies V = V' and for any a ∈ Rn, a'Va = var(a'Y) implies a'Va ≥ 0.

(c) For (c), suppose G also satisfies the condition. Then a'Vb = a'Gb for all a,b ∈ Rn, hence a'(V – G)b = 0 for all a,b ∈ Rn which implies V = G.

The covariance matrix is a very useful tool for expressing variances and covariances of linear combinations of random vectors.


Example 1.3.6. Let Y and X = 1n be as in Example 1.2.5 and suppose Y1,…,Yn have a common variance σ². Clearly cov(Y) exists and is equal to σ²In. Thus, Proposition 1.3.5(a) implies cov(a'Y,b'Y) = σ²a'b for all a,b ∈ Rn. In particular, let μ̂ be as in Example 1.2.5. Then we have that μ̂ = t'Y where t = X(X'X)^{-1} = n^{-1}1n and

var(μ̂) = var(n^{-1}1n'Y) = n^{-1}1n'σ²In1n n^{-1} = σ²/n.

Lemma 1.3.7. Suppose X and Y are mx1 and nx1 random vectors, A and B are lxm and pxn matrices of real numbers and a and b are lx1 and px1 vectors of real constants. Then

cov(AX + a, BY + b) = Acov(X,Y)B'.

Proof. Let U = AX + a and let V = BY + b. Then by corollary 1.2.4, E(U) = AE(X) + a and E(V) = BE(Y) + b and by proposition 1.3.2 and theorem 1.2.3,

cov[AX + a, BY + b] = cov[U,V] = E[(U - E(U))(V – E(V))']

= E[(AX + a – AE(X) – a)(BY + b – BE(Y) – b)']

= E[A(X - E(X))(B(Y – E(Y)))'] = E[A(X – E(X))(Y – E(Y))'B']

= AE[(X – E(X))(Y – E(Y))']B' = Acov(X,Y)B'.

Corollary 1.3.8. Let Y be an nx1 random vector, A be an lxn matrix of real numbers and let a be an lx1 vector of real constants. If cov(Y) = V, then

cov(AY + a) = cov(AY) = AVA'.

Proof. By lemma 1.3.7, cov(AY + a) = cov(AY + a, AY + a) = Acov(Y,Y)A' = Acov(Y)A' = AVA' = cov(AY).

The covariance matrix of a random vector has many similarities with the variance of a random variable. For example, it is nonnegative in the sense of Proposition 1.3.5. As another example, if Y and Z are independent nx1 vectors, then

cov((Y', Z')') = (cov(Y) 0nn ; 0nn cov(Z)),

that is, the covariance matrix of the stacked vector is block diagonal (rows separated by semicolons). Now, if α is a real number and c ∈ Rn, using corollary 1.3.8, we have

cov(αY + Z + c) = cov(αY + Z) = cov[(αIn, In)(Y', Z')'] = (αIn, In)cov((Y', Z')')(αIn, In)'

= α² cov(Y) + cov(Z)     (1.3.9)

as long as cov(Y) and cov(Z) both exist. Additional properties of cov(.) are discussed in the problems at the end of this chapter.

Example 1.3.10. (Two variance component model) Suppose Yij = μ + bi + eij, i,j = 1,2, where μ is an unknown parameter and b1,b2,e11,…,e22 are independent random variables having zero means. Also assume the bi have variance σb² and the eij have variance σ². Set Y = (Y11,Y12,Y21,Y22)', X = (1,1,1,1)', e = (e11,e12,e21,e22)',

B' = (1 1 0 0 ; 0 0 1 1)

and b = (b1,b2)'. Then Y can be expressed in matrix form as Y = Xμ + Bb + e. Notice that b and e are independent random vectors, that Xμ is a constant vector, that cov(b) = σb²I2 and cov(e) = σ²I4. From (1.3.9) we conclude that

cov(Y) = cov(Xμ + Bb + e) = cov(Bb + e) = cov(Bb) + cov(e) = Bcov(b)B' + σ²I4 = σ²I4 + σb²BB'.

Also notice that E(Y) = Xμ since b and e both have expectation zero. In this example, σ² and σb² are called variance components.

The expression given for cov(Y) in proposition 1.3.4 in terms of a random matrix also leads to a convenient form for the expectation of a quadratic form.

Definition 1.3.11. Suppose Y = (Y1,…,Yn)' is a random vector and A = A' = (aij)nxn. Then

Y'AY = Σ_{i=1}^{n} Σ_{j=1}^{n} aijYiYj

is called a quadratic form in Y.

Proposition 1.3.12. Let Y = (Y1,…,Yn)' be a random vector with E(Y) = μ and cov(Y) = V. Let A = A' = (aij)nxn. Then E(Y'AY) = μ'Aμ + trAV.

Proof. To begin, observe that Y'AY = (Y - μ)'A(Y - μ) + μ'AY + Y'Aμ - μ'Aμ. Now observe that since A = A', Y'Aμ = (Y'Aμ)' = μ'A'Y = μ'AY and E(Y'Aμ) = E(μ'AY) = μ'AE(Y) = μ'Aμ, we have that

E(Y'AY) = E[(Y - μ)'A(Y - μ) + μ'AY + Y'Aμ - μ'Aμ]

= E[(Y - μ)'A(Y - μ)] + E(μ'AY) + E(Y'Aμ) - E(μ'Aμ)

= E[(Y - μ)'A(Y - μ)] + μ'Aμ + μ'Aμ - μ'Aμ

= E[Σ_{i=1}^{n} Σ_{j=1}^{n} aij(Yi - μi)(Yj - μj)] + μ'Aμ

= Σ_{i=1}^{n} Σ_{j=1}^{n} aijE[(Yi - μi)(Yj - μj)] + μ'Aμ

= Σ_{i=1}^{n} Σ_{j=1}^{n} aij cov(Yi,Yj) + μ'Aμ = trAV + μ'Aμ.

Corollary 1.3.13. Let W = Y - b where E(Y) = μ, cov(Y) = V and b ∈ Rn. If A = A' is a matrix of real numbers, then cov(W) = cov(Y) and E(W'AW) = trAV + (μ - b)'A(μ - b).

The problems at the end of this chapter related to this section explore other aspects of the covariance operator. We strongly suggest that the reader go through these problems and at the very least become familiar with the properties discussed.
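Proposition 1.3.12 is convenient to sanity-check by simulation. The sketch below is not from the text; it is a minimal Python/NumPy illustration using arbitrarily chosen values of μ, σ², σb² and a symmetric A, together with the covariance structure of Example 1.3.10. It draws Y = Xμ + Bb + e repeatedly, averages Y'AY, and compares the average with μ'Aμ + tr(AV) where V = σ²I4 + σb²BB'.

import numpy as np

rng = np.random.default_rng(2)

mu, sig2, sigb2 = 3.0, 1.5, 0.75          # arbitrary parameter values
X = np.ones(4)
B = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
V = sig2 * np.eye(4) + sigb2 * B @ B.T    # cov(Y) from Example 1.3.10
m = X * mu                                # E(Y) = X*mu

A = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 1.0, 0.0],
              [0.0, 1.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 2.0]])      # symmetric A, chosen arbitrarily

reps = 100_000
acc = 0.0
for _ in range(reps):
    b = np.sqrt(sigb2) * rng.standard_normal(2)
    e = np.sqrt(sig2) * rng.standard_normal(4)
    Y = m + B @ b + e
    acc += Y @ A @ Y

print(acc / reps)                         # Monte Carlo estimate of E(Y'AY)
print(m @ A @ m + np.trace(A @ V))        # mu'A mu + tr(AV) (Proposition 1.3.12)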

1.4 The Multivariate Normal Distribution

In this section, we introduce the multivariate normal distribution and investigate some of its properties. For a brief review of some of the distribution theory used in this section, the reader should consult Appendix A14. Let Z be a standard normal random variable and recall the following properties associated with Z:

(1) The probability density function for Z is, for all z ∈ R1, f(z) = (2π)^{-1/2}exp[(-1/2)z²].

(2) E(Z) = 0.

(3) var(Z) = 1.

(4) The moment generating function (m.g.f.) for Z is, for all t ∈ R1, MZ(t) = exp[(1/2)t²].

Now let Z1,…,Zn be mutually independent standard normal random variables and let Z = (Z1,…,Zn)'. Then some easily established facts concerning Z are the following:

(1) The joint density function for Z is, for all z ∈ Rn,

fZ(z) = Π_{i=1}^{n} fZi(zi) = Π_{i=1}^{n} (2π)^{-1/2}exp((-1/2)zi²)

= (2π)^{-n/2}exp((-1/2)(Σ_{i=1}^{n} zi²))

= (2π)^{-n/2}exp((-1/2)(z'Inz)).

(2) E(Z) = 0n.

(3) cov(Z) = In.

(4) The m.g.f. for Z is, for all t ∈ Rn,

MZ(t) = Π_{i=1}^{n} MZi(ti) = Π_{i=1}^{n} exp[(1/2)ti²]

= exp[(1/2)(t1² +…+ tn²)] = exp[(1/2)t'Int].

Definition 1.4.1. We say the random vector X = (X1,…,Xn)' follows an n-dimensional multivariate normal distribution of rank p if X has the same distribution as AZ + b where A is some nxn real matrix with r(A) = p, b ∈ Rn and Z = (Z1,…,Zn)' is an n-dimensional random vector whose components Zi are independent standard normal random variables.


Proposition 1.4.2. Suppose X satisfies definition 1.4.1. Then

(a) E(X) = b.

(b) cov(X) = V where V = AA' and r(V) = r(A) = p.

Proof. Since X has the same distribution as AZ + b, it follows that E(X) = E(AZ + b) and that cov(X) = cov(AZ + b). These results now follow after applying corollary 1.2.4 and corollary 1.3.8 to AZ + b.

If X satisfies definition 1.4.1, we denote it by "X ~ Nn(b,V) of rank p" where V = AA' and p = r(A) = r(V). If V > 0, we will generally omit the rank portion of the preceding statement. We note that if Z is as in definition 1.4.1, then Z ~ Nn(0n, In). We now investigate some of the properties associated with multivariate normal distributions.

Proposition 1.4.3. An nx1 random vector X ~ Nn(b,V) of rank p if and only if its m.g.f. has the form

MX(t) = exp(t'b + (1/2)t'Vt)

where b ∈ Rn, V ≥ 0 and r(V) = p.

Proof. Suppose X ~ Nn(b,V). Then X satisfies definition 1.4.1 and has the same distribution as AZ + b where A is an nxn real matrix of rank p, b ∈ Rn and Z is an n-dimensional random vector whose components are independent standard normal random variables. Since Z has an m.g.f., so does AZ + b and since X and AZ + b have the same distributions, they have the same m.g.f. But the m.g.f. of Z is

MZ(t) = E(exp(t'Z)) = exp((1/2)t'Int).

Thus,

MX(t) = E[exp(t'X)] = MAZ+b(t) = E[exp(t'(AZ + b))] = E[exp(t'AZ + t'b)]

= exp(t'b)E[exp((A't)'Z)] = exp(t'b)MZ(A't)

= exp(t'b)exp[(1/2)t'AA't] = exp[t'b + (1/2)t'Vt]

where V = AA'. Conversely, suppose X has m.g.f. MX(t) = exp[t'b + (1/2)t'Vt] where r(V) = p. Since V ≥ 0, we can find an nxn real matrix A of rank p such that V = AA'. Now, as in the proof above, it follows that the m.g.f. of AZ + b is exp[t'b + (1/2)t'Vt] where V = AA', the same as X. Because the m.g.f. uniquely determines the distribution (when the m.g.f. exists in an open n-dimensional neighborhood containing 0n), X has the same distribution as AZ + b.

Proposition 1.4.4. Let X ~ Nn(μ,V), let C be a qxn real matrix and let a ∈ Rq. Then Y = CX + a ~ Nq(Cμ + a, CVC') of rank r(CVC').

Proof. Observe that

MY(t) = E[exp(t'Y)] = E[exp(t'(CX + a))] = E[exp(t'CX + t'a)]

= exp(t'a)E[exp((C't)'X)] = exp(t'a)MX(C't)

= exp(t'a)exp[(C't)'μ + (1/2)(C't)'V(C't)]

= exp[(C't)'μ + t'a + (1/2)t'CVC't] = exp[t'(Cμ + a) + (1/2)t'CVC't].

Now observe that this last expression for MY(t) is the same as that of a random vector Y ~ Nq(Cμ + a, CVC') of rank r(CVC').

Corollary 1.4.5. Let Y = (Y1', Y2')' ~ Nn(μ,V) where Y1 and Y2 are n1x1 and n2x1 random vectors, respectively. Correspondingly, let

μ = (μ1', μ2')' and V = (V11 V12 ; V21 V22)

where E(Y1) = μ1, E(Y2) = μ2, cov(Y1) = V11, cov(Y2) = V22 and cov(Y1,Y2) = V12 = V21'. Then Y1 ~ Nn1(μ1, V11).

Proof. In Proposition 1.4.4, take C = (In1, 0n1n2).

Proposition 1.4.6. Suppose Y ~ Nn(μ,V) where Y, μ and V are partitioned as in corollary 1.4.5. Then Y1 and Y2 are independent if and only if cov(Y1,Y2) = V12 = 0n1n2.

Proof. By Corollary 1.4.5, Y1 ~ Nn1(μ1,V11) and Y2 ~ Nn2(μ2,V22) and Y1 and Y2 are mutually independent if and only if MY(t) = MY1(t1)MY2(t2). Now, observe that by Proposition 1.4.3,

MY1(t1) = exp[t1'μ1 + (1/2)t1'V11t1] and MY2(t2) = exp[t2'μ2 + (1/2)t2'V22t2]

and

MY(t) = MY((t1', t2')') = exp[(t1', t2')(μ1', μ2')' + (1/2)(t1', t2')(V11 V12 ; V21 V22)(t1', t2')']

= exp[t1'μ1 + t2'μ2 + (1/2)(t1'V11t1 + t1'V12t2 + t2'V21t1 + t2'V22t2)].

Therefore, Y1 and Y2 are independent if and only if t1'V12t2 + t2'V21t1 = 0 for all possible values of t1 and t2. But since t2'V21t1 = (t2'V21t1)' = t1'V12t2 = (t1'V12t2)', the condition is that t1'V12t2 = 0 for all t1 and t2, i.e., that V12 = 0n1n2.

Proposition 1.4.7. Suppose Y ~ Nn(μ,V) where V > 0. Then

fY(y) = (2π)^{-n/2}|V|^{-1/2}exp[-(1/2)(y - μ)'V^{-1}(y - μ)]

for all y ∈ Rn and where |V| denotes the determinant of V.

Proof. Since Y ~ Nn(μ,V), it has the same distribution as AZ + μ where Z = (Z1,…,Zn)' and the Zi are all independent standard normal random variables, μ ∈ Rn, A is an nxn real matrix with r(A) = n and V = AA'. Now recall that, for all z ∈ Rn,

fZ(z) = Π_{i=1}^{n} fZi(zi) = Π_{i=1}^{n} (2π)^{-1/2}exp[-(1/2)zi²] = (2π)^{-n/2}exp[-(1/2)z'Inz].

As indicated in Appendix A14,

fY(y) = (1/|A|)fZ(A^{-1}(y - μ))

= (2π)^{-n/2}(1/|A|)exp[-(1/2)(y - μ)'(A^{-1})'InA^{-1}(y - μ)]

= (2π)^{-n/2}|V|^{-1/2}exp[-(1/2)(y - μ)'V^{-1}(y - μ)].

Note. When n = 1, the expression given above for the density function of Y reduces to the density function of a 1-dimensional normal random variable.
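Definition 1.4.1 and Proposition 1.4.2 suggest a direct way to simulate from Nn(b,V): factor V as AA' and set X = AZ + b. The sketch below is not from the text; it is a minimal Python/NumPy illustration with an arbitrarily chosen b and positive definite V, using a Cholesky factor for A and checking the sample mean and sample covariance against b and AA'.

import numpy as np

rng = np.random.default_rng(3)

b = np.array([1.0, -2.0, 0.5])
V = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.0],
              [0.5, 0.0, 2.0]])          # positive definite, chosen arbitrarily
A = np.linalg.cholesky(V)                # V = AA'

reps = 200_000
Z = rng.standard_normal((reps, 3))       # rows are independent N3(0, I3) draws
X = Z @ A.T + b                          # each row is one draw of AZ + b

print(X.mean(axis=0))                    # should be close to b       (Prop. 1.4.2(a))
print(np.cov(X, rowvar=False))           # should be close to V = AA' (Prop. 1.4.2(b))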

1.5 The Chi-Squared Distribution

In this section we consider properties of central and non-central chi-squared distributions which play fundamental roles in tests of hypotheses in linear models.

Definition 1.5.1. Let Y ~ Nv(μ, Iv). Then the random variable Y'Y = Σ_{i=1}^{v} Yi² is said to follow a non-central chi-squared distribution with v degrees of freedom (d.f.) and non-centrality parameter λ = μ'μ.

The reader should observe that when λ = μ'μ = 0, we get the well known central chi-squared distribution with v d.f., which can be expressed as the sum of v squared independent standard normal random variables. We shall use notation such as X ~ χ²(v,λ) to indicate that the random variable X follows a chi-squared distribution with v d.f. and non-centrality parameter λ and use the notation X ~ χ²(v) to denote a random variable X which follows a central chi-squared distribution with v d.f.

Proposition 1.5.2. If Y ~ χ²(v,λ), then the moment generating function for Y is

MY(t) = (1/(1 - 2t))^{v/2}exp[λt/(1 - 2t)] for all t < 1/2.

[…]

Let s = r(A) and observe that A ≥ 0. So by theorem A10.12, we can find a pxs matrix H such that A = HH' and r(H) = s. Then U'AU = U'HH'U = (H'U)'(H'U) = Y'Y where Y = H'U ~ Ns(H'μ, H'ΣH). Let L = (H'H)^{-1}H' and observe that

H'ΣH = (H'H)^{-1}H'HH'ΣHH'H(H'H)^{-1} = L(AΣA)L' = L(πA)L' = π(H'H)^{-1}H'HH'H(H'H)^{-1} = πIs

which implies Y = H'U ~ Ns(H'μ, πIs) and (1/π)^{1/2}Y ~ Ns((1/π)^{1/2}H'μ, Is). Hence, by definition 1.5.1,

(1/π)^{1/2}Y'(1/π)^{1/2}Y = (1/π)Y'Y = (1/π)U'HH'U = (1/π)U'AU ~ χ²(s,λ)

where s = r(A) and λ = (1/π)^{1/2}μ'HH'μ(1/π)^{1/2} = (1/π)μ'Aμ.

Corollary 1.6.2. Let Y ~ Nn(μ, πIn) where π > 0. If A = A' = A², then

(1/π)Y'AY ~ χ²(k,λ)

where k = trA = r(A) and λ = (1/π)μ'Aμ.

Proof. This follows from proposition 1.6.1 since A(πIn)A = πA² = πA and from proposition A11.11 since A² = A implies k = trA = r(A).

Proposition 1.6.3. Let A = A' and V = BB' where B is an nxn matrix. Set T = B'AB. Then the following statements hold:

(a) trT = tr(VA).
(b) If T² = T, then trT = r(T).
(c) r(T) = r(VAV).
(d) T² = T if and only if V(AVA - A)V = 0nn.

Proof. (a) By proposition A10.14, trT = tr(B'AB) = tr(ABB') = tr(AV) = tr(VA).

(b) This is part of proposition A11.11.

(c) Using proposition A8.2 and proposition A6.2, we have that

r(VAV) = r(BB'AV) = r(B'AV) - dim(R(B'AV) ∩ N(B)) = r(B'AV) - dim(R(B'AV) ∩ R(B')⊥) = r(B'AV) - 0

= r(VAB) = r(BB'AB) = r(B'AB) - dim[R(B'AB) ∩ N(B)] = r(B'AB) - dim[R(B'AB) ∩ R(B')⊥] = r(B'AB) - 0 = r(T).

(d) Assume T² = T. Then (B'AB)(B'AB) = B'AB and BB'ABB'ABB' = BB'ABB', which implies that VAVAV = VAV, which yields the desired result. Conversely, assume VAVAV = VAV. Then BB'AVAV - BB'AV = 0nn, which implies that B[B'AVAV - B'AV] = 0nn and R(B'AVAV - B'AV) ⊂ N(B) ∩ R(B') = R(B')⊥ ∩ R(B') = 0n. Hence B'AVAV - B'AV = 0nn. But this implies that VAVAB - VAB = BB'AVAB – BB'AB = B(B'AVAB - B'AB) = 0nn. Hence, as above, that R(B'AVAB – B'AB) ⊂ N(B) ∩ R(B') = R(B')⊥ ∩ R(B') = 0n and that B'AVAB – B'AB = B'ABB'AB – B'AB = T² – T = 0nn.

Proposition 1.6.4. Suppose Y ~ Nv(μ, πV) where π > 0, V ≥ 0 and μ ∈ R(V). If A = A' and V(AVA - A)V = 0vv, then (1/π)Y'AY ~ χ²(k,λ) where k = tr(AV) and λ = (1/π)μ'Aμ.

Proof. Since V ≥ 0, by theorem A10.10, we can find a vxv matrix B such that V = BB'. Set T = B'AB. Then T = T' and by proposition 1.6.3, it follows that T² = T. Also, since μ ∈ R(V) = R(BB') = R(B), we can find δ such that μ = Bδ. Let X ~ Nv(δ, πIv). Then BX ~ Nv(Bδ, πBB') = Nv(μ, πV). Now let Y = BX. Then,

(1/π)Y'AY = (1/π)X'B'ABX = (1/π)X'TX

where T = T' = T². Thus, by corollary 1.6.2,

(1/π)Y'AY = (1/π)X'TX ~ χ²(k,λ)

where k = tr(T) = tr(AV) and λ = (1/π)δ'Tδ = (1/π)δ'B'ABδ = (1/π)μ'Aμ.

Corollary 1.6.5. The preceding proposition remains true if A = A' ≠ 0vv and (VA)² = VA.

Proof. Since (VA)² = VA, we have that VAVAV = VAV and the result follows by proposition 1.6.4.

Proposition 1.6.6. Suppose Y ~ Np(μ,V) where V ≥ 0. Let Qi = Y'AiY for i = 1,…,t where Ai = Ai' and let U0 = B'Y where B is a pxs matrix. If AiVAj = 0pp and AiVB = 0ps for all i ≠ j, i,j = 1,…,t, then U0, Q1,…,Qt are all mutually independent.

Proof. For i = 1,…,t observe that Qi = Y'AiY = Y'AiAi⁻AiY = Ui'Ai⁻Ui where Ui = AiY, thus each Qi is a function of Ui for i = 1,…,t. So to show that the Qi are all independent, it is enough to show that the Ui are all independent. To this end, let U0 be as defined above and let Uk = AkY for k = 1,…,t. Now consider the matrix M = (B,A1',…,At')' and let W = MY = (Y'B,Y'A1',…,Y'At')'. By proposition 1.4.4, W = (U0',U1',…,Ut')' ~ Nn(Mμ, MVM') where n = tp + s. But also from proposition 1.4.6, we know that the Ui, i = 0,…,t are independent if and only if cov(Ui,Uj) = 0pp for all i ≠ j, i,j = 1,…,t and cov(U0,Ui) = 0sp for i = 1,…,t. But by lemma 1.3.7, cov(Ui,Uj) = cov(AiY,AjY) = AiVAj, i ≠ j, i,j = 1,…,t and cov(U0,Ui) = cov(B'Y,AiY) = B'VAi for i = 1,…,t. Thus by proposition 1.4.6, the Ui, i = 0,…,t are all independent if and only if the conditions of the proposition hold.

We now momentarily diverge and give two lemmas concerning linear algebra which are needed to prove the main result of this section.

Lemma 1.6.7. Suppose H and G are nxn symmetric matrices such that r(H + G) = r(H) + r(G) and (H + G)² = H + G. Then H² = H, G² = G and HG = GH = 0nn.

Proof. Note that

r(H) + r(G) = r(H + G) = dim R(H + G) ≤ dim[R(H) + R(G)] = dim[R(H)] + dim[R(G)] - dim[R(H) ∩ R(G)] ≤ r(H) + r(G)

which implies that dim[R(H) ∩ R(G)] = 0, hence that R(H) ∩ R(G) = 0n. Now consider (H + G)² = H² + HG + GH + G² = H + G, which we can rewrite as H² + HG - H = G - GH - G². This latter expression implies that R(H² + HG - H) ⊂ R(H) ∩ R(G) = 0n, which implies that H² – H = -HG and, since H and G are symmetric, that H² - H = (H² - H)' = -GH. Thus R(H² - H) ⊂ R(H) ∩ R(G) = 0n and H² - H = 0nn, so that HG = GH = 0nn. Similarly, it can be shown that G² = G.

Lemma 1.6.8. Suppose A1,…,At are nxn symmetric matrices with ranks r1,…,rt, respectively. If In = A1 +…+ At, then the following are equivalent:

(a) AiAj = 0nn for all i ≠ j.
(b) Ai² = Ai for i = 1,…,t.
(c) Σ_{i=1}^{t} ri = n.

Proof. (a) ⇒ (b) Note that Ai = AiIn = Ai(A1 +…+ At) = Ai², hence Ai² = Ai.

(b) ⇒ (c) By lemma A11.11, Ai² = Ai implies that trAi = ri. So n = tr(In) = tr(Σ_{i=1}^{t} Ai) = Σ_{i=1}^{t} tr(Ai) = Σ_{i=1}^{t} ri.

(c) ⇒ (a) Suppose p ≠ q and let H = Ap + Aq, G = In - H, h = rp + rq and g = n - h. Then r(H) = r(Ap + Aq) ≤ r(Ap) + r(Aq) = rp + rq = h and h - r(H) ≥ 0. Now,

r(G) = r(In - H) = r(Σ_{i≠p,q} Ai) ≤ Σ_{i≠p,q} r(Ai) = Σ_{i≠p,q} ri = g

and g - r(G) ≥ 0. Now observe that n = r(In) = r(H + G) ≤ r(H) + r(G) ≤ h + g = n and (h - r(H)) + (g - r(G)) ≤ 0. Thus h = r(H), g = r(G) and r(H + G) = r(H) + r(G). Now, since H + G = In and In² = In implies (H + G)² = (H + G) and since r(H + G) = r(H) + r(G), we have that H² = H by lemma 1.6.7. But then r(Ap + Aq) = r(H) = h = r(Ap) + r(Aq) and (Ap + Aq)² = H² = H = Ap + Aq, hence by lemma 1.6.7, ApAq = 0nn.

Theorem 1.6.9. (Cochran's Theorem) Suppose Z ~ Nn(0n,In) and Z'Z = Q1 +…+ Qt where Qi = Z'AiZ, Ai = Ai' and r(Ai) = ri for i = 1,…,t. Then Q1,…,Qt are mutually independent and Qi ~ χ²(ri) for i = 1,…,t if and only if Σ_{i=1}^{t} ri = n.

Proof. For i = 1,…,t, Ai = Ai' and Qi = Z'AiZ. Then the condition Z'Z = Z'InZ = Q1 +…+ Qt = Z'A1Z +…+ Z'AtZ = Z'(A1 +…+ At)Z for all values z of Z implies In = A1 +…+ At. Now, if n = Σ_{i=1}^{t} ri, it follows from lemma 1.6.8 that Ai² = Ai for i = 1,…,t and that AiAj = 0nn for all i ≠ j. Hence we have by corollary 1.6.2 and proposition 1.6.6 that Qi = Z'AiZ ~ χ²(r(Ai)) = χ²(ri) and that the Qi are all independent. Conversely, suppose Q1,…,Qt are all mutually independent and Qi ~ χ²(ri) for i = 1,…,t. Then by proposition 1.5.3, Σ_{i=1}^{t} Qi ~ χ²(Σ_{i=1}^{t} ri). By definition, Z'Z ~ χ²(n) and since Z'Z = Σ_{i=1}^{t} Qi, we have that Σ_{i=1}^{t} ri = n.
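Cochran's theorem is easy to see in action for the familiar decomposition of a sum of squares into the part explained by the mean and the sum of squared deviations. The sketch below is not from the text; it is a minimal Python/NumPy illustration that takes A1 = (1/n)1n1n' and A2 = In - A1, checks that the ranks add to n, and compares the simulated behaviour of Q1 = Z'A1Z and Q2 = Z'A2Z with chi-squared distributions on 1 and n - 1 degrees of freedom.

import numpy as np

rng = np.random.default_rng(4)
n = 6

one = np.ones((n, 1))
A1 = one @ one.T / n          # projection onto span{1n}, rank 1
A2 = np.eye(n) - A1           # projection onto the orthogonal complement, rank n-1
print(np.linalg.matrix_rank(A1) + np.linalg.matrix_rank(A2))   # = n, so Cochran applies

reps = 100_000
Z = rng.standard_normal((reps, n))
Q1 = np.einsum('ij,jk,ik->i', Z, A1, Z)     # Z'A1Z for each draw
Q2 = np.einsum('ij,jk,ik->i', Z, A2, Z)     # Z'A2Z for each draw

# A chi-squared variable with k d.f. has mean k and variance 2k.
print(Q1.mean(), Q1.var())        # ~1, ~2          (chi-squared, 1 d.f.)
print(Q2.mean(), Q2.var())        # ~n-1, ~2(n-1)   (chi-squared, n-1 d.f.)
print(np.corrcoef(Q1, Q2)[0, 1])  # ~0, consistent with independence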

1.7 Other Distributions of Interest

In this section we introduce some other distributions which are useful in the study of linear models.

Definition 1.7.1. Let Q1 ~ χ²(p) and Q2 ~ χ²(q) and suppose Q1 and Q2 are independent. Then F = (Q1/p)/(Q2/q) is said to follow a central F-distribution with p and q degrees of freedom and is denoted by F ~ F(p,q).

Note: If F ~ F(p,q), then F has probability density function

g(f) = Γ((p + q)/2)p^{p/2}q^{q/2}f^{(p/2) - 1}/[Γ(p/2)Γ(q/2)(q + pf)^{(p + q)/2}].

Definition 1.7.2. Suppose U1 ~ χ²(p,λ) and U2 ~ χ²(q) where U1 and U2 are independent. We say W = (U1/p)/(U2/q) follows a non-central F-distribution with p and q degrees of freedom and non-centrality parameter λ, denoted by W ~ F(p,q,λ).

Note: If W ~ F(p,q,λ), then the probability density function of W can be expressed as a weighted infinite series of densities. For more information on the central and non-central F-distributions, the reader is referred to Searle (1971).

The main value of the central F-distribution is for testing hypotheses in linear models. We shall discuss such tests in Chapter 4. The main use of the non-central F-distribution is the determination of power associated with F-tests used in testing hypotheses in linear models.

Definition 1.7.3. Let Z ~ N(0,1) and Y ~ χ²(v) with Z and Y being independent. Let T = Z/(Y/v)^{1/2}. We say T follows a central T-distribution with v degrees of freedom, denoted by T ~ T(v).

Note. If T ~ T(v), then the probability density function of T is given by

fT(t) = [Γ((v + 1)/2)/((vπ)^{1/2}Γ(v/2))](1 + (t²/v))^{-(v + 1)/2} for all -∞ < t < ∞.

Note. The main use of the T-distribution is to find confidence intervals and test hypotheses about individual parameters or individual linear functions of parameters in linear models.

Definition 1.7.4. Let X ~ N(μ,1) and Y ~ χ²(v) with X and Y being independent. If T' = X/(Y/v)^{1/2}, we say T' follows a non-central T-distribution with v degrees of freedom and non-centrality parameter μ, denoted by T' ~ T(v,μ).

Note. The main use of the non-central T-distribution is to determine the power of T-tests concerning individual linear functions of parameters in linear models. For more information on the central and non-central T-distributions, the reader is referred to Searle (1971).
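The F and T variates defined in this section can be generated directly from independent standard normals, which is a convenient check on the definitions. The sketch below is not from the text; it is a minimal Python/NumPy illustration with arbitrarily chosen degrees of freedom that builds central F and T variates and compares their sample moments with the known values E(F) = q/(q - 2) and var(T) = v/(v - 2).

import numpy as np

rng = np.random.default_rng(5)
p, q, v, reps = 3, 10, 8, 200_000

# Build the variates from the definitions above, using sums of squared
# standard normals for the chi-squared ingredients.
Q1 = (rng.standard_normal((reps, p)) ** 2).sum(axis=1)   # chi-squared, p d.f.
Q2 = (rng.standard_normal((reps, q)) ** 2).sum(axis=1)   # chi-squared, q d.f.
F = (Q1 / p) / (Q2 / q)                                  # Definition 1.7.1
print(F.mean(), q / (q - 2))                             # E(F) = q/(q-2) for q > 2

Z = rng.standard_normal(reps)
Y = (rng.standard_normal((reps, v)) ** 2).sum(axis=1)
T = Z / np.sqrt(Y / v)                                   # Definition 1.7.3
print(T.mean(), T.var(), v / (v - 2))                    # E(T) = 0, var(T) = v/(v-2) for v > 2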


1.8 Problems for Chapter 1

1. Suppose Z = (Zij)2x3 is a random matrix with E(Z) = (2 1 2 ; -1 1 3). Let A = (2 -1 ; 3 2) and B = (6 2 -2 ; 3 -1 4 ; -1 2 6) be matrices of constants. Find E(AZB).

2. Suppose W = (Wij)nxp and Z = (Zij)kxp are random matrices and that A = (aij)mxn and B = (bij)mxk are matrices of constants. Show that E(AW + BZ) = AE(W) + BE(Z).

3. (Seely (1989)) Suppose Z is a random vector such that E(Z) = (1, 0, -1)' and cov(Z) = (2 1 0 ; 1 2 1 ; 0 1 2). Set Y = (Y1, Y2, Y3)' where Y1 = Z1 – Z2 + Z3, Y2 = 2Z1 – Z3, and Y3 = 2Z2 + 3Z3.

(a) Determine E(Y), cov(Y) and E(YY').
(b) Determine the mean and the variance of Y2.
(c) Determine the covariance between Y1 and Y2 + Y3.
(d) Find the expectation of Q = Y1 + 2Y3 + Y1² - 2Y1Y3 + Y3².

4. Suppose Y is a random vector such that cov(Y) = σ²In. What is the covariance matrix of Z = R'Y where R is an orthogonal matrix?

5. Suppose Y is an nx1 random vector such that cov(Y) = σ²V where V is a positive definite matrix. Let V = BB' where B is nxn. What is the covariance of Z = B^{-1}Y?

6. Let X be an n-dimensional random vector and let Y1 = X1 and Yi = Xi – Xi-1 for i = 2,…,n. If the Xi's are all independent with unit variances, find cov(Y).

7. Let X1,…,X4 be independent random variables each with variance σ² and let Xi+1 = aXi + b for i = 1,2,3. Find cov(X).

8. Suppose Y and Z are independent kxt and txm random matrices such that E(Y) and E(Z) both exist. Show that E(YZ) = E(Y)E(Z).


9. Suppose U and Y are kx1 random vectors and Z is an mx1 random vector and that cov(U,Z) and cov(Y,Z) both exist. Show that cov(αU + Y + c, Z) = αcov(U,Z) + cov(Y,Z) for all real numbers α and all vectors c ∈ Rk.

10. (Seely (1989)) Answer the following questions:

(a) What is the relationship between cov(Y,Z) and cov(Z,Y)?
(b) What is cov(Y,Z) when Y and Z are independent?
(c) What is cov(X) in terms of Y and Z when X = (Y',Z')'?
(d) In (1.3.9) an expression for cov(αY + Z + c) was given when Y and Z are independent. How should the expression be altered if the independence assumption is removed?
(e) Give two expressions for cov(Y,Z) in terms of the expectation operator E(.).

Properties of covariance matrices.

11. Show that an nx1 random vector Y has a degenerate distribution, i.e., Pr(Y = μ) = 1 for some μ ∈ Rn, if and only if cov(Y) = 0nn.

12. (Seely (1989)) Suppose Y is an nx1 nondegenerate random vector such that V = cov(Y) exists. Let μ = E(Y) and let N = In – P where P is the orthogonal projection on R(V). Verify the following assertions:

(a) Pr[N(Y - μ) = 0] = 1.
(b) Pr(Y = BZ + μ) = 1 where Z = (B'B)^{-1}B'(Y - μ) and B is any nxr matrix such that r = r(V) and V = BB'.
(c) The random vector Z in (b) has zero mean and cov(Z) = Ir.

13. Suppose Z is an nx1 random vector such that V = cov(Z) exists and such that E(Z) = 0n. Verify the following statements:

(a) Pr(Z ∈ R(V)) = 1.
(b) If M is any subspace of Rn for which Pr(Z ∈ M) = 1, then R(V) ⊂ M.


(c) The smallest subspace M of Rn for which Pr(Z ∈ M) = 1 is the subspace R(V).

14. Generalize problem 13 above to a nonzero mean vector.

15. Suppose Y ~ Np(μ, V) and assume A and B are mxp and nxp matrices, respectively. Show that the random vectors AY and BY are independent if and only if AVB' = 0mn.

16. (Seely (1989)) Determine the mean vector and the covariance matrix for each of the bivariate normal distributions whose density functions are given below:

(a) (2π)^{-1}exp{-(2x² + y² + 2xy – 22x – 14y + 65)/2}.
(b) (2π)^{-1}exp{-(x² + y² + 4x – 6y + 13)/2}.

17. If X1,…,Xn are independently and identically distributed with mean μ and variance σ², find the expectation of Q = (X1 – X2)² + (X2 – X3)² +…+ (Xn-1 – Xn)².

18. If X = (X1,…,Xn)' and the Xi have a common mean μ and cov(X) = V = (vij)nxn where vii = σ² for i = 1,…,n and vij = ρσ² for all i ≠ j, find the expectation of Q = Σ_{i=1}^{n} (Xi - X̄)².

19. Assume Y ~ Np(μ, V) where V > 0. Find the distribution of Y'V^{-1}Y.

20. Let X ~ Nn(μ, V) and let A be a matrix such that A = A² = A'. Find the distribution of (X - μ)'A(X - μ).

21. (Seely (1989)) If Y1,…,Yn is a random sample from Y ~ N1(0,σ²), answer the following questions:

(a) Is Ȳ independent of Σ_{i=2}^{n} (Yi – Yi-1)²? (Justify your answer.)

(b) Let X = AY, U = BY and V = CY where Y = (Y1,…,Yn)' and A, B and C are mxn matrices of rank m. If X and U are independent, and X and V are independent, then is X independent of U + V? (Justify your answer.)


22. Suppose Y ~ Nn(μ, In). Find the joint distribution of W1 = a'Y and W2 = b'Y where a'b = 0.

23. (Seely (1989)) Let (X1,Y1)',…,(Xn,Yn)' be a random sample from U ~ N2(μ,V). Find the joint distribution of W = (1/n)[(X1,Y1)' + … + (Xn,Yn)'].

24. If Y1 and Y2 are random variables such that Z1 = Y1 + Y2 and Z2 = Y1 – Y2 are independent standard normal random variables, find the joint distribution of Y1 and Y2.

25. If Y ~ Nn(μ1n, V) where vii = σ², i = 1,…,n and vij = σ²(1 – ρ) for all i ≠ j and 0 ≤ ρ ≤ 1, is Ȳ independent of Σ_{i=1}^{n} (Yi - Ȳ)²?

26. Suppose Y ~ N3(μ, V) where V = (1 ρ 0 ; ρ 1 ρ ; 0 ρ 1). For what values of ρ are W1 = Y1 + Y2 + Y3 and W2 = Y1 - Y2 – Y3 independent?

CHAPTER 2 ESTIMATION

2.1 Introduction

Throughout this chapter and chapter 5, our interest will be in using an outcome y on a random vector Y to estimate various unknowns when certain assumptions are made about the probability distribution of Y. The present chapter is used to develop linear model estimation theory under special assumptions on the expectation and covariance of Y. In Chapter 5 we consider various relaxations of these special expectation and covariance assumptions. There are a number of books dealing with estimation and testing linear hypotheses in linear models. A few of these books include Christensen (2002), Graybill (1976), Rao (1973), Scheffe (1959), Searle (1971) and Seber (1977).

2.2 The Basic Linear Model

The basic probability model in linear models is that of a random vector Y such that E(Y) and cov(Y) both exist. Unless mentioned otherwise, we will assume that Y is an nx1 random vector. Various assumptions regarding what is known about E(Y) can be made. The two cases we consider in this chapter occur when E(Y) is an element of Ω where Ω is either a subspace or an affine set in Rn. For the most part up through section 2.8, we consider the case where Ω is a subspace in Rn. In section 2.9 we consider the more general case where Ω is an affine set in Rn. In either case, the set Ω is called the expectation space and is often described using a parameterization.

Definition 2.2.1. Suppose Y is a random vector such that E(Y) ∈ Ω where Ω is a known subspace (affine set) in Rn. Then the representation E(Y) = Xβ, β ∈ Θ, is said to be a parameterization for E(Y) provided that X is a known matrix and Θ is a known subspace (affine set) such that Ω = {Xβ : β ∈ Θ}. In any such parameterization, β is called the parameter vector, Θ is called the parameter space and X is called the design matrix.


For any linear model the expectation space Ω is a unique fixed quantity. However, for a parameterization of E(Y), say E(Y) = Xβ, β ∈ Θ, there are infinitely many choices of X and Θ. In any such parameterization, and unless mentioned otherwise, X is assumed to be nxp so that β is a px1 vector and Θ is a subspace (affine subset) in Rp.

Comment. From time to time throughout this text, we may be dealing with more than one random vector at a time, i.e., we might be dealing with random vectors denoted such as Y and Z. In such cases, we may use subscripts to help identify such things as the parameter spaces and expectation spaces associated with these different random vectors, i.e., we might use ΘY and ΩY to denote the parameter space and expectation space associated with the random vector Y and use ΘZ and ΩZ to denote the parameter space and the expectation space associated with the random vector Z. At other times, we may use a subscript on an object such as a parameter space or expectation space to emphasize that the object is associated with a particular random vector.

Typically when using linear model theory, a particular random phenomenon is modelled using a parameterization of the form E(Y) = Xβ, β ∈ Θ, that occurs naturally. Thus, from a practical viewpoint the expectation space is defined implicitly through a given parameterization. It should always be remembered, however, that the expectation space Ω is the basic ingredient of the linear model and that it can be parameterized in many ways. In fact, as we shall see, in analyzing a given set of data it is sometimes advantageous to consider several different parameterizations.

As noted above, the set Ω is usually defined implicitly by a parameterization E(Y) = Xβ, β ∈ Θ, and it is often the case that interest is in making inferences about the components of β or some linear functions of the components of β. Because of this, the primary emphasis in the following chapter is on a fixed but arbitrary parameterization for E(Y). For the cases considered in this chapter, in a parameterization of the form E(Y) = Xβ, β ∈ Θ, the set Θ is usually described using one of the following statements:

(1) The parameter vector β is completely unknown.

(2) The parameter vector β satisfies a set of linear homogeneous constraints Δ'β = 0s where Δ' is an sxp matrix.

(3) The parameter vector β satisfies a set of linear nonhomogeneous constraints Δ'β = ξ ≠ 0s.

The first statement is the standard assumption occurring in most of the linear models literature. It means that Θ = Rp. When Θ = Rp and r(X) = p, the corresponding parameterization is called a full or maximal rank parameterization or a full or maximal rank model. In the second statement, the parameter space Θ is the null space of Δ' and is a subspace of Rp. Notice that in this statement if we take Δ' = 0sp, then Θ = Rp. Statement (3) is the most general and implies that Θ and Ω are affine sets in Rp and Rn, respectively. Statement (3) also contains the other statements as special cases, i.e., statement (2) is obtained from (3) by taking ξ = 0s and statement (1) is obtained by taking Δ = 0ps and ξ = 0s.

Example 2.2.2. (The one-way additive model) For Yij, i = 1,2,3, j = 1,2, consider the following parameterizations for E(Y).

Parameterization 1: E(Yij) = ρi, i = 1,2,3, j = 1,2. In this model, E(Y) = X1β1 where

X1 = (1 0 0 ; 1 0 0 ; 0 1 0 ; 0 1 0 ; 0 0 1 ; 0 0 1),

β1 = (ρ1,ρ2,ρ3)', Ω1 = R(X1) and Θ1 = R3.

Parameterization 2: E(Yij) = μ + αi, i = 1,2,3, j = 1,2. In this model, E(Y) = X2β2 where

X2 = (1 1 0 0 ; 1 1 0 0 ; 1 0 1 0 ; 1 0 1 0 ; 1 0 0 1 ; 1 0 0 1),

β2 = (μ,α1,α2,α3)', Ω2 = R(X2) and Θ2 = R4.

Parameterization 3: E(Yij) = μc + αic, i = 1,2,3, j = 1,2 and Σ_{i=1}^{3} αic = 0. In this model, E(Y) = X3β3 and Δ'β3 = 0 where X3 = X2, β3 = (μc,α1c,α2c,α3c)', Δ' = (0,1,1,1), Ω3 = {X3β3 : Δ'β3 = 0} and Θ3 = N(Δ').


It is straightforward to show that Ω1 = Ω2 = Ω3, hence that parameterizations 1, 2, and 3 all describe the same expectation space for Y.

Suppose we let Y be a random vector and let E(Y) = Xβ, β ∈ Θ, be a parameterization for E(Y). Let Λ be a pxs matrix and let c be an sx1 real vector. If we consider a transformation from Θ to Rs defined by Λ(β) + c = Λ'β + c for all β ∈ Θ, then we call Λ'β + c a parametric vector. Our primary estimation problem with regard to linear models can be stated as follows: For a given pxs matrix Λ and an sx1 vector c, how should an nxs coefficient matrix T and an sx1 vector d be selected so that the random vector T'Y + d is in some sense a good estimator for the parametric vector Λ'β + c? To help answer this question, we shall later in this chapter employ the standard statistical criteria of unbiasedness and smallest variance to find "good estimators". However, now we give some additional examples to illustrate the types of models and parameterizations to which linear model techniques can be applied.

Example 2.2.3. (Two-way additive model with no constraints) Consider a collection of random variables {Yij} such that E(Yij) = μ + τi + δj for i = 1,2, j = 1,2,3 and μ, the τi's and the δj's denote unknown parameters. Let Y = (Y11,Y12,Y13,Y21,Y22,Y23)'. Then for this model, E(Y) = Xβ, β ∈ Θ, where

X = (1 1 0 1 0 0 ; 1 1 0 0 1 0 ; 1 1 0 0 0 1 ; 1 0 1 1 0 0 ; 1 0 1 0 1 0 ; 1 0 1 0 0 1),

β = (μ,τ1,τ2,δ1,δ2,δ3)', Ω = R(X) and Θ = R6, i.e., β is completely unknown. Note that Ω = R(X) is implicit under this parameterization.

Example 2.2.4. (Two-way additive model with constraints) Consider the same situation as in Example 2.2.3 except suppose for all i,j that E(Yij) = μc + τic + δjc where the parameters are known to satisfy the conditions τ1c + τ2c = 0 and δ1c + δ2c + δ3c = 0. Then the assumptions regarding E(Y) can be

29

stated as E(YͿсyɴc͕ȴ͛ɴc= 02, where X is the same as in example 2.2.3, ɴc is defined in the obvious way and ȴ͛сቀ

0 1 0 0

1 0 0 1

0 0 ቁ. 1 1

hŶĚĞƌƚŚŝƐƉĂƌĂŵĞƚĞƌŝnjĂƚŝŽŶ͕ɏс΂yɴc͗ȴ͛ɴc = 02} and ੓ сE;ȴ͛Ϳ͘/ƚŝƐĞĂƐLJƚŽ show that the parameterizations given in this example and in example 2.2.3 both describe the same expectation space. Example 2.2.5. (piecewise linear regression model) (Seely (1989)) Suppose the response function y(x), x ൒ 0, for some process is known to be linear in ƚŚĞ ŝŶƚĞƌǀĂůƐ  с ΀Ϭ͕ϭϬ΁ ĂŶĚ с;ϭϬ͕ьͿ͘ ƐƐƵŵĞ ƚŚĂƚ ƚŚĞ ůŝŶĞƐ ŝŶ ƚŚĞ segments A and B may possibly be different, but at x = 10 the two lines join. That is, for some unknown parameƚĞƌƐɲ1͕ɲ2 ͕ɷ1 ĂŶĚɷ2 we have z;džͿсɲ1 нɷ1x if x ੣ A сɲ2 нɷ2x if x ੣ B ĂŶĚɲ1 нϭϬɷ1 сɲ2 нϭϬɷ2. Now suppose observations y1,…,y5 are available on the process at xi levels 0,2,10,12 and 15, respectively. Assume that the yi may be regarded as an outcome on the random variable Yi having expectation E(Yi) =y(xi). Define Y = (Y1,Y2,Y3,Y4,Y5)’ and 1 0 0 0 1 2 0 0‫ۊ‬ ‫ۇ‬ X = ‫ۈ‬1 10 0 0 ‫͕ۋ‬ɴс;ɲ1͕ɷ1͕ɲ2͕ɷ2Ϳ͛ĂŶĚȴ͛с;ϭ͕ϭϬ͕-1,-10). 0 0 1 12 ‫ ۉ‬0 0 1 15‫ی‬ Then y = (y1,y2,y3,y4,y5)’ can be regarded as an outcome on a random vector Y whose expectation has the linear model structure E(YͿсyɴ͕ȴ͛ɴсϬ͘/Ŷ this set up, ੓ сE;ȴ͛ͿĂŶĚɏс΂yɴ͗ȴ͛ɴсϬ΃͘ The condition we assume on the expectation of a random vector Y to have a linear model structure is that E(Y) belongs to a known subspace (affine set) of Rn, or in parameterized form, that E(YͿсyɴ͕ɴ੣ ੓, where ੓ is a known subspace (affine set) in Rp. It is also possible to develop much of linear model theory under the more general asƐƵŵƉƚŝŽŶ ŽĨ ɏ ĂŶĚ ੓ being arbitrary subsets of Rn and Rp, respectively, but this discussion is

Chapter 2

30

beyond the scope of this text. We now give an example where the expectation and parameter spaces are affine sets instead of subspaces. Example 2.2.6. (Seely (1989)) Suppose one has available observations {yij} ŽŶƚŚĞĂŶŐůĞƐɴ1 ͕ɴ2͕ɴ3 of a triangle where yij represents the jth observation (j = 1,2) on the ith angle (i = 1,2,3) and the problem is to estimate the true values of the angles. Let Y = (Y11,Y12,Y21,Y22,Y31,Y32)’ and 1 1 ‫ۇ‬ 0 X=‫ۈ‬ ‫ۈ‬0 0 ‫ۉ‬0

0 0 1 1 0 0

0 0 ‫ۊ‬ 0 ͕ɴс;ɴ ͕ɴ ͕ɴ Ϳ͛ĂŶĚȴс;ϭ͕ϭ͕ϭͿ͛͘ 1 2 3 ‫ۋ‬ 0‫ۋ‬ 1 1‫ی‬

Then one possible model to describe the above situation is that the yij ‘s are outcomes on a random vector Y where E(YͿ с yɴ͕ ɴ ੣ ੓ with ੓ = ΂ɴ͗ȴ͛ɴсϭϴϬ΃с Ⱦത н E;ȴ͛Ϳ ĂŶĚ Ⱦത ŝƐ ĂŶLJ ƐŽůƵƚŝŽŶ ƚŽ ȴ͛ɴ с ϭϴϬ͘ hŶĚĞƌ ƚŚŝƐ parameterization ੓ ĂŶĚɏс΂yɴ͗ɴ੣ ੓} are both affine sets. Comment. Perhaps a more realistic model for the experimental situation given in example 2.2.6 might be to have ੓ с΂ɴ͗ɴ1͕ɴ2͕ɴ3 хϬ͕ȴ͛ɴсϭϴϬ΃ĂŶĚ ɏс΂yɴ͗ɴ੣੓}. In this latter case neither ੓ ŶŽƌɏĂƌĞĂĨĨŝŶĞƐĞƚƐ͘dŽĚĞĂůǁŝƚŚ this situation requires methods similar to those developed in section 2.9 for the affine set case but which we will not pursue here. Up to the present we have been discussing assumptions on the expectation of the random vector Y. In order to answer the basic estimation problem posed earlier, it is necessary to further assume that the covariance matrix of Y exists and has a particular structure. It is convenient to distinguish three categories describing the covariance structure of Y: (1) cov(YͿсʍ2In ǁŚĞƌĞʍ2 is an unknown positive real number. (2) cov(YͿсʍ2sǁŚĞƌĞʍ2 is an unknown positive real number and V is a known positive definite matrix. (3) cov(Y) = є͕є੣ V, where V is a known collection of positive semi-definite matrices.

Estimation

31

Linear model estimation theory is primarily developed in this chapter under the second category and chapter 5 is devoted to the third category. Several general comments with regard to these three categories are easy to see: (1) The three categories are not mutually exclusive; in fact, each is a special case of the succeeding one. (2) The assumption cov(YͿ с ʍ2In is the one most commonly used in the linear model literature. (3) The second category allows a wider class of models than does the first with essentially no loss in conclusions that can be drawn. (4) The assumption in the last category is so general that almost no exact small sample results can be obtained without further assumptions on the structure of the set V. An example is now given to illustrate category 3. Example 2.2.7. (Two variance component model) Suppose {Yij} is a collection of random variables such that each Yij may be expressed as Yij = ɴi + bj + eij͕ŝ͕ũсϭ͕Ϯ͕ǁŚĞƌĞɴ1 ĂŶĚɴ2 are unknown constants and where the bi’s and eij ‘s are mutually independent random variables having zero ŵĞĂŶƐĂŶĚǀĂƌŝĂŶĐĞƐʍb2 ĂŶĚʍ2, respectively. Rewrite the model in matrix form as Y сyɴнb+e where 1 X = ቌ1 0 0

0 1 0ቍ and B = ቌ0 1 1 0 1

0 1ቍ 0 1

and Y͕ɴ͕b and e are defined in the obvious fashion. Then following the same procedure as in example 1.3.10, Y is seen to be a random vector such that E(YͿсyɴ͕ɴ੣ R2, and cov(Y) = ʍ2I4 нʍb2s͕ʍ2 > 0, ʍb2 ൒ 0, where V = BB’. Notice that E(Y) has a linear model structuƌĞǁŝƚŚɏсZ;yͿ and covariance structure cov(YͿсє͕є੣ V, where V = {ʍ2I4 + ʍb2V, ʍ2 > 0, ʍb2ш 0}.
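A small numerical sketch of example 2.2.7 may help: it builds X, B and one member of the class V of covariance matrices. The values chosen for σ2 and σb2 are arbitrary illustrative numbers, not part of the example.

```python
# A minimal sketch of example 2.2.7: one member of the class V of covariance
# matrices (the values of sigma2 and sigma_b2 below are made up).
import numpy as np

X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)   # fixed effects beta1, beta2
B = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)   # random effects b1, b2
V_rand = B @ B.T                                              # the matrix V = BB'

sigma2, sigma_b2 = 2.0, 0.5                                   # assumed illustrative values
cov_Y = sigma2 * np.eye(4) + sigma_b2 * V_rand                # cov(Y) = sigma2*I4 + sigma_b2*BB'
print(cov_Y)
```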


Suppose Y is a random vector such that E(Y) = yɴ͕ ɴ ੣ ੓, is a parameterization for E(Y) and cov(YͿсʍ2V. In such a parameterization, we assume that X and ੓ are known so that E(Y) has a linear model structure. tĞ ĂůƐŽ ĂƐƐƵŵĞ͕ ƵŶůĞƐƐ ŵĞŶƚŝŽŶĞĚ ŽƚŚĞƌǁŝƐĞ͕ ƚŚĂƚ ʍ2 is a positive unknown parameter. For parameterizations considered here, this amounts to stating tŚĂƚƚŚĞƉĂƌĂŵĞƚĞƌƐƉĂĐĞŽĨ;ɴ͕ʍ2) is ੓x(0,λ). As a final observation in this section, we note that in many standard text books and journal articles the model we have been discussing for Y is described as Y с yɴ н e ǁŚĞƌĞyĂŶĚɴŚĂǀĞƚŚĞƐĂŵĞŝŶƚĞƌƉƌĞƚĂƚŝon as above and where e is a random vector. In a description such as this the distribution of Y ŝƐĚĞƚĞƌŵŝŶĞĚĨƌŽŵƚŚĞǀĞĐƚŽƌyɴĂŶĚƚŚĞĚŝƐƚƌŝďƵƚŝŽŶĂů assumptions on the vector e. The expectation of the vector e in a model Yсyɴнe is always assumed to exist and to be equal to the zero vector so that E(YͿ с yɴ ĂƐ ǁŝƚŚ ƚŚĞ ŵŽĚĞů ǁĞ ŚĂǀĞ ďĞĞŶ ĚŝƐĐƵƐƐŝŶŐ͘ KƚŚĞƌ assumptions are generally made concerning the random vector e and thus assumptions concerning the vector Y may be determined, e.g., cov(eͿсʍ2In Žƌ ʍ2V implies cov(YͿ с ʍ2In Žƌ ʍ2V. Occasionally we shall use a model description of the form Y сyɴнe and will assume the same interpretation as just mentioned: but for most purposes we shall simply indicate our model and assumptions directly through Y. The random vector e in a model of the form Y сyɴнe is usually referred to as the random error vector.

2.3 Preliminary Notions In the previous section we described the various assumptions on the expectation vector and covariance matrix of a random vector Y that are made in linear models. It was also indicated that the primary estimation goal in linear models is the selection of a matrix T and a vector d which insures that T’Y + d is a good estimator for a component or a linear function ȿ͛ɴ н Đ ŽĨ ƚŚĞ ƉĂƌĂŵĞƚĞƌ ǀĞĐƚŽƌ ɴ͘ dŚĞ ƉƌĞƐĞŶƚ ƐĞĐƚŝŽŶ ŝƐ ŝŶƚĞŶĚĞĚ ƚŽ acquaint the reader with some additional fundamental ideas in linear models. To this end, suppose Y is a random vector with cov(YͿсʍ2V and that E(YͿсyɴ͕ɴ੣ ੓, is a parameterization for E(Y) where ੓ is a subspace (affine set) in Rp. The notion of a linear parametric vector as described in the previous ƐĞĐƚŝŽŶŝƐĐĞŶƚƌĂůƚŽůŝŶĞĂƌŵŽĚĞůƐ͘/ĨȿŝƐĂƉdžƐŵĂƚƌŝdžĂŶĚĐŝƐĂŶƐdžϭǀĞĐƚŽƌ͕ ƚŚĞŶ ǁĞ ƐŚĂůů ƵƐĞ ȿ͛ɴ н Đ ƚŽ ĚĞŶŽƚĞ ĂŶ Ɛdž1 linear parametric vector. As


ŵĞŶƚŝŽŶĞĚƉƌĞǀŝŽƵƐůLJ͕ȿ͛ɴнĐĐĂŶďĞƚŚŽƵŐŚƚŽĨĂƐĂƚƌĂŶƐĨŽƌŵĂƚŝŽŶĨƌŽŵ ੓ to Rs. In general, we shall refer to a linear parametric vector as simply a parametric vector. When a parametric vector is one dimensional we often use the term parametric function instead of parametric vector and ƚLJƉŝĐĂůůLJĚĞŶŽƚĞƐƵĐŚĂĨƵŶĐƚŝŽŶďLJʄ͛ɴнĐǁŚĞƌĞʄ੣ Rp is a vector and c੣R1. For many purposes in linear models it is sufficient to study parametric vectors via parametric functions because an sx1 parametric vector is simply a vector composed of s parametric functions. For this reason some results in later sections are formulated in terms of parametric functions along with a brief mention concerning the generalization to parametric vectors. Another thing to keep in mind throughout the sequel is that we ƐŚĂůůƐŽŵĞƚŝŵĞƐƵƐĞŶŽƚĂƚŝŽŶƐƵĐŚĂƐȿ͛ɴнĐƚŽĚĞŶŽƚĞĂƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌ and other times use the same notation to denote the value of the ƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌĂƚĂƉĂƌƚŝĐƵůĂƌɴ੣ ੓. This might cause some confusion. However, the correct interpretation will generally be clear from the context in which it is used. In addition, we try to use the terminology ͞ƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌ͟ǁŚĞŶǁĞŵĞĂŶȿ͛ɴнĐŝƐĂŐĞŶĞƌĂůƚƌĂŶƐĨŽƌŵĂƚŝŽŶ͖ĂŶĚ ŝŶĞƐƉĞĐŝĂůůLJĐŽŶĨƵƐŝŶŐƉůĂĐĞƐǁĞǁŝůůĂƚƚĞŵƉƚƚŽƵƐĞŶŽƚĂƚŝŽŶƐůŝŬĞȿ͛Ⱦഥ +c, ȿ͛ɴ0 + c, etc., when we mean a particular value of the parametric vector. Similarly, notation like ʄ͛ɴнĐǁŚĞƌĞʄ੣ Rp serves the dual role as the name of a parametric function and the value of the parametric function at a ƉĂƌƚŝĐƵůĂƌƉŽŝŶƚɴ੣ ੓. A second notion fundamental to linear models is that of a linear estimator. If A is an nxs matrix and d is an sx1 vector, then any estimator of the form A’Y + d is called a linear estimator because its components are linear combinations of the random observations that make up Y. When it is important, we employ additional terminology such as A’Y + d is an sdimensional linear estimator to mean that A’Y + d is sx1 or equivalently that A is nxs and d is sx1. One dimensional linear estimators play an important role in the theory of linear models. Such estimators are sometimes referred to as real-valued linear estimators and sometimes we use notation like a’Y + d where a ੣ Rn and d ੣ R1 to specifically state that a linear estimator is one-dimensional. As mentioned previously, our goal as far as estimation is concerned is to find linear estimators of the form A’Y + d which are in some sense good ĞƐƚŝŵĂƚŽƌƐĨŽƌĂŐŝǀĞŶƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌȿ͛ɴнĐ͘,ŽǁĞǀĞƌ͕ǁĞƐŚŽƵůĚĂůƐŽ observe that in many situations it is unreasonable to expect to estimate all


parametric vectors. We now give an example which would never occur in practice but which illustrates this idea. Example 2.3.1. Suppose Y is a random vector such that cov(Y) = 0nn. Further suppose that E(YͿсyɴ͕ɴ੣ ੓, is a parameterization for E(Y). Then Y is a random vector such that P(Y сyɴͿсϭĨŽƌƐŽŵĞɴ੣ ੓ and suppose we want ƚŽ ĞƐƚŝŵĂƚĞ ƚŚĞ ƉĂƌĂŵĞƚƌŝĐ ǀĞĐƚŽƌ ɴ͘ ,ĞƌĞ ǁĞ ŚĂǀĞ ƚŚĞ ďĞƐƚ ƉŽƐƐŝďůĞ ƐŝƚƵĂƚŝŽŶ ĨŽƌ ĞƐƚŝŵĂƚŝŶŐ ƚŚĞ ƉĂƌĂŵĞƚƌŝĐ ǀĞĐƚŽƌ ɴ ƐŝŶĐĞ Y с yɴ ǁŝƚŚ probabiliƚLJŽŶĞĨŽƌƐŽŵĞɴ੣ ੓. Thus for an outcome y of Y, we would select Ⱦ෠ ੣ ੓ such that y = XȾ෠ and if the determination of Ⱦ෠ ŝƐƵŶŝƋƵĞ͕ǁĞŚĂǀĞɴ exactly. However, if the specification of Ⱦ෠ is not unique such as when ੓ =Rp and r(X) < p, then it should be clear that there is no possibility for ĚĞƚĞƌŵŝŶŝŶŐ ɴ ĂŶĚ ŚĞŶĐĞ ƚŚĂƚ ŝƚ ŝƐ ƵŶƌĞĂƐŽŶĂďůĞ ƚŽ ĚŽ ƐŽ͘ dŚĞ ƉƌŽďůĞŵ illustrated in this example is that there is not necessarily a one to one ĐŽƌƌĞƐƉŽŶĚĞŶĐĞ ďĞƚǁĞĞŶ ǀĞĐƚŽƌƐ ɴ ੣ ੓ ĂŶĚ ƚŚĞ ĞdžƉĞĐƚĂƚŝŽŶ yɴ ŽĨ ƚŚĞ random vector Y, ŝ͘Ğ͕͘ƐĞǀĞƌĂůɴǀĞĐƚŽƌƐŵĂLJůĞĂĚƚŽƚŚĞƐĂŵĞĞdžƉĞĐƚĂƚŝŽŶ ǀĞĐƚŽƌ yɴ ĂŶĚ ƚŚĞƌĞ ŝƐ ƐŝŵƉůLJ ŶŽ ǁĂLJ ƚŽ ĚŝƐƚŝŶŐƵŝƐŚ ǁŚŝĐŚ ɴ ŝƐ ƚŚĞ ĂƉƉƌŽƉƌŝĂƚĞŽŶĞĞǀĞŶŝŶƚŚĞĞǀĞŶƚƚŚĂƚǁĞŬŶŽǁĞdžĂĐƚůLJƚŚĞyɴǀĞĐƚŽƌĂƐ in the case when cov(Y) = 0nn. To find a good estimator ĨŽƌĂƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌȿ͛ɴнĐ͕ǁĞƐŚĂůůĚƌĂǁ upon the standard statistical criteria of unbiasedness and smallest variance. In the following section, we identify those parametric vectors of ƚŚĞĨŽƌŵȿ͛ɴнĐĨŽƌǁŚŝĐŚůŝŶĞĂƌƵŶďŝĂƐĞĚĞƐƚŝŵĂƚŽƌƐ͛Y + d exist.

2.4 Identifiability and Estimability of Parametric Vectors From this section up through section 2.8, linear model estimation theory will be developed under the assumption that ё is a subspace. We will consider models having expectation spaces which are affine sets in section 2.9. We shall assume from this point on up through section 2.8 that ੓ с΂ɴ͗ȴ͛ɴсϬs} and use these characterizations of the parameter space interchangeably. For example, sometimes we may describe the parameterization for E(Y) as yɴ͕ɴ੣ ੓͕ĂŶĚŽƚŚĞƌƚŝŵĞƐĂƐyɴ͕ȴ͛ɴсϬs. We also note that ੓ сE;ȴ͛ͿĂŶĚǁĞǁŝůůƵƐĞƚŚŝƐƌĞůĂƚŝŽŶƐŚŝƉďĞƚǁĞĞŶ੓ and N(ȴ’) whenever it is convenient. So suppose Y is a random vector that has cov(YͿсʍ2sĂŶĚɏс΂yɴ͗ɴ੣੓} where ੓ is a subspace in Rp is a parameterization for E(Y). In this section


ǁĞ ŝŶǀĞƐƚŝŐĂƚĞ ǁŚĂƚ ƉĂƌĂŵĞƚƌŝĐ ǀĞĐƚŽƌƐ ŽĨ ƚŚĞ ĨŽƌŵ ȿ͛ɴ н Đ ĐĂŶ ďĞ estimated unbiasedly by a linear estimator of the form A’Y + d. Definition 2.4.1. A linear estimator A’Y + d is an unbiased linear estimator ĨŽƌĂƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌȿ͛ɴнĐŝĨĂŶĚŽŶůLJŝĨ;͛Y + d) = A’XȾത нĚсȿ͛Ⱦത + c for all Ⱦത੣੓. With regard to Definition 2.4.1, we note that if A’Y + d is an arbitrary linear estimator, then A’Y + d is unbiased for E(A’Y + d) = A’E(Y) + d = A’XȾതнĚсȿ͛Ⱦഥ +d for all Ⱦത ੣ ੓ ǁŚĞƌĞȿсy͛͘tĞĐĂůůȿ͛ɴнĚƚŚĞƉĂƌĂŵĞƚƌŝĐ vector induced by the expectation of A’Y + d. Thus any linear estimator is unbiased for its induced expectation. We shall from time to time refer to E(A’Y + d) as a parametric vector with the understanding that we are referring to E(A’Y нĚͿсȿ͛ɴнĚ͘dŚŝƐƌĂƚŚĞƌĐĂƐƵĂůƵƐĞŽĨƚŚĞĞdžƉĞĐƚĂƚŝŽŶ operator can initially be confusing in the sense that E(A’Y + d) can mean a parametric vector or it can mean the value of the expectation at a given ƉĂƌĂŵĞƚĞƌƉŽŝŶƚɴ੣ ੓͘ƐǁŝƚŚƚŚĞĚƵĂůƵƐĂŐĞŽĨƚŚĞŶŽƚĂƚŝŽŶȿ͛ɴнĚ͕ƚŚĞ correct interpretation of E(A’Y + d) will generally be clear from the context. In addition, we will attempt to avoid confusion by saying parametric vector E(A’Y + d) or by saying the value of E(A’Y нĚͿĂƚɴ੣ ੓. So any linear estimator A’Y + d is unbiased for the parametric vector induced by its expectation. Unfortunately, there does not always exist a ůŝŶĞĂƌƵŶďŝĂƐĞĚĞƐƚŝŵĂƚŽƌĨŽƌĂƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌŽĨƚŚĞĨŽƌŵȿ͛ɴнĐ͘/Ŷ the following example, we illustrate a situation in which a parametric vector cannot be estimated unbiasedly. In the same example, we also demonstrate an important property ƚŚĂƚȿ͛ɴ н Đ ŵƵƐƚŚĂǀĞ ŝŶ ŽƌĚĞƌ ĨŽƌ ƚŚĞƌĞƚŽĞdžŝƐƚĂůŝŶĞĂƌƵŶďŝĂƐĞĚĞƐƚŝŵĂƚŽƌŽĨȿ͛ɴнĐ͘ Example 2.4.2. Consider a parameterization of the form E(YͿсyɴ͕ǁŚĞƌĞy ŝƐŶdžƉ͕ƌ;yͿфƉĂŶĚɴ੣ ੓ = Rp͘EŽǁĐŽŶƐŝĚĞƌɴ1͕ɴ2 ੣ Rp ƐƵĐŚƚŚĂƚyɴ1 сyɴ2 ĂŶĚůĞƚȿ͛ɴн c be a k-ĚŝŵĞŶƐŝŽŶĂůƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌƐƵĐŚƚŚĂƚȿ͛ɴ1+ c ് ȿ͛ɴ2 + c. Then if A’Y+d ŝƐ Ă ůŝŶĞĂƌ ĞƐƚŝŵĂƚŽƌ ĨŽƌ ȿ͛ɴ н Đ͕ ĨŽƌ ɴ1͕ɴ2 ੣ ੓, E(A’YнĚͿс͛yɴ1 нĚс͛yɴ2нĚďƵƚȿ͛ɴ1 + c ് ȿ͛ɴ2 + c. Hence we see that A’Y нĚĚŽĞƐŶŽƚƐĂƚŝƐĨLJĚĞĨŝŶŝƚŝŽŶϮ͘ϰ͘ϭĂŶĚĐĂŶŶŽƚďĞƵŶďŝĂƐĞĚĨŽƌȿ͛ɴнĐ͘ This example illustrates an important property that a parametric vector of ƚŚĞĨŽƌŵȿ͛ɴнĐŵƵƐƚŚĂǀĞŝŶŽƌĚĞƌĨŽƌƚŚĞƌĞƚŽĞdžŝƐƚĂůŝŶĞĂƌƵŶďŝĂƐĞĚ ĞƐƚŝŵĂƚŽƌĨŽƌȿ͛ɴнĐ͕ŝ͘Ğ͕͘ȿ͛ɴн ĐŵƵƐƚďĞĐŽŶƐƚĂŶƚĂĐƌŽƐƐĂůůɴ੣ ੓ that lead ƚŽƚŚĞƐĂŵĞyɴ੣ ɏ͘/ĨĂƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌƉŽƐƐĞƐƐĞƐƚŚŝƐůĂƐƚƉƌŽƉĞƌƚLJ͕ŝƚŝƐ said to be identifiable.


Definition 2.4.3. Consider a parameterization for E(YͿŽĨƚŚĞĨŽƌŵ;zͿсyɴ͕ ɴ੣੓. A parametric vector ȿ͛ɴнĐŝƐƐĂŝĚƚŽďĞŝĚĞŶƚŝĨŝĂďůĞ if and only if ǁŚĞŶĞǀĞƌɴ1͕ɴ2੣੓ ĂƌĞƐƵĐŚƚŚĂƚyɴ1 сyɴ2͕ƚŚĞŶȿ͛ɴ1 + c сȿ͛ɴ2 + c. &ŽƌĂƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌȿ͛ɴнĐǁŚĞƌĞȿс;ʄ1͕͙͕ʄk) and c’ = (c1,…,ck), it ŝƐ ĞĂƐLJ ƚŽ ƐĞĞ ƚŚĂƚ ȿ͛ɴ н Đ ŝƐ ŝĚĞŶƚŝĨŝĂďůĞ ŝĨ ĂŶĚ ŽŶůLJ ŝĨ ŝƚƐ ĐŽŽƌĚŝŶĂƚĞ ƉĂƌĂŵĞƚƌŝĐĨƵŶĐƚŝŽŶƐʄ1͛ɴнĐ1͕͙͕ʄk͛ɴнĐk are identifiable. Thus, it suffices to limit attention on identifiability to parametric functions. Also, notice ƚŚĂƚ ŝĨ ƚŚĞ ƉĂƌĂŵĞƚƌŝĐ ǀĞĐƚŽƌ ɴ ŝƚƐĞůĨ ŝƐ ŝĚĞŶƚŝĨŝĂďůĞ͕ ƚŚĞŶ Ăůů ƉĂƌĂŵĞƚƌŝĐ vectors are identifiable. Example 2.4.4. (Seely (1989)) Suppose that ੓ с΂ɴ͗ɴ1 нɴ2 нɴ3 = 0}, that ŶсϮ͕ĂŶĚĨŽƌɴ੣ ੓, that E(YiͿсɴi, i = 1,2. It is easy to see that the parametric ĨƵŶĐƚŝŽŶƐɴ1 ĂŶĚɴ2 are identifiable, i.e., the condition XȾത = XȾധ alone implies Ⱦതi = Ⱦധi, i =1,2. It is also true that the parametric functŝŽŶɴ3 is identifiable. To see this, let Ⱦത,Ⱦധ੣੓ be such that XȾത = XȾധ. Then Ⱦത3 = -Ⱦത1 -Ⱦത2= -Ⱦധ1 - Ⱦധ2 = Ⱦധ3 ƐŚŽǁƐƚŚĂƚɴ3 ŝƐŝĚĞŶƚŝĨŝĂďůĞ͘,ĞŶĐĞƚŚĞƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌɴŝƐŝĚĞŶƚŝĨŝĂďůĞ which implies that all parametric vectors are identifiable. Example 2.4.5. (Two-way additive model) (Seely (1989)) Consider again example 2.2.3. For this example (and in general) any parametric function ʄ͛ɴнĐƚŚĂƚĐĂŶďĞĞdžƉƌĞƐƐĞĚŝŶƚŚĞĨŽƌŵʄ͛ɴнĐсĂ͛yɴнĐĨŽƌƐŽŵĞǀĞĐƚŽƌĂ is clearly identifiable. That is, any parametric function which can be ĞdžƉƌĞƐƐĞĚĂƐĂůŝŶĞĂƌĐŽŵďŝŶĂƚŝŽŶŽĨĞdžƉĞĐƚĂƚŝŽŶƐђнʏiнɷj is identifiable. For example, ʏ1 - ʏ2с;ђнʏ1 нɷ1) – ;ђнʏ2 нɷ1) is an identifiable parametric function. Unlike the previous example, however, it can be shown that none of the coordinate functions are identifiable. To see this for the parametric function μ, let Ⱦത = 06 and let Ⱦന =(1,-1,-1,0,0,0). Then XȾത = XȾധ = 06, but μത = 0 ് μധ = 1 so that μ is not identifiable. /ŶĞdžĂŵƉůĞϮ͘ϰ͘ϱ͕ŝƚǁĂƐƐƚĂƚĞĚƚŚĂƚĂŶLJƉĂƌĂŵĞƚƌŝĐĨƵŶĐƚŝŽŶʄ͛ɴнĐƚŚĂƚ can be expressed as a linear combination of the components of E(Y) is identifiable. Another way of stating this condition is that there exists a ůŝŶĞĂƌƵŶďŝĂƐĞĚĞƐƚŝŵĂƚŽƌĨŽƌʄ͛ɴнĐ͘tĞŶŽǁŐŝǀĞŽŶĞĐŚĂƌĂĐƚĞƌŝnjĂƚŝŽŶŽĨ parametric vectors that can be estimated unbiasedly.


Lemma 2.4.6. Consider the parameterization E(Y) = Xβ, Δ'β = 0s, and suppose Λ'β + c is a kx1 parametric vector. Then Λ'β + c can be unbiasedly estimated if and only if Λ = X'A + ΔL for some matrices A and L, in which case A'Y + d with d = c is an unbiased linear estimator for Λ'β + c.

Proof. Assume Λ = X'A + ΔL for some matrices A and L and c = d. Then for all β ∈ Θ,

Λ'β + c = A'Xβ + L'Δ'β + c = A'Xβ + c = A'E(Y) + c = E(A'Y + c) = E(A'Y + d).

Hence A'Y + d is unbiased for Λ'β + c. Conversely, suppose A'Y + d is unbiased for Λ'β + c. Then A'Xβ + d = Λ'β + c for all β ∈ Θ. Because Θ is a subspace, 0p ∈ Θ, hence A'X0p + d = Λ'0p + c and c = d. Now, since c = d, it follows that (A'X - Λ')β = 0k for all β ∈ N(Δ'), hence that R(X'A – Λ) ⊆ N(Δ')⊥ = R(Δ) and Λ = X'A + ΔL for some sxk matrix L.

Comment. The reader should note in lemma 2.4.6 that because c = d, A'Y is unbiased for Λ'β. Thus, in models with homogeneous constraints or no constraints on the parameter vector, when looking for parametric vectors that can be estimated unbiasedly we can essentially restrict our attention to linear parametric vectors of the form Λ'β. Accordingly, from this point on up through section 2.8, we will only be considering parametric vectors of the form Λ'β. Also, observe that in the case that Δ = 0ps, there exists an unbiased estimator for Λ'β if and only if R(Λ) ⊆ R(X').

Lemma 2.4.6 gives one characterization of parametric vectors that can be estimated unbiasedly, and example 2.4.2 implies that the only parametric vectors that one can even attempt to estimate unbiasedly are those that are identifiable. We now give a result which ties together the concepts of identifiability and unbiased estimation and provides some additional criteria for determining when a given parametric vector can be unbiasedly estimated.

Theorem 2.4.7. Consider the parameterization E(Y) = Xβ, Δ'β = 0s. For a tx1 parametric vector Λ'β, the following statements are equivalent:

(a) Λ'β is identifiable.

(b) Λ = X'A + ΔL for some nxt matrix A and sxt matrix L, i.e., R(Λ) ⊆ R(X') + R(Δ).


;ĐͿdŚĞƌĞĞdžŝƐƚƐĂůŝŶĞĂƌƵŶďŝĂƐĞĚĞƐƚŝŵĂƚŽƌ͛zĨŽƌȿ͛ɴ͘ Proof. (a) ֜ ;ďͿ͘ ƐƐƵŵĞ ȿ͛ɴ ŝƐ ŝĚĞŶƚŝĨŝĂďůĞ͘ tĞ ǁŝůů ƐŚŽǁ ƚŚĂƚ Z;ȿͿ‫ؿ‬ R;y͛ͿнZ;ȴͿďLJƐŚŽǁŝŶŐƚŚĞĞƋƵŝǀĂůĞŶƚƐƚĂƚĞŵĞŶƚƚŚĂƚ Z;ȿͿᄰ сE;ȿ͛Ϳ‫; ـ‬Z;y͛ͿнZ;ȴͿͿᄰ = N(X)‫ת‬E;ȴ͛Ϳ͘ So let f ੣ N(X) ‫ ת‬E;ȴ͛ͿĂŶĚůĞƚɴ0 ੣ ੓ be fixed and let Ⱦതсɴ0 нĨ͘dŚĞŶȴ͛Ⱦത = ȴ͛ɴ0 нȴ͛ĨсϬs which implies Ⱦത ੣ ੓ = N(ȴ’) and XȾഥ сyɴ0 нyĨсyɴ0 which ŝŵƉůŝĞƐȿ͛ɴ0 сȿ͛Ⱦത ĂŶĚȿ͛;Ⱦത – ɴ0Ϳсȿ͛ĨсϬt, hence f ੣ E;ȿ͛ͿĂƐǁĞǁĞƌĞƚŽ show. (b) ֜ (c). This follows directly from Lemma 2.4.6. (c) ֜ (a). Assume A’Y ŝƐƵŶďŝĂƐĞĚĨŽƌȿ͛ɴ͘dŚĞŶ͛yɴсȿ͛ɴĨŽƌĂůůɴ੣ ੓. Now ŝĨɴ1͕ɴ2 ੣ ੓ ĂƌĞƐƵĐŚƚŚĂƚyɴ1 сyɴ2͕ƚŚĞŶȿ͛ɴ1 с͛yɴ1 с͛yɴ2 сȿ͛ɴ2 which implies Ȧᇱ Ⱦ is identifiable. tŚĞŶ ʄ ŝƐ Ă Ɖdžϭ ǀĞĐƚŽƌ ƚŚe condition in Theorem 2.4.7 (b) can ĞƋƵŝǀĂůĞŶƚůLJďĞƐƚĂƚĞĚĂƐʄ੣ Z;y͕͛ȴͿ͘dŚŝƐŝƐĂŚĂŶĚLJĨĂĐƚƚŽƌĞŵĞŵďĞƌǁŚĞŶ ĚĞĂůŝŶŐǁŝƚŚƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌƐĂŶĚĨƵŶĐƚŝŽŶƐ͘EŽƚŝĐĞĂůƐŽƚŚĂƚZ;y͕͛ȴͿŝƐĂ subspace which implies that the class of identifiable parametric vectors is closed under linear operations. That is, any parametric vector of the form >ɻнɷǁŚĞƌĞ>ŝƐĂŐŝǀĞŶŵĂƚƌŝdžĂŶĚɻĂŶĚɷĂƌĞŝĚĞŶƚŝĨŝĂďůĞƉĂƌĂŵĞƚƌŝĐ vectors is an identifiable parametric vector. We also observe that the condition for IpɴсɴƚŽďĞŝĚĞŶƚŝĨŝĂďůĞĂŶĚŚĞŶĐĞĂůůƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌƐƚŽ be identifiable is that R(Ip) ‫ؿ‬Z;y͕͛ȴͿǁŚŝĐŚŝƐĞƋƵŝǀĂůĞŶƚƚŽƌ;y͕͛ȴͿсƉ͘>ĂƐƚůLJ͕ notice that the theorem implies a parametric vector is identifiable provided that it can be estimated unbiasedly by a linear estimator. We now give a definition which is more consistent within the literature on linear models with regard to identifiable parametric vectors. ĞĨŝŶŝƚŝŽŶϮ͘ϰ͘ϴ͘ƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌȿ͛ɴŝƐƐĂŝĚƚŽďĞĞƐƚŝŵĂďůĞ if and only ŝĨƚŚĞƌĞĞdžŝƐƚƐĂůŝŶĞĂƌƵŶďŝĂƐĞĚĞƐƚŝŵĂƚŽƌĨŽƌȿ͛ɴ͘ ĞĐĂƵƐĞŽĨƚŚĞŽƌĞŵϮ͘ϰ͘ϳ͕ŝƚŝƐĐůĞĂƌƚŚĂƚĨŽƌĂƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌȿ͛ɴ͕ the two concepts of estimability and identifiability are equivalent. Thus all properties stated previously that hold for identifiable parametric vectors also hold for estimable parametric vectors. For example, because the space of identifiable parametric vectors is closed under all linear operations, so


is the space of estimable parametric vectors, and if β is identifiable (estimable), so are all parametric vectors involving β identifiable (estimable). However, because the term estimable is so widely used in the linear model literature, hereafter it is generally used in preference to the term identifiable. The equivalence of the two concepts and Theorem 2.4.7 will, of course, be utilized when convenient.

It is frequently necessary to determine whether a given parametric function, say λ'β, is estimable. While there are a number of ways in which this can be done, one useful method is to determine if there exists a solution b to the equations Hb = λ or, equivalently, if r(H,λ) = r(H) where H = (X',Δ).

In this section the equivalent notions of identifiability and estimability have been introduced. For the sake of consistency in terminology with the linear model literature, we will almost exclusively use the term estimable in preference to identifiable. The main result of this section is theorem 2.4.7, which gives three equivalent statements for a parametric vector to be estimable. One very important consequence of this theorem, as mentioned previously, is that the class of estimable parametric vectors is closed under all linear operations.
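The rank test just described is easy to carry out numerically. The sketch below implements r(H,λ) = r(H) with H = (X',Δ); the particular X used (an unconstrained one-way layout, so Δ is empty) is only a made-up illustration.

```python
# A sketch of the rank test r(H, lambda) = r(H) with H = (X', Delta); the design
# matrix below is a hypothetical one-way layout with an intercept and no constraints.
import numpy as np

X = np.array([[1, 1, 0], [1, 1, 0], [1, 0, 1]], dtype=float)   # hypothetical n x p design
Delta = np.zeros((3, 0))                                        # no constraints here (s = 0)

H = np.hstack([X.T, Delta])                                     # p x (n + s)

def is_estimable(lam, H):
    """lambda' beta is estimable iff appending lambda does not raise the rank of H."""
    return np.linalg.matrix_rank(np.column_stack([H, lam])) == np.linalg.matrix_rank(H)

print(is_estimable(np.array([0.0, 1.0, -1.0]), H))   # a contrast of group effects: True
print(is_estimable(np.array([1.0, 0.0, 0.0]), H))    # the intercept alone: False here
```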

2.5 Best Linear Unbiased Estimation when cov(Y) = ʍ2V In section 2.4 we characterized those parametric vectors that can be estimated unbiasedly. We now consider the problem of finding a best linear unbiased estimator (blue), when they exist, for a given parametric vector. Throughout this section we consider random vectors Y having E(Y)੣ɏǁŚĞƌĞɏŝƐĂŬŶŽǁŶƐƵďƐƉĂĐĞŝŶZn and cov(YͿсʍ2V > 0. We also assume that E(Y) is parameterized as E(YͿсyɴ͕ɴ੣ ੓, where ੓ is a subspace in Rp. Recall that the first criteria imposed for estimating a parametric vector with a linear estimator was that of unbiasedness. In the previous section, ǁĞ ĐŚĂƌĂĐƚĞƌŝnjĞĚ Ăůů ƉĂƌĂŵĞƚƌŝĐ ǀĞĐƚŽƌƐ ȿ͛ɴ ĨŽƌ ǁŚŝĐŚ ůŝŶĞĂƌ ƵŶďŝĂƐĞĚ estimators exist. While the requirement of unbiasedness typically reduces the class of candidate estimators, in general, the class of linear unbiased estimators for a particular parametric vector is still extremely large. To further reduce the class, we consider another requirement on the


estimator. This requirement is that the estimator have the “smallest covariance matrix” among the linear unbiased estimators for a given parametric vector. To do this, it is necessary to at least assume that cov(Y) exists. We begin our discussion with a formal definition. Definition 2.5.1. A linear estimator T’Y is said to be a best linear unbiased estimator (blue) for its expectation if and only if cov(T’Y) ൑ cov(A’Y),i.e., cov(A’Y)– cov(T’Y) ൒ 0, for all linear estimators A’Y that are unbiased for the parametric vector E(T’Y). The reader should note the implication of definition 2.5.1 for the case ǁŚĞŶʄ͛ɴŝƐĂŶĞƐƚŝŵĂďůĞƉĂƌĂŵĞƚƌŝĐĨƵŶĐƚŝŽŶ͘&ŽƌƚŚŝƐĐĂƐĞ͕ĚĞĨŝŶŝƚŝŽŶϮ͘ϱ͘ϭ implies that t’Y ŝƐĂďůƵĞĨŽƌʄ͛ɴŝĨĂŶĚŽŶůLJŝĨŝƚŚĂƐƚŚĞƐŵĂůůĞƐƚǀĂƌŝĂŶĐĞ ĂŵŽŶŐ Ăůů ůŝŶĞĂƌ ƵŶďŝĂƐĞĚ ĞƐƚŝŵĂƚŽƌƐ ĨŽƌ ʄ͛ɴ͕ ŝ͘Ğ͕͘ ƚ͛Y is the minimum variance linear unbiased ĞƐƚŝŵĂƚŽƌĨŽƌʄ͛ɴ͘ The problem now is to find, among all linear unbiased estimators of a given parametric vector, the unbiased estimator having the smallest covariance matrix as defined in definition 2.5.1. To solve this problem, we employ a method that is less used in the literature on linear models, but which is well known in mathematical statistics, i.e., see Lehmann and Scheffe (1950). We begin by characterizing all unbiased estimators for a given parametric vector. While there are several methods of doing this, our approach is to characterize them through the linear unbiased estimators of zero. For the remainder of this text, assuming a parameterization of the form E(YͿсyɴǁŚĞƌĞȴ͛ɴсϬs, we let N с΂yʌ͗ȴ͛ʌсϬs}ᄰ с΂y΀E;ȴ͛Ϳ΁΃ᄰ = ɏᄰ. Lemma 2.5.2. A linear tx1 estimator F’Y is an unbiased estimator of zero if and only R(F) ‫ ؿ‬N Proof. Assume F’Y is unbiased for zero. Then E(F’YͿс&͛yɴсϬt ĨŽƌĂůůɴ੣ ੓ which implies R(F) ᄰ Nᄰ сё and R(F) ‫ ؿ‬N. Conversely, assume R(F) ‫ܰ ؿ‬.Then ĨŽƌĂŶLJɴ੣ ੓ сE;ȴ͛Ϳ͕yɴ ੣ Nᄰ and E(F’Y) = F’Xɴ= 0t where the last inequality follows because R(F)‫ ؿ‬N. We can use lemma 2.5.2 to characterize all unbiased estimators for a ŐŝǀĞŶ ƉĂƌĂŵĞƚƌŝĐ ǀĞĐƚŽƌ ȿ͛ɴ͘ /Ŷ ƉĂƌƚŝĐƵůĂƌ͕ ƚŚĞ ƌĞůĂƚŝŽŶƐŚŝƉ ďĞƚǁĞĞŶ ƚŚĞ


linear unbiased estimators for a particular parametric vector and unbiased estimators of zero is given in the following lemma.

Lemma 2.5.3. Let A'Y be a given tx1 linear estimator. Then a linear estimator B'Y is an unbiased estimator for the parametric vector E(A'Y) if and only if B = A + F for some matrix F having R(F) ⊆ N.

Proof. Assume B = A + F where R(F) ⊆ N. Then E(B'Y) = E((A + F)'Y) = E(A'Y) + E(F'Y) = E(A'Y) because F'Y is unbiased for 0t by lemma 2.5.2. Conversely, assume B'Y is unbiased for E(A'Y). Then (B – A)'Y = F'Y where F = B – A is unbiased for zero and, by lemma 2.5.2, R(F) ⊆ N. Thus B = A + F as we were to show.

From lemma 2.5.3, it follows that if Λ'β is a given parametric vector of interest and if A'Y is unbiased for Λ'β, then the entire class of linear unbiased estimators for Λ'β can be described as B'Y where B is any matrix of the form B = A + F and F is a matrix such that R(F) ⊆ N.

Example 2.5.4. (Seely (1989)) Suppose Y is a random vector such that E(Y) = Xβ, Δ'β = 0, where

X = ( 1 0
      0 1
      2 1 )   and   Δ' = (1, 1).

We now apply the remarks indicated in the previous paragraph to the parametric function δ1'β = β1 with δ1 = (1,0)'. First, note that a'Y = Y1 is unbiased for β1. So, the set of linear unbiased estimators for β1 is a'Y + f'Y, f ⊥ Ω. We will characterize the vectors f via a basis for the orthogonal complement of Ω. One possibility for doing this is to find a matrix U whose columns span Ω and then find a basis for N(U') = Ω⊥. Set B = (1,-1)'. Because Δ'B = 0 and r(B) = n(Δ'), it follows that R(B) = N(Δ') = Θ. Thus Ω = {Xβ : Δ'β = 0} = {XBδ : δ ∈ R1} = R(U) where U = XB = (1,-1,1)'. By inspection, we see that f1 = (1,1,0)' and f2 = (0,1,1)' form a basis for the orthogonal complement of Ω. So

{a'Y + α1f1'Y + α2f2'Y : α1, α2 ∈ R1}


describes the entire set of linear unbiased estimators for the parametric ĨƵŶĐƚŝŽŶɴ1. We now use the previous lemmas to help characterize those linear estimators of the form T’Y that are blues for their expectation. Lemma 2.5.5. Let T’Y be a linear estimator. If R(VT) ‫ ؿ‬Nᄰ = ɏ, then T’Y is a blue for E(T’Y). Proof. Suppose T’Y is a linear estimator such that R(VT) ‫ ؿ‬Nൢ. Now suppose B’Y is unbiased for E(T’Y). Then from lemma 2.5.3, it follows that B = T + F for some matrix F such that R(F) ‫ ؿ‬N. Hence, since R(VT) ‫ ؿ‬Nᄰ, cov(B’Y) = cov[(T + F)’Y] = ʍ2(T + F)’V(T + F) сʍ2(T’VT + T’VF + F’VT + F’VF) =ʍ2(T’VT + F’VF) ൒ ʍ2T’VT = cov(T’Y). Since this argument holds for any unbiased estimator B’Y for E(T’Y), we have that T’Y is a blue. The reader should note that lemma 2.5.5 provides a sufficient condition for T’Y to be a blue for E(T’Y), but not a necessary one. To show that the condition given in lemma 2.5.5 is also necessary we introduce the following additional set. Let M = {V-1yʌ͗ȴ͛ʌсϬs} = V-1[Nᄰ] = V-1[ɏ]. We note that if T’Y is a linear estimator, then it is easily seen that R(VT)‫ ؿ‬Nᄰ if and only R(T) ‫ ؿ‬M. The following theorem is due to Zyskind (1967). Theorem 2.5.6. Let M and N be as defined previously in this section. Then the following statements hold: (a) Rn = M ۩ N. (b) T’Y is a tx1 blue for E(T’Y) if and only if R(T) ‫ ؿ‬M, i.e., R(VT) ‫ ؿ‬Nᄰ. Proof. (a) Observe that because V > 0, dim(MͿсĚŝŵ΂yʌ͗ȴ͛ʌсϬs} = dim{ɏ}, hence that


dim(M + N) = dim(M) + dim(N) – dim(M ‫ ת‬N) = n – dim(M ‫ ת‬N). So, if we can show that M ‫ ת‬N = 0n, we have the desired result. So suppose q੣M‫ת‬N. Then q = V-1yʌĨŽƌƐŽŵĞǀĞĐƚŽƌʌƐĂƚŝƐĨLJŝŶŐȴ͛ʌсϬs͘dŚĞŶsƋсyʌ ĂŶĚsƋсyʌ੣ Nᄰ. But this implies, since q ੣ N͕ƚŚĂƚƋ͛sƋсƋ͛yʌсϬ͘dŚƵƐ q=0n since V > 0. (b) If R(VT) ‫ ؿ‬Nᄰ,i.e., R(T) ‫ ؿ‬M, it follows from lemma 2.5.5 that T’Y is a blue for E(T’Y). Conversely, assume T’Y is a blue for E(T’Y). Then from part (a), it follows that T = A + F where R(A) ‫ ؿ‬M, i.e., R(VA) ‫ ؿ‬Nᄰ and R(F) ‫ ؿ‬N. Thus we can rewrite T’Y as T’Y = A’Y + F’Y where A’Y has the same expectation as T’Y (because F’Y is unbiased for zero) and it follows as in the proof of lemma 2.5.5 that since R(VA)‫ؿ‬Nᄰ, cov(A’Y) ൑ cov(T’Y). But by assumption, T’Y is also a blue for E(T’Y), hence cov(T’Y) ൑ cov(A’Y). Thus we have that cov(A’Y) = A’VA = cov(T’Y) = T’VT = (A + F)’V(A + F) = A’VA + A’VF + F’VA + F’VF =A’VA + F’VF [Because R(VA) ‫ ؿ‬Nᄰ and R(F) ‫ ؿ‬N]. Now, since A’VA=A’VA+F’VF, it follows that F’VF = 0tt which implies that VF= 0nt. But since F = T-A, we have that VT = VA. Thus R(VT) = R(VA)‫ ؿ‬Nൢ Theorem 2.5.6 has a number of important implications. One important consequence is that the class of blues is closed under linear (matrix) operations. ෡ + Ʉො where L is a given matrix, Ԅ ෡ is a blue for Lemma 2.5.7. Suppose Ɂ෠ = L’Ԅ ƚŚĞƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌʔĂŶĚɄො ŝƐĂďůƵĞĨŽƌƚŚĞƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌɻ͘dŚĞŶ Ɂ෠ ŝƐĂďůƵĞĨŽƌƚŚĞƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌ>͛ʔнɻ͘ ෡ = T’Y and Ʉො = H’Y where R(T) ‫ ؿ‬M and Proof. To see this statement, let Ԅ ෡ + Ʉො = L’(T’Y) + H’Y = (TL + H)’Y where R(TL + H) ‫ؿ‬ R(H)‫ؿ‬M. Then Ɂ෠ = L’Ԅ R(TL) + R(H)‫ؿ‬M. So by Theorem 2.5.6, Ɂ෠ ŝƐƚŚĞďůƵĞĨŽƌɷ͘ Lemma 2.5.7 leads to an alternative and interesting characterization of a blue. Consider an arbitrary linear estimator Ɂ෠. Suppose Ɂ෠ is a blue. Because each component of Ɂ෠ can be obtained by a linear operation on Ɂ෠,


it follows that each component of Ɂ෠ is a blue. Conversely, since Ɂ෠ can be obtained from its components by linear operations, it follows that Ɂ෠ is a blue whenever each of its components is a blue. To summarize, an sdimensional linear estimator is a blue if and only if each of its s components is a blue. The reader should note another important consequence of theorem 2.5.6 is that because M = V-1΀ɏ΁͕ the coefficient matrix of a blue only ĚĞƉĞŶĚƐŽŶɏĂŶĚĚŽĞƐŶŽƚĚĞƉĞŶĚŽŶĂŶLJƉĂƌƚŝĐƵůĂƌƉĂƌĂŵĞƚĞƌŝnjĂƚŝŽŶ͘ For example, if a linear estimator is a blue under one parameterization it is a blue under any parameterization. Of course, the parametric vector that a blue estimates does depend on the parameterization. Never the less, the fact that a linear estimator being a blue only depends on the expectation space leads to one method of finding a blue for a given parametric vector under a given parameterization. In particular, suppose H is selected so that its columns span Nᄰ = ё. Then it follows that R(V-1H) = M defined earlier and A’H’V-1Y is a blue for any matrix A. Thus to determine a blue for a given estimable parametric vector, simply select A so that A’H’V-1Y has the correct expectation. However, there is an alternative and interesting method for finding blues. It involves applying the results given in theorem 2.5.6 to a given linear estimator. In particular, let P and N = In – P denote the projection operators on M along N and on N along M, respectively. Then In = P + N so that any linear estimator A’Y can be decomposed as A’Y =T’Y + F’Y where T = PA and F=NA. In this decomposition notice that R(T) ‫ ؿ‬M so that T’Y is a blue and that R(F) ‫ ؿ‬N so that F’Y has expectation zero. Putting these two observations together, we have the following lemma. Lemma 2.5.8. Let A’Y be an arbitrary linear estimator. Then A’P’Y is a blue for the parametric vector E(A’Y) where P is the projection on M along N. The technique for finding a blue given in Lemma 2.5.8 is the basis for many later results. For the present time, there are two immediate conclusions implied by lemma 2.5.8. First, if there exists a linear unbiased estimator for a parametric vector, then there exists a blue for that parametric vector. Secondly, one way to construct a blue for a parametric vector (if one exists) is to first determine a linear unbiased estimator, say A’Y, then compute A’P’Y where P is the projection on M along N. For


example, Y is unbiased for E(Y) under any parameterization, hence P’Y is a blue for E(Y) under any parameterization. Example 2.5.9. (Seely (1989)) We illustrate the last observations in the previous paragraph using example 2.5.4 assuming cov(YͿсʍ2In. First we ĨŝŶĚĂďůƵĞĨŽƌƚŚĞƉĂƌĂŵĞƚƌŝĐĨƵŶĐƚŝŽŶɷ1’ɴсɴ1. From example 2.5.4, we know a’Y = Y1 ŝƐƵŶďŝĂƐĞĚĨŽƌɴ1 ĂŶĚhƐĂƚŝƐĨŝĞƐZ;hͿсɏсNᄰ. Thus a’PY = a’U(U’U)-1U’Y = (a’U/U’U)U’Y = t1’Y, t1 = 3-1h ŝƐ Ă ďůƵĞ ĨŽƌ ɴ1 where P=U(U’U)-1U’ is the orthogonal projection onto Nᄰ сɏ͘EŽǁǁĞǁŝůůĨŝŶĚĂ ďůƵĞĨŽƌƚŚĞƉĂƌĂŵĞƚƌŝĐĨƵŶĐƚŝŽŶɴ2. Because 03’Y ŝƐƵŶďŝĂƐĞĚĨŽƌȴ͛ɴсϬ and 03 ੣ ɏ͕ŝƚĨŽůůŽǁƐƚŚĂƚϬ3’Y ŝƐĂďůƵĞĨŽƌȴ͛ɴ͘dŚƵƐƚ2’Y=03’Y-t1’Y (t2 = -t1) is a linear combination of blues so that t2’Y ŝƐĂďůƵĞĨŽƌȴ͛ɴ– ɷ1͛ɴсɴ2. Lemma 2.5.10. If T’Y and H’Y are tx1 blues for the same parametric vector, then T = H. Proof. Let F = T – H. We show that R(F) ‫ ؿ‬M ‫ ת‬N = 0n which implies T = H. First, note that R(F) ‫ ؿ‬N since E(F’Y) = 0t. Also, since T’Y and H’Y are both blues, R(H)‫ؿ‬M and R(T) ‫ ؿ‬M, hence R(F) = R(T – H) ‫ ؿ‬M which implies R(F)‫ ؿ‬M ‫ ܰ ת‬as we were to show Comment. In Theorem 2.5.7, we let M = {V-1yʌ͗ȴ͛ʌсϬs} and N с΂yʌ͗ȴ͛ʌсϬs}ᄰ and let P be the projection on M along N. Because the projection P is an important matrix that appears throughout the rest of this text, it is useful to have a computable form for P. This is the purpose of the following lemma. Lemma 2.5.11. Consider the parameterization E(YͿсyɴ͕ȴ͛ɴсϬs. Now let ďĞĂƉdžŬŵĂƚƌŝdžƐƵĐŚƚŚĂƚZ;ͿсE;ȴ͛Ϳ͕ůĞƚhсyĂŶĚůĞƚWсs-1U(U’V-1U)-U’. Then P is the projection on M along N where M and N are as defined above. Proof. First, we observe that M = R(V-1XA) = R(V-1U) and N = R(XA)ᄰ = R(U)ᄰ= N(U’). Now, note that P2 = V-1U(U’V-1U’)-U’V-1U(U’V-1U)-U’ = V-1U(U’V-1U)-U’ because U’V-1U(UV-1U’)- is a projection on R(U’). Clearly R(P) ‫ ؿ‬R(V-1U) = M. Now let x ੣ M. Then x = V-1hʌĨŽƌƐŽŵĞǀĞĐƚŽƌʌ͘ƵƚƚŚĞŶ


U'Px = U'V-1U(U'V-1U)-U'x = U'V-1U(U'V-1U)-U'V-1Uρ = U'V-1Uρ.

Hence V-1U(U'V-1U)-U'V-1Uρ – V-1Uρ ∈ N(U') ∩ R(V-1U) = M ∩ N = 0n, and we have that Px = V-1U(U'V-1U)-U'x = x, so that x ∈ R(P) and R(P) = M. Clearly N(U') = N ⊆ N(P). So let x ∈ N(P). Then Px = V-1U(U'V-1U)-U'x = 0n and U'Px = U'V-1U(U'V-1U)-U'x = U'x = 0k, so that x ∈ N(U') and hence N(P) = N(U') = N.

Comment. Since the matrix P defined in lemma 2.5.11 is the projection on M along N, it follows from proposition A11.7 that the matrix P' = U[(U'V-1U)-]'U'V-1 = U(U'V-1U)-U'V-1 is the projector on N⊥ = {Xρ : Δ'ρ = 0s} along M⊥ = {V-1Xρ : Δ'ρ = 0s}⊥.

Comment. If we let P be the projection on M along N and let N = In – P, then using simple algebra and the expression for P given in lemma 2.5.11, the following facts are easy to prove:

a) P'VP = P'V = VP.
b) PV-1P' = PV-1 = V-1P'.                                   (2.5.12)
c) N'VN = N'V = VN.
d) NV-1N' = NV-1 = V-1N'.

While the observations given in (2.5.12) are not of great value for our present considerations, they will be useful in much of our later work.
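The computable form of P given in lemma 2.5.11 and the identities in (2.5.12) can be checked numerically. The sketch below does so for the matrices of example 2.5.4 together with an arbitrary positive definite V chosen purely for illustration; a Moore–Penrose inverse stands in for the generalized inverse (U'V-1U)-.

```python
# A sketch of lemma 2.5.11 and the identities in (2.5.12), using the matrices of
# example 2.5.4 and an arbitrary positive definite V chosen only for illustration.
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
A = np.array([[1.0], [-1.0]])                          # columns span N(Delta') for Delta' = (1, 1)
U = X @ A

rng = np.random.default_rng(0)
G = rng.normal(size=(3, 3))
V = G @ G.T + 3 * np.eye(3)                            # a positive definite V (illustrative)
Vinv = np.linalg.inv(V)

P = Vinv @ U @ np.linalg.pinv(U.T @ Vinv @ U) @ U.T    # P = V^{-1} U (U'V^{-1}U)^- U'
N = np.eye(3) - P

assert np.allclose(P @ P, P)                           # P is a projection (idempotent)
assert np.allclose(P.T @ V @ P, V @ P)                 # P'VP = P'V = VP
assert np.allclose(P @ Vinv @ P.T, Vinv @ P.T)         # PV^{-1}P' = PV^{-1} = V^{-1}P'
assert np.allclose(N.T @ V @ N, V @ N)                 # the companion facts for N = I - P
print("projection identities in (2.5.12) verified numerically")
```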

2.6 The Gauss-Markov Property In general, our main problem in linear models is to determine blues for parametric vectors. If we are only interested in determining blues for one or two parametric vectors, then either of the two methods for finding blues given in section 2.5 will work. However, in many situations, we are interested in determining the blue for a number of parametric vectors. When this is the case the methods introduced in section 2.5 are not very efficient. In this section, we develop methods of estimating arbitrary


estimable parametric vectors under a given parameterization. The ƐŝŵƉůĞƐƚĐĂƐĞĨŽƌǁŚŝĐŚƚŚŝƐŝƐƉŽƐƐŝďůĞŽĐĐƵƌƐǁŚĞŶɴŝƚƐĞůĨŝƐĞƐƚŝŵĂďůĞ͘/Ŷ ƚŚŝƐĐĂƐĞ͕ŝĨɴŝƐĞƐƚŝŵĂďůĞĂŶĚȾ෠ ŝƐƚŚĞďůƵĞĨŽƌɴ͕ƚŚĞŶŝƚĨŽůůŽǁƐĨƌŽŵůĞŵŵĂ Ϯ͘ϱ͘ϳƚŚĂƚŝĨȿ͛ŝƐĂŶĂƌďŝƚƌĂƌLJůdžƉŵĂƚƌŝdž͕ƚŚĞŶƚŚĞďůƵĞĨŽƌȿ͛ɴŝƐȿ͛Ⱦ෠. It is this latter idea we wish to generalize. In particular, such as in the case when ɴŝƐĞƐƚŝŵĂďůĞ͕ǁĞǁŽƵůĚůŝŬĞŝĨƉŽƐƐŝďůĞƚŽĨŝŶĚĂƌĂŶĚŽŵǀĞĐƚŽƌȾ෠ such that ȿ͛Ⱦ෠ ŝƐƚŚĞďůƵĞĨŽƌȿ͛ɴǁŚĞŶĞǀĞƌȿ͛ɴŝƐĞƐƚŝŵĂďůĞ͘ƐŝƚƚƵƌŶƐŽƵƚ͕ǁĞĐĂŶ always find such a random vector Ⱦ෠. Definition 2.6.1. Under a given parameterization E(Y) = Xɴ, ȴɴ = 0s, a random vector Ⱦ෠ is said to have the Gauss-Markov (gm) property ĨŽƌɴŝĨ ĂŶĚŽŶůLJŝĨȿ͛Ⱦ෠ ŝƐƚŚĞďůƵĞĨŽƌȿ͛ɴǁŚĞŶĞǀĞƌȿ͛ɴŝƐĞƐƚŝŵĂďůĞ͘ Theorem 2.6.2. Assume a parameterization of the form E(YͿсyɴ͕ȴ͛ɴсϬs, and let P be the projection on M along N where M and N are as previously defined in this chapter. Then a random vector Ⱦ෡ ŚĂƐƚŚĞŐŵƉƌŽƉĞƌƚLJĨŽƌɴ if and only if XȾ෠=P’Y ĂŶĚȴ͛Ⱦ෠ = 0s. Proof. Suppose Ⱦ෠ has the gm property. Then it follows from definition 2.6.1 ƚŚĂƚȿ͛Ⱦ෠ ŵƵƐƚďĞƚŚĞďůƵĞĨŽƌȿ͛ɴǁŚĞŶĞǀĞƌȿ͛ɴŝƐĞƐƚŝŵĂďůĞ͘ƵƚƚŚĞŶyȾ෠ ŵƵƐƚďĞƚŚĞďůƵĞĨŽƌyɴĂŶĚȴ͛Ⱦ෠ ŵƵƐƚďĞƚŚĞďůƵĞĨŽƌȴ͛ɴƐŝŶĐĞyɴĂŶĚȴ͛ɴ ĂƌĞďŽƚŚĞƐƚŝŵĂďůĞ͘^ŝŶĐĞȴ͛ɴсϬs , it is clear that 0s ŝƐƚŚĞďůƵĞĨŽƌȴ͛ɴ͕ ŚĞŶĐĞƚŚĂƚȴ͛Ⱦ෠ = 0s by uniqueness of blues. Since Y ŝƐƵŶďŝĂƐĞĚĨŽƌyɴ͕ŝƚ follows from lemma 2.5.8 that P’Y ŝƐĂďůƵĞĨŽƌyɴĂŶĚĂŐĂŝŶďLJƵŶŝƋƵĞŶĞƐƐ of blues, we have that XȾ෠ = P’Y. Conversely, suppose Ⱦ෡ satisfies XȾ෠ = P’Y ĂŶĚȴ͛Ⱦ෠ = 0s͘>Ğƚȿ͛ɴďĞĂŶLJĞƐƚŝŵĂďůĞƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌĂŶĚůĞƚ͛Y be ƵŶďŝĂƐĞĚ ĨŽƌ ȿ͛ɴ͘ &ƌŽŵ ůĞŵŵĂ Ϯ͘ϰ͘ϲ͕ ŝƚ ĨŽůůŽǁƐ ƚŚĂƚ ȿсy͛нȴ> Ĩor some ŵĂƚƌŝdž>͘ƵƚƚŚĞŶȿ͛Ⱦ෠ = A’XȾ෡ н>͛ȴ͛Ⱦ෡ = A’P’Y ǁŚŝĐŚŝƐƚŚĞďůƵĞĨŽƌȿ͛ɴďLJ lemma 2.5.8. Comment: We call XȾ෠ = P’Y the fitted values for the model. Also, notice that in Theorem 2.6.2 when the model has no constraints, then the condition for Ⱦ෠ to have the gm property reduces to XȾ෠ = P’Y where P is the projection on M along N where M = R(V-1X) and N = R(X)ᄰ. The following lemma provides an alternative characterization for vectors Ⱦ෠ having the gm property.


Lemma 2.6.3. Consider a parameterization of the form E(Y) = Xɴ, ȴ͛ɴсϬs, and let W be any nxl matrix such that R(W) = M. Then Ⱦ෠ ŝƐŐŵĨŽƌɴŝĨĂŶĚ only if W’XȾ෠=W’Y ĂŶĚȴ͛Ⱦ෠ = 0s. Proof. Assume Ⱦ෠ ŝƐŐŵĨŽƌɴ͘dŚĞŶȾ෠ satisfies XȾ෠ = P’Y ĂŶĚȴ͛Ⱦ෠ = 0s where P is the projection on M along N. Since P is a projection on M and R(W) =M, it follows that PW = W, W’P’ = W’ and NW = 0nl where N = In – P. Hence from XȾ෠ =P’Y we get W’XȾ෠ = W’P’Y = W’Y ĂŶĚȴ͛Ⱦ෠ = 0s. Conversely, suppose W’XȾ෠ = W’Y ĂŶĚ ȴ͛Ⱦ෠=0s. Then, as in the previous paragraph, W’XȾ෠=(W’P’)XȾ෠ = W’Y = (W’P’)Y, hence R(P’XȾ෠ – P’Y) ‫ ؿ‬N(W’) ‫ ת‬R(P’) = R(W)ᄰ ‫ ת‬N(P)ᄰ = Mᄰ ‫ ת‬Nᄰ = 0n since Rn = M ْ N. Therefore P’XȾ෠ = P’Y. But we also have that P’ is a projection on R(P’)=Nᄰ= ΂yʌ͗ ȴ͛ʌ с Ϭs΃ ĂŶĚ ƐŝŶĐĞ ȴ͛Ⱦ෠= 0s, it follows that XȾ෠੣Nᄰ = R(P’) and that P’XȾ෠ = XȾ෠ = P’Y. Thus Ⱦ෠ ŝƐŐŵĨŽƌɴďLJƚŚĞŽƌĞŵϮ͘ϲ͘Ϯ͘ With regard to Lemma 2.6.3, we make several observations. In lemma 2.6.3, if we take V = In͕ȴсϬps and W = X, then Ⱦ෠ ŝƐŐŵĨŽƌɴŝĨĂŶĚŽŶůLJŝĨ X’XȾ෠ = X’Y. These latter equations are often called the gm equations for the cŽƌƌĞƐƉŽŶĚŝŶŐŵŽĚĞů͘/ŶůĞŵŵĂϮ͘ϲ͘ϯ͕ŝĨȴсϬps, we can take W = V-1X and see that Ⱦ෠ ŝƐ Őŵ ĨŽƌ ɴ ŝĨ ĂŶĚ ŽŶůLJ ŝĨ ŝƚ ƐĂƚŝƐĨŝĞƐ y͛s-1XȾ෡ = X’V-1Y. These equations are often referred to as the generalized gm equations. As a final topic in this section, we note that whenever possible, it would be nice to have a computationally simple method for computing a gm estimator for ɴ in the model for Y. That is the topic of the following discussion. Consider a random vector Y having E(Y) ੣ ɏǁŚĞƌĞɏŝƐĂƐƵďƐƉĂĐĞŽĨZn and two alternative parameterizations for E(Y): Parameterization 1: E(Y) = Xɴ͕ȴ͛ɴ = 0s͕ǁŚĞƌĞȴ͛ŝƐĂŶƐdžƉŵĂƚƌŝdž͘ EŽǁůĞƚďĞĂƉdžŵŵĂƚƌŝdžƐƵĐŚƚŚĂƚZ;ͿсE;ȴ͛ͿĂŶĚůĞƚhсy͘dŚĞŶƐŝŶĐĞ ɏс΂yɴ͗ ȴ͛ɴ = 0s} = R(U), we clearly have the following as an alternative parameterization for E(Y): Parameterization 2: E(YͿсhʏ͕ʏ੣ Rm.


Now in parameterization 2, we have that τ̂ is gm for τ if and only if τ̂ satisfies the equation

Uτ̂ = P'Y                                                   (2.6.4)

where P is the projection on M along N, or τ̂ satisfies the equation

U'V-1Uτ̂ = U'V-1Y.                                           (2.6.5)

Now observe in (2.6.4) that if we let β̂ = Aτ̂, then we have that Uτ̂ = XAτ̂ = Xβ̂ = P'Y and Δ'β̂ = Δ'Aτ̂ = 0s. Thus if τ̂ is gm for τ in parameterization 2, it follows from theorem 2.6.2 that β̂ = Aτ̂ is gm for β in parameterization 1. Thus the problem of finding a gm estimator for β in a parameterization involving homogeneous constraints has been reduced to finding a gm estimator for τ in a parameterization without constraints, which is generally easier to handle computationally. With regard to finding a vector which is gm for τ in parameterization 2, we note that τ̂ must satisfy (2.6.5) above. These equations are particularly easy to solve when U has full column rank, since then U'V-1U is invertible and τ̂ = (U'V-1U)-1U'V-1Y is gm for τ. So it would be useful to know when the matrix U in parameterization 2 above has full column rank. This is the content of the following lemma.

Lemma 2.6.6. In parameterization 2, let A be a pxm matrix whose columns form a basis for N(Δ') and let U = XA. Then U has full column rank if and only if r(X',Δ) = p.

Proof. Assume U has full column rank and note that U is an nxm matrix and that r(U) = r(A) = m. But

r(A) = r(U) = r(XA) = r(A'X') = r(X') – dim[N(A') ∩ R(X')] = r(X') – dim[R(Δ) ∩ R(X')]
     = r(X') + r(Δ) – dim[R(Δ) ∩ R(X')] – r(Δ) = dim[R(X') + R(Δ)] – r(Δ)
     ≤ p – r(Δ) = dim[R(Δ)⊥] = dim[N(Δ')] = r(A).

Thus dim[R(X') + R(Δ)] – r(Δ) = p – r(Δ) and dim[R(X') + R(Δ)] = r(X',Δ) = p. Conversely, suppose r(X',Δ) = p. Then as in the first part of this proof,


r(U) = r(XA) = r(A’X’) = r(X’) –dim[N(A’) ‫ ת‬R(X’) = r(X’) – Ěŝŵ΀Z;ȴͿ‫ ת‬R(X’)] сƌ;y͛Ϳнƌ;ȴͿ– dim [R(X’) ‫ ת‬Z;ȴͿ΁– ƌ;ȴͿсĚŝŵ΀Z;y͛ͿнZ;ȴͿ΁– ƌ;ȴͿ =r(X’,ȴ) - r(ȴ) = p – ƌ;ȴͿсĚŝŵ΀Z;ȴͿᄰ΁сĚŝŵE;ȴ͛Ϳсƌ;Ϳ So r(U) = r(A) = m and since U and A both have m columns, we have the desired result. We now summarize the above discussion in the following theorem. Theorem 2.6.7. Suppose E(YͿсyɴ͕ȴ͛ɴсϬs, is a parameterization for E(Y). dŚĞŶɴŝƐĞƐƚŝŵĂďůĞĂŶĚƚŚĞƌĞĞdžŝƐƚƐĂďůƵĞĨŽƌɴŝĨĂŶĚŽŶůLJŝĨƌ;y͕͛ȴͿсƉ͘ &ƵƌƚŚĞƌ͕ ůĞƚ hсy ǁŚĞƌĞ ƚŚĞ ĐŽůƵŵŶƐ ŽĨ  ĨŽƌŵ Ă ďĂƐŝƐ ĨŽƌ E;ȴ͛Ϳ͘ /Ĩ ƌ;y͕͛ȴͿсƉ͕hŚĂƐĨƵůůĐŽůƵŵŶƌĂŶŬ͕h͛s-1U is invertible and Ⱦ෡ =A(U’V-1U)-1U’V-1Y ŝƐƚŚĞƵŶŝƋƵĞďůƵĞĨŽƌɴŚĂǀŝŶŐĐŽǀ;Ⱦ෠Ϳсʍ2D where D = A(U’V-1U)-1A’. WƌŽŽĨ͘tĞŶŽƚĞƚŚĂƚɴс/pɴŝƐĞƐƚŝŵĂďůĞŝĨĂŶĚŽŶůLJŝĨZ;/p) ‫ ؿ‬Z;y͛ͿнZ;ȴͿ which occurs if and only if p = r(Ip) чĚŝŵ΀Z;y͛ͿнZ;ȴͿ΁сƌ;;y͕͛ȴͿчƉ͘dŚĞfact that Ⱦ෠ as given in ƚŚĞƚŚĞŽƌĞŵŝƐƚŚĞďůƵĞĨŽƌɴĨŽůůŽǁƐĨƌŽŵƚŚĞĚŝƐĐƵƐƐŝŽŶ preceding lemma 2.6.6. Finally, observe that cov(Ⱦ෠) = cov[A(U’V-1U)-1U’V-1Y΁сʍ2A(U’V-1U)-1U’V-1VV-1U(U’V-1U)-1’A’ сʍ2A(U’V-1U)-1A’. As mentioned previously, theorem 2.6.7 is computationally useful because full column rank parameterizations are often assumed in the literature on best linear unbiased estimation and many computer programs are usable only for such parameterizations. We now give an example to illustrate the use of theorem 2.6.7. Example 2.6.8. (One-way additive model) (Seely(1989)) Suppose {Yij} is a collection of independent random variables having a constant unknown ǀĂƌŝĂŶĐĞʍ2 and expectation E(YijͿсђнɲi, for i = 1,…,a, j = 1,…,ni and has σ௜ ɲi =0,i.e, the Yij follow a one-way additive model with constraints. Write the model in matrix form as E(Y) = yɴ͕ȴ͛ɴсϬĂŶĚĐŽǀ;YͿсʍ2In. Let Xd = (1a,Ia) be the axp (p =a+1) matrix of the distinct rows of X. Then it is easily


seen that r(X',Δ) = r(Xd',Δ) = a + 1 = p. Thus, we can apply theorem 2.6.7 to find a blue for β. To do this we need a matrix A whose columns form a basis for N(Δ'). Since n(Δ') = a, r(A) = a, and any pxa matrix of rank a whose columns are orthogonal to Δ would be a satisfactory choice for A. By inspection it is easy to generate a satisfactory A matrix to form a parameterization E(Y) = Uα, α ∈ Ra, from which β̂ and cov(β̂) can be determined.
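As a concrete illustration of theorem 2.6.7 and example 2.6.8, the sketch below takes a = 2 groups with cov(Y) = σ2I, forms a matrix A whose columns span N(Δ'), and computes β̂ = A(U'V-1U)-1U'V-1Y; the data vector y is made up for the illustration.

```python
# A sketch of theorem 2.6.7 applied to the one-way additive model of example 2.6.8
# with a = 2 groups and cov(Y) = sigma2*I (the data vector y below is hypothetical).
import numpy as np

X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)            # columns: mu, alpha1, alpha2
Delta = np.array([0.0, 1.0, 1.0])                 # constraint alpha1 + alpha2 = 0

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, -1.0]])                       # columns form a basis for N(Delta')
U = X @ A                                         # full column rank since r(X', Delta) = p
V = np.eye(5)

y = np.array([3.1, 2.9, 5.0, 5.2, 4.8])           # made-up observations
Vinv = np.linalg.inv(V)
beta_hat = A @ np.linalg.solve(U.T @ Vinv @ U, U.T @ Vinv @ y)

print(beta_hat)                                   # (mu_hat, alpha1_hat, alpha2_hat)
print(Delta @ beta_hat)                           # 0: the constraint is satisfied exactly
```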

2.7 Least Squares, Gauss-Markov, Residuals and Maximum Likelihood Estimation when cov(Y) = σ2In

The classical approach to linear estimation problems when cov(Y) = σ2In, and the one used in many standard textbooks, is the principle of least squares (ls), which is described below. One purpose of the present section is to tie together the linear estimation techniques developed in the previous sections of this chapter for the case when cov(Y) = σ2In and the principle of least squares. As a by-product of this development we will obtain an estimator of the parameter σ2 associated with the covariance matrix of our random vector Y. We will take a similar approach concerning generalized least squares (gls) when cov(Y) = σ2V in the next section.

One of the standard ways of introducing ls is to estimate E(Y) by finding a vector in Ω which is in some sense as close to the observed value y of Y as possible. In particular, since the mean vector of Y is Xβ, we select β̂ so that Xβ̂ is as close as possible to the observed value y of Y. If we take the statement "as close as possible" to be defined in terms of Euclidean distance, then we arrive at choosing β̂ to minimize

‖Y – Xβ‖ = [(Y – Xβ)'(Y – Xβ)]1/2.

Since minimizing the Euclidean distance is the same as minimizing the square of the Euclidean distance, it is clear that minimizing the Euclidean distance is equivalent to minimizing (Y – Xβ)'(Y – Xβ). In our discussion, nothing has been said about constraints and/or estimability. Typically ls is introduced for parameterizations with no constraints and X has full column rank, so that such problems don't arise. However, constraints are easily added by simply minimizing over the possible β vectors and insisting that the solution satisfy the constraints, and the full rank condition is relaxed


by not requiring a unique solution for β̂. For reference, we formally define least squares.

Definition 2.7.1. Consider a parameterization of the form E(Y) = Xβ, Δ'β = 0s, where cov(Y) = σ2In. A random vector β̂ is said to be least squares (ls) for β if and only if

inf{β: Δ'β = 0s} ‖Y – Xβ‖2 = inf{β: Δ'β = 0s} (Y – Xβ)'(Y – Xβ) = (Y – Xβ̂)'(Y – Xβ̂)

and Δ'β̂ = 0s.

The principle of ls, as described above, is advocated on an intuitive basis. In particular, notice that the covariance matrix played no role in its development. In fact, the principle of ls is frequently employed without any knowledge of the covariance structure. However, when the covariance structure is known to be a multiple of the identity matrix, then it is true that the principle of ls and the gm property coincide.

Lemma 2.7.2. Consider a parameterization of the form E(Y) = Xβ, Δ'β = 0s, where cov(Y) = σ2In. Then β̂ is ls for β if and only if β̂ is gm for β.

Proof. First observe that since cov(Y) = σ2In, M and N are orthogonal. So let P be the orthogonal projection onto M and let N = In – P. For β ∈ N(Δ') = Θ, Y – Xβ can be expressed as Y – Xβ = Y – PY + PY – Xβ = NY + (PY – Xβ). Since PN = 0 and NXβ = 0 for all β ∈ N(Δ'), it follows that

(Y – Xβ)'(Y – Xβ) = [NY + (PY – Xβ)]'[NY + (PY – Xβ)] = (NY)'(NY) + (PY – Xβ)'(PY – Xβ).

So for any value y of Y, Py – Xβ can always be made zero by an appropriate choice of β. So for a given value y of Y,

inf{β: Δ'β = 0s} (y – Xβ)'(y – Xβ) = (Ny)'(Ny).

Thus for any estimator β̂ = β̂(Y), write

(Y – Xβ̂)'(Y – Xβ̂) = (NY)'(NY) + (PY – Xβ̂)'(PY – Xβ̂).


Hence, we have that β̂ = β̂(Y) is ls for β if and only if (Y – Xβ̂)'(Y – Xβ̂) = (NY)'(NY), i.e., if and only if we have (PY – Xβ̂)'(PY – Xβ̂) = 0. But this implies PY = Xβ̂ and Δ'β̂ = 0s, hence that β̂ is ls for β if and only if β̂ is gm for β.

When cov(Y) = σ2In and we have a given parameterization E(Y) = Xβ, Δ'β = 0s, once the gm (ls) estimator β̂ is determined, the vector Y – Xβ̂ = ê can be computed. The vector ê is called the residual vector and has several uses. First, it is often used to assess the assumptions made on the model, i.e., it is used for model diagnostics. Second, it is used for hypothesis testing, which is covered in chapter 4. Finally, it provides a natural means of measuring how close the observed values of Y are to the fitted values under the assumed model. The random variable R̂ = (Y – Xβ̂)'(Y – Xβ̂) = ê'ê is called the residual sum of squares and provides an overall measure of how widely dispersed the observed values of Y are from the fitted values of Y. However, we now observe that since β̂ satisfies Xβ̂ = PY, we have that

ê = Y – Xβ̂ = Y – PY = NY   and   R̂ = (Y – Xβ̂)'(Y – Xβ̂) = [NY]'[NY] = Y'NY.

The random vector ê and the random variable R̂ satisfy several other properties which we give in the following two lemmas.

Lemma 2.7.3. Consider a parameterization of the form E(Y) = Xβ, Δ'β = 0s, where cov(Y) = σ2In. Suppose Λ'β is a tx1 estimable parametric vector and that β̂ is ls (gm) for β. Then the following statements hold:

(a) E(ê) = 0n.
(b) cov(ê) = σ2N.
(c) cov(ê, Λ'β̂) = 0nt.

Proof. (a) Observe that E(ê) = E(NY) = N[E(Y)] = N[Xβ] = 0n since Xβ ∈ M for all β ∈ Θ.
(b) cov(ê) = cov(NY) = Nσ2InN' = σ2N because N is an orthogonal projection.
(c) Write Λ'β̂ = T'Y for an appropriate matrix T having R(T) ⊆ M. Then cov(ê, Λ'β̂) = cov(NY, T'Y) = Nσ2InT = 0nt.


As mentioned above, the residual sum of squares R̂ provides a measure of how widely dispersed the observed values of Y are from the fitted values of Y under the assumed model. On the other hand, the components of e = Y – Xβ have expectation zero and variance σ2, which provides a measure of the variability of the distribution of Y around Xβ. Thus it seems natural that we should be able to use R̂ in some way to estimate σ2.

Lemma 2.7.4. Consider a parameterization of the form E(Y) = Xβ, Δ'β = 0s. Then E(R̂) = (n – m)σ2.

Proof. Since R̂ = Y'NY is a quadratic form, we have from proposition 1.3.12 that

E(R̂) = E(Y'NY) = tr(σ2NIn) + [E(Y)]'N[E(Y)] = σ2tr(N) + [Xβ]'N[Xβ] = σ2r(N) = σ2(n – m)

since N is the orthogonal projection along M and Xβ ∈ M for all β ∈ Θ.

Using lemma 2.7.4, it is clear that when cov(Y) = σ2In, σ̂2 = R̂/(n – m) will be an unbiased estimator for σ2, and it is typically the one that is used in practice. We refer to this estimator as the mean square error (MSE).

As a final topic in this section, we consider the relationships between best linear unbiased estimation, least squares estimation, and the method of maximum likelihood estimation. The reader should note that to this point we have made no distributional assumptions about the random vector Y other than that E(Y) occurs in a known subspace of Rn and that cov(Y) = σ2In. However, an assumption which is often made in practice, and which is made in many textbooks, is that Y follows a multivariate normal distribution. This amounts to assuming that Y ~ Nn(Xβ, σ2In). Once the assumption of normality is made, we can apply the method of maximum likelihood to find maximum likelihood estimators (mles) for β and σ2. To find these mles, we consider the likelihood function that is derived from the joint density of the observations by considering the parameters as variables and the observations as fixed at their observed values. If we make the assumption that Y follows a multivariate normal distribution, then we find the mles for β and σ2 by finding those values of the parameters that maximize

f(y; β, σ2) = (2πσ2)-n/2 exp[-(Y – Xβ)'(Y – Xβ)/2σ2]                    (2.7.5)


subject to Δ'β = 0s. Equivalently, the log of the likelihood function can be maximized. The log of (2.7.5) is

log[f(y; β, σ2)] = -(n/2)log(2π) – (n/2)log[σ2] – (Y – Xβ)'(Y – Xβ)/2σ2.                    (2.7.6)

For every fixed value of σ2, (2.7.6) is maximized by taking β to be the value that minimizes (Y – Xβ)'(Y – Xβ) subject to Δ'β = 0s, i.e., the mle β̂ for β is the least squares estimator for β discussed above. To estimate σ2, we can substitute β̂ into the log likelihood function, where (Y – Xβ̂)'(Y – Xβ̂) = Y'(In – P)Y, differentiate with respect to σ2, and get the mle for σ2 to be Y'(In – P)Y/n where P is the orthogonal projection on M.

Comment. The mle derived above for σ2 is not much used in practical applications because it is a biased estimator. The MSE is used as an estimate for σ2 in almost all applications because it is unbiased and because, when the assumption of normality is made for the random vector Y, the MSE can be shown to be the minimum variance unbiased estimator for σ2.
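The estimators of this section are easy to compute when there are no constraints and X has full column rank. The sketch below, on simulated data, forms the ls (gm) estimator from the gm equations X'Xβ̂ = X'Y, the residual vector, the unbiased MSE = R̂/(n – m), and the biased mle R̂/n.

```python
# A minimal sketch of section 2.7 with no constraints: the ls/gm fit, residual
# vector, unbiased MSE, and the (biased) mle of sigma2; the data are simulated.
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 40, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])      # full column rank design
beta_true = np.array([1.0, -0.5])
Y = X @ beta_true + sigma * rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)                      # orthogonal projection onto M = R(X)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)               # solves the gm equations X'X b = X'Y
e_hat = Y - X @ beta_hat                                   # residual vector
assert np.allclose(e_hat, (np.eye(n) - P) @ Y)             # residuals equal NY with N = I - P
rss = e_hat @ e_hat                                        # residual sum of squares Y'NY

m = np.linalg.matrix_rank(X)
mse = rss / (n - m)                                        # unbiased for sigma2
sigma2_mle = rss / n                                       # mle under normality, biased downward
print(mse, sigma2_mle)
```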

2.8 Generalized Least Squares, Gauss-markov, Residuals and Maximum Likelihood Estimation when cov(Y) = ʍ2V In this section, we extend the discussion developed in section 2.7 for least squares, gm estimators, maximum likelihood estimation and residuals to the case when cov(YͿсʍ2V. The method we use here involves the transformation of a random vector Y having cov(YͿсʍ2V to a random vector W having cov(WͿ сʍ2In and then exploring the relationships between the two models. Because the following discussion will simultaneously involve two random vectors, we will use subscript notation to associate such things as parameter spaces, expectation spaces, etc., with each of the vectors. So let Y be a random vector that has cov(YͿсʍ2V and has a parameterization of the form E(YͿс yɴ͕ ȴ͛ɴ с Ϭs. Under this model, V is a knŽǁŶ ƉŽƐŝƚŝǀĞ ĚĞĨŝŶŝƚĞ ŵĂƚƌŝdž ĂŶĚ ʍ2> 0 is an unknown constant. Also under this model the expectation space for E(YͿ ŝƐ ɏY = ΂yɴ͗ȴ͛ɴ с Ϭs} and the parameter space is ੓Y с ΂ɴ͗ ȴ͛ɴ с Ϭs} = N(ȴ’). In addition, we have


MY = {V-1Xɴ: ȴɴ = 0} and NY = [Xɴ: ȴ’ɴ = 0}ᄰ. Because V>0, we can find a nonsingular matrix Q such that V=QQ’. Now let W=Q-1Y. Then E(W) = E(Q-1Y) = Q-1yɴс,ɴ where H=Q-1yĂŶĚȴ͛ɴсϬs. In addition, cov(W) = cov(Q-1Y) = Q-1cov(Y)Q-1’ = Q-1ʍ2VQ-1͛сʍ2Q-1QQ’Q-1͛сʍ2In. Thus the random vector W follows a model such as considered in section 2.7. Under this model for W, ƚŚĞĞdžƉĞĐƚĂƚŝŽŶƐƉĂĐĞŝƐɏW с΂,ɴ͗ȴ͛ɴсϬs} = {Q1yɴ͗ȴ͛ɴсϬs} and the parameter space is ੓W с ΂ɴ͗ ȴ͛ɴ с Ϭs} = ੓Y. Also, MW={Hɴ: ȴ’ɴ = 0s} = {Q-1Xɴ: ȴ’ɴ = 0s} = ёW and NW = MWᄰ. We now investigate the relationships between the models followed by the random vector Y and the random vector W. Lemma 2.8.1. Let the random vectors Y and W follow the models described above. (a) If a’Y ŝƐƵŶďŝĂƐĞĚĨŽƌʄ͛ɴŝŶƚŚĞŵŽĚĞůĨŽƌY, then a’QW is unbiased for ʄ͛ɴŝŶƚŚĞŵŽĚĞůĨŽƌW. (b) If a’W ŝƐƵŶďŝĂƐĞĚĨŽƌʄ͛ɴŝŶƚŚĞŵŽĚĞůĨŽƌW, then a’Q-1Y is unbiased ĨŽƌʄ͛ɴŝŶƚŚĞŵŽĚĞůĨŽƌY. Proof. (a) Suppose a’Y ŝƐƵŶďŝĂƐĞĚĨŽƌʄ͛ɴŝŶƚŚĞŵŽĚĞůĨŽƌY. Then a’Xɴ= ʄ͛ɴĨŽƌĂůůɴ੣ ੓Y. But then E(a’QW) = a’QE(W) = a’QE(Q-1Y) = a’QQ-1yɴсĂ͛yɴсʄ͛ɴ ĨŽƌĂůůɴ੣ ੓W = ɅY, hence a’QW ŝƐƵŶďŝĂƐĞĚĨŽƌʄ͛ɴ͘ (b) Suppose a’W ŝƐƵŶďŝĂƐĞĚĨŽƌʄ͛ɴ͘dŚĞŶĂ͛,ɴсĂ͛Y-1Xɴс ʄ’ɴ ĨŽƌĂůůɴ੣ ɅW. But then E(a’Q-1Y) = a’Q-1yɴсĂ͛,ɴсʄ͛ɴ ĨŽƌĂůůɴ੣ ੓W = ɅY. Hence a’Q-1Y is unbiased for ʄɴ.


Corollary 2.8.2. Let Y and W follow models as described above. The ƉĂƌĂŵĞƚƌŝĐĨƵŶĐƚŝŽŶʄ͛ɴŝƐĞƐƚŝŵĂďůĞŝŶƚŚĞŵŽĚĞůĨŽƌY ŝĨĂŶĚŽŶůLJʄ͛ɴŝƐ estimable in the model for W. Lemma 2.8.3. Let the random vectors Y and W follow the models described above. (a) If a’Y ŝƐƚŚĞďůƵĞĨŽƌʄ͛ɴŝŶƚŚĞŵŽĚĞůĨŽƌY, then a’QW ŝƐƚŚĞďůƵĞĨŽƌʄ͛ɴ in the model for W. (b) If a’W ŝƐƚŚĞďůƵĞĨŽƌʄ͛ɴŝŶƚŚĞŵŽĚĞůĨŽƌW, then a’Q-1Y is the blue for ʄ͛ɴŝŶƚŚĞŵŽĚĞůĨŽƌY. Proof. (a) Suppose a’Y ŝƐƚŚĞďůƵĞĨŽƌʄ͛ɴŝŶƚŚĞŵŽĚĞůĨŽƌY. Then a’QW is ƵŶďŝĂƐĞĚĨŽƌʄ͛ɴŝŶƚŚĞŵŽĚĞůĨŽƌW by lemma 2.8.1. Furthermore, since a’Y is a blue in the model for Y, it follows from theorem 2.5.6 that Va = QQ’a ੣ {yʌ͗ ȴ͛ʌ с Ϭs} = ёY which implies that Q’a ੣ {Q-1yʌ͗ ȴ͛ʌ с Ϭs} = ΂,ʌ͗ȴ͛ʌсϬs} = ёW. Hence again by theorem 2.5.6 we have that a’QW is the ďůƵĞĨŽƌʄ͛ɴŝŶƚŚĞŵŽĚĞůĨŽƌW. (b) Suppose a’W ŝƐ ƚŚĞ ďůƵĞ ĨŽƌ ʄ͛ɴ ŝŶ ƚŚĞ ŵŽĚĞů ĨŽƌ W. Then a’Q-1Y is ƵŶďŝĂƐĞĚĨŽƌʄ͛ɴŝŶƚŚĞŵŽĚĞůĨŽƌY by lemma 2.8.1. Furthermore, since a’W is a blue, it follows from theorem 2.5.6 that a ੣ MW = ёW which implies that Q’-1a ੣ {Q’-1Q-1Xɏ͗ȴ͛ʌсϬs} = {V-1yʌ͗ȴ͛ʌсϬs} = MY. Hence again from theorem 2.5.6 we have that a’Q-1Y ŝƐƚŚĞďůƵĞĨŽƌʄ͛ɴŝŶ the model for Y. Lemma 2.8.4. Let Y and W follow the models described above. A random vector Ⱦ෠ ŝƐŐŵĨŽƌɴŝŶƚŚĞŵŽĚĞůĨŽƌY if and only if Ⱦ෠ ŝƐŐŵĨŽƌɴŝŶƚŚĞ model for W. Proof. Suppose Ⱦ෠ ŝƐŐŵĨŽƌɴŝŶƚŚĞŵŽĚĞůĨŽƌW͘dŚĞŶŝĨʄ͛ɴŝƐĞƐƚŝŵĂďůĞŝŶ the model for W͕ʄ͛Ⱦ෠ ŝƐƚŚĞďůƵĞĨŽƌʄ͛ɴ͘tĞĂůƐŽŚĂǀĞƚŚĂƚʄ͛Ⱦ෠ = t’W for some vector t ੣ MW and from lemma 2.8.3 that t’Q-1Y ŝƐƚŚĞďůƵĞĨŽƌʄ͛ɴin the model for Y. Hence t’Q-1Y = t’W = ʄ’Ⱦ෠ and Ⱦ෠ is gm for ɴ in the model for Y.


Now assume β̂ is gm for β in the model for Y. Then if λ'β is estimable in the model for Y, λ'β̂ is the blue. But then λ'β̂ = t'Y for some vector t ∈ MY and, from lemma 2.8.3, we have that t'QW is the blue for λ'β in the model for W and t'QW = t'QQ-1Y = t'Y = λ'β̂. Thus β̂ is gm for β in the model for W.

The relationship between the models for Y and W is now considered in terms of least squares estimation. Recall from section 2.7 that the least squares estimator for β in the model for W satisfies

inf{β: Δ'β = 0s} (W – Hβ)'(W – Hβ) = (W – Hβ̂)'(W – Hβ̂).                    (2.8.5)

Since W = Q-1Y and H = Q-1X, we can rewrite (2.8.5) in terms of Y as

inf{β: Δ'β = 0s} (Q-1Y – Q-1Xβ)'(Q-1Y – Q-1Xβ) = inf{β: Δ'β = 0s} (Y – Xβ)'V-1(Y – Xβ) = (Y – Xβ̂)'V-1(Y – Xβ̂).                    (2.8.6)

The random vector β̂ satisfying (2.8.6) is called the generalized least squares (gls) estimator for β under the model for Y. But since (2.8.5) and (2.8.6) are equivalent equations, it follows that the solution β̂ is the same for both. From lemma 2.7.2, it follows that β̂ satisfying (2.8.5) is also gm for β in the model for W. But we also have from lemma 2.8.4 that a β̂ which is gm for β in the model for W is also gm for β in the model for Y. Thus the generalized least squares estimator for β satisfying (2.8.6) is also gm for β in the model for Y.

We now consider the relationship between these models for Y and W with respect to σ2. Recall that it was shown in section 2.7 that under the model for W,

σ̂2 = W'(In – PW)W/r(In – PW) = W'NWW/r(NW) = R̂W/r(NW)                    (2.8.7)

is an unbiased estimator for σ2 where, if A is a matrix such that R(A) = N(Δ') and U = XA,

PW = (HA)[(HA)'(HA)]-(HA)' = (Q-1XA)[(Q-1XA)'(Q-1XA)]-(Q-1XA)' = Q-1U(U'V-1U)-U'Q-1'

is the orthogonal projection on MW and R̂W is the residual sum of squares under the model for W.
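The transformation underlying this section can be illustrated directly: with V = QQ' taken from a Cholesky factorization, least squares on W = Q-1Y and H = Q-1X reproduces the generalized least squares estimator obtained from the generalized gm equations. The sketch below uses simulated data and an arbitrary positive definite V chosen only for illustration.

```python
# A sketch of the transformation used in section 2.8: with V = QQ' (Cholesky),
# W = Q^{-1}Y has cov sigma2*I, and ls on (W, Q^{-1}X) reproduces the gls estimator.
# All numbers are simulated for illustration.
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
G = rng.normal(size=(n, n))
V = G @ G.T + n * np.eye(n)                 # a known positive definite V (illustrative)
Y = X @ np.array([2.0, 1.0]) + np.linalg.cholesky(V) @ rng.normal(size=n)

Q = np.linalg.cholesky(V)                   # V = QQ'
W = np.linalg.solve(Q, Y)                   # W = Q^{-1}Y, so cov(W) = sigma2*I
H = np.linalg.solve(Q, X)                   # H = Q^{-1}X

beta_ls_on_W = np.linalg.solve(H.T @ H, H.T @ W)

Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ Y)   # generalized gm equations

assert np.allclose(beta_ls_on_W, beta_gls)
print(beta_gls)
```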


59

෡ W is the residual sum of squares is the orthogonal projection on MW and R 2 under the model for W͘^ŝŶĐĞʍ is the same under each of the models for Y and W, the equivalent estimator given in terms of Y is ෝ2 = W’NW/r(In – PW) = (Q-1Y)’(In – PW)(Q-1Y)/r(In – PW) ɐ = Y’Q-1’(In - PW)Q-1Y/r(In –PW) = Y’Q-1’[In – Q-1U(U’V-1U)-U’Q-1’]Q-1Y /r(In –PW)

(2.8.8)

= [Y’V-1Y - Y’V-1U[U’V-1U]-U’V-1Y]/r(In – PW) = Y’ (V-1 - V-1PY’)Y/r(In - PY) [see lemma 2.5.11] = Y’(V-1 - V-1PY’ + PYV-1 – PYV-1)Y/r(In – PY) where PY is the projection on MY = R(V-1U) along NY с΂yʌ͗ȴ͛ʌсϬs}ᄰ = R(U)ᄰ. Using the fact that PYV-1PY’ = PYV-1 = V-1PY’ [see (2.5.12)], we can rewrite the last expression in (2.8.8) as ෝ2 = Y’(V-1 – V-1PY’ + PYV-1PY’ – PYV-1)Y/r(I – PY) ɐ = Y’[In – PY]V-1[In – PY ]’Y/r(I – PY)

(2.8.9)

= YԢ[I – PY]V-1[I – PY]’Y/(n – m) where m = r(PW) = r(PY) = dim MW = dim MY. Thus the last expression given ŝŶ;Ϯ͘ϴ͘ϵͿŝƐĂŶƵŶďŝĂƐĞĚĞƐƚŝŵĂƚŽƌĨŽƌʍ2 in the model for Y. We typically ෡ Y and call it the denote the numerator of the last expression in (2.8.9) by R weighted residual sum of squares under the model for Y. As in the previous section, the final topic we consider here is the relationship between best linear unbiased estimation, generalized least squares and maximum likelihood estimation. In the present set up, the only distributional assumptions we have been making is that Y is a random vector whose expectation occurs in a known subspace in Rn and whose ĐŽǀĂƌŝĂŶĐĞŵĂƚƌŝdžŝƐʍ2V where V is a known positive definite matrix. We now assume that Y ~ Nn;yɴ͕ʍ2V) and find the mles ĨŽƌɴĂŶĚʍ2 based on this assumption. To find these mles, we again consider the likelihood function that is derived from the joint pdf of Y by considering the parameters as variables and the observed values of Y as fixed in


f(y; β, σ2) = (2πσ2)-n/2|V|-1/2 exp[-(y – Xβ)'V-1(y – Xβ)/2σ2]                    (2.8.10)

where |V| denotes the determinant of V. Equivalently, we can find those values of the parameters that maximize the log likelihood function given by

log[f(y; β, σ2)] = -(n/2)log(2π) – (n/2)log(σ2) – (1/2)log|V| – (y – Xβ)'V-1(y – Xβ)/2σ2.                    (2.8.11)

For every fixed value of σ2, (2.8.11) is maximized by taking that value of β that minimizes (y – Xβ)'V-1(y – Xβ) subject to Δ'β = 0s, i.e., the mle β̂ for β is the generalized least squares estimator for β determined above. To estimate σ2, we substitute β̂ into the log likelihood function, where

(Y – Xβ̂)'V-1(Y – Xβ̂) = (NY'Y)'V-1(NY'Y) = R̂Y,

differentiate with respect to σ2, and get σ̂2 = (NY'Y)'V-1(NY'Y)/n where NY = In – PY and PY is the projection on MY along NY.

Comment. Once again, the mle for σ2 obtained above is not used very often in practice. Rather, the estimate for σ2 given in (2.8.9) is typically preferred because among its properties is unbiasedness for σ2, and because, when Y is assumed to follow a multivariate normal distribution, the estimator given in (2.8.9) can be shown to be the uniform minimum variance unbiased estimator for σ2.

2.9 Models with Nonhomogeneous Constraints In this section, we consider parameterizations where the constraints on the parameter vector are nonhomogeneous. In particular, let Y denote an nx1 random vector having E(YͿсyɴ͕ȴ͛ɴсԃ ് 0s, and cov(YͿсʍ2V where sхϬŝƐĂŬŶŽǁŶŵĂƚƌŝdžĂŶĚʍ2 > 0 is an unknown constant. So under such a ŵŽĚĞů͕ɏY с΂yɴ͗ȴ͛ɴс੘} and ੓Y с΂ɴ͗ȴ͛ɴс੘}. The reader should observe that the expectation space ɏY and the parameter space ੓Y are affine sets in Rn and Rp, respectively, instead of subspaces as assumed in previous sections. To begin our discussion of models having nonhomogeneous constraints, we give several definitions and results concerning


identifiability and estimability that were previously given in section 2.4 for models with homogeneous constraints but now make them relevant for random vectors following models having nonhomogeneous constraints. Definition 2.9.1. A linear estimator A’Y + d is an unbiased estimator for a ƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌȿ͛ɴнĐŝĨĂŶĚŽŶůLJŝĨ;͛Y + d) = A’XȾത нĚсȿ͛Ⱦത + c for all Ⱦത੣੓Y. With regard to Definition 2.9.1, as with models with homogeneous constraints, we note that if A’Y + d is an arbitrary linear estimator, then A’Y+ d is unbiased for E(A’Y + d) = A’E(Y) + d = A’XȾത нĚсȿ͛Ⱦഥ + d for all Ⱦത੣੓Y ǁŚĞƌĞȿсy͛͘ƐŝŶƐĞĐƚŝŽŶϮ͘ϰ͕ǁĞĐĂůůȿ͛ɴнĚƚŚĞƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌ induced by the expectation of A’Y + d. Thus any linear estimator is unbiased for its induced expectation. For models with nonhomogeneous constraints, as in the case for models with homogeneous constraints, the only parametric vectors that one can even attempt to estimate unbiasedly are those that are “identifiable.” Definition 2.9.2. Consider a parameterization for E(Y) of the form E(YͿсyɴ͕ ɴ੣੓Y. A parametriĐǀĞĐƚŽƌȿ͛ɴнĐŝƐƐĂŝĚƚŽďĞŝĚĞŶƚŝĨŝĂďůĞ if and only if ǁŚĞŶĞǀĞƌɴ1͕ɴ2 ੣ ੓Y ĂƌĞƐƵĐŚƚŚĂƚyɴ1сyɴ2͕ƚŚĞŶȿ͛ɴ1 + c сȿ͛ɴ2 + c. A theorem similar to theorem 2.4.7 is now given which provides several necessary and sufficient conditions for parametric vectors to be identifiable in models having nonhomogeneous constraints. Theorem 2.9.3. Consider the parameterization E(YͿсyɴ͕ȴ͛ɴс੘. For a tx1 ƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌȿ͛ɴнĐ͕ƚŚĞĨŽůůŽǁŝŶŐƐƚĂƚĞŵĞŶƚƐĂƌĞĞƋƵŝǀĂůĞŶƚ͗ (aͿȿ͛ɴнĐŝƐŝĚĞŶƚŝĨŝĂďůĞ͘ ;ďͿȿсy͛нȴ>ĨŽƌƐŽŵĞŵĂƚƌŝĐĞƐĂŶĚ>͕ŝ͘Ğ͕͘Z;ȿͿ ‫ ؿ‬Z;y͛ͿнZ;ȴͿ͕ǁŚĞƌĞ L’੘+c=d for an appropriate vector d. ;ĐͿdŚĞƌĞĞdžŝƐƚƐĂůŝŶĞĂƌƵŶďŝĂƐĞĚĞƐƚŝŵĂƚŽƌ͛zнĚĨŽƌȿ͛ɴнĐ͘ Proof. (a) ֜ ;ďͿ͘ƐƐƵŵĞȿ͛ɴнĐŝƐŝĚĞŶƚŝĨŝĂďůĞ͘tĞǁŝůůƐŚŽǁƚŚĂƚZ;ȿͿ‫ؿ‬ Z;y͛ͿнZ;ȴͿďLJƐŚŽǁŝŶŐƚŚĞĞƋƵŝǀĂůĞŶƚƐƚĂƚĞŵĞŶƚƚŚĂƚ

R(Λ)⊥ = N(Λ') ⊇ (R(X') + R(Δ))⊥ = N(X) ∩ N(Δ'). So let f ∈ N(X) ∩ N(Δ'), let β0 ∈ ΘY be fixed and let β̄ = β0 + f. Then Δ'β̄ = Δ'β0 + Δ'f = Δ'β0 + 0 = ξ which implies β̄ ∈ ΘY and Xβ̄ = Xβ0 + Xf = Xβ0 which implies Λ'β0 + c = Λ'β̄ + c and Λ'(β̄ – β0) = Λ'f = 0t, hence f ∈ N(Λ') as we were to show. Furthermore, Λ'β + c = A'Xβ + L'Δ'β + c = A'Xβ + L'ξ + c = A'Xβ + d where d = L'ξ + c. (b) ⇒ (c). Assume Λ = X'A + ΔL for some matrix L such that L'ξ + c = d. Then for all β ∈ ΘY, Λ'β + c = A'Xβ + L'Δ'β + c = A'Xβ + L'ξ + c = A'E(Y) + L'ξ + c = A'E(Y) + d = E(A'Y + d). Hence A'Y + d is unbiased for Λ'β + c. (c) ⇒ (a). Assume A'Y + d is unbiased for Λ'β + c. Then A'Xβ + d = Λ'β + c for all β ∈ ΘY. Now, if β1, β2 ∈ ΘY are such that Xβ1 = Xβ2, then Λ'β1 + c = A'Xβ1 + d = A'Xβ2 + d = Λ'β2 + c which implies Λ'β + c is identifiable. Similar to section 2.4, there are several things we should observe with regard to the previous theorem. First, when Λ = λ is a px1 vector, the condition in Theorem 2.9.3(b) can equivalently be stated as λ ∈ R(X', Δ). Second, because R(X', Δ) is a subspace, it follows that the class of identifiable parametric vectors is closed under linear operations. That is, any parametric vector of the form Lη + δ, where L is a given matrix and η and δ are identifiable parametric vectors, is an identifiable parametric vector. Lastly, notice that the theorem implies a parametric vector is identifiable provided that it can be estimated unbiasedly by a linear estimator. As in the case for models with homogeneous constraints, we now give a definition which is more consistent with the literature on linear models with regard to identifiable parametric vectors. Definition 2.9.4. A parametric vector Λ'β + c is said to be estimable if and only if there exists a linear unbiased estimator for Λ'β + c. Because of Theorem 2.9.3, it is clear that for a parametric vector Λ'β + c, the two concepts of estimability and identifiability are equivalent. Thus all

properties stated previously that hold for identifiable parametric vectors also hold for estimable parametric vectors. We now consider the problem of finding blues in models with nonhomogeneous constraints. To deal with this problem, it is convenient to transform the vector Y to a vector Z that follows a model with homogeneous constraints and then consider relationships between the two models. So let β̄ be a fixed vector in ΘY and for each vector β ∈ ΘY, let α = β - β̄. Now let Z = Y - Xβ̄. Then E(Z) = E(Y - Xβ̄) = X(β - β̄) = Xα and Δ'(β - β̄) = Δ'α = 0s. Thus Z follows a model having homogeneous constraints such as considered in the previous sections of this chapter where E(Z) = Xα, Δ'α = 0s and cov(Z) = σ2V. Under the model followed by Z, ΩZ = {Xα: Δ'α = 0s} and ΘZ = {α: Δ'α = 0s} = N(Δ') and are subspaces of Rn and Rp, respectively. We also have that MZ = {V-1Xα: Δ'α = 0} and NZ = {Xα: Δ'α = 0}⊥. With regard to these two models, the following relationships are easy to establish: ΩY = Xβ̄ + ΩZ and ΘY = β̄ + ΘZ = β̄ + N(Δ'), whereas ΩZ = -Xβ̄ + ΩY = ΩY – ΩY and ΘZ = -β̄ + ΘY = ΘY - ΘY. We now consider relationships between the models for Y and Z which are useful in determining blues in the model for Y. Lemma 2.9.5. Let Y and Z be random vectors which follow the models described above. The parametric vector Λ'β + d is estimable under the model for Y if and only if the parametric vector Λ'α + e is estimable in the model for Z. Proof. The necessary and sufficient condition for Λ'β + d to be estimable in the model for Y and for Λ'α + e to be estimable in the model for Z is by theorem 2.9.3 the same for both models, i.e., R(Λ) ⊂ R(X') + R(Δ). Definition 2.9.6. Let Y be a random vector which follows a model such as described above. The linear estimator T'Y + c is a blue for E(T'Y + c) if and only if whenever A'Y + d is unbiased for E(T'Y + c), then

cov(T'Y + c) ≤ cov(A'Y + d). Lemma 2.9.7. Let Y and Z be random vectors which follow the models described above. Then T'Y + c is a blue for its expectation under the model for Y if and only if T'Z + d is a blue for its expectation under the model for Z. Proof. Suppose T'Z + d is a blue under the model for Z and let A'Y + f be a linear estimator under the model for Y such that E(A'Y + f) = A'Xβ + f = E(T'Y + c) = T'Xβ + c for all β ∈ ΘY. Then if β̄ ∈ ΘY is fixed, A'Xβ̄ + f = T'Xβ̄ + c and E(A'Y + f – A'Xβ̄ - f) = A'Xβ + f - A'Xβ̄ – f = A'X(β – β̄) = A'Xα = E(A'Z) = T'Xβ + c – T'Xβ̄ – c = T'X(β - β̄) = T'Xα = E(T'Z). Thus A'Z and T'Z both have the same expectations as will T'Z + d and A'Z + d. But since T'Z + d is a blue, we have that cov(T'Z + d) = cov(T'(Y - Xβ̄) + d) = cov(T'Y) = cov(T'Y + c) ≤ cov(A'Z + d) = cov(A'(Y - Xβ̄) + d) = cov(A'Y) = cov(A'Y + f). Hence T'Y + c is the blue for its expectation. Conversely, assume T'Y + c is a blue under the model for Y and let T'Z + d and A'Z + f both have the same expectation under the model for Z, i.e., E(A'Z + f) = E(A'(Y – Xβ̄)) + f = A'Xβ – A'Xβ̄ + f = E(T'Z + d) = E(T'(Y – Xβ̄) + d) = T'Xβ – T'Xβ̄ + d for all β ∈ ΘY. But since β̄ ∈ ΘY, it follows that d = f and that T'Z and A'Z both have the same expectations as do T'Y - T'Xβ̄ and A'Y – A'Xβ̄. Now observe that T'Y - T'Xβ̄ + T'Xβ̄ + c = T'Y + c and A'Y – A'Xβ̄ + T'Xβ̄ + c also have the same expectation and since T'Y + c is a blue, it follows that cov(T'Z + d) = cov(T'Z) = cov(T'Y – T'Xβ̄) = cov(T'Y) = cov(T'Y + c) ≤ cov(A'Y - A'Xβ̄ + T'Xβ̄ + c) = cov(A'(Y - Xβ̄)) = cov(A'Z) = cov(A'Z + f). Since this same argument holds for any unbiased estimator of E(T'Z + d), we have that T'Z + d is a blue.
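The following is a small numerical sketch of the reduction just described, under the simplifying assumption V = In and with an entirely hypothetical constraint and data set: a particular solution β̄ of Δ'β = ξ is used to shift Y to Z = Y - Xβ̄, the homogeneous-constraint model for Z is fit by writing α = Bγ with B a basis of N(Δ'), and the estimate is shifted back.

```python
# Sketch of Lemmas 2.9.5/2.9.7: nonhomogeneous constraint handled via Z = Y - X beta_bar.
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3
X = rng.normal(size=(n, p))                       # hypothetical design
Delta = np.array([[1.0, 1.0, 1.0]]).T             # hypothetical constraint: beta1+beta2+beta3 = xi
xi = np.array([2.0])

# Any particular solution of Delta' beta = xi can serve as beta_bar.
beta_bar, *_ = np.linalg.lstsq(Delta.T, xi, rcond=None)

beta_true = beta_bar + np.array([0.5, -0.5, 0.0]) # satisfies the constraint
y = X @ beta_true + 0.1 * rng.normal(size=n)      # cov(Y) = sigma^2 I assumed

# Shifted data follow a model with homogeneous constraint Delta' alpha = 0:
z = y - X @ beta_bar
_, _, vt = np.linalg.svd(Delta.T)
B = vt[1:, :].T                                   # columns span N(Delta')
gamma_hat, *_ = np.linalg.lstsq(X @ B, z, rcond=None)
alpha_hat = B @ gamma_hat                         # gm estimate for alpha in the Z model
beta_hat = alpha_hat + beta_bar                   # corresponding estimate for beta in the Y model
print(beta_hat, Delta.T @ beta_hat)               # the constraint returns xi
```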

At this point, the reader is reminded that under the model for Z, T'Z + d is a blue if and only if R(T) ⊂ MZ = {V-1Xρ: Δ'ρ = 0s}. But from Lemma 2.9.7, we have that T'Z + d is a blue under the model for Z if and only if T'Y + c is a blue under the model for Y. Thus to check if T'Y + c is a blue, we need only to check that R(T) ⊂ MZ. Since the conditions for T'Y + c to be a blue in the model for Y are the same as those for T'Z + d to be a blue in the model for Z, we can let MY = MZ and NY = NZ and let MY and NY play the same roles in the model for Y as MZ and NZ play in the model for Z. Definition 2.9.8. Let Y be a random vector following the model described above, i.e., E(Y) = Xβ, Δ'β = ξ. Then a random vector β̂ is said to have the Gauss-Markov (gm) property for β provided that whenever Λ'β + d is an estimable parametric vector in the model for Y, Λ'β̂ + d is the blue for Λ'β + d. Lemma 2.9.9. Let Y and Z be random vectors that follow the models described above and let β̄ be any solution to the equations Δ'β = ξ. (a) If α̂ is gm for α in the model for Z, then β̂ = α̂ + β̄ is gm for β in the model for Y. (b) If β̂ is gm for β in the model for Y, then α̂ = β̂ - β̄ is gm for α in the model for Z. Proof. (a) Suppose α̂ is gm for α in the model for Z. Then Λ'α̂ + b is the blue for Λ'α + b whenever Λ'α + b is estimable under the model for Z. We must show that Λ'(α̂ + β̄) + c is the blue for Λ'β + c whenever Λ'β + c is estimable in the model for Y. So assume Λ'β + c is estimable in the model for Y. Since ΘY = β̄ + ΘZ, each β ∈ ΘY can be written as β = α + β̄ for some α ∈ ΘZ. Thus we can rewrite Λ'β + c as Λ'β + c = Λ'α + Λ'β̄ + c where Λ'α + Λ'β̄ + c is estimable in the model for Z by lemma 2.9.5. But then the blue for Λ'α + Λ'β̄ + c is Λ'α̂ + Λ'β̄ + c = T'Z + d for some matrix T having R(T) ⊂ MZ = MY. Now observe that Λ'β̂ + c = Λ'(α̂ + β̄) + c = Λ'α̂ + Λ'β̄ + c = T'Z + d = T'Y – T'Xβ̄ + d where the last expression is a blue in the model for Y since R(T) ⊂ MY = MZ. Also we have that E(Λ'β̂ + c) = E(Λ'α̂ + Λ'β̄ + c) = Λ'α + Λ'β̄ + c = Λ'(α + β̄) + c = Λ'β + c.

Thus Λ'β̂ + c is the blue for Λ'β + c which implies that β̂ is gm for β. (b) Suppose β̂ is gm for β under the model for Y, i.e., Λ'β̂ + d is the blue for Λ'β + d whenever Λ'β + d is estimable under the model for Y. We must now show that if α̂ = β̂ - β̄, then Λ'α̂ + c is the blue for Λ'α + c whenever Λ'α + c is estimable under the model for Z. So suppose Λ'α + c is estimable under the model for Z. Since ΘZ = ΘY - β̄, each α ∈ ΘZ can be expressed as α = β - β̄ for some β ∈ ΘY. But then we can rewrite Λ'α + c as Λ'α + c = Λ'(β - β̄) + c = Λ'β – Λ'β̄ + c where Λ'β – Λ'β̄ + c is estimable in the model for Y by lemma 2.9.5. Since β̂ is gm for β, the blue for Λ'β – Λ'β̄ + c is Λ'β̂ – Λ'β̄ + c = T'Y + d for some matrix T having R(T) ⊂ MY = MZ. Now Λ'α̂ + c = Λ'(β̂ - β̄) + c = Λ'β̂ – Λ'β̄ + c = T'Y – T'Xβ̄ + T'Xβ̄ + d = T'Z + T'Xβ̄ + d where R(T) ⊂ MZ = MY, hence Λ'α̂ + c = T'Z + T'Xβ̄ + d is a blue under the model for Z. But we also have that E(Λ'α̂ + c) = E[Λ'(β̂ - β̄) + c] = Λ'(β - β̄) + c = Λ'α + c, thus Λ'α̂ + c is the blue for Λ'α + c which implies that α̂ is gm for α under the model for Z. Another characterization is now given for vectors β̂ having the gm property under the model for Y. Recall that α̂ is gm for α in the model for Z if and only if α̂ satisfies one of the following two conditions: (1) Xα̂ = PZ'Z and Δ'α̂ = 0s. (2) W'Xα̂ = W'Z and Δ'α̂ = 0s. (2.9.10) where in (2.9.10(1)) PZ is the projection on MZ along NZ and in (2.9.10(2)) W is any matrix such that R(W) = MZ. Theorem 2.9.11. (a) A random vector β̂ is gm for β under the model for Y if and only if Xβ̂ = PY'Y + NY'Xβ̄ and Δ'β̂ = ξ where PY = PZ is the projection on MZ = MY along NZ = NY, NY = In – PY and β̄ is any solution to Δ'β = ξ.

(b) A random vector β̂ is gm for β under the model for Y if and only if it satisfies W'Xβ̂ = W'Y and Δ'β̂ = ξ where W is any matrix such that R(W) = MY. Proof. (a) Suppose β̂ is gm for β under the model for Y. Then by Lemma 2.9.9, α̂ = β̂ - β̄ is gm for α under the model for Z and from (2.9.10(1)) above, must satisfy Xα̂ = Xβ̂ - Xβ̄ = PZ'Z = PZ'Y – PZ'Xβ̄ and Δ'α̂ = 0s. But then this implies that Xβ̂ = PZ'Y + Xβ̄ – PZ'Xβ̄ = PZ'Y + NZ'Xβ̄ and 0s = Δ'α̂ = Δ'(β̂ - β̄) which means that Δ'β̂ = Δ'β̄ = ξ. Replacing PZ by PZ = PY then gives the desired result. Conversely, suppose β̂ satisfies the equations given. Then Xβ̂ - Xβ̄ = Xα̂ = PY'(Y - Xβ̄) = PZ'(Y - Xβ̄) = PZ'Z and Δ'β̂ = ξ = Δ'β̄ which implies Δ'(β̂ - β̄) = Δ'α̂ = 0s. Hence α̂ is gm for α under the model for Z and β̂ = α̂ + β̄ is gm for β under the model for Y by lemma 2.9.9. (b) Suppose β̂ is gm for β. Then by (a) above, β̂ must satisfy the equations Xβ̂ = PY'Y + NY'Xβ̄ and Δ'β̂ = ξ. Since R(W) = MY and PY is the projection on MY along NY, it follows that PYW = W, W'PY' = W' and NYW = (In – PY)W = 0. Thus W'Xβ̂ = W'PY'Y + W'NY'Xβ̄ = W'Y + 0t = W'Y and Δ'β̂ = ξ. Conversely, suppose β̂ satisfies the given equations. Using the relations between PY and W established above, it follows that W'Xβ̂ = (PYW)'Xβ̂ = W'PY'Xβ̂ and W'Y = (PYW)'Y = W'PY'Y, which implies W'(PY'Xβ̂ - PY'Y) = 0t. So R(PY'Xβ̂ – PY'Y) ⊂ N(W') ∩ R(PY') = R(W)⊥ ∩ N(PY)⊥ = MY⊥ ∩ NY⊥ = 0n which means that PY'Xβ̂ = PY'Y. But Δ'β̂ = ξ, hence Xβ̂ ∈ {Xρ: Δ'ρ = 0s} + Xβ̄ = N⊥ + Xβ̄ from which it follows that X(β̂ - β̄) ∈ N⊥ = R(PY'). Hence PY'X(β̂ - β̄) = X(β̂ - β̄), PY'Xβ̂ – PY'Xβ̄ = Xβ̂ - Xβ̄ and from this last expression we get

Xβ̂ = PY'Xβ̂ – PY'Xβ̄ + Xβ̄ = PY'Y + NY'Xβ̄, so β̂ is gm for β by part (a). Another relationship between the models for Y and Z occurs with respect to the estimation of σ2. In particular, we saw in section 2.7 that under the model for Z, σ̂2 = Z'(In – PZ)V-1(In – PZ)'Z/(n – m) is an unbiased estimator for σ2 where PZ is the projection on MZ = MY along NZ = NY and m = dim ΩZ. Since Z = Y - Xβ̄, the equivalent unbiased estimator for σ2 under the model for Y is

σ̂2 = (Y - Xβ̄)'(In – PY)V-1(In – PY)'(Y - Xβ̄)/(n – m)

(2.9.12)

where PY = PZ. The numerator of the expression given in (2.9.12) is usually denoted by R̂Y and is called the weighted corrected residual sum of squares under the model for Y. As a final observation in this section, we note that by using lemma 2.9.9, we can find a gm estimator for β in the model for Y by first finding a gm estimator for α in the computationally simpler model for Z. To estimate α, one can often use the technique described just prior to lemma 2.6.6 with the potentiality of applying theorem 2.6.7.
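Below is a hedged sketch of Theorem 2.9.11(b) in the special case V = In (so that MY reduces to {Xρ: Δ'ρ = 0} and PY is the orthogonal projection onto it): a Gauss-Markov vector β̂ is found by solving the equations W'Xβ = W'Y and Δ'β = ξ, and σ2 is then estimated from (2.9.12). The constraint, design and data are invented for illustration.

```python
# Sketch: solving the gm equations of Theorem 2.9.11(b) with V = I_n.
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 3
X = rng.normal(size=(n, p))
Delta = np.array([[1.0, -1.0, 0.0]]).T            # hypothetical constraint beta1 - beta2 = xi
xi = np.array([1.0])
beta_true = np.array([1.5, 0.5, 2.0])             # satisfies the constraint
y = X @ beta_true + 0.2 * rng.normal(size=n)

# Basis B of N(Delta'); then the columns of W = X B span M_Y (here V = I).
_, _, vt = np.linalg.svd(Delta.T)
B = vt[1:, :].T
W = X @ B

# Stack W'X beta = W'Y with Delta' beta = xi and solve.
A = np.vstack([W.T @ X, Delta.T])
b = np.concatenate([W.T @ y, xi])
beta_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

# Unbiased variance estimate from (2.9.12) with P the orthogonal projection onto R(W).
P = W @ np.linalg.pinv(W)
m = np.linalg.matrix_rank(W)
beta_bar, *_ = np.linalg.lstsq(Delta.T, xi, rcond=None)
r = (y - X @ beta_bar) - P @ (y - X @ beta_bar)
sigma2_hat = (r @ r) / (n - m)
print(beta_hat, sigma2_hat)
```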

2.10 Sampling Distributions of Estimators If one wishes to find confidence intervals and test hypotheses about individual parametric vectors, additional assumptions about the distribution of the random vector Y are needed. The usual assumption for Y is that Y ~ Nn(Xβ, σ2V) where σ2 > 0 is an unknown parameter and V > 0. Using this assumption, confidence intervals and tests of hypotheses for individual estimable parametric functions of the form π'β + c can be obtained using the following lemma. Lemma 2.10.1. Consider the parameterization E(Y) = Xβ, Δ'β = ξ and let β̂ be gm for β. Also let π'β + c be an estimable parametric function, let PY be the projection on MY along NY where MY and NY are as previously defined in this section, let NY = In - PY and let β̄ be some fixed solution to Δ'β = ξ. Then the following statements hold assuming Y ~ Nn(Xβ, σ2V): (a) π'β̂ + c ~ N(π'β + c, σ2ρ1'PY'VPYρ1) where π = X'ρ1 + Δρ2.

(b) (1/σ2)(Y - Xβ̄)'(In – PY)V-1(In – PY)'(Y - Xβ̄) ~ χ2(n – m) where m = r(PY). (c) π'β̂ + c from (a) and (1/σ2)(Y - Xβ̄)'(In – PY)V-1(In – PY)'(Y – Xβ̄) from (b) are mutually independent. (d) T = (π'β̂ – π'β)/[ρ1'PY'VPYρ1]1/2/[(Y - Xβ̄)'(In – PY)V-1(In – PY)'(Y - Xβ̄)/(n – m)]1/2 ~ T(n - m). Proof. (a) Since β̂ is gm for β, it satisfies Xβ̂ = PY'Y + NY'Xβ̄, Δ'β̂ = ξ and since π'β + c is estimable, π = X'ρ1 + Δρ2 for some vectors ρ1 and ρ2 where ρ2'ξ + c = d for an appropriate scalar d. But then π'β̂ + c = ρ1'Xβ̂ + ρ2'Δ'β̂ + c = ρ1'Xβ̂ + ρ2'ξ + c = ρ1'PY'Y + ρ1'NY'Xβ̄ + ρ2'ξ + c is a linear function of Y, π'β̂ + c is unbiased for π'β + c and cov(π'β̂ + c) = cov(ρ1'PY'Y + ρ1'NY'Xβ̄ + ρ2'ξ + c) = cov(ρ1'PY'Y) = σ2ρ1'PY'VPYρ1 from which it follows that π'β̂ + c ~ N1(π'β + c, σ2ρ1'PY'VPYρ1). (b) By assumption, Y ~ Nn(Xβ, σ2V), Δ'β = ξ. Applying lemma 1.6.1 to the quadratic form (1/σ2)(Y – Xβ̄)'(In – PY)V-1(In – PY)'(Y - Xβ̄) and using the facts that PYV-1PY' = PYV-1 = V-1PY' [see (2.5.12)], we see that (In – PY)V-1(In – PY)'σ2V(In – PY)V-1(In – PY)' = σ2(In – PY)(In – V-1PY'V)(In – PY)V-1(In – PY)' = σ2(In – PY)(In – PY)(In – PY)V-1(In – PY)' = σ2(In – PY)V-1(In – PY)'. Hence, it follows from lemma 1.6.1 that (1/σ2)(Y - Xβ̄)'(In – PY)V-1(In – PY)'(Y - Xβ̄) ~ χ2[r((In – PY)V-1(In – PY)'), λ] = χ2[r(In – PY), λ] where λ = (Xβ - Xβ̄)'(In – PY)V-1(In – PY)'(Xβ - Xβ̄) = 0.

(c) From (a) we have that π'β̂ + c = ρ1'PY'Y + ρ1'NY'Xβ̄ + ρ2'ξ + c which is a function of PY'Y, and since (1/σ2)(Y - Xβ̄)'(In – PY)V-1(In – PY)'(Y - Xβ̄) is a function of (In – PY)'Y, it suffices to show that PY'Y and (In – PY)'Y are mutually independent. Now, using proposition 1.6.6 and the fact that PY'VPY = PY'V = VPY (see (2.5.12)), we have that (In – PY)'σ2VPY = σ2(VPY – PY'VPY) = σ2(VPY – VPY) = 0nn. Thus PY'Y and (In – PY)'Y are independent which implies that π'β̂ + c and (Y - Xβ̄)'(In – PY)V-1(In – PY)'(Y - Xβ̄) are independent. (d) Let Z = [π'β̂ – π'β]/[σ2ρ1'PY'VPYρ1]1/2 and U = (1/σ2)(Y - Xβ̄)'(In – PY)V-1(In – PY)'(Y - Xβ̄). Then Z ~ N1(0, 1) and it follows from part (b) above that U ~ χ2(n – m). Also, we have from part (c) that Z and U are independent, thus T = Z/[U/(n – m)]1/2 ~ T(n – m). One can now use the T-statistic given in lemma 2.10.1 to find confidence intervals and carry out tests of hypotheses on individual parametric functions using familiar T-procedures such as T-tests.
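As a brief illustration of the T-procedure suggested by Lemma 2.10.1, here is a minimal sketch specialized to the unconstrained case with V = In, so that π'β̂ is the ordinary least squares estimate and the standard error takes its familiar form. The design, data and choice of π are all hypothetical.

```python
# Sketch: a 95% t-interval for an estimable function pi'beta (V = I_n, no constraints).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([2.0, 1.0, -1.0])
y = X @ beta_true + 0.5 * rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
m = np.linalg.matrix_rank(X)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - m)          # unbiased estimate of sigma^2

pi = np.array([0.0, 1.0, -1.0])               # hypothetical estimable function beta2 - beta3
est = pi @ beta_hat
se = np.sqrt(sigma2_hat * pi @ XtX_inv @ pi)
t_crit = stats.t.ppf(0.975, df=n - m)
print(est - t_crit * se, est + t_crit * se)   # 95% confidence interval
```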

2.11 Problems for Chapter 2 1. Suppose Y is a random vector such that E(Y) = Xβ, β ∈ Θ where Θ is a subspace of Rp. Do the following: (a) Let λ, δ ∈ Rp. Show that the parametric functions λ'β and δ'β are identical if and only if λ – δ ∈ Θ⊥. (b) Show that a parametric function λ'β is unique if and only if Θ = Rp.

2. (Seely (1989)) Consider a model of the form E(Y) = Xβ, β ∈ Rp. If X = (X1,…,Xp) and β = (β1,…,βp)', show that β1 is estimable if and only if X1 cannot be written as a linear combination of X2,…,Xp. 3. (Seely (1989)) Consider a model of the form E(Y) = Xβ, β ∈ Θ where Θ is a subspace of Rp. Suppose t'Y is a blue. By considering estimators of the form a'Y + c where c is a constant, is it possible to obtain an estimator having the same expectation as t'Y and less variance? Explain your reasoning. 4. Let Y1, Y2, and Y3 be independent random variables such that E(Y1) = α, E(Y2) = 2α – 3ξ and E(Y3) = α + 4ξ. Assume var(Yi) = σ2 for i = 1,2,3. Find the blues for α and ξ and their covariance matrix. 5. Consider the model E(Yi) = β0 + β1xi + β2(xi2 – 4) for i = 1,2,3 where x1 = 0, x2 = 1, and x3 = 2. Assuming var(Yi) = σ2 for i = 1,2,3 and all observations are independent, find the blues for β0, β1 and β2 and their covariance matrix. 6. In order to estimate two parameters α and ξ, suppose we take m observations having expectation α - ξ, m observations having expectation α - 2ξ and n observations having expectation α + 3ξ. Suppose all observations are independent and have constant variance σ2. (a) Find the blues for α and ξ and their variances. (b) Are there any values for m and n for which the blues for α and ξ are uncorrelated? If so, give them. 7. Suppose Y1 and Y2 are independent random variables such that E(Y1) = β1x1, var(Y1) = σ2, E(Y2) = β2x2 and var(Y2) = 3σ2. Find the blues for β1 and β2 and find the covariance matrix of your estimates. 8. Suppose Y1,…,Yn are independent random variables having mean E(Yi) = μ for all i and var(Yi) = σ2/ai for all i. Find the blue for μ and its variance. 9. Let Y1,…,Yn be independent random variables with E(Yi) = iβ and var(Yi) = i2σ2 for i = 1,…,n. Find the blue for β and its variance. Problems 10-14 (Seely (1989)) below relate to the following paragraph concerning solving equations for finding gm vectors:

Assume a model of the form E(Y) = Xβ, Δ'β = 0, cov(Y) = σ2In, and suppose that X, β and Δ are partitioned so that Xβ = (X1, X2)(β1', β2')' and Δ'β = Δ1'β1 + Δ2'β2 = 0 where Δ1' is sxs, Δ2' is sxv, and X1, X2, β1, and β2 have dimensions conformable with Δ1' and Δ2'. Assume Δ1' is nonsingular. For this situation a natural way of handling constraints is to first solve for β1 in terms of β2, which leads to β1 = -Lβ2 where L = (Δ1')-1Δ2'. Then substitute the solution for β1 into E(Y) to obtain E(Y) = Uβ2 where U = X2 – X1L. In problem 10 below, you are asked to show that Ω = {Xβ: Δ'β = 0} = R(U). This latter expression implies that E(Y) = Uβ2, β2 ∈ Rv, is also a parameterization for E(Y). Ordinarily we would not use the same symbol β2 in two different parameterizations for E(Y); however, problems 11 and 12 below justify this double usage by establishing that no confusion will arise with respect to estimability and blues for parametric functions of the form λ2'β2. 10. Show R(U) = Ω. 11. For λ2 ∈ Rv, show that the parametric function λ2'β2 is estimable with respect to the parameterization E(Y) = Xβ, Δ'β = 0 if and only if λ2'β2 is estimable with respect to the parameterization E(Y) = Uβ2, β2 ∈ Rv. 12. If β̂2 is gm for β2 with respect to E(Y) = Uβ2, β2 ∈ Rv, then show that β̂1 = -Lβ̂2 and β̂2 are jointly gm for β with respect to E(Y) = Xβ, Δ'β = 0. Conversely, if β̂ = (β̂1', β̂2')' is gm for β with respect to E(Y) = Xβ, Δ'β = 0, then show that β̂2 is gm for β2 with respect to E(Y) = Uβ2, β2 ∈ Rv. 13. Show that the parametric vector β is estimable if and only if the parametric vector β2 is estimable. 14. Suppose that β is estimable and that β̂ is the blue for β. Show that cov(β̂) = σ2A(U'U)-1A' where A = [-L', Iv]'. Problems 15 and 16 below refer to the following paragraph concerning artificial solutions to the normal equations for finding gm estimators: Assume a model of the form E(Y) = Xβ, Δ'β = 0s with the additional assumption that R(X') ∩ R(Δ) = 0p and r(X') + r(Δ) = p where Δ' is an sxp matrix. Now let β̃ be any random vector satisfying W'Wβ̃ = W'Z where Z = (Y', 0s')' and W = (X', Δ)'. 15. (Seely (1989)) Verify the following assertions:

(a) W'W is nonsingular. (b) Δ'β̃ = 0s and X'Xβ̃ = X'Y. (c) β̃ is the blue for β. (d) If Λ'β is a parametric vector such that R(Λ) ⊂ R(X'), then cov[Λ'β̃] = σ2Λ'(W'W)-1Λ. 16. (Seely (1989)) Consider the same situation as in problem 15 and answer each of the following questions. Be sure to justify your answer. (a) Is β̃ = β̂ where β̂ is gm for β under the model E(Y) = Xβ, Δ'β = 0 described in the paragraph preceding problem 15? (b) Is cov(β̃) = σ2(W'W)-1? (c) Is R̂ = (Z - Wβ̃)'(Z - Wβ̃)? (d) Is (W'W)-1 a generalized inverse for either X'X or ΔΔ'? The following problems are numerical exercises. 17. (Piecewise linear regression model) (Seely (1989)) Suppose y is a function of x ∈ R1 defined by y(x) = α1 + β1x for x ≤ 5, y(x) = α2 + β2x for x > 5, y(5) = α1 + β1(5) = α2 + β2(5), where α1, α2, β1, β2 denote unknown parameters. From a series of nine independent experiments at the levels x1, … , x9 indicated below, the corresponding yi observations were obtained in such a way that each yi may be thought of as an observation on a random variable Yi with the property that E(Yi) = y(xi) and var(Yi) = σ2. Assume the following data was obtained: x1 = -1, y1 = 1; x2 = 0, y2 = 1; x3 = 3, y3 = 4; x4 = 4, y4 = 5; x5 = 5, y5 = 4; x6 = 6, y6 = 6; x7 = 10, y7 = 21; x8 = 20, y8 = 53; x9 = 30, y9 = 75.

Set π = (α1, β1, α2, β2)'. Let π̂ be the blue for π and let D be such that cov(π̂) = σ2D. Do the following: (a) Determine m = dim Ω. (b) Why does π̂ exist? (c) Evaluate π̂ for the above set of outcomes. (d) Determine D. (e) Determine the value of R̂ for the above set of outcomes. (f) Determine the estimate for the parametric function β1 – β2 and the variance of the estimator. 18. (Piecewise quadratic regression model) (Seely (1989)) This problem is an extension of problem 17 above. Suppose y is a function defined for each x ∈ R1 by y(x) = α1 + β1x + ξ1x2 if x ≤ 4, y(x) = α2 + β2x + ξ2x2 if x > 4, and the Greek letters denote unknown parameters. Further suppose it is known that the function y(x) and the unknown parameters satisfy the following relationships: α1 + β1(4) + ξ1(16) = α2 + β2(4) + ξ2(16), β1 + 2ξ1(4) = β2 + 2ξ2(4). Suppose a series of independent experiments are carried out at levels xi in such a way that observations yi may be thought of as observations on random variables Yi with the property that E(Yi) = y(xi) and var(Yi) = σ2. Suppose the following xi's and yi's were obtained from the experiment: x1 = -3, y1 = -11; x2 = 1, y2 = 0; x3 = 2, y3 = 5; x4 = 3, y4 = 9; x5 = 1, y5 = 3; x6 = 4, y6 = 12; x7 = 6, y7 = 30; x8 = 5, y8 = 15; x9 = 10, y9 = 80; x10 = 7, y10 = 40. (a) Find the blues for the unknown parameters.

(b) Obtain the covariance matrix for the blue estimates of the unknown parameters. (c) Find R̂. 19. (Analysis of covariance model) (Seely (1989)) Assume {Yijk} is a collection of random variables such that Yijk has variance σ2 and an expectation E(Yijk) = μ + τi + δj + θxijk. Suppose i = 1,2, j = 1,2,3, k = 1,…,nij, and that the xijk are known real numbers. Let Y denote the vector of Yijk and let E(Y) = Xβ, β unknown, denote the above parameterization. Do parts (a) – (f) below assuming the set of outcomes {yijk} and their associated {xijk} values are as follows: x111 = 1, y111 = 9; x112 = 2, y112 = 4; x131 = -2, y131 = 10; x132 = 114, y132 = 2; x133 = 1, y133 = 9; x211 = -1, y211 = 5; x221 = 3, y221 = -6; x222 = 0, y222 = 3; x231 = 3, y231 = 1. (a) Determine m = dim Ω. (b) Determine which of the individual parameters are estimable. (c) For what values of i = 1,2, j = 1,2,3 and x ∈ R1 is the parametric function μ + τi + δj + θx estimable? (d) Evaluate β̂ for the above data. (e) Verify that the parametric vector Λ'β = (τ1 – τ2, δ1 – δ2, δ1 – δ3)' is estimable. Then determine an estimate for Λ'β and the covariance matrix of the estimator. (f) Determine the value of R̂. 20. (Seely (1989)) Suppose the data below represents observations on seven independent random variables Yi, i = 1,…,7, each with a common unknown variance σ2 and having an expectation of the form E(Yi) = α1x1i + α2x2i + α3x3i, i = 1,…,7, where α1, α2, and α3 are parameters known to satisfy the restriction α1 + α2 + α3 = 1. Assume the following data: x11 = 4, x21 = 6, x31 = 2, y1 = 3.5; x12 = 2, x22 = 3, x32 = 7, y2 = 4.1; x13 = 5, x23 = 9, x33 = 4, y3 = 5; x14 = 3, x24 = 1, x34 = 1, y4 = 1.6; x15 = 4, x25 = 2, x35 = 3, y5 = 3.9; x16 = 1, x26 = 4, x36 = 5, y6 = 4.1; x17 = 8, x27 = 3, x37 = 9, y7 = 6.8.

(a) Find the blue estimates of α1, α2, and α3. (b) Estimate σ2. (c) Obtain the joint covariance matrix for the blues determined in part (a). (d) Find cov(2α̂1 - α̂2, -α̂1 + 2α̂2 + 3α̂3) where the α̂i, i = 1,2,3, are the blues obtained in part (a).

CHAPTER 3 PARAMETERIZATIONS AND CORRESPONDENCE

3.1 Introduction In this chapter, we begin by considering random vectors having expectation spaces which are affine sets in Rn. In particular, we consider random vectors Y having cov(Y) = σ2V and parameterizations for E(Y) of the form E(Y) = Xβ where Δ'β = ξ. While the basic theory of best linear unbiased estimation was developed in most of chapter 2 for random vectors following models where the expectation space is a subspace of Rn, methods for dealing with models where the expectation space is an affine set in Rn were developed in section 2.9. In this chapter we consider several additional aspects of the theory. The majority of the chapter is used to discuss a relation called "correspondence" between parametric vectors from different parameterizations and various ways of determining unconstrained full rank parameterizations. While the discussion and results given in this chapter are alluded to in other texts on linear models, the material presented here on correspondence is the only formal development of relationships between parametric vectors from different parameterizations for an expectation space known to the author and is completely due to Justus Seely (1989). One of the primary reasons for discussing these additional topics is computational. That is, the topics discussed illustrate how to determine blues via regression routines for full rank parameterizations which are available on almost any computer. A secondary reason is to illustrate, via examples, that different parameterizations of a linear model can also provide alternative interpretations for the data obtained in a study.

3.2 Correspondence In this section we investigate relationships between different parameterizations for E(Y). So to begin, take Y to be a random vector such

that E(Y) ∈ Ω where Ω = {Xβ: Δ'β = ξ} is an affine set in Rn and cov(Y) = σ2V. Suppose in what follows that E(Y) = Uα, Γ'α = δ, is an alternative parameterization for E(Y). Let P be the projection on M along N as defined in chapter 2, let N = In – P and let R̂ denote the residual sum of squares. Now recall from chapter 2 that P, N, and R̂ depend only on the expectation space Ω and not on any particular parameterization for E(Y). On the other hand, the ideas of an estimable parametric vector and of a random vector having the gm property do depend upon a particular parameterization for E(Y). Thus a statement like "π'α + c is an estimable parametric vector" must be interpreted with respect to the parameterization E(Y) = Uα, Γ'α = δ, whereas a statement like "β̂ is gm for β" must be interpreted with respect to the parameterization E(Y) = Xβ, Δ'β = ξ. We also note that the theory developed in Chapter 2 applies to both of the assumed parameterizations for E(Y). Because the results obtained in this chapter seldom depend upon the covariance structure of Y, we will typically state our results without reference to the covariance structure of Y. The primary purpose of this section is to explore in what ways parametric vectors associated with different parameterizations for E(Y) can be thought of as the same. While there are several ways this might be done, our approach is in the same spirit as our original definition of identifiability. That is, when the mean vectors under two parameterizations are the same, the values of the "corresponding" parametric vectors should also be the same. This is formulated in the following definition. Definition 3.2.1. (Seely (1989)) Suppose Λ'β + c and π'α + d are parametric vectors. We say that Λ'β + c and π'α + d correspond and write Λ'β + c ≅ π'α + d provided that Λ'β̄ + c = π'ᾱ + d whenever β̄ and ᾱ satisfy Δ'β̄ = ξ and Γ'ᾱ = δ, respectively, and are such that Xβ̄ = Uᾱ. Proposition 3.2.2. (Seely (1989)) Suppose Λ'β + c and π'α + d are parametric vectors such that Λ'β + c ≅ π'α + d and assume that β̂ and α̂ are gm for β and α, respectively. Then: (a) Λ'β̂ + c = π'α̂ + d. (b) Λ'β + c and π'α + d are both estimable parametric vectors.

Proof. (a) First observe that Ω = {Xβ: Δ'β = ξ} = {Uα: Γ'α = δ}. Now choose β̄ and ᾱ such that Δ'β̄ = ξ, Γ'ᾱ = δ and Xβ̄ = Uᾱ. Then -Xβ̄ + Ω = {X(β - β̄): Δ'(β - β̄) = 0} = {Xρ: Δ'ρ = 0} = -Uᾱ + Ω = {U(α - ᾱ): Γ'(α - ᾱ) = 0} = {Uτ: Γ'τ = 0}, and this latter set of equalities implies that M and N as defined in chapter 2 for a given parameterization are the same for both parameterizations for E(Y) given above. So let P be the projection on M along N and let N = In - P. Then by Theorem 2.9.11, β̂ and α̂ are gm for β and α, respectively, if and only if Xβ̂ = P'Y + N'Xβ̄ = P'Y + N'Uᾱ = Uα̂ and Δ'β̂ = ξ and Γ'α̂ = δ. But because Λ'β + c ≅ π'α + d, it follows that Λ'β̂ + c = π'α̂ + d. (b) First, observe that if β1, β2 ∈ {β: Δ'β = ξ} and Xβ1 = Xβ2, then we can find α1 ∈ {α: Γ'α = δ} such that Xβ1 = Xβ2 = Uα1. But since Λ'β + c ≅ π'α + d, we have that Λ'β1 + c = π'α1 + d = Λ'β2 + c, hence Λ'β + c is identifiable and estimable. Similarly, it follows that π'α + d is estimable. The above proposition establishes an important fact about the correspondence relation. In particular, if there is interest in determining the blue for an estimable parametric vector Λ'β + c via a parameterization with convenient computations, then we need only to know what Λ'β + c corresponds to in the computationally convenient parameterization in order to obtain our estimate. When working with the correspondence relation, it is often useful to realize that two parametric vectors correspond if and only if their respective component parametric functions correspond. Similarly, it is also useful to know and show that all linear operations can be performed on corresponding parametric vectors as though the relation "≅" were an equality, i.e., see problem 1 at the end of this chapter. Frequently, the observations just mentioned along with the elementary, but fundamental, correspondence Xβ ≅ Uα are adequate for determining particular correspondences between two parameterizations.
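The following is a small numerical sketch of how such correspondences can be exploited computationally for the one-way additive model treated in Example 3.2.3 below: the full rank parameterization is fit first, and the correspondences μ ≅ Σαi/t and τi ≅ αi - ᾱ are then used to recover estimates in the original parameterization. The group sizes and observations are invented.

```python
# Sketch: correspondence computations for E(Y_ij) = mu + tau_i with sum(tau_i) = 0.
import numpy as np

y = np.array([4.1, 3.9, 5.2, 5.0, 4.8, 6.3, 6.1])     # hypothetical observations
groups = np.array([0, 0, 1, 1, 1, 2, 2])              # t = 3 groups
t = groups.max() + 1
F = np.zeros((len(y), t))
F[np.arange(len(y)), groups] = 1.0                    # full rank one-way design

alpha_hat = np.linalg.solve(F.T @ F, F.T @ y)         # group means, since F'F = diag(n_i)
mu_hat = alpha_hat.mean()                             # mu corresponds to sum(alpha_i)/t
tau_hat = alpha_hat - mu_hat                          # tau_i corresponds to alpha_i - alpha_bar
print(alpha_hat, mu_hat, tau_hat)
```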

Example 3.2.3. (One-way additive model) (Seely (1989)) Assume {Yij} is a collection of random variables whose expectations are E(Yij) = μ + τi for i = 1,…,t, j = 1,…,ni, with Σ τi = 0, and that cov(Y) = σ2In. Write the one-way additive model in matrix form as E(Y) = Xβ, Δ'β = 0. It is easy to show Ω = R(X), dim(Ω) = r(X) = m = t and that any t columns of X = (X0, X1,…,Xt) form a basis for Ω. Let F = (X1,…,Xt). Then E(Y) = Fα, α ∈ Rt, is a full rank parameterization for E(Y). We now find the correspondences between the components of α and β. From Xβ ≅ Fα we get αi ≅ μ + τi, i = 1,…,t. However, there is often more interest in the reverse correspondences because F'F = diag(n1,…,nt) means that α̂ = (F'F)-1F'Y is easy to evaluate. Using the correspondences already determined and linear operations, we get Σαi ≅ Σ(μ + τi) = tμ + Στi = tμ. So μ ≅ Σαi/t = ᾱ and τi = (μ + τi) - μ ≅ αi - ᾱ for i = 1,…,t. From these correspondences, it is easy to determine any desired correspondences between the Fα and Xβ parameterizations. The previous example illustrates the determination of various correspondences using Xβ ≅ Uα and linear operations. We conclude this section with some alternative ways of determining correspondences. Proposition 3.2.4. (Seely (1989)) Suppose Λ'β + c and π'α + d are kx1 parametric vectors. Then the following statements are equivalent: (a) Λ'β + c ≅ π'α + d. (b) There exist matrices … ≅ π'α for any kxs matrix L. (b) If Λi'β and πi'α are corresponding d-dimensional parametric vectors for i = 1,2, then Λ1'β + Λ2'β ≅ π1'α + π2'α. 2. (Seely (1989)) Let Δ' = (Δ1', Δ2') be an sxp matrix. Suppose that Δ1' is nonsingular. Show that the columns of the matrix B = [-Δ2(Δ1)-1, Ip-s]' form a basis for N(Δ'). Discuss how one could use this result to find a matrix whose columns form a basis for N(Γ') where Γ' is an arbitrary qxp matrix. 3. (Seely (1989)) Suppose Y is a random vector such that E(Y) = Xβ, Δ'β = 0s, is a parameterization for E(Y). Suppose also that dim Ω = m. Let Λ'β be an mx1 estimable parametric vector. Show that the following statements are equivalent: (a) r(Λ) = m and R(Λ) ∩ R(Δ) = 0p. (b) There exists a matrix A satisfying Λ'A = Im and Δ'A = 0. (c) There exists a parameterization E(Y) = Uα, α ∈ Rm, such that α ≅ Λ'β. 4. (Seely (1989)) Suppose Y is a random vector such that E(Y) = Xβ, Δ'β = 0, is a parameterization for E(Y). Set U = XA where A is a pxs matrix. Show that E(Y) = Uα, α unknown, is a parameterization for E(Y) if and only if N(X) + N(Δ') = N(X) + R(A).

5. (Seely (1989)) Suppose {Yijk} is a collection of independent random variables that follow a two-way additive model with a common unknown variance σ2 and expectation of the form E(Yijk) = μ + τi + δj where i = 1,2,3,4; j = 1,2,3; and k = 0,1, … , nij. Suppose also that the following data represents an outcome on the random variables {Yijk} and that the Greek letters in the expectation denote unknown parameters. Assume the following data values: y111 = -1; y112 = 1; y113 = 2; y121 = 8; y211 = 5; y221 = 10; y222 = 9; y331 = 7; y332 = 9; y421 = 14; y422 = 13. Analyze this data and do the following: (a) Determine which of the following parametric vectors are estimable: τ1 – τ2, τ2 – τ3, δ1 – δ2, δ2 – δ3, δ2 – δ4, μ + τ2 + δ3. (b) For the set of estimable parametric vectors found in (a), find their blues and their joint covariance matrix. (c) Find the residual sum of squares and give an estimate for σ2. 6. (Seely (1989)) Suppose {Yij : i = 1,2,3,4; j = 1,2,…,nij} is a collection of random variables such that E(Yij) = μi and such that cov(Yij, Yi'j') = σ2 for i = i' and j = j', = (1/2)σ2 for i = i' and |j – j'| = 1, = 0 otherwise. Assume further that the following observations are available on the random variables indicated: y11 = 2; y12 = 4; y21 = 2; y31 = -5; y32 = -9; y33 = -6; y41 = -1; y42 = 1. Estimate the parameters μ1,…,μ4 and σ2 under each of the following assumptions on the μi's, assuming for each case that σ2 is unknown. (a) μi = iα for i = 1,2,3,4 where α is an unknown real number. (b) μi = μi+2 for i = 1,2. (c) μi = α + βti for i = 1,2,3,4 where α and β are unknown real numbers and t1 = 0, t2 = -1, t3 = 4 and t4 = 1.

(d)3μ1+μ2Ͳμ4=3,Ͳμ1+2μ2Ͳ2μ3+3μ4=9andͲ7μ1Ͳ2μ3+5μ4=3. 7.(Seely(1989))Assumetheinformationinthedatagivenbelowarethe outcomes from a Latin square experiment having two missing observations,i.e.,E(Yijk)=μ+ri+cj+ʏkwherei=1,2,3,4,j=1,2,3,4,andri =theeffectofrowi,cj=theeffectofcolumnjandʏ1ʏ2,ʏ3,andʏ4represent theeffectsoftreatmentsA,B,CandD,respectively.AssumethattheYijk‘s are all independent and have a constant variance ʍ2 and take on the followingvalues;y122=13;y133=10;y144=19; y21428;y221=8; y243=14; y313=22;y324=19;y331=6;y342=15;y412=22;y423=12;y434=28;y441=8. (a)Determinewhichofthefollowingparametricfunctionsareestimable: ʏ1–ʏ2,ʏ2–ʏ3,ʏ2–ʏ4,ʏ3–ʏ4. (b)Findthebluesforeachoftheparametricfunctionswhichareestimable inpart(a)andfindtheirjointcovariancematrix. ෡ andgiveanestimateforʍ2. (c)FindR 

CHAPTER 4 TESTING LINEAR HYPOTHESES

4.1 Introduction In this chapter we consider tests of hypotheses in linear models. To test hypotheses, additional assumptions about the random vector Y must be made. The typical assumption, and the one we make here, is that Y follows a multivariate normal distribution. We consider essentially two methods of testing. The first involves tests on E(Y). The second involves tests of linear parametric vectors. While both methods of testing can be shown to be essentially equivalent, there are some conceptual differences. Throughout this chapter, unless otherwise specified, we assume that the expectation space Ω for E(Y) is a subspace of Rn. The reason for this is to simplify notation. We consider tests when the expectation space is an affine set in the last two sections of this chapter. In the next section, we cover some preliminary notions concerning tests of hypotheses in linear models. In section 4.3, we give an intuitive derivation of the usual F test for linear hypotheses whereas in section 4.4, a more quantitative derivation for the F-statistic using the likelihood ratio principle is given. In section 4.5 we consider the problem of formulating hypotheses using parametric vectors. In section 4.6 we discuss several ANOVA tables often associated with testing a linear hypothesis whereas in section 4.7 we give an alternative form of the F-statistic used for testing linear hypotheses. Finally, in sections 4.8 and 4.9 we consider the more general problem of testing hypotheses in linear models where the expectation space is an affine subset of Rn.

4.2 Some Preliminaries In this section we give some preliminary discussion concerning testing linear hypotheses in models under the assumption that E(Y) ∈ Ω where Ω is a known subspace of Rn and where cov(Y) = σ2V. Thus throughout this section we assume that Y is a random vector distributed according to some

distribution in the collection P = {Nn(μ, σ2V): μ ∈ Ω, σ2 > 0} where Ω is a known subspace. We will generally refer to the model having E(Y) ∈ Ω as "the initial or full model" associated with a test. Relative to the full model, let P, N and m be defined in the usual way, i.e., P is the projection on M along N where M and N are as defined in chapter 2, N = In – P and m = dim Ω. Throughout this chapter, unless specified otherwise, we assume that n > m. Definition 4.2.1. A linear hypothesis is any hypothesis testing statement about E(Y) that can be formulated as H: E(Y) ∈ ΩH vs A: E(Y) ∈ ΩA where ΩH is a subspace of Ω and ΩA consists of all vectors in Ω that are not in ΩH, i.e., ΩA = Ω - ΩH. In any such linear hypothesis, the statement H: E(Y) ∈ ΩH is called the null hypothesis whereas the statement A: E(Y) ∈ ΩA is called the alternative hypothesis. We note that an essential feature of definition 4.2.1 is that in formulating the null hypothesis we assume that ΩH is a subspace, which is similar to the assumption being made on Ω, and that it is consistent with our knowledge of E(Y) (ΩH ⊂ Ω). We will often use the terminology "reduced model" to designate the linear model structure described by the null hypothesis of a linear hypothesis. In particular, in definition 4.2.1, the reduced model would refer to the linear model E(Y) ∈ ΩH and cov(Y) = σ2V where σ2 > 0. We call the model specified by the null hypothesis the reduced model because usually the expectation space associated with the reduced model is smaller than that of the full model. Generally, when reference is made to a linear hypothesis, only the portion of the null hypothesis describing the mean vector of Y is specified. For example, it is common to specify a linear hypothesis as H: E(Y) ∈ ΩH. In such a statement, unless mentioned otherwise, it is assumed that ΩH is a subspace contained in Ω and that the alternative hypothesis is as defined in definition 4.2.1. We shall refer to any hypothesis that is stated as in definition 4.2.1 as a hypothesis on E(Y). Example 4.2.2. (Seely (1989)) Suppose Yij ~ N(βi, σ2) where i = 1,2,3 and j = 1,…,ni with n1 = n3 = 2 and n2 = 3 and all the Yij's are independent. Assume that β = (β1, β2, β3)' is unknown, that σ2 is unknown, and let X be such that E(Y) = Xβ, β ∈ R3. Here the class P of distributions has n = 7 and Ω = R(X). Now suppose we wish to test a hypothesis of the form H: E(Yij) = βH, βH ∈ R1, for all i and j. That is, we wish to test the null hypothesis that β1, β2 and β3 are all equal. Set XH = 1₇, the 7x1 vector of ones. It is clear that the mean vector assumptions under the above null hypothesis can alternatively be stated as H: E(Y) ∈ R(XH) or

equivalently as H: E(Y) = XHβH, βH ∈ R1. Since ΩH = R(XH) ⊂ Ω, it is clear from definition 4.2.1 that the above hypothesis is a linear hypothesis. Example 4.2.3. (Seely (1989)) Suppose Yij ~ N(βi, σ2) for i = 1,2,3 and j = 1,…,s and the Yij's are all independent. Suppose that β = (β1, β2, β3)' satisfies Δ'β = β1 + β2 + β3 = 0 and that σ2 is unknown. Let Y denote the vector of Yij's and let X be such that E(Y) = Xβ, Δ'β = 0. Then the class P of distributions has n = 3s and Ω = {Xβ: Δ'β = 0}. Suppose we wish to test the null hypothesis that β1 and β2 are equal. There are several ways of formulating this null hypothesis. One possibility is the following: H: E(Yij) = βH1 for i = 1,2 and j = 1,…,s, = βH2 for i = 3 and j = 1,…,s, where βH1 + βH1 + βH2 = 2βH1 + βH2 = 0. Let XH = (X1 + X2, X3) where Xi denotes the ith column of X and let Γ' = (2,1). Set ΩH = {XHβH: Γ'βH = 0}. Then the above hypothesis can also be formulated as H: E(Y) ∈ ΩH. Because ΩH is a subspace and ΩH ⊂ Ω, it follows from definition 4.2.1 that the hypothesis is a linear hypothesis. The reader should note that in both examples above, the subspace specified under the linear hypothesis was parameterized in some way. This is often the case when testing a linear hypothesis, i.e., the reduced expectation space specified under the null hypothesis is often parameterized in some way. Often, the matrix XH, say, used to parameterize the reduced expectation space consists of a subset of columns or perhaps linear combinations of the columns of the matrix X which is used to parameterize the full expectation space.
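As a small computational companion to Examples 4.2.2 and 4.2.3, here is a sketch of how a reduced-model design matrix XH can be built from the columns of X. The layout below follows Example 4.2.2 (three group-indicator columns with n1 = n3 = 2 and n2 = 3), and the null hypothesis that all three means are equal gives XH = 1₇; the rank check at the end simply confirms ΩH = R(XH) ⊂ Ω = R(X).

```python
# Sketch: constructing the reduced design X_H for a linear hypothesis.
import numpy as np

groups = np.array([0, 0, 1, 1, 1, 2, 2])
n = len(groups)
X = np.zeros((n, 3))
X[np.arange(n), groups] = 1.0        # full-model design, Omega = R(X)

X_H = np.ones((n, 1))                # reduced design under H: E(Y_ij) = beta_H
# Sanity check that Omega_H = R(X_H) is contained in Omega = R(X):
assert np.linalg.matrix_rank(np.hstack([X, X_H])) == np.linalg.matrix_rank(X)
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(X_H))
```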

4.3 An Intuitive Approach to Testing Suppose Y is a random vector such that E(Y) ∈ Ω ⊂ Rn and cov(Y) = σ2V where Ω is a known subspace of dimension m, and without loss of generality assume that the full model for E(Y) is parameterized as E(Y) = Xβ, β ∈ Rp. Now let ΩH be a subspace of Ω where dim(ΩH) = mH and suppose that we wish to test the null hypothesis H: E(Y) ∈ ΩH vs A: E(Y) ∈ ΩA

(4.3.1)

where ΩA = Ω – ΩH. Further assume without loss of generality that under the null hypothesis the reduced model is parameterized as E(Y) = XHβH, βH ∈ Rs. Thus the test described in (4.3.1) can be equivalently stated as H: E(Y) = XHβH, βH ∈ Rs vs A: E(Y) ≠ XHβH, βH ∈ Rs.

(4.3.2)

The usual assumption is that the full model is the correct model and we are testing to see if the reduced model is also correct. If the reduced model is also correct then it should be used because simpler models are easier to work with and interpret. The usage of the simpler model under the null hypothesis is also supported by Occam's razor, which can be paraphrased as "when presented with two competing hypotheses that make essentially the same predictions, one should select the one with the fewest assumptions" or as "other things being equal, simpler explanations are generally better than complex ones." An intuitive approach to the F statistic when cov(Y) = σ2In We now give an intuitive derivation of the usual statistic used to test the linear hypothesis described in (4.3.1) and (4.3.2) when cov(Y) = σ2In. Recall from chapter 2 that under the full model, β̂ is gm (ls) for β if and only if it satisfies Xβ̂ = PY where P is the orthogonal projection onto Ω, or if and only if it satisfies inf{β: Xβ ∈ Ω} (Y – Xβ)'(Y – Xβ) = (Y - Xβ̂)'(Y - Xβ̂) = ê'ê = Y'(In – P)Y = R̂ where ê denotes the residual vector and R̂ denotes the residual sum of squares. Similarly, under the reduced model, β̂H is gm (ls) for βH if and only if it satisfies XHβ̂H = PHY where PH is the orthogonal projection onto ΩH, or if and only if it satisfies inf{βH: XHβH ∈ ΩH} (Y – XHβH)'(Y – XHβH) = (Y - XHβ̂H)'(Y - XHβ̂H) = êH'êH = Y'(In – PH)Y = R̂H where êH denotes the residual vector and R̂H denotes the residual sum of squares. From these previous expressions, we see that ê and êH provide measures as to how close the fitted values Xβ̂ and XHβ̂H are to the individual observed data values under the full and reduced models whereas R̂ and R̂H provide overall measures as to how well the corresponding models fit the

observed data. One method often used to help decide whether to reject or fail to reject the null hypothesis is to compare the full and reduced models with respect to how well they fit the observed data values. Before giving a more detailed discussion of this approach, we make several observations regarding the information already given in this paragraph: (1) The fitted data values under the full model are Xβ̂ = PY and those under the reduced model are XHβ̂H = PHY. Because the fitted values under the full and reduced models are equal to PY and PHY, respectively, these fitted values do not depend upon the parameterizations used for Ω or ΩH. (2) Similarly, because R̂ = Y'(In – P)Y and R̂H = Y'(In – PH)Y, these residual sums of squares do not depend upon the parameterizations used for Ω or ΩH. (3) Because ΩH ⊂ Ω and dim ΩH ≤ dim Ω, it follows from proposition A11.10 that P – PH is an orthogonal projection, hence it is positive semidefinite. Thus R̂H - R̂ = Y'(In – PH)Y – Y'(In – P)Y = Y'[(In – PH) – (In - P)]Y = Y'(P – PH)Y ≥ 0 for all values of Y, and it will always be the case that the full model provides at least as good a fit to the data as does the reduced model. From the above, we see that if the full model provides only a modestly better fit to the data than the reduced model, then R̂H and R̂ should be about the same and the size of R̂H - R̂ should be relatively small. In this case, the size of R̂H - R̂ should be regarded as evidence that the two models fit the data about the same, hence that the reduced model is probably adequate, i.e., this is evidence that we should fail to reject the null hypothesis. On the other hand, if the full model provides a substantially better fit to the data than the reduced model, then R̂H should be bigger than R̂ and the size of R̂H - R̂ should be relatively large. In this situation, the size of R̂H - R̂ should be regarded as evidence that the reduced model is not adequate, i.e., this is evidence that we should reject the null hypothesis.
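The comparison just described is easy to carry out numerically. Below is a minimal sketch, assuming cov(Y) = σ2In and using an invented one-way layout and data, of the full and reduced residual sums of squares and their nonnegative difference Y'(P – PH)Y.

```python
# Sketch: R_hat, R_hat_H, and their difference via orthogonal projections.
import numpy as np

groups = np.array([0, 0, 1, 1, 1, 2, 2])
n = len(groups)
X = np.zeros((n, 3)); X[np.arange(n), groups] = 1.0   # full model
X_H = np.ones((n, 1))                                 # reduced model: all means equal
y = np.array([5.1, 4.9, 6.0, 6.2, 5.8, 7.1, 6.9])     # hypothetical data

def orth_proj(A):
    # Orthogonal projection onto R(A).
    return A @ np.linalg.pinv(A)

P, P_H = orth_proj(X), orth_proj(X_H)
R_hat = y @ (np.eye(n) - P) @ y
R_hat_H = y @ (np.eye(n) - P_H) @ y
print(R_hat, R_hat_H, R_hat_H - R_hat)   # the difference equals y'(P - P_H)y >= 0
```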

From the preceding discussion, we see that whether we reject or fail to reject the null hypothesis depends on deciding whether R̂H - R̂ is large or small. But while R̂H - R̂ provides a measure of the difference of how well the two models fit the data, to make a decision, we also need some idea as to how large R̂H - R̂ should be when the null hypothesis is true or not true. To gain at least a long term understanding of the behavior of the magnitude of R̂H - R̂, we consider the expectation of R̂H - R̂ = Y'(P – PH)Y when the reduced model is correct and when it is not correct. When the reduced model is correct, by proposition 1.3.12, E(Y'(P – PH)Y) = (XHβH)'(P – PH)(XHβH) + σ2tr[(P – PH)In] = (m – mH)σ2 (4.3.3) since XHβH ∈ ΩH ⊂ Ω. On the other hand, if the reduced model is not correct, again by proposition 1.3.12, E(Y'(P – PH)Y) = (Xβ)'(P – PH)(Xβ) + tr[(P – PH)σ2In] = (m – mH)σ2 + (Xβ)'(P – PH)(Xβ).

(4.3.4)

From (4.3.3) and (4.3.4), we see that if the null hypothesis is true, then R̂H - R̂ = Y'(P – PH)Y has a smaller expectation than when the null hypothesis is not true. To make the comparison more transparent, rather than consider Y'(P – PH)Y, suppose we consider (R̂H - R̂)/(m – mH) = Y'(P – PH)Y/(m – mH). From (4.3.3) and (4.3.4), we see that E[Y'(P – PH)Y/(m – mH)] = σ2

(4.3.5)

when the null hypothesis is true whereas when the null hypothesis is not true, E(Y'(P – PH)Y/(m - mH)) = σ2 + (Xβ)'(P – PH)(Xβ)/(m – mH). (4.3.6) From (4.3.5) and (4.3.6), it is clear that when the null hypothesis is correct, Y'(P – PH)Y/(m – mH)

is an unbiased estimator for σ2 whereas when the null hypothesis is not correct, Y'(P – PH)Y/(m – mH) has an expectation which is larger than σ2. Thus values of Y'(P – PH)Y/(m – mH) near σ2 would indicate that the reduced model is correct whereas values of Y'(P – PH)Y/(m – mH) larger than σ2 would indicate that the reduced model is not correct. In practice, the true value of σ2 is unknown, so we can't actually compare Y'(P – PH)Y/(m - mH) to σ2. As an alternative, we can compare Y'(P – PH)Y/(m - mH) to an estimate of σ2 which does not depend on the null hypothesis. Recall from chapter 2, under the full model (because it is assumed to be correct), MSE = Y'(In – P)Y/(n – m) = R̂/(n – m) is an unbiased estimator for σ2 where R̂ is the residual sum of squares under the full model. Thus to determine if Y'(P – PH)Y/(m – mH) is approximately equal to or larger than σ2, we can compare it to MSE under the full model using the ratio F̂ = [Y'(P – PH)Y/(m – mH)]/MSE. F̂ is called an F-statistic. If the reduced model is correct, the numerator and denominator of F̂ are both unbiased estimators of σ2, hence this ratio should be about one. However, if the reduced model is not correct, then from (4.3.6) we see that the numerator is estimating something larger than σ2, hence the value of F̂ should be larger than one. Thus values of F̂ much larger than one lead to rejection of the null hypothesis whereas values of F̂ nearer to one lead us to fail to reject the null hypothesis. We also note that since Y'(P – PH)Y = R̂H - R̂, we can express the above F-statistic as F̂ = [(R̂H - R̂)/(m – mH)]/[R̂/(n – m)]. The above argument used to derive the F-statistic for testing (4.3.1) was made on an intuitive basis and the only assumptions made were that E(Y) ∈ Ω and that cov(Y) = σ2In. In the next section we provide a more quantitative and statistical approach for deriving the above F-statistic based on the likelihood ratio test and the assumption Y ~ Nn(Xβ, σ2V).
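For the cov(Y) = σ2In case just discussed, the F-statistic and its reference distribution are easy to compute. The following self-contained sketch reuses the one-way layout of the earlier examples (full model E(Yij) = βi versus H: all βi equal); the observations are invented, and the p-value is taken from a central F(m - mH, n - m) distribution as justified later in section 4.4.

```python
# Sketch: the F statistic (R_H - R)/(m - m_H) over R/(n - m) and its p-value.
import numpy as np
from scipy import stats

groups = np.array([0, 0, 1, 1, 1, 2, 2])
n = len(groups)
X = np.zeros((n, 3)); X[np.arange(n), groups] = 1.0
X_H = np.ones((n, 1))
y = np.array([5.1, 4.9, 6.0, 6.2, 5.8, 7.1, 6.9])

P = X @ np.linalg.pinv(X)
P_H = X_H @ np.linalg.pinv(X_H)
m, m_H = np.linalg.matrix_rank(X), np.linalg.matrix_rank(X_H)

R_hat = y @ (np.eye(n) - P) @ y
R_hat_H = y @ (np.eye(n) - P_H) @ y
F = ((R_hat_H - R_hat) / (m - m_H)) / (R_hat / (n - m))
p_value = stats.f.sf(F, m - m_H, n - m)     # central F(m - m_H, n - m) under H
print(F, p_value)
```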

However, before proceeding to the next section, we now give an intuitive derivation of the F-statistic when cov(Y) = σ2V. An intuitive approach to the F-statistic when cov(Y) = σ2V We now extend the notion of using an F-test for testing linear hypotheses to the more general case when cov(Y) = σ2V instead of σ2In. Because the following discussion involves two different random vectors, as in section 2.8, we will use subscript notation to associate such things as parameter spaces, expectation spaces, etc., with each of the two random vectors. So for the situation now being considered, we make essentially the same assumptions as in the previous derivation of the F-statistic, i.e., E(Y) = Xβ, β ∈ Rp, in which case ΩY = R(X) and dim(ΩY) = m. But here we assume cov(Y) = σ2V. Under this model for Y, using the same notation as in chapter 2, MY = R(V-1X) and NY = ΩY⊥ = R(X)⊥. We want to test H: E(Y) ∈ ΩYH where ΩYH is an mH dimensional subspace of ΩY and E(Y) is parameterized as E(Y) = XHβH, βH ∈ Rs. For this linear hypothesis, ΩYH = R(XH). We also have that MYH = R(V-1XH) and NYH = ΩYH⊥ = R(XH)⊥. To actually derive the test, it is convenient to transform the vector Y. In particular, recall that since V > 0, we can find a nonsingular matrix Q such that V = QQ'. Now let W = Q-1Y. Then as in section 2.8, we see that E(W) = Q-1Xβ, β ∈ Rp, cov(W) = σ2In, and the expectation space of W is ΩW = {Q-1Xβ: β ∈ Rp} = R(Q-1X) = Q-1[ΩY] and has dimension m. Now let ΩWH = Q-1[ΩYH] = {Q-1XHβH: βH ∈ Rs}. Then because Q-1 is nonsingular and one to one, there is a one to one correspondence between the expectation vectors in ΩY and ΩW and because ΩYH ⊂ ΩY, it follows that ΩWH = Q-1[ΩYH] ⊂ ΩW = Q-1[ΩY] and that there is a one to one correspondence between the expectation vectors in ΩYH and ΩWH. It also follows that E(Y) ∈ ΩYH if and only if E(W) ∈ ΩWH. Thus we see that testing H: E(Y) ∈ ΩYH is equivalent to testing H': E(W) ∈ ΩWH. Applying the same argument as given above for the case cov(Y) = σ2In, we see that when the full and reduced models for W are both correct, then R̂W and R̂WH should be about the same and R̂WH - R̂W should be small, where R̂W is the residual sum of squares under the full model for W and R̂WH is the residual sum of squares under the reduced model for W. However, if the reduced model for W is not correct, then R̂WH should be larger than R̂W and R̂WH - R̂W should be larger. So larger values of R̂WH - R̂W would lead us to

reject H': E(W) ∈ ΩWH whereas smaller values of R̂WH - R̂W would lead to failure to reject the null hypothesis. Continuing to use the same argument as above, we are led to the F-statistic for testing H': E(W) ∈ ΩWH which is given by F̂ = [(R̂WH - R̂W)/(m – mH)]/MSEW = [(R̂WH - R̂W)/(m – mH)]/[R̂W/(n – m)] (4.3.7) = [W'(PW – PWH)W/(m – mH)]/[W'(In – PW)W/(n – m)] where PW is the orthogonal projection onto ΩW and PWH is the orthogonal projection onto ΩWH. We can also express the F-statistic given in (4.3.7) in terms of the original random vector Y. To see this, observe that the orthogonal projection on ΩW = R(Q-1X) is PW = (Q-1X)[(Q-1X)'(Q-1X)]-(Q-1X)' = (Q-1X)[X'V-1X]-X'Q-1' and that the orthogonal projection onto ΩWH = R(Q-1XH) is PWH = (Q-1XH)[(Q-1XH)'(Q-1XH)]-(Q-1XH)' = Q-1XH[XH'V-1XH]-XH'Q-1'. Finally, let PY = V-1X[X'V-1X]-X' and PYH = V-1XH[XH'V-1XH]-XH'. Then from lemma 2.5.11 and proposition A11.7, we have that PY is the projection on MY = R(V-1X) along NY = R(X)⊥ = N(X'), PY' is the projection on NY⊥ = R(X) along MY⊥ = R(V-1X)⊥ = N(X'V-1), PYH is the projection on MYH = R(V-1XH) along NYH = R(XH)⊥ = N(XH') and that PYH' is the projection on NYH⊥ = R(XH) along MYH⊥ = R(V-1XH)⊥ = N(XH'V-1). Also recall from section 2.8 that we call R̂Y = Y'(In – PY)V-1(In – PY)'Y the weighted residual sum of squares under the model for Y. Correspondingly, we call R̂YH = Y'(In – PYH)V-1(In – PYH)'Y the weighted residual sum of squares under the reduced model for Y. Using the above expressions, we have the following lemma. Lemma 4.3.8. (a) R̂W = W'(In – PW)W = Y'(In – PY)V-1(In – PY)'Y = R̂Y. (b) R̂WH = W'(In – PWH)W = Y'(In – PYH)V-1(In – PYH)'Y = R̂YH.

(c) R̂WH - R̂W = W'(PW – PWH)W = Y'(PY – PYH)V-1(PY – PYH)'Y = R̂YH - R̂Y. Proof. (a) Using the relationships given in (2.5.12) between P, N, V and V-1, we see that R̂W = W'(In – PW)W = Y'Q-1'[In – Q-1X[X'V-1X]-X'Q-1']Q-1Y = Y'Q-1'Q-1Y - Y'V-1X[X'V-1X]-X'V-1Y = Y'V-1Y – Y'V-1PY'Y = Y'[V-1 – V-1PY']Y = Y'[V-1 - V-1PY' + PYV-1PY' – PYV-1PY']Y = Y'[V-1 – V-1PY' + PYV-1PY' – PYV-1]Y = Y'(In - PY)V-1(In - PY)'Y = R̂Y. (b) This follows the same as in (a) using the corresponding relationships obtained from (2.5.12) upon replacing PY and NY by PYH and NYH. (c) Once again using the relationships given in (2.5.12), we have that R̂WH - R̂W = W'(PW – PWH)W = W'[(In – PWH) – (In – PW)]W = W'(In – PWH)W - W'(In – PW)W = Y'(In – PYH)V-1(In – PYH)'Y – Y'(In – PY)V-1(In – PY)'Y = Y'[(In – PYH)V-1(In – PYH)' – (In – PY)V-1(In - PY)']Y = Y'[V-1 – V-1PYH' – PYHV-1 + PYHV-1PYH' – V-1 + V-1PY' + PYV-1 – PYV-1PY']Y = Y'[-V-1PYH' – PYHV-1 + PYHV-1PYH' + V-1PY']Y = Y'[-V-1(PYPYH)' – (PYPYH)V-1 + PYHV-1PYH' + PYV-1PY']Y = Y'[-PYHV-1PY' – PYV-1PYH' + PYHV-1PYH' + PYV-1PY']Y = Y'(PY – PYH)V-1(PY – PYH)'Y = R̂YH - R̂Y. Using lemma 4.3.8, we see that F̂ = [W'(PW – PWH)W/r(PW – PWH)]/[W'(In – PW)W/r(In – PW)] = [Y'(PY – PYH)V-1(PY – PYH)'Y/r(PY – PYH)]/[Y'(In – PY)V-1(In – PY)'Y/r(In – PY)]

= [(R̂YH - R̂Y)/r(PY – PYH)]/[R̂Y/r(In – PY)] since r(PY – PYH) = r(PW – PWH) and r(In – PY) = r(In – PW). The latter two expressions given above for the F-statistic can be computed using the relevant expressions for the model for Y.
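The whitening route used above is also the simplest way to compute the weighted F-statistic in practice. The following is a hedged sketch under a known, invented covariance structure V and simulated data: whiten with Q-1 where V = QQ', then apply the ordinary F computation to W = Q-1Y, Q-1X and Q-1XH, which by Lemma 4.3.8 produces the same value as the PY-based expressions.

```python
# Sketch: the F statistic when cov(Y) = sigma^2 V, computed via W = Q^{-1} Y.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 9
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X_H = X[:, :1]                                   # H: the two slope coefficients are zero
V = np.diag(np.linspace(0.5, 2.0, n))            # assumed known covariance structure
Q = np.linalg.cholesky(V)
y = X @ np.array([1.0, 0.0, 0.0]) + Q @ rng.normal(size=n)   # data generated under H

Qi = np.linalg.inv(Q)
w, Xw, XHw = Qi @ y, Qi @ X, Qi @ X_H
Pw = Xw @ np.linalg.pinv(Xw)
PwH = XHw @ np.linalg.pinv(XHw)
m, m_H = np.linalg.matrix_rank(Xw), np.linalg.matrix_rank(XHw)

R_W = w @ (np.eye(n) - Pw) @ w                   # weighted residual SS, full model
R_WH = w @ (np.eye(n) - PwH) @ w                 # weighted residual SS, reduced model
F = ((R_WH - R_W) / (m - m_H)) / (R_W / (n - m))
print(F, stats.f.sf(F, m - m_H, n - m))
```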

4.4 The Likelihood Ratio Test In the previous section, we presented two intuitive derivations of the F-statistic normally used to test hypotheses in linear models when cov(Y) = σ2In or σ2V. In this section we give a more statistical justification for usage of the F-test via the likelihood ratio approach. To this end, assume Y is a random vector such that E(Y) ∈ Ω = {Xβ: Δ'β = 0s} and cov(Y) = σ2V and recall that if Y ~ Nn(Xβ, σ2V), then for all y ∈ Rn, f(y; β, σ2) = (2π)-(n/2)(σ2)-(n/2)|V|-(1/2)exp{-(y – Xβ)'V-1(y – Xβ)/2σ2}. Also recall from section 2.8 that the maximum likelihood estimators for (β, σ2) are (β̂, σ̂2) where β̂ is gm for β and σ̂2 = R̂/n where R̂ = (Y - Xβ̂)'V-1(Y – Xβ̂) = Y'(In – P)V-1(In – P)'Y is the weighted residual sum of squares. In this last expression, P is the projection on M = V-1(Ω) along N = Ω⊥. We also have that f(y; β̂, σ̂2) = (2π)-(n/2)(R̂/n)-(n/2)|V|-(1/2)exp{-(n/2)}. Now let ΩH be a subspace of Ω where ΩH = {XHβH: Γ'βH = 0l} and suppose we want to test the linear hypothesis that E(Y) ∈ ΩH. Under the linear hypothesis, the maximum likelihood estimators for (βH, σ2) are (β̂H, σ̂H2) where β̂H is gm for βH and σ̂H2 = R̂H/n where R̂H = (Y - XHβ̂H)'V-1(Y - XHβ̂H) = Y'(In – PH)V-1(In – PH)'Y is the weighted residual sum of squares under the null hypothesis and PH is the projection on MH = V-1(ΩH) along NH = ΩH⊥ = {XHβH: Γ'βH = 0l}⊥. We also have that f(y; β̂H, σ̂H2) = (2π)-(n/2)(R̂H/n)-(n/2)|V|-(1/2)exp{-(n/2)}.

Proposition 4.4.1. Suppose H: E(Y) ∈ ΩH is a linear hypothesis. Then the likelihood ratio test statistic λ̂ is given by λ̂ = [R̂/R̂H]n/2. Furthermore, with probability one the random variable λ̂ is well defined. Proof. To simplify notation, suppose we let μ = Xβ in the discussion of the preceding paragraph, which means that Y ~ Nn(μ, σ2V). Then f(y; μ, σ2) = (2π)-(n/2)(σ2)-(n/2)|V|-(1/2)exp[(-1/2σ2)(y - μ)'V-1(y - μ)] for all y ∈ Rn is the probability density function for Y. Now suppose y is the observed value of Y such that R̂ > 0 and R̂H > 0. Then λ̂(y) = sup{μ ∈ ΩH, σ > 0} f(y; μ, σ2)/sup{μ ∈ Ω, σ > 0} f(y; μ, σ2).

(4.4.1)

Let μ̂ = Xβ̂ and σ̂2 be the maximum likelihood estimators for μ and σ2 under the full model where β̂ is gm for β, and let μ̂H = XHβ̂H and σ̂H2 be the maximum likelihood estimators for μ and σ2 under the reduced model where β̂H is gm for βH. Then plugging these estimates as derived above into (4.4.1), it is straightforward to show that λ̂(y) = [R̂/R̂H]n/2 where R̂ is the weighted residual sum of squares under the full model and R̂H is the weighted residual sum of squares under the reduced model. To see that λ̂(y) is defined with probability one, we observe that R̂H ≥ R̂ and that, as proven in lemma 2.10.1, R̂/σ2 ~ χ2(n – m) implies (since we are assuming n > m) that P(R̂ > 0) = 1 for every possible distribution. Now consider testing a linear hypothesis, say H: E(Y) ∈ ΩH. The likelihood ratio principle says to reject the null hypothesis for any outcome y of Y such that λ̂(y) ≤ K where K is selected to give the appropriate size or significance level. It is this testing procedure which we use for testing a linear hypothesis. Before proceeding further, we note, as indicated in section 4.3, that the F-statistic for testing the linear hypothesis H: E(Y) ∈ ΩH can be expressed in several different ways. In particular, recall that if P is the projection on M = V-1(Ω) along N = Ω⊥, then the weighted residual sum of squares under the full model can be expressed as R̂ = Y'(In – P)V-1(In – P)'Y. Similarly, if we let PH denote the projection on V-1(ΩH) along ΩH⊥, then the weighted residual sum of squares R̂H under the reduced model can be expressed as R̂H = Y'(In – PH)V-1(In – PH)'Y. It now follows from lemma 4.3.8 that the numerator sum of squares for the F-statistic used to test H: E(Y) ∈ ΩH can be expressed as R̂H - R̂ = Y'(P – PH)V-1(P – PH)'Y.

(4.4.2)

We now consider the basic properties needed to establish the distribution of F̂.

Theorem 4.4.3. Consider the linear hypothesis H: E(Y) ∈ ΩH and suppose mH < m and n – m > 0. Then the following statements can be made:
(a) R̂/σ2 ~ χ2(n – m).
(b) (R̂H – R̂)/σ2 ~ χ2(m – mH, λH) where λH = (Xβ)'(P – PH)V-1(P – PH)'(Xβ)/σ2.
(c) R̂H – R̂ and R̂ are independent.


Furthermore, the non-ĐĞŶƚƌĂůŝƚLJƉĂƌĂŵĞƚĞƌʄH in part (b) is zero if and only ŝĨyɴ੣ ɏH. Proof (a) This follows from proposition 1.6.1 because Y ~ Nn;yɴ͕ʍ2V), ෡ = Y’(In – P)V-1(In – P)’Y R where P is the projection on M along N and (using the relationships given in (2.5.12)) the fact that ;ϭͬʍ2)(In – P)V-1(In – WͿ͛;ʍ2sͿ;ϭͬʍ2)(In - P)V-1(In – P)’ с;ϭͬʍ2)(In – P)V-1(In - P)’VV-1(In – P)’(In – P)’ с;ϭͬʍ2)(In – P)V-1(In – P)’(In – WͿ͛с;ϭͬʍ2)(In – P)V-1(In – P)’. (b) This also follows from proposition 1.6.1 since ෡ H- R ෡ = Y’(P – PH)V-1(P – PH)’Y R and (using the relationships from (2.5.12)) the fact that ;ϭͬʍ2)(P – PH)V-1(P – PHͿ͛΀ʍ2s΁;ϭͬʍ2)(P – PH)V-1(P – PH)’ с;ϭͬʍ2)(P – PH)(P – PH)’V-1V(P – PH)V-1(P – PH)’ с;ϭͬʍ2)(P – PH)(P – PH)(P – PH)V-1(P – PH)’ с;ϭͬʍ2)(P – PH)(P – PH)V-1(P – PH)’ с;ϭͬʍ2)(P – PH)V-1(P – PH)’. ෡ and R ෡H - R ෡ follows from (c) Again using (2.5.12), independence between R proposition 1.6.6 since (In – P)V-1(In – P)’ʍ2V(P – PH)V-1(P – PH)’ = ʍ2(In - P)V-1V(In – P)(P – PH)V-1(P – PH)’ = 0nn. &ŽƌƚŚĞůĂƐƚĐŽŶĐůƵƐŝŽŶŶŽƚĞƚŚĂƚʄH = 0 if and only if 0n = (P – PHͿyɴ͘Ƶƚ yɴ੣ɏ, ƐŽƚŚĂƚʄH сϬŝĨĂŶĚŽŶůLJŝĨyɴсWHyɴ͕ŝ͘Ğ͕͘yɴ੣ ɏH.


For testing a linear hypothesis with m – mH ൒ 1, theorem 4.4.3 implies ෡H - R ෡ )/(m – mH)/R ෡ /(n – m) F෠ = (R has a central F-distribution under the null hypothesis. Thus, the likelihood ratio testing procedure can be carried out very simply by making the familiar F-test that was intuitively justified in section 4.3,i.e., reject the linear hypothesis when F෠ > K where K is selected so as to give the appropriate significance level. The above theorem also implies that under the alternative hypothesis F෠ has a non-central F-distribution with a nonzero non-centrality parameter. The degrees of freedom for the Fdistribution under both the null and alternative hypotheses are (m – mH) and (n – m), respectively. Example 4.4.4. (Seely(1989)) Let us determine the various quantities needed to form F෠ for the linear hypothesis in example 4.2.3. Note that X ŚĂƐĨƵůůĐŽůƵŵŶƌĂŶŬƐŽƚŚĂƚƌ;y͕͛ȴͿсƌ;y͛Ϳ͘dŚƵƐ͕ŵсƌ;y͕͛ȴͿ– ƌ;ȴͿсϯ– 1 =2. By exactly the same reasoning, we also get mH = r(XH’͕ʧͿ– ƌ;ʧͿсϮ– 1 = 1. ෡ H it is only necessary to realize that R ෡ H is the residual sum of To calculate R squares under the reduced model. As a result, any of our techniques for ෡ can be used to calculate R ෡ H. For example, let the columns of calculating R a matrix BH ĨŽƌŵ Ă ďĂƐŝƐ ĨŽƌ E;ʧ͛Ϳ͕ Ğ͘Ő͕͘ H = (1,-2)’. Because XH has full column rank, it follows that the columns of UH = XHBH ĨŽƌŵĂďĂƐŝƐĨŽƌɏH. Thus, if we let PH = U(U’U)-1U’, then ܴ෠ H = Y’(In – PH)Y.
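The F-statistic and the likelihood ratio statistic of proposition 4.4.1 can be computed side by side. The sketch below is a minimal illustration, not the book's code: it assumes the full and reduced expectation spaces are spanned by the columns of design matrices X_full and X_reduced, solves the normal equations with lstsq so that rank-deficient designs are allowed, and uses scipy only to evaluate the central F tail probability. The function and argument names are ours.

    import numpy as np
    from scipy.stats import f as f_dist

    def weighted_rss(Y, X, V):
        """Weighted residual sum of squares when E(Y) ranges over R(X) and cov(Y) = sigma^2 V."""
        Vinv = np.linalg.inv(V)
        beta_hat = np.linalg.lstsq(X.T @ Vinv @ X, X.T @ Vinv @ Y, rcond=None)[0]
        r = Y - X @ beta_hat
        return r @ Vinv @ r, np.linalg.matrix_rank(X)

    def lr_and_F(Y, X_full, X_reduced, V):
        """Likelihood ratio statistic [R_hat/R_hat_H]^(n/2) and the equivalent F statistic."""
        n = len(Y)
        R_full, m = weighted_rss(Y, X_full, V)
        R_red, mH = weighted_rss(Y, X_reduced, V)
        lam_hat = (R_full / R_red) ** (n / 2)
        F_hat = ((R_red - R_full) / (m - mH)) / (R_full / (n - m))
        p_value = f_dist.sf(F_hat, m - mH, n - m)    # central F tail under the null
        return lam_hat, F_hat, p_value

With X_reduced chosen so that its column space is ΩH, F_hat is the statistic that is rejected when it exceeds the appropriate F percentile.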

4.5 Formulating Linear Hypotheses Let us adopt here the model and notation introduced in section 4.2, i.e., we assume that E(YͿсyɴ͕ɴ੣ ੓, where ੓ is a subspace of Rp is a given parameterization for E(YͿ͘KĨƚĞŶƚŚĞƐĞƚɏH in a linear hypothesis is defined implicitly by some description that has a practical meaning or significance for the problem at hand. There are two basic ways of supplying such a description. These are via a parameterization for E(Y) under the reduced ŵŽĚĞůŽƌǀŝĂƐŽŵĞƐƚĂƚĞŵĞŶƚĂďŽƵƚƚŚĞƉĂƌĂŵĞƚĞƌǀĞĐƚŽƌɴƵŶĚĞƌƚŚĞĨƵůů model. It is the purpose of this section to explore these two ways of specifying a linear hypothesis.


The description of a linear hypothesis via a parameterization for E(Y) under the reduced model is generally stated in the form H: E(Y) = Gɲ, ʧ͛ɲ=0l, where G aŶĚʧĂƌĞŐŝǀĞŶŵĂƚƌŝĐĞƐ͘LJƐƵĐŚĂƐƚĂƚĞŵĞŶƚǁĞŵĞĂŶŽĨ course, the linear hypothesis H: E(Y) ੣ ɏH ǁŚĞƌĞɏH = {Gɲ͗ʧ͛ɲ = 0l}. For such a statement to constitute a valid linear hypothesis, notice that it is both ŶĞĐĞƐƐĂƌLJ ĂŶĚ ƐƵĨĨŝĐŝĞŶƚ ƚŚĂƚ ɏH = {Gɲ͗ ʧ͛ɲ = 0l΃ ŝƐ ĐŽŶƚĂŝŶĞĚ ŝŶ ɏ͘ &Žƌ purposes of exposition we shall generally refer to a linear hypothesis that is described in the above manner as a linear hypothesis on E(Y). Example 4.5.1. (Seely(1989)) Consider again example 4.2.2. Notice that the linear hypothesis in this example was given as a linear hypothesis on E(Y), i.e., H:E(Y)= XHɴH͕ ɴH ੣ R1. As another example suppose that the three groups (i= 1,2,3) are defined via some quantitative variable, say x, so that we have three distinct real numbers x1,x2 and x3 naturally associated with the three groups. In situations like this it is often of interest to test a hypothesis of the form H: E(YijͿсɲ0 нɲ1xi͕ĨŽƌĂůůŝ͕ũĂŶĚɲ0, ɲ1 unknown. Let G = (17,x) where x = (x1,x1,x2,…,x3)’. Then the above hypothesis can be rewritten as H: E(YͿс'ɲ͕ɲ੣ R2. To check if this is a linear hypothesis on E(Y), we need only check if R(G) ‫ ؿ‬ɏĂŶĚƚŚŝƐŝƐƐƚƌĂŝŐŚƚĨŽƌǁĂƌĚ͘ The second method for describing a linear hypothesis that we wish to discuss is one formulated in terms of the parameter vector of the full model. Consider the hypothesis statement ,͗ɴ੣ ੓H [੓H a subspace of ੓΁ǀƐ͗ɴ੣ ੓A [੓A с΂ɴ੣ ੓͗ɴ‫ ב‬੓H]. (4.5.2) In (4.5.2), as in definition 4.2.1, the statement H: ɴ ੣ ɽH is called the null hypothesis while the statement A: ɴ ੣ ɽA is called the alternative hypothesis. However, it should be noted that any null hypothesis formulated as in (4.5.2) also implies a statement about E(Y). In particular, it implies the statement H: E(Y) ੣ ɏH ǁŚĞƌĞɏH с΂yɴ͗ɴ੣ ੓H}. Because ੓H is a subspace of ੓͕ŝƚŝƐĐůĞĂƌƚŚĂƚɏH ŝƐĂƐƵďƐƉĂĐĞŽĨɏƐŽƚŚĂƚ,͗;Y) ੣ ɏH is a linear hypothesis. Unfortunately, when one tries to interpret the linear hypothesis H: E(Y) ੣ ɏH in terms of statement (4.5.2) some ambiguities can arise. The problem is that when one fails to reject H: E(Y) ੣ ɏH, then in ƚĞƌŵƐŽĨɴLJŽƵŬŶŽǁŽŶůLJƚŚĂƚɴ੣ ੓ ĂŶĚyɴ੣ ɏH which does not necessarily ŝŵƉůLJ ƚŚĞ ƐƚĂƚĞŵĞŶƚ ,͗ ɴ ੣ ੓H. To insure that the linear hypothesis H:E(Y)੣ɏH leads to an unambiguous interpretation of (4.5.2), we make the following definition.


Definition 4.5.3. The hypothesis testing statement in (4.5.2) is said to be a linear hypothesis on the parameter vector ɴŝĨĂŶĚŽŶůLJŝĨȾത ੣ ੓ and XȾത ੣ ɏH implies Ⱦത ੣ ੓H ǁŚĞƌĞɏH ŝƐƚŚĞƐĞƚŽĨĂůůǀĞĐƚŽƌƐyɴŐĞŶĞƌĂƚĞĚďLJɴ੣ ੓H. As we have previously done with linear hypotheses, we usually just ǁƌŝƚĞƚŚĞŶƵůůŚLJƉŽƚŚĞƐŝƐ,͗ɴ੣ ੓H when describing a linear hypothesis on ɴ͘/ŶƐƵĐŚĂƐƚĂƚĞŵĞŶƚŝƚŝƐƚŽďĞƵŶĚĞƌƐƚŽŽĚƚŚĂƚƚŚĞĂĐƚƵĂůŚLJƉŽƚŚĞƐŝƐ testing problem is as stated in (4.5.2). Of course it is also to be understood that it is an unambiguous linear hypothesis, i.e., the additional condition in definition 4.5.3 is satisfied. Example 4.5.4. (Seely (1989)) Suppose {Yij} is a collection of normally distributed independent random variables that follow an additive one way model with E(YijͿсђнɲi͕ŝ͕ũсϭ͕ϮǁŚĞƌĞɴс;ђ͕ɲ1͕ɲ2)’ is unknown. Set ੓H = ;ɴ͗ʋ͛ɴсϬ΃ǁŚĞƌĞʋ͛с;Ϭ͕ϭ͕ϭͿ͘>ĞƚƵƐĐŚĞĐŬŝĨ,͗ɴ੣ ੓H is a linear hypothesis ŽŶ ɴ͘ learly ੓H is a subspace in ੓= R3. So, we need only check the additional condition in definition 4.5.3. Let Y and X = (X0,X1,X2) be defined in the usual manner. Set ȾതH с;ϭ͕Ϭ͕ϬͿ͛͘dŚĞŶʋ͛ȾതH = 0 so that XȾH = X0 ੣ ɏH. Now select Ⱦത = (0,1,1)’. Then Ⱦത ੣ ੓ and XȾത = X1 + X2 = X0 ੣ ɏH͕ďƵƚʋ͛Ⱦത = 2 ് 0 so that Ⱦത ‫ ב‬੓H͘dŚƵƐǁĞƐĞĞƚŚĂƚ,͗ɴ੣੓H ŝƐŶŽƚĂůŝŶĞĂƌŚLJƉŽƚŚĞƐŝƐŽŶɴ͘ Example 4.5.5. (Seely (1989)) Consider again the null hypothesis in ĞdžĂŵƉůĞϰ͘Ϯ͘ϯƚŚĂƚɴ1 ĂŶĚɴ2 are equal. We now verify that this is in fact a linear hypothesis on ɴ͘^ŽůĞƚʋс;ϭ͕-ϭ͕ϬͿ͘dŚĞŶƚŚĞƐĞƚŽĨǀĞĐƚŽƌƐɴ੣E;ȴ͛Ϳ ǁŝƚŚɴ1сɴ2 is precisely the set of vectors ੓H с΂ɴ͗ʋ͛ɴсϬ͕ɴ੣ ੓΃͘^Ž͕,͗ɴ੣ ੓H ĚĞƐĐƌŝďĞƐ ƚŚĞ ĚĞƐŝƌĞĚ ŶƵůů ŚLJƉŽƚŚĞƐŝƐ͘ tĞ ŶŽƚĞ ƚŚĂƚ ŝŶ ƚŚŝƐ ĐĂƐĞ͕ ʋ͛ɴ ŝƐ estimable and hence identifiable. Let us check if the given hypothesis is a ůŝŶĞĂƌŚLJƉŽƚŚĞƐŝƐŽŶɴ͘EŽƚĞƚŚĂƚ੓H is a subspace of ੓ so that we only need check definition 4.5.3. Suppose Ⱦത ੣ ੓ is such that XȾത ੣ ɏH. Let ȾതH ੣ ੓H be such that XȾത = XȾതH. We now have Ⱦത,ȾതH ੣ ੓ and XȾത = XȾതH͘ĞĐĂƵƐĞʋ͛ɴŝƐ ĂŶŝĚĞŶƚŝĨŝĂďůĞƉĂƌĂŵĞƚƌŝĐĨƵŶĐƚŝŽŶ͕ŝƚĨŽůůŽǁƐƚŚĂƚʋ͛Ⱦത сʋ͛ȾതH͘Ƶƚʋ͛ȾതH=0 ƐŽƚŚĂƚʋ͛Ⱦത = 0 which implies Ⱦത ੣ ੓H. Thus ŝƚĨŽůůŽǁƐƚŚĂƚ,͗ɴ੣ ੓H is a linear ŚLJƉŽƚŚĞƐŝƐŽŶɴ͘ The set ੓H ƵƐĞĚƚŽĚĞƐĐƌŝďĞĂůŝŶĞĂƌŚLJƉŽƚŚĞƐŝƐŽŶɴŝƐŵŽƐƚŽĨƚĞŶ͕ĂƐŝŶ the previous two examples, described via a set of linear equations. That is, ƚŚĞ ĐŽŵŵŽŶ ǁĂLJ ŽĨ ǁƌŝƚŝŶŐ Ă ůŝŶĞĂƌ ŚLJƉŽƚŚĞƐŝƐ ŽŶ ɴ ŝƐ ƚŽ ƐŝŵƉůLJ ǁƌŝƚĞ H͗ʋ͛ɴсϬl ǁŚĞƌĞʋŝƐĂŐŝǀĞŶƉdžůŵĂƚƌŝdž͘/ŶƐƵĐŚĂƐƚĂƚĞŵĞŶƚ͕ĞǀĞŶŝĨŝƚŝƐŶŽƚ


ĞdžƉůŝĐŝƚůLJ ŵĞŶƚŝŽŶĞĚ͕ ŝƚ ŝƐ ƚŽ ďĞ ƵŶĚĞƌƐƚŽŽĚ ƚŚĂƚ ɴ ŝƐ ĂůƐŽ ŝŶ ੓, e.g., the ĂĐƚƵĂůŚLJƉŽƚŚĞƐŝƐƚĞƐƚŝŶŐƐƚĂƚĞŵĞŶƚŝƐ,͗ɴ੣ ੓H ǀƐ͗ɴ੣ ੓A where ੓H с΂ɴ͗ʋ͛ɴсϬl͕ɴ੣ ੓} and ੓A с΂ɴ͗ʋ͛ɴ് 0l͕ɴ੣ ੓}. ^ƵƉƉŽƐĞ ŶŽǁ ƚŚĂƚ ʋ ŝƐ Ă Ɖdžů ŵĂƚƌŝdž ĂŶĚ ǁĞ ǁĂŶƚ ƚŽ ƚĞƐƚ ,͗ ʋ͛ɴ с Ϭl. Under this set of linear equations, ੓H с΂ɴ͗ʋ͛ɴсϬl͕ɴ੣ ੓} is a subspace of ੓͘/ŶĂĚĚŝƚŝŽŶ͕ǁĞŚĂǀĞƚŚĂƚɏс΂yɴ͗ɴ੣ ੓} and ɏH с΂yɴ͗ɴ੣ ੓H΃с΂yɴ͗ʋ͛ɴсϬl͕ɴ੣ ੓}. ,ŽǁĞǀĞƌ͕ǁŝƚŚŽƵƚĨƵƌƚŚĞƌƐƉĞĐŝĨŝĐĂƚŝŽŶŽŶʋ͛ɴ͕ŝƚĐŽƵůĚŚĂƉƉĞŶƚŚĂƚπH сɏ ƵŶĚĞƌ ,͗ ʋ͛ɴ с Ϭl ĂŶĚ ǁĞ ǁŽƵůĚ ďĞ ƚĞƐƚŝŶŐ ŶŽƚŚŝŶŐ͕ ŝ͘Ğ͕͘ɏH is simply an alternative parameterization for E(Y) as occurs in the case when ʋ’ɴ = 0l are nonpre-estimable constraints as discussed in section 3.3. Thus for ,͗ʋ͛ɴсϬl ƚŽďĞĂŵĞĂŶŝŶŐĨƵůƚĞƐƚ͕ǁĞŶĞĞĚĚŝŵɏH фĚŝŵɏ͘^ŽǁŚĂƚĂƌĞ ƚŚĞĐŽŶĚŝƚŝŽŶƐŽŶʋŶĞĞĚĞĚƚŽŝŶƐƵƌĞƚŚĂƚ,͗ʋ͛ɴсϬl is a legitimate linear ŚLJƉŽƚŚĞƐŝƐŽŶɴĂŶĚƐĂƚŝƐĨŝĞƐĚĞĨŝŶŝƚŝŽŶϰ͘ϱ͘ϯĂŶĚĂǀŽŝĚƐƚŚe problem of ŚĂǀŝŶŐɏH сɏ͘tĞŶŽǁŐŝǀĞĂĐŽŶĚŝƚŝŽŶƚŚĂƚŝŶƐƵƌĞƐƚŚĂƚ,͗ʋ’ɴ = 0l is a legitimate hypothesis. WƌŽƉŽƐŝƚŝŽŶϰ͘ϱ͘ϲ͘>Ğƚʋ͛ďĞĂŶůdžƉŵĂƚƌŝdžĂŶĚůĞƚ੓H с΂ɴ͗ʋ͛ɴсϬl͕ɴ੣ ੓}. dŚĞŶ,͗ɴ੣ ੓H͕ǁŚŝĐŚǁĞŶŽƌŵĂůůLJǁƌŝƚĞĂƐ,͗ʋ͛ɴсϬl, is a linear hypothesis ŽŶɴŝĨĂŶĚŽŶůLJŝĨʋ͛ɴŝƐĂŶĞƐƚŝŵĂďůĞƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌ͘ WƌŽŽĨ͘^ƵƉƉŽƐĞĨŝƌƐƚƚŚĂƚ,͗ɴ੣ ੓H ŝƐĂůŝŶĞĂƌŚLJƉŽƚŚĞƐŝƐŽŶɴ͘>ĞƚȾത1,Ⱦത2 ੣ ੓ be such that XȾത1 = XȾത2͘tĞǁŝůůƐŚŽǁƚŚĂƚʋ͛Ⱦത1 сʋ͛Ⱦത2 so that identifiability ŝŵƉůŝĞƐʋ͛ɴŝƐĞƐƚŝŵĂďůĞ͘^ĞƚȾത = Ⱦത1 - Ⱦത2. Then Ⱦത ੣ ੓ and XȾത = 0n ŝƐŝŶɏH since ɏH is a subspace. By definition 4.5.3, it follows that Ⱦത ੣ ੓H. But this implies ʋ͛Ⱦത = 0l ƐŽ ƚŚĂƚ ʋ͛Ⱦത1 с ʋ͛Ⱦത2. The converse can be proved by the same argument used in example 4.5.5. Observe from Proposition 4.5.6 that we get a criterion we are familiar ǁŝƚŚƚŽĚĞƚĞƌŵŝŶĞŝĨĂƐĞƚŽĨůŝŶĞĂƌĞƋƵĂƚŝŽŶƐʋ͛ɴсϬl gives us a legitimate ŚLJƉŽƚŚĞƐŝƐŽŶɴ͘tĞĂůƐŽŚĂǀĞƚŚĞĨŽůůŽǁŝŶŐƉƌŽƉŽƐŝƚŝŽŶ͘ Proposition 4.5.7. Consider the parameterization E(YͿ с yɴ͕ ȴ͛ɴ с Ϭs, for E(YͿĂŶĚůĞƚ,͗ʋ͛ɴсϬl ďĞĂŚLJƉŽƚŚĞƐŝƐŽŶɴ͕ŝ͘Ğ͕͘ʋ͛ɴŝƐĞƐƚŝŵĂďůĞ͘>Ğƚ ɏс΂yɴ͗ȴ͛ɴсϬs΃ĂŶĚɏH с΂yɴ͗ȴ͛ɴсϬs͕ʋ͛ɴсϬl}.


/ĨZ;ʋͿ‫ ف‬Z;ȴͿ͕ƚŚĞŶĚŝŵɏH фĚŝŵɏ. WƌŽŽĨ͘dŽďĞŐŝŶ͕ŽďƐĞƌǀĞƚŚĂƚŝĨZ;ʋͿ‫ ؿ‬Z;ȴͿ͕ƚŚĞŶZ;ʋͿᄰ сE;ʋ͛Ϳ‫ ـ‬Z;ȴͿᄰ = E;ȴ͛Ϳ ĂŶĚ ɏH с ΂yɴ͗ ȴ͛ɴ с Ϭs͕ ʋ͛ɴ с Ϭl΃ с ΂yɴ͗ ȴ͛ɴ с Ϭs΃ с ɏ. Thus when Z;ʋͿ‫ؿ‬Z;ȴͿ͕ɏH сɏĂŶĚƚŚĞƚĞƐƚ,͗ʋ͛ɴсϬl is really testing nothing. Now ĂƐƐƵŵĞZ;ʋͿ‫ ف‬Z;ȴͿ͘dŚĞŶZ;ȴͿᄰ = N(ȴ͛Ϳ‫ ف‬Z;ʋͿᄰ сE;ʋ͛ͿĂŶĚƚŚĞƌĞŵƵƐƚĞdžŝƐƚ ĂƚůĞĂƐƚŽŶĞɴ͕ƐĂLJȾത͕ƐƵĐŚƚŚĂƚȴ͛Ⱦത = 0s͕ďƵƚʋ͛Ⱦത ് 0l. But since Ⱦത ੣ E;ȴ͛ͿĂŶĚ ʋ͛Ⱦത ് 0l, it follows that XȾത ੣ ɏ͕ďƵƚyȾത ‫ ב‬ɏH͘dŚƵƐĚŝŵɏH фĚŝŵɏ as we were to show. Comment. In problem 1 of section 4.10, you are asked to show that if H:ʋ’ɴ= 0l is a linear hypothesis on ɴ, then r(ʋ,ȴ) – r(ȴ) = m – mH where m = dim ё and mH = dim ёH. How is this problem related to proposition 4.5.7? We have so far considered the idea of a linear hypothesis on E(Y) and the idea oĨĂůŝŶĞĂƌŚLJƉŽƚŚĞƐŝƐŽŶɴ͘/ƚƐŚŽƵůĚďĞĐůĞĂƌƚŚĂƚďŽƚŚŽĨƚŚĞƐĞ ideas are simply different ways of describing a linear hypothesis. As a result, the method of testing developed previously in this chapter can be used to test either a linear hypothesis H: E(Y) ੣ ɏH ‫ ؿ‬ɏŽŶ;Y) or a linear ŚLJƉŽƚŚĞƐŝƐ,͗ʋ͛ɴс 0l ŽŶɴďLJƐĞƚƚŝŶŐɏH с΂yɴ͗ʋ͛ɴсϬl͕ɴ੣ ੓}. It should also be realized that a linear hypothesis on the E(Y) and a linear hypothesis on ɴĐĂŶďŽƚŚĚĞƐĐƌŝďĞƚŚĞƐĂŵĞůŝŶĞĂƌŚLJƉŽƚŚĞƐŝƐ͘dŚĂƚŝƐ͕ƐƵƉƉŽƐĞH: E(Y) = 'ɲ͕ʧ͛ɲсϬf is a linear hypothesis on E(YͿĂŶĚƐƵƉƉŽƐĞ,͗ʋ͛ɴсϬl is a linear ŚLJƉŽƚŚĞƐŝƐŽŶɴ͘/ĨƚŚĞƐĞƚƐ΂'ɲ͗ʧ͛ɲсϬf΃ĂŶĚ΂yɴ͗ɴ੣੓H} are the same where ੓H с ;ɴ͗ ʋ͛ɴ с Ϭl͕ ɴ ੣ ੓}, then the two different hypotheses statements describe the same linear hypothesis. With regard to this last statement, we have the following proposition. Proposition 4.5.8. Consider the parameterization E(YͿсyɴ͕ȴ͛ɴсϬs, for E(Y) ĂŶĚůĞƚɏH ‫ ؿ‬ɏ͘dŚĞŶŝƚŝƐĂůǁĂLJƐƉŽƐƐŝďůĞƚŽĨŝŶĚĂŶĞƐƚŝŵĂďůĞƉĂƌĂŵĞƚƌŝĐ ǀĞĐƚŽƌʋ͛ɴƐƵĐŚƚŚĂƚɏH с΂yɴ͗ȴ͛ɴсϬs͕ʋ͛ɴсϬl}. WƌŽŽĨ͘>ĞƚďĞĂŶŶdžŐŵĂƚƌŝdžƐƵĐŚƚŚĂƚE;ͿсɏH ĂŶĚůĞƚʋ͛сቀԢቁ. First, ȟԢ ŽďƐĞƌǀĞƚŚĂƚZ;ʋͿсZ;y͕͛ȴͿ‫ ؿ‬Z;y͕͛ȴͿ͕ŚĞŶĐĞʋ͛ɴŝƐĞƐƚŝŵĂďůĞ͘EŽǁŽďƐĞƌǀĞ that ɏH = N(B) ‫ ת‬ɏсE;Ϳ‫΂ ת‬yɴ͗ȴ͛ɴсϬs΃с΂yɴ͗͛yɴсϬg͕ȴ͛ɴсϬs} с΂yɴ͗ ቀԢቁɴсϬg+s΃с΂yɴ͗ʋ͛ɴсϬl} ȟԢ


where l = g + s. So proposition 4.5.8 guarantees that any linear hypothesis of the form H:E(Y)੣ ɏH on E(YͿĐĂŶĂůƐŽďĞĨŽƌŵƵůĂƚĞĚĂƐĂƐƚĂƚĞŵĞŶƚĂďŽƵƚɴŝŶǀŽůǀŝŶŐ an estimable parametric vector. The reader should also keep in mind that ŝƚŝƐĂůƐŽƉŽƐƐŝďůĞƚŚĂƚƚǁŽĚŝĨĨĞƌĞŶƚƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌƐʋ1͛ɴĂŶĚʋ2͛ɴĐĂŶ ďĞƵƐĞĚƚŽĚĞƐĐƌŝďĞƚŚĞƐĂŵĞůŝŶĞĂƌŚLJƉŽƚŚĞƐŝƐŽŶɴ͘ Example 4.5.9. Consider the parameterization E(YͿсyɴ͕ȴ͛ɴсϬs͘>Ğƚʋ1͛ɴ be ĞƐƚŝŵĂďůĞ ĂŶĚ ůĞƚ ʋ2 с ʋ1 н ȴʌ for some vector ʌ of appropriate ĚŝŵĞŶƐŝŽŶ͘dŚĞŶʋ2͛ɴŝƐĞƐƚŝŵĂďůĞĂŶĚʋ2͛ɴс;ʋ1нȴʌͿ͛ɴсʋ1͛ɴĨŽƌĂůůɴ੣ ੓ = E;ȴ͛Ϳ͘ůĞĂƌůLJƚŚĞƚǁŽŚLJƉŽƚŚĞƐĞƐ,1͗ʋ1͛ɴсϬl and H2͗ʋ2͛ɴсϬl describe the ƐĂŵĞŚLJƉŽƚŚĞƐĞƐŽŶɴ͘ As a final observation in this section, it is worth noting that just as two different looking parameterizations can describe the same expectation ƐƉĂĐĞ ɏ ĨŽƌ ;Y), it is also possible that two different looking parameterizations can describe the same linear hypothesis spĂĐĞɏH. We now give a proposition which relates how the same linear hypothesis can be described in terms of parametric vectors relative to two different parameterizations for E(Y). Proposition 4.5.10. Suppose E(YͿсyɴ͕ȴ͛ɴсϬs and E(YͿсhɲ͕ʧ͛ɲсϬt are both parameterizations for E(Y) ĂŶĚ ĂƐƐƵŵĞ ƚŚĂƚ ʋ͛ɴ ĂŶĚ ȿ͛ɲ ĂƌĞ corresponding parametric vectors. Then: ;ĂͿ,͗ʋ͛ɴсϬl ŝƐĂůŝŶĞĂƌŚLJƉŽƚŚĞƐŝƐŽŶɴ͘ ;ďͿ,͗ȿ͛ɲсϬl ŝƐĂůŝŶĞĂƌŚLJƉŽƚŚĞƐŝƐŽŶɲ. Moreover, (a) and (b) describe the same linear hypothesis. Proof. Since corresponding parametric vectors must be estimable, (a) and ;ďͿĨŽůůŽǁŝŵŵĞĚŝĂƚĞůLJ͘/ƚƌĞŵĂŝŶƐƚŽƐŚŽǁƚŚĂƚɏH(1) с΂yɴ͗ȴ͛ɴсϬs͕ʋ͛ɴсϬl} ĂŶĚɏH(2)с΂hɲ͗ʧ͛ɲсϬt͕ȿ͛ɲсϬl} are equal. Suppose x ੣ ɏH(1). Then x =XȾത ǁŚĞƌĞʋ͛Ⱦത = 0l ĂŶĚȴ͛Ⱦത = 0s for some vector Ⱦത. Because x ੣ ɏǁĞĐĂŶĂůƐŽ ഥ͘ƵƚƚŚĞŶ͕ʋ͛ɴ‫ ؗ‬ȿ͛ɲŝŵƉůŝĞƐ write x = UȽ ഥ ǁŚĞƌĞʧ͛Ƚ ഥ = 0t for some vector Ƚ ത ȿ͛Ƚഥ сʋ͛Ⱦ = 0l so that x ੣ ɏH(2). The reverse containment is analogous.
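Looking back at propositions 4.5.6 and 4.5.7, the question of whether H: π'β = 0l is a linear hypothesis on β reduces to an estimability check, which is a rank computation. The sketch below is a hypothetical illustration (the function and variable names are ours): the first function covers the constrained parameterization E(Y) = Xβ, Δ'β = 0s, where π'β is estimable exactly when R(π) ⊆ R(X') + R(Δ); the second evaluates m – mH = r(π, Δ) – r(Δ) mentioned in the comment after proposition 4.5.7; and the final lines reproduce the negative conclusion of example 4.5.4, whose two-observations-per-group design matrix is written out explicitly.

    import numpy as np

    def is_estimable(X, Pi, Delta=None):
        """Is Pi'beta estimable, i.e. is R(Pi) contained in R(X') + R(Delta)?"""
        base = X.T if Delta is None else np.hstack([X.T, Delta])
        return np.linalg.matrix_rank(np.hstack([base, Pi])) == np.linalg.matrix_rank(base)

    def hypothesis_df(Delta, Pi):
        """m - m_H = r(Pi, Delta) - r(Delta) when H: Pi'beta = 0 is a linear hypothesis on beta."""
        return np.linalg.matrix_rank(np.hstack([Pi, Delta])) - np.linalg.matrix_rank(Delta)

    # Example 4.5.4: E(Y_ij) = mu + alpha_i, i, j = 1, 2; pi'beta = alpha_1 + alpha_2.
    X = np.array([[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]], dtype=float)
    pi = np.array([[0.0], [1.0], [1.0]])
    print(is_estimable(X, pi))   # False: H is not a linear hypothesis on beta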


The above proposition establishes, as one would expect, that corresponding parametric vectors lead to the same linear hypothesis under different parameterizations for E(Y). Example 4.5.11. (Seely (1989)) Consider the linear model with parameterization E(YͿсyɴ͕ȴ͛ɴсϬ͕ŝŶĞdžĂŵƉůĞϰ͘Ϯ͘ϯ͘^Ğƚс;1,B2) where B1 =(1,-1,0)’ and B2 = (1,0,-ϭͿ͛͘^Ğƚhсy͘^ŝŶĐĞZ;ͿсE;ȴ͛Ϳ͕ŝƚĨŽůůŽǁƐƚŚĂƚ E(YͿсhɲ͕ɲ੣ R2, is a parameterization for E(Y) and from proposition 3.3.2 ƚŚĂƚʋ͛ɴ‫ ؗ‬ʋ͛ɲǁŚĞŶĞǀĞƌʋ͛ɴŝƐĞƐƚŝŵĂďůĞ͘dŚƵƐƚŽƚĞƐƚ,͗ɴ1 - ɴ2 = 0 of example 4.5.5 via the parameterization E(YͿсhɲ͕ɲ੣ R2͕ǁĞĐŽŵƉƵƚĞʋ͛с (2,1) and use proposition 4.5.10 to conclude that an equivalent hypothesis on ɲ ŝƐ,͗Ϯɲ1нɲ2 = 0
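A quick numerical companion to example 4.5.11 (a sketch using only the B and π of that example, so no data or design matrix is needed): under the reparameterization E(Y) = Uα with U = XB, the corresponding parametric vector is (π'B)α, and the computation below reproduces the coefficient vector (2, 1).

    import numpy as np

    # Basis for N(Delta') used in example 4.5.11.
    B = np.array([[ 1.,  1.],
                  [-1.,  0.],
                  [ 0., -1.]])
    pi = np.array([1.0, -1.0, 0.0])    # H: beta_1 - beta_2 = 0 from example 4.5.5

    print(pi @ B)                       # [2. 1.]  ->  equivalent hypothesis H: 2*alpha_1 + alpha_2 = 0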

4.6 ANOVA Tables for Testing Linear Hypotheses When calculating the F- statistic for a linear hypothesis, it is often useful (and standard in many places) to present the pertinent statistics in the form of analysis of variance (ANOVA) tables. We discuss below some of the ingredients of such tables. Let us assume the model E(Y) ੣ ɏĂŶĚĐŽǀ;YͿсʍ2V. Further, suppose that E(Y) = yɴ͕ ȴ͛ɴ с Ϭs, is a parameterization for E(Y). To begin our discussion, we reconsider the decomposition of what is often called the total weighted sum of squares Y’V-1Y. Recall that if Ⱦ෠ ŝƐ Őŵ ĨŽƌ ɴ͕ ƚŚĞŶ Ⱦ෡ satisfies XȾ෠ = P’Y ĂŶĚȴ͛Ⱦ෠ = 0s where P is the projection onto M along N. Also recall that ‫܍‬ො = Y - XȾ෠ = (In – P’)Y is called the residual vector and R෡ = ‫܍‬ො’V-1‫܍‬ො = Y’(In – P)V-1(In – P)’Y is called the weighted residual sum of squares. Now observe, using the algebraic relationships given in (2.5.12), that Y’V-1Y = Y’[(In – P) + P)V-1[(In – P) + P]’Y = Y’[(In –P) + P][V-1(In – P)’ + V-1P’]Y = Y’[(In - P) + P][(In – P)V-1 + PV-1]Y = Y’[(In - P)(In – P)V-1 + (In - P)PV-1 + P(In – P)V-1 + PPV-1]Y = Y’[(In - P)V-1(In - P)’ + PV-1P’]Y ෡ + Y’PV-1P’Y = Y’[(In - P)V-1(In – P)’Y + Y’PV-1P’Y = R


where Y'PV-1P'Y is called the weighted regression sum of squares. So the weighted total sum of squares is the sum of the weighted residual and the weighted regression sums of squares. This decomposition of Y'V-1Y is conveniently summarized in the following tabular form.

Table 4.6.1 (Initial ANOVA Table for the Full Model)

  Source of Variation     df       SS             E(SS)
  Weighted Regression     m        Y'PV-1P'Y      mσ2 + (Xβ)'V-1(Xβ)
  Weighted Residual       n – m    R̂              (n – m)σ2
  Weighted Total          n        Y'V-1Y

In the above table, notice that both the df (degrees of freedom) column and the SS (sum of squares) column are additive. Also, note that the last column contains the expected value of the regression SS and the residual SS. To verify the expression for the expected value of the regression SS, we apply proposition 1.3.12 and use (2.5.12) to get

E[Y'PV-1P'Y] = tr[PV-1P'σ2V] + (Xβ)'PV-1P'(Xβ) = tr[PV-1σ2V] + (Xβ)'V-1(Xβ) = σ2tr[P] + (Xβ)'V-1(Xβ)

because Xβ ∈ Ω = R(P') and because tr P = m.

Now suppose H: π'β = 0l is a linear hypothesis on β. Let R̂H be defined in the usual way. Because the reduced model has the same model structure as the full model, we can construct an ANOVA table similar to table 4.6.1. In particular, we get the following:

Table 4.6.2 (ANOVA table for the reduced model under H:)

  Source of Variation     df        SS              E(SS)
  Regression under H:     mH        Y'PHV-1PH'Y     mHσ2 + (Xβ)'PHV-1PH'(Xβ)
  Residual under H:       n – mH    R̂H              (n – mH)σ2 + (Xβ)'(In – PH)V-1(In – PH)'(Xβ)
  Total                   n         Y'V-1Y


The verification of the above expected sums of squares column is straightforward and is left as an exercise. Notice that the E(SS) column is not quite analogous to the corresponding column in table 4.6.1. The reason is that these expectations were taken relative to the full model, i.e., β ∈ Θ, as opposed to taking the expectations relative to the reduced model, i.e., β ∈ ΘH. An exact analogy does occur, however, when β ∈ ΘH, because then Xβ ∈ ΩH so that PHXβ = Xβ and (In – PH)Xβ = 0n.

We can now conveniently summarize in another ANOVA table the necessary statistics for testing the linear hypothesis H: π'β = 0l. One tabular form for the summary ANOVA table is the following:

Table 4.6.3 (ANOVA for testing the null hypothesis)

  Source of Variation     df        SS          E(SS)
  Residual                n – m     R̂           (n – m)σ2
  Residual under H:       n – mH    R̂H          (n – mH)σ2 + (Xβ)'(In – PH)V-1(In – PH)'(Xβ)
  Deviation from H:       m – mH    R̂H – R̂      (m – mH)σ2 + (Xβ)'(P – PH)V-1(P – PH)'(Xβ)

Notice here that the expected sum of squares column can be verified directly from tables 4.6.1 and 4.6.2.

The ANOVA table 4.6.3 is presented in terms of residual sums of squares. Another way of presenting the statistic R̂H – R̂ is via regression sums of squares. Since the sum of squares columns in tables 4.6.1 and 4.6.2 are additive, it is straightforward to see that

R̂H – R̂ = Y'PV-1P'Y – Y'PHV-1PH'Y.

Using this identity, we can conveniently summarize the pertinent statistics necessary to test H: π'β = 0l in the following table:

Table 4.6.4 (Summary ANOVA Table)

  Source of Variation     df        SS
  Regression              m         Y'PV-1P'Y
  Regression under H:     mH        Y'PHV-1PH'Y
  Deviation from H:       m – mH    Y'PV-1P'Y – Y'PHV-1PH'Y

Except for the expected sum of squares column, which can easily be filled in, observe that all of the information needed to test H: π'β = 0l in tables 4.6.1, 4.6.2 and 4.6.3 can be extracted from the above ANOVA table. Be sure to observe also that the "deviation from the hypothesis" line is exactly the same in tables 4.6.3 and 4.6.4. In this section we have presented the fundamental ANOVA tables that are associated with a linear hypothesis. Our presentation did assume that we had a linear hypothesis on β, but it should be clear that our discussion can be applied equally well to a linear hypothesis on E(Y).
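As a worked illustration of how the entries of tables 4.6.1-4.6.4 might be assembled, the sketch below handles the parameterization E(Y) = Xβ, Δ'β = 0s, with H: π'β = 0l. It is an assumption of the sketch (consistent with, but not quoted from, section 2.8) that the fitted value for a model with expectation space R(U) can be written U(U'V-1U)-U'V-1Y, so that P' and PH' are realized as generalized least squares hat matrices of the reduced designs XA and XAH with R(A) = N(Δ') and R(AH) = N(Δ') ∩ N(π'); all names below are ours.

    import numpy as np
    from scipy.linalg import null_space

    def hat_transpose(U, Vinv):
        # P' for the model with expectation space R(U); any g-inverse works, pinv is used here.
        return U @ np.linalg.pinv(U.T @ Vinv @ U) @ U.T @ Vinv

    def anova_tables(Y, X, Delta, Pi, V):
        """Sums of squares appearing in tables 4.6.1-4.6.4 for H: Pi'beta = 0, Delta'beta = 0."""
        Vinv = np.linalg.inv(V)
        A  = null_space(Delta.T)                        # R(A)  = N(Delta')
        AH = null_space(np.hstack([Delta, Pi]).T)       # R(AH) = N(Delta') ∩ N(Pi')
        Pt, PtH = hat_transpose(X @ A, Vinv), hat_transpose(X @ AH, Vinv)
        m, mH = np.linalg.matrix_rank(X @ A), np.linalg.matrix_rank(X @ AH)
        reg   = Y @ Pt.T @ Vinv @ Pt @ Y                # Y'P V^-1 P'Y
        regH  = Y @ PtH.T @ Vinv @ PtH @ Y              # Y'P_H V^-1 P_H'Y
        total = Y @ Vinv @ Y
        return {"R": total - reg, "R_H": total - regH,
                "deviation": reg - regH, "df": (m, mH)}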

4.7 An Alternative Test Statistic for Testing Parametric Vectors In this section, we give an alternative form of the F-statistic that was ĚĞǀĞůŽƉĞĚƉƌĞǀŝŽƵƐůLJĨŽƌƚĞƐƚŝŶŐŚLJƉŽƚŚĞƐĞƐŽŶɴ͘dŽďĞŐŝŶŽƵƌĚŝƐĐƵƐƐŝŽŶ͕ we shall assume an unconstrained parameterization for E(Y) of the form E(YͿсyɴ͕ɴ੣ Rp, cov(YͿсʍ2V > 0. Under ƚŚŝƐŵŽĚĞů͕ɏY с΂yɴ͗ɴ੣ Rp} = R(X) and ੓Y = Rp͘>Ğƚʋ͛ɴďĞĂŶĞƐƚŝŵĂďůĞƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌǁŚĞƌĞʋ͛ŝƐĂŶůdžƉ matrix and suppose we want to test ,͗ʋ͛ɴсϬl.

(4.7.1)

Under the null hypothesis, the reduced model has expectation space ΩYH = {Xβ: π'β = 0l}.

(4.7.2)

tĞĐĂŶǁŝƚŚŽƵƚůŽƐƐŽĨŐĞŶĞƌĂůŝƚLJĂƐƐƵŵĞʋ͛ŚĂƐĨƵůůƌŽǁƌĂŶŬƐŝŶĐĞŝĨʋ͛ does not initially have full row rank, we can eliminate redundant linearly dependent rows in ɎԢ ƚŽĨŽƌŵƚŚĞĚdžƉŵĂƚƌŝdžʋ1’ which does have full row


ƌĂŶŬĂŶĚŽďƐĞƌǀĞƚŚĂƚɏYH с΂yɴ͗ʋ͛ɴсϬl΃с΂yɴ͗ʋ1͛ɴсϬd}. Thus we shall ĂƐƐƵŵĞĨƌŽŵƚŚŝƐƉŽŝŶƚŽŶƚŚĂƚʋ͛ŚĂƐĨƵůůƌŽǁƌĂŶŬ͘dŚŽƵŐŚƚŚŝƐĂƐƐƵŵƉƚŝŽŶ is not necessary, it does make some of our computations later on simpler. Under the full (initial) model, Ⱦ෠ ŝƐ Őŵ ĨŽƌ ɴ ŝĨ ĂŶĚ ŽŶůLJ ŝĨ ŝƚ ƐĂƚŝƐĨŝĞƐ ƚŚĞ equation X’V-1XȾ෠ = X’V-1Y ĂŶĚƚŚĞďůƵĞĨŽƌʋ͛ɴŝƐ ʋ’Ⱦ෠ сʋ͛;y͛s-1X)-X’V-1Y

(4.7.3)

and ĐŽǀ;ʋ͛Ⱦ෠Ϳсʋ͛;y͛s-1X)-X’V-1ʍ2VV-1X(X’V-1X)-‘‘ʋсʍ2ʋ͛;y͛s-1X)-ʋсʍ2Vʋ (4.7.4) Proposition 4.7.5. Let Vʋ be as in (4.7.4). Then Vʋ is nonsingular. WƌŽŽĨ͘^ŝŶĐĞʋ͛ŝƐůdžƉ͕ƌ;ʋ͛Ϳсů͕ʋ͛ɴŝƐĞƐƚŝŵĂďůĞĂŶĚƐŝŶĐĞZ;y͛s-1X) = R(X’), ǁĞŚĂǀĞƚŚĂƚʋсy͛s-1XB for some matrix B. Thus Vʋ сʋ͛;y͛s-1X)-ʋс͛;y͛s-1X)’(X’V-1X)-(X’V-1X)B= B’(X’V-1X)B. So r(Vʋ) = r[B’(X’V-1X)B] and r(Vʋ) = r[B’(X’V-1X)B] = r(XB) ൒ r(X’V-1yͿсƌ;ʋͿсů͘ Now, since Vʋ is an lxl matrix, we have that Vʋ is nonsingular. Combining proposition 4.7.5 with (4.7.3) and (4.7.4), we see that if we assume Y‫ ׾‬Nn;yɴ͕ʍ2V) and Ⱦ෠ ŝƐŐŵĨŽƌɴ͕ƚŚĞŶʋ͛Ⱦ෠ ‫ ׾‬Nl;ʋ͛ɴ͕ʍ2Vʋ). We now prove the following proposition. Proposition 4.7.6. Assume an unconstrained parameterization for E(Y) and that Y ‫ ׾‬Nn;yɴ͕ʍ2sͿ͘/Ĩʋ͛ɴŝƐĞƐƚŝŵĂďůĞǁŚĞƌĞʋ͛ŝƐůdžƉŽĨƌĂŶŬů͕ĂŶĚȾ෠ is ŐŵĨŽƌɴ͕ƚŚĞŶƚŚĞĨŽůůŽǁŝŶŐƐƚĂƚĞŵĞŶƚƐŚŽůĚ͗ (a) (1ͬʍ2Ϳ;ʋ͛Ⱦ෠)’Vʋ-1;ʋ͛Ⱦ෠) ‫ ׽‬ʖ2;ů͕ʄͿǁŚĞƌĞʄс;ϭͬʍ2Ϳ;ʋ͛ɴͿ͛sʋ-1;ʋ͛ɴͿ͘ (b) F෠ с;ϭͬʍ2Ϳ;ʋ͛Ⱦ෠)’ Vʋ-1 ;ʋ͛Ⱦ෠Ϳͬůͬ;ϭͬʍ2) Y’(In – PY)V-1(In – PY)’Y/(n – m) ‫׽‬ F(l,n-m,ʄ) where PY is the projection on M = V-1(ёY) along N = R(X)ᄰ, m = ĚŝŵɏY ĂŶĚʄŝƐĂƐŐŝǀĞŶŝŶ;ĂͿ͘ Proof. (a) In proposition 1.6.4, if we let A = (1/ʍ2) Vʋ-1, the result follows ĚŝƌĞĐƚůLJǁŝƚŚʄс;ϭͬʍ2Ϳ;ʋ͛ɴͿ͛sʋ-1;ʋ͛ɴͿ͘


(b) It has already been established in section 4.4 that ෡ Y~ ʖ2(n – m) ;ϭͬʍ2)Y’(In – PY)V-1(In –PY)’Y с;ϭͬʍ2)R where PY is the projection on M along N. We must now show that the quadratic forms appearing in the numerator and denominator of F are independent. But since the numerator of F෠ is a function of ʋ͛Ⱦ෠ сʋ͛;y͛s-1X)-X’V-1Y and the denominator is a function of (In – PY)’Y, the result follows from proposition 1.6.6 because R(X) ‫ ؿ‬R(PY’) and (In – PY)’ʍ2VV-1X(X’V-1X)-͚ʋсʍ2(In – PY)’X(X’V-1X)-‘ʋ = 0np. hƐŝŶŐƉƌŽƉŽƐŝƚŝŽŶϰ͘ϳ͘ϲ͕ǁĞƐĞĞƚŚĂƚĂƚĞƐƚĨŽƌ,͗ʋ͛ɴсϬl ǀƐ͗ʋ͛ɴ് 0l can be constructed using the F-statistic given in proposition 4.7.6. In ƉĂƌƚŝĐƵůĂƌ͕ǁŚĞŶ,͗ʋ͛ɴсϬl ŝƐƚƌƵĞ͕ǁĞŚĂǀĞƚŚĂƚʄŝŶƚŚĞƉƌŽƉŽƐŝƚŝŽŶŝƐϬ and that F෠ follows a central F-distribution with l and n – m df. On the other ŚĂŶĚ͕ǁŚĞŶ,͗ʋ͛ɴсϬl is noƚƚƌƵĞ͕ʄ് 0 and F෠ given in the proposition follows a non-central F-distribution with non-centrality parameter ʄс;ϭͬʍ2Ϳ;ʋ͛ɴͿ͛sʋ-1;ʋ͛ɴͿ͘dŚŝƐůĞĂĚƐƵƐƚŽƌĞũĞĐƚƚŚĞŶƵůůŚLJƉŽƚŚĞƐŝƐ,͗ʋ͛ɴсϬl if F෠ > fɲ(l,n – m) where fɲ(l, n-m) denotes the (1-ɲ)100 percentile of the central F-distribution associated with F෠ under the null hypothesis. If F෠ ч fɲ(l,n – m), we fail to reject the null hypothesis. While the F-statistic given in proposition 4.7.6 can be used to test ,͗ʋ͛ɴсϬl, we note that we could also have used the F-statistic derived in ƐĞĐƚŝŽŶϰ͘ϰƚŽŵĂŬĞƚŚĞƐĂŵĞƚĞƐƚ͘/ŶƉĂƌƚŝĐƵůĂƌ͕ŝĨǁĞůĞƚɏY с΂yɴ͗ɴ੣ Rp} ĂŶĚɏYH с΂yɴ͗ʋ͛ɴсϬl΃ǁŚĞƌĞʋ͛ɴŝƐĞƐƚŝŵĂďůĞĂŶĚůĞƚWY and PYH be the projections on M along N and on MH along NH, respectively, then we could have used the statistic F෠ = Y’(PY – PYH)V-1(PY – PYH)’Y/(m – mH)/Y’(In – PY)V-1(In – PY)’Y/(n –m) ෡ YH - R ෡ Y)/(m – mH)/R ෡ Y/(n – m) (4.7.7) = (R ƚŽƚĞƐƚ,͗ʋ͛ɴсϬl. However, as we shall now show, the F statistics given in proposition 4.7.6 and (4.7.7) are exactly the same. To simplify the process of showing that these F statistics are the same and simplify calculations,


we begin by showing that the two corresponding statistics for testing H:ʋ’ɴ= 0l under the model E(Y) = Xɴ, ɴ ੣ Rp, cov(Y) = ʍ2In are the same where ʋ’ɴ is estimable and ʋ’ is an lxp matrix of rank l. For this latter model, ɏY=R(X), ɏYH = {Xɴ: ʋ’ɴ = 0l} and if we let PY and PYH denote the orthogonal projections on ɏY and ɏYH, respectively, then the two statistics given in ƉƌŽƉŽƐŝƚŝŽŶϰ͘ϳ͘ϲĂŶĚ;ϰ͘ϳ͘ϳͿĨŽƌƚĞƐƚŝŶŐ,͗ʋ͛ɴсϬl reduce to F෠ 1 с;ʋ͛Ⱦ෠)’Vʋ-1;ʋ͛Ⱦ෠)/l/Y’(In – PY)Y/(n – m) where Vʋсʋ͛;y͛yͿ-ʋ͕Ⱦ෠ ŝƐŐŵĨŽƌɴĂŶĚ F෠ 2 = Y’(PY – PYH)Y/(m – mH)/Y’(In – PY)Y/(n – m). Thus to show that F෠ 1 = F෠ 2 when cov(Y) = ʍ2In, we need to show that the numerator sums of squares in F෠ 1 and F෠ 2 are equal, i.e., we need to show that ;ʋ͛Ⱦ෠)’Vʋ-1;ʋ͛Ⱦ෠)/l = Y’(PY – PYH)Y/(m – mH). dŽƚŚŝƐĞŶĚ͕ŽďƐĞƌǀĞƚŚĂƚďĞĐĂƵƐĞʋ͛ɴŝƐĞƐƚŝŵĂďůĞ͕ʋсy͛ĨŽƌƐŽŵĞŵĂƚƌŝdž B, hence from 4.7.4 it follows that Vʋ= B’X(X’X)-X’B = B’PYB = B’PY’PYB where PY is the orthogonal projection on ёY = R(X). Also, because Ⱦ෠ ŝƐŐŵĨŽƌɴ͕ŝƚ satisfies XȾ෠ = PYY, hence ;ʋ͛Ⱦ෠)’Vʋ-1;ʋ͛Ⱦ෠) = (B’XȾ෠)’Vʋ-1(B’XȾ෠) = (B’PYY)’Vʋ-1(B’PYY) = Y’PYB(B’PYB)-1B’PYY. Thus to show that F෠ 1= F෠ 2, we need to show that (PYB)(B’PYB)-1(B’PY) = (PYB)(B’PY’PyB)-1(PYB)’ = PY - PYH and that l = m – mH. Proposition 4.7.8. Consider the model E(YͿсyɴ͕ɴ੣ Rp, and cov(YͿсʍ2In. >Ğƚʋсy͛ďĞĂƉdžůŵĂƚƌŝdžŽĨƌĂŶŬůĂŶĚƐƵƉƉŽƐĞǁĞǁĂŶƚƚŽƚĞƐƚ,͗ʋ͛ɴсϬl. Then (a) l = m – mH where m = dim(ɏY) and mH = dim(ɏYH). (b) PYB(B’PYB)-1B’PY = PY – PYH.


Proof. (a) First, note that if A is a matrix such that R(A) = N(ʋ’), then ɏH=R(XA). Thus mH сĚŝŵ;ɏYH) = r(XA) = r(A) – dim(N(X) ‫ ת‬Z;ͿͿсŶ;ʋ͛Ϳ– dim[N(X) ‫ ת‬E;ʋ͛Ϳ΁ ௑ сŶ;ʋ͛Ϳ– dim[N[൫஠ᇱ ൯]

= p – l – (p – r(X', π)) = r(X', π) – l = r(X') – l = r(X) – l = m – l. Thus l = m – mH as we were to show. To prove (b), note that PYB(B'PYB)-1B'PY is an orthogonal projection on R(PYB) along R(PYB)⊥ = N(B'PY) and by proposition A11.10 that PY – PYH is an orthogonal projection on R(PY – PYH) = ΩYH⊥ ∩ ΩY along R(PY – PYH)⊥ = (ΩYH⊥ ∩ ΩY)⊥ = ΩYH ⊕ ΩY⊥. Thus it suffices to show that ΩYH ⊕ ΩY⊥ = N(B'PY). To show this latter equality, first observe that since R(X)⊥ = ΩY⊥ = N(PY), we have that ΩY⊥ ⊆ N(B'PY). Now suppose that a ∈ ΩYH, i.e., a = Xρ for some ρ where π'ρ = 0l. Then PYa = a since ΩYH ⊆ ΩY = R(PY). Now B'PYa = B'a = B'Xρ = π'ρ = 0l which implies that a ∈ N(B'PY), hence ΩYH ⊕ ΩY⊥ ⊆ N(B'PY). But we also have that dim[ΩYH ⊕ ΩY⊥] = dim(ΩYH) + dim(ΩY⊥) = dim(ΩYH) + dim(R(X)⊥) = mH + (n – m) and dim[N(B'PY)] = n – r(B'PY) = n – r(B'PY'PYB) = n – r(Vπ) = n – l = n – (m – mH) = mH + n – m which gives dim(ΩYH ⊕ ΩY⊥) = dim(N(B'PY)), hence equality of the two subspaces.

Using Proposition 4.7.8, we see that F̂1 = F̂2 for testing H: π'β = 0l under the model E(Y) = Xβ, β ∈ Rp, and cov(Y) = σ2In. We now show that the test statistics given in proposition 4.7.6 and (4.7.7) for testing H: π'β = 0l under the model E(Y) = Xβ, β ∈ Rp, cov(Y) = σ2V, are the same where π = X'B is a pxl


matrixofrankl.ToshowthattheseFͲstatisticsarethesame,weneedto showthat (Ʌ’Ⱦ෠)’VʋͲ1(ʋ’Ⱦ෠)=Y’(PY–PYH)VͲ1(PY–PYH)’Y  (4.7.9) whereȾ෠isgmforɴ,Vʋ=ʋ’(X’VͲ1X)Ͳʋ,PYistheprojectiononMY=R(VͲ1X) along NY =R(X)ᄰand PYH is theprojectiononMYH =R(VͲ1XA)alongNYH= R(XA)ᄰandR(A)=N(ʋ’).Toestablish(4.7.9),letQbeamatrixsuchthatV= QQ’andletW=QͲ1Y.Thenasinsection2.8,wehavethatE(W)=QͲ1Xɴ= HɴwhereH=QͲ1Xandcov(W)=ʍ2In.UnderthismodelforW,ɽW=ɽY=Rp, ёW=Q1[ёY]={QͲ1Xɴ:ɴ੣Rp}={Hɴ:ɴ੣Rp},ɽWH=ɽYH={ɴ:ʋ’ɴ=0l}andёWH= QͲ1[ёYH]={QͲ1Xɴ:ʋ’ɴ=0l}={Hɴ:ʋ’ɴ=0l}.Now,becauseQͲ1isnonsingular, itisaonetoonetransformationfromwhichitfollowsthatthereisaone toonecorrespondencebetweentheexpectationvectorsforE(W)inёW andёWHandtheexpectationvectorsforE(Y)inёYandёYH.Furthermore, itfollowsthatE(W)੣ёWHifandonlyofE(Y)੣ёYH.Thusweseethattesting H:E(Y)੣ёYHisequivalenttotestingH’:E(W)੣ёWH.Butinthemodelfor W,wehaveshownthat (ʋ’Ⱦ෠)’VɅWͲ1(ʋ’Ⱦ෠)=W’(PW–PWH)W whereʋ=H’C=X’QͲ1’C=X’B(B=QͲ1’C),Ⱦ෠isgmforɴinthemodelforW andsatisfiesHȾ෠=PWW,VɅW=ʋ’(H’H)ͲʋandPWandPWHaretheorthogonal projections onto MW = ёW = R(H) and MWH = ёWH = R(HA), respectively, whereR(A)=N(ʋ’).Now W’(PW–PWH)W=(QͲ1Y)’[H(H’H)ͲH]ͲͲ[(HA)(A’H’HA)ͲA’H’](QͲ1Y) =Y’QͲ1’[QͲ1X(X’QͲ1’QͲ1X)ͲX’QͲ1’–QͲ1XA(A’X’QͲ1’QͲ1XA)ͲA’X’QͲ1’]QͲ1Y =Y’[VͲ1X(X’VͲ1X)ͲX’VͲ1–VͲ1XH(XH’VͲ1XH)ͲXH’VͲ1]Y =Y’(PYVͲ1–PYHVͲ1)Y[seelemma2.5.11] =Y’(PY–PYH)VͲ1Y=Y’(PY–PYH)VͲ1(PYͲPYH)’Y wherethelastequalityfollowsfromtherelationsgivenin(2.5.12)andthe fact that R(PYH) ‫ ؿ‬R(PY). Also, we have that because Ⱦ෠ is gm for ɴ and satisfiesHȾ෠=PWW,ʋ=H’Cand


C’H(H’H)ͲH’W=C’QͲ1X(X’QͲ1’QͲ1X)ͲX’QͲ1’QͲ1Y=B’X(X’VͲ1X)X’VͲ1Y=B’PY’Y, (ʋ’Ⱦ෠)’VʋWͲ1(ʋ’Ⱦ෠)=(C’HȾ෠)’VʋWͲ1(C’HȾ෠)=(C’PWW)’VʋWͲ1(C’PWW) =(W’PW’C)(ʋ’(H’H)Ͳʋ)Ͳ1C’PWW =(Y’QͲ1’H(H’H)ͲH’C)(ʋ’(X’QͲ1’QͲ1X)Ͳʋ)Ͳ1C’H(H’H)ͲH’QͲ1Y =Y’PYB(ʋ’(X’VͲ1X)Ͳʋ)Ͳ1B’PY’Y =Ⱦ෠’X’B(ʋ’(X’VͲ1X)Ͳʋ)Ͳ1B’XȾ෠=Ⱦ෠’ʋ(ʋ’(X’VͲ1X)Ͳʋ)Ͳ1ʋ’Ⱦ෠. =Ⱦ෠’ʋVʋͲ1ʋ’Ⱦ෠. Finally,sinceW’(PW–PWH)W=(ʋ’Ⱦ෠)’VʋWͲ1(ʋ’Ⱦ෠),wehavethedesiredresult. Comment. Which form of the FͲstatistic given in proposition 4.7.6 and (4.7.7)shouldbeusedtotestH:ʋ’ɴ=0lwhereʋ’ɴisestimabledepends upon which is computationally most convenient. If computing VʋͲ1 is computationallysimplerthancomputingPYandPYH,thenitmaybeeasier tousetheformoftheFͲstatisticgiveninproposition4.7.6.Otherwise,one couldusetheformgivenin(4.7.7). While the alternative test for a parametric vector developed in this sectionassumedaninitialfullmodelparameterizationoftheformE(Y)=Xɴ, ɴ੣Rp,similartestscanbeusedafterappropriatereparameterizationfor othermodelsaswell.Inparticular,supposewestartwithamodelofthe form E(Y)=Xɴ,ȴ’ɴ=0s,cov(Y)=ʍ2V.  (4.7.10) Inthismodel,theparameterspaceis੓={ɴ:ȴ’ɴ=0s}andtheexpectation space is ɏ = {Xɴ: ȴ’ɴ = 0s}. Now suppose ʋ’ɴ is estimable in parameterization(4.7.10)whereʋ’isanlxpmatrixofrankIandwewant totestH:ʋ’ɴ=0l.Underthishypothesis,thereducedmodelhas ੓H={ɴ:ȴ’ɴ=0s,ʋ’ɴ=0l}andɏH={Xɴ:ȴ’ɴ=0s,ʋ’ɴ=0l}. NowletAbeamatrixsuchR(A)=N(ȴ’)andletU=XA.Thenanalternative parameterizationforE(Y)in(4.7.10)isclearly


E(Y) = Uα, α ∈ Rm.

(4.7.11)

&ƵƌƚŚĞƌŵŽƌĞ͕ĨƌŽŵƉƌŽƉŽƐŝƚŝŽŶϯ͘ϯ͘Ϯ͕ǁĞŚĂǀĞƚŚĂƚʋ͛ɴ‫ ؗ‬ʋ͛ɲсʗ͛ɲǁŚĞƌĞ ʗ͛сʋ͛ĂŶĚĨƌŽŵƉƌŽƉŽƐŝƚŝŽŶϰ͘ϱ͘ϭϬƚŚĂƚ,͗ʋ͛ɴсϬl ĂŶĚ,͗ʗ͛ɲсϬl both describe the same reduced model expectation space under the parameterizations for E(Y) given in (4.7.10) and (4.7.11). We can thus test ,͗ ʋ͛ɴ с Ϭl ŝŶ ƉĂƌĂŵĞƚĞƌŝnjĂƚŝŽŶ ;ϰ͘ϳ͘ϭϬͿ ďLJ ƚĞƐƚŝŶŐ ,͗ ʗ͛ɲ с Ϭl in parameterization (4.7.11). This latter hypothesis can be tested using the form of the F-statistic developed in this section or the form given in (4.7.7) Comment. The alternative F-test for parametric vectors developed in this ƐĞĐƚŝŽŶƌĞůŝĞĚŽŶƚŚĞĂƐƐƵŵƉƚŝŽŶƚŚĂƚʗ͛ɲŝŶƉĂƌĂŵĞƚĞƌŝnjĂƚŝŽŶ;ϰ͘ϳ͘ϭϭͿǁĂƐ an estimable parametric vector having full row rank. However, as ŵĞŶƚŝŽŶĞĚƉƌĞǀŝŽƵƐůLJ͕ŝĨʗ͛ĚŽĞƐŶŽƚŚĂǀĞĨƵůůƌŽǁƌĂŶŬ͕ǁĞĐĂŶĞůŝŵŝŶĂƚĞ ƌĞĚƵŶĚĂŶƚůŝŶĞĂƌůLJĚĞƉĞŶĚĞŶƚƌŽǁƐĨƌŽŵʗ͛ƚŽĂƌƌŝǀĞĂƚĂŵĂƚƌŝdžʗ1’ and ƚŚĞĐŽƌƌĞƐƉŽŶĚŝŶŐƌŽǁƐĨƌŽŵʋ͛ƚŽĂƌƌŝǀĞĂƚĂŵĂƚƌŝdžʋ1͛ƐŽƚŚĂƚ,͗ʗ͛ɲсϬl, ,͗ ʗ1͛ɲ с Ϭg͕ ,͗ ʋ͛ɴ с Ϭl ĂŶĚ ,͗ʋ1͛ɴс Ϭg all describe the same reduced expectation spaces in parameterizations (4.7.10) and (4.7.11)
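A minimal numerical sketch of the alternative statistic of proposition 4.7.6 for the unconstrained parameterization E(Y) = Xβ, β ∈ Rp: it assumes π' has full row rank and, for simplicity, that X'V-1X is invertible; the function and argument names are illustrative only, not part of the text. For a constrained model one would first reparameterize as described above (U = XA with R(A) = N(Δ')) and then apply the same function to U and ψ' = π'A.

    import numpy as np

    def wald_type_F(Y, X, Pi_t, V):
        """F of proposition 4.7.6 for H: Pi'beta = 0 under E(Y) = X beta, cov(Y) = sigma^2 V."""
        n, _ = X.shape
        Vinv = np.linalg.inv(V)
        XtVX = X.T @ Vinv @ X
        beta_hat = np.linalg.solve(XtVX, X.T @ Vinv @ Y)    # gm for beta (full rank assumed)
        est = Pi_t @ beta_hat                               # pi'beta_hat
        V_pi = Pi_t @ np.linalg.solve(XtVX, Pi_t.T)         # pi'(X'V^-1 X)^- pi
        l = Pi_t.shape[0]
        m = np.linalg.matrix_rank(X)
        resid = Y - X @ beta_hat
        R_hat = resid @ Vinv @ resid
        return (est @ np.linalg.solve(V_pi, est) / l) / (R_hat / (n - m))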

4.8 Tests of Hypotheses in Models with Nonhomogeneous Constraints In this section we extend the results given in section 4.4 of this chapter to the case of testing hypotheses in models having nonhomogeneous constraints. So in our full or initial model, we consider a random vector Y following a model of the following form: Model for Y; E(Y) сyɴ͕ȴ͛ɴс੘, cov(YͿсʍ2V > 0 ǁŚĞƌĞsŝƐĂŬŶŽǁŶŵĂƚƌŝdžĂŶĚʍ2 > 0 is an unknown constant. Under this model for Y, the parameter space is ੓Y с΂ɴ͗ȴ͛ɴс੘} = Ⱦത + N(ȴ͛ͿĂŶĚƚŚĞ ĞdžƉĞĐƚĂƚŝŽŶƐƉĂĐĞŝƐɏY с΂yɴ͗ȴ͛ɴс੘} = X[੓Y] = XȾത нy΀E;ȴ͛Ϳ΁ǁŚĞƌĞȾത is any vector in ੓Y. Thus in this model for Y, ੓Y ĂŶĚɏY are both affine sets rather than subspaces. So suppose we are interested in testing a hypothesis about Y of the form H:E(Y) ੣ ɏYH ‫ ؿ‬ɏY ǁŚĞƌĞɏYH is an affine ƐƵďƐĞƚŽĨɏY. To conduct the test, we transform the random vector Y to a random vector Z having homogeneous constraints and the hypothesis about Y to an equivalent hypothesis about Z. So to this end, let Ⱦത be any vector in ੓Y and let Z = Y - XȾത. Under this transformation of Y, ĨŽƌĂŶLJɴ੣੓Y,


E(Z)= E[Y – XȾത΁сy;ɴെ ȾതͿсyɲǁŚĞƌĞɲсɴ- Ⱦത ĂŶĚȴ͛;ɴ-ȾതͿсȴ͛ɲсϬs. Thus under the above transformation, we see that Z follows a model of the following form: Model for Z: E(ZͿсyɲ͕ȴ͛ɲсϬs, cov(ZͿсʍ2V. In addition, observe that under this model for Z, the parameter space is ੓Zс΂ɲ͗ȴ͛ɲсϬs΃сE;ȴ͛ͿĂŶĚƚŚĂƚƚŚĞĞdžƉĞĐƚĂƚŝŽŶƐƉĂĐĞŝƐɏZ с΂yɲ͗ȴ͛ɲсϬs}= y΀E;ȴ͛Ϳ΁͘tĞĂůƐŽŚĂǀĞƚŚĞĨŽůůŽǁŝŶŐĞĂƐŝůLJĞƐƚĂďůŝƐŚĞĚƌĞůĂƚŝŽŶƐŚŝƉƐ͗ ੓Z = ੓Y - Ⱦത͕ɏZ сɏY - XȾത, ੓Y = ੓Z + Ⱦത ĂŶĚɏY сɏZ + XȾത. ůƐŽŶŽƚĞƚŚĂƚŝĨǁĞůĞƚɏZH сɏYH – XȾത, then it follows from proposition ϭϯ͘ϯƚŚĂƚɏZH ŝƐĂƐƵďƐƉĂĐĞŽĨɏZ. In addition, because the translation of ƚŚĞ ĂĨĨŝŶĞ ƐĞƚ ɏY ƚŽ ƚŚĞ ƐƵďƐƉĂĐĞ ɏZ is a one to one transformation, it follows that there is a one to one correspondence between the expectation vectors for Y ŝŶɏY and the expectation vectors for Z ŝŶɏZ. This also implies that there is a one to one correspondence between the expectation vectors for Y ŝŶɏYH and the expectation vectors for Z ŝŶɏZH and furthermore that E(Y) ੣ ɏYH if and only if E(Z)੣ ɏZH. Thus we see that testing the hypothesis H: E(Y) ੣ ɏYH in the model for Y is equivalent to testing the hypothesis H’: E(Z) ੣ ɏZH in the model for Z. This latter test can be conducted using results given in the previous sections of this chapter. In particular, if we let PZ and PZH be the projections onto MZ = V-1(ёZ) along NZс ɏZᄰ and onto MZH = V-1΀ɏZH] along NZH с ɏZHᄰ, respectively, then the appropriate statistic for testing H’: E(Z) ੣ ɏZH is from section 4.4 F෠ = Z’(PZ–PZH)V-1(PZ–PZH)’Z /(m–mH)/Z’(In -PZ)V-1(In–PZ)’Z/(n – m) (4.8.1) where m = r(PZͿсĚŝŵɏZ and mH = r(PZHͿсĚŝŵɏZH. Using F, we reject the null hypothesis in the model for Z if F෠ > fɲ(m – mH, n – m). We can also express this latter F-statistic in terms relative to the model for Y. To do this, the reader should recall from section 2.6 that the projections PZ and PZH defined above are precisely the projections defined in section 2.6 to find random vectors having the gm properties relative to to models having homogeneous constraints and relative to the reduced model defined by the linear hypothesis on E(Y) given by H: E(Y)੣ɏYH. Thus if we let PY = PZ, PYH = PZH and Z = Y - XȾത where Ⱦത ƐĂƚŝƐĨŝĞƐȴ͛Ⱦ = ੘, we can rewrite the F-statistic given above in terms relative to the model for Y as


F෠ = (Y - XȾത)’(PY – PYH)V-1(PY – PYH)’(Y - XȾത)/(m - mH)/ (Y - XȾത)’(In – PY)V-1(In – PY)’(Y - XȾത)/(n – m) . (4.8.2) The reader should note that the denominator of this latter expression for the F-ƐƚĂƚŝƐƚŝĐŝƐĞdžĂĐƚůLJƚŚĞƐĂŵĞĂƐƚŚĞƵŶďŝĂƐĞĚĞƐƚŝŵĂƚŽƌĨŽƌʍ2 derived in section 2.9 relative to the model for Y. In calculating the F-statistic given in (4.8.2) for models having nonhomogeneous constraints, it is often convenient to give the appropriate expressions for sums of squares in the form of ANOVA tables such as given in section 4.6. In particular, in the sums of squares expressions given prior to table 4.6.1, if we replace Y by Y – XȾത where Ⱦത is any solution to ȟᇱ Ⱦ = ԃ and set P = PY we get (Y - XȾത)’V-1(Y - XȾത) = (Y - XȾത)’(In – PY)V-1(In - PY)’(Y - XȾത) + (Y - XȾത)’PYV-1 PY’(Y - XȾത) ෡ Y + (Y- XȾത)’PYV-1PY’(XȾ෠ - XȾത) =R

(4.8.3)

where R̂Y = (Y – Xβ̄)'(In – PY)V-1(In – PY)'(Y – Xβ̄). In (4.8.3),
(1) (Y – Xβ̄)'V-1(Y – Xβ̄) is called the total weighted corrected sum of squares (ss),
(2) (Y – Xβ̄)'(In – PY)V-1(In – PY)'(Y – Xβ̄) = R̂Y is called the weighted corrected residual (error) ss, and
(3) (Y – Xβ̄)'PYV-1PY'(Y – Xβ̄) is called the weighted corrected regression ss.
These sums of squares are typically given in an initial ANOVA table for the full model such as the following:

Table 4.8.4 (Initial ANOVA table for the full model)

  Source of variation                 df       ss
  Weighted corrected regression ss    m        (Y – Xβ̄)'PYV-1PY'(Y – Xβ̄)
  Weighted corrected residual ss      n – m    R̂Y
  Total weighted corrected ss         n        (Y – Xβ̄)'V-1(Y – Xβ̄)


Now, for the hypothesis H: E(Y) ∈ ΩYH where ΩYH is an affine subset of ΩY, we can for the reduced model obtain sums of squares similar to those given in table 4.8.4 for the full model. If we let PYH be the projection on MH = V-1[ΩYH – Xβ̄] along NH = [ΩYH – Xβ̄]⊥ where β̄ satisfies Δ'β̄ = ੘, we get the following sums of squares for the reduced model:

(Y – Xβ̄)'V-1(Y – Xβ̄) = (Y – Xβ̄)'(In – PYH)V-1(In – PYH)'(Y – Xβ̄) + (Y – Xβ̄)'PYHV-1PYH'(Y – Xβ̄)
= R̂YH + (Y – Xβ̄)'PYHV-1PYH'(Y – Xβ̄)

where

R̂YH = (Y – Xβ̄)'(In – PYH)V-1(In – PYH)'(Y – Xβ̄).

Similar to table 4.8.4, we have the following ANOVA table for the reduced model:

Table 4.8.5 (ANOVA table for the reduced model)

  Source of variation                 df        ss
  Weighted corrected regression ss    mH        (Y – Xβ̄)'PYHV-1PYH'(Y – Xβ̄)
  Weighted corrected residual ss      n – mH    R̂YH
  Total weighted corrected ss         n         (Y – Xβ̄)'V-1(Y – Xβ̄)

We also observe that using lemma 4.3.8, it can be shown that

R̂YH – R̂Y = (Y – Xβ̄)'(PY – PYH)V-1(PY – PYH)'(Y – Xβ̄).

Thus the F-statistic given in (4.8.2) can also be expressed as

F̂ = (R̂YH – R̂Y)/(m – mH) / [R̂Y/(n – m)].


This latter expression can easily be computed from the information provided in tables 4.8.4 and 4.8.5 or from the following table that summarizes the information given in tables 4.8.4 and 4.8.5:

Table 4.8.6 (summary table for testing H: E(Y) ∈ ΩH)

  Source of variation                df        ss
  Residual under the full model      n – m     R̂Y
  Residual under the reduced model   n – mH    R̂YH
  Deviation from the hypothesis      m – mH    R̂YH – R̂Y

Comment. With regards to tables 4.8.4 and 4.8.5, the reader should keep in mind that the weighted corrected regression ss and the weighted corrected total ss may not be unique, i.e., different choices of the β̄ vector may lead to different sums of squares.
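To make the reduction of this section concrete, the sketch below transforms a model with nonhomogeneous constraints to the homogeneous-constraint model for Z = Y – Xβ̄ and returns the corrected sums of squares of table 4.8.4. Obtaining β̄ by least squares and realizing PY' as a generalized least squares hat matrix of X times a basis of N(Δ') are implementation choices of the sketch, not constructions quoted from the text; the argument c stands in for the model's constraint vector.

    import numpy as np
    from scipy.linalg import null_space

    def corrected_ss(Y, X, Delta, c, V):
        """Corrected sums of squares of table 4.8.4 for E(Y) = X beta, Delta'beta = c."""
        Vinv = np.linalg.inv(V)
        beta_bar = np.linalg.lstsq(Delta.T, c, rcond=None)[0]   # a particular solution of Delta'beta = c
        Z = Y - X @ beta_bar                                    # Z follows the homogeneous-constraint model
        U = X @ null_space(Delta.T)                             # spans Omega_Z = X[N(Delta')]
        Pt = U @ np.linalg.pinv(U.T @ Vinv @ U) @ U.T @ Vinv    # P_Y' taken as the GLS hat matrix (assumption)
        total = Z @ Vinv @ Z
        regression = Z @ Pt.T @ Vinv @ Pt @ Z
        return {"total": total, "regression": regression, "residual": total - regression}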

4.9 Testing Parametric Vectors in Models with Nonhomogeneous constraints

In the previous section we considered testing linear hypotheses on E(Y) for random vectors Y in models having the following form:

Model for Y: E(Y) = Xβ, Δ'β = ੘, cov(Y) = σ2V > 0

where V is a known matrix and σ2 is an unknown constant. Under this model, ΘY = {β: Δ'β = ੘} = β̄ + N(Δ') and ΩY = {Xβ: Δ'β = ੘} = Xβ̄ + X[N(Δ')] where β̄ is any vector in ΘY. Thus ΘY and ΩY are both affine sets. In this set up, we derived an F-statistic for testing H: E(Y) ∈ ΩYH vs A: E(Y) ∉ ΩYH where ΩYH is an affine subset of ΩY. In this section, we consider linear hypotheses on β in models having nonhomogeneous constraints which can be stated in terms of a parametric vector as

H: π'β = ρ vs A: π'β ≠ ρ.

(4.9.1)

Comment. We could consider hypotheses that can be stated as H: π'β + c = ρ vs A: π'β + c ≠ ρ. However, after subtracting c from both sides of the equations given in (4.9.1), we end up with a statement such as given in


(4.9.1). Thus we may as well only consider hypotheses of the form given in (4.9.1). hŶĚĞƌĂůŝŶĞĂƌŚLJƉŽƚŚĞƐŝƐŽŶɴƐƵĐŚĂƐŐŝǀĞŶŝŶ;ϰ͘ϵ͘ϭͿ͕ ੓YH с΂ɴ੣ ੓Y͗ʋ͛ɴсʌ΃с΂ɴ͗ȴ͛ɴс੘͕ʋ͛ɴсʌ΃сߚҧ н΀E;ȴ͛Ϳ‫ ת‬E;ʋ͛Ϳ where Ⱦത ƐĂƚŝƐĨŝĞƐȴ͛Ⱦത = ੘ ĂŶĚʋ͛Ⱦത сʌĂŶĚ ɏYH с΂yɴ͗ɴ੣ ੓YH} = XȾത нy΀E;ȴ͛Ϳ‫ ת‬E;ʋ͛Ϳ΁͘ We note that much of the discussion given in section 4.5 concerning the formulation of unambiguous hypotheses applies equally well to this ƐĞĐƚŝŽŶ͘ /Ŷ ƉĂƌƚŝĐƵůĂƌ͕ ǁŝƚŚŽƵƚ ƐŽŵĞ ĂĚĚŝƚŝŽŶĂů ĐŽŶĚŝƚŝŽŶƐ ŽŶ ʋ͛ɴ͕ ŝƚ ĐĂŶ ŚĂƉƉĞŶ ƚŚĂƚ ĨĂŝůƵƌĞ ƚŽ ƌĞũĞĐƚ ƚŚĞ ŶƵůů ŚLJƉŽƚŚĞƐŝƐ ,͗ ʋ͛ɴ с ʌ ůĞĂĚƐ ƵƐ ƚŽ ĐŽŶĐůƵĚĞƚŚĂƚɴ੣ ੓Y ĂŶĚyɴ੣ ɏYH, but not ŶĞĐĞƐƐĂƌŝůLJƚŚĂƚɴ੣ ੓YH. To avoid this type of an ambiguous conclusion, we make the following definition analogous to definition 4.5.3. ĞĨŝŶŝƚŝŽŶ ϰ͘ϵ͘Ϯ͘  ůŝŶĞĂƌ ŚLJƉŽƚŚĞƐŝƐ ŽŶ ɴ ŝƐ Ă ŚLJƉŽƚŚĞƐŝƐ ŽĨ ƚŚĞ ĨŽƌŵ ,͗ʋ͛ɴсʌƐƵĐŚƚŚĂƚŝĨȾത ੣ ੓Y and XȾത ੣ ɏYH, then Ⱦത ੣ ੓YH. Any hypothesis satisfying definition 4.9.2 avoids the type of ambiguous interpretation mentioned in the previous paragraph should we fail to ƌĞũĞĐƚ,͗ʋ͛ɴсʌ͘dŚĞĐŽŶĚŝƚŝŽŶŽŶʋ͛ɴƚŚĂƚŝƐŶĞĞĚĞĚƚŽƐĂƚŝƐĨLJĚĞĨŝŶŝƚŝŽŶ 4.9.2 is the same as that given in section 4.5. >ĞŵŵĂϰ͘ϵ͘ϯ͘^ƵƉƉŽƐĞʋ͛ɴсʌŝƐĂĐŽŶƐŝƐƚĞŶƚƐĞƚŽĨůŝŶĞĂƌĞƋƵĂƚŝŽŶƐ͘dŚĞŶ ,͗ʋ͛ɴсʌĐŽŶƐƚŝƚƵƚĞƐĂůŝŶĞĂƌŚLJƉŽƚŚĞƐŝƐŽŶƚŚĞƉĂƌĂŵĞƚĞƌǀĞĐƚŽƌɴŝĨĂŶĚ ŽŶůLJŝĨZ;ʋͿ‫ؿ‬Z;y͛ͿнZ;ȴͿ͕ŝ͘Ğ͕͘ŝĨĂŶĚŽŶůLJŝĨʋ͛ɴŝƐĞƐƚŝŵĂďůĞ͘ WƌŽŽĨ͘^ƵƉƉŽƐĞZ;ʋͿ‫ ؿ‬R(X͛ͿнZ;ȴͿĂŶĚƐƵƉƉŽƐĞɴ1 ੣ ɏY ĂŶĚyɴ1 ੣ ɏYH. Then ǁĞ ĐĂŶ ĨŝŶĚ ɴ2 ੣ ੓YH ƐƵĐŚ ƚŚĂƚ yɴ1 с yɴ2 ĂŶĚ ʋ͛ɴ2 с ʌ͘ Ƶƚ ƐŝŶĐĞ ʋ͛ɴ /Ɛ ĞƐƚŝŵĂďůĞ͕ŝƚŝƐĂůƐŽŝĚĞŶƚŝĨŝĂďůĞ͕ŚĞŶĐĞǁĞŚĂǀĞƚŚĂƚʋ͛ɴ1 сʋ͛ɴ2сʌĂŶĚƚŚĂƚ ɴ1 ੣ ੓YH , ŽŶǀĞƌƐĞůLJ͕ ƐƵƉƉŽƐĞ ,͗ ʋ͛ɴ с ʌ ƐĂƚŝƐĨŝĞƐ ĚĞĨŝnition 4.9.2. Let ɴ1͕ɴ2੣੓Y ďĞƐƵĐŚƚŚĂƚyɴ1 сyɴ2͘tĞǁŝůůƐŚŽǁƚŚĂƚʋ͛ɴŝƐŝĚĞŶƚŝĨŝĂďůĞ͕ŚĞŶĐĞ ĞƐƚŝŵĂďůĞ͘^ŝŶĐĞɴ1͕ɴ2 ੣ ੓Y͕ǁĞĐĂŶĞdžƉƌĞƐƐɴ1 = Ⱦത нɲ1 ĂŶĚɴ2 = Ⱦതнɲ2 where ɲ1͕ɲ2 ੣ E;ȴ͛ͿĂŶĚȾത੣੓YH. Now let Ⱦ෨сɴ1–ɴ2 сɲ1–ɲ2. Then XȾ෨ = 0 ੣ ɏYH-XȾത = y΀E;ȴ͛Ϳ ‫ ת‬E;ʋ͛Ϳ΁ ǁŚŝĐŚ ŝƐ Ă ƐƵďƐƉĂĐĞ͘ Ƶƚ ƚŚĞŶ yȾ෨ + XȾത ੣ ɏYH and since


Ⱦ෨+Ⱦത੣੓Y,itfollowsfromdefinition4.9.2thatȾ෨+Ⱦത੣੓YH.Butthenʋ’(Ⱦ෨+Ⱦത)=ʌ andʋ’(ɴ1–ɴ2)=0l,henceʋ’ɴ1=ʋ’ɴ2asweweretoshow ThustotestahypothesisonɴinthemodelforY,itonlymakessense totestalinearhypothesisoftheformH:ʋ’ɴ=ʌwhereʋ’ɴisanestimable parametricvector.Whilethereareseveralmethodsofperformingsucha test,perhapsthesimplestisthemethodoftestinggiveninsection4.8.To implementthismethodoftesting,wefirstneedtoidentifytheaffineset ɏYHassociatedwithH:ʋ’ɴ=ʌ,i.e.,let੓YH={ɴ੣੓Y:ʋ’ɴ=ʌ}={ɴ:ȴ’ɴ=੘, ʋ’ɴ=ʌ}=Ⱦത+[N(ȴ’)‫ת‬N(ʋ’)]whereȾതsatisfiesȴ’Ⱦത=੘andʋ’Ⱦത=ʌandthen letɏYH={Xɴ:ɴ੣੓YH}=XȾത+X[N(ȴ’)‫ת‬N(ʋ’)].NowconductatestonE(Y) using the hypothesis H: E(Y) ੣ ɏYH which is equivalent to the initial hypothesis on ɴ. The test on H: E(Y) ੣ ɏYH can be performed using the methodgiveninsection4.8. We now give a proposition which shows that any hypothesis of the formH:E(Y)੣ɏYHwhereɏYHisanaffinesubsetofɏYcanalsobestatedas ahypothesisaboutɴ. Proposition4.9.4.ConsiderthemodelforYandletɏYHbeanaffinesubset inɏY.Thenthereexistsanestimablefunctionʋ’ɴandaconstantɷsuch thatɏYH={Xɴ:ʋ’ɴ=ɷandȴ’ɴ=੘}. Proof.Let੓YH={ɴ੣੓Y:Xɴ੣ɏYH}andletȾത੣੓YH.NowpropositionA13.3 impliesthatɏYHͲXȾതisasubspaceofɏYͲXȾത.LetF’beanylxnmatrixsuch thatN(F’)=ɏYHͲXȾത,letʋ’=F’Xandletɷ=F’XȾത.Clearlyʋisestimablesince R(ʋ)‫ؿ‬R(X’)+R(ȴ).Nowletɏʋ={Xɴ:ʋ’ɴ=ɷandȴ’ɴ=੘}.Wewillshowthat ɏYH=ɏʋ.Sosupposeʅ੣ɏYH.Thenwecanfindɴ1੣੓YHsuchthatʅ=Xɴ1. ButthenX(ɴ1ͲȾത)੣ɏYHͲXȾതwhichimpliesthatF’(Xɴ1ͲXȾത)=0l,hencethat ʋ’ɴ1=F’Xɴ1=F’XȾത=ɷ.Thusʅ=Xɴ1੣ɏʋ.Nowletʅ੣ɏʋ.Thenwecanfind ɴ1suchthatȴ’ɴ1=੘andʋ’ɴ1=F’Xɴ1=F’XȾത=ɷ.ButthenXɴ1ͲXȾത੣N(F’)= ɏYHͲXȾത,hencewehavethatXɴ1=ʅ੣ɏYH.
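The description of ΘYH above requires a vector β̄ satisfying both the model constraints Δ'β̄ equal to the constraint vector (written c below) and π'β̄ = ρ. A small hedged sketch of how one might compute such a particular solution; lstsq is simply one way to do it, and the names are ours:

    import numpy as np

    def particular_solution(Delta, c, Pi, rho):
        """Find beta_bar with Delta'beta_bar = c and Pi'beta_bar = rho (if the system is consistent)."""
        A = np.vstack([Delta.T, Pi.T])
        b = np.concatenate([c, rho])
        beta_bar, *_ = np.linalg.lstsq(A, b, rcond=None)
        if not np.allclose(A @ beta_bar, b):
            raise ValueError("constraint equations are inconsistent")
        return beta_bar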

4.10ProblemsforChapter4 Forproblems1and2below,assumethatYisannͲdimensionalrandom vectorthatfollowsamultivariatenormaldistributionandthatE(Y)=Xɴ, ȴ’ɴ=0,isaparameterizationforE(Y).Alsoassumethatcov(Y)=ʍ2V ൐ 0. Inthissituation,wewillletɏdenotetheexpectationspaceandɏHdenote


ĂƐƵďƐƉĂĐĞŽĨɏƚŚĂƚǁĞĂƌĞƚĞƐƚŝŶŐĂďŽƵƚǁŚĞƌĞĚŝŵё = m and dim ёH = mH. ϭ͘^ƵƉƉŽƐĞ,͗ʋ͛ɴсϬŝƐĂŚLJƉŽƚŚĞƐŝƐŽŶɴ͘^ŚŽǁƚŚĂƚƌ;ʋ͕ȴͿ- ƌ;ȴͿсŵ-mH. Ϯ͘^ƵƉƉŽƐĞ,͗ɴ੣ ੓Hi ĨŽƌŝсϭ͕ϮĂƌĞĞƋƵŝǀĂůĞŶƚŚLJƉŽƚŚĞƐĞƐŽŶɴ͘^ŚŽǁƚŚĂƚ ੓H1 = ੓H2. 3. Suppose Y1 ~ N1;ɲ1͕ʍ2), Y2 ~ N1;ɲ1 –ϯɲ2͕ʍ2) and that Y3 ~ N1;ɲ1 нϰɲ2͕ʍ2) and that all the indicated random variables are mutually independent. Derive the F-ƐƚĂƚŝƐƚŝĐĨŽƌƚĞƐƚŝŶŐ,͗ɲ1 сɲ2. 4. Suppose Y1, Y2 and Y3 are independent random variables such that Y1~ N1;Ϯɲ1нɲ2͕ʍ2), Y2 ~ N1;ϯɲ1͕ʍ2) and Y3 ~ N1(-ɲ1 нϮɲ2͕ʍ2). Derive the FƐƚĂƚŝƐƚŝĐĨŽƌƚĞƐƚŝŶŐƚŚĞŚLJƉŽƚŚĞƐŝƐ,͗ɲ1 сϯɲ2. 5. Suppose Y ~ N4;ђ͕ʍ2I4) where μ1 + μ2 + μ3 - μ4 = 0. Derive the F-statistic for testing H: μ2 = μ4. 6. Let X1,…,Xm be a random sample from X ~ N1(μ1͕ʍ2) and let Y1,…,Yn be a random sample from Y ~ N1(μ2͕ʍ2). Derive the F-statistic for testing H:μ1=μ2. 7. Consider the data from problem 5 in chapter 3 with the additional assumption that the Yijk’s are normally distributed. (a) Construct an initial ANOVA table. (b) Construct an ANOVA table for testing the linear hypothesis ,͗ʏ1сʏ2сʏ3сʏ4 and carry out the appropriate test. (c) For each of the following statements that constitutes a linear hypothesis on the parameter vector, carry out the appropriate test: (i) H1͗ђнʏ1 = 0. (ii) H2͗ʏ1 – ʏ2 нɷ1 – ɷ2 = 0. (iii) H3͗ɷ1 = ɷ2 сɷ3 = 0.


8. (Seely(1989)) For problem 19 of chapter 2, do the following: (a) Give the initial ANOVA table for the full model. (b) Give the ANOVA table for the reduced model for testing the linear hypothesis H: ੓ = 0 and carry out the appropriate test. ;ĐͿdĞƐƚƚŚĞŚLJƉŽƚŚĞƐŝƐŽĨŶŽĚŝĨĨĞƌĞŶĐĞŝŶɷĞĨĨĞĐƚƐ͕ŝ͘Ğ͕͘ƚĞƐƚ,͗ɷ1 = ɷ2 сɷ3. 9. (Seely (1989)) Consider again the data given in problem 7 of chapter 3, i.e., the data given in that problem are the outcomes from a Latin square experiment having two missing observations where E(Yijk) = μ + ri + cj нʏk for i = 1,2,3,4, j = 1,2,3,4, and ri = the effect of row i, cj = the effect of column ũ ĂŶĚ ʏ1 ʏ2͕ ʏ3͕ ĂŶĚ ʏ4 represent the effects of treatments A,B,C and D, respectively. (a) Construct an initial ANOVA table for the above model and data. (b) For each of the following statements that constitute a linear hypothesis on the parameter vector, carry out the appropriate test: (i) H1͗ʏ1 нʏ2 сʏ3 нʏ4 (ii) H2͗ʏ1 нʏ2 нʏ3 нʏ4 = 0 10. (Seely(1989)) Consider again problem 17 in chapter 2. For the data given in this problem, do the following: (a) Give the initial ANOVA table for this model. (b) Give the ANOVA table for the reduced model under the linear ŚLJƉŽƚŚĞƐŝƐ ,͗ɲ1с ɲ2, ɴ1 с ɴ2 and carry out the appropriate test for this hypothesis. 11. (Seely(1989))Consider again problem 20 in Chapter 2 and do the following: (a) Give the initial ANOVA table for the data in this problem. (b) Give the ANOVA table for the reduced model under the linear ŚLJƉŽƚŚĞƐŝƐ ,͗ɲ1 с ɲ2 с Ϯɲ3 and carry out the appropriate test for this hypothesis.


12. (Seely (1989)) Suppose {Yijk} is a collection of independent random variables which are normally distributed with a common unknown variance and expectation of the form E( Yijk) = μij for i = 1,2,3; j= 1,2,3; and k = 1, … , nij > 0. Set each of the following linear hypotheses on E(Y) up as a ůŝŶĞĂƌŚLJƉŽƚŚĞƐŝƐŽŶƚŚĞƉĂƌĂŵĞƚĞƌǀĞĐƚŽƌɴс;ђ11,…,μ33)’. (a) E(YijkͿсђнɲi + ੘j for all i, j and k. (b) E(YijkͿсђнʏi for all i, j and k. (c) E(Yijk) = μ + Ʉ1 + Ʉ2xi + Ʉ3xi2 + ੘j for all i, j and k where x1 = 1, x2 = -2 and x3 = 0. [Hint: Use proposition 4.5.8 of this chapter to do this problem.]

CHAPTER 5 THE GENERAL THEORY OF LINEAR ESTIMATION

5.1 Introduction and Preliminaries The present chapter is a continuation of unbiased linear estimation that we pursued in chapter 2. The majority of the material deals with covariance structures for a random vector Y not covered in chapter 2. We shall assume throughout this chapter that the expectation space of Y is a subspace. The reason for this latter assumption is primarily for ease of exposition. Whenever the expectation space in a linear model is a subspace, it is easily seen that all of the concepts, terminology and results of sections 4 and 5 of chapter 2 that did not involve the covariance structure carry over without change. Similarly, under the subspace assumption the notion of a parameterization for the expectation space and relationships between different parameterizations for the same expectation space remain the same as discussed in chapters 2 and 3. Most of the material presented in this chapter is taken from a set of notes written by Justus Seely(1989). In this chapter, we consider n-dimensional random vectors Y whose ƉŽƐƐŝďůĞ ĞdžƉĞĐƚĂƚŝŽŶ ǀĞĐƚŽƌƐĐŽŶƐƚŝƚƵƚĞ Ă ƐƵďƐƉĂĐĞ ɏ of Rn. We will also assume throughout this chapter, primarily for notational convenience, that ƚŚĞĞdžƉĞĐƚĂƚŝŽŶƐƉĂĐĞɏŚĂƐďĞĞŶƉĂƌĂŵĞƚĞƌized as E(YͿсyɴ͕ɴ੣ Rp, i.e., ɴ ŚĂƐ ŶŽ ƌĞƐƚƌŝĐƚŝŽŶƐ͘ &ƌŽŵ ƚŚĞ ĚŝƐĐƵƐƐŝŽŶ ĐŽŶĐĞƌŶŝŶŐ ƉĂƌĂŵĞƚĞƌŝnjĂƚŝŽŶƐ given in chapter 3, we know that such a parameterization is always possible. Also, because the results given in Chapter 3 relating estimable parametric vectors between different parameterizations do not depend on the covariance structure of Y, we can apply those same results to parameterizations when cov(Y) ൒ 0. The assumptions on the covariance structure will, however, vary from section to section in this chapter. The most general covariance structure assumption we make in this chapter is that cov(Y) exists and is an element of a set V of positive semi-definite matrices. Before considering this general structure we will first consider


the case when cov(Y) = D where D is a given positive semi-definite matrix. Because the notion of a blue and that of the gm property depend on the covariance structure, we give below the appropriate definitions for these two concepts. Definition 5.1.1. Suppose Y is a random vector such that E(Y) ੣ ɏ and cov(Y)੣ V ǁŚĞƌĞɏс΂yɴ͗ɴ੣ Rp} = R(X) is a subspace and V is a set of positive semi-definite matrices. For such a model, a linear estimator T’Y is said to be a blue for its expectation if and only if covɇ(T’Y) ൑ covɇ(A’YͿĨŽƌĂůůɇ੣ V and for all estimators A’Y having the same expectation as T’Y. In the previouƐĚĞĨŝŶŝƚŝŽŶ͕ƚŚĞĚĞƉĞŶĚĞŶĐĞŽŶɇ͕ƚŚĞĐŽǀĂƌŝĂŶĐĞŵĂƚƌŝdž of Y, has been indicated using a subscript. Typically we will omit the covariance matrix subscript in such expressions, but we have used it in the definition to emphasize that the covariance matrix inequality must hold for all covariance matrices in V. As in the previous chapter, we use terminology like T’Y is a blue for its expectation and T’Y is a blue for a given parametric vector to mean it is a blue and is additionally unbiased for the given parametric vector. Definition 5.1.2. Assume the same linear model structure as in definition 5.1.1 and suppose E(YͿсyɴ͕ɴ੣ Rp, ŝƐĂƉĂƌĂŵĞƚĞƌŝnjĂƚŝŽŶĨŽƌɏ͘&ŽƌƐƵĐŚĂ parameterization, a random vector Ⱦ෠ ŝƐƐĂŝĚƚŽďĞŐŵĨŽƌɴŝĨĂŶĚŽŶůLJŝĨȿ͛Ⱦ෠ ŝƐĂďůƵĞĨŽƌȿ͛ɴǁŚĞŶĞǀĞƌȿ͛ɴŝƐĂŶĞƐƚŝŵĂďůĞƉĂƌĂŵĞƚƌŝĐǀĞĐƚŽƌ͘ Definition 5.1.2 gives the appropriate generalization of the gm property previously defined in chapter 2 to deal with the more general covariance structures considered in this chapter. With regard to the two definitions in this section, several things should be kept in mind. The definition of a blue is independent of the notion of a parameterization for E(Y). That is, if T’Y is a blue under any parameterization, it is a blue for all parameterizations. However, from definition 5.1.2, it is clear that the gm property is dependent on a given parameterization for E(Y). Finally, be sure to observe that the gm property depends (via the blue condition in the definition) on the covariance structure of Y even though the set V does not explicitly appear in definition 5.1.2.


5.2ASingleCovarianceMatrix Weassumethroughoutthissectionthemodelinsection1underthe specialcaseV={D}.Thatis,weassumeYisannx1randomvectorwith expectation E(Y)੣ɏ and cov(Y) = D where ɏ is a subspace and D is an arbitrarybutfixedpositivesemiͲdefinitematrix.Weshallalsoassumethat ɏisparameterizedasE(Y)੣ɏ={Xɴ:ɴ੣Rp}. Our approach in developing the theory of best linear unbiased estimationfollowssomewhatthatgiveninsections4and5ofchapter2. Theprimarytoolindevelopinglinearestimationinchapter2wasthatany tx1linearestimatorA’YcouldbedecomposedintoasumT’Y+F’Ywhere F’Yhasexpectationzero(sothatT’YandA’Yhavethesameexpectation) and where cov(T’Y) ൑ cov(A’Y). To see how to generalize this decomposition, we need only consider the covariance matrix inequality because the condition of zero expectation does not depend on the covariancestructure,i.e.,F’YhaszeroexpectationifandonlyifR(F)‫ؿ‬ɏᄰ. Thecovariancematrixinequalitywasachievedbyhavingthecolumnsof DTorthogonaltothecolumnsofF.Thisorthogonalityimpliedcov(T’Y,F’Y)= T’DF=0tt,whichallowedustogettheappropriatematrixinequality.Inthe presentsettingwefollowthesameapproach.Asinchapter2,onewayto assurethatthisistrueistohaveR(DT)‫ؿ‬ɏbecausethezeroexpectation conditiononF’Yis,asstatedabove,equivalenttoR(F)‫ؿ‬ɏᄰ. Lemma5.2.1.LetT’Ybeanarbitrarylinearestimator.IfR(DT)‫ ؿ‬ɏ,then T’Yisablue. Proof.AssumeR(DT)‫ؿ‬ɏandletA’Ybeanyotherunbiasedestimatorfor E(T’Y).Thenbylemma2.5.3,A=T+FwhereR(F)‫ ؿ‬ɏᄰ.Butthen cov(A’Y)=cov((T+F)’Y)=(T+F)’D(T+F)=T’DT+T’DF+F’DT+F’DF=T’DT+F’DF =cov(T’Y)+cov(F’Y)൒ cov(T’Y). HenceT’Yisablue. As in chapter 2, we note that lemma 5.2.1 provides a sufficient conditionforT’Ytobeabluebutnotanecessaryone.Toshowthatthe condition given in lemma 5.2.1 is also necessary, we introduce the followingset.Let


μD = {t: Dt ੣ ɏ΃͘ Notice that μD is a subspace. It is also easy to check that a matrix T satisfies R(DT) ‫ ؿ‬ɏŝĨĂŶĚŽŶůLJŝĨZ;dͿ ‫ ؿ‬μD. In particular, it should be noted that if T is such that R(T) ‫ ؿ‬μD, then T’Y is a blue. Some relationships between μD ĂŶĚɏᄰ are given next. Lemma 5.2.2. Let μD = {t:Dt ੣ ɏ΃ǁŚĞƌĞɏŝƐĂƐƵďƐƉĂĐĞĂŶĚŝƐĂƉŽƐŝƚŝǀĞ semi-definite matrix. Then the following statements can be made: (a) μDᄰ = (Df: f ੣ ɏᄰ}. (b) Rn = μD нɏᄰ. (c) μD ‫ ת‬ɏᄰ = N(D) ‫ ת‬ɏᄰ. WƌŽŽĨ͘;ĂͿ>ĞƚYďĞĂŵĂƚƌŝdžƐƵĐŚƚŚĂƚE;Y͛Ϳсɏ͘dŚĞŶђD = N(Q’D). So μDᄰ = R(DQ)= {Df: f ੣ R(Q)} = {Df: f ੣ N(Q’)ᄰ} = {Df: f ੣ ɏᄰ} which establishes (a). (b) To show (b), we establish the equivalent condition A = μDᄰ ‫ ת‬ɏсϬn. If b੣A, then b = Df for some vector f ੣ ɏᄰ [by (a)] and b ੣ ɏ͘^Žď͛ĨсĨ͛ĨсϬ͘ Thus b=Df=0n. (c) Observe that N(D) ‫ ؿ‬μD gives one direction. For the other direction, suppose f ੣ μD and f ੣ ɏᄰ. Then Df ੣ ɏƐŽƚŚĂƚĨ͛ĨсϬǁŚŝĐŚŝŵƉůŝĞƐĨсϬn and f੣N(D)‫ת‬ёᄰ. Lemma 5.2.2 allows us to answer a number of questions about the theory of best linear unbiased estimation when cov(Y) = D. For example, if A’Y is any linear estimator, then part (b) of the lemma implies A can be expressed as A = T + F where T’Y is a blue and F’Y has zero expectation. Thus, in any parameterization there always exists a blue for any estimable parametric vector. Theorem 5.2.3. (The characterization theorem). For the linear model E(Y)੣ɏ ĂŶĚ ĐŽǀ;Y) = D, a kx1 linear estimator T’Y is a blue if and only if R(DT)‫ ؿ‬ɏ.


Proof. Assume T’Y is a blue. By lemma 5.2.2 (b), we can write T = H + F where R(DH) ‫ ؿ‬ɏĂŶĚZ;&Ϳ‫ ؿ‬ɏᄰ. Then cov(H’Y) ൑ cov(T’Y) ൑ cov(H’Y) where the first inequality follows as in the proof of lemma 5.2.1 and the second inequality follows because T’Y is a blue and H’Y has the same expectation as T’Y. We also have cov(H’Y) = cov(T’Y) if and only if cov(F’Y) = F’DF = 0kk. But this condition is equivalent to DF = 0nk and since F = T – H, it follows that DT = DH which shows that R(DT) ‫ ؿ‬ɏ͘ dŚĞ ĐŽŶǀĞƌƐĞ ĨŽůůŽǁƐ ĨƌŽŵ lemma 5.2.1. While Theorem 5.2.3 provides a necessary and sufficient condition for T’Y to be a kx1 blue, it doesn’t provide a very satisfactory method for obtaining a blue for a given parametric vector. We now give two more specific and useful methods for finding a blue for a given kx1 estimable parametric ǀĞĐƚŽƌȿ͛ɴ͘ Method 1 Let Q be an nx(n-ŵͿŵĂƚƌŝdžƐƵĐŚƚŚĂƚZ;YͿсɏᄰ͘dŚĞŶE;Y͛Ϳсɏ so that Q’DT = 0n-m,k is equivalent to the condition that R(DT) ‫ ؿ‬ɏĨŽƌd͛Y to be a blue for E(T’Y). Now let A’Y be a kx1 linear estimator. We would like to find a linear estimator T’Y which is a blue for E(A’Y). We know by lemma 5.2.2 (b) that T = A + F = A + QB for some matrix B where F = QB has R(F) ‫ ؿ‬ɏᄰ. For T to be a blue, it must satisfy the condition given above that 0n-m,k = Q’DT =Q’DA + Q’DQB. Solving this last equation for B, we get B = -(Q’DQ)-Q’DA. Thus T = A + QB = A – Q(Q’DQ)-Q’DA = (In – Q(Q’DQ)-Q’D)A. (5.2.4) Lemma 5.2.5. Let A’Y be a kx1 linear estimator. Then T’Y where T is given in (5.2.4) is a blue for E(A’Y). Proof. First observe E(T’YͿс͛yɴ– A’DQ(Q’DQ)-͚Y͛yɴс͛yɴс;͛Y) since Q’X=0n-m,p. Now we show T’Y is a blue by showing that Q’DT = 0n-m,k. To obtain this last equality, observe that Q’DT = Q’DA – Q’DQ(Q’DQ)-Q’DA = Q’DA –Q’DA = 0n-m,k. since Q’DQ(Q’DQ)- is a projection on R(Q’DQ) = R(Q’D). Thus T’Y is a blue for E(A’Y).
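A compact sketch of the construction in (5.2.4): given any A with A'Y unbiased for the target, it produces T = (In – Q(Q'DQ)-Q'D)A with the columns of Q spanning Ω⊥. Using scipy's null_space for Q and the Moore-Penrose pseudoinverse as the particular g-inverse are choices made for this sketch; by lemma 5.2.5 any g-inverse would serve. The function name is ours.

    import numpy as np
    from scipy.linalg import null_space

    def method1_blue_coefficients(A, X, D):
        """Method-1 construction (5.2.4): T = (I - Q (Q'DQ)^- Q'D) A, columns of Q spanning R(X)-perp."""
        n = X.shape[0]
        Q = null_space(X.T)                       # N(X') = Omega-perp when Omega = R(X)
        T = (np.eye(n) - Q @ np.linalg.pinv(Q.T @ D @ Q) @ Q.T @ D) @ A
        return T                                   # T'Y is then a blue for E(A'Y)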


Using the above procedure, we can find a blue for any parametric vector Λ'β which is estimable by doing the following:
(1) Find a matrix Q such that N(Q') = Ω.
(2) Find a matrix A such that A'Y is unbiased for Λ'β.
(3) Compute the blue T'Y for Λ'β where T = (In – Q(Q'DQ)-Q'D)A.

We now give an example to illustrate the method just outlined for finding a blue. In the following example, we also illustrate how to deal with a model with constraints.

Example 5.2.6. (Seely (1989)) Suppose Y is a 3x1 random vector with cov(Y) = σ2V where

    V = [ 1  0  1
          0  1  1
          1  1  2 ]

and E(Y) = Xβ, Δ'β = 0, where

    X = [ 1   1
          1  -1
          2   0 ]   and   Δ' = (1, 2).

We begin by parameterizing E(Y) to an unconstrained model. Observe that R(A) = N(Δ') where A' = (2, -1). Let U = XA where U = (1, 3, 4)'. Then E(Y) = Uα, α ∈ R1, is an unconstrained full rank parameterization for E(Y). Also observe that since Xβ ≅ Uα, β1 ≅ 2α. We now use the procedure outlined above to find a blue for α. Observe that Q1' = (1, 1, -1) and Q2' = (7, -5, 2) form a basis for Ω⊥ = R(U)⊥. Set Q = (Q1, Q2). Now Y1 = a'Y where a' = (1, 0, 0) is unbiased for α. If t'Y is a blue for α, then t = a + Qδ for some δ because of unbiasedness. Using Vt ∈ Ω if and only if Q'Vt = 0, we get the conditions Q'Vt = Q'V(a + Qδ) = 0 which implies Q'VQδ = -Q'Va. The solution for δ is δ = -(Q'VQ)-Q'Va which gives δ' = (-2/10, -1/10) which in turn gives t' = (a + Qδ)' = (1/10, 3/10, 0). Thus the blue for α is α̂ = (1/10)Y1 + (3/10)Y2 and since β1 ≅ 2α, the blue for β1 is β̂1 = 2α̂ = (2/10)Y1 + (6/10)Y2.

Method 1 demonstrated in the previous example for finding blues can be used to estimate any parametric vector. However, the technique is in general not very appealing because one must first compute Q and then essentially invert Q'VQ which could be a very large matrix. It is also inefficient when there is interest in finding blues for a number of estimable parametric vectors. We now give a second method for estimating estimable parametric vectors which overcomes some of the


Method 1, demonstrated in the previous example, can be used to estimate any parametric vector. However, the technique is in general not very appealing because one must first compute Q and then essentially invert Q'VQ, which could be a very large matrix. It is also inefficient when there is interest in finding blues for a number of estimable parametric vectors. We now give a second method for estimating estimable parametric vectors which overcomes some of the inconveniences associated with method 1. This second method revolves around finding a vector which has the gm property.

Method 2

Consider the same model as assumed for method 1, i.e.,

E(Y) = Xβ, β ∈ Rp, cov(Y) = σ²V ≥ 0.

(5.2.7)

Before proceeding to our main result, we give a preliminary lemma.

Lemma 5.2.8. Under the model given in (5.2.7), Y ∈ R(X) + R(V) with probability 1.

Proof. Consider the random vector Z = Y – Xβ which has E(Z) = 0n and cov(Z) = σ²V. Now let PV be the orthogonal projection on R(V). Then cov[(In – PV)Z] = (In – PV)(σ²V)(In – PV) = 0nn. Hence, from problems 13-15 of chapter 1, (In – PV)Z equals E[(In – PV)Z] = 0n with probability one, i.e., Z = Y – Xβ ∈ R(V) with probability 1, which implies Y ∈ R(X) + R(V) with probability 1.

We now give our main result concerning method 2.

Lemma 5.2.9. Let W = V + XX' and let G be any g-inverse of W. Then the following facts hold:
(a) There exists a random vector β̂ satisfying X'GXβ̂ = X'GY.
(b) The β̂ vector satisfying the equations given in part (a) is gm for β.

Proof. (a) To begin, observe that N(V) ⊇ N(W) and N(XX') ⊇ N(W) since V ≥ 0 and XX' ≥ 0. This implies that R(V) ⊆ R(W) and R(XX') ⊆ R(W). Also, since R(X) = R(XX') ⊆ R(W), Y ∈ R(X) + R(V) ⊆ R(W) with probability 1 and W is symmetric, it follows that X'GX = X'G'X and X'GY = X'G'Y. We now show that R(X'GX) = R(X') by showing that R(X'GX)⊥ = N(X'G'X) = N(X'GX) = R(X')⊥ = N(X). To show this latter equality, note that since R(X) = R(XX') ⊆ R(W), we have that X = WΛ = (V + XX')Λ for some matrix Λ. Hence, X'GX = Λ'(V + XX')GX = Λ'X since (V + XX')G is a projection on R(V + XX').


Now suppose λ ∈ N(X'GX). Then 0p = λ'X'GX = λ'Λ'X = λ'Λ'(V + XX')Λ implies (post-multiplying by λ and using V + XX' ≥ 0) that λ'Λ'(V + XX') = 0n = λ'X'; i.e., λ'X'GX = 0p implies λ'X' = 0 and N(X'GX) ⊆ N(X). Clearly, N(X) ⊆ N(X'GX), hence we have that N(X) = N(X'GX) and that R(X'GX) = R(X'). But then for any outcome y of Y, we can find β̂(y) such that X'GXβ̂(y) = X'Gy, hence β̂(y) satisfies the equation.

(b) Suppose β̂ satisfies the equation given in (a) and let β̂ = (X'GX)⁻X'GY. If the kx1 parametric vector Λ'β is estimable, then Λ = X'A for some matrix A and

E(Λ'β̂) = E(A'Xβ̂) = E(A'X(X'GX)⁻X'GY) = A'X(X'GX)⁻X'GXβ = A'Xβ = Λ'β,

so Λ'β̂ is unbiased for Λ'β. Now observe that Λ'β̂ = A'X(X'GX)⁻X'GY and since V = W – XX', we have that

V[GX(X'GX)⁻X'A] = (W – XX')[GX(X'GX)⁻X'A]
                = WGX(X'GX)⁻X'A – XX'GX(X'GX)⁻X'A
                = X(X'GX)⁻X'A – XX'GX(X'GX)⁻X'A
                  (since WG is a projection on R(W) and R(X) ⊆ R(W))

which is in R(X). Hence Λ'β̂ is the blue for Λ'β. Thus the vector β̂ satisfying the equations given in lemma 5.2.9 (a) is gm for β and can be used to find a blue for any estimable parametric vector.

Example 5.2.10. Consider again example 5.2.6 with X, Δ, V and U as defined there. Let

W = V + UU' = (  2   3   5 )
              (  3  10  13 )
              (  5  13  18 )

and observe that r(W) = 2. Using the method of computation outlined in appendix A12, one generalized inverse for W is given by

G = ( 10/11  -3/11  0 )
    ( -3/11   2/11  0 ).
    (     0      0  0 )

Now, to find a gm vector for α, we solve the equation U'GUα̂ = U'GY, i.e., (10/11)α̂ = (1/11)Y1 + (3/11)Y2.


Thus, the gm vector (blue) for α is α̂ = (1/10)Y1 + (3/10)Y2, and since β1 corresponds to 2α, the blue for β1 is β̂1 = 2α̂ = (2/10)Y1 + (6/10)Y2. Note that the blue β̂1 determined for β1 here is the same as the blue determined for β1 in example 5.2.6 using method 1.
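A similar sketch (again Python with NumPy, illustrative only) carries out method 2 for example 5.2.10 with the Moore-Penrose inverse used as the g-inverse G. A different g-inverse can produce a different coefficient vector for α̂, so rather than comparing with (1/10, 3/10, 0)' directly, the checks confirm unbiasedness and the blue condition; the resulting estimator agrees with the one in the example with probability one.

import numpy as np

V = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])
U = np.array([[1.0], [3.0], [4.0]])
Q = np.array([[1.0, 7.0],
              [1.0, -5.0],
              [-1.0, 2.0]])                  # columns span R(U)-perp

W = V + U @ U.T                              # W = V + UU'
G = np.linalg.pinv(W)                        # one convenient g-inverse of W
t = (G @ U) / (U.T @ G @ U).item()           # alpha_hat = t'Y solves U'GU alpha_hat = U'GY

print((t.T @ U).item())          # 1.0 -> t'Y is unbiased for alpha
print(Q.T @ V @ t)               # ~0  -> Vt lies in R(U), so t'Y is a blue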

5.3 Some Further Results on Estimation

This section is devoted to generalizing some of the facts about best linear unbiased estimation discussed in the previous section. Unless specified otherwise, we assume throughout the general linear model structure E(Y) ∈ Ω, where Ω = {Xβ: β ∈ Rp} is a subspace, and cov(Y) ∈ V, where V is an arbitrary set of positive semi-definite matrices.

For most general facts in this section, we rely on the results already developed for a single covariance matrix. To do this, it is convenient to think of our general model as a collection of linear models where each linear model has the same expectation space and a single, but different, covariance matrix. For each nxn positive semi-definite matrix D, let MD denote the linear model E(Y) ∈ Ω and cov(Y) = D. From the definition of a blue, it is clear that a linear estimator T'Y is a blue if and only if it is a blue with respect to model MD for all D ∈ V. Using this observation, we can restate many of the facts in section 2 for our general model by simply applying them to the linear model MD for all D ∈ V. For example, theorem 5.2.3 can be restated as follows.

Theorem 5.3.1. (Zyskind's Theorem). A linear estimator T'Y is a blue if and only if R(DT) ⊆ Ω for all D ∈ V.

This theorem was first given by Zyskind (1967) and is most useful as a theoretical tool. From a practical viewpoint, however, the theorem is sometimes useful when you wish to check whether or not a particular linear estimator is a blue. However, in this section, the theorem's primary use is to generalize some of the ideas of section 2 as well as to discuss some facts that can easily be determined using the theorem. In this regard, it should be remembered that we use the term blue (gm) without a qualifying "with respect to …" to mean blue (gm) with respect to the general linear model assumed in this section.
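In the spirit of Zyskind's theorem, the following small utility (Python with NumPy; the function name and arguments are illustrative, not from the text) checks numerically whether R(DT) ⊆ R(X) for every D in a finite collection of covariance matrices, using a rank comparison.

import numpy as np

def is_blue(T, X, Ds, tol=1e-10):
    # T'Y is a blue (with respect to the models M_D, D in Ds) when every
    # column of DT lies in R(X); this holds iff rank([X, DT]) = rank(X).
    rX = np.linalg.matrix_rank(X, tol)
    for D in Ds:
        if np.linalg.matrix_rank(np.hstack([X, D @ T]), tol) > rX:
            return False
    return True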


Let μ0 denote the set of vectors t such that t'Y is a blue. Further, let μD denote the set of vectors t such that t'Y is a blue with respect to model MD. From the discussion preceding Zyskind's theorem, it is clear that t ∈ μ0 if and only if t ∈ μD for all D ∈ V. This observation immediately allows us to state the following relationship:

μ0 = ∩{μD : D ∈ V}.

(5.3.2)

Some immediate consequences of (5.3.2) are the following:

(a) μ0 is a subspace.
(b) T'Y is a blue if and only if R(T) ⊆ μ0.
(c) The class of blues is closed under linear operations. That is, if ρ̂ and δ̂ are blues and A is a matrix, then ρ̂ + Aδ̂ is a blue provided that the matrix operations are defined.
(d) T'Y is a blue if and only if t'Y is a blue for each column t of T.

The above consequences are straightforward to verify. In particular, (a) follows because the intersection of subspaces is a subspace. Then, continually using the fact that μ0 is a subspace, (b) follows from Zyskind's theorem, (c) follows from (b), and (d) follows from (c). As noted above, Zyskind's theorem is the generalization of theorem 5.2.3 that is needed to develop the theory of linear unbiased estimation under the more general model being considered in this section.

For the remaining part of this section, we consider the questions of existence and uniqueness of blues. Suppose A'Y is an arbitrary linear estimator. If A can be expressed as A = T + F where T'Y is a blue and F'Y is unbiased for zero, then T'Y is a blue with the same expectation as E(A'Y). In particular, there exists a blue for E(A'Y). A formal statement of this existence is given as the next theorem.

Theorem 5.3.3. (Existence Theorem). For each linear estimator A'Y, there exists a blue for E(A'Y) if and only if Rn = μ0 + Ω⊥.

Proof. Sufficiency follows as in the previous paragraph. Conversely, let a be any nx1 vector. Select t so that t'Y is a blue for E(a'Y).


Because t'Y and a'Y have the same expectation, there exists f ∈ Ω⊥ such that a = t + f. Hence Rn ⊆ μ0 + Ω⊥.

The next question we address is the uniqueness of the coefficient matrix of a blue. Suppose t'Y is a blue and that f ∈ μ0 ∩ Ω⊥. Set h = t + f. Then h'Y is a blue (because μ0 is a subspace) and h'Y has the same expectation as t'Y (because f'Y has expectation zero). Hence t'Y and h'Y are blues for the same parametric function. Thus if μ0 ∩ Ω⊥ is not 0n, then the coefficient vector (and hence matrix) of a blue is not unique. The next result establishes that this intersection being zero is both necessary and sufficient for the coefficient matrix of a blue for a given parametric vector to be unique.

Theorem 5.3.4. (Uniqueness Theorem). Set 𝓕 = μ0 ∩ Ω⊥. Then the following statements can be made:
(a) If T'Y is a blue and H = T + F where R(F) ⊆ 𝓕, then T'Y and H'Y are blues for the same parametric vector.
(b) If T'Y and H'Y are blues having the same expectation, then H = T + F where R(F) ⊆ 𝓕.

Proof. (a) This follows as in the previous paragraph. (b) Set F = H – T. Because T'Y and H'Y are blues, we have R(F) ⊆ μ0. Because they have the same expectation, we have R(F) ⊆ Ω⊥, hence H = T + F where R(F) ⊆ 𝓕.

Be sure to observe the implication of this theorem. In particular, the condition μ0 ∩ Ω⊥ = 0n implies that the coefficient matrix of a blue for a given parametric vector is unique. Also, the relationship

μ0 ∩ Ω⊥ ⊆ μD ∩ Ω⊥ for all D ∈ V

(5.3.5)

which follows from (5.3.2), is handy to remember. For example, uniqueness for model MD for a single D ∈ V implies μ0 ∩ Ω⊥ = 0n. Also, it is worthwhile noting that N(D) ∩ Ω⊥ = 0n is equivalent to r(D,U) = n, where U can be any matrix satisfying the condition R(U) = Ω.


5.4 More on Estimation

This section is a continuation of the discussion started in section 5.3. As in that section, we assume the linear model structure E(Y) ∈ Ω, where Ω = {Xβ: β ∈ Rp} is a subspace, and cov(Y) ∈ V, where V is an arbitrary set of positive semi-definite matrices. In addition, we also use the notation and ideas developed in that section.

As noted in section 5.3, Zyskind's theorem is sometimes useful when one wishes to check whether or not a particular linear estimator is a blue. This is essentially the focus of the present section. In the quest for a blue, it would be helpful if one could limit the potential candidates. This can be done as follows: Let V ∈ V. Then a necessary condition for a linear estimator to be a blue is that it is a blue with respect to model MV. Thus, for determining a blue for a particular parametric vector, one need only check the linear estimators that are blue for that parametric vector with respect to model MV. This is especially easy when V is positive definite, because then there is only one possible candidate to check. When possible, the choice V = In is natural because such a choice involves the simplest computations.

Example 5.4.1. Suppose Y is a random vector with E(Y) = Xμ where X = 1n and μ is unknown. Further, suppose that

cov(Yk,Yt) = σ² for k = t
           = γ otherwise,

where σ² and γ are unknown parameters (subject, of course, to the constraints imposed by a covariance matrix). If possible, let us determine a blue for μ. Notice that In is a possible covariance matrix (e.g., σ² = 1 and γ = 0). From chapter 2 we know that the sample mean μ̂ = t'Y is the blue with respect to model MIn. Thus, t'Y is the only potential candidate for the blue. Now let us use Zyskind's theorem to check if t'Y is a blue. For D ∈ V, we see that Dt = (σ² + (n – 1)γ)t ∈ Ω because t ∈ Ω. Since D was arbitrary, it follows that μ̂ is the blue for μ.
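A short numerical illustration of example 5.4.1 (Python with NumPy; the values of n, σ² and the common covariance below are assumed for illustration only): Dt is a multiple of 1n, so the sample-mean coefficient vector satisfies Zyskind's condition.

import numpy as np

n, sigma2, gamma = 5, 2.0, 0.7
D = gamma * np.ones((n, n)) + (sigma2 - gamma) * np.eye(n)   # cov(Y) under the model
t = np.ones((n, 1)) / n                                      # coefficient of the sample mean
Dt = D @ t
print(np.allclose(Dt, Dt[0, 0] * np.ones((n, 1))))           # True: Dt lies in sp{1n} = the expectation space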


To check whether or not a particular linear estimator T'Y is a blue via Zyskind's theorem, it is necessary to examine the structure of DT for all choices D ∈ V. In many cases, however, one can examine a few select choices of D and determine whether or not T'Y is a blue. To illustrate, reconsider example 5.4.1. Notice that any D can be expressed as αIn + γ1nn where α = σ² – γ. It is easy to check that both Int and 1nnt are in Ω; and since Ω is a subspace it follows that Dt = αInt + γ1nnt ∈ Ω. Thus we know that Zyskind's theorem is applicable after checking the choices D = In and D = 1nn.

There are two basic ideas implied in the previous paragraph. The first is the notion of replacing the set V in Zyskind's theorem with another set G in such a way that the theorem remains true. A solution to this is easily given. If V and G have the same span (within the class of symmetric matrices), then it is easy to check that Zyskind's theorem remains true if V is replaced by G. The second is to make the replacement set "small" so that there are fewer things to check. One possibility here is to select the replacement set so that its elements form a basis for VS = sp V. There are, however, some things to consider with this proposed solution. In particular, the span condition does not restrict the elements of G to be positive semi-definite matrices; and without this restriction it may not be possible to tie together the Zyskind condition and the minimum covariance matrix condition in definition 5.1.1. Thus our first requirement for a replacement set will be that it consists of covariance (positive semi-definite) matrices. A second thing to consider is that in some situations, one might like to make the replacement set as "big" as possible. For example, one might decide to use least squares for estimation purposes and consequently would like to know for what types of covariance structures the least squares estimators are optimal. That is, one starts with {In} and then wants the replacement set to be as "big" as possible. For present and future purposes, it is convenient to formally provide a concept (definition) that incorporates the points raised in this paragraph.

Definition 5.4.2. A set of covariance matrices G is said to be b-equivalent to V (written as V ⇔ G) if and only if μ0 remains the same when V in the linear model is replaced by G.

Let VD denote the set of all positive semi-definite matrices in VS. Then a trivial, but handy, observation is that any set G ⊆ VD whose span is VS is b-equivalent to V. The reason for this is that the positive semi-definite matrix condition is just another way of saying the set consists of covariance matrices, and the span condition insures that Zyskind's theorem can be applied to prove the equivalence of the two μ0 sets. In example 5.4.1, the set {In, 1nn} is one such set.


A class of models where the selection of a b-equivalent set is essentially obvious is the class of variance component models.

Example 5.4.3. (Variance component models) A class of models that is frequently used in applications, and which we will use to illustrate some of our concepts, is the class of variance component models. A random vector Y whose covariance structure is of the form

cov(Y) = σ²In + σ1²D1 + … + σk²Dk,

where D1,…,Dk are known positive semi-definite matrices and the σi² are unknown variances, is said to follow a variance component model. For a variance component model we assume, unless specifically mentioned otherwise, that Ω is parameterized as E(Y) = Xβ with β unknown, that σ² > 0 and that σ1²,…,σk² ≥ 0. Under these assumptions it is easy to check that {In,D1,…,Dk} is b-equivalent to V. Notice that σ² is assumed to be positive. This insures that all D ∈ V are positive definite. A fact that is sometimes handy to remember (and is sometimes used as the definition) for a variance component model is that it can be thought of as a random vector Y with the structure

Y = Xβ + B1b1 + … + Bkbk + e,

where X, B1,…,Bk are known matrices, β is an unknown parameter vector, and b1,…,bk and e are independent random vectors of dimensions n1,…,nk and n with means zero and covariance matrices σ1²In1,…,σk²Ink and σ²In, respectively. (Note that the identity matrices here are generally of different dimensions.) This type of model falls within our general definition of a variance component model by taking Di = BiBi' for i = 1,…,k.

The set μ0 is important in essentially all aspects of our theory. An expression for μ0 that is sometimes convenient to use is

μ0 = {t: Dt ∈ Ω for all D ∈ G}, G ⇔ V.   (5.4.4)

To see that this expression is true, simply apply Zyskind's theorem or else use expression (5.3.2).
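As a small illustration of the structure in example 5.4.3, the following sketch (Python with NumPy; the incidence matrix and variance values are assumed for illustration, not taken from the text) assembles cov(Y) = σ²In + σ1²B1B1' + … + σk²BkBk'.

import numpy as np

def vc_cov(sigma2, Bs, sigma2s):
    # Build sigma^2*I_n + sum_i sigma_i^2 * B_i B_i' with D_i = B_i B_i'.
    n = Bs[0].shape[0]
    D = sigma2 * np.eye(n)
    for B, s2 in zip(Bs, sigma2s):
        D += s2 * (B @ B.T)
    return D

# One random effect: a one-way layout with two groups of sizes 2 and 3.
B1 = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
D = vc_cov(sigma2=1.0, Bs=[B1], sigma2s=[0.5])
print(D)     # 1 + 0.5 on the diagonal, 0.5 within groups, 0 across groups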


Example 5.4.5. (Random one-way model) Suppose {Yij} is a collection of random variables of the form

Yij = μ + bi + eij, i = 1,…,r, j = 1,…,ni,

where μ is unknown and the bi and eij are all mutually independent random variables with common mean zero and unknown variances σb² and σ², respectively. Express Y in matrix form as Y = Xμ + Bb + e where X, B, b and e are defined in the obvious fashion. Then cov(Y) = σ²In + σb²D where D = BB'. Clearly, this is a special case of a two-variance component model. So, {In, D} is b-equivalent to V and μ0 = {t: t ∈ Ω, Dt ∈ Ω}.

By inspection, one can see that either μ0 = 0n or μ0 = Ω. The case μ0 = Ω is of particular interest because this is the condition for a blue to exist for μ. It is easy to check that the condition μ0 = Ω is true if and only if r(X, DX) = 1, which can be shown to be equivalent to the condition ni = nj for all i,j. Thus, a blue for μ exists if and only if the ni are all equal.

The previous example illustrates the point where our general theory diverges from the results in chapter 2. That is, with a general covariance structure it is no longer possible to prove that a blue exists for a given estimable parametric vector. Unfortunately, nonexistence is the usual case and existence is the exceptional case. Because of this, we limit our remaining discussion to a few particular remarks and some examples.

Suppose one could find a covariance matrix D such that V and {D} are b-equivalent. This means that any facts about blues for model MD carry over to the model with cov(Y) ∈ V. In particular, all of the results in section 5.2 are at our disposal. If in addition D is nonsingular, then all of the results given in chapter 2 are also at our disposal. Thus, a sufficient condition for things to go smoothly with our general theory is that V is b-equivalent to a single covariance matrix. We now pursue conditions under which V is b-equivalent to a single covariance matrix.

Definition 5.4.6. We say C ∈ VD is a maximal element in VD if and only if R(D) ⊆ R(C) for all D ∈ VD.

One useful fact to remember is that if V ∈ VD is positive definite, then it is a maximal element. Now we make some general observations that can be made without any additional assumptions:


(a) V ⇔ {D} is equivalent to the condition μ0 = μD.
(b) μ0 ⊆ μD for all D ∈ VD.

(5.4.7)

(c) μ0 ∩ Ω⊥ = μC ∩ Ω⊥ for all maximal elements C ∈ VD.

Part (a) of (5.4.7) follows from the definition of b-equivalence. Part (b) follows from expression (5.3.2) and part (c) follows from lemma 5.2.2 (c). These observations provide us a mechanism to decide whether or not V is b-equivalent to a single covariance matrix. This result is contained in the next lemma along with an equivalent statement regarding the existence of blues given in theorem 5.3.3.

Lemma 5.4.8. Select any maximal element C ∈ VD. Then the following statements are equivalent:
(a) V ⇔ {D} for some D.
(b) μ0 = μC.
(c) Rn = μ0 + Ω⊥.

Proof. It is clear that (b) implies (a) and (a) implies (c). So, we show only that (c) implies (b). From proposition A3.3 of the appendices, we get that

dim μ0 = dim[μ0 + Ω⊥] – dim Ω⊥ + dim[μ0 ∩ Ω⊥].

Similarly, we get that

dim μC = dim[μC + Ω⊥] – dim Ω⊥ + dim[μC ∩ Ω⊥].

In these two expressions for dim μ0 and dim μC, the first terms are equal because we are assuming (c), the second terms are the same and the third terms are equal by (5.4.7) (c). So dim μ0 = dim μC, from which (b) follows because of the fact that μ0 ⊆ μC.

If C is any maximal element, then one simply has to check the condition μC ⊆ μ0 (the other containment is always true) to decide whether or not V is b-equivalent to a single covariance matrix. One way to check this condition is the following: Find T such that R(T) = μC. For example, when C is positive definite, T = C⁻¹U, where R(U) = Ω, is one such T.


Then R(T) ⊆ μ0 if and only if T'Y is a blue, which, for example, can be checked using Zyskind's theorem. For future reference, we summarize our findings for the case most frequently encountered in practical situations.

Corollary 5.4.9. Suppose E(Y) = Xβ, β unknown, is a parameterization for Ω and suppose In ∈ V. Then the following statements are equivalent:
(a) V is b-equivalent to a single covariance matrix.
(b) X'Y is a blue.
(c) μ0 = R(X).

Proof. Note that μIn = R(X). So, (a) if and only if (c) by the preceding paragraph with V = In. Clearly (c) implies (b). It remains to show (b) implies (c). Assume (b) is true. By the blue property, R(X) ⊆ μ0 and since μ0 ⊆ μIn = R(X), the result follows.

Example 5.4.10. (Heterogeneous Variances). Suppose {Yij} is a collection of independent random variables such that E(Yij) = μ and var(Yij) = σi² for i = 1,…,r and j = 1,…,ni. Suppose also that μ is an unknown parameter and the variances are all unknown positive parameters. Write the model in matrix form as E(Y) = Xμ and cov(Y) = σ1²V1 + … + σr²Vr, where X = 1n. Clearly In ∈ V and G = {V1,…,Vr} is b-equivalent to V. Since μ is estimable, we can conclude from corollary 5.4.9 that there exists a blue for μ if and only if X'Y is a blue. Using Zyskind's theorem (with G), we see that X'Y is a blue if and only if ViX ∈ Ω for i = 1,…,r. But this is true if and only if r = 1, i.e., if and only if there is a single group.


Example 5.4.11. (Rao Structure). Assume E(Y) = Xβ, β unknown, is a parameterization for Ω and that cov(Y) ∈ V. Let Q be any matrix such that R(Q) = Ω⊥. Suppose each D ∈ V can be expressed in the form

D = σ²In + XΛX' + QΨQ'

for some σ², Λ and Ψ (there are no restrictions on σ², Λ and Ψ except, of course, that D is a covariance matrix). Then

DX = σ²X + XΛX'X + QΨQ'X = X(σ²Ip + ΛX'X).

That is, R(DX) ⊆ Ω for all D ∈ V. Hence X'Y is a blue, so that R(X) ⊆ μ0. This means, among other things, that blues and random vectors having the gm property computed under model MIn will enjoy the same property under cov(Y) ∈ V.

This concludes this text on linear models. The author hopes that, by reading this text, the reader has gained some level of appreciation for the many applications that linear models have in practice.

5.5ProblemsforChapter5 1.(Seely(1989)SupposeYisa4x1randomvectorsuchthatE(Y)=Xɴ,ɴ unknownisparameterizationforE(Y)andsuchthatcov(Y)=ʍ2Vwhere 1 X=ቌ 1 0 െ1

1 0 0 0ቍ and V= ቌ0 1 0 0 1 1 0 1

0 0 1 0

1 0ቍ. 0 1

Let β̂ = (β̂1, β̂2)' be a blue for β. Do the following:
(a) Find dim μ0.
(b) Find a matrix T such that T'Y is a blue for β.
(c) Describe all H such that H'Y is a blue for β.
(d) Find D such that cov(β̂) = σ²D.
(e) Set ê = Y – Xβ̂. Find W such that cov(ê) = σ²W.

2. (Seely (1989)) Do problem 1 above assuming X = (1, 2, 2, 1)'.

3. (Seely (1989)) Suppose {Yij: i = 1, 2, 3, 4; j = 1, 2,…,ni} is a collection of random variables such that E(Yij) = μi and such that

cov(Yij,Yi'j') = σ² for i = i' and j = j'
              = (1/2)σ² for i = i' and |j – j'| = 1


              = 0 otherwise.

Assume further that the following observations are obtained: y11 = 2; y12 = 4; y21 = 3; y31 = 5; y32 = -9; y33 = -6; y41 = -1; y42 = 1. Estimate the parameters μ1,…,μ4 and σ² under each of the following assumptions on the μi's, assuming for each case that σ² is unknown. For each case below, state why μ = (μ1, μ2, μ3, μ4)' is estimable, determine m = dim μ0 and find D such that cov(μ̂) = σ²D where μ̂ is the blue for μ.
(a) μi = iα for i = 1, 2, 3, 4 where α is an unknown real number.
(b) μi = μi+2 for i = 1, 2.
(c) μi = α + βti for i = 1, 2, 3, 4 where α and β are unknown real numbers and t1 = 0, t2 = -1, t3 = 4 and t4 = 1.
(d) 3μ1 + μ2 – μ4 = 3, -μ1 + 2μ2 – 2μ3 + 3μ4 = 9 and -7μ1 – 2μ3 + 5μ4 = 6.

In problems 4-7 (Seely (1989)) below, assume that Y is a random vector such that E(Y) ∈ Ω where Ω is a subspace of Rn and cov(Y) = σ²V where V is a known positive semi-definite matrix and σ² is unknown.

4. Do the following:
(a) Show that dim μ0 = n(V) + dim[R(V) ∩ Ω].
(b) Is Rn = Ω ⊕ μ0⊥ ⊕ N(V) ∩ Ω⊥?

5. Let Q be any matrix such that R(Q) = Ω⊥. Verify the following:
(a) If A is any nxs matrix, then there exists a matrix L satisfying Q'VQL = Q'VA.
(b) If A and L are such that Q'VQL = Q'VA, then T = A – QL is such that T'Y is a blue for E(A'Y). Furthermore, cov(T'Y) = σ²(A'VA – L'Q'VQL).
(c) If the columns of Q are linearly independent, then Q'VQ is positive definite if and only if N(V) ∩ Ω⊥ = 0.

6. Suppose E(Y) = Xβ, β unknown, is a parameterization for E(Y). Let W = V + A where A is symmetric and R(A) ⊆ R(X) ⊆ R(W). Let G be any g-inverse for W. Verify the following assertions:


(a) There exists a random vector β̂ satisfying X'GXβ̂ = X'GY.
(b) Any β̂ satisfying the equations in (a) is gm for β.
(c) If A = XΨX' where Ψ is symmetric, β̂ is gm for β and π'β, Λ'β are estimable, then

cov[π'β̂, Λ'β̂] = σ²π'[D – Ψ]Λ

where D is any g-inverse for X'GX.

7. Show that A = XX' is a satisfactory choice of A in problem 6.

Problems 8-12 (Seely (1989)) below refer to the following paragraph. Assume the same general model and notation as in section 4 of this chapter. In addition, assume E(Y) = Xβ, β unknown, is a parameterization for E(Y) and let VD = {D ∈ VS: D is positive semi-definite}. It is clear that V and VD are b-equivalent. As defined in section 4, any matrix C in VD is said to be a maximal element provided R(D) ⊆ R(C) for all D ∈ VD. In problems 8-12, assume C can be selected arbitrarily, but that it remains fixed throughout the problems. A maximal element essentially takes the place of positive definite matrices in sections 3 and 4 of this chapter. Of course, as pointed out previously in this chapter, any positive definite matrix is a maximal element. In the event V contains a positive definite matrix, the results below add little to the theory already given.

8. Let C = D1 + … + Dk where D1,…,Dk are positive semi-definite matrices that form a spanning set for VS. Show that C is a maximal element.

9. Verify the following assertions:
(a) μ0 ∩ Ω⊥ = N(C) ∩ Ω⊥.
(b) The condition r(C,X) = n is true if and only if μ0 ∩ Ω⊥ = 0n.

10. Let A'Y be an arbitrary linear estimator. Let T'Y be a blue for E(A'Y) with respect to model MC. Show that if there exists a blue for E(A'Y), then T'Y is a blue for E(A'Y).

[From problem 10, we see that to check whether or not a blue exists for E(A'Y), we only need to check for its existence with respect to model MC.]


11. Show that the following statements are equivalent:
(a) Rn = μ0 + Ω⊥.
(b) There exists a covariance matrix D such that μ0 = μD.
(c) μ0 = μC.
(d) μ0 = μV for every maximal element V ∈ VD.

12. Let T be such that R(CT) ⊆ Ω (i.e., T'Y is a blue with respect to model MC) and such that r(T'X) = r(X). Show that μ0 = μC if and only if T'Y is a blue.

APPENDICES BACKGROUND MATERIAL

In the following appendices, we provide background which is useful for reading the main text. The reader is encouraged to go through any of the appendices which contain information with which the reader may not be familiar.

Appendix A1-Matrices and Matrix Operations

Definition A1.1. An mxn matrix A is a rectangular array of elements having m rows and n columns. We denote the entry in the i th row and j th column of A by aij. So

A = (aij)mxn = ( a11  …  a1n )
               (  ⋮   ⋱   ⋮  )
               ( am1  …  amn ).

If m = n, we say A is a square matrix of order n. If aij = aji for i,j = 1, …, n, we say A is symmetric. In this text, we will be dealing with matrices whose elements are always real numbers. Definition A1.2. 0mn is used to denote an mxn matrix having all entries equal to zero. We will denote a 0m1 matrix by 0m and the real number 01 by 0. Definition A1.3. 1mn or Jmn is used to denote an mxn matrix whose entries are all ones. To denote 1m1 we use 1m or Jm and to denote 11 we use 1. Definition A1.4. In = (aij)nxn is the identity matrix, an nxn matrix having aii =1 for i=1,…,n and aij = 0 for all i,j = 1,…,n, i ് j.


Definition A1.5. A matrix A = (aij)mxn can be multiplied by a scalar (real number) c and this product is defined as cA = B = (bij)mxn where bij = caij for all i and j. Definition A1.6. Two matrices A = (aij)mxn and B = (bij)mxn can be added where the addition is defined as A + B = C = (cij)mxn where cij = aij + bij for all i and j. Comment. If A and B are matrices, in order for A + B to be defined, the number of rows and columns in each matrix must be the same. Definition A1.7. Two matrices A = (aij)mxn and B = (bij)nxp can be multiplied together to form the product AB = C = (cij)mxp where cij = σ௡௞ୀଵ aikbkj for all i and j. If AB = BA, we say the matrices commute. Comment. In order for a product AB between matrices A and B to be defined, the number of columns of A must be the same as the number of rows of B. Comment. If A = (aij)mxn and B = (bij)nxp = (b1,…,bp) where bi denotes the i th column of B, then it easy to show using definition A1.7 that AB = A(b1,…,bp)= (Ab1,…,Abp). These latter expressions are useful to remember. Definition A1.8. The transpose of a matrix A = (aij)mxn is written as A’. It is defined as A’ = C = (cij)nxm where cij = aji for all i and j. Comment. (1) Note that if A = A’, then A is a symmetric matrix. (2) If A and B are matrices such that AB exists, then (AB)’ = B’A’. Comment. Let A = (aij)nxn be a matrix. We say A is a diagonal matrix if aij = 0 for all i ് j and denote such a matrix A by A = diag(a11,…,ann).

Appendix A2-Vector Spaces-Rn Let Rn consist of all nx1 matrices with real numbers. These nx1 matrices are called vectors in Rn and are denoted by x = (x1,…,xn)’. Let vector addition and scalar multiplication of vectors be as defined previously for matrices. These operations satisfy the following properties: (1) If u,v ੣ Rn, then u + v ੣ Rn.


(2) If u,v ੣ Rn, then u + v = v + u. (3) If u,v,w ੣ Rn, then u + (v + w) = (u + v) + w. (4) The vector 0n ੣ Rn is such that 0n + u = u + 0n = u for all u੣Rn. (5) For each u ੣ Rn, there exists -u ੣ Rn such that u + (-u) = 0n. (A2.1). (6) If k is any real number and u ੣ Rn, then ku ੣ Rn. (7) If k is a real number and u,v ੣ Rn, then k(u + v) = ku + kv. (8) If k and l are real numbers and u ੣ Rn, then (k + l)u = ku + lu. (9) If k and l are real numbers and u ੣ Rn, then k(lu) = (kl)u. (10) If u ੣ Rn, then 1u = u. Because the operations of vector addition and scalar multiplication satisfy properties 1-10 given in (A2.1), we call Rn a vector space and more specificly, a Euclidean vector space. Comment. More generally, any set V of objects having two operations called addition of the objects and multiplication of the objects by a scalar and which satisfies the 10 properties given in (A2.1) is called a vector space. Definition A2.2. A subset W of Rn is called a subspace of Rn if W is itself a vector space under the addition and scalar multiplication operations defined on Rn. We denote the fact that W is a subspace of Rn by W ‫ ؿ‬Rn. Theorem A2.3. If W is a subset of Rn, then W is a subspace of Rn if and only if (a) u,v ੣ W implies u + v ੣ W and (b) if u ੣ W and k is a real number, then ku ੣ W. Definition A2.4. A vector y ੣ Rn is called a linear combination of the vectors v1,…,vr੣ Rn if y = k1v1 +…+ krvr where k1,…,kr are real numbers.


Definition A2.5. Let v1,…,vr be vectors in Rn and let V be a subset of vectors in Rn. If every vector in V can be written as a linear combination of v1,…,vr then we say these vectors span V. Theorem A2.6. If v1,…,vr are vectors in Rn, then the following statements hold: (a) The set W of all linear combinations of v1,…, vr is a subspace of Rn. (b) W in (a) is the smallest subspace of Rn containing v1,…,vr. If V is a set of vectors in Rn, then all linear combinations of vectors in V is called the span of V and is denoted by sp{V}. Definition A2.7. Let S = {v1,…,vr} be a subset of Rn. Then the vector equation k1v1+…+ krvr = 0n has at least one solution, i.e., k1 =…= kr = 0. If this is the only solution, then S is called a linearly independent (l.i.) set. If there are other solutions, then S is called a linearly dependent set. Definition A2.8. If W is a subspace of Rn and S = {v1,…,vr} is a subset of W, then S is called a basis for W if (a) S is a linearly independent set and (b) S spans W. Theorem A.2.9. If S = {v1,…,vr} is a basis for W ‫ ؿ‬Rn, then every set of vectors in W with more than r vectors is linearly dependent. Theorem A2.10. Any two bases for a subspace W of Rn have the same number of vectors. Definition A2.11. The dimension of a subspace W of Rn, denoted by dimW, is the number of vectors in a basis for W. We also define the zero vector space to have dimension 0. Theorem A2.12. If S = {v1,…,vr} is a linearly independent set in a pdimensional subspace W of Rn, then there exists vectors vr+1,…,vp such that {v1,…,vr,vr+1,…,vp} is a basis for W.

Appendix A3-Set Arithmetic in Rn Definition A3.1. Let S and T be subsets in Rn. Then S + T = {s + t: s ੣ S and t੣T}.

Background Material

155

If S, T and U are subsets in Rn, then the following properties are easily established: (1) S + T = T + S. (2) (S + T) + U = S + (T + U). (3) For 0n ੣ Rn, 0n + S = S. (4) Rn + S = Rn. (5) S – T = S + (-T) where -T = {-t:t ੣ T}. Proposition A3.2. Let S and T be subspaces in Rn. Then (a) S ‫ ת‬T is a subspace in Rn. (b) S + T is a subspace in Rn. Proposition A3.3. If S and T are subspaces in Rn, then dim{S + T} = dim S + dim T – dim{S ‫ ת‬T}. Definition A3.4. Let S and T be subspaces in Rn. If S ‫ ת‬T = 0n, we say S and T are disjoint subspaces. Definition A3.5. If S and T are subspaces of Rn and S ‫ ת‬T = 0n, we denote S+T by S ْ T and call S ْ T the direct sum of S and T. Proposition A3.6. Let S and T be disjoint subspaces in Rn. Then dim[S ۩ T] = dim S + dim T. Proposition A3.7. Let S and T be disjoint subspaces in Rn. Then every vector x ੣ S ۩ T can be represented uniquely as x = s + t where s ੣ S and t ੣ T Comment. The above results concerning direct sums have natural extensions to m subspaces. For example, if S1,…,Sm are subspaces of Rn such that Si ‫ ת‬Sj = 0n for all i ് j, then S1 + … + Sm = S1 ۩ …۩Sm.


Appendix A4-The Euclidean Inner Product Definition A4.1. If x,y ੣ Rn, then the Euclidean inner product between x and y, denoted as x·y or (x,y)n, is defined as x·y = (x,y)n = x’y = x1y1 +…+ xnyn. If u, v, w are vectors in Rn and k is any real number, then the following properties associated with the Euclidean inner product are easy to verify: (1) uήv = v·u. (2) (u + v)ήw = uήw + v·w. (3) (ku)ήv = k(uήv). (4) u‫ڄ‬u ൒ 0 with equality if and only if u = 0n. Definition A4.2. The Euclidean norm (or length) of x ੣ Rn is defined as ԡxԡ = (x· x)1/2 = (x’x)1/2 = (x12 +…+ xn2)1/2. If u and v are vectors in Rn and k is a real number, then the following properties of the Euclidean norm (the Euclidean length) are easy to establish: (1) ԡuԡ ൒ 0. (2) ԡuԡ = 0 if and only if u = 0n. (3) ԡkuԡ = |k| ԡuԡ. (4) ԡu + vԡ ൑ ԡuԡ + ԡvԡ (the triangle inequality). Definition A4.3. The Euclidean distance between x,y ੣ Rn is defined by d(x,y) = ԡx െ yԡ =[(x – y)’(x – y)]1/2 = [(x1 – y1)2 +…+ (xn – yn)2]1/2. If u, v and w are vectors in Rn, then Euclidean distance has the following properties: (1) d(u,v) ൒ 0 with equality if and only if u = v. (2) d(u,v) = d(v,u).


(3) d(u,v) ≤ d(u,w) + d(w,v) (the triangle inequality).

Theorem A4.4. (The Cauchy-Schwarz inequality) If x and y are vectors in Rn, then (x·y)² ≤ (x·x)(y·y).

Definition A4.5. If x and y are vectors in Rn and x·y = 0, we say x and y are orthogonal, denoted by x ⊥ y.

Theorem A4.6. (The Pythagorean Theorem) If x and y are vectors in Rn and x ⊥ y, then ‖x + y‖² = ‖x‖² + ‖y‖².

Definition A4.7. A set of vectors S in Rn is called an orthogonal set if all pairs of distinct vectors in S are orthogonal. An orthogonal set of vectors S in which each vector has norm (length) 1 is called an orthonormal set.

Theorem A4.8. If S = {v1,…,vr} is an orthogonal set of nonzero vectors in Rn, then S is a linearly independent set.

Theorem A4.9. If S = {v1,…,vr} is an orthonormal basis for a subspace W of Rn and u ∈ W, then u = (u·v1)v1 +…+ (u·vr)vr.

Theorem A4.10. (The Gram-Schmidt Theorem) Let S be a subspace of Rn with basis {x1,…,xr}. Then there exists an orthonormal basis for S, say {y1,…,yr}, with ys in the space spanned by x1,…,xs for s = 1,…,r.

Proof. Define y1 = x1/(x1'x1)^(1/2) and, for s = 2,…,r,

ws = xs – Σ_{i=1}^{s-1} (xs'yi)yi  and  ys = ws/(ws'ws)^(1/2).
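The construction in the proof of theorem A4.10 can be carried out numerically; the following sketch (Python with NumPy, illustrative only) implements it for a matrix whose columns are assumed linearly independent.

import numpy as np

def gram_schmidt(X):
    # Columns of X are x1,...,xr; returns orthonormal y1,...,yr as columns.
    Y = []
    for s in range(X.shape[1]):
        w = X[:, s].copy()
        for y in Y:
            w = w - (X[:, s] @ y) * y      # w_s = x_s - sum_i (x_s'y_i) y_i
        Y.append(w / np.sqrt(w @ w))       # y_s = w_s / (w_s'w_s)^(1/2)
    return np.column_stack(Y)

X = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
Y = gram_schmidt(X)
print(np.round(Y.T @ Y, 10))               # identity matrix: columns are orthonormal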

Definition A4.11. Let S and T be sets in Rn. We say S and T are orthogonal if s·t = 0 for all s ੣ S and t ੣ T. We denote this relationship by S ٣ T. Definition A4.12. Let S be a set in Rn. Then the orthogonal complement of S with respect to Rn, denoted by Sᄰ, is defined by Sᄰ = {x ੣ Rn: x ٣ S}. Comment. If S is a subset of Rn, then Sᄰ is always a subspace of Rn (even if S is just a subset of Rn).


Proposition A4.13. Let S and T be subspaces in Rn. Then the following statementshold: (a)(Sᄰ)ᄰ=S. (b)Rn=S۩Sᄰ. (c)IfS‫ؿ‬T,thenSᄰ‫ـ‬Tᄰ. (d)(S+T)ᄰ=Sᄰ‫ת‬Tᄰ. (e)(S‫ ת‬Tሻᄰ=Sᄰ+Tᄰ. AsageneralizationoftheorthogonalcomplementofasubsetSinRn withrespecttoRn,wehavethefollowingdefinition. DefinitionA4.14.LetSandTbesubspacesofRnandletTSᄰ={y੣S:y٣T}. ThenTSᄰiscalledtheorthogonalcomplementofTwithrespecttoS. Comment.IfSistakenasRnindefinitionA4.14,thenTSᄰreducestothe orthogonalcomplementofTwithrespecttoRndefinedindefinitionA4.12. PropositionA4.15.LetS‫ؿ‬RnandletT‫ ؿ‬S.Thenthefollowingstatements hold: (a)TSᄰ=S‫ ת‬TᄰisasubspaceofS. (b)S=TْTSᄰ=Tْ(S‫ת‬Tᄰ). (c)dimS=dimT+dimTSᄰ=dimS+dim(S‫ת‬Tᄰ) (d)Ifx੣S,thenxcanberepresenteduniquelyasx=t+ywheret੣Tand y੣TSᄰ.

AppendixA5ͲMatricesasLinearTransformations LetA=(aij)mxn beamatrixandletx੣Rn.ThenAx=y੣Rm andwecan think of A as a function or a transformation from Rn to Rm, denoted by A:Rn՜Rm.


Comment.LetAbeanmxnmatrixandletx,y੣Rnandletkandlbereal numbers.ThenAalsosatisfies A(kx+ly)=kAx+lAy.  (A5.1) Anyfunctionthatsatisfies(A5.1)iscalledalineartransformationfromRn toRm. DefinitionA5.2.Thematrix0mniscalledthezerotransformationfromRnto Rmandsatisfies0mnx=0m੣Rmforallx੣Rn. DefinitionA5.3.LetIn=(aij)nxnwhereaii=1fori=1,…,nandaij=0foralli്j. Then In: Rn ՜ Rn defined by Inx = x for all x ੣ Rn is called the identity transformation. Let A = (aij)mxn. Two important subspaces associated with A are the following: (1)TherangespaceorcolumnspaceofAisdenotedbyR(A)anddefined by        R(A)={Ax੣Rm:x੣Rn}‫ؿ‬Rm. (2)ThenullspaceofAisdenotedbyN(A)anddefinedby N(A)={x੣Rn:Ax=0m}‫ؿ‬Rn. If A is an mxn matrix, then R(A) is a subspace of Rm and N(A) is a subspaceofRnandweusetheR(A)andN(A)todefineboththerankand thenullityofA.Inparticular,therankofA,denotedbyr(A),isdefinedby r(A)=dimR(A)andthenullityofA,denotedbyn(A),isdefinedbyn(A)= dimN(A).IfA=(aij)mxn=(a1,…,an)whereairepresentstheithcolumnvector inAandx=(x1,…,xn)’੣Rn,thenAx=x1a1+…+xnan.Thelastexpressionin the preceding sentence implies that R(A) = sp{a1,…,an} and r(A) = dim{sp[a1,…,an]}=thenumberoflinearlyindependentcolumnvectorsin A. PropositionA5.4.IfA=(aij)mxnisamatrix,thenn=r(A)+n(A).


Many proofs in linear algebra involving matrices, say A and B, rest upon showing that R(A) = R(B). A useful fact to remember is that if A and B are matrices, there are essentially two ways of showing that R(A) = R(B): (1) Show that R(A) ‫ ؿ‬R(B) and R(B) ‫ ؿ‬R(A). (2) Show R(A) ‫ ؿ‬R(B) and dim R(A) = dim R(B). If A = (aij)mxn and S is a subset of Rn, we will from time to time use notation such as A[S] to denote the subset T of Rm defined by T = {As: s ੣S}, i.e., T is the set of all vectors in Rm generated by multiplying all vectors in S by A.

Appendix A6-The Transpose of a Matrix A (A’) Let A = (aij)mxn be a matrix. Then A: Rn ՜ Rm is a linear transformation from Rn to Rm and the transpose of A, A’= (aji)nxm is a linear transformation A’:Rm՜ Rn from Rm to Rn. Recall that for x,y ੣ Rn, the Euclidean inner product between x and y, denoted by x·y or (x,y)n, is defined by x·y = (x,y)n = x’y = x1y1 +...+xnyn. Also, recall that if A is an sxt matrix and B is a txm matrix, then (AB)’ = B’A’. Using these latter stated results, we see that if A is an sxt matrix, then for x ੣ Rt and y ੣ Rs, (Ax,y)s = (Ax)’y = x’(A’y) =(x,A’y)t.

(A6.1)

The relationship given in (A6.1) is often useful to remember. Proposition A6.2. Let A = (aij)sxt be a matrix. Then the following statements hold: (a) r(A) = r(A’). (b) R(A) = N(A’)ᄰ. (c) N(A) = R(A’)ᄰ. Proof.( c) Let x ੣ N(A). Then Ax = 0s ੣ Rs. Now, for any y ੣ Rs, 0 = (0s,y)s = (Ax,y)s = (Ax)’y = x’(A’y) = (x, A’y)t


which implies that x ٣ R(A’) and x ੣ R(A’)ᄰ. Conversely, let y ੣ R(A’)ᄰ ‫ ؿ‬Rt. Then for any x ੣ Rs, 0 = (y, A’x)t = y’(A’x) = (Ay)’x = (Ay, x)s which implies that Ay ੣ RƐᄰ = 0s. Thus Ay = 0s and y ੣ N(A) as we were to show. (b) From part (c), we have that N(A’) = R[(A’)’]ᄰ = R[A]ᄰ which implies N(A’)ᄰ = [R(A)ᄰ]ᄰ = R(A). (a) From proposition A5.4, since A is an sxt matrix, we have that t = r(A)+n(A) and since Rt = R(A’) ۩ R(A’)ᄰ , it follows that t = dim R(A’) + dim R(A’)ᄰ = r(A’) + dim N(A) = r(A’) + n(A). Hence we have that t = r(A) + n(A) = r(A’) + n(A) which implies that r(A)=r(A’). Using proposition A6.2, we also have the following: If A = (aij)sxt, then (a) Rs = R(A) ۩ N(A’) and (b) Rt = R(A’) ۩ N(A).

Appendix A7-Inverses Definition A7.1. Let A = (aij)nxn and suppose A:Rn ՜ Rn is a transformation from Rn onto Rn, i.e., R(A) = Rn. Then we say that A is invertible. We now give a list of properties associated with invertible matrices that are proven in most beginning linear algebra courses. So let A and B be nxn invertible matrices. Then the following statements hold: (1) r(A) = n. (2) A is invertible if and only if N(A) = 0n. (3) A is invertible if and only if A is a one to one transformation, i.e., if x1,x2੣Rn and x1 ് x2, then Ax1 ് Ax2.


(4) A is invertible if and only if there exists a unique matrix B such that AB=BA=In.WedenotethismatrixBbyAͲ1andcallittheinverseofA. (5)IfAisinvertible,thensoisAͲ1and(AͲ1)Ͳ1=A. (6)IfAandBarebothinvertible,soisABand(AB)Ͳ1=BͲ1AͲ1. (7)IfAisinvertible,thensoisA’and(A’)Ͳ1=(AͲ1)’. (8)Aisinvertibleifandonly|‫്|ܣ‬0where|‫|ܣ‬denotesthedeterminantof thematrixA. Comment. Any nxn matrix A which is invertible is also said to be a nonsingularmatrix.IfAisannxnmatrixwhichisnotinvertible,thenwe sayAissingular.

AppendixA8ͲProductsofMatrices Productsofmatricesoccurfrequentlyinlinearmodels.Asaresultitis handyto have some basic facts about products at our disposal. In what follows, B is an mxn matrix and A is an sxm matrix. Thus B is a linear transformationfromRntoRm andAisalineartransformationfromRm to Rs.Also,theproductABiswelldefinedandisansxnmatrixwhichisalinear transformationfromRntoRs.NoticethattherearenoconditionsonAand Bexceptthattheirproductiswelldefined. For a vector v ੣ Rn, observe that Bv ੣ Rm so that ABv ੣ R(A), i.e. R(AB)‫ؿ‬R(A)whichimpliesr(AB)൑r(A).Ontheotherhand,ifv੣N(B),then clearly ABv = 0s so that N(B) ‫ ؿ‬N(AB) and n(B) ൑ n(AB). Now using proposition A5.4 and the fact that B and AB both map from Rn, the inequalityn(B)൑n(AB)isseentoimplyr(AB)൑r(B).Insummary: (1)R(AB)‫ؿ‬R(A).  (2)N(B)‫ؿ‬N(AB).    (A8.1) (3)r(AB()൑min[r(A),r(B)]. The results given in (A8.1) can of course be applied to any product of matrices.Forexample,ifA,BandCarematricessuchthattheproductABC


iswelldefined,thenR(ABC)‫ؿ‬R(AB)andr(ABC)islessthanorequaltoany oneoftheranksofA,BorC. The information given in (A8.1) provides some general information abouttheproductABbetweentwomatrices.Oftenitisnecessarytohave more precise results. The rank of AB provides much of the needed additionalinformation.Forexample,ifr(AB)=r(A),then(A8.1(1))implies R(AB)=R(A).Also,therankalwaysprovidesinformationaboutthenullity viapropositionA5.4.Afactthatisoftenusefulabouttherankofaproduct isthefollowing: PropositionA8.2.IfAandBarematricessuchthatABisdefined,then r(AB)=r(B)–dim[R(B)‫ת‬N(A)]. Proof. In the product AB, suppose we consider A to be a linear transformation from R(B) to R(A). Then proposition A5.4 implies r(B) = r(AB)+n(AB).TheresultnowfollowsbynotingthatN(AB)=N(A)‫ת‬R(B). TwospecialcasesofpropositionA8.2areworthnoting: (1)r(AB)=r(B)ifAisinvertible. (2)r(AB)=r(A)ifBisinvertible.   (A8.3) Toseetheresultsgivenin(A8.3),notethatAinvertibleimpliesN(A)=0m sothat(1)follows.Tosee(2)notethatBinvertibleimpliesR(B)=Rm,which containsN(A),sothatr(AB)=m–n(A)=r(A)bypropositionA5.4. Anotherusefulfactisgiveninthefollowingcorollary. CorollaryA8.4.LetA=(aij)mxn.Then (a)R(AA’)=R(A). (b)R(A’A)=R(A’). Proof.(a)WeclearlyhaveR(AA’)‫ؿ‬R(A)andbypropositionA8.2, r(AA’)=r(A’)–dim[N(A)‫ת‬R(A’)]=r(A)–dim[R(A’)ᄰ‫ת‬R(A’)]=r(A)–0=r(A). ThusR(AA’)=R(A).


(b)Similartotheproofof(a). PropositionA8.5.SupposeAandGaremxnandmxtmatricessuchthat R(A)‫ؿ‬R(G).ThenthereexistsatxnmatrixHsuchA=GH. Proof.LetA=(a1,…,an)wheretheai‘srepresentthecolumnvectorsofA. Then ai ੣ R(G) for i = 1,…,n, i.e., ai = Ghi for some vector hi ੣ Rt. So let H=(h1,…,hn).ThenGH=(Gh1,…,Ghn)=(a1,…,an)=Aasweweretoshow.

AppendixA9ͲPartitionedMatrices Ofteninlinearmodels,werunintopartitionedmatrices.Inparticular, supposeC=(cij)mxnisamatrixandwewriteC=(A,B)whereAisanmxp1 matrix thatconsists of the first p1 columns of C and B isan mxp2 matrix consisting of columns p1 + 1,…,n of C where p1 + p2 = n. We call C a partitionedmatrix.SomeeasilyprovenfactsaboutCarethefollowing: (1)R(C)=R(A,B)=R(A)+R(B). (2)dimR(C)=dim[R(A)+R(B)]=dimR(A)+dimR(B)–dim[R(A)‫ת‬R(B)]       =r(A)+r(B)–dim[R(A)‫ת‬R(B).   (A9.1) Noticein(A9.1)thattherangeoperationisadditiveoversuchpartitioned matriceswhereastherankoperationisadditiveonlyifthesubspacesR(A) andR(B)aredisjoint. Of course, the idea of partitioning a matrix is not only restricted to columnsofagivenmatrixbutcanbeappliedtorowsaswellasrowsand columnssimultaneously. A Example A9.2. Let C be an mxn matrix such C = ቀ ቁ were A is an m1xn B matrixandBisanm2xnmatrix,m1 +m2=m.ThenCisamatrixthatisa partitionedmatrixwherethepartitioningisbasedonrows. S T ቁwhereSis U V p1xq1,Tisp1xq2,Uisp2xq1,andVisp2xq2,p1+p2=mandq1+q2=n.Then Cisanexampleofapartitionedmatrixthatisbasedonpartitioningrows andcolumnssimultaneously. ExampleA9.3.LetCbeanmxnmatrixsuchthatC=ቀ


We note that addition and multiplication of partitioned matrices can also be carried out based on the partitioning of the matrices.

Example A9.4. Let

A = ( C  D )        B = ( J  K )
    ( E  F )  and      ( L  M )

where A is pxq, C is p1xq1, D is p1xq2, E is p2xq1, F is p2xq2, B is qxr, J is q1xr1, K is q1xr2, L is q2xr1 and M is q2xr2. If p1 = q1, p2 = q2, q1 = r1 and q2 = r2, then

A + B = ( C+J  D+K )
        ( E+L  F+M )

is a pxq matrix, whereas

AB = ( CJ+DL  CK+DM )
     ( EJ+FL  EK+FM )

is a pxr matrix. The addition and multiplication of partitioned matrices illustrated in this example can be extended to more complicated partitioned matrices provided the dimensions of the submatrices involved are conformable for the appropriate operation.

Appendix A10-Eigenvalues and Eigenvectors

Definition A10.1. Let A = (aij)nxn be a matrix. A number λ is called an eigenvalue of A if A – λIn is a singular matrix. The number λ is called an eigenvalue of multiplicity s if n(A – λIn) = s. An nx1 nonzero vector x is called an eigenvector of A corresponding to λ if Ax = λx.

Comment. It can be shown that if λ is an eigenvalue of an nxn matrix A, the eigenvectors corresponding to λ (along with the 0n vector) form a subspace of Rn. If λ = 0 is an eigenvalue of A, then the subspace of Rn corresponding to 0 is N(A).

We now consider only nxn real symmetric matrices A, i.e., A = A' and all entries in A are real numbers, and state a series of results concerning the eigenvalues and eigenvectors of such matrices.

Proposition A10.2. If A is an nxn real symmetric matrix, then the following statements hold:
(a) The eigenvalues of A are all real numbers.


(b) The eigenvectors corresponding to distinct eigenvalues of A are orthogonal.

Proposition A10.3. If A is an nxn real symmetric matrix, then there exists a basis for R(A) consisting of eigenvectors corresponding to nonzero eigenvalues of A. If λ is an eigenvalue of multiplicity s, then the basis will contain s eigenvectors corresponding to λ.

Proposition A10.4. If A is an nxn real symmetric matrix, then there exists an orthonormal basis for Rn consisting of eigenvectors of A.

Definition A10.5. An nxn matrix P is called an orthogonal matrix if P' = P⁻¹.

Comment. The reader should note that if P is an orthogonal matrix, then so is P'.

Proposition A10.6. If A is an nxn real symmetric matrix, then there exists an orthogonal matrix P such that P'AP = diag(λ1,…,λn), where diag(λ1,…,λn) denotes a diagonal matrix. The diagonal entries λ1,…,λn are the eigenvalues of A where, if an eigenvalue has multiplicity s, it occurs s times on the main diagonal of diag(λ1,…,λn).

Comment. The columns of the matrix P in proposition A10.6 consist of n orthonormal eigenvectors corresponding to the n eigenvalues of A.

Corollary A10.7. Let A and P be as in proposition A10.6. Then A = P diag(λ1,…,λn)P'.

Corollary A10.8. If A is a real symmetric matrix, then r(A) = the number of nonzero eigenvalues of A (including multiplicities).

Definition A10.9. An nxn symmetric matrix A is said to be positive definite (positive semi-definite) if, for any nonzero vector x ∈ Rn, x'Ax > 0 (x'Ax ≥ 0). We denote A being positive definite (positive semi-definite) by A > 0 (A ≥ 0).

Theorem A10.10. An nxn symmetric matrix A ≥ 0 if and only if there exists an nxn matrix P such that PP' = A.

Corollary A10.11. An nxn symmetric matrix A > 0 if and only if there exists an nxn nonsingular matrix P such that PP' = A.


Comment. The reader should note that if A > 0, then corollary A10.11 implies A is nonsingular.

Theorem A10.12. If A ≥ 0 is an nxn matrix and r(A) = r, then there exists an nxr matrix B of rank r such that A = BB'.

Proof. Let P = (p1,…,pn) be an orthogonal matrix where p1,…,pr are eigenvectors corresponding to the nonzero eigenvalues λ1,…,λr of A and pr+1,…,pn are the eigenvectors corresponding to 0. Then

P'AP = ( D    0rt )
       ( 0tr  0tt )

where D = diag(λ1,…,λr) and t = n – r. Now let Q = diag(λ1^(1/2),…,λr^(1/2)) and observe that

( D    0rt )  =  (Q  0rt)'(Q  0rt)
( 0tr  0tt )

and that

A = P ( D    0rt ) P'  =  P(Q  0rt)'(Q  0rt)P'  =  BB',
      ( 0tr  0tt )

where B = P(Q  0rt)' is the desired matrix.
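The decomposition of proposition A10.6 / corollary A10.7 and the factorization of theorem A10.12 can be checked numerically; the following sketch (Python with NumPy, illustrative only, for a small symmetric matrix chosen here) keeps the nonzero eigenvalues to form the factor B.

import numpy as np

A = np.array([[2.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [1.0, 1.0, 2.0]])
lam, P = np.linalg.eigh(A)                     # eigenvalues and orthonormal eigenvectors
print(np.allclose(P @ np.diag(lam) @ P.T, A))  # A = P diag(lambda) P'

keep = lam > 1e-10                             # nonzero eigenvalues (A >= 0 here)
B = P[:, keep] @ np.diag(np.sqrt(lam[keep]))   # B = P (Q 0rt)' with Q = diag(sqrt(lambda))
print(np.allclose(B @ B.T, A))                 # A = BB'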

Definition A10.13. Let A = (aij)nxn be symmetric. The trace of A, denoted by tr A, is defined by tr A = Σ_{i=1}^n aii.

Proposition A10.14. For matrices A = (aij)rxs and B = (bij)sxr, tr AB = tr BA.

Proposition A10.15. Let A = (aij)nxn be symmetric with eigenvalues λ1,…,λn. Then tr A = Σ_{i=1}^n λi.

Appendix A11-Projections Definition A11.1. Let Rn = M ْ N where M ‫ ؿ‬Rn and N ‫ ؿ‬Rn. Then for each v ੣ Rn, v = m + n where m ੣ M and n ੣ N and this representation is unique. The nxn matrix P such that Pv = m for all v ੣ Rn is called the projection on M along N. Theorem A11.2. (Existence and uniqueness of projections) Suppose Rn = Mْ N and let e1,…,en be the standard basis vectors in Rn, i.e., ei has a 1 in its i th component and zeros elsewhere. Let ei = pi + qi where pi ੣ M and qi੣N. Now let P = (p1, … , pn). Then P is the unique projection on M along N.


Proof. Suppose x = (x1,…,xn)' ∈ Rn. Then

x = x1e1 +…+ xnen = Σ_{i=1}^n xi(pi + qi) = Σ_{i=1}^n xipi + Σ_{i=1}^n xiqi,

where Σ_{i=1}^n xipi ∈ M and Σ_{i=1}^n xiqi ∈ N, and

Px = (p1,…,pn)(x1,…,xn)' = x1p1 +…+ xnpn ∈ M.

This implies that P is the desired projection. To establish uniqueness, suppose P1 and P2 are both projections on M along N. Then P1x = P2x for all x ∈ Rn. Hence (P1 – P2)x = 0n for all x ∈ Rn, which implies that P1 – P2 = 0nn, hence that P1 = P2.

Example A11.3. Let M = sp{(2,1)'} and N = sp{(1,1)'}. Then R2 = M ⊕ N. Now observe that e1 = (1,0)' = p1 + n1 = (2,1)' + (-1,-1)' and e2 = (0,1)' = p2 + n2 = (-2,-1)' + (2,2)'. Thus

P = (p1, p2) = ( 2  -2 )
               ( 1  -1 )

is the desired projection on M along N.

In example A11.3, the reader should observe that P = P². Any matrix satisfying this last property is said to be idempotent.

Proposition A11.4. Suppose P is an nxn matrix such that P² = P, i.e., P is idempotent. Then the following statements hold:
(a) R(P) = {v ∈ Rn: Pv = v}.
(b) Rn = R(P) ⊕ N(P).
(c) P is the projection on R(P) along N(P).

Proof. (a) Observe that {v ∈ Rn: Pv = v} ⊆ R(P). So assume t ∈ R(P), which implies t = Px for some vector x ∈ Rn. But then Pt = P(Px) = P²x = Px = t, which implies R(P) ⊆ {v ∈ Rn: Pv = v} as we were to show.
(b) Suppose v ∈ R(P) ∩ N(P). Thus Pv = 0n and from (a) it follows that Pv = v, hence v = 0n. So now let v ∈ Rn and let x = Pv and y = (In – P)v. Then v = x + y where x ∈ R(P) and Py = P(In – P)v = (P – P²)v = 0nnv = 0n, which implies y ∈ N(P), hence that Rn = R(P) ⊕ N(P).
(c) Simply apply (a) and (b) and the definition of a projection.


Proposition A11.5. Suppose P is a projection on M along N. Then P2 = P and M = R(P) and N = N(P). Proof. For v ੣ Rn, let v = x + y where x ੣ M and y ੣ N. Then Pv = x and Px = x which implies P(Pv) = Px = x = Pv and that P2 = P. Now, since Px = x for all x੣M and P2 = P, it follows from proposition A11.4 that M ‫ ؿ‬R(P). Also, since Py = 0n for all y ੣ N, we have that M ‫ ؿ‬N(P) and Rn = R(P) ۩ N(P) = M ْ N. Now, n = dim M +dim N= r(P) + n(P). But dim M ൑ r(P) and dim N ൑ n(P) which implies M = R(P) and N = N(P). Proposition A11.6. If P is the projection on M along N, then In - P is the projection on N along M. Proof. For v ੣ Rn, let v = x + y where x ੣ M and y ੣ N. We then have that (In - P) v= (In – P)(x + y) = x –x + y = y ੣ N which implies In – P is the projection on N along M. Proposition A11.7. If P is the projection on M along N, then P’ is the projection on Nᄰ along Mᄰ. Proof. Since P is a projection, it follows from propositions A11.4 and A11.5 that P2 = P, M = R(P), N = N(P). Thus (P’)2 = P’ and P’ is a projection on R(P’)= N(P)ᄰ = Nᄰ along N(P’) = R(P)ᄰ = Mᄰ. Definition A11.8. Let Rn = M ۩ Mᄰ. Then the projection on M along Mᄰ is called the orthogonal projection on M. Proposition A11.9. P is an orthogonal projection if and only if P = P2 = P’. Proof. Suppose P is the projection on M along Mᄰ. By propositions A11.5 and A11.7, it follows that P2 = P and P’ is the projection on (Mᄰ)ᄰ = M along Mᄰ and by uniqueness of projections we have that P = P’. Conversely, suppose P = P2 = P’. Then by proposition A11.4, P is a projection on R(P) along N(P) and since P = P’, N(P) = N(P’) = R(P)ᄰ which implies P is an orthogonal projection. Proposition A11.10. Let P and P0 be orthogonal projections such that R(P0)‫ ؿ‬R(P). Then the following statements hold:


(a) P – P0 is an orthogonal projection. (b) R(P – P0) = R(P0)R(P)ᄰ = R(P0)ᄰ ‫ ת‬R(P). (c) N(P – P0) = R(P0) ْ R(P)ᄰ. Proof (a) Since R(P0) ‫ ؿ‬R(P), PP0 = P0 and by symmetry, P0P = P0. Now observe that (P – P0)2 = P2 – PP0 - P0P + P02 =P – PP0 – PP0 + P0 = P – P0 –P0 + P0 = P – P0 and that (P – P0)’ = P – P0, hence we have by proposition A11.9 that P – P0 is an orthogonal projection. (b) To begin, observe that R(P – P0) ٣ R(P0) because (P – P0)P0 = PP0 – P02 = P0 – P0 = 0nn. Thus R(P – P0) ‫ ؿ‬R(P0)R(P)ᄰ. Now, if x ੣ R(P) and x ٣ R(P0), then x = Px = (P – P0)x + P0x = (P – P0)x. Thus x ੣ R(P – P0), so R(P0)R(P)ᄰ ‫ ؿ‬R(P – P0) as we were to show. The second equality follows from proposition A4.15. (c) This follows since N(P – P0) = R(P – P0)ᄰ = [R(P0)ᄰ ‫ ת‬R(P)]ᄰ = R(P0) ْ R(P)ᄰ We now give three methods for computing projection matrices. (1) Let Rn = M ۩ N and let e1,…,en denote the n standard basis vectors in Rn. If we let ei = pi + ni where pi ੣ M and ni ੣ N for i = 1,…,n, then as shown in the proof of theorem A11.2, P = (p1,…,pn) is the projection on M along N. (2) Let Rn = M ْ Mᄰ, suppose dimM = m and let p1,…,pm be an orthonormal basis for M. Let L = (p1,..,pm) and let P = LL’. Then P = P2 = P’ and P is the orthogonal projection on M. (3) Let Rn = M ۩ Mᄰ and Let U be any matrix whose columns form a basis for M. Let P = U(U’U)-1U’. Then it is easily seen that P = P2 = P’ and R(P) =M, thus P is the orthogonal projection on M. Comment. Two useful facts to remember throughout the text are the following: (1) Suppose P is a projection on M along N and A is a matrix such that R(A)‫ؿ‬M. Then PA = A.


(2) Suppose U and T are matrices such that R(U) = R(T) and the columns of U and T each form a basis for a subspace M of Rn. Then U(U’U)-1U’ = T(T’T)-1T’ since both of these matrices are orthogonal projections onto R(U) = R(T)=M and we get equality by uniqueness of projections. Several miscellaneous facts about projections are given in the following Lemma. Lemma A11.11. Suppose P is a projection on a subspace M along N. Then the following statements hold: (a) The eigenvalues of P are all 0’s or 1’s. (b) tr P = r(P). Proof. See the exercises.
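Method (3) above can be checked numerically; the following sketch (Python with NumPy, illustrative only) forms P = U(U'U)⁻¹U' for a matrix U whose columns are assumed linearly independent, and verifies that P = P² = P' and that P acts as the identity on R(U).

import numpy as np

U = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])                    # columns form a basis for M = R(U)
P = U @ np.linalg.inv(U.T @ U) @ U.T          # orthogonal projection on M
print(np.allclose(P, P @ P), np.allclose(P, P.T))   # idempotent and symmetric
print(np.allclose(P @ U, U))                        # PU = U, i.e., PA = A whenever R(A) is in M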

Appendix A12-Generalized Inverses Proposition A12.1. Let A = (aij)mxp be a matrix. Then there exists a pxm matrix B such that A’AB = A’. Moreover, for any such matrix B, the following facts hold: (a) ABA = A. ( b) (AB)2 = AB. (c) R(AB) = R(A). (d) (AB)’ = AB. Proof. By corollary A8.4, we have that R(A’A) = R(A’), hence we have by proposition A8.5 that there exists a pxm matrix B such that A’AB = A’. (a) Suppose A’AB = A’. Then A’ABA = A’A and A’(ABA –A) = 0pp. Thus R(ABA – A) ‫ ؿ‬R(A) ‫ ת‬N(A’) = R(A) ‫ ת‬R(A)ᄰ = 0m which implies ABA – A = 0mp as we were to show.


(b) From part (a), ABA = A which implies ABAB = AB, hence AB is idempotent. (c) From (a), R(ABA) = R(A) ‫ ؿ‬R(AB) ‫ ؿ‬R(A), hence R(AB) = R(A). (d) (AB)’ = B’A’ = B’A’AB = (A’AB)’B = (A’)’B = AB. Comment. If B satisfies A’AB = A’, then it follows from proposition A12.1 that AB is the orthogonal projection on to R(A). Definition A12.2. Suppose A = (aij)mxp is any matrix. Any pxm matrix G satisfying AGA = A is said to be a generalized inverse (g-inverse) of A. We denote an arbitrary generalized inverse of A by A-. Comment (1) Proposition A12.1 establishes the existence of a generalized inverse for any matrix A. (2) If A is an mxp matrix, then a generalized inverse of A may not be unique. In fact, there may be infinitely many generalized inverses for a given matrix. In the following proposition, we present some of the basic properties possessed by generalized inverses. Proposition A12.3. Suppose A = (aij)pxq is a matrix and let A- denote any generalized inverse of A. Then the following statements hold: (a) r(A-) ൒ r(A). (b) (A-)’ is a generalized inverse of A’. (c) If A is invertible, then A- = A-1. (d) AA- is a projection on R(A). (e) A-A is a projection along N(A). Proof. (a) r(A) = r(AA-A) ൑ min [r(A),r(A-A)] ൑ r(A-A) ൑ min[r(A),r(A-)] ൑ r(A-).


(b) AAͲA = A which implies A’AͲ‘A’ = A’. Hence by definition, AͲ‘ is a generalizedinverseofA’. c)IfAisinvertible,AAͲA=AimpliesAͲ1AAͲAAͲ1=AͲ=AͲ1AAͲ1=AͲ1 aswe weretoshow. (d) Observe that (AAͲ)(AAͲ) = AAͲAAͲ = AAͲ which implies that AAͲ is idempotent,henceitisaprojection.NowobservethatR(A)=R(AAͲA)‫ؿ‬ R(AAͲ)‫ؿ‬R(A)whichimpliesR(A)=R(AAͲ).ThusAAͲisaprojectiononR(A). (e)Observethat(AͲA)(AͲA)=AͲAAͲA=AͲA,henceAͲAisidempotentanda projection. Now we have that N(A) ‫ ؿ‬N(AͲA) ‫ ؿ‬N(AAͲA) = N(A) which impliesthatN(AͲA)=N(A),hencethatAͲAisaprojectionalongN(A). In proposition A12.3, the reader should note that in part (b), if A is symmetric,thenbothAͲandAͲ‘aregeneralizedinversesofA.Also,inparts (d)and(e),thesubspacesthatAAͲprojectsalongandthatAͲAprojectsonto cannot be specified. This is because AͲ is not unique and these spaces dependuponwhichgeneralizedinverseofAisused. Proposition A12.4. Suppose M ‫ ؿ‬Rn and that A is any matrix such that R(A)=M.ThenP=A(A’A)ͲA’istheorthogonalprojectionontoR(A)=M. Proof.Tobegin,observethat P2=[A(A’A)ͲA’][A(A’A)ͲA’]=A(A’A)Ͳ(A’A)(A’A)ͲA’=A(A’A)ͲA’=P because A’A(A’A)Ͳ is a projection onto R(A’A) = R(A’). Thus A(A’A)ͲA’ is idempotent and a projection. Now observe that if we let B = (A’A)ͲA’ in PropositionA12.1,thenBsatisfiestheequationA’AB=A’.Hencebythe commentimmediatelyfollowingPropositionA12.1,wehavethatA(A’A)ͲA’ istheorthogonalprojectiononR(A)=M. InpropositionA12.4,itshouldbenotedthatAcouldbeanymatrixsuch thatR(A)=M.ThusifAandBareanymatricessuchthatR(A)=R(B)=Mfor somesubspaceMofRn,thenitfollowsfromtheuniquenessoforthogonal projectionsthatA(A’A)ͲA’=B(B’B)ͲB’. We now give one method for computing a generalized inverse of a givenmxnmatrixAthatisbasedontherankofA:


(1) Suppose A is an mxn matrix of rank s and can be partitioned as

A = ( A11  A12 )
    ( A21  A22 )

where A11 is an sxs submatrix having r(A11) = s, A12 is sx(n−s), A21 is (m−s)xs and A22 is an (m−s)x(n−s) submatrix.

(2) Compute an nxm generalized inverse for A as

A⁻ = ( B11  B12 )
     ( B21  B22 )

where B11 = A11⁻¹, B12 = 0s,(m−s), B21 = 0(n−s),s and B22 = 0(n−s),(m−s).

The method just given for computing a generalized inverse of a given matrix A is the simplest known to the author, but it does depend upon being able to partition the matrix A as in step (1) above. More general algorithms for computing generalized inverses are available; for more information, the reader is referred to Searle (1971). While there are many additional properties associated with generalized inverses that have not been given here, the facts given in this appendix are adequate for the purposes of this text.
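The following is a minimal numerical sketch, added here purely for illustration and not part of the text, assuming NumPy is available. The matrix A is a hypothetical rank-2 example whose leading 2x2 block is nonsingular, so the partition method applies directly; np.linalg.pinv is used only as one convenient choice of generalized inverse of A'A when checking proposition A12.4.

```python
import numpy as np

# Hypothetical rank-2 matrix whose leading 2x2 block A11 is nonsingular.
A = np.array([[2., 1., 3.],
              [1., 1., 1.],
              [3., 2., 4.]])          # row3 = row1 + row2, so r(A) = 2
s = np.linalg.matrix_rank(A)          # s = 2

# Steps (1)-(2): invert the leading sxs block and pad with zeros.
A11_inv = np.linalg.inv(A[:s, :s])
G = np.zeros((A.shape[1], A.shape[0]))   # G is n x m
G[:s, :s] = A11_inv

# G is a generalized inverse: A G A = A.
print(np.allclose(A @ G @ A, A))         # True

# Proposition A12.4: P = A (A'A)^- A' is the orthogonal projection on R(A).
AtA_ginv = np.linalg.pinv(A.T @ A)       # one particular g-inverse of A'A
P = A @ AtA_ginv @ A.T
print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric
print(np.allclose(P @ A, A))             # P acts as the identity on R(A)
```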

Appendix A13-Linear Equations and Affine Sets

Consider a set of simultaneous linear equations Ax = b where A is an sxt matrix and b is an sx1 vector. If b = 0, we say the equations are homogeneous. If b ≠ 0, we say the equations are nonhomogeneous. Let H = {x: Ax = b} be the entire set of solutions to the equations. The equations Ax = b are said to be consistent provided that H is nonempty. This is equivalent to the existence of an x0 ∈ Rt satisfying Ax0 = b, which is clearly equivalent to b ∈ R(A). For the special cases b ∉ R(A), b = 0, and A nonsingular, the form of H is easy to ascertain. One general way of expressing H is given in the following proposition.

Proposition A13.1. If x0 ∈ H, then H = x0 + N(A).

Proof. Clearly x0 + N(A) ⊆ H. Conversely, if x ∈ H, then f = x − x0 ∈ N(A) and x = x0 + f, which implies that H ⊆ x0 + N(A).

From proposition A13.1, we see that when H is nonempty, it is the translation of a subspace. We refer to such translations as affine sets.
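As a concrete illustration (added here, not from the text), the sketch below uses NumPy to represent H = x0 + N(A) for a hypothetical consistent system: a particular solution is obtained by least squares and a basis for N(A) from the singular value decomposition.

```python
import numpy as np

# Hypothetical consistent system Ax = b (b lies in R(A) by construction).
A = np.array([[1., 2., 1.],
              [2., 4., 2.]])        # rank 1, so N(A) has dimension 2
b = A @ np.array([1., 0., 1.])      # guarantees consistency

# One particular solution x0 (least squares solves the system exactly here).
x0, *_ = np.linalg.lstsq(A, b, rcond=None)

# An orthonormal basis for N(A): the last (n - r) right singular vectors.
U, svals, Vt = np.linalg.svd(A)
r = np.linalg.matrix_rank(A)
N_basis = Vt[r:].T                  # columns span N(A)

# Proposition A13.1: every x0 + f with f in N(A) solves Ax = b.
f = N_basis @ np.array([2.5, -1.0])     # an arbitrary element of N(A)
print(np.allclose(A @ (x0 + f), b))     # True
```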


Definition A13.2. In Rn, a set of the form M = v + L where v ∈ Rn and L is a subspace of Rn is called an affine set. The subspace L is called the subspace parallel to M. By convention, the empty set is also considered to be an affine set.

Thus in proposition A13.1, if H is nonempty, then H is an affine set and N(A) is the subspace parallel to H. With regard to affine sets, we now give several useful propositions.

Proposition A13.3. Let H = v + L be an affine set in Rn where v is a vector in Rn and L is a subspace of Rn. Then the following statements hold:
(a) H − H = {h1 − h2: h1, h2 ∈ H} = L.
(b) For every vector h0 ∈ H, H − h0 = {h − h0: h ∈ H} = L.

Proof. See the exercises.

Two easily seen facts concerning subspaces are the following:

(A13.4) (1) For any subspace L of Rn having dim L = m, there exists an nxm matrix A such that R(A) = L. (Let A be any matrix whose m columns form a basis for L.)
(2) For any subspace L of Rn having dim L = m, there exists an (n−m)xn matrix B such that N(B) = L. (Let B be an (n−m)xn matrix whose n − m rows form a basis for L⊥.)

Proposition A13.5. H is an affine set in Rn if and only if there is a qxn matrix B and a vector c ∈ Rq such that H = {x ∈ Rn: Bx = c}.

Proof. Let H = v + L be an affine set in Rn where v ∈ Rn and L ⊆ Rn. Choose h0 ∈ H. Then L = H − h0 and by (A13.4(2)), there exists an (n−q)xn matrix B, where q = dim L, such that N(B) = L. Set c = Bh0. Now

H = h0 + L = {h0 + l: Bl = 0n−q} = {x ∈ Rn: B(x − h0) = 0n−q} = {x ∈ Rn: Bx = Bh0} = {x ∈ Rn: Bx = c}.

Conversely, choose h0 ∈ H = {x: Bx = c}. Then Bh0 = c and

H − h0 = {x − h0: Bx = c} = {x − h0: Bx = Bh0} = {x − h0: B(x − h0) = 0q} = N(B)


where N(B) is a subspace of Rn. Thus H = h0 + N(B) and H is an affine set.

Occasionally in the text it is necessary to know when a fixed vector v ∈ Rn has a constant Euclidean inner product with all vectors in an affine subset of Rn; that is, if H is an affine subset of Rn and v ∈ Rn, under what conditions is v'x = c for all x ∈ H, where c is some constant?

Proposition A13.6. (Seely (1989)) Assume H = {x ∈ Rn: Ax = c} is nonempty where A is pxn and let v ∈ Rn. Then v'x is constant for all x ∈ H if and only if v ∈ R(A').

Proof. Let x0 ∈ H be fixed. Since R(A') = N(A)⊥, the condition v ∈ R(A') is the same as v'f = 0 for all f ∈ N(A). But this is equivalent to v'(x0 + f) = v'x0 for all f ∈ N(A). The result now follows by proposition A13.1.

It is sometimes useful to know how a linear transformation acts on an affine set. Let H = {x ∈ Rn: Ax = b} where A is pxn and let M = {Bx: Ax = b} where B is an mxn matrix. That is, M is the image of H under the matrix B. Set L = {Bf: Af = 0p}. Then L is a subspace, and using proposition A13.1 it follows that Ax0 = b implies

M = {Bx: Ax = b} = {B(x0 + f): Af = 0p} = Bx0 + L.     (A13.7)

Thus, M is an affine set and L is the subspace parallel to M. Two useful facts about L are given in the next proposition.

Proposition A13.8. (Seely (1989)) Set L = {Bf: Af = 0s} where B and A are arbitrary mxt and sxt matrices, respectively. Then the following statements can be made:
(a) dim L = r(B',A') − r(A').
(b) L⊥ = {z: B'z ∈ R(A')}.

Proof. (b) Let T be a txq matrix such that R(T) = N(A), so that L = R(BT). Now observe that L⊥ = N(T'B') and that N(T') = R(A'). Thus z ∈ L⊥ implies T'B'z = 0q, hence B'z ∈ N(T') = R(A'), which establishes (b).
(a) Let T be as in (b) and note that dim L = r(BT) = r(T'B'). By proposition A8.2,


r(T'B') = r(B') − dim[R(B') ∩ N(T')] = r(B') − dim[R(B') ∩ R(A')].

Now observe that

dim L = r(T'B') = r(B') − dim[R(B') ∩ R(A')]
      = [r(B') + r(A') − dim[R(B') ∩ R(A')]] − r(A') = r(B',A') − r(A')

where the last equality follows from (A9.1(2)).
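The rank identity in proposition A13.8(a) is easy to check numerically. The sketch below is an added illustration, not part of the text: it compares dim L, computed directly by pushing a null-space basis of A through B, against r(B',A') − r(A'), computed as the rank of the stacked matrix; the matrices are hypothetical random examples.

```python
import numpy as np

def dim_image_of_nullspace(B, A):
    # dim{Bf : Af = 0}: push an orthonormal basis of N(A) through B.
    _, _, Vt = np.linalg.svd(A)
    r = np.linalg.matrix_rank(A)
    T = Vt[r:].T                       # columns form a basis of N(A)
    return np.linalg.matrix_rank(B @ T)

# Hypothetical matrices with a common column dimension t = 4.
rng = np.random.default_rng(0)
A = rng.integers(-3, 4, size=(2, 4)).astype(float)
B = rng.integers(-3, 4, size=(3, 4)).astype(float)

lhs = dim_image_of_nullspace(B, A)
# Proposition A13.8(a): dim L = r(B', A') - r(A'), i.e. the rank of the
# stacked matrix (B over A) minus the rank of A.
rhs = np.linalg.matrix_rank(np.vstack([B, A])) - np.linalg.matrix_rank(A)
print(lhs == rhs)                      # True
```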

Appendix A14-Multivariate Distributions

In this appendix we give a brief review of multivariate probability distributions. The primary reasons for doing this are for the sake of reference and to establish notation. No attempt is made at rigor. Because we only consider continuous random variables in this book, we will only consider multivariate continuous distributions. It is assumed the reader is familiar with the basic properties of 1-dimensional random variables.

Let X1,…,Xn be n 1-dimensional continuous random variables. Then the vector X = (X1,…,Xn)' is called an n-dimensional continuous random vector. So X can essentially assume all values on some n-dimensional rectangular region in Rn or some union of such rectangular regions. Associated with X are two critical functions which completely define the distribution of X. The first is the joint cumulative distribution function of X, denoted by FX(x) = FX(x1,…,xn) and defined for all x ∈ Rn by

FX(x) = P(X1 ≤ x1,…,Xn ≤ xn).

The second function associated with X is the joint probability density function of X, denoted by fX(x) = fX(x1,…,xn), and is obtained from FX(x) by

fX(x) = (∂^n/∂x1…∂xn) FX(x1,…,xn).

We note that fX(x) must satisfy two conditions:

(1) fX(x) ≥ 0 for all x ∈ Rn.

(2) ∫_{-∞}^{∞} … ∫_{-∞}^{∞} fX(x) dx1…dxn = 1.
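As a quick numerical illustration (added here, not from the text), the sketch below checks conditions (1) and (2) for a hypothetical bivariate normal density using SciPy; an integral over a wide rectangle stands in for the integral over all of R².

```python
import numpy as np
from scipy import integrate
from scipy.stats import multivariate_normal

# A hypothetical bivariate normal density, used only to illustrate the two
# conditions a joint probability density function must satisfy.
mvn = multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]])
f = lambda x1, x2: mvn.pdf([x1, x2])

print(f(0.0, 0.0) >= 0)    # condition (1), checked at one point

# Condition (2): numerical integration over a rectangle wide enough to
# capture essentially all of the probability mass.
total, _ = integrate.dblquad(lambda x2, x1: f(x1, x2), -15, 15, -15, 15)
print(round(total, 6))     # approximately 1.0
```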


If we partition the n-dimensional random vector X into X = (X1,X2)' where X1 is an n1-dimensional random vector and X2 is an (n − n1)-dimensional random vector, then we can find the marginal distribution (the joint probability density function) for X1 by integrating out all the components of X associated with X2, i.e.,

fX1(x1) = ∫_{-∞}^{∞} … ∫_{-∞}^{∞} fX(x1,x2) dxn1+1…dxn.

The conditional distribution of X2 given X1 = x1 can also be obtained via the ratio

f(x2|x1) = fX(x1,x2)/fX1(x1), provided fX1(x1) > 0.

If the conditional distribution f(x2|x1) does not depend on X1, we say X1 and X2 are independent. Alternatively, one can establish X1 and X2 as being independent by showing that

fX(x1,x2) = fX1(x1)fX2(x2) for all x ∈ Rn.

As in the 1-dimensional case, one way we can sometimes describe certain aspects of the distribution of X is through its moments. If X = (X1,…,Xn)', the kth moment about zero for Xi is

μXi(k) = E(Xi^k) = ∫_{-∞}^{∞} … ∫_{-∞}^{∞} xi^k fX(x) dx1…dxn

provided the above integral exists. For k = 1, the superscript is usually omitted and μi is written for μi(1).

Another useful measure to help describe the distribution of a random vector X is the covariance between its various components. The covariance between the ith and jth components of X is denoted by cov(Xi,Xj) or σij and is defined by

cov(Xi,Xj) = σij = E[(Xi − μi)(Xj − μj)] = ∫_{-∞}^{∞} … ∫_{-∞}^{∞} (xi − μi)(xj − μj) fX(x) dx1…dxn

provided the above integral exists. Similarly, the variance of the ith component of X, denoted by var(Xi) or σi², can be computed as

var(Xi) = σi² = cov(Xi,Xi) = E[(Xi − μi)²]


provided the appropriate integral exists. The variances and covariances for the components of a random vector X are often presented in the form of a variance-covariance matrix V = cov(X) = (σij)nxn whose properties are further explored in chapter 1.

Moments and relationships between distributions are often obtained using a moment generating function (m.g.f.). In the 1-dimensional case, the m.g.f. of a random variable X, denoted by MX(t), is said to exist and is defined by

MX(t) = E(e^{tX}) = ∫_{-∞}^{∞} e^{tx} fX(x) dx

provided the last integral is finite for all values of t in some open interval containing 0. Note that MX(t) is a function of t and that MX(0) = 1. If MX(t) exists, then one can use it to find the kth moment of X around 0 by evaluating

μX(k) = ∂^k MX(t)/∂t^k |_{t=0}.

The m.g.f. for a function of X, say h(X), can also be found by evaluating

Mh(X)(t) = E(e^{th(X)}) = ∫_{-∞}^{∞} e^{th(x)} fX(x) dx,

provided this last integral exists for all t in an open interval containing zero. If Mh(X)(t) exists, then it can also be used to find the kth moment about 0 of h(X) by computing

μh(X)(k) = ∂^k Mh(X)(t)/∂t^k |_{t=0}.

For multivariate distributions, some similar results hold. If X = (X1,…,Xn)', the m.g.f. of X utilizes a vector of parameters t' = (t1,…,tn), is denoted by MX(t), and is said to exist provided

MX(t) = E(e^{t'X}) = E[exp(t1X1 + … + tnXn)] = ∫_{-∞}^{∞} … ∫_{-∞}^{∞} e^{t'x} fX(x) dx1…dxn

is finite for all values of t in some open neighborhood in Rn containing 0n. Using MX(t), as in the 1-dimensional case, one can find the kth moment of the ith component of X about 0 by evaluating

μXi(k) = ∂^k MX(t)/∂ti^k |_{t=0}.
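To illustrate this moment formula (an illustration added here, not from the text), the SymPy sketch below differentiates a hypothetical multivariate m.g.f., that of two independent normal components, and recovers the familiar moments.

```python
import sympy as sp

# Hypothetical m.g.f.: X1 ~ N(mu1, s1^2) independent of X2 ~ N(mu2, s2^2),
# so M_X(t1, t2) factors into the two marginal m.g.f.'s.
t1, t2, mu1, mu2 = sp.symbols('t1 t2 mu1 mu2', real=True)
s1, s2 = sp.symbols('s1 s2', positive=True)
M = sp.exp(mu1*t1 + s1**2*t1**2/2) * sp.exp(mu2*t2 + s2**2*t2**2/2)

EX1 = sp.diff(M, t1).subs({t1: 0, t2: 0})          # mu1
EX1sq = sp.diff(M, t1, 2).subs({t1: 0, t2: 0})     # mu1**2 + s1**2
EX1X2 = sp.diff(M, t1, t2).subs({t1: 0, t2: 0})    # mu1*mu2 (independence)
print(sp.simplify(EX1), sp.simplify(EX1sq), sp.simplify(EX1X2))
```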


Also as in the 1-dimensional case, we can find the m.g.f. of some scalar function of X, say h(X), by computing

Mh(X)(t) = E(e^{th(X)}) = ∫_{-∞}^{∞} … ∫_{-∞}^{∞} e^{th(x)} fX(x) dx1…dxn,

provided this last integral is finite for all values of t in an open neighborhood in Rn containing the origin. As well as yielding the moments about zero of a distribution, the m.g.f. has two other important uses which are used throughout this text. First, if two n-dimensional random vectors X and Y have m.g.f.'s that exist and are such that

MX(t) = MY(t)

for all t in some n-dimensional open rectangle of Rn containing the origin, then X and Y have exactly the same distributions. The second usage is in establishing independence of random vectors. In particular, if X = (X1,X2)' and X, X1 and X2 all have well defined m.g.f.'s, then X1 and X2 are independent if and only if

MX(t) = MX1(t1)MX2(t2)

where t = (t1,t2)' is partitioned correspondingly to X = (X1,X2)'.

Often in linear models, we are interested in linear transformations of a random vector. In particular, if X is an n-dimensional random vector with joint probability density function fX(x), A is an nxn nonsingular matrix and Y = AX + b where b is an nx1 vector of real numbers, then Y can be shown to have probability density function

fY(y) = fX[A⁻¹(y − b)](1/|A|).

If X has an m.g.f., A is an nxn matrix and b ∈ Rn, then the m.g.f. for Y = AX + b is

MY(t) = e^{t'b} MX(A't).

These latter expressions are used in chapter 1 in connection with the multivariate normal distribution.
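A quick Monte Carlo sketch (added here as an illustration, with hypothetical numbers) of the last identity: for X with independent standard normal components, MX(s) = exp(s's/2) is available in closed form, so E[exp(t'Y)] estimated from simulated Y = AX + b can be compared with e^{t'b}MX(A't).

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[1.0, 0.5],
              [0.0, 2.0]])
b = np.array([1.0, -1.0])
t = np.array([0.3, -0.2])          # a point near the origin

# Simulate Y = AX + b with X having independent standard normal components.
X = rng.standard_normal(size=(200_000, 2))
Y = X @ A.T + b

mc_estimate = np.mean(np.exp(Y @ t))                               # E[exp(t'Y)]
closed_form = np.exp(t @ b) * np.exp((A.T @ t) @ (A.T @ t) / 2)    # e^{t'b} M_X(A't)
print(mc_estimate, closed_form)    # the two values should agree closely
```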


Appendix A15-Problems for the Appendices

1. Determine whether the space spanned by {(1,-2,4,1)', (11,7,11,5)', (1,1,0,0)'} contains the vector (-1,17,-37,-6)'. Do the same for the vector (1,0,-1,3)'.

2. Determine whether the vectors {(1,0,1,1)', (2,3,1,-1)', (4,2,-2,3)', (1,5,8,8)'} form a linearly independent set.

3. Determine the dimension of the space spanned by the four vectors in problem 2.

4. In problem 2, find two different sets of basis vectors for the space spanned by the four vectors.

5. Find x so that the vector (-2,13,x)' is contained in the space spanned by {(1,4,3)', (2,1,5)'}.

6. Prove proposition A3.2.

7. If the vectors x1 and x2 span the subspace V1 and the vectors y1, y2, and y3 span the space V2, find a basis for V1 ∩ V2, where x1 = (1,1,-1,0)', x2 = (3,1,2,1)', y1 = (1,4,1,3)', y2 = (8,-1,-6,-5)' and y3 = (4,15,-2,7)'.

8. In problem 7, find a basis for V1 + V2.

9. In problem 7, find the dimension of V1 ∩ V2 and the dimension of V1 + V2.

10. Let V be the subspace of R4 spanned by the vectors x1 = (1,1,-1,2)' and x2 = (1,2,2,1)'. Find an orthonormal basis for the subspace V.

11. Find an orthonormal set of basis vectors for the orthogonal complement of the subspace V in problem 10.

12. Prove that if V1 and V2 are subspaces in Rn, then (V1 + V2)⊥ = V1⊥ ∩ V2⊥.

13. Let V be the subspace of R4 spanned by the vectors x1 = (1,-1,-1,3)', x2 = (4,1,2,0)', and x3 = (2,1,0,-1)' and let M be the subspace of R4 spanned by y = (8,1,2,7)'. Find an orthonormal basis for MV⊥.


14. Let x = (1,1,-1)' and y = (2,1,0)'. Find two vectors αx ∈ sp{x} and z ∈ sp{x}⊥ such that y = αx + z.

15. Let

A = (  1  3  -1 )
    (  3  1   5 )
    ( -1  1  -3 )

and let x = (4,20,-8)'.

(a) Determine whether x ∈ R(A).
(b) Determine the rank of A.
(c) Determine a basis for N(A).
(d) Determine the nullity of A.
(e) Find a basis for R(A)⊥.

16. Let

A = (  1   2  -1 )
    ( -1   1   0 )
    ( -1  -2  -5 )

and B = (  3   2   5 )
        ( -5   2   1 )
        (  1  -3   1 ).

(a) Determine if R(A) = R(B).
(b) Determine r(A) and r(B).
(c) Find a basis for N(A) and a basis for N(B).
(d) Find a basis for R(A)⊥ and R(B)⊥.

17. Let A be an sxt matrix. Verify the following assertions: (a) Rs = R(A) ⊕ N(A') and (b) Rt = R(A') ⊕ N(A).

18. Let

A = ( 1  -1  -1 )
    ( 2  -3  -4 )
    ( 8   3   4 ).

Find A⁻¹.

19. Let A and B be invertible matrices. Show the following: (a) (AB)⁻¹ = B⁻¹A⁻¹. (b) (A')⁻¹ = (A⁻¹)'.


20. (Seely (1989)) Suppose A and B are sxk and kxt matrices, respectively. Show that r(AB) ≥ r(A) + r(B) − k.

21. (Seely (1989)) Suppose A, B and C are matrices such that ABC is defined. Show that r(AB) + r(BC) ≤ r(B) + r(ABC) and that equality holds if and only if R(BC) ∩ N(A) = R(B) ∩ N(A).

22. Let A and B be matrices such that A + B is defined. Verify the following statements:
(a) R(A + B) ⊆ R(A) + R(B).
(b) r(A + B) ≤ r(A) + r(B).
(c) r(A + B) = r(A) + r(B) if and only if R(A + B) = R(A) ⊕ R(B).

23. Let

A = ( 1  2   4 )
    ( 1  1  -1 )
    ( 2  3  -4 )

and B = ( -1  -3   9 )
        (  3   2  11 )
        (  7   3  14 ).

(a) Show that R(B) ⊆ R(A).
(b) Find a matrix C such that B = AC.

24. Let A = ( 2  4 )
            ( 4  6 ).

(a) Find the eigenvalues of A.
(b) Find the eigenvectors associated with each eigenvalue of A.
(c) Verify that the eigenvectors of A are orthogonal.

25. Let A be as defined in problem 24. Normalize the eigenvectors of A and let P be a matrix whose columns consist of the normalized eigenvectors of A.
(a) Compute P'P and PP'.
(b) Compute P'AP to obtain the matrix D and then compute (PDP')^k for any integer k ≥ 1.


26. Let A be a symmetric matrix and let λ be an eigenvalue of A. Show that λ^k is also an eigenvalue of A^k for any integer k ≥ 1.

27. Determine whether the matrix

A = ( 4  3 )
    ( 3  4 )

is positive definite or not.

28. (a) Show that the matrix

B = (  5  -3 )
    ( -3   4 )

is positive definite.

(b) Find the eigenvalues of B and verify that they are positive.
(c) Find a matrix Q such that B = QQ'.

29. Suppose V and D are nxn positive semi-definite matrices. Verify the following assertions:
(a) If x ∈ Rn, then x'Vx = 0 if and only if x ∈ N(V).
(b) N(V + D) = N(V) ∩ N(D).
(c) R(V + D) = R(V) + R(D).

30. Let A and B be sxt matrices and suppose D is a txt positive definite matrix. Verify the following statements:
(a) If ADB' = 0s, then R(A') ∩ R(B') = 0t.
(b) If R(A') ∩ R(B') = 0t, then R(A + B) = R(A) + R(B).

31. Let A be an nxn matrix and let P be an nxn nonsingular matrix. Show that tr(A) = tr(P⁻¹AP).

32. If A and B are nxn matrices and a and b are real numbers, show that tr(aA + bB) = a tr(A) + b tr(B).

33. (Seely (1989)) Suppose Rn = M ⊕ N. Let T and Q be matrices satisfying R(T) = M and N(Q') = N, respectively. Verify the following assertions:
(a) There exists a matrix A satisfying Q'TA = Q'.
(b) If Q'TA = Q', then P = TA is the projection on M along N.


34. Let A' = ( 2  2  -1 ) and let B' = ( 1  1   1 ).
             ( 4  3  -2 )              ( 1  0  -1 )

(a) Show that R3 = R(A) ⊕ N(B').
(b) Use problem 33 to determine the projection on R(A) along N(B').

35. (Seely (1989)) Suppose D is an sxs matrix and A is an sxt matrix such that R(A'DA) = R(A'). Let G be any generalized inverse of A'DA. Verify the following statements:
(a) DAGA' is the projection on R(DA) along N(A').
(b) AGA'D is the projection on R(A) along N(A'D).

36. Assume P1 and P2 are orthogonal projections on M1 and M2. Set P = P1 + P2. Show that the following statements are equivalent:
(a) P is idempotent.
(b) P1P2 = 0.
(c) R(P1) ⊥ R(P2).
Furthermore, if any of the above statements are true, then show that P is the orthogonal projection on M = M1 + M2.

37. Let P1 and P2 be orthogonal projections on M1 and M2. Set P = P1 − P2. Show that the following statements are equivalent:
(a) P is idempotent.
(b) P1P2 = P2.
(c) R(P1)⊥ ⊆ R(P2)⊥.
(d) R(P2) ⊆ R(P1).
Furthermore, if any of the above statements are true, then show that P is the orthogonal projection on M = M1 ∩ M2⊥.


38. Let M = sp{(1,1)'} and N = sp{(1,-1)'}. Find the projection on M along N.

39. Let A' = ( 2  2  -1 ). Find the orthogonal projection on R(A).
             ( 4  3  -2 )

40. Find a generalized inverse for each of the following matrices:

A = ( 2  1  3   5 )   B = ( 2  1  5  3  2  0  1 )
    ( 4  2  6  10 )       ( 4  1  2  3  1  6  5 ).

41. Find a g-inverse of the matrix A where

A = ( 3  2   1 )
    ( 1  1   1 )
    ( 3  1  -1 ).

42. Let A be an mx2 matrix of rank 1, i.e., A = (a  ka) where k is a real number and a is an mx1 vector. Give a g-inverse for A in terms of k and a.

43. If G is a generalized inverse of a pxq matrix A, show that G + Z − GAZAG is also a generalized inverse of A for any matrix Z that is conformable in terms of its dimensions.

44. Suppose A and B are symmetric matrices such that R(A) ∩ R(B) = 0. Show that G = (A + B)⁻ is a generalized inverse of both A and B.

45. Prove proposition A13.3.

46. (a) If B is any g-inverse of A, show that BAB is also a g-inverse of A.
(b) In part (a) show that BAB has the same rank as A.
(c) Let B be as in part (a) and let C = BAB. Show that CAC = C.

47. Suppose S ⊆ Rn and that T ⊆ S. Show that TS⊥ = S ∩ T⊥.
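For readers who want to verify numerical answers to some of the computational problems above, the short NumPy sketch below is an aid added here, not part of the text. It illustrates three typical checks: span membership (problems 1 and 5), a symmetric eigendecomposition (problems 24-25), and verification that a candidate matrix is a g-inverse (problems 40-41); the Moore-Penrose inverse is used only as one example of a matrix G satisfying AGA = A.

```python
import numpy as np

# Span membership (problem 1 style): v lies in the span of the columns of S
# exactly when the least squares residual of S c = v is (numerically) zero.
S = np.column_stack([[1, -2, 4, 1], [11, 7, 11, 5], [1, 1, 0, 0]]).astype(float)
for v in ([-1, 17, -37, -6], [1, 0, -1, 3]):
    v = np.array(v, dtype=float)
    c, *_ = np.linalg.lstsq(S, v, rcond=None)
    print("in span:", np.allclose(S @ c, v))

# Eigendecomposition (problems 24-25): for a symmetric matrix, eigh returns
# real eigenvalues and orthonormal eigenvectors, so P'P = I and A = PDP'.
A = np.array([[2., 4.], [4., 6.]])
eigvals, P = np.linalg.eigh(A)
D = np.diag(eigvals)
print(np.allclose(P.T @ P, np.eye(2)), np.allclose(P @ D @ P.T, A))

# g-inverse check (problems 40-41 style): G is a g-inverse of M exactly when
# M G M = M; any answer from the partition method of appendix A12 can be
# checked the same way.
M = np.array([[3., 2., 1.], [1., 1., 1.], [3., 1., -1.]])
G = np.linalg.pinv(M)                 # one particular generalized inverse
print(np.allclose(M @ G @ M, M))
```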

 

REFERENCES

1. Aitken, A. C. (1934). On least squares and linear combinations of observations, Proceedings of the Royal Society of Edinburgh, 55, 42-48.
2. Box, G. E. P., and Draper, N. R. (1987). Empirical Model-Building and Response Surfaces, John Wiley and Sons, New York.
3. Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for Experimenters, John Wiley and Sons, New York.
4. Christensen, R. (1996). Plane Answers to Complex Questions: The Theory of Linear Models (2nd edition), Springer, New York.
5. Cochran, W. G. (1934). The distribution of quadratic forms in a normal system with applications to the analysis of covariance, Proc. Cambridge Philos. Soc., 30, 178-191.
6. Cox, D. R. (1958). The Planning of Experiments, John Wiley and Sons, New York.
7. Draper, N. R., and Smith, H. (1981). Applied Regression Analysis (2nd edition), John Wiley and Sons, New York.
8. Graybill, F. A. (1976). Theory and Application of the Linear Model, Duxbury Press, North Scituate, MA.
9. Graybill, F. A. (1969). Introduction to Matrices with Applications in Statistics, Wadsworth Publishing Company, Inc., Belmont, CA.
10. Lehmann, E. L. (1983). Theory of Point Estimation, John Wiley and Sons, New York.
11. Lehmann, E. L., and Scheffe, H. (1950). Completeness, similar regions and unbiased estimation, part 1, Sankhya, 10, 305-346.


12. Ogawa, J. (1949). On the independence of bilinear and quadratic forms of a random sample from a normal population, Ann. of the Inst. of Statist. Math., 1, 83-108.
13. Rao, C. R. (1973). Linear Statistical Inference and Its Applications (2nd edition), John Wiley and Sons, New York.
14. Rao, C. R., and Mitra, S. K. (1971). Generalized Inverse of Matrices and Its Applications, John Wiley and Sons, New York.
15. Scheffe, H. (1959). The Analysis of Variance, John Wiley and Sons, New York.
16. Searle, S. R. (1971). Linear Models, John Wiley and Sons, New York.
17. Seber, G. A. F. (1977). Linear Regression Analysis, John Wiley and Sons, New York.
18. Seely, J. (1989). Linear Models Notes, unpublished.
19. Snedecor, G. W., and Cochran, W. G. (1980). Statistical Methods (7th edition), Iowa State University Press, Ames.
20. Zyskind, G. (1967). On canonical forms, non-negative covariance matrices and best and simple least squares linear estimators in linear models, Ann. Math. Statist., 38, 1092-1109.

INDEX

affine sets, 176 Akim's razor, 93 ANOVA tables. See section 4.6 for the full model, 111 for the full model having nonhomogeneous constraints, 123 for the reduced model, 112 for the reduced model having nonhomogeneous constraints, 124 b-equivalent set of covariance matrices, 142 Cauchy-Schwarz inequality, 157 central chi-squared distribution definition of, 14 degrees of freedom for, 14 moment generating function for, 15 central F-distribution definition of, 21 degrees of freedom for, 21 probability density function for, 21 central T-distribution, 21 definition of, 21 degrees of freedom for, 21 probability density function of, 21 characterization theorem, 133 Cochran's Theorem, 20 constraints homogeneous, 27 nonhomogeneous, 27 nonpre-estimable, 85 continuous random matrix, 1 continuous random variables covariance between, 3

expectation of, 2 kth moment about zero, 180 moment generating function for, 180 variance of, 4 continuous random vector, 1 conditional distribution of, 179 covariance matrix of, 4 definition of, 178 independence of marginals, 179 joint probability density function for, 1, 178 marginal distribution of, 179 moment generating function for, 181 eigenvalues, 166 eigenvectors, 166 estimators, 33 best linear unbiased, 40, 64, 131 generalized least squares, 59 least squares, 52 linear, 34 linear unbiased, 61 linear unbiased for zero, 41 maximum likelihood, 55, 60, 100 minimum variance linear unbiased, 40 real-valued linear, 34 Euclidean distance, 157 inner product, 156 norm (length), 156 expectation, 2 induced, 35, 62 matrix, 2 vector, 2

F-statistic, 96 Gauss-Markov equations, 49 Gauss-Markov property, 47, 66, 131 generalized Gauss-Markov equations, 49 Gram-Schmidt Theorem, 158 likelihood function, 55, 60 linear combinations of random variables, 5 covariance between, 5 expectation of, 5 variance of, 5 linear combinations of vectors, 154 linear equations homogeneous, 175 nonhomogeneous, 175 linear hypothesis, 91 alternative, 91, 105 alternative test statistic for. See section 4.7 and corresponding parametric vectors, 109 definition of, 91 in models with nonhomogeneous constraints. See section 4.8 likelihood ratio test for. See section 4.4 likelihood ratio test statistic for, 101 null, 91, 105 on E(Y), 91 on the parameter vector, 106 linear transformation definition of, 159 identity transformation, 160 zero transformation, 159 log likelihood function, 55, 60 matrix, 4 column space of, 160 covariance, 4 definition of, 151 design, 26

diagonal, 152 dispersion, 4 generalized inverse of (g-inverse of), 173 idempotent, 169 identity, 151 inverse, 162 invertible, 162 nonsingular, 163 null space of, 160 nullity of, 160 of ones, 151 orthogonal, 167 orthogonal projection, 170, 174 partitioned. See appendix A9 positive definite, 167 positive semi-definite, 167 projection, 168 range space of, 160 rank of, 160 singular, 163 square, 151 symmetric, 151, 152 trace of, 168 transpose, 152, 161 Vandermonde, 84 zero, 151 maximal element, 144 mean square error, 55 model, 7 analysis of covariance, 76 full, 91 full rank, 27 initial, 91 Latin square, 89 maximal rank, 27 piecewise linear regression, 29, 74, 84 piecewise quadratic regression, 75 random one-way, 144 reduced, 91 two variance component, 7, 31

variance component, 143 with nonhomogeneous constraints. See section 2.9 multivariate normal random vector, 9 covariance matrix of, 10 definition of, 10 expectation of, 10 independence of marginals, 12 joint probability density function for, 13 marginal distributions of, 12 moment generating function for, 11 non-central chi-squared distribution, 14 definition of, 14 degrees of freedom for, 14 expectation of, 15 moment generating function for, 14 non-centrality parameter for, 14 variance of, 15 non-central F-distribution definition of, 21 degrees of freedom for, 21 non-centrality parameter for, 21 non-central T-distribution definition of, 22 degrees of freedom for, 22 non-centrality parameter for, 22 one-way additive model, 27, 51, 80, 83, 86, 106 parameter space, 26 parameter vector, 25 parameterization definition of, 25 full rank, 27 maximal rank, 27 parametric functions, 33


parametric vectors corresponding, 78 definition of, 28 estimable, 39, 63 identifiable, 36, 62 Pythagorean Theorem, 157 quadratic forms definition of, 8 expectation of, 8 in normal random vectors. See section 1.6 independence of, 18 random error vector, 32 Rao Structure, 146 regression sum of squares weighted, 111 weighted corrected, 123 residual sum of squares, 53 weighted, 60, 98, 100, 111 weighted corrected, 69, 123 residual vector, 53 standard normal random variable, 9 expectation of, 9 moment generating function for, 9 probability density function for, 9 variance of, 9 subspaces basis for, 154 definition of, 153 dimension of, 154 direct sum of, 155 disjoint, 155 intersection of, 155 orthogonal complement with respect to, 159 sum of, 155 total sum of squares, 111 weighted, 111 weighted corrected, 123 triangle inequality, 157 T-statistic, 71

two-way additive model, 36, 88 with constraints, 29 without constraints, 28 vector sets addition of, 155 difference of, 155 linearly dependent, 154 linearly independent, 154 orthogonal, 157, 158

orthogonal complement of, 158 orthonormal, 157 span of, 154 vector spaces definition of, 153 Euclidean, 153 vectors orthogonal, 157 Zyskind's Theorem, 138