
BAYESIAN ANALYSIS OF LINEAR MODELS

STATISTICS: Textbooks and Monographs
A SERIES EDITED BY D. B. OWEN, Coordinating Editor
Department of Statistics, Southern Methodist University, Dallas, Texas

Vol. 1: The Generalized Jackknife Statistic, H. L. Gray and W. R. Schucany
Vol. 2: Multivariate Analysis, Anant M. Kshirsagar
Vol. 3: Statistics and Society, Walter T. Federer
Vol. 4: Multivariate Analysis: A Selected and Abstracted Bibliography, 1957-1972, Kocherlakota Subrahmaniam and Kathleen Subrahmaniam (out of print)
Vol. 5: Design of Experiments: A Realistic Approach, Virgil L. Anderson and Robert A. McLean
Vol. 6: Statistical and Mathematical Aspects of Pollution Problems, John W. Pratt
Vol. 7: Introduction to Probability and Statistics (in two parts), Part I: Probability; Part II: Statistics, Narayan C. Giri
Vol. 8: Statistical Theory of the Analysis of Experimental Designs, J. Ogawa
Vol. 9: Statistical Techniques in Simulation (in two parts), Jack P. C. Kleijnen
Vol. 10: Data Quality Control and Editing, Joseph I. Naus (out of print)
Vol. 11: Cost of Living Index Numbers: Practice, Precision, and Theory, Kali S. Banerjee
Vol. 12: Weighing Designs: For Chemistry, Medicine, Economics, Operations Research, Statistics, Kali S. Banerjee
Vol. 13: The Search for Oil: Some Statistical Methods and Techniques, edited by D. B. Owen
Vol. 14: Sample Size Choice: Charts for Experiments with Linear Models, Robert E. Odeh and Martin Fox
Vol. 15: Statistical Methods for Engineers and Scientists, Robert M. Bethea, Benjamin S. Duran, and Thomas L. Boullion
Vol. 16: Statistical Quality Control Methods, Irving W. Burr
Vol. 17: On the History of Statistics and Probability, edited by D. B. Owen
Vol. 18: Econometrics, Peter Schmidt
Vol. 19: Sufficient Statistics: Selected Contributions, Vasant S. Huzurbazar (edited by Anant M. Kshirsagar)
Vol. 20: Handbook of Statistical Distributions, Jagdish K. Patel, C. H. Kapadia, and D. B. Owen
Vol. 21: Case Studies in Sample Design, A. C. Rosander
Vol. 22: Pocket Book of Statistical Tables, compiled by R. E. Odeh, D. B. Owen, Z. W. Birnbaum, and L. Fisher
Vol. 23: The Information in Contingency Tables, D. V. Gokhale and Solomon Kullback
Vol. 24: Statistical Analysis of Reliability and Life-Testing Models: Theory and Methods, Lee J. Bain
Vol. 25: Elementary Statistical Quality Control, Irving W. Burr
Vol. 26: An Introduction to Probability and Statistics Using BASIC, Richard A. Groeneveld
Vol. 27: Basic Applied Statistics, B. L. Raktoe and J. J. Hubert
Vol. 28: A Primer in Probability, Kathleen Subrahmaniam
Vol. 29: Random Processes: A First Look, R. Syski
Vol. 30: Regression Methods: A Tool for Data Analysis, Rudolf J. Freund and Paul D. Minton
Vol. 31: Randomization Tests, Eugene S. Edgington
Vol. 32: Tables for Normal Tolerance Limits, Sampling Plans, and Screening, Robert E. Odeh and D. B. Owen
Vol. 33: Statistical Computing, William J. Kennedy, Jr. and James E. Gentle
Vol. 34: Regression Analysis and Its Application: A Data-Oriented Approach, Richard F. Gunst and Robert L. Mason
Vol. 35: Scientific Strategies to Save Your Life, I. D. J. Bross
Vol. 36: Statistics in the Pharmaceutical Industry, edited by C. Ralph Buncher and Jia-Yeong Tsay
Vol. 37: Sampling from a Finite Population, J. Hajek
Vol. 38: Statistical Modeling Techniques, S. S. Shapiro
Vol. 39: Statistical Theory and Inference in Research, T. A. Bancroft and C.-P. Han
Vol. 40: Handbook of the Normal Distribution, Jagdish K. Patel and Campbell B. Read
Vol. 41: Recent Advances in Regression Methods, Hrishikesh D. Vinod and Aman Ullah
Vol. 42: Acceptance Sampling in Quality Control, Edward G. Schilling
Vol. 43: The Randomized Clinical Trial and Therapeutic Decisions, edited by Niels Tygstrup, John M. Lachin, and Erik Juhl
Vol. 44: Regression Analysis of Survival Data in Cancer Chemotherapy, Walter H. Carter, Jr., Galen L. Wampler, and Donald M. Stablein
Vol. 45: A Course in Linear Models, Anant M. Kshirsagar
Vol. 46: Clinical Trials: Issues and Approaches, edited by Stanley H. Shapiro and Thomas H. Louis
Vol. 47: Statistical Analysis of DNA Sequence Data, edited by B. S. Weir
Vol. 48: Nonlinear Regression Modeling: A Unified Practical Approach, David A. Ratkowsky
Vol. 49: Attribute Sampling Plans, Tables of Tests and Confidence Limits for Proportions, Robert E. Odeh and D. B. Owen
Vol. 50: Experimental Design, Statistical Models, and Genetic Statistics, edited by Klaus Hinkelmann
Vol. 51: Statistical Methods for Cancer Studies, edited by Richard G. Cornell
Vol. 52: Practical Statistical Sampling for Auditors, Arthur J. Wilburn
Vol. 53: Statistical Signal Processing, edited by Edward J. Wegman and James G. Smith
Vol. 54: Self-Organizing Methods in Modeling: GMDH Type Algorithms, edited by Stanley J. Farlow
Vol. 55: Applied Factorial and Fractional Designs, Robert A. McLean and Virgil L. Anderson
Vol. 56: Design of Experiments: Ranking and Selection, edited by Thomas J. Santner and Ajit C. Tamhane
Vol. 57: Statistical Methods for Engineers and Scientists, Second Edition, Revised and Expanded, Robert M. Bethea, Benjamin S. Duran, and Thomas L. Boullion
Vol. 58: Ensemble Modeling: Inference from Small-Scale Properties to Large-Scale Systems, Alan E. Gelfand and Crayton C. Walker
Vol. 59: Computer Modeling for Business and Industry, Bruce L. Bowerman and Richard T. O'Connell
Vol. 60: Bayesian Analysis of Linear Models, Lyle D. Broemeling

OTHER VOLUMES IN PREPARATION

AN IMPORTANT MESSAGE TO READERS . . .

A Marcel Dekker, Inc. Facsimile Edition contains the exact contents of an original hard cover MDI published work but in a new soft sturdy cover. Reprinting scholarly works in an economical format assures readers that important information they need remains accessible. Facsimile Editions provide a viable alternative to books that could go "out of print." Utilizing a contemporary printing process for Facsimile Editions, scientific and technical books are reproduced in limited quantities to meet demand. Marcel Dekker, Inc. is pleased to offer this specialized service to its readers in the academic and scientific communities.

BAYESIAN ANALYSIS OF LINEAR MODELS

Lyle D. Broemeling
Oklahoma State University
Stillwater, Oklahoma

MARCEL DEKKER, INC.

New York and Basel

Library of Congress Cataloging in Publication Data

Broemeling, Lyle D., [date]
Bayesian analysis of linear models.
(Statistics, textbooks and monographs; v. 60)
Includes bibliographical references and index.
1. Linear models (Statistics) 2. Bayesian statistical decision theory. I. Title. II. Series.
QA276.B695 1985   519.5   84-19902
ISBN: 0-8247-8582-7

COPYRIGHT © 1985 by MARCEL DEKKER, INC.

ALL RIGHTS RESERVED

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher.

MARCEL DEKKER, INC.
270 Madison Avenue, New York, New York 10016

Current printing (last digit): 10 9 8 7 6 5 4 3 2

PRINTED IN THE UNITED STATES OF AMERICA

PREFACE

This book presents the basic theory of linear models from a Bayesian viewpoint. A course in linear model theory is usually required for graduate degrees in statistics and is the foundation for many other courses in the curriculum. In such a course the student is exposed to regression models, models for designed experiments, including fixed, mixed and random models, and perhaps some econometric and time series models. The approach taken here is to introduce the reader to a wide variety of linear models, not only those most frequently used by statisticians but those often used by engineers and economists. This book is unique in that time series models such as autoregressive moving average processes are treated as linear models in the same way the general linear model is examined. Discrete-time linear dynamic systems are another class of linear models which are not often studied by statisticians; however, these systems are quite popular with communication and control engineers and are easily analyzed with Bayesian procedures.

Why the Bayesian approach? Because one may solve the inferential problems in the analysis of linear models with one tool, namely Bayes theorem. Instead of learning the large number of sampling theory techniques of estimation, hypothesis testing, and forecasting, one need only remember how to apply Bayes theorem.

Bayesian statistics is now rapidly becoming accepted as a way to solve applied statistical problems and has several special features which combine to make it appealing for solving applied problems: First, it can be applied to a wide range of statistical and other problems; second, it allows the formal use of prior information, and thereby it offers a well-defined and straightforward way of analyzing any problem. Also,


it assists the understanding and formulation of applied problems in statistical terms, and, finally, it faces squarely the problem of optimal action in the face of uncertainty.

This book is intended for second-year statistics graduate students who have completed introductory courses in probability and mathematical statistics, regression analysis, and the design of experiments. Graduate students of econometrics and communication engineering should also benefit, and the book should serve well as a reference for statisticians, econometricians, and engineers.

Many people have helped me to write this book, and I am indebted to former students Samir Shaarawy, Joaquin Diaz, Peyton Cook, David Moen, Diego Salazar, Muthyia Rajagopalan, Mohamed Al Mahmeed, Mohamed Gharraf, Margaret Land, Juanita Chin Choy, Albert Chi, Yusoff Abdullah, and Don Holbert. Secretarial and clerical assistance was provided by Leroy Folks of Oklahoma State University and William B. Smith of Texas A & M University. Some of the topics in this book are a result of the research sponsored by the Office of Naval Research, Contract No. N000-82-K0292.

Lyle D. Broemeling

CONTENTS

Preface

Introduction

1 BAYESIAN INFERENCE FOR THE GENERAL LINEAR MODEL
   Introduction
   The Parametric Inference Problem
   Bayes Theorem
   Prior Information
   Posterior Analysis
   An Example
   Inferences for the General Linear Model
   Predictive Analysis
   Summary and Guide to the Literature
   Exercises
   References

2 LINEAR STATISTICAL MODELS AND BAYESIAN INFERENCE
   Introduction
   Linear Statistical Models
   Bayesian Statistical Inference
   Summary
   Exercises
   References

3 THE TRADITIONAL LINEAR MODELS
   Introduction
   Prior Information
   Normal Populations
   Linear Regression Models
   Multiple Linear Regression
   Nonlinear Regression Analysis
   Some Models for Designed Experiments
   The Analysis of Two-Factor Experiments
   The Analysis of Covariance
   Comments and Conclusions
   Exercises
   References

4 THE MIXED MODEL
   Introduction
   The Prior Analysis
   The Posterior Analysis
   Approximations
   Approximation to the Posterior Distribution of
   Approximation to the Posterior Distribution of the Variance Components
   Posterior Inferences
   Summary and Conclusions
   Exercises
   References

5 TIME SERIES MODELS
   Introduction
   Autoregressive Processes
   Moving Average Models
   Autoregressive Moving Average Models
   The Identification Problem
   Other Time Series Models
   A Numerical Study of Autoregressive Processes Which Are Almost Stationary
   Exercises
   References

6 LINEAR DYNAMIC SYSTEMS
   Introduction
   Discrete Time Linear Dynamic Systems
   Estimation
   Control Strategies
   An Example
   Nonlinear Dynamic Systems
   Adaptive Estimation
   An Example of Adaptive Estimation
   Summary
   Exercises
   References

7 STRUCTURAL CHANGE IN LINEAR MODELS
   Introduction
   Shifting Normal Sequences
   Structural Change in Linear Models
   Detection of Structural Change
   Structural Stability in Other Models
   Summary and Comments
   Exercises
   References

8 MULTIVARIATE LINEAR MODELS
   Introduction
   Multivariate Regression Models
   Multivariate Design Models
   The Vector Autoregressive Process
   Other Time Series Models
   Multivariate Models with Structural Change
   Comments and Conclusions
   Exercises
   References

9 LOOKING AHEAD
   Introduction
   A Review of the Bayesian Analysis of Linear Models
   Future Research
   Conclusions
   References

APPENDIX
   Introduction
   The Univariate Normal
   The Gamma
   The Normal-Gamma
   The Univariate t Distribution
   The Multivariate Normal
   The Wishart Distribution
   The Normal-Wishart
   The Multivariate t Distribution
   The Poly-t Distribution
   The Matrix t Distribution
   Comments
   References

Index

INTRODUCTION

There are many who do not accept Bayesian techniques because of the controversy surrounding the use of prior distributions and the rejection, by Bayesian doctrine, that the sampling distribution of statistics is relevant for making inferences about model parameters. See Tiao and Box (1973) for an account of this aspect of the controversy. They essentially say the sampling properties of Bayes estimators are irrelevant, since they refer to hypothetical repetitions of an experiment which in fact will not occur. Lindley (1971) rejects many sampling-theory techniques because they violate the likelihood principle and thus violate the axioms of utility and probability. For more about the advantages and disadvantages of Bayesian inference and other inferential theories, Barnett (1973) gives an unbiased view. Chapter 2 also discusses some of the more controversial topics of Bayesian inference.

Bayes theorem and its use in inference and prediction are introduced with a study of the general linear model. The general linear model is the first model to be encountered in the first linear model course, and from a non-Bayesian viewpoint, Graybill (1961) and Searle (1968) present the sampling theory viewpoint. The treatment given here is quite different, since one theorem on the posterior analysis provides one with a way to make all inferences (estimation and hypothesis testing) about the parameters of the model. The techniques of the prior, posterior, and predictive analyses presented in Chapter 1 will be repeated as each class of models is analyzed.

Chapter 2 gives a brief history of Bayesian inference and explains the subjective interpretation of probability and the implementation of prior knowledge. It is argued that the subjective interpretation of the probability distribution of a parameter is no more subjective than the frequency interpretation of the distribution of an observable random variable. Prior information about the parameters of a model is implemented by fitting the hyperparameters of a proper density either to past data or to prior values of the observations. This chapter also gives a brief discussion of how each of the various models is used in other areas of interest.

The general linear model includes as special cases the regression models and the models of designed experiments, and since they are so important, Chapter 3 gives a detailed account of how a Bayesian would use them.

The mixed linear model is studied in Chapter 4, in which the marginal posterior distribution of the variance components is derived. Using a normal approximation to the posterior distribution of the random effects, one is able to devise Bayes estimators of the primary parameters of the model.

The subject of time series analysis is often given separate treatment; however, with the Bayesian approach to statistical inference, one is able to study ARMA (autoregressive moving average) processes in much the same way one studies the general linear model. First autoregressive processes are considered, in which one is able to find the posterior distribution of the parameters and the predictive distribution of future observations. The marginal posterior distribution of the autoregression coefficients is a multivariate t, and the error precision has a gamma posterior distribution. The predictive distribution of future observations is shown to be a t distribution. The moving average class of models is more difficult to analyze, although the exact analysis of the MA(1) and ARMA(1,1) processes is reported. Other time series models which are studied are the regression model with autocorrelated errors and the so-called lagged variable model in econometrics. This chapter shows that the Bayesian approach to the study of time series models is in its infancy but that the approach will produce some interesting results.

Very few statisticians learn about the linear dynamic model and its use by communication engineers with monitoring and navigation problems. The model is called dynamic because the parameters of the model are always changing, from one observation to the next; thus it is natural to study such processes from a Bayesian approach. There are several important problems in connection with a dynamic linear model. The first is estimation of the parameters of the system, and estimation consists of filtering, smoothing, and prediction. Filtering is estimating the current state (parameter) of the system, while smoothing is estimating the past states of the system. Prediction, unlike time series analysis, is estimating future states of the system. Chapter 6 develops the Bayesian way of estimating the parameters by finding the posterior distribution of the states of the system, and the Kalman filter is shown to be the mean of the posterior distribution of the current state. Chapter 6 also treats other aspects of the linear dynamic model, including adaptive estimation and control.


Chapter 7 is a complete study of structural change in linear models, in which structural change means some of the parameters of the model will change during the time the observations will be taken. In this type of system the parameters change only occasionally, whereas in linear dynamic systems they change with each additional observation. Thus models with structural change stand between the static class, such as the traditional linear models (regression models, for example), and linear dynamic models. There are two ways to model structural change: One is with a shift point, and the other is with a transition function. Both are used in Chapter 7, with regression models and time series processes.

Most of the material in Chapter 8 on multivariate processes is given by Box and Tiao (1973) and Zellner (1971). Multivariate linear regression models, models for designed experiments, and multivariate autoregressive processes are introduced. The only new material is a section on structural change of a multivariate regression model. Many examples illustrate the prior, predictive, and posterior analyses, which are the three main ingredients of a Bayesian inference.

The necessary statistical background is given in the Appendix, in which the following distributions are defined: the univariate normal, the gamma, the normal-gamma, the multivariate normal, the Wishart, the normal-Wishart, the multivariate t, the poly-t, and the matrix t. If one learns the material in the Appendix, one will be able to understand the way a Bayesian makes statistical inferences about linear models.

The books by Box and Tiao (1973) and Zellner (1971) have given us the foundation for the Bayesian analysis of linear models; thus this book can best be thought of, first, as a review of the material in those books and, second, as a presentation of new material. This Preface and Chapters 1, 2, and 3 are mostly a review of the Bayesian techniques of estimation, hypothesis testing, and forecasting as applied to the traditional populations, while the remaining chapters present material which is either new or an extension of existing work.

REFERENCES

Barnett, V. (1973). Comparative Statistical Inference. John Wiley and Sons, Inc., New York.
Box, G. E. P. and G. C. Tiao (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading, Mass.
Graybill, F. A. (1961). An Introduction to Linear Statistical Models. McGraw-Hill, New York.
Lindley, D. V. (1971). Bayesian Statistics, A Review, Reg. Conf. Ser. Applied Math., 2. SIAM, Philadelphia.


Searle, S. R. (1968). Linear Models. John Wiley and Sons, Inc., New York.
Tiao, G. C. and G. E. P. Box (1973). "Some comments on Bayes estimators," The American Statistician, Vol. 27, No. 1, pp. 12-14.
Zellner, Arnold (1971). An Introduction to Bayesian Inference in Econometrics. John Wiley and Sons, Inc., New York.


1 BAYESIAN INFERENCE FOR THE GENERAL LINEAR MODEL

INTRODUCTION

Many books begin with the general linear model, and this is no exception. The general linear model includes many useful and interesting special cases. For example, independent normal sequences of random variables, simple and multiple regression models, models for designed experiments, and analysis of covariance models are all special cases of the general linear model, which will be explained in more detail in the following chapters. This chapter introduces the prior, posterior, and predictive analysis of the linear model and thereby lays the foundation for the remainder of the book, where the analyses are reported with other models.

THE PARAMETRIC INFERENCE PROBLEM

Let θ be a p × 1 vector of real parameters, Y = (y_1, y_2, ..., y_n)' an n × 1 vector of observations, and X an n × p known design matrix. Then the general linear model is

    Y = Xθ + e,                                                          (1.1)

where e ~ N(0, τ⁻¹I_n), τI_n is the precision matrix of e, which has covariance matrix σ²I_n, and σ⁻² = τ > 0 is unknown.

This is the general linear model, and our objective is to provide inferences for θ and τ when observing s = (y_1, y_2, ..., y_n), where y_i is the i-th observation. The word inference is somewhat vague, but it usually implies a procedure which extracts information about θ from the sample s. Introductory statistics books explain inferential techniques in terms of point and interval estimation, tests of hypotheses, and forecasts.

For the Bayesian, all inferences are based on the posterior distribution of θ, which is given by Bayes theorem.

BAYES THEOREM

Suppose one's prior information about θ is represented by a probability density function ξ(θ, τ), θ ∈ R^p, τ > 0; then Bayes theorem combines this information with the information contained in the sample. The likelihood function for θ and τ is

    L(θ, τ | s) ∝ τ^{n/2} exp{-(τ/2)(Y - Xθ)'(Y - Xθ)},                   (1.2)

where θ ∈ R^p and τ > 0, and where ∝ means proportional to (as a function of τ and θ). The likelihood function is one's sample information about the parameters and is the conditional density function of the sample random variables given θ and τ. Bayes theorem gives the conditional density of θ and τ given s,

    ξ(θ, τ | s) ∝ L(θ, τ | s) ξ(θ, τ),   θ ∈ R^p,   τ > 0.                (1.3)

The posterior density of θ and τ is ξ(θ, τ | s) and represents one's knowledge of θ and τ after observing the sample s. On the other hand, our information about θ and τ before s is observed is contained in the prior density. Note the posterior density (1.3) is written with a proportionality symbol, and ξ is used to denote both the prior and posterior densities. If one uses an equality sign, the posterior density is

    ξ(θ, τ | s) = K · L(θ, τ | s) ξ(θ, τ),   θ ∈ R^p,   τ > 0,            (1.4)

where K is the normalizing constant and is given by

    K⁻¹ = ∫_{R^p} ∫_{0}^{∞} L(θ, τ | s) ξ(θ, τ) dτ dθ,

which is the marginal probability density of Y. Obviously it is easier to omit the normalizing constant and use the proportionality symbol, and this convention will be adhered to in this book.

Before continuing the analysis, one must find the posterior density of θ and τ.
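As a concrete illustration of working with the unnormalized posterior, the following short Python sketch (not from the text; the data, grid, and hyperparameter values are assumed for the demonstration) evaluates L(θ, τ | s)ξ(θ, τ) on a grid for the scalar special case y_i = θ + e_i and then normalizes it numerically, which plays the role of the constant K in (1.4).

    import numpy as np

    # Unnormalized posterior xi(theta, tau | s) on a grid for y_i = theta + e_i.
    rng = np.random.default_rng(0)
    y = rng.normal(loc=2.0, scale=1.0, size=20)       # simulated sample s
    n = y.size

    theta = np.linspace(0.0, 4.0, 201)                # grid for theta
    tau = np.linspace(0.05, 3.0, 200)                 # grid for tau (the precision)
    T, Th = np.meshgrid(tau, theta)                   # shape (len(theta), len(tau))

    # log-likelihood (1.2) plus a normal-gamma log prior (1.5)-(1.7); the
    # hyperparameters alpha, beta, mu, P below are hypothetical choices.
    alpha, beta, mu, P = 2.0, 2.0, 0.0, 0.1
    log_like = 0.5 * n * np.log(T) - 0.5 * T * ((y[:, None, None] - Th) ** 2).sum(axis=0)
    log_prior = (alpha - 1 + 0.5) * np.log(T) - T * beta - 0.5 * T * P * (Th - mu) ** 2
    log_post = log_like + log_prior

    post = np.exp(log_post - log_post.max())          # unnormalized posterior
    post /= post.sum() * (theta[1] - theta[0]) * (tau[1] - tau[0])   # numerical K
    marg_theta = post.sum(axis=1) * (tau[1] - tau[0])                # marginal of theta
    print("posterior mean of theta ~", (marg_theta * theta).sum() * (theta[1] - theta[0]))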

PRIOR INFORMATION

The prior information about the parameters θ and τ is given in two ways. The first is when ξ(θ, τ) is a normal-gamma prior density, namely,

    ξ(θ, τ) = ξ_1(θ | τ) ξ_2(τ),   θ ∈ R^p,   τ > 0,                      (1.5)

where

    ξ_1(θ | τ) ∝ τ^{p/2} exp{-(τ/2)(θ - μ)'P(θ - μ)},   θ ∈ R^p,          (1.6)

and μ is a p × 1 given vector and P is a known p × p positive definite matrix. Thus ξ_1 is the conditional density of θ given τ and is normal with mean vector μ and precision matrix τP. The marginal prior density of τ is gamma with parameters α > 0 and β > 0,

    ξ_2(τ) ∝ τ^{α-1} exp(-τβ),   τ > 0.                                   (1.7)

What does this imply about the marginal prior density of θ? Since (1.5) is the joint prior density of θ and τ, the marginal density of θ is

    ξ_1(θ) ∝ ∫_{0}^{∞} ξ(θ, τ) dτ
           ∝ ∫_{0}^{∞} τ^{(p+2α)/2 - 1} exp{-(τ/2)[2β + (θ - μ)'P(θ - μ)]} dτ
           ∝ [2β + (θ - μ)'P(θ - μ)]^{-(p+2α)/2},   θ ∈ R^p,              (1.8)


which is a t density with 2α degrees of freedom, location vector μ, and precision matrix (2α)(2β)⁻¹P.

By using the normal-gamma density as a prior for the parameters, one cannot stipulate one's prior information about θ separately from that of τ. The parameters of the marginal distribution of θ involve α and β, which are parameters of the prior distribution of τ, but the marginal prior density of τ does not involve parameters of the marginal of θ. The parameter μ is one's prior mean for θ, while one's opinion of the correlation between the components of θ is given by P⁻¹(2β)(2α - 2)⁻¹, which is the marginal prior dispersion matrix of θ. Since this involves α and β, the marginal prior information about τ depends on the choice of the dispersion or precision matrix of θ.

With regard to the information about τ, it is convenient to think of τ as the inverse of the residual variance σ² and that

    E(τ⁻¹) = β(α - 1)⁻¹,   α > 1,                                         (1.9)

or that

    var(τ⁻¹) = β²(α - 1)⁻²(α - 2)⁻¹,   α > 2.                             (1.10)

These two equations together with

    E(θ) = μ                                                              (1.11)

and

    D(θ) = P⁻¹(2β)(2α - 2)⁻¹,                                             (1.12)

which are the mean vector and dispersion matrix of θ, will assist one in choosing the four hyperparameters for the prior distribution of θ and τ.

As will be seen, the normal-gamma prior density is a member of a conjugate class of distributions, that is, the posterior density ξ(θ, τ | s) of (1.3) is also a normal-gamma density. Conjugate families have the advantage that one has a scale by which to judge the amount of information added by the sample, beyond the amount given a priori.

Early writers such as Jeffreys (1959) and Lindley (1965) use improper densities for prior information. For example,


    ξ(θ, τ) ∝ 1/τ,   θ ∈ R^p,   τ > 0,                                    (1.13)

is called a vague non-informative prior density for θ and was developed by Jeffreys, because it satisfies certain rules of invariance and conveys "very little" information about the parameters. This density, although improper, produces a normal-gamma posterior density for θ and τ, which of course is proper. Thus, improper prior densities may result in proper posterior densities, but this does not hold in general. The Jeffreys prior implies that, a priori, θ and τ are independent, that θ has a constant density over R^p, and that the marginal prior density of τ is ξ_2(τ) ∝ 1/τ, τ > 0. Both forms (1.5) and (1.13) of prior information will be employed in the posterior analysis which follows.
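The moment relations (1.9)-(1.12) can be turned into a simple recipe for picking the hyperparameters. The following sketch (the prior guesses and the two-component θ are assumptions for illustration, not from the text) backs out α and β from a prior mean and variance for the residual variance, and P from a desired prior dispersion matrix for θ.

    import numpy as np

    # Prior guesses (hypothetical): the residual variance tau^-1 has prior
    # mean 4.0 and prior variance 8.0; invert (1.9) and (1.10) for alpha, beta.
    m_var, v_var = 4.0, 8.0
    alpha = m_var ** 2 / v_var + 2.0      # from var(tau^-1) = E(tau^-1)^2 / (alpha - 2)
    beta = m_var * (alpha - 1.0)          # from E(tau^-1) = beta / (alpha - 1)

    # Prior guesses for theta (p = 2): mean vector and desired dispersion matrix.
    mu = np.array([1.0, 0.5])
    D = np.diag([2.0, 2.0])
    # From (1.12), D(theta) = P^-1 (2 beta)(2 alpha - 2)^-1, so
    P = np.linalg.inv(D) * (2.0 * beta) / (2.0 * alpha - 2.0)

    print("alpha =", alpha, " beta =", beta)
    print("P =\n", P)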

POSTERIOR ANALYSIS

Using Bayes theorem given by (1.3), then using the normal-gamma prior density (1.6), the posterior density of θ and τ is

    ξ(θ, τ | s) ∝ τ^{(n+2α+p)/2 - 1} exp{-(τ/2)[2β + (θ - μ)'P(θ - μ) + (Y - Xθ)'(Y - Xθ)]}     (1.14)

for θ ∈ R^p and τ > 0. Now, completing the square on θ gives

    ξ(θ, τ | s) ∝ τ^{(n+2α)/2 - 1} exp{-(τ/2)[2β + Y'Y - (X'Y + Pμ)'(X'X + P)⁻¹(X'Y + Pμ)]}
                   × τ^{p/2} exp{-(τ/2)[θ - (X'X + P)⁻¹(X'Y + Pμ)]'(X'X + P)[θ - (X'X + P)⁻¹(X'Y + Pμ)]},

which is a normal-gamma density, hence the marginal posterior density of τ is gamma with parameters (n + 2α)/2 and

    β + [Y'Y - (X'Y + Pμ)'(X'X + P)⁻¹(X'Y + Pμ)]/2.

The marginal posterior density of θ is found by integrating (1.14) with respect to τ and yields

    ξ_1(θ | s) ∝ {2β + Y'Y - (X'Y + Pμ)'(X'X + P)⁻¹(X'Y + Pμ)
                  + [θ - (X'X + P)⁻¹(X'Y + Pμ)]'(X'X + P)[θ - (X'X + P)⁻¹(X'Y + Pμ)]}^{-(n+2α+p)/2},   θ ∈ R^p,   (1.15)

which is a p-dimensional t density with n + 2α degrees of freedom, location vector

    μ* = (X'X + P)⁻¹(X'Y + Pμ)                                            (1.16)

and precision matrix

    D*(θ | s) = (X'X + P)(n + 2α)[2β + Y'Y - (X'Y + Pμ)'(X'X + P)⁻¹(X'Y + Pμ)]⁻¹.     (1.17)

The reader is referred to the Appendix for properties of the normal-gamma, multivariate t, and gamma distributions. The marginal posterior moments of τ⁻¹ are

    E(τ⁻¹ | s) = [2β + Y'Y - (X'Y + Pμ)'(X'X + P)⁻¹(X'Y + Pμ)](n + 2α - 2)⁻¹          (1.18)

and

    Var(τ⁻¹ | s) = [E(τ⁻¹ | s)]² 2(n + 2α - 4)⁻¹.                          (1.19)

The posterior analysis of the general linear model reveals the joint distribution of θ and τ is a normal-gamma distribution, the marginal distribution of θ is a multivariate t, and the marginal of τ a gamma, if the prior of the parameters is a normal-gamma. What is the posterior analysis if the improper density ξ(θ, τ) ∝ 1/τ, θ ∈ R^p, τ > 0, is used as prior information? The following theorem should be verified.

Theorem 1.1. If θ and τ are the parameters of the general linear model (1.1), where X'X is of full rank, and if the prior density of θ and τ is Jeffreys' improper density

    ξ(θ, τ) ∝ 1/τ,   θ ∈ R^p,   τ > 0,                                    (1.20)

then the posterior density of θ and τ is normal-gamma, where the marginal posterior density of τ is gamma with parameters (n - p)/2 and [Y'Y - Y'X(X'X)⁻¹X'Y]/2, and the conditional posterior density of θ given τ is normal with mean vector (X'X)⁻¹X'Y and precision matrix τX'X. Also, the marginal posterior density of θ is a p-dimensional t distribution with n - p degrees of freedom, location vector

    E(θ | s) = (X'X)⁻¹X'Y                                                 (1.21)

and precision matrix

    D*(θ | s) = (n - p)(X'X)[Y'Y - Y'X(X'X)⁻¹X'Y]⁻¹.                      (1.22)

These results can be obtained by assuming the prior distribution of θ and τ is a normal-gamma with hyperparameters α, β, P, and μ, and then letting β → 0, α → -p/2, and P → 0 (p × p) in the joint posterior distribution of θ and τ. Note, this is not equivalent to letting β → 0, α → -p/2, and P → 0 (p × p) in the joint prior distribution of θ and τ, because the prior density would not be Jeffreys' improper prior (1.20). If one uses the normal-gamma density as a prior for θ and τ, X'X may be singular, but if one uses Jeffreys' improper prior, X'X must be nonsingular; otherwise the posterior density of θ and τ is improper.
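A short numerical sketch may help fix ideas; the data, prior hyperparameters, and the use of Python/NumPy below are assumptions for illustration, not part of the text. It computes the posterior location and dispersion of θ under the normal-gamma prior, equations (1.16)-(1.18), and the least squares location of Theorem 1.1 under Jeffreys' prior.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 30, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    theta_true = np.array([1.0, 2.0, -0.5])
    Y = X @ theta_true + rng.normal(scale=0.5, size=n)

    # Normal-gamma prior hyperparameters (hypothetical choices).
    alpha, beta = 2.0, 1.0
    mu = np.zeros(p)
    P = 0.01 * np.eye(p)

    A = X.T @ X + P
    b = X.T @ Y + P @ mu
    mu_star = np.linalg.solve(A, b)                           # (1.16)
    resid_term = Y @ Y - b @ mu_star                          # Y'Y - b'A^-1 b
    E_var = (2 * beta + resid_term) / (n + 2 * alpha - 2)     # (1.18): posterior mean of tau^-1
    D_star = A * (n + 2 * alpha) / (2 * beta + resid_term)    # (1.17): precision of the posterior t

    # Jeffreys' prior (Theorem 1.1): least squares location and s^2.
    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    ss = Y @ Y - Y @ X @ theta_hat                            # Y'Y - Y'X(X'X)^-1 X'Y
    print("posterior location (conjugate):", mu_star)
    print("posterior location (Jeffreys) :", theta_hat)
    print("E(tau^-1 | s) =", E_var, "   s^2 =", ss / (n - p))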

AN EXAMPLE

Consider a special case of the general linear model, where θ is a scalar (p = 1) and the design matrix is an n × 1 column vector of ones; then

    y_i = θ + e_i,   i = 1, 2, ..., n,                                    (1.23)

where the e_i are n.i.d. (0, τ⁻¹), τ > 0, θ ∈ R.

The y_i, i = 1, 2, ..., n, represent a random sample of size n from a normal population with mean θ and precision τ. Suppose the prior information about θ and τ is a normal-gamma with parameters α, β, P, and μ, where α, β, and P are positive scalars and μ is any real number; then the marginal posterior density of θ is a univariate t with n + 2α degrees of freedom, mean (n + P)⁻¹(ΣY_i + Pμ), and precision

    (n + P)(n + 2α)[2β + ΣY_i² - (ΣY_i + Pμ)²(n + P)⁻¹]⁻¹.

The marginal posterior density of τ is gamma, and the posterior moments of τ⁻¹ are

    E(τ⁻¹ | s) = [2β + ΣY_i² - (ΣY_i + Pμ)²/(n + P)](n + 2α - 2)⁻¹

and

    Var(τ⁻¹ | s) = [E(τ⁻¹ | s)]² 2(n + 2α - 4)⁻¹.

Using the improper Jeffreys prior density, the marginal posterior density of θ is a t with n - 1 degrees of freedom, mean Ȳ, and precision n(n - 1)/Σ(Y_i - Ȳ)², and the marginal posterior distribution of τ is gamma with parameters (n - 1)/2 and Σ(Y_i - Ȳ)²/2. These results are obvious from Theorem 1.1.

INFERENCES FOR THE GENERAL LINEAR MODEL

From the Bayesian viewpoint, all inferences are based on the joint posterior distribution of θ and τ, but what is meant by inference? Within the Bayesian approach, there are two ways to study the parameters. The first is called inference and consists of plotting the posterior distribution of the parameters and computing certain characteristics, such as the mean, variance, mode, and median of the distribution. Usually, one wants to estimate the parameters with point and interval estimates or test hypotheses about the parameters. These methods are to be explained and will be illustrated in the book. The second way to study the parameters of the model is called the Bayesian decision theory approach, which utilizes a loss function L(d, θ), which measures the loss when d is the decision and θ is the value of the parameter. The loss d* which minimizes the average loss E L(d, θ) . . .

. . . F(1 - γ; m, n + 2α).

If the prior density is improper, ξ(θ, τ) ∝ 1/τ for θ ∈ R^p and τ > 0, the 1 - γ HPD region for θ is the set of all θ where Q(θ) = (θ - θ̂)'X'X(θ - θ̂) ≤ p s² F(γ; p, n - p), where θ̂ = (X'X)⁻¹X'Y is the least squares estimate of θ and the estimate of σ² is s² = (Y - Xθ̂)'(Y - Xθ̂)(n - p)⁻¹. This is the standard confidence region of content 1 - γ for θ, and if the null hypothesis is H₀: θ = θ₀ and the alternative is H₁: θ ≠ θ₀, then H₀ is rejected if Q(θ₀) > p s² F(γ; p, n - p).

This method of testing hypotheses is not the only way, and other tests are available. For example, one may use posterior odds ratios or a decision theory approach, when a loss function is available to describe the costs of making a wrong decision. DeGroot (1970) presents various Bayesian tests of hypotheses, while Zellner (1971) has a good introduction to posterior odds ratios. With the HPD approach, one must specify the hyperparameters μ, P, α, and β of the normal-gamma prior density, and this information must reflect the statistician's opinion about the parameters. One's opinion about θ and τ, when the null hypothesis is true, is perhaps different than when the alternative is true, so one must be careful in choosing the prior density.
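The HPD test just described is easy to carry out numerically. The sketch below uses hypothetical data and SciPy's F quantile for F(γ; p, n - p) (taken here as the upper γ point); it is an illustration under those assumptions, not a procedure given in the text.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, p = 25, 2
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    Y = X @ np.array([1.0, 0.8]) + rng.normal(scale=1.0, size=n)

    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)      # least squares estimate
    resid = Y - X @ theta_hat
    s2 = resid @ resid / (n - p)                        # estimate of sigma^2

    gamma = 0.05
    crit = p * s2 * stats.f.ppf(1 - gamma, p, n - p)    # p * s^2 * F(gamma; p, n - p)

    theta0 = np.array([1.0, 0.0])                       # hypothesized value (illustrative)
    Q = (theta0 - theta_hat) @ (X.T @ X) @ (theta0 - theta_hat)
    print("Q(theta0) =", Q, "  critical value =", crit, "  reject H0:", Q > crit)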

PREDICTIVE ANALYSIS

Predictive analysis is the methodology that is developed in order to forecast future observations. Of course, forecasting is a very important activity in business and economics, where sophisticated time series techniques are used. The Bayesian uses the so-called Bayesian predictive density to forecast future observations, so this will be defined in this section.

There are many ways to predict observations; however, the Bayesian approach is natural in that one bases one's prediction on the conditional distribution of the future given the past. In order to do this, one must treat the parameters of the model as random, that is, as will be seen, the joint distribution of the future observations and the parameters, given the past observations, is averaged over the posterior distribution of the parameters.

In this section the Bayesian predictive density is defined and illustrated with linear models; then in the later parts of the book the same approach will be employed with time series models.

Suppose y_1, y_2, ..., y_n is a random sample from a normal population with mean θ and precision τ and that w_1, w_2, ..., w_k are future observations; how does one forecast the values of the future observations?

The Bayesian predictive density is the conditional density of w = (w_1, w_2, ..., w_k)' given the sample s = (y_1, y_2, ..., y_n)'. The density of w given θ, τ, and s is

    f(w | θ, τ, s) ∝ τ^{k/2} exp{-(τ/2) Σ_{i=1}^{k} (w_i - θ)²},   w ∈ R^k,            (1.33)

and does not depend on s, since given θ and τ, w and s are independent. Now suppose θ and τ have a prior normal-gamma density, say

    ξ(θ, τ) ∝ τ^{α-1} e^{-τβ} τ^{1/2} e^{-(τ/2) q(θ - μ)²},   θ ∈ R,   τ > 0,          (1.34)

where α > 0, β > 0, μ ∈ R, and q > 0 are the hyperparameters. The product of the posterior density of θ and τ and (1.33) is the joint conditional density of w, θ, and τ given s, which when averaged with respect to θ and τ produces the Bayesian predictive density of w. The posterior density of θ and τ is

    ξ(θ, τ | s) ∝ τ^{(n+2α)/2 - 1} exp{-(τ/2)[2β + Σ(y_i - ȳ)² + n(θ - ȳ)²]}            (1.35)

for θ ∈ R and τ > 0, and when this is multiplied by (1.33), gives

    g(w, θ, τ | s) ∝ τ^{(n+2α+k)/2 - 1} exp{-(τ/2)[2β + n(θ - ȳ)² + Σ_{i=1}^{k}(w_i - θ)² + Σ_{j=1}^{n}(y_j - ȳ)²]}     (1.36)

for w ∈ R^k, θ ∈ R, and τ > 0, which is to be integrated with respect to θ and τ over R × (0, ∞). Completing the square on θ and integrating with respect to θ over R, then integrating (1.36) with regard to τ over (0, ∞), then completing the square on w, results in

    h(w | s) ∝ {[w - A⁻¹B]'A[w - A⁻¹B] + C - B'A⁻¹B}^{-(n+2α+k)/2},   w ∈ R^k,          (1.37)

as the predictive density of w, where A is a k × k matrix with diagonal elements (n + k - 1)(n + k)⁻¹ and off-diagonal elements -(n + k)⁻¹. The column vector B has elements nȳ(n + k)⁻¹; thus we see one would forecast w with a k-dimensional t distribution with n + 2α degrees of freedom, location vector A⁻¹B, and precision matrix (n + 2α)A(C - B'A⁻¹B)⁻¹, where

    C = 2β + Σ(y_i - ȳ)² + nȳ² - n²ȳ²(n + k)⁻¹.

It is important to remember one's prediction of w is to be based on the whole of the predictive distribution, that is to say, one has available an entire distribution with which to forecast w. A point forecast of w is given by A⁻¹B, the mean of the distribution, and the confidence or the forecast error of this prediction is given by the dispersion matrix of w, namely A⁻¹(C - B'A⁻¹B)(n + 2α - 2)⁻¹. Note, the degrees of freedom do not depend on the number of observations to be predicted but on the number of observations in the sample and the prior parameter α.

An ad hoc way to make a forecast about w_1, one future observation, is to treat w_1 ~ n(θ, τ⁻¹) and estimate θ by ȳ and τ⁻¹ by (n - 1)⁻¹ Σ(y_i - ȳ)², then predict w_1 with ȳ, since ȳ is the mean of the conditional distribution of w_1 given θ and τ. The same prediction arises from the Bayesian method if one uses a Jeffreys' vague prior density for θ and τ and uses the mean ȳ of the predictive density of w_1 given s; however, the ad hoc method is based on a n(θ, τ⁻¹) distribution, while the Bayesian forecast is based on a t distribution; but the latter is equivalent to the former when the number of observations is large, in which case the t distribution approaches the normal.

In the more general case of a general linear model, how does one forecast future observations w, where

    w = Zθ + e,                                                                        (1.38)

w is a k × 1 vector of future observations, Z a known k × p matrix, θ the p × 1 parameter vector of the linear model

    Y = Xθ + e,                                                                        (1.1)

and e a k × 1 vector of future errors? The forecasting principle is the same here as in the previous example: one must find the conditional distribution of w given y. Thus, assuming the error vectors of (1.1) and (1.38) are independent normal errors, where the future error vector is N(0, τ⁻¹I_k), and using a normal-gamma prior density for θ and τ, namely (1.5), it is easy to derive the Bayesian predictive density of w. The previous example is a special case of forecasting w in the general linear model. Suppose the prior density for θ and τ is the normal-gamma density (1.5) with hyperparameters α > 0, β > 0, μ ∈ R^p, and P, a positive definite p × p matrix; then the reader may verify that the predictive density of w is

    g(w | y) ∝ . . .

. . . τ > 0 is an unknown precision with a prior gamma density. Find the posterior density of τ.
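For the univariate case treated above, the quantities A, B, and C are simple to form. The following sketch (the sample and the hyperparameters α and β are assumed for illustration, not from the text) computes the point forecast A⁻¹B, which works out to ȳ for every future observation, and the forecast dispersion A⁻¹(C - B'A⁻¹B)(n + 2α - 2)⁻¹.

    import numpy as np

    rng = np.random.default_rng(3)
    y = rng.normal(loc=5.0, scale=2.0, size=15)
    n, k = y.size, 3
    alpha, beta = 2.0, 3.0                      # prior gamma hyperparameters (hypothetical)
    ybar = y.mean()

    A = -np.ones((k, k)) / (n + k)              # off-diagonal elements -(n + k)^-1
    np.fill_diagonal(A, (n + k - 1) / (n + k))  # diagonal elements (n + k - 1)(n + k)^-1
    B = np.full(k, n * ybar / (n + k))
    C = 2 * beta + ((y - ybar) ** 2).sum() + n * ybar ** 2 - n ** 2 * ybar ** 2 / (n + k)

    mean_w = np.linalg.solve(A, B)              # location of the predictive t: each entry equals ybar
    disp_w = np.linalg.inv(A) * (C - B @ np.linalg.solve(A, B)) / (n + 2 * alpha - 2)
    print("forecast mean:", mean_w)
    print("forecast dispersion matrix:\n", disp_w)
    print("degrees of freedom:", n + 2 * alpha)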

REFERENCES

Aitchison, J. S. and J. R. Dunsmore (1975). Statistical Prediction Analysis, Cambridge University Press, London.
Barnett, V. (1973). Comparative Statistical Inference, John Wiley and Sons, Inc., New York.
Berger, James O. (1980). Statistical Decision Theory, Foundations, Concepts, Methods, Springer-Verlag, New York.
Bernardo, J. M. (1979). "Reference posterior distributions for Bayesian inference" (with discussion), Journal of the Royal Statistical Society, Series B, Vol. 41, No. 2, pp. 113-147.
Box, G. E. P. and G. C. Tiao (1965). "Multiparameter problems from a Bayesian point of view," Annals of Mathematical Statistics, Vol. 36, pp. 1468-1482.
Box, G. E. P. and G. C. Tiao (1973). Bayesian Inference in Statistical Analysis, Addison-Wesley, Reading, Mass.
DeGroot, M. H. (1970). Optimal Statistical Decisions, McGraw-Hill Book Company, New York.
Ferguson, Thomas S. (1967). Mathematical Statistics, A Decision Theoretic Approach, Academic Press, New York and London.
Geisser, S. (1971). "The inferential use of predictive distributions," in Foundations of Statistical Inference, edited by V. P. Godambe and A. D. Sprott, Holt, Rinehart and Winston, Toronto.
Harrison, P. J. and C. F. Stevens (1976). "Bayesian forecasting" (with discussion), Journal of the Royal Statistical Society, Series B, Vol. 38, pp. 205-247.
Jeffreys, H. (1939/1967). Theory of Probability, Third Edition, Clarendon Press, Oxford.
Lindley, D. V. (1965). Introduction to Probability and Statistics from a Bayesian Viewpoint, University Press, Cambridge.
Lindley, D. V. (1971). Bayesian Statistics, A Review, Reg. Conf. Ser. Appl. Math., 2. SIAM, Philadelphia.
Lindley, D. V. and A. F. M. Smith (1972). "Bayes estimates for the linear model," Journal of the Royal Statistical Society, Series B, Vol. 34, pp. 1-42.
Raiffa, Howard and Robert Schlaifer (1961). Applied Statistical Decision Theory, Division of Research, Harvard Business School, Boston.
Savage, L. J. (1972). The Foundation of Statistics, Dover, Inc., New York.
Winkler, Robert L. (1977). "Prior distributions and model building in regression analysis," in New Developments in the Applications of Bayesian Methods, edited by Ahmet Aykac and Carlo Brumat, North-Holland, Amsterdam.
Zellner, Arnold (1971). An Introduction to Bayesian Inference in Econometrics, John Wiley and Sons, Inc., New York.
Zellner, Arnold (1980). "On Bayesian regression analysis with g-prior distributions," Technical Report, H. G. B. Alexander Research Foundation, Graduate School of Business, University of Chicago.

2 LINEAR STATISTICAL MODELS AND BAYESIAN INFERENCE

INTRODUCTION

This chapter will introduce the various linear models which are to be examined in the remaining chapters of this book. One class of models was introduced in Chapter 1, namely the so-called general linear model, which includes, as special cases, the fixed models, which are used for regression analysis and the analysis of designed experiments. The regression and design models are the traditional ones in that they are the most frequently used in statistical practice. Many of the software packages such as SAS and BMD allow one to routinely analyze data with one of the traditional models.

A first cousin to the design model is the mixed model, which is often employed when some of the experimental factors are random, that is, the levels of the factors are selected at random from a population of levels and one is interested in making inferences about the factor populations.

The time series models to be analyzed in this book are special Gaussian linear models, namely the regression model with autocorrelated errors, the ARMA class (autoregressive moving average models), and the distributed lag models of econometrics. During the past decade the ARMA models have been thoroughly examined from the classical analysis of Box and Jenkins (1970), while the other time series models, in addition to being analyzed from a classical perspective, have been approached, principally by Zellner (1971), from the Bayesian viewpoint.

Linear dynamic models are now being studied by statisticians, but in the past have received the most attention of engineers who are


interested in communication theory, navigation systems, and tracking of satellites. This class of models should prove to be very useful in the statistical analysis of time series.

Linear models which have unstable parameters are studied in Chapter 7 under the title structural change, which is a term from the field of economics. If one is confident, a priori, that a change in model (population) parameters will occur, special techniques are necessary. We will see that the Bayesian analysis of such models has been a major contribution to the analysis of data in this area.

Most of this book deals with univariate linear models, and it is only in Chapter 8 that multivariate models are first encountered. Multivariate regression, design, time series models, and some econometric models are examined in detail. The linear logistic model is also mentioned in this chapter but will not be studied.

In what is to follow, each of the above models will be defined. An example of each is given, and a preview of the Bayesian analysis of the particular model is described.

LINEAR STATISTICAL MODELS

Regression Models

The regression model is a special case of the general linear model

    Y = Xθ + e                                                            (1.1)

of Chapter 1, where Y is an n × 1 vector of observations, X is an n × p known matrix, θ a p × 1 unknown real parameter vector, and e an n × 1 vector of observation errors. Regression models are employed to examine the relationship between a dependent variable y and q independent variables x_1, x_2, ..., x_q; thus one observes (Y, x_1, x_2, ..., x_q), and the n observations are denoted by (y_i, x_1i, x_2i, ..., x_qi), i = 1, 2, ..., n, where y_i is the value of Y when one observes x_1i, x_2i, ..., x_qi. Presumably Y is observed with an error, but the independent variables are observed without error. In terms of the general model (1.1),

    y_i = β_0 + Σ_{j=1}^{q} β_j x_ji + e_i,   i = 1, 2, ..., n,           (2.1)

and the i-th component of Y is y_i, of e, e_i; the first column of X is j, a column of ones, and the second column is (x_11, x_12, ..., x_1n)', etc., where p = q + 1.

Thus, one is assuming the average value of the dependent variable is a linear function of q independent variables if the n errors each

Thus, one is assuming the average value of the dependent v a ri­ able is a linear function of q independent variables if the n errors each


have a zero mean. The regression model is examined in detail in Chapter 3, where a complete Bayesian analysis is performed. It is shown how to estimate and test hypotheses about the parameters of the model and how to forecast future observations, using either a vague density or a conjugate density to express prior information. Chapter 1 explained how one is to do a Bayesian analysis, and these ideas are applied to the regression model and design models of Chapter 3.

The regression and design models are quite similar, and each is expressed in terms of the general linear model; however, with the design model the design matrix X of (1.1) is used to indicate the presence or absence of the levels of several factors of the experiment, and the components of X are either zero or one.
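As a small illustration of the structure of X in (2.1), the sketch below (hypothetical data and coefficient values, not from the text) builds the n × p design matrix with a leading column of ones and recovers the coefficients by least squares, which is also the posterior location under the vague Jeffreys prior of Chapter 1.

    import numpy as np

    rng = np.random.default_rng(4)
    n, q = 10, 2
    x = rng.normal(size=(n, q))                       # observed independent variables
    X = np.column_stack([np.ones(n), x])              # n x p design matrix, p = q + 1
    beta = np.array([2.0, 1.0, -0.5])                 # beta_0, beta_1, beta_2 (assumed)
    y = X @ beta + rng.normal(scale=0.3, size=n)

    # Least squares estimate = posterior location under Jeffreys' prior (Theorem 1.1).
    print(np.linalg.solve(X.T @ X, X.T @ y))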

The Design Models

If the experiment has only one factor with, say, k levels, the model is

    y_ij = θ_i + e_ij,                                                    (2.2)

where y_ij is the j-th observation on the i-th level, where i = 1, 2, ..., k, and j = 1, 2, ..., n_i, and n = Σ_{i=1}^{k} n_i is the number of observations, and e_ij is the error associated with the ij-th observation. Of course, this example can be expressed in terms of the general linear model, where y = (y_11, y_12, ..., y_1n_1; y_21, ..., y_2n_2; ...; y_k1, ..., y_kn_k)' and the design matrix X is an n × k matrix, where the first n_1 components of the first column are ones and the remaining elements are zero. The last n_k components of the k-th column are ones; thus the elements of X are either zeros or ones with a one in each row. The elements indicate the presence or absence of the k levels of the factor for each of the n observations, and the theory of Chapter 1 is applied in Chapter 3 in order to analyze one- and two-factor experiments, such as the completely randomized design and the completely randomized block design.

If the k levels of the experiment are the only ones of interest to the experimenter, that is, inferences are confined to the k means θ_i, the model is called fixed; otherwise, if the k levels are selected at random from a population of levels, the model is said to be mixed or random.

In the fixed version of the one-way model (2.2), the e_ij's are independent normal variables with zero mean and unknown precision τ, and the k parameters θ_i represent the effects of the k levels of the factor.
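The indicator structure of the one-way design matrix can be made concrete with a short sketch (the values of k and the n_i below are illustrative, not from the text); each row of X contains a single one marking the level of that observation.

    import numpy as np

    n_i = [2, 3, 2]                                   # assumed group sizes, k = 3 levels
    k = len(n_i)
    X = np.zeros((sum(n_i), k))
    row = 0
    for level, count in enumerate(n_i):
        X[row:row + count, level] = 1.0               # ones in the n_i rows for level i
        row += count
    print(X)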

In the classical approach to this problem, the θ_i's are regarded as fixed unknown constants, if the model is fixed; otherwise they are thought of as unobservable random normal variables with zero mean and constant variance (the variance component) if the k levels are selected at random from a population of levels, hence if the variance component is zero the population is degenerate and there is no treatment effect. From the Bayesian viewpoint in the fixed case, the θ_i's are treated as random variables with some known prior distribution, but in the random case, a prior density is put on the variance component of the model. Thus if one adopts a Bayesian approach the distinction between fixed and random is not as meaningful as it is when one adopts the classical approach.

Consider an example by Davies (1967, page 105), which Box and Tiao (1973, page 216) analyze from a Bayesian approach. The experiment consists of learning what effect the batch to batch variation of a raw material has on the yield of the product produced. Five samples from each of six batches were examined for product yield, and the experiment is thought of as random, because the six batches were selected at random from a population of batches. Clearly, one's inferences should not be confined only to the six batches of the experiment, but to the population of batches, which is characterized by a variance component which explains the batch-to-batch variation.

The Bayesian analysis of random and mixed models is given in Chapter 4, but the approach differs from the method of Box and Tiao (1973), who use numerical integration to isolate the marginal posterior distribution of each of the variance components. The results of Chapter 4 are based on the dissertation of Rajagopalan (1980), who found a


way to isolate the posterior density of each variance component via analytical approximation. The mixed model is given by

    y = xθ + Σ_{i=1}^{c} u_i b_i + e,                                     (2.3)

where y is an n × 1 vector of observations, x is an n × p known matrix, θ a p × 1 unknown parameter vector, u_i is an n × m_i known matrix, the b_i's are independent normal unobservable vectors each with a zero mean vector, and b_i has precision matrix τ_i I_{m_i}, and e is independent of the b_i's and is a normal random vector with zero mean and precision matrix τI_n. Also, θ ∈ R^p, τ_i > 0, τ > 0, and the c variance components are σ_i² = τ_i⁻¹, while σ² = τ⁻¹ is called the error variance.

The mixed model accommodates both fixed and random factors of a designed experiment, and the θ vector has the levels of the fixed factors, while the levels of the c random factors are represented by b_1, b_2, ..., b_c. Box and Tiao (1973, page 341) introduce the additive mixed model

    y_ij = θ_i + c_j + e_ij                                               (2.4)

to analyze a car-driver experiment with eight drivers and six cars. The main response was the mileage per gallon of gasoline, and each driver drove each of the six cars, where i = 1, 2, ..., 8 and j = 1, 2, ..., 6. Thus y_ij is the gasoline mileage when driver i drives car j, the θ_i's are unknown constants, the c_j's are normal independent random variables with zero mean and variance component σ_c², and the e_ij's are independent normal variables with zero means and error variance σ². Thus model (2.4) is seen to be a special case of the mixed model (2.3), where the cars were selected from a population of cars with variance σ_c², but inferences about the drivers are confined to the eight drivers of the experiment. Chapter 4 is concerned with such experiments, and the posterior analysis consists of determining the posterior distribution of each variance component and the vector of fixed effects.

Time Series Models

One of the most useful models to analyze time series data is the p-th order autoregressive model

    y(t) = Σ_{i=1}^{p} θ_i y(t - i) + e_t,   t = 1, 2, ..., n,            (2.5)

where the e_t's are white noise errors. For p ≥ 2, more complex correlations can be studied. Still another useful model to analyze time series data is the q-th order moving average class

    y_t = e_t - Σ_{i=1}^{q} φ_i e_{t-i},   t = 1, 2, ..., n,              (2.7)

where y_t is the observation at time t, the φ_i's are unknown real parameters (the moving average coefficients), and the e_t's are independent n(0, τ⁻¹) random variables (white noise). If q = 1, the correlation, see Box and Jenkins (1970), is given by

    ρ(y_t, y_{t+s}) = -φ_1/(1 + φ_1²),   s = 1,
                    = 0,                 s ≥ 2,                           (2.8)

and the moving average class introduces another form of correlation for the observations.

The autoregressive model has been used for many years for analysis of time series data and has received attention from both the Bayesians, see Zellner (1971), and from classical procedures, see for example Box and Jenkins (1970). On the other hand, the moving average model has not been explored from a Bayesian viewpoint but has received a lot of attention from other perspectives.
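A brief simulation sketch (the coefficient values and series length are assumptions for illustration, not from the text) contrasts the two classes: it generates an AR(1) series as in (2.5) and an MA(1) series as in (2.7), estimates the autoregressive coefficient by conditional least squares, and checks that the lag-1 sample correlation of the MA(1) series is near -φ_1/(1 + φ_1²) from (2.8).

    import numpy as np

    rng = np.random.default_rng(5)
    n = 500
    e = rng.normal(size=n + 1)                     # white noise

    theta1 = 0.6                                   # AR(1) coefficient (assumed)
    y_ar = np.zeros(n)
    for t in range(1, n):
        y_ar[t] = theta1 * y_ar[t - 1] + e[t]

    phi1 = 0.4                                     # MA(1) coefficient (assumed)
    y_ma = e[1:] - phi1 * e[:-1]

    ar_hat = (y_ar[1:] @ y_ar[:-1]) / (y_ar[:-1] @ y_ar[:-1])   # conditional least squares
    r1 = np.corrcoef(y_ma[1:], y_ma[:-1])[0, 1]                 # lag-1 sample correlation
    print("AR(1) estimate:", ar_hat, " (true 0.6)")
    print("MA(1) lag-1 correlation:", r1, " theory:", -phi1 / (1 + phi1 ** 2))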


In Chapter 5, the Bayesian analysis of Zellner will be expanded for the autoregressive model, and the first and second order moving average models will be studied, where the posterior and predictive analysis will be derived. The autoregressive and moving average models are combined into the ARMA class, thus an AR M A(1,1) model is given by yt -

where y

* v r

( 2 - 9)

is the observation at time t, is the moving average param­

eter, 0 the autoregressive parameter, the e^’s are white noise, and y^ is a known constant. Thus there are three parameters to the model. The ARMA model is parsimonious, that i s , it is able to represent time series data with a few parameters, thus for a given set of observa­ tions { y ( t ) , t = 1,2, . . . , n } an AR M A(1,1) model and an A R (8 ) (8th order autoregressive model) perhaps will give the same ’’fit” to the data. Parsimony was emphasized by Box and Jenkins (1970) in their development of the statistical analysis of time series which are gen­ erated by ARMA models. Indeed they say in practice one is not likely to encounter a case where either (o r both) the moving average order a or (a n d ) the autoregressive order p is in excess of two. The Box and Jenkins analysis of time series is a major contribution to statistical methodology and is today very popular. Compared to Box-Jenkins methodology, the Bayesian approach is in its infancy. Nevertheless, the Bayesian posterior analysis and forecasting techniques of ARMA models are attempted in Chapter 5, where a com­ plete treatment of autoregressive processes is given. The A R M A (1,1), M A(1) and M A(2) processes are also examined, but only partial success can be reported here for the M A(2) case. Monahan (1983) has recent­ ly developed a Bayesian analysis of ARMA models for low order processes.

Models for Structural Change

Two important references on structural change in linear models are Poirier (1976) and more recently the special issue of the Journal of Econometrics, which was edited by Broemeling (1982). The book by Poirier is a review of the statistical and econometric literature on structural change and presents new ideas about how one should model structural change with the aid of spline functions. The special issue contains ten articles on the subject and introduces the reader to a large variety of theory and methodology. This issue contains some material which was introduced before 1976, but for the most part includes ideas which appeared after Poirier's book.



The term "structural change" doesn't have a definite definition and is used by econometricians to denote a change in the parameters of a model which explains a relationship between economic variables. Also, the term is often employed to refer to a model which has been misspecified. Terms or phrases such as shift point, change point, transition function, separate regressions, switching regression, and two-phase regression, although not identical in meaning to structural change, are involved in some way with structural change.

When data are collected in an ordered sequence, often one is interested in the probabilistic structure of the measured variables from one subset of the time domain to the other. Suppose for example a major economic policy change is put into effect at some time; then the researcher is perhaps interested in assessing the effects of the change on the variables under study. Two fundamental questions now occur:

(i) Has a change occurred among the variables under investigation during the observation period? This is called the detection problem because one wants to detect a change in the relationship.

(ii) Assuming a change has taken place, can one estimate the parameters of the model? For example, one may want to estimate the time where the change occurred as well as the other parameters of the model, namely those which explain the "before" and "after" relationship between the variables.

Problems of structural change are found in a wide variety of disciplines. Often the variables under study are economic, where the policy change could be a new tax law (such as the windfall profits tax), a new government program (such as price supports), or a major disturbance to the economy (an oil embargo). In biology, one might focus on the life of an organism and the time when a change in growth pattern will occur. These examples and others are given by Holbert (1982), who reviews Bayesian developments in structural change from 1968 to the present.

An interesting example reported by Holbert is a two-phase regression problem illustrated with stock market sales volume. On January 3, 1970, Business Week reported regional stock exchanges were hurt by abolition of give-ups (commission splitting) on December 5, 1968. McGee and Carleton (1970) analyzed the relationship between dollar volume sales of the Boston Stock Exchange and the combined New York and American Stock Exchanges. They did this to investigate the hypothesis that a structural change occurred (between dollar volume sales of the regional and national exchanges) after December 5, 1968, and their analysis was based on a piecewise regression methodology. Holbert (1982) reexamined the data, but used the Bayesian approach, and came to the same conclusion as McGee and Carleton, namely that abolition of give-ups did indeed hurt the regional exchanges. The Bayesian approach was based on the two-phase regression model

    y_i = α_1 + β_1 x_i + e_i,    i = 1, ..., m,
    y_i = α_2 + β_2 x_i + e_i,    i = m + 1, ..., n,    (2.10)

where the e_i's are n.i.d. (0, τ⁻¹), y_i is the sales volume on the combined exchanges at time i, x_i the corresponding total dollar volume on the Boston Stock Exchange at time i, m the unknown shift point, and α_i and β_i (i = 1, 2) are unknown regression parameters. Thus the first m pairs (x_i, y_i) follow one relationship and the last n − m pairs follow a regression relation with parameters α_2 and β_2. The main question here is: what is m? Is m "close to" December 5, 1968? Does m have a large posterior probability, corresponding to support for the hypothesis that the regional exchanges were damaged because of the abolition of give-ups? Holbert's analysis adopted a vague improper prior for the parameters

    ξ(α_1, β_1, α_2, β_2) ∝ a constant,    α_i ∈ R,  β_i ∈ R,
    ξ(m) = (n − 3)⁻¹,    2 ≤ m ≤ n − 2,
    ξ(τ) ∝ 1/τ,    τ > 0,    (2.11)

which when combined with the likelihood function gave

    ξ(m|data) ∝ [ Σ_{i=1}^{m} (x_i − x̄_{1,m})²  Σ_{i=m+1}^{n} (x_i − x̄_{2,m})² ]^{−1/2} [ Σ_{i=1}^{m} (y_i − ŷ_{i,m})² + Σ_{i=m+1}^{n} (y_i − ỹ_i)² ]^{−(n−4)/2},    (2.12)

where 2 ≤ m ≤ n − 2, as the marginal posterior mass function of m. In this formula x̄_{1,m} is the mean of the first m x-values, x̄_{2,m} the mean of the last n − m values, ŷ_{i,m} is the predicted y_i value based on the first m (x, y) pairs, and ỹ_i the predicted value of y_i based on the last n − m (x, y) pairs. Holbert computes the function (normalized) (2.12) for the months from February, 1967, to November, 1969, and obtains a "large" posterior probability of .2615 for November, 1968, the month just before the change went into effect.

The two-phase regression model is one of the many models of structural change which appear in Chapter 7. Other models which are examined are normal sequences and time series processes such as the autoregressive and regression models with autocorrelated errors. The multiple linear regression model with one change is investigated in detail and a thorough sensitivity study is done. In most of the models either a vague improper prior density or a quasiconjugate prior density is used.

Models for structural change usually incorporate only one change (or at most two changes). On the other hand, models which incorporate parameters which always change (from one time point to the next) are also very important and, as we will see, have been quite valuable to engineers and scientists in the communication and space industries.
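As an illustration of how the posterior mass function of the shift point can be evaluated in practice, the sketch below computes (2.12), as reconstructed above, over the admissible values of m; the function name and any data fed to it are hypothetical.

```python
import numpy as np

def shift_point_posterior(x, y):
    """Posterior mass of the shift point m in the two-phase regression (2.10),
    under the vague prior (2.11), following the form of (2.12) above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    ms = np.arange(2, n - 1)                      # 2 <= m <= n - 2
    log_post = np.empty(len(ms))
    for k, m in enumerate(ms):
        b1 = np.polyfit(x[:m], y[:m], 1)          # least-squares fit before the shift
        b2 = np.polyfit(x[m:], y[m:], 1)          # least-squares fit after the shift
        sse = (np.sum((y[:m] - np.polyval(b1, x[:m])) ** 2)
               + np.sum((y[m:] - np.polyval(b2, x[m:])) ** 2))
        sxx1 = np.sum((x[:m] - x[:m].mean()) ** 2)
        sxx2 = np.sum((x[m:] - x[m:].mean()) ** 2)
        log_post[k] = -0.5 * np.log(sxx1 * sxx2) - 0.5 * (n - 4) * np.log(sse)
    post = np.exp(log_post - log_post.max())
    return ms, post / post.sum()                  # normalized, as Holbert reports it
```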

Dynamic Linear Models

The Kalman filter is one of the major achievements of mathematical (statistical) modeling; however, it is not well-known to statisticians. The Kalman filter is an algorithm which recursively computes the mean of the posterior distribution of a parameter of a dynamic model. Models are called dynamic when the parameters of the model are actually always changing; thus the traditional models of regression and the design of experiments are static, since the parameters of these models are not thought to be always changing their values. The dynamic linear model is, for t = 1, 2, ...,

    y(t) = F(t) θ(t) + U(t),
    θ(t) = G θ(t − 1) + V(t),    (2.13)

where y(t) is a m × 1 observation vector, F(t) a m × p matrix, θ(t) a p × 1 state vector, U(t) a m × 1 unobservable observation error vector, G a p × p systems transition matrix, and V(t) a p × 1 unobservable systems error vector.

The primary parameters are the state vectors θ(0), θ(1), ..., and consequently one must estimate them as the observations become available, that is, given y(1), what is θ(1); given y(1) and y(2), what is θ(2); and so on. The first equation of the model is the observation equation, which relates the observation in a linear way to the state vector and by itself is the usual linear model of regression and the design of experiments. The second equation gives the model its dynamic flavor, is called the systems equation, and tells us how the parameters of the model change. The second equation is a first-order vector autoregressive model (examined in Chapter 8), which tells us the parameter (state vector) is always changing in such a way that the next value is a linear function of the present value apart from a random error.

One immediately notices that the dynamic model contains a large number of parameters when compared to their static counterparts; thus one usually assumes that F(t), t = 1, 2, ..., and the covariance matrices of the error vectors U(t) and V(t) are known. Assuming these quantities are known, as well as making other assumptions such as normality of the error vectors, Kalman (1960) found the mean of the marginal posterior normal distribution of θ(t) given y(1), y(2), ..., y(t) for all t = 1, 2, ..., but more importantly, developed a recursive formula for its computation. It was very important to have a recursive formula since often the time between successive observations is very short, only a few seconds or less.

The Kalman filter is often used to compute the position of a satellite or a missile, and tracking these craft requires the computation to be extremely fast; thus Kalman's filter was a major contribution. Consider again the dynamic linear model; then θ(t) might represent the actual position of a satellite at time t, y(t) the measured position by radar or telemetry, and U(t) the error in the observed position. Knowing the relationship between the actual and observed positions allows one to know F(t), and knowing the noise characteristics of the radar or telemetry signals allows one to know the covariance matrix of the observation error vector U(t). Celestial mechanics assists one to find a law which relates the successive positions of the satellite from one time to the next, that is, one knows G, the transition matrix of the system equation. See Jazwinski (1970, page 324) for an example of orbit mechanics.

Tracking problems are easily handled by dynamic models (2.13), as are navigation problems where it is necessary to control a missile, aircraft, or submarine. A linear dynamic model for navigation is given by the observation equation of (2.13), but the systems equation is amended to give

    θ(t) = G θ(t − 1) + H x(t − 1) + V(t),    (2.14)

where t = 1, 2, ..., H is a p × s known matrix, x(t − 1) a s × 1 control vector, G a p × p matrix, and V(t) a p × 1 error vector.
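A minimal sketch of the Kalman recursion for model (2.13), assuming, as the text does, that the design matrix, the transition matrix, and the two error covariances are known; F is held constant over time here for brevity, and the matrix names R, Q and the initial values are assumptions of the sketch rather than the book's notation.

```python
import numpy as np

def kalman_filter(ys, F, G, R, Q, theta0, P0):
    """Posterior means of theta(t) given y(1), ..., y(t) for
        y(t) = F theta(t) + U(t),        U(t) ~ N(0, R)
        theta(t) = G theta(t-1) + V(t),  V(t) ~ N(0, Q),
    computed recursively (prediction step, then update step)."""
    theta, P = np.asarray(theta0, float), np.asarray(P0, float)
    means = []
    for y in ys:
        theta_pred = G @ theta                           # prior mean of theta(t) given y(1..t-1)
        P_pred = G @ P @ G.T + Q                         # and its covariance
        S = F @ P_pred @ F.T + R                         # covariance of the forecast of y(t)
        K = P_pred @ F.T @ np.linalg.inv(S)              # Kalman gain
        theta = theta_pred + K @ (y - F @ theta_pred)    # posterior mean after seeing y(t)
        P = P_pred - K @ F @ P_pred                      # posterior covariance
        means.append(theta.copy())
    return np.array(means)
```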



With navigation problems, one must accurately predict the position θ(t + 1) of the vehicle one time unit ahead, then make the necessary adjustments so that the actual position of the vehicle is at a predetermined location. This problem of navigation is called the control problem in the literature. At time t, one has t observations y(1), y(2), ..., y(t) and one must control the vehicle so that its position θ(t + 1) one time unit in the future is to be, say, T(t + 1); thus, how should one choose x(t) so that θ(t + 1) is "close" to T(t + 1)? A plan to do this is called a control strategy and is well-developed in the literature.

Chapter 6 introduces the dynamic linear model and how it is applied to navigation and tracking problems. The presentation is from the perspective of a statistician and should be readable to people with a statistical training. The chapter reviews the Kalman filter, the control problem, nonlinear filtering, adaptive estimation (when some of the parameters of the system and observation equations are unknown), smoothing, and prediction. The dynamic linear model is the most general of those considered in this book, and indeed all of the others are special cases.
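One simple control strategy, offered here only as an illustration of the idea (it is not the procedure developed in Chapter 6), is to pick x(t) so that the predicted state G θ̂(t) + H x(t) is as close as possible, in least squares, to the target T(t + 1):

```python
import numpy as np

def control_action(theta_hat, G, H, target):
    """Choose the control x(t) minimizing ||target - (G theta_hat + H x)||^2,
    where theta_hat is the current posterior mean of the state."""
    drift = G @ theta_hat                              # predicted state with no control applied
    x, *_ = np.linalg.lstsq(H, target - drift, rcond=None)
    return x
```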

Other Linear Models

The chapter on other models, Chapter 8, includes many of the multivariate models which are familiar to theoretical and applied statisticians. For example, multivariate regression and design models as well as vector autoregressive and moving average processes are to be considered. For each univariate model considered in this book, there is a corresponding multivariate version. The linear dynamic model is an example of a multivariate linear model and is to be examined in Chapter 6; however, it is so important, so believes the author, that a separate chapter will be devoted to its analysis.

Let us consider the multivariate version of the p-th order autoregressive process, namely

    y'(t) = Σ_{i=1}^{p} y'(t − i) θ_i + e'(t),    (2.15)

where t = 1, 2, ..., n, y(t) is a m × 1 observation vector, y(0), y(−1), ..., y(1 − p) are known m × 1 vectors, the θ_i (i = 1, 2, ..., p) are unknown m × m matrices of real numbers where θ_p is nonzero, and the e(t) are independent N(0, P⁻¹) random vectors, where P is a m × m unknown positive definite symmetric precision matrix. This model will accommodate the simultaneous analysis of m time series observed at equally spaced points, and as with the univariate AR(p) process, the main objective of our analysis is to develop the joint posterior distribution of the parameters θ_1, θ_2, ..., θ_p, and P and to predict future observations y(n + 1), y(n + 2), ....

It will be shown that the Bayesian analysis of the multivariate AR model is a straightforward generalization of the univariate AR model if one uses the natural conjugate prior density for the parameters. In the univariate version of the model, the conjugate family is the normal-gamma class and the normal-Wishart for the vector version of the model; however, a constraint on the posterior precision matrix of θ_1, θ_2, ..., θ_p is imposed in the vector case (m ≥ 2). See Zellner (1971) for a discussion of this situation. Since the multivariate autoregression model is also a multivariate linear regression model, the same constraint occurs with that model, but except for this particular nuisance, no difficulties are encountered in the posterior and predictive analysis of the multivariate linear model.

The autoregressive process is, as mentioned earlier, the model most often used for time series data, and the vector version is being used more and more along with the vector ARMA model. Jenkins (1979, page 75) gives an interesting example of a bivariate time series

    y(t) = (y_1(t), y_2(t))',    t = 1, 2, ...,    (2.16)

where y_i(t) is the sales of product i, i = 1, 2. The products are competitive, and the observations are quarterly and consist of eleven years of data. Since the sales of one product affect those of the other (and conversely), a multivariate ARMA model

    y'(t) − Σ_{i=1}^{p} y'(t − i) θ_i = e'(t) − Σ_{j=1}^{q} e'(t − j) φ_j    (2.17)

seems an appropriate tentative guess to explain the data, where y(t) is the vector of observations at time t (the t-th quarter) consisting of the sales of the two products at that time, θ_i is a 2 × 2 unknown matrix which explains the dependence of each series on the other at time lag i, the e(t) are bivariate normal vectors with mean zero and unknown precision matrix P, and the φ_j's are unknown 2 × 2 moving average matrix coefficients.

Also θ_p and φ_q are nonzero matrices. Jenkins explains the Box-Jenkins methodology of identification, estimation, diagnostic checking and forecasting in building a model for the sales data.

Box and Tiao (1981) have recently expanded the univariate analysis of Box and Jenkins (1970) to the vector ARMA case, where an iterative procedure of identification, estimation, and diagnostic checking is proposed. Unfortunately, the Bayesian analysis of vector ARMA processes (which is also the situation with univariate models) has not progressed as far as the Box-Tiao methodology. With regard to the AR process, the Bayesian analysis is fairly easy to derive, but it is another story with the moving average process because its likelihood function is difficult to work with.

Other models considered in Chapter 8 are the regression model with autocorrelated errors and the changing multivariate regression model. The regression model is

    y_i' = β'x_i + e_i',    i = 1, 2, ..., n,    (2.18)

where y_i is a m × 1 observation vector, β is a m × 1 unknown parameter vector, x_i a m × m known matrix, and e_i a m × 1 random vector which follows a vector AR(1) model, namely

    e_i = φ e_{i−1} + ε_i,    i = 1, 2, ..., n,

where e_0 is known, the ε_i are independent N(0, P⁻¹) random m × 1 vectors, φ is a m × m unknown coefficient matrix, and P is an unknown precision matrix. This model is often used to model time series which have a dependence on a set of m explanatory variables and which have the same correlation structure as an autoregressive time series of order one.

The linear logistic model is in a class by itself because it is quite different from the others, since it is used to analyze counting or discrete data. Curiously, the multinomial logit model has not been examined, until recently by Zellner and Rossi (1982), from a Bayesian viewpoint; however, from the classical perspective a lot of work has been accomplished. For example in econometrics, Hausman and McFadden (1981) report that the multinomial logit model is the most used model specification. The multinomial logit model is the usual multinomial model but with a logit parametrization of the class probabilities. Suppose there are n independent trials and on each exactly one of three events can occur with probabilities p_1, p_2, and p_3, where

    p_i = e^{βZ_i} / Σ_{j=1}^{3} e^{βZ_j},    i = 1, 2, 3,

Z_1 = 1, Z_2 = 0 = Z_3, and β is a scalar parameter; then the joint distribution of n_1, n_2, and n_3, where n_i is the frequency of the i-th event in n trials (n = n_1 + n_2 + n_3), is multinomial.

By reparametrizing the model in terms of θ = e^β/2, Zellner (1982) and Zellner and Rossi (1982) present a convenient exact small-sample posterior analysis. Their parametrization by θ allows them to identify the inverted Beta as the conjugate class to the model; hence the posterior moments are easily computed. This is an elegant way to solve the problem, and their solution is quite significant since recently very little work on non-normal models has appeared.

The foregoing should give the reader a good idea of what models are going to appear and in what situations they are useful. The linear models appearing in this work are useful in the engineering, physical, social, mathematical, economic, and biological sciences, and the following chapters will give the reader a good idea of how to do a Bayesian statistical analysis when one of these linear models is appropriate for the analysis of the data. It is assumed one knows that a particular model is indeed appropriate for a particular experimental situation, and the question of choosing the appropriate model, although very important, is not at issue. What is important is how one implements a Bayesian statistical analysis.

BAYESIAN STATISTICAL INFERENCE

The Bayesian analysis of a statistical problem, using the general linear model, was illustrated in Chapter 1, where the posterior and predictive analysis was done on the basis of either an improper prior density or the proper conjugate prior density. In this section we will discuss just what constitutes our Bayesian analysis when one adopts a linear model for the probability model of the observations, and what Bayesian inference is, as well as what it is not. The results of this section will apply to all the linear models which were introduced in the first part of this chapter. In what is to follow, we will look at the history of Bayesian inference, the subjective interpretation of probability, the main ingredients of a Bayesian analysis, the various types of prior information, the implementation of prior information, and the advantages and disadvantages of Bayesian inference.



Some History of Bayes Procedures

Our subject's beginning originated with the Rev. Thomas Bayes, who published only two papers on the subject. His "An Essay Toward Solving a Problem in the Doctrine of Chances" was published in 1764 and was concerned with the problem of making inferences about the parameter θ of n Bernoulli trials. Bayes assumed θ was uniformly distributed (by construction of a "billiard" table) and presented a way to compute P(a < θ < b | x = p), where x is the number of successes among the n trials and a and b are constants. Of course, this is now recognized as the calculation, a posteriori, that the probability of a success lies in some interval (a, b). Bayes gave two contributions in his paper: (1) the calculation for P(a < θ < b | x = p), called Bayes theorem, and (2) the principle of insufficient reason. The first contribution is widely accepted; however, the second is quite controversial.

Some have interpreted the principle of insufficient reason to mean that if nothing is known about the values of θ, then one value of θ is as likely as another, and hence a uniform prior density is appropriate for θ. According to Stigler (1982), this is not what Bayes really meant, but instead knowing nothing about θ really means the marginal distribution of x is uniform. In any case, such controversy over Bayes' article is not unusual. Beginning with Laplace (1951), many people have commented on the essay of Bayes, including Pearson (1920), Fisher (1922), and Jeffreys (1939).

In the essay, Bayes used a uniform prior density for θ, but it was constructed to be so, because θ was the horizontal component of the position of a ball tossed at random on a flat table; hence the prior distribution of θ can be given a frequency interpretation and there is very little controversy. The interesting question is that if one cannot use a frequency interpretation for the distribution of θ, a priori, how does one choose a prior for θ, and is it appropriate to use a uniform distribution for θ if one knows "nothing" about the possible values of θ? For example, I have a coin and its probability of landing "heads" on a toss is θ; can I use a uniform prior density for θ if I know "nothing" about θ? There are those that say "no," because if I know nothing about θ, then I know nothing about any function of θ, say g(θ) = θ², 0 < θ < 1; then θ² should also have a uniform distribution, which is impossible. For this simple problem, there have been many "solutions." Haldane (1932) proposes

    ξ(θ) = θ⁻¹(1 − θ)⁻¹,    0 < θ < 1,

and others, including Jeffreys, use

    ξ(θ) ∝ θ^{−1/2}(1 − θ)^{−1/2},    0 < θ < 1,

to express prior ignorance for θ. Thus we see that, from its inception, Bayesian inference is a controversial subject, but not much was heard about it until the twentieth century, where during the last thirty years many advances have been made.

The current activity can be divided into Bayesian decision theory and Bayesian inference, where the former uses three sources of information (in making statistical decisions), namely the sample data s, the prior density ξ(θ) of the parameters, and the loss function which measures the consequences of making an incorrect decision. On the other hand, Bayesian inference only uses the prior density of the parameters and the sample data, because with inference one is not interested in making decisions, only investigating the values of the parameters of the model or making predictions about future observations.

As mentioned earlier, the books by Lindley (1965), Zellner (1971), Box and Tiao (1973), Winkler (1972), and Press (1982) give introductions to Bayesian inference, while DeGroot (1970), Ferguson (1967), Raiffa and Schlaifer (1961), Savage (1954) and Berger (1980) deal with decision making. The early work in Bayesian statistics of this century was done by deFinetti (1930), Jeffreys (1939) and Ramsey (1931/1964). deFinetti and Ramsey develop a formal theory of subjective probability and utility which was continued by Savage (1954), Lindley (1971), and DeGroot, among others, while Jeffreys gave the foundation for Bayesian inference, which was continued by Lindley (1965), Box and Tiao (1973), Winkler (1972), Zellner (1971) and Press (1982). Most of these writers in the latter group employ improper type prior distributions for the parameters, but Raiffa and Schlaifer (1961) introduce conjugate prior distributions, which are proper probability distributions.

In recent years, Bayesian statistics has been nurtured and developed by the Seminar on Bayesian Inference in Econometrics, which has been sponsored jointly by the National Bureau of Economic Research and the National Science Foundation. This small group under the leadership of Arnold Zellner has played a significant role in fostering new ideas in Bayesian theory and methodology. One activity they sponsor is the Savage dissertation award for the best doctoral dissertation in Bayesian inference in econometrics, and another is the publication of several books devoted to new research. Among these are those edited by Aykac and Brumat (1977), Fienberg and Zellner (1975) and Zellner (1980).

We see Bayesian ideas have come a long way since Bayes; however, with regard to statistical practice it will be some time before they are widely accepted for the routine processing of data. One reason Bayesian methodology will have to wait is because non-Bayesian ideas are firmly entrenched, but another is two controversial aspects of Bayes theorem. First is the subjective interpretation of probability, which is so often (but not always) used in Bayesian inference, and the other is the assessment of prior information.

Subjective Probability

The most widely accepted interpretation of probability is the limiting frequency of an event in infinite repetitions of an experiment. Most practicing statisticians use this interpretation and are skeptical of the other interpretations, such as the classical, logical, and subjective interpretations, which is one way to classify the various approaches to probability.

Bernoulli (1713) and DeMorgan (1847) advance the idea of subjective probability in an informal way. For instance DeMorgan says, "By degree of probability we really mean, or ought to mean, degree of belief. ... Probability then, refers to and implies belief, more or less, and belief is but another name for imperfect knowledge, or it may be, expresses the mind in a state of imperfect knowledge." Formal theories of subjective probability were formulated by Ramsey (1926) and deFinetti (1937), but, according to Kyburg (1970), it was not until Savage (1954) that this interpretation began to have an impact on statisticians. A more recent formal presentation of subjective probability is given by DeGroot (1970), who axiomatically develops utility and probability.

Why is the Bayesian interested in the subjective interpretation of probability? We saw that in Bayes' original paper, the parameter θ was given a frequency interpretation, because he constructed it that way using a ball-tossing experiment, but in many applications it is difficult not to use a subjective interpretation (non-frequency) for the parameters θ of the probability model. For example, suppose θ is the variance component of a one-way random model; then treating θ as a random variable, with values occurring as some experiment is repeated, is, for some, to go too far. Barnett (1982), in his book on comparative inference, refers to this idea (giving θ a frequency interpretation) as a super experiment. Thus, in many applications, it appears unreasonable to give θ a frequency interpretation, and one must rely on his subjective interpretation that the probability distribution of θ expresses one's degree-of-belief about the values of the parameter. If θ is the probability of a coin landing "heads" on one toss, and I believe every possible value of θ is equally likely, then I would use

    ξ(θ) = 1,    0 < θ < 1,

as my density for θ, but if I believed the coin is "almost" unbiased, I would use

    ξ(θ) ∝ θ^{10}(1 − θ)^{10},    0 < θ < 1,

to express my belief. Other people would probably use other densities to express their beliefs about the values of θ, but the important thing to observe is that it is possible to express one's degree-of-belief about θ via probability density functions, and in doing so, one is employing a subjective notion of probability. Of course, it is one thing to express one's degree-of-belief about the parameter θ of a binomial experiment and quite another to do it for the variance-covariance matrix of a multinormal population.

In the formal theory of subjective probability one must be coherent in the following sense: the individual must not exhibit irrational behavior when assigning his subjective probability to events. He must follow certain rules so that it is not possible to contradict the usual rules of probability. The axioms of subjective probability begin with a preference relation ("at least as likely as," see DeGroot, 1970) between the events, and one must be consistent with his preferences. For example, if I prefer event A to event B, and prefer B to event C, then I must prefer A to C. If the axioms of subjective probability are adopted, one will assign probabilities to the events in such a way that the Kolmogorov axioms are satisfied.

Subjective probability is unavoidable, or so it seems, with the Bayesian approach to statistics, and in the remainder of this chapter we will see how to deal with it when the ideas of prior probability are explained. For an interesting and highly critical account of the frequency interpretation of probability the reader should read chapter VII of Jeffreys (1961).

The Basic Components of Inference

Of the many definitions of statistics, I like Barnett's (1982), who says statistics is "the study of how information should be employed to reflect on, and give guidance for action in, a practical situation involving uncertainty." For this book, if one eliminates "and give guidance for action in" in Barnett's definition, the resulting statement is what this book is about, because this book considers inference (not decision making), and inference is an act of reflection about the parameters of the probability model, among other things.

When one does Bayesian statistical inference, one is using prior information and sample data in order to find the values of the parameters in the model which generated the data. The components of statistical inference consist of the prior information, the sample data, calculation of the posterior density of the parameters, and sometimes calculation of the predictive distribution of future observations. The prior information is expressed by a probability density ξ(θ), θ ∈ Ω, of the parameters θ of the probability model f(x|θ), θ ∈ Ω, x ∈ S, where f is a density of a random variable x and S is the support of f. The information in the data x_1, x_2, ..., x_n, where x_1, x_2, ..., x_n is a random sample from a population with density f, is contained in the likelihood function ℓ(θ|x_1, x_2, ..., x_n), θ ∈ Ω, which is the joint density of the sample data; this is then combined with the prior density of θ, by Bayes theorem, to give the posterior density of θ. Of course, these ideas were shown in Chapter 1; thus one may describe an inference problem in terms of (S, Ω, ξ(θ), f(x|θ)), and this problem is solved once the posterior density

    ξ(θ|x_1, x_2, ..., x_n) ∝ ℓ(θ|x_1, x_2, ..., x_n) ξ(θ),    θ ∈ Ω,

is calculated.

From the posterior density one may make inferences for θ by examining the posterior density. Some prefer to give estimates of θ, either point or interval estimates, which are computed from the posterior distribution. If θ is one-dimensional, a plot of its posterior density tells one the story about θ, but if θ is multidimensional one must be able to isolate those components of θ one is interested in.

The following diagram illustrates the approach to statistical inference which will be adopted in this book:

    THE MATHEMATICAL MODEL  f(x|θ),  x ∈ S,  θ ∈ Ω
        ↓
    THE PRIOR INFORMATION  ξ(θ)
        ↓
    THE SAMPLE DATA
        ↓
    THE POSTERIOR DISTRIBUTION, VIA BAYES THEOREM  ξ(θ|x_1, x_2, ..., x_n)
        ↓
    POSTERIOR INFERENCES:
        ESTIMATION (POINT, INTERVAL)
        TESTS OF HYPOTHESIS (HPD REGIONS FOR PARAMETERS)
        PREDICTION OF FUTURE OBSERVATIONS (POINT, INTERVAL)



Thus one begins the investigation with a probability model, which tells one how the observations s_n will be distributed (for each value of the parameter), and one's prior (before the data) information is expressed with the density ξ(θ); after s_n is observed, the posterior density of θ can be calculated, via Bayes' theorem. The foundation of Bayesian inference is the posterior distribution of θ, because all other aspects of inference depend on this distribution. Perhaps one is more interested in estimation, in which case one must first decide what components of θ are of primary importance and the nuisance parameters eliminated. On the other hand, if testing hypotheses is one's main objective, then one must decide how to conduct the test, i.e., to use an HPD region or a posterior odds ratio. Forecasting future observations will be done with the Bayesian predictive density f(y_{n+1}|s_n), where y_{n+1} is a future observation, y_{n+1} ∈ S, and one may provide either point or interval (region) predictions.

It is obvious this is an idealization of the actual way in which one would carry out a statistical analysis. For example, how does one adopt a probability model (say an AR model for a time series) for the observations, and how does one assess the prior information which is available for the parameters? These and other questions are to be examined in the following sections.

One difficulty with advocating such an ideal system of inference is that it ignores the way scientific learning progresses. Box (1980) describes, by means of a Bayesian model, the iterative process of criticism and estimation for building statistical models, and he uses the predictive distribution as a diagnostic check on the probability model. The diagram of inference, presented here, ignores that important aspect of model building; that is, it is assumed the probability model is "correct" and is not to be changed during the analysis. Box's diagnosis should be carried out in practice but will not be adopted here, because it would carry us too far afield from our main goal, which is to demonstrate the technical mechanics of making posterior inferences. In any case, one will benefit from reading the Box paper.
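The diagram can be made concrete with a toy calculation (an added illustration with made-up data, not an example from the book): the posterior density of a Bernoulli parameter θ is computed on a grid as the normalized product of the likelihood and a uniform prior.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])        # hypothetical Bernoulli sample data
theta = np.linspace(0.001, 0.999, 999)        # grid over the parameter space (0, 1)

prior = np.ones_like(theta)                   # xi(theta) = 1, a uniform prior
loglik = x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1.0 - theta)
post = prior * np.exp(loglik - loglik.max())  # Bayes theorem, up to a constant
post /= np.trapz(post, theta)                 # normalize so it integrates to one

post_mean = np.trapz(theta * post, theta)     # one possible point estimate of theta
```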

The Probability Model

How does one choose a probability model f(x|θ) to represent the data generating mechanism? In our context, how does one choose one of the many linear models which are described in the earlier parts of the chapter? This is a separate area of statistics and we will not deal with it here. It will be assumed a particular model is indeed appropriate and that it was chosen either from subject matter considerations or from diagnostic checks. In addition to the classical goodness-of-fit procedures to check model adequacy, there are Bayesian methods of choosing a model from a class of models. It is a direct application of Bayes theorem and is given by Zellner (1971, page 292). It will be assumed here that one has correctly chosen a parametric family of densities {f(x|θ); x ∈ S, θ ∈ Ω} which correctly represents the sample data, and our problem is to find out more about the parameter θ. Or it may be the problem of choosing one particular density (or subset of densities, as in some hypothesis testing problems) from this class. For example, suppose one is sure an AR(p) process is appropriate for some time series data {y(t): t = 1, 2, ..., n}; then how should one choose p? This identification problem is analyzed in Chapter 5. Here we have assumed the AR(p) class of models is appropriate and one may describe the class by a parametric family of Gaussian densities. Our problem is to learn more about p (the order) as well as the other parameters (the autoregressive coefficients and noise variance) of the model and to forecast future observations.

Prior Information

Of all the various aspects of Bayesian inference, this is the most difficult and controversial. Many object to treating unobservable parameters θ as random variables and giving them subjective probability distributions or densities ξ(θ); however, not many people object to giving observable quantities x a frequency-type probability distribution. This is probably so because one is so used to doing it this way that one does not question this part of the model. To assume x is a random variable with a frequency distribution is to assume the existence of an unending sequence of repetitions of the experiment, and of course, since such repetitions in fact do not occur they must be imagined to occur; hence to assume x has a frequency type probability distribution is no more difficult (or is as difficult) than it is to assume θ has some (subjective) probability distribution. In any case, to use the Bayesian approach one must have a probability distribution for θ which expresses one's information about the parameters before the data are observed. Phrased another way, one can say that to adopt a frequency probability distribution for x and a subjective probability law for θ are both quite subjective activities. However, the probability law of x is more objective in the sense that x can be observed, and if observed its probability law can be checked against the data.

Choosing a prior density for the parameters has an interesting history. We have seen that Bayes chooses a prior density for the parameter θ of a Bernoulli sequence by constructing a "billiard" table and that ξ(θ) (uniform over (0, 1)) had a frequency interpretation. In those situations where θ cannot be given a frequency interpretation, how does one choose a prior for θ, if one knows nothing about θ? Stigler (1982) argues that Bayes would choose ξ(θ) to be uniform because the marginal distribution of x (the total number of successes in n trials) has a discrete uniform distribution over 0, 1, 2, ..., n, and a continuous uniform prior density for θ implies x has a discrete uniform distribution. This justification for a uniform prior for θ is quite different than the usual interpretation of Bayes' choice of a uniform prior for θ. Many have used the principle of insufficient reason to put a uniform distribution on the parameters. The principle of insufficient reason according to Jeffreys (1961) is "If there is no reason to believe one hypothesis rather than another, the probabilities are equal," and one may choose a uniform prior density for the parameters; however, if one knows nothing about θ, nothing is known about any function of θ, thus any function of θ should have a uniform prior density (according to the principle of insufficient reason), and hence one is led to a contradiction in using the principle. Jeffreys has formulated a theory of choosing prior densities based on rules of invariance in situations where "nothing" is known about the parameters. For example, if θ_1 and θ_2 are the mean and standard deviation of a normal population, then Jeffreys' improper prior is

    ξ(θ_1, θ_2) ∝ 1/θ_2,    θ_1 ∈ R,  θ_2 > 0,

which implies θ_1 and θ_2 are independent, that θ_1 has a constant marginal density over R, and that the marginal density of θ_2 is 1/θ_2, θ_2 > 0. This means every value of θ_1 is equally likely and that smaller values of the standard deviation are more likely than larger values.

This type of prior was used in Chapter 1 and will be used many times in the following chapters. Often Jeffreys' prior leads to confidence intervals which are identical to those constructed from a non-Bayesian theory, but of course the interpretation of the two would be different. To some Bayesians this perhaps is an advantage of the Jeffreys approach but to others a liability. Whatever the advantages or disadvantages, the Jeffreys improper prior has been used over and over again by Box and Tiao (1971), Zellner (1971), Lindley (1965) and many others, but Lindley now prefers to use proper prior densities instead of Jeffreys' improper prior distributions, because their use can lead to logical inconsistencies, which was demonstrated by Dawid et al. (1973). The issue is still being debated, and Jaynes (1980) reportedly refutes the Dawid argument.

Conjugate densities are proper distributions which have the property that the prior and posterior densities both belong to the same parametric family; thus they have the desirable property of mathematical convenience. Raiffa and Schlaifer (1961) developed Bayesian inference and decision theory with conjugate prior densities, and their work is a major contribution to statistical theory. DeGroot (1970) also gives an excellent account of conjugate prior distributions and shows how to use them with the usual (popular) population models such as normal, Bernoulli, and Poisson. Press (1982), Savage (1954), and Zellner (1971) also consider conjugate densities. Considering Bayes' original paper about a Bernoulli population parameter θ, the conjugate class is the two-parameter beta family with density

    ξ(θ|α, β) ∝ θ^{α−1}(1 − θ)^{β−1},    0 < θ < 1,  α > 0,  β > 0,

and one must choose values for α and β in order to express prior information. This class is quite flexible because it contains a large number and variety of distributions, but this is not a general property of a conjugate class. The Wishart class is conjugate to the multinormal population with an unknown precision matrix and is not very flexible (as compared to the beta family). See Press (1982) for a discussion of the Wishart and related distributions. The main problem with employing a conjugate class is that one must choose the parameters (hyperparameters) of the prior distribution. With the beta one must choose two scalar hyperparameters, and with the p-dimensional normal distribution one must find p + p(p + 1)/2 hyperparameters.

Still another principle in choosing prior distributions is that of precise measurement, which "is the kind of measurement we have when the data are so incisive as to overwhelm the initial opinion, thus bringing a great variety of realistic initial opinions to practically the same conclusion" (Savage, 1954, page 70). In choosing a prior on the basis of this principle, the normalized likelihood function is approximately the same as the posterior distribution, and the prior does not have much of an effect on the posterior distribution.

After this brief history of prior information, we are still left with the problem of how to choose a prior distribution for the parameters of the model, that is, how does one express what one actually knows about θ? Apparently it is easier to model the observations than it is to model the parameters of the model of the observations. As statisticians we are familiar with choosing various probability models for different experimental or observational conditions, and it becomes somewhat routine in some environments. For example, regression models and models for designed experiments are clearly appropriate in many situations, but when it comes to adopting a model for the parameters of the probability model (such as for the regression coefficients in an autoregressive process) it is another story. Of course, one should not always use a Jeffreys' improper prior density or a conjugate class, or the principle of precise measurement, but try to accurately represent the prior information.
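For the beta conjugate class just described, combining the prior with a binomial likelihood gives another beta density, so the posterior is available in closed form. A short sketch (the hyperparameter values and data are hypothetical):

```python
from scipy import stats

alpha, beta = 2.0, 2.0          # hypothetical hyperparameters of the beta prior
successes, n = 7, 10            # hypothetical Bernoulli data: successes in n trials

# conjugacy: beta(alpha, beta) prior + binomial likelihood -> beta posterior
posterior = stats.beta(alpha + successes, beta + (n - successes))

print(posterior.mean())          # posterior mean of theta
print(posterior.interval(0.95))  # a central 95% posterior interval
```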



Improper prior densities or a conjugate class of densities are used throughout this book to represent prior information, although this does not mean one should always use these two types of prior information. They are used because they combine nicely with the likelihood function.

Why is it so difficult to choose prior densities for the parameters? Possibly because we are so used to thinking in terms of observable random variables that we have not developed a feeling or intuition about what the parameters represent in our probability models.

The approach the author favors is to choose the prior density in terms of the marginal distribution of the observations (predicted or past data). This approach is referred to as the predictive method or the device of imaginary results (Stigler, 1982, page 254) and is exploited by Good (1950, page 35), Geisser (1980), Kadane (1980), Winkler (1977), and Kadane et al. (1980). In a recent paper, Stigler (1982) argues that this "device of imaginary results" was what Bayes did in his justification for the choice of a uniform prior density for θ, where θ is the parameter of a binomial experiment and θ "must" be given a subjective probability distribution.

The device of imaginary results is thoroughly explained by Stigler (1982), and what follows is his description. Consider the usual parametric model f(x|θ), the conditional density of an observation x when θ is the value of the parameter and Ω is the parameter space; then the marginal density of x is

    p(x) = ∫_Ω f(x|θ) ξ(θ) dθ,    x ∈ S,

where S is the sample space and ξ the prior density of θ, θ ∈ Ω. The device of imaginary results implies that one should choose ξ(θ) in such a way that it is compatible with one's choice of p(x), x ∈ S. How does one choose p(x)? One can either "fit" p to imaginary predictions of future values of the observation or fit p to past data which were collected under the same circumstances as the future observations will be observed. As Winkler (1980) notes, it is much easier for the experimenter to predict future values of x than it is to think directly in terms of θ.

In the binomial case considered by Bayes (1963), a uniform discrete distribution of x implied ξ(θ) was uniform on (0, 1); that is,

    p(x) = 1/(n + 1),    x = 0, 1, 2, ..., n,

implies ξ(θ) = 1, 0 < θ < 1, where

    f(x|θ) = (n choose x) θ^x (1 − θ)^{n−x},    x = 0, 1, 2, ..., n.

Of course if ξ(θ) is uniform then p(x) is uniform, but the converse is not necessarily true.

The prior predictive density p(x) allows us to assess prior information about θ from prior knowledge about x instead of contemplating directly the prior density ξ(θ). If ξ depends on hyperparameters a ∈ A, then the predictive density is

    p(x|a) = ∫_Ω f(x|θ) ξ(θ|a) dθ,    x ∈ S,  a ∈ A,

and one may use this integral equation to find values of a which support or fit predicted observations x_1, x_2, ..., x_n (sampled from the population with density p(x|a)) or to fit past observations which were sampled from a population with density p(x|a). Suppose in the binomial example

    ξ(θ|a) ∝ θ^{a_1 − 1}(1 − θ)^{a_2 − 1},    0 < θ < 1,  a_1 > 0,  a_2 > 0;

then the prior density belongs to the conjugate class, and the prior predictive density of x is

    p(x|a_1, a_2) ∝ Γ(a_1 + x) Γ(n + a_2 − x),    x = 0, 1, 2, ..., n.
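Anticipating the fitting step described next, one simple way to choose a_1 and a_2 is to match the mean and variance of the beta-binomial prior predictive above to those of a set of imagined future counts. The sketch below does this by the method of moments; the imagined data are placeholders, and the moment formulas are the standard beta-binomial ones.

```python
import numpy as np
from scipy.optimize import fsolve

n = 20
imagined = np.array([8, 15, 11, 16, 9, 13])        # imagined counts out of n trials (placeholders)
m, v = imagined.mean(), imagined.var(ddof=1)

def moment_equations(a):
    a1, a2 = a
    s = a1 + a2
    mean = n * a1 / s                               # beta-binomial mean
    var = n * a1 * a2 * (s + n) / (s ** 2 * (s + 1))  # beta-binomial variance
    return [mean - m, var - v]

a1, a2 = fsolve(moment_equations, x0=[2.0, 2.0])    # hyperparameters compatible with the imagined data
print(a1, a2)
```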

If one observes values x_1, x_2, ..., x_n from this distribution, one may find values of a_1 and a_2 compatible with these predictions by the method of moments or maximum likelihood or some other principle of estimation. One might put a known prior density on a_1 and a_2 and estimate a_1 and a_2 from the conditional distribution of a_1 and a_2 given the predicted observations.

Winkler (1980) shows how this method works with the general linear model of Chapters 1 and 3 and develops the conjugate prior density for the parameters of a multiple linear regression model

    y = Σ_{i=1}^{k} θ_i x_i + e,



where y is the dependent variable, θ_i is an unknown regression coefficient, x_i is the i-th independent variable, and e is n(0, σ⁻²). If σ² is known, the conjugate class is normal with mean, say, b and covariance matrix σ⁻²V, where b is k × 1 and V is k × k; then the predictive distribution of y is normal with mean and variance

    E(y|σ², x) = b'x,    Var(y|σ², x) = σ⁻²(x'Vx + 1),

where x' = (x_1, x_2, ..., x_k) and b' = (b_1, b_2, ..., b_k). Winkler shows how one may assess this predictive distribution, thus choosing values for b and V, the hyperparameters of the prior distribution of θ = (θ_1, θ_2, ..., θ_k)'. Given x, one predicts the mean and variance of y for k(k + 1)/2 values of x, where k of the equations (assessments) allow one to find b, and all k(k + 1)/2 allow one to find V. Winkler

also shows how to deal with the case when σ² is unknown, and Kadane et al. (1980) develop a sophisticated but complicated method of fitting the hyperparameters of prior distributions.

The advantage of the predictive approach is that the experimenter is better able to think about an observable random variable than he is (directly) about an unobservable parameter θ. A disadvantage is that one must usually restrict prior information to a parametric family such as a conjugate class of distributions, and the class may not be as flexible as one desires. Perhaps this can be avoided by a mixture of densities from the conjugate class. The principle of imaginary results appears to be a very promising way to assess prior information, and it should be used in practice to see if experimenters and other users of statistical techniques are willing to adopt it.

One of the latest developments in assessing prior information is the idea that the parameters of the model are exchangeable. The idea of exchangeability was introduced by deFinetti (1937) (see Kyburg and Smokler, 1964) and has been applied by Lindley and Smith (1972) and Leonard (1972) for the Bayesian analysis of linear models and binomial probability laws.

Let us consider an example given by Lindley and Smith (1972) where y_1, y_2, ..., y_n are independent, normally distributed, have a known variance, and E(y_i) = θ_i, i = 1, 2, ..., n, where the means θ_1, θ_2, ..., θ_n are unknown real parameters and very little is known about the means. Since very little is known about the means, it perhaps is reasonable to assume the means are exchangeable, that is, one's prior information about θ_1 is the same as that about θ_2 or any other θ_i, and that in general one's prior knowledge about θ_1, θ_2, ..., θ_n is invariant under any permutation of the indices. An exchangeable sequence can arise as a mixture of independent identically distributed random variables, given some hyperparameter μ, say; that is, the density of θ = (θ_1, ..., θ_n)' is

    g(θ) = ∫ ∏_{i=1}^{n} p(θ_i|μ) dQ(μ),

where p(θ_i|μ) is the conditional density of θ_i given μ, Q(μ) the distribution function of μ, and p and Q are arbitrary.

Thus θ_1, θ_2, ..., θ_n (given μ) constitute a random sample of size n from a population with density p, and μ is a random variable with distribution function Q(μ). Thus, for example, we might let p be normal with mean μ and known variance τ⁻¹ and give μ some known probability distribution, or we could have the prior distribution of μ depend on unknown hyperparameters and put a known distribution on them and continue in a hierarchical fashion.

Lindley and Smith next consider a randomized block layout of b blocks in each of which t treatments are assigned; then the model is

    y_ij = μ + α_i + β_j + e_ij,

where y_ij is the observation on the experimental unit of block j to which treatment i was applied, the errors are independent n(0, σ⁻²), and one must assign a prior distribution to μ, α_1, ..., α_t, β_1, β_2, ..., β_b.

If the treatment constants are exchangeable and also the block parameters, and they are independent, then one perhaps would assign α_i ~ n(0, σ_α⁻²), β_j ~ n(0, σ_β⁻²) and μ ~ n(0, σ_μ⁻²), where these distributions are independent. If the variances σ_α⁻², σ_β⁻², and σ_μ⁻² are known, one stops here at this stage and proceeds with the posterior analysis. Also, one might consider these variances as unknown and give them some known distribution, and this is also considered by Lindley and Smith.

The idea of exchangeability has played an important role in the Bayesian analysis of linear models because deFinetti's work provides a mathematical justification for selecting a prior distribution when



a priori symmetry is present about the parameters. It imposes a reasonable restriction on the prior probability distribution when our knowledge of the parameters exhibits this peculiar symmetric property, but it usually doesn't tell us what particular parametric family is the appropriate one to use. For example, in the above two problems exchangeability of the model parameters does not imply a normal prior density for them, although these parameters are indeed exchangeable.

This section has reviewed some of the ways one may express prior information about the parameters. We have looked at vague prior information, conjugate prior densities, the principle of insufficient reason, the principle of imaginary results, and exchangeable prior information. Also, empirical methods of assessing prior information were discussed; thus it is easy to see that much has been developed in the area of prior information but that also much remains to be accomplished.

The Posterior Analysis

Suppose one is now satisfied that the probability model, f(x|θ), for the observations and the prior density, ξ(θ), for the parameters are appropriate, and one must determine the plausible values of θ. One now enters the posterior analysis stage of a Bayesian statistical analysis, where one perhaps might want to estimate θ (or some of its components), test hypotheses about θ, predict future values of the observations, or possibly control the observations at predetermined levels. Regardless of what particular activity is contemplated, one must first find the posterior density of θ,

    ξ(θ|x) ∝ f(x|θ) ξ(θ),    θ ∈ Ω ⊂ R^p,

which is found by Bayes theorem. Often all the components of θ are of interest, but sometimes some of these components θ_1 are regarded as nuisance parameters and the remaining θ_2 (q × 1) are of primary interest. How should one make inferences about θ_2? The Bayesian will use

    g(θ_2|x) = ∫ ξ(θ_1, θ_2|x) dθ_1,

where θ_1 is (p − q) × 1. g is called the marginal posterior density of θ_2, and as with any posterior density function it may be used to estimate and test hypotheses concerning θ_2. To estimate θ_2, either the marginal mean, median, or mode, or any other typical or representative value of θ_2, can be computed from the marginal posterior density of θ_2. Of course these estimates are summary characteristics of ξ(θ_2|x) and only tell us part of the story about the parameter;

they are not a substitute for the entire distribution.

Consider an experiment which has been conducted according to a randomized block design (see Chapter 3); then the primary parameters are the treatment effects, but the block effects and error variance are regarded as nuisance parameters. Then the Bayesian would test for significance of treatment effects by referring to the marginal posterior distribution of those parameters.

To test hypotheses about the parameters θ_2, one perhaps would find a 1 − γ HPD (highest posterior density) region R_γ(θ_2), 0 < γ < 1, from ξ(θ_2|x). Such a region has the property that if θ_2' ∈ R_γ(θ_2) and θ_2'' ∉ R_γ(θ_2), then ξ(θ_2'|x) ≥ ξ(θ_2''|x); that is, parameter values inside the region have larger posterior probability density than those excluded from the region, which must satisfy

    1 − γ = ∫_{R_γ(θ_2)} ξ(θ_2|x) dθ_2,

that is, the HPD region has posterior probability content 1 − γ. Box and Tiao (1971) and Chapter 1 give other properties of HPD regions, and they are used in Chapter 3. To test the hypothesis H_0: θ_2 = θ_20 (known) versus H_a: θ_2 ≠ θ_20, one rejects H_0 if θ_20 (the hypothesized value) is excluded from the region. In the case of a randomized block design, one tests for treatment effects by finding an HPD region for the treatment contrasts.

The other commonly used way to test hypotheses is based on the posterior odds ratio, but it is not implemented in this book. DeGroot (1970), Zellner (1971), and others use the odds ratio; however, Box and Tiao (1971) do not. Suppose θ_2 is scalar, that Ω_2 is some nondegenerate interval of the real line, and that H_0: θ_2 ∈ (a, b) ⊂ Ω_2, H_a: θ_2 ∉ (a, b), θ_2 ∈ Ω_2; then the posterior odds ratio is

    O(H_0|H_a, x) = P{θ_2 ∈ (a, b) | x} / P{θ_2 ∈ Ω_2 − (a, b) | x},

while

    P{θ_2 ∈ (a, b) | x} = ∫_a^b ξ(θ_2|x) dθ_2,

and the larger the ratio, the stronger the indication that the null hypothesis is true. If the null hypothesis is H₀: θ₂ = θ₂₀ versus Hₐ: θ₂ ≠ θ₂₀, Jeffreys (1939/1961) expresses prior information about θ₂ in terms of a mixed (part discrete, part continuous) distribution made up of a discrete prior probability ξ(θ₂₀) at θ₂ = θ₂₀ and a density over the other values of θ₂; the posterior odds ratio for the null versus the alternative is then

ξ(θ₂₀|x) / ∫_{Ω₂ − {θ₂₀}} ξ(θ₂|x) dθ₂ = O(H₀|Hₐ, x).

For an up-to-date account of Bayesian hypothesis testing see Bernardo (1980) and Barnett (1982). Since the posterior odds ratio is still being developed, it will not be used to test hypotheses; instead, the HPD region method, which was formulated by Lindley (1965), will be employed. Lindley advocates the HPD region only when a vague prior density is appropriate. However, it is not an unreasonable procedure with any prior probability density function, such as a member of a conjugate class, and this is sometimes done (see Chapter 3) in this book.

For almost all the linear models of this book, estimation is accomplished by formulas for the posterior mean and variance of the parameters of the model. For example, Chapter 4 examines the random and mixed linear models, where the marginal distribution of each variance component is found (approximately) and formulas for the mean and variance are derived. In this way each component can be estimated, and an idea of the uncertainty in the estimate is given by the marginal posterior variance. Unfortunately, it was not possible to find estimates for the fixed effects of the model, because their marginal posterior distribution cannot be determined analytically and must be determined numerically, as was done by Box and Tiao (1973). Another example of estimation is the Kalman filter, which is the mean of the marginal normal posterior distribution of the successive states of a linear dynamic system. This particular method of estimation is explained in Chapter 6.

Sometimes all of the parameters of the model are nuisance parameters, in the sense that one is mainly interested in predicting future observations or controlling the future values of the dependent variable of a regression analysis or time series.
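Returning to the HPD region method adopted above, the short sketch below (an editorial illustration, not part of the original text) computes an approximate 1 − γ HPD interval for a scalar parameter from its posterior density evaluated on a grid; the beta posterior used as input is only an invented stand-in, assumed purely for demonstration.

```python
import numpy as np
from math import gamma as gamma_fn

def hpd_interval(theta, density, coverage=0.95):
    """Approximate HPD interval for a scalar parameter.

    theta   : equally spaced grid of parameter values
    density : posterior density (or kernel) evaluated on the grid
    Returns the shortest interval with roughly `coverage` posterior content.
    """
    d_theta = theta[1] - theta[0]
    prob = density * d_theta
    prob = prob / prob.sum()          # normalize numerically
    # Keep grid points in order of decreasing density until the required
    # probability content is reached (the defining HPD property).
    order = np.argsort(density)[::-1]
    cumulative = np.cumsum(prob[order])
    keep = order[: np.searchsorted(cumulative, coverage) + 1]
    return theta[keep].min(), theta[keep].max()

# Stand-in posterior: a beta(3, 9) density (invented example).
grid = np.linspace(1e-6, 1 - 1e-6, 2001)
beta_pdf = grid**2 * (1 - grid)**8 / (gamma_fn(3) * gamma_fn(9) / gamma_fn(12))
print(hpd_interval(grid, beta_pdf, coverage=0.95))
```

For a unimodal posterior such as this one, the retained grid points form a single interval, which is the 95% HPD interval.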


Bayes Forecasting

Suppose one observes a series of observations y(1), y(2), ..., y(n) and one wants to forecast the next observation y(n + 1); the generating mechanism of the series is known, and consequently one "knows" the joint density f[y(1), y(2), ..., y(n)|θ], θ ∈ Ω, of the observations and the prior density for θ, say ξ(θ). Then how does the Bayesian predict the next observation y(n + 1)? The standard way is to find the conditional distribution of y(n + 1) given y(1), y(2), ..., y(n); this is called the Bayesian predictive distribution of y(n + 1) and is

f[y(n + 1)|y(1), ..., y(n)] = ∫ f[y(n + 1)|y(1), y(2), ..., y(n), θ] ξ[θ|y(1), ..., y(n)] dθ,

where the first factor of the integrand is the conditional density of y(n + 1) given θ and the previous observations, and the second factor is the posterior density of θ. The predictive density is the average of the conditional predictive density of y(n + 1), given θ, with respect to the posterior distribution of θ; that is, if one knew θ, one would predict y(n + 1) with f[y(n + 1)|y(1), y(2), ..., y(n), θ], but since θ is unknown this density is averaged over the posterior distribution of θ, and a Bayesian forecast consists of an average of conditional predictions.

We have seen how this approach is used with the general linear model of Chapter 1. In what is to follow, the Bayesian predictive density is found for the time series models of Chapter 5 and the linear dynamic model of Chapter 6. Often with linear dynamic systems it is important to control the states (parameters) of the system at predetermined levels, and to do this one must be able to forecast the future observations at each stage in the evolution of the system. The Bayesian predictive density plays an important role in the control and monitoring of physical and economic systems, and its use will be demonstrated in Chapters 5 and 6.

Press (1982) and Zellner (1971) consider a control problem which is not discussed in this book. Suppose Y = Xθ + Zγ + e is a linear model, where Y is the vector of observations, X is a matrix consisting of the observations on independent variables which cannot be controlled, θ is an unknown parameter vector, Z a matrix consisting of the values of control variables, γ another unknown parameter vector,


and e an error vector. Under the assumption of normally distributed errors and a vague prior for the parameters, Zellner shows how to choose the control variable Z so that Y is "close" to some predetermined value. He adopts three control strategies to do this, but in all three the Bayesian predictive density plays a crucial role. These problems, although important, will not be examined because the author is unable to add anything new and Zellner's account is excellent.

The type of control problem which occurs in Chapter 6 is similar to the following problem. Suppose {y(t): t = 1, 2, ..., n, ...} is a time series which follows the law

y(t) = θ(t)y(t − 1) + e(t)

and

θ(t) = Gθ(t − 1) + H(t)x(t − 1) + ε(t),

where t = 1, 2, .... This is a dynamic AR(1) process where the time-dependent autoregressive coefficients follow an AR(1) law. The observation errors e(t) and the system errors ε(t) are independent normally distributed random variables; y(0) is known, G is known, H(t) is known, and at the initial stage, assume θ(0) is known. Among these and other assumptions, the problem is to control θ(1) at some level, say T(1), but to do this one must have an idea of the value of θ(1) (as a function of H(0)); thus one must predict θ(1) (one step ahead). Since the posterior mean of θ(1) depends on y(1), to predict θ(1) one averages this posterior mean over the Bayesian predictive distribution of y(1), because y(1) has yet to be observed. This procedure is applied at each stage in the evolution of the system. Linear dynamic systems are very interesting models to study from a Bayesian perspective, because in order to analyze them one must use the complete arsenal of Bayesian methodology.

Once the Bayesian predictive density is found, one predicts the future value y(n + 1) by examining this distribution as one would do in making inferences for the parameters from their posterior distribution. If one wants one value to forecast y(n + 1), perhaps the mean, median, or mode will do, and if one wants an interval forecast, one could construct an HPD (highest predictive density) region for y(n + 1), analogous to the way one would find a region of highest posterior density for the parameters of the model.

The Bayesian predictive distribution for y(n + 1) turns out to be a univariate, multivariate, or matrix t (depending on the dimension of y(n + 1)) for most of the linear models, with the exception of the linear models exhibiting structural change, where the predictive distribution is a mixture of t distributions. Other exceptions are the moving average process, the ARMA model, and linear dynamic models.
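The averaging that defines the predictive density can be mimicked by simulation. The sketch below is an added illustration, not the book's method: it forecasts y(n + 1) for an ordinary AR(1) model with known unit error precision by averaging the conditional prediction over draws from the posterior of the autoregressive coefficient. The flat prior, the simulated data, and the specific numbers are all assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative series generated from y(t) = 0.6 y(t-1) + e(t), e(t) ~ n(0, 1).
n = 100
y = np.zeros(n + 1)
for t in range(1, n + 1):
    y[t] = 0.6 * y[t - 1] + rng.standard_normal()

# With a flat prior on theta and known unit error precision, the posterior of
# theta is normal with the least-squares mean and precision sum of y(t-1)^2.
lagged, current = y[:-1], y[1:]
post_prec = np.sum(lagged**2)
post_mean = np.sum(current * lagged) / post_prec

# Bayesian forecast of y(n+1): average the conditional prediction
# f[y(n+1) | y(1),...,y(n), theta] over the posterior distribution of theta.
theta_draws = rng.normal(post_mean, 1.0 / np.sqrt(post_prec), size=20000)
predictive_draws = rng.normal(theta_draws * y[-1], 1.0)

print("point forecast (predictive mean):", predictive_draws.mean())
print("90% interval forecast:", np.quantile(predictive_draws, [0.05, 0.95]))
```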

SUMMARY

This chapter gives the reader a chance to see what this book is about. First, a description of each class of models is given, with some applications provided illustrating their use. For example, we have seen that linear dynamic models are sophisticated ways for modeling the monitoring and control of many physical and economic variables, and that models for structural change give one a degree of flexibility not possessed by the others.

Secondly, we see the way a Bayesian would analyze data which were generated by one of the linear models. The components of an analysis, namely the probability model, prior information, and the posterior and predictive analysis, were carefully explained. Problems with choosing prior information were discussed, and it was decided that the "principle of imaginary results" was a promising approach to assessing prior information, but that there was no easy and "quick" solution to this vexing problem. Further research along the lines suggested by Winkler (1980) and Kadane et al. (1980) should enhance the way one expresses one's prior information about the parameters, and if these ideas can be made operational so that they become standard in the routine problems of regression, the analysis of variance, and time series, the Bayesian methodology of this book will easily be implemented.

EXERCISES

1.  If x₁, x₂, ..., xₙ is a random sample from a population with mass function

    f(x|θ) = θ^x (1 − θ)^(1−x),   x = 0, 1,   0 < θ < 1,

    and

    ξ(θ) ∝ θ^(α−1) (1 − θ)^(β−1),   0 < θ < 1,   α > 0,   β > 0,

    is the prior density of θ, find:
    (a) The posterior density of θ,
    (b) The predictive distribution of xₙ₊₁, and
    (c) The joint predictive distribution of xₙ₊₁, xₙ₊₂.
    (d) If n = 4, x₁ = 1, x₂ = 1, x₃ = 0, x₄ = 0, and α = 1 = β, plot the posterior density of θ and the predictive mass function of xₙ₊₁.

2.  Consider an exponential population with density

    f(x|θ) = θe^(−θx),   x > 0,   θ > 0,

    and let x₁, x₂, ..., xₙ be a random sample.
    (a) Examine the likelihood function for θ (based on x₁, x₂, ..., xₙ) as a function of θ, where θ > 0. What type of function is it?
    (b) Upon examining the likelihood function, what is a conjugate prior density for θ?
    (c) Using the conjugate prior for θ, derive the posterior density of θ and the predictive density of xₙ₊₁.
    (d) Suppose, a priori, that previous values of x have been observed, namely x₁* = 1, x₂* = 6, and x₃* = 4. What values would you use for the hyperparameters of the conjugate prior density?

3.  Let x₁, x₂, ..., xₙ be a random sample from a Poisson population

    f(x|θ) = e^(−θ) θ^x / x!,   x = 0, 1, 2, ...,

    and suppose the mean θ has prior density

    ξ(θ) ∝ θ^(α−1) e^(−θβ),   α > 0,   β > 0,   θ > 0.

    (a) Is the above a conjugate prior density?
    (b) Using the above prior density, find a HPD region for θ.
    (c) Find the predictive mass function of xₙ₊₁.
    (d) Give an interval forecast of xₙ₊₁.

4.  Let x₁, x₂, ..., xₙ be a random sample of size n from a population with density

    f(x|θ₁, θ₂) = (θ₂ − θ₁)^(−1),   θ₁ < x < θ₂,

    where θ₁ and θ₂ are unknown real parameters. Give a conjugate prior density for θ₁ and θ₂ and use it to find the predictive density of xₙ₊₁. Refer to Chapter 9 of DeGroot (1970).

5.  If x and y have a normal-gamma distribution, that is, the conditional distribution of x given y is normal and the marginal distribution of y is gamma, find the conditional distribution of y given x. Also, find the upper .01 point of the marginal distribution of x. Refer to the Appendix.

6.  If x₁, x₂, ..., xₙ is a random sample of size n from a bivariate normal population with mean vector μ and precision matrix P (2 × 2), and if the prior distribution of μ and P is a normal-Wishart, find the posterior distribution of the correlation coefficient.

7.  The gamma population has density

    f(x|α, β) = [Γ(α)β^α]^(−1) x^(α−1) e^(−x/β),   x > 0,   β > 0,

    where Γ is the gamma function.
    (a) Find a conjugate prior density for α and β.
    (b) If α is known, what is the conjugate class of prior densities for β?
    (c) If α is known, and x₁*, x₂*, ..., xₘ* are previous values of x (the population random variable), how would you choose values for the parameters of the conjugate prior density of β?

8.  Is the sampling distribution of an estimator of θ irrelevant in making inferences about θ? Is the posterior distribution of θ relevant in making inferences about θ? Which approach is more relevant? Explain your answers carefully.

9.  What are the controversial aspects of the Bayesian approach to statistical analysis?

10. Why isn't the Bayesian approach to statistics more popular? Is there an inherent defect in the Bayesian analysis of a statistical problem?

REFERENCES

Aykac, Ahmet and Carlo Brumat (1977). New Developments in the Applications of Bayesian Methods. North-Holland Publishing Company, New York.
Barnett, Vic (1982). Comparative Statistical Inference, second edition. John Wiley and Sons, New York.
Bayes, T. (1763). "An essay toward solving a problem in the doctrine of chances," Philosophical Transactions of the Royal Society, Vol. 53, pp. 370-418.
Berger, James O. (1980). Statistical Decision Theory: Foundations, Concepts, and Methods. Springer-Verlag, New York.


Bernardo, J. M. (1980). "A Bayesian analysis of classical hypothesis testing," in Bayesian Statistics: Proceedings of the First International Meeting, Valencia, 1979, edited by Bernardo, DeGroot, Lindley and Smith.
Bernoulli, J. (1713). Ars Conjectandi. Basel.
Box, G. E. P. (1980). "Sampling and Bayes inference in scientific modeling and robustness" (with discussion), Journal of the Royal Statistical Society, Series A, Vol. 143, part 4, pp. 383-430.
Box, G. E. P. and Gwilym M. Jenkins (1970). Time Series Analysis, Forecasting, and Control, Holden-Day, San Francisco.
Box, G. E. P. and G. C. Tiao (1973). Bayesian Inference in Statistical Analysis, Addison-Wesley, Reading, Mass.
Box, G. E. P. and G. C. Tiao (1981). "Modeling multiple time series," Journal of the American Statistical Association, Vol. 76, No. 376, pp. 802-816.
Broemeling, Lyle (1982). "Introduction," Journal of Econometrics, Vol. 19, pp. 1-5.
Davies, O. L. (1967). Statistical Methods in Research and Production, third edition, Oliver and Boyd, London.
Dawid, A. P., Stone, M., and J. V. Zidek (1973). "Marginalization paradoxes in Bayesian and structural inference" (with discussion), Journal of the Royal Statistical Society, B, Vol. 35, pp. 189-233.
deFinetti, B. (1930). "Sulla proprietà conglomerativa delle probabilità subordinate," Rend. R. Inst. Lombardo (Milano), Vol. 63, pp. 414-418.
deFinetti, B. (1937). "Foresight: its logical laws, its subjective sources," Annales de l'Institut Henri Poincaré. Reprinted (in translation) in Kyburg and Smokler (1964).
DeGroot, M. H. (1970). Optimal Statistical Decisions, McGraw-Hill, New York.
DeMorgan, August (1847). Formal Logic, Taylor and Walton, London.
Fienberg, S. E. and A. Zellner (1975). Studies in Bayesian Econometrics and Statistics. North-Holland Publishing Co., New York.
Fisher, R. A. (1922). "On the mathematical foundations of theoretical statistics," Phil. Transactions Roy. Soc. London, A, Vol. 222, pp. 309-362.
Ferguson, Thomas S. (1967). Mathematical Statistics: A Decision Theoretic Approach, Academic Press, New York.
Geisser, S. (1980). "A predictivistic primer," in Bayesian Analysis in Econometrics and Statistics, edited by A. Zellner, North-Holland, Amsterdam.
Good, I. J. (1950). Probability and the Weighing of Evidence, Griffin, London.
Holbert, Don (1982). "A Bayesian analysis of a switching linear model," Journal of Econometrics, Vol. 19, pp. 77-87.


Jaynes, Edwin T. (1980). "Marginalization and prior probabilities," in Bayesian Analysis in Econometrics and Statistics, edited by A. Zellner, North-Holland Publishing Co., Amsterdam.
Jazwinski, Andrew H. (1970). Stochastic Processes and Filtering Theory, Academic Press, New York.
Jeffreys, Harold (1939/1961). Theory of Probability, Oxford at the Clarendon Press, first and third editions.
Jenkins, Gwilym M. (1979). Practical Experiences with Modeling and Forecasting Time Series, Gwilym Jenkins and Partners (Overseas) Ltd.
Kadane, J. B. (1980). "Predictive and structural methods for eliciting prior distributions," in Bayesian Analysis in Econometrics and Statistics, edited by A. Zellner, North-Holland, Amsterdam.
Kadane, J. B., J. M. Dickey, R. L. Winkler, W. S. Smith, and S. C. Peters (1980). "Interactive elicitation of opinion for a normal linear model," Journal of the American Statistical Association, Vol. 75, pp. 845-854.
Kalman, R. E. (1960). "A new approach to linear filtering and prediction problems," Transactions of ASME, Series D, Journal of Basic Engineering, Vol. 83, pp. 95-108.
Kyburg, Henry E., Jr. (1970). Probability and Inductive Logic. The Macmillan Company, Collier-Macmillan Limited, London.
Kyburg, H. E., Jr., and Smokler, H. E. (eds.) (1964). Studies in Subjective Probability, John Wiley and Sons, New York.
Laplace, P. S. de (1951). A Philosophical Essay on Probabilities. (An English translation by Truscott and Emory of the 1820 edition.) Dover, New York.
Leonard, T. (1972). "Bayesian methods for binomial data," Biometrika, Vol. 59, pp. 581-589.
Lindley, D. V. (1965). Introduction to Probability and Statistics from a Bayesian Viewpoint, Part 2, Inference. Cambridge University Press.
Lindley, D. V. (1971). Making Decisions, Wiley-Interscience, John Wiley and Sons Ltd., London.
Lindley, D. V. and A. F. M. Smith (1972). "Bayes estimates for the linear model," Journal of the Royal Statistical Society, B, Vol. 34, pp. 1-42.
McGee, V. E. and W. T. Carleton (1970). "Piecewise regression," Journal of the American Statistical Association, Vol. 65, pp. 1109-1124.
Monahan, J. F. (1983). "Fully Bayesian analysis of ARMA time series models," Journal of Econometrics, Vol. 21, pp. 307-331.
Pearson, K. (1920). "Note on the fundamental problem of practical statistics," Biometrika, Vol. 13, pp. 300-301.
Poirier, D. J. (1976). The Econometrics of Structural Change, North-Holland Publishing Company, Amsterdam.


Press, S. James (1982). Applied Multivariate Analysis: Using Bayesian and Frequentist Methods of Inference, second edition, Robert E. Krieger Publishing Company, Malabar, Florida.
Raiffa, H. and R. Schlaifer (1961). Applied Statistical Decision Theory, Division of Research, Graduate School of Business Administration, Harvard University, Cambridge, Massachusetts.
Rajagopalan, Muthiya (1980). Bayesian Inference for the Variance Components in Mixed Linear Models. Ph.D. dissertation, December 1980, Oklahoma State University, Stillwater, Oklahoma.
Ramsey, F. P. (1931/1964). "Truth and probability," in Kyburg and Smokler (1964). Reprinted from The Foundations of Mathematics and Other Essays, Kegan Paul, Trench, London, 1931.
Savage, L. J. (1954). The Foundations of Statistics. John Wiley and Sons, Inc., New York.
Stigler, Stephen M. (1982). "Thomas Bayes and Bayesian inference," Journal of the Royal Statistical Society, A, Vol. 145, part 2, pp. 250-258.
Winkler, R. L. (1972). Introduction to Bayesian Inference and Decision, Holt, Rinehart and Winston, New York.
Winkler, R. L. (1980). "Prior distributions and model building in regression analysis," in New Developments in the Applications of Bayesian Methods, edited by Aykac and Brumat, North-Holland Publishing Company, Amsterdam.
Zellner, A. (1971). An Introduction to Bayesian Inference in Econometrics, John Wiley and Sons, Inc., New York.
Zellner, Arnold (1980). Bayesian Analysis in Econometrics and Statistics, Essays in Honor of Harold Jeffreys, North-Holland Publishing Company, New York.

3 THE TRADITIONAL LINEAR MODELS

INTRODUCTION

A first one-semester course in linear models introduces the student to simple and multiple regression models and fixed models of designed experiments, and, if time permits, to random and mixed models of designed experiments. These are the so-called standard or traditional linear models that the statistician is expected to know and be able to use.

The Bayesian analysis of the general linear model was introduced in Chapter 1 and will be used to analyze the regression models and the fixed models for designed experiments, and these are to be studied in this chapter. The random and mixed models will be examined in the following chapter.

First, elementary models for one and two normal populations are given, and the Bayesian approach is illustrated with the prior, posterior, and predictive analysis. Secondly, simple and multiple linear regression models are analyzed, then nonlinear regression models, and finally models for designed experiments, including models containing concomitant variables, are introduced and Bayesian inferences are provided and illustrated with numerical examples.

Prior information about the parameters of these models is specified by either a vague improper Jeffreys' density or a member of a class which is conjugate to the model. We have seen that the conjugate class is the normal-gamma family of distributions and that the four parameters of this distribution must be specified from one's prior knowledge of the future experiment. This chapter will use Winkler's (1977) method of


the prior density of a future observation to set the values of the parameters of the normal-gamma prior density.

Once having determined the prior distribution, the joint posterior distribution of all the parameters and the marginal distribution of certain subsets of the parameters will be found. Of course, from Chapter 1 the relevant distribution theory has been derived, and all that is needed is to find the parameters of the various posterior distributions. For example, in a simple linear regression model, the intercept and slope parameters will have a bivariate t distribution, while the marginal distribution of each of these parameters will be a univariate t distribution. Having the sample information and the parameters of the prior distribution will allow one to know the precision matrix, degrees of freedom, and mean vector of the joint posterior distribution of the slope and intercept coefficients of the model. Thus, this will allow one to test hypotheses about the parameters, using HPD regions, and construct confidence intervals for these parameters as well as for others, such as the average response of the dependent variable at selected values of the regressor variables. To predict a future value of the dependent variable, the Bayesian predictive density will give point and interval forecasts.

This type of analysis will be repeated for multiple linear regression models and those for some designed experiments, accompanied by numerical examples which will provide tables and graphs of posterior and predictive densities. The reader should consult Box and Tiao (1973), Lindley (1965), and Zellner (1971) for a similar presentation along the lines developed in this chapter.
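As a preview of the computations this chapter works through, the short sketch below (an added illustration; the data and hyperparameter values are invented) updates a normal-gamma prior for a simple linear regression and reports the posterior location of the intercept and slope, matching the location vector formula (3.31) given later in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data from y_i = 1 + 2 x_i + e_i with unit error variance.
x_vals = np.arange(1, 21, dtype=float)
y = 1.0 + 2.0 * x_vals + rng.standard_normal(x_vals.size)
X = np.column_stack([np.ones_like(x_vals), x_vals])      # n x 2 design matrix

# Hypothetical normal-gamma hyperparameters (mu, P, alpha, beta).
mu = np.zeros(2)
P = np.eye(2)
alpha, beta = 2.0, 1.0

# Conjugate updating: the posterior location of (theta1, theta2) is
# (X'X + P)^{-1} (X'y + P mu), with n + 2*alpha degrees of freedom.
A = X.T @ X + P
b = X.T @ y + P @ mu
post_location = np.linalg.solve(A, b)
dof = len(y) + 2 * alpha
print("posterior location (intercept, slope):", post_location, "df:", dof)
```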

PRIOR INFORMATION

Two types of prior information will be used in this chapter. First, Jeffreys' improper density

ξ(θ, τ) ∝ 1/τ,   θ ∈ R^p,   τ > 0,   (1.13)

is to be used when very little prior information is available for the parameters θ and τ of the linear model, and when one is more informed, a priori, recall

ξ(θ, τ) = ξ₁(θ|τ)ξ₂(τ),   τ > 0,   θ ∈ R^p,   (1.5)

where

ξ₁(θ|τ) ∝ τ^(p/2) exp[−(τ/2)(θ − μ)'P(θ − μ)],   θ ∈ R^p,   τ > 0,   (1.6)

and

ξ₂(τ) ∝ τ^(α−1) e^(−τβ),   τ > 0,   (1.7)

is the normal-gamma density with known parameters μ ∈ R^p, P a symmetric positive definite matrix, α > 0, and β > 0. Of course the difficulty with using this conjugate class is that the parameters need to be specified; thus how does one know what values to assign to the hyperparameters?

Consider one way advocated by Winkler (1977), although what follows is my interpretation. Thus consider the model

Y = Xθ + e,   (1.1)

which describes the experiment which will be run in the future, where Y is n × 1, X is n × p, θ is p × 1, and e is n(0, τ^(−1)I_n). Also consider the model

Z = X*θ + e,   (3.1)

where Z is m × 1, X* is m × p, θ is p × 1, and e is n(0, τ^(−1)I_m), and where θ and τ are the parameters of both models. Note that θ and τ have the same meaning in both models. The latter model (3.1) thus describes a related experiment (related by the parameters θ and τ), which perhaps is a past experiment with past data Z when the design matrix was X*. On the other hand, (3.1) is perhaps a hypothetical experiment with hypothetical observations Z when the design matrix is X*. The hypothetical data Z are supplied by the experimenter, and in this situation, if Z is the experimenter's guess or prediction of Y, then the design matrix is X (= X*).

What is the predictive prior density of Z? It is the marginal density of Z, which will depend on the hyperparameters μ, P, α, and β. The conditional density of Z given θ and τ is, from (3.1),

g(z|θ, τ) ∝ τ^(m/2) exp[−(τ/2)(z − X*θ)'(z − X*θ)],   z ∈ R^m,   (3.2)

thus the marginal density of Z is

h(z|μ, P, α, β) ∝ [(z − A^(−1)B)'A(z − A^(−1)B) + C − B'A^(−1)B]^(−(m+2α)/2),   z ∈ R^m,   (3.3)


where

A = I_m − X*(X*'X* + P)^(−1)X*',   (3.4)

B = X*(X*'X* + P)^(−1)Pμ,   (3.5)

and

C = μ'Pμ + 2β − μ'P(X*'X* + P)^(−1)Pμ.   (3.6)

Since the prior predictive density (3.3) of Z is an m-dimensional multivariate t with 2α degrees of freedom, location

E(Z) = A^(−1)B

and precision matrix

P(Z) = (2α)A / (C − B'A^(−1)B),

and since these moments depend on the hyperparameters, one may choose them in such a way that the equations

A^(−1)B = X*(X*'X*)^(−1)X*'Z   (3.7)

and

(2α)A / (C − B'A^(−1)B) = (m − p){Z'[I_m − X*(X*'X*)^(−1)X*']Z}^(−1) I_m   (3.8)

are satisfied; however, the solution is not unique. The right-hand sides of these two equations were obtained from (3.1) as follows. From (3.1), E(Z|θ) = X*θ, and this is estimated by the right-hand side of (3.7), since the usual estimate of θ is (X*'X*)^(−1)X*'Z. Also from (3.1), the dispersion matrix of Z is τ^(−1)I_m, thus its precision matrix is τI_m, and τ^(−1), the variance, is usually estimated by the residual mean square appearing in (3.8).

Still, a much easier way to set the values of the hyperparameters is to notice that, from (3.1), the usual estimators of θ and τ^(−1) are

θ* = (X*'X*)^(−1)X*'Z   (3.9)


and

(τ*)^(−1) = [Z'Z − Z'X*(X*'X*)^(−1)X*'Z] / (m − p),

and since the prior mean of θ is μ, choose μ = θ*. In the same way, since the prior mean of τ^(−1) is β/(α − 1), (α > 1), choose α and β such that β/(α − 1) = (τ*)^(−1). Of course the choice of α and β is not unique. The prior dispersion of θ is β(α − 1)^(−1)P^(−1); thus choose P so that (τ*)^(−1)P^(−1) = (τ*)^(−1)(X*'X*)^(−1), or let P = X*'X*. This determines values of the hyperparameters from the past or hypothetical experiment (3.1), although the choice of α and β is not unique. Is there a way to choose α and β uniquely?

Of course it is important that Z be known, and this is not a problem if Z is from a past experiment; however, if Z must be chosen hypothetically as the future value of Y, more subjectivity is involved, because Z must then be supplied by the experimenter in the hypothetical future experiment

Z = Xθ + e,

which is the experiment to be conducted, where Y (in place of Z) will be observed. As Winkler emphasizes, the advantage of this method is that the values of the hyperparameters are determined indirectly in terms of future (or past) values of Y and not directly in terms of the hyperparameters. It is much easier to "guess" future values of Y, or to observe past values of Y, than it is to "guess" the values of the hyperparameters. For additional information on choosing the hyperparameters of the conjugate prior density (1.5), Zellner (1980) uses a g-prior distribution, while Kadane et al. (1980) choose the hyperparameters from the quantiles of the prior predictive distribution.

To recapitulate, one may either use an improper prior density or a conjugate prior density with which to express one's prior information about the parameters, and in the latter case, one may use the prior predictive density to set the values of the hyperparameters. Consider a n(θ, τ^(−1)) population and suppose a normal-gamma conjugate prior ξ(θ, τ) is appropriate; then how does one choose the hyperparameters when m past observations Z₁, Z₂, ..., Zₘ are available? From the past data, θ and τ^(−1) would be estimated by θ* = Z̄ and (τ*)^(−1) = Σᵢ(Zᵢ − Z̄)²/(m − 1); then the hyperparameters could be


chosen as follows: μ by Z̄, α and β to satisfy β(α − 1)^(−1) = (τ*)^(−1), and P (scalar) by m, where α > 1.
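A small numerical sketch of this recipe for the single normal population follows; it is added here only as an illustration, and the past observations are invented.

```python
import numpy as np

# Hypothetical past observations Z_1, ..., Z_m from the related experiment;
# these numbers are invented purely to illustrate the recipe above.
z = np.array([4.1, 5.3, 3.8, 4.9, 5.6, 4.4])
m = len(z)

theta_star = z.mean()        # estimate of theta -> prior mean mu
var_star = z.var(ddof=1)     # (tau*)^{-1}, the sample variance
P = float(m)                 # scalar prior precision multiplier

# Choose alpha > 1 (not unique), then beta so that beta/(alpha - 1) matches
# the estimated variance, i.e. the prior mean of 1/tau.
alpha = 2.0
beta = (alpha - 1.0) * var_star

print(f"mu = {theta_star:.4f}, P = {P}, alpha = {alpha}, beta = {beta:.4f}")
```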

NORMAL POPULATIONS

Bayesian inferences for one normal population were discussed in Chapter 1, so now several normal populations will be examined. Suppose one has two normal populations, n(θ₁, τ₁^(−1)) and n(θ₂, τ₂^(−1)); then four cases arise, namely:

1. θ₁ = θ₂ and τ₁ = τ₂,
2. θ₁ ≠ θ₂ and τ₁ = τ₂,
3. θ₁ = θ₂ and τ₁ ≠ τ₂,
4. θ₁ ≠ θ₂ and τ₁ ≠ τ₂.

In the first case the populations are identical and there is actually only one population, therefore only the three remaining cases are of interest.

Two Normal Populations, Distinct Means

Let S₁ = (x₁₁, x₁₂, ..., x₁ₙ₁) be a random sample from a n(θ₁, τ^(−1)) population and suppose S₂ = (x₂₁, x₂₂, ..., x₂ₙ₂) is a random sample from the second n(θ₂, τ^(−1)) population; then the likelihood function of θ₁, θ₂, and τ is

L(θ, τ|s) ∝ τ^((n₁+n₂)/2) exp{−(τ/2)[Σᵢ₌₁ⁿ¹(x₁ᵢ − θ₁)² + Σᵢ₌₁ⁿ²(x₂ᵢ − θ₂)²]},   θ ∈ R²,   τ > 0,   (3.11)

where s = (S₁, S₂) and θ = (θ₁, θ₂)'. If one uses the normal (bivariate)-gamma prior density of (θ, τ) suggested by (3.11), with

71

parameters y (2 * 1), P(2 * 2), a and 3» the joint posterior density of ( 0 , t ) is

5 (9 , t |s ) « T(n+2a+2)/2" 1 exp -

| [ ( e - A ' 1B ) 'A ( e - A _1B )

+c b 'A ' a ' 1:B ] , C - B

(3.12)

where n = n^ + n^,

H

\ o"

1

°H

n2 /

V ZX2i 7 and

C = 23 +

2 + Ex2. 2 + y'Py .

Furthermore, the marginal posterior density of e is a bivariate t density e< e|s) « [ ( 0 -

a ' 1B ) ’A (6 -

A* 1B ) + C - B 'A " 1B ] ' (n+2ct+2)/2, e € R 2,

thus 0 has n + 2a degrees of freedom, location vector A cision matrix P ( 9 |s) = - - - " 2ct)^ . C - B 'A B

(3.13)

and p re ­

(3.14)

In addition the marginal posterior distribution of x is a gamma with parameters a' = (n + 2a) and

12

THE TRADITIONAL LINEAR MODELS

72 0' = (C - B 'A _1B )/2 .

How does one make posterior inferences about e and t?

First

consider 0, then the mean A *B jointly estimates 6^ and ©2, but suppose one wants to estimate 0^, ignoring e^.

If 01s has a bivariate t distribu­

tion then 0 |s has a univariate t distribution (see DeGroot, Chapter 5, and the Appendix of this book) with n

+ 2a

degrees of freedom, location

parameter which is the first component of A * B , and precision P x(

e1 |s)

= p n ( 0 |s) -

p 12( 6 |s)p22( 0 ls ) P 2i ( 0 ls ) » where Py is the D "111

component of p ( 0 |s) given by (3 .1 4 ). Often y = 0 — 0 is the parameter of interest, because y measures 1 L the difference in the means of the two normal populations and if y = 0, the two populations are identical. What is the posterior distribution of y? Since @|s is a bivariate t, y|s has a univariate t distribution with n +

2a

degrees of freedom, location (1 , - 1 ) A

* B , and precision

[(1 , - 1 ) P 1(0 | s )(l , —l ) 1] 1 = P (y | s ). These results can be verified by referrin g to DeGroot, Chapter 5, or to the Appendix of this book. Remember the hyperparameters y , P , a and 6 must be known in order to do the posterior analysis and they can be determined by using the methods of the previous section of this chapter. On the other h and, if v ery little is known about the parameters, before the experiment, one perhaps would use the Jeffreys' prior (1.13) which results in a simplified posterior analysis. For example the posterior distribution o f 0 is a bivariate t with n — 2 degrees of freedom, location vector E(0 |s) = ( _ - 0

) , and precision matrix

2

-

( nl + n2 “ 2) P (e | s) = (n x -

Cl 0 ) \o

l ) s j + (n 2 -

n. /

l)s 2

where n.

i

S.2 = 2 3 (X.. j= l 3

X .)2(n

-

1)

i = 1,2

NORMAL POPULATIONS

73

and x

and x are the sample means and s and s the sample vari1 Z I z ances. The posterior analysis for 9 using Jeffreys1 prior density is based on Theorem 1.1 of Chapter 1. One may obtain the same results by letting a + —1, 9 + 0 , P + 0(2 x 2) in the joint posterior distribution of 9 with density (3 .1 2 ). Regions of highest posterior density, HPD, may be found for 0, 9., i = 1,2, and y. First for 0, from (3.12) one may show (see Chap­ ter 1, 1.28)

A ' 1B ) ’A (9 - A -1B ) ( n + 2ct) C -

(3.15)

B 'A *B

has an F-distribution with 2 and n + 2a degrees of freedom and one may plot 99%, 95% regions for 0. If 0^ or 0^ or y is the parameter of interest, one may construct an HPD interval using the Student t tables. How would one do this? Since 0.. and 09 are the primary parameters of interest, Bayesian 1 u inferences for t will not be discussed.

Two Normal Populations w ith a Common Mean Consider two normal populations n ( 0 , i ^ ) and n (0 ,x 2*) with a common mean 0 but distinct precisions t

and t 2, then on the basis of random

samples S. = (x ., ,x .rt, . . . , x. ) , i = 1,2, how does one make inferences v l l l i2 in. l about the common mean? Suppose given 9, and i ^ that S^ and are independent, then the likelihood function for 0 and t =

*s

n.

i

i= l (3.16) where s = (S

1

,S ) . z

The conjugate prior density is

THE TRADITIONAL LINEAR MODELS

74 0 G R,

x. > 0,

(3.17)

where a. > 0 , B. > 0, P. > 0 , and y. G R, and is the product of two i l l i normal-gamma densities, one for each population. This implies the marginal prior density of 0 is

2 " ( 2ai+1^ 2 C O ) “ 1 1 [20. + P.(0 - p .) ] , i= l

0 G R,

which is the product of two t densities on 0 and is called a 2/0 uni­ variate t density with parameters a., 3., P., and y ., i = 1,2. This is a very complicated density because its normalizing constant and moments are unknown, thus it is v ery difficult to assign values to the hyperparameters. How would one assign values to the hyperparam­ eters? One solution is to consider each population separately, so for the nC©,-^ ) population, one may set the values of y^, P^, a^, and 3^^ and then repeat the process for the second population. legitimate procedure? The marginal prior density o f and Xg is

2

,

o

a . - l -t.3 . l 11 ti. e

n

i=1

* e

Is this a

1/2 1/2

t. t_ 1 2

1/2 (T 1P 1 + T 2P 2>

' T l T 2P l P 2( v r ,J2) 2 / ( T l P l + T 2P 2 )

•*1 >

0,

t 2 > 0, (3.18)

thus x ^ and x ^ are not independent, a p rio ri, even if y ^ = y 2.

Since

the populations have a common mean, it is probably wise to let y ^ and y g have a common value, in which case (3.18) simplifies but is not the density of independent precision components. Proceeding with the posterior analysis, we have 2

(2a.+n.+l)/2-l t

i= i

.

1

+ P.(e-yi )2 |,

x. /

ni

exp - -7 - )’ 2 b. + 1 j ( x . M 1 j= i 1]

e )‘

(3.19)

NORMAL POPULATIONS where eters .

0

e

R,

t

.

> 0 , for the joint posterior density o f the param­

C orrm letin cr t h e

and

75

sn n a re

on

A an d

in tecrrA tin o *

w ith

resn en t

to

t

t

0£ R

(3.20)

for the marginal posterior density of 9 and is a 2/0 scalar poly-t den­ sity, which, as we have seen, is the same form as the prior density of 0. For the properties of this distribution, see Box and Tiao (1973), Zellner (1971), Dreze (1977), Sedory (1980), and Y usoff (1982). Since 0 is scalar, the density (3.20) is easily plotted and its moments calcu­ lated by numerical integration; however if 0 is a vector, it may be necessary to use approximations to the density and this has been done by Box and Tiao (1973), Zellner (1971), Dreze (1977), and Yusoff (1982). What is the marginal posterior density of and If 0 is eliminated from the joint density,

(n .+ 2 a .)/ 2 -l

S (

t

|s )

cc

n. i

r r ( i= l

t

is the density of the precision components, where

.

> 0

(3.21)

THE TRADITIONAL LINEAR MODELS

76

which is some type of bivari ate-gamma distribution and it is necessary to employ numerical integration to calculate the joint and marginal properties of the distribution. Thus with a common mean for the two populations, it is very d if­ ficult to do the posterior analysis because the marginal posterior distributions for 6, then for x^ and x 2 are not standard densities. For example, if the precisions are equal and the means are distinct, it has been shown the posterior analysis results in standard well-known distributions for the parameters. How should inferences be made for this problem? On the one hand, for 0, one must use the poly-t density, and on the other, to make inferences for x and x , one must use the bivariate type gamma X

z

density (3 .2 1 ), and both densities must be numerically determined. Is there some other way by which posterior inferences for 0 and x are possible? Consider the conditional posterior densities for the parameters, for example, for 0 given x and x , and then for x given 0. The X L former distribution is normal with mean

E ( 0| t , s ) =

[E

T .(n.

+ P.)]

[E

T .(n.x. +

p.y.)]

and precision

P (0 | x ,s ) =

x.(n. + P )

1

1 1

and the conditional distribution of x^ and x 2> given 0, is that of in ­ dependent gamma random variables, such that x. |0 is gamma with parameters (2a. + n^ + l)/2 and 1 n. l 26. + S (x.. i j=^ i]

x .) + n.(0 l i

x . ) 2 + P.(0 - y . ) 2 l i l

Although the marginal posterior distributions of the parameters are intractable, the corresponding conditional distributions are wellknown. How can these be used for inferences about the parameters? Perhaps the mode of all the parameters can be found from the modes of the conditional densities, which are

NORMAL POPULATIONS

77

2 Xrf T.(n.X. + p.y .) - I l l 11 (3.22)

2

and

M(x. |t.,i

i

j,0 ) =

n. + 2a. l l

1

n. l

(3.23) for i = 1,2. Using an approach of Lindley and Smith (1972), the mode of the joint density of 0 , x ^ , and x 2> perhaps, can be found by solving simultaneously the above three modal equations for the parameters. An explicit solution cannot be found and an iterative technique can be tried by starting the process with some trial values for t and t 1 z (say the sample precisions) and finding the conditional mode of 0 from (3.22) which is then substituted into the equations, (3 .2 3 ), for the conditional modes of x and x . The process is repeated until the

0

x

Z

estimates stabilize, and then the final estimates are a solution to the modal equations; the solution, perhaps, is the joint mode of the joint posterior density. Since the joint density may have multiple modes, the iterative technique may not converge to the absolute maximum but instead to a local maximum, depending on the starting value of the iterative procedure.
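A sketch of this iteration is given below; it is an added illustration, the data and hyperparameter values are invented, and the constants in the conditional modes follow my reading of the conditional densities given above, so they should be checked against (3.22) and (3.23).

```python
import numpy as np

# Invented data: two samples assumed to share a common mean theta but to
# have distinct precisions tau1 and tau2 (all numbers are illustrative only).
x = [np.array([-0.8, 0.4, -0.3, 0.1, -0.2]),
     np.array([0.7, 0.2, 0.5, 0.1, 0.4, 0.3])]
# Hypothetical hyperparameters (alpha_i, beta_i, P_i, mu_i) of the two
# normal-gamma prior components.
alpha, beta, P, mu = [2.0, 2.0], [1.0, 1.0], [1.0, 1.0], [0.0, 0.0]

n = [len(xi) for xi in x]
xbar = [xi.mean() for xi in x]
ss = [((xi - xi.mean()) ** 2).sum() for xi in x]

# Start the iteration at the sample precisions, as suggested in the text.
tau = [1.0 / xi.var(ddof=1) for xi in x]
theta = 0.0
for _ in range(200):
    # Conditional mode (= mean) of theta given tau1, tau2: a precision-
    # weighted combination of the sample and prior means.
    num = sum(tau[i] * (n[i] * xbar[i] + P[i] * mu[i]) for i in range(2))
    den = sum(tau[i] * (n[i] + P[i]) for i in range(2))
    theta_new = num / den
    # Conditional mode of each tau_i given theta (a gamma conditional).
    tau = [(n[i] + 2 * alpha[i] - 1) /
           (2 * beta[i] + ss[i] + n[i] * (theta_new - xbar[i]) ** 2
            + P[i] * (theta_new - mu[i]) ** 2) for i in range(2)]
    if abs(theta_new - theta) < 1e-10:
        theta = theta_new
        break
    theta = theta_new

print("joint modal estimates:", theta, tau)
```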



where the sample precisions are used as starting values. This will give three point estimates of 6. ^ _ 2 Suppose the sample values are n = n 0 = 10 and z (x —x ) = 31.7882,

10

z1

— 2



_

i i 3

i

( x 0. - x j 2] L = 9.9999, x1 =-.1 49 9 and x ^= .3611, thus there are two normal populations with a common mean and distinct variances and the two sample means are —.1499 and .3611. The sample precisions are .28312 and .9 respectively and are used as the starting values of the iterative solution to the modal equations. One must choose the hyperparameters a^, 6^, a^, B2» P^, V ^ y

X

, and y

z

of the prior distribution, and for fifty-two combinations

of these parameters, Table 3.1 lists the mode of the marginal density of e, the 6 component of the solution to the modal equations, and the mean of the marginal density of 0. We see, first of all, that for each combination of the eight hyper­ parameters, the three estimates of 0 are very close. The marginal mean and mode of 0 are not the same because the marginal density of 0, (3 .2 0 ), is usually not symmetric and the 0 component of the joint modal estimate is not the same as that calculated from the marginal density of 0, nevertheless, the three estimates are close to one another. Table 3.1 reveals the sensitivity of the three posterior estimates to the values of the parameters of the prior distribution. For e x ­ ample, the range of the largest to the smallest values of each estimate is 5.4087, 5.3870, and 5.3865 for the marginal mode, 0 component of iterative solution, and posterior marginal mean, respectively, and the last has the smallest range. The three estimates appear to be sensitive to y ^ and y 2> the two means of the prior distribution of 0.

As y

and y

increase, so do

the posterior mean of 0 and the other two estimates and this was not unexpected. In order to arrive at definite conclusions about this way of estima­ tion, more numerical studies need to be conducted. Another way to do the posterior analysis is with an improper prior density

£ ( 0 , t ^ , t 2) a t

2 ^»

0 6 R>

> 0, x 2 > 0

(3.24)

for the parameter. If this is done, one will see that the marginal pos­ terior density of 0 is a 2/0 p o ly-t, of the form (3 .2 0 ), and the mar­ ginal posterior density of t is a bivariate type gamma density, (3 .2 1 ). Also, the analysis of this section is easily extended to k (> 2 ) normal populations with the same mean but with distinct variances.

NORMAL POPULATIONS

79

Table 3.1 Comparison of the Mode (from Marginal Distribution), Mode (by Iteration), and the Mean al a2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 4

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 4

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 0 0

32

*1

C2 y l

Mode (marginal) y2

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 6 9 12 15 18 21 24 27 0 0

0 0 0 0 0 0 2 4 6 8 10 12 14 16 18 0 0 0 0 0 0 0 0 0 0 0

0 0 0 2 0 4 0 6 0 8 0 10 2 0 4 0 6 0 8 0 10 0 12 0 14 0 16 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 2 4 6 8 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

.2458 .2458 .2458 .2458 .2458 .2458 .2060 .1661 .1528 .1262 .1130 .1130 .0997 .0864 .0864 .1894 .1528 .1528 .1162 .1162 .1162 .0797 .0797 .0797 .2458 .2458

Mode (iterative)

Mean

.2418 .2418 .2418 .2418 .2418 .2418 .1995 .1698 .1478 .1308 .1173 .1064 .0973 .0896 .0831 .1989 .1679 .1443 .1260 .1113 .0992 .0891 .0806 .0733 .2418 .2418

.2261 .2261 .2261 .2261 .2261 .2261 .1867 .1589 .1384 .1225 .1099 .0997 .0911 .0840 .0779 .1867 .1592 .1389 .1231 .1105 .1001 .0915 .0841 .0778 .2305 .2330

THE TRADITIONAL LINEAR MODELS Table 3.1 (Continued)

a2

32

Mode (m arginal)

*i

Mode (iterative)

Mean

6

0

.2458

.2418

.2346

8

0

.2458

.2418

.2357

10

0

.2458

.2418

.2365

12

0

.2458

.2418

.2371

14

0

.2458

.2418

.2376

16

0

.2458

.2418

.2380

2

3

-8

-8

-1.7741

-1.7634

-1.7636

2

3

-6

-6

-1.2957

-1.2967

-1.2974

2

3

-4

-4

-.8173

- . 8225

-.8245

2

3

-2

-2

-.3389

-.3299

-.3349

2

3

0

0

.1794

.1745

.1664

2

2

.6179

.6262

.6187

2

4

1.0565

1.0572

1.0520

6

1.4950

1.4999

1.4966

8

1.9336

1.9512

1.9489

.2359

.2299

.2191

.4186

.4078

.3996

.5382

.5189

.5124

.5781

.5949

.5895

2 2 2 2 2 2 2

12 12

.6578

.6502

.6455

2

15

15

.6977

.6922

.6881

18

18

.7375

.7252

.7215

3

2

2

2

2

.4983

.4947

.4877

2 2 4

2

6

4

4

4

4

1.2425

1.2501

1.2477

6

3

9

6

6

6

6

2.3322

2.3318

2.3301

8

4

12

8

8

8

8

3.6346

3.6236

3.6229

NORMAL POPULATIONS

81

Two Normal Populations w ith D istin ct Means and Precisions The Behrens-Fisher problem is the problem of comparing the means 0^ and ©2 of two normal populations which have distinct variances or precisions

and t

If the precisions are equal, the two-sample

t-test is the way to test the hypothesis the two means are equal, and the Bayesian analysis of this problem was discussed in the section on two normal populations with distinct means (see pp. 70-73). When the precisions are unequal, various tests such as that of Scheffe (1943) have been proposed. Suppose S. = (x . ,x. , . . . , x. ) is a random sample of size n. l

from a n (0 .,i.

IX

in. l

l

) population, where i = 1,2, then how does one provide

posterior inferences for the parameters? 0 = (0 1»02) and T = ( t i , t 2) is

0 G

2

R ,

The likelihood function for

n n. l

t ^ > 0,

(3.25)

t2 > 0

where s = ( S ^ S ^ , which implies

a.-l

-

t

2

0 G

R ,

t

x

>

.$.

i pi

0,

2 1/2 t . e i

t

2 > 0

(3.26)

is the conjugate density, a product of two normal-gamma densities, thus, a priori, (0 ,-r ^ and (0 2>T2) are independent. It follows, a posteriori, that ( 0 ^ ^ ) and (0 2>?2) are independent with density,

THE TRADITIONAL LINEAR MODELS

82 where

t

.

> 0,

2 e R , hence (0.

0

,

t

.)

has a normal-gamroa distribution

with parameters ex., 0., v ., and P., i = 1,2. of 0^ and

The joint marginal p . d . f .

is 2 TT(

ni V '

2

1=1 (

]=1

+ [6. -

(n. + P .)

-

5 ( 9 | s) « U 26i + E ( x

2

-x.) +

niPi ( u i "

p:1------

J _i

i

(3.28)

i

_ 2 i “(n.+2a.+l)/2 (P.w. + n .x .)] (n. + P .)} 1 1

and 0. has a univariate t distribution with n^ + 2a. degrees of freedom, location

E(6.|s) = (n. + P . ) ' 1^

.

+ n.x.)

and precision (n. + 2a )(n . + P.) P(0. |s) = -

1

n. l

1

1

1

20. + ^ (x .. — x . ) 2 + n .P .(y . — x . ) 2(n. + P .) 1 i ~ i ] l i l l l l i i = 1,2. Furthermore,

\

and

t

it0

are independent and

t

.

i

is gamma with

parameters af = (n. + 2a.)/2 and 0’ , where n. l

20'

= 20. + i

"

(x.. i]

x . ) 2 + n .P .(y. i

i l l

x . ) 2(n. + P .) *, l

i

l

i = 1,2

In this type of problem, often y = 0 - 0 is the parameter of 1 z interest and and are nuisance parameters. How would one find the posterior density of y? See Box and Tiao (1973, pages 104—109) where they explain an approximation of Patil (1964) to the posterior density of y and illustrate the approximation with a "textile” example. When reading Box and Tiao the reader should remember, they are using an improper prior density

NORMAL POPULATIONS -

1-1

£ (0 , t ) « ti t 2 ’

83

2

>

t1

> 0 , t 2 >0

for the parameters and not the conjugate prior density (3 .2 6 ); how­ ever, their analysis can be duplicated by letting a. —1/2, 0, and P.

0, i = 1,2, in the posterior density (3 .2 7 ).

When letting the

hyperparameters approach these limits in the posterior density, the normalizing constant of the density should be ignored. Now returning to the distribution of y, let £.(e. |s) be the pos­ terior density of e., then we have seen £. is a t density with n. + 2a. degrees of freedom, location E(0. |s) and precision P(0. |s), i = 1,2, hence the posterior density of y is

g (y |s) =

J R

£ (y + A |s)£ (A |s)dA , 1 1

y€R

(3.29)

and the integral cannot be expressed in terms of simple functions and must be computed by numerical integration. Patil's approximation to g (y | s ) avoids the numerical integration of (3.29) and expresses g (y | s ) as a t distribution. Box and Tiao compare the Patil approxima­ tion to the exact distribution of y |s and show the approximation works well. How would one develop a normal approximation to the distribution of y|S?

Concluding Remarks This chapter begins with the Bayesian analysis of two normal popula­ tions, where either the means are distinct or the precisions are distinct, or both. In the case the means are distinct, but the precisions are equal, the marginal posterior distribution of the means is a bivariate t dis­ tribution with density (3.13) and the marginal posterior distribution of the common precision parameter is a gamma. Also, the marginal posterior distributions of 0 , 0 , and y = 0 are univariate t densities and Bayesian inferences for these parameters are easily obtained. If the precision parameters are distinct and the means are equal, the marginal posterior density of the common mean is a univariate 2/0 poly-t density (3 .2 0 ), and the joint posterior density of the precisions is given by (3 .2 1 ), which is a bivariate-type gamma density. A nu­

THE TRADITIONAL LINEAR MODELS

84

merical example gave three estimates of the common mean for various combinations of the parameters of the prior distribution and the es­ timates are listed in Table 3.1. One of these estimates was calculated from the 6 component of the mode of the posterior distribution of all the parameters. An iterative procedure of Lindley and Smith (1972) produced this modal estimate of 0 while the other estimates, the margin­ al mode and mean, were calculated from the marginal posterior density of 0. This section was concluded with a Bayesian analysis of the Behrens-Fisher problem. Using a conjugate prior density for the pa­ rameters, it was shown that the joint posterior density (3.27) is such that ( 0 1 , t 1 ) and ( 0 O t 0 ) are independent and each has a normalgamma distribution, however the distribution of y = 9^ — evaluated numerically.

must

The marginal posterior distributions for

and Tg are each gamma, which follows from (3.27) but inferences for these parameters were not made. How would one compare Perhaps the ratio of these two parameters could be used.

and t 2?

LIN EA R REGRESSION MODELS

One of the standard models the statistician must know how to use is the linear regression model. With such models one attempts to build a connection between a dependent variable Y and a set of m (m > 1) independent variables. Often, but not always, the average value of Y is a linear function of a set of p unknown parameters 0, thus if there are n observations ( Y . , x . . , x . ft, . . . , x. ) on these m + 1 varil l l i2 lm ables, the model relating Y to m independent variables is Y = x0

+ e,

(3.30)

where Y is the n * 1 observation vector, x is the n * p , (p = m + 1), matrix of values of the independent variables, 0 a p * 1 parameter vector, and e is a n * 1 vector of unobservable random variables with zero means. If, in addition, the n observations on y are independent and normally distributed with a common precision t , model (3.3 0) is the general linear model of Chapter 1. The prior, posterior and p re ­ dictive analysis for such a model was developed in Chapter 1. In this section, a detailed Bayesian analysis of the linear re g re s­ sion model will be given. That is, using the normal-gamma prior density for 0 and t :

LINEAR REGRESSION MODELS

85

(i) Point estimates for 0 and t will be found. (ii) HPD interval estimates for the parameters will be constructed. (iii) Point and interval estimates of E (Y |x) for a given set X of values of the independent variables will be constructed. (i v ) Prediction intervals for future Y values for a given set of values of the independent variables are to be derived. ( v ) Numerical examples will illustrate the above four techniques. O f course, not all regression studies involve the general linear model (3 .3 0 ), since sometimes the E (Y |x) is a nonlinear function of x and often the observations on Y are correlated, that is the precision matrix P (Y |x) is not tI^ , but is nondiagonal. Nonlinear regression models and models with correlated observations will be presented later.

Simple L in ear Regression Let yi = ei + e2x. + e.,

i = 1,2, . . . , n

where y. is the i-th observation on a dependent variable Y , x. is the i-th observation on an independent variable x , 0^ and

are unknown

intercept and slope coefficients respectively, and e. is the i-th error -1 1 term. If the e. are n .i.d . (0 , t ) variables, then the general linear model and the theory of Chapter 1 apply. In terms of this model Y is the n x l vector of observations of the dependent variable, x is a n x 2 matrix

0 = (0^, ©2 > G ^ e - n ( 0 , T _1In ) .

» anc* e *s

n x 1 vector of errors e ., where

THE TRADITIONAL LINEAR MODELS

86

If one uses a normal-gamma prior density for 0 and t , the Bayesian analysis was developed in Chapter 1. First, how does one develop point and interval estimates for 0^ 02» and

t

From Chapter 1 the marginal posterior density o f 0 is a

?

t with n + 2a degrees of freedom, location vector jj*

= ( x fx + p)

-1, (x*y + P y )

(3.31)

and precision matrix ( x ?x + P )(n + 2 a )

P (fls ) = 23

+

y »y

-

where

( x ty

+

p y )» (x !x + P)

(3.32)

(x 'y + P y )

z z n

n

1

x Tx =

n

n

Z1

x.1 1

1

and

x ?y = E w , . 1 1 If one uses an improper prior density

£( 0 ,

t

)

oc

1 /t ,

t

> 0,

0

e

R

,

(3.33)

the marginal posterior density of 0 is a t with n — 1 degrees of freedom, location vector

= ( x ’x ) *x!y * 3i d2

I

(3.34)

LINEAR REGRESSION MODELS

87

where

and n ]C 8 2

(x . -

1

x )(y

- y)

n

£ (x. - x ) 2 1 which is the usual least squares estimator of 0, and the precision matrix of 0 is

P * ( 0 | s ) = 2y >

(3.35)

S

where

S

2 _ y ’y “ y fx ( x Tx ) *x ’y (n - 2)

i=l and s

2

a2

The dispersion 2 -1 matrix of the marginal posterior distribution of 0 is s ( x !x ) (n — 1)(n -

is the usual unbiased estimator of

= 1/t.

3) *, n > 3. We see if one uses an improper prior density for the parameters, the Bayesian analysis closely resembles the least squares or maximum likelihood approach to estimating the parameters. To estimate 0 from the joint posterior distribution, the vector 0 of least squares estimators seems appropriate, if one uses an improper prior density (3 .3 3 ), but i f one uses a conjugate prior density, then y * , the mean of the marginal posterior density of 0 seems appropriate. B y letting a + —1, & 0, P -* 0 (2 x 2) in the joint posterior density of 0 and t , which corresponds to the conjugate prior density, one will obtain the joint posterior density of 0 and t corresponding to the improper prior density (3 .3 3 ). In particular, y * -*• 0 (the least squares

THE TRADITIONAL LINEAR MODELS

88

estimator of 6) as P 0 and as a —1 also if P 0, and 3 0, 2 P (e | s) -* ( x fx )/ s , the precision matrix of the marginal posterior distribution of 0 corresponding to the improper prior density (3 .3 3 ). If one is interested in t , how should this parameter be estimated? We know the marginal posterior density of t is

r-r i«\ £(x|s)

«

x

. -(T / 2 ){2 6 + y 'y (x 'y+ p v|) I( x ,x+P) 1(x 'y + P y ) } (n + z a )/ 2 -l > e t

> 0

(3.36)

if one uses a conjugate prior density with hyperparameters y , P, a, and 0. Since t|s has a gamma density, the mean and mode of t|s are distinct, thus if one uses the mean to estimate t , the estimator is n + 2a E ( t |s ) = ----------------------------------------------- - j -------------23 + y ’y — ( x ’y + P y )’( x ’x + P) ( x ’y + P y )

(3 .3 7 )

whereas the marginal posterior mode of x is n + 2a — 2 M ( t |s ) = ----------------------------------------------- — x -------------23 + y fy — ( x Ty + Py ) ’( x ’x + P) ( x ’y + Py )

(3 .3 8 )

-1 2 What is the marginal posterior mean of x (which is the variance a along the regression line) and what is E(x|s) if an improper prior density is used to express one’s prior information about the param­ eters? The marginal posterior densities for 0 and x are also useful in obtaining HPD regions. For example, what is the HPD interval for 0^? Since 0^ |s has a univariate t density, one may use Student’s t-tables to find these intervals for 0 , as well as for 0 .

With regard to x ,

HPD intervals are more difficult to construct because the gamma marginal posterior distribution is not symmetric and numerical integra­ tion is necessary in order to find an HPD interval for x . An approx­ imate HPD interval is easily constructed from the chi-square tables since, a posteriori, [20 + y ’y — ( x ’y + P y )’( x ’x + P) * (x ’y + Py) ] x /s has a chi-square distribution with n + 2a degrees of freedom. O ften , one is interested in estimating the average value of the dependent variable Y when the independent variable x = x * , where x * is known. Now

E(Y | θ, x = x*) = (1, x*)θ

is the average value of Y when x = x*, and since θ has a bivariate t distribution with n + 2α degrees of freedom, location vector γ* and precision matrix P(θ|s), the mean response μ = (1, x*)θ has a t distribution with n + 2α degrees of freedom, location (1, x*)γ* and precision

P(μ|s) = [(1, x*)P⁻¹(θ|s)(1, x*)']⁻¹,   (3.39)

and one would usually estimate μ by (1, x*)γ*. An HPD interval for μ is easily found from the Student t tables. For example, a 1 − Δ (0 < Δ < 1) HPD interval for μ is

(1, x*)γ* ± t_{Δ/2, n+2α} P^{-1/2}(μ|s),   (3.40)

where P(μ|s) is given by (3.39).

The analysis of a simple linear regression model is completed with the development of a forecasting procedure to predict a future value w of the dependent variable Y when the regressor x = x_{n+1}, and x_{n+1} is known. Recall from the section on predictive analysis in Chapter 1 (see pp. 14-18) the derivation of the Bayesian predictive density, with which a future value of Y is to be predicted. One obtains the predictive density of Y_{n+1} if one substitutes the appropriate quantities into formula (1.39), which is based on the conjugate prior density with hyperparameters α, β, γ, and P. Use the following substitutions: k = 1, z = (1, x_{n+1}), x is the n × 2 matrix following formula (3.30), and y is the n × 1 vector of observations. The predictive density of Y_{n+1} = w is a t-distribution with n + 2α degrees of freedom, location A⁻¹B and precision (n + 2α)A(C − B'A⁻¹B)⁻¹, where A, B, and C are explained by formula (1.39). Notice, letting β → 0, P → 0 (2 × 2), and α → −1,

A = 1 − (1, x_{n+1})(z'z + x'x)⁻¹(1, x_{n+1})',   (1.41)

B = (1, x_{n+1})(z'z + x'x)⁻¹x'y,   (1.42)

and

C = y'y − y'x(z'z + x'x)⁻¹x'y.   (1.43)

The predictive density of w which corresponds to the improper prior density is a univariate t density with n − 2 (n > 2) degrees of freedom, location A⁻¹B = E(w|s) and precision P(w|s) = (n − 2)A(C − B'A⁻¹B)⁻¹, where A, B, and C are now given by (1.41), (1.42), and (1.43), respectively, and a 1 − Δ (0 < Δ < 1) HPD interval for a future observation w (when x = x_{n+1}) is given by

E(w|s) ± t_{Δ/2, n−2} [(C − B'A⁻¹B)(n − 2)⁻¹A⁻¹]^{1/2},   n > 3.   (1.44)

How does this interval compare to the "usual" prediction interval for a future observation? See Draper and Smith (1966) for details about the conventional simple linear regression analysis.
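As a numerical companion to the conjugate analysis above, the following sketch computes the normal-gamma posterior hyperparameters, the marginal posterior t distribution of θ, and the predictive t distribution of a future observation. It assumes the gamma parameterization used here (prior density for τ proportional to τ^{α−1}e^{−βτ}); the function names are mine, not the book's, and the sketch is illustrative rather than a transcription of any formula numbered in the text.

```python
import numpy as np

def ng_posterior(x, y, gamma, P, alpha, beta):
    """Normal-gamma conjugate update for y = x theta + e, e ~ N(0, 1/tau).
    Prior: theta | tau ~ N(gamma, (tau P)^-1), tau ~ Gamma(alpha, rate=beta).
    Returns the posterior hyperparameters (gamma*, P*, alpha*, beta*)."""
    n = len(y)
    P_star = x.T @ x + P
    gamma_star = np.linalg.solve(P_star, x.T @ y + P @ gamma)
    alpha_star = alpha + n / 2.0
    beta_star = beta + 0.5 * (y @ y + gamma @ P @ gamma
                              - gamma_star @ P_star @ gamma_star)
    return gamma_star, P_star, alpha_star, beta_star

def theta_marginal(gamma_star, P_star, alpha_star, beta_star):
    """Marginal posterior of theta: multivariate t with n + 2*alpha degrees of
    freedom, location gamma*, and precision (alpha*/beta*) P*."""
    return 2 * alpha_star, gamma_star, (alpha_star / beta_star) * P_star

def predictive(z, gamma_star, P_star, alpha_star, beta_star):
    """Predictive distribution of w at a new design row z: univariate t with
    n + 2*alpha degrees of freedom, location z gamma*, and precision
    (alpha*/beta*) / (1 + z P*^-1 z')."""
    loc = z @ gamma_star
    prec = (alpha_star / beta_star) / (1.0 + z @ np.linalg.solve(P_star, z))
    return 2 * alpha_star, loc, prec
```

A 1 − Δ HPD interval for a coefficient or a future observation then follows from the corresponding t quantile, exactly as in (3.40) and (1.44).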

An Example of Simple Linear Regression

Consider the model

y_i = θ₁ + θ₂x_i + e_i,   i = 1, 2, ..., 30,   (3.41)

where x_i = i, the e_i are n.i.d. (0, 1), θ₁ = 1, and θ₂ = 2. Thirty y_i values were generated with this model, and the parameters of the normal-gamma prior density were determined empirically from the first six (x_i, y_i) pairs. The values of the hyperparameters were

γ = (.03958, 2.27426)',   P = [ 5.58765, 19.5568 ; 19.5568, 84.7461 ],   (3.42)

α = 1, and β = .931275, where these values were calculated as

γ = (z'z)⁻¹z'y*,   P = z'z/s²,   β = (s²)⁻¹,

s² = [y*'y* − y*'z(z'z)⁻¹z'y*]/4,

and α was set equal to one. Also,

y* = (y₁, y₂, ..., y₆)'

and z is the 6 × 2 matrix whose i-th row is (1, i), i = 1, 2, ..., 6.

The posterior analysis is based on the normal-gamma prior density, which was constructed from the first six observations, and the likelihood function, which is based on the 7th through 29th pairs (i, y_i), i = 7, 8, ..., 29, selected from (3.41). Combining the prior and likelihood function gave the following values for the parameters of the joint posterior distribution:

E(θ|s) = ( E(θ₁|s), E(θ₂|s) )' = ( .970903, 2.005566 )',   (3.43)

P(θ|s) = [ 23.1789, 351.528 ; 351.528, 6931.33 ],

P(θ₁|s) = 5.3085,   P(θ₂|s) = 1600.1.

The average value of a future observation y₃₀ given x₃₀ = 30 is E(w|s) = 61.1376, compared to y₃₀ = 61.1289, and the precision of the predictive distribution of y₃₀ is P(w|s) = .707206, where s = {(i, y_i): i = 7, ..., 29}.

If one uses a Jeffreys' prior density (3.33), the following values of the parameters of the posterior distribution were calculated:

E(θ|s) = ( E(θ₁|s), E(θ₂|s) )' = ( 1.0334, 2.00196 )',   (3.44)

P(θ|s) = [ 17.3861, 312.95 ; 312.95, 6398.09 ],

P(θ₁|s) = 2.07877,   P(θ₂|s) = 764.989,

and the mean of the predictive distribution of y₃₀ is E(w|s) = 61.0922 when x₃₀ = 30, when actually y₃₀ = 61.1289, and the precision of w is P(w|s) = .637491.

By using a normal-gamma prior density with parameters fitted to the first six observations (i, y_i), i = 1, 2, 3, 4, 5, 6, one obtains prior estimates (3.42) of θ₁ and θ₂ which are very close to the true values of the intercept θ₁ = 1 and slope θ₂ = 2 of the regression model (3.41); therefore it is not surprising that the posterior estimates (3.43) are even closer to the true values.

On the other hand, when the prior information is vague, the posterior estimates (3.44) of the regression coefficients are not as close as they were when the prior was fitted to the first six observations. Also, we see the posterior precisions of θ₁ and θ₂ are smaller when Jeffreys' density expresses our prior information.

Suppose θ₁ and θ₂ are estimated by 95% HPD intervals. The marginal posterior distribution of θ₁ is a t with n + 2α degrees of freedom, location E(θ₁|s), and precision P(θ₁|s); thus

E(θ₁|s) ± t_{.025, n+2α} P^{-1/2}(θ₁|s)

is a 95% HPD region for θ₁ and reduces to (n = 23, α = 1)

1.0334 ± (2.060)(.693580)   or   (-0.39537, 2.462175).

As for θ₂, the 95% HPD interval is 2.00556 ± (2.060)(.02499) or (1.95407, 2.05709), which is a much shorter interval than the HPD interval for θ₁, because the latter has a much smaller precision.

Now suppose y₃₀ is to be predicted with a 95% HPD interval; then

E(w|s) ± t_{.025, n+2α} P^{-1/2}(w|s)

is the appropriate formula. Substituting n = 23 and α = 1, E(w|s) = 61.1376 and P(w|s) = .707206 gives 61.1376 ± 2.4495 as the interval for forecasting the future observation y₃₀ when x₃₀ = 30.

What are the 95% HPD intervals for θ₁, θ₂, and w|x₃₀ = 30 when vague prior information is used? The following y values were generated from the model (3.41) and were used in the preceding calculations. Beginning with y₁ and continuing through y₃₀, they are: 2.70178, 4.07126, 7.64881, 7.57121, 12.31, 13.6939, 14.7727, 16.6751, 19.1657, 20.4161, 24.6735, 26.1047, 26.4773, 28.2187, 29.5846, 34.3475, 35.337, 36.2295, 40.5608, 40.6826, 42.5413, 46.2414, 47.9887, 47.5131, 50.6753, 55.0602, 54.9478, 54.8281, 60.5314, 60.0476.
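The following sketch is intended to reproduce, approximately, the calculations of this example from the thirty y values just listed: the hyperparameters are fitted to the first six pairs (with β taken as (s²)⁻¹, as read from (3.42)), the posterior is formed from pairs 7 through 29, and the predictive mean and precision of y₃₀ are computed. All function and variable names are mine, and small differences from the printed values due to rounding are to be expected.

```python
import numpy as np

# The thirty generated y values listed above (model y_i = 1 + 2 i + e_i).
y = np.array([2.70178, 4.07126, 7.64881, 7.57121, 12.31, 13.6939, 14.7727,
              16.6751, 19.1657, 20.4161, 24.6735, 26.1047, 26.4773, 28.2187,
              29.5846, 34.3475, 35.337, 36.2295, 40.5608, 40.6826, 42.5413,
              46.2414, 47.9887, 47.5131, 50.6753, 55.0602, 54.9478, 54.8281,
              60.5314, 60.0476])
i = np.arange(1, 31)

# Hyperparameters fitted to the first six pairs, as in (3.42).
z = np.column_stack([np.ones(6), i[:6]])
ystar = y[:6]
gamma = np.linalg.solve(z.T @ z, z.T @ ystar)
s2 = (ystar @ ystar - ystar @ z @ gamma) / 4.0
P = z.T @ z / s2
alpha, beta = 1.0, 1.0 / s2

# Posterior from the 7th through 29th pairs.
x = np.column_stack([np.ones(23), i[6:29]])
yy = y[6:29]
P_post = x.T @ x + P
mean_post = np.linalg.solve(P_post, x.T @ yy + P @ gamma)
alpha_post = alpha + 23 / 2.0
beta_post = beta + 0.5 * (yy @ yy + gamma @ P @ gamma
                          - mean_post @ P_post @ mean_post)
prec_theta = (alpha_post / beta_post) * P_post   # precision matrix of theta | s

# Predictive mean and precision of y_30 at x_30 = 30.
z30 = np.array([1.0, 30.0])
w_mean = z30 @ mean_post
w_prec = (alpha_post / beta_post) / (1.0 + z30 @ np.linalg.solve(P_post, z30))
print(gamma, mean_post, w_mean, w_prec)
```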

We see they are increasing, as they should, because so are the x values, since x_i = i, i = 1, 2, ..., 30.

MULTIPLE LINEAR REGRESSION

Consider a linear regression model with two independent variables,

y_i = θ₁ + θ₂x_{i1} + θ₃x_{i2} + e_i,   (3.45)

where the e_i are n.i.d. (0, τ⁻¹). The n triples (y_i, x_{i1}, x_{i2}), i = 1, 2, ..., n, are such that y_i is the i-th observation on a dependent variable y, and x_{i1} and x_{i2} are the i-th observations on two nonstochastic regressor variables x₁ and x₂, respectively. The unknown parameters are θ = (θ₁, θ₂, θ₃)' and τ, where θ ∈ R³ and τ > 0, and given the sample s = {(y_i, x_{i1}, x_{i2}): i = 1, 2, ..., n}, what inferences can be made about the parameters, and how does one forecast future values of the dependent variable? Of course, these questions are answered in much the same way as one would answer the same questions about the simple linear regression model, and the theory concerning the prior, predictive, and posterior analysis is covered in Chapter 1.

With the simple linear regression model, it was shown the prior analysis is based on the normal-gamma conjugate prior density or the Jeffreys' improper prior, the posterior analysis consists of studying the normal-gamma posterior distribution of θ and τ, and the predictive analysis is then done with the aid of the predictive t distribution. All these techniques apply to the analysis of multiple linear regression models; however, because the model contains more than one independent variable (as compared to a simple linear regression model), the posterior analysis becomes more involved, since there are now more parameters in the model and the posterior distribution of θ has one more dimension.

S(0,

t

)

«

1/t ,

t

> 0,

0eR 3

is used in assessing one's prior information.

(3.46)

95

MULTIPLE LINEAR REGRESSION The joint posterior density o f the parameter is

.

(n / 2 )-l

5( e , t | s )

a t

t

exp

-

-

n V*

2

-$

tyj

-

et

-

e2xu

-

03x.2 ]

2 ,

(3.47) where 0. € R (i = 1,2,3) and

t

> 0, and is a normal-gamma density,

thus the marginal posterior distribution of 0 is a trivariate t with n — 3 degrees of freedom, location vector E ( 0 | s ) = A ' 1B

(3.48)

and precision matrix

P (e | s) = (n -

3 )A (C - B ' A ^ B ) ' 1,

n>3

(3.49)

where n Ix i i

IXi l 2 Exi i

IXi2

IXi l Xi2

SYi

EXi2 Exi l xi2 2 ZXi2

\

EyiXi l ZyiXi2 and

c - S r f 1

.

The three marginal posterior univariate distributions are t dis­ tributions. For example, the marginal posterior distribution of 0^ is a t with n -

3 degrees of freedom, location

E (6 2 |s ) = ( 0 , l , 0 ) E ( e | s )

(3.50)

THE TRADITIONAL LINEAR MODELS

96 and precision

P (0 2|s) = [(0 .1 .0 )P " 1(e | 8 )(0 , l , 0 )t] " 1 and a 1 -

(3.51)

i (0 < 4 < 1) HPD region for e2 is given by

^ p‘ 1 (e2ls >

y2 >y3 >yr y 5 >ye y

1

X *

=

11

21

12

22

16

26 i

and

and using the formulas the values of the hyperparameters were calcu­ lated to be:

101

MULTIPLE LINEAR REGRESSION Table 3.2 Y Values for Multiple Linear Regression

II

*4

34.2187

y l5 = 94.3475

/ 12.1431 p = I

42.5008 184.17

/ .26964 \ y = I 2.47146 J \ 2.6549 /

II II II CJl

to

II II

y l4 =

77.4773

120.948 114.828 135.531

CM

y !3 =

50.1047

112.06

II 00

y !2 =

y 26 =

80.6753

to ~q II

*4

36.6735

o

59.4161

122.513

CO CM

40.1657

=

136.989

II

46.6751

^q

109.683

II

CO 00

87.2295

31.6939

y9

*4

*4

12.31

80.3337

106.561

y 18 CO

II II

22.5712

y i6 =

to o

II

10.6488

CM

13.0713

II

5.70178

*
From the resulting posterior distribution, a 90% HPD interval for θ₁ is

E(θ₁|s) ± t_{.05, n+2α} P^{-1/2}(θ₁|s),

or .951843 ± .58297, and a 90% HPD interval for θ₂ is 2.02176 ± .07385. What is the 90% HPD interval for θ₃, and how would one test for significance of regression (H₀: θ₂ = θ₃ = 0)?

Suppose one desires to forecast a future value of Y₃₀ (=138.903) given x_{1,30} = 30 and x_{2,30} = 36. We have seen a point estimate of Y₃₀ is given by E(w|s) = 168.988 with a predictive precision of .597336; thus a 90% prediction interval for Y₃₀ is 168.988 ± 2.20993, and the actual observation of 138.048 lies quite far outside the prediction interval, because the predictive precision is very small.

oU

THE TR AD ITIO NAL LINEAR MODELS

104

This example was constructed so that 0 ^ and 0^ have the same "true" values as 0

and 0 O, the slope and intercept, respectively, of 1 A the simple linear regression model (3.4 1) and the first regressor in both assume the same values, namely x.^ = i, i = 1,2, . . . , 30. The addition o f the second regressor affects the marginal posterior distribu­ tions of 0 ^ and 0 £. For instance P ( 0 ^|s) = 5.35085 with the simple linear regression model and a data-based prior, while P ( 0 ^|s) = 8.5838 with the regression containing two regressions and a data-based prior.

N O N LIN E A R REGRESSION A N A LY S IS

Suppose y. = f ( x . , 0 ) + e .,

i = 1 ,2 , ..., n

(3.68)

where x. is a m x 1 vector o f independent v ariab les, 0 is a p x 1 real unknown parameter vector, x. € Rm, 0 € R ^, f: Rm x R^

R is a

known function with domain Rm x R^ and range contained in R , and the e. are n .i .d . ( 0 , t * ) , where f ( x , 9) = x 0 ,

t

> 0 and unknown.

If

0 e Rp

x g Rm,

(3.69)

where x is a 1 x m vector, then the model is the general linear model, but if f ( x , 0 ) is not linear in 0 , the model (3.68) is called a nonlinear regression model, and the mean of the dependent variable y is a non­ linear function of 0 . It has been shown that if f is linear in 0 , then the Bayesian ap­ proach is well developed, however when f is nonlinear in 0 , Bayesian inferences are more difficult to develop. One problem is the nonlinear­ ity of f makes a conjugate family difficult to identify (except in some special cases). This is obvious from the likelihood function n L ( 0 ,t|s) 0,

(3.70)

NONLINEAR REGRESSION ANALYSIS

105

1,2

s = { ( x l i ’x 2i ’

m}

and , x .). mi

Xi = ( x l i ’X2i*

If p = 1, there is an exact small-sample Bayesian analysis as follows. Suppose the prior density of 0 and t is

5 (0 ,t)

* T01’ 1 e " TB

5 (0),

then the posterior density of

£ ( 0 ,

t

|s )

oc

0

£ ( 0 ) T f n + 2 o t ^ /2

0

and

e R, t

t

> 0

is

1 exp -

tYj “ f ( x . , 0

0 e R,

t

) ]

2 + 23

> 0,

where £(9) is a proper marginal prior density of 0 and prior density of t is gamma with parameters a > 0 and By integrating (3.71) with respect to t ,

5 (e | s) cc ----------------- ^ ‘

n 2 B + E [y, i= l

(n + 2 a)

12

,

j,

(3.71) themarginal >0 .

3

e e R

(3.72)

f ( x . , 0 ) ] 21

is the proper marginal posterior density of 0 . Since 0 is scalar, £ ( 0 |s) is easily normalized, plotted, and the posterior moments, pro­ vided they exist, computed. It is assumed f is such a function that £( 0 |s) is integrable with respect to 0 over R. The marginal posterior density of 0 is difficult to determine (except in special cases) in terms of simple functions, however the conditional posterior distribution of t is gamma with parameters

0 .'= ^ ^

(3.73)

and 3 f( 0 ) , where n 23?(e ) = 23

[y< i= l

f(x . , 0 ) ]

(3.74)

THE TRADITIO NAL LINEAR MODELS

106

and the marginal posterior mean of E ( t |s ) =

E

t

is

[ t '| B '( 9 ) ] ,

where the expectation is with respect to the posterior density of 0 and one must assume £(e|s)/& f( e ) is integrable with respect to 0 over R. Provided the other moments of t exist, they may be computed in a similar fashion, in terms of the conditional posterior moments of t . O f course, £(x|s) may be determined by a univariate numerical inte­ gration of the joint density £ (0 ,x | s), (3 .7 1 ), and we see i f 0 is scalar a complete Bayesian analysis is possible. What is the Bayesian predictive density of a future observation w of the dependent variable y when x = x*? Assuming w is independent of s , the conditional density of w given 0 and t is

g(w | 0 ,T ) cc

T 1 /2

e " Ttw' f ( x * ’ 6)1 /2,

W

G

(3.75)

R

and upon forming the product of g (w | 0 ,x ) and £ (0 ,x | s), the joint density of w , 0, and t given s is

,

n

h ( w , 8, t |s ) « 5 (e )T * n+2a+1* /2" 1 exp - ^|26 + 5 3 tYj “ f ( xj> 6)12

+ [w -

f(x * ,e )]2 L

(3.76)

where w e R , 0 e R , and t > 0. Eliminating x from (3.76) gives h ( w , 0 |s) cc - ---------- --------------------) (n+2oc+l) /2

{23 +

£

-

« x . , e )]2 + [ w -

0

f ( x* , ) ] 2J

w e R

(3.77)

for the joint density of w and 0, given s. But from (3 .7 7 ), one can identify the conditional distribution of w given 0 and s as a t with (n + 2a) degrees of freedom, location E(w 10,s ) = f ( x * ,0) and precision

(3.78)

NONLINEAR REGRESSION ANALYSIS

P (w | e ,s ) = (n + 2a)

j

23 +

107

[y. ~ f ( x . , 0 ) ] 2

j

.

(3.79)

How does one forecast a future value of w? If one knew 0 , one would use f ( x * , 0 ) ; however, since 0 is unknown, one could use the predictive mean of w, namely E(w |s) =

E (w |0 ,s)

E

0 |s which is calculated by averaging the conditional mean with respect to 0 . By numerical univariate integrations, the Bayesian predictive density of w can be computed by the integral ( i f it exists)

h (w | s) =

I

0

£ ( 0 |s)h(w 10,s ) d0,

w e R

(3.80)

where 5 ( 0 |s) is the marginal posterior density of 0 and h (w | e ,s) is the conditional predictive density of w , given 0 , which is a t with n + 2a degrees of freedom, location (3 .7 8 ), and precision (3 .7 9 ). One may sum up the Bayesian analysis of nonlinear regression, when 0 is scalar, by saying a complete analysis is possible i f f is sufficiently well-behaved to insure the existence of certain integrals; however, if 0 is of dimension greater than or equal to two, a Bayesian analysis becomes more difficult. If 0 € R ^, the marginal density of 0 is (3.72) but revised to be «e|s)

cc

--------------------------------------

^

[ 2 0 + 2 ^ [f (x . , 0 ) - y .]

,e € r p

) ( n + 2 a) / 2

j

where £ ( 0 ) is the prior marginal density of 0 on R ^, f is chosen so that £ ( 0 |s) is integrable over R ^ and the marginal prior density of t is gamma with parameters a and 3 . If p = 2, on e, by numerical bivariate integration, could develop the normalizing constant, the contours of constant density, and some of the bivariate moments, and by a series of univariate numerical integrations, determine the marginal posterior densities of the compo­ nents of 0. O f course when p > 3, the numerical integration problems become more impractical.

(

3. 8I )

THE TR AD ITIO NAL LINEAR MODELS

108

When p > 2, there are some special cases, where an exact and complete Bayesian analysis is possible. For example, let p = 2 and consider f (x , 6 ) = e2f 1( x , e 1) , where f^: R * R

01 e R ,

R and

e2 € R

(3.82)

is such that £ (e | s ), (3 .8 1 ), is integrable

over R ^ , and i f the prior density of 0 is

50) =

6 6 R 2

where E9(6 „ ) is constant over R , then the posterior marginal density Of 6 is

5 ( e | s ) = -----------

,

e eR

I rtl (n+2a)/2 | 23 + 2 ^ [y i ~ 0 2f 1( x . , 0 1) ] j

(3.83)

hence the conditional posterior density of 0 given 0 is Z L

5 ( e j e 1(s) 2

V

{ [ 02 - A ’ 1( e 1) B ( e 1) ] 2A ( e 1) + c ( s ) -

B 2( e 1) A ’ 1( e 1) }(n+2“ ) / 2 ’

02 6 R where

A O j) =

f^ x -.e ^ ,

= ^ y i V i ’V and



(3.84)

NONLINEAR REGRESSION ANALYSIS

109

n C (s ) = 23 + ] C y.2.

1 We see 0^ |©i has a univariate t distribution with n + 2a degrees of freedom, location

E (e 2 |el f s ) = A ^ e ^ B O ^ , and precision

g

(n + 2 a )A ( 0^) —

s ) __

.

C (s ) - B ( 0^)A * ( 0 ^

Also, the marginal posterior density of 0^ is CfCQf)

S (6 llS ) ' A 1' 2( e 1) | , . . ] - ---------------- -----------------------------------------------n ' (n + 2 a )12 V* (l) 2V + 2 - [yj - f l x . V ^ , * ] ] J

26



,

NONLINEAR REGRESSION ANALYSIS

111

( 1)

0

€ R

P1

(3.88)

hence if p = 1, £ is a scalar function o f 0 ^ € R, and ^ versus (1 ) 0 may be plotted, and each component of 0 has an easily determined conditional posterior distribution. If is "close" to the "true" value of 0 ^ , then 5 will not provide inaccurate inferences for 0 ^ . The classical approach to nonlinear regression, see Draper and Smith (1966), expands f in a Taylor’s series about known successive values of 0, then by an iterative procedure, attempts to maximize the likelihood function with respect to 0 and t . A similar approach could be taken here, from a Bayesian viewpoint. One other way to analyze a nonlinear regression problem is to first find the maximurn-likelihood estimate of 0, say 0 = [ 0 ( 1 ) , 0 ^ ] , then substitute 0 ^ for into the conditional posterior density of 0 ^ given (2 ) 0 = (J). Then posterior inferences for each component of 0 are pos­ sible, and one may obtain additional information about these parameters beyond what is given by the usual analysis, which consists of the maximum-likelihood estimate 0 and its large sample dispersion matrix. Consider this simple nonlinear regression model

y. = e

- 0x. 1 + e.,

i = 1,2, . . . , n

(3.89)

where 9 G R , (x ^ ,y .) is the i-th observation, where y. is the i-th obser­ vation of the random variable y, x. the i-th observation of the regression 1 -l variable x and the e. are n .i.d . ( 0 , t ) , where 0 and t > 0 are unknown parameters. where

0 =

2

Suppose n = 30 (x . ,y . ) pairs are generated from (3 .8 9 ), and t =

1

and that the prior density for

1

? (8 , t ) = t

0

and

t

is

1 19 - C t / 2 ) ( 6 - u ) 2P

A

e TB t

e

,

e e R and t > 0, (3.90)

where a = 2, sity of 0 is

3

= 1,

( 5( e|s) -

y

= 3, and P

^

{26+2^

=

1, then the marginal posterior den­

-e x ( Yi "

e

-(n + 2 a + l)/ 2 )

+ (9 -

3)

2.0 1.0

2.5

BA AA

Nonlinear regression — Bayesian posterior density.

0 .5

ABA

4.0

5

AAA AAA BAB AABABAAB

7 .0

ABAABAABABAABA

line ar

-

BA

t r a d it io n a l

Figure 3.1.

-

BAABABAABAAB

A AA

AA

the

0 .0075

0.0125

0.0150

0.0175

0. 0 20 0

0.0225

POSTERIOR DENSITY

112 models

NONLINEAR REGRESSION ANALYSIS

113 e€ r

(3.91)

which is plotted in Figure 3.1. Table 3.3 gives the 30 observations, which were generated from the nonlinear regression model (3 .8 9 ), and the posterior moments of 6, namely the posterior mean E (e |s) = 3.08133, the second posterior moment,

E (e 2 |s) = 10.8659, and the posterior variance V a r( 6 |s) = 1.37126.

Table 3.3 Observations from Nonlinear Model N

Y

N

Y

0.432834

Row 16

0.19206

Row

1

Row

2

-1.33177

Row 17

0.636969

Row

3

-0.702886

Row 18

-0.409463

Row

4

-1.64393

Row 19

2.15701

Row

5

-0.828128

Row 20

0.120331

Row

6

-0.738765

Row 21

-0.305884

Row

7

1.07194

Row 22

-0.728323

Row

8

0.731636

Row 23

0.176013

Row

9

0.797699

Row 24

0.409809

Row 10

0.0071879

Row 25

-0.594562

Row 11

0.0268496

Row 26

1.77496

Row 12

1.20312

Row 27

1.06013

Row 13

1.31212

Row 28

0.482367

Row 14

-0.690334

Row 29

0.692809

Row 15

1.08125

Row 30

0.522263

THE TR AD ITIO NAL LINEAR MODELS

114

Table 3.4 Posterior Density of 8 O bs.

e

Posterior density

O bs.

6

Posterior density

1

-0.94

0.0000000

29

0.74

0.0027856

2

-0.88

0.0000000

30

0.80

0.0031682

3

-0.82

0.0000000

31

0.86

0.0035714

4

-0.76

0.0000000

32

0.92

0.0039949

5

-0.70

0.0000000

33

0.98

0.0044384

6

-0.64

0.0000000

34

1.04

0.0049016

7

-0.58

0.0000000

35

1.10

0.0053843

8

-0.52

0.0000000

36

1.16

0.0058860

9

-0.46

0.0000000

37

1.22

0.0064063

10

-0.40

0.0000000

38

1.28

0.0069445

11

-0.34

0.0000000

39

1.34

0.0074999

12

-0.28

0.0000000

40

1.40

0.0080713

13

-0.22

0.0000000

41

1.46

0.0086575

14

-0.16

0.0000000

42

1.52

0.0092570

15

-0.10

0.0000000

43

1.58

0.0098680

16

-0.04

0.0000000

44

1.64

0.0104886

17

0.02

0.0001189

45

1.70

0.0111164

18

0.08

0.0002839

46

1.76

0.0117491

19

0.14

0.0003522

47

1.82

0.0123838

20

0.20

0.0004532

48

1.88

0.0130176

21

0.26

0.0005919

49

1.94

0.0136474

22

0.32

0.0007659

50

2.00

0.0142698

23

0.38

0.0009728

51

2.06

0.0148813

24

0.44

0.0012104

52

2.12

0.0154785

25

0.50

0.0014762

53

2.18

0.0160575

26

0.56

0.0017683

54

2.24

0.0166148

27

0.62

0.0020847

55

2.30

0.0171465

28

0.68

0.0024242

56

2.36

0.0176490

NONLINEAR REGRESSION ANALYSIS

115

Table 3.4 (Continued) O bs.

e

Posterior density

O bs.

0

Posterior dens

57

2.42

0.0181187

85

4.10

0.0131022

58

2.48

0.0185522

86

4.16

0.0124903

59

2.54

0.0189463

87

4.22

0.0118800

60

2.60

0.0192979

88

4.28

0.0112744

61

2.66

0.0196044

89

4.34

0.0106761

62

2.72

0.0198632

90

4.40

0.0100878

63

2.78

0.0200724

91

4.46

0.0095117

64

2.84

0.0202303

92

4.52

0.0089498

65

2.90

0.0203355

93

4.58

0.0084039

66

2.96

0.0203874

94

4.64

0.0078755

67

3.02

0.0203854

95

4.70

0.0073658

68

3.08

0.0203297

96

4.76

0.0068759

69

3.14

0.0202209

97

4.82

0.0064066

70

3.20

0.0200598

98

4.88

0.0059583

71

3.26

0.0198479

99

4.94

0.0055314

72

3.32

0.0195871

100

5.00

0.0051262

73

3.38

0.0192795

101

5.06

0.0047425

74

3.44

0.0189277

102

5.12

0.0043803

75

3.50

0.0185347

103

5.18

0.0040392

76

3.56

0.0181034

104

5.24

0.0037187

77

3.62

0.0176374

105

5.30

0.0034185

78

3.68

0.0171402

106

5.36

0.0031378

79

3.74

0.0166154

107

5.42

0.0028760

80

3.80

0.0160670

108

5.48

0.0026324

81

3.86

0.0154987

109

5.54

0.0024061

82

3.92

0.0149143

110

5.60

0.0021964

83

3.98

0.0143177

111

5.66

0.0020025

84

4.04

0.0137124

112

5.72

0.0018234

THE TRADITIO NAL LINEAR MODELS

116

Table 3.4 (Continued) O bs.

0

Posterior density

O bs.

0

Posterior density

113

5.78

0.00165834

132

6.92

0.00022984

114

5.84

0.00150651

133

6.98

0.00020576

115

5.90

0.00136707

134

7.04

0.00018413

116

5.96

0.00123922

135

7.10

0.00016470

117

6.02

0.00112217

136

7.16

0.00014726

118

6.08

0.00101518

137

7.22

0.00013163

119

6.14

0.00091753

138

7.28

0.00011761

120

6.20

0.00082851

139

7.34

0.00010505

121

6.26

0.00074749

140

7.40

0.00009381

122

6.32

0.00067382

141

7.46

0.00008375

123

6.38

0.00060693

142

7.52

0.00007475

124

6.44

0.00054627

143

7.58

0.00006671

125

6.50

0.00049131

144

7.64

0.00005951

126

6.56

0.00044158

145

7.70

0.00005309

127

6.62

0.00039662

146

7.76

0.00004735

128

6.68

0.00035602

147

7.82

0.00004222

129

6.74

0.00031938

148

7.88

0.00003765

130

6.80

0.00028636

149

7.94

0.00003357

131

6.86

0.00025661

150

8.00

0.00002993

These last three moments were computed via numerical integration of the marginal posterior density (3 .9 1 ). Table 3.4 is a tabulation o f the posterior density of 9, where the domain of the function is from -0 .9 4 to 8.00 to form a partition of 150 9-values with an interval width of .06. At the end points, £(e|s) is zero to at least four decimal places and the mode is at m (e|s) = 3.08. Upon inspection of the posterior marginal density of 6, one sees the heavy influence o f the prior density of 6, which has a mean at 0 = 3 , while the true value of 0 is located at 0 = 2, thus the informa­ tion about 0 from the sample does not overcome the highly confident prior information that 0 is located at three.

SOME MODELS FOR DESIGNED EXPERIMENTS

117

SOME MODELS FOR DESIGNED EXPERIMENTS

The analysis of designed experiments usually consists, among other th in gs, of an analysis of variance, which is a way to determine the effect of various factors on a response variable y . For example, in a completely randomized design t treatments are assigned at random to n experimental units. Suppose that n. is the number of units receiving the i-th treatment.

If y.. denotes the response measured from the j-th

unit receiving treatment i , a one-way analysis of variance is performed in order to determine if each treatment produces the same average response. The analysis of variance partitions the total sum of squares into "between" treatment and "within" treatment components, and if the ratio of their corresponding mean squares is "la r g e ," significant d if­ ferences between the treatments are said to e x ist, that i s , the average responses produced by the treatments are not the same. Fisher ( 1 9 5 3 ) , Kempthorne ( 1 9 5 2 ) , Cochran and Cox ( 1 9 5 7 ) , and Cox ( 1 9 5 8 ) are standard references for the design and analysis of experiments and Graybill ( 1 9 6 1 ) and Searle ( 1 9 7 1 ) give the theory for those models which are appropriate for the analysis of designed experiments. This section will give the Bayesian analysis of designed exper­ iments and is the analog of the classical approach given by Graybill ( 1 9 6 1 ) and Searle ( 1 9 7 1 ) . The so-called one-way model y.. = 0. + e.., Jij i i]

i = 1,2, . . . , t ,

j = 1,2, . . . ,

n. i

(3.92)

models the completely randomized design, where y.. is the response measured from the j-th unit receiving treatment i, 6. is the mean of the i-th treatment, and the ey ’s are n .i.d . (0 , t

) , where t > 0, thus the

average response from treatment i is 0., which is considered a real unknown parameter. The analysis of variance is a way to test H^: 0^ = 0^ = . . . = 0^ versus the alternative that

is not true, and it is equivalent to the

likelihood-ratio test. This is shown by Graybill ( 1 9 6 1 ) and Searle (1 9 7 1 ). The A O V test is to reject at level a whenever F = BMS/WMS

>F /0 . . a/2 ,t- l,n -t ,

(3 .9 3 )

where t bms

= £ (y . i= l

- y

) 2/(t -

1)

118

THE TR AD ITIO NAL LINEAR MODELS

and

wms =

EE 1

(y ^ - y, ) 2/(n - t ).

1

J

The Bayesian approach to the analysis of the completely randomized design is based on an HPD region for the vector of treatment effects

e = ( ©j,e2 ,

et ) \

T h e Com pletely Randomized Design The one-way model (3.92) is a special case of the general linear model of Chapter 1, where y is n * 1 (n = E* n.) and

/ y (1>\

.( 2 )

v - >

where y ^

is n. * 1 with components y 1

11

,y .9,

sign matrix X is n x t, where

, y. , and the dem. i

/ * i\ X =

w where x. is a n. x t matrix with i-th column consisting of ones, and the remaining columns are zero. The error vector n .i.d . ( 0 , t ) random variables. The likelihood function for e and t is

e is

n x 1 and contains

SOME MODELS FOR DESIGNED EXPERIMENTS

L (6,-r|y,x) = T ° /2 exp -

t

ni

i = l

j= l

|

119

(y ~ -

2

6 ) 2,

1]

1

B 6 R 1,

t

>0

(3 .9 4 )

and suppose the prior information of the parameters is a normal-gamma distribution with parameters y e Bayes theorem

R?,

P, a > 0, and 3 > 0, then by

t ? ( 9 , T | y , x )

oc

T ( n + 2 o + l ) / 2

1 e x p

2

+ ( 9 — vi ) ' P ( 0 - y ) >

ni

+

_ I

(y.. i= i j = l

.

1]

e )2 1

( 3 . 9 5 )

where 9 e R^ and t > 0, is the joint posterior density of 0 and t , which may be expressed as

C (e ,x |y ,x ) « T 0, and a = ( z !z)

£(a|y) « [ ( y -

z a )T(y -

(3.131)

za)], -1 z'y.

The marginal posterior

za) + (a — a )!zfz (a — a )]

rbt/2

a e Rbt

(3.132)

hence a has a t distribution with rbt — bt degrees of freedom, location vector E (a | y ) = a,

(3.133)

and precision matrix (3.134) It is interesting to observe that if r = 1, there are zero degrees of freedom and the marginal posterior distribution is improper, which is also a problem when one analyzes the experiment the conventional way, because then the error sum of squares is zero and one is unable to estimate t *, the error variance. One way for the Bayesian to avoid this is to use a normal-gamma prior density for a and t , but it perhaps would be a difficult task to assign values to the hyperparam eters. Let u s, for the time being, assume r > 1 and avoid the difficulty. How does one analyze such a model? Remember, one is interested in the way the levels of the two factors effect the average response S.. = m + a. + b. + (a b ).., i] l ] i] for i = 1,2, . . . , a and j = 1,2, . . . , b , and the model is said to have no interaction if (S.. - S.?.) - (S ..f - S.,.,) = 0 for all i, i T, j and jf , which is equivalent to saying (a b )y — (ab).^. -

(a b )„ , + ( a b ). t^.T = 0

for all i, iT, j, and jT, otherwise the model is said to be nonadditive or is a model with interaction. In the analysis of such models, one first checks for the presence of interaction and if there is none, one examines the levels a ^ ,a 2> . . . , at of the first factor and then the b levels of the second factor.

THE ANALYSIS OF TWO-FACTOR EXPERIMENTS

133

Thus, the first task is to develop a test for no interaction, namely, test the hypothesis H^: (a b)? . = 0, where i = 1,2, • • • , a and j = 1,2, . . . , b , where

t

b

1

1

i]

One may show there is no interaction if and only if (a b )* . = 0 for all i and j. For example, if t = b = 2, implies Ta = 0 (1 x l ) , where T is the matrix T = (1 , —1, —l , l ) f , thus to test for no interaction, one must find an HPD region for Ta. In general, to test for no interaction, one may find a (t — 1 )(b — 1) x bt matrix T , such that no interaction H^ is equivalent to Hq:

y = T a = 0 [(t -

l)(b -

1) x 1].

(3.135)

Since a|y ^ t [ r b t - b t,E (a | y ) , P ( a | y ) ] , where the location and precision are given by (3.133) and (3.134), respectively, y = T a also has a t distribution with b t (r — 1) degrees of freedom, location vector (3.136)

E( y |y) = T E (a|y) and precision matrix

(3.137)

P ( y |y) = [T P ~ 1(a | y ) T '] ”1, and a 1 HPD

A (0 < A < 1) HPD region for

A

( Y) = (F (Y | y ) < F

y

is

A , ( t - l ) ( b - l ) ,(r -l)b t

(3.138)

}

where

F ( y |y) = [Y - E ( y | y )]'P (Y | y ) [ y - E ( Y | y )](t -

i ) -1 (b -

l ) " 1. (3.139)

THE TRADITIONAL LINEAR MODELS

134

F

If no interaction is indicated, that is, if F(0 |y) > , one may analyze the additive model A , ( t - l ) ( b - l ) ,b t (r -l) y..„ = m + a. + b. + e.., i]k 1 ] i]k

according to the methods of the previous section. When r = 1, there is no Bayesian way to test for interaction, b e ­ cause Jeffreys* density was used to express prior information; however, one may develop a complete Bayesian analysis if one may express prior information by a proper prior density, such as the normal-gamma density for a and t . The above model can be used to analyze several experimental layouts. For example, a two-factor factorial arrangement of treatments in a completely randomized design could be examined by the methods presented in this section.

TH E A N A LY S IS OF C O V A R IA N C E

Analysis of covariance models combine the regression and the models for designed experiments into one model. For example, consider a completely randomized design consisting of t treatments, where the i-th treatment is assigned to n. experimental units, and the response y „ and a regressor or covariable x „ are measured on the j-th unit receiving the i-th treatment.

A model for such a situation is given by

y.. = a. + x..y + e.., i]

i

i]

where a. GR, x.. e R , y g i

t

(3.140)

i]

i]

R, and the e.. are n .i.d . ( 0 , t * ) , where l]

> 0 and i = 1,2, . . . , tand j = 1,2, . . . , n .. The unknownparameters

are

the treatment effects a ^ a g , . . . ,

a^, y, and

t

, and one is

usually

interested in the effect of treatments on the average responses E (y ..) = a. + x ..y . i]

i

i]

The covariable or concomitant variable x is introduced to increase the precision estimating the treatment effects. The covariable should be highly related to the main response y and is not to be affected by the treatments. Our objective is to examine the effect of the treatments on the average response after taking into account or "adjusting” for the effect of the covariable. One is referred to Cox (1958) for a discussion

THE ANALYSIS OF COVARIANCE

135

of experimental design considerations when one uses concomitant ob­ servations and to Gray bill (1961) for the conventional analysis of variance of such designs. From a Bayesian viewpoint, in order to study the treatment effects, adjusted for the concomitant variable, one must isolate the marginal posterior distribution of 0^ where 0^ = (a ^ .a ^ , . . . , a^)f is the vector of treatment effects, and it is necessary to assign a prior distribution to the parameters 0 and t where 0 = ( 0* , 0 V and 0 = The Bayesian 1 Z id analysis will produce results similar to the analysis of variance if one uses

y.

£ ( 0 ,

t

)

«

1 /

t

,

t

>

0,

0

,t+l

€ R

(3.141)

for the prior density; but it is no trouble if one uses a normal-gamma prior density. The likelihood function for 0 and t is

L ( e ,x |y ) = t n/2 exp - | [ ( 0 - A 1B )'A (e - A_1B) + C - B ' A ^ B ] ,

T

where A is a (t + 1) x (t + 1) matrix

A =

B is the (t + 1) x l vector

>

0,

6

e

R

t+1

(3.142)

THE TR AD ITIO NAL LINEAR MODELS

136 and

The marginal posterior density o f 0 is a t with n — (t + 1) degrees o f freedom, location vector

E(0 |y) = A _1B and precision matrix

P (e | y ) =

(n — t — 1)A

Inferences for the treatment effects are to be based on the marginal posterior distribution of 0^, which is also a t with n — t — 1 degrees of freedom, location vector

where 4> is a t * 1 zero vector, and precision matrix P O j y ) = t (I t ,)P'1(e | y )( I t , l|0’f How does one test H0:

ai

a2 " ■

at

for equality of treatment effects?

Since H^ is equivalent

a^.+1j = 0, i = 1,2, . . . , t — 1, consider S = T(

V

where T is a (t — 1) x t vector

1-1 0 0 1-1

... 0 0 ... 0

0

0

T =

0

1 -1

to H^: a.

COMMENTS AND CONCLUSIONS

137

then S = 0 [(t — 1) x 1 ], if and only i f HQ is true and S has a t distribu­ tion with n — t precision matrix

1 degrees o f freedom, location vector TE(0^|y) and

P (S | y ) = [ T P ' ^ e j l y j T * ] ' 1. It is known that F (S | y ) = [S - E (S | y )]'P (S | y )[S - E (S | y )](t -

l ) '1

has an F distribution with t - 1 and n - t - 1 degrees of freedom, thus Hq is rejected at significance level A (0 < A < 1) i f F(0 |y) > F4 A and it can be shown that this test is equivalent to the A ;t -l,n -t -l M analysis of covariance procedure. O f course, one perhaps should first test to see i f the concomitant variable x is actually needed in the model, i . e . , is y = 0? This can be done in the usual way by finding the marginal posterior distribution of y from the marginal posterior density of 0 and constructing an HPD region for y The above analysis of the completely randomized design with one covariable is easily extended to other design s, for example to a ran­ domized block design with one or more concomitant variables.

COMMENTS AND CONCLUSIONS The Bayesian methodology for the standard statistical models of reg re s­ sion and for designed experiments is given in this chapter. The methodology is based on the theory which was developed in Chapter 1, and the theory gave the prior, posterior, and predictive analysis for the standard linear model. Before analyzing the regression models, Bayesian inferences for two normal populations were examined and it was found that when the populations have a common mean but distinct variances, the marginal posterior distribution of the common mean was a p o ly -t, a distribution which will be again encountered in the chapters on mixed models and linear dynamic systems. The section on regression analysis was partitioned into su b­ sections on simple linear, multiple linear, and nonlinear regression problems. The simple and multiple linear models were studied on the basis of prior, posterior, and predictive analyses. With regard to nonlinear regression, it was shown that it is indeed difficult to specify the marginal posterior distribution of the parameters when there are

138

THE TRADITIONAL LINEAR MODELS

several parameters, because the distribution is not well-known as it is when the regression relation is linear. This chapter gives only a b rief introduction to the Bayesian analysis of models used to analyze designed experiments. If the de­ sign is completely randomized the Bayesian analysis (with Jeffreys' prior density) is more or less equivalent to the analysis of variance procedure for testing the equality of treatment effects. On the other hand, when examining two-factor experiments, such as randomized block designs, one must reparametrize the less than full rank (design matrix) models to full rank equivalents in order to produce the standard F tests of the analysis of variance (again using Jeffreys' prior den­ sity ). It was necessary to reparametrize the model to full rank b e ­ cause if one uses Jeffreys' prior density in conjunction with the less than full rank model, the posterior distribution of the parameters is improper. The chapter is concluded with a Bayesian test of the equality of treatment effects, where the design is completely randomized and one concomitant variable is measured on the experimental units. One may conclude that it is possible to develop a complete Bayesian methodology for the analysis of designed experiments and that the analysis will be more informative than the traditional analysis of variance.

EXERCISES 1.

2.

3.

If X and 3 > 3 if

is a Bernoulli random variable with parameter 0, 0 < 0 < 1 0 has a Beta distribution with parameters a and 3 (a > 0, 0 ), how would one set values for the hyperparameters a and one had past data X * , X * . . . , X* ,X* = 0,1, i = 1,2, . . . , m ? 1 2 m i Use the prior predictive distribution of X. Suppose X is n ( 0 , x ) and that 0 and t have a prior normal-gamma distribution with hyperparameters y , P, a and 3 . If z ,z , . . . , X L z represent earlier values of X , use them to set values for the m hyperparam eters. Refer to equation (3.15) and construct: (a ) a HPD region for 91 ~~ 02* an(*

4. 5.

a

region for t , the common precision.

Derive the joint prior density of t and t , that is verify equation (3 .1 8 ). Refer to equation (3 .2 8 ), the joint posterior density of 0^ and ©2* Are 0^ and ©2 independent?

Let y = 0.^ -

and develop

a normal approximation to the marginal posterior density of y .

139

EXERCISES

6.

In the Behrens-Fisher problem, what is the marginal posterior density of Refer to equation (3 .2 1 ).

7.

In the example of simple linear regression in thischapter (see pp. 85-94) find the mean of the marginal posterior distribution of

8. 9.

t using the data-based prior. In the section on multiple linear regression, explain why equation (3.52) gives a HPD region for

Refer to equation (3.103) but use Jeffreysf improper prior density (3.105), then show the analysis of variance test of H^: e1 = e2 = ... =

10.

is equivalent to the Bayes procedure of using the HPD

region of (3.103). Consider the k linear models with common parameter 8 y. = X .0 + e., i = 1,2, . . . , k where the y. are n * l , X. is n * p, 0 is p * 1 and

’ ek

are n .i.d . ( 0 ,t. *1 ) , and e e R ? and the t . > 0 are unknown pal n l ^ rameters. (a ) What is the conjugate prior density for the parameters 0 , (b ) (c )

11.

W Tk ? What is the marginal posterior density of 0 ? Hint: See equation (3 .2 8 ). Derive the marginal posterior density o f p = ( x 1 ,T 0 .........

l i

V To answer (a ) and ( b ) use the conjugate prior density of the parameters. Consider a linear model Y.. = 0. + 3 X.. + e.., i] i i] i] where 0. e R , 3 e

R , the X .. are known values of an independent

variable X, Y.. is the j-th observation of a random variable for 13 -1 group i, and the e.. are n .i.d . ( 0 , t ) , where i = 1 , 2 , . . . , a, ] = 1,2, . . . , b, and t ( > 0) is an unknown precision. This model is that corresponding to a g ro u p s, on each of which b pairs of ob­ servations (X ..»Y .j) are made. The independent variable X rep ­ resents a covariable which is related to the dependent variable Y . Using the improper prior density

THE TR AD ITIO NAL LINEAR MODELS

£

( 0, T, 8)

cc

~

T

, 0 G Ra ,

t

>

0,

0 G R,

where 0 = (0 ,0 , (a ) (b )

0 ) ' , do the following: a Find the marginal posterior density of 0. Find an HPD region for 8 and test H^: 8 = 0.

(c )

Find an HPD region for 0 and test

\

l

= = •••

= ea *

REFERENCES B ox, G. E. P. and G. C. Tiao (1973). B a y e s ia n I n f ere n ce in S tatistical A n a l y s i s , Addison-W esley, Reading, Mass. Cochran, W. G. and G. M. Cox (1957). E xperim ental D e s i g n s , 2nd e d . , John Wiley and Sons, New York. Cox, D. R. (1958). Planning o f E x p e r i m e n t s , John Wiley and Sons, New York. de Finetti, Bruno (1972). P r o b a b i l i t y , I n d u c t i o n , and S t a t i s t i c s , the A r t o f G u e s s i n g , John Wiley and Sons, London. D raper, N . R. and H. Smith (1966). A p p l i e d R e g r e s s i o n A n a l y s i s , John Wiley and Sons, I n c ., New York. Fisher, R. A . (1951). D e sig n o f E x p e r im e n ts, 6th e d . , Oliver and Boyd, Edinburgh. G raybill, F. A . (1961). A n In tro d u ctio n to Linear St ati sti cal Models, McGraw Hill, New York. Kadane, Joseph B . , James M. Dickey, Robert L. Winkler, Wayne S. Smith and Stephen C. Peters (1980). "Interactive elicitation of opinion for a normal linear model," Journal o f American Sta­ tistical Association, Vol. 75, pp. 845-854. Kempthorne, O. (1952). D e sig n an d A n a l y s i s o f E x p e r im e n ts, John Wiley and Sons, In c ., New York. Lindley, D. V . (1965). I n tro d u ctio n to P ro bability an d S t a t i s t i c s from a B a y e s ia n V iew point, University P ress, Cambridge. Lindley, D. V . (1968). "The choice of variables in multiple regression, Journal of the Royal Statistical Society, Series B , Vol. 30, pp . 31-66. Lindley, D. V . and A . F. M. Smith (1972). "Bayes estimates for the linear model," Journal of the Royal Statistical Society, Series B , Vol. 34, pp. 1-42. Patil, V . H. (1964). "The Behrens-Fisher problem and its Bayesian solution," Journal of the Indian Statistical Association, Vol. 2, pp. 21-31. Scheffe, Henry (1943). "On solutions to the Behrens-Fisher problem based on the t-distribu tion ," Annals of Mathematical Statistics, Vol. 14, pp. 35-44.

REFERENCES

141

Searle, S. R. (1971). Linear Models, John Wiley and Sons, In c ., New York. Sedory, S. A . (1981). Pursuing the Poly-1. Unpublished Masters Report, Department of Statistics, Oklahoma State University. Winkler, Robert L. (1977). TtPrior distributions and model building in regression analysis,” in New D e v e lo p m e n ts in the A pplications o f B a yes ia n M e t h o d s , edited by Ahmet Aykac and Carlo Baumat, North-Holland, Amsterdam. Y u soff, Mat Abdullah (1982). B a y e s ia n I n f ere n ce w ith the P o l y - t D i s tr ib u tio n . P h .D . dissertation, Oklahoma State University. Zellner, Arnold (1971). A n I n tro du ctio n to B a yes ia n In fere n ce in E con o m etr ics , John Wiley and Sons, In c ., New York. Zellner, Arnold (1980). ”On Bayesian regression analysis with g prior distributions,” Technical reprint, H. G. B . Alexander Re­ search Foundation, Graduate School o f Business, University of Chicago.

4 THE MIXED MODEL

IN T R O D U C T IO N

The mixed model is more general than the fixed because in addition to containing fixed effects the model contains so called random effects. The random factors of the model are unobservable random variables and these have variances called variance components which are the primary parameters of interest. In the traditional analysis, the fixed effects are parameters but are not regarded as random variables. Fixed, mixed, and random are terms which came from the sampling theory framework and are not applicable to the Bayesian ap­ proach; nevertheless we will continue to use them, but within a Bayesian context. The fixed model is the general linear model Y = X6 + e of Chap­ ter 1, where oneTs main interest lies with the posterior distribution of 6 and not with the parameters per se of the posterior distribution of 9. With a mixed model c y = X0 +

u.b. + e,

1

(4 .1 )

11

the random factors b 1 ,b 9, . . . , b

are independent random vectors with c 2 zero means and dispersion matrices a. I , and one's main interest is i m. 2 1 with the variance components a ., i = 1,2, . . . , c , not the posterior

143

THE MIXED MODEL

144

distribution of the random factors. I f 0 is a vector of fixed effects, one is also interested in the posterior distribution of 0, not with the parameter of the posterior distribution, although these are indispensable for describing the posterior of 0. The usual assumption about the random factors is that they are independent and normally distributed, namely N [ 0,

-1 I

] , where

and the error precision N [0 ,

t

h n],

t

t.

=

-2 ,

which is equivalent to a prior asi sumption about the random factor parameters. Now, since the variance components are parameters of the random factors, one must introduce a prior distribution for their analysis. Note there are two levels o f parameters. The primary level con­ sists of 0, the vector of fixed effects, the random factors b „ , . . . , b , t.

-2

= a

1

, where the random error term e -

and is independent of the random factors.

C

The secondary

level consists of the vector of precision components, p = ( t ) f . Thus, the precision components (o r variance components) are c parameters of some of the primary level parameters of the model. For a complete description of the model we let X be a full rank n x p matrix, and let the u. be n x m. known design matrices. t

There are p + m + c + 1 parameters, with p + m + 1 primary level parameters, and c secondary level parameters. The random factors are usually regarded as nuisance parameters, thus there are p + c + 1 parameters of interest. Most of the work concerning random (p = 1) and mixed models has centered on estimating the variance components

t

.

* and until

1967 the methodology was to equate the analysis of variance sum of squares to their expectations and solve the resulting system of linear equations for estimates of the variance components. For example, in a one-way random model, the between and within mean squares are equated to their expectations, giving analysis of variance estimates of the between and within components. See Searle (1971), pp. 385-389 for an example. The original methodology was introduced by Daniels (1939) and Winsor and Clarke (1940), and the sampling properties of the estimators studied by Graybill (1954), Graybill and Wortham (1956), and Graybill and Hultquist (1961). For unbalanced data, Henderson (1953) introduced three analysis o f variance methods of estimating variance components, and the properties of these estimators were examined by Searle (1971). The next new development was maximum likelihood estimation given by Hartley and Rao in 1967 and since then there have appeared many new methodologies including restricted maximum likelihood, minimum norm quadratic unbiased estimation or MINQUE, iterative MINIQUE, MIVQUE,

THE PRIOR ANALYSIS

145

or minimum variance quadratic unbiased estimation. These and other methods of estimating variance components are described by Searle (1978). Box and Tiao (1973) is the only Bayesian book dealing with the variance components of random and mixed models. They give a very thorough treatment of the subject and the methodology is based on a numerical determination of the one- and two-dimensional marginal pos­ terior distribution of the variance components. Thus with the one­ way random model, which has two variance components, the one twodimensional and two one-dimensional marginal posterior distributions are given for two data sets of Chapter 4. One interesting feature of their analysis is the choice of prior distribution. They show a Jeffreys1 type vague non-informative prior on the variance components is equivalent to a vague prior on the expected mean squares of the analysis of variance. They continue the study o f variance components by examining the two-fold nested random model, the two-way random, and the two-way mixed models. Other studies from a Bayesian approach have been taken by Hill (1965, 1967), who studied the one-way model, and Stone and Springer (1965) who criticize the Box and Tiao choice of prior distribution. Regardless of oneTs approach, it is safe to say that to estimate variance components is indeed a difficult task. There are so many sampling methods, it is difficult to declare any one as superior to the others, while with the existing Bayesian techniques one must rely on multi-dimensional numerical integrations. I suspect, variance compo­ nents, except for the error variance, are difficult to estimate because they are secondary level parameters (at least more difficult to estimate than primary level param eters). This suspicion is motivated by Goel and DeGroot (1980), who assess the amount of information of hyper­ parameters in hierarchical models. Inference about parameters of a mixed model will be accomplished by first determining the one-dimensional marginal posterior distribution of the variance components and the marginal posterior distribution of the fixed effects 6, then finding the joint posterior mode of all the parameters in the model.

TH E PRIOR A N A LY S IS To recapitulate, consider the model ( 4. 1) , where y is a n x 1 vector of observations, X an n x p known full-rank design matrix, u. a n x m. known design matrix, 0 a p x 1 real unknown parameter vector, b. a m. x l random real parameter vector, and e a n x 1 normal vector

THE MIXED MODEL

146

with mean 0 and precision Tln . Furthermore assume ••• > b » and e are independent. Thus given 0 , the b ., and t ( > 0 ), y is normal with mean X 0 + ub and precision Tln . The prior information is introduced in two stages: conditional prior distribution of b. given i l

First, the

is b. ~ N( 0, x. I ) , i = l l m. l 1,2, . . . , c , which is the usual assumption about the random factors, x.

6 has a marginal prior density which is constant over , and t is gamma with positive parameters a and 3. Also assume 0 , b , and t are independent, a priori. Secondly, assume the secondary param­ eters t , t 0 , . . . , t are independent and that t . is gamma with positive 1 2 c 1 parameters a and 3..

Why this choice of prior distribution? of the prior density, namely,

,_. j .

p(0,b,T |p)

TT 11

J L a

l

m./2 t.

ill

- x.b!b./2

e

a-1e -iB ,

t

As will be seen, this form



,

i= l (4 .2 ) _

a.-l

p ( p ) * 1 1 t. 1 i =l

-

e

t

. 3.

11

m. for 0 G R P , x > 0 , b. g R 1, and x. > 0, produces an analytically tractable posterior distribution for the parameters. If one believes this prior is not flexible enough to express one's prior opinion about 0, p , and t , one may use mixtures of these distributions which will allow more flexibility. A normal prior distribution could have been used with 0, but this yields messy posterior distributions. O r we could give 0 and t a normal-gamma distribution, b and p a normalgamma, and let 0 and b be independent. This would give messy but tractable posterior distributions. Note that (4 .2 ) is an improper prior density. The likelihood function or probability density of y given 0, b , and t is

L( 0, b, T) « for

0

€ R*\ b

G

Tn / 2

exp -

Rm, and

x

| (y -

x e - u b ) '( y - Xe - u b )

> 0, where m =

m..

(4 .3 )

THE POSTERIOR ANALYSIS

147

This will be combined with ( 4. 2) , by Bayes 1 theorem, to give the posterior distribution of all the parameters.

TH E POSTERIO R A N A LY S IS

In this section, joint and marginal posterior distributions for the param­ eters will be d eriv ed , thereby giving us the foundation for making in­ ferences about the parameters of the model. F irst, the joint distribution o f all the parameters is found by combining the likelihood function with the two-stage prior density, then the conditional posterior distributions of , b , x , and p , given the other parameters is found. Second, the conditional distributions of given b , x given b , and p given b are derived and lastly, the marginal posterior density of p and x is found. Combining the likelihood function (4 .3 ) with the prior density ( 4 . 2 ) , we have

e

e

p ( 0 , b , x , p |y) « T( n+2ot) / 2 1 exp _ I {23 + y»Ry - b'u 'R u b

+ (b -

b ) Tu fR u (b -

b)

+(e -eyx'xo -e)} 11 t. JL i

((m .+ 2 o .)/ 2 ) - l

1

X.

X exp where 0 € R ^,

y - (26. + b! b .) ,

b e Rm, x > 0, and x. > 0 , as thejoint posterior den­

sity of all the parameters.

The various quantities in this expression

are b = (u ’R u )'u 'R y , R = I -

XCX’X ) ' ^ ' , and 9 = (X 'X )’ 1X '(y -

where A is the unique Moore-Penrose generalized inverse of the matrix A. An equivalent representation of this density is

p(e,b,T,p

(4 .4 )

|y) = T( n + 2 a ) / 2 - 1 exp -

\

[26 + (y - X9 - u b )'

u b ),

THE MIXED MODEL

148 T.

x exp -

where e e R ^ , T > 0 ,

t.

y

> 0, and

(b!b. + 2 3 .)

b

e Rm.

(4 .5 )

This form of the density

allows one to readily deduce the many conditional posterior distributions of the parameters. Thus we have

Theor em 4 .1.

The conditional posterior distribution of 0 given b , t , and p is normal with mean 0 and precision matrix tX 'X . The con­ ditional posterior distribution of b given 0, t , and p is normal with mean [ x u fu + A ( p ) ] ^ u ^ y — X0) and precision matrix t u ' u + A ( p ) , where A ( p ) is the m x m block diagonal matrix with i-th diagonal matrix t . I , i = l , 2, . . . , c . The conditional posterior distribution of t 1 m.

1

given 0 , b , and p is gamma with parameters (n + 2a)/2 and [23 + (y — X0 — u b )’(y — X0 — u b )]/ 2 . Finally, the conditional distribution of p given 0 , b , and t is such that

* * * * Tc are *n(iePen(ient anci

rameters (m. + 2a.)

12

*s gamma with pa­

and ( 2 3 . + b !b .)/ 2 , i = 1,2, . . . , c.

Thus the conditional distribution of each parameter given the other is a well-known density and, as will be shown, provides a con­ venient way to estimate the mode of the joint posterior density. O f course, our goal is to determine the marginal posterior dis­ tribution of 0, t , and p, regarding b as a nuisance parameter, but first consider the conditional distributions of 0 given b and ( t , p) given b .

T heo rem 4.2. The conditional posterior distribution of 0 conditional on b is a multivariate t with mean vector 0 and precision matrix T = ________________________________________________ 23 + y fRy - b ’u'Rub + (b - b ) ’u fR u (b - b ) ’

(46)

K

The conditional distribution of t conditional on b is gamma with parameters (n - p + 2a)/2 and [ ( 2 3 + y ?Ry - 6 fufRu6 + (b — b ) tutR u (b - b )]/ 2 . Also, conditional on b ; t , 1 z ..., tc are independent and t . is gamma with parameters (m. + 2 a . ) 12 and ( 2 3 . + b! b.)/2 , i = 1,2, . . . , c. We will use the conditional distributions o f the previous theorem in order to find the marginal posterior distributions o f 0, t , and p , and to do th is, we n eed, also, the marginal posterior distribution of b , which is given by

J

THE POSTERIOR ANALYSIS

149

Theor em 4, 3, The marginal posterior density of b , the vector of random effects, is

p (b | y ) - [1 + (b - b ) 'A ( b - b ) ] ' (n ' p+2“)/2

JL - ( m+ 2a ) /2 x I I (1 + b!A .b.) , 1

(4 .7 )

where b e Rm, u'Ru

/V

>

23 + y'R y - b'u 'R u b and A. = ( 2 6 . ) '1I , i i m.

i = 1,2

c.

i

This density is called a multiple t or poly-t density and is dis­ cussed at length by Dreze (1977). It arises in many cases of reg re s­ sion analysis and Dreze discusses the various situations it will occur in in econometrics. Except in the trivial case, when c = 1, the usual t density, the poly-t distribution is difficult to work with because the normalizing constant is unknown. For an exact Bayesian analysis of the variance components, one needs their marginal distributions. Consider the following.

T heorem 4, 4,

The marginal posterior density of

p (t ,p | y ) * A (

t ,

t

> 0 and

t.

and

p

is

-1/2 i ~ p) | fi( t , p) | exp - - b ' g j C p )

x [(f^t.p) where

t

6 ' 1( p ) ] 6 1( p ) b .

> 0, i = 1,2, . . . , c.

A ( t , p ) = T^n P+2ot^ 2 1 exp _ | (2 B + y'R y - b'u'R ufi)

S. x

1 1

t.

(m.+2a.) /2-1 1 1 exp -

t.B.

(4 .8 )

THE MIXED MODEL

150 3 ( t ,p ) = 6 1( p ) + A ( p ) , where 31( p ) =

t u 'Ru

.

The above joint posterior density of t and p is a very complicated expression and is analytically intractable. For instance it seems im­ possible to integrate it with respect to the precision components. Another way of finding the marginal densities of the variance components is to combine the marginal density of b with the conditional densities of the variance components given b , then integrate with respect to b , but this will require an approximation to the density of b.

A P P R O X IM A TIO N S

The posterior distribution of b is a key factor in obtaining the con­ ditional distribution of the variance components as well as determining the posterior marginal distribution of these parameters. The distribu­ tion of b is very difficult to handle, thus one way of solving the p rob ­ lem is to find an approximation in terms of simple expressions. Since a multivariate t distribution may be approximated by a multivariate normal with the same first two moments as the multivariate t distribution, [1 + (l / k )( X -

y)'A(X -

y)]

^n+k^ 2 may

ap­

proximated by exp - l/ 2 [(k - 2 )/ k ](X - y ) ’A( X - y ) for any X € Rn and any non-negative definite matrix A . Using this approximation with each of the c factors of ( 4. 7) , one may show the density of b may be approximated by 1 p (b |y) « exp - - (b -

b * ) 'A * ( b - b * ) ,

b e R

m

,

where A * = A x + A 2,

b * = ( A * ) _1A 6 , A^ = (n -

p + 2a -

m-

2)u'Ru[2$ + y ’Ry -

b 'u 'R u b ],

( 4. 9)

APPROXIMATIONS

151

and A , = {D IA G [(o . l i

1) /8,11 ) i m. l

The matrix A^ is m x m block diagonal with i-th diagonal matrix [ ( a. l

1) /B.] I , where i = 1,2, l m. l Thus the posterior distribution of b is approximately normal with mean vector b * and precision matrix A * , however we would expect the approximation not to be good unless 6 is close to the zero vector. See Box and Tiao (1973), Chapter 9, where they discuss the poly-t distribu­ tion. Later, we will discuss how good the approximation is, using a one-way random model. The mean of b , namely b * , is the approximate mean of the distribu­ tion of b and can be used as the conditioning value of b. For example, if conditional point estimates for the parameters are needed, the mean of these conditional posterior distributions of Theorem 4.2 can be taken as estimates. They are

6 = ( X ' X ) ' 1X '(y -

u b *)

for the fixed effects,

o*

= (n + 2a — p — 2) *(2B + y ' R y — b ’u 'R ub)

for the error variance, and a = (2g. + b.*'b.*)/(m. + 2a. l i l l l l for the variance components.

2 ),

Note, b * = ( b * T, b* ’ , . . . , b * ’) 1 is the

partitioned form of the approximate mean of the posterior distribution of b. Our objective here is to determine a convenient approximation to the marginal posterior distribution of the variance components. To do this, we need the distribution of quadratic forms in normal variables. Let Z be a random n-vector having a normal distribution with mean y and positive definite dispersion matrix D , or symbolically let Z - N (y , D) . Let Q ( Z ) = (Z - Z q) ’M(Z - Z^) , where Z^ is a known n vector and M a known non-negative definite matrix, then it is clear the distribution of Q ( Z ) is the same as that of (Z * — Z^) ’M( Z* — Z^) where Z^ = Z^ — y and Z* - N ( 0 , D) .

THE MIXED MODEL

152

Since D is positive definite, there exists a nonsingular lower triangular matrix L such that D = LL \ Also, since L TML is symmetric there exists an orthogonal matrix P such that P 'L ’MLP = A , the diag­ onal matrix of eigenvalues of L ’ML. Using the transformation Z* = LPW, one may show the distribution of Q ( Z ) is that of n 2 Z X.(W. -

W .)2,

where A = di ag( X1>X2

Xn ) ,

w' = (w 1,w 2

wfi) ,

w = ( w r w2,

wn ) ,

w = P'L

and

n = number of rows of M. If n 1 = rank of m < n, then assuming the last n — nT of the X’s to be zero, one may show the distribution of Q( Z ) is the same as that of n Q( W) = £

1

X.(W. - w .)2. 1 l l

Ruben (1960, 1962) has shown that for given A , n, and w, the distribution of Q( W) may be expressed as Fn ( q / A, w) = P r [ Q ( z ) < q]

n = Pr

S A.(W - w . ) 2 < q Lj=1 1 1 1 J

_ +OT E -e .P- r (x 2 j=0

]

< q/c) ,

(4.1 0)

POSTERIOR DISTRIBUTION OF o2

153

where n

n

V j ej

(r »l),

e Q = exp

n-1 e

r

= (l/ 2 r ) £ j=0 n

G r

= E (1 j= l

n c/X.)r + rc 3

L ( w 2/X. ) ( l j= l 3 3

c/UM 3

( r > 1),

and c is an arbitrary positive constant. Ruben has also shown that the series on the right of (4.10) is uniformly convergent over any finite interval of q and the bound for error in the above series is the n-th term. Furthermore it is obvious that the distribution in (4.10) can be an infinite mixture of chi-square densities provided c is chosen to be less than the minimum of ( ^ » * 2 * . . . , X ) , to insure each e. > 0. n l

APPROXIMATION TO THE POSTERIOR DISTRIB UTIO N OF a2 From Theorem 4.2, the conditional distribution of t given b has a gamma distribution with parameters a* = (n — p + 2a) 12 and B* = (1/2)(2B + y fRy - b'u 'R u b + Q ), where Q = (b - b ) TufR u (b - b ). From Rubenfs representation, let b * , b , A * * and uTRu, replace y , Z q , D , and M respectively, and let s = n = rank of u?Ru, then the marginal density of

t

is

THE MIXED MODEL

154 x (q / c )S/2+j“ 1(l/ c )d q ,

t

> 0

(4.11)

where 2k = n — p + 2a, and d = 28 + y'R y — b ’u’R ub, and c > 0. Now let k be an integer. To insure this, one needs to make a slight adjustment in the value of a, but the change in a need not be larger than 0.5.

2

Using this assumption, the density of a is given by

, 2, -d/2a2 . . ...k 2 (k+1) p (a |y) = ke (d/2) (1/a ) e. J j=0 r ( s/2 + j ) ( l + c/a2) S/2+j

k T(s/2 + r + j)

E

r=0 r ( r + l ) r ( k -

r + l ) [( d / 2 ) ( l / c + l/a2) ] r (4.12)

APPROXIMATION TO THE POSTERIOR DISTRIBUTION OF THE VARIANCE COMPONENTS Consider the i-th precision component x., then using RubenTs result, -l1 replace y , z^, D , and M by b * , 0, A.* , and I respectively and i let s. = m. = rank of I , where A * is the precision matrix of b. in the 1 1 m. i ^ l l normal approximation to b , given by ( 4. 9) . The posterior marginal density of x. is

p(Ti,y) = F u p JQ -oo

a* a * - l

exP - ^

00

>S e« [ ^ Pr 0 , a.* = (m. + 2a.)/2, 8* = (28. + b!b.)/2, and the e.^ are given by ( 4. 10) , with e.^ = e^., and c. is an arbitrary positive constant.

POSTERIOR INFERENCES

155

One may let k. = (m. + 2a.) 12 be an integer with a slight adjustment in 1 2 1 -1 1 a . , then letting a. = t . , the above formula reduces to l & l l 2 .................................2 (k i+1> p(o. |y) = k.(d./2) *(1 / a .) 1 exp -

. 2 d./2a.

e.. i] 2, (8,/2+J) j=0 r( s. /2 + j ) ( l + c./Op

k. * r=0 r ( r + l ) r ( k . -

r (s./2 + r + j) r + l)[(d ./ 2 )(l / c . + 1lo*) ]

(4.13)

2

where a. > 0 .

1

Using the normal approximation to b and Ruben’s representation of a quadratic form in normal variables, formulas (4.12) and (4.13) are approximate densities for the marginal posterior distribution of the error variance and i-th variance component respectively. Ruben’s re ­ sult is exact, however, the normal distribution of b is an approximation to the true distribution of b , given by ( 4 . 9 ) , therefore the adequacy of the approximation to the posterior distribution of the variance com­ ponents and error variance depends on the adequacy of the normal approximation to b . Thus, up to now, the exact marginal and conditional posterior distributions of the parameters of the mixed model have been derived. Since the exact marginal density ( 4. 8) of the error variance and v a ri­ ance components is analytically intractable, it is necessary to rep re­ sent their marginal density in a more convenient form, thus the normal approximation to b and Ruben’s representation was employed. These derivations are found in Rajagopalan (1980) and Rajagopalan and Broemeling (1983).

POSTERIOR INFERENCES Inferences for the parameters of the mixed model are to be based on the appropriate posterior distribution and two approaches will be taken here. First, in regard to estimating the error variance and variance components, a plot of the relevant marginal posterior distribution will be employed, along with the posterior mean and variance of that pa­

THE MIXED MODEL

156

rameter. Secondly, joint modal estimates of all the parameters will be calculated from an iterative technique.

M arginal Posterior Inferences

2

Consider making inferences for the error variance a of a mixed model,

2

then one must plot the posterior marginal distribution of a , either by numerical integration, via the joint posterior density ( 5. 8) of t and p , or via the approximate density (5 .1 2 ). O f course, in a similar way, inferences for the variance components are made by plotting the marginal posterior density of that parameter, via numerical integration from ( 4. 8) , or by plotting the approximate density (4 .1 3 ). Accompanying the plot of each parameter one may compute the posterior mean and variance. Again here one has a choice. Either compute these moments from the exact joint marginal posterior density ( 4. 8) , or from the approximate densities, (4.12) and (4 .1 3 ). In the latter case, one may derive formulas for the mean and variance.

Posterior Means and V ariances fo r V arian ce Components The moments of a

2

2

and a. are difficult to obtain directly from the

approximate densities (4.12) and ( 4. 13) , but may be obtained relatively easily by first computing the conditional moments of these parameters given b , which are exact, then averaging conditional moments over the distribution of b , using the normal approximation to b , ( 4. 9) .

2

For instance, the moments of a (i) (ii)

can be computed by noting that

t given b has a gamma distribution, from Theorem 4.2, and b has an approximate normal distribution. Using the formulas

(i)

E (o 2) = E [ E ( a 2 | b ) ] , and

(ii)

2 ^ 2 V a r(o ) = V a r [ E ( a b

|b)] + E[ V a r ( a b

2

|b),

where E denotes expectation with respect to b and Var means variance b b with respect to the distribution of b , one may show the marginal posterior mean of

is

POSTERIOR INFERENCES

E (a 2 |y) = (n -

157

p + 2a — 2) *[2$ + y !Ry - b'u'R ub

+ TrCu’R u A * "1) + (b * - b )'u TR u (b * - b ) ] ,

(4.14)

and the marginal posterior variance is

Var( 4 — 2a. 2 In a similar way, the first two posterior moments o f a. are

E(o.2 |y) = (m. + 2cl -

2 ) ' 1[26j + T r C A * '1) + b * 'b * ]

(4.16)

and

Var(a.2 |y) = 2(m. + 2a. -

2 )‘ 2(m. + 2a. -

+ b.*'b.*]2 + (m. + 2a -

1 1

x if m. > 4 -

l

l

i

2 ) ' 1(m. + 2a. -

l

[2 T r(A .*"1) 2 + 4b.*'A.*~ H>.*]

l

i

l

4 )" 1[2B. + T rC A *"1)

l

i

4 ) '1

(4.17)

2 a , i = 1,2, . . . , c.

l

All the higher moments, provided they exist, of the variance com­ ponents and error variance may be obtained in a similar fashion.

An Example o f M arginal Posterior A nalysis A balanced one-way random model is considered to illustrate the p re ­ ceding results. The linear model describing such a design is

THE MIXED MODEL

158 y.. = u + b. + e..;

i = 1,2

m;

j = 1,2

t,

2 where b. is normal with mean 0 and variance a - , and the e.. follow a

1

1

2

1]

normal distribution with zero mean and error variance a for all i and j. This is the mixed model (4 .1 ) with c = l , m = m , p = l , and n = mt. Also

y = ( y n , • • • - y l t ; y 21. • • • > y 2 t ; - - - y m l * x = j , a n x l vector of on es, n 0 = y , a real scalar, U = Diagonal

b

=

•••> it ) »

( b l ’b 2

V

of

order n x m, and

-

For this particular case, it can be shown that R = I

n

— (l/ n )J

ufRu = t [I

m

-

n

, J a n x n matrix of ones, n

(l/m )J

(u ’R u )" = d / t ) [ I

m

-

], m ( l/m)J ] , m

b = ( y x - y, y 2 - y, . . . , y m - y ) ', b !uTRub =s^, the between group sum of squares,

A , = (n - m + 2a - 3) ( 2g + s j ' ^ t l (t/m)J 1 2 m m where s 2 is the within group sum of squares, A 0 = [(cv

i

l

-

D/3JI , 1 m

A* = A

l +i

A = ( a t + g )I m

a = (n

— m +

2a —

3 )( 2 3

+

(at/m)J

s 2> *,

, m

and

where

],

POSTERIOR INFERENCES g = (cij -

159

1 ) /3j.

Using the above results, one may show that

( A * ) 1 = (a t + g ) 1Im + at(gm ) 1(at + g ) 1Jm>

b * = at(at + g )

u 'R u C A * )'1 = t ( a t + g ) " 1 [I

m

-

(l/m )J

], m

b*»b* = a2t (a t + g )

(b * - b )'u 'R u (b * -

b ) = g 2(a t + g ) ~ 2s .

The conditional Bayes estimators conditional on b * are

a2 = (n +

2a -

3)

1 [ 23+

s 2 + g 2S j(a t

+ g ) 2] , (4.18)

a2 = (m + 2a1 -

2) 1 [2B1 + a 2ts (at + g ) 2] .

Approximate Bayes estimators, based on the approximations to the marginal posterior distribution of the parameters are given by (4.14) and (4.16) and reduce to the estimators a2 = (n + 2a -

3) 1 [ 23+ s 2 + t(m -

l ) ( a t + g ) 1 + g 2s 1(a t + g )

2] ,

(4.19) a2 = (m + 2a1 -

2) 1[261 + ( a t + mg ) g

1( a t +

g ) 1 + a2t s 1(a t + g ) 2] .

The approximate Bayes estimators (means of posterior densities) and the conditional Bayes estimators (means of the conditional pos­ terior distribution) are very similar. For example, we see they each have two terms in common, only differing in the third term t(m - 1)* (at + g ) * for the error variance and (at + mg) g \ a t + g ) * for the between variance component.

THE MIXED MODEL

160

-2 The usual AOV estimators, see Box and Tiao (1973), are a = s 2t 1(m -

1) 1 and

= [s 2t 1(m -

1) 1 -

s^ m -

1) 1] ( t -

1) and

are quite different than the conditional Bayes and approximate Bayes estimators, which are not linear functions of the AOV sum of squares (as are the AOV estimators), but instead are ratios of integer powers of the AOV sum of squ a re s. For this model, the AOV estimators are minimum variance quadratic unbiased estimators (bu t are negative with positive probability), see Graybill and Wortham (1956), thus it would be interesting to compare the sampling mean square errors of Bayes and AOV estimators. We now continue with our example of the one-way random model by plotting the posterior density of the variance components and cal­ culating the posterior mean and variance of these parameters for a set of data employed b y Box and Tiao (1973). For the generated data the

2

2

true value of a is 16 and is 4 for the between variance component a^. Many different sets of values o f the hyperparameters ( a , 3 , a ^ , 3 ^ ) were taken so as to examine the effects of the prior parameter values on the two marginal posterior distributions and to see the effect on the closeness of the approximation. For each set of hyperparam eters, the posterior means and v a ri­ ances of the variance components were calculated in two ways. First from (4.14) —( 4.17) , the approximate moments were computed, then the true moments, mean and variance, were calculated from the exact posterior densities via numerical integration. The eigenvalues and other constants needed to evaluate the approximate densities (4.12) and (4.13) and their moments, (4 .1 4 )-(4 .1 7 ) , were computed with the MATRIX procedure in SAS (Statistical Analysis System). The true posterior distributions are given by ( 4. 8) , and the relevant marginal densities and their moments were evaluated using numerical integra­ tions. This was done with a FORTRAN program (W A T F IV ). To eval­ uate the approximate densities (4.12) and (4.13) only 20 terms in the mixture were taken, because the contribution from the remaining terms was negligible. The constant c was chosen to be the smallest eigen­ value in order to make (4.12) and (4.13) mixtures. The calculations were done on an IBM 370/168 at Oklahoma State University by M. Rajagopalan (1980). Table 4.1 provides one with the approximate and true values of the posterior means and variances of the variance components for various values of the hyperparameters. The exact and approximate marginal posterior densities of the two parameters are plotted in Figures 4.1-4.6 for six sets of hyperparam­ eters. Results of the numerical study indicate the approximations are close, generally, to the true means and variance.

POSTERIOR INFERENCES

161

The closeness of the approximations increases with a , even for values as low as 8 , and is independent of the 3 value. That this must be so follows from the fact that as the a parameter increases, the de­ grees of freedom of each factor of the poly-t density increase, thus each factor is well approximated by the normal density, and the pos­ terior density of b is well approximated by a normal. This insures the accuracy of the approximation, however a large a parameter also insures a small prior standard deviation for the parameter. -

2

For example, the prior standard deviation of a is 8 (a -

1/2

2

1)

-1 •

(a — 2 ) , thus the approximate posterior distribution of a is ac­ curate if one has a strong informative prior for the variance component. When precise prior information is available the approximations are e f­ ficient. The accuracy of the approximations is confounded with one's prior beliefs. Also, when the a-parameter is large and the prior means of the components are close to their true values, the posterior means are close to the true means. Thus in such cases, the two posterior dis­ tributions are centered very close to the true value of the variance components and the posterior variances are small. Furthermore, when a is large, changes in the a parameter do not significantly affect the posterior distribution of the within component as compared to the between component. Informative priors influence the between component relatively more than the within. Hence, unless one is quite sure of what one is doing, then the value of the a param­ eter for the between variance component should not be selected as large even though it does not matter much for the within variance component. The reasons are explained by referrin g to the posterior density of b , (4 .7 ), where in the first factor the degrees of freedom,

2

which introduces prior information about a (o r t ) , is (n — p + 2 a + m) which is likely to be large for most data, regardless of the value of a . Variations in the a parameter do not affect this factor, but on the other hand, the second part, consisting of c factors, of the density of b , and the degrees of freedom of the i-th factor, is only m., and varia­ tions in the value of

significantly affect the posterior distribution

of the between component. Lastly, when there is little prior information, it seems reasonable to let the value of the a parameter be close to 2 so the prior variance is large. Even when a is close to 2, the approximations are close, but whether the posterior distributions of the variance components are close to their true values depends on the 6 parameter. From Table 4.1, one can see that when a = 2 and 3 varies from 2 to 2 0 , the posterior

2

mean of a

value of a

2

goes from 13.06 to 14.82, a small change. = 16.

Note the true

We conclude, therefore, that the posterior distn bu-

3

3

3

3

8

20

8

3

5

10

20

50

2

2

2

3

3

3

3

3

3

5

2

2

2

5

2

2

2

2

50

20

10

5

3

5

20

8

5

2

15.80

13.56

12.75

16.18

13.86

12.97

12.50

12.28

12.21 12.32

13.39

14.82

13.81

13.56

13.06

13.35

14.59

13.49

13.14

12.96

app

13.58 19.08

17.86

11.67

10.73

10.13

12.68

17.02

14.22

13.67

12.24

app

12.92

10.96

9.63

9.92

12.30

16.16

12.84

11.15

11.72

true

true

3

3^

Prior parameters

ot^

Within variance component -----------------------------------------------------mean variance

9.37 24.22

6.50 12.23 12.07

26.73

11.90

2.61

2.12 3.36

1.23

1.07

2.05

0.60

0.30

24.40

4.02

2.66

1.07

app

0.56

0.30

16.87

2.63

1.97

0.84

true

1.33

1.17

8.97

3.69

3.01

1.62

app

5.77

3.07

1.79

1.17

1.11

7.44

3.29

2.40

1.22

true

Between variance component ---------------------------------------------mean variance

Table 4.1 Mean and Variance o f the True and the Approximate Marginal Posterior Distributions o f the Within and Between Variance Components for B ox's Data

o> to

THE MIXED MODEL

10.95

5

5

5

5

5

5

10

20

50

8

10

40

5

5

5

5

8

10

10

10

10.14

8.79

10.10 12.86 17.56 8.82 18.14

10

50

100

200

100

200

80

20

20

20

20

50

50

20

100

200

100

400

500

10

20

20

32

15.49

8.81

9.48

8

8

15.51

18.02

8.84

17.31

12.94

9.53

18.29 0.30

10.70 1.17

10.25

0.01 0.32 1.25

0.01 0.32

1.10

0.52 2.52

11.26 5.55

5.52

2.50

14.90

10.58

2.49

14.62

7.87

3.98

0.43 0.80

4.00 3.96

0.08

2.02 2.01 4.00

4.65

9.74 9.70

4.90

2.51

4.82

4.77 7.55

0.52

3.57

4.89

0.16 0.16

1.09

1.08

4.65

13.86

9.13

8.76

12.15

0.79

0.33

0.06

4.71

14.37

3.35 2.98

4.08

3.85

8.89

1.03 0.95

2.20

2.07

0.30

20.76

6.08

2.00

0.56

0.30

7.79

1.11

5.15

1.72

0.56

0.30

5.03

2.77

1.50

1.16

4.64

2.54

1.38

1.12

3.55

4.98

11.98

13.98

13.85

50

8.70

12.04

11.91

7.67

7.23

11.00 11.34

14.81

14.53

14.85 7.24

10.77

9.40

8.74

10.07

10.52

9.23

8.24

9.91

12.85

12.08

11.69

12.26

10 20 11.26

14.76

50

4

50

4

12.68

20

4

20

4

11.96

10

4

10

4

11.59

5

4

5

4

12.21

5

5

3

3

POSTERIOR co

INFERENCES

THE MIXED MODEL

Alternative

Plan Alternative Education Plan

164

Figure 4.1. Marginal posterior density of the within-variance compo­ nent; a = 2, $ = 5, = 2, 3 ^ = 5 ; ------- = approximation, ------ = true distribution.

tion of the within variance component is not very sensitive to changes in 3, for values of a close to 2. This is not the case with the between component since when a is close to 2 , the posterior mean and variance heavily depend on the 3 parameter. Thus a fairly good estimate of the true value is needed for a proper choice of the 3 parameter. These conclusions are based on a limited study of one data set. More extensive numerical studies need to be done before one can formu­ late general rules on the closeness of the approximations and the choice of the hyperparameters. Overall, one may say that the results of the numerical study indi­ cate the approximations are good. Generally, the accuracy of the ap-

POSTERIOR INFERENCES

Alternative Education Plan Alternative Education Plan

165

Figure 4.2. Marginal posterior density of the within-variance compo­ nent; a = 5, 3 = 10, = 5, 3^ = 1 0 ; ------- = approximation, = true distribution.

proximations increases with the a parameter ( a or a^) of the prior dis­ tribution of the variance components and the approximations are close

2

for values of a as small as 8 . The posterior distribution of a is less sensitive to variations in the 3 parameters of the gamma priors as com­ pared to that of the between variance component. When precise prior information is available, it doesn't seem unreasonable to set the a parameter close to 2. Generally, in order to fix the 3 parameters, a good prior estimate of the true value of the variance components is needed. But, more detailed numerical studies are needed before one can put forward these conclusions as general rules.

THE MIXED MODEL

Alternative Education Plan Alternative Education Plan

166

Figure 4.3. Marginal posterior density of the within-variance compo­ nent; a = 32, 3 = 500, = 20, 3 ^ = 8 0 ; ------= approximation, --------- = true distribution.

Joint Modal Estimators o f Parameters o f Mixed Models Another way to estimate the parameters of mixed linear models is with the mode of the joint posterior distribution of all the parameters. This approach is somewhat similar to the method of maximum likelihood estimation, which was first developed by Hartley and Rao (1968). The mode of the joint posterior distribution is found in the same way as done by Lindley and Smith (1972), who used iterative procedures to find modal estimates of the parameters. These modal estimators were viewed as large-sample approximations to posterior means. The view taken here is to regard joint modal estimators as "natural" Bayesian estimators, since they correspond to points of highest pos-

167

Alternative Education Plan Alternative Education Plan

POSTERIOR INFERENCES

Figure 4.4. Marginal posterior density of the between-variance compo­ nent; a = 2, 3 = 5, a 1 = 2, 3 ^ = 5 ; ------- = approximation, --------= true distribution.

terior density, and to use the posterior mean if a quadratic loss func­ tion is appropriate. O f course, a complete Bayesian analysis dictates that one give the marginal posterior distributions of all the param eters, thus for a model k with k parameters, there are 2 — 1 marginal distributions to determine. If one considers a one-way model, it has three parameters, excluding the random factor, one should specify seven marginal distributions, something which has not been done. Instead, one is forced to use summary characteristics of the posterior distributions such as posterior means, modes, medians, variances and so on of the one- and two-dimensional marginal distribu-

THE MIXED MODEL

Education Plan Alternative EducationAlternative Plan

168

Figure 4.5. Marginal posterior density of the between-variance compo­ nent; a = 5 , 8 = 10, a^ = 5, 6 ^ = 1 0 ; ------= approximation, -------- -true distribution.

tions of the two parameters a

and a .

This approach usually involves

numerical integration of functions of several variables. In a related study, Rajagopalan (1980) presents a way to compute all one-dimensional marginal densities of the parameters of mixed linear models and this method was presented in the previous sections of the chapter. One advantage of his approach is that it avoids numerical integration o f functions of several variables. Lindley and Smith (1972) and later Smith (1973) propose using the mode of the joint posterior density of the parameters in the model. They employed their method with a two-factor model with two variance components, the error variance and the row and column effects. They

POSTERIOR INFERENCES

Alternative Education Plan Alternative Education Plan

169

Figure 4.6. Marginal posterior density of the between-variance compo­ nent; a = 32, 3 = 500, = 20, 3^ =8 0 ; --------- = approximation, = true distribution.

were primarily interested in estimating the main effect factors, but in principle, their method will work with any subset of parameters. Consider the joint posterior density (4 .5 ) of all the parameters of a mixed linear model. Theorem 4.1 gives the conditional posterior dis­ tributions of 0 , b , t , and p , given the other parameters of the model, thus all the conditional modes are known, namely

M (e | b ,T ,p ) = ( X ' X ) ' 1X’(y - u b ) is the conditional posterior mode of 0 given b ,

(4.20)

t

, and p .

A lso ,

THE MIXED MODEL

170

M ( b | e , x , p ) = [ t u ’u + A ( p ) ]

1xu’ ( y -

xe),

(4.21)

(4.22)

and

M(T."1 |e,b,T ) =

23. + b!b. l ii m. + 2 a. + 2 l l

are the conditional modes of b , the other parameters. t

t

(4.23)

1, and

t

. *, i = 1 , 2 , . . . , c , given

The last two modes refer to the error variance

1 and the variance components

t

.

\ and the four equations for the

conditional modes will be used to find a mode of the joint posterior density, (4 .5 ), of all the parameters e, b , t , and p. Suppose a mode of the joint density is given by a solution to the equations - ^ - p (e , b , T , p |y) = 0,

(4 .24)

— p (e ,b ,T ,p |y) = 0,

(4.25)

-^ -p (e ,b ,T , p |y) = 0,

(4.26)

dT

and p (e ,b ,T ,p |y) = 0 ,

(4.27)

i where i = 1 , 2 , . . . , c . Now consider (4 .2 4 ), then it is known that the joint density factors as p (e ,b ,T ,p |y) = p 1 (e | b ,T , p , y )p 2 (b ,T ,p |y), where p^ is the conditional density of 9 given b , t , and p and p^ the marginal density of b , t , and p.

I f p^ > 0 for b e Rm, t > 0, and

all t . > 0, the first equation (4.24) is equivalent to

POSTERIOR INFERENCES

171

3 P jO



|b ,T

,p ,y ) = 0,

(4.28)

and in a similar fashion



P 3(b | e ,T ,p ,y ) = 0,

(4.2 9)

p 5 (T | b , 6 , p ,y ) = 0 ,

(4.3 0)

and

P ^ T .Ip .e .b .y ) = 0 ,

i = 1 ,2 ,

(4.31)

i where p^, p &, and p 7 are the conditional densities of b , t , and t . given the other parameters, and this system of four equations is equivalent to the system (4 .2 4 )-(4 .2 7 ). Since each conditional density is unimodal for all values of the conditioning variables, this system is equivalent to the four modal equations (4 .2 0 )-(4 .2 3 ). In order to find the joint mode, this system must be solved simultaneously for 0, b , t , and t . , i = 1,2, . . . , c, over the space b € Rm, e € RP , t > 0, and t . > 0, i = 1,2

c.

Starting with some initial value for b , one may find a 0 value from the first modal equation (4 .2 0 ), then a t 1 value from (4 .2 2 ), and finally c t. 1 values from (4 .2 3 ), then the cycle is repeated, and if the process converges, the solution is a solution to the system (4 .2 4 )(4 .2 7 ), and a mode of the joint posterior density is obtained. Let us see how this method works when applied to some simple mixed models. First, consider the one-way random balanced model of the previous section, and let_the initial value of b be the teast squares estimator of b , namely Qb = ( y 1 - y , y g - y , . . . , ym - y ) \ then the first modal equation is

= y where

0 = y

.

Further substitution gives

THE MIXED MODEL

172 m — - 2 (y j - y ) ' -l ________________________ 0 T1 ~ m + 2 a^ + 2

2 e1 + 5 Z

as the first stage estimates of the error variance and between variance component, respectively. G harraf (1979) shows that ^b, ^ y , ^ t , and nx1 1 are conditional Bayes (conditional posterior modes) estimators of -1 -i b , y , t , and t , respectively. -1 -1 The cycle is repeated using ^b = m(b IqU >qT ) and con­ tinued until the value of the estimates stabilizes. Note the first stage estimators depend on the familiar analysis of variance sum of squares. Next consider a two-fold nested random balanced model = 0 + a. + b.. + e.._ , l i] i]k

y.._

J i]k

(4.32)

where i = 1,2, . . . , a; j = 1,2, . . . , g ; k = 1,2, . . . , d. In terms of the general mixed model, n = agd, c = 2 (two variance components), m^ = a and m^ = g .

6 is a real unknown scalar,

The observations are

the sets {a .}, {b . .}, and

are independent and the a. are inde^ 2 pendent and n ( 0 , a . ) , the b.. are independent and n ( 0 ,a ) , and the

2

2

eijk are ^dependent and n (0 ,a ) .

If the starting value of b is ^b =

( j fR u) u ’R y, the modal equations reduce to

oe = y. a

g

d

E

.0 0

2

1

-

E

1

2/(g + l ) 2 1

a +

1 __________________

2a„ + 2

and

2 0°2

2 +

1

L

[y « . - g y , / ( g + i ) - y / ( g + D ]

i

l

13

g + 2a2 + 2

1

2

POSTERIOR INFERENCES

173

for the first stage estimates of the parameters. beginning with a new value of b , namely jb = t0 Tu'y + A ( Qp )] ^TU'Cy -

-1

The cycle is repeated

X 0 6] ,

-2

where np = ( ftx , t ) and t . = a. , i = 1, 2. This model is given as U U 1 U Z 1 1 an example by Box and Tiao (1973) to illustrate the Bayesian approach, by Hartley and Rao (1968), who use maximum likelihood estimation, and by Gharraf (1979), who showed the first stage estimators were condi­ tional Bayes estimators. The starting value of b was the least squares estimator of b , however other reasonable estimators of b will suffice. There are some problems with this procedure. First, when does it converge and if it converges does it converge to the joint posterior mode? Also if it converges, how does the rate of convergence depend on the starting value? There are also some other computational prob­ lems with inverting large matrices. The generalized Moore-Penrose inverse of u'Ru and the inverse of t u ’ u + A (p ) are of order m x m, the total number of random effects of the model. This problem also occurs with maximum likelihood estimation, thus perhaps the W trans­ form introduced by Hemmerle and Lorens (1976) is a way to calculate inverses of patterned matrices such as u?Ru. This section is concluded with numerical examples for the one­ way and twofold random models. Again, the data set generated by Box and Tiao (1973) is used as in the previous section. With regard to the first mode the parameters were set at

t

1 = 16,

= 4, and 6 = 5 , and

there are six groups, m = 6 , and five observations per group. analysis here uses gamma prior densities for

2 a

and

2 a^,

The

which differs

from Box and Tiao (1973), who place an improper vague prior on the analysis of variance expected mean squares. See pp. 246-276 of Box and Tiao (1973) for a detailed account. The analysis of variance

-2

-2

estimators is a = 14.495 and a^ = 1.3219, and the joint posterior mode of ( 02 , 0 ^) is (13.796,0). Table 4.2 gives a solution to the modal equations ( 4 . 2 0 )-(4 .23) corresponding to a one-way model using different values for the hyper­ parameters a, o^, 3 , and 3 ^, and in all cases the starting value of b was the least squares estimator. The random effect estimates are not shown since b is a 6 x l vector and the other parameters are our primary concern. In all cases 6 is 5.666 because the conditional mode of 0 is y and is not affected by prior information. Table 4.2 demonstrates the effect of the prior distribution on the modal estimates. For example, the first row uses the "true" value of

THE MIXED MODEL

174 Table 4.2 Modal Estimates of Parameters: The One-Way Model

A2 a

al

0

*1

0

a

A2 °1

2

2

4

4

5.6656

11.4861

.713785

2

2

2

2

5.6656

10.9041

.349768

2

2

10

10

5.6656

10.8208

1.8334

2

2

20

20

5.6656

12.9161

3.59343

5

5

4

4

5.6656

9.93803

.462915

5

5

2

2

5.6656

9.41301

.228348

5

5

50

20

5.6656

11.1399

5

5

10

10

5.6656

9.3689

10

10

16

4

5.6656

8.08161

.293158

10

10

2

2

5.6656

7.63936

.145248

10

10

15

20

5.6656

9.05468

1.49166

10

10

100

20

5.6656

11.0234

1.47858

10

10

10

10

5.6656

1

1

5.6656

11.5485

.20321

7.63557

2.35272 1.1868

.748002

1.1

1.1

1.01

1.01

.1

.1

5.6656

11.7507

.0200421

1.001

1.001

.0 1

.01

5.6656

11.7734

.0200043

the variance components as the prior mean of these parameters, and the estimates of the within and between components are 11.4861 and .713785, respectively. On the other hand, when the prior means are set at 10, the modal estimates are 11.7734 and .0200043. The modal estimate of 0 is not sensitive to prior information, but the modal estimates of the within and between components are sensitive; however, the within estimate is less sensitive than the between estimate. For instance, the ratio of the largest to smallest of the within estimates is only 1.5419 while the ratio is 179.6328 for the between estimates.

POSTERIOR INFERENCES

175

It is difficult to state any definitive conclusions about these estimates and a more thorough sensitivity analysis needs to be done. The estimates are closest to the true value of the parameters when the prior means are set at 15 and 20 and in all cases the estimates are smaller than the true values of the parameters. The next example is the twofold nested model, (4 .3 2 ), and is discussed by Box and Tiao (1973) beginning on page 282. This model

2

has four parameters, the fixed effect e, the error variance a , and two variance components, a

2

2

and a9.

The authors generated the data

and set the parameter values at 0 = 0, 2.25. are

2

2

2

= 1.0, a2 = 4.0, and a =

Also a = 10 and g = d = 2 and the analysis of variance estimators

-2

= 1.04,

a2

A2

= 5.62, and a

= 3 .6 0 .

Assuming a vague prior

distribution on the expected mean squares of the analysis of variance, Box and Tiao numerically determine the marginal distribution of the variance components. Suppose one uses independent gamma densities for the precision

-2

components T 1 ( = a 1 )> T2 » an(i the error precision x (= a hyperparameters

c^, 3 ^ c^,

a>

anci

-2

) with

3 and a constant prior density

for the fixed effect. For various values of thehyperparameters model equations were solved and the solution listed in Table 4.3, where in all cases the starting value of b is the least squares estimator (u ’R u ) u*Ry. _ The conditional mode of 0 is y and is not affected by prior information.

a2

The estimates o and

a2

of the two variance components

are more sensitive to prior information than the error variance estimate

-2

a .

For example, the ratios of the largest to smallest estimates are

a2

5.86, 5703.23, and 14.22 for o ,

a2

and

a2

a^9 respectively.

Definitive

conclusions are difficult to state on the basis of such a limited study. In this section, a mode of the posterior marginal distribution of all the parameters is found for a general mixed model. The estimates are solutions to the modal equations based on the conditional mode of each set of parameters, conditional on the remaining. In all cases the iterative scheme rapidly converged and was il­ lustrated with two random models. The estimates of the variance com­ ponents were more sensitive to prior information than the error vari­ ance, and estimates of the fixed effects were not sensitive to prior in­ formation . The program of the iterative procedure was written in SAS.

the

2

2

2

1

1.001

2

1

2

2

1

1.001

2

1

a2

2

al

10

50

2

.1

.01

4

10

31

10

50

2

.01

.1

2.25

10

THE MIXED MODEL

EXERCISES

177

SUMMARY AND CONCLUSIONS This chapter has introduced a Bayesian analysis of the mixed model. Generally two methodologies were given. First, the one-dimensional marginal distributions of all the variance components and of the error variance are derived and are based on a normal approximation to the marginal posterior density of the random factors. This method applies to any mixed model, be it unbalanced or otherwise, and general formulas for the posterior means and variances are derived. A numerical study of the one-way random model reveals the ac­ curacy of the approximation and the sensitivity of the marginal posterior densities to the parameters of the prior density. Second, modal estimates of all the parameters may be found by solving a set of modal equations based on the conditional posterior dis­ tributions given in Theorem 4.1. The sensitivity of the modal estimates to prior information was investigated for the one-way and twofold nested models. The advantage of the first method is that one may plot each marginal posterior density, thus, giving us at a glance (see Fig­ ures 4 .1 -4 . 6 ) inferences about the parameters based on all the infor­ mation in the sample. But, a major disadvantage is that this method is based on an approximation to the marginal posterior densities. With regard to the second method, estimates of all the parameters, including the random and fixed effects, are easily found, however the posterior precision of the parameters is not available. These two methodologies should both be done in a Bayesian analysis and are not competing methods, but should augment each other. The second is easier to program than the first. As we have seen, making inferences for the parameters of a mixed model presents many challenges. With regard to the variance components, further research, perhaps, will give us an accurate ap­ proximation to the marginal distribution of b , in which case, the Bayesian solution will be complete.

EXERCISES

1.

2.

Formulas (4 .1 4 ), (4 .1 5 ), (4 .1 6 ), and (4.17) give the approximate mean and variance for the posterior distributions of the error variance and variance components of any mixed model. Using the above formulas, find the approximate mean and variance of the posterior distribution of a balanced twofold nested random model (4 .3 2 ). Repeat the above procedure for a balanced two-way crossed random model without interaction and with one observation per cell.

THE MIXED MODEL

178 3.

4.

5. 6.

7.

8.

For a balanced two-way crossed random model one observation per cell and with no interaction, find the conditional posterior mean of the error variance and the two variance components, given b the vector of random effects. Consider the general mixed model (4 .1 ) and the prior density (4 .2 ) of the parameters. The conditional posterior density of 6 given b is expressed by Theorem 4.6. Explain how you would make inferences about 6 , from the marginal posterior distribution of 0 . For example, what is the mean of the marginal distribution of 0 ? Prove Theorems 4.1, 4.2, 4.3, and 4.4. Consider Theorem 4.3, which gives the marginal distribution of b , the vector of random effects. Develop a normal approximation to the distribution of b . R eferring to Theorem 4.4 and equation (4 .8 ), what is the joint posterior density of t and t ^, the error precision and precision component of the one-way random model given above? Can the marginal posterior density of t be isolated? Consider a twofold nested balanced random model (4.32) with error variance a

2

2

and variance components a^ and

conditional joint marginal posterior distribution of

2

o 2^ Using the 2 2 a , a , and

a^ given b , find the conditional joint posterior distribution of

2 ai

1

a

2

+ o2 x +2 o2

and

2 2= 2

yo

Q

°2

2

2

+ ° l + °2

given b (the vector of random effects).

REFERENCES B ox, G. E. P. and G. C. Tiao (1973). B a y e s ia n I n f ere n ce in S t a t i s ­ tical A n a l y s i s , Addison-W esley, Reading, Massachusetts. Daniels, H. E. (1939). "The estimation of components of variance," Journal of the Royal Statistical Society (S u p p .), Vol. 6 , pp. 186197.

REFERENCES

179

Dreze, Jacque H. (1977). "Bayesian regression analysis using poly-tdistributions," from Ne w D e v e l o p m e n t s in the A pplicati on o f B a yes ia n M eth o ds, edited by Aykac and Brumat, North-Holland Publishing C o ., Amsterdam. Gharrah, M. K. (1979). A General Solution to Making Inferences about the Parameters of Mixed Linear Models. P h .D . disserta­ tion, Oklahoma State University. Goel, Prem K. and Morris H. DeGroot (1981). "Information about hyperparameters in hierarchical models," Journal of the American Statistical Association, Vol. 76, pp. 140-147. Graybill, F. A . (1954). "On quadratic estimates of variance compo­ nents," Annals of Math. Statistics, Vol. 25, p p . 367-372. Graybill, F. A . , and R. A . Hultquist (1961). "Theorems on Eisenhart!s model I I , " Annals of Math Statistics, Vol. 32, pp. 261-269. Graybill, F. A . and A . W. Wortham (1956). "A note on uniformly best unbiased estimators for variance components," Journal of the American Statistical Association, Vol. 51, pp . 266-268. Hartley, H. O. and J. N . K. Rao (1967). "Maximum likelihood estima­ tion for the mixed analysis of variance model," Biometrika, Vol. 54, pp. 93-108. Hemmerle, W. J. and J. A . Lorens (1976). "Improved algorithm for the W-transform in variance component estimation," Technometrics, Vol. 18, No. 2, pp. 207-212. Henderson, C. R. (1953). "Estimation of variance and covariance components," Biometrics, Vol. 9, pp. 226-252. Hill, B . M. (1965). "Inference about variance components in the one­ way model," Journal of the American Statistical Association, Vol. 60, pp. 806-825. Hill, B . M. (1967). "Correlated errors in the random model," Journal of the American Statistical Association, Vol. 62, pp. 1387-1400. Lindley, D. V . and A . F. M. Smith (1972). "Bayes estimates for the linear model," Journal of the Royal Statistical Society, Series B , Vol. 34, pp. 1-18. Rajagopalan, M. (1980). "Bayesian inference for the variance com­ ponents in mixed linear models," P h .D . dissertation, Oklahoma State University, Stillwater, Oklahoma. Rajagopalan, M. and Lyle Broemeling (1983). "Bayesian inference for the variance components of the general mixed models," Com­ munications in Statistics, Vol. 12, No. 6 , pp. 701-724. Ruben, H. (1960). "Probability content of regions under spherical normal distributions," Annals of Math. Statistics, Vol. 31, pp. 598-619. Ruben, H. (1962). "Probability contents of regions under spherical normal distributions, IV : the distribution of homogeneous and non-homogeneous quadratic function of normal v ariables," An ­ nals of Math. Statistics, Vol. 33, pp. 542-570.

180

THE MIXED MODEL

Searle, S. R. (1971). L in ear Models, John Wiley and Sons, In c ., New York. Searle, S. R. (1978). "A summary of recently developed methods of estimating variance components," Technical Report, BU-338, Biometrics Unit, Cornell University, New York. Smith, A . F. M. (1973). "Bayes estimates in one-way and two-way models,” Biometrika, Vol. 60, pp. 319-329. Stone, M. and B . G. F. Springer (1965). "A paradox involving quasi prior distributions," Biometrika, Vol. 52, p. 623. Winsor, C. P. and G. L . Clarke (1940). "A statistical study of variation in the catch of plankton n ets," Sears Foundation Journal o f Marine Research, Vol. 3, pp. 1-34.

5 TIME SERIES MODELS

IN T R O D U C T IO N

Up to this point, our study of linear models has been confined to those which are usually studied in traditional courses on linear models. Re­ gression models and models used to analyze designed experiments have been analyzed from a Bayesian viewpoint. With the exception o f the autoregressive model, v e ry little has appeared from the Bayesian viewpoint and Zellnerfs (1971) book appears to be the only one devoted to the Bayesian analysis o f selected para­ metric time series models. His book examines the autoregressive process and distributed lag models. This chapter introduces the prior, pos­ terior, and predictive analysis o f selected autoregressive, moving average, and autoregressive-m oving average models as well as the dis­ tributed lag model, thus one may view this chapter as an extension of Zellner* s work. The literature about the theoretical and methodological aspects of ARMA models is quite large and the standard reference is Box and Jenkins (1970). They present the theory and methodology for the analysis of data which are generated by univariate and multivariate ARMA and transfer function models and models which incorporate seasonal and intervention effects. The methodology is iterative and consists o f differencing the data to achieve stationarity, identifying the model from the sample autocorrelation and partial autocorrelation function, estimating the parameters via least squares or maximum like­ lihood, and forecasting future observations by linear combinations of past observations and forecasts. There is no doubt that the BoxJenkins way is the prevailing methodology for analyzing ARMA proc­ esses. Their methods are not Bayesian and include many classical algorithms for identification, forecasting, and control. 181

182

TIME SERIES MODELS

On the other hand, the Bayesian literature devoted to ARMA models is sparse. When compared to Box-Jenkins and other classical techniques, such as those given by Granger and Newbold (1977), there are few Bayesian contributions. The Current Index to Statistics (1980) lists very few Bayesian articles, but among them are Smith (1979) and Akaike (1979). The first article is a generalization of the Harrison and Stevens (1976) steady-state forecasting model, and the second is a Bayesian interpretation of the author 1s identification method for auto­ regressive processes. As mentioned earlier, Zellner (1971) is the only book exclusively devoted to the Bayesian analysis o f time series and econometric models but deals only with autoregressive and distributed lag models. Aoki!s (1967) book is a Bayesian decision-theoretic study o f the linear dynamic systems of engineering control. Bayesian studies o f time series have been mostly made by engineers and economists. For example, the econometric literature about struc­ tural change in linear models, including selected time series, is well developed and reviewed by Poirier (1976). More recently, Holbert and Broemeling (1977), Chin Choy and Broemeling (1980), Abraham and Wei (1979), Salazar, Broemeling, and Chi (1981), and Tsurumi (1980) have investigated structural change. Structural change in a model is similar to ARMA models which have intervention effects. Bayesian analyses developed primarily for ARMA models are limited; however, Newbold (1973) derives the posterior distribution of the parameters of a univariate transfer function noise model. Since 1973 there has been very little work and Chatfield's (1977) review re ­ fers to one paper, the Harrison and Stevens (1976) paper. Control engineers are developing an interest in Bayesian inferential procedures and PerterkaTs (1981) article reviews Bayesian methodology and dem­ onstrates its applicability to system identification. Most recently, Monahan (1981) has devised a Bayesian analysis of ARMA time series models. His approach is mostly numerical in that the posterior dis­ tribution of the parameters must be specified numerically by numerical integrations. In spite of these difficulties, Monahan's work is an im­ portant first step. Why is there so little work on the Bayesian analysis of time series? It appears the likelihood function is analytically intractable for most ARMA models and this presents a problem, not only to Bayesians but also to those interested in maximum likelihood estimation. Many papers, such as those by Newbold (1974), Dent (1977), Ali (1977), Ansley (1979), Hillmer and Tiao (1979), Ljung and Box (1976), Nicholls and Hall (1979), and many others are investigations for simplifying the likelihood function so one may develop more efficient algorithms. I f the likelihood function can be represented in such a way as to produce analytically tractable posterior distributions, a complete Bayesian analysis is possible.

183

AUTOREGRESSIVE PROCESSES A UTO R E G R ES S IV E PROCESSES

Let Y^:

t = 1,2, . . . , be a time series and suppose

Y t “ 9 i Yt - i = v

(5 - ‘ >

where 0 ^ is a real unknown parameter, and suppose

are

independent and identically distributed normal variables with mean zero and precision t > 0. Thus each observation is alinear function of the previous observation plus white noise. This model is called a first order autoregressive process and is stationary i f |e^| < 1 . A second order process, denoted by A R (2 ), is

Yt where 0

9l Yt -I and 0

model, the

V t -2

= V

are real unknown parameters and as with the A R (1 )

are i .i .d . n (0 ,x

stationary i f ( 0 ^, 0 2>

e

-1

) random variables.

The process is

T , where T is the triangle with vertices ( - 2 , 0 ) ,

( 2 , 0 ) , and ( 0 , 1 ) . In general an A R (p ) process is given by

0(B )Y t = where the

z^

et , t =

l, 2 , . . .

(5 .3 )

are white noise with precision t ,

8( B ) = 1 -

6 B 1

0 „B 2 «

and BXY t = Y^ ., i = 1,2, . . . .

---------- 0 B P P

The process is stationary i f the roots

of 0 ( B ) = 0 lie outside the unit circle. Suppose there are n observations ^ ^ 2 ’ process, then

0( B ) Y t = where y ft,y

>

et , t =

• • •>y .

(5 .4 )

1,2, . . . , n

"• • •

Y n ^rom an

(5 .5 )

are initial observations assumed to be known —1 constants. Since e , e , . . . , e are i .i .d . n (0 ,x ) , the joint density of l z n the n observations is

184

TIME SERIES MODELS

••• y j & . t ) “ t n / 2 exp -

|

n * §

(y t ~

V t -P)2

V t -i

where 0. 6 R, i = 1,2, . . . , p , x > 0, and y

(5 ,6 )

e R, t = 1,2, . . . , n.

Suppose stationarity is not assumed and one wants to make inferences about the parameters

) € RP and x > 0. The y. are p 1 observed values o f Y ., and (5 .6 ) is the likelihood function of 0 and x 0

=

(0

1

,0 , . . . ,

2

0

based on the observations. What does one do about prior information for the parameters? It is well known that the conjugate cla§a o f prior distributions is the normal-gamma family, i . e . , e given x is normal and x is gamma, thus let P ( 01x) a Tp/r2 exp - ^ ( 0 - y ) TP ( 0

~ y ),

0 G RP

(5 .7 )

be the conditional prior density o f 0 given x and p ( x ) « Ta

1 e T ^, x ^ 0

be the marginalprior density o f x.

(5 .8 ) The known hyperparameters are

y , P, a and 0 , where y e Rp , a > 0, g > 0, and P is a p -th order positive definite matrix. Since we a re working with the conjugate prior class for the parameters, the joint prior distribution o f the pa­ rameters is also normal-ganima, and one must find the posterior param­ eters . Using BayesT theorem, the posterior density o f the parameters is P (0 .T |sn ) « T( n+2 «fP ) / 2 'l exp _ I [ (e _ A - 1c ) .A (e _ A ' * C )

+ D ],

(5.9 )

where 0 e Rp and x > 0. The other terms in this expression are A , C , and D , where A is a known p x p matrix

AUTOREGRESSIVE PROCESSES

185

and

D = 26 + v'Pu + 5 3

-

C 'A _ 1 C.

Also, (a ..) is p x p with ij-th element a.. = E? yx ,y. . and (c .) i] i] 1 t - r t-j y is a p x l vector with j-th element E, „ yxyA .. t= l t t - ] Integrating the joint density with respect to t ,

8£RP (5.10) is the marginal posterior density o f 0 , thus 0 has a t density with n + 2a degrees o f freedom, location A * C , and precision matrix (n + 2a)A *D.

The marginal posterior density o f

P (t|sn ) « T( n + 2a >/ 2 ' 1 exp f

t

is

[23 + u'Pu +

-

C 'A - 1C ] ,

t

> 0

(5.11)

where s

= (y ,y , , y ) is the sample observations, thus t has a n n 2 gamma distribution with parameters n + 2 a and [ 2 3 + u T?u + —

C 'A ‘ 1 C]/2. Thus, with regard to the autoregressive model, the posterior analysis is similar to the general linear model. Usually with time series models, the main objective is to predict future observations Y n+i>Y n+ 2 » •••»

the Process> end as has been

shown the Bayesian predictive distribution provides a convenient fore­ casting procedure. Consider one future observation Y , . , then the density o f Y ^ i n+i n+i given s ^ , t , and 0 is

, with n + 2 a degrees of freedom, location E [Y n+ 1 |sn ] = [1 -

F * '(A + E ) ' 1 F * ] ’ 1 F * '(A + E ) _1C

and precision P [Y A1 Is ] = [1 - F * '(A + E )_ 1 F * ]{D + C 'A _1C n+l n X F * [l -

C '(A + E ) ' 1

F *’( A + E ) ' 1F * ] ‘ 1F *’(A + E ) XC } ,

where F = yn+^F *> and F* is p x 1 with j-th component y n

1 ,2,

j =

• • «, p. O f course, this can be generalized to more than one future obser­ vation , but the joint predictive density of two future observations is not the usual bivariate t distribution. We have seen that the posterior and predictive analysis are straightforward for an A R (p ), autoregressive model i f one assumes a conjugate prior density for the parameters and i f one does not assume stationarity. What i f one assumes stationarity? Consider an A R (1 ) model

MOVING AVERAGE MODELS where |

187

| < 1 , then how does one analyze this stationary model?

The normal-gamma family is no longer the conjugate class since the parameter space is interval ( - 1 , 1 ) x ( 0 ,«>) and not as before ( - 00, 00) x ( 0 ,°°). One approach is to ignore the stationarity assumption and pro­ ceed to use the normal-gamma family as a prior distribution for the pa­ rameters, but with an adjustment in the prior parameters to allow for approximate stationarity. To see how this works with an A R (1 ) model, use a normal-gamma prior with parameters y , £, a , and 3 , then the marginal prior density of 0^ and x is a t distribution with 2 a degrees of freedom, location y , and precision a£/ 3 , and one may choose y , a, 3 , and £ in such a way so that 1 0 0 y% of the marginal prior probability for 0 ^ is contained in the interval ( —1 , 1 ) , for any 0 < y < 1 * By letting y be "close" to 1, one is assuming "almost" stationarity. It should be noted that the a and 3 parameters are the parameters of the marginal prior distribution of t , thus choosing an (a , 3 ) pair in the marginal prior distribution of 0 ^ will affect one!s prior information about t . Another approach is to put a prior distribution on the parameter space ( —1 , 1 ) x ( 0 ,«>), but this way will produce analytically intract­ able posterior distributions, and a purely numerical procedure is re­ quired for a complete analysis. In spite o f these disadvantages, Lahiff (1980) has developed an interesting algorithm to deal with this difficult problem, and Monahan (1981) has taken a similar approach with A R M A (p ,q ) models, where p ,q = 0, 1, 2. The identification problem of determining the order of an A R (p ) process has not been explored yet, but Diaz and Farah (1981) and Monahan (1981) have begun some initial studies. Except for these two papers, Bayesian results have been limited. With regard to the forecasting problem, v e ry little has been done in finding the joint predictive density of future observations. Zellner (1971) has developed the distribution for first and second order AR processes and Land (1981) derived the forecasting distribution of A R (p ) models with k future observations, but both authors deal with the non-stationary case. In her dissertation, Land shows that the forecast procedure may be done by a series o f conditional t-forecasts. For more information about autoregressive processes, the reader is referred to Zellner (1971), where interesting examples are explained in detail.

MOVING AVERAGE MODELS

A first order moving average process, MA(1), expresses each observation as

Y_t = e_t − φ₁e_{t−1},  t = 1, 2, ...,   (5.14)

which is a linear combination of white noise; that is, the e_t are n.i.d. (0, τ^{-1}), where τ > 0, and φ₁ is the moving average parameter. A MA(2) model is given by

Y_t = e_t − φ₁e_{t−1} − φ₂e_{t−2},  t = 1, 2, ...,   (5.15)

where φ₁ and φ₂ are moving average parameters and the e_t are white noise with precision τ. In general, a q-th order moving average process is given by

Y_t = Φ(B)e_t,  t = 1, 2, ...,   (5.16)

where

Φ(B) = 1 − φ₁B − φ₂B² − ... − φ_qB^q

and B is the backshift operator such that B^k e_t = e_{t−k}; the moving average parameters φ₁, φ₂, ..., φ_q are assumed to be real unknown parameters. Moving average processes are always stationary.

Suppose there are n observations (y₁, y₂, ..., y_n)' = y from a MA(1) process, and let

y = A(φ₁)e,   (5.17)

where e = (e₀, e₁, ..., e_n)' and A(φ₁) is the n × (n + 1) matrix whose t-th row has −φ₁ in column t, 1 in column t + 1, and zeros elsewhere. Since e ~ N(0, τ^{-1}I_{n+1}), y ~ N{0, τ^{-1}A(φ₁)A'(φ₁)}, and A(φ₁)A'(φ₁) is a tridiagonal Toeplitz matrix with −φ₁ for the super- and sub-diagonal components and 1 + φ₁² for the principal diagonal elements. The joint density of the observations is given by

f(y|φ₁, τ) ∝ τ^{n/2} |A(φ₁)A'(φ₁)|^{-1/2} exp{−(τ/2) y'[A(φ₁)A'(φ₁)]^{-1}y},  y ∈ Rⁿ.   (5.18)

Since A(φ₁)A'(φ₁) is symmetric and positive definite, there exists an orthogonal matrix Q such that D(φ₁) = Q'A(φ₁)A'(φ₁)Q is diagonal with diagonal elements

d_i(φ₁) = 1 + φ₁² − 2φ₁ cos[iπ(n + 1)^{-1}],  i = 1, 2, ..., n.

It can be shown, see Gregory and Karney (1969), that Q is independent of φ₁ and has ij-th element [2/(n + 1)]^{1/2} sin[ijπ(n + 1)^{-1}]. We see that A(φ₁)A'(φ₁) = QD(φ₁)Q', thus [A(φ₁)A'(φ₁)]^{-1} = QD^{-1}(φ₁)Q', and the density (5.18) may be written as

f(y|φ₁, τ) ∝ τ^{n/2} ∏_{i=1}^{n} d_i^{-1/2}(φ₁) exp{−(τ/2) y'QD^{-1}(φ₁)Q'y},  y ∈ Rⁿ,   (5.19)

which is a simplification over the other form (5.18), because the determinant and inverse of A(φ₁)A'(φ₁) are now available in closed form for every value of φ₁.
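As a quick numerical check of the diagonalization just described, the following Python sketch builds A(φ₁), forms A(φ₁)A'(φ₁), and verifies that the matrix Q and the eigenvalues d_i(φ₁) given above do diagonalize it. The values n = 6 and φ₁ = 0.4 are arbitrary illustrative choices, not values from the text.

import numpy as np

def ma1_cov_factor(phi, n):
    # n x (n+1) matrix A(phi): -phi on the "diagonal", 1 on the superdiagonal
    A = np.zeros((n, n + 1))
    for t in range(n):
        A[t, t] = -phi
        A[t, t + 1] = 1.0
    return A

n, phi = 6, 0.4
A = ma1_cov_factor(phi, n)
M = A @ A.T    # tridiagonal Toeplitz: 1 + phi^2 on the diagonal, -phi off-diagonal

i = np.arange(1, n + 1)
d = 1 + phi**2 - 2 * phi * np.cos(i * np.pi / (n + 1))               # d_i(phi)
Q = np.sqrt(2 / (n + 1)) * np.sin(np.outer(i, i) * np.pi / (n + 1))  # ij-th element of Q

print(np.allclose(Q.T @ M @ Q, np.diag(d)))   # True: Q'(A A')Q = D(phi)
print(np.allclose(Q.T @ Q, np.eye(n)))        # True: Q is orthogonal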

Suppose, a priori, that φ₁ has some proper density p(φ₁) over R, and let the marginal prior density of τ be gamma with parameters α and β; this prior combines easily with the likelihood function (5.19) for φ₁ and τ. By applying Bayes' theorem, the joint posterior density of φ₁ and τ is

p(φ₁, τ|y) ∝ τ^{(n+2α)/2 − 1} p(φ₁) ∏_{i=1}^{n} d_i^{-1/2}(φ₁) exp{−(τ/2)[2β + y'QD^{-1}(φ₁)Q'y]},   (5.20)

where φ₁ ∈ R and τ > 0, and the marginal posterior density of φ₁ is

p(φ₁|y) ∝ p(φ₁) ∏_{i=1}^{n} d_i^{-1/2}(φ₁) [2β + y'QD^{-1}(φ₁)Q'y]^{−(n+2α)/2},  φ₁ ∈ R,   (5.21)

which, since φ₁ is scalar, is easy to plot, and it is an easy task numerically to determine the posterior moments of φ₁. The marginal posterior density of τ must be determined numerically from the joint density (5.20), because the integration of the joint density with respect to φ₁ is not known analytically. The conditional density of τ given φ₁ is gamma with parameters (n + 2α)/2 and [2β + y'QD^{-1}(φ₁)Q'y]/2; hence the conditional mean and variance of σ² = τ^{-1} are

E(σ²|φ₁, y) = [2β + y'QD^{-1}(φ₁)Q'y]/(n + 2α − 2)   (5.22)

and

Var(σ²|φ₁, y) = 2[2β + y'QD^{-1}(φ₁)Q'y]² / [(n + 2α − 2)²(n + 2α − 4)].

Thus the marginal mean and variance of σ² may be found numerically by averaging these conditional moments with respect to the marginal posterior density of φ₁, (5.21). Since

E(σ²|y) = E_{φ₁}[E(σ²|φ₁, y)]

and

Var(σ²|y) = E_{φ₁}[Var(σ²|φ₁, y)] + Var_{φ₁}[E(σ²|φ₁, y)],

where E_{φ₁} and Var_{φ₁} denote the mean and variance taken with respect to the marginal density of φ₁, the marginal moments of σ² may be found from the conditional moments. It is now seen that the posterior analysis of a MA(1) model is somewhat more messy than the posterior analysis of an AR(p) model, in that the joint and marginal posterior distributions of the parameters are not standard distributions; nevertheless, it is possible to determine the marginal distribution of φ₁, the conditional distribution of σ² (or τ) given φ₁, and the marginal moments of σ².
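The marginal posterior density (5.21) and the averaging of the conditional moments (5.22) over it are one-dimensional numerical tasks. The following Python sketch illustrates them; the simulated series, the N(0, 1) prior chosen for φ₁, and the gamma parameters α = β = 2 are all arbitrary illustrative assumptions, not values from the text.

import numpy as np

rng = np.random.default_rng(0)
n, phi_true, tau_true = 60, 0.5, 1.0
alpha, beta = 2.0, 2.0                        # illustrative gamma prior parameters for tau

# simulate an MA(1) series: Y_t = e_t - phi e_{t-1}
e = rng.normal(0.0, 1.0 / np.sqrt(tau_true), n + 1)
y = e[1:] - phi_true * e[:-1]

i = np.arange(1, n + 1)
Q = np.sqrt(2 / (n + 1)) * np.sin(np.outer(i, i) * np.pi / (n + 1))
z = Q.T @ y                                    # Q'y, computed once

def log_post(phi):
    d = 1 + phi**2 - 2 * phi * np.cos(i * np.pi / (n + 1))
    quad = np.sum(z**2 / d)                    # y' Q D^{-1}(phi) Q' y
    log_prior = -0.5 * phi**2                  # N(0,1) prior for phi_1 (arbitrary proper choice)
    return log_prior - 0.5 * np.sum(np.log(d)) - 0.5 * (n + 2 * alpha) * np.log(2 * beta + quad)

grid = np.linspace(-3, 3, 1201)
lp = np.array([log_post(p) for p in grid])
w = np.exp(lp - lp.max())
w /= np.trapz(w, grid)                         # normalized marginal posterior (5.21) on the grid

post_mean = np.trapz(grid * w, grid)
post_var = np.trapz((grid - post_mean)**2 * w, grid)

# conditional mean of sigma^2 = 1/tau given phi_1, eq. (5.22), averaged over (5.21)
def cond_mean_sigma2(phi):
    d = 1 + phi**2 - 2 * phi * np.cos(i * np.pi / (n + 1))
    return (2 * beta + np.sum(z**2 / d)) / (n + 2 * alpha - 2)

sigma2_mean = np.trapz(np.array([cond_mean_sigma2(p) for p in grid]) * w, grid)
print(post_mean, post_var, sigma2_mean)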

How does one forecast with a moving average model? With the autoregressive model, the Bayesian predictive density was employed, thus the same method will be adopted for moving average models. Suppose we first consider a first order model (5.14) with the gamma prior for τ and the posterior density (5.20). Let Y_{n+1} be the future observation; then the joint density of Y = (Y₁, ..., Y_n), the past observations, and the future observation Y_{n+1}, given φ₁ and τ, is, from (5.19),

f(y, y_{n+1}|φ₁, τ) ∝ τ^{(n+1)/2} ∏_{i=1}^{n+1} d*_i^{-1/2}(φ₁) exp{−(τ/2)(y', y_{n+1}) Q*D*^{-1}(φ₁)Q*'(y', y_{n+1})'},   (5.23)

where y ∈ Rⁿ, y_{n+1} ∈ R, φ₁ ∈ R, and τ > 0. Q* and D*(φ₁) are analogous to the quantities Q and D(φ₁) of (5.19); that is, Q*'A*(φ₁)A*'(φ₁)Q* = D*(φ₁), with

d*_i(φ₁) = 1 + φ₁² − 2φ₁ cos[iπ(n + 2)^{-1}],  i = 1, 2, ..., n + 1,

as diagonal elements, and the ij-th element of Q* is [2/(n + 2)]^{1/2} sin[ijπ(n + 2)^{-1}], where i, j = 1, 2, ..., n + 1.

Consider the exponent of (5.23) and let Q*D*^{-1}(φ₁)Q*' = E(φ₁), where E(φ₁) is partitioned into blocks E₁₁(φ₁), E₁₂(φ₁), E₂₁(φ₁), and E₂₂(φ₁), with E₁₁(φ₁) of order n × n and E₂₂(φ₁) a scalar. The exponent is then a quadratic in y_{n+1} of the form y²_{n+1}E₂₂(φ₁) − 2y_{n+1}B(φ₁) + y'E₁₁(φ₁)y. Multiply the density (5.23) by the prior density for φ₁ and τ and integrate the product with respect to τ; then

f(y_{n+1}|y, φ₁) ∝ [y²_{n+1}E₂₂(φ₁) − 2y_{n+1}B(φ₁) + y'E₁₁(φ₁)y + 2β]^{−(n+1+2α)/2},  y_{n+1} ∈ R,   (5.24)

is the conditional density of the future observation given the past observations y and φ₁. Complete the square in y_{n+1} by letting A(φ₁) = E₂₂(φ₁), B(φ₁) denote the cross-product term above, and C(φ₁) = y'E₁₁(φ₁)y + 2β; then the conditional density (5.24) is

f(y_{n+1}|y, φ₁) ∝ {[y_{n+1} − A^{-1}(φ₁)B(φ₁)]²A(φ₁) + C(φ₁) − A^{-1}(φ₁)B²(φ₁)}^{−(n+2α+1)/2},  y_{n+1} ∈ R.   (5.25)

This is the conditional predictive density of Y_{n+1} given φ₁, not the Bayesian predictive density, but we see that Y_{n+1} given φ₁ is a t distribution with n + 2α degrees of freedom, location

E(Y_{n+1}|φ₁, y) = A^{-1}(φ₁)B(φ₁)   (5.26)

and precision

P(Y_{n+1}|φ₁, y) = A(φ₁)(n + 2α)[C(φ₁) − A^{-1}(φ₁)B²(φ₁)]^{-1}.   (5.27)

As with the posterior analysis, the marginal moments can be computed from the conditional moments. For example, suppose a point forecast of Y_{n+1} is needed; then the mean of the marginal predictive distribution of Y_{n+1} is given by

E(Y_{n+1}|y) = E_{φ₁}[E(Y_{n+1}|φ₁, y)],   (5.28)

where E_{φ₁} is the expectation with respect to the marginal posterior density of φ₁, given by (5.21). In a similar way the marginal variance Var(Y_{n+1}|y) of the predictive distribution of Y_{n+1} is computed by first finding the conditional variance Var(Y_{n+1}|φ₁, y) of Y_{n+1} given φ₁ from (5.27), and using

Var(Y_{n+1}|y) = E_{φ₁}[Var(Y_{n+1}|φ₁, y)] + Var_{φ₁}[E(Y_{n+1}|φ₁, y)],   (5.29)

where Var_{φ₁} means the variance is computed over the marginal posterior distribution of φ₁. Since φ₁ cannot be eliminated analytically from the joint density of Y_{n+1} and φ₁, there is no closed form expression for the predictive density of Y_{n+1}; nevertheless, one may plot the predictive density via one-dimensional numerical integrations of (5.25) and compute the forecast mean and variance from the formulas (5.28) and (5.29).
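A minimal sketch of this forecasting recipe follows, reusing the objects (y, the grid, the normalized posterior weights w, and α, β) from the previous sketch; the sign of B(φ₁) below is written so that the completed square yields the conditional mean, as in (5.25)-(5.27).

import numpy as np

# assumes y, grid, w, alpha, beta, n are defined as in the previous sketch
m = n + 1
j = np.arange(1, m + 1)
Qs = np.sqrt(2 / (m + 1)) * np.sin(np.outer(j, j) * np.pi / (m + 1))    # Q*, of order n+1

def cond_forecast(phi):
    ds = 1 + phi**2 - 2 * phi * np.cos(j * np.pi / (m + 1))             # d*_i(phi)
    E = Qs @ np.diag(1.0 / ds) @ Qs.T                                    # E(phi) = Q* D*^{-1} Q*'
    E11, E12, E22 = E[:n, :n], E[:n, n], E[n, n]
    A = E22                                                              # A(phi_1)
    B = -y @ E12                                                         # B(phi_1): sign from completing the square
    C = y @ E11 @ y + 2 * beta                                           # C(phi_1)
    mean = B / A                                                         # (5.26)
    var = (C - B**2 / A) / (A * (n + 2 * alpha - 2))                     # variance implied by (5.27)
    return mean, var

means, vars_ = np.array([cond_forecast(p) for p in grid]).T
pred_mean = np.trapz(means * w, grid)                                     # (5.28)
pred_var = np.trapz((vars_ + (means - pred_mean)**2) * w, grid)           # (5.29)
print(pred_mean, pred_var)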

Can the above be extended to a second order model (5.15) with n observations Y = (y₁, y₂, ..., y_n)'? Again, let

y = A(φ₁, φ₂)e,

where e = (e₋₁, e₀, e₁, ..., e_n)' and A(φ₁, φ₂) is the n × (n + 2) matrix whose t-th row has −φ₂ in column t, −φ₁ in column t + 1, 1 in column t + 2, and zeros elsewhere. Since e is normally distributed, y is also normal with mean zero and dispersion matrix τ^{-1}B(φ₁, φ₂), where B(φ₁, φ₂) = A(φ₁, φ₂)A'(φ₁, φ₂) is n × n and banded: the principal diagonal has common element 1 + φ₁² + φ₂², the first off-diagonals have common element φ₁(φ₂ − 1), and the second off-diagonals have common element −φ₂, while all other elements are zero. The conditional density of Y given φ₁, φ₂, and τ is

f(y|φ₁, φ₂, τ) ∝ τ^{n/2}|B(φ₁, φ₂)|^{-1/2} exp{−(τ/2) y'B^{-1}(φ₁, φ₂)y},  y ∈ Rⁿ.   (5.30)

Suppose (φ₁, φ₂) ∈ R², let p(φ₁, φ₂) be an arbitrary proper density on R², and suppose τ is independent of (φ₁, φ₂) and has a gamma prior distribution with parameters α and β. Of course, the posterior density of the parameters is the product of the likelihood function (5.30) and the prior density; but can the inverse and determinant of B(φ₁, φ₂) be expressed in closed form, as was the case for the first order model? If so, the posterior analysis for the second order model is a straightforward generalization of that of the first order model, but in any case the marginal posterior density of (φ₁, φ₂) is

p(φ₁, φ₂|y) ∝ p(φ₁, φ₂)|B(φ₁, φ₂)|^{-1/2}[2β + y'B^{-1}(φ₁, φ₂)y]^{−(n+2α)/2},  (φ₁, φ₂) ∈ R².   (5.31)

The marginal moments of τ may be computed directly from the marginal density of τ or indirectly by noting that the conditional density of τ given (φ₁, φ₂) is gamma with parameters (n + 2α)/2 and [2β + y'B^{-1}(φ₁, φ₂)y]/2; thus the conditional mean of τ^{-1} is

E(τ^{-1}|φ₁, φ₂, y) = [2β + y'B^{-1}(φ₁, φ₂)y]/(n + 2α − 2).   (5.33)

The marginal mean of τ^{-1} is

E(τ^{-1}|y) = E_{φ₁,φ₂}[E(τ^{-1}|φ₁, φ₂, y)],   (5.34)

where E_{φ₁,φ₂} denotes expectation with respect to (φ₁, φ₂), with density (5.31), and the other marginal moments of τ^{-1} may be computed in a similar way.

We have seen that the moving average model presents computational and theoretical problems that do not arise with the autoregressive model. The marginal posterior distributions of the parameters of an autoregressive model are standard distributions, namely the multivariate t for the autoregressive parameters and a gamma for the precision of the white noise. Also, the predictive distribution for a future observation is a univariate t distribution.

With regard to the moving average model, the marginal posterior distributions are not standard; however, the conditional posterior distribution of the noise precision given the moving average parameters is a gamma. Although the marginal distribution of the moving average parameters is not standard, it may be plotted numerically and its moments computed. The marginal moments of the noise precision are computed from the conditional moments by averaging the latter over the marginal posterior distribution of the moving average parameters.

AUTOREGRESSIVE MOVING AVERAGE MODELS

The AR(p) and MA(q) processes may be coupled to form an ARMA(p,q) model,

θ(B)Y_t = Φ(B)e_t,  t = 1, 2, ...,   (5.35)

where

θ(B) = 1 − θ₁B − ... − θ_pB^p  and  Φ(B) = 1 − φ₁B − ... − φ_qB^q.

The observations are Y_t, the e_t are white noise with precision τ > 0, the autoregressive coefficients are θ₁, ..., θ_p, and φ₁, ..., φ_q are the moving average parameters. The parameter space is Ω = R^p × R^q × (0, ∞), and the process is stationary if the roots of θ(B) = 0 are outside the unit circle, while if all the roots of Φ(B) = 0 are outside the unit circle, the process is invertible. In particular, an ARMA(1,1) model is given by

Y_t − θ₁Y_{t−1} = e_t − φ₁e_{t−1},  t = 1, 2, ...,   (5.36)

and is stationary and invertible if |θ₁| < 1 and |φ₁| < 1. If φ₁ = 0, the process is AR(1), and it is MA(1) if θ₁ = 0.
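The stationarity and invertibility conditions just stated are easy to check numerically; the short sketch below (the coefficient values are arbitrary illustrative choices) tests whether all roots of θ(B) or Φ(B) lie outside the unit circle.

import numpy as np

def roots_outside_unit_circle(coefs):
    # coefs = (c_1, ..., c_m) for the polynomial 1 - c_1 B - ... - c_m B^m
    c = np.asarray(coefs, dtype=float)
    if c.size == 0:
        return True
    poly = np.concatenate((-c[::-1], [1.0]))   # highest power of B first, constant term last
    return bool(np.all(np.abs(np.roots(poly)) > 1.0))

theta = [0.6]   # theta_1 for an ARMA(1,1); illustrative value
phi = [0.3]     # phi_1; illustrative value
print(roots_outside_unit_circle(theta))   # True: the AR part is stationary (|theta_1| < 1)
print(roots_outside_unit_circle(phi))     # True: the MA part is invertible (|phi_1| < 1)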

Before proceeding with the posterior analysis of an ARMA(1,1) model, let us derive the likelihood function of the three parameters θ₁, φ₁, and τ, not assuming stationarity or invertibility and taking Y₀ to be a known constant.

First let z_t = Y_t − θ₁Y_{t−1}, t = 1, 2, ..., n; then first the joint density of z₁, z₂, ..., z_n will be determined, and then the joint density of Y₁, ..., Y_n will be derived. The density of z = (z₁, z₂, ..., z_n)' is that of the observations of a MA(1) process and has been derived in the previous section. To recapitulate, the density of z is (5.19), namely

f(z|φ₁, τ) ∝ τ^{n/2} ∏_{i=1}^{n} d_i^{-1/2}(φ₁) exp{−(τ/2) z'QD^{-1}(φ₁)Q'z},  z ∈ Rⁿ,   (5.37)

where D(φ₁) is the n × n diagonal matrix whose elements are the eigenvalues of A(φ₁)A'(φ₁) of (5.17). The eigenvalues of this matrix are

d_i(φ₁) = 1 + φ₁² − 2φ₁ cos[iπ(n + 1)^{-1}],  i = 1, 2, ..., n,

and the eigenvectors of the matrix are the columns of the n × n matrix Q, which has ij-th element [2/(n + 1)]^{1/2} sin[ijπ(n + 1)^{-1}] and is independent of φ₁. Let Y = (Y₁, ..., Y_n)' be the vector of observations; then since Z = Y − θ₁Y*(Y), where Y*(Y) = (Y₀, Y₁, ..., Y_{n−1})', the density of Y is

g(y|θ₁, φ₁, τ) ∝ τ^{n/2} ∏_{i=1}^{n} d_i^{-1/2}(φ₁) exp{−(τ/2)[y − θ₁y*(y)]'QD^{-1}(φ₁)Q'[y − θ₁y*(y)]},   (5.38)

where y*(y) = (y₀, y₁, ..., y_{n−1})', y ∈ Rⁿ, θ₁ ∈ R, φ₁ ∈ R, and τ > 0. This can be simplified by letting ỹ = Q'y and ỹ* = Q'y*(y); thus the likelihood function for θ₁, φ₁, and τ is

L(θ₁, φ₁, τ|y) ∝ τ^{n/2} ∏_{i=1}^{n} d_i^{-1/2}(φ₁) exp{−(τ/2)(ỹ − θ₁ỹ*)'D^{-1}(φ₁)(ỹ − θ₁ỹ*)}.   (5.39)

Suppose, a priori, that (θ₁, τ) is normal-gamma with parameters μ, ξ, α, and β, and that φ₁ has some proper prior density p(φ₁) over R, independent of (θ₁, τ). Combining this prior with (5.39), the exponent of the joint posterior density is, apart from the factor −τ/2, a quadratic in θ₁ of the form θ₁²A(φ₁) − 2θ₁B(φ₁) + C(φ₁), where

A(φ₁) = ỹ*'D^{-1}(φ₁)ỹ* + ξ,   B(φ₁) = ỹ'D^{-1}(φ₁)ỹ* + ξμ,

and

C(φ₁) = ỹ'D^{-1}(φ₁)ỹ + 2β + ξμ².

Integrating the joint posterior density with respect to θ₁ and τ gives the marginal posterior density of φ₁,

p(φ₁|y) ∝ p(φ₁) ∏_{i=1}^{n} d_i^{-1/2}(φ₁) A^{-1/2}(φ₁)[C(φ₁) − B²(φ₁)A^{-1}(φ₁)]^{−(n+2α)/2},  φ₁ ∈ R,   (5.41)

which, as with the MA(1) model, is a one-dimensional density that is easily plotted and integrated numerically. Furthermore, the conditional density of θ₁ given φ₁ is

p(θ₁|φ₁, y) ∝ {[θ₁ − A^{-1}(φ₁)B(φ₁)]²A(φ₁) + C(φ₁) − B²(φ₁)A^{-1}(φ₁)}^{−(n+2α+1)/2},  θ₁ ∈ R,   (5.43)

a univariate t with n + 2α degrees of freedom and location A^{-1}(φ₁)B(φ₁), while the conditional distribution of τ given φ₁ is a gamma; thus the marginal posterior moments of θ₁ and τ may be obtained by averaging the conditional moments over (5.41).

Suppose now a general ARMA(p,q) process is entertained, with a normal-gamma prior for the autoregressive coefficients θ = (θ₁, ..., θ_p)' and τ, that is,

p(θ|τ) ∝ τ^{p/2} exp{−(τ/2)(θ − μ)'P(θ − μ)},  θ ∈ R^p,
p(τ) ∝ τ^{α−1} e^{−βτ},  τ > 0,

and p(φ) is a proper density over R^q, where φ = (φ₁, φ₂, ..., φ_q)' are the q moving average parameters. If these assumptions are satisfied, it is conjectured that the conditional distribution of τ given φ is a gamma, the conditional posterior distribution of θ given φ is a p-variate multiple t distribution, and the marginal posterior distribution of φ, although not a standard density, may be explicitly determined in closed form. If this is true, the marginal posterior moments of τ and θ can be computed from the conditional posterior moments of these parameters.

The section on ARMA processes is concluded with the Bayesian procedure for forecasting future observations of an ARMA(1,1) model; thus consider the one-step-ahead forecast of Y_{n+1}. The Bayesian predictive density of Y_{n+1} will be found by deriving the joint marginal density of Y = (Y₁, Y₂, ..., Y_n) and Y_{n+1}, then identifying the conditional density of Y_{n+1} given Y = y. First consider the joint conditional density of Y and Y_{n+1} given θ₁, φ₁, and τ, which is (5.38) with the following changes.

Let y_(2) = (y₁, ..., y_n, y_{n+1})' and y*_(2) = (y₀, y₁, ..., y_n)', and replace y by y_(2) and y*(y) by y*_(2) in (5.38), with Q and D(φ₁) replaced by their order-(n + 1) analogues Q* and D*(φ₁); this gives the joint density of Y and Y_{n+1} given θ₁, φ₁, and τ. Multiplying by the normal-gamma prior for (θ₁, τ) and the prior for φ₁, and integrating with respect to θ₁ and τ, one obtains the joint density of Y, Y_{n+1}, and φ₁; completing the square in y_{n+1} then gives the conditional density of Y_{n+1} given y and φ₁ in the form

f(y_{n+1}|y, φ₁) ∝ {[y_{n+1} − A^{-1}(φ₁)B(φ₁)]²A(φ₁) + C(φ₁) − B²(φ₁)A^{-1}(φ₁)}^{−(n+2α+1)/2},   (5.47)

where A(φ₁), B(φ₁), and C(φ₁) are now built from the prior parameters μ, ξ, α, and β and from the blocks of the (n + 1) × (n + 1) matrix R(φ₁) = Q*D*^{-1}(φ₁)Q*', partitioned into R₁₁(φ₁), R₁₂(φ₁), R₂₁(φ₁), and R₂₂(φ₁), with R₁₁(φ₁) of order n.

From (5.47), one may conclude that the conditional density of Y_{n+1} given y and φ₁ is a t distribution with n + 2α degrees of freedom, location A^{-1}(φ₁)B(φ₁), and precision A(φ₁)(n + 2α)[C(φ₁) − B²(φ₁)A^{-1}(φ₁)]^{-1}. Suppose the mean of the predictive distribution of Y_{n+1} is used as a point forecast of the future observation; then since

E(Y_{n+1}|y, φ₁) = A^{-1}(φ₁)B(φ₁)   (5.48)

is the conditional mean of Y_{n+1} given φ₁ and y, this conditional moment can be averaged with respect to the marginal posterior density of φ₁, given by (5.41). It would be interesting to see if the forecasting procedure of the ARMA(1,1) model can be extended to the general ARMA(p,q) model.

THE IDENTIFICATION PROBLEM

In the preceding Bayesian analysis of ARMA models the orders p and q of the autoregressive and moving average components of the model were assumed known; however, in practice one does not usually know these values and they must be identified or estimated. Box and Jenkins (1970) identify the order of an ARMA process from the sample autocorrelation and partial autocorrelation functions, while the method of Akaike (1979) is popular in estimating the order of an AR(p) model. A purely Bayesian procedure to identify the order of an ARMA(p,q) process has yet to be examined; however, Diaz and Farah (1981) have derived the exact posterior mass function of the order p of an autoregressive process, and their derivation will be given.

Recall that for an AR(p) process with n observations y = (y₁, y₂, ..., y_n)', and assuming the initial observations y₀, y₋₁, ..., y_{1−p} are known, the likelihood function is

L[θ(p), p, τ|y] ∝ τ^{n/2} exp{−(τ/2) Σ_{t=1}^{n} (y_t − Σ_{i=1}^{p} θ_{pi} y_{t−i})²},

where θ(p) = (θ_{p1}, ..., θ_{pp})' ∈ R^p, τ > 0, and p = 1, 2, ..., k, where k is the largest possible order of the process. The likelihood may be expressed as

L[θ(p), p, τ|y] ∝ τ^{n/2} exp{−(τ/2)[θ'(p)A(p, y)θ(p) − 2θ'(p)B(p, y) + C(y)]},   (5.49)

where A(p, y) is p × p with i-th diagonal element Σ_{t=1}^{n} y²_{t−i} and ij-th off-diagonal element Σ_{t=1}^{n} y_{t−i}y_{t−j}, B(p, y) is a p × 1 vector with i-th component Σ_{t=1}^{n} y_t y_{t−i}, and C(y) = Σ_{t=1}^{n} y²_t.

The form of the likelihood suggests letting θ(p) have a conditional normal distribution, a priori, given p and τ, with mean vector μ(p) and precision matrix τQ_p. Suppose τ and p are independent, where τ is gamma with parameters α and β, and p is uniform over the integers 1, 2, ..., k. Thus the conditional prior density of θ(p) given p and τ is

g[θ(p)|p, τ] ∝ τ^{p/2} exp{−(τ/2)[θ(p) − μ(p)]'Q_p[θ(p) − μ(p)]},  θ(p) ∈ R^p,   (5.50)

the marginal prior density of τ is

g(τ) ∝ τ^{α−1} e^{−τβ},  τ > 0,   (5.51)

and the marginal prior mass function of p is g(p) = k^{-1}, p = 1, 2, ..., k. Now, combining the prior of θ(p), p, and τ with the likelihood function,

p[θ(p), p, τ|y] ∝ |Q_p|^{1/2} τ^{(n+p+2α)/2 − 1} exp(−(τ/2){[θ(p) − A*^{-1}(p,y)B*(p,y)]'A*(p,y)[θ(p) − A*^{-1}(p,y)B*(p,y)] + C*(p,y) − B*'(p,y)A*^{-1}(p,y)B*(p,y)})   (5.52)

is the joint posterior density of θ(p), p, and τ, where A*(p,y) = Q_p + A(p,y), B*(p,y) = B(p,y) + Q_pμ(p), and C*(p,y) = C(y) + 2β + μ'(p)Q_pμ(p). Eliminating first θ(p), then τ, from (5.52) gives

h(p|y) ∝ |Q_p|^{1/2}|A*(p,y)|^{-1/2}[C*(p,y) − B*'(p,y)A*^{-1}(p,y)B*(p,y)]^{−(n+2α)/2},  p = 1, 2, ..., k,   (5.53)

as the marginal posterior mass function of p. This identification procedure assigns a posterior probability to each of the k autoregressive models

Y_t = Σ_{i=1}^{p} θ_{pi} y_{t−i} + e_t,  p = 1, 2, ..., k,

where θ_{pi} is the coefficient of the i-th lagged value of the dependent variable in the p-th model. Formula (5.53) allows one to compute a plot of the posterior mass function of p over all possible orders of the model. To identify the model, one should inspect the plot of the posterior mass function of p and select an order for the model. Notice that for each value of p the inverse and determinant of A(p, y) + Q_p must be computed.

How does one make inferences about the vector θ(p) of autoregressive parameters? One could identify p from its posterior mass function, then use the conditional posterior density of θ(p̂) given p = p̂, where p̂ is the identified value of p.
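A sketch of how (5.53) can be evaluated over the candidate orders is given below; the user must supply the presample values y₀, ..., y_{1−k} and the hyperparameters μ(p) and Q_p for each p, as discussed above and in the next paragraph, and the normalization uses the uniform prior g(p) = 1/k.

import numpy as np
from numpy.linalg import inv, slogdet

def ar_order_posterior(y, y_init, k, alpha, beta, mu_list, Q_list):
    """Posterior mass function (5.53) over AR orders p = 1,...,k.
    y       : the n observations y_1,...,y_n
    y_init  : the k presample values (y_0, y_{-1}, ..., y_{1-k}), newest first
    mu_list[p-1], Q_list[p-1] : prior mean vector and precision matrix of theta(p)."""
    yy = np.concatenate((np.asarray(y_init)[::-1], y))   # y_{1-k}, ..., y_0, y_1, ..., y_n
    n = len(y)
    C = np.sum(y**2)                                      # C(y)
    log_h = np.empty(k)
    for p in range(1, k + 1):
        X = np.column_stack([yy[k - i : k - i + n] for i in range(1, p + 1)])  # column i holds y_{t-i}
        A = X.T @ X                                       # A(p, y)
        B = X.T @ y                                       # B(p, y)
        Qp, mup = Q_list[p - 1], mu_list[p - 1]
        Astar = Qp + A
        Bstar = B + Qp @ mup
        Cstar = C + 2 * beta + mup @ Qp @ mup
        s = Cstar - Bstar @ inv(Astar) @ Bstar
        log_h[p - 1] = 0.5 * slogdet(Qp)[1] - 0.5 * slogdet(Astar)[1] \
                       - 0.5 * (n + 2 * alpha) * np.log(s)
    h = np.exp(log_h - log_h.max())
    return h / h.sum()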

The conditional posterior density of θ(p̂), h[θ(p̂)|p̂, y], given p̂ is a multivariate t with n + 2α degrees of freedom, location vector θ*_p̂, and precision parameter (n + 2α)[A(p̂, y) + Q_p̂]/s_p̂, where θ*_p̂ = A*^{-1}(p̂, y)B*(p̂, y) and s_p̂ = C*(p̂, y) − B*'(p̂, y)A*^{-1}(p̂, y)B*(p̂, y).

Note that to use the above analysis of an AR(p) model one must specify a sequence of hyperparameters, namely the means μ(p) and precisions Q_p of θ(p) for p = 1, 2, ..., k. The conditional prior density of θ(p) given p is

g[θ(p)|p] ∝ {2β + [θ(p) − μ(p)]'Q_p[θ(p) − μ(p)]}^{−(p+2α)/2},  θ(p) ∈ R^p,   (5.54)

which is a t density with 2α degrees of freedom, location μ(p), and dispersion matrix βQ_p^{-1}(α − 1)^{-1}. Now, how should one choose α, β, μ(p), and Q_p, p = 1, 2, ..., k? The parameters α and β control the gamma prior distribution of the noise precision and the prior dispersion of all the conditional distributions of θ(p) given p = 1, 2, ..., k. One way, advocated by Litterman (1980), is to let the first component of μ(p) always be one and the others zero, and to let the prior variances of θ_{p1}, θ_{p2}, ..., θ_{pp} decrease as the lag increases, so that the coefficients of the more distant lags are shrunk more strongly toward zero.

OTHER TIME SERIES MODELS

Consider a simple linear regression model with autocorrelated errors,

Y_t = βx_t + u_t,  u_t = ρu_{t−1} + e_t,  t = 1, 2, ..., n,   (5.55)

where the e_t are n.i.d. (0, τ^{-1}), τ > 0, β is an unknown regression coefficient, x_t is a known regressor, and ρ is the autocorrelation parameter of the errors. The process is explosive if |ρ| > 1; otherwise it is called a nonexplosive process. If |ρ| < 1, the correlation between adjacent observations is ρ, i.e., the correlation between Y_t and Y_{t+1} is ρ, and in general the correlation between Y_t and Y_{t+k} is ρ^k; thus in the nonexplosive case the autocorrelation function goes to zero in a geometric fashion as the lag increases. The model may be expressed as

Y_t = ρY_{t−1} + β(x_t − ρx_{t−1}) + e_t,

where Y₀ is a known initial value. Since the e_t are n.i.d. (0, τ^{-1}), the joint density of Y₁, ..., Y_n, conditional on the parameters, is

f(y|τ, β, ρ) ∝ τ^{n/2} exp{−(τ/2) Σ_{t=1}^{n} [y_t − ρy_{t−1} − β(x_t − ρx_{t−1})]²}.

With a normal-gamma prior for (β, τ) and a proper prior for ρ, the posterior analysis parallels that of the earlier models: the marginal posterior density of ρ is obtained analytically, the conditional posterior distributions of β and τ given ρ are a t and a gamma, and the conditional predictive density of a future observation given ρ is a univariate t whose precision is a function of ρ and the data (through sums of the form Σ(y_t − ρy_{t−1})²), so that given y one has some idea of the precision of one's forecast of Y_{n+1}.

Zellner (1971) gives a more extensive examination of the regression model with autocorrelated errors and presents a numerical analysis based on generated data where the regressor variable is rescaled investment expenditure taken from a paper by Haavelmo (1947). He examines explosive and nonexplosive cases by plotting the marginal densities of β and ρ, and the conditional distribution of β given ρ, for various values of ρ, to see the effect of the autocorrelation parameter on inferences for β. His results can be obtained if one lets α → −1/2, β → 0, and ξ → 0 in the posterior analysis. Also, Zellner generalizes the model to include several regressor variables but does not develop a prediction analysis, as was done here. The last time series model to be considered is the distributed lag model.

The distributed lag model is

y_t = α Σ_{i=0}^{∞} λ^i x_{t−i} + u_t,  t = 1, 2, ..., n,   (5.66)

where y_t is the value of the dependent variable at time t, x_{t−i} is the value of the stimulus at time t − i, λ is an unknown real parameter with 0 < λ < 1, α is an unknown coefficient, and the u_t are white noise with precision τ > 0. An equivalent representation of this model is obtained by subtracting λy_{t−1} from both sides to obtain

y_t = λy_{t−1} + αx_t + u_t − λu_{t−1},   (5.67)

which is an ARMA(1,1) model containing a concomitant variable x_t. It can be shown, see Zellner, Chapter 7, that the dispersion matrix of Y₁, Y₂, ..., Y_n is τ^{-1}G(λ), where G(λ) is tridiagonal with 1 + λ² as the principal diagonal elements and −λ as the super- and subdiagonal elements, all other elements being zero. This is the same matrix that was encountered with the MA(1) model of this chapter. So, as before, let Q be an orthogonal matrix which diagonalizes G(λ), i.e., Q'G(λ)Q = D(λ), where D(λ) is diagonal with diagonal elements

d_i(λ) = 1 + λ² − 2λ cos[iπ(n + 1)^{-1}],  i = 1, 2, ..., n,

and Q is independent of λ with ij-th element [2/(n + 1)]^{1/2} sin[ijπ(n + 1)^{-1}]. Our analysis of the distributed lag model will be quite like the analysis of the ARMA(1,1) process. The likelihood function for τ, λ, and α is

L(τ, λ, α|y) ∝ τ^{n/2}|G(λ)|^{-1/2} exp{−(τ/2)(y − λy₋₁ − αx)'G^{-1}(λ)(y − λy₋₁ − αx)},   (5.68)

where τ > 0, 0 < λ < 1, α ∈ R, y₋₁ = (y₀, y₁, ..., y_{n−1})', y = (y₁, y₂, ..., y_n)', and x = (x₁, x₂, ..., x_n)'. Since G^{-1}(λ) = QD^{-1}(λ)Q', the likelihood function reduces to

L(τ, λ, α|y) ∝ τ^{n/2} ∏_{i=1}^{n} d_i^{-1/2}(λ) exp{−(τ/2)(y − λy₋₁ − αx)'QD^{-1}(λ)Q'(y − λy₋₁ − αx)}.   (5.69)

Let y* = Q'y, y*₋₁ = Q'y₋₁, and x* = Q'x; then

L(τ, λ, α|y) ∝ τ^{n/2} ∏_{i=1}^{n} d_i^{-1/2}(λ) exp{−(τ/2)(y* − λy*₋₁ − αx*)'D^{-1}(λ)(y* − λy*₋₁ − αx*)}

gives a convenient form of the likelihood function to combine with the prior density of the parameters.

Suppose, a priori, that (α, τ) is independent of λ, that (α, τ) is normal-gamma with parameters μ, ξ, α, and β, and that λ has some proper density g(λ) over (0, 1]. (The coefficient α of the model should not be confused with the shape parameter α of the gamma component of the prior, which appears only in the exponents below.) The posterior density of the parameters is, by Bayes' theorem,

p(λ, τ, α|y) ∝ g(λ) τ^{(1+n+2α)/2 − 1} ∏_{i=1}^{n} d_i^{-1/2}(λ) exp{−(τ/2)[2β + (α − μ)²ξ + (y* − λy*₋₁ − αx*)'D^{-1}(λ)(y* − λy*₋₁ − αx*)]},   (5.70)

where 0 < λ < 1, τ > 0, and α ∈ R. It will be shown that the marginal posterior density of λ can be isolated and determined analytically, and that the conditional posterior distributions of α given λ and of τ given λ are a t and a gamma, respectively. Integrating with respect to α and τ gives

p(λ|y) ∝ g(λ) ∏_{i=1}^{n} d_i^{-1/2}(λ) [A(λ, y)]^{-1/2}[C(λ, y) − B'(λ, y)A^{-1}(λ, y)B(λ, y)]^{−(n+2α)/2},  0 < λ < 1,   (5.71)

for the marginal posterior density of λ, and furthermore it can be shown that the conditional distribution of α given λ is a t with n + 2α degrees of freedom, location

E(α|λ, y) = A^{-1}(λ, y)B(λ, y)   (5.72)

and precision A(λ, y)(n + 2α)[C(λ, y) − B'(λ, y)A^{-1}(λ, y)B(λ, y)]^{-1}, where A, B, and C are the appropriate functions of λ and y. The conditional density of τ given λ is a gamma with parameters α* = (n + 2α)/2 and β*, where 2β* = C(λ, y) − B'(λ, y)A^{-1}(λ, y)B(λ, y). The marginal posterior moments of α and τ may be computed by averaging their conditional moments with respect to the marginal density of λ, (5.71).

It should be relatively easy to develop the Bayesian predictive density of future observations for the distributed lag model, since the analysis is so similar to that of the ARMA(1,1) model. There are many versions of the distributed lag model, and the reader is referred to Zellner (1971) for his study of the model. Zellner also provides an interesting numerical example.
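The equivalence of the direct form (5.66) and the transformed form (5.67), on which the analysis above rests, can be checked by simulation. In the Python sketch below the parameter values and the stimulus series are arbitrary illustrative choices, and the infinite sum in (5.66) is started at the first available stimulus, which makes the identity exact from the second observation onward.

import numpy as np

rng = np.random.default_rng(1)
n, lam, a, tau = 120, 0.6, 1.5, 4.0           # illustrative values for lambda, alpha, tau
x = rng.normal(size=n)                         # an arbitrary stimulus series x_1,...,x_n
u = rng.normal(0.0, 1.0 / np.sqrt(tau), n)     # white noise with precision tau

# direct form (5.66), with the sum started at the first observed stimulus
y = np.array([a * np.sum(lam**np.arange(t + 1) * x[t::-1]) for t in range(n)]) + u

# transformed form (5.67): y_t - lam*y_{t-1} = a*x_t + u_t - lam*u_{t-1}
lhs = y[1:] - lam * y[:-1]
rhs = a * x[1:] + u[1:] - lam * u[:-1]
print(np.allclose(lhs, rhs))                   # True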

A NUMERICAL STUDY OF AUTOREGRESSIVE PROCESSES WHICH ARE ALMOST STATIONARY

The autoregressive process was examined on pp. 183-187, where the posterior and predictive analyses were derived, and then on pp. 203-206 the identification of the order of an autoregressive model was explained. Suppose we consider a first-order model

Y_t = θY_{t−1} + e_t,   (5.1)

where the e_t are i.i.d. N(0, τ^{-1}), θ ∈ R, Y_t is the observation at time t, and θ and τ (> 0) are unknown. By using a normal-gamma prior density for θ and τ, the posterior density of θ and τ is normal-gamma and given by formula (5.9), and the marginal posterior density of θ by (5.10). Also, (5.11) gives the marginal posterior density of the precision τ, and the one-step-ahead predictive density of a future observation is given by (5.13). All these derivations are for a general AR(p) model, p = 1, 2, ..., and no assumption of stationarity was made.

With the first-order model (p = 1), the process is stationary if |θ| < 1; thus if one knows the stationarity assumption is true, the posterior and predictive analyses of pp. 183-187 do not apply. If stationarity is indeed the case, the parameter space is Ω* = {(θ, τ): |θ| < 1, τ > 0}, and one should develop a Bayesian analysis with a prior distribution defined on Ω*. This has been done by Lahiff (1980) and Monahan (1981). Suppose that one is not certain that the first-order model is stationary but is, say, 95% confident the stationarity assumption holds. If this is the case, can one use a normal-gamma prior density for (θ, τ) in the analysis of a first-order model? Suppose one chooses a normal-gamma prior density with parameters μ, ξ, α, and β, i.e., let

p(θ, τ) ∝ τ^{1/2} e^{−τξ(θ−μ)²/2} τ^{α−1} e^{−τβ},  θ ∈ R, τ > 0,

be the prior density of θ and τ; then

p(θ) ∝ [2β + ξ(θ − μ)²]^{−(2α+1)/2},  θ ∈ R,   (5.73)

is the marginal prior density of θ, which is a t density with 2α degrees of freedom, location μ, and precision αξ/β. The marginal prior density of τ is gamma with parameters α and β, which are also involved in the degrees of freedom and precision of the marginal distribution of θ. Suppose one must choose the parameters of the prior distribution in such a way that, say, (1 − γ)100% of the marginal prior probability of θ is contained in (−1, +1); then since

P(−1 < θ < 1) = 1 − γ,  0 < γ < 1,

μ, α, β, and ξ may be chosen so that, approximately,

1 = μ + t_{γ/2, 2α}[β/ξ(α − 1)]^{1/2},

where t_{γ/2, 2α} is the upper γ/2 quantile of the t distribution with 2α degrees of freedom. If μ, α, and β are given, ξ must be set at

ξ = t²_{γ/2, 2α} β / [(1 − μ)²(α − 1)],   (5.74)

where −1 < μ < 1, β > 0, and α > 1, and in this way (1 − γ)100% of the marginal prior probability of θ is contained in (−1, +1), and one would be (1 − γ)100% confident the AR(1) process is stationary. For example, if γ = .05, β = α − 1, and |μ| < 1, then by letting

ξ = t²_{.025, 2α} / (1 − μ)²,   (5.75)

one would be 95% confident the model is stationary, the prior mean of θ would be μ, and the prior information about τ would be a gamma distribution with parameters α and β where β = α − 1, so that E(τ^{-1}) = β/(α − 1) = 1, α > 1. By choosing γ, α, and β first, then μ, and finally ξ according to (5.74), approximate stationarity, a priori, may be induced. However, there are some problems: as μ → 1 (or −1), ξ → ∞; thus if one puts μ close to one of the end points, one is forced to put a large prior precision on the marginal prior density of θ. A numerical study is now introduced to investigate the effect of the sample size n and the prior mean μ on the marginal posterior density of θ (5.10), the marginal posterior density of τ (5.11), and the Bayesian predictive density of Y_{n+1}, given by (5.13).
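Formula (5.74) is easy to evaluate; the short Python sketch below uses scipy to compute ξ for the prior settings employed in the tables that follow (α = 10, β = 9, γ = .05), and the resulting prior variances β/[ξ(α − 1)] reproduce the V(θ) column of Tables 5.1-5.5.

import numpy as np
from scipy.stats import t

alpha, beta, gamma = 10.0, 9.0, 0.05
t_val = t.ppf(1 - gamma / 2, 2 * alpha)        # t_{gamma/2, 2*alpha} = 2.086 for 20 d.f.

for mu in [0.0, 0.25, 0.5, 0.75, 0.9]:
    xi = t_val**2 * beta / ((1 - mu)**2 * (alpha - 1))   # eq. (5.74)
    prior_var = beta / (xi * (alpha - 1))                 # equals 1/xi since beta = alpha - 1
    print(mu, round(xi, 3), round(prior_var, 4))
# prior variances: .2298, .1293, .0575, .0144, .0023 -- the V(theta) entries of Tables 5.1-5.5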

Table 5.1  AR(1), Y_t = e_t, e_t i.i.d. N(0, 1), y₀ = 0, N-G prior α = 10, β = 9. For each prior mean μ = 0, .25, .5, .75, .9 and each sample size n = 25, 50, 100, 750 the table lists the prior information E(θ), V(θ), E(τ), V(τ); the posterior information E(θ|S), V(θ|S), E(τ|S), V(τ|S); and the predictive information E(Y_{n+1}), V(Y_{n+1}).

Table 5.2  AR(1), Y_t = .25Y_{t−1} + e_t, e_t i.i.d. N(0, 1), y₀ = 0, N-G prior α = 10, β = 9; same layout as Table 5.1.

Table 5.3  AR(1), Y_t = .5Y_{t−1} + e_t, e_t i.i.d. N(0, 1), y₀ = 0, N-G prior α = 10, β = 9; same layout as Table 5.1.

Table 5.4  AR(1), Y_t = .75Y_{t−1} + e_t, e_t i.i.d. N(0, 1), y₀ = 0, N-G prior α = 10, β = 9; same layout as Table 5.1.

Table 5.5  AR(1), Y_t = .9Y_{t−1} + e_t, e_t i.i.d. N(0, 1), y₀ = 0, N-G prior α = 10, β = 9; same layout as Table 5.1.

Series of length n = 25, 50, 100, and 750 were generated from AR(1) models with θ = 0, .25, .5, .75, and .9, initial value Y₀ = 0, and τ = 1. The parameters of the normal-gamma prior density were chosen as α = 10, β = 9, μ = 0, .25, .5, .75, and .9, and ξ was calculated from (5.74); in this way 95% of the prior probability for θ was contained in (−1, 1). Tables 5.1-5.5 give the prior mean and variance of θ and τ, the posterior mean and variance of θ and τ, and the predictive mean and variance of Y_{n+1}. For example, Table 5.1 is based on data generated from an AR(1) model with θ = 0, Y₀ = 0, and τ = 1, for samples of size n = 25, 50, 100, 750, prior means μ = 0, .25, .5, .75, and .9, α = 10, β = 9, and ξ calculated from (5.74). The prior variance of θ is β/ξ(α − 1) = 1/ξ since β = α − 1. From Table 5.1, if μ = 0, the prior variance of θ is V(θ) = .2298. Consider n = 50, μ = .25; then from Table 5.1 the prior mean of θ is .25, the prior variance .1293, the prior mean of τ is 1.1111 and its variance .1235, the posterior mean and variance of θ are .2008 and .0171, respectively, the posterior mean and variance of τ are .9227 and .0243, respectively, and the mean and variance of the predictive distribution of Y_{n+1} are −.3068 and 1.1554, respectively. Tables 5.2-5.5 are now self-explanatory. Several graphs are also given. Figure 5.1 is a graph of the prior and posterior marginal t densities of θ corresponding to the first row of Table 5.3, that is, when the series was generated from an AR(1) model with θ = .5 and a length of n = 25; the prior parameters were μ = 0 and a prior variance for θ of .2298. Figure 5.2 is for the same case but is a plot of the prior and posterior gamma densities of the precision τ, while Figure 5.3 gives a plot of the predictive t density of the future observation Y₂₆. What may be concluded from this study? The tables show the anticipated reduction of the posterior variances of θ and τ and the predictive variance of Y_{n+1} as the sample size increases. Also, the effect of the prior mean of θ on the posterior mean of θ is clearly demonstrated: as μ becomes closer to the "true" θ value, so does the posterior mean of θ, and as μ departs further from the "true" value of θ, the posterior mean of θ does the same. Also, as the sample size increases, the effect of the prior mean μ of θ on the posterior mean of θ becomes less and less. A very interesting result of this study shows that values of θ close to one are easier to estimate (with the posterior mean of θ) than θ values close to zero. Table 5.5 reveals remarkably small changes



Figure 5.1  Prior and posterior densities for θ: true θ = .5, n = 25, prior mean μ = 0, N-G prior for (θ, τ); dotted curve = prior, triangles = posterior.

Figure 5.2  Prior and posterior densities for τ: true τ = 1, true θ = .5, n = 25, E(τ) = 1.1111, E(θ) = 0, N-G prior for (θ, τ); dotted curve = prior, triangles = posterior.

Figure 5.3  Predictive probability density of Y₂₆.

in the posterior means E(θ|S) for changes in n (= 25, 50, 100, 750) and changes in μ (= 0, .25, .75, .9). There is little effect on the posterior mean of θ from the prior distribution of θ and the sample size when the "true" value of θ is close to one. On the other hand, Table 5.1 shows θ is more difficult to estimate when it is close to zero, in the sense that the posterior mean of θ is more sensitive to the prior mean of θ and the sample size than when θ is close to one. If θ is close to one, one could use a small sample size and not have an accurate prior estimate μ of θ and still have a reasonably good estimate of θ (at least for this data set), but if θ is close to zero, one should have precise prior information or a large sample size in order to obtain good estimates of θ. This phenomenon is confirmed by the sampling theory approach to time series analysis: the sampling variance of the usual estimator of θ is approximately (1 − θ²)/n, thus as θ → 1 one obtains better estimates. See Box and Jenkins (1970) for the sampling theory approach to time series analysis. As a last observation, we see that as the prior mean of θ increases, the posterior variance of θ decreases, and this should occur because as μ increases, the prior precision of θ increases according to (5.74); consequently the posterior variances of θ decrease along with the prior variance of θ. Is the normal-gamma distribution adequate to express prior knowledge about AR(1) processes which are almost stationary? We have seen one limitation with this approach, namely, the prior variance of θ is a function of the prior mean of θ. In order to keep 95% of the prior probability of θ in (−1, +1), for values of μ close to one the prior precision is forced to be high. This is not a problem when the "true" value of θ is close to one (see Table 5.5), but can be a problem when θ is close to zero (see Table 5.1). Note that in these tables y₀ = 0.

EXERCISES

1.  Consider an autoregressive process of order one and let Y_{n+1}, Y_{n+2} be two future observations. If the "initial" observation is given and one adopts a normal-gamma prior distribution, the marginal posterior density of θ is given by (5.10) and the predictive density of Y_{n+1} by (5.13). Derive the conditional predictive density of Y_{n+2} given Y_{n+1}, and describe how you should predict Y_{n+2} with this density.
2.  Equation (5.2) is that of an AR(2) process with parameters θ₁, θ₂, and τ. Using a normal-gamma prior density and assuming Y₀ and Y₋₁ are known, find: (a) the marginal posterior density of θ₂, and (b) the predictive density of a future observation Y_{n+1}. Do not assume the process is stationary.
3.  Why is it so difficult to develop the posterior and predictive analysis of a moving average process? Explain carefully.
4.  The section above entitled "The Identification Problem" (pp. 203-207) develops a way to identify the order of an autoregressive process using the marginal posterior mass function of the order parameter p given by equation (5.53). Propose an alternative method by which a Bayesian may identify the order of an autoregressive process.
5.  Verify that equation (5.59) is the marginal posterior density of ρ, the autocorrelation parameter of a regression model (5.55) with autocorrelated errors.
6.  Describe the similarities between the density (5.60) and the univariate t density.
7.  From equation (5.65), show the conditional predictive density of w_t given ρ is a univariate t distribution, and present a way to forecast a future observation.
8.  Given an AR(1) process with a normal-gamma prior distribution and assuming Y₀ is given, find the joint predictive density of Y_{n+1}, Y_{n+2}, ..., Y_{n+k}, where k = 1, 2, ....
9.  Consider a MA(1) process (5.14), Y_t = e_t − φ₁e_{t−1}, where t = 1, 2, ..., n, φ₁ ∈ R, and e₀, e₁, ..., e_n are n.i.d. (0, τ^{-1}), τ > 0. Letting e₀ = 0, develop an approximation to the likelihood function of φ₁ and τ based on Y₁, Y₂, ..., Y_n. Discuss the merits of the approximation and why it should provide an easier way to study moving average processes.
10.  (Continuation of Problem 9.) If e₀ = 0, one may calculate the residuals recursively as e_t = Y_t + φ₁e_{t−1}, t = 1, 2, ..., n, corresponding to the n observations. Thus one may estimate the moving average parameter by finding the value of φ₁, say φ̂₁, which minimizes the residual sum of squares g(φ₁) = Σ_{t=1}^{n} e_t²(φ₁) with respect to φ₁ over R. Using the approximate likelihood function of Problem 9 and a normal-gamma prior density for φ₁ and τ, develop a t-approximation to the marginal posterior distribution of φ₁. Also show the marginal posterior distribution of τ is approximately a gamma.
11.  Referring to Problem 10 above, find a t-approximation to the predictive distribution of a future observation Y_{n+1}.

REFERENCES

Abraham, Bovas and William W. S. Wei (1979). "Inferences in a switching time series," American Statistical Association Proceedings of the Business and Economic Statistics Section, pp. 354-358.
Akaike, H. (1979). "A Bayesian extension of the minimum AIC procedure of autoregressive model building," Biometrika, Vol. 66, No. 2, pp. 237-242.
Ali, M. M. (1977). "Analysis of autoregressive moving average models: estimation and prediction," Biometrika, Vol. 64, pp. 535-545.
Ansley, C. F. (1979). "An algorithm for the exact likelihood of a mixed autoregressive moving average process," Biometrika, Vol. 66, pp. 59-65.
Aoki, Masanao (1967). Optimization of Stochastic Systems, Topics in Discrete Time Systems, Academic Press, New York.
Box, George E. P. and Gwilym M. Jenkins (1970). Time Series Analysis, Forecasting and Control, Holden-Day, San Francisco.
Broemeling, Lyle D. (1977). "Forecasting future values of changing sequences," Communications in Statistics, Theory and Methods, Vol. A6(1), pp. 87-102.
Chatfield, C. (1977). "Some recent developments in time series analysis," Journal of the Royal Statistical Society, Series A, Vol. 140, pp. 492-510.
Chin Choy, J. H. and L. D. Broemeling (1980). "Some Bayesian inferences for a changing linear model," Technometrics, Vol. 22, No. 1, pp. 71-78.
Current Index to Statistics, Applications, Methods, and Theory, Volume 5, 1979 (1980).
Dent, W. T. (1977). "Computation of the exact likelihood function for an ARIMA process," Journal of Statistical Computation and Simulation, Vol. 5, pp. 193-206.
Diaz, Joaquin and Jose Luis Farah (1981). "Bayesian identification of autoregressive processes," 22nd NBER-NSF Seminar on Bayesian Inference in Econometrics.
Granger, C. W. J. and Paul Newbold (1977). Forecasting Economic Time Series, Academic Press, New York.
Haavelmo, T. (1947). "Methods of measuring the marginal propensity to consume," Journal of the American Statistical Association, Vol. 42, pp. 105-122.
Harrison, P. J. and Stevens, C. F. (1976). "Bayesian forecasting" (with discussion), Journal of the Royal Statistical Society, Series B, Vol. 38, pp. 205-247.
Hillmer, Steven C., and George C. Tiao (1979). "Likelihood function of stationary multiple autoregressive moving average models," Journal of the American Statistical Association, Vol. 74, No. 367, pp. 652-660.
Lahiff, Maureen (1980). "Time series forecasting with informative prior distributions," Technical Report No. 111, Department of Statistics, The University of Chicago.
Land, Margaret Foster (1981). Bayesian Forecasting for Switching Regression and Autoregressive Processes, Ph.D. dissertation, Oklahoma State University, Stillwater, Oklahoma.
Lindley, D. V. and A. F. M. Smith (1972). "Bayes estimates for the linear model," Journal of the Royal Statistical Society, Series B, Vol. 34, pp. 1-18.
Litterman, Robert B. (1980). "A Bayesian procedure for forecasting with vector autoregressions," an unpublished paper, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Ljung, G. M. and G. E. P. Box (1976). "Studies in the modeling of discrete time series - maximum likelihood estimation in the autoregressive moving average model," Technical Report No. 476, Department of Statistics, University of Wisconsin, Madison.
Monahan, John F. (1981). "Computations for Bayesian time series analysis," a paper presented at the 141st Annual Meeting of the American Statistical Association, August 12, 1981.
Newbold, Paul (1973). "Bayesian estimation of Box-Jenkins transfer function noise models," Journal of the Royal Statistical Society, Series B, Vol. 35, No. 2, pp. 323-336.
Newbold, Paul (1974). "The exact likelihood function for a mixed autoregressive moving average process," Biometrika, Vol. 61, pp. 423-426.
Nicholls, D. F. and A. D. Hall (1979). "The exact likelihood function of multivariate autoregressive moving average models," Biometrika, Vol. 66, No. 2, pp. 259-264.
Peterka, V. (1981). "Bayesian system identification," Automatica, Vol. 17, No. 1, pp. 41-53.
Poirier, Dale J. (1976). The Econometrics of Structural Change, North-Holland Publishing Company, Amsterdam.
Salazar, D., Broemeling, L., and A. Chi (1981). "Parameter changes in a regression model with autocorrelated errors," Communications in Statistics, Part A - Theory and Methods, Vol. A10, No. 17, pp. 1751-1758.
Smith, J. O. (1979). "A generalization of the Bayesian steady forecasting model," Journal of the Royal Statistical Society, Series B, Vol. 41, No. 3, pp. 375-387.
Tsurumi, H. (1980). "A Bayesian estimation of structural shifts by gradual switching regression with an application to the US gasoline market," in Bayesian Analysis in Econometrics and Statistics: Essays in Honor of Harold Jeffreys, edited by A. Zellner, North-Holland, Amsterdam.
Zellner, A. (1971). An Introduction to Bayesian Inference in Econometrics, John Wiley and Sons, Inc., New York.

6 LINEAR DYNAMIC SYSTEMS

INTRODUCTION

Linear dynamic systems are used by communications and control engineers to monitor and control the state of a system as it evolves through time. To determine whether a system is operating satisfactorily, and later to control the system, one must know the behavior of the system at any time. For example, in the navigation of a submarine, spacecraft, or aircraft, the state of the system is the position and velocity of the craft. The state of the system may be random because physical systems are often subject to random disturbances, and one uses measuring instruments to take observations on the system; but the measurements are often corrupted with noise because of the electrical and mechanical components of the measuring device. Finding the state of the system from noisy measurements is called estimation, and estimation is essential to monitoring the system as it evolves through time. This chapter will give the reader a brief introduction to linear dynamic systems. The engineering literature on this subject is vast and one should consult some of the many books on the subject. For example, Aoki (1967), Maybeck (1979), Jazwinski (1970), Sawaragi et al. (1967), and Harrison and Stevens (1976) provide one with the essentials for understanding linear dynamic systems. The above references are for the most part written by and for engineers, except the Harrison and Stevens paper, which is statistical in nature and emphasizes business and economic forecasting. Maybeck is a good general introduction to estimation (filtering, prediction, smoothing), control, adaptive estimation, and nonlinear estimation. For an unusually excellent account of filtering, the book by Jazwinski is recommended, and


for a thorough treatment of the Bayesian approach to control strategies one should read Aoki. Sawaragi et al. should appeal to statisticians because it shows the relationship between statistical decision theory and adaptive control systems. One will find that Bayesian ideas have played an important role in the development of estimation, identification, and control of linear dynamic systems. Indeed, the Kalman (1960) filter is based on a Bayesian estimation technique and many of the resulting methodologies have a Bayesian foundation; thus it is interesting that the theory and methodology of linear dynamic models is not very familiar to statisticians. Recently, however, statisticians are beginning to study the linear dynamic model and use it with some of the standard statistical problems. For example, Sallas and Harville (1981) estimate the parameters of the mixed model with best recursive estimators, while Downing, Pike, and Morrison (1980) apply the Kalman filter to inventory control. Shumway, Olsen, and Levy (1981) employ mixed model estimation in a linear dynamic model, which in a way is the reverse of what was done by Sallas and Harville. This chapter begins with an explanation of the linear dynamic model and reviews the Kalman filter, then develops the Bayesian analysis for control, adaptive estimation, and nonlinear filtering.

DISCRETE TIME LINEAR DYNAMIC SYSTEMS

Consider an observation equation

Y(t) = F(t)θ(t) + U(t),   t = 1,2, ...   (6.1)

where Y(t) is an n x 1 observation vector, F(t) an n x p nonrandom matrix, θ(t) a p x 1 real unknown parameter vector, and U(t) an n x 1 observation error vector. The observation equation (6.1) looks like a sequence of linear models of the usual type, where F(t) is an n x p design matrix. The parameter θ(t) is called the state of the system at time t, and how the states evolve is given by the system equation

θ(t) = Gθ(t - 1) + V(t),   t = 1,2, ...   (6.2)

where G is a p x p nonrandom system transition matrix, θ(0) is the initial state of the system, and V(t) is a p x 1 system error vector. Some measuring device records at each time point Y(t), which indirectly measures the state of the system at time t and is linearly related to the actual state θ(t); but since the signal is corrupted by noise, the additive error term U(t) is put into the observation equation. The system equation is a first-order autoregressive process, but of course other stochastic processes could represent the states of the system.
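As a concrete illustration, the following sketch simulates data from the system (6.1)-(6.2). The dimensions, the matrices F and G, and the noise precisions below are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def simulate_lds(F, G, P_u, P_v, m0, P0, k, rng=None):
    """Simulate k steps of the linear dynamic system (6.1)-(6.2).

    F : (n, p) observation matrix (held fixed over time here)
    G : (p, p) system transition matrix
    P_u, P_v : observation and system error precision matrices
    m0, P0 : mean and precision of the initial state theta(0)
    Returns the simulated states theta(1..k) and observations Y(1..k).
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = rng.multivariate_normal(m0, np.linalg.inv(P0))      # theta(0)
    thetas, ys = [], []
    for _ in range(k):
        v = rng.multivariate_normal(np.zeros(G.shape[0]), np.linalg.inv(P_v))
        theta = G @ theta + v                                   # system equation (6.2)
        u = rng.multivariate_normal(np.zeros(F.shape[0]), np.linalg.inv(P_u))
        y = F @ theta + u                                       # observation equation (6.1)
        thetas.append(theta)
        ys.append(y)
    return np.array(thetas), np.array(ys)

# illustrative two-dimensional state, scalar observation
F = np.array([[1.0, 0.0]])
G = np.array([[0.9, 1.0], [0.0, 0.9]])
states, obs = simulate_lds(F, G, P_u=np.eye(1) * 4.0, P_v=np.eye(2) * 10.0,
                           m0=np.zeros(2), P0=np.eye(2), k=50)
```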

ESTIMATION

Estimation is the activity of estimating the states θ(t) of the system as observations Y(1), Y(2), ..., become available, and each observation adds one parameter to be estimated. Now suppose one has k observations Y(1), Y(2), ..., Y(k); then there are k parameters θ(1), θ(2), ..., θ(k) to be estimated. Filtering is the estimation of the current state θ(k), and in the next section the Kalman filter will be explained. Smoothing is the estimation of the previous states θ(1), θ(2), ..., θ(k - 1), and prediction is the forecasting of the future states θ(k + 1), θ(k + 2), .... As we will see, prediction is important when one wants to control a future state, say θ(k + 1), at some known target value T(k + 1). The next three sections explain filtering, smoothing, and prediction.

Filtering

The Kalman filter is a recursive algorithm to estimate the current state of the system at each time point. Suppose F(t) and G are known, that θ(0) is random with known mean m(0) and known precision matrix P(0), that U(t) is normal with mean zero and known precision matrix P_u(t), that V(t) is normal with mean zero and known precision matrix P_v(t), and assume θ(0), U(1), U(2), ..., V(1), V(2), ..., are stochastically independent. Then given Y(1), Y(2), ..., Y(k), the posterior distribution of θ(k) is normal with mean m(k) and precision matrix P(k), where

m(k) = P^{-1}(k){F'(k)P_u(k)Y(k) + P_v(k)G[G'P_v(k)G + P(k - 1)]^{-1} P(k - 1)m(k - 1)}   (6.3)

and

P(k) = F'(k)P_u(k)F(k) + [GP^{-1}(k - 1)G' + P_v^{-1}(k)]^{-1},   k = 1,2, ...   (6.4)

Thus the current state θ(k) is recursively estimated by m(k), and P(k) is expressed as a function of the current observation Y(k), the known current parameters P_u(k) and P_v(k), and the parameters m(k - 1) and P(k - 1) of the posterior distribution of the previous state θ(k - 1).


This is summarized by the following theorem.

Theorem 6.1 (The Kalman Filter). Consider the linear dynamic system

Y(t) = F(t)θ(t) + U(t),
θ(t) = Gθ(t - 1) + V(t),   t = 1,2, ..., k,

and suppose for each k, k = 1,2, ..., that F(t) and G are known and that θ(0), U(1), ..., U(k), V(1), V(2), ..., V(k) are stochastically independent. Let U(t) ~ N[0, P_u^{-1}(t)] and V(t) ~ N[0, P_v^{-1}(t)] for t = 1,2, ..., k, and θ(0) ~ N[m(0), P^{-1}(0)]. Then the marginal posterior distribution of θ(k) is normal with mean m(k) and precision matrix P(k), where m(k) and P(k) are given by (6.3) and (6.4), respectively.

To prove this theorem, note the prior distribution of θ(k) is induced by the posterior distribution of θ(k - 1) via the system equation (6.2), and at each stage only p x p matrices need to be inverted. Plackett (1950) gave an earlier version of these formulas, but Kalman's main contribution is the recursive nature of the algorithm, (6.3) and (6.4), which provides one with an efficient computational procedure.
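A minimal sketch of the recursion (6.3)-(6.4) in precision form is given below; the function names and the use of NumPy are illustrative assumptions, and linear solves are used in place of some explicit inverses.

```python
import numpy as np

def kalman_step(m_prev, P_prev, y, F, G, P_u, P_v):
    """One step of the Kalman filter in the precision form (6.3)-(6.4).

    m_prev, P_prev : posterior mean and precision of theta(k-1)
    y              : current observation Y(k)
    Returns the posterior mean and precision of theta(k).
    """
    # prior precision of theta(k): [G P^{-1}(k-1) G' + P_v^{-1}(k)]^{-1}
    P_star = np.linalg.inv(G @ np.linalg.inv(P_prev) @ G.T + np.linalg.inv(P_v))

    # posterior precision, equation (6.4)
    P_k = F.T @ P_u @ F + P_star

    # posterior mean, equation (6.3); the bracketed factor equals P_star @ G @ m_prev
    gain = P_v @ G @ np.linalg.solve(G.T @ P_v @ G + P_prev, P_prev @ m_prev)
    m_k = np.linalg.solve(P_k, F.T @ P_u @ y + gain)
    return m_k, P_k

def kalman_filter(ys, m0, P0, F, G, P_u, P_v):
    """Run the filter over a sequence of observations."""
    m, P = m0, P0
    means = []
    for y in ys:
        m, P = kalman_step(m, P, y, F, G, P_u, P_v)
        means.append(m)
    return np.array(means)
```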

Smoothing

Suppose the assumptions of Theorem 6.1 are satisfied; then how should the previous states θ(0), θ(1), ..., θ(k - 1), given k observations Y(1), Y(2), ..., Y(k), be estimated? To a Bayesian, one should estimate these parameters from the posterior distribution of θ^(k-1) = [θ'(0), θ'(1), ..., θ'(k - 1)]', which is easily derived from the posterior distribution of θ^(k), namely the conditional distribution of θ(0), ..., θ(k), given Y^(k) = [Y'(1), ..., Y'(k)]'. In this way, all of the previous observations are available to estimate the previous states.

First, consider the posterior distribution of θ^(k), so suppose the assumptions of Theorem 6.1 hold; then the joint density of Y^(k) and θ^(k) is the product π_1[Y^(k)|θ^(k)]π_2[θ^(k)],


where Y^(k) ∈ R^{nk}, θ^(k) ∈ R^{p(k+1)}, and

π_1[Y^(k)|θ^(k)] ∝ exp{- (1/2) Σ_{t=1}^{k} [Y(t) - F(t)θ(t)]' P_u(t) [Y(t) - F(t)θ(t)]}   (6.6)

and

π_2[θ^(k)] ∝ exp{- (1/2) Σ_{t=1}^{k} [θ(t) - Gθ(t - 1)]' P_v(t) [θ(t) - Gθ(t - 1)]} exp{- (1/2) [θ(0) - m(0)]' P(0) [θ(0) - m(0)]}.   (6.7)

Combining the exponents of π_1 and π_2, the resulting exponent is quadratic in the k + 1 states θ(0), ..., θ(k); thus it is the density of a multivariate normal distribution, and the mean vector and precision matrix need to be identified.

It can be shown that θ^(k) has density

π[θ^(k)|Y^(k)] ∝ exp{- (1/2) [θ^(k) - A^{-1}B]' A [θ^(k) - A^{-1}B]}   (6.8)

where A is p(k + 1) x p(k + 1), symmetric, and patterned as follows:

A = | A_11  A_12  A_13 |
    | A_21  A_22  A_23 |
    | A_31  A_32  A_33 |

Here A_13 = A'_31 = 0 (p x p), A_22 is quasi-tridiagonal of order (k - 1)p, the i-th diagonal block is F'(i)P_u(i)F(i) + P_v(i) + G'P_v(i + 1)G for i = 1,2, ..., k - 1, and the super- and subdiagonal blocks are -G'P_v(i), i = 2, ..., k - 1. A_23 is (k - 1)p x p and

A_23 = | 0 (p x p) |
       | 0 (p x p) |
       |    ...    |
       | -G'P_v(k) |

and

B = | P(0)m(0)        |
    | F'(1)P_u(1)Y(1) |
    | F'(2)P_u(2)Y(2) |
    |      ...        |
    | F'(k)P_u(k)Y(k) |

is a (k + 1)p x 1 vector. The scalar C is given by

C = Σ_{t=1}^{k} Y'(t)P_u(t)Y(t) + m'(0)P(0)m(0),

thus θ^(k) has a normal distribution with mean vector A^{-1}B = μ and precision matrix A = T.

How does one estimate the previous states θ(0), θ(1), ..., θ(k - 1)? Obviously all inferences should be based on the marginal distribution of these states; thus let

θ* = [I, 0] θ^(k)   (6.9)

where 0 is a pk x p zero matrix and I is the identity matrix of order pk. Then θ* is the pk x 1 vector θ^(k-1) of the previous states, i.e., those previous to θ(k). From (6.8) and (6.9), θ^(k-1) has a normal distribution with mean [I, 0]μ and precision matrix {[I, 0] T^{-1} [I, 0]'}^{-1}, and in a similar way any subvector of θ^(k-1) has a normal distribution.

The previous results are summarized as follows by


Theorem 6.2. If the assumptions of Theorem 6.1 are true and one has k observations Y^(k) = [Y'(1), Y'(2), ..., Y'(k)]', then the conditional distribution of the previous states θ^(k-1) is normal with mean [I, 0]μ and precision matrix {[I, 0] T^{-1} [I, 0]'}^{-1}, where μ and T are the mean vector and precision matrix of θ^(k), which has density given by (6.8).

It should be noted that the posterior distribution of θ(i), 1 ≤ i ≤ k - 1, given by Theorem 6.2 is not the same as that given by the Kalman filter, because Theorem 6.1 gives the posterior distribution of θ(i) based on the i observations Y(1), Y(2), ..., Y(i), whereas Theorem 6.2 provides the conditional distribution of θ(i), 0 ≤ i ≤ k - 1, based on all k observations. Smoothing is a way to improve one's estimates of the previous states beyond those given by the Kalman filter and is a post-mortem operation.
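The joint-precision construction behind Theorem 6.2 can be carried out directly for small k; the sketch below builds the joint precision matrix T and the vector B for a time-invariant system and reads off the smoothed means as T^{-1}B. Treating F, G, and the precisions as constant over time is an assumption made only to keep the example short.

```python
import numpy as np

def smooth_all_states(ys, m0, P0, F, G, P_u, P_v):
    """Smoothed (joint posterior) means of theta(0),...,theta(k) per Theorem 6.2.

    Builds the p(k+1) x p(k+1) joint precision matrix T and the vector B,
    then returns the joint posterior mean T^{-1} B, one state per row.
    """
    k, p = len(ys), G.shape[0]
    T = np.zeros((p * (k + 1), p * (k + 1)))
    B = np.zeros(p * (k + 1))

    # prior contribution from theta(0)
    T[:p, :p] += P0
    B[:p] += P0 @ m0

    for t in range(1, k + 1):
        i, j = p * (t - 1), p * t            # blocks for theta(t-1), theta(t)
        # system equation theta(t) = G theta(t-1) + V(t)
        T[j:j + p, j:j + p] += P_v
        T[i:i + p, i:i + p] += G.T @ P_v @ G
        T[i:i + p, j:j + p] += -G.T @ P_v
        T[j:j + p, i:i + p] += -P_v @ G
        # observation equation Y(t) = F theta(t) + U(t)
        T[j:j + p, j:j + p] += F.T @ P_u @ F
        B[j:j + p] += F.T @ P_u @ ys[t - 1]

    mu = np.linalg.solve(T, B)               # joint posterior mean A^{-1} B
    return mu.reshape(k + 1, p)
```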

Prediction

Suppose one has available k observations Y(1), Y(2), ..., Y(k) and we want to forecast θ(k + 1) before observing Y(k + 1); then the appropriate distribution with which to forecast θ(k + 1) is the prior (prior to Y(k + 1)) distribution of θ(k + 1). The prior distribution of θ(k + 1) is induced by the posterior distribution of θ(k), where

θ(k + 1) = Gθ(k) + V(k + 1),

but the posterior distribution of θ(k) is given by Theorem 6.1 (the Kalman filter), where m(k) is the mean of θ(k) and P(k) the precision matrix of θ(k); thus, by Bayes' theorem, we have

Theorem 6.3. Under the assumptions of Theorem 6.1, the prior distribution of θ(k + 1) is normal with mean

m*(k + 1) = P*^{-1}(k + 1){P_v(k + 1)G[G'P_v(k + 1)G + P(k)]^{-1} P(k)m(k)}   (6.10)

and precision matrix

P*(k + 1) = [GP^{-1}(k)G' + P_v^{-1}(k + 1)]^{-1}.   (6.11)

Note that m*(k + 1) is a function of the first k observations because m(k) is a function of these observations.


By repeated application of this theorem one may predict θ(k + 2), θ(k + 3), and so on, since the prior distribution of θ(k + 2) is induced by the prior distribution of θ(k + 1) via the system equation θ(k + 2) = Gθ(k + 1) + V(k + 2). Thus, the prior distribution of θ(k + 2) is normal with mean m*(k + 2) and precision matrix P*(k + 2), where

m*(k + 2) = P*^{-1}(k + 2){P_v(k + 2)G[G'P_v(k + 2)G + P*(k + 1)]^{-1} P*(k + 1)m*(k + 1)}   (6.12)

and

P*(k + 2) = [GP*^{-1}(k + 1)G' + P_v^{-1}(k + 2)]^{-1}.   (6.13)

Again m*(k + 2) is a function of the first k observations, because m*(k + 2) is a function of m*(k + 1), which is a function of m(k) (see Theorem 6.3), which is the mean of the posterior distribution of θ(k).

Prediction in the context of linear dynamic systems means something other than what prediction means in time series analysis. In time series, to predict means to forecast future observations (see Chapter 5), while with linear dynamic systems, to predict means to forecast the future states of the system.

In the next section on the control problem, the predictive distribution of future observations will be employed in the strategy of choosing control variables.
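As an illustrative sketch, the recursion (6.10)-(6.13) can be iterated to forecast several states ahead from the current filtered moments; the function below is an assumption-level example, not code from the text.

```python
import numpy as np

def predict_states(m_k, P_k, G, P_v, steps):
    """Forecast theta(k+1),...,theta(k+steps) from the posterior of theta(k).

    Applies (6.10)-(6.13): each prior mean and precision is induced from the
    previous one through the system equation theta(t) = G theta(t-1) + V(t).
    """
    m, P = m_k, P_k
    forecasts = []
    for _ in range(steps):
        # prior precision, equations (6.11)/(6.13)
        P_next = np.linalg.inv(G @ np.linalg.inv(P) @ G.T + np.linalg.inv(P_v))
        # prior mean, equations (6.10)/(6.12); algebraically this equals G @ m
        m_next = np.linalg.solve(
            P_next, P_v @ G @ np.linalg.solve(G.T @ P_v @ G + P, P @ m))
        forecasts.append((m_next, P_next))
        m, P = m_next, P_next
    return forecasts
```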

CONTROL STRATEGIES

This section gives only a brief introduction to the control of linear dynamic systems (6.1) and (6.2). For a more thorough treatment of control problems, Aoki (1967) provides one with a Bayesian analysis of various strategies to control the state of a physical system.

Consider the linear dynamic system with observation equation

Y(t) = F(t)θ(t) + U(t)   (6.1)

and with system equation

θ(t) = Gθ(t - 1) + H(t)X(t - 1) + V(t),   (6.14)

where the term H(t)X(t - 1) has been added to the system equation (6.2), H(t) is a known p x q matrix, and X(t - 1) is a q x 1 vector of control variables; that is, one can set these vectors at various values to control the process.

The control problem is: how does one choose X(0), X(1), ..., in such a way that the corresponding states θ(1), θ(2), ..., are close to preassigned target values T(1), T(2), ...? Initially, one has yet to observe Y(1), and X(0) must be chosen in order for θ(1) to be "close" to T(1); then, having observed Y(1) = y(1) but not yet Y(2), X(1) is chosen so that θ(2) is "close" to T(2), and so on. Choosing a set of values x(0), x(1), ..., for the control variables X(0), X(1), ..., is called a strategy, and one's strategy depends on how one measures the closeness of θ(i) to T(i), i = 1,2, .... The strategy presented here is based on the predictive distribution of the posterior means and is just one of many criteria that could have been used. Aoki (1967) gives many control strategies, and the reader should refer to him and to Maybeck (1979) for more detailed accounts of the control problem.

The strategy here is to choose the control variables X(0), X(1), ..., in such a way that at the first stage

T(1) = E_1 E[θ(1)|Y(1), X(0)],   (6.15)

at the second stage

T(2) = E_2 E[θ(2)|y(1), Y(2), X(0), X(1)],   (6.16)

and in general

T(i) = E_i E[θ(i)|y(1), ..., y(i - 1), Y(i), X(0), X(1), ..., X(i - 2), X(i - 1)]   (6.17)

for i = 1,2, ..., where E_i denotes expectation with respect to the conditional predictive distribution of Y(i) given Y(j) = y(j), j = 1,2, ..., i - 1, y(j) is the observed value of Y(j), and X(i - 1) is chosen so that T(i) matches the expected posterior mean of θ(i), in the sense of (6.17). Thus, X(0) is chosen so that the average value of the posterior mean of θ(1) equals the target value T(1), the average being taken with respect to the predictive distribution of Y(1) (which has yet to be observed). At the second stage, X(1) is chosen before Y(2) is observed but after Y(1) is observed to be y(1), in such a way that the target value T(2) equals the average value of the posterior mean of θ(2). Because Y(1) = y(1) and Y(2) is yet to be observed, the average E_2 is taken with respect to the conditional predictive distribution of

Y(2) given Y(1) = y(1), and so on. Solving the system (6.17) with respect to X(0), X(1), ..., gives the solution x(0), x(1), ..., for the control strategy.

Let us return to the initial stage of the control problem, where X(0) must be chosen; then one must be able first to find the posterior mean of θ(1), and then the predictive distribution of Y(1), for the linear dynamic system (6.1) and (6.14). To do this, one must stipulate assumptions comparable to those given in Theorem 6.1; thus suppose θ(0) ~ N[m(0), P^{-1}(0)], U(t) ~ N[0, P_u^{-1}(t)], V(t) ~ N[0, P_v^{-1}(t)], that θ(0); U(1), U(2), ...; V(1), V(2), ..., are independent, and that F(t), H(t), P_u(t), and P_v(t) are known for t = 1,2, ..., and that G is also known. Now under these assumptions, how should X(0) be chosen so that (6.15) is satisfied? First, it can be shown that the posterior mean of θ(1) is

E[θ(1)|Y(1), X(0)] = P^{-1}(1){[GP^{-1}(0)G' + P_v^{-1}(1)]^{-1} H(1)X(0) + F'(1)P_u(1)Y(1) + P_v(1)G[P(0) + G'P_v(1)G]^{-1} P(0)m(0)},   (6.18)

where

P(1) = F'(1)P_u(1)F(1) + [GP^{-1}(0)G' + P_v^{-1}(1)]^{-1}   (6.19)

and is the posterior precision matrix of θ(1). It can also be shown that the predictive distribution of Y(1) is normal with mean

E[Y(1)] = Q^{-1}(1)P_u(1)F(1)P^{-1}(1){[GP^{-1}(0)G' + P_v^{-1}(1)]^{-1} H(1)X(0) + P_v(1)G[P(0) + G'P_v(1)G]^{-1} P(0)m(0)}   (6.20)

and precision matrix

Q(1) = P_u(1) - P_u(1)F(1)P^{-1}(1)F'(1)P_u(1).   (6.21)

Both the predictive mean E[Y(1)] and the posterior mean E[θ(1)|Y(1), X(0)] may be derived from the joint distribution of Y(1), θ(1), and θ(0), which, from the assumptions given above, has density

π[Y(1), θ(1), θ(0)] = π_1[Y(1)|θ(1)] π_2[θ(1)|θ(0)] π_3[θ(0)],   (6.22)

where

π_1[Y(1)|θ(1)] ∝ exp{- (1/2) [Y(1) - F(1)θ(1)]' P_u(1) [Y(1) - F(1)θ(1)]},   Y(1) ∈ R^n,

π_2[θ(1)|θ(0)] ∝ exp{- (1/2) [θ(1) - Gθ(0) - H(1)X(0)]' P_v(1) [θ(1) - Gθ(0) - H(1)X(0)]},   θ(1) ∈ R^p,

and

π_3[θ(0)] ∝ exp{- (1/2) [θ(0) - m(0)]' P(0) [θ(0) - m(0)]},   θ(0) ∈ R^p.

If one eliminates θ(1) and θ(0) from (6.22), one has a normal predictive distribution for Y(1); if instead one integrates (6.22) with respect to θ(0), the result is the posterior density of θ(1). The posterior mean of θ(1), (6.18), is a linear function of Y(1) and X(0); thus let

E[θ(1)|Y(1), X(0)] = AX(0) + BY(1) + C   (6.23)

where

A = P^{-1}(1)[GP^{-1}(0)G' + P_v^{-1}(1)]^{-1} H(1),
B = P^{-1}(1)F'(1)P_u(1),

and

C = P^{-1}(1)P_v(1)G[P(0) + G'P_v(1)G]^{-1} P(0)m(0).

Also, E[Y(1)] is a linear function of X(0); thus let

E[Y(1)] = DX(0) + E,   (6.24)

where

D = Q^{-1}(1)P_u(1)F(1)P^{-1}(1)[GP^{-1}(0)G' + P_v^{-1}(1)]^{-1} H(1)

and

E = Q^{-1}(1)P_u(1)F(1)P^{-1}(1)P_v(1)G[P(0) + G'P_v(1)G]^{-1} P(0)m(0).

Now from (6.15), T(1) = (A + BD)X(0) + BE + C, and the control vector X(0) should be set at

X(0) = {(A + BD)'(A + BD)}^{-1}(A + BD)'[T(1) - BE - C],   (6.25)

which solves the problem at the first stage; but how does the process continue? At the second stage one must choose X(1) so that T(2) is close to θ(2) in the sense of (6.16). Thus we need to know the conditional predictive distribution of Y(2) given Y(1) = y(1) and the posterior mean of θ(2), and then solve equation (6.16) for X(1). In general, referring to equation (6.17), we need to know the conditional predictive distribution of Y(i) given Y(j) = y(j), j = 1,2, ..., i - 1, and the posterior mean of θ(i), and then solve equation (6.17) for X(i - 1). The general solution is given by the following theorem.

Theorem 6.4. Consider the linear dynamic system (6.1) and (6.14), where θ(0) ~ N[m(0), P^{-1}(0)], U(t) ~ N[0, P_u^{-1}(t)], and V(t) ~ N[0, P_v^{-1}(t)]. Assume for t = 1,2, ..., that G, F(t), H(t), P_u(t), P_v(t), m(0), and P(0) are known and that θ(0); U(1), U(2), ...; V(1), V(2), ..., are independent. Then at the i-th stage of control, the value X(i - 1) which satisfies

T(i) = E_i E[θ(i)|y(1), y(2), ..., y(i - 1), Y(i), X(0), X(1), ..., X(i - 2), X(i - 1)]   (6.17)

is given by

X(i - 1) = {[A(i) + B(i)D(i)]'[A(i) + B(i)D(i)]}^{-1}[A(i) + B(i)D(i)]'[T(i) - B(i)E(i) - C(i)]   (6.26)

where

A(i) = P^{-1}(i)[GP^{-1}(i - 1)G' + P_v^{-1}(i)]^{-1} H(i),
B(i) = P^{-1}(i)F'(i)P_u(i),
C(i) = P^{-1}(i)P_v(i)G[P(i - 1) + G'P_v(i)G]^{-1} P(i - 1)m(i - 1),
D(i) = Q^{-1}(i)P_u(i)F(i)P^{-1}(i)[GP^{-1}(i - 1)G' + P_v^{-1}(i)]^{-1} H(i),
E(i) = Q^{-1}(i)P_u(i)F(i)P^{-1}(i)P_v(i)G[P(i - 1) + G'P_v(i)G]^{-1} P(i - 1)m(i - 1),
P(i) = F'(i)P_u(i)F(i) + [GP^{-1}(i - 1)G' + P_v^{-1}(i)]^{-1},
Q(i) = P_u(i) - P_u(i)F(i)P^{-1}(i)F'(i)P_u(i),
m(i) = P^{-1}(i){[GP^{-1}(i - 1)G' + P_v^{-1}(i)]^{-1} H(i)X(i - 1) + F'(i)P_u(i)Y(i) + P_v(i)G[P(i - 1) + G'P_v(i)G]^{-1} P(i - 1)m(i - 1)}.

Here m(i) and P(i) are the mean vector and precision matrix, respectively, of the posterior distribution of θ(i), and Q(i) is the precision matrix of the conditional predictive distribution of Y(i) given Y(j) = y(j), j = 1,2, ..., i - 1.

Proof. The posterior distribution of θ(i) is normal with mean vector

m(i) = P^{-1}(i){[GP^{-1}(i - 1)G' + P_v^{-1}(i)]^{-1} H(i)X(i - 1) + F'(i)P_u(i)Y(i) + P_v(i)G[P(i - 1) + G'P_v(i)G]^{-1} P(i - 1)m(i - 1)}   (6.27)

and precision matrix

P(i) = F'(i)P_u(i)F(i) + [GP^{-1}(i - 1)G' + P_v^{-1}(i)]^{-1},   (6.28)

for i = 1,2, .... This is a recursive formula for the posterior mean and precision matrix of θ(i) and is similar to that given by the Kalman filter of Theorem 6.1; hence the posterior mean of θ(i) is expressed as a linear function of the current observation Y(i), the known parameters G, F(i), H(i), P_u(i), and P_v(i) at the i-th stage, and the previous posterior mean m(i - 1) and precision matrix P(i - 1).

Next, what is the predictive distribution of Y(i) given Y(j) = y(j) for j = 1,2, ..., i - 1? It can be shown that this conditional distribution is normal with mean Q^{-1}(i)R(i) and precision matrix Q(i), where

Q(i) = P_u(i) - P_u(i)F(i)P^{-1}(i)F'(i)P_u(i)   (6.29)

and

R(i) = P_u(i)F(i)P^{-1}(i){[GP^{-1}(i - 1)G' + P_v^{-1}(i)]^{-1} H(i)X(i - 1) + P_v(i)G[P(i - 1) + G'P_v(i)G]^{-1} P(i - 1)m(i - 1)},   (6.30)

and P(i - 1) is the posterior precision matrix of θ(i - 1), given by formula (6.28), while (6.27) gives the posterior mean m(i - 1) of θ(i - 1). Since m(i) is a linear function of Y(i) and X(i - 1), and because Q^{-1}(i)R(i) is a linear function of X(i - 1), X(i - 1) is given as above by (6.26).

The above theorem shows one how to compute the vectors X(0), X(1), ..., in order to control the states of the system at target values T(1), T(2), .... The criterion (6.17) was used in developing this strategy; however, other criteria could have been employed. For example, one could use a loss function L_i{T(i), E[θ(i)]} at each stage and choose X(i - 1) in such a way as to minimize the expected loss with respect to the conditional predictive distribution of Y(i) given Y(j) = y(j), j = 1,2, ..., i - 1.

Theorem 6.4 is a single-period control strategy; that is, only one-step-ahead predictions are involved. If one is interested in multi-period control strategies, there are several ways to devise control algorithms. For example, Zellner (1971) develops "here and now," sequential updating, and adaptive control strategies for multi-period situations involving multiple linear regression models. These methods are easily adapted to linear dynamic systems.

AN EXAMPLE

The preceding sections have introduced the reader to the analysis of linear dynamic systems. Estimation, including smoothing, filtering, and prediction, was considered, and a single-period control algorithm was devised in order to control the states of the system. These Bayesian methodologies will be illustrated with scalar linear dynamic models. Let

Y(t) = 3θ(t) + U(t),
θ(t) = .5θ(t - 1) + V(t),   t = 1,2, ..., 50,   (6.31)

where U(t) ~ n(0, .2), V(t) ~ n(0, 1), and θ(0) ~ n(2, .7), and where θ(0); U(1), ..., U(50); V(1), V(2), ..., V(50), are independent random variables. In terms of the general model, F(t) = 3, G = .5, P_u(t) = .5, P_v(t) = 1, m(0) = .2, and P(0) = (.7)^{-1}.

Fifty observations were generated from the model (6.31), and Table 6.1 consists of six columns: (1) the U column has the fifty values of the observation errors; (2) the fifty system errors are shown in the V column; (3) the Theta column lists the fifty states, where the initial state θ(0) = 2.9944 is given in the first row; (4) the Kalman filter estimates, (6.3), are listed in the m column; (5) the Y column gives the fifty observations (beginning in row 2); and (6) the P column lists the precision of the posterior distribution of the current states, calculated from formula (6.4). Since the design matrices and the two precision matrices are constants (scalars), the precision of the estimates reaches the level 18.9870 after three steps and stays at that level (to four decimal places). For the most part, the Kalman filter estimates imitate the behavior of the states with respect to sign; however, since the precision of each estimate is relatively low (18.9870), it is difficult to say that m(k) is a "good" estimate of θ(k).

[Table 6.1. Kalman Filter Estimates of θ: columns U, V, Theta, m, Y, and P, as described above, for the fifty simulated observations; numerical entries not reproduced here.]

The next example considers the control problem with the scalar model

Y(t) = 3θ(t) + U(t),
θ(t) = .5θ(t - 1) + 2X(t - 1) + V(t),   (6.32)

where t = 1,2, . . . , 50, where Y ( t ) , 0( t ) , U ( t ) , X (t ) and V (t ) are scalar and U (t ) - n C O . P ^ ), V (t ) - n ( 0 , P ^ 1) , and 6(0) - n ^ , . ? ' 1) . Suppose the states of the system are to be controlled at the target values T (t ) = - . 7 , t = 1,2, . . . , 50, in such a way that the future mean of each state is equal to the target value. This is the criterion (6.17) of the preceding section. Now using the formulas (6.26) of Theorem 6.4, fifty observations were generated from the model (6.32) under three conditions: Experiment One: Experiment Two: Experiment Three:

= Pv = 1 Pu = Pv = 10 P^ = Pv = 100

(6.33) (6.34) (6.35)

The observations o f (6.32) were generated as follows: (i ) (ii)

6( 0 ) is selected from a n ( . 2 , .7 *) population, X (0 ) is computed from (6 .2 6 ),

(iii)

V ( l ) is selected from a n (0 , P ^ ) population,

(i v )

0 ( 1 ) is computed from (6 .3 2 ),

(v )

U ( l ) is selected from a n ( 0 , P ^ ) population,

(v i) (v ii)

and Y ( l ) is computed from (6 .3 2 ). The first six steps are repeated 49 times to give fifty observa­ tions and values of the control variable.

AN EXAMPLE

263

c

D

E

-0.035434 -0.035803 -0.033477 -0.031356 -0.031997 -0.033901 -0.037399 -0.031889 -0.031423 -0.033248

6 6 6 6 6 6 6 6 6 6

-1.0870 -1.0983 -1.0270 -0.9619 -0.9816 -1.0400 -1.1473 -0 . 9783 -0.9640 -1.0199

-0.16695 -0.17884 -0.18968 -0.18641 -0.17667 -0.15879 -0.18696 -0.18934 -0.18001 -0.17062

-0.72732 -0.63440 -0.67078 -0.68313 - 0.68688 -0.74127 -0.70172 -0.60147 -0.65361 -0.75086

-2.2071 -2.0489 -1.9047 -1.9483 -2.0777 -2.3156 -1.9410 -1.9093 -2.0334 -2.1583

-0.73220 -0.68464 -0.64126 -0.65437 -0.69331 -0.76485 -0.65218 -0.64263 -0.67996 -0.71752

-0.035085

6

-1.0763

0.00000

-0.71399

-2.1808

-0.72430

Z

X

Y

m

Tables 6.2, 6.3, and 6.4 give the results of the three experiments. These tables are organized into several columns. The X, A, B, C, D, E, P, and Q columns are the values needed to compute the control variable X according to formula (6.26), while the m column is the Kalman filter estimate of the θ column, which lists the fifty states (beginning in row 2) of the system. All the relevant formulas are found in Theorem 6.4.

[Table 6.2. Controlling the States of a Dynamic Model (P_u = 1, P_v = 1); numerical entries not reproduced here.]

[Table 6.3. Controlling the States of a Dynamic Model (P_u = 10, P_v = 10); numerical entries not reproduced here.]

[Table 6.4. Controlling the States of a Dynamic Model (P_u = 100, P_v = 100); numerical entries not reproduced here.]

For example, from Table 6.2,

m(1) = -0.1448   (6.36)

is the Kalman filter estimate of θ(1) = -0.0426, and the precision of the estimate is P(1) = 9.73684. From Table 6.3 (P_u = P_v = 10), row 2,

m(1) = -1.3839   (6.37)

is the Kalman filter estimate of θ(1) = -1.3917, and the precision of the estimate is P(1) = 92.1875. The precision of the Kalman filter estimates increases by a factor of about ten because the observation and system precisions have been increased by a factor of ten, namely from one to ten. When the precision is increased to P_u = P_v = 100,

m(1) = -0.50275   (6.38)

is the Kalman filter estimate of θ(1) = -0.492, found in row 2 of Table 6.4, and the precision of the estimate is 902.724.

[Figure 6.1. Controlling the states of a dynamic model (P_u = 1, P_v = 1): plot of the controlled states against time; data points not reproduced here.]

[Figure 6.2. Controlling the states of a dynamic model (P_u = 10, P_v = 10).]

[Figure 6.3. Controlling the states of a dynamic model (P_u = 100, P_v = 100).]
NONLINEAR DYNAMIC SYSTEMS

267

The objective of this study was to see the effect of the observation and system precisions on the control of the state vectors, which were supposed to be close to -0.7. From the θ columns of the tables and from Figure 6.3, we see the θ values become close to the target value -0.7 when P_u and P_v are one hundred. That this should be the case is evident from the formulas of Theorem 6.4. Figures 6.1 and 6.2 demonstrate that θ is much more difficult to control when the system and observation precisions are small, as compared to Figure 6.3, where the precisions are each one hundred.

NONLINEAR DYNAMIC SYSTEMS

Consider the dynamic system

Y(t) = F_t[θ(t)] + U(t),   t = 1,2, ...   (6.39)
θ(t) = G_t[θ(t - 1)] + V(t),   (6.40)

where F_t is some function of the state vector θ(t) and G_t is some function of θ(t - 1), the previous state. As before, assume F_t, G_t, P_u(t), P_v(t), M(0), and P(0) are known, where P_u(t) is the precision matrix of U(t), P_v(t) is the precision matrix of V(t), M(0) is the mean of θ(0), and P(0) is the precision matrix of θ(0). Suppose Y(t) is the n x 1 observation vector, θ(t) a real p x 1 state vector, F_t: R^p → R^n, G_t: R^p → R^p, U(t) ~ N[0, P_u^{-1}(t)], V(t) ~ N[0, P_v^{-1}(t)], and that θ(0); U(1), U(2), ...; V(1), V(2), ..., are independent.

When F_t[θ(t)] = F(t)θ(t) and G_t[θ(t - 1)] = Gθ(t - 1), we have the usual linear dynamic system, with (6.39) as the observation equation with design matrix F(t) and G as the transition matrix of the system equation (6.40). Obviously, the system equation (6.40) can be amended to include a nonlinear function of control variables. There are many versions of the nonlinear dynamic system, and one should refer to Jazwinski (1971) for a detailed account of nonlinear filters.

If F_t and/or G_t are nonlinear functions of θ(t) and θ(t - 1), respectively, difficulties are encountered in determining the posterior distribution of θ(k), given k observations Y(1), Y(2), ..., Y(k), for k = 1,2, .... These troubles are not new to statisticians


who have worked with nonlinear regression problems. Except for some special nonlinear functions F_t and G_t, the Kalman filter formulas are no longer applicable, and one must devise some new methodology for the nonlinear case.

Suppose the function of the system equation is linear in θ(t - 1), that is, G_t[θ(t - 1)] = Gθ(t - 1), where G is a p x p transition matrix; then

θ(t) = Gθ(t - 1) + V(t)   (6.41)

is the system equation. Consider estimation of θ(1). The prior distribution of θ(1) is induced by the prior distribution of θ(0), which is N[M(0), P^{-1}(0)]. It is obvious the prior density of θ(1) is normal with mean M(1) and precision P(1), where

M(1) = [P_v^{-1}(1) + GP^{-1}(0)G']P_v(1)G[P(0) + G'P_v(1)G]^{-1} P(0)M(0)   (6.42)

and

P(1) = [GP^{-1}(0)G' + P_v^{-1}(1)]^{-1}.   (6.43)

What is the posterior distribution of θ(1)? From the observation equation with t = 1, the conditional distribution of Y(1) given θ(1) and P_u(1) is normal with mean F_1[θ(1)] and precision P_u(1); thus the posterior density of θ(1) is

π[θ(1)|Y(1)] ∝ exp{- (1/2) {Y(1) - F_1[θ(1)]}' P_u(1) {Y(1) - F_1[θ(1)]}} exp{- (1/2) [θ(1) - M(1)]' P(1) [θ(1) - M(1)]},   θ(1) ∈ R^p,   (6.44)

but if F_1 is nonlinear in θ(1), the distribution of θ(1) given Y(1) is not normal; hence the recursive estimates of θ(t) given by the Kalman filter (6.3) do not hold. How should one estimate θ(1)?


If p = 1, 2, or 3, numerical methods may be used to estimate θ(1). For example, one could plot the one- and two-dimensional marginal posterior densities of the components of θ(1) and compute posterior joint and marginal moments of θ(1).

Now consider the next stage, estimating θ(2), where the prior distribution of θ(2) is induced by the posterior distribution of θ(1) through θ(2) = Gθ(1) + V(2). Since the posterior density of θ(1) is not normal, the prior distribution of θ(2) is also non-normal, and the posterior density of θ(2) is

π[θ(2)|Y(1), Y(2)] ∝ exp{- (1/2) {Y(2) - F_2[θ(2)]}' P_u(2) {Y(2) - F_2[θ(2)]}} ∫ π[θ(1)|Y(1)] exp{- (1/2) [θ(2) - Gθ(1)]' P_v(2) [θ(2) - Gθ(1)]} dθ(1),   (6.45)

where θ(2) ∈ R^p and π[θ(1)|Y(1)] is given by (6.44). This integration must be done numerically, since the integrand is not in the form of a normal density in θ(1). In general one would have

π[θ(k)|Y(1), ..., Y(k)] ∝ exp{- (1/2) {Y(k) - F_k[θ(k)]}' P_u(k) {Y(k) - F_k[θ(k)]}} ∫ π[θ(k - 1)|Y(1), ..., Y(k - 1)] exp{- (1/2) [θ(k) - Gθ(k - 1)]' P_v(k) [θ(k) - Gθ(k - 1)]} dθ(k - 1),   (6.46)

k = 1,2, ..., for the posterior density of θ(k). This is a recursive formula for the posterior density of θ(k) as a function of the posterior density of θ(k - 1), and it will give good operational results if p is "small," so that the numerical integration of (6.46) can be done quickly. Thus (6.46) yields a generalized Kalman filter, where, after (6.46) is determined numerically, the posterior mean M(k) of θ(k) is given by

M_i(k) = ∫_{R^p} π[θ(k)|Y(1), Y(2), ..., Y(k)] θ_i(k) dθ(k),   i = 1,2, ..., p,   (6.47)

where the integration must be done numerically, M_i(k) is the i-th component of M(k), and θ_i(k) is the i-th component of θ(k).

To summarize the above, one may say that if a nonlinear function is included in the observation equation, the computational efficiency of the Kalman filter is lost. One might try a linear approximation to F_1 as a function of θ(1) and obtain a normal approximation to the posterior density of θ(1). Since the prior density of θ(1) is normal with mean M(1) and precision matrix P(1), replace F_1[θ(1)] by its first-order Taylor series about M(1) ∈ R^p, i.e., let

F_1[θ(1)] ≅ F_1[M(1)] + R'_1[θ(1) - M(1)],   (6.48)

where R_1 is the p x 1 vector with i-th component

∂F_1[θ(1)]/∂θ_i(1), evaluated at θ(1) = M(1),

and θ_i(1) is the i-th component of θ(1).

The approximate posterior density of θ(1) is, from (6.44),

π̂[θ(1)|Y(1)] ∝ exp{- (1/2) {Y(1) - F_1[M(1)] - R'_1[θ(1) - M(1)]}' P_u(1) {Y(1) - F_1[M(1)] - R'_1[θ(1) - M(1)]}} exp{- (1/2) [θ(1) - M(1)]' P(1) [θ(1) - M(1)]},   θ(1) ∈ R^p,   (6.49)

and it can be shown that π̂ is a normal density with mean

M̂(1) = P̂^{-1}(1)R_1P_u(1){Y(1) + R'_1M(1) - F_1[M(1)]}   (6.50)

and precision

P̂(1) = R_1P_u(1)R'_1 + P(1),   (6.51)

where P(1) is the prior precision of θ(1) and M(1) is the mean of the prior distribution of θ(1). Recall that

M(1) = P^{-1}(1)P_v(1)G[P(0) + G'P_v(1)G]^{-1} P(0)M(0),   (6.52)

where

P(1) = [P_v^{-1}(1) + GP^{-1}(0)G']^{-1}.   (6.53)

Let us consider the second stage, where

Y(2) = F_2[θ(2)] + U(2)   (6.54)

and

θ(2) = Gθ(1) + V(2);   (6.55)

then the prior distribution of θ(2) is induced by the N[M̂(1), P̂^{-1}(1)] posterior distribution of θ(1). The prior (prior to Y(2)) distribution of θ(2) is normal with mean M(2) and precision P(2), where

M(2) = P^{-1}(2){P_v(2)G[G'P_v(2)G + P̂(1)]^{-1} P̂(1)M̂(1)}   (6.56)

and

P(2) = [GP̂^{-1}(1)G' + P_v^{-1}(2)]^{-1}.   (6.57)

As was done at the first stage, suppose F_2[θ(2)] is replaced by its first-order Taylor series; then

F_2[θ(2)] ≅ F_2[M(2)] + R'_2[θ(2) - M(2)],   (6.58)

where R_2 is the p x 1 vector with i-th component

∂F_2[θ(2)]/∂θ_i(2), evaluated at θ(2) = M(2),

and θ_i(2) is the i-th component of θ(2); thus the observation equation (6.54) is approximately

Y(2) = F_2[M(2)] + R'_2[θ(2) - M(2)] + U(2)   (6.59)

and the posterior distribution of θ(2) is approximately normal with mean

M̂(2) = P̂^{-1}(2)R_2P_u(2){Y(2) + R'_2M(2) - F_2[M(2)]}   (6.60)

and precision matrix

P̂(2) = R_2P_u(2)R'_2 + P(2),   (6.61)

where M(2) and P(2) are given by (6.56) and (6.57), respectively. Since M(2) and P(2) are functions of M̂(1) and P̂(1), M̂(2) and P̂(2) are functions of the posterior mean and precision of θ(1).

Suppose k observations Y(1), Y(2), ..., Y(k) are available and one wants to estimate θ(k), and suppose that at each stage prior to and including the k-th,

F_t[θ(t)] = F_t[M(t)] + R'_t[θ(t) - M(t)]   (6.62)

for t = 1,2, ..., k, where M(t) is the prior mean of θ(t) and R_t is p x 1 with i-th component ∂F_t[θ(t)]/∂θ_i(t), evaluated at θ(t) = M(t); then the posterior distribution of θ(k) is approximately normal with mean

M̂(k) = P̂^{-1}(k)R_kP_u(k){Y(k) + R'_kM(k) - F_k[M(k)]}   (6.63)


and precision matrix

P̂(k) = R_kP_u(k)R'_k + P(k),   (6.64)

where

M(k) = P^{-1}(k){P_v(k)G[G'P_v(k)G + P̂(k - 1)]^{-1} P̂(k - 1)M̂(k - 1)}   (6.65)

and

P(k) = [GP̂^{-1}(k - 1)G' + P_v^{-1}(k)]^{-1}.   (6.66)

Thus M̂(k), the mean of the approximate posterior distribution of θ(k), acts like a Kalman filter, because M̂(k) is a function of M̂(k - 1) and P̂(k - 1), the previous estimates of the state; however, since the nonlinear functions F_t[θ(t)] are approximated by linear functions of θ(t) in a neighborhood of M(t), M̂(k) is only an approximation to the posterior mean of θ(k), and the accuracy of the approximation needs to be investigated.

We have seen that if the observation equation (6.39) of a dynamic system is nonlinear, then the exact Bayesian solution, i.e., the posterior mean of θ(k), is given by (6.47); this integration must be done numerically, but the posterior distribution of θ(k) can be expressed recursively as a function of the posterior distribution of θ(k - 1), for k = 1,2, .... If p is small, say p = 1, 2, or 3, the numerical integration of (6.47) may be done quickly and on-line filtering of θ(k) is possible. On the other hand, if p is too large, approximation techniques may be necessary in order for on-line filtering to be feasible.

An approximation method for nonlinear filtering was presented whereby each of the nonlinear functions F_t[θ(t)] of the observation equation was replaced by its first-order Taylor series about the prior mean M(t), (6.65), of θ(t), thus allowing one to develop the usual Kalman-type filter M̂(k), (6.63), as an approximation to the actual mean of the posterior distribution of θ(k). Needless to say, the error of approximating F_t[θ(t)] by its Taylor series induces an error of approximation of E[θ(k)] by the filter M̂(k), and this error has not been investigated.

Only one type of nonlinearity was considered, namely when the observation equation (6.39) contained a nonlinear function F_t of θ(t)

0(t ) in a neighborhood o f M (t ), M (k ) is only an approximation to the posterior mean o f 0( k ) , and the accuracy of the approximation needs to be investigated. We have seen that i f the observation equation (6.39) o f a dy­ namic system is nonlinear, then the exact Bayesian solution, i . e . , the posterior mean o f 0(k ) is given by (6 .4 7 ), and this integration must be done numerically, but the posterior distribution o f 0(k ) can be ex­ pressed recursively as a function o f the posterior distribution of 0 (k — 1 ), for k = 1,2, . . . . I f p is small, say p = 1, 2, 3, the nu­ merical integration o f (6.47) may be quickly done and the on-line filtering of 0 (k ) is possible. On the other hand, i f p is too large, approximation techniques may be necessary in order for on-line filtering to be feasible. An approximation method for nonlinear filtering was presented whereby each of the nonlinear functions F^[ 0( t ) ] of the observation equation was replaced by its first-o rd er Taylor series about the prior mean M (t ), (6 .6 5 ), o f 0( t ) , thus allowing one to develop the usual Kalman filter lft(k), (6 .6 3 ), as an approximation to the actual mean of the posterior distribution of 0( k ) . Needless to say, the error o f ap­ proximation o f 0 ( t ) ] by its Taylor series induces an error o f ap­ proximation o f E [ 0( k ) ] by the Kalman filter M (k ), and this error has not been investigated. Only one type of nonlinearity was considered, namely when the observation equation (6.39) contained a nonlinear function F of 0(t )

274

LINEAR DYNAMIC SYSTEMS

and the observation equation (6 .40) had a linear transition matrix Gt of 0(t - 1). If, in addition, Gt is nonlinear the posterior analysis is more complex than what was done here; but by replacing G^[ 6(t — 1 ) ] by the Taylor series, one may derive a Kalman filter which is an ap­ proximation to the posterior mean of Q (k ). There are many ways to approach the problem o f nonlinear fil­ tering and one should read Jazwinski (1971) or Maybeck (1979) for the traditional solutions to the problem.

ADAPTIVE ESTIMATION

Introduction

The Kalman filter was designed for linear dynamic systems where all the parameters, except the state vectors, are known. The parameters of the system are m(0), P(0), P_u(t), and P_v(t), where m(0) and P(0) are the mean and precision, respectively, of the initial state θ(0), P_u(t) is the precision matrix of U(t), the observation error, and P_v(t) is the precision matrix of V(t), the system error. Often one or more of these parameters are unknown, and adaptive estimation is the term applied to estimation problems when some of the parameters are unknown. Adaptive estimation is an active area of research, and Jazwinski (1970), Mehra (1970, 1972), Lainiotis (1971), Kailath (1968), and Maybeck (1979) will give the reader a good introduction to adaptive filtering, smoothing, and prediction.

Adaptive filtering based on Bayesian methods is given by Alspach (1974) and Hawkes (1973); however, their methods, although related, are not the same as those which are to be presented. Other methods of adaptive filtering depend on covariance-matching, correlation, and maximum-likelihood techniques of estimating parameters.

Adaptive Filtering with Unknown Precisions

Our approach to adaptive estimation begins with the special situation where the observation and system errors u(t) and v(t) are scalar and have unknown precision parameters τ_u and τ_v, respectively, where τ_u = P_u(t) and τ_v = P_v(t) for t = 1,2, .... The observation precision τ_u is common to all observations, and there is only one system error precision τ_v. The other parameters are m(0) and p(0), but for the moment suppose θ(0) is a known initial state, that is, m(0) = θ(0) and p(0) = ∞. Thus the linear dynamic model is

y(t) = f(t)θ(t) + u(t),
θ(t) = gθ(t - 1) + v(t),   t = 1,2, ...   (6.67)

where u(t) ~ n(0, τ_u^{-1}), v(t) ~ n(0, τ_v^{-1}), t = 1,2, ..., and u(1), u(2), ..., and v(1), v(2), ..., are independent sequences. The unknown parameters are τ_u > 0 and τ_v > 0, while θ(0), the initial state, is known. The problem is to estimate the successive states θ(1), θ(2), ..., of the system as the observations y(1), y(2), ..., become available.

The primary parameters are the states of the system, and the precision components τ_u and τ_v are regarded as nuisance parameters; but since they are unknown, they must be assigned a prior distribution. Thus suppose τ_u and τ_v are independent, that the marginal prior density of τ_u is

π(τ_u) ∝ τ_u^{α_u - 1} e^{-β_u τ_u},   τ_u > 0,

and that

π(τ_v) ∝ τ_v^{α_v - 1} e^{-β_v τ_v},   τ_v > 0,

where α_u, α_v, β_u, and β_v are known positive constants.
First, consider estimating θ(1); then it may be shown that the posterior density of θ(1) is

    π[θ(1)|y(1)] ∝ {2β_u + f(1)[y(1)/f(1) − θ(1)]²}^(−(2α_u + 1)/2)
                   × {2β_v + [θ(1) − Gθ(0)]²}^(−(2α_v + 1)/2),    θ(1) ∈ R,    (6.68)

which is a 2/0 univariate poly-t density, a distribution commonly encountered in econometrics. See Dreze (1977), for example, for a review of the literature on this topic. Now in general, it can be shown that θ^(k) = [θ(1), θ(2), ..., θ(k)]′ has joint marginal density function

    π[θ^(k)|y(1), ..., y(k)] ∝ {2β_u + Σ_{t=1}^{k} f(t)[y(t)/f(t) − θ(t)]²}^(−(k + 2α_u)/2)
                               × {2β_v + Σ_{t=1}^{k} [θ(t) − Gθ(t − 1)]²}^(−(k + 2α_v)/2)    (6.69)

for θ^(k) ∈ R^k, which is a 2/0 k-variable poly-t density and is very difficult to work with. For example, the normalizing constant is unknown and the moments must be evaluated numerically. Various approximations have been proposed for working with this distribution. Box and Tiao (1973) and Zellner (1971) have proposed infinite series expansions, while Dreze (1977) has developed a numerical integration algorithm to find the moments and other characteristics of the density. An interesting feature of this distribution is that the two t densities of (6.69) always have 2α_u and 2α_v degrees of freedom for all k, k = 1, 2, .... Thus as more observations are obtained, the precision matrices of the two factors do not "decrease" with k! The precision matrices of each factor depend on β_u and β_v, respectively, but β_u and β_v are constant, i.e., they do not depend on k.

The two parameters β_u and β_v control the precision of the precision parameters τ_u and τ_v. For example,

    Var(τ_u⁻¹) = β_u² / [(α_u − 1)²(α_u − 2)],    α_u > 2;

thus small values of β_u (and β_v) decrease the prior variance of τ_u⁻¹ and increase the posterior precision of θ(1), θ(2), ..., θ(k). To obtain precise estimates of the states, one must have "small" values of β_u and β_v, but small values of β_u and β_v imply that one's prior knowledge of τ_u and τ_v is precise.
The difficulty in working with the posterior density (6.69) is that the marginal density of θ(k) is difficult to determine; a recursive formula for F[θ(k)|y(1), ..., y(k)] in terms of F[θ(k − 1)|y(1), ..., y(k − 1)], y(k), f(k), and G is all but impossible to derive, and a formula analogous to the Kalman filter is difficult to obtain.
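Because the moments of a poly-t density such as (6.68) are not available in closed form, they are typically found by quadrature. The sketch below evaluates the unnormalized posterior (6.68) on a grid and computes the posterior mean of θ(1) numerically; the prior settings, the observation, and the model constants are illustrative assumptions only.

```python
import numpy as np

# Grid evaluation of the 2/0 poly-t posterior (6.68) for theta(1)
alpha_u, beta_u = 3.0, 2.0   # assumed prior parameters for tau_u
alpha_v, beta_v = 3.0, 2.0   # assumed prior parameters for tau_v
f1, G, theta0 = 1.0, 0.9, 0.0
y1 = 1.2                     # an assumed first observation

theta = np.linspace(-10.0, 10.0, 20001)
obs_factor = (2.0 * beta_u + f1 * (y1 / f1 - theta) ** 2) ** (-(2.0 * alpha_u + 1.0) / 2.0)
sys_factor = (2.0 * beta_v + (theta - G * theta0) ** 2) ** (-(2.0 * alpha_v + 1.0) / 2.0)
kernel = obs_factor * sys_factor              # unnormalized posterior density

norm_const = np.trapz(kernel, theta)          # normalizing constant by quadrature
post_mean = np.trapz(theta * kernel, theta) / norm_const
print(post_mean)                              # numerical posterior mean of theta(1)
```

In one dimension this is trivial, which is why on-line filtering is feasible for small p; the same grid approach becomes impractical as the dimension of the state grows.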


What should be done in order to find a recursive estimate of θ(k)? One could approximate each of the two t-density factors of (6.69) by a normal density (with the same mean vector and precision matrix); then the posterior density of θ^(k) is normal and it would be possible to find a Kalman-type expression for the posterior mean of θ(k). However, the normal approximation to a poly-t density may not be very good, so this approach, though interesting, will be abandoned. Inspecting (6.69), one sees that the conditional posterior density of θ(k), given θ(t), t = 1, 2, ..., k − 1, is a 2/0 univariate poly-t density.

Adaptive Filtering, the Vector Case

Suppose now that the observation and system errors of the model have unknown precision matrices U (of order n) and V (of order p), which are assigned independent Wishart prior distributions. Let U have density

    π(U) ∝ |U|^((ν₁ − n − 1)/2) exp[−(1/2) Tr(T_u U)],    U > 0,

where U > 0 means the n(n + 1)/2 distinct elements of U form a positive definite matrix U; that is, U is Wishart with ν₁ degrees of freedom and precision matrix T_u. In a similar way let V have density

    π(V) ∝ |V|^((ν₂ − p − 1)/2) exp[−(1/2) Tr(T_v V)],    V > 0,

i.e., V is Wishart with ν₂ degrees of freedom and precision matrix T_v, which is known, symmetric, and positive definite. The joint density of y(1), θ(1), U, and V is

    g[y(1), θ(1), U, V] ∝ |U|^((ν₁ + 1 − n − 1)/2) exp(−(1/2) Tr{[y(1) − F(1)θ(1)][y(1) − F(1)θ(1)]′ + T_u}U)
                          × |V|^((ν₂ + 1 − p − 1)/2) exp(−(1/2) Tr{[θ(1) − Gθ(0)][θ(1) − Gθ(0)]′ + T_v}V),

where U > 0, V > 0, y(1) ∈ Rⁿ and θ(1) ∈ Rᵖ, and eliminating U and V by the properties of the Wishart density gives

    g[y(1), θ(1)] ∝ |T_u + [y(1) − F(1)θ(1)][y(1) − F(1)θ(1)]′|^(−(ν₁ + 1)/2)
                    × |T_v + [θ(1) − Gθ(0)][θ(1) − Gθ(0)]′|^(−(ν₂ + 1)/2),

and upon completing the square with respect to θ(1) in each factor, the conditional distribution of θ(1) given y(1) is

    g[θ(1)|y(1)] ∝ {1 + [θ(1) − μ₁₁]′P₁₁[θ(1) − μ₁₁]}^(−(ν₁ + 1)/2)
                   × {1 + [θ(1) − μ₁₂]′P₁₂[θ(1) − μ₁₂]}^(−(ν₂ + 1)/2),    θ(1) ∈ Rᵖ.    (6.95)

Therefore, the marginal posterior density of θ(1) is a p-variable 2/0 poly-t distribution, where

    μ₁₁ = [F′(1)T_u⁻¹F(1)]⁻¹F′(1)T_u⁻¹y(1),

    P₁₁ = F′(1)T_u⁻¹F(1) / [1 + y′(1)T_u⁻¹y(1) − μ′₁₁F′(1)T_u⁻¹y(1)],

    μ₁₂ = Gθ(0),    and    P₁₂ = T_v⁻¹.

If one wants to filter θ(1), the mean of θ(1) needs to be found from (6.95), which is unknown; thus it must be found numerically, but this is indeed difficult if p > 1. Suppose each of the two t densities of (6.95) is approximated by a normal density with the same mean vector and precision matrix; then (6.95) will be approximately normal with mean

    m(1) = (P*₁₁ + P*₁₂)⁻¹(P*₁₁μ₁₁ + P*₁₂μ₁₂)    (6.96)

and precision

    P(1) = P*₁₁ + P*₁₂,    (6.97)

where

    P*₁₁ = (ν₁ − p − 1)P₁₁    and    P*₁₂ = (ν₂ − p − 1)P₁₂.

In general, for k ≥ 2, the conditional posterior distribution of

θ(k), given θ(1), θ(2), ..., θ(k − 1), is normal with mean

    m(k) = P⁻¹(k)(P*_k1 μ_k1 + P*_k2 μ_k2)    (6.98)

and precision matrix

    P(k) = P*_k1 + P*_k2,    (6.99)

where

    P*_ki = (νᵢ + k − p − 2)P_ki,    i = 1, 2,

    μ_k1 = [F′(k)S₁⁻¹(k − 1)F(k)]⁻¹F′(k)S₁⁻¹(k − 1)y(k),

    P_k1 = F′(k)S₁⁻¹(k − 1)F(k) / [1 + y′(k)S₁⁻¹(k − 1)y(k) − μ′_k1 F′(k)S₁⁻¹(k − 1)y(k)],

    S₁(k − 1) = T_u + Σ_{t=1}^{k−1} [y(t) − F(t)θ(t)][y(t) − F(t)θ(t)]′,

    μ_k2 = Gθ(k − 1),    P_k2 = S₂⁻¹(k − 1),

and

    S₂(k − 1) = T_v + Σ_{t=1}^{k−1} [θ(t) − Gθ(t − 1)][θ(t) − Gθ(t − 1)]′.

To reiterate, the conditional distribution of θ(k) given θ(1), ..., θ(k − 1) is normal with mean m(k) and precision P(k), where for k = 1, m(1) and P(1) are given by (6.96) and (6.97), respectively, and for k ≥ 2, m(k) and P(k) by (6.98) and (6.99), respectively.
To filter θ(k), one may use either the mean of the marginal posterior distribution of θ(k) or the mean m(k) of the conditional posterior distribution of θ(k) given θ(1), θ(2), ..., θ(k − 1), conditioning on estimates of the previous states. We choose the latter because it is difficult to express the marginal posterior mean of θ(k) in a recursive fashion. By using the conditional mean m(k) of θ(k), given the conditional means of the previous states, θ(t) = m(t), t = 1, 2, ..., k − 1, one may obtain a recursive filter for θ(k).
Consider the first state θ(1); using the approximate mean m(1), θ(1) is easily estimated. Now consider the second state θ(2), which is estimated by m(2); but m(2) depends on θ(1) because S₁(1) and μ₂₂ depend on θ(1), thus let θ(1) = m(1) and m(2) is determined. In the same way, m(3) depends on θ(1) and θ(2) because S₁(2) and μ₃₂ are functions of θ(1) and θ(2); thus letting θ(1) = m(1) and θ(2) = m(2), where m(2) is given at the second stage, determines m(3). This process is continued and a recursive filter for θ(k) is easily constructed; however, it depends on the normal approximation to a poly-t distribution, and the error of approximation needs to be studied.
A numerical study at the end of the chapter will reveal how reasonable the filter m(k) is in estimating the states.
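The recursion just described is straightforward to program. The following sketch implements the approximate filter (6.96)-(6.99) for the vector case, with simulated data standing in for the numerical study reported later in the chapter; the variable names and all numerical settings are assumptions made for illustration, not the book's example.

```python
import numpy as np

def adaptive_filter(y, F, G, T_u, T_v, nu1, nu2, theta0):
    """Recursive approximation (6.96)-(6.99): the 2/0 poly-t conditional
    posterior of theta(k) is replaced by a normal density, and earlier
    states are fixed at their filtered means m(1), ..., m(k-1)."""
    p = G.shape[0]
    S1, S2 = T_u.copy(), T_v.copy()
    m_prev, filtered = theta0, []
    for k, (yk, Fk) in enumerate(zip(y, F), start=1):
        S1_inv = np.linalg.inv(S1)
        A = Fk.T @ S1_inv @ Fk                    # F'(k) S1^{-1}(k-1) F(k)
        b = Fk.T @ S1_inv @ yk                    # F'(k) S1^{-1}(k-1) y(k)
        mu1 = np.linalg.solve(A, b)               # mu_k1
        c = 1.0 + yk @ S1_inv @ yk - mu1 @ b      # scalar in P_k1
        P1 = A / c                                # P_k1
        mu2 = G @ m_prev                          # mu_k2 = G theta(k-1)
        P2 = np.linalg.inv(S2)                    # P_k2 = S2^{-1}(k-1)
        w1 = (nu1 - p - 1) if k == 1 else (nu1 + k - p - 2)
        w2 = (nu2 - p - 1) if k == 1 else (nu2 + k - p - 2)
        Pk = w1 * P1 + w2 * P2                    # precision P(k)
        mk = np.linalg.solve(Pk, w1 * P1 @ mu1 + w2 * P2 @ mu2)   # mean m(k)
        # plug the filtered mean in for theta(k) when updating S1 and S2
        S1 = S1 + np.outer(yk - Fk @ mk, yk - Fk @ mk)
        S2 = S2 + np.outer(mk - G @ m_prev, mk - G @ m_prev)
        m_prev = mk
        filtered.append(mk)
    return np.array(filtered)

# Illustrative simulation; the numbers are assumptions, not the book's data.
rng = np.random.default_rng(1)
G = np.array([[0.95, 0.0], [0.0, 0.9]])
U_true = np.diag([4.0, 4.0])                      # observation precision
V_true = np.diag([100.0, 100.0])                  # system precision
theta = np.array([1.0, 1.0])
F_seq, y_seq = [], []
for t in range(50):
    theta = G @ theta + rng.multivariate_normal(np.zeros(2), np.linalg.inv(V_true))
    F_seq.append(np.eye(2))
    y_seq.append(F_seq[-1] @ theta + rng.multivariate_normal(np.zeros(2), np.linalg.inv(U_true)))

m_path = adaptive_filter(y_seq, F_seq, G,
                         T_u=4 * np.linalg.inv(U_true),
                         T_v=4 * np.linalg.inv(V_true),
                         nu1=4, nu2=4, theta0=np.array([1.0, 1.0]))
print(m_path[-1])                                 # filtered estimate of theta(50)
```

Because each stage conditions on the plug-in means m(1), ..., m(k − 1), the whole path is filtered in a single pass; the quality of the normal approximation to the poly-t factors is not assessed by the sketch.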

Adaptive Control, the Vector Case

Dynamic linear models are often used to model the control mechanism of dynamic systems, and in a previous section of this chapter a control strategy was developed for a scalar version of the dynamic model (6.72). The scalar version of the model is now generalized to the case when the states are vectors, and as in the previous section, the conditional posterior density of the present state given the previous states is approximated by a normal distribution.
To recapitulate, again consider the model

    y(t) = F(t)θ(t) + u(t),
                                                            (6.72)
    θ(t) = Gθ(t − 1) + H(t)x(t − 1) + v(t),

where t = 1, 2, ..., and the terms in the model are as those of (6.1) except that H(t)x(t − 1) has been added to the system equation, where H(t) is a known p × q matrix and x(t − 1) is a q × 1 vector of control variables, which are to be chosen in such a way that θ(t) is "close" to T(t), t = 1, 2, .... Suppose θ(0) is known, that u(1), u(2), ... and v(1), v(2), ... are independent vectors, and that u(t) ~ N(0, U⁻¹) and v(t) ~ N(0, V⁻¹), t = 1, 2, ..., where U and V are unknown positive definite symmetric precision matrices of orders n and p, respectively.
First the exact posterior distribution of θ(1), θ(2), ..., θ(k) given y(1), y(2), ..., y(k), k = 1, 2, ..., will be derived; then the exact conditional distribution of θ(k) given θ(1), θ(2), ..., θ(k − 1) will be identified; and finally this will be approximated by a normal distribution, from which a control strategy will be developed.
The main parameters of interest are the states; thus the precision matrices U and V will be regarded as nuisance parameters, independent, with prior Wishart distributions. Let U be Wishart with ν₁ > n − 1 degrees of freedom and precision matrix T₁, which is known, and let ν₂ > p − 1 be the degrees of freedom and T₂ the precision matrix of the prior distribution of V.

Proceeding as in the previous section, the conditional posterior density of θ(k), given θ(1), ..., θ(k − 1), is

    g[θ(k)|θ(1), ..., θ(k − 1)] ∝ ∏_{i=1}^{2} {1 + [θ(k) − μ_ki]′P_ki[θ(k) − μ_ki]}^(−(νᵢ + k)/2),    θ(k) ∈ Rᵖ,    (6.100)

which is a 2/0 poly-t density, and

    μ_k1 = [F′(k)S₁⁻¹(k − 1)F(k)]⁻¹F′(k)S₁⁻¹(k − 1)y(k),

    S₁(k − 1) = T₁ + Σ_{t=1}^{k−1} [y(t) − F(t)θ(t)][y(t) − F(t)θ(t)]′,

    P_k1 = F′(k)S₁⁻¹(k − 1)F(k) / [1 + y′(k)S₁⁻¹(k − 1)y(k) − μ′_k1 F′(k)S₁⁻¹(k − 1)y(k)],    (6.101)

    μ_k2 = Gθ(k − 1) + H(k)x(k − 1),    P_k2 = S₂⁻¹(k − 1),

and

    S₂(k − 1) = T₂ + Σ_{t=1}^{k−1} [θ(t) − Gθ(t − 1) − H(t)x(t − 1)][θ(t) − Gθ(t − 1) − H(t)x(t − 1)]′.

The marginal posterior density of θ(1) is given by (6.100) with k = 1 and is to be approximated by a normal distribution with mean

    m(1) = (P*₁₁ + P*₁₂)⁻¹(P*₁₁μ₁₁ + P*₁₂μ₁₂)    (6.102)

and precision matrix

    P(1) = P*₁₁ + P*₁₂,    (6.103)

where P*₁ᵢ = (νᵢ − p − 1)P₁ᵢ, i = 1, 2, and P₁ᵢ and μ₁ᵢ are given by (6.101) with k = 1. This normal approximation is based on approximating each of the two factors of (6.100) by a normal density with the same mean vector and precision matrix as the factor. In the same way, the conditional posterior density of θ(k) given the previous states is approximated by a normal density with mean

    m(k) = P⁻¹(k)(P*_k1 μ_k1 + P*_k2 μ_k2)    (6.104)

and precision matrix

    P(k) = P*_k1 + P*_k2,

where

    P*_ki = (νᵢ + k − p − 2)P_ki,    i = 1, 2,    k = 2, 3, ...,

and μ_k1, μ_k2, P_k1, and P_k2 are given by (6.101).
How does one develop a control strategy based on the normal approximation? First, one must choose a control criterion which tells one how close the preassigned target value T(i) is to be to the state θ(i), i = 1, 2, .... Up to this point the criterion was based on the one-step-ahead forecasts of the states; i.e., for example, x(0) was chosen so that

    T(1) = E_{y(1)}{E[θ(1)|x(0), y(1)]} = E_{y(1)}[m(1)],    (6.106)

where E_{y(1)} denotes expectation with respect to the (future) distribution of y(1). But μ₁₁ and P₁₁ are both functions of y(1), and the expectation in (6.106) appears to be intractable; therefore consider an alternative criterion of control.
In lieu of choosing the control vector x(0) so that the target value T(1) is the average value of the future state θ(1) via (6.106), choose x(0) so that T(1) is the average prior state of θ(1), i.e., let

    T(1) = E_{v(1)}[θ(1)] = Gθ(0) + H(1)x(0),    (6.107)

and the average is taken with respect to v(1). Solving (6.107) with respect to x(0) gives

    x(0) = [H′(1)H(1)]⁻¹H′(1)[T(1) − Gθ(0)]    (6.108)

if q ≤ p and H(1) is of full rank.
At the second stage of the control algorithm, y(1) has been observed, x(0) has been chosen according to (6.108), and x(1) must be chosen before y(2) is observed, in such a way that

    T(2) = E_{v(2)}[θ(2)|x(1), x(0)]

         = Gθ̂(1) + H(2)x(1),

where θ̂(1) = m(1) evaluated at x(0) = x̂(0) according to (6.102), the posterior mean of θ(1), which depends on x(0) because μ₁₂ of (6.100) is a function of x(0); thus x(1) is chosen as

    x(1) = [H′(2)H(2)]⁻¹H′(2)[T(2) − Gθ̂(1)].

Thus in general,

    x(k − 1) = [H′(k)H(k)]⁻¹H′(k)[T(k) − Gθ̂(k − 1)],    (6.109)

where θ̂(k − 1) = m(k − 1), evaluated at x(k − 2) = x̂(k − 2) according to (6.104), and (6.109) gives a recursive formula for controlling the states at preassigned values T(1), T(2), .... Of course other control criteria could have been used, and the reader is referred to Aoki (1967), Sawaragi et al. (1967), and Zellner (1971) for other ways to choose a control strategy.
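The control rule (6.109) amounts to a small least-squares computation at each stage. A minimal sketch of one such step, under assumed values of G, H(1), and the target (none of which come from the book's example):

```python
import numpy as np

def control_step(H_k, T_k, G, theta_hat_prev):
    """One step of the rule (6.109): choose x(k-1) so that the prior
    mean G theta_hat(k-1) + H(k) x(k-1) of theta(k) equals the target T(k)."""
    HtH = H_k.T @ H_k
    return np.linalg.solve(HtH, H_k.T @ (T_k - G @ theta_hat_prev))

# Illustrative values (assumed)
G = np.array([[0.9, 0.1], [0.0, 0.8]])
H1 = np.eye(2)                        # p = q = 2 here
theta0 = np.array([1.0, 1.0])
target = np.array([2.0, 0.5])         # T(1)
x0 = control_step(H1, target, G, theta0)
# check: G theta(0) + H(1) x(0) reproduces the target when H(1) is square
assert np.allclose(G @ theta0 + H1 @ x0, target)
```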

AN EXAMPLE OF ADAPTIVE ESTIMATION

This example will illustrate adaptive filtering of the bivariate states of a linear dynamic model which has bivariate observations. Consider the model

    y(t) = Fθ(t) + u(t),
                                        t = 1, 2, ..., 50,        (6.110)
    θ(t) = Gθ(t − 1) + v(t),

where

u(t) ~ n(0, u⁻¹), v(t) ~ n(0, v⁻¹), and θ(0) is known. Also assume u(1), ..., u(50), v(1), ..., v(50) are independent bivariate normal random vectors. Fifty bivariate observations were generated according to (6.110); i.e., v(1) was selected from a N(0, v⁻¹) population and θ(1) was generated according to the system equation θ(1) = Gθ(0) + v(1), where θ(0) is known; then u(1) was selected from a N(0, u⁻¹) population and y(1) was computed from the observation equation y(1) = Fθ(1) + u(1), and this was repeated fifty times. Table 6.5 has four columns: the first two consist of the two components of θ(1), θ(2), ..., θ(50), and the last two columns list the two components of the filter m(1), m(2), ..., m(50), where m(1) is computed by formula (6.96), while the remaining filters m(t), t = 2, ..., 50, are given by (6.98). The observation and system precision matrices used in generating the observations and states were

    u = (  0.6   −0.4 )
        ( −0.4    0.6 )                                      (6.111)

and

    v = (  555.5556   −444.4445 )
        ( −444.4445    555.5556 ),                           (6.112)

and the initial state was θ(0) = (1, 1)′.

= 4 and T v are the parameters of the prior distribution of v,

LINEAR DYNAMIC SYSTEMS

296

Table 6.5 Estimates of Bivariate States, First Model

1

0.8767608 1.0138780 1.0157220 0.8869861 0.9979184 1.0177950 0.8915292 0.9890499 1.0458780 0.8683957

0.9436042 1.0086370 0.9275690 0.8818690 0.9660156 0.9694926 0.9432775 1.0239550 1.0531150 0.8866324

0.9921297 0.9956300 0.9968629 0.9951978 0.9960202 0.9971470 0.9970379 0.9964256 0.9965436 0.9977383

0.9920480 0.9949626 0.9964153 0.9950728 0.9958527 0.9968274 0.9969366 0.9968209 0.9969116 0.9972972

1.0401780 0.9417663 1.0145750 0.9947690 1.0400200 1.0981140 1.1150730 0.8976504 0.9583021 0.9606810

1.1292990 0.9990182 0.9442801 1.0194400 1.0639820 1.0044710 1.0627030 0.7979441 0.9701377 0.8540279

0.9984227 0.9983795 0.9992406 0.9993104 0.9997271 0.9995579 0.9989278 0.9993744 0.9992608 0.9992424

0.9980738 0.9984148 0.9993359 0.9996912 0.9999698 0.9992192 0.9997275 0.9995686 0.9994334 0.9991309

23 24 25 26 27 28 29 30

1.0641170 1.0035550 0.9376380 0.8141437 1.0743160 1.0292130 0.9791433 1.0833180 1.0762960 0.9052299

1.0544220 1.0304920 1.0469480 0.8606508 1.0896180 0.9388443 0.9985471 1.0673180 1.0282070 0.9344323

0.9989031 0.9984049 0.9988129 0.9981064 0.9981948 0.9982485 0.9979768 0.9978260 0.9979589 0.9978279

0.9992933 0.9988465 0.9983499 0.9984301 0.9981479 0.9979065 0.9982306 0.9981133 0.9977634 0.9978729

31 32 33 34 35 36 37 38 39 40

0.8707097 1.0610300 0.9989973 1.0556150 0.9656247 0.8883852 1.0536730 0.9135162 1.0512460 1.0604350

0.9098074 1.0355550 0.9829768 1.1018600 1.0633060 0.8781859 1.1227540 0.9449835 1.0342730 0.9912644

0.9979017 0.9979369 0.9981555 0.9978477 0.9976317 0.9972868 0.9969144 0.9971144 0.9970482 0.9972612

0.9977839 0.9979443 0.9976802 .9975105 .9973399 .9973239 .9974268 0.9969935 0.9971711 0.9967551

2 3 4 5 6 7

8 9

10 11 12

13 14 15 16 17 18 19

20 21 22

AN EXAMPLE OF ADAPTIVE ESTIMATION

297

Table 6.5 (Continued) 41 42 43 44 45 46 47 48 49 50

1.0291380 0.9045110 1.0577140 0.9408895 0.9251381 0.8869721 0.9849322 0.8835386 0.9539953 1.0018050

T

u

= v .u

0.9817958 0.9742339 1.0639700 0.8657800 1.0265350 0.9065623 1.0250230 0.9120919 0.9177560 1.0242950

0.9966899 0.9968914 0.9966074 0.9966914 0.9967267 0.9968241 0.9967179 0.9967266 0.9966724 0.9968926

0.9971806 0.9967545 0.9970306 0.9968976 0.9972832 0.9969598 0.9969848 0.9969863 0.9970572 0.9970560

-1

1

and T

v

= v 0v

-1

2

where u and v are given by ( 6 . 1 1 1 ) and ( 6 . 1 1 2 ) , respectively. Now, referrin g to Table 6.5, the last two columns estimate the first two, thus m (l) = (.9921297, .9920480) estimates 0(1) = (.8767608, .9436042) and m(50) = (.9968926, .9970560) estimates 0(50) = (1.0018050, 1.0242950) and one sees there is very little variation in the states and estimates. That there is little variation is due to the "large" system precision matrix V , which allows for small variation in the state vectors. Thus 555.5556 is the precision in the first com­ ponent of the state vector. Table 6 . 6 lists the fifty precision matrices of the approximate conditional normal distribution of 0 (k ) given 0 (t ) = m (t), t = 1 , 2 , . . . , k — 1 and k = 1,2, . . . , 50. Remember m (k) is the mean of the con­ ditional distribution of 0 ( k ) , given 0 ( 1 ) , 0 ( 2 ) , . . . , 0 (k — 1 ) , and Table 6 . 6 gives the precision of the estimates of this distribution for k = 1, . . . , 50. We see as the number of observations increases, the precision of the first of the two components of m (k) increases from 139.9763 to 6936.667. If the systems precision matrix is changed to

and the observation precision matrix to

LINEAR DYNAMIC SYSTEMS

298

Table 6 . 6 Estimates of Precision Matrix, First Model

1

139.9763000 278.1474000 416.8376000 555.5969000 694.3935000 833.1870000 971.9570000 1110.7590000 1249.5300000 1388.3120000

-110.1362000 -221.9297000 -333.1662000 -444.3525000 -555.5400000 -666.7160000 -777.8828000 -889.0532000

-110.1362000 -221.9297000 -333.1662000 -444.3525000 -555.5400000 -666.7160000 -777.8828000 -889.0532000

-1111.3460000

-1111.3460000

139.7888000 278.0527000 416.8354000 555.6538000 694.4687000 833.3015000 972.1335000 1110.9590000 1249.7910000 1388.6030000

1527.0440000 1665.7910000 1804.5810000 1943.3680000 2082.1280000 2220.9120000 2359.6900000 2498.2870000 2637.0580000 2775.8220000

-1222.4810000 -1333.6250000 -1444.7840000 -1555.9500000 -1667.0730000 -1778.2190000 -1889.3500000 -2000.3210000 -2111.4570000 -2222.5910000

-1222.4810000 -1333.6250000 -1444.7840000 -1555.9500000 -1667.0730000 -1778.2190000 -1889.3500000 -2000.3210000 -2111.4570000 -2222.5910000

1527.4240000 1666.2520000 1805.0830000 1943.8970000 2082.6820000 2221.4880000 2360.2680000 2498.9140000 2637.7170000 2776.5240000

23 24 25 26 27 28 29 30

2914.6010000 3053.3150000 3191.9850000 3330.6710000 3469.3740000 3608.1450000 3746.8620000 3885.5970000 4024.3060000 4163.0580000

-2333.7330000 -2444.8210000 -2555.8980000 -2666.9500000 -2778.0430000 -2889.1820000 -3000.2620000 -3111.3650000 -3222.4520000 -3333.5700000

-2333.7330000 -2444.8210000 -2555.8980000 -2666.9500000 -2778.0430000 -2889.1820000 -3000.2620000 -3111.3650000 -3222.4520000 -3333.5700000

2915.3340000 3054.0960000 3192.8710000 3331.5910000 3470.3690000 3609.1800000 3747.9270000 3886.7010000 4025.4670000 4164.2500000

31 32 33 34 35 36 37 38 39 40

4301.8120000 4440.5620000 4579.3240000 4717.9600000 4856.6790000 4995.3940000 5134.1400000 5272.7260000 5411.4680000 5550.2100000

-3444.6980000 -3555.8140000 -3666.9430000 -3777.9470000 -3889.0160000 -4000.0910000 -4111.2030000 -4222.1750000 -4333.2810000 -4444.3900000

-3444.6980000 -3555.8140000 -3666.9430000 -3777.9470000 -3889.0160000 -4000.0910000 -4111.2030000 -4222.1750000 -4333.2810000 -4444.3900000

4303.0500000 4441.8390000 4580.6400000 4719.3120000 4858.0350000 4996.7650000 5135.5540000 5274.2100000 5412.9840000 5551.7650000

2 3 4 5 6 7

8

9

10 11 12

13 14 15 16 17 18 19

20 21 22

-

1000.2010000

-

1000.2010000

AN EXAMPLE OF ADAPTIVE ESTIMATION

299

Table 6 . 6 (Continued) 41 42 43 44 45 46 47 48 49 50

5688.8000000 5827.3780000 5966.1090000 6104.7340000 6243.4450000 6381.9880000 6520.6990000 6659.3710000 6798.0460000 6936.6670000

-4555.3350000 -4666.2920000 -4777.3860000 -4888.3820000 -4999.4600000 -5110.3350000 -5221.4210000 -5332.4680000 -5443.5070000 -5554.4960000

-4555.3350000 -4666.2920000 -4777.3860000 -4888.3820000 -4999.4600000 -5110.3350000 -5221.4210000 -5332.4680000 -5443.5070000 -5554.4960000

5690.3710000 5829.0030000 5967.7650000 6106.4290000 6245.1830000 6383.6950000 6522.4570000 6661.1750000 6799.8820000 6938.5390000

Table 6.7 Estimates of Bivariate States, Second Model

1 2 3 4 5

6 7

8 9

10 11 12 13 14 15 16 17 18 19

20 21 22 23 24 25

-2.0187330 -0.3568782 -0.1034499 -4.2550140 -3.6362530 -3.3096080 -6.3195590 -4.6620440 -3.3241550 -7.4801380

0.2714834 -0.5616732 -2.8283520 -4.1827350 -4.6128130 -4.8248720 -4.5256390 -3.5833760 -3.2063650 -6.6583880

-1.7885740 -0.1711372 -0.4052987 -3.8115030 -3.4780980 -3.1248320 -5.0647860 -5.0297330 -4.5411190 -4.8145800

0.0073242 1.0364430 -2.7328530 -3.5973520 4.2119970 -4.3479190 -3.8704520 -4.0316240 -4.1322040 4.7252330

-3.7249180 -5.9570100 -4.2883490 -4.8190750 -3.7379370 -2.2448510 -1.6583950 -6.8210420 -5.4881420 -5.5082360

-1.1017790 -4.0974540 -6.4610460 -4.0589170 -3.0930690 -5.2986450 -3.4925020 -9.6436080 -5.0424010 -8.6731960

-5.1964110 -5.2468970 -4.1150940 -5.1326100 -4.9735230 -3.4472240 -4.1790720 -3.7749360 -4.7625970 -3.6413460

-3.9505730 -4.1224350 5.3106630 -4.3283750 -4.3302470 -5.5369760 -4.4818610 5.1844730 -4.3596320 -5.4908290

-2.9756040 -4.3860670 -6.0284100 -9.1554870 -3.0298220

-3.4056260 -3.5756530 -2.5753710 -7.3568390 -2.7223140

-4.5341880 -4.9361630 -5.3793240 -5.5199390 -4.8775220

-4.4563310 -4.1328890 -3.8960780 4.2653220 -4.7453790

LINEAft DYNAMIC SYSTEMS

300

Table 6.7 (Continued) 26 27 28 29 30

-4.0510470 -5.1820460 -2.6773540 -2.7688590 -6.8400740

-6.8654740 -4.5522410 -3.3413930 -4.3950170 -5.7588750

-3.7927980 -5.0457170 -4.6268300 -4.0422420 -4.9973520

-5.6437160 -4.4836900 -4.7481430 5.0999170 -4.3651140

31 32 33 34 35 36 37 38 39 40

-7.7955340 -3.2683430 -4.7181680 -3.2870640 -5.4523020 -7.4232510 -3.4467090 -6.9186350 -3.6154960 -3.3500490

-6.3414780 -4.1748700 -5.2078570 -1.9998420 -2.4132670 -7.5067710 -1.4618470 -5.7871800 -4.2446610 -5.5865210

-5.1617560 -4.5055150 -4.3622970 -4.8541070 -5.4350430 -4.8126640 -5.6200360 -5.1857880 -4.7909640 -3.9588040

-4.4704800 4.9825980 -5.0384390 -4.4801670 -4.0666450 -4.8422420 -4.1190240 -4.7001310 5.0114490 -5.6019570

41 42 43 44 45 46 47 48 49 50

-3.9933370 -7.0150830 -3.3668910 -6.2205110 -6.5916890 -7.6909470 -5.3735310 -7.9085680 -6.2792730 -5.1237140

-5.5007500 -4.7013150 -3.3031310 -8.3897360 -3.3572670 -6.8660590 -4.1286990 -6.8041210 -7.2928890 -4.4496240

-4.5554680 -5.4134860 -5.0504150 -4.3390080 -6.1055210 -5.3444560 -5.5777810 -5.6229320 -5.1296930 -5.5265420

-4.9468580 -4.3033640 -4.6506340 -5.3339380 -3.8858910 -4.8167030 -4.6881830 -4.8358590 -5.3598010 -5.0118690

Table 6 . 8 Estimates of Precision Matrix, Second Model

1

2 3 4 5 6 7 8 9

10

758.8698000 0.8138938 1.1235380 1.4086680 0.9312764 1.0907000 1.2407220 1.0807830 1.0940570 1.1929490

664.1608000 0.4229654 -0.1353001 -0.2876567 -0.6302108 -0.7812513 -0.8738025 -0.8331743 -0.8739011 -0.9670018

564.1604000 0.4229654 -0.1353030 -0.2876605 -0.6302108 -0.7812513 -0.8737891 -0.8331743 -0.8739011 -0.9670018

585.1857000 6.8305330 2.5455650 0.7875984 0.8866733 0.9866041 1.0532150 1.1221170 1.2227610 1.3511640

AN EXAMPLE OF ADAPTIVE ESTIMATION

301

Table 6 . 8 (Continued)

11 12

1.3056440 1.3159190 1.3486230 1.3669840 1.4258790 1.5057740 1.3708450 1.4474810 1.4582770 1.5198770

-1.0688860 -1.0775630 -1.1105050 -1.0775690 -1.1277940 -1.1944110 -1.0020730 -1.0563220 -1.0213700 -1.0675250

-1.0688860 -1.0775630 -1.1105360 -1.0776020 -1.1278000 -1.1944110 -1.0020570 -1.0563470 -1.0213440 -1.0675060

1.4792010 1.5309000 1.6132620 1.5714780 1.6615020 1.7638520 1.5602130 1.6408530 1.6057760 1.6855690

23 24 25 26 27 28 29 30

1.4835190 1.5480850 1.5788550 1.5478350 1.5434750 1.6102210 1.5587260 1.6061540 1.6609260 1.6834240

-0.9850793 -1.0297110 -1.0468060 -1.0077560 -0.9992738 -1.0458600 -0.9563593 -0.9865548 -1.0198510 -1.0155720

-0.9849992 -1.0297320 -1.0468280 -1.0077670 -0.9992023 -1.0458350 -0.9563465 -0.9865147 -1.0198790 -1.0156010

1.6061190 1.6797060 1.7324680 1.7345170 1.7701100 1.8458860 1.7667630 1.8299230 1.8910060 1.9096880

31 32 33 34 35 36 37 38 39 40

1.7227500 1.7573640 1.8067600 1.8513450 1.8974430 1.8817690 1.9336460 1.9107270 1.9533640 1.9965790

-1.0394010 -1.0596560 -1.0860420 -1.1049870 -1.1316270 -1.1020990 -1.1328220 -1.0931890 -1.1179410 -1.1390550

-1.0393260 -1.0596250 -1.0860420 -1.1049220 -1.1316100 -1.1020650 -1.1328580 -1.0930800 -1.1179230 -1.1390170

1.9677200 2.0228830 2.0751410 2.1187620 2.1756700 2.1818270 2.2407580 2.2340560 2.2908620 2.3391290

41 42 43 44 45 46 47 48 49 50

1.9646730 2.0108160 2.0133390 2.0570000 2.0767010 1.9753590 2.0107060 2.0310540 2.0556070 2.0949360

-1.0760300 -1.0993710 -1.0870810 -1.1111050 -1.1023650 -0.9883214 -1.0061040 -1.0085630 -1.0163300 -1.0328120

-1.0759400 -1.0992560 -1.0869860 -1.1111490 -1.1021750 -0.9883955 -1.0061070 -1.0085220 -1.0163910 -1.0328550

2.2950500 2.3447920 2.3677530 2.4217210 2.4339880 2.3575520 2.4068760 2.4413430 2.4821100 2.5257250

13 14 15 16 17 18 19

20 21 22

LINEAR DYNAMIC SYSTEMS

302 211.2676 -78.169

-78.169 70.47254

Table 6.7 lists m (k) versus e (k ) and Table 6 . 8 gives the precision matrix P (k ) , which was computed according to (6 .9 9 ). The prior information about u and v is such that the prior param­ = v .u T = v nv \ v = v = 4 and 0(0) = (1 ,1 )T. u 1 v 2 1 2 An examination of Table 6 . 8 reveals the precision of the two com­ ponents of m (k) slowly increases to the value 2.094936 of the first component and 2.525725 for the second component and this causes more variation in the 0 (k ) and m (k ), k = 1,2, . . . , 50 of Table 6.7. More variation is present in e (k ) and m (k) (a s compared to the first e x ­ ample) because of the "small" system precision matrix V , given by (6.113).

eters are T

SUMMARY In this chapter, a strictly Bayesian approach to the analysis of linear dynamic systems was taken. The usual approach (see Maybeck, 1979) is to use both Bayesian and non-Bayesian methods of analysis. For example, the Kalman filter is usually derived from the Bayesian ap­ proach, but sometimes the Kalman filter is based on sampling theory where it is derived as a minimum mean squared error estimator. The analysis on estimation examined all phases of the estimation procedure, i . e . , of filtering, smoothing, and prediction, where the appropriate posterior distribution was derived. With filtering it was the posterior distribution of the present state and the Kalman filter was given by Theorem 6.1, Theorems 6.1 and 6.2 specify the pos­ terior distribution of the past and future states, thus giving a solu­ tion to the smoothing and prediction problems. A Bayesian control strategy was introduced (p p . 244-251), where the average value of the posterior mean of the one-step-ahead future states was equal to the predetermined target values. A recursive algorithm, similar to the Kalman filter for finding the values of the control variables, was given and shown to produce an operational on­ line way to control the linear dynamic system. The analysis of dynamic systems was extended to the case where the design function of the observation equation was nonlinear and the transition function of the systems equation was linear. It was shown that nonlinearities of this type produce filters which do not possess the advantages of the Kalman filter. The mean of the posterior distribution can be determined in a recursive fashion but must be calculated by numerical integration.

EXERCISES

303

To avoid the difficulty, a linear approximation (based on a firstorder Taylor's series) to the design function was taken and the re ­ sult was a recursive filter which can be easily computed, however, the effect of the error of approximation on the properties of the filter was not studied. The last part of the chapter examined a particular case of adaptive estimation, namely when the precision matrices of the observation and system errors are unknown. Unfortunately, the posterior distribution of the present state could not be expressed as a recursive relation, because the posterior distribution of the past and present states is a 2 / 0 multivariate poly-t distribution; therefore an approximate filter was devised. The approximation was based on the conditional posterior distribution of the present state given the past states. In this way, one may derive a recursive filter, which is similar to the Kalman filter; however, a normal approximation to the poly-t distribution was also employed. In addition, a control strategy based on the above approx­ imation was developed. Linear dynamic systems are an important class of linear models. These models are usually not studied by statisticians, but as has been shown, present one with challenging problems in the Bayesian approach to inference, and this chapter is only a v ery b rief introduction to the subject. Many interesting problems are waiting to be solved.

EXERCISES

1.

Let (i ) (ii)

Y (t ) = F ( t ) 0 (t ) + u (t ) 6 (t ) = G 0 (t - 1) + v ( t )

where t = 1 , 2 , . . . , that is, consider the linear dynamic model ( 6 . 1 ) and ( 6 . 2 ) , and that the initial state 0 ( 0 ) is known and also that F (t ) and G are known, where all quantities above in the model are scalar. Let u ( t ) ; t = 1,2, . . . be a sequence of n .i.d . ( 0 ,a

2

) variables and suppose v ( t ) , t = 1 , 2 , . . . is a sequence ^ 2 of n .i.d . (0 ,o v ) variables. Given k observations y ( l ) , y ( 2 ) , . . . y ( k ) , find the predictive distribution of y (k + 1 ) and explain how you would forecast a future observation.

2

2.

a and that they are known, v R eferring to the above system but now letting

Assume

o

2

=

LINEAR DYNAMIC SYSTEMS

304 (iii) Y (t ) = 0 (t ) Y (t - 1) + u (t ) (i v ) 0 ( t ) = G 0 (t - 1) + v ( t ) ,

3.

4.

5.

6.

that is above let F (t ) = Y (t - 1). This is a dynamic autoregres­ sive model of order one. Assuming u ( 1) , u ( 2), . . . and v ( 1), v ( 2 ) , . . . are independent random vectors estimate (filte r) the current state 0 (k ) using the first k observations Y ( 1 ) , Y ( 2 ) , . . . , Y (k ). Theorem 6.4 gives a solution to the control problem using a onestep-ahead strategy. Now use a two-step-ahead strategy and find values for the control variables X (0 ) and X ( l ) such that 0(1) and 0(2) are close to T ( l ) and T (2 ) respectively, where Y ( l ) and Y (2 ) have yet to be observed. Hint: Use the joint predic­ tive density of Y ( l ) and Y (2 ). Assume what is given in Theorem 6.4. Write a short essay on the difficulties of a Bayesian approach to nonlinear filtering and review the nonlinear regression section of Chapter 3. Propose alternative ways to filter the states of a dynamic linear system for the general case of adaptive estimation. See the sec­ tion entitled "Adaptive Filtering, the Vector Case" (p p . 286-290). Consider the linear dynamic model of (6 .1 ) and (6 .2 ): (a )

7.

Explain how the usual general linear model (1 .1 ) is a special case of the dynamic linear model. ( b ) Define the mixed model (4 .1 ) as a special case of the dy­ namic model. (c ) Is the autoregressive process A R (1 ) a special case of the dynamic linear model? If so, explain in what sense. Consider a dynamic autoregressive model Y (t ) = 0 x( t ) Y ( t -

1) + 02 (t ) Y ( t -

1) + u ( t ) , t = 1,2, . . .

with autoregressive parameters ©1 ( t ) and ©2 ( t ) which follow a bivariate A R (1 ) process

0 ( t ) = G 0 (t -

1) + V (t )

where 0(0) is known, G is a known 2 x 2 transition matrix and e (t) = [ 01 ( t ) , 02 ( t ) ] » , t = 1 , 2 , . . . Let u (t ) '" N (0 ,P V V (t ) N (0 ,P 1) , where P and P are known, 1 u 2 v u v P^ > 0 and Pv is a 2nd order positive definite precision matrix.

REFERENCES

305

Suppose u ( l ) , u ( 2), V ( 1 ), V (2 ), . . . are independent and develop a test of the null hypothesis = 0 , t = 1 , 2 , . . . , k, where k is a fixed positive integer.

REFERENCES

Alspach, D .L . (1974). "A parallel filtering algorithm for linear sy s­ tems with unknown time varying noise statistics," IEEE T ran s­ actions on Automatic Control, Vol. AC-^19, 555-556. Aoki, Masanao (1967). O ptim ization o f S to c h a stic S y s t e m s , T o p ics in D is c r e te Time S e r i e s , Academic Press, New York, London. Box, G. E. P. and G. C. Tiao (1973). B a yesia n In fere n ce in S ta tistica l A n a ly s is , Addison-Wesley, Reading, Mass. Downing, D .J ., D. H. Pike, and G. W. Morrison (1980). "A p ­ plication of the Kalman filter to inventory control," Techno­ metrics, Vol. 22, No. 1, pp. 17-22. Dreze, Jacques H. (1977). "Bayesian regression analysis using poly-t densities," in New D e velo p m en ts in th e A p plica tion o f B ayesian M eth o d s , edited by Ahmet Aykac and Carlo Bremat, North-Holland Publishing Company, Amsterdam. Harrison, P. J. and C. F. Stevens (1976). "Bayesian forecasting" (with discussion), Journal of the Royal Statistical Society, Series B , Vol. 38, 205-247. Hawkes, R. M. (1973). D em odulation o f P ulse M odulated S ignals U sing A d a p tiv e E stim ation , unpublished master's thesis, University of Newcastle. Jazwinski, Andrew H. (1970). S to c h a stic P ro c e s s e s an d F ilterin g T h e o r y , Academic Press, New York. Kailath, T . (1968). "An innovations approach to least squares estima­ tion— Part I: Linear filtering in addative white noise," IEEE Transactions on Automatic Control, Vol. A C - 13, 639-645. Kalman, R. E. (1960). "A new approach to linear filtering and p re ­ diction problem s," Transactions, ASME, J. of Basic Engineering, Vol. 82-D, pp. 34-45. Lainiotis, D. G. (1971). "Optimal adaptive estimation: structure and parameter adaptation," IEEE Transactions on Automatic Control, Vol. A C -16, pp . 160-170. Maybeck, Peter S. (1979). S to c h a stic M odels , E stim ation , and C on ­ trol, Volumes I and II, Academic Press, New York. Mehira, R. A . (1970). "Approaches to adaptive filterin g," IEEE Transactions on Automatic Control, Vol. A C - 17, pp. 693-698. Mehira, Raman K. (1970). "On the identification of variances and adaptive Kalman filterin g," IEEE Transactions on Automatic Control, Vol. A C - 15, pp. 175-184.

306

LINEAR DYNAMIC SYSTEMS

Plackett, R . L. (1950). "Some theorems in least squ ares," Bio­ metrika, Vol. 37, pp. 149-157. P ress, James S. (1972). A p p lied .M u ltiv a ria te A n a ly s is , Holt, Rinehart, and Winston, In c ., New York. Sallas, William M. and David A . Harville (1981). "Best linear recursive estimation for mixed linear models," Journal of American Statis­ tical Association, Vol. 76, pp. 860-869. Sawaragi, Yashikazu, Yoshifumi Sunaharo, and Takayoshi Nakamizo (1967). S ta tistic a l D ecision T h e o ry in A d a p tiv e C ontrol S y s te m s , Academic Press, New York, London. Shumway, R. H ., D. E. Olsen and L. J. Levy (1981). "Estimation and tests of hypotheses for the initial mean and covariance in the Kalman filter model," Communications in Statistics, Part A — Theory and Methods, Vol. A 10, No. 16, pp. 1625-1642. Zellner, A . (1971). A n In tro d u c tio n to B a yesia n I n fere n ce in E conom etrics, John Wiley and Sons, In c ., New York.

7 STRUCTURAL CHANGE IN LINEAR MODELS

INTRODUCTION

This chapter will give a brief introduction to linear models which e x ­ hibit a change in some or all of the parameters during the period of observing the data. These models are quite useful for some economic, social, physical, and biological processes. Since 1954, when Page discussed the problem of detecting a pa­ rameter change in the context of quality control, statistical techniques for examining structural change have been rapidly developing. Page (1954, 1955, 1957) continued to develop nonparametric p ro ­ cedures for detecting shifts in univariate sequences and is well-known for his cusums (cumulative sum s). A fter that, the next development was by Chernoff and Zacks (1964) and Kander and Zacks (1966), who use Bayesian methods to estimate the current mean and to test for a shift in the mean (H ft: y = y versus H : y ^ y ) of a sequence of u jl z a l z independent random variables. Bhattacharyya and Johnson (1968) study optimality criteria related to testing hypotheses of shifting pa­ rameters, while Bacon and Watts (1970) and Broemeling (1972) give a Bayesian analysis of shifting normal sequences and switching re g re s­ sion models and explain estimation and detection techniques. Much of the work in structural change is in the field of economics and the pioneering work of Quandt (1958, 1960, 1972, 1974). Goldfield and Quandt (1973a, 1973b, 1974) introduced techniques to analyze switching regression models. Other contributions to this area of structural change are Hinkley (1969, 1971), Farley, Hinich, and McGuire (1971, 1975), Chow (1960), Brown and Durbin (1968), who use recursive residuals to detect changes 307

STRUCTURAL CHANGE IN LINEAR MODELS

308

in regression parameters, Madala and Nelson (1974), and Hartley and Mallela (1975). For a review of estimation methods for structural change, Poirier (1976) gives a thorough survey. From the Bayesian viewpoint, there have also been many contribu­ tions. Bacon and Watts (1970) appears to be the first Bayesian study of switching regression models, while Broemeling (1972), as mentioned previously, studied sequences of normal random variables. The Bacon and Watts method employs a transition function to model possibly smooth changes from one regression line to the other, while Broemeling uses a shift point, which allows for more general changes in the mean. Other contributions to the Bayesian methods of analyzing struc­ tural change are Holbert (1973), Holbert and Broemeling (1977), Chin Choy (1977), Chin Choy and Broemeling (1980, 1981), Chi (1979), Salazar (1980), Salazar, Broemeling and Chi (1981), Smith (1975, 1977, 1980), Ferreira (1975), and Tsurumi (1977, 1978). This b rie f review fails to mention many contributions to struc­ tural change, but one must not fail to mention the many articles by Sen and Srivastava (1973, 1975a, 1975b, 1975c), who studied changes in the mean vector of multivariate normal sequences.

S H IF T IN G NORMAL SEQUENCES

A simple example of structural change is a normal sequence X ,X , . . . , J.

z

X

,X X of independent random variables, where the first m m+i n m observations have a n ( 0 1>T) distribution and the last (n — m) have a n (0 2 ,x ) distribution, where

e R, 0 g

e

R,

t

> 0, 01

4

and the

shift point m = 1,2, . . . , n — 1, where n > 3. The primary interest here is in m, the shift point and the p re - and postchange values of the mean. The likelihood function of the parameters is m L (0 1 >02 ,T,m|s) oc Tn / 2 exp -

| < ] £ (x . t i= l

e^ 2

n +

2 for (0 l t 02) e R

but ex

£ ( x - 0 ) 2}, i=m+l 1 L )

*

e2> T > °> and m = 1.2,

should one analyze structural change?

(7 .1 )

n -

1.

How

The posterior analysis will

SHIFTING NORMAL SEQUENCES

309

consist of deriving the marginal posterior mass function of m, and the marginal posterior densities of 0 , and t , but first one must se­ lect prior information for the parameters. It is difficult to find a family of conjugate densities for the param­ eters, however the form of the likelihood function suggests a quasiconjugate type prior density, namely, let and T have a normalgamma distribution, and suppose m is independent of 6

1,

2 1 ), i . e . , let 6 =

and has a uniform distribution over { 1 , 2 , . . . , n ( 01 ,e 2) ' , and p (e | t ) = t 2 / 2 exp

| ( e - y ) 'P ( 6 - y ) , 0 € R 2

-

be the prior density of

9 , and t

9,

(7 .2 )

given x, where P is a 2*2positive definite

2

matrix, and y e R are known hyperparameters. marginal prior density of t is P ( T) cc T01' 1 e~Te,

x> 0

p(m ) = (n — 1) 1,

m = 1,2, . . . , n -

Now suppose the

( a > 0, 6 > 0)

(7 .3 )

and 1,

(7 .4 )

the marginal posterior mass function of m. It seems imperative that the conditional prior density of 9 depend on m, namely through the hyperparameters, y and P , but it is difficult to devise such a density which conveniently combines with the likelihood function. Combining the likelihood function with the prior density gives the joint posterior density of the parameters. p (9,T ,m |s) « T(n+2+2a)/2 1 exp _ I { [ q - ^

1 (m )B (m )]lA (m )

A 1 (m )B (m )] + C(m ) — B f(m )A 1 (m )B (m )},

x [9 -

(7 .5 ) where s

=

( x ^ x ^ , . . ., x ^ ) is the sample,

m = l , 2 , . . . , n — l, and /m A (m ) = P + I \ 0

0

\ I n — m/

2

9

e R ,

9^

^

t

> 0,

STRUCTURAL CHANGE IN LINEAR MODELS

310

and n C(m ) = 23 + i C x .2 + y'Py .

i

1

Now e and x are easily eliminated from the joint density (7 .5 ), to give

P ( m Is )

a

-j /9

|A(m) |

_i

(r\+2a}l2 *

[C (m ) — B f(m )A 1 (m )B (m ) ] ( n ^

a}U

m = 1,2, . . . , n — 1

(7 .6 )

for the marginal posterior mass function of the shift point. Upon inspection of (7 .5 ), one sees the conditional posterior den­ sity of 0 , given m, after eliminating x , is a bivariate t distribution with density

p(e|m,s) = -----1 { [6 - A 1 (m )B (m )]'A (m ) [ 0 - A ' 1 (m )B (m )] + C(m ) - B '(m )A " 1 (m )B (m ) } ( n + 2 + 2 a ) / 2 (7 .7 )

2

for 0 e R . matrix is T(m ) =

The location parameter is A

(n + 2 a)A (m ) C(m ) - B ’( m ) A ' 1 (m )B (m )

and the degrees of freedom are n + 2a.

-1

(m )B (m ), the precision

SHIFTING NORMAL SEQUENCES

311

In the same way, one may show the conditional posterior distribu­ tion of t , given m, is a gamma with parameters (n + 2 a)/2 and [C (m ) B ,( m ) A ' 1 (m )B (m )]/ 2 . To find the marginal distributions of 9 and t , one may average the conditional distributions of 0 and t with respect to the marginal posterior mass function of m, thus for 9 n-1 p ( 01s) =

p(m | s)p (0 | m ,s),

9

e

R2

(7 .8 )

m=l where p(m |s) and p(9|m ,s) are given by (7 .6 ) and (7 .7 ), respec­ tively, is the marginal posterior density of 0 . This is a mixture of n — 1 conditional t densities and the mixing distribution is the mar­ ginal distribution of the shift point. A s for x, one has n- 1 p(x|s) =

^ m=l

p(m |s)p( x|m ,s),

x > 0,

(7 .9 )

where p(x|m ,s) is a gamma density with parameters a(m) = (n + 2 a )/ 2 , and 3 (m) = [C (m ) — B !(m )A 1 (m )B (m )]/ 2 . The above may be summarized by the following theorem.

Theorem

7.1.

Consider a sequence X ,X , . . . , X of independent i z n normal random variables, where n ( 0 ^, x *) ,

X.

i = 1,2, . . . , m and

! X .^ n (0

1

z

,x

(7.10) ),

i = m + 1, . . . , n

where 01t 0O, x, and m are unknown, and 01 e R, 0

1

£

x > 0 , and m = l , 2 , . . . , n — 1 . Suppose a priori that 0 = ( 0

1

2

€ R, ei ^ 6 9, 1 Z

, 0 O) T and x is a normal gamma with

2

parameters y , P , a and 3 , where y e R , P is a 2 x 2 positive definite symmetric matrix and that a > 0 and 3 > 0 and suppose m is uniform over { 1 , 2 , . . . , n — 1 } and independent of ( 0 , x ) , then the joint pos­ terior density of all the parameters is given by (7 .5 ), the marginal posterior mass function of the shift point by (7 .6 ), the marginal posterior mass function of 0 is a mixture of conditional t densities, given by (7 .8 ), and (7 .9 ) expresses the marginal posterior density of x as a mixture of (n — 1 ) conditional gamma densities.

STRUCTURAL CHANGE IN LINEAR MODELS

312

We will see that these results extend in a straightforward way to two-phase regression and more general linear models, including time series processes. How does one represent vague prior information when one has a shifting sequence of random variables? If one lets P 0( 2 x 2), £ 0 , and a - 1 in the joint posterior distribution (7 .5 ), of all the parameters, then the joint posterior density of the parameters is the one that would have been obtained if one had used p (e ,x ,m ) « (n — 1 )

\

t > 0,

0 6 R^, m = 1,2 . . . , n -

for the prior density of the parameters. may prove the following

1

(7.11)

Under this assumption, one

C orollary 7.1. Consider a shifting sequence (7.10) of normal random variables and suppose the prior distribution of 0 , 0 , x, and m is 1

L

given by the density (7 .1 1 ), then the marginal posterior mass function of m is

1

p*(m |s) a

/m(n-m) [C *(m ) - B * ’( m ) A * ' 1 (m )B *(m ) ] ( n ' 2 ) / 2 m = 1,2 . . . , n where n > 3 and

0 0

B *(m )

( and n C *(m )

n —m

1

(7.12)

SHIFTING NORMAL SEQUENCES

313

A lso , the marginal posterior density of 0 is the mixture n-1 p (e | s ) cc ^ p*(m | s)p *(e | m ,s), m=l

0 € R2

(7.13)

of n — 1 conditional t densities p * ( 0 |m,s), which have n — 2 degrees of freedom, location vector A * 1 (m )B *(m ) and precision T * (m ) = ----------- O L n 2 ) A * ( m)

As for

t,

its marginal posterior distribution is a mixture of n —

1 conditional gamma densities p * (i| m ,s ), which have parameters (n -

2)/2 and B(m) = [C *(m ) - B * '(n i )A * ' 1(m )B *(m )]/ 2 ,

and the mixing distribution is the posterior mass function p*(m |s) of m. The above theorem and corollary provide a way to make inferences about the parameters of the model and this will be given in detail for the two-phase regression model, but for now let us study the fore­ casting problem for a shifting normal sequence. Suppose S = (x ^ X g , , x n > represents the observed values of a shifting normal sequence (7 .1 0 ), and that X n+1 is a future value one wants to forecast, then what is the Bayesian predictive density of the future observation? To find this distribution the joint density of X „ ,X 0, . . . , X , X t 1 , . . . , X , X _Li given 0,t , and m will be mul1 2 m m+1 n n +1 ° tip lied by the proper prior density (7 .2 ) of the parameters, then the product integrated with respect to 0 , t , and m, and the result will be the marginal joint density of the observations and the future value Xn + r namely» n-1 p (m | s )g (x 1 ......... x n ,x n+ 1 |m),

g ( x 1( . . . , x n ,x n+1) = m=l

(7.14) for ( x ^ x ^ , . . . , x n+j ) e R of m given by (7 .1 2 ), and

, p(m |x) is the posterior mass function

STRUCTURAL CHANGE IN LINEAR MODELS

314 g (x r

x n+ 1 |m) -

1

(7 .1 5)

D 1 /2 (m )[F (tn ) - E '(m )D ' 1 (m )E (m ) ] ( n + 2 a + 1 ) / 2 is the conditional density of the observations and x n+-^» given m, where

D (m ) =

/m 0 \ ( ) \ 0 n - m + 1/

+

f,

m x. l E(m) = 1

I

+ P,

n

x1

m+1

i

X

n+1

and

F(m ) = 2$ + u 1Pp. +

2^t

o 2 x - + xn+i*

Since F(m ) - ET(m )D 1 (m )E (m ) is a quadratic function of xn+1> the predictive density of

is a mixture of n - 1 conditional t den­

sities (7.15) and the mixing distribution is the marginal posterior dis­ tribution of the shift point m with density (7 .1 2 ). Each conditional density has n + 2a degrees of freedom. What are the other parameters? To forecast a future value, one averages n — 1 conditional fore­ casts with respect to the marginal posterior probabilities of the pos­ sible change points, i . e . , one assumes the change occurred at m = 1 , then forecasts with the conditional t density (7.15) letting m = 1 , then one assumes a change occurred at m = 2 , and forecasts with the conditional distribution of Xn+1 given s and m = 2 , and so on until one completes n — 1 forecasts. Each forecast is weighted by the appropriate posterior probability of a shift occurring at that point. For further details about the posterior analysis of shifting normal sequences, the reader is referred to Holbert (1973) and Broemeling (1974), and for Bayesian forecasting methods with normal sequences, Broemeling (1977).

STRUCTURAL CHANGE IN LINEAR MODELS

315

S T R U C T U R A L CHANCE IN LIN EA R MODELS In tro d u ctio n Structural change in linear models is easily generalized from changing normal sequences. A special case of the changing linear model is twophase regression, where Yi = a i + 3 i xi + ei ,

i = 1 , 2 , . . . , m and

Y i = a 2 + &2x. + e.,

i = m + 1 , . . . , n,

(7.16)

where the e.'s are i .i .d . n ( 0 ,x 1) , the x .Ts are the values of a concom­ itant variable x , the regression coefficients satisfy ( a 1 , 3 1) ^ (a X X

Z

,3 ), Z

and m is the shift point such that if m = n, there is no shift, but when m = 1,2, . . . , n — 1 exactly one shift has occurred. This model has been studied from classical and Bayesian perspectives. For example, Quandt (1958) estimated the switch point m and the regression param­ eters by maximum likelihood, and Hinkley (1969, 1971), under the as­ sumption that the structural change was continuous, estimated and made other inferences about the intersection y = ( a 2 “ a i ^ ^ i ^ ^ between the two regression lines. From the Bayesian viewpoint, Holbert (1973) and Holbert and Broemeling (1977) studied two-phase regression by assigning a mar­ ginal uniform proper prior to the shift point and an improper prior for the unknown regression parameters. Ferreira (1975) also assigned a vague-type prior distribution to the unknown regression parameters and assigned three prior mass functions to the shift point and developed some sampling properties for the estimated shift point m. Some other studies related to two-phase regression problems are Quandt (1960), Sprent (1961), Hudson (1966), Feder (1975), Farley, Hinick, and McGuire (1975), and Brown, Durbin, and Evans (1975). From a non-Bayesian viewpoint, the changing linear model p re ­ sents analytical and numerical difficulties and to avoid such problems, the Bayesian approach, as will be seen, provides one with some ad­ vantages. The earliest Bayesian solutions used improper prior distributions with the result that the marginal posterior mass function of the shift point did not exist at all possible values, thus it is important to r e ­ examine the problem with proper prior distributions. Essentially, there are two problems associated with changing linear models: detecting the change (what is m?) and making inferences about the shift point and the other parameters, assuming a change has occured, where m = 1,2, . . . , n — 1. In the first problem, one tests the hypothesis m = n versus the alternative m ^ n , but presently we will deal

STRUCTURAL CHANGE IN LINEAR MODELS

316

with only the second problem. For details about the second problem, see Broemeling (1972), Smith (1975), Chin Choy (1977), and Chin Choy and Broemeling (1981). In many practical problems either the data them­ selves will validate the assumption that there is a change or there will be reasons which make this assumption reasonable. For example in biological systems, the threshold level of a chemical is specific, i .e ., the response of the system to the chemical is additive to the threshold level, and after the level is attained, the response stays constant or the chemical becomes toxic, thereby resulting in a decreasing response with increasing concentration. Ohki (1974) found the top growth of cotton increased sharply with a very slight increase of manganese con­ tent of the blade tissue. A fter reaching the inflection point of the nu­ trient calibration curve, the manganese content of the blade tissue in­ creased sharply with no increase in plant growth. Some other examples of this problem can be seen in the papers of Green and Fekette (1933), Needham (1935), and Pool and Borchgrevinck (1964). In what is to follow, the posterior analysis of a changing linear model will be presented, then an example will be given in detail which will illustrate how one makes inferences about the parameters of the model.

T h e P osterior A nalysis Consider a sequence Y ,Y 0, . . . , Y of normally and independently l m n distributed random variables, where ^ i = Xi ^ l + ei >

* = * ’ ^ ’ ’ ’ ’ * m an 0. This model can be written as

and the precision param­

y^(m ) = x 1 (m )3 1 + e 1? (7.18) y 2(m ) = x 2(m )3 2 + e 2> or

STRUCTURAL CHANGE IN LINEAR MODELS

317

y = x(m ) 3 + e where y .(m ), x .(m ), 3 ., and e., i = 1 , 2 , denote the observation vec­ tor, design matrix, parameter vector, and error vector of the i-th model. A lso , y = [y f1 (m ),y f2 ( m ) ] t, 3 = (B ^ B 'g )1, e = ( e ^ e ^ ) ',

Note y ^ m ) is m * 1, Y 2 (m ) *s (n — m) x l , x ^ m ) is m x p, x 2 (m ) is (n — m) x p , 3 ^ is p x l, 3 2 is p x l, e^ is m x l, e 2 is (n — m) x l , and

and

are matrices of zeros.

The probability density function of y given m, 3 and t is

f (y |m, 3» t) * Tn / 2 exp -

| [y -

x (m )3 ]f [y -

x (m )3 ],

(7.19)

2d where x > 0, 3 e R > and m = l , 2 , . . . , n — 1 . In this presentation, only the most general case will be con­ sidered, where all the parameters are unknown. For this and special cases, one may refer to Chin Choy (1977). The posterior analysis is given by the following theorem.

T heorem 7.2. If model (7.17) holds and m, 3, and t are unknown and if m is uniformly distributed over I ^ = { 1 , 2 , . . . , n — 1 } and the joint marginal prior distribution of 3 and t is such that: the condi­ tional distribution of 3 given t is normal with mean y and precision matrix xr ( t > 0 ) where r is a given 2p * 2p positive definite matrix and y a 2p x l constant vector, and the marginal prior distribution of x is gamma with parameters a > 0 and b > 0 , and m is independent of ( 3 , t ) , then (i)

The posterior probability mass function of m is Tr(m|y) « D

-a *

(m) |x?(m )x (m ) + r|

- 1/2

,

m = 1,2, . . . , n (7.20)

where a* = a + n/2,

318

STRUCTURAL CHANGE IN LINEAR MODELS

D (m ) = b + | {y ’y + y ’ry -

3 *’ (m )[x t(m )x (m ) + r ] 3 * (m )},

and

3 *(m ) = [ x ,(m )x (m ) + r ] ^[ry + x f(m )y ]. (ii)

The posterior density function of 3 is n-1 tt( 3 |y) a

S t [ 3 ; 2p , 2 a * , 3 * (m ),p (m )] •Tr(mly), m=l 3 € R 2p

(7.21)

where t [ 3 ; 2p , 2 a*, 3 *(m ) ,p (m )] is a 2p-dimensional t-density with 2a* degrees of freedom, location 3 *(m ) and precision matrix p(m ) = [a * / p (m )][x !(m )x (m ) + r ] . Partition the vector 3, the location 3 * (m ), and the precision ma­ trix p(m ) as

'I \

A>n (m ) ,

p*(m ) = j

p 1 2 (m)>

1 , p(m ) =|

\ ° 2 {m)/

\ P 2 l (m )

P 2 2 (mV

where 3 . and a?(ni), i = 1 , 2 , are each p * 1 and the p ..(m ), i,j = 1 , 2 , are p x p , then (iii) The marginal posterior density of 3 ^ is n-1 " (e jy ) a

E

iC t [ 3 1 ; p , 2 a * , a * (m ),p * (m )] 7r(m |y), m=l

(7 .22) where

STRUCTURAL CHANGE IN LINEAR MODELS (i v )

319

In a similar fashion, the marginal posterior density of 3

z

is

n -1 ■^(B2 |y> “

2 t [8 2 ;p ,2 a * ,a * (m ),p * (m )] •ir(m ly), m=l

(7.23)

where

Pgtm) = P 2 2 (m) " P 2 i (m )P i i (m )P i 2 (m )(v )

The marginal posterior density of

t

is

n-1 ir(i|y) «

^ g [ x ;a * ,D (m )] ir(m | y )], m=l

t

> 0,

(7.24)

where g [T ,a * ,D (m )] is a gamma density with parameters a* and D (m ).

P roof.

Combining the likelihood function (7.19) with the prior distribu­ tion of the parameters gives Tr(m,e,T|y) 0 , for the joint posterior density of

the parameters. If (7.25) is integrated with respect to ( 3 , t ) , (m ,x ); and (m , 3 ) , one obtains the marginal posterior p . d . f . of m, 3 , and t respectively. Using the properties of the multivariate t distribution, see DeGroot (1970), the posterior p . d . f . of 3., i = 1,2, is obtained from (7 .2 1 ). 1 As a corollary to the theorem, one may show:

Corollary 7.2.

(i) The conditional posterior p.d.f. of β given m is

   π(β|y, m) ∝ t[β; 2p, 2a*, β*(m), p(m)],   β ∈ R^{2p}.                           (7.26)

(ii) The conditional posterior p.d.f. of β_1 given m is

   π(β_1|y, m) ∝ t[β_1; p, 2a*, β*_1(m), p*_1(m)],   β_1 ∈ R^p.                    (7.27)

(iii) The conditional posterior density of τ given m is

   π(τ|y, m) = g[τ; a*, D(m)].                                                     (7.28)

These formulas for the marginal and conditional posterior distributions of the shift point (7.20), regression coefficients (7.21), (7.26), and precision (7.24), (7.28) will be illustrated with data generated from a model which has known parameters.

Marginal and Conditional Moments

To estimate the parameters one may use several estimators corresponding to different loss functions. It is well known that with a squared error loss function, the Bayes estimator is the mean of the posterior distribution and the Bayes risk is the average variance of the posterior distribution. The mean and dispersion matrix of the conditional posterior distributions of β_1, β_2, and τ given m are

   E(β_1|y, m) = β*_1(m),   E(β_2|y, m) = β*_2(m),   E(τ|y, m) = a*/D(m),          (7.29)
   Cov(β_1|y, m) = a*(a* - 1)^{-1}[p*_1(m)]^{-1},
   Cov(β_2|y, m) = a*(a* - 1)^{-1}[p*_2(m)]^{-1},
   Var(τ|y, m) = a*/[D(m)]^2.

The mean and dispersion matrix of the marginal posterior distributions of β_1, β_2, and τ are

   E(τ|y) = Σ_{m=1}^{n-1} π(m|y) E(τ|m, y),                                        (7.30)
   Cov(β_1|y) = E_m Cov(β_1|m, y) + Cov_m E(β_1|m, y),
   Cov(β_2|y) = E_m Cov(β_2|m, y) + Cov_m E(β_2|m, y),
   Var(τ|y) = E_m Var(τ|m, y) + Var_m E(τ|m, y),

where E_m, Cov_m, and Var_m denote that the mean, dispersion, and variance are taken with respect to the marginal distribution of m.
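All of the quantities in (7.20), (7.29), and (7.30) are simple functions of β*(m), D(m), and π(m|y), so they are easy to compute. The sketch below is only an illustration of Theorem 7.2 in Python/NumPy; the function name, the vectorized layout, and the use of log-scale normalization are my own assumptions and not part of the original presentation.

```python
import numpy as np
from numpy.linalg import inv, slogdet

def two_phase_posterior(y, X, p, mu, r, a, b):
    """Minimal sketch of Theorem 7.2: posterior mass of the shift point m and
    the conditional quantities beta*(m), D(m) for a two-phase linear model
    under a normal-gamma prior (mean mu, precision tau*r, tau ~ gamma(a, b))."""
    n = len(y)
    a_star = a + n / 2.0
    log_post, bstar, D = [], [], []
    for m in range(1, n):                      # m = 1, ..., n-1
        # block-diagonal design x(m): first m rows use beta_1, the rest beta_2
        Xm = np.zeros((n, 2 * p))
        Xm[:m, :p] = X[:m]
        Xm[m:, p:] = X[m:]
        A = Xm.T @ Xm + r                      # x'(m)x(m) + r
        bs = inv(A) @ (r @ mu + Xm.T @ y)      # beta*(m), eq. below (7.20)
        Dm = b + 0.5 * (y @ y + mu @ r @ mu - bs @ A @ bs)
        log_post.append(-a_star * np.log(Dm) - 0.5 * slogdet(A)[1])
        bstar.append(bs)
        D.append(Dm)
    w = np.exp(np.array(log_post) - max(log_post))
    return w / w.sum(), bstar, np.array(D)     # pi(m|y), beta*(m), D(m)
```

The mixture moments (7.30) then follow by weighting the conditional means and dispersions, computed from β*(m) and D(m) as in (7.29), by the returned mass function.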

A Numerical Example

In this section, an example given by Chin Choy (1977) will illustrate the method of estimating the shift point and all unknown regression parameters of a two-phase regression model. This example uses the data generated by Quandt (1958) shown in Table 7.1. These data are twenty observations generated from the model

   Y_i = 2.5 + 0.7x_i + e_i,   i = 1, 2, ..., 12,
   Y_i = 5.0 + 0.5x_i + e_i,   i = 13, ..., 20,                                    (7.31)

where the e_i's are i.i.d. n(0, 1). Assume that the two-phase regression model is

   Y_i = α_1 + β_1 x_i + e_i,   i = 1, 2, ..., m,
   Y_i = α_2 + β_2 x_i + e_i,   i = m + 1, ..., n,                                 (7.32)

where the e_i's are n.i.d. (0, τ^{-1}), and τ and β = (α_1, β_1, α_2, β_2)' are unknown. The first part of the example demonstrates the sensitivity of the marginal posterior distribution of m to different prior distributions, and the second part consists of providing inferences for all the parameters.

Table 7.1 Quandt's Data Set

Obs. no. (i)   y_i       x_i     Obs. no. (i)   y_i       x_i
 1              3.473     4      11             13.036    15
 2             11.555    13      12              8.264    11
 3              5.714     5      13              7.612     3
 4              5.710     2      14             11.802    14
 5              6.046     6      15             12.551    16
 6              7.650     8      16             10.296    10
 7              3.140     1      17             10.014     7
 8             10.312    12      18             15.472    19
 9             13.353    17      19             15.650    18
10             17.197    20      20              9.871     9


Sensitivity Analysis

Two sources of prior information are used, where the first is data based; that is, the parameters μ, r, a, and b of the normal-gamma prior density are determined from the data. In order to obtain parameter values, assume the shift is near 12 and group the first nine consecutive observations into three sets and the last six into two sets, with three observations per set. Based on the three observations in each set, a regression analysis is performed and the usual least squares estimators α̂, β̂, and σ̂² (σ² = τ^{-1}) are obtained. From the first three sets, the mean and variance of α̂ and β̂ and the covariance between α̂ and β̂ are calculated; these values are used for the prior parameters of the first regression, and in the same way, using the last two sets, values for the prior parameters of the second regression are found. The numerical results are shown in Table 7.2, and from this table the prior mean vector of β is μ = (2.7523, 0.5878, 6.1420, 0.4440)'. The coefficients of the first regression are assumed independent of those of the second, so the covariance matrix of β is block diagonal.

Now τ has a gamma distribution with parameters a and b, thus σ² = τ^{-1} has an inverse gamma distribution with parameters a and b, and by the method of moments two estimators of a and b are used. One is based on the five values of σ̂² in Table 7.2 and the other on the five values of τ̂ = 1/σ̂² of the same table. For the first, the estimates are a_1 = 3.5032 and b_1 = 1.0550, and for the second method the estimates based on τ̂ are a_2 = 0.3625 and b_2 = .0226. Since β has a four-variate t distribution with mean μ and precision (a/b)r, it can be shown that the dispersion matrix of β is cov(β) = b(a - 1)^{-1} r^{-1} when a > 1. If one uses the second set of estimates a_2 and b_2, cov(β) does not exist; thus the first set of estimates a_1 and b_1 are used, giving, for the marginal prior dispersion matrix of β,

   cov(β) = [   17.4064    130.2590       0            0       ]
            [  130.2590    980.3119       0            0       ]
            [      0           0       422.1041    4439.3427   ]
            [      0           0      4439.3427   46698.8884   ].

Table 7.2 Least Squares Estimators for Each of the Five Sets

Set No.   Observations    α̂        β̂        σ̂²        τ̂ = 1/σ̂²
1         (1, 2, 3)       0.8009   0.8336    1.0007     0.9993
2         (4, 5, 6)       4.9266   0.2891    0.5892     1.6973
3         (7, 8, 9)       2.5294   0.6406    0.0144    69.2712
4         (15, 16, 17)    7.7053   0.2953    0.2043     4.8960
5         (18, 19, 20)    4.5787   0.5925    0.2993     3.3407

Sets 1-3:  Mean(α̂) = 2.7523, Var(α̂) = 4.2926;  Mean(β̂) = 0.5878, Var(β̂) = 0.0762;  Cov(α̂, β̂) = -0.5704
Sets 4-5:  Mean(α̂) = 6.1420, Var(α̂) = 4.8878;  Mean(β̂) = 0.4440, Var(β̂) = 0.0442;  Cov(α̂, β̂) = -0.4647
Method of moments:  a_1 = 3.5027, b_1 = 1.0550 (from the σ̂²);  a_2 = 0.3623, b_2 = 0.0226 (from the τ̂)
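The grouping-and-moment-matching step summarized in Table 7.2 is easy to mechanize. The sketch below is my own illustration (the function names and the use of NumPy are assumptions, not the book's): it fits a least squares line to each small group and then converts the σ̂² values into (a, b) by matching inverse-gamma moments, which approximately reproduces a_1 ≈ 3.50 and b_1 ≈ 1.06 for Quandt's data.

```python
import numpy as np

def data_based_prior(x, y, groups):
    """Fit a straight line to each small group of observations (as in Table 7.2).
    `groups` is a list of index arrays, e.g. [[0, 1, 2], [3, 4, 5], [6, 7, 8]]."""
    alpha, beta, s2 = [], [], []
    for g in groups:
        X = np.column_stack([np.ones(len(g)), x[g]])
        coef, rss, *_ = np.linalg.lstsq(X, y[g], rcond=None)
        alpha.append(coef[0])
        beta.append(coef[1])
        s2.append(float(rss[0]) / (len(g) - 2))    # sigma-hat^2 with n - 2 d.f.
    return np.array(alpha), np.array(beta), np.array(s2)

def moments_for_gamma(s2):
    """Method-of-moments (a, b) treating the s2 values as draws of an inverse
    gamma variance: E = b/(a - 1), Var = b^2 / [(a - 1)^2 (a - 2)]."""
    m, v = s2.mean(), s2.var()
    a = 2.0 + m * m / v
    return a, m * (a - 1.0)
```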


The values a, b, μ, and r complete the specification of the prior normal-gamma distribution. Using these values for μ, r, a, and b, the marginal p.m.f. of m is calculated and shown in Table 7.3; the p.m.f. of m at 12 is 0.8728, an extremely high probability, and it appears unreasonable to believe the shift did not occur at m = 12.

Now consider two experiments to see the sensitivity of the marginal posterior mass function of m to the prior parameters: the mean μ of β, the precision matrix r(β) = (a/b)r of β (or the covariance matrix of β, cov(β) = [b/(a - 1)]r^{-1}), and the parameters a and b of τ, where E(τ) = a/b and var(τ) = a/b². Also note the variance σ² = τ^{-1} has mean and variance E(σ²) = b(a - 1)^{-1} and var(σ²) = b²(a - 1)^{-2}(a - 2)^{-1}. The two experiments were as follows.

Experiment One. The values of the parameters are assigned as follows:
1. μ = (2.5, 0.7, 5, 0.5)',
2. r(β) = λI_4 with λ = .01, .1, 1, 10, 100, so that all regression coefficients are uncorrelated a priori,
3. E(τ) = 1,
4. var(τ) = .01, 0.1, 1, 10, 100.

Once the values of these four quantities are assigned, the values of μ, r, a, and b are known, and the result is 25 different prior distributions. Based on each prior, the p.m.f. of m is calculated as shown in Tables 7.4-7.8.

Experiment Two. The values of the parameters are
1. μ = (2.5, 0.7, 5, 0.5)',
2. cov(β) = νI_4, ν = .01, .1, 1, 10, 100,
3. E(σ²) = 1,
4. var(σ²) = .01, .1, 1, 10, 100.

Once the values of μ, cov(β), E(σ²), and var(σ²) are assigned, one knows the values of μ, r, a, and b, and for each of the 25 sets of these one has a marginal p.m.f. of m, which is tabulated in Tables 7.9-7.13. A small sketch of how such a grid of priors can be generated follows.
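The sketch below constructs the 25 prior settings of Experiment One; it is only an illustration (the function name, the use of Python, and the dictionary layout are my assumptions). Each setting converts E(τ) and var(τ) into the gamma parameters and recovers r from r(β) = (a/b)r = λI_4; every setting can then be passed to the posterior routine sketched earlier to reproduce a table such as Table 7.4.

```python
import numpy as np
from itertools import product

def experiment_one_priors(lams=(0.01, 0.1, 1, 10, 100),
                          var_taus=(0.01, 0.1, 1, 10, 100), e_tau=1.0):
    """Grid of Experiment One priors: a = E(tau)^2/var(tau), b = E(tau)/var(tau),
    and r chosen so that the prior precision of beta, (a/b) r, equals lambda*I_4."""
    priors = []
    for lam, v in product(lams, var_taus):
        a, b = e_tau ** 2 / v, e_tau / v
        r = (b / a) * lam * np.eye(4)          # (a/b) r = lambda * I_4
        mu = np.array([2.5, 0.7, 5.0, 0.5])
        priors.append(dict(mu=mu, r=r, a=a, b=b, lam=lam, var_tau=v))
    return priors
```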

Table 7.3 Posterior Distribution of m for the Data-Based Prior

m    p.m.f.      m    p.m.f.
1    0.0002     11    0.0244
2    0.0002     12    0.8728
3    0.0004     13    0.0017
4    0.0000     14    0.0022
5    0.0002     15    0.0039
6    0.0006     16    0.0019
7    0.0117     17    0.0001
8    0.0234     18    0.0001
9    0.0279     19    0.0002
10   0.0280

Table 7.4 Posterior Distribution of m When λ = 0.01, E(τ) = 1, and μ = (2.5, 0.7, 5, 0.5)'

         var(τ)
m      0.01     0.1      1        10       100
1      0.3505   0.3480   0.3410   0.3390   0.3389
2      0.0159   0.0158   0.0155   0.0154   0.0154
3      0.0132   0.0131   0.0128   0.0128   0.0127
4      0.0010   0.0013   0.0016   0.0017   0.0017
5      0.0021   0.0024   0.0027   0.0027   0.0027
6      0.0034   0.0037   0.0039   0.0039   0.0040
7      0.0186   0.0177   0.0167   0.0165   0.0164
8      0.0274   0.0254   0.0234   0.0230   0.0230
9      0.0279   0.0257   0.0235   0.0230   0.0230
10     0.0362   0.0331   0.0301   0.0295   0.0295
11     0.0347   0.0318   0.0289   0.0283   0.0283
12     0.4077   0.4173   0.4323   0.4360   0.4364
13     0.0054   0.0055   0.0054   0.0054   0.0054
14     0.0078   0.0077   0.0075   0.0074   0.0074
15     0.0169   0.0159   0.0149   0.0146   0.0146
16     0.0146   0.0139   0.0132   0.0130   0.0130
17     0.0015   0.0019   0.0022   0.0023   0.0023
18     0.0022   0.0028   0.0032   0.0033   0.0033
19     0.0129   0.0172   0.0211   0.0219   0.0220

Table 7.5 Posterior Distribution of m When λ = 0.1, E(τ) = 1, and μ = (2.5, 0.7, 5, 0.5)'

         var(τ)
m      0.01     0.1      1        10       100
1      0.1454   0.1458   0.1436   0.1429   0.1428
2      0.0151   0.0155   0.0155   0.0155   0.0155
3      0.0151   0.0142   0.0150   0.0150   0.0149
4      0.0013   0.0017   0.0021   0.0022   0.0022
5      0.0027   0.0032   0.0036   0.0037   0.0037
6      0.0045   0.0050   0.0053   0.0053   0.0053
7      0.0259   0.0246   0.0232   0.0229   0.0228
8      0.0381   0.0354   0.0326   0.0320   0.0320
9      0.0389   0.0358   0.0328   0.0321   0.0321
10     0.0500   0.0458   0.0417   0.0408   0.0407
11     0.0480   0.0440   0.0400   0.0392   0.0391
12     0.5459   0.5580   0.5745   0.5784   0.5789
13     0.0072   0.0073   0.0072   0.0072   0.0072
14     0.0103   0.0102   0.0099   0.0098   0.0098
15     0.0225   0.0212   0.0198   0.0196   0.0195
16     0.0189   0.0181   0.0171   0.0169   0.0169
17     0.0018   0.0023   0.0027   0.0028   0.0028
18     0.0026   0.0032   0.0038   0.0039   0.0039
19     0.0058   0.0077   0.0095   0.0099   0.0099

Table 7.6 Posterior Distribution of m When λ = 1, E(τ) = 1, and μ = (2.5, 0.7, 5, 0.5)'

         var(τ)
m      0.01     0.1      1        10       100
1      0.0384   0.0419   0.0444   0.0448   0.0449
2      0.0040   0.0050   0.0057   0.0059   0.0059
3      0.0071   0.0080   0.0087   0.0088   0.0088
4      0.0009   0.0014   0.0019   0.0020   0.0020
5      0.0025   0.0031   0.0037   0.0038   0.0038
6      0.0046   0.0053   0.0059   0.0060   0.0060
7      0.0351   0.0338   0.0324   0.0320   0.0320
8      0.0519   0.0489   0.0457   0.0451   0.0450
9      0.0541   0.0505   0.0469   0.0462   0.0461
10     0.0659   0.0613   0.0566   0.0557   0.0556
11     0.0629   0.0585   0.0541   0.0532   0.0531
12     0.6036   0.6120   0.6233   0.6259   0.6262
13     0.0079   0.0082   0.0084   0.0084   0.0084
14     0.0113   0.0114   0.0113   0.0113   0.0113
15     0.0250   0.0240   0.0229   0.0229   0.0226
16     0.0186   0.0183   0.0177   0.0177   0.0176
17     0.0014   0.0018   0.0023   0.0023   0.0023
18     0.0019   0.0025   0.0030   0.0030   0.0031
19     0.0029   0.0040   0.0051   0.0051   0.0053

Table 7.7 Posterior Distribution of m When λ = 10, E(τ) = 1, and μ = (2.5, 0.7, 5, 0.5)'

         var(τ)
m      0.01     0.1      1        10       100
1      0.0026   0.0042   0.0060   0.0063   0.0064
2      0.0005   0.0008   0.0013   0.0014   0.0014
3      0.0015   0.0022   0.0030   0.0031   0.0032
4      0.0003   0.0005   0.0009   0.0010   0.0010
5      0.0013   0.0019   0.0026   0.0027   0.0028
6      0.0033   0.0042   0.0050   0.0052   0.0052
7      0.0431   0.0426   0.0418   0.0416   0.0416
8      0.0647   0.0624   0.0598   0.0593   0.0592
9      0.0702   0.0671   0.0639   0.0632   0.0632
10     0.0817   0.0777   0.0737   0.0728   0.0727
11     0.0776   0.0739   0.0701   0.0693   0.0692
12     0.5888   0.5944   0.6012   0.6027   0.6028
13     0.0081   0.0088   0.0093   0.0094   0.0094
14     0.0114   0.0119   0.0123   0.0124   0.0124
15     0.0247   0.0246   0.0243   0.0243   0.0243
16     0.0164   0.0169   0.0171   0.0171   0.0171
17     0.0008   0.0013   0.0017   0.0019   0.0018
18     0.0011   0.0017   0.0022   0.0024   0.0024
19     0.0018   0.0027   0.0038   0.0040   0.0040

Table 7.8 Posterior Distribution of m When λ = 100, E(τ) = 1, and μ = (2.5, 0.7, 5, 0.5)'

         var(τ)
m      0.01     0.1      1        10       100
1      0.0003   0.0006   0.0011   0.0012   0.0013
2      0.0002   0.0004   0.0007   0.0008   0.0008
3      0.0008   0.0013   0.0019   0.0020   0.0020
4      0.0002   0.0004   0.0007   0.0007   0.0007
5      0.0010   0.0015   0.0022   0.0023   0.0023
6      0.0027   0.0036   0.0045   0.0047   0.0047
7      0.0425   0.0426   0.0424   0.0424   0.0424
8      0.0644   0.0629   0.0612   0.0609   0.0608
9      0.0694   0.0673   0.0650   0.0645   0.0645
10     0.0918   0.0882   0.0846   0.0838   0.0838
11     0.0881   0.0847   0.0812   0.0805   0.0804
12     0.5822   0.5850   0.5885   0.5893   0.5894
13     0.0083   0.0092   0.0100   0.0101   0.0101
14     0.0111   0.0120   0.0127   0.0129   0.0129
15     0.0212   0.0218   0.0222   0.0222   0.0223
16     0.0134   0.0143   0.0150   0.0152   0.0152
17     0.0006   0.0010   0.0015   0.0015   0.0016
18     0.0008   0.0012   0.0018   0.0019   0.0019
19     0.0012   0.0019   0.0028   0.0030   0.0030

Table 7.9 Posterior Distribution of m for ν = 0.01, E(σ²) = 1, and μ = (2.5, 0.7, 5, 0.5)'

         var(σ²)
m      0.01     0.1      1        10       100
1      0.0003   0.0004   0.0006   0.0006   0.0006
2      0.0002   0.0003   0.0004   0.0004   0.0004
3      0.0007   0.0010   0.0011   0.0012   0.0013
4      0.0001   0.0002   0.0003   0.0004   0.0004
5      0.0009   0.0012   0.0013   0.0014   0.0014
6      0.0026   0.0029   0.0030   0.0031   0.0031
7      0.0417   0.0387   0.0359   0.0354   0.0353
8      0.0636   0.0588   0.0542   0.0534   0.0533
9      0.0686   0.0635   0.0587   0.0578   0.0577
10     0.0911   0.0848   0.0788   0.0776   0.0775
11     0.0874   0.0813   0.0755   0.0744   0.0743
12     0.5876   0.6137   0.6387   0.6436   0.6441
13     0.0080   0.0079   0.0076   0.0075   0.0075
14     0.0108   0.0104   0.0100   0.0098   0.0098
15     0.0208   0.0195   0.0182   0.0180   0.0179
16     0.0130   0.0125   0.0118   0.0117   0.0009
17     0.0006   0.0008   0.0009   0.0009   0.0011
18     0.0007   0.0009   0.0011   0.0011   0.0018
19     0.0011   0.0015   0.0017   0.0018   0.0018

Table 7.10 Posterior Distribution of m for ν = 0.1, E(σ²) = 1, and μ = (2.5, 0.7, 5, 0.5)'

         var(σ²)
m      0.01     0.1      1        10       100
1      0.0025   0.0031   0.0036   0.0037   0.0037
2      0.0004   0.0006   0.0007   0.0008   0.0008
3      0.0015   0.0017   0.0019   0.0019   0.0019
4      0.0003   0.0004   0.0005   0.0005   0.0005
5      0.0012   0.0015   0.0016   0.0017   0.0017
6      0.0031   0.0033   0.0034   0.0034   0.0034
7      0.0424   0.0387   0.0352   0.0346   0.0345
8      0.0639   0.0581   0.0529   0.0518   0.0517
9      0.0695   0.0634   0.0578   0.0567   0.0566
10     0.0810   0.0744   0.0681   0.0669   0.0668
11     0.0770   0.0706   0.0646   0.0635   0.0633
12     0.5945   0.6248   0.6541   0.6598   0.6604
13     0.0079   0.0075   0.0071   0.0070   0.0070
14     0.0111   0.0104   0.0097   0.0095   0.0095
15     0.0242   0.0222   0.0203   0.0199   0.0199
16     0.0160   0.0149   0.0137   0.0135   0.0135
17     0.0008   0.0010   0.0011   0.0011   0.0011
18     0.0011   0.0013   0.0014   0.0014   0.0014
19     0.0017   0.0021   0.0023   0.0024   0.0024

Table 7.11 Posterior Distribution of m for ν = 1, E(σ²) = 1, and μ = (2.5, 0.7, 5, 0.5)'

         var(σ²)
m      0.01     0.1      1        10       100
1      0.0372   0.0351   0.0325   0.0319   0.0319
2      0.0039   0.0040   0.0039   0.0039   0.0039
3      0.0068   0.0066   0.0062   0.0061   0.0061
4      0.0009   0.0011   0.0011   0.0011   0.0011
5      0.0023   0.0025   0.0025   0.0024   0.0024
6      0.0044   0.0043   0.0041   0.0041   0.0041
7      0.0344   0.0305   0.0269   0.0262   0.0261
8      0.0512   0.0453   0.0398   0.0388   0.0387
9      0.0534   0.0474   0.0418   0.0408   0.0406
10     0.0653   0.0584   0.0519   0.0506   0.0505
11     0.0624   0.0557   0.0495   0.0483   0.0481
12     0.6104   0.6478   0.6844   0.6915   0.6923
13     0.0077   0.0071   0.0064   0.0063   0.0062
14     0.0110   0.0100   0.0089   0.0087   0.0086
15     0.0245   0.0217   0.0191   0.0186   0.0186
16     0.0182   0.0162   0.0144   0.0140   0.0140
17     0.0013   0.0014   0.0015   0.0015   0.0015
18     0.0018   0.0019   0.0020   0.0019   0.0019
19     0.0027   0.0031   0.0032   0.0032   0.0032

Table 7.12 Posterior Distribution of m for ν = 10, E(σ²) = 1, and μ = (2.5, 0.7, 5, 0.5)'

         var(σ²)
m      0.01     0.1      1        10       100
1      0.1419   0.1274   0.1128   0.1099   0.1096
2      0.0147   0.0134   0.0119   0.0116   0.0116
3      0.0147   0.0132   0.0117   0.0114   0.0114
4      0.0013   0.0014   0.0014   0.0014   0.0014
5      0.0026   0.0026   0.0025   0.0024   0.0024
6      0.0043   0.0041   0.0038   0.0037   0.0037
7      0.0254   0.0223   0.0194   0.0189   0.0188
8      0.0377   0.0330   0.0286   0.0278   0.0277
9      0.0385   0.0339   0.0295   0.0286   0.0285
10     0.0497   0.0441   0.0387   0.0377   0.0377
11     0.0477   0.0423   0.0371   0.0361   0.0360
12     0.5541   0.6008   0.6475   0.6566   0.6576
13     0.0070   0.0063   0.0056   0.0055   0.0055
14     0.0101   0.0090   0.0079   0.0077   0.0077
15     0.0221   0.0194   0.0169   0.0164   0.0163
16     0.0185   0.0163   0.0142   0.0138   0.0137
17     0.0017   0.0019   0.0019   0.0019   0.0018
18     0.0025   0.0026   0.0025   0.0025   0.0025
19     0.0055   0.0060   0.0061   0.0061   0.0061

Table 7.13 Posterior Distribution of m for ν = 100, E(σ²) = 1, and μ = (2.5, 0.7, 5, 0.5)'

         var(σ²)
m      0.01     0.1      1        10       100
1      0.3442   0.3152   0.2846   0.2784   0.2777
2      0.0156   0.0143   0.0129   0.0126   0.0126
3      0.0130   0.0119   0.0107   0.0105   0.0105
4      0.0010   0.0011   0.0011   0.0011   0.0011
5      0.0020   0.0020   0.0020   0.0020   0.0019
6      0.0033   0.0032   0.0030   0.0030   0.0030
7      0.0184   0.0165   0.0147   0.0144   0.0143
8      0.0273   0.0244   0.0217   0.0212   0.0211
9      0.0278   0.0250   0.0223   0.0217   0.0217
10     0.0362   0.0329   0.0296   0.0289   0.0288
11     0.0347   0.0315   0.0283   0.0277   0.0276
12     0.4164   0.4645   0.5143   0.5256   0.5267
13     0.0053   0.0049   0.0045   0.0044   0.0044
14     0.0077   0.0070   0.0063   0.0062   0.0061
15     0.0167   0.0150   0.0133   0.0130   0.0130
16     0.0144   0.0130   0.0116   0.0113   0.0113
17     0.0015   0.0015   0.0016   0.0016   0.0016
18     0.0022   0.0023   0.0023   0.0023   0.0023
19     0.0124   0.0138   0.0144   0.0144   0.0144

The results of Experiment One demonstrate that:
1. The posterior mass function of m is maximum at m = 12 for all values of λ and var(τ).
2. When λ decreases, the probability of the end points m = 1 and m = 19 increases, and this is very noticeable when λ = .01 and 0.1. The reason is that as λ → 0, r = λI approaches singularity, and x'(m)x(m) is singular at m = 1 and 19; thus x'(m)x(m) + r approaches singularity.
3. The posterior probability at m = 12 increases with an increase in var(τ).

The results from the second experiment show that:
1. The p.m.f. of m is maximum at m = 12 for all values of ν and var(σ²), and the probability at m = 12 increases very little as var(σ²) increases from 1 to 100.
2. As ν increases, the probability at m = 1 and m = 19 increases, especially at m = 1, and this is most noticeable when ν = 10 and 100, because x'(m)x(m) + r approaches singularity (as ν increases).
3. The posterior probability at m = 12 increases with an increase in var(σ²).

A safe conclusion from the above results is that the shift point is at m = 12 for all prior distributions and that the p.m.f. shows a remarkable insensitivity to the different prior distributions.

Inferences for the Parameters

This part of the example will provide inferences for the parameters of the model: β (including β_1 and β_2), τ, and m. Together with Quandt's data set, Table 7.1, a normal-gamma prior distribution is used with parameters μ = (2.5, 0.7, 5, 0.5)', r = I_4, a = 3, and b = 2, i.e., E(σ²) = 1, var(σ²) = 1, E(β) = μ, and cov(β) = I_4, and the marginal posterior distributions of each parameter will be analyzed.

The posterior p.m.f. of m for this prior (the ν = 1, var(σ²) = 1 case of Table 7.11) clearly shows a shift occurred at m = 12 with a probability of .6844; the mode is at m = 12, the median at 12, and the mean at 11.11. Inferences about the regression parameters α_1, β_1, α_2, and β_2 may be based on either the marginal posterior distribution or the conditional posterior distribution, given, say, m = 12, of the parameter; these densities were derived earlier in the chapter, the conditional densities being given by formulas (7.26) through (7.28) and the marginal densities by Theorem 7.2. Point estimates and regions of highest posterior density will be obtained for each of the parameters using both the marginal and conditional posterior distributions of that parameter. The reader is referred to Box and Tiao (1965) and to Chapter 1 for an explanation of HPD regions.


Inferences about α_1. Let π(α_1|y) and π(α_1|y, 12) denote the marginal and conditional (given m = 12) posterior densities of α_1. Figure 7.1 is a plot of π(α_1|y), π(α_1|y, 12), and the marginal prior density of α_1.

DETECTION OF STRUCTURAL CHANGE

The likelihood function is

   f(y|m, β, τ) ∝ τ^{n/2} exp{ -(τ/2)[y(m) - x(m)β]'[y(m) - x(m)β] },              (7.34)

where y(m) = [y'_1(m), y'_2(m)]',

   x(m) = [ x_1(m)   0_1 ]
          [ 0_2      x_2(m) ],

0_1 and 0_2 are zero matrices, y_1(m) is an m x 1 vector consisting of the first m observations of the sequence, and y_2(m) is the vector of the remaining n - m observations. Now x_1(m) is the design matrix corresponding to the first m observations, namely x_1(m) = (x_1, x_2, ..., x_m)', and x_2(m) = (x_{m+1}, ..., x_n)'. Prior information is assigned as follows: first, the marginal p.m.f. of m is

   π_0(m) = q,                  m = n,
          = (1 - q)/(n - 1),    1 ≤ m ≤ n - 1,                                     (7.35)

where 0 < q < 1.


Secondly, if m = n, β_1 and τ are the parameters and are assigned a normal-gamma density, where

   π_0(β_1|τ) ∝ τ^{p/2} exp{ -(τ/2)(β_1 - β_10)'A_11(β_1 - β_10) },   β_1 ∈ R^p,   (7.36)

is the conditional prior p.d.f. of β_1 given τ, and the marginal prior density of τ is

   π_0(τ) ∝ τ^{a-1} e^{-τb},   τ > 0,                                              (7.37)

where β_10 is a given p x 1 vector, a > 0, and b > 0. Thirdly, if 1 ≤ m ≤ n - 1, β' = (β'_1, β'_2) and τ are the parameters and are assigned a normal-gamma prior density, where

   π_0(β_1, β_2|τ) ∝ τ^{2p/2} exp{ -(τ/2)(β - β_0)'A(β - β_0) },   β ∈ R^{2p},      (7.38)

is the conditional prior density of β given τ, and the marginal prior density of τ is still (7.37). The mean vector of β is β_0, where β_0 = (β'_10, β'_20)' and β_20 is p x 1, and the precision matrix A is partitioned as

   A = [ A_11   A_12 ]
       [ A_21   A_22 ],

where A_11, A_12, and A_22 are given matrices such that A is positive definite. Note that β_10 and A_11 are also used in the prior density of β_1 given τ when m = n. Combining the likelihood function (7.34) with the prior information gives


   π(m, β_1, τ|y) ∝ q |A_11|^{1/2} τ^{(n+p+2a)/2 - 1}
                     exp{ -(τ/2)[ D(n) + (β_1 - β*)'[x'(n)x(n) + A_11](β_1 - β*) ] },
                     m = n,  β_1 ∈ R^p,  τ > 0,

   π(m, β, τ|y) ∝ (1 - q) |A|^{1/2} τ^{(n+2p+2a)/2 - 1}
                   exp{ -(τ/2){ D(m) + [β - β*(m)]'[x'(m)x(m) + A][β - β*(m)] } },
                   1 ≤ m ≤ n - 1,  β ∈ R^{2p},  τ > 0,                             (7.39)

where

   β*(m) = [x'(m)x(m) + A]^{-1}[Aβ_0 + x'(m)y(m)],

   D(m) = 2b + [y(m) - x(m)β*(m)]'y(m) + [β_0 - β*(m)]'Aβ_0,

   β* = [x'(n)x(n) + A_11]^{-1}[A_11β_10 + x'(n)y(n)],

and

   D(n) = 2b + y'(n)y(n) + β'_10 A_11 β_10 - β*'[x'(n)x(n) + A_11]β*.

B | '[x '(n )x (n ) + A

The shift point m is of primary interest, thus the nuisance param­ eters Bj. > and T are integrated from (7.39) giving ,,

m

Y t = Xt ^ 2 +

V

t = m + 1, ..., T

(7.41)

t = 1 ,2, . . . , T

at = Pat- l + V where

(7.42)

is the t-th observation on the dependent variable, 3 ^ and

3 2 are distinct k * 1 vectors of regression parameters,

is the t-th

observation on k independent variables, p is the unknown autocorre­ lation parameter, and the e^s are i .i .d . n ( 0 , i *) random variables, where the precision t > 0, and a^ a known initial quantity.

The shift

point m models the time where the change occurs and m = 1 , 2 , . . . , T - 2. The model is rewritten as Y t = PYt _ 1 +

-

Px ^ . 1 ) 3 1 + et ,

t = 1,2, . . . , m (7.43)

Y m+ l = p Y m + X m+i e 2

p x1 3. + e , 1 , m l m+1

STRUCTURAL STABILITY IN OTHER MODELS where

and

353

are "known” initial values of the independent and de­

pendent variables, and the errors a^ still satisfy (7 .4 2 ).

The model

allows one change in the regression parameter, but the precision pa­ rameter t and autocorrelation parameters do not undergo change. The following theorem gives the marginal posterior mass function of m and the marginal posterior densities of 3 and t.

Theorem 7. 3. Given the above model (7.43) and supposing the joint prior distribution of m, 3^» 3g» P» and T is as follows: The marginal prior distribution of m is uniform over 1, 2, . . . , T — 2, the condi­ tional prior density of 3 = ( 3 !.>3 f0) f given t is a 2k-dimensional normal 2k with mean vector y = ( y f , y T) f and precision matrix P, where y € R , k y. € R , i = 1, 2, and P is positive definite, the marginal prior distribu­ tion of t is gamma with positive parameters a and b , the marginal prior density of p is constant over R, and m, ( 3 , t ) , and p are in­ dependent, a priori, then: (i)

The marginal posterior mass function of m is

R m = 1,2 . . . , T -

2,

where

H(p)

=

where P ^ is of order k and H( p ) is partitioned as H( p ) = ( H . j ( p ) ) , where H „ ( p ) is of order k,

(7.44)

354

STRUCTURAL CHANGE IN LINEAR MODELS

Z1

= (x l -

px 0 ’ x 2 "

Z2 = ( X m+2 ~ pXm +l’ Py 0.

V1 = ( y l ~

. . . . x T --

y2 - py^

p*T_1 ) ’

••• ’ y m ~ ' pym- 1 ) ( I

1 >*

1

•• •» yT

CL

pym +l’

pxm- 1 > ’

H >>

V2 = ( y m+2 -

pXj, . . .

5 2 = H 2 2 . 1

I“ 2 - H 21 ( » )H n *

m=l

|H (p ) r 1 /V ' ! (p > ' 2 dp,

/ R

t

P ro o f .

The likelihood function of the parameters is

L(6,T,p,m |y)

t

T /2

exp -

t -

-

z ^ j VCV j

-

z ^ )

> 0. (7.46)

STRUCTURAL STABILITY IN OTHER MODELS

2k where 3 € R , t > 0, p € R , and m = 1 , 2 , . . . , T prior density of the parameters is Q+fc—1 i r ( 3 , T , p , m) «

t

355

2, and the joint

exp — - [ 2 b + ( 3 - y ) !P ( 3 - y ) l ,

2k where 3 € R , t > 0, p € R, and m = 1,2 T - 2 and combined with the likelihood function gives the joint posterior density of the parameters. The marginal posterior densities ( i ) , ( i i ) , and (iii) of the theorem are easily found by integrating the joint posterior density of the parameters with respect to the appropriate parameters. This theorem gives the foundation for making inferences about the parameters of a regression model with autocorrelated errors, but, of course, it doesn’t directly provide all the information about the parameters. For example, part (ii) of the theorem gives the posterior distribution of 3 but not the marginal distributions of 3 ^ and 3 g» but by using the properties of the multivariate t, they are easily found. Consider a numerical example to illustrate inferences for m, the shift point. Suppose the model is t = 1 , 2 , . . . , m* (7.47) (3 + A )x

+ a ,

t = m* + 1, . . . , 15,

at = pat 1 + V the et are A = *2» p .5, 1.25, and the x^ are rescaled investment expenditures used by

=

Zellner (1971, page 89). The joint prior density of the parameters is tt^( 3 >t,p ,m) « ia 1e

^ g e R,

p

g R,

t

> 0, m = 1,2, . . . , 13, where a = 1 and

b = 2. The data were generated using IMSL subroutines and the p .m .f. of m (7.44) was obtained using Simpson’s rule for numerical integration in double precision on an IBM 370/168 computer. Table 7.15 gives the marginal p .m .f. of m* for five values of A and m* = 3 , 8 , 1 2 , where the first entry in each cell corresponds to p = 1.25, the ’’explosive” case, and the second entry corresponds to p = .5. The results show when A > .5, the posterior p .m .f. of m gives a clear indication of the shift, and a shift located near the center

356

STRUCTURAL CHANGE IN LINEAR MODELS Table 7.15 Posterior Probability of m* for T = 15 A

3

.3

.5

.7

.07

.06

.89

.99

1.00

.07

.42

.90

.97

.05

.12

.65

.96

1.00

.09

.03

.77

1.00

1.00

.10

.14

.35

.69

.83

.05

.11

.38

.73

.84

CO

8

12

.8

.2

o

m*

The first value in each cell corresponds to p = 1.25 and the second to p = .5.

(m* = 8 ) is easier to detect than if m* is located at either extreme (m* = 3 or 12). Table 7.16 is the p .m .f. of m when m* = 8 , p = .5, t = 1, A as before is .2 through . 8 , and T = 15. The correct value of m* has probability of at least .77 when A > .5, but when A = .2, m = 5 has the largest posterior mass of .16. If A = .2, the distribution of m is rather uniform and one would be hesitant in identifying the shift at any particular mass point. When A = .3, one perhaps would identify incorrectly m as 1 2 . The results of Table 7.17 are similar to those of Table 7.16 in that if A > .5, the posterior probability of a shift at the "correct" value of m = 3 becomes easier to detect as A increases, but these probabilities are not as large as those in Table 7.16 because the m* in Table 7.16 is set at 8 (which is at the center of the data) while m* is set at one e x ­ treme (m* = 3) for Table 7.17. When the value of m* is set at the middle of the observations, the probability of a change at m* is larger than when m* is set equal to one extreme or the other.

T ra n s itio n Functions and S tru c tu ra l Change in Regression Models w ith A utoco rrelated E rro rs When a shift point m is employed to model structural change, one is assuming, a priori, only that a change has occurred, exactly once, somewhere during the period of observation, thus if one has additional

STRUCTURAL STABIL ITY IN OTHER MODELS

357

Table 7.16 Posterior Distribution of m for p = . 5 and m* = 8 A

.2

m

.3

.5

.7

.8

1

.0923

.0624

.0031

.0001

0.0000

2

.0449

.0191

.0010

0.0000

0.0000

3

.0616

.0155

.0011

0.0000

0.0000

4

.0680

.0139

.0015

.0001

0.0000

5

.1623

.0142

.0014

.0001

0.0000

6

.1042

.0136

.0023

.0001

0.0000

7

.1133

.0160

.0039

.0001

0.0000

8

.0894

.0308

.7689

.9949

.9991

9

.0661

.0420

.1543

.0037

.0006

10

.0448

.0970

.0240

.0003

.0001

11

.0467

.2109

.0161

.0002

.0001

12

.0509

.3691

.0174

.0003

.0001

13

.0552

.0951

.0045

.0001

0.0000

prior information about the nature of the change, the shift point method would most likely ignore this information and is inappropriate. For example, if one knew the average value of the dependent variable was a continuous function of the dependent variable over the domain where the shift will occur, this information should be a part of onefs prior information about the model. From the Bayesian viewpoint, first Bacon and Watts (1971), then Tsurumi (1977) and later Salazar (1980), use a transition function to model both abrupt and smooth changes in linear models. Bacon and Watts work with a simple linear regression model, Tsurumi with simultaneous equations models, and Salazar (1980) with time series processes. Now, Salazar's work will be presented for a regression model with autocorrelated errors. Suppose ^ is a non-negative function of a real variable such that (i )

iKO) = 0

STRUCTURAL CHANGE IN LINEAR MODELS

358

Table 7.17 Posterior Distribution of m for p = 1.25 and m* = 3 A

.2

m

.3

.5

.7

.8

1

.0738

.0495

.0054

.0002

.0001

2

.0519

.0636

.0153

.0009

.0002

3

.0750

.2400

.8932

.9946

.9986

4

.1661

.1255

.0123

.0004

.0001

5

.0289

.0238

.0034

.0002

0.0000

6

.1094

.0843

.0091

.0004

.0001

7

.0346

.0312

.0048

.0002

.0001

8

.0442

.0364

.0049

.0002

.0001

9

.0528

.0398

.0049

.0002

0.0000

10

.0731

.0630

.0095

.0005

.0001

11

.1637

.1297

.0171

.0008

.0002

12

.0691

.0620

.0110

.0007

.0002

13

.0573

.0511

.0089

.0005

.0001

lim n^oo

= 1, where

= s /y and

(

t < t*

(ii)

st =

0,

It -

t*,

(7.48)

t > t*

and t* is a fixed real number. Then $ is called a transition function and y the transition parameter and indicates the abruptness of the structural change which begins at time t*. Consider a regression model with autocorrelated errors, with k independent variables x ^ X g , . . . , x k and k regression parameters 3 1 » 3 2» •••> from 3 1q to 3

and suppose 3 1 changes

+ 3 ^ beginning at time t*, then at time t

STRUCTURAL STABILITY IN OTHER MODELS

at = pat - i + v

359

t = 1>2, •••’ T,

where x ^ , i = 1 , 2 , . . . , k, is the value of the i-th independent v ari­ able at time t,

the dependent variable at time t, and the errors fol­

low a first order autoregressive process with autocorrelation parameter p G R. The model may be rewritten as

Yt =

+ ( zt ” pzt - l ^ + V

zt =

xt2>

t = 1>2, *

T ' where (7.50)

xtk]-,

k + 1

6 = (B 1 0 ’ 6 1 1 ’ 6 2 ..............ek *' € R the

are i .i .d . n ( 0 ,x 1) ,

t

> 0 , and zQ and Y^ are known initial values

of the independent and dependent variables, respectively. The transition function ^ depends on unknown parameters t* and the likelihood function may be written as

L (B ,t*,Y ,p ,T | y ) « xT / 2 exp for B € R p y 0> y 2 _ zt

k +1

, 1 < t* < T ,

py !> •••> y T -

y

> 0,

t

| (V > 0,

p

G

z $ )'(V -

zB ),

y

(7.51)

R and where V = (y^

p y ^ ) ' and z = ( z i -

and

-

pzo *z2 ~ p z i .........

_ pZT - i ) I " The conjugate family for this model is unknown, thus let

7T0 ( 3 ,t * ,y ,p ,T ) « xa 1 e Tt>, 1 < t* < T ,

pG

R,

t > 0,

6 e Rk+1,

y > 0

(7.52)

be the joint prior density for all the parameters, which implies 3 > t*, Y, p, and t are independent and that 3, t*, y , and p have marginal constant densities and that t is gamma with parameters a and b . Combining the prior density with the likelihood function and intek +1 grating the product with respect to 3 and t over R x (0 ,° ° ), the joint posterior density of t* and y is TT(t*,Y|y) “

f

Jn

|z'z I" 1/ 2 (b + v'Pv/2 ) ' ( T ' k ' 1 ) / 2 "a / 2 dp,

STRUCTURAL CHANGE IN LINEAR MODELS

360 1 < t* < T ,

(7.53)

> 0, where P = I - z (z ’z) *zf . To find the marginal posterior densities of t* and y one must use numerical integration procedures since they cannot be expressed in analytical form. If p = 0 , the model reduces to simple linear r e ­ gression . Consider the following example of a gradual change (y > 0) using a set of T = 30 standardized normal random deviates. The marginal posterior density function of t* was calculated with data generated from Y

Y

= 3xt + Atxt + at ,

t = 1,2, . . . , 30 (7.54)

at = pat - i + v

where A ^ - 0 for t - 1,2, . . . , t* •5’ \ * + 3 = - 8’ \ * +4 = The

1, A ^

.1, A

• 3» A

\ * + 5 = 1>2' U0 = - 5>X0 = ° and y 0 = - 5-

are i .i .d . n ( 0 , l ) and the joint prior density of

3e R,

t>

0,

t*, 1 < t* < 30, p G R and y > 0 is

TrQ( 6 , t * ,y , p ,T ) * ~

Inferences about t* are illustrated in Figures 7.7 and 7.8, where the posterior densities of t* were computed for p = 0, .5, 1.25 and t* = 14 and 19. It is shown that for non-explosive models, the center of the posterior densities of t* are insensitive to changes in p. For example when t* = 19 and p = 0 or .5, they are tightly concentrated around the modes of 18.5 and 19 shown in Figure 7.7.

For t* = 14,

p = 0, the posterior density is spread between 1 and 17 with modal value of 10.5. The posterior density of t* when t* = 14 and p = 1.25 is shown in Figure 7.8 and has a mode at 12.5 and a long left tail. The two densities of this figure show the marginal posterior density of t* is sensitive to changes in p. The mode changes from 9.5 to 12.5 when p changes from 0 to 1.25, when the change in parameter begins at 14.

STRUCTURAL STABIL ITY IN OTHER MODELS


361

Figure 7.7. Posterior density of t* for p = 0 and .5 and t* = 19, the true value of t* (T = 30).

S tru c tu ra l Change in A uto re g re s s iv e Parameters Only elementary versions of the autoregressive process will be con­ sidered; however, the posterior analysis of a general autoregressive model is easily found. The p-th order process (5 .3 ) was defined in Chapter 5 but for now consider a first order process with one change, Y t = 3 11 + B12 Y t - l +

V

t =

Y t = B 2 1 + e 22 Y t - l +

V

t=m +

1’ 2, • • • '

m

1’

where m is the shift point, m = l , 2 , . . . , n — 1 , 3 32 = ( 3 2 1 >$2 2 ^

R

2

(7,55)

= (P u jB iJ 1 €R

are the autore&ressive parameters, 3j

+

2

,

32> Y t

is the t-th observation on the dependent variable, the e are i .i .d . n (0 ,

t

) ,

t

> 0, and y^ is a known initial value.

Thus the first m observa­

tions follow an A R ( 1) process with parameters 3^ and the remaining observa-

STRUCTURAL CHANGE IN LINEAR MODELS


362

Figure 7.8. Posterior density of t* for p = 0 and 1.25 and t* = 14, the true value of t* (T = 30).

tions follow an A R (1 ) process with parameters

anc* all the param­

eters are unknown. The A R (p ) process allows a conjugate family of normal-gamma distributions, which will be modified for an A R (1 ) with one change. Suppose, a priori, that m and ( B , t ) are independent where 4 3 = G R > and that m has a uniform distribution over 1 , 2 , . . . , n - 1

and that B given

t

is a four variate normal with mean

y

=

2

( y - » y ’0) Tt y. e R , i = 1 , 2 , and precision matrix tP , where P is pos1 z 1 itive definite and

where P ^ is

2 x

2

and also suppose

and b , then one may show:

t

is a gamma with parameters a

STRUCTURAL STABILITY IN OTHER MODELS

T heorem

7. 4.

363

The marginal posterior mass function of the shift point

is ir(m|y) «

|A(m) |

m = 1,2, . . . , T — 1, (7.56)

where

A 11

P 12

P 21

A 22

A (m )

A 11 = P 11 + G 1 ’ A 22

P 22 + H l* m

/

Gl

=

Ey i

(

l’



L

m

E y 2

i Vt T

/ /

T -

m

E

1!

m+1 I

T

T

E m+1 C^m )

= b + 2 [ S 1(® 1) + S l ( ®'

h

2 12

STRUCTURAL CHANGE IN LINEAR MODELS

364 m ?

s

(yt

(S ) = E

(y

B11

-

B 12y t - 1 *

e 21 -



B22y t _1 >2 .

m+1 m i =G;1( £

m yt ’ £

y t y t - i ) ,>

y- ’ I

y*y0

(7.58)

where g(x|m ,y) is a gamma density with parameters a + T/2 and C ^ m ). The results of this theorem completely parallel Theorem 7.2, which is not surprising since the latter theorem gave the posterior analysis for a changing linear model. There are three results to the theorem, namely the three marginal posterior distributions of m, 3 , and t , and of course more results follow. For example, the marginal densities of 3 ^ and 3 2 easily follow from the density of 3 .

STRUCTURAL STABILITY IN OTHER MODELS

365

This analysis employed a shift point parameter m and will be gen­ eralized to include smoother changes by use of the transition function, (7 .4 8 ). Suppose the shift parameter m is of primary importance and the other parameters regarded as nuisance parameters, then one's inference about m will be based on (7 .5 6 ). Suppose an A R (1 ) model changes once according to the model

Yt = , “

n V i * V

, = 1 ’ 2 ........... 9 (7.59)

Yt = 4 * (B12 * 4)Yt-l * V where the

' “ 10............20

are i .i .d . n ( 0 , l ) and y Q = 5.

The prior distribution of the parameters was chosen so that m is independent of ( 3 >t) and uniform over 1,2, . . . , 19, where 6 = ^ 1 1 ’ ^ 1 2 ’ ^ 2 1 ’ ^ 22 ^ * an(* t^ie conc*itional P1™ 1* distribution of 3 given t is normal with mean vector (4 ,1 ,4 ,1 )T and precision matrix xl^, and the marginal distribution of x is gamma with parameters a = 1 and b = 2. Table 7.18 shows the results of the posterior mass function of m when the data were generated according to (7.59) with 3 ^ = -5 anc* 322 =

&i 2 + ^ 9 w^ ere

A = .5, 1, 1.49.

Thus there is one shift

beginning with the 10 th observation and this corresponds to a non­ explosive series since |3

1< 1 and I ^ 22 ^
1, the p .m .f. of m correctly identifies the shift. With regard to Table 7.19 the results give the p .m .f. of m when 312 = 1* A = •!> *2, .3, and m* = 10, thus the data generated according to (7.59) were from an explosive model, since l& ^ l >

anc* structural

change is easier to detect in this case than when the data are generated from a nonexplosive model. For a small change of .1, it is difficult to identify the shift point, but when A > . 2, there is no question the change is correctly given at m = 1 0 . Now let!s use a transition function to model a change in an A R (1 ) model, thus let

for t = 1,2, . . . , T , where $ is the transition function given by (7 .4 8 ),

nt = St^Y ’ and

STRUCTURAL CHANGE IN LINEAR MODELS

366

Table 7.18 Posterior Distribution of m for 3 ^ 2 = *5 and m* = 10 A .5

1.

1.49

1

.000

.000

.000

2

.016

.011

.002

3

.018

.014

.004

4

.019

.016

.004

5

.015

.013

.003

6

.019

.021

.007

7

.016

.018

.006

8

.024

.043

.023

9

.021

.036

.015

10

.058

.291

.547

11

.082

.347

.278

12

.062

.093

.006

13

.044

.024

.001

14

.037

.014

.000

15

.030

.012

.000

16

.065

.014

.000

17

.469

.029

.001

18

.000

.000

.000

19

.000

.000

.000

m

This equation represents an A R ( 1) process with a change in the autoregressive parameter from 3 2Q to 3 21 beginning at time t* and y, the transition parameter, measures the degree of change, where y = 0 indicates an abrupt change.

Now assuming the

are n .i.d . (0 ,t 1) ,

STRUCTURAL STABILITY IN OTHER MODELS

Table 7.19 Posterior Distribution of m for

=

367

and m* = 10

A m

.1 0

.20

.30

1

.000

.000

.000

2

.291

.003

.000

3

.116

.001

.000

4

.048

.001

.000

5

.042

.000

.000

6

.033

.000

.000

7

.031

.000

.000

8

.043

.002

.000

9

.034

.005

.000

10

.091

.980

1.000

11

.028

.004

.000

12

.015

.001

.000

13

.012

.001

.000

14

.013

.000

.000

15

.041

.000

.000

16

.041

.000

.000

17

.116

.000

.000

18

.000

.000

.000

19

.000

.000

.000

where t > 0 , the joint density of the observations, given 6 ^, &2 q> **» t , and y is T XT / 2 exp -

L(t*,Y,e,T|s) «

-

|

23

[y t “



e 20 y t - l (7.61)

STRUCTURAL CHANGE IN LINEAR MODELS

368

where s = ( y 0,y , . . . , y ) is the sample of observations and B1 = Q ^ ( 3 1 , 3 2 0 ,B 2 1 ) € R » 1 < t 5| c < T » Y > 0 and t > 0 . Now suppose prior information is represented by the density (t * , y >3 »x) « xa 1 e T b ,

x > 0,

y > 0,

3 € R 3, 1 < t* < T

(7.62)

i . e . , the marginal prior density of t is gamma with parameters a > 0 and b > 0 , and x is independent of the others t*, y> and 3 , which have a constant prior density. By Bayes theorem, one may show the joint posterior density is Tr(t*,Y,3,x|s) a T( T + 2a)/2 1 exp _ I

_ 3 )TH (t * , y ) ( 3 - 3)

+ s 2 (3 ) + 2 b ],

(7.63)

where T £ l T H ( t * , Y)

T

[ E

s l(B)

S

T yt , E

[yt

T ytyt. i .

Bj

2

6 20y t - l

yt - i

l

T

T

1

1

T

T

-i yt. i « V y tJ*

e 2 1 lJ'( n t ) y t - l ^

The marginal posterior density of t* and y is

SUMMARY AND COMMENTS

TT(t*,y) «

369

|H (t*,y) | " 1/ 2 [ b + S l(B )/ 2 ] ‘ ( T “ 3 + a ) / 2

y > 0,

1 < t* < T ,

(7.64)

and is found by integrating the joint posterior density (7.63) with respect to 3 and t . It does not appear possible to eliminate either parameter from (7 .64) except by numerical integration procedures. Very little work has appeared for structural change in time series processes, except for what we have seen here, namely for auto­ regressive models and regression models with autocorrelated errors. Clearly, the results given above for the posterior analysis of auto­ regressive models can be extended to autoregressive models of any order. Also, it should be relatively easy to develop Bayesian predic­ tive techniques for autoregressive models which exhibit structural change. Structural change in moving average models and autoregressive moving average models has yet to be derived. In Chapter 5 we saw that the posterior analysis of moving average models is more difficult than that for autoregressive processes, and we would expect the same situation when one puts structural changes into these two models, nevertheless it should be possible to build an operational posterior and predictive analysis for M A (q ) and A R M A (p ,q ) processes.

SUMMARY AND COMMENTS This chapter has introduced the reader to the Bayesian analysis of linear models which exhibit structural change in the parameters. Normal sequences, two-phase regression models, general linear models, regression models with autocorrelated errors and autoregressive proces ses have been examined. The instability of a process was represented either by a changepoint parameter or a transition function and the posterior analysis focused on the posterior mass function of the change point or the pos­ terior density of the two parameters of the transition function. A b rie f introduction to detecting instability was given on pp. 346351, and an exhaustive numerical study of a changing linear model was presented on pp. 321-345, where the posterior mass function of the change point and the posterior densities of the other parameters were evaluated. Although much has been written about structural change, much remains to be done. For example, time series processes need to be examined in more detail, and forecasting methods which incorporate structural change into the model need to be developed.

STRUCTURAL CHANGE IN LINEAR MODELS

370

For the latest account on structural change in statistics and economics, the reader is referred to Poirier (1976).

EXERCISES 1.

2.

Verify equation (7 .8 ), that the marginal posterior distribution of 0 is a mixture of bivariate t densities and that the mixing distribution is that of the marginal posterior distribution of the shift point. From this equation, derive the mean and variance of 0 . Referring to equation (7 .1 5 ), show the predictive density of Y n+^>

one future observation, is a mixture of n — 1 univariate t den­ sities and that the mixing distribution is of the shift point m. 3. Verify equation (7 .4 4 ), of the marginal posterior mass function of m for a regression model with autocorrelated e rrors, (7 .4 3 ). 4. For a regression model with autocorrelated errors, and one change, represented by (7 .4 3 ), find the predictive distribution of one future observation Y , using the prior density given in Theorem 7.3. 5. One may use either a shift point or a transition function to model structural change, as for example was presented with the re g re s­ sion model with autocorrelated errors, (7 .4 3 ). Explain when it is appropriate to use the shift point (not the transition function) and when it is appropriate to use the transition function but not the shift point. Also, present other techniques to model struc­ tural change in statistical models. Read Poirier (1976). 6 . Consider the general linear model with one change (7 .1 7 ), show that if one uses the improper prior density §(m, 3 , t

7.

8.

)

«

-1 T

,

t

> 0, 3 e R

2d

, m = 1,2, . . . , n -

1

that the marginal mass function of m does not exist over the integers 1 , 2 , . . . , n - 1 , but does exist over p ,p + 1 , . . . , n — p. The parameter 3 of a linear model with one change(7.17) may be estimated from either the marginal posterior density function of 3 , (7 .2 1 ), or the conditional posterior density function of 3 , (7 .2 6 ). Explain under what conditions the latter is likely to be a "good" approximation to the former. What are the advantages of using the conditional posterior density of 3 (to make inferences about 3 ) when it is a "good" approximation to the marginal pos­ terior density of 3 ? Consider a sequence of independent Poisson random variables, say X 1 ,X , . . . , X , where the first m have mean 8 and the x z n X

REFERENCES

371

last a mean of 0 O, where 0 n ^ 6 , and 0

f*

1

u

X

> 0 and 0

i

> 0.

Suppose

1 < m < n — 1 and that a priori m, 0 ^, and ©2 are independent such that m is uniform over 1 , 2 , . . . , n with parameters A. = i = 1, 2. Find:

9.

1 , and 0 . is gamma

(a ) (b )

the marginal posterior mass function of m. the marginal posterior density of 0 ^ and 0 ^.

(c )

an HPD region for 0^.

Consider a sequence of normal variables with one change, given by (7.10) but assuming ^ > 02» where 0 ^ e R, g R , and m = 1,2, . . . , n — 1. Formulate a reasonable prior density for the parameters and derive the joint posterior density of 0 and 0 ^.

10.

Consider an autoregressive process of first order with one change at time point m, 1 < m < n — 1 , with parameters t , 3 , 3 0, XX xz 321» and 322« Refer to equation (7.55) for the explanation of the model, and Theorem 7.4 for the posterior analysis. In this theorem it is stated that the marginal posterior distribution of 3 is a mixture of multivariate t distributions, however no formula is given. Derive the formula.

REFERENCES

Bacon, D. W. and D. G. Watts (1971). "Estimating the transition between two intersecting lin es," Biometrika, Vol. 58, pp. 524534. Bhattacharyya, G. K . and B . A . Johnson (1968). "Non parametric tests for shift at an unknown time point," Annals of Mathe­ matical Statistics, Vol. 39, pp. 1731-1734. Box, G. E. P. and G. C. Tiao (1965). "Multiparameter problems from a Bayesian point of view ," Annals of Mathematical Statistics, Vol. 36, pp. 1468-1482. Box, G. E. P. and G. C. Tiao (1973). B a yesia n In fere n ce in S ta tistica l A n a ly s is , Addison-Wesley, Reading, Mass. Broemeling, L. D. (1972). "Bayesian procedures for detecting a change in a sequence of random v ariables," Metron, Vol. X X X -N -1-4, pp. 1-14. Broemeling, L. D. (1974). "Bayesian inference about a changing sequence of random variables," Communications in Statistics, Vol. 3 (3 ), pp. 243-255. Broemeling, L . D. (1977). "Forecasting future values of a chang­ ing sequence," Communications in Statistics, Vol. A 6 ( l ) , pp. 87-102.

372

STRUCTURAL CHANGE IN LINEAR MODELS

Broemeling, L. D. and J. H. Chin Choy (1981). "Detecting struc­ tural change in linear models," Communications in Statistics, Vol. A 1 0 (24) , pp. 2551-2562. Brown, R. L . , J. D urbin, and J. M. Evans (1975). "Techniques for testing the constancy of regression relations over time" (with discussion), Journal of the Royal Statistical Society, Series B , pp. 49-92. Chernoff, H. and S. Zacks (1964). "Estimating the current mean of a normal distribution which is subjected to changes over time," Annals of Mathematical Statistics, Vol. 35, pp. 999-1018. Chi, Albert Yu-M ing (1979). The B a yesia n A n a ly s is o f S tr u c tu r a l C hange in L in ear M odels, doctoral dissertation, Oklahoma State University, Stillwater, Oklahoma. Chin Choy, J. H. (1977). A B a yesia n A n a ly s is o f a C h an gin g L in ear M odel , doctoral dissertation, Oklahoma State University, Stillwater, Oklahoma. Chin Choy, J. H. and L . D. Broemeling (1980). "Some Bayesian inferences for a changing linear model," Technometrics, Vol. 22(1), pp. 71-78. Chow, G. (1960). "Tests of the equality between two sets of coef­ ficients in two linear regression s," Econometrica, Vol. 28, pp. 591-605. Farley, J. U . , M. J. Hinich, and T . W. Mcguire (1975). "Some comparisons of tests for a shift in the slopes of a multivariate linear time series," J. of Econometrics, Vol. 3, pp. 297-318. Feder, P. (1975). "The log likelihood ratio in segmented re g re s­ sion," Annals of Statistics, Vol. 3, pp. 84-97. Ferreira, R. E. (1975). "A Bayesian analysis of a switching re g re s­ sion model: known number of regim es," Journal of the American Statistical Association, Vol. 70, pp. 370-374. Goldfield, S. M. and R. E. Quandt (1973a). "The estimation of structural shifts by switching regression s," Annals of Economic and Social Measurement, Vol. 2, pp. 475-485. Goldfield, S. M. and R. E. Quandt (1973b). "A Markov model for switching regression s," Journal of Econometrics, Vol. 1, pp. 3-16. Goldfield, S. M. and R. E. Quandt (1974). "Estimation in a dis­ equilibrium model and the value of information," unpublished manuscript. Green, C. V . and E. Fekette (1933). "Differential growth in the mouse," J. Experimental Zoology, Vol. 6 6 , pp . 351-370. Hartley, M. J. and P. Mallela (1975). "The asymptotic properties of a maximum likelihood estimator for a model of markets in disequilibrium ," Discussion paper no. 329, State University of New York at Buffalo. Hinkley, D. V. (1969). "Inference about the intersection in twophase regression ," Biometrika, Vol. 56, pp. 495-504.

REFERENCES

373

Hinkley, D. V. (1971). "Inference in two-phase regression ,” J. of American Statistical Association, Vol. 6 6 , pp. 736-793. Holbert, D . and L. D . Broemeling (1977). "Bayesian inference related to shifting sequences and two phase regression ,” Com­ munications in Statistics, Vol. A 6 (3 ), pp . 265-275. Hudson, D. (1966). "Fitting the segmented curves whose end­ points have to be estimated,” Journal of the American Statistical Association, Vol. 61, pp. 1097-1109. Kander, Z. and S. Zacks (1966). "Test procedures for possible changes in parameters of statistical distributions occurring at unknown time po in ts,” Annals of Mathematical Statistics, Vol. 37, pp. 1196-1210. Lindley, D. V . (1965). In tro d u ctio n to P ro b a b ility and S ta tis tic s from a B a yesia n V iew p o in t , Cambridge University Press, Cam bridge. Madala, G. S. and F. D. Nelson (1974). "Maximum likelihood methods for models of markets in disequilibrium ,” Econometrica, Vol. 42, pp. 1013-1030. Needham, A . E. (1935). "On relative growth in the jaws of certain fis h e s,” Proceedings of Zoological Society of London, Vol. 2, pp . 773-784. Ohki, K. (1974). "Manganese nutrition of cotton under two boron levels. II. Critical M le v e l,” Agronomy Journal, Vol. 6 6 , pp. 572-575. n Page, E. S. (1954). "Continuous inspection schemes,” Biometrika, Vol. 41, pp. 100-114. Page, E. S. (1957). "On problems in which a change in parameter occurs at an unknown poin t,” Biometrika, Vol. 44, pp. 248252. Poirier, D. J. (1976). T he E conom etrics o f S tr u c tu r a l C h a n g e , North-Holland Publishing Company, Amsterdam. Pool, J. and C. F. Borchgrevinck (1964). "Comparison of rat liver response to coumarin administered in vivo versus in v itro ,” American Journal of Physiology, Vol. 206(1), pp . 229-238. Quandt, R. E. (1958). "The estimation of the parameters of a linear regression system obeying two separate regim es,” Journal of the American Statistical Association, Vol. 53, pp. 873-880. Quandt, R. E. (1960). "Tests of hypotheses that a linear re g re s­ sion system obeys two separate regim es,” Journal of the Amer­ ican Statistical Association, Vol. 55, pp. 324-330. Quandt, R. E. (1972). "A new approach to estimating switching regression s,” Journal of the American Statistical Association, Vol. 67, pp. 306-310. Quandt, R. E. (1974). "A comparison of methods for testing non­ nested hypotheses,” Review of Economics and Statistics, Vol. 56, pp. 92-99.

STRUCTURAL CHANGE IN LINEAR MODELS

374

The A n a ly s is o f S tr u c tu r a l C h an ges in Time S e r ie s an d M ultivariate L inear M odels , doctoral dissertation,

Salazar, Diego (1980).

Oklahoma State University, Stillwater, Oklahoma. Salazar, Diego, L. Broemeling, and A . Chi (1981). ’’Parameter changes in a regression model with autocorrelated e r r o r s ,” Communications in Statistics, Vol. A10(17), pp. 1751-1758. Sen, A . K. and M. S. Srivastava (1973). ”On multivariate tests for detecting change in mean,” Sankhya, Vol. A 35, pp. 173185. Sen, A* K. and M. S. Srivastava (1975a). ”On tests for detecting a change in mean,” Annals of Statistics, Vol. 3, p p . 98-108. Sen, A . K. and M. S. Srivastava (1975b). ”On tests for de­ tecting change in mean when variance is unknown,” Annals of the Institute of Statistical Mathematics, Vol. 27, pp. 479-486. Sen, A . K. and M. S. Srivastava (1975c). ’’Some one-sided tests for change in le v e l,” Technometrics, Vol. 17, pp. 61-64. Smith, A . F. M. (1975). ”A Bayesian approach to inference about a change point in a sequence of random v a ria b le s,” Bio­ metrika, Vol. 62, pp. 407-416. Smtih, A . F. M. (1977). ”A Bayesian analysis of some time-varying models,” in R e c e n t D e velo p m en ts in S ta tis tic s (edited by J. R. B arra et a l . ) , North-Holland, Amsterdam, pp . 257-267. Smith, A . F. M. (1980). ’’Change-point problems: approaches and applications," in B a yesia n S ta t i s t ic s , edited by J. M. Bernardo, University Press, Valencia, pp. 83-98. Sprent, P. (1961). ’’Some hypotheses concerning two-phase re g re s­ sion lin e s,” Biometrics, Vol. 17, pp. 634-645. Srivastava, M. S. (1980). "On tests for detecting change in the multivariate mean,” Technical Report no. 3, University of Toronto. Tsurumi, H. (1977). ”A Bayesian test of a parameter shift with an application,” J. of Econometrics, pp. 371-380. Tsurumi, H. (1978). ”A Bayesian test of a parameter shift in a simultaneous equation with an application to a macro savings function,” Economic Studies Quarterly, Vol. 24(3), pp. 216230. Zellner, A . (1971). A n In tro d u c tio n to B a yesia n In feren ce in E co n o m etrics , John Wiley and Sons, In c ., New Y ork.

8 MULTIVARIATE UNEAR MODELS

IN T R O D U C T IO N

This chapter concludes the book with a study of some multivariate linear models which are useful in regression, time series, and designed experiments, Multivariate models are generalizations of some of the univariate linear models which were considered in the earlier chapters, thus the Bayesian analysis of the multivariate versions of the reg re s­ sion design and autoregressive models is introduced. The Bayesian analysis of multivariate linear models is well known and is presented in the books of Box and Tiao (1973), Press (1982), and Zellner (1971) and we will review these analyses in the next three sections. The analysis of vector time series models such as vector ARMA models has not been developed, except for some special cases. For example, the univariate autoregressive model has been examined by Zellner (1971) and was presented in Chapter 5, however, very little has appeared in the multivariate case; however Litterman (1980) and Chow (1975) have done some work and the vector AR model is quite similar to the multivariate linear model. Nothing has appeared for vector moving average processes (because very little has been done with univariate MA models) therefore only the vector AR model is studied. Structural change in linear models is rapidly being developed from a Bayesian viewpoint. For example Salazar (1981,1982) has developed an analysis for changing multivariate linear models and autoregressive processes, and Tsurumi (1978) has examined structural change in simultaneous equations of econometrics. Structural change in linear models is of course related to dynamic linear systems, which were


studied in Chapter 6. Linear dynamic models are linear models whose parameters are always changing, while structural change in linear models deals with parameters that change only once (or only occasionally).

MULTIVARIATE REGRESSION MODELS

Suppose one has m correlated dependent variables and k independent variables and one wants to investigate the relationship between the two sets. If the average values of the dependent variables are linear functions of the independent variables, then

Y = Xθ + e,   (8.1)

where Y is an n × m matrix representing n observations on the m dependent variables, X is an n × k matrix of n observations on the k independent variables, θ is a k × m matrix of unknown regression parameters, and e is an n × m matrix of errors such that the n 1 × m rows of e, say e'_1, e'_2, ..., e'_n, are independent normal random vectors with mean vector zero and unknown precision matrix P. Thus, as with the univariate model of Chapter 1, we will develop the prior, posterior, and predictive analysis.

It should be noted that the rows of the Y matrix are independent normal random vectors and that the likelihood function of the parameters is

L(θ, P|Y, X) ∝ |P|^{n/2} exp{−½ Tr (Y − Xθ)'(Y − Xθ)P},   (8.2)

where θ is a k × m real matrix and P an unknown m × m positive definite symmetric matrix. The likelihood function is the conditional density of Y given X (nonstochastic), P, and θ, and can also be written as

L(θ, P|Y, X) ∝ |P|^{n/2} exp{−½ Tr [S + (θ − θ̂)'X'X(θ − θ̂)]P},   (8.3)

where S is the m × m matrix

S = (Y − Xθ̂)'(Y − Xθ̂),   (8.4)

and


θ̂ = (X'X)⁻¹X'Y, if X is n × k of full rank. Of course, θ̂ is the least squares estimator of θ and (Y − Xθ̂)'(Y − Xθ̂) is the residual matrix. It is interesting to note that the likelihood function may also be written as

L(θ, P|Y, X) ∝ |P|^{n/2} exp{−½ Tr S P} exp{−½ (θ_c − θ̂_c)'(P ⊗ X'X)(θ_c − θ̂_c)},   (8.5)

where θ_c = (θ'_1, θ'_2, ..., θ'_m)', θ_i is the i-th column of θ, i = 1, 2, ..., m, and θ̂_i is the i-th column of θ̂. Thus θ̂_c is the maximum likelihood estimator of θ_c, which is normally distributed with mean θ_c and precision matrix P ⊗ X'X, or dispersion matrix P⁻¹ ⊗ (X'X)⁻¹. Thus the sampling distribution of θ̂_c is such that its variance-covariance matrix places restrictions on the variances and covariances of θ̂, and the same restrictions are placed on the least squares estimators of θ. This is a consequence of the definition of the model and the principle of maximum likelihood, and we will encounter this restriction in the Bayesian analysis.

Let us first use the conjugate family for the prior distribution of θ and P, which we see from (8.3) is a normal-Wishart, namely


ξ(θ, P) = ξ₁(θ|P) ξ₂(P),   (8.6)

where

ξ₁(θ|P) ∝ |P|^{k/2} exp{−½ Tr (θ − μ)'A(θ − μ)P}   (8.7)

and

ξ₂(P) ∝ |P|^{(v−m−1)/2} exp{−½ Tr BP},   (8.8)

where μ is a known k × m matrix, A is a known k × k matrix, v > m − 1, and B is a known m × m positive definite symmetric matrix. Thus, a priori, P has a marginal density which is Wishart with v degrees of freedom and precision matrix B, while the conditional prior density of θ given P is normal such that E(θ) = μ and θ* = (θ'_(1), θ'_(2), ..., θ'_(k))', where θ_(i) is the i-th row of θ, has precision matrix A ⊗ P. Multiplying (8.7) and (8.8) gives the joint prior density of θ and P, and if P is eliminated by integration (using the properties of the Wishart distribution, see DeGroot, 1970, p. 57), one sees the marginal prior density of θ is

ξ(θ) ∝ |B + (θ − μ)'A(θ − μ)|^{−(v+k)/2},

which is a matrix t distribution. Its properties are explained in Press (1982) and Box and Tiao (1973). Box and Tiao show the mean of θ is μ and the variance-covariance matrix of θ* is (v − m − 1)⁻¹ A⁻¹ ⊗ B; thus the parameters of the prior distribution of θ are functions of the parameters of the marginal prior distribution of P and of the conditional prior density of θ given P, which is a consequence of using the conjugate prior distribution.

Combining the prior (8.6) with the likelihood function (8.2) (note X'X can be singular), the posterior density is normal-Wishart, namely

ξ(θ, P|Y, X) ∝ ξ₁(θ|P, Y, X) ξ₂(P|Y, X),   (8.9)

where


ξ₁(θ|P, Y, X) ∝ |P|^{k/2} exp{−½ Tr (θ − θ̂)'(X'X + A)(θ − θ̂)P}   (8.10)

and

ξ₂(P|X, Y) ∝ |P|^{(n+v−m−1)/2} exp{−½ Tr [B + μ'Aμ + Y'Y − (X'Y + Aμ)'(X'X + A)⁻¹(X'Y + Aμ)]P},   (8.11)

where

θ̂ = (X'X + A)⁻¹(X'Y + Aμ).   (8.12)

Now (8.10) is the conditional density of θ given P, which is normal, and ξ₂ is the marginal density of P, which is Wishart. The marginal density of θ is a matrix t with density

ξ(θ|X, Y) ∝ |C + (θ − θ̂)'(X'X + A)(θ − θ̂)|^{−(n+v+k)/2},   (8.13)

where

C = B + μ'Aμ + Y'Y − (X'Y + Aμ)'(X'X + A)⁻¹(X'Y + Aμ),

and the posterior mean of θ is E(θ|X, Y) = θ̂. Also, the posterior dispersion matrix of θ* is

D(θ*|X, Y) = (n + v − m − 1)⁻¹ (X'X + A)⁻¹ ⊗ C.   (8.14)

The last equation shows certain restrictions are placed on the posterior variances and covariances of the regression matrix θ, which was noted by Rothenberg (1963). We have seen such restrictions occur if one uses the method of maximum likelihood; that is, the same restrictions are placed on the variance-covariance matrix (with A = 0) of the sampling distribution of the maximum likelihood estimator θ̂ = (X'X)⁻¹X'Y of θ.

Can these restrictions be avoided? Suppose one uses Jeffreys' improper prior density

ξ(θ, P) ∝ |P|^{−(m+1)/2},   (8.15)

then the marginal posterior density of θ is

ξ(θ|X, Y) ∝ |S + (θ − θ̂)'X'X(θ − θ̂)|^{−n/2},   (8.16)

and the same restrictions on the variances and covariances of θ* occur as with the maximum likelihood estimator of θ. See Zellner (1971, p. 229).
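For readers who wish to check these formulas numerically, the conjugate posterior quantities are ordinary matrix computations. The following sketch (Python with numpy; the function name, data, and hyperparameter values are hypothetical, not taken from the text) evaluates θ̂ and C of (8.12)-(8.13) and the dispersion (8.14); letting A and B go to zero matrices and v = −k recovers the vague-prior results described above.

import numpy as np

def conjugate_posterior(Y, X, mu, A, B, v):
    # Normal-Wishart posterior quantities for the model Y = X theta + e.
    # Y: n x m responses, X: n x k design, mu: k x m prior mean,
    # A: k x k prior precision factor, B: m x m Wishart scale, v: prior d.f.
    n, m = Y.shape
    G = X.T @ X + A
    H = X.T @ Y + A @ mu
    theta_hat = np.linalg.solve(G, H)                                # eq. (8.12)
    C = B + mu.T @ A @ mu + Y.T @ Y - H.T @ np.linalg.solve(G, H)    # eq. (8.13)
    disp = np.kron(np.linalg.inv(G), C) / (n + v - m - 1)            # eq. (8.14), row-stacked theta*
    return theta_hat, C, disp

# hypothetical illustration
rng = np.random.default_rng(0)
n, k, m = 30, 3, 2
X = rng.normal(size=(n, k))
Y = X @ rng.normal(size=(k, m)) + rng.normal(size=(n, m))
theta_hat, C, disp = conjugate_posterior(Y, X, np.zeros((k, m)),
                                         0.1 * np.eye(k), np.eye(m), v=m + 2)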


Can one develop a Bayesian analysis which avoids these restrictions? Press (1982) and Zellner (1971) develop a posterior analysis which somewhat avoids placing restrictions on the variances and covariances of the posterior distribution of θ. They use a generalized natural conjugate prior density (so called by Press), where θ and P are independent, θ has a normal density, and P a Wishart; however, the joint and marginal posterior densities of θ and P are not standard, and one must use approximations to obtain a standard form. If this is done (see Press, 1982), the large sample approximation to the marginal posterior distribution is normal, but the precision matrix of θ is D + G ⊗ (X'X), where D and G are known matrices and D⁻¹ is the dispersion matrix of the marginal prior density of θ, which is normal with some mean μ. The restrictions are avoided to some extent by adding D to the Kronecker product G ⊗ (X'X), the part of the sum which places restrictions on the marginal prior dispersion matrix of θ. The problem with this approach is that one's prior knowledge of the dispersion of θ may not be of such a nature as to remove undesirable restrictions on θ. Of course, another problem is that the posterior distributions of θ are large sample approximations, and their accuracy in small samples needs to be studied. The Bayesian analysis of the multivariate linear model based on the generalized conjugate prior will not be pursued here; the reader should consult Press (1982) or Zellner (1971).

Now consider the problem of forecasting or predicting future observations with the multivariate linear model. Suppose one plans a future experiment

W = Zθ + ẽ,   (8.17)

where Z is the known "future" n × k design matrix, W the n × m matrix of n future observations on the m dependent variables, and ẽ the future error matrix with the same properties as the error matrix e of (8.1); then (8.17) represents a future replication of the experiment given by the linear model (8.1). Predicting W is to be based on the Bayesian predictive density of W, which is the conditional density of W given Y, namely

g(W|Y, X, Z) = ∫_{Ω*} ξ(θ, P|Y, X) h(W|Z, θ, P) dθ dP,   (8.18)

where ξ(θ, P|Y, X) is the joint posterior density of the parameters and h is the conditional density of W given Z, θ, and P, namely

h(W|Z, θ, P) ∝ |P|^{n/2} exp{−½ Tr (W − Zθ)'(W − Zθ)P},   (8.19)


and Ω* is the parameter space, where θ is a k × m real matrix and P an m × m positive definite symmetric matrix. It is assumed that e and ẽ are independent. If one uses the conjugate prior density (8.6), the corresponding posterior density of the parameters is given by (8.9), and if this is used in the predictive density g of (8.18), the forecasting procedure will be based on conjugate prior information. Doing this, one may show the predictive density (8.18) reduces to

g(W|X, Y, Z) ∝ |(W − W*)'D(W − W*) + E|^{−(2n+v)/2},   (8.20)

where

W* = [I − Z(Z'Z + X'X + A)⁻¹Z']⁻¹ Z(Z'Z + X'X + A)⁻¹(X'Y + Aμ),   (8.21)

D = I − Z(Z'Z + X'X + A)⁻¹Z',   (8.22)

and

E = Y'Y + μ'Aμ + B + (X'Y + Aμ)'(Z'Z + X'X + A)⁻¹(X'Y + Aμ) − W*'Z(Z'Z + X'X + A)⁻¹(X'Y + Aμ).   (8.23)
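As a numerical aside, the quantities (8.21)-(8.23) are direct matrix computations once Y, X, Z, and the hyperparameters are in hand. A minimal sketch (the function name is mine, and the arrays are assumed to be conformable numpy arrays, as in the earlier sketch):

import numpy as np

def predictive_moments(Y, X, Z, mu, A, B):
    # W*, D and E of (8.21)-(8.23), the parameters of the matrix t predictive density (8.20).
    G = Z.T @ Z + X.T @ X + A
    H = X.T @ Y + A @ mu
    D = np.eye(Z.shape[0]) - Z @ np.linalg.solve(G, Z.T)        # eq. (8.22)
    W_star = np.linalg.solve(D, Z @ np.linalg.solve(G, H))      # eq. (8.21)
    E = (Y.T @ Y + mu.T @ A @ mu + B
         + H.T @ np.linalg.solve(G, H)
         - W_star.T @ Z @ np.linalg.solve(G, H))                # eq. (8.23)
    return W_star, D, E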

We see the predictive distribution of W is a matrix t with mean W*, and the covariance matrix of W̃ = [W'_(1), W'_(2), ..., W'_(n)]', where W_(i)' is the i-th row of W, is (n + v − m − 1)⁻¹ D⁻¹ ⊗ E. This predictive distribution was derived from (8.18) by completing the square on θ, eliminating θ using the properties of the normal density, eliminating P by the properties of the Wishart distribution, and finally completing the square on W to identify its distribution as a matrix t.

We now consider the problem of making inferences for the parameters θ and P of the multivariate linear model. If one has only vague prior knowledge about the parameters, one would not use the conjugate prior density (8.6), which was used in the derivation of the posterior analysis; instead we could consider Jeffreys' vague prior

ξ(θ, P) ∝ |P|^{−(m+1)/2},   (8.24)

which is the density employed by Box and Tiao (1973), Zellner (1971), and Press (1982), in which case the marginal posterior density of θ is


given by (8.16), which is obtained from the marginal posterior density (8.13) by letting A → 0(k × k), B → 0(m × m), and v = −k, and assuming X'X is nonsingular. Inferences for θ are based on this distribution, and one would estimate θ by E(θ|X, Y) = (X'X)⁻¹X'Y, the usual least squares estimator. More sophisticated inferences require the properties of the matrix t distribution, which are explained by Box and Tiao (1973, pp. 441-452) and Zellner (1971, pp. 396-399); the latter author refers to the matrix t distribution as a generalized Student t, denoted GSt. For example, in order to find HPD regions for subsets of the elements of θ (say the first column or row), Box and Tiao (p. 448) explain the procedure. If a row or column of the regression matrix is of primary interest, one may show that the row or column has a multivariate t distribution, which we have used many times in the previous chapters. Let θ₁ be the first column of the regression matrix θ; then Box and Tiao (1973, p. 443) show θ₁ has a multivariate t distribution, thus it is quite easy, using the material of Chapter 1, to construct HPD regions for θ₁.

In most applications of the multivariate linear model, the regression parameters θ are of primary concern and the precision matrix P is a nuisance parameter. If P is also of interest, one's inferences about the elements of P would be based on its marginal density (8.11), which is a Wishart distribution with n + v degrees of freedom and precision matrix (see DeGroot, 1970)

P(P|X, Y) = B + Y'Y + μ'Aμ − (X'Y + Aμ)'(X'X + A)⁻¹(X'Y + Aμ).   (8.25)

This distribution was derived assuming the conjugate normal-Wishart prior distribution (8.6), so if one uses the vague prior density (8.24) instead, one would modify the parameters of (8.11) by letting v = −k, A = 0, and B = 0; thus, a posteriori, P has a Wishart distribution with n − k degrees of freedom and precision matrix

P(P|X, Y) = Y'Y − Y'X(X'X)⁻¹X'Y,   (8.26)

the matrix of the sum of squares and cross products of the residuals. Thus, by assuming vague prior knowledge, one perhaps would estimate P by


E(P|X, Y) = (n − k)[Y'Y − Y'X(X'X)⁻¹X'Y]⁻¹,   (8.27)

the marginal posterior mean matrix of P. The properties of the Wishart distribution are given by Press (1982) as well as by Box and Tiao (1973) and Zellner (1971); however, Press gives a very detailed analysis of the Wishart and inverse Wishart distributions. For example (p. 107), he gives the variance-covariance matrix of the elements of the precision matrix. HPD regions for some of the elements of P can be found. For example, the diagonal elements of P have gamma distributions, so for them such regions are easily constructed numerically; however, for arbitrary subsets of the elements of P the density is unknown, and one would have to use multidimensional integration techniques on the Wishart density.

An examination of Box and Tiao (1973), Press (1982), and Zellner (1971) shows that the first authors give an interesting and thorough Bayesian analysis of a chemical process with two dependent variables, yield of product and yield of by-product, and one independent variable, temperature; thus the model is

Y_i1 = θ_11 X_i1 + θ_21 X_i2 + e_i1,
Y_i2 = θ_12 X_i1 + θ_22 X_i2 + e_i2,   (8.28)

where X_i1 = 1 and X_i2 is the i-th setting of the temperature corresponding to product yield Y_i1 and by-product yield Y_i2, for i = 1, 2, ..., n. The regression matrix is

θ = (θ₁, θ₂),   θ₁ = (θ_11, θ_21)',   θ₂ = (θ_12, θ_22)',

and the n independent bivariate error vectors e'_i = (e_i1, e_i2) are normal with mean (0, 0) and unknown 2 × 2 precision matrix P. Assuming a vague prior density for the parameters θ and P, Box and Tiao (1973, pp. 453-459) determine the marginal posterior distributions (bivariate t distributions) of θ₁ and θ₂, where the elements of θ₁ are the regression parameters of the product yield equation in (8.28). They plot some of the contours of the posterior distributions of θ₁ and θ₂ and from these test hypotheses about the parameters.

This example is quite interesting and demonstrates the


advantages of the Bayesian approach to inference. Zellner (1971, p. 244) also gives an example of using the model to analyze multivariate data, but it is not as detailed as the one given by Box and Tiao.
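A small numerical sketch in the same spirit as that example, with simulated temperatures and yields standing in for the chemical-process data (all values below are invented for illustration, not Box and Tiao's data):

import numpy as np

rng = np.random.default_rng(1)
n = 25
temp = rng.uniform(160.0, 190.0, size=n)               # hypothetical temperature settings
X = np.column_stack([np.ones(n), temp])                # X_i1 = 1, X_i2 = temperature
theta_true = np.array([[50.0, 20.0],                   # columns: product, by-product
                       [0.10, -0.05]])
Y = X @ theta_true + rng.normal(scale=0.5, size=(n, 2))

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)          # posterior mean of theta under (8.24)
S = Y.T @ Y - Y.T @ X @ theta_hat                      # residual matrix, eq. (8.26)
P_mean = (n - X.shape[1]) * np.linalg.inv(S)           # posterior mean of P, eq. (8.27)

Each column of theta_hat is the location of a bivariate t posterior, so contours of the kind plotted by Box and Tiao can be evaluated directly from these quantities.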

MULTIVARIATE DESIGN MODELS

The Bayesian analysis of multivariate linear models for designed experiments is now being developed, and most of the work has been done by Press (1980) and Press and Shigemasu (1982). Box and Tiao (1973) give a detailed treatment of multivariate regression analysis but devote very little to the analysis of designed experiments.

The analysis presented here is similar to that of Press and Shigemasu (1982) with one important difference: prior distributions for the parameters of the model are expressed with a natural conjugate prior distribution instead of assuming the precision matrix P and the other parameters θ (treatment and block means, etc.) to be independent. The Bayesian analysis of univariate design models for one-way and two-way layouts was given in Chapter 3 and, as mentioned there, was done earlier by Lindley and Smith (1972), who used exchangeable prior distributions. Press and Shigemasu (1982) also employ the idea of exchangeability for the prior distribution in their treatment of the multivariate case. Press, in his 1980 paper, uses a vague constant prior for the design parameters and a Jeffreys' vague prior for the precision matrix of the observations; therefore the analysis presented now is similar to his in that the posterior distribution of the parameters is normal-Wishart, which is the conjugate family for the multivariate normal linear model.

Let us first consider a one-way layout consisting of k treatments assigned at random to n experimental units such that n_i units, i = 1, 2, ..., k, receive treatment i. On each unit one observes m correlated responses; then if Y_ij is the m × 1 vector of responses obtained from the j-th unit which received treatment i, the normal theory model is

Y = Xθ + e,   (8.30)

where

Y' = [Y_11, ..., Y_{1n_1}; Y_21, ..., Y_{2n_2}; ...; Y_k1, Y_k2, ..., Y_{kn_k}],
e' = [e_11, ..., e_{1n_1}; e_21, ..., e_{2n_2}; ...; e_k1, e_k2, ..., e_{kn_k}],
θ' = [θ_(1), θ_(2), ..., θ_(k)],

and X = DIAG(J_{n_1}, ..., J_{n_k}) is a block diagonal matrix, where J_{n_i} is an n_i × 1 vector of ones. In terms of individual observations the model (8.30) is

Y_ij = θ_(i) + e_ij,   (8.31)

where θ_(i) is a real unknown m × 1 vector representing the effect of treatment i (the mean vector of population i), and the e_ij are such that the rows of e are independent normal random vectors with mean zero and m × m unknown precision matrix P.

Of course, the linear model for designed experiments has the same form as the multivariate regression model (8.1). The main parameters of interest are the treatment effects θ_(1), θ_(2), ..., θ_(k), m × 1 vectors which represent the means of the k normal populations, and one must specify prior information about them as well as about the precision matrix P. We have seen that if one uses the Jeffreys' improper prior density

ξ(θ, P) ∝ |P|^{−(m+1)/2},

the joint posterior density of θ and P is normal-Wishart and the marginal density of θ is

ξ(θ|X, Y) ∝ |S + (θ − θ̂)'X'X(θ − θ̂)|^{−n/2},   (8.32)

is a real k x m matrix, 6 = (X fX ) *XTY is the least squares estimator of 0 and S = Y TY — Y !X (X TX ) * is the matrix of residual sum of squares and cross products. This density of 0 is called a matrix t density and

0 is said to have a matrix t distribution with parameters 0 , (X 'X ) \ S, and d = n — k — m + 1 . Box and Tiao (1973) use the notation

-t, [ 0, ( X 'x r ^ s . d ] km

(8.33)

to denote the distribution of 0 , and we have seen that E ( 0 |X,Y) = 0 and the posterior dispersion matrix of 0 * = [ 0’ 1. , 0* , . . . , 0' . ] 1 is .. _( ( “) ( k) D ( 6 * |X, Y ) = (d - 2) (X 'X ) ® S. If on the other hand one uses a conjugate prior density for 6 and P, how does one choose the hyperparameters? Suppose the prior density of 0 and P is £ ( 0 , P) «

5 j ( e | P)

(8.34)

CL( e|P) -

, ,k/2 Tr |P | exp - - f ( 0 - U) ' A ( 0 - y ) P

C2 (P )

and

(8.35)

and , ( v -m -1)/2 _

Tr

2

(8.36)

where y is a real k x m matrix, A i s k x k , B i s m x m, and v > m — 1 , then we know the marginal prior density of 0 is

5(0)^

|B + ( 0 - y ) ' A ( 0 - u ) | ’ ( k + V ) / 2

(8.37)

Thus, if one uses a conjugate prior density for the parameters, one is assuming that E ( 0 ) = y and that

D ( 0 *) = (v - m -

l) ' ^ ’ 1 ©

B.

Now A is k x k and B is m x m and the Kronecker product is the km x km matrix

MULTIVARIATE DESIGN MODELS

(v — m — 1 )

387

-1

(

and the diagonal elements of A are equal, then the treatment effects L, x are such that they are uncorrelated and have (k ) the same precision matrix. If one chooses the rows of y to be the same, then one can represent prior information about the treatment effects so that they all have the same mean vector as well as the same covariance matrix. One still has the problem of choosing the hyperparameters, but ignoring that for now let us consider the posterior analysis. The marginal posterior density of 0 is given by (8 .1 3 ), that is e|X,Y - t, 1

km

[ 0 , (X 'X +

A )’ 1,

C, n + v -

m + 1],

(8.38)

where

6 = (X 'X + A )" 1 (X 'Y + A p ), and C = B + p'Ap + Y 'Y -

(X 'Y + A p )'(X 'X + A ) " 1(X 'Y + A p ). (8.39)

The matrix 0 of treatment effects has posterior mean 0 and the covariance matrix of 0 * is D (e | X ,Y ) = (n + v -

m-

D '^ X 'X + A ) ' 1 ®

C.

How does one make inferences for the treatment effects? In problems like these, one is usually interested in the treatment effects θ_(1), θ_(2), ..., θ_(k), each of which is an m × 1 vector, and especially in the

equality of treatment effects. In the classical approach to statistics one tests for equality of treatment effects by the analysis of variance; one should consult Press (1982), Anderson (1958), Morrison (1967), and Srivastava and Khatri (1979).

When one studies a univariate one-way layout by Bayesian techniques (see Chapter 3), one first finds the marginal posterior distribution of θ, which is a multivariate t distribution. Then one transforms to the F-distribution and finds an HPD region to test for equality of treatment effects. When one uses a Jeffreys' improper prior distribution, the HPD region test is equivalent to the analysis of variance, but what does one do in the multivariate case? Consider the hypothesis

H₀: θ_(1) = θ_(2) = ... = θ_(k),   (8.40)

then we know the marginal posterior distribution of θ is the matrix t (8.38), assuming a conjugate prior density. The null hypothesis is equivalent to H₀: η = 0, where η = Qθ and Q is the (k − 1) × k contrast matrix of (8.41), chosen so that Qθ = 0 exactly when all the treatment effects are equal. It can be shown, see Box and Tiao (1973, p. 447), that η is distributed as

t_{(k−1)m}[Qθ̂, Q(X'X + A)⁻¹Q', C, n + v − m + 1],

and one may test for equality of treatment effects by finding an HPD region for η. A (1 − α) HPD region for η is given by

C(α) = {η : U(η) ≤ k(α)},   (8.42)

where k(α) depends on α and the distribution of U(η), which can be determined by the methods of Box and Tiao (1973). They find the exact distribution of U(η) when m = 1 and 2 and give an approximation for m ≥ 3; in fact, they show that −φ d log U(η) has an approximate chi-square distribution with m(k − 1) degrees of freedom, where φ = 1 + (2d)⁻¹(m + k − 1 − 3) and d = n + v − m + 1. Thus equality of treatment effects is easily tested by use of the chi-square tables and is a straightforward extension of the testing procedure in the univariate case.
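To make the computations concrete, the following sketch builds the design matrix X = DIAG(J_{n_1}, ..., J_{n_k}) for a one-way layout and evaluates the posterior quantities (8.38)-(8.39); the group sizes, prior settings, and simulated data below are hypothetical:

import numpy as np

rng = np.random.default_rng(2)
sizes = [8, 10, 7]                                   # hypothetical n_i for k = 3 treatments
k, m = len(sizes), 2
n = sum(sizes)
X = np.zeros((n, k))                                 # X = DIAG(J_n1, ..., J_nk)
row = 0
for i, ni in enumerate(sizes):
    X[row:row + ni, i] = 1.0
    row += ni
true_means = np.array([[0.0, 0.0], [0.5, 0.2], [1.0, -0.3]])
Y = X @ true_means + rng.normal(scale=1.0, size=(n, m))

mu = np.zeros((k, m)); A = 0.01 * np.eye(k)          # weak conjugate prior (hypothetical)
B = np.eye(m); v = m + 2
G = X.T @ X + A
H = X.T @ Y + A @ mu
theta_hat = np.linalg.solve(G, H)                    # posterior mean of the treatment effects
C = B + mu.T @ A @ mu + Y.T @ Y - H.T @ np.linalg.solve(G, H)   # C of eq. (8.39)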


Press and Shigemasu (1982) show how one may analyze a two-way layout with and without interaction; however, they do not use a conjugate prior density. If one does, it appears possible to test hypotheses about the row and/or column effects by finding an HPD region for these parameters. Press and Shigemasu (1982) also extend their analyses to models which include covariables, thus giving a Bayesian alternative to the analysis of covariance procedures of classical statistics.

THE VECTOR AUTOREGRESSIVE PROCESS

Autoregressive processes are the most used of parametric time series models, and the univariate case was considered in Chapter 5. The Bayesian analysis of the vector autoregressive process is being developed by Litterman (1980); however, much remains to be done. The Bayesian identification of vector AR processes has yet to be developed, and the posterior and predictive analysis has yet to be completed, although Chow (1975) has given us the moments of the predictive distribution. By using a conjugate prior density for the parameters, the prior and predictive analysis will be developed here. First let us consider an m-vector AR(p) process, namely

Y(t) = Σ_{i=1}^{p} θ_i Y(t − i) + e(t),

where t = 1, 2, ..., n, Y(t) is an m × 1 real observation at time t, θ_i is an m × m real matrix, θ_p is a nonzero matrix, p = 1, 2, ..., and the e(t) are n.i.d. (0, P⁻¹), where P is an m × m positive definite symmetric precision matrix. The model is a special case of the multivariate linear model (8.1), where

Y = Xθ + e

(8.43)

and Y is an n × m matrix with t-th row Y'(t), the t-th row of the n × pm matrix X is [Y'(t − 1), Y'(t − 2), ..., Y'(t − p)], and the pm × m regression matrix θ of autoregressive parameters is


θ = (θ'_1, θ'_2, ..., θ'_p)',   (8.44)

and e is n × m with t-th row e'(t), t = 1, 2, ..., n. It is assumed that Y(0), Y(−1), ..., Y(1 − p) are known "initial" values. Conditions of stationarity will not be imposed on the parameters. The likelihood function is

L(θ, P|X, Y) ∝ |P|^{n/2} exp{−½ Tr (Y − Xθ)'(Y − Xθ)P}

and we will use a conjugate prior density for the parameters, namely

ξ(θ|P) ∝ |P|^{mp/2} exp{−½ Tr (θ − μ)'A(θ − μ)P}

as the conditional prior density of e given P and

ξ(P) ∝ |P|^{(v−m−1)/2} exp{−½ Tr BP}

is the marginal prior density of P and is Wishart with v degrees of freedom and precision matrix B . The marginal prior density of 0 is the matrix t density

ξ(θ) ∝ |B + (θ − μ)'A(θ − μ)|^{−(mp+v)/2},

(8 .45)

where B is a positive definite m × m matrix and A is mp × mp. The prior dispersion matrix of θ* is

D(θ*) = (v − m − 1)⁻¹ A⁻¹ ⊗ B,   (8.46)

(8 .46)

where 0* is the m^p x l vector

0* = ( 0 11, . . . , 0 , ; 0 O1, . . . , 0 O ; . . . , 0 11 lm 21 2m pi where 0„ is the j-th row of 0!.

..., 0 ) T, pm

(8.47)

Thus the hyperparameters of the

prior distribution of the parameter may be chosen so that the autore-

THE VECTOR AUTOREGRESSIVE PROCESS

391

gressive matrices are uncorrelated, each having the same mean matrix and dispersion matrix. If m = p = 2, the 4 x 4 matrix in the upper left hand portion of the 8 x 8 matrix D ( 0*) is

aU B (v - m -

1) 1

a1 2 B

|

| * a B

(8.48)

a2SB

where a..* is the ii-th element of A \ and is the variance-covariance i] matrix of the first autoregressive matrix 01 . This would be a second order model and the variance-covariance matrix of the second auto­ regressive matrix would be the 4 x 4 matrix in the lower right hand corner of D ( 0* ) . The covariance matrix between 0T ^ and 0^ is the 4 x 4 matrix

f

—1 __ a l3

-1 a l4

(v -

a2 3 B

l ) " 1,

a24B '

and by choosing a g, a ^ , a ^ , and a 24 to be zero, one may put zero correlation between the two autoregressive matrices of the model. B is given by the partitioned matrix

-1 ® B 11

A 12

-1 ® 21

A 22

B

®

B



®

B



> H * 1 no

In general A 1 ©



A 2p

® ® = A* 1 ®

A’ 1 ® PP

B,

B (8.49)

where the mp x mp matrix A * is partitioned as

MULTIVARIATE LINEAR MODELS

392 a ’ 1 = (A y 1) , i = 1,2,

Therefore A..* © e! and e! if i

4

p ; j = 1,2,

p.

B (v — 1) * is the covariance matrix between

j , and the dispersion matrix of 0!, if i = j .

If one

chooses A.. = 0(m x m ), i # j, and A.. = A = ... = A , the reg re s1] XX Li Li PP sion matrices 0* , 0* , . . . , 0! are uncorrelated and each have the same 1 2 p dispersion matrix. We also see the prior variance-covariance matrix places certain restrictions, a priori, on the variances and covariances of the elements of the regression matrices of the model, and if these do not conform to one's prior information about the parameters, the prior density used here is not appropriate. This problem was encountered with the multivariate linear model. The posterior analysis of the vector autoregressive model is the same as that for the multivariate linear model. In particular the mar­ ginal posterior density of 0 is

6

+ A ) _ 1 , C, n + v -

(pm)m

m + 1]

(8.5 0)

where

and

e = ( X ' X + A ) ’ 1( X ' Y + A y )

C = B + y ' A y + Y 'Y -

e'(X 'Y + A y ) ,

Thus the posterior mean of 0 is 0 and the dispersion matrix of 0 * is

D ( 0* | X ,Y ) = A * © C ( n + v -

m - l )

1,

(8.51)

where A * = (X 'X + A ) ’ 1.

2

Recall that 9* is m p x l and given by 0* = [ 01 1 } 0 ln , . . . , 0 .

11

e2 1 >02 2 ’ 02 m: 6p 1 ’ 9p2’ row of 0^, and is 1 x m. Let

12

lm

;

0pm] ’ ’ where e12 is the second

THE VECTOR AUTOREGRESSIVE PROCESS

393

e* =

where the i-th row of 0 ! is 0.. and i i]

1 ,2 ,

( 8 . 53 )

p,

2

which is a m x l vector. If A * is partitioned into p dispersion matrix of 0? is

D (e * |X,Y) = A*. ®

submatrices A.*, of order m, then the

ll

C (n + v -

m-

l) ' 1

(8.54)

and the covariance matrix between 0.* and 0* is i 3 C o v ( 6* , e* |X,Y) = (n + V -

m-

D ^ A ? 1. ® C .

(8.55)

In making inferences for the regression parameters of this model, one bases them on the matrix t distribution (8 .5 0 ), and we saw how to use this distribution for the multivariate design model with a one-way layout. Often the autoregressive model is used to predict future obser­ vations and this will be done, but sometimes one is more interested in knowing the values of the parameters 0 and P as in some time series which occur in the analysis of designed experiments of psychology as in Glass, Willson, and Gottman (1975). OneTs knowledge of the 0 parameters is gained from a posterior analysis of the model which is based on the matrix t distribution t, x [ 0 , ( X fX + A ) (pm)m

* , C, n + v -

m + 1].

We have seen that one may

MULTIVARIATE LINEAR MODELS

394

test hypotheses about the regression parameters by constructing HPD regions. Consider the bivariate A R (1 ) model

Y l (t )

=

0l l Y l ( t



1} +

012Y 2( t



1} +

el ( t )

Y 2( t )

= 921Y l ( t

~

1} +

622Y 2( t

-

1} +

e 2( t )

where t = 1 , 2 , . . . , n , which is a model for relating two univariate time series { Y ^ t ) : t = 1,2, . . . } and ( Y 2 ( t ) : t = 1,2, . . . } . Thus each component of the bivariate time series is a linear function of the previous observations on both series plus an error term eT(t ) = [e ( t ) , which has a bivariate normal distribution with mean ( 0 , 0 ) and 2 x 2 precision matrix P. In terms of the general model m = 2, p = 1 and 0 = 6^, where 0 ^ is the 2 x 2 regression matrix

If a normal-Wishart distribution of 0 and P is used, we have seen the marginal posterior density of 0 is the matrix t t o[ 0 , ( X TX + A ) \ C, n + v matrix of

1] and one would estimate 0 by 0 , while the dispersion

is D (e *| X ,Y ) = (n + v where A * = (X 'X + A ) ’ 1 and

3 )_ 1A * ®

C,

THE VECTOR AUTOREGRESSIVE PROCESS

A* ©

s; i c

»;2c

a 21

a 22 C

395

C =

where a?5, is the ij-th element of A * and C is 2 * 2. persion matrix of (Q-q ,

The posterior dis­

*s ( n + v “ 3) ^ a ^ C, the dispersion

matrix of (0 12’ 9 22^? *s ( n + v “ 3) *a* 2 C, and the covariance matrix between ( e n> 9 ^ ) ' and ( 9 ^ , 9 ^ ) ' is a * 2

C=a*n C.

Now,

021 22 and one can make inferences about this matrix by the techniques of the earlier sections. For example, the hypothesis 0 ^ 2 = 0 may be tested as follows. We know the first column of 0 has a bivariate t distribution and that 0 ^ 2 has a univariate t distribution, then an HPD region for 0 ^ 2 can be constructed and H^ rejected if 0 ^ 2 = 0 is not in the interval.

Also HQ:

= 0 can be tested by a similar procedure.

Other inferences about the 0 matrix can be developed by the methods given by Box and Tiao (1973), Press (1982), and Zellner (1971). We now consider predicting one step ahead for Y (n + 1) with a vector autoregressive process. The general formula for this is given by ( 8 . 2 0 ) , namely g [ Y ( n + 1) f X,Y ,X*] «

|E + [ Y ( n + 1) X [ Y ( n + 1) -

Y * ]'D *

Y* ] | * ( 2 n + v ) / 2 ,

(8.56)

where

Y * = [I -

x

X * ( X * ' X * + X ' X + A ) ' 1X * ' ] ’ 1X * ( X * ' X * + X ' X + A ) " 1

(X'Y + A y ) ,

E = Y ' Y + y ' A y + B + ( X ' Y + A y ) ' ( X * ' X * + X ' X + A ) ' 1( X ' Y + A y ) -

Y * ' X * ( X * ' X * + X 'X + A ) _1( X ' Y + A y ) ,

MULTIVARIATE LINEAR MODELS

396 and D* = I - X *(X * 'X * + X'X +

The other term that needs to be defined is X* which is the X matrix of the future observation Y (n + 1), where Y (n + 1) = X * 6 + e (n + 1) or Y f( n + 1) = Y ' ( n ) e» + Y '(n - 1) e' + . . . + Y '(n - p - 1) eT + 1 i p ef( n + 1). The (n + l ) - s t error term is n ( 0 , p ) and independent of m the error term e of ( 8. 43) , thus X* is the 1 x mp matrix X * = [ Y ' ( n ) , Y ’(n -

1),

Y '(n - p -

1) ] .

(8.57)

The predictive density (8.56) is based on the conjugate prior density which is given b y (8 .4 5 ). Since Y (n + 1) is a m x l vector the predictive distribution of Y (n + 1) is a multivariate t distribution with mean vector E [ Y ( n + 1) |X, Y, X* ] = Y* . precision matrix P [ Y ( n + 1) |X, Y, X* ] = (2n + v -

m-

2) ' 1© * ' 1 ®

E,

and 2 n + v — m — 2 degrees of freedom. If one uses Y * as a point forecast of Y (n + 1), one is simul­ taneously forecasting the m future components o f Y ( n + 1), but sup­ pose one is interested only in the first component of the future obser­ vation, then how would one construct its point forcast? Of course one would most likely use the first component of Y * and the precision of the forecast would be the precision of the first component of a ran ­ dom vector which has a multivariate t distribution and the formulas are well-known and given in Chapter 1 or in DeGroot (1970). An HPD region for Y (n + 1) is based on the random variable

F [ Y ( n + 1)] = [ Y ( n + 1) - Y * ] ' D * [ Y ( n + 1) -

Y ^ m '1,

(8.58)

which has an F distribution with m and 2n + v — m degrees of freedom and a ( 1 — a) region is given by the ellipse { Y ( n + 1): F [ Y ( n + 1)] < F 0

0 A }. 2 , m, 2 n+v-m

If one wants to predict two or more steps ahead one very easily can derive the appropriate Bayesian predictive density but it is not a standard distribution.
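To illustrate the one-step-ahead computation just described, the sketch below builds the lagged design matrix for an m-variate AR(p) model, forms the weak conjugate-prior posterior quantities, and evaluates the point forecast Y* of (8.56)-(8.57). The simulated series, the prior settings, and the helper function ar_design are all hypothetical illustrations, not part of the text:

import numpy as np

def ar_design(series, p):
    # series: (n + p) x m array whose first p rows are the known initial values.
    n = series.shape[0] - p
    Y = series[p:]
    X = np.hstack([series[p - i:p - i + n] for i in range(1, p + 1)])  # row t: [Y'(t-1),...,Y'(t-p)]
    return Y, X

rng = np.random.default_rng(3)
m, p, n = 2, 1, 60
phi = np.array([[0.5, 0.1], [-0.2, 0.3]])           # hypothetical AR(1) coefficient matrix
series = np.zeros((n + p, m))
for t in range(p, n + p):
    series[t] = phi @ series[t - 1] + rng.normal(scale=0.3, size=m)

Y, X = ar_design(series, p)
mu = np.zeros((m * p, m)); A = 0.01 * np.eye(m * p); B = np.eye(m); v = m + 2
x_star = series[-p:][::-1].reshape(1, -1)           # X* built from the last p observations
G = x_star.T @ x_star + X.T @ X + A                 # X*'X* + X'X + A
H = X.T @ Y + A @ mu
d = 1.0 - x_star @ np.linalg.solve(G, x_star.T)     # D* of (8.56), a scalar for one future row
Y_star = (x_star @ np.linalg.solve(G, H)) / d       # point forecast of Y(n + 1)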

THE VECTOR AUTOREGRESSIVE PROCESS

397

The posterior and predictive analyses of the vector autoregres­ sive process were derived with a conjugate prior density but if one believes the Jeffreys 1 prior e(0,P)

|j,

-

|- Cm+l ) /2

is appropriate, the posterior and predictive densities are obtained by letting v = —k (=m ), A •* 0(k x k ) and B 0(m x m) in the appropriate formulas. Model identification is an important matter when one uses an autoregressive process to model a time series and this problem was discussed in Chapter 5 on time series analysis. In that chapter a Bayesian identification method of Diaz and Farah (1981) was explained for a univariate autoregressive process. This method is easily e x ­ tended to a vector process as follows. Consider a m dimensional vec­ tor AR process of order p, P Y '( t )

Y '( t - j ) 0 . + e '(t ),

t = 1, 2, . . . , n,

(8.59)

j= l where p = 1,2, . . . , L , and L is the maximum known order of the model with L = 1,2, . . . . The observation at time t is Y ( t ) , which is a m x l vector and the 0pj, j = 1,2 . . . , p, p = 1,2 . . . , L are m x m real matrices, Y ( 0 ) , Y ( - l ) .........Y( 1 - p ) are known initial values, and e ( l ) , . . . , e(n ) are independent n (0 ,P "1 ) random m x l vectors. Then the model is Y = X 0 + E, P where Y is n x m and

and X is n x mp and

(8.60)

398

MULTIVARIATE LINEAR MODELS Y ’( 0 )

Y T( —1) ,

. . . , Y T( 1 — p)

Y'(l)

Y f( 0 ) ,

..., Y '(2 -

p)

X =

Y f( n -

1)

Y '(n -

2), . . . , Y f( n - p)

Also

and

E =

The prior distribution for p, 9^, and P are chosen as follows: First the marginal mass function o f p i s £ ( p) , p = 1,2, . . . , L, and given p , 6 and P have a normal-Wishart distribution with density P

5(6

P

,p|p) = 5 ( 6

1

P

|P,p) 5 „ ( P ) , *

where

x e x p - E ( e p - Up)' Ap

0,

p

>

0

(8.70)

then the joint posterior density of the parameters is

? ( 0 , P ,

t

| X , Y )

-

T1172' 1

exp -

I [(Y

-

p Y _ j) -

(X -

x [ ( Y — PY _ 1) — (X — p X )e ],

pX ^ e ] '

(8.71)

where p € R, x > 0, and 0 G R*\ It is seen that the conditional pos­ terior distribution of x given 0 and p is gamma, the conditional dis­ tribution of 0 given p is a multivariate t, and the conditional distribu­ tion of p given 0 is a univariate t. Furthermore, it is seen the joint posterior distribution of p and 0 is not a (p + 1 )-dimensional multi­ variate t, and that the marginal distributions of 0 and p may be iso­ lated but they are not standard well-known distributions. The joint density of 0 and p is £( 0 , p |X ,Y ) a { [ ( Y -

pY^) -

(X p X ^ e ] 1

x [(Y -

pY _x) -

(X -

pX _1)6 ] }~n/2,

where 9 e R^ and p e R , and we see that given p , 9 has a multivariate t density and that given 0, p has a univariate t distribution, but that 0 and p jointly do not have a multivariate t distribution. The marginal density of p is obtained by first integrating (8.71) with respect to 0 then x and the result is 5 (p | X ,Y ) « CjCP.X) e2( p , X , Y ) ,

p e R,

where

C1 ( P ’ X) = ---------------------------------------------172

|(X -

and

Px_1)'(x - pX^)! '

(8.72)

MULTIVARIATE LINEAR MODELS

404 £ 2( P , X , Y )

=

p Y _ 1>, ( Y

-

1_____________________________ {(Y

-

X (X -

p Y _ x) — ( Y

p X ^ f^ X

— p Y _ 1) t( X

- p X _ 1) , ( Y -

-

pX

^ K X

-

pX^)'

p Y ^ ) }(n l)/2 #

This density is not of a standard form but is easily graphed since p is a scalar. Also the posterior moments of p are found by numerical inte­ gration . In the same way, the marginal posterior density of 0 can be de­ rived, however since 0 is p-dimensional it may be difficult to find the marginal moments of the components of 0. Another approach is to find the conditional distribution of 0 given p , which is a multivariate t dis­ tribution with n degrees of freedom, location vector

E ( e | X , Y , p ) = t(X - pX_1)'(X - pX_1) ] ' 1(X - pX_1)'(Y - pY_x) (8.73) and precision matrix n(X

-

-

pY

PX

)'(X

-

PX

(X

-

)

P ( e | X , Y , p ) = ---------------------------------------- , where S(p,X,Y)

= (Y

X

(X

-

)'[I

- i n

-

p X _ 1) ] " 1( X

-

pX

J [(X

-1

p X ^ 'K Y

-

-

pX

pY_

)'

-1

j )

.

The marginal mean vector of 0 is

E ( e | X, Y ) = E E ( e | X , Y , p ) ,

(8.75)

P

where

E

denotes expectation with respect to p .

P

The marginal posterior mean of 0 is computed by the numerical integration of £ ( p | X , Y ) E ( 0 | X , Y , p ) with respect to p, thus only p one-dimensional integrations are involved in order to determine this moment. If one first found the marginal density of 0, the means of

OTHER TIME SERIES MODELS

405

each of the p components of 0 would involve (p - 1) dimensional inte­ grations, thus if p > 3, one has more computing efficiency from formu­ la (8 .7 5 ). In the same way one may find the dispersion matrix of 0 by averaging the conditional dispersion matrix (n — 2) *P * (0 | X ,Y ,p ) of 0 with respect to the marginal density of p . To make inferences for 0 and p , one may base them on the ap­ proximate marginal posterior distribution, but how does one estimate and test hypotheses about the precision t ? It can be shown that the marginal density of x cannot be obtained in closed form, however it can be shown the conditional density of t given p is 5 ( t |X ,Y , p ) - t (n "P)/2

exp - | S ( p , X , Y ) ,

t> 0

(8.76)

where S (p , X , Y ) is given by (8 .7 4 ), that is the conditional distribu­ tion of t given p is a gamma with parameters a = (n — p)/2 and 3 = S (p ,X ,Y )/ 2 .

The conditional posterior mean of t 1 given p is

E ( t ' 1 |p . X , Y )

= S(p , X , Y )(n - p -

2 )" 1.

(8.77)

One may estimate t either from the marginal distribution of t or from the conditional distribution of t given p by conditioning on an estimate of p, say E(p | X ,Y ), which is computed via (8 .7 2 ), or one may find the marginal posterior mean of t * by averaging E ( t * | p ,X ,Y ) with respect to the marginal density of p , that is

E ( t ~ 1 |X, Y )

=

/

E ( T ' 1|X,Y,p)C(p|X,Y)dp

R may be found numerically. O f course in the same way one may compute the precision or variance of the marginal posterior distribution of t . An alternative method of estimation of the parameters 0, t , and p of the multiple regression model with autocorrelated errors is to de­ termine the mode of the joint posterior density (8 .7 1 ), as was done in the univariate case of the model in Chapter 5. Lindley and Smith (1972) show the joint mode of the distribution may be found by solving simultaneously the modal equations, which are found from the conditional distributions of 0 given p and t , p given 0 and x, and x given 0 and p. From an examination of the joint density (8 .7 1 ), it can be shown that the conditional posterior density of i given 0 and p is gamma with parameters a = n/2 and 3, where 23 = [ ( Y - pY_^) - (X - p X ^ )© ]' x [(Y -

pY 1) -

(X -

pX _1)0 ], that the conditional posterior distribu­

tion of 0 given p and x is normal with mode

MULTIVARIATE LINEAR MODELS

406 E ( 61p , t ) = [(X -

PX _ 1)'(X -

PX _ 1) ] ' 1(X -

pX _1) ,(Y -

p Y _ , (8.78)

and that p given 0 and t is also normal with mode E ( e |p ,

t

)

=

[ ( Y _ 1 - x _ 10 ) ' ( Y _ 1 - x _ 1e ) ] " 1( Y _ 1 - X _ 16 )'(Y

-

X 0) .

(8.79) The mode of conditional distribution of

m(Tj 0,p) =

t

given

0

and

p

-

is

(8.80)

P

The three equations

0

= E (

0|p,t),

p = E(p |0» t ) , t

=

m (

t

(8.81)

|0 , p )

are solved simultaneously by an iterative scheme: one begins with an initial guess for ρ, solves for θ and τ, and repeats the cycle until the solutions stabilize. Of course, convergence is not guaranteed, and if convergence occurs the solution is perhaps the mode of the joint density. This method of estimation requires further investigation and should be used with caution. Thus there are two ways to estimate the parameters from the joint density: the first is to find the means of the marginal distributions along with their variances, and the second is the joint mode. The first way gives the marginal Bayesian estimators of the parameters with respect to a squared error loss function, and the second gives additional information about the joint posterior distribution. One may easily repeat the above posterior analysis when one uses a proper prior density for the parameters. For example, consider

C (0 ,P ,T ) = ^ ( t K ^ O . p |t ) , where r a -1 -ib € ^ (t ) « x e ,

t

> 0,

t

> 0,

0€R P,

p e Rf

(8 .82)

OTHER TIME SERIES MODELS

52( 0 , p |t ) « x 1/2 exp -

407 | [(Y * -

p Y *x) -

(X * -

x t (Y * — p Y * ^ ) — (X * — PX * 1)0 ] , where Y * , Y * ^ X *, and X*^ are n x i , n x i , n x p ,

pX ^je]'

6 eRP,

p e R ,

and n x p matrix

hyperparameters. Indeed these quantities may be chosen from an hy­ pothetical future "experiment” (o r with past d a ta ), where Y * = X*e + e*.

(8.83)

If this is done the posterior and predictive distributions are of the same type that were derived with the improper prior density (8 .7 0 ). The predictive density of a future observation was not found but was obtained for the simple linear regression model of Chapter 5.
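A minimal sketch of the iterative modal scheme described earlier in this section (Python with numpy; the function name, starting value, tolerance, and stopping rule are my own choices, and the τ update uses the mode of the conditional gamma density, consistent with the conditional given in the text):

import numpy as np

def joint_mode(Y, X, Y_lag, X_lag, n_iter=200, tol=1e-8):
    # Iterates the modal equations (8.78)-(8.80) for theta, rho and tau.
    # Y, X: observations and design; Y_lag, X_lag: their one-period lags.
    n = X.shape[0]
    rho = 0.0                                        # hypothetical starting value
    for _ in range(n_iter):
        Xs = X - rho * X_lag
        Ys = Y - rho * Y_lag
        theta = np.linalg.solve(Xs.T @ Xs, Xs.T @ Ys)          # eq. (8.78)
        u = Y - X @ theta
        u_lag = Y_lag - X_lag @ theta
        rho_new = (u_lag @ u) / (u_lag @ u_lag)                # eq. (8.79)
        resid = Ys - Xs @ theta
        tau = (n - 2) / (resid @ resid)                        # gamma mode, eq. (8.80)
        if abs(rho_new - rho) < tol:
            rho = rho_new
            break
        rho = rho_new
    return theta, rho, tau

As the text cautions, this cycle may fail to converge, so in practice one would monitor the successive values of ρ rather than trust a single run.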

A Transfer Function Model A transfer function model relates two time series {Y ( t ) : t = 1,2, . . . } and ( X ( t ) : t = 1,2, . . . } in such a way that P

q

Y (t ) = £ 0 Y (t - i) + 2 * 2, •••> +q)

e

Rr are unknown vectors, that e ( l ) ,

R q » and n = (n ^ n g . •••> nr ) e . . . , e (n ) are n .i.d . (0, t^ 1) , that

MULTIVARIATE LINEAR MODELS

408

two sequences are independent, that Y ( 0 ) , Y ( 1 ) , Y (1 — p ) ; X (0 ), X [m a x (r ,q )] are known scalars and that and x2 are unknown. Under these conditions, how does one estimate the unknown pa­ rameters? How does one perform a Bayesian analysis of such a model? It is obvious that if the X values are fixed, the model can be analyzed as one would analyze any multivariate linear model. The density of Y ( l ) , Y ( 2 ) , . . . , Y (n ) given 0, + , ^ , and X (1 ),X (2 ), . . . , X (n ) is n

t

g [Y 16,,t ,X ] « t" /2 exp -

^

£

t e (B )Y (t ) -

K B ) X ( t ) ] 2,

t=l Y e Rn (8.8 5) where Y = [Y ( 1) ,Y (2 ) , . . . . Y ( n ) ] ' ,

0(B) = 1 - 01B - 0AB 2 - . . . - 0PBP , S (m ,X ,Y ), d ],

where d = n + a — p + 1. Now how should one use (8.104) in order to make inferences for y? Suppose one wants to see if the first column of the regression matrix 0^ has changed, then one would see if the first column of y was different than the zero vector and this can be done by examining the posterior distribution of the first column of y, say y , where

and y. is the i-th column (k * 1) of y .

MULTIVARIATE MODELS WITH STRUCTURAL CHANGE Since y

417

= yQ , where Q is p x 1 and Q = (1 ,0 ,0 , . . . , 0 )1, it

can be shown, again from Box and Tiao (1973, p . 447), that y^ has density n-1 «y J Y )< *

H

C(rn|Y) 5(y

|m,Y),

(8.106)

m=l where C (Yj |m,Y) «

1__________________________________ [ ( Y x-

Y1)'[P (X 'X + A ) ' 1P 1] " 1( y 1 -

Yx) + Q 'S (m ,X ,Y )Q ](n+2k+a)/2 (8.107)

Now y^ = yQ and P is given by (8.103) and we see the marginal pos­ terior distribution of y

is a mixture of k dimensional t densities.

U n­

fortunately, it is quite difficult to find an HPD region for y , since its posterior distribution is a mixture, however one may test H^: y^ = 0(k x 1) by a conditional test from the conditional posterior distribu­ tion of y^ given m = m*, where m* is an estimate of m obtained from the marginal posterior distribution of m, given by (8.100). We know the conditional distribution of y^ has density (8.107) and represents a t distribution with n + k + a degrees of freedom, lo­ cation vector y , and precision matrix (n + k + a) [P (X ’X + A ) ^P’ ] * * [Q 'S (m ,X ,Y )Q ] F ( Y l |m) =

~l ,

thus -

Y1)'[P (X 'X + A ) ' V f \ y 1 -

Y x)

(8.108)

has a conditional distribution (given m) which is an F with k and n + k + a degrees of freedom and one would reject H^: y^ = 0 if

F(0|m) > F (3 ;k ,n + k + a)

(8.109)

with s significance level of 3, where F( 3;k ,n + k + a) is the upper 1003% point of the F distribution with k and n + k + a degrees of free-

MULTIVARIATE LINEAR MODELS

418

dom. One should use the conditional test with caution because the appropriate test should be based on the marginal posterior density of not its conditional counterpart. If the conditioning value of m, m* has a "large" posterior probability, then the conditional distribution of given m = m* is a "good" approximation to the marginal distribu­ tion of m, in which case one may use the conditional F test (8.109). If y ^ is scalar, one may Easily find the HPD region for y ^ by numerical integration of the marginal distribution of y ^ and this was done in an example of the previous chapter, but if y ^ is not scalar it is quite difficult to numerically integrate the mixture of t densities (8.106). Up to this point we know how to make inferences for the shift point parameter m and the matrices of regression parameters, and the only other parameters of interest are the elements of the precision matrix P, which give the correlations between the p responses, thus we need the marginal posterior distribution of P. The joint posterior density of m and P is given by (8 .9 9 ), thus the marginal density of P is n-1

C(P|Y) 2 e(m|Y) =

(8.110)

£ (P | m ,Y ),

m=l

where £(m|Y) is the marginal posterior mass function of m and £(P|m ,Y) is the conditional posterior density function of P and is also given by (8.99) which is a Wishart distribution with (n + a) de­ grees of freedom and precision matrix P (P|m ,Y)

= B + Y'Y

+ u'Ap

-

(X 'Y

+ A v )'(X 'X

+ A ) ‘ 1( X ' Y

+ A p ).

8 111 )

( .

A point estimate of P is given by the posterior mean of P which is n-1 E (P|Y)

=

Z )

5 ( m | Y ) ( n + a ) P " \ P |m , Y ) ,

(8.112)

m=l however HPD regions for P are difficult to obtain because its distribu­ tion is a mixture. Conditional HPD regions for the components of P are more easily determined from the conditional posterior distribution of P given m = m*, where m* is say the posterior mode of m. For e x ­ ample, one perhaps is interested in testing for zero correlation between the p components of the dependent variable.

MULTIVARIATE MODELS WITH STRUCTURAL CHANGE

419

Some questions about the posterior analysis have been left un­ answered but for the most part, if one is interested in estimating or testing hypotheses about the parameters, one would be able to do these things from the information given in this section. Often one wants to forecast future observations W = (W ^ W ^ , W ^ )T where W is a

I

x p matrix, where

W = V02 + E* with V a

I

x k known matrix, 02 the regression matrix of the changing

model (8 .9 3 ), and E* an

i

x p matrix with rows which are independent

and distributed as N (0 , P * ) . Note V is a known matrix of Z observa­ tions on the k independent variables and E* is independent of E- and V

Changing multivariate models should be very useful in fore­ casting future economic activity because the models used now do not include the possibility of changing parameters, therefore it seems if the models are revised to include changing parameters the fore­ casts should improve. There are many ways to represent parameters which change in a linear process other than the way we did here in this section, which was to change the regression matrix only once during the observa­ tion period. Another way is given by Chapter 6 with the linear d y ­ namic model which has parameters which always change. The Bayesian predictive density will be used to forecast the future observations W and is determined by finding the conditional density of W given the past observations Y . It can be shown that the future distribution of W is a mixture of matrix t distributions and that the mixing distribution is the posterior mass function of m, the shift point. This section is concluded with a numerical study of the robustness of the distribution of m to changes in the values of the hyperparam­ eters y , A , and a of the prior distribution. In this study, the natural conjugate prior distribution of the rows of 0., i = 1, 2, namely 0L, j = 1,2, . . . , k, given P is a multivariate normal with mean vector y !. e R^ and precision matrix r..P, r.. > 0, i] F i] i] and the marginal distribution of P is Wishart with a degrees of freedom and precision matrix B . One may show the posterior distribution of the change point is S(m|Y) «

|D1D 2 |'P/2Q(m ) ‘ (n + a )/ 2 ( l < m < n -

= 0 otherwise,

1

(8.113)

MULTIVARIATE LINEAR MODELS

420 where D. = X!X. + R., i = 1, 2 i l i l R. = D ia g o n a l(r..), i = 1, 2

2

Q(m) = B + E [Y !Y . + y!R .V. i= l

(X !Y. + R .v ^ 'D .'^ X I Y . + R . y . ) ] ,

and

vi = ( y i l ’ pi2 ’ • • • ’

v ikVt

i = 1* 2>

For the numerical study the following choices were made for the parameters of the model:

•.■ O ' A 2 = .04, .05, .06, and .07,

_ + . ai v 2~ y l + . A1

°

0

A^ = 0, .2, .4, and .6, and 2

(

2p

2pl 2,

where p = 0, ±.2, ±.5, and ±.7. The bivariate observations were generated with the 2 x 2 p re ­ cision matrix P, where

MULTIVARIATE MODELS WITH STRUCTURAL CHANGE

and

2

2

= Og = 1.

421

The example is a bivariate regression model with

two independent variables and the first m bivariate observations are Y.

ll

- 011X .1 + 0O1X .O + e...

Y i2

612Xi l + 022Xi2 + ®i2

11 l l

21 i2

ll

for i = 1,2, . . . , m , 1 < m < n — 1. is

11

12

21

21 >

(8.114)

The regression parameter matrix

and since

A2 1

+

A2 only the first column of 0^ changes, thus only the average value of the first component of (Y of Y.g.

, Y ) changes and not the average value 1JL 1Z The n x 2 design matrix consisted of ones for the first col­

umn and the values for the second column were two-digit numbers selected at random from a random number table. Now n = 20 bivariate obser­ vations were generated and three different cases were considered for the actual change, namely m = 3, 10, and 17. Moen (1982), using an IMSL (International Mathematical and Statistical Library ) subroutine, generated the bivariate normal error term for a specified p, and SAS (Statistical Analysis System) pro­ grams were written for finding the posterior distribution of m, (8.113). Tables 8.1-8.3 present the results for a sample of size n = 20 when using a natural conjugate prior when the actual point of change is at 3, 10, and 17. For fixed values of p and A ^ changes in A ^ do not have much of an effect on the posterior probability of the actual point of change; however in all cases, as A 2 increases for fixed values of p and A the posterior probability also increases. It is most often the case that for fixed A ^ and the posterior probability of the true change point

.91379 .99335 .99941 .99993 .90975 .99311 .99939 .99993 .89395 .99180 .99928 .99992 .86235 .98892 .99903 .99989

.41512* .85267 .97817 .99669 .41126* .85168 .97820 .99672 .39160* .84118 .97654 .99650 .35773* .81991 .97285 .99595

.10962* .37789* .76023 .93980 .11055* .38127* .76384 .94125 .10795* .37512* .75967 .94027 .10210* .35974* .74746 .93673

.07374* .25171* .60240 .86660 .07516* .25632* .60917 .87036 .07451* .25460* .60765 .87016 .07184* .24666* .59775 .86594

.07398* .24787* .58358 .84746 .07608* .25404* .59226 .85253 .07609* .25416* .59311 .85358 .07403* .24819* .58607 .85060

.18900* .53609 .84069 .95760

.18916* .53726 .84217 .95827

.18299* .52835 .83838 .95728

.61913 .91615 .98612 .99784

.61606 .91579 .98610 .99783

.59747 .91020 .98500 .99763

.04 .05 .06 .07

.04 .05 .06 .07

.04 .05 .06 .07

.2

.4

.6

MULTIVARIATE LINEAR

*The largest probability occurs at the 19th data point.

0.

.18253* .52496 .83394 .95520

.7

.60709 .91138 .98507 .99764

A2

A1

.04 .05 .06 .07

\ A2

/ A2

.5

P

B2 = &i +

.2

»

0

°/

° \ I

—. 2

A1

A1

- .5

'

-.7

n = 20, u 2 = P i + |

Table 8.1 Posterior Probability That m = 3, When the Actual Point o f Change Is T h ree, Using a Natural Conjugate P rior Distribution;

422 MODELS

.59613 .74536 .86086 .93511 .61090 .75969 .87180 .94157 .62255 .77093 .87971 .94590 .63075 .77902 .88478 .94835

.23619* .54068 .68529 .79954 .25076* .55492 .69965 .81222 .25963* .56662 .71182 .82247 .26232* .57558 .72165 .83032

.10368* .42065 .61133 .73916 .11160* .43463 .62617 .75340 .11653* .44509 .63885 .76529 .11809* .45173 .64914 .77477

.08031* .36097 .58655 .72901 .08487* .37299 .60236 .74430 .08727* .38127 .61559 .75696 .08738* .38542 .62597 .76691

.20797 .51787 .70831 .84290

.21213 .53525 .72680 .85662

.21269 .54896 .74148 .86678

.20952 .55841 .75225 .87363

.53581 .75976 .90374 .96846

.55664 .78028 .91514 .97288

.57246 .79476 .92213 .97529

.58229 .80338 .92537 .97608

.04 .05 .06 .07

.04 .05 .06 .07

.04 .05 .06 .07

.04 .05 .06 .07

.2

.4

.6

.82823 .93076 .97712 .99331

STRUCTURAL

.82337 .92921 .97690 .99334

WITH

.81380 .92442 .97517 .99282

.79931 .91593 .97162 .99162

.7

MODELS CHANGE

♦The largest probability occurs at the 19th data point.

0.

.5

.2

A2

A1

0

P -.2

+

-.5

v1

- .7

n = 20, u 2 =

Table 8.2 Posterior Probability That m = 10, When the Actual Point of Change Is Ten, Using a Natural Conjugate Prior Distribution; MULTIVARIATE 423

.04 .05 .06 .07

.04 .05 .06 .07

.04 .05 .06 .07

.04 .05 .06 .07

0.

.2

.4

.6

.99897 .99994 1.00000 1.00000

.99910 .99995 1.00000 1.00000

.99912 .99995 1.00000 1.00000

.99903 .99994 1.00000 1.00000

-.7

.96986 .99699 .99968 .99996

.97069 .99711 .99969 .99996

.96993 .99705 .99969 .99996

.96745 .99679 .99966 .99996

- .5

.64047 .92627 .98914 .99843

.64235 .92737 .98937 .99847

.63778 .92619 .98921 .99845

.62674 .92265 .98862 .99836

- .2

41

° /

» \

| .

.36473* .79294 .96513 .99500

.36969* .79783 .96635 .99520

.36921* .79793 .96644 .99522

.36332* .79327 .96538 .99505

0

P

#2 = Bj *

.2

42

.23832* .70465 .95295 .99420

.24728* .71701 .95597 .99462

.25168* .72258 .95724 .99479

.25127* .72147 .95687 .99473

\

/ i.

.33235* .88170 .99133 .99936

.36419* .89777 .99276 .99948

.38634* .90703 .99352 .99953

.39714* .91054 .99377 .99955

.5

.83168 .99426 .99979 .99999

.87132 .99597 .99986 .99999

.89395 .99683 .99989 1.00000

.90392 .99715 .99990 1.00000

.7

MULTIVARIATE LINEAR

*The largest probability occurs at the 19th data point.

A2

A1

20. „ 2 = * j ♦ |

'

Table 8.3 Posterior Probability That m = 17, When the Actual Point of Change Is Seventeen, Using a Natural Conjugate Prior Distribution; 424 MODELS


(m = 3, 10, 17) is smallest when ρ = 0 and increases as ρ becomes large in absolute value. These results are what one would expect of the behavior of the posterior distribution of m. As the prior mean of θ₂ changes, the posterior probability of m does not change very much when the other parameters are held constant. Of course, this is a very limited study, and additional work is required before definite conclusions can be drawn. One interesting feature of the study is that when β₂ differs from β₁ by as little as .04, the change is often detected with high probability, except when ρ is close to zero in absolute value; for values of ρ close to zero, the change is difficult to detect unless β₂ differs from β₁ by as much as .07. Another interesting question is why changes are more difficult to detect when ρ = 0.
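For completeness, the posterior mass function (8.113) used to produce these tables is itself a short computation. The sketch below follows the form of that formula; the function name, the prior settings (μ_i, R_i, B, a), and the data passed in are hypothetical placeholders rather than the values used in the study:

import numpy as np

def change_point_posterior(Y, X, mu1, mu2, R1, R2, B, a):
    # Posterior mass of the shift point m for the changing model, following eq. (8.113).
    # Y: n x p responses, X: n x k design; the regression matrix changes after observation m.
    n, p = Y.shape
    log_post = np.full(n - 1, -np.inf)
    for m in range(1, n):
        parts = [(Y[:m], X[:m], mu1, R1), (Y[m:], X[m:], mu2, R2)]
        Q = B.copy()
        log_det = 0.0
        for Yi, Xi, mui, Ri in parts:
            Di = Xi.T @ Xi + Ri
            hi = Xi.T @ Yi + Ri @ mui
            Q += Yi.T @ Yi + mui.T @ Ri @ mui - hi.T @ np.linalg.solve(Di, hi)
            log_det += np.linalg.slogdet(Di)[1]
        log_post[m - 1] = -0.5 * p * log_det - 0.5 * (n + a) * np.linalg.slogdet(Q)[1]
    w = np.exp(log_post - log_post.max())
    return w / w.sum()                               # xi(m | Y), m = 1, ..., n - 1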

COMMENTS AND CONCLUSIONS

This chapter introduces the reader to some of the basic problems one encounters in the analysis of multivariate data. If the data can be represented by a multivariate linear model, this chapter shows how Bayesian inferences can be made.

The first model to be considered was the multivariate regression model, where prior information is given either by the normal-Wishart conjugate prior or by Jeffreys' improper density. The posterior analysis consisted of finding the marginal posterior distribution of the regression matrix and, from that, showing how one may estimate these parameters either with an HPD region or with point estimators. Forecasting future observations was based on the predictive distribution, which was a matrix t.

The multivariate linear model for a one-way layout was also examined, and the results are similar to those obtained for the regression model: the marginal posterior distribution of the "treatment" mean vectors is a matrix t, and inferences about them are obtained from the usual HPD regions.

Two vector time series models were studied, the autoregressive model and the regression model with autocorrelated errors, and it was demonstrated that the posterior analysis of the former is a simple generalization of the univariate case examined in Chapter 5. Of particular importance with the autoregressive model is the forecasting methodology, where inferences about future (one-step-ahead) observations are based on a multivariate t distribution.

The last section of the chapter was on structural change in the multivariate regression model, where one change in the regression


matrix is allowed. The main emphasis here was on the change point parameter; however, the marginal posterior distributions of the other parameters were also obtained. Finally, the section concluded with a numerical study of the sensitivity of the posterior distribution of the change point to the parameters of the prior distribution. One can conclude that this chapter is for the most part a review of the standard methodology given by Box and Tiao (1973), Press (1982), and Zellner (1971). Some "new" developments were also introduced, namely the Bayesian identification procedure for vector autoregressive processes and the treatment of structural change in multivariate regression models. Of course, there are many interesting areas in which additional research can be done. For example, structural change problems in the vector autoregressive model need to be examined, especially the forecasting procedures. The autoregressive model is widely used in economic forecasting, and it is often thought that the regression parameters are indeed changing. If this is so, forecasting methods via the Bayesian predictive density should give improved predictions over those based on autoregressions with parameters that do not change. Another promising area of research is the development of a complete Bayesian analysis of vector ARMA models, but as was shown in Chapter 5, one must first solve the univariate case of the ARMA process.

EXERCISES

1.  Show the likelihood function may be expressed by equations (8.2), (8.3), and (8.5).

2.  Verify the matrix t density.

3.  Describe the marginal posterior distribution of θ. See equation (8.13).

4.  From equation (8.19), show the predictive distribution of W is a matrix t with mean W* and covariance matrix $(n + \nu - m - 1)^{-1} D \otimes E$, and explain how you would construct point and region forecasts of W.

5.  Describe a way to set the values of the hyperparameters of the conjugate prior distribution of the parameters of a multivariate linear model using past data from a related experiment. See Chapter 3 and the method of the prior predictive distribution for assessing prior information.

6.  Consider equation (8.30), which gives a multivariate linear model for a one-way layout of k treatments. A test for the equality of treatment effects is given by the HPD region (8.12). Does this test reduce to the one given by (3.103) of Chapter 3, which is the approximate test for the equality of treatment effects when one measures a single response on each experimental unit?

7.  From equation (8.71), show that: (a) the conditional posterior distribution of τ given θ and ρ is gamma, (b) the conditional distribution of θ given ρ is a multivariate t, and (c) the conditional distribution of ρ given θ is a univariate t.

8.  Equation (8.92) is the joint posterior density of λ and α, two parameters of the distributed lag model (8.91). Explain why the joint distribution is not a multivariate t distribution.

9.  Consider a multivariate regression model (8.93) with one change; then, using the prior densities (8.95) and (8.96), show the marginal posterior mass function of m is given by (8.100). Describe how you would forecast a future observation with this model.

10. Equation (8.30) of this chapter describes a linear model with a one-way layout where m responses are measured on the experimental units. Define a multivariate linear model (of m responses) which would be appropriate for analyzing data emanating from a randomized block design. That is, generalize the univariate model (3.124) of Chapter 3 to the multivariate case. Also, develop a test for the equality of treatment effects.

11. Consider a moving average process

$$Y(t) = \begin{cases} e(t) - \phi_1 e(t-1), & t = 1, 2, \ldots, m \\ e(t) - \phi_2 e(t-1), & t = m+1, \ldots, n \end{cases}$$

    where 1 ≤ m ≤ n − 1, φ1 ∈ R, φ2 ∈ R, and the e(t), t = 0, 1, 2, . . . , n, are n.i.d.(0, τ⁻¹), where τ is a positive precision and Y(t) is the t-th observation. Assuming φ1 ≠ φ2, the parameter of the moving average process has changed at some unknown point m from φ1 to φ2. Letting e(0) = 0, the residuals may be calculated recursively by

$$e(t) = \begin{cases} Y(t) + \phi_1 e(t-1) & \text{if } t = 1, 2, \ldots, m \\ Y(t) + \phi_2 e(t-1) & \text{if } t = m+1, \ldots, n \end{cases}$$

    where 1 ≤ m ≤ n − 1. Assuming m can be any one of its possible values with equal probability, describe a way to estimate φ1 and φ2.
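The residual recursion in the last exercise is easy to carry out numerically. The following sketch (in Python, with made-up values for φ1, φ2, m, and the data; none of these names come from the text) simulates such a process and computes the residuals recursively with e(0) = 0.

    import numpy as np

    def ma_change_residuals(y, phi1, phi2, m):
        # Recursive residuals with e(0) = 0 and a parameter change after time m:
        # e(t) = Y(t) + phi1*e(t-1) for t <= m, and e(t) = Y(t) + phi2*e(t-1) otherwise.
        n = len(y)
        e = np.zeros(n + 1)                  # e[0] plays the role of e(0) = 0
        for t in range(1, n + 1):
            phi = phi1 if t <= m else phi2
            e[t] = y[t - 1] + phi * e[t - 1]
        return e[1:]

    # Illustrative (hypothetical) data: simulate the process, then recover the residuals.
    rng = np.random.default_rng(0)
    n, m, phi1, phi2, tau = 50, 20, 0.3, 0.8, 4.0
    eps = rng.normal(0.0, tau ** -0.5, n + 1)
    eps[0] = 0.0                             # matches the convention e(0) = 0
    y = np.array([eps[t] - (phi1 if t <= m else phi2) * eps[t - 1]
                  for t in range(1, n + 1)])
    print(np.allclose(ma_change_residuals(y, phi1, phi2, m), eps[1:]))   # True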

$$X^{(1)} \sim N_{m_1}\!\left(\mu^{(1)},\ [P_{11} - P_{12}P_{22}^{-1}P_{21}]^{-1}\right),$$

where μ^(1) is the vector of the first m1 components of μ and P is partitioned as

$$P = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix},$$

where P11 is m1 × m1. It can also be shown that the conditional distribution of X^(2) given X^(1) is normal, but for the details see DeGroot (1970), Chapter Five.
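As a quick numerical check of the marginal result above, the following sketch (Python with numpy; the precision matrix is made up for illustration) verifies that the covariance of the first m1 components of X equals the inverse of the Schur complement P11 − P12 P22⁻¹ P21.

    import numpy as np

    m1 = 2
    P = np.array([[4.0, 1.0, 0.5, 0.2],
                  [1.0, 3.0, 0.3, 0.1],
                  [0.5, 0.3, 2.0, 0.4],
                  [0.2, 0.1, 0.4, 5.0]])     # a positive definite precision matrix
    P11, P12 = P[:m1, :m1], P[:m1, m1:]
    P21, P22 = P[m1:, :m1], P[m1:, m1:]

    # Covariance of X^(1): inverse of the Schur complement of P22 in P.
    marginal_cov = np.linalg.inv(P11 - P12 @ np.linalg.inv(P22) @ P21)
    full_cov = np.linalg.inv(P)              # covariance of the full vector X
    print(np.allclose(marginal_cov, full_cov[:m1, :m1]))   # True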

THE WISHART DISTRIBUTION

Let V be a random m × m symmetric matrix with density

$$f(v\,|\,d, T) \propto |T|^{d/2}\,|v|^{(d-m-1)/2}\exp\left\{-\tfrac{1}{2}\operatorname{Tr}(Tv)\right\}$$

if v is positive definite, and zero otherwise, where T is an m × m positive definite symmetric matrix and d > m − 1; then V is said to have a Wishart distribution with d degrees of freedom and precision matrix T. It can be shown that E(V) = dT⁻¹, but for additional properties of the Wishart see Press (1982) and DeGroot (1970).
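For readers who want to experiment numerically, the following sketch uses scipy.stats.wishart; note that scipy parameterizes the Wishart by a scale matrix, which under the precision parameterization above corresponds to T⁻¹ (that correspondence, and the numbers used, are assumptions of the sketch, not part of the text).

    import numpy as np
    from scipy.stats import wishart

    d = 5                                    # degrees of freedom, d > m - 1
    T = np.array([[2.0, 0.5],
                  [0.5, 1.0]])               # hypothetical precision matrix (m = 2)
    W = wishart(df=d, scale=np.linalg.inv(T))

    V = W.rvs(size=20000, random_state=1)    # draws of the random matrix V
    print(V.mean(axis=0))                    # approximately E(V) = d * T^{-1}
    print(d * np.linalg.inv(T))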

THE NORMAL-WISHART

Let X be an m × 1 real random vector and V an m × m symmetric random matrix; then X and V have a normal-Wishart distribution if the density of X and V is such that the conditional distribution of X given V = v is normal with mean μ ∈ R^m and precision matrix av, where a > 0 and v is m × m positive definite, and the marginal distribution of V is Wishart with d degrees of freedom and precision matrix T, where T is symmetric positive definite and d > m − 1. The density of X and V is f(x, v | μ, a, d, T) = f1(x | v, μ, a) f2(v | d, T), where

$$f_1(x\,|\,v, \mu, a) \propto |v|^{1/2}\exp\left\{-\tfrac{a}{2}(x-\mu)'v(x-\mu)\right\}, \qquad x \in R^m,$$

and

$$f_2(v\,|\,d, T) \propto |T|^{d/2}\,|v|^{(d-m-1)/2}\exp\left\{-\tfrac{1}{2}\operatorname{Tr}(Tv)\right\},$$

v positive definite. We see f1 is the conditional density of X given V = v and f2 the marginal density of V. When one analyzes a multivariate normal population with both parameters unknown, the normal-Wishart is the conjugate class. See Chapter 8.
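A minimal sketch of drawing a pair (X, V) from the normal-Wishart just described, assuming illustrative values for μ, a, d, and T, and again using scipy's scale (= T⁻¹) parameterization of the Wishart:

    import numpy as np
    from scipy.stats import wishart, multivariate_normal

    m, a, d = 2, 4.0, 6
    mu = np.array([1.0, -0.5])
    T = np.array([[1.5, 0.3],
                  [0.3, 0.8]])               # precision matrix of the Wishart part

    # One joint draw: V from the Wishart marginal, then X | V = v from N(mu, (a v)^{-1}).
    V = wishart(df=d, scale=np.linalg.inv(T)).rvs(random_state=2)
    X = multivariate_normal(mean=mu, cov=np.linalg.inv(a * V)).rvs(random_state=3)
    print(X, V)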

THE MULTIVARIATE t DISTRIBUTION

There are many ways to define multivariate t distributions, see Johnson and Kotz (1972) for example, but for the purposes of this book only the following t distribution occurs in the Bayesian analysis of linear models. Let X be an m × 1 real random vector; then X has a t distribution with d degrees of freedom, location μ ∈ R^m, and precision matrix T if the density of X is

$$f(x\,|\,d, \mu, T) \propto \left[d + (x - \mu)'T(x - \mu)\right]^{-(d+m)/2}, \qquad x \in R^m,$$

where d > 0 and T is an m × m symmetric positive definite matrix. First, it can be shown that when m = 1, X has a univariate t distribution as defined earlier and that (X − μ)T^{1/2} has Student's t distribution with d degrees of freedom. An interesting property of the multivariate t distribution is that if m ≥ 2, the components of X cannot be independent, but they can be uncorrelated, since the dispersion matrix of X is

$$D(X) = \frac{d}{d-2}\,T^{-1}, \qquad d > 2.$$

The mean vector of X is E(X) = μ, and the marginal distribution of X^(1) is also a t with d degrees of freedom, location μ^(1), and precision matrix T11 − T12 T22⁻¹ T21, where

$$T = \begin{pmatrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{pmatrix}$$

and T11 is of order m1. Here X^(1) consists of the first m1 components of X, and μ is partitioned as

$$\mu = \begin{pmatrix} \mu^{(1)} \\ \mu^{(2)} \end{pmatrix},$$

where μ^(1) is m1 × 1.

The conditional distribution of X^(1) given X^(2) is also a t, where the conditional mean vector of X^(1) is

$$E[X^{(1)}\,|\,X^{(2)} = x_2] = \mu^{(1)} - T_{11}^{-1}T_{12}\,[x_2 - \mu^{(2)}],$$

and the conditional precision matrix is

$$\frac{d + m_2}{d + [x_2 - \mu^{(2)}]'\,[T_{22} - T_{21}T_{11}^{-1}T_{12}]\,[x_2 - \mu^{(2)}]}\;T_{11}.$$

Unlike the normal distribution, the precision matrix of the conditional distribution of X^(1) given X^(2) depends on the value of the conditioning variable. The multivariate t distribution is encountered many times in the Bayesian analysis of linear models; see Chapters 1, 2, and 3 in particular. HPD regions for the regression parameters θ of a general linear model are constructed on the basis of the F distribution, which is a particular function of a t random vector X. That is to say, if X has a multivariate t distribution with parameters d, μ, and T, then

$$Q(X) = m^{-1}(X - \mu)'\,T\,(X - \mu)$$

has an F distribution with m and d degrees of freedom. See Chapter 1.
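The F property just stated is easy to check by simulation. In the sketch below (Python), scipy's multivariate_t is used with its shape matrix set to T⁻¹, which is how the precision parameterization above translates into scipy's convention; the particular d, m, and T are made up.

    import numpy as np
    from scipy.stats import multivariate_t, f

    m, d = 3, 8
    mu = np.zeros(m)
    T = np.diag([2.0, 1.0, 0.5])             # hypothetical precision matrix

    X = multivariate_t(loc=mu, shape=np.linalg.inv(T), df=d).rvs(
        size=50000, random_state=4)
    diff = X - mu
    Q = np.einsum("ij,jk,ik->i", diff, T, diff) / m    # Q(X) for each draw

    # The simulated upper tail of Q should match the F(m, d) distribution.
    print(np.mean(Q > f(m, d).ppf(0.95)))    # close to 0.05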


THE POLY-t DISTRIBUTION

The poly-t distribution is the name of the distribution of a real random vector X of m dimensions which has density

$$f(x\,|\,d, \mu, T) \propto \prod_{j=1}^{k} f_j(x\,|\,d_j, \mu_j, T_j), \qquad x \in R^m,$$

where

$$f_j(x\,|\,d_j, \mu_j, T_j) = \left[1 + (x - \mu_j)'T_j(x - \mu_j)\right]^{-d_j/2}, \qquad x \in R^m.$$

The parameters of the distribution are d = (d1, d2, . . . , dk)' and T = [T1, T2, . . . , Tk], where dj > 0, μj is m × 1, Tj is symmetric and positive semidefinite, and Σj Tj is positive definite. We see that if k = 1, X has a multivariate t distribution, but when k ≥ 2 very little is known about it. For example, the normalizing constant and other moments are unknown and it is very difficult to work with, but unfortunately it occurs quite often in a Bayesian analysis of linear models. Actually there are two basic forms of the distribution: the one above is called the product form, the other the quotient form. Only the product form appears in this book. The poly-t distribution first occurs in Chapter 3, when one analyzes several normal populations with a common mean, where the poly-t is the marginal posterior distribution of the common mean. Next it occurs in the Bayesian analysis of mixed linear models, where it is the marginal posterior distribution of the random effects of the model.


Lastly, it is found as the posterior distribution of the states of a dynamic system in Chapter 6. For other examples of this vexing distribution see Dreze (1977), who gives an excellent account of the problems of working with the poly-t distribution. If X is a scalar or of dimension less than three, one may use numerical integration techniques, but in general one will probably have to develop an adequate approximation. A normal approximation to the poly-t was used in Chapters 4 and 6, where each factor fj of the density was approximated by a normal density with the same mean vector and dispersion matrix as fj. Since the poly-t distribution is in general multimodal and asymmetric, we know the normal distribution is an inadequate approximation, and further research is required.
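The product form is straightforward to evaluate up to its (unknown) normalizing constant, and the factor-wise normal approximation mentioned above is also easy to assemble. In the sketch below (Python), each factor is treated as a multivariate t with dj − m degrees of freedom, so that its mean is μj and its dispersion is Tj⁻¹/(dj − m − 2); that identification, and all the numbers, are assumptions of the sketch rather than statements from the text.

    import numpy as np

    def log_polyt(x, ds, mus, Ts):
        # Unnormalized log density of the product-form poly-t at the point x.
        total = 0.0
        for d_j, mu_j, T_j in zip(ds, mus, Ts):
            q = (x - mu_j) @ T_j @ (x - mu_j)
            total += -0.5 * d_j * np.log1p(q)
        return total

    def normal_approx(ds, mus, Ts):
        # Approximate each factor by a normal with the factor's mean and dispersion,
        # then combine: the product of normal densities is again a normal kernel.
        m = len(mus[0])
        precs = [(d_j - m - 2) * T_j for d_j, T_j in zip(ds, Ts)]
        P = sum(precs)                                         # combined precision
        mean = np.linalg.solve(P, sum(L @ mu for L, mu in zip(precs, mus)))
        return mean, np.linalg.inv(P)

    # Hypothetical two-factor example in two dimensions.
    ds = [8.0, 10.0]
    mus = [np.array([0.0, 0.0]), np.array([1.0, 0.5])]
    Ts = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
    mean, cov = normal_approx(ds, mus, Ts)
    print(mean, cov, log_polyt(mean, ds, mus, Ts))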

THE MATRIX t DISTRIBUTION

The matrix t distribution is a generalization of the multivariate t distribution, of which the univariate t is a special case. We have seen that the matrix t distribution occurs in the analysis of multivariate linear models, namely as the posterior distribution of the matrix of regression coefficients and as the one-step-ahead predictive distribution of a future observation. Press (1982) and Box and Tiao (1973) define the matrix t distribution as follows. The pq elements of the random p × q matrix T follow a matrix t distribution if the density of T is

$$f(t) = \frac{|A^{-1} + t\,B^{-1}t'|^{-m/2}}{c(m, p, q)\,|A|^{(m-q)/2}\,|B|^{p/2}},$$

where t is a real p × q matrix, A is of order p, symmetric and positive definite, B is of order q, symmetric and positive definite, and m > p + q − 1. The constant is given by

$$c(m, p, q) = \frac{\pi^{pq/2}\,\Gamma_p[(m-q)/2]}{\Gamma_p(m/2)},$$

where

$$\Gamma_p(x) = \pi^{p(p-1)/4}\,\Gamma(x)\,\Gamma\!\left(x - \tfrac{1}{2}\right)\cdots\Gamma\!\left(x - \tfrac{p-1}{2}\right)$$

and Γ is the gamma function. Box and Tiao and Press both give additional properties of the distribution and in particular they show that any row or any column of T has a multivariate t distribution.
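For completeness, a direct transcription of the density above into code (Python; scipy.special.multigammaln gives log Γp). The parameter values are purely illustrative.

    import numpy as np
    from scipy.special import multigammaln   # log of the multivariate gamma function

    def matrix_t_logpdf(t, A, B, m):
        # Log density of the matrix t distribution as written above,
        # for a p x q matrix t with A (p x p), B (q x q), and m > p + q - 1.
        p, q = t.shape
        log_c = (0.5 * p * q * np.log(np.pi)
                 + multigammaln(0.5 * (m - q), p) - multigammaln(0.5 * m, p))
        _, ld_kern = np.linalg.slogdet(np.linalg.inv(A) + t @ np.linalg.inv(B) @ t.T)
        _, ld_A = np.linalg.slogdet(A)
        _, ld_B = np.linalg.slogdet(B)
        return -log_c - 0.5 * m * ld_kern - 0.5 * (m - q) * ld_A - 0.5 * p * ld_B

    p, q, m = 2, 3, 7
    print(matrix_t_logpdf(np.zeros((p, q)), np.eye(p), np.eye(q), m))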

COMMENTS

The normal-gamma distribution is a special case of the normal-Wishart and the univariate t a special case of the multivariate t which in turn is a special case of the matrix t distribution. The multivariate t is also a special case of the poly-t (when k = 1), but for the most part the poly-t is in a class by itself. We have seen that with the exception of the poly-t distribution, all are easy to work with in the sense that the moments and other characteristics are well known. For additional information about these distributions see Box and Tiao (1973), DeGroot (1970), Dreze (1977), Press (1982), and Zellner (1971).

REFERENCES

Box, G. E. P. and G. C. Tiao (1973). Bayesian Inference in Statistical Analysis, Addison-Wesley, Reading, Mass.

DeGroot, M. H. (1970). Optimal Statistical Decisions, McGraw-Hill Book Company, New York.

Dreze, Jacques H. (1977). "Bayesian regression analysis using poly-t distributions," in New Developments in the Applications of Bayesian Methods, edited by Aykac and Brumat, North-Holland Publishing Co., Amsterdam.

Johnson, Norman L. and Samuel Kotz (1972). Distributions in Statistics: Continuous Multivariate Distributions, John Wiley and Sons, Inc., New York.

Press, S. James (1982). Applied Multivariate Analysis: Using Bayesian and Frequentist Methods of Inference, Krieger Publishing Co., Melbourne, Florida.

Zellner, Arnold (1971). An Introduction to Bayesian Inference in Econometrics, John Wiley and Sons, Inc., New York.

INDEX

Adaptive control, 278-284, 289-293
Adaptive estimation, 273-278 (see also Dynamic models; Estimation)
Adaptive filter, 285-289
Analysis:
  of covariance, 134
  with interaction, 131
  posterior, 53 (see also Mixed model)
  predictive, 54
  prior, 47
  of two factors, 123
  of variance, 159
Approximations:
  examples of, 161-165
  to random effects, 149
  to variance components, 153
Autoregressive, structural change of, 360-368
Autoregressive univariate process, 182-186
Autoregressive vector process, 388-399

Bayes theorem, 2-3

Confidence intervals and regions:
  HPD posterior, 54
  HPD predictive, 55
Control problems, 35
Control strategies, 273-293

Designs:
  completely randomized, 118
  randomized block, 129
Distributed lag models:
  multivariate, 408-409
  univariate, 213-215
Dynamic models:
  adaptive estimation, 273-278 (see also Adaptive estimation)
  control, 243-250
  estimation, 238
  filtering, 238
  Kalman filter, 239, 250-254
  nonlinear, 266-273
  observation equation, 237
  prediction, 242-243
  system equation, 237
  transition matrix, 237

Eigenvalues, 188, 196
Eigenvectors, 188, 196
Estimation (see also Adaptive estimation; Dynamic models):
  adaptive, 273-278
  least squares, 87
  maximum likelihood, 50, 143
  modal, 170-172

Filtering:
  adaptive, 273-278
  Kalman, 239

Gamma distribution, 441
General linear model:
  point estimation, 8-11
  posterior estimation, 5-14
  predictive analysis, 14-18
  prior analysis, 3-5
  region estimation, 11-12
  testing hypotheses, 12-14

Hypothesis testing:
  by HPD regions, 54
  by posterior odds, 54

Identification:
  by Diaz-Farah method, 396-399, 202-206
  of multivariate AR, 396-399
  of univariate AR, 202-206
Interval estimation, 54
Iterative technique, 77, 165, 174-177

Marginal posterior density, 53
Mixed model:
  between component, 161-169
  fixed factors, 142
  maximum likelihood, 143
  modal estimation, 165-176
  nested models, 144, 171
  one-way layout, posterior analysis, 146-158
  random factors, 142
  variance components, 143
  within component, 161-169
Models:
  additive, 125
  autocorrelated errors, 38
  autoregressive, 25, 36
  autoregressive moving average, 25, 31, 37
  dynamic, 25, 36
  econometric, 25
  general linear, 1
  moving average, 25, 30 (see also Moving average)
  multivariate, 374-427
  time series, 25, 30
  traditional, 65-141
Moving average, 186-195 (see also Models)
Multiple regression:
  univariate case, 94-100
  multivariate case, 375-383
Multivariate distributions:
  matrix-t, 444
  normal, 442 (see also Normal multivariate)
  poly-t, 446
  Wishart, 443
Multivariate models:
  autocorrelated errors, 388-400
  autoregressive, 400-406
  for designed experiments, 383-388
  regression, 375-383
  time series, 388-608
  transfer function, 406-409

Natural conjugate distributions:
  normal-gamma, 441 (see also Normal distribution)
  normal-Wishart, 443 (see also Normal distribution)
Normal distribution, 440
Normal-gamma, 441 (see also Natural conjugate distributions)
Normal linear model, 1
Normal multivariate, 442 (see also Multivariate distributions)
Normal populations, 70, 81
Normal-Wishart, 443 (see also Natural conjugate distributions)
Nonlinear dynamic models, 104-117
Nonlinear regression models, 266-273
Nuisance parameters, 53 (see also Parameters)

Parameters:
  hyperparameters, 83
  nuisance, 82 (see also Nuisance parameters)
  precision, 1
Posterior joint distribution, 53
Posterior marginal distribution, 53-54
Posterior odds ratio, 54
Precision matrix, 1
Predictive analysis, 43-44
Predictive distribution, 56
Predictive region, 56
Prior distributions:
  conjugate, 47-48, 50
  exchangeable, 52
  Jeffreys, 47
  noninformative, 47
  vague, 47
Principle:
  of imaginary results, 49
  of insufficient reason, 40
  of precise measurement, 48
Probability:
  frequency interpretation, 46
  models of, 45
  subjective interpretation, 42

Regression:
  multivariate linear, 375-383
  univariate linear, 84-90
  univariate multiple, 91-103
  univariate nonlinear, 104
Research problems, 433-438

Statistical analysis system (SAS), 159
Structural change:
  of autocorrelated errors, 351-360
  detection of, 345-350
  of linear models, 314-345
  of normal sequences, 307-314
  sensitivity to, 322-325
  of time series, 360-368

t distribution (see also Univariate distributions):
  multivariate, 444
  univariate, 442
Testing hypotheses:
  by HPD regions, 54
  by posterior odds, 54
Time series:
  autocorrelated errors, 206-212
  autoregressive moving average, 195-202
  autoregressive processes, 182-186, 215-231
  distributed lag, 212-215
  identification, 202-206
  invertible, 195
  multivariate, 388-399
  prediction with, 184-185, 191, 201, 210
  stationary, 183-199
  vector processes, 388-399
Transition:
  function, 355-358
  matrix, 237

Univariate distributions:
  gamma, 441
  normal, 440
  t, 442 (see also t distribution)

Vague prior distribution, 47
Variance components, 29, 143, 148, 155

Wishart distribution, 443