Handbook of Statistics 5: Time Series in the Time Domain [5, First ed.] 9780444876294, 0444876294

In this volume prominent workers in the field discuss various time series methods in the time domain. The topics include

129 46 22MB

English Pages [492] Year 1985

Report DMCA / Copyright

DOWNLOAD FILE

Polecaj historie

Handbook of Statistics 5: Time Series in the Time Domain [5, First ed.]
 9780444876294, 0444876294

Citation preview

Handbook of Statistics Volume 5

Time Series in the Time Domain

About the Book In this volume prominent workers in the field discuss various time series methods in the time domain. The topics included are autoregressive-moving average models, control, estimation, identification, model selection, non-linear time series, non-stationary time series, prediction, robustness, sampling designs, signal attenuation, and speech recognition. This volume complements Handbook of Statistics 3: Time Series in the Frequency Domain.

This page has been left intentionally blank

Handbook of Statistics Volume 5

Time Series in the Time Domain

Edited by

E. J. Hannan P. R. Krishnaiah M. M. Rao

North-Holland is an imprint of Elsevier Radarweg 29, PO Box 211, 1000 AE Amsterdam, Netherlands The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom Copyright © 1985 Elsevier B.V. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein). Notices Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. ISBN: 978-0444876294 ISBN: 0444876294

For information on all North-Holland publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Zoe Kruze Acquisition Editor: Sam Mahfoudh Editorial Project Manager: Peter Llewellyn Production Project Manager: Vignesh Tamil Cover Designer: Mark Rogers Typeset by SPi Global, India

Table of Contents Preface E.J. Hannan, P.R. Krishnaiah and M.M. Rao Contributors Chapter 1. Chapter 2.

Chapter 3.

Chapter 4.

Chapter 5.

Chapter 6.

Chapter 7.

Chapter 8.

Chapter 9.

Chapter 10.

Chapter 11. Chapter 12.

Nonstationary Autoregressive Time Series Wayne A. Fuller Non-Linear Time Series Models and Dynamical Systems Tohru Ozaki Autoregressive Moving Average Models, Intervention Problems and Outlier Detection in Time Series G.C. Tiao Robustness in Time Series and Estimating ARMA Models R. Douglas Martin and Victor J. Yohai Time Series Analysis with Unequally Spaced Data Richard H. Jones Various Model Selection Techniques in Time Series Analysis Ritei Shibata Estimation of Parameters in Dynamical Systems Lennart Ljung Recursive Identification, Estimation and Control Peter Young General Structure and Parametrization of ARMA and State-Space Systems and its Relation to Statistical Problems M. Deistler Harmonizable, Cramér, and Karhunen Classes of Processes M.M. Rao On Non-Stationary Time Series C.S.K. Bhagavan Harmonizable Filtering and Sampling of Time Series Derek K. Chang

v

xiii 1 25 85

119 157 179 189 213 257

279 311 321

Chapter 13. Chapter 14. Chapter 15.

Chapter 16. Chapter 17.

Sampling Designs for Time Series Stamatis Cambanis Measuring Attenuation M.A. Cameron and P.J. Thomson Speech Recognition Using LPC Distance Measures P.J. Thomson and P. de Souza Varying Coefficient Regression D.F. Nicholls and A.R. Pagan Small Samples and Large Equations Systems Henri Theil and Denzil G. Fiebig

337

Subject Index

481

Handbook of Statistics Contents of Previous Volumes

485

363 389 413 451

This page has been left intentionally blank

Preface

The theory and practice of the analysis of time series has followed two lines almost since its inception. One of these proceeds from the Fourier transformation of the data and the other from a parametric representation of the temporal relationships. Of course, the two lines are interrelated. The frequency analysis of data was surveyed in Volume 3 of the present Handbook of Statistics series, subtitled, Time Series in the Frequency Domain, edited by D. R. Brillinger and P. R. Krishnaiah. Time domain methods are dealt with in this volume. The methods are old, going back at least to the ideas of Prony in the eighteenth century, and owe a great deal to the work of Yule early this century. Several different techniques for classes of nonstationary processes have been developed by various analysts. By the very nature of the subject in these cases, the work tends to be either predominantly data analysis oriented with scant justifications, or mathematically oriented with inevitably advanced arguments. This volume contains descriptions of both these approaches by strengthening the former and minimizing the latter, and yet presenting the state-of-the-art in the subject. A brief indication of the work included is as follows. One of the successful parametric models is the classical autoregressive scheme, going back to the pioneering work of G. U. Yule, early in this century. The model is a difference equation with constant coefficients, and much of the classical work is done if the roots of its characteristic equation are interior to the unit circle. If the roots are of unit modulus, the analysis presents many difficulties. The advances made in recent years in this area are described in W. Fuller's article. An important development in the time domain area is the work of R. Kalman. It led to the emphasis on a formalization of rational transfer function systems as defined by an underlying state vector generated in a Markovian manner and observed subject to noise. This representation is connected with a rich structure theory whose understanding is central in the subject. It is surveyed in the article by M. Deistler. The structure and analysis of several classes of nonstationary time series that are not of autoregressive type but for which the ideas of Fourier analysis extend is given in the article by M. M. Rao; and the filtering and smoothing problems are discussed by D. K. Chang. Related results on what may be termed "asymptotically stationary" and allied time series have been surveyed in C. S. K. Bahagavan's paper. The papers by L. Ljung, P. Young and G. C. Tiao relate to the estimation

vi

Preface

problems in the dynamical modelling systems. Here Young's paper deals with the on-line (real time) calculations. One of the uses of these models has been to analyze the consequences of an intervention (such as the introduction of exhaust emission laws) and another to consider the outlier detection problems. These are discussed by Tiao and T. Ozaki. Though rational transfer function models are parametric, it is seldom the case that the model set contains the truth and the problem may better be viewed as one of selecting a structure from an infinite set in some asymptotically optimal manner. This point of view is explored by R. Shibata. Though least squares techniques, applied to the prediction errors, have dominated, there is a need to modify these to obtain estimators less influenced by discrepant observations. This is treated by Tiao and, in an extensive discussion, by R. D. Martin and V. J. Yohai. The model selection and unequally spaced data are natural problems in this area confronting the experimenter, and these are discussed by R. H. Jones. Since the time points may sometimes be under control of the experimenter, their optimal choice must be considered. This problem is treated by S. Cambanis. The modelling in the papers referred to above has been essentially linear. Ozaki presents an approach to the difficult problem of nonlinear modelling. The autoregressive models may have time varying parameters, and this is considered by D. F. Nicholls and A. R. Pagan. Their paper has special reference to econometric data as does also the paper by H. Theil and D. G. Fiebig who treat the problem where the regressor vectors in a multivariate system may be of a dimension higher than the number of time points for observation. The final two papers on applications by M. A. Cameron, P. J. Thomson and P. de Souza complement the areas covered by the preceding ones. These are designed to show two special applications, namely in signal attenuation estimation and speech recognition. Thus several aspects of the time domain analysis and the current trends are described in the different chapters of this volume. So they will be of interest not only to the research workers in the area of time series, but also to data analysts who use these techniques in their work. We wish to express our sincere appreciation to the authors for their excellent cooperation. We also thank the North-Holland Publishing Company for their cooperation. Eo J. Hannan P. R. Krishnaiah M. M. Rao

Contributors

C. S. K. Bhagavan, Dept. of Statistics, Andhra University, Waltair, India 530003 (Ch. H) S. Cambanis, Dept. of Statistics, University of North Carolina, Chapel Hill, NC 27514, USA (Ch. 13) M. A. Cameron, CSIRO, Division of Mathematics & Statistics, P.O. Box 218, Lindfield, N.S.W., Australia 2070 (Ch. 14) D. K. Chang, Dept. of Mathematics, California State University, Los Angeles, CA 90023, USA (Ch. 12) M. Deistler, Institute of Econometrics, Technical University of Vienna, Argentinierstr. 8, A 1040 Vienna, Austria (Ch. 9) P. de Souza, Dept. of Mathematics, Victoria University, Wellington, New Zealand (Ch. 15) D. G. Fiebig, University of Sydney, Sydney, N.S.W., Australia 2006 (Ch. 17) W. A. Fuller, Dept. of Statistics, Iowa State University, Ames, IA 50011, USA (Ch. 1) R. H. Jones, Scientific Computing Center, University of Colorado Medical Center, Box B-119, Denver, CO 80262, USA (Ch. 5) L. Ljung, Dept. of Electrical Engineering, Link6ping University, S-581 83 LinkSping, Sweden (Ch. 7) R. D. Martin, Dept. of Statistics, GN22, B313 Padelford Hall, University of Washington, Seattle, WA 98195, USA (Ch. 4) D. F. Nicholls, Statistics Dept., Australian National University, G.P.O. Box 4, Canberra, A.C.T., Australia 2601 (Ch. 16) T. Ozaki, The Institute of Statistical Mathematics, 4-6-7-Minami-Azabu, Minato-Ku, Tokyo, Japan (Ch. 2) A.R. Pagan, Statistics Dept., Australian National University, G.P.O. Box 4, Canberra, A.C.T., Australia 2601 (Ch. 16) M.M. Rao, Dept. of Mathematics, University of California, Riverside, CA 92521, USA (Ch. 10) R. Shibata, Dept. of Mathematics, Keio University, 3-14-1 Hiyoshi, Kohoku, Yokohama 223, Japan (Ch. 6) H. Theil, College of Business Administration, Dept. of Economics, University of Florida, Gainesville, FL 32611, USA (Ch. 17) xiii

xiv

Contributors

19. J. Thomson, Institute of Statistics and Operations Research, Victoria University, Wellington, New Zealand (Ch. 14, 15) G. C. Tiao, Graduate School of Business, University of Chicago, Chicago, IL 60637, USA (Ch. 3) V. J. Yohai, Department of Mathematics, Piso 7, University of Buenos Aires, Argentina (Ch. 4) P. Young, Dept. of Environmental Sciences, University of Lancaster, Lancaster L A I 4YQ, England (Ch. 8)

E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5 © Elsevier Science Publishers B.V. (1985) 1-23

Nonstationary

Autoregressive

Time

1

Series

Wayne A. Fuller

1. I n t r o d u c t i o n A m o d e l often u s e d to d e s c r i b e the b e h a v i o r of a v a r i a b l e o v e r time is the a u t o r e g r e s s i v e m o d e l . In this m o d e l it is a s s u m e d that the c u r r e n t value can be e x p r e s s e d as a function of p r e c e d i n g values a n d a r a n d o m error. If we let Yt d e n o t e the value of the v a r i a b l e at time t, the p t h - o r d e r real valued a u t o r e gressive time series is a s s u m e d to satisfy P

Y , = g ( t ) + ~ , c q Y , _ i + e ,,

t-l,2

.....

(1.1)

i-1

w h e r e t h e e,, t - 1, 2 . . . . . are r a n d o m v a r i a b l e s a n d g ( t ) is a real v a l u e d fixed function of time. W e h a v e chosen to define the a u t o r e g r e s s i v e time series on the p o s i t i v e integers, but the time series might b e d e f i n e d on o t h e r d o m a i n s . T h e statistical b e h a v i o r of the t i m e series is d e t e r m i n e d by the initial values (I(0, Y--1. . . . . Y p+l), by the function g(t), by the coefficients (cq, a 2. . . . . C~p), and by the stochastic p r o p e r t i e s of the e,. W e shall, h e n c e f o r t h , assume that the e, h a v e z e r o m e a n a n d v a r i a n c e 0-2. A t a m i n i m u m we a s s u m e the e, to be u n c o r r e l a t e d . O f t e n we a s s u m e the e, to be i n d e p e n d e n t l y and i d e n t i c a l l y distributed. L e t t h e j o i n t d i s t r i b u t i o n function of a finite set {Y,,, Y'2 . . . . . Y,} of the 1/, be d e n o t e d by

Y,2

. . . . .

Y'2. . . . . Y'o)

T h e t i m e series is strictly s t a t i o n a r y if F,,,,, r,2..... v,, (Y,,' Y,2. . . . .

Y,°) = F ~,~+,, W,2+h..... r,,,h(Y'~' Y'2. . . . .

Y',,)

for all p o s s i b l e sets of indices tl, t 2 , . . . , tn and t 1+ h, t 2+ h , . . . ,-t, + h in t h e set {1, 2 . . . . }. T h e time series is said to b e c o v a r i a n c e s t a t i o n a r y if

E{Y,} =

t=1,2 .....

2

W. A, Fuller

and E{(Y,-tx)(Yt.~h-/X)}=y(h),

t=1,2 .... ; h=0,1,...,

w h e r e / x is a real n u m b e r and y ( h ) is a'real valued function of h. T o study the b e h a v i o r of the time series Y~ we solve the difference e q u a t i o n (1.1) and express I,', as a function of (el, e2 . . . . e,) and (Y0, Y < . . . . , Y-p+1). T h e difference equation p

coi= ~ c~jcoi-j

(1.2)

j=l

with initial conditions co0 = 1 ,

coi=0,

i=

1, - 2 . . . .

has solution of the form p

col = Z cjim~,

(1.3)

j=l

where m i are the roots of the characteristic equation p

m p - ~ ~jm p-j = 0 ,

(1.4)

j=l

the coefficients Cji

are

of the form

cji = bji kj ,

(1.5)

and the bj are such that the initial conditions are satisfied. T h e e x p o n e n t kj is zero if the root rnj is a distinct root. A root with multiplicity r has r coefficients with k j = 0 , 1 . . . . . r - 1 . Using the coi, the time series Y, can be written as t-I

p

I

t -1

Yt = 2 coie,-i + ~, co,+iY-i + ~]~ wig(t i=0

i= 0

i) .

(1.6)

i=0

T h e m e a n of Yt is t-I

E { Y t } = ~, co,g(t i=0

p 1

i)+ ~, co,+iE{Y ,}.

(1.7)

i= 0

T h e r e f o r e , if (1/0, Y-1 . . . . . Y-p+1) is a fixed vector, the variance of I", is a function of t and Y, is not stationary.

Nonstationary autoregressive time series

3

If the roots of (1.4) are less than one in absolute value, then m i goes to zero as i goes to infinity. O n e c o m m o n model is that in which g(t) =- a o. A s s u m e that (Yo, Y - l , . . . , Y-p+1) is a vector of r a n d o m variables with c o m m o n mean

o0(,-ko,)' i=1

c o m m o n variance a¢ 0"2 E tO~ i=o

(1.9)

and covariances

S{Yt.Yt+h}=0-2E(.oio)i+h,

t,t+h=O,-1

.... ,-p+l.

(1.10)

i=0

If g(t) = %, if (Y0, Y-1 . . . . . Y--v+1) is i n d e p e n d e n t of (e0, e 1. . . . ), and if the initial conditions satisfy (1.8), (1.9) and (1.10), then Y, is covariance stationary. If the initial conditions do not satisfy (1.8), (1.9) and (1.10), the time series will display a different b e h a v i o r for small t than for large t. H o w e v e r , if g ( t ) = a0 and the roots of the characteristic e q u a t i o n are less than one in absolute value, the nonstationarity is transitory. In such a situation, the large-t b e h a v i o r is that of a stationary time series.

2. T h e first-order model W e begin our discussion with the first-order m o d e l

g,=ao+c~lY

, 1-~ e,,

=%,

t=1,2 .....

(2.1)

t=0.

Given n o b s e r v a t i o n s on the process, several inference p r o b l e m s can be considered. O n e is the estimation of %. Closely related to the estimation p r o b l e m is the p r o b l e m of testing h y p o t h e s e s a b o u t oq, particularly the hypothesis that o~ = 1. Finally, one may be interested in predicting future observations. A natural e s t i m a t o r for (c~o,%) is the least squares e s t i m a t o r o b t a i n e d by regressing Y, oll Y,-1, including an intercept in the regression. T h e estimators are n

_.

2 q lr n

(2.2)

4

W . A . Fuller

where n

Y, 1, t 1 n t-1

These estimators are the m a x i m u m likelihood estimators for normal e, and fixed I/0. The distribution of c~1 depends upon the true value of al, the initial conditions, and the distribution of the e,. The error in the estimator of c~1 can be written

(~1 (~1

(Yg-I- Y(-,))

~'~ (Y,-1- Y(-t))( e, e(0)).

(2.3)

t=l

-

Under the assumption that the e, are uncorrelated, the expected value of the n u m e r a t o r is zero. The limiting behavior of the estimator is determined by the joint behavior of the sample m o m e n t s in the numerator and d e n o m i n a t o r of (2.3). The limiting distributions of c}~ are characterized in Table 2.1. For a time series with l a [ < 1, the limiting distribution of F/1/2(15~1 - O{1) is normal under quite weak assumptions. T h e first proof of the limiting normal distribution was given by Mann and Wald (1943). There have been a n u m b e r of extensions since that time. Because wi ~ 0 as n -~ % the initial value Y0, for any real Y0, will not influence the limiting distribution, though the influence for small samples could be large. The variance of the limiting distribution of n l / 2 ( ~ 1 -- 0~1) is

O) i

~

l--

O'1.

Table 2.1 Limiting properties of the least squares estimator of crt Parameters

0"1 ]all < 10"ll = lai] = [all > 10"11>

1 1 1 1 1 I0"1[ > a

Limiting distribution

0"o

Initial value I/0

Distribution of et

Standardizing function a

any real ao # 0 ao 0 0"0 = 0 ao = 0 0"0 # 0

any real any real any real !/0 = 0 Yo = 0 Y0 # 0

lID(0, 0-2) lID(0, 0-2) lID(0, 0.2) NID(0, 0.2) lID(0, 0.2)

nl/2(1 - a2) -1/2 n 3/2 n (a 2 - 1) '0"¢ ( a 2 - I) toe'l' (a '2 -- 1) 'de

N I D ( 0 , 0.2)

aThe standardizing function is a multiplier of ( ~ distribution. bThe constant sc = Y0+ o~0(1 og) 1.

Form b N(0, 1) Normal Tabulated Cauchy 9 N(0, 1)/N(~:, 1)

oct) that produces a n o n d e g e n e r a t e limiting

Nonstationary autoregressivetime series

5

The result of Table 2.1 is stated for independently and identically distributed random variables, but the limiting distribution of nU2(&l- cq) is also normal for et that are martingale differences. For example, see H a n n a n and Heyde (1972) and Crowder (1980). If Icq] = 1 and a 0 - 0, there is no simple closed form expression for the limiting distribution of n(&a- oq). The limiting distribution of n(~ 1 %) is that of a function of three random variables, L

n(al-- ~,)--, [2(r- W2)] l[(T2 1) 2TW],

(2.4)

where (F, T, W) =

2 2 E 2%,,z,, Z ~,~/2 2~. ~,,z,, ~ ~',~,), =

i=1

Yi-- (-1)i+12[( 2i -

i=1

l ) v ] -1 ,

and {Z/} is a sequence of NI(0, 1) random variables. Tables of the distribution are given in Fuller (1976) and the distribution has been discussed by Dickey and Fuller (1979). T h e estimator of oq constructed under the knowledge that a 0 = 0 has been studied by White (1958), Rao (1978a, 1978b), Dickey and Fuller (1979), and Evans and Savin (1981a). It is interesting that the normalization required to obtain a limiting distribution for 61 when [all = 1 is n, not n u2. The basis for the normalization is partly explained by examining the sum of squares in the denominator of ill. If Yt is stationary, E{Y~} is a constant for all t and

is nearly a constant multiple of n. This remains true for lall < 1 and any fixed real initial conditions. If lO/ll : 1 and a 0 = 0,

E { v , ~} = to-~ and n

y,-

n g 2 = [2-'n(n q 1)

=

= 6 q ( n 2 - 1)o.2 " Ifa 0-/0 anda l=l,then

Yt = Yo + aot + ~ ej j=l

6 1, a0 = 0 and Y0 = 0, then Y, can be written as t-1

r , : E ' °llet

i

i=o

t-1

= Ol tl £

Ol i1- t et i

i=0 c~

j=t+l

where X : •

aTJej.

j=l

Therefore, aTtyt converges to the r a n d o m variable X as t becomes large. It is also true that n

-2n (~1 £ t=l

--2 P , 2 1)X Y t - " ' + ( O / 1 -" .

2

T h e limiting properties of the estimator of % follow from these results. Because the sum of squares of Yt is increasing at the rate a 2n 1 , the least squares estimator of a 1 converges to o/1 very rapidly and it is necessary to multiply d 1 - % by a'~ to obtain a limiting distribution. The limiting distribution of a~'(~l - al) is that of the ratio of two r a n d o m variables. T h e variable X (or X plus a constant) is in the d e n o m i n a t o r and the n u m e r a t o r variable is an independent r a n d o m variable whose distribution is the limiting distribution of n-1

1=0

Therefore, if s 0 - 0, I/0 = 0 and the e, are normally distributed, the limiting

Nonstationary autoregressive time series

7

distribution is that of a Cauchy random variable. This-result was obtained by White (1958) and has been extended by Anderson (1959), R a o (1961), Venkataraman (1967), Narasimham (1969), and Hasza (1977). If s 0 ¢ 0 or Y0 ¢ 0, the denominator random variable has a nonzero m e a n (see Table 2.1). If the e t are not normally distributed, the form of the limit distribution depends upon the form of the distribution of the e r To summarize, the least squares estimator of c~1 has a limiting distribution for any value of c~,, but the standardizing function of n required to obtain a limiting distribution is a function of cq, c~0 and Y0- Also, the form of the distribution is a function of the same three parameters. An interesting aspect of the limiting distribution of the estimator of % is that o-2 is not a p a r a m e t e r of the distribution. This is because the least squares estimator of o~1 is invariant to changes in the scale of Yr. The case of Icq] = 1 is clearly a boundary case. Fuller (197"9) has shown that slight changes in the definition of the estimator produce different limiting distributions. For example, if it is known that ]O{11~ 1, and if one has observations (Y0, Y , . . . , Y,), one might use the estimator

~1

n 1 ( Y o - y)2 + Z ( Y t - Y)2 q-2(Yn t=l

where

n

Z (Yt-,- Y)(Yt-- Y), t=l

(2.6)

n

J3 = (n + 1)-' ~'~ Y~. t-0

This estimator is restricted to [ - 1 , 1] and is the estimator for the first-order process used in the m a x i m u m entropy method of spectral estimation described by Burg (1975) and Ulrych and Bishop (1975). If al = 1, then L 1 n ( ~ l - °gl)-)-2[/=~1 ~ u2i ] 2] -1 ,

(2.7)

where {ui} is a sequence of NID(0, 1) random variables Y2i-12= (4i2 2)-1

Y2i2= (4Z~)-1 ,

and Z i is the ith positive zero of the function t 2 sin t - t ' cos t. The limiting distribution was obtained in a different context by Anderson and Darling (1952) and is discussed by MacNeil (1978). The distribution defined in (2.7) is much easier to tabulate than that of 61, where 61 is defined in (2.2) because the characteristic function for (2.7) may be obtained and inverted numerically. Statistics closely related to 61 have been discussed by Durbin (1973), Sargan and Bhargava (1983) and Bhargava (1983). Lai and Siegmund (1983) consider a sampling scheme in which observations

8

W.A. Fuller

are taken from the time series until

nc E Y~-I > co-2, t=l

where c is a specified constant and n c is the smallest number such that the inequality holds. For this sampling scheme and the model with a 0 = 0 known, they show that n~l

2

\1t2

Yt-1)

L

( d q - oq) + N(0, o e)

as c ~ % uniformly for - 1 ~< ~'1 ~ 1. Thus, for a particular kind of sampling, a limiting normal distribution is also obtained for the unit root case. The least squares estimator of a0 given in (2.2) can be written as

(2.8) Therefore, the distribution of d 0 is intimately related to that of & l - %. For the model with ]%] < 1, the limiting distribution of nl/2(60- o~0)is normal. For other situations, the limiting distribution is more complicated. The fact that the distribution of 61 does not depend on o-2 permits one to use the distribution of Table 2.1 for inference about a 1. Another statistic that is natural to use for inference purposes is the Studentized statistic = [ ~t~'r{61}1-1(~1-

(2.9)

1),

where g{~l}

(Yt-

=

] or2,

= n

I~"2=

(g/ - -

2) -1 Z [ 1/, - Y{0~- c~l(Y, 1 - 'i2{ 0)12. t=l

The limiting distribution of the statistic [ also depends upon the true parameters of the model. The types of distributions are tabulated in Table 2.2. For those situations where the limiting distribution of the standardized least squares estimator oq is normal, the limiting distribution of the [-statistic is N(0, 1). The distribution of [ for loq] = 1 is a ratio of quadratic forms and has been tabulated by Dickey (1976). See Fuller (1976). One of the more interesting results of Table 2.2 is the fact that the limiting distribution of the •-statistic is N(0, 1) for ]all > 1. This result emphasizes the unique place of tall = 1. The •-statistic for estimator (2.6) has a limiting distribution that is a simple transformation of the limiting distribution of 61 . The properties of predictors for the first-order autoregressive process are

Nonstationary autoregressive time series Table 2.2 Limiting properties of the least squares 't-statistic' Parameters

lall < Icql = I~ll = Icql > Icql > [all >

I 1 1 1 1 1

Initial value Y0

Distribution of et

Limiting distribution

any real any real any real Yo - 0 Y~ - 0 Yo- 0

IID(0, 0-:) IID(0, 0-2) lID(0, 0 - 2 ) NID(0, 0-2) liD(0, 0-2) N1D(O, 0"2)

N(0, 1) N(0, 1) Tabulated N(0, 1) ?(0, 1) N(O, 1)

any real s0 ~ 0 so = 0 0~0 = 0 a0 = 0 ao # 0

given in Table 2.3. Let Y,÷j denote the predictor constructed with known parameters. If the parameters are known and if the e t are independent, the best predictor of Y,+j given (Y0, Y~. . . . , Y,) is the conditional expectation ?o+j = E{Yo+j [ Y.} = d0+ = %(1+ a 1+... +

-1)+ a t Y . .

The error in this predictor is Yn+j - Y,+:

e,,.j+ale,.;

1 +''

j-2

"q-Od 1

e, Im21 > Im31>~ . . . >~ lmp[, where I m l [ > l and ]mi] / >

I*kl. Fountis and Dickey (1983) show that L

n(2,- a , ) - * A , ( 2 r ) - ' ( r 2 - 1), where F and T are defined in (2.4).

19

Nonstationary autoregressive time series

Fuller, Hasza and Goebel (1981) have extended T h e o r e m 3.3 to the model with one root of (3.2) greater than or equal to one in absolute value and the remaining roots less than one in absolute value. Also, the prediction results of Section 2 extend to the more complicated models of this section. EXAMPLE 3.1. Engle and Kraft (1981) analyzed the logarithm of the implicit Price Deflator for Gross National Product as an autoregressive time series. We simplify the model of Engle and Kraft and use data for the period 1955 first quarter through 1980 third quarter. For the initial part of our analysis we assume that the process is a third-order autoregressive process. The least squares estimated autoregressive equation is (3.17)

f'~ = -0.021 + 1.429Y t ~ 0.133Yr_2+ 0.290Yt_ 3

and the residual mean square error is d-2 = 1.1173(10 5). T h e r e are a total of 103 observations and 100 observations are used in the regression. The largest root of the characteristic equation m 3 - 1.429m2 + 0.133m + 0.290 = 0 is 1.0178. Because the largest root is greater than one, the estimated model is explosive. We first test the hypothesis that the largest root is one. This is done by regressing the first differences on Yt-i and the lagged first differences. The estimated equation is

'frl -- Wt-1 =

-0,0211 @ 0.0054Y t (0.0082) (0.0026)

+ 0.290(Yt-2- Y, 3), (o.099)

1 + 0.423(Yt-1

....

"}//-2)

(0.098)

(3.18)

where the numbers in parentheses are the estimated standard errors obtained from the ordinary least squares regression program. By T h e o r e m 3.4 the statistic [ = (0.0026) 1(0.0054)= 1.93 has the distribution tabulated by Dickey when the largest root is one. By Table 8.5.2 of Fuller (1976) the /'-statistic will exceed 0.63 about one percent of the time. Therefore, the hypothesis of a unit root is easily rejected. Because of the large positive autocorrelation of series such as the price deflator, numerical problems are often reduced by fitting the model in the form (3.18) instead of in the form (3.17). To set confidence limits for the largest root, we again use T h e o r e m 3.4. Let the coefficient of gt-I in the regression of }It--mlYt-1 on Yt-l, Y~ 1--mlYt-2 and Y, 2-miYt-3 be denoted by /~. If m 1 > 1 is the largest root of the characteristic equation and if all other roots are less than one in absolute value,

W. A. Fuller

20

then the limiting distribution of the statistic

i = (s.e. t;)-'~, where s.e. /) is the ordinary least squares standard error, is that of a N(0, 1) r a n d o m variable. Therefore, we can define a confidence interval for m I to be those m 1 such that the absolute value of the calculated statistic /" is less than the tabular values of Student's t for the desired confidence level. F o r our data

"/,

1.0091Y,_~ = -0.0211+ 0.00274 Y, 1+0.417(Y, (0.0082) -r- 0 . 2 8 8 ( Y

(0.00139) t 2 - 1.0091

1--

1.0091Y,~2)

(0.098) Y, 3)

(0.099) and Y,-1.02561/,1=-0.0211-

0.00254Y 1 1+ 0.406(Y, 1 .- 1.0256Yt_2)

(0.0082)

(0.00128)

(0.098)

+ 0.283(Y,-2 -- 1.0256 Y,-3) • (0.097)

It follows that a 95 percent confidence interval for m I based on the large sample theory is (1.0091, 1.0256). In the preceding analysis we assumed the process to be a third-order autoregressive process. W e can use T h e o r e m 3.4 to test the hypothesis that the coefficient for Yt-4 is zero. By that t h e o r e m the ordinary regression t-statistic for Y, 3 - relY, 4 in the regression of I/, - relY,_ l on Yt-1, Yt 1- miYt-2, Y,-2-relY, 3, and Y,-3-m~Yt-4 has a N(0, 1) distribution in the limit. Because the t-statistic for the hypothesis that the coefficient for Yt 3-m~ Y, 4 is zero is identical (for any m 1 # 0) to the t-statistic for the coefficient of Y,-4, we have a test of the hypothesis that the process is third order against the hypothesis that it is fourth order. W e have, for example,

f't - 1.0178Y, i = -0.0207

0.000001/, ~ q 0.409(Y, j

(0.0086)

(o.00r 14)

(0.104)

+ 0 . 2 8 0 ( Yt_ 2 --

1.0178 Y,-3)

+ 0.012(Y,_~

1.0178I/, 4).

(o.1o8) (0.104)

Because the /--statistic for Y, 4 is t = (0.104) 10.012 = 0.12

1.0178 Y,_2)

Nonstationary autoregressive time series

21

we easily accept the hypothesis that the process is third order. The argument extends to the use of an F-test with two degrees of freedom to t e s t the hypothesis of third order against the alternative of a fifth order process, etc. Acknowledgements This research was partly supported by Joint Statistical A g r e e m e n t J.S.A. 82-6 with the U.S. Bureau of the Census. I thank David Dickey, David Hasza, V. A. Samaranayake, and Sastry Pantula for comments.

References Amemiya, T. and Fuller, W. A. (1967). A comparative study of alternative estimators in a distributed lag model. Econometrica 35, 509-529. Anderson, R. L. (1942). Distribution of the serial correlation coefficient. Ann. Math. Statist. 13, t-13. Anderson, T. W. (1959). On asymptotic distributions of estimates of parameters of stochastic difference equations. Ann. Math. Statist. 30, 676~87. Anderson, T. W. and Darling, D. A. (1952). Asymptotic theory of certain "goodness of fit" criteria based on stochastic processes. Ann. Math. Statist. 23, 193-212. Bhargava, A. (1983). On the theory of testing for unit roots in observed time series. London School of Economics. Burg, J. P. (1975). Maximum entropy spectral analysis. Unpublished Ph.D. thesis. Stanford University, Stanford, CA. Crowder, M. J. (1980). On the asymptotic properties of least squares estimators in autoregression. Ann. Statist. 8, 132-146. Davisson, L. D. (1965). The prediction error of stationary Gaussian time series of unknown covariance. IEEE Trans. Inform. Theory, IT-11, 527-532. Dickey, D. A. (1976). Estimation and hypothesis testing ill nonstationary time series. Unpublished Ph.D. thesis, Iowa State University, Ames, Iowa. Dickey, D. A. (1977). Distributions associated with the nonstationary autoregressive process. Paper presented at the eastern regional meeting of the Institute of Mathematical Statistics, Chapel Hill, NC (April 1977). Dickey, D. A. and Fuller, W. A. (1979). Distribution of the estimators for autoregressive time series with a unit root. J. Amer. Statist. Assoc. 74, 427-431 Dickey, D. A. and Fuller, W. A. (1981). Likelihood ratio statistics for autoregressive time series with a unit root. Econometrica 49, 1057-1072. Dickey, D. A. and Said, S. E. (1982). Testing A R I M A (p, 1, q) versus A R M A (p + 1, q). In: O. D. Anderson, ed., Applied Time Series Analysis. North-Holland, Amsterdam. Dickey, D. A., Hasza, D. P., and Fuller, W. A. (1984). Testing for unit roots in scasoaaI time series. J. Amer. Statist. Assoc. 79, 355-367. Durbin, J. (1960). Estimation of parameters in time-series regression models. J. Roy. Statist. Soc. 22, 139-153, Durbin, J. (1973). Distribution theory for tests based on the sample distribution function. Regional Conference Series in Applied Mathematics No. 9. SIAM, Philadelphia, Pennsylvania. Engle, R. F. and Kraft, D. F. (1981). Multiperiod forecast error variances of inflation estimated from A R C H models. In: A. Zellner, ed,, Proceedings of the A S A - C e n s u s - N B E R Conference on Applied "Iime Series Analysis of Economic Data. Evans, G. B. A. and Savin, N. E. (1981a). The calculation of the limiting distribution of the least squares estimator of the parameter in a random walk model. Ann. Statist. 9, 1114-1118.

22

W. A . Fuller

Evans, G. B. A. and Savin, N. E. (1981b). Testing for unit roots 1. Econometrica 49, 753-777. Findley, D. F. (1980). Large sample behavior of the S-array of seasonally nonstationary ARMA series. In: O. D. Anderson and M. R. Perryman, eds., Time Series Analysis, 163-170. NorthHolland, Amsterdam. Fountis, N. G. (1983). Testing for unit roots in' multivariate autoregressions. Unpublished Ph.D. thesis. North Carolina State University, Raleigh, NC. Fountis, N. G. and Dickey, D. A. (1983). Testing for a unit root nonstationarity in multivariate autoregressive time series. Paper presented at Statistics: An Appraisal, International Conference to Mark the 50th Anniversary of the Iowa State University Statistical Laboratory, Ames, Iowa. Friedman, M. and Schwartz, A. J. (1963). A Monetary History of the United States I867-1960. Princeton University Press, Princeton, NJ. Fuller, W. A. (1976). Introduction to Statistical Time Series. Wiley, New York. Fuller, W. A. (1979). Testing the autoregressive process for a unit root. Paper presented at the 42rid Session of the International Statistical Institute, Manila. Fuller, W. A. (1980). The use of indicator variables in computing predictions. J. Econometrics 12, 231-243. Fuller, W. A. and Hasza, D. P. (1980). Predictors for the first-order autoregressive process. J. Econometrics 13, 139-157. Fuller, W. A. and Hasza, D. P. (1981). Properties of predictors for autoregressive time series. J. Amer. Statist. Assoc. 76, 155-161. Fuller, W. A., Hasza, D. P. and Goebel, J. J. (1981). Estimation of the parameters of stochastic difference equations. Ann. Statist. 9, 531-543. Gould, J. P. and Nelson, C. R. (1974). The stochastic structure of the velocity of money. American Economic Review 64, 405-417. Grenander, U. (1954). On the estimation of regression coefficients in the case of an autocorrelated disturbance. Ann. Math. Statist. 25, 252-272. Hannan, E. J. (1956). The estimation of relationships involving distributed lags. Econometrica 33, 206--224. Hannah, E. J. (1970). Multiple Time Series. Wiley, New York. Hannan, E. J. (1979). The central limit theorem for time series regression. Stoch. Process. Appl. 9, 281-289. Hannan, E. J., Dunsmuir, W. T. M. and Deistler, M. (i980). Estimation of vector ARMAX models. J. Multivariate Anal. 10, 275-295. Hannan, E. J. and Heyde, C. C. (1972). On limit theorems for quadratic fuuctions of discrete time series. Ann. Math. Statist. 43, 2058-2066. Hannan, E. J. and Nicholls, D. F. (1972). The estimation of mixed regression, autoregression, moving average and distributed lag models. Econometrica 40, 529-548. Hasza, D. P. (1977). Estimation in nonstationary time series. Unpublished Ph.D. thesis. Iowa State University, Ames, Iowa. Hasza, D. P. and Fuller, W. A. (1979). Estimation for autoregressive processes with unit roots. Ann. Statist. 7, 1106-1120. Hasza, D. P. and Fuller, W. A. (1982). Testing for nonstationary parameter specifications in seasonal time series models. Ann. Statist. 10, 1209-1216. Hatanaka, M. (1974). An efficient two-step estimator for the dynamic adjustment model with autoregressive errors. J. Econometrics 2, 199-220. Kawashima, H. (1980). Parameter estimation of autoregressive integrated processes by least squares. Ann. Statist. 8, 423435. Koopmans, T. C., Rubin, H. and Leipnik, R. B. (1950). Measuring the equation systems of dynamic economics. In: T. C. Koopmans, ed., Statistical Inference in Dynamic Economic Models. Wiley, New York. Lai, T. L. and Siegmund, D. (1983). 
Fixed accuracy estimation of an autoregressive parameter. Ann. Statist. 11, 478-485. Lai, T. L. and Wei, C. Z. (1982). Asymptotic properties of projections with applications to stochastic regression problems. J. Multivariate Anal. 12, 346-370.

Nonstationary autoregressive time series

23

MacNeil, I. B. (1978). Properties of sequences of partial sums of polynomial regression residuals with applications to tests for change of regression at unknown times. Ann. Statist. 6, 422-433. Mann, H. B. and Wald, A. (1943). On the statistical treatment of linear stochastic difference equations. Eeonometrica 11, 173-220. Muench, T. J. (1971). Consistency of least squares estimates of coefficients of stochastic difference equations. Mimeograph, University of Minnesota, Minneapolis, MN. Narasimham, G. V. L. (1969). Some properties of estimators occurring in the theory of linear stochastic processes. In: M. Beckman and H. P. K/inzi, eds., Lecture Notes in Operations Research and Mathematical Economics. Springer, Berlin. Nichols, D. F. (1976). The efficient estimation of vector linear time series models. Biometrika 63, 381-390. Orcutt, G. H. and Winokur, H. S. (1969). First order autoregression: Inference, estimation, and prediction. Econometrica 37, 1-14. Phillips, P. C. B. (1979). The sampling distribution of forecasts from a first-order autoregression. J. Econometrics 9,241-262. Rao, C. R. (1967). Least squares theory using an estimated dispersion matrix and its application to measurement of signals. In: Proc. Fifth Berkeley Syrup. Math. Statist. and Probability, Vol. 1, 355-372. University of California, Berkeley, CA. Rao, M. M. (1961). Consistency and limit distributions of estimators of parameters in explosive stochastic difference equations. Ann. Math. Statist. 32, 195-218. Rao, M. M. (1978a). Asymptotic distribution of an estimator of the boundary parameter of an unstable process. Ann. Statist. 6, 185-190. Correction (1980) 1403. Rao, M. M. (1978b). Covariance analysis of nonstationary time series. In: P. R. Krishnaiah, ed., Developments in Statistics. Academic Press, New York. Rubin, H. (1950). Consistency of maximum-likelihood estimates in the explosive case. In: T. C. Koopmans, ed., Statistical Inference in Dynamic Economic Models. Wiley, New York. Samaranayake, V. A. and Hasza, D. P. (1983). The asymptotic properties of the sample autocorrelations for a multiple autoregressive process with one unit root. University of Missouri, Rolla, MO. Sargan, J. D. and Bhargava, A. (1983). Testing residuals from least squares regression for being generated by the Gaussian random walk. Econometrica 51, 153-174. Scott, D. J. (1973). Central limit theorems for martingales and for processes with stationary increments using a Skorokhod representation approach. Adv. in Appl. Probab. 5, 119-137. Stigum, B. P. (1974). Asymptotic properties of dynamic stochastic parameter estimates (Ill). J. Multivariate Anal. 4, 351-381. Stigum, B. P. (1975). Asymptotic properties of autoregressive integrated moving average processes. Stoch. Proper. Appl. 3, 315-344. Stigum, B. P. (1976). Least squares and stochastic difference equations. J. Econometrics 4, 349-370. Tiao, G. C. and Tsay, R. S. (1983). Consistency properties of least squares estimates of autoregressive parameters in A R M A models. Ann. Statist. II, 856-871. Utrych, T. J. and Bishop, T. N. (1975). Maximum entropy spectral analysis and autoregressive decomposition. Rev. Geophys. Space Phys. 13, 183-200. Venkataraman, K. N. (i967). A note on the least square estimators of the parameters of a second order linear stochastic difference equation. Calcutta Statist. Assoc. Bull. 16, 15-28. Venkataraman, K. N. (1968). Some limit theorems on a linear stochastic difference equation with a constant term, and their statistical applications. Sankhyg Ser. 
A 30, 51-74. Venkataraman, K. N. (1973). Some convergence theorems on a second order linear explosive stochastic difference equation with a constant term. J. Indian Statist. Assoc. 11, 47--69. Wegman, E. J. (1974). Some results on nonstationary first order autoregression. Technometrics 16, 321-322. White, J. S. (1958). The limiting distribution of the serial correlation in the explosive case. Ann. Math. Statist. 29, 1188-1197. White, J. S. (1959). The limiting distribution of the serial correlation coefficient in the explosive case I1. Ann. Math. Statist. 30, 831-834. Yamamoto, T. (1976). Asymptotic mean square prediction error for an autoregressive model with estimated coefficients. J. Appl. Statist. 25, 123-127.

E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5 © Elsevier Science Publishers B.V. (1985) 25-83

2

Non-Linear Time Series Models and Dynamical Systems Tohru Ozaki

1. Introduction

There are many examples of dynamic p h e n o m e n a in nature which can be regarded as stochastic processes, e.g. a ship rolling in the sea, brain wave records, animal populations in ecology, sunspot numbers in astronomy and riverflow discharge in hydrology. Some of them are considered to be stochastic processes by virtue of their own mechanism. Some of them, such as hydrodynamic p h e n o m e n a , may not be considered to be stochastic at the microscopic level, but may be considered to be at the macroscopic level. By treating them as stochastic processes, meaningful results both in theory and applications may be obtained. For the inference of the characteristics of these stochastic processes and for their forecasting and control, observation data, obtained by sampling the process at equally spaced intervals of time, are often used. Since many stochastic p h e n o m e n a in the world can be considered to be approximately Gaussian processes, it follows that much of the effort of time series analysts has been devoted to providing methodologies for the statistical analysis of Gaussian time series. After recent development in linear time series modelling for stationary Gaussian processes by Box and Jenkins (1970) and Akaike and Nakagawa (1972), time series analysts' attention has been turned to nonGaussian or non-stationary processes. For the analysis of these processes and for their forecasting and control, non-linear or non-stationary time series models are needed. It is also expected that non-linear time series models may be useful for the inference of the non-linear structure in the dynamics of stochastic processes. Several non-linear time series models have been introduced and used for the analysis of time series data. In other applications, diffusion process models have been considered to be a good approximate model for stochastic dynamic p h e n o m e n a and have been used in many fields of science. Our main purpose in this paper is to see how non-linear time series models and diffusion process models characterize the non-linear dynamics in the non-Gaussian stochastic process and to show that the both are closely related 25

26

T. Ozaki

by a time discretization scheme for differential equations. For this purpose, we study some examples of dynamic p h e n o m e n a and their linear and non-linear time series models in Section 2, and diffusion processes and their time discretization schemes in Section 3. In Section 4 we discuss estimation methods for models considered in the previous sections and show that non-linear time series models are useful for the inference of the non-linear structure of a process, whose non-linearity is characterized by potential functions. In Section 5 the multivariate extension and the implications of the present scheme in time series analysis are discussed.

2. Amplitude-dependent autoregressive models

2.1. Autoregressive models for ship rolling T h e m o v e m e n t of a ship on the ocean is complicated because of sustained excitement caused by the ocean waves, whose dynamics are impossible to describe by a deterministic differential equation. To describe the dynamics, a stochastic approach was introduced and stochastic differential equation methods and time series analysis methods were introduced (Yamanouchi, 1974). For example, the dynamics of ship rolling are supposed to have the following mechanism. If the ship rolls by x degrees, then the centre of buoyancy moves from B to B' in Fig. 2.1, and has a buoyancy force from the water at B' while the ship's centre of gravity is G. Then the gravity force G W and the buoyancy force B'M cause the righting m o m e n t and it gives the ship a restoring force which is a function g(x) of the

/*

Z

Fig. 2.1.

Non-linear time series models and dynamical systems

27

rolling angle x. At the same time, the sea water damps the ship's m o v e m e n t and this damping force is supposed to be a function f(ic) of the velocity 2 of the rolling movement. T h e ship is also considered to be under the continual external force s~ of r a n d o m excitement by the ocean waves. Therefore, the dynamics of the ship rolling can be described by the following stochastic differential equation: 5~+ / ( x ) + g ( x ) : s~ .

(2.1)

When x and x are not very large, f ( 2 ) and g ( x ) are usually approximated by linear functions as f(2) = a 2 ,

(2.2)

g(x) = bx.

(2.3)

Then the dynamics of ship rolling is approximately described by the linear stochastic differential equation 5i + a2 + bx = ~.

(2.4)

where the p a r a m e t e r a is called the damping coefficient and the p a r a m e t e r b is called the restoring coefficient. Both a and b depend on the size and shape of the ship and hence each ship has its own parameters. When the external force s~ is a Gaussian white noise and At is sufficiently small, (2.4) corresponds to an A R M A ( 2 , 1) model X t = ~)IXt 1 + ~lXt_2 + 01at_ 1 + a t ,

(2.5)

where a t is a discrete time Gaussian white noise with variance 0 "2, and q~l, ~2, 01 and 0-2 a re uniquely determined if a, b and the variance of white noise s~ are known (Pandit and Wu, 1975). If we use the backward shift operator B, which is such that B x t = xt_ 1 and Bixt = Bi-axt_~, (2.5) is rewritten as (1 - 4 , 1 B

- 4,2B2)x, - (1 + OiB)a,

o r as

~(B) O ( B ) x, = a,,

(2.6)

where ~b(B) = 1 - q ~ l B - q~2B2 and O ( B ) - 1 + 0 l B . T h e x t is considered to be an output of the system which is driven by a Gaussian white noise a t. However, ~: of (2.1) is usually a non-Gaussian coloured noise and so a t of (2.6) is, in general, not a Gaussian white noise but a non-Gaussian coloured noise process with some peaks and valleys in its spectrum. A simple approximate model for such a coloured noise process a t with rational svectrum is a Gaussian

29

Non-linear time series models and dynamical systems

O- I

-!

sl '0

80

]6'0

~o 40'0

2410

480

5do

6~'o

720

8;0

8so

980 iO00

TIME

Fig. 2.2. Ship rolling data. Fig. 2.3 shows the spectrum which was calculated using the Hanning window. When we fit A R models of order 0 to 20 to the data, the following AR(7) model: xt = 1-9500xt-1- 0.8660xt-2- 0.2430xt 3-r- 0.1070x t 4 + O.0120x,-s - 0.0660x,-6 + 0.0690xt-7 + 6 ,

(2.10)

6.2= 0.2100, was adopted as the best model by A1C (Ozaki and Oda, 1978). Fig. 2.4 shows the spectrum of the fitted model (2.10), where the p e a k of the spectrum shows about 25 seconds periodicity of rolling, and the figure of the spectrum extracts the essential characteristics of the spectrum Fig. 2.3 calculated without the parametric model.

(2) C3

tA

f2~¢3

z~-t

/

z C_Bc2" CM___

~0 9 0

T--T---

I

0,20

I. . . .

I

0=~0

FREOUENSY Fig. 2.3. Estimated spectrum by Hanning window (Lag = 100).

Non-linear time series models and dynamical systems

31

×t

X )( X

X X

X C

X

X

×

X

X C ©

X

X

X 0

× X

X

×

× O

o x°

X

tO

Fig. 2.6. One-step-ahead prediction (x : observed, ©: predicted). restoring force is the following stochastic differential equation model: 2 + a2 + bx + cx 3 = ~,

(2.11)

where the restoring force is approximated by (b + cx2)x. When c > 0 the system is called a hard spring type and when c < 0 the system is called a soft spring type. A natural non-linear extension of the time series model for ship rolling may be the A R model with some non-linear terms such as Xt = ~)lXt_ 1 -~ . . .

+ (~pXt_ p 4- O ( X t _ l . . . . .

Xt_p) + E,,

(2.12)

where Q(Xt_l,..., Xt_p) is a polynomial of the variables xt_ 1. . . . , xt_ p. We call the non-linear A R model (2.12) a polynomial A R model. The validity of the polynomial A R model is checked by fitting both linear and polynomial A R models for part of the data (see Fig. 2.6) and by comparing the one-step-ahead prediction error variances obtained by applying both fitted models to the rest of the data. For example, we fitted the AR(7) model and an AR(7) model with a non-linear term x 3t - 1 (AR(7)+ ~-Xt_l) 3 for the first 760 data points, x 1, x a. . . . . x760, of Fig. 2.2, and calculated the variances of the one-step-ahead prediction error for x761. . . . . xl000. The obtained prediction error variance d-2Lof ~2 the AR(7) model was ~rL= 0.1041 and the prediction error variance O'NL^2by the AR(7) +Trxt_ ~3 model was ~rNL^2= 0.1016. This means that the non-linear model slightly improves the prediction performance of the ship rolling. Although the above polynomial A R model gives better predictions than the linear A R model, it has a fatal deficiency as a model for the dynamics of vibration systems. Simulations of fitted polynomial A R models almost always diverge even though the original ship rolling process or the process defined by a non-linear stochastic differential equation (2.11) is quite stable and nondivergent. Therefore, some other non-linear time series model which is not explosive is desired. 2.3. E x p o n e n t i a l A R

models

To see the reason why the polynomial A R models are explosive, let us

32

T. Ozaki

consider some simple polynomial A R models of vibration systems. The simplest A R model which can exhibit a r a n d o m vibration system is the AR(2) model (2.13)

• X t • 051Xt_1 ~- 052Xt 2 + F.t,

and the simplest polynomial A R model for a non-linear vibration system is xt = 05ixt_1 + 052xt_2+ 7rxt_13+ et.

(2.14)

The spectrum of the process defined by (2.13) is O"

P(f)

2

~- I 1 - 051 e i 2 ~ r f 05 e-~2~S.2 2 •

2

(2.15)

The p e a k of the spectrum which characterizes the proper frequency of the vibration system is attained at 1 X / - 052~-- 4052, f = 2"rr tan-1

05~

which is the argument of the roots of the characteristic equation, A 2 - 05,A --052 = O.

(2.16)

When 052 is fixed to be a constant, the proper frequency is characterized by 051 as in Fig. 2.7. The polynomial A R model (2.14) is represented as xt = (051 + 7TXt_l)Xt_l 2 -]- 052Xt_2 q- e t , gl 1

Fig. 2.7.

(2.17)

Non-linear time series models and dynamical systems

33

and is considered to have an amplitude-dependent first-order autoregressive coefficient (see Fig. 2.8). In m a n y vibration systems, the value of x~ may stay in some finite region [x,l < M and the roots of the equation a 2 - (l~lq- "JTx~)A -- 11)2 -..~ 0

(2.18)

may stay inside the unit circle for such x[s. However, the white noise e t is Gaussian distributed and may have a large value and the roots of (2.18) may lie outside the unit circle. Then the system begins to diverge at this stage. Since we are interested in the stochastic behaviour of x, mostly for ]xt[ < M, it may be reasonable to make the non-linear function approach a b o u n d as t ~ + oo as in Fig. 2.9. A time series model which can exhibit this characteristic is the following model x, = (¢1

2

~r e -x,_, )x,__l+

(2.19)

ff~2Xt_2 q- g t .

The model is called an exponential A R model (Ozaki and Oda, 1978). The roots of the equation

(2.2o)

A2 - (~1-- "/r e -x2 ')A - ¢2 = 0

always stay inside the unit circle for any x,_ 1 if ¢1, ¢2 and rr satisfy the condition that the roots A0 and £0 of A2-- ( 4 1 - 7r)A - (/)2 0,

(2.21)

~-

and the roots , ~ and a-® of (2.22)

A 2 --I~IA .--4)2 = 0

all lie inside the unit circle (see Fig. 2.10).

. . . . . . . . . . . . . .

1 .........................

¢

x

Fig. 2.8.

1

34

T. Ozaki

P

Xt-~

Fig. 2.9. In the above example, the second-order coefficient is fixed to 4~2 and the roots of both (2.21) and (2.22) all stay inside the unit circle. However, in the general vibration system, the damping coefficient is not constant in general. One example is the following van der Pol equation: - a(1

-

x 2 ) . , ~ q-

b x = O,

(2.23)

where for x 2 < 1 the system has negative damping force and starts to oscillate and diverge, but for x 2 > 1 the system has positive damping force and it starts to damp out. The interplay of these two effects of opposite tendency produces a steady oscillation of a certain amplitude, which is called a limit cycle. When the system is disturbed by a white noise n, we have - a ( 1 - x2)x + bx = n,

(2.24)

which produces a perturbed limit cycle process (see Fig. 2.11). The exponential A R model (2.19) is easily extended and applied (Haggan

Fig. 2.10.

35

Non-linear time series models and dynamical systems ""

I ~

_

i~1111t.

Wlgllllll

i i BI~Ill

I

x(t)

~(t)

n(t)

~

Fig~ 2.11. Analog simulation of (2.24). and Ozaki, 1981; O z a k i 1982a) for this kind of non-linear d a m p i n g system by m a k i n g the s e c o n d - o r d e r coefficient amplitude d e p e n d e n t as Xt = (~1 q- 7/'1 e-X2-1)xt-1 q- (t~2 + 7r2 e-x2 l)xt 2 q- Ft "

(2.25)

If the coefficients satisfy the condition (C1), which is such that (C1) the roots )t o and A0 of A 2 - (4~1 + wi)A - (~b2 + -,r2)= 0

(2.26)

lie outside the unit circle, then x t starts to oscillate and diverge for small x t 1~ while if the coefficients satisfy the condition (C2) such that (C2) the roots of A~ and A~ of A2

~b~A - ~b2 = 0

(2.27)

lie inside the unit circle, then x t starts to d a m p out w h e n xt_ ~ b e c o m e s too large. T h e result of these two effects is e x p e c t e d to p r o d u c e a similar sort of self-sustained oscillation as (2.23) if we suppress the white noise e t of (2.25). Fig.

36

T. Ozaki

2-

Q .L-)

kD

'0,00

I00.00

200,00

300.00

400,00

500,00

500.00

700

Fig. 2.12.

2.12 shows the limit cycles obtained for the model xt = (1.95 + 0.23 e-X2'-')xt_l - (0.96 + 0.24 e -x~ 9x, 2 + t-,,

(2.28)

where the coefficients satisfy the above conditions (C1) and (C2). 2.4. Stationarity The necessary and sufficient condition for the AR(2) model (2.29)

Xt = (01Xt-1 + (02Xt-2 ~- ~'t

to be stationary is that the roots of the characteristic equation A 2 - (01A - (02 = 0

(2.30)

all lie inside the unit circle. For checking the stationarity of exponential model x, = ((o1+ ~rle x2'-l)xt 1+(,52 + 7rze x]-l)X, 2-~ e,,

(2.31)

the following theorem about the ergodicity of a Markov chain on a norm space is useful. THEORE~ 2.1 (Tweedie, 1975). A Markov chain X , on a norm space with transition law p(x, y) is ergodic if p(x, y) is strongly continuous, i.e. p(x, y) is continuous with respect to x when y is fixed, and if there exists a compact set K and a positive value c > 0 which satisfy the following conditions,

(i)

E{Ilxo+,II- Ilxoll I x . = x}

(ii)

E{[]X,+I[]- [[X,][ [ X, = x} ~ T 2 ,

where 1r(x,_l)= ~r0+ ~r,xt-,+ ' " + ~rrx~-l. If we a p p r o x i m a t e f1(x,-1)by a con2 stant plus a H e r m i t i a n - t y p e p o l y n o m i a l thl + ( % + 7rlxt-i + " ' " + %x~_1)e -x'-~

(a) Linear threshold AR model

(b) Non-linear threshold AR model

~(x)

(c) Exponential AR model

~(x)'

~(x) I i

--q

I I x

x

0

Fig. 2.17.

0

44

T. Ozaki

, ',(XOT I I I I i I

x2

,%

0

~;

/,~

xt

Fig. 2.18.

(see Fig. 2.17), we have the following e x t e n d e d e x p o n e n t i a l A R m o d e l (Ozaki, 1981a): x, = {4)1+ ( % + 7rlx,_ 1 + " " + 7r2c~_1) e

x 2

'-i}x, 1 + et.

(2.58)

This model includes the exponential A R model as special case, s --- 0. It seems that non-linear models with continuous 4) functions have more versatile geometric structure than models with discontinuous step & functions such as linear threshold A R models. For example, if we design the 4) function of x,+, = 4~(x,)x, + ~,+1

as in Fig. 2.18 by using non-linear threshold A R models or an extended exponential A R model, then (b(x,) = 1 at four points x t = (1, sol, (2 and ~:2 (see Fig. 2.18) and so they have four non-zero stable or unstable singular points to which x t converges or from which x t diverges when the white noise e, is suppressed. However, the linear threshold models do not have such a geometric structure, since the ~b function of the model is a discontinuous step function.

2. Z T h r e s h o l d structure

W e have used the threshold in some amplitude-dependent A R models to a p p r o x i m a t e the dynamics of the A R coefficients. The introduction of the threshold idea in such a situation may look somewhat ad hoc. However, there are often cases in nature, in physical or biological phenomena, where the threshold value has a significant physical meaning. The threshold structure does not necessarily mean that the system is switched from one linear system to another linear system depending on whether the concerned x t values crosses over the critical value. One example is the wave propagation of a nerve impulse (see Fig. 2.19) or a heart beat, which are supposed to form a fixed wave

Non-linear time series models and dynamical systems

45

(a) Impulse above the threshold

P

(b) Impulse b e l o w the threshold

L___Fig. 2.19. pattern and propagate if an impulse is larger than a critical value, while if the impulse is less than the critical value the impulse wave dies out (see Fig. 2.19). Neurophysically, the wave propagation is realized by the flow of electrons along the axon which is caused by the change of membrane potential and a mathematical model, called the Hodgkin-Huxley equation, is presented for this dynamic phenomenon by Hoi3gkin and Huxley (1952). Starting from this Hodgkin-Huxley equation, Fitzhugh (1969) obtained the following non-linear dynamical system model for the dynamics of the potential V: dV . : = a ( V - Eo)3+(E=- V) - b ( V - E l ) ,

(2.59)

tit

where ( V - E o ) B + = ( V - E o ) 3 for V>~Eo and ( V - E o ) 3 - O for V < E o , E 0 < E 1% E 2 and E 0, E 1 and E 2 are ionic equilibrium potentials determined by the sodium and potassium ion and some other ion. The coefficients a and b of (2.59) are values which are related to the sodium, potassium and some other ions. Since they are varying very slowly compared with V, they can be considered to be locally constant. From (2.59) we know that dV/dt is zero at A, B and C (see Fig. 2.20). The reference to the sign of dV/dt on the neighbouro hood of these points shows that A and C are stable singular points, while B is an unstable singular point. If V > B , V ~ C, but if V < B , V ~ A . Therefore, B is a 'threshold', separating two stable states which may be called the resting state A, and the excited state C. This kind of threshold structure is realized by the discrete time non-linear difference equation X,+l = 4,(x,)x,,

designing &(xt) as in Fig. 2.18. One example is the following model: 2

2

x,+1 = (0.8 + 4x, e-X')x,,

(2.60)

46

7". O z a k i Ionic current

~

b(V-E1)

A Eo

~ ( V - E o ) ~ (E2-V)

E1

0

E2

Fig. 2.20. where sc~ = 0.226 . . . . and so; = - 0 . 2 2 6 . . . are unstable singular points and ~:~ = 2.1294 . . . . sc~ = - 2 . 1 2 9 4 . . . and s%= 0 are stable singular points. If we apply an impulse to model (2.60), then xt goes to zero for t - + ~ if the magnitude of the impulse is less than the unstable singular point ~:~ but xt goes to ~:~ for t ~ m if the magnitude of the impulse is larger than the threshold value ~ (see Fig. 2.21). If we have a white noise input to the model defined by (2.60), we have the following model:

(2.61)

Xt+1= (0.8 -}-4*x ,2 e x2/. )x,+e,+~,

where et+~ is a Gaussian white noise. Fig. 2.22 shows the simulation of model (2.61), where xt fluctuates around one of the stable singular points and sometimes moves around from one stable singular point to another depending

cD..... cD

,%

c~

g'£?"t____J r 0.00

r

40.00

-r----T ......--F .......

:

80.00

°O ,00 Fig. 2.21.

40.00

-T"----T. . . . T. . . . .

60,00

47

Non-linear time series models and dynamical systems 0 (xJ

~-I--

n'~rY"l"11~,'~-,

.q-lr,rr'F''ir'v

v.~n~r-T.'~

O

C) 130

io .00

-f

1

2o.oo

i

I

I

4o.oo

I

"3"--I

6o.oo

l--

8o.oo

I

I00,00

~i0 1 Fig. 2.22.

on the white noise input. By looking at the data (Fig. 2.22) of the above example (2.61), people may think of two linear models, one above the threshold and one below the threshold. However, the data are actually described by o n e non-linear model. A similar non-linear phenomenon is realized by a non-linear time series model with time varying coefficients. For example, consider the following model: X,+ 1

(2.62)

= (~(t, Xt)X t H- 6 , + 1 ,

where ~b(t, x,) = {0.8 + 0.4rt e -x2' + 4(1 - rt)x 2 e -~2'}

changes from 2

£', +1

(2.63)

xt+x = (0.8 + 0.4 e-X2')x, + e , < ,

(2.64)

x,+l = (0.8 H 4x 2 e-X,)x, -t to

(23 (%1

0

or')

T '0.00

r 20.00

1 ~

J 40.00

'-T

~" 60.00

Fig. 2.23.

1

T .............. J. . . . . . . . 1 I00.00 80.~u~'~

48

T. Ozaki

x'-

0

Fig. 2.24.

as "rt increases monotonically from 0 to 1 as t increases. The model (2.63), as we saw before, has three stable singular points ~:~, ~:z and so0= 0 and two unstable singular points s~ and ~ , while the model (2.64) has two stable singular points r t ~ = 0 . 8 3 . . . . and - q ~ = - 0 . 8 3 . . . , and one unstable singular point rl = 0 . Therefore, the stable singular point ~:0= 0 changes into an unstable singular point as time t passes and the process xt begins to move arouhd s% to one of the other stable singular points as in Fig. 2.23. The sudden change of an equilibrium point in the above example is considered to be a result of a smooth change of some potential function as in Fig. 2.24. This kind of structural change of the process caused by a gradual change of parameters is closely related with the topic treated in catastrophe theory (see, for example, Zeeman, 1977).

2.8. Distributions We have seen that a threshold structure is realized by a stationary non-linear time series model x,+l = (0.8 + 4x 2 e-X~)x, + e,+l,

(2.65)

where x~ moves around from one stable singular point to another depending on the white noise input. However, the process defined by (2.65) has one and the same equilibrium distribution on the whole. Fig. 2.26 shows the histogram of the data generated by simulating the non-linear threshold A R model 1"(0.8+ 1.3x{ - 1.3xg)xt + e,+l

x,+l = t 0.8xt + et+x

for Ix, I < 1.0, for tx, f > 1.0,

(2.66)

which has the same structural property as (2.65). It has three stable singular points ~:0= 0, ~:¢ = 0.9 and sc~ = - 0 . 9 and two unstable singular points ~:~ =

Non-linear time series models and dynamical systems

49

I

-1 .47

-0.63

0.2

1 .05

Fig. 2.25. 0.4358.~. and s~i = - 0 . 4 3 5 8 . . . Fig. 2.25 shows the histogram of the white noise used in the above simulation, where the number of data is N = 8000. It is obvious that the three peaks in Fig. 2.25 correspond to the three stable singular points G0, s~i and ~ , and the two valleys correspond to the two unstable singular points ~:i~ and s~. These correspondences remind us of the

1

-0.8I

-0.26

0.09

Fig. 2.26.

I

0.44

50

T.

Ozaki

Fig. 2.27. correspondence between the singular points of the dynamical system

Yc = f ( x ) and its potential function x

V(x) -= - f f ( y ) dy. For example, the dynamical system 2 = - 4 x + 5x 3 - x 5 has three stable singular points G0= 0, s~ = 2, ~:~ = - 2 and two unstable singular points ~:i~ - 1 and £~ = - 1 (see Fig. 2.27). The stable singular points correspond to the valleys of the potential and unstable singular points correspond to the peaks of potential (see Fig. 2.28). Further, it is known that the equilibrium distribution W ( x ) of the diffusion process defined by the stochastic dynamical system

2 = f ( x ) + n(t) is given by

W ( x ) - Wo exp{-2 V(x)/0-2}, where 0-2 is the variance of white noise n(t) and W0 is a normalizing constant. If we consider this structural correspondence between non-linear time series models and diffusion processes defined by stochastic dynamical systems, it may be natural to study the diffusion process and its time discretization scheme in the succeeding section.

Non-linear time series models and dynamical systems

51

V(×J

0

X

Fig. 2.28.

3. Diffusion processes and their time discretizations

3.1. Stochastic d y n a m i c a l systems

A stochastic dynamical system is defined by

(3.~)

= f(x) + ~(t).

where ~:(t) is a Gaussian white noise with variance cr 2, and so it is also r e p r e s e n t e d as (3.2)

5c = f ( x ) + ern (t) ,

where n ( t ) is a unit G a u s s i a n white noise whose v a r i a n c e is one. Since, for small r > 0, it holds that lim E [ A x ] _ f ( x ) , .r~O

lim r~0

lim r-*O

T

E[(Ax)2I

~r2 ,

T

Et(ax)q

-0

(k/>3),

T

where Ax = x ( t + r ) - x ( t ) = f ( x ) r + f[+'~ ds n ( s ) + o(r), we have, for the process defined by (3.2), the following F o k k e r - P l a n c k equation: 0p ot

1 02 0 Ox [f(x)pl + ~ x 5 [o'2p],

(3.3)

T. Ozaki

52

where p stands for the transition probability p(X[Xo, t) which means the probability that the process takes the value x at time t, given that it had the value x 0 at time t = 0. Thus the stochastic dynamical system uniquely defines a diffusion process with transition probability p(x I Xo, t) defined by the F o k k e r Planck equation (3.2). Conversely, the diffusion process defined by (3.3), obviously, uniquely defines the stochastic dynamical system (3.2). However, the rate of the growth of the variance,

E[(Ax)

lim r~0

-

,

T

of a general diffusion process is not a constant but a function of x. A general diffusion process is characterized by the following Fokker-Planck equation:

Op

0

Ot

[a(x)p] + ~ ~Ox [b(x)p] . Ox

1 02

(3.4)

Then (3.4) uniquely defines the following stochastic differential equation (see, for example, Goel and Richter-Dyn, 1974) 2 = f(x) + g(x)n(t),

(3.5)

where

f ( x ) : a(x),

g ( x ) - X/b(x).

On the other hand, a stochastic differential equation

2 - f ( x ) + g(x)n(t) uniquely defines a diffusion process whose Fokker-Planck equation is

019 Ot

1 0a 0 [f(x)p] + ~ ~x 2 [g2(x)p] . Ox

By the variable transformation

y = y(x)=

f

x dE g(~),

(3.6)

we have, from the stochastic differential equation (3.5), the following stochastic dynamical system:

= a ( y ) + n(t),

(3.7)

where n(t) is a Gaussian white noise with unit variance. We call the process y

Non-linear time series models and dynamical systems

53

the associated diffusion process of (3.4), and we call the dynamical system f~ = a(y) the associated dynamical system of (3.4). By the analogy with mechanics we define the potential function by

V(y) = -

f

Y

a 05) d)T

(3.8)

We note that the potential function (3.8) is different from the potential function well known in Markov process theory (Blumenthal and Getoor, 1968), and we call V(y) of (3.8) the potential function associated with the diffusion process or simply the associated potential function. The above discussion suggests that any diffusion process uniquely defines a variable transformation and a potential function with respect to the transformed variable.

3.2. Distribution systems Since our interests are non-linear stationary time series with given equilibrium distributions, let us confine ourselves to homogeneous diffusion processes which have unique equilibrium distributions. The equilibrium distribution W(x) of the diffusion process (3.4) is given by W(x) = ~ C exp{2 fx [a(~)/b(~)]d~}

(3.9)

where C is the normalizing constant. Wong (1963) showed that for any probability distribution function W(x) defined by the Pearson system

d W(x) c o+ Qx dx - do+ dlX + d2x2 W(x) ,

(3.10)

we can construct a diffusion process whose equilibrium distribution is W(x). Then the following proposition is obvious from the straightforward extension of Wong's logic: PROPOSITION 3.1~ For any distribution W(x) defined by the distribution system dW(x)

dx

c(x) W(x), d(x)

(3.11)

we can construct a diffusion process whose equilibrium distribution is W(x) as follows: Op O 1 02 Ot ..... Ox [{c(x) + d'(x)Ip] + ~ Ox~ [2d(x)p] ,

where c(x) and d(x) are analytic functions.

(3.12)

T. Ozaki

54

We call the distribution system (3.11) a generalized Pearson system. The system includes not only distributions of the Pearson system but also all the analytic exponential families ~g of distributions which are defined by the set of distributions {f} of the following forria:

W(x) = a (f)a (x) exp{ fi (f). t(x)},

(3.13)

where a and the fli of fl = ( i l l , . . . , ilk) are real-valued functions of ~, and a(x) and t(x)= (tl(X) . . . . . tk(X))' are analytic functions of x (Barndorff-Nielsen, 1978). From the definition of the generalized Pearson system the following propositions are also easily obtained.

The generalized Pearson system of the equilibrium distribution of the diffusion process defined by the Fokker-Planck equation PROPOSITION 3.2.

Op Ot

0 1 Oa [a(x)p] + ~ ~Ox [b(x)p] Ox

(3.14)

is dW

dx

-

2 a ( x ) - b'(x)

b(x)

(3.15)

W(x).

The generalized Pearson system of the equilibrium distribution of the diffusion process defined by the stochastic differential equation

PROPOSITION 3.3.

Yc= f(x) + g(x)n(t)

(3.16)

is dW(x) dx

=

2 f ( x ) - g(x)g'(x) -

-

g(x) 2

W(x).

(3.1'7)

The generalized Pearson system of the diffusion process y associated with the diffusion process x defined by (3.16) is PROPOSrrlON 3.4.

dW(y) dy

-2a(y)W(y),

(3.18)

where c~ ( y ) = c~ ( y ( x ) ) -

f(x)

(3.19)

g(x) "

The above correspondence between the generalized Pearson system and the diffusion process in Proposition 3.1 is unique if we restrict that c(x) and d(x) of (3.11) are mutually irreducible.

Non-linear time series models and dynamical systems

55

3.3. Local linearization of y = f(y) + n(t) A well-known method for the time discretization of

= f ( y ) + n(t)

(3.20)

is to use the following Markov chain model: yt+A,- Yt = At. f ( y , ) + B,+a,- B,,

(3.21)

where B , + a t - B t is an increment of a process of Brownian motion and is distributed as a Gaussian distribution with variance At. The process y, defined by (3.21) is known to converge uniformly, for At-+0, to the original diffusion process y defined by (3.20) on a finite interval of time (Gikhman and Skorohod, 1965). The deterministic part, y,+a,-Yt = At "f(Yt), of (3.21) is known as the Euler method of discretization of the dynamical system Y = f(y).

(3.22)

However, the Euler method is known to be unstable and explosive for any small At, if the initial value of y is in some region. For example, the trajectory y(t) of 3~ = _y3

(3.23)

is known to go to zero for any initial value of y. Its discretized model by the Euler method is Y,+at = Yt- At. y~,

(3.24)

which is explosive, the trajectory going to infinity if the initial value Y0 is in the region ]Y0I> ~/2/At. It is also known that, for any small At, the Markov chain (3.21) is non-stationary if f ( y ) is a non-linear function which goes to + ~ for [ y l - ~ (Jones, 1978). The same thing can be said for some other more sophisticated discretization methods such as the H e u n m e t h o d or the R u n g e Kutta method (see, for example, Henrici, 1962). For the estimation and simulation of diffusion processes by a digital com~ puter, it is desirable to have a stationary Markov chain which converges to the concerned stationary diffusion process for At-~ 0. Our idea of obtaining such a stationary Markov chain is based on the following local linearization idea. When f ( y ) of (3.22) is linear as in = -~y,

(3.25)

its analytic solution is obtained as

y(t) = Yo e-~'.

(3.26)

T. O z a k i

56

Therefore, we can define the discrete time dynamical system by y,+~, = e ~aty,,

(3.27)

which coincides with y(t) of (3.26) on t, t + At, t + 2At . . . . . Also, the Markov chain defined by Y t+At = e - ' ~ a t y t + k / ~ e

(3.28)

t+at

is stationary if a > 0 , and the Markov chain converges to the stationary diffusion process :9 = - ~ y

+ n(t)~

If we approximate e -~a' of (3.27) by a first-order Taylor approximation, (3.27) becomes equivalent to the Euler method, which does not even coincide with the analytic solution (3.26) at t, t + At, t + 2 A t , . . . . Other discretization methods such as the Heun method and the R u n g e - K u t t a method are approximation methods which aim to be higher-order (2nd and 4th, respectively) Taylor approximations of e -~at. If we consider the general dynamical system (3.22) to be locally linear, i.e. linear for a small interval At, and if we use the analytic solution (3.26) for the small interval, we have a trajectory which coincides with the trajectory of the original dynamical system at least for linear f(y). This idea is realized by integrating, over [t, T), t ~< ~- < t + At,

of

Y = 7f(Y) • oy

(3.29)

which is obtained by differentiating (3.22), assuming that

J, = Of ¢ 0 Oy

(3.30)

is constant on the interval, i.e. assuming that the system is linear on the interval. Then we have y(~-) = eJ'('-°3~(t )

(3.31)

from which we have, by integrating again over [t, t ~ At), y(t 4

At) = y(t) + J;l(eJ'a'--

1)f(y,).

(3.32)

For Jt = 0 we have

y(t ~ at) = y(/) + a t f ( y , ) .

(3.33)

Non-linear time series models and dynamical systems

57

It is easily seen that the model defined by (3.32) and (3.33), which we call a locally linearized dynamical system, converges to 3 ) = f ( t ) f o r A t e 0 . It is also easily checked (see, for example, Gikhman and Skorohod, 1965) that the Markov chain defined by Y,+a, =

(3.34)

@(Y,)+ V ~ e , + a , ,

where Yt + j;1 (e,,a, _ 1)f(yt) qS(Yt) = Yt + At . f(yt)

for Jt ¢ O,

(3.35)

for Jt = O,

and et+~t is a Gaussian white noise with unit variance, converges to the diffusion process y(t) of (3.20). We call the model (3.34) the locally linearized Markov chain model of the stochastic dynamical system model (3.20). As we shall see later, the present local linearization method brings us, unlike the Euler method or other discretization methods, non-explosive discrete time dynamical systems. If f ( x ) is specified it is easy to check whether the locally linearized dynamical system is non-explosive or not. However, it may be sometimes useful if sufficient conditions for the non-explosiveness of the locally linearized dynamical system are given for the general dynamical system 9 = f(Y). The model (3.34) is rewritten in the following way: Y,+a, = 4' (Y,)Y, + V~Te t+at,

(3.36)

4'(y,) = 1 + (e j'a'- 1)f(y,)/(.l, . y,)

(3.37)

where

for y, ¢ 0 and Jt ¢ 0. For the y, to be non-explosive for t ~ , has only to satisfy

the function f ( y )

[6(y,)l < 1

for large lY,I. From (3.37) it is obvious that we have (3.38)

(e j'a' - 1)f(yD/(J , . y,) < 0 ; hence 4'(Yt) < 1, for large ]YtI if f ( y ) satisfies the following condition: (A)

f(y) - - m ,

forlyl~ ~.

For the locally linearized dynamical system model (3.36) to be non-explosive

58

T. Ozaki

we have to say that 6(y)> -1 for lY]-* m This is equivalent to

(eJ~y~a' 1)f(y)/{J(y)y} > - 2

(3.39)

for lyl~m, W h e t h e r f(y) satisfies (3.39) or not very much d e p e n d s on the decreasing (or increasing) b e h a v i o u r of the function f ( y ) for [Yl--'m. F r o m now on, we will discuss the situation w h e r e y ~ m, because the s a m e l o g i c m a y be applied for the negative side. If

J(y)"~O

for y ~ m

then we have

e s°)a'- I 2~tf(y) J(y) At y

Atf(y) y

- - >

-2

for y--~ m.

(3.40)

T h e r e f o r e , ~b(y) > - 1 for y -* m if f(y) satisfies J ( y ) ~ 0 for y ~ m. If J(y)-* c < 0 for y -* m, we have, for sufficiently small At,

eJ(y)at 1 .f(y___)) e - c a ' - l f ( y ) > _ 2

J(y)

y

c

for y ~ m

(3.41)

y

T h e r e f o r e , a sufficient condition for qS(y)> -1 for y ~ o~ is: (BI)

J(y)-*c 0 for Y>Yo, then we have q ~ ( y ) > 0 for Y>Yo. T o have q ~ ' ( y ) > 0 for Y>Y0 concaveness of f ( y ) is sufficient. To have q~(y0)_> 0 for some Y0, it is sufficient that there exists Yl ~> Y0 such that f(Yl) < - c X / y l Vc > 0. This is always satisfied if f(y) satisfies the condition (B;), and so (B~) is a sufficient condition for qS(y) > - 1 for y ~ oo. Examples of functions which satisfy (B;) are

f(y)=-y

e y2 and

f(y)=-y3.

The similar conditions of f ( y ) for y ~ --oo are obtained f r o m the same logic as follows: (C1)

J (y )'~ c 0"~ ] and for any c > 0 there exists Yl ~< Yo such that

f(Yl) > - cYl. From the above discussions we have the following theorem:

The locally linearized dynamical system (3.32) is non-explosive if the function f ( y ) of (3.22) satisfies the condition (A), any one of conditions (B1) or (B;) and any one of conditions (C0 or (C;).

THEOREM 3.1.

The non-explosiveness of the locally linearized dynamical system (3.32) is

60

T. Ozaki

closely related with the ergodicity of Markov chains on the continuous norm space. For the locally linearized Markov chains (3.34) to be ergodic, Theorem 2.1 requires q~(y) to be a continuous function of y and to have the shift back to centre property which is guaranteed by [q'(Y)/Yl = 14ffy)l < 1 forly[ ~ ~ . Therefore, we have the following theorem: THEOREM 3.2. The locally linearized Markov chain (3.34) is ergodic if f ( y ) satisfies the condition (A), any one of conditions (B1) or (B;) and,any one of conditions (C 0 or (C~). 3.4. Some examples Let us see some examples of diffusion processes which have some distributions of interest and their locally linearized Markov chain models. EXAMPLE 1. Ornstein-Uhlenbeck process. is defined by 0p_

at

0

The Ornstein-Uhlenbeck process

10 2

Ox [axp] + ~ Ox--~ [o'2pl,

(3.42)

from which we have the following stochastic differential equation: Y¢= - ax + ~rn(t).

(3.43)

The associated dynamical system is = -ay,

(3.44)

where y = x/cr. We define the damping function z ( y ) of a dynamical system = f ( y ) by z(y) = -f(y)

o

Then the damping function of (3.44) is a linear function (see Fig. 3.1) z(y) = ay.

(3.45)

The associated potential function (see Fig. 3.2) is V(y) = a y2. Z

(3.46)

Non-linear time series models and dynamical systems

61

/

zlv)

¥ ¥

Fig. 3.1.

Fig. 3.2.

T h e Pearson system of the equilibrium distribution of x of (3.42) is

dW(x) - 2ax W(x), dx o.2

(3.47)

and the distribution W ( x ) is the well-known Gaussian distribution (see Fig. 3.3)

/ a / ax2\ W ( x ) = ~ a 2 e x p , - --~-7-) -

(3.48)

T h e locally linearized M a r k o v chain model is Xt = o . Y t , Yt+at = e - a

atYt + X/--~

(3.49)

et+at ,

which is an AR(1) model with a constant ~b function (see Fig. 3.4). EXAMPLE 2. 2 = --X 3.

T h e dynamical system

2 - - x3

(3.50)

has a non-linear cubic damping function as in Fig. 3.5. If this dynamical system is

Wlxl

0

Fig. 3.3.

×

Fig. 3.4.

62

T. Ozaki

driven by a white noise of variance 0"2, we have 2 = - x 3 + o'n(t).

(3.51)

The Fokker-Planck equation of the process x is _013 _ = _ _0 [x3p ] + _102 [0-2p]. Ot

OX

(3.52)

20X 2

The associated dynamical system is obtained by employing the variable transformation (3.53)

y = x/o',

giving (3.54)

= _0-2y3.

The associated potential function (see Fig. 3.6) is

V(y)

0-1 y4 .

(3.55)

= ~

The distribution system of the equilibrium distribution of x is d W(x) dx

_

--2X 3 (3.56)

0"2 W ( x ) .

Then the distribution W ( x ) is given by (see Fig. 3.7) W ( x ) = W o exp - ~ 2

,

(3.57)

where W 0 is a normalizing constant.

V{y)

0 Fig. 3.5.

Fig. 3.6.

Non-linear time series models and dynamical systems

63

W(X)

~(Yt)

2

5

Fig. 3.8.

Fig. 3.7.

T h e locally linearized M a r k o v chain model is

xt = 0.Yt, where

Yt+at = 6(Yt)Y, + X/~te,+at.

(3.58)

2 1 2 2 qb(yt) = 3 + 5 exp(--30. 2xtyt).

(3.59)

T h e figure of the ~b function is shown in Fig. 3.8. EXAMPLE 3. system

2 = --6X + 5.5X 3 - X5.

T h e d a m p i n g function of the dynamical

2 = - 6 x + 5.5x 3 - x 5

(3.60)

has five zero points, so0 = 0, sc~ = ~22, ~:~ =-Xf~-~, sc~ = 2 and ( ~ = - 2 (see Fig. 3.9). T h e y are called singular points of the dynamical system. If an initial value x 0 of (3.60) is one of the five singular points, then x(t) stays at x 0 for any t > 0. If the d y n a m i c a l system is driven by a white noise 0-n(t), we have 2 = -6x

+ 5.5x 3 -

x 5~

o'n(t).

(3.61)

T h e c o r r e s p o n d i n g F o k k e r - P l a n c k equation is

0t9 Ot

0 [ ( - 6 x + 5.5x 3-- xS)p] + 1 0 2 0x~ [0-2p] ~ Ox

(3.62)

T h e associated dynamical system is 3? = - 6 y + 5.50-2y 3 - o4y5, where y = x/0-. The associated potential function is (see Fig. 3.10) 11°"2 y4 +

V ( y ) = 3y 2 - - - 8 -

0. 4

--6

y6

"

(3.63)

64

7". Ozaki

T h e distribution system of x is dW(x)

- 1 2 x + l l x 3 - 2x 5

dx

0-2

W(x) ,

(3.64)

and the distribution W ( x ) is (see Fig. 3.11)

W(x)= Woexp{(-6x2+llx4-1x6)/0-2},

(3.65)

where W 0 is a normalizing constant. T h e locally linearized M a r k o v chain m o d e l is

x t = cryt , y,+~,, = 49(yt) + X/Net+at, where

q)(Y,) =! Y'

+ flY') ~ t ) [exp{J(yt)z~t} - 1]

y, + a t . f(y,)

for J(Yt) ~ 0, for J(y,) = 0 ,

f ( y , ) = _ 6 y t + 5.50-2y3_ o '4y t,, and

J(Yt) = - 6 + 16.50-2yt; - 50"4yt.a Since

cl)(yt)/y, ~ e -6a'

for

ly, l-" 0,

the ~b function of the locally linearized M a r k o v chain m o d e l is (see Fig. 3.12)

I1 + f ( Y t L [ e x p { J ( y t ) A t } - 1] 'I J~Yt)Yt ~P(Yt) = ' 1 ] + (--6y, + 16.50-2y3t -- 50-'ySt)At

for J(Yt)Yt # O, for J(y,) --- O,

/

t e-6At

for Yt = 0.

J

Z(y)

/ Fig. 3.9.

Fig. 3.10.

65

Non-linear time series models and dynamical systems w(×l'

¢(Vt)

0

Fig. 3.11.

Fig. 3.12.

Gamma-distributed process.

EXAMPLE 4. by

W(x) =

Z

The Gamma distribution is defined

x ~-1 e x p ( - x / f l )

r(~)/3 °

(3.66)

Its Pearson system is dW(x)

(a-1)/3-x

dx

/3x

(3.67)

from which we have a diffusion process defined by the following Fokker= Planck equation 0t9

0

Ot

Ox

l 02 [(a/3 - x)p] + ~ ~ [2/3xp].

(3.68)

The stochastic differential equation representation of the diffusion process is 2 = (a - ½)fl - x + ~/~2flx" n ( t ) .

(3.69)

By the variable transformation

y = x/2~//3

0.70)

we have the stochastic dynamical system y = ( a - ~ )1/ y - y/2+ n ( t ) .

(3.71)

The damping function z ( y ) of the associated dynamical system is z ( y ) = y / 2 - (a - ~)/y. 1

(3.72)

66

T. Ozaki 1

As is seen in Fig. 3.13, if a >~ the damping function is negative for y < 1 V'2a - 1 while if a < ~ the damping function is always positive. The associated potential function (see Fig. 3.14) is V ( y ) = y2/4 - (a - 2) log y.

(3.73)

The shape of the distribution of Gamma distribution changes drastically at a = 1, while the critical value for the distribution of the associated process y ( t ) of (3.71) is ce = 12. The equilibrium distribution of y(t) is given by (see Fig. 3.15) 1 y(~_l) exp(_ ~ ) V(oL)2,,_1

W(y)-

(3.74)

when a =~1 the damping function is a linear function of y, and the potential function is a quadratic function. Therefore, the distribution of y is Gaussian for 1 a = ~. The locally linearized Markov chain model for the diffusion process x ( t ) is x, = (flyt)2/2fl, (3.75) Yt+a, = 49(y,) + X/ M e,+a,,

where = l y , + [exp{J(yt) At}- 1]. f ( y t ) / J ( y , )

for J(y,) # 0,

q:'(y,)

Ly, + A t .

and

for

f(y,),

f ( y , ) = (a - ~1) / y , - y J 2 ,

J(y,)=-(a

1 -

9/7,

(3.76) J(y,)

= 0

2 1

~.

-

L~(y,)/y,I < 1 for y,-~ % ~b(y,) is not bounded (see Fig. 3.16), when a _! ~ z. A l t h o u g h [4~(Y,)I =

Z(y)[ V(y)

(~ >0.5 0

Fig. 3.13.

Fig. 3.14.

Y

67

Non-linear time series models and dynamical systems

W(x)

~(Yt)

~'0.5

o' 0, the 4,'s satisfy the

• - ..

(2.4)

+ ~I)pffll-p - - O l ,

where O0 = 1, q,j = 0 for j < 0 and 0 t = 0 for l > q. T h u s for 1/> r, the ~p,s can be expressed in the form

~t = A l a t l + "" . + Apoapol ,

(2.5)

where pop.

The important property of ~(1) is that it vanishes for 1 > p when the model is AR(p). This is akin to the property of the autocorrelation coefficients p(/)'s with respect to the MA(q) model, and will prove to be a useful tool in model building.

Extended autocorrelation function For the ARMA(p, q) model, we see from (2.14) that for 1 > q, letting

@(')(p) = c(p, t)-lr(p, O, where 4~(')(p) = (4~{° . . . . . .

p,t

=

-'"

(2.22)

q~o), and letting (2.23)

W(t)~ follows a MA(q) then, since q¢O(p)_- q~(p), the transformed process {__p.,,

ARMA

models, intervention problems a n d outlier detection

93

model. Thus, if we let p(p, l) be the lag 1 autocorrelation of wp, " q)t, we have that

p ( p , 1) =

+

+'"+02)

ll0,

-1 ,

l=q, l>q.

(2.24)

In general, for k = 1,2, 3 , . . . and 1 = 1, 2,3 . . . . , let the k x 1 vector

~O(k ) =

( 9~1 . . . . .

satisfies the equations

G(k, l)~(~)(k ) = ~,(k, l)

(2.25)

and p(k, l) be the lag 1 autocorrelation of the transformed process tw(0x t vv k,tJ, where W~l]t = (1 - cI)g)B . . . . . cI)g~Bk)Z,. That is

p(k, l) = b ' G ( k + 1, I)b/b'G(k + 1, O)b,

(2.26)

where b '= (1, q¢°(k)') and it is easily seen that p(k, l) is a function of the autocorrelations p(1) . . . . , p(k + 1). Now, for k = p and l >1 q, p(k, l) has the 'cutting off' property (2.24) for A R M A ( p , q ) model which is akin to the property of p(1) in (2.16) for the MA(q) model. Following the work of Tsay and Tiao (1984), we shall call p(k, l) the kth extended autocorrelation of lag l for Z r W e shall also denote p(l) = p(O, l) so that p(k, l) will be defined for k >i 0 and l/> 1. It can be readily shown that for stationary A R M A ( p , q) model, when k >~p,

p(k,l)=

c,

O,

l = q + k-p,

l>q+k-p,

(2.27)

where [c] < 1. The above property for p(k, l) will be exploited later in the model building process.

2.2. Prediction theory In this section, we discuss the problem of forecasting future observations for the A R M A ( p , q) model (1.3). W e shall assume that the model is known, i.e. all the p a r a m e t e r s q~l. . . . , q~p, 01. . . . . Oq and o-2 are given. In practice, these parameters will, of course, have to be estimated from the data. For a discussion of the effect of estimation errors of the estimates on forecasts, see e.g. Y a m a m o t o (1976). Basically, the forecasting problem is as follows. Suppose that the {Zt} series begins at time m and we have available observations up to time T, Z m. . . . . Z r. What statements can then be m a d e about future observations Zr+l, l = 1, 2 . . . . . L? Clearly, all the information about Zr+ 1. . . . . Zr+ c is contained in the conditional distribution p ( Z T + 1. . . . . ZT+ c [ Z(T)), where Z ( T ) = ( Z m . . . . . Z T ) ' .

G. (2". T i a o

94

From the probabilistic structure assumed in (2.1), this conditional distribution is a L-dimensional multivariate normal distribution. In what follows, we obtain the mean vector and covariance matrix of this distribution and discuss their main properties. We shall denote Z r ( l ) as the conditional expectation

Zr(l) = ET(ZT+,)= E(ZT+t

(2.28)

I ZCT)),

which is the minimum mean square,d error (m.m.s.e.) forecast of Zr+ l, and denote er(1 ) as the forecast error

er(l) = Zr+,

-

(2.29)

2r(z).

From (1.3) with C = 0 and (2.3), we have that for l t> 1 2r(0

: @ , 2 r ( l - 1) + " - O~fiT(l

+ % 2 r ( 1 - p) + fir(t)

-- 1) .....

OqfiT(l

(2.30)

-- q)

where Z,)(/') = ZT+j, j < 0,

and

fiT(i)

=

E(ar+ i [ Z(T))

so that fiT(i) = 0 for i > 0. Thus, the Z,r(/)'s can be recursively calculated from (2.30) once the expected values fir(-/'), J' = 0 . . . . . q - 1, are determined, and for l > q the Zr(/)'s satisfy the difference equation • ( B ) 2 r ( / ) = 0,

(2.31)

where B now operates on l. To obtain a r ( - ] ) , we have from (2.10) that T-j-m

fir(--J)- Z r - i -

Z

T-j m

%Zr-i-h +

Z

fr *hE

(w~__j_~ I z¢r~)

h= T-j-(m+r)+l

h=l

(2.32) It can be shown that when all the zeros of O(B) are lying outside the unit circle, both ~rh and rr~ approach zero as h ~ m and for T - j >> m, the third term on the right-hand side of (2.30) can be ignored so that T-j-m

fir(--jl=Zr_j -

~

rrhZrq_ h.

(2.32a)

h=l

Thus, approximately, fir(--J) only depends on Zr_ j. . . . . Z,,. Note that the requirement that all zeros of O(B) be lying outside the unit circle is known as the 'invertibility condition' of the A R M A ( p , q) model. For a discussion of noninvertible models, see e.g. Harvey (1981). It is of interest to study the behavior of the forecasts Z'r(/) as a function of

ARMA

m o d e l s , i n t e r v e n t i o n p r o b l e m s a n d outlier d e t e c t i o n

95

the lead time I. F r o m (2.31), we can write

aT(l) =

"~IA(T)'~'I~I-1-

"'"

(2.33)

-t- J-lNa(T)of/~l ,

-1 where, as in (2.5), p o < p , a71, . . . , c%o are the Po distinct zeros of q~(B), and . (T) A~r), .,Ap0 are polynomials in I whose coefficients are linear functions of Z q, the asymptotic variance of r(1) is V a r ( r ( / ) ) - -- 1 + 2 /'~

02(/) .

(2.43)

j=l

By substituting r(j) for the unknown p(j) in (2.43), the estimated variances of the r(/)'s are often used to help specify the order q of a MA model.

SPA CF The sample partial autocorrelations ~(l),

l = 1. . . . .

(2.44)

G. C. Tiao

98

of Z, are obtained by replacing the p(/)'s in (2.20) by their sample estimates r(/)'s. For stationary models P

a~(1) ~ ~(1)

(2.45)

and the ~(/)'s are asymptotically normally distributed. Also, for a stationary AR(p) model 1

Var(~(l))---,

l>p.

(2.46)

n

The properties in (2.45) arid (2.46) make SPACF a convenient tool for specifying the order p of a stationary A R model in practice. For nonstationary models, i.e. ~ ( B ) contains the factor U(B) in (1.5), the asymptotic property of ~(l) is rather complex, however. In the past, the SACF and SPACF have been the most commonly used statistical tools for tentative model specification. Specifically, a persistently high SACF signals the need for differencing, a moving average model is suggested by SACF exhibiting a small number of large values at low lags and an autoregressive model, by SPACF showing a similar 'cutting off' pattern. Also, for series exhibiting a strong seasonal behavior of period s, persistent high SACF at lags which are multiples of s signals the need to apply the 'seasonal differencing' operator 1 - B ' to the data, and so on. The weaknesses of these two methods are (i) subjective judgement is often required to decide on the order of differencing and (ii) for stationary mixed autoregressive moving average models, both SACF and SPACF tend to exhibit a gradual 'tapering off' behavior making specification of the orders of the autoregressive and the moving average parts difficult.

ESA CF Recently, several approaches have been proposed to handle the mixed model specification problems. These include the R- and S-array methods of Gray et al. (1978) and the generalized partial autocorrelations by Woodward and Gray (1981). In what follows, we discuss the procedure proposed by Tsay and Tiao (1984), using what they called the extended sample autocorrelation function (ESACF) for tentative specification of the order (p, q) for the general nonstationary and stationary A R M A model (1.3). The proposed procedure eliminates the need to difference or in general transform the series to achieve stationarity and directly specify the values p and q. For stationary A R M A models, estimates ~(k,/)'s of the EACF p(k,/)'s as defined in (2.26) can be obtained upon replacing the p(/)'s in (2.26) by their sample counterparts r(/)'s. In this case, the estimated ~5(k,/)'s will be consistent for the p(k,/)'s and hence the property (2.27) can be exploited for model identification. However, for nonstationary model, the ~(k,/)'s will not have the asymptotic property given by the right-hand side of (2.27) in general.

ARMA

models, intervention problems and outlier detection

99

Now for ARMA(p, q) models, one can view the extended sample autocorrelation function approach as consisting of the following two steps. W e first attempt to find consistent estimates of the autoregressive parameters in order to transform Z t into a moving average process. We then make use of the 'cutting off' property of the autocorrelation function of the transformed process for model identification. For estimating the autoregressive parameters, the following iterated regression approach has been proposed. First, let ,.g(0) .¢.(0) "a- l ( k ) , • . . U)k(k) be the ordinary least squares (OLS) estimates from fitting the A R ( k ) regression to the data, ,

(o)

(o) Z

Z t = 451(k)Zt 1 + " " " + qgk(k)

°(°)

(2.47)

t-t + ~ka,

where .,(0) L. k,t denotes the error term. The 1st iterated A R ( k ) regression is given by Zt = ¢~(1) 7

~l(k.~t-1

~- " " " -1-

~(1)

k(k)

Z

~

~(1)

.9(0)

(1)

t - k -- t-" l ( k ) ~ k , t - 1 -}- e k , t ,

(2.48)

who,.o ~(o) _ :1 ,g(o)~ ^ (o) k • ..... k . , - - , - - ~ ' t ( k ) ~" . . . . . q~k(k)B )Z, ,s the residual from (2.47) and e(k'~ denotes the error term. This yields a new set of OLS estimates C~]~k),.. " ' C ~k(kF O) In general, for 1 = 1, 2, . . . the estimates ,fi(t) ~'t(k), • • •, ~m(~) k ( k ) are obtained from the /th iterated A R ( k ) regression Z t ~_ ( ~ l ~ k ) Z t _ 1 _ ~ . . .

_[_ (~)(l) 7 ~('-~) ) ~(0) k(k)L't~k -}- bft(l) 'l(k) k,t-1 -}- " " " q- P0 ( 'l(k)t~ k,t-I +

e(~!, (2.49)

where i

O(i) = ( 1 k,t

¢~(i)

R

"x'- ( k ) ~

....

. __

]~k~, 7 __ ~ "~ k(k) JJ .IL't ~

(~(i)

i~(i) ~(i-h) I"h(k)'k,t-h

h=l

(i.e. the residuals from the ith iterated regression) and e~)t is the error term. In practice, these iterated estimates ,g(0 '~:'(k),~~ can be obtained from OLS estimates of the autoregressive coefficients by fitting AR(k), . . . , A R ( k + l) to Z t using the recursion j(k, = q•(t)

^ . . . . ~i(g+0- q~;'(~)qb~+lllk+,)/45~(I,1)), ~(t-1)

(2.50)

where ~0(k,'~(~)'=-1, j = l , . . . , k , k ~ > l and 1/>1. Based on some consistency results of OLS estimates of autoregressive parameters for nonstationary and stationary ARMA(p, q) models in Tiao and Tsay (1983), they show that for k=p P

~(')(p)-->

~(p),

l ~ q,

(2.51)

where ~(l)(p)= (ci)l(p) . . . . . .

p(p): .

Now analogous to (2.26), the extended sample autocorrelation -function r ( k , 1)

100

G. C. Tiao

is defined as

r(k, 1)= q(Wk.,) ~ (o ,

(2.52)

where rl(lTd~t!,) is the lag l sample autocorrelation of the transformed series -~ (0

Wk, t =

( 1 - -a(0 tPl(kyB

,fi(0 r~k~7 ~k(k)'-" J~t

.....

(2.53)

Also, we may denote r(0, l ) = r(l) for the ordinary sample autocorrelations, and shall call r(k, l) the kth extended sample autocorrelation of lag I. Tsay and Tiao show that for the general A R M A ( p , q) model in (1.3), stationary or nonstationary, when k >/p e {c,

l=q+k-p,

(2.54)

r(k,l)--~ O, l > q + k - p . where Icl < 1.

Tentative model specification via E S A CF The asymptotic property of the E S A C F r(k, l) given by (2.54) can now be exploited to help tentatively identify A R M A ( p , q) models in practice. For this purpose, it is useful to arrange the r(k, l)'s in a two-way table as shown in Table 2.1 in which the first row gives the SACF, the second row gives the 1st E S A C F , and so on. The rows are numbers 0, 1, 2 , . . . to signify the A R order and the columns in a similar way for the M A order. To illustrate the use of the table, suppose the true model is an A R M A ( 1 , 2). For the SACF, it is well known that asymptotically r(0, l) ¢ 0 for l ~ 2. Now from (2.54) with p = 1 and q = 2, we see that (i) when k = 1, r(1, l) - 0 for 1/> 3, (ii) when k = 2, r(2, I) - 0 for 1 ~> 4 and so on. The full situation is shown in Table 2.2, where x denotes a nonzero value, 0 is zero and * means a value between - 1 and 1. T h e zero values are seen to form a triangle with boundaries given by the two lines k = 1 and l - k = 2. The row and column coordinates of the vertex correspond precisely to the A R and M A order, respectively.

Table 2.1 The ESACF table ~,.,. M A R~ _

MA

0

1

2

3

r(O, 1) r(1, 1) r(2, 1) r(3, 1)

r(0,2) r(1, 2) r(2, 2) r(3, 2)

r(O, 3) r(1, 3) r(2, 3) r(3, 3)

r(0,4) r(1, 4) r(2, 4) r(3, 4)

\

0 1 2 3

A R M A models, intervention problems and outlier detection

101

Table 2.2 T h e asymptotic E S A C F table for an A R M A (1.2) model where x denotes a nonzero value and * denotes a value between - 1 and 1

A R ~

MA

0 1 2 3 4

0

1

2

3

4

5

6

7

* * * * *

X X X X X

X 0 X X X

X 0 0 X X

X 0 0 0 X

X 0 0 0 0

X 0 0 0 0

X 0 0 0 0

In general, we are thus led to search from the E S A C F table the vertex of a triangle of asymptotic 'zero' values having boundary lines k = c1> 0 and l - k = c 2 > 0 , and tentatively identify p - - c 1 and q = c 2 as the order of the A R M A model. In practice, for finite samples, the r(k,/)'s will not be zero. The asymptotic variance of the r(k,/)'s can be approximately obtained by using Bartlett's formula. As a crude but simple approximation, we may use the value (n - k - l) -1 on the hypothesis that the transformed series lYC(~!tis white noise to estimate the variance of r(k, l). Of course, it is understood that this simple approximation might underestimate the variance of r(l, k) and a further study of this subject is needed in the future. As a preliminary but informative guide for model specification, the E S A C F table may be supplemented by an analogous table consisting of indicator symbols x denoting values greater or less than -+2 standard deviations and 0 for in between values.

2.3.2. Estimation Once the order (p,q) of the model (1.3) is tentatively specified, the parameters (C, ~1 . . . . . @p, 01,. • . , Oq, tr 2) can now be estimated by maximizing the corresponding likelihood function. An extensive literature exists on properties of the likelihood function, various simplifying approximations to this function, and asymptotic properties of the associated maximum likelihood estimates (see e.g. Anderson, 1971; Newbold, 1974; Fullerl 1976; Ljung and Box, 1979). In what follows, we consider two useful approximations, the first of which has been called the 'conditional likelihood function' proposed by Box and Jenkins (1970) and the second, the 'exact likelihood function' by Hillmer and Tiao (1979). With n observations Z = ( Z 1. . . . . Zn)' from the model (1.3) and assuming m ~ 1, consider the transformed vector W = ( W x. . . . . IV,)', where

W : D~)Z,

(2.55)

with D ~ ) a n x n matrix analogous to D ~ ) in (2.35). Now partitioning W ' =

q~ C. T i a o

102

. W(2)), . . where . (Wo), W ( O - ( W 1 , . . . , Wp) and Wi2)= write the joint distribution of W as

(Wp+l,

.. . , W,), we can

(2.56)

p( W ) = p ( w m l W~2))p( W~2)) .

Both the 'conditional' and the 'exact' likelihood approaches are based on the distribution p(W(2)) by ignoring p(W(l) I W(2)); and it can in fact be shown that, for moderately large n, the parameter estimates are little affected by p(W(I) IW(2)). Now from (1.3) and (2.55), the probabilistic structure of W(2) is given by q

W t - C - ~ , Oia, i + a,

(2.57)

t = p + 1. . . . , n .

i=1

The 'conditional' approach assumes that ap = case, the likelihood function can be written as

(

10(C, ~, 0, ~r2 I Z) oc o-; ("-p) exp - ~

ap_ 1 . . . .

1 Z°

ap_q+1 = 0. In this

)

a2 ,

(2.58)

O'a t = p + l

where for given parameter values of (C, ~, O) the at's are recursively calculated from p

q

(2.59)

a, = Z, - C - Z ebZ,-i + ~, O,a,-i. i=1

i=1

Standard nonlinear least squares methods can now be employed to obtain estimates (C, q~, 0) minimizing the sum of squares in the exponent of (2.58). That is,

(2.60)

S(C, 4~, O)= min S(C, ~ , O),

where S(C, ~, 0 ) = Y'",=p+l a,.2 Also, the corresponding maximum likelihood estimate of ~r2a is 1

d-2a= - S ( C , ~, 0) o

(2.61)

n

In the 'exact' approach, the assumption at, ap_q+1 = 0 is not made, and after some algebraic reduction it can be shown that the likelihood function is . . . . . .

l(c, ~, o, o-]lZ) o~ o-X~"-,~)l~l-laexp(

1 =p~q+l "~ d~) . 2~r2

(2.62)

ARMA

models, intervention problems and outlier detection

In (2.62), for t = p + 1 . . . . .

n

p

d, = z, - c -

103

q

(2.63)

Z 4,,z,, + Z 0,a, ,, i=1

i=1

and for t = p - q + 1 . . . . . p the vector d ,

=

(ap-q+l,

is given by

- - - , i~p)'

(2.63a)

~i, = 22 - I R ' M ' a ,

w h e r e / ) = Iq + R ' M 'MR, -

1

°

."

.

.

71"1 "

E°q.......°i1

".. "

"



.

7"gn,_l . . . . . . .

°

"

"1

7"i'n,_q

n' = n - p, lq is a q x q identity matrix, the 7r~'s satisfy the relation (1 ~- 7 r ~ B 0qB q) = 1, and a = (ap+l . . . . . a , ) ' the elements of Ir~B 2 . . . . )(1 - 01B . . . . . which are given by (2.59). For a detailed derivation of (2.62), see Hillmer and Tiao (1979). T o obtain the m a x i m u m likelihood estimates of the p a r a m e t e r s in (2.62), we see that the c o n c e n t r a t e d likelihood of (C, q~, 0) is n

max l(C, ~, O, or] ] Z) ~ O'a

/~, t=

-

,

(2.64)

+1

where/)t = l~'~[l/2(n-P)~lt"T h u s standard nonlinear routines can be used to obtain estimates (C, ~ , 0) minimizing the sum of squares n

s*(c,., o ) =

Z

b,~

(2.65)

t = p - q + l

and the c o r r e s p o n d i n g m a x i m u m likelihood estimate of O"~2a 1 ^2 a O"

__

n-p

~-l/(n P)S*(C, 4}, 0).

is

(2.66)

it is clear f r o m (2.59), (2.63) and (2.63a) that the exact a p p r o a c h is c o r n putationally m o r e b u r d e n s o m e , but it can appreciably r e d u c e the biases in estimating the moving average p a r a m e t e r s 0 associated with the conditiona~ approach, especially w h e n some of the zeros of O(B) are near or on the uni~.

104

G. C. Tiao

circle. In practice, one uses the conditional approach in the initial phases of the iterative modeling process and switches to the exact methods towards the end. 2.3.3. Diagnostic checking

Once the parameters of the tentatively specified model are obtained, it is important to perform various diagnostic checks on the fitted model to determine if it is indeed adequate in representing the time series being studied. Methods for detecting model inadequacies are primarily based on the residuals P

q

at:Zt-d-~l~tZt-i-~at i=1

i, t : p + l

.... ,n,

(2.67)

i=1

from the fitted model. Useful tools include plotting of residuals against time to spot outliers (see later discussion in Subsection 3.3) and changes in level and variability, and studying the sample autocorrelation function rn(1) of the residuals to determine if it is consonant with that of a white noise process. A 'portmenteau' criterion originally proposed by Box and Pierce (1970) and later modified by Ljung and Box (1978) is given by m

O = n(n + 2) ~ (n - l)-lr](l).

(2.68)

1=1

On the hypothesis that the Zt's are generated from a stationary ARMA(p, q) model, then O in (2.68) obtained from the residuals will be approximately distributed as X2 with m - (p + q) degrees of freedom. It should be noted that in practice when serious inadequacy occurs, patterns of the individual ra(/)'s often provide useful information about directions to modify the tentatively specified model.

3. Transfer function models, intervention analysis and outlier detection

In this section, we discuss some properties of the transfer function model in (1.6) with special emphasis on its application to intervention analysis and outlier detection problems. In general, the input variables X#'s can be deterministic or stochastic. When the X#'s themselves are stochastic and follow Gaussian ARMA models, Box and Jenkins (1970) have proposed a modeling procedure which specifically deals with the case of one input variable. AIthough their procedure can in principle be extended to the case of several stochastically independent input variables, it becomes cumbersome to apply and an alternative method via vector ARMA models has been suggested (see Tiao and Box, 1981). In what follows, we shall confine our discussion to deterministic inputs.

A R M A models, intervention problems and outlier detection

105

3.1. Intervention problems In the analysis of economic and environmental time series data, it is frequently of interest to determine the effects of exogenous interventions such as a change in fiscal policy or the implementation of a certain pollution control measures that occurred at some known time points. Standard statistical procedures such as the t-test of mean difference before and after the intervention are often not appropriate because of (i) the dynamic characteristics of the intervention, and (ii) the existence of serial dependence in the observations. It is shown in Box and Tiao (1975) that a transfer function of the form (1.6) can be employed to study the effect of interventions. Specifically, suppose we wish to estimate simultaneously the effects of J interventions on an output series Yt, we may make X# indicator variables taking the values 1 and 0 to denote the occurrences and nonoccurrences of exogenous interventions and use 8~I(B)coj(B)B bj to model the dynamic effects on the output, where 8j(B) = 1 - 6liB . . . . .

6rfiB rj,

co(B)

= cooj - colj B . . . . .

cosj

s]

(3.1) and bj is a nonnegative integer representing the delay or 'dead time'. The variables X# can assume the form of a step function X# = S(~~) or a pulse function Xjt = -tP(rJ), where

S~rJ)=

0, 1,

tTj, and

{1, p~r,)= 0,

t=~, tCTj,

(3.2)

and note that (1 - B)S~ r) = p~r). Fig. 3.1 shows the response to a step and a pulse input for various transfer functions models of practical interest. Specifically, for a step change in input, (a) shows a step response with one-period delay; (b) shows the more common situation of a 'first-order' dynamic response and the steady state gain (eventual effect) is measured by w/(1 - 6); and (c) represents the situation when 6 = 1 in which the step change in the input produces a 'ramp' response or trend in the output. For a pulse input, (d) shows the situation in which the pulse input (e.g. a promotion campaign) has only a transient effect on the output (sales) with col measuring the initial increase and 6 the rate of decline; (e) represents the situation that apart from the transient effect, the possibility is entertained that a residual gain (or loss) 0)2 in the output persists, and finally (f) shows the situation of an immediate positive response to be followed by a decay and possibly a permanent residual effect. The last figure might represent the dynamic response of sales to a price increase. A positive coo would represent an immediate rush of buying when a prospective price change was announced at time T, the initial reduction in sales which occurred at time T + 1 when the price increase took effect would be measured by o) I + o)2 and the final effect of the price change would be represented by 0)2.

G. C. Tiao

106

I~_~ sIT) e ,

~ - ~ STEP

~

Pt(T' PULS~

's(s, st(T,

e(B) ~(B--~ piT)

[o1

Ill

_ _

+

II

w2

(hi

,'4

~e)

J (c)

P

- -

.... If)

%

r~P,

~t . . . . .

Fig. 3.1. R e s p o n s e s to a step and a p u l s e input.

Obviously, these dynamic transfer models may be readily extended to represent many situations of potential interest, and intervention extending over several time periods can be represented by indicator variables other than the pulse or the step functions.

3.2. Model building In practice, one needs to tentatively specify both the dynamic models

8il(B)o~j(B)B bj and an ARMA(p, q) model for the noise term N, in (1.6). Parsimonious dynamic models are usually postulated to represent the expected effects of interventions. For tentative specification of a model for the noise term Nt, there are several possible alternatives. One may apply the identification procedures discussed earlier in Subsection 2.3 to data prior to the occurrences of the interventions if a sufficiently large number of such observations are available. One may apply these procedures to the entire data set when the effects of the interventions are expected to be transient in nature. Finally, one may first estimate the impulse responses ~'~h l = 1 , . . . , m, for a

ARMA

models, intervention problems and outlier detection

107

suitably large m, where I.~j(B ) = FoJ q._ Pl.1B .jr_... _[_ 1]m.iB m ._t. 6fl(B)wj(B)Bb,,

by ordinary least squares, and then apply the identification procedures to the residuals Yt - Y'~=t ui(B)X# • Once a model of the form (1.6) is tentatively specified, we can then estimate the intervention parameters and parameters in the noise model for N t simultaneously via maximum likelihood. Specifically, write J

Yt = C + Z ujt + dP-I(B)O(B)a,,

(3.3)

j=l where 8j(B)U i, = ~oj(B)BbJXj, so that for given values of the parameters in 3j(B) and ~oj(B) the Uj, s carl be recursively calculated from the Xj,'s; we may then compute the at's recursively from q 0 ( B ) ( Y t - C - Z ] = l Ujt) = O(B)at and apply nonlinear least squares methods to estimate all the parameters involved. Finally, diagnostic checks can be performed on the residuals to assess the adequacy of the model fit and to search for directions of improvement, if needed.

3.3. Detection of outliers in time series In the above application of the transfer function model (1.6), the time points of occurrence of the interventions are supposed known. We now discuss a variant of the methods for handling situations in which the timings Tj's of the exogenous interventions are unknown and the effects lead to what may be called aberrant observations or outliers. We summarize the results on outliers detection in time series of Chang and Tiao (1983), following earlier work by Fox (1972).

Additive and innovational outliers Let {Yt} be the observable time series. We shall concentrate on two types of outliers, additive and innovational. An additive outlier (AO) is defined as

YI

Nt

+

.~(,o)

(3.4)

while an innovational outlier (10) is defined as

v,

N, +

O(B)

where

~:(t0)_{l' t=t 0, t

-

O,

t#to,

(3.5)

G. (2 Tiao

108

and N t follows the m o d e l (1.3). In terms of the a,'s in (1.3) with C = 0, we have that

o(B) Y, - 4)(B) at + 0)¢70)

(AO)

(3.6)

and (Io)

, + Y, - O(B) ta,

~(,0),) .

(3.7)

Thus, the A O case m a y be called a 'gross e r r o r ' model, since only the level of the t0th observation is affected. On the o t h e r hand, an I O r e p r e s e n t s an e x t r a o r d i n a r y shock at t o influencing Z~, Z~+ 1. . . . through the m e m o r y of the system described by O(B)/q)(B).

Estimation of o~ when to is known T o m o t i v a t e the situation when to known. Defining ( 1 - ~ ' B 1 - 7rB~ . . . . (AO)

p r o c e d u r e s for the detection of A O and IO, we discuss the and all time series p a r a m e t e r s in the m o d e l (1.3) are the residuals e, = r r ( B ) Y , where 7 r ( B ) = 4)(B)/O(B)= ), we have that e, : w~(B)~(t'°)+ a,.

and

(3.8) (IO)

e, : ~o~(,'°)+ a,.

F r o m least squares theory, estimators of the impact w of tile intervention and the variances of these estimators are (AO)

~ba=p27r(F)e,o,

Var(~A) =

p 2o" a;

and

(3.9) (IO)

a3, = e~,

Var(wl) = (r2a,

w h e r e F = B -~, p 2 = (1 + 7r~ + rr~ + - • ,)-1. Thus, the best estimate of the effect of an I O at time t o is the residual et0, while the best estimate for the effect for an A O is a linear c o m b i n a t i o n of e~, e,~+l. . . . with weights d e p e n d i n g on the structure of the time series model. N o t e that the variance of o5A can be much smaller than ~r] If desired, one m a y p e r f o r m various tests a m o n g the h y p o t h e s e s Ho: Hi: H2:

Yt0 is neither an I O nor an A O , Y'0 is an IO, Y~is an A O .

T h e likelihood ratio test statistics for I O and A O are H1 vs./40

~.1 = @/o~.

ARMA

models, intervention problems a n d outlier detection

109

and H 2 vs. H o A 2 = ff)Ai(po'a). On the null hypothesis/40, ,~ and

/~2 a r e

both distributed as N(0, 1).

Detection of outliers In practice, t o as well as the time series p a r a m e t e r s are all unknown. If only to is unknown, one may proceed by calculating A1 and A2 for each t, denoted by Att and Az, and then m a k e decisions based on the sampling properties given above. The time series p a r a m e t e r s (q~'s, O's, and O'a) are also unknown, and it can be shown that the estimates of these p a r a m e t e r s can be seriously biased by the existence of outliers. In particular, ~ra will tend to be overestimated. These considerations have led to the following iterative procedure to handle a situation in which there may exist an unknown n u m b e r of A O or I O outliers. (i) Model the series Yt by supposing that there are no outliers Yt (i.e. Yt = Nt) and from the estimated model compute the residuals

e, =

e(B)Y,.

Let ^2 =

^2

O"a

e t /~l t = l

2 be the initial estimate of cra. (ii) C o m p u t e £i, i = 1, 2 and t = 1 , . . . , n, these being Alt and A2t with the estimated model. Let 1£`01= max, maxi[12,fl. If 1£`01= I,(1`01> c, where c is a predetermined positive constant usually taken to be some value between 3 and 4, then there is the possibility of an I O at to and the best estimate of o) is o.]1` 0. Eliminate the effect of this possible I O by defining a new residual Yt° = ~t0- ~b~t° = 0. If, on the other hand, ]£J = 1£2,01> c, then there is the possibility of an A O at to, and the best estimate of its effect is o3at¢ T h e effect of this A O can be removed by defining the new residuals et = e t - WAtorrtD)gt ^ ,m,.(to), t ~> t0. A new estimate or, - 2 is c o m p u t e d from the modified residuals. (iii) R e c o m p u t e £1t and £2t based on the same initial p a r a m e t e r estimates of the ~ ' s and 0's but using the modified residuals and 52a, and repeat the process (ii). (iv) W h e n no more outliers are found in (iii), suppose that J outliers (either I O or A O ) have been tentatively identified at times t~. . . . . b. Treat these times as if they are known, and estimate the outlier p a r a m e t e r s o21. . . . , ~oj and the time series p a r a m e t e r s simultaneously using models of the form

J O(B) Y, = • wjLj (B)~(,") + - a, j=l 4)(B) '

(3.10)

where L j ( B ) = 1 for an A O and L j ( B ) = O(B)/CP(B) for an I O at t = tf The

G. C. Tiao

110

new residuals are J

~}1)= 7r0/(B)[ Yt- ~'~ °JjL, (B)sC(t'P] •

(3.11)

./=t

The entire process is repeated until all outliers are identified and their effects simultaneously estimated. The above procedure is easy to implement since very few modifications to existing software capable of dealing with A R M A and transfer function models are needed to carry out the required computations. Based on simulation studies, the performance of this procedure for estimating the autoregressive coefficient of a simple AR(1) model compares favorably with the robust estimation procedure proposed by Denby and Martin (1979) and Martin (1980). While the latter procedures cover only the AR case, our iterative procedure can be used for any ARMA model.

4. Illustrative examples In this section, we illustrate the ARMA modeling, intervention analysis and outlier detection procedures discussed in the preceding sections by two actual examples.

4.1. Gas data We here apply the ARMA modeling and outlier detection procedures to the Gas data given in Box and Jenkins (1970). The data consist of 296 observations taken at 9 second intervals on input gas feed rate from a gas furnace. Fig. 4.1 shows a plot of the series. The sample mean Z and sample variance s 2 = (n - 1)-1E (Z t - 5 ) 2 are, respectively, Z = -0.0568 and s 2 = 1.147.

Model specification Tables 4.1a, 4.1b, 4.1c and 4.1d give, respectively, the SACF, SPACF, ESACF and the simplified ESCAF for this example. Note that (i) the estimated standard errors of SACF are computed using Bartlett's formula (2.43), (ii) those for the SPACF are obtained by assuming that the series is white noise and (iii) the indicator symbol x is used in the simplified ESACF table when Ir(k, l)l > 2(n - k - l) -1/2. The SPACF suggests that an AR(3) model might be appropriate. On the other hand, an alternative ARMA(2, 3) model is suggested by the ESACF. The AR(3) model was used by Box and Jenkins; but we have found that the ARMA(2, 3) model gives a slightly better fit, and shall proceed with this model.

Estimation and diagnostic checking Employing the 'exact' likelihood approach discussed in Subsection 2.3.2, the

A R M A models, intervention problems' and outlier detection

111

2

0

-i

-2

0

40

80

120

180

200

240

280

t

Fig. 4.1. Gas data.

estimation results c o r r e s p o n d i n g to an A R M A ( 2 , 3) m o d e l are 1.29B + 0.43B2)Zt = --0.0082 + (1 + 0.63B + 0.50B 2 + 0.36B2)a, ~

(1 -

(0.10)

(0.09)

(0.03)

(0.10)

(0.09)

(0.07)

(4.1) ~2 ~r a = 0.0341, where the values in the parentheses are the estimated standard errors of the p a r a m e t e r estimates. Table 4.2 gives the S A C F of the residuals f r o m the fitted m o d e l (4.1). T h e

Table 4.1a Sample autocorrelation function--gas data l

e(l) S.E.

1

2

3

4

5

6

7

8

9

10

11

12

95 0.06

0.83 0.10

0.68 0.12

0.53 0.13

0.41 0.14

0.32 0.14

0.26 0.15

0.23 0.15

21 0.15

0.21 0.15

0.20 0.15

0A9 0.15

Table 4.1b Sample partial autocorrelation function---gas data l

p(l) S.E.

1

2

3

4

5

6

7

8

9

10

11

12

0.95 0.06

-0.79 0.06

0.34 0.06

0.12 0.06

0.06 0.06

-0.11 0.06

0.05 0.06

0.10 0.06

0.02 0.06

- 0.07 0.06

-0.09 0.06

0.04 0.06

G . C . Tiao

112

Table 4.1c E x t e n d e d sample autocorrelation f u n c t i o n - - g a s data

0

1

2

3

4

5

6

7

8

0.95 0.78 0.40 -0.32 -0.38 0.40 0.38

0.83 0.50 0.31 -0.02 -0.03 0.31 0.33

0.68 0.26 0.23 0.20 0.14 -0.07 -0.23

0.53 0.07 0.09 -0.20 -0.18 -0.19 -0.22

0.41 -0.06 -0.09 -0.09 -0.17 0.06 0.05

0.32 -0.14 -0,08 0.09 0.07 0.07 0.09

0.26 -0.18 -0.07 0.04 0.00 -0.00 -0.07

0.23 -0.18 -0.10 0.01 0.01 0.07 0.09

0.21 0.10 -0.10 -0.11 -0.09 -0.01 -0.02

Table 4.1d Simplified extended sample autocorrelation function---gas data

MAR~M]~

0

1

2

3

4

5

6

7

8

0

X

X

X

X

X

X

X

X

X

1

x

x

x

0

0

x

x

x

0

2 3 4 5

x x x

x 0 0

0 x x

x

x

x x x 0

x

0 0 x 0

0 0 0 0

0 0 0 0

0 0 0 0

0 0 0 0

6

x

x

x

x

0

0

0

0

0

associated values of However, inspection tence of a number detection procedure results are obtained:

the Q statistic indicate that the model seems adequate. of these residuals themselves indicates the possible exiso of outliers. Specifically, applying the iterative outlier discussed in Subsection 3.3 with c = 4.0, the following

~

43 6.62

Nature

AO

to

55 113 -5.95 -4.23 AO

AO

Simultaneous estimation of the effects of these three outliers and the time Table 4.2 Sample autocorrelation function of residuals--gas data l

1

2

3

4

5

6

7

8

9

10

11

12

ra(l)

0.02

-0.02

-0.02

0.02

-0.05

0.05

0.04

-0.02

-0.05

0.07

0.13

-0.06

S.E.

0.06

0.06

0.06

0.06

0.06

0,06

0.06

0.06

0.06

0.06

0.06

0.06

O

0.01

0.2

0.2

0.4

1.2

2.1

2.6

2.7

3.5

4.9

10.2

11.4

ARMA models, intervention problems and outlier detection

113

series m o d e l p a r a m e t e r s yields Z t = - 0 . 0 5 5 3 + 0 . 4 6 ~ 43) - 0.39~:~55)- 0.27~:~m) + Nt, (o.18) (0.05) (0.05) (0.05)

(4.2)

where (1 -

1.41B + 0.53B2)Nt = (1 + 0.81B + 0.45B 2 + 0.23B3)a, (0.09)

(0.09)

(0,10)

(0.12)

(0.08)

and 6-2a = 0.0227. C o m p a r i n g (4.2) with (4.1), it is seen that a substantial reduction in the estimated variance 6-2 of the at's occurs, f r o m 0.0341 to 0.0227, when the effects of these three A O ' s are taken into account. In addition, changes in the estimates of the autoregressive and moving average p a r a m e t e r s are also appreciable. W e note here that if the critical value c were set to be equal to 3, a few additional A O or I O would be identified. T h e effects on p a r a m e t e r estimates are, however, very slight and h e n c e they have not been included in the model. N o w it is readily verified that the zeros of the fitted autoregressive polynomial ( 1 - 1 . 4 1 B + 0.53B 2) in (4.2) are complex and lying outside the unit circle. This implies that the series is stationary. T h e estimated m e a n of the series is - 0 . 0 5 5 3 having an estimated standard error of 0.18 so that the m e a n is essentially zero. T h e estimated m o v i n g average p o l y n o m i a l ( 1 + 0 . 8 1 B + 0.45B2+ 0.23B 3) has o n e real zero and a pair of c o m p l e x zeros, all lying outside the unit circle. T h e c o m p l e x zeros in the autoregressive and moving average polynomials jointly explain the p s e u d o periodic b e h a v i o r exhibited by the series. Forecasts E m p l o y i n g (4.2) as the final m o d e l and treating the p a r a m e t e r estimates as T h e true values, Table 4.3 gives the forecasts Z r ( / ) of future observations Zr+t,

Table 4.3 Forecasts of future observations--gas data (T = 296) Lead time

l Z,r(l) S.E.(eT (l))

1 -0.248 0.151

2 --0.192 0.367

3 -0.122 0.588

4 -0.076 0.775

5 -0.049 0.905

Lead time

l Zr(l) S.E.(er (l))

6 -0.036 0.986

7 -0.031 1.031

8 -0.031 1.054

9 -0.035 1.064

10 -0.039 1.068

Lead time

l ZT(I) S.E.(er(l))

11 -0.043 1.070

12 -0.047 1.070

13 -0.050 1.070

14 -0.052 1.070

15 -0.054 1.070

Lead time

l Zr(l) S.E.(er(l))

16 -0.055 1.070

17 -0.055 1.070

18 -0.056 1.070

19 -0.056 1.070

20 -0.056 1.070

G. C. Tiao

114

l = 1 . . . . . 20, made at T = 296, the end of the data period. It is seen that as l increases, ZT(I) gradually approaches -0.0553, the estimated m e a n of the series. Also, the estimated standard error of the forecast error eT(l ) increases from 0.151 = 6-a for l -- 1 to 1.070 for l = 20 which is essentially the estimated standard deviation of the series. T h e seven-fold increase, from 0.1517 to 1.070, in the standard errors of forecasts shows that, although the series is stationary, substantial i m p r o v e m e n t in the accuracy of short-term forecasts is possible when past values of the series are utilized instead of relying solely on the mean level of the series. It is noted that all the computations involved in this example are p e r f o r m e d using the package developed by Liu et al. (1983).

4.2. Ozone data T o illustrate the intervention analysis techniques, we turn to consider the ozone data shown in Fig. 4.2 analyzed earlier by Tiao et al. (1975) and Box and Tiao (1975). The data consist of monthly averages of ozone level in downtown Los Angeles from January 1955 to D e c e m b e r 1972. Two interventions 11 a n d / 2 of potential m a j o r importance are: 11:

/2:

In early 1960 the opening of a new freeway in Los Angeles which altered the traffic pattern and the inception of a new law (Rule 63) which reduced the proportion of reactive hydrocarbons in the gasoline sold locally. F r o m 1966 onward, regulations required engine design changes in new automobiles which would be expected to reduce the emission of nitrogen oxides and hydrocarbons which are the primary components in the formation of ozone through photochemical reaction.

10

i

J

o __]__L~A___.I 24

48

!

L~[___I 72

l

98

1 120

l

I _L_ L__]----]--L 144

168

192

218

t

Fig. 4.2. Monthly averages of ozone at downtown Los Angeles (January 1955-December 1972).

ARMA models, intervention problems and outlier detection

115

The first intervention 11 was expected to produce a step change in the ozone level at the beginning of 1960. As for 12, the engine changes were expected to reduce the formation of ozone. Now in the absence of information on the proportion of cars with new design changes in the car population over time, we might represent the possible effect of I 2 as an annual trend reflecting the effect of the increased proportion of 'new design vehicles' in the population. As explained more fully in Tiao et al. (1975), because of the differences in meteorological conditions between the summer months and the winter months, the effect of I 2 would be different in these two seasons. The above considerations have led to the following model for the monthly ozone observations Y, Yt -- (001Xl, if- ( 0 0 2 ( 1 - B 1 2 ) - l x 2 t + ( 0 o 3 ( 1 - B 1 2 ) - l x 3 t + N t ,

where Nit ~-

.~(T)

{1, {1, X3t= O, ~

X2t~

T = January 1960,

t

O,

(4.3)

'summer' months J u n e - O c t o b e r beginning 1966, otherwise, 'winter' months N o v e m b e r - M a y beginning 1966, otherwise,

and N t is the noise term. Inspection of the SACF of Y, and that of the seasonally differenced series ( 1 - B ~ 2 ) Y t leads to the tentative model for the noise term Nt (1

- -

B12)Nt

:

(1 - O~B)(1 - OzB~2)a,.

(4.4)

The models (4.3) and (4.4) allow for (i) a step change in the level of ozone of size (001 associated with 11, (ii) progressive yearly increment in ozone level of sizes o)02 and (003, respectively, for the summer and the winter months associated with 12, and (iii) seasonal and nonseasonal serial relationship in the data. Employing the estimation procedure described in Subsection 3.2, we have obtained the following fitting results: Parameter

Estimate

S.E,

(901 (002 (003

-1.34 -0.24 -0.10 -0.27 0.78 0.62

0.19 0.06 0.05 0.07 0.04

01 02 o-.2

Examination of the residuals shows that the model seems adequate for this data set. Thus, there is evidence to support the following:

116

G. C. Tiao

(i) associated level of ozone; (ii) associated d a t a period, the m o n t h s , but the

with 11 is a step change of a p p r o x i m a t e l y o501= - 1 . 3 4 in the with I 2 there is a progressive reduction in ozone. O v e r the yearly i n c r e m e n t is e s t i m a t e d at &02 = - 0 . 2 4 for the s u m m e r i n c r e m e n t in the winter is slight.

5. S o m e a s p e c t s of v e c t o r A R M A

models

M u c h of the p r o p e r t i e s of the univariate A R M A m o d e l (1.3) discussed in Section 2 can be generalized to the vector m o d e l (1.7). In particular, following the s a m e d e v e l o p m e n t leading to the q, form in (2.6) and the ~r f o r m in (2.10) and relabeling the Zt's, wt's and a t ' s as vectors, and the O's, q~'s, ~p's and ~-'s as matrices, we can write the vector m o d e l alternatively in the forms t-m

Zt:.,+

h=l

t-m

O:,_h +

Z

O h*w,-h

(5.1)

h~t-(m+r)+ l

and t-m

t-m

Zt = Z arhZt-h -h=l

Z

rr h-w, ,, + a t .

(5.2)

h=t-(m+r)+l

It is clear from (5.2) that every e l e m e n t of Z t in general is related to all the e l e m e n t s of Zt-j, j = 1, 2 , . . . , so that there can be f e e d b a c k relationships a m o n g all the k c o m p o n e n t series {Z1¢}, i = 1 , . . . , k. H o w e v e r , if the c o m p o n e n t s of Zt can be arranged such that the matrices ~ ' s and O's are all lower triangular, then so will be the ~-'s and (1.7) will imply an unidirectional relationship a m o n g the series. T o illustrate, consider the vector A R M A ( 1 , 0) m o d e l with k = 2 and C = 0. S u p p o s e ~ is lower triangular so that we can write

[ 1 - (~)llB -(iD21B

][Z12:] = [alt ] . 1 - qO22B

(5.3)

ka2tJ

Letting a2, = /~al, + gt where e t and a u are i n d e p e n d e n t , we carl express (5.3) as ( 1 - : P n B ) Z 1 , - al, , ,-

w o - o)IB

1 - ~22 B

ZI, + (1 - ~ 2 2 B ) - l ~ ; t ,

(5.4a) (5.4b)

w h e r e w 0 =/3 and w 1 = (I)21 - A[~I~ll. T h u s ZI, will only d e p e n d on its o w n past, but Z2, will d e p e n d on its own past as well as the p r e s e n t and past of Z w In this case, Z1, can be r e g a r d e d as the 'input' and Z2t the ' o u t p u t ' . E x p r e s s i o n (5.4b) is of the s a m e f o r m as (1.6) with a single stochastic input. M o r e generally, an undirectional relationship m a y exist b e t w e e n subsets of the c o m p o n e n t s

A R M A models, intervention problems and outlier detection

117

of Zt while feedbacks are allowed within each subset. This occurs when the @'s and O's are lower block triangular. The important thing to note is that vector A R M A models cover both undirectional and feedback relationship. Model building procedures discussed in Subsection 2.3 can also be extended to cover the vector case. For a discussion of the various modeling techniques, see Quenouille (1957), Hannan (1970), Tiao and Box (1981), and Tiao and Tsay (1983).

References Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley, New York. Bartlett, M. S. (1964), On the theoretical specification of sampling properties of autocorrelated time series. J. Roy. Statist. Soc. 8, 27. Box, G. E. P. and Jenkins, G. M. (19"70), Time Series Analysis l,brecasting and Control. HoldenDay, San Francisco, CA. Box, G. E. P. and Pierce, D. A. (1970). Distribution of residual autocorrelations in autoregressiveintegrated moving average time series models. J. Amer. Statist. Assoc. 65, 1509-1526. Box, G. E. P. and Tiao, G. C. (1975). Intervention analysis with application to economic and environmental problems. J. Amer. Statist. Assoc. 70, 70-79. Chang, I. and Tiao, G. C. (1983). Estimation of time series parameters in the presence of outliers. Technical Report No. 8. Statistics Research Center, Graduate School of Business, University of Chicago (to appear in Technometrics). Denby, L. and Martin, R. D. (1979). Robust estimation of the first order autoregressive parameters. J. Amer. Statist. Assoc. 74, 140--146. Fox, A. J. (1972). Outliers in time series. J. Roy. Statist. Soc. Set. B 43, 350-363. Fuller, W. A. (1976). Introduction to Time Series Analysis. Wiley, New York. Gray, H. L., Kelly, G. D. and Mclntire, D. D. (1978). A new approach to ARMA modeling. Comm. Statist. B7, 1-77. Hannah, E. J. (1970). Multiple Time Series. Wiley, New York. Harvey, A. C. (1981). Finite sample prediction and overdifferencing. J. Time Set. Anal. 2, 221-232, Hillmer, S. C. and Tiao, G. C. (1979). Likelihood function of stationary multiple autoregressive moving average models. J. Amer. Statist. Assoc. 74, 652--660. Liu, L. M., Hudak, G. B., Box, G. E. P., Muller, M. E. and Tiao, G. C. (1983). The SCA System for Univariate-Multivariate Time Series and General Statistical Analysis. DeKalb: Scientific Computing Associates. Ljung, G. M. and Box, G. E. P. (1978). On a measure of lack of fit in time series models. Biometrika 65, 297-304. Ljung, G. M. and Box, G. E. P. (1979). The likelihood function of stationary autoregressive-moving average models. Biometrika 66, 265-270. Martin, R. D. (1980). Robust estimation of autoregressive models In: D. R. Brillinger and G. C. Tiao, eds., Direction in Time Series. Institute of Mathematical Statistics, Hayward, CA. Newbold, P. (1974). The exact likelihood function for a mixed autoregressive-moving average models. Biometrika 61, 423--426. Quenouille, M. H. (1957). The Analysis of Multiple Time Series. Griffin, London. Slutsky, E. (1937). The summation of random causes as the source of cyclic processes. Econometrica 5, 105-146. Fiao, G. C. and Box, G. E. P. (1981). Modeling multiple time series with applications. J. Amer. Statist. Assoc. 76, 802.-816. Tiao G. C. and Tsay, R. S. (1983). Consistency properties of least squares estimates of autoregres.sire parameters in A R M A models. Ann. Statist. 11, 856-871.

118

G. C. Tiao

Tiao, G. C. Box, G. E. P. and Hamming, W. J. (1975). Analysis of Los Angeles photochemical smog data: a statistical overview. J. Air Pollution Control Assoc. 25, 260-265. Tsay, R. S. and Tiao, G. C. (1984). Consistent estimates of autoregressive parameters and extended sample autocorrelation function for stationary and nonstationary ARMA models. J. Amer. Statist. Assoc. 79, 84-96.

E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5 © Elsevier Science Publishers B.V. (1985) 11%155

4

Robustness in Time Series and Estimating A R M A Models R . D o u g l a s M a r t i n * a n d Victor J. Y o h a i t

1. Robustness concepts Three distinct probabilistic concepts of robustness have been developed in the context of point estimation based on independent and identically distributed (i.i.d.) observations (or error terms). These concepts are, in historical order of inception, efficiency robustness (Tukey, 1960), m i n - m a x robustness (Huber, 1964), and qualitative robustness (Hampel, 1968, 1971), the contribution of these notions being due to the authors cited. In addition there is Tukey's (1976) data-oriented counterpart of qualitative robustness known as resistance. Taking relative importance as the criterion, we would list the probability-based robustness concepts in the following order: qualitative robustness, efficiency robustness and rain-max robustness. Resistance is at the same relative level as qualitative robustness, but on the data-based side of things. Since we regard resistance and qualitative robustness as the most important concepts, the bulk of this section is devoted to qualitative robustness. A careful definition of resistance forms the base for a particularly transparent definition of qualitative robustness which turns out to be equivalent to Hampel's (1968, 1971) definition in the classic i.i.d, setting. In Subsections 1.1-1.3 below, we briefly define efficiency robustness, minmax robustness and resistance. We note in advance that there is no conceptual difficulty in applying these three concepts in the time-series setting. The situation is quite different with regard to qualitative robustness, where new technical issues arise in providing an adequate definition for time series, and the relevant details are given in Subsection 1.4. Some summary comments are provided in Subsection 1.5.

I.I. Efficiency robustness Let T. = T.(x 1, x z. . . . . x.) be an estimate of the scalar parameter 0 in the *Research supported by the Office of Naval Research under contract N00014-82-0062, and by National Science Foundation Grant SES80-15570. tResearch supported by the Office of Naval Research under contract N00014-82-0062. 119

R. D. Martin and V. J. Yohai

120

distribution P~ for x" = (x 1, x 2. . . . . X n ) , and let EFF(T,, P0) denote a suitably defined efficiency of T~ at P0. For example, we might have EFF(T,, P0) = VARe;(best known T,) VARe;(T,) ,

(1.1)

or we might have VcR(Po) EFF(T,, P 0) = VARpo(T,) ' n

(1.2)

where VcR(P0) is the Cramer-Rao lower bound at P0. When the focus is on asymptotic efficiencies (as it often is), tile estimate is denoted T, the measure for the process {x,},m is denoted Po, and the efficiency of T at F o is EFF(T, P o ) -

VcR(Po) V~(T)

1 i(Po)V~(T )'

(1.3)

where V=(T) is the asymptotic variance of ~ / n T , at Po, and i(Po) = lim,.= n-li(Po) is the asymptotic Fisher information for 0, i(Po) being the finite-sample Fisher information for 0. Let Po be the nominal distribution for the data (typically P~ is Gaussian), and let P~,I, Po,2, • • -, Po,K be a strategically selected set of distributions which are in some sense ~near' Po. Typically, the Po, i will have marginal or conditional distributions which are heavy-tailed deviations from normality, and hence give rise to outliers. Then an estimate T (or Tn) is said to be efficiency robust if T (or 7~) has high efficiency at Po, and also at P~,I,..., Po,K. High efficiency at Po will usually mean an efficiency in the range 90% to 95%. Of course, for estimates T, at finite samples we require only the appropriate marginal distributions Po, Po,1. . . . , P0,K. In the most frequently used situation where the x, are i.i.d., we need only the one-dimensional marginal measures Po, Po,x. . . . . Po,K" For estimating location in the i.i.d, setting, the sample mean is fully efficient (i,e. has efficiency 100%) at a Gaussian P0, but has low efficiencies at heavy tailed alternatives ('low' can mean zero, e.g. at Cauchy-tailed distributions). On the other hand, high-efficiency robustness can be obtained through the use of trimmed means (Tukey, 1960), or Huber's (1964) location M-estimates, defined in (1.6) and (1.7) below, to name just two of many possibilities. Efficiency robustness can be similarly defined for vector parameters by using an appropriate definition of multivariate efficiency.

1.2. M i n - m a x robustness Let V(T, P=) denote the asymptotic variance of an estimate T at distribution P=, and let T denote a large family of estimates, while W denotes a large family

Robustness in time series and estimating A R M A models

121

of distributions for the process {Xn}n~1. A min-max robust estimate T O shires the problem inf sup V(T, P=).

(1.4)

TET P~P~

The solution to this problem is usually obtained by solving the saddle-point problem, sup inf V(T, P~) = V(To, P o ) = inf sup V(T, P~). P~P~

TET

(1.5)

TET P~EP ~

Of course, for that most frequently treated case of i.i.d, processes {x,}n~ ~ with marginal distribution F, one would replace W by a family P of univariate distributions, and replace P= by a univariate distribution P in the above expressions. Huber's (1964) seminal work showed that for estimating location in the i.i.d. setting, the above problem is solved by a member of the class of M-estimates/2 defined by tl

min 2 P ( Y i - / x ) , P-

(1.6)

i=l

where p is symmetric and convex. Equivalently,/2 is a solution of n

o(y,-

= 0

(1.7)

i=1

with ~b = p'. The min-max estimate T Ocorresponds to a particular psi-function OH which is now often called 'Huber's psi-function'. The definition of OH is given by (2.22) in Section 2. For more general min-max theory and results see H u b e r (1981). 1.3. R e s i s t a n c e

To many statisticans a resistant estimate is one which is not unduly affected by a few outliers (Tukey, 1976). This definition has been refined somewhat in the following way (cf. Huber, 1981, Chap. 1.2): t An estimate Tn is called resistant if 'small' changes in the data result in only small changes in T~, where 'small changes' in the data means (i) large changes in a small fraction of the data, and/or (ii) small changes in all the data. The large changes in (i) correspond to outliers, while the small changes in (ii) correspond, for example, to rounding errors or grouping errors. The sample mean lacks resistance, whereas estimates such as trimmed means (with the median as a limiting case) and M-estimates are resistant.

R. D. Martin and V. J. Yohai

122

We would remark that while resistance is a beautifully transparent notion, which makes it eminently serviceable for applied scientists, it suffers from a small defect, which is that the definition is not very precise. This defect is remedied in the next subsection, where a precise definition of resistance is given. This definition turns out to yield a very transparent and useful definition of qualitative robustness. As a caveat, one should be aware of the fact that even with a careful definition, it is not completely trivial to verify resistance for implicitly defined estimates such as location and regression M-estimates.

1.4. Qualitative robustness Let x~. . . . . x, . . . . . be i.i.d, observations with values in a Polish space, i.e. a complete and separable metric space (X, d). In most cases, X is a Euclidean space with the usual metric. The following notation will be used. Let X ~ and X = be the Cartesian product of n copies of X and countable copies of X, respectively./J will denote the Borel o--field on X, and/3 ~, fl~ the corresponding product o--field on X" and X =. For any measurable space (/2, A), let P(g2) be the set of all the probability measures on A. I f / z and u are in P(O), then POx, u) denotes the class of all the probabilities P on ( ~ x ~ , A x A) with marginals/z and v. Given a probability P C P(X), P" and P~ will for the time being denote the corresponding product probabilities in P(X") and P(X=). If (X, d) is any metric space, the Prohorov distance 7rd between # and v in P(X) is defined by 7rd(/z, ~')--inf{e: /x(B) ~ u(V(B, e, d ) ) + e, VB C fl},

(1.8)

V(B, e, d)-- {x ~ X: d(x, B ) < e}.

(1.9)

where

Strassen (1965) proved that if (X, d) is a Polish space, then 7rd is alternatively given by "rrd~, ~,) = inf {e: 3 P C POx, u) such that P([d(x, x') >7 e]) ~< e},

(1.10)

where [d(x, x') ~> e] = {(x, x'): d(x, x') >1 e}. Let T,: X" ~ F, n I> n 0, be a sequence of estimates which arc invariant under permutation of coordinates, where the parameter space (F, 3') is also a Polish space (in most cases F is a Euclidean space). The reason for the appearance of n o is that often a minimum number n o of observations are required in order to define the estimate. Hampel (1968) introduced two definitions of qualitative robustness. The first definition is as follows: DEFINrrlON 1.1. The sequence {T,},~0 is qualitatively robust at P ~ P(X) if given e > 0, there exists 6 > 0 such that Vn 1> no, VQ C P(X), Try(P, (2)) < 6 rr~(L(T,, P"), L(T,, (2)")) < e, where L ( T n, pn) denotes the law of T, under pn,

Robustness in time series and estimatingARMA models

123

According to the Strassen characterization of the Prohorov distance, this definition of qualitative robustness requires, uniformly ir~ sample size n, that the distributions of the estimates do not change too much when there is a small change in the marginal distribution of the observations produced by one or both of the following: (a) A small fraction of observations with gross errors (outliers). (b) Small errors in all the observations (e.g. rounding or grouping errors). However, Definition 1.1 allows only for i.i.d, deviations from the central i.i.d. model P". In order to at least partially cover non-i.i.d, deviations, Hampel introduced the concept of qualitative ~r-robustness. We use the following notation. Let J~" be X n modulo a permutation of coordinates. Given x " = ( x t , . . . , x , ) ~ X ", denote by tx [x n] the empirical probability which assigns mass $ n 1/n to each point xi, l ~ i < ~ n . Given x ", y n in X", define dn(x ,y")= 7rd(/~[x.], ~[yn]). Finally, given P~ E P(Xn), let /5~ be the probability induced on DEFINITION 1.2. The sequence {Tn},~,0 is qualitatively ~v-robust at P C P(X) if given e > 0, there exists 6 > 0 such that

vn 1>no, vOn e

&) L(Tn, &))

Boente, Fraiman and Yohai (1982) proposed a new approach to qualitative robustness, based on the concept of resistance (see Tukey, 1976; Mosteller and Tukey, 1977). The basic idea is to require that the estimate change by only a small amount when the sample is changed by replacing a small fraction of observations by arbitrarily large outliers or by perturbing all the observations with small errors (e.g. round-off or grouping errors). This approach has the advantage that it may be applied without special assumptions on the probability model for the observations, e.g., they may be dependent or non-identically distributed. Moreover as we will see below, the new definitions are based on quite simple and transparent concepts. First we define a new distance d~ on X". Given x " = (xl . . . . . xn), y n = (Yl. . . . . Yn) in X ' , define d](x", yn) = inf{e: #{i: d(xi, y i ) ~ e } ~ ne}.

(1.11)

Therefore, two points of X n are close in the metric d + if all the coordinates except a small fraction are close. According to this notion of closeness, if the original sample is modified by replacing a fraction no greater than e of observations by arbitrary outliers, or if all the observations are perturbed by round-off errors smaller than e, then the original and modified samples have a distance smaller than e.

R. D. Martin and V. J. Yohai

124

Given x" E X" and 6 > 0 let

AT,(x", 6) = sup{[T.(y") - T,(z")[: d+~(y ", x ~) 0 , there exists 6 > 0 such that AT,(x", ~) < e,

Vn i> no.

From now on P " will denote any probability in P(X n) (not just a product probability) and similarly P= will denote any probability in P(X=), unless otherwise noted. The following definitions of strong and weak robustness were introduced by Boente, Fraiman and Yohai (1982), and represent an alternative to Hampel's definition of qualitative robustness. DEFINrrlON 1.4.

Let P~ E P(X~). {7~,},~.,,° is strongly robust at P~ if

P~([{T,},~ o is resistant at x ] ) = 1 .

(1.13)

DEFINITION 1.5. Let P= ~ P(X=). {T,},~,o is weakly robust at P~ if, given e > 0, there exists 6 > 0 such that P~([AxT~(x", ~ ) ~ e ] ) ~ 1 - E ,

Vn -~ n o .

(1.14)

Boente, Fraiman and Yohai (1982) proved the following relationships between (i) weak and strong robustness, and between (ii) both weak and strong robustness and Hampel's definition of qualitative 7r-robustness: THEOREM 1.1. Le¢ {T,}n~,~° be a sequence of estimates and P~ ~ P(X~). 7hen (i) Strong robustness implies weak robustness. (ii) I f {T,},~,,° are invariant under permutations of coordinates and P~ corresponds to an i.i.d, process, weak robustness, strong robustness and qualitative •r-robustness are equivalent. Papantoni-Kazakos and Gray (1979), Bustos (1981) and Cox (1981) also gave various definitions of qualitative robustness which hold for dependent processes and which are in the spirit of Hampel's approach. There are two such definitions which correspond to generalizations of Hampel's qualitative robustness and qualitative 7r-robustness respectively. DEFINITION 1.6.

Let p be a metric on P(X~), and P ~

P(X~). {~/~},~,~ is

Robustness in time series and estimating A R M A models

125

qualitatively p-robust at P~ if given e > 0, there exists ~ > 0 such that Vn>~no,

VQ=~P(X~),

p(P=,Q=) 0, there exists 6 > 0 such that Vn ~ no,

VQ n e P ( X ' ) ,

O,(P", Qn) ~ 0. In this case both GM-estimates and RA-estimates are neither resistant nor qualitatively robust. For example, let us consider the MA(1) model y, = u , - Ou t p The estimated residuals ~t(O) are given by

a,(O) = y, + Oy,_l + " " + O'-ly~,

(5.11)

and a single outlier y, at time t has influence on all £~t,(O) with t'>~ t. Thus, a small fraction of outliers may have a large effect on a large fraction of residuals. Just one large outlier in the first observation may have a large effect on all the observations. Therefore, since GM- and RA-estimates depend on the residuals fi~ they cannot be qualitatively robust. However, GM- and RAestimates are less sensitive to outliers than LS- and M-estimates. A Monte Carlo study (see Bustos and Yohai, 1983) shows that for the MA(1) model with additive outliers, the RA-estimates of the Mallows and Hampel type are more robust than LS- or M-estimates. This is especially true when $ is taken in the bisquare family given by (2.23). More theoretical support of the behavior of GM- or RA-estimates for the AR(1) and MA(1) models using a proper definition of influence function for time series may be found in Martin and Yohai (1984). The idea is briefly described in Section 7. In the next subsection we present another class of estimates which are qualitatively robust for ARMA(p, q) models with q > 0. 5.4. T r u n c a t e d R A - e s t i m a t e s

As we have seen in the preceding subsection, the failure of resistance and robustness for the RA-estimates of the MA(1) model is due to the fact that tk.< residuals fit(0) given by (5.11) depend upon all the present and past data. By the same type of reasoning, RA-estimates lack robustness for a n y genuir:e A R M A model (i.e. one with a moving-average component). In order to robustify these estimates we introduce the truncated ~csid,_;ak. ~;~ order k. In the MA(1) case, these are (~t,k(O)

= Yt +

OYt-i + " " " + OkYt-k .

It is easy to see that if 00 is the true parameter, then ~lt,k(O0) = gt-

ok+lut-k-1 •

Therefore, if F is symmetric and ,/(u, v) odd in each variable, we have EopT(~t+L~ (0°)

/~"k(0°))=0,

Vj ~ l , j ¢

k+l

(5.~2)

R. D. Martin and V. J. Yohai

140

Recall that an RA-estimate for the MA(1) model with mean zero is obtained as a solution of T-1

Z 0i-~%(O) : 0,

(5.13)

1=1

where 3)j is defined in (5.5). Define

%,k(o) = Z n

"Yj,k(O)

by

u,,

t=l

Then the k-TRA-estimates, introduced by Bustos and Yohai (1983), are defined by replacing ~i(0) by ~j.k(O) for j ¢ k + 1, and Yk+l(O) by Yk+l,k-l(O) in (5.13). Equation (5.12) implies that if rl is odd in each variable and F symmetric, the TRA-estimates are Fisher consistent. The extension of the TRA-estimates for any A R M A model may be found in Bustos and Yohai (1983). The k-TRA-estimates are asymptotically normal, but their asymptotic covariance matrix expression is quite complicated, and can be found in Bustos and Yohai (1983) and in Bustos, Fraiman and Yohai (1984). Since the residuals in a TRA-estimate depend on only a finite number of observations, a sufficient condition for resistance and qualitative robustness of the TRA-estimate, under general regularity conditions, is that "0 be bounded. As k increases, the corresponding TRA-estimate becomes more efficient under the nominal Gaussian model without outliers, but it becomes less robust with regard to bias and variability under a general contamination model of the type (3.1). Of course, in large samples the former is dominant and so we often focus on bias robustness. Therefore, the choice of k will depend on a trade-off between efficiency under the model, and bias robustness under a general contamination model. Monte Carlo results studying the performance trade-otis of the TRA-estimates may be found in Bustos and Yohai (1983).

6. Approximate maximum-likelihood type estimates One of several things learned from Huber's (1964) early work on robust estimation of location was that robust estimates can be obtained using maximum-likelihood estimates for suitably heavy-tailed distributions. Some caveats are in order here, e.g. densities whose MLE's for location are robust do not always produce robust estimates of scale, and we do not yet have an M L E rationale for the bounded-influence regression estimates studied by Krasker and Welsch (1982) and Huber (1983). Nonetheless, the non-Gaussian M L E rationale sometimes provides a convenient way of uncovering and understanding the structure of robust estimates. We have already seen in Subsections 2.4 and 3.1 that while 'simple' M.estimates

Robustness in time series and estimating ARMA models

141

can provide efficiency robustness for perfectly observed A R M A models, they are not resistant or robust toward general contamination models of the type (3.1). In this section we describe a class of estimates of A R M A model parameters which are motivated by maximum-likelihood estimates for the additive outliers type of contamination model described in Subsection 3.1, and which are resistant and robust. W e call these estimates approximate maximum-likelihood type estimates (AM-estimates) because of approximations involving the nonGaussian MLE.

6.1. Definition of A M estimates As before, let the parameter vector a ' = (~p', 0', o"2) represent the parameters of the x t process in the A O model,

y, = x, + v,,

(6.1)

where x t and v t are assumed to be independent, and the v t are i.i.d, with zero mean. Throughout, we shall presume that the Yt in (6.1) have mean /z = 0. W h e n / x is unknown, it may be estimated robustly and the estimate/2 can be used to form centered observations. W h e n / 2 is consistent, estimators based on the centered data typically behave asymptotically as if /x were known and exactly centered observations were used. Alternatively, an intercept term can be included in some of the equations to follow. The log likelihood for this model is T

log h(y r a ) : Z l o g h(y, [y'-~, a ) ,

(6.2)

t=l

where y ' - (Yl, Yz. . . . Yt)' is the vector of observations up to and including observation y,. The observation-prediction density h(y t [ y t - l , a ) is the conditional density of the observations Yt given yt-~, and h ( y l l y °, a) denotes the unconditional density h(yl ] a). Since x t and vt are by assumption independent, we can write

h(y, [y' l, a ) - f fx(y, - ~ l y '-1, a ) dF~(~),

(6.3)

where F v is the distribution function of the measurement error v~ and fx is the conditional prediction density function of x t given y' 1. W e shall refer to this density as the state-prediction density. Let Xt ~,

i =: E(x, l y '-1)

(6.4)

and

m, = E[(x,-- 2,--1):]y,-1]

(6.5)

142

R. D. Martin and V. J. Yohai

denote the conditional-mean predictor of x, given yt 1, and the conditionalmean-square error of prediction, respectively. Because of the assumptions concerning (6.1), we also have xtt-1 = ytt-I : E(y, I y ' - l ) ,

(6.6)

where 13',-1 is the conditional-mean predictor of Yt given yt-1. Because of (6.6) we shall use 21-1 and y't-1 interchangeably. Since we cannot actually compute the exact conditional m e a n s xtt -1= ytt-1 , we shall only require that the 21-1 or ))it-1 appearing in the remainder of the discussion have the same structure as the approximate conditional-mean estimates described in the last part of Subsection 6.3. We make an important simplifying assumption that fx may be well approximated by the form

s (x, i,' 1,

u

1

IX,-

'~',-'~

st-v-U),

(6.7)

for some fixed density f which is independent of the parameters a (for t = 1, the expectations are taken to be unconditional). Of course, m t = rG(a ) and 2,,-1= 2,-,(,,). Now, using (6.7) we can rewrite (6.3) as h(y, l y '-1, or) = gt(ut),

(6.8)

where u t = y t - 2 ' t -1 and the subscript t on the function g, indicates the dependence of g, on y,-l. In practice, we very rarely know the noise distribution F v in the tails with high accuracy. For the contaminated normal (CN) noise distribution F~ = (1 - y)N(0, 02) + yN(0, 02),

(6.9)

2 ~ O.2 and small y > 0, Martin (1979) gave some motivation for apwith 0"0 proximating gt by setting

g , ( u , ) = s,

where s, is defined below, and ttle density g is obtained by convolution, g = f*Fv.

(6.11)

Although the functional forms assumed in (6.7) and (6.10) are not good approximations for general non-Gaussian F v, we believe that the use of these forms when F v is nearly normal involves an approximation error that is small enough to be relatively inconsequential.

Robustness in time series and estimating ARMA models

143

The scale measure st in (6.10) represents the scale of the y-prediction residuals ut = Yt - 33t/1. Since the x-prediction residuals x t - 21-1 have as scale measures the quantity V'm-~,,and since Yt = x, + vo with v t independent of x,, it is reasonable to let s, = X/m,----~¢~

(6.12)

when F, at most deviates from a nominal N(0, o-2) distribution primarily in the tails, e.g. as in (6.9). Of course, when the errors v, are zero most of the time, so that P(v t = 0) = 1 - y by virtue of having o.~ = 0 in (6.9), with 3, not too large, then we have st =

X/~.

(6.13)

Using (6.10) and (6.11) we can rewrite (6.2) as T

T

l ° g h ( y r l ° z ) - - ~ l ° g s t + ~ ' ~ l °,=, gg(~

(6.14)

Now, it seems natural by analogy with Huber's (1964, 1981) M-estimates (maximum-likelihood type estimates) to replace - l o g g with a properly chosen symmetric function p. Thus, we propose to define approximate maximum° likelihood estimates (AM-estimates) as the value a that minimizes the following robustified loss function:

L ( a ) = Z log s,(a)+ Z P k s ~ a ) ) ' t=l

(6.15)

t=l

with the residuals u t - - u t ( a ) and scale values s t = st(a ) obtained from the approximate conditional mean type filter cleaners described in Subsection 6.3. The parameter vector a is included in (6.15) to indicate explicitly the dependence of s,(a) and u,(a) on the parameter vector o d = (¢', 0', o'2). If p ( t ) = - l o g g(t) and the density g is normal, then minimization of L ( a ) yields the Gaussian maximum-likelihood estimate. The choice of the function p is guided by the same qualitative robustness considerations as for H u b e r M-estimates for location and regression (see, for example, Huber, 1981; Hampel, 1974), and, for the A R M A model M-estimates of Section 2: O should have a bounded and continuous derivative ~O= p'.

6.2. State-variable representation of the A R M A model To determine the parameter estimates which minimize the loss function L ( a ) defined by (6.15), we need to express 33', 1 and s, as functions of ¢, 0, G , and y,-1 In doing so it is convenient to write the A O A R M A model for the xt in the

144

R. D. Martin and V. J. Yohai

state-variable form X, = 45X,_1 + re,,

(6.16)

y, = xt + vt,

(6.17)

where x t is the first element of X~ and

45=

q~2

1)

(6.18)

,

where I(k_l)is a (k - 1 ) x ( k - 1) identity matrix, and 0 is a ( k - 1) column vector of zeros. The dimensionality of the square 45 matrix is k = max{p, q + 1}. If q ~>p, the first column of 45 contains the autoregressive parameters q~l, q~2. . . . , Pk, but with q~i = 0 for i > p. Corresponding to this choice of 45, the vector r in (6.16) is a k x 1 column vector defined as ( 1 , - 0 1 , - 0 2 , . . . , - - O k - l ) ' with 0i = 0 for i > q in case p > q. For details, see Appendix A of Martin, Samarov and Vandaele (1983). This state-variable representation is not unique. See Akaike (1974) for another possibility. 6.3. R o b u s t filter cleaners We now describe a class of robust filter cleaners which are used to obtain the one-step-ahead predictions ~gtt-1= 2tt -1, and thereby compute the prediction residuals ut= y t - ~ t t -1 appearing in the loss function (6.15). These filter cleanears are sometimes called approximate conditional-mean type (ACM) filter cleaners because of an approximate optimality result described at the end of this section. Here the term filter refers to an estimate 2 t of x t which is based on the present and past data y' = (Yx. . . . . y,)'. A smoother is an estimate 2~ of x, based on all the observed data y r = (Yl, Y2. . . . . Yr)'. We discuss smoother cleaners in Subsection 6.5. Under conditions to be described subsequently, the ))tt-~ are approximate conditional-mean estimates for the non-Gaussian A O model, and it is in this case that (6.14) will be an approximation to the log-likelihood function (6.2) (the various approximations involved here seem difficult to avoid in non~ Gaussian A O models). However, we shall not generally require that the conditions alluded to be in force, since good filter-cleaners and associated parameter estimates ~ can be obtained without such a requirement. The filter cleaner computes robust estimates J~r of the vector X, according to the following recursion: ^t

1

(6.19)

Robustness in time series and estimating A R M A models

where p, = mJs~, with m t being the first column of the k computed recursively as M,+I = q~P,q~' + Q ,

x

14~

k matrix M,, which is

(6.20)

, = M, w("s::, ') m"' s 2, The ~ is a robustifying psi-function, O = cr2rr ', and w is a weight function described in (6.27) below. The time-varying scale s t is defined by s~ = mll.t,

(6.22)

where r a n , t is the 1-1 element of M , the robust one-step-ahead predictors of y, and x, are )3;-1 = 2;-~= (4)Xt_l)l,

(6.23)

and the cleaned data at time t is

2, = (Xt)~ •

(6.24)

With the scaling (6.22), we will have 37t = y, a large fraction of the time when there are rather few outliers in the series. This is why we use the term filter cleaner.

Before proceeding, note that when 0 is the identity function, w is identically 1, and (6.22) is replaced by s 2t - mH, , + or20 with o-z0 = var v, in the additive-noise model, the above recursions are those of the Kalman filter. Correspondingly, M t and Pt are the prediction and filtering error-covariance matrices. See, for example, Kalman (1960), Jazwinski (1970), Meditch (1969). Unfortunately, the Kalman filter is not robust; a single outlying observation Yt can spoil not only Y~t, but also ~',, u > t. Use of a robust version is imperative in many situations. Our use of ~r20= 0 in (6.22) corresponds to the assumption that v t = 0 a large fraction of the time, e.g. as when a contaminated normal distribution with degenerate central component, i.e. o-~= 0 in (6.9), and 3: small, provides a reasonable model for F~. The weight function w should have the same qualitative properties as a good robustifying 0-function, namely: b o u n d e d n e s s , c o n t i n u i t y and perhaps c o m p a c t support.

A common compact support for 0 and w results in the following desirable behavior of the filter cleaner: if an observation y, deviates from its prediction 311-~

146

R. D. Martin and V. J. Yohai

by a sufficiently large amount, then Y~, will be the pure prediction X', = q~J~t-~, and the filtering-error covariance is set equal to the one-step prediction-error covariance Pt = Mt. The latter idea has often been i m p l e m e n t e d as a so-called hard-rejection rule: set Xt = q~X,-i and Pt = M, if lu,] > cs~ replacing (6.22) by s 2t = m11,t + o-20 in the general noise case where there is a nonzero additive Gaussian noise component. Typically, c = 3 has been used according to a time-honored habit, and the procedure accordingly is termed a 3-sigma-edit rule. This corresponds to the choices

t, It I < c , ¢,..(t) =

0,

(6.25)

[tt/> c,

1, It] < c , WHR(t)---- 0, ttl>fc.

(6.26)

Our filter cleaners would differ from this simple rule by imposing continuity, as well as boundedness and compact support. The O and w functions should return smoothly to zero. One reasonable way to accomplish this is to impose continuity on ~, and take w as

w(t)- 440

(6.27)

t

The two-part redescending W-function

t,

OnA(t)=

i

Itt- 0 as t-~ % i.e. the dependency of O, on t vanishes asymptotically. For examples which clarify this point, see Martin and Yohai (1984). Then the time-series version of I C H is ICH(~c) = lim T ( / z r ) - T(/z) ~o T

(7.4)

The trouble with this definition is that it is not very natural from the following viewpoint, among others. The contamination process measure p.~ is a mixture which corresponds to obtaining a realization from the stationary m e a s u r e / x with probability 1 - y, and with probability 7 obtaining a realization from the (nonstationary) process having marginal measure 65 for (Yp Y0, Y-1. . . . ). Such a mixture process does not correspond to any realistic contamination process occurring in practice! Further discussion on this point may be found in Martin and Yohai (1984), who propose a new definition of time-series influence curve IC as follows. Let the 0-1 process in (3.1) satisfy P ( z , = 1)= ,y + o(7), let /xx denote the measure for x t, let/x w denote the measure for the contaminating process wo and let /x~ be the measure for y,L We can get either isolated or patchy outliers depending upon how we specify the 0-1 processes z t and w,. Assume that the estimate of interest is obtained from the functional T(/x~). Then the time-series influence curve IC(p.w)= IC(/xw; T, {#~}) is the derivative at #w, along the arc { ~ } = {/x~: 0 4 7 < 1} as 7 ~ 0 , and correspondingly # ~ # x : IC(/xw) = lim T O x ~ ) - T(/xx) :,~0

(7.5)

)'

The argument of IC is the contamination measure /xw, so in general IC is a curve on measure space. However, calculations of IC(/xw) usually entail special forms for the contamination process w,: in the case (Subsection 3.1 (i)) of additive outliers, we let v t ~ ~ so that w, = x t + ~, whence the additive outliers have constant amplitude ~:, and in the case (Subsection 3.1(ii)) of substitution outliers we let w, ~-~, so that the substitution outliers all have constant value ~. When these special forms of w t are used we replace the notation IC(/xw) by IC(~:), so IC is now a curve with domain the real line. This is in keeping with the spirit of ICH(~:), whose argument is a fixed contamination value £ in R p, with p = 1 for

Robustness in time series and estimating A R M A models

153

univariate problems (and we are dealing only with univariate time series in the present discussion). Although the IC is similar in spirit to ICH, it coincides with ICH only in the special case where the estimate is permutation invariant and y, is an i.i.d. substitution outliers model (Subsection 3.1(ii)), i.e. in the usual i.i.d, setup the two definitions coincide (Corollary 4.1 of Martin and Yohai, 1984). Although in general IC is different from ICH, there is a close relationship between the two which facilitates the calculation of IC. Namely, under regularity conditions IC(/zw) = lim E, ICI-I(y~), :,-~0

(7.6)

y

where Yt = (Yl, Y0, Y-l, • • ") is governed by the measure/z ~ for the process Yt in (3.1), and Er den6tes expectation with respect to/Zy. The above result is established in Martin and Yohai (1984), where several other results concerning IC's are presented: Conditions are established which aid in the computation of IC's and which ensure that an IC is bounded. IC's are computed for both least squares and a variety of robust estimates of first-order autoregressive and moving-average models. Distinctly different behaviors of the IC are exhibited for patchy versus isolated outliers. It is shown that bounded monotone ~b-functions do not yield bounded IC's for moving-average parameters, whereas redescending 0-functions do yield bounded IC's. Finally, the IC is used to show that a class of generalized RA-estimates has a certain optimality property.

References Akaike, H. (1974). Markovian representation of stochastic processes and its application to the analysis of autoregressive moving average processes. Ann. Instit. Statist. Math. 26, 363-387. Beaton, A.E. and Tukey, J.W. (1974). The fitting of power series, meaning polynomials, illustrated on band spectroscopic data. Technometrics 16, 147-185. Belsley, D. A., Kuh, E. and Welsch, R. E. (1980). Regression Diagnostics. Wiley, New York. Boente, G., Fraiman, R. and Yohai, V. J. (1982). Qualitative robustness for general stochastic processes. Technical Report No. 26. Department of Statistics. University of Washington, Seattle, WA. Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis Forecasting and Control. Holdeno Day, San Francisco, CA. Bustos, O. H. (1981). Qualitative robustness for general processes, Informes de Mathemfitica, Serie B-002/81. Instituto de Mathemfitica Pura e Aplicada, Brazil. Bustos, O. H. (1982). General M-estimates for contaminated p-th order autoregressive processes: consistency and asymptotic normality. Z. Wahrsch. Verw. Gebiete 59, 491-504. Bustos, O. H. and Yohai, V. J. (1983). Robust estimates for ARMA models. Informes de Mathemfitica, Serie B-12/83. Instituto de Mathemfitica Pura e Aplicada, Brazil. To appear in J. Amer. Statist. Assoc. Bustos, O., Fraiman, R. and Yohai, V. J. (1984). Asymptotics for RA-estimates of ARMA rnodels~ In: J. Franke, W. Hiirdle and D. Martin, eds., Robust and Nonlinear Time Series Analysis. Springe~ .. Berlin.

154


E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5 © Elsevier Science Publishers B.V. (1985) 157-177

5

Time Series Analysis with Unequally Spaced Data

Richard H. Jones

1. Introduction

Unequally spaced data can occur in two distinct ways. The data can be equally spaced with missing observations, or the data can be truly unequally spaced with no underlying sampling interval. For multivariate data, when several variables are recorded at each observation time, it is possible to have missing observations within the observation vector at a given time. In this case, the observation times may be equally or unequally spaced. The key to data analysis in these situations is to represent the structure of the process using a state-space representation. For Gaussian inputs and errors, this allows the calculation of the exact likelihood using the Kalman filter. Nonlinear optimization can then be used to obtain maximum likelihood estimates of the unknown parameters of the process. These methods are easily extended to regression with stationary errors, including analysis of variance with serially correlated errors. Mixed models that include random effects fit naturally into the state-space formulation, since the random parameters can be included in the state vector and the variances estimated by maximum likelihood, even with unbalanced designs.

2. State space and the Kalman filter

Kalman (1960) developed an approach to filtering and prediction based on the concept of state and state transition. While the term prediction is clear to those working in the area of time series analysis, filtering simply means estimating the current state of a process given observations up to the present time. Here, the Kalman filter will be used as a recursive method of calculating -2 ln likelihood which easily handles missing or unequally spaced data. The concept of state is the least amount of data about the past and present of a process needed to predict the future. For a first-order autoregressive (AR(1)) process, this is simply the current observation. For autoregressive moving average (ARMA) processes, the state can be represented in many ways as a


vector. This turns a univariate process into a vector Markov process involving a state transition matrix. It is this Markov property that allows -2 ln likelihood to be calculated recursively (Schweppe, 1965). A discussion of the Kalman filter can be found in Gelb (1974). A general state-space model consists of two equations. The state equation defines the properties of the process in vector Markov form, and the observation equation defines what is actually observed. These equations are

X(t) = F(t; t-1)X(t-1) + G(t)u(t),    (2.1)
Y(t) = H(t)X(t) + v(t).

X(t) is an m by 1 column vector representing the state of the process at time t. F(t; t-1) is an m by m state transition matrix defining how the process progresses from one time point to the next. u(t) is the random input to the state equation, sometimes referred to as the plant noise, which is a column vector of length m', m' ≤ m, assumed to have a multivariate normal distribution with zero mean vector and covariance matrix equal to the identity matrix. The u(t) are assumed to be independent at different times. G(t) is an m by m' matrix defining how the random inputs are propagated into the state. H(t) is a d by m matrix defining linear combinations of the state that are observed at time t. Y(t) is a d by 1 vector of observations at time t, and v(t) is a d by 1 vector of random observational errors assumed to be normally distributed with zero mean vector and covariance matrix R(t). The v(t) are assumed to be independent at different times and independent of the random input u(t). The Kalman filter produces estimates of the state vector X(t) based on data collected up to time t, assuming that all the parameters of the model are known, i.e. F(t; t-1), G(t), H(t) and R(t). The notation for this estimate is X(t|t), and the estimate has covariance matrix

P(t|t) = E{[X(t) - X(t|t)][X(t) - X(t|t)]'},    (2.2)

where ' denotes transpose. Similarly, X(t|t-1) denotes the estimate of the state at time t given observations up to time t-1, a one-step prediction, and its covariance matrix is denoted P(t|t-1). To begin the recursion, it is necessary to specify an initial value of the state vector before the first observation is collected, X(0|0), and its covariance matrix P(0|0). The general step of the recursion starts with the information available at time t-1, X(t-1|t-1) and P(t-1|t-1), and ends when this same information is available at time t. The recursion proceeds as follows:

(1) Calculate a one-step prediction

X(t|t-1) = F(t; t-1)X(t-1|t-1).


(2) Calculate the covariance matrix of this prediction

P(t|t-1) = F(t; t-1)P(t-1|t-1)F'(t; t-1) + G(t)G'(t).

(3) The prediction of the next observation vector is

Y(t|t-1) = H(t)X(t|t-1).

(4) The innovation vector is the difference between the observations and the predicted observations

I(t) = Y(t) - Y(t|t-1).

(5) The innovation covariance matrix is

V(t) = H(t)P(t|t-1)H'(t) + R(t).

(6) The contribution to -2 ln likelihood for this step is

I'(t)V⁻¹(t)I(t) + ln|V(t)|,

where | | denotes the determinant of the matrix. The contribution for each step is summed over all steps.

(7) The Kalman gain matrix is

K(t) = P(t|t-1)H'(t)V⁻¹(t).

(8) The updated estimate of the state vector is

X(t|t) = X(t|t-1) + K(t)I(t).

(9) Its covariance matrix is

P(t|t) = P(t|t-1) - K(t)H(t)P(t|t-1).

For univariate time series, Y(t), the observation at time t, and R(t), the observational error variance, will be scalars. A process with time-invariant structure observed at equally spaced time intervals has parameters that do not depend on time, and the model can be written

X(t) = FX(t-1) + Gu(t),    (2.3)
Y(t) = HX(t) + v(t),

with observational error covariance matrix R.


A stationary time series has time-invariant structure, as do certain nonstationary processes such as autoregressive integrated moving average (ARIMA) processes. ARIMA processes can be directly represented in the above form without differencing the data. Differencing can cause problems when there are missing observations. For unequally spaced time series, F(t; t-1) and G(t) will usually depend on the length of the time step. For univariate time series at equal spacing with missing observations, when an observation is missing, the recursion skips steps (3)-(7), and the final two steps become simply

(8) Update estimate

X(t|t) = X(t|t-1).    (2.4)

(9) Covariance matrix

P(t|t) = P(t|t-1).    (2.5)

Note that these two equations require no calculation, since the values that are in memory are not changed. For multivariate time series with missing observations within the observation vector, it is only necessary to reduce the number of rows in the H(t) matrix to allow for these missing observations. When the unknown model parameters have been estimated by maximum likelihood, predictions can be made by running the recursion off the end of the data using the missing data form of the recursion. Calculating V(t) from step (5) of the recursion gives the variance or covariance matrix of the prediction.
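The nine steps above, together with the missing-data shortcut, translate almost line for line into code. The following is a minimal sketch in Python with NumPy (the language, function name and NaN convention are illustrative assumptions, not from the chapter), written for the time-invariant case (2.3) with a row of NaNs marking a missing observation vector:

    import numpy as np

    def neg2_log_likelihood(y, F, G, H, R, x0, P0):
        # -2 ln likelihood by the Kalman filter, steps (1)-(9).
        # y: (n, d) observations; a row of NaNs is a missing observation.
        x, P = x0.copy(), P0.copy()
        total = 0.0
        for yt in y:
            # (1)-(2): one-step prediction and its covariance
            x = F @ x
            P = F @ P @ F.T + G @ G.T
            if np.isnan(yt).all():
                continue                      # (8)-(9) reduce to no change
            # (3)-(5): innovation and its covariance
            innov = yt - H @ x
            V = H @ P @ H.T + R
            Vinv = np.linalg.inv(V)
            # (6): accumulate the contribution to -2 ln likelihood
            total += innov @ Vinv @ innov + np.log(np.linalg.det(V))
            # (7)-(9): gain, updated state, updated covariance
            K = P @ H.T @ Vinv
            x = x + K @ innov
            P = P - K @ H @ P
        return total

One pass of this function is one evaluation of the objective handed to a nonlinear optimization routine.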

3. A state-space representation for an ARMA(1, 1) process

There are several ways to represent an ARMA(1, 1) process in state-space form. One method that keeps all the equations in scalar form is to represent the process as a first-order autoregression with observational error,

x(t) = αx(t-1) + σu(t),    (3.1)
y(t) = x(t) + v(t).

u(t) is assumed to have unit variance, and the observational error variance is R. The three parameters to be estimated by maximum likelihood are α, σ and R. For any pass through the recursion, these parameters are assumed to be known and are varied by a nonlinear optimization routine between passes. In other words, one pass through the Kalman filter produces a value of -2 ln likelihood, which is one function evaluation for a nonlinear optimization routine. Multivariate extensions of this model are discussed in Jones (1984). This special case has many practical applications. ARMA(1, 1) processes can be used to model serial correlation in many situations where data spans are not


too long and the process is not highly structured. In particular, in regression or analysis of variance, using an ARMA(1, 1) model for the error structure may be much better than the usual assumption of independent errors. Approximate modeling of serial correlation when it exists is better than not modeling it at all. To begin the recursion, it is necessary to specify the initial parameters. Since x(t) is a zero mean AR(1) process, the variance of the process (lag zero covariance) is σ²/(1 - α²). For given values of the parameters, the initial conditions specify what is known before any data are collected,

x(0|0) = 0,    P(0|0) = σ²/(1 - α²).    (3.2)

The recursion is now a special case of the general recursion given in the last section:

(1) Calculate a one-step prediction

x(t|t-1) = αx(t-1|t-1).

(2) Calculate its variance

P(t|t-1) = P(t-1|t-1)α² + σ².

(3) The prediction of the next observation is

y(t|t-1) = x(t|t-1).

(4) Calculate the innovation

I(t) = y(t) - y(t|t-1).

(5) The innovation variance is

V(t) = P(t|t-1) + R.

(6) The contribution to -2 ln likelihood is

I²(t)/V(t) + ln V(t).

(7) The Kalman gain is

K(t) = P(t|t-1)/V(t).

(8) Update the estimate of the state

x(t|t) = x(t|t-1) + K(t)I(t) = [Rx(t|t-1) + P(t|t-1)y(t)]/V(t).


(9) Update its variance

P(t|t) = P(t|t-1) - K(t)P(t|t-1) = RP(t|t-1)/V(t).

The second form of this last equation is more numerically stable since it prevents a subtraction. It is possible to concentrate σ² out of the likelihood by differentiation, in which case the recursion takes a slightly different form. In this case, to calculate the -2 ln likelihood it is necessary to accumulate two terms, one associated with the weighted residual (or innovation) sum of squares, RSS, and the other associated with the determinant in the multivariate normal distribution, DET. A new variable is defined which is the ratio of the two variances,

c² = R/σ².    (3.3)

The initialization is

x(0|0) = 0,    P(0|0) = 1/(1 - α²),
RSS = 0,    DET = 0.    (3.4)

The modified recursion is

(1) x(t|t-1) = αx(t-1|t-1),
(2) P(t|t-1) = P(t-1|t-1)α² + 1,
(3) y(t|t-1) = x(t|t-1),
(4) I(t) = y(t) - y(t|t-1),
(5) V(t) = P(t|t-1) + c²,
(6) RSS = RSS + I²(t)/V(t),
    DET = DET + ln V(t).

Here the equal sign is used in the programming sense of "is replaced by".

(7) K(t) = P(t|t-1)/V(t),
(8) x(t|t) = [c²x(t|t-1) + P(t|t-1)y(t)]/V(t),
(9) P(t|t) = c²P(t|t-1)/V(t).

After completing the recursion with n observations present,

-2 ln likelihood = n ln RSS + DET.    (3.5)

A nonlinear optimization search procedure can be used to find the minimum of -2 ln likelihood with respect to α and c². When this is completed, the maximum likelihood estimates of the two variances can be calculated from

σ² = RSS/n,    R = c²σ².    (3.6)


For missing observations, steps (3)-(7) are skipped and the last two steps are replaced by

(8) x(t|t) = x(t|t-1),
(9) P(t|t) = P(t|t-1).

The above recursions are easily modified for an AR(1) process with missing observations by setting R or c² = 0. In this case, the problem is nonlinear in only one parameter, α. Since, for a stationary process, this parameter must be in the range -1 < α < 1, and in most practical applications is in the range 0 ≤ α < 1, it is easy to search for the maximum likelihood estimate of α on a microcomputer. Note that in the case of an AR(1) process, the above steps simplify to:

(5) V(t) = P(t|t-1),
(8) x(t|t) = y(t),
(9) P(t|t) = 0.

In other words, without observational error, the variance of the innovation is the prediction variance, the updated estimate of the state is the new observation, and it has variance zero.
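As an illustration, the concentrated recursion (3.4)-(3.5) is only a few lines of code. The sketch below (Python; the function name and the NaN convention for missing values are assumptions for illustration) returns -2 ln likelihood for given α and c²:

    import numpy as np

    def neg2_log_lik_concentrated(y, alpha, c2):
        # Scalar Kalman recursion for the model (3.1) with sigma^2
        # concentrated out; c2 = R/sigma^2 as in (3.3).
        x, P = 0.0, 1.0 / (1.0 - alpha**2)   # initialization (3.4)
        rss, det, nobs = 0.0, 0.0, 0
        for yt in y:
            x = alpha * x                    # (1)
            P = P * alpha**2 + 1.0           # (2)
            if np.isnan(yt):
                continue                     # steps (3)-(7) skipped
            innov = yt - x                   # (3)-(4)
            V = P + c2                       # (5)
            rss += innov**2 / V              # (6)
            det += np.log(V)
            x = (c2 * x + P * yt) / V        # (8)
            P = c2 * P / V                   # (9)
            nobs += 1
        return nobs * np.log(rss) + det      # (3.5)

Minimizing this over α and c² and then applying (3.6) gives the maximum likelihood estimates of σ² and R; setting c² = 0 gives the pure AR(1) case.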

4. ARMA(p, q) processes

There are various equivalent state-space representations of ARMA(p, q) processes in the literature. These can be constructed using concepts well known to electrical engineers (see, for example, Wiberg, 1971). As a simple example, consider an autoregression of order p (AR(p)),

x(t) = α₁x(t-1) + α₂x(t-2) + ⋯ + α_p x(t-p) + e(t),    (4.1)

where e(t) has standard deviation σ. The state of this process can be defined as the p most recent values of the process, and a state-space representation is

\begin{bmatrix} x(t) \\ x(t-1) \\ \vdots \\ x(t-p+1) \end{bmatrix} =
\begin{bmatrix} \alpha_1 & \alpha_2 & \cdots & \alpha_{p-1} & \alpha_p \\ 1 & 0 & \cdots & 0 & 0 \\ & \ddots & & & \vdots \\ 0 & \cdots & & 1 & 0 \end{bmatrix}
\begin{bmatrix} x(t-1) \\ x(t-2) \\ \vdots \\ x(t-p) \end{bmatrix} +
\begin{bmatrix} \sigma \\ 0 \\ \vdots \\ 0 \end{bmatrix} u(t),    (4.2)

where u(t) has unit standard deviation. A minimal state-space representation of an ARMA(p, q) process has a state


vector of length m = max(p, q + 1). The inclusion of observational error in the observation equation can modify this. The addition of white noise to an ARMA(p, q) process is discussed by Box and Jenkins (1976, p. 122). If p > q, and the process is observed with error, the resulting observed process is ARMA(p, p). The resulting 2p parameters are a function of the original p + q parameters plus the variance of the observational error. The inclusion of observational error in the model provides the opportunity to find a more parsimonious model than simply fitting ARMA processes. For example, if the process is actually pure autoregressive with observational error, it is only necessary to estimate p + 1 parameters rather than 2p parameters for an ARMA(p, p) process. Fitting an ARMA(p, p-1) with observational error may be equivalent to fitting an ARMA(p, p) process without observational error. If p ≤ q, the addition of observational error produces an ARMA model of the same order, so the variance of the observational error will be confounded with the model parameters. The conclusion is that observational error can only be included in the model if p > q. If the model is ARMA(p, p), the state vector can sometimes be reduced to length p rather than length p + 1 by fitting an ARMA(p, p-1) model with observational error. The state vector of an ARMA(p, q) process is not simply values of the process at lagged times. The m elements must summarize the entire past and present for the purpose of prediction. Akaike's (1975) Markovian representation was used by Jones (1980). The elements of the state vector are the present value of the process and the 1, 2, …, m-1 step predictions into the future. A j step prediction is denoted x(t+j|t), i.e. the prediction at time t+j given data up to and including time t. The state equation is

\begin{bmatrix} x(t) \\ x(t+1|t) \\ \vdots \\ x(t+m-1|t) \end{bmatrix} =
\begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ \alpha_m & \alpha_{m-1} & \cdots & \alpha_2 & \alpha_1 \end{bmatrix}
\begin{bmatrix} x(t-1) \\ x(t|t-1) \\ \vdots \\ x(t+m-2|t-1) \end{bmatrix} +
\sigma \begin{bmatrix} 1 \\ g_2 \\ \vdots \\ g_m \end{bmatrix} u(t),    (4.3)

and the observation equation is

y(t) = [1 \;\; 0 \;\; \cdots \;\; 0] \begin{bmatrix} x(t) \\ x(t+1|t) \\ \vdots \\ x(t+m-1|t) \end{bmatrix} + v(t).    (4.4)


For the ARMA(p, q) model,

x(t) = α₁x(t-1) + ⋯ + α_p x(t-p) + e(t) + β₁e(t-1) + ⋯ + β_q e(t-q).    (4.5)

The g's in (4.3) are a function of the α's and β's and are generated by the recursion

g₁ = 1,    g_j = β_{j-1} + Σ_{k=1}^{j-1} α_k g_{j-k}.    (4.6)

Harvey (1981) uses a different but equivalent state-space representation where the G vector of equation (2.1) is made up of the β's. Other representations have the β's in the H vector. Whatever representation is used, for a stationary process, it is necessary to be able to calculate the initial covariance matrix of the state, P(0|0). For Akaike's Markovian representation, the necessary equations for calculating this matrix are given in Jones (1980).
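To make the construction concrete, the sketch below (Python; the function name and array layout are illustrative assumptions) builds the g's by the recursion (4.6) and assembles the matrices of the representation (4.3)-(4.4):

    import numpy as np

    def akaike_state_matrices(alphas, betas, sigma):
        # alphas = (alpha_1, ..., alpha_p), betas = (beta_1, ..., beta_q)
        p, q = len(alphas), len(betas)
        m = max(p, q + 1)
        a = np.zeros(m); a[:p] = alphas        # alpha_j = 0 for j > p
        b = np.zeros(m); b[1:q + 1] = betas    # b[j] holds beta_j
        g = np.zeros(m); g[0] = 1.0            # g_1 = 1
        for j in range(1, m):                  # recursion (4.6)
            g[j] = b[j] + sum(a[k] * g[j - 1 - k] for k in range(j))
        F = np.zeros((m, m))
        F[:-1, 1:] = np.eye(m - 1)             # shift rows of (4.3)
        F[-1, :] = a[::-1]                     # last row (alpha_m, ..., alpha_1)
        H = np.zeros(m); H[0] = 1.0            # observation picks off x(t)
        return F, sigma * g, H

With these matrices, the general recursion of Section 2 computes the exact likelihood once P(0|0) has been obtained as in Jones (1980).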

5. Stationarity and invertibility

The ARMA(p, q) process will be stationary if the roots of

1 - Σ_{k=1}^{p} α_k z^k = 0    (5.1)

are outside the unit circle, and for the moving average to be invertible, the roots of

1 + Σ_{k=1}^{q} β_k z^k = 0    (5.2)

must be outside the unit circle. To ensure stationarity and invertibility, Jones (1980) reparameterized in terms of the partial autoregression and partial moving average coefficients, and constrained them to be in the interval (-1, 1) by a logistic type transformation. If a_k is a partial autoregressive coefficient,

a_k = [1 - exp(-u_k)]/[1 + exp(-u_k)],    (5.3)

which has the inverse transformation

u_k = ln[(1 + a_k)/(1 - a_k)].    (5.4)

The u_k can vary from -∞ to ∞, and these are the variables that the nonlinear


optimization routine works with. For a given value of u_k, the corresponding a_k is calculated from (5.3), and the autoregressive coefficients are calculated from the Levinson (1947)-Durbin (1960) recursion. For j = 1, …, p, α_j^(j) = a_j, and for j > 1,

α_k^(j) = α_k^(j-1) - α_j^(j) α_{j-k}^(j-1),    k = 1, 2, …, j-1.    (5.5)

The α's are then used in the state-space representation along with the β's, which are transformed in a similar fashion, and a value of -2 ln likelihood is calculated. A natural way to obtain initial guesses at the parameters for nonlinear optimization is to proceed in a stepwise fashion, adding a single parameter, or perhaps both an autoregressive and a moving average parameter, at each step. The initial values of the parameters can be the final values obtained from the previous step with the new parameter or parameters set to zero. The optimization will then start from the best value of -2 ln likelihood found at the previous step and try to improve it.
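In code, the transformation (5.3) followed by the recursion (5.5) is straightforward. The following Python sketch (function name illustrative) maps unconstrained variables u_1, …, u_p to a stationary set of autoregressive coefficients:

    import numpy as np

    def ar_from_unconstrained(u):
        # (5.3): squash each u_k into a partial coefficient a_k in (-1, 1)
        u = np.asarray(u, dtype=float)
        a = (1.0 - np.exp(-u)) / (1.0 + np.exp(-u))
        alpha = np.zeros(0)
        for j, aj in enumerate(a, start=1):
            prev = alpha.copy()                # alpha^(j-1)
            alpha = np.append(prev, aj)        # alpha_j^(j) = a_j
            for k in range(j - 1):             # recursion (5.5)
                alpha[k] = prev[k] - aj * prev[j - 2 - k]
        return alpha

A nonlinear optimization routine can then search freely over the u's, with stationarity guaranteed at every trial point.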

6. ARIMA(p, d, q) processes

Differencing is usually used to reduce ARIMA processes to ARMA processes. When there are missing observations, this presents a problem. An alternative is to represent the integrated moving average process in state-space form so that it is possible to work with the original observations. The only problem is that for nonstationary processes, the initial covariance matrix cannot be expressed as a function of the process parameters. One possibility is to use the conditional likelihood, conditional on observing the first d available time points. Consider the following examples. A random walk observed with error, in state-space form, is

x(t) = x(t-1) + σu(t),    (6.1)
y(t) = x(t) + v(t),

where v(t) has variance R. The y(t) process is a special case of an ARIMA(0, 1, 1) and contains two unknown parameters to be estimated by maximum likelihood, σ and R. It is well known that the best estimate of the present of this process or the best prediction of the future is an exponentially weighted average of the past. The Kalman filter produces an exponentially weighted moving average in the limit for long data spans (Jones, 1966). It also produces optimal estimates near the beginning of a data span or in the presence of missing data once the parameters of the process are known. The


likelihood conditional on the first observation can be calculated using the following starting conditions,

x(1|1) = y(1),    P(1|1) = R.    (6.2)

An ARIMA(1, 1, 0) process requires a state vector of length two to model directly in state-space form. One state-space representation, corresponding to the representation (4.2), is

\begin{bmatrix} x(t) \\ x(t-1) \end{bmatrix} =
\begin{bmatrix} 1+\alpha & -\alpha \\ 1 & 0 \end{bmatrix}
\begin{bmatrix} x(t-1) \\ x(t-2) \end{bmatrix} +
\begin{bmatrix} \sigma \\ 0 \end{bmatrix} u(t),    (6.3)

y(t) = [1 \;\; 0] \begin{bmatrix} x(t) \\ x(t-1) \end{bmatrix}.

This is a nonstationary second-order autoregression with one root of the characteristic equation (5.1) equal to 1, and the other equal to 1/α. Since there is no observational error in this model, the initial conditions for calculating the likelihood conditional on the first observation are

X(2|1) = \begin{bmatrix} y(1) \\ y(1) \end{bmatrix},    P(2|1) = \begin{bmatrix} \sigma^2/(1-\alpha^2) & 0 \\ 0 & 0 \end{bmatrix}.    (6.4)

Using this form of the initial conditions, in the form of a prediction, the recursion is entered at step (3). The general ARIMA(p, d, q), represented directly in state-space form without differencing, requires an autoregressive part of order p + d. The stationary autoregressive part of order p can be represented as before in terms of partial autoregression coefficients, using the transformation (5.3) to ensure stationarity. The autoregression coefficients for the nonstationary process can be calculated from the corresponding powers of z in the generating function

(1 - z)^d (1 - α₁z - α₂z² - ⋯ - α_p z^p).    (6.5)

For example, if d = 1, the nonstationary α's are

α′₁ = α₁ + 1,
α′₂ = α₂ - α₁,
α′₃ = α₃ - α₂,
⋮    (6.6)
α′_p = α_p - α_{p-1},
α′_{p+1} = -α_p.
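The expansion (6.5) is a polynomial multiplication, so the nonstationary coefficients for any d can be generated by repeated convolution with (1 - z). A small Python sketch (function name illustrative):

    import numpy as np

    def arima_ar_coefficients(alphas, d):
        # Coefficients alpha'_k defined by
        # (1 - z)^d (1 - alpha_1 z - ... - alpha_p z^p)
        #     = 1 - alpha'_1 z - ... - alpha'_{p+d} z^{p+d}
        poly = np.concatenate(([1.0], -np.asarray(alphas, dtype=float)))
        for _ in range(d):
            poly = np.convolve(poly, [1.0, -1.0])   # multiply by (1 - z)
        return -poly[1:]

For d = 1 this reproduces (6.6); for example, arima_ar_coefficients([0.5], 1) returns (1.5, -0.5), i.e. α′₁ = α₁ + 1 and α′₂ = -α₁.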


7. Continuous time models for unequally spaced data

When data are truly unequally spaced, not equally spaced with missing observations, continuous time models are necessary to represent the process. These processes are discussed by Doob (1953). Kalman and Bucy (1961) develop the state-space filtering approach for continuous time processes, and Wiberg (1971) gives an easy-to-read introduction to the subject. The use of continuous time models allows the prediction and updating equations to be developed for an arbitrary time interval, so that the Kalman filter recursion depends on the length of the step. As an introduction, consider a continuous time first-order autoregression, referred to as a CAR(1) process. A zero mean CAR(1) process can be represented as a first-order linear differential equation driven by 'white noise'. The continuous time state-space representation is

dx(t) = -ax(t) dt + dW(t),    (7.1)

where a > 0, and W(t) is a Wiener process, i.e. dW(t) is continuous time zero mean 'white noise'. Integrated white noise is a continuous time random walk or Brownian motion process which satisfies the differential equation

dz(t) = dW(t).    (7.2)

The variance of the change in the random walk over a finite time interval is proportional to the length of the interval, i.e. for b > a,

Var{z(b) - z(a)} = Var{∫_a^b dW(t)} = (b - a)Q.    (7.3)

Here Q will be referred to as the variance of the white noise process. The process (7.1) is a continuous time Markov process, with covariance function at lag τ

C(τ) = Q exp(-a|τ|)/2a.    (7.4)

If the process is observed at a given time, this observation sums up the information in the past for the purposes of predicting the future, i.e. the value of the process at any time is the state of the process. If the process is observed at time a, the prediction of the process at time b > a can be calculated by solving the differential equation without the random input, dW(t), which is unpredictable, and substituting in the initial condition,

x(b) = x(a) exp[-a(b - a)].    (7.5)

The random input over a finite time interval is an exponentially weighted


integral of the white noise input (Gelb, 1974),

∫_a^b exp[-a(b - τ)] dW(τ),    (7.6)

and has variance

Q{1 - exp[-2a(b - a)]}/2a.    (7.7)

Note that as the time interval b - a becomes large, this prediction variance approaches the variance of the process. If this CAR(1) process is observed at equally spaced time intervals with spacing h, the resulting discrete time process is AR(1) with autoregression coefficient

α = exp(-ah)    (7.8)

and

σ² = Q[1 - exp(-2ah)]/2a.    (7.9)

Assume that the process is observed at n unequally spaced time points,

t₁ < t₂ < ⋯ < t_n.

It does not matter how these time points are determined, by some random mechanism or selected in advance, as long as the time points do not depend on the values of the process being sampled. It is assumed that the sampling times are known. The continuous time state-space model (7.1) can now be represented as a discrete time state-space model at the sampling times,

x(t_i) = F(t_i; t_{i-1})x(t_{i-1}) + G(t_i)u(t_i),    (7.10)

where

F(t_i; t_{i-1}) = exp[-a(t_i - t_{i-1})],
G(t_i) = √(Q{1 - exp[-2a(t_i - t_{i-1})]}/2a).    (7.11)

Now the state transition and the standard deviation of the random input depend on the length of the time step. The observation equation is

y(t_i) = x(t_i) + v(t_i),    (7.12)

where v(t_i) is the observational error with variance R. Here observational error would probably be truly observational error or numerical round-off error. In the equally spaced case, observational error is sometimes used as a convenient way to obtain a parameterization of the process with one less element in the state vector.
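The step-dependent quantities (7.11) and the scalar Kalman recursion give the exact likelihood for unequally spaced CAR(1) data in a few lines. A Python sketch (names illustrative):

    import numpy as np

    def neg2_log_lik_car1(times, y, a, Q, R):
        # -2 ln likelihood for the model (7.10)-(7.12)
        x, P = 0.0, Q / (2.0 * a)        # stationary variance, (7.4) at lag 0
        total, t_prev = 0.0, times[0]
        for t, yt in zip(times, y):
            dt = t - t_prev
            F = np.exp(-a * dt)                              # (7.11)
            G2 = Q * (1.0 - np.exp(-2.0 * a * dt)) / (2.0 * a)
            x, P = F * x, F * P * F + G2
            innov, V = yt - x, P + R
            total += innov**2 / V + np.log(V)
            K = P / V
            x, P = x + K * innov, P - K * P
            t_prev = t
        return total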


8. Continuous time AR(p) process with observational error, CAR(p)

A CAR(p) process with observational error, and FORTRAN code for calculating the exact -2 ln likelihood, are given in Jones (1981). Variable transformations are used to ensure that the estimated parameters generate a stationary process. The continuous time model can be written

d[x^(p-1)(t)] + [α_{p-1}x^(p-1)(t) + ⋯ + α₀x(t)] dt = dW(t),    (8.1)

where x^(j)(t) denotes the jth derivative with respect to time. For stationarity, it is necessary that the roots of

Σ_{j=0}^{p} α_j z^j = 0    (8.2)

have negative real parts (α_p = 1). A state-space representation for this process uses the value of the process and its first p-1 derivatives as the state. It does not matter that the derivatives are unobservable, since the Kalman theory assumes that linear combinations of the state are observed, and in this case the linear combination simply picks off the first element of the state vector. For stationary processes, the unconditional covariance matrix of this state vector is known (Doob, 1953), i.e. the covariances between the process and its derivatives, so the initial state covariance matrix can be calculated for given values of the parameters of the process (Jones, 1981). An interesting byproduct of this analysis is that estimates are obtained of the derivatives of the process as well as the process itself, and these are often of interest. For example, the velocity and acceleration are estimated if p ≥ 3. If the best estimates of velocity and acceleration are required within the data span using all the data, they can be calculated using the Kalman smoother (Gelb, 1974). This state-space representation can be written

d\begin{bmatrix} x(t) \\ x^{(1)}(t) \\ \vdots \\ x^{(p-1)}(t) \end{bmatrix} =
\begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ -\alpha_0 & -\alpha_1 & -\alpha_2 & \cdots & -\alpha_{p-1} \end{bmatrix}
\begin{bmatrix} x(t) \\ x^{(1)}(t) \\ \vdots \\ x^{(p-1)}(t) \end{bmatrix} dt +
\begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix} dW(t),    (8.3)


and the observation equation is

y(t) = [1 \;\; 0 \;\; \cdots \;\; 0] \begin{bmatrix} x(t) \\ x^{(1)}(t) \\ \vdots \\ x^{(p-1)}(t) \end{bmatrix} + v(t),    (8.4)

at each time point where the process is observed.
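The drift matrix of (8.3) is in companion form, and the state transition over a step Δt of the deterministic part is its matrix exponential. A Python sketch (function name illustrative; the covariance of the accumulated random input over the step must be computed separately, as in Jones, 1981):

    import numpy as np
    from scipy.linalg import expm

    def carp_transition(alphas, dt):
        # alphas = (alpha_0, ..., alpha_{p-1}) of (8.1); returns exp(A dt)
        p = len(alphas)
        A = np.zeros((p, p))
        A[:-1, 1:] = np.eye(p - 1)        # superdiagonal ones
        A[-1, :] = -np.asarray(alphas)    # last row of (8.3)
        return expm(A * dt)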

9. CARMA(p, q) and CARIMA(p, d, q) processes

The above continuous time autoregression can be generalized to continuous time autoregressive moving average (CARMA) and continuous time autoregressive integrated moving average (CARIMA) processes. Doob (1953, p. 542) discusses continuous time processes with rational spectra, and Wiberg (1971, p. 19) gives state-space representations for these processes. If we formally represent the continuous time 'white noise' process as

ε(t) = dW(t),    (9.1)

a CARMA(p, q) process can be represented

x^(p)(t) + α_{p-1}x^(p-1)(t) + ⋯ + α₀x(t) = ε^(q)(t) + β_{q-1}ε^(q-1)(t) + ⋯ + β₀ε(t).    (9.2)

For stationarity, it is necessary that p > q and that the roots of (8.2) have negative real parts. The representation will be 'minimum phase' if the roots of

Σ_{j=0}^{q} β_j z^j = 0    (9.3)

have negative real parts. It is also assumed that (8.2) and (9.3) have no common roots. For a CARIMA(p, d, q) process, d roots of (8.2) must be zero. This means that

α₀ = α₁ = ⋯ = α_{d-1} = 0.    (9.4)

10. Regression with stationary errors

Harvey and Phillips (1979) used state-space representations to obtain exact maximum likelihood estimates of regression coefficients and the ARMA


parameters for regression problems when the errors have a stationary ARMA structure. The regression parameters are included in the state vector of the process and concentrated out of the likelihood. There are some problems with the initial conditions when using this method. An alternative method is to realize that the Kalman filter is simply a linear operation on the previous data. The prediction summarizes the information in the past and most recent observation. The innovation is the component of the next observation that is orthogonal to the past. The Kalman filter, therefore, simply transforms correlated data to uncorrelated data. Since the prediction variance is also part of the algorithm, the innovation can be divided by the square root of this variance, producing a sequence of uncorrelated random variables with constant variance. In reality, the parameters of the ARMA process are not known, so guesses are made and -2 ln likelihood calculated. Embedding the procedure in a nonlinear optimization algorithm gives maximum likelihood estimates of the ARMA parameters and the regression coefficients. The regression coefficients are separated out of the equations so that optimization is required only with respect to the nonlinear parameters. Consider the usual regression equation,

y = Xβ + ε,    (10.1)

where y is an n by 1 vector of the response variable, X is an n by p matrix of the independent variables, β is a p by 1 vector of unknown regression coefficients, and ε is an n by 1 vector of errors. It is assumed that the errors have the structure of one of the models discussed in this paper, i.e. an ARIMA or CARIMA process with missing or unequally spaced data. The missing data are not within the y vector or X matrix. These are assumed to have no missing observations. The assumption is that the data are collected in time, and the time points are not equally spaced. As is usual when discussing weighted least squares, premultiply the regression equation by a matrix K, which in this case represents the Kalman filter,

Ky = KXβ + Kε.    (10.2)

In a regression situation, the Kalman filter operates on the y vector and each column of the X matrix in order to transform the errors to be uncorrelated with constant variance. It is very easy to modify the Kalman filter algorithm so that it operates on a matrix rather than a vector. By forming an n by p+1 matrix by augmenting X by y, the algorithm can operate on each column, replacing each entry by the innovation. The usual X'X matrix and X'y vector can be formed from the innovations. If y'y, the total sum of squares, is also calculated, then the residual sum of squares is

RSS = TSS - y'X(X'X)⁻¹X'y.    (10.3)


It is important that the determinant term be included in the likelihood, since the weight matrix is changed for each iteration,

-2 ln likelihood = n ln RSS + DET,    (10.4)

where DET is the natural log of the innovation variance summed over the time points.
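In code, the procedure amounts to whitening the columns of the augmented matrix [X y] with the Kalman filter and then doing ordinary least squares on the innovations. In the Python sketch below, whiten is a hypothetical helper, assumed to run the recursion of Section 2 down one column for the current error-model parameters and return the innovations divided by the square roots of their variances, together with DET:

    import numpy as np

    def gls_by_kalman(X, y, whiten):
        n, p = X.shape
        Z = np.column_stack([X, y])
        W = np.empty_like(Z, dtype=float)
        det = 0.0
        for j in range(p + 1):                # filter each column of [X y]
            W[:, j], det = whiten(Z[:, j])    # DET is the same for every column
        Xw, yw = W[:, :p], W[:, p]
        beta = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
        rss = yw @ yw - yw @ Xw @ beta        # (10.3)
        return beta, n * np.log(rss) + det    # (10.4)

Embedded in a nonlinear optimization over the error-model parameters, each call returns one value of -2 ln likelihood, with the estimated regression coefficients available at convergence.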

11. Variance component models

Duncan and Horn (1972) showed how random effects are naturally handled using the Kalman recursion. Random effects can be included in the state of the process while linear fixed effects are concentrated out of the likelihood as in regression. The advantage of this approach is the ability to handle unbalanced designs, such as missing observations, by exact likelihood methods. A second advantage is that serial correlation in repeated measures designs can be modeled, even when there are missing observations or the data are unequally spaced. Consider a simple two-way repeated measures design,

y_ij = μ + τ_j + π_i + ε_ij,    (11.1)

where i denotes the subject and j denotes the repeated measurements on each subject. μ is the fixed grand mean, τ_j the fixed treatment or time effect, π_i the random subject effect, and ε_ij the random error. It is assumed that the π_i are independent N(0, V_π), and the ε_ij are independent N(0, V_ε) and are independent of the π_i. These assumptions produce the compound symmetry correlation structure for observations on the same subject, i.e. constant correlation between any two observations. This intraclass correlation,

V_π/(V_π + V_ε),    (11.2)

is a result of the random subject effect, and is not serial correlation in the usual time series analysis sense. If this model is balanced with no missing observations, the usual repeated measures analysis of variance is appropriate (Winer, 1971), and exact maximum likelihood estimates of the two variances can be expressed in closed form (Herbach, 1959). In the unbalanced case with missing observations, the exact likelihood can be calculated using a state-space model. Concentrating V_ε out of the likelihood as before, -2 ln likelihood is nonlinear in only one parameter, the ratio of the two variances,

c² = V_π/V_ε.    (11.3)


Since the fixed effects can be handled by regression as in Section 10, only the random terms need be represented in state-space form. Since subjects are independent, -2 ln likelihood can be calculated for each subject and summed over subjects. For subject i, the state equation is trivial, since π_i is constant for each subject. It is, however, random across subjects with variance V_π. The state equation is

π_i(j) = π_i(j-1),    (11.4)

and the observation equation is

y_ij - μ - τ_j = π_i(j) + ε_ij.    (11.5)

The initial conditions are

π_i(0|0) = 0,    P_i(0|0) = c².    (11.6)

This initial variance would be V_π if V_ε had not been concentrated out of the likelihood. ε_ij now plays the role of observational error. Concentrating V_ε out of the likelihood has the effect of dividing all variances in the recursion by V_ε; therefore, the observational error variance R for this model will be set equal to 1. If serial correlation exists between the ε's within a subject, the ε's must be modeled as part of the state. Any of the models discussed in this chapter can be used to model this serial correlation. The random subject effect is simply tacked onto the end of the state vector. The observations can be equally or unequally spaced, and there may be missing observations. For example, if the ε's satisfy an AR(1) structure,

ε_ij = αε_{i,j-1} + u_j,    (11.7)

the state equation can be written

\begin{bmatrix} \varepsilon_{ij} \\ \pi_i \end{bmatrix} =
\begin{bmatrix} \alpha & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} \varepsilon_{i,j-1} \\ \pi_i \end{bmatrix} +
\begin{bmatrix} 1 \\ 0 \end{bmatrix} u_j,    (11.8)

and the observation equation is

y_ij - μ - τ_j = [1 \;\; 1] \begin{bmatrix} \varepsilon_{ij} \\ \pi_i \end{bmatrix}.    (11.9)

The initial state vector is [0  0]' with covariance matrix

P(0|0) = \begin{bmatrix} 1/(1-\alpha^2) & 0 \\ 0 & c^2 \end{bmatrix}.    (11.10)

Here c² = V_π/σ², and this model is nonlinear in two parameters, c and α.
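For the compound symmetry case (11.4)-(11.6), the per-subject recursion is particularly simple. A Python sketch (names illustrative), returning the RSS and DET contributions of one subject:

    import numpy as np

    def subject_contribution(resid, c2):
        # resid: y_ij - mu - tau_j for one subject, in time order
        # c2: the variance ratio of (11.3); R = 1 after concentration
        x, P = 0.0, c2                    # initial conditions (11.6)
        rss = det = 0.0
        for r in resid:
            # the state transition (11.4) is the identity
            innov, V = r - x, P + 1.0
            rss += innov**2 / V
            det += np.log(V)
            K = P / V
            x, P = x + K * innov, P - K * P
        return rss, det

Summing these contributions over subjects and applying (3.5) gives -2 ln likelihood as a function of c alone.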


The usual linear mixed model (Rao and Kleffe, 1980) is

y = Xβ + U₁θ₁ + ⋯ + U_pθ_p + ε,    (11.11)

where X, U₁, …, U_p are known matrices, β is a vector of fixed unknown parameters, and θ₁, …, θ_p are random vectors with zero means, uncorrelated, with covariance matrices

E{θ_iθ_i'} = V_i I.    (11.12)

The methods presented here can be used to estimate β and the variance components for unbalanced designs, and the estimation is nonlinear in only p parameters, the variance components.

12. Nonlinear optimization

The nonlinear optimization routines used by the author are the quasi-Newton methods discussed by Dennis and Schnabel (1983), who give algorithms in the Appendix of their book. The art of nonlinear optimization is highly developed by computer scientists, and statisticians need only find good code. Supplying derivatives for the functions being minimized is a good idea if possible, but it is not necessary. Gradients can be approximated by finite differences.

13. Conclusion

State-space representations and the Kalman filter provide a unified approach for calculating likelihoods for time series models. The state-space representation represents the process as a vector Markov process. At each time point, the Kalman filter calculates the component of the new observation that is orthogonal to the past. These innovations, together with their variances, are used to calculate the likelihood, assuming that the process has Gaussian errors and inputs. If observations are missing, predictions continue across the missing data, keeping track of the growing prediction variance. When the next observation is available, the innovation has a larger variance than when there are no missing observations, but the correct variance has been calculated to enter into the likelihood. If a large block of data is missing, so that there is no longer any information available from the past for prediction, the algorithm converges to a steady state and the result is the same as if the algorithm starts again on a new realization of the process. It is also possible to use multiple realizations of a process to obtain a single value of the likelihood. If data are unequally spaced, continuous time models can be used for the


process. A continuous time state-space representation defines a discrete time representation at the sample points. Embedding the recursion as a function evaluation in a nonlinear optimization routine provides maximum likelihood estimates of the process parameters. A calculation of -2 ln likelihood from a recursion through the data is one value of the function that the routine is attempting to minimize. These procedures generalize to regression with correlated error structure, including analysis of variance problems. Of particular interest are mixed linear models, since variance components can be estimated from unbalanced designs, and in the presence of serially correlated errors.

References

Akaike, H. (1975). Markovian representation of stochastic processes by canonical variables. SIAM J. Control 13, 162-173.
Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco, CA.
Dennis, J. E. and Schnabel, R. B. (1983). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, NJ.
Doob, J. L. (1953). Stochastic Processes. Wiley, New York.
Duncan, D. B. and Horn, S. D. (1972). Linear dynamic recursive estimation from the viewpoint of regression analysis. J. Amer. Statist. Assoc. 67, 815-821.
Durbin, J. (1960). The fitting of time series models. Review of the International Statistical Institute 28, 233-244.
Gelb, A. (1974). Applied Optimal Estimation. M.I.T. Press, Cambridge, MA.
Harvey, A. C. (1981). Time Series Models. Philip Allan, Deddington, Oxford, and John Wiley (Halstead Press), New York.
Harvey, A. C. and Phillips, G. D. A. (1979). Maximum likelihood estimation of regression models with autoregressive-moving average disturbances. Biometrika 66, 49-58.
Herbach, L. H. (1959). Properties of model II-type analysis of variance tests. Ann. Math. Statist. 30, 939-959.
Jones, R. H. (1966). Exponential smoothing for multivariate time series. J. Roy. Statist. Soc. Ser. B 28, 241-251.
Jones, R. H. (1980). Maximum likelihood fitting of ARMA models to time series with missing observations. Technometrics 22, 389-395.
Jones, R. H. (1981). Fitting a continuous time autoregression to discrete data. In: D. F. Findley, ed., Applied Time Series Analysis II, 651-682. Academic Press, New York.
Jones, R. H. (1984). Fitting multivariate models to unequally spaced data. In: E. Parzen, ed., Time Series Analysis of Irregularly Observed Data. Lecture Notes in Statistics, Vol. 25, 158-188. Springer, Berlin-New York.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. ASME Trans. Part D (J. Basic Engineering) 82, 35-45.
Kalman, R. E. and Bucy, R. S. (1961). New results in linear filtering and prediction theory. ASME Trans. Part D (J. Basic Engineering) 83, 95-108.
Levinson, N. (1947). The Wiener RMS error criterion in filter design and prediction. J. Math. Phys. 25, 261-278.
Rao, C. R. and Kleffe, J. (1980). Estimation of variance components. In: P. R. Krishnaiah, ed., Handbook of Statistics, Vol. 1, 1-40. North-Holland, Amsterdam.
Schweppe, F. C. (1965). Evaluation of likelihood functions for Gaussian signals. IEEE Trans. Inform. Theory IT-11, 61-70.

Solo, V. (1984). Some aspects of continuous-discrete time series modeling. In: E. Parzen, ed., Time Series Analysis of Irregularly Observed Data. Lecture Notes in Statistics, Vol. 25, 325-345. Springer, Berlin-New York.
Wecker, W. E. and Ansley, C. F. (1983). The signal extraction approach to nonlinear regression and spline smoothing. J. Amer. Statist. Assoc. 78, 81-89.
Wiberg, D. M. (1971). Theory and Problems of State Space and Linear Systems. Schaum's Outline Series. McGraw-Hill, New York.
Winer, B. J. (1971). Statistical Principles in Experimental Design, 2nd ed. McGraw-Hill, New York.

E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5 © Elsevier Science Publishers B.V. (1985) 179-187

6

Various Model Selection Techniques in Time Series Analysis

Ritei Shibata

1. Introduction

This chapter aims to give a short review of various model selection techniques which have been developed in the context of time series analysis. Our main concern is with moving-average (MA), autoregressive (AR) or autoregressive-moving average (ARMA) models. A related problem is model checking, which aims at checking the adequacy of a model. Since the test statistics employed in model checking can also be used for constructing a model selection procedure, we first review such statistics. We should, however, note that model selection is not a simple combination of model checkings. The aim of model selection is not only checking the adequacy of a model but also (a) obtaining a good predictor, or (b) describing a system, or identifying a system. We consider a univariate ARMA(p, q) model,

φ(B)z_t = θ(B)e_t,    (1.1)

where

φ(B) = 1 + φ₁B + φ₂B² + ⋯ + φ_pB^p and θ(B) = 1 + θ₁B + θ₂B² + ⋯ + θ_qB^q

are transfer functions with backward operator B. For simplicity, we assume that {e_t} is a sequence of independent, identically normally distributed random variables with mean 0 and variance σ². The sequence of autocovariances is denoted by {γ_l}, and that of the sample autocovariances, based on n observations z₁, …, z_n, is denoted by {Ĉ_l}. The estimated noise sequence is then

ê_t = θ̂(B)⁻¹φ̂(B)z_t,

where θ̂(B) and φ̂(B) are the maximum likelihood or quasi-maximum likelihood estimates of θ(B) and φ(B), respectively. In this chapter, we do not go


further into mathematical details. We assume any commonly used regularity conditions. In the next section, we will see some test statistics which are specific to the type of model, MA, AR or ARMA. In Section 3, some other statistics will be seen which are not specific to the type of model. Section 4 discusses how to construct a selection procedure based on such test statistics.

2. Statistics specific for each model

2.1. Moving-average models

A commonly used test statistic for MA(q) is one based on sample autocovariances [7]. The quadratic form of the sample autocovariances,

T = n Σ_{l,m=q+1}^{h} Ĉ_l Ĉ_m σ̂^{lm},

is asymptotically distributed as χ²_{h-q}(ψ) with the noncentrality

ψ = n Σ_{l,m=q+1}^{h} γ_l γ_m σ^{lm},

where σ^{lm} and σ̂^{lm} are the (l, m) elements of the inverse of the autocovariance matrix and of the sample autocovariance matrix, respectively. Therefore, by T we can check q dependence, a specific property of MA(q), but we may fail in checking linearity of the process. A remarkable fact is that this statistic is not equivalent to the maximum log likelihood in any sense.

2.2. Autoregressive models

For testing AR(p), the sequence of partial autocorrelations {φ_m}, which are zero for m > p, plays an important role. The sum of squared sample partial autocorrelations {φ̂_m},

T = n Σ_{l=1}^{h} φ̂²_{p+l},

is asymptotically distributed as χ²_h(ψ) with the noncentrality

ψ = n Σ_{l=1}^{h} φ²_{p+l}.

Therefore, by T we can test the null hypothesis of {z_t} being an AR(p). As the sample partial autocorrelation φ̂_m, a commonly used definition is the last coordinate of the solution φ̂(m)' = (φ̂₁(m), …, φ̂_m(m)) of the mth-order Yule-Walker equation.


Historically, the following statistic was proposed earlier by Quenouille [19],

T = (n/σ̂⁴(p)) Σ_{l=1}^{h} ĥ²_{p+l},

where

ĥ_l = (1/n) Σ_{t=p+1}^{n} (Σ_{j=0}^{p} φ̂_j(p)z_{t-j})(Σ_{j=0}^{p} φ̂_j(p)z_{t-l+j})

and

σ̂²(p) = (1/(n-p)) Σ_{t=p+1}^{n} (Σ_{l=0}^{p} φ̂_l(p)z_{t-l})²,

with z_t = 0 for t ≤ 0. The above ĥ_l can be thought of as an approximation to the covariance

h_l = (1/n) Σ_{t=p+1}^{n} e_t ě_{t-l}

between the noise e_t and its backward representation ě_t = Σ_{j=0}^{p} φ_j z_{t+j}. Hence, ĥ_l/σ̂²(p) might be more natural than the φ̂_l as an estimate of the partial autocorrelation. It is well known that the above two statistics are asymptotically equivalent to each other [2]. These statistics are also asymptotically equivalent to the maximum log likelihood.

2.3. Autoregressive moving-average models

A specific property of the autoregressive moving-average model is that it is non-identifiable when overfitted. Since an ARMA model η(B)φ(B)z_t = η(B)θ(B)e_t has the same covariance structure as that of (1.1), the transfer functions θ(B) and φ(B) are not uniquely determined by the autocovariances of {z_t}. The generalized partial autocorrelation φ_k(j) is defined as the last coordinate of the solution φ(j) of the equation

A(j, k)φ(j) = -γ(j),

where

A(j, k) = \begin{bmatrix} \gamma_j & \gamma_{j-1} & \cdots & \gamma_{j-k+1} \\ \vdots & & & \vdots \\ \gamma_{j+k-1} & \gamma_{j+k-2} & \cdots & \gamma_j \end{bmatrix}

and γ(j)' = (γ_{j+1}, …, γ_{j+k}).

Because of the non-identifiability when overfitted, the equations which characterize ARMA(p, q),

γ_l = -Σ_{j=1}^{p} φ_j γ_{l-j}    for l > q,

imply that the matrix A(j, k) is singular whenever j > q and k > p, and nonsingular otherwise. If j = 0, φ_k(j) reduces to an ordinary partial autocorrelation φ_k. Making use of this property, we can find the orders p and q as follows. In the estimated φ array, find the coordinates (q+1, p+1) which specify the North-West corner of the largest South-East sub-array all of whose elements are unstable but with zeros on the North edge. By a similar idea, the use of the Δ array,

Δ = [|A(j, k)|],

is proposed by Beguin, Gourieroux and Monfort [4]. Equivalent procedures are proposed by Chow [8], Graupe, Krause and Moore [10], or Woodside [29]. Since

|A(j, k)| = 0 if j > q and k > p, and ≠ 0 otherwise,

it is enough to find, in an estimated Δ array, the coordinates (q+1, p+1) which specify the North-West corner of the largest South-East sub-array all of whose elements are zeros. We can also construct a test statistic by considering the determinant of the above South-East sub-array [4]. More complicated statistics are proposed by Gray, Kelley and McIntire [11], called the S array and R array. For example, the (j, k) element of the S array is defined as

S_k(j) = {(-1)^k S^{k+1,1}}⁻¹,

where S^{l,m} is the (l, m) element of the inverse of

S = \begin{bmatrix} \gamma_j & \cdots & \gamma_{j+k-1} \\ \vdots & & \vdots \\ \gamma_{j+k} & \cdots & \gamma_{j+2k-1} \end{bmatrix}.

The S array has the following properties:

S_p(j) = (-1)^p (1 + Σ_{l=1}^{p} φ_l)    for all j > q-p,

S_p(j) = (-1)^{p+1} (1 + Σ_{l=1}^{p} φ_l) φ_p    for all j < -q-p,

S_k(-q-p-1) = ±∞    for any k > p,

S_k(j) = undefined for k > p, if j < -q-p or j > q-p.

Then, similarly as in the Δ array, we can find the coordinates (q+1, p+1) in the S array. It is known that some elements of the S array coincide with the partial autocorrelations (see Woodward and Gray [30]). An advantage of a selection procedure based on such generalized partial autocorrelations is that we can avoid unstable estimation of ARMA parameters when overfitted. It is, of course, at the risk of underfitting. In other words, such a procedure might be good for the aim (b) in Section 1.
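The corner method lends itself to a direct implementation: compute the array of determinants |A(j, k)| from (sample) autocovariances and look for the South-East corner of near-zero entries. A Python sketch (function name and indexing conventions are illustrative assumptions):

    import numpy as np

    def delta_array(gamma, jmax, kmax):
        # gamma[l] holds the (sample) autocovariance at lag l >= 0,
        # with negative lags obtained by symmetry; gamma must cover
        # lags up to jmax + kmax - 1.
        g = lambda l: gamma[abs(l)]
        D = np.full((jmax + 1, kmax + 1), np.nan)
        for j in range(jmax + 1):
            for k in range(1, kmax + 1):
                A = np.array([[g(j + l - m) for m in range(k)]
                              for l in range(k)])
                D[j, k] = np.linalg.det(A)
        return D

In the array computed from sample autocovariances, the North-West corner of the largest South-East sub-array of near-zero determinants estimates (q+1, p+1).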

3. Other statistics not specific for type of models

3.1. Based on likelihood

The dominant term of the maximum log likelihood is the one-step-ahead prediction error,

σ̂²_e = (1/n) Σ_{t=1}^{n} ê²_t.    (3.1)

However, non-identifiability of the ARMA model when overfitted causes a problem. If we want to estimate both transfer functions θ(B) and φ(B), these estimates are not only inconsistent but also unstable. For the aim (a) in Section 1, such inconsistency does not, however, cause many problems, since the maximum likelihood estimate of the transfer function k(B) = θ(B)⁻¹φ(B) is not far from the true one, even when overfitted [14]. Another way might be to use the Lagrangian multiplier test statistic, as is demonstrated in Poskitt and Tremayne [18]. By modifying the Fisher information matrix, we can avoid the problem of singularity, but for doing this we have to fix an alternative a priori. Therefore, such a statistic is not suitable for model selection.

3.2. Portmanteau test statistic

The portmanteau test statistic, or Q statistic, is the sum of squares of the serial correlations r_l(ê) of the residual sequence {ê_t},

T = n Σ_{l=1}^{h} r²_l(ê).

It is shown by Box and Pierce [6] that the above T is asymptotically distributed as χ²_{h-p-q} under the null hypothesis. To accelerate the speed of convergence to the asymptotic distribution, Ljung and Box [16] proposed a correction such as

T = n(n+2) Σ_{l=1}^{h} (n-l)⁻¹ r²_l(ê).    (3.2)

Detailed analysis of the distribution under the null hypothesis or alternatives can be found in [16]. The above statistic is the most natural for checking uncorrelatedness of residuals but, if our main concern is only in obtaining a good predictor in the sense of mean squared error, it might be checking too many things. In spite of the correction in (3.2), convergence is not so fast, since the statistic consists of fourth moments of the original process. A comparison with the Lagrangian multiplier test can be found in Godfrey [9].
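The corrected statistic (3.2) is simple to compute from the residuals. A Python sketch (function name illustrative):

    import numpy as np

    def ljung_box(resid, h):
        e = np.asarray(resid, dtype=float)
        e = e - e.mean()
        n = len(e)
        denom = e @ e
        r = np.array([(e[l:] @ e[:-l]) / denom for l in range(1, h + 1)])
        lags = np.arange(1, h + 1)
        return n * (n + 2) * np.sum(r**2 / (n - lags))   # (3.2)

Under the null hypothesis the value is referred to the chi-squared distribution with h - p - q degrees of freedom.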

3.3. Cross-validation

This kind of statistic was proposed by Stone [27] in the context of multiple regression. A formal extension yields the statistic

T = Σ_{t=1}^{n} {z_t - ẑ_t(t-)}²,

where ẑ_t(t-) is an interpolation of z_t, estimated from the observations excluding z_t. It generally requires laborious calculation. Not much is known about the behavior of this statistic, but it has a tendency toward overfitting, particularly when outliers exist.


4. Model selection

We can construct a selection procedure by using one of the test statistics introduced in the previous sections. However, it is not a good idea to repeat such testing for various p and q. If we do so, we first have to choose the many significance levels required, and the resulting power is a complicated function of the levels, as well as of the order of the testings. It is hard to get good control even of the overall type I error. As an alternative, we can consider the use of a homogeneous testing procedure, as in Krishnaiah [15]. By such testing, we can control the type I error well, but it still requires a lot of computation. A better principle might be to find a model which balances overfitting risks and underfitting risks. A typical way of realizing such balancing behavior is to select the p and q which minimize

C(p, q) = T + α(p + q).    (4.1)

Here, T is one of the test statistics introduced in the previous sections. The second term in (4.1) can be considered as a penalty term for the complexity of the model, and a term compensating for the random fluctuation of T. Since the expectation of T is p + q when T is distributed as χ² with p + q degrees of freedom, it is better to choose α greater than 1, so as to ensure a positive penalty for an increase of the degrees of freedom. In AIC [1], BIC [20], or φ [12], which are called criterion procedures, all criteria are of the form of (4.1) with T = -2 log(maximum likelihood). For such criteria, much discussion has taken place. The most controversial point is how to choose α, which is 2 in AIC, log n in BIC, and c loglog n for some c > 2 in φ. The choice of α depends on the aim of the selection. If our main concern is prediction, α should be chosen so as to yield less prediction error. If it is to identify a stable system, consistency is more important than the amount of the prediction error. In Shibata [21], these two aspects of model selection are demonstrated for the case of AIC, in the context of nested AR model fitting. It is shown that selection by the minimum AIC procedure has a tendency toward overfitting and is not consistent, but the increase of the prediction error is not so large, only of the order O(1/n) uniformly in φ₁, …, φ_p. Similar discussions are given for general α by Bhansali and Downham [5], or Atkinson [3]. Their conclusion is consistent on the point that α should be greater than 2 even if the prediction error is our main concern. An answer to the optimality question is given by Shibata [22] from the viewpoint of prediction error. He showed that the choice α = 2 is asymptotically optimal, under the assumption that the underlying process does not degenerate to a finite order AR process. This result, namely the 'asymptotic efficiency of the selection with α = 2', also applies to an autoregressive spectral


estimate [24]. Taniguchi [29] showed that Shibata's result holds true also for A R M A models. However, for the case of small samples, the above asymptotic theory does not work so well [23]. Recently Shibata [25] showed that the a p p r o x i m a t e minimax regret choice of ce is 2.8. The regret means how much the prediction error increases when a section procedure is applied, compared with the error when the true model is known. Further generalization of the A I C can be found in [26]. If we want to avoid overfitting in any case, a should be chosen greater than 2 loglog n but slower than n. This is the result of H a n n a n and Quinn [12]. The term 2 loglog n follows from the fact that the range of the random fluctuation of T is at most 2 loglog n from the law of iterated logarithm. It is interesting to note that the choice a = log n in BIC, which is derived from the viewpoint of Bayesian, satisfies the above condition. H a n n a n and Rissanen [13] proposed a practical way of selecting the orders p and q of A R M A by using one of the above consistent criteria. Assuming p = q, find m which minimizes C(m, m) in (4.1), then the m is asymptotically equal to max(p0, q0) of the true orders P0 and q0. Next assuming p = m or q = m, find p and q which minimize C(p, q), then we can find P0 and q0 consistently. A remaining problem in practice is how to choose P and O which specify the largest orders p and q. This is equivalent to the problem how to choose ' h ' of statistics in Section 2. This problem has not been analyzed well, but an analysis by Shibata [26] gives a rough guideline that we can choose any large P and Q, as long as the tail probability P(F,,+2,,_p_o> am/(m + 2)) is 2 close enough to P(x,,.z>am) for m = 1 , 2 , 3 . . . . . n - P O. As a final remark, we note that if a is chosen bounded, then actual penalty is seriously affected by small changes of T as well as changes of initial conditions. W e should choose a so as to compensate well any such changes. References [1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In: B. N. Petrov and F. Csfiki, eds., Second International Symposium on Information Theory, 267-281. Akadrmia Kiado, Budapest. [2] Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley, New York. [3] Atkinson, A. C. (1980). A note on the generalized information criterion for choice of a model. Biometrika 67, 413--418. [4] Beguin, J.-M., Gorieroux, C. and Monfort, A. (1980). Identification of a mixed autoregressive-moving average process: the corner method. In: O. D. Anderson, ed., Time Series, 423-435. North-Holland, Amsterdam. [5] Bhansali, R. J. and Downham, D. Y. (1977). Some properties of the order of an autoregressive model selected by a generalization of Akaike's E P F criterion. Biometrika 64, 547-551. [6] Box, G. E. P. and Pierce, D. A. (1970). Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. J. Amer. Statist. Assoc. 65, 1509--1526. [7] Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control. Holden-Day, New York. [8] Chow, J. C. (1972). On estimating the orders of an autoregressive moving-average process with uncertain observations. IEEE ?¥ans. Automat. Control AC-17, 707-709.



[9] Godfrey, L. G. (1979). Testing the adequacy of a time series model. Biometrika 66, 67-72.
[10] Graupe, D., Krause, D. J. and Moore, J. B. (1975). Identification of autoregressive-moving average parameters of time series. IEEE Trans. Automat. Control AC-20, 104-107.
[11] Gray, H. L., Kelley, G. D. and McIntire, D. D. (1978). A new approach to ARMA modeling. Comm. Statist. B7, 1-115.
[12] Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. J. Roy. Statist. Soc. Ser. B 41, 190-195.
[13] Hannan, E. J. and Rissanen, J. (1982). Recursive estimation of mixed autoregressive-moving average order. Biometrika 69, 81-94.
[14] Hannan, E. J. (1982). Fitting multivariate ARMA models. In: G. Kallianpur, P. R. Krishnaiah and J. K. Ghosh, eds., Statistics and Probability: Essays in Honor of C. R. Rao, 307-316. North-Holland, Amsterdam.
[15] Krishnaiah, P. R. (1982). Selection of variables under univariate regression models. In: P. R. Krishnaiah, ed., Handbook of Statistics, Vol. 2. North-Holland, Amsterdam.
[16] Ljung, G. M. and Box, G. E. P. (1978). On a measure of lack of fit in time series models. Biometrika 65, 297-303.
[17] Milhøj, A. (1981). A test of fit in time series models. Biometrika 68, 177-187.
[18] Poskitt, D. S. and Tremayne, A. R. (1981). An approach to testing linear time series models. Ann. Statist. 9, 974-986.
[19] Quenouille, M. H. (1947). A large-sample test for the goodness of fit of autoregressive schemes. J. Roy. Statist. Soc. Ser. B 11, 123-129.
[20] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-464.
[21] Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika 63, 117-126.
[22] Shibata, R. (1980). Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. Ann. Statist. 8, 147-164.
[23] Shibata, R. (1980). Selection of the number of regression parameters in small sample cases. In: Statistical Climatology, 137-148. Elsevier, Amsterdam.
[24] Shibata, R. (1981). An optimal autoregressive spectral estimate. Ann. Statist. 9, 300-306.
[25] Shibata, R. (1983). A theoretical view of the use of AIC. In: O. D. Anderson, ed., Time Series Analysis: Theory and Practice, Vol. 4, 237-244. Elsevier, Amsterdam.
[26] Shibata, R. (1984). Approximate efficiency of a selection procedure for the number of regression variables. Biometrika 71, 43-49.
[27] Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B 36, 111-133.
[28] Taniguchi, M. (1980). On selection of the order of the spectral density model for a stationary process. Ann. Inst. Statist. Math. 32A, 401-419.
[29] Woodside, C. M. (1971). Estimation of the order of linear systems. Automatica 7, 727-733.
[30] Woodward, W. A. and Gray, H. L. (1981). On the relationship between the S array and the Box-Jenkins method of ARMA model identification. J. Amer. Statist. Assoc. 76, 579-587.

E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5 © Elsevier Science Publishers B.V. (1985) 189-211

7

Estimation of Parameters in Dynamical Systems

Lennart Ljung

1. Introduction

By a dynamical system we mean a relationship between an 'input signal' and an 'output signal', such that the current input affects also future outputs. Such systems are important in a number of areas in engineering, applied science and econometrics. We may think of systems like paper machines, communication channels, ecological systems, national economies, etc. Often it is found necessary to include also disturbances of various character in order to obtain reasonable descriptions of the systems. A special case is the situation when no input is present, and the system is driven by the disturbances only. Time series can be described in that way, and will thus form a subclass of the systems discussed in this chapter. The nomenclature and the notation may differ between various application areas (in econometrics, for example, the input signal is usually called an 'exogenous variable'), and we shall here stick to notation that is common in the control theory area. In the different areas various types of models have been developed for describing systems. Some of these have been specifically designed to serve certain purposes, like Bode plots for control synthesis. Others are of more general character, e.g. differential and/or difference equations for simulation, analysis and various decision-making tasks. We shall discuss models of dynamical systems in somewhat more detail in Section 2. Various engineering disciplines have been developed for solving certain design problems based on a given model of a system (like control theory, filter design, signal processing, etc.). The applicability of the theory is thus critically dependent on the availability of good models. How does one construct good models of a given system? This question about the interface between the real world and the world of mathematics thus becomes crucial. The general answer is that we have to study the system experimentally and make some inference from the observations. In practice there are two main routes. One is to split up the system, figuratively speaking, into subsystems, whose properties are well understood from previous experience.



This basically means that we rely upon 'laws of Nature' and other well-established relationships that have their roots in earlier empirical work. These subsystems are then joined together mathematically, and a model of the whole system is obtained. This route is known as modelling, and does not necessarily involve any experimentation on the actual system. When a model is required of a yet unconstructed system (such as a projected aircraft), this is the only possible approach. The other route is based on experimentation. Input and output signals from the system are recorded and are then subjected to data analysis in order to infer a model of the system. This route is known as identification. It is often advantageous to try to combine the approaches of modelling and identification in order to maximize the information obtained from identification experiments and to make the data analysis as sensible as possible. In this chapter we shall discuss various techniques for the identification of dynamical systems, focusing on methods that have been used, and to some extent developed, in the control oriented community. We shall pay special attention to sequential, or recursive, methods, which refer to schemes that process the measurements obtained from the system continually, as they become available (Sections 7-11). We have found it suitable to present such recursive schemes as a natural development of off-line or batch identification methods, which assume that the whole data batch is available in each stage of the data processing. Therefore, we will spend a major part in exposing and explaining general ideas in identification (Sections 4-6). A particular problem with parameter estimation in dynamical systems is the multitude of possible models that are available. In Sections 2-3 we shall discuss a number of such possibilities, and also point out a unified framework for how to handle them. A very important problem, which is crucial for a successful application, is the choice of a family of candidate models for describing the systems. This problem is quite application-dependent, and we consider it outside the scope of this chapter to address the problem of choice of model set. This means that what we discuss most of the time is actually how to estimate parameters in a given model structure. For further discussion of the topics treated here we may refer to Goodwin and Payne (1977), Eykhoff (1974, 1981) and Ljung and Söderström (1983).

2. Time-domain models of dynamical systems

Describing dynamical systems in the time domain allows a considerable amount of freedom. Usually, differential (partial or ordinary) equations are used to describe the relationships between inputs and outputs. In discrete time (sampled-data systems), difference equations are used instead. The question of how to describe properties of various disturbance signals also allows for several different possibilities. Here we shall list a few typical choices, confining ourselves to the case of linear, discrete-time models.



The word model is sometimes used ambiguously. It may mean a particular description (with numerical values) of a given system. It may also refer to a description with several coefficients or parameters that are not fixed. In the latter case, it is more appropriate to talk about a model set: a set of models that is obtained as the parameters range over a certain domain.

Linear difference equations

Let the relationship between the input sequence {u(t)} and the output sequence {y(t)} be described by

y(t) + a_1 y(t−1) + ... + a_n y(t−n) = b_1 u(t−1) + ... + b_m u(t−m). (2.1)

Here the coefficients a_i and b_i are adjustable parameters. (A multivariable description would be quite analogous, with a_i and b_i as matrices.) We shall generally denote the adjustable parameters by a vector θ:

θ = (a_1, ..., a_n, b_1, ..., b_m)^T. (2.2)

If we introduce the vector of lagged inputs and outputs

φ(t) = (−y(t−1), ..., −y(t−n), u(t−1), ..., u(t−m))^T, (2.3)

(2.1) can be rewritten in the more compact form

y(t) = θ^T φ(t). (2.4)

In (2.1) or (2.4), the relationship between inputs and outputs is assumed to be exact. This may not be realistic in a number of cases. Then we may add a term v(t) to (2.1) or (2.4),

y(t) = θ^T φ(t) + v(t), (2.5)

which accounts for various noise sources and disturbances that affect the system, as well as for model inaccuracies. This term can be further modelled, typically by describing it as a stochastic process with certain properties. The simplest model of that kind is to assume {v(t)} to be white noise, i.e. a sequence of independent random variables with zero mean values. However, many other possibilities exist. Among the most common models is the following one.

ARMAX models

If the term {v(t)} in (2.5) is described as a moving average (MA) of white noise {e(t)}, we have a model

y(t) + a_1 y(t−1) + ... + a_n y(t−n) = b_1 u(t−1) + ... + b_m u(t−m) + e(t) + c_1 e(t−1) + ... + c_n e(t−n). (2.6)

Such a model is known as an ARMAX model.
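As a concrete illustration, here is a minimal Python sketch that simulates data from a first-order instance of the ARMAX model (2.6). The coefficient values, noise level and sample size are arbitrary choices made for the example, not values taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# First-order ARMAX model, cf. (2.6):
#   y(t) + a1*y(t-1) = b1*u(t-1) + e(t) + c1*e(t-1)
a1, b1, c1 = -0.7, 1.0, 0.5           # hypothetical 'true' parameters
N = 500
u = rng.standard_normal(N)            # input signal {u(t)}
e = 0.1 * rng.standard_normal(N)      # white noise {e(t)}

y = np.zeros(N)
for t in range(1, N):
    y[t] = -a1 * y[t - 1] + b1 * u[t - 1] + e[t] + c1 * e[t - 1]
```

The arrays y and u generated here are reused in the sketches further below.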

Output error models

Instead of adding the disturbance v(t) to the equation as in (2.5), it can be added as an output measurement error:

y(t) = x(t) + v(t), (2.7a)
x(t) + f_1 x(t−1) + ... + f_n x(t−n) = b_1 u(t−1) + ... + b_m u(t−m). (2.7b)

Such models are often called output error models. The 'noise-free output' x(t) is here not available for measurement, but given (2.7b) it can be reconstructed from the input. We denote by x(t, θ) the noise-free output that is constructed using the model parameters

θ = (f_1, ..., f_n, b_1, ..., b_m)^T, (2.8)

i.e.

x(t, θ) + f_1 x(t−1, θ) + ... + f_n x(t−n, θ) = b_1 u(t−1) + ... + b_m u(t−m). (2.9)

With

φ(t, θ) = (−x(t−1, θ), ..., −x(t−n, θ), u(t−1), ..., u(t−m))^T, (2.10)

(2.7) can be rewritten as

y(t) = θ^T φ(t, θ) + v(t). (2.11)

Notice the formal similarity to (2.5), but note the important computational difference!

State-space models

A common way of describing stochastic, dynamical systems is to use state-space models. Then the relationship between input and output is described by

x(t + 1) = F(θ)x(t) + G(θ)u(t) + w(t),
y(t) = H(θ)x(t) + e(t), (2.12)

where the noise sequences w and e are assumed to be independent at different time instants and to have certain covariance matrices. Unknown, adjustable parameters θ may enter the matrix elements in F, G and H in an arbitrary manner. These may, for example, correspond to canonical parametrizations (canonical forms) or to physical parameters in a time-continuous state-space description, which has been sampled to yield (2.12).
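The state-space form (2.12) is just as easy to simulate; the following sketch uses small hypothetical matrices F, G, H and Gaussian noises, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# State-space model (2.12):
#   x(t+1) = F x(t) + G u(t) + w(t),   y(t) = H x(t) + e(t)
F = np.array([[0.9, 0.1], [0.0, 0.8]])   # hypothetical parameter values
G = np.array([[1.0], [0.5]])
H = np.array([[1.0, 0.0]])

N = 300
u = rng.standard_normal((N, 1))
x = np.zeros((N + 1, 2))
y = np.zeros(N)
for t in range(N):
    w = 0.05 * rng.standard_normal(2)    # process noise w(t)
    e = 0.1 * rng.standard_normal()      # measurement noise e(t)
    y[t] = (H @ x[t])[0] + e
    x[t + 1] = F @ x[t] + G @ u[t] + w
```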

3. Models and predictors

The list of potential models and model sets can be made long. For our purposes it is useful to extract the basic features of models, so as to allow for a treatment of model sets in general. First we introduce the following notation:

ℳ(θ): a particular model, corresponding to the parameter value θ;
ℳ: a set of models: ℳ = {ℳ(θ) | θ ∈ D_ℳ ⊂ R^d};
z^t: the set of measured input-output data up to time t: z^t = {u(1), y(1), u(2), y(2), ..., u(t), y(t)}.

Similarly, u^t and y^t denote the input sequence and the output sequence, respectively, up to time t. The various models that can be used for dynamical systems all represent different ways of thinking about and representing relationships between measured signals. They have one feature in common, though. They all provide a rule for computing the next output or a prediction (or 'guess') of the next output, given previous observations. This rule is, at time t, a function from z^{t−1} to the space where y(t) takes its values (R^p in general). It will also be parametrized in terms of the model parameter θ. We shall use the notation

ŷ(t | θ) = g_ℳ(θ; t, z^{t−1}) (3.1)

for this mapping. The actual form of (3.1) will of course depend on the underlying model. For the linear difference equation (2.1) or (2.4), we will have

ŷ(t | θ) = θ^T φ(t). (3.2)



The same prediction or guess of the output y(t) will be used for the model (2.5) with disturbances, in case {v(t)} is considered as 'unpredictable' (like white noise). For the state-space model (2.12) the predictor function is given by the Kalman filter. Then g_ℳ is a linear function of past data. For the ARMAX model (2.6) a natural predictor is computed as

ŷ(t | θ) + c_1 ŷ(t−1 | θ) + ... + c_n ŷ(t−n | θ) = (c_1 − a_1) y(t−1) + ... + (c_n − a_n) y(t−n) + b_1 u(t−1) + ... + b_m u(t−m). (3.3)

Notice that this can be rewritten as

ŷ(t | θ) = θ^T φ(t, θ), (3.4a)
θ = (a_1, ..., a_n, b_1, ..., b_m, c_1, ..., c_n)^T, (3.4b)
φ(t, θ) = (−y(t−1), ..., −y(t−n), u(t−1), ..., u(t−m), ε(t−1, θ), ..., ε(t−n, θ))^T, (3.4c)
ε(t, θ) = y(t) − ŷ(t | θ). (3.4d)
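The predictor recursion (3.3) (equivalently (3.4)) is straightforward to implement. The sketch below computes one-step predictions and prediction errors for the first-order ARMAX case; the zero initial conditions are an arbitrary choice.

```python
import numpy as np

def armax_predict(y, u, a1, b1, c1):
    """One-step predictions via (3.3) for a first-order ARMAX model:
    yhat(t|theta) + c1*yhat(t-1|theta) = (c1 - a1)*y(t-1) + b1*u(t-1)."""
    N = len(y)
    yhat = np.zeros(N)
    eps = np.zeros(N)
    for t in range(1, N):
        yhat[t] = -c1 * yhat[t - 1] + (c1 - a1) * y[t - 1] + b1 * u[t - 1]
        eps[t] = y[t] - yhat[t]        # prediction error, cf. (3.4d)
    eps[0] = y[0] - yhat[0]
    return yhat, eps
```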

For the models (2.7)-(2.11), a natural predictor is also given by (3.4a) with θ and φ(t, θ) defined by (2.8)-(2.10). Notice that in this case the prediction is formed from past inputs only. We then have, formally,

ŷ(t | θ) = g_ℳ(θ; t, u^{t−1}). (3.5)

Such a model we call an output error model or a simulation model. We shall sometimes work with the general linear structure

ξ(t+1, θ) = 𝓕(θ) ξ(t, θ) + 𝓖(θ) z(t),
ŷ(t | θ) = 𝓗(θ) ξ(t, θ). (3.6)

Here we simply assume that the prediction ŷ is a linear function of past data z, and that this linear function can be realized with a finite-dimensional, time-invariant filter. Notice that the function g_ℳ(θ; t, ·) in (3.1) is a deterministic function from the observations z^{t−1} to the predicted output. All stochastic assumptions involved in the model descriptions (e.g. white noises, covariance matrices, Gaussianness) have only served as vehicles or 'alibis' to arrive at the predictor function. The prediction ŷ(t | θ) is computed from z^{t−1} at time t − 1. At time t the output y(t) is received. We can then evaluate how good the prediction was by computing

ε(t, θ) = y(t) − ŷ(t | θ). (3.7)



We shall call ε(t, θ) the prediction error at time t, corresponding to model ℳ(θ). This term will be the generic name for general model sets. Depending on the character of the particular model set, other names, for example, the (generalized) equation error, may be used. For a simulation model (3.5) it is customary to call the corresponding prediction error (3.7) the output error. We can also adjoin an assumption about the stochastic properties of the prediction error to the model ℳ(θ):

ℳ(θ): "Assume that the prediction error ε(t, θ) has the conditional (given z^{t−1}) probability density function (p.d.f.) f(t, θ, x) [i.e. P(ε(t, θ) ∈ B) = ∫_{x∈B} f(t, θ, x) dx]". (3.8)

Notice that in (3.8) there is an implied assumption of independence of the prediction errors, for different t, since the p.d.f. does not depend on z^{t−1}. A predictor model (3.1) adjoined with a probabilistic assumption (3.8) we shall call a probabilistic model.

4. Guiding principles behind identification methods

The problem now is to decide upon how to use the information contained in z^N to select a proper member ℳ(θ̂_N) in the model set that is capable of 'describing' the data. Formally speaking, we have to determine a mapping from z^N to the set ℳ:

z^N → ℳ(θ̂_N). (4.1)

Now, how can such a mapping be determined? We pointed out that the essence of a model of a dynamical system is its prediction aspect. It is then natural to judge the performance of a given model ℳ(θ*) by evaluating the prediction errors ε(t, θ*) given by (3.7). A guiding principle to form mappings (4.1) is thus the following one:

"Based on z^t compute the prediction error ε(t, θ) using (3.1) and (3.7). At time t = N, select θ̂_N so that the sequence of prediction errors ε(t, θ̂_N), t = 1, ..., N, becomes as small as possible."

The question is how to quantify what 'small' should mean. Two approaches have been taken. These will be treated in the following two subsections.

4.1. Criterion minimization techniques

We introduce the scalar measure

l(t, θ, ε(t, θ)) (4.2)



to evaluate 'how large' the prediction error ε(t, θ) is. Here l is a mapping from R × R^d × R^p to R, where d = dim θ, p = dim y. After having recorded data up to time N, a natural criterion of the validity of the model ℳ(θ) is

V_N(θ, z^N) = (1/N) Σ_{t=1}^{N} l(t, θ, ε(t, θ)). (4.3)

This function is, for given z^N, a well-defined, scalar-valued function of the model parameter θ. The estimate at time N, θ̂_N, is then determined by minimization of the function V_N(θ, z^N). This gives us a large family of well-known methods. Particular 'named' methods are obtained as special cases, corresponding to specific choices of model sets and criterion functions l(t, θ, ε), and sometimes particular ways of minimizing (4.3).

The least squares method

Choose l(t, θ, ε) = |ε|² and apply the criterion (4.3) to the difference equation model (2.5). Since the prediction is given by (3.2), we have the prediction error

ε(t, θ) = y(t) − θ^T φ(t).

The criterion function (4.3) thus becomes

V_N(θ, z^N) = (1/N) Σ_{t=1}^{N} |y(t) − θ^T φ(t)|², (4.4)

which we recognize as the familiar least squares criterion (see, e.g., Strejc, 1980). This function is quadratic in θ, which is a consequence of the prediction being linear in θ and the quadratic choice of criterion function. This means that an explicit expression for the minimizing element θ̂_N can be given:

θ̂_N = [Σ_{t=1}^{N} φ(t) φ^T(t)]^{−1} Σ_{t=1}^{N} φ(t) y(t). (4.5)
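In code, the explicit least squares estimate (4.5) is a linear regression on lagged data. A minimal sketch for the first-order case (the model y(t) + a1*y(t-1) = b1*u(t-1) + v(t) and the data arrays y, u are assumed):

```python
import numpy as np

def arx_least_squares(y, u):
    """Least squares estimate (4.5) with theta = (a1, b1)^T and
    regressor phi(t) = (-y(t-1), u(t-1))^T, cf. (2.3)."""
    Phi = np.column_stack([-y[:-1], u[:-1]])   # rows are phi(t)^T
    # Solves the normal equations [sum phi phi^T] theta = sum phi y(t)
    theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
    return theta
```

Using numpy.linalg.lstsq rather than forming the inverse in (4.5) explicitly is numerically preferable but gives the same estimate.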

A quadratic criterion is a common ad hoc choice also for general models. For multioutput systems this gives

l(t, θ, ε) = ½ ε^T Λ^{−1} ε. (4.6)

To arrive at other specific functions l, we could invoke, for example, the maximum likelihood idea.

The maximum likelihood method

For the probabilistic model (3.1), (3.8), the likelihood function can be



determined. Calculations show that

−(1/N) log P(y(N), y(N−1), ..., y(1)) = −(1/N) Σ_{t=1}^{N} log f(t, θ, ε(t, θ)). (4.7)

Maximizing the likelihood function is thus the same as minimizing the criterion (4.3) with

l(t, θ, ε) = −log f(t, θ, ε). (4.8)

For Gaussian prediction errors,

−log f(t, θ, ε) = const + ½ log det Λ_t(θ) + ½ ε^T Λ_t^{−1}(θ) ε, (4.9)

where Λ_t(θ) is the assumed covariance matrix for the prediction errors. If the covariance matrix Λ_t is supposed to be known (independent of θ), then the first two terms of (4.9) do not affect the minimization, and we have obtained a quadratic criterion like (4.6). The maximum likelihood method was introduced for ARMAX models in Åström and Bohlin (1965). For the least squares case it was possible to give an explicit expression for the parameter estimate. This is not the case in general. Then the criterion function (4.3) must be minimized using numerical search procedures. We shall comment more on this later. We shall, following Ljung (1978), use the general term prediction error identification methods for the procedures we described in this section (see also Åström, 1980). When applied to the special simulation model (3.5), the term output error methods might be preferred.
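When no closed-form minimizer exists, e.g. for the output error model (2.7), the criterion (4.3) has to be minimized numerically; a rough sketch using a general-purpose optimizer (first-order model, arbitrary starting point):

```python
import numpy as np
from scipy.optimize import minimize

def oe_criterion(theta, y, u):
    """Quadratic prediction error criterion (4.3), (4.6) for the first-order
    output error model x(t) + f1*x(t-1) = b1*u(t-1), yhat(t|theta) = x(t, theta)."""
    f1, b1 = theta
    x = np.zeros(len(y))
    for t in range(1, len(y)):
        x[t] = -f1 * x[t - 1] + b1 * u[t - 1]   # noise-free output, cf. (2.9)
    return np.mean((y - x) ** 2)

# theta_hat = minimize(oe_criterion, x0=[0.0, 0.0], args=(y, u)).x
```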

4.2. Correlation techniques

Another way of expressing that the sequence {ε(t, θ)} is small is to require that it be uncorrelated with a given sequence {ζ(t)}. Let the vector ζ(t) represent some information that is available at time t − 1:

ζ(t) = ζ(t, z^{t−1}). (4.10)

Sometimes there is reason to consider a more sophisticated variant, where ζ itself may depend on the parameter θ. (Some such cases will be discussed below.)

ζ(t) = ζ(t, θ, z^{t−1}). (4.11)

The rationale for requiring ε(t, θ) and ζ(t) to be uncorrelated is the following: the predictors ŷ(t | θ) should ideally utilize all available information at time t − 1. Thus the prediction errors ε(t, θ) should be uncorrelated with such information. (If they are not, more information can be squeezed out from z^{t−1}.) We thus determine θ̂_N as the solution of

f_N(θ, z^N) = 0 (4.12a)

with

f_N(θ, z^N) = (1/N) Σ_{t=1}^{N} ε(t, θ) ζ(t), (4.12b)

where, normally, the dimension of ζ is such that (4.12) gives a system of equations that is compatible with the dimension of θ. When (4.12) is applied to the model (3.2), the well-known instrumental variable method results. The vector ζ is then known as the instruments or the instrumental variables. See Young (1970) and Söderström and Stoica (1981) for a further discussion of this method.

How to choose ζ(t)?

A way to make the estimate θ̂ insensitive to the characteristics of the noise that affects the system is to choose ζ to depend on past inputs only,

ζ(t) = ζ(t, u^{t−1}). (4.13)

Then that contribution to ε(t, θ) that has its origin in the noise will be uncorrelated with ζ for all θ. Choices (4.13) are typical for the instrumental variable method. It turns out that the choices giving the best accuracy of the obtained estimates are obtained when u is filtered through filters associated with the true system (see Söderström and Stoica, 1981). We then have

ζ(t) = ζ(t, θ, u^{t−1}). (4.14)

For models that can be written as (3.4a) (like the ARMAX model (2.6) and the output error model (2.7)-(2.11)), a natural choice is

ζ(t, θ, z^{t−1}) = φ(t, θ). (4.15)

Notice also that if we choose

ζ(t, θ, z^{t−1}) = −ψ(t, θ) Λ^{−1}, (4.16)

where

ψ(t, θ) = −(d/dθ) ε(t, θ), (4.17)

we find that (4.12) will define the stationary points of the criterion (4.3), (4.6). The criterion minimization approach can thus be seen as a special case of (4.12), from this point of view.
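For the linear regression model (3.2), the correlation equations (4.12) with instruments of the type (4.13) can be solved directly. A sketch with delayed inputs as instruments (a simple, though not optimal, choice):

```python
import numpy as np

def arx_instrumental_variables(y, u):
    """Solve (4.12) for y(t) = theta^T phi(t) + v(t), phi(t) = (-y(t-1), u(t-1))^T,
    with instruments zeta(t) = (u(t-2), u(t-1))^T built from past inputs only."""
    Phi = np.column_stack([-y[1:-1], u[1:-1]])   # phi(t) for t = 3, ..., N
    Zeta = np.column_stack([u[:-2], u[1:-1]])    # zeta(t) = (u(t-2), u(t-1))^T
    # Sample version of (4.12): (1/N) sum zeta(t) (y(t) - theta^T phi(t)) = 0
    return np.linalg.solve(Zeta.T @ Phi, Zeta.T @ y[2:])
```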



5. Asymptotic properties of the estimates

It is an important problem to investigate the properties of the estimates defined by (4.3) and (4.12). Since the data z^t typically are described as realizations of stochastic processes, the analysis has to be performed in a probabilistic setting. It is a difficult problem to derive the finite sample properties of these estimates, i.e. the properties for finite N. It is easier to establish what happens asymptotically as N tends to infinity. Such analysis basically relies upon (non-standard versions of) the law of large numbers and the central limit theorem. For the current problem formulation, the analysis is carried out in Ljung (1978) and Ljung and Caines (1979). The result is that

θ̂_N → θ* with probability one as N → ∞, (5.1)

where

θ* = arg min_{θ ∈ D_ℳ} V̄(θ), (5.2)
V̄(θ) = lim_{N→∞} E V_N(θ, z^N), (5.3)

and

√N (θ̂_N − θ*) ∈ AsN(0, P), (5.4)

where

P = Q^{−1} H Q^{−1}, (5.5)
H = lim_{N→∞} E N V′_N(θ*, z^N) [V′_N(θ*, z^N)]^T, (5.6)
Q = V̄″(θ*). (5.7)

Here θ̂_N is the estimate defined by the minimizing argument of V_N given by (4.3). Prime and double prime denote differentiation, once and twice respectively, with respect to θ. Expectation E is over the stochastic process z^N. Equation (5.4) means that the random variable √N (θ̂_N − θ*) converges in distribution to the normal distribution with zero mean and covariance matrix P. An analogous result holds for the estimate defined by the solution of (4.12).
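The statements (5.1)-(5.4) can be illustrated by simulation: over repeated independent realizations, the spread of √N (θ̂_N − θ*) stays roughly constant as N grows. A quick Monte Carlo sketch for the least squares estimate of a first-order ARX model (all numerical values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
a1, b1 = -0.7, 1.0                    # assumed true parameters (theta*)

def estimate(N):
    u = rng.standard_normal(N)
    v = 0.1 * rng.standard_normal(N)  # white equation error
    y = np.zeros(N)
    for t in range(1, N):
        y[t] = -a1 * y[t - 1] + b1 * u[t - 1] + v[t]
    Phi = np.column_stack([-y[:-1], u[:-1]])
    return np.linalg.lstsq(Phi, y[1:], rcond=None)[0]

for N in (100, 400, 1600):
    est = np.array([estimate(N) for _ in range(200)])
    print(N, np.sqrt(N) * est.std(axis=0))   # approx. constant over N, cf. (5.4)
```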

6. Numerical schemes for determining the estimates

Above we described two principles for identification methods, namely to minimize V_N(θ, z^N) in (4.3) or to solve f_N(θ, z^N) = 0 in (4.12). In many cases these functions may be fairly complex, and it is not obvious how to actually obtain the estimate in practice. Such questions will be discussed in this section (see also Åström and Bohlin, 1965; Gupta and Mehra, 1974).



6.1. General schemes

For the minimization of (4.3), the gradient of the criterion will play an important role. Let us therefore introduce the notation

ψ(t, θ) = (d/dθ) ŷ(t | θ) = −(d/dθ) ε(t, θ) (6.1)

(ψ is a d × p matrix) for the gradient of the prediction with respect to θ. Then

V′_N(θ, z^N) = (1/N) Σ_{t=1}^{N} [l′_θ(t, θ, ε(t, θ)) − ψ(t, θ) l′_ε(t, θ, ε(t, θ))] (6.2)

(a d × 1 vector) and, in the quadratic case (4.6),

V′_N(θ, z^N) = −(1/N) Σ_{t=1}^{N} ψ(t, θ) Λ^{−1} ε(t, θ). (6.3)

Standard search routines for numerical minimization of functions can now be applied to (4.3). The general descent method is

θ̂_N^{(i+1)} = θ̂_N^{(i)} − μ_N^{(i)} [R_N^{(i)}]^{−1} V′_N(θ̂_N^{(i)}, z^N), (6.4)

where θ̂_N^{(i)} denotes the ith iterate when solving for the minimizing value θ̂_N. The number μ is chosen so that

V_N(θ̂_N^{(i+1)}) < V_N(θ̂_N^{(i)}).
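A concrete instance of the iteration (6.4) for the first-order ARMAX model, with the gradient (6.3), the Gauss-Newton choice R_N = (1/N) Σ ψ(t) ψ(t)^T, and step-length halving for μ, is sketched below. This is an illustrative implementation under those particular choices, not the only possibility.

```python
import numpy as np

def armax_eps(theta, y, u):
    """Prediction errors: eps(t) = y(t) + a1*y(t-1) - b1*u(t-1) - c1*eps(t-1)."""
    a1, b1, c1 = theta
    eps = np.zeros(len(y))
    for t in range(1, len(y)):
        eps[t] = y[t] + a1 * y[t - 1] - b1 * u[t - 1] - c1 * eps[t - 1]
    return eps

def V_N(theta, y, u):
    """Quadratic criterion (4.3), (4.6) with Lambda = 1."""
    return np.mean(armax_eps(theta, y, u) ** 2)

def pem_descent(y, u, theta0, iters=20):
    """Damped descent of the form (6.4); theta = (a1, b1, c1)."""
    theta = np.asarray(theta0, dtype=float)
    N = len(y)
    for _ in range(iters):
        a1, b1, c1 = theta
        eps = armax_eps(theta, y, u)
        psi = np.zeros((N, 3))            # psi(t) = -d eps(t)/d theta, cf. (6.1)
        for t in range(1, N):
            psi[t, 0] = -y[t - 1] - c1 * psi[t - 1, 0]
            psi[t, 1] = u[t - 1] - c1 * psi[t - 1, 1]
            psi[t, 2] = eps[t - 1] - c1 * psi[t - 1, 2]
        grad = -psi.T @ eps / N           # V'_N, cf. (6.3)
        step = np.linalg.solve(psi.T @ psi / N, grad)
        mu = 1.0                          # halve mu until the criterion decreases
        while V_N(theta - mu * step, y, u) > V_N(theta, y, u) and mu > 1e-8:
            mu /= 2
        theta = theta - mu * step
    return theta
```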
(iv) For every k ∈ π(Θ̄_α^{(3)}) the k-equivalence class in Θ̄_α^{(3)} is an affine subspace.
(v) U_α^{(3)} is open in Ū^{(3)}.
(vi) π(Θ̄_α^{(3)}) ⊂ Ū_α^{(3)} and equality holds for s = 1.
For more general a priori restrictions, results analogous to Theorem 6.1 are not yet available. Again, results analogous to Remarks 1 and 3 after Theorem 4.1 hold.

7. The relation to estimation

We now discuss the implications of the preceding results concerning the properties of the parametrizations for the process of identification. Here we concentrate on ARMA representations, as the results for state-space representations are analogous. First let us consider the case where an α determining one of the parameter spaces Θ^{(2)}, Θ_α^{(3)} or Θ_α is already given. The common estimation procedures in this case are the (Gaussian) maximum likelihood estimators (MLE) or related methods (e.g. prediction error estimation). We here discuss the MLE as the prototype procedure. Let T denote the sample size, let y_T = (y′(1), ..., y′(T))′ be the observations, and let Γ_T(θ, σ(Σ)) denote the sT × sT covariance matrix given by

Γ_T(θ, σ(Σ)) = (E y(r) y′(s))_{r,s = 1, ..., T},

where the spectral density matrix f_y is determined by the parameter vectors θ and σ(Σ). Then −2T^{−1} times the log of the (Gaussian) likelihood is given, up to a constant, by

L_T(θ, σ(Σ)) = T^{−1} log det Γ_T(θ, σ(Σ)) + T^{−1} y_T′ Γ_T^{−1}(θ, σ(Σ)) y_T. (7.1)

Here θ is an element of either Θ^{(2)}, of Θ_α^{(3)} or of Θ_α. We use Θ_α, for short, to



cover all three cases. (Even more generally, Θ_α could be an identifiable set of MFD's with bounded degrees and with an additional technical assumption imposed.) In this section, (2.4) and (2.7) are assumed throughout, without taking this into account in our notation. Let U_α = π(Θ_α) be the corresponding set of transfer functions, and let φ_α: U_α → Θ_α denote the corresponding parametrization. As L_T(θ, σ(Σ)) depends on θ only via π(θ), a 'coordinate-free' likelihood depending on k (and on σ(Σ)) may be defined. In the process of optimization of the likelihood, the possibility that the optimum is attained at certain boundary points cannot be excluded, and this is one reason to define the coordinate-free likelihood L_T(k, σ(Σ)) as a function with domain Ū_α × {σ(Σ) | Σ > 0} (where again (2.4) and (2.7) have been imposed) rather than with domain U_α × {σ(Σ) | Σ > 0}. Note that this coordinate-free likelihood is introduced for mathematical convenience, as some statistical properties do not depend on the underlying parametrization; the actual optimization of the likelihood, however, has to be performed in suitable coordinates. A reason for the introduction of the coordinate-free likelihood is the following consistency result [19, 33]: under the additional ergodicity requirement

lim_{T→∞} (1/T) Σ_{t=1}^{T} y(t + s) y′(t) = E y(s) y′(0) (a.s.)

and if the true transfer function k_0 is in U_α, then the MLEs k̂_T and Σ̂_T for k_0 and Σ_0 (obtained by optimizing L_T over Ū_α × {σ(Σ) | Σ > 0}) are strongly consistent, i.e. k̂_T → k_0 (in T_pt) a.s. and Σ̂_T → Σ_0 (where Σ_0 is the true matrix) a.s. This result, together with the properties of the parametrizations discussed in the previous sections, has the following implications for parameter estimation: let k̂_T → k_0 (in T_pt), k̂_T, k_0 ∈ Ū_α (where k̂_T is not necessarily the MLE); then we can distinguish three different cases [13, 16]:
(i) If k_0 ∈ U_α, then, by the openness of U_α in Ū_α, k̂_T will be in U_α too, from a certain T_0 onwards. From this T_0 onwards, the parameter estimates φ_α(k̂_T) = θ̂_T are uniquely defined and, by the continuity of φ_α, we have θ̂_T = φ_α(k̂_T) → φ_α(k_0) = θ_0; thus, for example, the MLEs θ̂_T are strongly consistent in this case.
(ii) Let k_0 ∈ π(Θ̄_α) − U_α. Then k_0 is represented by an equivalence class in Θ̄_α − Θ_α (along this equivalence class the likelihood defined on Θ̄_α, for fixed Σ, is constant). If in addition suitable prior bounds are imposed on the norm of the elements in Θ̄_α, then the (not necessarily unique) parameter estimates θ̂_T (i.e. π(θ̂_T) = k̂_T) will converge to the 'true' equivalence class. Whether the algorithm will search along this class or whether the θ̂_T converge to a certain point in the equivalence class depends on the actual estimation procedure used. Of course, reparametrization with a suitable β < α, such that k_0 ∈ U_β, leads to the 'well-posed' situation described in (i).



(iii) The situation k_0 ∈ Ū_α − π(Θ̄_α) can only occur in the multivariable case (s > 1). In this case, k_0 corresponds to the 'point of infinity' of Θ_α, in the sense that even if k̂_T ∈ U_α, T ∈ N, then k̂_T → k_0 implies that the norm of the parameter estimates φ_α(k̂_T) will tend to infinity. In the special case of the overlapping parametrization of M(n), when U_α = U_α^{(2)}, this situation occurs if either k_0 has order n and we have chosen the wrong local coordinates (i.e. k_0 ∉ U_α) or k_0 has order smaller than n but cannot be described in a Θ_β^{(2)} such that β < α. Also in this situation, a suitable reparametrization leads to case (i). If k_0 ∈ U_α but 'near' to a point in Ū_α − U_α, similar problems (in finite samples) may arise. In this case the matrices determining the parametrizations are ill conditioned, and thus φ_α, although being continuous, is very distorting, in the sense that a 'small' variation of the transfer functions causes a 'large' variation of the parameters θ. The discussion in (ii) and (iii) may be considered as an analysis of the behavior of the parameter estimates in the case of a wrong dynamic specification. Of course, there is also another case of wrong dynamic specification, namely when k_0 ∉ Ū_α, i.e. when the observations do not correspond to a system in the model class (underfitting). In this case, of course, we cannot have consistency of the estimates. However, the maximum likelihood type estimates still have an optimality property: they converge to the set (consisting of more than one element in general) in Ū_α corresponding to the best linear one-step-ahead predictors for the process generating the data [44]. Now let us turn to the problem of inference of integer-valued parameters for the dynamic specification of the submodel. There are two main inference principles in this case, namely information criteria like AIC or BIC and criteria based on the inference of the linear dependence relations in the Hankel matrix H. We mainly consider the case of the overlapping parametrization of M(n). Here both the order n and appropriate local coordinates given by α have to be determined. The reason why MLEs do not give reasonable results in order estimation (and in related problems) is as follows: since M̄(n_0) ⊂ M̄(n_1) for n_0 < n_1 and M(n_1) is 'almost all' of M̄(n_1), the MLE over M̄(n_1) will be attained 'almost surely' in M(n_1), even if n_0 is the true order. One way to overcome this notorious tendency of the MLE to overestimate the true order (more precisely, to attain its value at the maximum prescribed order) is to add a penalty term, taking into account the dimension of the parameter space. This leads to estimation criteria of the form

A_T(n) = log det Σ̂_T(n) + d · C(T)/T,   n = 0, ..., N,

where Σ̂_T(n) is the MLE of Σ over M̄(n) × {σ(Σ) | Σ > 0}, N is the maximum prescribed order, and d = 2ns is the dimension of the parameter space. C(T) has to be prescribed. If C(T) = 2, then A_T(n) is called AIC [2, 3]. If C(T) = c log T, then A_T(n) is called BIC [54]. The estimates n̂_T of the order are obtained by minimizing A_T(n), n = 0, ..., N. Consistency of the minimum BIC estimate n̂_T has been shown in [30, 31]. BIC, defined over U_α^{(3)} with d given by Theorem 4.1(i), also gives consistent estimates of the Kronecker indices [34]. Minimum AIC estimates of n are not consistent; AIC was designed to satisfy another optimality criterion [58]. Closely related to these estimation methods are likelihood ratio or Lagrange multiplier tests for the order [51, 52].
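For a univariate AR model the criterion A_T(n) is easy to compute. The sketch below fits AR(n) by least squares for each candidate order (a standard shortcut standing in for the exact Gaussian MLE, an assumption made here for brevity) and compares C(T) = 2 (AIC) with C(T) = log T (BIC).

```python
import numpy as np

def select_order(y, N_max, C):
    """Minimize A_T(n) = log sigma2_hat(n) + n * C(T) / T over n = 0..N_max
    (scalar case s = 1, so the penalty is proportional to n)."""
    T = len(y)
    crit = []
    for n in range(N_max + 1):
        if n == 0:
            sigma2 = np.mean((y - y.mean()) ** 2)
        else:
            Phi = np.column_stack([y[n - k - 1:T - k - 1] for k in range(n)])
            theta = np.linalg.lstsq(Phi, y[n:], rcond=None)[0]
            sigma2 = np.mean((y[n:] - Phi @ theta) ** 2)
        crit.append(np.log(sigma2) + n * C(T) / T)
    return int(np.argmin(crit))

# n_aic = select_order(y, 10, lambda T: 2.0)
# n_bic = select_order(y, 10, lambda T: np.log(T))
```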

For estimation of the local coordinates, measures of the conditioning of the estimate of H are used [47]. In principle, all the integer-valued parameters discussed here could be inferred from an investigation of the linear dependence relations in H, where H is estimated, e.g. by a 'long' autoregression. However, in most practical applications this seems to be a fairly tedious procedure. As has been pointed out in [34], in practical applications, for s > 3, both the large dimension of the parameter spaces and the large number of neighborhoods that have to be considered may cause great problems. Each optimization of the likelihood itself is a fairly costly procedure and, if N is the maximum prescribed order, we have to search over (N+s choose s) neighborhoods U_α^{(2)} (Σ_i n_i = n, 0 ≤ n_i).

Let {S(u), u ≥ 0} be a family of bounded linear mappings on L²(P) such that (i) S(u + v) = S(u)S(v), u, v ≥ 0, S(0) = identity, (ii) ‖S(u)f‖ ≤ c‖f‖ for some c > 0, so that

‖τ_s X_t‖ ≤ c‖X_t‖. (17)

In the stationary case, c = 1 and there is equality in (17). As an easy consequence of (16), one will have τ_s τ_u = τ_{s+u}, and hence on the relevant subspace ℋ, {τ_s, s ∈ T} should form a semigroup. Since this is true for the stationary case (with τ_s unitary) and since one wants to include some nonstationary processes, it is natural to look for the τ_s family, with some structure, at least as a normal operator semigroup, i.e. {τ_s, s ∈ T} should satisfy the commutativity relations τ_s τ_s* = τ_s* τ_s (τ_s* is the adjoint of τ_s). Let us find out the possible nonstationary processes admitted under such an assumption, since the stationary class is automatically included (because every unitary operator is normal). The mathematical detail will be minimized here. Let {τ_s, s ≥ 0} be a bounded semigroup of normal shifts on {X_t, t ≥ 0} such that ‖τ_s X − X‖ → 0 as s → 0 for each X ∈ ℋ, the closed span of the X_t in L²_0(P). In order to include the unitary (or equivalently the stationary) case, τ_s should not be assumed self-adjoint; thus normality is the next reasonable generalization. [Also, the condition that ‖τ_s X − X‖ → 0 is known to be equivalent to the strong continuity of τ_s for s > 0 and the boundedness of τ_s(ℋ) in ℋ; this is thus a technical hypothesis.] Let A_h = (τ_h − I)/h, h > 0. Then A_h is a bounded normal transformation for each h. It is a consequence of the classical theory of such semigroups that for each X ∈ ℋ, one has

τ_s X = lim_{h→0} e^{sA_h} X, (18)

the limit existing in the metric of ℋ, uniformly in s on closed intervals [0, a], a > 0. On the other hand, for each h > 0, A_h is a bounded normal operator on the Hilbert space ℋ. Hence one can invoke the standard spectral theorem, according to which there exists a 'resolution of the identity', {E_h(A), A ⊂ C}, such that

A_h X = ∫_C z E_h(dz) X, X ∈ ℋ_0, (19)

where the integral is a vector integral and A ↦ E_h(A) X ∈ ℋ gives a vector measure μ_X. Here ℋ_0 ⊂ ℋ is the subspace for which the integral exists, i.e. z is μ_X-integrable for X ∈ ℋ_0. But from the same theory one can also deduce that

e^{sA_h} X = ∫_C e^{sz} E_h(dz) X, X ∈ ℋ_1 ⊂ ℋ_0, (20)



for which e^{sz} is μ_X-integrable. If y* ∈ ℋ*, then y* E_h(·) X is a signed measure in (20), and if y* = X (∈ ℋ* ≅ ℋ), then it is a positive bounded measure for each h, so that one can invoke the Helly selection principle and then the Helly-Bray theorem in one of its forms to conclude that lim_{h→0} y* E_h(·) X converges to some ν_{X,y*}, a signed measure. This may be represented as y* F(·) X for an F(·) which has properties analogous to those of E_h(·). Here the argument, which is standard in spectral theory, needs much care and detail. With this, one can take limits in (20) as h → 0 and interchange the limit with the integral to get

τ_s X = lim_{h→0} e^{sA_h} X = ∫_C e^{sz} F(dz) X. (21)

Thus the measure F(·)X is orthogonally scattered and is supported by the intersection of the spectral sets of A_h, h > 0. It now follows that, if X_s = τ_s X_0, then by (21) with X = X_0 (∈ ℋ_1) there, one gets

X_s = ∫_C e^{sz} Z(dz), s ≥ 0, (22)

where Z(·) on C is an L²_0(P)-valued orthogonally scattered measure. The covariance function r of this process is given by

r(s, t) = E(X_s X̄_t) = ∫_C exp(sz + t z̄) G(dz), (23)

with G(A ∩ B) = E(Z(A) Z̄(B)). If S = C and ν = G in (13), one sees that {X_s, s ≥ 0} is a Karhunen process relative to f(s, ·), s ≥ 0, f(s, z) = e^{sz}, and the finite positive measure G such that f(s, ·) ∈ L²(C, G), s ≥ 0. If C is replaced by its imaginary axis, and for s < 0 the process is extended with X_s = τ*_{−s} X_0, then the stationary case is recovered (cf. (2)). That (23) is essentially the largest such subclass of Karhunen processes admitting shifts again involves further analysis, and this was shown by Getoor [9] in some detail. Thus the Karhunen class contains a subset of nonstationary processes which admit shift operations on them and also a subset of nonstationary processes (namely the harmonizable class) which do not admit such transformations. Since the representing measures in (3) and (23) or (13) are of a different character (a complex 'bimeasure' in (3) and a regular signed measure in (13)), a study of Karhunen processes becomes advantageous for a structural analysis of various stochastic models. On the other hand, (3) shows a close relationship of some processes with a possibility of employing the finer Fourier analytic methods, giving perhaps a more detailed insight into their behavior. Thus both

of these viewpoints are pertinent in understanding many nonstationary phenomena.

4. Cramér class

After seeing the work of the preceding two sections it is natural to ask whether one can define a more inclusive nonstationary class incorporating and extending the ideas of both Karhunen and Loève. Indeed, the answer is yes, and such a family was already introduced by Cramér in 1951 [6]; a brief description of it is in order. This also has an independent methodological interest, since it results quite simply under linear transformations of Karhunen classes in much the same way that harmonizable families result under similar mappings from the stationary ones. One says that a function F on T × T into C is locally of (Fréchet) variation finite if the restriction of F to each finite proper subrectangle I × I of T × T has the (Fréchet) variation finite, I ⊂ T being a finite interval. Let us now state the concept in:

DEFINITION. A second-order process {X_t, t ∈ T} ⊂ L²_0(P) is of Cramér class (or class(C)) if its covariance function r is representable as

r(t_1, t_2) = ∫_S ∫_S g(t_1, λ) ḡ(t_2, λ′) ν(dλ, dλ′), t_i ∈ T, i = 1, 2, (24)

relative to a family {g(t, ·), t ∈ T} of Borel functions and a positive definite function ν of locally bounded variation on S × S, S being a subset of T (or more generally a locally compact space), and each g satisfying the (Lebesgue) integrability condition that the integral in (24) exists for t_1 = t_2 = t, t ∈ T.

lim_{n→∞} r_n^{(p)}(h) = r̃(h), p ≥ 1, (39)

where

r_n^{(p)}(h) = (1/n) Σ_{ν=1}^{n} r_ν^{(p−1)}(h), r_n^{(1)}(h) = r_n(h).

The analog for the case that T = R can similarly be given. Since in (34) r_n(·) is positive definite, it is seen easily that r_n^{(p)}(·) is also positive definite. Hence r̃(·) satisfies the same hypothesis and (35) holds, so that the representing H(·) may now be called a pth-order associated spectrum. The classical results on summability imply that if r_n^{(p)}(h) → r̃(h), then r_n^{(p+1)}(h) → r̃(h) for each integer p ≥ 1, but the converse implication is false. Hence class(KF) ⊂ class(KF, p) ⊂ class(KF, p + 1), and the inclusions are proper. Thus one has an increasing sequence of classes of nonstationary processes, each having an associated spectrum. The computations given for (38) show that the preceding example does not belong even to the class ∪_{p≥1} class(KF, p). This also indicates that weakly harmonizable processes form a much larger class than the strongly harmonizable ones, and are not included in the last union. It should be remarked here that a further extension of the preceding class is obtainable by considering the still weaker concept of Abel summability. The consequences of such an extension are not yet known, and perhaps should be investigated in the future. The general idea behind the class(KF, p), p ≥ 1, is that if the given process is not stationary, then some averaging, which is a smoothing operation, may give an insight into the structure by analyzing its associated spectrum. Moreover, if {X_t, t ∈ R} ∈ class(KF) and f is any Lebesgue integrable scalar function on R, then the convolution of f and the X_t process is again in class(KF) whenever the function φ defined by φ(t) = [E(|X_t|²)]^{1/2} is in L^q(R) for some 1 ≤ q ≤ ∞. Then

Y_t = (f * X)_t = ∫_R f(t − s) X_s ds, t ∈ R, (40)

where the integral is a vector (or Bochner) integral, gives {Y_t, t ∈ R} ∈ class(KF). Thus class(KF) itself is a large family. This example is a slight extension of one indicated in [31].
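The averaging behind (39) is simple to carry out numerically. The sketch below forms sample covariances r_n(h) of a zero-mean series (the sample quantity standing in for the model covariance, an assumption made for illustration) and their repeated Cesàro means, whose convergence as n grows is what membership in class(KF, p) asserts.

```python
import numpy as np

def cesaro_covariances(x, h, p=1):
    """r_n(h) = (1/n) * sum_{t=1}^{n} x(t+h)*x(t), and the Cesaro means
    r_n^{(p)}(h) = (1/n) * sum_{nu=1}^{n} r_nu^{(p-1)}(h), r_n^{(1)} = r_n,
    cf. (39)."""
    n_max = len(x) - h
    r = np.cumsum(x[h:h + n_max] * x[:n_max]) / np.arange(1, n_max + 1)
    for _ in range(p - 1):               # each pass raises the Cesaro order by one
        r = np.cumsum(r) / np.arange(1, n_max + 1)
    return r                             # inspect the tail of r for convergence
```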

7. The Cramér-Hida approach and multiplicity

In the previous discussion of Karhunen and Cramér classes, it was noted that each {X_t, t ∈ T} admits an integral representation such as (26) relative to a family {g(t, ·), t ∈ T} and a stochastic measure Z(·) on the spectral set S into L²_0(P). Both g(t, u) and Z(du) can be given the following intuitive meaning,



leading to another aspect of the subject. Thus X_t may be considered as the intensity of an electrical circuit measured at time t, Z(du) as a random (orthogonal) impulse at u, and g(t, u) as a response function at time u but measured at a later time t. So X_t is regarded as the accumulated random innovations up to t. This will be realistic provided the effects are additive and g(t, u) = 0 if u > t. Hence (26) should be replaced by

X_t = ∫^t g(t, u) Z(du), t ∈ T. (41)

Since in (26) the g there need not satisfy this condition, that formula does not generally reduce to (41). So one should seek conditions on a subclass of Karhunen processes admitting a representation of the type (41), which clearly has interesting applications. Such a class will be discussed together with some illustrations. First it is noted that each process {X_t, t ∈ T} ⊂ L²_0(P), assumed to be left continuous with right limits (i.e. for each t ∈ T, E(|X_t − X_{t−h}|²) → 0 as h → 0+, and there is an X̃_t such that E(|X̃_t − X_{t+h}|²) → 0 as h → 0+, denoted X̃_t = X_{t+0}), can be decomposed into a deterministic and a purely nondeterministic part (defined below). The deterministic component does not change from the remote past, so that it has no real interest for further stochastic analysis such as in prediction and filtering problems. Thus only the second component has to be analyzed for a possible representation (41). This was shown to be the case by Cramér [7] and Hida [12] independently, and it will be presented here. ([7] has the 1960 references to Cramér's papers.) Let ℋ = sp{X_t, t ∈ T} ⊂ L²_0(P), and similarly ℋ_t = sp{X_s, s ≤ t} ⊂ ℋ and ℋ_{−∞} = ∩_{t∈T} ℋ_t. Since ℋ_{t_1} ⊂ ℋ_{t_2} for t_1 < t_2, one has ℋ_{−∞} ⊂ ℋ_t ⊂ ℋ, and ℋ_{−∞} represents the remote past while ℋ_t stands for the past and present. The X_t process is deterministic if ℋ_{−∞} = ℋ and purely nondeterministic if ℋ_{−∞} = {0}. Thus the remote past generally contributes little to the experiment. The separation of the remote past from the evolving part is achieved as follows. A process {X_t, t ∈ T} which is left continuous with right limits (and this is automatic if T = Z) can be uniquely decomposed as X_t = Y_t + Z_t, t ∈ T, where the Y_t component is purely nondeterministic, the Z_t is deterministic, and the Y_t and Z_t processes are uncorrelated. (This is a special case of Wold's decomposition.) Since the deterministic part is uninteresting for the problems of stochastic analysis, and can be separated by the above result, one can ignore it. Hence for the rest of this section it will be assumed that our processes are purely nondeterministic. The proofs of the following assertions may be completed from the work of Cramér in [7] (cf. the references for his other papers there). The approach here does not give much insight if T = Z. However, T = R is really the difficult case, and the present method is specifically designed for it. The new element in this analysis is the concept of 'multiplicity', and it is always



one if T = Z, while it can be any integer N ≥ 1 if T = R. (See [5], and the references there, and also [7].) The basic idea is to 'break up' the continuous parameter case, in the sense that each such process can be expressed as a direct sum of mutually uncorrelated components of the type (41), so that each of the latter elements can be analyzed with special methods. This relatively deep result was obtained independently (cf. [7] and [12]) and can be given as follows:

THEOREM 7.1. Let {X_t, t ∈ R} ⊂ L²_0(P) be a purely nondeterministic process which is left continuous with right limits on R. Then there exists a unique integer N, 1 ≤ N ...

Having only {X_u, u ≤ s} at our disposal, it is desirable to have some approximations to the best predictor. A result on this can be described as follows. Let T = Z for simplicity, and for s < t_0 ∈ Z define 𝒞_n = sp{X_s, X_{s−1}, ..., X_{s−n}}, so that lim_n 𝒞_n = sp{∪_{n≥0} 𝒞_n} = ℋ_s. If X̂_{t_0,n} = Q_n(X_{t_0}), Q_n being the orthogonal projection of ℋ onto 𝒞_n, then one can show, using the geometry of ℋ, that E(|X̂_{t_0,n} − X̂_{t_0,s}|²) → 0 as n → ∞. However, the pointwise convergence of X̂_{t_0,n} to X̂_{t_0,s} is much more difficult, and in fact the truth of the general statement is not known. For a normal process, an affirmative answer can be obtained from the following nonlinear case. Let Y_{t_0,n} = E(X_{t_0} | X_s, X_{s−1}, ..., X_{s−n}) and Y_{t_0,s} be as before. Then the sequence {Y_{t_0,n}, n ≥ 1} is a square integrable martingale such that sup_n E(|Y_{t_0,n}|²) < ∞. Hence the general martingale convergence theory implies Y_{t_0,n} → Y_{t_0,s} both in the mean and with probability one, as n → ∞. Since for normal processes both the linear and nonlinear predictors coincide, the remark at the end of the preceding paragraph follows. Thus predictors from finite but large samples give good (asymptotic) approximations to the solutions X̂_{t_0,s} (or Y_{t_0,s}), and this is important in practical cases. However, the error estimation in these problems has received very little attention in the literature. In the case of normal processes certain other methods (e.g., the Kalman filter, etc.), giving an algorithm to compute the X̂_{t_0,s} sequence, are available. But there is no such procedure as yet for general second-order processes. At this point it will be useful to present a class of nondeterministic processes, belonging to a Karhunen class, which arise quite naturally as solutions of certain stochastic differential equations. This will also illustrate the remark made at the end of Section 4. In some problems of physics, the motion X_t of a simple harmonic oscillator, subject to random disturbances, can be described by a formal stochastic differential equation of the form (cf. [3]):

d²X(t)/dt² + β dX(t)/dt + ω_0² X(t) = A(t) (X(t) = X_t), (51)

where β is the friction coefficient and ω_0 denotes the circular frequency of the oscillator. Here A(t) is the random fluctuation, assumed to be white noise, the symbolic (but really fictional) derivative of Brownian motion. In some cases, β and ω_0 may depend on time. To make (51) realistic, the symbolic equation should be expressed as:

dẊ(t) + a_1(t) Ẋ(t) dt + a_2(t) X(t) dt = dB(t), (52)

where the B(t) process is Brownian motion. Thus for each t > 0, B(t) is normal with mean zero and variance σ²t, denoted N(0, σ²t), and if 0 < t_1 < t_2 < t_3, then B(t_3) − B(t_2) and B(t_2) − B(t_1) are independent normal random variables with distributions N(0, σ²(t_3 − t_2)) and N(0, σ²(t_2 − t_1)) respectively. Also Ẋ(t) = dX(t)/dt is taken as a mean square derivative. Then (52) and (51) can be interpreted in the integrated


form, i.e. by definition,

∫_a^b f(t) A(t) dt = ∫_a^b f(t) dB(t), (53)

the right side of (53) being a simple stochastic integral which is understood as in Section 3 (since B is also orthogonally scattered). Here f is a nonstochastic function. The integration theory, if f is stochastic, needs a more subtle treatment, and the B(t) process can also be replaced by a 'semi-martingale'. (See, e.g., [26], Chapters IV and V for details.) The point is that the following statements have a satisfactory and rigorous justification. With Brownian motion one can assert more and, in fact, regarding the solution process of (52), the following is true.

THEOREM 8.1. Let J = [a_0, b_0] ⊂ R⁺ be a bounded interval, and {B_t, t ∈ J} be the Brownian motion. If a_i(·), i = 1, 2, are real (Lebesgue) integrable functions on J such that equation (52) is valid, then there exists a unique solution process {X_t, t ∈ J} satisfying the initial conditions X_{a_0} = C_1, Ẋ_{a_0} = C_2, where C_1, C_2 are constants. In fact, the solution is defined by

X_t = ∫_{a_0}^{t} G(t, u) dB(u) + C_1 V_1(t) + C_2 V_2(t), t ∈ J, (54)

where V_i(·), i = 1, 2, are the unique solutions of the accompanying homogeneous differential equation

d²f(t)/dt² + a_1(t) df(t)/dt + a_2(t) f(t) = 0 (55)

with the initial conditions f(a_0) = 1, ḟ(a_0) = 0, and f(a_0) = 0, ḟ(a_0) = 1, respectively. In (54), G: J × J → C is the Green function. This is a continuous function such that ∂G/∂t is continuous in (t, s) on a_0 ≤ s ≤ t ≤ b_0.

However, if Ω = R^T, Σ = the cylinder σ-algebra, then X_t: Ω → R is defined as X_t(ω) = ω(t), i.e., the coordinate function, and the problem of determining when P_1 ~ P_2, or P_1 ⊥ P_2, or neither, is not simple. In the case that both P_1, P_2 are normal probability measures on Ω = R^T, only the main dichotomy that P_1 ~ P_2 or P_1 ⊥ P_2 can occur. This was first established independently by J. Feldman and J. Hájek in 1958, and later elementary proofs of this theorem were presented by L. A. Shepp and others. A simplified but still nontrivial proof of this result with complete details is given in ([27], pp. 212-217). The statistical problem therefore is to decide, on the basis of a realization, which one of P_1, P_2 is the correct probability governing the process. In the singular case this is somewhat easier, but in case P_1 ~ P_2 the problem is not simple. A number of cases were discussed in [10] before the dichotomy result was known. The simplest usable condition in the general case is the following: Let P_i have the mean and covariance functions (m_i, r_i), written P(m_i, r_i), i = 1, 2. Then P_1 ~ P_2 iff one has P(0, r_1) ~ P(0, r_2) and P(m_1, r_1) ~ P(m_2, r_1). Thus P(m_1, r_1) ~ P(m_2, r_2) iff P(m_1, r_1) ~ P(m_2, r_1) ~ P(m_2, r_2). Some applications with likelihood ratios appear in [25]. This equivalence criterion will now be illustrated on a purely nondeterministic normal process of multiplicity one. If {X_t, t ∈ T} is a normal process with mean zero and covariance r, let Z_t = m(t) + X_t, where m: T → R is a measurable nonstochastic function, so that the Z_t process has mean function m and covariance r and is also normal. Let P and P_m be the corresponding probabilities governing them. The mean m(·) is



called admissible if P ~ P_m. The set M_P of all admissible means is an interesting space in its own right. In fact, it is a linear space, carries an inner product, and with it M_P becomes a Hilbert space attached to the given normal process. (For an analysis of M_P, and the following, see [24].) One shows that m ∈ M_P iff there is a unique Y ∈ ℋ = sp{X_t, t ∈ T} ⊂ L²(P) such that

m(t) = E(Y X_t), t ∈ T, (65)

and then the likelihood ratio dP_m/dP is given by

dP_m/dP = exp{Y − ½ E(|Y|²)}. (66)

Using now an abstract generalization of the classical Neyman-Pearson lemma due to Grenander ([10], p. 210), one can test the hypothesis H_0: m ≡ 0 vs. H_1: m(t) ≢ 0. The critical region for this problem can be shown to be

A_k = {ω ∈ Ω: Y(ω) ≤ k}, (67)

where k is chosen so that P(A_k) = α, the prescribed size of the test (e.g., α = 0.05 or 0.01). This general result was first obtained by Pitcher [23]. In the case of nondeterministic processes of multiplicity one, the conditions on admissible means can be simplified much further. This may be stated, following Cramér [7], as follows: Let T = [a, b] and X_t be purely nondeterministic, so that by (44) with N = 1 one has

X_t = ∫_a^t g(t, λ) Z(dλ), t ∈ T, (68)

and that ℋ = sp{X_t, t ∈ T} = sp{Z(A): A ⊂ T, Borel}. But m ∈ M_P iff there exists a Y ∈ ℋ such that (65) holds. In this special case, therefore, Y admits a representation as

Y = ∫_a^b h(λ) Z(dλ), (69)

for some h ∈ L²([a, b], F), where F(A) = E(|Z(A)|²). Suppose that the derivative F′ exists outside a set of Lebesgue measure zero. Since Z(·) has orthogonal increments, (65), (68) and (69) imply

m(t) = ∫_a^t h(λ) g(t, λ) F′(λ) dλ, t ∈ T = [a, b]. (70)

This is the simplification noted above. If ∂g/∂t is assumed to exist, then (70)



implies that the derivative m′(t) of m(t) also exists. In particular, if X_t is Brownian motion, so that g ≡ 1 and F′ ≡ 1, one gets m′(t) = h(t) (a.e.) and h ∈ L²([a, b], dt) in order that P_m ~ P. There is a corresponding result when P_1 ~ P_2, with P_i normal but having different covariances. However, this is more involved. A discussion of this case from different points of view occurs in the works [35, 33, 7, 25]. (See also the extensive bibliographies in these papers.) There is a great deal of specialized analysis for normal processes in both the stationary and general cases. It is thus clear how various types of techniques can be profitably employed for several classes of nonstationary processes of second order. Many realistic problems raised by the above work are of interest for future investigations.

Acknowledgement

This work is prepared with a partial support of ONR Contract No. N00014-84-K0356.

References

[1] Bhagavan, C. S. K. (1974). Nonstationary Processes, Spectral and Some Ergodic Theorems. Andhra University Press, Waltair, India.
[2] Bochner, S. (1954). Stationarity, boundedness, almost periodicity of random valued functions. In: Proc. Third Berkeley Symp. Math. Statist. and Probability, Vol. 2, 7-27. University of California Press, Berkeley, CA.
[3] Chandrasekhar, S. (1943). Stochastic problems in physics and astronomy. Rev. Modern Phys. 15, 1-89.
[4] Chang, D. K. (1983). Harmonizable filtering and sampling of time series. UCR Tech. Report No. 8, 26 pp. (to appear in Handbook of Statistics, Vol. 5).
[5] Chi, G. Y. H. (1971). Multiplicity and representation theory of generalized random processes. J. Multivariate Anal. 1, 412-432.
[6] Cramér, H. (1951). A contribution to the theory of stochastic processes. In: Proc. Second Berkeley Symp. Math. Statist. and Probability, 329-339. University of California Press, Berkeley, CA.
[7] Cramér, H. (1971). Structural and Statistical Problems for a Class of Stochastic Processes. S. S. Wilks Memorial Lecture, Princeton University Press, Princeton, NJ.
[8] Dolph, C. L. and Woodbury, M. A. (1952). On the relation between Green's functions and covariances of certain stochastic processes and its application to unbiased linear predictions. Trans. Amer. Math. Soc. 72, 519-550.
[9] Getoor, R. K. (1956). The shift operator for nonstationary stochastic processes. Duke Math. J. 23, 175-187.
[10] Grenander, U. (1950). Stochastic processes and statistical inference. Ark. Mat. 1, 195-277.
[11] Grenander, U. and Rosenblatt, M. (1957). Statistical Analysis of Stationary Time Series. Wiley, New York.
[12] Hida, T. (1960). Canonical representation of Gaussian processes and their applications. Mem. Coll. Sci. Kyoto Univ., Ser. A, 32, 109-155.
[13] Kampé de Fériet, J. and Frenkiel, F. N. (1962). Correlation and spectra of nonstationary random functions. Math. Comp. 16, 1-21.
[14] Karhunen, K. (1947). Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fenn. Ser. A I Math. 37, 3-79.

310

M . M . Rao

[15] Kelsh, J. P. (1978). Linear analysis of harmonizable time series. Ph.D. thesis. UCR Library. [16] Lo~ve, M. (1948). Fonctions alfiatoires du second ordre. A note in P. L6vy's Processes Stochastiques et Movement Browien, 228-352. Gauthier-Villars, Paris. [17] Masani, P. (1968). Orthogonally scattered measures. Adv. in Math. 2, 61-117. [18] Morse, M. and Transue, W. (1956). C-bimeasures and their integral extensions. Ann. Math. 64, 480-504. [19] Nagabhushanam, K. (1951). The primary process of a smoothing relation. Ark. Mat. 1, 421-488. [20] Niemi, H. (1975). Stochastic processes as Fourier transforms of stochastic measures. Ann. Acad. Sci. Fenn. Set. A I Math. 591, 1-47. [21] Parzen, E. (1962). Spectral analysis of asymptotically stationary time series. Bull. Internat. Statist. Inst. 39, 87-103. [22] Parzen, E. (1962). Stochastic Processes. Holden-Day, San Francisco, CA. [23] Pitcher, T. S. (1959). Likelihood ratios of Gaussian processes. Ark. Mat. 4, 35-44. [24] Rao, M. M. (1975). Inference in stochastic processes--V: Admissible means. Sankhyd Set. A 37, 538-549. [25] Rao, M. M. (1978). Covariance analysis of nonstationary time series. Developments in Statistics, Vol. 1, 171-225. Academic Press, New York. [26] Rao, M. M. (1979). Stochastic Processes and Integration. Sijthoff and Noordhoff, Alphen aan den Rijn, The Netherlands. [27] Rao, M. M. (1981). Foundations of Stochastic Analysis, Academic Press, New York. [28] Rao, M. M. (1982). Harmonizable processes: structure theory. L'Enseign. Math. 28, 295-351. [29] Rao, M. M. (1984). Probability Theory with Applications. Academic Press, New York. [30] Rao, M. M. (1984). The spectral domain of multivariate harmonizable processes. Proc. Nat. • Acad. Sci. U.S.A. 81, 4611-4612. [31] Rozanov, Yu. A. (1959). Spectral analysis of abstract functions. Theory Probab. AppL 4, 271-287. [32] Rozanov, Yu. A. (1967). Stationary Random Processes (English translation). Holden-Day, San Francisco. [33] Rozanov, Yu. A. (1971). Infinite Dimensional Gaussian Distributions (English translation). American Mathematical Society, Providence, RI. [34] Yaglom, A. M. (1962). A n Introduction to the Theory of Stationary Random Functions (English translation). Prentice-Hall, Englewood Cliffs, NJ [35] Yaglom, A. M. (1963). On the equivalence and perpendicularity of two Gaussian probability measures in function spaces. Proc. Syrup. Time Series Analysis, 327-346. Wiley, New York.

E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5 © Elsevier Science Publishers B.V. (1985) 311-320

] "l A L At_

On Non-Stationary Time Series

C. S. K. B h a g a v a n

1. Introduction A set of (numerical) observations on a characteristic of interest, collected over a few successive values of a progressively increasing indexing p a r a m e t e r called time, is known as time series. A single characteristic is usually considered as it is simple for discussion. The source for a time series is a stochastic process, an indexed set of r a n d o m variables, the index set being usually an infinite set. A time series is thus a truncated realisation of a stochastic process. From the point of view of collection of data, time series has the facility and simplicity of arising from routine collection, incidental to, say, administrative routine without resorting to the methods of sampling. T h e time series now takes the place of a sample. Time series being part of stochastic process, the terminology used for processes will also be used for time series, indistinguishably.

2. Stationarity Considerable progress has been m a d e in the analysis of time series under the basic assumption of stationarity. The stationarity considered is one of a structural invariance under translations of time. The structures considered are in two directions. O n e is regarding the probability setup and the other is regarding the second moments. T h e former is known as strict stationarity, while the latter is known as weak stationarity, the two coinciding over Gaussian processes. Strict stationarity is occasionally used, while weak stationarity is the frequently used one. A slight variant of weak stationarity, which requires that the first m o m e n t also is time invariant, is known as wide sense stationarity, a concept used in inferential problems. Unless otherwise stated, stationarity normally means weak or wide sense stationarity. It can be formally defined as follows: 311

c. s. K. Bhagavan

312

Let X(t) be a stochastic process possessing finite m o m e n t s of the first and second order: m (t); C(t, u). If

C(t, u)= E [ ( X ( t ) - m(t))(X(u)- m(u))] = C(u

t),

a function of the time lag, where the bar over an expression denotes the complex conjugate; then X(t) is called a weakly stationary process. If further,

m(t) : E(X(t)) = m, a constant over time, X(t) is referred to as a wide sense stationary (W.S.S) process. A non-stationary process means a process that is not necessarily stationary. Thus the class of stationary processes will be expanded by including classes of other processes as well. A time series from a non-stationary process will be naturally called a non-stationary time series.

3. Spectrum T i m e series in general are observed to have periodic tendencies. This is due to the blend of harmonic terms in the characteristic considered. Thus one well-known aim of time series analysis is the search for hidden periodicities. T h e presence of harmonic terms can be detected by an instrument called the spectrum of the process, the existence of which was established in the case of stationary processes by Herglotz (1911) and Khintchine (1934) (see also G r e n a n d e r and Rosenblatt, 1957), through a result saying that the covariance can be written as the Fourier-Stieltjes transform

C(u - t) = C(k ) = fw e~k~d F ( s ) , where w = [-~r, 7r] or ( - ~ , o0) according as the time p a r a m e t e r is discrete or continuous and F(s) is a bounded, non-negative and non-decreasing function. H e r e F(s) is known as the spectral function or spectrum of the process. Thus the concentration is now on the possible jumps of F(s). The jumps of the spectrum are noted to reflect the periodic nature of the time series. Using a time series, the device to track the jumps of the spectrum is through the well-known technique of periodogram analysis, where we plot the function called the intensity function I(s) or its modifications (see Anderson (1971) in this regard) against various chosen trial values. The intensity function is of the form

I(s) = {A2(s) + B2(s)}U2 ~

On non-stationary time series

313

where n

a (s) = 2 ~, X ( t ) cos n

2wt s

t=l n

B(s) = 2 ~, X ( t ) sin 2wt n

t=l

s

The nature of the periodogram is that it runs close to the X-axis except that there are sudden peaks at points corresponding to the jumps of the spectrum. Thus the periodogram analysis plays a vital role in time series analysis. Before passing on, it is to be emphasised that all these considerations are conditioned by

the assumption of stationarity. Having recognised the importance of the spectrum, one is naturally led to the question: What happens to these considerations if stationarity is absent? In other words, one is led to the consideration of 'non-stationary situations'. The immediate problem one faces here is to restore the concept of spectrum in this case. It may be remarked, even at this stage, that when the facility and simplicity of stationarity is gone, the attempts become diverse and the related aspects need much further developments. This chapter thus concentrates more on these spectral aspects, presenting the details in the discrete parameter case, putting the concept of spectrum thus obtained to the same usage as in the stationary case. The generalisation of the spectrum envisaged can fruitfully be achieved if we first have a look at what we are expecting of the spectral function. Broadly speaking, the following would be the requirements for a spectral function (see Loynes, 1968; Nagabhushanam, 1970): Non-negativity and additivity like mass or energy, unique determinability from the auto-covariance function, relationship to a meaningful function by Fourier-Stieltjes transformation, possibility of possessing a jump part; determinability of the spectral transfer function when the process variates undergo a simple linear transformation, estimability of the spectral density from a single realisation of the process, and reduction to the usual spectral function when the process is specialised to be a weakly stationary process. Loynes (1968) has listed all the requirements for a spectral function of a process and concluded that when a process is not stationary, there does not seem to exist a spectral function satisfying all the requirements. Then what can be done seems to be to define a spectrum of a type that will be suitable to the particular inquiry on hand. These have broadly developed in two streams: one stream taking a start from the covariance and the other taking a start from the process representation (see Cram6r and Leadbetter (1967) for process representation). 4. Spectra of non-stationary processes

We shall now review the various spectra considered for non-stationary processes:

C. S. K. Bhagavan

314

(a) Fano (1950) and Page (1952) have defined spectra based on considerations of Fourier integrals. The spectrum defined by Fano cannot include stationary processes in an essential way and that of Page cannot be necessarily non-negative. (b) Cram& (1961) has defined s

f f ldh s as the spectrum of the harmonisable process of discrete parameter, where

h (s, r) is a function of bounded variation in terms of which the auto-covariance function of the process has the representation

C(t, u) =

e i~+i"r dh (s, r) -~r

-~

(see Lo6ve, 1963). The function F(s) is now a bounded measure function and thus additive like mass. Further, when the process is stationary, it reduces to the spectrum as in the stationary process. This spectrum has been shown to be useful for judging if the process is purely non-deterministic or not and for linear prediction. (c) Parzen (1967) has considered real processes for which

E(X(t)) = 0 and

R ( k ) = l i m l f r-k E ( X (t)X (t + k )) dt T~o Z o

for k ~ 0

exist finitely for each k and remarks that these may be termed asymptotically weakly stationary processes, and that a time series X(t), t >t O, for which there exists a function R(k) satisfying the above could be said to possess a covariance function R(k) and a spectrum. H e establishes the existence of the spectrum assuming that: (i) fourth moments of the process exist and are uniformly bounded; (ii) ( l / T ) f o r-k X(t)X(t + k)dt converges in the mean square to R(k) as T ~ ; and (iii) R ( k ) i s continuous. (d) Herbst (1964) has considered discrete parameter processes X(t) of the form P

x ( o : ~ ajc, jE, j,

t= 0,+1,+2 .....

j=O

where e(t) is a real Gaussian stationary process of identically and in-. dependently distributed random variables, ai's being constants and ct's being

On non-stationary time series

315

such that c~ d/'*s'N(A )' and if the function tXs,N : R - + C

u, v E R

5

is absolutely continuous, so that one has

fS,N(U -- V) = f . e iC"-O*fs,N(A) d a , the solution to the filter problem is given by the expression

S(t) .

f

.

k(a)+ f~N(a)

. . =-e J, fs(A) + fN(A) + 2 Re(fs, u (a))

i,,

de(a),

teR.

(14)

Note that in the above results, one has to assume that all the spectral functions and the cross-spectral function are absolutely continuous. If this is not the case, the results become more complicated. When the series S, N and X are of Cram6r class as defined in [8], which contains the class of all strongly harmonizable time series, and when S and N are

D.K. Chang

330

uncorrelated, then similar results were obtained in [271. Without assuming that S and N are uncorrelated, Kelsh [15] considered the same problem for multidimensional Cram6r class series, and got the corresponding result using the technique essentially due to Rao [27]. For one-dimensional strongly harmonizable series S, N and X, Kelsh's result can be stated as follows. L e t / x s,/x N : R x R ~ C be the spectral functions of S and N, #S,N : R X R ~ C be the cross-spectral function, and let P-~.N:R x R--~C be defined by tX*s.N(u, v) = #s,N(v, u), for u, v E R. Then the optimal filter is S(t)=~aF(A)dZ(A),

t~R,

(15)

where F: R ~ C is a solution to the set of integral equations

j

f F ( u ) e -i~v dot s + # u +/Xs,N +/x ].N)(u, v)

RxR

= f f eitU-iS°d(tzs+ #s,N)(u,v), RxR

for all s ¢ R. In general, it is not easy to solve this system of integral equations analytically. However, if the spectral functions P-s, #N and #s.N are absolutely continuous, expression (15) can be reduced to an explicit form as in (14).

4. Sampling a harmonizable process

Next we discuss the sampling problem of the continuous parameter time series. When we study a time series in practice, it is sometimes physically difficult or economically undesirable to observe the whole series. It is then required to sample it at only finitely many times, and to estimate the original series from the observed samples. Sampling theorems are very important in many fields in practice, such as the communication and information theory. The following result is called the Kotel'nikov-Shannon formula, and is an abstraction of a classical (nonstochastic) result due to Cauchy [4]. If X = {X(t), t E R} is a weakly stationary time series with spectral function /x which is supported by a bounded interval (-1~h, ~1h ), h > 0 , i.e. it is constant in (-~,-~h] and [12h, o~), then N

X(t)=l.i.m. ~ X(nh) N ~ ,,=-N

sin[~r(t

nh)/h]

~(t -- nh )/h

,

t~R,

(16)

where the convergence on the right side of (16) is in the sense of mean square. This formula gives a periodic samplino theorem, where one observes the time

H a r m o n i z a b l e filtering a n d s a m p l i n g o f time series

331

series at the periodic points t = nh, - N 7, w h e r e X is an u n k n o w n i n p u t series. N o t e t h a t if we a s s u m e that Y(n) = 0 for all n ~< 0, t h e p r o b l e m b e c o m e s q u i t e simple. This is not a s s u m e d here. T h e c h a r a c t e r i s t i c p o l y n o m i a l of the filter L is of the f o r m P(t)= E6=0 af. T h e r o o t s t I. . . . . t 6 of P can also b e c o m p u t e d . T h e s e a r e as follows: t I = 1.295, t2 = _n1.746,

/3, 14 = 0.501-v- ] . 3 5 7 i , t~, t 6 =

0.739 - 1.118i.

Harmonizable filtering and sampling of time series Table 1 X Y

56.94 36.00

46.50 26.00

17.34 6.00

37.26 31.00

51.16 27.00

62.02 33.00

63.24 28.00

49.35 18.00

X Y

25.51 4.00

-24.86 -29.00

-24.45 -13.00

5.87 -7.00

-9.74 -15.00

-8.53 -7.00

-25.60 -22.00

-62.52 -40.00

X Y

-26.09 4.00

-12.47 -10.00

-28.30 -18.00

-26.92 -10.00

-42.74 -27.00

-53.19 -23.00

-34.82 -10.00

-19.68 -8.00

X Y

-21.52 -10.00

-19.68 -6.00

11.98 20.00

31.81 24.00

28.22 17.00

15.39 8.00

1.51 0.00

19.35 19.00

X Y

51.61 33.00

32.46 5.00

26.11 17.00

48.21 29.00

6.16 13.00

-25.86 -18.00

7.92 11.00

16.36

-36.44 -26.00

-21.63 -7.00

-22.06 -13.00

-19.53 9.00

12.98 15.00

3.82 -4.00

-43.91 -31.00

X Y

-14.99

19.00 37.65

1.00

X Y

-10.00

35.71 -21.00

-27.98 -10.00

28.84 -18.00

0.38 9.00

41.05 35.00

42.43 23.00

-29.48 -32.00

X Y

-54.60 -24.00

-24.72 3.00

-25.92 -20.00

-32.84 -21.00

-2.06 5.00

1.89 0.00

-20.87 -11.00

-56.99 -36.00

X Y

-29.33 - 1.00

46.23 45.00

63.53 31.00

51.70 27.00

63.24 42.00

57.73 32.00

21.57 5.00

42.92 30.00

X Y

45.59 13.00

16.25 -4.00

8.15 0.00

42.99 -46.00

-75.97 -46.00

-60.67 -31.00

43.61 -27.00

-63.65 -43.00

X Y

-68.65 --35.00

-62.08 -28.00

-52.70 -19.00

-25.59 0.00

-5.88 4.00

-10.89 -2.00

5.56 17.00

18.39 17.00

X Y

-16.33 -15.00

-54.96 -33.00

13.74 9.00

29.98 22.00

42.14 23.00

49.10 28.00

43.92 23.O0

24.88 14.00

X Y

34.20 25.00

24.15 4.00

2.77 -7.00

-0.71 -4.00

30.88 21.00

45.78 22.00

15.79 -4.00

1.18 - 1.00

X Y

--8.23 -8.00

-59.02 -47.00

-86.41 -51.00

-64.37 -32.00

-0.74 13.00

9.18 9.00

9.82 5.00

-33.10 -23.00

X Y

-26.27 2.00

20.79 28.00

32.36 15.00

29.50 15.00

21.57 9.00

10.95 12.00

-56.82 -38.00

-70.72 -40.00

X Y

-24.37 -3.00

10.38 6.00

17.36 8.00

14.21 9.00

61.42 54.00

51.27 24.00

11.06 1.00

30.26 26.00

X Y

50.84 26.00

32.36 9.00

10.57 -3.00

36.31 24.00

58.69 32.00

22.91 -3.00

-30.18 31.00

--50.69 -32.00

X Y

-16.49 0.00

-31.08 -31.00

-45.90 -30.00

-50.45 -30.00

-40.44 16.00

-21.23 -2.00

-3.85 3.00

-'7.21 -3.00

X Y

-'17.05 -6.00

-16.91 -3.00

12.31 19.00

38.81 27.00

41.67 22.00

42.31 25.00

37.32 20.00

20.33 8.00

X Y

-8.89 -13.00

24.73 23.00

7'7.68 47.00

80.61 35.00

21.77 -7.00

-25.42 -22.00

-46.07 -28.00

-57.54 -37.00

X Y

-66.98 -46.00

-58.49 -36.00

22.81 -4.00

12.46 20.00

-6.08 -7.00

-33.97 -15.00

--12.72 8.00

-45.03 -24.00

333

D. K.

334

Chang

Table 1

(Con~nued)

X Y

45.63 42.00

85.60 54.00

95.99 52.00

66.43 28.00

29.03 12.00

-49.77 -47.00

-48.37 -21.00

21.58 19.00

X Y

69.26 33.00

43.30 7.00

26.50 13.00

5.22 1.00

-28.48 -19.00

-52.19 -34.00

-13.25 -16.00

-6.87 -5.00

X Y

20.25 15.00

13.12 2.00

-6.64 -4.00

6.62 15.00

35.84 28.00

41.95 22.00

-8.41 -21.00

-42.37 27.00

X Y

-54.06 -31.00

-22.13 -3.00

16.36 12.00

1.63 -11.00

-36.77 -26.00

-57.53 -29.00

-57.13 -25.00

-26.83 -4.00

Since all these roots lie outside the unit circle, the filter L is physically realizable. To compute the values for the sequence X, we need to expand the rational function 1/P using the Taylor series method. With the coefficients b0, b 1. . . . thus determined, we can use the formula

X(m)= ~ b.Y(m -n) n=0

40 30 20 10 0 -10 -20 -30 -40 -50 -60 -70

Time

Fig. la. O u t p u t Series Y.

335

Harmonizable filtering and sampling of time series

60 50 40 30 20 10

-I0 --20

/ -5(

/

-7( 0

10

20

30

40

50

60

70

80

90

1oo Time

Fig. lb. Input Series X.

to o b t a i n t h e i n p u t s e r i e s X. T h e first 24 b's, c o r r e c t to t h r e e d e c i m a l places, are as f o l l o w s : bo = b 1= b2 = b3 = b4 = b5 =

1.177, 0.640, 0.033, -0.032, 0.030, 0.019,

b6 = b7= bs = b9 =

0.136, 0.147, 0.049, -0.003,

blo = 0.003, b n = 0.008,

b12 = b13 = bl4 = bls = bt6 = b17 =

0.018, 0.025, 0.015, 0.003, 0.000, 0.002,

bt8 = b19 = b2o = b21 = b22 = b23 =

0.003, 0.004, 0.003, 0.001, 0.000, 0.000.

A set of t w o h u n d r e d v a l u e s (from t h e s a m e d a t a r e c o r d s ) of X a n d Y, c o r r e c t to two d e c i m a l places, is g i v e n in T a b l e 1, a n d t h e g r a p h s for b o t h series X a n d Y with t h e s e v a l u e s a r e p l o t t e d in Fig. l a , b for c o m p a r i s o n .

References [1] Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68,337--404. [2] Bhagavan, C. S. K. on non-stationary time series. This volume, Chapter 11. [3] Bochner, S. (1956). Stationarity, boundedness, almost periodicity of random valued functions. In: Proc. Third Berkeley Syrup. Math. Statist. and Probability, Vol. 2, 7-27. University of California Press, Berkeley, CA.

336

D. K. Chang

[4] Cauchy, A.-L. (1841). Memoire sur diverses formulaes de analyse. C. R. Acad. Sci. Paris 12, 283-298. [5] Chang, D. K. (1983) Bimeasures, harmonizable processes and filtering. Ph.D. Dissertation. University of California, Riverside, CA. [6] Chang, D. K. and Rao, M. M. (1983). Bimeasures and sampling theorems for weakly harmonizable processes. Stochastic Anal. & Appl. 1, 21-55. [7] Clarkson, J. A. and Adams, C. R. (1933). On definitions of bounded variation of two variables. Trans. Amer. Math. Soc. 35, 824-854. [8] Cram~r, H., (1951). A contribution to the theory of stochastic processes. In: Proc. Second Berkeley Symp. Math. Statist. and Probability, 329-339. University of California, Berkeley, CA. [9] Diestel, J. and Uhl, 3. J. Jr. (1977). Vector Measures, Mathematical Surveys No. 15. American Mathematical Society, Providence, RI. [10] Dunford, N. and Schwartz, J. T. (1958). Linear Operators, Part I: General Theory. Interscience, New York. [11] Grenander, U. (1950). Stochastic processes and statistical inference. Ark. Mat. 1, 195-277 [12] Hannan, E. J. (1967). The concept of a filter. Proc. Cambr. Phil. Soc. 63, 221-227. [13] Helson, H. and Lowdenslager, D. (1958). Prediction theory and Fourier series in several variables. Acta Math. 99, 165-202. [14] Kallianpur, G. (1959). A problem in optimum filtering with finite data. Ann. Math. Statist. 30, 659-669. [15] Kelsh, J. P. (1978). Linear analysis of Harmonizable time series. Ph.D. Dissertation. University of California, Riverside, CA. [16] Lloyd, S. P. (1959). A sampling theorem for stationary (wide sense) stochastic processes. Trans. Amer. Math. Soc. 92, 1-12. [17] Lo6ve, M. (1963). Probability Theory, 3rd ed. Van Nostrand, New York. [18] Masani, P. (1965). The normality of time-invariaut, subordinative operators in Hilbert space. Bull. Amer. Math. Soc. 71, 546-550. [19] Miamee, A. G. and Salehi, H. (1978). Harmonizability, V-boundedness, and stationary dilations of stochastic processes. Indiana Univ. Math. J. 27, 37-50. [20] Nagabhushanam, K. (1950). The primary process of a smoothing relation. Ark. Mat. 1~ 421--488. [21] Niemi, H. (1975). Stochastic processes as Fourier transforms of stochastic measures. Ann. Acad. Sci. Fenn. Ser. A I, 591, 1-47 (Helsinki). [22] Parzen, E. (1962). Extraction and detection problems and reproducing kernel Hilbert spaces. J. S I A M Control Ser. A 1, 35-62. [23] Penrose, R. A. (1955). A generalized inverse for matrices. Proc. Cambr. Phil. Soc. 51,400-413. [24] Piranashvili, Z. A. (1967). On the problem of interpolation of stochastic processes. Theory Prob. Appl. 12, 647-657. [25] Pourahmadi, M. (1980). On subordination, sampling theorem and 'past and future' of some classes of second-order processes. Ph.D. dissertation. Michigan State University. [26] Rao, M. M. (1982). Harmonizable processes: structure theory. L'Enseign. Math. 28, 295-351. [27] Rao, M. M. (1967). Inference in stochastic processes, III. Zeit. Warsch. Verw. Gebiete 8, 49-72. [28] Rozanov, Yu. A. (1959). Spectral theory of abstract functions. Theory Prob. Appl. 4. 271-287. [29] Yaglom, A. M. (1961). Second order homogencous random fields. In: Proc. Fourth Berkeley Symp. Math. Statist. and Probability, Vol. 2, 593. University of California Press, Berkeley, CA.

E. J. H a n n a n , P. R. Krishnaiah, M. M. Rao, eds., Handbook © Elsevier Science Publishers B.V. (1985) 337-362

of Statistics, Vol. 5

"l '7~ At_

Sampling Designs for Time Series*

Stamatis Cambanis

1. Introduction

In practice, a time series (or more generally a random field) is observed only at a finite number of appropriately chosen points, which constitute the sampling design, and based on these observations an estimate or a statistic is formed for use in the problem at hand. How to select the most appropriate choice of sampling points is the problem of sampling design. The statistician may be free to choose any desirable points, or else part of the sampling mechanism may be imposed on the statistician who then controls only certain parameters, e.g. periodic sampling is imposed where the period is controlled by the statistician, or Poisson sampling (at the times of occurrence of a Poisson stream of events) is imposed but the statistician has control over its rate. With such constraints, i.e. within certain classes of sampling designs, or with no constraints, i.e. among all sampling designs, how can the statistician choose the best design of a given sample size, or how can the statistician determine the sample size of a certain kind of design required to achieve a given performance? These questions will be considered in the context of three specific problems of interest involving time series: the estimation of a weighted average of a random quantity, the estimation of regression coefficients, and the detection of signals in noise. These three problems have a great deal in common, and specifically their sampling design questions are essentially the same. The setup here differs in two important ways from the classical setup. All observations are taken from a fixed (interval) region A and so, especially for large sample sizes, it is not realistic to assume lack of correlation; hence, observations form a correlated time series. Also repeated sampling at the same point is not allowed, and only one realization of the time series is available; i.e. only simple designs are considered in the terminology of Pfizman (1977). We consider both deterministic and random sampling designs, where either

*This research was supported under the Air Force Office of Scientific Research Grant No. F49620 82 C 0009. 337

S. Cambanis

338

optimal estimators and sufficient statistics are employed, or much simpler estimators and statistics are employed instead. Finding optimal designs of a given sample size turns out to be a very difficult task, which can be accomplished only for certain specific covariance structures on certain sufficiently simple sampling designs, such as simple random sampling. Finding sampling designs, which for large sample size perform like the best designs, is an easier task, which can be accomplished for broad classes of covariance structures and different designs, such as unconstrained, median and stratified, the latter two in fact using the simpler kind of estimators and statistics. There is a vast literature on designs with uncorrelated errors. In sharp contrast, the literature on sampling designs for time series is rather limited. Expressions for mean square errors for various kinds of deterministic and random sampling designs with correlated errors or in a time series setup are given in Cochran (1946), Quenouille (1949), Zubrzycki (1958), and Tubilla (1975). The question of finding optimal, and asymptotically optimal, sampling designs begins with the fundamental work of Sacks and Ylvisaker (1966, 1968, 1970a, 1970b) who resolved in a series of papers the case of deterministic designs using optimal estimators. Their work was continued by Hfijek and Kimeldorf (1974), Wahba (1971, 1974), and Eubank, Simth and Smith (1981, 1982a, 1982b). Median sampling and random sampling designs were considered by Schoenfelder (1978, 1982), Schoenfelder and Cambanis (1982), and Cambanis and Masry (1983). While the picture is reasonably, but by no means fully, complete for one-dimensional sampling, the case of multivariate sampling designs is in its infancy as the work of Ylvisakar (1975) indicates. Throughout {X(t), t C A} will be a time series defined over the time interval A of length ]A], with covariance function R(s, t), which is assumed continuous and strictly positive definite. When T = {t 1. . . . . t,} C A we will write X~- for the vector (X(tl) . . . . . X(t,)), R T for the n × n matrix {R(ti, ~)}/~4:1, and similarly f~-= (f(tl) . . . . ,f(tn) ) for a function f(t) defined on A. We will consider a function f of the form f(t)=IaR(t,s)4~(s)ds,

tCA,

(1.1)

where ~b is a continuous function on A, and we will put

s2= ;a f A R(S, t)cb(s)4)(t) ds dt.

(1.2)

For simplicity, we will sometimes write double integrals of this form as f f R4~4'. The centered process X(t) has quadratic mean derivative on A if and only if its covariance function R(s, t) is differentiable on A x A (and similarly for higher order derivatives). Reference will be made to the reproducing kernel

Sampling designs for time series

339

Hilbert space of a covariance function R, R K H S (R). The relevant facts can be found in Parzen (1967) but no essential knowledge is required here. For the reader's convenience, we mention a few relevant properties here. Any function f of the form (1.1) (in fact with ~b simply square integrabte over A) belongs to R K H S ( R ) , and the expression in (1.2) is its norm in R K H S ( R ) . The reproducing kernel Hilbert space norm of f r E R K H S ( R r ) is the familiar expression

)lf

(1.3)

and of course R K H S ( R r ) = R". In fact, f E R K H S ( R ) if and only if the supremum of (1.3) taken over all finite subsets T of A is finite, and the value of that supremum is the R K H S norm of f. When R (s, t) = min(s, t) and A = [0, hi, the R K H S ( R ) consists of all functions f which vanish at zero and are absolutely continuous with square integrable derivative: f(t) = f~ g(u) du, 0 s 2 .

(4.17)

Similarly, in simple random sampling with density h, if c(t) satisfies the consistency condition (4.13), we have 1sr, T n = I T n ,

(4.18)

esr, 2 T, = /,/-1( f cr2(~2h 1_ s 2 ) ~ 0 ,

(4.19)

where o-2(0 = R (t, t). In stratified sampling, we can choose c, so as to satisfy for each n, G(t)h.(t)-= th(t),

t ~ A,

(4.20)

where h a is the averaged sampling density n -1 ~;~=1 hnk, and then we have

.l~,,w. = n-'(cblh)~ XT. = ~ k=l

X(t.~)~

(4.21)

'~nkk*nkl

n

(4.22) k =1

nk

nk

nk

S a m p l i n g designs f o r t i m e series

347

E s t i m a t i o n of regression coefficients

For a sequence {T.} of median sampling designs generated by the density h, if c(t) satisfies the consistency condition c ( t ) h ( t ) = s-24)(t),

(4.23)

t ~ A,

we have tim,T. = s-glr.,

(4.24)

Bias tm.r. = ~s-a(m~ - s 2) ~ O,

(4.25)

Var tim,r - s-2 = s-4(s2. - s2) -~ O.

(4.26)

In simple random sampling with density h, by choosing c(t) as in (4.23) we have

ti..,ro= s--21To,

(4.27)

Bias flsr,r. = 0,

(4.28)

Var fls~,r. - s -2 = s-4 e~,r. ~ O .

(4.29)

In stratified sampling, we choose c . ( t ) h . ( t ) = s 24~(t) for each n and obtain fist,r. = s-2Ist,r,,

(4.30)

Bias ti,,.r, = 0,

(4.31)

Var/3~t,r,

....

s-2

= S -4e 2 st,n

~0.

(4.32)

Detection of signals in noise

For a sequence {T,} of median sampling designs generated by the density h, if c(t) satisfies the consistency condition (4.13), then S~,r,' = I t ,

(4,33)

Pa(Sm,r ) = q5 [}m.!. 4 - ' ( 1 - a ) ] ,

(4.34)

Sn

and comparing it with the probability of detection of the optimal test based on the entire interval we have

[

s-4,-l(l-a)

q~(u) du s~4,-l(1-a)

=

a (s2--,s2r)*2-4,-1(1-,~),~(u)du

~b(u) du

(4.35)

348

where

S. C a m b a n i s

~2

= SNR(•)-

SNR(Sm,r.)=

= 1 {$2($2 _ S2) _

(m.

s 2-

- s2)(mn

n/Sn2

m 2

(4.36)

+ $2)}--~ O.

s. In simple random sampling with density h, if c(t) satisfies the consistency conditions (4.13), so that (4.37)

Ssr,Tn = I T n ,

then, while the statistics S~r,r" are no longer Gaussian, we have Pd(S,~,r.) ~ Pd(Sa),

SNR(S,~,T.)-~ SNR(SA) ,

(4.38)

and in fact the distributions of Ssr,rn under each alternative hypothesis converge weakly to those of SA:

f(Ss~,r~ IHI)~ ~#(s2, $2),

~J~(Ssr,Tn [ S 0 ) -'--),Jr'(0, $2).

In this case (as in any case of a random detection are expressed in terms of the variables m, and s,, 2 which is not easy to signal-to-noise ratios whose expressions are

(4.39)

sampling design), probabilities of joint distribution of the random compute. We thus focus only on much simpler. In this case,

S4

SNR(S~r,~)-

2

S 2 + esr,~

*s2,

(4.40)

so that

$2 --

SNR(Ssr,Tn)

2 Gr,T.2

-

•0 .

(4.41)

S + esr,T~

In stratified sampling, if the functions G(t) are chosen from the consistency condition (4.20), so that

Sst3; = Ist,r " ,

(4.42)

then Ss,3," have the desirable limiting properties (4.38) and (4.39), and again concentrating on signal-to-noise ratios we have S4

SNR(Ss, r,) .............. -~ s 2 ,

S2 + e2st,rn

(4.43)

Sampling designs for time series

349

and 2 2

S e st, T,,

s 2 - SNR(Sst, r.) = s2 +

e2t,r . . O.

(4.44)

4.3. Parametric versus nonparametric estimators The simple-coefficient estimator (4.14) and statistic (4.33) require no knowledge of the covariance R and are thus nonparametric in nature, while the estimator (4.24) requires knowledge of s 2only. In contrast, the optimal coefficient estimators (4.1), (4.3), and statistic (4.7) require precise knowledge of the covariance R (t, z).

5. Optimal fixed sample size designs and asymptotically optimal designs Within a specified class of sampling designs ~, we are interested in finding the best sampling design of size n. For the specific problems we have been considering here a sampling design T of size n is optimal if it minimizes the 2 or the bias and variance of /3> or its mean square approximation error er, mean square error (MSE = Var + (Bias)2), or if it maximizes the probability of detection or the signal-to-noise ratio of S r, among all sampling designs in @ of size n. Finding optimal designs of fixed sample size turns out to be a difficult task. We are therefore interested also in finding sequences {T,*} of sampling designs T* of size n, which, while generally not optimal for any sample size n, are nevertheless asymptotically optimal in the sense that as the sample size tends to infinity their performance tends to that of the sequence of optimal sampling designs. For the specific problems under consideration, this means that 2 er~---*l,

inf e 2r

Var/3r;' -+1, inf Var fir

MSE/3r; -->1, inf MSE/3 r

(5.1)

--Pd(ST;) -+ 1, SNR(Sr;') - + 1, sup Pa(ST) sup SNR(Sr) where infimum and supremum are taken over all sampling designs of size n in ~. it should be clear that for any random design there always exists a better nonrandom design. Our main interest is therefore to find optimal or asymptotically optimal sampling designs within the class of all (deterministic) designs. In the following we comment on the asymptotics of the performance of optimal sampling designs, we show how in certain cases asymptotically optimal sequences of designs can be found, and we consider the performance of optimal fixed sample size simple random designs and of asymptotically optimal stratified designs.

S. Cambanis

350

5.1. Optimal coefficients and regular sampling When optimal coefficients are used, it is clear from expressions (4.2), (4.4) or (4.5), and (4.8) or (4.9), that in all three problems under consideration, the optimal sampling design of size n maximizes

f ~.R -r~f r = IIPrf II2

(5,2)

among all sampling designs @n of size n: T = {t~ < t 2 < ' ' " < tn} , where Prf is the projection of f to the subspace of the reproducing kernel Hilbert space of R generated by {R(., t), t C T}. Since the maximization is over the open subset of A" determined by /he inequalities t~< t 2 < - . . < tn, an optimal sampling design of size n does not necessarily exist. Such an optimal design exists when R(s, t ) = u(s)v(t) for s < t , including the Wiener and Gauss-Markov cases min(s, t) and exp(-Is - t[), but its existence becomes a very delicate question when R is ditterentiable (on the diagonal of A × A). Even when an optimal design exists, it is usually difficult to determine it by carrying out the minimization (an algorithm for certain special cases is developed in Eubank, Smith and Smith (1982a)). A very special case where the minimi.zation is easily carried out is when ~b ~- 1 and X has stationary independent increments: the optimal design of sample size n is given, for A = [0, 1], by tni = 2i/(2n + 1), i = 1 , . . . , n, with corresponding ETn2 = (0-2/3)(2n + 1)-2, where 0-2 = R(t, t)/t; this is derived in Samaniego (1976) and in Cressie (1978). The optimal designs satisfy sup f~R-rlfw : sup TE~ n

. TEffJ n

IlPTfll2~ [Ifll2 :

s2 .

(5.3)

n

We now turn our attention to sequences {Tn} of sampling designs Tn of size n, which are not optimal. If they form a regular sequence of designs generated by a density h, then they satisfy

2 = s 2_ f~cR r~fr, = Ilfll2 - [[Pr,f[I2 = Ilf-

Ern

P~fll 2--, 0

(5.4)

When R is smooth, upper bounds can be found on er.2 Specifically if R(s, t) has continuous (k, k) mixed partial derivative, then

er,2 = o ( n 2k),

(5.5)

and if in addition the (k, k) mixed partial derivative of R(s, t) is smooth off the diagonal of A x A, then 2 = o(n er,

2k-2).

(5.6)

Thus the smoother R is, i.e. the more quadratic derivatives the centered 2 process X(t) has, the faster er, tends to 0.

Sampling designs for time series

351

Precise rates of convergence and asymptotically optimal sequences of sampling designs are known only in certain cases where the centered process X ( t ) has exactly k quadratic mean derivatives, and the rate is n -2k-2. Specifically, under certain further regularity conditions,

n

2k+2

2

%(t)fb2(t) dt.

~.-~ G fA h2k+2(t)

(5.7)

We will not insist on the precise technical regularity conditions other than giving the expression for the function ak(t), assumed positive, % ( t ) = R(k'k+')(t, t -- O)

-

R(k'k+l)(t, t + 0),

(5.8)

and noting that stationary covariances R with rational spectral densities and the right number of quadratic mean derivatives satisfy them in fact with % ( 0 =- %. The constant C k is defined by C k = ]B2k+a]/(2k + 2)!, where B m is the ruth Bernoulli number, and C O= 1/12, C 1= 1/720. This asymptotic result has been established by Sacks and Ylvisaker (1970a, 1970b) for k = 0, 1, and for covariances R satisfying specific regularity conditions, and by Eubank, Smith and Smith (1981) for all k and a narrower class of covariances R, essentially those of k-fold integrals of Brownian motion or bridge; the latter authors conjecture its validity for the broader class of covariances considered by Sacks and Ylvisaker. By choosing the density h which minimizes the right-hand side of (5.7), one obtains an asymptotically optimal sequence of designs! Specifically, the regular sequence {T*} of sampling designs generated by the density h*(t) proportional to [%(t)f)2(t)] 1/(2k+3)is asymptotically optimal and

n2k+ZeT;-'+ C2k



2

fA [ak(t)62(t)] 1/(2k+3)dt }2k+3,

(5.9)

i.e. mfrs~ e T = infTe~,Hf- PTfl] 2 has the same asymptotics (where @n consists of all sampling designs of size n). The asymptotics of er;, 2 Var fiT', and Pd(ST;) follow immediately from (4.2), (4.5) and (4.10). Periodic sampling is covered by (5.7) by taking h the uniform density over A, and its asymptotic performance is then easily compared with that of the asymptotically optimal sequence of designs; they both have the same rate but different asymptotic constants: C2klA[2k+2f %4~2 for periodic sampling and as in the right-hand side of (5.9) for the asymptotically optimal sampling design, and the ratio of the latter to the former can take (for different 4~'s) any value in (0, 1]. Thus substantial improvement .in asymptotic performance may be achieved by sampling according to h*(t) rather than periodically.

S. Cambanis

352

5.2. Simple coefficients and median sampling For sequences of median sampling designs {Tn} generated by the density h and using the simpler coefficients described in Section 4, it is clear from expressions (4.15), (4.25), (4.26) and (2k34) that their asymptotic performance is determined by the asymptotics of m, ~ s 2 and of s2,~ s of (4.16) and (4.17) Here we describe the results for the case k = 0, i.e. the centered process X(t) has no quadratic mean derivative, under regularity conditions similar to those required for (5.7). Included are the cases where R is the covariance of the Wiener process, the Gauss-Markov process, etc. The precise asymptotic behavior is as follows: n2(s 2 -- S 2) -~ ~ f A ~°(t)ga2(t) h2(t) dt ' n 2 ( m n _ S 2) --> 1

fA

O¢o(t)(/12(t) h2(t ) dt.

(5.10) (5.11)

It then follows from (4.15), (4.25), (4.26), (4.36) and (4.35) that n2e2,T~ -+ 112f Oeo~b2h-2 ,

(5.12)

n 2 Bias [3m,Tn --+ 481 ~S -2f c¢0~2h -2 ,

(5.13)

n2(Var/~,,,r,, _ s-2)_~s 4 f c~042h 2,

(5.14)

n2(MSE/~m, G _ S-2)__.> ~S-4 f O~o~2h -2,

(5.14')

n2[s 2 - SNR(Sm, r,)] -~ ]~ f ao4 2h-2 ,

(5.15)

n2[pd(&) _ Pd(Sm'Tn)] ~ d¢)[S -- (I) 1(1 -- Or)] f ce0q~2h -2 24s

(5.16)

The density h*(t) which minimizes the integral f ao4)2h -2 is proportional to [Ceo(t)4)2(t)]2/3, and then the value of the integral becomes 2

3

and the corresponding sequence {T*} of median sampling designs is asymptotically optimal for the integral approximation problem and for the signal detection problem, as is seen from the equality of the asymptotic constants in

Sampling designs for time series

353

(5.7) and (5.12) for the former case, and from (5.7) and (5.16), (4.35) for the latter. Thus in these cases, median sampling design is both very simple (in view of the very simple form of its coefficients) and asymptotically optimal. In the regression problem the asymptotic constant in (5.14) is 50% larger than that in (4.5) and (5.7) and thus median sampling is not asymptotically optimal; it requires asymptotically about 22.5% more samples than the optimal sampling design in order to achieve the same variance. It is remarkable that median sampling design, utilizing such a simple (nonparametric) form of estimator coefficients, is asymptotically optimal for integral approximation and signal detection, and for regression coefficient estimation, it has the same rate of convergence of the optimal sequence of designs using (parametric) optimal coefficients, but with larger asymptotic constant. These results were obtained by Schoenfelder (1978) and complemented by Cambanis and Masry (1983). Work in progress by Schoenfelder has extended these results to k = 1, i.e. exactly one quadratic mean derivative for the centered process X(t); and for k ~> 2 it has produced rates of convergence n -2k-2 (i.e. identical with those of the optimal sequence of designs using optimal coefficients) by using, instead of the median of each interval (i.e. midpoint sampling), k appropriate quantiles (i.e. quantile sampling).

5.3. Simple coefficients and simple random sampling 2 In this case, it is clear from expressions (4.19), (4.29) and (4.41) that esr, r ., Var/3~r,r " - s -z, s z - SNR(Ssr,r,) all tend to zero with rate n -1, with no assumption whatsoever on the covariance R. This very simple result is also valid for random fields (i.e. for multidimensional index sets A). We can also find the optimal fixed sample size simple random design, by finding the density h which minimizes the integral f~r2cb2h -1 in (4.19). This optimal density h(t) is proportional to cr(t)lcb(t)l, and assuming or(t) is bounded away from 0, we have

[ ' r , - n-1 ~" sgn ¢b(tklx(tk ) k=l

e:.

o.(tk)

,

(5.18)

1,

and all other quantities are determined by these via (4.27), (4.29), (4.37) and (4.41).

5.4. Simple coefficients" and stratified sampling It is clear from expressions (4.22), (4.32) and (4.44) that only the convergence 2 to zero of e~t,r " needs to be considered. For periodic sampling with uniform jitter (worst case), we have

f o-2(t)4f(t) dt. ne 2st.r, -~ IAIJA

(5.20)

354

S. Cambanis

For each fixed partition {A,k}7,=~ of A, the sampling densities/~,k(t) which are proportional to ~r(t)l~b(t)] within each stratum A,k minimize the right-hand side of (4.22) term by term, producing a (partly optimal) stratified sampling design with n

'

=

=

,k

-7--c. cr(t,k)

(5.21)

A(tnk)

and n

k

/tfA

nk

JAfA nk

nk

}

For regular sequences of partitions generated by a density h bounded away from zero, we have e-2s,,r. = O(n-1).

(5.23)

Precise rates of convergence depend again on the quadratic mean differentiability of the centered process X, and again require appropriate regularity conditions. When k = 0, i.e. the centered process X has no quadratic mean derivative, then z-2 l ( ao(t)492(t) n est,r" -~ ~ Ja h2(t ) dt,

(5.24)

and by choosing h * ( t ) proportional to [ao(t)4)z(t)] 2/3, we obtain an asymptotically optimal sequence {T~} of stratified sampling designs with

IrA[a0(t)~b2(/)l 1/3dt /3.

n 2-z e s t , ~ ; ~1

(5.25)

Comparing (5.25) with (5.9), we see that the asymptotically optimal sequence of stratified sampling designs has the same rate as the sequence of optimal sampling designs using optimal coefficients, and the asymptotically optimal sequence of median sampling designs using simple coefficients, but asymptotic constant twice as large, thus requiring asymptotically 41.5% more samples for the same performance. When k ~ 1, i.e. when the centered process X has one or more quadratic mean derivatives, then 3-2 1 f . fl(t)d)2(t) n est,T.--~ ~ - , a h3(t ) dr,

(5.26)

where/3(t) = 2o-(t)o-"(t)- 2R~°a~(t, t) (= -2R"(0) in the stationary case)is ~>0 and in fact >0 on some small interval, so that the rate does not improve as the centered process X has more than one quadratic mean derivative. Thus n - 3 is the ultimate rate achievable by stratified sampling designs. By choosing h * ( t ) proportional to [/3 (t)~b2(t)] TM we obtain an asymptotically optimal sequence {T~}

Sampling designsfor time series

355

of stratified sampling designs with

n est,T.. ~

[fi(t)qsz(t)] TM dt

;

.

(5.27)

6. Discussion and extension

6.1. References for Sections 4 and 5 The work described in Sections 4 and 5 began with a series of fundamental papers by Sacks and Ylvisaker (1966, 1968, 1970a, 1970b), where the regression problem was considered using optimal coefficients, the notion of asymptotically optimal designs was introduced, and asymptotically optimal designs using optimal coefficients were found for k = 0, 1. In the last paper, the connection with the random integral estimation problem is also discussed. For general k, but a more restricted class of covariances, asymptotically optimal designs using optimal coefficients were developed by Eubank, Smith and Smith (1981, 1982b) based on the related work of Barrow and Smith (1978, 1979). Schoenfelder (1978) studied median and random designs for the integral approximation problem, and Cambanis and Masry (1983) considered the signal detection problem.

6.2. Sampling designs of fixed size The existence of optimal fixed sample size designs using optimal coefficients, and algorithms for their construction, are discussed, for k = 0, 1 and certain covariances, by Eubank, Smith and Smith (1981, 1982a). Random designs of fixed sample size are compared in Schoenfelder and Cambanis (1983): for every simple random sampling design, there is a better stratified sampling design of the same size; there are cases where systematic sampling outperforms stratified sampling and vice versa--in fact, systematic sampling may be outperformed even by random sampling in special cases.

6.3. Comparison of asymptotic performance The asymptotic performance of the various sampling designs considered in Section 5 is summarized in Table 1, where the exact rates of convergence are shown. The parentheses indicate that, except for k = 1, the result is established for special cases, and the double parentheses indicate anticipated results, not yet available in the literature. The performance of simple random sampling, while quite poor, is not affected by the smoothness of R or by the dimensionality of the index set. Stratified sampling performs as expected better than simple random, but is also not affected by the smoothness of R once one quadratic mean derivative exists; and even when its rate is n -2 for k = 0, just as for the nonrandom designs, its asymptotic constant is twice as large. The performance of all of these

S. Cambanis

356 Table 1 Exact rates of convergence Exact no. of q.m. derivatives k=0

1 ~ 1, C~, < Ck and as expected use of derivatives improves performance asymptotically (for instance, C ; = 0.3C 2, C~:C3/21, C; = C4/210, etc.). Wahba (1971) and Hfijek and Kimeldorf (1974) treat the case where the centered process X is autoregressive, and these results are generalized to the vector-valued case by Wittwer (1976). Wahba (1974) also treats the (more general) case where R is the covariance of a Gaussian process equivalent to an autoregressive Gaussian process. Eubank, Smith and Smith (1981, 1982a) give sufficient conditions on f under which there exist unique optimal designs using optimal coefficients for each sample size or for all sufficiently large sample sizes, and develop algorithms for finding them. Product sampling designs for a certain two-dimensional random field are considered by Wittwer (1978).

Sampling designs for time series

359

7.2. Relationship to quadrature formulae The approximation of the random integral (2.1), f a X ( t ) c k ( t ) d t , by the random sum (2.2), c~X T = Z~= 1Cr,kX(tk), of the values of the random integrand X ( t ) at a finite number of points, is reminiscent of quadrature formulae in the approximation of ordinary integrals, and their relationship is discussed in Sacks and Ylvisaker (1970b). Consider approximating the ordinary integral fa x(t)qb(t)dt by the quadrature formula crxr = ~=1Cr.kX(tk) • When x belongs to the reproducing kernel Hilbert space of R, the approximation error can be written as

er(X ,

Cr)= I f a x ( t ) 4 ) ( t ) d r -

n

k~= l

CT'kX(tk)

L=

I f f -- gr, cT~ X)l ,

.... CT,kR ' t( k, t). where ( . , • ) is the inner product of the RKHS(R) and gr, cr(t ) = zk=l For fixed T, the quadrature formula is called best in the sense of Sard (1963) if the coefficients c r minimize sup er(X, CT) = sup ][(f- gT,~r' X)I = I[fllxtl~l

Ilxll~l

gv,~ll.

But this is minimized when gTc* = P r f (cf. (5.2) and (5.4)). The quadrature • . ' T formula is called best m the sense of Sard if the finite set T* of sampling points (nodes) and the weights c~-. satisfy sup ev.(x, c).) = inf inf sup er(X, cv)(= inf IV - Prf[I) tlXl]~l

r

C T tlxll~l

r

The connection with the random integral approximation problem follows from the relationship

Ee2r(X(.), cr) = E

X(t)4~(t) dt - ~, CT;kX(tk k=t

- 17-gT;¢r]l 2 = sup er(X, CT) . IlxlL~l

Thus the best quadrature formula c~.x r. in the sense of Sard for f A x(t)g)(t) dt, x ~ R K H S ( R ) , determines the optimal sampling design T* when optimal coefficients c r are used and the best quadratic mean approximation c r.Xr. of fAX(t)~(t)dt, and vice versa. Certain properties of the best quadrature formula in the sense of Sard, and thus also of the optimal sampling design, are established in Karlin (1976), and asymptotics are studied in Barrow and Smith (1979).

360

S. Cambanis

7.3. Estimating random integrals with observation errors In connection with the problem of estimating the random integral (2.1), it is natural to consider the case where the values of the process X cannot be measured with perfect accuracy, but the observation at each sampling instant ti is Y~ = X(ti)+ ei, where the observation errors ei are uncorrelated and have zero means and common variance o-2. This important case has been studied by Jones (1948) when ~b --- 1 and R(t, s) = e x p ( - l t - sl), where it is shown that, with A = [0, 1], the best estimator is n -1 £i~=1 Y~ and the optimal sampling design of size n, {t~}~"__1, is periodic and symmetrically located in [0, 1], and its period along with the value t 1= 1 - t , are determined (implicitly). Kendall (1948) showed that the mean square error e ,2 satisfies ne2,~ 0-2, which, compared to (5.9) with k = 0, shows a loss of one power in the rate of convergence due to observation errors (n -1 instead of n -2 with no observation errors).

Z 4. Multiple regression with negligible correlation Bickel and Herzberg (1979) consider the multiple regression problem (6.2) with error covariance R(t, s)= y0-2p(t- s) for t # s and R(t, t) = 0-2, where p is a stationary covariance with p(0) = 1, and 0 ~< 7 ~< 1 ; i.e. the error N(t) consists of a stationary component with covariance y0-2p(t-s) and an uncorrelated white component with variance ( 1 - T ) 0 - 2. They make the critical assumption that as the sample size n of the design increases, the error correlation becomes negligible; specifically, they assume that p(t) depends on the sample size n as follows: p,(t)= r(nt), where r is a fixed stationary covariance with r(t)~O as t~. When the regression functions are powers, f j ( t ) = tj-l, and A = [ - a , a], they point out that asymptotic results when the covariance p is allowed to depend on the sample size n as above, can be translated to asymptotic results for fixed covariance p but interval over which the samples are taken depending on the sample size as follows: A , = I-ha, na]. They show that the variancecovariance matrix of the least squares linear estimates of the regression coefficients tends to zero with rate n -1, and they determine asymptotically optimal designs implicitly in the general case, and explicitly for location (J = 1, f l = 1), regression through the origin (J = 1, fl(t)= t), and linear regression (J = 2, fl =- 1, fz(t) = t). In Bickel, Herzberg and Schilling (1981) the first-order autoregressive case p ( t ) - exp(-[tl) is considered in detail for the location and linear regression problems, and the performance of the uniform designs is compared with that of the optimal and the asymptotically optimal designs. For the cases treated in the latter paper, the asymptotic performance of the variance-covariance matrix of the minimum variance linear unbiased estimao tots and of the least squares linear estimates are identical.

7.5. Periodic sampling to discriminate processes with independent increments Some aspects of the problem of finding the optimal period of a periodic sampling design of fixed size, in discriminating between two processes with independent increments are considered in Newman and Stuck (1979).

Sampling designs for time series

361


E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5 © Elsevier Science Publishers B.V. (1985) 363-387


Measuring Attenuation

M. A. Cameron and P. J. Thomson

1. Introduction

The analysis of the relationships between several time series may be performed in a number of ways. In the time domain, vector models of ARMA type may be fitted to the data and these give empirical descriptions of the relationships between the series that can be used to generate forecasts. These models may form the basis of further investigation of the data structure (see Box and Tiao (1977) for an example). The alternative in the time domain is to hypothesise a structure and to use that to formulate and fit an appropriate model. A simple example of this is a time series regression (transfer function model), but more generally the model will be a constrained vector ARMA model. In the frequency domain the methods developed in multivariate analysis for the analysis of covariance matrices may be adapted to analyse complex covariance matrices and applied to the estimated cross-spectral density matrices at different frequencies. Brillinger (1975) gives examples of regression, principal component and canonical variate analysis applied to the estimated cross-spectral matrices. If the analysis is exploratory, then the estimated spectral density for each of a number of frequencies across the frequency range of interest is analysed separately and no attempt is made to combine, formally, the information from different frequency bands. On the other hand, there may be a parametric model for the dependence between the different series which is formulated in either the time or the frequency domain. Then, if the analysis is performed in the frequency domain, information from different frequency bands should be combined formally.

In this chapter, the problem of the estimation of the attenuation of a signal is considered. Here a model for the observations may be written down in the time domain, but the estimation of the time domain model may be performed in either the time or the frequency domain. The model is a time series version of a factor analysis model and so the methods and results are of wide relevance. For example, Geweke and Singleton (1981) and Engle and Watson (1981)


describe the application of time series factor models to economic data and Cameron (1983) uses the model for comparative calibration of time series recorders. The methods are also applicable, for example, to a geophysical problem described by Clay and Hinich (1981) and to the estimation of the velocity and attenuation of a signal as it passes across an array of sensors. The model will be developed and discussed in terms of the attenuation of a signal across an array for definiteness.

In the next section, the model is described and some notation is introduced. In Section 3, methods for estimating attenuation are given. These are for time domain and for both narrow and broad band frequency domain estimation. In Section 4 the methods are extended to the case where delays are present whilst in Section 5 a discussion is given of how the methods could be used in practice and the analysis of a simple set of data is described.

2. The model

Consider the situation where a number of recording devices are each endeavouring to measure a common scalar stochastic signal. Typically the observations made at these recorders will comprise a modified form of the signal together with additive noise. A simple model of a situation such as this is the (single) factor analysis model

$$y_{ij} = \mu_j + \alpha_j S_i + \varepsilon_{ij}, \qquad j = 1, \ldots, p. \tag{2.1}$$

Here the observed random variables $y_{ij}$ have mean $\mu_j$, the common signal $S_i$ has mean zero and variance $\sigma_S^2$, $S_i$ and $\varepsilon_{ij}$ are independent Gaussian random variables and each $\varepsilon_{ij}$ has mean zero and variance $\sigma_j^2$. The $\alpha_j$ are attenuation coefficients. However, for the case of the simple model (2.1), they are more commonly known as factor loadings. This particular model has been extensively used in many areas, in particular the social sciences. The properties of such models together with associated estimation procedures have been extensively discussed (see, for example, Joreskog, 1978).

We wish to set this familiar problem in the more general time series context of signal estimation where the signal now becomes a stochastic process over time. The basic model considered is given by

$$y_j(t) = \mu_j + \int_{-\infty}^{\infty} \alpha_j(\tau) S(t - \tau)\, d\tau + x_j(t), \qquad -\infty < t < \infty, \tag{2.2}$$

where each of the observed processes $y_j(t)$, $j = 1, \ldots, p$, comprises a mean $\mu_j$, a filtered form of the common signal $S(t)$ and a noise process $x_j(t)$. The $S(t)$ and $x_j(t)$ processes are assumed to be zero mean, continuous time stationary processes with spectral densities $f_S(\omega)$ and $f_{x,j}(\omega)$, respectively. Moreover, the $x_j(t)$ will be assumed to be independent of one another and also of $S(t)$. The impulse response function $\alpha_j(\tau)$ reflects the fact that the signal


will, in general, undergo modification prior to its arrival at any recorder and that this modification will typically depend on the particular recorder concerned.

Suppose the spectral representation of $S(t)$ is given by

$$S(t) = \int_{-\infty}^{\infty} e^{-it\omega}\, dZ(\omega), \tag{2.3}$$

where the complex valued process $Z(\omega)$ has zero mean and orthogonal increments, i.e.

$$E\{dZ(\omega)\, \overline{dZ(\omega')}\} = \begin{cases} f_S(\omega)\, d\omega, & \omega = \omega', \\ 0, & \omega \neq \omega'. \end{cases}$$

Then (2.2) can be written as

$$y_j(t) = \mu_j + \int_{-\infty}^{\infty} e^{-it\omega} a_j(\omega)\, dZ(\omega) + x_j(t), \tag{2.4}$$

where

$$a_j(\omega) = \int_{-\infty}^{\infty} \alpha_j(\tau) e^{i\omega\tau}\, d\tau$$

is the transfer function of the filter with impulse response function $\alpha_j(\tau)$. It is evidently $a_j(\omega)$ that modifies the component of $S(t)$ at frequency $\omega$.

In practice, the data will normally be sampled at equidistant time intervals either by virtue of the recording process adopted or as a consequence of the fact that the data will invariably be analysed by a digital computer. Assuming that the sampling interval has been chosen sufficiently small for aliasing effects in either $S(t)$ or the $x_j(t)$ to be ignored, the continuous time model (2.2) is now replaced by the discrete time process:

$$y_j(n) = \mu_j + \sum_k \beta_j(k) S(n - k) + x_j(n), \qquad n = 0, \pm 1, \ldots, \tag{2.5}$$

where time has now been rescaled so that the sampling interval represents one time unit and the $\beta_j(k)$ satisfy

$$\sum_k \beta_j(k) e^{ik\omega} = a_j(\omega), \qquad -\pi < \omega \le \pi.$$

We assume that $a_j(\omega)$ is only non-zero in $(-\pi, \pi]$. Then the $\alpha_j(\tau)$ can be recovered from the $\beta_j(k)$ via the formula

$$\alpha_j(\tau) = \sum_k \beta_j(k) \frac{\sin(k - \tau)\pi}{(k - \tau)\pi},$$


due to Shannon. However, (2.5) may be considered as a model in its own right irrespective of whether or not it has been derived from some underlying continuous time process. In any event, we are concerned with fitting models such as (2.5) to data and, in particular, estimating the transfer functions $a_j(\omega)$. From the latter, estimates of the $\beta_j(k)$ and $\alpha_j(\tau)$ can be derived, either non-parametrically or via some suitably parametrised version of (2.5).

Consider the $p$-dimensional process $y(n)$, $n = 0, \pm 1, \ldots$, with $j$th component $y_j(n)$ given by (2.5). If $\gamma_S(n)$ and $\gamma_{x,j}(n)$ are the serial covariance functions of $S(n)$ and $x_j(n)$ respectively, then the matrix of cross-covariance functions of $y(n)$ has typical element

$$\gamma_{jk}(n) = \sum_l \sum_{l'} \beta_j(l) \beta_k(l + l') \gamma_S(n - l') + \delta_{jk} \gamma_{x,j}(n),$$

where $\delta_{jk}$ is the Kronecker delta. Likewise, the spectral density matrix $f(\omega)$ of $y(n)$ has typical element

$$f_{jk}(\omega) = a_j(\omega) \overline{a_k(\omega)} f_S(\omega) + \delta_{jk} f_{x,j}(\omega).$$

It is clear that the signal, noise and attenuation parameters are not uniquely determined by the cross-covariances and spectral densities of the $y(n)$ process unless more is known or artificial constraints are imposed. If one of the $y_j(n)$ can be thought of as a reference or control series for which the corresponding $a_j(\omega)$ is identically unity (i.e. $\beta_j(n) = \delta_{0n}$), then this problem is resolved. Yet another possibility would be to confound $f_S(\omega)$ with the $a_j(\omega)$ and measure only the factors $a_j(\omega) f_S^{1/2}(\omega)$. This is equivalent to taking the signal as unit variance white noise, but with spectral components whose amplitude is unrelated to those of the noise. Normalising the factors $a_j(\omega) f_S^{1/2}(\omega)$ by the amplitude of the corresponding noise process at frequency $\omega$ yields the spectral density

$$f_{jk}(\omega) = (\nu_j(\omega) \overline{\nu_k(\omega)} + \delta_{jk}) f_{x,j}^{1/2}(\omega) f_{x,k}^{1/2}(\omega), \tag{2.6}$$

where

$$\nu_j(\omega) = a_j(\omega) \{f_S(\omega)/f_{x,j}(\omega)\}^{1/2}. \tag{2.7}$$

Apart from scaling the signal and noise to have the same amplitude, this reformulation of $f(\omega)$ has the virtue that $|\nu_j(\omega)|^2$ admits a simple interpretation as the signal-to-noise ratio at the $j$th recorder. The latter is an important parameter in its own right. In practice, it might be expected that the signal $S(n)$ and the transfer functions $a_j(\omega)$ would be relatively smooth with the noise processes $x_j(t)$ having reasonably flat spectra. This would mean that the $\nu_j(\omega)$ would tend to be sizeable in the lower frequencies. As a consequence, it would seem that these spectral quantities should typically vary smoothly over frequency. Thus it might be expected that non-parametric estimation techniques based on the Fourier transforms of the data over non-overlapping narrow frequency bands would prove


to be effective. Similarly, simple parametric forms for the $\nu_j(\omega)$ and the $f_{x,j}(\omega)$ should generally fit such data well.

The model (2.5) can be generalised in a number of ways. Of these, the most obvious is to incorporate delays. Consider the situation where the signal is received by an array of recorders. Because of the spatial configuration of the array, the individual recorders will, at any instant of time, receive lagged or delayed forms of the signal. In such circumstances, the model (2.5) becomes

$$y_j(n) = \mu_j + \sum_k \beta_j(k) S(n - \tau_j - k) + x_j(n), \tag{2.8}$$

where the $\tau_j$'s are not integers in general. However, it may well be true that the medium through which the signal is travelling is dispersive. This would mean that the different frequency components that make up $S(t)$ travel at different speeds, resulting in frequency-dependent delays. This leads to a model of the form

$$y_j(n) = \mu_j + \int_{-\pi}^{\pi} e^{-i(n - \tau_j(\omega))\omega} a_j(\omega)\, dZ(\omega) + x_j(n). \tag{2.9}$$

Methods of taking account of delays will be given later in this chapter. Other generalisations concern the cases where trend is present, where the signal is transient, where the observations and the signal are no longer scalar but are vector time series, and where there is more than one signal. This last case includes the situation where signal and noise are not incoherent.
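To make the discrete model concrete before turning to estimation, the following sketch simulates data from (2.5) for $p = 3$ recorders: a common autoregressive signal $S(n)$, a short attenuation filter $\beta_j(k)$ at each recorder, and independent white noise. All numerical values (the AR coefficient, filter weights, means and noise level) are illustrative assumptions, not values taken from this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 1024, 3

# Common signal S(n): a zero-mean AR(1) process (illustrative choice).
e = rng.standard_normal(N + 100)
S = np.zeros(N + 100)
for n in range(1, N + 100):
    S[n] = 0.8 * S[n - 1] + e[n]
S = S[100:]                                   # discard burn-in

# Attenuation filters beta_j(k); recorder 1 is a reference (delta filter).
beta = [np.array([1.0]),                      # beta_1(k) = delta_{k0}
        np.array([0.6, 0.3]),                 # recorder 2: attenuated, smeared
        np.array([0.4, 0.2, 0.1])]            # recorder 3

mu = np.array([0.0, 1.0, -0.5])               # recorder means mu_j
sigma_x = 0.5                                 # noise standard deviation

y = np.empty((p, N))
for j in range(p):
    filtered = np.convolve(S, beta[j])[:N]    # sum_k beta_j(k) S(n - k)
    y[j] = mu[j] + filtered + sigma_x * rng.standard_normal(N)
```

Later sketches in this chapter reuse `y`, `N` and `p` from this simulation.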

3. Estimation

This section addresses the problems of fitting models such as (2.5) to data. As in most model fitting, the procedure used is composed of three parts: (a) exploratory data analysis and model selection, (b) parameter estimation and (c) diagnostic model checking. As models become more complicated and involve more parameters, the model selection phase becomes increasingly important since there will be many plausible models. In time series these models must be fitted by numerically maximising some function or, equivalently, by solving a set of non-linear equations. This becomes more difficult as the number of parameters increases, particularly if good initial estimates of the parameters are not available or if the model is a poor description of the data. Unfortunately, the model selection itself becomes more complicated. For example, model selection in ARMA models is generally based solely on the autocorrelations and partial autocorrelations, whilst in transfer functions it is either a multistage procedure (Box and Jenkins, 1976) or involves calculating


the impulse response function and noise autocorrelations from spectrum estimates (Cameron, 1981; Pukkila, 1982). For models of the form of (2.5), the simplest procedure is to calculate estimates of $\nu_j(\omega)$ and $f_{x,j}(\omega)$ for a number of frequencies and then to apply an inverse Fourier transform to these and choose appropriate models on the basis of these derived quantities. The estimation of $\nu_j(\omega)$ and $f_{x,j}(\omega)$ is described in Subsection 3.1. Although we have introduced these estimates as being part of the model selection procedure, they are of interest in their own right in many applications, especially if there are delays between the series (see Section 4). Subsection 3.2 addresses the problem of fitting (2.5) and its various parametric forms over all frequencies or possibly over some chosen band of frequencies. For this estimation a parametric form is required for $\nu_j(\omega)$ but not for $f_{x,j}(\omega)$, which may be approximated by the narrow band estimates obtained using the methods of Subsection 3.1. Finally, in Subsection 3.3, methods of fitting (2.5) in the time domain are considered. Here all components of the model must be parametrised.

The estimation procedures described in Subsections 3.1, 3.2 and 3.3 thus form a natural sequence in fitting models such as (2.5). The exploratory phase suggests the models to be fitted, the frequency domain procedure of Subsection 3.2 allows the transfer functions or $\nu_j(\omega)$'s to be modelled without a model also being fitted to the noise processes. The time domain procedure allows a full, exact maximum likelihood estimation of all parameters simultaneously. This will be most useful if there are few observations (which may be the case with economic data, for example). It will often be the case, however, that there are sufficient data for frequency domain methods to be used and that it is only the transfer function that is of interest, so that the time domain estimation procedure will not be required.

3.1. Estimation: Narrow band

We now consider the problem of estimating the $\nu_j(\omega)$ and $f_{x,j}(\omega)$ from a sample of observations $y(1), \ldots, y(N)$ generated by (2.5). These are most expeditiously determined using the finite Fourier transform of the data,

$$W(\omega_k) = (2\pi N)^{-1/2} \sum_{n=1}^{N} y(n)\, e^{in\omega_k}, \qquad 0 \le k \le [\tfrac{1}{2}N], \quad \omega_k = \frac{2\pi k}{N}. \tag{3.1}$$

Here $[x]$ denotes the integral part of $x$. These quantities are important because $(2\pi/N)^{1/2} W(\omega)$ is an approximation to the component of frequency $\omega$ in the spectral representation of $y(n)$ and, if $N$ is highly composite, they are extremely cheap (i.e. rapid) to compute. Moreover, under certain quite general conditions, the $W(\omega_k)$ are asymptotically independently distributed each with a complex multivariate normal distribution with zero mean vector and covariance matrix $f(\omega_k)$ (see Hannan, 1970). Now the situation we have in mind is that where the $\nu_j(\omega)$ and the $f_{x,j}(\omega)$ vary slowly with frequency $\omega$. This means


that estimates of the $\nu_j(\omega)$ and the $f_{x,j}(\omega)$ at any given frequency $\omega$ should be able to be constructed from the $W(\omega_k)$ evaluated over those $\omega_k$ nearest to $\omega$. For the sake of clarity let us denote the chosen frequency of interest as $\lambda$ and suppose that our estimates of the $\nu_j(\lambda)$ and $f_{x,j}(\lambda)$ will be based on the $m$ values of $\omega_k$ closest to $\lambda$. An obvious estimation technique in this situation is to use the method of maximum likelihood where the likelihood is given by the probability density function derived from the asymptotic distribution of the $W(\omega_k)$ for the $m$ values of $\omega_k$ nearest to $\lambda$. The relevant log-likelihood is proportional to

$$l(\lambda) = -m^{-1} \sum_{\lambda} \{\log \det f(\omega_k) + \operatorname{tr}[f^{-1}(\omega_k) W(\omega_k) W(\omega_k)^*]\}, \tag{3.2}$$

where $\det(\cdot)$ and $\operatorname{tr}(\cdot)$ denote the matrix operations of determinant and trace respectively and $\sum_{\lambda}$ is the sum over the $m$ values of $\omega_k$ concerned. Moreover, in keeping with the assumption that $f(\omega)$ is not varying to any degree over this narrow band of frequencies, for the $\omega_k$ near $\lambda$, we set

$$f(\omega_k) = f_x^{1/2}(I + \nu\nu^*) f_x^{1/2}. \tag{3.3}$$

Here $I$ is the $p$-row identity matrix, the asterisk denotes transposition combined with conjugation, the diagonal matrix $f_x$ has typical diagonal element $f_{x,j}(\lambda)$ and the $p$-dimensional vector $\nu$ has typical element $\nu_j(\lambda)$. Note that, given $f(\omega)$ as specified by (3.3), we can only know the $|\nu_j(\lambda)|$ and the phase differences $\psi_j(\lambda) - \psi_k(\lambda)$, where $|\nu_j(\lambda)|$ and $\psi_j(\lambda)$ are the modulus and argument respectively of $\nu_j(\lambda)$. Additional information is necessary in order to identify the individual $\psi_j(\lambda)$. We shall assume that $\psi_1(\lambda)$ is zero. Thus the first recorder is chosen as the recorder relative to which the phase differences $\psi_j(\lambda)$ will be measured.

Now, with $f(\omega)$ given by (3.3), maximising (3.2) is equivalent to minimising

$$\sum_{j=1}^{p} \log f_{x,j} + \log(1 + \nu^*\nu) + \sum_{j=1}^{p} \hat f_{jj}/f_{x,j} - (1 + \nu^*\nu)^{-1} \nu^* f_x^{-1/2} \hat f f_x^{-1/2} \nu,$$

where, for simplicity, the argument $\lambda$ has been omitted and $\hat f = m^{-1} \sum_{\lambda} W(\omega_k) W(\omega_k)^*$. The parameters $\nu_j(\lambda)$ and $f_{x,j}(\lambda)$ could be estimated by direct numerical maximisation of $l(\lambda)$ or else the derivatives of $l(\lambda)$ with respect to the unknown parameters may be calculated and the estimates found by solving the resulting estimating equations. These are

$$f_x^{-1/2} \hat f f_x^{-1/2} \hat\nu = (1 + \hat\nu^*\hat\nu) \hat\nu \tag{3.4}$$

and

$$\hat f_{x,j} = (1 + |\hat\nu_j|^2)^{-1} \hat f_{jj}, \qquad j = 1, \ldots, p. \tag{3.5}$$


In the special case when the noise spectra at the different recorders are assumed equal, these equations may be solved explicitly, yielding

$$\hat f_x = \{\operatorname{tr} \hat f - \hat\nu^* \hat f \hat\nu/(1 + \hat\nu^*\hat\nu)\}/(p - 1) \tag{3.6}$$

and $\hat\nu$ is the eigenvector of $\hat f$ corresponding to the maximum eigenvalue of $\hat f$. If there are only two recorders, such an assumption must be made. In the most general situation, (3.4) and (3.5) are the complex analogues of the equations that arise when fitting a conventional factor analysis model in the case where there is only one factor. If we write $\hat f(\lambda)$ and $\hat\nu$ in terms of their real and imaginary parts, i.e.

$$\hat f(\lambda) = C(\lambda) + iQ(\lambda), \qquad \hat\nu = \hat\nu_R + i\hat\nu_I,$$

where $C(\lambda)$ is symmetric and $Q(\lambda)$ is skew-symmetric, then (3.4) and (3.5) become

$$\begin{bmatrix} \hat f_x^{-1/2} & 0 \\ 0 & \hat f_x^{-1/2} \end{bmatrix} \begin{bmatrix} C(\lambda) & -Q(\lambda) \\ Q(\lambda) & C(\lambda) \end{bmatrix} \begin{bmatrix} \hat f_x^{-1/2} & 0 \\ 0 & \hat f_x^{-1/2} \end{bmatrix} \begin{bmatrix} \hat\nu_R \\ \hat\nu_I \end{bmatrix} = (1 + \hat\nu_R^t \hat\nu_R + \hat\nu_I^t \hat\nu_I) \begin{bmatrix} \hat\nu_R \\ \hat\nu_I \end{bmatrix}, \tag{3.7}$$

$$\hat f_{x,j}(\lambda) = (1 + \hat\nu_{R,j}^2 + \hat\nu_{I,j}^2)^{-1} C_{jj}(\lambda). \tag{3.8}$$
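As an illustration of the equal-noise solution, the sketch below (continuing the simulation above; the band location and width are arbitrary choices) forms $\hat f(\lambda)$ from the $m$ Fourier ordinates nearest a chosen $\lambda$ and reads off $\hat f_x$ and $\hat\nu$ from the eigendecomposition, using the fact that $f = f_x(I + \nu\nu^*)$ has largest eigenvalue $f_x(1 + \nu^*\nu)$ and remaining $p - 1$ eigenvalues equal to $f_x$.

```python
# Narrow band estimation at a chosen frequency lam, equal noise spectra.
lam, m = 0.5, 16                                      # band centre and size

y0 = y - y.mean(axis=1, keepdims=True)
# Match (3.1), which uses exp(+i n w_k); for real data this is the
# conjugate of numpy's forward FFT convention.
W = np.conj(np.fft.fft(y0, axis=1)) / np.sqrt(2 * np.pi * N)
omega = 2 * np.pi * np.arange(N) / N
idx = np.argsort(np.abs(omega[: N // 2] - lam))[:m]   # m ordinates nearest lam

fhat = (W[:, idx] @ W[:, idx].conj().T) / m           # p x p band estimate

vals, vecs = np.linalg.eigh(fhat)                     # ascending eigenvalues
ell, e1 = vals[-1], vecs[:, -1]
fx_hat = (np.trace(fhat).real - ell) / (p - 1)        # average of the rest, cf. (3.6)
nu_hat = e1 * np.sqrt(max(ell / fx_hat - 1.0, 0.0))   # scale the eigenvector
nu_hat *= np.exp(-1j * np.angle(nu_hat[0]))           # identification: psi_1 = 0
```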

Equations (3.7) and (3.8) can now be solved using methods similar to conventional factor analysis procedures. Clearly, if good a priori estimates of the $f_{x,j}(\lambda)$ were known, then (3.4) states that $\hat\nu$ is proportional to the eigenvector of $f_x^{-1/2} \hat f(\lambda) f_x^{-1/2}$ associated with the largest eigenvalue. The constant of proportionality is the square root of the difference between the eigenvalue and unity. This observation and the simple form of the relationship (3.5) suggest that one might solve (3.4) and (3.5) numerically by first fixing the $f_{x,j}$ and determining $\nu$, then adjusting the $f_{x,j}$ using (3.5) and recomputing $\hat\nu$, iterating until convergence. Unfortunately, this algorithm is frequently very slow to converge since it fails to take into account the covariation between small changes in the $\nu_j$ and the $f_{x,j}$. This problem has been discussed by Joreskog (1967) and Lawley (1967). It is thus preferable to maximise the log-likelihood directly or, equivalently, to devise a Newton-Raphson algorithm based on (3.4) and (3.5) which takes into account the covariation and the fact that (3.2) is being maximised. Experience in standard factor analysis suggests that algorithms based on an optimisation technique of Fletcher and Powell (1963) converge reliably. This is discussed in greater detail by Joreskog (1978) and by Geweke and Singleton (1981).

Under relatively mild regularity conditions (see Thomson, 1982), it can be shown that the resulting estimators of $\nu$ and the $f_{x,j}(\lambda)$ are strongly consistent. Moreover, suppose the vector $\alpha_0$ is defined as

$$\alpha_0 = (f_{x,1}(\lambda), \ldots, f_{x,p}(\lambda), |\nu_1(\lambda)|, \ldots, |\nu_p(\lambda)|, \psi_2(\lambda), \ldots, \psi_p(\lambda))^t,$$


where $|\nu_j(\lambda)|$ and $\psi_j(\lambda)$ are as before the modulus and argument respectively of $\nu_j(\lambda)$. If $\hat\alpha$ is defined as the corresponding vector estimator of $\alpha_0$ obtained from (3.4) and (3.5), then for $m$ and $N$ large enough $m^{1/2}(\hat\alpha - \alpha_0)$ has an asymptotic multivariate normal distribution with zero mean vector and covariance matrix $\Gamma$. Here

$$\Gamma = \begin{bmatrix} J_1 & J_{12}^t & 0 \\ J_{12} & J_2 & 0 \\ 0 & 0 & J_3 \end{bmatrix}^{-1},$$

where the $p \times p$ matrices $J_1$, $J_2$, $J_{12}$ ($= J_{21}$) and the $(p - 1) \times (p - 1)$ diagonal matrix $J_3$ have typical elements

$$J_{1,jk} = \begin{cases} \{1 - |\nu_j|^2 (1 + \nu^*\nu)^{-1}(\nu^*\nu - |\nu_j|^2)\}/f_{x,j}^2, & j = k, \\ -|\nu_j|^2 |\nu_k|^2 (1 + \nu^*\nu)^{-1}/(f_{x,j} f_{x,k}), & j \neq k, \end{cases}$$

$$J_{2,jk} = \begin{cases} 2(1 + \nu^*\nu)^{-1}\{\nu^*\nu - |\nu_j|^2 (1 + 3\nu^*\nu)(1 + \nu^*\nu)^{-1}\}, & j = k, \\ -2|\nu_j||\nu_k|(1 + 3\nu^*\nu)(1 + \nu^*\nu)^{-2}, & j \neq k, \end{cases}$$

$$J_{12,jk} = \begin{cases} |\nu_j|(2 + \nu^*\nu - |\nu_j|^2)\{f_{x,j}(1 + \nu^*\nu)\}^{-1}, & j = k, \\ -|\nu_j||\nu_k|^2 \{f_{x,k}(1 + \nu^*\nu)\}^{-1}, & j \neq k, \end{cases}$$

$$J_{3,jj} = 2(1 + \nu^*\nu)^{-1} |\nu_j|^2\, \nu^*\nu.$$

3.2. Estimation: Wide band

The narrow band estimates provide a useful decomposition of the observations and are computed relatively easily from standard factor analysis software. These estimates may be sufficient in many circumstances. It will often be the case, however, that a model will be fitted to the attenuation over a broad band of frequencies (or all frequencies) for comparison with theory, for forecasting or perhaps just to obtain a smoother estimate. In this case the narrow band estimates may be used to help choose models for the individual components of (2.5) and to provide initial parameter estimates.

In this section it is assumed that the $\nu_j(\omega)$'s depend on a vector $\theta$ of unknown parameters and they are written as $\nu_j(\omega; \theta)$ to indicate this. The form of the parametrisation is not important here. However, stationary processes are usually modelled by ARMA models, transfer functions are often modelled by ratios of polynomials and so models for $\nu_j(\omega; \theta)$ will, most commonly, be ratios of trigonometric polynomials. Note that the noise spectra are not parametrised here. That may be done as a straightforward extension of the methods used here or by using the time domain procedures described in Subsection 3.3. However, that first requires a good parametric model of the noise processes. A good way of choosing a model for these processes is from the estimated noise spectra derived in Subsection 3.1. These may be inverted using the discrete


Fourier transform to obtain estimates of the serial correlation functions and other quantities derived from them. In practice, estimates of all the unknown parameters will be calculated using a numerical optimisation algorithm. If this is to find a global optimum efficiently and if there are a large number of parameters to be estimated, then the models for the various components $f_x(\omega)$ and $\nu(\omega)$ must be reasonable fits to the data and the initial estimates should be close to the optimal ones. Since the number of unknown parameters may grow very rapidly, the best strategy is to use the methods of this section to estimate the parameters defining the $\nu_j(\omega)$'s before attempting to model the noise processes. It will be shown below, however, that if the noise spectra are assumed to be equal, then $\theta$ may be estimated without explicitly estimating the noise spectra.

As in the narrow band case, the estimation of $\theta$ is performed by maximising a likelihood derived from the asymptotic distribution of $W(\omega)$. The likelihood is calculated assuming that the noise spectra are known, though in practice they will not be known and will be replaced by estimates obtained in the manner described in Subsection 3.1. The estimate of $\theta$ need not depend on all frequencies available. Let $B$ be a finite union of intervals in $(-\pi, \pi)$ that is symmetric about, but does not include, the origin. Then the vector $\theta$ of parameters describing the $\nu_j$'s may be estimated by maximising the log-likelihood, which is equivalent to minimising

$$C_N(\theta) = (m/2N) \sum_{\lambda \in B} [\log\{\det f(\lambda)\} + \operatorname{tr}\{f^{-1}(\lambda) \hat f(\lambda)\}] \tag{3.9}$$

with respect to the parameter $\theta$. Except for a scaling factor, $C_N(\theta)$ is just the sum of terms of the form (3.2) calculated over a number of non-overlapping narrow bands. Differentiating $C_N(\theta)$ with respect to $\theta$ leads to the estimating equations

$$\frac{m}{2N} \sum_{\lambda \in B} \operatorname{tr}\left[ f^{-1}(\lambda) \frac{\partial f(\lambda)}{\partial \theta} f^{-1}(\lambda) \{f(\lambda) - \hat f(\lambda)\} \right] = 0, \tag{3.10}$$

where

$$\frac{\partial f(\lambda)}{\partial \theta} = f_x^{1/2}(\lambda) \left[ \frac{\partial \nu(\lambda; \theta)}{\partial \theta} \nu^* + \nu \frac{\partial \nu^*(\lambda; \theta)}{\partial \theta} \right] f_x^{1/2}(\lambda).$$

Suppose that $\hat\theta$ is the solution to (3.10) and that $\theta_0$ is the true value of $\theta$. It is proved in Cameron (1983) that this estimator of $\theta$ is strongly consistent and that $N^{1/2}(\hat\theta - \theta_0)$ is a Gaussian random vector with mean zero and covariance matrix

$$A^{-1} + A^{-1} K A^{-1},$$


where the $(j, k)$ element of $A$ is

$$(4\pi)^{-1} \int_B \operatorname{tr}\left\{ f^{-1}(\lambda) \frac{\partial f(\lambda)}{\partial \theta_j} f^{-1}(\lambda) \frac{\partial f(\lambda)}{\partial \theta_k} \right\} d\lambda$$

and the $(j, k)$ element of $K$ is

$$(2\pi)^{-1} \sum_{a,b,c,d} \iint_B \Phi_{(j)ab}(\lambda)\, \Phi_{(k)cd}(\lambda')\, s_{abcd}(\lambda, -\lambda, -\lambda', \lambda')\, d\lambda\, d\lambda'.$$

Here $\Phi_{(j)ab}(\lambda)$ is the $(a, b)$ element of the matrix $f^{-1}(\lambda)[\partial f(\lambda)/\partial \theta_j] f^{-1}(\lambda)$ and $s_{abcd}(\lambda, -\lambda, -\lambda', \lambda')$ is the Fourier transform of the fourth cumulant function of $y(n)$. The integrand in $A$ simplifies to

$$\operatorname{tr}\left\{ [I + \nu\nu^*]^{-1} \frac{\partial}{\partial \theta_j}(\nu\nu^*) [I + \nu\nu^*]^{-1} \frac{\partial}{\partial \theta_k}(\nu\nu^*) \right\},$$

where the arguments $\lambda$ and $\theta$ of $\nu$ have been deleted, and

$$\Phi_{(j)} = f_x^{-1/2} [I + \nu\nu^*]^{-1} \frac{\partial}{\partial \theta_j}(\nu\nu^*) [I + \nu\nu^*]^{-1} f_x^{-1/2}.$$

The matrix $A$ is estimated consistently by the matrix of second derivatives of $C_N(\theta)$. In the important case where the data are Gaussian, the matrix $K$ is null and $N^{1/2}(\hat\theta - \theta_0)$ has an asymptotic Gaussian distribution with mean zero and covariance matrix $A^{-1}$. In the non-Gaussian case, Taniguchi (1982) gives a method of obtaining consistent estimates of $K$.

If, in addition to assuming that the noise spectra are constant over narrow bands, it is also assumed that within each band the spectra of the noise are equal, then the estimating equations may be simplified. For a given $\theta$, $\nu(\lambda; \theta)$ may be evaluated for each frequency band and thus, using (3.6), estimates of the common noise spectrum may be obtained. When this is substituted into $C_N(\theta)$, the resulting expression simplifies and, omitting an additive constant, reduces to

$$C_N^{(1)}(\theta) = m(2N)^{-1} \sum_{\lambda \in B} [(p - 1) \log\{\operatorname{tr} \hat f(\lambda) - R(\lambda; \theta)\} + \log R(\lambda; \theta)],$$

where

$$R(\lambda; \theta) = m^{-1} \sum_{\lambda} |\nu(\lambda; \theta)^* W(\omega_s)|^2 / \{\nu(\lambda; \theta)^* \nu(\lambda; \theta)\}.$$
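A minimal sketch of the wide band fit under two simplifying assumptions: the parametric form is $\nu(\omega; \theta) = \theta$, a real vector constant over frequency, and the plug-in noise spectra are taken as the single narrow band value $\hat f_x$ above for every band. The criterion (3.9) is then minimised with a general-purpose optimiser; the band layout is an illustrative choice.

```python
from scipy.optimize import minimize

# Non-overlapping bands of m Fourier frequencies covering (0, pi).
bands = [np.arange(s, s + m) for s in range(1, N // 2 - m, m)]
fhats = [(W[:, b] @ W[:, b].conj().T) / m for b in bands]
fx_bands = [np.full(p, fx_hat) for _ in bands]        # plug-in noise (assumption)

def CN(theta):
    """Criterion (3.9) with nu(omega; theta) = theta, constant over frequency."""
    nu = theta.astype(complex)
    total = 0.0
    for fh, fxu in zip(fhats, fx_bands):
        fmod = np.sqrt(np.outer(fxu, fxu)) * (np.eye(p) + np.outer(nu, nu.conj()))
        total += np.linalg.slogdet(fmod)[1]                  # log det f(lambda)
        total += np.trace(np.linalg.solve(fmod, fh)).real    # tr{f^-1 fhat}
    return m * total / (2 * N)

res = minimize(CN, np.abs(nu_hat), method="Nelder-Mead")     # narrow band start
theta_hat = res.x
```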


3.3. Time domain estimation

If parameters are to be estimated in the time domain, then not only must the attenuation be appropriately parametrised but so also must the signal and noise processes. In addition, constraints must be introduced in order that the parameters be identifiable. Suppose that the signal and noise processes are assumed to be ARMA processes and that the transfer function $a_j(\omega)$ is a ratio of trigonometric polynomials. Then the observed series $y_j(n)$, $j = 1, \ldots, p$, form a vector ARMA model from a particular parametric family and so for given values of the unknown parameters the exact likelihood can be calculated (see, for example, Nicholls and Hall, 1979). An algorithm such as that of Fletcher and Powell may then be used to find the parameter values maximising this likelihood. Difficulties arise, however, in choosing the correct models and lags for the signal and noise processes and the attenuation and in obtaining good initial estimates of the unknown parameters. Thus in all but the simplest cases the estimation methods described in Subsections 3.1 and 3.2 should be used before attempting a full time domain model.

A different approach to calculating the likelihood in the time domain, which is in some ways more direct, arises by noting that (2.5) is similar in form to the observation equation in a state-space model, except that the noise here may be coloured. Since the noise processes may also be written in state-space form, the model (2.5) may be written in standard state-space form by augmenting the state to include not only the signal but also the state of the coloured noise processes. Once the model has been written in state-space form, the Kalman filter can be used to compute the innovations recursively and a Gaussian likelihood may be evaluated. This likelihood may be maximised to obtain estimates of the unknown parameters. Engle and Watson (1981) use this procedure to estimate parameters in an example where they have 25 observations of each of 5 series. In their case, the signal follows a second-order autoregression and each of the noises is a first-order autoregression. The impulse response of the transfer function involves only the present value of the signal so that the $a_j(\omega)$ are taken as constant. The state is a vector of length seven, the components at time $n$ being $S(n)$, $S(n-1)$, $x_1(n), \ldots, x_5(n)$. Once the models for the various components have been chosen and the Kalman filter appropriately parametrised, the likelihood is easily evaluated. Again the likelihood must be maximised numerically and Engle and Watson suggest using an algorithm based on the method of scoring. Aasnaes and Kailath (1973) provide a direct recursion for state-space estimation in coloured noise without going through the augmentation step.

As mentioned earlier, a difficulty with time domain estimation lies in the need to choose the correct orders of the models for all components. An


additional problem arises because all of the parameters must be estimated simultaneously. If there are many parameters, a large amount of computation is required unless good initial parameter estimates are available. If one works solely in the time domain, it is difficult to uncouple the signal and noise components, so that a sensible decision about the orders of each of the components cannot be made. As we have shown in Subsection 3.1, however, it is easy to produce band by band estimates of the $\nu_j(\omega)$ and of the noise spectrum and therefore to calculate 'model-free' estimates of the impulse response functions and of the autocorrelation functions of the noise processes. From these the orders of the models may be chosen and initial parameter estimates calculated.
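As a sketch of the state-space route, the function below evaluates the Gaussian likelihood by the Kalman filter for a specification of the kind used by Engle and Watson: an AR(2) signal, AR(1) noises and constant loadings $a_j$, with state $(S(n), S(n-1), x_1(n), \ldots, x_p(n))$. The common noise coefficient, the vague prior and the trial parameter values are simplifying assumptions of the sketch; in practice the function would be handed to a numerical optimiser.

```python
def neg_loglik(y, phi1, phi2, rho, a, q, r):
    """-log likelihood via the Kalman filter for y_j(n) = a_j S(n) + x_j(n),
    with S(n) an AR(2) (phi1, phi2, innovation variance q) and each x_j(n)
    an AR(1) (coefficient rho, innovation variance r).
    State: (S(n), S(n-1), x_1(n), ..., x_p(n))."""
    p, N = y.shape
    d = p + 2
    T = np.zeros((d, d))                        # state transition matrix
    T[0, :2] = phi1, phi2
    T[1, 0] = 1.0
    T[2:, 2:] = rho * np.eye(p)
    Q = np.zeros((d, d))                        # state innovation covariance
    Q[0, 0] = q
    Q[2:, 2:] = r * np.eye(p)
    Z = np.hstack([np.asarray(a, float)[:, None], np.zeros((p, 1)), np.eye(p)])

    x, P = np.zeros(d), 10.0 * np.eye(d)        # vague prior (assumption)
    nll = 0.0
    for n in range(N):
        x, P = T @ x, T @ P @ T.T + Q           # predict
        v = y[:, n] - Z @ x                     # innovation
        F = Z @ P @ Z.T                         # innovation covariance
        nll += 0.5 * (np.linalg.slogdet(F)[1]
                      + v @ np.linalg.solve(F, v) + p * np.log(2 * np.pi))
        K = P @ Z.T @ np.linalg.inv(F)          # Kalman gain
        x, P = x + K @ v, P - K @ Z @ P         # measurement update
    return nll

# Placeholder parameter values, to be replaced by optimised estimates:
nll = neg_loglik(y - y.mean(axis=1, keepdims=True),
                 0.5, 0.2, 0.7, np.ones(3), 1.0, 0.25)
```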

4. Estimation in the presence of delays

We now consider the situation where the model incorporates delayed as well as attenuated forms of the signal. If the delay is frequency dependent, then a simple time domain model cannot be given and estimation of the delay must be performed in the frequency domain. If the delay is not frequency dependent, then it may be approximated by a time domain model (Chan, Riley and Plant, 1980). However, a simpler procedure is to estimate delay in the frequency domain, realign the series and then, if time domain models are required, use the methods of Subsection 3.3 to estimate the remaining parameters in the model. Only frequency domain methods are considered in this section.

The model is as given by (2.9) and the spectral density matrix of the observed process, expressed in terms of the $\nu_j(\omega)$, $f_{x,j}(\omega)$ and the $\tau_j(\omega)$, now has typical element

$$f_{jk}(\omega) = (\nu_j(\omega) \overline{\nu_k(\omega)} + \delta_{jk}) f_{x,j}^{1/2}(\omega) f_{x,k}^{1/2}(\omega)\, e^{i(\tau_j(\omega) - \tau_k(\omega))\omega}. \tag{4.1}$$

The effect of delay, if unaccounted for, will lead to biased estimators. Consider, for example, the case where, over a narrow band of frequencies centred at frequency $\lambda$, the $\nu_j(\omega)$, $f_{x,j}(\omega)$ and $\tau_j(\omega)$ are approximately constant. Then, keeping the bandwidth fixed but allowing $N$ to increase, the estimator of $f(\lambda)$ given by

$$\hat f(\lambda) = \frac{1}{m} \sum_{\lambda} W(\omega_s) W(\omega_s)^*$$

will converge to the matrix with typical element

$$f_{jk}(\lambda)(2\delta)^{-1} \int_{\lambda - \delta}^{\lambda + \delta} e^{i(\tau_j(\lambda) - \tau_k(\lambda))(\omega - \lambda)}\, d\omega = f_{jk}(\lambda)\, \frac{\sin\{(\tau_j(\lambda) - \tau_k(\lambda))\delta\}}{(\tau_j(\lambda) - \tau_k(\lambda))\delta}. \tag{4.2}$$

Here the bandwidth is $2\delta$. Now, if the $\tau_j(\lambda) - \tau_k(\lambda)$ are large, then (4.2) shows


that the off-diagonal cross-spectral estimates $\hat f_{jk}(\lambda)$, $j \neq k$, will be biased downwards, but the estimates of the spectra $\hat f_{jj}(\lambda)$ will be unbiased. This will have the effect of inflating the estimates of the noise spectra $f_{x,j}(\lambda)$ and seriously deflating the estimates of the attenuation function $\nu_j(\lambda)$. Hence it is important to take account of any significant phase variation caused by delays. We consider two situations, first the case of correcting for significant phase variation due to delays over a narrow band and, second, the case of correcting for such variation over any arbitrary range of frequencies.

4.1. The narrow band case

We consider, as in Subsection 3.1, the narrow band of $m$ frequencies $\omega_k = 2\pi k/N$ closest to some chosen frequency $\lambda$. The relevant log-likelihood is given by (3.2), but with $f(\omega)$ given by (4.1), i.e.

$$f(\omega) = A(\omega; \tau) f_x^{1/2}(I + \nu\nu^*) f_x^{1/2} A^*(\omega; \tau). \tag{4.3}$$

Here $f_x$ and $\nu$ are as defined below (3.3) and $A(\omega; \tau)$ is a diagonal matrix with typical element $\exp\{i\tau_j(\lambda)(\omega - \lambda)\}$. As before we are requiring that $\nu_j(\omega)$, $f_{x,j}(\omega)$ and $\tau_j(\omega)$ be effectively constant over the narrow band of frequencies and we have chosen to describe the phase of $f_{jk}(\omega)$ near $\lambda$ as

$$\psi_j(\lambda) - \psi_k(\lambda) + (\tau_j(\lambda) - \tau_k(\lambda))(\omega - \lambda).$$

Thus $\psi_j(\lambda)$, the argument of $\nu_j(\lambda)$, now incorporates $\tau_j(\lambda)\lambda$, the phase at frequency $\lambda$ due to delays. To identify all the parameters we shall again measure all phase differences relative to the first recorder and so

$$\psi_1(\lambda) = \tau_1(\lambda) = 0.$$

Maximising this log-likelihood with respect to the parameters yields

$$\hat f_x^{-1/2} \hat f(\lambda; \hat\tau) \hat f_x^{-1/2} \hat\nu = (1 + \hat\nu^*\hat\nu) \hat\nu, \tag{4.4}$$

$$\hat f_{x,j}(\lambda) = (1 + |\hat\nu_j(\lambda)|^2)^{-1} \hat f_{jj}(\lambda; \hat\tau), \qquad j = 1, \ldots, p, \tag{4.5}$$

$$\hat\nu^* \hat f_x^{-1/2} \{\partial \hat f(\lambda; \hat\tau)/\partial \hat\tau_j\} \hat f_x^{-1/2} \hat\nu = 0, \qquad j = 2, \ldots, p, \tag{4.6}$$

where

$$\hat f(\lambda; \tau) = \frac{1}{m} \sum_{\lambda} A(\omega_s; \tau) W(\omega_s) W(\omega_s)^* A^*(\omega_s; \tau),$$

i.e.

$$\hat f_{jk}(\lambda; \tau) = \frac{1}{m} \sum_{\lambda} W_j(\omega_s) \overline{W_k(\omega_s)}\, e^{i(\tau_j(\lambda) - \tau_k(\lambda))(\omega_s - \lambda)}.$$

Here the $(p - 1)$-dimensional vector $\tau$ has typical element $\tau_j(\lambda)$, $j = 2, \ldots, p$. In principle, (4.4)-(4.6) can be solved by a numerical procedure such as the


Newton-Raphson procedure with due account being taken of the fact that the required solutions must maximise (3.2). However, it can be shown that, appropriately normalised, the estimators $\hat\tau_j$ are asymptotically independent of $\hat\nu$ and the $\hat f_{x,j}(\lambda)$. This suggests that the following two-stage iterative scheme might have computational advantages. First, select initial estimates of the $\tau_j(\lambda)$ and solve (4.4) and (4.5) using factor analysis techniques as before. Then, using the resulting estimates of $\hat\nu$ and the $\hat f_{x,j}$, solve (4.6). Note that, given $\nu$ and $f_x$, (4.6) is equivalent to maximising

$$\hat\nu^* \hat f_x^{-1/2} \hat f(\lambda; \tau) \hat f_x^{-1/2} \hat\nu = \frac{1}{m} \sum_{\lambda} |\hat\nu^* \hat f_x^{-1/2} A^*(\omega_s; \tau) W(\omega_s)|^2.$$

With the new estimates of $\tau_j(\lambda)$, repeat the first step and so on. Schemes such as this based on first solving (4.4) and (4.5) and then (4.6) will be asymptotically equivalent to a full Newton-Raphson procedure. However, as in any numerical maximisation procedure, it is important to get good first estimates, especially of the delays $\tau_j(\lambda)$.

If we had once and for all estimates of the delays $\tau_j(\lambda)$, we would then apply the standard estimation procedure described in Subsection 3.1. Such a procedure would have computational advantages over the direct solution of (4.4)-(4.6). In Hannan and Thomson (1973) a non-parametric estimate of the (group) delay in the case of two recorders is given. A simple generalisation of their technique leads us to estimate the $\tau_j(\lambda)$ by the values of $\tau_j(\lambda)$ that minimise $\det \hat f(\lambda; \tau)$. Indeed, this criterion follows from maximising (3.2) with $f(\omega)$ replaced by $A(\omega) f(\lambda) A^*(\omega)$ and taking the $f_{jk}(\lambda)$ and the $\tau_j(\lambda)$ as the parameters. To compare this estimator with that obtained by maximising the original likelihood involving $\nu(\lambda)$, $f_x(\lambda)$ and $\tau(\lambda)$, consider maximising the likelihood over $\nu$ and the $f_{x,j}$ for any given $\tau$. The resulting maximised value of the likelihood is

$$\sum_{1}^{p} \log(1 + |\hat\nu_j|^2) - \log(1 + \hat\nu^*\hat\nu) = -\log \det \hat\Sigma(\lambda; \tau),$$

where $\hat\Sigma(\lambda; \tau)$ has typical element

$$\hat\Sigma_{jk}(\lambda; \tau) = \hat f_{jk}(\lambda; \tau)/\{\hat f_{jj}(\lambda; \tau) \hat f_{kk}(\lambda; \tau)\}^{1/2} \tag{4.7}$$

and

$$\hat f(\lambda; \hat\tau) = \hat f_x^{1/2}(I + \hat\nu\hat\nu^*) \hat f_x^{1/2}. \tag{4.8}$$

Thus the maximum likelihood estimate of $\tau$ is the value of $\tau$ minimising $\det \hat\Sigma(\lambda; \tau)$, where $\hat\Sigma(\lambda; \tau)$ is the estimated (complex) coherence matrix. Now minimising $\det \hat f(\lambda; \tau)$ is clearly equivalent to minimising $\det \tilde\Sigma(\lambda; \tau)$, where $\tilde\Sigma(\lambda; \tau)$ is formed from $\hat f(\lambda; \tau)$ in the same way as $\hat\Sigma(\lambda; \tau)$ was formed from $f(\lambda; \tau)$. Thus we see that the two criteria, i.e. minimising $\det \hat\Sigma(\lambda; \tau)$ and minimising $\det \tilde\Sigma(\lambda; \tau)$, are of essentially the same character. Indeed, they can


be shown to be asymptotically equivalent. These considerations lead us to estimate the $\tau_j(\lambda)$ by minimising $\det \hat f(\lambda; \tau)$ and then to estimate $\nu$ and the $f_{x,j}(\lambda)$ using the standard estimation procedure described in Subsection 3.1 with $\hat f(\lambda)$ replaced by $\hat f(\lambda; \hat\tau)$.

In terms of asymptotic properties it can be shown that, under suitable regularity conditions, the $\hat\tau_j(\lambda)$ are strongly consistent estimators of the $\tau_j(\lambda)$ and that the estimators of $\nu$ and the $f_{x,j}(\lambda)$ obtained from $\hat f(\lambda; \hat\tau)$ have the same properties as before. Moreover, $N^{-1} m^{3/2}(\hat\tau - \tau)$ is asymptotically distributed independently of $\hat\nu$ and the $\hat f_{x,j}(\lambda)$ and has asymptotic multivariate normal distribution with zero mean and covariance matrix $B^{-1}$, where

$$B_{jk} = \begin{cases} \dfrac{2\pi^2}{3}\, |\nu_j|^2 (\nu^*\nu - |\nu_j|^2)/(1 + \nu^*\nu), & j = k, \\[2mm] -\dfrac{2\pi^2}{3}\, |\nu_j|^2 |\nu_k|^2/(1 + \nu^*\nu), & j \neq k. \end{cases} \tag{4.9}$$

For further details concerning the proof of these results, see Thomson (1982).

Note that the chosen criterion of minimising $\det \hat f(\lambda; \tau)$ has the following interpretation. Let

$$\zeta(n) = m^{-1/2} \sum_{\lambda} W(\omega_s) \exp\{-i2\pi n s/m\}, \qquad n = 1, \ldots, m, \tag{4.10}$$

and

$$l_j = m\tau_j(\lambda)/N, \qquad j = 1, \ldots, p. \tag{4.11}$$

Then it can be shown that minimising $\det \hat f(\lambda; \tau)$ is almost the same as minimising the determinant of the matrix with typical element

$$m^{-1} \sum_{n=1}^{m} \zeta_j(n + l_j)\, \overline{\zeta_k(n + l_k)}, \tag{4.12}$$

i.e. the generalised variance of the $\zeta(n + l_j)$. Roughly speaking, $\zeta(n)$ is the output of a band pass filter acting on $y(n)$ which passes only the band of frequencies in question. Then the $\zeta_j(n)$ are lagged in the obvious way and the lagging is optimised by minimising a (generalised) variance. Finally, having determined estimates of the delays $\tau_j(\lambda)$, we now give corrected estimates of the attenuation coefficients by $|\hat\nu_j(\lambda)| \exp[i\{\hat\psi_j(\lambda) - \hat\tau_j(\lambda)\lambda\}]$.
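For two recorders, minimising $\det \hat f(\lambda; \tau)$ reduces to a one-dimensional search over the trial delay $\tau_2(\lambda)$ (with $\tau_1 = 0$). A sketch in the notation of the earlier narrow band code (`W`, `omega`, `idx` and `lam` as defined there; the grid range is an arbitrary choice):

```python
def delay_hat(W, idx, omega, lam, taus):
    """tau_2(lambda) minimising det fhat(lambda; tau) for p = 2 recorders."""
    w1, w2 = W[0, idx], W[1, idx]
    f11 = np.mean(np.abs(w1) ** 2)          # diagonal terms: unaffected by tau
    f22 = np.mean(np.abs(w2) ** 2)
    dets = []
    for tau in taus:
        # Realigned cross term fhat_12(lambda; tau); see the definition of
        # fhat_jk(lambda; tau) following (4.6).
        f12 = np.mean(w1 * np.conj(w2) * np.exp(-1j * tau * (omega[idx] - lam)))
        dets.append(f11 * f22 - np.abs(f12) ** 2)
    return taus[int(np.argmin(dets))]

taus = np.linspace(-20.0, 20.0, 801)        # trial delays in sampling intervals
tau2_hat = delay_hat(W, idx, omega, lam, taus)
```

Minimising the determinant here is the same as maximising $|\hat f_{12}(\lambda; \tau)|$, i.e. aligning the band-limited series as in (4.12).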

4.2. The broad band case

The model for the data may be considered as having three distinct components: the phase effects embodied in the delay $\tau(\lambda)$ and the arguments of the


$\nu_j(\lambda)$, the signal-to-noise ratio $|\nu(\lambda)|$ and the noise spectrum $f_x(\lambda)$. Each component may be estimated within each band or else can be modelled by a finite parameter model across a wide band of frequencies. The precise method to be used depends on which components are to be modelled by a finite number of parameters and which are to be estimated band by band. Of course, when all are estimated band by band, the problem reverts to that considered in the previous section. Here we consider only the case where the phase effects are modelled across broad bands, but the other parameters remain specific to each narrow band. Now, in contrast to the assumption in Subsection 4.1, it is assumed that the variation of $\{\tau_j(\omega) - \tau_k(\omega)\}\omega$ over any narrow band of frequencies is small compared with the variation across some given broad band of frequencies $B$. In the broad band case it is the variation between narrow bands that is to be modelled, whereas in Subsection 4.1 it was the within band variation that was modelled.

It will be supposed, in this section, that the phase differences depend on a vector, $\phi$, of unknown parameters. The unknown parameters may be simply the relative delays of the signal at the different recorders, or, if the signals have been dispersed, they may be coefficients of polynomials or splines used to model the frequency dependence of the delays. As in Subsection 3.2 we consider $B$ to be the union of non-overlapping narrow bands $B_u$ of $m$ frequencies about central frequencies $\lambda_u$. For any $\omega$ in the band $B_u$, we model the spectral density of the process as

$$f(\omega) = A(\lambda_u; \phi) f_x^{1/2}(\lambda_u) \{I + \rho(\lambda_u) \rho(\lambda_u)^t\} f_x^{1/2}(\lambda_u) A(\lambda_u; \phi)^*, \tag{4.13}$$

where $f_x(\lambda_u)$ is a diagonal matrix with typical diagonal element $f_{x,j}(\lambda_u)$, $\rho(\lambda_u)$ is a vector with typical element $|\nu_j(\lambda_u)|$ and $A(\lambda_u; \phi)$ is a diagonal matrix with typical diagonal element $\exp\{i\xi_j(\lambda_u; \phi)\}$. Here $\xi_j(\lambda_u; \phi) = \psi_j(\lambda_u; \phi) + \tau_j(\lambda_u; \phi)\lambda_u$, where $\psi_j(\lambda_u; \phi)$ and $\tau_j(\lambda_u; \phi)$ are appropriate parametric forms for $\psi_j(\lambda)$ and $\tau_j(\lambda)$ that model the phase variation over $B$. If the $f_x(\lambda_u)$ and $\rho(\lambda_u)$ were known, then maximising the approximate log-likelihood (or equivalently minimising (3.9)) with respect to $\phi$ reduces to minimising

$$-\sum_u \{1 + \nu^*\nu\}^{-1} \sum_{j,k} |\nu_j||\nu_k| \{f_{x,j} f_{x,k}\}^{-1/2} \hat f_{jk}\, \overline{\Lambda_j(\phi)} \Lambda_k(\phi), \tag{4.14}$$

where the frequency $\lambda_u$, on which each of the quantities depends, has been omitted for simplicity, and $\Lambda_j(\phi)$ denotes the $j$th diagonal element of $A(\lambda_u; \phi)$. Note that (4.14) can also be written as

$$-\frac{1}{m} \sum_u (1 + \nu^*\nu)^{-1} \sum_{B_u} |\rho^* f_x^{-1/2} A^*(\omega; \phi) W(\omega)|^2. \tag{4.15}$$

Now the quantity

$$\{1 + \nu^*\nu\}^{-1} |\nu_j||\nu_k| \{f_{x,j} f_{x,k}\}^{-1/2} |f_{jk}| \tag{4.16}$$


may be consistently estimated from the estimated coherences. Here $\hat c_{jk}(\lambda_u)$ is the estimated coherence at frequency $\lambda_u$ between $y_j(n)$ and $y_k(n)$ and $\hat c^{jk}(\lambda_u)$ is the $(j, k)$ element of the inverse of the matrix of estimated coherences $\{\hat c_{jk}(\lambda_u)\}$. Using this estimate, (4.14) reduces, after a little manipulation, to

$$O(\phi) = \sum_{u > 0} \sum_{j,k} \hat c_{jk}(\lambda_u)\, \hat c^{jk}(\lambda_u) \cos[\hat\theta_{jk}(\lambda_u) - \{\xi_j(\lambda_u; \phi) - \xi_k(\lambda_u; \phi)\}].$$

Here $\hat\theta_{jk}(\lambda_u)$ is the argument of $m^{-1} \sum_{\lambda} W_j(\omega_s) \overline{W_k(\omega_s)}$. The properties of the estimate of $\phi$ which is obtained by maximising $O(\phi)$ have been discussed in Hamon and Hannan (1974). Usually, $\hat c_{jk}(\lambda_u)$ is the standard estimator of coherence, namely

$$\hat c_{jk}(\lambda_u) = |\hat f_{jk}(\lambda_u)|/\{\hat f_{jj}(\lambda_u)\, \hat f_{kk}(\lambda_u)\}^{1/2}.$$

However, Hannan and Thomson (1981) suggest fitting a vector autoregression to the data and then estimating the coherences from the autoregressive spectra. Their simulations show that using the autoregression leads to improved estimates of $\phi$. However, the number of parameters fitted increases as $p^2$ and does not take account of the hypothesised structure of the observations.

Alternatively, the structure of the underlying model can be more directly utilised to estimate (4.16). Consider in (4.16) replacing $\nu$ by $\hat\nu$, $f_x$ by $\hat f_x$ and $|f_{jk}|$ by the asymptotically equivalent form $|\hat\nu_j \hat\nu_k| \{\hat f_{x,j} \hat f_{x,k}\}^{1/2}$. Here $\hat\nu$ and $\hat f_x$ are the estimators obtained using the narrow band techniques of Subsection 3.1. In this case, (4.14) becomes

$$O_1(\phi) = \sum_{u > 0} (1 + \hat\nu^*\hat\nu)^{-1} \sum_{j,k} |\hat\nu_j|^2 |\hat\nu_k|^2 \cos[\hat\theta_{jk}(\lambda_u) - \{\xi_j(\lambda_u; \phi) - \xi_k(\lambda_u; \phi)\}].$$

In similar vein, if $\hat\rho$ has typical element $|\hat\nu_j|$, (4.15) yields the criterion

$$O_2(\phi) = \frac{1}{m} \sum_u (1 + \hat\nu^*\hat\nu)^{-1} \sum_{B_u} |\hat\rho^* \hat f_x^{-1/2} A^*(\omega; \phi) W(\omega)|^2.$$

The properties of the estimators that maximise these criteria are the same as those of the estimator which maximises $O(\phi)$. Note that the methods of Subsection 3.2 could be used to estimate strongly consistent, 'smooth' values of $\nu$, and thus of the weight function in (4.15). To completely match the procedure of Hannan and Thomson a finite parameter


model should also be fitted to the noise processes. In this case the number of parameters increases only as $p$ and the parameters are easily interpreted. The cost is extra computation. At this stage no comparison of the finite sample properties of this procedure with the others has been performed.
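As an illustration of the broad band phase fit, the sketch below evaluates a coherence-weighted criterion of the form of $O(\phi)$ for a pure, non-dispersive delay parametrisation, $\xi_j(\lambda_u; \phi) - \xi_k(\lambda_u; \phi) = (\phi_j - \phi_k)\lambda_u$ with recorder 1 as reference, and maximises it numerically. It reuses the bands and band estimates from the wide band sketch; the weight form follows the reconstruction of $O(\phi)$ above and the parametrisation is an assumption of the sketch.

```python
from scipy.optimize import minimize

def O_criterion(phi_free, fhats, lams):
    """Coherence-weighted cosine criterion, cf. O(phi), for pure delays."""
    phi = np.concatenate([[0.0], phi_free])     # phi_1 = 0 (reference)
    total = 0.0
    for fh, lam_u in zip(fhats, lams):
        d = np.sqrt(np.real(np.diag(fh)))
        coh = np.abs(fh) / np.outer(d, d)       # estimated coherences c_jk
        coh_inv = np.linalg.inv(coh)            # elements c^{jk}
        theta = np.angle(fh)                    # estimated phases theta_jk
        delta = np.subtract.outer(phi, phi) * lam_u
        total += np.sum(coh * coh_inv * np.cos(theta - delta))
    return total

lams = [omega[b].mean() for b in bands]         # band centre frequencies
res = minimize(lambda ph: -O_criterion(ph, fhats, lams),
               np.zeros(p - 1), method="Nelder-Mead")
phi_hat = res.x
```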

5. Applying the methods

Some of the practical problems that can arise in fitting models of the type described earlier are discussed in this section. These questions fall into two categories: (i) assuming the model is correct, how should the parameters be estimated? and (ii) what can be done to check that the model fitted is appropriate, and what is the effect of a poor model on the parameter estimates?

The questions in the first category are of most importance when frequency domain methods are to be used, and cover problems such as whether the data should be prefiltered, what frequency bands should be used and how wide these should be. Of necessity the discussion of such matters is more nebulous than that presented in earlier sections where the data are always stationary and the observed records are assumed to be long. However, when sample sizes are finite and spectra are changing rapidly, poor data analysis may result if the practical problems are not treated thoughtfully.

The main problems that arise when estimating parameters using frequency domain methods are the result of biases in the spectrum estimates caused by rapid changes in power or phase across a narrow band of frequencies. This was mentioned at the beginning of Section 4 where it was shown that if there is a large delay between recorders, then the modulus of the estimated $\nu_j$'s may be biased downwards. The delay may also be less efficiently estimated in such circumstances. See, for example, the simulation results in Hannan and Thomson (1981). These biases may be substantially reduced by realigning the series (possibly in a frequency dependent way) to reduce the phase changes, by prefiltering the observations so that their spectra are approximately flat, and by making the individual frequency bands narrower. Decreasing the width of the individual frequency bands increases the variance of spectral quantities within the band. However, if effective realignment and prewhitening are performed, then the width of the bands will not be critical.

Unless data are delayed by an integral number of time units, the simplest way to delay a series by $\tau(\lambda)$ time units is to compute

$$\tilde x(n) = \frac{1}{N} \sum_k \Big\{ \sum_m x(m)\, e^{im\omega_k} \Big\}\, e^{-i\tau(\omega_k)\omega_k}\, e^{-in\omega_k}.$$

That is, $x(n)$ is Fourier transformed, the coefficient at frequency $\omega_k$ is multiplied by $\exp\{-i\tau(\omega_k)\omega_k\}$ to introduce the delay and then an inverse Fourier transform is applied. If $\tau$ is an integer constant, then this moves the first $\tau$ observations from the beginning of the series to the end. This may not always be an appropriate transformation, so an improved procedure would be to mean correct the series, add zeros to the ends and multiply by an appropriate taper before performing the transform described above.
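A sketch of this realignment for a possibly fractional delay, with mean correction, zero padding and a simple taper; the taper and the padding length are illustrative choices.

```python
import numpy as np

def realign(x, tau):
    """Delay the series x by tau time units (tau need not be an integer)."""
    n = len(x)
    xt = (x - x.mean()) * np.hanning(n)         # mean correct and taper
    xt = np.concatenate([xt, np.zeros(n)])      # zero pad against wrap-around
    wk = 2 * np.pi * np.fft.fftfreq(2 * n)      # frequencies omega_k (radians)
    X = np.fft.fft(xt)
    # Multiply by exp{-i tau omega_k} and invert, as described above.
    return np.real(np.fft.ifft(X * np.exp(-1j * tau * wk)))[:n]
```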

So that appropriate prewhitening and realigning can be performed, exploratory analyses of the data will be required. These preliminary analyses should also be used to suggest good parametrisations of the different components in the model and to suggest initial parameter estimates to be used in the iterative maximisation of the likelihood. A general procedure for estimating attenuation and delay should thus include the following steps:
(i) Estimate the spectrum of each observed series separately, using a few different bandwidths and also an autoregressive estimator. Choose a filter which will, approximately, whiten each of the observed series.
(ii) From the whitened series, estimate the phase difference functions, compute approximate delays between the observed series and use these to realign the observations, if necessary.
(iii) Compute the coherence functions for the prewhitened and realigned series. This should show the frequency range in which the signal common to the different recorders is discernible over the background noise.
(iv) For narrow bands over the frequency range of interest, perform the narrow band analysis described in Subsection 3.1 or, if significant delays between the observed series remain, perform the analysis in Subsection 4.1. This analysis should suggest parametric forms for the delay and attenuation. Initial estimates of these parameters as well as estimates of the spectra of the noise processes should be obtained at this stage.
(v) A broad band analysis of the form of those described in Subsections 3.2 and 4.2 may now be performed.
(vi) If a time domain estimation is to be performed, the data should be corrected for any remaining delays estimated in (v). Also noise models may be chosen and preliminary parameter estimates obtained from the estimates of the noise spectra obtained in (iv).
(vii) The final estimates should be modified to account for any preliminary whitening or realignment that has been performed.

The steps outlined above may need to be performed more than once. For example, if there is a large delay between two series, then, except at very low frequencies, the phase difference between the two will change rapidly and so a reasonable estimate of the delay between the two series will be difficult to obtain initially. However, an approximate realignment will make the next estimate much better.

It remains to consider whether the model fitted is a good description of the data. Difficulties may arise early in the estimation procedure if an inappropriate model is chosen. For example, if the noise at different recorders is correlated (often called signal associated noise), then narrow band analysis will yield noise spectrum estimates which are biased down and may be negative


unless the estimation is constrained. If, for a particular band, the noise spectrum is biased downwards, then the weight attached to that band will be inflated in a broad band analysis, leading to biased estimates of the attenuation and delay. Signal associated noise is not uncommon, particularly in geophysical data where the same signal may follow more than one path. If one path is substantially longer than the other, then the amplitude of the signal received after travelling along the longer path will often be diminished so as to be of the same magnitude as background noise. When signal associated noise is detected, the only solution is to model it as a second signal. When there is more than one signal, the basic ideas for parameter estimation are the same as those described here, but greater problems of identifiability arise. This will not be discussed here. To check if there is more than one signal present, a principal component analysis of the cross-spectral matrix for each frequency band of interest could be performed (see Brillinger, 1975). Alternatively, narrow band factor analysis models could be fitted and the number of factors chosen using an AIC-type criterion.

To illustrate the methods of this chapter, we consider a simple example of just two anemometers from which instantaneous readings of wind speed were obtained several times per second. A plot of the spectrum of the data shows a rapid decrease in power with frequency. Autoregressions were fitted to the two series and it was found that both series were adequately described by simple third-order autoregressions, the coefficients of the two series being very similar. The two series were prewhitened using a filter obtained using the coefficients of one of the autoregressions. This substantially reduced the variation of the spectra with frequency, though the observations were still not white noise. The two filtered series were strongly coherent at low frequencies, though the coherence was reduced at high frequencies (see Fig. 1). There was no evidence of delay between the two series.

As mentioned earlier, when estimating the attenuation between just two observed series, some further assumption must be made about the noise. The simplest and most natural assumption is that the two noise series have the same spectrum, and that assumption was made here. The phase difference between the two series varied unsystematically about zero and so it was assumed that the ratio $a_2(\omega)/a_1(\omega)$ $[= \nu_2(\omega)/\nu_1(\omega)]$ was real, and further estimation was based solely on the real part of the cross-spectrum between the two observed series.

Narrow band estimates of the ratio $a_2(\omega)/a_1(\omega)$ were calculated and these are plotted in Fig. 2. Because we are dealing with real quantities, confidence intervals may be calculated for these estimates using the method of Creasy (1956). The approximate 95% intervals calculated using this method are plotted on Fig. 2. It can be seen that the estimates are approximately constant and generally greater than one. There is one band at which the estimate is much greater than those at other frequencies. However, the signal-to-noise ratio is much lower at


Fig. 1. Coherence between the two observed, prewhitened series. (Coherence plotted against frequency.)

Fig. 2. Narrow band estimates of $|\nu_2(\omega)/\nu_1(\omega)|$ with approximate 95% confidence intervals.


Fig. 3. Narrow band estimates of $|\nu_1(\omega)|$, the signal-to-noise ratio for the first anemometer.

Fig. 4. Estimated noise spectrum.


the high frequencies, as can be seen from the widths of the confidence intervals in Fig. 2 and from Fig. 3, where $|\hat\nu_1(\omega)|$ is plotted. The estimate of the spectrum of the noise is plotted in Fig. 4.

The above analysis suggests that the ratio $\nu_2(\omega)/\nu_1(\omega)$ is constant and, under this assumption, the ratio was estimated using a wide frequency band estimator. The estimated ratio was found to be 1.10 with an approximate standard error of 0.02. The estimation was repeated without prefiltering. The estimate of $\nu_2(\omega)/\nu_1(\omega)$ was essentially the same, but the estimated noise spectrum was dominated by the power at low frequencies common to both signal and noise.

At this point, time domain models could be chosen for $\nu_1(\omega)$ and $f(\omega)$ by examination of their narrow band estimates and an overall time domain estimation performed using the methods of Engle and Watson. This could improve the estimate of $\nu_2(\omega)/\nu_1(\omega)$ if there were fewer observations, but because there are a moderate number of observations and because the spectra contain no sharp peaks, there would appear to be no benefit in this case. Note that the noise models would probably be misspecified in the first phase of a time domain modelling unless the data were prewhitened.

References

Aasnaes, H. B. and Kailath, T. (1973). An innovations approach to least-squares estimation, Part VII: Some applications of vector autoregressive-moving average models. IEEE Trans. Automat. Control AC-18, 601-607.
Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco, CA.
Box, G. E. P. and Tiao, G. C. (1977). A canonical analysis of multiple time series. Biometrika 64, 355-366.
Brillinger, D. R. (1975). Time Series, Data Analysis and Theory. Holt, Rinehart and Winston, New York.
Cameron, M. A. (1981). Estimation of noise correlations in transfer function models. Commun. Statist. Simula. Computa. B10, 369-381.
Cameron, M. A. (1983). The comparison of time series recorders. Technometrics 25, 9-22.
Chan, Y. T., Riley, J. M. and Plant, J. B. (1980). A parameter estimation approach to time delay estimation and signal detection. IEEE Trans. Acoust. Speech Signal Process. ASSP-28, 8-16.
Clay, C. S. and Hinich, M. J. (1981). Estimating the earth's impedance function when there is noise in the electric and magnetic signals. In: D. F. Findley, ed., Applied Time Series Analysis II, 184-219. Academic Press, New York.
Creasy, M. A. (1956). Confidence limits for the gradient in the linear functional relationship. J. Roy. Statist. Soc. Ser. B 18, 65-69.
Engle, R. and Watson, M. (1981). A one-factor multivariate time series model of metropolitan wage rates. J. Amer. Statist. Assoc. 76, 774-781.
Fletcher, R. and Powell, M. J. D. (1963). A rapidly convergent descent method for minimization. Computer J. 6, 163-168.
Geweke, J. F. and Singleton, K. J. (1981). Maximum likelihood "confirmatory" factor analysis of economic time series. Internat. Econom. Rev. 22, 37-54.
Hamon, B. V. and Hannan, E. J. (1974). Spectral estimation of time delay for dispersive and non-dispersive systems. Appl. Statist. 23, 134-142.
Hannan, E. J. (1970). Multiple Time Series. Wiley, New York.

Measuring attenuation

387

Hannan, E. J. (1983). Signal estimation. In: P. R. Krishnaiah, ed., Time Series Analysis in the Frequency Domain. North-Holland, Amsterdam. Hannan, E. J. and Thomson, P. J. (1973). Estimating group delay. Biometrika 60, 241-253. Hannan, E. J. and Thomson, P. J. (1981). Delay estimation and the estimation of coherence and phase. I E E E Trans. Acoust. Speech and Signal Process. ASSP 29, 485-490. Jones, R. H. (1980). Maximum likelihood fitting of A R M A models to time series with missing observations. Technometrics 22, 389-395. Joreskog, K. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika 32, 443--482. Joreskog, K. (1978). Structural analysis of covariance and correlation matrices. Psychometrika 43, 443447. Lawley, D. N. (1967). Some new results in maximum likelihood factor analysis. Proe. Roy. Soc. Edinburgh Ser. A 67, 256-264. Nicholls, D. F. and Hall, A. D. (1979). The exact likelihood function of multivariate autoregressivemoving average models. Biometrika 66, 259-264. Pukkila, T. (1982). On the identification of transfer function noise models with several correlated inputs. Scand. J. Statist. 9, 139-146. Taniguchi, M. (1982). On estimation of the integrals of the fourth order cumulant spectral density. Biometrika 69, 117-122. Thomson, P. J. (1982). Signal estimation using an array of recorders. Stochastic Process. Appl. 13, 201-214.

E. J. ttannan, P. R. Kristmaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5 © Elsevier Science Publishers B.V. (1985) 389-412

] AL

Speech Recognition Using LPC Distance Measures P. J. Thomson and P. de Souza

1. Introduction

Research into automatic recognition of speech is concerned with the problem of designing a device which accepts speech as input and determines what words were spoken. This is to be distinguished from the related task of speech understanding where the goal is to design a device which reacts correctly to spoken commands, and which may be able to do so without necessarily recognising every word correctly. The task of a speech recogniser is simply to transcribe the input utterance without responding to its meaning, as if it were taking dictation. Indeed, dictation is one possible commercial application of a speech recogniser. Speech recognition can be subdivided into two categories: recognition of isolated words where there are distinct pauses between the words; and recognition of natural continuous speech where words are usually run together. Recognition of isolated words is less difficult than recognition of continuous speech and, in fact, isolated word recognisers handling vocabularies of up to 200 words have been in use commercially for several years [1]. Recently, recognisers have become available which will accept vocabularies of up to 500 words [19]. Two reasons why isolated word recognition has been more successful than continuous speech recognition are that: (1) pausing between words leads to clearer speech and substantially reduces the effects of co-articulation caused by the preceding and following words; (2) pauses can be identified fairly reliably [2] and hence the end-points of isolated words can be determined more easily than in continuous speech. In continuous speech, words are pronounced less carefully and co-artb culatory effects cause the pronunciation of words to vary according to their context. For example, the pronunciation of the word " a n d " is usually much clearer when spoken in isolation than when spoken naturally in a phrase like "black and blue" which may be pronounced more like "black 'n' blue". Also, in words such as " b r a n d " the final consonant may or may not be pronounced depending on its context: it is more likely to be pronounced in the phrase "brand of product" than in "brand new product".

390

P. J. Thomson and P. de Souza

Unlike isolated words, it is not usually possible to identify word end-points in continuous speech. As most successful isolated word recognisers rely heavily on end-point information they cannot be converted easily to work on continuous speech. Continuous speech, therefore, represents a significantly more complex problem which remains far from being fully solved. Typically, in isolated word recognition [3,20] a parameter vector is computed every 10-20 ms over a windowed region of the input signal. Making use of the pauses between words to determine the end-points of an utterance, a feature matrix is then extracted consisting of the time-varying parameter vectors over the interval between the end-points. The word is identified by comparing the feature matrix with a set of stored templates derived from the words in the vocabulary and selecting the template which gives the closest match. This simple, template matching approach is adequate for small vocabularies when the words are spoken in isolation, but problems are encountered when it is applied to large vocabularies [4], and it is difficult to apply to continuous speech although it can be done for very small vocabularies such as the ten digits [17]. In order to implement an isolated word recogniser like the one above, it is necessary to decide on the parameter set and distance measure to be used in comparing an unidentified word with the reference words. The predominant parametric representation in use today [5] is linear predictive coding (LPC) which is applied to the speech signal after digitising it at a rate of between 6 and 20 kHz. The sampling rate must be twice the desired bandwidth in order to avoid spectral aliasing. In the case of voiced speech a minimum bandwidth of 3 kHz is required for accurate estimation, while in the case of unvoiced fricative sounds such as "s" a bandwidth of 8-10 kHz is necessary. Since the telephone has a bandwidth of about 3 kHz only, a low sampling rate is usually adequate for telephone quality speech, whereas a 20 kHz sampling rate is desirable for high quality microphone speech [6]. The sampled speech is typically quantised to an accuracy of between 9 and 16 bits [5], again depending on the quality of the speech required and the application. The linear predictive model [21] which is applied to the digitised signal basically assumes that a speech sample can be approximated by a linear combination of the immediately preceding speech samples, but the foundations of LPC can be traced to Fant's very successful linear speech production model [7]. In Fant's model of acoustical speech behaviour, speech is considered to be the output of a linear, time-varying system excited by periodic pulses during voiced speech and random noise during unvoiced speech. Linear prediction provides a robust and accurate means of estimating the parameters that characterise this system [8, chapter 8], and is valid, therefore, to the extent that the production model is valid. The success of LPC can be attributed to the accuracy with which Fant's basic model applies to speech. Further reasons for the importance of LPC lie in the accuracy with which the

Speech recognition using LPC distance measures

391

speech parameters can be estimated and in the speed with which they can be computed. As it happens, estimation of the linear prediction coefficients reduces to a set of linear equations which are blessed with mathematical properties that allow extremely efficient solution [6]. Additionally, LPC has the further advantage that the asymptotic distribution of the linear prediction coefficients is known, tractable, and appears to provide a workable approximation to the distribution of the coefficients as obtained over finite length intervals typical of those used in speech recognition [9, 10]. The importance of knowing the approximate distribution of the coefficients is that it provides the means of developing appropriate distance measures instead of relying on empirically derived measures which would otherwise be the case. Despite these advantages, Fant's model is known to be imperfect [6], and the speech signal is not truly stationary as the LPC model assumes. Therefore, LPC does not provide a perfect representation of speech and is not necessarily the best parameter set for speech recognition purposes. Other parameters such as the discrete Fourier transform spectrum are in use and good results have been reported [11]. Once a parameter set has been selected, the isolated word recogniser sketched earlier needs an appropriate distance measure to determine which of the reference templates is closest to the feature matrix of an unidentified word. Ideally, the distance measure should have the property that it maximises the probability of selecting a template of the correct word. Since an unidentified word will usually be of a different duration than its corresponding template(s), it is necessary to perform some kind of time alignment between the unidentified and the reference patterns. Thus, computing the distance between an unidentified word and a reference template involves both time alignment and the accumulation of some distance function after alignment. it is well established that careful time alignment leads to a significant reduction in the recognition error rate of isolated words, particularly when the vocabulary contains polysyllabic words [12]. In practice this means that it is inadequate to perform simple linear time alignment in which one of the patterns is stretched or compressed linearly to match the other. Better results are obtained by performing dynamic time warping in which the optimal alignment is taken as being that one which minimises the total accumulated distance between the unidentified and reference patterns. In dynamic time warping the accumulated distance is taken to be the sum of the local distances between the temporally aligned parameter vectors of the reference and unidentified words. Efficient recursive procedures to find the required alignment have been devised and a useful discussion on the subject is given by Myers et al. [13]. To complete the definition of the distance between an unidentified word and a reference word, it is necessary to specify how the local distance between two parameter vectors will be measured. In the case of parameter vectors which have an unknown or intractable distribution, it is difficult to define an optimal distance; it may be necessary, instead, to resort to measures such as the

392

P. J. Thomson and P. de Souza

Euclidean distance, the city-block metric or any one of several intuitive distances [14] according to whichever was found to work best in practice. In contrast, when the parameters have a known tractable distribution, as is the case for LPC coefficients, this information can be used to derive distance measures which are optimal, or nearly so, in some well-defined sense. Probably the most common LPC distance measure in use today is Itakura's so-called log likelihood ratio [15]. Ironically, this is neither the log likelihood ratio statistic for comparing two estimated LPC vectors, nor is it statistically optimal as was pointed out by de Souza and Thomson [9, 10]. Despite this, it has been found to work better than most ad hoc distance measures and has the advantages that it can be computed quickly, and its storage requirements are small. Nonetheless, when Itakura's distance measure is used, isolated word recognisers of the type described here do not perform well with complex vocabularies containing acoustically similar words [16]. There is a need, therefore, for more powerful LPC distance measures than Itakura's in order to discriminate better between similar sounding words. Several candidates for an improved LPC distance measure where investigated by de Souza and Thomson [10] and will be further discussed later in this chapter. Because of the limitations inherent in isolated word recognisers using template matching, other approaches to speech recognition have been investigated. One of the more successful of these has been the maximum likelihood approach in which speech is modelled as a Markov source [18]. In this approach, word templates are replaced by Markov models, and each parameter vector is replaced by a scalar indicating which of several reference vectors is the nearest to the observed vector. The resulting sequence of scalars, or labels, is analysed to find those words whose collective concatenated Markov models maximise the likelihood of the observed labels. Good results using this technique have been reported for continuous speech [18] as well as for isolated words [11]. As in the case of template matching, the Markov modelling approach requires a distance measure defined in terms of the chosen parameter set. In the latter case it is needed in order to find the closest reference vector to an observed parameter vector during the labelling phase. Additionally, the reference vectors should be chosen so as to minimise distortion in the quam tised, or labelled, speech. This process, known as vector quantisation, is intrinsically related to the choice of distance measure used in labelling [22, 23]. It can be seen, then, that the definition of distance measures is an important aspect of speech recognition research. Given the predominance of LPC as the choice of parameters, LPC distance measures are of particular importance, and the remainder of this chapter is devoted to this subject. We begin in Section 2 by reviewing the LPC model and some of its statistical properties.

Speech recognition using LPC distance measures

393

2. The LPC m o d e l - - a review The stationary stochastic process {xt} is said to follow an L P C model if p

t=0,+l

~,aix,_ i=e,,

.....

(2.1)

i=0

where {et} is a white noise process; i.e. a sequence of uncorrelated r a n d o m variables each with m e a n zero and variance o-2. It is also assumed that a 0 = 1 and the z transform E ai zi is n o n - z e r o inside and on the unit circle. T h e latter condition ensures that x t is uncorrelated with future innovation terms Et+s, s > 0. N o t e that the L P C (linear predictive coding) model is nothing other than the familiar autoregressive model used in almost all b r a n c h e s of time series. Consider estimating a = (% . . . . . ap) T and o.2 from a sample of N observations on the process {xt}. W i t h o u t any loss of generality we may take these observations to be given by x = ( x I . . . . . XN) ~. A natural m e t h o d for estimating and o2 is by m e a n s of the least squares criterion

N'

(2.2)

a T (-' ~a ' t=p+l

=

where the (p + 1) x (p x 1) matrix C has typical e l e m e n t

c,j

] = m N'

N

Z

x, ix,_j,

(2.3)

i,j = o, 1 . . . . . p ,

t=p+l

with N ' - N -

p and a .... (1, v ) r .

Minimising (2.2) with respect to a yields (2.4)

& = -D-'d,

where D is the p x p submatrix of C obtained by deleting row and column 0, and d is the p - d i m e n s i o n a l column vector o b t a i n e d by deleting row zero f r o m the zeroth column of C. As is usual in least squares an estimate of o-z can be f o u n d by considering the m e a n squared residual

=:

~.2

~

_Z

t-

1

OliXt_ i

= ~'r6~.

(2.5)

i=0

It is well k n o w n (see, for example, [24-26]) that the distribution of ~ / ~ ( ~ a ) a p p r o a c h e s that of a multivariate Gaussian distribution as N ' ~ c . The limiting distribution has mean 0 and covariance matrix N where an asymptotically unbiased estimator of X is given by )2 - 4 2 D ' ' .

(2.6)

P. J. Thomson and P. de Souza

394

Although d"2 is asymptotically unbiased it will be biased for finite N ' . Paralleling the usual procedure for estimating variance in least squares, consider the estimator

S2 _

~

i

N'-pt=p+l

(/:~0)2

•iXt_i

=

S t

_ __

(j.2

N'-p

"

(2.7)

This estimator takes into account the p degrees of freedom lost by using the estimated values of cq in (2.2). Following the argument given in Fuller [26, p. 338] s 2 should prove to be a less biased estimator of o,2 in small samples. Now, since ~)2 and s 2 are asymptotically equivalent, we can define an alternative asymptotically unbiased estimator of X as

S = s2D -~ .

(2.8)

This estimator should be more appropriate than 2~ for small samples. If the process {xt} were Gaussian, then the likelihood of xp< . . . . . conditioned on the first p observations x, . . . . . xp is given by

(o-'k/ 2Trr, rr)-N' e x p { - ~J~22arrCa } .

XN,

(2.9)

Maximising (2.9) with respect to a and o-2 shows that d~ and d-2 are the m a x i m u m likelihood estimates of a and 0-2. Now the form of (2.9) as a function of oL and o- is precisely analogous to that of the Gaussian linear regression model in conventional statistics. (See Seber [27] for example.) Indeed, if y = (Y, . . . . . yn)x follows a Gaussian linear regression model, then y has a multivariate Gaussian distribution with mean Xfl and covariance matrix o-21. H e r e the columns of X contain the regressors, fl contains the regression coefficients and I is the identity matrix. H e n c e the likelihood for the linear regression model is -

-

-"

~

--~ (y (cryX/2~T) e x p [ - ~2Cry

- xt~)T(y

-

xfl)]

= (o-,',/2-~)-" exp[~--~n2 b TBb ] , Z,O'y

where b - ( l , - , i g T ) T and

B:I[yTy

n [XTy

yWX J XTX "

Replacing the fij by --aj, o~y by o-, n by N ' , columns of X by (Xp_j+1. . . . . XN_j) T, j = 1. . . . . establishes the correspondence between the respective distributions. The distribution of

y by (xp+1. . . . . Xu)v and the

p, yields (2.9). Note that this likelihoods only and not the xp< . . . . , x N conditioned on

Speech recognition using LPC distance measures

395

x l , . . . , xp does not follow the Gaussian linear regression model. The parallel between the likelihoods proves to be useful in the development that follows. The unconditional likelihood of x 1. . . . . xN is (2.9) multiplied by 1

T

(crX/2-~w)-P(det(A)) -1/2 exp - - -2o.2 x p A

--1

Xp

}

(2.10)

where xp = ( x 1. . . . . Xp) T and A has typical element Aij = covariance (x i, xs)/cr 2. A procedure for determining the Ais as a function of a is given in McLeod [28]. Although there exist algorithms for computing the exact maximum likelihood estimates [29, 30], they are considerably more costly to compute than the conditional maximum likelihood estimates. For this reason the unconditional maximum likelihood estimates have not been seriously considered for speech recognition where computational efficiency is a primary requirement. Note also, if N is large b y comparison to p, then the multiplicative factor (2.10) contributes little to the likelihood. Now, apart from the multiplicative factor (det(A)) -1/2, the exact likelihood of x 1. . . . . x N is (2.9) with N ' replaced by N and C replaced by (~ where (~ has typical element 1 N-i-j

Cis = ~

~,

xt+ix,+j,

i, j = O, 1 . . . . , p .

(2.11)

t=l

This gives an approximation to the exact likelihood since (det(A)) -1/2 is independent of N and will, for moderate to large samples, contribute little to the likelihood. (See Box and Jenkins [31] for discussion of this approximate likelihood.) Maximising this approximate likelihood yields the estimate = -/3-1d,

(2.12)

where /5 and d are defined in terms of (~ in exactly the same way that D and d were defined in terms of C. Moreover, the estimate of o-2 is (~2 __~I ~ T ~

(2.13)

,

where ~ = (1, ~T)T. Paralleling (2.7) we also define the least squares estimator of @2 as

g2 __

N

0.2.

(2.14)

N-p

These estimates should provide closer approximations to the (unconditional) maximum likelihood estimates than ~ and d"2 . Yet another way to estimate ~e is to solve the well-known Yule-Walker

396

P.J. Thomson and P. de Souza

equations. These yield the estimate

(2.15) where, once again, /} and d are defined in terms of C in exactly the same way that D and d were defined in terms of C. Here 0 has typical element

~ij = ~(li-in),

i,j =0, 1 . . . . . p ,

and 1 Nn

a(n) =

Z

x,x..,

= o, l . . . . .

p.

i=I The estimates of 0-2 analogous to (2.5) and (2.7) are (~2 .__ liT~li,

g2__

N

~2,

(2.16)

N-p

where a = ( l , liT)T. The Yule-Walker estimates are also approximate (unconditional) maximum likelihood estimates where the approximate likelihood in question is as for the previous paragraph, but with approximated by C. These estimates, popular because of the ease with which they can be computed, have lost some of their appeal due to the fact that there are now fast algorithms for computing & and 6-2 [6, 32]. Moreover, even in moderate size samples, simulation studies favour & and d"2 to & and 6-2 [331. All the estimates of o~ and 0-2 considered are asymptotically equivalent. In practice however, as intimated above, the estimates & and ~2 should normally provide the best estimates of ot and 02 followed, in order, by & and d-2 and then & and 6-2. In terms of computation the estimates & and 6-2 require a computational effort no greater than that for the Yule-Walker estimates ~i and 6-2 [32]. The approximate maximum likelihood estimates &, 6-2 can also be computed rapidly [34], but not quite as rapidly as & and 6-2 or & and 6-2. The algorithms concerned are the Levinson-Durbin recursion and generalisations of this. In the following section a number of tests will be derived using the likelihood ratio method and Gaussian likelihoods. The exact distribution of the resulting test statistics cannot easily be established and, as a consequence, only asymptotic distributions can be given. However, it can be shown that these asymptotic distributions will also hold under more general conditions where the white noise process {~t} satisfies the conditions following (2.1) together with additional mild regularity conditions. (See Hannan [25] and Fuller [26] in particular.) Thus, although the statistics are derived under Gaussian assumptions, they will continue to follow the stated asymptotic distribution in more general circumstances. For the sake of definiteness, we shall now confine our attention, in the

Speech recognition using LPC distance measures

397

main, to the conditional likelihood estimates & and d-2. This is not a restriction. Since the estimates can all be o b t a i n e d by maximising an a p p r o p r i a t e version of the likelihood (2.9), the three estimates of a and er2 may be used interchangeably in the statistics and distance measures that follow.

3. Comparative tests for LPC models In this section we use the formal theory of statistical hypothesis testing and, in particular, the likelihood ratio m e t h o d to g e n e r a t e a p p r o p r i a t e measures of the distance b e t w e e n sets of L P C coefficients. M u c h of the material that follows is drawn f r o m [10]. C o n s i d e r a s e q u e n c e of observations x T of length N T c o r r e s p o n d i n g to a stretch of voiced input that is to be coded. A s s u m e that x r is generated by a linear predictive process of order p with L P C coefficients a T and innovation 2 M o r e o v e r , we shall assume, for the m o m e n t , that x T has a variance err. Gaussian distribution so that the likelihood of x v c o n d i t i o n e d on the first p observations, is given by (2.9) with a = a r = (1, a T ) T and o- = err. In the simplest situation we might conceivably wish to test the hypothesis H : a r = a R, where a R is some k n o w n fixed reference vector. Alternatively we may not k n o w erR, but know instead only a reference s e q u e n c e of observations x R of length N R. In such cases we shall assume that x R is g e n e r a t e d ind e p e n d e n t l y of x T by a linear predictive process of order p with L P C coefficients erR and innovation variance cr2R. This again leads to consideration of the test of the hypothesis H : a T = erR, where n o w both err and e r r are u n k n o w n and must be estimated from the data. In this section we discuss the likelihood ratio tests of the hypothesis H : a R = a r in the various situations alluded to above. F r o m these tests relevant distance measures are constructed. It should be noted in passing that these tests and distance m e a s u r e s are also of interest in their own right since they are applicable to p r o b l e m s in fields o t h e r than speech recognition. T h e y can be seen as building on the work of Quenouille [35]. 3.1.

H:

err = aR ; err known

H e r e the relevant likelihood ratio statistic to test H : erj = oLR is o b t a i n e d as the m a x i m u m of the likelihood (2.9) u n d e r H expressed as a ratio of the u n c o n s t r a i n e d , m a x i m u m of the likelihood. By taking advantage of the corr e s p o n d e n c e between the likelihood (2.9) and that of the Gaussian linear regression model (see the discussion following (2.9)), the relevant-likelihood ratio test statistic is a m o n o t o n i c function of

e ( a , , aR) =

N ~ ( ~ T - aR)TDT(d@ or T

aR)

,

(3. t)

P. J. Thomson and P. de Souza

398

where &r, °-r,"2 C T and D r are obtained from x r using (2.3)--(2.5). Writing l~T = (1, t^r T T T we note that (3.1) can also be written as r ) T, a R = (1, trR) ~(&r, a n ) = N ~

R CT~ir

l

1

,

(3.2)

which is cheaper to compute than the form given by (3.1)o Asymptotically ~ has a Xp2 distribution when H : ¢er = atR is true. Note that the logarithm of the likelihood ratio is proportional to

r.]Ga. 1 I(& r, a n ) = l o g [ ~ / = k~rCr~r3

log(1 + ((&r~ a n ) / N ~ ) .

(3.3)

This statistic, with ¢iT and C r replaced by the asymptotically equivalent Y u l e - W a l k e r estimates dr and 0 r , is commonly known as Itakura's dista_nce [15]. When H : a T = a R is true, NLfl(&r, ozR) also has an asymptotic Xp2 distribution. We have chosen the statistic F in preference to I because of the f o r m e r ' s more tractable distributional properties and because of its direct relationship to the conventional test statistics developed for the Gaussian linear regression model [27, 37]. A better approximation to the null distribution of ~ can be determined. Note first that N~(& r

-

OIR)TDT(dfT --

o~R)/,~~

(3.4)

has an asymptotic X2 distribution under H and, if x t follows a Gaussian distribution, N~d-2r/o-2r is asymptotically equivalent to a X2n, p distribution under 2 T H. Moreover, &r and dr T are asymptotically independent. (See [31, p. 228] for example.) Hence, when H : ~ r = eeR is true, N~-- p l(&~ o~R) = N~.p ( (&r, aR)

N~.(c~r - aR)TDT (deT - ozR)/p 2

(3.5)

ST

has an (approximate) Fp,N~_ p d i s t r i b u t i o n . H e r e s 2r is obtained from ~r .,2r using (2.7). It can be seen from (3.4) that approximating the null distribution of g by a xzp distribution ignores the variability inherent in O-z modified distance " T . The measure l and its approximate F distribution have gone some way towards taking account of the variability of o--zT. Note that (3.4) and hence F each have an asymptotic X2p distribution under considerably m o r e general circumstances than the Gaussian assumption given previously [25, 26]. However, the distribution of , ^ 2 2T will • NTO-T/onot necessarily be well approximated by a AVN2 T-P distribution if x r is non-Gaussian. It has been argued forcefully that, it many circumstances, no single reference vector a~R will successfully characterise any given speech segment. (See

399

Speech recognition using LPC distance measures

[9, 10, 36] for example.) Because of co-articulation, there can be marked differences between different realisations of the same speech segment. In this context it is interesting to note that between 25% and 50% of words in normal conversation are unintelligible when taken out of context and heard in isolation. In practice this implies that a R is frequently not a fixed reference vector, but an estimated LPC vector with its own inherent variability. In these circumstances ~', l and Itakura's distance I are inappropriate. More appropriate measures are given in the remaining part of Section 3. 3.2. H : a r = a n ; a n u n k n o w n , er~ a n d o.2 k n o w n to be e q u a l

The joint likelihood of x n and x r, conditioned on the first p observations of each sequence, is ~ (er'X/~)-(N~+N~)exptf - - zer ( N n,a gTC R a a + N Tra r TC r a r

)}

'

(3.6)

N a' = N R - - p , a R = ( 1 , aR) T T , C R is obtained from x a using (2.3) and o.2 z We can again take advantage of the denotes the common value of ern2 and err. correspondence between (3.6) and the likelihood associated with two Gaussian linear regression models having the same variance; the corresponding test is that of testing for coincidence of two regression functions. (See Graybill [37, p. 190] for example.) The relevant likelihood ratio test statistic is a monotonic function of where

F(I~IT, ~ R )

[(N~+ N'r)erp , -2 - NRer n,~2 - N~6"Zrl/p =

, ~2

, ^2

,

,

[Nner R + N r e r r ] / ( N R + N r - 2p)

'

(3.7)

where er^ 2R and err ^ 2 are the estimates of o.2 obtained from x a and x r respectively using (2.5) and ^2

er,-

^1"

a~G, ~ .

(3.8)

Moreover, Cp is the pooled covariance matrix given by Cp = ( N ~ C n + N ' r C r ) / ( N ~ + N r ' ) ,

(3.9)

and dp .....(1, ¢~p), ^TT where ~p is obtained from 6p and its corresponding submatrix Dp using (2.4). The null hypothesis H : a r = a n will be rejected when F is significantly large. This statistic is due to de Souza [9] who derived it by analogy to classical regression theory. It can be shown [27, 37] that (3.7) is the same as F'(d~v &R) = (&R - & r ) T [ D R 1 / N ~ + D~'l/N'r]-~(d~n - & r ) / ( p s 2 ) ,

(3.10)

P. J. T h o m s o n a n d P. de S o u z a

400

where S2

=

t ^2 (Nn0.n + N r i0 .~2r ) / ( N n t + N~-- 2p)

(3.11)

estimates 0-2. Thus p F is just the squared distance between &n and ~T standardised by an estimate of the covariance matrix of &n - &r" Note that, as a distance measure, F possesses the desirable property that it is symmetric in &n and &r. If the null hypothesis H : a T = a n is true, then p F will have an asymptotic X2p distribution. This follows from the form of (3.10) and the stochastic properties of &n, &r, 0.R~2 and °'r.~2 This result will hold under quite mild regularity conditions concerning the nature of the processes generating x n and x r. It is not restricted to Gaussian x R and x r. However, if x n and x T are Gaussian, (NRd'2R + N~.d'2r)/0. 2 is asymptotically equivalent to a X2~+N~_2p random variable. Since &n and a r are asymptotically independent of d'~ and d-~- it is evident that F is, in the case of Gaussian x n and xr, asymptotically equivalent to an Fp. uk+N~_2p random variable. Because the F distribution yields a m o r e conservative testing precedure, we shall take t h e Fp,gk+s~_2p distribution as the a p p r o x i m a t e distribution of F under H even when x R and x r are not Gaussian. When N T' - N R' -- N ' , then

F(&r, & R ) -

2 ( N ' - P) p [d'~ + 6"~.

"1

1/

2 ( N ' - p) (&R - &Y ) T ( D R 1

(3.12) +

D r ~)

I ( ~ R --

~T)

^2 ^2 0 . R q- 0 . T

P

and F is asymptotically equivalent to

an

Fp,2(N,p)

(3.13)

random variable when

~fl~R ~'~ a T "

3.3. H : a n = a r ; a R u n k n o w n , o-2r and o-2 not necessarily equal

If 0.~ and 0.~ were known a priori, then it is easily shown that the likelihood ratio test statistic for testing H " a T = a n is a monotonic function of - I ./1~ . . . n _~(t~ n -- a T ) \ T 1r 0 . 2R lrJ- ~ R

o.~D-r,/N,r}-l(d~n

_ &r).

(3.14)

This has an asymptotic X2p distribution under H~ Since 0.R, 2 0.r2 are unknown, the natural test statistic, or distance measure, to use is (3.14) with o-2 and 0.J replaced by their estimates 8 2 and 0.~27 or SR2 and ST.2 (See (2.5) and (2.7).) This yields the test statistic X2(&r, den)= (an

&r)T{s~Dn~INR + S2rD T llN~}-~(&n - &T),

(3.15)

and the null hypothesis H : a R = a T will be rejected when X 2 is significantly large. As a distance measure, /~2 is symmetric in &R and der and is again the

Speech recognition using LPC distance measures

401

squared distance between &R and &r standardised by an estimate of the covariance matrix of & R - & r . Although not/ a function of the appropriate likelihood ratio test statistic, X2 is asymptotically equivalent to it. In this case the correct likelihood ratio test statistic possesses certain undesirable theoretical and computational properties. (See, for example, the literature on the Behrens-Fisher problem in classical statistics [38, 39].) Note that X2 is proportional to the F statistic of Subsection 3.2 computed for the rescaled sequences £R = XR/SR and 2 r = xv/sT, We have chosen s 2R and s 2r instead of crheR and ~~2 . partly because they are less biased and partly because they make Xz a more conservative test statistic. The X2 statistic has an asymptotic .g2 distribution under the null hypothesis. If xR and x r are Gaussian, a slightly better approximation to the null distribution which takes some account of the variability of the estimates s 2R and s 2r is given by the following argument. Under H : a n = a r the matrices D R and D r converge to a common limit A, i.e.

lim D R = lim D r

N~-~

N~oo

(3.16)

= A,

which means that X2 is asymptotically equivalent to

a )Ta(aR -

/ s2 /Ni

s /N;

o'R/N R+ ¢ r / N r

/crR/NR+

~rr/N r

(a.

-

(3.17)

When H is true, the numerator of (3.17) has an asymptotic X2p distribution and is asymptotically independent of the denominator. In addition, using Satterthwaite's approximation [40], the denominator is asymptotically equivalent, under H, to a X 2J r random variable where v is estimated as

=

( s 2 / N i + s2r/N'r)2

(3.18)

(s2 /N~)2/(N~ - p) + ( s 2 / N ~)Z/(Nr - p) Thus the distribution of X2/p under H : a R = oer can be approximated by an Fp,~ distribution. Note that (3.18) varies between the smaller of N ~ - p , N ~ - p and N~ + N } - 2 p with the maximum occurring when 2 SR

N~(N R-p)

S2F

N~(N~-p)"

For the special case N r i-

N ~ = N,r

X2(&r,&e) -- N,(& R --OgT) ~ 7{ s 2n D n t -~ S T2D T } :l = 2(N .... p)(s2p-- 1),

~(OgR-~ ~IT)

(3.19) (3.20)

402

P. J. Thomson and 1:'. de Souza

where sp2 is obtained from (3.8) and (3.9), but with C R and C T replaced by CR/s ~ and Cfls2r respectively. When otR = otT, the distribution of X 2 is asymptotically X2p and the distribution of X2/p can be approximated by an/z~,~ distribution, where 4

4

(N'-p).

S R -1- S T

3.4. Alternative tests of H : o~R = o~T

The F and X 2 statistics are m o r e costly to compute than either ~ or I. In order to meet this problem, approximations to F and X 2 were introduced by de Souza and Thomson [10]. These are F , and ,,v,2 respectively, where

F*('~T, '~.) = ( N i - ' + N 7~)-2('~R - '~T) T • ( D R / N R + D T / N T ) ( & R -- & r ) / ( p s 2)

(3.21)

and

X 2(~[tT' aR) = ( N i -1-~- NT-1)-2(~R -- &T)X(s-e2DR/N;~ + s-r2DT/N)) " (&R

-

&r).

(3.22)

These approximations follow from the observation that F and X 2 involve quadratic forms of the type x'r ( w l A ~ 1 + w2Azl)-lx,

(3.23)

where A 1 and A 2 denote positive definite symmetric matrices and w 1 and w 2 are non-negative weights that sum to unity. In the case of F the weights w 1 and w 2 are N ; f l / ( N ~ -1 + N~--1) and N r,-1 / ( N R,-1 + N T1-1) respectively and A1 and A 2 a r e D R and D T respectively. T h e X 2 statistic has the same weights as F, but A 1 and A 2 are now DR/s 2 and Dv/s2r. Observe that (wiA71 + w 2 A ; l ) -1 is just a harmonic average of A~ and A 2. F , and X,2. are basically the F and X 2 statistics with this harmonic average replaced by the analogous arithmetic average, i.e. w~A 1 + w2A 2. The F , and X~ statistics involve much the same computational cost as either ( or I. Moreover, when the null hypothesis is true they are asymptotically equivalent to F and X 2 respectively. Thus the asymptotic approximations to the distributions of F and X 2 under the null hypothesis also hold, respectively, for F , and X2,. In theory, F , and X~ are less powerful than F and X 2. In practice, however, the loss in power may be small enough to not warrant the additional computational cost involved in computing the more powerful tests. The simulation results of de Souza and T h o m s o n [10] give some guidelines here. In summary the F , and X~ statistics possess the advantage that they are naturally related to the optimal F and X 2 statistics and they are relatively inexpensive to compute.

Speech recognition using LPC distance measures W h e n N Rp= N r, _- N

403

t and a R = a r ,

F * ( & r ' OlR) =

N ' - p (&R -- &r)T(DR + D r ) ( & n -- &r) - 2p -z + erT A2 erR

(3.24)

has an a p p r o x i m a t e Fp,2(N,_p) distribution if x R and x r are G a u s s i a n and Nt X2*(&T, &R) = -4- (~¢R -- &T)T(DR/S2 + D r / s Z ) ( & R - &r)

(3.25)

has an a s y m p t o t i c X 2 distribution. Using Satterthwaite's a p p r o x i m a t i o n the null distribution of X 2 / p is a p p r o x i m a t e l y Fp,~, w h e r e ~) is given by (3.18). Tribolet et al. [41] consider the test statistic N ' ( ~ r - -,...--~n~rDr(&r- &R) g*(&r, &R)= 2 ^2

(3.26)

O" T

2 in the situation w h e r e N r' -- N R' = N ' and it is assumed a priori that err2 = erR. 2 t h e n , u n d e r H : aR = a r , O b s e r v e that ~'*(&r, &R) = F(&r, &R)/2- Given er 2r _- erR, F, has an a s y m p t o t i c X 2 distribution. As in the case of ~, a modification of ~', yields

l,(e

N'-p

r, e,R) = - N'p

N'

2

(~T

e,(e

--

r,

&R)TDT(&T- CeR)/P 2

(3.27)

sr

which is asymptotically equivalent to an F v , N , p distribution u n d e r H : a R = a r p r o v i d e d x R and x r are Gaussian. T h e tests based on t~. or l. will not be as powerful as F and X 2 or F . and X 2. H o w e v e r , the principal d i s a d v a n t a g e of ( . and l. is simply that, as distance measures, they are not s y m m e t r i c functions of &r and &R; i.e~ 2 it ~*(~(T' ~R) 76 ~P*(&R,•T) and l,(&T, dzR) ¢ I,(~R, &T)" M o r e o v e r , if o-~ ¢ err, is easily shown that when H : a r = a R is true, ( . is asymptotically equivalent to 2 1 2,, 2 2 2 a ~(1 + o'R/errlXv r a n d o m variable and I. to a ½(1 + erR/O'r)Fv, u,_p r a n d o m variable. H e n c e the 2"p2 or Fp,N, p a p p r o x i m a t i o n will give spurious results w h e n e v e r erre differs significantly f r o m err. T h e inter-relationships b e t w e e n I t a k u r a ' s distance m e a s u r e I and the distrance m e a s u r e s g, ( . and X~ are of interest. First o b s e r v e from (3.3) that, when a r = a R , N f l ( & r , a e ) and ( ( & r , aeR) are asymptotically equivalent and ~ . given by (3.26) is asymptotically equivalent to N } I ( & r, &R)/2 irrespective of

P. J. Thomson and P. de Souza

404

2

2

t

r

whether ~rR = ~rT or N R = N T. Thus, under H : a R = ¢~r, g(&r, aR) =' N~I(dzr, aR),

e*(d~r, &R) "- N~I(&r, e~R)/2,

(3.28)

where - indicates that the expressions concerned approach equality as N ~ and N ~ tend to infinity in such a way that N ' r / N ~ remains fixed. Moreover, from (3.22),

xz(&T, &R)= 2(N~+ N~)-2[(1-N~)N~-Z~,(&R, &r) P

t2

^

(3.29)

and, when the null hypothesis is true, X~ is asymptotically equivalent to I *(&r, &R) = ( N ~ + N~-)-2[(N~ - p) N'rZI (&R, &r) t

+ (NT- p)N.

12

' R)I -

(3.30)

H e r e (3.29) expresses the symmetric distance measure X 2 as a linear combination of the two asymmetric distances g*(&n, &r) and g*(&r, &R). When a R = a r, (3.30) shows that ,g2, is asymptotically equivalent to the symmetric distance measure I , which is a linear combination of the two asymmetric Itakura distances I(&R, &r) and I(&r, &g)- For N~ = N~-= N',

1

= p •

^

aT)+

and I*(&r, e/R)= N ' - p . 2

~2[I(¢~R' &r) + l(&r, &e)l-

(3.32)

Forming a symmetric distance measure from two asymmetric distance measures in this way is an intuitively reasonable procedure. Such a procedure has been used previously by Rabiner et al. [42]. The above derivation also shows that the distance measure I , is closely related to X2, and hence X 2. The implication of (3.30) and (3.32) is that speech recognition systems relying on Itakura's distance might be improved by the trivial modification of replacing I by I , . An experiment in which this was done is described in Section 6. 2 3.5. H : o"2R = O'T , OlR and o~r unknown

The tests and distance measures considered so far compare only the LPC coefficients, and not the innovation variances. Tests for comparing the in~-

Speech recognition using LPC distance measures

405

novation variances are useful for detecting a m p l i t u d e or e n e r g y changes. T h e resulting information, when c o m b i n e d a p p r o p r i a t e l y with an L P C distance m e a s u r e , can lead to greater recognition accuracy [3]. T o c o m p a r e two innovation variances we consider the likelihood ratio test of 2 This can be shown to be a function of 2 2 H : o 2 = o-r. s r / s R or, equivalently, ST2

FR-

2

-

Sa

NTaT, ^T CTIiT/(N~, - p ) ,~ ^ T

^

,

NRaR CRaR/(NR-- p)

(3.33)

"

T h e statistic F R is asymptotically equivalent to an FNi_p. Ni¢ p r a n d o m variable when O-R 2 = O.2 and x R and x r are Gaussian. T h e test is two sided with the critical points, in practice, being d e t e r m i n e d as the 100(~-y)% percentile and the l O 1 0 0 ( 1 - ~y)Yo percentile of the FNr_ p N,R p distribution w h e r e 7 d e n o t e s the level of significance of the test. T h e s e values are not o p t i m a l and can be slightly i m p r o v e d on. (See, for example, R o u s s a s [43, p. 303].) T h e p r o b l e m of combining L P C and energy m e a s u r e s in o n e overall m e a s u r e is considered in Subsection 3.6. 2

2

3.6. H : O-R = O-T, OlR

017"

~-

H e r e we wish to c o m p a r e p a r a m e t e r vectors that c o m p r i s e the p + 1 coefficients o-2 and re. T h e likelihood ratio test statistic for testing H : o-~¢ = o-2, a a = a T can be shown to be A w h e r e - 2 log A is given by

LLR

(N~+N~-)tog r

( I + N ~ + N ~ cP_ 2

^2 O'~

~.2

OrR

o-T

+ N R log ~ 5 + N~-log ~ S "

p

F) (3.34)

In the a b o v e F is given by (3.10), ^2

~

2

t

o-, - (N'R+ N , r • 2 p ) s , / ( N R +

N~)

and s~• is given by (3.11). W h e n o-2 = o-'r2 and 01R = 017, L L R has an asymptotic 2 ,gp+l distribution. Alternatively the critical values of this statistic when H is true can be d e t e r m i n e d numerically from the joint distribution of ps~F/cr 2, ( N ~ p)s2/o- 2 and ( N ; ~ - p ) s 2 / o - 2 (o-2= o-~ = o_2) which are asymptotically equivalent to i n d e p e n d e n t h '2 r a n d o m variables with degrees of f r e e d o m p, (N~r --.p) and ( N ; ~ - p ) respectively. T h e null hypothesis will be r e j e c t e d when L L R is sufficiently large. Simpler asymptotically equivalent expressions for L L R when o-2 ..... o"27 and 01R ~

KIlT a r e

~27. ~ L L R , = p F + N~¢ log SR -~ N r' log ---~

(3.35)

P. d. T h o m s o n a n d P. de S o u z a

406

2

--

2

1 LLR~, = p F + ~(N Rt-1 + N ~ - l )" -~l{lS~R - - S T ~ I \

Both L L R , and LLR.~ have asymptotic

S,

2

)(p+l

2

(3.36)

I

distributions under H.

3. 7. Tests based on alternative L P C likelihoods The test statistics and distance measures constructed in Section 3 have all been derived using the conditional likelihood (2.9). However, as noted in Section 2, other likelihoods could be chosen. In particular, a good approximation to the exact likelihood is given by (2.9) with C replaced by C, and this likelihood can, in turn, be further approximated by (2.9) with C replaced by C'. These likelihoods yield precisely the same test statistics as before, but with the conditional likelihood estimates (&r, °'r,^2 etc.) replaced by the approximate maximum likelihood estimates 0 i t , o'v,-2etc.) for the likelihood based on C', or by the Yule-Walker estimates (&r, ~2r, etc.) for the likelihood based on 0. Since ~ / N C , V ' N C and X/N-ff are asymptotically equivalent, the stated asymptotic distributional properties of the various test statistics and distance measures based on C are the same as for those based on (~ or C.

4. Power functions of the tests

Power is an important consideration in the choice of an appropriate test or distance measure. The more powerful the test, the greater the recognition accuracy of the associated distance measure. In this section the asymptotic distributions of many of the test statistics given in Section 3 are determined for the case when the null hypothesis is false. From these distributions approximations to the power functions can be constructed. For the situation described in Subsection 3.1 it can be shown that when a R ~ a T both t ~ and l have distributions that are asymptotically equivalent to a non-central gp2 distribution with non-centrality parameter T 12-_- N , T ( a T --

aa'TV- c) ! c),

P O ( 2 >>"c)

1c).

(4.7)

It would appear that F , and X2 are the more powerful statistics since F and X2 have the same respective asymptotic distributions under the null hypothesis as F , and X2. However, the asymptotic distributions of these statistics are only approximations to the true distributions for finite N~ and N}. Hence it might be deduced from (4.6) that any apparent increase in power by F , and .g2 would

408

P. d. Thomson and P. de Souza

be at the expense of a greater rate of Type l errors than that chosen. Evidence that this is indeed the case is provided in the simulation studies of de Souza and Thomson [10]. Since Itakura's distance I and its modification I , are asymptotically equivalent to ~ and X2, respectively under the null hypothesis, it might be expected that I and I , would be equally as powerful as ~ and X 2 respectively in the important case of small departures from the null hypothesis. One way of showing this is to consider, under the alternative hypothesis, a sequence of 2 and ~'22 values of o~R - aer which decrease with N r, and N R, in such a way that ~-~ remain fixed. In this situation it is readily shown that I and I , are asymptotically equivalent to t ~ and X~ respectively. 2 is not necessarily the same as o2 , the statistic For the general case when O~R 2 2 FR of Subsection 3.5 is asymptotically equivalent to a (rrr/O'R)FN~c_p,N~_ p random variable. The simulation studies of de Souza and Thomson [10] give some guidance as to the adequacy of the various asymptotic approximations in practice. In particular, the LPC distance measures X2 and X~ were found to be robust and powerful. However, of the two, X 2 follows the Xp2 distribution more closely under the null hypothesis, especially in the upper tail of the distribution

5. Computational costs of the tests

We first note that the conditional likelihood estimates & and 6.2, the approximate maximum likelihood estimates ¢/ and ~2 and the Yule-Walker estimates d, and 6. can all be computed efficiently. Indeed, Morf et al. [32] show that the number of multiplications necessary to compute & and 6.2 or & and c~z is ( N ' - 1)(p + 1) + 7p 2 + O(p), whereas Dickinson [34] shows that & and ~2 can be computed in (N' - 1)(/) + 1) + 7~2p2 + O(p) multiplications. In normal speech processing applications, however, the value of p chosen is such that these algorithms are no faster than the solution using Cholesky decomposition. Nevertheless, these algorithms do lead to reduced storage requirements. Consider now the computation of the various test statistics given in Section 3 and, in particular, the case N~,= N~. First observe that (, 1, I, t~, and l, are all functions of F which is computed more efficiently using (3.2) rather than (3.1). Moreover F is computed more efficiently using (3.12) rather than (3.13) and X2 ^2 (3.8) in F is more efficiently computed using (3.20) rather than (3.19). Here (rp and sap in 1`2 (3.20) can be calculated using the algorithms of Friedlander et al. [45] which take advantage of the near-Toeplitz structure, measured in terms of displacement rank, of D R and D r . In particular, since D R and D r each have displacement rank 2, any linear combination ClD R + c2D T where c~ and c 2 are positive constants has displacement rank 4. These algorithms together with those for 6 and 6.2 yield computationally efficient procedures for determining, not only F and X2, but also all the other distance measures discussed in Section 3. Once again, however, although the values of p used in most speech processing

Speech recognition using LPC distance measures

409

applications are such that these algorithms are no faster than Cholesky decomposition, they do lead to reduced storage requirements. Using the number of multiplications as an estimate of computational complexity, the calculations for F require approximately (p + 1)(N r - 1)+p3/3+ 3p2+ 8p/3 multiplications, whereas those for 4, l, I, ~ , , I, and F , each require (p + 1)(N~,- 1)+p3/6+ 2p2+ 17p/6 multiplications. X2 and X~ require -~(p + 1) multiplications in addition to those for F and F , respectively and I , requires an additional lzp(p + 3) multiplications in addition to those for X~. In arriving at these estimates it has been assumed that &r, &p, etc. were obtained using the Cholesky decomposition and that the number of multiplications that this entails is as given in [32]. Thus for N = 100, p = 10 the F and X2 statistics require approximately 20% more computations than any of 4, l, I, ( , , l, and F , . For N = 300, p = 10 this figure reduces to approximately 8%. Therefore, except under stringent computational conditions, the choice of test statistic can he based on the properties of the test statistic concerned rather than com~putational cost. Turning now to storage requirements we note that the computation of F, X2, 1 F , , X~ and I , require the storage of &R, °'R~2and D R (p + 1 + ~p(p + 1) floating point numbers), whereas 4, l, I, ~, and l, require the storage of &R only (p floating point numbers). Thus, for the typical case p = 10, the storage necessary to compute F, X2, F , , X2, or I , exceeds that for 4, l, I, 4, or 1, by about 56 floating point numbers per reference template.

6. An isolated word recognition experiment In this section we describe two versions of an isolated word recognition experiment performed on a 62-word vocabulary for the case in which N~ = N~. In the first version the Itakura distance was used, and in the second it was replaced by the symmetric distance I . (3.32) with dR, dr, o-~ and ~r~ estimated using the Yule-Walker estimates (2.15) and (2.16). The vocabulary used for this experiment comprised the letters, digits, and 26 other words consisting mainly of keyboard symbols (comma, period, asterisk, slash, percent, dollar, etc.). Each word in the vocabulary was uttered 10 times in random sequence by a male speaker in an ordinary laboratory environment using a Shure SM12 headset microphone. The incoming speech was digitised by a 14-bit A/D converter at a rate of 20 kHz. It was then pre-emphasised and a 14 pole selective autocorrelation LPC analysis was performed every 10 ms over a 512 point Hamming window. The end-points of the 620 words uttered were determined automatically and corrected where necessary by hand. The first utterance of each word was used as the reference or template, and the remaining 558 utterances were used as test data for recognition. Each of the 558 test utterances was recognised by finding the closest matching template using dynamic time warping to obtain a good match. The

410

P.J. Thomson and P. de Souza

dynamic time warping algorithm, which has been widely used, imposed strict end-point constraints on the test and reference patterns by forcing the two sets of end-points to coincide. Under these conditions 30 recognition errors were made using the ltakura distance I (3.3). The experiment was then repeated using the symmetric measure I , as the local distance, and on this occasion the number of errors was 25. Comparing these results, it can be seen that in this experiment, the price paid for using the Itakura distance measure instead of a more powerful LPC distance measure is a 20% increase in the word recognition error rate. This is a consequence of discarding information about the variability of the reference vector &R. In summary, this experiment shows that speech recognition systems based on Itakura's distance I can be significantly improved by the trivial modification of replacing I by I , where, for the case N T - NR, I, is directly proportional to the arithmetic mean of the Itakura distances I(&n, ~T) and I(& r, ~R). t

--

t

Acknowledgements The authors would like to thank S. Haltsonen for his assistance in performing the isolated word recognition experiments.

References [1] Martin, T. B. (1977). One way to talk to computers. IEEE Spectrum 14(5), 35-39. [2] de Souza, P. (1983). A statistical approach to the design of an adaptive self-norma!ising silence detector. IEEE Trans. Acoustic. Speech Signal Process. 31(3), 678-684. [3] Brown, M. K. and Rabiner, L. R. (1982). On the use of energy in LPC-based recognition of isolated words. Bell System Tech. J. 61(10), 2971-2987. [4] Rabiner, L. R., Rosenberg, A. E., Wilpon, J. G. and Keilin, W. J. (1982). Isolated word recognition for large vocabularies. Bell System Tech. J. 61(10), 2989-3005. [5] Zue, V. W. and Schwartz, R. M. (1980). Acoustic processing and phonetic analysis. In: W. A. Lea, ed., Trends in Speech Recognition 101-124. Prentice-Hall, Englewood Cliffs, NJ. [6] Markel, J. D. and Gray, A. H. (1976). Linear Prediction of Speech. Springer, Berlin. [7] Fant, G. C. M. (1960). Acoustic Theory of Speech Production. Mouton and Co., 's-Gravenhage, The Netherlands. [8] Rabiner, L. R. and Schafer, R. W. (1978). Digital Processing of Speech Signals'. Prentice-Hall, Englewood Cliffs, NJ. [9] de Souza, P. (1977). Statistical tests and distance measures for LPC coefficients. IEEE Trans. Acoust. Speech Signal Process. 25(6), 554-559. [10] de Souza, P. and Thompson, P. J. (1982). LPC distance measures and statistical tests with particular reference to the likelihood ratio. IEEE Trans. Acoust. Speech Signal Process. 30(2), 304-315. [11] Bahl, L. R., Cole, A. G., Jelinek, F., Mercer, R. L., Nadas, A., Nahamoo, D. and Picheny, M. A. (1983). Recognition of isolated-word sentences from a 5000-word vocabulary office correspondence task. Proc. IEEE Internat. Conf. on Acoustics, Speech, and Signal Processing, 1065-1067. [12] White, G. M. and Neely, R. B. (1976). Speech recognition experiments with linear prediction,

Speech recognition using LPC distance measures

[13]

[14] [15] [16] [17]

[18]

[19] [20] [21] [22] [23]

[24] [25] [26] [27] [28] [29] [30] [31] [32] [33]


E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5 © Elsevier Science Publishers B.V. (1985) 413-449

16

Varying Coefficient Regression

D. F. Nicholls and A. R. Pagan

1. Introduction

Very early on in the development of methods for the analysis of time series and the relationships between time series, it was recognized that techniques based upon constant coefficient models might well be inadequate. Early examples of this position would be Rubin (1950) and Kendall (1953); the former allowed for some random variation in the coefficients whilst the latter restricted them to a deterministically evolving pattern. Despite these qualms, constant coefficient models have proven to be effective in empirical data analysis, so much so that only relatively recently have there appeared either theoretical papers detailing the methodology for dealing with the types of coefficient variation important in the analysis of time series or empirical studies providing applications of these techniques.¹

A number of surveys have been written in the last five years of the area that this chapter covers, a book by Raj and Ullah (1981) and a contribution by Chow (1983) to the Handbook of Econometrics being prominent examples. As well, there is an annotated bibliography by Johnson (1977, 1980). Combining these references would provide any reader with a fairly comprehensive list of papers on the topic. For this reason we do not attempt an exhaustive examination of all the work done. Our objective is best understood by considering what it was that made the work by Box and Jenkins (1976) so seminal. Though there were some advances in estimation and hypothesis testing documented in that book, it is arguable that most of the techniques used by them had been available for some period of time--the autocorrelation function had been routinely computed during spectral analysis, a non-linear least squares technique for fitting models with moving average errors can be found in Whittle (1954), and the analysis of residuals for diagnostic purposes was long a feature of applied research. What was pathbreaking in their material was the presentation of an integrated approach to time series modelling, involving the specification/estimation/diagnostics cycle.² When approached in this disciplined way it proved easy to both communicate and assimilate techniques that had been in existence previously but had not been extensively used. It seems likely therefore that, in a book concerned with time series analysis, any discussion of varying coefficient models can be usefully structured in the same way. As will be shown, the estimation phase has received the predominant attention to date and yet, just as in standard time series analysis, it is probably the specification part of the cycle which is critical for practical work. Consequently, some of this chapter is an attempt to remedy that deficiency, although it will be apparent that much remains to be done.

The analogy with Box and Jenkins' approach can be pushed one step further. In their research they recognized that the presence of seasonal factors in time series led to a different class of models than would be appropriate if no seasonal effects were present; the modelling cycle remained the same but different models were likely to be required. It is also useful to make such a distinction in discussing the varying coefficient regression (VCR) literature. To clarify that contention, (1.1) represents the model examined in this paper:

y_t = x_t β_t + e_t .   (1.1)

Here e_t is a martingale difference process with E(e_t² | ℱ_{t-1}) = σ² < ∞ a.s., and ℱ_{t-1} is the sigma field composed of a set of events that includes {y_{t-j}}_{j=1}^∞ and {x_{t-j}}_{j=1}^∞ and may include x_t if it is taken to be exogenous; x_t is a 1 × p vector of regressors; and β_t is specified as following a multivariate ARIMA process A(L)(β_t − β̄) = η_t, where A(L) is a (possibly rational) polynomial in the backward lag operator L. Although such a characterization is restrictive, on the basis of the success of ARIMA models in representing time series it is to be hoped that β_t could also be approximated in such a way. The noise η_t driving β_t − β̄ is taken to be i.i.d.(0, Σ) and independent of {e_t}, while β̄ is the mean of β_t when the process generating β_t is stationary, but equals zero if that process is ARIMA.³

Equation (1.1) illustrates the three dimensions to any particular model: the nature of x_t, the nature of the process generating β_t, and the constancy of σ_t². Table 1 lists the various assumptions employed about those three dimensions in this chapter.


Table 1
Assumptions employed in models

(X1) x_t contains lagged values of y_t.
(B1) β_t = β̄.
(V1) σ_t² = σ².
(X2) x_t is a strictly exogenous set of variables. They may be stochastic or non-stochastic with a uniform bound.
(B2) β_t = β̄ + η_t, i.e. A(L) = 1. This will be referred to as random coefficient variation.
(V2) σ_t² is not constant.
(X3) x_t contains endogenous variables, i.e. (1.1) is part of a set of simultaneous equations.
(B3) A(L) ≠ 1. This case will be referred to as evolving coefficient variation.

Altogether there are some 18 possible combinations of these assumptions. Some are discussed elsewhere in this volume, e.g. (X1, B1, V1), which represents constant coefficient autoregressive models. Some have never been formally examined within the literature, e.g. (X1, B3, V2), and, apart from some general comments later, will have to be ignored. In Table 2 the content of each of the later sections of the chapter is matched with the various demarcations of Table 1.

Table 2

Section   Models
2         (X1, B2, V1), (X2, B2, V1)
3         (X1, B3, V1), (X2, B3, V1)
4.1       (X3, B2, V1)
4.2       (X1, B1, V2), (X2, B1, V2)

A number of the combinations missing from Table 2, e.g. (X3, B3, V2), may well be too complex to solve, particularly in the light of the difficulties facing investigators with the simpler alternative (X3, B2, V1), and we therefore do not even attempt to analyse such classes of models here.

¹Because of the nature of this volume a large literature based on longitudinal data--which indexes responses by individual units to allow for variation in model coefficients across individuals--is ignored. Some of this literature is surveyed in Swamy (1971) and Engle and Watson (1979).
²Our preference is for the term 'specification' rather than 'identification' to describe the process of a preliminary screening of models, as this latter term also needs to be used when discussing whether unique estimates of the unknown parameters of a process can be obtained.
³In restricting β_t to have a constant mean at most, we have ignored the possibility that β̄ might vary in a deterministic fashion with some variables z_t (say), i.e. β̄_t = E(β_t) = z_t δ. As will be evident from later analysis this modification merely induces extra regressors involving the cross product between z_t and x_t and does not change the essence of our proposals.

2. Random coefficient variation

In this section we shall be primarily interested in models of the form (X1, B2, V1) and shall outline known results relating to this particular class of model. These results should extend, using similar arguments, to the class (X2, B2, V1) or even to mixtures of these two classes of models. When appropriate, reference will be made to where proofs of results for the wider class of models are available.

The model (X1, B2, V1) can be written in the form

y_t = Σ_{j=1}^{p} β_{t,j} y_{t-j} + e_t ,

with β_{t,j} the jth element of the p × 1 vector β_t = β̄ + η_t, so that this model becomes

y_t = Σ_{j=1}^{p} (β̄_j + η_{t,j}) y_{t-j} + e_t .   (2.1)

For models of this form Andel (1976) derived conditions for their second-order stationarity, while Nicholls and Quinn (1981), referring to such models as random coefficient autoregressions (RCA), have extended Andel's results to the case of multivariate RCA's. For simplicity we shall concentrate, in the remainder of this section, on scalar models, though most of the results extend in a natural way to the multivariate situation. In the case of the model (2.1) we make the following assumptions:
(i) {e_t; t = 0, ±1, ±2, ...} is a sequence of i.i.d. random variables with zero mean and variance σ².
(ii) β̄' = (β̄_1, ..., β̄_p) is a vector of constants.
(iii) If η'_t = (η_{t,1}, ..., η_{t,p}), then {η_t; t = 1, ..., T} is a sequence of i.i.d. random vectors with zero mean and E(η_t η'_t) = Σ.
(iv) {η_t} and {e_t} are mutually independent.

If

M = ( 0       I_{p-1}           )
    ( β̄_p    β̄_{p-1} ⋯ β̄_1 ) ,

with the (1, 1) block being the (p − 1) × 1 null matrix, the (1, 2) block the (p − 1) × (p − 1) identity matrix, and ℱ_t is the σ-field generated by {(e_s, η_s); s ≤ t}, then it is possible to show (see Nicholls and Quinn, 1982, p. 31) that, when Σ > 0, there exists a unique ℱ_t-measurable second-order stationary solution to (2.1) if and only if M has all its eigenvalues within the unit circle and (vec Σ)' vec W < 1, where vec W is the last column of the matrix (I − M ⊗ M)^{-1}. (The tensor or Kronecker product ⊗ together with associated definitions and useful results are given in the Appendix.)

To obtain asymptotic properties of the estimators of the parameters of (2.1), the boundedness of the second moments of {y_t} is required. If the two criteria required for the second-order stationarity of (2.1) are bounded away from unity, it follows that this moment condition will be satisfied. As a result the next assumption for (2.1) is


(v) The largest eigenvalue of M is less than or equal to (1 − δ₁) and (vec Σ)' vec W ≤ (1 − δ₂), where δ₁ > 0 and δ₂ > 0 are both arbitrarily small.

The parameters β̄_j, j = 1, ..., p, and Σ must be such that the solution {y_t} to (2.1) is strictly stationary and ergodic, these conditions being required to obtain asymptotic properties of the estimators. A sufficient condition for this strict stationarity and ergodicity is that a second-order stationary solution to (2.1) exists; a feature guaranteed by (v), together with the fact that {e_t} and {η_t} are strictly stationary, which follows immediately from (i) and (iii). If z_t = K_p vec(Y_{t-1} Y'_{t-1}), where K_p is defined in the Appendix, the proofs of a number of the theorems to follow require that E{(z_t − E(z_t))(z_t − E(z_t))'} is positive definite. This follows from

(vi) There is no non-zero constant vector α such that α'(z_t − E(z_t)) = 0 almost everywhere.

The next assumption to be imposed on (2.1) is

(vii) The variance σ² of e_t is bounded below by δ₃ while the smallest eigenvalue of Σ is bounded below by δ₄, with δ₃ > 0 and δ₄ > 0 both arbitrarily small.

Imposing (vii) eliminates the possibility of the vector of parameters of (2.1) lying on the boundary of the parameter space. Such situations cause difficulties when obtaining asymptotic properties of estimators. We discuss this further in the next subsection.
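To make the stationarity conditions concrete, the following minimal sketch (in Python; the parameter values and names are hypothetical illustrations, not part of the original treatment) simulates a scalar RCA(1) process. For p = 1 the two criteria reduce to |β̄| < 1 and (vec Σ)' vec W = σ_η²/(1 − β̄²) < 1, i.e. β̄² + σ_η² < 1.

```python
import numpy as np

# Minimal sketch: simulate a scalar RCA(1) model
#   y_t = (beta_bar + eta_t) * y_{t-1} + e_t
# with hypothetical parameter values.
rng = np.random.default_rng(0)
beta_bar, sigma_eta2, sigma2 = 0.6, 0.2, 1.0

# Second-order stationarity for p = 1: |beta_bar| < 1 and
# sigma_eta2 / (1 - beta_bar**2) < 1, i.e. beta_bar**2 + sigma_eta2 < 1.
assert abs(beta_bar) < 1 and beta_bar**2 + sigma_eta2 < 1

T = 500
y = np.zeros(T)
for t in range(1, T):
    beta_t = beta_bar + rng.normal(scale=np.sqrt(sigma_eta2))
    y[t] = beta_t * y[t - 1] + rng.normal(scale=np.sqrt(sigma2))

# The stationary variance v solves v = (beta_bar**2 + sigma_eta2) v + sigma2.
print(y.var(), sigma2 / (1 - beta_bar**2 - sigma_eta2))
```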

2.1. Specification of the model

The first step in the specification stage is to determine the order of (2.1). Rewriting this model in the form

y_t = Σ_{j=1}^{p} β̄_j y_{t-j} + u_t = Y'_{t-1} β̄ + u_t   (2.2)

with

u_t = Σ_{j=1}^{p} η_{t,j} y_{t-j} + e_t = Y'_{t-1} η_t + e_t ,   (2.3)

the ordinary least squares (OLS) estimates of β̄_p, p = 1, 2, ..., are just the partial correlation coefficients. Furthermore, as seen later in Theorem 2.2, these estimators are strongly consistent and asymptotically normally distributed. Consequently, in order to determine the order of the model (2.1) (or (2.2)), the partial autocorrelation coefficients and their standard errors are computed for orders 1, 2, .... Thereafter, the order at which the first (and subsequent) of these is not significantly different from zero is found in a similar fashion to that proposed in the Box-Jenkins procedure. If the first coefficient which is not significantly different from zero occurs at lag (p + 1), and all higher-order coefficients are not significantly different from zero, then the model is of order p.
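This order-selection step can be sketched as follows (a minimal Python illustration, not part of the original text): the pth partial autocorrelation is obtained as the last OLS coefficient of an autoregression of order p, and ordinates are compared with a rough ±2/√T band.

```python
import numpy as np

def pacf_by_ols(y, max_lag):
    """Partial autocorrelations as the last OLS coefficient of
    successively longer autoregressions fitted to y (a sketch)."""
    y = np.asarray(y, dtype=float)
    out = []
    for p in range(1, max_lag + 1):
        # Columns are lags 1..p; rows correspond to t = p, ..., T-1.
        X = np.column_stack([y[p - j - 1:len(y) - j - 1] for j in range(p)])
        coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
        out.append(coef[-1])        # coefficient on lag p
    return np.array(out)

# Hypothetical use: flag ordinates outside a rough 2/sqrt(T) band.
y = np.random.default_rng(1).normal(size=400)
pacf = pacf_by_ols(y, 10)
band = 2 / np.sqrt(len(y))
print([(j + 1, round(v, 3), abs(v) > band) for j, v in enumerate(pacf)])
```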


Alternative order determination procedures, including Akaike's AIC, BIC or related criteria, could also be used for this determination. A number of these are discussed in Priestley (1981).

In practice, when considering models of the form (2.1), having determined the order of the model, the next question to be determined is whether or not a constant coefficient autoregressive model would fit the data just as well. The usual theory associated with tests based on maximum likelihood estimates will not hold in this context, since the vector of unknown coefficients under the null hypothesis lies on the boundary of the parameter space. Indeed, to demonstrate that the maximum likelihood estimators in (2.14)-(2.16) later are asymptotically normal, it is necessary to restrict the parameter space Θ in such a way that the parameters do not lie on the boundary of Θ. If a boundary value was to be permitted, e.g. θ ≥ 0, the maximum likelihood estimator (MLE) of θ would need to solve {max_θ l(θ) s.t. θ ≥ 0}, where l(θ) is the log likelihood. As Moran (1971), Chant (1974) and, more recently, Gourieroux et al. (1982) have shown, the estimator θ̂_R that stems from this restricted problem has a very complex asymptotic distribution. From the results of those papers the likelihood ratio test is a mixture of χ² and degenerate random variables, while Gourieroux et al. also demonstrate that this is true of the test statistic based upon the Kuhn-Tucker multipliers, the analogue of the Lagrange Multiplier test statistic.

Although the obvious way to test for coefficient constancy is to test the hypothesis Σ = 0, the fact that the MLE of Σ has a complex distribution when the null is true makes this route unattractive. One potential solution is to base a test for Σ = 0 upon the scores ∂l/∂θ, thereby ignoring the constraint Σ ≥ 0. Such an approach loses power compared to that based on the scores of the restricted likelihood, but it does at least have a tractable asymptotic distribution; essentially this represents the proposal by Moran that Neyman's (1959) C(α) statistic be used. In this sense, the test presented below can be regarded as a score or C(α) or Lagrange Multiplier test statistic.

In order to develop a test of the null hypothesis that Σ = 0 or, equivalently, γ = vech Σ = 0, let β̃_T and σ̃²_T be the maximum likelihood estimates of β̄ and σ² under that hypothesis. (β̃_T and σ̃²_T are, of course, the usual maximum likelihood estimates for a fixed coefficient autoregression.) For a sample of size T, if Y_t


by regressing û_t û_{t-j} against z_{t,j}. Of course, β̄ is not known, so that it will be necessary to estimate β̄ by regressing y_t against x_t. With the resulting estimate β̂, the residuals û_t = y_t − x_t β̂ replace u_t in (3.9). Such a strategy parallels that of Subsection 2.1, the expansion in the number of regressions being occasioned by the fact that only Γ_0 = Σ is non-zero in the model (X1, B2, V1). Another pleasing outcome is that such a regression yields exactly the autocovariance function of the residuals when x_t = 1, since then z_{t,j} = 1 and γ̂_j = T^{-1} Σ û_t û_{t-j}, demonstrating that the regression-based strategy to obtain γ̂_j is merely an extension of standard time series analysis. Defining z̄_{t,j} as z_{t,j} for j > 0 and as (1, z_{t,j}) for j = 0, the relations (3.9) written in terms of residuals are û_t û_{t-j} = z̄_{t,j} θ_{2,j} + v_{j,t} or, in matrix form, Û_{-j} = Z̄_j θ_{2,j} + v_j. The estimators considered are β̂ = (X'X)^{-1} X'y and θ̂_{2,j} = (Z̄'_j Z̄_j)^{-1} Z̄'_j Û_{-j} -- the estimates of β̄ from the regression of y_t against x_t and of θ_{2,j} from the regression of û_t û_{t-j} against z̄_{t,j} -- and Theorem 3.1 describes their asymptotic properties.

THEOREM 3.1. Under conditions (i)-(iv) of Theorem 2.2 and with
(v) β_t − β̄ a stationary invertible process,
(vi) E(v_j v'_j) = V_j > 0,
(vii) x_t a sequence of non-stochastic regressors with uniform bound and lim_{T→∞} T^{-1} X'X = B > 0,
(viii) lim_{T→∞} T^{-1} Σ_t z̄'_{t,j} z̄_{t,j} = lim_{T→∞} T^{-1} Z̄'_j Z̄_j = Q_j > 0,
then θ̂'_j = (β̂', θ̂'_{2,j}) has the properties

(A) θ̂_j → θ_j a.s.,
(B) T^{1/2}(β̂ − β̄) →_d N(0, B^{-1} (lim T^{-1} X'V X) B^{-1}),
(C) T^{1/2}(θ̂_{2,j} − θ_{2,j}) →_d N(0, Q_j^{-1} (lim T^{-1} Z̄'_j V_j Z̄_j) Q_j^{-1}).

A somewhat cumbersome proof of this theorem was referred to in Pagan (1980), but a simpler one could be mounted along the following lines. When residuals are used in (3.9), the error term is v_{j,t} = u_t u_{t-j} − E(u_t u_{t-j}) + û_t û_{t-j} − u_t u_{t-j}. As T^{-1/2} Σ z̄'_{t,j}(û_t û_{t-j} − u_t u_{t-j}) can be shown to be o_p(1) without much difficulty, a proof of Theorem 3.1 would need to establish


the limiting distribution of an OLS estimator in a regression involving disturbances that are both non-stationary and dependent processes; non-stationarity owing to the presence of a non-constant x_t in the definition of u_t, and dependence because of any dependence in β_t − β̄. Theorems 2.3 and 2.4 of Domowitz and White (1982) may be used for this purpose. Joint normality of e_t and η_t may be dispensed with through such an approach, being replaced by some finite moment assumptions.

Now γ̂_j provides an estimate of the a.c.f. of β_t − β̄, and Theorem 3.1 potentially enables a judgement to be made concerning whether any γ_j = 0. Unfortunately, computing the covariance matrix of γ̂_j is no easy task. To see this, suppose that there is only a single coefficient and it is evolving as α(L)(β_t − β̄) = η_t. Specializing Theorem 3.1, if β_t follows a kth-order linear process, the asymptotic variance of T^{1/2}(γ̂_j − γ_j) for j > k is the probability limit of T times

φ_j^{-1} ( T^{-1} Σ_t x_t² x_{t-j}² γ̄_{t,0} + 2 Σ_{m=1}^{k} T^{-1} Σ_{t=m+1+j} x_t x_{t-j} x_{t-m} x_{t-j-m} γ̄_{t,m} ) φ_j^{-1} ,

where φ_j = T^{-1} Σ x_t² x_{t-j}² and

γ̄_{t,m} = E{(u_t u_{t-j} − E(u_t u_{t-j}))(u_{t-m} u_{t-j-m} − E(u_{t-m} u_{t-j-m}))}
        = E(u_t u_{t-j} u_{t-m} u_{t-j-m}) = E(u_t u_{t-m}) E(u_{t-j} u_{t-j-m})
        = E(a_{m,t}) E(a_{m,t-j})

under the null hypothesis, with a_{m,t} = u_t u_{t-m}. When x_t ≡ 1 and e_t ≡ 0, E(a_{m,t}) = E(a_{m,t-j}) = γ_m and the formula corresponds to that in Box and Jenkins (1976, p. 35, eq. (2.1.13)). It is apparent that the variance of T^{1/2}(γ̂_j − γ_j) corresponds to that from a regression model in which the errors follow a 'moving average' of kth order with time-dependent covariances. Under the conditions of Theorem 3.1 it follows from Domowitz and White (1982) that this variance may be consistently estimated by T times

φ̂_j^{-1} ( T^{-1} Σ_t x_t² x_{t-j}² v̂²_{j,t} + 2 Σ_{r=1}^{k} T^{-1} Σ_{t=r+1} v̂_{j,t} v̂_{j,t-r} x_t x_{t-j} x_{t-r} x_{t-j-r} ) φ̂_j^{-1}

by using the OLS residuals v̂_{j,t} from (3.9). Estimating the asymptotic variance in this fashion seems a good deal simpler than the alternative of explicitly evaluating γ̄_{t,m} and replacing any γ_j appearing in them with γ̂_j. As well, Domowitz and White's formula applies even if β_t is not a scalar.

Many regression packages nowadays provide estimates of the variance of the OLS estimator adjusted for heteroscedasticity, as recommended in Eicker (1967) and White (1980). In the context of the regression in (3.9), these estimates would correspond to T times φ̂_j^{-1} (T^{-1} Σ x_t² x_{t-j}² v̂²_{j,t}) φ̂_j^{-1}. When only a single coefficient evolves the omitted term is strictly non-negative, so that any test statistic for γ_j = 0 based on the heteroscedasticity-adjusted variances would be a conservative one.


Unfortunately, this directional result does not obviously extend to the case when more than one coefficient varies.

All of the above has been devoted to order determination. Regarding format, it is customary to examine the partial a.c.f. as well as the ordinary a.c.f. As the p.a.c.f. ordinates can be thought of as estimates of the parameters ρ_j in a sequence of autoregressions fitted to β_t, they may be found from the γ̂_j by solving the multivariate equivalent of the Yule-Walker equations. More specifically, defining γ' = (γ'_1, ..., γ'_K) and ρ' = (ρ'_1, ..., ρ'_K), a linear relation of the form Aρ = γ exists, where A is a matrix constructed from γ_0, ..., γ_{K-1}. When β_t is a scalar, the (i, j)th element of A is γ_{i-j}. The linear relation between γ and ρ may be exploited to re-parameterize the regressions of Theorem 3.1, with ρ_j replacing γ_j as the unknown parameters. To illustrate, suppose β_t is a scalar and ρ_1 and ρ_2 are to be found. For K = 1, γ_0 ρ_1 = γ_1, so that the relation û_t û_{t-1} = x_t x_{t-1} γ_1 + v_{1,t} is equivalently written as û_t û_{t-1} = x_t x_{t-1} γ_0 ρ_1 + v_{1,t}. When K = 2,

( γ_0   γ_1 ) ( ρ_1 )   ( γ_1 )
( γ_1   γ_0 ) ( ρ_2 ) = ( γ_2 )

provides

û_t û_{t-1} = (x_t x_{t-1} γ_0) ρ_1 + (x_t x_{t-1} γ_1) ρ_2 + v_{1,t}

and

û_t û_{t-2} = (x_t x_{t-2} γ_1) ρ_1 + (x_t x_{t-2} γ_0) ρ_2 + v_{2,t} .

As is evident, the presence of ρ_1 and ρ_2 in both equations shows that, if an efficient estimator of both parameters is desired, it would be necessary to estimate both equations jointly, imposing the cross-equation equality restrictions. Simpler alternatives would be to ignore the cross-equation restrictions or to add the equations together; unfortunately, if the last tactic were adopted, the error v_{1,t} + v_{2,t} would generally be autocorrelated. Some research in this area would seem needed.

All of the above is predicated upon a knowledge of γ_j. In fact these are unknown, and all that is available are the γ̂_j. Replacing γ_j by γ̂_j modifies the error term, e.g. in the scalar case when K = 1, û_t û_{t-1} = (x_t x_{t-1} γ̂_0) ρ_1 + v_{1,t} + x_t x_{t-1}(γ_0 − γ̂_0) ρ_1. As is easily verified, such a substitution does not affect the consistency of ρ̂_1, but it does mean that the covariance matrix of T^{1/2}(ρ̂_1 − ρ_1) depends not only upon the limit of T^{-1/2} Σ x_t x_{t-1} v_{1,t}, but also upon

T^{-1/2} Σ x_t x_{t-1} (x_t x_{t-1} (γ_0 − γ̂_0) ρ_1) = T^{1/2} (γ_0 − γ̂_0) ρ_1 T^{-1} Σ x_t² x_{t-1}² ;

the second term clearly possesses a limit distribution. Adjustments must be performed to obtain the correct variance for T^{1/2}(ρ̂_1 − ρ_1), but the exact nature of these must remain an area for future research. It is worth noting that,


asymptotically and for a single evolving coefficient, the computed OLS variance of T^{1/2}(ρ̂_j − ρ_j) understates the true variance, providing a conservative test statistic.

In all of the above analysis it was presumed that the process generating β_t was stationary, yet there is no compelling reason why A(L) should not contain unit roots. In standard time series analysis, such a happening is detected by successive differencing of the time series until the a.c.f. ordinates die out rapidly. Unfortunately, it does not seem easy to mimic that mode of operation here. Suppose β_t was a scalar and β_t = β_{t-1} + η_t. Then β̄ = β_0 and u_t in (3.1) becomes u_t = x_t Σ_{s=1}^{t} η_s + e_t, demonstrating that

E(u_t u_{t-j}) = σ_η² (t − j) x_t x_{t-j} + σ² δ_{j,0}   (j = 0, 1, 2, ...).

Regressing û_t û_{t-j} against (t − j) x_t x_{t-j} yields an estimate not of the jth autocovariance of η_t but of its variance! A differencing-like test might be constructed by regressing û_t² − û_t û_{t-1} against a constant, (x_t² t − x_t x_{t-1}(t − 1)) and x_t²; the last of these regressors should provide an insignificant contribution, as

E(u_t² − u_t u_{t-1}) = σ_η² (x_t² t − x_t x_{t-1}(t − 1)) + σ² .

Probably a similar strategy might be devised for the detection of models where A(L) is not solely composed of unit roots, e.g. A(L) = (1 − λ_1 L)(1 − L), but it must remain a high-priority area for research.

Finally, application of Theorem 3.1 to (X1, B3, V1) is not at all straightforward. It is a feature of (3.9) that in this case the random variable v_{j,t} will be autocorrelated. OLS applied to (3.9) will therefore yield inconsistent estimators of γ_j whenever x_t contains lagged values of y_t. This is a serious weakness, and to overcome it requires the use of some instrumental variables for the lagged values of y_t. Unfortunately, unless the autocorrelation in v_{j,t} is of the MA type--and there is little reason to believe that β_t − β̄ would have this characteristic--it is not possible to exploit the past history of y_t for instruments. Finding instruments may well be difficult, and unless they are of good quality it could be very hard to make any prior discrimination between models. Once again this is a topic that requires much more detailed attention.

3.2. Estimation

Having isolated a range of models that are to be entertained, the next phase in the modelling cycle involves estimating the unknown parameters. In contrast to the specification aspect, there has been a substantial amount of research devoted to estimation. Much of this research represents an adaptation of the techniques presented in Section 2. The recommended estimation technique of that section was maximum likelihood, and Subsection 3.2.2 details the properties of this estimator in the evolving coefficient case. What differentiates the two situations is that the likelihood can only be defined implicitly in the evolving coefficient case, making it difficult to find analytic derivatives as was


done in Subsection 2.2.2. Consequently, resort is frequently had to numerical algorithms for maximizing the likelihood, and a brief discussion of some of these is presented later in Subsection 3.3.3. Even though these algorithms have managed to handle quite complex models, the computational burden can be quite heavy, and one might be satisfied with a consistent estimator only. Subsection 3.2.1 deals with the most popular variants to achieve this objective, all of which involve the construction of estimators via regression analysis. These are essentially extensions of the least squares estimators described in Theorem 2.2, which were proposed mainly for the purpose of generating estimates to begin the iterations towards the MLE. Some authors have suggested that the estimators of Subsection 3.2.1 be employed to derive a 'two-step' estimator that has the same limiting distribution as the MLE, and that idea is described in Subsection 3.3.1.

3.2.1. Covariance estimators

As Theorem 3.1 showed, it is possible to find consistent estimators of β̄, γ_j and σ² by regression. For γ_j (j > 0) the regression is of û_t û_{t-j} against z_{t,j} = (x_{t-j} ⊗ x_t) K'_p, where the matrix K_p reflects the relation γ_j = vech(Γ_j). Although not rigorous, it is convenient in what follows to ignore the symmetry in Γ_j and to define z_{t,j} = x_{t-j} ⊗ x_t. In practice, this symmetry restriction is always imposed by the nature of the regression anyway. Once γ̂_j is derived, α̂ (the vector of unknown parameters in A(L)) and Σ̂ may be recovered from the autocovariances of the β_t process, and the task is therefore one of factorizing the covariance function--see Wilson (1969) and (1973) for details on algorithms to accomplish this. Accordingly, such estimators might be termed 'covariance estimators'.

Although this covariance estimator was suggested in Rosenberg (1973), it has not received a great deal of use until recently; possibly because Rosenberg did not provide any asymptotic properties for it. There is, however, one variant of the covariance estimator which has been applied--that of Swamy and Tinsley (1980) (hereafter S-T)--with applications in Havenner and Swamy (1981) and Swamy et al. (1982). S-T formulate A(L)β_t = η_t in the linear system form ξ_t = Φ ξ_{t-1} + ζ_t; when β_t is an AR(p), for example, ξ'_t = (β'_t, ..., β'_{t-p+1}), ζ'_t = (η'_t, 0) and α = vec(Φ). Their estimator then involves the regression of û_t û_{t-j} against ẑ_{t,j} = x_{t-j} ξ̂'_{t-1} ⊗ x_t (the intercept term, if there is one, is absorbed into x_t in their formulation), where ξ̂_t is an estimate of ξ_t generated by a formula given in S-T (1980, eq. (4.10)). To appreciate the relation of this estimator to that set out in Theorem 3.1, it is important to observe that S-T estimate α and Σ rather than the γ_j. For illustrative purposes let x_t be a scalar and assume that β_t follows an AR(1), β_t = α_1 β_{t-1} + η_t. Then γ_1 = α_1 γ_0 and the unknown coefficients are α_1 and σ_η². The regression relation to generate γ_1 is

û_t û_{t-1} = x_t x_{t-1} γ_1 + v_{1,t} ,   (3.10)

that is,

û_t û_{t-1} = x_t x_{t-1} γ_0 α_1 + v_{1,t} ,   (3.11)

and, just as in the p.a.c.f. computations, α̂_1 could be found by regressing û_t û_{t-1} against x_t x_{t-1} γ̂_0 (γ̂_0 would be an output from the regression of û_t² against x_t² and unity). In contrast to this approach the S-T regression would be

û_t û_{t-1} = x_t x_{t-1} ξ̂²_{t-1} α_1 + v_{1,t} ,   (3.12)

demonstrating its close relation to the estimator of Theorem 3.1; the sole difference in this instance being the replacement of γ̂_0 by ξ̂²_{t-1}. As might be expected, for α̂_{1,S-T} to be consistent certain conditions must be satisfied by ξ̂_t. In particular, for this scalar case, the sample moments of ξ̂_t should be consistent estimators of the population moments of ξ_t (up to the fourth order). S-T's choice of ξ̂_t can in fact be shown to imply this for the scalar case, but it is much harder to see the equivalent necessary conditions for consistency in more general models. It would obviously be desirable that a proof of consistency of S-T's proposed estimator be available before extensive use is made of it. Havenner and Swamy (1981) show that β̂ is consistent and asymptotically normal, but that is a comparatively simple task compared to establishing the limiting properties of estimators of α and Σ.

There are a number of other points that need to be made about S-T's approach. First, just as the insertion of γ̂_0 in place of γ_0 invalidates the consistency of the OLS estimate of the covariance matrix of γ̂_j (and hence α̂_1), so too the covariance matrix of α̂_{1,S-T} is not consistently estimated by the OLS variance formula. Applications made of the S-T estimator do not seem to have allowed for this. Second, it is not clear what is to be gained by moving from γ̂_0 to ξ̂²_{t-1}, the computational load of the first estimator being much lower. Of course, the S-T estimator is iterative in that new ξ̂_t can be found with the updated α̂ and Σ̂, and these may be exploited to give new estimates α̂ and Σ̂, etc. One could iterate the estimator of Theorem 3.1 as well by exploiting the form of the covariance matrix of v_{1,t}, e.g. a weighted least squares regression to account for the heteroscedasticity in v_{1,t}, but iterations on covariance estimators seem a bit pointless, as the computational burden in each iteration is much the same as in each step of an iterative scheme to get the maximum likelihood estimator (MLE). Furthermore, because S-T's estimator is a variant of that in Theorem 3.1, it shares with that estimator the problems posed whenever x_t contains lagged values of y_t. Some applications of the S-T estimator have in fact been made to models such as (X1, B3, V1) without apparently realizing that the estimator will be inconsistent in such cases. Overall, this difficulty seriously reduces the appeal of covariance estimators for VCR's.
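As a concrete illustration of the regressions (3.10)-(3.11) in the scalar case, the following minimal Python sketch (our own construction; the simulated data and all names are hypothetical) recovers γ̂_0, γ̂_1 and hence α̂_1 = γ̂_1/γ̂_0.

```python
import numpy as np

def covariance_estimator(y, x):
    """Sketch of the scalar covariance estimator: beta_bar by OLS of y on x,
    then gamma_0, gamma_1 from moment regressions on the residuals, and
    alpha_1 = gamma_1 / gamma_0 for an AR(1) coefficient process."""
    x = np.asarray(x, float); y = np.asarray(y, float)
    beta_hat = (x @ y) / (x @ x)            # regression of y_t on x_t
    u = y - x * beta_hat                    # residuals u_hat_t
    # j = 0: regress u_t^2 on (1, x_t^2); the slope estimates gamma_0 and
    # the intercept sigma^2 (the z_bar_{t,0} = (1, z_{t,0}) case).
    Z0 = np.column_stack([np.ones_like(x), x**2])
    sigma2_hat, gamma0_hat = np.linalg.lstsq(Z0, u**2, rcond=None)[0]
    # j = 1: regress u_t u_{t-1} on x_t x_{t-1}; the slope estimates gamma_1.
    w = x[1:] * x[:-1]
    gamma1_hat = (w @ (u[1:] * u[:-1])) / (w @ w)
    return beta_hat, sigma2_hat, gamma0_hat, gamma1_hat / gamma0_hat

# Hypothetical use on simulated data with an AR(1) coefficient process:
rng = np.random.default_rng(3)
x = rng.normal(size=1000)
beta_t = 1.0; ys = []
for t in range(1000):
    beta_t = 1.0 + 0.5 * (beta_t - 1.0) + rng.normal(scale=0.3)
    ys.append(x[t] * beta_t + rng.normal())
print(covariance_estimator(np.array(ys), x))   # last entry is near 0.5
```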

3.2.2. Maximum likelihood estimation

As would be familiar from ordinary time series analysis, covariance estimators tend to be fairly inefficient; their prime virtue being their simplicity and their ability to provide consistent estimators that possess a limiting distribution. To improve efficiency, most investigators interested in VCR models have followed Box and Jenkins and engaged in ML estimation. Following the strategy of Section 2, the log likelihood is constructed as if e_t and η_t are jointly normal, being

log L_T(γ, α, Σ) = −(T/2) ln 2π − ½ Σ ln h_t − ½ Σ h_t^{-1} (y_t − y_{t/t-1})² + ln f(y_1) ,   (3.13)

where y_{t/t-1} is the expectation of y_t conditional upon the σ-field ℱ_{t-1} = (y_1, ..., y_{t-1}, x_1, ..., x_t) and h_t is the variance of the innovations ε_t = y_t − y_{t/t-1}. The first author to exploit this decomposition may have been Schweppe (1965), and it has subsequently formed the cornerstone for ML estimation of VCR models.

Equation (3.13) is constructed with ε_t and h_t. For the model treated in Section 2, both quantities could be derived analytically, but that is not so for the model of this section. Fortunately, once the VCR system in (1.1) is placed in the state-space form (SSF)

y_t = x_t β̄ + x̃_t ξ_t + e_t ,   (3.14a)

ξ_t = Φ ξ_{t-1} + ψ_t ,   (3.14b)

where x̃_t = (x_t ⋮ 0) and ξ_t has leading rows (β_t − β̄) and thereafter is defined so as to reduce A(L)(β_t − β̄) = η_t to first-order form, the Kalman Filter (KF) equations provide values of h_t and ε_t for given α and Σ. This approach has been well documented elsewhere, e.g. Rosenberg (1973) and Harvey (1981), and interested readers can find the KF described in these and in a number of other references. Two items deserve some attention, however. First, the KF needs to be initialized by E(ξ_1) and E(ξ_1 ξ'_1); because ξ_t is composed from A(L)(β_t − β̄) = η_t, E(ξ_1) = 0 and E(ξ_1 ξ'_1) is a function solely of Σ and α. Accordingly, it is not necessary to treat the initial coefficient β_1 as fixed and unknown; if this were desirable, Rosenberg (1973) showed how to concentrate it out of the likelihood. Second, the term log f(y_1) in the log likelihood (3.13) needs examination. As y_1 = x_1 β̄ + x_1(β_1 − β̄) + e_1, y_1 will be normally distributed with mean x_1 β̄ and variance x_1 Γ_0 x'_1 + σ², allowing log f(y_1) to be computed from x_1, α and Σ (this derivation assuming x_t to be non-stochastic).

What makes ML estimation desirable is the consistency of the resulting estimator and the fact that its covariance matrix is given by the inverse of the information matrix 𝒥_θθ = −E(∂²L_T/∂θ ∂θ'); this latter quantity frequently being estimated by the inverse of the Hessian of the log likelihood. However, because VCR constitutes a non-standard problem, with observations y_t being dependent and non-stationary, there is no certainty that these desirable properties can be invoked. Crowder (1976) and Basawa et al. (1976) have provided


theorems for the MLE to be consistent and asymptotically normal when y_t has the characteristics stemming from a VCR model, and the following theorem--which is a special case of Theorem 4 in Pagan (1980)--was proven by verifying that the conditions set out by Crowder hold under the stated assumptions.

THEOREM 3.2. If
A. the model is asymptotically locally identified;
B. (i) x_t is non-stochastic and uniformly bounded from above, (ii) Θ, the permissible parameter space, is a subset of R^s, (iii) the eigenvalues of Φ in (3.14b) have modulus less than unity;
C. the errors e_t and ψ_t constitute a multivariate normal distribution with finite variances;
D. θ_0, the s × 1 vector of true parameter values, is an interior point of Θ;
then

θ̂_ML →_p θ_0 ,   𝒥_θθ^{1/2} (θ̂_ML − θ_0) →_d N(0, I_s) .

If θ does not include vec(Φ), i.e. the transition matrix is fixed a priori, condition B(iii) may be deleted.

Some comments can be made upon this theorem and its assumptions. The permissible parameter space is defined by the problem, but would certainly require Σ to be p.s.d. In some situations the conditions are not exclusive, e.g. B(i)-(iii) would be a sufficient condition for A, but it seems worthwhile leaving unspecified what is needed for A and concentrating upon the asymptotic theory of the ML estimator given that A holds. Then, even if B(iii) did not hold, provided no elements in Φ were estimated and the parameters in Σ were asymptotically identified, consistency and asymptotic normality would follow. This then extends the range of the estimation theorem to non-stationary cases, provided a separate analysis of asymptotic identifiability can be given. If elements in Φ are to be estimated, then, by analogy with the corresponding situation of estimating unstable AR's, it would be expected that normality would not hold when Φ had unit roots. Although in that literature asymptotic normality of 𝒥_θθ^{1/2}(θ̂ − θ_0) does hold when the roots of Φ are greater than unity, it is very doubtful that such a result would be true for a VCR model, the reason being that the innovations in an AR have bounded variance regardless of the roots of Φ, whereas the variance of the innovations would tend rapidly to infinity if the roots of Φ were greater than unity in the VCR case. Of the other assumptions of the theorem, normality could be dispensed with by providing bounds on the moments of η_t and e_t. However, it would not seem possible to relax B(i) to allow non-stationary behaviour in x_t, as the outcome of such an alternative would be an unbounded variance of the innovations, and it is hard to see how the theorem could possibly hold. It is worthwhile noting that Amemiya (1977) also retained this assumption for the ordinary random coefficient case.

From the definition of the log likelihood in (3.13), the conditioning on past


data ensures that it remains the same even when x_t includes lagged values of y_t. Thus, ML estimates would be obtained in the same way regardless of the definition of x_t. However, Theorem 3.2 does not apply directly, although Weiss (1982) has considered the requisite extension. To do so demands the addition of various assumptions that serve to bound the moments of y_t; as might be expected, Weiss' methodology effectively combines Theorems 2.2 and 3.2. From Weiss' research it would seem that the properties of the MLE extend to the combination (X3, B3, V1) and, given the difficulties experienced by the covariance estimator under these circumstances, this establishes a strong case for its use.
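To fix ideas, here is a minimal sketch (our own, not the authors' code) of the prediction-error decomposition (3.13) computed by the Kalman filter for the simplest evolving coefficient model, a scalar random-walk coefficient y_t = x_t β_t + e_t, β_t = β_{t-1} + η_t; the parameter values, starting values and diffuse initialization are hypothetical choices.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, y, x):
    """Sketch: prediction-error log likelihood (3.13) via the Kalman
    filter for y_t = x_t*b_t + e_t, b_t = b_{t-1} + eta_t (scalar state).
    params = (log sigma2_e, log sigma2_eta); a diffuse-ish prior on b."""
    s2e, s2eta = np.exp(params)        # variances kept positive
    b, P = 0.0, 1e4                    # state mean and variance
    ll = 0.0
    for t in range(len(y)):
        P = P + s2eta                  # time update of state variance
        h = x[t] * P * x[t] + s2e      # innovation variance h_t
        eps = y[t] - x[t] * b          # innovation y_t - y_{t/t-1}
        ll += -0.5 * (np.log(2 * np.pi) + np.log(h) + eps**2 / h)
        K = P * x[t] / h               # Kalman gain
        b, P = b + K * eps, P - K * x[t] * P
    return -ll

# Hypothetical use: maximize over the two log-variances.
rng = np.random.default_rng(4)
x = rng.normal(size=200)
b = np.cumsum(rng.normal(scale=0.1, size=200))
y = x * b + rng.normal(size=200)
print(minimize(neg_loglik, x0=[0.0, -2.0], args=(y, x)).x)
```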

3.2.3. Identifiability

Theorem 3.2 required that the model be asymptotically locally identified, or that T^{-1} 𝒥_θθ be non-singular in the limit--this latter interpretation being provided by Rothenberg (1971). Since this assumption is very closely bound up with the existence of a consistent estimator of θ, and Theorem 3.1 showed how such a consistent estimator might be found, it should come as no surprise that the conditions for the existence of the estimator of Theorem 3.1, viz. that T^{-1} X'X and T^{-1} Σ z̄'_{t,j} z̄_{t,j} have a non-singular probability limit, also appear as sufficient conditions for asymptotic identifiability. This is the result proven in Pagan (1980, p. 349) by decomposing the information matrix. Swamy and Tinsley (1980) give a similar requirement, but in terms of their ẑ_{t,j}'s rather than z_{t,j}. An unsatisfactory aspect of stating identification conditions in terms of ẑ_{t,j} is the dependence upon estimates of ξ_t. There seems little to be gained by adopting their version. Deducing necessary conditions is much harder. S-T assert that their conditions are necessary, but no proof is actually given of this proposition; just because a condition is necessary for the existence of their estimator does not mean that it is a necessary condition for identifiability. A promising alternative approach has been set out by Solo (1982). By utilizing a variant of the KF equations--the Output Statistics Kalman Filter due to Son and Anderson (1971)--and assuming that the x_t's follow stationary processes, he has been able to slightly generalize the results in Pagan (1980).

All of the above papers relate to models in which β_t − β̄ follows stationary invertible processes. However, some applications have forced the A(L) polynomial to have unit roots, e.g. the seasonal adjustment model in Hannan et al. (1970), treated as an evolving coefficient regression in Pagan (1973b). Under these circumstances only Σ and σ² are unknown and, although Theorem 3.2 shows that the MLE retains its standard properties, it does so by assuming asymptotic identifiability. The most complete treatment of identifiability when A(L) = I − L is contained in Hatanaka and Tanaka (1981). They demonstrate that asymptotic identifiability holds under the following assumptions.

A1. (a) x_t x'_t < c_1 < ∞ for all t, with c_1 a positive constant. (b) There exists a positive integer τ and a positive real number c_2 such that for every pair of k-element vectors h_1 and h_2 with h'_1 h_1 = 1, h'_2 h_2 = 1 and for every non-negative integer m and s (0 ≤ s ≤ τ − 1), |x_t h_1| > c_2 and |x_t h_2| > c_2 for some t in the interval [mτ + s, (m + 1)τ + s].
A2. x_t x'_t > c_3 for all t and for some positive number c_3.


Of the two assumptions A2 is the stronger, but they indicate that it can be eliminated at the expense of a more complex proof. Accordingly, in most circumstances, the presence of unit roots in A(L) would not invalidate the standard properties of the MLE expressed in Theorem 3.2. Unfortunately, when x_t contains lagged values of y_t nothing is yet available concerning identifiability, and this is an area that is in need of much more research.

3.3. Some miscellaneous topics

3.3.1. Two-step estimators

One way to find the MLE is to use the method of scoring, which involves the iterative scheme

θ_{(j)} − θ_{(j-1)} = 𝒥^{-1}_{θθ(j-1)} ∂L_T/∂θ |_{(j-1)} ,   (3.15)

where (j) indicates values at the jth iteration. If θ_{(0)} is a consistent estimator such that θ_{(0)} − θ is O_p(T^{-1/2}), it is well known that T^{1/2}(θ_{(1)} − θ) has the same limiting distribution as T^{1/2}(θ̂_ML − θ). Many proofs of this proposition are available, with a convenient statement being Rothenberg and Leenders (1964). Generally, θ_{(1)} is not second-order efficient, but a further iteration will in fact produce θ_{(2)} which is--see Rothenberg (1983). Consequently, it is possible to derive asymptotically efficient estimators from (3.15) once a θ_{(0)} is available. But the estimator of Theorem 3.1 satisfies these requirements for θ_{(0)}, making one step of the scoring algorithm from such estimates a means for deriving an estimator that is as efficient as the MLE. It is this argument that justifies the contention at the end of Subsection 3.2 that there was little point in iterating covariance estimators.

3.3.2. Diagnostic checking

After estimation is complete, checks need to be made of model adequacy. An alternative is to 'overfit' and to test if the surplus parameters are zero using the asymptotic theory developed for the MLE. But, just as in ordinary time series analysis, exercising this option can be computationally expensive, and diagnosing inadequacy through residuals comes to the fore.

Suppose the true innovations ε_t were available. As mentioned earlier, the autocorrelation function might be found by regressing ε_t against ε_{t-j} (j = 1, 2, ...), i.e. the relation

ε_t = φ_j ε_{t-j} + a_t   (3.16)

would be estimated. Because a_t does not generally have constant variance--under H_0: φ_j = 0, a_t = ε_t and so E(a_t²) = h_t--OLS is not the most efficient estimator of φ_j in (3.16). To isolate an efficient test statistic for H_0: φ_j = 0 it is weighted least squares rather than OLS which is the appropriate estimator.
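A minimal sketch of that weighted regression (ours, with hypothetical names) is given below; the valid residual-based version in Theorem 3.3 further on also partials out the derivative columns P, a refinement omitted here for brevity.

```python
import numpy as np

def weighted_acf_tstat(eps, h, j):
    """Sketch: t statistic for H0: phi_j = 0 in (3.16) by weighted least
    squares, i.e. regress h_t^{-1/2} eps_t on h_t^{-1/2} eps_{t-j}."""
    w = 1.0 / np.sqrt(h)
    y = w[j:] * eps[j:]            # weighted regressand
    z = w[j:] * eps[:-j]           # weighted regressor eps_{t-j}
    phi = (z @ y) / (z @ z)        # WLS slope
    resid = y - phi * z
    se = np.sqrt((resid @ resid) / (len(y) - 1)) / np.sqrt(z @ z)
    return phi / se
```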


Accordingly, the regression should feature h_t^{-1/2} ε_t as regressand and h_t^{-1/2} ε_{t-j} as regressor. But even this modification is not enough. In practice, ε_t is not available and the investigator must make do with ε̂_t. The relation (3.16) converts to

ε̂_t = φ_j ε̂_{t-j} + a_t + (ε̂_t − ε_t) + φ_j (ε_{t-j} − ε̂_{t-j}) .   (3.17)

Under H_0: φ_j = 0, the error term in this regression is a_t + ε̂_t − ε_t. By the mean value theorem,

ε̂_t = ε_t + (∂ε_t/∂θ)'(θ̂ − θ) + (θ̂ − θ*)' (∂²ε_t/∂θ ∂θ')(θ̂ − θ*) ,

where θ* lies between θ and θ̂, allowing (3.17) to be rewritten as

ε̂_t = φ_j ε̂_{t-j} + a_t + (∂ε_t/∂θ)'(θ̂ − θ) + o_p(T^{-1/2}) .   (3.18)

Examining the limiting distribution of T^{1/2}(φ̂_j − φ_j) from the regression of ε̂_t against ε̂_{t-j}, it becomes apparent that the term (∂ε_t/∂θ)'(θ̂ − θ) contributes unless

T^{-1} Σ ε̂_{t-j} (∂ε_t/∂θ) →_p 0 .

The likelihood that this moment is zero is remote--from Pagan (1980, p. 359) ∂ε_t/∂θ is a linear combination of ε_{t-k} (k = 1, 2, ...). From this, the variance of the jth ordinate of the a.c.f. of ε̂_t will be quite complex, and certainly not T^{-1}. Some way around this complication needs to be sought. Let ε̂*, ε̂*_{-j} and P be matrices with h_t^{-1/2} ε̂_t, h_t^{-1/2} ε̂_{t-j} and h_t^{-1/2} ∂ε_t/∂θ' as tth elements respectively. An obvious matrix representation of h_t^{-1/2} times (3.18) is

ε̂* = φ_j ε̂*_{-j} + a* + P(θ̂ − θ) + o_p(T^{1/2}) .   (3.19)

Defining M = I − P(P'P)^{-1} P', it is possible to annihilate the term P(θ̂ − θ) by pre-multiplication of (3.19) by M:

M ε̂* = φ_j M ε̂*_{-j} + M a* + o_p(T^{1/2}) .   (3.20)

Equation (3.20) is the basis for the proof of the following theorem concerning a valid diagnostic test for H_0: φ_j = 0.

THEOREM 3.3. In the regression of ε̂* against ε̂*_{-j} and P, the 't statistic' associated with the coefficient of ε̂*_{-j}, when treated as a standard normal deviate, is asymptotically a valid test statistic for the null hypothesis that φ_j = 0.


PROOF. Regressing M ε̂* against M ε̂*_{-j} gives

φ̂_j = φ_j + (ε̂*'_{-j} M ε̂*_{-j})^{-1} ε̂*'_{-j} M'M a* + o_p(T^{-1/2})   (3.21)
    = φ_j + (ε̂*'_{-j} M ε̂*_{-j})^{-1} ε̂*'_{-j} M' a* + o_p(T^{-1/2}) ,   (3.22)

using M'M = M. Since ĥ_t →_p h_t because of the consistency of θ̂, it is easily seen that

T^{1/2}(φ̂_j − φ_j) →_d N(0, (plim_{T→∞} T^{-1} ε̂*'_{-j} M ε̂*_{-j})^{-1}) .

But the regression (3.20) involves the regression of the residuals from the regression of ε̂* against P versus the residuals from the regression of ε̂*_{-j} against P; as is well known, an identical estimate of φ_j and its covariance matrix can be found by the regression of ε̂* against ε̂*_{-j} and P.

It might be noted that, except for the division by h_t^{1/2} to produce a constant variance in the innovations, the regression in Theorem 3.3 is essentially that recommended by Durbin (1970) for adjusting the a.c.f. of residuals from an autoregression, again establishing a link between ordinary time series analysis and VCR methodology. Some other points of interest arise concerning Theorem 3.3. First, as ε̂*'_{-j} M'M ε̂*_{-j} = ε̂*'_{-j} M ε̂*_{-j} …

Frequently, there are more instruments available than are needed, i.e. the dimension of w_t exceeds that of x_t, and Sargan (1958) showed that an optimal instrument would be the predictions from the regression of x_t against w_t, denoted x̂_t in the following. Let us, therefore, rewrite (1.2) in terms of x̂_t as

y_t = x̂_t β̄ + (x_t − x̂_t) β̄ + x_t (β_t − β̄) + e_t .   (4.1)

If β_t − β̄ = η_t, the estimator obtained by regressing y_t against x̂_t will be

β̂ = β̄ + ( Σ x̂'_t x_t )^{-1} Σ ( x̂'_t x_t η_t + x̂'_t e_t )   (4.2)

using the properties of regression that Σ x̂'_t (x_t − x̂_t) = 0 and Σ x̂'_t x_t = Σ x̂'_t x̂_t. Examination of (4.2) reveals that β̂ is a consistent estimator of β̄ if

T^{-1} Σ x̂'_t x̂_t η_t →_p 0 .

Although this might appear a reasonable condition to impose, it would be better if a more basic set of assumptions ensuring it could be stated. When coefficients are constant the x̂_t are sometimes taken as predictions from the reduced form of the system. Shiba and Tsurumi (1982) followed such a strategy for the (X3, B2, V1) case, and their analysis shows some of the difficulties that can arise. An important one is that, even if a reduced form exists, the stochastic part may have an infinite variance. Consequently, regressing x_t against w_t would not yield a consistent estimator of any relationship between the two variables. Tsurumi and Shiba (1982) provide an example of just such a situation in the context of a very simple macro-economic model.

Even if β̄ is consistently estimated, however, there is still the task of estimating σ² and Σ = E(η_t η'_t). One possibility is to regress û_t² = (y_t − x_t β̂)²


against a constant and variables such as x_t ⊗ x̂_t; under certain conditions involving the existence of reduced form moments this would provide a consistent estimator of Σ, but not of σ², as E(x_t e_t) ≠ 0 means that the mean of û_t will also not be zero. Needless to say, more research in this area is appropriate.
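The instrument construction behind (4.1)-(4.2) can be sketched as follows (a minimal Python illustration of our own, with hypothetical simulated data): x̂_t is obtained by regressing x_t on the instruments w_t, and β̄ is then estimated by regressing y_t on x̂_t.

```python
import numpy as np

def iv_beta(y, x, W):
    """Sketch of the Sargan-style instrument: regress x on instruments W
    to form x_hat, then estimate beta_bar by regressing y on x_hat
    (scalar regressor; W of shape (T, q))."""
    # First stage: fitted values x_hat = W (W'W)^{-1} W'x.
    coef, *_ = np.linalg.lstsq(W, x, rcond=None)
    x_hat = W @ coef
    # Second stage, using x_hat'x = x_hat'x_hat as in the text.
    return (x_hat @ y) / (x_hat @ x_hat)

# Hypothetical use:
rng = np.random.default_rng(5)
W = rng.normal(size=(500, 2))
x = W @ np.array([1.0, -0.5]) + rng.normal(size=500)
beta_t = 2.0 + rng.normal(scale=0.2, size=500)   # random coefficients
y = x * beta_t + rng.normal(size=500)
print(iv_beta(y, x, W))                          # near 2.0
```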

4.2. Non-constant variances

It is sometimes forgotten that a regression model is characterized by two types of parameters: those connected with the mean behaviour of the data (β̄) and those connected with the second moment (σ²). Nevertheless, there has been a steady growth of interest either in allowing for the effects of a non-constant error variance or in modelling any changes in it. If σ_t² is not constant, but rather indexed by t, it is well known that the variance of the OLS estimator β̂ = (Σ x'_t x_t)^{-1} Σ x'_t y_t is (Σ x'_t x_t)^{-1} Σ_t x'_t x_t σ_t² (Σ x'_t x_t)^{-1}, and Eicker (1967), Fuller (1975) and White (1980) have all proposed consistently estimating this quantity by replacing σ_t² in the formula by the squared OLS residuals û_t². As the autocorrelation and partial autocorrelation functions of a series y_t can be viewed as the estimated coefficients from a regression of y_t against its various lags, if the variance in such a regression is not constant the standard theory as in Box and Jenkins (1976) would not be applicable. Nicholls and Pagan (1983) have shown that it is possible to use the same adjustment as proposed by Eicker/Fuller/White even when x_t contains lagged values of y_t, and this approach allows test statistics based on the autocorrelation function to be adjusted for non-stationarity in variances.

Rather than react passively to a non-constant variance for e_t, some have suggested an active strategy of modelling it. Amemiya (1973) derived a three-step procedure asymptotically equivalent to the MLE when σ_t² is assumed a function of E(y_t). Harvey (1976) focussed upon an application of the scoring algorithm to give a two-step estimator asymptotically equivalent to the MLE for a general class of heteroscedastic errors. Amemiya (1977) considered consistent estimators rather than efficient ones. As might be expected, all of these proposals have been covered indirectly in Section 2. Any differences stem from the special characteristics of the model of Section 2, namely that lagged values of y_t appear in x_t and the fact that σ_t² under random coefficients is related to the y_{t-j} rather than to a more arbitrary set of variables w_t. Nevertheless, the MLE and consistent estimators are derived exactly as in that section.

A more recent development in this area has been the distinction drawn between the conditional and unconditional variance of e_t. It may be that the unconditional variance of e_t is a constant, but that the variance conditional upon the sigma field ℱ_{t-1} = [y_1, ..., y_{t-1}, x_1, ..., x_t] is not; e.g. the model y_t = β y_{t-1} + e_t with E(e_t² | ℱ_{t-1}) = σ² + δ y²_{t-1} possesses this property (assuming that E(y_t²) is finite). There are obviously many ways in which such a situation could arise. One example is the random coefficient autoregressions of Section 2 or equivalent MA formulations (Robinson, 1977); another the bilinear models of Granger and Andersen (1978). A third is the recent


development by Engle (1982) of what he terms autoregressive conditional heteroscedasticity (ARCH) models, in which σ_t² = E(e_t² | ℱ_{t-1}) is a linear function of g(e_0, ..., e_{t-1}), where g is some known function, e.g. σ_t² = σ² + δ e²_{t-1}. A number of papers have reported such effects in time series data, although it is by no means certain that the ARCH model is not proxying for a different type of conditional heteroscedasticity, e.g. σ² + δ y²_{t-1}. It might be expected that this will be an area which is likely to receive a good deal more attention in the next few years.
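A minimal simulation sketch of the ARCH(1) scheme σ_t² = σ² + δ e²_{t-1} just described (hypothetical parameter values, our own code) illustrates the distinction drawn above: the unconditional variance is the constant σ²/(1 − δ), while the conditional variance moves with e²_{t-1}, showing up as autocorrelation in e_t².

```python
import numpy as np

# Sketch: simulate ARCH(1) errors with sigma_t^2 = sigma2 + delta * e_{t-1}^2.
# delta < 1 keeps the unconditional variance finite at sigma2 / (1 - delta).
rng = np.random.default_rng(6)
sigma2, delta, T = 1.0, 0.5, 20000
e = np.zeros(T)
for t in range(1, T):
    h_t = sigma2 + delta * e[t - 1]**2      # conditional variance
    e[t] = rng.normal(scale=np.sqrt(h_t))

print(e.var(), sigma2 / (1 - delta))        # both roughly 2.0
# Conditional heteroscedasticity shows up as autocorrelation in e_t^2:
s = e**2 - (e**2).mean()
print((s[1:] @ s[:-1]) / (s @ s))           # roughly delta
```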

5. Conclusion

Our aim has been to survey material on varying coefficient regression in the context of a framework of crucial importance to the analysis of time series, viz. the specification, estimation and diagnostic cycle. By adopting this systematic approach, a better rapport between VCR research and traditional time series analysis is established, and areas in which there is a serious research deficiency can be more effectively isolated. As we commented in the introduction, there are many cells in Table 1 that have not been studied at all, and our survey has thrown up some gaps even for those models that have been studied. Two aspects stand out from the review: considerable progress has been made in enabling the standard fixed coefficient assumption to be relaxed but, at the same time, much remains to be done before VCR models become as widespread in their use as their fixed coefficient counterparts.

Appendix

Tensor notation and related results

If A and B are matrices of order m × n and p × q respectively, then
(i) the tensor or Kronecker product A ⊗ B is the mp × nq matrix whose (i, j)th block is a_{ij} B, where a_{ij} is the (i, j)th element of A;
(ii) vec A denotes the mn × 1 vector obtained by stacking the columns of A one on top of the other, in order, from left to right.

The vector with r(r + 1)/2 elements obtained by stacking those elements of the columns of the r × r symmetric matrix D on and below the main diagonal, one on top of the other, in order, from left to right, is denoted by vech D (the vector half of D). If the matrix product ABC is defined, then it can be shown (Henderson and Searle, 1979) that

vec(ABC) = (C' ⊗ A) vec B .   (A.1)

For symmetric matrices it is possible to obtain linear relationships between


vec D and vech D. Indeed, Henderson and Searle show that for any r × r symmetric matrix D there exist {r(r + 1)/2} × r² matrices K_r and H_r such that H_r K'_r = I_{r(r+1)/2} and for which

vech D = H_r vec D   and   vec D = K'_r vech D .

References

Amemiya, T. (1973). Regression when the variance of the dependent variable is proportional to the square of its expectation. J. Amer. Statist. Assoc. 68, 928-934.
Amemiya, T. (1977). A note on a heteroskedastic model. J. Econometrics 6, 365-370.
Andel, J. (1976). Autoregressive series with random parameters. Math. Operationsforsch. Statist. 7, 735-741.
Basawa, I. V., Feigin, P. D. and Heyde, C. C. (1976). Asymptotic properties of maximum likelihood estimators for stochastic processes. Sankhyā 38, 259-270.
Berndt, E. K., Hall, B. H., Hall, R. E. and Hausman, J. A. (1974). Estimation and inference in nonlinear structural models. Ann. Econ. Soc. Meas. 4, 653-665.
Billingsley, P. (1961). The Lindeberg-Lévy theorem for martingales. Proc. Amer. Math. Soc. 12, 788-792.
Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control (revised edition). Holden-Day, San Francisco, CA.
Box, G. E. P. and Pierce, D. A. (1970). Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. J. Amer. Statist. Assoc. 65, 1509-1526.
Breusch, T. S. and Pagan, A. R. (1979). A simple test for heteroscedasticity and random coefficient variation. Econometrica 47, 1287-1294.
Breusch, T. S. and Pagan, A. R. (1980). The Lagrange multiplier test and its applications to model specification in econometrics. Rev. Econom. Stud. 47, 239-253.
Chant, D. (1974). On asymptotic tests of composite hypotheses in non-standard conditions. Biometrika 61, 291-298.
Chow, G. C. (1983). Random and changing coefficient models. In: Z. Griliches and M. D. Intriligator, eds., Handbook of Econometrics, Chap. 21. North-Holland, Amsterdam.
Cooley, T. and Prescott, E. (1973). An adaptive regression model. Internat. Econom. Rev. 14, 364-371.
Cooley, T. and Prescott, E. (1976). Estimation in the presence of sequential parameter variation. Econometrica 44, 167-184.
Crowder, M. J. (1976). Maximum likelihood estimation for dependent observations. J. Roy. Statist. Soc. Ser. B 38, 45-53.
Davies, R. B. (1977). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika 64, 247-254.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39, 1-39.
Domowitz, I. and White, H. (1982). Misspecified models with dependent observations. J. Econometrics 20, 35-58.
Durbin, J. (1970). Testing for serial correlation in least squares regression when some of the regressors are lagged dependent variables. Econometrica 38, 410-421.
Eicker, F. (1967). Limit theorems for regressions with unequal and dependent errors. In: L. Le Cam and J. Neyman, eds., Proc. Fifth Berkeley Symposium, 59-82. University of California Press, Berkeley, CA.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50, 987-1007.
Engle, R. F. and Watson, M. (1979). A time domain approach to dynamic factor and MIMIC models. Discussion paper 79-41, University of California, San Diego.

Varying coefficient regression

447

Engle, R. F. and Watson, M. (1981). A one-factor multivariate time series model for metropolitan wage rates. J. Amer. Statist. Assoc. 76, 774-781. Fuller, W. A. (1975). Regression analysis for sample survey. Sankhyg~ 37, C, 117-132. Garbade, K. (1977). Two methods for examining the stability of regression coefficients. J. Amer. Statist. Assoc. 72, 54-63. Godfrey, L. G. (1978). Testing for multiplicative heteroskedasticity. J. Econometrics 8, 227-236. Gourieroux, C., Holly, A. and Monfort, A. (1982). Likelihood ratio test, Wald test, and KuhnTucker test in linear models with inequality constraints on the regression parameters. Econometrica 50, 63-80. Granger, C. W. J. and Andersen, A. (1978). A n Introduction to Bilinear Time Series Models. Vandenhoeck and Ruprecht, G6ttingen. Hannah, E. J. and Kavalieris, L. (1983). The convergence of autocorrelations and autoregressions. Austral. J. Statist. 25, 287-297. Hannan, E. J., Terrell, R. D. and Tuckwell, N. (1970). The seasonal adjustment of economic time series. Internat. Econom. Rev. 11, 24-52. Harvey, A. C. (1976). Estimating regression models with multiplicative heteroscedasticity. Econometrica 44, 461-466. Harvey, A. C. (1981). Time Series Models. Phillip Allan, Oxford. Hatanaka, M. and Tanaka, K. (1981). On the estimability of the covariance matrix in the multivariate random walk representing the time changing parameters of regression models. Mimeo. Osaka University. Havenner, A. and Swamy, P. A. V. B. (1981). A random coefficient approach to seasonal adjustment of economic time series. J. Econometrics 15, 177-210. Henderson, H. V. and Searle, S. R. (1979). Vec and vech operators for matrices with some uses in Jacobian and multivariate statistics. Canad. J. Statist. 7, 65-81. Hildreth, C. and Houck, J. P. (1968). Some estimators for a linear model with random coefficients. J. Amer. Statist. Assoc. 63, 584-595. Hurwicz, L. (1950). Systems with non-additive disturbances. In: T. C. Koopmans, Ed., Statistical Inference in Dynamic Economic Models, 410-418. Wiley, New York. Imhof, J. P. (1961). Computing the distribution of quadratic forms in normal variables. Biometrika 48, 419--426. Johnson, L. W. (1977). Stochastic parameter regression; an annotated bibliography. Internat. Statist. Rev. 45, 257-272. Johnson, L. W. (1980). Stochastic parameter regression: an additional annotated bibliography, Internat. Statist. Rev. 48, 95-102. Kendall, M. G. (1953). The analysis of economic time series--Part I: Prices. J. Roy. Statist. Soc. Ser. A 106, 11-25. Kelejian, H. H. (1974). Random parameters in a simultaneous equation framework: Identification and estimation. Econometrica 42, 517-528. King, M. L. and Hillier, G. (1980). A small sample power property of the Lagrange multiplier test. Monash University discussion paper. La Motte, L. R. and McWhorter, A. (1978). An exact test for the presence of random walk coefficients in a linear regression model. J. Amer. Statist. Assoc. 73, 816--820. Ljung, G. M. and Box, G. E. P. (1978). On a measure of lack of fit in time series models. Biometrika 65, 297-303. McDonald, J. (1981). Consistent estimation of models with composite moving average disturbance terms: A survey. Flinders University Mimeo. Moran, P. A. P. (1971). Maximum likelihood estimation in non-standard conditions. Proc. Camb. Phil. Soc. 70, 441-445. Neyman, J. (1959). Optimal asymptotic tests for composite statistical hypotheses. In: U. Grenander, ed., Probability and Statistics, 213-234. Wiley, New York. Nicholls, D. F. and Pagan, A. R. (1983). 
Heteroscedasticity in models with lagged dependent variables. Econometrica 51, 1233-1242. Nicholls, D. F. and Quinn, B. G. (1981). Multiple autoregressive models with random coefficients. .L Multivariate Anal. 11, 185-198.

Nicholls, D. F. and Quinn, B. G. (1982). Random Coefficient Autoregressive Models: An Introduction. Springer-Verlag, New York.
Pagan, A. R. (1973a). Efficient estimation of models with composite disturbance terms. J. Econometrics 1, 329-340.
Pagan, A. R. (1973b). Estimation of an evolving seasonal pattern as an application of stochastically varying parameter regression. Econometric Research Program Memo No. 153, Princeton University.
Pagan, A. R. (1980). Some identification and estimation results for regression models with stochastically varying coefficients. J. Econometrics 13, 341-363.
Pagan, A. R. and Hall, A. D. (1983). Diagnostic tests as residual analysis. Econometric Reviews 2, 159-218.
Pagano, M. (1974). Estimation of models of autoregressive signal plus white noise. Ann. Statist. 2, 99-108.
Priestley, M. B. (1981). Spectral Analysis and Time Series, Volume 1: Univariate Series. Academic Press, New York.
Raj, B. and Ullah, A. (1981). Econometrics, A Varying Coefficients Approach. Croom Helm, London.
Rao, C. R. (1948). Large sample tests of statistical hypotheses concerning several parameters with application to problems of estimation. Proc. Camb. Phil. Soc. 44, 50-57.
Reinsel, G. (1979). A note on the estimation of the adaptive regression model. Internat. Econom. Rev. 20, 193-202.
Revankar, N. S. (1980). Analysis of regressions containing serially correlated and serially uncorrelated error components. Internat. Econom. Rev. 21, 185-200.
Robinson, P. M. (1977). The estimation of a non-linear moving average model. Stochastic Process. Appl. 5, 81-90.
Rosenberg, B. (1973). The analysis of a cross-section of time series by stochastically convergent parameter regression. Ann. Econ. Soc. Meas. 2, 399-428.
Rothenberg, T. J. (1971). Identification in parametric models. Econometrica 39, 577-592.
Rothenberg, T. J. (1983). Approximating the distributions of econometric estimators and test statistics. In: Z. Griliches and M. D. Intriligator, eds., Handbook of Econometrics. North-Holland, Amsterdam.
Rothenberg, T. J. and Leenders, C. T. (1964). Efficient estimation of simultaneous equation systems. Econometrica 32, 57-76.
Rubin, H. (1950). Note on random coefficients. In: T. C. Koopmans, ed., Statistical Inference in Dynamic Economic Models. Wiley, New York.
Sargan, J. D. (1958). The estimation of economic relationships using instrumental variables. Econometrica 26, 393-415.
Schweppe, F. C. (1965). Evaluation of likelihood functions for Gaussian signals. IEEE Trans. Inform. Theory IT-11, 61-70.
Shiba, T. and Tsurumi, H. (1982). Consistent estimation of the random coefficient model in a simultaneous framework. Discussion paper 81-20, Rutgers University.
Silvey, S. D. (1959). The Lagrangian multiplier test. Ann. Math. Statist. 30, 389-407.
Solo, V. (1982). The output statistics Kalman filter and varying parameter regression. Mimeo, Harvard University.
Son, L. H. and Anderson, B. D. O. (1971). Design of Kalman filters using signal model output statistics. Proc. IEEE 120, 312-318.
Swamy, P. A. V. B. (1971). Statistical Inference in Random Coefficient Regression Models. Springer-Verlag, New York.
Swamy, P. A. V. B. and Tinsley, P. A. (1980). Linear prediction and estimation methods for regression models with stationary stochastic coefficients. J. Econometrics 12, 103-142.
Swamy, P. A. V. B., Tinsley, P. A. and Moore, G. R. (1982). An autopsy of a conventional macroeconomic relation: the case of money demand. Paper presented to the Society for Economic Dynamics and Control Conference, Washington, DC.

Tanaka, K. (1981). On the Lagrange multiplier test for the constancy of regression coefficients and the asymptotic expansion. Mimeo, Kanazawa University.
Tsurumi, H. and Shiba, T. (1982). A Bayesian analysis of a random coefficient model in a simple Keynesian system. J. Econometrics 18, 239-250.
Watson, M. (1980). Testing for varying coefficients when a parameter is unidentified. Discussion paper No. 80-8, University of California, San Diego.
Watson, M. and Engle, R. F. (1982). The EM algorithm for dynamic factor and MIMIC models. Harvard Institute of Economic Research discussion paper No. 879.
Weiss, A. A. (1982). The estimation of the dynamic regression model with stochastic coefficients. Discussion paper No. 82-11, University of California, San Diego.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48, 817-838.
Whittle, P. (1954). Estimation and information in stationary time series. Ark. Mat. 2, 423-434.
Wilson, G. T. (1969). Factorization of the covariance generating function of a pure moving average. SIAM J. Numer. Anal. 6, 1-7.
Wilson, G. T. (1973). The estimation of parameters in multivariate time series models. J. Roy. Statist. Soc. Ser. B 35, 76-85.

E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5
© Elsevier Science Publishers B.V. (1985) 451-480

17

Small Samples and Large Equation Systems*

Henri Theil and Denzil G. Fiebig

1. Introduction

In econometrics (and in several other areas of applied statistics) it happens frequently that we face a system of equations rather than a single equation. For example, let a consumer select the quantities of N goods which maximize his utility function subject to his budget constraint. Then under appropriate conditions a system of demand equations emerges, each describing the consumption of one good in terms of income and all N prices. The number of coefficients in such a system is on the order of N², which is a large number (unless N is small) and which raises problems when we want to test hypotheses about these coefficients (see Section 2). Another example is the estimation of the coefficients of one equation which is part of a system of simultaneous equations. Here a problem arises when the system contains a large number of exogenous variables (see Section 3). One way of solving such problems is by (1) recognizing that the sample is drawn from a continuous distribution and (2) using this sample to fit a continuous approximation to the parent distribution. When this is done under the maximum entropy (ME) criterion subject to mass- and mean-preserving constraints, a continuous ME distribution emerges which is superior to the discrete sample distribution in a number of respects, particularly for small samples (see Sections 4 and 5). Subsequent sections show how the ME distribution can be used for the problems mentioned above.

2. How asymptotic tests can be misleading

Let t = 1, …, n refer to successive observations and i, j = 1, …, N to consumer goods. We consider a linear demand system,

$$y_{it} = \theta_i x_{0t} + \sum_{j=1}^{N} \pi_{ij} x_{jt} + \varepsilon_{it}, \qquad (2.1)$$

*Research supported in part by NSF Grant SES-8023555. The authors are indebted to Sartaj A. Kidwai of the University of Florida for his research assistance.


where y_it is consumption of good i, x_0t is total consumption, x_jt is the price of good j, ε_it is a random error, and θ_i and π_ij are parameters, the π_ij's being known as Slutsky coefficients. Here we shall be interested in two hypotheses, viz., demand homogeneity,

$$\sum_{j=1}^{N} \pi_{ij} = 0, \qquad i = 1, \ldots, N, \qquad (2.2)$$

and Slutsky symmetry,

$$\pi_{ij} = \pi_{ji}, \qquad i, j = 1, \ldots, N. \qquad (2.3)$$

Details on these properties are provided in the Appendix. Summation of (2.1) over i = 1, …, N yields x_0t = x_0t + Σ_i ε_it (because Σ_i θ_i = 1, Σ_i π_ij = 0), which implies that the ε_it's are linearly dependent. This problem can be solved by deleting one of the equations, say the Nth. We assume that (ε_1t, …, ε_{N-1,t}) for t = 1, …, n are independently and normally distributed with zero means and nonsingular covariance matrix Σ. Since (2.2) and (2.3) are linear in the π_ij's, the standard procedure for testing these hypotheses is an F test if Σ is known. However, Σ is typically unknown, in which case it is usual to replace Σ by S, the matrix of mean squares and products of LS residuals. Many such tests have yielded unexpected negative results; see, e.g., Barten (1969), Byron (1970), Christensen et al. (1975), Deaton (1974) and Lluch (1971).

Laitinen (1978) conducted a simulation experiment in order to explore this problem. He constructed a model of the form (2.1) satisfying both (2.2) and (2.3), with the ε_it's obtained as pseudo-normal variates with zero means and a known covariance matrix Σ. He used n = 31 observations and considered systems of N = 5, 8, 11 and 14 equations. Using the true Σ, he applied the F test of the homogeneity hypothesis (2.2). The upper left part of Table 1 shows the numbers of rejections out of 100 trials at the 5 and 1 percent significance levels; these numbers are satisfactorily close to 5 and 1, respectively. Laitinen also used the same samples to compute S and the associated test statistic; this amounts to a χ² test which is asymptotically (n → ∞) valid. The numbers of rejections shown in the upper middle part of Table 1 are much larger, particularly for large N. The results for the corresponding exact χ² test based on the true Σ (upper right part of Table 1) are far more satisfactory, thus strongly suggesting that the use of S rather than Σ is mainly responsible for the numerous rejections of homogeneity in the literature. Meisner (1979) conducted a similar simulation experiment for testing the symmetry hypothesis (2.3). His results, shown in the lower part of Table 1, indicate an analogous increasing bias toward rejecting the null hypothesis as N increases when S rather than Σ is used. The tests based on S fall under what are frequently referred to as Wald tests. See Bera et al. (1981) for similar results obtained with the asymptotically equivalent Lagrange multiplier and likelihood ratio tests.


Table 1
Rejections (out of 100 samples) of homogeneity and symmetry

              Exact F tests    Asymptotic χ² tests   Exact χ² tests
              based on Σ       based on S            based on Σ
              5%      1%       5%      1%            5%      1%

Rejections of homogeneity
5 goods        7       1       14       6             8       1
8 goods        8       2       30      16             5       2
11 goods       5       2       53      35             5       1
14 goods       4       2       87      81             6       1

Rejections of symmetry
5 goods        6       0        9       3             5       1
8 goods        5       2       26       8             5       1
11 goods       6       2       50      37             4       3
14 goods       4       2       96      91             6       0

Laitinen (1978) proved that the exact distribution of the homogeneity test statistic based on S is Hotelling's T², which in this case is an F ratio whose denominator has n - 2N + 1 degrees of freedom. This illustrates the problem of homogeneity testing when N is not far below one-half the sample size n. The exact distribution of the symmetry test statistic based on S is a much more difficult issue because (2.3) is a cross-equation constraint. This is also the reason why symmetry-constrained estimation of (2.1) presents a problem when Σ is not known (see Section 7).

3. Simultaneous equation estimation from undersized samples

Our objective is to estimate the parameter γ in

$$y_{1t} = \gamma y_{2t} + \varepsilon_t, \qquad t = 1, \ldots, n, \qquad (3.1)$$

which is one equation of a system that consists of several linear equations. The y's are observations on two endogenous variables of the system. The other equations contain certain endogenous variables in addition to these two, and also p exogenous variables; the observations on the latter variables are written x_1t, …, x_pt. The sample moment matrix of these variables and those in (3.1) is thus of order (p + 2) × (p + 2),

$$\begin{bmatrix} m_{11} & m_{12} & m_{1p}' \\ m_{12} & m_{22} & m_{2p}' \\ m_{1p} & m_{2p} & M_p \end{bmatrix} \qquad (3.2)$$

with rows and columns corresponding to y_1t, y_2t and x_1t, …, x_pt, where m_1p, m_2p are p-element vectors and M_p is square (p × p).


The LS estimator of γ in (3.1) is then m_12/m_22; this estimator is biased and inconsistent because y_2t and ε_t are correlated. However, we can obtain consistent estimators from the property that each exogenous variable is statistically orthogonal to the errors in (3.1) in the sense that, for h = 1, …, p, (1/n)Σ_t x_ht ε_t has zero probability limit as n → ∞. The conditional sample moment matrix of the two endogenous variables given the p exogenous variables is

$$\begin{bmatrix} m_{11\cdot p} & m_{12\cdot p} \\ m_{12\cdot p} & m_{22\cdot p} \end{bmatrix} = \begin{bmatrix} m_{11} & m_{12} \\ m_{12} & m_{22} \end{bmatrix} - \begin{bmatrix} m_{1p}' \\ m_{2p}' \end{bmatrix} M_p^{-1} \begin{bmatrix} m_{1p} & m_{2p} \end{bmatrix}. \qquad (3.3)$$

The k-class estimator of γ is defined as [see, e.g., Theil (1971, Chap. 10)]

$$\hat{\gamma}(k) = \frac{m_{12} - k\,m_{12\cdot p}}{m_{22} - k\,m_{22\cdot p}}, \qquad (3.4)$$

which includes the LS estimator m_12/m_22 as a special case (k = 0). It can be shown that, under standard conditions, γ̂(k) is consistent if k - 1 has zero probability limit and that n^{1/2}[γ̂(k) - γ] converges to a normal distribution with zero mean if n^{1/2}(k - 1) has zero probability limit. These conditions are obviously satisfied by k = 1, which is the case of two-stage least squares (2SLS). In Section 6 we shall meet a k-class estimator with a random k.

Equation (3.1) is quite special because it contains only two endogenous and no exogenous variables. The extension to more variables is straightforward [see, e.g., Theil (1971, Chaps. 9 and 10)], but it is not our main concern here. Our problem is that the matrix (3.3) does not exist when there are more exogenous variables than observations (p > n) because M_p is then singular. In fact, all standard methods of consistently estimating γ fail for p > n because they all require the inverse of M_p. Almost all present-day economy-wide econometric models have more exogenous variables than observations. The problem is even more pervasive due to the occurrence of lagged variables in dynamic equation systems. It is standard practice to treat each lagged variable in the same way as the exogenous variables are treated, which means that the 'dynamic' version of M_p is even more likely to be singular.

The irony of this problem is that we can reasonably argue that it should not be a problem. We can estimate γ by 2SLS from n = 20 observations when there are p = 10 exogenous variables in the system, but not when p = 30 (because M_p is then singular for n = 20). In the former case there are 10 variables known to be statistically orthogonal to the error vector (ε_1, …, ε_n) of (3.1), in the latter there are 30 such variables; a priori one would expect that the estimation of (3.1) is improved (at least, not hurt) when there are more orthogonality conditions on its error vector. In Section 6 we shall consider this matter further.
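For concreteness, here is a minimal NumPy sketch of (3.3)-(3.4) (ours, not the authors'; the function name k_class is hypothetical). It uses ordinary sample moments about zero and therefore fails for p > n, which is exactly the difficulty just described:

```python
import numpy as np

def k_class(y1, y2, X, k):
    """k-class estimator (3.4) of gamma in y1 = gamma*y2 + eps.

    y1, y2 are (n,) endogenous series and X is (n, p) exogenous data;
    moments are second-order moments about zero, as in the text.
    """
    n = len(y1)
    m12, m22 = y1 @ y2 / n, y2 @ y2 / n
    m1p, m2p = X.T @ y1 / n, X.T @ y2 / n
    Mp = X.T @ X / n                      # singular when p > n
    Mp_inv = np.linalg.inv(Mp)
    m12_p = m12 - m1p @ Mp_inv @ m2p      # conditional moments, eq. (3.3)
    m22_p = m22 - m2p @ Mp_inv @ m2p
    return (m12 - k * m12_p) / (m22 - k * m22_p)

# k = 0 reproduces LS (m12/m22); k = 1 is 2SLS.
```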


4. The ME distribution of a univariate sample

The previous discussion illustrates difficulties with sample moment matrices: S in Section 2, M_p in Section 3. Here and in Section 5 we shall seek a solution under the condition that the relevant random variables are continuously distributed. Our strategy will be to use the sample to fit a continuous distribution as an estimate of the parent distribution, and to compute moments (and other characteristics) from this fitted distribution. Such moments are population-moment estimators which are alternatives to the ordinary sample moments.

4.1. The ME principle and the univariate ME distribution

The principle of maximum entropy (ME) states that given some information on the parent distribution of a random variable, the fitted distribution should be most uninformative subject to the constraints imposed by the prior information. To do otherwise would imply the use of information that is not available. The criterion of uninformativeness used in information theory is the entropy, which is minus the expectation of the logarithm of the density function. Specifically, the ME criterion maximizes

$$H = -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx \qquad (4.1)$$

by varying the density function f(·) subject to certain constraints. If all that is known is that the variable is continuous with a finite range (a, b), the ME distribution is the uniform over this interval. If we know the mean of a positive continuous random variable (but nothing else), the ME distribution is the exponential with this mean.¹ These results were used by Theil and Laitinen (1980) to construct an estimated distribution function from a sample (x_1, …, x_n).² They used order statistics, written here with superscripts: x¹ < x² < ⋯ < xⁿ, and defined intermediate points between successive order statistics,

$$\xi_i = \xi(x^i, x^{i+1}), \qquad i = 1, \ldots, n-1, \qquad (4.2)$$

where ξ(·, ·) is a symmetric differentiable function of its two arguments whose value is between these arguments. These ξ_i's define two open-ended intervals, I₁ = (-∞, ξ₁) and I_n = (ξ_{n-1}, ∞), and n - 2 bounded intervals, I₂ = (ξ₁, ξ₂), …, I_{n-1} = (ξ_{n-2}, ξ_{n-1}). Each I_i contains one order statistic x^i and, hence,

¹For other ME properties of the exponential family, see Kagan et al. (1973).
²See Theil and Fiebig (1984) for a survey containing many other results as well as proofs of the statements which follow.


a fraction 1/n of the mass of the sample distribution. We impose on the density function f(·) which will be fitted that it preserves these fractions,

$$\int_{I_i} f(x)\, dx = \frac{1}{n}, \qquad i = 1, \ldots, n, \qquad (4.3)$$

which is a mass-preserving constraint. We also impose an analogous mean-preserving constraint, referring both to the overall mean (the sample mean x̄) and to the means in each interval I_i.³ Thus, our constraints refer to moments of order zero and one. Subject to these constraints we seek the density f(·) which maximizes the entropy (4.1). The solution is unique for n > 2; it implies that the intermediate points (4.2) become midpoints between successive order statistics, ξ_i = ½(x^i + x^{i+1}), and that f(·) is constant in each bounded I_i and exponential in I₁ and I_n. Thus, the associated cdf is continuous and monotone increasing, and it is piecewise linear around each x^i except around x¹ and xⁿ where it is exponential. We shall refer to this fitted distribution as the ME distribution of a univariate sample or, more briefly, as the univariate ME distribution.

It will be convenient to extend ξ_i = ½(x^i + x^{i+1}) to i = 0, 1, …, n, where x⁰ = x¹ and x^{n+1} = xⁿ so that ξ₀ = x¹, ξ_n = xⁿ. These ξ's are referred to as the primary midpoints. The interval means of the ME distribution, written x̄¹, …, x̄ⁿ, are given by

$$\bar{x}^i = \tfrac{1}{2}(\xi_{i-1} + \xi_i), \qquad i = 1, \ldots, n, \qquad (4.4)$$

which will be called the secondary midpoints.
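The construction is mechanical once the sample is sorted. A small sketch (ours; the name me_midpoints is hypothetical) in Python:

```python
import numpy as np

def me_midpoints(sample):
    """Primary midpoints xi_0, ..., xi_n and secondary midpoints, eq. (4.4)."""
    x = np.sort(np.asarray(sample, dtype=float))   # order statistics x^1 <= ... <= x^n
    xi = np.concatenate(([x[0]],                   # xi_0 = x^1
                         0.5 * (x[:-1] + x[1:]),   # xi_i = (x^i + x^{i+1})/2
                         [x[-1]]))                 # xi_n = x^n
    xbar = 0.5 * (xi[:-1] + xi[1:])                # secondary midpoints, eq. (4.4)
    return xi, xbar
```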

4.2. Applications

Given that the density picture of the ME distribution is so simple (piecewise constant or exponential), it is straightforward to evaluate its variance and higher moments. For example, the variance of the ME distribution (the ME variance) equals

$$\frac{1}{n}\sum_{k=1}^{n}(x^k - \bar{x})^2 \;-\; \frac{1}{4n}\sum_{i=1}^{n-1}(x^{i+1} - x^i)^2 \;-\; \frac{1}{24n}\sum_{i=2}^{n-1}(x^{i+1} - x^{i-1})^2. \qquad (4.5)$$

Since the first term is the sample variance and since the two others are negative, the ME variance is thus subject to shrinkage relative to the sample variance. Kidwai and Theil (1981) showed that, under normality, this shrinkage is a random variable whose mean and standard deviation are both approximately proportional to n^{-1/3}.

³Define the order statistics associated with each interval I_i as those which determine its end points: x¹ and x² for I₁ = (-∞, ξ₁) [see (4.2)], xⁿ and x^{n-1} for I_n, and x^{i-1}, x^i and x^{i+1} for I_i with 1 < i < n. The mean-preserving constraint on I_i requires that f(x) for x ∈ I_i be constructed so that the mean is a homogeneous linear function of the order statistics associated with I_i.
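The following minimal sketch (ours, not the authors') transcribes (4.5) as reconstructed above directly into code:

```python
import numpy as np

def me_variance(sample):
    """ME variance, eq. (4.5): the sample variance minus two shrinkage terms."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    s2 = np.mean((x - x.mean()) ** 2)
    t1 = np.sum((x[1:] - x[:-1]) ** 2) / (4 * n)    # gaps between successive order statistics
    t2 = np.sum((x[2:] - x[:-2]) ** 2) / (24 * n)   # gaps between order statistics two apart
    return s2 - t1 - t2
```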


Simulation experiments with pseudo-normal variates indicate that the ME variance and third- and fourth-order moments about the mean are all more accurate (in the mean-squared-error sense) than the corresponding estimators derived from the discrete sample distribution. This difference reflects the efficiency gain obtained by exploiting the knowledge that the parent distribution is continuous. However, the difference converges to zero as n → ∞, implying that the efficiency gain is a small-sample gain. Fiebig (1982, Chap. 4) extended the simulation experiment to the estimation of the variances of fat-tailed mixtures of normal distributions. The fatter the tails for given n, the larger is the efficiency gain of the ME variance over the sample variance.

Since the ME distribution is formulated in terms of order statistics, it is natural to consider the quantiles of the ME distribution as estimators of the parent quantiles. Let n be odd and write m = ½(n + 1). Then the sample median is x^m, but the ME median is x̄^m, i.e. the median of the secondary midpoints. For random samples from a normal population, the ME median has a smaller expected squared sampling error than the sample median, but the relative difference tends to zero as n → ∞. Let n + 1 be a multiple of 4 and write q = ¼(n + 1). Then the sample quartiles are x^q and x^{3q}, whereas the ME quartiles are Q_L = ⅛x^{q-1} + ½x^q + ⅜x^{q+1} and Q_U = ⅜x^{3q-1} + ½x^{3q} + ⅛x^{3q+1} if q > 1.⁴ For random samples from a normal population, the ME quartiles have smaller expected squared errors. Again, the relative difference tends to zero as n → ∞, but this difference is still in excess of 10 percent for the interquartile distances Q_U - Q_L and x^{3q} - x^q at n = 39. Also, the ME median and quartiles dominate their sample distribution counterparts (under squared-error loss) in the presence of an outlier with a different mean or a different variance; see Theil and Fiebig (1984) for details.
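A sketch of these ME quantiles (ours; all names hypothetical), valid under the stated conditions on n:

```python
import numpy as np

def me_median_quartiles(sample):
    """ME median and quartiles; requires n odd with n + 1 a multiple of 4 (e.g. n = 19)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    q = (n + 1) // 4                                    # assumes q > 1
    m = (n + 1) // 2
    xi = np.concatenate(([x[0]], 0.5 * (x[:-1] + x[1:]), [x[-1]]))
    xbar = 0.5 * (xi[:-1] + xi[1:])                     # secondary midpoints
    median = xbar[m - 1]                                # the ME median
    QL = x[q - 2] / 8 + x[q - 1] / 2 + 3 * x[q] / 8
    QU = 3 * x[3 * q - 2] / 8 + x[3 * q - 1] / 2 + x[3 * q] / 8
    return median, QL, QU
```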

4.3. Extensions

The ME distribution is easily extended to bounded random variables. If the variable is positive, the only modification is that I₁ = (-∞, ξ₁) becomes (0, ξ₁) and that the distribution over this interval becomes truncated exponential.

A different extension is in order when the parent distribution is known to be symmetric. The difference x^i - x̄ then has a sampling distribution identical to that of x̄ - x^{n+1-i} for each i. We define x̃^i - x̄ as the average of these differences, i.e.

$$\tilde{x}^i = \bar{x} + \tfrac{1}{2}(x^i - x^{n+1-i}), \qquad i = 1, \ldots, n. \qquad (4.6)$$

Clearly, x̃¹, …, x̃ⁿ are 'symmetrized' order statistics located symmetrically around the sample mean x̄. (Since the ME procedure is mean-preserving, x̄ is a natural point of symmetry.)

⁴Since the ME distribution has a continuous cdf, its median and quartiles are uniquely defined for each n. This is in contrast to the sample quantiles whose definitions for certain values of n can be made unique only by interpolation between order statistics.


The symmetric ME (SYME) distribution is then constructed from the x̃^i's in the same way that the ME distribution is obtained from the x^i's. An alternative justification of the definition (4.6) is that it satisfies the LS criterion of minimizing Σ_i (x̃^i - x^i)² for variations in the x̃^i's subject to the symmetry constraint x̃^i + x̃^{n+1-i} = 2x̄.⁵ SYME moments and quantiles can be used as estimators of the corresponding population values if the population is symmetric. Doing so amounts to exploiting the knowledge of symmetry in addition to continuity. For random samples from a normal distribution, the SYME quartiles are asymptotically more efficient than the ME and sample quartiles: as n → ∞, the sampling variance of the former is about 13 percent below that of the latter. This shows that there are situations in which the exploitation of symmetry yields a large-sample gain. (Recall that the ME efficiency gain, based on the exploitation of continuity, is a small-sample gain only.) Under normality, the SYME variance provides no reduction in mean squared error beyond that of ME (mainly because the SYME variance is subject to additional shrinkage), but Fiebig (1982) did obtain such reductions for fat-tailed symmetric mixtures of normal distributions.
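The symmetrization (4.6) is a one-liner; a sketch (ours, with a hypothetical name):

```python
import numpy as np

def symmetrize(sample):
    """Symmetrized order statistics, eq. (4.6): xbar + (x^i - x^{n+1-i})/2."""
    x = np.sort(np.asarray(sample, dtype=float))
    return x.mean() + 0.5 * (x - x[::-1])   # x[::-1][i] is x^{n+1-i}
```

The SYME distribution is then obtained by applying the ME construction of Section 4.1 to this symmetrized array instead of the raw sample.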

5. The ME distribution of a multivariate sample

5.1. The bivariate and multivariate ME distributions

Let (x_k, y_k) for k = 1, …, n be a sample from a continuous bivariate population. Our objective is to use this sample in the construction of the joint density function which maximizes the bivariate entropy

$$H = -\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y) \log f(x, y)\, dx\, dy, \qquad (5.1)$$

subject to mass- and mean-preserving constraints. As in the univariate case, we start with order statistics and the intermediate points (4.2), but we do this now for both variables, yielding n intervals I₁, …, I_n for x and n intervals J₁, …, J_n for y. In the plane of both variables, we thus have n² rectangular cells, but since there are only n observations, n cells contain one observation each and n² - n cells contain no observations. The mass-preserving constraint states that the former cells are assigned mass 1/n and the latter zero mass. Maximizing (5.1) requires stochastic independence

⁵A different procedure for estimating a symmetric distribution, proposed by Schuster (1973, 1975), consists of 'doubling the sample'; i.e. associated with each sample element x_k is a value 2x̄ - x_k at equal distance from x̄ but on the opposite side, which yields an augmented sample of size 2n (symmetric around x̄) when these associated values are merged with a sample of size n. In a bivariate context, the value associated with (x_k, y_k) is (2x̄ - x_k, 2ȳ - y_k), yielding spherical symmetry. However, the simulation experiments by Theil, Kidwai, Yalnizoğlu and Yellé (1982) based on pseudo-normal variates indicate that this alternate form of symmetrizing is not very promising.


within each cell with mass 1/n. Each such cell falls under one of three groups: those which are bounded on all four sides, those which are open-ended on one side, and those which are open-ended on two sides. For the first group, the ME distribution within the cell is the bivariate uniform distribution; for the second, it is the product of the exponential (for the open-ended variable) and the uniform (for the other variable); for the third, it is the product of two exponentials.

The extension to the p-variate ME distribution is straightforward. There are then n^p cells, n of which contain one observation each and are assigned mass 1/n, while the n^p - n others are assigned zero mass. The ME distribution within each cell with mass 1/n is the product of p univariate distributions, each being either uniform or exponential. The cdf of this distribution is a continuous and nondecreasing function of its p arguments, and it is piecewise linear except for exponential tails.

5.2. The ME covariance matrix

The covariance of the bivariate ME distribution equals the covariance of the secondary midpoints,

$$\frac{1}{n}\sum_{k=1}^{n}(\bar{x}_k - \bar{x})(\bar{y}_k - \bar{y}), \qquad (5.2)$$

where (x̄_k, ȳ_k) for k = 1, …, n are the secondary midpoint pairs rearranged in the order of the original sample elements (x_k, y_k). This rearrangement is indicated by the use of subscripts rather than superscripts [cf. (4.4)]. The ME variance was given in (4.5), but this variance can also be written in the form

$$\frac{1}{n}\sum_{k=1}^{n}(\bar{x}_k - \bar{x})^2 \;+\; \frac{1}{12n}\sum_{i=2}^{n-1}(\xi_i - \xi_{i-1})^2 \;+\; \frac{(\xi_1 - \xi_0)^2 + (\xi_n - \xi_{n-1})^2}{4n}, \qquad (5.3)$$

where the first term is the variance of the secondary midpoints.⁶ The two other terms are a weighted sum of squared differences between successive primary midpoints which is always positive. On combining (5.2) and (5.3) we find that the 2 × 2 ME covariance matrix takes the form C + D, where C is the covariance matrix of the secondary midpoints and D is a diagonal matrix with positive diagonal elements. This C + D formulation applies to the covariance matrix of any p-variate ME distribution. The diagonal matrix D serves as the ridge of the ME covariance matrix;⁷ this ridge ensures that the ME covariance matrix is always positive definite even when p ≥ n.

⁶Expression (5.3) is nothing but the variance decomposition of the univariate ME distribution between and within groups, the 'groups' being the intervals I₁, …, I_n.
⁷This ridge formulation has a superficial similarity to ridge regression. The major difference is that the ridge of the ME covariance matrix is not subject to arbitrary choice but is uniquely determined by the ME criterion subject to mass- and mean-preserving constraints.
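A sketch (ours; the name me_covariance is hypothetical) of the C + D construction for an (n, p) data array, following (5.2)-(5.3) marginal by marginal:

```python
import numpy as np

def me_covariance(data):
    """ME covariance matrix C + D: C from the secondary midpoints, D the diagonal ridge."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    mids = np.empty_like(data)
    ridge = np.empty(p)
    for j in range(p):
        order = np.argsort(data[:, j])
        x = data[order, j]
        xi = np.concatenate(([x[0]], 0.5 * (x[:-1] + x[1:]), [x[-1]]))
        mids[order, j] = 0.5 * (xi[:-1] + xi[1:])   # secondary midpoints, sample order
        d = np.diff(xi)                             # gaps between primary midpoints
        ridge[j] = (np.sum(d[1:-1] ** 2) / 12 + (d[0] ** 2 + d[-1] ** 2) / 4) / n
    C = np.cov(mids, rowvar=False, bias=True)       # divisor n, matching (5.2)
    return C + np.diag(ridge)
```

The returned matrix is positive definite even when p ≥ n, because the ridge is strictly positive whenever the marginal samples are not degenerate.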


The ME correlation ρ̂ is obtained by dividing the ME covariance by the square root of the product of the two corresponding ME variances. A simulation experiment based on 10,000 pseudo-binormal variates with correlation ρ indicates that ρ̂ has a smaller expected squared error than the sample correlation r for |ρ| ≤ 0.6. The less satisfactory performance of ρ̂ for large |ρ| results from the ridge of the ME covariance matrix which prevents |ρ̂| from being close to 1. However, the picture is different when we evaluate the correlation estimators in terms of the squared errors of their Fisher transforms; then ρ̂ is superior to r for |ρ| ≤ 0.95.⁸

Fiebig (1982) generated pseudo-normal vectors consisting of p equicorrelated variates with zero mean and unit variance. He computed their ME and sample covariance matrices and applied different loss functions to both. The ME estimator has smaller expected loss than the sample estimator when p is not small and ρ not close to 1, whereas the opposite holds for ρ = 0.99 and small p. The latter result is again due to the ridge of the ME covariance matrix. Fiebig also amended Haff's (1980) empirical Bayes estimator of the covariance matrix by substituting the ME covariance matrix for the sample covariance matrix in Haff's formula. Simulations indicate that this is an improvement except when the population covariance matrix is close to singular or when the number of variables is small.

5.3. Ties and missing values

Ties have zero probability when the sample is drawn from a continuous distribution, but they can occur when the data are rounded. Let the ath and bth observations on x after rounding share the tth and (t + 1)st positions in ascending order:

$$x_a = x_b = x^t = x^{t+1}. \qquad (5.4)$$

Here we consider the bivariate ME distribution of x and y under the assumption that the y_k's are not tied and that x_a < x_b and x_a > x_b both have probability ½ before rounding. The appropriate procedure is to assign mass 1/2n to each of the four cells associated with the tie. The ME covariance formula (5.2) remains applicable if x̄_a and x̄_b are defined as

$$\bar{x}_a = \bar{x}_b = \tfrac{1}{2}(\bar{x}^t + \bar{x}^{t+1}), \qquad (5.5)$$

which means that the tie x_a = x_b is preserved in the form x̄_a = x̄_b.

The univariate ME distribution is not affected by the tie (5.4) so that we can

⁸Since the ridge of the ME covariance matrix tends to push ρ̂ toward zero, this difference mainly results from the downward bias of r and the upward bias of the Fisher transform of r (for ρ > 0). In Theil, Kidwai, Yalnizoğlu and Yellé (1982) the simulation experiment is extended to the SYME correlation and also to the correlation of the spherically symmetric version mentioned in footnote 5. Only the last correlation estimator has some merits for particular values of ρ (around 0.95) under squared-error loss of the Fisher transform.


use (4.5) for the ME variance.⁹ However, it is of interest to also consider the effect of the tie on the variance formula (5.3), which contains x̄_k for k = a and k = b. It can be shown that, under the definition (5.5), a term must be added to (5.3) of the form (x^{t+2} - x^{t-1})²/32n, which amounts to an extra ridge (the 'tie ridge') of the ME covariance matrix in the presence of a tie. See Theil and Fiebig (1984) for further details.

Similar results hold for the multivariate ME distribution with missing values as analyzed by Conway and Theil (1980). Consider n observations on two variables; let n₁ values be known for one variable (n - n₁ are missing at random) and n₂ values for the other (n - n₂ are missing at random). The number of cells is then reduced from n² to n₁n₂. The result for the ME covariance is that (5.2) is still applicable provided that x̄_k is interpreted as the sample mean x̄ when x_k is missing (similarly for ȳ_k). This does not mean that we act as if the missing x_k takes a particular value. No such value is assumed; the only thing needed for the ME covariance is a specification of x̄_k for missing x_k, and this specification is x̄_k = x̄, which follows directly from the ME principle subject to mass- and mean-preserving constraints under the assumption that the values which are missing are missing at random. When we apply x̄_k = x̄ for missing x_k to the variance formula (5.3), we must add an extra ridge (the missing-value ridge). This result is similar to that of the tie ridge and it is not surprising. Both ties and missing values make the sample less informative than it would be if there were no ties or missing values. Since the ME distribution is obtained by maximizing the entropy subject to constraints implied by the sample, we should expect that both missing values and ties yield an ME distribution closer to the independence case, and that is indeed what is shown by its covariance matrix.

6. Experiments in simultaneous equation estimation

Here we return to (3.1) and we consider the question of whether the ME approach can be useful when the sample is undersized.

6.1. The LIML estimator

Suppose that ε_t in (3.1) and the error terms in the other equations of the system have a multinormal distribution. It is then possible to apply the maximum likelihood method, which yields a k-class estimator known as LIML.¹⁰ The LIML value of k is k = μ, where μ is the smallest root of a

⁹For t = 1 and t = n - 1, (5.4) is an extremal tie which implies that the exponential distribution over I₁ or I_n collapses, all mass being concentrated at the tied point. This also holds for a multiple tie, x_a = x_b = x_c = x^t = x^{t+1} = x^{t+2}. In both cases the ME distribution becomes mixed discrete/continuous, but the validity of the variance formula (4.5) is not affected.
¹⁰LIML = limited-information maximum likelihood. 'Limited information' refers to the fact that no restrictions are incorporated on equations other than (3.1). 'Full information' and FIML use all restrictions in the system; see, e.g., Theil (1971, Chap. 10).


polynomial which is quadratic in the case of (3.1). The solution is

$$\mu = \frac{B}{2A} - \frac{1}{2A}\sqrt{B^2 - 4A(m_{11}m_{22} - m_{12}^2)}, \qquad (6.1)$$

where A = m_{11·p} m_{22·p} - m²_{12·p} and B = m_{11} m_{22·p} + m_{22} m_{11·p} - 2 m_{12} m_{12·p}, the m_{ij·p}'s being obtained from (3.3). Note that μ is random. As n → ∞, n(μ - 1) converges in distribution to a χ² variate so that n^{1/2}(μ - 1) converges in probability to zero. Therefore, the propositions stated in the discussion following (3.4) imply that n^{1/2}[γ̂(μ) - γ] has the same asymptotic normal distribution as its 2SLS counterpart, n^{1/2}[γ̂(1) - γ].

A closer approximation to the sampling distributions of the 2SLS and LIML estimators may be described as follows.¹¹ We standardize these two estimators by subtracting the true value of γ and then dividing the difference by their common asymptotic standard deviation. The asymptotic distribution of these two standardized estimators is standard normal. This is a first-order approximation which can be improved upon by appropriate expansions. The second-order approximation yields cdfs of the form

$$\text{2SLS:}\qquad \Phi(u) - n^{-1/2}\theta(u^2 - p + 1)\Phi'(u), \qquad (6.2)$$

$$\text{LIML:}\qquad \Phi(u) - n^{-1/2}\theta u^2\,\Phi'(u), \qquad (6.3)$$

where Φ(u) and Φ'(u) are the standard normal cdf and density function, respectively, while θ is a constant determined by the parameters of the system which contains (3.1) as one of its equations. Since substitution of u = 0 into (6.3) yields Φ(0) - 0 = ½, we conclude that the approximate distribution of the standardized LIML estimator has zero median, whereas (6.2) shows that the standardized 2SLS estimator has this property only for p = 1. As p increases, the median of the latter approximate distribution moves away from zero.

It appears that to a large extent these properties also apply when the estimators are formulated in terms of ME rather than sample moments. Theil and Meisner (1980) performed a simulation experiment in which the 2SLS estimator is systematically formulated in terms of ME moments. This has the advantage that the estimator exists even when p > n [because M_p in (3.3) is then positive definite], but the estimator is badly biased for large p. We shall therefore pay no further attention to 2SLS-type estimators. On the other hand, the approximate median-unbiasedness of LIML which is implied by (6.3) appears to also apply when this estimator is formulated in terms of ME moments.
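A sketch of the LIML computation (ours; the name liml is hypothetical), combining (3.3), (3.4) and (6.1). Sample moments about zero are used here for illustration; replacing them by ME or hybrid moments gives the LIML/ME and LIML/HY estimators discussed below:

```python
import numpy as np

def liml(y1, y2, X):
    """LIML estimate of gamma: the k-class estimator (3.4) with k = mu from (6.1)."""
    n = len(y1)
    m11, m12, m22 = y1 @ y1 / n, y1 @ y2 / n, y2 @ y2 / n
    m1p, m2p, Mp = X.T @ y1 / n, X.T @ y2 / n, X.T @ X / n
    Mi = np.linalg.inv(Mp)
    m11p = m11 - m1p @ Mi @ m1p                  # conditional moments, eq. (3.3)
    m12p = m12 - m1p @ Mi @ m2p
    m22p = m22 - m2p @ Mi @ m2p
    A = m11p * m22p - m12p ** 2
    B = m11 * m22p + m22 * m11p - 2 * m12 * m12p
    mu = (B - np.sqrt(B ** 2 - 4 * A * (m11 * m22 - m12 ** 2))) / (2 * A)  # eq. (6.1)
    return (m12 - mu * m12p) / (m22 - mu * m22p)                          # eq. (3.4)
```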

We return again to (3.1) and specify that the two associated reduced-form n T h e results which follow are from A n d e r s o n and Sawa; a convenient s u m m a r y is given by Malinvaud (1980, pp. 716-721).


equations are¹²

$$y_{1t} = \sum_{h=1}^{p} x_{ht} + \zeta_{1t}, \qquad y_{2t} = \sum_{h=1}^{p} x_{ht} + \zeta_{2t}, \qquad (6.4)$$

which agree with (3.1) if and only if γ = 1 and ε_t = ζ_{1t} - ζ_{2t}. In the simulation experiment to be discussed, the x_ht's and ζ_jt's are all generated as independent pseudo-normal variates,¹³ the distribution of each x_ht being N(0, V/p) and that of each ζ_jt being N(0, σ₀²). Therefore,

$$\sum_{h=1}^{p} x_{ht} \sim N(0, V), \qquad \zeta_{jt} \sim N(0, \sigma_0^2), \qquad \varepsilon_t \sim N(0, \sigma^2), \qquad (6.5)$$

where σ² = 2σ₀². Note that the distribution of the exogenous component in the reduced form (6.4) is independent of p.

The objective of the experiment is to analyze the behavior of LIML estimators as p increases beyond n. Table 2 is based on 1000 trials for the specification V = 1 and a fixed value of σ₀². Columns (2) and (3) contain, for each selected pair (p, n), the median of the LIML estimates over the 1000 trials. In column (2) we use the conventional LIML estimator based on sample moments (LIML/SA); in column (3) we have LIML/ME, obtained by interpreting the matrix (3.2) as consisting of ME moments. Since (3.1) contains no constant term, both the sample and the ME moments are interpreted as second-order moments measured from zero (rather than from the mean). Columns (4), (7), (10) and (13) will be discussed in the next subsection.

The medians in column (3) are all close to 1 and thus suggest that the LIML/ME estimator is approximately median-unbiased. Note that the medians in column (2) decline as p approaches n. This means that the conventional LIML/SA estimator loses its median-unbiasedness for large p. Also, the interquartile distance of LIML/SA in column (11) increases substantially as p approaches n. A comparison with the corresponding quartiles in columns (5) and (8) indicates that this increased dispersion results primarily from a declining lower quartile but also from an increasing upper quartile.¹⁴

The experiment underlying Table 2 uses uncorrelated exogenous variables. In Table 3 we extend this to equicorrelated variables.

¹²The reduced form is obtained by solving the system for the endogenous variables. This requires the number of equations to be equal to the number of these variables.
¹³The x's are not constant in repeated trials. Making them constant would have implied that all entries in any given row of Table 2 are determined by the same set of n observations on the p exogenous variables.
¹⁴Since Mariano and Sawa (1972) have shown that the sampling distribution of the LIML/SA estimator does not possess finite moments of any order, we use medians and quartiles to measure location and dispersion.
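One trial of the design (6.4)-(6.5) can be sketched as follows (ours, not the authors'; in particular, the default sig0_sq is a placeholder because the value of σ₀² used for the tables is not fully recoverable from the text):

```python
import numpy as np

def draw_trial(n, p, V=1.0, sig0_sq=0.25, rho=0.0, rng=None):
    """One trial of (6.4)-(6.5); returns (y1, y2, X) with true gamma = 1.

    rho = 0 gives the uncorrelated design of Table 2; rho > 0 gives the
    equicorrelated design of Table 3.
    """
    rng = rng or np.random.default_rng()
    # Equicorrelated exogenous variables with variance V/[p + p(p-1)rho],
    # so that sum_h x_ht ~ N(0, V) for any (p, rho).
    v = V / (p + p * (p - 1) * rho)
    cov = v * ((1 - rho) * np.eye(p) + rho * np.ones((p, p)))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    z1, z2 = rng.normal(0.0, np.sqrt(sig0_sq), size=(2, n))  # reduced-form errors
    s = X.sum(axis=1)
    return s + z1, s + z2, X
```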


Table 2
Quartiles of LIML estimators based on sample, ME and hybrid momentsᵃ

        Median             Lower quartile     Upper quartile     Interquartile distance
p       SA    ME    HY     SA    ME    HY     SA    ME    HY     SA    ME    HY
(1)     (2)   (3)   (4)    (5)   (6)   (7)    (8)   (9)   (10)   (11)  (12)  (13)

n = 21 observations
10      1.00  1.00  1.00   0.84  0.84  0.85   1.18  1.19  1.19   0.34  0.35  0.34
15      0.99  1.00  1.00   0.80  0.81  0.83   1.23  1.23  1.20   0.43  0.42  0.37
20      0.87  1.00  0.99   0.48  0.78  0.83   1.33  1.25  1.19   0.84  0.47  0.36
25      b     0.98  0.98   b     0.81  0.84   b     1.18  1.17   b     0.37  0.33
30      b     1.01  1.00   b     0.84  0.85   b     1.19  1.18   b     0.35  0.33
35      b     1.01  1.01   b     0.85  0.86   b     1.19  1.16   b     0.34  0.31
40      b     0.99  1.00   b     0.85  0.86   b     1.15  1.13   b     0.30  0.27

n = 31 observations
10      1.00  1.00  1.00   0.88  0.88  0.88   1.13  1.12  1.12   0.25  0.24  0.24
15      1.00  1.00  1.00   0.87  0.87  0.87   1.15  1.14  1.14   0.28  0.27  0.27
20      0.99  0.99  0.99   0.85  0.85  0.86   1.15  1.15  1.14   0.31  0.30  0.28
25      1.00  1.00  1.00   0.83  0.84  0.87   1.20  1.19  1.17   0.37  0.35  0.29
30      0.92  0.99  1.00   0.57  0.83  0.87   1.28  1.20  1.16   0.72  0.37  0.30
35      b     0.99  1.00   b     0.83  0.85   b     1.19  1.16   b     0.35  0.30
40      b     0.99  1.00   b     0.86  0.87   b     1.15  1.13   b     0.29  0.27
45      b     0.98  0.98   b     0.86  0.87   b     1.14  1.13   b     0.28  0.27
50      b     1.01  1.01   b     0.88  0.88   b     1.16  1.15   b     0.29  0.27

n = 41 observations
10      1.01  1.01  1.01   0.89  0.90  0.90   1.14  1.13  1.13   0.24  0.24  0.24
15      1.01  1.01  1.01   0.91  0.91  0.91   1.12  1.12  1.12   0.21  0.21  0.21
20      1.00  1.00  1.00   0.89  0.90  0.90   1.13  1.13  1.12   0.24  0.23  0.22
25      1.00  1.00  0.99   0.86  0.87  0.87   1.14  1.13  1.13   0.27  0.27  0.26
30      1.00  1.01  1.01   0.87  0.87  0.88   1.16  1.16  1.15   0.28  0.29  0.27
35      0.99  1.00  1.00   0.82  0.83  0.86   1.22  1.21  1.17   0.40  0.38  0.31
40      0.91  1.00  1.00   0.47  0.84  0.88   1.35  1.21  1.16   0.88  0.38  0.28
45      b     1.01  1.01   b     0.87  0.89   b     1.18  1.16   b     0.31  0.27
50      b     1.00  1.00   b     0.88  0.88   b     1.14  1.13   b     0.26  0.25
55      b     1.00  1.00   b     0.88  0.89   b     1.15  1.14   b     0.27  0.25
60      b     1.01  1.01   b     0.90  0.90   b     1.15  1.15   b     0.25  0.25

ᵃBased on 1000 trials; see text.
ᵇThe LIML/SA estimator does not exist.

Let the components of the vector (x_1t, …, x_pt) be pseudo-normal with zero mean, correlation ρ and variance V/[p + p(p - 1)ρ] so that (6.5) is still applicable for any (p, ρ). Let these vectors be independent for different values of t. Table 3 uses V = 1 and the same σ₀² as before, and it is based on 1000 trials of size n = 21 for selected values of p. The results for LIML/SA and LIML/ME in this table are similar to the corresponding results in Table 2.

465

Table 3 Q u a r t i l e s of L I M L e s t i m a t o r s b a s e d on c o r r e l a t e d e x o g e n o u s variables"

Median

Lower quartile

Upper quartile

Interquartile distance

p

SA

ME

HY

SA

ME

HY

SA

ME

HY

SA

ME

HY

10 15 20 25 30 35 40

1.02 1,00 0.90 U b b b

1.02 1.00 0.98 0.97 1.00 0.99 1.01

1.02 1.01 0.99 0.98 1.01 0.99 1.01

0.85 0.81 0.56 b b b b

0.85 0.82 0.79 0,81 0.85 0.83 0.87

p=0 0.85 0.85 0.85 0.83 0.86 0.84 0.87

1.19 1.21 1.31 U b b b

1.20 1.20 1.22 1.17 1.21 1.16 1.18

1.19 1.18 1.18 1.15 1.20 1.14 1.17

0.35 0.40 0.75 b b b b

0.35 0.38 0.43 0.36 0.36 0.32 0.31

0.34 0.33 0.33 0.32 0.34 0.30 0.29

10 15 20 25 30 35 40

1.02 1.01 0.91 b b b b

1.01 1.01 0.99 0.98 1.00 1.00 1.00

1.01 1.01 0.99 0.99 1.00 1.01 1,00

0.84 0.82 0.49 b b b b

0.84 0.83 0.81 0.83 0,85 0.87 0.87

p = 0.3 0.85 1.18 0.86 1.23 0.85 1.37 0.85 b 0.86 b 0.88 b 0.88 b

1,18 1.22 1.24 1.16 1.16 1.16 1.16

1.18 1.19 1.17 1.14 1.15 1.16 1.15

0.34 0.41 0.88 b b b b

0.34 0.39 0.42 0.33 0.31 0.30 0.29

0.33 0.33 0.32 0.29 0.29 0.29 0.27

10 15 20 25 30 35 40

1.02 1.01 0.93 b b b b

1.0I 1.01 0.99 0.98 0.99 1.01 1,00

1.02 1.00 1.00 0.99 0.99 1.01 1.00

0.84 0.82 0.52 b b h b

0.85 0.83 0.82 0.83 0.85 0.87 0.87

p = 0.6 0.86 1.18 0.85 1.23 0.85 1.38 0.85 b 0.86 b 0.87 b 0.88 b

1.18 1.22 1.22 1.17 1.16 1.17 1.16

1.18 1.19 1.17 1.15 1.16 1.16 1.16

0.34 0.42 0.87 b b b b

0.33 0.39 0.40 0.34 0.30 0.30 0.29

0.32 0.34 0.33 0.31 0.30 0.29 0.28

10 15 20 25 30 35 40

1.02 1.01 0.94 b b b b

1.02 1.01 1.00 0.99 0.99 1.01 0.99

1.02 1.01 0.99 0.99 0.99 1.01 0.99

0.85 0.81 0.53 b b b b

0.85 0.84 0.83 0.83 0.84 0.87 0.86

P =0.9 0.85 0.86 0.85 0.85 0.86 0.88 0.86

1.17 1.24 1.38 b b b b

1.17 1.20 1.20 1.17 1.17 1.18 1.17

1,17 1.19 1.18 1.17 1.16 1.17 1.17

0.32 0.43 0.85 b b b b

0.32 0.36 0.37 0.33 0.33 0.31 0.32

0.31 0.32 0.33 0.32 0.30 0.30 0.31

10 15 20 25 30 35 40

1.02 1.01 0.93 b b b b

1.02 1.01 0.99 0.99 0.99 1.01 1.00

1.02 0.85 11.01 0.81 0.99 0.54 1,00 b 0.99 b 1.01 b 1.00 U

0.86 0.86 0.84 0.85 0.85 0.87 0.85

p = 0.99 0.86 1.17 0.86 1.24 0.85 1.38 0.85 b 0.86 b 0.88 b 0.85 U

1.17 1.20 1.18 1.17 1.16 1.19 1.18

1.17 1.18 1.16 1.17 1.16 1.18 1.17

0.32 0.42 0.84 b b b b

0.31 0.34 0.33 0.33 0.31 0.31 0.33

0.30 0.32 0.31 0.31 0.30 0.29 0.31

"Based on 1000 trials; see text.

bThe L I M L / S A e s t i m a t o r d o e s n o t exist.


6.3. LIML estimators based on hybrid moments

Although Tables 2 and 3 indicate that the performance of LIML/ME is far better than that of LIML/SA, it is the case that the interquartile distance of the former estimator shows a bulge around p = n.¹⁵ This bulge indicates that for fixed n and increasing p, the precision of the estimator deteriorates when p approaches n and then improves when p increases beyond n. Is it possible to eliminate this bulge?

One way of doing this is by adding a ridge to the ME moment matrix in the same way that Haff's (1980) empirical Bayes estimator of the covariance matrix amounts to adding a ridge to the sample moment matrix. Specifically, let us interpret the p + 2 diagonal elements of the matrix (3.2) as sample moments and all off-diagonal elements as ME moments. We shall refer to (3.2) thus interpreted as the hybrid moment matrix of the p + 2 variables. Simulation experiments based on alternative risk functions have indicated that the hybrid moment matrix is an attractive alternative to the ME moment matrix, particularly when the objective is to estimate the inverse of a parent moment matrix; see Theil and Fiebig (1984).

The 1000 trials underlying each line of Table 2 have also been used to compute LIML/HY estimates, all obtained from the hybrid interpretation of the moment matrix (3.2). The medians of these estimates in column (4) are about as close to 1 as those of LIML/ME in column (3), but the interquartile distances of the former estimates in column (13) are systematically below those of the latter in column (12). Also, the interquartile distances in column (13) do not show the same large bulge around p = n which we find in column (12). The picture of the correlated case in Table 3 is about the same.

The evidence of Tables 2 and 3 suggests that the LIML approach can be rescued in the case of undersized samples by the simple device of replacing sample moments by hybrid moments. This simplicity is in agreement with the view (see Section 3, last paragraph) that the problem of undersized samples should not be a problem. See Theil and Fiebig (1984) for additional evidence concerning equations with more than two variables. In Section 7 we shall apply hybrid moments to a problem of constrained estimation.
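The hybrid matrix itself is trivial to form once the ME and sample moment matrices are available; a one-function sketch (ours, with a hypothetical name):

```python
import numpy as np

def hybrid_moments(me_mom, sample_mom):
    """Hybrid moment matrix: sample moments on the diagonal, ME moments off it."""
    H = np.array(me_mom, dtype=float)
    np.fill_diagonal(H, np.diag(sample_mom))
    return H
```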

7. Canonical correlations and symmetry-constrained estimation

7.1. Error covariance matrices and canonical correlations

We return to the linear system (2.1), which we generalize to a system of q linear equations with q dependent variables on the left and, in each equation, the same set of p independent variables on the right. The errors in the equations form a vector (ε_1t, …, ε_qt) with zero mean and covariance matrix Σ. In Section 2 we described some problems that arise when we replace Σ by the

¹⁵There is no clear evidence of such a bulge for large ρ. This exception reflects the fact that the p exogenous variables effectively behave as one variable when ρ is sufficiently close to 1.


estimate S consisting of mean squares and products of LS residuals; here we shall consider whether an ME approach yields more attractive results. The account which follows is a modified version of Meisner (1981).

We write the (p + q) × (p + q) covariance matrix of the dependent and the independent variables in partitioned form:

$$\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}' & \Sigma_{22} \end{bmatrix}, \qquad (7.1)$$

where Σ₁₁ corresponds to the q dependent variables and Σ₂₂ to the p independent variables.

If this matrix is interpreted as consisting of population variances and covariances, it is related to the error covariance matrix Σ by

$$\Sigma = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}'. \qquad (7.2)$$

Let ρ₁, …, ρ_m be the canonical correlation coefficients of the dependent and the independent variables, where m = min(p, q). These ρ_i's can be obtained from the determinantal equation

$$|\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}' - \rho_i^2\Sigma_{11}| = 0, \qquad (7.3)$$

so that (7.2) implies

$$|\Sigma - (1 - \rho_i^2)\Sigma_{11}| = 0, \qquad (7.4)$$

which provides a link between the error covariance matrix Σ and the canonical correlations of the q dependent and the p independent variables of the system: for i = 1, …, m, one minus each squared canonical correlation coefficient is a latent root of the diagonalization of Σ in the metric of the covariance matrix Σ₁₁ of the dependent variables.
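Numerically, (7.3) is a generalized symmetric eigenvalue problem. A sketch (ours; the function name is hypothetical, and SciPy's eigh is used for the generalized eigensolve):

```python
import numpy as np
from scipy.linalg import eigh

def canonical_correlations(S11, S12, S22):
    """Canonical correlations solving (7.3): |S12 S22^{-1} S12' - rho^2 S11| = 0."""
    M = S12 @ np.linalg.solve(S22, S12.T)             # S12 S22^{-1} S12'
    rho_sq = eigh(M, S11, eigvals_only=True)          # generalized eigenvalues
    return np.sqrt(np.clip(rho_sq, 0.0, 1.0))[::-1]   # largest first
```

Feeding this function either the ME, the sample, or the hybrid version of the partitioned matrix (7.1) yields the corresponding estimated canonical correlations discussed below.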

7.2. Estimation of canonical correlations

Given our interest in Σ, the result (7.4) suggests that it is worthwhile to consider the estimation of canonical correlations. Fiebig (1980) conducted a simulation experiment based on

$$Y_i = \rho_i X_i + (1 - \rho_i^2)^{1/2} v_i, \qquad i = 1, \ldots, 9, \qquad (7.5)$$

where the X_i's and v_i's are 18 independent standard pseudo-normal variates. Then the Y_i's are also independent standard pseudo-normal, while X_i and Y_j are uncorrelated for i ≠ j and (X_i, Y_i) has correlation ρ_i. Therefore, ρ₁, …, ρ₉ are the canonical correlations of (X₁, …, X₉) and (Y₁, …, Y₉). The joint covariance matrix of the 18 variables (X's and Y's) takes the form (7.1) with Σ₁₁ = Σ₂₂ = I and Σ₁₂ diagonal with ρ₁, …, ρ₉ on the diagonal. Their true values


are specified as

$$\rho_1 = 0.9, \quad \rho_2 = 0.8, \quad \ldots, \quad \rho_8 = 0.2, \quad \rho_9 = 0.1. \qquad (7.6)$$

By interpreting (7.1) as consisting of either ME or sample moments computed for a sample of size n, and then solving the associated determinantal equation (7.3), we obtain nine ME or sample canonical correlations. This experiment was replicated 100 times and the results are summarized in Table 4 in terms of means and RMSEs around the true value. The upper part of the table concerns the largest canonical correlation (with true value ρ₁ = 0.9). Both the ME and the sample estimator are subject to a substantial upward bias which slowly declines as n increases,¹⁶ but the bias of the former estimator is smaller and this also holds for its RMSE. The middle part of Table 4 concerns the arithmetic average canonical correlation (true value 0.5) and the lower part deals with the sum of the squared canonical correlations (true value 2.85); this sum plays a role in Hooper's (1959) trace correlation coefficient. The results are similar to those in the upper part: there is an upward bias which slowly decreases as n increases, and both the bias and the RMSE are smaller when ME rather than the sample moments are used.

Although these results are encouraging for the ME approach, it should be admitted that the upward bias is quite substantial. A comparison of the last four columns of Table 4 shows that this bias is typically close to the corresponding RMSE, suggesting that a bias correction is in order. Let r₁ ≥ r₂ ≥ ⋯ ≥ r_m be the ME canonical correlations. The corrected coefficients are r̃₁, …, r̃_m, obtained from

$$1 - \tilde{r}_i^2 = (1 - r_i^2)^{n/(n+p+q-1)}, \qquad (7.7)$$

which is a c o r r e c t i o n in e x p o n e n t i a l f o r m . T o e x p l a i n t h e e x p o n e n t we n o t e t h a t e a c h c a n o n i c a l v a r i a t e involves p - 1 o r q - 1 m u l t i p l i c a t i v e coefficients (only t h e ratios of t h e s e coefficients m a t t e r ) . T h i s yields p + q - 2 coefficients for a p a i r of c a n o n i c a l variates, to which w e a d d 1 for the use of a c o n s t a n t t e r m , y i e l d i n g a total of p + q - 1 coefficients. (Both c a n o n i c a l v a r i a t e s h a v e c o n s t a n t t e r m s , b u t the c o v a r i a n c e in t h e n u m e r a t o r of t h e c a n o n i c a l corr e l a t i o n is not affected w h e n only o n e c o n s t a n t is used.) T a b l e 5 p r o v i d e s e v i d e n c e of t h e c o r r e c t i o n (7.7) b a s e d on the e x p e r i m e n t a l d e s i g n (7.5) a n d (7.6) for b o t h t h e M E a n d t h e h y b r i d c a n o n i c a l c o r r e l a t i o n s . ~6The upward bias of the sample estimator is not surprising, since canonical correlations are generalizations of the multiple correlations. Let R be such a correlation, associated with a linear regression on p independent variables (including a constant term). A frequently used correction amounts to multiplying 1 - R 2 by the ratio of n - 1 to n - p - 1. Both this correction and that which is shown in (7.7) below for canonical correlations are corrections to the order 1/n, but (7.7) has the advantage of never yielding a negative f]. See also Lawley (1956, 1959) for an asymptotic expansion of the expected sample canonical correlations; the implied correction is much more complicated than (7.7).
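A minimal sketch of the sample-moment side of this experiment follows; it is ours, not Fiebig's original code. The helper names canonical_corrs and corrected are hypothetical, and the ME-moment variant is not reproduced here (it would replace np.cov by the ME moment matrix of Theil and Fiebig, 1984).

```python
# Sketch of design (7.5)-(7.6): nine independent pairs (X_i, Y_i) with
# correlations 0.9, ..., 0.1; sample canonical correlations from the
# determinantal equation (7.3); exponential bias correction (7.7).
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
rho = np.arange(9, 0, -1) / 10.0            # true values (7.6)
p = q = 9

def canonical_corrs(S, p, q):
    """Squared canonical correlations from a (p+q) x (p+q) moment matrix S,
    solving |S12 S22^{-1} S21 - rho^2 S11| = 0 as a generalized eigenproblem."""
    S11, S12 = S[:p, :p], S[:p, p:]
    S21, S22 = S[p:, :p], S[p:, p:]
    vals = eigh(S12 @ np.linalg.solve(S22, S21), S11, eigvals_only=True)
    return np.clip(np.sort(vals)[::-1], 0.0, 1.0)   # largest first

def corrected(r2, n, p, q):
    """Correction (7.7): 1 - rbar^2 = (1 - r^2)^(n/(n+p+q-1))."""
    return 1.0 - (1.0 - r2) ** (n / (n + p + q - 1))

n = 30
X = rng.standard_normal((n, 9))
V = rng.standard_normal((n, 9))
Y = rho * X + np.sqrt(1.0 - rho**2) * V     # model (7.5), column by column
S = np.cov(np.hstack([X, Y]), rowvar=False)
r2 = canonical_corrs(S, p, q)
print(np.sqrt(r2))                          # uncorrected sample values
print(np.sqrt(corrected(r2, n, p, q)))      # corrected values
```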


Table 4
ME and sample canonical correlation coefficients

            Mean              Estimated bias a        RMSE
  n      ME     Sample        ME      Sample        ME     Sample

Largest canonical correlation coefficient (true value 0.9)
 10    0.991      b         0.091       b          0.091      b
 15    0.992      b         0.092       b          0.092      b
 20    0.986    0.994       0.086     0.094        0.087    0.094
 25    0.973    0.978       0.073     0.078        0.073    0.079
 30    0.963    0.967       0.063     0.067        0.064    0.068
 40    0.945    0.949       0.045     0.049        0.050    0.052
 50    0.936    0.939       0.036     0.039        0.040    0.042
100    0.919    0.920       0.019     0.020        0.026    0.028

Average canonical correlation coefficient (true value 0.5)
 10    0.826      b         0.326       b          0.328      b
 15    0.733      b         0.233       b          0.235      b
 20    0.678    0.685       0.178     0.185        0.181    0.188
 25    0.635    0.640       0.135     0.140        0.139    0.144
 30    0.613    0.618       0.113     0.118        0.118    0.122
 40    0.583    0.586       0.083     0.086        0.088    0.091
 50    0.562    0.563       0.062     0.063        0.068    0.069
100    0.530    0.531       0.030     0.031        0.037    0.039

Sum of squared canonical correlation coefficients (true value 2.85)
 10    6.74       b         3.89        b          3.91       b
 15    5.67       b         2.82        b          2.84       b
 20    4.95     5.06        2.10      2.21         2.14     2.23
 25    4.46     4.52        1.61      1.67         1.63     1.69
 30    4.19     4.25        1.34      1.40         1.38     1.43
 40    3.83     3.86        0.98      1.01         1.00     1.04
 50    3.60     3.62        0.75      0.77         0.79     0.81
100    3.23     3.23        0.38      0.38         0.41     0.42

a Mean minus true value.
b Not computed: for n = 10 and 15, the largest sample canonical correlation coefficient is identically equal to 1.

The top row of the table shows the true value of each squared canonical correlation. The first eight rows contain means over 100 trials and, in parentheses, the RMSEs around the true value of the squared ME canonical correlations. The next eight lines provide analogous results for the hybrid estimates obtained by interpreting (7.1) as the hybrid covariance matrix (with sample variances on the diagonal and ME covariances elsewhere). In the lower half of the table the correction (7.7) is applied to either the ME or the hybrid estimator. A comparison of means and RMSEs shows that for n ≥ 15 the corrected hybrid estimator is superior except with respect to the largest canonical correlation.

[Table 5. Means and RMSEs of the squared canonical correlations: ME and hybrid estimates, uncorrected (upper half) and corrected by (7.7) (lower half). The entries of this table are not legible in the source scan.]

7.3. A cross-country demand system

We return again to the demand system (2.1), which we now amend by adding a constant term to each equation:

    yᵢₜ = αᵢ + βᵢx₀ₜ + Σⱼ₌₁ᴺ πᵢⱼxⱼₜ + εᵢₜ.                          (7.8)

Our application of this system will not be to time series data but to per capita data for 15 countries (t = 1, ..., n = 15); see the Appendix for further details. The analysis of homogeneity and symmetry testing is beyond the scope of this chapter, because it would involve not only the frequency of rejections of the null hypothesis when this hypothesis is true but also the power of the test. Instead, we shall impose the homogeneity condition (2.2) by writing (7.8) in the form

    yᵢₜ = αᵢ + βᵢx₀ₜ + Σⱼ₌₁ᴺ⁻¹ πᵢⱼ(xⱼₜ − x_Nₜ) + εᵢₜ,               (7.9)

and we shall want to estimate this system subject to the symmetry constraint (2.3). Since ε₁ₜ + ⋯ + ε_Nₜ = 0, we can confine the estimation of (7.9) to i = 1, ..., N − 1. We write (7.9) for t = 1, ..., 15 as yᵢ = Xδᵢ + εᵢ, where δᵢ = (αᵢ, βᵢ, πᵢ₁, ..., π_{i,N−1})′ and X is a 15 × (N + 1) matrix whose tth row equals (1, x₀ₜ, x₁ₜ − x_Nₜ, ..., x_{N−1,t} − x_Nₜ). Let (ε₁ₜ, ..., ε_{N−1,t}) for t = 1, ..., 15 be independently and identically distributed with zero means and nonsingular covariance matrix Σ. Then (X′X)⁻¹X′yᵢ is the LS estimator of δᵢ, which is unbiased if X is fixed, while S defined as

    S = [15 − (N + 1)]⁻¹ Y′[I − X(X′X)⁻¹X′]Y,   Y = [y₁, ..., y_{N−1}],   (7.10)

is an unbiased estimator of Σ.

The LS estimator of δᵢ does not satisfy the symmetry constraint (2.3). We can write (2.3) in the form Rδ = 0, where δ is a vector with δᵢ as the ith subvector (i = 1, ..., N − 1) and R is a matrix whose elements are all 0 or ±1, each row of R corresponding to πᵢⱼ = πⱼᵢ for some (i, j). The BLU estimator of δ constrained by (2.3) is

    δ̂(Σ) = d − C(Σ)R′[RC(Σ)R′]⁻¹Rd,                               (7.11)

and its covariance matrix is

    C(Σ) − C(Σ)R′[RC(Σ)R′]⁻¹RC(Σ),                                 (7.12)

where C(Σ) = Σ ⊗ (X′X)⁻¹ and d is a vector with (X′X)⁻¹X′yᵢ as the ith subvector (i = 1, ..., N − 1). For details on constrained linear estimation, see, e.g., Theil (1971, Sec. 6.8). If Σ is known, we can compute (7.11) from the data; a computational sketch is given below. If Σ is not known, the standard procedure is to replace Σ in (7.11) by the estimator S of (7.10).
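The following is a minimal sketch (ours, not from the original chapter) of how (7.11) and (7.12) can be computed. It assumes the stacked LS vector d, the restriction matrix R, the common regressor matrix X and an estimate of Σ are already available as NumPy arrays; constrained_blu is a hypothetical helper name.

```python
import numpy as np

def constrained_blu(d, R, X, Sigma):
    """Symmetry-constrained BLU estimator (7.11) and its covariance (7.12)."""
    C = np.kron(Sigma, np.linalg.inv(X.T @ X))   # C(Sigma) = Sigma (x) (X'X)^{-1}
    CR = C @ R.T                                  # C(Sigma) R'
    A = np.linalg.solve(R @ CR, R)                # [R C(Sigma) R']^{-1} R
    delta = d - CR @ (A @ d)                      # (7.11)
    cov = C - CR @ (A @ C)                        # (7.12)
    return delta, cov
```

Here d stacks the subvectors (X′X)⁻¹X′yᵢ and each row of R encodes one restriction πᵢⱼ = πⱼᵢ.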

Alternatively, we can use an estimator based on corrected canonical correlations of the type (7.7), but an adjustment must be made for the fact that Σ refers only to N − 1 equations.¹⁷ Here we retain all N equations (7.9) for i = 1, ..., N by specifying p = q = N in (7.1).¹⁸ Indicating by hats (circumflexes) that (7.1) has sample variances on the diagonal and ME covariances elsewhere, we obtain the (uncorrected) hybrid canonical correlations r₁ > r₂ > ⋯ > r_N from

    (Σ̂₁₂Σ̂₂₂⁻¹Σ̂₂₁ − rᵢ²Σ̂₁₁)zᵢ = 0,                                 (7.13)

where zᵢ is a characteristic vector associated with rᵢ², normalized so that

    zᵢ′Σ̂₁₁zⱼ = δᵢⱼ or, equivalently, Z′Σ̂₁₁Z = I,   Z = [z₁, ..., z_N].   (7.14)

Let Σ_N be the covariance matrix of (ε₁ₜ, ..., ε_Nₜ), to be estimated from the N × N version of (7.4) with characteristic vectors (the zᵢ's) added. Since Σ_N has rank N − 1, we correct r₁ to 1 and use (7.7) with p = q = N for i = 2, ..., N. Let Λ be the diagonal matrix with 0, 1 − r̄₂², ..., 1 − r̄_N² on the diagonal. Then, from (7.4), (7.13) and (7.14), the corrected estimator of Σ_N is Σ̄_N = (Z′)⁻¹ΛZ⁻¹ = Σ̂₁₁ZΛZ′Σ̂₁₁, so that

    Σ̄_N = Σᵢ₌₂ᴺ (1 − r̄ᵢ²) Σ̂₁₁zᵢ(Σ̂₁₁zᵢ)′,                          (7.15)

after which Σ̄ is obtained by deleting the last row and column of Σ̄_N. This Σ̄ is an estimator of Σ in (7.11) that will be used below as an alternative to S of (7.10). Note that Σ̄ does not involve the largest canonical correlation (see the end of the previous subsection).
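A sketch of the computation (7.13)-(7.15), under the same caveats as the sketch above: the 2N × 2N hybrid matrix of (7.1) is taken as given (building the ME covariances themselves is outside the scope of this sketch), and corrected_sigma is a hypothetical helper name.

```python
import numpy as np
from scipy.linalg import eigh

def corrected_sigma(Shat, N, n):
    """Corrected error covariance estimator of Section 7.3, eqs. (7.13)-(7.15)."""
    S11, S12 = Shat[:N, :N], Shat[:N, N:]
    S21, S22 = Shat[N:, :N], Shat[N:, N:]
    # (7.13) as a generalized eigenproblem; eigh normalizes the eigenvectors
    # so that Z' S11 Z = I, which is exactly (7.14).
    r2, Z = eigh(S12 @ np.linalg.solve(S22, S21), S11)
    order = np.argsort(r2)[::-1]                   # largest squared correlation first
    r2, Z = np.clip(r2[order], 0.0, 1.0), Z[:, order]
    rbar2 = 1.0 - (1.0 - r2) ** (n / (n + 2 * N - 1))   # correction (7.7), p = q = N
    rbar2[0] = 1.0                                  # r1 corrected to 1: Sigma_N has rank N-1
    lam = 1.0 - rbar2                               # diagonal of Lambda
    SZ = S11 @ Z
    SigmaN = (SZ * lam) @ SZ.T                      # (7.15): sum of (1-rbar_i^2) S11 z_i (S11 z_i)'
    return SigmaN[:-1, :-1]                         # delete last row and column
```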

¹⁷ Deleting the first of N equations rather than the last amounts to a linear transformation of the dependent variables. Such a transformation affects the corrected ME error covariance matrix in a nontrivial way, since the rectangular cells in the second paragraph of Section 5 become parallelograms when the variables are linearly transformed.

¹⁸ There are p = N independent variables in (7.9); the constant terms αᵢ are handled by the use of variances and covariances rather than second moments around zero.

7.4. Discussion of numerical results

A simulation experiment was performed in order to compare the three symmetry-constrained estimators, with N = 8 goods: food; clothing; rent; furniture; medical care; transport and communication; recreation and education; other consumption expenditures. The first column of Table 6 contains the true values of the parameters. For each of the three estimators, the columns labeled Bias and RMSE contain the estimated bias (mean minus true value) and the RMSE around the true value over 500 trials. Bias presents no problem; the estimated bias values are all small in magnitude relative to the corresponding RMSEs.

[Table 6. True parameter values and, for the symmetry-constrained estimators based on S, Σ̄ and the true Σ, the bias, RMSE and RMSSE over 500 trials. The entries of this table are not legible in the source scan.]

Differences between the three estimators appear when we consider their RMSEs. The estimates based on S are markedly inferior to those which use the true Σ. When we use Σ̄ rather than S, we obtain estimates which compare much more favorably to those based on the true Σ. In order to facilitate these comparisons, we computed ratios of the RMSEs based on the true Σ to those based on S and Σ̄ for each of the 35 coefficients. These ratios are shown in the first two columns of Table 7, and the quartiles of these ratios (lower, median, upper) are shown below.

                Lower    Median    Upper
Ratios for S     0.67     0.74      0.81
Ratios for Σ̄     0.92     0.95      0.99

It is evident from these figures that there is a substantial efficiency gain from using Σ̄ rather than S in the symmetry-constrained estimation procedure, and that the efficiency loss from not knowing the true error covariance matrix is quite modest when Σ̄ is used as its estimator. Another matter of importance is whether the standard errors of the symmetry-constrained estimates provide an adequate picture of the variability of these estimates around the true parameter values. This problem is pursued by means of the RMSSEs of Table 6. These are obtained from the matrix (7.12), with Σ interpreted as either S or Σ̄ or the true Σ, by averaging the diagonal elements of (7.12) over the 500 trials and then taking square roots of these averages. On comparing the RMSSEs based on S with the corresponding RMSEs we must conclude that the standard errors based on S tend to underestimate the variability of their coefficient estimates. Table 7 illustrates this more clearly by providing the ratio of the RMSSE to the corresponding RMSE for each estimator. The third column of this table shows the substantial understatement of the variability of the estimates based on S. The quartiles of the 35 ratios in each of the last three columns are as follows:

                Lower    Median    Upper
Ratios for S     0.48     0.59      0.69
Ratios for Σ̄     0.89     0.94      1.06
Ratios for Σ     0.99     1.01      1.02

When the true Σ is used, the ratios are tightly distributed around unity. Use of Σ̄ yields ratios which are more widely dispersed around 1, but which represent a marked improvement over the use of S.
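For completeness, a minimal sketch of the two summary measures used above, assuming the simulation output is collected in arrays; rmse and rmsse are hypothetical helper names.

```python
import numpy as np

def rmse(est, true):
    """RMSE around the true value; est has shape (trials, k)."""
    return np.sqrt(np.mean((est - true) ** 2, axis=0))

def rmsse(covs):
    """RMSSE: average the diagonal of (7.12) over the trials, then take square roots;
    covs is a list of (7.12) matrices, one per trial."""
    return np.sqrt(np.mean([np.diag(C) for C in covs], axis=0))
```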


Table 7
Ratios of RMSEs and RMSSEs of symmetry-constrained estimates

                     Ratio of RMSE based on true Σ      Ratio of RMSSE to RMSE
                     to RMSE based on
                        S        Σ̄                       S       Σ̄       Σ

Coefficients βᵢ
i = 1                 0.70     0.92                    0.56    1.39    1.01
i = 2                 0.69     0.90                    0.60    0.94    1.06
i = 3                 0.89     0.99                    0.86    1.08    1.02
i = 4                 0.63     0.99                    0.44    0.92    0.98
i = 5                 0.79     0.90                    0.67    1.25    0.98
i = 6                 0.83     0.94                    0.76    0.95    1.02
i = 7                 0.67     0.94                    0.48    0.94    0.97

Diagonal Slutsky coefficients πᵢᵢ
i = 1                 0.72     0.94                    0.57    1.52    1.03
i = 2                 0.78     1.00                    0.60    0.88    1.00
i = 3                 0.92     0.97                    0.84    0.95    1.00
i = 4                 0.61     1.00                    0.38    0.83    0.97
i = 5                 0.79     0.93                    0.67    1.35    1.04
i = 6                 0.88     0.97                    0.79    0.85    1.03
i = 7                 0.65     0.92                    0.44    0.87    0.96

Off-diagonal Slutsky coefficients πᵢⱼ
i = 1, j = 2          0.70     0.87                    0.52    0.93    0.99
i = 1, j = 3          0.83     0.86                    0.72    1.27    1.03
i = 1, j = 4          0.62     0.97                    0.42    0.98    0.99
i = 1, j = 5          0.77     0.93                    0.59    1.43    0.99
i = 1, j = 6          0.86     0.88                    0.66    1.00    0.96
i = 1, j = 7          0.63     0.96                    0.43    1.06    1.01
i = 2, j = 3          0.81     0.99                    0.71    0.90    1.02
i = 2, j = 4          0.63     0.99                    0.43    0.84    0.99
i = 2, j = 5          0.72     0.94                    0.55    0.98    1.01
i = 2, j = 6          0.81     1.00                    0.67    0.86    1.02
i = 2, j = 7          0.67     0.95                    0.45    0.87    0.98
i = 3, j = 4          0.78     1.01                    0.66    0.92    0.98
i = 3, j = 5          0.81     0.87                    0.71    1.11    1.01
i = 3, j = 6          0.90     1.01                    0.79    0.90    1.01
i = 3, j = 7          0.79     0.96                    0.69    0.91    0.99
i = 4, j = 5          0.68     0.95                    0.49    0.97    1.02
i = 4, j = 6          0.69     0.96                    0.54    0.86    1.03
i = 4, j = 7          0.60     1.01                    0.36    0.89    1.00
i = 5, j = 6          0.78     0.86                    0.65    0.97    1.05
i = 5, j = 7          0.72     0.92                    0.55    0.98    1.01
i = 6, j = 7          0.74     0.99                    0.58    0.93    1.01


8. Conclusion

We have attempted to demonstrate an approach, based on the ME distribution, to problems that arise in large equation systems. Estimators of various population parameters are generated from this distribution according to the method of moments: whenever a standard procedure uses sample moments, we use ME moments. For example, previous analyses have found that the ME moment matrix leads to small-sample gains relative to the usual sample moment matrix. On the basis of our experimentation, impressive results were also achieved from a hybrid moment matrix whose diagonal elements are sample moments and whose off-diagonal elements are ME moments.

The experiments presented have, we hope, illustrated the effectiveness of the ME approach. Simulation experiments cannot be conclusive, though, and it is appropriate that further work be done in order to reinforce these initial impressions. The simultaneous equation experiment could be extended in a number of directions. The form of the equation isolated for attention is extremely simple, and the experiment could be extended to equations that include more endogenous and/or exogenous variables. Also, no attempt was made to test the validity of the asymptotic standard errors.

It is appropriate to note that there exist other problems associated with large equation systems that have not been discussed here. In the context of simultaneous equation estimation, full information methods of estimation (such as three-stage least squares) require the number of endogenous variables in the system to be less than the number of observations. Without such a condition, the usual sample estimator of the error covariance matrix is singular. Essentially the same problem can arise in systems of demand equations or, more generally, in any system of seemingly unrelated regression equations. For example, in order to estimate a system of demand equations with 37 goods on the basis of annual U.K. data for 17 years, Deaton (1975) used an a priori specified covariance matrix. The ME approach provides a simple and elegant solution in such situations.

Appendix

The demand systems (2.1) and (7.8) are obtained by differentiating an appropriately differentiable utility function subject to the budget constraint Σᵢ pᵢqᵢ = M, where pᵢ and qᵢ are the price and quantity of good i and M is total expenditure (or 'income'). The technique used amounts to deriving the first-order constrained maximum condition and then differentiating it with respect to M and the pⱼ's. The result can be conveniently written in the differential form

    wᵢ d(log qᵢ) = θᵢ d(log Q) + Σⱼ₌₁ᴺ πᵢⱼ d(log pⱼ),               (A1)

where wᵢ is the budget share of good i and d(log Q) is the Divisia volume index,

    wᵢ = pᵢqᵢ/M,   d(log Q) = Σᵢ₌₁ᴺ wᵢ d(log qᵢ),                   (A2)

while θᵢ = ∂(pᵢqᵢ)/∂M is the marginal budget share of good i and the Slutsky coefficient πᵢⱼ equals (pᵢpⱼ/M) ∂qᵢ/∂pⱼ, the derivative ∂qᵢ/∂pⱼ measuring the effect of pⱼ on qᵢ when real income remains constant. The homogeneity property (2.2) reflects that proportionate changes in all prices do not affect any qᵢ when M also changes proportionately. The symmetry property (2.3) results from the assumed symmetry of the Hessian matrix of the utility function.

To apply (A1) to time series we write Dxₜ = log(xₜ/xₜ₋₁) for any positive variable x with value xₜ at time t. A finite-change approximation to (A1) is then

    w̄ᵢₜ Dqᵢₜ = θᵢ DQₜ + Σⱼ₌₁ᴺ πᵢⱼ Dpⱼₜ,                            (A3)

where DQₜ = Σᵢ w̄ᵢₜ Dqᵢₜ and w̄ᵢₜ is the arithmetic average budget share of good i at t − 1 and t. Equation (A3) is equivalent to (2.1) for yᵢₜ = w̄ᵢₜ Dqᵢₜ, x₀ₜ = DQₜ, xⱼₜ = Dpⱼₜ. For further details, see Theil (1980).

The numerical results reported in Section 7 are based on the analysis of Theil and Suhm (1981) of data on 15 countries collected by Kravis et al. (1978). These countries are the U.S., Belgium, France, West Germany, U.K., The Netherlands, Japan, Italy, Hungary, Iran, Colombia, Malaysia, Philippines, South Korea, and India. Let wᵢₜ be the per capita budget share of good i in country t. Working's (1943) model describes such a share as a linear function of the logarithm of income. To take into account that different countries have different relative prices, Working's model is postulated to hold at the geometric mean prices across countries, p̄₁, ..., p̄_N, where

    log p̄ᵢ = (1/15) Σₜ₌₁¹⁵ log pᵢₜ,                                (A4)

which requires that a substitution term be added to the model. The result is that the demand system takes the form (7.8), with x₀ₜ the per capita real income of country t, xⱼₜ = log(pⱼₜ/p̄ⱼ) and yᵢₜ equal to 1 − xᵢₜ + Σⱼ wⱼₜxⱼₜ, multiplied by wᵢₜ. Then the sums over i = 1, ..., N of yᵢₜ, αᵢ, βᵢ and πᵢⱼ are equal to 1, 1, 0 and 0, respectively, implying that ε₁ₜ, ..., ε_Nₜ are linearly dependent.
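A minimal sketch of the finite-change transforms in (A2)-(A3), for the time-series version of the system; it assumes (T × N) arrays q and p of quantities and prices and a length-T array M of total expenditure, and the helper names dlog and divisia_transform are ours.

```python
import numpy as np

def dlog(x):
    """D x_t = log(x_t / x_{t-1}), applied along the time axis."""
    return np.diff(np.log(x), axis=0)

def divisia_transform(q, p, M):
    """Finite-change variables of (A3): y_it, DQ_t and Dp_jt."""
    w = (p * q) / M[:, None]          # budget shares w_it, cf. (A2)
    wbar = 0.5 * (w[1:] + w[:-1])     # average share of good i at t-1 and t
    Dq, Dp = dlog(q), dlog(p)
    DQ = (wbar * Dq).sum(axis=1)      # Divisia volume index DQ_t
    y = wbar * Dq                     # dependent variables of (2.1)
    return y, DQ, Dp
```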

References

Barten, A. P. (1969). Maximum likelihood estimation of a complete system of demand equations. European Economic Review 1, 7-73.
Bera, A. K., Byron, R. P. and Jarque, C. M. (1981). Further evidence on asymptotic tests for homogeneity and symmetry in large demand systems. Econom. Lett. 8, 101-105.
Byron, R. P. (1970). The restricted Aitken estimation of sets of demand equations. Econometrica 38, 816-830.
Christensen, L. R., Jorgenson, D. W. and Lau, L. J. (1975). Transcendental logarithmic utility functions. American Economic Review 65, 367-383.
Conway, D. and Theil, H. (1980). The maximum entropy moment matrix with missing values. Econom. Lett. 5, 319-322.
Deaton, A. S. (1974). The analysis of consumer demand in the United Kingdom. Econometrica 42, 341-367.
Deaton, A. S. (1975). Models and Projections of Demand in Post-War Britain. Chapman and Hall, London.
Fiebig, D. G. (1980). Maximum entropy canonical correlations. Econom. Lett. 6, 345-348.
Fiebig, D. G. (1982). The maximum entropy distribution and its covariance matrix. Doctoral dissertation, Department of Economics, University of Southern California.
Haff, L. R. (1980). Empirical Bayes estimation of the multivariate normal covariance matrix. Ann. Statist. 8, 586-597.
Hooper, J. W. (1959). Simultaneous equations and canonical correlation theory. Econometrica 27, 245-256.
Kagan, A. M., Linnik, Y. V. and Rao, C. R. (1973). Characterization Problems in Mathematical Statistics. Wiley, New York.
Kidwai, S. A. and Theil, H. (1981). Simulation evidence on the ridge and the shrinkage of the maximum entropy variance. Econom. Lett. 8, 59-61.
Kravis, I. B., Heston, A. W. and Summers, R. (1978). International Comparisons of Real Product and Purchasing Power. The Johns Hopkins University Press, Baltimore, MD.
Laitinen, K. (1978). Why is demand homogeneity so often rejected? Econom. Lett. 1, 187-191.
Lawley, D. N. (1956). Tests of significance for the latent roots of covariance and correlation matrices. Biometrika 43, 128-136.
Lawley, D. N. (1959). Tests of significance in canonical analysis. Biometrika 46, 59-66.
Lluch, C. (1971). Consumer demand functions, Spain, 1958-1964. European Economic Review 2, 277-302.
Malinvaud, E. (1980). Statistical Methods of Econometrics, 3rd ed. North-Holland, Amsterdam.
Mariano, R. S. and Sawa, T. (1972). The exact finite-sample distribution of the limited-information maximum likelihood estimator in the case of two included exogenous variables. J. Amer. Statist. Assoc. 67, 159-165.
Meisner, J. F. (1979). The sad fate of the asymptotic Slutsky symmetry test for large systems. Econom. Lett. 2, 231-233.
Meisner, J. F. (1981). Appendix to Theil and Suhm (1981).
Schuster, E. F. (1973). On the goodness-of-fit problem for continuous symmetric distributions. J. Amer. Statist. Assoc. 68, 713-715.
Schuster, E. F. (1975). Estimating the distribution function of a symmetric distribution. Biometrika 62, 631-635.
Theil, H. (1971). Principles of Econometrics. Wiley, New York.
Theil, H. (1980). The System-Wide Approach to Microeconomics. The University of Chicago Press, Chicago, IL.
Theil, H. and Fiebig, D. G. (1984). Exploiting Continuity: Maximum Entropy Estimation of Continuous Distributions. Ballinger, Cambridge, MA.
Theil, H. and Laitinen, K. (1980). Singular moment matrices in applied econometrics. In: P. R. Krishnaiah, ed., Multivariate Analysis V, 629-649. North-Holland, Amsterdam.
Theil, H. and Meisner, J. F. (1980). Simultaneous equation estimation based on maximum entropy moments. Econom. Lett. 5, 339-344.
Theil, H. and Suhm, F. E. (1981). International Consumption Comparisons: A System-Wide Approach. North-Holland, Amsterdam.
Theil, H., Kidwai, S. A., Yalnizoğlu, M. A. and Yellé, K. A. (1982). Estimating characteristics of a symmetric continuous distribution. CEDS Discussion Paper 74, College of Business Administration, University of Florida.
Working, H. (1943). Statistical laws of family expenditure. J. Amer. Statist. Assoc. 38, 43-56.

Subject Index

ACM filter cleaners, 144 Adaptive algorithm, 202 Adaptive regression model, 430 Additive outliers model, 133 Akaike's AIC, 274, 417 Akaike's Markovian representation, 165 Almost harmonizable process, 285 Amplitude-dependent autoregressive models, 26, 43 Analysis of covariances matrices, 363 Approximate maximum-likelihood type estimates, 141 ARMAX model, 191, 217, 223, 258 Asymptotically efficient estimators, 440 Asymptotically optimal, 349 Asymptotically optimal sequence of designs, 351 Asymptotically stationary, 280 Asymptotic breakdown point, 150 Asymptotic properties, 199, 378 Autocorrelation coefficients, 426 Autocorrelation function, 430, 433 Autoregressive (AR) model, 1, 15, 75, 179, 180, 358, 393, 445 Autoregressive moving average (ARMA) model, 27, 28, 75, 85, 86, 88, 106, 119, 155, 157, 179, 181, 257, 258, 265-269 Averaged covariances, 315, 316

Backward representation, 181 Batch identification, 190 Bernoulli number, 351 Bivariate entropy, 458 Bochner-Herglotz theorem, 280 Box and Jenkins' approach, 414 Breakdown points, 150 Brownian motion, 303, 351 Business data analysis, 243

Canonical correlations, 466, 467 Canonical echelon, 265, 266 Canonical form, 265-269 Canonical state space, 265-269 Catastrophe theory, 48 Class (KF), 295 Comparative calibration, 364 Computation of AM-estimates, 147 Concentrated maximum likelihood, 423 Consistency, 436, 437 Consistency condition, 346, 347, 348 Continuous time models, 168 Correlation characteristic, 301 Covariance estimators, 435, 436 Covariance matrix analysis, 363 Covariance stationary, 1 Cramér class, 291 Cramér-Hida class, 280 Cramér-Rao lower bound, 120 Cross-validation, 184 Delta-array, 182 Detection of outliers, 107, 109 Deterministic process, 299 Deterministic sampling, 342 Diagnostic checks, 425, 440 Difference-differential filter, 305 Difference equation, 2, 191 Diffusion processes, 51, 53 Dilation of Cramér process, 292 Dilation of harmonizable process, 283 Distribution systems, 53 Domain estimation, 374 Dynamical systems, 25, 189 Dynamic time warping, 391, 409 Echelon forms, 265, 266 Ecological systems, 237 Efficiency of robust estimates, 129


Efficiency robustness, 119 E-M algorithm, 442 Endogenous variables, 415, 443, 454 Environmental systems, 237 Ergodicity, 36, 417, 421 Errors-in-variables, 133 Estimating the transfer functions, 273, 366 Estimation of narrow band, 368, 434 Estimation in presence of delays, 375 Evolutionary behavior, 428 Evolving coefficient regression, 439 Evolving coefficient variation, 415, 426 Evolving constant model, 430 Exogenous set of variables, 415, 454 Explosive autoregression, 6, 17 Exponential AR models, 31, 33, 37, 74 Extended autocorrelation function, 92 Extended exponential AR model, 44, 76 Extended Kalman filter, 205 Factorable, 339 Factor analysis, 363, 364 False alarm rate, 341, 345 Filter, 143, 145, 305 Final prediction error, 230 Finite Fourier transform of the data, 368 Finite-sample breakdown points, 150 Fisher consistency, 149 Fisher information, 120, 130 Fokker-Planck equation, 51, 52, 54, 63, 71 Forgetting factors, 209 Fréchet variation, 282, 291, 323 Frequency dependent, 375 Frequency-dependent delays, 367 Frequency response, 325 Gaussian noise, 341, 342 Gaussian process, 301, 358 Gauss-Markov, 350, 352, 356 Gauss-Newton algorithm, 200 Generalized partial autocorrelation, 181 Generalized Pearson system, 54 Generalized spectral function, 282 Gradient algorithm, 200 Green's function, 304 Hampel-Krasker-Welsch type, 135 Hankel matrix (block), 263 Harmonizable process, 314 Harmonizable process, multivariate, 285 Harmonizable process, strongly, 280 Harmonizable process, weakly, 282 Homogeneous testing, 185


Subject index LPC model, 393 LPC of Gaussian linear model, 394 LPC power functions, 406-408

Manifold estimation, 275 Manifold of systems, 263, 269, 270 Markov chain, 36, 37, 55, 60, 75 Markov chain model, 57, 63, 66, 68, 72 Markov models, 392 Martingale central limit theorem, 421 Martingale difference, 414 Matrix fraction description, 261 Maximum entropy, 451,455, 458, 459, 460, 462 Maximum likelihood estimation, 129, 272, 273, 274, 419, 422, 436 Mean square approximation error, 344 Median sampling, 343, 346, 347, 351, 352, 354, 355, 356 M-estimates, 121, 130, 134 Method of scoring, 440 Minimax regret choice, 186 Minimum variance unbiased estimator, 306 Min-max robustness, 119, 120, 121 Missing observations, 157, 460 Model adequacy, 440 Model building strategy, 96, 106 Model checking, 179 Model order identification, 228, 232 Model parameter estimation, 232 Model selection, 185, 367, 375 Model validation, 232 Moving average filter, 305, 325 Moving average (MA) model, 179, 180 Multidimensional time series, 323 Multiindex, 266 Multiple regression, 357, 360 Multiplicity of a process, 300 Multivariate autoregression, 18

Narrow band case, 376 Newton algorithm, 200 Newton-Raphson method, 2111, 424, 442 Noise process, 294 Non-Gaussian colored noise, 27 Non-identifiability, 182 Non-linear difference equation, 40 Non-linear least squares, 413 Non-linear model, 47 Non-linear optimization, 175 Non-linear time series, 25 Non-parametric estimation techniques, 366, 377 Non-stationarity, 312, 444

483

Normal operator semigroup, 289 Normal process, 307 Numerical schemes, 201 Observation errors, 360 Off-line identification, 190 Optimal designs, 349, 352, 355, 356, 358, 359, 360 Optimal filter, 329 Optimal generalized equation error (OGEE) approach, 219 Order determination, 417, 433 Order of a system, 263 Orders of the models, 375 Ornstein-Uhlenbeck process, 60 Orthogonally scattered measure, 28'7 Oscillatory processes, 315 Outliers, 104, 120, 126 Output error identification, 197 Output error models, 192 Ozone data, 114 Parametrization, 259 Parametrization, ARMA, 262 Parametrization of the manifold, 269 Parametrization of state space, 264 Partial autocorrelations, 91, 180, 430, 433 Periodic, 360 Periodic sampling, 343, 351,360 Periodic sampling with jitter, 343 Periodogram analysis, 312 Permissible parameter space, 438 Phase across broad band, 379 Phase across narrow band, 381 Polynomial AR models, 30 Polynomial filter, 325 Portmanteau test statistic, 184 Praxis, 442 Prediction error, 9, 183, 217, 218 Prediction error identification, 197 Prediction error representation, 264 Prediction theory, 93 Predictors, 193 Prewhitening, 381,382 Processes with independent increments, 360 Product sampling designs, 358 Prohorov distance, 122 Pseudolinear regressions, 205 Psi-array, 182 Purely nondeterministic process, 299 Quadratic mean derivative, 338, 350, 351, 353, 354 Quadrature formula, 359

484

Subject index

Qualitative robustness, 119, 122, 123, 125, 136 Quantile sampling, 353, 356 Quenouille's test statistic, 181 Random coefficient autoregressions, 416, 444 Random coefficient variation, 415 Random fields, 357, 358 Random sampling, 342, 343, 355 Random vibrational system, 32 Random walk, 226 R-array, 182 Rate of convergence, 35.3-358, 360 Realignment, 381,382 Real-time identification, 202 Recursive identification, 202, 213 Recursive methods, 190 Recursive least squares algorithm, 218 Recursive prediction error methods, 202 Recursive time series, 231 Regression with stationary errors, 171 Regular sampling, 343, 349 Relatively smooth with noise, 366 Relevant log-likelihood, 369 Reproducing kernel Hilbert space, 327, 338, 339, 340, 341,350, 359 Residual autocovariance estimates, 136, 137, 138-140 Robust filter cleaners, 144 Robustness, 123, 124, 125, 356 Runge-Kutta method, 56 Sampling designs, 337, 342, 343 S-array, 182 Score-test statistic, 428 Seasonal adjustment model, 439 Seasonal factors, 414 Second-order efficient, 440 Second-order stationary, 417 Sequential parameter estimation, 202 Shift operator, 288 Ship rolling, 26 Signal associated noise, 383 Signal characteristic, 294 Signal process, 294 Signals in noise, 341,345, 347 Signal-to-noise ratio, 342, 348, 366 Simple random sampling, 343, 346, 347, 348, 353, 356, 357 Simultaneous equation estimation, 453, 461 Small sample distribution, 429 Smoothed random walk, 226 Spectral characteristic, 325 Spectral function, 322, 323

Spectral matrix function, 285 Spectral representation, 365 Spectrum of the process, 312 Speech recognition, 389-412 State-space models, 192, 258, 263, 264, 374 State-space representation, 157 State-variable estimation, 222 State-variable feedback, 224 Stationarity condition, 90 Stationary covariances, 351,360 Stationary independent increments, 350 Stationary invertible processes, 439 Statistical ergodic theorem, 318 Stochastic differential equation, 31 Stochastic dynamical systems, 51, 52, 67 Strassen characterization, 123 Stratified sampling, 343, 346, 348, 353, 355, 356, 358 Strict stationarity, 1,417 Strong consistency, 370, 421 Strongly harmonizable time series, 323 Strong robustness, 124 Structural identifiability, 271,272 Sufficient statistic, 341,345 Systematic sampling, 344, 355 Tensor notation, 445 Testing for coefficient evolution, 427 Three-step procedure, 444 Time-series influence curve, 152 Time-variable parameter estimation, 232 Transfer function models, 104, 215, 216, 217, 259 Trigonometric polynomials, 371 Two-step estimator(s), 440, 444 Typical diagonal element, 379 Unequally spaced data, 157 Unit roots, 434, 439 Variance component models, 173 Varying coefficient models, 413, 414 Vector ARMA models, 87, 116 Vibration systems, 33 Vitali variation, 323 Weakly harmonizable time series, 323 Weakly stationary process, 312 Weakly stationary time series, 322 Wide band, 371 Wiener, 350, 352, 356 Yule-Walker equation, 180 Yule-Walker estimates, 395, 396, 406, 409

H a n d b o o k of Statistics Contents of Previous V o l u m e s

V o l u m e 1. A n a l y s i s of V a r i a n c e E d i t e d b y P. R. K r i s h n a i a h 1980 xviii + 1002 p p .

1. Estimation of Variance Components by C. R. Rao and J. Kleffe 2. Multivariate Analysis of Variance of Repeated Measurements by N. H. Timm 3. Growth Curve Analysis by S. Geisser 4. Bayesian Inference in MANOVA by S. J. Press 5. Graphical Methods for Internal Comparisons in A N O V A and MANOVA by R. Gnanadesikan 6. Monotonicity and Unbiasedness Properties of ANOVA and MANOVA Tests by S. Das Gupta 7. Robustness of ANOVA and MANOVA Test Procedures by P. K. Ito 8. Analysis of Variance and Problems under Time Series Models by D. R. Brillinger 9. Tests of Univariate and Multivariate Normality by K. V. Mardia 10. Transformations to Normality by G. Kaskey, B. Kolman, P. R. Krishnaiah and L. Steinberg 11. ANOVA and MANOVA: Models for Categorical Data by V. P. Bhapkar 12. Inference and the Structural Model for ANOVA and MANOVA by D. A. S. Fraser 13. Inference Based on Conditionally Specified ANOVA Models Incorporat,. ing Preliminary Testing by T. A. Bancroft and C.-P. Han 14. Quadratic Forms in Normal Variables by C. G. Khatri 15. Generalized Inverse of Matrices and Applications to Linear Models by S. K. Mitra 16. Likelihood Ratio Tests for Mean Vectors and Covariance Matrices by P. R. Krishnaiah and J. C. Lee 485

486 17. 18. 19. 20. 21. 22. 23. 24. 25.

Contents of previous volumes

Assessing Dimensionality in Multivariate Regression by A. J. Izenman Parameter Estimation in Nonlinear Regression Models by H. Bunke Early History of Multiple Comparison Tests by H. L. Harter Representations of Simultaneous Pairwise Comparisons by A. R. Sampson Simultaneous Test Procedures for Mean Vectors and Covariance Matrices by P. R. Krishnaiah, G. S. Mudholkar and P. Subbaiah Nonparametric Simultaneous Inference for Some MANOVA Models by P. K. Sen Comparison of Some Computer Programs for Univariate and Multivariate Analysis of Variance by R. D. Bock and D. Brandt Computations of Some Multivariate Distributions by P. R. Krishnaiah Inference on the Structure of Interaction in Two-Way Classification Model by P. R. Krishnaiah and M. Yochmowitz

V o l u m e 2. C l a s s i f i c a t i o n , P a t t e r n R e c o g n i t i o n a n d R e d u c t i o n of Dimensionality E d i t e d by P. R. K r i s h n a i a h a n d L. N. K a n a l 1982 xxii + 903 pp.

1. Discriminant Analysis for Time Series by R. H. Shumway 2. Optimum Rules for Classification into Two Multivariate Normal Populations with the Same Covariance Matrix by S. Das Gupta 3. Large Sample Approximations and Asymptotic Expansions of Classification Statistics by M. Siotani 4. Bayesian Discrimination by S. Geisser 5. Classification of Growth Curves by 3. C. Lee 6. Nonparametric Classification by J. D. Broffitt 7. Logistic Discrimination by J. A. Anderson 8. Nearest Neighbor Methods in Discrimination by L. Devroye and T. J. Wagner 9. The Classification and Mixture Maximum Likelihood Approaches to Cluster Analysis by G. J. McLachlan 10, Graphical Techniques for Multivariate Data and for Clustering by J. M. Chambers and B. Kleiner ll. Cluster Analysis Software by R. K. Blashfield, M. S. Aldenderfer and L. C. Morey 12. Single-link Clustering Algorithms by F. J. Rohlf 13. Theory of Multidimensional Scaling by J. de Leeuw and W. lfeiser 14. Multidimensional Scaling and its Applications by M. W!sh and J. D. Carroll 15. Intrinsic Dimensionality Extraction by K. Fukunaga

Contents of previous volumes

487

16. Structural Methods in Image Analysis and Recognition by L. N. Kanal, B. A. Lambird and D. Lavine 17. Image Models by N. Ahuja and A. Rosenfeld 18. Image Texture Survey by R. M. Haralick 19. Applications of Stochastic Languages by K. S, Fu 20. A Unifying Viewpoint on Pattern Recognition by J. C. Simon, E. Backer and J. Sallentin 21. Logical Functions in the Problems of Empirical Prediction by G. S. Lbov 22. Inference and Data Tables and Missing Values by N. G. Zagoruiko and V. N. Yolkina 23. Recognition of Electrocardiographic Patterns by J. H. van Bemmel 24. Waveform Parsing Systems by G. C. Stockman 25. Continuous Speech Recognition: Statistical Methods by F. Jelinek, R. L. Mercer and L. R. Bahl 26. Applications of Pattern Recognition in Radar by A. A. Grometstein and W. H, Schoendorf 27. White Blood Cell Recognition by E. S. Gelsema and G. H. Landweerd 28. Pattern Recognition Techniques for Remote Sensing Applications by P. H. Swain 29. Optical Character Recognition--Theory and Practice by G, Nagy 30. Computer and Statistical Considerations for Oil Spill Identification by Y. T. Chien and T. J. Killeen 31. Pattern Recognition in Chemistry by B, R. Kowalski and S. Wold 32. Covariance Matrix Representation and Object-Predicate Symmetry by T. Kaminuma, S, Tomita and S. Watanabe 33. Multivariate Morphometrics by R. A. Reyment 34. Multivariate Analysis with Latent Variables by P. M. Bentler and D. G. Weeks 35. Use of Distance Measures, Information Measures and Error Bounds in Feature Evaluation by M. Ben-Bassat 36. Topics in Measurement Selection by J. M. Van Campenhout 37. Selection of Variables Under Univariate Regression Models by P. R. Krishnaiah 38. On the Selection of Variables Under Regression Models Using Krishnaiah's Finite Intersection Tests by J. L. Schmidhammer 39. Dimensionality and Sample Size Considerations in Pattern Recognition Practice by A. K. Jain and B. Chandrasekaran 40. Selecting Variables in Discriminant Analysis for Improving upon Classical Procedures by W. Schaafsma 41. Selection of Variables in Discriminant Analysis by P. R. Krishnaiah

488

Contents of previous volumes

V o l u m e 3. T i m e S e r i e s in t h e F r e q u e n c y D o m a i n E d i t e d b y D . R. B r i l l i n g e r a n d P. R . K r i s h n a i a h 1983 xiv + 485 pp.

1. Wiener Filtering (with emphasis on frequency-domain approaches) by R. J. Bhansali and D. Karavellas 2. The Finite Fourier Transform of a Stationary Process by D. R. Brillinger 3. Seasonal and Calendar Adjustment by W. S. Cleveland 4. Optimal Inference in the Frequency Domain by R. B. Davies 5. Applications of Spectral Analysis in Econometrics by C. W. J. Granger and R. Engle 6. Signal Estimation by E. J. Hannan 7. Complex Demodulation: Some Theory and Applications by T. Hasan 8. Estimating the Gain of A Linear Filter from Noisy Data by M. J. Hinich 9. A Spectral Analysis Primer by L. H. Koopmans 10. Robust-Resistant Spectral Analysis by R. D. Martin 11. Autoregressive Spectral Estimation by E. Parzen 12. Threshold Autoregression and Some Frequency-Domain Characteristics by J. Pemberton and H. Tong 13. The Frequency-Domain Approach to the Analysis of Closed-Loop Systems by M. B. Priestley 14. The Bispectral Analysis of Nonlinear Stationary Time Series with Reference to Bilinear Time-Series Models by T. Subba Rao 15. Frequency-Domain Analysis of Multidimensional Time-Series Data by E. A. Robinson 16. Review of Various Approaches to Power Spectrum Estimation by P. M. Robinson 17. Cumulants and Cumulant Spectra by M. Rosenblatt 18. Replicated Time-Series Regression: An Approach to Signal Estimation and Detection by R. H. Shumway 19. Computer Programming of Spectrum Estimation by T. Thrall 20. Likelihood Ratio Tests on Covariance Matrices and Mean Vectors of Complex Multivariate Normal Populations and their Applications in Time Series by P. R. Krishnaiah, J. C. Lee and T. C. Chang

Contents of previous volumes

489

V o l u m e 4. N o n p a r a m e t r i c M e t h o d s E d i t e d by P. R. K r i s h n a i a h a n d P. K. Sen 1984 xx + 968 pp.

1. Randomization Procedures by C. B. Bell and P. K. Sen 2. Univariate and Multivariate Multisample Location and Scale Tests by V. P. Bhapkar 3. Hypothesis of Symmetry by M. Hu~kovfi 4. Measures of Dependence by K. Joag-Dev 5. Tests of Randomness against Trend or Serial Correlations by G. K, Bhattacharyya 6. Combination of independent Tests by J. L. Folks 7. Combinatorics by L. Takfics 8. Rank Statistics and Limit Theorems by M. Ghosh 9. Asymptotic Comparison of T e s t s - A Review by K. Singh 10. Nonparametric Methods in Two-Way Layouts by D. Quade 11. Rank Tests in Linear Models by J. N. Adichie 12. On the Use of Rank Tests and Estimates in the Linear Model by J. C. Aubuchon and T. P. Hettmansperger 13. Nonparametric Preliminary Test Inference by A. K. Md. E. Saleh and P. K. Sen 14. Paired Comparisons: Some Basic Procedures and Examples by R. A. Bradley 15. Restricted Alternatives by S. K. Chatterjee 16. Adaptive Methods by M, Hu~kovfi 17. Order Statistics by J. Galambos 18. Induced Order Statistics: Theory and Applications by P. K. Bhattacharya 19. Empirical Distribution Function by E. Csfiki 20. Invariance Principles for Empirical Processes by M. Cs6rg6 21. M-, L- and R-estimators by J. Jureekovfi 22. Nonparametric Sequential Estimation by P. K. Sen 23. Stochastic Approximation by V. Dupae 24. Density Estimation by P. R6v6sz 25. Censored Data by A. P. Basu 26. Tests for Exponentiality by K. A. Doksum and B. S. Yandell 27. Nonparametric Concepts and Methods in Reliability by M. Hollander and F. Proschan 28. Sequential Nonparametric Tests by U. M/iller-Funk 29. Nonparametric Procedures for some Miscellaneous Problems by P. K. Sen 30. Minimum Distance Procedures by R. Beran 31. Nonparametric Methods in Directional Data Analysis by S. R. Jammalamadaka 32. Application of Nonparametric Statistics to Cancer Data by H. S. Wieand

490

Contents of previous volumes

33. Nonparametric Frequentist Proposals for Monitoring Comparative Survival Studies by M. Gail 34. Meteorological Applications of Permutation Techniques based on Distance Functions by P. W. Mielke, Jr. 35. Categorical Data Problems Using Information Theoretic Approach by S. Kullback and J. C. Keegel 36. Tables for Order Statistics by P. R. Krishnaiah and P. K. Sen 37. Selected Tables for Nonparametric Statistics by P. K. Sen and P. R. Krishnaiah