Multidimensional Stationary Time Series: Dimension Reduction and Prediction [1 ed.] 9780367569327, 9781003107293

This book gives a brief survey of the theory of multidimensional (multivariate), weakly stationary time series, with emphasis on dimension reduction and prediction.


English. Pages: [296]. Year: 2021.



Table of contents :
Cover
Half Title
Title Page
Copyright Page
Dedication
Contents
Foreword
Preface
List of Figures
Symbols
1. Harmonic analysis of stationary time series
1.1. Introduction
1.2. Covariance function and spectral representation
1.3. Spectral representation of multidimensional stationary time series
1.4. Constructions of stationary time series
1.4.1. Construction 1
1.4.2. Construction 2
1.4.3. Construction 3
1.4.4. Construction 4
1.4.4.1. Discrete Fourier Transform
1.4.4.2. The construction
1.5. Estimating parameters of stationary time series
1.5.1. Estimation of the mean
1.5.2. Estimation of the covariances
1.5.3. Periodograms
1.6. Summary
2. ARMA, regular, and singular time series in 1D
2.1. Introduction
2.2. Time invariant linear filtering
2.3. Moving Average processes
2.4. Autoregressive processes
2.5. Autoregressive moving average processes
2.6. Wold decomposition in 1D
2.7. Spectral form of the Wold decomposition
2.8. Factorization of rational and smooth densities
2.8.1. Rational spectral density
2.8.2. Smooth spectral density
2.9. Classification of stationary time series in 1D
2.10. Examples for singular time series
2.10.1. Type (0) singular time series
2.10.2. Type (1) singular time series
2.10.3. Type (2) singular time series
2.11. Summary
3. Linear system theory, state space models
3.1. Introduction
3.2. Restricted input/output map
3.3. Reachability and observability
3.4. Power series and extended input/output maps
3.5. Realizations
3.6. Stochastic linear systems
3.6.1. Stability
3.6.2. Prediction, miniphase condition, and covariance
3.7. Summary
4. Multidimensional time series
4.1. Introduction
4.2. Linear transformations, subordinated processes
4.3. Stationary time series of constant rank
4.4. Multidimensional Wold decomposition
4.4.1. Decomposition with an orthonormal process
4.4.2. Decomposition with innovations
4.5. Regular and singular time series
4.5.1. Full rank processes
4.5.2. Generic regular processes
4.5.3. Classification of non-regular multidimensional time series
4.6. Low rank approximation
4.6.1. Approximation of time series of constant rank
4.6.2. Approximation of regular time series
4.7. Rational spectral densities
4.7.1. Smith–McMillan form
4.7.2. Spectral factors of a rational spectral density matrix
4.8. Multidimensional ARMA (VARMA) processes
4.8.1. Equivalence of different approaches
4.8.2. Yule–Walker equations
4.8.3. Prediction, miniphase condition, and approximation by VMA processes
4.9. Summary
5. Dimension reduction and prediction in the time and frequency domain
5.1. Introduction
5.2. 1D prediction in the time domain
5.2.1. One-step ahead prediction based on finitely many past values
5.2.2. Innovations
5.2.3. Prediction based on the infinite past
5.3. Multidimensional prediction
5.3.1. One-step ahead prediction based on finitely many past values
5.3.2. Multidimensional innovations
5.4. Spectra of spectra
5.4.1. Bounds for the eigenvalues of Cn
5.4.2. Principal component transformation as discrete Fourier transformation
5.5. Kálmán's filtering
5.6. Dynamic principal component and factor analysis
5.6.1. Time domain approach via innovations
5.6.2. Frequency domain approach
5.6.3. Best low-rank approximation in the frequency domain, and low-dimensional approximation in the time domain
5.6.4. Dynamic factor analysis
5.6.5. General Dynamic Factor Model
5.7. Summary
A. Tools from complex analysis
A.1. Holomorphic (or analytic) functions
A.2. Harmonic functions
A.3. Hardy spaces
A.3.1. First approach
A.3.2. Second approach
B. Matrix decompositions and special matrices
C. Best prediction in Hilbert spaces
D. Tools from algebra
Bibliography
Index


Multidimensional Stationary Time Series

Multidimensional Stationary Time Series: Dimension Reduction and Prediction

Marianna Bolla and Tamás Szabados

First edition published 2021
by CRC Press, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press, 2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2021 Marianna Bolla and Tamás Szabados

CRC Press is an imprint of Taylor & Francis Group, LLC

The right of Marianna Bolla and Tamás Szabados to be identified as authors of this work has been asserted by them in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

ISBN: 9780367569327 (hbk)
ISBN: 9781003107293 (ebk)

Typeset in CMR10 font by KnowledgeWorks Global Ltd.

To our families.


Foreword

The purpose of this book is to give a brief survey of the theory of multidimensional (multivariate) weakly stationary time series, with emphasis on dimension reduction and prediction. We restricted ourselves to the case of discrete time, since this is a natural model of measurements in practice, and also, it is a good viewpoint to extend these techniques to the theory of continuous time as well. Understanding the covered material requires a certain mathematical maturity, a degree of knowledge in probability theory and statistics, and also in linear algebra and real, complex and functional analysis. It is advantageous if the reader is familiar with some notions of abstract algebra, but the corresponding details may also be skipped. The main tools include harmonic analysis, state space methods of linear time-invariant systems, and methods that reduce the rank of the spectral density matrix. We made an effort to give an almost self-contained version of the theory, with proofs that seemed the easiest to grasp. Also, we collected important results from a rather wide range of topics that are relevant in the theory of stationary time series. We hope that this book can be used as a text for graduate courses and also as a reference material for researchers working with time series. The applications of multidimensional stationary time series cover a very wide range of fields nowadays; there are important applications in statistics, engineering, economics, finance, natural and social sciences.


Preface

“Deep is the well of the past. Should we not call it bottomless?”
(Thomas Mann: Joseph and his brothers)

To understand the remote past is indeed crucial if we want to classify the weakly stationary (briefly, stationary) time series. The presence of a remote past causes different types of singularities, which themselves make the future predictable with zero error. We use the beautiful spectral theory of stationary time series, developed mainly for one-dimensional processes a century ago by H. Cramér, H. Wold, A.N. Kolmogorov, and N. Wiener, just to mention the most prominent ones. Singular, or in other words, deterministic processes in 1D may arise due to a spectral distribution that is singular with respect to the Lebesgue measure (we call it Type (0) singularity), but even if an absolutely continuous spectral measure exists, the process can be singular (we call those Type (1) and Type (2) singularities), e.g. the sliding summation (two-sided moving average) that cannot be written as a one-sided moving average. The famous Wold decomposition is able to separate the singular and regular part of a non-singular process, where the regular part is, in fact, a one-sided moving average, and the singular part is of Type (0) in 1D. Here it is important that the Wold decomposition is not applicable to singular processes, as the possible Type (1) or Type (2) singularities cannot coexist with a regular process. Unfortunately, less attention has been paid to this distinction; however, understanding the downside makes us better able to understand the regular part, where prediction makes sense with positive error.

The theory extends to multidimensional processes where more complicated situations can occur, e.g. regular and singular parts may coexist when concentrated on different components. Regular processes are also called purely non-deterministic or causal, as probabilistic predictions can be made for the future on the basis of the past, of course, with positive error. Those are the innovations (called fundamental shocks in finance) that contain the added value of the future observations, and so, they provide a driving force of the whole process, and establish the method of dynamic factor analysis too.

We shall work with multidimensional stationary time series of discrete time, both in the frequency and in the time domain, and find analogies between these two approaches as for dimension reduction. Our main objects are (weakly) stationary time series with a spectral density matrix that has constant rank (almost everywhere in the frequency domain). Those correspond


to the sliding summation, which is in general singular, but still, its spectral density can be factorized, and the (two-sided) power series expansion of the transfer function provides the coefficient matrices, in terms of which the two-sided moving average can be written. An important special case of a constant rank process is the one-sided moving average, which (by Wold's theorem) is always a regular process. Here the (one-sided) power series expansion of the transfer function gives the coefficient matrices that are called impulse responses, and the regular process can be expanded with them in terms of the innovations. A further subclass of the regular processes is the set of processes that have a rational spectral density matrix (its entries are complex rational functions of e^{-iω}) and also a rational transfer function. When factorizing rational spectral densities, the usual row and column operations of the classical matrix decomposition algorithms should be adapted to modules over the ring of complex polynomials, to obtain e.g. the Smith–McMillan form.

Under some additional conditions, there is a one-to-one correspondence between the spectral density matrix (frequency domain) and the pair of the transfer function and the innovation covariance matrix (time domain). Under the same conditions, there is also an (infinite) matrix relation between the block Hankel matrix of the impulse response matrices and that of the autocovariance matrices. However, by Kronecker's theorem, in the possession of a rational spectral density, these block Hankel matrices have bounded rank, so we can confine ourselves to finite segments of them. The processes with a rational spectral density matrix have either a stable VARMA (vector autoregressive plus moving average), or an MFD (matrix fractional description), or a state space representation; the latter are linear dynamical systems with hidden state variables and observable variables; further, they contain white noises as error terms. All these representations have a finite number of parameter matrices that are sometimes overparametrized, and are not always uniquely determined by the transfer function, but can be estimated under some conditions. We do not want to go into the details of the methods of estimation, but will discuss the Yule–Walker equations that emerge in different situations.

In practice, we have a finite time series observed from a starting time, and the infinite past can only be imitated if we go farther and farther to the future. Ergodicity under general conditions guarantees that we can make estimations based on a single trajectory, observed for a long enough time. Luckily, the theory of finite past predictions has been well elaborated since Gauss, with the projection principle that is widely used in the theory of multivariate regression (now our sample is the set of the finitely many, say n, past observations). As n → ∞, we approach the infinite past situation. The innovations, obtained by the Gram–Schmidt orthogonalization and, in the multidimensional case, by the block Cholesky decomposition, will better and better approach the innovations based on the infinite past prediction; whereas the coefficient matrices will better and better approach the impulse responses. In this way, we assign a bottom to the well and ignore the remote past.


When the spectral density matrix has a lower rank than its size, then it is important that its rank is equal to the dimension of the innovation subspaces in the multidimensional Wold decomposition. In this case, the impulse response matrices are not quadratic, but rectangular. In practice, the rank of the spectral density matrix is only estimated from a sample, and hence, we can speak of an essential rank of it. The theory of the General Dynamic Factor Models (GDFM) makes it possible to separate the structural (large absolute value) eigenvalues of the spectral density matrix and to look for lower dimensional innovations. Actually, this is the task of dynamic factor analysis, and it can be realized with different techniques. As for the dimension reduction, we also introduce the "spectra of spectra" technique, i.e. in the possession of a d-dimensional time series observed at n consecutive time instances, we take the spectral decomposition of the d×d spectral density matrices at the n Fourier frequencies, and keep only the structural eigenvalues with corresponding eigenvectors. The computational complexity is of order nd^3, which is smaller than that of the block Cholesky decomposition of the large nd × nd block Toeplitz matrix. However, it is proved that there is an asymptotic correspondence between the nd eigenvalues of this large matrix and the union of the d eigenvalues of the spectral density matrices at the n Fourier frequencies. Of course, both the autocovariance matrices and these spectral density matrices are only estimated by means of multivariate periodograms.

We also touch upon Kálmán's filtering. This method is able to predict the state variables in a state space model based on finitely many observed variables (past ones, current ones, or future ones, corresponding to prediction, filtering, or smoothing problems, respectively). As a byproduct, it is able to find the innovations as well.

The organization of the chapters is as follows. In Chapter 1, we introduce the equivalent views of a weakly stationary, multidimensional time series (with discrete time and complex state space): they have a non-negative definite autocovariance matrix function and a non-negative definite matrix valued spectral measure on [−π, π] (the two are related via Fourier transformation). We deal with complex valued multivariate time series, though most real-life time series (e.g. financial ones) have real coordinates. However, with the spread of quantum computers and quantum information theory, there are emerging telecommunication systems that are complex valued by nature. Our tools are well applicable to this situation, since the frequency domain calculations need complex harmonic analysis anyway. In Chapter 2, we start with simple 1D time series, and from ARMA processes we arrive at the regular and non-singular ones with the Wold decomposition, in an inductive way. We also give a classification of the 1D weakly stationary processes, by distinguishing between different types of singularities. Chapter 3 is devoted to abstract algebraic tools in order to find matrix factorizations that keep the rationality. This is important when we work with rational spectral densities and state space models; however, this chapter can be skipped. Chapter 4 investigates the types of multidimensional stationary processes in a deductive way.
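The "spectra of spectra" technique described above can be tried out numerically. The following Python sketch is only a rough illustration under our own simplifying assumptions (a simulated series and a plain moving-average smoother for the periodogram matrices); it is not the estimator developed later in the book.

```python
import numpy as np

# Sketch: estimate the d x d spectral density matrix at each Fourier frequency
# and keep only its k leading ("structural") eigenvalues; overall cost O(n d^3).
rng = np.random.default_rng(0)
n, d, k = 512, 5, 2                                  # toy sizes (assumptions)
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # simulated data

Xf = np.fft.fft(X - X.mean(0), axis=0)               # DFT of each coordinate
P = np.einsum('tj,tk->tjk', Xf, Xf.conj()) / (2 * np.pi * n)   # periodogram matrices
m = 8                                                # smoothing half-width (assumption)
kernel = np.ones(2 * m + 1) / (2 * m + 1)
f_hat = np.stack([np.convolve(P[:, a, b], kernel, mode='same')
                  for a in range(d) for b in range(d)], axis=1).reshape(n, d, d)

# eigenvalues of the (Hermitian-symmetrized) estimated spectral density matrices
eigvals = np.linalg.eigvalsh(0.5 * (f_hat + np.conj(np.transpose(f_hat, (0, 2, 1)))))
structural = eigvals[:, -k:]                         # k largest at each frequency
print(structural.shape)                              # (n, k)
```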


First, the ones with a spectral density matrix of constant rank are introduced; these are, in fact, the sliding summations (two-sided moving averages). A special case is the one-sided moving average, which is a (purely) regular, causal process. Types of singularities are also characterized. A subclass of regular processes is constituted by those of rational spectral density, which are nothing else but the VARMA processes, or equivalently, the state space models that also have an MFD; they can be finitely parametrized and predicted by finitely many past observations and shocks. In Chapter 5, we perform predictions in the time domain and introduce the original version of the Wold decomposition. We also establish analogies between the time and frequency domain notions. The chapter closes with a brief discussion of Kálmán's filtering and dynamic factor analysis techniques.

We believe that this book fills the gap between the classical theory and modern techniques, overcoming the curse of dimensionality in multivariate time series. We collect pieces from the very original classical works (H. Cramér, H. Wold, N. Wiener, A.N. Kolmogorov, Yu.A. Rozanov, and R.E. Kálmán) that are "lost in translation" but needed for a deeper understanding of the topic.

Marianna Bolla and Tamás Szabados
Budapest, November 2020 (during the COVID-19 pandemic)

List of Figures

2.1 A typical trajectory and its prediction of an MA(4) process, with its covariance function and spectral density in Example 2.1.
2.2 The mean square prediction error and det(C_n) of an MA(4) process in Example 2.1.
2.3 A typical trajectory and its prediction of an MA(∞) process, with its covariance function and spectral density in Example 2.2.
2.4 The mean square prediction error and det(C_n) of an MA(∞) process in Example 2.2.
2.5 A typical trajectory and its prediction of a sliding summation process, with its covariance function and spectral density in Example 2.3.
2.6 The mean square prediction error and det(C_n) of a sliding summation process in Example 2.3.
2.7 A typical trajectory and its prediction of a stable AR(4) process, with its b_k coefficients, covariance function, and spectral density in Example 2.4.
2.8 The mean square prediction error and det(C_n) of a stable AR(4) process in Example 2.4.
2.9 A typical trajectory and its prediction of a stable ARMA(4,4) process, with its b_k coefficients, covariance function, and spectral density in Example 2.5.
2.10 The mean square prediction error and det(C_n) of an ARMA(p, q) process in Example 2.5.
2.11 Spectral measure and covariance function of a Type (0) singular process in Example 2.6.
2.12 Prediction error and det(C_n) of a Type (0) singular process in Example 2.6.
2.13 Spectral density and covariance function of a Type (1) singular process in Example 2.7.
2.14 Prediction error and det(C_n) of a Type (1) singular process in Example 2.7.
2.15 Spectral density and covariance function of a Type (2) singular process in Example 2.8.
2.16 Prediction error and det(C_n) of a Type (2) singular process in Example 2.8.
4.1 Typical trajectories of a 3D VAR(2) process, with its impulse response functions and covariance functions in Example 4.1.
4.2 Spectral densities in Example 4.1.
5.1 Eigenvalue processes of the estimated M_j (j = 0, ..., 534) matrices over [0, 2π], ordered decreasingly in Example 5.1.
5.2 Approximation of the original time series by a rank 3 time series in Example 5.1.
5.3 The 3 leading PC's of the stock exchange data in the time domain in Example 5.1.

Symbols

Symbol Description (Algebra)

N : Set of natural numbers.
Z : Set of integers.
R : Set of real numbers.
C : Set of complex numbers.
C^d : Set of complex d-dimensional vectors.
C^{d×r} : Set of complex matrices of size d × r.
T : The unit circle in C.
D : The open unit disc in C.
a, b, ..., x, y : Real scalars.
z, α, β, ... : Complex scalars.
a, ..., z : Column vectors (of real or complex components).
A, B, ... : Matrices (of real or complex entries).
0 : Zero vector.
O : Zero matrix.
I_d : d × d identity matrix.
a^T, A^T, ... : Transpose of a (real or complex) vector or matrix.
a^*, A^*, ... : Adjoint (conjugate transpose) of a complex vector or matrix.
z̄, ā, Ā, ... : Conjugate of a complex scalar or entrywise conjugate of a complex vector or matrix.
H : Hankel matrix.
A ⊗ B : Kronecker product of matrices A and B.
||A|| : Spectral norm of matrix A.
||A||_F : Frobenius norm of matrix A.
ρ(A) : Spectral radius of matrix A.
|A|, det(A) : Determinant of the quadratic matrix A.
tr(A) : Trace of the quadratic matrix A.
rank(A) : Rank of the matrix A.
Ker(A) : Kernel space of the matrix A.
R(A), Range(A) : Image space (range) of the matrix A.
diag[λ_1, ..., λ_d] : Diagonal matrix with λ_1, ..., λ_d in its main diagonal.
Span{a_j : j ∈ J} : Linear span: all finite linear combinations of a set of vectors.
span{a_j : j ∈ J} : Closure of the linear span of a set of vectors.

δ_{jk} : Kronecker delta, 1 if j = k; 0 if j ≠ k.
Σ = (A, B, C, D) : Linear system with linear operators A, B, C, D.
R : Reachability matrix.
O : Observability matrix.
φ_0 : Restricted input/output map.
φ : Extended input/output map.
U : Set of input sequences.
Y : Set of output sequences.
H(z) : Transfer function.
{H_ℓ : ℓ = 1, 2, ...} : Impulse response function.

Symbol Description (Probability and analysis)

P : Probability.
E : Expectation.
Var : Variance.
Cov : Covariance (matrix).
Corr : Correlation.
X, Y, ξ, η, ... : Random variables.
X, Y, ξ, η, ... : Random vectors.
{X_t}_{t∈Z} : d-dimensional time series (of complex or real components) with discrete time.
C(h), h ∈ Z : d × d, hth order autocovariance matrix of a d-dimensional, weakly stationary time series.
C_n : n × n Toeplitz matrix of autocovariances of a 1D, weakly stationary time series.
C_n : nd × nd block Toeplitz matrix of autocovariance matrices of a d-dimensional, weakly stationary time series.
f(ω) = [f^{ij}(ω)] : d × d spectral density matrix of {X_t}, ω ∈ [−π, π] or [0, 2π].
dF(ω) = [dF^{ij}(ω)] : d × d spectral measure matrix of {X_t}, ω ∈ [−π, π] or [0, 2π].
S : Right (forward) shift operator (unitary operator).
L = S^{−1} = S^* : Left (backward) shift operator.
WN(Σ) : White noise sequence with covariance matrix Σ.
ℓ^1 : Banach space of absolutely summable sequences.
ℓ^2 : Hilbert space of square summable sequences.
L^p : Banach space of functions f s.t. |f|^p is integrable, 1 ≤ p ≤ ∞.
H(X) : Closure of the linear span of the components of a time series {X_t}.
X̃_T : Empirical mean.
Ĉ(h) : Empirical covariance matrix function.
ω_j = 2πj/n : jth Fourier frequency, j = 0, ..., n − 1.
ρ_j = e^{iω_j} : jth primitive root of 1, j = 0, ..., n − 1.
dF^{Y,X}(ω) : Joint spectral measure of {(X_t, Y_t)}.
f^{Y|X}(ω) : Conditional spectral density of {Y_t} w.r.t. {X_t}.
log, ln : Natural logarithm, with e as its base.
o(f(n)) : 'Little o' of f(n), i.e. a function of n such that lim_{n→∞} o(f(n))/f(n) = 0.
O(f(n)) : 'Big O' of f(n), i.e. a function of n such that O(f(n)) ≤ C f(n) with some constant C, independent of n.

Symbol Description (Verbal abbreviations)

1D : One dimensional (time series).
multi-D : Multidimensional (time series).
w.r.t. : With respect to.
a.e. : Almost everywhere (w.r.t. a measure).
i.i.d. : Independent, identically distributed.
p.d.f. : Probability density function.
c.d.f. : Cumulative distribution function.
SD : Spectral decomposition (of a self-adjoint matrix).
parsimonious SD : Spectral decomposition with minimum number of dyads.
SVD : Singular value decomposition (of a rectangular matrix).
Gram-decomp. : Decomposition G = AA^* of a self-adjoint matrix G.
TLF : Time invariant linear filter.
AR(p) : pth order autoregressive process (in 1D).
MA(q) : qth order moving average process (in 1D).
ARMA(p, q) : pth order autoregressive plus qth order moving average process (in 1D).
VAR(p) : pth order vector autoregressive process (in multi-D).
VMA(q) : qth order vector moving average process (in multi-D).
VARMA(p, q) : pth order vector autoregressive plus qth order vector moving average process (in multi-D).
RMSE : Root mean square error.
Toeplitz matrix : Quadratic matrix that has the same entries along its main diagonal and along all lines parallel to the main diagonal.
Hankel matrix : Quadratic matrix that has the same entries along its anti-diagonal and along all lines parallel to its anti-diagonal.
block Toeplitz : Block matrix which is Toeplitz in terms of its blocks.
block Hankel : Block matrix which is Hankel in terms of its blocks.
DFT : Discrete Fourier Transform.
IDFT : Inverse Discrete Fourier Transform.

1 Harmonic analysis of stationary time series

1.1 Introduction

A widely applicable notion of random fields is the one of multidimensional (or multivariate) time series X_t = (X_t^1, ..., X_t^d), where t ∈ R is the time, d ∈ N is the dimension, and for each fixed t, each coordinate X_t^j (j = 1, ..., d) is a complex valued random variable on the same probability space (Ω, F, P). Sometimes, we investigate the special case when the random variables are real valued. Here we consider time series with discrete time and state space C^d.

We concentrate on stationary time series, mainly in the wide sense, the behavior of which is irrespective of time shift. Moreover, the assumption of weak stationarity needs only the first and second moments and cross-moments (second-order processes). This allows us to use a huge machinery of analytical tools and the theory of Hilbert spaces. Fortunately, many time series in practice can be approximated by stationary time series, possibly after some operations that remove e.g. trend and seasonality from the time series.

We prove equivalent notions of weak stationarity in terms of autocovariance matrix functions and matrix valued spectral measures. The spectral representation of the process itself is also given by orthogonal increments (Cramér's representation). We also give four different constructions for stationary time series, and discuss ergodicity of the estimates for the mean and covariances. In this way, we are able to make inferences based on a single trajectory, observed for a long enough time. Equivalent forms of a periodogram are also given.

1.2 Covariance function and spectral representation

Definition 1.1. The d-dimensional time series X_t (t ∈ R) is strongly stationary (or stationary in the strong sense) if for any h ∈ R, n ∈ N, and time instances t_1 < t_2 < ··· < t_n, the joint distribution of (X_{t_1+h}, ..., X_{t_n+h}) is the same as the joint distribution of (X_{t_1}, ..., X_{t_n}). That is, the joint distributions are invariant for any time shift.

The next conditions of weak stationarity are simpler to check in practice.

Definition 1.2. The d-dimensional time series X_t (t ∈ R) is weakly stationary (or stationary in the wide sense) if it has finite expectation and finite covariance function that do not depend on time shift:

EX_t = μ = [μ_1, ..., μ_d]^T ∈ C^d,

c_{jk}(h) := Cov(X_{t+h}^j, X_t^k) = Cov(X_h^j, X_0^k) = E[(X_h^j − μ_j) \overline{(X_0^k − μ_k)}],

where t, h ∈ R; j, k = 1, ..., d, and X_t = [X_t^1, ..., X_t^d]^T in terms of its components.

Note that strong stationarity implies the weak one if the process has finite second moments. If the time series is Gaussian, then the two notions are equivalent, because Gaussian distributions are uniquely determined by their expectations and covariances. Further, to any weakly stationary process there exists a strongly stationary Gaussian one with the same expectation and covariance matrix function. Weakly stationary processes are sometimes called second-order processes, as only their first and second moments are used in the theory describing their behavior.

Without loss of generality, from now on we assume that μ = 0, in which case

Var(X_t^k) := E(|X_t^k|^2),   Cov(X_{t+h}^j, X_t^k) := E(X_{t+h}^j \overline{X_t^k}),

where the complex conjugation on the second factor can be disregarded if the components are real valued. The main object of the present book is time series with discrete time.

Definition 1.3. The d-dimensional time series X_t (t ∈ Z) is weakly stationary (or stationary in the wide sense) with discrete time if it has finite expectation and finite covariance function that do not depend on time shift:

EX_t = 0,   c_{jk}(h) := Cov(X_{t+h}^j, X_t^k) = Cov(X_h^j, X_0^k) = E(X_h^j \overline{X_0^k})

(t, h ∈ Z; j, k = 1, ..., d), where, as we said above, we may assume that EX_t = μ = 0.

From now on, the expression 'stationary time series' will refer to a discrete time weakly stationary process with zero expectation, unless it is explicitly stated otherwise. Considering complex valued random vectors simplifies the discussion; it is easy to describe the specific case of real valued random vectors whenever it is needed. Since we assume finite second moments, it follows that each component X_t^j, j = 1, ..., d, is square integrable, so belongs to the Hilbert space L^2(Ω, F, P) for any time instant t ∈ Z. Here and below, X is a column vector and X^* denotes its adjoint (conjugate transpose), a row vector.

We use the covariance matrix function (or autocovariance matrix function) C(h) = [c_{jℓ}(h)] ∈ C^{d×d} to describe weakly stationary time series,

C(h) := Cov(X_{t+h}, X_t) = E(X_{t+h} X_t^*),   h ∈ Z.

The covariance matrix function C does not depend on the time instant t ∈ Z because of the assumed weak stationarity. Clearly,

c_{jℓ}(−h) = E(X_{t−h}^j \overline{X_t^ℓ}) = E(X_t^j \overline{X_{t+h}^ℓ}) = \overline{c_{ℓj}(h)},

therefore

C(−h) = C^*(h).    (1.1)

By the Cauchy–Schwarz inequality, for any j, k = 1, ..., d and h ∈ Z,

|c_{jk}(h)| = |E(X_h^j \overline{X_0^k})| ≤ [E(|X_h^j|^2) E(|X_0^k|^2)]^{1/2} = [c_{jj}(0) c_{kk}(0)]^{1/2},

and so,

|c_{jj}(h)| ≤ c_{jj}(0).    (1.2)
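As a quick numerical illustration of these definitions (our own toy example, assuming NumPy and a simulated sample path), the empirical autocovariance matrices satisfy the symmetry relation (1.1) exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 3
E = rng.standard_normal((n + 1, d))
X = E[1:] + 0.6 * E[:-1]                  # a toy stationary (VMA(1)-type) sample
X = X - X.mean(axis=0)

def C_hat(X, h):
    """Empirical C(h) = (1/n) sum_t X_{t+h} X_t^*  (h may be negative)."""
    n = X.shape[0]
    if h >= 0:
        return X[h:].T @ X[:n - h].conj() / n
    return X[:n + h].T @ X[-h:].conj() / n

print(np.allclose(C_hat(X, -2), C_hat(X, 2).conj().T))   # (1.1): C(-h) = C(h)^*
print(np.round(C_hat(X, 1), 2))                          # roughly 0.6 * I here
```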

While the covariance matrix C(0) is self-adjoint (Hermitian) by (1.1), C(h) with a fixed time lag h ≠ 0 is not self-adjoint in general. However, for an arbitrary n ≥ 1, let us consider the following nd × nd matrix, which is a block Toeplitz matrix (see Definition B.9):

        [ C(0)       C(1)       C(2)       ···  C(n−1) ]
        [ C^*(1)     C(0)       C(1)       ···  C(n−2) ]
  C_n = [ C^*(2)     C^*(1)     C(0)       ···  C(n−3) ]    (1.3)
        [   ⋮          ⋮          ⋮         ⋱     ⋮    ]
        [ C^*(n−1)   C^*(n−2)   C^*(n−3)   ···  C(0)   ]
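A small sketch, assuming a toy VMA(1)-type covariance sequence, of how the block Toeplitz matrix C_n of (1.3) can be assembled and checked for positive semidefiniteness:

```python
import numpy as np

def block_toeplitz(C_list):
    """Assemble C_n of (1.3); C_list[h] = C(h), and C(-h) = C(h)^* is used below."""
    n, d = len(C_list), C_list[0].shape[0]
    Cn = np.zeros((n * d, n * d), dtype=complex)
    for r in range(n):
        for c in range(n):
            h = c - r
            block = C_list[h] if h >= 0 else C_list[-h].conj().T
            Cn[r * d:(r + 1) * d, c * d:(c + 1) * d] = block
    return Cn

# toy covariances of X_t = E_t + A E_{t-1}: C(0) = I + A A^T, C(1) = A, C(h) = 0 else
A = np.array([[0.5, 0.2], [0.0, 0.4]])
C_list = [np.eye(2) + A @ A.T, A] + [np.zeros((2, 2))] * 3
Cn = block_toeplitz(C_list)
print(np.min(np.linalg.eigvalsh(Cn)) >= -1e-10)   # True: C_n is positive semidefinite
```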

It is obviously self-adjoint and positive semidefinite, as it is the usual covariance matrix of the compounded nd-dimensional random vector [X_1^T, X_2^T, ..., X_n^T]^T.

The next two theorems give important characterizations of weakly stationary time series in general.

Theorem 1.1. The following back-and-forth statements are true.

(a) If a d-dimensional time series {X_t}_{t∈Z} is weakly stationary, then its covariance function C(h) (h ∈ Z) is non-negative definite (positive semidefinite), which means that

Σ_{j,k=1}^{n} a_k^* C(k − j) a_j ≥ 0,   ∀n ≥ 1,  ∀a_1, ..., a_n ∈ C^d.    (1.4)

Equivalently, the matrix C_n in (1.3) is self-adjoint and non-negative definite:

a^* C_n a ≥ 0,   ∀n ≥ 1,  ∀a ∈ C^{nd}.

(b) Conversely, to any non-negative definite matrix function C(h) (h ∈ Z) one can find a weakly stationary time series {X_t}_{t∈Z} with this covariance function.

Note that the fact that C(h) (h ∈ Z) is a positive semidefinite matrix function does not mean that the C(h)'s are positive semidefinite matrices. However, Equation (1.4) just means that the block Toeplitz matrix C_n is positive semidefinite in terms of its blocks.

Theorem 1.2. The following back-and-forth statements are also true.

(a) If a d-dimensional time series {X_t}_{t∈Z} is weakly stationary, then it has a non-negative definite spectral measure matrix dF on [−π, π] such that the covariance matrix function C(h) (h ∈ Z) of {X_t} can be represented as the Fourier transform of dF:

C(h) = ∫_{−π}^{π} e^{ihω} dF(ω)   (h ∈ Z).

Note that a measure matrix dF = [dF^{rs}] ∈ C^{d×d} on [−π, π] is called non-negative definite if for any interval (α, β] ⊂ [−π, π] and for any z_1, ..., z_d ∈ C, it holds that

Σ_{r,s=1}^{d} dF^{rs}((α, β]) z_r \overline{z_s} ≥ 0.    (1.5)

(b) Conversely, to any non-negative definite measure matrix dF on [−π, π], one can find a non-negative definite function C(h) (h ∈ Z), and so one can find a weakly stationary time series {X_t}_{t∈Z}, whose spectral measure matrix is dF.

The proof of these two theorems will be given step-by-step in the sequel, in this and subsequent sections. First, it is very simple that the matrix C_n in formula (1.3) is non-negative definite:

Σ_{k,j=1}^{n} a_k^* C(k − j) a_j = E |Σ_{k=1}^{n} a_k^* X_k|^2 = Var(Σ_{k=1}^{n} a_k^* X_k) ≥ 0    (1.6)

for any n ≥ 1 and a_1, ..., a_n ∈ C^d, see (1.1) as well. This proves Theorem 1.1(a). Later, a construction in Corollary 1.5 will show that the converse statement in Theorem 1.1(b) is also true.

Theorem 1.3 (Herglotz theorem) below proves Theorem 1.2(a) in the one-dimensional (1D) case. The proof of the general case will be quite circuitous, but relatively simple:

• The 1D case automatically extends to any covariance function c_{jj}(h), j = 1, ..., d, which can be represented by non-negative spectral measures dF^j(ω), c.f. (1.9).

• Then in Section 1.3 we will use this to give the important spectral representation of the time series {X_t} itself via a stochastic integral w.r.t. a random process {Z_ω} of orthogonal increments, c.f. (1.20).

• In turn, {Z_ω} will be used to define the complex spectral measures dF^{jℓ}(ω), j, ℓ = 1, ..., d, which can represent any covariance function c_{jℓ}(h). This way we are obtaining the spectral measure matrix dF(ω) that represents the covariance matrix function C(h), see Corollary 1.2.

There exists a shorter way to give first the spectral representation of {X_t} using the spectral theory of normal operators in a Hilbert space. Since this is a quite advanced tool, see e.g. [49, 12.22 Theorem], this approach will only be briefly sketched in Remark 1.8 later. Finally, Corollary 1.6 will give a construction to show the converse statement in Theorem 1.2(b).

Naturally, the 1D cases of formulas (1.3) and (1.6) are simpler. If {X_t}_{t∈Z} is a 1D stationary time series and its covariance function is

c(h) = Cov(X_{t+h}, X_t)   (t, h ∈ Z),

then

        [ c(0)       c(1)       c(2)      ···  c(n−1) ]
        [ c(−1)      c(0)       c(1)      ···  c(n−2) ]
  C_n = [ c(−2)      c(−1)      c(0)      ···  c(n−3) ]    (1.7)
        [   ⋮          ⋮          ⋮        ⋱     ⋮    ]
        [ c(−n+1)    c(−n+2)    c(−n+3)   ···  c(0)   ]

is an ordinary Toeplitz matrix. Clearly, it is self-adjoint and non-negative definite for any n ≥ 1:

Σ_{j,k=1}^{n} c(k − j) z_k \overline{z_j} = E |Σ_{k=1}^{n} z_k X_k|^2 = Var(Σ_{k=1}^{n} z_k X_k) ≥ 0   (z_1, ..., z_n ∈ C).

Also, it is the usual covariance matrix of (X_1, ..., X_n)^T and it is positive semidefinite for this reason too. This implies that in the d-dimensional case, the Toeplitz matrix of the covariance function of each coordinate X_t^j (j = 1, ..., d) is also self-adjoint and non-negative definite for any n ≥ 1.

Theorem 1.3 (Herglotz theorem). Let c : Z → C be a non-negative definite function. Then c has a spectral representation

c(h) = ∫_{−π}^{π} e^{ihω} dF(ω),   h ∈ Z,

where dF is a unique bounded non-negative measure on [−π, π].

Proof. Since c is non-negative definite, with any n ≥ 1 and z_j = e^{−ijω}, j = 1, ..., n, we get that

0 ≤ Σ_{j,ℓ=1}^{n} c(j − ℓ) e^{−i(j−ℓ)ω} = Σ_{k=−(n−1)}^{n−1} c(k) e^{−ikω} (n − |k|).

Define

f_n(ω) := (1/2π) Σ_{k=−(n−1)}^{n−1} c(k) e^{−ikω} (1 − |k|/n).    (1.8)

Then

f_n(ω) ≥ 0,   ∫_{−π}^{π} f_n(ω) dω = c(0) ≥ 0   (n ≥ 1).
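A minimal numerical check of (1.8), assuming a toy MA(1)-type covariance sequence: the Fejér-weighted sum f_n is non-negative and integrates to c(0).

```python
import numpy as np

c = {0: 1.0, 1: 0.4, -1: 0.4}             # toy covariance sequence (assumption)

def f_n(w, n):
    """Evaluate (1.8) on a grid of frequencies w."""
    k = np.arange(-(n - 1), n)
    ck = np.array([c.get(int(kk), 0.0) for kk in k])
    s = np.sum(ck * np.exp(-1j * k * w[:, None]) * (1 - np.abs(k) / n), axis=1)
    return np.real(s) / (2 * np.pi)

w = np.linspace(-np.pi, np.pi, 4001)
vals = f_n(w, n=50)
print(vals.min() >= -1e-12)               # True: f_n(w) >= 0
print(np.trapz(vals, w))                  # approximately c(0) = 1.0
```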

Let dF_n be the measure on [−π, π] whose non-negative density is f_n for n ≥ 1. Since these measures are bounded and the interval is compact, by Helly's selection theorem, see e.g. [37, p. 76], there exists a subsequence {n'} such that the sequence of measures {dF_{n'}} converges weakly to a limit dF; this will be the desired measure of the theorem.

First, for any integer h such that |h| < n,

∫_{−π}^{π} e^{ihω} f_n(ω) dω = c(h) (1 − |h|/n).

Thus for any h ∈ Z,

lim_{n→∞} ∫_{−π}^{π} e^{ihω} f_n(ω) dω = c(h).

By the definition of weak convergence, for a suitable subsequence {n'} one has

lim_{n'→∞} ∫_{−π}^{π} e^{ihω} dF_{n'}(ω) = ∫_{−π}^{π} e^{ihω} dF(ω).

Comparing these expressions proves the claimed representation. The uniqueness of the measure dF follows from two facts. First, any continuous function on [−π, π] can be uniformly approximated by trigonometric polynomials (Weierstrass' theorem), see e.g. [50, 4.25 Theorem]. Second, integrals of continuous functions uniquely determine the measure by the Riesz representation theorem, see e.g. [50, 2.14 Theorem].

Later, in Corollary 1.6, we will see that the converse of the Herglotz theorem is also true. The Herglotz theorem implies that each covariance function c_{jj}(h) (h ∈ Z), for j = 1, ..., d, has a spectral representation

c_{jj}(h) = ∫_{−π}^{π} e^{ihω} dF^j(ω),   F^j(−π) = 0,   F^j(π) = c_{jj}(0).    (1.9)

Here F^j(ω) := dF^j((−π, ω]), ω ∈ (−π, π], is a non-decreasing function, a spectral cumulative distribution function (c.d.f.).

Remark 1.1. Here we see two important special cases of the Herglotz theorem.


(a) The first case is when the covariance function c(h) is absolutely summable:

Σ_{h=−∞}^{∞} |c(h)| < ∞,

that is, {c(h)}_{h∈Z} is in ℓ^1. Then by (1.8), for k ≤ ℓ and for any ω ∈ [−π, π],

|f_ℓ(ω) − f_k(ω)| ≤ (1/2π) Σ_{|h|≥k} |c(h)| → 0

as k → ∞. Thus f_k converges uniformly on [−π, π] to a continuous spectral density function f as k → ∞. So dF is absolutely continuous w.r.t. Lebesgue measure, and c(h) becomes the Fourier coefficient of the function f:

c(h) = ∫_{−π}^{π} e^{ihω} f(ω) dω,   dF(ω) = f(ω) dω,    (1.10)

and, vice versa, f can be expanded into the Fourier series

f(ω) = (1/2π) Σ_{h=−∞}^{∞} c(h) e^{−ihω},   f(ω) ≥ 0,   ω ∈ [−π, π],    (1.11)

which is pointwise convergent (in fact, uniformly convergent) in [−π, π]. At this point we mention that analysis textbooks, see e.g. [50, 4.26], typically use a different convention for Fourier series: c_n = (1/2π) ∫_{−π}^{π} f(t) e^{−int} dt, f(t) ∼ Σ_{n=−∞}^{∞} c_n e^{int}. This is a matter of convention; several pieces of the time series literature use the same convention as we do, since the starting point of Fourier analysis in the theory of time series is Theorem 1.3, the spectral representation of the already defined covariance function.

that is, {c(h)}h∈Z is in `2 . Though this condition is weaker than the absolute summability, the Riesz–Fischer theorem can be applied. This theorem says, see e.g. [50, 4.26], that then there exists a function f ∈ L2 [−π, π] such that formula (1.10) holds and the Fourier series (1.11) converges in the L2 [−π, π] sense. Remark 1.2. The spectrum of a time series with discrete times Z has been defined on the interval [−π, π], more precisely, on (−π, π]. Another usual approach is to define the spectrum on the unit circle T = {z ∈ C : |z| = 1}. Clearly, there is a one-to-one correspondence between the two: (−π, π] 3 ω ↔ e−iω ∈ T.

8

Harmonic analysis of stationary time series

Consequently, there also exists a one-to-one correspondence between functions defined on the two. If φ : (−π, π] → C, then there is a unique function Φ : T → C such that φ(ω) = Φ(e−iω ) for any ω ∈ (−π, π]. Remark 1.3. In this book we need the space L2d (Ω, F, P) of square integrable d-dimensional complex valued random vectors X = (X 1 , . . . , X d ). We assume that EX = 0, and define the inner product hX, Yi :=

d X

E(X j Y j ).

j=1

Beside the usual linear combinations aX + bY with constants a, b ∈ C, we also define linear combinations AX + BY with matrices A, B ∈ Cd×d in L2d (Ω, F, P). A consequence of this is that one can define two types of linear span of a set of random vectors {Xγ : γ ∈ Γ} ⊂ L2d (Ω, F, P):   d n X X  aj` Xγ`j : aj` ∈ C, n ≥ 1 ; (1.12) M = Span{Xγ : γ ∈ Γ} :=   j=1 `=1   n X  Md = Spand {Xγ : γ ∈ Γ} := Aj Xγj : Aj ∈ Cd×d , n ≥ 1 .   j=1

Then M ⊂ L2 (Ω, F, P) and Md ⊂ L2d (Ω, F, P). Clearly, Md = M × · · · × M, a Cartesian product with d factors. One can similarly define two types of closed linear spans, span{Xγ : γ ∈ Γ} and spand {Xγ : γ ∈ Γ}, the closures of the respective linear spans in the respective Hilbert spaces. The relationship between two second-order random vectors (of not necessarily the same dimension) is described by their cross-covariance matrix : 0

0

d×d Cov(X, Y) = E(XY∗ ) = [E(X j Y k )]d,d , j,k=1 ∈ C

where X is a d-dimensional and Y is a d0 -dimensional complex random vector with L2 (Ω, F, P) components with zero expectations. Clearly, Cov(Y, X) = [Cov(X, Y)]∗ , and by Proposition B.1, Cov(X, X) is a self-adjoint, nonnegative definite (positive semidefinite) d × d matrix, the usual covariance matrix of X. We say that X and Y are orthogonal, denoted X ⊥ Y, if Cov(X, Y) = O, the zero matrix. This is a more general notion of orthogonality, and applicable to random vectors of different dimensions too. Observe that it is a stronger


condition than the standard orthogonality ⟨X, Y⟩ = tr(Cov(X, Y)) = 0 if the two vectors are of the same dimension.

The next lemma describes a slight generalization of the Projection Theorem C.1; see Appendix C. It clarifies that as far as one discusses optimal mean square approximations, orthogonal projections, orthogonality of random vectors, and linear spans of a set of random vectors, one may work with scalar components of random vectors in the scalar Hilbert space L^2(Ω, F, P) and its subspaces M defined by (1.12).

Lemma 1.1. Let M_d = M × ··· × M be a closed subspace in L^2_d(Ω, F, P) and Y ∈ L^2_d(Ω, F, P). Then there exists a unique Ŷ ∈ M_d such that

||Y − Ŷ|| ≤ ||Y − Z||  for any Z ∈ M_d,    (1.13)

equivalently,

Cov(Y − Ŷ, Z) = O  for any Z ∈ M_d,   (Y − Ŷ) ⊥ M_d.    (1.14)

For this Ŷ, we have Ŷ^j = Proj_M Y^j (j = 1, ..., d), the standard orthogonal projection of Y^j to the closed Hilbert subspace M ⊂ L^2(Ω, F, P).

Proof. Define Ŷ = (Ŷ^1, ..., Ŷ^d) componentwise by the standard projection theorem: Ŷ^j := Proj_M Y^j, j = 1, ..., d. Then for arbitrary Z ∈ M_d let Cov(Y − Ŷ, Z) =: [γ_{jk}]_{j,k=1}^{d}, for which we have

γ_{jk} = ⟨Y^j − Ŷ^j, Z^k⟩ = 0   ∀j, k.

This proves (1.14). Also,

||Y − Z||^2 = Σ_{j=1}^{d} ||Y^j − Z^j||^2 = Σ_{j=1}^{d} { ||Y^j − Ŷ^j||^2 + ||Ŷ^j − Z^j||^2 },    (1.15)

which proves (1.13). Conversely, if we choose any Z ∈ M_d, Z ≠ Ŷ, then (1.15) shows that it cannot have the minimum property (1.13).
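The content of Lemma 1.1, namely that the best approximation of a random vector is obtained componentwise, can be mimicked in a finite-sample setting. The sketch below is our own illustration, with ordinary least squares playing the role of the projection onto M.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, p = 5000, 3, 4
Z = rng.standard_normal((n, p))                  # variables spanning (a sample analogue of) M
B = rng.standard_normal((p, d))
Y = Z @ B + 0.1 * rng.standard_normal((n, d))    # target random vector

# joint least squares = componentwise least squares, column by column
coef_joint, *_ = np.linalg.lstsq(Z, Y, rcond=None)
coef_by_col = np.column_stack([np.linalg.lstsq(Z, Y[:, j], rcond=None)[0]
                               for j in range(d)])
print(np.allclose(coef_joint, coef_by_col))      # True

resid = Y - Z @ coef_joint
print(np.round(Z.T @ resid / n, 6))              # ~ zero matrix: (Y - Yhat) is orthogonal to M
```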

1.3 Spectral representation of multidimensional stationary time series

Let X_t = (X_t^1, ..., X_t^d), t ∈ Z, be a d-dimensional weakly stationary time series. For each fixed j = 1, ..., d, take the set of all finite linear combinations, with complex coefficients, of the random variables X_t^j (t ∈ Z), and let H(X^j) denote its closure in the Hilbert space L^2(Ω, F, P). Also, let H(X) denote the closure in L^2(Ω, F, P) of the linear span of H(X^1) ∪ ··· ∪ H(X^d):

H(X) = span{X_t : t ∈ Z} := span{X_t^j : t ∈ Z, j = 1, ..., d}.

The right time shift (or forward shift) S is a linear operator given by S X_t^j = X_{t+1}^j for j = 1, ..., d. The operator S can be extended to H(X) by linearity and continuity. The inverse S^{−1} of S is the left time shift L (or backward shift), defined similarly. Thus we may write that S^t X_0 = X_t for any t ∈ Z. Consider the following two equations for any k, t ∈ Z:

Cov(X_{t+k}, X_t) = Cov(S^{t+k} X_0, S^t X_0) = Cov(S^k X_0, S^{*t} S^t X_0),
Cov(S^k X_0, X_0) = Cov(X_k, X_0).

They are equal to each other if and only if S^* S = I, that is, S^{−1} = S^*. Thus the right time shift S is a unitary operator in H(X) if and only if the time series {X_t}_{t∈Z} is weakly stationary.

From now on we assume that {X_t}_{t∈Z} is a weakly stationary d-dimensional time series. Then, by continuous extension, S is a unitary operator in each H(X^j) (j = 1, ..., d) and in the whole H(X) as well. Also, for any A ⊂ Z and for any k ∈ Z,

S^k (span{X_t : t ∈ A}) = span{X_{t+k} : t ∈ A}.

The spectral representation of the covariance function can be extended to a spectral representation of the time series {X_t}_{t∈Z} itself. To this end, we introduce a natural isometry. Note that a map ψ from a Hilbert space H onto a Hilbert space G is called an isometry (isometric isomorphism) between H and G, if we have

⟨ψ(X), ψ(Y)⟩_G = ⟨X, Y⟩_H,   ∀X, Y ∈ H.

First consider a 1D weakly stationary time series {X_t}_{t∈Z} with spectral measure dF. Define the linear map ψ : H(X) → L^2([−π, π], B, dF) for a random variable X_t as ψ(X_t) = {e^{itω} : ω ∈ (−π, π]}. Here B denotes the σ-field of Borel sets in [−π, π]. We emphasize that the image of the random variable X_t is the function ω ↦ e^{itω} from (−π, π] onto the unit circle T. Then ψ is indeed an isometry:

⟨X_t, X_s⟩_{H(X)} = E(X_t \overline{X_s}) = c(t − s) = ∫_{−π}^{π} e^{i(t−s)ω} dF(ω) = ∫_{−π}^{π} e^{itω} \overline{e^{isω}} dF(ω) = ⟨e^{itω}, e^{isω}⟩_{dF}.

Next, this isometry ψ can be extended to finite linear combinations as

ψ( Σ_{k=1}^{m} a_k X_{t_k} ) = Σ_{k=1}^{m} a_k e^{i t_k ω},   ω ∈ (−π, π],

and finally to the whole Hilbert space H(X) by continuity. The image of H(X) is the closure of the set of trigonometric polynomials.

Proposition 1.1. The closure of the set of trigonometric polynomials in L^2([−π, π], B, dF) is the whole space L^2([−π, π], B, dF).

Proof. First, the space of continuous functions on [−π, π] is dense in L^2([−π, π], B, dF), see e.g. [50, 3.14 Theorem]. Second, the set of trigonometric polynomials is dense in the space of continuous functions on [−π, π], see e.g. [50, 4.25 Theorem].

Returning to the case of d-dimensional weakly stationary time series {X_t}, for any index j and component {X_t^j}, j = 1, ..., d, we can define an isometry ψ^j : H(X^j) → L^2([−π, π], B, dF^j) as described above. In this isometric isomorphism, the application of S^k in H(X^j) corresponds to a multiplication with the function e^{ikω} in L^2([−π, π], B, dF^j). It means that for any square-integrable periodic function on [−π, π], there exists a unique random variable in H(X^j) by (ψ^j)^{−1}. In particular, for any ω ∈ (−π, π] and indicator 1_{(−π,ω]}, there exists a unique complex valued random variable of 0 expectation:

Z_ω^j := (ψ^j)^{−1}(1_{(−π,ω]}) ∈ H(X^j),

and we define Z_{−π}^j := 0. Moreover, for any B ∈ B, there exists a unique random variable Z_B^j = (ψ^j)^{−1}(1_B) ∈ H(X^j). Then the process (Z_ω^j)_{ω∈(−π,π]} has orthogonal increments. Indeed, if −π ≤ a < b ≤ c < d ≤ π, then

E[(Z_b^j − Z_a^j) \overline{(Z_d^j − Z_c^j)}] = ⟨Z_b^j − Z_a^j, Z_d^j − Z_c^j⟩_{H(X^j)} = ⟨1_{(a,b]}, 1_{(c,d]}⟩_{dF^j} = ∫_{−π}^{π} 1_{(a,b]}(ω) 1_{(c,d]}(ω) dF^j(ω) = 0,    (1.16)

likewise,

E(|Z_b^j − Z_a^j|^2) = ∫_{−π}^{π} 1_{(a,b]}(ω) dF^j(ω) = F^j(b) − F^j(a).    (1.17)

In order to introduce stochastic integration of non-random functions w.r.t. {Z_ω^j}, let us start with step functions (or simple functions). With an arbitrary positive integer N, let

φ_step(ω) := Σ_{r=1}^{N−1} a_r 1_{(ω_r, ω_{r+1}]}(ω)   (a_r ∈ C)


in L^2([−π, π], B, dF^j), where each ω_r is a continuity point of F^j and −π = ω_1 < ω_2 < ··· < ω_N = π. Define the stochastic integral of a step function by

∫_{−π}^{π} φ_step(ω) dZ_ω := Σ_{r=1}^{N−1} a_r (Z_{ω_{r+1}} − Z_{ω_r}).    (1.18)

This stochastic integration also establishes an isometry between step functions and random variables of the form (1.18). Since any φ ∈ L^2([−π, π], B, dF^j) can be approximated by step functions, this isometry extends to a Hilbert space isometry. So we get the stochastic integral of a non-random L^2-function:

(ψ^j)^{−1}(φ) = ∫_{−π}^{π} φ(ω) dZ_ω^j,   φ ∈ L^2([−π, π], B, dF^j).
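The isometry behind (1.18) can be checked by simulation. In the sketch below (our own toy setup, not from the text) the increments of Z are independent complex Gaussians with E|dZ(ω)|^2 = f(ω) dω, and the second moment of the approximating sum matches ∫ |φ|^2 dF.

```python
import numpy as np

rng = np.random.default_rng(3)
N, reps = 400, 20000
w = np.linspace(-np.pi, np.pi, N + 1)
dw = np.diff(w)
f = (1 + 0.8 * np.cos(w[:-1])) / (2 * np.pi)      # a toy spectral density on the grid
phi = np.exp(1j * 2 * w[:-1])                     # phi(w) = e^{2iw}

# independent complex Gaussian increments with variance f(w_r) dw_r
dZ = (rng.standard_normal((reps, N)) + 1j * rng.standard_normal((reps, N))) \
     * np.sqrt(f * dw / 2)
I = (phi * dZ).sum(axis=1)                        # approximating sums as in (1.18)
print(np.mean(np.abs(I) ** 2))                    # Monte Carlo estimate of E|I|^2
print(np.sum(np.abs(phi) ** 2 * f * dw))          # target: int |phi|^2 dF
```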

Apply this for φ(ω) = e^{itω}:

X_t^j = ∫_{−π}^{π} e^{itω} dZ_ω^j   (j = 1, ..., d).    (1.19)

In this way the following important fact is proved.

Theorem 1.4. With the notation X_t := [X_t^1, ..., X_t^d]^T (t ∈ Z) and Z_ω := [Z_ω^1, ..., Z_ω^d]^T, ω ∈ [−π, π], one can write that

X_t = ∫_{−π}^{π} e^{itω} dZ_ω   (t ∈ Z).    (1.20)

This is the spectral representation (sometimes called Cramér's representation) of the stationary time series {X_t}_{t∈Z}.

Now we are ready to define the complex measures corresponding to the entries c_{jℓ}(h) of the autocovariance matrix. Let us recall again that an isometric isomorphism between two Hilbert spaces is an isomorphism between the inner products as well. For any B ∈ B, define the complex valued set function

dF^{jℓ}(B) := E(Z_B^j \overline{Z_B^ℓ})   (j, ℓ = 1, ..., d).

In particular, for ω ∈ (−π, π], define spectral cumulative distribution functions (spectral c.d.f.):

F^{jℓ}(ω) := dF^{jℓ}((−π, ω]) = E(Z_ω^j \overline{Z_ω^ℓ})   (j, ℓ = 1, ..., d).    (1.21)

By isometry, it is consistent with our previous definition of the function F^j(ω):

F^{jj}(ω) = E(|Z_ω^j|^2) = ∫_{−π}^{π} |1_{(−π,ω]}(t)|^2 dF^j(t) = ∫_{−π}^{ω} dF^j(t) = F^j(ω).    (1.22)

Spectral representation of multidimensional stationary time series

13

The Cauchy–Schwarz inequality, (1.9), and (1.22) show that for any B ∈ B we have   n o 21 j j 2 ` 2 ` |dF j` (B)| = E ZB ZB ≤ E(|ZB | )E(|ZB | )  1 1 ≤ F j (π)F ` (π) 2 = {cjj (0)c`` (0)} 2 . (1.23) This inequality and the next lemma imply that dF j` has bounded variation. Lemma 1.2. Let µ be a complex valued set function on a σ-field F and suppose that |µ(F )| ≤ c < ∞ for any F ∈ F. Then µ has bounded variation. Proof. Let µre := Re(µ) and µim := Im(µ). By our assumption, |µre (F )| ≤ c and |µim (F )| ≤ c for any F ∈ F. Indirectly, suppose that at least one of the signed measures µre and µim has infinite variation.P Then there exists a ∞ sequence (Fj )∞ j=1 of pairwise disjoint sets in F such that j=1 |µre (Fj )| > 2c, say. Define J+ := {j : µre (Fj ) > 0} and J− := {j : µre (Fj ) < 0}. Then [ [ |µre ( Fj )| > c or |µre ( Fj )| > c or both, j∈J+

j∈J−

which is a contradiction. Thus F j` (ω) can be decomposed into four bounded, non-decreasing functions: j` j` j` j` F j` (ω) = Fre+ (ω) − Fre− (ω) + i(Fim+ (ω) − Fim− (ω))

(−π < ω ≤ π).

It means that F j` (ω) defines a complex measure on ([−π, π], B). Also, we have the equality of bilinear functions: Z π j ` E(ZB ZB 0 ) = 1B (ω)1B 0 (ω)dF j` (ω) (j, ` = 1, . . . , d). −π

In particular, if B ∩ B 0 = ∅, then j ` ) = 0, ZB E(ZB 0

dF j` (B ∪ B 0 ) = dF j` (B) + dF j` (B 0 ),

for any j, ` ∈ {1, . . . , d}. (Of course, the second equality is consistent with the fact that dF j` is a complex measure.) Similarly to the Herglotz theorem (Theorem 1.3), one can exhibit and prove spectral representation of the non-diagonal entries of the covariance matrix: Z π cj` (h) = eihω dF j` (ω), F j` (−π) = 0, F j` (π) = cj` (0). (1.24) −π

14

Harmonic analysis of stationary time series

The only difference is that here the spectral measure is complex valued in general, even if the time series has real valued components. Let us introduce the spectral measure matrix dF := [dF j` ]d×d . First, it is self-adjoint: Z π Z π cj` (h) = eihω dF j` (ω) = c`j (−h) = eihω dF `j (ω) ⇒ dF j` = dF `j , −π

−π

(1.25) since by Weierstrass’ theorem the trigonometric polynomials are dense in C[−π, π], the space of continuous functions over [−π, π], and thus determine the measure by the Riesz representation theorem. Also, dF is non-negative definite: m 2 m X X ∗ ∗ ar dF (Br ∩ Bs ) as = E ar ZBr ≥ 0 (1.26) r,s=1

r=1

n

for any m ≥ 1 and a1 , . . . , am ∈ C ; B1 , . . . , Bm ∈ B. In particular, definition (1.5) of non-negative definiteness of a spectral measure matrix holds too. Corollary 1.1. In the special case when each dF j` are absolutely continuous w.r.t. Lebesgue measure in [−π, π], that is, dF j` (ω) = f j` (ω) dω for j, ` = 1, . . . , d, it follows that the time series has a spectral density matrix f := [f j` ]d×d , which is self-adjoint and non-negative definite. Proof. The self-adjointness of f follows from (1.25), while the non-negative definiteness of f follows from (1.26). Corollary 1.2. We can write (1.24) in matrix form: Z π C(h) = eihω dF (ω), h ∈ Z, −π

or in the case of an absolutely continuous spectral measure: Z π C(h) = eihω f (ω)dω, h ∈ Z.

(1.27)

−π

This proves Theorem 1.2(a). Corollary 1.3. By formula (1.21), we have the following relationship between the orthogonal increment process Zω and the spectral c.d.f. F(ω): F(ω) = E(Zω Z∗ω )

(ω ∈ [−π, π]).

Corollary 1.4. Let ψqj ∈ L2 ([−π, π], B, dF j ) for j = 1, . . . , d and q = 1, 2. Define the random variables Z π Yqj := ψqj (ω)dZωj ∈ H(X) ⊂ L2 (Ω, F, P), −π

Spectral representation of multidimensional stationary time series

15

and the random vectors Yq := (Yq1 , . . . , Yqd ) for q = 1, 2. Then by isometry, we may write the cross-covariance matrix of Y1 and Y2 as  Z π  Z π h i j j ∗ ` j ` ` ψ1 (ω)dZω E(Y1 Y2 ) = hY1 , Y2 iH(X) = E ψ2 (ω)dZω d×d

i h = hψ1j , ψ2` idF j`

−π

Z

π

= d×d

−π

−π

d×d

 ψ1j (ω)ψ2` (ω)dF j` (ω)

.

(1.28)

d×d

If dF is absolutely continuous with respect to Lebesgue measure with spectral density f , then  Z π j j` ∗ ` ψ1 (ω)f (ω)ψ2 (ω)dω . E(Y1 Y2 ) = −π

d×d

Definition 1.4. An important special case of weakly stationary time series is a so-called white noise process. Given a self-adjoint non-negative definite matrix Σ = [σjk ] ∈ Cd×d , the stationary sequence {ξt }t∈Z is called a white noise process with covariance matrix Σ, denoted WN(Σ), if E(ξt ) = 0 for all t ∈ Z and its autocovariance function is given by Cξ (h) = E(ξt+h ξt∗ ) = δh0 Σ

(h ∈ Z),

(1.29)

where δjk = 1 if j = k, and δjk = 0 if j 6= k (Kronecker delta). It means that the values of {ξt } are orthogonal (uncorrelated) longitudinally for different time instants, while its coordinates may be correlated cross-sectionally, that is, at the same time instant. The special case WN(Id ) is an orthonormal sequence, whose coordinates are uncorrelated cross-sectionally as well. By Remark 1.4 and (1.29), {ξt } ∼ WN(Σ) has spectral density fξ (ω) = jk [fξ (ω)]d×d : 1 1 fξjk (ω) = σjk , fξ (ω) = Σ. (1.30) 2π 2π This explains the name ‘white noise’: the spectral density is the same constant at every frequency ω, like the spectrum of ideal white light. Remark 1.4. Remark 1.1 can be extended to the present case as well. (a) If for each j, ` = 1, . . . , d we have ∞ X

|cj` (h)| < ∞,

(1.31)

h=−∞

that is, {C(h)}h∈Z ∈ `1 , then the time series {Xt } has a continuous spectral density matrix f (ω) = [f j` (ω)]d×d such that (1.27) holds and f (ω) =

∞ 1 X C(h) e−ihω , 2π h=−∞

where the series converges pointwise.

ω ∈ [−π, π],

(1.32)

16

Harmonic analysis of stationary time series

(b) By the Riesz–Fischer theorem, condition (1.31) can be weakened as ∞ X

|cj` (h)|2 < ∞,

∀j, ` = 1, . . . , d,

h=−∞

that is, {C(h)}h∈Z ∈ `2 . Then the series (1.32) converges in L2 ([−π, π], B, dω) entrywise, where dω is Lebesgue measure, so the spectral density f (ω) exists and (1.27) holds. Remark 1.5. Let {Xt } be a d-dimensional stationary time series with real components and with absolute continuous spectral measure with density matrix f . By (1.1) we have C(−h) = C ∗ (h) for the covariance matrix of the process. However, now C(h) is a matrix with real entries, hence C(−h) = C T (h) and also C(h) = C(h) (entrywise complex conjugation) for any h ∈ Z. Therefore from the equation (1.27), we obtain that Z π Z π ihω C(−h) = e f (ω)dω = C(−h) = eihω f (−ω)dω. −π

−π

Since the Fourier transform is a.e. uniquely determines f , it follows that for real valued processes we have f (−ω) = f (ω),

ω ∈ [−π, π].

(1.33)

(We do not distinguish two density functions which are equal almost everywhere.) It implies that in this case it is enough to consider only the half frequency interval [0, π]. Remark 1.6. Here we discuss some properties of time series which have autocovariance matrix function with real entries, which are absolute summable. (a) Let {Xt } be a d-dimensional weakly stationary time series of complex components. Denoting by C(h) = [cpq (h)] the d × d autocovariance matrix function, C(−h) = C ∗ (h), h ∈ Z, inPthe time domain, assume that their ∞ entries are absolutely summable, i.e. h=0 |cpq (h)| < ∞ for p, q = 1, . . . , d. By Remark 1.4(a), then the spectral density matrix f (ω) exists in the frequency domain, and it can be computed as (1.32). In view of Corollary 1.1 it is always self-adjoint, non-negative definite (positive semidefinite). Further, if C(h) is a matrix of real entries ∀h ∈ Z, then the relation f (−ω) = f (2π − ω) = f (ω) (with entrywise conjugation) holds ∀ω ∈ [0, 2π]. The last statement follows from (1.33) and also from (1.32) by substituting −ω or 2π − ω for ω. (b) We have the following equivalent forms for 2πf (ω): 2πf (ω) =

= C(0) +

∞ X h=−∞ ∞ X

C(h)e−ihω = C(0) +

∞ X

[C(h)e−ihω + C ∗ (h)eihω ]

h=1

[(C(h) + C ∗ (h)) cos(hω) + i(C ∗ (h) − C(h)) sin(hω)].

h=1

Spectral representation of multidimensional stationary time series

17

The first line shows again that f (ω) is self-adjoint. The second line shows that whenever C(h) is a real matrix and so, C ∗ (h) = C T (h), then C(h) + C T (h) is symmetric and C T (h) − C(h) is anti-symmetric with P∞ 0 diagonal, but i(C T (h) − C(h)) is self-adjoint. Actually, (C(h) + h=1 P∞ C T (h)) cos(hω) is the real and h=1 (C T (h) − C(h)) sin(hω) is the imaginary part of 2πf (ω), the (entrywise) conjugate of which is 2πf (−ω) = 2πf (2π − ω). Remark 1.7. If {Xt } is a 1D weakly stationary, real time series, then f (ω) ≥ 0 is real, and f (−ω) = f (ω), ∀ω ∈ [0, 2π]. If {Xt } is a d-dimensional weakly stationary time series of real components, then C(h) is a matrix of real entries (∀h ∈ Z), so f (−ω) = f (ω) holds, ∀ω ∈ [0, 2π]. But it is not necessary for a time series to have real components so that C(h) be a real matrix. For example, let {Yt } be a d-dimensional weakly stationary time series of real components with expectation 0. Let µ ∈ Cd be a vector of at least one coordinate containing a nonzero imaginary part. Then the time series Xt = Yt + µ has at least one complex (not purely real) coordinate, still its autocovariance matrix sequence is the same as that of Yt , so C(h)s have real entries. When C(h) has complex entries too, then there is no such simple relaP∞ tionship between f (−ω) and f (ω) in general. Indeed, A := h=1 (C(h) + P∞ C ∗ (h)) cos(hω) and B := h=1 (C ∗ (h) − C(h)) sin(hω) are complex matrices, say A = A1 + iA2 , B = B1 + iB2 . Then f (ω) = C0 + A + iB = C0 + (A1 − B2 ) + i(A2 + B1 ), whereas f (−ω) = C0 + A − iB = C0 + (A1 + B2 ) + i(A2 − B1 ), and the latter is neither the same, nor the conjugate of the former. Summarizing, there is no extension from the real to the complex case. Consequently, in the real case we can confine ourselves to [0, π], while in the complex case the whole [0, 2π] or [−π, π] should be used. Remark 1.8. It should be noted that an alternative approach for the spectral representation of the process {Xt }t∈Z could start from the spectral representation of the unitary operator S in the Hilbert subspace H(X), and then there follows the spectral representation of the covariance matrix. See this approach e.g. in [34] and [47], and see the underlying spectral theorem of normal operators e.g. in [49]. Here we just briefly summarize the main points. If T is a bounded operator in a Hilbert space H and T is normal: T T ∗ = T ∗ T , then there exists an orthogonal projection measure E on the Borel subsets of the spectrum σ(T ) of T which satisfies Z T = λ dE(λ). σ(T )

18

Harmonic analysis of stationary time series

The spectrum σ(T ) is the subset of C such that λ ∈ σ(T ) if and only if T − λI is not invertible: • T − λI is not one-to-one (that is, λ is an eigenvalue of T ), or • T − λI is not onto H. If the operator S we consider is a unitary operator: SS ∗ = I = S ∗ S, then its spectrum σ(S) is a subset of the unit circle: Z π S= eiω dE(ω). −π

Since Xt = S t X0 , we get that Z

π

Xt =

eitω dΦ(ω),

−π

where dΦ(ω) := dE(ω)X0 . This formula corresponds to (1.20).

1.4

Constructions of stationary time series

There are some standard constructions of a stationary time series with a given covariance function or with a given spectral measure.

1.4.1

Construction 1

Suppose that C : Z → Cd×d is a non-negative definite function: n X

a∗k C(k − j) aj ≥ 0,

∀n ≥ 1,

∀a1 , . . . , an ∈ Cd .

j,k=1

Equivalently, the block Toeplitz matrix Cn defined by (1.3) is self-adjoint and non-negative definite for any n ≥ 1. We are going to construct a d-dimensional time series {Xt }t∈Z with the given covariance function E(Xt X∗s ) = C(t − s),

t, s ∈ Z.

(1.34)

The construction goes by induction, defining the value of the time series at t = 0, then at t = 1, then at t = −1, then at t = 2, then at t = −2, and so on. However, the very first thing is to choose an orthonormal sequence of random variables {ξj }∞ j=0 with expectation 0 on a probability space (Ω, F, P). For example, one can choose a sequence of independent tossing of a fair coin

Constructions of stationary time series

19

P(ξj = ±1) = 21 (j = 0, 1, 2, . . . ); or, one can choose a sequence {ξj }∞ j=0 of independent standard normal variables with the probability density function x2 1 φ(x) = √ e− 2 2π

(x ∈ R).

We define the Hilbert space H that is going to be a basis of the construction as the closed linear span H := span{ξj : j = 0, 1, 2, . . . } ⊂ L2 (Ω, F, P). At the beginning we set C(0) = A0 A∗0 ,

r := rank C(0),

X0 := A0 ξr ,

where ξr := [ξ0 , . . . , ξr−1 ]T , using the parsimonious Gram-decomposition (B.5) of the Appendix. Then each step of the induction will consist of two sub-steps. Assuming for example that a sequence (X−n , X−n+1 , . . . , X0 , . . . , Xn−1 , Xn ) has already been defined for some n ≥ 0, the first sub-step will take a preliminary sequence h iT ˜ −n , X ˜ −n+1 , . . . , X ˜ 0, . . . , X ˜ n, X ˜ n+1 . (1.35) X−n,n+1 := X After that a second sub-step will result the new vector Xn+1 in the constructed time series. Then the construction of a new vector X−n−1 goes similarly, so not detailed. For n ≥ 0 fixed, we would like to define the sequence X−n,n+1 so that its covariance function be C(h), h ∈ Z: C−n,n+1    := E(X−n,n+1 X∗−n,n+1 ) =  

C(0) C(1) .. . C(2n + 1)

C(−1) · · · C(0) · · · .. .. . . C(2n) · · ·

C(−2n − 1) C(−2n) .. .

   . 

C(0)

Clearly, C−n,n+1 is a self-adjoint, non-negative definite block Toeplitz matrix of rank r ≤ 2n + 2. (The value of r can be different at each step of the construction!) Thus by equation (B.5) in the Appendix, it has a parsimonious Gram-decomposition: C−n,n+1 = A−n,n+1 · A∗−n,n+1 ,

A−n,n+1 ∈ C(2n+2)×r .

We set X−n,n+1 := A−n,n+1 ξr ,

ξr := [ξ0 , . . . , ξr−1 ]T .

(1.36)

Then, really, E(X−n,n+1 X∗−n,n+1 ) = A−n,n+1 E(ξr ξr∗ )A∗−n,n+1 = C−n,n+1 . Now comes the second sub-step of the induction. Assume that the sequence (X−n , X−n+1 , . . . , X0 , . . . , Xn−1 , Xn ) has been already defined for some n ≥ 0

20

Harmonic analysis of stationary time series

and has the covariance function (1.34) for s, t ∈ {−n, . . . , n}. We have also defined the preliminary sequence X−n,n+1 by (1.35) and (1.36). Then define the operator T2n+1 by ˜ t = Xt T2n+1 X

(t = −n, . . . , n).

By the construction it follows that T2n+1 has the following important property:   ˜ t )(T2n+1 X ˜ tX ˜ s )∗ = E(X ˜ ∗ ) = C(t − s), s, t ∈ {−n, . . . , n}. E (T2n+1 X s (1.37) By linearity, T2n+1 can be extended to an isomorphy between the finite dimensional spaces ˜ t : t = −n, . . . , n}, ˜ 2n+1 := Spand {X H H2n+1 := Spand {Xt : t = −n, . . . , n}, so that we still have   ∗ ˜ ˜ ˜Y ˜ ∗ ), E (T2n+1 X)(T Y) = E(X 2n+1

˜ Y ˜ ∈H ˜ 2n+1 . X,

By Lemma 1.1, we may write that ˜ n+1 = X ˜− +X ˜+ , X n+1 n+1

˜− ∈ H ˜ 2n+1 , X n+1

˜+ ⊥ H ˜ 2n+1 . X n+1

(1.38)

˜ + = 0, then we are ready: Xn+1 := T2n+1 X ˜ n+1 belongs to the already If X n+1 + defined subspace H2n+1 . Otherwise, set Un+1 := X+ n+1 /kXn+1 k and define + ˜ ˜ H 2n+1 := Spand {H2n+1 , Un+1 }. Let ξk be the first random variable in the sequence {ξj }∞ j=0 that has not been used so far in the construction of the sequence (X−n , . . . , Xn ). √ + := Define T2n+1 Un+1 = Vn+1 := [ξk , ξk+1 , . . . , ξk+d−1 ]T / d and H2n+1 + + ˜ Spand {H2n+1 , Vn+1 }. Extend T2n+1 between H2n+1 and H2n+1 by linearity, ˜ n+1 . Then by (1.37) and (1.38), and define Xn+1 := T2n+1 X   ˜ n+1 )(T2n+1 X ˜ t )∗ = E(X ˜ n+1 X ˜ ∗ ) = C(n + 1 − t) E(Xn+1 X∗t ) = E (T2n+1 X t for any t = −n, . . . , n. This completes the induction. Corollary 1.5. Formula (1.6) shows that the covariance function of any stationary time series is non-negative definite. Conversely, Construction 1 proves that for any non-negative definite function C(h), h ∈ Z, one can construct a stationary time series with this covariance function.

1.4.2

Construction 2

Assume that we are given a d × d matrix dF = [dF rs ]dr,s=1 whose entries are finite complex measures on ([−π, π], B) and which is non-negative definite: d d Z π X X dF rs ((α, β])zr z¯s = zr z¯s 1(α,β] (ω) dF rs (ω) ≥ 0, (1.39) r,s=1

r,s=1

−π

Constructions of stationary time series

21

for any interval (α, β] ⊂ [−π, π] and z1 , . . . , zd ∈ C. Equivalently, the matrix ∆αβ F := [∆αβ F rs ]dr,s=1 := [F rs (β) − F rs (α)]dr,s=1 is non-negative definite for any (α, β] ⊂ [−π, π], that is, d X

∆αβ F rs zr z¯s ≥ 0.

(1.40)

r,s=1

We would like to construct a time series whose spectral measure matrix is the given dF . One way to do it is to show that dF defines a non-negative definite function C(h), h ∈ Z, and then using Construction 1 to complete the construction. It should be noted that inequality (1.39) can be extended to sums of the d Z π X n X zr (j)¯ zs (j) 1(αj ,βj ] (ω) dF rs (ω) ≥ 0, n ≥ 1, (1.41) r,s=1

−π j=1

where (αj , βj ] ⊂ [−π, π] and z1 (j), . . . , zd (j) ∈ C for any n ≥ 1 and j = 1, . . . , n. It is clear that the class of step-functions g(ω) :=

n X

z(j) 1(αj ,βj ] (ω),

n ≥ 1,

ω ∈ [−π, π],

j=1

is dense in L2 ([−π, π], B, tr(dF (ω))), where tr(dF (ω)) denotes the trace of dF , which dominates any measure entry in dF . Thus one can extend inequality (1.41) to the case d Z π X gr (ω)¯ gs (ω) dF rs (ω) ≥ 0, (1.42) −π

r,s=1

where g1 , . . . , gd ∈ L2 ([−π, π], B, tr(dF (ω))). Define Z π

eihω dF (ω) ∈ Cd×d ,

C(h) :=

h ∈ Z.

−π

Take an arbitrary integer n ≥ 1 and arbitrary vectors ak = (a1k , . . . , adk ) ∈ C for k = 1, . . . , n. Define the trigonometric polynomials d

ζr (ω) :=

n X

ark e−ikω ,

r = 1, . . . , d.

k=1

Then by inequality (1.42) we have n X

a∗k C(k

j,k=1

=

d X r,s=1

− j) aj =

n X j,k=1

Z

a∗k

Z

π

ei(k−j)ω dF (ω) aj

−π

π

−π

ζr (ω)ζ¯s (ω)dF rs (ω) ≥ 0.

22

Harmonic analysis of stationary time series

Thus the matrix function C(h), h ∈ Z, is non-negative definite, so by Construction 1 a stationary time series can be constructed with this covariance matrix function. Corollary 1.6. Equation (1.26) shows that the spectral measure matrix of any stationary time series is non-negative definite. Conversely, Construction 2 proves that for any non-negative definite measure matrix, one can construct a stationary time series with this spectral measure. If d = 1, then (1.40) holds if and only if F is a right continuous, nondecreasing function on [−π, π] such that F (−π) = 0, F (π) < ∞. It implies the converse of Herglotz theorem, Theorem 1.3. For any such distribution function F , its Fourier transform Z π c(r) = eirω dF (ω) (r ∈ Z) −π

defines a non-negative definite function c. If d = 2, then (1.40) holds if and only if for any −π ≤ α ≤ β ≤ π, ∆αβ F 11 ∆αβ F 12 rr ≥ 0. ∆αβ F ≥ 0 (r = 1, 2) and ∆αβ F 21 ∆αβ F 22

1.4.3

Construction 3

Let us assume that the d-dimensional stationary time series {Xt }t∈Z we would like to construct has an absolutely continuous spectral measure with given density matrix f (which is a self-adjoint, non-negative definite matrix valued function) and suppose that f (ω) has constant rank r ≤ d for a.e. ω ∈ [−π, π]. Then we can take the parsimonious Gram-decomposition (B.5) of 2πf : f (ω) =

1 φ(ω)φ∗ (ω), 2π

φ(ω) ∈ Cd×r ,

for a.e. ω ∈ [−π, π]. (Compare with Theorem 4.1.) Then we may define a d-dimensional stationary Gaussian time series {Xt } with spectral density f using Itˆo’s stochastic integration. Let B(ω) be a standard r-dimensional Brownian motion (Wiener process) on the interval [−π, π]. Define a time series as Z π 1 √ Xt := eitω φ(ω)dB(ω), t ∈ Z. (1.43) 2π −π It is well-known that then {Xt } is a Gaussian process, EXt = 0 for any t, and by Itˆ o isometry, the covariance function is Z π Z π 1 ei(t+h)ω φ(ω)e−itω φ∗ (ω)dω = eihω f (ω)dω C(h) = E(Xt+h X∗t ) = 2π −π −π for any h ∈ Z. Thus this time series is stationary with spectral density f : this proves the correctness of the construction. In practice one would approximate

Constructions of stationary time series

23

the stochastic integral in (1.43) by a stochastic sum; for the approximation there are several approaches. For example, one may use L´evy’s construction of Brownian motion, see e.g. [42, p. 7]; or an approximation of Brownian motion by simple, symmetric random walks [52].

1.4.4 1.4.4.1

Construction 4 Discrete Fourier Transform

First let us review the Discrete Fourier Transform (DFT) in a way that is consistent with our previous setting. For simplicity, choose a positive odd integer 2N + 1 and define ∆ω := 2N2π+1 . Suppose that the spectral measure of the investigated d-dimensional stationary time series {Xt } is absolutely continuous with density matrix function f . We assume that f , or an estimate of it, is given at the discrete points ωj := j∆ω ∈ [−π, π], j = −N, . . . , N , called Fourier frequencies. Then the DFT of f is defined as ˆ C(k) = ∆ω

N X

f (ωj )eikωj ,

k = −N, . . . , N.

(1.44)

j=−N

This finite sequence is a natural estimate of the covariance matrix function, see (1.27), Z π C(k) = eikω f (ω)dω (k ∈ Z), −π

if f is Riemann integrable and N is large enough. A property of DFT is that ˆ + 2N + 1) = C(k) ˆ it is periodic with period 2N + 1: C(k for any k. Conversely, assume that the covariance matrix function C(k), k ∈ Z, or an estimate of it, is given and we would like to find an estimate of the spectral density f . Then the inverse DFT (IDFT) is defined by N 1 X fˆ(ωj ) = C(k) e−ikωj , 2π

j = −N, . . . , N.

(1.45)

k=−N

It is a natural estimate of the spectral density matrix, see (1.32): f (ω) =

∞ 1 X C(k)e−ikω , 2π

ω ∈ [−π, π],

k=−∞

if the entries of C(k) are negligible for |k| > N . If the entries of C are absolute summable: C ∈ `1 , and N is large enough, then this condition holds. A property of IDFT is that it is also periodic with period 2N +1: fˆ(ωj+2N +1 ) = fˆ(ωj ) for any j. If the chosen positive integer is even: 2N , then everything goes similarly as above, except that the indices run from −N + 1 to N .

24

Harmonic analysis of stationary time series

It is well-known that {eikω }k∈Z is an orthonormal sequence of functions in L ([0, 2π], B, dω): Z π 1 ikω i`ω he , e i := eikω ei`ω dω = δk` (k, ` ∈ Z). 2π −π 2

Similarly, {j 7→ eikωj }{k=−N,...,N } is an orthonormal sequence of functions on the discrete set of points {ωj : j = −N, . . . , N } in the following sense: N X 1 eikωj ei`ωj 2N + 1

hj 7→ eikωj , j 7→ ei`ωj i :=

j=−N

N X 1 ei(k−`)ωj = δk` , 2N + 1

=

j=−N

for any k, ` ∈ Z. The next proposition shows that this property implies that the IDFT is really the inverse transformation of the DFT. ˆ Proposition 1.2. Assume that C(k), k = −N, . . . , N , is the DFT of an approximate spectral density fˆ as defined by (1.44). Then the IDFT defined by (1.45) gives N N N X 1 X ˆ 1 X −ikωj C(k)e−ikωj = e ∆ω fˆ(ω` )eikω` 2π 2π k=−N N X

=

k=−N

fˆ(ω` )

`=−N

1 2N + 1

N X

`=−N

ei(`−j)k∆ω = fˆ(ωj ),

k=−N

for j = −N, . . . , N . Similarly, assume that fˆ(ωj ), j = −N, . . . , N , is the IDFT of an approxiˆ as defined by (1.45). Then the DFT defined mate covariance matrix function C by (1.44) gives ∆ω

N X

fˆ(ωj )eikωj = ∆ω

j=−N

=

N X

ˆ C(`)

`=−N

N X j=−N

1 2N + 1

N X

eikωj

N 1 X ˆ C(`)e−i`ωj 2π `=−N

ˆ ei(k−`)ωj = C(k),

j=−N

for k = −N, . . . , N . The DFT and IDFT are efficient from an algorithmic point of view, because when N = 2n , they can be evaluated in O(N log N ) steps using Fast Fourier Transform (FFT).

Constructions of stationary time series 1.4.4.2

25

The construction

Like in Construction 3, let us assume that the d-dimensional stationary time series {Xt }t∈Z we would like to construct has an absolutely continuous spectral measure with given density matrix f (which is a self-adjoint, non-negative definite matrix valued function) and suppose that f (ω) has constant rank r ≤ d for a.e. ω ∈ [−π, π] and is Riemann integrable on [−π, π]. Take the parsimonious Gram-decomposition (B.5) of 2πf : f (ω) =

1 φ(ω)φ∗ (ω), 2π

φ(ω) ∈ Cd×r ,

for a.e. ω ∈ [−π, π]. (Compare with Theorem 4.1.) The basis of the construction is the spectral representation (1.20) of {Xt }: Z π Xt = eitω dZω (t ∈ Z), (1.46) −π

where {Zω }ω∈[−π,π] is a d-dimensional process with orthogonal increments. Like above, assume that we have chosen an odd positive integer 2N + 1, ∆ω = 2N2π+1 , and the Fourier frequencies ωj = j∆ω, j = −N, . . . , N . Define ∆Z(ωj ) := (2N + 1)−1/2 φ(ωj )Vj , where

1

j = −N, . . . , N, r

Vj := [eiUj , . . . , eiUj ]T , and {Ujk : k = 1, . . . , r; j = −N, . . . , N } are independent random variables, uniformly distributed on [−π, π]. Here ∆Z(ωj ) gives a random vector measure of the interval [ωj , ωj+1 ]. It is an increment of a process with orthogonal increments, see (1.16) and (1.17): 1 φ(ωj )E(Vj V`∗ )φ∗ (ω` ) 2π = δj` ∆ωf (ωj ),

E (∆Z(ωj )∆Z∗ (ω` )) = ∆ω

since

 k  m E eiUj e−iU` = δj` δkm ,

(1.47)

E(Vj V`∗ ) = δj` Ir .

As an approximation of (1.46), for t = 0, . . . , 2N define ˆ t := X

N X

N √ 1 X itωj eitωj ∆Z (ωj ) = √ e φ(ωj )Vj ∆ω, 2π j=−N j=−N

(1.48)

which is a periodic sequence with period 2N + 1. It is of the form of DFT (1.44) with coefficients (∆ω)−1 ∆Z(ωj ). Compare also the last expression in (1.48) with construction (1.43).

26

Harmonic analysis of stationary time series ˆ t } is By (1.47) and (1.48), the covariance matrix function of {X ˆ t+h , X ˆ t ) = E(X ˆ t+h X ˆ ∗ ) = ∆ω Cov(X t

N X

ˆ f (ωj ) eihωj = C(h).

j=−N

By (1.44) and Proposition 1.2 it follows that the spectral density of the seˆ t } is exactly the given f (ωj ) at the Fourier frequencies {ωj : j = quence {X −N, . . . , N }.

1.5 1.5.1

Estimating parameters of stationary time series Estimation of the mean

It is important in practice whether one can estimate parameters of a stationary time series by observing a single trajectory of the process for a long enough time. The first thing to estimate is the mean µ ∈ C of a process, which could differ from 0 now. It is enough to consider a one-dimensional time series {Xt }t∈Z , because expectation can be taken componentwise. If Xtµ := Xt + µ,

EXt = 0,

µ ∈ C (t ∈ Z),

then one gets a natural approximation of µ by taking a positive integer T and computing the empirical mean, that is, the average of a single trajectory for t = 0, 1, . . . , T − 1: T −1 T −1 X 1 X ˜ µ := 1 ˜T . X Xtµ = µ + Xt = µ + X T T t=0 T t=0

˜ µ to the theoretical expectation If we have convergence of the time average X T µ in mean square then it is called ergodicity for the mean; it is a ‘law of large ˜ T → 0 in numbers’. Obviously, for this it is necessary and sufficient that X mean square; this is the case that we are going to investigate in the sequel. Let H(X) ⊂ L2 (Ω, F, P) be the Hilbert space defined in Section 1.3. In this section each random variable is considered as a vector in H(X). So equality of two random variables means that they are P-a.s. equal. Also, convergence of random variables is always understood in this Hilbert space, that is, as convergence in mean square. Now we introduce two subspaces of H(X). First, let S denote the unitary operator of forward shift in H(X) and I := {ξ ∈ H(X) : Sξ = ξ} the subspace of translation invariant random variables in H(X). It is clear

Estimating parameters of stationary time series

27

that I is a closed subspace in H(X). Second, define N := span{η ∈ H(X) : η = Sζ − ζ for some ζ ∈ H(X)}. Let N ⊥ denote the orthogonal complement of N in H(X), so H(X) = N ⊕N ⊥ , where ⊕ denotes orthogonal direct sum. Theorem 1.5. Let {Xt }t∈Z be a stationary time series (with mean 0). Then ˜ T converges to a random variable Y in mean square as the time average X T → ∞: ˜ T − Y |2 = 0, lim E|X T →∞

where Y is the orthogonal projection of X0 to the subspace I, denoted as Y = PI X0 ; moreover, EY = 0. Proof. Since Xt = S t X0 (t ∈ Z), let us introduce the following linear operators in H(X): T −1 1 X t S (T ≥ 1). VT := T t=0 Obviously, if ξ ∈ I, then VT ξ = ξ for any T ≥ 1, so lim VT ξ = ξ. T →∞

On the other hand, if η = Sζ − ζ for some ζ ∈ H(X), then VT η = VT (Sζ − ζ) =

T −1 1 X t+1 1 (S − S t )ζ = (S T − S 0 )ζ. T t=0 T

Since kSk = 1, and also kS T k = 1, we get that lim VT η = 0 ∀η ∈ N .

T →∞

Next we want to show that I = N ⊥ . First let ξ ∈ I. Then hξ, Sζ − ζi = hξ, Sζi − hξ, ζi = hS ∗ ξ, ζi − hξ, ζi = hξ, ζi − hξ, ζi = 0, where ζ ∈ H(X) is arbitrary, since ξ is invariant under S ∗ = S −1 as well. Hence ξ⊥(Sζ − ζ) for any ζ ∈ H(X). This implies that ξ ∈ N ⊥ , that is, I ⊂ N ⊥ . Conversely, assume that ξ ∈ N ⊥ . Then for any ζ ∈ H(X), 0 = hξ, Sζ − ζi = hξ, Sζi − hξ, ζi = hS ∗ ξ − ξ, ζi. This means that (S ∗ ξ − ξ)⊥ζ for any ζ ∈ H(X). Thus S −1 ξ = ξ, that is, ξ ∈ I, which implies that N ⊥ ⊂ I. Consequently, I = N ⊥ .

28

Harmonic analysis of stationary time series

Finally, it follows that any ξ ∈ H(X) can be written as ξ = ξI + ξN , where ξI = PI ξ ∈ I and ξN ∈ N . Then by the first part of the proof, lim VT (ξI + ξN ) = ξI .

T →∞

This shows that ˜ T = lim VT X0 = PI X0 =: Y ∈ I. lim X

T →∞

T →∞

Also, EY = 0, as the expectation of any random variable in H(X) is zero. These prove the theorem. The next theorem gives a necessary and sufficient condition of ergodicity for the mean. Theorem 1.6. Let {Xt }t∈Z be a stationary time series with mean 0 and with ˜ T converges to covariance function c(j) (j ∈ Z). Then the time average X Y = 0 in mean square as T → ∞ if and only if T −1 1 X c(j) = 0. T →∞ T j=0

lim

(1.49)

Proof. By Theorem 1.5 there exists a random variable Y ∈ H(X) such that ˜ T → Y in mean square as T → ∞. Thus for any t ∈ Z we have X t+T −1 T −1 T −1 1 X 1 X 1 X hXj , Xt i = lim hXj , X0 i = lim c(j). T →∞ T T →∞ T T →∞ T j=t j=0 j=0

hY, Xt i = lim

In the case of Y = 0, this implies that (1.49) holds. Conversely, if (1.49) holds, then hY, Xt i = 0 for any t ∈ Z. Since Y belongs to the space spanned by {Xt }, Y = 0 follows. Corollary 1.7. It is an elementary analysis fact that limj→∞ c(j) = 0 implies (1.49), thus it is a sufficient (but not necessary) condition P∞ of ergodicity for the mean. An even stronger sufficient condition is that j=−∞ |c(j)| < ∞ holds. Another approach to ergodicity for the mean is to use spectral representation of the time series. Theorem 1.7. Let {Xt }t∈Z be a stationary time series with mean 0, with random spectral measure dZω , and spectral measure dF (ω), for ω ∈ [−π, π]. ˜ T converges in mean square to Y = Z{0} , the (a) Then the time average X atom (point mass) of the random spectral measure at {0}, as T → ∞. (b) Ergodicity for the mean holds if and only if dF ({0}) = 0, that is, the spectral measure has no atom at {0}.

Estimating parameters of stationary time series Proof. Take the spectral representation Z π eitω dZω Xt =

29

(t ∈ Z).

−π

Then ˜T = X

Z

π

−π

Z π T −1 1 X itω e dZ(ω) = gT (ω)dZω , T t=0 −π

where

( gT (ω) =

eiT ω −1 T (eiω −1)

1

(ω = 6 0) . (ω = 0)

Fix an arbitrary  > 0. By Corollary 1.4, we can write that Z 2 Z ˜ 2 |gT (ω)| dF (ω) + |gT (ω)|2 dF (ω), (1.50) E XT − Z{0} = 0 q.) If we use (2.27) to express the coefficients bk−` in terms of the parameters αj and β` recursively, then the right hand side of the first q equations in (2.28) will contain polynomial expressions of the parameters. Using the method of (2.2), the spectral density of an ARMA process {Xt } is f X (ω) =

1 |β(e−iω )|2 , 2π |α(e−iω )|2

β(e−iω ) =

q X `=0

β` e−i`ω ,

α(e−iω ) =

p X

αj e−ijω ,

j=0

(2.29) for ω ∈ [−π, π].

Autoregressive moving average processes

57

ARMA(p,q) process

4

Xt Xt

3

X process

2 1 0 1 2 3 4 20

40

1.0

3.0

0.8

2.5

0.6 0.4 0.2 0.0

80

100

2.0 1.5 1.0 0.5 0.0

0.2

0.5 0

20

40

Index k

60

80

100

0

20

40 60 Value of h

80

100

0

20

40 60 Value of h

80

100

100

1.50

|c(h)|, log scale

1.25 Spectral density f

60

Time t

Covariance function c(h)

Impulse response bk

0

1.00 0.75 0.50 0.25 0.00 3

2

1

0 Frequency

1

2

3

10

2

10

4

10

6

10

8

10

10

10

12

10

14

FIGURE 2.9 A typical trajectory and its prediction of a stable ARMA(4,4) process, with its bk coefficients, covariance function, and spectral density in Example 2.5. Example 2.5. Figure 2.9 shows a typical trajectory of a stable ARMA(4, 4) process and its prediction based on the finite past X0 , . . . , Xn−1 , see Subsection 5.2.1. The second, third, and fourth panel show its impulse response bk coefficients, covariance function, and spectral density. The AR polyno1 3 1 4 z − 12 z and the MA polynomial is mial is α(z) = 1 − 67 z + 12 z 2 + 12 1 1 2 1 3 1 4 β(z) = 1 − 2 z + 2 z + 3 z − 3 z . The last panel shows the exponential decrease of the covariance function. The first panel on Figure 2.10 shows the mean square prediction error as a function of n as Xn is predicted. The prediction error goes to 1, which is the non-zero impulse response b0 , since it is a regular process, see Section 2.6. The last panel shows det(Cn ).

58

ARMA, regular, and singular time series in 1D Mean square prediction error 1.40 1.35 1.30

en2

1.25 1.20 1.15 1.10 1.05 1.00 0

20

40

Value of n

60

80

100

60

80

100

Determinant of Cn 50

detCn

40 30 20 10 0

20

40

Value of n

FIGURE 2.10 The mean square prediction error and det(Cn ) of an ARMA(p, q) process in Example 2.5.

2.6

Wold decomposition in 1D

Recall that we use the notation span{Xt : t ∈ A} for the closed linear span of the random variables {Xt : t ∈ A} ⊂ L2 (Ω, F, P). Then for a weakly stationary time series {Xt }t∈Z , we define \ Hn− (X) = span{Xt : t ≤ n} (n ∈ Z), H−∞ (X) = Hn− (X). n∈Z

If it is clear from the context, we simply write Hn− instead of Hn− (X) and H−∞ instead of H−∞ (X). Hn− is called the past of {Xt } until n and H−∞ is the remote past of {Xt }. Clearly, H(X) = span{Hn− : n ∈ Z} and by the weak stationarity of {Xt }t∈Z , Hn− = S n H0− (n ∈ Z), where S is the unitary operator of right time-shift. The time series is called singular if H−∞ = H(X), − − or equivalently, if Hk− = Hk+1 for some k ∈ Z, that is, if Hk− = Hk+m for any k, m ∈ Z; otherwise, it is non-singular. The time series is regular if H−∞ = {0}. We are going to see that in general, a weakly stationary time series can be written as an orthogonal sum of a regular and a singular time series.

Wold decomposition in 1D

59

− − , X0 }. Assume that {Xt }t∈Z is non-singular, so H−1 6= H0− = span{H−1

Let X0 = X0− + X0+ ,

− X0− ∈ H−1 ,

− X0+ ⊥H−1 .

Define the random variable ξ0 := X0+ /kX0+ k. Then ξ0 ∈ H0− , kξ0 k = 1, − − ξ0 ⊥H−1 . Define ξn := S n ξ0 (n ∈ Z). Clearly, ξn ∈ Hn− , ξn ⊥Hn−1 , and Hn− = − span{Hn−1 , ξn }. Thus {ξn }n∈Z is an orthonormal sequence and ξn ⊥H−∞ for each n. This procedure resembles a Gram–Schmidt orthogonalization. Now let us expand X0 into its orthogonal series w.r.t. {ξn }t∈Z : X0 =

∞ X

bk ξ−k + Y0 .

(2.30)

k=0

P Here bk = hX0 , ξ−k i, k |bk |2 < ∞, and bk = 0 for k < 0 because ξk ⊥H0− when k ≥ 1. In particular, b0 = hX0 , ξ0 i = hX0+ , X0+ /kX0+ ki = kX0+ k > 0.

(2.31)

The vector Y0 is simply the remainder term, which of course is 0 if {ξn }t∈Z span H(X), but not in general. It is not hard to see that Y0 ∈ H−∞ . Now apply the operator S t to (2.30): Xt =

∞ X

bk ξt−k + Yt =: Rt + Yt

(t ∈ Z),

(2.32)

k=0

where we have defined Yt := S t Y0 (t ∈ Z). This way we have proved the important Wold decomposition of {Xt }. Theorem 2.1. Assume that {Xt }t∈Z is a non-singular weakly stationary time series. Then we can decompose {Xt } in the form (2.32), where {Rt }t∈Z is a regular time series (that is, a causal MA(∞) process) and {Yt }t∈Z is a singular time series, Yt ∈ H−∞ for all t; the two processes are orthogonal to each other. ˆ h of Xh (h ≥ 1) is by definition the projection The best linear prediction X of Xh to the past until 0, that is, to H0− . Time 0 is considered the present and ˆ h is called a h-step ahead prediction. By the Wold decomposition (2.32) and X the projection theorem, the best prediction is ˆh = X

∞ X k=h

bk ξh−k + Yh =

0 X

bh−j ξj + Yh

(h ≥ 1),

(2.33)

j=−∞

ˆ h ) is since the right hand side of (2.33) is in H0− and the difference (Xh − X − orthogonal to H0 . Hence, the prediction error (the mean-square error) of the h-step ahead prediction is given by ˆ h k2 = σh2 := kXh − X

h−1 X k=0

|bk |2 .

(2.34)

60

ARMA, regular, and singular time series in 1D

This implies that lim

h→∞

σh2

=

∞ X k=0



2

X

|bk | = bk ξ`−k

2

(` ∈ Z),

lim

h→∞

k=0

∞ X

bk ξh−k = 0.

k=h

Clearly, the MA part {Rt }t∈Z of {Xt }t∈Z is a regular time series. For, the past Hn− (R) := span{Rt : t ≤ n} of {Rt } is spanned by the vectors {ξt : t ≤ n}, so \ H−∞ (R) := Hn− (R) = lim Hn− (R) = {0}. n→−∞

n∈Z

If the process {Xt }t∈Z itself is regular, then Yt = 0 for each t ∈ Z, ˆ t = 0. (` ∈ Z), and lim X

lim σt2 → kX` k2

t→∞

t→∞

It is also easy to see that {Yt }t∈Z is a singular process and spans H−∞ . Clearly, if a process {Xt }t∈Z is singular, then Xt ∈ H0− for any t ∈ Z, so ˆ t = Xt for any t ∈ Z. perfect linear prediction is possible: X

2.7

Spectral form of the Wold decomposition

The regular part of a non-singular time series {Xt } is a causal MA(∞) process, so by (2.11) it has an absolutely continuous spectral measure with density f (ω) =

∞ 1 X −ijω 2 1 bj e |H(e−iω )|2 , = 2π j=0 2π

∞ X

|bj |2 < ∞,

(2.35)

j=0

where H(z) is the transition function of the process. Starting from a different perspective, we may introduce the function Φ ∈ L2 (T ), called a spectral factor, by the Fourier series Φ(e−iω ) :=

∞ X

bj e−ijω ,

ω ∈ [−π, π].

(2.36)

j=0

It is essential that this Fourier series is one-sided, its coefficients are zero for j < 0. Then this definition can be extended to the open unit disc D as an analytic function ∞ X Φ(z) := bj z j , z ∈ D. (2.37) j=0

Recall that this power series must be convergent in D, since its radius of convergence R is given by (A.3), and lim supj→∞ |bj |1/j = 1/R cannot be

Spectral form of the Wold decomposition

61

greater than 1, because then there would be infinitely many coefficients bj with value greater than 1 and that would contradict to the fact that P∞ absolute 2 j=0 |bj | is convergent. Therefore Φ is analytic in the open unit disc D and is in L2 on the unit circle T , with zero Fourier coefficients when j < 0. It means that Φ ∈ H 2 , see Section A.3 about Hardy spaces. We mention that the negative sign in the exponents in (2.36) is only a matter of convention, usual in the theory of time series. An H 2 function which is not identically 0, vanishes only on a set of measure 0 on T , so the spectral density f in (2.35) is positive a.e. on T , see the last sentence of Theorem A.13. Remark 2.2. The next lemma shows that the Wold decomposition (2.32) is a spectral decomposition in the sense too that the support of the spectral measure of the singular process {Yt } must be disjoint from the set {ω ∈ [−π, π] : f (ω) 6= 0}. Consequently, the spectral measure of {Yt } is a singular measure w.r.t. Lebesgue measure. This way we see that the Wold decomposition (2.32) of {Xt } is equivalent to a decomposition of the spectral measure of a non-singular process into an absolutely continuous and a singular part w.r.t. Lebesgue measure on [−π, π]. Lemma 2.1. Assume that {Xt } is a stationary time series, Xt = Ut + Vt , where Ut , Vt ∈ H(X) for all t ∈ Z, {Ut } and {Vt } are stationary processes, and hUt , Vs i = 0 for all t, s. Then there is a decomposition A ∪ B = [−π, π], A ∩ B = ∅, A and B are measurable, and Z π Z π Z π Xt = eitω dZω = eitω 1A dZω + eitω 1B dZω = Ut + Vt (t ∈ Z). −π

−π

−π

Proof. We know that there is a unique unitary operator S on H(X) such that Xt = S t X0 (t ∈ Z). The assumptions imply that the Hilbert subspaces H(U ) and H(V ) of H(X) are orthogonal and H(X) = H(U ) ⊕ H(V ), where ⊕ denotes orthogonal direct sum. There exist unitary operators SU and SV on H(U ) and H(V ), respectively, such that Ut = SUt U0 and Vt = SVt V0 . Then the direct sum SU ⊕ SV is a unitary operator on H(X) that maps Xt into Xt+1 . Thus we conclude that S = SU ⊕ SV and SU and SV are the restrictions of S to U and V , respectively. Since U0 , V0 ∈ H(X), there exist functions u, v ∈ L2 ([−π, π], B, dF ) such that Z π Z π U0 = u(ω)dZω , V0 = v(ω)dZω , −π

−π

and so Z

π

Ut =

eitω u(ω)dZω ,

−π

Z

π

Vt =

eitω v(ω)dZω

(t ∈ Z).

−π

But the orthogonality of {Ut } and {Vt } implies that for any t and s, Z π ei(t−s)ω u(ω)v(ω)dω) = hUt , Vs i = 0. −π

62

ARMA, regular, and singular time series in 1D

It implies that the function h := u¯ v is orthogonal to all trigonometric polynomials in L2 ([−π, π], B, dF ). Since in general h ∈ L1 ([−π, π], B, dF ) only, we want to show that this still implies that h = 0 a.e. with respect to dF . For any continuous function g such that g(−π) = g(π), define the bounded linear functional Z π Kg := g(ω)h(ω)dF (ω). −π

Since K vanishes for all trigonometric polynomials, a dense set in the space of continuous functions g with g(−π) = g(π), it follows that Kg = 0 for any such g as well. By the Riesz representation theorem, K can be represented by a complex measure µ so that dµ(ω) = h(ω)dF (ω). Thus µ must be identically 0 and h(ω) = 0 for a.e. ω with respect to dF . On the other hand, u(ω) + v(ω) = 1 a.e. with respect to dF , since Z π Z π {u(ω) + v(ω)}dZω = 1 dZω = X0 , −π

−π

and the spectral representation is unique. Thus uv = 0 and u + v = 1 a.e. (dF ), so u and v are indicator functions on disjoint sets A and B, A ∪ B = [−π, π], except on a set of dF measure 0, which does not influence the integrals. This proves the lemma. Assume now that {Xt } is regular, with spectral representation Z π Xt = eitω dZω .

(2.38)

−π

We want to show that under not too restrictive assumptions we can find the causal MA(∞) representation Xt =

∞ X

bk ξt−k ,

t ∈ Z,

k=0

by factoring the spectral density f of {Xt }. This method then automatically extends to the regular part of any non-singular time series. Since ξ0 ∈ H(X) by its definition, there is a function ψ(ω) ∈ L2 ([−π, π], B, dF ) which represents it; we then have Z π n ξn = S ξ0 = einω ψ(ω)dZω . (2.39) −π

We want to show that one can find the function ψ(ω) and so the orthonormal sequence {ξn }, by suitably factoring the spectral density f (ω) of the regular part of {Xt }. There are three conditions that {ξn } must satisfy:

Spectral form of the Wold decomposition

63

1. orthonormality, 2. ξn ⊥Xn−k for k > 0, 3. ξn ∈ Hn− for n ∈ Z. By (1), Z

π

δnm = hξn , ξm i =

einω ψ(ω)e−imω ψ(ω)f (ω)dω,

−π

which implies 1 (a.e.) 2π because of the uniqueness theorem for Fourier series. Then we can factor the spectral density |ψ(ω)|2 f (ω) =

f (ω) =

1 1 1 1 · =: φ(ω) · φ(ω), 2π ψ(ω) ψ(ω) 2π

where the spectral factor φ is a complex valued square integrable function on [−π, π]. There is a simple one-to-one correspondence of this spectral factor φ defined on [−π, π] and Φ defined by (2.36) on the unit circle T , as explained in Remark 1.2. Next, condition (2) requires that hξn , Xn−k i = 0, k > 0. But Z π Z π 1 hξn , Xn−k i = einω ψ(ω)e−i(n−k)ω f (ω)dω = eikω φ(ω)dω = 0, 2π −π −π for all k > 0, that is, φ has to be orthogonal to e−ikω for all k < 0. It means that the Fourier series of φ has to be of ‘power series type’: φ(ω) =

∞ X

bk e−ikω ,

ω ∈ [−π, π],

k=0

that is, φ ∈ H 2 , where H 2 is the closed linear span of {e−ikω : k ≥ 0} in L2 (T ), see Subsection A.3.2 in the Appendix about the Hardy space H 2 . Comparing with (2.36) and (2.37), it shows that φ(ω) is the boundary value of a function Φ(z) which is regular in D: φ(ω) = Φ(e−iω ), ω ∈ [−π, π]. Finally, by (3), we must have ξn ∈ Hn− ; for this ξ0 ∈ H0− is enough. But by (2.39), Z π Z π 1 ξ0 = ψ(ω)dZω = dZω . −π −π φ(ω) The ξ0 ∈ H0− means that there exist complex numbers {γk }∞ k=0 , P∞ requirement 2 |γ | < ∞, such that k k=0 ξ0 =

∞ X k=0

γk X−k , and so ψ(ω) =

∞ X

γk e−ikω

k=0

by the spectral representation of the time series {Xt }. It means that we must have 1/φ = ψ(ω) ∈ H 2 . This way we have proved:

64

ARMA, regular, and singular time series in 1D

Theorem 2.2. Assume that the spectral density f of a regular stationary time 1 series {Xt } has a factorization f = 2π φ · φ, where φ ∈ H 2 and 1/φ ∈ H 2 . (a) Then the orthonormal sequence appearing in the Wold decomposition is given by the random variables Z π itω e ξt = dZω (t ∈ Z). (2.40) φ(ω) −π (b) We have the Fourier series φ(ω) =

∞ X

bk e−ikω ,

1 2π

bk =

k=0

Z

π

eikω φ(ω)dω,

−π

∞ X

|bk |2 < ∞,

(2.41)

k=0

and this defines the coefficients of the Wold representation: Xt =

∞ X

bk ξt−k ,

t ∈ Z.

(2.42)

k=0

Proof. We have seen (2.40) above. On the other hand, if φ ∈ H 2 , then by (2.38), (2.40), and (2.41), Z

π

Xt =

e −π

itω

Z

π

dZω = −π

eitω

∞ X



bk e−ikω

k=0

X 1 dZω = bk ξt−k , φ(ω) k=0

using the properties of spectral representation. Comparing (2.35), (2.36) and (2.41), we see that φ(ω) = Φ(e−iω ) = H(e−iω ), where Φ(z) is the spectral factor and H(z) is the transition function of the process. Theorem 2.2 gives an explicit solution to causal MA(∞) representation of regular time series under conditions that are relatively mild. If {Xt } is an arbitrary non-singular time series, then Theorem 2.2 is still valid, with the difference that (2.42) gives the regular part of the process, since by Remark 2.2, the spectral density of a non-singular time series is the same as the spectral density of its regular part. In the next section we are going to discuss two special cases, where the factorization of the spectral density is rather straightforward. Remark 2.6 later will give an explicit formula for the factoring the spectral density of a general regular time series in 1D.

Factorization of rational and smooth densities

2.8 2.8.1

65

Factorization of rational and smooth spectral densities Rational spectral density

We saw in (2.29) that the spectral density f of a stationary ARMA process is a rational function of e−iω . Remark 2.3. If a spectral density f is rational function of z = e−iω , then the rational function cannot have poles on the unit circle T . For, the spectral −iω measure dF is a finite measure, while a rational function that has R π of z = e one or more poles on T cannot have finite integral −π f (ω)dω. Lemma 2.2. (Fej´er–Riesz lemma [17]) If the spectral density f is a rational function of z = e−iω and f (ω) ≥ 0 for ω ∈ [−π, π], then it can be written in the form f (ω) =

1 |β(e−iω )|2 2π |α(e−iω )|2

(−π ≤ ω ≤ π),

(2.43)

where the polynomials α(z) =

p X j=0

αj z j ,

β(z) =

q X

βj z j

j=0

are relative prime, α has no zeros in the closed unit disc {z : |z| ≤ 1}, and β has no zeros in the open unit disc D. If, moreover, f (−ω) = f (ω), then the coefficients of the polynomials α and β can be chosen real. Proof. By our assumptions we can write that Qn (e−iω − vk ) −i`ω Qk=1 f (ω) = ce , m −iω − u ) j j=1 (e where c ∈ C, ` ∈ Z, m, n ≥ 0, and the sets of complex nonzero numbers {uj }j=1,...,m and {vk }k=1,...,n are disjoint. Since f (ω) is real, Qn Qn vk−1 − e−iω ) (eiω − v¯k ) 0 iω(`+n−m) i`ω k=1 (¯ k=1 = c e , (2.44) f (ω) = c¯e Qm Q m iω − u −iω ) ¯j ) u−1 j=1 (e j −e j=1 (¯ where c0 ∈ C. Since f (ω) ≥ 0, it coincides with the absolute values of all the above expressions written for it. Considering that −iω |e−iω − w ¯ −1 | = |w|−1 we ¯ − 1 = r−1 |re−i(θ+ω) − 1| = r−1 |1 − rei(θ+ω) | = |w|−1 e−iω − w , w = reiθ ,

66

ARMA, regular, and singular time series in 1D

it follows that for each root uj and vk , where |uj | > 1 and |vk | > 1, there exists another root uj 0 and vk0 , respectively, such that uj = u ¯−1 ¯k−1 0 . j 0 and vk = v −iω Also because of f (ω) ≥ 0, in case of any factor (e −vk ), where vk = e−iω0 is on the unit circle T , the root must be double: (e−iω − e−iω0 ) = (eiω − eiω0 ) = −ei(ω+ω0 ) (e−iω − e−iω0 ). These prove (2.43). If f (−ω) = f (ω), then by the first equality of (2.44), for every root uj and vk there corresponds a root uj 0 = u ¯j and vk0 = v¯k , respectively. Thus α and β can be chosen real. Corollary 2.2. If the spectral density f is a rational function of z = e−iω 1 φ · φ is straightforward and f (ω) > 0 for ω ∈ [−π, π], the factorization f = 2π by (2.43): Qq (e−iω − vk )(eiω − v¯k ) , f (ω) = K Qk=1 p −iω − u )(eiω − u ¯j ) j j=1 (e Qq √ (e−iω − vk ) β(e−iω ) φ(ω) = 2πK Qk=1 , (2.45) = p −iω − u ) α(e−iω ) j j=1 (e where K > 0 is a constant and each |vk | > 1, |uj | > 1. In this case Theorem 2.2 can be applied to find an explicit causal MA(∞) representation of the time series since both φ and 1/φ are continuous, so bounded, on [−π, π] and this way they belong to the Hardy space H 2 . When p > 1, it may be useful to apply a partial fraction expansion of φ in order to obtain the Fourier series explicitly. Equations (2.10) and (2.45) show that a process with spectral density satisfying the assumptions of Corollary 2.2 is an MA(q) process when p = 0, that is, the denominator is 1. Also, (2.24) and (2.45) give that the process is an AR(p) process when q = 0, that is, the numerator is 1. Finally, (2.29) and (2.45) show that the process is a proper ARMA(p, q) process otherwise. It simplifies the discussion if from now on, we consider AR and MA processes as special ARMA processes. Remark 2.4. By Lemma 2.2 it follows that any 1D stationary time series with spectral density f which is a rational function of z = e−iω can be represented with a stable ARMA process. Also, any ARMA process satisfies the assumptions of Lemma 2.2, so it can be represented as a stable ARMA process.

2.8.2

Smooth spectral density

Now assume that f > 0 and f is continuously differentiable on [−π, π]. Then log f is also continuously differentiable and it can be expanded into a uniformly

Classification of stationary time series in 1D

67

convergent Fourier series: log f (ω) =

∞ X

β−n = β¯n .

βn einω ,

n=−∞

Now we write log f (ω) = Q(ω) + Q(ω) :=

−1 X 1 β0 + βn einω 2 n=−∞

! +

∞ X 1 β0 + βn einω 2 n=1

! .

Then by Theorem 2.2 f (ω) = eQ(ω) · eQ(ω) =:

1 φ(ω) · φ(ω) 2π

is the correct factorization. For, then both φ¯ and 1/φ¯ are continuous on [−π, π] and the Fourier series of φ¯ contains only non-negative powers of eiω . This leads to new formulas for the prediction error. By Theorem 2.2, we need the Fourier series of  k ∞ ∞ X X √ √ 1 1 φ(ω) = 2πeQ(ω) = 2π β0 + β¯j e−ijω  . k! 2 j=1 k=0

In particular, the constant term is ∞ √ X 1 b0 = 2π k! k=0



β0 2

k =



2πeβ0 /2 .

But β0 is just the 0th Fourier coefficient of log f , so by formula (2.34) the one-step ahead prediction error is given by Z π dω 2 2 β0 σ1 = b0 = 2πe = 2π exp log f (ω) . (2.46) 2π −π The expressions for the several step-ahead prediction errors σ22 , σ32 , . . . are similar, but more complicated.

2.9

Classification of stationary time series in 1D

The theorems and their proofs in this section are based on the seminal paper “Stationary sequences in Hilbert space” [34] by Kolmogorov from 1941. Wold decomposition in Section 2.6 showed that a stationary time series

68

ARMA, regular, and singular time series in 1D

{Xt }t∈Z is regular if and only if it can be represented as a causal MA(∞) process ∞ X Xt = bj ξt−j , (2.47) j=0

with an orthonormal sequence {ξt }t∈Z . Moreover, in (2.35) and (2.36) we saw that the spectral density function f X of a regular time series {Xt } can be 1 |Φ(e−iω )|2 , where Φ(z) is analytic in the open unit written as f X (ω) = 2π disc D and is in L2 , more accurately, is in H 2 , on the unit circle T . The next lemma gives an even more precise and interesting description of Φ. Lemma 2.3. [34, Theorem 21] For any regular stationary time series {Xt }t∈Z , the analytic function Φ(z) has no zeros in the open unit disc D. Proof. Indirectly, assume that there exists a z0 ∈ D, where Φ(z0 ) = 0. Then consider the linear fractional transformation Φ(z) :=

z − z0 , 1 − z¯0 z

which maps the unit circle T and the open unit disc D onto itself in a one-toone way, respectively. We introduce a new stationary time series {ηt }t∈Z by linear filtration from the orthonormal series {ξt }t∈Z of (2.47). As we saw in Section 2.2, such a filter can be determined by the Fourier series of its weights {φj }j∈Z , the conditional spectral density of η w.r.t. ξ: ˆ f η|ξ (ω) := φ(ω) :=

∞ X

φj e−ijω .

(2.48)

e−iω − z0 . 1 − z¯0 e−iω

(2.49)

j=−∞

In the present case we choose ˆ f η|ξ (ω) = φ(ω) = Φ(e−iω ) = Since f η|ξ is a bounded function, |f η|ξ (ω)|2 = 1,

(2.50)

it belongs to L2 ([−π, π], B, dF ξ ), where dF ξ (ω) =

1 dω, 2π

and it shows that the time series {ηk }t∈Z is well-defined by the above filtering. Moreover, by (2.2), dF η (ω) = |f η|ξ (ω)|2 dF ξ (ω) = so {ηk }t∈Z is also an orthonormal sequence.

1 dω, 2π

Classification of stationary time series in 1D

69

By (2.7), dF X,η (ω) = f X|ξ (ω)f η|ξ (ω)dF ξ (ω) = f X|ξ (ω)f η|ξ (ω)

1 dω. 2π

Since by (2.50), f η|ξ (ω) = 1/f η|ξ (ω), and by (2.6), f X|η = dF X,η /dF η , we obtain that f X|ξ (ω) . f X|η (ω) = η|ξ f (ω) Comparing (2.2) and (1.28) it follows that f X|ξ (ω) = Φ(e−iω ), thus by (2.49) we get the result f X|η (ω) = Φ(e−iω ) Also,

1 − z¯0 e−iω ˜ −iω ). =: Φ(e e−iω − z0

1 − z¯0 z ˜ Φ(z) = Φ(z) z − z0

(|z| ≤ 1),

which is an analytic function in D, since z0 ∈ D is a zero of Φ(z). Moreover, ˜ Φ(z) is in L2 (in fact, in H 2 ) on T . Thus the properties of ˜ Φ(z) =

∞ X

˜bj z j

(z ∈ D)

j=0

guarantees that we may use the series {ηt }t∈Z as an orthonormal sequence in H(X) instead of {ξt }t∈Z : Xt =

∞ X

˜bj ηt−j

(t ∈ Z).

φj ξt−j

(t ∈ Z).

j=0

Formula (2.48) implies that ηt =

∞ X j=0

This shows that ηt ∈ Ht− (ξ) := span{ξj : j ≤ t} for any t, and so ηt ∈ Ht− (X) := span{Xj : j ≤ t} for any t. Thus the procedure of the Wold decomposition described in Section 2.6 implies that ˜b0 ηt = b0 ξt



ηt =

Then it would follow that f η|ξ (ω) =

b0 ξ ˜b0 t

(t ∈ Z).

b0 , ˜b0

which contradicts (2.49), and this proves the lemma.

70

ARMA, regular, and singular time series in 1D

Theorem 2.3. [34, Theorem 22] A stationary time series {Xt }t∈Z is regular if and only if the following three conditions hold on [−π, π]: (1) its spectral measure dF is absolutely continuous w.r.t. Lebesgue measure; (2) its spectral density f is positive a.e.; (3) (Kolmogorov’s condition) log f is integrable. Proof. Every regular time series can be represented as a causal MA(∞) process (2.47), so the necessity of conditions (1) and (2) follows from this. Let us show the necessity of condition (3). By (2.35) and (2.36) it follows that its spectral density function f can be written as 2 X ∞ 1 1 −ijω |Φ(e−iω )|2 = b e f (ω) = j , 2π 2π j=0 where Φ(z) is analytic in the open unit disc D and is in L2 on the unit circle T . By Lemma 2.3 we√know that Φ(z) has no zeros in D. Denote by Q(z) that branch of log(Φ(z)/ 2π) which at z = 0 takes the real value Φ(0) b0 Q(0) = log √ = log √ ∈ R, 2π 2π

(2.51)

since b0 > 0 by (2.31). The fact that Φ(z) has no zeros in D implies that the above choice uniquely defines the analytic function Q(z) in D and √ Φ(z) = 2π eQ(z) (z ∈ D). (2.52) Clearly, Re Q(z) = log |Φ(z)| −

1 log(2π). 2

(2.53)

Denote by Re+ z := max(Re(z), 0) and log+ x := max(log x, 0) (x ≥ 0). Then Re+ Q(z) < log+ |Φ(z)| ≤ |Φ(z)|, and so by the maximum modulus theorem, for any 0 ≤ ρ < 1 we have Z π Z π Z π Re+ Q(ρe−iω )dω < |Φ(ρe−iω )|dω ≤ |Φ(e−iω )|dω =: K < ∞. −π

−π

−π

Equation (2.53) shows that Re Q(z) is a continuous subharmonic function in D, see Theorem A.8. It means that Z π 1 Re Q(0) ≤ Re Q(ρe−iω )dω (0 ≤ ρ < 1). 2π −π

Classification of stationary time series in 1D These imply that Z π Z |Re Q(ρe−iω )|dω = −π

71

π

{2Re+ Q(ρe−iω ) − Re Q(ρe−iω )}dω

−π

≤ 2K − 2πRe Q(0)

(0 ≤ ρ < 1).

(2.54)

By Theorem A.10, the radial limit lim Φ(ρe−iω ) = Φ(e−iω )

ρ→1

exist a.e. on T , and (2.53) and (2.54) imply that Φ(e−iω ) 6= 0 a.e. on T . Hence lim log

ρ→1

Φ(ρe−iω ) √ = lim Q(ρe−iω ) = Q(e−iω ) ρ→1 2π

also exist a.e. on T . Then (2.54) implies that the boundary value of Re Q(z) on T : Re Q(e−iω ) = log

1 |Φ(e−iω )| |Φ(e−iω )|2 1 √ = log = log f (ω) 2 2π 2 2π

(2.55)

is integrable w.r.t. Lebesgue measure on [−π, π]. This proves the necessity of condition (3) of the theorem. Now we turn to the proof of sufficiency of the conditions of the theorem. Thus we may define the harmonic real function Re Q(z) in D by the definition Z π 1 Re Q(ρe−iω ) := log f (t)Pρ (ω − t)dt. (2.56) 4π −π (See the definition and properties of Poisson integrals in Section A.2.) There exists a unique analytic function Q(z) in D whose real part is Re Q(z), with its harmonic conjugate Im Q(z), if we assume that Q(0) = Re Q(0) ∈ R, and consequently, Im Q(0) = 0. Then we may define the analytic function Φ(z) in D by formula (2.52):   Z π √ 1 eit + z Φ(z) = 2π exp log f (t) it dt . (2.57) 4π −π e −z By Jensen’s inequality, see e.g. [50, p. 63], if g is an a.e. positive function and µ is a probability measure on a set A, then Z Z exp log g dµ ≤ gdµ. A

A

Apply this to (2.56):  |Φ(ρe−iω )|2 1 = exp 2Re Q(ρe−iω ) ≤ 2π 2π

Z

π

f (t)Pρ (ω − t)dt. −π

72

ARMA, regular, and singular time series in 1D

Hence by Fubini’s theorem, Z π Z π Z π Z π 1 1 |Φ(ρe−iω )|2 dω ≤ f (t) Pρ (ω − t)dω = f (t)dt < ∞ 2π −π 2π −π −π −π for any 0 ≤ ρ < 1. This shows that the boundary value Φ(e−iω ) exists, it is in L2 (T ), and Φ(z) belongs to the Hardy space H 2 , see Section A.3. It follows that ∞ X Φ(e−iω ) = bj e−ijω (bj ∈ C). (2.58) j=0

By (A.9), for the boundary values we have  |Φ(e−iω )|2 = 2π exp 2Re Q(e−iω ) = 2πf (ω)

(2.59)

for a.e. ω ∈ [−π, π]. Let us use linear filtering of the stationary time series {Xt }t∈Z by the conditional spectral density f ξ|X (ω) = By (2.59), Z

1 . Φ(e−iω )

π

|f ξ|X (ω)|2 f (ω)dω = 1,

−π

so f ξ|X ∈ L2 (−π, π], B, dF X ), the filter is well-defined. By (2.2), dF ξ (ω) = |f ξ|X (ω)|2 f (ω)dω =

1 , 2π

so the resulting process: {ξk }t∈Z is an orthonormal sequence. Since f X|ξ (ω) = 1/f ξ|X (ω) = Φ(e−iω ), by (2.58) it follows that Xt =

∞ X

bj ξk−j

(t ∈ Z),

j=0

which shows that {Xt } is a regular process and so it completes the proof of the theorem. Remark 2.5. If {Xt }t∈Z is regular, then Q(z) defined in the previous theorem is analytic in D, so we can write that Q(z) =

∞ X k=0

γk z k

(z ∈ D).

Classification of stationary time series in 1D

73

By condition (3) of the theorem and by formula (2.55), Re Q(e−iω ) ∈ L1 (T ), so it can be expanded in a Fourier series: ∞



X X 1 log f (ω) ∼ Re(γk eikω ) = (αk cos kω − βk sin kω) 2 k=0 k=0 (2.60) for ω ∈ [−π, π], where γk = αk + iβk . Then we also have Re Q(e−iω ) =

Im Q(e−iω ) ∼

∞ X

Im(γk eikω ) =

k=1

∞ X

(αk sin kω + βk cos kω),

k=1

since γ0 ∈ R by (2.51). Remark 2.6. Equation (2.57) gives an explicit formula for the factorization of the spectral density of an arbitrary one-dimensional regular time series:   Z π √ 1 eit + z 1 |Φ(e−iω |2 , Φ(z) = 2π exp log f (t) it dt , f (ω) = 2π 4π −π e −z where z ∈ D, Φ(e−iω ) is the boundary value of the analytic function Φ(z), by the proof of the previous theorem. We mention that according to the definition in Subsection A.3.1 in the Appendix, Φ2 (z)/(2π) is an outer function. Its absolute value at z = e−iω is the spectral density f (ω). Remark 2.7. (Kolmogorov–Szeg˝o formula) If {Xt }t∈Z is regular, by (2.51), (2.52), and (2.60) we obtain that   Z π √ √ √ 1 b0 = Φ(0) = 2πeQ(0) = 2πeα0 = 2π exp log f (ω)dω , 4π −π so Z π dω σ12 = b20 = 2π exp log f (ω) , 2π −π see Remark 2.6 as well. Thus for any regular time series we have got the same prediction error formula as obtained earlier in the special case of smooth spectral densities, see (2.46). As follows from the next theorem, see Corollary 2.3, this formula is valid for any non-singular time series as well. The next theorem gives the different classes of singular time series. Theorem 2.4. [34, Theorem 23] Assume that {Xt }t∈Z is a stationary time series with spectral measure dF on [−π, π]. Then there exists a unique Lebesgue decomposition dF = dFa + dFs ,

dFa  dω,

dFs ⊥dω,

(2.61)

where dω denotes the Lebesgue measure, dFa (ω) = fa (ω)dω is the absolutely continuous part of dF with density fa and dFs is the singular part of dF , which is concentrated on a zero Lebesgue measure subset of [−π, π]. The following three cases are distinguished:

74

ARMA, regular, and singular time series in 1D

(1) fa (ω) = 0 on a set of positive Lebesgue measure on [−π, π]; Rπ (2) fa (ω) > 0 a.e., but −π log fa (ω)dω = −∞; Rπ (3) fa (ω) > 0 a.e. and −π log fa (ω)dω > −∞. Then in cases (1) and (2), the time series X is singular. In case (3), the time series {Xt } is non-singular, so we may apply the Wold decomposition to write Xt = Rt + Yt =

∞ X

bj ξt−j + Yt (t ∈ Z),

(2.62)

j=0

where {Rt } is a regular time series with absolutely continuous spectral measure dFa (ω) = fa (ω)dω and {Yt } is a singular time series with singular spectral measure dFs , as described by (2.61). Proof. Assume first that {Xt } is non-singular. Then, as we saw earlier, in the Wold decomposition (2.62) {Rt } and {Yt } are orthogonal to each other, {Rt } is a regular process with absolutely continuous spectrum dFa (ω) = fa (ω)dω and {Yt } is a singular process with singular spectrum dFs w.r.t. Lebesgue measure. Theorem 2.3 shows that for the density fa of the regular part case (3) of the present theorem holds. Conversely, we want to show that in case (3) {Xt } is non-singular. Our starting point will be the Lebesgue decomposition (2.61). We want to find two mutually orthogonal stationary time series {Rt } and {Yt } both subordinated to {Xt }, Rt + Yt = Xt , with spectral measure dFa and dFs , respectively. Since {Rt } and {Yt } are orthogonal to each other, their covariance function C R,Y (k) := E(Rt+k Yt ) = 0 for any k ∈ Z. Thus C R,X (k) = E(Rt+k Xt ) = E(Rt+k Rt ) = C R (k),

C Y,X (k) = C Y (k),

for k ∈ Z and so we have the same relationship for spectral measures: dF R,X = dF R = dFa and dF Y,X = dF Y = dFs . By (2.2) and (2.6) it implies for the conditional densities that f R|X = |f R|X |2 ,

f Y |X = |f Y |X |2 ,

|f R|X |2 + |f Y |X |2 =

dF Y dF R + = 1. dF dF

It means that f R|X and f Y |X may take only 0 and 1 values, complementarily, f R|X =

dFa , dF

f Y |X =

dFs , dF

and they belong to L2 ([−π, π], B, dF ). Moreover, since {Rt } and {Yt } should belong to H(X), by (2.1) we may define them as Z π Z π itω R|X Rt = e f (ω)dZω , Yt = eitω f Y |X (ω)dZω , Xt = Rt + Yt −π

−π

Examples for singular time series

75

for t ∈ Z. Thus {Rt } has spectral measure dFa (ω) = fa (ω)dω and {Yt } has a singular spectral measure dFs . Then by Theorem 2.3, case (3) of the present theorem implies that {Rt } is a regular time series. Therefore, Xt = Rt + Yt =

∞ X

bj ξt−j + Yt

(t ∈ Z),

(2.63)

j=0

where {ξt }t∈Z is an orthonormal sequence. Let Hk− (R) := span{Rj : j ≤ k}, the past of {Rt } until k, and H(Y ) := span{Yj : j ∈ Z}, the closed space spanned by {Yt }. Formula (2.63) shows that Hk− (X) := span{Xj : j ≤ k} ⊂ (Hk− (R) ⊕ H(Y )). (The right hand side is the closed direct sum, that is, the smallest closed space spanned by Hk− (R) and HY .) Now we write   ∞ X  − + Xk+1 = Xk+1 + Xk+1 := bj ξk+1−j + Yk+1 + b0 ξk+1 .   j=1

− + Clearly, Xk+1 ∈ (Hk− (R) ⊕ HY ), while Xk+1 = b0 ξk+1 ⊥(Hk− (R) ⊕ HY ). Moreover, by the condition in case (3) and the Kolmogorov–Szeg˝o formula (Remark 2.7) we see that b0 > 0, hence Xk+1 ∈ / Hk− (X). This proves that X is nonsingular. Thus we have shown that in case (3) the time series is non-singular and for any non-singular time series the case (3) holds. It implies that in the cases (1) and (2) the time series is singular and this completes the proof of the theorem.

Corollary 2.3. In the proof of the previous theorem we saw that for any non-singular time series {Xt } the Wold decomposition Xt = Rt + Yt gives a regular process {Rt } whose spectral density function f R is a.e. the same as the spectral density function fa of {Xt }. Thus the Kolmogorov–Szeg˝ o formula (Remark 2.7) is valid for any non-singular process {Xt } and its spectral density function fa .

2.10 2.10.1

Examples for singular time series Type (0) singular time series

In the Lebesgue decomposition (2.61) one can further decompose the singular spectral measure: dFs = dFd + dFc , where dFd is the discrete spectrum corresponding to at most countable many jumps of the spectral distribution function F , while dFc is the continuous singular spectrum.

76

ARMA, regular, and singular time series in 1D Spectral measure

Covariance function

1.0 3 0.8

2

sj2

c(h)

0.6 0.4

0 1

0.2 0.0

1

2 2

1

0 Value of

1

2

0

5

10 15 Value of h

20

25

FIGURE 2.11 Spectral measure and covariance function of a Type(0) singular process in Example 2.6. (a) A typical example for a process with discrete spectrum: Xt =

n X

Aj eitωj ,

t ∈ Z,

(2.64)

j=1

where −π < ω1 < · · · < ωn ≤ π; A1 , . . . , An are uncorrelated random variables with mean 0 and variance s2j (j = 1, . . . , n). (The Aj ’s can be e.g. independent Gaussian random variables.) This process is weakly stationary with Z π n n X X 2 ikωj 2 ikωj c(k) = E(Xt+k Xt ) = E(|Aj | )e = sj e = eikω dF (ω) j=1

j=1

π

for k ∈ Z, where F (ω) =

X

s2j ,

ω ∈ [−π, π].

ωj ≤ω

Observe that the covariance function does not tend to 0 as k → ∞. Example 2.6. Figure 2.11 shows the spectral measure and the covariance function of a Type (0) process (2.64) with n = 6. The frequencies ωj are {−2, −π/2, −1, 1, π/2, 2} and the variances s2j are {1/3, 1/2, 1, 1, 1/2, 1/3}. Since in this example the covariance function is real valued, by the construction in Subsection 1.4.1 it is possible to construct a real valued process {Xt } with this covariance function. The first panel on Figure 2.12 shows the mean square prediction error e2n when predicting Xn based on the finite past X0 , . . . , Xn−1 , see Subsection 5.2.1. If n ≥ 5, square error e2n = 0. On the second panel det(Cn ) is shown, which is also 0 if n ≥ 6, see Remark 5.5. (b) The standard example for a continuous singular function on [0, 1] is the Cantor function γ, “the devil’s ladder.” Suppose C is the Cantor set in [0, 1],

Examples for singular time series

77

Mean square prediction error

3.5 3.0 2.5 en2

2.0 1.5 1.0 0.5 0.0 0

5

10

Value of n

15

20

15

20

Determinant of Cn 40 35 30 detCn

25 20 15 10 5 0 0

5

10

Value of n

FIGURE 2.12 Prediction error and det(Cn ) of a Type(0) singular process in Example 2.6. that is, x ∈ C if and only if in base 3 expansion. While in (a) above we discussed a singular process with pure discrete spectrum, now we consider a singular process with continuous singular spectrum. x=

∞ X

an 3−n ,

an = 0 or 2.

n=1

Then the Cantor function γ : [0, 1] → [0, 1] can be defined as  P∞ 1 P∞ an 2−n , x = n=1 an 3−n ∈ C; n=1 2 γ(x) = sup{γ(y) : y ≤ x, y ∈ C}, x ∈ [0, 1] \ C. Then γ(0) = 0, γ(1) = 1, γ is non-decreasing on [0, 1], and γ 0 (x) = 0 for a.e. x ∈ [0, 1]. By the case d = 1 of Corollary 1.6, the definition   ω+π F (ω) = γ , ω ∈ [−π, π], 2π gives the spectral distribution function of a singular stationary time series X. Heuristically, the spectrum of X consists of the points of an uncountable but zero Lebesgue measure Cantor set, with infinitesimally small amplitudes.

78

ARMA, regular, and singular time series in 1D

2.10.2

Type (1) singular time series

A simple example for a singular time series corresponding to case (1) of Theorem 2.4 is the one with spectral density function  1 2 , |ω| ≤ 1; f (ω) = 0, 1 < |ω| ≤ π. By the case d = 1 of Corollary 1.6, one can construct a singular stationary time series with this spectral density. Its covariance function is  Z 1 1 1, k = 0; c(k) = eikω dω = sin k , k 6= 0. 2 −1 k It is an example for an absolutely non-summable covariance function which still corresponds to an absolutely continuousP spectral measure. On the other P hand, k c(k) converges conditionally; also, k |c(k)|2 < ∞. Generalizing the previous example, observe the following interesting phenomenon. For any δ > 0 fixed, a time series with spectral density  1 2(π−δ) , |ω| ≤ π − δ; f (ω) = 0, π − δ < |ω| ≤ π. is still singular, like the one above. On the other hand, if we take δ = 0, that 1 is, f (ω) = 2π for any ω ∈ [−π, π], then the time series becomes a regular, orthonormal series. Example 2.7. Figure 2.13 shows the spectral density and the covariance function of the above described Type (1) singular process. The first panel on Figure 2.14 shows the mean square prediction error e2n when predicting Xn based on the finite past X0 , . . . , Xn−1 , see Subsection 5.2.1. As n is growing, the prediction error is going to 0, since this is a singular process. On the second panel det(Cn ) is shown, which goes to 0 as well, see Remark 5.5.

2.10.3

Type (2) singular time series

An example for case (2) of Theorem 2.4 is the following spectral density: 1

f (ω) = e− |ω| ,

ω ∈ [−π, π] \ {0},

f (0) = 0.

Then f (ω) > 0 a.e. and f is continuous everywhere on [−π, π], Z π Z π Z π 1 f (ω)dω < ∞, log f (ω)dω = − = −∞. |ω| −π π −π By the case d = 1 of Corollary 1.6, one can construct a singular stationary

Examples for singular time series

79

Spectral density

Covariance function

1.0

1.0

0.8

0.8 0.6 c(h)

f( )

0.6 0.4

0.4 0.2

0.2

0.0

0.0

0.2 3

2

1

0 Value of

1

2

3

0

5

10

15

20 25 Value of h

30

35

40

FIGURE 2.13 Spectral density and covariance function of a Type (1) singular process in Example 2.7.

Mean square prediction error

0.30 0.25

en2

0.20 0.15 0.10 0.05 0.00 0

5

10

15

20 Value of n

25

30

35

25

30

35

Determinant of Cn 1.0

detCn

0.8 0.6 0.4 0.2 0.0 0

5

10

15

20 Value of n

FIGURE 2.14 Prediction error and det(Cn ) of a Type (1) singular process in Example 2.7.

80

ARMA, regular, and singular time series in 1D Spectral density

2.5

0.6

2.0

0.5

1.5

0.4

c(h)

f( )

Covariance function

3.0

0.7

0.3

1.0 0.5

0.2

0.0

0.1

0.5

0.0

1.0 3

2

1

0 Value of

1

2

3

0

100

200 300 Value of h

400

500

FIGURE 2.15 Spectral density and covariance function of a Type (2) singular process in Example 2.8. time series {Xt } with this spectral density. Theorem 4.1 later shows that a time series can be represented as a two-sided infinite MA (a sliding summation) if and only if it has constant rank, that is, in 1D its spectral density is positive a.e., like in the case of this {Xt }. However, since this {Xt } is singular, it cannot be represented as a causal (one-sided) infinite MA. In general, in 1D the same is true for any singular time series of Type (2) and only for these ones. Example 2.8. Figure 2.15 shows the spectral density and the covariance function of the above described Type (2) singular process. The last few values of c(h) are not used because of the numerical distortion of the IDFT (FFT) procedure used to compute the covariance function from the spectral density. The first panel on Figure 2.16 shows the mean square prediction error e2n when predicting Xn based on the finite past X0 , . . . , Xn−1 , see Subsection 5.2.1. As n is growing, the mean square prediction error is eventually going to 0, since this is a singular process. The second panel shows det(Cn ), which goes to 0 as well, see Remark 5.5. Corollary 2.4. Theorem 2.4 has the following interesting corollary. Any nonsingular process may contain a Type (0) singular component, but cannot contain a Type (1) or (2) singular part. For, if case (1) or (2) of Theorem 2.4 holds for a time series then that process must be singular. Adding an orthogonal regular process to a Type (1) or (2) singular process results in a regular process. For, let {Xt } be a regular and {Yt } be a Type (1) or (2) singular stationary time series, Cov(Xs , Yt ) = 0 for all s, t ∈ Z. Then Z π Cov(Xt+h + Yt+h , Xt + Yt ) = cX (h) + cY (h) = eihω (f X (ω) + f Y (ω))dω, −π

where cX and cY are the covariance functions and f X and f Y are the spectral

Summary

81 Mean square prediction error

2.5

en2

2.0 1.5 1.0 0.5 0 2.00

100

200

Value of n

300

400

500

300

400

500

Determinant of Cn

1e40

1.75 1.50 detCn

1.25 1.00 0.75 0.50 0.25 0.00 0

100

200

Value of n

FIGURE 2.16 Prediction error and det(Cn ) of a Type (2) singular process in Example 2.8. density functions of X and Y , respectively. Consequently, Z π Z π X Y log(f (ω) + f (ω))dω ≥ log f X (ω)dω > −∞. −π

−π

This shows that {Xt + Yt } is regular.

2.11

Summary

Linear filtering of a stationary time series {Xt }t∈Z means applying a TLF (time-invariant linear filter) to it: Yt :=

∞ X j=−∞

ctj Xj =

∞ X

bk Xt−k

(t ∈ Z).

k=−∞

Time-invariance means that the coefficient ctj depends only on t − j, i.e. ctj = bt−j , giving the final form of a TLF. The so obtained {Yt } is also a weakly stationary sequence, and we use the wording that it is subordinated to the process {Xt }. A sufficient condition forPthe almost sure and mean square ∞ convergence of the above filtration is that j=−∞ |bj | < ∞.

82

ARMA, regular, and singular time series in 1D

The second order, weakly stationary 1D time series {Zt }t∈Z is called white noise sequence if its autocovariances are c(0) = σ 2 and c(h) = 0 for h = ±1, ±2, . . . . It is denoted by WN(σ 2 ). In other words, Zt s are uncorrelated and have variance σ 2 . In particular, if Zt s are i.i.d., with finite variance, they constitute a white noise sequence, and in the Gaussian case, the two notions are the same. We call the WN(1) sequence orthonormal sequence. Its spectral 1 ), like the spectrum of the white light. With the {ξt } ∼ density is constant ( 2π WN(1) sequence, the following special processes are introduced. The two-sided MA (sliding summation) is defined by Xt =

∞ X

∞ X

bk ξt−k =

bt−j ξj ,

t ∈ Z,

j=−∞

k=−∞

wherePfor the sequence of non-random complex coefficients {bk } we assume that k |bk |2 < ∞. (Then the Riesz–Fischer theorem implies the mean square convergence of the above infinite series.) If ξj = 0 for any j 6= j0 , then Xt = bt−j0 ξj0 , t ∈ Z. This is why the sequence {bk }k∈Z is called the impulse response function. Since the single impulse ξj0 (random shock) at time j0 can create nonzero responses not only for times t ≥ j0 , but also for times t < j0 , a sliding summation process {Xt } is non-causal in general. In contrast, when bk = 0 whenever k < 0, we obtain a causal (one-sided, future-independent) MA(∞) process: ∞ ∞ X X Xt = bk ξt−k = bt−j ξj , t ∈ Z, j=0

k=0

P∞

with the covariance function c(h) = k=0 bk+h¯bk (h ≥ 0), c(−h) = c¯(h) . It is customary to introduce the lag Poperator or left (backward) shift L, and the formal power series H(L) := k bk Lk , so we can write the sliding summation as Xt = H(L)ξt , t ∈ Z. Further, it is also customary to denote the operator L by the indeterminate z as well and to call the power series ∞ X

H(z) =

bk z k

k=−∞

the transfer function of the sliding P summation {Xt } or the z-transform of H(L). Since by our assumption k |bk |2 < ∞, we have H(z) ∈ L2 (T ), where T denotes the unit circle of the complex plane C. The finite order MA processes are generated with finitely many random shocks. The qth order moving average process, denoted by MA(q), is defined by q X Xt = β(L)ξt = βk ξt−k , k=0

where {ξt } ∼ WN(1), L is the lag operator, and β(z) =

q X k=0

βk z k

Summary

83

is the MA polynomial. The covariance function of the MA(q) process is c(h) =

q X

βk β¯k−h

(0 ≤ h ≤ q),

c(−h) = c¯(h),

c(h) = 0 if |h| > q.

k=h

Conversely, if the autocovariance function of a zero mean stationary process is such that c(h) = 0 for |h| > q and c(q) 6= 0, then it is a MA(q) process. 1 |β(e−ikω )|2 which is a The spectral density of an MA(q) process is f (ω) = 2π −iω 2q degree polynomial of e . A pth order autoregressive process AR(p) is defined by α(L)Xt = βξt , where {ξt } ∼ WN(1), L is the lag operator, and α(z) =

q X

αk z k

k=0

is the AR polynomial, α0 = 1, β 6= 0. A stationary causal MA(∞) solution of it exists if and only if the AR(p) polynomial is stable, i.e. it has no roots in the closed unit disc (|z| ≤ 1) of C. If Cp of Chapter 1 is positive definite, then this condition holds. Between the autocovariances and the constants of the AR(p) process we have the following Yule–Walker equations: c(−k) +

p X

c(j − k)α ¯ j = δ0k |β|2 ,

k ≥ 0.

j=1

Taking them for 0 ≤ k ≤ p, one obtains a system of linear equations for the unknowns α1 , . . . , αp , |β|2 , if the covariances c(0), c(1), . . . , c(p) are known. It has a unique solution if and only if the aforementioned matrix Cp is positive definite, see also Chapter 5. In this case the AR(p) process is also stable. The ARMA(p, q) processes (p ≥ 0, q ≥ 0) are generalizations of both AR(p) and MA(q) processes: in the p = 0 case we get a MA(q), while in the q = 0 case we get an AR(p) process. With the aforementioned polynomials, an ARMA(p, q) process is defined by α(L)Xt = β(L)ξt . Again, we want to find a causal stationary MA(∞) solution of this equation. Assume that the polynomials α(z) and β(z) have no common zeros. Then {Xt } is causal if and only if α(z) 6= 0 for all |z| ≤ 1 (the same stability condition as in case of the ARPprocess). The coefficients (impulse responses) ∞ of the MA(∞) process Xt = j=0 bj ξt−j are obtainable by the power series expansion of the transfer function H(z) =

∞ X j=0

bj z j = α−1 (z)β(z),

|z| ≤ 1.

84

ARMA, regular, and singular time series in 1D

The spectral density of the ARMA(p, q) process is f (ω) =

1 |β(e−iω) |2 . 2π |α(e−iω) |2

This is a rational function (ratio of polynomials) of e−iω . Rational spectral densities play an important role in state space models of Chapter 3. It can also be shown that any continuous spectral density can be approximated with the spectral density of a MA(q) and/or AR(p) process with any small error, albeit with possibly large q and/or p. So processes of rational spectral density are special MA(∞) processes. Onesided moving average processes are also called regular (also called purely nondeterministic) as they have no remote past at all (they are entirely governed by random shocks). In contrast, singular (also called deterministic) processes are completely determined by their remote past. As a mixture of them, nonsingular processes cannot be completely predicted based on their past values, but there are added values (innovations) of the newcoming observations. The Wold decomposition in 1D guarantees that any non-singular weakly stationary time series (not completely determined by the remote past) can be decomposed into a regular (MA(∞)) and a singular process that are orthogonal to each other. The original proof of Wold is based on the one-step ahead predictions with longer and longer past, where the error terms (innovations) give the shocks, and prediction errors converge to an optimum value (that is not zero in the presence of a regular part), see Chapter 5. The spectral form of the Wold decomposition is related to the transfer function and discussed in details in the multivariate situation in Chapter 4. By a theorem of Kolmogorov, a 1D stationary time series is regular if and only if the following three conditions hold on [−π, π]: 1. its spectral measure dF is absolutely continuous w.r.t. Lebesgue measure; 2. its spectral density f is positive almost everywhere; 3. (Kolmogorov’s condition) log f is integrable. Note that we have different classes of 1D singular time series as follows. The spectral measure of a stationary time series has a unique Lebesgue decomposition dF = dFa + dFs , dFa  dω, dFs ⊥dω, where dω denotes Lebesgue measure, dFa (ω) = fa (ω)dω is the absolutely continuous part of dF with density fa and dFs is the singular part of dF , which is concentrated on a zero Lebesgue measure subset of [−π, π]. We call it Type (0) singularity. Apart from this, we distinguish between the following three cases: (1) fa (ω) = 0 on a set of positive Lebesgue measure on [−π, π];

Summary

85

Rπ (2) fa (ω) > 0 a.e. , but −π log fa (ω)dω = −∞; Rπ (3) fa (ω) > 0 a.e. and −π log fa (ω)dω > −∞. In cases (1) and (2), the time series is singular, whereas in case (3), the time series is non-singular and has the Wold decomposition. So a regular process can coexist only with a Type (0) singular one, while adding an orthogonal regular process to Type (1) or (2) singularity results in a regular process.

3 Linear system theory, state space models

3.1

Introduction

We apply state space models to multidimensional stationary processes, the same approach as that of the seminal K´alm´an’s filtering. We consider the reachability and observability of a system, defined by an inner description. Power series and extended input/output maps are also considered together with an external description of the system. Algebraic tools of modules are intensively used. By means of these, we investigate when and how one can get an internal description from an external one, more exactly, from a reduced input/output map. We also introduce the notion of minimal polynomial of a state space if it is finite dimensional. Next, we turn to realizations that use power series techniques. The transfer function H(z) of a linear system is a power series containing only non-negative powers of z and its coefficients are p × q complex matrices. Equivalently, H(z) is a p × q matrix, whose entries are complex power series with only nonnegative powers of z. The transfer function H(z) of a linear system with finite dimensional state space is a rational matrix. Equivalences are proved about the rationality of the transfer function and the bounded rank of the infinite block Hankel matrix formed by its coefficients. The McMillan degree of the transfer function is the dimension of the state space in a minimal realization. It is uniquely defined. More general linear time-invariant dynamical systems are also treated. Relation of Hankel operators and realization theory is investigated. Finally, stochastic time-invariant linear systems are introduced that are driven by two multidimensional random stationary time series. Multidimensional ARMA (VARMA) processes are special cases of them.

3.2

Restricted input/output map

Suppose that we are given a quadruple of bounded linear operators (A, B, C, D), A : X → X, B : U → X, C : X → Y , and D : U → Y , where U is a q-dimensional, Y is a p-dimensional and X (the state space) is 87

88

Linear system theory, state space models

an n-dimensional complex vector space. The corresponding linear time invariant dynamical system is described by xk+1 = Axk + Buk yk = Cxk + Duk

(k ∈ Z).

(3.1)

We may write that U = Cq , X = Cn , Y = Cp , A ∈ Cn×n , B ∈ Cn×q , C ∈ Cp×n , and D ∈ Cp×q . Eventually, the input sequence {uk } is going to be a random stationary time series, that generates the dynamics. The resulting random stationary time series {xk } describes the consecutive, directly unobservable, internal states of the system, while the random stationary time series {yk } is the observable output sequence. However, except for the last section of this chapter, everything will be deterministic. Interestingly, for discussing many important properties of linear dynamical systems it is immaterial whether the input process is deterministic or stochastic. In this section we consider a very simple case. We assume that 1. uj = 0 when j ≥ 1; 2. the input sequence {uj : j ≤ 0} contains only finitely many nonzero terms; 3. the system was at rest when the first nonzero input arrived, say at time −j0 ≤ 0 (which is arbitrary, but finite): x−j0 = 0. If k ≤ 0, then by the first equation of (3.1), progressing from time −j0 forward step-by-step, we get that xk = Buk−1 + ABuk−2 + A2 Buk−3 + · · · + Aj0 +k−1 Bu−j0 . Then remembering the above assumptions (1)–(3), we may write that xk =

∞ X

Aj+k−1 Bu−j

(k ≥ 1).

(3.2)

j=0

This sum is convergent since it contains only finitely many non-zero terms. Further, we consider the output sequence for times k ≥ 1 only, where uk = 0: yk =

∞ X

CAj+k−1 Bu−j

(k ≥ 1).

(3.3)

j=0

This sum is convergent as well, since contains only finitely many non-zero terms. Then we define the restricted input/output map φ0 of the system by φ0 ({uj }0j=−∞ ) = {yk }∞ k=1 ,

(3.4)

as described by (3.3). (It is necessary writing the input sequence as a sequence

Reachability and observability

89

{uj }0j=−∞ , because the starting time j0 ≤ 0 of the input sequence can be arbitrary.) Observe that φ0 does not depend on the operator D, therefore in this chapter, except when it is explicitly stated otherwise, we may assume that D = 0. It is useful to write φ0 in matrix form:        u0 CB CAB CA2 B · · · u0 y1    u−1   CAB CA2 B  y2  · ···    u−1       (3.5)    y3  = H  u−2  =  CA2 B · · · · ·   u−2        .. .. .. .. .. .. . . . . . . The infinite coefficient matrix H here is a block Hankel matrix ; each of its blocks is a p × q complex matrix (see Appendix B).

3.3

Reachability and observability

Let Σ = (A, B, C) be a finite dimensional linear system. We say that the state x ∈ X is reachable if there is a k ≥ 0 and a sequence of inputs {u0 , u1 , . . . , uk−1 } of length k that drives the system from the state x0 = 0 to the state xk = x:   uk−1 k−1  uk−2  X   x= Ak−1−j Buj = [B, AB, . . . , Ak−1 B]  .  .  ..  j=0

u0 By the Cayley–Hamilton theorem, if k ≥ n, Ak can be expressed in terms of I, A, . . . , An−1 . Thus Range[B, AB, . . . , Ak−1 B] = Range[B, AB, . . . , An−1 B]

(k ≥ n).

The matrix R := [B, AB, . . . , An−1 B]

(3.6)

is the reachability matrix of the system and its range is the subspace of reachable states. We say that the system Σ is reachable if any x ∈ X is reachable, that is, if rank(R) = dim(Range(R)) = n, equivalently, if R has n linearly independent columns, equivalently, if Ker(R∗ ) = {0}. Clearly, Range(R) is an A-invariant subspace in X: A Range(R) ⊂ Range(R). We say that the state x ∈ X is observable if starting from the state x0 = x at time 0 and having a constant zero input sequence uj = 0 (j ≥ 0), there is a time instant k ≥ 0 such that yk = CAk x 6= 0.

90

Linear system theory, state space models

The reason for the name ‘observable’ is that the state x is observable if and only if for any x(1) , x(2) ∈ X, x(2) − x(1) = x, starting at time 0 with initial state x(1) and x(2) , respectively, there exists a time instant k ≥ 0 such that the output yk is different in the two cases. This fact follows from the linearity of the system. Again, by the Cayley–Hamilton theorem, if     C C  CA   CA      O :=   , then Ker  ..  = Ker(O) for each k ≥ n − 1, (3.7) ..  .    . CAn−1

CAk

where O is called the observability matrix of the system. The orthogonal complement of its kernel is the subspace of observable states. The system Σ is called observable, if any x ∈ X is observable, that is, if Ker(O) = {0}, equivalently, if O has n linearly independent rows, equivalently, if rank(O) = n. Clearly, Ker(O) is an A-invariant subspace in X.

3.4

Power series and extended input/output maps

An important tool in system theory is the application of power series. The set of integers Z, that is, the set of time instants can be mapped in a one-toone way onto powers of the indeterminate z which is the operator of the left (backward) time shift in Z. So an input u at time −j can be written as uz j and an output y at time k can be written as yz −k . Correspondingly, the right (forward) time shift is denoted by z −1 . Now we consider more general input/output sequences than in Section 3.2. Thus we suppose that 1. the input sequence contains only finitely many terms with negative indices; 2. the system was at rest when the first nonzero input arrived, say at time −j0 ≤ 0 (which is arbitrary): x−j0 = 0. So the input sequence can be written as a formal power series u(z) :=

∞ X j=−j0

uj z −j =

∞ X

uj z −j

(3.8)

j=−∞

Recall that here the coefficients uj ∈ U are q-dimensional complex vectors and in the last sum there are only finitely many terms with negative index j. The formal power series u(z) is also called the z-transform of the sequence {. . . , u−2 , u−1 , u0 , u1 , u2 , . . . }.

Power series and extended input/output maps

91

Let U denote the set of all input sequences {uj : j ∈ Z} that contain only finitely many nonzero terms with negative indices. Equivalently, U is the vector space of all formal power series u(z) as given by (3.8) with any −j0 ∈ Z. Likewise, a corresponding output sequence (y−j0 +1 , y−j0 +2 , . . . ) is the following formal power series (z-transform) with p-dimensional complex vector coefficients yk ∈ Y : y(z) :=

∞ X

yk z −k =

k=−j0 +1

∞ X

yk z −k .

(3.9)

k=−∞

Remember that in the last sum there are only finitely many terms with negative indices k. Let Y denote all output sequences {yk : k ∈ Z} that contain only finitely many nonzero terms with negative indices. Equivalently, Y is the vector space of all formal power series y(z) as given by (3.9) with any −j0 ∈ Z. With a finite dimensional linear system Σ = (A, B, C), if x−j0 = 0 and k ≥ −j0 + 1, then yk =

k−1 X

CAk−j−1 Buj =

j=−j0

k−1 X

CAk−j−1 Buj .

(3.10)

j=−∞

(The last series converges, because it contains only finitely many non-zero terms.) Now we can consider the extended input/output map φ : U → Y of the finite dimensional linear system Σ = (A, B, C), which is defined by   ∞ ∞ X X φ(u(z)) = y(z), φ  uj z −j  = yk z −k , (3.11) j=−∞

k=−∞

where yk is given by (3.10). Also, we define the left shift L that works in the ordinary way on the input/output sequences defined on Z: Lu(z) := zu(z),

Ly(z) := zy(z).

Then simply we have φ(u(z)) = y(z)



φ(Lu(z)) = Ly(z).

(3.12)

It is very important that φ is not only a complex linear function from U into Y, but a homomorphism (see Appendix D) for left shifts as well, as is shown by (3.12). Let C[z] denote the ring of all polynomials in the indeterminate z, with complex coefficients (see Appendix D). Let p(z) = am z m + am−1 z m−1 + · · · + a1 z + a0 ∈ C[z] be an arbitrary complex polynomial. The effect of p(z) on an input sequence u(z) or on an output sequence y(z) is simply the corresponding combination

92

Linear system theory, state space models

of left shifts, additions and complex multiplications: Lp(z) u(z) := p(z)u(z) = Lp(z) y(z) := p(z)y(z) =

m X r=0 m X

∞ X

ar z r u(z) = ar z r y(z) =

z −j

j=−j0 −m ∞ X

m X

z −k

k=−j0 −m+1

r=0

ar uj+r ,

r=0 m X

ar yk+r .

r=0

Algebraically, it means that φ is a C[z]-module homomorphism from U to Y (see Appendix D): φ(u(z)) = y(z)



φ(Lp(z) u(z)) = Lp(z) y(z),

or briefly, φLp = Lp φ. Let us introduce the transfer function H(z) by the formal power series H(z) :=

∞ X

H` z ` := D +

`=0

∞ X

CA`−1 Bz ` ,

(3.13)

`=1

the coefficients H` := CA`−1 B (` ≥ 1) being p × q complex matrices. The coefficient H0 = D has been assumed to be a p × q zero matrix so far, but eventually it can be nonzero. Proposition 3.1. We have the following properties of a transfer function: (1) If z is considered a complex variable, then (3.13) converges for |z| < 1/ρ(A), where ρ(A) denotes the spectral radius of the matrix A: ρ(A) := max{|λ| : λ ∈ C, det(λIn − A) = 0}, see Lemma B.2. Then H(z) is an analytic matrix function in the disc {z : |z| < 1/ρ(A)}, meaning that each entry of the matrix H(z) is an analytic complex function there. In particular, if ρ(A) = 1, then H(z) is analytic in the open unit disc and if ρ(A) < 1, then H(z) is analytic in the closed unit disc. (2) H(z) = D + C(In − Az)−1 zB, H(z

−1

) = D + C(In z − A)

−1

B,

|z| < 1/ρ(A),

(3.14)

|z| > ρ(A).

(3.15)

(3) The z-transform of the input/output map φ is H(z −1 ): y(z) = φ(u(z)) = H(z −1 )u(z),

|z| > ρ(A).

(4) The transfer function H(z) gives the output in terms of the input: yk = H(z)uk ,

k ∈ Z.

(3.16)

Power series and extended input/output maps

93

(5) lim H(z −1 ) = D.

z→∞

Proof. By Lemma B.2, for any  > 0, kAk k ≤ (ρ(A) + )k , so

∞ ∞

X

X

`−1 ` kCk · kA`−1 k · kBk · |z|` CA Bz ≤

`=1

`=1

≤ kCk · kBk · |z|

∞ X [|z|(ρ(A) + )]`−1 < ∞, `=1

if |z| < 1/(ρ(A) + ). This proves (1). Simple algebra shows that we have the following relationship for formal power series: ∞ X (In − Az) A`−1 z ` = In z. `=1

If |z| < 1/ρ(A), then this implies that In − Az is invertible and (In − Az)−1 z =

∞ X

A`−1 z ` .

(3.17)

`=1

By definition (3.13) this proves (3.14). Equation 3.15 follows from this. By (3.9) and (3.10), yk =

k−1 X

Hk−j uj

(k ∈ Z),

∞ X

y(z) =

j=−∞

yk z −k .

k=−∞

On the other hand, H(z −1 )u(z) =

∞ X

H` z −`

∞ X

uj z −j =

j=−∞

`=1

∞ X k=−∞

z −k

k−1 X

Hk−j uj .

j=−∞

These prove (3). Statement (4) follows from H(z)uk =

∞ X

Hj uk−j = yk .

j=1

Statement (5) is a simple consequence of (3.15). If all uj = 0 when j 6= t, then by (3.10), yk = Hk−t ut ,

(k > t).

This shows that the sequence {H` : ` = 1, 2, . . . } represents the impact of a single impulse ut at time t on the future values of the process. That is why it is called the impulse response function.

94

Linear system theory, state space models

Equations (3.10) and (3.11) imply the important strict causality of the input/output map: u(z) =

∞ X

uj z −j



y(z) = φ(u(z)) =

j=0

∞ X

yk z −k .

(3.18)

k=1

By words: An input sequence beginning at time 0 may only influence the output sequence from time 1. While (3.1) is called the internal description of the linear system, (3.16) is the external description: if we can determine (e.g. measure or estimate) the coefficients H` , then we can compute the output sequence from the input sequence, without using (or knowing) the matrices (A, B, C). In Section 3.2 we considered special input and output sequences. Consistently with that, let U− denote the set of all input sequences {uj : j ≤ 0} that contain only finitely many nonzero terms. Also, let Y+ denote all output sequences {yk : k ≥ 1}. Then the restricted input/output map φ0 : U− → Y+ was defined by (3.4) and (3.5): φ0 (u(z)) = y(z) ∈ Y+

if

u(z) ∈ U− .

(3.19)

In this case the left shift L should be adapted to the restricted situation: L(. . . , u−2 , u−1 , u0 ) = (. . . , u−2 , u−1 , u0 , 0), L(y1 , y2 , y3 , . . . ) = (y2 , y3 , y4 , . . . ). Then it follows that φ0 (L(. . . , u−2 , u−1 , u0 )) = Lφ0 (. . . , u−2 , u−1 , u0 )

(3.20)

for any (. . . , u−2 , u−1 , u0 ) ∈ U− . Using the power series notation introduced above, we may write that u(z) ∈ U− y(z) ∈ Y+





u(z) =

y(z) =

j0 X j=0 ∞ X

u−j z j

(j0 ≥ 0, arbitrary),

(3.21)

yk z −k .

k=1

We can identify the left shift L in the space U− with the multiplication by the identity function χ(z) = z, and in the space Y+ with the operation on formal power series   −1 r X X Ly(z) = [zy(z)]− , where  aj z j  := aj z j , (3.22) j=−∞



j=−∞

Power series and extended input/output maps

95

whenever r ≥ 0. Another notation for the operator [·]− is the projection π− . Then (3.20) can be expressed in the form ⇒

φ0 (u(z)) = y(z)

φ0 (Lu(z)) = Ly(z).

It implies that φ0 is also a C[z]-module homomorphism: ⇒

φ0 (u(z)) = y(z)

φ0 (Lp(z) u(z)) = Lp(z) y(z),

(3.23)

or briefly, φ0 Lp = Lp φ0 . Let ι+ : U− → U denote the inclusion map and π− : Y → Y+ denote the projection map. Then by (3.16), (3.19), (3.22), and (3.23) the following diagram is commutative: φ0

U− ι+ ↓

−→

U

φ

−→

Y+ ↑ π−

(3.24)

Y

That is, φ0 = π− φ ι+ , or with power series, y(z) = φ0 (u(z)) = [H(z −1 )u(z)]− ∈ Y+ if u(z) ∈ U− .

(3.25)

Proposition 3.2. Given the restricted input/output map φ0 , there exists a unique strictly causal input/output map φ which makes diagram (3.24) commutative. Namely, we can define   ∞ ∞ X X φ0 (uj )z −j . uj z −j  := φ j=−∞

j=−∞

Proof. By (3.25), φ0 (uj ) =

∞ X

uj CAr−1 Bz −r .

r=1

Thus by the above definition,   ∞ ∞ ∞ X X X φ uj z −j  = φ0 (uj )z −j = yk z −k , j=−∞

j=−∞

(3.26)

k=−∞

where yk =

k−1 X

CAk−j−1 Buj ,

j=−∞

exactly as (3.10) and (3.11) demand. Remember that in each sum there are only finitely many terms with negative indices. The uniqueness of φ follows from the fact that φ has to be linear, so (3.26) has to hold. Then the strict causality of φ and the commutativity of the diagram (3.24) follow the same way as before, see (3.18) and (3.22).

96

3.5

Linear system theory, state space models

Realizations

In this section we investigate when and how one can get an internal description Σ = (A, B, C) from an external description, more exactly, from a reduced input/output map φ0 : U− → Y+ . Realizations are not unique, but Theorem 3.1(b) below shows that in the finite dimensional case minimal realizations are isomorphic. First we are looking for a finite dimensional complex vector space X describing the space of internal states x, and linear maps F : U− → X and G : X → Y+ such that φ0 = G F. Such a decomposition of φ0 will be called a factorization. To explain our search for such a factorization, let us review some earlier results. The Hankel matrix of the system Σ = (A, B, C) defined by (3.5) can be written as an infinite block dyadic product:   C  CA    (3.27) H =  CA2  [B, AB, A2 B, . . . ].   .. . Clearly, the system is observable if and only if the first factor in (3.27) that maps X to Y+ is one-to-one. Also, the system is reachable if and only if the second factor that maps U− to X is onto. By (3.2), the image of the second factor is determined by x1 , the internal state at time 1, because that uniquely determines the whole output sequence (y1 , y2 , y3 , . . . ) ∈ Y+ , since now the input uj is supposed to be zero if j ≥ 1: x1 =

j0 X j=0

Aj Bu−j =

∞ X

Aj Bu−j ,

yk = Cxk = CAk−1 x1

(k ≥ 1).

j=0

Theorem 3.1. We have the following important properties of factorization. (a) Suppose that we have two factorizations: φ0 = G1 F1 = G2 F2 , where Fj : U− → Xj and Gj : Xj → Y+ (j = 1, 2). If F1 is onto and G2 is one-to-one, then there exists a linear map L : X1 → X2 such that the following diagram is commutative: F1

% U−

G1

X1 L↓

F2

&

& Y+

G2

X2

%

(b) Suppose that in statement (a) both F1 and F2 are onto and both G1 and G2 are one-to-one. Then the linear map L is invertible, so the two factorizations are equivalent.

Realizations

97

(c) Suppose that we have a factorization φ0 = G1 F1 with a finite dimensional state space X1 . Then dim(X1 ) is minimal if and only if F1 is onto and G1 is one-to-one. Proof. (a) Let x1 ∈ X1 be arbitrary. Since F1 is onto, there exists a u(z) ∈ U− such that F1 (u(z)) = x1 . Then φ0 (u(z)) = G1 (x1 ) ∈ Y+ , and since G2 is one-to-one, there exists a unique x2 ∈ X2 such that G2 (x2 ) = G1 (x1 ). Then we can set L(x1 ) := x2 . −1 (b) It follows from (a) that in this case L = G−1 = G−1 2 G1 and L 1 G2 . (c) We may assume that the state space X1 is such that our description is reachable and observable. For, otherwise we may omit those subspaces from X1 that are either not reachable or not observable. In turn, it means that F1 is onto and G1 is one-to-one. Suppose that φ0 = G2 F2 is an arbitrary factorization with a state space X2 . Then dim(X2 ) ≥ dim(Range(G2 )) ≥ dim(Range(G2 F2 )) = dim(Range(φ0 )) = dim(X1 ),

(3.28)

because G1 is one-to-one. By statement (b), if φ0 = G2 F2 is an arbitrary factorization with a state space X2 such that F2 is onto and G2 is one-to-one, then there exists an invertible linear map L : X1 → X2 , thus dim(X2 ) = dim(X1 ). Conversely, if dim(X2 ) is minimal, then in (3.28) there is equality everywhere, so dim(X2 ) = dim(Range(G2 )). This means that G2 is one-to-one. Further, if F2 were not onto, then dim(F2 ) < dim(X2 ) would hold, which would contradict the equality dim(X2 ) = dim(Range(G2 F2 )). Theorem 3.2. For any k, j ≥ 1, let  H1 H2 H3  H2 H3 ·   H3 · · Hk,j =   .. .. ..  . . . Hk · ·

··· ··· ···

Hj · · .. .

···

Hk+j−1

      

be the k × j upper left sub-matrix of the infinite block Hankel matrix H defined in (3.5). (Each block Hr is a p × q complex matrix.) The system of equations Hr = CAr−1 B

(r ≥ 1)

has a linear system solution Σ = (A, B, C) (as described in Section 3.2) if and only if sup rank(Hk,j ) < ∞. k,j

Moreover, then the construction in the proof gives a linear system that is both reachable and observable, so its state space X has minimal dimension.

98

Linear system theory, state space models

Proof. If Σ = (A, B, C) is a linear system as described in Section 3.2 and Hr = CAr−1 B (r ≥ 1), then, similarly as in (3.27), we can write that   C  CA    (3.29) Hk,j =   [B, AB, . . . , Aj−1 B]. ..   . CAk−1 Thus rank(Hk,j ) ≤ n, since the second factor has n rows. Conversely, suppose that rank(Hk,j ) is a bounded function of k and j. Since the rank is a non-decreasing, integer valued function, it must be constant if k ≥ k0 and j ≥ j0 . Consider the matrices Hk0 +1,j0 +1 ∈ C(k0 +1)p×(j0 +1)q . Choose a basis in the range of Hk0 +1,j0 +1 ; let us denote the number of the basis vectors by n. From these n column vectors create the matrix Mk0 +1 ∈ C(k0 +1)p×n . Each column vector of Hk0 +1,j0 +1 is a linear combination of the columns of Mk0 +1 , so there exists a matrix Vj0 +1 ∈ Cn×(j0 +1)q such that Hk0 +1,j0 +1 = Mk0 +1 Vj0 +1 .

(3.30)

Define the matrix C ∈ Cp×n as the first p rows of the matrix Mk0 +1 and define the matrix B ∈ Cn×q as the first q columns of the matrix Vj0 +1 . It is clear from the construction that C and B may remain the same for any Hk,j if k ≥ k0 and j ≥ j0 . Let us denote the columns of matrix Vj0 +1 by v1 , v2 , . . . , v(j0 +1)q . Define the matrix A ∈ Cn×n as a solution — if there exists one — of the system of linear equations Avr = vq+r (r = 1, 2, . . . , j0 q). (3.31) Denote Vj0 := [v1 , v2 , . . . , vj0 q ],

Vjs0 := [vq+1 , vq+2 , . . . , v(j0 +1)q ].

Obviously, to have a solution A of (3.31) it is necessary that for c ∈ Cj0 q , Vj0 c = 0



Vjs0 c = 0.

(3.32)

Moreover, (3.32) is also sufficient for the solvability of (3.31). For, writing down the augmented matrix of the system (3.31):   v1 v2 ··· vj0 q  − − − − − − − − − − − − , vq+1 vq+2 ··· v(j0 +1)q we see that whenever elementary column operations result a zero column in the upper matrix Vj0 , the same operations give a zero column in the lower matrix Vjs0 as well, so the rows of Vjs0 depend linearly on the rows of Vj0 , which is equivalent to the solvability of (3.31).

Realizations

99

Thus we have to check that (3.32) holds. Suppose that Vj0 c = 0. Then Mk0 +1 Vj0 c = 0 as well. Because of the block Hankel matrix structure of Hk,j , we see that   H1 H2 ··· Hj 0  H2 H3 · · · Hj0 +1    Mk0 +1 Vj0 =   .. .. ..   . . . Hk0 +1 Hk0 +2 · · · Hk0 +j0 and

  Mk0 Vjs0 = 

H2 .. .

H3 .. .

···

Hk0 +1

Hk0 +2

···

 Hj0 +1  .. , . Hk0 +j0

which is the same as Mk0 +1 Vj0 , deleting its first row. Thus Mk0 Vjs0 c = 0 holds too. Since the columns of Mk0 are independent, it follows that Vjs0 c = 0, which was to show and which implies that the system (3.31) is solvable for A. By our assumptions, rank(Hk,j ) = rank(Mk ) = n whenever k ≥ k0 and j ≥ j0 . Hence rank(Vj ) = n too when j ≥ j0 . It implies that the columns of Vj0 span the n-dimensional complex vector space on which A acts, hence the solution A of the system (3.31) is unique and is the same for any k ≥ k0 and j ≥ j0 . Writing down the first block row of (3.30) using (3.31), it follows that Hr = CAr−1 B

(r = 1, 2, . . . j0 + 1).

Since it is true for any j ≥ j0 + 1 as well, we have this representation for any r ≥ 1. Also, we saw that in the factorization (3.29), equivalently in (3.30), the two factors have rank n, it follows that the linear system Σ = (A, B, C) is both reachable and observable, so its state space X has minimal dimension. Next we introduce the notion of minimal polynomial of a state space X. Suppose that φ0 is a restricted input/output map of a linear system and φ0 = G F is a factorization with a finite dimensional state space X; suppose as well that F is onto and G is one-to-one. Then we can define the effect of left shift L on each state x ∈ X as a multiplication by the indeterminate z. Since F is onto, there exists a u(z) ∈ U− such that F (u(z)) = x. Then z · x := F (zu(z)) ∈ X

(x ∈ X).

The vector z · x is uniquely defined, because if x = F (u1 (z)) = F (u2 (z)), then φ0 (u1 (z)) = G F (u1 (z)) = G F (u2 (z)) = φ0 (u2 (z)) and applying left shift, φ0 (zu1 (z)) = φ0 (zu2 (z));

100

Linear system theory, state space models

since G is one-to-one, it follows that z · x = F (zu1 (z)) = F (zu2 (z)). If p(z) = am z m + · · · + a1 z + a0 ∈ C[z] is an arbitrary polynomial in the indeterminate z, and x = F (u(z))), then by definition p(z) · x :=

m X

aj z j · x = F (p(z)u(z)) ∈ X.

j=0

If y(z) = φ0 (u(z)), then by (3.25) and (3.23), G(p(z)·x) = G F (p(z)u(z)) = φ0 (p(z)u(z)) = [H(z −1 )p(z)u(z)]− = Lp(z) y(z). The annihilator AnnX of the state space X is an ideal in C[z] defined as AnnX := {p(z) ∈ C[z] : p(z) · x = 0 ∀x ∈ X}. Since C[z] is a principal ideal domain, if AnnX 6= {0}, there exists a non-zero polynomial ψX ∈ AnnX such that AnnX = ψX C[z]. For uniqueness, we may assume that ψX is monic, that is, its leading coefficient is 1. Then ψX is called the minimal polynomial of X. Proposition 3.3. If the state space X is finite dimensional, there exists a minimal polynomial ψX of X. Proof. Since C[z] is a principal ideal domain, it is enough to show that AnnX 6= {0}. Take a basis (e1 , . . . , en ) in X. Then the sequence of vectors {z j · e1 : j = 0, 1, . . . , n} must be linearly dependent. Thus there exists a nonzero polynomial p1 (z) (of degree at most n) such that p1 (z) · e1 = 0. Similarly, for any ek (k = 1, . . . , n) there exists a nonzero polynomial pk (z) such that pk (z) · ek = 0. Then we can take the least common multiple (lcm): lcm(p1 , . . . , pn ) ∈ AnnX .

Next, we turn to realizations that use power series techniques, introduced in Section 3.4. The transfer function H(z) ∈ Cp×q (z) of a linear system was defined in (3.13). Its definition shows that H(z) is a power series and its coefficients are p × q complex matrices. Equivalently, H(z) is a p × q matrix, whose entries are complex power series. H(z) is called rational if it is a rational matrix, that is, there exists a nonzero polynomial φ(z) ∈ C[z] such that φ(z)H(z) is a polynomial matrix: φ(z)H(z) ∈ Cp×q [z].

Realizations

101

Theorem 3.3. The transfer function H(z −1 ) of a linear system Σ = (A, B, C), defined by (3.13), (3.14), and (3.15), with finite dimensional state space X, is a rational matrix H(z −1 ) =

P (z) , ψX (z)

P (z) ∈ C[z]p×q ,

where ψX (z) is the minimal polynomial of the state space X. Proof. Let us denote the basis of coordinate unit vectors in the input space U = Cq×1 by (e1 , . . . , eq ). Consider q different input sequences from U− :  er j = 0 r uj := (r = 1, . . . , q). 0 j 6= 0 Then the corresponding polynomials are ur (z) = er

(r = 1, . . . , q).

(3.33)

By (3.25), the corresponding output polynomials from Y+ are y r (z) = [H(z −1 )ur (z)]− = H(z −1 )ur (z) = H(z −1 )er , because now H(z −1 )ur (z) does not have terms with non-negative powers of z. Let us apply the polynomial left shift ψX (z) to the input sequences. Then by (3.23), φ0 (LψX ur (z)) = G F (LψX ur (z)) = G(ψX (z) · F (ur (z))) = 0, since ψX (z) · x = 0 for any x ∈ X as ψX (z) ∈ AnnX . It means that [H(z −1 )ψX (z)ur (z)]− = 0, so H(z −1 )ψX (z)ur (z) contains only non-negative powers of z, that is, H(z −1 )ψX (z)ur (z) = Pr (z) ∈ C[z]p×1

(r = 1, . . . , q),

with certain p×1 polynomial matrices Pr (z). Thus by (3.33), ψX (z)H(z −1 )er = Pr (z) (r = 1, . . . , q), ψX (z)H(z −1 ) = P (z) = [P1 (z), . . . , Pq (z)], which proves the statement of the theorem. Observe that H(z) is a rational matrix if and only if H(z −1 ) is a rational matrix. Theorem 3.4. The following properties of a linear system are equivalent: (1) The reduced input/output map φ0 : U− → Y+ defined by (3.19) has an internal description Σ = (A, B, C) with finite dimensional state space X. (2) If Hk,j denotes the k × j upper left minor of the infinite Hankel block matrix H defined in (3.5), then supk,j rank(Hk,j ) < ∞.

102

Linear system theory, state space models

(3) Range(φ0 ) = Range(H) is finite dimensional. (4) The transfer function H(z) defined by (3.13) or (3.14) is a rational matrix. (5) The transfer function H(z) has a matrix fraction description (MFD) defined by Theorem D.1 in the Appendix. Proof. Theorem 3.2 proved the equivalence of (1) and (2) and Theorem 3.3 proved that (1) implies (4). Supposing condition (4), H(z −1 ) = P (z)/ψ(z), where P (z) ∈ C[z]p×q and ψ(z) ∈ C[z]. By (3.21), u(z) ∈ U− if and only if u(z) ∈ C[z]q×1 . Thus for any u(z) ∈ U− , H(z −1 )ψ(z)u(z) = P (z)u(z) contains only non-negative powers of z. Thus we get that φ0 (Lψ u(z)) = [H(z −1 )ψ(z)u(z)]− = 0.

(3.34)

Then Ψ := ψ(z)C[z]q×1 is a submodule in the C[z]-module C[z]q×1 , whose elements φ0 maps to 0. For any u(z) ∈ U− we can perform standard polynomial division by the ordinary polynomial ψ(z) to obtain u(z) = Q(z)ψ(z) + R(z),

Q(z), R(z) ∈ C[z]q×1 ,

deg(R(z)) < deg(ψ(z)),

where deg(R(z)) denotes the degree of a polynomial R(z). Thus by (3.34), it follows that φ0 (u(z)) = φ0 (ψ(z)Q(z)) + φ0 (R(z)) = φ0 (R(z)), which implies that dim(Range(φ0 )) ≤ dim(C[z]q×1 /Ψ) = q deg(ψ(z)). This proves (3). If (3) holds, then, since each Hk,j is a minor of H, rank(Hk,j ) ≤ rank(H) < ∞, so (2) follows. Finally, Theorem D.1 and Remark D.3 show the equivalence of (4) and (5). The McMillan degree of the transfer function H(z) is the dimension of the state space X in a minimal realization. Theorem 3.1(b) implies that the McMillan degree is uniquely defined.

Stochastic linear systems

3.6

103

Stochastic linear systems

3.6.1

Stability

Now we consider a stochastic time invariant linear system: Xt+1 = AXt + BUt Yt = CXt + DVt

(t ∈ Z),

(3.35)

which is driven by the white noise processes {Ut }t∈Z ∼ WN(Σ) and {Vt }t∈Z ∼ WN(S). The matrices A ∈ Cn×n , B ∈ Cn×q , C ∈ Cp×n , and D ∈ Cp×s give the internal description of the system. The covariance matrices Σ ∈ Cq×q and S ∈ Cs×s are given self-adjoint non-negative definite matrices. It is assumed that E(Ut Vτ∗ ) = δtτ R, where R ∈ Cq×s represents the cross-covariance between the two driving processes. An important special case is when Vt = Ut for each t ∈ Z. As before, {Xt }t∈Z is the sequence of the directly not observable Cn -valued internal states and {Yt }t∈Z is the sequence of the observable Cp -valued output terms. These are stationary time series now. We assume that the system is at rest at the remote past, that is, X1 =

∞ X

Aj BU−j ,

j=0

where the sum is convergent in mean square. (The convergence of the sum follows from stability, see next.) It implies that EXt = 0 for each t, and each Ut and Vt are orthogonal to the past: E(Ut X∗τ ) = 0,

E(Vt X∗τ ) = 0

(∀ τ ≤ t).

The system (3.35) is called stable if all the eigenvalues of the matrix A are in the open unit disc {z ∈ C : |z| < 1}, that is, det(zI − A) 6= 0 if |z| ≥ 1. The spectrum σ(A) of the matrix A is the set of all its eigenvalues and the spectral radius of A is ρ(A) = max{|λ| : λ ∈ σ(A)}. See the properties of spectral radius in Lemma B.2 in the Appendix. Theorem 3.5. If a system (3.35) is stable, that is, ρ(A) < 1, then it has a unique causal stationary MA(∞) solution both for {Xt } and {Yt }: Xt =

∞ X j=1

Aj−1 BUt−j ,

Yt = DVt +

∞ X

CAj−1 BUt−j

j=1

These series converge with probability 1 and in mean square.

(t ∈ Z).

104

Linear system theory, state space models

Proof. This proof is similar to the one given for AR(1) processes in Section 2.4 above. Suppose that ρ(A) < 1. First consider the convergence of the claimed MA solution for {Xt }. Since {Ut } is an orthogonal series with constant covariance matrix Σ, the Cauchy–Schwarz inequality implies that 1/2 = (tr(Σ))1/2 = constant. E|Ut−j | ≤ E(|Ut−j |2 ) Pq Here tr(Σ) denotes the trace of matrix Σ, tr(Σ) = j=1 σjj . Thus by Lemma B.2(2) in the Appendix we have X X ∞ ∞ j−1 A BUt−j ≤ kAj−1 k kBk E|Ut−j | E j=1 j=1 ≤ (tr(Σ))1/2 kBk

∞ X

Kcj−1 < ∞ (ρ(A) < c < 1),

j=1

which shows that the proposed MA solution for {Xt } converges with probability 1. It is clear then that {Xt } satisfies the first equation of (3.35). The statement of the theorem for {Yt } follows from this and from the second equation of (3.35). The weak stationarity of {Xt } follows directly: CX (k) := E(Xt+k X∗t ) =

∞ X

Aj−1 B E(Ut+k−j U∗t−` ) B ∗ A∗(`−1)

j,`=1

=

∞ X

Ak+`−1 B ΣB ∗ A∗(`−1) = Ak CX (0)

(k ≥ 0)

(3.36)

`=1 ∗ (k), independently of t. and CX (−k) = CX Now we show that the proposed MA solution is convergent in mean square and it is the only stationary solution of (3.35). Iterate the first equation of (3.35), shifted by one unit backward, (k − 1) times, :

Xt = BUt−1 + ABUt−2 + A2 BUt−3 + · · · + Ak−1 BUt−k + Ak Xt−k . Then using the properties of a trace, we obtain k X 2 2 E Xt − Aj−1 BUt−j = E Ak Xt−k j=1

  = E X∗t−k A∗k Ak Xt−k = E tr X∗t−k (A∗ A)k Xt−k   = E tr Xt−k X∗t−k (A∗ A)k = tr E(Xt−k X∗t−k )(A∗ A)k   = tr CX (0)(A∗ A)k = tr Ak CX (0)A∗k ,

(3.37)

since we are looking for a stationary solution {Xt } whose covariance matrix CX (0) constant. Under the condition ρ(A) < 1, Lemma B.2 and (3.37) imply Pis ∞ that j=1 Aj−1 BUt−j converges in mean square to {Xt }. This completes the proof of the theorem.

Stochastic linear systems

105

Remark 3.1. Applying the indeterminate z as the operator of left (backward) shift, one can write that z j Ut = Ut−j (j ≥ 1). Using this and (3.17), the solutions given in Theorem 3.5 can be written as Xt = (In − Az)−1 zBUt ,

Yt = DVt + C(In − Az)−1 zBUt

(t ∈ Z).

Looking at the transfer function H(z) defined in (3.13) and (3.14), in the case Vt ≡ Ut we can write that Yt = H(z)Ut =

∞ X

Hj Ut−j ,

t ∈ Z.

j=0

3.6.2

Prediction, miniphase condition, and covariance

Now we study the prediction in the case of a stochastic linear system Xt+1 = AXt + BUt (t ∈ Z).

Yt = CXt + DUt

(3.38)

Assuming stability: ρ(A) < 1, it is easy to find the best linear prediction. By Theorem 3.5, the solution series Xt =

∞ X

Aj−1 BUt−j ,

Yt = DUt +

j=1

∞ X

CAj−1 BUt−j ,

t∈Z

j=1

converge in mean square. In mean square the best linear h-step ahead prediction (h = 1, 2, . . . ) of Yt+h is ˆ t+h := CAh Xt = Y

∞ X

CAh+j−1 BUt−j ,

t ∈ Z.

(3.39)

j=1

This follows from the projection theorem, since ˆ t+h = DUt+h + Yt+h − Y

h X

CAj−1 BUt+h−j ,

t ∈ Z,

j=1

ˆ t+h . The mean square prediction error, c.f. (2.34), is is orthogonal to Y ˆ t+h k2 = DΣD∗ + kYt+h − Y

h X

CAj−1 BΣB ∗ A∗(j−1) C ∗ .

j=1

In many applications one cannot observe the driving white noise process {Ut }, only the output process {Yt }, so it is important if one can express the former in terms of the latter. Then this can be used in the best linear prediction (3.39).

106

Linear system theory, state space models

One can describe an ‘inverse’ linear system of (3.38) that uses the observable process {Yt } as input and the not observable process {Ut } as output. We assume that m = n and D = In . Starting with Xt+1 = AXt + BUt Yt = CXt + Ut ,

t ∈ Z,

we obtain that Ut = −CXt + Yt , and so Xt+1 = (A − BC)Xt + BYt . The last two equations describe the ‘inverse’ system where the matrix (A−BC) plays the role of the matrix A in the original system: the transition matrix in the state space. The stability condition ρ(A) < 1 of Theorem 3.5 is then replaced by the so-called strict miniphase condition: ρ(A − BC) < 1

(3.40)

of the ‘inverse’ system. Under this condition the ‘inverse’ system has a solution Ut =

∞ X

˜ j Yt−j = Yt − H

∞ X

C(A − BC)j−1 BYt−j

(t ∈ Z),

j=1

j=0

˜ j , j ≥ 0, is the which converges almost surely and in mean square. Here H impulse response function of the ‘inverse’ system, whose transfer function is H −1 (z) =

∞ X

˜ j zj , H

|z| ≤ 1.

j=0

Remark 3.2. By the first equation of (3.35) we have CX (0) = E(Xt+1 X∗t+1 ) = E {(AXt + BUt )(U∗t B ∗ + X∗t A∗ )} , and it gives a linear equation, a so-called Lyapunov equation for the covariance matrix CX (0): CX (0) = ACX (0)A∗ + BΣB ∗ . (3.41) If ρ(A) < 1, by (3.36) this Lyapunov equation has a unique solution: CX (0) =

∞ X

Aj BΣB ∗ A∗j .

(3.42)

j=0

From the second equation of (3.35) we get the following result for the autocovariance function of {Yt }: CY (k) := E(Yt+k Yt∗ ) = E {(CXt+k + DVt+k ) (X∗t C ∗ + Vt∗ D∗ )} = CCX (k)C ∗ + 1{k=0} DSD∗ + 1{k≥1} CAk−1 BRD∗

(k ≥ 0),

(3.43)

Stochastic linear systems

107

since E(Xt+k Vt∗ ) =

∞ X

Aj−1 B E(Ut+k−j Vt∗ ) = 1{k≥1} Ak−1 BR.

j=1

Also, CY (−k) =

CY∗

(k). In particular, we have CY (0) = CCX (0)C ∗ + DSD∗ .

Remark 3.3. Lemma B.2(3) and (3.36) show that if ρ(A) < 1, the covariance function CX (k) decays exponentially as |k| → ∞: kCX (k)k ≤ kCX (0)k Kck , where ρ(A) < c < 1. By (3.43), CY (k) also decays exponentially as |k| → ∞. In the important special case Vt = Ut , ∀ t ∈ Z, and using (3.36) too, (3.43) becomes CY (k) = 1{k=0} DΣD∗ + 1{k≥1} CAk−1 BD∗ +

∞ X

CAk+`−1 B ΣB ∗ A∗(`−1) C ∗ ,

`=1

when k ∈ Z. At this point we may apply the transfer function H(z) =

∞ X

H` z ` = D +

∞ X

CA`−1 Bz ` ;

`=1

`=0

see (3.13), with H0 := D. Thus we get CY (k) =

∞ X

Hk+` ΣH`∗ ∈ Cp×p

(k ∈ Z).

`=0

In particular, CY (0) =

∞ X

H` ΣH`∗ .

`=0

Based on this, now we can introduce the block Hankel matrix of the covariance function CY (k):   CY (1) CY (2) CY (3) · · ·  CY (2) CY (3) CY (4) · · ·    H(CY ) :=  CY (3) CY (4) CY (5) · · ·    .. .. .. . . .    H1 H2 H3 · · · ΣH0∗ 0 0 ···  H2 H3 H4 · · ·   ΣH1∗ ΣH0∗ 0 ···     =  H3 H4 H5 · · ·   ΣH ∗ ΣH ∗ ΣH ∗ · · ·  . 2 1 0    .. .. .. .. .. .. .. . . . . . . .

108

Linear system theory, state space models

Here the first factor is H, the block Hankel matrix of the linear system defined in (3.5), while the second factor is a lower triangular block Toeplitz matrix based on H and on the covariance matrix Σ of the driving white noise {Ut }. In Section 3.3 we saw that the reachability of a linear system depends on the rank of the block matrix R∞ := [B, AB, A2 B, . . . ]. In fact, we saw that rank(R∞ ) = rank(R), where R is the reachability matrix defined by (3.6). Moreover, the linear system (3.35) is reachable if and only if rank(R) = n, where n denotes the dimension of the state space. Under the condition of stability ρ(A) < 1, let us evaluate the reachability Gramian P: ∞ X P := R∞ R∗∞ = Aj BB ∗ A∗j ∈ Cn×n . j=0

Then P is the unique solution of the Lyapunov equation P = APA∗ + BB ∗ , compare with (3.41) and (3.42). In fact, if Σ = In , then P = CX (0), the covariance matrix of {Xt }. It also follows from the above facts that the linear system (3.35) is reachable if and only if rank(P) = n, that is, P is non-singular. The case of observability is similar. Also in Section 3.3 we saw that the observability of a linear system depends on the rank of the block matrix   C  CA    O∞ :=  CA2  .   .. . In fact, we saw that rank(O∞ ) = rank(O), where O is the observability matrix defined by (3.7). Moreover, the linear system (3.35) is observable if and only if rank(O) = n. Under the condition of stability ρ(A) < 1, let us evaluate the observability Gramian Q: ∞ X ∗ Q := O∞ O∞ = A∗j C ∗ CAj ∈ Cn×n . j=0

Then Q is the unique solution of the Lyapunov equation Q = A∗ QA + C ∗ C. It also follows from the above facts that the linear system (3.35) is observable if and only if rank(Q) = n, that is, Q is non-singular.

Summary

3.7

109

Summary

State space model approach of R.E. K´alm´an is used. A quadruple of bounded linear operators (A, B, C, D), A : X → X, B : U → X, C : X → Y , and D : U → Y is given, where U = Cq , X = Cn , Y = Cp , A ∈ Cn×n , B ∈ Cn×q , C ∈ Cp×n , and D ∈ Cp×q . The corresponding time invariant, linear dynamical system is described by xk+1 = Axk + Buk yk = Cxk + Duk

(k ∈ Z).

The restricted input/output map φ0 of the system is defined by φ0 ({uj }0j=−∞ ) = {yk }∞ k=1 , where starting time j0 ≤ 0 of the input sequence can be arbitrary. Observe that φ0 does not depend on the operator D, therefore we assume that D = 0. It is useful to write φ0 in matrix form:        y1 u0 CB CAB CA2 B · · · u0  y2   u−1   CAB CA2 B   · ···       u−1    .  y3  = H  u−2  =  CA2 B   · · · · ·   u−2        .. .. .. .. .. .. . . . . . . The infinite coefficient matrix H here is a block Hankel matrix ; each of its blocks is a p × q complex matrix. Let (A, B, C) be a finite dimensional linear system. We say that the state x ∈ X is reachable if there is a k ≥ 0 and a sequence of inputs u0 , u1 , . . . , uk−1 that drives the system from the state x0 = 0 to the state xk = x:   uk−1 k−1  uk−2  X   x= Ak−1−j Buj = [B, AB, . . . , Ak−1 B]  .  . ..   j=0 u0 By the Cayley–Hamilton theorem, Range[B, AB, . . . , Ak−1 B] = Range[B, AB, . . . , An−1 B]

(k ≥ n).

The matrix R := [B, AB, . . . , An−1 B] is the reachability matrix of the system and its range is the subspace of reachable states. We say that the system is reachable if any x ∈ X is reachable, that is, if rank(R) = dim(Range(R)) = n, equivalently, if R has n linearly independent columns, equivalently, if Ker(R∗ ) = {0}. Clearly, Range(R) is an A-invariant subspace in X.

110

Linear system theory, state space models

We say that the state x ∈ X is observable if starting from the state x0 = x at time 0 and having a constant zero input sequence uj = 0 (j ≥ 0), there is a time instant k ≥ 0 such that yk = CAk x 6= 0. The reason for the name “observable” is that the state x is observable if and only if for any x(1) , x(2) ∈ X, x(2) − x(1) = x, starting at time 0 with initial state x(1) and x(2) , respectively, there exists a time instant k ≥ 0 such that the output yk is different in the two cases. This fact follows from the linearity of the system. Again, by the Cayley–Hamilton theorem,     C C  CA   CA      Ker  .  = Ker   =: Ker(O) (k ≥ n − 1), ..  ..    . n−1 k CA CA where O is called the observability matrix of the system. The orthogonal complement of its kernel is the subspace of observable states. The system is observable, if any x ∈ X is observable, that is, if Ker(O) = {0}, equivalently, if O has n linearly independent rows, equivalently, if rank(O) = n. Clearly, Ker(O) is an A-invariant subspace in X. After we consider more general input/output sequences: assume that the observed input sequence contains only finitely many terms with negative indices; the system was at rest when the first nonzero input arrived, say at time −j0 ≤ 0 (which is arbitrary): x−j0P = 0. So the input sequence can be ∞ written as a formal power series u(z) := j=−j0 uj z −j . A corresponding output sequence (y−j0 +1 , y−j0 +2 , . . . ) is the following formal power P∞ series with p-dimensional complex vector coefficients yk ∈ Y : y(z) := k=−j0 +1 yk z −k . Let Y is the vector space of all formal power series y(z) with this property. Also, y(z) = φ(u(z)) = H(z −1 )u(z) with the transfer function H(z). If |z| > ρ(A) (the spectral radius of A), then Iz − A is invertible and H(z −1 ) = C(Iz − A)−1 B. Clearly, limz→∞ H(z −1 ) = 0. These facts imply the important strict causality of the input/output map, i.e., an input sequence beginning at time 0 may influence only the output sequence from time 1. This gives a link between the internal (with state space equations) and the external description (with transfer function) of the linear system. Then we investigate when and how one can get an internal description (A, B, C) from an external description, more exactly, from a reduced input/ output map φ0 : U− → Y+ . Realizations are not unique, but Theorem 3.1 guarantees that in the finite dimensional case minimal realizations are isomorphic. The following properties of a linear system are equivalent: the reduced input/output map φ0 : U− → Y+ has an internal description (A, B, C) with finite dimensional state space X; the infinite block Hankel matrix has a bounded rank; the transfer function H(z) is rational and has a matrix fraction description (MFD).

Summary

111

The McMillan degree of the transfer function H(z) is the dimension of the state space X in a minimal realization. Theorem 3.1 shows that the McMillan degree is uniquely defined. Eventually, a stochastic time-invariant linear system is considered: Xt+1 = AXt + BUt Yt = CXt + DVt

(t ∈ Z).

(3.44)

which is driven by two white noise processes {Ut } and {Vt }. These systems will play an important role in the subsequent chapters.

4 Multidimensional time series

4.1

Introduction

In this chapter we investigate properties of multidimensional, weakly stationary time series, similarly to the 1D case. From the literature we use the following important ones: [11, 14, 22, 28, 29, 35, 36, 39, 41, 47, 51, 54, 59, 60]. In contrast to Chapter 2, here we proceed deductively: from the most general constant rank processes, via regular (causal) ones, to the VARMA (vector autoregressive) processes and state space models. Not surprisingly, here the classification is more complicated; for example, the spectral density matrix can be of reduced rank, but not zero. Linear filtering and constant rank processes are considered, and we prove that the spectral density matrix of a d-dimensional process has a constant rank r ≤ d if and only if the process is a two-sided moving average, obtained as a TLF with an r-dimensional white noise process; in this way, the spectral density matrix is also factorized. Special cases are the regular processes that have a causal (future independent) MA representation. Relations between the impulse responses, transfer function, and spectral factor are also discussed. A non-singular process has a Wold decomposition in multidimension too. Here only the so-called innovation subspaces are unique, whereas their dimension is equal to the constant rank of the process. Further subclass of regular processes are the ones with a rational spectral density matrix. These can be finitely parametrized, and have either a state space, a stable VARMA representation, or an MFD (Matrix Fractional Description). Here, in the factorization of the spectral density matrix, the so-called spectral factors are also rational matrices. This fact has important consequences, for example, in the dynamic factor analysis and relations in the time domain (see Chapter 5).

4.2

Linear transformations, subordinated processes

While we met the notions of linear filter and subordinated process in Section 2.2 in the case of a one-dimensional time series, their extension to the d-

113

114

Multidimensional time series

dimensional case requires some technical tools; these are going to be discussed in this section. Let {Xt }t∈Z be an d-dimensional stationary time series with spectral representation Z π Xt = eitω dZω −π

and spectral measure matrix dF = dFX as discussed in Chapter 1. Assume that we are given a matrix function T (ω) = [tjk (ω)]m×d ,

tjk ∈ L2 ([−π, π], B, tr(dF)).

(4.1)

Here tr(dF) = dF 11 + · · · + dF dd

(4.2) jk

is a non-negative measure which dominates each dF by the non-negative definiteness of dF. Indeed, substituting zj = eit1 , zk = eit2 , z` = 0 for ` 6= j, k and fixing α and β in (1.40), we obtain that ∆F jj + ∆F kk + ∆F jk ei(t1 −t2 ) + ∆F kj ei(t2 −t1 ) ≥ 0

for any

t1 , t2 ,

which implies that ∆F jj + ∆F kk ≥ 2|∆F jk | . By definition, the m-dimensional process {Yt }t∈Z is a linear transform of or obtained by a time invariant linear filter (TLF) from {Xt } if Z π Yt = eitω T (ω) dZω (t ∈ Z). (4.3) −π

It means that with the random measure dZYω := T (ω) dZω we have a representation of the process {Yt }: Z π Yt = eitω dZYω (t ∈ Z). −π

Similarly to formula (1.28), "Z

π

eihω

Cov(Yt+h , Yt ) = −π

Z

n X

#m tpr (ω)tqs (ω)dF rs (ω)

r,s=1

p,q=1

π

eihω T (ω)dF(ω)T ∗ (ω)

=

(h ∈ Z).

(4.4)

−π

Thus {Yt } is also a stationary time series with spectral measure matrix dFY = T dFX T ∗ .

(4.5)

Linear transformations, subordinated processes

115

Considering Cov(Yt+h Xt ) = E(Yt+h X∗t ), one can similarly obtain that {Xt } and {Yt } are jointly stationary and their joint spectral density matrix is dFY,X = T dFX .

(4.6)

It is not difficult to show that the last formula is not only necessary, but sufficient for obtaining {Yt } from {Xt } by linear transformation. Because of (4.6) we may call T the conditional spectral density of {Yt } w.r.t. {Xt } and denote T = f Y |X . Compare with the one-dimensional case (2.6). By Fourier transformation, we rewrite the definition of linear transformation in the time domain, assuming that condition (4.1) holds: T (ω) =

∞ X

τ (j)e−ijω ,

τ (j) =

j=−∞

1 2π

Z

∞ X

π

eijω T (ω)dω,

−π

kτ (j)k2F < ∞,

j=−∞

where the Fourier coefficients τ (j) are m×d matrices. Because of the isometry between H(X) and L2 ([−π, π], B, tr(dF)), definition (4.3) of linear transform is equivalent to Z π Z π ∞ ∞ X X itω −ijω Yt = τ (j) e τ (j)e dZω = ei(t−j)ω dZω −π

=

j=−∞

∞ X

−π

j=−∞

τ (j)Xt−j ,

t ∈ Z.

(4.7)

j=−∞

This shows that the linear transform {Yt } is really obtained by linear filtering from {Xt }, that is, by a sliding summation with given matrix weights τ (j). We call an m-dimensional stationary time series {Yt }t∈Z causally subordinated to an d-dimensional time series {Xt }t∈Z if {Yt } is obtained from {Xt } by a linear transform (4.3) or (4.7), and, also, Hk− (Y) ⊂ Hk− (X) for all k ∈ Z. By stationarity, it suffices to require the above condition to hold for a specific value of k, for example, for k = 0. In the causally subordinated case the filtering equation (4.7) modifies as a one-sided infinite summation: Yt =

∞ X

τ (j)Xt−j ,

τ (j) = 0

if

j < 0.

(4.8)

kτ (j)k2F < ∞,

(4.9)

j=0

Therefore, the Fourier series of T becomes one-sided: T (ω) =

∞ X j=0

τ (j)e−ijω ,

∞ X j=0

where k · kF denotes the Frobenius norm, see Appendix B.

116

4.3

Multidimensional time series

Stationary time series of constant rank

Assume that Xt = (Xt1 , . . . , Xtd ) (t ∈ Z) is a d-dimensional complex valued weakly stationary time series with absolutely continuous spectral measure with density matrix f on [−π, π], see Chapter 1. We say that {Xt } has constant rank r, if the matrix f (ω) has rank r for almost every ω ∈ [−π, π]. In 1D it means the f (ω) > 0 for a.e. ω. Theorem 4.1. ([47, Section I.9] We have the following back-and-forth statements. (a) Assume that the stationary time series Xt = (Xt1 , . . . , Xtd ) (t ∈ Z) has an absolutely continuous spectral measure with density matrix f of constant rank r. Then f can be factored as f (ω) =

1 φ(ω)φ∗ (ω) 2π

for a.e. ω ∈ [−π, π],

where φ(ω) ∈ Cd×r , r ≤ d. Also, {Xt } can be represented as a two-sided infinite MA process (a sliding summation) Xt =

∞ X

b(j)ξt−j ,

(4.10)

j=−∞

where {ξt }t∈Z is a WN(Ir ) (orthonormal) sequence, b(j) = [bk` (j)] ∈ Cd×r (j ∈ Z) is a non-random matrix-valued sequence, the Fourier coefficients of the factor φ(ω): Z π ∞ ∞ X X 1 b(j)e−ijω , b(j) = kb(j)k2F < ∞. φ(ω) = φ(ω)dω, 2π −π j=−∞ j=−∞ (4.11) This Fourier series converges to φ in L2 sense. (b) Conversely, any stationary time series {Xt }t∈Z represented as a two-sided infinite MA process (4.10) has an absolutely continuous spectral measure with density matrix f of constant rank r, where r is the dimension of the white noise sequence {ξt }. Proof. First suppose that f has constant rank r. Since 2πf is self-adjoint and non-negative definite by Corollary 1.1, it has a Gram-decomposition for a.e. ω by Proposition B.1: 2πf (ω) = φ(ω)φ∗ (ω),

φ(ω) ∈ Cd×r .

(4.12)

Now for a.e. ω there exists a matrix ψ(ω) ∈ Cr×d (not unique if d > r) such that ψ(ω)φ(ω) = Ir . (4.13)

Stationary time series of constant rank

117

For, fixing ` = 1, . . . , r, the system of equations d X

ψ`j φjk = δ`k

(k = 1, . . . , r)

j=1

has a solution (ψ`1 , . . . , ψ`d ) for each ω where the coefficient matrix φ(ω) has rank r, i.e. for almost every ω. Then (4.12) and (4.13) imply that for a.e. ω, ψf ψ ∗ =

1 1 ψφφ∗ ψ ∗ = Ir . 2π 2π

By the spectral representation (1.19), Z π Xt = eitω dZω ,

(4.14)

Zω = (Zω1 , . . . , Zωd ).

−π

We are immediately going to show that the following random variables are well-defined and are in H(X) for any Borel subset B ⊂ [−π, π]: VB` :=

Z X d

ψ`j (ω) dZωj

(` = 1, . . . , r).

(4.15)

B j=1 ` and Vω := (Vω1 , . . . , Vωr ) if ω ∈ (−π, π]. Then the compoSet Vω` := V(−π,ω] nents of the process Vω , −π < ω ≤ π, are orthogonal and have orthogonal increments: d   Z π X 0 E VB` VB` 0 = 1B (ω)1B 0 (ω) ψ`j (ω)fjj 0 (ω)ψj 0 `0 (ω) dω −π

j,j 0 =1

1 = δ``0 dω(B ∩ B 0 ) 2π

(`, `0 = 1, . . . , r),

(4.16)

where we used (4.14) and the isometric isomorphism between the Hilbert spaces H(X) ⊂ L2 (Ω, F, P) and L2 ([−π, π], B, tr(dF )). Here dω is Lebesgue measure, dF is the spectral measure matrix of {Xt }, and tr(dF ) defined by (4.2) is a non-negative measure which dominates each dF jk . At the same time, (4.16) shows that each ψ`j ∈ L2 ([−π, π], B, tr(dF )), so by (4.1) the integral in (4.15) exists and VB` ∈ H(X). Let us introduce now the stationary time series ξt = (ξt1 , . . . , ξtr ) (t ∈ Z) by stochastic integrals: ξt` :=

Z

π

eitω dVω` =

−π

or briefly, Z

Z

π

eitω

−π

π

ξt = −π

eitω dVω =

d X

ψ`j (ω) dZωj ,

j=1

Z

π

−π

eitω ψ(ω) dZω .

(4.17)

118

Multidimensional time series

(4.16) implies that {ξt }t∈Z is an orthonormal sequence in H(X):  E ξt ξs = δts Ir t, s ∈ Z. (4.12), (4.13), and (4.14) imply that (φψ − Id )f (ψ ∗ φ∗ − Id ) = (φψ − Id )

1 φφ∗ (ψ ∗ φ∗ − Id ) = 0 2π

a.e. in [−π, π]. Consequently, the difference Z Z π eitω φ(ω)ψ(ω)dZω − ∆t :=

π

eitω dZω

−π

−π

is orthogonal to itself in H(X), so it is a zero vector. Thus we have Z π Z π Z π itω itω Xt = e dZω = e φ(ω)ψ(ω)dZω = eitω φ(ω)dVω . −π

−π

(4.18)

−π

It is clear by (4.12) that each entry φk` (ω) is square integrable, so has a Fourier expansion converging in L2 ([−π, π], B, dω): Z π ∞ ∞ X X 1 φ(ω) = b(j)e−ijω , b(j) = φ(ω)dω, kb(j)k2F < ∞. 2π −π j=−∞ j=−∞ (4.19) Finally, (4.17), (4.18), and (4.19) imply that Xt =

∞ X

b(j)ξt−j ,

b(j) = [bk` (j)]d×r ,

j=−∞

which completes the proof of statement (a) of the theorem. Conversely, assume that {Xt }t∈Z has an MA representation (4.10). Let φ = [φj` ]d×r be the Fourier series with coefficients b(j) as defined by (4.19) and Vω = (V1 (ω), . . . , Vr (ω)) be the random Fourier spectrum of the orthonormal time series {ξt }t∈Z as defined by (4.17). The Hilbert space H(ξ) ⊂ L2 (Ω, F, P) generated by the orthonormal time series {ξt }t∈Z coincides with the Hilbert space H(V) generated by random variables of the form Z π ζ= h∗ (ω) dV(ω), h(ω) = (h1 (ω), . . . , hr (ω)), −π 2

where h` ∈ L ([−π, π], B, dω) (1 ≤ ` ≤ r). By (4.10), it is also clear that H(X) = H(ξ) = H(V). 1 Since the orthonormal time series {ξt }t∈Z has 2π Ir as its spectral density matrix, (4.10) implies that the spectral measure of {Xt }t∈Z is also absolutely continuous w.r.t. Lebesgue measure with spectral density matrix f=

1 φφ∗ and rank(f (ω)) ≤ r 2π

for a.e.

ω ∈ [−π, π].

Stationary time series of constant rank

119

Indeed, ∞ X

C(h) = E(Xt+h X∗t ) =

∗ b(j)E(ξt+h−j ξt−k )b∗ (k) =

j,k=−∞

∞ X

b(k + h)b∗ (k),

k=−∞

P∞

−ikω while the Fourier series φ(ω) = converges in L2 , so k=−∞ b(k)e 1 ∗ 1 2π φφ ∈ L , thus Z π Z π ∞ X 1 1 eihω φ(ω)φ∗ (ω)dω = eihω b(j)e−ijω b∗ (k)eikω dω 2π −π 2π −π j,k=−∞ Z π ∞ X = eihω dF (ω). (4.20) b(k + h)b∗ (k) = C(h) = −π

k=−∞

Now, indirectly suppose that on a set of positive Lebesgue measure, rank(f (ω)) < r. RThen there exists h(ω) = (h1 (ω), . . . , hr (ω)), h` ∈ π 2 L2 ([−π, π], B, dω), −π |h(ω)| dω 6= 0, such that φ(ω)h(ω) = 0 for a.e. ω ∈ [−π, π]. This means that the random variable Z π ζ= h∗ (ω) dV(ω) ∈ H(V) = H(X), kζk 6= 0, −π

is orthogonal to H(X), since Z π r X eitω φj` (ω)h` (ω)dω = 0 hXtj , ζi = E(Xtj ζ) = −π

(j = 1, . . . , d; t ∈ Z).

`=1

This contradiction completes the proof of statement (b) of the theorem. Remark 4.1. Observe that the proof of statement (a) of the previous theorem depended only on the fact that one has a rank r ≤ d factorization of the spectral density of the form 1 φ(ω)φ∗ (ω), φ(ω) ∈ Cd×r , for a.e. ω ∈ [−π, π]. 2π Remark 4.2. Similarly as in 1D in Section 2.3, here we can also introduce the impulse response function {b(j)}j∈Z and the transfer function H(z) = P j 2 j∈Z b(j)z , which belongs to L (T ) and with which we can write that Xt = H(z)ξt , t ∈ Z. Clearly, for the factor φ we have φ(ω) = H(e−iω ). This and (4.12) imply that f (ω) =

f (ω) =

1 H(e−iω )H(e−iω )∗ , 2π

ω ∈ [−π, π].

By (4.10), the covariance function of {Xt } can be written in terms of {b(j)} as in 1D: ∞ X C(h) = E(Xt+h X∗t ) = b(k + h)b∗ (k), h ∈ Z. (4.21) k=−∞

120

4.4

Multidimensional time series

Multidimensional Wold decomposition

The multidimensional version of Wold decomposition goes similarly as its 1D case in Section 2.6. The notations, definitions, and a part of the material of Section 2.6 about singular and regular processes can be extended to the multidimensional case with no essential change, therefore will not be repeated here.

4.4.1

Decomposition with an orthonormal process

Theorem 4.2. (See e.g. Rozanov’s book [47, Sections II.2–3].) Assume that {Xt }t∈Z is an d-dimensional non-singular stationary time series. Then it can be represented in the form Xt = Rt + Yt =

∞ X

b(j)ξt−j + Yt

(t ∈ Z),

(4.22)

j=0

where (1) {Rt } is a d-dimensional regular time series causally subordinated to {Xt }; (2) {Yt } is an d-dimensional singular time series causally subordinated to {Xt }; (3) {ξt } is an r-dimensional (r ≤ d) WN(Ir ) (orthonormal) sequence causally subordinated to {Xt }; (4) {Rt } and {Yt } are orthogonal to each other: E(Rt Ys∗ ) = O for t, s ∈ Z; P∞ (5) b(j) = [bk` (j)] ∈ Cd×r for j ≥ 0 and j=0 kb(j)k2F < ∞. − span{Xj : j ≤ k} denote the past of {Xt } until time k and Proof. Let T Hk := H−∞ := k∈Z Hk− is the remote past of {Xt }. − − Since {Xt } is assumed to be non-singular, H−1 , X0 }. 6= H0− = span{H−1 (One may choose an arbitrary initial time.) Let D0 be the orthogonal com− plement of H−1 in H0− : − H0− = H−1 ⊕ D0 , (4.23)

where ⊕ denotes orthogonal direct sum. Choose an orthonormal basis {ξ01 , . . . , ξ0r } in the subspace D0 . Then clearly, r ≤ d. Set ξ0 := (ξ01 , . . . , ξ0r ). Let S denote the unitary operator of right time-shift in H(X). Define − ξt := S t ξ0 for t ∈ Z. Clearly, ξt ∈ Dt := S t D0 , so ξt ∈ Ht− , ξt ⊥Ht−1 for each − − t, and Ht = span{Ht−1 , ξt }. Thus {ξt }t∈Z is an r-dimensional orthonormal sequence: E(ξt ξs∗ ) = δts Ir (t, s ∈ Z)

Multidimensional Wold decomposition

121

and ξt ⊥H−∞ for each t. It is important that Hk− = H−∞ ⊕

k M

∞ M

H(X) = H−∞ ⊕

Dj ,

j=−∞

Dj ,

j=−∞

and each {ξk1 , . . . , ξkr } is an orthonormal basis in Dk (k ∈ Z). Let us expand X0 into its orthogonal series w.r.t. {ξt }: X0k = Y0k +

∞ X r X

` bk` (j)ξ−j ,

` bk` (j) := hX0k , ξ−j i,

k = 1, . . . , d;

j=0 `=1

that is, X0 = Y0 +

∞ X

b(j)ξ−j .

(4.24)

j=0

The vector Y0 is simply the remainder term, which of course is 0 if the bases {ξt1 , . . . , ξtr }t∈Z span H(X), but not in general. It is not hard to see that Y0 ∈ H−∞ . Now apply the operator S t to (4.24): Xt =

∞ X

b(j)ξt−j + Yt =: Rt + Yt

(t ∈ Z),

j=0

where we have defined Yt := S t Y0 (t ∈ Z). Thus {Yt }t∈Z is a stationary process, and Yt ∈ H−∞ for all t, so it is singular. {Rt }t∈Z is a stationary causal MA(∞) process, that is, a regular process. {Rt } and {Yt } are orthogonal to each other. Formula (4.7) shows that {Rt } is a linear transform of {Xt }, and since Yt = Xt − Rt for all t, it follows for {Yt } as well. The relationships Hk− (R) ⊂ Hk− (X),

Hk− (Y) ⊂ Hk− (X)

(k ∈ Z)

clearly holds: thus {Rt } and {Yt } are subordinated to {Xt }. As we are going to see in the next section, the orthonormal series {ξt } can be chosen the same as the orthonormal series in Theorem 4.1, which was obtained by linear transformation ψ(ω) from {Xt } according to formula (4.17). Also, by the definition {ξt } in the present theorem, Hk− (ξ) ⊂ Hk− (X), so {ξt } is also subordinated to {Xt }. This completes the proof of the theorem. Remark 4.3. It is clear from the proof that ξ0 is unique up to premultiplication by an arbitrary r × r unitary matrix U . Indeed, we can take any other initial orthonormal basis in the subspace D0 in the form ξ˜0 = U ξ0 , but otherwise the subspace D0 is fixed. Also, the Wold decomposition of the theorem is uniquely defined up to this rotation of the initial basis.

122

Multidimensional time series

The transfer function of the regular part {Rt } is HR (z) =

∞ X

b(j)z j

(z ∈ T ),

Rt = HR (z)ξt

(t ∈ Z).

(4.25)

j=0

For simplicity, let us fix the present time as time 0. Then the best linear ˆ h of Xh is by definition the projection of Xh onto h-step ahead prediction X the past until 0, that is, to H0− . By the Wold decomposition (4.22), the best linear h-step ahead prediction is ˆh = X

∞ X

0 X

b(j)ξh−j + Yt =

j=h

b(h − k)ξk + Yt

(h ≥ 1),

(4.26)

k=−∞

since the right hand side of (4.26) is in H0− and the error ˆh = Xh − X

h−1 X

b(j)ξh−j

j=0

is orthogonal to H0− . Hence the mean square error of the prediction is given by h−1 X ˆ h k2 = kb(j)k2F > 0. (4.27) σh2 := kXh − X j=0

4.4.2

Decomposition with innovations

(See e.g. the approach introduced by Wold [62] and used by Wiener and Masani [59]. Now we briefly describe that approach too, since we need it later.) Let Hk− denote again the Hilbert space spanned by the past of {Xt } until − time k and let ProjH − Xt denote the orthogonal projection of Xt to Ht−1 , t−1 which exists uniquely by the projection theorem and Lemma 1.1. Then kXt − ProjH − Xt k ≤ kXt − Zk for any t−1

− Z ∈ Ht−1 .

Define the process of innovations ηt := Xt − ProjH − Xt , t−1

t ∈ Z.

(4.28)

If the covariance matrix of {ηt } is Σ := E(η0 η0∗ ) ∈ Cd×d ,

(4.29)

then {ηt } is a WN(Σ) process, E(ηt ηs∗ ) = δts Σ for any t, s ∈ Z. The one-step

Multidimensional Wold decomposition

123

− ahead prediction of Xt based on Ht−1 is exactly ProjH − Xt , the prediction t−1 error is ηt , and the covariance matrix of this error is Σ. Further, by (4.28),

Cov(Xt , ηt ) = E(Xt ηt∗ ) = E(ηt ηt∗ ) = Σ.

(4.30)

The rank r ≤ d of Σ is the rank of the process {Xt }. It is clear from the definition (4.23) of D0 that rank(Σ) ≤ dim(D0 ). On the other hand, rank(Σ) < dim(D0 ) would lead to a contradiction. So it follows that r = rank(Σ) = dim(D0 ),

(4.31)

and so each Dt , t ∈ Z, can be called innovation subspace. If {Xt } is nonsingular, then clearly we must have r ≥ 1. In the special case when r = d the process is called full rank. In general, there is the following relationship between the approach described in Theorem 4.2 and the present one. The components of ξ0 are an orthonormal basis in the innovation subspace D0 , Thus the components of the innovation η0 can be expressed as linear combinations of them. It means that there exists a d × r matrix A such that η0 = Aξ0 . By the stationarity of the time series {Xt }, the operator S of right (forward) time shift takes this relationship into ηt = S t η0 = AS t ξ0 = Aξt

(t ∈ Z),

Σ = E(Aξt ξt∗ A∗ ) = AA∗ .

(4.32)

With the WN(Σ) innovation process {ηt } the Wold decomposition described in the proof of Theorem 4.2 can go similarly as with the orthonormal process {ξt } there, resulting X t = Rt + Y t =

∞ X

a(j)ηt−j + Yt

(t ∈ Z).

(4.33)

j=0

Here {Yt } is a singular process, Yt ∈ H−∞ , Rs ⊥Yt for any s, t ∈ Z. Further, {Rt } is a regular process, a(j) ∈ Cd×d (j ≥ 0), and by (4.32) and (4.22), Rt =

∞ X

a(j)ηt−j =

j=0

∞ X

a(j)Aξt−j =

j=0

∞ X

b(j)ξt−j ,

t ∈ Z.

j=0

This implies that b(j) = a(j)A for j ≥ 0. Moreover, by (4.33) and (4.30), ∗ Cov(X0 , η−j ) = E(X0 η−j ) = a(j)Σ,

Cov(X0 , η0 ) = Cov(η0 , η0 )



a(0)Σ = Σ.

The covariance matrix function of {Rt } is   ∞ ∞ X X ∗ C R (h) = E  a(j)ηt+h−j ηt−k a∗ (k) = a(k + h) Σ a∗ (k), j,k=0

k=0

(4.34)

(4.35)

124

Multidimensional time series

and its squared norm is kRt k2 = tr(C R (0)) =

∞ X

∞  X  1 1 1 tr a(k) Σ 2 Σ 2 a∗ (k) = ka(k)Σ 2 k2F < ∞.

k=0

k=0

(4.36) In the present setting, the transfer function of the regular part {Rt } is HR (z) =

∞ X

a(j)z j

(z ∈ T ),

Rt = HR (z)ηt

(t ∈ Z),

j=0

compare with (4.25).

4.5

Regular and singular time series

Assume that {Xt }t∈Z is a regular stationary time series. By Wold’s decomposition (4.22), Xt =

∞ X

b(j)ξt−j

(t ∈ Z),

b(j) = [bk` (j)]d×r ,

(4.37)

j=0 ∞ X

kb(j)k2F < ∞,

{ξt }t∈Z ∼ WN(Ir ).

j=0

By Theorem 4.1, this representation of {Xt } implies that {Xt } has an absolutely continuous spectral measure with density matrix f (ω) and with constant rank r˜ for a.e. ω ∈ [−π, π], f (ω) =

1 φ(ω)φ∗ (ω), 2π

φ(ω) = [φk` (ω)]dטr ,

for a.e. ω ∈ [−π, π]

(4.38)

and Xt =

∞ X

j=−∞ ∞ X

˜ ξ˜t−j b(j)

(t ∈ Z),

2 ˜ kb(j)k F < ∞,

˜ b(j) = [˜bk` (j)]dטr ,

(4.39)

{ξ˜t }t∈Z ∼ WN(Ir ).

j=−∞

Compare now (4.37) and (4.39). Remark 4.3 says that the orthonormal process {ξt } in (4.37) is unique up to pre-multiplication by an arbitrary r × r unitary matrix U . Thus it follows that 1. r˜ = r,

Regular and singular time series

125

˜ 2. b(j) = 0 if j < 0, ˜ 3. ξ0 = U ξ0 , ξ˜k = S k U ξ0 (k ∈ Z). Corollary 4.1. The dimension r of the WN(Ir ) (orthonormal) innovation process {ξt } in (4.37) and of the rank of the WN(Σ) innovation process {ηt } in (4.28) and (4.31) are equal to the a.e. constant rank of the spectral density matrix f of the regular time series {Xt }. From now on we assume that ξ0 = ξ˜0 is chosen as in (4.37), so we may omit all ‘tildes’ in (4.39). Let us write the Fourier expansion (4.19) for the present regular time series {Xt }: φ(ω) =

∞ X

b(j)e−ijω ,

kφk22 =

j=0

∞ X

kb(j)k2F < ∞.

j=0

Corollary 4.2. A stationary time series {Xt }t∈Z is regular if and only it has an absolutely continuous spectral measure with spectral density of rank r ≤ d that can be factored in the form f (ω) =

1 φ(ω)φ∗ (ω), 2π

where φ(ω) =

∞ X

φ(ω) = [φk` (ω)]d×r ,

b(j)e−ijω ,

kφk22 =

kb(j)k2F < ∞.

j=0

j=0

That is, φ(ω) = Φ(e−iω ),

∞ X

for a.e. ω ∈ [−π, π],

Φ(z) =

∞ X

b(j)z j .

z ∈ D,

(4.40)

j=0

so the entries of the spectral factor Φ(z) = [Φjk (z)]d×r are analytic functions in the open unit disc D and belong to the class L2 (T ) on the unit circle, consequently, they belong to H 2 , see Section A.3. Briefly, we write that Φ ∈ H 2. Proof. If {Xt } is regular, then it has constant rank and so its spectral density f can be factored as (4.38) and φ can be expressed as an L2 -convergent Fourier sum with coefficients b(j) by (4.11), but now (4.22) implies that b(j) = 0 when j < 0. Conversely, if f can be factored as stated, then Theorem 4.1, Remark 4.1 and Theorem 4.2 imply that it is regular. Take the spectral factor Φ(z) = [Φk` (z)]d×r ∈ H 2 defined in (4.40). Not only does it have the boundary value φ(ω) on the unit circle T , but by Theorem A.10 it has radial limits: lim kφ(ω) − Φ(ρe−iω )k2 = 0.

ρ→1

126

Multidimensional time series

The spectral factor Φ(z) contains all information needed for finding the orthonormal innovation process {ξt } since by (4.17) in the proof of Theorem 4.1, we may compute {ξt } by linear transformation of {Xt } using ψ, and by (4.13) ψ can be obtained from the system of linear equations ψ(ω)φ(ω) = Ir , where φ is the boundary value of Φ. Also, the coefficients b(j) can be obtained from Φ by power series expansion. As soon as we have these information, we may get the optimal linear prediction and its mean square error by formulas (4.26) and (4.27). Moreover, it follows from Remark 4.2, that Φ(z) is equal to the transfer function H(z) on the unit circle T , so we can write that Xt = H(z)ξt = Φ(z)ξt , t ∈ Z. In the case of a regular process {Xt }, (4.21) takes the following form: C(h) =

∞ X

b(k + h)b∗ (k)

(h ∈ Z),

C(−h) = C ∗ (h).

k=0

4.5.1

Full rank processes

Assume that {Xt } is a d-dimensional stationary time series, with spectral measure matrix dF = dFa + dFs ,

dFa  dω,

dFs ⊥ dω,

where dω denotes Lebesgue measure, dFa (ω) = f (ω)dω, f is the spectral density matrix of {Xt }, dFs is supported on a zero Lebesgue measure subset of [−π, π]. We say that {Xt } has full rank if rank(f (ω)) = d for a.e. ω ∈ [−π, π]. It means that f (ω) is a.e. non-singular, positive definite matrix. The next theorem is an extension of Theorem 2.4 and the Kolmogorov– Szeg˝ o formula, Corollary 2.3, to the multidimensional full rank case. Theorem 4.3. A d-dimensional stationary time series {Xt } is of full rank non-singular process if and only if log det f ∈ L1 , that is, Z π log det f (ω)dω > −∞. −π

In this case if Σ denotes the covariance matrix of the innovation process {ηt }, that is, of the one-step ahead prediction error process defined in (4.28) and (4.29), then Z π dω (4.41) det Σ = (2π)d exp log det f (ω) . 2π −π The proof of this theorem requires some technical tools that we are going to discuss first. The arguments will essentially follow the lines of Wiener and Masani [59].

Regular and singular time series

127

Let φ : [−π, π] → C be a measurable function. When p > 0, we say that φ ∈ Lp if Z π |φ(ω)|p dω < ∞. −π

When f = [fjk ]d×d , we say that f ∈ Lp if each fjk ∈ Lp . Recall that each Lp is a vector space, however it is a Banach space only when p ≥ 1. Lemma 4.1. If f = [fjk ]d×d ∈ Lp , p > 0, then det f ∈ Lp/d . Proof. The function det f is the sum of d! terms of the form g(ω) := ±f1k1 (ω) · · · fdkd (ω). Then by the inequality between the geometric and arithmetic means, d

1/d

|g(ω)|p/d = (|f1k1 (ω)|p · · · |fdkd (ω)|p )



1X |fjkj (ω)|p , d j=1

thus g ∈ Lp/d . It implies that det f ∈ Lp/d as well. Lemma 4.2. Let A and B be d × d self-adjoint, non-negative definite matrices. Then (1) (det(A + B))1/d ≥ (det A)1/d + (det B)1/d , (2) det(A + B) tr(A) ≥ det(A) tr(A + B), (3) det(A + B) ≥ det(A). Proof. (1) is Minkowski’s inequality, see [30, p. 35]. (2) We may suppose that A is a diagonal matrix, since we may diagonalise it by means of a unitary transformation that does not affect the determinants and traces. If A = diag[a1 , . . . , ad ], then a1 + b11 b12 ··· b1d b21 a2 + b22 · · · b2d det(A + B) = .. .. .. .. . . . . bd1 bd2 · · · ad + bdd X X = det B + aj det Bj + aj ak det Bjk + · · · + a1 · · · ad , 1≤j≤d

1≤j 0. Then the integrability of log det f R and the inequality in (c) follows from Lemma 4.4(c), since log det Σ = 2 log det Φ(0) and   Z π Z π 1 1 1 log det f R (ω)dω = log det φ(ω)φ∗ (ω) dω 2π −π 2π −π 2π Z π 1 = −d log(2π) + 2 · log | det φ(ω)|dω. 2π −π

Let P (z) :=

N X

Aj z j ,

z ∈ C,

j=0

be a matrix polynomial with coefficients Aj ∈ Cd×d . We consider z as the left shift (backward shift) operator and define P (X) := P (z)X0 :=

N X

Aj X−j .

(4.47)

j=0

Lemma 4.6. Let {Xt } be a d-dimensional stationary time series with spectral measure matrix dF . Then Z π E(P (X)P (X)∗ ) = P (e−iω ) dF (ω) P (e−iω )∗ ; −π

Proof.

E

 N X 

=

Aj X−j

j=0

k=0

N X

Z

π

= −π

 X∗−k A∗k  =

N X

Aj E(X−j X∗−k )A∗k

j,k=0

π

Aj ·

ei(k−j)ω dF (ω) · A∗k

−π

j,k=0

Z

N X

  !∗ N N X X −ijω −ikω   dF (ω) Aj e Ak e , j=0

which proves the statement.

k=0

132

Multidimensional time series

Let {Xt } be a d-dimensional stationary time series with spectral measure matrix dF . Then there exists a unique Lebesgue decomposition dF = dFa + dFs ,

dFa  dω,

dFs ⊥dω,

where dω denotes Lebesgue measure, dFa (ω) = f (ω)dω, f is the spectral density matrix of {Xt }, dFs is supported on a zero Lebesgue measure subset of [−π, π]. Lemma 4.7. Let {Xt } be a d-dimensional stationary time series with spectral measure matrix dF = dFa + dFs and spectral density matrix f , as defined above. Let P (X) be defined by (4.47). Then   Z π 1 1 ∗ log det f (ω)dω + 2 log | det A0 |. log det E(P (X)P (X) ) ≥ 2π 2π −π Proof. It is not difficult to see that Z π Z P (e−iω ) dF (ω) P (e−iω )∗ − −π

π

P (e−iω ) f (ω) P (e−iω )∗ dω

−π

is a self-adjoint, non-negative definite matrix. Thus by Lemmas 4.2 (3) and 4.6,     Z π 1 1 E(P (X)P (X)∗ ) = det P (e−iω ) dF (ω) P (e−iω )∗ det 2π 2π −π   Z π 1 ≥ det P (e−iω ) f (ω) P (e−iω )∗ dω . (4.48) 2π −π The values of P (e−iω ) f (ω) P (e−iω )∗ are also self-adjoint, non-negative definite matrices for a.e. ω, so by Lemma 4.3 we have   Z π 1 log det P (e−iω ) f (ω) P (e−iω )∗ dω 2π −π Z π  1 ≥ log det P (e−iω ) f (ω) P (e−iω )∗ dω 2π −π Z π Z π 1 1 = 2 log | det P (e−iω )|dω + log det f (ω)dω. (4.49) 2π −π 2π −π Now det P (z) is a polynomial in z, therefore log | det P (z)| is a subharmonic function, see e.g. [50, Theorem 17.3]: Z π 1 log | det P (e−iω )|dω ≥ log | det P (0)| = log | det A0 |. (4.50) 2π −π Inequalities (4.48), (4.49), and (4.50) prove the statement of the lemma.

Regular and singular time series

133

Proof of Theorem 4.3. Assume first that {Xt } is a full rank non-singular process. By the Wold decomposition described in (4.33)–(4.34) and by Lemma 4.5, Xt = Rt +Yt with spectral measures dF = dF R +dF Y , dF R (ω) = f R (ω) dω. Also, log det f R ∈ L1 and Z π dω log det f R (ω) . det Σ ≤ (2π)d exp (4.51) 2π −π We may take the Lebesgue decompositions dF = dFa +dFs , dFa (ω) = f (ω)dω and dF Y = dFaY + dFsY , dFaY (ω) = f Y (ω)dω as well. Then f = f R + f Y . Since f Y is self-adjoint, non-negative definite, by Lemma 4.2 (3) we see that det f R (ω) ≤ det f (ω). Inequalities (4.51) and (4.52) imply that Z π dω d det Σ ≤ (2π) exp log det f (ω) . 2π −π

(4.52)

(4.53)

Since we assumed that {Xt } has full rank, det Σ 6= 0. Thus the integral in (4.53) cannot be −∞. It cannot be +∞ either, since by Lemma 4.3 it is dominated by   Z π 1 f (ω)dω < ∞. log det 2π −π This proves that log det f ∈ L1 . Conversely, assume that log det f ∈ L1 . By definition (4.28), take the innovation at time 0: η0 = X0 − ProjH − X0 . −1

Consider an approximation of η0 by a finite past: η0N

:= X0 −

N X j=1

AN j X−j ,

lim η0N = η0

N →∞

in H0− ⊂ L2 (Ω, F, P).

Then η0N is a matrix polynomial P (X) as defined by (4.47), with A0 = Id . Hence by Lemma 4.7,   Z π 1 1 log det E(η0N η0N ∗ ) ≥ log det f (ω)dω. 2π 2π −π Now if N → ∞, E(η0N η0N ∗ ) → E(η0 η0∗ ) = Σ. Thus it follows that   Z π 1 dω log det Σ ≥ log det f (ω) . 2π 2π −π

(4.54)

By our assumption, the integral is finite, so det Σ 6= 0. This proves that {Xt } is non-singular and has full rank.

134

Multidimensional time series

Thus we have proved the equivalence of the conditions that log det f ∈ L1 and that {Xt } is non-singular and of full rank. Under these conditions both (4.53) and (4.54) hold, which prove the multidimensional Kolmogorov–Szeg˝o formula (4.41). Corollary 4.3. If {Xt } is of full rank non-singular time series, then the singular process {Yt } in its Wold decomposition Xt = Rt + Yt has singular spectral measure. More exactly, using the notations introduced above, dF R = dFa ,

dF Y = dFs .

Proof. We know that in the Wold decomposition dF = dF R + dF Y , dF R is an absolutely continuous measure, see Lemma 4.5 (b). Hence it is enough to show that for the spectral density of Y we have f Y = 0 a.e. on [−π, π]. Also, by Lemma 4.5 (b), f = fR + fY =

1 φφ∗ + f Y , 2π

φ ∈ L2 .

Since f R and f Y are self-adjoint and non-negative definite, Lemma 4.2 (2) gives that det(f ) tr(f R ) ≥ det(f R ) tr(f R + f Y ),   tr(f Y ) det(f ) ≥ det(f R ) 1 + 1 . 2 2π kφkF Take logarithm and integrate over [−π, π]: Z π Z π Z log det f (ω)dω ≥ log det f R (ω)dω +

tr(f R ) =

π

1 kφk2F , 2π

 tr(f Y (ω)) dω log 1 + 1 2 −π −π −π 2π kφ(ω)kF  Z π    tr(f Y (ω)) 1 Σ + log 1 + 1 ≥ 2π log det dω, (4.55) 2 2π −π 2π kφ(ω)kF 

where we were allowed to use Lemma 4.5 (c) since {Xt } has full rank. By (4.41), the integral on the left hand side of (4.55) is also   Z π 1 log det f (ω)dω = 2π log det Σ . 2π −π Since the integrand on the right hand side of (4.55) is non-negative, it follows that   tr(f Y (ω)) = 0 a.e. ⇒ tr(f Y (ω)) = 0 a.e., log 1 + 1 2 kφ(ω)k F 2π because the denominator here must be finite a.e., since φ ∈ L2 . We know that f Y (ω) is self-adjoint and non-negative definite a.e., thus we conclude that f Y (ω) = 0 a.e. This completes the proof.

Regular and singular time series

135

Corollary 4.4. A stationary time series {Xt } is regular and of full rank if and only if (1) it has an absolutely continuous spectral measure matrix dF with density matrix f ; (2) log det f ∈ L1 . Then the Kolmogorov–Szeg˝ o formula (4.41) also holds. Proof. First assume that {Xt } is regular and of full rank. Then in the Wold decomposition Xt = Rt + Yt we have Yt = 0, so F is absolutely continuous, see Lemma 4.5. By Theorem 4.3, log det f ∈ L1 , since the process has full rank. Conversely, assume that assumptions (1) and (2) hold. Then {Xt } has full rank by Theorem 4.3. Then by Corollary 4.3, dF R = dFa and dF Y = dFs . Since dFs = 0 by assumption (1), it follows that Xt = Rt , a regular process.

4.5.2

Generic regular processes

The next theorem is an extension of Corollary 4.4 to the general, not necessarily full rank, case. Let {Xt } be a d-dimensional stationary time series. Assume that its spectral measure matrix dF is absolutely continuous with density matrix f (ω) which has rank r, 1 ≤ r ≤ d, for a.e. ω ∈ [−π, π]. Take the parsimonious spectral decomposition of the self-adjoint, non-negative definite matrix f (ω) as described in Definition B.2: f (ω) =

r X

˜ (ω)Λr (ω)U ˜ ∗ (ω), λj (ω)uj (ω)u∗j (ω) = U

(4.56)

j=1

where Λr (ω) = diag[λ1 (ω), . . . , λr (ω)],

λ1 (ω) ≥ · · · ≥ λr (ω) > 0,

(4.57)

˜ (ω) ∈ Cd×r is a sub-unitary matrix. and U Then, still, we have ˜ ∗ (ω)f (ω)U ˜ (ω). Λr (ω) = U

(4.58)

Take care that here we use the word ‘spectral’ in two different meanings. On one hand, we use the spectral density of a time series in terms of a Fourier spectrum, on the other hand we take the spectral decomposition of a matrix in terms of eigenvalues and eigenvectors. Theorem 4.4. A d-dimensional stationary time series {Xt } is regular and of rank r, 1 ≤ r ≤ d, if and only if each of the following conditions holds:

136

Multidimensional time series

(1) It has an absolutely continuous spectral measure matrix dF with density matrix f (ω) which has rank r for a.e. ω ∈ [−π, π]. (2) For Λr (ω) defined by (4.57) one has log det Λr ∈ L1 = L1 ([−π, π], B, dω), equivalently, Z π log λr (ω) dω > −∞. (4.59) −π

˜ (ω) appearing in the spectral decomposi(3) The sub-unitary matrix function U tion of f (ω) in (4.56) belongs to the Hardy space H ∞ ⊂ H 2 , so ˜ (ω) = U

∞ X

ψ(j)e−ijω ,

ψ(j) ∈ Cd×r ,

j=0

∞ X

kψ(j)k2F < ∞.

j=0

Proof. Assume first that {Xt } is regular and of rank r, 1 ≤ r ≤ d. Here we are going to use the Wold decomposition given in Theorem 4.2. By Corollary 4.2 it follows that {Xt } has an absolutely continuous spectral measure with density matrix f (ω), which has constant rank r for a.e. ω. This shows that condition (1) of the theorem holds. In order to check that log det Λr ∈ L1 , it is enough to show that (4.59) holds. First, Z π

log det Λr (ω)dω < ∞ −π

is always true. Namely, using the inequality log x < x if x > 0, we see that log det Λr (ω) =

r X

log λj (ω) < tr(Λr (ω)) = tr(f (ω)) ∈ L1 ,

j=1

since the spectral density f is an integrable function. Second, by (4.57), Z

π

r X

Z log λj (ω)dω > −∞

−π j=1

π



log λr (ω) dω > −∞.

(4.60)

−π

By Corollary 4.2 it follows that f (ω) =

1 Φ(e−iω )Φ∗ (e−iω ), 2π

(4.61)

where the spectral factor Φ(z) = [Φjk (z)]d×r is an analytic function in the open unit disc D and Φ ∈ H 2 . Equality (4.61) implies that every principal minor M (ω) = det[fjp jq ]rp,q=1 of f is a constant times the product of a minor MΦ (e−iω ) of Φ(e−iω ) and its conjugate: M (ω) = (2π)−r det[Φjp k (e−iω )]rp,k=1 det[Φjp k (e−iω )]rp,k=1 2 = (2π)−r det[Φjp k (e−iω )]rp,k=1 =: (2π)−r |MΦ (e−iω )|2 .

(4.62)

Regular and singular time series

137

The row indices of the minor MΦ (z) in the matrix Φ(z) are the same indices jp , p = 1, . . . , r, that define the principal minor M (ω) in the matrix f (ω). Since the function MΦ (z) = det[Φjp k (z)]rp,k=1 is analytic in D, it is either identically zero or is different from zero a.e. The rank of f is r a.e., therefore the sum of all its principal minors of order r (which are non-negative since f is non-negative definite) must be different from zero a.e. The last two sentences imply that there exists a principal minor M (ω) of order r which is different from zero a.e. We are using this principal minor M (ω) from now on. The entries of Φ(z) are in H 2 , so by Lemma 4.4 (b) it follows that the determinant MΦ (z) ∈ H 2/r . Then Theorem A.13 in the Appendix implies that log |MΦ (e−iω )| ∈ L1 , which in turn with (4.62) imply that n 2 o log M (ω) = log (2π)−r MΦ (e−iω ) = −r log 2π + 2 log MΦ (e−iω ) ∈ L1 . (4.63) ˜ (ω) by Further, let us define the corresponding minor of the matrix U r ˜ ˜ MU˜ (ω) := det[Ujp k ]p,k=1 . Since U (ω) is a sub-unitary matrix, its each entry has absolute value less than or equal to 1. Consequently, |MU˜ (ω)| ≤ r!. By (4.56), M (ω) = MU˜ (ω) det Λr (ω)MU˜ (ω) = det Λr (ω) |MU˜ (ω)|2 . It follows that log det Λr (ω) = log M (ω) − 2 log |MU˜ (ω)| ≥ log M (ω) − 2 log(r!), which with (4.63) and (4.60) shows (4.59), and this proves condition (2) of the theorem. The matrix function Λr (ω) is a self-adjoint, positive definite function, tr(Λr (ω)) = tr(f (ω)), where f (ω) is the density function of a finite spectral measure. This shows that the integral of tr(Λr (ω)) over [−π, π] is finite. Hence Λr (ω) can be considered as the spectral density function of an r-dimensional stationary time series {Vt }t∈Z of full rank r. In fact, by linear filtering we define Z π ˜ ∗ (ω)dZω , t ∈ Z. Vt := eitω U (4.64) −π

Then by (4.4) its auto-covariance function and spectral density are really given by Z π Z π ˜ ∗ (ω)f (ω)U ˜ (ω)dω = CV (h) = eihω U eihω Λr (ω)dω, h ∈ Z. (4.65) −π

−π

Then Corollary 4.4 and (4.63) show that {Vt } is a regular time series. By the characterization of regular time series in Corollary 4.2, it follows that Λr can be factored: Λr (ω) =

1 Dr (ω)Dr (ω), 2π

(4.66)

138

Multidimensional time series

where Dr (ω) =



p p 2π diag[ λ1 (ω), . . . , λr (ω)]. Then

Dr (ω) =

∞ X

δ(j)e−ijω ,

j=0

∞ X

kδ(j)k2F < ∞,

j=0

and ∆(z) :=

∞ X

δ(j)z j ,

∆(z) = [∆k` (z)]r×r ,

z ∈ D,

j=0

∆(e−iω ) = Dr (ω),

(4.67)

∆(z) is an H 2 analytic function. Since the non-diagonal entries of the boundary value Dr (ω) are zero, it follows that the non-diagonal entries of ∆(z) are also zero, see the integral formulas in Theorem A.10. The factorizations (4.12), (4.40), and (4.56) show that ˜ (ω)∆(e−iω ). Φ(e−iω ) = U Since Dr (ω) = ∆(e−iω ) is a diagonal matrix and each diagonal entry is positive, we may write that   −1 −iω ∆11 (e ) · · · 0  .. .. .. ˜ (ω) = Φ(e−iω )  U .  . . . −1 −iω 0 · · · ∆rr (e ) The components of the process {Vt } are regular 1D time series, so by Theorem 2.3, the H 2 -functions ∆kk (z) (k = 1, . . . , r) have no zeros in the open unit disc D. Consequently, the entries on the right hand side are boundary values of the ˜ (ω) ratio of two H 2 -functions, and the denominator has no zeros in D. Hence U is the boundary value of an analytic function W (z) = [wk` (z)]d×r defined in D: ˜ (ω) = W (e−iω ). U (4.68) Moreover, since the boundary value U (ω) is unitary, its entries are bounded functions. It implies that W (z) =

∞ X

ψ(j)z j ∈ H ∞ ⊂ H 2 ,

j=0

˜ (ω) = U

∞ X j=0

ψ(j)e

−ijω

,

∞ X

kψ(j)k2F < ∞.

(4.69)

j=0

This proves condition (3) of the theorem. Conversely, assume that conditions (1), (2), and (3) of the theorem hold. Conditions (1) and (2) give that Λr (ω) is the spectral density of an rdimensional regular stationary time series {Vt }t∈Z of full rank r, just like

Regular and singular time series

139

in the first part of the proof. Then Λr (ω) has the factorization (4.66), (4.67). ˜ (ω) ∈ H ∞ ⊂ H 2 , with properties (4.68) and Condition (3) implies that U (4.69). The spectral decomposition (4.56) can be written as ˜ (ω)Λr (ω)U ˜ ∗ (ω) = 1 U ˜ (ω)Dr (ω)Dr (ω)U ˜ ∗ (ω) = 1 φ(ω)φ∗ (ω). f (ω) = U 2π 2π So ˜ (ω)Dr (ω) = φ(ω) = U

∞ X

ψ(j)e−ijω

j=0

b(`) =

` X

∞ X

δ(j)e−ikω =

k=0

∞ X

b(`)e−i`ω ,

`=0

ψ(j)δ(` − j).

j=0

These imply that Φ(z) :=

∞ X

b(`)z ` = W (z)∆(z),

z ∈ D,

`=0

where W ∈ H ∞ and ∆ ∈ H 2 , thus Φ(z) ∈ H 2 . By Corollary 4.2 it means that the time series {Xt } is regular. This completes the proof of the theorem. Remark 4.4. Assume that {Xt } is a d-dimensional regular time series of rank r. Assume as well that its spectral density matrix f has the spectral decomposition (4.56). Then the r-dimensional time series {Vt } can be given by (4.64), whose spectral density by (4.65) is fV (ω) = Λr (ω). ˜ ∗ (ω) = P∞ ψ ∗ (j)eijω , By (4.69) and (4.64) we obtain that U j=0 P∞ ∗ 2 j=0 kψ (j)kF < ∞, and Z

π

Vt =

e −π

=

∞ X

itω

∞ X



ijω

ψ (j)e

j=0

ψ ∗ (j)Xt+j .

dZω =

∞ X j=0



Z

π

ψ (j)

ei(t+j)ω dZω

−π

(4.70)

j=0

It is interesting that {Vt } is not causally subordinated to {Xt }, see (4.8) and (4.9). Remark 4.5. Comparing Corollary 4.4 and Theorem 4.4 shows that in the full rank case, condition (3) in Theorem 4.4 follows from conditions (1) and (2). Corollary 4.5. Assume that {Xt } is a d-dimensional regular stationary time series of rank r, 1 ≤ r ≤ d. Then a Kolmogorov–Szeg˝ o formula holds: Z π X Z π r dω dω = (2π)r exp log λj (ω) , det Σr = (2π)r exp log det Λr (ω) 2π 2π −π j=1 −π

140

Multidimensional time series

where Λr is defined by (4.57) and Σr is the covariance matrix of the innova(r) tion process of an r-dimensional subprocess {Xt } of rank r, as defined below in the proof. Proof. Let fr (ω) be the submatrix of f (ω) whose determinant M (ω) was defined in the first part of the proof of Theorem 4.4: fr (ω) = [fjp jq (ω)]rp,q=1 ,

det fr (ω) = M (ω) 6= 0

for a.e.

ω ∈ [−π, π].

(r)

The indices jp , p = 1, . . . , r, define a subprocess Xt = [Xtj1 , . . . , Xtjr ]T of the (r) original time series {Xt }. Then {Xt } has an absolutely continuous spectral measure with density fr (ω), and by (4.63), log det fr = log M ∈ L1 . (r)

Hence by Corollary 4.4, {Xt } is a regular process of full rank r and Z π dω det Σr = (2π)r exp log det fr (ω) , 2π −π (r)

where Σr is the covariance matrix of the innovation process of {Xt } as defined by (4.28) and (4.29):   (r) (r) (r) (r) (r)∗ ηt := Xt − ProjH − Xt (t ∈ Z), Σr := E η0 η0 . t−1

(r)

Here we used that the past until (t − 1) of the subprocess {Xt } is the same − as the past Ht−1 of {Xt }. Really, by (4.33), for the regular process {Xt } of rank r we have a causal MA(∞) form: Xt =

∞ X

a(j)ηt−j ,

t ∈ Z,

j=0 (r)

and for each t, ηt is linearly dependent on ηt . Thus (r)

(r)

(r)

− span{Xt−1 , Xt−2 , Xt−3 , . . . } = span{Xt−1 , Xt−2 , Xt−3 , . . . } = Ht−1 .

4.5.3

Classification of non-regular multidimensional time series

The classification of multidimensional time series is — not surprisingly — more complex than the one-dimensional ones, see the 1D classes in Section 2.10. We call a time series non-regular if either it is singular or its Wold decomposition contains two orthogonal, non-vanishing processes: a regular and a singular one. The classification below follows from Theorem 4.4.

Low rank approximation

141

In dimension d > 1 a non-singular process beyond its regular part may have a singular part with non-vanishing spectral density. For example, if d = 3 and the components {(Xt1 , Xt2 , Xt3 )} are orthogonal to each other, it is possible that {Xt1 } is regular of rank 1, {Xt2 } is Type (1) singular, and {Xt3 } is Type (2) singular. Below we are considering a d-dimensional stationary time series {Xt } with spectral measure dF . • Type (0) non-regular processes. In this case the spectral measure dF of the time series {Xt } is singular w.r.t. the Lebesgue measure in [−π, π]. Clearly, type (0) non-regular processes are simply singular ones. Like in the 1D case, we may further divide this class into processes with a discrete spectrum or processes with a continuous singular spectrum or processes with both. • Type (1) non-regular processes. The time series has an absolutely continuous spectral measure with density f , but rank(f ) is not constant. It means that there exist measurable subsets A, B ⊂ [−π, π] such that dω(A) > 0 and dω(B) > 0, rank(f (ω)) = r1 if ω ∈ A, rank(f (ω)) = r2 if ω ∈ B, and r1 6= r2 . Here dω denotes Lebesgue measure in [−π, π]. • Type (2) non-regular processes. The time series has an absolutely continuous spectral measure with density f which has constant rank r a.e., 1 ≤ r ≤ d, but Z π Z π X r log det Λr (ω)dω = log λj (ω) dω = −∞, −π

−π j=1

where Λr is defined by (4.57). • Type (3) non-regular processes. The time series has an absolutely continuous spectral measure with density f which has constant rank r a.e., 1 ≤ r < d, Z

π

Z

π

log det Λr (ω)dω = −π

r X

log λj (ω) dω > −∞,

−π j=1

˜ (ω) appearing in the spectral decomposibut the unitary matrix function U tion of f (ω) in (4.58) does not belong to the Hardy space H 2 . By Corollary 4.3, if {Xt } has full rank r = d and it is non-singular, then it may have only a Type (0) singular part.

4.6

Low rank approximation

The aim of this section is to approximate a time series of constant rank r with one of smaller rank k. This problem was treated by Brillinger in [8] and also

142

Multidimensional time series

in [9, Chapter 9] where it was called Principal Component Analysis (PCA) in the Frequency Domain. We show the important fact that when the process is regular, the low rank approximation can also be chosen regular.

4.6.1

Approximation of time series of constant rank

Assume that {Xt } is a d-dimensional stationary time series of constant rank r, 1 ≤ r ≤ d. By Theorem 4.1, it is equivalent to the assumption that {Xt } can be written as a sliding summation of form (4.10). The spectral density f of the process has rank r a.e., and so we may write its eigenvalues as λ1 (ω) ≥ · · · ≥ λr (ω) > 0,

λr+1 (ω) = · · · = λd (ω) = 0.

(4.71)

Also, the spectral decomposition of f is f (ω) =

r X

˜ r (ω)Λ ˜ r (ω)U ˜ ∗ (ω), λj (ω)uj (ω)u∗j (ω) = U r

a.e. ω ∈ [−π, π],

j=1

(4.72) ˜ r (ω) := diag[λ1 (ω), . . . , λr (ω)], uj (ω) ∈ Cd (j = 1, . . . , r) are the where Λ ˜ r (ω) ∈ Cd×r is the matrix of corresponding orthonormal eigenvectors, and U these column vectors. Now the problem we are treating can be described as follows. Given an (k) integer k, 1 ≤ k ≤ r, find a process {Xt } of constant rank k which is a linear transform of {Xt } and which minimizes the distance n o (k) (k) (k) kXt − Xt k2 = E (Xt − Xt )∗ (Xt − Xt ) n o (k) (k) = tr Cov (Xt − Xt ), (Xt − Xt ) , ∀t ∈ Z. (4.73) In Brillinger’s book [9] this is called Principal Component Analysis (PCA) in the Frequency Domain. (k) Consider the spectral representations of {Xt } and {Xt }, see (4.3): Z π Z π (k) Xt = eitω dZω , Xt = eitω T (ω)dZω , t ∈ Z. −π

−π

Then by (4.4) and (4.72) we can rewrite (4.73) as Z π (k) 2 kXt − Xt k = tr (Id − T (ω))f (ω)(Id − T ∗ (ω)) dω −π Z π ˜ r (ω)Λ ˜ r (ω)U ˜ ∗ (ω)(Id − T ∗ (ω)) dω, = tr (Id − T (ω))U r

(4.74)

−π

which clearly does not depend on t ∈ Z. To find the minimizing linear transformation T (ω), we have to study the non-negative definite quadratic form ˜ r (ω)Λ ˜ r (ω)U ˜ ∗ (ω)v, v∗ f (ω)v = v∗ U r

v ∈ Cd ,

|v| = 1.

Low rank approximation

143

By (4.71), there is a monotonicity: taking the orthogonal projections uj u∗j (j = 1, . . . , r) in the space Cd one-by-one, the sequence vj∗ f (ω)vj ,

vj ∈ uj (ω)u∗j (ω) Cd (j = 1, . . . , r), Pd ∗ is non-increasing. Since Id = j=1 uj (ω)uj (ω) and T (ω) must have rank k a.e., (4.74) implies that the minimizing linear transformation must be the orthogonal projection T (ω) =

k X

˜ k (ω)U ˜ ∗ (ω); uj (ω)u∗j (ω) = U k

(4.75)

j=1

see Corollary B.1 as well. Thus we have proved that Z π (k) ˜ k (ω)U ˜ ∗ (ω)dZω , t ∈ Z. Xt = eitω U k

(4.76)

−π (k)

Then by (4.5), the spectral density of {Xt } is ˜ k (ω)U ˜ ∗ (ω)U ˜ r (ω)Λ ˜ r (ω)U ˜ ∗ (ω)U ˜ k (ω)U ˜ ∗ (ω) fk (ω) = U k r k     I k ˜ ∗ (ω) ˜ ˜ U = Uk (ω) Ik 0k×(r−k) Λr (ω) k 0(r−k)×k ˜ k (ω)Λ ˜ k (ω)U ˜ ∗ (ω), ω ∈ [−π, π]. =U k

(4.77)

(k)

Further, the covariance function of {Xt } is Z π Ck (h) := eihω fk (ω)dω,

h ∈ Z.

(4.78)

−π

The next theorem summarizes the results above. Theorem 4.5. Assume that {Xt } is a d-dimensional stationary time series of constant rank r, 1 ≤ r ≤ d, with spectral density f . Let (4.71) and (4.72) be the spectral decomposition of f . (a) Then (k)

Xt

Z

π

=

˜ k (ω)U ˜ ∗ (ω)dZω , eitω U k

t ∈ Z,

−π

is the approximating process of rank k, 1 ≤ k ≤ r, which minimizes the mean square error of the approximation. (b) For the mean square error we have Z π X r (k) kXt − Xt k2 = λj (ω) dω,

t ∈ Z,

(4.79)

t ∈ Z.

(4.80)

−π j=k+1

and

R π Pr (k) kXt − Xt k2 j=k+1 λj (ω) dω Rπ P = −π , r kXt k2 j=1 λj (ω) dω −π

144

Multidimensional time series

(c) If condition λk (ω) ≥ ∆ >  ≥ λk+1 (ω)

∀ω ∈ [−π, π],

(4.81)

holds then we also have (k)

kXt − Xt k ≤ (2π(r − k))1/2 , and (k)

kXt − Xt k ≤ kXt k



(r − k) r∆

t∈Z

 12 ,

t ∈ Z.

Proof. Statement (a) was shown above. By (4.74) and (4.75), (k)

kXt − Xt k2     Z π  X r r   X  ∗ ˜ r (ω)Λ ˜ r (ω)U ˜ ∗ (ω) = tr uj (ω)u∗j (ω) U u (ω)u (ω) dω j r j    −π j=k+1 j=k+1   0k×d Z π  u∗k+1 (ω)      ˜ 0d×k uk+1 (ω) · · · ur (ω) Λr (ω)  = tr  dω ..   −π . u∗r (ω)

Z

π

˜ r (ω)diag[0, . . . , 0, λk+1 , . . . , λr ]U ˜ ∗ (ω)dω U r

= tr −π

Z =

π

r X

λj (ω) dω,

−π j=k+1

where we finally used that the trace equals the sum of the eigenvalues of a matrix. This proves (4.79). Since (4.79) holds for k = 0 as well, we get (4.80). Finally, condition (4.81) and (b) imply (c). Corollary 4.6. For the difference of the covariance functions of {Xt } and (k) {Xt } we have the following estimate: Z π kC(h) − Ck (h)k ≤ λk+1 (ω)dω, h ∈ Z, −π

where the matrix norm is the spectral norm, see Appendix B. If condition (4.81) holds then we have the bound kC(h) − Ck (h)k ≤ 2π,

h ∈ Z.

Low rank approximation

145

Proof. By (4.72), (4.77), and (4.78) it follows that

Z π

ihω

e (f (ω) − fk (ω))dω kC(h) − Ck (h)k =

−π

Z π n o

ihω ˜ ∗ ˜ ˜

e Ur (ω) Λr (ω) − diag[λ1 (ω), . . . , λk (ω), 0, . . . , 0] Ur (ω)dω =

−π Z π ˜ r (ω)k · kdiag[0, . . . , 0, λk+1 (ω), . . . , λr (ω)]k · kU ˜ ∗ (ω)kdω kU ≤ r −π Z π = kdiag[0, . . . , 0, λk+1 (ω), . . . , λr (ω)]k dω −π Z π = λk+1 (ω)dω. −π

Equation (4.76) can be factored. As in Theorem 4.4, one can take the ˜ k (ω) ∈ L2 : Fourier series of the sub-unitary matrix function U ˜ k (ω) = U

∞ X

ψ(j)e

−ijω

1 ψ(j) = 2π

,

j=−∞

where

P∞

j=−∞

Z

π

˜ k (ω)dω ∈ Cd×k , eijω U

−π

kψ(j)k2F < ∞. Consequently, ˜ ∗ (ω) = U k

∞ X

ψ ∗ (j)eijω ,

ω ∈ [−π, π].

j=−∞

If the time series {Vt } is defined by the linear filter (4.64), then similarly to (4.70) it follows that {Vt } can be obtained from the original time series {Xt } by a sliding summation: Vt =

∞ X

ψ ∗ (j)Xt+j ,

t ∈ Z,

j=−∞

and similarly to (4.65), its spectral density is a diagonal matrix: fV (ω) = Λk (ω) = diag[λ1 (ω), . . . , λk (ω)]. It means that the covariance matrix function of {Vt } is also diagonal: Z π CV (h) = diag[c11 (h), . . . , ckk (h)], cjj (h) = eihω λj (ω)dω, h ∈ Z, −π

that is, the components of the process {Vt } are orthogonal to each other. Thus the process {Vt } can be called k-dimensional Dynamic Principal Components (DPC) of the d-dimensional process {Xt }.

146

Multidimensional time series

Using a second linear filtration, which is the adjoint ψ of the previous (k) filtration ψ ∗ , one can obtain the k-rank approximation {Xt } from {Vt }: Z π Z π (k) itω ˜ ∗ ˜ ˜ k (ω)dZV Xt = e Uk (ω)Uk (ω)dZω = eitω U ω −π

−π

∞ X

=

ψ(j)Vt−j ,

t ∈ Z.

j=−∞

Notice the dimension reduction in this approximation. Dimension d of the original process {Xt } can be reduced to dimension k < d with the crosssectionally orthogonal DPC process {Vt }, obtained by linear filtration, from (k) which the low-rank approximation {Xt } can be reconstructed also by linear filtration. Of course, this is useful only if the error of the approximation given by Theorem 4.5 is small enough. ˜kU ˜ ∗ ∈ L2 as well, one can take the L2 -convergent Fourier series Since U k ∞ X

˜ k (ω)U ˜ ∗ (ω) = U k

∞ X

ψ(j)e−ijω ψ ∗ (`)ei`ω =

w(m)e−imω ,

m=−∞

j,`=−∞

where ω ∈ [−π, π] and w(m) =

∞ X

ψ(j) ψ ∗ (j − m) ∈ Cd×d ,

∞ X

kw(m)k2F < ∞.

m=−∞

j=−∞

(k)

By (4.7) it implies that the filtered process {Xt } can be obtained directly from {Xt } by a two-sided sliding summation: (k)

Xt

=

∞ X

w(m)Xt−m ,

t ∈ Z.

m=−∞

4.6.2

Approximation of regular time series

Proposition 4.1. Assume that {Xt } is a d-dimensional regular time series (k) of rank r. Then the rank k approximation {Xt }, 1 ≤ k ≤ r, defined in (4.76) is also a regular time series. (k)

Proof. We have to check that the conditions in Theorem 4.4 hold for {Xt }. (k) By (4.77), {Xt } has an absolutely continuous spectral measure with density of constant rank k, so condition (1) holds. Rπ If −π log λk (ω)dω = −∞, then that would contradict the regularity of the original process {Xt }, and this proves condition (2). ˜ r (ω) belongs to the Hardy space H ∞ , It follows from Theorem 4.4 that U ˜ k (ω) as well, since its columns are a subset of the so the same holds for U former’s. This proves condition (3).

Rational spectral densities

147

Theorem 4.5 and Corollary 4.6 are valid for regular processes without change. However, the factorization of the approximation discussed above is different in the regular case, because several of the summations become onesided. Thus we have ˜ ∗ (ω) = U k

∞ X

ψ ∗ (j)eijω ,

ω ∈ [−π, π].

j=0

Consequently, the k-dimensional, cross-sectionally orthogonal process {Vt } becomes ∞ X Vt = ψ ∗ (j)Xt+j , t ∈ Z. j=0 (k)

Further, the reconstruction of the k-rank approximation {Xt } from {Vt } is (k)

Xt

=

∞ X

ψ(j)Vt−j ,

t ∈ Z.

j=0 (k)

The direct evaluation of {Xt } from {Xt } takes now the following form: ˜ k (ω)U ˜ ∗ (ω) = U k

∞ X

ψ(j)e−ijω ψ ∗ (`)ei`ω =

w(m)e−imω ,

ω ∈ [−π, π],

m=−∞

j,`=0

where w(m) =

∞ X

P∞

j=max(0,m)

ψ(j) ψ ∗ (j−m). It implies that the filtered process

(k) {Xt }

is not causally subordinated to the original regular process {Xt } in general, since it can be obtained from {Xt } by a two-sided sliding summation: (k)

Xt

=

∞ X

w(m)Xt−m ,

t ∈ Z,

m=−∞

see (4.7). On the other hand, it is clear that if kψ(j)kF goes to 0 fast enough as j → ∞, one does not have to use too many ‘future’ terms of {Xt } to get (k) a good enough approximation of {Xt }. In practice one can also replace the (k) future values of {Xt } by 0 to get a causal approximation of {Xt }.

4.7

Rational spectral densities

An important subclass of the class of regular stationary time series, which in turn is a subclass of time series with constant rank, is such that each entry f k` (ω) in the spectral density matrix f is a rational complex function in z = e−iω . As we are going to see in Theorem 4.8, this subclass is the same as

148

Multidimensional time series

that of the stable VARMA(p, q) processes. Also, this is the subclass of stable stochastic linear systems with finite dimensional state space representation. Moreover, this is the subclass of stable stochastic linear systems with rational transfer function. In sum, this is the subclass of weakly stationary time series that can be described by finitely many complex valued parameters. Remark 4.6. Every minor (that is, the determinant of a sub-matrix) of a rational matrix is a rational function which is either identically zero or has zeros and poles at only finitely many points. Hence a weakly stationary time series with a spectral density f which is a rational matrix in z = e−iω must be of constant rank r. Remark 4.7. Suppose that the spectral density f of a weakly stationary time series is a rational matrix in z = e−iω . Since the entries of the spectral measure dF are finite measures on [−π, π] by formula (1.23), the entries of f can have no poles on the unit circle T . See Remark 2.3 as well.

4.7.1

Smith–McMillan form

The Smith–McMillan form is a useful tool by which one can diagonalize a non-negative definite rational matrix so that both the obtained diagonal matrix and the transformation matrix used for the diagonalization are rational matrices. The usual technique of linear algebra which uses eigenvalues and eigenvectors does not have this important property, since the eigenvalues and the entries of eigenvectors of a rational matrix are not rational functions in general. Lemma 4.8. Let A(z) = [ajk (z)]d×d be a rational matrix which is self-adjoint and non-negative definite for z ∈ T , having no poles on T , and whose rank is r, 1 ≤ r ≤ d, for z ∈ T \ Z, where Z is a finite set. Then we can write A(z) in Smith–McMillan form: A(z) = E(z)Λ(z)E ∗ (z)

(z ∈ T \ Z),

(4.82)

where Λ(z) and E(z) are rational matrices, Λ(z) = diag[λ1 (z), . . . , λr (z)],

λj (z) > 0

(j = 1, . . . , r; z ∈ T \ Z).

Here diag[λ1 , . . . , λr ] denotes an r × r diagonal matrix with entries λj (j = 1, . . . , r) in its main diagonal. Also, E(z) = [ejk (z)]d×r is a lower unit trapezoidal matrix: 1. ejk (z) = 0 if k > j, 2. ejk (z) = 1 if k = j.

(z ∈ T \ Z),

Rational spectral densities

149

Proof. Since A(z) is a self-adjoint rational matrix, has no poles on T , and has rank r for z ∈ T \ Z, it has a principal minor M (z) of order r which is a rational function, has no poles for z ∈ T and no zeros for z ∈ T \ Z, while any minor of order larger than r is identically zero. We may assume that the sub-matrix corresponding to M (z) stands in the upper left corner of A(z), since by rearranging rows and columns (that is, using elementary row and column operations) we may achieve this. Since A(z) is non-negative definite, any upper left corner minor Mj (z) of size j (j = 1, . . . , r) obtained from M (z) = Mr (z) must be positive if z ∈ T \ Z. In particular, M1 (z) = a11 (z) > 0 if z ∈ T \ Z. Using elementary row and column operations we may transform the rational matrix A(1) (z) = A(z) into another rational matrix of the form   a11 (z) 01×(n−1) (2) (2) A (z) = [ajk (z)]d×d = , 0(n−1)×1 B (2) (z) (2)

B (2) (z) = [bjk (z)]j,k=2,...,n ,

(2)

bjk (z) = ajk (z) −

aj1 (z)a1k (z) . a11 (z)

(2)

If r ≥ 2, then b22 (z) = M2 (z)/M1 (z) > 0 if z ∈ T \ Z, and we can apply the same transformation for B (2) (z) as for A(1) (z), and so on. Finally, after the rth step we obtain A(r) (z) = diag[λ1 (z), . . . , λr (z), 0, . . . , 0]d×d , λj (z) = Mj (z)/Mj−1 (z) > 0

λ1 (z) = M1 (z) > 0,

(j = 2, . . . , r; z ∈ T \ Z).

(4.83)

Define Λ(z) := diag[λ1 (z), . . . , λr (z)]

(z ∈ T \ Z).

(4.84)

The way one obtains A(`+1) (z) from A(`) (z) is an application of a sequence of elementary row and column operations. Such an elementary row operation is adding α(z) times the ith row to the jth row, where α(z) is a rational complex function. This operation is equivalent to left multiplication by the elementary matrix Ei,j (α(z)), see (D.1). Similarly, the elementary column operation of adding α(z) times the ith column to the jth column is equivalent to right multiplication by Ei,j (α(z))∗ . It is important that Ei,j (α(z))−1 = Ei,j (−α(z)). Using this notation, we can write that A(`+1) (z) = E (`) (z)A(`) (z)E (`) (z)∗ , where E (`) (z) :=

d−`−1 Y

  (`) (`) E`,d−j −ad−j,` (z)/a`` (z) .

j=0

(Product of a sequence of matrices is understood so that the one with the first index, in the present case j = 0, is the first factor from the left, and so on.) Observe that each matrix E (`) (z) is a lower unit triangular matrix.

150

Multidimensional time series Thus ˜ ˜ ∗ (z), A(r) (z) = E(z)A(z) E

˜ E(z) :=

r−1 Y

E r−` (z), if z ∈ T \ Z,

`=1

˜ where E(z) is also a lower unit triangular matrix. Then ˜ −1 A(r) (z)E ˜ ∗ (z)−1 if z ∈ T \ Z. A(z) = E(z) Since the last d − r rows and columns of A(r) (z) are zero, we may omit the ˜ −1 . The resulting matrix of size d×r will be denoted last d−r columns of E(z) by E(z), with which we still have A(z) = E(z)Λ(z)E ∗ (z) if z ∈ T \ Z, where E(z) is a lower unit trapezoidal matrix and, correspondingly, E(z)∗ is an upper unit trapezoidal matrix. This with equations (4.83) and (4.84) proves the lemma.

4.7.2

Spectral factors of a rational spectral density matrix

Theorem 4.6. [47, Section I.10] Let A(z) = [ajk (z)]d×d be a rational matrix which is self-adjoint and nonnegative definite for z ∈ T , having no poles on T , and whose rank is r for all z ∈ T \ Z, where Z is a finite set. Then we can write a special Gram decomposition 1 A(z) = Φ(z)Φ∗ (z) (z ∈ T \ Z), 2π where Φ(z) = [Φjk (z)]d×r is a rational matrix, analytic in the open unit disc D = {z : |z| < 1}, and has rank r for any z ∈ T \ Z. Proof. By Lemma 4.8, we may write A(z) = E(z)Λ(z)E(z)∗ , z ∈ T \ Z, with the properties of Λ(z) = diag[λ1 (z), . . . , λr (z)] and E(z) = [ejk (z)]d×r described in the statement of the lemma. Denote ejk (z) =

qjk (z) pjk (z)

(z ∈ T \ Z),

where qjk and pjk are coprime (relative prime) polynomials for j = 1, . . . , d (k) and k = 1, . . . , r. Fixing a column k = 1, . . . , r, let ζ` (` = 1, . . . , `0 (k)) denote all zeros in D of the polynomials pjk (z), each with maximal multiplicity for the indices j = 1, . . . , d. Set `0 (k)

ck (z) :=

Y `=1

(k)

(z − ζ` ),

Dk (z) :=

λk (z) |ck (z)|2

(z ∈ T \ Z).

Multidimensional ARMA (VARMA) processes

151

By Lemma 4.8, Dk (z) is a non-negative rational function with no poles on T , so Lemma 2.2 shows that we can write Qk (z) 2 , Dk (z) = Pk (z) where the coprime polynomials Qk and Pk do not have zeros in the open unit disc D, and Pk does not have zeros on the unit circle T either. Define the rational functions Φjk (z) :=



2πejk (z)ck (z)

Qk (z) Pk (z)

(j = 1, . . . , d; k = 1, . . . , r)

which are analytic in D and set Φ(z) := [Φjk (z)]d×r . Then " r # X ¯ ` (z) Q` (z) Q 1 ∗ Φ(z)Φ(z) = ej` (z)c` (z) e¯k` (z) c¯` (z) ¯ 2π P` (z) P` (z) `=1 d×d " r # X = ej` (z)λ` (z)¯ ek` (z) = A(z) (z ∈ T \ Z) `=1

d×d

by (4.82). Corollary 4.7. Let {Xt }t∈Z be a d-dimensional weakly stationary time series with spectral density f (ω) which is a rational function in z = e−iω . By Remark 4.6, f (ω) has constant rank r for all ω ∈ [−π, π] \ Z, where Z is a finite set. By Remark 4.7, f cannot have poles on [−π, π]. Thus Theorem 4.6 applies: f (ω) =

1 Φ(e−iω )Φ∗ (e−iω ), 2π

ω ∈ [−π, π] \ Z,

where Φ(z) is a d × r rational matrix, analytic in the open unit disc D, and has rank r for any z = e−iω , ω ∈ [−π, π] \ Z.

4.8 4.8.1

Multidimensional ARMA (VARMA) processes Equivalence of different approaches

Definition 4.1. A multidimensional ARMA(p, q) process (p ≥ 0, q ≥ 0) or VARMA(p, q) process (vector ARMA process) is a causal stationary time series solution {Yt }t∈Z of the stochastic difference equation p X j=0

αj Yt−j =

q X `=0

β` Ut−` ,

(4.85)

152

Multidimensional time series

where each Yt is a Cn -valued random vector, {Ut }t∈Z is the driving Cm -valued WN(Σ) sequence, E(Ut U∗s ) = δts Σ,

EUt = 0,

∀s, t ∈ Z,

Σ ∈ Cm×m is a given self-adjoint non-negative definite matrix. The coefficients αj ∈ Cn×n and β` ∈ Cn×m are given matrices. We always assume that α0 = In and β0 6= 0n×m . We also assume that EYt = 0 for each t and the {Yt } process has no remote past: Yt ∈ span{Ut , Ut−1 , . . . }, ∀t ∈ Z, that is, Yt depends only on the present and past values of the driving process. It implies that each Ut is orthogonal to the past (Yt−1 , Yt−2 , . . . ): E(Ut Ys∗ ) = 0,

∀ s < t.

As in the 1D case, we introduce polynomial matrices, the VAR polynomial and VMA polynomial of the VARMA process: α(z) :=

p X

αj z j ,

j=0

β(z) :=

q X

βj z j ,

z ∈ C.

j=0

From now on, the complex variable z corresponds to the left (backward) shift operator L; with this convention (4.85) can be written as α(L)Yt = β(L)Ut

or α(z)Yt = β(z)Ut ,

t ∈ Z.

(4.86)

The special case when p = 0 is called VMA(q) process and when q = 0 is a VAR(p) process. We can realize any multidimensional ARMA process as a stochastic linear system. The correspondence is far from unique, but a simple, though highly redundant realization is as follows. Proposition 4.2. First introduce a random C(np+mq) -valued state vector that contains all necessary history of {Yt } and {Ut }: Xt := (Yt−p+1 , Yt−p+2 , . . . , Yt ; Ut−q , Ut−q+1 , . . . , Ut−1 ). Then the system (3.35) with the following matrices A ∈ C(np+mq)×(np+mq) , B ∈ C(np+mq)×m , C ∈ Cn×(np+mq) , D ∈ Cn×m is a realization of the

Multidimensional ARMA (VARMA) processes ARMA(p, q) process:  0n In  .. ..  . .   0n 0 n   −αp −αp−1 A :=   0m×n 0m×n   .. ..  . .   0m×n 0m×n 0m×n 0m×n

153

··· .. . ··· ··· ··· .. .

0n .. .

0n×m .. .

0n×m .. .

In −α1 0m×n .. .

0n×m βq 0m .. .

0n×m βq−1 Im .. .

··· .. . ··· ··· ··· .. .

··· ···

0m×n 0m×n

0m 0m

0m 0m

··· ···

 0n×m  ..  .  0n×m   β1  , 0m    ..  .  Im  0m

T

B := [0n×m , . . . , 0n×m , β0 , 0m , . . . , 0m , Im ] , C := [0n , · · · , 0n , In , 0n×m , · · · , 0n×m ], D := 0n×m , and Vt := Ut . The proof is obvious, so omitted. The next theorem gives the standard observable realization of a VARMA process. The dimension of the state space will be significantly smaller in general than in Proposition 4.2 above. The stability condition below is a direct generalization of the condition on the zeros of AR polynomial in the 1D case. Theorem 4.7. Consider the VARMA(p, q) process defined in (4.85) with α0 = In and β0 6= 0n×m . Assume the stability condition det(α(z)) 6= 0

if

|z| ≤ 1.

(4.87)

Then this process can be realized by a parsimonious, stable observable stochastic linear system (3.35) with an (np)-dimensional state space X, where Vt = Ut for all t ∈ Z,     0n In 0n ··· 0n H1  0n  0n In ··· 0n    H2   ..   . . ..  . . . . A :=  . B :=  .   . . . .    ..   0n 0n 0n ··· In  Hp np×n −αp −αp−1 −αp−2 · · · −α1 np×np C :=[In , 0n , · · · , 0n ]n×np

D := β0 .

(4.88)

Here H1 , . . . , Hp ∈ Cn×m are coefficient matrices of the transfer function H(z) =

∞ X j=0

Hj z j := α−1 (z)β(z),

H0 := β0 .

(4.89)

154

Multidimensional time series

Consequently, the considered VARMA process has a unique causal stationary MA(∞) solution Yt =

∞ X

Hj Ut−j = DUt +

j=0

∞ X

CAj−1 BUt−j

(t ∈ Z),

(4.90)

j=1

which converges almost surely and also in mean square: N

2 X

lim Yt − Hj Ut−j = 0.

N →∞

j=0

Proof. Without loss of generality we may suppose that p ≥ q, because if p < q, we may add new coefficients αp+1 = · · · = αq = 0n . From now on in this proof we assume that det(α(z)) 6= 0 if |z| ≤ 1. Since det(α(z)) is a continuous function of z, there exists an  > 0 such that the rational matrix α−1 (z) = adj(α(z))/ det(α(z)) (4.91) is defined and analytic for |z| < 1 + . Thus we can define the transfer function (a rational matrix) as H(z) := α−1 (z)β(z) =

∞ X

Hj z j

(Hj ∈ Cn×m ),

(4.92)

j=0

which is a convergent power series for |z| < 1 + . Using (4.92), for |z| < 1 + , we see that  ! ∞ p q X X X αk z k  Hj z j  = β` z ` , j=0

k=0

`=0

so β` =

` X

αk H`−k

(` = 0, 1, . . . , q).

(4.93)

k=0

It is clear that H0 = β0 = D by the definition of Consider the system of linear block equations      W1 W1  ..   .    .  = zA  ..  +  Wp

Wp

D in (4.88).  H1 ..  , .  Hp

(4.94)

where W1 , . . . , Wp ∈ Cn×m are unknowns and the matrix A is defined by

Multidimensional ARMA (VARMA) processes

155

(4.88). That is, W1 = zW2 + H1 W2 = zW3 + H2 .. . Wp−1 = zWp + Hp−1 Wp = −

p X

αp−j+1 zWj + Hp .

(4.95)

j=1

Substitute the first equation for W1 into the last equation: Wp = (−αp z 2 − αp−1 z)W2 − αp−2 W3 − · · · − α1 Wp + Hp − (αp z)H1 . Then do the same with the second equation for W2 , the third equation for W3 , and so on, to obtain Wp = (−αp z p − αp−1 z p−1 − · · · − α1 z)Wp + Hp − (αp z)H1 − (αp z 2 + αp−1 z)H2 − · · · − (αp z p−1 + αp−1 z p−2 + · · · + α2 z)Hp−1 . Rearranging the terms, using α0 = In , we get α(z)Wp = −z p−1 (αp Hp−1 ) − z p−2 (αp Hp−2 + αp−1 Hp−1 ) − · · · − z(αp H1 + αp−1 H2 + · · · + α2 Hp−1 ) + Hp . Then substitute this into the penultimate equation of (4.95), and so on, to eventually obtain α(z)W1 = z 2p−2 (αp Hp−1 ) − z 2p−3 (αp Hp−2 + αp−1 Hp−1 ) − · · · − z p (αp H1 + αp−1 H2 + · · · + α2 Hp−1 ) + z p−1 Hp + α(z)(z p−2 Hp−1 + · · · + zH2 + H1 ). Fortunately, all terms with powers z k (k ≥ p) cancel on the right hand side, so by (4.93), for |z| < 1 +  we get that α(z)D + zα(z)W1 = z p (α0 Hp + α1 Hp−1 + · · · + αp H0 ) + · · · + z(α0 H1 + α1 H0 ) + (α0 H0 ) = β(z).

(4.96)

This completes the first part of the proof. Now, let us continue with the definition of the matrices A, B, C, D given in (4.88). By Lemma 4.9 below, det(In − zA) = z np det(z −1 In − A) 6= 0 for |z| ≤ 1. It implies that all eigenvalues of the matrix A are in the open unit disc {z : |z| < 1}, that is, ρ(A) < 1. Thus by Theorem 3.5, the linear system defined in (4.88) has a unique causal stationary MA(∞) solution Yt = DUt +

∞ X j=1

CAj−1 BUt−j

(t ∈ Z),

(4.97)

156

Multidimensional time series

and this series converges with probability 1 and in mean square. Here {Ut } ∼ WN(Σ) is the same process that appears in the definition (4.85). By induction over j it is easy to check that CAj−1 = [0n , . . . , 0n , In , 0n , . . . , 0n ]n×np

(j = 2, . . . , p),

(4.98)

where the block In is at the jth position; all other blocks are 0n . Thus CAj−1 B = Hj

(j = 1, . . . , p).

On the other hand, by (4.94) we obtain that     W1 H1 ∞ X  ..  .  −1  −1 Aj Bz j ,  .  = (In − Az)  ..  = (In − Az) B = j=0 Wp Hp ∞ X W1 = C(In − Az)−1 B = CAj Bz j , (4.99) j=0

if |z| < 1 + . Comparing equations (4.96) and (4.99), it follows for the linear system defined in (4.88) that its transfer function is H(z) = α−1 (z)β(z) = D + zC(In − Az)−1 B = D +

∞ X

CAj−1 Bz j ,

j=1

if |z| < 1 + . This and (4.97) imply that the solution of the linear system has the form (4.90). This proves that the linear system (4.88) is a realization of the given VARMA process. Pp Taking the output sequence Ψt := j=0 αj Yt−j (t ∈ Z), it has the transfer function α(z)H(z) = β(z), which means that the sequence {Yt } defined by (4.90) is the unique solution of the considered VARMA process, see (4.86). By (4.98),   C  CA    rank(Op ) := rank   = rank(Inp ) = np = dim(X) = rank(O), ..   . CAp−1

so the linear system in (4.88) is indeed observable, see Section 3.3. Lemma 4.9. If the block-matrix A is defined by (4.88) and the VAR polynomial α(z) is defined by (4.91), then det(In − zA) = det(α(z)).

Multidimensional ARMA (VARMA) processes

157

Proof. det(In − zA) =

In 0n .. .

−zIn In .. .

0n −zIn .. .

··· ··· .. .

0n 0n .. .

0n zαp

0n zαp−1

0n zαp−2

··· ···

In zα2

−zIn In + zα1 0n 0n .. .

Add z times the first block-column to the second, then z times the second to the third, etc., lastly z times the (p − 1)th block-column to the pth. Finally, eliminate the non-diagonal blocks of the last block-row by using all the other block-rows: In 0n ··· 0n 0n 0n In ··· 0n 0n .. . . .. .. .. .. det(In − zA) = . . . 0n 0 · · · I 0 n n n zαp z 2 αp + zαp−1 · · · z p−1 αp + · · · + zα2 α(z) In 0n · · · 0n 0n 0n In · · · 0n 0n .. . . .. = det(α(z)). . .. .. .. = . . 0n 0n · · · In 0n 0n 0n · · · 0n α(z)

By (4.5), (1.30), and (4.89), we get the spectral density function of a stable VARMA(p, q) process: fY (ω) =

∞ 1 1 X −i`ω H(e−iω )ΣH(e−iω )∗ = e 2π 2π `=−∞

=

∞ X

H`+k ΣHk∗

k=max(0,−`)

1 −1 −iω α (e )β(e−iω )Σβ(e−iω )∗ α−1 (e−iω )∗ . 2π

(4.100)

The next theorem summarizes results above and in Appendix D, showing equivalence of different approaches that describe weakly stationary time series determined by finitely many scalar parameters. P∞ Theorem 4.8. Let H(z) = j=0 Hj z j , Hj ∈ Cn×n , H0 = In , be a rational ¯ By transfer function of a linear system, analytic in the closed unit disc D. Theorem D.1 in the Appendix, there exist left coprime polynomial matrices α(z), β(z) ∈ Cn×n [z] such that H(z) can be represented by a matrix fraction description (MFD) H(z) = α−1 (z)β(z), and so the linear system can be represented as a VARMA(p, q) process as in

158

Multidimensional time series

¯ Definition 4.1. Since α−1 (z) = adj α(z)/ det α(z), and H(z) is analytic in D, it follows that det(α(z)) 6= 0 for |z| ≤ 1. Thus Theorem 4.7 implies that this matrix fraction can be realized as a stable observable stochastic linear system with coefficient matrices Σ = (A, B, C, D) given by (4.88). Also, there exists a unique causal stationary MA(∞) solution of the VARMA(p, q): Yt =

∞ X

Hk Ut−k

(t ∈ Z),

∞ X

kHj k2F < ∞,

j=0

k=0

which converges with probability 1 and in mean square, where {Ut } ∼ WN(Im ). It is clear from the proof of Theorem D.1 that if H(z) = P (z)/ψ(z), where P (z) is a polynomial matrix and ψ(z) is a polynomial, then α(z) is obtained by simplifying with the greatest common left divisor of P (z) and ψ(z)In , so p in VARMA(p, q) can be much smaller than the degree of ψ(z). It means that the above realization is in this sense optimal. By (4.100), any stable VARMA(p, q) process has spectral density f which is a rational function in z = e−iω . In turn, by Corollaries 4.2 and 4.7 and Remark 4.7, if {Xt } is a stationary time series with absolutely continuous spectral measure of density f which is 1 Φ(e−iω )Φ∗ (e−iω ), where the a rational function in z = e−iω , then f (ω) = 2π spectral factor Φ(z) is a rational matrix, analytic in D, belongs to H 2 , and has no poles on T . Thus defining the transfer function H(z) = Φ(z), the first paragraph of the present theorem shows that {Xt } can be given as a stable VARMA(p, q) process. By (4.100), its spectral density is the given f , when {Ut } ∼ WN(Im ).

4.8.2

Yule–Walker equations

In this subsection we always assume that the conditions of Theorem 4.7 are fulfilled and w.l.o.g. we suppose that q < p. Then there exists a unique causal stationary MA(∞) solution of the VARMA(p, q): Yt =

∞ X

Hk Ut−k ,

t ∈ Z,

(4.101)

k=0

where the sum converges almost surely and in mean square. Then it is easy write the covariance function in terms of the transfer Pto ∞ function H(z) = j=0 Hj z j (which is convergent for |z| ≤ 1):   ! ∞ ∞   X X CY (k) = E(Yt+k Yt∗ ) = E H` Ut+k−`  U∗t−j Hj∗    `=0

=

∞ X j=0

Hj+k ΣHj∗ ,

CY (−k) = CY∗ (k),

j=0

k ≥ 0.

(4.102)

Multidimensional ARMA (VARMA) processes

159

The important Yule–Walker equations can be obtained from definition ∗ (4.85) by multiplying it with Yt−k and taking expectation: p X j=0 p X

∗ αj E(Yt−j Yt−k )

=

q X

∗ β` E(Ut−` Yt−k ),

`=0

αj CY (k − j) =

j=0

q X

∗ β` ΣH`−k

(k ≥ 0).

(4.103)

`=k

The right hand side of (4.103) is O if k > q. It is easy to extend the discussion of Yule–Walker equations in the 1D case in Section 2.4 to the present multi-D case. Substitute (4.101) into definition (4.85): p q ∞ X X X αj Hk Ut−j−k = β` Ut−` . k=0 j=0

`=0

Equate the coefficients on both sides, starting with Ut and working toward the past (α0 = In ): H0 = β0 , H1 + α1 H0 = β1 , H2 + α1 H1 + α2 H0 = β2 , .. . Hp−1 + α1 Hp−2 + · · · + αp−1 H0 = βp−1 , Hp+k + α1 Hp+k−1 + · · · + αp Hk = 0 (k ≥ 0),

(4.104)

where β` = 0 if ` > q. If {αj : j = 1, . . . , p} and {β` : ` = 0, . . . , q} are known, these equations uniquely determine the coefficients {Hk : k ≥ 0} by recursion. From now on we omit the subscript Y from the covariances, since all of the covariances here are related to the process {Yt }. From (4.103) we obtain the following system of block equations   C(0) C(1) · · · C(p) · · · C(p − 1)     C(−1) C(0)  In α1 · · · αp  .  . .. .. ..  ..  . . ···

C(0)

βq ΣH0∗

O

C(−p) C(−p + 1) =

 Pq

∗ `=0 β` ΣH`

Pq

∗ `=1 β` ΣH`−1

···

···

 O . (4.105)

In the special case when the process is VAR(p), that is, q = 0, this can be rearranged into a system of linear block equations for the unknowns α1 , . . . , αp

160

Multidimensional time series

if the covariance matrix function C(h),  C(0) C(−1)    α1 · · · αp  .  .. =−



C(1) · · ·

C(−p + 1)  C(p) ,

(h = 0, 1, . . . , p) is known: C(1) C(0) .. .

··· ··· .. .

C(−p + 2)

···

 C(p − 1) C(p − 2)     C(0) (4.106)

and also, p X

αj C(j) = β0 Σβ0∗ ,

j=0

since H0 = β0 . Unfortunately, if q > 0, so when one really has an ARMA process, on the right hand side of (4.105) polynomial expressions of the unknown parameters would be obtained from (4.104). Introducing the notations α := [α1 , . . . , αp ], c := [C(1), . . . , C(p)], and   C(0) C(1) · · · C(p − 1)  C(−1) C(0) · · · C(p − 2)    C :=  . , .. . ..  ..  . C(−p + 1) C(−p + 2) · · · C(0) we may write (4.106) as αC = −c.

(4.107) −1

If C is positive definite, then there exists a unique solution: α = −cC . Otherwise, one may use the Moore–Penrose inverse C+ instead of the inverse matrix, see Definition B.4 in Appendix B. The system (4.107) is always consistent, always has a solution, because it is a Gauss normal equation, see Appendix C. ∗ ∗ For, taking Y := [Yp−1 Yp−2 . . . Y0∗ ]∗ , we have C = E(YY∗ ),

c = E(Yp Y∗ ),

and this shows that the right hand side of (4.107) is in the row space of the left hand side. Assuming that n = m, rank(H0 ) = rank(β0 ) = n, and rank(Σ) = n, introduce the ∞ × p upper trapezoidal block Toeplitz matrix   H0 H1 · · · Hp−1 Hp Hp+1 · · ·  0 H0 · · · Hp−2 Hp−1 Hp ···    H :=  . (4.108) . . . . .. ..  .. .. ..  .. . .. .  0

0

···

H0

H1

H2

···

and the infinite block row matrix h := [H1 , H2 , H3 , · · · ]. Then the infinite system of linear equations (4.104) can be written as αH = −h,

Multidimensional ARMA (VARMA) processes

161

and in the light of (4.102), equation (4.107) is equivalent to αH(ΣH∗ ) = −h(ΣH∗ ),

C = H(ΣH∗ ),

where (ΣH∗ ) denotes block multiple  ΣH0  ΣH1   ..  . ∗ (ΣH ) :=   ΣHp−1   ΣHp  .. .

c = h(ΣH∗ ),

(4.109)

of a block matrix: 0 ΣH0 .. .

0 ··· .. .

ΣHp−2 ΣHp−1 .. .

··· ··· .. .

0 0 .. .



    . ΣH0   ΣH1   .. .

Equation (4.109) gives an infinite block Cholesky-type decomposition of C. By our assumption rank(H0 ) = n, so (4.108) shows that the range of H (when multiplying from the left) is pn-dimensional and this pn-dimensional subspace of `2 is mapped by (ΣH∗ ) onto Cpn . Corollary 4.8. Under the above assumptions, rank(C) = pn, so the system (4.107) has a unique solution α = −c C−1 . Since the covariances can be easily estimated from a random sample, see Subsection 1.5.2, the resulting estimated version of (4.106) gives a practical method for estimating the coefficients α1 , . . . , αp of a VAR(p) process.

4.8.3

Prediction, miniphase condition, and approximation by VMA processes

A very useful property of stable VARMA processes Yt = −α1 Yt−1 − · · · − αp Yt−p + β0 Ut + β1 Ut−1 + · · · + βq Ut−q that in mean square the best linear one-step ahead prediction of Yt based on − the infinite past Ht−1 = span{Yt−1 , Yt−2 , . . . } can be evaluated using only finitely many past values: ˆ t = −α1 Yt−1 − · · · − αp Yt−p + β1 Ut−1 + · · · + βq Ut−q . Y

(4.110)

The proof follows from the projection theorem: the right hand side here lies in − ˆ t = β0 Ut is orthogonal to H − . This finiteness is Ht−1 and the error Yt − Y t−1 in contrast to the general case of regular processes, where one needs infinitely many data in principle, see (4.26). The problem with the prediction (4.110) is that in practice one typically cannot observe the input process {Ut }. Lemma 4.9 proves that the stability condition (4.87) is equivalent to the stability condition ρ(A) < 1 of the corresponding linear system. Similarly, the strict miniphase condition det(β(z)) 6= 0,

∀|z| ≤ 1

162

Multidimensional time series

is equivalent to the stability condition ρ(A − BC) < 1, see (3.40) and the ‘inverse’ stochastic linear system discussed in Subsection 3.6.2. Here it is assumed that m = n and β0 = In in the definition (4.85). Under the stability condition (4.87) of the original case we had Yt = H(z)Ut = α−1 (z)β(z)Ut ,

t ∈ Z,

so the output could be expressed in terms of the input. Similarly, under the miniphase condition, the input can be expressed in terms of the observable output: Ut = H −1 (z)Yt := β −1 (z)α(z)Yt , t ∈ Z. The importance and wide applicability of VARMA processes is further emphasized by the next proposition. Proposition 4.3. The stable VARMA processes are dense among the regular time series. More exactly, for any regular stationary time series {Xt } and for any  > 0 there exists a positive integer N and a VMA(N ) process {Yt } such that kXt − Yt k < , ∀t ∈ Z. Proof. Consider a regular time series Xt =

∞ X

b(j)ξt−j

(t ∈ Z),

b(j) = [bk` (j)]d×r ,

j=0 ∞ X

|bk` (j)|2 < ∞

∀k, `,

{ξt }t∈Z ∼ WN(Ir ).

j=0

Then for any  > 0 and for any k, ` fixed, there exists a positive integer Nk` such that ∞ X 2 . |bk` (j)|2 < rd j=Nk` +1

Take N := max{Nk` : k = 1, . . . , d; ` = 1, . . . , r} and define Yt :=

N X

b(j)ξt−j ,

t ∈ Z.

j=0

Then {Yt } is a VMA(N ) process and kXt − Yt k2 =

∞ X j=N +1

This proves the proposition.

kb(j)k2 < 2 ,

∀t ∈ Z.

Summary

163

Example 4.1. Figure 4.1 shows typical trajectories of a 3D VAR(2) process, together with its impulse response functions and covariance functions. Figure 4.2 shows the spectral densities. The parameters are β = I3 , α0 = I3 ,     −1 0 0 1/2 0 0 0 , 0 . α1 =  1/2 −4/5 α2 =  1/3 1/5 −1/2 1/3 −1/6 −1/2 1 −1/6 The third panel of Figure 4.1 shows that the off-diagonal entries of the matrix Hk are 0. The first panel of Figure 4.2 shows that the spectral densities of the diagonal entries are even functions, similarly the real parts of the off-diagonal entries, while the imaginary parts of the off-diagonal entries are odd functions, see Remark 1.7.

4.9

Summary

The d-dimensional, weakly stationary time series {Xt } has a spectral density matrix f of constant rank r ≤ d (almost everywhere on [−π, π]) if and only if it can be factorized as f (ω) =

1 φ(ω)φ∗ (ω) 2π

for a.e. ω ∈ [−π, π],

where φ(ω) ∈ Cd×r and φ(ω) =

∞ X

b(j)e−ijω ,

j=−∞

∞ X

|bqs (j)|2 < ∞ (q = 1, . . . , d; s = 1, . . . , r).

j=−∞

Equivalently, {Xt } can be represented as a two-sided MA process: Xt =

∞ X

b(j)ξt−j ,

j=−∞

where b(j) ∈ Cd×r (j ∈ Z) is the above non-random matrix-valued sequence; it consists of the Fourier coefficients of the function φ(ω) and called impulse response function; further, {ξt } ∼ W N (Ir ) is the orthonormal white noise sequence of random shocks. Akin to the 1D situation (see Chapter 2), here too we can introduce the P transfer function (matrix) H(z) = j∈Z b(j)z j , which belongs to L2 (T ) and with which we can write that Xt = H(z)ξt , t ∈ Z. Clearly, for the spectral factor φ we have φ(ω) = H(e−iω ). Therefore, f (ω) =

1 H(e−iω )H(e−iω )∗ , 2π

ω ∈ [−π, π],

164

Multidimensional time series VAR(p) process Xt1 Xt2 Xt3

10

Process Xt

5 0 5 10 15

0

20

40

1.0

Hk11 Hk22 Hk33

0.6 0.4 0.2

80

100 Hk12 Hk13 Hk23

0.04 Impulse response Hk

Impulse response Hk

0.8

60

Time t

0.02 0.00 0.02

0.0 0.04

0.2 0

10

20

30

Index k

40

50 c11(h) c22(h) c33(h)

20 15 10 5

10

20

Index k

30

40

50

0 Covariance function C(h)

25 Covariance function C(h)

0

2 4 6 c12(h) c13(h) c23(h)

8

0 0

10

20 30 Value of h

40

50

0

10

20 30 Value of h

40

50

FIGURE 4.1 Typical trajectories of a 3D VAR(2) process, with its impulse response functions and covariance functions in Example 4.1.

f 11 f 22 f 33 Spectral density f

Spectral density f

20 15 10 5

4

6

2

4

0

Spectral density f

25

2 4 6 Ref 12 Ref 13 Ref 23

8

0 3

2

1

0 1 Frequency

2

3

10

3

2

FIGURE 4.2 Spectral densities in Example 4.1.

1

0 1 Frequency

2 0 2 Imf 12 Imf 13 Imf 23

4 2

3

6

3

2

1

0 1 Frequency

2

3

Summary

165

and the covariance function of {Xt } is C(h) = E(Xt+h X∗t ) =

∞ X

b(k + h)b∗ (k),

h ∈ Z.

k=−∞

Regular (purely non-deterministic) time series are subclasses of the constant rank ones in that they have a one-sided MA representation: Xt =

∞ X

b(j)ξt−j ,

t ∈ Z.

j=0 1 φ(ω)φ∗ (ω), Equivalently, their density matrix factorizes as f (ω) = 2π P∞spectral−ikω where φ(ω) = k=0 b(k)e , so the transfer function is also one-sided. The multi-D Wold decomposition also works as follows. Assume that {Xt }t∈Z is an d-dimensional non-singular stationary time series. Then it can be represented as

X t = Rt + Y t =

∞ X

b(j)ξt−j + Yt

(t ∈ Z),

j=0

where {Rt } is a d-dimensional regular time series subordinated to {Xt }; {Yt } is an d-dimensional singular time series subordinated to {Xt }; {ξt } is an rdimensional (r ≤ d) WN(Ir ) sequence subordinated to {Xt }. Further, {Rt } and {Yt } are orthogonal to each other; the impulse response matrices are b(j), j ≥ 0, with square summable entries. It is important that the orthonormal innovation process {ξt } is not unique, but ξt s are within the pairwise orthogonal innovation subspaces that are unique and their dimension is equal to the constant rank r of the spectral density matrix f of the process. More special regular processes are the ones that can be finitely parametrized. Those are, in fact, the causal VARMA (vector ARMA) processes that also have an MFD or state space representation. The d-dimensional VARMA(p, q) process of 0 mean is defined by α(L) Xt = β(L) Ut , where α(z) = I + α1 z + · · · + αp z p and β(z) = I + β1 z + · · · + βq z q are matrix-valued complex polynomials, namely, the AR and MA polynomials, L is the backward shift operator, and {Ut } ∼ WN (Σ) is d-dimensional white noise. The coefficient matrices α1 , . . . , αp and β1 , . . . , βq are d × d complex matrices; α0 = Id and β0 = Id can be assumed. In particular, in the q = 0 case we have a VAR(p), whereas, in the p = 0 case we have a VMA(q) process. If the condition |α(z)| 6= 0 for |z| ≤ 1 for the VAR polynomial is satisfied (it is called stability), then we have a causal representation of the process: Xt =

∞ X j=0

Hj Ut−j ,

166

Multidimensional time series

with {Ut } ∼ WN(0, Σ) and the coefficient matrices Hj s come from the power series expansion of the transfer function: H(z) =

∞ X

Hj z j = α−1 (z) β(z).

j=0

So we can write the original process as Xt = H(z) Ut . Note that {Ut } can be transformed into an orthonormal process. Indeed, if the white noise covariance matrix Σ has rank r ≤ d, it has the Gramdecomposition (see Appendix B) like Σ = BB ∗ with the d × r matrix B of full rank, and ∞ ∞ X X Xt = Hj Ut−j = (Hj B)ξt−j , j=0

j=0

where {ξt } ∼ WN(Ir ) is an orthonormal process (both longitudinally and cross-sectionally). This is the multidimensional Wold decomposition (there is no singular part). If, in addition to the stability condition, the inverse stability or strict miniphase condition, i.e., |β(z)| 6= 0 for |z| ≤ 1 also holds (concerning the MA polynomial), then Ut can also be expanded in terms of Xk s (k ≤ t). It is important that a VAR(p) process makes rise of a finite prediction of Xt = a1 Xt−1 + · · · + ap Xt−p + Ut with its p-length long past: ˆ t = a1 Xt−1 + · · · + ap Xt−p , X where aj = −αj for j = 1, . . . , p. This extends to the infinite past prediction, see Chapter 5. The multidimensional Yule–Walker equations also work in this situation. It is also discussed when and how we can find a trivial and a minimal state space representation to a VARMA process. We prove that the stable VARMA processes are dense among the multi-D regular time series. Regular, singular, and non-singular processes are also characterized. A ddimensional stationary time series {Xt } is of full rank non-singular process if and only if Z π

log det f (ω)dω > −∞. −π

In this case, if Σ denotes the covariance matrix of the innovation process {ηt }, that is, of the one-step ahead prediction error process based on the infinite past (see Chapter 5), then Z π dω det Σ = (2π)d exp log det f (ω) , 2π −π which is the multi-D generalization of the Kolmogorov–Szeg˝o formula. More generally, the d-dimensional stationary time series {Xt } is regular and of rank r, 1 ≤ r ≤ d, if and only if each of the following conditions holds:

Summary

167

1. It has an absolutely continuous spectral measure matrix dF with density matrix f (ω) which has rank r for a.e. ω ∈ [−π, π]. ˜ (ω)Λr (ω)U ˜ ∗ (ω) the parsimonious spectral 2. Denoting by f (ω) = U decomposition of the spectral density matrix, log det Λr (ω) ∈ L1 , or equivalently, Z π

log λr (ω) dω > −∞, −π

where λr (ω) is the smallest diagonal entry of Λr (ω). ˜ (ω) appearing in the spectral 3. The sub-unitary matrix function U decomposition of f (ω) in belongs to the Hardy space H ∞ ⊂ H 2 , so ˜ (ω) = U

∞ X

ψ(j)e−ijω ,

ψ(j) ∈ Cd×r ,

j=0

∞ X

kψ(j)k2F < ∞.

j=0

In the full rank case, the first two conditions imply the third one, so it need not be stated separately. In the possession of the above, the classification of non-regular multi-D time series is more complex than that of the 1D ones. We call a time series non-regular if either it is singular or its Wold decomposition contains two orthogonal, non-vanishing processes: a regular and a singular one. In dimension d > 1 a non-singular process beyond its regular part may have a singular part with non-vanishing spectral density. We consider a d-dimensional stationary time series {Xt } with spectral measure dF . We distinguish between the following cases. • Type (0) non-regular processes. In this case, dF is singular w.r.t. the Lebesgue measure in [−π, π]. Clearly, type (0) non-regular processes are simply singular ones. Like in the 1D case, we may further divide this class into processes with a discrete spectrum or processes with a continuous singular spectrum or processes with both. • Type (1) non-regular processes. The time series has an absolutely continuous spectral measure with density f , but rank(f ) is not constant on [−π, π]. • Type (2) non-regular processes. The time series has an absolutely continuous spectral measure with density f which has constant rank r a.e., 1 ≤ r ≤ d, but Z π Z π X r log det Λr (ω)dω = log λj (ω) dω = −∞. −π

−π j=1

• Type (3) non-regular processes. The time series has an absolutely continuous spectral measure with density f which has constant rank r a.e., 1 ≤ r < d, Z π Z π X r log det Λr (ω)dω = log λj (ω) dω > −∞, −π

−π j=1

168

Multidimensional time series

˜ (ω) appearing in the spectral decomposibut the unitary matrix function U tion of f (ω) does not belong to the Hardy space H 2 . If {Xt } has full rank r = d and it is non-singular, then it may have only a Type (0) singular part. Assume that {Xt } is a d-dimensional stationary time series of constant rank r, 1 ≤ r ≤ d, so it can be written as a sliding summation with a WN(Ir ) process. In Brillinger’s book [9], the Principal Component Analysis (PCA) in the Frequency Domain aims at approximating the process {Xt } with a rank (k) (k) k process {Xt } such that the mean square error kXt − Xt k2 is minimized, where k ≤ r is a fixed positive integer. The solution is as follows. ˜ (ω)Λr (ω)U ˜ ∗ (ω). Then Consider the spectral decomposition f (ω) = U Z π (k) ˜ k (ω)U ˜ ∗ (ω)dZω , t ∈ Z, Xt = eitω U k −π

is the best approximating process of rank k. For the mean square error we have Z π X r (k) kXt − Xt k2 = λj (ω) dω, t ∈ Z, −π j=k+1

and

R π Pr (k) kXt − Xt k2 j=k+1 λj (ω) dω Rπ P = −π , r 2 kXt k j=1 λj (ω) dω −π

t ∈ Z.

If condition λk (ω) ≥ ∆ >  ≥ λk+1 (ω)

∀ω ∈ [−π, π],

holds, then we also have (k)

kXt − Xt k ≤ (2π(r − k))1/2 , and (k)

kXt − Xt k ≤ kXt k



(r − k) r∆

t∈Z

 12 ,

t ∈ Z.

In particular, when {Xt } is a d-dimensional regular time series of rank r, then its best rank k approximation is also a regular time series (1 ≤ k ≤ r). Application of the above for dynamic PCA is further discussed in Chapter 5.

5 Dimension reduction and prediction in the time and frequency domain

5.1

Introduction

In this chapter, we deal with the prediction of stochastic processes in general and in the weakly stationary case. We consider one-step and more-step ahead predictions based on finitely many past values or on the infinite past. Actually, the original paper of H. Wold [62] is about 1D, weakly stationary time series, and constructs the famous decomposition via one-step ahead predictions based on the n-length long past with usual multivariate regression techniques, while making use of stationarity as well. Then, at a passage to infinity (n → ∞) he gets the formula for the one-step ahead prediction based on the infinite past. In this way, he decomposes the regular part of a weakly stationary 1D time series as the infinite sum of the innovations that also form a weakly stationary process, namely, a white noise process with the smallest obtainable variance of the linear prediction error. Orthogonality (uncorrelatedness) of the innovation ηt and the past Xt−1 , Xt−2 , . . . of Xt is the consequence of the projection principle used in multivariate regression. The generalization to a multivariate process {Xt } is straightforward with the observation that here simultaneous multivariate linear regressions are used for the components of Xt based on all the components of Xt−1 , . . . , Xt−n . The error terms, ηt s and their covariance matrices are obtainable by the block Cholesky decomposition of the block Toeplitz matrix Cn already used in Chapter 1. At a passage to infinity, we get the multidimensional Wold decomposition that is more complicated than the 1D one in that here only the so-called innovation subspaces are unique, the dimension of which is the same as the rank of the spectral density matrix of the process {Xt }. If this rank r is less than the dimension d of the process, then the error covariance matrix of ηt is singular (of rank r), but usually not the zero matrix. In this case, within the innovation subspaces, with usual factor analysis techniques, a {ξt } ∼ WN(Ir ) process can be constructed (up to orthogonal rotation) such that it appears in the multidimensional Wold decomposition instead of the innovation subspaces. Actually, this is the task of the dynamic factor analysis when the low-dimensional approximation is not always straightforward but is

169

170

Dimension reduction and prediction in the time and frequency domain

obtainable with spectral approximations under the conditions of the GDFM (Generalized Dynamic Factor Model) of [3, 15, 18, 31]. We also establish asymptotic relations between the spectrum of Cn and the spectra of spectral density matrices at the Fourier frequencies, for “large” n. In this way, the spectra of spectra, that is the spectral decomposition of these matrices plays a crucial rule in dimension reduction and dynamic PCA (see Chapter 4), and gives rise to computationally more tractable algorithms as the above block Cholesky decomposition. The technique of the K´alm´an’s filtering [1, 32, 33, 58] is also introduced together with a recursion to obtain the innovations and the newer and newer predictions for the state variable of a state space system, while using only the newcoming observed variable and the preceding estimate of the state variable. In the heart of the recursion there lies the propagation of the error covariance matrices.

5.2 5.2.1

1D prediction of weakly stationary processes in the time domain One-step ahead prediction based on finitely many past values

We have a 1D time series {Xt } which is not necessarily stationary, for now; we just assume the existence of the second moments (cross-autocovariances). For simplicity, the state space is R, but the time is discrete (t ∈ Z). Assume that E(Xt ) = 0 (t ∈ Z). Select a starting observation X1 and Hn := Span{X1 , . . . , Xn } (precisely, it should be denoted by Hn (X), but as the process is fixed, it is briefly denoted by Hn ). We want to linearly predict Xn+1 based on random past values X1 , . . . , Xn . ˆ 1 := 0, and denote by X ˆ n+1 the best linear prediction that minimizes Let X ˆ n+1 )2 , n = 1, 2, . . . . If we consider the the mean square error E(Xn+1 − X Hilbert space of the random variables with 0 expectation and finite variance, where the inner product is the covariance (see Appendix C), then we have ˆ n+1 )2 = kXn+1 − X ˆ n+1 k2 . By the general theory of Hilbert spaces, E(Xn+1 − X ˆ Xn+1 = ProjHn Xn+1 , i.e. the projection of Xn+1 onto the linear subspace Hn . ˆ n+1 = E(Xn+1 | X1 , . . . , Xn ), which is In the Gaussian case, the solution is X the regression plane, but the coefficients of the optimal linear predictor ˆ n+1 = an1 Xn + · · · + ann X1 X can be obtained in the non-Gaussian case too, by solving a system of linear equations that contains the second moments and the second cross-moments of the involved random variables as follows by the theory of multivariate linear regression (see also Appendix C).

1D prediction in the time domain

171

With the notations an = (an1 , . . . , ann )T , Cn = [Cov(Xi , Xj )]ni,j=1 and dn = (Cov(Xn+1 , Xn ), . . . , Cov(Xn+1 , X1 ))T , we have to solve the following system of linear equations (Gauss normal equations): Cn an = dn .

(5.1)

A solution (the projection) always exists, and it is unique if Cn is positive definite. Then the unique solution is an = Cn−1 dn . Otherwise, there are infinitely many solutions, and we can give them similarly, with any generalized inverse of the positive semidefinite matrix Cn . In this case, there are linear relations between X1 , . . . , Xn , and so, infinitely many linear combinations of them produce the same projection of Xn+1 onto the subspace spanned by them. In case of a singular Cn it is customary to use the (unique) Moore–Penrose inverse (see Appendix B) that gives the particular solution an = Cn+ dn . We will see that this issue is immaterial in the stationary case, since then zero determinant of Cn for the smallest n ≥ 1 indicates zero prediction error, see Remark 5.5. In particular, if {Xt } is stationary, then Cn = [c(i − j)]ni,j=1 , so Cn is a Toeplitz matrix, and dn (j) = c(j), j = 1, . . . , n. Therefore, the solution an does not depend on the selection of the starting time of the starting observation X1 . In this case, no double indexing for the coordinates of the vector an is necessary, they can as well be written as a1 , . . . , an . Also, when {Xt } is stationary, then under very general conditions, there is a unique solution as discussed below. Some remarks are in order. Remark 5.1. Namely, by Proposition 5.1.1 of [11], if c(0) > 0 and limh→∞ c(h) = 0, then the autocovariance matrix Cn = [c(i−j)]ni,j=1 of (X1 , . . . , Xn )T is positive definite for every n ∈ N. Remark 5.2. By Proposition 4.5.2 of [11], for “large” n, the eigenvalues of Cn are asymptotically the same as the union of the values of the spectral density f at the Fourier frequencies. In Section 5.4, we will generalize this statement for multidimensional time series. Considering the decomposition ˆ n+1 + ηn+1 , Xn+1 = X ˆ n+1 = aT Xn and ηn+1 is the error term, it is easy to see (Appendix C) where X n that the two right-hand side terms are orthogonal (uncorrelated), therefore their squared norms (variances) are added together: ˆ n+1 ) + Var(ηn+1 ). Var(Xn+1 ) = Var(X With our notation it yields c(0) = Var(aTn Xn ) + Var(ηn+1 ) = Var(dTn Cn−1 Xn ) + Var(ηn+1 ) = dTn Cn−1 Cn Cn−1 dn + Var(ηn+1 ) = dTn Cn−1 dn + Var(ηn+1 ).

172

Dimension reduction and prediction in the time and frequency domain

Consequently, the prediction error, that is the variance of the error term, is e2n = kηn+1 k2 = Var(ηn+1 ) = c(0) − dTn Cn−1 dn ,

(5.2)

by Remark C.3 in Appendix C. It will be further analyzed in Section 5.2.2. Note that Equation (5.1) is exactly the same as the first n Yule–Walker equations for estimating the parameters of a stationary AR(n) process which is Xt = a1 Xt−1 + a2 Xt−2 + · · · + an Xt−n + ηt , t = 0, ±1, ±2, . . . where {ηt } ∼ WN(σ 2 ) is a white noise process, and σ 2 is also estimated. In case of second order processes, due to the projection principle, it also comes out that ηt (the orthogonal component) is uncorrelated with the regressor, and so with the past values Xt−1 , Xt−2 , . . . Xt−n too. The Yule–Walker equations based on the first n autocovariances are:  a1 c(1) + · · · + an c(n) + σ 2 , k=0 c(k) = (5.3) a1 c(k − 1) + · · · + an c(k − n), k = 1, . . . , n. For real-valued time series, the Yule–Walker equations (5.3) for k = 1, . . . , n can be written in matrix form:       a1 c(1) c(0) c(1) . . . c(n − 1)      c(1) c(0) . . . c(n − 2)   a2   c(2)   (5.4)   ·  ..  =  ..  . .. .. .. ..   .  .  . . . . c(n) c(n − 1) c(n − 2) . . . c(0) an These are the Gauss normal equations, see Appendix C. If the coefficient matrix is strictly positive definite, then we have a unique solution. Substituting this solution in the first equation of (5.3), which is the same as equation (5.2), provides the solution for σ 2 . Here σ 2 = e2n if the order n of the AR process is fixed. More precisely, between the coefficients αi s of Equations (2.21) and the coefficients ai s of Equations (5.4) the relation ai = −αi holds, i = 1, . . . , n (here n stands for the order of the AR process). Also, in case of a stable AR(n) process, the first n Yule–Walker equations imply the next ones; while in other cases, the solution of the first n Yule– Walker equations just gives the best prediction using n past values, and they are rather called Gauss normal equations. Remark 5.3. (see [11], p. 424). If for some n ≥ 1 the covariance matrix Cn is positive definite, then the nth degree AR polynomial α(z) is causal in the sense that α(z) 6= 0 for z ≤ 1. Remark 5.4. Comparing Remarks 5.1 and 5.3, we can conclude the following. If for the autocovariance function of the process {Xt }, c(0) > 0 and limh→∞ c(h) = 0 hold, then the autocovariance matrix Cn = [c(i − j)]ni,j=1

1D prediction in the time domain

173

of (X1 , . . . , Xn )T assigned to the process is positive definite for every n ∈ N. Consequently, the process {Xt } has a (unique) stable AR(n) representation such that the first n autocovariances of it are c(0), . . . , c(n − 1), for every n ∈ N. However, if the sequence c(h) tends to 0, but not exponentially fast, these AR(n) representations based on just X1 , . . . , Xn do not approximate the process at all, and the process is not even necessarily regular. On the contrary, if Cn is singular for some n (and consequently, for larger ns too), then using its the generalized inverse, we get an AR(n) solution, but it is not stable. It is also important, that in case of a stationary process, the h-step ahead prediction, i.e. the prediction of Xn+h based on X1 , . . . , Xn can be easily concluded from the one-step ahead prediction, for h = 1, 2, . . . . In view of Hn ⊂ Hn+h−1 , ˆ n+h , ProjHn Xn+h = ProjHn ProjHn+h−1 Xn+h = ProjHn X we get the equation Cn an = dn (h) for the coefficients of the prediction in an , where dn (h) = [Cov(Xn+h , Xn ), . . . , Cov(Xn+h , X1 )]T, and it is [c(h), . . . , c(n + h − 1)]T in the stationary case. Equation (5.1) is the special case when h = 1 and dn = dn (1).

5.2.2

Innovations

Observe that, by the Gram–Schmidt procedure, the prediction error terms form an orthogonal sequence, and they are called innovations. In this way, Xn s can as well be expressed in terms of the innovations; in other words, Xn can be written as the linear combination of the normalized error terms that form a complete orthonormal system in Hn . Moreover, this is true in each step of the Gram–Schmidt process, so in the expansion of each Xn only the same and lower index error terms appear. We do it as follows. First, {Xt } is not necessarily stationary. Recall that Hn := Span{X1 , . . . , ˆ n+1 be the one-step ahead prediction error term, Xn }. Let ηn+1 := Xn+1 − X ˆ 1 = 0 (see Section 5.2.1), based on the n-length long past, for n = 0, 1, . . . . As X η1 = X1 ∈ H1 and the unique orthogonal decomposition ˆ 2 + η2 X2 = X ˆ 2 ∈ H1 and η2 ⊥ H1 , whenever H1 ⊂ H2 is a proper subspace works, where X (disregard the situation H1 = H2 , when η2 = 0). Therefore, η2 ∈ H2 and η2 ⊥ η1 . Further, X2 = l21 η1 + η2 .

174

Dimension reduction and prediction in the time and frequency domain

With the same considerations, ˆ j+1 + ηj+1 , Xj+1 = X

j = 2, . . . , n

ˆ j+1 ∈ Hj and ηj+1 ⊥ Hj if Hj ⊂ Hj+1 is a proper subspace. So with X ηj+1 ∈ Hj+1 and ηj+1 ⊥ ηj . In this way, we get the innovations η1 , . . . , ηn , the linear combination of which produces Xk as X1 = η1 ,

Xk =

k−1 X

lkj ηj + ηk ,

k = 2, . . . , n,

j=1

where the coefficients lkj are obtained recursively, together with the mean square one-step ahead prediction errors e2k−1 = kηk k2 (k = 2, . . . , n), kη1 k2 = c(0). Actually, this is the LDL (variant of the Cholesky) decomposition, see Appendix B. Indeed, with the notation ηn = (η1 , . . . , ηn )T and Xn = (X1 , . . . , Xn )T , we have to find an n × n lower triangular matrix Ln with entries lkj s and all 1s along its main diagonal such that Xn = Ln ηn .

(5.5)

Taking the covariance matrices on both sides, yields the LDL decomposition Cn = Ln Dn LTn .

(5.6)

If Cn is positive definite, then Dn = diag(c(0), e21 , . . . , e2n−1 ) is positive definite too. Ln is not singular (with diagonal entries 1s), hence ηn = L−1 n Xn , where is also lower triangular; therefore, the innovations can as well be written L−1 n in terms of the same or lower index Xt s. So the LDL decomposition gives the prediction errors (diagonal entries of Dn ), and the entries of Ln (below its main diagonal, which is constantly 1) are obtainable in a nested way; therefore, n does not play an important role here, see Appendix B. With increasing n, we just extend the rows of Ln . Summarizing, Xn =

n−1 X

ˆ n + ηn , lkj ηj + ηn = X

Var(η1 ) = c(0),

Var(ηn ) = e2n−1 , (5.7)

j=1

and it is true for any n = 1, 2, . . . . The situation further simplifies in the stationary case, when Cn is a Toeplitz matrix. However, Ln will not be Toeplitz, but asymptotically, it becomes more and more like a Toeplitz one, and the entries of Dn will be more and more similar to each other (the sequence e2n converges) as with n → ∞ the situation extends to the prediction by the infinite past case. This is the topic of the Wold decomposition, see Section 5.2.3.

1D prediction in the time domain

175

In the stationary case, the innovation ηn is non-zero if Hn−1 ⊂ Hn is a proper subspace. In this case, ηn s are true innovations. Recall that, by Remark 5.1, this holds true at the same time for any n whenever c(0) > 0 and limh→∞ c(h) = 0. In this case, we can also standardize the ηn s, and write Xn =

n X

˜lnj ξj ,

n = 1, 2, . . .

j=1

p where ξj = ηj /ej−1 and ˜lnj = lnj ej−1 , e0 = c(0), for j = 1, . . . , n. Here ξ1 , . . . , ξn form a complete orthonormal system in Hn . The coefficients ˜lnj s are obtainable by the Gram decomposition (see Appendix B) Cn = An ATn 1/2

where An = Ln Dn is a lower triangular solution, but it can be postmultiplied with any orthogonal matrix.

5.2.3

Prediction based on the infinite past

Going farther, in case of a stationary, non-singular process, we can project Xn+1 onto the infinite past Hn− = span {Xt : t ≤ n} and expand it in terms of an orthonormal system, that is called Wold decomposition. This part will be the regular (causal) part of the process, whereas, the other, singular part, is orthogonal to it. Note that this singular part is of Type (0) deterministic (see Section 2.9). Recall that a singular (deterministic) process has no future, only (remote) past; whereas a regular (purely non-deterministic) process has only future, and no past. A non-singular process has future (with or without past). Also, by stationarity, the one-step ahead prediction error σ 2 = kXn+1 − ProjHn− Xn+1 k = E(Xn+1 − ProjHn− Xn+1 )2 does not depend on n, and it is positive, since the process is non-singular. Again, the Wold decomposition gives Xn =

∞ X

bj ηn−j + Yn ,

j=0

where {Yn } is of Type (0) singular and {ηt } is white noise with variance σ 2 , b0 = 1. If Yn = 0 for all n, the process {Xn } is regular. The coefficients bi /σ are the impulse responses. Because of the stationarity and infinite past, b has a single index. Here the coefficients bj s are the limiting values of lnj s when n → ∞ in (5.5). It is in accord with the earlier observation that the matrix Ln will be closer and closer to a Toeplitz one, if we disregard the first finitely many rows of it.

176

Dimension reduction and prediction in the time and frequency domain

Note that the innovation process is a MA(∞) process, which is a causal TLF. Here ηn is not considered as an error term, but rather than positive information that is not contained in the past of Xn . This is why it is called innovation. Wold derives his celebrated decomposition theorem for real, univariate stationary time series in the following situation: the one-step ahead prediction of Xt is based on its n-length long past and n → ∞. More precisely, let us fix Xt and consider its one-step ahead prediction, based on its n-length long past. In view of Equation (5.7) and by stationarity, the mean square prediction error does not depend on t, it only depends on n, and was denoted by e2n . It can be written in many equivalent forms, see the theory of multivariate regression [44] and Equation (C.3): 2 e2n = c(0)(1 − rX ) = c(0) − dTn Cn−1 dn , t ,(Xt−1 ,...,Xt−n ) 2 where rX is the squared multiple correlation coefficient between t ,(Xt−1 ,...,Xt−n ) Xt and (Xt−1 , . . . , Xt−n ); it does not depend on t either, and obviously increases (does not decrease) with n, i.e. e21 ≥ e22 ≥ . . . . The mean square error can as well be written with the determinants of the consecutive Toeplitz matrices Cn and Cn+1 . The next proposition is also used in [62], but here we give a simple proof by means of the determinants of block matrices.

Proposition 5.1. If for some n, |Cn | 6= 0, then e2n = c(0) − dTn Cn−1 dn =

|Cn+1 | . |Cn |

(5.8)

Proof. We use block matrix techniques for the following partitioned matrix:   Cn dn Cn+1 = . dTn c(0) It is known that |Cn+1 | = |Cn − dn c−1 (0)dTn | · |c(0)| = c(0)|Cn (In − Cn−1 dn dTn /c(0))| = c(0)|Cn | · |In − Cn−1 dn dTn /c(0)| = c(0)|Cn | · (1 − λ(Cn−1 dn dTn /c(0))), where λ(Cn−1 dn dTn /c(0)) is the only nonzero eigenvalue of the matrix Cn−1 dn dTn /c(0), which is of rank 1. Indeed, the rank of the dyad dn dTn is 1, and the multiplication with another matrix cannot increase this rank. Therefore, the eigenvalues of In − Cn−1 dn dTn /c(0) are 1 − λ(Cn−1 dn dTn /c(0)) and 1 (with multiplicity n − 1). So its determinant is 1 − λ(Cn−1 dn dTn /c(0)) = 1 − λ(Cn−1 dn dTn )/c(0) = 1 − tr(Cn−1 dn dTn )/c(0) = 1 − tr(dTn Cn−1 dn )/c(0) = 1 − dTn Cn−1 dn /c(0) =

c(0) − dTn Cn−1 dn , c(0)

1D prediction in the time domain

177

where we used that the only nonzero eigenvalue of a rank 1 matrix is its trace and the cyclic commutativity of the trace operator. Putting things together: |Cn+1 | = c(0)|Cn |

c(0) − dTn Cn−1 dn = |Cn |(c(0) − dTn Cn−1 dn ), c(0)

that proves the statement. Remark 5.5. If |Cn | = 0 for some n, then |Cn+1 | = |Cn+2 | = · · · = 0 too. The smallest index n for which this happens indicates that there is a linear relation between n consecutive Xj s, but no linear relation between n − 1 consecutive ones (by stationarity, this property is irrespective of the position of the consecutive random variables). This can happen only if some Xt linearly depends on n − 1 preceding Xj s. In this case e2n−1 = 0 and, of course e2n = e2n+1 = · · · = 0 too. In any case, e21 ≥ e22 ≥ . . . is a decreasing (non-increasing) nonnegative sequence, and in view of Equation (5.8), |C1 | = c(0),

|Cn | = c(0)e21 . . . e2n−1 ,

n = 2, 3, . . . ,

(5.9)

so, provided c(0) > 0, |Cn | = 0 holds if and only if e2n−1 = 0. Note that in this stationary case there is no sense of using generalized inverse if |Cn | = 0, since then exact one-step ahead prediction with the n − 1 long past can be done with zero error, and this property is manifested for longer past predictions too. Remark 5.6. Equation (5.9) can as well be obtained from the LDL decomposition of Equation (5.6). Indeed, |Cn | = |Dn | = c(0)e21 . . . e2n−1 ,

(5.10)

utilizing that the diagonal entries of Dn are the prediction errors. In this way, Equation (5.10) implies another proof of Proposition 5.1. In the light of these, there are the following possibilities: • Ck is positive definite up to k ≤ h, but |Ch | = 0 for some positive integer h (and so, |Ck | = 0 for k > h too). Wold calls such a process singular of rank h. Then, by Remark 5.5, e21 ≥ e22 ≥ · · · > e2h−1 = e2h = · · · = 0. So Xt can be exactly predicted based on its (h − 1)-length long past. This is caused by periodicities, for instance, in case of the Type(0) singular process of Section 2.10.1. In this case, c(h) cannot tend to 0, otherwise all the Ch s were positive definite, in view of Remark 5.1. • |Cn | 6= 0 for any n, and so, e2n > 0 for every n, but still, limn→∞ e2n = 0 in a decreasing (non-increasing) way. Wold calls such a process singular of infinite rank. This is caused by hidden periodicities, for instance, the Type (1) and Type (2) singular process of Section 2.10.2. (Then limh→∞ c(h) = 0, but not exponentially fast.)

178

Dimension reduction and prediction in the time and frequency domain

• In the remaining (non-singular) case, e2n → σ 2 as n → ∞ in a decreasing (non-increasing) way, where 0 < σ 2 < c(0). (In case of ARMA processes limh→∞ c(h) = 0 exponentially fast.) Wold shows that the residual process ηt,n (one-step ahead prediction error term of predicting Xt with its n-length long past) is stationary for any fixed n. After a passage to the limit, the process {ηt,n } converges in probability to the residual process {ηt } as n → ∞. We cite the exact theorem (Theorem 6 in [62]): Theorem 5.1. A residual process {ηt } obtained from a non-singular stationary process {Xt } is stationary and non-autocorrelated. Further, ηt is noncorrelated with Xt−1 , Xt−2 , . . . , while p Var(ηt ) Var(ηt ) σ Cov(Xt , ηt ) p p =p = p =p . Corr(Xt , ηt ) = p c(0) Var(ηt ) c(0) Var(ηt ) c(0) c(0) Wold notes that the arguments used in the proof of this theorem also apply to the singular cases. As the residual variables ηt are here vanishing, their correlation properties will be indeterminate. Accordingly, these cases do not need further comment.

5.3 5.3.1

Multidimensional prediction One-step ahead prediction based on finitely many past values

We have a d-dimensional real time series {Xt } with components Xt = [Xt1 , . . . , Xtd ]T . It is not necessarily stationary, we just assume the existence of the second moments and cross-moments. For simplicity, the state space is Rd , but the time is discrete (t ∈ Z). Assume that E(Xt ) = 0 (t ∈ Z). Select a starting observation X1 and Hn := Span{Xtj : t = 1, . . . , n; j = 1, . . . , d}. (Precisely, it should be denoted by Hn (X), but as the process is fixed, it is briefly denoted by Hn . However, this Hn is not the same as in the 1D situation.) We want to linearly predict Xn+1 based on past values X1 , . . . , Xn . Let ˆ 1 := 0, and denote by X ˆ n+1 the best one-step ahead linear prediction that X minimizes the mean square error ˆ n+1 )2 = kXn+1 − X ˆ n+1 k2 , E(Xn+1 − X

n = 1, 2, . . .

Multidimensional prediction

179

ˆ n+1 = Proj Xn+1 , i.e. in the Hilbert-space setup of Section 5.2.1. Thus, X Hn the projection of Xn+1 onto the linear subspace Hn . In the Gaussian case, the ˆ n+1 = E(Xn+1 | X1 , . . . , Xn ), which is the instance of simultanesolution is X ous linear regressions for the components of Xn+1 by predictors X1 , . . . , Xn . In the general case, we have to solve a system of linear equations that resembles (5.1). Indeed, the projection is looked for in the form ˆ n+1 = An1 Xn + · · · + Ann X1 , X

(5.11)

ˆ n+1 ) ⊥ Xn+1−k for where An1 , . . . Ann are d × d matrices. But (Xn+1 − X k = 1, . . . , n in the sense that ˆ n+1 )XT E[(Xn+1 − X n+1−k ] = Od ,

k = 1, . . . n,

(5.12)

where Od is the d × d zero matrix. Equations (5.11) and (5.12) together yield the following system of linear equations: n X

Anj Cov(Xn+1−j , Xn+1−k ) = Cov(Xn+1 , Xn+1−k ),

k = 1, . . . , n,

j=1

(5.13) where Cov now denotes an d × d cross-covariance matrix. This is the extension of the Gauss normal equations for parallel linear predictions with d-dimensional target, see Lemma C.2 of Appendix C. When {Xt } is stationary, then Equation (5.13) simplifies to n X

Aj C(k − j) = C(k),

k = 1, . . . , n,

j=1

where C(k) is the kth order d × d autocovariance matrix. This provides a system of d2 n linear equations with the same number of unknowns that always has a solution. Further, the solution does not depend on the selection of the time of the starting observation X1 , and no double indexing of the coefficient matrices is necessary. For the block matrix version, see Appendix C. The coefficient matrix is just Cn , which is always positive semidefinite. If positive definite, we have a unique solution; otherwise, with block matrix techniques, reduced rank innovations are obtained. There are recursions to solve this system (e.g. the Durbin–Levinson algorithm), see [11], which resembles the set of the first n Yule–Walker equations for a multidimensional VAR(n) processes. Proposition 5.2. ([11], p.424). If for some n ≥ 1 the covariance matrix of (XTn+1 , . . . , XT1 )T is positive definite, then the matrix polynomial α(z) = I − A1 z − · · · − An z n is causal in the sense that the determinant |α(z)| 6= 0 for z ≤ 1.

180

5.3.2

Dimension reduction and prediction in the time and frequency domain

Multidimensional innovations

Analogously to the 1D situation, Xt can again be expanded in terms of the now d-dimensional innovations, i.e. the prediction error terms ˆ n+1 . ηn+1 := Xn+1 − X It can be done step by step as follows. Assume that the nd × nd covariance matrix Cn of the components of X1 , . . . , Xn is positive definite for every n ≥ 1. ˆ 1 := 0, η1 := X1 and consider the unique orthogonal decomposition Let X ˆ 2 + η2 , X2 = X ˆ 2 ∈ H1 and η2 ⊥ H1 , whenever H1 ⊂ H2 is a proper subspace where X (disregard the situation H1 = H2 , when η2 = 0). Therefore, η2 ∈ H2 and η2 ⊥ η1 . With the same considerations, ˆ j+1 + ηj+1 , Xj+1 = X

j = 2, . . . , n

ˆ j+1 ∈ Hj and ηj+1 ⊥ Hj if Hj ⊂ Hj+1 is a proper subspace. So with X ηj+1 ∈ Hj+1 and ηj+1 ⊥ ηj . In this way, we get the innovations η1 , . . . , ηn that trivially have 0 expectation and form an orthogonal system in the nd-dimensional Hn (their pairwise cross-covariance matrices are zeros). We consider the first n steps, i.e. the recursive equations Xk =

k−1 X

Bkj ηj + ηk ,

k = 1, 2, . . . , n

(5.14)

j=1

in the case when the observations X1 , . . . , Xn are available. If our process is stationary, the coefficient matrices are irrespective of the choice of the starting time. The ηj s are not zeros if Hn ⊂ Hn+1 are proper subspaces, i.e. they are true innovations. However, it can be, that though they are not zeros, they span a lower than d-dimensional subspace, i.e. their covariance matrix Ej = Eηj ηj0 is not zero, but a positive semidefinite matrix of reduced rank. When we go to the future, then look back to the “infinite” past, and obtain the multidimensional Wold decomposition (see Section 4.4 and the forthcoming explanation at the end of this section). Multiplying the equations in (5.14) by XTj from the right, and taking expectation, the solution for the matrices Bkj and Ej (k = 1. . . . , n; j = 1, . . . , k − 1) can be obtained via the block Cholesky (LDL) decomposition: Cn = Ln Dn LTn ,

(5.15)

where Cn is nd × nd positive definite block Toeplitz matrix of general entry C(i − j), see (1.3). Dn is nm × nm block diagonal and contains the positive semidefinite prediction error matrices E1 , . . . , En in its diagonal blocks,

Multidimensional prediction

181

whereas Ln is nd × nd lower triangular with blocks Bkj s below its diagonal blocks which are d × d identities, so Ln is non-singular. In matrix form,     E1 O . . . O O I O ... O O  O E2 . . . O O   B21 I ... O O     , Dn =  . Ln =  .  .. .. .. ..  . . . . . . . . . . .  .  . . . . .  . . . . En (5.16) To find the block Cholesky decomposition of (5.16), the following recursion is at our disposal: for j = 1, . . . , n Bn1

Bn2

...

Bn,n−1

Ej := C(0) −

j−1 X

O

I

T Bjk Ek Bjk ,

O

j = 1, . . . , n

...

O

(5.17)

k=1

and for i = j + 1, . . . , n Bij :=

C(i − j) −

j−1 X

! T Bik Ek Bik

Ej+ ,

(5.18)

k=1

where we take the Moore–Penrose inverse if necessary. Note that Equation (5.15) implies the following: |Cn | = |Dn | =

n Y

|Ej |

j=1

that is the multi-D analogue of the 1D Equation (5.10). Note that here E1 = C(0), and Ej is analogous to e2j−1 there. Also note that En is the error covariance matrix of the prediction of Xn based on its (n−1)-length long past. In the stationary case, if we predict based on the n-length long past, then we project on a richer subspace, therefore the prediction errors of the linear combinations of the coordinates of Xn are decreased (better to say, not increased). Consequently, by Remark C.2 of Appendix C, the ranks of the error covariance matrices En s are also decreased (not increased) as n → ∞. If the prediction is based on the infinite past, then with n → ∞ this procedure (which is a nested one) extends to the multidimensional Wold decomposition. We can construct a causal TLF in this way. Actually, here n = t, and as observations arrive, Xn is predicted based on past values X1 , . . . , Xn−1 , and so, ηn is in fact, ηt,n . By stationarity, it has the same distribution for all t, especially for t = n. Also, if n → ∞, the matrix Ln better and better approaches a Toeplitz one, and the matrices E1 , . . . , En are closer and closer to Σ, the covariance matrix of the innovation process {ηt } that is the limit of {ηt,n }. In this way, we get the multi-D analogue of the 1D Theorem 5.1, according to which, ηn → η in mean square: kEn − Σk = kE(ηn ηnT ) − E(ηη T )k → 0

182

Dimension reduction and prediction in the time and frequency domain

as n → ∞. Consequently, Bnj → Bj as n → ∞ as it continuously depends on Ej s in view of Equations (5.18). Also, if there is a gap in the spectrum of Σ, like λ1 ≥ · · · ≥ λr ≥ ∆  ε ≥ λr+1 ≥ · · · ≥ λd , then there is a gap in the spectrum of En too. Indeed, to any δ > 0 there is an N such that for n ≥ N : kEn − Σk < δ. Then for the eigenvalues of En , (n)

λ1

(n)

(n)

≥ · · · ≥ λ(n) ≥ ∆ − δ  ε + δ ≥ λr+1 ≥ · · · ≥ λd . r

Consequently, for the best rank r approximations (with Gram-decompositions): kΣ − Σr k ≤ ε and kEn − Enr k ≤ δ + ε holds by Theorem B.5 (Weyl perturbation theorem). Therefore, kΣr − Enr k ≤ kΣr − Σk + kΣ − En k + kEn − Enr k ≤ ε + δ + (δ + ε) = 2(δ + ε) that can be arbitrarily close to 2ε. At the same time, the projections onto the subspaces spanned by the eigenvectors of the r structural eigenvalues of these matrices are close to each other, in the sense of Theorem B.6 (Davis–Kahan theorem). Let S1 := [∆ − δ, λ1 + δ] and S2 := [λd + δ, ε + δ]. Then for n > N : kPΣ (S1 ) − PEn (S1 )k2F = kPΣ (S1 )k2F + kPEn (S1 )k2F − 2 tr[PΣ (S1 )PETn (S1 )] = 2r − 2 tr[PΣ (S1 )(Id − PETn (S2 ))] = 2r − 2 tr[PΣ (S1 ) − PΣ (S1 )PETn (S2 )] = 2r − 2r + 2 tr[PΣ (S1 )PETn (S2 )] ≤ 2dkPΣ (S1 )PETn (S2 )k cδ c kΣ − En k ≤ 2d ≤ 2d ∆−δ−ε ∆−δ−ε that can be arbitrarily small if δ is arbitrarily small. Here we also used Lemma B.3 and Theorem B.6. Going further, when the Ej s are of rank r < d, we can find a system ξ1 , . . . , ξn ∈ Rr in the d-dimensional innovation subspaces that span the same subspace as η1 , . . . , ηn . (Though, in this situation, the block Cholesky decomposition algorithm should be modified by taking generalized inverses.) If the rank is not exactly r (may be full), but the spectral density matrix has r < d structural eigenvalues, then ξj ∈ Rr , Eξj ξj0 = Ir is the principal component factor of ηj obtained from the r-factor model η j = A j ξj + ε j , p where the columns of d × r matrix Aj are λj` uj` with the r largest eigenvalues and the corresponding eigenvectors of Ej ; the vector εj is the error

Spectra of spectra

183

comprised of both the idiosyncratic noise and the error term of the model, but it has a negligible L2 -norm. Note that Aj of the decomposition Ej = Aj ATj is far not unique, it can be post-multiplied with an r × r orthogonal matrix. With this, k X Xk ∼ Bkj Aj ξj , k = 1, 2, . . . , n (5.19) j=1

where Bkk = Ik . This approaches the following Wold decomposition of the d-dimensional process {Xt } with an r-dimensional (r ≤ d) innovation process {ξt }: ∞ ∞ X X Xt = Bj ηt−j = Bj Aξt−j , j=0

j=0

where limk→∞ Bkj = Bj is d × d matrix; {ηt } is a d-dimensional white-noise sequence with covariance matrix Σ of rank r (actually, Σ it is the limit of the sequence En ), and {ξt } is an r-dimensional white-noise sequence with covariance matrix Ir . Further, Σ = AAT is the Gram-decomposition of the matrix Σ of rank r, where A is d × r (see Appendix B). Then the matrix sequence Bj A plays the role of the d × r coefficient matrices in the multidimensional Wold decomposition of Section 4.4. Note that here we use nd × nd block matrices, but the procedure, realized by Equations (5.17) and (5.18), iterates only with the d × d blocks of them, so the computational complexity of this algorithm is not significantly larger than that of the K´ alm´ an’s filtering of Section 5.5. However, in the next Section 5.4, we can decrease this computational complexity in the frequency domain.

5.4

Spectra of spectra

Let {Xt } be a d-dimensional, weakly stationary time series with real components and autocovariance matrices C(h), h ∈ Z, C(−h) = C T (h). Consider the finite segment X1 , . . . , Xn ∈ Rd of it and the nd × nd covariance matrix Cn of the compounded random vector [XT1 , . . . , XTn ]T ∈ Rnd , as introduced in Equation (1.3). This is a symmetric, positive semidefinite block Toeplitz matrix, the (i, j) block of which is C(j − i). The symmetry comes from the fact, that the (j, i) entry is C(i − j) = C T (j − i). To characterize the eigenvalues of the block Toeplitz matrix Cn , we need (s) the symmetric block circulant matrix Cn that we consider for odd n, say n = 2k + 1 here (for even n, the calculations are similar); for the definition, see [53]. In fact, the rows of a circulant matrix are cyclic permutations of the preceding ones; whereas, in the block circulant case, when permuting, we take transposes of the blocks if those are not symmetric themselves, see the example of Equation (5.20). Spectra of block circulant matrices are well characterized,

184

Dimension reduction and prediction in the time and frequency domain (s)

but Cn is not block circulant, in general; this is why Cn is constructed, by disregarding the autocovariances of order greater than n2 . This can be done only on the assumption that the sequences of autcovariances are (entrywise) absolutely summable. (s) The (i, j) block of Cn for 1 ≤ i ≤ j ≤ n is  C(j − i), j−i≤k (s) Cn (blocki , blockj ) = C(n − (j − i)), j − i > k; whereas, for i > j, it is C(s) n (blocki , blockj ) =



C T (i − j), i−j ≤k C T (n − (i − j)), i − j > k.

(s)

In this way, Cn is a symmetric block Toeplitz matrix, like Cn , and it is the (s) same as Cn within the blocks (i, j)s for which |j − i| ≤ k holds. However, Cn is also a block circulant matrix that fits our purposes. For example, if n = 7 and k = 3, then we have   C(0) C(1) C(2) C(3) C(3) C(2) C(1) C T (1) C(0) C(1) C(2) C(3) C(3) C(2)  T  C (2) C T (1) C(0) C(1) C(2) C(3) C(3)   (s) T T T C(1) C(2) C(3) C7 :=  C T (3) C T (2) C T (1) C(0)  . (5.20) C (3) C (3) C (2) C T (1) C(0) C(1) C(2)  T  C (2) C T (3) C T (3) C T (2) C T (1) C(0) C(1) C T (1) C T (2) C T (3) C T (3) C T (2) C T (1) C(0) In the univariate (d = 1) case, when n = 2k + 1, by Kronecker products (with permutation matrices) it is well known, see e.g. [19, 11, 53], that the (s) jth (real) eigenvalue of Cn is k X h=−k

c(h)ρhj = c(0) + 2

k X

c(h) cos(hωj ),

h=1

where ρj = eiωj is the jth primitive (complex) nth root of 1 and ωj = 2πj n is the jth Fourier frequency (j = 0, 1, . . . , n − 1). Further, the eigenvector √ corresponding to the jth eigenvalue is (1, ρj , . . . , ρn−1 )T ; it has norm n. j After normalizing with √1n , we get a complete orthonormal set of eigenvectors (of complex coordinates). When C(h)s are d × d matrices, by inflation techniques and applying Kronecker products, we use blocks instead of entries and the eigenvectors also follow a block structure. In [19, 53], the eigenvalues and eigenvectors of a general symmetric block circulant matrix are characterized. We apply this result in our situation, when n = 2k + 1 is odd. In view of this, the spectrum (s) of Cn is the union of spectra of the matrices Mj = C(0)+

k X

[C(h)ρhj +C T (h)ρ−h j ] = C(0)+

h=1

k X

[C(h)eiωj h +C T (h)e−iωj h ]

h=1

Spectra of spectra

185

for j = 0, 1, . . . , n − 1; whereas, the eigenvectors are obtained by compounding the eigenvectors of these d×d matrices. So we need the spectral decomposition Pk of the matrices M0 = C(0) + h=1 [C(h) + C T (h)] and Mj = C(0) +

k X

[(C(h) + C T (h)) cos(ωj h) + i(C(h) − C T (h)) sin(ωj h)]

h=1

for j = 1, 2, . . . , n − 1. Since C(h) + C T (h) is symmetric and C(h) − C T (h) is anti-symmetric with 0 diagonal, Mj is self-adjoint for each j and has real eigenvalues with corresponding orthonormal set of eigenvectors of possibly complex coordinates. Indeed, Mj may have complex entries if j 6= 0; actually, Pk Pk T T h=1 (C(h) + C (h)) cos(ωj h) is the real and h=1 (C(h) − C (h)) sin(ωj h) is the imaginary part of Mj . It is easy to see that Mn−j = Mj (entrywise conjugate), therefore, it has the same (real) eigenvalues as Mj , but its (complex) eigenvectors are the (componentwise) complex conjugates of the eigenvectors of Mj . We also need the following form of this matrix: Mn−j = C(0) +

k X

[(C(h) + C T (h)) cos(ωj h) − i(C(h) − C T (h)) sin(ωj h)]

h=1

= C(0) +

k X

[C(h)e−iωj h + C T (h)eiωj h ],

j = 1, . . . , n − 1.

h=1

(5.21) (s) Summarizing, for odd n = 2k + 1, the nd eigenvalues of Cn are obtained as the union of the (real) eigenvalues of M0 and those of Mj (j = 1, . . . , k) duplicated. Note that for even n, similar arguments hold with the difference (s) that there the spectrum of Cn is the union of the eigenvalues of M0 and Mn−1 , whereas the eigenvalues of M1 , . . . , M n2 −1 are duplicated. (s)

The eigenvectors of Cn are obtainable by compounding the d (usually complex) orthonormal eigenvectors of the d × d self-adjoint matrices M0 , M1 , . . . , Mn−1 as follows. For j = 1, . . . , k: if v is a unit-length eigenvector of Mj with eigenvalue λ, then in [53] it is proved that the compound vector (vT , ρj vT , ρ2j vT , . . . , ρn−1 vT )T ∈ Cnd j (s)

is an eigenvector of Cn with the same eigenvalue λ. It has squared norm −(n−1)

n−1 2 −2 v∗ v(1 + ρj ρ−1 ρj j + ρj ρj + · · · + ρj

) = n.

Therefore, the vector 1 w = √ (vT , ρj vT , ρ2j vT , . . . , ρn−1 vT )T ∈ Cnd j n (s)

is a unit-norm eigenvector (of complex coordinates) of Cn .

(5.22)

186

Dimension reduction and prediction in the time and frequency domain Further, if 1 z = √ (tT , ρ` tT , ρ2` tT , . . . , ρ`n−1 tT )T ∈ Cnd n (s)

is another unit-norm eigenvector of Cn compounded from a unit-norm eigenvector t of another M` (` 6= j), then w and z are orthogonal, irrespective whether M` has the same eigenvalue λ as Mj or not. Similar construction holds starting with the eigenvectors of M0 . Here for each j = 0, 1, . . . , n−1, there are d pairwise orthonormal eigenvectors (potential vs) of Mj , and the so obtained ws are also pairwise orthonormal. Assume that the eigenvectors of Mj are enumerated in non-increasing order of its (real) eigenvalues, and the inflated ws also follow this ordering, for j = 0, 1, . . . , n − 1. Choose a unit-norm eigenvector v ∈ Cd of Mj with (real) eigenvalue λ. Then v ∈ Cd is the corresponding unit-norm eigenvector of Mn−j with the same eigenvalue λ. Consider the compounded w ∈ Cnd and w ∈ Cnd obtained from them by Equation (5.22). We learned that they are orthonormal (s) eigenvectors of Cn corresponding to the eigenvalue λ with multiplicity (at least) two. From them, corresponding to this double eigenvalue λ, the new orthonormal pair of eigenvectors w+w w−w √ and − i √ (5.23) 2 2 is constructed, but they, in this order, occupy the original positions of w and w. They√have real coordinates and unit norm; actually, their coordinates contain the 2 multiples the real and imaginary parts of the corresponding coordinates of w. It is in accord with the fact that a real symmetric matrix, (s) as Cn , must have an orthogonal system of eigenvectors with real coordinates too. We do not go in details, neither discuss defective cases. Consider u1 , . . . , und , the so obtained orthonormal set of eigenvectors (s) (of real coordinates) of Cn (in the above ordering), and denote by U = (u1 , . . . , und ) the nd × nd (real) orthogonal matrix containing them columnwise. Let Cn(s) = U Λ(s) U T (5.24) be the corresponding spectral decomposition. After this preparation, we are able to prove the following theorem. Theorem 5.2. Let {Xt } be d-dimensional weakly stationary time series of real components. Denoting by C(h) = [cij (h)] the d×d autocovariance matrices (C(−h) = C T (h), h ∈ Z) inPthe time domain, assume that their entries ∞ are absolutely summable, i.e. h=0 |cpq (h)| < ∞ for p, q = 1, . . . , d. Then, the self-adjoint, positive semidefinite spectral density matrix f exists in the frequency domain, and it is defined by f (ω) =

∞ 1 X C(h)e−ihω , 2π h=−∞

ω ∈ [0, 2π].

Spectra of spectra

187

For odd n = 2k + 1, consider X1 , . . . Xn and the block Toeplitz matrix Cn of (1.3); further, the Fourier frequencies ωj = 2πj n for j = 0, . . . , n − 1. Let Dn be the dn × dn diagonal matrix that contains the spectra of the matrices f (0), f (ω1 ), f (ω2 ), . . . , f (ωk ), f (ωk ), . . . , f (ω2 ), f (ω1 ) in its main diagonal, i.e. Dn = diag(spec f (0), spec f (ω1 ), . . . , spec f (ωk ), spec f (ωk ), . . . , spec f (ω1 )). Here spec denotes the eigenvalues of the affected matrix in non-increasing order if not otherwise stated. (The duplication is due to the fact that f (ωj ) = f (ωn−j ), j = 1, . . . , k, for real time series). Then, with the spectral decomposition (5.24), U T Cn U − 2πDn → O, n → ∞, i.e. the entries of the matrix U T Cn U − 2πDn tend to 0 uniformly as n → ∞. (s)

Proof. We saw that U T Cn U = Λ(s) . Recall that the eigenvalues in the diagonal of Λ(s) comprise the union of spectra of the matrices M0 and those of M1 , . . . , Mn−1 , which are the same as the eigenvalues of M0 and those of Mn−1 , . . . , Mn−k of (5.21), duplicated. But these matrices are finite sub-sums (for |h| ≤ k) of the infinite summations 2πf (ωj ) =

∞ X

C(h)e−ihω = C(0) +

∞ X

[C(h)e−iωj h + C T (h)eiωj h ].

h=1

h=−∞

So, by the absolute summability of the autocovariances, and because the eigenvalues depend continuously on the underlying matrices, the pairwise distances between the eigenvalues of Mj and the corresponding eigenvalues of 2πf (ωj ) (both in non-increasing order) tend to 0 as n → ∞, for j = 0, 1, . . . , k. Indeed, the absolute summability of the entries of C(h)s implies that the diagonal entries of the diagonal matrix Λ(s) − 2πDn are bounded in absolute value by X |cpq (h)| → 0, n = 2k + 1 → ∞. max p,q∈{1,...,d}

|h|>k

Therefore, the matrix Λ(s) − 2πDn tends to the zero matrix entrywise uni(s) formly as n → ∞. It remains to show that the entries of UT Cn U − UT Cn U tend to 0 uniformly as n → ∞. Before doing this, some facts should be clarified. • The pth row sum of Mj is bounded by d X q=1

|cpq (0)| +

d X k X q=1 h=1

|cpq (h)| +

d X k X q=1 h=1

|cqp (h)| ≤ dcpp (0) + 2dL,

188

Dimension reduction and prediction in the time and frequency domain P∞ for p ∈ {1, . . . , d} with L = maxp,q∈{1,...,d} h=1 |cpq (h)| > 0, independently of n, because of the absolute summability of the entries of C(h). This is true for any j ∈ {0, 1, . . . , n − 1}. For simplicity, consider (any) one of the Mj s, and denote it by M = [mpq ]dp,q=1 . Then kM k∞ =

max p∈{1,...,d}

d X

|mpq | ≤ d

q=1

max p∈{1,...,d}

cpp (0) + 2dL = K.

As the spectral radius of M is at most kM k∞ , any eigenvalue λ of M is bounded in absolute value by K (independently of n). • Recall that u is compounded via (5.22) andq(5.23) from the primitive roots. Therefore, its coordinates are bounded by n2 in absolute value. Now we are ready to show that T |uTi C(s) n uj − ui Cn uj | → 0,

n→∞ (s)

uniformly in i, j ∈ {1, . . . , nd}. Recall that in the nd × nd matrices Cn and Cn the (m, `) blocks are the same if |m − `| ≤ k. Denote by ui,m and uj,` the mth and `th blocks of the unit-norm eigenvectors ui and uj , respectively. Then |uTi (C(s) n − Cn )uj | k m X X [uTi,` (C(m) − C(n − m))uj,n−m+` = m=1 `=1 T + ui,n−m+` (C(n − m) − C(m))uj,` ] r r k 2 2 X T ≤2 m 1d (C(m) − C(n − m))1d n n m=1 k X

4 ≤ n ≤ 4d

m=1 2

≤ 4d2

m

d X d X p=1 q=1

|cpq (m)| +

k X m=1

m

d X d X

! |cpq (n − m)|

p=1 q=1

! k k X X m m |cpq (m)| + max |cpq (n − m)| max n n p,q∈{1,...,d} p,q∈{1,...,d} m=1 m=1 ! k n−1 X X k m max |cpq (m)| + max |cpq (m)| , n n p,q∈{1,...,d} p,q∈{1,...,d} m=1 m=n−k

where 1d ∈ Rd is the vector of all 1 coordinates and so, the quadratic form 1Td (C(m) − C(n − m))1d is the sum of the entries of C(m) − C(n − m). In the last line, the second term converges to 0, since it is bounded by

Spectra of spectra 189 P∞ Pn−1 P∞ k m=k |cpq (m)| (indeed, m=n−k n |cpq (m)| ≤ m=k |cpq (m)| as k < n − k), and together with n, k tends to ∞ too; further, it holds uniformly for all p, q ∈ {1, . . . , d}. The first term for every p, q pair also tends to 0 as n → ∞ by the discrete version of the dominated convergence theorem (for series), see the forthcoming Lemma 5.1. Indeed, the summand is dominated by |cpq (m)| P∞ ∞, for any fixed and m=1 |cpq (m)| < ∞; further, m n |cpq (m)| → 0 as n → P P∞ k m |c (m)| tends to 0, and so does m. Consequently, m=1 m m=1 n |cpq (m)| n pq as n → ∞. It holds uniformly for all p, q, and also for all i, j, so the proof is complete. Lemma 5.1. (Dominated convergence theorem for sums, discrete verP∞ sion). Consider f (m) and assume that |fn (m)| ≤ g(m) with m=1 n P∞ g(m) < ∞. If lim f (m) = f (m) exists ∀m ∈ N, then n→∞ n m=1 lim

∞ X

n→∞

fn (m) =

m=1

∞ X

f (m).

m=1

Some important consequences of Theorem 5.2 follow.

5.4.1

Bounds for the eigenvalues of Cn

Proposition 5.3. Analogously to the 1D statement (see [11], Proposition 4.5.3), the above theorem implies the following. Assume that for the spectra of the spectral densities f of the d-dimensional weakly stationary process {Xt } of real coordinates the following hold: m :=

inf ω∈[0,2π],q∈{1,...,d}

M :=

sup

λq (f (ω)) ≥ 0, λq (f (ω)) < ∞.

ω∈[0,2π],q∈{1,...,d}

(Note that under the conditions of Theorem 5.2, f (ω) is continuous almost everywhere on [0, 2π], so the above conditions are readily satisfied.) Then for the eigenvalues λ1 ≤ λ2 ≤ · · · ≤ λnd of the block Toeplitz matrix Cn the following holds: 2πm ≤ λ1 ≤ λnd ≤ 2πM. Proof. Let λ be an arbitrary eigenvalue of Cn with a corresponding eigenvector x ∈ Cnd , x∗ = [x∗1 , . . . , x∗n ], xj ∈ Cd : Cn x = λx. Take the spectral decomposition of the spectral density matrix f : f (ω) =

d X `=1

λ` (f (ω)) · u` (f (ω)) · u∗` (f (ω)).

190

Dimension reduction and prediction in the time and frequency domain

Then we can write that λ|x|2 = λx∗ x = x∗ Cn x Z π h in ∗ e−i(j−k)ω f (ω) =x · Z

−π n X

π

=

dω · x

j,k=1

e−i(j−k)ω x∗j f (ω)xk d ω

−π j,k=1

Z

n X

π

=

e−i(j−k)ω

−π j,k=1

Z =

d π X

d X

λ` (f (ω)) · x∗j · u` (f (ω)) · u∗` (f (ω)) · xk dω

`=1

λ` (f (ω))

−π `=1

n X

e−ijω x∗j · u` (f (ω)) · u∗` (f (ω)) · xk · eikω dω

j,k=1

2 X n −ijω ∗ = λ` (f (ω)) e · xj · u` (f (ω)) dω −π `=1 j=1 Z π n d X X ≤M x∗j · e−i(j−k)ω u` (f (ω)) · u∗` (f (ω)) dω · xk Z

π

d X

j,k=1 n X

= 2πM

−π

`=1

x∗j xj = 2πM |x|2 .

j=1

This proves that λ ≤ 2πM for any eigenvalue of Cn . The proof of the fact that λ ≥ 2πm is similar.

5.4.2

Principal component transformation as discrete Fourier transformation

The complex principal component (PC) transform of the collection of random vectors X = (XT1 , . . . XTn )T of real coordinates is the random vector Z = (ZT1 , . . . , ZTn )T of complex coordinates obtained by Z = W ∗ X. (s)

Here, analogously to (5.24), Cn also has the spectral decomposition (s) C(s) W ∗, n = WΛ

where the unitary matrix W = (w1 , . . . , wnd ), contains a complete orthonor(s) mal set of eigenvectors of Cn , columnwise. They usually have complex coordinates. To relate the PC transformation to a discrete Fourier transformation, we also make PC transformations within the blocks. For this purpose we use the

K´ alm´ an’s filtering

191

eigenvectors in the columns of W (of complex coordinates) in the ordering described in the preparation of Theorem 5.2. We utilize their block structure and also assume that they are already normalized to have a complete orthonormal system in Cnd . By Theorem 5.2, EZZ∗ ∼ 2πDn , so the coordinates of Z are asymptotically uncorrelated, for “large” n. Instead, we consider the blocks Zj s of it, and perform a “partial principal component transformation” (in d-dimension) of them. Let w1j , . . . , wdj be the columns of W corresponding to the coordinates of Zj . In view of (5.22), Zj can be written as 1 Zj = √ (Vj∗ ⊗ r∗ )X, n −(n−1)

−2 where r∗ = (1, ρ−1 ) and Vj is the d × d unitary matrix in the j , ρj , . . . , ρj spectral decomposition Mj = Vj Λj Vj∗ . Because of EZj Z∗j = Λj (apparently from the proof of Theorem 5.2), we have that

E(Vj Zj )(Vj Zj )∗ = Vj Λj Vj∗ = Mj . At the same time, n

1 1 X 1 Xt e−itωj , Vj Zj = √ Vj (Vj∗ ⊗r∗ )X = √ (Id ⊗r∗ )X = √ n n n t=1

j = 1, . . . , n.

This is the discrete Fourier transform of X1 , . . . , Xn . It is in accord with the existence of the orthogonal increment process {Zω } (see Chapter 1 and [11]) of which Vj Zj ∼ Zωj is the discrete analogue. Also, Z1 , . . . Zn are asymptotically pairwise orthogonal akin to V1 Z1 , . . . , Vn Zn . Further, E(Vj Zj )(Vj Zj )∗ ∼ 2πf (ωj ), and it is in accord with the fact that EZj Z∗j ∼ 2π diag spec f (ωj ), for j = 0, 1, . . . , n − 1 when n is “large”.

5.5

K´ alm´ an’s filtering

Given a linear dynamical system, with state equations and specified matrices, R.E. K´ alm´ an gave a recursive algorithm, how to find prediction for the state variable Xt in the possession of newer and newer observations for the b t+1|t are found, while observable variable Yt . Starting at time 0, estimates X observing Yt , t = 1, 2, . . . . The point is that we only use the last observation

192

Dimension reduction and prediction in the time and frequency domain

b t|t−1 . During the recursion, we use the linYt and the preceding estimate X earity of the state equations and the predictions, for which either normality is assumed, or we confine ourselves to the second moments of the underlying distributions, see Appendix C. This so-called filtering technique is widely used in the engineering practice, when we can “get rid” of the noise, and also possess an algorithm to find the innovations of the observed {Yt } process (we need not perform the block Cholesky decomposition of Section 5.3.2, but get the innovations recursively). The problem is that we merely have the output of a linear system that is burdened with noise, and usually not invertible; e.g. in case of telecommunication systems, when sensors can sense only noisy signals. It is not by accident that the research of R.E. K´ alm´ an and R.S. Bucy followed the era of the information theoretical breakthroughs, e.g. the intensive use of the Shannon entropy. Here we follow the discussion of R.E. K´alm´an’s original paper [32], where stationarity is not assumed, but the random vectors are Gaussian. (Sometimes we use simpler notation in accordance with the one used in the previous sections of this chapter.) Here the linear dynamical system is Xt+1 = At Xt + Ut Yt = Ct Xt ,

(5.25)

where At and Ct are specified matrices; At is an n × n matrix, called phase transition matrix, and Ct is p × n; further, Ut is an orthogonal noise process with EUt UTs = δst QU (t) and EXTs Ut = 0 for s ≤ t. All the expectations are zeros, and all the random vectors have real components. Sometimes Ut is called random excitation, Xt is the n-dimensional hidden state variable, while Yt is the p-dimensional observable variable. In the paper [32], p ≤ n is assumed, but it is not a restriction. Even if p = n, the matrix Ct is not invertible, otherwise the process Xt is trivially observable, unless a noise term is added to Ct Xt in the second equation (we will touch upon this possibility at the end of this section). So the problem is the following: starting the observations at time 0, given Y0 , . . . , Yt−1 , we want to estimate X component-wise, with minimum b denotes this estimate, then X b = mean square error. More precisely, if X ProjHt−1 (Y) X, where Ht−1 (Y) = Span (Y0 , . . . , Yt−1 ) consists of the linear combinations of all the components of Y0 , . . . , Yt−1 (with the notation of Appendix C, but here the indexing starts at 0) and the projection of X is meant component-wise as noted in Remark C.2. If we minimize the mean square error, the minimizer is the conditional expectation E(X | Y0 , . . . , Yt−1 ), which is the linear function of the coordinates of the random vectors in the condition, whenever the underlying distribution is Gaussian. If X = Xt , this is the prediction problem and we denote the optimal b t|t−1 . In a similar vein, X b t|t solves the one-step ahead prediction of Xt by X filtration problem, when we project onto Ht (Y) for the prediction; finally, b t|t+h solves the smoothing problem, when we project onto Ht+h (Y) with X

K´ alm´ an’s filtering

193

h > 0 integer. The first one-step ahead prediction problem can be generb t+h|t−1 , h > 0 integer (not the alized to the h-step ahead prediction of X same as the smoothing problem). The first problem is sometimes called extrapolation, whereas the second two interpolation, respectively, see also A.N. Kolmogorov [34] and N. Wiener [57]. Note that the problem itself is originated in the Wiener–Hopf problem. As for the one-step ahead prediction problem, if Y0 , . . . , Yt−1 are observed, i.e. Ht−1 (Y) is known, then the newly observed (measured) Yt can be orthogonally decomposed as e t|t−1 , e t|t−1 = Yt|t−1 + Y Yt = ProjHt−1 (Y) Yt + Y

(5.26)

e t|t−1 ∈ It (Y), and It (Y) is the so-called where the orthogonal component Y e t|t−1 span It (Y)). We innovation subspace. (Actually, the components of Y shall make intensive use of this innovation. Assume that It (Y) is not the sole 0 vector, otherwise observing Yt does not give any additional information to Ht−1 (Y). If {Yt } is weakly stationary, it means that the process is regular. Equation (5.26) implies the decomposition of the corresponding subspaces like Ht (Y) = Ht−1 (Y) ⊕ It (Y) (5.27) that is the analogue of the multidimensional Wold decomposition in the case when the prediction is based on finite past measurements. The multidimensional Wold decomposition applies to the stationary and infinite past case. When t → ∞, i.e. going to the future, we approach this situation. b t|t−1 is already known. We shall give a recursion to find Assume that X b t+1|t by using the new value of Yt . In view of Equation (5.27), we proceed X as follows: b t+1|t = ProjH (Y) Xt+1 = ProjH (Y) Xt+1 + ProjI (Y) Xt+1 X t t−1 t e t|t−1 = At ProjHt−1 (Y) Xt + ProjHt−1 (Y) Ut + Kt Y

(5.28)

b t|t−1 + Kt Y e t|t−1 , = At X where we utilized Lemma C.3 of Appendix C and the fact that Ut ⊥ Ht−1 (Y); we also used the first (state) equation of (5.25). Since ProjIt (Y) Xt+1 is a linear operation and results in a vector within It (Y) (linear combination of e t|t−1 ), its effect can be written as a matrix Kt times the components of Y e t|t−1 . This n × p matrix Kt is called K´ Y alm´ an gain matrix. (In fact, the notation K is first used in the paper [33] of K´alm´an and Bucy.) b t|t is produced, then a strongly related matrix Lt In another context, when X emerges that in some places is also called gain matrix; however, Kt = At Lt as it will be shown later in this section. In the stationary case, the rank r(≤ p) of the spectral density matrix of the process {Yt } is equal to the dimension of the innovation subspace if we predict based on the infinite past (see [40, 47]). In this stationary, infinite past case, Kt has a limiting value at the passage

194

Dimension reduction and prediction in the time and frequency domain

to infinity and it is unique only if the spectral density matrix of the process {Yt } is of full rank (r = p), or equivalently, if the p × p covariance matrix of the innovations is non-singular. If t → ∞, we approach the infinite past e t|t−1 Y e T ] has some near zero eigenvalues, based prediction, and so, if E[Y t|t−1 this is an indication of a reduced rank spectral density matrix of {Yt }, see Section 5.3.2. In the nonstationary case too, even if there are innovations (the innovations are not zeros), the innovation subspace can be of reduced rank, e t|t−1 Y e T ] is not invertible (we shall take its generalized in which case E[Y t|t−1 inverse later if necessary). e t|t−1 in terms To specify the K´ alm´ an gain matrix Kt , we have to write Y b t|t−1 and Yt . For this purpose, let us project both sides of the second of X (observation) equation of (5.25), i.e. of Yt = Ct Xt , to Ht−1 (Y). We get that b t|t−1 . Yt|t−1 = Ct X Taking the orthogonal decomposition (5.26) of Yt into consideration yields that e t|t−1 = Yt − Yt|t−1 = Yt − Ct X b t|t−1 . Y (5.29) We substitute this into the last line of Equation (5.28) and obtain that b t+1|t = At X b t|t−1 + Kt Y e t|t−1 = (At − Kt Ct )X b t|t−1 + Kt Yt . X With the notation A∗t = At − Kt Ct

(5.30)

for the updated transition matrix, we get the new linear dynamics: b t+1|t = A∗ X b X t t|t−1 + Kt Yt .

(5.31)

It is also important that Equations (5.28) and (5.31) give two equivalent forb t+1|t : mulas for the prediction of X b t+1|t = At X b t|t−1 + Kt (Yt − Ct X b t|t−1 ) = A∗ X b X t t|t−1 + Kt Yt .

(5.32)

We shall intensively use this equivalence. The estimation error is also governed by the linear dynamical system. This error term is e t+1|t = Xt+1 − X b t+1|t = At Xt + Ut − A∗ X b X t t|t−1 − Kt Ct Xt b t|t−1 ) + Ut = A∗ X e t|t−1 + Ut , = A∗ (Xt − X t

(5.33)

t

so A∗t is not only the transition matrix in (5.31), but it is also the transition matrix of the linear dynamical system governing the error. By the equivalence, stated in Equation (5.32), we get another expression for the same error term: e t+1|t = Xt+1 − X b t+1|t X b t|t−1 − Kt (Yt − Ct X b t|t−1 ) = At Xt + Ut − At X e t|t−1 + Ut − Kt (Yt − Ct X b t|t−1 ). = At X

(5.34)

K´ alm´ an’s filtering

195

In the heart of the algorithm there is a recursion for the propagation of the the covariance matrix of the above error term, which is defined as e t|t−1 X e T ]. P (t) = E[X t|t−1 Then we shall write P (t + 1) in terms of P (t) with the help of the two alternative Equations (5.33) and (5.34) for the same error term: e t+1|t X eT ] P (t + 1) = E[X t+1|t e t|t−1 + Ut )(At X e t|t−1 + Ut − Kt (Yt − Ct X b t|t−1 ))T ] = E[(A∗t X e t|t−1 X e T ]AT + QU (t) = A∗ P (t)AT + QU (t), = A∗ E[X t

t|t−1

t

t

(5.35)

t

where recall that QU (t) = E[Ut UTt ] and we used that Ut is uncorrelated with e t|t−1 too; further, that Yt − Ct X b t|t−1 is within the Xt and, therefore, with X innovation subspace It (Y). It remains to find an explicit formula for Kt , and thus, also for A∗t . Recall that Kt is the matrix of the linear operation ProjIt (Y) Xt+1 , therefore by the geometry of projections: e T ]+ , e T ][E(Y e t|t−1 Y Kt = [EXt+1 Y t|t−1 t|t−1 where + denotes the Moore–Penrose generalized inverse. See also the theory of simultaneous linear regressions in Appendix C, namely Equation (C.2); AT there plays the role of our Kt here. Now we calculate the matrices in brackets. e t = Ct X e t , we get that By the second equation of (5.25), that extends to Y T T eT e e e t|t−1 Y EY t|t−1 = E(Ct Xt|t−1 )(Ct Xt|t−1 ) = Ct P (t)Ct .

b t|t−1 and By the first and second equation of (5.25) and the orthogonality of X e t|t−1 : X eT eT b e eT EXt+1 Y t|t−1 = At EXt Yt|t−1 = At E(Xt|t−1 + Xt|t−1 )(Ct Xt|t−1 ) = At P (t)CtT .

(5.36)

Therefore, Kt = At P (t)CtT [Ct P (t)CtT ]+ .

(5.37)

Instead of the Moore–Penrose generalized inverse, we use the regular inverse provided the matrix in brackets is invertible, i.e. the innovation subspace It (Y) is of full dimension p, and Ct is of full rank p. Then the recursion starts at t = 1, when the systems of p linear equations b 1|0 b 1|0 = Y1|0 and C1 X1 = Y1 should be solved for the coordinates of X C1 X and X1 , respectively (the n coordinates are the unknowns). They obviously have a solution if C1 is of full rank. Here Y1|0 = E(Y1 ) if Y0 is a constant vector. Even if it is 0, the system has a nontrivial solution in the p ≤ n case.

196

Dimension reduction and prediction in the time and frequency domain

e 1|0 = X1 − X b 1|0 . In the original paper of K´alm´an [33], the following Then X b 1|0 := 0; X e 1|0 := X1 ; P (1) := E[X1 XT ]. This can be starting is suggested: X 1 the product moment estimate from the training sample (possible past). See also Remark 5.7. By using Remark C.1, we can summarize the above results and recursion as follows. b t+1|t of Xt+1 given Y0 , Y1 , . . . , Yt Proposition 5.4. The optimal estimate X is generated by the linear dynamical system b t+1|t = A∗ X b X t t|t−1 + Kt Yt . The estimation error term is given by e t+1|t = A∗ X e X t t|t−1 + Ut and the propagated covariance matrix of the estimation error term is e T ], e t|t−1 X P (t) = E[X t|t−1 while the expected quadratic loss is trP (t). The matrices involved are generated by the following recursion. Starting with b 1|0 = ProjY X1 , X e 1 = X1 −X b 1|0 , P (1) = E[X e 1|0 X e T ] = E[X1 XT ]−E[X b T ], b 1|0 X X 1 1|0 1|0 0 for t = 1, 2, . . . , the steps of the following recursion are uniquely defined: • Evaluate Kt by (5.37): Kt = At P (t)CtT [Ct P (t)CtT ]+ . • Input Yt . Output b t+1|t = (At − Kt Ct )X b t|t−1 + Kt Yt . X • Evaluate A∗t by (5.30): A∗t = At − Kt Ct . • Eventually, calculate P (t + 1) by (5.35) that completes the cycle: P (t + 1) = A∗t P (t)ATt + QU (t). Note that QU (t) is known/given or estimated from a training sample. In the forthcoming Remark 5.11, a symmetric expression is also given for the matrix P (t + 1). Some remarks are in order. b 1|0 = ProjY X1 = Σ ˆ XY Σ ˆ + Y0 , by Remark 5.7. As for the starting, X YY 0 Lemma C.2, where the last training sample entry can be chosen for Y0 . To initialize P (1), the whole training sample can be used. Another possibility is b 1|0 = 0, see [29]. to start with X

K´ alm´ an’s filtering

197

Remark 5.8. As a byproduct, the algorithm is able to get the innovations via Equation (5.29). Remark 5.9. In some situations, the observation equation also contains a noise term, for example, in [11]; whereas, in [22] the stationary case is treated with this generalization that results only in minor changes in the algorithm. In [33], K´ alm´ an and Bucy consider the continuous-time case, but they write that even in this case, the assumption that every observed signal contains a white noise term, “is unnecessary when the random processes in question are sampled (discrete-time parameter), see [32]”; even in the continuous-time case, it “is no real restriction since it can be removed in various ways”. However, the random excitation in the state (message) process “is quite basic; it is analogous to but somewhat less restrictive than the assumption of rational spectra in the conventional theory”. Indeed, K´alm´an uses only the regularity (causality) of the process if stationary, but not the rational spectral density. He mostly considers Gaussian processes that is not a restriction in the possession of second order processes, when we confine ourselves to the second moments. In this case, the state equations have the form Xt+1 = At Xt + Ut Yt = Ct Xt + Wt , where Wt is independent of Xt and Ut (latter condition can be relaxed by introducing the covariance matrix between Ut and Wt as a given parameter, see [11]); further the covariance matrix of the zero expectation Wt is QW = EWt WtT is also given. The only difference in the calculations is that now T T eT e e e t|t−1 Y EY t|t−1 = E(Ct Xt|t−1 + Wt )(Ct Xt|t−1 + Wt ) = Ct P (t)Ct

+ QW (t), and so, Kt = At P (t)CtT [Ct P (t)CtT + QW (t)]+ . Instead of Ut we may write Bt Vt with some n × q matrix Bt with q ≤ n and q-dimensional orthogonal noise Vt , i.e. EVt VtT = QV (t) is a given diagonal matrix. Here instead of Q(t) the matrix Bt QV (t)BtT enters into Equation (5.35). This approach mainly used in the stationary case, when a lower rank driving force (excitation) is assumed, but this is the topic of Dynamic Factor Analysis, see [15] and Section 5.6. In the same vein, instead of Wt we may write Dt Zt with some p×s matrix Dt with s ≤ p and s-dimensional orthogonal noise Zt , i.e. EZt ZTt = QZ (t) is a given diagonal matrix. Here instead of QW (t), the possibly reduced rank matrix Dt QZ (t)DtT enters into the calculations.

198

Dimension reduction and prediction in the time and frequency domain

Remark 5.10. Equation (5.31) gives rise to a predictive filtering, in the possession of the gain matrix Kt . After this, the algorithm is also applicable to filtering. Indeed, b t+1|t = ProjH (Y) Xt+1 = ProjH (Y) (At Xt + Ut ) = At X b t|t . X t t Now, provided At is invertible, b t|t = A−1 X b t+1|t . X t If At is not invertible, then we proceed as follows: b t|t = Proj b e X Ht (Y) Xt = ProjHt−1 (Y) Xt + ProjIt (Y) Xt = Xt|t−1 + Lt Yt|t−1 . Now the gain matrix is Lt , which is not the same as Kt (though, sometimes this is what called K´ alm´ an gain matrix), can be determined with a similar calculation: e T ]+ . e T ][E(Y e t|t−1 Y Lt = [EXt Y t|t−1 t|t−1 The only difference between the formula for Kt and Lt that here we calculate e T , but Equation (5.36) is at our disposal the covariance between Xt and Y t|t−1 in this situation too. We get that T eT EXt Y t|t−1 = P (t)Ct ,

and so, Lt = P (t)CtT [Ct P (t)CtT ]+ . Consequently, Kt = At Lt , so we could first find Lt = P (t)CtT [Ct P (t)CtT ]+ b t|t } and then, Kt . Therefore, in course of the iteration, the filtered process {X can as well be obtained. Remark 5.11. If we write the expression for Kt , Lt , and A∗t into Equation (5.35), then we get P (t + 1) = A∗t P (t)At T + QU (t) = (At − At Lt Ct )P (t)At T + Q(t) = (At (I − Lt Ct ))P (t)At T + QU (t) = At P (t)At T − At Lt Ct P (t)At T + QU (t) = At P (t)At T − At P (t)CtT [Ct P (t)CtT ]−1 Ct P (t)At T + QU (t) which final formula shows that P (t + 1) is indeed a symmetric matrix.

(5.38)

Dynamic principal component and factor analysis

199

Remark 5.12. Assume that the underlying process is weakly stationary, and put A for At , C for Ct , and QU for QU (t). In this case, instead of the recursion, we get the fixed point iteration Pt+1 = APt AT − APt C T [CPt C T ]+ CPt AT + QU , where now Pt just denotes step t of the iteration. Note that [55] considers the question when the discrete matrix Riccati equation P = AP AT − AP C T [CP C T ]+ CP AT + QU ,

(5.39)

has a unique solution and so, the method of successive approximation, resembling the recursion in (5.38), is able to find it. (Actually, the Riccati operator is concave and has a unique fixed point under very general conditions.) With this P , the limit of the sequence Kt is K = AP C T [CP C T ]+ as t → ∞, and Lt also has a limiting value L = P C T [CP C T ]+ , when our sequence is weakly stationary. The paper [33] gives guidance to the solution, mainly considers the continuous time case, and contains many applications in engineering and telecommunication. The authors also discuss the relation to differential equations and the Fisher information matrix. We remark that in the possession of another error term W (but QW does not depend on the time), Equation (5.39) has the slightly modified form P = AP AT − AP C T [CP C T + QW ]+ CP AT + QU , though it does not change the type of the matrix Riccati equation. In the stationary case, the stability of the matrix A should be assumed, as well as that of the new transition matrix, corresponding to A∗ , which is A − KC. For further details, see [22, 51, 55].

5.6

Dynamic principal component and factor analysis

Here we confine ourselves to high dimensional weakly stationary processes that are usually of lower rank than their dimension or can be approximated with a lower rank process. In the time domain, we are looking for the convenient filters and for the matrices in the state equations too. In the frequency domain, we use the low rank approximation of the spectral density matrix at the Fourier frequencies. We summarize the findings based of the previous sections.

5.6.1

Time domain approach via innovations

First we use the method of innovations. If Xt s have different dimensions, then denoting by d the minimal dimension, first we perform a static factor analysis

200

Dimension reduction and prediction in the time and frequency domain

on them, and start with the so obtained d-dimensional static factor process. We also deprive the process from trend and seasonality, and assume that it has a spectral density matrix of constant rank. If the process is also deprived of the singular part, then a regular process is at our disposal. If Xt is regular, we learned that it can be expanded in terms of the ddimensional innovations ˆ t+1 , ηt+1 := Xt+1 − X ˆ t+1 is the projection of Xt+1 onto the subspace spanned by where X X1 , . . . , Xt , denoted by Ht . It can be done step by step as described in Section 5.3.2. If not regular, the prediction process gives the regular part of it. We can as well reduce the dimension of the innovation process to k < d. This k-dimensional innovation process can be considered as a dynamic factor process, where k ≤ r, and r is the rank of the spectral density matrix of the process. As an alternative to the block Cholesky decomposition, the K´alm´an filtering is also able to find the innovations, see Equation (5.29). In this way, instead of the decomposition of a huge block matrix, we operate with matrices of size comparable to the dimension of the process. The above is also related to the minimal phase spectral factor. To find this and a reduced rank causal approximation of a process of rational spectral density, we refer to [40]. Another approach via singular autoregressions is discussed in [13]. We saw that a d-dimensional regular process {Xt }, whose spectral density matrix f is of rank r ≤ d has the variant of the multidimensional Wold decomposition: ∞ X Xt = Bj νt−j , j=0

where Bj s are d × r matrices (like dynamic factor loadings), and {νt } ∼ WN(Σ) is r-dimensional white noise (like non-standardized minimal dynamic factors). It is important that there is a one-to-one correspondence between f (frequency domain) and the B(z), Σ pair (time domain): B(z) =

∞ X

Bj z j ,

|z| ≤ 1

(5.40)

j=0

and Σ is the covariance matrix of νt . This correspondence is given by f (z) =

1 B(z)ΣB ∗ (z). 2π

1 We can as well write f (z) = 2π H(z)H ∗ (z), where H(z) = B(z)Σ1/2 is the transfer function and it is unique only up to unitary transformation. At the same time, the matrices Bj Σ−1/2 are the impulse responses, also see (5.19). So

Dynamic principal component and factor analysis

201

by performing the expansion (5.40) at the Fourier frequencies, we can estimate the transfer function. In Section 5.3.2, we gave an algorithm to this in the time domain, via block Cholesky decomposition, see (5.15). Then we can perform a static PCA on Σ with k ≤ r principal components, that results in dynamic factors of dimension k. The choice of k is such that there are n(r − k) negligible eigenvalues in the spectrum of Cn . By Theorem 5.2 , for “large” n, this is in accord with the existence of r − k negligible eigenvalues of the spectral density matrix at all the n Fourier frequencies. Therefore, we proceed in the frequency domain.

5.6.2

Frequency domain approach

Let {Xt } be discrete time, d-dimensional, weakly stationary time series of zero expectation and spectral density matrix of constant rank. For given 0 < k ≤ d we are looking for the k-dimensional time series Yt such that X Yt = bt−j Xj , t ∈ Z, j

where bj s are k × d matrices and b is the corresponding transfer function. (Here k is less than the rank of the process itself.) Then approximate Xt with X ˆt = X ct−j Yj , t ∈ Z, j

where the impulse responses cj s are d × k matrices, and c is the transfer function. ˆ is obtained from X with the time invariant filter So X a(ω) = c(ω)b(ω). The error of approximation is measured with ˆ t )∗ (Xt − X ˆ t ). E(Xt − X Then Brillinger [8] in Theorem 9.3.1 states that the minimum is attained with the impulse responses Z 2π 1 b(ω)eijω dω bj = 2π 0 and 1 cj = 2π

Z



c(ω)eijω dω,

0

where c(ω) = (u1 (ω), . . . , uk (ω))

202

Dimension reduction and prediction in the time and frequency domain

contains columnwise the orthonormal eigenvectors corresponding to the k largest eigenvalues of the spectral density matrix f of {Xt }. Further, b(ω) = c∗ (ω). (See Section 4.6 as well.) The approximation error is d X



Z 0

λj (ω) dω.

j=k+1

This is in Frobenius norm, in spectral norm it only depends on λk+1 , but the best k-rank approximation is the same in any unitary invariant norm (that depends only on the eigenvalues). The larger the gap in the spectrum between the k largest and the other eigenvalues, the better the approximation is. {Yt } is called principal component process. Its spectral density matrix is diagonal with diagonal entries λ1 (ω), . . . λk (ω). In Section 4.6 we proved that if the original process is regular, then its best k-rank approximation is regular too.

5.6.3

Best low-rank approximation in the frequency domain, and low-dimensional approximation in the time domain

Let {Xt }nt=1 be the finite part of a d-dimensional process of real coordinates and constant rank 1 ≤ r ≤ d. Its discrete Fourier transform, discussed in Section 5.4.2, is n

1 X Xt e−itωj, Tj = Vj Zj = √ n t=1 More precisely, T0 =

√1 n

Pn

t=1

j = 0, . . . , n − 1.

Xt ,

n

1 X Xt [cos(tωj ) − i sin(tωj )], Tj = √ n t=1 and Tn−j = Tj , for j = 1, . . . , k (n = 2k + 1). Therefore, Zj = Vj−1 Tj = Vj∗ Tj ,

j = 0, . . . , n − 1.

It can easily be seen that Zn−j = Zj . To find the best m-rank approximation (1 < m ≤ r) of the process in any unitary invariant norm (see Theorem B.8), we project the d-dimensional vector Tj onto the subspace spanned by the m leading eigenvectors of Vj (see e.g. [44] for the linear algebra justification for this). Important that the eigenvalues in Λj are in non-increasing order. Let us denote the eigenvectors corresponding to the m largest eigenvalues by vj1 , . . . , vjm . Then b j := ProjSpan{v ,...,v } Tj = T j1 jm

m m X X ∗ (vj` Vj Zj )vj` = Zj` vj` , `=1

`=1

Dynamic principal component and factor analysis

203

b n−j = T b j , for for j = 1, . . . , k (by the previous considerations), where and T n = 2k + 1. Further, m X b 0 := T Z0` v0` . `=1

So, for each j, the resulting vector is the linear combination of the vectors vj` s with the corresponding coordinates Zj` s of Zj , ` = 1, . . . , m. Eventually, we find the m-rank approximation of Xt by inverse Fourier transformation: n−1 X b j eitωj = b t := √1 T X n j=0   k   X 1 b0 + bj + T b j ) cos(tωj ) + i(T bj − T b j ) sin(tωj )] =√ T [(T  n j=1   k   X 1 b0 + b j ) cos(tωj ) + i · 2i · Im(T b j ) sin(tωj )] =√ T [(2Re(T  n j=1   k  X 1 b b j ) cos(tωj ) − Im(T b j ) sin(tωj )] . =√ T0 + 2 [Re(T  n j=1

b t (t = 1, . . . , n) all have real coordinates (n = 2k+1). Apparently, the vectors X In this way, we have a lower rank process with spectral density of rank m ≤ r. Note that if the process is regular (e.g. it has a rational spectral density), then so is its low-rank approximation. The theory (e.g. [44]) guarantees that the “larger” the gap between the mth and (m + 1)th eigenvalues (in non-increasing order) of the spectral density matrix, the “smaller” the approximation error is. To back-transform the PC process into the time domain, note that ∗ Zj` = vj` Tj ,

` = 1, . . . m

defines the coordinates of an m-dimensional approximation of Tj , m ≤ r ≤ d. ˜ j = (Zj1 , . . . , Zjm )T . That is, we take the This is the m-dimensional vector T first m complex PCs in each blocks (it is important that the entries in the diagonal of each Λj are in non-increasing order). The other d − m coordinates of Zj are disregarded (they are taken zeros in the new coordinate system vj1 , . . . , vjd ). The proportion of the total variance explained by the first m Pm Pd principal components at the jth Fourier frequency is `=1 λj` / `=1 λj` .

204

Dimension reduction and prediction in the time and frequency domain

Then the m-dimensional approximation of Xt by the PC process is as follows: n−1 X ˜ t := √1 ˜ j eitωj = X T n j=0   k   X 1 ˜0 + ˜j + T ˜ j ) cos(tωj ) + i(T ˜j − T ˜ j ) sin(tωj )] =√ T [(T  n j=1   k   X 1 ˜0 + ˜ j ) cos(tωj ) + i · 2i · Im(T ˜ j ) sin(tωj )] =√ T [2Re(T  n j=1   k  X 1 ˜ ˜ j ) cos(tωj ) − Im(T ˜ j ) sin(tωj )] [Re(T T0 + 2 =√  n j=1

that again results in real coordinates. Equivalently, the m-dimensional PC process is: n−1 n−1 X X ˜ t = √1 ( Zjm eitωj )T. X Zj1 eitωj , . . . , n j=0 j=0 Note that to estimate the matrices Mj s, only the estimates of the first n/2 autocovariances are needed. By the ergodicity considerations of Section 1.5.2, these can be estimated more accurately (using at least one half of the sample entries) as the remaining n/2 ones if n is “large”. Example 5.1. The previously detailed low-rank approximation is illustrated on a financial dataset [2] containing stock exchange log-returns: Istanbul stock exchange national 100 index, S&P 500 return index, stock market return index of Germany, UK, Japan, Brazil, the MSCI European index, and the MSCI emerging markets index; ranging from June 5, 2009 to February 22, 2011. It is a d = 8 dimensional time series dataset of length n = 535. In Figure 5.1, the eigenvalue processes of the estimated Mj matrices are shown in the frequency domain. Based on this, the time series is of rank approximately 3, thus we can apply the outlined low-rank approximation with m = 3. In Figure 5.2, the individual variables of the original data and its rank 3 approximation are illustrated. There are calculated root mean square error (RMSE) values under each subplot. The 3 leading PC’s, back-transformed to the time domain, are to be found in Figure 5.3.

5.6.4

Dynamic factor analysis

Standard factor analysis can be generalized to the case of a d-dimensional, real valued, vector stochastic process {Xt }. Here t ≥ 0 is the time, and our sample usually consists of observations at discrete time instances t = 1, . . . , T . In the

Dynamic principal component and factor analysis

205 (1) (2) (3)

0.005

eigenvalues

0.004 0.003 0.002 0.001 0.000 0

1

0.5

1.5

2

FIGURE 5.1 Eigenvalue processes of the estimated Mj (j = 0, . . . , 534) matrices over [0, 2π], ordered decreasingly in Example 5.1. classical factor analysis approach, the data come from i.i.d. observations, and the dimension reduction happens in the so-called cross-sectional dimension, i.e. the number d of variables is decreased. In Dynamic Factor Analysis, the observations Xt ’s are not independent, and we want to compress the information, embodied by them, in the cross-sectional and the time dimension as well. Sometimes even the cross-sectional dimension d is large compared to the time span T . Assume that {Xt } is weakly stationary with an absolutely continuous spectral distribution, i.e. it has the d × d spectral density matrix fX . With the integer 1 ≤ k < d, the dynamic k-factor model for Xt (see, e.g. [4]) is Xt = µ + B(L)Zt + et = µ + χt + et

(5.41)

or with components, Xti = µi + bi1 (L)Zt1 + · · · + bik (L)Ztk + eit (Zt1 , . . . Ztk )T

(5.42)

where the k-dimensional stochastic process Zt = is the dynamic factor, χt is called common component, the d-dimensional stochastic process et = (e1t , . . . , edt )T is called noise component, and the d × k matrix B(L) = (bij (L)), i = 1, . . . , d, j = 1, . . . , k, is the transfer function. Here L is the lag operator (backward shift) and bij (L) is a square-summable one-sided filter, P∞ i.e. bij (L) = bij (0) + bij (1)L + bij (2)L2 + . . . with `=0 b2ij (`) < ∞. Further, the components of (5.42) satisfy the following requirements: E(Zt ) = 0, Cov(eit , Zsj ) Cov(eit , ejs )

E(et ) = 0,

= 0,

i = 1, . . . , d,

= 0,

i, j = 1, . . . d,

t∈Z j = 1, . . . , k, i 6= j,

t, s ∈ Z, s ≤ t.

t, s ∈ Z, s < t.

206

Dimension reduction and prediction in the time and frequency domain

0.1

ISE

0.0 100

200

300

400

500

0.05

0 SP

0.00 0

100

200

300

400

DAX

0

100

200

300

400

0

RMSE = 0.00219749 200 300

100

400

500 DAX(lra)

0

100

RMSE = 0.00202743 200 300

400

0

100

RMSE = 0.00202032 200 300

400

500

0.05 0.00 FTSE 0

100

200

300

400

0.05

500 NIKKEI

0.00

0.05

FTSE(lra)

0.05

500 NIKKEI(lra)

0.00 0

100

200

300

400

0.05

0.05

500 BOVESPA

0

RMSE = 0.00239289 200 300

100

400

0.05

500 BOVESPA(lra)

0.00

0.00 0

100

200

300

400

0.05

500

0.05

EU

0

RMSE = 0.00251209 200 300

100

400

0.05

500 EU(lra)

0.00

0.00 0.05

SP(lra)

0.05 0.05

500

0.00

0.05

500

0.00

0.05

0.05

400

0.05 0.05

500

0.00

0.05

RMSE = 0.00261937 200 300

100

0.00

0.05 0.05

ISE(lra)

0.0 0

0.05

0.1

0

100

200

300

400

0.05

500

0.05

EM

0.00 0

100

200

300

400

500

0

RMSE = 0.00170108 200 300

100

400

500 EM(lra)

0.025 0.000 0.025 0

RMSE = 0.00172178 200 300

100

400

500

FIGURE 5.2 Approximation of the original time series by a rank 3 time series in Example 5.1. If Zt and et are also weakly stationary and they have rational spectral densities fZ and fe , the model Equation (5.41) extends to the spectral density matrices: ∗

fX (ω) = fχ (ω) + fe (ω) = B(e−iω )fZ (ω)B(e−iω ) + fe (ω),

ω ∈ [−π, π]. (5.43) Very frequently, Zt is assumed to be orthonormal WN(Ik ) process. Then Equation (5.43) simplifies to fX (ω) =

1 ∗ B(e−iω )B(e−iω ) + fe (ω). 2π

The so-called static case occurs if, in addition, B is constant. Otherwise,

Dynamic principal component and factor analysis

207

PC(1)

0.1 0.0 0.1 0

100

200

300

400

500 PC(2)

0.02 0.00 0.02 0

100

200

300

400

500 PC(3)

0.005 0.000 0.005 0.010

0

100

200

300

400

500

FIGURE 5.3 The 3 leading PC’s of the stock exchange data in the time domain in Example 5.1. Equation (5.41) is dynamic in that the latent variables Ztj s can affect the observables Xti s both contemporaneously and with lags. Like in the standard factor model, neither B(L) nor Zt are identified uniquely; and given the spectral density √ fX , the spectra fχ and fe are generically can be determined for k ≤ n − n (reminiscent of the Lederman bound).

5.6.5

General Dynamic Factor Model

Let Xt be a weakly stationary time series (t = 1, 2, . . . ) with an absolutely continuous spectral measure and the positive semidefinite spectral density matrix fX . Assume that fX (ω) has constant rank r for a.e. ω ∈ [−π, π]. If Xt is also regular (it always holds if fX is a rational spectral density matrix), then the multidimensional Wold decomposition is able to make it a one-sided V M A(∞) process. It is important that the dimension of the innovation subspaces is also r. With the integer 1 ≤ k ≤ r, the k-factor GDFM : Xt = χt + et ,

t = 1, 2, . . .

208

Dimension reduction and prediction in the time and frequency domain

where now χt denotes the common component, et is the idiosyncratic noise, and all the expectations are zeros, for simplicity. Here χt is subordinated to Xt , but has spectral density matrix of rank k ≤ r. For example, there are k uncorrelated signals (given by k distinct sources) detected by r sensors. Opposed to the static factors, this is not a low-rank approximation of the (zero-lag) auto-covariance matrix that provides the static factors. Forni and Lippi [18] and Deistler et al. [13, 15] gave necessary and sufficient conditions for the existence of an underlying GDFM in terms of the expanding n sequence of n × n spectral density matrices fX (ω), n ∈ N. Theorem 5.3. The nested sequence {Xnt : n ∈ N, t = 1, 2, . . . } can be represented by a sequence of k-factor GDFMs if and only if n • the k largest eigenvalues, λnX,1 (ω) ≥ · · · ≥ λnX,k (ω) of fX (ω) diverge almost everywhere in [−π, π] as n → ∞; n • the (k + 1)-th largest eigenvalue λnX,k+1 (ω) of fX (ω) is uniformly bounded for ω ∈ [−π, π] (almost everywhere) and for all n ∈ N.

The theorem is rather theoretical; its message is that for large n and T (T is not necessarily larger than n) we can conclude for k from the spectral gap of the constant rank spectral density matrix. The estimate χnt is consistent if n, T → ∞. The idiosyncratic noise is less and less important when n, T → ∞, and it may have slightly correlated components. Also, the largest eigenvalue of fen (ω) is uniformly bounded for ω ∈ [−π, π] and for all n ∈ N. As we learned in Chapter 4, a stationary process with a not full rank spectral density matrix may have some singular components. All these parts are included in the weakly dependent idiosyncratic noise. Dynamic factor analysis is an unsupervised learning method, and with the lag-dependent factor loading matrices we are able to give meaning to the dynamic factors that embody the comovements between the components at different lags. For example, when we use a parametric method, see, e.g. [5], we are also able to give predictions for the dynamic factors (via autoregression) and, in turn, for the components of the time series too. There are also state space models that are able to estimate the parameter matrices via singular autoregression [13, 18]. The reduced rank approximation in Section 5.6.3 offers a first step, and the Yule–Walker equations can be solved for the reduced rank process.

5.7

Summary

First we have a 1D real valued time series {Xt } which is not necessarily stationary, E(Xt ) = 0 (t ∈ Z). Selecting a starting observation X1 and with the notation Hn = Span{X1 , . . . , Xn }, Xn+1 is predicted linearly based on random

Summary

209

ˆ 1 := 0 and e2n = Eη 2 = E(Xn+1 −X ˆ n+1 )2 past values X1 , . . . , Xn such that X n+1 ˆ is minimized, n = 1, 2, . . . . By the general theory of Hilbert spaces, Xn+1 is the projection of Xn+1 onto the linear subspace Hn . The coefficients of the optimal linear predictor ˆ n+1 = an1 Xn + · · · + ann X1 X can be obtained by solving the system of linear equations Cn an = dn , where an = (an1 , . . . , ann )T , Cn = [Cov(Xi , Xj )]ni,j=1 and dn = (Cov(Xn+1 , Xn ), . . . , Cov(Xn+1 , X1 ))T . A solution (the projection ) always exists, and it is unique if Cn is positive definite; then the unique solution is an = Cn−1 dn , otherwise, the generalized inverse of Cn comes into existence. However, in case of stationary processes, this is not an issue. The h-step ahead prediction is obtained from Cn an = dn (h), where dn (h) = (c(h), . . . , c(n + h − 1))T in the stationary case. As for the innovation ηn = (η1 , . . . , ηn )T , we have to find an n × n lower triangular matrix Ln such that Xn = Ln ηn . Taking the covariance matrices on both sides, yields Cn = Ln Dn LTn . In this way, the LDL decomposition (a variant of the Cholesky decomposition) gives the prediction errors e2n s (diagonal entries of Dn ), and the entries of Ln below its main diagonal (the main diagonal is constantly 1). The situation further simplifies in the stationary case, when Cn is a Toeplitz matrix. However, Ln will not be Toeplitz, but asymptotically, it becomes more and more like a Toeplitz one, and the entries of Dn will be more and more similar to each other, i.e. to the limit σ 2 = limn→∞ e2n . In particular, if {Xt } is stationary, then Cn = [c(i − j)]ni,j=1 , so Cn is a Toeplitz matrix, and dn (j) = c(j), j = 1, . . . , n. Therefore, no double indexing is necessary, but an = (a1 , . . . , an )T. With it, the defining equation is exactly the same as the first n Yule–Walker equations for estimating the parameters of a stationary AR(n) process. The prediction error is e2n = Var(ηn+1 ). It can be written in many equivalent forms, e.g. 2 e2n = c(0)(1 − rX ) = c(0) − dTn Cn−1 dn , t ,(Xt−1 ,...,Xt−n ) 2 where rX is the squared multiple correlation coefficient between t ,(Xt−1 ,...,Xt−n ) Xt and (Xt−1 , . . . , Xt−n ); it does not depend on t either, and obviously increases (does not decrease) with n, i.e. e21 ≥ e22 ≥ . . . . The mean square error can as well be written with the determinants of the consecutive Toeplitz matrices Cn and Cn+1 . If for some n, |Cn | = 6 0, then

e2n = c(0) − dTn Cn−1 dn =

|Cn+1 | . |Cn |

If |Cn | = 0 for some n, then |Cn+1 | = |Cn+2 | = · · · = 0 too. The smallest

210

Dimension reduction and prediction in the time and frequency domain

index n for which this happens indicates that there is a linear relation between n consecutive Xj s, but no linear relation between n − 1 consecutive ones (by stationarity, this property is irrespective of the position of the consecutive random variables). This can happen only if some Xt linearly depends on n − 1 preceding Xj s. In this case e2n−1 = 0 and, of course e2n = e2n+1 = · · · = 0 too. In any case, e21 ≥ e22 ≥ . . . is a decreasing (non-increasing) nonnegative sequence, and in view of Equation (5.8), |C1 | = c(0),

|Cn | = c(0)e21 . . . e2n−1 ,

n = 2, 3, . . . ,

so, provided c(0) > 0, |Cn | = 0 holds if and only if e2n−1 = 0. Note that in this stationary case there is no sense of using generalized inverse if |Cn | = 0, since then exact one-step ahead prediction with the n − 1 long past can be done with zero error, and this property is manifested for longer past predictions too. Note that the previous LDL decomposition also implies that |Cn | = |Dn | = c(0)e21 . . . e2n−1 , n = 2, 3, . . . . In case of a stationary, non-singular process, we can project Xn+1 onto the infinite past Hn− = span {Xt : t ≤ n} and expand it in terms of an orthonormal system, see the Wold decomposition. This part will be the regular (causal) part of the process, whereas, the other, singular part, is orthogonal to it. Also, by stationarity, the one-step ahead prediction error σ 2 does not depend on n, and it is positive, since the process is non-singular. Then we have a d-dimensional time series {Xt } with components Xt = (Xt1 , . . . , Xtd )T , the state space is Rd and E(Xt ) = 0. Select a starting observation X1 and Hn := span{Xtj : t = 1, . . . , n; j = 1, . . . , d}. We want to linearly predict Xn+1 based on random past values X1 , . . . , Xn . ˆ 1 := 0 and X ˆ n+1 is the best one-step ahead Analogously to the 1D situation, X ˆ linear predictor that minimizes E(Xn+1 − Xn+1 )2 . Now we solve the system of linear equations n X

Anj Cov(Xn+1−j , Xn+1−k ) = Cov(Xn+1 , Xn+1−k ).

k = 1, . . . , n.

j=1

When {Xt } is stationary, then it simplifies to n X

Aj C(k − j) = C(k),

k = 1, . . . , n.

j=1

This provides a system of d2 n linear equations with the same number of unknowns that always has a solution. Further, the solution does not depend on the selection of the time of the starting observation X1 , and no double indexing of the coefficient matrices is necessary. If for some n ≥ 1 the covariance matrix of (XTn+1 , . . . , XT1 )T is positive definite, then the matrix polynomial

Summary

211

(VAR polynomial) α(z) = I − A1 z − · · · − An z n is causal in the sense that |α(z)| 6= 0 for z ≤ 1; otherwise, block matrix techniques and reduction in the innovation subspaces is needed. Xt can again be expanded in terms of the now d-dimensional innovations, ˆ n+1 . In this way, we get the ini.e. the prediction error terms ηn+1 = Xn+1 − X novations η1 , . . . , ηn that trivially have 0 expectation and form an orthogonal system in Hn . Actually, we have the recursive equations Xk =

k−1 X

Bkj ηj + ηk ,

k = 1, 2, . . . , n.

j=1

Here the covariance matrix Ej = Eηj ηjT is a positive semidefinite matrix, but can be of reduced rank. At a passage to infinity, we obtain the multidimensional Wold decomposition. At the end, we have to perform the block Cholesky (LDL) decomposition: Cn = Ln Dn LTn , where Cn is nd × nd positive definite block Toeplitz matrix, Dn is nm × nm block diagonal and contains the positive semidefinite prediction error matrices E1 , . . . , En in its diagonal blocks, whereas Ln is nd × nd lower triangular with blocks Bkj s below its diagonal blocks which are d × d identities. In view of this, n Y |Cn | = |Dn | = |Ej |, j=1

analogously to the 1D situation. We also prove that if the entries of the autocovariance matrices are absolutely summable, then the eigenvalues of Cn asymptotically comprise the union of the spectra of the spectral density matrices at the n Fourier frequencies as n → ∞. When the Ej s are of reduced rank, we can find a system ξ1 , . . . , ξn ∈ Rr in the d-dimensional innovation subspaces that span the same subspace as η1 , . . . , ηn . If the the spectral density matrix has r < d structural eigenvalues in a General Dynamic Factor Model, then ξj ∈ Rr is the principal component factor of ηj obtained from an r-factor model. Note that here we use d × d block matrices in the calculations, so the computational complexity of the procedure is not significantly larger than that of the subsequent K´ alm´an’s filtering for which we use the notation of R.E. K´ alm´ an’s original paper [32], where stationarity is not assumed, but the random vectors are Gaussian. The linear dynamical system is Xt+1 = At Xt + Ut Yt = C t Xt , where At and Ct are specified matrices; At is an n × n matrix, called phase

212

Dimension reduction and prediction in the time and frequency domain

transition matrix, and Ct is p × n; further, Ut (random excitation) is an orthogonal noise process with EUt UTs = δst QU (t) and EXTs Ut = 0 for s ≤ t. All the expectations are zeros, and all the random vectors have real components. Xt is the n-dimensional hidden state variable, while Yt is the p-dimensional observable variable. In the paper [32], p ≤ n is assumed, but it is not a restriction. Even if p = n, the matrix Ct is not invertible, otherwise the process Xt is trivially observable, unless a noise term is added to Ct Xt in the second equation. The problem is the following: starting the observations at time 0, given Y0 , . . . , Yt−1 , we want to estimate X component-wise, with minimum mean square error. If X = Xt , this is the prediction problem and we denote the b t|t−1 . If Y0 , . . . , Yt−1 is observed optimal one-step ahead prediction of Xt by X b b t+1|t by using and Xt|t−1 is already known, then we give a recursion to find X the new value of Yt : b t+1|t = At X b t|t−1 + Kt (Yt − Ct X b t|t−1 ), X where Kt is the K´ alm´ an gain matrix: Kt = At P (t)CtT [Ct P (t)CtT ]− . Here e t|t−1 X eT P (t) = EX t|t−1 is the error covariance matrix that drives the process. For it, the recursion P (t + 1) = At P (t)At T − At P (t)CtT [Ct P (t)CtT ]− Ct P (t)At T + QU (t) holds, which makes rise to an iteration. The above equation results in a matrix Riccati equation for P = P (t) = P (t + 1) if the process is stationary. With the integer 1 ≤ k < r, the dynamic k-factor mode (GDFM) is: Xt = χt + et ,

t = 1, 2, . . .

where now χt denotes the common component, et is the n-dimensional idiosyncratic noise, and all the expectations are zeros, for simplicity. Here χt is subordinated to Xt , but has spectral density matrix of rank k < r. For example, there are k uncorrelated signals (given by k distinct sources), detected by d sensors. Opposed to the static factors, this is not a low-rank approximation of the (zero-lag) auto-covariance matrix that provides the static factors. Forni and Lippi [18] and Deistler et al. [15, 13] gave necessary and sufficient conditions for the existence of an underlying GDFM in terms of the observable n n × n spectral densities fX (ω), n ∈ N. The nested sequence {Xnt : n ∈ N, t = 1, 2, . . . } can be represented by a sequence of GDFMs if and only if • the first k eigenvalues, λnX,1 (ω) ≥ · · · ≥ λnX,k (ω) (in non-increasing order), n of fX (ω) diverge almost everywhere in [−π, π] as n → ∞;

Summary

213

n • the (k + 1)-th eigenvalue λnX,k+1 (ω) of fX (ω) is uniformly bounded for ω ∈ [−π, π] almost everywhere and for all n ∈ N.

So we can conclude for k from the spectral gap. The estimate χnt is consistent if n, T → ∞. The idiosyncratic noise is less and less important when n, T get larger and larger, and it may have slightly correlated components.

A Tools from complex analysis

Details of the material and the proofs of the theorems discussed in this chapter can be found e.g. in [50]. The aim of this chapter to summarize the most important tools from complex analysis that we are using in this book.

A.1

Holomorphic (or analytic) functions

Let Ω ⊂ C be an open set and f : Ω → C be a complex function. If z0 ∈ Ω and if f (z) − f (z0 ) lim z→z0 z − z0 exists then we denote this limit by f 0 (z0 ) ∈ C and call it the derivative of f at z0 and f is said to be complex differentiable at z0 . If f is complex differentiable at every z0 ∈ Ω, then we say that f is holomorphic (or analytic) in Ω. Let [α, β] ⊂ R be a closed interval. A path γ is a piecewise continuously differentiable function γ : [α, β] → C. A closed path is a path such that γ(α) = γ(β). If f : Ω → C is a continuous complex function and the range γ ∗ of the path γ is in Ω, then the integral of f over γ is defined as Z Z β f (z)dz := f (γ(t))γ 0 (t)dt. γ

α

Let γ be a closed path and take Ω = C \ γ ∗ . Define Z dζ 1 Indγ (z) := , z ∈ Ω. 2πi γ ζ − z Then Indγ is an integer-valued function in Ω which is constant in each component of Ω and which is 0 in the unbounded component of Ω. We call Indγ (z) the index or winding number of z with respect to γ. The most important special class of closed paths is when there are exactly two components of Ω w.r.t. γ: one where the winding number is 1 and one where the winding number is 0. Such a closed path is called a simple closed path. If γ is a simple closed path in Ω, that is, such that Ω = Ω1 ∪ Ω0 ∪ γ ∗ ,

Indγ = 1

in

Ω1 ,

Indγ = 0

in

Ω0 , 215

216

Tools from complex analysis

then we may call Ω1 the interior and Ω0 the exterior of γ. This is the case e.g. when γ(t) = eit , 0 ≤ t ≤ 2π, so γ ∗ is the unit circle T . Then γ winds around the points of the open unit disc D exactly once counterclockwise. It is easy to see that in this example Indγ (z) = 1 if z ∈ D and Indγ (z) = 0 if z ∈ {ζ ∈ C : |ζ| > 1}. The fundamental theorems of complex analysis are Cauchy’s theorem and Cauchy’s formula. Theorem A.1. Suppose that Ω is an open set and f is a holomorphic function in Ω. Then for any closed path γ in Ω such that Indγ (z) = 0 for any z ∈ / Ω, we have Cauchy’s theorem Z f (z)dz = 0. γ

Moreover, we also have Cauchy’s formula Z f (ζ) 1 dζ, f (z) · Indγ (z) = 2πi γ ζ − z

∀z ∈ Ω \ γ ∗ .

(A.1)

An important consequence of (A.1) is that if γ is a simple closed path, then the values of a holomorphic function f in the interior of γ are uniquely determined by the values of f on the boundary γ ∗ . The condition “Indγ (z) = 0 for any z ∈ / Ω” of the theorem prevents a situation that points where f is not holomorphic may influence the value of the integrals. (Imagine that Ω is an annulus and a closed path γ in Ω goes around the inner circle boundary of the annulus.) Another important consequence is stated in the next theorem. Theorem A.2. Suppose that Ω is an open set. Then a function f is holomorphic in Ω if and only if it is representable by power series in Ω in the sense that for any a ∈ Ω there exists an r > 0 such that f (z) =

∞ X

cn (z − a)n ,

∀z ∈ D(a, r) := {ζ : |ζ − a| < r},

(A.2)

n=0

where each cn ∈ C. It follows that if f is holomorphic in Ω, then f is arbitrary many times complex differentiable in Ω. Recall that a function f which is representable by power series in an open set Ω is usually called an analytic function. We have just seen that the class of holomorphic and the class of analytic functions in an open set Ω coincide in complex analysis. Notice that the power series in (A.2) contains only nonnegative powers of (z − a). The domain of convergence of such a power series is always a disc {z : |z − a| < R} and possibly some points of the boundary circle as well, where R ∈ [0, ∞], 1 = lim sup |cn |1/n . R n→∞

(A.3)

Holomorphic (or analytic) functions

217

Here comes another important property of holomorphic functions. Theorem A.3. (The maximum modulus theorem) (a) Suppose that Ω is a connected open set, D(a, r) ⊂ Ω, r > 0, and f is a holomorphic function in Ω. Then |f (a)| ≤ max |f (a + reit )|. t

Equality occurs here if and only if f is constant in Ω. (b) Let Ω be a bounded connected open set and let K denote its closure. If the function f is continuous in K and holomorphic in Ω, then |f (z)| ≤ sup |f (ζ)|,

z ∈ Ω,

ζ∈∂Ω

where ∂Ω denotes the boundary of Ω. If equality holds at one point z ∈ Ω, then f is constant. Let f be a holomorphic function in the connected open set Ω, and define the zero set of f by Z(f ) := {a ∈ Ω : f (a) = 0}. If f is not identically 0 in Ω, then Z(f ) has no limit point in Ω and Z(f ) is at most countable. Moreover, then to each a ∈ Z(f ) there corresponds a unique positive integer m such that f (z) = (z − a)m g(z),

z ∈ C,

where g is holomorphic in Ω and g(a) 6= 0. Then f is said to have a zero of order m at the point a. Let Ω be an open set, a ∈ Ω, and f be holomorphic in Ω \ {a}. Then f is said to have an isolated singularity at a. If f can be defined at the point a so that the extended function is holomorphic in Ω, then the singularity is called removable. If there is a positive integer m and constants c−1 , . . . , c−m ∈ C, c−m 6= 0, such that f (z) − Q(z) = f (z) −

m X

c−k (z − a)−k

k=1

has a removable singularity at a, then f is said to have a pole of order m at a and Q(z) is called the principal part of f at a. Then there exists an r > 0 such that f can be expressed by a two-sided power series around a as f (z) =

∞ X

ck (z − a)k ,

0 < |z − a| < r.

k=−m

Any other isolated singularity is called an essential singularity. A function f is said to be meromorphic in an open set Ω if there is a subset A ⊂ Ω such that

218

Tools from complex analysis 1. A has no limit point in Ω, 2. f is holomorphic in Ω \ A, 3. f has a pole at each point of A.

The set A can be at most countable. For each a ∈ A, the principal part of f at a has the form m(a) X (a) Qa (z) = c−k (z − a)−k ; k=1 (a)

the coefficient c−1 is called the residue of f at a: (a)

Res(f, a) := c−1 . If γ is a closed path in Ω \ A, then elementary integration shows that Z 1 Qa (z)dz = Res(f ; a) Indγ (a). 2πi γ This simple fact can be used to show the following Residue theorem. Theorem A.4. Suppose that f is a meromorphic function in the open set Ω. Let A be the subset of points at which f has poles. If γ is a closed path in Ω \ A such that Indγ (z) = 0 for all z ∈ / Ω, then Z X 1 f (z)dz = Res(f ; a) Indγ (a). 2πi γ a∈A

The next theorem is an application of the Residue theorem; useful to determine how many zeros a holomorphic function f has in the interior of a simple closed path. Theorem A.5. Assume that γ is a simple closed path in a connected open set Ω, such that Indγ (z) = 0 for any z ∈ / Ω. Let f be a holomorphic function in Ω and let Nf (γ) denote the number of zeros of f in the interior Ω1 of γ, counted according to their multiplicities. Assume that f has no zeros on γ ∗ . Then Z 0 f (z) 1 dz = Indf ◦γ (0). Nf (γ) = 2πi γ f (z) This theorem is sometimes called the “Argument principle.” This name can be explained by a heuristic argument. Since we assumed that f has no zeros on γ ∗ , along the closed path γ one can take   log f (z) = log |f (z)|ei arg f (z) = log |f (z)| + i arg f (z),

Harmonic functions

219

where arg f (z) denotes the multiple-valued argument (angle) of the complex 0 number f (z). By the chain rule, (log f (z)) = f 0 (z)/f (z), and so Z 0 Z Z f (z) dz = d log |f (z)| + i d arg f (z) γ f (z) γ γ = log |f (γ(β))| − log |f (γ(α))| + i∆γ arg f (z) = i∆γ arg f (z), since γ(β) = γ(α). Here ∆γ arg f (z) denotes the change of argument of f along the closed path γ, which divided by 2π gives the winding number Indf ◦γ (0) in the theorem.

A.2

Harmonic functions

If f : C → C is a complex function, we may write that f (z) = u(x, y)+iv(x, y), where u, v : R2 → R are real two-variable functions. If Ω is a plane open set, then f is holomorphic in Ω if and only if u and v are differentiable two-variable functions in Ω and the Cauchy–Riemann equations hold: ∂x u = ∂y v,

∂y u = −∂x v,

(x, y) ∈ Ω,

(A.4)

where ∂x and ∂y denote partial differentiation w.r.t. x and y, respectively. Then 1 i f 0 = (∂x u + ∂y v) + (∂x v − ∂y u). (A.5) 2 2 Another way of writing the above equalities can be obtained by introducing the differential operators ∂ :=

1 (∂x − i∂y ), 2

1 ∂¯ := (∂x + i∂y ). 2

Then (A.4) and (A.5) are equivalent to ¯ = 0, ∂f

f 0 = ∂f.

(A.6)

(Applying ∂ and ∂¯ goes like multiplication with complex numbers.) Let Ω be a plane open set and let u : R2 → R be a two-variable real function such that ∂xx u and ∂yy u exist at every point of Ω. Then the Laplacian of u is defined as ∆u := ∂xx u + ∂yy u. The function u is called harmonic in Ω if it is continuous in Ω and ∆u = 0 in Ω. Similarly, the complex function f : C → C, f = u + iv, is harmonic in Ω if it is continuous in Ω, ∂xx f and ∂yy f exist in Ω, and ∆f := ∂xx f + ∂yy f = ∆u + i∆v = 0

in

Ω.

(A.7)

220

Tools from complex analysis

If f is a holomorphic function in Ω, then f has continuous derivatives of ¯ . Since then ∂f ¯ = 0 by all orders, so ∂xy f = ∂yx f , moreover, ∆f = 4∂ ∂f (A.6), it follows that ∆f = 0 in Ω. This shows that holomorphic functions are harmonic. Equation (A.7) shows that the real and imaginary parts of f are also harmonic, and by the Cauchy–Riemann equations (A.4) they are strongly related to each other; that is why u and v are called harmonic conjugates. For example, any harmonic function u in D is the real part of one and only one holomorphic function f = u + iv such that f (0) = u(0), and then v(0) = 0 and this v is also unique. For 0 ≤ ρ < 1; t, θ ∈ R and z = ρeiθ , the Poisson kernel is   it 1 − |z|2 1 − ρ2 e +z = it = . Pρ (θ − t) := Re it e −z |e − z|2 1 − 2ρ cos(θ − t) + ρ2 Then Pρ (t) > 0,

1 2π

Z

π

Pρ (t)dt = 1

(0 ≤ ρ < 1).

−π

If g ∈ L1 ([−π, π]), then 1 ˜ G(z) := 2π

Z

π

−π

eit + z g(t)dt eit − z

is a holomorphic function of z = ρeiθ in the open unit disc D. Hence the Poisson integral Z π 1 G(ρeiθ ) := Pρ (θ − t)g(t)dt (A.8) 2π −π is the real part of a holomorphic function, so a harmonic real function in D, for any g ∈ L1 ([−π, π]) real function. It implies that if g ∈ L1 ([−π, π]) is complex valued, the Poisson integral G(z) defined by (A.8) is a complex harmonic function. Moreover, lim G(ρeiθ ) = g(θ)

ρ→1

in

L1 ([−π, π]).

(A.9)

The Dirichlet problem is a famous and important problem of mathematics and physics. Here we discuss it in the unit disc. Assume that a continuous function g is given on T and it is required to find a harmonic function G in D, ¯ and has the boundary values g. which is continuous on the closed unit disc D Theorem A.6. We have a unique solution of the Dirichlet problem in the unit disc. (a) Assume that g ∈ C(T ). Let G(eiθ ) := g(eiθ ) on T and define G(z) in D by ¯ the Poisson integral (A.8). Then G is harmonic in D and G ∈ C(D). ¯ and G is harmonic in D. Then G is the (b) Conversely, suppose that G ∈ C(D) Poisson integral (A.8) in D of its restriction to T .

Harmonic functions

221

So far the Poisson integral was considered only in the unit disc. However, it is easy to extend it to an arbitrary disc. If g is a continuous complex or real function on the boundary of the open disc D(a, R) := {z : |z − a| < R} and if g is defined by the Poisson integral Z π R 2 − ρ2 1 g(a + ρeiθ ) = g(a + Reit )dt (A.10) 2 2π −π R − 2Rρ cos(θ − t) + ρ2 ¯ R) and harmonic in in D(a, R), then g is continuous on the closed disc D(a, D(a, R). Conversely, if u is a harmonic real function in an open set Ω and if ¯ R) ⊂ Ω, then u satisfies (A.10) in D(a, R) and there is a unique holoD(a, morphic function f = u + iv in D(a, R) such that f (a) = u(a) and v(a) = 0. In sum, every real harmonic function is locally the real part of a holomorphic function. Consequently, every harmonic function has continuous partial derivatives of arbitrary order. We say that a continuous complex or real function g has the mean value property in an open set Ω if we have Z π 1 ¯ R) ⊂ Ω. g(a + Reit )dt, ∀D(a, g(a) = 2π −π The Poisson integral (A.10) shows that any harmonic complex or real function has the mean value property. In fact, much more is true, as shown by the next theorem. Theorem A.7. A continuous complex or real function has the mean value property in an open set Ω if and only if it is harmonic in Ω. A real-valued function u defined in a plane open set Ω is said to be subharmonic in Ω if it has the following four properties: 1. −∞ ≤ u(z) < ∞ for all z ∈ Ω; 2. u is upper semicontinuous in Ω; Z π 1 ¯ R) ⊂ Ω; 3. u(a) ≤ u(a + Reit )dt, ∀D(a, 2π −π 4. none of the integrals above is −∞. Clearly, every harmonic real function is also subharmonic. Theorem A.8. We have several useful criteria for subharmonic functions. (a) If u is subharmonic in the open set Ω and φ is a monotonically increasing convex function in R, then φ ◦ u is subharmonic in Ω. (b) If Ω is a connected open set in the plane and f is a holomorphic function in Ω which is not identically 0, then log |f |, log+ |f | := max(0, log |f |), and |f |p (0 < p < ∞) are subharmonic in Ω.

222

Tools from complex analysis

The next theorem explains the term “subharmonic.” Theorem A.9. Suppose that u is a continuous subharmonic function in a plane open set Ω, K is a compact subset of Ω, h is a continuous real function on K which is harmonic in the interior of K, and u(z) ≤ h(z) at all boundary points of K. Then u(z) ≤ h(z) for all z ∈ K.

A.3 A.3.1

Hardy spaces First approach

Details of the material and the proofs of the theorems discussed in this subsection can be found e.g. in Chapter 17 of [50]. Let D denote the open unit disc in C and T be its boundary circle. Let f ∈ C(D) and define fr on T by fr (eit ) := f (reit )

(0 ≤ r < 1,

−π ≤ t ≤ π),

and  kfr kp := kfr k∞ :=

1 2π

Z

1/p |fr (eit )|p dt

π

(0 < p < ∞),

−π

sup |fr (eit )|, −π≤t 0}, H

¯ 02 := {f ∈ L2 (T ) : an = 0 if n ≥ 0}. H

226

Tools from complex analysis

¯ 2 . Also, It follows that there is an orthogonal decomposition L2 (T ) = H 2 ⊕ H 0 2 2 it it ¯ ¯ ¯ ¯ f ∈ H if and only if f ∈ H , where f is defined by f (e ) := f (e ). Let us denote the space of square summable complex sequences on Z by `2 (−∞, ∞). One can define the Fourier transform F from `2 (−∞, ∞) onto L2 (T ) by X F({an }n∈Z ) = an eint . n∈Z

Then F is a unitary map, that is, an isometric isomorphism from `2 (−∞, ∞) onto L2 (T ). Let S be the (bilateral) right shift in `2 (−∞, ∞). Then it defines a shift operator in L2 (T ) as well by the formula (Sf )(eit ) = eit f (eit ), with which we have FS = SF. It is clear that F(`2 (0, ∞)) = H 2 and the restricted operators S|`2 (0, ∞) and S|H 2 are also unitarily equivalent and they agree with the operator S introduced in (A.11) and (A.12). There is another useful characterization of outer functions introduced in Subsection A.3.1. Theorem A.16. A function f ∈ H 2 is outer if and only if the linear combinations of the functions S n f (n ≥ 0) are dense in H 2 .

B Matrix decompositions and special matrices

We consider finite dimensional complex Euclidean spaces that are also Hilbert spaces. Linear operations between them can be described by matrices of complex entries. Vectors are treated as column-vectors and denoted by bold-face, lower-case letters. The inner product of the vectors x, y ∈ Cn is therefore written with matrix multiplication as x∗ y, where ∗ stands for the conjugate transpose of a complex vector, hence x∗ is a row-vector. Matrices will be denoted by bold-face upper-case letters. An m × n matrix A = [aij ] of complex entries aij ’s corresponds to a Cn → Cm linear transformation (operator). Its adjoint, A∗ , is an n × m matrix, the entries of which are a∗ij = aji . An n × n matrix is called quadratic (or square) and it maps Cn into itself. The identity matrix is denoted by I or In if we want to refer to its size. Definition B.1. The following types of matrices will be frequently used. • The n × n complex matrix A is self-adjoint (Hermitian) if A = A∗ . In particular, a real matrix with AT = A is called symmetric. • The n × n real matrix A is anti-symmetric if AT = −A. • The n × n complex matrix A is unitary if AA∗ = A∗ A = In . It means that both its rows and columns form a complete orthonormal system in Cn . • The n × r complex matrix (r ≤ n) is sub-unitary if its columns constitute a (usually not complete) orthonormal system in Cn ; consequently, A∗ A = Ir , whereas AA∗ is the matrix of the orthogonal projection onto the subspace spanned by the column vectors of A. • The n × n complex matrix P is Hermitian projector if it is self-adjoint and idempotent, i.e. P 2 = P . • The complex matrix A is called normal if AA∗ = A∗ A. The n × n matrix A has an inverse if and only if its determinant, det A = |A| = 6 0, and its inverse is denoted by A−1 . In this case, the linear transformation corresponding to A−1 undoes the effect of the Cn → Cn transformation corresponding to A, i.e. A−1 y = x if and only if Ax = y for any y ∈ Cn ; equivalently, AA−1 = A−1 A = In . A unitary matrix A is always invertible and A−1 = A∗ .

227

228

Matrix decompositions and special matrices

It is important that in the case of an invertible (regular ) matrix A, the range (or image space) of A – denoted by R(A) – is the whole C n , and in exchange, the kernel of A (the subspace of vectors that are mapped into the zero vector by A) consists of only the vector 0. Note that for an m × n matrix A, its range is R(A) = Span{a1 , . . . , an }, where a1 , . . . , an are the column vectors of A for which fact the notation A = [a1 , . . . , an ] will be used. The rank of A is the dimension of its range: rank(A) = dim R(A), and it also equals the maximum number of linearly independent rows, or equivalently, the maximum number of linearly independent columns of A, or the maximal size of a nonzero minor (subdeterminant) of A. Trivially, rank(A) ≤ min{m, n}; if equality is attained, we say that A has full rank. In the case of m = n, A is regular if and only if rank(A) = n, and singular otherwise. Eigenvalues and eigenvectors tell “everything” about a quadratic matrix. The complex number λ is an eigenvalue of the n × n complex matrix A with corresponding eigenvector u 6= 0 if Au = λu. If u is an eigenvector of A, it is easy to see that for c 6= 0, cu is also an eigenvector with the same eigenvalue. Therefore, it is better to speak about eigen-directions instead of eigenvectors; or else, we will consider specially normalized, e.g. unit-norm eigenvectors, when only the orientation is divalent. It is well known that an n× n matrix A has exactly n eigenvalues (with multiplicities) which are (complex) roots of the characteristic polynomial |A − λI|. Knowing the eigenvalues, the corresponding eigenvectors are obtained by solving the system of linear equations (A − λI)u = 0 which must have a non-trivial solution due to the choice of λ. In fact, there are infinitely many solutions (in case of a single eigenvalue, they are constant multiples of each other). Normal matrices have the following important spectral property: to their eigenvalues there corresponds a complete orthonormal set of eigenvectors; choosing this as a new basis, the matrix becomes diagonal (all the off-diagonal entries are zeros). Here we only state the analogous version for Hermitian matrices. Theorem B.1 (Hilbert–Schmidt theorem). The n × n self-adjoint complex matrix A has real eigenvalues λ1 ≥ · · · ≥ λn (with multiplicities), and the corresponding eigenvectors u1 , . . . , un can be chosen so that they constitute a complete orthonormal set in Cn . Theorem B.1 implies the following Spectral Decomposition (SD) of the n×n self-adjoint matrix A: A=

n X i=1

λi ui u∗i = U ΛU ∗ ,

(B.1)

229 where Λ = diag(λ1 , . . . , λn ) is the diagonal matrix containing the eigenvalues — called spectrum — in its main diagonal, while U = [u1 , . . . , un ] is the unitary matrix containing the corresponding unit-norm eigenvectors of A in its columns in the order of the eigenvalues. Of course, permuting the eigenvalues in the main diagonal of Λ, and the columns of U accordingly, will lead to the same SD, however — if not otherwise stated — we will enumerate the real eigenvalues in non-increasing order. About the uniqueness of the above SD we can state the following: the unit-norm eigenvector corresponding to a single eigenvalue is unique (up to orientation), whereas to an eigenvalue with multiplicity m there corresponds a unique m-dimensional so-called eigen-subspace within which any orthonormal set can be chosen for the corresponding eigenvectors. Definition B.2. The parsimonious SD of the n × n self-adjoint matrix A of rank r is r X ˜Λ ˜U ˜ ∗, λi ui u∗i = U (B.2) A= i=1

˜ = [u1 , . . . , ur ] is n × r sub-unitary matrix and Λ ˜ = diag(λ1 , . . . , λr ) where U is r × r diagonal matrix, where λ1 ≥ · · · ≥ λr is the set of nonzero eigenvalues. The quadratic form x∗ Ax with the SD of the self-adjoint A is x∗ Ax =

n X i=1

λi (x∗ ui )(x∗ ui ) =

n X

λi |x∗ ui |2

i=1

that is a real number. Some properties of the self-adjoint matrix A and of the quadratic forms generated by it follow: • A is singular if and only if it has a 0 eigenvalue, and r = rank(A) = rank(Λ) = |{i : λi 6= 0}|; moreover, R(A) = Span{ui : λi 6= 0}. Therefore, the SD of A simplifies to Pr ∗ i=1 λi ui ui . • A is positive (negative) definite if x∗ Ax > 0 (x∗ Ax < 0), ∀x 6= 0; equivalently, all the eigenvalues of A are positive (negative). • A is positive (negative) semidefinite if x∗ Ax ≥ 0 (x∗ Ax ≤ 0), ∀x ∈ Cn ; equivalently, all the eigenvalues of A are non-negative (non-positive). • Note that the notion non-negative (non-positive) definite can be used instead of positive (negative) semidefinite. In some literature if A is called positive (negative) semidefinite, then it is understood that x∗ Ax = 0 for at least one x 6= 0; and so the spectrum of A contains the zero eigenvalue too. • A is indefinite if x∗ Ax takes both positive and negative values (with different, non-zero x’s); equivalently, the spectrum of A contains at least one positive and one negative eigenvalue.

230

Matrix decompositions and special matrices Qn Pn • |A| = det(A) = i=1 λi and tr(A) = i=1 λi . A canonical decomposition for a rectangular matrix is a useful tool. Theorem B.2. Let A be an m × n rectangular matrix of complex entries, rank(A) = r ≤ min{m, n}. Then there exist an orthonormal set (v1 , . . . , vr ) ⊂ Cm and (u1 , . . . , ur ) ⊂ Cn together with the positive real numbers s1 ≥ s2 ≥ · · · ≥ sr > 0 such that Auj = sj vj ,

A∗ vj = sj uj ,

j = 1, 2, . . . , r.

(B.3)

The elements vj ∈ Cm and uj ∈ Cn (j = 1, . . . , r) in (B.3) are called relevant singular vector pairs (or left and right singular vectors) corresponding to the singular value sj (j = 1, 2, . . . , r). The transformations in (B.3) give a one-to-one mapping between R(A) and R(A∗ ), all the other vectors of Cn and Cm are mapped into the zero vector of Cm and Cn , respectively. However, the left and right singular vectors can appropriately be completed into a complete orthonormal set {v1 , . . . , vm } ⊂ Cm and {u1 , . . . un } ⊂ Cn , respectively, such that, the so introduced extra vectors in the kernel subspaces in Cm and Cn are mapped into the zero vector of Cn and Cm , respectively. With the unitary matrices V = (v1 , . . . , vm ) and U = (u1 , . . . un ), the following singular value decomposition (SVD) of A and A∗ holds: A = V SU ∗ =

r X i=1

si vi u∗i

and A∗ = U S ∗ V ∗ =

r X

si ui vi∗ ,

(B.4)

i=1

where S is an m × n so-called generalized diagonal matrix which contains the singular values s1 , . . . , sr in the first r positions of its main diagonal (starting from the upper left corner) and zeros otherwise. We remark that there are other equivalent forms of the above SVD depending on, whether m < n or m ≥ n. For example, in the m < n case, V can be an m × m unitary, S an m × m diagonal, and U an n × m sub-unitary matrix with the same relevant entries. About the uniqueness of the SVD the following can be stated: to a single positive singular value there corresponds a unique singular vector pair (of course, the orientation of the left and right singular vectors can be changed at the same time). To a positive singular value of multiplicity k > 1 a k-dimensional left and right so-called isotropic subspace corresponds, within which, any k-element orthonormal sets can embody the left and right singular vectors with orientation such that the requirements in (B.3) are met. We also remark that the singular values of a self-adjoint matrix are the absolute values of its real eigenvalues. In case of a positive eigenvalue, the left and right singular vectors are the same (they coincide with the corresponding eigenvector with any, but the same orientation). In case of a negative eigenvalue, the left and right side singular vectors are opposite (any of them is the corresponding eigenvector which have a divalent orientation). In case of a zero singular value the orientation is immaterial, as it does not contribute to the

231 SVD of the underlying matrix. Numerical algorithms for SD and SVD of real matrices are presented in [26, 61]. Assume that the m × n complex matrix A of rank r has SVD (B.4). It is easy to see that the matrices AA∗ and A∗ A are self-adjoint, positive semidefinite matrices of rank r, and their SD is AA∗ = V (SS ∗ )V ∗ =

r X

s2j vj vj∗

and A∗ A = U (S ∗ S)U ∗ =

j=1

r X

s2j uj u∗j ,

j=1

where the diagonal matrices SS ∗ and S ∗ S both contain the numbers s21 , . . . , s2r in the leading positions of their main diagonals as non-zero eigenvalues. These facts together also imply that the only positive singular value of a sub-unitary matrix is the 1 with multiplicity of its rank. By means of SD and SVD we are able to define so-called generalized inverses of singular quadratic or rectangular matrices. Definition B.3. The m × n complex matrix X is a generalized inverse of the n × m complex matrix A if AXA = A. A generalized inverse X satisfying AXA = A is denoted by A− . In fact, any matrix that undoes the effect of the underlying linear transformation between the ranges of A∗ and A will do. A generalized inverse is far not unique as any transformation operating on the kernels can be added. However, the following pseudoinverse (Moore–Penrose inverse) is unique and, in case of a quadratic matrix, it coincides with the usual inverse if exists. Definition B.4. The m × n complex matrix X is the pseudoinverse (in other words, the Moore–Penrose inverse) of the n×m complex matrix A if it satisfies all of the following conditions: AXA = A, XAX = X, (AX)∗ = AX, (XA)∗ = XA. It can be proven that there uniquely exists a pseudoinverse satisfying the conditions in the above definition, and it is denoted by A+ . Actually, it can be obtained from the SVD (B.4) of A as follows: A+ = U S + V ∗ =

r X 1 uj vj∗ , s j j=1

where S + is the m × n generalized diagonal matrix containing the reciprocals of the non-zero singular values, otherwise zeros, in its main diagonal.

232

Matrix decompositions and special matrices

In particular, the Moore–Penrose inverse of the n × n self-adjoint matrix with SD (B.1) is r X 1 A+ = uj u∗j = U Λ+ U ∗ , λ j j=1 where Λ+ = diag( λ11 , . . . , λ1r , 0, . . . , 0) is the diagonal matrix containing the reciprocals of the non-zero eigenvalues, otherwise zeros, in its main diagonal. Note that any analytic function f of the self-adjoint matrix A can be defined by its SD, A = U ΛU ∗ , in the following way: f (A) := U f (Λ)U ∗ where f (Λ) = diag(f (λ1 ), . . . , f (λn )), of course, only if every eigenvalue is in the domain of f . In this way, for a positive semidefinite A, its square root is A1/2 := U Λ1/2 U ∗ , and for a regular A its inverse is obtained by applying the f (x) = x−1 function to it: A−1 = U Λ−1 U ∗ . For a singular A, the Moore–Penrose inverse is obtained by using Λ+ instead of Λ−1 . Accordingly, for a positive definite matrix, its −1/2 power is defined as the square root of A−1 . Now a special type of a matrix is introduced. Definition B.5. We say that the n × n self-adjoint complex matrix G = (gij ) is a Gram matrix (Gramian) if its entries are inner products; i.e., there is a dimension d > 0 and vectors x1 , . . . , xn ∈ Cd such that gjk = x∗j xk ,

j, k = 1, . . . n.

Proposition B.1. The self-adjoint matrix G is a Gramian if and only if it is positive semidefinite. Proof. If G is a Gram-matrix, then it can be decomposed as G = AA∗ , where A∗ = [x1 , . . . , xn ] with its generating vectors x1 , . . . , xn ∈ Cd . Therefore, x∗ Gx = x∗ AA∗ x = (A∗ x)∗ (A∗ x) = kA∗ xk2 ≥ 0,

∀x ∈ Cn .

Conversely, if G is positive semidefinite Prwith rank∗ r ≤ n, then its SD — using (B.2) — can be written as G = j=1 λj uj uj with its positive real eigenvalues λ1 ≥ · · · ≥ λr > 0. Let the n × r matrix A be defined as follows: p p ˜Λ ˜ 1/2 = [ λ1 u1 , . . . , λr ur ], (B.5) A=U ˜ and Λ ˜ are as defined in Definition B.2. where U

233 Then the row vectors of the matrix A will be the vectors xj ∈ Cr (j = 1, . . . n) reproducing G. Of course, the decomposition G = AA∗ is far not unique: first of all, instead of A the matrix AQ will also do, where Q is an arbitrary r × r unitary matrix (obviously, xj ’s can be rotated); and xj ’s can also be put in a higher (d > r) dimension with attaching any (but the same) number of zero coordinates to them. Now matrix norms are summarized. Definition B.6. The spectral norm (or operator norm) of an m × n complex matrix A of rank r, with singular values s1 ≥ · · · ≥ sr > 0, is kAk := max |Ax| = s1 , |x|=1

where | · | is the Euclidean (L2 ) norm. Then for square matrices A and B we have kABk ≤ kAk kBk. The Frobenius norm, denoted by k.kF , is  1/2 !1/2 m X n r X X p p kAkF :=  |aij |2  = tr(AA∗ ) = tr(A∗ A) = s2i . i=1 j=1

i=1

For a self-adjoint matrix A, kAk = max |Ax| = max |λi | and kAkF = |x|=1

i

r X

!1/2 λ2i

.

i=1

Obviously, for a matrix A of rank r, kAk ≤ kAkF ≤



rkAk.

(B.6)

More generally, a matrix norm is called unitary invariant if kAkun = kQARkun with any m × m and n × n unitary matrices Q and R, respectively. It is easy to see that a unitary invariant norm of a matrix merely depends on its singular values (or eigenvalues if it is self-adjoint). For example, the spectral and Frobenius norms are such. Next, the spectral radius of a quadratic matrix is defined and related to its so-called natural norms. Definition B.7. The spectrum σ(A) of the matrix A ∈ Cn×n is the set of all its eigenvalues λj , j = 1, . . . , n. The spectral radius of the matrix A ∈ Cn×n is ρ(A) = max{|λ| : λ ∈ σ(A)}. Note that for a self-adjoint matrix ρ(A) = kAk, where kAk is the spectral norm of A.

234

Matrix decompositions and special matrices

Definition B.8. A natural matrix norm (or matrix norm induced by a vector norm) of a matrix $A \in \mathbb{C}^{n \times n}$ is defined as
$$
\|A\|_p = \max_{\|x\|_p = 1} \|Ax\|_p,
$$
where $\|x\|_p$ can be any $L^p$ vector norm in $\mathbb{C}^n$, $1 \le p \le \infty$. Note that $\|A\|_2 = \|A\|$, the previous spectral norm.

Lemma B.1. (Theorem 11.1.3 of [48]). Between the spectral radius and any natural norm of the quadratic, complex matrix $A$, the relation $\rho(A) \le \|A\|_p$ holds.

Proof. Let $\lambda^* := \rho(A)$, and let $x^*$ denote a unit-norm eigenvector corresponding to $\lambda^*$. Then
$$
\|A\|_p = \max_{\|x\|_p = 1} \|Ax\|_p \ge \|A x^*\|_p = \|\lambda^* x^*\|_p = \rho(A)\, \|x^*\|_p = \rho(A),
$$
and that proves the lemma.

Here we quote some important facts about the spectral radius. The first statement can be proved using the Jordan normal form of $A$; the other statements easily follow from the first.

Lemma B.2. For any matrix $A \in \mathbb{C}^{n \times n}$ and its spectral norm $\|A\| = \|A\|_2$ we have

(1) $\rho(A) = \lim_{k \to \infty} \|A^k\|^{1/k} = \inf_{k \ge 1} \|A^k\|^{1/k}$;

(2) if $\rho(A) < 1$, then for any $c$ with $\rho(A) < c < 1$, there exists a constant $K$ such that $\|A^j\| \le K c^j$ ($j \ge 0$);

(3) $\rho(A) < 1 \iff \lim_{k \to \infty} A^k = 0$;

(4) $\rho(A) > 1 \implies \lim_{k \to \infty} \|A^k\| = \infty$;

(5) $\rho(A) = 1 \implies \|A^k\| \ge 1$ for any $k \ge 1$.
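A small numerical illustration of items (1) and (3) of Lemma B.2 (a sketch assuming NumPy; the rescaling and powers are arbitrary choices):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
A /= 1.1 * np.abs(np.linalg.eigvals(A)).max()     # rescale so that rho(A) = 1/1.1 < 1
rho = np.abs(np.linalg.eigvals(A)).max()          # spectral radius

for k in (10, 50, 200):                           # ||A^k||^(1/k) approaches rho(A)
    print(k, np.linalg.norm(np.linalg.matrix_power(A, k), 2) ** (1.0 / k))
print(rho)
print(np.allclose(np.linalg.matrix_power(A, 400), 0))   # A^k -> 0 since rho(A) < 1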

It is obvious that all the complex eigenvalues of an $n \times n$ complex matrix $A$ lie within the closed disc of radius $\rho(A)$ around the origin of the complex plane. However, the subsequent Gersgorin disc theorem gives a finer localization of them.

Theorem B.3 (Gersgorin disc theorem). Let $A$ be an $n \times n$ matrix with entries $a_{ij} \in \mathbb{C}$. The Gersgorin discs of $A$ are the following regions of the complex plane:
$$
D_i = \Big\{ z \in \mathbb{C} : |z - a_{ii}| \le \sum_{j \ne i} |a_{ij}| \Big\}, \qquad i = 1, \dots, n.
$$
Let $\lambda_1, \dots, \lambda_n$ denote the eigenvalues of $A$. Then $\{\lambda_1, \dots, \lambda_n\} \subset \bigcup_{i=1}^{n} D_i$. Furthermore, any connected component of the set $\bigcup_{i=1}^{n} D_i$ contains as many eigenvalues of $A$ as the number of discs that form this component.

Theorem B.4 (Cayley–Hamilton theorem). For any $n \times n$ complex matrix $A$, $p_n(A) = O$ (the $n \times n$ zero matrix), where $p_n$ is the characteristic polynomial of $A$, i.e., $p_n(z) = |A - zI|$ (an $n$th degree polynomial of $z$).

Now some perturbation results follow for self-adjoint matrices.

Theorem B.5 (Weyl perturbation theorem). Let $A$ and $B$ be $n \times n$ self-adjoint matrices. Then
$$
|\lambda_j(A) - \lambda_j(B)| \le \|A - B\|, \qquad j = 1, \dots, n
$$
in spectral norm, where the eigenvalues of $A$ and $B$ are enumerated in non-increasing order.

A theorem for the perturbation of spectral subspaces (sometimes called the Davis–Kahan theorem) is stated here for self-adjoint matrices.

Theorem B.6. Let $A$ and $B$ be self-adjoint matrices; $S_1$ and $S_2$ are subsets of $\mathbb{R}$ such that $\mathrm{dist}(S_1, S_2) = \delta > 0$. Let $P_A(S_1)$ and $P_B(S_2)$ be the orthogonal projections onto the subspace spanned by the eigenvectors of the matrix in the lower index, corresponding to the eigenvalues within the subset in the argument. Then with any unitary invariant norm
$$
\|P_A(S_1) P_B(S_2)\| \le \frac{c}{\delta}\, \|A - B\|,
$$
where $c$ is a constant. In case of the Frobenius norm, $c = 1$ will always do.

We also need the following simple lemma.

Lemma B.3. If $A$ and $B$ are self-adjoint, positive semidefinite quadratic matrices of the same size, then $AB$ has real nonnegative eigenvalues.

Proof. Though $AB$ is usually not self-adjoint, it is still diagonalizable, as follows. The eigenvalue–eigenvector equation for the matrix $AB$ is
$$
A B x = \lambda x,
$$


which is equivalent to
$$
(A^{1/2} B A^{1/2})(A^{-1/2} x) = \lambda\, (A^{-1/2} x),
$$
where $A^{1/2} B A^{1/2} = (A^{1/2} B^{1/2})(A^{1/2} B^{1/2})^*$ is a Gram matrix (Definition B.5), so positive semidefinite. Each of its nonnegative real eigenvalues $\lambda$ is also an eigenvalue of $AB$, with eigenvector obtained by premultiplying its eigenvector by $A^{1/2}$.

Definition B.9. The quadratic matrix $A = [a_{jk}]$ is of Toeplitz type if it has the same entries along its main diagonal and along all lines parallel to the main diagonal. In other words, the value of the entry $a_{jk}$ depends only on $j - k$, $\forall j, k$.

Definition B.10. The quadratic matrix $A = [a_{jk}]$ is of Hankel type if it has the same entries along its anti-diagonal and along all lines parallel to its anti-diagonal. In other words, the value of the entry $a_{jk}$ depends only on $j + k$, $\forall j, k$.

Block Toeplitz and block Hankel matrices are defined analogously: they are of Toeplitz and of Hankel type in terms of their blocks considered as entries, respectively.

Without proofs, we list some notable matrix decompositions, see [26].

• The Gram decomposition of the self-adjoint, positive semidefinite $n \times n$ matrix $G$ is the decomposition $G = A A^*$ in the proof of Proposition B.1. As we saw, it is not unique, and the minimal size of $A$ is $n \times r$, where $r = \mathrm{rank}(A)$. The Gram decomposition $G = A A^*$ with the $A$ of equation (B.5) is called the parsimonious Gram decomposition.

• The QR-decomposition of a complex $m \times n$ matrix $A$ is $A = QR$, where the matrix $Q$ is $m \times m$ unitary, whereas $R$ is an $m \times n$ generalized upper triangular matrix (there are 0 entries below its main diagonal, starting at its upper left corner). This (not necessarily unique) decomposition always exists, and can be derived by applying the Gram–Schmidt orthogonalization procedure to the column vectors of $A$.

The related QR-transformation (Francis, 1961) uses the QR-decomposition for an iteration converging to $\Lambda$ of the SD $A = U \Lambda U^*$ of the self-adjoint matrix $A$. The iteration is as follows:
$$
A_0 = A, \quad Q_0 = Q, \quad R_0 = R,
$$
and for $t = 1, 2, \dots$, if $A_{t-1} = Q_{t-1} R_{t-1}$, then $A_t := R_{t-1} Q_{t-1}$. Then $\lim_{t \to \infty} A_t = \Lambda$ in $L^2$-norm.
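A minimal NumPy sketch of this QR-transformation on a generic random symmetric matrix (an illustration only; the size, seed, and iteration count are arbitrary, and convergence can be slow when eigenvalue magnitudes are close):

import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                     # real symmetric (self-adjoint) test matrix

At = A.copy()
for _ in range(1000):                 # A_t := R_{t-1} Q_{t-1}
    Q, R = np.linalg.qr(At)
    At = R @ Q

print(np.round(At, 6))                # off-diagonal entries should be (numerically) small
print(np.sort(np.diag(At)))           # diagonal approaches the eigenvalues of A ...
print(np.sort(np.linalg.eigvalsh(A))) # ... as computed directly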

• The LDL-decomposition of the complex $n \times n$ self-adjoint matrix $A$ is $A = L D L^*$, where $L$ is $n \times n$ lower triangular with 1s along its main diagonal, and $D$ is an $n \times n$ diagonal matrix (with nonnegative diagonal entries). Moreover, the LDL-decomposition is nested in the following sense: if $A_k$, $L_k$, and $D_k$ denote the $k \times k$ submatrices of the underlying matrices formed by their first $k$ rows and columns, then $A_k = L_k D_k L_k^*$ is also an LDL-decomposition, for $k = 1, 2, \dots, n$.

• The LU-decomposition of the complex $n \times n$ matrix $A$ is $A = LU$, where $L$ is an $n \times n$ lower, and $U$ an $n \times n$ upper triangular matrix. It can be arranged that each diagonal entry of $L$ is 1. This decomposition is sometimes called the Cholesky decomposition. If $A$ is self-adjoint, then the LU-decomposition can be obtained from the LDL-decomposition via manipulations with the nonnegative entries of $D$.

The related LR-transformation (Rutishauser, 1958, see [61]) uses the LU-decomposition for an iteration converging to $\Lambda$ of the SD $A = U \Lambda U^*$ of the self-adjoint matrix $A$. Here LR stands for left-right, which is the same as LU (lower-upper). The iteration is as follows:
$$
A_0 = A = LR, \quad L_0 = L, \quad R_0 = R,
$$
and for $t = 1, 2, \dots$, if $A_{t-1} = L_{t-1} R_{t-1}$, then $A_t := R_{t-1} L_{t-1}$. Eventually, $\lim_{t \to \infty} R_t = \lim_{t \to \infty} A_t = \Lambda$ and $\lim_{t \to \infty} L_t = I_n$ in $L^2$-norm.

Note that the above matrix decompositions work for block matrices similarly. The computational complexity is increased with the understanding that here matrix multiplications are substituted for entry-wise multiplications. Block matrices sometimes arise as Kronecker products.

Definition B.11. Let $A$ be a $p \times n$ and $B$ a $q \times m$ complex matrix. Their Kronecker product, denoted by $A \otimes B$, is the following $pq \times nm$ block matrix: it has $p$ block rows and $n$ block columns; each block is a $q \times m$ matrix such that the block indexed by $(j, k)$ is the matrix $a_{jk} B$ ($j = 1, \dots, p$; $k = 1, \dots, n$).

This product is associative and distributive with respect to addition, but usually not commutative. If $A$ is an $n \times n$ and $B$ an $m \times m$ quadratic matrix, then
$$
\det(A \otimes B) = (\det A)^m \cdot (\det B)^n;
$$
further, if both are regular, then so is their Kronecker product. Namely,
$$
(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}.
$$
It is also useful to know that — provided $A$ and $B$ are self-adjoint — the spectrum of $A \otimes B$ consists of the real numbers
$$
\alpha_j \beta_k \qquad (j = 1, \dots, n;\ k = 1, \dots, m),
$$
where the $\alpha_j$'s and $\beta_k$'s are the eigenvalues of $A$ and $B$, respectively. Definition B.11 naturally extends to vectors: the Kronecker product of vectors $a \in \mathbb{C}^n$ and $b \in \mathbb{C}^m$ is a vector $a \otimes b \in \mathbb{C}^{nm}$.

The eigenvalues of other types of block matrices are characterized in the following theorem.

Theorem B.7 (Theorem 5.3.1 of [48]). Let $A$ be a $d \times d$ self-adjoint matrix with spectral decomposition
$$
A = \sum_{k=1}^{d} a_k u_k u_k^*;
$$
the analytic functions $g_{ij}(z)$ for $i, j = 1, \dots, n$ satisfy $g_{ij}(z) = g_{ji}(z)$, and the eigenvalues $a_1, \dots, a_d$ are within the convergence region of every $g_{ij}(z)$. Denoting the spectral decomposition of the self-adjoint matrix $[g_{ij}(a_k)]_{i,j=1}^{n}$ by
$$
[g_{ij}(a_k)]_{i,j=1}^{n} = \sum_{\ell=1}^{n} \lambda_\ell^{(k)} v_\ell^{(k)} \big(v_\ell^{(k)}\big)^*, \qquad k = 1, \dots, d,
$$
the spectral decomposition of the $nd \times nd$ block matrix $[A_{ij}]_{i,j=1}^{n} = [g_{ij}(A)]_{i,j=1}^{n}$ is
$$
[A_{ij}]_{i,j=1}^{n} = \sum_{\ell=1}^{n} \sum_{k=1}^{d} \lambda_\ell^{(k)} \big(u_k \otimes v_\ell^{(k)}\big)\big(u_k \otimes v_\ell^{(k)}\big)^*.
$$
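Before continuing, a quick numerical check of the Kronecker-product facts above (a sketch assuming NumPy; the random symmetric test matrices are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3)); A = (A + A.T) / 2     # self-adjoint, n = 3
B = rng.standard_normal((2, 2)); B = (B + B.T) / 2     # self-adjoint, m = 2

alpha = np.linalg.eigvalsh(A)
beta = np.linalg.eigvalsh(B)
print(np.allclose(np.sort(np.outer(alpha, beta).ravel()),
                  np.sort(np.linalg.eigvalsh(np.kron(A, B)))))   # spectrum = {alpha_j * beta_k}
print(np.isclose(np.linalg.det(np.kron(A, B)),
                 np.linalg.det(A) ** 2 * np.linalg.det(B) ** 3)) # det(A x B) = det(A)^m det(B)^n
print(np.allclose(np.linalg.inv(np.kron(A, B)),
                  np.kron(np.linalg.inv(A), np.linalg.inv(B))))  # (A x B)^{-1} = A^{-1} x B^{-1}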

In [48] it is noted that if $A$ is normal and the matrices $[g_{ij}(a_k)]_{i,j=1}^{n}$ are all normal as well, the statement also holds, irrespective of whether the $g_{ij}$'s are analytic.

Next we discuss low rank approximations of a matrix.

Theorem B.8. Let $A \in \mathbb{C}^{m \times n}$ with SVD $A = \sum_{i=1}^{r} s_i u_i v_i^*$, where $r$ is the rank of $A$ and $s_1 \ge \dots \ge s_r > 0$. Then for any $1 \le k < r$ such that $s_k > s_{k+1}$ we have
$$
\min \|A - B\| = s_{k+1} \quad \text{and} \quad \min \|A - B\|_F = \left( \sum_{i=k+1}^{r} s_i^2 \right)^{1/2},
$$
where the minima are taken over all matrices $B \in \mathbb{C}^{m \times n}$ of rank $k$. Both minima are attained with the matrix $B = A_k := \sum_{i=1}^{k} s_i u_i v_i^*$.
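A short NumPy sketch verifying Theorem B.8 on a random matrix (an illustration; the size, seed, and chosen $k$ are arbitrary):

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))
U, s, Vh = np.linalg.svd(A, full_matrices=False)

k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]                 # best rank-k approximation A_k
print(np.isclose(np.linalg.norm(A - Ak, 2), s[k]))         # spectral-norm error = s_{k+1}
print(np.isclose(np.linalg.norm(A - Ak, 'fro'),
                 np.sqrt((s[k:] ** 2).sum())))             # Frobenius error = sqrt(sum_{i>k} s_i^2)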

Note that $A_k$ is called the best rank $k$ approximation of $A$, and the aforementioned theorem guarantees that it is the best approximation both in spectral and Frobenius norm. In fact, it is true for any unitary invariant norm:
$$
\min_{\substack{B \in \mathbb{C}^{m \times n} \\ \mathrm{rank}(B) = k}} \|A - B\|_{\mathrm{un}} = \|A - A_k\|_{\mathrm{un}}.
$$

Corollary B.1. Let $A \in \mathbb{C}^{n \times n}$ be a self-adjoint, positive semidefinite matrix with SD $A = \sum_{j=1}^{r} \lambda_j u_j u_j^*$, where $r$ is the rank of $A$ and the eigenvalues are $\lambda_1 \ge \dots \ge \lambda_r > 0$. Then for any $1 \le k < r$ such that $\lambda_k > \lambda_{k+1}$ we have
$$
\min \|A - B\| = \lambda_{k+1} \quad \text{and} \quad \min \|A - B\|_F = \left( \sum_{i=k+1}^{r} \lambda_i^2 \right)^{1/2},
$$
where the minima are taken over all self-adjoint, positive semidefinite matrices $B \in \mathbb{C}^{n \times n}$ of rank $k$. Both minima are attained with the best rank $k$ approximation
$$
A_k = \sum_{j=1}^{k} \lambda_j u_j u_j^* = \tilde{U}_k \tilde{\Lambda}_k \tilde{U}_k^*
$$
of $A$, where $\tilde{U}_k = [u_1, \dots, u_k]$ is an $n \times k$ sub-unitary and $\tilde{\Lambda}_k = \mathrm{diag}(\lambda_1, \dots, \lambda_k)$ is a $k \times k$ diagonal matrix. In particular,
$$
A = A_r = \tilde{U}_r \tilde{\Lambda}_r \tilde{U}_r^*,
$$
of which we can take the square root:
$$
A^{1/2} = \tilde{U}_r \tilde{\Lambda}_r^{1/2} \tilde{U}_r^*,
$$
where $\tilde{\Lambda}_r^{1/2} = \mathrm{diag}(\sqrt{\lambda_1}, \dots, \sqrt{\lambda_r})$. However, with any matrix $M = \tilde{U}_r \tilde{\Lambda}_r^{1/2} Q$, where $Q$ is $r \times r$ unitary, the decomposition $A = M M^*$ holds. In particular it holds with $M = \tilde{U}_r \tilde{\Lambda}_r^{1/2} = [\sqrt{\lambda_1}\, u_1, \dots, \sqrt{\lambda_r}\, u_r]$, see the parsimonious Gram decomposition (B.5).

Theorem B.8 is proved for self-adjoint matrices in [44] with the Frobenius norm, but it can be easily extended to any unitary invariant norm, see also [45].

C Best prediction in Hilbert spaces

Let $L^2(\Omega, \mathcal{A}, \mathbb{P})$ be the Hilbert space of real valued random variables with zero expectation and finite variance; the inner product is the covariance, and $\mathbb{P}$ denotes the joint distribution of all of them. We use subspaces of this, related to multivariate, weakly stationary processes. Let $\mathbf{X} = \{\mathbf{X}_t\}_{t \in \mathbb{Z}}$ be a weakly stationary, $d$-dimensional time series with real-valued coordinates, $\mathbb{E}\mathbf{X}_t = \mathbf{0}$. By weak stationarity, the $\mathbf{X}_t$'s all have the same covariance matrix $C(0)$.

Note that to any weakly stationary process there corresponds a Gaussian one with the same second moments, so it is not a restriction to confine ourselves to Gaussian processes. Sometimes we speak in terms of so-called second order processes that are determined by their first and second moments. When the expectations are 0s, the pairwise covariances characterize the process, and predictions can be discussed in terms of projections in Hilbert spaces. So in the Gaussian case, the $\mathbf{X}_t$'s all have the same $d$-variate Gaussian distribution, but they are defined on different $d$-dimensional marginals of $\mathbb{P}$. In particular, their first, second, etc. autocovariances $C(1), C(2), \dots$ characterize their joint distribution. Therefore, this can be regarded as a special random field that extends in space and time (cross-sectionally and longitudinally), i.e. the parameters of the random process contain both (discrete) time and ($d$-dimensional) space locations, see [23, 25, 24].

Corresponding to the above weakly stationary process, throughout the book we consider the following subspaces of $L^2(\Omega, \mathcal{A}, \mathbb{P})$:
$$
\begin{aligned}
H(\mathbf{X}) &= \mathrm{span}\{X_k^i \mid k \in \mathbb{Z},\ i = 1, \dots, d\}, \\
H_t^{-}(\mathbf{X}) &= \mathrm{span}\{X_k^i \mid k \le t,\ i = 1, \dots, d\}, \\
H_t(\mathbf{X}) &= \mathrm{span}\{X_k^i \mid 1 \le k \le t,\ i = 1, \dots, d\} = \mathrm{Span}\{X_k^i : 1 \le k \le t,\ i = 1, \dots, d\},
\end{aligned}
$$
as the last one is finite dimensional. Obviously, $H_t(\mathbf{X}) \subseteq H_t^{-}(\mathbf{X}) \subseteq H(\mathbf{X})$.

In any Hilbert space, the following Projection Theorem holds true.

Theorem C.1. Let $M$ be a closed subspace of the Hilbert space $H$. Then for any $Y \in H$, there is a unique element $\hat{Y} \in M$ such that
$$
\|Y - \hat{Y}\| \le \|Y - Z\|, \qquad \forall Z \in M,
$$
and
$$
Y - \hat{Y} \perp Z, \qquad \forall Z \in M.
$$


This unique $\hat{Y}$ is called the projection of $Y$ onto $M$ and is denoted by $\mathrm{Proj}_M Y$. By the Pythagorean Theorem, $\|Y\|^2 = \|\hat{Y}\|^2 + \|Y - \hat{Y}\|^2$.

The Projection Theorem is widely used in statistics, in the context of multivariate parallel, in other words simultaneous or joint response, regressions. We distinguish between nonparametric and parametric regression, where the former applies to arbitrary multivariate distributions, whereas the latter to Gaussian ones. More generally, we speak in terms of random vectors which are not necessarily instances of a multidimensional time series.

Let $\mathbf{X}$ be a $q$- and $\mathbf{Y}$ a $p$-dimensional random vector. In the Gaussian, zero mean case, their joint distribution is characterized by their $q \times p$ cross-covariance matrix $\mathbb{E}\mathbf{X}\mathbf{Y}^T$, and the orthogonality (independence) of them means that each component of $\mathbf{X}$ is uncorrelated with each component of $\mathbf{Y}$, i.e. $\mathbb{E}\mathbf{X}\mathbf{Y}^T = O$. Note that $\mathbb{E}\mathbf{Y}\mathbf{X}^T = [\mathbb{E}\mathbf{X}\mathbf{Y}^T]^T$, whereas $\mathbb{E}\mathbf{X}\mathbf{X}^T$ is the usual (positive semidefinite) covariance matrix of $\mathbf{X}$.

Consider the multiple response regression problem, when each component of $\mathbf{Y}$ is projected onto the subspace generated by the coordinates of $\mathbf{X}$. First we deal with the general (nonparametric) situation.

Lemma C.1. Let $\mathbf{Y}$ be an $\mathbb{R}^p$-valued and $\mathbf{X}$ an $\mathbb{R}^q$-valued random vector with components of zero expectation. Then for every measurable function $f : \mathbb{R}^q \to \mathbb{R}^p$ for which the expectation below exists and is finite, the conditional expectation of $\mathbf{Y}$ conditioned on $\mathbf{X}$ minimizes the error covariance matrix:
$$
\mathbb{E}(\mathbf{Y} - f(\mathbf{X}))(\mathbf{Y} - f(\mathbf{X}))^T \ge \mathbb{E}(\mathbf{Y} - \mathbb{E}(\mathbf{Y}|\mathbf{X}))(\mathbf{Y} - \mathbb{E}(\mathbf{Y}|\mathbf{X}))^T,
$$
in the sense that the difference of the left and right hand side matrices is positive semidefinite.

Proof. With the notation $f^*(\mathbf{X}) = \mathbb{E}(\mathbf{Y}|\mathbf{X})$ we have that
$$
\begin{aligned}
\mathbb{E}(\mathbf{Y} - f(\mathbf{X}))(\mathbf{Y} - f(\mathbf{X}))^T
&= \mathbb{E}\big[(\mathbf{Y} - f^*(\mathbf{X})) + (f^*(\mathbf{X}) - f(\mathbf{X}))\big]\big[(\mathbf{Y} - f^*(\mathbf{X})) + (f^*(\mathbf{X}) - f(\mathbf{X}))\big]^T \\
&= \mathbb{E}(\mathbf{Y} - f^*(\mathbf{X}))(\mathbf{Y} - f^*(\mathbf{X}))^T + \mathbb{E}(f^*(\mathbf{X}) - f(\mathbf{X}))(f^*(\mathbf{X}) - f(\mathbf{X}))^T \\
&\quad + \mathbb{E}(\mathbf{Y} - f^*(\mathbf{X}))(f^*(\mathbf{X}) - f(\mathbf{X}))^T + \mathbb{E}(f^*(\mathbf{X}) - f(\mathbf{X}))(\mathbf{Y} - f^*(\mathbf{X}))^T \\
&= \mathbb{E}(\mathbf{Y} - f^*(\mathbf{X}))(\mathbf{Y} - f^*(\mathbf{X}))^T + \mathbb{E}(f^*(\mathbf{X}) - f(\mathbf{X}))(f^*(\mathbf{X}) - f(\mathbf{X}))^T.
\end{aligned}
$$
In the last step we used that $\mathbb{E}(\mathbf{Y} - f^*(\mathbf{X}))(f^*(\mathbf{X}) - f(\mathbf{X}))^T = O$ is the zero matrix, and likewise its transpose $\mathbb{E}(f^*(\mathbf{X}) - f(\mathbf{X}))(\mathbf{Y} - f^*(\mathbf{X}))^T$. So it suffices to prove this only for the first one:
$$
\begin{aligned}
\mathbb{E}(\mathbf{Y} - f^*(\mathbf{X}))(f^*(\mathbf{X}) - f(\mathbf{X}))^T
&= \mathbb{E}\big[\mathbb{E}\big((\mathbf{Y} - f^*(\mathbf{X}))(f^*(\mathbf{X}) - f(\mathbf{X}))^T \mid \mathbf{X}\big)\big] \\
&= \mathbb{E}\big[\mathbb{E}(\mathbf{Y} - f^*(\mathbf{X}) \mid \mathbf{X})\,(f^*(\mathbf{X}) - f(\mathbf{X}))^T\big] \\
&= \mathbb{E}\big[(\mathbb{E}(\mathbf{Y}|\mathbf{X}) - f^*(\mathbf{X}))(f^*(\mathbf{X}) - f(\mathbf{X}))^T\big] \\
&= O,
\end{aligned}
$$

since $\mathbb{E}(\mathbf{Y}|\mathbf{X}) - f^*(\mathbf{X}) = \mathbf{0}$. As $\mathbb{E}(f^*(\mathbf{X}) - f(\mathbf{X}))(f^*(\mathbf{X}) - f(\mathbf{X}))^T$ is positive semidefinite (being a covariance matrix), it follows that
$$
\mathbb{E}(\mathbf{Y} - f(\mathbf{X}))(\mathbf{Y} - f(\mathbf{X}))^T - \mathbb{E}(\mathbf{Y} - f^*(\mathbf{X}))(\mathbf{Y} - f^*(\mathbf{X}))^T
$$
is positive semidefinite, which was to be proved.

Remark C.1. The conditional expectation $f^*(\mathbf{X}) = \mathbb{E}(\mathbf{Y}|\mathbf{X})$ also minimizes
$$
\mathbb{E}\|\mathbf{Y} - f(\mathbf{X})\|^2 = \sum_{i=1}^{p} \mathbb{E}\big[Y^i - f_i(\mathbf{X})\big]^2,
$$
where the $f_i$'s are the coordinate functions of $f$. Indeed, we can minimize the $p$ terms separately. Applying Lemma C.1 to the univariate case, we get that the minimizer of the $i$th term is $f_i^*(\mathbf{X}) = \mathbb{E}(Y^i|\mathbf{X})$. As this is the $i$th coordinate of $f^*(\mathbf{X}) = \mathbb{E}(\mathbf{Y}|\mathbf{X})$, the minimum of $\mathbb{E}\|\mathbf{Y} - f(\mathbf{X})\|^2$ is attained at the same $f^*(\mathbf{X})$ that minimizes the covariance matrix $\mathbb{E}(\mathbf{Y} - f(\mathbf{X}))(\mathbf{Y} - f(\mathbf{X}))^T$ in the sense of Lemma C.1. Note that $\mathbb{E}\|\mathbf{Y} - f(\mathbf{X})\|^2$ is the trace of the error covariance matrix.

Therefore, it suffices to investigate univariate nonparametric regression estimates. If they are mean square consistent, they can be constructed with a sequence $f_i^{(n)}$ (for example, with local averaging) such that the mean square error
$$
\mathbb{E}\big[f_i^{(n)}(\mathbf{X}) - f_i^*(\mathbf{X})\big]^2 \to 0, \qquad n \to \infty,
$$
for $i = 1, \dots, p$. This implies that in the $p$-variate case, for the sequence $f^{(n)}(\mathbf{X}) = (f_1^{(n)}(\mathbf{X}), \dots, f_p^{(n)}(\mathbf{X}))^T$,
$$
\mathbb{E}\|f^{(n)}(\mathbf{X}) - f^*(\mathbf{X})\|^2 \to 0, \qquad n \to \infty \tag{C.1}
$$
holds, exhibiting a kind of mean square consistency in the multiple target situation. Since $\mathbb{E}\|f^{(n)}(\mathbf{X}) - f^*(\mathbf{X})\|^2$ is the trace of the $p \times p$ symmetric, positive semidefinite error covariance matrix $E_n = \mathbb{E}(f^{(n)}(\mathbf{X}) - f^*(\mathbf{X}))(f^{(n)}(\mathbf{X}) - f^*(\mathbf{X}))^T$, Equation (C.1) is equivalent to $\|E_n\|_2 \to 0$ as $n \to \infty$ (the spectral norm $\|E_n\|_2$ is the largest eigenvalue of $E_n$). Conversely, if $\|E_n\|_2 \to 0$, then $\mathrm{tr}\, E_n \to 0$, and by the Cauchy–Schwarz inequality, $E_n \to O$ too. See [7, 46], also [27].

Now we concentrate on linear estimates, which are the best in the above sense, too, if the underlying distribution is multivariate Gaussian. Therefore, the forthcoming estimation can be called parametric simultaneous (joint response) regression, and can be described by matrices. Here we use the second order property of $\mathbb{P}$: the pairwise inner products of the random variables in $L^2(\Omega, \mathcal{A}, \mathbb{P})$ are determined by their covariances. If we consider subspaces, then the relation of a $p$- and a $q$-dimensional subspace can be described by all possible $pq$ pairs of the pairwise covariances, i.e. by the cross-covariance matrices.


Lemma C.2. Let $\mathbf{Y} \in \mathbb{R}^p$ and $\mathbf{X} \in \mathbb{R}^q$ be random vectors on a joint probability space with existing second moments and zero expectation. Then the $q \times p$ matrix $A$ minimizing $\mathbb{E}\|\mathbf{Y} - A^T \mathbf{X}\|^2$ is
$$
A = [\mathbb{E}\mathbf{X}\mathbf{X}^T]^{-} [\mathbb{E}\mathbf{X}\mathbf{Y}^T], \tag{C.2}
$$
where we use a generalized inverse if the covariance matrix $\mathbb{E}\mathbf{X}\mathbf{X}^T$ of $\mathbf{X}$ is singular (see Appendix B). If it is positive definite, then we get a unique minimizer $A$ with the unique inverse matrix $[\mathbb{E}\mathbf{X}\mathbf{X}^T]^{-1}$.

Proof. Observe that minimizing
$$
\mathbb{E}\|\mathbf{Y} - A^T \mathbf{X}\|^2 = \sum_{i=1}^{p} \mathbb{E}(Y^i - a_i^T \mathbf{X})^2
$$
with respect to $A = [a_1, \dots, a_p]$ falls apart into the following $p$ minimization tasks, with respect to the $q$-dimensional column vectors of $A$:
$$
\min_{a_i} \mathbb{E}(Y^i - a_i^T \mathbf{X})^2, \qquad i = 1, \dots, p.
$$
The $i$th task, with the coordinates of $\mathbf{X} = (X^1, \dots, X^q)^T$ and $a_i^T = (a_{1i}, \dots, a_{qi})$, is equivalent to
$$
\mathbb{E}\Big(Y^i - \sum_{k=1}^{q} a_{ki} X^k\Big)^2 \to \min.
$$
Take the derivative with respect to $a_{ji}$ and set it equal to 0. (We assume regularity, i.e. that the differentiation and taking the expectation can be interchanged, which is true if the underlying distribution is Gaussian.)
$$
2\, \mathbb{E}\Big[(-X^j)\Big(Y^i - \sum_{k=1}^{q} a_{ki} X^k\Big)\Big] = 0, \qquad j = 1, \dots, q.
$$
After rearranging, we have the system of equations
$$
\sum_{k=1}^{q} a_{ki}\, \mathbb{E}(X^j X^k) = \mathbb{E}(X^j Y^i), \qquad j = 1, \dots, q.
$$
This can be condensed into the well-known system of Gauss normal equations from the classical theory of multivariate regression:
$$
[\mathbb{E}\mathbf{X}\mathbf{X}^T]\, a_i = [\mathbb{E}\mathbf{X} Y^i], \qquad i = 1, \dots, p.
$$
(Actually, the original equations of Gauss apply to the sample version, and do not contain expectations.) Since this system of linear equations is consistent (the vector $\mathbb{E}\mathbf{X} Y^i$ is in the column space of $\mathbb{E}\mathbf{X}\mathbf{X}^T$), it always has a solution in the general form
$$
a_i = [\mathbb{E}\mathbf{X}\mathbf{X}^T]^{-} [\mathbb{E}\mathbf{X} Y^i], \qquad i = 1, \dots, p.
$$
Here $[\mathbb{E}\mathbf{X}\mathbf{X}^T]^{-}$ is a generalized inverse of the matrix in brackets. Therefore the matrix $A$ giving the optimum is $A = [\mathbb{E}\mathbf{X}\mathbf{X}^T]^{-} [\mathbb{E}\mathbf{X}\mathbf{Y}^T]$, which is unique only if $\mathbb{E}\mathbf{X}\mathbf{X}^T$ is invertible (positive definite); otherwise (if $\mathbb{E}\mathbf{X}\mathbf{X}^T$ is singular, positive semidefinite) infinitely many versions of the generalized inverse give infinitely many convenient $A$'s (see Appendix B). Albeit with different linear combinations of the coordinates of the $X^i$'s, these always provide the same optimal linear prediction for $\mathbf{Y}$, as follows:
$$
\hat{\mathbf{Y}} = \begin{pmatrix} \hat{Y}^1 \\ \hat{Y}^2 \\ \vdots \\ \hat{Y}^p \end{pmatrix}
= \begin{pmatrix} a_1^T \mathbf{X} \\ a_2^T \mathbf{X} \\ \vdots \\ a_p^T \mathbf{X} \end{pmatrix}.
$$
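A minimal simulation sketch of formula (C.2) with sample covariances in place of the population ones (NumPy assumed; the dimensions, noise level, and sample size are arbitrary illustrative choices, and the data-generating matrix B below is hypothetical):

import numpy as np

rng = np.random.default_rng(6)
q, p, n = 3, 2, 200_000
B = rng.standard_normal((q, p))                    # "true" q x p coefficient matrix (illustrative)
X = rng.standard_normal((n, q))                    # rows: i.i.d. samples of X ~ N(0, I_q)
Y = X @ B + 0.1 * rng.standard_normal((n, p))      # Y = B^T X + small noise, row-wise

Cxx = X.T @ X / n                                  # sample version of E X X^T
Cxy = X.T @ Y / n                                  # sample version of E X Y^T
A = np.linalg.pinv(Cxx) @ Cxy                      # A = (E X X^T)^- (E X Y^T), cf. (C.2)
Y_hat = X @ A                                      # best linear prediction, row-wise A^T X

print(np.allclose(A, B, atol=0.02))                # A recovers the generating coefficients
print(np.round((Y - Y_hat).T @ X / n, 6))          # orthogonality: E(Y - Y_hat) X^T = O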

Remark C.2. It can easily be checked that $\mathbb{E}(\mathbf{Y} - \hat{\mathbf{Y}})\hat{\mathbf{Y}}^T = O$. By the proof of Lemma C.1, it follows that $\mathbf{Y} - \hat{\mathbf{Y}}$ is orthogonal to any element of $M$ spanned by $p$-tuples of the coordinates of the vector $\mathbf{X}$ (in the sense that their cross-covariance is the zero matrix). Therefore, it bears the properties of a projection in a wider sense. Hence, the statement of Lemma C.2 also follows by applying the Projection Theorem C.1 simultaneously. We know that the $q \times p$ matrix $A$, giving the minimum of $\mathbb{E}\|\mathbf{Y} - A^T \mathbf{X}\|^2$, is such that $A^T \mathbf{X} = \mathrm{Proj}_M \mathbf{Y}$, where $\mathrm{Proj}_M \mathbf{Y} = (\mathrm{Proj}_M Y^1, \dots, \mathrm{Proj}_M Y^p)^T$ denotes the coordinate-wise projection. But $\mathbf{Y} - \mathrm{Proj}_M \mathbf{Y}$ is orthogonal to any vector in $M$, which has the form $B^T \mathbf{X}$ with a $q \times p$ matrix $B$. Therefore,
$$
\mathbb{E}\big[(\mathbf{Y} - A^T \mathbf{X})(B^T \mathbf{X})^T\big] = O, \qquad \forall B_{q \times p}.
$$
Equivalently,
$$
\big[\mathbb{E}(\mathbf{Y}\mathbf{X}^T) - A^T \mathbb{E}(\mathbf{X}\mathbf{X}^T)\big] B = O, \qquad \forall B_{q \times p}.
$$
This implies that the matrix in brackets is the zero matrix, which fact, after transposing and using that $\mathbb{E}(\mathbf{X}\mathbf{X}^T)$ is symmetric (it is the usual covariance matrix of $\mathbf{X}$), gives again the system of Gauss normal equations in concise form:
$$
[\mathbb{E}(\mathbf{X}\mathbf{X}^T)]\, A = \mathbb{E}(\mathbf{X}\mathbf{Y}^T).
$$

Remark C.3. We can also estimate the attainable minimum error. When $p = 1$, then with the notations
$$
C := \mathbb{E}\mathbf{X}\mathbf{X}^T \quad \text{and} \quad d := \mathbb{E}\mathbf{X} Y,
$$


by the theory of multivariate regression [44] we have that
$$
\mathbb{E}(Y - \hat{Y})^2 = \mathrm{Var}(Y)(1 - r_{Y\mathbf{X}}^2) = \mathrm{Var}(Y) - d^T C^{-1} d, \tag{C.3}
$$
where we assumed that $C$ is positive definite and $r_{Y\mathbf{X}}$ denotes the multiple correlation between $Y$ and the components of $\mathbf{X}$. Adapting this for a $p$-dimensional $\mathbf{Y}$ we get that
$$
\mathbb{E}\|\mathbf{Y} - \hat{\mathbf{Y}}\|^2 = \sum_{i=1}^{p} \mathbb{E}(Y^i - \hat{Y}^i)^2 = \|\mathbf{Y}\|^2 - \sum_{i=1}^{p} d_i^T C^{-1} d_i,
$$
where $d_i = \mathbb{E}\mathbf{X} Y^i$, $i = 1, \dots, p$.

Lemma C.3. Let $\mathbf{Y} \in \mathbb{R}^p$ and $\mathbf{X} \in \mathbb{R}^q$ be random vectors on a joint probability space with existing second moments and zero expectation, and let $\mathrm{Proj}_M \mathbf{Y}$ denote the best linear prediction of $\mathbf{Y}$ based on $p$-tuples of linear combinations of the coordinates of $\mathbf{X}$, denoted by $M$, as in Lemma C.2. Then with any $p \times p$ matrix $\Phi$,
$$
\mathrm{Proj}_M(\Phi \mathbf{Y}) = \Phi\, \mathrm{Proj}_M \mathbf{Y}.
$$

Proof. We saw that $\mathrm{Proj}_M \mathbf{Y} = A^T \mathbf{X}$, where by (C.2) $A = [\mathbb{E}\mathbf{X}\mathbf{X}^T]^{-} [\mathbb{E}\mathbf{X}\mathbf{Y}^T]$, and we use a generalized inverse if the covariance matrix $\mathbb{E}\mathbf{X}\mathbf{X}^T$ of $\mathbf{X}$ is singular. Then
$$
\mathrm{Proj}_M(\Phi \mathbf{Y}) = \big\{[\mathbb{E}\mathbf{X}\mathbf{X}^T]^{-} [\mathbb{E}\mathbf{X}(\Phi\mathbf{Y})^T]\big\}^T \mathbf{X}
= [\mathbb{E}(\Phi\mathbf{Y}\mathbf{X}^T)]\,[\mathbb{E}\mathbf{X}\mathbf{X}^T]^{-} \mathbf{X}
= \Phi\,[\mathbb{E}(\mathbf{Y}\mathbf{X}^T)]\,[\mathbb{E}\mathbf{X}\mathbf{X}^T]^{-} \mathbf{X}
= \Phi\, \mathrm{Proj}_M \mathbf{Y}.
$$

The above lemma shows that this projection is linear in $\mathbf{Y}$ and commutes with $\Phi$. In the Gaussian case, obviously, we have that $\mathrm{Proj}_M(\Phi\mathbf{Y}) = \mathbb{E}(\Phi\mathbf{Y} \mid \mathbf{X}) = \Phi\, \mathbb{E}(\mathbf{Y} \mid \mathbf{X}) = \Phi\, \mathrm{Proj}_M \mathbf{Y}$ by the properties of the conditional expectation.

Now we go back to time series. In particular, if we look for the best linear prediction of the $p$-dimensional random vector $\mathbf{Y}$ based on the segment $\mathbf{X}_1, \dots, \mathbf{X}_t$ of a $d$-dimensional time series, then with the $q = dt$ dimensional vector $\mathbf{X} = [\mathbf{X}_1^T, \dots, \mathbf{X}_t^T]^T$, the above formula adapts as
$$
\hat{\mathbf{Y}} = \mathrm{Proj}_{H_t(\mathbf{X})} \mathbf{Y} = A^T \mathbf{X}
= \begin{bmatrix} a_{11}^T & \dots & a_{1t}^T \\ a_{21}^T & \dots & a_{2t}^T \\ \vdots & \ddots & \vdots \\ a_{p1}^T & \dots & a_{pt}^T \end{bmatrix}
\begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \\ \vdots \\ \mathbf{X}_t \end{bmatrix}
= \begin{bmatrix} a_{11}^T \mathbf{X}_1 + \dots + a_{1t}^T \mathbf{X}_t \\ a_{21}^T \mathbf{X}_1 + \dots + a_{2t}^T \mathbf{X}_t \\ \vdots \\ a_{p1}^T \mathbf{X}_1 + \dots + a_{pt}^T \mathbf{X}_t \end{bmatrix},
$$
where the columns of the $q \times p$ matrix $A$ are partitioned into $t$ segments of length $d$; or equivalently, the columns of the $p \times q$ matrix $A^T$ are partitioned into $p \times d$ matrices $A_1^T, \dots, A_t^T$ like
$$
A = \begin{bmatrix} a_{11} & \dots & a_{p1} \\ a_{12} & \dots & a_{p2} \\ \vdots & \ddots & \vdots \\ a_{1t} & \dots & a_{pt} \end{bmatrix},
\qquad
A^T = \begin{bmatrix} A_1^T & \dots & A_t^T \end{bmatrix},
\qquad
A_j^T = \begin{bmatrix} a_{1j}^T \\ a_{2j}^T \\ \vdots \\ a_{pj}^T \end{bmatrix},
$$
$j = 1, \dots, t$. With this, $\hat{\mathbf{Y}}$ is the linear combination of $\mathbf{X}_1, \dots, \mathbf{X}_t$ with matrices $A_1^T, \dots, A_t^T$, i.e.
$$
\hat{\mathbf{Y}} = A_1^T \mathbf{X}_1 + \dots + A_t^T \mathbf{X}_t.
$$

Remark C.4. Observe that the $pdt$-dimensional linear space generated by the linear combinations $A_1^T \mathbf{X}_1 + \dots + A_t^T \mathbf{X}_t$ of $\mathbf{X}_1, \dots, \mathbf{X}_t$ with $p \times d$ matrices $A_1^T, \dots, A_t^T$ is the $p$-tuple Cartesian product of $H_t(\mathbf{X})$, which is $dt$-dimensional and contains scalar linear combinations of all the $d$ coordinates of $\mathbf{X}_1, \dots, \mathbf{X}_t$. So, in case of $p$ simultaneous regressions, $\mathbf{X}_1, \dots, \mathbf{X}_t$ are linearly combined with $p \times d$ matrices, the rows of which give the scalar linear combinations that define the individual regressions. Just the solution is organized in matrix form, which is more suitable for our purposes.

So far, the time series was not necessarily stationary. When it is, the covariance matrix of the compounded vector $\mathbf{X} = [\mathbf{X}_1^T, \dots, \mathbf{X}_t^T]^T$ is
$$
\mathbb{E}\mathbf{X}\mathbf{X}^T =
\begin{bmatrix}
C(0) & C(1) & \dots & C(t-1) \\
C(-1) & C(0) & \dots & C(t-2) \\
\vdots & \vdots & \ddots & \vdots \\
C(1-t) & C(2-t) & \dots & C(0)
\end{bmatrix},
$$
which is the symmetric (due to $C(-k) = C^T(k)$), positive semidefinite block Toeplitz matrix discussed in the context of VARMA processes. It is also positive definite whenever the process $\{\mathbf{X}_t\}$ is regular (see the Wold decomposition). Going further, with $\mathbf{Y} = \mathbf{X}_{t+1}$, we get the one-step ahead prediction $\hat{\mathbf{X}}_{t+1} = A^T \mathbf{X}$, where the optimal $dt \times d$ matrix $A$ is
$$
A = \begin{bmatrix}
C(0) & C(1) & \dots & C(t-1) \\
C(-1) & C(0) & \dots & C(t-2) \\
\vdots & \vdots & \ddots & \vdots \\
C(1-t) & C(2-t) & \dots & C(0)
\end{bmatrix}^{-}
\begin{bmatrix}
C(1) \\ C(2) \\ \vdots \\ C(t)
\end{bmatrix}.
$$
When the above block Toeplitz matrix is not singular, $A$ is the unique solution of the system of equations
$$
\begin{bmatrix}
C(0) & C(1) & \dots & C(t-1) \\
C(-1) & C(0) & \dots & C(t-2) \\
\vdots & \vdots & \ddots & \vdots \\
C(1-t) & C(2-t) & \dots & C(0)
\end{bmatrix} A =
\begin{bmatrix}
C(1) \\ C(2) \\ \vdots \\ C(t)
\end{bmatrix},
$$
which are exactly the first $t$ Yule–Walker equations introduced in the context of VARMA processes. Note that the theory naturally extends to complex valued random variables with the inner product used in Chapter 1.
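The following simulation sketch illustrates the one-step ahead prediction from a finite past (NumPy assumed; the VAR(1) coefficient matrix, dimensions, and sample size are arbitrary illustrative choices, and the stacking conventions are those of the code, chosen for convenience). The normal equations solved below are the sample analogue of the Yule–Walker system above, their coefficient matrix being the empirical block Toeplitz matrix of the autocovariances.

import numpy as np

rng = np.random.default_rng(7)
d, t, n = 2, 3, 50_000
Phi = np.array([[0.5, 0.2], [-0.1, 0.4]])          # a stable VAR(1) coefficient matrix (illustrative)
X = np.zeros((n, d))
for s in range(1, n):
    X[s] = Phi @ X[s - 1] + rng.standard_normal(d) # X_s = Phi X_{s-1} + standard normal innovation

# regress X_{s+t} on the stacked past block [X_s^T, ..., X_{s+t-1}^T]
Z = np.hstack([X[j:n - t + j] for j in range(t)])  # (n-t) x (d t) design matrix
Y = X[t:]                                          # targets X_{s+t}
A = np.linalg.solve(Z.T @ Z, Z.T @ Y)              # dt x d solution of the sample normal equations
X_hat = Z @ A                                      # one-step ahead predictions

# for a VAR(1), only the most recent block matters: A is approximately [O; O; Phi^T]
print(np.round(A, 2))
print("mean squared one-step error:", ((Y - X_hat) ** 2).mean())   # close to 1, the innovation variance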

D Tools from algebra

Algebraic tools are important for proving many statements in linear system theory. Here we recall some tools from algebra that we are using. Readers who are not interested in these algebraic tools and their applications in the proofs may skip this. Otherwise, it is supposed that the interested reader is familiar with some basic concepts of algebra, like semigroups, groups, and fields.

A ring $R$ is a set with two operations called addition and multiplication. An important example is $\mathbb{C}[z]$, the ring of complex polynomials. By definition, with respect to addition $R$ is a commutative (Abelian) group, with respect to multiplication $R$ is a semigroup with a multiplicative unit 1, and the two distributive laws hold. $R$ is called commutative if $xy = yx$ for all $x, y \in R$. For example, $\mathbb{C}[z]$ is a commutative ring.

Given two rings $R_1$ and $R_2$, a ring homomorphism is a map $\phi : R_1 \to R_2$ such that
$$
\phi(x + y) = \phi(x) + \phi(y), \qquad \phi(xy) = \phi(x)\phi(y), \qquad \phi(1) = 1;
$$
$\phi(0) = 0$ follows automatically.

A subset $J$ of a ring $R$ is a left ideal if it is an additive subgroup of $R$ and $RJ = J$. A right ideal is similar, with $JR = J$. An ideal is simultaneously a left and right ideal. For example, fixing a non-zero polynomial $p(z) \in \mathbb{C}[z]$, $J = p(z)\mathbb{C}[z]$ is an ideal. If we have two nonzero elements $x, y \in R$ such that $xy = 0$, then we say that $x$ and $y$ are zero divisors. A principal ideal domain $R$ is a commutative ring with no zero divisors and in which every ideal is principal, that is, of the form $aR$, where $a \in R$. For example, $\mathbb{C}[z]$ is a principal ideal domain.

A generalization of the notion of vector spaces is that of modules. The reason for introducing this notion here is that we want to multiply our 'vectors' not only by complex scalars, but by complex polynomials, and $\mathbb{C}[z]$ is only a ring, not a field. Let $R$ be a ring. A left $R$-module $M$ is a set in which an addition is defined, plus a multiplication from the left by elements of $R$. By definition, with respect to addition $M$ is a commutative group, and with respect to multiplication it satisfies
$$
r(x + y) = rx + ry, \qquad (r + s)x = rx + sx, \qquad r(sx) = (rs)x, \qquad 1x = x
$$
for all $r, s \in R$ and $x, y \in M$. Right $R$-modules are defined similarly; we just multiply by the elements of $R$ from the right. In our main examples the


elements of a module are commutative with respect to multiplication by polynomials from the ring $\mathbb{C}[z]$, so they can simply be called $\mathbb{C}[z]$-modules.

If $M_1$ and $M_2$ are two left $R$-modules, a map $\phi : M_1 \to M_2$ is an $R$-module homomorphism if
$$
\phi(x + y) = \phi(x) + \phi(y), \qquad \phi(rx) = r\phi(x),
$$
for any $x, y \in M_1$ and $r \in R$.

Our main example for a $\mathbb{C}[z]$-module is the set of polynomial matrices $P(z) \in \mathbb{C}[z]^{j \times k}$ whose entries are complex polynomials in the indeterminate $z$, where $j$ and $k$ are arbitrary fixed positive integers. It is often advantageous to treat $z$ as a complex number, though in several cases it can signify a left shift of a sequence over the integers $\mathbb{Z}$ as well. We can also write $P(z)$ as a matrix polynomial with complex matrix coefficients,
$$
P(z) = \sum_{r=0}^{m} P_r z^r \in \mathbb{C}^{j \times k}[z], \qquad P_r \in \mathbb{C}^{j \times k},
$$
where $m$ is the highest degree of the polynomial entries of $P(z)$. Thus $\mathbb{C}^{j \times k}[z]$ and $\mathbb{C}[z]^{j \times k}$ are isomorphic. In general, an isomorphism, denoted $\cong$, is a one-to-one and onto homomorphism. Important special cases are the $\mathbb{C}[z]$-module of column matrices with polynomial entries, denoted by $\mathbb{C}[z]^j \cong \mathbb{C}^j[z] = \mathbb{C}^{j \times 1}[z] \cong \mathbb{C}[z]^{j \times 1}$, and of the row matrices with polynomial entries, denoted by $\mathbb{C}[z]^{1 \times k} \cong \mathbb{C}^{1 \times k}[z]$.

Given a $j \times \ell$ polynomial matrix $P(z)$ and a $j \times k$ polynomial matrix $R(z)$, $P(z)$ is left divisible by $R(z)$ if there exists a $k \times \ell$ polynomial matrix $Q(z)$ such that $P(z) = R(z) Q(z)$. The $j \times k$ polynomial matrix $R(z)$ is a common left divisor of the $j \times \ell_r$ polynomial matrices $P_r(z)$ ($r = 1, \dots, s$) if each $P_r(z)$ is left divisible by $R(z)$. A greatest common left divisor (gcld) $R(z)$ is a common left divisor such that any other common left divisor is a left divisor of $R(z)$ as well.

A $j \times j$ polynomial matrix $U(z)$ is unimodular if there exists $U^{-1}(z) \in \mathbb{C}^{j \times j}[z]$ such that $U(z) U^{-1}(z) = I_j = U^{-1}(z) U(z)$, where $I_j$ is the $j \times j$ identity matrix. Equivalently, $U(z)$ is unimodular if and only if $\det U(z) = \alpha \in \mathbb{C}$, $\alpha \ne 0$, since $U^{-1}(z) := \mathrm{adj}(U(z))/\det U(z)$, where $\mathrm{adj}(U(z))$, the adjugate matrix (transpose of the cofactor matrix) of $U(z)$, is a polynomial matrix.
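A small symbolic sketch of unimodularity (SymPy assumed; the example matrices are arbitrary illustrative choices): a polynomial matrix with constant nonzero determinant has a polynomial inverse, while a non-unimodular one does not.

import sympy as sp

z = sp.symbols('z')
U = sp.Matrix([[1, z**2 + 1], [0, 1]])       # det U(z) = 1, a nonzero constant, so U is unimodular
print(U.det())                                # 1
print(sp.simplify(U.inv()))                   # Matrix([[1, -z**2 - 1], [0, 1]]): again polynomial
print(sp.simplify(U * U.inv()))               # identity matrix

R = sp.Matrix([[z, 1], [0, z]])               # det R(z) = z**2, not a constant, so R is not unimodular
print(sp.factor(R.det()))
print(sp.simplify(R.inv()))                   # inverse has rational, non-polynomial entries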

$P_1(z)$ and $P_2(z)$ are left coprime if their greatest common left divisor is a unimodular matrix. Observe that any $j \times j$ unimodular matrix is a left divisor of each $j \times \ell$ polynomial matrix $P(z)$: $P(z) = U(z)\,(U^{-1}(z)P(z))$. One can similarly define the corresponding notions of right divisibility.

We say that a polynomial matrix $P(z) = [p_{jk}(z)] \in \mathbb{C}^{p \times q}[z]$ and an ordinary complex polynomial $\psi(z) \in \mathbb{C}[z]$ are coprime if there are no common factors $(z - z_0)$ of all polynomial entries $p_{jk}(z)$ ($j = 1, \dots, p$; $k = 1, \dots, q$) and $\psi(z)$.

Observe that $\mathbb{C}^{n \times n}[z]$, the set of square polynomial matrices, is simultaneously a ring and a $\mathbb{C}[z]$-module.

We also need the notion of a rational matrix. $H(z) = [h_{jk}(z)]_{p \times q}$ is a rational matrix if each entry $h_{jk}(z)$ is a rational function of the variable $z \in \mathbb{C}$, $h_{jk}(z) = n_{jk}(z)/d_{jk}(z)$, where $n_{jk}(z)$ and $d_{jk}(z)$ are complex polynomials.

A square polynomial matrix $R(z) \in \mathbb{C}^{p \times p}[z]$ is called non-singular if $\det R(z) \not\equiv 0$, that is, its determinant is not identically zero. However, its determinant, which is a complex polynomial, may have finitely many zeros in $\mathbb{C}$. It follows that the inverse $R^{-1}(z)$ exists except for at most finitely many values of $z$ and is a $p \times p$ rational matrix: $R^{-1}(z) := \mathrm{adj}(R(z))/\det R(z)$, where $\mathrm{adj}(R(z))$ denotes the adjugate matrix: the transpose of the cofactor matrix.

The extension of the notions of generator set, linear independence, and basis from vector spaces to $R$-modules is straightforward; the only difference is that the elements of the ring $R$ play the role of scalar multipliers. For example, let us consider $\mathbb{C}^n[z]$, the set of $n$-dimensional polynomial vectors, as a module over the ring $\mathbb{C}[z]$ of polynomials. Then the vectors $x_1(z), \dots, x_s(z) \in \mathbb{C}^n[z]$ are linearly independent if
$$
\sum_{j=1}^{s} r_j(z) x_j(z) \equiv 0 \implies r_1(z) \equiv \dots \equiv r_s(z) \equiv 0,
$$
where the $r_j(z)$ are complex polynomials. As a concrete example, the columns of the matrix
$$
\begin{bmatrix} x_1(z) & x_2(z) \end{bmatrix} =
\begin{bmatrix} z+1 & z+2 \\ (z+1)(z+3) & (z+2)(z+3) \end{bmatrix}
$$
are not linearly independent, because $(z+2)x_1(z) - (z+1)x_2(z) \equiv 0$.

Correspondingly, the column rank of a polynomial matrix $A(z) \in \mathbb{C}^{m \times n}[z]$ is the maximal number of its linearly independent columns over the ring $\mathbb{C}[z]$ of polynomials. It is not difficult to show that the column rank of $A(z)$ is equal to the row rank and is equal to the size of its largest nonsingular minor. (A determinant is a polynomial now; it is called nonsingular if it is not identically 0. Otherwise, it may have finitely many zeros in $\mathbb{C}$.) Thus we can simply call this the rank of the polynomial matrix $A(z)$, $\mathrm{rank}\, A(z) \le \min(m, n)$.

Example D.1. It is easy to find a basis for the $\mathbb{C}[z]$-module $\mathbb{C}^{m \times n}[z]$. For example, the standard basis
$$
E_{jk} = [e_{rs}]_{m \times n} = [\delta_{rj}\delta_{sk}]_{m \times n} \qquad (j = 1, \dots, m;\ k = 1, \dots, n)
$$
for the vector space of ordinary matrices $\mathbb{C}^{m \times n}$ is a basis for this module as well. For, these matrices are clearly linearly independent, and any matrix $A(z) = [a_{jk}(z)]_{m \times n} \in \mathbb{C}^{m \times n}[z]$ can be represented as
$$
A(z) = \sum_{j=1}^{m} \sum_{k=1}^{n} a_{jk}(z) E_{jk}.
$$
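The concrete two-column example above can be checked symbolically (a SymPy sketch, for illustration only):

import sympy as sp

z = sp.symbols('z')
x1 = sp.Matrix([z + 1, (z + 1) * (z + 3)])
x2 = sp.Matrix([z + 2, (z + 2) * (z + 3)])
print(sp.simplify((z + 2) * x1 - (z + 1) * x2))   # zero vector: the columns are dependent over C[z]

A = sp.Matrix.hstack(x1, x2)
print(A.rank())                                   # 1: the rank of the polynomial matrix
print(sp.factor(A.det()))                         # 0: the determinant vanishes identically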

The situation is more complex with submodules. The next lemma generalizes a well-known statement from linear algebra to $R$-submodules.

Lemma D.1. Let $R$ be a principal ideal domain and $M$ a left $R$-module with a finite basis of $m \ge 0$ elements. Then every left $R$-submodule $N \subset M$ also has a finite basis of $0 \le n \le m$ elements. A similar statement holds for right $R$-modules as well.

Proof. If $M = \{0\}$, that is, its basis has 0 elements, then the statement is trivial. We prove the theorem by induction over $m$. Assume that the lemma holds for $(m-1)$ basis elements. Let $B = \{e_1, \dots, e_m\}$ be a basis for $M$. Then every element $x \in M$ has a unique representation $x = \sum_{j=1}^{m} r_j e_j$, $r_j \in R$. Let $N \subset M$ be a left $R$-submodule. If $N$ contains only elements of the form $\sum_{j=1}^{m-1} r_j e_j$, then the statement holds by the induction hypothesis. On the other hand, if $N$ contains an element $\sum_{j=1}^{m} r_j e_j$ with $r_m \ne 0$, define $J := \{r_m \in R : \sum_{j=1}^{m} r_j e_j \in N\}$. Clearly, $J$ is an ideal in $R$: $JR = J$. Since $R$ is a principal ideal domain, it follows that $J = aR$ with some $a \in J$, $a \ne 0$. Thus $N$ has an element
$$
y = r_1 e_1 + \dots + r_{m-1} e_{m-1} + a e_m, \qquad a \ne 0, \quad r_j \in R.
$$
If $\tilde{y} \in N$ is arbitrary, then there exists an $r \in R$ such that $\tilde{y} - r y$ belongs to the submodule $\tilde{N}$ of $M$ generated by $\{e_1, \dots, e_{m-1}\}$ only. Hence by the induction hypothesis there exists a basis $\{f_1, \dots, f_{n-1}\}$ of $\tilde{N}$ with $n-1 \le m-1$. Clearly, $B' := \{f_1, \dots, f_{n-1}, y\}$ spans $N$. We claim that $B'$ is linearly independent. Assume that
$$
s_1 f_1 + \dots + s_{n-1} f_{n-1} + s_n y = 0 \qquad (s_j \in R).
$$
If $s_n = 0$ then, $\{f_1, \dots, f_{n-1}\}$ being linearly independent, it follows that $s_1 = \dots = s_{n-1} = 0$ as well. On the other hand, if $s_n \ne 0$ then it follows that
$$
s_1 f_1 + \dots + s_{n-1} f_{n-1} + s_n (r_1 e_1 + \dots + r_{m-1} e_{m-1} + a e_m)
= \big[s_1 f_1 + \dots + s_{n-1} f_{n-1} + s_n (r_1 e_1 + \dots + r_{m-1} e_{m-1})\big] + s_n a e_m = 0.
$$
The term $s_n a e_m$ is linearly independent from the term in the brackets. Thus $a = 0$ would follow, which is a contradiction. This finishes the proof of the lemma.

It can be proved that if $R$ is a principal ideal domain and $M$ is a left $R$-module with a finite basis of $m \ge 0$ elements, then each basis of $M$ consists of $m$ elements. The number of the elements in a basis of $M$ is called the rank of the module $M$, denoted by $\mathrm{rank}\, M$. By Example D.1, $\mathrm{rank}\, \mathbb{C}^{m \times n}[z] = mn$. Clearly, the rank of a polynomial matrix $A(z) \in \mathbb{C}^{m \times n}[z]$ is the same as the rank of the submodule $M := A(z)\mathbb{C}^n[z] \subset \mathbb{C}^m[z]$.

The useful technique of elementary row (or column) operations can be extended to polynomial matrices. From now on we discuss mainly row operations, since it is easy to modify the statements to column operations. So the elementary row operations are

1. interchanging the $i$th and $j$th rows ($i \ne j$),
2. multiplying the $i$th row by a nonzero complex number $\alpha$,
3. adding a polynomial $p(z) \in \mathbb{C}[z]$ multiple of the $i$th row to the $j$th row ($i \ne j$).

Each of these elementary row operations on a polynomial matrix $A(z) \in \mathbb{C}^{m \times n}[z]$ can be represented by an $m \times m$ elementary matrix $E \in \mathbb{C}^{m \times m}[z]$, which multiplies $A(z)$ from the left: $E_{i \leftrightarrow j}$ is the identity matrix with its $i$th and $j$th rows interchanged, $E_{i,\alpha}$ is the identity matrix with its $(i,i)$ entry replaced by $\alpha$, and $E_{i,j}(p(z))$ is the identity matrix with the additional entry $p(z)$ in position $(j,i)$:
$$
E_{i \leftrightarrow j} :=
\begin{bmatrix}
1 & & & & \\
 & 0 & & 1 & \\
 & & \ddots & & \\
 & 1 & & 0 & \\
 & & & & 1
\end{bmatrix},
\qquad
E_{i,\alpha} :=
\begin{bmatrix}
1 & & & \\
 & \alpha & & \\
 & & \ddots & \\
 & & & 1
\end{bmatrix},
\qquad
E_{i,j}(p(z)) :=
\begin{bmatrix}
1 & & & & \\
 & 1 & & & \\
 & & \ddots & & \\
 & p(z) & & 1 & \\
 & & & & 1
\end{bmatrix}. \tag{D.1}
$$

Clearly, $E_{i \leftrightarrow j}^{-1} = E_{i \leftrightarrow j}$, $E_{i,\alpha}^{-1} = E_{i,1/\alpha}$, and $E_{i,j}(p(z))^{-1} = E_{i,j}(-p(z))$. This implies that all elementary matrices and their products are unimodular. In the case of elementary column operations we should multiply a matrix $A(z)$ by the adjoint matrices $E^*$ of the above elementary matrices from the right. The formula $E_{i,\alpha}^{-1} = E_{i,1/\alpha}$ explains why one cannot multiply a row or column by a polynomial, just by a nonzero complex number, if one wants to use only unimodular elementary matrices.

Next we are going to describe a Hermite form of a polynomial matrix.

Lemma D.2. Let $A(z) \in \mathbb{C}^{m \times n}[z]$ and let its rank be equal to $n$, so $A(z)$ has full column rank. Then by elementary row operations, equivalently, by premultiplying by a unimodular matrix $U(z) \in \mathbb{C}^{m \times m}[z]$, it can be transformed to an upper triangular form
$$
\begin{bmatrix}
c_{11}(z) & c_{12}(z) & \cdots & c_{1n}(z) \\
0 & c_{22}(z) & \cdots & c_{2n}(z) \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & c_{nn}(z) \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix},
$$
where

1. all entries below the main diagonal are identically zero,
2. the diagonal entries $c_{kk}(z)$ are monic,
3. any entry $c_{jk}(z)$, $j < k$, above the main diagonal has smaller degree than the corresponding diagonal entry $c_{kk}(z)$.

Proof. By interchanging rows, bring to the $(1,1)$ position a nonzero element of the lowest degree in the first column, call it $a_{11}(z)$. By the Euclidean division algorithm, any entry $a_{j1}(z)$, $j > 1$, can be written as
$$
a_{j1}(z) = q_{j1}(z) a_{11}(z) + r_{j1}(z), \qquad \deg r_{j1}(z) < \deg a_{11}(z),
$$
with some polynomials $q_{j1}(z)$ and $r_{j1}(z)$, where $\deg$ denotes the degree of a polynomial. (By definition, the identically zero polynomial has degree smaller than zero.) So add $-q_{j1}(z)$ times the first row to the $j$th; this way the original polynomial entry $a_{j1}(z)$ is replaced by $r_{j1}(z)$, which has smaller degree than $a_{11}(z)$. Perform this for $j = 2, \dots, m$. Now repeat the operation choosing a new nonzero $(1,1)$ element with the smallest degree in the first column, and continue this until all elements in the first column except the $(1,1)$ element are zero. If the leading coefficient of the $(1,1)$ element is $\alpha \ne 1$, then multiply the first row by $1/\alpha$.

Then consider the second column of the resulting matrix and, temporarily ignoring the first row, repeat the above procedure with the $(2,2)$ element and the rows below, until all entries below the $(2,2)$ element become zero. If the leading coefficient of the $(2,2)$ element is $\alpha \ne 1$, then multiply the second row by $1/\alpha$. Then by a Euclidean division replace the $(1,2)$ entry above the $(2,2)$ entry by a remainder that has smaller degree than the $(2,2)$ entry, if this condition did not originally hold. Continue this procedure with the 3rd, ..., $n$th column, and finally we arrive at the desired Hermite form.

Corollary D.1. By elementary row operations, equivalently by premultiplying by a unimodular matrix, one may transform an arbitrary polynomial matrix $A(z) \in \mathbb{C}^{m \times n}[z]$ to a Hermite form, even if $\mathrm{rank}\, A(z) = r < n$. At the beginning we skip the identically zero columns, if there are any. Then we carry out the algorithm as described in the proof of Lemma D.2, except if, after the currently finished $k$th column and $(j,k)$, $j \le k$, 'diagonal' element, the next $(k+1)$th, $(k+2)$th, ... columns have only zeros below the $j$th row. Then we skip these columns. At the end we have a quasi upper triangular matrix called an echelon form, which still has all entries below the 'main diagonal' identically zero. Also, in the $r$ columns that were not skipped in the procedure, properties (2) and (3) described in Lemma D.2 hold as well.

The next lemma gives a construction for a greatest common right divisor. For simplicity, here we restrict ourselves to square matrix greatest common right divisors (gcrd's). One can similarly construct a greatest common left divisor.

Lemma D.3. Given polynomial matrices $A_j(z) \in \mathbb{C}^{m_j \times n}[z]$ ($j = 1, \dots, k$), there exists a greatest common right divisor $R(z) \in \mathbb{C}^{n \times n}[z]$ of them. Moreover, there exist polynomial matrices $\tilde{U}_j(z) \in \mathbb{C}^{n \times m_j}[z]$, $j = 1, \dots, k$, such that
$$
R(z) = \sum_{j=1}^{k} \tilde{U}_j(z) A_j(z). \tag{D.2}
$$
Similar statements hold for a greatest common left divisor.

Proof. Let $m := \sum_{j=1}^{k} m_j$ and define the $m \times n$ matrix $A(z)$ as
$$
A(z) := \begin{bmatrix} A_1(z) \\ \vdots \\ A_k(z) \end{bmatrix}. \tag{D.3}
$$


By Corollary D.1 there exists a unimodular matrix $U(z) \in \mathbb{C}^{m \times m}[z]$ such that $U(z)A(z) = \tilde{R}(z)$,
$$
\begin{bmatrix} U_{11}(z) & \cdots & U_{1k}(z) \\ U_{21}(z) & \cdots & U_{2k}(z) \end{bmatrix}
\begin{bmatrix} A_1(z) \\ \vdots \\ A_k(z) \end{bmatrix}
= \begin{bmatrix} R(z) \\ 0_{(m-n) \times n} \end{bmatrix}, \tag{D.4}
$$
where $\tilde{R}(z) \in \mathbb{C}^{m \times n}[z]$ is the Hermite form of $A(z)$, $R(z) \in \mathbb{C}^{n \times n}[z]$, $U_{1j}(z) \in \mathbb{C}^{n \times m_j}[z]$, $U_{2j}(z) \in \mathbb{C}^{(m-n) \times m_j}[z]$, $j = 1, \dots, k$, and $0_{(m-n) \times n}$ is a zero matrix of size $(m-n) \times n$. Since $U(z)$ is unimodular, its inverse $V(z) \in \mathbb{C}^{m \times m}[z]$ is also a polynomial matrix:
$$
U^{-1}(z) = V(z) = \begin{bmatrix} V_{11}(z) & V_{12}(z) \\ \vdots & \vdots \\ V_{k1}(z) & V_{k2}(z) \end{bmatrix},
$$
where $V_{j1}(z) \in \mathbb{C}^{m_j \times n}[z]$ and $V_{j2}(z) \in \mathbb{C}^{m_j \times (m-n)}[z]$, $j = 1, \dots, k$. Then we get $A(z) = V(z)\tilde{R}(z)$,
$$
\begin{bmatrix} A_1(z) \\ \vdots \\ A_k(z) \end{bmatrix}
= \begin{bmatrix} V_{11}(z) & V_{12}(z) \\ \vdots & \vdots \\ V_{k1}(z) & V_{k2}(z) \end{bmatrix}
\begin{bmatrix} R(z) \\ 0_{(m-n) \times n} \end{bmatrix}.
$$
This implies
$$
A_j(z) = V_{j1}(z) R(z), \qquad j = 1, \dots, k,
$$
so $R(z)$ is a right divisor of each $A_j(z)$, $j = 1, \dots, k$. By (D.4) it follows that
$$
R(z) = \sum_{j=1}^{k} U_{1j}(z) A_j(z),
$$
which proves (D.2). If $S(z) \in \mathbb{C}^{n \times n}[z]$ is another common right divisor:
$$
A_j(z) = W_j(z) S(z), \qquad W_j(z) \in \mathbb{C}^{m_j \times n}[z], \quad j = 1, \dots, k,
$$
then
$$
R(z) = \Big\{ \sum_{j=1}^{k} U_{1j}(z) W_j(z) \Big\} S(z).
$$

R2 (z) = W1 (z)R1 (z),

Wj (z) ∈ Cn×n [z],

257 thus R1 (z) = W2 (z)W1 (z)R1 (z). It implies that if R1 (z) is nonsingular, that is, det R1 (z) 6≡ 0, then W1 (z) and W2 (z) must be unimodular, hence R2 (z) is also nonsingular. So if one gcrd is nonsingular, then all gcrd’s must be nonsingular. Similarly, if one gcrd is unimodular, then all gcrd’s must be unimodular. Moreover, if the matrix A(z) defined in (D.3) has full column rank n, then Lemma D.2 implies that the gcrd R(z) is nonsingular and all other gcrd are also nonsingular, differing from R(z) by a unimodular left factor. The next B´ezout’s identity characterizes the coprime polynomial matrices. Lemma D.4. P1 (z) ∈ Cm1 ×n [z] and P2 (z) ∈ Cm2 ×n [z] are right coprime if and only if there exist polynomial matrices Xj (z) ∈ Cn×mj [z] (j = 1, 2) such that X1 (z)P1 (z) + X2 (z)P2 (z) = In , (D.5) where In is the n × n identity matrix. Similar statement holds for left coprime matrices. Proof. (D.2) gives a formula for a gcrd R(z) of P1 (z) and P2 (z): ˜1 (z)P1 (z) + U ˜2 (z)P2 (z) R(z) = U ˜1 (z) and U ˜2 (z). If P1 (z) and P2 (z) are right cowith polynomial matrices U prime, then R(z) must be unimodular, so that R−1 (z) is also a polynomial matrix. Therefore, we can write In = X1 (z)P1 (z) + X2 (z)P2 (z),

˜j (z), Xj (z) = R−1 (z)U

j = 1, 2.

Conversely, assume that (D.5) holds. Let R(z) be a gcrd of P1 (z) and P2 (z). Then Pj (z) = Uj (z)R(z), j = 1, 2, with some polynomial matrices Uj (z), j = 1, 2. These imply that {X1 (z)U1 (z) + X2 (z)U2 (z)} R(z) = In . This shows that R−1 (z) = X1 (z)U1 (z) + X2 (z)U2 (z), a polynomial matrix. Thus R(z) is unimodular, so P1 (z) and P2 (z) are right coprime. Remark D.2. P1 (z) ∈ Cm1 ×n [z] and P2 (z) ∈ Cm2 ×n [z] are right coprime if and only if the matrix   P1 (z) P2 (z) has full column rank n for every z ∈ C. For, then Lemmas D.2 and D.3 imply that it is equivalent to the fact that the determinant of the gcrd R(z) of P1 (z) and P2 (z) is a nonzero constant complex number, so R(z) is unimodular.

258

Tools from algebra

Lemma D.5. We have the following properties of submodules. (a) M ⊂ Cn [z] is a submodule if and only if it is of the form M = A(z)Ck [z],

A(z) ∈ Cn×k [z],

where A(z) has full column rank k. (b) If A(z) ∈ Cn×q [z] and B(z) ∈ Cn×p [z], then M1 := A(z)Cq [z] ⊂ M2 := B(z)Cp [z]



A(z) = B(z)X(z),

where X(z) ∈ Cp×q [z]. Proof. (a) The ‘if’ part is obvious. Conversely, Lemma D.1 and Example D.1 imply that a submodule M ⊂ Cn [z] has a basis consisting of k ≤ n elements a1 (z), . . . , ak (z) ∈ Cn [z]. Define A(z) := [a1 (z) · · · ak (z)]n×k . Then M = A(z)Ck [z] and A(z) has rank k. (b) The implication ⇐ is obvious. Conversely, we have to show that if M1 ⊂ M2 , then the system of equations B(z)X(z) = A(z) has a polynomial matrix solution X(z). Perform elementary row operations on the matrix [B(z)A(z)]n×(q+p) to obtain its Hermite form. The condition M1 ⊂ M2 guarantees that the above system will be solvable and by the back-substitution of the usual Gaussian algorithm one gets a polynomial solution X(z) from the Hermite form. Here we extend Lemma D.3 to the non-square divisor case and we establish a connection between the generated submodule and the gcld. Lemma D.6. Let Aj (z) ∈ Cn×mj [z], j = 1, . . . , r. Then there exists their greatest common left divisor (gcld) D(z) ∈ Cn×q [z]. A polynomial matrix D(z) ∈ Cn×q [z] is a gcld if and only if for the generated submodule M := A1 (z)Cm1 [z] + · · · + Ar (z)Cmr [z] ⊂ Cn [z] we have M = D(z)Cq [z]. Similar statements are true for the greatest common right divisors. Pr Proof. Define A(z) := [A1 (z) · · · Ar (z)]n×m , where m = j=1 mj . A matrix D(z) ∈ Cn×q [z] is a common left divisor of the matrices Aj (z), j = 1, . . . , r, if and only if D(z) is a left divisor of A(z): A(z) = D(z)U (z),

U (z) ∈ Cq×m [z].

By Lemma D.5(b) this is equivalent to M := A1 (z)Cm1 [z] + · · · + Ar (z)Cmr [z] = A(z)Cm [z] ⊂ D(z)Cq [z].

(D.6)

259 Since M is a submodule of Cn [z], by Lemma D.1 it has a finite basis, so there exists a matrix D0 (z) ∈ Cn×p [z] such that M = D0 (z)Cp [z].

(D.7)

Thus D0 (z) is also a common left divisor. Assuming that D(z) is a gcld, it follows that D(z) = D0 (z)Q(z)

for some

Q(z) ∈ Cp×q [z].

By Lemma D.5(b), this implies that D(z)Cq [z] ⊂ D0 (z)Cp [z]. In turn, this with (D.6) and (D.7) imply that D(z)Cq [z] = M = D0 (z)Cp [z]. Conversely, assume that M := A1 (z)Cm1 [z] + · · · + Ar (z)Cmr [z] = D(z)Cq [z]. By Lemma D.5(a) such a D(z) always exists. Then by (D.6) D(z) is a common left divisor. Moreover, D(z)Cq [z] = M ⊂ D0 (z)Cp [z] for any common left divisor D0 (z) by (D.6). Then by Lemma D.5(b) this implies that D(z) = D0 (z)Q(z) for some Q(z) ∈ Cp×q [z]. Thus D(z) is a greatest common left divisor. Lemma D.7. Let Aj (z) ∈ Cn×mj [z], j = 1, . . . , r with rank q := rank[A1 (z) · · · Ar (z)]. (a) There exists a gcld D(z) ∈ Cn×q [z] with rank q. If D0 (z) ∈ Cn×p [z] is also a gcld with full column rank p, then p = q and there exists a unique unimodular transformation X(z) ∈ Cq×p [z] such that D0 (z) = D(z)X(z). (b) A square gcld D(z) ∈ Cn×n [z] with det D(z) 6≡ 0 exists if and only if rank[A1 (z) · · · Ar (z)] = n. Similar statements hold for the greatest common right divisors. Proof. (a) By Lemma D.1, M := A1 (z)Cm1 [z] + · · · + Ar (z)Cmr [z] has a finite basis and rank M = q ≤ n. By Lemma D.5(a), there exists a full column rank matrix D(z) ∈ Cn×q [z] such that M = D(z)Cq [z]. By Lemma D.6, then D(z) is a gcld. Let D0 (z) ∈ Cn×p [z] be another gcld with full column rank p. Then

260

Tools from algebra

D0 (z)Cp [z] = M = D(z)Cq [z]. By Lemma D.5, this implies that p = q, D0 (z) = D(z)X(z) and D(z) = D0 (z)Y (z) with X(z), Y (z) ∈ Cq×q [z]. Thus D0 (z) = D0 (z)Y (z)X(z),

D(z) = D(z)X(z)Y (z).

Since D(z) and D0 (z) have full column rank, by the Gaussian algorithm, getting a Hermite form of the systems of equations [D0 (z)D0 (z)] and [D(z)D(z)] and using back-substitution, we obtain that Y (z)X(z) = Iq = X(z)Y (z). Thus X(z) and Y (z) are unimodular, Y (z) = X −1 (z). (b) If rank[A1 (z), · · · , Ar (z)] = n, equivalently, rank M = n, then by (a) it follows that D(z) ∈ Cn×n [z], rank D(z) = n, det D(z) 6≡ 0. Conversely, if D(z) ∈ Cn×n [z], det D(z) 6≡ 0, then also by (a) it follows that rank[A1 (z), · · · , Ar (z)] = rank M = n. Theorem D.1. [21, Theorem 2.29] Assume that H(z) is an n × m rational matrix, which is not identically zero. (a) Then it has a representation H(z) =

P (z) , ψ(z)

P (z) = [pjk (z)] ∈ Cn×m [z],

ψ(z) ∈ C[z].

If we assume that ψ(z) is monic and P (z) and ψ(z) are coprime, then P (z) and ψ(z) are unique. (b) Also, H(z) can be represented as a matrix fraction, the ‘ratio’ of two left coprime polynomial matrices: H(z) = α−1 (z)β(z),

α(z) ∈ Cn×n [z],

det α(z) 6≡ 0,

β(z) ∈ Cn×m [z].

The polynomial matrices α(z) and β(z) are unique up to a unimodular factor. (c) Similarly, H(z) can be represented as the ‘ratio’ of two right coprime polynomial matrices: ˜ α ˜ −1 (z), H(z) = β(z)

˜ α(z) ∈ Cm×m [z],

˜ det α(z) 6≡ 0,

˜ β(z) ∈ Cn×m [z],

which are unique up to a unimodular factor. Proof. By our assumptions, we can define ψ(z) as the least common multiple of the denominators of the entries of the rational matrix H(z), also assuming that each entry is in lowest terms, that is, its numerator and denominator are coprime. If we assume that ψ(z) has leading coefficient 1 (is monic), then it really gets unique. This proves (a).

261 By (a) we may factorize as H(z) = (ψ(z)In )−1 P (z). Consider the submodule M := P (z)Cm [z] + (ψ(z)In )Cn [z] ⊂ Cn [z]. M contains the linearly independent elements ψ(z)ej , j = 1, . . . , n, where ej denotes the jth coordinate unit vector in Cn , j = 1, . . . , n. Hence rank M ≥ n, while M is a submodule of Cn [z], so rank M ≤ n. Thus rank M = n and, due to Lemma D.7, there exists a gcld D(z) ∈ Cn×n [z], det D(z) 6≡ 0, of P (z) and ψ(z)Ip . Therefore, there exists α(z) ∈ Cn×n [z] and β(z) ∈ Cn×m [z] such that P (z) = D(z)β(z),

ψ(z)In = D(z)α(z).

By Lemma D.6, M = D(z)β(z)Cm [z] + D(z)α(z)Cn [z] = D(z)Cn [z], and since D(z) is nonsingular, it follows that β(z)Cm [z] + α(z)Cn [z] = Cn [z].

(D.8)

Let e1 , . . . , en be the standard basis in Cn [z], that is, ej is the jth coordinate unit vector in Cn , j = 1, . . . , n. Then by (D.8), there exist elements x1 (z), . . . , xp (z) ∈ Cm [z] and y1 (z), . . . , yp (z) ∈ Cn [z] such that β(z)xj (z) + α(z)yj (z) = ej ,

j = 1, . . . , n.

Consequently, with X(z) = [x1 (z), . . . , xn (z)] ∈ Cm×n [z] and Y (z) = [y1 (z), . . . , yn (z)] ∈ Cn×n [z] we have β(z)X(z) + α(z)Y (z) = In .

(D.9)

By B´ezout’s identity (Lemma D.4) this implies that α(z) and β(z) are left coprime. Moreover, H(z) = (ψ(z)In )−1 P (z) = (D(z)α(z))−1 D(z)β(z) = α−1 (z)β(z), which is the stated left coprime factorization. (Since ψ(z)In is nonsingular, α(z) must be nonsingular too.) Assume now that (N (z), ∆(z)) is another left coprime factorization of H(z): ∆−1 (z)N (z) = H(z) = α−1 (z)β(z). Thus N (z) = ∆(z)α−1 (z)β(z). By B´ezout’s identity (D.9), α−1 (z)β(z)X(z) + Y (z) = α−1 (z). Therefore N (z)X(z) + ∆(z)Y (z) = ∆(z)α−1 (z), ∆(z) = (N (z)X(z) + ∆(z)Y (z))α(z) = U (z)α(z),

262

Tools from algebra

where U (z) ∈ Cn×n [z] is a polynomial matrix. Similarly, there exists a polynomial matrix V (z) ∈ Cn×n [z] such that α(z) = V (z)∆(z). Thus we obtain that α(z) = V (z)U (z)α(z). Since α(z) is nonsingular, it follows that V (z)U (z) = Ip , U (z) and V (z) are unimodular, with N (z) = ∆(z)α−1 (z)β(z) = U (z)β(z),

∆(z) = U (z)α(z).

This completes the proof of (b) of the theorem. Statement (c) can be proved similarly. Remark D.3. Conversely, if a transfer function H(z) can be written in any of the form (a), (b), or (c) in Theorem D.1, then it is a rational matrix. The case of (a) is obvious. The case of (b) follows from the fact that α−1 (z) = adj α(z)/det α(z), and the case of (c) follows similarly.

263

Acknowledgment The research and work underlying the present book and carried out at the Budapest University of Technology and Economics was supported by the National Research Development and Innovation Fund based on the charter of bolster issued by the National Research, Development and Innovation Office under the auspices of the Hungarian Ministry for Innovation and Technology, so was supported by the National Research, Development and Innovation Fund (TUDFO/51757/2019-ITM, Thematic Excellence Program). It was also funded by the project EFOP-3.6.2-16-2017-00015-HU-MATHS-IN: for deepening the activity of the Hungarian Industrial and Innovation Network. The authors gratefully remember Andr´as Kr´amli for his valuable help with the theory and the related literature. The authors are obliged to Professors Manfred Deistler and Gy¨ orgy Michaletzky for their valuable help and thank M´ at´e Baranyi for his useful remarks and help in creating some figures.

Bibliography

[1] H. Akaike. Stochastic theory of minimal realization. IEEE Transactions on Automatic Control, 19(6):667–674, 1974. [2] O. Akbilgic, H. Bozdogan, and M. E. Balaban. A novel Hybrid RBF Neural Networks model as a forecaster. Statistics and Computing, 24(3):365– 375, May 2014. [3] B. D. O. Anderson, M. Deistler, E. Felsenstein, B. Funovits, L. Koelbl, and M. Zamani. Multivariate AR systems and mixed frequency data: Gidentifiability and estimation. Econometric Theory, 32(4):793–826, 2016. [4] M. Bolla. Factor analysis, dynamic. Wiley StatsRef: Statistics Reference Online, pages 1–15, 2014. [5] M. Bolla and A. Kurdyukova. Dynamic factors of macroeconomic data. Annals of the University of Craiova, Mathematics and Computer Science Series, 37(4):18–28, 2010. [6] G. E. P. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung. Time series analysis: Forecasting and control. John Wiley & Sons, 2015. [7] L. Breiman and J. H. Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80(391):580–598, 1985. [8] D. R. Brillinger. The canonical analysis of stationary time series. In P. R. Krishnaiah, editor, Multivariate analysis II, pages 331–350. Academic Press, New York, 1969. [9] D. R. Brillinger. Time series: Data analysis and theory, volume 36. SIAM, 1981. [10] P. J. Brockwell and R. A. Davis. Introduction to time series and forecasting. Springer, 2016. [11] P. J. Brockwell, R. A. Davis, and S. E. Fienberg. Time series: Theory and methods. Springer Science & Business Media, 1991. [12] H. Cram´er. Mathematical methods of statistics. Princeton University Press, 1946.

265

266

Bibliography

[13] M. Deistler, B. D. O. Anderson, A. Filler, C. Zinner, and W. Chen. Generalized linear dynamic factor models: An approach via singular autoregressions. European Journal of Control, 16(3):211–224, 2010. [14] M. Deistler and W. Scherrer. Modelle der Zeitreihenanalyse. Springer, 2018. [15] M. Deistler, C. Zinner, et al. Modelling high-dimensional time series by generalized linear dynamic factor models: An introductory survey. Communications in Information & Systems, 7(2):153–166, 2007. [16] J. L. Doob. Stochastic processes, volume 101. Wiley, 1953. ¨ [17] L. Fej´er. Uber trigonometrische polynome. Journal f¨ ur die Reine und Angewandte Mathematik, 146:53–82, 1916. [18] M. Forni and M. Lippi. The general dynamic factor model: One-sided representation results. Journal of Econometrics, 163(1):23–28, 2011. [19] B. Friedman. Eigenvalues of composite matrices. Mathematical Proceedings of the Cambridge Philosophical Society, 57(1):37–49, 1961. [20] P. A. Fuhrmann. Linear systems and operators in Hilbert space. Courier Corporation, 2014. [21] P. A. Fuhrmann and U. Helmke. The mathematics of networks of linear systems. Springer, 2015. [22] L. Gerencs´er, Z. V´ ag´ o, and B. Gerencs´er. Financial time series. P´azm´any P´eter Catholic University, 2013. [23] I. I. Gikhman and A. V. Skorokhod. The theory of stochastic processes I. Springer, 1974. [24] I. I. Gikhman and A. V. Skorokhod. The theory of stochastic processes III. Springer, 1979. [25] I. I. Gikhman and A. V. Skorokhod. The theory of stochastic processes II. Springer Science & Business Media, 2004. [26] G. H. Golub and C. F. Van Loan. Matrix computations, volume 3. JHU Press, 2012. [27] L. Gy¨ orfi, M. Kohler, A. Krzyzak, and H. Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006. [28] E. J. Hannan. Multiple time series, volume 38. John Wiley & Sons, 2009. [29] E. J. Hannan and M. Deistler. The statistical theory of linear systems. SIAM, 2012. [30] G. H. Hardy, J. E. Littlewood, and G. P´olya. Inequalities. Cambridge University Press, 1952.

Bibliography

267

[31] B. L. Ho and R. E. K´ alm´an. Effective construction of linear state-variable models from input/output functions. Regelungstechnik, 14(12):545–548, 1966. [32] R. E. K´ alm´ an. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960. [33] R. E. K´ alm´ an and R. S. Bucy. New results in linear filtering and prediction theory. Journal of Basic Engineering, 83(1):95–108, 1961. [34] A. N. Kolmogorov. Stationary sequences in Hilbert space. Moscow University Mathematics Bulletin (in Russian), 2(6):1–40, 1941. Translated into English: Selected works of A. N. Kolmogorov, Vol. II, Probability theory and mathematical statistics, ed. A. N. Shiryayev, 228–271, Springer, 1986. [35] A. Kr´ amli. On factorization of spectral matrices (in Hungarian). MTA III. Oszt´ aly K¨ ozlem´enyei, 18:183–186, 1968. [36] A. Kr´ amli. Regularity and singularity of stationary stochastic processes (in Hungarian). MTA III. Oszt´ aly K¨ ozlem´enyei, 18:155–168, 1968. [37] J. Lamperti. Probability: A survey of the mathematical theory. Wiley, second edition, 1996. [38] J. Lamperti. Stochastic processes: A survey of the mathematical theory, volume 23. Springer Science & Business Media, 2012. [39] A. Lindquist and G. Picci. Linear stochastic systems: A geometric approach to modeling, estimation and identification, volume 1. Springer, 2015. [40] P. Loubaton. Non-full-rank causal approximations of full-rank multivariate stationary processes with rational spectrum. Systems & Control Letters, 15(3):265–272, 1990. [41] H. L¨ utkepohl. New introduction to multiple time series analysis. Springer Science & Business Media, 2005. [42] P. M¨ orters and Y. Peres. Brownian motion, volume 30. Cambridge University Press, 2010. [43] N. Nikolski. Hardy spaces, volume 179. Cambridge University Press, 2019. [44] C. R. Rao. Linear statistical inference and its applications, volume 2. Wiley, 1973. [45] C. R. Rao. Separation theorems for singular values of matrices and their applications in multivariate analysis. Journal of Multivariate Analysis, 9(3):362–377, 1979.

[46] A. Rényi. On measures of dependence. Acta Mathematica Academiae Scientiarum Hungarica, 10(3–4):441–451, 1959.
[47] Yu. A. Rozanov. Stationary random processes. Holden-Day, 1967.
[48] P. Rózsa. Linear algebra and its applications (in Hungarian). Műszaki Kiadó, Budapest, third edition, 1991.
[49] W. Rudin. Functional analysis. McGraw-Hill, 1991.
[50] W. Rudin. Real and complex analysis. Tata McGraw-Hill Education, 2006.
[51] W. Scherrer and M. Deistler. Vector autoregressive moving average models. In Conceptual Econometrics Using R, page 145. Elsevier, 2019.
[52] T. Szabados and B. Székely. Stochastic integration based on simple, symmetric random walks. Journal of Theoretical Probability, 22(1):203–219, 2009.
[53] G. J. Tee. Eigenvectors of block circulant and alternating circulant matrices. New Zealand Journal of Mathematics, 36(8):195–211, 2007.
[54] R. S. Tsay. Multivariate time series analysis: With R and financial applications. John Wiley & Sons, 2013.
[55] G. Tusnády and M. Ziermann, editors. Analysis of time series (in Hungarian). Műszaki Könyvkiadó, Budapest, 1986.
[56] R. von Sachs. Nonparametric spectral analysis of multivariate time series. Annual Review of Statistics and Its Application, 7:361–386, 2020.
[57] N. Wiener. Time series. MIT Press, 1949.
[58] N. Wiener. Extrapolation, interpolation, and smoothing of stationary time series: With engineering applications, volume 8. MIT Press, 1964.
[59] N. Wiener and P. Masani. The prediction theory of multivariate stochastic processes, I. The regularity condition. Acta Mathematica, 98:111–150, 1957.
[60] N. Wiener and P. Masani. The prediction theory of multivariate stochastic processes, II. The linear predictor. Acta Mathematica, 99:93–137, 1958.
[61] J. H. Wilkinson. The algebraic eigenvalue problem, volume 662. Oxford Clarendon, 1965.
[62] H. Wold. A study in the analysis of stationary time series. PhD thesis, Almqvist & Wiksell, 1938.

Index

(auto)covariance matrix function, 35
AR polynomial, 47, 83
AR(p) process, 47, 83
  stable, 49
ARMA process
  multi-D, 151
ARMA(p, q) process, 53, 83
  stable, 55
autoregressive process, 47, 172
autoregressive, moving average process, 53
Bézout's identity, 257
best rank k approximation, 239
Beurling's theorem, 225
block Cholesky decomposition, 180, 201
block Hankel matrix, 109
  of the covariance function, 107
Cauchy's formula, 216
Cauchy's theorem, 216
Cauchy–Riemann equations, 219
causal (one-sided, future-independent) MA(∞) process, 82
causally subordinated time series, 115
Cayley–Hamilton theorem, 235
Cholesky decomposition, 174, 237
conditional spectral density, 42, 115
constant rank, 116
constructions of time series, 18
coprime, 251
coprime (relative prime), 150
covariance function
  absolutely summable, 7
  square summable, 7
covariance matrix function, 2
Cramér's representation, 36
cross-covariance matrix, 8
Davis–Kahan theorem, 235
dimension reduction, 146
Dirichlet problem, 220
Discrete Fourier Transform (DFT), 23
Dynamic Factor Analysis, 197, 205
Dynamic Principal Components (DPC), 145
eigenvalue, 228
eigenvector, 228
elementary matrix, 253
elementary row (or column) operations, 253
empirical covariance, 30
empirical mean, 26
ergodic
  for the covariance, 31
  for the mean, 26
ergodicity, 37
external description, 94
factorization, 96
factorization of spectral density, 64
filtration, 41
first order autoregressive process, 46
Fourier coefficient, 7
Fourier frequencies, 23, 171, 201
Fourier series, 7
Fourier transform, 4, 191
Frobenius norm of a matrix, 233
full rank process, 123, 126
function
  analytic, 215
  harmonic, 219
  holomorphic, 215
  meromorphic, 217
  subharmonic, 221
Gauss normal equations, 171, 172, 179, 244
generalized inverse of a matrix, 231
Gersgorin disc theorem, 235
Gram decomposition, 175, 236
  parsimonious, 236
Gram matrix (Gramian), 232
Gram–Schmidt procedure, 173
greatest common left divisor (gcld), 250
greatest common right divisor (gcrd), 255
Hardy spaces, 222, 225
Herglotz theorem, 5
Hermite form, 254
Hilbert space, 170, 179
ideal, 249
idiosyncratic noise, 208
impulse response function, 43, 82, 93, 119, 163
impulse responses, 175, 201
inner function, 223
innovation, 176
innovation process, 176, 200
innovation space, 123
innovation subspace, 193
innovation subspaces, 207
innovations, 173
input/output map
  extended, 91
  restricted, 88
internal description, 94
inverse DFT (IDFT), 23
isometry (isometric isomorphism), 10
joint spectral measure, 41
Kálmán gain matrix, 193
Kálmán's filtering, 191, 200, 211
kernel of a matrix, 228
Kolmogorov's condition, 70, 84
Kolmogorov–Szegő formula, 126
  multi-D, 166
Kronecker product, 237
lag operator, 82
LDL-decomposition, 174, 237
left (backward) shift operator L, 91, 94
left coprime, 251
left divisible, 250
linear span, 8
  closed, 8
linear system, 109
  stable, 103
linear time invariant dynamical system, 88
linear transform, 114
LU-decomposition, 237
Lyapunov equation, 106
MA polynomial, 44, 83
MA(q) process, 82
matrix
  anti-symmetric, 227
  block Hankel, 236
  block Toeplitz, 236
  diagonal, 228
  generalized diagonal, 230
  Hankel, 236
  Hermitian, 227
  Hermitian projector, 227
  invertible, 227
  negative definite, 229
  negative semidefinite, 229
  non-negative definite, 229
  non-positive definite, 229
  normal, 227
  positive definite, 229
  positive semidefinite, 229
  self-adjoint, 227
  sub-unitary, 227
  symmetric, 227
  Toeplitz, 236
  unitary, 227
matrix fraction, 260
matrix fraction description (MFD), 102
matrix polynomial, 250
maximum modulus theorem, 217
McMillan degree, 102
mean square prediction error, 105
minimal polynomial, 100
minimal realizations, 96
Minkowski's inequality, 127
module, 249
monic polynomial, 100
moving average process (MA), 43, 82
  causal MA(∞), 43
  MA(q), 43
multidimensional (or multivariate) time series, 1
natural matrix norm, 234
non-causal, 43, 82
non-negative definite, 36
non-negative definite measure matrix, 4
non-singular, 178, 251
non-singular process
  full rank, 166
number of zeros of a function, 218
observability Gramian, 108
observability matrix, 90
observable, 109
  state, 89
  system, 90
observable variable, 192
operator of left (backward) shift L, 43
operator of right (forward) shift S, 10, 224
orthogonality, 8
orthonormal innovation process, 165
orthonormal sequence, 15
outer function, 223, 226
past of {Xt} until n, 58
past until time k, 120
periodogram, 34, 37
phase transition matrix, 192
Poisson integral, 220
pole of order m, 217
polynomial matrix, 250
prediction
  h-step ahead, 59, 105, 122, 173
  error, 59
  one-step ahead, 161, 176, 192
Principal Component Analysis
  PC transformation, 190
Principal Component Analysis (PCA) in the Frequency Domain, 142, 168
principal ideal domain, 249
process of innovations, 122
Projection Theorem, 241
pseudoinverse (Moore–Penrose inverse) of a matrix, 231
QR-decomposition, 236
radial limit, 223
range of a matrix, 228
rank of a matrix, 228
rank of the process, 123
rational matrix, 251
reachability Gramian, 108
reachability matrix, 89
reachable, 109
  state, 89
  system, 89
regular process, 193
  multi-D, 135, 166
remote past, 120
remote past of {Xt}, 58
Riccati equation, 199
Riesz–Fischer theorem, 7
ring, 249
second order processes, 2, 35
shift operator
  left (backward), 44
simultaneous linear regressions, 179
singular time series
  type (0), 75, 84, 177
  type (1), 78, 85, 177
  type (2), 78, 85, 177
singular value, 230
singular value decomposition (SVD), 230
singular vector pair, 230
sliding summation, 43, 82, 116
spectral (operator) norm of a matrix, 233
spectral amplitudes, 37
spectral cumulative distribution function (c.d.f.), 6
Spectral Decomposition (SD), 228
spectral density matrix, 14, 36
  of a VARMA process, 157
  of an ARMA process, 56
  of constant rank, 116
  rational, 65
  smooth, 66
spectral factor, 60, 63, 125, 163
spectral measure matrix, 4, 14
spectral radius of a matrix, 92, 103, 233
spectral representation of a time series, 10
spectrum, 103
spectrum of a matrix, 229, 233
stability, 165
stability condition, 153, 199
stable, 83
state space, 87
state variable, 192
static factors, 208
stationary, 1, 171
  in the strong sense, 1
  in the wide sense, 2
  with discrete time, 2
stochastic integration, 11
strict miniphase condition, 106, 161, 166
strongly stationary, 1
submodule, 258
subordinated process, 41, 81
time invariant linear filter (TLF), 40, 114
time invariant linear system
  stochastic, 103
time series
  non-regular, 140, 167
  non-singular, 59
  of constant rank, 163
  regular, 58, 84, 120, 125, 165
  singular, 58, 84, 120
  spectral representation, 12
  with real components, 16
time shift
  left (backward), 10, 90
  right (forward), 10, 90
  unitary, 10
time-invariant linear filter (TLF), 81
Toeplitz matrix, 174, 176, 209
transfer function, 43, 56, 82, 92, 110, 119, 163, 166, 201
  rational, 102
unimodular matrix, 250
VAR
  polynomial, 152
  process, 152
VARMA process, 151, 165
  standard observable realization, 153
VMA
  polynomial, 152
  process, 152
weakly stationary, 2
Weyl perturbation theorem, 235
white noise process, 15, 172
white noise sequence, 82
Wold decomposition, 59, 174, 175, 210
  multi-D, 120, 165
Yule–Walker equations, 51, 56, 83, 172
  multi-D, 159
z-transform, 43, 82, 90
zero of order m, 217