Data Driven Model Learning for Engineers: With Applications to Univariate Time Series 3031316355, 9783031316357


Table of contents :
Preface
References
Contents
1 Basic Concepts of Time Series Modeling
References
2 Singular Spectrum Analysis of Univariate Time Series
2.1 The Main Ingredients
2.2 The Basic Singular Spectrum Analysis for Low Rank Approximation
2.2.1 Times Series Decomposition
2.2.2 Times Series Reconstruction
2.3 The Impact of the SVD for the Basic SSA
2.3.1 SVD and Separability
2.3.2 SVD and Noise Effect Filtering
2.3.3 In a Nutshell
2.3.4 Another Example
2.4 Take Home Messages
References
3 Trend and Seasonality Model Learning with Least Squares
3.1 Linear Least Squares Solutions
3.1.1 Problem Formulation
3.1.2 Analytic Solution from a Projection Viewpoint
3.1.3 Numerical Solution with a QR Factorization
3.1.4 Numerical Solution with a Singular Value Decomposition
3.1.5 Linear Least Squares and Condition Number
3.2 A Digression to Nonlinear Least Squares
3.3 Linear Model Complexity Selection and Validation
3.4 Hints for Least Squares Solutions Refinement
3.5 Take Home Messages
References
4 Least Squares Estimators and Residuals Analysis
4.1 Residual Components
4.2 Stochastic Description of the Residuals
4.2.1 Stochastic Process
4.2.2 Stationarity
4.2.3 Ergodicity
4.3 Basic Tests of the Residuals
4.3.1 Autocorrelation Function Test
4.3.2 Portmanteau Test
4.3.3 Turning Point Test
4.3.4 Normality Test
4.4 Statistical Properties of the Least Squares Estimates
4.4.1 Bias, Variance, and Consistency of the Linear Least Squares Estimators
4.4.2 Bias, Variance, and Consistency of the Nonlinear Least Squares Estimators
4.4.3 Least Squares Statistical Properties Validation with the Bootstrap Method
4.4.4 From Least Squares Statistical Properties to Confidence Intervals
Estimated Parameters Confidence Intervals
Estimated Outputs Confidence Intervals
Predicted Outputs Confidence Intervals
4.5 Wold Decomposition
4.6 Take Home Messages
References
5 Residuals Modeling with AR and ARMA Representations
5.1 From Transfer Functions to Linear Difference Equations
5.2 AR and ARMA Model Learning
5.2.1 AR Model Parameters Estimation
Parameters Estimation with Linear Least Squares
Parameters Estimation with the Yule-Walker Algorithm
5.2.2 ARMA Model Parameters Estimation
Parameters Estimation with Pseudolinear Least Squares
Parameters Estimation with Nonlinear Least Squares
5.3 Partial Autocorrelation Function
5.4 Forecasting with AR and ARMA Models
5.5 Take Home Messages
References
6 A Last Illustration to Conclude
A Vectors and Matrices
A.1 Vector Space
A.1.1 Linear Space and Subspace
A.1.2 Inner Product, Induced Norm and Inner Product Space
A.1.3 Complementary Subspaces
A.1.4 Orthogonal Complement and Projection
A.2 Vector
A.2.1 First Definitions
A.2.2 Basic Operations
A.2.3 Span, Vector Space, and Subspace
A.2.4 Linear Dependency and Basis
A.2.5 Euclidean Inner Product and Norm
A.2.6 Orthogonality and Orthonormality
A.3 Matrix
A.3.1 First Definitions
A.3.2 Basic Operations
A.3.3 Symmetry
A.3.4 Hankel and Toeplitz Matrices
A.3.5 Gram, Normal, and Orthogonal Matrices
A.3.6 Vectorization and Frobenius Matrix Norm
A.3.7 Quadratic Form and Positive Definiteness
A.4 Matrix Fundamental Subspaces
A.4.1 Range and Nullspace
A.4.2 Rank
A.5 Matrix Inverses
A.5.1 Square Matrix Inverse
A.5.2 Matrix Inversion Lemmas
A.5.3 Matrix Pseudo-inverse
A.6 Some Useful Matrix Decompositions
A.6.1 QR Decomposition
A.6.2 Singular Value Decomposition
A.6.3 Condition Number and Norms
A.6.4 SVD and Eckart–Young–Mirsky Theorem
A.6.5 Moore–Penrose Pseudo-inverse
A.7 Orthogonal Projector
A.7.1 First Definitions
A.7.2 Orthogonal Projector and Singular Value Decomposition
References
B Random Variables and Vectors
B.1 Probability Space and Random Variable
B.1.1 Probability Space
B.1.2 Random Variable
B.2 Univariate Random Variable
B.2.1 Cumulative Distribution Function
B.2.2 Probability Density Function
B.2.3 Cumulative Distribution Function and Quantile
B.2.4 Uniform Random Variable
B.2.5 Normal Random Variable
B.2.6 Student's Random Variable
B.2.7 Chi-squared Random Variable
B.2.8 Moments
B.3 Multivariate Random Variable
B.3.1 Basic Idea
B.3.2 2D Joint Distribution Function
B.3.3 2D Joint Probability Density Function
B.3.4 Marginal Distribution and Density Functions
B.3.5 Statistical Independence
B.3.6 Generalized Mean and Moments
B.3.7 Covariance
B.3.8 Correlation Coefficient
B.3.9 Correlation
B.3.10 nx D Joint Cumulative Distribution and Density Function
B.3.11 nx D Marginal Distribution and Density Functions
B.3.12 Independence
B.3.13 Uncorrelatedness and Orthogonality
B.3.14 Random Vector
B.3.15 Mean Vector, Covariance, and Correlation Matrices
B.3.16 Pay Attention to the Definition!!!
B.3.17 White Random Vector
B.4 Sum of Random Variables
B.4.1 Sample Mean and Variance
B.4.2 Sample vs. Expected Values
B.4.3 Central Limit Theorem
B.4.4 Sample Mean with Gaussian Assumption
B.4.5 Laws of Large Numbers
B.4.6 Weak Law of Large Numbers
B.4.7 Strong Law of Large Numbers
References
C Data


Guillaume Mercère

Data Driven Model Learning for Engineers With Applications to Univariate Time Series


Guillaume Mercère Université de Poitiers Poitiers cedex 9, France

ISBN 978-3-031-31635-7    ISBN 978-3-031-31636-4 (eBook)
https://doi.org/10.1007/978-3-031-31636-4

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

As teachers in engineering programs, I guess we all have already experienced the difficulty of introducing complicated (but useful) mathematical tools like the Singular Value Decomposition to engineering students straightforwardly. According to my recent experience with Master students of different engineering degrees, tackling this problem the French way, i.e., focusing first on theory (lists of definitions and theorems) and then illustrating it with tutorials and practicals during dedicated sessions, is indeed not efficient anymore. The Generation Z students we now have in front of us are digital natives who may have difficulty embracing theoretical notions without clear and illustrated reasons why these tools or concepts are useful for their future professional (and personal?) activities. Such an observation may explain why I think that switching to lectures combining theoretical concepts and their direct application to real data sets with dedicated software is, for the moment, a good solution for motivating the students, and thus for reaching our ultimate goal: to impart knowledge and skills in order to prepare our students for real-world challenges. I personally initiated this change of teaching methodology 3 years ago after a lecture session dedicated to state space modeling and control with electrical engineering Master students. First, their difficulty in figuring out the role played by eigenvalues for guaranteeing the stability of a linear state space representation or for designing a controller made me aware of the students' new needs. Second, I also realized the necessity for me to (i) illustrate a lot more what I mean with simulations and (ii) involve the students' critical thinking skills regularly, all along the lectures, via dedicated activities. With this in mind, and with the objective of teaching my Master students tools for model learning, i.e., basics of linear algebra, numerical optimization, and statistics, the most important step of this "new" learning process for me and my students was to select diverse data sets having a physical meaning, easy to get from the web, free of charge, and, more importantly, real. This is probably the main reason why I decided to work with real univariate time series, then use them (especially the "air passenger" time series popularized by G. Box and G. Jenkins in Box et al. 2016) to illustrate the efficiency of combining linear algebra, numerical optimization, and statistics tools for time series model learning and forecasting.

Because my secondary goal with this lecture was also to introduce mathematical tools I could use afterward during my lectures dedicated to system identification and Kalman filtering (my main research topics in fact), I have chosen a path inspired by, but different from, the mainstream literature. Indeed, instead of studying thoroughly the ARMA and ARIMA models for time series analysis and forecasting, as carried out efficiently and in a very didactic manner, e.g., in (Bisgaard and Kulahci, 2011; Montgomery et al., 2015; Box et al., 2016; Brockwell and Davis, 2016), a different (but complementary) journey in time series modeling is considered in this document. As explained in Chap. 1, the recipe I suggest herein is made of different steps and ingredients inspired by the famous Wold decomposition theorem (see Sect. 4.5 for more details), i.e., first a specific attention to the main deterministic components of a time series by using standard linear algebra and numerical optimization tools, then a focus on the residuals and their statistical properties in order to introduce solutions for AR and ARMA model parameters estimation. I hope that this different viewpoint on time series analysis and forecasting will be seen by readers as an interesting complement to the mainstream solutions. Notice that, even if the triggering event for writing this document was a bit painful (it is always difficult, I guess, for a teacher to see that what he tries to explain is not understood by the audience despite the time spent explaining it), I strongly thank my students for their constant feedback when I teach this lecture here in Poitiers. It has indeed helped me determine what I have to focus on in this document.

Poitiers, France
March 18, 2023

Guillaume Mercère

References
S. Bisgaard, M. Kulahci, Time Series Analysis and Forecasting by Example (Wiley, 2011)
G. Box, G. Jenkins, G. Reinsel, G. Ljung, Time Series Analysis: Forecasting and Control (Wiley, 2016)
P. Brockwell, R. Davis, Introduction to Time Series and Forecasting (Springer, 2016)
D. Montgomery, C. Jennings, M. Kulahci, Introduction to Time Series Analysis and Forecasting (Wiley, 2015)


Chapter 1

Basic Concepts of Time Series Modeling

At a time when politicians, decision makers, but also social media influencers are obsessed with data and numbers, nobody can contradict the fact that data play a central role in economics, industry, and science of course, but also in contemporary politics and public life. For instance, as shown in The 2020 Global State of Enterprise Analytics report¹ written by the business intelligence company MicroStrategy:
• 94% of the survey respondents say that data and analytics are important to their organization's digital transformation efforts
• 59% of the survey respondents are moving forward with the use of advanced and predictive analytics
• 45% of the survey respondents are using data and analytics to develop new business models
to give a short sample of figures demonstrating that data analysis drives decision making and impacts new strategy formulation. In other words, this kind of survey proves that, in the era of "big data," being able to:
• Understand the links between the available data sets
• Make sensible forecasts from the available data sets
are both essential in many fields for reliable decision making. Such an assessment probably explains the reasons why websites, books, training courses, and diplomas dedicated to data analytics flourish nowadays. A quick look at their contents clearly shows that a standard and efficient first ingredient to analyze raw data and find trends to answer different strategic questions is model learning (Box et al., 2016). Indeed, once reliable models of data set dynamical evolution are determined accurately, efficient prediction or classification tools can be deployed effectively (Hastie et al., 2009; Theodoridis, 2015).

¹ Available via the link https://www3.microstrategy.com/getmedia/db67a6c7-0bc5-41fa-82a9-bb14ec6868d6/2020-Global-State-of-Enterprise-Analytics.pdf.

The developments introduced in this textbook do not depart from this well-tried methodology by bringing to light (new) solutions for data-driven model learning. Among the different data a user can encounter in practice, a specific attention is paid herein to discrete time series. By discrete time series, it is meant that the observations to be modeled can be described by a sequence of samples acquired at discrete time instants (most of the time at regular time intervals, i.e., with a constant sampling period). This choice is mainly dictated by the fact that we are all immersed in discrete time series as soon as we follow the stock exchange, weekly numbers of hospitalizations or deaths due to a virus like COVID-19, polls about the next presidential election, ..., or our monthly or yearly running performance generated with our new "smart" watch. Examples of data of this kind are displayed as time series plots in Figs. 1.1, 1.2, 1.3, and 1.4. These plots clearly illustrate that:
• Some series have obvious trends and/or seasonal patterns.
• Adjacent observations seem to be dependent.
• Linear models may not always be sufficient to describe the underlying dynamics.
Another important feature that is not directly visible in these plots can be emphasized as follows. Let us consider, e.g., the time series in Fig. 1.5. This curve shows some trends of the atmospheric carbon dioxide via the measurement of the monthly mean carbon dioxide rate at Mauna Loa Observatory, Hawaii, between 1959 and 1991. By resorting to curve fitting solutions introduced in Chap. 3, a polynomial function and a Fourier series can be fitted to this data set, leading to the curves given in Fig. 1.6. As shown with the bottom curve, the residuals, i.e., the difference between the initial observations and the simulated model points:
• Are different from zero


Fig. 1.1 Female life expectancy in France



Fig. 1.2 Infant mortality rate in France


Fig. 1.3 Monthly accidental death toll in the USA

• Do not have any discernible trend
• Are quite smooth
These observations have two main explanations and consequences. First, because the residuals are different from zero, it can be concluded that the selected model (a combination of a polynomial function and a Fourier series) is not flexible enough to describe the data dynamics perfectly. An analysis (as introduced, e.g., in James et al. 2017, Chapter 5) must thus be performed to see if increasing the complexity of the fitting function to better mimic the real data points is the best solution for "a good" time series modeling. Second, because the residuals dynamics "is not that random,"


Fig. 1.4 Monthly worldwide airline passengers


Fig. 1.5 Monthly mean carbon dioxide concentration measured at Mauna Loa Observatory, Hawaii

it is unlikely that independent and identically distributed stochastic processes² can be used to describe the residuals behavior accurately. The smoothness of the residuals plot probably means that there is still some dependency or correlation in the residuals points. Quantifying such a correlation by resorting to filtered noises

² See, e.g., Leon-Garcia (2008) for more details about random variables and statistics.


Fig. 1.6 Initial and reconstructed time series (top) and residuals (bottom). CO2 concentration at Mauna Loa Observatory, Hawaii

and dedicated statistical model descriptions may be an easy way to describe the residuals behavior efficiently.
This short and quite naive analysis of the small sample of time series plotted in Figs. 1.1, 1.2, 1.3, 1.4, and 1.5 outlines the main steps of the model learning procedure carried out in this book. After the drawing of the time series and a careful analysis of these plots to detect trends, seasonality, outliers, nonlinear behavior, ..., and dependency (see, e.g., Bisgaard and Kulahci 2011, Chapter 2, for first graphical tools for visualizing time series structures), the multi-step approach considered in this book consists in:
• Suggesting an appropriate model class (or a family of models) to describe the main dynamics of the time series
• Estimating the deterministic components of these models by resorting to standard least squares solutions (Björck, 1996)
• Focusing on the residuals and analyzing their statistical properties by using standard tools such as the autocorrelation functions, some whiteness tests, ... (Papoulis, 2000; Kay, 2006; Leon-Garcia, 2008; Brockwell and Davis, 2016)
• Choosing a model to fit the residuals accurately
• Estimating the residuals model parameters so that, in the end, the "leftovers" do not have any internal dynamics or dependency whatsoever anymore
the final test being the prediction efficiency of the estimated model.
This global procedure being set, each of its steps will be described in detail in the rest of this book as follows. More precisely, Chap. 2 is devoted to the description of a nonparametric technique yielding a reliable decomposition of univariate time series into a sum of components, each having a meaningful interpretation. Once


the deterministic components of the time series are available, Chap. 3 focuses on the determination of parsimonious parametric models of the trend and seasonal patterns composing the raw time series by using specific least squares estimation solutions. Chapter 4 first deals with the statistical analysis of the residuals, then with its consequences as far as the least squares estimators are concerned. In Chap. 5, we describe standard numerical techniques to model the residuals with AR and ARMA representations effectively. A last illustration of the model learning procedure suggested in this book is finally the topic of Chap. 6.
Before diving into the description of techniques and algorithms to be used for time series model learning, it is essential to point out that, when real data sets are handled:
• The data generating process (i.e., the real process that "generates" the data we are handling) is unknown.
• Only a finite sequence of samples (from an infinite population generated by the aforementioned data generating process) is available in practice.
• The time series samples are measured most of the time and thus may be noisy.
All of these practical conditions and constraints make the determination of an exact mathematical representation of the data generating process utopian. On the contrary, focusing on a parsimonious parametric model mimicking the available data set accurately, with the ability to give reliable predictions and uncertainty quantifications inferred from it, is what the time series model learning solutions introduced in this book aim at. This is the main reason why, as claimed by L. Ljung in his famous book (Ljung, 1999), "our acceptance of models should be guided by usefulness rather than truth." Said differently, the user of the model learning solutions introduced hereafter must be aware that the generated models have a certain domain of validity and thus must take notice of the assumptions made to generate these models. In this book, a specific attention will be paid to time series assumed to be composed of deterministic trend and/or cyclic components combined with random or irregular movements that can be described by stationary stochastic processes. Such assumptions, which can be considered too strong or too unrealistic by some users, must be kept in mind by the reader when he deploys the solutions introduced hereafter.
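Before turning to the chapters themselves, here is a minimal, hedged sketch (in Python with NumPy, on synthetic data rather than on the series plotted above; all signal parameters below are illustrative choices, not values taken from the book) of the multi-step procedure just outlined: a deterministic trend and seasonal model is estimated by linear least squares, and the residuals' sample autocorrelation is then inspected for leftover dependency.

```python
import numpy as np

# Hedged sketch of the multi-step procedure outlined above, on synthetic data:
# fit a deterministic trend + seasonal model by linear least squares, then
# inspect the residuals.
rng = np.random.default_rng(0)
N = 360                               # 30 years of monthly samples
t = np.arange(N)

# Synthetic "data generating process": quadratic trend + yearly cycle + colored noise
e = rng.normal(0, 0.3, N + 1)
y = 310 + 0.1 * t + 1e-4 * t**2 + 2.5 * np.sin(2 * np.pi * t / 12) \
    + (e[1:] + 0.8 * e[:-1])          # MA(1) noise, i.e., not white

# Regressor matrix: polynomial trend (order 2) + first Fourier pair (period 12)
X = np.column_stack([np.ones(N), t, t**2,
                     np.cos(2 * np.pi * t / 12), np.sin(2 * np.pi * t / 12)])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ theta

# Sample autocorrelation of the residuals: values well outside +/- 2/sqrt(N)
# at small lags indicate leftover dependency (here, the MA(1) part).
r0 = residuals - residuals.mean()
acf = np.array([np.dot(r0[:N - k], r0[k:]) / np.dot(r0, r0) for k in range(1, 13)])
print(np.round(acf, 2), "threshold ~", round(2 / np.sqrt(N), 2))
```

The same workflow, with the tools detailed in Chaps. 3 to 5, is what the rest of the book develops on real data.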

References
S. Bisgaard, M. Kulahci, Time Series Analysis and Forecasting by Example (Wiley, Hoboken, 2011)
A. Björck, Numerical Methods for Least Squares Problems (SIAM, Philadelphia, 1996)
G. Box, G. Jenkins, G. Reinsel, G. Ljung, Time Series Analysis: Forecasting and Control (Wiley, Hoboken, 2016)
P. Brockwell, R. Davis, Introduction to Time Series and Forecasting (Springer, Berlin, 2016)
T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer, Berlin, 2009)
G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning with Applications in R (Springer, Berlin, 2017)
S. Kay, Intuitive Probability and Random Processes Using MATLAB (Springer, Berlin, 2006)
A. Leon-Garcia, Probability, Statistics, and Random Processes for Electrical Engineering (Pearson, London, 2008)
L. Ljung, System Identification: Theory for the User (Prentice Hall, Hoboken, 1999)
A. Papoulis, Probability, Random Variables, and Stochastic Processes (McGraw-Hill Europe, Irvine, 2000)
S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective (Academic, Cambridge, 2015)

Chapter 2

Singular Spectrum Analysis of Univariate Time Series

Before diving into statistical or numerical optimization considerations for characterizing time series parsimoniously and effectively, it is interesting to introduce a first technique, called the singular spectrum analysis (SSA) (Elsner and Tsonis, 1996; Golyandina and Zhigljavsky, 2013), which yields the main components of a time series by resorting to applied linear algebra tools only (Zhang, 2017; Meyer, 2000). As shown hereafter, this method indeed gives direct access to a decomposition of the original time series into a small number of components (such as the aforementioned trends, seasonalities, or structureless noise) without assuming anything about the statistical properties of the process generating the observations. On top of that, and contrary to the solutions introduced in the next chapters, the basic SSA technique is a nonparametric technique (Golyandina, 2020), i.e., it can be seen as a model free method. Such a feature often leads the user to resort to this kind of low rank approximation technique (Markovsky, 2008, 2012) as a first line method when an initial and easy to tune characterization of a specific univariate time series is required. The main content of this chapter is the following. First, before giving a formal description of what the literature calls the "basic" SSA, the main ingredients of the SSA recipe are introduced in Sect. 2.1. Section 2.2 focuses on both stages composing the basic SSA algorithm (see also Golyandina 2020 for a complementary description of the basic SSA algorithm). In Sect. 2.3, a specific attention is paid to the reasons why the singular value decomposition is an efficient tool for decomposing a time series into a small number of components. Section 2.4 concludes this chapter with the main points to remember.


2.1 The Main Ingredients

The first ingredient of the SSA technique is of course a time series. As shown via the small sample of examples introduced in Chap. 1, a time series is a sequence of observations recorded at successive time instants. More specifically:

Definition 2.1 A time series defined over a time set $\mathbb{T} \subseteq \mathbb{Z}$ is a sequence of samples $(y_0, y_1, \cdots)$ indexed by time instants $t \in \mathbb{T}$ and compactly denoted as follows:
$$(y_t)_{t \in \mathbb{T}}. \tag{2.1}$$

In most of the practical situations encountered in this book, because the selected observation sets are equally spaced, countable, and generally short, the time set $\mathbb{T}$ is finite and equal to $t_0 + \{0, \cdots, N-1\} \times T_s$, $N \in \mathbb{N}^*$, where $t_0$ stands for the initial value of the time axis, whereas $T_s$ is the user defined sampling period. For ease of notation, thanks to a convenient time axis rescaling and shift, the explicit reference to $t_0$ and $T_s$ is often removed and $\mathbb{T} = \{0, \cdots, N-1\}$, $N \in \mathbb{N}^*$, hereafter. By assuming that the time series $(y_t)_{t\in\mathbb{T}}$ to be studied is originally made of a trend $(m_t)_{t\in\mathbb{T}}$, an oscillatory component $(s_t)_{t\in\mathbb{T}}$, and leftovers $(n_t)_{t\in\mathbb{T}}$ (called noise in the sequel), i.e.,
$$y_t = m_t + s_t + n_t, \quad t \in \mathbb{T}, \tag{2.2}$$
the low rank approximation technique (Markovsky, 2008, 2012) introduced in this chapter aims at determining the aforementioned components (i.e., each component $(m_t)_{t\in\mathbb{T}}$, $(s_t)_{t\in\mathbb{T}}$, and $(n_t)_{t\in\mathbb{T}}$ separately or the noise free signal $(m_t + s_t)_{t\in\mathbb{T}}$) from $(y_t)_{t\in\mathbb{T}}$ by using a specific data description called in the SSA literature the trajectory matrix (Elsner and Tsonis, 1996). The importance as well as the structure of this trajectory matrix can be explained as follows. Let us first focus on the trend $(m_t)_{t\in\mathbb{T}}$. If this trend can be accurately described by a polynomial function, then
$$m_t = \sum_{i=0}^{k} a_i t^i, \quad t \in \mathbb{T}, \tag{2.3}$$
where $k \in \mathbb{N}$ is any user defined polynomial order. With the differentiation operator¹ defined as follows (Oppenheim and Schafer, 2014):
$$\mathrm{d}^k z_t = (1 - q^{-1})^k z_t, \tag{2.4}$$

¹ In this textbook, the backward (resp., forward) shift operator is denoted by $q^{-1}$ (resp., $q$) instead of B (resp., F) as suggested, e.g., in the standard statistics and econometrics literature (Brockwell and Davis, 1991; Box et al., 2016). This choice is made first because of the automatic control background of the author, second because $q$ and $q^{-1}$ are a lot more used when the notions of transfer function, poles, ... are introduced (Oppenheim et al., 2014) (as done, e.g., in Chap. 5). Similarly, we use $\mathrm{d}^k$ instead of $\nabla^k$ for denoting the differentiation operator.


for any time series $(z_t)_{t\in\mathbb{T}}$ with $q^{-1}$ the backward shift operator, i.e., $q^{-1} z_t = z_{t-1}$, it can be shown that
$$\mathrm{d}^{k+1} m_t = \mathrm{d}^{k+1}\left(\sum_{i=0}^{k} a_i t^i\right) = 0, \quad t \in \mathbb{T}. \tag{2.5}$$

Illustration 2.1 If $m_t = at + b$, then
$$\mathrm{d}^2 m_t = (1 - q^{-1})^2 m_t = m_t - 2 m_{t-1} + m_{t-2} = at + b - 2(a(t-1) + b) + a(t-2) + b = 0. \tag{2.6}$$

By using the binomial formula (Bronshtein et al., 2015) to expand $(1 - q^{-1})^k$, real coefficients $b_i$, $i \in \{0, \cdots, k\}$, exist such that Eq. (2.5) becomes
$$b_0 m_t + b_1 m_{t-1} + \cdots + b_k m_{t-k} = \sum_{i=0}^{k} b_i m_{t-i} = \sum_{i=0}^{k} b_i q^{-i} m_t = 0. \tag{2.7}$$

Let us now turn to the seasonality component $(s_t)_{t\in\mathbb{T}}$. Clearly, if the period of this oscillatory term is p, then, obviously,
$$s_t = s_{t-p}, \tag{2.8}$$
or, written differently,
$$s_t - s_{t-p} = (1 - q^{-p}) s_t = 0. \tag{2.9}$$

or, written differently, .

Remark 2.1 It is interesting to notice that, even if it is used in a different way, the SSA solution introduced herein shares important common features with the time series analysis mainstream techniques. Indeed, as shown, e.g., in Brockwell and Davis (2016, Chapter 6) or in Box et al. (2016, Chapter 4), the standard solutions involving seasonal ARIMA models for time series analysis and forecasting preliminarily eliminate the trend and seasonal components by introducing the operators (see Brockwell and Davis 2016, Section 1.5 for more details and illustrations):
• $(1 - q^{-p})$ to get rid of the seasonality of period p.
• $(1 - q^{-1})^d$ with a user defined sufficiently large value of $d \in \mathbb{N}^*$ to eliminate the trend.
On the one hand, the SSA based solution introduced herein can thus be seen as a smart solution for selecting the orders p and d of the standard solutions involving seasonal ARIMA models efficiently. On the other hand, these operators can be seen as the basis of the low rank approximation introduced herein to cope with the deterministic components of a time series (as illustrated with Eqs. (2.5) and (2.9)). The path chosen in this textbook however differs from the standard solutions the reader can find in the literature. We indeed suggest herein first analyzing the raw time series via the SSA method before any transformation, then modeling the deterministic components with least squares based methods, before resorting to "standard" techniques for fitting ARMA models on the residuals if necessary.

In a nutshell, both components can be compactly written as recurrent equations of sufficient orders that are equal to zero. By assuming that the available data points $(y_t)_{t\in\mathbb{T}}$ are noise free (i.e., $n_t = 0$ for any $t \in \mathbb{T}$), such observations on $(m_t)_{t\in\mathbb{T}}$ and $(s_t)_{t\in\mathbb{T}}$ lead us to introduce, for each case, a sufficiently large integer $\ell$ and a nonzero vector $r \in \mathbb{R}^{(\ell+1) \times 1}$ such that
$$\underbrace{\begin{bmatrix} r_0 & \cdots & r_\ell \end{bmatrix}}_{r^\top} \begin{bmatrix} y_t \\ \vdots \\ y_{t+\ell} \end{bmatrix} = 0, \quad t \in \{0, \cdots, N-1-\ell\}. \tag{2.10}$$

Illustration 2.2 For a trend $(m_t)_{t\in\mathbb{T}}$ satisfying Eq. (2.7), we have $\ell = k$ and
$$r^\top = \begin{bmatrix} b_k & b_{k-1} & \cdots & b_0 \end{bmatrix}, \tag{2.11}$$
whereas, for the oscillatory term $(s_t)_{t\in\mathbb{T}}$ of period p, $\ell = p$ and
$$r^\top = \begin{bmatrix} 1 & 0 & \cdots & 0 & -1 \end{bmatrix}. \tag{2.12}$$

Because the time series $(y_t)_{t\in\mathbb{T}}$ is made of N data samples, Eq. (2.10) is valid for any of the following vectors:
$$\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_\ell \end{bmatrix}, \quad \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{\ell+1} \end{bmatrix}, \quad \cdots, \quad \begin{bmatrix} y_{N-1-\ell} \\ y_{N-\ell} \\ \vdots \\ y_{N-1} \end{bmatrix}. \tag{2.13}$$
Thus,
$$r^\top Y_\ell = 0, \tag{2.14}$$

13

or, equivalently, Y  r = 0,

.

(2.15)

where .Y  stands for the Hankel matrix2 defined as follows (Katayama, 2005, Chapter 2): ⎡

y0 y1 ⎢y1 y2 ⎢ .Y  = ⎢ . ⎣ .. . . . y y+1

⎤ · · · yN −1− · · · yN − ⎥ ⎥ (+1)×(N −) , N > . .. ⎥ ∈ R .. ⎦ . . · · · yN −1

(2.16)

This specific Hankel matrix, called the trajectory matrix in the SSA literature (Golyandina, 2020), is our second ingredient of the recipe. Let us now focus on Eq. (2.15). Because .r = 0, such equality means that .Y  is rank deficient (Meyer, 2000, Chapter 7), i.e.: • The null space of .Y   is not empty  • The columns of .Y  are linearly dependent as soon as the available data points .(yt )t∈T are noise free and are made of trends and/or periodic components for instance. Said differently, as soon as the dynamics of the components of .(yt )t∈T can be accurately described with recurrent equations (such as, e.g., Eq. (2.7)), this dynamical information is available through the analysis of the rank and the column space of an Hankel matrix (Katayama, 2005, Chapter 2) made of the available data points. This observation is strongly linked to the following theorem (Markovsky, 2008, 2012): Theorem 2.1 Given a sequence .(yt )t∈T and a polynomial function R(q) = R0 + R1 q + · · · + R q ∈ R[q]

(2.17)

R(q)yt = 0,

(2.18)

.

such that .

then: • The Hankel matrix .Y  built from the sequence .(yt )t∈T is low rank • .(yt )t∈T is generated by the finite linear time invariant dynamical model characterized by the recurrent Eq. (2.18) are equivalent facts.

2 A Hankel matrix, named after Hermann Hankel, is a matrix, the parameters of which are constant along each antidiagonal.

14

2 Singular Spectrum Analysis of Univariate Time Series



Proof See Markovsky (2008, Section II).

The third ingredient of the technique introduced in this chapter is thus a linear algebra tool enabling the determination of the rank of matrix, its column space as well as its low rank approximation. Such a tool is the singular value decomposition (SVD) (Meyer, 2000, Chapter 5). The Eckart–Young–Mirsky theorem (Golub and Van Loan, 2013, Chapter 2) (see below) clearly shows the usefulness of the SVD for matrix rank determination and low rank approximation. ˆ ∈ Rm×n , the cost Theorem 2.2 Given a matrix .A ∈ Rm×n , minimizing, over .A function ˆ F s.t. rank(A) ˆ ≤ r ≤ min(m, n),

A − A

.

(2.19)

where . • F stands for the Frobenius norm (Meyer, 2000, Chapter 5), has an analytic solution in terms of the singular value decomposition of the data matrix .A. Indeed, with      Σ1 0  V1 , A = U1 U2 0 Σ2 V  2

.

(2.20)

where .U 1 ∈ Rm×r , .U 2 ∈ Rm×(m−r) , .Σ 1 ∈ Rr×r , .V 1 ∈ Rn×r , and .V 2 ∈ Rn×(n−r) , ˆ ∗ = U 1 Σ 1 V  is such that the rank r matrix .A 1 ˆ ∗ F =

A − A

.

min

ˆ rank(A)≤r

ˆ F =

A − A

 2 + · · · + σ 2. σr+1 n

(2.21)



ˆ is unique if and only if .σr = σr+1 . On top of that, The minimum .A ˆ ∗ 2 =

A − A

.

min

ˆ rank(A)≤r

ˆ 2 = σr+1 ,

A − A

(2.22)

where . • 2 stands for the spectral norm (Meyer, 2000, Chapter 5). Proof See Golub and Van Loan (2013, Section II).



The main ingredients being available, it is time to introduce the main steps composing the “basic” SSA recipe (Golyandina and Zhigljavsky, 2013).

2.2 The Basic Singular Spectrum Analysis for Low Rank Approximation As shown in Elsner and Tsonis (1996), Golyandina and Zhigljavsky (2013), Golyandina and Zhigljavsky (2018), the “basic” SSA technique consists of two

2.2 The Basic Singular Spectrum Analysis for Low Rank Approximation

15

stages that include two steps each. The first stage aims at decomposing a user defined Hankel matrix (the matrix .Y  introduced previously) into (the sum of) rank 1 matrices thanks to an SVD. The second stage reconstructs times series of each component of .(yt )t∈T from user defined groups of rank 1 matrices generated in the first stage. The SSA recipe is more precisely described in the following subsections.

2.2.1 Times Series Decomposition By assuming that the time series .(yt )t∈T is made of components, the behavior of which can be described by linear recurrent equations, the first step of the SSA procedure consists in constructing the trajectory matrix .Y  given in Eq. (2.16), i.e., filling up an Hankel matrix with the data points .(y0 , · · · , yN −1 ). Such a step leads to a matrix .Y  of dimension . + 1 by .N −  (as shown in Eq. (2.16)) where . is a user defined integer chosen large enough to guarantee that the coefficients of all the aforementioned recurrent equations are in the null space of .Y   (see, e.g., Golyandina 2020, Section 2.6 for an interesting discussion on the choice of . in SSA). Once the trajectory matrix .Y  is generated, its analysis, more precisely the extraction of the main dynamical patterns of the time series, is then performed by applying an SVD to .Y  , i.e., for a matrix .Y  of rank r, by using the following data description: Y  = U ΣV  =

r 

.

σi ui v  i =

i=1

r 

Y i ,

(2.23)

i=1

with .r = rank(Y  ) and where (Zhang, 2017, Chapter 5): • .σi ∈ R, .i ∈ {1, · · · , r}, are the r singular values of .Y  satisfying .σ1 ≤ · · · ≤ σr . • .Σ = diag(σ1 , · ·· , σr ) is a square diagonal of size .r × r. • .U = u1 · · · ur ∈ R(+1)×r is a semi-orthogonal matrix, i.e., a matrix satisfying  .U U = I r×r .   • .V = v 1 · · · v r ∈ R(N −)×r is a semi-orthogonal matrix, i.e., a matrix satisfying .V  V = I r×r . • .ui ∈ R(+1)×1 , .i ∈ {1, · · · , r}, are the first r left singular vectors of .Y  . • .v i ∈ R(N −)×1 , .i ∈ {1, · · · , r}, are the first r right singular vectors of .Y  . More specifically, as far as the first stage of the SSA procedure is concerned, a specific attention is paid to the right singular vectors and singular values of .Y  : • First, as shown, e.g., in Zhang (2017, Chapter 5), range(Y   ) = range(V ).

.

(2.24)

16

2 Singular Spectrum Analysis of Univariate Time Series

Thus, analyzing and plotting the evolution of the column of .V are equivalent to handling .Y   for testing, e.g., the linear dependency of the rows of .Y  . But, contrary to .Y   , .V has orthonormal columns, i.e., .V gives access to an  orthonormal basis of .range(Y   ). Thus, by focusing on .V instead of .Y  , the main ingredients for generating the dynamical information (such as the time series trend and seasonality) contained in the column space of .Y   are available with a subspace representation benefiting form a better separability capability. • Second, Eq. (2.23) clearly shows that the SVD of .Y  yields a decomposition of .Y  into a sum of rank 1 matrices weighted each by a singular value. Because (Golub and Van Loan, 2013, Chapter 2) r 

Y  F =

.

σi and Y i F = σi ,

(2.25)

i=1

the ratio σi rσ = r

.

(2.26)

i=1 σi

quantifies the contribution of .Y i into .Y  . The beneficial impact of these two properties for dynamical pattern extraction is illustrated via the analysis of the monthly maximum temperatures in Paris given in Fig. 2.1.

30

25

20

15

10

5

0 2002

2004

2006

Fig. 2.1 Monthly maximum temperatures in Paris

2008

2010

2012

2.2 The Basic Singular Spectrum Analysis for Low Rank Approximation

17

Illustration 2.3 Let us consider the time series given in Fig. 2.1. These 154 data points are the monthly maximum temperatures in Paris, France, between January 2000 and November 2012. As clearly shown in Fig. 2.1, this time series is periodic (with a period of 12 months) with an average of .16 ◦ C approximately. By selecting . = 12 in order to describe the yearly periodicity of the time series, the trajectory matrix .Y 12 can be generated. Figures 2.2 and 2.3 show the thirteen singular values and the thirteen right singular vectors of .Y 12 , respectively. These plots clearly indicate that: • This time series is made of three main components, one for the mean value and two for the main periodicity. • These three main components embed more than .80% of the dynamics included in the trajectory matrix .Y 12 (sum of the corresponding .rσ ). Notice that the thirteen singular values are different from zero, which means that .Y 12 is not rank deficient as expected. The main reason why the null space of .Y 12 is restricted to the zero vector is linked to the fact that this time series is noisy (as any real time series), which implies that small and slowly decaying singular values appear in the singular value spectrum (because of the noise contribution). This feature is clearly shown in Fig. 2.2 where we see that the last nine singular values are very small when compared with the first three singular values (see Sect. 2.3 for further discussions about the noise effects in SSA).

700

600

500

400

300

200

100

0

1

2

3

4

5

6

7

8

9

10

Fig. 2.2 Leading singular values of .Y 12 . Maximum temperatures in Paris

11

12

13

18

2 Singular Spectrum Analysis of Univariate Time Series

0.2

0.2

0.2

0.2

0

0

0

0

0

-0.2

-0.2

-0.2

-0.2

-0.2

0

50 100

0

50 100

0

50 100

0.2

0

50 100

0.2

0.2

0.2

0.2

0

0

0

0

0

-0.2

-0.2

-0.2

-0.2

-0.2

0

50 100

0

50 100

0.2

0.2

0

0

0

-0.2

-0.2

-0.2

0

50 100

0

50 100

0

50 100

0

50 100

0

50 100

0.2

0

50 100

0.2

0

50 100

Fig. 2.3 Leading right singular vectors of .Y 12 . Maximum temperatures in Paris

Remark 2.2 The analysis of the SVD of $Y_\ell$ has been performed by focusing on its singular values and right singular vectors only. Complementary tools such as specific 2D scatterplots or w-correlation plots can be introduced to help the user select the useful components for the next stage of SSA. See, e.g., Hassani (2007) for an interesting illustration of these complementary graphical tools for the SSA analysis of the time series of monthly accidental deaths in the USA between 1973 and 1978. As shown, however, in this chapter, plotting the singular values and right singular vectors of $Y_\ell$ is often sufficient to draw first conclusions and guide the user to select the main components of the time series and/or filter out the noise effect efficiently. The first observation of the former example, more specifically the presence of two close singular values and the necessity to involve two components to describe the main periodicity pattern of the original time series, is an illustration of the fact that "for an arbitrary frequency component, no matter how much the frequency and its amplitude are, this frequency component always generates only two nonzero singular values" (Zhao and Ye, 2018). Such a property of harmonic components can be seen as supplementary information to help the user select the useful components from the decomposition of $Y_\ell$. This indeed guides the user to select a window length $\ell$ proportional to the frequency component he aims at characterizing.

2.2.2 Times Series Reconstruction As quantified by the ratio introduced in Eq. (2.26), the rank 1 matrices .Y i , .i ∈ {1, · · · , r}, contribute in a different way to the global dynamics of .(yt )t∈T . On top of

2.2 The Basic Singular Spectrum Analysis for Low Rank Approximation

19

that, as clearly shown in Fig. 2.3, the right singular vectors have different dynamical behaviors and, more importantly, some of them have similar time evolution. These observations lead the user to gather the elementary matrices .Y i , .i ∈ {1, · · · , r}, into disjoint groups of rank 1 matrices having equivalent behaviors. Such a step is called “grouping” in the SSA literature (Hassani, 2007). More specifically, after, e.g., the analysis of the singular values and right singular vectors as performed in Figs. 2.2 and 2.3, respectively, the user is asked to select which elementary matrices among the r rank 1 matrices .Y i , .i ∈ {1, · · · , r}, must be grouped to mimic the dynamical behavior he wants to describe effectively. At the end of this grouping step, the user has access to a new trajectory matrix Y˜  =

g 

.

σi ui v  i ,

(2.27)

i=1

where g stands for the number of user selected components. Again, once these components are selected, the global contribution of these grouped rank 1 matrices can be quantified with the ratio given in Eq. (2.26), i.e.,  .

i∈Ig

σi

i∈I

σi



,

(2.28)

where .I = {1, · · · , r}, whereas .Ig stands for a subset of I made of the g user selected elementary matrices .Y i indices, .i ∈ {1, · · · , r}. Remark 2.3 Thanks to the orthonormality property of the singular vectors, it can be shown that (Meyer, 2000, Chapter 5) Y˜  = U g U  g Y ,

.

(2.29)

  with .U g = ui i∈I . Said differently, the truncated trajectory matrix .Y˜  is the result g of the orthogonal projection of .Y  onto .range(U g ), i.e., on the user selected signal subspace .U g (Meyer, 2000, Chapter 5). It is essential to notice that because of the SVD and the selection of specific rank 1 matrices among the r matrices .Y i , .i ∈ {1, · · · , r}, the reconstructed matrix .Y˜  loses its Hankel structure. A diagonal averaging step is thus necessary to “rehankelize” .Y˜  . Such a step consists in computing the k-th term of the “rehankelized” matrix by averaging3 the components .y˜ij over its antidiagonal elements, i.e., all i and j such that .i + j = k + 2. As proved in Golyandina et al. (2001, Chapter 6), this hankelization procedure is optimal in the sense that the generated matrix is the closest to .Y˜  (with respect to the Frobenius matrix norm) among all Hankel matrices of the corresponding size. Once the rehankelized 3 See

Golyandina and Zhigljavsky (2013, Paragraph 2.1.1.2) for a generic equation.

20

2 Singular Spectrum Analysis of Univariate Time Series

17

16.5

16

15.5 2002

2004

2006

2008

2010

2012

2002

2004

2006

2008

2010

2012

10 5 0 -5 -10

Fig. 2.4 SSA time series components: trend (top) and seasonality (bottom). Maximum temperatures in Paris

trajectory matrix .Y˜rehan is available, the reconstruction of the corresponding time series samples .(y˜trehan )t∈T is straightforward by focusing on the first column and last row of .Y˜rehan [just have a look to Eq. (2.16) (more specifically, the components of its first column and last row, respectively) for a direct proof of this claim].

Illustration 2.4 Going back to the Paris temperature time series, the grouping of the first three components, then the rehankelization of the generated trajectory matrix leads to the reconstructed time series .(y˜trehan )t∈{0,··· ,153} in red in Fig. 2.5 (see also Fig. 2.4 for the reconstructed trend and seasonality). These curves clearly reproduce the main dynamics of the initial time series by embedding the mean and the seasonality of the time series .(yt )t∈{0,··· ,153} (Fig. 2.5).

2.3 The Impact of the SVD for the Basic SSA Once the Hankel matrix .Y  is generated, the different stages composing the basic SSA solution mainly rely on the trajectory matrix singular value decomposition. This tool is indeed used in SSA to recover a specific data matrix (denoted herein by ˜  ), the components of which should be selected by the user from the analysis of the .Y

2.3 The Impact of the SVD for the Basic SSA

21

30

20

10

0

2002

2004

2006

2008

2010

2012

2002

2004

2006

2008

2010

2012

4 2 0 -2 -4 -6

Fig. 2.5 Initial and SSA time series (top) with residuals (bottom). Maximum temperatures in Paris

SVD of .Y  . Once .Y˜  is available, the extraction of the SSA time series components indeed boils down to a “rehankelization” of this matrix. Because of the central role played by the SVD in the SSA success, we aim at explaining in this section why this linear algebra tool is a good candidate to extract these components efficiently.

2.3.1 SVD and Separability First, by recalling that the SVD of .Y  writes Y =



.

σi ui v  i ,

(2.30)

i∈I

with .I = {1, · · · , r}, the reconstructed trajectory matrix .Y˜  satisfies Y˜  =



.

σi ui v  i ,

(2.31)

i∈Ig

where .Ig stands for the user selected component indices. By denoting by .E  the remaining components of .Y  , i.e., E =



.

i∈I \Ig

σi ui v  i ,

(2.32)

22

2 Singular Spectrum Analysis of Univariate Time Series

we thus have Y  = Y˜  + E  ,

.

(2.33)

and, thanks to the orthonormality of the singular vectors of .Y  (Meyer, 2000, Chapter 5), it can be shown that Y˜  E   = 0× .

.

(2.34)

The orthogonality between .Y˜  and .E  (thanks to the SVD) clearly emphasizes the interest of using an SVD to guarantee an efficient separability of .Y˜  and .E  , thus a consistent extraction of user defined time series components. Said differently, thanks to the orthonormal system generated with the SVD of .Y  , the separation of interpretable components such as the trend, regular oscillations, and/or the noise is made possible (see also Golyandina and Zhigljavsky 2013, Section 2.3.3 for further discussion on the separability of components in time series). Remark 2.4 If, instead of considering .Y  , its centered transposed is used, i.e., the mean of each row is removed before transposing the resulting matrix, the SVD introduced in step 2 of the basic SSA can be related to the famous principal component analysis (PCA) (Vidal et al., 2016). This is probably the reason why, in the SSA literature (Hassani and Thomakos, 2010), the left singular vectors .ui , .i ∈ {1, · · · , r}, are called the “factor empirical orthogonal functions,” whereas the right singular vectors .v i , .i ∈ {1, · · · , r}, are called the “principal components” (see Golyandina 2020 for a short description of the similarities between SSA and PCA).

2.3.2 SVD and Noise Effect Filtering Until now, no specific attention has been paid to the noise acting on any acquired data set. Noise has however an effect on the dynamical pattern extraction as illustrated, e.g., in Fig. 2.2 through its impact on the magnitude of the singular values. When SSA is involved, a natural way for noise extraction consists in grouping the rank 1 matrices that do not seemingly contain elements of trend or oscillations and see them as noise components. This idea can be linked to specific properties of the SVD illustrated in De Moor (1993). Indeed, as suggested in De Moor (1993), a particular analysis of the SVD of the Hankel matrix .Y  can be performed by using simple geometrical and algebraic tools. By assuming that: • .2 > N − 1, i.e., .Y  has more columns than rows • The time series .(yt )t∈T is made of a noise free sequence .(y˘t )t∈T (e.g., .y˘t = mt + st , .t ∈ T) and an additive noise .(nt )t∈T , i.e., yt = y˘t + nt , t ∈ T

.

(2.35)

2.3 The Impact of the SVD for the Basic SSA

23

then, thanks to the structure of the Hankel matrix .Y  , Y  = Y˘  + N  .

(2.36)

.

Let us now introduce the SVD of .Y  and .Y˘  as follows (Meyer, 2000, Chapter 5):      Σ1 0 V1 .Y  = U 1 U 2 = U1 × Σ1 × V  1 

 

0 Σ2 V  2 

r×r

(+1)×r

+

×

U2 

(+1)×(+1−r)

Σ2 

r×N −

×

(+1−r)×(N −−r)

V 2 

r×r

(2.37)

(N−−r)×(N −)

   ˘ 1 0 V˘    Σ 1 ˘ 1 × V˘  ˘ ˘ 1 U˘ 2 .Y  = U U˘ 1 × Σ  = 

1 , 



0 0 V˘ 2 (+1)×r

,

(2.38)

r×N −

where .r ≤  stands for the rank of .Y˘  . Remark 2.5 The parameter . has been chosen large enough by the user so that the trajectory matrix embeds the dynamics of the process that has generated the data points. Said differently, . has been chosen so that .Y˘  is rank deficient, i.e., ˘  ) = r <  + 1. On the contrary, because the data points .(yt )t∈T are noisy, .rank(Y the Hankel matrix .Y  should be full rank, i.e., .rank(Y  ) =  + 1. Thus, the following theorem can be stated (De Moor, 1993; Katayama, 2005). Theorem 2.3 Given a time series .(yt )t∈T made of a noise free sequence .(y˘t )t∈T and an additive noise .(nt )t∈T , under the assumption that 

2 ˘ N N   = σ I (+1)×(+1) and N  Y  = 0(N−)×(N −) ,

.

(2.39)

we have 





˘ 1 V˘ 1 + U˘ 1 U˘ 1 N  + U˘ 2 U˘ 2 N  Y  = U˘ 1 Σ    (Σ  ˘ 21 + σ 2 I r×r )1/2 0 = U˘ 1 U˘ 2 . 0 σ I (+1−r)×(+1−r)  2   ˘ 1 V˘  ˘ ˘ 1 + σ 2 I r×r )−1/2 (Σ + U N ) (Σ  1 1 × ,  U˘ 2 N  σ −1

(2.40)

24

2 Singular Spectrum Analysis of Univariate Time Series

and  .Y  Y 

     Σ  ˘ 21 + σ 2 I r×r U˘ 1 0 = U˘ 1 U˘ 2  . 0 σ 2 I (+1−r)×(+1−r) U˘ 2

Proof See De Moor (1993, Section II) and Katayama (2005, Section 6.7).

(2.41) 

The main consequences of this theorem are the following: • The left singular vectors of the noise free matrix .Y˘  can be reconstructed from the left singular vectors of .Y  perfectly. • The right singular vectors of .Y  are impacted by the noise component. ˘ 21 + σ 2 I r×r is larger than .σ 2 , i.e., the singular • The smallest singular value of .Σ spectrum of .Y  should have a gap. 

Said differently, .range(Y˘  ) and .ker(Y˘  ) can be reconstructed perfectly from the  SVD of .Y  , whereas the extraction of .range(Y˘  ) and .ker(Y˘  ) from the SVD of .Y  is biased. On top of that because the last . + 1 − r singular values of .Y  are all equal to .σ 2 , .σ 2 can be estimated from the singular spectrum of .Y  directly and the exact singular values of .Y˘  can be determined a posteriori as follows: .

˘ 1 = (Σ 21 − σ 2 I r×r )1/2 . Σ

(2.42)

This gap in the singular spectrum of .Y  is clearly illustrated in Fig. 2.2. An interesting side effect of the SVD of .Y  is what we could call its denoising feature. Indeed, by keeping only the singular components (singular values, left and right singular vectors) corresponding to the r largest singular values, the noise effect is filtered out from the noisy signals. As pointed out in Theorem 2.3, this noise effect removal is not perfect (the right singular vectors of .Y˘  are indeed not reconstructed perfectly), but the fact that no other rank r approximation captures more of the dynamics in the data than this SVD based solution (as proved via the former Eckart–Young–Mirsky theorem Golub and Van Loan 2013, Chapter 2) leads us to emphasize this SVD denoising approach. On top of that, as proved, e.g., in Golub and Van Loan (2013, Chapter 2), the robustness of the SVD to mild violations  of the assumptions introduced in the former theorem (such as . N  Y˘  F is small instead of zero) guarantees that the SVD of .Y  gives access to good approximations of the sought signal subspace in the end (see also De Moor 1993). Remark 2.6 In Theorem 2.3, the main assumptions are 2 ˘ N N   = σ I (+1)×(+1) and N  Y  = 0(N−)×(N −) .

.

(2.43)

2.3 The Impact of the SVD for the Basic SSA

25

In order to point out the physical meaning of these assumptions, let us focus on the  matrix multiplications .Y  Y   and .N  N  . Thanks to the Hankel structure of these matrices, we have ⎡

λz,0 λz,−1 ⎢λz,1 λz,0 ⎢  .Z  Z  = ⎢ . .. ⎣ .. . λz, λz,−1

⎤ · · · λz,− · · · λz,1− ⎥ ⎥ . ⎥, .. . .. ⎦ · · · λz,0

(2.44)

where .zt , .t ∈ T, stands for .yt , .t ∈ T, or .nt , .t ∈ T, respectively, whereas λz,i =

N−−1 

.

zk+i zk , i ∈ {0, · · · , } with, for any i, λz,i = λz,−i .

(2.45)

k=0 1  Thus, by introducing the weight . N1 , i.e., by considering . N1 Y  Y   and . N N  N  1  instead of .Y  Y   or .N  N  , the generic coefficients . N λz,i , .i ∈ {0, · · · , }, are 4 nothing but finite approximations of the correlation between two time instants of the random sequence .(zt )t∈T . This observation means that assuming

.

1 2 N N   = σ I (+1)×(+1) , . N→∞ N 1  lim N  Y˘  = 0(N−)×(N −) , N→∞ N lim

(2.46a) (2.46b)

is nothing but considering that: • The sequence .(nt )t∈T is as a finite realization of a zero mean white noise of variance .σ 2 . • The random variables generating the noise and the noise free sequences are orthogonal (and uncorrelated if they are zero mean on top of that). Notice also that, thanks to this direct link between the matrices involved in the SSA stages and correlation coefficients, the first stage of the basic SSA (i.e., the construction of .Y  and its SVD) is nothing but the SVD based step of the Karhunen– Loeve transform (KLT) (Gerbrands, 1981) if the time series .(yt )t∈T is assumed to be a realization of a zero mean random sequence .(yt )t∈T (see Golyandina 2020 for a short description of the similarities between SSA and KLT).

4 These .(zt )t∈T

approximations become exact when N tends to infinity and when the random variable is ergodic (see Papoulis 2000 and Van Overschee and De Moor 1996, Chapter 3).

26

2 Singular Spectrum Analysis of Univariate Time Series

2.3.3 In a Nutshell As explained in De Moor (1993), the SVD is the linear algebra tool to be used when: • “The information on the data generating mechanism is completely contained in certain subspaces of the data matrix.” • “The complexity of the model is given by the rank of the data matrix.” • “The data are corrupted by additive noise.” The extraction of trend and/or seasonality patterns from noisy time series via the SVD of a specific Hankel matrix is a perfect illustration of these three beneficial features of the SVD. This probably explains the reasons why the SVD of .Y  is the core of SSA.

2.3.4 Another Example In order to convince the reader of the usefulness of the basic SSA algorithm for extracting the trend and the seasonal components from a noisy time series, let us consider another time series.

Illustration 2.5 The time series given in Fig. 2.6 and made of 384 samples shows the monthly mean carbon dioxide concentration measured at Mauna Loa Observatory, Hawaii, USA, between 1959 and 1991. This curve clearly has a polynomial trend and a yearly periodicity. Thus, by selecting again . = 12, the steps composing the basic SSA algorithm can be used to extract these components from the initial time series. Again, according to Figs. 2.7 and 2.8, the trend is available from the first singular vectors, whereas the seasonal pattern is summed up with the next four singular vectors (components 2, 3, 4, and 5). These observations are confirmed after signal reconstruction as shown in Figs. 2.9 and 2.10, respectively.

2.4 Take Home Messages • The goal of SSA is to decompose a time series into a small number of interpretable components such as a trend, an oscillatory pattern, and noise. • The SSA class of solutions can be used without making any statistical assumptions on the time series to be decomposed.

2.4 Take Home Messages

27

360 355 350 345 340 335 330 325 320 315 310 1960

1965

1970

1975

1980

1985

1990

Fig. 2.6 Monthly mean carbon dioxide concentration measured at Mauna Loa Observatory, Hawaii 2.5

10 4

2

1.5

1

0.5

0

1

2

3

4

5

6

7

8

9

10

11

12

13

Fig. 2.7 Leading singular values of .Y 12 . CO2 concentration at Mauna Loa Observatory, Hawaii

• The basic SSA solution assumes the existence of an unknown linear recurrent equation satisfied by the time series to be analyzed. • The basic SSA algorithm is made of four steps and requires Hankel matrices and SVD only. • SSA can be used for denoising a time series or, in a complementary way, for signal subspace extraction.

28

2 Singular Spectrum Analysis of Univariate Time Series

0.1

0.1

0.1

0.1

0

0

0

0

0

-0.1

-0.1

-0.1

-0.1

-0.1

0

200

0

200

0

200

0.1

0

200

0.1

0.1

0.1

0

0

0

0

0

-0.1

-0.1

-0.1

-0.1

-0.1

0

200

200

0.1

0.1

0

0

0

-0.1

-0.1

-0.1

0

200

0

200

0

200

200

0

200

0.1

0.1

0

0

0

200

0.1

0

200

Fig. 2.8 Leading right singular vectors of .Y 12 . CO2 concentration at Mauna Loa Observatory, Hawaii

350 340 330 320 1960

1965

1970

1975

1980

1985

1990

1960

1965

1970

1975

1980

1985

1990

2 0 -2 -4

Fig. 2.9 SSA time series components: trend (top) and seasonality (bottom). CO2 concentration at Mauna Loa Observatory, Hawaii

References

29

360 350 340 330 320 310

1960

1965

1970

1975

1980

1985

1990

1960

1965

1970

1975

1980

1985

1990

0.5

0

-0.5

Fig. 2.10 Initial and reconstructed SSA time series (top) and residuals (bottom). CO2 concentration at Mauna Loa Observatory, Hawaii

References G. Box, G. Jenkins, G. Reinsel, G. Ljung, Time Series Analysis: Forecasting and Control (Wiley, Hoboken, 2016) P. Brockwell, R. Davis, Time Series: Theory and Methods (Springer, Berlin, 1991) P. Brockwell, R. Davis, Introduction to Time Series and Forecasting (Springer, Berlin, 2016) I. Bronshtein, K. Semendyayev, G. Musiol, H. Muhlig, Handbook of Mathematics (Springer, Berlin, 2015) B. De Moor, The singular value decomposition and long and short spaces of noisy matrices. IEEE Trans. Signal Process. 41, 2826–2838 (1993) J. Elsner, A. Tsonis, Singular Spectrum Analysis: A New Tool in Time Series Analysis (Springer, Berlin, 1996) J. Gerbrands, On the relationships between SVD. KLT and PCA. Patt. Recog. 14, 375–381 (1981) G. Golub, C. Van Loan, Matrix Computations (John Hopkins University Press, Baltimore, 2013) N. Golyandina, Particularities and commonalities of singular spectrum analysis as a method of time series analysis and signal processing. Wiress Comput. Statist. 12, 1–39 (2020) N. Golyandina, A. Zhigljavsky, Singular Spectrum Analysis for Time Series (Springer, Berlin, 2013) N. Golyandina, A. Zhigljavsky, Singular Spectrum Analysis with R (Springer, Berlin, 2018) N. Golyandina, V. Nekrutkin, A. Zhigljavsky, Analysis of Time Series Structure: SSA and Related Techniques (CRC Press, Boca Raton, 2001) H. Hassani, Singular spectrum analysis: methodology and comparison. J. Data Sci. 5, 239–257 (2007) H. Hassani, D. Thomakos, A review of singular spectrum analysis for economic and financial time series. Statist. Inter. 3, 377–397 (2010) T. Katayama, Subspace Methods for System Identification (Springer, Berlin, 2005) I. Markovsky, Structured low-rank approximation and its applications. Automatica 44, 891–909 (2008)

30

2 Singular Spectrum Analysis of Univariate Time Series

I. Markovsky, Low Rank Approximation: Algorithms, Implementation, Applications (Springer, Berlin, 2012) C. Meyer, Matrix Analysis and Applied Linear Algebra (SIAM, Philadelphia, 2000) A. Oppenheim, R. Schafer, Discrete Time Signal Processing (Pearson, London, 2014) A. Oppenheim, S. Willsky, S. Hamid, Signals and Systems (Pearson, London, 2014) A. Papoulis, Probability, Random Variables, and Stochastic Processes (McGraw-Hill Europe, Irvine, 2000) P. Van Overschee, B. De Moor, Subspace Identification for Linear Systems. Theory, Implementation, Applications (Kluwer Academic Publishers, Amsterdam, 1996) R. Vidal, Y. Ma, S. Sastry, Generalized Principal Component Analysis (Springer, Berlin, 2016) X. Zhang, Matrix Analysis and Applications (Cambridge University, Cambridge, 2017) X. Zhao, B. Ye, Separation of single frequency component using singular value decomposition. Circuits Syst. Signal Process. 38, 191–217 (2018)

Chapter 3

Trend and Seasonality Model Learning with Least Squares

The former chapter has introduced an efficient solution to decompose any discrete times series into the sum of a small number of physically meaningful components. Often used in practice to get a first insight into the time series main components (Hassani and Zhigljavsky, 2009; Rodriguez-Aragon and Zhigljavsky, 2010; Bricenoa et al., 2013), this approach is by construction nonparametric and hence does not give access to a compact description of the time series under study. As explained in Chap. 1, time series model learning is made of several steps, the final goal of which is the characterization of the dynamical system generating the observable data points with a compact and parsimonious mathematical representation, i.e., a finite set of parameters. This is all the more true when the estimated model is used for prediction (Box et al., 2016; Brockwell and Davis, 2016). Thus, having access to additional solutions yielding parametric models would be of prime interest for the user. In this chapter, a specific attention is paid to the determination of parametric models of the time series deterministic components: the trend and the seasonal patterns. As explained in Chap. 1, once these components are accurately estimated, the residuals, i.e., the difference between the raw data and the estimated model data points, should be studied by resorting to a different paradigm: randomness and statistics in order to describe the unpredictable variations of these residuals. As shown, e.g., in (Commandeur and Koopman, 2007; Box et al., 2016; Brockwell and Davis, 2016), many solutions for the estimation and/or elimination of the trend and seasonality patterns are available in the literature. We can cite the smoothing methods with finite moving average, exponential or low pass filters as well as the techniques resorting to specific differentiation operators (Box et al., 2016; Brockwell and Davis, 2016). In order to bypass tuning issues linked to these filter based solutions, least squares data fitting methods (Lawson and Hanson, 1995; Björck, 1996; Seber and Wild, 2003) are favored in this chapter. By least squares data fitting, we mean solutions that determine the unknown parameters of a parametric mathematical function by minimizing the sum of squared differences © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 G. Mercère, Data Driven Model Learning for Engineers, https://doi.org/10.1007/978-3-031-31636-4_3

31

32

3 Trend and Seasonality Model Learning with Least Squares

between the observed data and the fitted outputs provided by the model. The main reasons why this class of solutions is selected are mainly linked to the availability of (Lawson and Hanson, 1995; Björck, 1996; Seber and Wild, 2003): • • • • •

A vast literature dedicated to least squares solutions for regression Analytic solutions under specific practical circumstances Statistical results proving the accuracy and consistency of the estimates Dedicated toolboxes and software Solutions to tune required hyperparameters effectively

On top of that, as shown in Chap. 5 for instance, some of these least squares tools can be used for residuals modeling as well. The problem solved with the least squares data fitting methods introduced hereafter is more precisely the following. By assuming the access to: • A time series .(yt )t∈T • A user defined parametric model .f (t, θ ), where .θ ∈ Rnθ ×1 stands for the vector of unknown parameters the least squares regression problem considered herein aims at determining an accurate estimate .θˆ of .θ that minimizes with respect to .θ the cost function V (θ ) =

N −1 

.

yi − f (ti , θ )22 ,

(3.1)

i=0

where . • 2 stands for the Euclidean norm (Meyer, 2000, Chapter 5). Such a cost function is required because, in practice, unless the data sets have been generated artificially, there is no .θ such that .yi = f (ti , θ ) for all .i ∈ {0, · · · , N − 1}. As a compromise, we aim at determining the parameter vector .θˆ that makes the residuals .yi − f (ti , θ ) as small as possible for all .i ∈ {0, · · · , N − 1}. In order to quantify this notion of “small residuals,” a norm (Debnath and Mikusinski, 2005, Chapter 1) must be introduced. The Euclidean norm is used herein (instead of the .1 norm (James et al., 2017, Chapter 6) for instance) because this norm does not require to involve tools such as subgradients (Lemaréchal, 1982) that render minimization more involved and, in the end, not necessary when trend and/or periodic function parameters have to be estimated. As shown hereafter, closed form or approximated solutions can be derived easily according to the complexity of the model description .f (t, θ ). Indeed, when this function is chosen linear with respect to the unknowns .θ , analytic solutions can be determined even when regularization is necessary (Björck, 1996). When this linearity property is violated, heuristics, approximations, and iterative algorithms are required (Nocedal and Wright, 2006; Rao, 2009). Both classes of solutions are introduced hereafter. More precisely, Sect. 3.1 focuses on the linear least squares estimation problem and its solutions. A specific attention is paid to analytic solutions, their numerical implementations with the QR factorization and the singular value decomposition (SVD) as well as numerical

3.1 Linear Least Squares Solutions

33

improvements when ill conditioned problems come into play. Section 3.2 is then dedicated to specific iterative algorithms to be used when the model structure is not linear with respect to the unknown parameters. The famous Levenberg– Marquardt algorithm (Nocedal and Wright, 2006) is more specifically introduced to determine the unknowns of a nonlinear model when a sum of squared residual cost function is considered. Section 3.3 introduces standard model validation methods for .(i) assessing a good model complexity (trade-off between model complexity and data fitting capability) and .(ii) tuning the hyperparameters of the optimization algorithms. Section 3.4 gives access to a short list of practicalities, i.e., a short implementation aspects guide that can be useful in practice. Finally, Sect. 3.5 concludes this chapter with the main points to remember.

3.1 Linear Least Squares Solutions 3.1.1 Problem Formulation Let us first focus on simple solutions for trend and seasonal patterns modeling. By assuming that the trend within the time series .(yt )t∈T is smooth, polynomial functions of sufficient high order can be used to describe such dynamics accurately (Trefethen, 2013, Chapter 6). Said differently, for trend modeling, the following model structure can be suggested f (t, θ ) =

k 

.

  cj t j = 1 t · · · t k × θ ,

(3.2)

j =0

where k is the polynomial function order to be selected by the user, whereas   θ = c0 c1 · · · ck ∈ R(k+1)×1 .

.

(3.3)

Let us now turn to the seasonality. Because a seasonality is periodic by definition, such a pattern can be described by Fourier series efficiently (Oppenheim et al., 2014, Chapter 5). Said differently, by denoting by .Tp the period of the seasonal term, we can introduce the following model structure:

f (t, θ ) = a0 +

k 

.

j =1

aj cos(j ωp t) +

k 

bj sin(j ωp t)

j =1

  = 1 cos(ωp t) · · · cos(kωp t) sin(ωp t) · · · sin(kωp t) × θ,

(3.4)

34

3 Trend and Seasonality Model Learning with Least Squares

where .ωp = 2π Tp , k is the number of cosine and sine functions selected by the user, whereas this time   θ = a0 · · · ak b1 · · · bk ∈ R(2k+1)×1 .

.

(3.5)

The common denominator of both model representations is the linearity in terms of the unknown parameters .θ . Indeed, once the user has selected  the period and the orders of each model structure, the vectors . 1 · · · t k and   . 1 cos(ωp t) · · · sin(kωp t) are known a priori and, generically, ⎡ ⎤ θ1 ⎢ . ⎥   .f (t, θ) = θ0 + φ1 (t) · · · φk (t) ⎣ . ⎦ = θ0 + φ t θ, .

(3.6)

θk where .φj (t), .j ∈ {1, · · · , k}, are user defined smooth time dependent functions such as the former cosine and sine functions, whereas .θ and the intercept .θ0 stand for the unknown parameters, respectively. By focusing on N time samples, we can write a compact model linear with respect to .θ and .θ 0 f (t, θ ) = θ 0 + Φθ ,

.

(3.7)

where1   θ 0 = θ0 · · · θ0 ∈ RN ×1 , .   f (t, θ ) = f (0, θ ) · · · f (N − 1, θ ) ∈ RN ×1 , .   Φ = φ 0 · · · φ N −1 ∈ RN ×nθ , .   θ = θ1 · · · θk ∈ Rnθ ×1 . .

(3.8a) (3.8b) (3.8c) (3.8d)

Thanks to this compact description, the least squares cost function .V (θ ) introduced previously becomes V (θ) = y − θ 0 − Φθ 22 = (y − θ 0 − Φθ ) (y − θ 0 − Φθ),

.

(3.9)

with   y = y0 · · · yN −1 ∈ RN ×1 .

.

(3.10)

1 As pointed out previously, the explicit use of the initial time instant as well as the sampling period .Ts in the definition of the time samples .ti = iTs , .i ∈ {0, · · · , N − 1}, is avoided to make the notations easier.

3.1 Linear Least Squares Solutions

35

It is now time to derive the minimum solution of this cost function. Remark 3.1 The former equations emphasize the affine structure of the chosen trend and seasonal pattern models by pointing out the intercept .θ0 explicitly. In practice, estimating this offset is not an issue. Indeed, by assuming that .θ is known a priori, it is easy to see that .θˆ0 =

N −1 1  yi − N i=0



N −1 1   φi θ . N

(3.11)

i=0

By using this equation into the former cost function .V (θ ) explicitly, we get ¯ 22 , V (θ ) = y¯ − Φθ

.

(3.12)

with   y¯ = y¯0 · · · y¯N −1 ∈ RN ×1 , .   ¯ = φ¯ 0 · · · φ¯ N −1  ∈ RN ×nθ , Φ .

(3.13a) (3.13b)

where N −1 1  .y ¯i = yi − yj , i ∈ {0, · · · , N − 1}, . N

(3.14a)

N −1 1    φ¯ i = φ  φ j , i ∈ {0, · · · , N − 1}. i − N

(3.14b)

j =0

j =0

The estimation of .θ0 being not an issue once .θˆ is available, the algorithms introduced in this section will focus on the estimation of .θ from centered data only, i.e., from ¯ instead of .(y, Φ). For the sake of conciseness, the bars are dropped in the ¯ Φ) .(y, sequel, and we consider the following linear model: f (t, θ ) = Φθ ,

.

(3.15)

as well as the following cost function: V (θ) = y − Φθ 22 = (y − Φθ) (y − Φθ ),

.

(3.16)

but the reader must keep in mind that most of the results introduced in this section are valid if and only if the data sets are centered initially. Notice right now that this centering step will however not be considered in Sect. 3.2.

36

3 Trend and Seasonality Model Learning with Least Squares

3.1.2 Analytic Solution from a Projection Viewpoint In order to minimize .V (θ), let us first assume that .N > nθ , .rank(Φ) = nθ , and recall that .y ∈ RN ×1 , .θ ∈ Rnθ ×1 , and .Φ ∈ RN ×nθ . Because there is no .θ such that .y = Φθ , .y clearly does not belong to .range(Φ). Thus, the least squares approximated solution of .θ should be obtained by determining the nearest vector .y Φ ∈ range(Φ) of .y and then computing the estimate .θˆ from .y Φ and .Φ. By assuming that .rank(Φ) = nθ and .N > nθ , the columns of .Φ form a basis of .range(Φ) (Meyer, 2000, Chapter 5). By having access to such a basis, any vector such as .y Φ belonging to .range(Φ) is a linear combination of the basis vectors, i.e., θ real valued .{θi }ni=1 exist (they are unique by the way Meyer, 2000, Chapter 5) such that y Φ = θ1 Φ 1 + · · · + θnθ Φ nθ = Φθ,

.

(3.17)

where the .Φ i , .i ∈ {1, · · · , nθ }, stand for the columns of .Φ, whereas ⎤ θ1 ⎢ . ⎥ n ×1 .θ = ⎣ . ⎦ ∈ R θ . . ⎡

(3.18)

θnθ Let us now recall the closest point theorem (Meyer, 2000, Chapter 5). Theorem 3.1 Let .M be a subspace of an inner product space .V, and let b be a vector in .V. Then, the unique vector in .M that is closest to b is the orthogonal projection of b onto .M. Proof See (Meyer, 2000, Section 5.13).

 

Herein, the subspace .M is .range(Φ), whereas .y plays the role of b. Then, according to the former theorem, the unique vector .y Φ ∈ range(Φ) such that y − y Φ 2 ≤ y − z2 for all z ∈ range(Φ)

.

(3.19)

is the unique minimizing vector satisfying .y − y Φ orthogonal to .range(Φ). The translation of the closest point theorem leads us to write that, for any .ϑ ∈ Rnθ ×1 different from .0, (Φϑ) (y − y Φ ) = 0,

(3.20)

ϑ  (Φ  y − Φ  Φθ ) = 0.

(3.21)

.

or, written differently, .

3.1 Linear Least Squares Solutions

37

Because the former equality is valid for any nonzero .ϑ ∈ Rnθ ×1 , we find that the linear least squares solution satisfies the normal equations (Björck, 1996, Chapter 2) Φ  Φ θˆ = Φ  y,

.

(3.22)

whereas the optimal approximation of .y in .range(Φ) satisfies yˆ = Φ θˆ .

.

(3.23)

One interesting feature of this linear least squares solution is the fact that, once .y ∈ RN ×1 and .Φ ∈ RN ×nθ are known, the extraction of .θˆ boils down to the solution of the normal equations .Φ  Φ θˆ = Φ  y (Björck, 1996, Chapter 2), i.e., to the solution of a system of .nθ linear equations with .nθ unknown parameters. Of course, as any squared system of linear equations, the solution depends on the invertibility of the symmetric matrix .Φ  Φ ∈ Rnθ ×nθ . Because it is assumed that .rank(Φ) = nθ , .Φ  Φ is invertible, and thus the linear least squares solution for .θ is unique and given in closed form as (Björck, 1996, Chapter 2) θˆ = (Φ  Φ)−1 Φ  y.

.

(3.24)

This closed form solution requires the matrix inversion of .Φ  Φ. Computing this solution directly should be avoided in practice because of numerical issues encountered when ill conditioned matrices are involved (Golub and Van Loan, 2013, Chapter 2), (Björck, 1996, Chapter 2). This is for instance the case when .Φ is a Vandermonde matrix (Walter, 2014, Chapter 5) that appears when trends are estimated with high order polynomial functions (see also Sect. 3.1.5 for a detailed discussion on such an issue). The next paragraphs will thus introduce numerical solutions to bypass these numerical problems.

3.1.3 Numerical Solution with a QR Factorization The first solution for computing .θˆ consists in resorting to the QR factorization of .Φ. More specifically (Meyer, 2000, Chapter 5), Definition 3.1 Every matrix .A ∈ Rm×n with .m ≥ n and linearly independent columns can be uniquely factored as A = QR,

.

(3.25)

with .Q ∈ Rm×n such that .Q Q = I n×n , whereas .R ∈ Rn×n is an upper triangular matrix with strictly positive diagonal entries. Furthermore, the columns of .Q form an orthonormal basis of .range(A).

38

3 Trend and Seasonality Model Learning with Least Squares

By using this tool with .Φ ∈ RN ×nθ , i.e., .Φ = QR with .Q ∈ RN ×nθ and .R ∈ Rnθ ×nθ , we have Φ  Φ = R  R,

(3.26)

(Φ  Φ)−1 Φ  = (R  R)−1 R  Q = R −1 Q

(3.27)

.

whereas .

because .R is square and invertible. Solving the normal equations .Φ  Φθ = Φ  y thus boils down to solve Rθ = Q y = b,

.

(3.28)

i.e., to solve the triangular equations .Rθ = b by using a back substitution technique for instance (Meyer, 2000, Chapter 3). The determination of .θˆ is thus made without any direct matrix inversion but with the inversion of some real and nonzero coefficients only. Once .θˆ is available, we easily see that yˆ = QQ y.

.

(3.29)

Illustration 3.1 In order to illustrate the use of this QR factorization based solution, let us consider the 49 samples given in Fig. 3.1 and consisting of the life expectancy of French women between 1968 and 2016. A quick look at this curve clearly shows that a low order polynomial function should be sufficient to mimic the trend of this time series. By considering a second order polynomial function, a time dependent Vandermonde matrix .Φ of size .49 × 3 can be constructed easily, whereas the vector .y involved in the cost function .V (θ ) is nothing but the 49 data samples of the initial time series. By using these vectors and matrices with the QR factorization introduced previously,    a vector .θˆ = −7.3668 × 103 7.2595 −1.7673 × 10−3 can be estimated straightforwardly (see Fig. 3.2 for an illustration of the model mimicking capabilities).

Illustration 3.2 Let us now consider a more turbulent time series. More specifically, let us tackle the problem of trend and seasonal pattern estimation of the monthly mean carbon dioxide concentration measured at Mauna Loa Observatory, Hawaii, USA, between 1959 and 1991 (see Fig. 2.6 or Fig. 3.3). As shown in Illustration 2.5, the SSA decomposition of this time series (see (continued)

3.1 Linear Least Squares Solutions

39

86

84

82

80

78

76

1970

1975

1980

1985

1990

1995

2000

2005

2010

2015

Fig. 3.1 Female life expectancy in France 86 84 82 80 78 76 1970

1975

1980

1985

1990

1995

2000

2005

2010

2015

1970

1975

1980

1985

1990

1995

2000

2005

2010

2015

0.2 0 -0.2 -0.4

Fig. 3.2 Initial and reconstructed linear least squares time series (top) with residuals (bottom). Female life expectancy in France

40

3 Trend and Seasonality Model Learning with Least Squares

360 355 350 345 340 335 330 325 320 315 310 1960

1965

1970

1975

1980

1985

1990

Fig. 3.3 Monthly mean carbon dioxide concentration measured at Mauna Loa Observatory, Hawaii

Illustration 3.2 (continued) Fig. 2.9) leads us to select a polynomial function of low order to describe the main trend of the data associated with a Fourier series to mimic its seasonality, i.e., a model of the form

f (t, θ) = a0 +

3 

.

aj cos(j ωp t)

j =1

+

3  j =1

bj sin(j ωp t) + c1 t + c2 t 2 , ωp =

π , 6Ts

(3.30)

if a second order polynomial function and three cosine and sine couples are considered to be sufficient to encompass the trend and seasonality of this time series. By considering the aforementioned .f (t, θ ), a matrix .Φ can be constructed by merging a Vandermonde matrix of size .384 × 3 and a matrix of size .384 × 12 made of .cos(j ωp ti ) and .sin(j ωp ti ) for .j ∈ {1, · · · , 3} and .ti = iTs , .i ∈ {0, · · · , 383}. The linear least squares estimates of the parameters .ai , .i ∈ {0, · · · , 3}, .bj , .j ∈ {1, · · · , 3}, and .ck , .k ∈ {1, 2}, are gathered in Table 3.1, whereas the corresponding model response .yˆ is drawn in Fig. 3.5 (see also Fig. 3.4 for a focus on the SSA component modeling). These curves illustrate the efficiency of the QR based linear least squares solution for trend and seasonality estimation (Fig. 3.5).

3.1 Linear Least Squares Solutions

41

Table 3.1 Estimated parameters of the model given in Eq. (3.30) .a ˆ0

4 .7.2427 × 10

ˆ2 .b

.a ˆ2

.a ˆ3

.−1.7011

−1 .7.4518 × 10

−3 .2.6789 × 10

ˆ3 .b

.9.4022

× 10−3

ˆ1 .b

.a ˆ1

.−7.2878

.cˆ1

× 10−4

.−7.4254

.2.1061

.cˆ2

× 101

.1.9113

× 10−2

350 340 330 320 1960

1965

1970

1975

1980

1985

1990

1960

1965

1970

1975

1980

1985

1990

1960

1965

1970

1975

1980

1985

1990

1960

1965

1970

1975

1980

1985

1990

0.5 0 -0.5

2 0 -2 -4

0.6 0.4 0.2 0 -0.2 -0.4

Fig. 3.4 SSA components (trend at the top, seasonality at the bottom) and corresponding estimated linear least squares models. CO.2 concentration at Mauna Loa Observatory, Hawaii

42

3 Trend and Seasonality Model Learning with Least Squares

360 350 340 330 320 310 1960

1965

1970

1975

1980

1985

1990

1960

1965

1970

1975

1980

1985

1990

1

0

-1

Fig. 3.5 Initial and reconstructed linear least squares time series (top) with residuals (bottom). CO.2 concentration at Mauna Loa Observatory, Hawaii

3.1.4 Numerical Solution with a Singular Value Decomposition In Chap. 2, the singular value decomposition has been introduced to decompose a Hankel matrix into rank 1 components. Here, the same tool is used but with a different objective in mind: the inversion of a matrix and the numerical analysis of the linear least squares solution. More precisely, let us consider the matrix .Φ ∈ RN ×nθ and, more specifically, its SVD. Then, we know2 that orthonormal matrices N ×N and .V ∈ Rnθ ×nθ as well as a block diagonal matrix .Σ ∈ RN ×nθ exist .U ∈ R such that (Meyer, 2000, Chapter 5) Φ = U ΣV  , .

.

(3.31a)

U U  = U  U = I N ×N , i.e., U −1 = U  , .

(3.31b)

V V  = V  V = I nθ ×nθ , i.e., V −1 = V  .

(3.31c)

Thus, straightforwardly, Φ  Φ = V Σ  ΣV 

.

2 Please

(3.32)

pay attention to the dimensions and properties of the matrices .U and .V used hereafter and compare them with those introduced in Chap. 2. For more details, see, e.g., (Zhang, 2017, Chapter 5).

3.1 Linear Least Squares Solutions

43

and (Φ  Φ)−1 = V (Σ  Σ)−1 V  ,

(3.33)

(Φ  Φ)−1 Φ  = V (Σ  Σ)−1 Σ  U  .

(3.34)

.

whereas .

−1  The matrix product . Σ  Σ Σ ∈ Rnθ ×N is called the pseudo-inverse of .Σ and is denoted by .Σ † (Meyer, 2000, Chapter 5). It is formed by replacing every nonzero diagonal entry of .Σ by its reciprocal and then transposing the resulting matrix. By using the notation .Σ † , we get θˆ = V Σ † U  y.

.

(3.35)

This linear least squares solution is generated by resorting to matrix transpositions and inversions of .nθ nonzero real values only. It is interesting to notice that, by focusing on the columns of .U and .V , when .rank(Φ) = nθ , we can write that ˆ= .θ

nθ  u y i

vi ,

(3.36)

(u i y)ui ,

(3.37)

i=1

σi

whereas yˆ =

nθ 

.

i=1

where .ui (resp., .v i ), .i ∈ {1, · · · , nθ }, stands for the ith left (resp., right) singular vector of .Φ, whereas .σi , .i ∈ {1, · · · , nθ }, is its ith singular value. Thus: u y

• The scalars . σi i , .i ∈ {1, · · · , nθ }, are the coordinates of the linear least squares 

 solution .θˆ in .range v 1 · · · v n , i.e., .θˆ is split up into .nθ mutually orthogonal θ



u y vectors . σi i v i , .i ∈ {1, · · · , nθ }. The scalars .u i ∈ {1, · · · , nθ }, i y, .  

are the coordinates of the optimal approximation .yˆ in .range u1 · · · unθ , i.e., .yˆ is split up into .nθ mutually orthogonal vectors .(u i y)ui , .i ∈ {1, · · · , nθ }.

These observations illustrate the fact that, from a numerical stability viewpoint, using the SVD is the best way to compute .θˆ (Björck, 1996, Chapter 2), (Golub and Van Loan, 2013, Chapter 5).

44

3 Trend and Seasonality Model Learning with Least Squares

Illustration 3.3 In Chap. 2, it has been shown that a nonparametric model, i.e., a signal .(ytssa )t∈T , made of N samples and containing the deterministic components of a time series can be reconstructed by using the basic SSA algorithm. By construction, this signal .(ytssa )t∈T should satisfy a recurrent equation, i.e., a nonzero vector .r ∈ R+1 , . ∈ N∗ being a user defined index, should exist such that ⎤ ytssa ⎢ y ssa ⎥  ⎢ t−1 ⎥ .r ⎢ . ⎥ = 0. ⎣ .. ⎦ ⎡

(3.38)

ssa . yt−

Written differently, a parameter vector .θ ∈ R×1 exists such that ssa ssa ytssa = θ1 yt−1 + · · · + θ yt− .

.

(3.39)

Such a recurrent equation is known in the literature as an autoregressive model (Box et al., 2016, Chapter 1), and thanks to the access to a realization ssa ) ×1 can be ˆ .(yt t∈T made of N data points, an estimated vector .θ ∈ R generated by minimizing with respect to .θ y ssa − Φθ22 ,

.

(3.40)

where the vector .y ssa and the matrix .Φ satisfy   ssa y ssa = yssa · · · yN ∈ R(N −)×1 , . −1

.

ssa · · · y ssa ⎤ y−1 0 ⎢ .. ⎥ ∈ R(N −)× . Φ = ⎣ ... . . . . ⎦ ssa ssa yN −2 · · · yN −1−

(3.41a)



(3.41b)

This problem is nothing but a linear least squares approximation problem with a regression matrix .Φ different from the ones considered until now. Said differently, as done before, once the time series .(ytssa )t∈T is available, the vector .y ssa and the matrix .Φ can be constructed easily from which .θˆ can be generated by resorting, e.g., to the SVD based solution introduced previously. Running this algorithm with the reconstructed SSA signal generated in Illustration 2.5 leads to the estimated parameters gathered in Table 3.2 as well as the model data points drawn in Fig. 3.6. This linear least squares solution proves that the nonparametric SSA solution can, in the end, be described with a parsimonious linear parametric model straightforwardly.

3.1 Linear Least Squares Solutions

45

Table 3.2 Estimated parameters of the model given in Eq. (3.39) .θˆ1

.θˆ2

.−3.1353

.4.4353

.θˆ3

.θˆ4

.−3.3707

.1.0873

.θˆ6

.θˆ5 .−5.1229

× 10−2

.1.3558

.θˆ9 −2 .−6.0409 × 10

.θˆ7

× 10−1

.θˆ10 −1 .3.4073 × 10

.θˆ8

.−2.2576

× 10−1

.θˆ11 −1 .−4.4080 × 10

× 10−2 ˆ .θ12 −1 .2.1924 × 10 .−3.6422

360 350 340 330 320 310 1960

1965

1970

1975

1980

1985

1990

1960

1965

1970

1975

1980

1985

1990

0.5

0

-0.5

Fig. 3.6 SSA and reconstructed autoregressive linear least squares time series (top) with residuals (bottom). CO.2 concentration at Mauna Loa Observatory, Hawaii

3.1.5 Linear Least Squares and Condition Number Whatever the approach, the numerical solutions introduced so far require the inversion of numerical values. As shown, e.g., in Eq. (3.36), these specific values are the singular values of .Φ when the SVD is used for linear least squares parameters estimation. Equation (3.36) clearly shows that the singular values play a central role ˆ Indeed, when the singular in the sensitivity of the linear least squares solution .θ. values have huge magnitude differences, i.e., when the “small” singular values have tiny magnitudes when compared with the “large” ones, Eq. (3.36) clearly shows that small errors in .y or .Φ, i.e., in .y, .U , or .V , can be magnified when it comes to determine .θˆ . The condition number (Meyer, 2000, Chapter 3) is the standard way to quantify this singular values magnitude difference via the following ratio: cond(Φ) =

.

σmax (Φ) , σmin (Φ)

(3.42)

46

3 Trend and Seasonality Model Learning with Least Squares

i.e., the ratio of the largest and smallest singular values of .Φ, respectively. By construction, the closer to one .cond(Φ) is, the better for the linear least squares solution. Said differently, favoring well-conditioned observation matrices .Φ is of prime interest for the user. The problem of ill conditioned matrices clearly appears when trend patterns are approximated with high order polynomial functions. Indeed, as shown with Eq. (3.2), the columns of .Φ are made of monomials .t k , .k ∈ N∗ , and the corresponding observation matrix .Φ is nothing but a Vandermonde matrix. As proved in (Pan, 2016), “any Vandermonde matrix of a large size is badly ill conditioned unless its knots are more or less equally spaced on or about the circle .C(0, 1) = {x : |x| = 1}.” Because, in time series modeling, this constraint on the knots location is, in general, not valid, most of the linear least squares problems involving Vandermonde matrices of order 10 or more are ill conditioned by construction. In order to bypass this difficulty, Legendre polynomial functions (Kennedy and Sadeghi, 2013, Chapter 2) can be introduced instead. More specifically, given: • The Hilbert space .L2 ([a, b], R) made up of functions .f : [a, b] → R that are Lebesgue integrable (Bronshtein et al., 2015, Chapter 8), i.e., such that .|f |2 is integrable • The inner product (Kennedy and Sadeghi, 2013, Chapter 2) 

b

f, g =

.

f (t)g(t)dt with f and g ∈ L2 ([a, b], R)

(3.43)

a

• The induced norm 

b

f 2 =

.

1/2 |f (t)| dt

with f ∈ L2 ([a, b], R)

2

(3.44)

a

the Legendre polynomials .(pn (t))n∈N defined for .a = −1 and .b = +1 as follows (Kennedy and Sadeghi, 2013, Chapter 2): pn (t) =

.

1 dn 2 (x − 1)n , n ∈ N∗ , 2n n! dt n

(3.45)

with .p0 (t) = 1 form an orthogonal basis of the Hilbert space .L2 ([−1, 1], R) (Kennedy and Sadeghi, 2013, Chapter 2). Inspired by this orthogonality property satisfied by the Legendre polynomials, a first idea suggested for improving the condition number of .Φ consists in substituting the monomials .t k , .k ∈ N∗ , with the components .pk (t), .k ∈ N∗ , defined previously. As shown, e.g., in (Kennedy and Sadeghi, 2013, Chapter 2), we can indeed prove that p1 (t) = t, .

.

p2 (t) =

1 2 (3t − 1), . 2

(3.46a) (3.46b)

3.1 Linear Least Squares Solutions

47

1 3 (5t − t), . 2 35 4 15 2 3 p4 (t) = t − t + , 8 4 8 p3 (t) =

(3.46c) (3.46d)

or, in a compact way, .

p0 (t) = 1, .

(3.47a)

p1 (t) = t, .

(3.47b)

pk+1 (t) = t

k 2k + 1 pk (t) − pk−1 (t), k ∈ N∗ . k+1 k+1

(3.47c)

Of course, constructing the observation matrix .Φ from .pk (t), .k ∈ N∗ , instead of the monomials .t k , .k ∈ N∗ , will not guarantee that the condition number of .Φ is unitary (as it would be if the inner product defined in Eq. (3.43) was used instead of the Euclidean inner product involved in the normal equations (3.22)), but, as shown hereafter, its condition number can be reduced drastically.

Illustration 3.4 Let us consider the time series given in Fig. 3.7 and made of 139 samples, i.e., hourly temperature readings from a ceramic furnace (Bisgaard and Kulahci, 2011, Appendix A). Let us furthermore assume that our goal is to describe the main trend of this time series with a single polynomial function. Then, because of the hectic variations of this time series, high order polynomial functions are required to be able to mimic all the dynamics of this data set. More specifically, a polynomial function of order 10 is suggested in this illustrative example. Then, by considering the direct use of a Vandermonde matrix for the observation matrix .Φ, we get .cond(Φ) = 4.14 × 1022 , whereas .cond(Φ) = 2.58 when Legendre polynomials are involved. This drastic improvement of the condition number makes the modeling easier from a numerical viewpoint, thus leading to the trend model given in Fig. 3.8. Notice that these good results are obtained after transforming the time samples to fit the range .[−1, 1], i.e., by using the following transformation: t˜ =

.

2t − min(t) − max(t) . max(t) − min(t)

(3.48)

Remark 3.2 It is important to notice that using a single polynomial function of relatively high order to mimic the behavior of the time series given in Fig. 3.7 was suggested for illustrating condition number issues only. It is clear that, in practice, it would be more efficient to split up this time series into small parts where low order

48

3 Trend and Seasonality Model Learning with Least Squares

840

835

830

825

820

815 20

40

60

80

100

120

80

100

120

Fig. 3.7 Hourly temperatures of a ceramic furnace

840

835

830

825

820

815 20

40

60

Fig. 3.8 Hourly ceramic furnace data and trend model time series. Estimation with Legendre polynomial functions

polynomial functions are tuned to describe each local behavior. On top of that, as explained in Sect. 3.3, there is no guarantee that the best choice for the polynomial function order is 10. Again, this value was chosen arbitrarily in order to illustrate the effect of high order polynomials on the condition number of .Φ. When tuning the components of .Φ by selecting appropriate functions .φj (t), j ∈ {1, · · · , nθ }, is complicated in practice, another standard solution to decrease the condition number of .Φ consists in resorting to regularization (Golub and Van

.

3.1 Linear Least Squares Solutions

49

Loan, 2013, Chapter 6). Regularized least squares solutions indeed aim at penalizing the cost function .V (θ ) by introducing a constraint on the 2-norm of .θ, i.e., by considering the ridge regression cost function (Theodoridis, 2015, Chapter 3) Vr (θ ) = V (θ) + ηθ22 ,

.

(3.49)

where .η > 0 is the regularization parameter to be selected by the user (see Sect. 3.3 for a solution). The term .θ  θ is thus used in .Vr (θ) to penalize models for which the estimated coefficients are too large. Therefore, the ridge regression favors models with low complexity and avoids overfitting (Boyd and Vandenberghe, 2018, Chapter 15). By observing that this cost function can be written as follows (Boyd and Vandenberghe, 2018, Chapter 15): √   2  ηI n ×n 0nθ ×1  θ θ  ,  θ− .Vr (θ ) =  Φ y 2

(3.50)

√  it is guaranteed that . ηI nθ ×nθ Φ  has full column rank once .η > 0, and thus, the regularized linear least squares problem has a unique solution (Björck, 1996, Chapter 2). On top of that, the corresponding normal equations satisfy √  √  √    ηI nθ ×nθ ηI nθ ×nθ ηI nθ ×nθ 0nθ ×1 . θ= , Φ Φ Φ y

(3.51)

i.e., (Φ  Φ + ηI nθ ×nθ )θ = Φ  y.

.

(3.52)

The ridge linear least squares solution is thus the solution of normal equations with the property that .Φ  Φ + ηI nθ ×nθ has always full rank. Last but not least, it can be proved (Björck, 1996, Chapter 2) that this solution is better conditioned than the ordinary linear least squares solution. Indeed, by resorting to the SVD of .Φ, we can write that (Σ  Σ + ηI nθ ×nθ )ϑ = Σ  U  y,

.

(3.53)

with .ϑ = V  θ . Then, the analysis ran for the ordinary linear least squares solution can be extended to the ridge linear least squares solution straightforwardly, thus leading to the estimate ˆ = .ϑ

nθ  i=1

σi

u i y σi2 + η

,

(3.54)

50

3 Trend and Seasonality Model Learning with Least Squares

whereas θˆ =

nθ 

σi

.

i=1

u i y σi2



vi =

nθ  i=1

βi

u i y vi , σi

(3.55)

with the weighting factors βi =

.

σi2 σi2 + η

, i ∈ {1, · · · , nθ }.

(3.56)

Because of these weighting factors, the ridge linear least squares solution is biased3 (when compared with the ordinary linear least squares solution) but, thanks to the introduction of the parameter .η, it can be ensured that this problem is better conditioned than the ordinary one. Indeed, cond(Φ  Φ + ηI nθ ×nθ ) =

.

2 (Φ) η + σmax , 2 (Φ) η + σmin

(3.57)

whereas cond(Φ  Φ) =

.

2 (Φ) σmax . 2 (Φ) σmin

(3.58)

Said differently, because of the decrease of .cond(Φ  Φ) in .η, regularization positively impacts the condition number of the corresponding linear least squares problem. Remark 3.3 In the list of unknown parameters considered in the trend and seasonality models, the first ones, i.e., .a0 and .c0 in Eqs. (3.2) and (3.4), respectively, are nothing but the intercepts of the curves fitting the data. In order to avoid the procedure depending on the origin chosen for .(yt )t∈T , the intercept must be left out of the penalty term, i.e., the following cost function must be used instead: Vr (θ) =

N −1 

.

yi − θ0 −

i=0

nθ  j =2

φj i θj 22



nθ 

θj2 ,

(3.59)

j =1

when .φ1i = 1 for any .i ∈ {0, · · · , N − 1}. This practical constraint also explains the reason why using centered data is of prime interest for linear least squares estimation. See, e.g., (Hastie et al., 2009, Chapter 3) for more details about this algorithmic measure.

3 See

Chap. 4.4.1 for a proof of this claim.

3.2 A Digression to Nonlinear Least Squares

51

840

835

830

825

820

815 20

40

60

80

100

120

Fig. 3.9 Hourly ceramic furnace data and trend model time series. Estimation with regularization

Illustration 3.5 Let us consider again the time series given in Fig. 3.7, but, this time, let us analyze the effect of regularization on the condition number of this least squares problem. By using for .Φ a Vandermonde matrix of order 10 and by selecting .η = 1×1010 , the condition number of .Φ  Φ +ηI nθ ×nθ drops to .6.6 × 1016 , which illustrates the positive impact of regularization on the problem numerical conditioning. Unfortunately, as illustrated in Fig. 3.9, this 5 decades decrease of the condition number also reduces the model mimicking abilities when compared with the former solution. Keep indeed in mind that the regularized linear least squares solution is biased.

3.2 A Digression to Nonlinear Least Squares When data sets such as the monthly flows of global air travel passengers are studied (see Fig. 1.4), linear models may not be sufficient to describe the dynamics of the deterministic components governing the time series behavior. This specific time series indeed has an upward trend that cannot be modeled by a low order polynomial function only. Therefore, using models nonlinear with respect to the unknown parameters is a standard but efficient solution to give access to parsimonious model representations for such nonlinear dynamics. In this section, a specific attention is thus paid to general smooth mathematical functions .f (ti , θ ), .i ∈ {0, · · · , N − 1}, which are, this time, not linear with respect to .θ.

52

3 Trend and Seasonality Model Learning with Least Squares

One of the main problems when nonlinear models are handled is the absence of any analytical solutions most of the time (Rao, 2009, Chapter 6). Thus, heuristic developments must be introduced to yield numerical solutions with the smallest possible residual norm. On top of that, in addition to the absence of analytical solutions, the second difficulty of nonlinear least squares is the presence of possible local minima for V(θ) (Rao, 2009, Chapter 6). Thus, most of the numerical optimization tools used for nonlinear least squares can only guarantee the convergence toward a local minimum. When these nonlinear optimization algorithms are iterative, the convergence toward the global optimum can be ensured only if the initial guess is chosen in the vicinity of the global minimum. The basic idea behind the standard numerical optimization algorithms for nonlinear least squares problem minimization is the use of a local linearization of the cost function around the current value of θ. In this book, we consider linear approximations only⁴ in order to determine the local structure of an iterative solution for the estimation of θ. As shown hereafter, these approximations are used to get a quadratic cost function in θ locally (like the ordinary linear least squares) and thus convexify the cost function locally. The starting point of the solutions developed herein is again the aforementioned cost function V(θ), more specifically its compact description

V(θ) = ‖y − f(t, θ)‖₂².    (3.60)

⁴ For other efficient but more complex solutions, see, e.g., (Nocedal and Wright, 2006).

In a small neighborhood of the optimum θ*, we can write that (Verhaegen and Verdult, 2007, Chapter 7)

f(t, θ) ≈ f(t, θ*) + F|_{θ*}(θ − θ*),    (3.61)

where the Jacobian matrix F|_{θ*} is defined as follows (Verhaegen and Verdult, 2007, Chapter 7):

F|_{θ*} = [ ∂f(0, θ)/∂θ₁      ···   ∂f(0, θ)/∂θ_{n_θ}
            ⋮                  ⋱     ⋮
            ∂f(N−1, θ)/∂θ₁    ···   ∂f(N−1, θ)/∂θ_{n_θ} ] |_{θ=θ*}  ∈ R^{N×n_θ}.    (3.62)

Thus, locally, the cost function V(θ) satisfies

V(θ) ≈ ‖y − f(t, θ*) − F|_{θ*}(θ − θ*)‖₂² = ‖z − F|_{θ*}ϑ‖₂²,    (3.63)

with z = y − f(t, θ*) and ϑ = θ − θ*. By using the similarity between this locally approximated cost function and the criterion encountered when linear least squares were studied, by assuming also that the Jacobian matrix F|_{θ*} has full column rank,


the minimum argument of this locally approximated cost function is nothing but the solution of the following normal equations:

F|_{θ*}⊤ F|_{θ*} (θ − θ*) = F|_{θ*}⊤ z.    (3.64)

The “basic” Gauss–Newton solutions (Nocedal and Wright, 2006, Chapter 10) have at their source these locally linearized equations. Indeed, they all estimate θ by using the iterative procedure

θ̂^{j+1} = θ̂^j + (F|_{θ̂^j}⊤ F|_{θ̂^j})⁻¹ F|_{θ̂^j}⊤ (y − f(t, θ̂^j)),    (3.65)

with θ̂^0 the starting point of the optimization process and θ̂^j, j ∈ N*, any intermediate estimate obtained during the search for the minimum of V(θ).
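A bare-bones implementation of the Gauss–Newton update of Eq. (3.65) can be sketched as follows in NumPy; f and jac are user supplied callables (here illustrated on a hypothetical exponential decay model), and no step control or stopping test is included.

```python
import numpy as np

def gauss_newton(f, jac, y, theta0, n_iter=20):
    """Basic Gauss-Newton iterations for min_theta ||y - f(theta)||^2.

    f(theta)   : model outputs, shape (N,)
    jac(theta) : Jacobian of f at theta, shape (N, n_theta)
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        r = y - f(theta)                      # current residual
        F = jac(theta)                        # Jacobian at the current estimate
        # Solve the locally linearized normal equations of Eq. (3.64)
        step = np.linalg.solve(F.T @ F, F.T @ r)
        theta = theta + step
    return theta

# Illustrative use: fit y ~ a * exp(b * t) from noisy samples
rng = np.random.default_rng(1)
t = np.linspace(0.0, 4.0, 60)
y = 2.0 * np.exp(-0.7 * t) + rng.normal(scale=0.01, size=t.size)

f = lambda th: th[0] * np.exp(th[1] * t)
jac = lambda th: np.column_stack((np.exp(th[1] * t), th[0] * t * np.exp(th[1] * t)))
print(gauss_newton(f, jac, y, theta0=[1.0, -1.0]))
```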

Remark 3.4 It is interesting to notice that this nonlinear optimization problem has the following first order optimality necessary condition (Rao, 2009, Chapter 2):

Theorem 3.2 If θ* minimizes

V(θ) = ‖y − f(t, θ)‖₂²,

then θ* is a critical point of V(θ), i.e.,

F|_{θ*}⊤ (y − f(t, θ*)) = 0.    (3.66)

Proof See (Rao, 2009, Chapter 2). □

This last equation is nothing but a local version of the orthogonality conditions used in Eq. (3.20) to determine the normal equations when ordinary linear least squares problems come into play. Unfortunately, for general functions f(t, θ):

• The former condition is not sufficient.
• There is no necessary and sufficient condition.

Thus, most of the iterative algorithms introduced in the literature use the first order optimality necessary condition only and then check if the resulting estimate is the sought optimum (Rao, 2009).

The numerical optimization algorithm introduced so far is valid if the former Taylor series is valid, i.e., if, at each iteration, θ̂^{j+1} and θ̂^j are not too far from each other. In order to guarantee that, at each iteration, ‖θ̂^{j+1} − θ̂^j‖₂² is small enough, regularization can be used (Björck, 1996, Chapter 2). More specifically, a regularized Gauss–Newton method (Nocedal and Wright, 2006, Chapter 10)


consisting in resorting to regularization as suggested in Sect. 3.1.5, i.e., minimizing with respect to θ

V_r(θ) = ‖y − f(t, θ̂^j) − F|_{θ̂^j}(θ − θ̂^j)‖₂² + η‖θ − θ̂^j‖₂²,    (3.67)

can be introduced to get around the former shortcoming. In the cost function V_r(θ), the strictly positive regularization parameter η is introduced to penalize estimates that move too far from the former one, i.e., solutions for which the affine approximation given in Eq. (3.61) is not valid anymore. As shown, e.g., in (Björck, 1996, Chapter 2), minimizing with respect to θ the cost function V_r(θ) is equivalent to minimizing

‖ [−√η I_{n_θ×n_θ} ; F|_{θ̂^j}] (θ − θ̂^j) − [0_{n_θ×1} ; y − f(t, θ̂^j)] ‖₂²,

where the two blocks inside each bracket are stacked vertically.

The resulting iterative procedure is the famous Levenberg–Marquardt algorithm (Nocedal and Wright, 2006, Chapter 10) that can be written as follows:

θ̂^{j+1} = θ̂^j + (η I_{n_θ×n_θ} + F|_{θ̂^j}⊤ F|_{θ̂^j})⁻¹ F|_{θ̂^j}⊤ (y − f(t, θ̂^j)).    (3.68)

The introduction of the parameter η ∈ R*₊ guarantees that:

• [−√η I_{n_θ×n_θ} ; F|_{θ̂^j}] has full column rank.
• The Levenberg–Marquardt solution is better conditioned than the Gauss–Newton one.
• θ̂^{j+1} is close to θ̂^j.

As shown, e.g., in (Gavin, 2020), updating η at each iteration can help ensure an efficient convergence toward a local minimum (see also Rao, 2009, Chapter 6). As a general rule, when nonpathological problems come into play, the parameter η is initialized with a large value so that the Levenberg–Marquardt algorithm coincides with the gradient descent algorithm (Nocedal and Wright, 2006, Chapter 5), i.e., the first updates are small steps in the steepest descent direction, whereas η is decreased when the solution improves, thus leading to a Gauss–Newton update when the algorithm estimate is close to a local minimum.
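The damped update of Eq. (3.68), together with the usual increase/decrease heuristic for η, can be sketched as follows; the update factors 10 and 0.1, the initial η, and the fixed iteration budget are common but arbitrary choices, not values prescribed by the text.

```python
import numpy as np

def levenberg_marquardt(f, jac, y, theta0, eta=1e3, n_iter=50):
    """Levenberg-Marquardt iterations for min_theta ||y - f(theta)||^2."""
    theta = np.asarray(theta0, dtype=float)
    cost = np.sum((y - f(theta)) ** 2)
    for _ in range(n_iter):
        r = y - f(theta)
        F = jac(theta)
        # Damped normal equations of Eq. (3.68)
        step = np.linalg.solve(eta * np.eye(theta.size) + F.T @ F, F.T @ r)
        candidate = theta + step
        new_cost = np.sum((y - f(candidate)) ** 2)
        if new_cost < cost:      # improvement: accept and behave more like Gauss-Newton
            theta, cost = candidate, new_cost
            eta *= 0.1
        else:                    # no improvement: reject and take a smaller, steeper step
            eta *= 10.0
    return theta
```

With the same f and jac callables as in the Gauss–Newton sketch above, this routine typically behaves like a gradient descent at the beginning and like a Gauss–Newton iteration near a (local) minimum.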

Illustration 3.6 Let us consider the classic Box and Jenkins airline data (Box et al., 2016, Series G) of Fig. 3.10, i.e., the monthly totals of international airline passengers between January 1949 and December 1960. Made of 144 samples, this airline passenger time series is known for showing a large number of nonlinear features (Box et al., 2016). This is the main reason why this time series is selected herein to illustrate the efficiency of nonlinear least


Fig. 3.10 Monthly worldwide airline passengers between January 1949 and December 1960


Fig. 3.11 Leading singular values of Y₁₂. Monthly totals of international airline passengers

squares solutions for nonlinear trend modeling. In order to emphasize these nonlinear components, let us start with the use of SSA for an efficient principal element extraction. By following the procedure introduced in Chap. 2, with a window length of 12 because of the yearly seasonality of the time series, the leading singular values and right singular vectors of Y₁₂ given in Figs. 3.11 and 3.12, respectively, lead us to consider the first five principal components only


to describe the deterministic time series dynamics accurately (see Fig. 3.13 for an illustration of this modeling accuracy). Whereas the main trend of this data set given at the top of Fig. 3.14 can be easily described with a polynomial function of low order (a third order polynomial function for instance, as shown in Fig. 3.15), thus by resorting to a linear least squares technique (see Table 3.3 for the corresponding estimated parameters), the other two main components drawn in Fig. 3.14 require a more complex model description because of their exponentially growing time evolution. Such a transient behavior can be parametrically described by a sum of products of exponential functions, polynomial functions, and harmonics, i.e.,

s_t = ∑_{k=1}^{q} a_k(t) e^{λ_k t} sin(ω_k t + ψ_k),    (3.69)

where a_k(t) are polynomial functions, whereas λ_k, ω_k, and ψ_k, k ∈ {1, · · · , q}, are arbitrary parameters (Hassani et al., 2010). For our model learning problem, the following parametric model (corresponding to the impulse response of a second order system, Oppenheim et al., 2014, Chapter 6)

s_{k,t} = a_k (ω_k / √(1 − ξ_k²)) e^{−ξ_k ω_k t} sin(ω_k √(1 − ξ_k²) t + ψ_k),  k ∈ {1, 2},    (3.70)

could be more specifically suggested to mimic the oscillatory behaviors given in Fig. 3.14 (bottom curves) by determining the amplitudes a_k, the natural frequencies ω_k, the damping ratios ξ_k, and the phase shifts ψ_k, k ∈ {1, 2}, from the available data points. Optimal values (in the least squares sense) of a_k, ω_k, ξ_k, and ψ_k, k ∈ {1, 2}, can be more precisely determined by minimizing the aforementioned cost function V_r(θ_k) with, for each k ∈ {1, 2}, θ_k⊤ = [a_k ω_k ξ_k ψ_k], via a nonlinear least squares optimization algorithm. The use of a nonlinear least squares solution is justified because, as clearly shown in the model description given in Eq. (3.70), the cost function V_r(θ) is by construction not linear with respect to θ. Starting from initial guesses generated by a simple graphical analysis of the curves in Fig. 3.14, then by using the Levenberg–Marquardt algorithm described previously (see, e.g., (Gavin, 2020) for a practical implementation), the parameter estimates given in Tables 3.4 and 3.5, respectively, can be determined efficiently, from which simulated model responses can be produced as shown in Figs. 3.16 and 3.17, respectively. These curves clearly show that nonlinear least squares methods (such as the aforementioned Levenberg–Marquardt algorithm, Nocedal and Wright, 2006, Chapter 10) are reliable solutions for nonlinear model parameter estimation.
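As an aside, a model such as Eq. (3.70) can be fitted along these lines with any nonlinear least squares routine. The sketch below uses SciPy's least_squares on synthetic data standing in for an SSA seasonal component; the initial guess mimics the graphical-analysis values of Table 3.4, and the bounds are merely a safeguard keeping the damping ratio away from ±1. None of this reproduces the book's exact computations.

```python
import numpy as np
from scipy.optimize import least_squares

def damped_sine(theta, t):
    """Second order impulse response model of Eq. (3.70)."""
    a, xi, omega, psi = theta
    return (a * omega / np.sqrt(1.0 - xi**2)
            * np.exp(-xi * omega * t)
            * np.sin(omega * np.sqrt(1.0 - xi**2) * t + psi))

# Synthetic stand-in for an SSA seasonal component (the real one comes from Chap. 2)
rng = np.random.default_rng(2)
t = np.arange(144) / 12.0                         # monthly samples expressed in years
true_theta = np.array([2.5e3, -0.025, 2 * np.pi, -1.2])
s = damped_sine(true_theta, t) + rng.normal(scale=200.0, size=t.size)

# Initial guess obtained, as in the text, from a quick graphical analysis
theta_init = np.array([2e3, -0.05, 2 * np.pi, -np.pi / 3])
result = least_squares(lambda th: s - damped_sine(th, t), theta_init,
                       bounds=([0.0, -0.99, 0.1, -np.pi],
                               [np.inf, 0.99, 20.0, np.pi]))
print(result.x)
```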



Fig. 3.12 Leading right singular vectors of Y₁₂. Monthly totals of international airline passengers


Fig. 3.13 Initial and reconstructed SSA time series (top) with residuals (bottom). Monthly totals of international airline passengers

The algorithms introduced herein for nonlinear least squares problem minimization being iterative by construction, it is important to know when the search should be stopped. Three main criteria are usually suggested in practice (Dennis and Schnabel, 1996, Chapter 7):

• The absolute or relative change in the parameter estimates, e.g., ‖θ̂^{j+1} − θ̂^j‖₂


Fig. 3.14 SSA time series components: trend (top) and seasonalities (middle and bottom). Monthly totals of international airline passengers


Fig. 3.15 SSA main trend and reconstructed linear least squares time series (top) with residuals (bottom). Monthly totals of international airline passengers

• The absolute or relative change in the cost function, e.g., (V_r(θ̂^{j+1}) − V_r(θ̂^j))/V_r(θ̂^j)
• The number of iterations

Once selected, these criteria are compared with user defined thresholds, and in general, the minimization is stopped when at least one of these criteria is fulfilled.
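A small helper implementing these three checks, to be called at each iteration of loops such as the Gauss–Newton or Levenberg–Marquardt sketches given earlier, could look as follows; the threshold values are placeholders to be tuned by the user.

```python
import numpy as np

def should_stop(theta_new, theta_old, cost_new, cost_old, iteration,
                tol_theta=1e-6, tol_cost=1e-9, max_iter=100):
    """Return True if at least one of the usual stopping criteria is met."""
    small_step = (np.linalg.norm(theta_new - theta_old)
                  <= tol_theta * (1.0 + np.linalg.norm(theta_old)))
    small_decrease = abs(cost_new - cost_old) <= tol_cost * (1.0 + abs(cost_old))
    too_many_iterations = iteration >= max_iter
    return small_step or small_decrease or too_many_iterations
```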


Table 3.3 Estimated parameters of the polynomial model for the SSA trend of the air passengers time series

θ̂₀ = 1.2177 × 10¹¹    θ̂₁ = −1.8489 × 10⁸    θ̂₂ = 9.3549 × 10⁴    θ̂₃ = −15.7730

Table 3.4 Initial guesses (top) and estimated parameters (bottom) of the harmonic model given in Eq. (3.70) for the first SSA seasonal component of the air passengers time series

a₁⁰ = 2 × 10³        ξ₁⁰ = −0.05      ω₁⁰ = 2π         ψ₁⁰ = −π/3
â₁ = 2.5405 × 10³    ξ̂₁ = −0.0247    ω̂₁ = 6.2547     ψ̂₁ = −1.2311

Table 3.5 Initial guesses (top) and estimated parameters (bottom) of the harmonic model given in Eq. (3.70) for the second SSA seasonal component of the air passengers time series

a₂⁰ = 6 × 10²        ξ₂⁰ = −0.015     ω₂⁰ = 4π         ψ₂⁰ = π/6
â₂ = 8.073 × 10²     ξ̂₂ = −0.0098    ω̂₂ = 12.6180    ψ̂₂ = 0.5395


Fig. 3.16 First SSA seasonal component and reconstructed nonlinear least squares time series (top) with residuals (bottom). Monthly totals of international airline passengers


Fig. 3.17 Second SSA seasonal component and reconstructed nonlinear least squares time series (top) with residuals (bottom). Monthly totals of international airline passengers

For more details, please read (Dennis and Schnabel, 1996, Chapter 7) and the references therein.

3.3 Linear Model Complexity Selection and Validation

As shown through the different illustrations introduced so far, the efficiency of the least squares optimization algorithms introduced in this book has been mainly tested by comparing the raw data sets with the estimated model outputs graphically. This comparison has led to figures such as Fig. 3.8 or Fig. 3.16 for instance. The model mimicking abilities have thus been validated by using common sense, i.e., by claiming that, for a good model, the model output should resemble the measured time series. In addition to these curves, it could be interesting and often more convenient to quantify the model capabilities to mimic the real data. Some popular model quality measurements are the following (Ljung, 1999; Verhaegen and Verdult, 2007):

• The root mean squared error (RMSE)

RMSE = ‖y − ŷ‖₂ / √N,    (3.71)

with y ∈ R^{N×1} the vector of observed data and ŷ ∈ R^{N×1} the model output vector


• The Best FiT⁵ (BFT)

BFT = 100% · max(1 − ‖y − ŷ‖₂ / ‖y − ȳ‖₂, 0),    (3.72)

with

ȳ = [ȳ · · · ȳ]⊤ ∈ R^{N×1},    ȳ = (1/N) ∑_{i=0}^{N−1} y_i    (3.73)

• The variance accounted for (VAF)

VAF = 100% · max(1 − var(y − ŷ)/var(y), 0)    (3.74)
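Equations (3.71)–(3.74) translate directly into a few lines of NumPy; in the sketch below, y and y_hat are placeholders for the observed data and the model output.

```python
import numpy as np

def fit_metrics(y, y_hat):
    """Root mean squared error, best fit (BFT) and variance accounted for (VAF)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = y.size
    rmse = np.linalg.norm(y - y_hat) / np.sqrt(n)                      # Eq. (3.71)
    bft = 100.0 * max(1.0 - np.linalg.norm(y - y_hat)
                      / np.linalg.norm(y - y.mean()), 0.0)             # Eq. (3.72)
    vaf = 100.0 * max(1.0 - np.var(y - y_hat) / np.var(y), 0.0)        # Eq. (3.74)
    return rmse, bft, vaf
```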

For instance, for the polynomial function fitting problem considered in Illustration 3.1 with the French women life expectancy, these values are equal to 0.3456, 92.56%, and 99.45%, respectively, values that confirm the efficiency of a second order polynomial function to mimic the main trend of this time series.

In addition to helping the user measure the model quality, these fitting measurements can also be used for tuning the hyperparameters of the linear least squares solutions introduced so far. These tuning parameters are more precisely: (i) the model orders (the polynomial function orders or the number of sine and cosine functions to be used for the trend and seasonality pattern description) representing the model complexity and (ii) the regularization parameter η for the regularized linear least squares. These fitting measurements can indeed be used in a cross validation procedure (Hastie et al., 2009, Chapter 7) to quantify the estimated model efficiency and thus help the user make a fair and reliable decision for a good trade-off between model complexity and data fitting capability. Notice indeed that the goal of a model is usually not to have a perfect fit on the data used to learn the parameters but to be able to achieve a good fit on previously unseen data for prediction, monitoring, etc.

Remark 3.5 It is essential to notice that the solutions introduced in this section for model validation and model complexity selection are dedicated to linear models only, i.e., model structures parameterized as follows:

f(t, θ) = θ₀ + [φ₁(t) · · · φ_k(t)] [θ₁ · · · θ_k]⊤ = θ₀ + φ_t⊤ θ,    (3.75)

⁵ Notice that the ratio involved in the definition of the BFT is nothing but the ratio between ‖y − ŷ‖₂/√N and ‖y − ȳ‖₂/√N, i.e., the ratio of two RMSE.


where:

• φ_j(t), j ∈ {1, · · · , k}, are user defined smooth time dependent functions such as the aforementioned cosine, sine, or polynomial functions, whereas θ and the intercept θ₀ stand for the unknown parameters, respectively.
• The functions φ_j(t), j ∈ {1, · · · , k}, depend on the current value of t and not on past (or future) values of t.

This last constraint means that the time series we play with at this stage do not have any memory or, said differently, dependency among observations. Notice right now that this assumption will not be valid in Chap. 5 anymore.

In order to run such a generalization performance test or validation step under ideal conditions, the best would be to have access to two realizations (at least) of the time series to analyze. Unfortunately, most of the time, this is simply an empty promise because, e.g., we do not have two realizations of French women for running a validation test for our polynomial function fitting problem !-)). This is the main reason why resampling methods such as the “out of sample validation” or “K-fold” procedures (Boyd and Vandenberghe, 2018, Chapter 13) are often used when linear regressions come into play. In a nutshell, these procedures test the generalization ability of a linear model by dividing the initial data set into two or more subsets and then use some of these subsets (usually called the training set, Rogers and Girolami, 2017, Chapter 1) for learning or training the model, whereas the remaining subsets (usually called the validation or test set, Rogers and Girolami, 2017, Chapter 1) are used for testing the prediction capabilities of the model trained with the training set. More specifically (Hastie et al., 2009, Chapter 7):

• The out of sample validation procedure divides the initial data set into two sets (one containing usually 80% of the data set, the other made of 20% of it), the large one corresponding to the training set, whereas the rest is used as a test set (see Fig. 3.18).
• The K-fold validation procedure (see Fig. 3.19) divides the original data set into K sets (usually 5 or 10) and then uses each of them as a validation test for a model learned with the K − 1 remaining sets (procedure that leads to K model parameter sets in the end).

Once test and training data sets are available, the basic idea of the standard solutions in the literature for model complexity selection or model validation (Hastie et al., 2009, Chapter 7) consists then in comparing the training errors and the test errors, i.e., the errors between the data and the model outputs estimated with the training sets and test sets, respectively, and then favoring parsimonious models having a good generalization ability, i.e., test errors that are, at least, of equivalent magnitude to the learning errors. Whereas the out of sample validation procedure is often used for a first validation of the estimated model because of its low computational time, the second one is often favored to make a fair decision among different models, i.e., to help the user select the most parsimonious model with the best generalization ability.


Fig. 3.18 A schematic display of the out of sample validation procedure. The bottom plot splits the data set into three consecutive parts (training, test, validation) when the full data set is long enough

Fig. 3.19 A schematic display of a 5-fold validation procedure

Remark 3.6 When the available data set is long enough, it is often suggested to divide the time series into three connected parts: a training set, a validation set, and a test set (Hastie et al., 2009, Chapter 7). First, the training set is used to fit the models. Second, the test set is used for model selection. Last, the validation set is used for assessing the generalization performance of the model chosen with the test set. As suggested in (Hastie et al., 2009, Chapter 7), “A typical split might be 50% for training, and 25% for validation and 25% for testing” (see Fig. 3.18, bottom plot). In order to explain this model structure selection problem easily, let us test the “K-fold” procedure with real time series.

Illustration 3.7 Let us consider again the French women life expectancy time series introduced in Illustration 3.1. Whereas common sense [after a quick look at the time series plot (see again Fig. 3.1)] leads us to select a low order polynomial function to fit this data set, it is still important to know if a second order model is more reliable for generalization than a fourth order model for instance. By assuming that the final goal of this model is to make good predictions for the next 10 years, i.e., for unseen data, it is essential


that the selected model gives small RMSE or large BFT or VAF on test sets without having a prohibitive structural complexity. In order to select the model structure effectively, a standard solution consists in:

• Starting with a constant function
• Running for this model structure a K-fold validation procedure
• Computing for each of the K test sets the RMSE, BFT, or VAF fitting measurement
• Averaging these fitting measurements
• Increasing the model structure complexity (affine function, quadratic functions, etc.)
• Running the K-fold validation procedure and the former calculations for each model structure
• Plotting these averaged fitting measurements vs. the model complexity
• Selecting, in the end, the model with the lower complexity and the smallest (largest) average RMSE (BFT or VAF)

Such a procedure has been run with the aforementioned life expectancy data set. The resulting curves are given in Fig. 3.20. As expected, the RMSE computed with the training sets decreases as the polynomial order (hence the model complexity) increases, whereas the RMSE determined from the test sets has a clear minimum for a polynomial function of order 2. Such a result suggests that a second order polynomial function should be selected because it is the best trade-off between model complexity and generalization ability. Said differently, this second order polynomial model should produce the most reliable predictions when it is used with new data. Notice that, once this resampling procedure is used to select a specific model complexity, it is often suggested to determine the final model parameters with the full data set, even if the K different models generated for this model complexity selection method are not too different. This rule of thumb is all the more true when the manipulated data sets are short (Boyd and Vandenberghe, 2018, Chapter 13).
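The procedure listed in this illustration can be sketched with plain NumPy for polynomial trend models; the data below are a synthetic noisy quadratic standing in for the life expectancy series, and the fold count, candidate orders, and noise level are arbitrary choices.

```python
import numpy as np

def kfold_rmse(t, y, order, n_folds=5):
    """Average test RMSE of a polynomial model of given order over a K-fold split."""
    idx = np.arange(y.size)
    folds = np.array_split(idx, n_folds)
    rmse = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        coeffs = np.polyfit(t[train], y[train], deg=order)  # least squares fit on the training set
        y_hat = np.polyval(coeffs, t[test])                 # prediction on the held-out fold
        rmse.append(np.sqrt(np.mean((y[test] - y_hat) ** 2)))
    return np.mean(rmse)

# Illustrative data: a noisy quadratic trend standing in for the life expectancy series
rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 70)
y = 65.0 + 12.0 * t + 6.0 * t**2 + rng.normal(scale=0.3, size=t.size)

for order in range(6):
    print(order, kfold_rmse(t, y, order))
```

Note that the folds are taken as contiguous blocks of the time axis, which is a common precaution when the samples are ordered in time.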

Remark 3.7 When the test sets in the K-fold procedure are restricted to one data point each time (i.e., the training sets in Fig. 3.19 contain N − 1 samples, whereas the test set consists of one sample only), the K-fold method boils down to what is called the leave one out cross validation (LOOCV) method (James et al., 2017, Chapter 5). When, on top of that, linear regressions come into play, the RMSE for each model can be directly calculated as follows (Hastie et al., 2009, Chapter 7):

RMSE = √( (1/N) ∑_{i=0}^{N−1} ((y_i − ŷ_i)/(1 − φ_i))² ),    (3.76)



Fig. 3.20 K-fold procedure for model order selection. Female life expectancy in France

where φ_i, i ∈ {0, · · · , N − 1}, is the (i + 1)th diagonal element of Φ(Φ⊤Φ)⁻¹Φ⊤. Under such practical constraints, the prediction error is determined via a single model fit, which reduces its computational time drastically.

It is interesting to notice that such a cross validation procedure can also be used for the selection of the aforementioned regularization parameter η involved in the regularized linear least squares for instance (see Sect. 3.1.5). Indeed, as η increases, the RMSE on the training data increases. But (as with the model order), the RMSE obtained with test sets typically decreases as η increases and then increases when η becomes very large. Thus, again, a good choice for the regularization parameter η is the value that approximately minimizes the RMSE with test sets. This tuning procedure has been used for the selection of η in Illustration 3.5, thus leading to choose η = 1 × 10¹⁰.

As clearly illustrated in Fig. 3.20, the RMSE and, by extension, the minimal value of the least squares cost function V(θ) evaluated at θ = θ̂ naturally decrease with the model order when the full data set is taken into account. This tendency is explained by the fact that adding new degrees of freedom (by increasing the model order) increases the model abilities to fit the measured data points with (too much) precision and thus leads to overfitted models. In order to bypass this overfitting issue, another solution (complementary to the aforementioned cross validation methods) consists in using a new discrepancy assessment criterion that penalizes the decrease of the loss function V(θ) when the model complexity gets too high. A general form of such a criterion is the following (Söderström and Stoica, 1989, Chapter 11):

W(θ̂) = V(θ̂)(1 + γ(N, n_θ)),    (3.77)

where γ(N, n_θ) is a function depending on the data length N and the number of unknown parameters n_θ, which should increase with n_θ in order to penalize complex


models and guarantee a certain model parsimony. Whereas several criteria have been suggested in the literature (Johansson, 1993, Chapter 9; Hastie et al., 2009, Chapter 7), the most famous one is probably the Akaike Information Criterion (AIC) that is defined as follows:

W_AIC(θ̂) = N log(V(θ̂)) + 2 n_θ,    (3.78)

which, as expected, decreases as V(θ̂) decreases and increases as n_θ increases. This AIC based solution is often used, e.g., when:

• The available data set is too small to split it up effectively in order to be able to generate sufficiently long training and test sets (De Ridder et al., 2005).
• Dependency among the time series samples is predominant, dependency that makes the standard “K-fold” technique questionable for a reliable selection of the training, test, and validation parts (for an interesting analysis of the impact of sample dependency on cross validation methods, see, e.g., Bergmeir and Benitez, 2012; Cerqueira et al., 2020).
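A sketch of Eq. (3.78) applied to polynomial trend models fitted on the full data set is given below; V(θ̂) is taken as the residual sum of squares, which is the cost function used throughout this chapter, and the synthetic data are the same stand-in series as in the K-fold sketch above.

```python
import numpy as np

def aic_polynomial(t, y, max_order=6):
    """AIC of Eq. (3.78) for polynomial models of increasing order."""
    scores = {}
    n = y.size
    for order in range(max_order + 1):
        coeffs = np.polyfit(t, y, deg=order)
        v = np.sum((y - np.polyval(coeffs, t)) ** 2)   # least squares cost V(theta_hat)
        n_theta = order + 1                            # number of estimated parameters
        scores[order] = n * np.log(v) + 2 * n_theta    # Eq. (3.78)
    return scores

# Reusing the synthetic quadratic trend of the previous sketch
rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 70)
y = 65.0 + 12.0 * t + 6.0 * t**2 + rng.normal(scale=0.3, size=t.size)

scores = aic_polynomial(t, y)
print(min(scores, key=scores.get), scores)
```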

Illustration 3.8 In order to reinforce the conclusion drawn in Illustration 3.7, let us plot, for the French female life expectancy time series, the AIC evolution versus the polynomial model order when, this time, the full data set is used for each model order. As shown in Fig. 3.21, selecting a second order polynomial model leads to a minimum value for the AIC and thus can be considered as the best model complexity according to the parsimony principle (Box et al., 2016, Chapter 1).


Fig. 3.21 AIC test for model order selection. Female life expectancy in France



3.4 Hints for Least Squares Solutions Refinement

Although the methods introduced previously have proved their efficiency in many practical situations (Kailath et al., 2000; Seber and Wild, 2003; van den Bos, 2007; Bisgaard and Kulahci, 2011; Gibbs, 2011), the dedicated literature is full of hints introduced to improve their numerical implementation. Herein, only a short list is given. For more details, please read dedicated sections in, e.g., (Kelley, 1987; Lawson and Hanson, 1995; Björck, 1996; Dennis and Schnabel, 1996; Bonnans et al., 2006; Nocedal and Wright, 2006; Rao, 2009; Luenberger and Ye, 2016).

First, when:

• Every observation should not be treated equally
• Prior on the standard deviation of the random errors acting on the data is available

weights can be introduced into the former least squares cost functions in order to balance this varying quality issue by giving each data point its proper amount of influence. More specifically, instead of considering the Euclidean 2-norm in V(θ) or V_r(θ) for measuring the length of the residuals, a weighted norm (Meyer, 2000, Chapter 5) can be favored in order to maximize the efficiency of parameter estimation. More precisely, the following criterion

‖y − f(t, θ)‖²_W = (y − f(t, θ))⊤ W (y − f(t, θ))    (3.79)

can be used instead of V(θ) where W ∈ R^{N×N} is a user defined positive definite weighting matrix. Most of the time, this matrix W is chosen to be diagonal with diagonal elements inversely proportional to the influence or variance of the corresponding observation. Such a nonsingular weighting matrix leads to the following closed form weighted linear least squares solution:

θ̂ = (Φ⊤WΦ)⁻¹ Φ⊤W y.    (3.80)

By introducing

Φ̃ = W^{1/2} Φ,    ỹ = W^{1/2} y,    f̃(t, θ) = W^{1/2} f(t, θ),    (3.81)

we have

‖y − Φθ‖²_W = ‖ỹ − Φ̃θ‖₂²,    (3.82a)
‖y − f(t, θ)‖²_W = ‖ỹ − f̃(t, θ)‖₂².    (3.82b)

Thus, although numerical issues associated with disparate weight values may occur, such a weighted least squares problem can be numerically solved by applying any of the methods introduced previously to this “tilde problem.” For numerically stable solutions when the weights vary widely in size, see (Björck, 1996, Section 4.4).
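A minimal NumPy sketch of Eqs. (3.80)–(3.82) for a diagonal weighting matrix is given below; the weights, chosen inversely proportional to a hypothetical known noise variance, are purely illustrative.

```python
import numpy as np

def weighted_lls(Phi, y, w):
    """Weighted linear least squares through the 'tilde problem' of Eq. (3.81).

    w contains the diagonal of the positive definite weighting matrix W.
    """
    sqrt_w = np.sqrt(w)
    Phi_tilde = sqrt_w[:, None] * Phi        # W^(1/2) Phi
    y_tilde = sqrt_w * y                     # W^(1/2) y
    theta, *_ = np.linalg.lstsq(Phi_tilde, y_tilde, rcond=None)
    return theta

# Hypothetical example: weights inversely proportional to a known noise variance
rng = np.random.default_rng(4)
t = np.linspace(0.0, 1.0, 100)
sigma = np.where(t < 0.5, 0.05, 0.5)         # second half of the record is noisier
y = 1.0 + 2.0 * t + rng.normal(scale=sigma)
Phi = np.column_stack((np.ones_like(t), t))
print(weighted_lls(Phi, y, w=1.0 / sigma**2))
```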


Second, in order to explicitly take into account the magnitude differences of the parameters to be estimated within the numerical optimization, it can be interesting to normalize each parameter θ_i, i ∈ {1, · · · , n_θ}, with respect to an initial guess θ_i⁰, i ∈ {1, · · · , n_θ}. By assuming that this initial guess is of the same order of magnitude as the final value, we have now to estimate n_θ parameters μ_i, which are linked to θ_i as follows:

θ_i = (1 + μ_i) θ_i⁰,  i ∈ {1, · · · , n_θ}.    (3.83)

The former numerical solutions can be extended straightforwardly to the estimation of the normalized parameters μ_i, i ∈ {1, · · · , n_θ}. Notice that such a normalization step also has an explicit impact on the determination of the Jacobian matrix involved in the Gauss–Newton or Levenberg–Marquardt algorithms. Indeed, by using a standard chain rule, we can easily show that

∂f(t, θ)/∂μ_i = (∂f(t, θ)/∂θ_i)(∂θ_i/∂μ_i) = θ_i⁰ ∂f(t, θ)/∂θ_i,  i ∈ {1, · · · , n_θ}.    (3.84)

This direct proportionality between the μ and θ dependent Jacobian matrix components should be explicitly used at each iteration of the Gauss–Newton or Levenberg–Marquardt algorithms (Gavin, 2020).

Last but not least, because, in many practical cases, the nonlinear least squares cost functions V(θ) and V_r(θ) have relative or local minima (Rao, 2009), the initial guess plays a central role to help the minimization algorithms converge toward the global minimum of the cost function. Indeed, even if it cannot be proved beforehand that the global minimum is reached, selecting an initial guess in the vicinity of the global optimum is an efficient way to get, in the end, the optimal value for θ̂. Thus, as far as the initialization is concerned, it is strongly suggested to use prior knowledge to give initial values or, at worst, upper and lower bounds for the unknown parameters in order to start the minimization from these values or within the ranges of possible values. On top of that, if bounds are available, it is interesting to resort to constrained optimization techniques (Rao, 2009, Chapter 7) instead in order to take this prior into account explicitly all along the optimization iterations. In the worst case scenario, i.e., when nothing can be said about the unknown parameters before running the numerical optimization, the only way to start the minimization is to resort to random number generators combined with random search methods (Rao, 2009, Chapter 6) in order to shrink the feasible set. Notice, however, that such a full randomness based solution is unlikely to give accurate results if n_θ is greater than 10 or so when standard gradient based minimization algorithms are considered (see, e.g., (Parrilo and Ljung, 2003) for an interesting illustration of this initialization solution weakness).
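When only bounds are available, the random search idea mentioned above can be sketched as a simple multistart loop around any local nonlinear least squares solver (SciPy's least_squares is used here for brevity); the number of draws, the bounds, and the commented usage line are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

def multistart_fit(residual_fun, lower, upper, n_starts=20, seed=0):
    """Draw random initial guesses within prior bounds and keep the best local minimum."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    best = None
    for _ in range(n_starts):
        theta0 = rng.uniform(lower, upper)        # random initial guess in the feasible box
        result = least_squares(residual_fun, theta0)
        if best is None or result.cost < best.cost:
            best = result
    return best

# Hypothetical usage with the damped sine residuals of the earlier illustration:
# best = multistart_fit(lambda th: s - damped_sine(th, t),
#                       lower=[1e3, -0.1, 5.0, -np.pi],
#                       upper=[5e3,  0.0, 8.0,  np.pi])
```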


3.5 Take Home Messages

• The linear least squares approach should be considered when the model to determine is linear with respect to the unknown parameters.
• When N > n_θ and rank(Φ) = n_θ, the linear least squares solution is analytically equivalent to the solution to the normal equations Φ⊤Φθ = Φ⊤y.
• The linear least squares solution is unique if N ≥ n_θ and Φ has full rank.
• If Φ has full rank and has the QR factorization Φ = QR, then the linear least squares solution can be found by back substitution from Rθ = Q⊤y.
• By using the SVD of Φ, the minimum norm solution of the linear least squares problem is given by VΣ†U⊤y = ∑_{i=1}^{r} (u_i⊤y/σ_i) v_i, where r = rank(Φ).
• The SVD of Φ is the tool to use to determine the condition number of Φ.
• When the initial linear least squares problem is ill conditioned, the user can either change the basis functions composing Φ or use a regularization approach to lead to better fits and less ill conditioned linear least squares problems.
• When nonlinear models are involved, iterative minimization algorithms must be used to determine the unknown parameter vector θ.
• For nonlinear least squares problems, the Levenberg–Marquardt method (also known as the damped nonlinear least squares method) is the generic algorithm used in many software applications for solving curve fitting problems.
• Because the Levenberg–Marquardt method finds only a local minimum, a specific attention must be paid to the initialization in order to guarantee that, in the end, the estimated local minimum is the global minimum of the nonlinear least squares cost function.
• Resampling techniques such as the K-fold procedure can be used for (i) assessing how the results of a linear least squares minimization problem can generalize to an independent data set and (ii) tuning the hyperparameters used in the linear least squares solutions.
• Model order selection can be performed by resorting to information criteria such as the famous AIC.

References

C. Bergmeir, J. Benitez, On the use of cross validation for time series predictor evaluation. Inform. Sci. 191, 192–213 (2012)
S. Bisgaard, M. Kulahci, Time Series Analysis and Forecasting by Example (Wiley, London, 2011)
A. Björck, Numerical Methods for Least Squares Problems (SIAM, Philadelphia, 1996)
J. Bonnans, J. Gilbert, C. Lemaréchal, C. Sagastizábal, Numerical Optimization: Theoretical and Practical Aspects (Springer, Berlin, 2006)
G. Box, G. Jenkins, G. Reinsel, G. Ljung, Time Series Analysis: Forecasting and Control (Wiley, London, 2016)
S. Boyd, L. Vandenberghe, Introduction to Applied Linear Algebra (Cambridge University Press, Cambridge, 2018)
H. Briceno, C. Rocco, E. Zio, Singular spectrum analysis for forecasting of electric load demand. Chem. Eng. 33, 919–924 (2013)
P. Brockwell, R. Davis, Introduction to Time Series and Forecasting (Springer, Berlin, 2016)
I. Bronshtein, K. Semendyayev, G. Musiol, H. Muhlig, Handbook of Mathematics (Springer, Berlin, 2015)
V. Cerqueira, L. Torgo, I. Mozetic, Evaluating time series forecasting models: an empirical study on performance estimation methods. Mach. Learn. 109, 1997–2028 (2020)
J. Commandeur, S. Koopman, An Introduction to State Space Time Series Analysis (Oxford University Press, Oxford, 2007)
F. De Ridder, R. Pintelon, J. Schoukens, D. Gillikin, Modified AIC and MDL model selection criteria for short data records. IEEE Trans. Instrum. Meas. 54, 144–150 (2005)
L. Debnath, P. Mikusinski, Introduction to Hilbert Spaces with Applications (Academic Press, London, 2005)
J. Dennis, R. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (SIAM, Philadelphia, 1996)
H. Gavin, The Levenberg-Marquardt algorithm for nonlinear least squares curve fitting problems. Technical report, Department of Civil and Environmental Engineering, Duke University, Durham, North Carolina, USA, 2020
B. Gibbs, Advanced Kalman Filtering, Least Squares and Modeling: A Practical Handbook (Wiley, London, 2011)
G. Golub, C. Van Loan, Matrix Computations (Johns Hopkins University Press, Baltimore, 2013)
H. Hassani, A. Zhigljavsky, Singular spectrum analysis: methodology and application to economics data. J. Syst. Sci. Complexity 22, 372–394 (2009)
H. Hassani, A. Soofi, A. Zhigljavsky, Predicting daily exchange rate with singular spectrum analysis. Nonlinear Anal. Real World Appl. 11, 2023–2034 (2010)
T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning (Springer, Berlin, 2009)
G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning with Applications in R (Springer, Berlin, 2017)
R. Johansson, System Modeling and Identification (Prentice Hall, Englewood Cliffs, 1993)
T. Kailath, A. Sayed, B. Hassibi, Linear Estimation (Prentice Hall, Englewood Cliffs, 2000)
C. Kelley, Iterative Methods for Optimization (Society for Industrial Mathematics, Philadelphia, 1987)
R. Kennedy, P. Sadeghi, Hilbert Space Methods in Signal Processing (Cambridge University Press, Cambridge, 2013)
C. Lawson, R. Hanson, Solving Least Squares Problems (Society for Industrial Mathematics, Philadelphia, 1995)
C. Lemaréchal, Numerical experiments in nonsmooth optimization. Progr. Nondifferentiable Optim. A-2361, 61–84 (1982)
L. Ljung, System Identification. Theory for the User (Prentice Hall, Englewood Cliffs, 1999)
D. Luenberger, Y. Ye, Linear and Nonlinear Programming (Springer, Berlin, 2016)
C. Meyer, Matrix Analysis and Applied Linear Algebra (SIAM, Philadelphia, 2000)
J. Nocedal, S. Wright, Numerical Optimization (Springer, Berlin, 2006)
A. Oppenheim, S. Willsky, S. Hamid, Signals and Systems (Pearson, London, 2014)
V. Pan, How bad are Vandermonde matrices? SIAM J. Matrix Anal. Appl. 37, 676–694 (2016)
P. Parrilo, L. Ljung, Initialization of physical parameter estimates, in Proceedings of the IFAC Symposium on System Identification (Rotterdam, 2003)
S. Rao, Engineering Optimization: Theory and Practice (Wiley, London, 2009)
L. Rodriguez-Aragon, A. Zhigljavsky, Singular spectrum analysis for image processing. Stat. Interface 3, 419–426 (2010)
S. Rogers, M. Girolami, A First Course in Machine Learning (CRC Press, Boca Raton, 2017)
G. Seber, C. Wild, Nonlinear Regression (Wiley, London, 2003)
T. Söderström, P. Stoica, System Identification (Prentice Hall, Englewood Cliffs, 1989)
S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective (Academic Press, London, 2015)
L. Trefethen, Approximation Theory and Approximation Practice (SIAM, Philadelphia, 2013)
A. van den Bos, Parameter Estimation for Scientists and Engineers (Wiley, London, 2007)
M. Verhaegen, V. Verdult, Filtering and System Identification: A Least Squares Approach (Cambridge University Press, Cambridge, 2007)
E. Walter, Numerical Methods and Optimization: A Consumer Guide (Springer, Berlin, 2014)
X. Zhang, Matrix Analysis and Applications (Cambridge University Press, Cambridge, 2017)

Chapter 4

Least Squares Estimators and Residuals Analysis

A thorough reading of Chaps. 2 and 3 clearly shows that the emphasis in these chapters is mainly on computational solutions combining linear algebra and numerical optimization for trend and/or seasonality extraction and parameter estimation. Although introducing efficient numerical methods for this important task is a valuable asset for generating accurate estimates, there are important available priors which have been ignored so far. One of them, which is essential for prediction, is the absence of errors in the modeling of the observations considered by the numerical solutions introduced in Chaps. 2 and 3, respectively (Björck, 1996; van den Bos, 2007). Apart from Sect. 2.3.2 where the noise effect has been quickly analyzed when the SVD of the SSA Hankel matrix is generated, the effects of (measurement) noise on the estimated parameters precision have indeed been disregarded until now. It is time to fill this gap and analyze the noise impacts, on the one hand, on the least squares estimation results introduced in Chap. 3 and, on the other hand, on the modeling of the residuals. This new chapter is thus devoted to the effective description of (i) these disturbances by resorting to statistics and (ii) their impacts as far as estimated model parameter accuracy and precision are concerned.

The fact that the available observations contain noise can be easily emphasized by a simple look at the residuals generated by subtracting the estimated models of the trend and seasonality patterns from the raw data sets. As shown, for instance, in the examples introduced in Chaps. 2 or 3, these residuals

• Are quite smooth, thus the residual samples are not necessarily independent of each other.
• Have no clear deterministic behavior, i.e., even if forecasting the one or two steps ahead values of the residuals seems to be possible for most of the examples considered in the former chapters by just a look at the time evolution of the preceding samples, accurately predicting residuals values long ahead is unlikely.

These simple observations lead us to leave the deterministic world to involve randomness in the time series representations in order to describe this unpredictability


properly. When random variables come into play, statistical assumptions must be set in order to characterize the phenomenon under study and, by extension, the dynamical system generating the residuals. The best way to characterize the statistics of the random processes involved in the generation of residuals would be to determine their generating characteristic functions (e.g., their cumulative distribution or probability density functions) from the available data samples. Estimating these characteristic functions from (short) data samples would be possible if the data sets were generated by simple simulations or if the user had access to many realizations of these stochastic processes (Leon-Garcia, 2008; James et al., 2017). Unfortunately, in practice, this is rarely the case.1 Thus, in many practical situations, the user must resort to weaker characteristics which can be generated from a single (and, sometimes, short on top of that) realization of the stochastic process to be characterized. In this book, as suggested, e.g., in Box et al. (2016) and Brockwell and Davis (2016), second order statistics, i.e., the mean value and the covariance functions, are chosen as the main descriptors of the data statistical properties. This specific choice is guided by two observations. First, when a single (and short) realization of a stochastic process is available, getting higher order statistics is far from being easy in practice (Nandi, 1989, Chapter 1). Second, if the random process under study exhibits a Gaussian distribution, its first and second moments characterize the random process distribution completely (Kay, 2006, Chapter 20), thus are sufficient to capture all the available statistical information of the process generating this realization. Remark 4.1 Because the solutions introduced in this book aim at retrieving system unknown information from its outputs only, they can be considered as blind model learning methods (Nandi, 1989). Thus, like any blind approach, the techniques introduced hereafter determine parametric models under the assumption that the dynamical system generating the observable data set is driven by stochastic sequences which are, contrary to the measurable outputs, not observable. This is also one of the reasons why resorting to statistical tools is essential to describe the dynamics embedded in the available data efficiently. This pragmatic discussion points out that, once the trend and seasonality patterns are removed, the second step of time series model learning should be the selection of a suitable mathematical characterization of the residuals. In order to capture the residuals unpredictability, involving stochastic processes seems to be a natural solution to describe the random nature of the observations. This is the main reason why, in the next sections, what is meant by randomness and, more specifically, stochastic processes (Papoulis, 2000) is briefly introduced (for more details, refer to well known contributions like Papoulis 2000; Kay 2006; Leon-Garcia 2008). Among the main properties of stochastic processes, a specific attention is paid to ergodic

¹ The data sets for the French female life expectancy or the North American accidental death toll are indeed unique, for instance (see Figs. 1.1 and 1.3, respectively). Thus, the residuals generated by subtracting trends and seasonal components from the raw data sets are unique as well.


covariance stationary time series (Kay, 2006, Chapter 17). As explained hereafter, ergodic covariance stationary stochastic processes are indeed characterized by their mean and covariance functions only, the values of which can be determined from realizations of the underlying stochastic process. Such residuals properties thus render their analysis and characterization from data samples a lot easier. In addition to giving access to convenient descriptions of the unpredictable fluctuations of the observations and, by extension, of the residuals, this change of paradigm results in the possibility to depict the estimated parameters as stochastic variables as well. These estimated parameters are indeed direct functions of noisy observations. By handling stochastic processes to describe time series, the estimated vector thus becomes a realization (usually called an estimate, Theodoridis 2015, Chapter 3) of a random vector (called in the literature the estimator, Theodoridis 2015, Chapter 3), and the analysis of the properties of such an estimator should help the user characterize the estimate quality. In this book, a specific attention will be paid to precision and accuracy of the least squares estimators introduced so far. These two standard quality attributes are indeed (i) essential to detect if the estimated model is reliable or not, (ii) quite easy to determine because they are directly linked to means and standard deviations of random variables which can be computed quite straightforwardly.

The importance of analyzing the randomness of time series as well as estimated parameters carefully leads us to organize this chapter as follows. After defining explicitly what we mean by residuals in Sect. 4.1, Sect. 4.2 introduces the main notions and definitions borrowed from statistics for the stochastic characterization of the time series studied in this book. Section 4.3 is then dedicated to standard tests performed to see if the residuals can be regarded as independent random variables or not, thus reducing their characterization to the estimation of means and variances. Once the stochasticity of the residuals has been discussed, the main statistical properties of the least squares estimates introduced in Chap. 3 are reviewed in Sect. 4.4. This chapter ends with the Wold representation theorem in Sect. 4.5, which can be considered as a fundamental justification for time series analysis. It indeed explicitly proves that any stationary time series can be decomposed into a deterministic part and a moving average (MA) component of infinite order. Section 4.6 concludes this chapter with the main points to remember.

4.1 Residual Components

As defined previously with words only, the residuals are generated by subtracting the estimated models of the trend and seasonality patterns from the raw data sets. A direct translation of this sentence with signals and parameters leads to the following equation for any residual sample z_i, i ∈ Z, i.e.,

z_i = y_i − f(t_i, θ̂),    (4.1)


where θ̂ stands for θ_lls or θ_nlls according to the linearity (w.r.t. θ) of the selected model structure f(t, θ) for the trend and seasonality patterns. By introducing explicitly in Eq. (4.1) a function g(t) standing for the true data function, i.e., the underlying (unknown) function generating the noise free data set perfectly, the residual sample z_i, i ∈ Z, satisfies straightforwardly

z_i = (y_i − g(t_i)) + (g(t_i) − f(t_i, θ̂)),    (4.2)

where

• y_i − g(t_i) = e_i, i ∈ Z, stands for the unpredictable fluctuation of the observation y_i from the not measurable noise free sample g(t_i), thus is a sample of a stochastic sequence (e_t)_{t∈Z}.
• g(t_i) − f(t_i, θ̂), i ∈ Z, is the modeling error, i.e., a deterministic component which quantifies the discrepancy between the unknown true data function g(t) and the computed fitting model f(t, θ̂).

By assuming that the combination of the priors we have and

• The SSA method
• The model complexity selection solutions
• The least squares based parameter estimation techniques

introduced in the former chapters guarantees that

• The true data function belongs to the user selected model class so that a parameter vector θ* exists such that

g(t_i) = f(t_i, θ*),  i ∈ Z.    (4.3)

• The solution θ̂ is the global minimum of the linear and nonlinear least squares cost functions introduced in Chap. 3,

the estimated parameter vector θ̂ should be close to θ*, thus, the residuals can be assumed to be mainly dominated by the unpredictable fluctuations of the observations, i.e.,

z_i ≈ y_i − g(t_i) = e_i,  i ∈ Z,    (4.4)

thus embed the statistical properties of these unpredictable fluctuations. This (strong?) assumption, i.e., the fact that the model .f (t, θˆ ) captures the pure data function well enough to neglect its impact on the residuals, will be considered valid in the sequel. Said differently, we assume from now on that the model learning steps introduced in Chaps. 2 and 3 give estimated trend and seasonality models accurate enough to guarantee that the residuals are stochastic sequences, i.e., are mainly dominated by the unpredictable fluctuations of the observations. Even if such an assumption can be regarded as a strong assumption in many model


learning problems, it is in fact often easy to satisfy when time series are concerned. Indeed, as shown through the different illustrations considered until now, combinations of exponential functions, polynomial functions, and harmonics are often sufficient to mimic the deterministic part of the time series, i.e., the pure data function g(t). Thus, once the user has selected a model class large enough to contain g(t), the solutions introduced in Chap. 3 lead to accurate estimates of this pure function, thus guarantee that the rest is made of stochastic sequences mainly. Notice finally that the residuals tests introduced in Sect. 4.3 will also help the user detect if there are still trends or seasonalities in the residuals. All these practical conditions (which are specific to our time series model learning problem) allow us to turn to the stochastic analysis of the residuals immediately.

4.2 Stochastic Description of the Residuals

The starting point of our stochastic analysis of residuals is the characterization of the randomness embedded into them. Because residuals are by construction time dependent in addition to being random, it seems reasonable to study their probabilistic characteristics by resorting to the notion of stochastic process (Kay, 2006; Papoulis, 2000).

4.2.1 Stochastic Process

Let us start by defining what we mean by stochastic process.

Definition 4.1 A stochastic or random process is a collection of random variables defined on a common probability space and indexed by some mathematical set.

In the sequel, the selected mathematical index set is a discrete time set denoted by T when a finite set of N integers, i.e., T = {0, · · · , N − 1}, is handled, whereas Z is used when the notions introduced hereafter require infinite length stochastic processes. Said differently, it is assumed in this book that the available residuals time series (z_t)_{t∈T} are realizations of discrete time stochastic processes or stochastic sequences (z_t)_{t∈Z}.

Remark 4.2 In order to avoid confusion between a stochastic process and (one of) its realizations, sans serif type style is used for denoting random variables whereas serif type style is used for realizations, i.e., (z_t)_{t∈Z} denotes a stochastic sequence whereas (z_t)_{t∈T} stands for a specific sequence or realization of this stochastic process.


The reason why discrete time processes are selected (instead of continuous time) is related to the fact that the time series studied in this book are all acquired with specific sampling periods, i.e., are discrete time by nature. Note also that handling continuous time random processes would require mathematical subtleties which are beyond the scope of this document (for readers interested in working with continuous time stochastic processes, please read Aström 2006 and the references therein). On the contrary, because the time series handled until now are continuous in the y-axis, the stochastic sequences studied herein will be assumed to be continuous valued in the sequel.

Remark 4.3 Whereas the stochastic sequences introduced so far explicitly depend on t ∈ Z, they implicitly depend on an event or sample space Ω (and the associated σ algebra and probability measure, Papoulis 2000) to take into account the randomness governing the evolution in time of the process behavior. Strictly speaking, the correct notation for a stochastic process should be

z_{t,ω} with t ∈ Z and ω ∈ Ω.    (4.5)

With this notation in mind,

• For a fixed ω ∈ Ω, z_{•,ω} is a realization (z_t)_{t∈Z}.
• For a fixed t ∈ Z, z_{t,•} is a random variable on the sample space Ω.

Again, for ease of notation, the explicit dependency of the stochastic sequence on Ω will be used only when necessary. It is now time to introduce an explicit way to characterize with time the randomness of a stochastic sequence. Such a characterization can be performed by using the notion of finite N-th order joint probability (Katayama, 2005, Chapter 4), i.e.,

Pr(z_0 ≤ z_0, · · · , z_{N−1} ≤ z_{N−1}) = ∫_{−∞}^{z_0} · · · ∫_{−∞}^{z_{N−1}} p_{z_0,···,z_{N−1}}(z_0, · · · , z_{N−1}) dz_0 · · · dz_{N−1},    (4.6)

with .zi ∈ R, .i ∈ {0, · · · , N − 1} whereas .pz0 ,··· ,zN−1 (z0 , · · · , zN −1 ) stands for the joint probability density function of .(zt )t∈Z for any .N ∈ N∗ and any choice of the time instants .{0, · · · , N − 1} (Leon-Garcia, 2008, Chapter 6). Thanks to this definition, various expectations can be determined (Johnson and Wichern, 2002, Chapter 2), then used to characterize the underlying stochastic sequence at least partially. For instance, Definition 4.2 Given a stochastic sequence .(zt )t∈Z , • The mean at any time instant .i ∈ Z is defined as follows


E{zi } = μzi = μz (i)  ∞  ∞ ··· zi pz0 ,··· ,zN−1 (z0 , · · · , zN −1 )dz0 · · · dzN −1 = −∞ −∞   

.

N times

 =



−∞

zi pzi (zi )dzi ∈ R,

(4.7)

where .pzi (zi ) stands for the marginal probability density function of .zi computed from the joint probability density function .pz0 ,··· ,zN−1 (z0 , · · · , zN −1 ) by integrating out the other variables, i.e.,  pzi (zi ) =

.





···  −∞ 



−∞

N −1 times



pz0 ,··· ,zN−1 (z0 , · · · , zN −1 )

× dz0 · · · dzi−1 dzi+1 · · · dzN −1 .

(4.8)

• The mean of ⎤ z0 ⎥ ⎢ = ⎣ ... ⎦ ∈ RN ×1 ⎡

z0:N −1

.

(4.9)

zN −1 is the concatenation of the aforedefined means at time .i ∈ {0, · · · , N − 1}, i.e., ⎤ μz0 ⎥ ⎢ = ⎣ ... ⎦ ∈ RN ×1 . ⎡

E{z0:N −1 } = μz0:N−1

.

(4.10)

μzN−1 Definition 4.3 Given a stochastic sequence .(zt )t∈Z , • The variance at any time instant .i ∈ Z is defined as follows  E{(zi − μzi ) } =

.

2

σz2i

=

σz2 (i)

=



· · · (zi − μzi )2   R R

N times

× pz0 ,··· ,zN−1 (z0 , · · · , zN −1 )dz0 · · · dzN −1  = (zi − μzi )2 pzi (zi )dzi . R

(4.11)

• The autocovariance coefficient between any time instants $i \in \mathbb{Z}$ and $j \in \mathbb{Z}$ is defined as follows
$$ \sigma_{z_i z_j} = \sigma_{zz}(i,j) = \mathbb{E}\{(z_i - \mu_{z_i})(z_j - \mu_{z_j})\} = \underbrace{\int_{\mathbb{R}} \cdots \int_{\mathbb{R}}}_{N \text{ times}} (z_i - \mu_{z_i})(z_j - \mu_{z_j})\, p_{z_0,\cdots,z_{N-1}}(z_0,\cdots,z_{N-1})\, dz_0 \cdots dz_{N-1} = \int_{\mathbb{R}}\int_{\mathbb{R}} (z_i - \mu_{z_i})(z_j - \mu_{z_j})\, p_{z_i,z_j}(z_i,z_j)\, dz_i\, dz_j. \qquad (4.12) $$
• The covariance matrix of $z_{0:N-1}$ is the concatenation of the aforedefined variance and autocovariance coefficients at times $i \in \{0,\cdots,N-1\}$ and $j \in \{0,\cdots,N-1\}$, i.e.,
$$ \mathbb{E}\{(z_{0:N-1} - \mu_{z_{0:N-1}})(z_{0:N-1} - \mu_{z_{0:N-1}})^{\top}\} = \Sigma_{z_{0:N-1} z_{0:N-1}} = \begin{bmatrix} \sigma_{z_0}^2 & \cdots & \sigma_{z_0 z_{N-1}} \\ \vdots & \ddots & \vdots \\ \sigma_{z_{N-1} z_0} & \cdots & \sigma_{z_{N-1}}^2 \end{bmatrix} \in \mathbb{R}^{N \times N}. \qquad (4.13) $$

Perhaps the most famous (and useful?) stochastic sequences are the ones made of independent and identically distributed (i.i.d.) components.

Definition 4.4 A stochastic sequence $(e_t)_{t\in\mathbb{Z}}$ is i.i.d. when, for any $N \in \mathbb{N}^*$ and any choice of the time instant set $\{0,\cdots,N-1\}$,
• The random variables $(e_0,\cdots,e_{N-1})$ are identically distributed, i.e.,
$$ p_{e_0}(e) = p_{e_k}(e), \quad \forall\, k \in \{0,\cdots,N-1\} \text{ and } \forall\, e \in \mathbb{R}. \qquad (4.14) $$
• The random variables $(e_0,\cdots,e_{N-1})$ are independent, i.e.,
$$ p_{e_0,\cdots,e_{N-1}}(e_0,\cdots,e_{N-1}) = p_{e_0}(e_0) \cdots p_{e_{N-1}}(e_{N-1}), \qquad (4.15) $$
or, equivalently, for all $i \in \{0,\cdots,N-1\}$ and $j \in \{0,\cdots,N-1\}$ with $i \neq j$,
$$ p_{e_i,e_j}(e_i,e_j) = p_{e_i}(e_i)\, p_{e_j}(e_j). \qquad (4.16) $$

Such a process can be used to generate a random walk sequence, for instance.

Definition 4.5 A random walk $(z_t)_{t\in\mathbb{N}}$ is a stochastic sequence starting at zero, i.e., $z_0 = 0$, and generated by summing i.i.d. stochastic components, i.e.,
$$ z_i = e_0 + \cdots + e_{i-1}, \quad i \in \mathbb{N}^*, \qquad (4.17) $$
where $(e_t)_{t\in\mathbb{Z}}$ is i.i.d., i.e., for $i \in \mathbb{N}^*$,
$$ z_i = z_{i-1} + e_{i-1} \quad \text{with} \quad z_0 = 0. \qquad (4.18) $$
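To make Definition 4.5 concrete, here is a minimal sketch (not taken from the book) that builds one realization of such a random walk by cumulative summation of i.i.d. samples; the Gaussian noise, the length, and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
e = rng.standard_normal(N)                      # i.i.d. components (e_t), assumed Gaussian here
z = np.concatenate(([0.0], np.cumsum(e)[:-1]))  # z_0 = 0 and z_i = e_0 + ... + e_{i-1}
print(z[:5])
```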

4.2.2 Stationarity

As explained, e.g., in Chap. 1, one essential purpose of the techniques introduced in this book is to draw reliable inference from the available time series. In order to guarantee that the calculations performed from a specific realization are valid for unseen data points, it is compulsory to assume that the underlying stochastic sequence has stochastic properties which do not change (too much) with time, i.e., is stationary. If this stationarity assumption is not valid, roughly speaking, it cannot be guaranteed that the future is statistically similar to the past. Thus, any prediction of the future behavior computed from the past is no longer reliable. This is the main reason why, in the sequel, the following assumption is made.

Assumption 4.1 The residuals are (wide sense) stationary.

In order to be more explicit, it is necessary to define what is meant by (wide sense) stationarity.

Definition 4.6 A stochastic sequence $(z_t)_{t\in\mathbb{Z}}$ is strongly or strict sense stationary if its statistical properties are invariant to any shift of the origin, i.e., if $(z_t)_{t\in\mathbb{Z}}$ and $(z_{t+h})_{t\in\mathbb{Z}}$ have the same joint distribution for all $h \in \mathbb{Z}$, i.e., for any $N \in \mathbb{N}^*$ and any choice of the time instant set $\{0,\cdots,N-1\}$,
$$ p_{z_h,\cdots,z_{N-1+h}}(z_h,\cdots,z_{N-1+h}) = p_{z_0,\cdots,z_{N-1}}(z_0,\cdots,z_{N-1}) \quad \text{for all } h \in \mathbb{Z}. $$

Because this strong stationarity condition can be violated in practice and, more importantly, is complicated to verify from a single realization of $(z_t)_{t\in\mathbb{Z}}$, a weaker form of stationarity, known as weak stationarity, wide sense stationarity, or covariance stationarity, is often favored. One of the main advantages of this kind of stationarity is that it involves the first and second moments only.

Definition 4.7 A stochastic sequence $(z_t)_{t\in\mathbb{Z}}$ is weakly or wide sense stationary if its mean does not vary with respect to time, its covariance matrix depends on the time difference only, and its second moment is finite for all times. More specifically, for all $t \in \mathbb{Z}$,
$$ \mathbb{E}\{z_{t+h}\} = \mathbb{E}\{z_t\} = \mu_z \quad \text{for all } h \in \mathbb{Z}, \qquad (4.19a) $$
$$ \sigma_z^2(t) < \infty, \qquad (4.19b) $$
$$ \sigma_{zz}(t+h, t) = \sigma_{zz}(h, 0) \quad \text{for all } h \in \mathbb{Z}, \qquad (4.19c) $$
i.e., $\sigma_{zz}(t+h,t)$ depends only on the time difference $h$ for all $t$ and $h \in \mathbb{Z}$. Said differently, the autocovariance coefficients have a single time index as free parameter, the value of which measures how much the time instants $t+h$ and $t$ differ. In this case, in order to shorten the notations and focus on the time difference, the autocovariance coefficients can be written as follows
$$ \sigma_{zz}(\ell) = \mathbb{E}\{(z_{i+\ell} - \mu_z)(z_i - \mu_z)\}, \quad i \in \mathbb{Z}, \ \ell \in \mathbb{Z}. \qquad (4.20) $$

Said differently, a weakly stationary stochastic process is characterized by specifying its first and second order moments only, i.e., the expected values $\mathbb{E}\{z_t\}$ and the expected products $\mathbb{E}\{(z_{i+\ell} - \mathbb{E}\{z_t\})(z_i - \mathbb{E}\{z_t\})\}$, $i \in \mathbb{Z}$ and $\ell \in \mathbb{Z}$. Of course, a strongly stationary stochastic sequence with finite second moment, i.e., satisfying $\sigma_z^2(t) < \infty$, is weakly stationary. Unfortunately, the opposite is often not true except, for instance, if the stochastic process is Gaussian (see, e.g., Kay 2006, Chapter 20 for a proof), a condition which is often satisfied in practice.

Remark 4.4 As proved, e.g., in Katayama (2005, Section 4.2), a random walk is not stationary.

Definition 4.8 A zero mean white noise is a stochastic sequence $(e_t)_{t\in\mathbb{Z}}$ satisfying
$$ \mathbb{E}\{e_t\} = 0, \qquad (4.21a) $$
$$ \mathbb{E}\{e_t^2\} = \sigma_e^2(t) = \sigma_e^2 < \infty, \qquad (4.21b) $$
$$ \sigma_{ee}(\ell) = \mathbb{E}\{e_{i+\ell}\, e_i\} = 0 \quad \text{for } \ell \neq 0, \qquad (4.21c) $$
for all $t \in \mathbb{Z}$. Such a stochastic sequence is weakly stationary. It is, however, not necessarily i.i.d.: the last condition in the former list only means that the white noise components are uncorrelated, not that they are independent. Conversely, a zero mean i.i.d. stochastic sequence is a white noise because independent components are necessarily uncorrelated. Note that this definition does not specify any particular probability density function but, if the white noise is Gaussian, it is i.i.d. as well because, for Gaussian processes, uncorrelated components are also independent.

4.2.3 Ergodicity

Although handling weakly stationary stochastic processes is, from a practical viewpoint, less demanding than requiring strongly stationary stochastic sequences, the definition of a wide sense stationary stochastic process (see Definition 4.7) still involves first and second order moments, and thus requires the determination of mean vectors and covariance matrices to characterize explicitly the stochastic sequence generating the measured observations. Because, in practice, only one time series is available, it is necessary to determine these essential components from time samples only. Determining accurate estimates of mean vectors and covariance matrices from time samples is possible if and only if the involved processes are ergodic. This important feature leads us to make the following assumption.

Assumption 4.2 The residuals are ergodic.

The strong law of large numbers (Kay 2006, Chapter 15) states that, if $(e_t)_{t\in\mathbb{Z}}$ is an i.i.d. stochastic sequence with finite and constant mean $\mu_e$, then the time average of the samples converges toward the ensemble average with probability one. In this book, it would be interesting to get such results for a larger class of stochastic processes, for weakly stationary stochastic sequences, for instance. Before introducing the ergodicity definitions, let us define the time average mean and autocovariance coefficients.

Definition 4.9 Given a time series $(z_t)_{t\in T}$, its time average or time sample mean $m_z$ is defined as follows
$$ m_z = \frac{1}{N} \sum_{i=0}^{N-1} z_i. \qquad (4.22) $$

Definition 4.10 Given a time series $(z_t)_{t\in T}$ of time average mean $m_z$, its time average or time sample autocovariance coefficient $s_{zz}(\ell)$ for any shift $\ell \in T$ is defined as follows
$$ s_{zz}(\ell) = \frac{1}{N} \sum_{i=0}^{N-1-\ell} (z_{i+\ell} - m_z)(z_i - m_z). \qquad (4.23) $$

Then, the following ergodicity definitions can be stated (see Pollock 1999, Chapter 20, Katayama 2005, Chapter 4, or Papoulis 2000, Chapter 12 for existence conditions under which a process is mean ergodic or covariance ergodic).

Definition 4.11 A stochastic process is said to be ergodic if its statistical moments can be determined from any one of its (infinite) realizations.

Definition 4.12 Given
• A weakly stationary stochastic sequence $(z_t)_{t\in\mathbb{Z}}$ of constant ensemble mean $\mu_z$
• A realization $(z_t)_{t\in T}$ of this stochastic process $(z_t)_{t\in\mathbb{Z}}$ having a constant time average mean $m_z$
Then, $(z_t)_{t\in\mathbb{Z}}$ is a mean ergodic sequence if
$$ \lim_{N\to\infty} m_z = \mu_z \qquad (4.24) $$
holds in the quadratic mean (Söderström and Stoica, 1989, Appendix B).

Definition 4.13 Given
• A weakly stationary stochastic sequence $(z_t)_{t\in\mathbb{Z}}$ of constant ensemble mean $\mu_z$ and ensemble autocovariance coefficients $\sigma_{zz}(\ell)$ for any shift $\ell \in \mathbb{Z}$
• A realization $(z_t)_{t\in T}$ of this stochastic process $(z_t)_{t\in\mathbb{Z}}$ having a constant time average mean $m_z$ and time average autocovariance coefficients $s_{zz}(\ell)$ for any shift $\ell \in T$
Then, $(z_t)_{t\in\mathbb{Z}}$ is a covariance ergodic sequence if
$$ \lim_{N\to\infty} s_{zz}(\ell) = \sigma_{zz}(\ell) \qquad (4.25) $$
holds in the quadratic mean (Söderström and Stoica, 1989, Appendix B).

Notice that the time average means and autocovariance coefficients can be computed for any time series, not only from realizations of stationary stochastic sequences. However, every ergodic process is necessarily stationary (Theodoridis, 2015, Chapter 2).
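As a rough numerical sanity check of these definitions, the following sketch (an illustration under assumed settings, not code from the book) compares the time average computed from a single realization of an i.i.d. Gaussian sequence with the ensemble average computed across many realizations at a fixed time instant; for such an ergodic process the two should agree.

```python
import numpy as np

rng = np.random.default_rng(1)
N, R = 10_000, 500                    # samples per realization, number of realizations
mu, sigma = 2.0, 1.5                  # assumed ensemble mean and standard deviation
Z = mu + sigma * rng.standard_normal((R, N))

time_avg = Z[0].mean()                # time average m_z from one realization
ensemble_avg = Z[:, 0].mean()         # ensemble average at a fixed time instant
print(time_avg, ensemble_avg, mu)     # all three close to each other for this ergodic process
```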

4.3 Basic Tests of the Residuals

Once it is assumed that the residuals have been generated so that no apparent deviation from (weak) stationarity is present in the corresponding time series $(z_t)_{t\in T}$, it is essential to test whether there is still dependency among the residual samples, in order to know if further modeling efforts are required apart from estimating some means and variances. Before introducing methods to determine models accounting for such a dependency (as we will do in Chap. 5), tests should be run to check if the residuals are uncorrelated or not. The solutions introduced in Chap. 5 are indeed useful if and only if there are still links between some or all of the residual samples. Such tests are abundant in the literature (see, e.g., Brockwell and Davis 2016, Chapter 1 or Box et al. 2016, Chapter 8 and the references therein). In this section, a specific attention is paid to four of them because they can be seen as a good trade-off between implementation complexity and efficiency when univariate time series come into play.

4.3.1 Autocorrelation Function Test

As clearly stated by the name of this test, its main ingredient is the autocorrelation function of the stochastic sequence $(z_t)_{t\in\mathbb{Z}}$. In order to introduce this test efficiently, it is thus necessary to start by defining what an autocorrelation function is. As explained, e.g., in Brockwell and Davis (2016, Chapter 2), the autocorrelation function of a weakly stationary stochastic sequence can be defined as follows.

Definition 4.14 Let us consider a weakly stationary stochastic sequence $(z_t)_{t\in\mathbb{Z}}$ of mean $\mu$. Then,
• The autocovariance coefficient² $\gamma_k$ at lag $k \in \mathbb{N}$ is defined as follows
$$ \gamma_k = \mathbb{E}\{(z_t - \mu)(z_{t+k} - \mu)\}, \quad t \in \mathbb{Z}, \ k \in \mathbb{Z}, \qquad (4.26) $$
and, for all $k \in \mathbb{N}$,
$$ \gamma_{-k} = \gamma_k. \qquad (4.27) $$
• The autocorrelation coefficient $\rho_k$ at lag $k \in \mathbb{Z}$ is defined as follows
$$ \rho_k = \frac{\gamma_k}{\gamma_0}, \quad k \in \mathbb{Z}, \qquad (4.28) $$
for $\gamma_0 = \sigma_z^2 \neq 0$ and, for all $k \in \mathbb{N}$,
$$ \rho_{-k} = \rho_k. \qquad (4.29) $$

Remark 4.5 By definition, the autocorrelation coefficient $\rho_k$ at lag $k \in \mathbb{Z}$ is
$$ \rho_k = \frac{\gamma_k}{\sqrt{\mathbb{E}\{(z_t - \mu)^2\}\, \mathbb{E}\{(z_{t+k} - \mu)^2\}}}, \quad k \in \mathbb{Z}. \qquad (4.30) $$
When stationary sequences come into play, the variances of $z_t$ at times $t$ and $t+k$ are the same and equal to $\gamma_0$, hence Eq. (4.28).

² In the sequel, we deliberately use new notations for the autocovariance coefficients (different from the ones introduced previously) in order to stick to the ones used in the literature when the autocorrelation function tests come into play.

Interesting from a theoretical viewpoint, these definitions, which involve mathematical expectations, are, by construction, conceptual only. In practice, the user has indeed access to a finite time series $(z_t)_{t\in T}$ only. As a consequence, approximations or estimates must be introduced to determine these autocorrelation coefficients from finite data sets. Fortunately, as pointed out in Sect. 4.2.3, when ergodic time series are involved, the time sample mean
$$ m_z = \frac{1}{N} \sum_{i=0}^{N-1} z_i, \qquad (4.31) $$
and the time sample autocovariance and autocorrelation coefficients defined by
$$ g_k = \frac{1}{N} \sum_{i=0}^{N-1-k} (z_{i+k} - m_z)(z_i - m_z), \qquad (4.32a) $$
$$ r_k = \frac{g_k}{g_0}, \quad g_0 = s_z^2 \neq 0, \qquad (4.32b) $$

can be used as accurate sample based estimators. More importantly, as far as the autocorrelation function test is concerned, the following theorem can be stated (Brockwell and Davis, 1991).

Theorem 4.1 Let us consider a weakly stationary stochastic sequence $(z_t)_{t\in\mathbb{Z}}$ and let us assume that this sequence is i.i.d. with finite variance. Then, the sample autocorrelation coefficients $r_k$ at lag $k$, $k \in \mathbb{N}$, are asymptotically i.i.d. and normally distributed with zero mean and variance $1/N$ when $N$ tends to infinity.

Proof For a proof, see Brockwell and Davis (1991, Section 7.3). □

Said differently, if the autocorrelation coefficients $r_k$ at lag $k$, $k \in T$, are computed from a realization $(z_t)_{t\in T}$ of an i.i.d. stochastic sequence $(z_t)_{t\in\mathbb{Z}}$, about 95% of the generated autocorrelation coefficients should be contained between the bounds $\pm 1.96/\sqrt{N}$ for a sufficiently large value of $N$ and $k \in T$. This observation is at the basis of the autocorrelation function test introduced herein. More precisely, by drawing the autocorrelogram of $(z_t)_{t\in T}$ (Box et al., 2016, Chapter 2), i.e., by plotting the sample autocorrelation coefficients $r_k$ versus the lags $k$, $k \in T$, together with the upper and lower bounds corresponding to a user defined significance level, the i.i.d. hypothesis is rejected as soon as more than a few coefficients (e.g., 2 or 3 at most when 40 lags are involved), $r_0 = 1$ excepted, lie outside the bounds, or as soon as one of them falls far outside.
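A minimal sketch of the test just described is given below (illustrative code, not the book's implementation): it computes the sample autocorrelation coefficients of Eq. (4.32) and counts how many of them fall outside the $\pm 1.96/\sqrt{N}$ bounds. The helper name sample_acf and the synthetic residuals are assumptions made for the sake of the example.

```python
import numpy as np

def sample_acf(z, max_lag):
    """Sample autocorrelation coefficients r_k, k = 0..max_lag, with the 1/N convention of Eq. (4.32)."""
    z = np.asarray(z, dtype=float)
    N = z.size
    zc = z - z.mean()
    g0 = np.sum(zc * zc) / N
    return np.array([np.sum(zc[k:] * zc[:N - k]) / N / g0 for k in range(max_lag + 1)])

# Synthetic residuals used as a stand-in for the series under study.
rng = np.random.default_rng(2)
z = rng.standard_normal(100)
r = sample_acf(z, 35)
bound = 1.96 / np.sqrt(z.size)
print(np.sum(np.abs(r[1:]) > bound))   # number of lags (r_0 = 1 excluded) outside the 95% bounds
```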

Fig. 4.1 Autocorrelogram of the residuals with bounds $\pm 1.96/\sqrt{N}$ in dash. Female life expectancy in France

Illustration 4.1 Let us first focus on the analysis of the residuals generated in Illustration 3.1, i.e., when the French women life expectancy is estimated by considering a second order polynomial function (see the lower part of Fig. 3.2 for a plot of the residuals). Having access to the 47 samples of $(z_t)_{t\in T}$, $T = \{0,\cdots,46\}$, the autocorrelogram given in Fig. 4.1 can be drawn easily by focusing on 35 lags only. Because most of the autocorrelation coefficients are within the bounds (in dash in Fig. 4.1), this curve does not reject the i.i.d. hypothesis. Now, let us consider the monthly totals of international airline passengers data introduced in Illustration 3.6, more specifically the residuals generated by subtracting the first five SSA principal components from the raw data (see the lower part of Fig. 3.13 for a plot of the residuals). Once again, by considering 35 lags for the autocorrelogram, Fig. 4.2 shows that there is still dependency among the residual samples.

Fig. 4.2 Autocorrelogram of the residuals with bounds $\pm 1.96/\sqrt{N}$ in dash. Monthly totals of international airline passengers

Remark 4.6 Equation (4.32) involves the factor $\frac{1}{N}$ whereas the number of samples used for computing the autocorrelation coefficients equals $N-k$. The reader may thus wonder why Eq. (4.32) does not use $\frac{1}{N-k}$ instead of $\frac{1}{N}$. In fact, the use of $\frac{1}{N}$ instead of $\frac{1}{N-k}$ can be justified by noticing that (Pollock 1999, Chapter 20, Brockwell and Davis 2016, Chapter 2)
• When $N \to \infty$, both choices give access to the same sample estimates.
• Using $\frac{1}{N}$ guarantees that the corresponding covariance matrix defined as follows
$$ S_{z_{0:N-1} z_{0:N-1}} = \begin{bmatrix} g_0 & \cdots & g_{N-1} \\ \vdots & \ddots & \vdots \\ g_{N-1} & \cdots & g_0 \end{bmatrix} \qquad (4.33) $$
is positive semidefinite (Zhang, 2017, Chapter 1).
These are the main reasons why the factor $\frac{1}{N}$ is favored in the sequel.

4.3.2 Portmanteau Test

Assuming again that the sample autocorrelation coefficients $r_k$ at lag $k$, $k \in T$, are asymptotically i.i.d. and normally distributed, the sum of the squares of the first $h$ sample autocorrelation coefficients should be distributed according to a chi-squared distribution with $h$ degrees of freedom (Kay, 2006, Chapter 10). A statistical hypothesis test based on this chi-squared null distribution is the keystone of the portmanteau test introduced herein. More precisely, as suggested in Ljung and Box (1978), the sum of squares
$$ Q = N \sum_{k=1}^{h} r_k^2, \quad h \in \{0,\cdots,N-1\}, \qquad (4.34) $$
should be distributed according to a chi-squared distribution with $h$ degrees of freedom, i.e., the i.i.d. hypothesis can be rejected at level $\alpha$ if $Q > \chi^2_{1-\alpha}(h)$, where $\chi^2_{1-\alpha}(h)$ is the $1-\alpha$ quantile of the chi-squared distribution with $h$ degrees of freedom (Draper and Smith, 1998, Chapter 6).
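The following sketch (again illustrative, with synthetic residuals) computes the statistic $Q$ of Eq. (4.34) and compares it with the chi-squared quantile delivered by scipy; the function name portmanteau_Q is an assumption of this example.

```python
import numpy as np
from scipy.stats import chi2

def portmanteau_Q(z, h):
    """Q = N * sum_{k=1}^{h} r_k^2 of Eq. (4.34), with r_k computed as in Eq. (4.32)."""
    z = np.asarray(z, dtype=float)
    N = z.size
    zc = z - z.mean()
    g0 = np.sum(zc * zc) / N
    r = np.array([np.sum(zc[k:] * zc[:N - k]) / N / g0 for k in range(1, h + 1)])
    return N * np.sum(r ** 2)

rng = np.random.default_rng(3)
z = rng.standard_normal(47)            # synthetic residuals for the sake of the example
h, alpha = 35, 0.05
Q = portmanteau_Q(z, h)
threshold = chi2.ppf(1 - alpha, df=h)  # (1 - alpha) quantile of the chi-squared distribution
print(Q, threshold, Q > threshold)     # True would reject the i.i.d. hypothesis at level alpha
```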

Illustration 4.2 Again, let us first consider the French female life expectancy data set. By using Eq. (4.34) with $h = 35$ and $N = 47$, we get $Q = 43.2$ whereas $\chi^2_{0.95}(35) = 49.8$. The portmanteau test gives the same conclusion as the former one, i.e., the residuals can be considered as being generated by an i.i.d. stochastic process. Now, if the same test is used with the monthly totals of international airline passengers, we have $Q = 404.8$ whereas, again, $\chi^2_{0.95}(35) = 49.8$. These figures clearly show that, again, the residuals are far from being i.i.d.

4.3.3 Turning Point Test

Before describing this third randomness test, let us define what a turning point is.

Definition 4.15 Consider a time series $(z_t)_{t\in T}$, $T = \{0,\cdots,N-1\}$. Then, there is a turning point at time instant $i \in \{1,\cdots,N-2\}$ if
• Either $z_{i-1} < z_i$ and $z_i > z_{i+1}$
• Or $z_{i-1} > z_i$ and $z_i < z_{i+1}$

As explained in Brockwell and Davis (2016, Chapter 1), the number $T$ of turning points in a time series made of samples of an i.i.d. stochastic process should be asymptotically normally distributed. More specifically,
$$ T \sim \mathrm{As}\mathcal{N}(\mu_T, \sigma_T^2) \quad \text{with} \quad \mu_T = \frac{2(N-2)}{3} \quad \text{and} \quad \sigma_T^2 = \frac{16N - 29}{90}. \qquad (4.35) $$

Again, thanks to this feature of $T$, an i.i.d. null hypothesis test can be performed easily, for which the i.i.d. hypothesis is rejected at a level $\alpha$ as soon as $\frac{|T - \mu_T|}{\sigma_T} > \Phi^{-1}(1 - \alpha/2)$ where, this time, $\Phi^{-1}(1 - \alpha/2)$ is the $1 - \alpha/2$ quantile of the standard normal distribution (Draper and Smith, 1998, Chapter 6).

Illustration 4.3 By using the same data sets as before, we have $\frac{|T - \mu_T|}{\sigma_T} = 1.26$ for the life expectancy residuals and $\frac{|T - \mu_T|}{\sigma_T} = 0.86$ for the air passengers residuals, whereas $\Phi^{-1}(0.975) = 1.96$. These two results indicate that, for both cases, there is not sufficient evidence to reject the i.i.d. hypothesis at level 0.05. Whereas this result reinforces the conclusions drawn previously as far as the life expectancy residuals are concerned, it contradicts the previous finding that the air passengers residuals still contain correlations between samples. The main issue with this randomness test is its emphasis on the number of maxima and minima in the series, i.e., on its time evolution, which completely disregards the amplitude changes that can occur when correlated samples are handled. A quick look at Fig. 3.13 clearly shows that the amplitudes of the residuals between 1950 and 1953 are very different from the values measured between 1958 and 1961, for instance. These important differences are not taken into account by this randomness test. This is probably the reason why this test is often considered as too conservative.
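For completeness, a minimal sketch of the turning point test is given below (illustrative code with synthetic residuals; the function name turning_point_test is an assumption): it counts the turning points of Definition 4.15 and compares the standardized statistic with the normal quantile, as described above.

```python
import numpy as np
from scipy.stats import norm

def turning_point_test(z, alpha=0.05):
    """Count turning points and compare |T - mu_T| / sigma_T with the normal quantile (Eq. 4.35)."""
    z = np.asarray(z, dtype=float)
    N = z.size
    inner = z[1:-1]
    T = np.sum(((inner > z[:-2]) & (inner > z[2:])) | ((inner < z[:-2]) & (inner < z[2:])))
    mu_T = 2.0 * (N - 2) / 3.0
    sigma_T = np.sqrt((16.0 * N - 29.0) / 90.0)
    stat = abs(T - mu_T) / sigma_T
    return T, stat, stat > norm.ppf(1 - alpha / 2)   # True rejects the i.i.d. hypothesis

# Synthetic residuals used as a stand-in for the series under study.
rng = np.random.default_rng(4)
z = rng.standard_normal(47)
print(turning_point_test(z))
```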

4.3.4 Normality Test

Once the i.i.d. hypothesis is tested with the solutions introduced previously, it is interesting to figure out which distribution governs the behavior of the residuals. Again, easy-to-draw graphical tools are introduced herein to assess whether a set of data plausibly comes from some theoretical distribution such as the Gaussian distribution. As we said previously, a specific attention is paid herein to the normal distribution because (i) thanks to the central limit theorem (Papoulis, 2000), in many practical situations, adding independent random variables leads to random variables whose distributions tend toward a normal distribution even if the original variables themselves are not normally distributed, and (ii) the statistical properties of a normally distributed random variable can be summarized by resorting to its mean and variance only. Notice, however, that the tests considered in this section can be adapted to other distributions straightforwardly, as explained, e.g., in Wilk and Gnanadesikan (1968).

The first graphical solution simply consists in drawing the histogram of the residuals, then inspecting visually whether this histogram looks like a normal probability curve (i.e., a standard bell curve). The empirical sample based distribution of the data (the histogram) should indeed be bell-shaped and resemble the normal distribution if the residuals are normally distributed and the sample size is not too small. Because conclusions drawn from the analysis of the residuals histogram can be problematic when small data sets are handled (keep in mind that the histogram appearance strongly depends on the number of data points and the number of bars), a second and complementary solution the user should implement is the normal probability plot (or QQ plot) (Thode, 2002). In a nutshell, this graph is obtained by plotting the residuals on the abscissa in ascending order, accounting for the signs, against the following quantity on the ordinate
$$ s_z \times \Phi^{-1}(p(i)) + m_z, \quad i \in \{1,\cdots,N\}, \qquad (4.36) $$
with (Thode 2002)
$$ p(i) = \frac{i - 3/8}{N + 1/4} \quad \text{when } N \le 10, \qquad (4.37a) $$
$$ p(i) = \frac{i - 1/2}{N} \quad \text{when } N > 10, \qquad (4.37b) $$
$$ \Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-\frac{u^2}{2}}\, du, \qquad (4.37c) $$
where $\Phi(z)$ stands for the cumulative probability distribution of a random variable $z$ normally distributed with zero mean and variance 1, whereas $s_z$ and $m_z$ stand for the time average standard deviation and mean of the residuals, respectively. Thus, the QQ plot is nothing but a comparison of the theoretical or expected quantiles of a normally distributed random variable with the observed quantiles. If the data are consistent with a realization of a normally distributed random variable, the points gathered in a QQ plot should therefore lie close to a straight line.
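A minimal sketch of the QQ plot construction of Eqs. (4.36)–(4.37) follows (illustrative code; qq_points and the synthetic residuals are assumptions): it returns the sorted residuals and the corresponding theoretical ordinates so that they can be plotted against each other.

```python
import numpy as np
from scipy.stats import norm

def qq_points(z):
    """Return (sorted residuals, theoretical ordinates) following Eqs. (4.36)-(4.37)."""
    z = np.asarray(z, dtype=float)
    N = z.size
    i = np.arange(1, N + 1)
    p = (i - 3.0 / 8.0) / (N + 0.25) if N <= 10 else (i - 0.5) / N
    ordinate = z.std() * norm.ppf(p) + z.mean()   # s_z * Phi^{-1}(p(i)) + m_z
    return np.sort(z), ordinate

# Synthetic residuals; points close to the diagonal support the normality hypothesis.
rng = np.random.default_rng(5)
z = rng.standard_normal(47)
x, y = qq_points(z)
print(np.max(np.abs(x - y)))
```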

Illustration 4.4 Let us use again the residuals of the French female life expectancy data set. Whereas the i.i.d. tests considered previously all led to the conclusion that the i.i.d. hypothesis cannot be rejected, the histogram and QQ plot given in Fig. 4.3 show that the normality hypothesis should be rejected.

4.4 Statistical Properties of the Least Squares Estimates

As explained, e.g., in the introduction of Chap. 3, the linear and nonlinear least squares methods for parameter estimation aim at minimizing the cost function $V(\theta)$ given in Eq. (3.1), i.e., at determining the unknown parameter vector $\theta$ by minimizing the sum of squared residuals. As shown through the different illustrations introduced in Chap. 3, these least squares numerical techniques make sense without imposing any stochastic framework for the estimation problem. This pragmatic viewpoint was already highlighted by K. F. Gauss in 1809 (Gauss 1963; see also Ljung 1999, Appendix II):

"In conclusion, the principle that the sum of the squares of the differences between the observed and the computed quantities must be minimum may be considered independently of the calculus of probabilities."


Fig. 4.3 Histogram and QQ plot of the residuals. Female life expectancy in France

This being said, such an observation does not prevent the user from studying the influence of the residuals statistical properties on the quality of the least squares estimates introduced in Chap. 3.

Remark 4.7 The reader must keep in mind that the statistical analysis run in this chapter is valid for least squares estimates involving deterministic regressor matrices $\Phi$ only (as, e.g., in Chap. 3). A discussion on the impact of stochastic regressors on the statistics of least squares estimates is postponed to Chap. 5.

In order to turn to this analysis, let us notice that the way the least squares methods estimate $\theta$ clearly pinpoints an intrinsic link between the residuals $(z_t)_{t\in T}$ and the estimated parameter vector $\hat{\theta}$, whatever the numerical technique used to compute this estimate. Indeed,
$$ \hat{\theta} = \arg\min_{\theta} \sum_{i=0}^{N-1} \| y_i - f(t_i, \theta) \|_2^2 = \arg\min_{\theta} \sum_{i=0}^{N-1} \| z_i(\theta) \|_2^2. \qquad (4.38) $$
Such a link has of course consequences as far as precision and accuracy of the least squares estimators are concerned. Indeed, according to the residuals statistical properties, we can wonder whether the least squares solutions introduced in Chap. 3 can guarantee access to an accurate and precise estimator of $\theta$. This section is thus dedicated to a statistical analysis of the least squares solutions introduced previously in order to answer this legitimate question. More specifically, we will examine whether, under mild conditions on the residuals, the least squares estimators introduced in Chap. 3 (i) are biased or not, (ii) have a minimum variance or not, and (iii) are consistent or not.

However, before giving conditions under which these interesting statistical properties are satisfied, it is first essential to define what is meant by a precise, accurate, and consistent estimator.

Definition 4.16 A statistic is a rule or a function acting on a set of observations, introduced to determine an estimate of a statistical quantity based on observed data. The result of this statistic for a specific observation data set is called the estimate. Once the set of observations is allowed to change randomly, the estimate becomes itself a random variable or vector which is called an estimator. With mathematical notations, for $(z_t)_{t\in T}$ a realization of a stochastic sequence $(z_t)_{t\in\mathbb{Z}}$ and $g(\bullet)$ a user defined statistic, we can define the estimator $\hat{\theta}$ as follows
$$ \hat{\theta} = g((z_t)_{t\in\mathbb{Z}}), \qquad (4.39) $$
whereas the estimate $\hat{\theta}$ satisfies
$$ \hat{\theta} = g((z_t)_{t\in T}). \qquad (4.40) $$

Because an estimator $\hat{\theta}$ is a function which maps a sample space to a set of sample estimates, an estimator is a random vector by definition. Thus, it can be characterized by its moments. The first and second moments of $\hat{\theta}$ lead to the definitions of accuracy and precision of an estimator.

Definition 4.17 By denoting by $\theta_o$ the true parameter vector,³ the bias of $\hat{\theta}$ is defined as
$$ b_{\hat{\theta}} = \mathbb{E}\{\hat{\theta}\} - \theta_o, \qquad (4.41) $$
where the expectation is taken with respect to the probability density function of the stochastic sequence $(z_t)_{t\in\mathbb{Z}}$, i.e.,
$$ \mathbb{E}\{\hat{\theta}\} = \underbrace{\int_{\mathbb{R}} \cdots \int_{\mathbb{R}}}_{N \text{ times}} g(z_0,\cdots,z_{N-1})\, p_{z_0,\cdots,z_{N-1}}(z_0,\cdots,z_{N-1})\, dz_0 \cdots dz_{N-1}. \qquad (4.42) $$
The estimator is (asymptotically) unbiased if $b_{\hat{\theta}} = 0$ (when $N$ tends to infinity). It is (asymptotically) biased otherwise. An estimator is said to be more accurate as its bias is smaller.

Definition 4.18 Given the variance of each component of an estimator $\hat{\theta}$, i.e., given, for any $k \in \{1,\cdots,n_{\theta}\}$,
$$ \sigma_{\hat{\theta}_k}^2 = \mathbb{E}\{(\hat{\theta}_k - \mathbb{E}\{\hat{\theta}_k\})^2\}, \qquad (4.43) $$
where, again, the expectation is taken with respect to the probability density function of the stochastic sequence $(z_t)_{t\in\mathbb{Z}}$, the estimator $\hat{\theta}$ is said to be more precise as all the aforedefined variances $\sigma_{\hat{\theta}_k}^2$, $k \in \{1,\cdots,n_{\theta}\}$, are smaller.

³ Pay attention to the notation! The subscript here is "o," not "0" as for the intercept.

Remark 4.8 It is important to notice that the accuracy depends on the true parameter whereas the precision depends on the estimator only. These two features can be combined by resorting to the mean squared error around the true parameter vector $\theta_o$, i.e.,
$$ \mathbb{E}\{(\hat{\theta} - \theta_o)^{\top}(\hat{\theta} - \theta_o)\} = \mathrm{tr}\left(\mathbb{E}\{(\hat{\theta} - \theta_o)(\hat{\theta} - \theta_o)^{\top}\}\right) = \sum_{k=1}^{n_{\theta}} \mathbb{E}\{(\hat{\theta}_k - \theta_{o_k})^2\}, \qquad (4.44) $$
where $\hat{\theta}_k$ and $\theta_{o_k}$ are the $k$th components of $\hat{\theta}$ and $\theta_o$, respectively. Indeed, because (see, e.g., van den Bos 2007, Chapter 4 for a proof)
$$ \mathbb{E}\{(\hat{\theta}_k - \theta_{o_k})^2\} = \sigma_{\hat{\theta}_k}^2 + b_{\hat{\theta}_k}^2, \quad k \in \{1,\cdots,n_{\theta}\}, \qquad (4.45) $$
where $b_{\hat{\theta}_k}$ is the $k$th component of the bias vector $b_{\hat{\theta}}$,
$$ \mathbb{E}\{(\hat{\theta} - \theta_o)^{\top}(\hat{\theta} - \theta_o)\} = \sum_{k=1}^{n_{\theta}} \sigma_{\hat{\theta}_k}^2 + \sum_{k=1}^{n_{\theta}} b_{\hat{\theta}_k}^2. \qquad (4.46) $$

Such an observation leads to the following definition.

Definition 4.19 An estimator $\hat{\theta}$ is said to be convergent toward the true parameter vector $\theta_o$ in quadratic mean if its mean squared error vanishes asymptotically, i.e.,
$$ \lim_{N\to\infty} \mathbb{E}\{(\hat{\theta} - \theta_o)^{\top}(\hat{\theta} - \theta_o)\} = 0, \qquad (4.47) $$
or, for all $k \in \{1,\cdots,n_{\theta}\}$,
$$ \lim_{N\to\infty} \mathbb{E}\{(\hat{\theta}_k - \theta_{o_k})^2\} = 0. \qquad (4.48) $$
In this case, its elements are asymptotically unbiased and their variances vanish asymptotically (van den Bos, 2007, Chapter 4).

Last but not least, the notion of (asymptotically) consistent estimator can be defined as follows (Pintelon and Schoukens, 2012, Chapter 16).

Definition 4.20 An estimator $\hat{\theta}$ is (asymptotically) consistent if it converges toward the true parameter vector $\theta_o$ as the number of data samples $N$ tends to infinity, i.e., as $N \to \infty$.

Because the (least squares) estimators are stochastic by construction, the definition of consistency can differ according to the selected stochastic limit (Pintelon and Schoukens, 2012, Chapter 16).

Definition 4.21 An estimator $\hat{\theta}$ is weakly consistent if it converges toward the true parameter vector $\theta_o$ in probability as the number of data samples $N$ tends to infinity, i.e., for all $\varepsilon > 0$ and $k \in \{1,\cdots,n_{\theta}\}$,
$$ \lim_{N\to\infty} \Pr(|\hat{\theta}_k - \theta_{o_k}| > \varepsilon) = 0. \qquad (4.49) $$

Definition 4.22 An estimator $\hat{\theta}$ is consistent if it converges toward the true parameter vector $\theta_o$ in quadratic mean as the number of data samples $N$ tends to infinity, i.e., for all $k \in \{1,\cdots,n_{\theta}\}$,
$$ \lim_{N\to\infty} \mathbb{E}\{(\hat{\theta}_k - \theta_{o_k})^2\} = 0. \qquad (4.50) $$

Definition 4.23 An estimator $\hat{\theta}$ is strongly consistent if it converges toward the true parameter vector $\theta_o$ with probability one (or almost everywhere) as the number of data samples $N$ tends to infinity, i.e., for all $k \in \{1,\cdots,n_{\theta}\}$,
$$ \Pr\left(\lim_{N\to\infty} \hat{\theta}_k = \theta_{o_k}\right) = 1. \qquad (4.51) $$

Once these important definitions have been introduced, it is now time to focus on the effect of the residuals statistical properties on the least squares estimators.

4.4.1 Bias, Variance, and Consistency of the Linear Least Squares Estimators

Let us first start with the analysis of the linear least squares estimators introduced in Chap. 3. As shown in Sect. 3.1, the regularized linear least squares estimate, and, by extension, the ordinary linear least squares estimate when $\eta = 0$, is the unique solution of the normal equations (see Chap. 3 for the notations)
$$ (\Phi^{\top}\Phi + \eta I_{n_{\theta} \times n_{\theta}})\, \theta = \Phi^{\top} y. \qquad (4.52) $$
Thus, the regularized linear least squares estimator satisfies
$$ \hat{\theta} = (\Phi^{\top}\Phi + \eta I_{n_{\theta} \times n_{\theta}})^{-1} \Phi^{\top} y, \qquad (4.53) $$

where
$$ y = y_{0:N-1} = \begin{bmatrix} y_0 & \cdots & y_{N-1} \end{bmatrix}^{\top} \in \mathbb{R}^{N \times 1}. \qquad (4.54) $$
After recalling that
• The residuals $(z_t)_{t\in T}$ and the observations $(y_t)_{t\in T}$ are linked as follows
$$ y_t = z_t + d_t, \quad t \in T, \qquad (4.55) $$
where $(d_t)_{t\in T}$ stands for the deterministic part of the time series (the trends and seasonality, for instance).
• The regressor matrices $\Phi$ considered in Chap. 3 are all made of deterministic⁴ components.
• The residuals $(z_t)_{t\in T}$ are realizations of zero mean white noise sequences $(z_t)_{t\in\mathbb{Z}}$.
The following theorems can be stated.

Theorem 4.2 By assuming that
• The stochastic sequence $(z_t)_{t\in\mathbb{Z}}$ generating the residuals is zero mean.
• $\Phi$ is deterministic and has full column rank.
• A unique $\theta_o \in \mathbb{R}^{n_{\theta} \times 1}$ exists such that, for all $t \in \mathbb{Z}$,
$$ d_t = \phi_t^{\top} \theta_o, \qquad (4.56) $$
then, the regularized linear least squares estimator satisfies
$$ \mathbb{E}\{\hat{\theta}\} = (\Phi^{\top}\Phi + \eta I_{n_{\theta} \times n_{\theta}})^{-1} \Phi^{\top}\Phi\, \theta_o. \qquad (4.57) $$

Proof Because $\Phi$ is a deterministic matrix, we have
$$ \mathbb{E}\{\hat{\theta}\} = (\Phi^{\top}\Phi + \eta I_{n_{\theta} \times n_{\theta}})^{-1} \Phi^{\top}\, \mathbb{E}\{y\}. \qquad (4.58) $$
Thanks to the zero mean assumption for $(z_t)_{t\in\mathbb{Z}}$, we can easily see that
$$ \mathbb{E}\{y\} = \mathbb{E}\{z\} + d = d, \qquad (4.59) $$
whereas
$$ d = \Phi\, \theta_o. \qquad (4.60) $$

⁴ The entries of $\Phi$ are linear and nonlinear functions of time samples which are assumed to be known a priori (thus not measured).

Combining these equations leads to Eq. (4.57) straightforwardly. □

Theorem 4.3 Under the assumptions of Theorem 4.2, the ordinary least squares estimator is unbiased.

Proof For $\eta = 0$ in Eq. (4.57), we get
$$ \mathbb{E}\{\hat{\theta}\} = (\Phi^{\top}\Phi)^{-1} \Phi^{\top}\Phi\, \theta_o = \theta_o. \qquad (4.61) $$

Theorem 4.4 Under the assumptions of Theorem 4.2, the regularized least squares estimator is biased.

Proof With $\eta > 0$ in Eq. (4.57),
$$ \mathbb{E}\{\hat{\theta}\} = (\Phi^{\top}\Phi + \eta I_{n_{\theta} \times n_{\theta}})^{-1} \Phi^{\top}\Phi\, \theta_o \neq \theta_o. \qquad (4.62) $$
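Theorems 4.3 and 4.4 can be checked numerically with a short Monte Carlo sketch such as the one below (an illustration under assumed settings: a polynomial regressor, a hypothetical $\theta_o$, and arbitrary values of $\eta$ and of the noise level): averaged over many noise realizations, the ordinary least squares estimates center on $\theta_o$ whereas the ridge estimates do not.

```python
import numpy as np

rng = np.random.default_rng(6)
N, eta, sigma = 50, 5.0, 0.5
t = np.arange(N)
Phi = np.column_stack([np.ones(N), t, t ** 2])   # deterministic regressor matrix
theta_o = np.array([1.0, 0.2, -0.01])            # hypothetical "true" parameter vector
d = Phi @ theta_o

ols, ridge = [], []
for _ in range(2000):
    y = d + sigma * rng.standard_normal(N)       # zero mean white noise disturbance
    A = Phi.T @ Phi
    ols.append(np.linalg.solve(A, Phi.T @ y))                     # eta = 0
    ridge.append(np.linalg.solve(A + eta * np.eye(3), Phi.T @ y)) # eta > 0

print(np.mean(ols, axis=0) - theta_o)    # close to zero: unbiased
print(np.mean(ridge, axis=0) - theta_o)  # noticeably nonzero (most visibly on the intercept): biased
```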

Theorem 4.5 By assuming that
• The stochastic sequence $(z_t)_{t\in\mathbb{Z}}$ generating the residuals is a zero mean white noise with finite variance $\sigma^2$.
• $\Phi$ is deterministic and has full column rank.
• A unique $\theta_o \in \mathbb{R}^{n_{\theta} \times 1}$ exists such that, for all $t \in \mathbb{Z}$,
$$ d_t = \phi_t^{\top} \theta_o, \qquad (4.63) $$
then, the covariance matrix of the regularized linear least squares estimator is equal to
$$ \Sigma_{\theta\theta,\eta} = \mathbb{E}\{(\hat{\theta} - \mathbb{E}\{\hat{\theta}\})(\hat{\theta} - \mathbb{E}\{\hat{\theta}\})^{\top}\} = \sigma^2 (\Phi^{\top}\Phi + \eta I_{n_{\theta} \times n_{\theta}})^{-1} \Phi^{\top}\Phi\, (\Phi^{\top}\Phi + \eta I_{n_{\theta} \times n_{\theta}})^{-1}, \qquad (4.64) $$
whereas, for the ordinary least squares,
$$ \Sigma_{\theta\theta,\eta=0} = \sigma^2 (\Phi^{\top}\Phi)^{-1}. \qquad (4.65) $$

Proof By using the result of Eq. (4.57) as well as Eq. (4.53), we can show that
$$ \hat{\theta} - \mathbb{E}\{\hat{\theta}\} = (\Phi^{\top}\Phi + \eta I_{n_{\theta} \times n_{\theta}})^{-1} \Phi^{\top} (y - \Phi\,\theta_o) = (\Phi^{\top}\Phi + \eta I_{n_{\theta} \times n_{\theta}})^{-1} \Phi^{\top} z \qquad (4.66) $$

with
$$ z = z_{0:N-1} = \begin{bmatrix} z_0 & \cdots & z_{N-1} \end{bmatrix}^{\top} \in \mathbb{R}^{N \times 1}. \qquad (4.67) $$
Equation (4.64) is then obtained straightforwardly by using the symmetry of $(\Phi^{\top}\Phi + \eta I_{n_{\theta} \times n_{\theta}})^{-1}$ and noticing that the covariance matrix of $z_{0:N-1}$ satisfies
$$ \Sigma_{z_{0:N-1} z_{0:N-1}} = \sigma^2 I_{N \times N}. \qquad (4.68) $$
The equation for the ordinary least squares is obtained when $\eta = 0$. □
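The covariance formulas (4.64) and (4.65) and the ordering proved just below can be verified numerically, as in the following sketch (illustrative regressors and arbitrary values of $\eta$ and $\sigma^2$, not taken from the book): the difference between the ordinary and the regularized covariance matrices turns out to be positive semidefinite.

```python
import numpy as np

N, eta, sigma2 = 50, 2.0, 0.3
t = np.linspace(0.0, 1.0, N)
Phi = np.column_stack([np.ones(N), t, t ** 2])   # deterministic regressor matrix
A = Phi.T @ Phi
W = np.linalg.inv(A + eta * np.eye(3))

cov_ols = sigma2 * np.linalg.inv(A)              # Eq. (4.65)
cov_ridge = sigma2 * W @ A @ W.T                 # Eq. (4.64)
diff = cov_ols - cov_ridge
print(np.linalg.eigvalsh(diff) >= -1e-12)        # all True: the difference is positive semidefinite
```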

Whereas the regularization has a negative impact as far as bias is concerned, it is interesting to notice that the parameter $\eta$ can be used to decrease the covariance matrix of the least squares estimator. More specifically, with $W_{\eta} = (\Phi^{\top}\Phi + \eta I_{n_{\theta} \times n_{\theta}})^{-1}(\Phi^{\top}\Phi)$, we have
$$ \Sigma_{\theta\theta,\eta=0} - \Sigma_{\theta\theta,\eta\neq 0} = \sigma^2 \left[ (\Phi^{\top}\Phi)^{-1} - W_{\eta\neq 0}\, (\Phi^{\top}\Phi)^{-1}\, W_{\eta\neq 0}^{\top} \right] = \sigma^2 W_{\eta\neq 0} \left[ W_{\eta\neq 0}^{-1} (\Phi^{\top}\Phi)^{-1} W_{\eta\neq 0}^{-\top} - (\Phi^{\top}\Phi)^{-1} \right] W_{\eta\neq 0}^{\top} = \sigma^2 W_{\eta\neq 0} \left[ 2\eta (\Phi^{\top}\Phi)^{-2} + \eta^2 (\Phi^{\top}\Phi)^{-3} \right] W_{\eta\neq 0}^{\top}, \qquad (4.69) $$
which finally leads to
$$ \Sigma_{\theta\theta,\eta=0} - \Sigma_{\theta\theta,\eta\neq 0} = \sigma^2 (\Phi^{\top}\Phi + \eta I_{n_{\theta} \times n_{\theta}})^{-1} \left[ 2\eta I_{n_{\theta} \times n_{\theta}} + \eta^2 (\Phi^{\top}\Phi)^{-1} \right] (\Phi^{\top}\Phi + \eta I_{n_{\theta} \times n_{\theta}})^{-\top}. \qquad (4.70) $$
Because this resulting matrix is positive semidefinite (Meyer, 2000, Chapter 7), we thus prove that
$$ \Sigma_{\theta\theta,\eta=0} \ge \Sigma_{\theta\theta,\eta\neq 0}, \qquad (4.71) $$
i.e., the covariance matrix of the ridge estimator is always smaller than the covariance matrix of the ordinary least squares estimator. This observation is at the basis of many recent developments in machine learning where biased estimators are favored for regression or classification problems because, thanks to the decrease of the covariance matrix of the estimator, the mean squared error cost function is decreased in the end. For an interesting discussion about machine learning and the impact of regularization on mean squared error cost function minimization, see, e.g., Theodoridis (2015).

Last but not least, the ordinary linear least squares estimator is consistent and asymptotically normal.

Theorem 4.6 By assuming that
• The stochastic sequence $(z_t)_{t\in\mathbb{Z}}$ generating the residuals is a zero mean white noise with finite variance $\sigma^2$.
• $\Phi$ is deterministic and has full column rank.
• A unique $\theta_o \in \mathbb{R}^{n_{\theta} \times 1}$ exists such that, for all $t \in \mathbb{Z}$,
$$ d_t = \phi_t^{\top} \theta_o, \qquad (4.72) $$
then, the ordinary linear least squares estimator satisfies
$$ \hat{\theta} \to \theta_o, \qquad (4.73) $$
when $N \to \infty$, where the convergence is meant to be with probability one, whereas
$$ \sqrt{N}\,(\hat{\theta} - \theta_o) \to \mathcal{N}(0, P), \qquad (4.74) $$
and
$$ \mathrm{cov}(\hat{\theta}) \to P_{\hat{\theta}} = \sigma^2 R_{\Phi}^{-1}, \qquad (4.75) $$
when $N \to \infty$, with $P = \sigma^2 \big(\tfrac{1}{N} R_{\Phi}\big)^{-1}$ and $R_{\Phi} = \lim_{N\to\infty} \Phi^{\top}\Phi$, where the convergence is meant to be in distribution (Söderström and Stoica, 1989, Appendix B).

Proof See Ljung and Glad (2016, Section 11.4) or Ljung (1999, Appendix II), Söderström and Stoica (1989, Section 7.5), Theodoridis (2015, Section 6.3).

Remark 4.9 When the stochastic sequence $(z_t)_{t\in\mathbb{Z}}$ has a Gaussian distribution, the ordinary linear least squares estimator is also Gaussian and centered on $\theta_o$ without requiring that $N$ tends to infinity (Ljung, 1999, Appendix II).

As shown, e.g., in Söderström and Stoica (1989, Chapter 4), an unbiased estimate of $\sigma^2$ can be obtained by computing
$$ \hat{s}^2 = \frac{1}{N - n_{\theta}} \| y - \Phi\,\hat{\theta} \|_2^2. \qquad (4.76) $$

Illustration 4.5 In order to illustrate the results gathered in Theorem 4.6, let us tackle the problem of determining the temperature of a room with a low cost sensor. More precisely, let us assume that
• The temperature of the room is $\theta_o = 25\,^{\circ}\mathrm{C}$.
• The measured temperature is disturbed by a zero mean white noise $(z_t)_{t\in T}$, i.e., for $t \in \{0,\cdots,N-1\}$,
$$ y_t = \theta_o + z_t. \qquad (4.77) $$
First of all, let us assume that
• 100 realizations of this measurement noise can be generated, with the constraint that this output disturbance is normally distributed with zero mean and unit variance.
• Experiments with $N = 10$, $N = 10^2$, $N = 10^3$, and $N = 10^4$ can be carried out.
For each data set, the linear least squares approach introduced in Chap. 3 can be used to estimate the room temperature, then the estimated temperatures can be compared with the true value $\theta_o = 25\,^{\circ}\mathrm{C}$. As clearly shown in Fig. 4.4,
• The estimated temperatures are homogeneously scattered around the true value.
• The dispersion of the estimates decreases with $N$. More precisely, as shown in Fig. 4.5, the standard deviation decreases by a factor of $\sqrt{10}$ when $N$ is multiplied by 10.
Both observations perfectly illustrate the fact that, under the practical conditions given in Theorem 4.6, the linear least squares estimator is consistent whereas the uncertainty decreases with $N$ (with a factor $1/\sqrt{N}$ more specifically).
Let us now turn to the asymptotic distribution of the estimate. In order to illustrate Eq. (4.74), let us assume that the measurement noise is now uniformly distributed, i.e., $z_t \sim \mathcal{U}(-1, 1)$. Then, by considering this time $10^5$ realizations (instead of 100) and $N = 1$, $N = 2$, $N = 4$, and $N = 8$, respectively, the histograms of the estimated temperatures can be drawn straightforwardly, then compared with the normal distribution given in Eq. (4.74). As shown in Fig. 4.6, whereas the histogram for $N = 1$ is strongly different from the Gaussian distribution given in Eq. (4.74), the convergence toward the expected normal distribution is very fast, as illustrated by the case $N = 8$, for instance. Notice that, for generating the (co)variance of the estimator $\hat{\theta}$, the value of $\sigma^2$ has been determined by using the fact that, for a uniform distribution $\mathcal{U}(a, b)$, $\sigma^2 = (b - a)^2/12$ (Papoulis, 2000, Chapter 4).
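A minimal sketch in the spirit of this illustration is given below (it is not the code used to produce Figs. 4.4–4.6; the seed and the number of runs are arbitrary): for the constant-regressor model (4.77), the least squares estimate reduces to the sample mean, and the spread of the estimates shrinks roughly as $1/\sqrt{N}$.

```python
import numpy as np

rng = np.random.default_rng(7)
theta_o, n_runs = 25.0, 100
for N in (10, 100, 1000, 10000):
    # For y_t = theta_o + z_t, the least squares estimate is the sample mean of the measurements.
    estimates = [np.mean(theta_o + rng.standard_normal(N)) for _ in range(n_runs)]
    print(N, np.std(estimates))   # standard deviation shrinks roughly by sqrt(10) at each step
```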

Fig. 4.4 Estimated room temperature $\hat{\theta}$ for $N = 10$, $N = 10^2$, $N = 10^3$, and $N = 10^4$ by considering 100 experiments each time. Linear case

Fig. 4.5 Standard deviation of $\hat{\theta}$ vs. $N$. Linear case

Fig. 4.6 Impact of $N$ on the data based pdf of $\hat{\theta}$. Linear case

4.4.2 Bias, Variance, and Consistency of the Nonlinear Least Squares Estimators

As clearly shown in Sect. 3.2, the nonlinear least squares optimization algorithms can be viewed as iterative solutions based on local linear least squares results. Indeed, the Gauss–Newton or Levenberg–Marquardt algorithms iteratively update regularized or ordinary linear least squares solutions where the Jacobian matrix $F|_{\theta}$ plays the role of the regression matrix $\Phi$. It is thus not surprising to have the following result (van den Bos, 2007, Chapter 5).

Theorem 4.7 By assuming that
• The stochastic sequence $(z_t)_{t\in\mathbb{Z}}$ generating the residuals is a zero mean white noise with finite variance $\sigma^2$.
• A unique $\theta_o \in \mathbb{R}^{n_{\theta} \times 1}$ exists such that, for all $t \in \mathbb{Z}$,
$$ z_t = y_t - f(t, \theta_o). \qquad (4.78) $$
• $F|_{\theta}$ is deterministic and has full column rank,
then, the nonlinear least squares estimator satisfies
$$ \hat{\theta} \to \theta_o = \arg\min_{\theta} \mathbb{E}\{\| y - f(t, \theta) \|_2^2\}, \qquad (4.79) $$
when $N \to \infty$, where the convergence is meant to be with probability one, whereas
$$ \sqrt{N}\,(\hat{\theta} - \theta_o) \to \mathcal{N}(0, P), \qquad (4.80) $$
and
$$ \mathrm{cov}(\hat{\theta}) \to P_{\hat{\theta}} = \sigma^2 R_F^{-1}, \qquad (4.81) $$
when $N \to \infty$, with $P = \sigma^2 \big(\tfrac{1}{N} R_F\big)^{-1}$ and $R_F = \lim_{N\to\infty} F|_{\theta_o}^{\top} F|_{\theta_o}$, where the convergence is meant to be in distribution (Söderström and Stoica, 1989, Appendix B).

Proof See Ljung and Glad (2016, Section 11.4) or van den Bos (2007, Section 5.10), Söderström and Stoica (1989, Section 7.5), Johansson (1993, Chapter 6).

Remark 4.10 In Theorem 4.7, it is assumed that a unique $\theta_o \in \mathbb{R}^{n_{\theta} \times 1}$ exists such that, for all $t \in \mathbb{Z}$,
$$ z_t = y_t - f(t, \theta_o), \qquad (4.82) $$
i.e., "the true system" generating the data belongs to the model class. If not, i.e., if the model class does not contain the true system, $\hat{\theta}$ will at least converge toward the parameter vector⁵ corresponding to the best model in the selected model class (Ljung and Glad, 2016, Section 11.4). This is an important robustness property of the nonlinear least squares estimator.

⁵ Of course, if the minimizer of $\mathbb{E}\{\| y - f(t, \theta) \|_2^2\}$ is not unique, the resulting parameter vector will be selected among the values of the minimizers set.

Remark 4.11 This important result is valid for a Gauss–Newton minimization technique but also for a Levenberg–Marquardt algorithm because, under standard practical conditions, the regularization parameter of the Levenberg–Marquardt algorithm is tiny, i.e., $\eta \approx 0$, when a local minimum is reached. This is the main reason why the parameter $\eta$ does not appear in the former theorem even when the Levenberg–Marquardt algorithm is considered.

As for the linear least squares case, an unbiased estimate of $\sigma^2$ can be obtained by computing this time (van den Bos, 2007, Chapter 5)
$$ \hat{s}^2 = \frac{1}{N - n_{\theta}} \| y - f(t, \hat{\theta}) \|_2^2. \qquad (4.83) $$

1 y − f (t, θˆ )22 . N − nθ

(4.83)

Illustration 4.6 We tackle again the problem of estimating the temperature of a room with a low cost sensor as we did in Illustration 4.5. However, contrary to Illustration 4.5, the selected sensor is nonlinear, i.e., yt = θ02 + zt ,

.

(4.84) (continued)

course, if the minimizer of .E {y − f (t, θ)22 } is not unique, the resulting parameter vector will be selected among the values of the minimizers set.

5 Of

104

4 Least Squares Estimators and Residuals Analysis

6

6

5.5

5.5

5

5

4.5

4.5

4

20

40

60

80

100

4

6

6

5.5

5.5

5

5

4.5

4.5

4

20

40

60

80

100

4

20

40

60

80

100

20

40

60

80

100

Fig. 4.7 Estimated room temperature .θˆ for .N = 10, .N = 102 , .N = 103 , and .N = 104 by considering 100 experiments each time. Nonlinear case

Thus, if the room temperature is $25\,^{\circ}\mathrm{C}$, the true value for $\theta_0$ is $5\,^{\circ}\mathrm{C}$. By running simulations similar to the ones considered in Illustration 4.5, i.e., by using
• First, 100 realizations of zero mean white Gaussian noises of variance 10 for $N = 10$, $N = 10^2$, $N = 10^3$, and $N = 10^4$.
• Second, $10^5$ realizations of uniformly distributed output disturbances between $-1$ and 1 for $N = 1$, $N = 2$, $N = 4$, and $N = 8$.
Figures 4.7, 4.8, and 4.9 can be generated. These curves clearly illustrate the results gathered in Theorem 4.7, i.e., the nonlinear least squares estimator is unbiased and normally distributed asymptotically.

Fig. 4.8 Standard deviation of $\hat{\theta}$ vs. $N$. Nonlinear case

Fig. 4.9 Impact of $N$ on the data based pdf of $\hat{\theta}$. Nonlinear case

4.4.3 Least Squares Statistical Properties Validation with the Bootstrap Method

Whereas the statistical properties of the least squares estimators gathered in Theorems 4.6 and 4.7 are very attractive theoretically, they all assume that the data set size is sufficiently large to guarantee their asymptotic accuracy.


Unfortunately, as shown in the former sections and chapters, real time series are often short (usually by construction but also to ensure the stationarity of the underlying stochastic process). Thus, the constraint that the number of data samples used for model learning tends to infinity is violated in many practical cases. In order to bypass this short data constraint and still generate reliable estimates of the statistical outcomes of Theorems 4.6 and 4.7 (like the covariance matrix $P_{\hat{\theta}}$), it is suggested hereafter to resort to the bootstrap method introduced by Efron in 1979.

Roughly speaking, as claimed in Zoubir and Boashash (1998), "the bootstrap does with a computer what the experimenter would do in practice if it were possible: he or she would repeat the experiment." Based on this claim, our idea for least squares statistical properties validation consists in
• Generating B "new" time series with the bootstrap principle.
• Computing B least squares estimates from which empirical sample based means and covariance matrices can be generated straightforwardly.
• Comparing these empirical estimates with the asymptotic values gathered in Theorems 4.6 and 4.7, respectively.
In order to describe our approach more precisely, let us start by introducing the bootstrap principle itself. As any resampling method (James et al., 2017, Chapter 5), the bootstrap method draws new training sets from an initial time series, then fits a model to each new training set in order to generate additional information about the estimated model. More precisely, the standard bootstrap method randomly draws new samples from a candidate distribution assumed to be close to the unknown distribution generating the initial time series, then determines important characteristics of the estimated model by resorting to empirical sample based estimates. One common way to determine the candidate cumulative distribution function in the bootstrap literature (Zoubir and Boashash, 1998; Politis, 1998) consists in using the available time series to generate an empirical sample based cumulative distribution function $\hat{F}$. More precisely, by assuming that the available data is made of i.i.d. samples drawn from a population characterized by a cumulative distribution function F, the bootstrap paradigm suggests that the samples used for computing $\hat{F}$ themselves constitute the underlying distribution F. Thus, in order to be valid, the bootstrap method requires ensuring that the initial training set is representative of the whole population. As far as our model learning problem is concerned, by assuming that (Zoubir and Boashash, 1998; Politis, 1998)
• The observed residual samples $(z_t)_{t\in T}$ are good representatives of the true residuals stochastic sequence $(z_t)_{t\in\mathbb{Z}}$, so that the following empirical cumulative distribution function
$$ \hat{F}_{z_t}(z) = \frac{\#\{z_i \le z\}}{N} \qquad (4.85) $$
is a reliable estimate of the cumulative distribution function of $(z_t)_{t\in\mathbb{Z}}$.
• The residual samples $(z_t)_{t\in T}$ do not reject the i.i.d. tests introduced in Sect. 4.3.
Our least squares statistical properties validation procedure consists in
• Estimating a least squares parameter vector $\hat{\theta}$ from the available time series by resorting to the techniques described in Chap. 3.
• Generating the residuals time series $(z_t)_{t\in T}$, then testing it to guarantee that it does not reject the i.i.d. tests introduced in Sect. 4.3.

• Generating B new residuals time series $(z_t^b)_{t\in T}$ by randomly selecting, with replacement, N samples from the available time series $(z_t)_{t\in T}$.
• Simulating B new estimated model outputs by adding each of the B new residuals time series $(z_t^b)_{t\in T}$ to $f(t, \hat{\theta})$.
• Estimating B least squares parameter vectors $\hat{\theta}^b$ from each of the B new training sets by resorting to the techniques described in Chap. 3.
• Determining the empirical sample based mean and covariance matrix of the estimated parameter vector from the B estimated $\hat{\theta}^b$.
By using this procedure, $P_{\hat{\theta}}$, for instance, can be approximated without knowing the true distribution of the residuals time series $(z_t)_{t\in T}$, as illustrated hereafter (see also the sketch below).

Remark 4.12 As pointed out in Zoubir and Boashash (1998), Politis (1998), the bootstrap method is valid only if the residuals stochastic sequence $(z_t)_{t\in\mathbb{Z}}$ is i.i.d. Of course, in many practical cases (see, e.g., Sect. 4.3.2), this assumption is far from being satisfied. Fortunately, by resorting to models of the residuals as described, e.g., in Chap. 5, different multi-step procedures can be suggested in order to get, in the end, samples which can be "bootstrapped" (see, e.g., Freedman 1984; Tjärnström and Ljung 2002).
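The procedure above can be sketched in a few lines of code (an illustration under assumed settings: a synthetic second order polynomial model with an arbitrary noise level, not the book's data). The residuals of an initial fit are resampled with replacement, B new series are refitted, and the empirical covariance of the B estimates is compared with $\hat{s}^2(\Phi^{\top}\Phi)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(8)
N, B = 47, 5000
t = np.arange(N)
Phi = np.column_stack([np.ones(N), t, t ** 2])           # deterministic regressor matrix
y = Phi @ np.array([70.0, 0.3, -0.002]) + 0.2 * rng.standard_normal(N)   # synthetic data

theta_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]       # initial least squares fit
z = y - Phi @ theta_hat                                  # residuals (should pass the i.i.d. tests)

boot = np.empty((B, Phi.shape[1]))
for b in range(B):
    zb = rng.choice(z, size=N, replace=True)             # resample residuals with replacement
    yb = Phi @ theta_hat + zb                            # new bootstrap time series
    boot[b] = np.linalg.lstsq(Phi, yb, rcond=None)[0]

P_boot = np.cov(boot, rowvar=False)                      # empirical covariance of the B estimates
s2 = np.sum(z ** 2) / (N - Phi.shape[1])                 # Eq. (4.76)
P_theory = s2 * np.linalg.inv(Phi.T @ Phi)
print(np.max(np.abs(P_boot - P_theory)))                 # small relative to the entries of P_theory
```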

Illustration 4.7 Let us consider again the French female life expectancy time series (see, e.g., Fig. 3.2) and more precisely the model learning problem involving a second order polynomial function (see Illustration 3.1 for more details about this model learning phase). As shown in Illustrations 4.1 and 4.4, such a model leads to residuals which are zero mean, uncorrelated, and almost Gaussian distributed. Then, according to Theorem 4.6, the covariance matrix of $\hat{\theta}$ should be asymptotically equal to $P_{\hat{\theta}} = \sigma^2 (\Phi^{\top}\Phi)^{-1}$, which can be approximated by $\hat{P}_{\hat{\theta}} = \hat{s}^2 (\Phi^{\top}\Phi)^{-1}$ when the residuals variance $\sigma^2$ is unknown. For this specific time series and with a second order polynomial model, we get
$$ \hat{s}^2 = 3.317 \times 10^{-2}, \qquad (4.86) $$
and
$$ \hat{P}_{\hat{\theta}} = \begin{bmatrix} 3.335 \times 10^{5} & -3.348 \times 10^{2} & 8.405 \times 10^{-2} \\ -3.348 \times 10^{2} & 3.362 \times 10^{-1} & -8.439 \times 10^{-5} \\ 8.405 \times 10^{-2} & -8.439 \times 10^{-5} & 2.118 \times 10^{-8} \end{bmatrix}. \qquad (4.87) $$
Because the residuals do not reject the i.i.d. hypothesis, as shown in Illustrations 4.1 and 4.4, let us try to validate this approximated covariance matrix by resorting to the bootstrap procedure introduced previously.

Fig. 4.10 Empirical distributions of the B bootstrap estimated parameter vectors $\hat{\theta}^b$. Female life expectancy in France

More precisely, by considering $B = 5000$ bootstrap time series, let us generate B estimated vectors $\hat{\theta}^b$. Then, by resorting to standard empirical sample based estimates, we get
$$ \hat{P}_{\hat{\theta}^b} = \begin{bmatrix} 3.076 \times 10^{5} & -3.089 \times 10^{2} & 7.754 \times 10^{-2} \\ -3.0895 \times 10^{2} & 3.102 \times 10^{-1} & -7.786 \times 10^{-5} \\ 7.754 \times 10^{-2} & -7.786 \times 10^{-5} & 1.954 \times 10^{-8} \end{bmatrix}, \qquad (4.88) $$
which is close to the theoretical covariance matrix $\hat{P}_{\hat{\theta}}$, and thus validates the asymptotic result given in Theorem 4.6 even with a 49 sample time series. Notice finally that, as shown in Fig. 4.10, the B estimated vectors $\hat{\theta}^b$ seem to be Gaussian distributed, as expected from Theorem 4.6, whereas, as shown in Fig. 4.11, where 10 instances of the B estimated vectors $\hat{\theta}^b$ have been (randomly) selected to generate 10 model outputs, the small values in $\hat{P}_{\hat{\theta}^b}$ or $\hat{P}_{\hat{\theta}}$ lead to a small variability in the model responses.

Fig. 4.11 Ten samples of model responses (in light grey) generated by selecting 10 instances among the B bootstrap estimated vectors $\hat{\theta}^b$. The red curve corresponds to the estimated model output generated from the initial time series. Female life expectancy in France

4.4.4 From Least Squares Statistical Properties to Confidence Intervals

Estimated Parameters Confidence Intervals

As clearly stated in Theorems 4.6 and 4.7, on average,
• The least squares solutions are close to the expected value (of course under the practical conditions listed in these theorems) and, more importantly, closer and closer to it as the number of samples used for model learning increases.
• The variability of the least squares solutions is encapsulated in the covariance matrices $P_{\hat{\theta}}$ of each theorem.
Interesting from a theoretical viewpoint, these statistical results unfortunately cannot guarantee that the estimated vector (generated from a specific data set) is the true parameter vector even if
• The residuals data set is a (finite) realization of a zero mean white noise with finite variance $\sigma^2$.
• The regressor matrix is deterministic.
• The system generating the data set is in the model class.
They indeed involve statistical averaging and, more importantly, infinite data sets, conditions which cannot be guaranteed in practice. A smart way to bypass the lack of consistency guarantees when finite data sets are handled consists in generating estimated values with uncertainty certificates.

Said differently, such an uncertainty on the estimated parameters (inherent to the involvement of finite data sets) leads the user to favor, in practice, the construction of intervals or sets of values which are highly likely to contain the true parameter vector $\theta_o$, instead of seeking a single vector of values designated as the estimate of the parameter vector of interest. In order to gauge and quantify the confidence we can place in the estimated parameters, these sets of values are usually $(1-\alpha) \times 100\%$ confidence intervals characterized by a lower and an upper bound determined such that $\theta_o$ is somewhere between these two bounds with a probability $1-\alpha$. In other words, the construction of confidence intervals consists in computing $l(z)$ and $u(z)$ such that
$$ \Pr(l(z) \le \theta_o \le u(z)) = 1 - \alpha. \qquad (4.89) $$

Of course, the narrower the confidence interval is, the more accurately we can specify the estimate of a parameter of interest.

Because the least squares estimators $\hat{\theta}$ are asymptotically normally distributed around $\theta_o$ with a shape controlled by the covariance matrix $P_{\hat{\theta}}$, each component $\hat{\theta}_i$, $i \in \{1,\cdots,n_{\theta}\}$, of $\hat{\theta}$ is normally distributed as well with, this time, a mean equal to $\theta_{o_i}$, $i \in \{1,\cdots,n_{\theta}\}$, and a variance equal to $p_{ii}$, $i \in \{1,\cdots,n_{\theta}\}$, where $p_{ii}$ stands for the $i$th diagonal element of $P_{\hat{\theta}}$. Consequently, as shown, e.g., in Klein and Morelli (2006, Chapter 5) or Leon-Garcia (2008, Chapter 8), for each $i \in \{1,\cdots,n_{\theta}\}$, the interval
$$ \left[ \hat{\theta}_i - r\sqrt{p_{ii}},\ \hat{\theta}_i + r\sqrt{p_{ii}} \right] \qquad (4.90) $$
contains $\theta_{o_i}$ with probability $1 - 2Q(r)$ with
$$ Q(r) = \frac{1}{\sqrt{2\pi}} \int_{r}^{\infty} e^{-\frac{t^2}{2}}\, dt, \qquad (4.91) $$
i.e., the probability of the "tail" of the Gaussian probability density function. Thus, if we let $r_{\alpha/2}$ be the value such that $\alpha/2 = Q(r_{\alpha/2})$, then the $(1-\alpha) \times 100\%$ confidence interval for each $\theta_{o_i}$, $i \in \{1,\cdots,n_{\theta}\}$, is given by
$$ \left[ \hat{\theta}_i - r_{\alpha/2}\sqrt{p_{ii}},\ \hat{\theta}_i + r_{\alpha/2}\sqrt{p_{ii}} \right]. \qquad (4.92) $$
With $\alpha = 0.05$, $r_{\alpha/2} = 1.96$ for a Gaussian distribution. Thus, the 95% confidence interval for each $i \in \{1,\cdots,n_{\theta}\}$ becomes
$$ \left[ \hat{\theta}_i - 1.96\sqrt{p_{ii}},\ \hat{\theta}_i + 1.96\sqrt{p_{ii}} \right]. \qquad (4.93) $$

By recalling that the covariance matrices $P_{\hat{\theta}}$ introduced in Theorems 4.6 and 4.7 both depend on $\sigma^2$ explicitly, these confidence intervals are valid only if $\sigma$ is known a priori. When $\sigma$ is estimated, i.e., with $\sigma \approx \hat{s}$,
$$ t_i = \frac{\hat{\theta}_i - \theta_{o_i}}{\sqrt{\hat{p}_{ii}}} \qquad (4.94) $$
has a Student's t-distribution with $N - n_{\theta}$ degrees of freedom (Papoulis, 2000, Chapter 7). Thus, in most practical cases, the $(1-\alpha) \times 100\%$ confidence interval for each parameter $\hat{\theta}_i$, $i \in \{1,\cdots,n_{\theta}\}$, writes
$$ \left[ \hat{\theta}_i - t_{\alpha/2, N-n_{\theta}}\sqrt{\hat{p}_{ii}},\ \hat{\theta}_i + t_{\alpha/2, N-n_{\theta}}\sqrt{\hat{p}_{ii}} \right], \qquad (4.95) $$
where $t_{\alpha/2, N-n_{\theta}}$ stands for user selected t-distribution values available in tables (see, e.g., Beyer 2018). For instance, for $\alpha = 0.05$ and $N - n_{\theta} = 20$, $t_{0.025,20} = 2.086$, whereas $t_{0.025,100} = 1.984$ for $\alpha = 0.05$ and $N - n_{\theta} = 100$.
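A minimal sketch of the interval (4.95) is given below (illustrative code on synthetic data; in practice $y$ and $\Phi$ come from the time series and model under study, and the t quantile is obtained from scipy rather than from tables).

```python
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(9)
N, alpha = 47, 0.05
tt = np.arange(N)
Phi = np.column_stack([np.ones(N), tt, tt ** 2])
y = Phi @ np.array([70.0, 0.3, -0.002]) + 0.2 * rng.standard_normal(N)   # synthetic data

theta_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]
n_theta = Phi.shape[1]
s2 = np.sum((y - Phi @ theta_hat) ** 2) / (N - n_theta)   # Eq. (4.76)
P_hat = s2 * np.linalg.inv(Phi.T @ Phi)
half = student_t.ppf(1 - alpha / 2, N - n_theta) * np.sqrt(np.diag(P_hat))
for i in range(n_theta):
    print(f"theta_{i+1}: [{theta_hat[i] - half[i]:.4g}, {theta_hat[i] + half[i]:.4g}]")
```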

Illustration 4.8 Let us go on with the residuals analysis started in Illustration 4.7. By using the estimated covariance matrix $\hat{P}_{\hat{\theta}}$ given in Eq. (4.87), the following 95% confidence intervals for the parameters composing the second order polynomial function involved in our model learning problem can be computed easily
$$ \theta_{o_1} \in \left[ -8.5293 \times 10^{3},\ -6.5213 \times 10^{3} \right], \qquad (4.96a) $$
$$ \theta_{o_2} \in \left[ 6.0923,\ 8.1084 \right], \qquad (4.96b) $$
$$ \theta_{o_3} \in \left[ -2.0603 \times 10^{-3},\ -1.5542 \times 10^{-3} \right]. \qquad (4.96c) $$
Notice that these intervals are in perfect line with the histograms available in Fig. 4.10. Using $\hat{P}_{\hat{\theta}^b}$ instead of $\hat{P}_{\hat{\theta}}$ leads to
$$ \theta_{o_1} \in \left[ -8.5357 \times 10^{3},\ -6.5576 \times 10^{3} \right], \qquad (4.97a) $$
$$ \theta_{o_2} \in \left[ 6.1336,\ 8.1194 \right], \qquad (4.97b) $$
$$ \theta_{o_3} \in \left[ -2.0618 \times 10^{-3},\ -1.5634 \times 10^{-3} \right], \qquad (4.97c) $$
i.e., confidence intervals equivalent to those generated with $\hat{P}_{\hat{\theta}}$, as expected.

It is interesting to point out that the aforedefined Student's t-distribution based confidence intervals can also be used to test whether each term involved in the regressor $\phi_t$ contributes effectively to the model learning capabilities. The linear models considered until now to describe the trend and/or seasonality of a time series may indeed be more effective after the deletion of one or more of the variables already in the model. Whereas such a problem has already been studied from a macroscopic viewpoint in Sect. 3.3, thanks to specific validation procedures or to cost functions penalizing complex models, the analysis of the real contribution of, e.g., each power of t in a polynomial function of an order determined via the solutions introduced in Sect. 3.3 has not been discussed so far. Such an analysis can be run easily by testing whether the corresponding estimated parameter is not "too small". Indeed, in statistics, the significance of an individual variable (or regressor) $\phi_i$, $i \in \{1,\cdots,n_{\theta}\}$, in a regression model can be tested by determining the significance of the corresponding regression coefficient $\hat{\theta}_i$, $i \in \{1,\cdots,n_{\theta}\}$, i.e., by testing the hypotheses (Montgomery et al., 2011, Chapter 4)
$$ H_0 : \hat{\theta}_i = 0, \qquad H_1 : \hat{\theta}_i \neq 0. \qquad (4.98) $$
As shown, e.g., in Klein and Morelli (2006, Chapter 5) or Montgomery et al. (2015, Chapter 3), this significance testing can be performed by comparing (again) with $t_{\alpha/2, N-n_{\theta}}$ the statistic
$$ \bar{t}_i = \frac{\hat{\theta}_i}{\sqrt{\hat{p}_{ii}}}. \qquad (4.99) $$
The null hypothesis should indeed be rejected if $|\bar{t}_i| > t_{\alpha/2, N-n_{\theta}}$. A comparison of Eqs. (4.94) and (4.99) directly shows that, for $i \in \{1,\cdots,n_{\theta}\}$, $\bar{t}_i = t_i$ when $\theta_{o_i} = 0$. Thus, rejecting (at a level of significance $\alpha$) the null hypothesis that $\hat{\theta}_i = 0$, $i \in \{1,\cdots,n_{\theta}\}$, is equivalent to noticing that its $(1-\alpha) \times 100\%$ confidence interval does not include zero.

Estimated Outputs Confidence Intervals Let us now focus on the determination of confidence intervals for the estimated model outputs. When the estimated outputs satisfies ˆ yˆt = φ  t θ , t ∈ T,

.

(4.100)

i.e., when linear least squares are handled, we easily get that the random variable .yˆ t for any .t ∈ T is also asymptotically normally distributed around .φ  t θ o and has a

4.4 Statistical Properties of the Least Squares Estimates

113

variance .σyˆ2 equal to .φ  t P θˆ φ t (Klein and Morelli, 2006, Chapter 5). Indeed, t

σyˆ2 = E{(yˆ t − yt )2 }, t

 ˆ   ˆ = E{(φ  t θ − φ t θ o )(φ t θ − φ t θ o ) }, .

(4.101)

 ˆ ˆ = φ t E{(θ − θ o )(θ − θ o ) }φ t ,

= φ t P θˆ φ t . When nonlinear linear least squares come into play, i.e., when yˆt = f (t, θˆ ),

(4.102)

.

things are more complicated because, even if the nonlinear least squares estimator θˆ is also asymptotically normally distributed around .θ o , the shape of which is governed by .P θˆ , a nonlinear transformation of Gaussian processes unfortunately does not maintain Gaussianity. In order to bypass this difficulty, especially when the nonlinear function .f (•, •) is complicated and/or noninvertible, it is often suggested resorting to a Taylor series expansion of .f (•, •), then considering

.

ˆ = f (t, θ o + δθ ) ≈ f (t, θ o ) + ˆ t = f (t, θ) .y



∂f (t, θ o ) ∂θ



δθ ,

(4.103)

where .δθ ∈ Rnθ ×1 stands for a small zero mean Gaussian variation having a nθ ×1 is the gradient of .f (•, •) with covariance matrix .P θˆ whereas . ∂f ∂θ (t, θ o ) ∈ R respect to .θ . Indeed, combining this first order approximation with the fact that Gaussianity is preserved with any affine transformation leads to a random variable ˆ t for any .t ∈ T which is asymptotically normally distributed around .f (t, θ o ) with .y     ∂f (t, θ ) P (t, θ ) . a variance .σyˆ2 equal to . ∂f o o ˆ ∂θ θ ∂θ t

Remark 4.13 Of course, in practice, we do not have access to .θ o . Thus, the best we can do is to use  σˆ yˆ2 =

.

t

∂f (t, θˆ ) ∂θ



Pˆ θˆ



∂f (t, θˆ ) ∂θ

 (4.104)

instead. ˆ once means and variances are available for any As done with the estimator .θ, .t ∈ T, the construction of confidence intervals for the estimated outputs can be performed straightforwardly. Hence, for .t ∈ T, the .(1 − α) × 100% confidence interval for .yˆt writes  .

 yˆt − tα/2,N −nθ σˆ yˆ t , yˆt + tα/2,N −nθ σˆ yˆ t .

(4.105)

114

4 Least Squares Estimators and Residuals Analysis

86

84

82

80

78

76

74

1970

1975

1980

1985

1990

1995

2000

2005

2010

2015

Fig. 4.12 Estimated outputs with confidence intervals. Female life expectancy in France

Illustration 4.9 By considering again the French female life expectancy time series, we can easily compute .95% confidence intervals for each .yˆt with .t ∈ {1968, 1969, · · · , 2016}. Such confidence intervals directly lead to the curves and error bars given in Fig. 4.12. This figure illustrates the fact that, with the least squares solutions introduced so far, the estimated time series is reliable as supported by the narrow error bars in Fig. 4.12.

When the bootstrap method can be used, the availability of B bootstrap estimated b vectors .θˆ makes the determination of the mean and variance of the estimated model outputs a lot easier. Indeed, when the bootstrap estimates are available, the mean and variance of any estimated model outputs can be determined by simply resorting to B evaluations of the model outputs under study, then by computing empirical sample based means and variances. This easy-to-implement empirical solution is again a good way to test the validity of the theoretical results introduced previously. b Illustration 4.10 By using the 5000 estimated parameter vectors .θˆ generated in Illustration 4.8, 5000 estimated model outputs can be determined for each .t ∈ {1968, 1969, · · · , 2016}. Then, thanks to standard empirical sample based variance estimates, the estimated model output quality can be assessed

(continued)

4.4 Statistical Properties of the Least Squares Estimates

115

86

84

82

80

78

76

74

1970

1975

1980

1985

1990

1995

2000

2005

2010

2015

Fig. 4.13 Estimated outputs with confidence intervals with bootstrap. Female life expectancy in France

Illustration 4.10 (continued) straightforwardly as shown in Fig. 4.13. As expected for this specific data set, this bootstrap based solution yields confidence intervals equivalent to those generated in Illustration 4.9 (see Fig. 4.12).

Predicted Outputs Confidence Intervals In addition to giving access to reliable estimates of the trend and seasonalities of a time series, the deterministic models introduced so far can also be used for prediction, i.e., for forecasting future and unseen time series samples from current and past values. Being able to forecast accurate future values of a time series is indeed essential when tasks like planning or production control come into play. Again, in order to assess the risks a decision maker or an engineer takes by relying on the predicted time series values, it is paramount to quantify the predicted outputs reliability. Resorting to confidence intervals is once more a good solution to assess the prediction uncertainty. Given the estimate .θˆ determined from the “past” time series samples .(yt )t∈T , i.e., for .t ∈ {0, · · · , N −1}×Ts , the predicted “future” output at time .tnew = Nnew (×Ts ) with .Nnew ∈ N and .Nnew ≥ N can be calculated straightforwardly by evaluating ˆ yˆtnew = φ  tnew θ,

.

(4.106)

116

4 Least Squares Estimators and Residuals Analysis

where .φ tnew stands for the regressor .φ t evaluated at .t = tnew . Whereas this value is similar to the estimated outputs introduced in the former section, its prediction error differs from Eq. (4.101). Indeed, because .yˆ tnew and .ytnew are statistically independent, σyˆ2

tnew

= E{(yˆ tnew − ytnew )2 },   ˆ   ˆ = E{(ztnew − (φ  tnew θ − φ tnew θ o ))(ztnew − (φ tnew θ − φ tnew θ o )) },

.

 ˆ ˆ = E{z2tnew } + φ  tnew E{(θ − θ o )(θ − θ o ) }φ tnew ,

(4.107)

= σ 2 + φ tnew P θˆ φ tnew . Thus, when predicted outputs come into play, the variance of .yˆ tnew − ytnew is the sum of the residuals variance .σ 2 and the variance of the estimated output .yˆ tnew given in Eq. (4.101). Remark 4.14 Of course, these lines are valid as well when nonlinear models are involved. In this case,  E{(yˆ tnew − ytnew )2 } ≈ σ 2 +

.

∂f (tnew , θˆ ) ∂θ



Pˆ θˆ



 ∂f (tnew , θˆ ) . ∂θ

(4.108)

From this variance expression, using (again) an estimated residuals variance .sˆ 2 , the prediction confidence interval becomes .

  yˆtnew − tα/2,N −nθ σˆ yˆ2

tnew

 + sˆ 2 , yˆtnew + tα/2,N −nθ σˆ yˆ2

tnew

 + sˆ 2 .

(4.109)

Illustration 4.11 Let us consider again the French women life expectancy time series and let us try to predict the life expectancy values from 2017 to 2030 with confidence intervals. As shown in Fig. 4.14, the second order polynomial function estimated from past samples, i.e., for .t ∈ {1968, · · · , 2016}, has good prediction capabilities. The .95% confidence intervals (represented here via the error bars) are indeed narrow even when the selected is 2030. Notice in Fig. 4.14 the impact of including .sˆ in the evaluation of the confidence intervals for prediction. Equivalent results are obtained with the bootstrap based solution as illustrated in Fig. 4.15.

4.5 Wold Decomposition

117

88

86

84

82

80

78

76

74

1970

1980

1990

2000

2010

2020

2030

Fig. 4.14 Estimated and predicted outputs with confidence intervals. Female life expectancy in France 88

86

84

82

80

78

76

74

1970

1980

1990

2000

2010

2020

2030

Fig. 4.15 Estimated and predicted outputs with confidence intervals with bootstrap. Female life expectancy in France

4.5 Wold Decomposition In order to make a smooth transition between the least squares statistical analysis introduced in Sect. 4.4 and the residuals modeling solutions described in the next chapters, let us now introduce the following paramount theorem, called the

118

4 Least Squares Estimators and Residuals Analysis

Wold decomposition theorem (Katayama, 2005, Chapter 4), which states how the residuals can be efficiently described once they are assumed to be observations of a weak stationary stochastic sequence. Theorem 4.8 Given a univariate weak stationary stochastic sequence .(zt )t∈Z , this discrete time process can be uniquely written as the sum of two sequences, i.e., zt = wt + vt , t ∈ Z,

.

(4.110)

where

 2 • .wt = ∞ j =0 hj et−j with .(et )t∈Z a zero mean white noise with finite variance .σ and .hj ∈ R, .j ∈ N. • .(vt )t∈Z is a linearly singular process, i.e., for all .t ∈ Z, .vt can be completely determined from linear functions of its past values. • .(vt )t∈Z is uncorrelated with .(wt )t∈Z and .(et )t∈Z for any .t ∈ Z, i.e., .E{wt vs } = 0 and  .E{et vs2} = 0 for all t and .s ∈ Z. • . ∞ j =0 |hj | < ∞ (stability). • .hj = 0 for .j ∈ Z− (causality) and .h0 = 1. • The .hj , .j ∈ N, are constant (i.e., independent of t) and unique. Proof See Brockwell and Davis (2016, Section II) and Katayama (2005, Section 4). Said differently, if the residuals satisfy the aforementioned covariance stationary time series conditions, the Wold decomposition theorem entails the interesting feature that the dynamical evolution of .(zt )t∈Z (and by extension .(zt )t∈T ) can be approximated by a linear model accurately. This theorem has thus a strong impact as far as model learning and prediction are concerned. Indeed, from the available observations .(yt )t∈T , if it is possible to • Determine the deterministic components accurately6 from .(yt )t∈T , and assume that they embed the aforementioned linearly singular process .(vt )t∈Z . • Remove these deterministic components to generate the residuals .(zt )t∈T . • Assume that the stochastic process generating the residuals is weak stationary and satisfies zt = wt , t ∈ Z,

.

(4.111)

 with .wt = ∞ j =0 hj et−j and .(et )t∈Z a zero mean white noise with finite variance 2 .σ . • Estimate from the residuals realization .(zt )t∈T the aforementioned unique scalars .hj , .j ∈ N, accurately.

6 See

Chap. 3 for reliable solutions.

References

119

• Characterize the zero mean white noise .(et )t∈Z efficiently by determining its variance .σ 2 , then the time series .(yt )t∈T will be learned efficiently and reliable inference will be conceivable. This is the main reason why this theorem is the keystone of the developments introduced in the next chapters.

4.6 Take Home Messages • Adding statistical assumptions on the observations is essential to explain the errors in the estimated parameters. • In order to extract statistical properties of the residuals from one (short) realization, guaranteeing the stationarity and ergodicity of the residuals sequences is paramount. • Contrary to the regularized linear least squares estimator, the linear least squares estimator is unbiased. • Regularization can reduce the estimator variance considerably when compared to the ordinary least squares solution. • Thanks to the strong link between regularized least squares and nonlinear optimization techniques like the Levenberg–Marquardt algorithm, consistency properties of the linear least squares are shared by the nonlinear least squares estimator asymptotically. • Asymptotic consistency of the least squares estimators help the user determine confidence interval for the estimated parameters as well as the estimated and predicted model outputs. • When possible, bootstrap solutions can be implemented to assess the accuracy of the asymptotic statistical properties of the least squares estimators. • Once the deterministic components of a time series is available, the remaining part can be described by linear difference equations involving constant coefficients and zero mean white noise sequences only thanks to the Wold decomposition theorem.

References K. Aström, Introduction to Stochastic Control Theory (Dover, Illinois, 2006) W. Beyer (ed.), CRC Handbook of Tables for Probability and Statistics (CRC Press, Boca Raton, 2018) A. Björck, Numerical Methods for Least Squares Problems (SIAM, Philadelphia, 1996) G. Box, G. Jenkins, G. Reinsel, G. Ljung, Time Series Analysis: Forecasting and Control (Wiley, Hoboken, 2016) P. Brockwell, R. Davis, Time Series: Theory and Methods (Springer, Berlin, 1991) P. Brockwell, R. Davis, Introduction to Time Series and Forecasting (Springer, Berlin, 2016) N. Draper, H. Smith, Applied Regression Analysis (Wiley, Hoboken, 1998)

120

4 Least Squares Estimators and Residuals Analysis

B. Efron, Bootstrap methods. another look at the Jackknife. Ann. Stratist. 7(0), 1–26 (1979) D. Freedman, On bootstrapping two stage least squares estimates in stationary linear models. Ann. Statist. 12, 827–842 (1984) K.F. Gauss, Theoria Motus corporum Celestium (Theory of the Motion of Heavenly Bodies) (Dover, Illinois, 1963) G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning with Applications in R (Springer, Berlin, 2017) R. Johansson, System Modeling and Identification (Prentice Hall, Hoboken, 1993) R. Johnson, D. Wichern, Applied Multivariate Statistical Analysis (Prentice Hall, Hoboken, 2002) T. Katayama, Subspace Methods for System Identification (Springer, Berlin, 2005) S. Kay, Intuitive Probability and Random Processes Using MATLAB (Springer, Berlin, 2006) V. Klein, E. Morelli, Aircraft System Identification: Theory and Practice (American Institute of Aeronautics and Astronautics, Reston, 2006) A. Leon-Garcia, Probability, Statistics, and Random Processes for Electrical Engineering (Pearson, London, 2008) G. Ljung, G. Box, On a measure of lack of fit in time series models. Biometrika 65, 297–303 (1978) L. Ljung, System Identification. Theory for the User (Prentice Hall, Hoboken, 1999) L. Ljung, T. Glad, Modeling and Identification of Dynamic Systems (Studentlitteratur, Lund, 2016) C. Meyer, Matrix Analysis and Applied Linear Algebra (SIAM, Philadelphia, 2000) D. Montgomery, G. Runger, N. Faris Hubele, Engineering Statistics (Wiley, Hoboken, 2011) D. Montgomery, C. Jennings, M. Kulahci, Introduction to Time Series Analysis and Forecasting (Wiley, Hoboken, 2015) A. Nandi (ed.), Blind Estimation Using Higher Order Statistics (Springer, Berlin, 1989) A. Papoulis, Probability, Random Variables, and Stochastic Processes (McGraw-Hill Europe, Irvine, 2000) R. Pintelon, J. Shoukens, System Identification: A Frequency Domain Approach (Wiley, Hoboken, 2012) D. Politis, Computer intensive methods in statistical analysis. IEEE Signal Process. Mag. 15, 39–55 (1998) D. Pollock. Handbook of Time Series Analysis, Signal Processing and Dynamics (Academic, Cambridge, 1999) T. Söderström, P. Stoica, System Identification (Prentice Hall, Hoboken, 1989) S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective (Academic, Cambridge, 2015) H. Thode, Testing for Normality (CRC Press, Boca Raton, 2002) F. Tjärnsträm, L. Ljung, Using the bootstrap to estimate the variance in the case of undermodeling. IEEE Trans. Autom. Control 47, 395–398 (2002) A. van den Bos, Parameter Estimation for Scientists and Engineers (Wiley, Hoboken, 2007) M. Wilk, R. Gnanadesikan, Probability plotting methods for the analysis of data. Biometrika 55, 1–17 (1968) X. Zhang, Matrix Analysis and Applications (Cambridge University, Cambridge, 2017) A. Zoubir, B. Boashash, The bootstrap and its application in signal processing. IEEE Signal Process. Mag. 15, 56–76 (1998)

Chapter 5

Residuals Modeling with AR and ARMA Representations

By assuming that the residuals time series is a realization of a zero mean weak stationary stochastic sequence which does not contain linearly singular components anymore, the Wold decomposition theorem states that the underlying stochastic sequence .(zt )t∈Z satisfies zt =

∞ 

.

hj et−j , hj ∈ R, j ∈ N,

(5.1)

j =0

with .(et )t∈Z a zero mean white noise with finite variance .σ 2 . Thus, by using the backward shift operator introduced in Chap. 2, we can write that ∞ 

zt =

.

hj q−j et = h(q)et ,

(5.2)

j =0

with h(q) =

∞ 

.

hj q−j .

(5.3)

j =0

This way of expressing .(zt )t∈Z in terms of .h(q) and .(et )t∈Z is called an input output representation (Oppenheim et al., 2014) whereas h(z) =

∞ 

.

hj z−j , z ∈ C,

(5.4)

j =0

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 G. Mercère, Data Driven Model Learning for Engineers, https://doi.org/10.1007/978-3-031-31636-4_5

121

122

5 Residuals Modeling with AR and ARMA Representations

is called the transfer function of the system generating the output signal .(zt )t∈Z from the input signal .(et )t∈Z (Oppenheim et al., 2014). In addition, as stated in Theorem 4.8, because  2 • . ∞ j =0 |hj | < ∞. • .hj = 0 for .j ∈ Z− . • The .hj , .j ∈ N, are independent of t. The system having the aforedefined transfer function is said to be causal, linear, time invariant, and (bounded input bounded output) stable (Hsu, 2019, Chapter 1), i.e., • Its output at an arbitrary time .t0 depends on its past input samples only, i.e., on input samples at time .t ≤ t0 . • Given zt = h(q)et and z˘ t = h(q)e˘ t ,

(5.5)

h(q)(αet + β e˘ t ) = αzt + β z˘ t ,

(5.6)

.

we have .

where .α and .β are arbitrary scalars. • A time shift in the input signal causes the same time shift in the output signal. • For any bounded input, the corresponding output is also bounded. Thanks to this linear time invariant input output representation stemming directly from the Wold decomposition theorem, it clearly appears that there is a strong relation between linear dynamical system theory and stationary time series modeling. Using such a link to model the residuals dynamics is the main goal of this chapter. More precisely, thanks to the linear time invariant representation given in Eq. (5.2), the residuals model learning problem boils down from now on to a parameter estimation problem, i.e., determination of the weights .hj ∈ R, .j ∈ N, from the realization .(zt )t∈T of .(zt )t∈Z . Such a model learning problem is nothing but what is called black box model identification in the automatic control community (Söderström and Stoica, 1989; Ljung, 1999). Black box model identification can indeed be defined as a set of techniques borrowed from statistics, numerical optimization, and linear algebra which are combined to build mathematical models of dynamical systems using mainly measurements of the input and output signals of the system. More specifically, the process of black box model identification requires • Data sets • A model structure • An estimation method to determine accurate values of the adjustable parameters in the candidate model structure

5.1 From Transfer Functions to Linear Difference Equations

123

• A validation step which evaluates if the estimated model is adequate for the application needs Because the first two items of this list are already available or have been defined in the former chapters, the rest of this chapter is dedicated to the last two steps of the black box model identification procedure only, i.e., the estimation of .hj , .j ∈ N, from available realizations .(zt )t∈T of .(zt )t∈Z , then the validation of the estimated model. Notice that the system identification theory has also been developed for model structures a lot more complicated than the input output representation considered in this document. For interested readers, please refer to (Goodwin and Payne, 1977), Söderström and Stoica (1989), Ljung (1999), Nelles (2000), Verhaegen and Verdult (2007), Tóth (2010), Lovera (2014) and the references therein. Hereafter, attention is paid to the estimation of scalar parameters of univariate AR and ARMA input output representations only. The rest of this chapter is more precisely organized as follows. After some time spent on the transition between the Wold decomposition transfer function given in Eq. (5.1) and the AR and ARMA model parameterizations in Sect. 5.1, Sect. 5.2 focuses on AR and ARMA parameter estimation, i.e., describes .(i) standard numerical solutions for computing the AR and ARMA coefficients, .(ii) the main statistical properties of these estimation techniques. Section 5.3 introduces a method based on the partial autocorrelation function to help the user select which model structure (among the AR and ARMA model representations) is the most appropriate parameterization for residuals model learning. Forecasting with AR or ARMA models is finally the main topic of Sect. 5.4. Again, Sect. 5.5 concludes this chapter with the main points to remember.

5.1 From Transfer Functions to Linear Difference Equations Before introducing specific numerical solutions for the estimation of the unknown parameters of the residuals input output representation given in Eq. (5.2), let us first look at the model structure dictated by the Wold decomposition theorem a bit closer. Interesting because it involves a stable, causal, linear time invariant transfer function, the model structure given in Eq. (5.4) suffers from the requirement of an infinite number of unknown parameters to describe the residuals behavior accurately. In order to relax this practical constraint, it is convenient to assume that • The transfer function .h(z) is rational, i.e., h(z) =

.

b(z) , a(z)

(5.7)

124

5 Residuals Modeling with AR and ARMA Representations

with a(z) =

na 

.

aj z−j , na ∈ N, aj ∈ R for j ∈ {0, · · · , na }, .

(5.8a)

bj z−j , nb ∈ N, bj ∈ R for j ∈ {0, · · · , nb },

(5.8b)

j =0

b(z) =

nb  j =0

such that – .a(z) = 0 for all .|z| ≤ 1 to guarantee (Brockwell and Davis, 2016, Chapter 3) * The existence of this rational function * The stationarity, uniqueness1 , and causality of the stochastic process .(zt )t∈Z generated by this transfer function when it is excited by a zero mean white noise .(et )t∈Z with finite variance – .a0 = b0 = 1 to guarantee a certain uniqueness of the system description • The model representation given in Eq. (5.7) is irreductible (Kailath, 1980, Chapter 6), i.e., there are no common factors in the polynomial functions .a(z) and .b(z) . • The model representation given in Eq. (5.7) is minimum phase, i.e., the inverse of .h(z) exists, is causal and stable, condition which is guaranteed if and only if .b(z) = 0 for all .|z| ≤ 1 (Brockwell and Davis, 2016, Chapter 3). The interesting consequences of these assumptions for black box model identification are the following. First, the fact that finite-dimensional representations are handled in the sequel reduces the number of unknown parameters to a finite value .na + nb without losing the mimicking capabilities of the model representation. Second, dealing with irreductible input output representations guarantees that the system representation does not contain redundant dynamics, thus .(a(z), b(z)) is a minimal fraction description of .h(z) (Hannan and Deistler, 1988, Chapter 2). Third, the minimum phase condition ensures that .h(z) has a stable and causal inverse which allows to determine a realization .(et )t∈T of a zero mean white noise sequence .(et )t∈Z from any realization .(zt )t∈T . By a straightforward combination of the former equations, the input output representation given in Eq. (5.1) becomes a linear difference equation a(q)zt = b(q)et ,

.

these stationarity and uniqueness properties, it is sufficient to guarantee that .a(z) = 0 for all = 1.

1 For .|z|

(5.9)

5.2 AR and ARMA Model Learning

125

or, written differently, for .na ∈ N∗ , .nb ∈ N∗ and .a0 = b0 = 1, zt = et +

nb 

.

bj et−j −

na 

j =1

aj zt−j .

(5.10)

j =1

Then, the following definition can be introduced. Definition 5.1 If .(et )t∈Z is a zero mean white noise with finite variance .σ 2 whereas a(z) = 0 for all |z| ≤ 1, .

(5.11a)

b(z) = 0 for all |z| ≤ 1, .

(5.11b)

.

a0 = b0 = 1,

(5.11c)

then • The model representation given in Eq. (5.10) is called an Auto Regressive Moving Average (ARMA) representation or model of the aforedefined dynamical, causal, minimum phase, stable and linear time invariant system. • The unique stochastic process .(zt )t∈Z generated by this model is called a stationary, causal, stable and invertible ARMA process. An important subclass is the Auto Regressive (AR) representation when .nb = 0, i.e., zt = et −

na 

.

aj zt−j .

(5.12)

j =1

Similarly, the stochastic process .(zt )t∈Z generated by this model is called a stationary, causal and stable AR process. In order to emphasize the dependency of the AR and ARMA model structures on the orders .na and .nb of the polynomial functions .a(z) and .b(z), the standard notations .ARMA(na , nb ) and .AR(na ) are used in the sequel (see, e.g., Brockwell and Davis, 2016, Chapter 3). These notations are thus introduced to characterize the model structure explicitly but also to qualify the stochastic sequences .(zt )t∈Z generated by these models and, by extension, any realization .(zt )t∈T of these stochastic processes.

5.2 AR and ARMA Model Learning Once the AR and ARMA model structures have been introduced to describe the residuals dynamics, let us now turn to solutions for estimating the unknown a ,nb parameters .{ai , bj }ni=1,j =1 from the available realizations of .(zt )t∈Z .

126

5 Residuals Modeling with AR and ARMA Representations

5.2.1 AR Model Parameters Estimation Parameters Estimation with Linear Least Squares Let us first focus on the AR model. More precisely, let us assume that the residuals time series .(zt )t∈T is a realization of a zero mean .AR(na ) stochastic process. Then, at any time .t ∈ Z, we have zt = et −

na 

.

aj zt−j ,

(5.13)

j =1

where .(et )t∈Z is a zero mean white noise with finite variance. Interesting for its a linearity in terms of the unknown parameters .{ai }ni=1 , the main issue with this difference equation is the presence of .et , i.e., an unpredictable component at time t. In order to bypass this difficulty, a standard solution in estimation theory Ljung (2002) simply consists in ignoring the noise contribution in Eq. (5.13) in order to introduce the following predictor zˆ t = −

na 

.

aj zt−j ,

(5.14)

j =1 a then estimating the unknown parameters .{ai }ni=1 such that the prediction error sequence .(t )t∈Z defined as

t = zt − zˆ t

.

(5.15)

is a zero mean white noise sequence as .(et )t∈Z is. Notice indeed that, when the a parameters .{ai }ni=1 are known, .t = et for any .t ∈ Z. Unfortunately, in practice, we never have a direct access to the stochastic process .(zt )t∈Z . At best, a reliable realization .(zt )t∈T of it is available. Under such practical a circumstances, the estimation of the parameters .{ai }ni=1 can be performed again by using a linear least squares approach, i.e., by discarding the underlying probabilistic a framework and determining .{aˆ i }ni=1 so that the prediction error realization .(t )t∈T is as small as possible at any .t ∈ T (Söderström and Stoica, 1989, Chapter 7) with, now, for .t ∈ T, zˆ t = −

na 

.

aj zt−j , .

(5.16a)

j =1

t = zt − zˆ t .

(5.16b)

5.2 AR and ARMA Model Learning

127

More precisely, by resorting to the following vectors and matrices    = na · · · N −1 ∈ R(N −na )×1 , .   z = zna · · · zN −1 ∈ R(N −na )×1 , .   φ t = −zt−1 · · · −zt−na ∈ Rna ×1 , .   Φ = φ na · · · φ N −1 ∈ R(N −na )×na , .

(5.17a) (5.17b) (5.17c) (5.17d)

and   θ AR = a1 · · · ana ∈ Rna ×1 ,

.

(5.18)

a our linear least squares solution for determining estimates of .{ai }ni=1 consists in minimizing the 2 norm (Boyd and Vandenberghe, 2018, Chapter 3) of the prediction error .t at any .t ∈ T, i.e., minimizing the AR cost function

VAR (θ AR ) = 22 = z − Φθ AR 22 = (z − Φθ AR ) (z − Φθ AR ).

.

(5.19)

Up to some small changes of notations, this loss function is equivalent to the ones introduced in Chap. 3 (see, e.g., Eq. (3.16)) when linear least squares solutions were introduced for trend and seasonal patterns modeling. Thus, any ordinary least squares tools introduced in Chap. 3 can be used for this AR model learning problem straightforwardly. More precisely, by assuming that .N > na whereas .Φ has full column rank, the least squares estimate .θˆ AR is the solution of the normal equations Φ  z = Φ  Φθ AR ,

(5.20)

θˆ AR = (Φ  Φ)−1 Φ  z.

(5.21)

.

i.e., .

Such an observation allows us to conclude that all the linear least squares solutions and techniques introduced in the former chapters can be used for estimating .θˆ AR effectively. For instance, the QR factorization or SVD based techniques described in Sect. 3.1 can be adapted for computing .θˆ AR directly from .Φ and .z. The model complexity selection and validation tools introduced in Sects. 3.3 and 4.3 can also be extended straightforwardly to select, e.g., the AR model order .na or test the statistical characteristics of the time series .(t )t∈T . Such direct extensions are validated in the following illustration.

128

5 Residuals Modeling with AR and ARMA Representations

Table 5.1 AIC test for AR model order selection. CO.2 concentration at Mauna Loa Observatory, Hawaii AR order AIC

1 −1002.52

AR order AIC

5

6

7

8

.−1016.08

.−1015.88

.−1015.39

.−1014.44

2 −1015.59

3 −1018.34

4 −1017.76

Illustration 5.1 In Illustration 3.2, the problem of trend and seasonality estimation of the monthly mean carbon dioxide concentration measured at Mauna Loa Observatory, Hawaii, USA, between 1959 and 1991 has been tackled by resorting to a polynomial function and a Fourier series, respectively. Let us now focus on the modeling of the resulting residuals (see the bottom curve of Fig. 3.5 or Fig. 5.1). As explained in Chap. 4, before trying to fit an AR model to the residuals, it is essential to detect if the residuals are i.i.d. or not. As clearly shown in Fig. 5.2, almost half of the residuals correlation coefficients are outside of the prescribed boundaries, which clearly contradicts the i.i.d. hypothesis. Thus, an AR model can be suggested for describing the remaining dynamics in the residuals. As far as the AR model complexity is concerned, solutions introduced in Sect. 3.3 can be considered for AR model order determination. By using, e.g., the AIC technique, an AR model with .na = 3 seems to be a good trade-off between complexity and accuracy as shown in Table 5.1. Once the AR model order is selected, a linear least squares algorithm can be used to estimate the AR model parameters as explained previously. Such a procedure leads to the parameter values gathered in Table 5.2. The last step of the AR model learning methodology consists in checking if the time series .(et )t∈T generated by subtracting the AR model output to the residuals time series .(zt )t∈T is a realization of a zero mean white noise or not. The definition of an AR model indeed requires that .(et )t∈T is a realization of a zero mean white noise .(et )t∈Z . Again, the tests introduced in Sect. 4.3 can be run to see if .(et )t∈Z is i.i.d., thus uncorrelated and white. As shown in Fig. 5.3, the autocorrelation function coefficients of .(et )t∈T are mostly inside the .±1.96/N bounds. On top of that, running the portmanteau test with .N = 384 and .h = 40 leads to a 2 (40) = 55.7. All these elements prove value for Q equal to .37.2 whereas .χ0.95 that an AR model of order 3 can be validated as a good candidate for the Mauna carbon dioxide concentration residuals modeling. Notice finally that a quick look at the histogram and the QQ plot of the time series .(et )t∈T (see Fig. 5.4) clearly shows that the residuals after the AR modeling step can be considered as normally distributed.

5.2 AR and ARMA Model Learning

129

1.5

1

0.5

0

-0.5

-1

-1.5 1960

1965

1970

1975

1980

1985

1990

Fig. 5.1 Residuals generated after the linear least squares estimation of the main trend and seasonal patterns of the time series. CO.2 concentration at Mauna Loa Observatory, Hawaii 1

0.8

0.6

0.4

0.2

0

-0.2

0

5

10

15

20

25

30

35

40

Fig. 5.2 Autocorrelogram of the residuals with bounds .±1.96/N in dash. CO.2 concentration at Mauna Loa Observatory, Hawaii Table 5.2 Estimated parameters of the AR model. CO.2 concentration at Mauna Loa Observatory, Hawaii .a ˆ2

.a ˆ1 .−6.3248

× 10−1

.−2.0003

.a ˆ3

× 10−1

.1.4902

× 10−2

As we did in Chap. 4, let us now turn to the determination of the statistical properties of the underlying ordinary least squares estimator .θˆ AR . Because the

130

5 Residuals Modeling with AR and ARMA Representations

1

0.8

0.6

0.4

0.2

0

-0.2

0

5

10

15

20

25

30

35

40

Fig. 5.3 Autocorrelogram of the residuals after AR modeling with bounds .±1.96/N in dash. CO.2 concentration at Mauna Loa Observatory, Hawaii

Fig. 5.4 Histogram and QQ plot of the residuals after AR modeling. CO.2 concentration at Mauna Loa Observatory, Hawaii

estimation problem solved herein seems to be identical to the one studied in Chap. 3 (up to some notation changes), the first idea would be to simply copy and paste the statistical results introduced in Chap. 4 for .θˆ AR . Whereas, as proved, e.g., in (Johansson, 1993, Chapter 5), the final conclusions are indeed similar to the ones drawn in Sect. 4.4.1, it is paramount to pinpoint that, for the AR estimator, the regressor matrix . is not deterministic anymore. It is indeed made of past residuals components which cannot be assumed to be deterministic this time. Thus, in order to

5.2 AR and ARMA Model Learning

131

guarantee consistency of .θˆ AR , this deterministic assumption must be replaced by the fact that the stochastic regressors . are uncorrelated with the disturbances vector .e, i.e., E{ e} = 0.

.

(5.22)

Whereas such a condition is generally not valid for most of the stochastic processes which can be encountered in practice, an important exception is when .(et )t∈T is a realization of a zero mean white noise sequence .(et )t∈Z with finite variance. Under such practical conditions, .et , .t ∈ Z, is indeed uncorrelated with all past data, thus with past samples of .zt , .t ∈ Z, as well. Therefore, assuming that .(et )t∈T is a realization of a zero mean white noise sequence .(et )t∈Z with finite variance is sufficient to guarantee the uncorrelatedness between . and .e. Such a particular case leads to state the following theorem. Theorem 5.1 By assuming that • The stochastic sequence .(et )t∈Z generating the residuals is a zero mean white noise with finite variance .σ 2 . • The realization .(zt )t∈T has been generated by a stationary, causal and stable AR process of order .na . • The stochastic regressor matrix . is selected so that .E{ } is nonsingular. Then, the estimator .θˆ AR is consistent, i.e., with .θ o the true parameter vector, θˆ AR → θ o ,

.

(5.23)

when .N → ∞ where the limit is meant to be with probability one whereas √ N(θˆ AR − θ o ) → N (0, σ 2 E{( )−1 }),

.

(5.24)

when .N → ∞ where the limit is meant to be in distribution (Söderström and Stoica, 1989, Appendix B). Proof See (Isermann and Münchhof, 2011, Chapter 9).



Parameters Estimation with the Yule-Walker Algorithm A complementary solution for AR model parameter estimation consists in resorting to the famous Yule-Walker equations (Söderström and Stoica, 1989, Appendix C8.1). Instead of focusing on the AR model directly, the basic idea of the Yule-Walker estimation method is to use correlation or covariance function

132

5 Residuals Modeling with AR and ARMA Representations

a coefficients in order to estimate the unknown parameters .{ai }ni=1 and benefit from the fact that .(et )t∈Z is a zero mean white noise, i.e., is uncorrelated with past samples. More precisely, by recalling that, for a zero mean stationary and causal AR process,

a(q)zt = et ,

(5.25)

E{a(q)zt zt+k } = a(q)E{zt zt+k },

(5.26)

E{et zt+k } = a −1 (q)E{et et+k } = 0,

(5.27)

.

then, for any .k ∈ N∗ , .

whereas2 .

because .(et )t∈Z is a zero mean white noise. Thus, for any .t ∈ Z and .k ∈ N∗ , we have a(q)E{zt zt+k } = 0,

(5.28)

γk + a1 γk−1 + · · · + ana γk−na = 0,

(5.29)

.

i.e., for .k ∈ N∗ , .

where, for a zero mean stochastic sequence .(zt )t∈Z , γk = E{zt zt+k }, k ∈ Z.

.

(5.30)

By using the property that .γk = γ−k and also by dividing through by .γ0 , we get that ρk = −

na 

.

ai ρi−k ,

(5.31)

i=1

where, again, ρk =

.

2 Keep

in mind that .a(z) ≡ 0 for .|z| ≤ 1.

γk . γ0

(5.32)

5.2 AR and ARMA Model Learning

133

By selecting .na equations, i.e., by considering that k varies from 1 to .na , we can write that ⎡ ⎤ ρ0 ρ1 ρ1 ⎢ ρ1 ⎢ ρ2 ⎥ ρ2 ⎢ ⎢ ⎥ .⎢ . ⎥ = −⎢ .. .. . ⎣ . ⎣ . ⎦ . ρna ρna −1 ρna −2 ⎡

⎤⎡ ⎤ a1 · · · ρna −1 ⎥ ⎢ · · · ρna ⎥ ⎢ a2 ⎥ ⎥ . ⎥⎢ . ⎥, .. . .. ⎦ ⎣ .. ⎦ · · · ρ0 ana

(5.33)

which gives a direct access to .θˆ AR as long as the square Toeplitz matrix involved in this system of linear equations has full rank. In practice, as long as ergodic time series are considered, the correlation coefficients .ρk , .k ∈ {0, · · · , na }, used in this well posed system of linear equations can be approximated by using any realization .(zt )t∈T of the zero mean stochastic sequence .(zt )t∈Z (see Sect. 4.3.1 for equations to determine the correlation coefficients .ρk , .k ∈ {0, · · · , na }, from finite realizations of ergodic stationary stochastic sequences). Thus, θˆ AR = −R −1 r,

.

(5.34)

with ⎤ · · · rna −1 . ⎥ na ×na .. ,. . .. ⎦ ∈ R rna −1 · · · r0   r = r1 · · · rna ∈ Rna ×1 , ⎡

r0 ⎢ .. .R = ⎣ .

(5.35a)

(5.35b)

where, for .k ∈ {0, · · · , na }, gk =

.

N −1−k 1  zi zi+k , k ∈ {1, · · · , m}, N

(5.36)

gk , g0 = 0. g0

(5.37)

i=0

and rk =

.

Illustration 5.2 Let us consider again the residuals modeled in Illustration 5.1 and let us test the Yule-Walker equation based solution introduced beforehand. As shown in Table 5.3, the estimated parameters for a third order AR model are very similar to those obtained with the first least squares (continued)

134

5 Residuals Modeling with AR and ARMA Representations

Table 5.3 Estimated parameters of the AR model with a Yule-Walker equation based solution. CO.2 concentration at Mauna Loa Observatory, Hawaii .a ˆ1

.a ˆ2

.−6.3048

× 10−1

.−2.0523

.a ˆ3

× 10−1

.2.1774

× 10−2

Illustration 5.2 (continued) approach (see Table 5.2). Thus, both solutions give, for this specific time series, comparable solutions.

Remark 5.1 When long AR models are required, the former implementation of the Yule-Walker equation based solution should be replaced by the famous LevinsonDurbin’s algorithm (Söderström and Stoica, 1989, Complement C8.2). The Levinson-Durbin’s algorithm is indeed recursive and takes into account the Toeplitz structure of .R explicitly. In the end, the Levinson-Durbin’s algorithm requires 2 3 .O(na ) operations whereas the brute force method involves .O(na ) arithmetic operations. See also (Theodoridis, 2015, Section 4.8) for a discussion about the Levinson-Durbin’s algorithm implementation. Remark 5.2 Once an accurate AR model is generated, the bootstrap approach introduced in Sect. 4.4.3 can be deployed to generate variance estimates and confidence intervals straightforwardly. Indeed, once the time series .(et )t∈T does not reject the i.i.d. hypothesis, this time series can be bootstrapped, i.e., B new realizations of .(et )t∈Z can be generated by resorting to standard random number a generators with replacement. Knowing reliable estimates for .{ai }ni=1 , B non i.i.d. realizations of the residuals .(zt )t∈Z can be produced easily, thus yielding inputs for the bootstrap analysis carried out in Sect. 4.4.3 and its sequels. This AR based bootstrap solution echoes back to Remark 4.12 in Chap. 4.

5.2.2 ARMA Model Parameters Estimation Parameters Estimation with Pseudolinear Least Squares Let us now consider the problem of estimating the full set of unknown parameters a ,nb {ai , bj }ni=1,j =1 , i.e., the parameters of an ARMA model. In order to reach this goal, let us assume that the residuals time series .(zt )t∈T is a realization of a zero mean stationary, causal, and invertible .ARMA(na , nb ) stochastic process, i.e.,

.

zt =

nb 

.

j =1

bj et−j −

na  j =1

aj zt−j + et ,

(5.38)

5.2 AR and ARMA Model Learning

135

at any time .t ∈ T. Then, because the noise contribution is no more restricted to the innovation term .et but appears also in the MA part of this .ARMA(na , nb ) model structure, the idea of resorting to zˆ t =

nb 

.

bj et−j −

j =1

na 

aj zt−j ,

(5.39)

j =1

then minimizing the resulting prediction error .t = zt − zˆ t at any time .t ∈ T with a linear least squares approach is not conceivable anymore. In order to bypass this difficulty, a standard solution consists in estimating the past components .et−j , .j ∈ {1, · · · , nb }, from the available data sets, then using these estimated a ,nb samples into a linear least squares problem for determining .{aˆ i , bˆj }ni=1,j =1 . Once accurate estimates of the components .et−j , .j ∈ {1, · · · , nb }, are available, the resulting estimation problem indeed boils down to a linear least squares problem and algorithms equivalent to the ones used for AR model parameters determination can be easily adapted for extracting .θˆ ARMA efficiently. As shown, e.g., in (Johansson, 1993, Chapter 6), the first solution to determine reliable estimates of .et−j , .j ∈ {1, · · · , nb }, can consist in • Estimating a long AR model by means of the former linear least squares solutions (see Sect. 5.2.1) in order to guarantee that the resulting residual time series .(et )t∈T is a realization of a zero mean white noise sequence .(et )t∈Z . a ,nb • Determining .{aˆ i , bˆj }ni=1,j =1 with, e.g., a linear least squares method developed with the following predictor zˆ t =

nb 

.

j =1

bj eˆt−j −

na 

aj zt−j ,

(5.40)

j =1

where the components .eˆt−j , .j ∈ {1, · · · , nb }, stand for the noise sequence generated during the first step of the procedure. As the AR model order is selected quite high, it can indeed be assumed that the corresponding residual sequence .(eˆt )t∈T yields a reliable approximation of the realization .(et )t∈T of the zero mean white noise sequence .(et )t∈T exciting the ARMA model. Interesting from a numerical viewpoint because of the involvement of linear least squares based steps only, this offline procedure (called the HannanRissanen algorithm in the literature Brockwell and Davis, 2016, Section 5.1.4) can be refined by resorting to an online implementation which yields, in one step, a ,nb accurate estimates of the noise sequence .(et )t∈T as well as .{aˆ i , bˆj }ni=1,j =1 via a single recursive least squares algorithm (Ljung and Söderström, 1983, Chapter 2), (Söderström and Stoica, 1989, Chapter 9), (Ljung, 1999, Chapter 11). Roughly speaking, such a recursivity stems from the observation that the following sequence

136

5 Residuals Modeling with AR and ARMA Representations

of prediction errors .

1 = z1 + a1 z0 − b1 0 , .

(5.41a)

2 = z2 + a1 z1 + a2 z0 − b1 1 − b2 0 , .

(5.41b)

.. .

.. .

N −1 = zN −1 + a1 zN −2 + · · · + ana zN −1−na − b1 N −2 − · · · − bnb N −1−nb ,

(5.41c)

or, in a compact form, for any .t ∈ T with .t = 0 for .t < 0 and .zt = 0 for .t ≤ 0, t = zt + a1 zt−1 + · · · + ana zt−na − b1 t−1 − · · · − bnb t−nb ,

.

(5.42)

can be used as good approximations of .(e1 , · · · , eN −1 ) given3 .(z0 , · · · , zN −1 ). More specifically, by assuming that the samples .(z0 , · · · , zt , 1 , · · · , t−1 ) are available at time .t ∈ T, it can be easily seen that the ARMA parameters can be estimated by minimizing4 VARMA (θ t ) = zt − Ψ t θ t 22 ,

.

(5.43)

where   zt = z1 · · · zt ∈ Rt×1 , .   ψ t = −zt−1 · · · −zt−na t−1 · · · t−nb ∈ R(na +nb )×1 , .   Ψ t = ψ 1 · · · ψ t ∈ Rt×(na +nb ) , .   θ t = a1 · · · ana b1 · · · bnb ∈ R(na +nb )×1 . .

(5.44a) (5.44b) (5.44c) (5.44d)

Then, as usual now !-)), the minimum of this cost function writes −1  θˆ t = (Ψ  t Ψ t ) Ψ t zt ,

.

(5.45)

when .Ψ t has full column rank. The main issue with this least squares estimate is the fact that the regressors contain unknown values for .(t )t∈T . Fortunately, this problem can be bypassed by resorting to a recursive algorithm which estimates, at a ,nb each iteration, the unknown parameters .{ai , bj }ni=1,j =1 as well as the sequence of

the noise sequence is unknown, the recursion starts with .0 = 0. the time dependency stressed with the t subscript.

3 Because 4 Notice

5.2 AR and ARMA Model Learning

137

prediction errors .(t )t∈T . With −1 P t = (Ψ  t Ψ t) ,

(5.46)

.

such a recursion can be easily highlighted by noticing that −1  P −1 t = P t−1 + ψ t ψ t .

(5.47)

.

Indeed, thanks to this updating equation for .P −1 t , we trivially have θˆ t = P t (Ψ  t−1 zt−1 + ψ t zt ) ˆ = P t (P −1 t−1 θ t−1 + ψ t zt )

.

(5.48)

ˆ = θˆ t−1 + P t ψ t (zt − ψ  t θ t−1 ), i.e., θˆ t = θˆ t−1 + k t t ,

(5.49)

.

with kt = P t ψ t , .

(5.50a)

ˆ  t = zt − ψ  t θ t−1 .

(5.50b)

.

Interestingly, by using the Sherman-Morrison-Woodbury formula (Meyer, 2000, Chapter 3), the explicit inversion of .P t can be avoided, leading to the famous equation in recursive identification P t = P t−1 −

.

P t−1 ψ t ψ  t P t−1 1 + ψ t P t−1 ψ t

.

(5.51)

In a nutshell, as shown, e.g., in (Söderström and Stoica, 1989, Section 9.5), a ,nb the recursive estimation of .{aˆ i , bˆj }ni=1,j =1 can be carried out by resorting to the following recursive pseudolinear regression algorithm (Söderström and Stoica, 1989, Section 9.5)   ψ t = −zt−1 · · · −zt−na t−1 · · · t−nb , .

.

P t = P t−1 −

P t−1 ψ t ψ  t P t−1 1 + ψ t P t−1 ψ t

,.

(5.52a) (5.52b)

kt = P t ψ t , .

(5.52c)

ˆ  t = zt − ψ  t θ t−1 , .

(5.52d)

θˆ t = θˆ t−1 + k t t ,

(5.52e)

138

5 Residuals Modeling with AR and ARMA Representations

6

4

2

0

-2

-4

20

40

60

80

100

120

Fig. 5.5 Residuals generated after the estimation of the main trend of the time series with a 10th order polynomial function. Hourly ceramic furnace temperatures

where .ψ t is nothing but the approximation of .Φ t defined as follows   Φ t = −zt−1 · · · −zt−na et−1 · · · et−nb ,

.

(5.53)

by replacing the unknown noise terms .{et−1 , · · · , et−nb } with the estimated prediction errors .{t−1 , · · · , t−nb }. By construction, the former recursive algorithm requires initial guesses. A rule of thumb consists in choosing .θˆ 0 = 0, .t = 0 for .t ∈ Z− and .P 0 quite large, e.g., .P 0 = 1000I (na +nb )×(na +nb ) , to guarantee a good transient behavior of the recursive algorithm. Indeed, as pointed out in Sect. 4.4.1, .P t is the covariance matrix of .θˆ t (up to .σ 2 ). It is thus consistent to select a big value for .P 0 if we do not have a huge confidence in the prior on .θˆ 0 . See, e.g., (Söderström and Stoica, 1989, Chapter 9) or Ljung and Söderström (1983) for further explanations on the impact of these initial values on the algorithm behavior).

Illustration 5.3 In order to illustrate the efficiency of the pseudolinear regression algorithm for ARMA model learning, let us consider the time series introduced in Illustration 3.4, i.e., the hourly temperature readings from a ceramic furnace. After removing the deterministic part of this time series by resorting to the polynomial function fitting solution introduced in Illustration 3.4, an ARMA model can be suggested for modeling the dynamical behavior of the residuals given in Fig. 5.5 (see Sect. 5.3 for some facts (continued)

5.2 AR and ARMA Model Learning

139

Illustration 5.3 (continued) explaining the reasons why an ARMA model can be suggested as a good candidate for this model learning problem). As we did in Illustration 5.1, the first step of the recipe consists in testing these residuals by checking if there is still something to model in their dynamics. As clearly shown in Fig. 5.6, even if many autocorrelation function coefficients of the residuals are encapsulated between the .±1.96/N bounds, a damped oscillation clearly appears in the dynamics of this autocorrelogram, thus leading us to reject the i.i.d. hypothesis. This i.i.d. hypothesis rejection is confirmed by the portmanteau test run 2 (40) = 55.75. with .h = 40 and .N = 139 for which .Q = 189.05 whereas .χ0.95 The second step of the learning procedure is then dedicated to the selection of the ARMA model complexity with, again, a specific attention to its parsimony. As suggested earlier, the Akaike’s Information Criterion can be used for this purpose. As shown in Table 5.4, selecting .na = 7 and .nb = 2 seems to be the best compromise between complexity and accuracy. Once this model order is selected, the ARMA model parameters given in Table 5.5 are estimated thanks to the pseudolinear regression algorithm introduced previously. Again, the efficiency of this procedure is tested by analyzing the remaining signal .(et )t∈T , i.e., the time series generated by subtracting the ARMA model output to the residuals time series .(zt )t∈T . The autocorrelation function test (see Fig. 5.7) and the portmanteau test (with .h = 40, .N = 139, .Q = 28.42 2 (40) = 55.75) are both leading to the conclusion that .(i) the time and .χ0.95 series .(et )t∈T can be considered as a realization of a zero mean white noise (close to be normally distributed as illustrated in Fig. 5.8), .(ii) an ARMA(7, 2) model can be used for modeling the hourly ceramic furnace temperatures residuals efficiently.

Because, by construction, the pseudolinear regression solution involves approximations, we may wonder if the estimated parameters are consistent or not. Fortunately, when an ARMA model is handled, the following theorem can be stated. Theorem 5.2 If • The stochastic sequence .(et )t∈Z generating the residuals is a zero mean white noise with finite variance .σ 2 . • The realization .(zt )t∈T has been generated by a stationary, causal, stable and invertible ARMA process of order .na and .nb , respectively. • The ARMA model structure satisfies Re

.

1 ıω b(e , θ o )



1 > 0 for all ω, 2

where .θ o stands for the true parameter vector,

(5.54)

140

5 Residuals Modeling with AR and ARMA Representations

1

0.8

0.6

0.4

0.2

0

-0.2

-0.4

0

5

10

15

20

25

30

35

40

Fig. 5.6 Autocorrelogram of the residuals with bounds .±1.96/N in dash. Hourly ceramic furnace temperatures Table 5.4 AIC test for ARMA model order selection. Hourly ceramic furnace temperatures na \nb 1 2 3 4 5 6 7 8

1 6.9917 × 10−1 6.7562 × 10−1 6.3391 × 10−1 6.2489 × 10−1 6.2025 × 10−1 5.6715 × 10−1 5.5477 × 10−1 5.8773 × 10−1

2 7.0056 × 10−1 6.5546 × 10−1 6.4917 × 10−1 6.2803 × 10−1 5.8964 × 10−1 5.3662 × 10−1 4.7126 × 10−1 5.2243 × 10−1

3 7.0497 × 10−1 6.8140 × 10−1 6.2827 × 10−1 6.4088 × 10−1 6.2587 × 10−1 5.3728 × 10−1 5.3223 × 10−1 6.0620 × 10−1

4 7.1828 × 10−1 6.5223 × 10−1 6.8214 × 10−1 6.2436 × 10−1 6.1773 × 10−1 5.5371 × 10−1 5.8416 × 10−1 6.0948 × 10−1

.na \nb

5 −1 .6.0789 × 10 −1 .5.7721 × 10 −1 .5.6425 × 10 −1 .6.4519 × 10 −1 .4.8246 × 10 −1 .5.7132 × 10 −1 .5.9450 × 10 −1 .6.0820 × 10

6 −1 .5.2186 × 10 −1 .4.8009 × 10 −1 .4.9701 × 10 −1 .6.0257 × 10 −1 .6.1039 × 10 −1 .5.7673 × 10 −1 .6.1455 × 10 −1 .6.0771 × 10

7 −1 .5.5268 × 10 −1 .5.4886 × 10 −1 .6.0456 × 10 −1 .6.1623 × 10 −1 .6.3042 × 10 −1 .6.1559 × 10 −1 .5.9934 × 10 −1 .6.3625 × 10

8 −1 .6.6302 × 10 −1 .6.5101 × 10 −1 .6.5383 × 10 −1 .6.4407 × 10 −1 .6.3811 × 10 −1 .6.2746 × 10 −1 .6.1420 × 10 −1 .6.4277 × 10

1 2 3 4 5 6 7 8

then , when .N → ∞, the parameter vector .θˆ ARMA generated by the pseudolinear regression algorithm given in Eq. (5.52) converges toward .θ o with probability 1. Proof See (Söderström and Stoica, 1989, Section 9.6)



5.2 AR and ARMA Model Learning

141

Table 5.5 Estimated parameters of the ARMA model with a pseudolinear least squares solution. Hourly ceramic furnace temperatures .a ˆ1

.a ˆ2

.−7.8401

× 10−2

.a ˆ3

.1.2007

× 10−1

.1.4818

.a ˆ4

× 10−1

.a ˆ5

.−9.4216

.a ˆ6

.a ˆ7

ˆ1 .b

ˆ2 .b

−2 .9.9828 × 10

−1 .−8.5736 × 10

−2 .3.1733 × 10

.−1.0200

× 10−3

.8.8229

× 10−2

× 10−1

1

0.8

0.6

0.4

0.2

0

-0.2

0

5

10

15

20

25

30

35

40

Fig. 5.7 Autocorrelogram of the residuals after ARMA modeling with bounds .±1.96/N in dash. Hourly ceramic furnace temperatures

Notice unfortunately that the condition given in Eq. (5.54) .(i) is a sufficient condition only .(ii) depends on the true parameter values which is obviously unknown, thus cannot be verified a priori. This is probably the main reason why this pseudolinear least squares algorithm is often “only” used for generating good initial guesses when nonlinear least squares ARMA model learning algorithms are considered. One of these nonlinear least squares ARMA model learning algorithms is introduced in the next section.

Parameters Estimation with Nonlinear Least Squares As explained previously, an efficient way to learn the dynamics of an ARMA model consists in determining the prediction error sequence .(t )t∈T recursively, then using this estimated sequence in a (recursive) linear least squares algorithm a b to get reliable estimates of .{ai }ni=1 and .{bi }ni=1 iteratively. Inspired by the idea of resorting to a prediction error sequence for parameters estimation, standard nonlinear least squares algorithms for ARMA model learning aim at determining a b .{a ˆ i }ni=1 and .{bˆi }ni=1 such than the sum of squared prediction errors on the full range

142

5 Residuals Modeling with AR and ARMA Representations

Fig. 5.8 Histogram and QQ plot of the residuals after ARMA modeling. Hourly ceramic furnace temperatures

T is minimized, i.e.,

.

θˆ ARMA = arg min

N−1 

.

θ

(zt − zˆ t|t−1 (θ))2 ,

(5.55)

t=0

where .zˆ t|t−1 (θ) stands for the one step ahead prediction of .zt , .t ∈ T (Söderström and Stoica, 1989, Chapter 7), (Ljung, 1999, Chapter 3), i.e., the prediction of .zt given the samples of .zt up to and including .t − 1 (i.e., given .(zt−1 , zt−2 , · · · )) as well as the parameter vector .θ. In order to reach this goal, it is necessary first to determine .zˆ t|t−1 (θ), i.e., to clarify the dependency of the one step ahead prediction b a of .zt , .t ∈ T, in terms of .{ai }ni=1 and .{bi }ni=1 . Such a .θ dependency can be determined by recalling that, for an ARMA(.na , .nb ) stochastic process, zt = h(q)et =

.

b(q) et , a(q)

(5.56)

with a filter .h(q) = b(q)/a(q) stable and inversely stable (i.e., the filter .h(q) is invertible and its inverse is stable). Thanks to the inversely stability of .h(q), we can write et = h−1 (q)zt ,

.

(5.57)

5.2 AR and ARMA Model Learning

143

with (Ljung, 1999, Chapter 3) h−1 (q) =

.



 1 h˜ j q−j , = h(q)

(5.58)

j =0

 ˜ 2 where the weights .(h˜ j )j ∈N satisfy .h˜ 0 = 1 and . ∞ j =0 |hj | < ∞. Thanks to this link between .(et )t∈Z and .(zt )t∈Z via .h−1 (q), it can be concluded that the knowledge of .zk for .k ≤ t − 1 implies the knowledge of .ek for .k ≤ t − 1. By recalling that .h(q) is monic, i.e., by noticing .h0 = 1, we have zt = (h(q) − 1)et + et .

.

(5.59)

Combining this rewriting of .zt with the knowledge of .ek for .k ≤ t − 1 allows us to conclude that .zt is made of two terms: .(h(q) − 1)et which is known at time .t − 1 and the innovation .et which is unknown at time .t − 1. Hence, the one step ahead prediction of .zt for any .t ∈ T satisfies .

zˆ t|t−1 = E{zt−1 |(zt )t∈{−∞,··· ,t−1} } = (h(q) − 1)et .

(5.60)

Now, by using Eq. (5.57), we easily get .



a(q) zˆ t|t−1 (θ) = (h(q) − 1)h−1 (q)zt = (1 − h−1 (q))zt = 1 − zt , b(q)

(5.61)

whereas, for $t \in \mathbb{T}$,

$$z_t - \hat{z}_{t|t-1}(\theta) = \frac{a(q)}{b(q)}\, z_t. \qquad (5.62)$$

Remark 5.3 So far, it has been assumed that the user has access to $z_k$ for $k \in \{-\infty, \cdots, t-1\}$. Assuming the availability of $z_t$ on a semi-infinite range is unfortunately not conceivable practically. In order to bypass this difficulty, the unknown values of $z_t$, i.e., the samples out of the range $[0, t-1]$, are usually replaced by zero because of the lack of knowledge of the time series initial conditions. Of course, such a choice (as well as the underlying approximated one step ahead prediction) is valid in practice only if the transient effect is decaying quickly. Fortunately, for most practical purposes, this approximation gives a satisfactory result because, in many applications involving ARMA models, the transient effect decays exponentially fast.

Because the one step ahead prediction error (see Eq. (5.62)) is not linear in terms of the unknown parameters $\{a_i\}_{i=1}^{n_a}$ and $\{b_i\}_{i=1}^{n_b}$ by construction, iterative algorithms


must be used to determine the parameter vector $\hat{\theta}_{\mathrm{ARMA}}$ which minimizes

$$V_{\mathrm{ARMA}}(\theta) = \sum_{t=0}^{N-1} \left(z_t - \hat{z}_{t|t-1}(\theta)\right)^2 = \sum_{t=0}^{N-1} \left(\frac{a(q,\theta)}{b(q,\theta)}\, z_t\right)^2. \qquad (5.63)$$

The Levenberg–Marquardt algorithm introduced in Sect. 3.2 is a good candidate for minimizing the nonconvex cost function $V_{\mathrm{ARMA}}(\theta)$ iteratively. Interestingly, when ARMA model learning comes into play, the efficiency of such an iterative minimization algorithm is guaranteed thanks to the following result (Söderström and Stoica, 1989, Chapter 7).

Theorem 5.3 For scalar ARMA models, all stationary points of $V_{\mathrm{ARMA}}(\theta)$ are global minima or, said differently, $V_{\mathrm{ARMA}}(\theta)$ has a unique minimum if the model order is correctly chosen.

Proof See (Söderström and Stoica, 1989, Complement C7.6).



Notice that, in order to accelerate the Levenberg–Marquardt algorithm convergence toward $\hat{\theta}_{\mathrm{ARMA}}$, it is strongly recommended to choose its initial guess in the vicinity of the global minimum of $V_{\mathrm{ARMA}}(\theta)$. Such reliable initial parameters can be generated by resorting, e.g., to the pseudolinear least squares algorithm introduced previously.

Remark 5.4 As clearly mentioned in Sect. 3.2, the Levenberg–Marquardt algorithm involves a Jacobian matrix, i.e., (partial) derivatives of $f(\bullet, \theta)$ w.r.t. each component of $\theta$. Whereas efficient automatic differentiation algorithms Neidinger (2010) can be deployed to determine these derivatives, it can be numerically more efficient to resort to explicit derivatives when these functions can be determined a priori easily. This is the case with the ARMA model. Indeed, because

$$z_t - \hat{z}_{t|t-1}(\theta) = \frac{1 + a_1 q^{-1} + \cdots + a_{n_a} q^{-n_a}}{1 + b_1 q^{-1} + \cdots + b_{n_b} q^{-n_b}}\, z_t, \qquad (5.64)$$

we have (Söderström and Stoica, 1989, Chapter 7)

$$\frac{\partial (z_t - \hat{z}_{t|t-1}(\theta))}{\partial a_k} = \frac{1}{b(q,\theta)}\, z_{t-k}, \qquad (5.65)$$

for $k \in \{1, \cdots, n_a\}$ whereas

$$\frac{\partial (z_t - \hat{z}_{t|t-1}(\theta))}{\partial b_k} = -\frac{1}{b(q,\theta)} \left(z_{t-k} - \hat{z}_{t-k|t-k-1}(\theta)\right), \qquad (5.66)$$

for $k \in \{1, \cdots, n_b\}$. These partial derivative equations can be used explicitly with most of the software implementing the Levenberg–Marquardt algorithm for ARMA model learning.
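To make this one step ahead prediction error minimization concrete, here is a minimal Python sketch, assuming NumPy and SciPy are available and that the residuals realization is stored in a one-dimensional array z; the helper names arma_prediction_errors and fit_arma_pem are illustrative, not taken from this book. It filters z with a(q)/b(q) as in Eq. (5.62) and feeds the resulting prediction errors to a Levenberg–Marquardt-type solver in order to minimize Eq. (5.63).

```python
import numpy as np
from scipy.signal import lfilter
from scipy.optimize import least_squares

def arma_prediction_errors(theta, z, na, nb):
    """One step ahead prediction errors a(q)/b(q) z_t with zero initial conditions (Eq. (5.62))."""
    a = np.concatenate(([1.0], theta[:na]))        # a(q) = 1 + a1 q^-1 + ... + a_na q^-na
    b = np.concatenate(([1.0], theta[na:na + nb]))  # b(q) = 1 + b1 q^-1 + ... + b_nb q^-nb
    # lfilter(num, den, z) implements den(q) * eps = num(q) * z, i.e., eps = (a(q)/b(q)) z
    return lfilter(a, b, z)

def fit_arma_pem(z, na, nb, theta0):
    """Minimize V_ARMA(theta), the sum of squared prediction errors of Eq. (5.63)."""
    sol = least_squares(arma_prediction_errors, theta0, args=(np.asarray(z, float), na, nb),
                        method="lm")
    sigma2_hat = np.mean(arma_prediction_errors(sol.x, np.asarray(z, float), na, nb) ** 2)
    return sol.x, sigma2_hat

# Usage sketch: theta0 would typically come from the pseudolinear least squares step, e.g.,
# theta_hat, sigma2_hat = fit_arma_pem(z, na=7, nb=2, theta0=theta_pls)
```

Note that this sketch lets the solver approximate the Jacobian by finite differences instead of using Eqs. (5.65) and (5.66), and it does not enforce the invertibility of $b(q)$ along the iterations; in practice, an initial guess coming from the pseudolinear least squares step usually keeps the iterates in a well-behaved region.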


Table 5.6 Estimated parameters $\hat a_1, \cdots, \hat a_7$ and $\hat b_1, \hat b_2$ of the ARMA(7, 2) model with a nonlinear least squares solution. Hourly ceramic furnace temperatures

Illustration 5.4 Let us tackle again the ARMA model learning problem addressed in Illustration 5.3, i.e., let us try to fit an ARMA(7, 2) model on the hourly ceramic furnace temperatures residuals given in Fig. 5.5 with, this time, the one step ahead prediction error method introduced previously. With the estimated ARMA(7, 2) model generated in Illustration 5.3 as the initial guess for minimizing $V_{\mathrm{ARMA}}(\theta)$ given in Eq. (5.63), the estimated parameters gathered in Table 5.6 can be determined straightforwardly thanks to the Levenberg–Marquardt algorithm introduced in Sect. 3.2. As shown in Fig. 5.9, then confirmed with the Portmanteau test figures ($h = 40$, $N = 139$, $Q = 19.87$ and $\chi^2_{0.95}(40) = 55.75$), this new estimated ARMA(7, 2) model is able to mimic the residuals dynamics efficiently. Hence, as a general conclusion for both model learning problems (see also Illustration 5.3), selecting an ARMA(7, 2) model for learning the dynamics of the hourly ceramic furnace temperatures residuals seems to be a good trade-off between model complexity and model accuracy.
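For completeness, a portmanteau check like the one used in this illustration can be coded in a few lines. The sketch below is hedged: it implements a basic Box–Pierce-type statistic with h degrees of freedom, whereas the exact variant used in this book may include finite-sample corrections; the function name is illustrative and the chi-square quantile comes from SciPy.

```python
import numpy as np
from scipy.stats import chi2

def portmanteau_check(residuals, h=40, alpha=0.05):
    """Box-Pierce-type portmanteau statistic Q compared with a chi-square threshold."""
    e = np.asarray(residuals, dtype=float) - np.mean(residuals)
    N = len(e)
    rho = np.array([np.sum(e[k:] * e[:N - k]) for k in range(1, h + 1)]) / np.sum(e ** 2)
    Q = N * np.sum(rho ** 2)
    return Q, chi2.ppf(1 - alpha, h)   # whiteness is not rejected when Q stays below the threshold
```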

As any prediction error method Ljung (2002), the nonlinear least squares based solution introduced herein for ARMA parameters estimation benefits from interesting statistical properties. More precisely,

Theorem 5.4 By assuming that

• The stochastic sequence $(e_t)_{t\in\mathbb{Z}}$ generating the residuals is a zero mean white noise with finite variance $\sigma^2$.
• The realization $(z_t)_{t\in\mathbb{T}}$ has been generated by a stationary, causal, and inversely stable ARMA process of order $n_a$ and $n_b$, respectively.
• The stochastic regressor matrix $F|_{\theta}$ is selected so that $\mathrm{E}\{F^{\top}|_{\theta}\, F|_{\theta}\}$ is nonsingular around the global minimum point of $V_{\mathrm{ARMA}}(\theta)$ with

$$F|_{\theta^*} = \begin{bmatrix} \dfrac{\partial f(0,\theta)}{\partial \theta_1} & \cdots & \dfrac{\partial f(0,\theta)}{\partial \theta_{n_\theta}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f(N-1,\theta)}{\partial \theta_1} & \cdots & \dfrac{\partial f(N-1,\theta)}{\partial \theta_{n_\theta}} \end{bmatrix}_{\theta = \theta^*}, \qquad (5.67)$$



Fig. 5.9 Autocorrelogram of the residuals after ARMA modeling with bounds $\pm 1.96/\sqrt{N}$ in dash. Hourly ceramic furnace temperatures

where $f(t, \theta) = z_t - \hat{z}_{t|t-1}(\theta)$, then the estimator $\hat{\theta}_{\mathrm{ARMA}}$ is consistent, i.e., with $\theta_o$ the true parameter vector,

$$\hat{\theta}_{\mathrm{ARMA}} \to \theta_o, \qquad (5.68)$$

when $N \to \infty$ where the limit is meant to be with probability one, whereas

$$\sqrt{N}\,(\hat{\theta}_{\mathrm{ARMA}} - \theta_o) \to \mathcal{N}\!\left(0,\ \sigma^2 \left(\mathrm{E}\{F^{\top}|_{\theta_o}\, F|_{\theta_o}\}\right)^{-1}\right), \qquad (5.69)$$

when $N \to \infty$ where the limit is meant to be in distribution (Söderström and Stoica, 1989, Appendix B).

Proof See (Söderström and Stoica, 1989, Chapter 7).



Last but not least, let us briefly discuss the links between the prediction error method introduced so far and the maximum likelihood method for ARMA model learning (Montgomery et al., 2015, Chapter 3), (Shumway and Stoffer, 2017, Chapter 3) when the noise in the ARMA model is assumed to be Gaussian distributed. By definition, the maximum likelihood estimate of a parameter vector $\theta$ is generated by maximizing the likelihood function, i.e., the probability distribution function of the time series sequence conditioned on $\theta$. Mathematically, the maximum likelihood


estimates can be generated by maximizing (Rogers and Girolami, 2017, Chapter 2)

$$\log \ell(\theta, \sigma) = -\frac{N}{2} \log(2\pi) - N \log(\sigma) - \frac{1}{2\sigma^2} \sum_{t=0}^{N-1} \left(z_t - \hat{z}_{t|t-1}(\theta)\right)^2, \qquad (5.70)$$

when the ARMA noise term is assumed to be zero mean, white, Gaussian with a variance equal to $\sigma^2$. Interestingly,

• We easily recognize $V_{\mathrm{ARMA}}(\theta)$ in this cost function (last term in the former equality).
• The only term depending on $\theta$ is $V_{\mathrm{ARMA}}(\theta)$.
• Knowing $\sigma$, maximizing $-\frac{1}{2\sigma^2} \sum_{t=0}^{N-1} (z_t - \hat{z}_{t|t-1}(\theta))^2$ w.r.t. $\theta$ is equivalent to minimizing $V_{\mathrm{ARMA}}(\theta)$ w.r.t. $\theta$.

Thus, when the noise in the ARMA model is assumed to be normally distributed, white and zero mean, the maximum likelihood solution is exactly the one derived with the nonlinear least squares ARMA model learning method introduced previously. Notice finally that, as far as the estimation of $\sigma$ is concerned, we have

$$\frac{\partial \log \ell(\theta, \sigma)}{\partial \sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3} V_{\mathrm{ARMA}}(\theta) = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{N} \sum_{t=0}^{N-1} \left(z_t - \hat{z}_{t|t-1}(\hat{\theta}_{\mathrm{ARMA}})\right)^2. \qquad (5.71)$$

Such a result echoes back to the nonlinear least squares estimator analysis made in Sect. 4.4.2 (see Eq. (4.83)). For a deeper analysis of the maximum likelihood solution for ARMA model learning, the reader is warmly invited to study (Söderström and Stoica, 1989, Complement C7.7).
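As a side note, the equivalence above is easy to check numerically: evaluating the Gaussian log-likelihood of Eq. (5.70) at the variance estimate of Eq. (5.71) only requires the prediction errors. The following minimal sketch assumes NumPy; the function name is illustrative.

```python
import numpy as np

def concentrated_gaussian_loglik(pred_errors):
    """Log-likelihood of Eq. (5.70) evaluated at the maximizing sigma of Eq. (5.71)."""
    e = np.asarray(pred_errors, dtype=float)
    N = len(e)
    sigma2_hat = np.mean(e ** 2)                     # Eq. (5.71)
    return -0.5 * N * np.log(2 * np.pi) - 0.5 * N * np.log(sigma2_hat) - 0.5 * N
```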

5.3 Partial Autocorrelation Function

After reading the former sections, the reader may wonder why an AR model was used in Illustration 5.1 whereas an ARMA model gave reliable results in Illustration 5.3. As pointed out, e.g., in (Bisgaard and Kulahci, 2011, Chapter 3 and Chapter 4), a standard empirical way to detect if a stationary time series should be modeled with either an AR or ARMA model consists in analyzing two complementary curves: the autocorrelogram and the partial autocorrelogram (Brockwell and Davis, 1991, Chapter 1 and Chapter 3). Whereas the autocorrelogram was introduced in Sect. 4.3.1 as the plot of the sample autocorrelation coefficients $\rho_k$ versus the time lags k, the partial autocorrelogram similarly consists in drawing the sample partial autocorrelation coefficients $\phi_k$ versus the time lags k. As explained in (Brockwell and Davis, 2016, Chapter 3), the sample partial autocorrelation

Table 5.7 Autocorrelogram and partial autocorrelogram properties of AR and ARMA models

                     | Autocorr.                                             | Partial Autocorr.
AR($n_a$)            | Infinite damped exponential and/or damped sine waves  | Finite and cut off after $n_a$ lags
ARMA($n_a$, $n_b$)   | Infinite damped exponential and/or damped sine waves  | Infinite damped exponential and/or damped sine waves

coefficients .φk are nothing but the kth coefficient of an AR(k) model fitted to the data set, i.e., the last coefficient of the corresponding AR(k). Said differently, by increasing the order k of an AR model fitting the time series to analyze, the partial autocorrelogram is nothing but the plot of the coefficient premultiplying the term .zt−k in the AR(k) model versus the model order k. As thoroughly studied in (Brockwell and Davis, 1991, Chapter 1 and Chapter 3), the autocorrelograms and partial autocorrelograms of AR and ARMA models should behave as summed up in Table 5.7 (see also Bisgaard and Kulahci, 2011, Chapter 3 and Chapter 4). These properties, which of course should not be used for any time series without a critical eye, have led us to select the different model structures in Illustrations 5.1 and 5.3, respectively, as explained hereafter.

Illustration 5.5 Let us first consider the residuals analyzed in Illustration 5.1. As clearly shown in Fig. 5.10, the autocorrelogram has a damped sine wave dynamical behavior whereas the partial autocorrelogram is mainly encapsulated within the upper and lower bounds after 3 lags. Such an analysis leads us to select an AR(3) model for mimicking the behavior of this time series. When the residuals of Illustration 5.3 are considered, the curves in Fig. 5.11 have both slowly damped oscillating variations which confirm the ARMA model structure selection.
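Since the sample partial autocorrelation coefficient $\phi_k$ is, by the definition recalled above, the last coefficient of an AR(k) model fitted to the data, it can be computed with a few successive least squares fits. The following is a minimal NumPy sketch under that definition (dedicated Durbin–Levinson implementations are of course more efficient); the function name is illustrative.

```python
import numpy as np

def sample_pacf(z, max_lag):
    """Sample partial autocorrelation: last coefficient of an AR(k) fit, k = 1..max_lag."""
    z = np.asarray(z, dtype=float) - np.mean(z)
    phi = np.zeros(max_lag)
    for k in range(1, max_lag + 1):
        # Regressor matrix of lagged samples: z_t explained by z_{t-1}, ..., z_{t-k}
        Z = np.column_stack([z[k - j: len(z) - j] for j in range(1, k + 1)])
        y = z[k:]
        coeffs, *_ = np.linalg.lstsq(Z, y, rcond=None)
        phi[k - 1] = coeffs[-1]   # coefficient multiplying z_{t-k}
    return phi

# phi = sample_pacf(z, max_lag=40)  # values outside +/- 1.96/sqrt(N) point at significant lags
```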

5.4 Forecasting with AR and ARMA Models

Once an AR or ARMA model has been fitted to the residuals realization $(z_t)_{t\in\mathbb{T}}$, it may be useful to make forecasts about the future on the basis of the model trained with past and present data samples. More precisely, by assuming that the user has access to accurate estimated parameters $\{\hat a_i\}_{i=1}^{n_a}$ and $\{\hat b_i\}_{i=1}^{n_b}$ determined from $(z_t)_{t\in\mathbb{T}}$, we now aim at predicting $\hat{z}_{N-1+k|N-1}$, $k \in \mathbb{N}^*$, $N \in \mathbb{N}^*$, i.e., determining



Fig. 5.10 Autocorrelogram (top) and partial autocorrelogram (bottom) of the residuals with bounds $\pm 1.96/\sqrt{N}$ in dash. CO$_2$ concentration at Mauna Loa Observatory, Hawaii


Fig. 5.11 Autocorrelogram (top) and partial autocorrelogram (bottom) of the residuals with bounds $\pm 1.96/\sqrt{N}$ in dash. Hourly ceramic furnace temperatures

the k step ahead prediction of $z_{N-1+k}$, $k \in \mathbb{N}^*$, $N \in \mathbb{N}^*$, knowing $(z_t)_{t\in\{0,\cdots,N-1\}}$, $N \in \mathbb{N}^*$.


By definition, the k step ahead prediction $\hat{z}_{N-1+k|N-1}$, $k \in \mathbb{N}^*$, $N \in \mathbb{N}^*$, is the conditional expectation of $z_{N-1+k}$ given the infinite past (Montgomery et al., 2015, Chapter 5), i.e. (see also footnote 5 below),

$$\hat{z}_{N-1+k|N-1} = \mathrm{E}\{z_{N-1+k} \,|\, (z_t)_{t\in\{-\infty,\cdots,N-1\}}\}, \quad k \in \mathbb{N}^*, \; N \in \mathbb{N}^*. \qquad (5.72)$$

In order to determine the k step ahead prediction of $z_{N-1+k}$ for ARMA and AR models, let us start by recalling that, for an AR($n_a$) or ARMA($n_a$, $n_b$) stochastic process,

$$z_t = \hat{h}(q)e_t = \sum_{j=0}^{\infty} \hat{h}_j e_{t-j}, \quad t \in \mathbb{Z}, \qquad (5.73)$$

with

$$\hat{h}(q) = \begin{cases} \dfrac{1}{\hat{a}(q)} & \text{for the AR model,} \\[2mm] \dfrac{\hat{b}(q)}{\hat{a}(q)} & \text{for the ARMA model.} \end{cases} \qquad (5.74)$$

Thus, for $k \in \mathbb{N}^*$ and $N \in \mathbb{N}^*$, we have

$$z_{N-1+k} = \sum_{j=0}^{\infty} \hat{h}_j e_{N-1+k-j} = \sum_{j=0}^{k-1} \hat{h}_j e_{N-1+k-j} + \sum_{j=k}^{\infty} \hat{h}_j e_{N-1+k-j}. \qquad (5.75)$$

Thanks to the causality and invertibility (see footnote 6 below) of $\hat{h}(z)$, for $k \in \mathbb{N}^*$ and $N \in \mathbb{N}^*$, we can prove that (Ljung, 1999, Chapter 3)

$$\mathrm{E}\{e_{N-1+k-j} \,|\, (z_t)_{t\in\{-\infty,\cdots,N-1\}}\} = \begin{cases} 0 & \text{for } j < k, \\ e_{N-1+k-j} & \text{for } j \ge k, \end{cases} \qquad (5.76)$$

Thus, for $k \in \mathbb{N}^*$, the forecast of $z_{N-1+k}$ satisfies

$$\hat{z}_{N-1+k|N-1} = \mathrm{E}\{z_{N-1+k} \,|\, (z_t)_{t\in\{-\infty,\cdots,N-1\}}\} = \sum_{j=k}^{\infty} \hat{h}_j e_{N-1+k-j}, \qquad (5.77)$$

Footnote 5: As shown in (Ljung, 1999, Chapter 3), $\hat{z}_{N-1+k|N-1}$ can also be defined as the minimizer of $\mathrm{E}\{(z_{N-1+k} - \hat{z}_{N-1+k|N-1})^2\}$, $k \in \mathbb{N}^*$, $N \in \mathbb{N}^*$.
Footnote 6: By invertibility of $\hat{h}(z)$, $e_t = \hat{h}^{-1}(q) z_t$, $t \in \mathbb{Z}$. Thus, if we know $(z_t)_{t\in\{-\infty,\cdots,N-1\}}$, we know $(e_t)_{t\in\{-\infty,\cdots,N-1\}}$, thus $e_{N-1+k-j}$ for $j \ge k$ by causality.


whereas

$$z_{N-1+k} - \hat{z}_{N-1+k|N-1} = \sum_{j=0}^{k-1} \hat{h}_j e_{N-1+k-j}. \qquad (5.78)$$

These two equations prove that the k step ahead predictor $\hat{z}_{N-1+k|N-1}$ as well as the prediction error $z_{N-1+k} - \hat{z}_{N-1+k|N-1}$ are both linear combinations of samples of $(e_t)_{t\in\mathbb{Z}}$. Now, by recalling that, for an AR or ARMA stochastic process, the sequence $(e_t)_{t\in\mathbb{Z}}$ is a zero mean white noise with finite variance $\sigma^2$, it is easily proved that, for $k \in \mathbb{N}^*$ and $N \in \mathbb{N}^*$,

$$\mathrm{E}\{z_{N-1+k} - \hat{z}_{N-1+k|N-1}\} = 0, \qquad (5.79a)$$

$$\mathrm{E}\{(z_{N-1+k} - \hat{z}_{N-1+k|N-1})^2\} = \sigma^2 \sum_{j=0}^{k-1} \hat{h}_j^2, \qquad (5.79b)$$

i.e., the k step ahead prediction error is zero mean with a variance increasing with growing values of the prediction horizon k.

Remark 5.5 If, in addition to being zero mean and white, the stochastic sequence $(e_t)_{t\in\mathbb{Z}}$ is normally distributed, then the k step ahead prediction error is normally distributed as well with zero mean and variance equal to $\sigma^2 \sum_{j=0}^{k-1} \hat{h}_j^2$.

Although Eq. (5.77) is useful to determine, e.g., the variance of the prediction error, it suffers from three main practical drawbacks. First, it involves infinitely many terms in the past, i.e., it requires to know $z_t$ for $t \in \{-\infty, \cdots, N-1\}$. Second, it calls for estimates of $(h_j)_{j=k}^{\infty}$. Third, it requires the knowledge of the full sequence $(e_{N-1+k-j})_{j=k}^{\infty}$, for $k \in \mathbb{N}^*$ and $N \in \mathbb{N}^*$. In order to bypass these three difficulties in one shot, let us use instead the definitions of an AR and ARMA model using $\{\hat a_i\}_{i=1}^{n_a}$ and $\{\hat b_i\}_{i=1}^{n_b}$ explicitly, i.e., let us consider (again) the difference equation

$$z_{N-1+k} = e_{N-1+k} - \sum_{j=1}^{n_a} \hat{a}_j z_{N-1+k-j}, \qquad (5.80)$$

for the AR model and

$$z_{N-1+k} = e_{N-1+k} + \sum_{j=1}^{n_b} \hat{b}_j e_{N-1+k-j} - \sum_{j=1}^{n_a} \hat{a}_j z_{N-1+k-j}, \qquad (5.81)$$


for the ARMA model. Then, knowing the finite sequence $(z_t)_{t\in\mathbb{T}}$, we can write

$$\hat{z}_{N-1+k|N-1} = \mathrm{E}\{z_{N-1+k} \,|\, (z_t)_{t\in\mathbb{T}}\} = \mathrm{E}\{e_{N-1+k} \,|\, (z_t)_{t\in\mathbb{T}}\} - \sum_{j=1}^{n_a} \hat{a}_j\, \mathrm{E}\{z_{N-1+k-j} \,|\, (z_t)_{t\in\mathbb{T}}\}, \qquad (5.82)$$

for the AR model whereas

$$\hat{z}_{N-1+k|N-1} = \mathrm{E}\{z_{N-1+k} \,|\, (z_t)_{t\in\mathbb{T}}\} = \mathrm{E}\{e_{N-1+k} \,|\, (z_t)_{t\in\mathbb{T}}\} + \sum_{j=1}^{n_b} \hat{b}_j\, \mathrm{E}\{e_{N-1+k-j} \,|\, (z_t)_{t\in\mathbb{T}}\} - \sum_{j=1}^{n_a} \hat{a}_j\, \mathrm{E}\{z_{N-1+k-j} \,|\, (z_t)_{t\in\mathbb{T}}\}, \qquad (5.83)$$

for the ARMA one. Let us now try to determine the value of each term composing these equalities. Because the available residuals samples at time $t = N-1$ are $z_t$ for $t \in \{0, \cdots, N-1\}$ only (Bisgaard and Kulahci, 2011, Chapter 4),

• The conditional expectations of the present and past residuals samples are themselves (keeping in mind that $z_t = 0$ for $t < 0$), i.e.,

$$\mathrm{E}\{z_{N-1+k-j} \,|\, (z_t)_{t\in\mathbb{T}}\} = z_{N-1+k-j} \quad \text{for } k - j \le 0. \qquad (5.84)$$

• The conditional expectations of the future residuals samples are nothing but their predictions, i.e.,

$$\mathrm{E}\{z_{N-1+k-j} \,|\, (z_t)_{t\in\mathbb{T}}\} = \hat{z}_{N-1+k-j|N-1} \quad \text{for } k - j > 0. \qquad (5.85)$$

• The conditional expectations of the future noise samples are unknown at any time $t < N$, thus must be replaced by zero, i.e.,

$$\mathrm{E}\{e_{N-1+k-j} \,|\, (z_t)_{t\in\mathbb{T}}\} = 0 \quad \text{for } k - j > 0. \qquad (5.86)$$

• As suggested, e.g., in Sect. 5.2.2, good approximations of the present and past noise samples $(e_t)_{t\in\mathbb{T}}$ are their one step ahead predictors, i.e.,

$$\mathrm{E}\{e_{N-1+k-j} \,|\, (z_t)_{t\in\mathbb{T}}\} = \hat{e}_{N-1+k-j|N-2+k-j} = z_{N-1+k-j} - \hat{z}_{N-1+k-j|N-2+k-j} \quad \text{for } k - j \le 0. \qquad (5.87)$$


In a nutshell (Shumway and Stoffer, 2017, Chapter 3),

• For an AR($n_a$) model, the (truncated) predictors of $z_{N-1+k}$ for $k \in \mathbb{N}^*$ are

$$\hat{z}_{N-1+k|N-1} = -\sum_{j=1}^{n_a} \hat{a}_j \hat{z}_{N-1+k-j|N-1}, \qquad (5.88)$$

with $\hat{z}_{\ell|N-1} = z_{\ell}$ for $\ell \in \mathbb{T}$ and $\hat{z}_{\ell|N-1} = 0$ for $\ell < 0$.

• For an ARMA($n_a$, $n_b$), the (truncated) predictors of $z_{N-1+k}$ for $k \in \mathbb{N}^*$ are

$$\hat{z}_{N-1+k|N-1} = -\sum_{j=1}^{n_a} \hat{a}_j \hat{z}_{N-1+k-j|N-1} + \sum_{j=1}^{n_b} \hat{b}_j \hat{e}_{N-1+k-j|N-1}, \qquad (5.89)$$

with $\hat{z}_{\ell|N-1} = z_{\ell}$ for $\ell \in \mathbb{T}$ and $\hat{z}_{\ell|N-1} = 0$ for $\ell < 0$ whereas the truncated prediction errors $\hat{e}_{\ell|N-1}$ are given by

$$\hat{e}_{\ell|N-1} = \hat{a}(q)\, \hat{z}_{\ell|N-1} - \sum_{j=1}^{n_b} \hat{b}_j \hat{e}_{\ell-j|N-1} \quad \text{for } \ell \in \mathbb{T}, \qquad (5.90)$$

and $\hat{e}_{\ell|N-1} = 0$ for $\ell < 0$ and $\ell > N-1$.

Remark 5.6 The reader must notice that the same notation is used for defining the k step ahead predictor when semi-infinite and finite data sets are involved (i.e., $(z_t)_{t\in\{-\infty,\cdots,N-1\}}$ and $(z_t)_{t\in\{0,\cdots,N-1\}}$). Of course, both predictors are equal only if it can be guaranteed that $z_t = 0$ for $t < 0$. Even if this practical constraint on the “past” residuals samples is not always satisfied in practice, its impact on the predictor accuracy can be neglected when N is large because, with standard stationary time series, the effect of the transient decays quickly.

As shown in Sect. 4.4.4, in order to quantify the confidence we can have in the forecasts made with $\mathrm{E}\{z_{N-1+k} \,|\, (z_t)_{t\in\mathbb{T}}\}$, $k \in \mathbb{N}^*$, it is essential to determine the k step ahead prediction error variance, thus compute $\sigma^2 \sum_{j=0}^{k-1} \hat{h}_j^2$. As shown, e.g., in Sect. 4.4 or in Sect. 5.2.2, estimating $\sigma^2$ is straightforward once accurate values of $\{\hat a_i\}_{i=1}^{n_a}$ and $\{\hat b_i\}_{i=1}^{n_b}$ have been generated. As far as the determination of the finite set of weights $(\hat{h}_j)_{j=0}^{k-1}$ is concerned, a simple look at the link between $(h_j)_{j=0}^{k-1}$ and the parameters $\{a_i\}_{i=1}^{n_a}$ and $\{b_i\}_{i=1}^{n_b}$ gives access to an iterative solution easily. Indeed, we have

.

1 , i.e.,(h0 + h1 q−1 + · · · )(1 + a1 q−1 + · · · + ana q−na ) = 1, a(q)

(5.91)


for an AR model whereas

$$h(q) = \frac{b(q)}{a(q)}, \ \text{i.e.,} \ (h_0 + h_1 q^{-1} + \cdots)(1 + a_1 q^{-1} + \cdots + a_{n_a} q^{-n_a}) = 1 + b_1 q^{-1} + \cdots + b_{n_b} q^{-n_b}, \qquad (5.92)$$

for an ARMA model. Thus, the weights $(\hat{h}_j)_{j=0}^{k-1}$ can be iteratively determined from $\{\hat a_i\}_{i=1}^{n_a}$ and $\{\hat b_i\}_{i=1}^{n_b}$ as follows

$$\hat{h}_0 = 1, \qquad (5.93a)$$
$$\hat{h}_1 = -\hat{h}_0 \hat{a}_1, \qquad (5.93b)$$
$$\hat{h}_2 = -\hat{h}_0 \hat{a}_2 - \hat{h}_1 \hat{a}_1, \qquad (5.93c)$$
$$\vdots$$

i.e., $\hat{h}_0 = 1$ and $\hat{h}_\ell = -\sum_{j=0}^{\ell-1} \hat{h}_j \hat{a}_{\ell-j}$ for $\ell \in \{1, \cdots, k-1\}$, $k \in \mathbb{N}^*$, for an AR model whereas

$$\hat{h}_0 = 1, \qquad (5.94a)$$
$$\hat{h}_1 = \hat{b}_1 - \hat{h}_0 \hat{a}_1, \qquad (5.94b)$$
$$\hat{h}_2 = \hat{b}_2 - \hat{h}_0 \hat{a}_2 - \hat{h}_1 \hat{a}_1, \qquad (5.94c)$$
$$\vdots$$

i.e., $\hat{h}_0 = 1$ and $\hat{h}_\ell = \hat{b}_\ell - \sum_{j=0}^{\ell-1} \hat{h}_j \hat{a}_{\ell-j}$ for $\ell \in \{1, \cdots, k-1\}$, $k \in \mathbb{N}^*$, for an ARMA model. Once $\hat{s}^2$ and $(\hat{h}_j)_{j=0}^{k-1}$ are available, confidence intervals can be generated for $\hat{z}_{N-1+k|N-1}$, $k \in \mathbb{N}^*$, by following the ideas introduced in Sect. 4.4.4. More specifically, each value of $\hat{z}_{N-1+k|N-1}$, $k \in \mathbb{N}^*$, can be associated with the following $(1-\alpha) \times 100\%$ confidence interval

$$\left[\hat{z}_{N-1+k|N-1} - t_{\alpha/2,\, N-n_a-n_b} \sqrt{\hat{s}^2 \sum_{j=0}^{k-1} \hat{h}_j^2}\,,\ \ \hat{z}_{N-1+k|N-1} + t_{\alpha/2,\, N-n_a-n_b} \sqrt{\hat{s}^2 \sum_{j=0}^{k-1} \hat{h}_j^2}\,\right]. \qquad (5.95)$$
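The truncated predictors of Eqs. (5.88)–(5.90), the weight recursion of Eq. (5.94), and the confidence interval of Eq. (5.95) translate almost line by line into code. The following Python sketch assumes NumPy/SciPy, an ARMA parameterization a_hat = $[\hat a_1,\cdots,\hat a_{n_a}]$ and b_hat = $[\hat b_1,\cdots,\hat b_{n_b}]$, a noise variance estimate s2, and N larger than the model orders; the function name is illustrative.

```python
import numpy as np
from scipy.stats import t as student_t

def arma_forecast(z, a_hat, b_hat, s2, k_max, alpha=0.05):
    """Truncated k step ahead forecasts (Eqs. (5.88)-(5.90)) with the intervals of Eq. (5.95)."""
    z = np.asarray(z, dtype=float)
    N, na, nb = len(z), len(a_hat), len(b_hat)        # assumes N > max(na, nb)
    # Truncated one step ahead prediction errors on T (Eq. (5.90)), zero initial conditions
    e_hat = np.zeros(N)
    for t in range(N):
        az = z[t] + sum(a_hat[j] * z[t - 1 - j] for j in range(na) if t - 1 - j >= 0)
        e_hat[t] = az - sum(b_hat[j] * e_hat[t - 1 - j] for j in range(nb) if t - 1 - j >= 0)
    # Recursive forecasts (Eqs. (5.88)-(5.89)): future innovations are replaced by zero
    z_ext = np.concatenate((z, np.zeros(k_max)))
    e_ext = np.concatenate((e_hat, np.zeros(k_max)))
    for k in range(1, k_max + 1):
        idx = N - 1 + k
        z_ext[idx] = (-sum(a_hat[j] * z_ext[idx - 1 - j] for j in range(na))
                      + sum(b_hat[j] * e_ext[idx - 1 - j] for j in range(nb)))
    forecasts = z_ext[N:]
    # Impulse response weights (Eq. (5.94)) and interval half-widths (Eq. (5.95))
    h = np.zeros(k_max)
    h[0] = 1.0
    for l in range(1, k_max):
        b_l = b_hat[l - 1] if l <= nb else 0.0
        h[l] = b_l - sum(h[j] * a_hat[l - 1 - j] for j in range(l) if l - 1 - j < na)
    half = student_t.ppf(1 - alpha / 2, N - na - nb) * np.sqrt(s2 * np.cumsum(h ** 2))
    return forecasts, forecasts - half, forecasts + half
```

Setting b_hat to an empty list recovers the pure AR case of Eqs. (5.88) and (5.93).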



Fig. 5.12 Estimated and predicted AR model output for $t \in$ {Feb. 1959, $\cdots$, Jan. 1992} with confidence intervals for the predicted part. CO$_2$ concentration at Mauna Loa Observatory, Hawaii

Illustration 5.6 Let us study (for the last time) the residuals generated after learning the deterministic components of the CO$_2$ concentration time series measured at Mauna Loa Observatory, Hawaii. As shown in Illustration 5.1, this residuals time series can be accurately described with the help of an AR(3) model. Based on the estimated parameters gathered in Table 5.2 and the available time series, the values of $\hat{z}_t$ for $t \in$ {Jan. 1991, Feb. 1991, $\cdots$, Dec. 1991, Jan. 1992} can be predicted, then plotted with confidence intervals as explained previously. The resulting time series forecast is given in Fig. 5.12. This curve shows that the increasing tendency of the last samples of the training time series is well mimicked by the predictions for the first four or five predicted months. Note also (i) how the forecast levels off quickly and (ii) how the confidence intervals are wide even if the forecast limits are only based on 2.3 standard deviations.

5.5 Take Home Messages

• By assuming that the residuals time series
  – Is a realization of a zero mean weak stationary stochastic sequence which does not contain linearly singular components anymore.
  – Has been generated by a stable, causal, linear, and time invariant system.
  the modeling of the residuals boils down to the estimation of ARMA model coefficients, i.e., a finite number of constant weights, thanks to the Wold decomposition theorem.
• The computation of AR model parameters from residuals realizations can be carried out by using a linear least squares algorithm.
• The computation of ARMA model parameters from residuals realizations can be carried out by using a pseudolinear least squares or a nonlinear least squares algorithm.
• Both AR and ARMA model parameters can be estimated consistently under mild practical conditions.
• The autocorrelation and partial autocorrelation functions are easy-to-implement tools to determine if an AR or ARMA model structure should be considered for residuals time series modeling.
• Prediction with confidence intervals can be carried out once reliable AR or ARMA model parameters are available.

References

S. Bisgaard, M. Kulahci, Time Series Analysis and Forecasting by Example (Wiley, London, 2011)
S. Boyd, L. Vandenberghe, Introduction to Applied Linear Algebra (Cambridge University Press, Cambridge, 2018)
P. Brockwell, R. Davis, Time Series: Theory and Methods (Springer, Berlin, 1991)
P. Brockwell, R. Davis, Introduction to Time Series and Forecasting (Springer, Berlin, 2016)
G. Goodwin, R. Payne, Dynamic System Identification: Experiment Design and Data Analysis (Academic Press, London, 1977)
E. J. Hannan, M. Deistler, The Statistical Theory of Linear Systems (Wiley, London, 1988)
H. Hsu, Signals and Systems (McGraw-Hill, New York, 2019)
R. Isermann, M. Münchhof, Identification of Dynamic Systems: An Introduction with Applications (Springer, Berlin, 2011)
R. Johansson, System Modeling and Identification (Prentice Hall, Englewood Cliffs, 1993)
T. Kailath, Linear Systems (Prentice Hall, Englewood Cliffs, 1980)
L. Ljung, System Identification. Theory for the User (Prentice Hall, Englewood Cliffs, 1999)
L. Ljung, Prediction error estimation methods. Circuits Syst. Signal Process. 21, 11–21 (2002)
L. Ljung, T. Söderström, Theory and Practice of Recursive Identification (MIT Press, Cambridge, 1983)
M. Lovera, Control-Oriented Modelling and Identification: Theory and Practice (The Institution of Engineering and Technology, 2014)
C. Meyer, Matrix Analysis and Applied Linear Algebra (SIAM, Philadelphia, 2000)
D. Montgomery, C. Jennings, M. Kulahci, Introduction to Time Series Analysis and Forecasting (Wiley, London, 2015)
R. Neidinger, Introduction to automatic differentiation and MATLAB object-oriented programming. SIAM Rev. 52(3), 545–563 (2010)
O. Nelles, Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models (Springer, Berlin, 2000)
A. Oppenheim, S. Willsky, S. Hamid, Signals and Systems (Pearson, London, 2014)
S. Rogers, M. Girolami, A First Course in Machine Learning (CRC Press, Boca Raton, 2017)


R. Shumway, D. Stoffer, Time Series Analysis and Its Applications with R Examples (Springer, Berlin, 2017)
T. Söderström, P. Stoica, System Identification (Prentice Hall, Englewood Cliffs, 1989)
S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective (Academic Press, London, 2015)
R. Tóth, Identification and Modeling of Linear Parameter-Varying Systems. Lecture Notes in Control and Information Sciences, vol. 403 (Springer, 2010)
M. Verhaegen, V. Verdult, Filtering and System Identification: A Least Squares Approach (Cambridge University Press, Cambridge, 2007)

Chapter 6

A Last Illustration to Conclude

In order to sum up the main contributions of this document and quickly illustrate the efficiency of the model learning solutions introduced so far, let us consider a last time series: the monthly electricity production in France between 1981 and 2013 (see Fig. 6.1). As clearly shown in Fig. 6.1, this time series seems to be made of an increasing trend and a periodic behavior. Such a quick graphical examination is directly confirmed thanks to the singular spectrum analysis introduced in Chap. 2. Indeed, by selecting the window length equal to 12 (as dictated by the yearly periodicity of the time series), the leading singular values of $Y_{12}$ given in Fig. 6.2 as well as the leading right singular vectors of $Y_{12}$ given in Fig. 6.3 show that one trend component and four periodic components should be enough to describe the main dynamics of the time series. This idea is confirmed in Fig. 6.5 where the SSA trend and seasonal components drawn in Fig. 6.4 are combined and then compared with the raw data set. As quantified by a BFT of 87.17% and a VAF of 98.35%, these five components are definitely enough for mimicking the main dynamical behavior of the French electricity production between 1981 and 2013 (Fig. 6.5).

Before diving into the analysis of the residuals, let us spend some time to model the trend and periodic components of Fig. 6.4 with parsimonious representations. Let us more precisely start with the trend. Whereas the top curve of Fig. 6.4 clearly indicates that a polynomial function can be suggested for describing this dynamical behavior, selecting the “best” polynomial function order is of prime importance to guarantee a good trade-off between complexity and efficiency of the estimated model. As suggested in Chap. 3, a good candidate for this model order selection is a K-fold validation procedure. Running this validation procedure for our polynomial function order selection problem leads to the curves gathered in Fig. 6.6, then to the conclusion that a third order polynomial function should be selected herein to avoid overfitting. As shown in Fig. 6.7, then confirmed with BFT and VAF indicators equal to 92.51% and 99.44%, respectively, a third order polynomial function with estimated parameters gathered in Table 6.1 is indeed able to reproduce the trend


Fig. 6.1 Monthly electricity production in France


Fig. 6.2 Leading singular values of $Y_{12}$. Monthly electricity production in France

component evolution correctly. Let us now turn to the seasonality time series (see the bottom curve of Fig. 6.4). Because of the slow increase of its amplitude, a good candidate for model learning is (again) the parametric function

$$y_t = a\, \frac{\omega_0}{\sqrt{1-\xi^2}}\, e^{-\xi \omega_0 t} \sin\!\left(\omega_0 \sqrt{1-\xi^2}\, t + \psi\right), \qquad (6.1)$$



Fig. 6.3 Leading singular vectors of $Y_{12}$. Monthly electricity production in France


Fig. 6.4 SSA time series components: trend (top) and seasonality (bottom). Monthly electricity production in France

as suggested in Chap. 3. By resorting to the nonlinear least squares algorithm introduced in Chap. 3 with initial guesses determined graphically, the estimated parameters gathered in Table 6.2 as well as the model plotted in Fig. 6.8 are easily generated, the accuracy of which can be quantified by BFT and VAF indicators equal to 78.77% and 92.73%, respectively. Combining these trend and seasonality models to reproduce the raw time series dynamical evolution leads to the curves given in



Fig. 6.5 Initial and SSA time series (top) with residuals (bottom). Monthly electricity production in France


Fig. 6.6 K-fold procedure for model order selection. Monthly electricity production in France
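A K-fold order selection procedure of the kind summarized in Fig. 6.6 can be sketched as follows. This is a minimal NumPy version, assuming the trend samples are stored in arrays t and y; the function name, the use of numpy.polyfit, and the random fold assignment are illustrative choices, not the exact implementation used for this figure.

```python
import numpy as np

def kfold_poly_order(t, y, max_order=6, K=5, seed=0):
    """K-fold cross validation error for polynomial trend models of increasing order."""
    t = np.asarray(t, float) - np.mean(t)   # centering improves the conditioning of the fit
    y = np.asarray(y, float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(t)), K)
    cv_error = []
    for order in range(max_order + 1):
        errs = []
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            coeffs = np.polyfit(t[train], y[train], order)
            errs.append(np.mean((y[test] - np.polyval(coeffs, t[test])) ** 2))
        cv_error.append(np.mean(errs))
    return np.array(cv_error)   # pick the order at which this validation error stops decreasing
```

The retained order is then the one beyond which the validation error no longer decreases significantly, which is the reasoning that led to the third order polynomial selected above.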

Fig. 6.9 and BFT and VAF indicators equal to 73.04% and 95.49%, respectively (Fig. 6.10). Once reliable and parsimonious models for the trend and seasonal components of the French electricity production time series are available, let us focus on the residuals model learning step (see the bottom curve of Fig. 6.9). As explained in Chap. 4, this step starts with basic tests such as the autocorrelation function test or the portmanteau test in order to detect if there is still dependency among the



Fig. 6.7 SSA trend component and corresponding estimated linear least squares model. Monthly electricity production in France

Table 6.1 Estimated parameters of the polynomial model for the SSA trend of the French electricity production time series: $\hat\theta_0 = 3.5536 \times 10^{9}$, $\hat\theta_1 = -5.4029 \times 10^{6}$, $\hat\theta_2 = 2.7374 \times 10^{3}$, $\hat\theta_3 = -0.4622$

Table 6.2 Initial guesses (top) and estimated parameters (bottom) of the harmonic model given in Eq. (6.1) for the SSA seasonal component of the French electricity production time series

$a^{0} = 5 \times 10^{2}$, $\xi^{0} = -0.005$, $\omega_0^{0} = 2\pi$, $\psi^{0} = 0$
$\hat a = 7.3004 \times 10^{2}$, $\hat\xi = -3.1403 \times 10^{-3}$, $\hat\omega_0 = 6.2842$, $\hat\psi = -8.9794 \times 10^{-1}$

residuals samples. Running such tests with the residuals given in Fig. 6.9 leads to the curves gathered in Fig. 6.11 as well as portmanteau values of $Q = 570.1$, whereas $\chi^2_{0.95}(35) = 49.8$. These portmanteau test figures clearly show that the residuals are far from being i.i.d., whereas Fig. 6.11 guides us to select an ARMA model structure for mimicking the residuals dynamics. Thanks to a model order selection procedure based on Akaike's Information Criterion as suggested in Chap. 3, an ARMA(14, 9) is more precisely selected for this residuals model learning step. Again, the accuracy of the ARMA(14, 9) model (generated by combining the least squares solutions introduced in Sect. 5.2.2) can be tested by analyzing the remaining signal $(e_t)_{t\in\mathbb{T}}$, i.e., the time series generated by subtracting the ARMA(14, 9) model output from the residuals time series $(z_t)_{t\in\mathbb{T}}$. The autocorrelation function test (see Fig. 6.12) and the portmanteau test (with $Q = 37.49$ and $\chi^2_{0.95}(35) = 49.8$) are both leading to the conclusion that (i) the time series $(e_t)_{t\in\mathbb{T}}$ can be considered as a realization of



Fig. 6.8 SSA seasonal component and corresponding estimated nonlinear least squares model. Monthly electricity production in France


Fig. 6.9 Initial and reconstructed time series (top) with residuals (bottom). Monthly electricity production in France

a zero mean white noise (close to being normally distributed as illustrated in Fig. 6.13), and (ii) an ARMA(14, 9) model can be used for modeling the residuals of Fig. 6.9 efficiently. Let us complete the analysis of the ARMA(14, 9) model properties by determining its forecasting capabilities. More precisely, by running the procedure introduced in Sect. 5.4, let us test the prediction ability of the ARMA(14, 9) model by generating its output for $t \in \{2013, \cdots, 2017\}$ with confidence intervals. As


Fig. 6.10 Estimated and predicted ARMA model output for .t ∈ {Feb. 2013, · · · , Jan. 2017} with confidence intervals for the predicted part. Monthly electricity production in France

clearly shown in Fig. 6.10, the ARMA(14, 9) model is able to forecast future events reliably as illustrated by the main dynamics of the future time series associated with reasonable confidence intervals. Such an observation reinforces our former conclusion on the ARMA(14, 9) model capabilities to describe the French electricity production residuals dynamics accurately.

The last stage of our model learning procedure consists in quantifying the confidence we can have in the estimated trend and seasonality models. Herein, a bootstrap based solution is favored because of the involvement of nonlinear least squares for estimating the parameters of the periodic pattern model. More precisely, because the time series $(e_t)_{t\in\mathbb{T}}$ can be considered as a realization of a zero mean white noise (as proved, e.g., in Fig. 6.12), our procedure consists in (see also the code sketch after the next list):

• Generating B new residuals time series $(z_t^b)_{t\in\mathbb{T}}$ by randomly selecting, with replacement, N samples from the available time series $(e_t)_{t\in\mathbb{T}}$, and then feeding each of the B resampled time series $(e_t^b)_{t\in\mathbb{T}}$ to the ARMA(14, 9) model estimated previously
• Simulating B new estimated model outputs by adding up each of the B new residuals time series $(z_t^b)_{t\in\mathbb{T}}$ to $f(t, \hat{\theta})$
• Estimating B least squares parameter vectors $\hat{\theta}^b$ from each of the B new training sets by resorting to the techniques described in Chap. 3 as suggested in Chap. 4.

Such an easy-to-run procedure leads to the histograms gathered in Fig. 6.14 from which empirical variance estimates can be determined and then used to assess the confidence we can have in our deterministic models when $t \in \{1981, \cdots, 2013\}$ (i.e., for simulation) but also when $t \in \{2013, \cdots, 2017\}$ (i.e., for prediction). As shown in Fig. 6.15:



Fig. 6.11 Autocorrelogram and partial autocorrelogram of the residuals with bounds $\pm 1.96/\sqrt{N}$ in dash. Monthly electricity production in France

• For $t \in \{1981, \cdots, 2013\}$, apart from outliers occurring regularly in January, the raw time series is globally always located within the confidence intervals.
• For $t \in \{2013, \cdots, 2017\}$, the estimated model is able to reproduce the decreasing trend engaged in the last samples of the training time series and thus can be used with confidence for predicting future values based on previously observed samples.
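A minimal sketch of the bootstrap procedure described before this list could look as follows. It assumes NumPy, a fitted deterministic model $f(t, \hat\theta)$ available as the function det_model, an ARMA simulator arma_filter built from the estimated ARMA(14, 9) coefficients, and a least squares refitting routine fit_deterministic; all these names are hypothetical placeholders for the routines developed in the previous chapters.

```python
import numpy as np

def bootstrap_parameter_distribution(t, e_hat, theta_hat, B,
                                     det_model, arma_filter, fit_deterministic, seed=0):
    """Residual bootstrap of the deterministic (trend + seasonality) model parameters."""
    rng = np.random.default_rng(seed)
    N = len(e_hat)
    theta_boot = []
    for _ in range(B):
        e_b = rng.choice(e_hat, size=N, replace=True)   # resample the whitened residuals
        z_b = arma_filter(e_b)                          # new correlated residuals via the ARMA model
        y_b = det_model(t, theta_hat) + z_b             # new synthetic training set
        theta_boot.append(fit_deterministic(t, y_b))    # refit trend + seasonality parameters
    return np.asarray(theta_boot)                       # empirical distribution, as in Fig. 6.14
```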



Fig. 6.12 Autocorrelogram of the residuals after ARMA modeling with bounds $\pm 1.96/\sqrt{N}$ in dash. Monthly electricity production in France

Fig. 6.13 Histogram and QQ plot of the residuals after ARMA modeling. Monthly electricity production in France


Fig. 6.14 Empirical distributions of the B bootstrap estimated parameter vectors $\hat{\theta}^b$ for the trend and periodic pattern models. Monthly electricity production in France


Fig. 6.15 Trend and seasonality models with confidence intervals for .t ∈ {1981, · · · , 2017}. Monthly electricity production in France

All these good mimicking capabilities prove that the multi-step approach considered in this book is an efficient and easy-to-implement first line univariate time series model learning solution.

Appendix A

Vectors and Matrices

The results gathered herein are mainly extracted from Meyer (2000); Golub and Van Loan (2013); Zhang (2017); Boyd and Vandenberghe (2018). Please read these documents for proofs and details. Notice that all the developments introduced so far involve finite sequences of real numbers only. Thus, the rest of this Appendix focuses on real vectors and matrices simply even if most of the definitions and properties detailed hereafter are valid for other fields (such as .C for instance).

A.1 Vector Space

Before focusing on real vectors and matrices, it is essential to recall important definitions that are valid for generic nonempty sets, the elements of which are called vectors. Of course, as shown hereafter, these definitions are true for the (Euclidean) set $\mathbb{R}^{n\times 1}$ as well.

A.1.1 Linear Space and Subspace

Herein, we have chosen to define vector spaces w.r.t. $\mathbb{R}$ because of the focus on real sequences in the former chapters. Notice, however, that most of these definitions can also be used for other fields (such as $\mathbb{C}$ for instance) with notation modifications mainly.

• A linear space (or vector space) V over the field $\mathbb{R}$ is a set of elements (called vectors) that satisfies, for all vectors x, y and z of V and α and β belonging to $\mathbb{R}$,

  x + y ∈ V and αx ∈ V,
  x + y = y + x and (x + y) + z = x + (y + z),


  (αβ)x = α(βx) and 1x = x,
  (α + β)x = αx + βx and α(x + y) = αx + αy,
  a zero vector 0 exists in V s.t. 0 + x = x,
  for each x ∈ V, −x ∈ V exists s.t. (−x) + x = 0.

• Given a vector space V, a nonempty subset U of V is a subspace of V if and only if, for all vectors x and y of U and α and β belonging to $\mathbb{R}$, αx + βy ∈ U.

• A subspace is a linear space.

A.1.2 Inner Product, Induced Norm and Inner Product Space

• Given a vector space V over $\mathbb{R}$, the map ⟨•, •⟩ : V × V → $\mathbb{R}$ is called an inner product if and only if it satisfies the following three properties for all vectors x, y and z of V and α and β belonging to $\mathbb{R}$

  ⟨x, y⟩ = ⟨y, x⟩,
  ⟨αx + βy, z⟩ = α⟨x, z⟩ + β⟨y, z⟩,
  ⟨x, x⟩ ≥ 0 and ⟨x, x⟩ = 0 ⇔ x = 0.

• A vector space with an inner product is called an inner product space.
• Given an inner product space V over $\mathbb{R}$ (equipped with the inner product ⟨•, •⟩), the induced norm ∥•∥ : V → $\mathbb{R}_+$ is defined for all vectors x of V as

  ∥x∥ = √⟨x, x⟩.

A.1.3 Complementary Subspaces

• Given a vector space V and two subspaces X and Y of V,
  – X + Y is the sum of X and Y, i.e., X + Y = {x + y | x ∈ X and y ∈ Y},
  – X ∩ Y is the intersection of X and Y, i.e., X ∩ Y = {v ∈ V | v ∈ X and v ∈ Y},


  – X and Y are said to be disjoint if and only if X ∩ Y = {0},
  – V is the direct sum of X and Y, i.e., V = X ⊕ Y, if and only if X and Y are complementary subspaces of V, i.e., if and only if V = X + Y and X ∩ Y = {0}.

A.1.4 Orthogonal Complement and Projection

• Given an inner product space V (equipped with the inner product ⟨•, •⟩) and a subspace X of V, the orthogonal complement X⊥ of X is defined as X⊥ = {x ∈ V | ⟨x, y⟩ = 0 ∀ y ∈ X}.
• Let X be a subspace of an inner product space V. Then, for any v ∈ V, unique vectors x ∈ X and y ∈ X⊥ exist such that v = x + y.
• The aforedefined vector x is called the orthogonal projection of v onto X.
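As a small numerical illustration of these definitions in the Euclidean case (a hedged NumPy sketch with illustrative variable names), the orthogonal projection of a vector v onto the subspace spanned by the columns of a full column rank matrix X can be obtained with a least squares solve, and v then splits into a component in the subspace and a component in its orthogonal complement.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))           # basis of a two-dimensional subspace of R^6
v = rng.standard_normal(6)

coeffs, *_ = np.linalg.lstsq(X, v, rcond=None)
p = X @ coeffs                             # orthogonal projection of v onto range(X)
r = v - p                                  # component lying in the orthogonal complement
print(np.allclose(X.T @ r, np.zeros(2)))   # r is orthogonal to every column of X
```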

A.2 Vector

A.2.1 First Definitions

• Finite-dimensional real vectors are the key ingredients of linear algebra.
• A real vector is an array or ordered list of real numbers.
• A real column vector made of real elements $\{x_1, \cdots, x_n\}$ is defined as follows

$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n \times 1}.$$

• The number of elements in a real vector is called the dimension or size of the vector.

Remark A.1 In order to shorten the notations, we may say that “x is a $n_{\mathbb{R}}$-vector” instead of “$x \in \mathbb{R}^{n\times 1}$.”


A.2.2 Basic Operations • A real row vector is denoted by x  and is written as follows:

x  = x1 · · · xn ∈ R1×n .

.

• This real row vector x  is the transpose of the real column vector x and vice versa. • In order to play with nR -vectors, operations can be introduced: – The multiplication of a vector x ∈ Rn×1 by a scalar α ∈ R is equal to a vector belonging to Rn×1 and is defined as ⎤ αx1 ⎢ . ⎥ .αx = ⎣ . ⎦ . . ⎡

αxn – The sum of two nR -vectors x and y is equal to a vector belonging to Rn×1 and is defined as ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ y1 x1 + y1 x1 ⎢ . ⎥ ⎢ . ⎥ ⎢ .. ⎥ .x + y = ⎣ . ⎦ + ⎣ . ⎦ = ⎣ . . . ⎦. xn

yn

xn + yn

A.2.3 Span, Vector Space, and Subspace • Given the aforedefined two operations, a set {x 1 , · · · , x k } of k nR -vectors as well as a set of real scalars {α1 , · · · , αk }, the vector α1 x 1 + · · · + αk x k

.

can be generated straightforwardly. This vector belongs to Rn×1 as well. • Given a set {x 1 , · · · , x k } of k nR -vectors, k ≤ n, the set U defined as follows U=

k

.



∗ αi x i αi ∈ R, i ∈ {1, · · · , k}, k ∈ N ,

i=1

is called the space spanned by {x 1 , · · · , x k } and is compactly denoted as follows: U = span(x 1 , · · · , x k ).

.

A

Vectors and Matrices

173

• The Euclidean space Rn×1 , i.e., the space spanned by the collection of all nR vectors, is a vector space (or linear space) over the field R. • Given a set {x 1 , · · · , x k } of k nR -vectors, k ≤ n, U = span(x 1 , · · · , x k ) is a subspace of Rn×1 .

A.2.4 Linear Dependency and Basis • The set {x 1 , · · · , x k } of k nR -vectors, k ≤ n, is called linearly independent if and only if k .

αi x i = 0 ⇒ α1 = · · · = αk = 0.

i=1

It is called linearly dependent otherwise. • A linearly independent collection of vectors belonging to Rn×1 can have at most n elements. • A basis of Rn×1 is a set of linearly independent vectors belonging to Rn×1 that spans Rn×1 . • The choice of a basis of Rn×1 is not unique, but the number of the elements of any basis of Rn×1 is unique, equals n, and is called the dimension of Rn×1 . • Once a basis {x 1 , · · · , x n } of Rn×1 is selected, any y ∈ Rn×1 can be expressed as a linear combination of the form y=

n

.

βi x i ,

i=1

where the coefficients {β1 , · · · , βn }, βi ∈ R, i ∈ {1, · · · , n}, are the unique components of y w.r.t. the basis {x 1 , · · · , x n }.

A.2.5 Euclidean Inner Product and Norm • The (Euclidean) inner product of two nR -vectors x and y is defined as the scalar x  y = x1 y1 + · · · + xn yn .

.

• The (Euclidean induced) 2-norm of a nR -vector x is defined as follows: x 2 =

.



xx =

 x12 + · · · + xn2 .

174

A Vectors and Matrices

A.2.6 Orthogonality and Orthonormality • Two nR -vectors x and y are orthogonal if and only if x  y = 0.

.

• A collection of m nR -vectors {x 1 , · · · , x m } is: – Orthogonal if and only if x i x j = 0, i ∈ {1, · · · , m}, j ∈ {1, · · · , m}, i = j

.

– Orthonormal if and only if it is orthogonal and x i 2 = 1, i ∈ {1, · · · , m}

.

• A collection of orthogonal or orthonormal vectors belonging to Rn×1 is a linearly independent collection.

A.3 Matrix A.3.1 First Definitions • A matrix is a compact way to gather a set of vectors of the same size. • From a collection of n mR -vectors {x 1 , · · · , x n }, a matrix X ∈ Rm×n , i.e., a m × n real matrix X, can be defined as follows:

.X = x 1 · · · x n . A matrix can also be defined row-wise from the same collection of mR -vectors, i.e., a matrix X ∈ Rn×m is defined as follows: ⎡ ⎤ x1 ⎢ .. ⎥ .X = ⎣ . ⎦. x n

• Given a matrix X = x 1 · · · x n ∈ Rm×n with x i ∈ Rm×1 , i ∈ {1, · · · , n}, its transpose X ∈ Rn×m satisfies ⎤ x 1 ⎢ ⎥  .X = ⎣ ... ⎦ . ⎡

x n

A

Vectors and Matrices

175

A.3.2 Basic Operations • Given two matrices A ∈ Rm×n and B ∈ Rm×n , the sum A + B is a matrix of Rm×n calculated entry-wise,2 i.e., (A + B)(ij ) = A(ij ) + B (ij ) , i ∈ {1, · · · , n}, j ∈ {1, · · · , m}.

.

• Given two matrices A ∈ Rm×n and B ∈ Rm×n , (A + B) = B  + A .

.

• Given a matrix A ∈ Rm×n and a scalar α ∈ R, the scalar multiplication αA is a matrix of Rm×n computed by multiplying every entry of A by α. • Given a matrix A ∈ Rm×n and a vector x ∈ Rn×1 , the matrix–vector multiplication y = Ax can be defined as the vector y ∈ Rm×1 satisfying y = x1 a 1 + · · · + xn a n ,

.

where

A = a 1 · · · a n , a i ∈ Rm×1 , i ∈ {1, · · · , n}.

.

• Given two matrices A ∈ Rm×n and B ∈ Rn×p , the matrix product AB is a matrix belonging to Rm×p that can be defined as the m × p matrix satisfying ⎤  a 1 b1 · · · a 1 bp ⎢ .. . . . ⎥ .AB = ⎣ . .. ⎦ , .  a m b1 · · · a m bp ⎡

with ⎤ a 1 ⎢ . ⎥ n×1 .A = ⎣ . ⎦ , a i ∈ R , i ∈ {1, · · · , m}, . ⎡

a m

B = b1 · · · bp , bi ∈ Rn×1 , i ∈ {1, · · · , p}. • Given two matrices A ∈ Rm×n and B ∈ Rn×p , (AB) = B  A .

.

2 For a matrix A ∈ Rm×n , A (ij ) , i ∈ {1, · · · , m}, j ∈ {1, · · · , n}, stands for the coefficient of A at the intersection of row i and column j , i ∈ {1, · · · , m}, j ∈ {1, · · · , n}.

176

A Vectors and Matrices

• Generally speaking, given two matrices A and B of appropriate dimensions, AB = BA.

.

A.3.3 Symmetry • A matrix A ∈ Rn×n is a symmetric matrix if and only if A = A,

.

i.e., aij = aj i , i ∈ {1, · · · , n}, j ∈ {1, · · · , n}. • A matrix A ∈ Rn×n is a skew-symmetric matrix if and only if A = −A,

.

i.e., aij = −aj i , i ∈ {1, · · · , n}, j ∈ {1, · · · , n}. • Given a matrix A ∈ Rn×n , A + A is a symmetric matrix.

A.3.4 Hankel and Toeplitz Matrices • A Hankel matrix A ∈ Rm×n is a matrix with constant skew diagonals, i.e., aij = ai−1,j +1 , i ∈ {1, · · · , m}, j ∈ {1, · · · , n}. For instance, ⎡

a ⎢b .⎢ ⎣c d

b c d e

c d e f

⎤ d e⎥ ⎥ f⎦ g

is a Hankel matrix. • A Toeplitz matrix A ∈ Rm×n is a matrix with constant diagonals, i.e., aij = ai+1,j +1 , i ∈ {1, · · · , m}, j ∈ {1, · · · , n}. For instance, ⎡

a ⎢f .⎢ ⎣g h is a Toeplitz matrix.

b a f g

c b a f

⎤ d c⎥ ⎥ b⎦ a

A

Vectors and Matrices

177

A.3.5 Gram, Normal, and Orthogonal Matrices • Given a matrix A ∈ Rm×n with columns {a 1 , · · · , a n }, a i ∈ Rm×1 , i ∈ {1, · · · , n}, the matrix product A A ∈ Rn×n is called the Gram matrix associated with the set of mR -vectors {a 1 , · · · , a n }.   • A Gram matrix is symmetric, i.e., A A = A A. • A matrix A ∈ Rn×n is said to be a normal matrix if and only if AA = A A.

.

• A matrix A ∈ Rn×n is said to be an orthogonal matrix if and only if AA = A A = I n×n .

.

• The rows and columns of an orthogonal matrix A ∈ Rn×n constitute orthonormal collections of vectors, thus orthonormal bases of Rn×1 . • A matrix A ∈ Rm×n is said to be a semi-orthogonal matrix if and only if AA = I m×m or A A = I n×n .

.

• The rows or columns of a semi-orthogonal matrix A ∈ Rm×n constitute an orthonormal collection of vectors, thus orthonormal bases of Rm×1 and Rn×1 , respectively.

A.3.6 Vectorization and Frobenius Matrix Norm • Given a matrix A ∈ Rm×n such that

A = a 1 · · · a n , a i ∈ Rm×1 , i ∈ {1, · · · , n},

.

its vectorization is the mn × 1 column vector obtained by stacking its columns on top of one another, i.e., ⎤ a1 ⎢.⎥ nm×1 .vec(A) = ⎣ . ⎦ ∈ R . . ⎡

an

178

A Vectors and Matrices

• Given a matrix A ∈ Rm×n such that ⎡

a11 a12 ⎢ a21 a22 ⎢ .A = ⎢ . .. ⎣ .. . am1 am2

··· ··· .. .

⎤ a1n a2n ⎥ ⎥ .. ⎥ , . ⎦

· · · amn

the 2-norm of vec(A) is the Frobenius norm of A, i.e., ⎛ ⎞1/2 n m . A F = vec(A) 2 = ⎝ aij2 ⎠ . i=1 j =1

A.3.7 Quadratic Form and Positive Definiteness • For a symmetric matrix A ∈ Rn×n : – A is positive definite if and only if, for all nonzero x ∈ Rn×1 , the quadratic form x  Ax satisfies x  Ax > 0.

.

– A is positive semidefinite if and only if, for all x ∈ Rn×1 , the quadratic form x  Ax satisfies x  Ax ≥ 0.

.

– Indefinite if neither A nor −A is positive semidefinite. – We can similarly define the negative (semi)definiteness.

A.4 Matrix Fundamental Subspaces A.4.1 Range and Nullspace • The column space of a matrix A ∈ Rm×n , denoted by range(A), is the subspace spanned by the columns of A. • The column space of a matrix A ∈ Rm×n is a subspace of Rm×1 . • The row space of a matrix A ∈ Rm×n , denoted by range(A ), is the subspace spanned by the rows of A.

A

Vectors and Matrices

179

• The row space of a matrix A ∈ Rm×n is a subspace of Rn×1 . • The nullspace or kernel of a matrix A ∈ Rm×n , denoted by null(A), consists of all vectors x ∈ Rn×1 that satisfy Ax = 0.

.

• The nullspace of a matrix A ∈ Rm×n is a subspace of Rn×1 . • To sum up, for a matrix A ∈ Rm×n ,  range(A) = y  null(A) = x  range(A ) = x  null(A ) = y .

∈ Rm×1 |y = Ax, x ∈ Rn×1  ∈ Rn×1 |Ax = 0 ,



∈ Rn×1 |x = A y, y ∈ Rm×1  ∈ Rm×1 |A y = 0 .



A.4.2 Rank • The column rank of the matrix A ∈ Rm×n , denoted by rankcol (A), is the number of linearly independent columns of A, i.e., rankcol (A) = dim (range(A)).

.

• The row rank of the matrix A ∈ Rm×n , denoted by rankrow (A), is the number of linearly independent rows of A, i.e., rankrow (A) = dim (range(A )).

.

• The column rank of a matrix is always equal to its row rank, i.e., for a matrix A ∈ Rm×n , rankcol (A) = rankrow (A).

.

• The resulting value, denoted by rank(A), is called the rank of A. • For a matrix A ∈ Rm×n , rank(A) ≤ min (m, n) .

.

• A ∈ Rm×n has full rank if and only if rank (A) = min (m, n). • A ∈ Rm×n has full column rank if and only if n ≤ m and rank (A) = n.

180

A Vectors and Matrices

• A ∈ Rm×n has full row rank if and only if n ≥ m and rank (A) = m. • A ∈ Rm×n is rank deficient if and only if rank (A) < min (m, n).

A.5 Matrix Inverses A.5.1 Square Matrix Inverse • A matrix A ∈ Rn×n is invertible if a matrix B ∈ Rn×n exists such that AB = BA = I n×n .

.

• The aforementioned matrix B, denoted by A−1 , is unique and is called the inverse of A. • An invertible square matrix is said to be nonsingular. • A square matrix with no inverse is called a singular matrix. • Given a matrix A ∈ Rn×n , the following statements are equivalent: – A is invertible. – rank(A) = n or ker (A) = {0}. – The columns (resp., rows) of A are linearly independent. • Given two nonsingular matrices A ∈ Rn×n and B ∈ Rn×n , the product is also nonsingular and (AB)−1 = B −1 A−1 .

.

• For a nonsingular matrix A ∈ Rn×n , (A−1 ) = (A )−1 .

.

• An orthogonal matrix A ∈ Rn×n is nonsingular and A−1 = A .

.

A.5.2 Matrix Inversion Lemmas • Given the matrices A ∈ Rn×n , B ∈ Rn×m , C ∈ Rm×m , and D ∈ Rm×n , by assuming that A, C, and A + BCD are invertible, (A + BCD)−1 = A−1 − A−1 B(C −1 + DA−1 B)−1 DA−1 .

.

A

Vectors and Matrices

181

A.5.3 Matrix Pseudo-inverse • For a matrix A ∈ Rm×n : – The matrix B ∈ Rn×m that satisfies BA = I n×n but not AB = I m×m is called the left pseudo-inverse of the matrix A. – The matrix B ∈ Rn×m that satisfies AB = I m×m but not BA = I n×n is called the right pseudo-inverse of the matrix A. • The right and left pseudo-inverses of a matrix A ∈ Rm×n are denoted generically by A† . • The left (resp., right) pseudo-inverse of a matrix A ∈ Rm×n exists if and only if A has full column (resp., row) rank. • When A ∈ Rm×n with m > n has full column rank, A A is square, full rank, and invertible, whereas A† = (A A)−1 A . • When A ∈ Rm×n with n > m has full row rank, AA is square, full rank, and invertible, whereas A† = A (AA )−1 .

A.6 Some Useful Matrix Decompositions A.6.1 QR Decomposition • Any matrix A ∈ Rm×n with m ≥ n can be decomposed into the product of an orthogonal matrix Q ∈ Rm×m and an upper triangular matrix R ∈ Rm×n , i.e., A = QR.

.

• Given a matrix A ∈ Rm×n with m ≥ n, we can also write  A=Q

.



R1 0(m−n)×n

= Q1 Q2



R1

0(m−n)×n

 = Q1 R 1 ,

where R 1 ∈ Rn×n is upper triangular, whereas Q1 ∈ Rm×n and Q2 ∈ Rm×(m−n) are semi-orthogonal matrices, i.e., Q 1 Q1 = I n×n ,

Q 2 Q2 = I (m−n)×(m−n) ,

.

and Q 2 Q1 = 0(m−n)×n .

.

• For a full column rank matrix A ∈ Rm×n with m ≥ n, R 1 and Q1 are unique and the diagonal elements of R 1 are strictly positive.

182

A Vectors and Matrices

• For a full column rank matrix A ∈ Rm×n with m ≥ n, range(A) = range(Q1 ),

.

null(A ) = range(Q2 ). • The QR factorization of a full column rank3 matrix A ∈ Rm×n with m ≥ n yields orthonormal bases of range(A) and null(A ), respectively.

A.6.2 Singular Value Decomposition • Consider a rank r matrix A ∈ Rm×n with r ≤ min(m, n). Then, orthogonal matrices U ∈ Rm×m and V ∈ Rn×n exist such that   diag(σ1 , · · · , σr ) 0r×(n−r)  .U AV = Σ = ∈ Rm×n , 0(m−r)×r 0(m−r)×(n−r) where the singular values σi satisfy σ1 ≥ · · · ≥ σr > 0. • The columns of U are the left singular vectors of A, whereas the columns of V are the right singular vectors of A. • The number of nonzero singular values is equal to the rank of the rectangular matrix A. • Because, for A ∈ Rm×n , AA = U ΣΣ  U  ,

.

A A = V Σ  ΣV  : – The columns of U are the eigenvectors of AA . – The columns of V are the eigenvectors of A A. – Its nonzero singular values are the nonnegative square roots of the nonzero eigenvalues of A A or AA . • The SVD is not unique because U and V are not. • For any rank r rectangular matrix A ∈ Rm×n with r ≤ min(m, n), we can compactly write

A = U1 U2

.



Σ1

0(m−r)×r

= U 1Σ 1V  1 =

r

   V1 0(m−r)×(n−r) V  2 0r×(n−r)

σi ui v  i ,

i=1

3 For

rank deficient matrices, please refer, e.g., to Golub and Van Loan (2013, Section 5.4).

A

Vectors and Matrices

183

with Σ 1 = diag(σ1 , · · · , σr ) ∈ Rr×r ,

.

whereas:

– U 1 = u1 · · · ur ∈ Rm×r is a semi-orthogonal matrix, i.e., U  1 U 1 = I r×r .

 n×r – V 1 = v1 · · · vr ∈ R is a semi-orthogonal matrix, i.e., V 1 V 1 = I r×r .

– U 2 = ur+1 · · · um ∈ Rm×(m−r) is a semi-orthogonal matrix, i.e., U  2 U2 = I (m−r)×(m−r) .

– V 2 = v r+1 · · · v n ∈ Rn×(n−r) is a semi-orthogonal matrix, i.e., V  2 V2 = I (n−r)×(n−r) . • Furthermore, .

U 1 U 2 = 0r×(m−r) ,

U 2 U 1 = 0(m−r)×r ,

V 1 V 2 = 0r×(m−r) ,

V 2 V 1 = 0(m−r)×r ,

whereas range(A) = range(U 1 ),

.

range(A ) = range(V 1 ), null(A) = range(V 2 ), null(A ) = range(U 2 ).

A.6.3 Condition Number and Norms • The condition number of a full rank matrix A ∈ Rm×n is defined to be cond(A) =

.

σmax (A) . σmin (A)

• For a rank r matrix A ∈ Rm×n , A 2 = σmax (A),  A F = σ12 + · · · + σr2 . .

184

A Vectors and Matrices

A.6.4 SVD and Eckart–Young–Mirsky Theorem ˆ ∈ Rm×n , the cost • Given a rectangular matrix A ∈ Rm×n , minimizing, over A function ˆ F such that rank(A) ˆ ≤ r ≤ min(m, n) A − A

.

has an analytic solution in terms of the SVD of A, i.e.,    

Σ1 0 V1 , A = U1 U2 0 Σ2 V  2

.

where U 1 ∈ Rm×r , U 2 ∈ Rm×(m−r) , Σ 1 ∈ Rr×r , V 1 ∈ Rn×r , and V 2 ∈ ˆ ∗ = U 1 Σ 1 V  , i.e., Rn×(n−r) . This rank r solution is the matrix A 1 ˆ ∗ F = A − A

.

min

ˆ rank(A)≤r

ˆ F. A − A



ˆ is unique if and only if σr = σr+1 . • The minimum A • On top of that, ∗

ˆ F = A − A

.

ˆ ∗ 2 = A − A

 2 + · · · + σ 2, σr+1 n

min

ˆ F = A − A

min

ˆ 2 = σr+1 . A − A

ˆ rank(A)≤r ˆ rank(A)≤r

A.6.5 Moore–Penrose Pseudo-inverse • Consider a rank r rectangular diagonal matrix Σ ∈ Rm×n , m ≥ n, with r ≤ n, i.e.,  Σ=

.

Σ1

0(m−n)×n

 ,

with Σ 1 = diag(σ1 , · · · , σr , 0, · · · , 0) ∈ Rn×n .

.

The (Moore–Penrose) pseudo-inverse Σ † ∈ Rn×m of Σ ∈ Rm×n is defined as

Σ † = Σ −1 1 0n×(m−n) ,

.

A

Vectors and Matrices

185

with

  Σ_1^{-1} = diag(1/σ_1, ..., 1/σ_r, 0, ..., 0) ∈ R^{n×n}.

• Knowing that, for any rank r matrix A ∈ R^{m×n} with r ≤ n ≤ m,

  A = [U_1  U_2] [ Σ_1  0 ;  0  0 ] [ V_1^T ;  V_2^T ],

where U_1 ∈ R^{m×r}, U_2 ∈ R^{m×(m−r)}, Σ_1 ∈ R^{r×r}, V_1 ∈ R^{n×r}, and V_2 ∈ R^{n×(n−r)}, the (Moore–Penrose) pseudo-inverse A† ∈ R^{n×m} of A ∈ R^{m×n} can be defined as

  A† = V_1 Σ_1^{-1} U_1^T.

• The (Moore–Penrose) pseudo-inverse A† ∈ R^{n×m} of a matrix A ∈ R^{m×n} can also be defined as the unique matrix satisfying these four equalities

  A A† A = A,   A† A A† = A†,   (A A†)^T = A A†,   (A† A)^T = A† A.

• When A ∈ R^{m×m} is invertible,

  A† = A^{-1} and A† A = A A† = I_{m×m}.

• When A ∈ R^{m×n}, m > n, has full column rank n,

  A† = (A^T A)^{-1} A^T and A† A = I_{n×n}.

• When A ∈ R^{m×n}, m < n, has full row rank m,

  A† = A^T (A A^T)^{-1} and A A† = I_{m×m}.

• When A ∈ R^{m×n}, m ≥ n, with rank(A) = r < min(m, n),

  A† = V Σ† U^T = V_1 Σ_1^{-1} U_1^T,

where U_1 ∈ R^{m×r}, Σ_1 ∈ R^{r×r}, and V_1 ∈ R^{n×r}, whereas

  A A† = U_1 U_1^T,   A† A = V_1 V_1^T.

These identities are checked numerically in the sketch below.
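A minimal NumPy check of these properties (illustrative only; the rank-deficient matrix is an assumed example): numpy.linalg.pinv computes A† from the SVD, and the four Penrose conditions as well as A A† = U_1 U_1^T can be verified directly.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 6, 4, 2
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-deficient matrix

A_pinv = np.linalg.pinv(A)                      # Moore-Penrose pseudo-inverse via the SVD

# Four Penrose conditions
print(np.allclose(A @ A_pinv @ A, A))
print(np.allclose(A_pinv @ A @ A_pinv, A_pinv))
print(np.allclose((A @ A_pinv).T, A @ A_pinv))
print(np.allclose((A_pinv @ A).T, A_pinv @ A))

# A A^+ = U1 U1^T and A^+ A = V1 V1^T
U, s, Vt = np.linalg.svd(A)
U1, V1 = U[:, :r], Vt[:r, :].T
print(np.allclose(A @ A_pinv, U1 @ U1.T))
print(np.allclose(A_pinv @ A, V1 @ V1.T))
```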


A.7 Orthogonal Projector

A.7.1 First Definitions

• A matrix P ∈ R^{n×n} is the orthogonal projector onto range(P), i.e., the projection matrix of the orthogonal projection onto range(P), if and only if

  P² = P,   P^T = P.

• An orthogonal projection matrix is a symmetric oblique projection matrix.
• Given:

  – 𝒳 a subspace of R^{n×1} of dimension r and 𝒳^⊥ its orthogonal complement (of dimension n − r)
  – A full column rank matrix X ∈ R^{n×r} such that range(X) = 𝒳, i.e., X is a basis of 𝒳

we have:

  – The projection matrix P_𝒳 of the orthogonal projection onto 𝒳 is given by

    P_𝒳 = X (X^T X)^{-1} X^T = X X†.

  – The projection matrix P_𝒳^⊥ of the orthogonal projection onto 𝒳^⊥ is given by

    P_𝒳^⊥ = I_{n×n} − P_𝒳 = I_{n×n} − X (X^T X)^{-1} X^T = I_{n×n} − X X†.

• Given a rank r matrix X ∈ R^{m×n} with r ≤ min(m, n), we have

  P_{range(X)} = P^⊥_{null(X^T)} = X X†,
  P^⊥_{range(X)} = P_{null(X^T)} = I_{m×m} − X X†,
  P_{range(X^T)} = P^⊥_{null(X)} = X† X,
  P^⊥_{range(X^T)} = P_{null(X)} = I_{n×n} − X† X,

as the sketch below illustrates.
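The sketch below (assumed example data) builds P = X (X^T X)^{-1} X^T for a random full-column-rank X and checks idempotence, symmetry, the pseudo-inverse form X X†, and the complementary projector:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 7, 3
X = rng.standard_normal((n, r))                     # full column rank basis of the subspace

P = X @ np.linalg.solve(X.T @ X, X.T)               # P = X (X^T X)^{-1} X^T
P_perp = np.eye(n) - P                              # projector onto the orthogonal complement

print(np.allclose(P @ P, P))                        # idempotent
print(np.allclose(P.T, P))                          # symmetric
print(np.allclose(P, X @ np.linalg.pinv(X)))        # P = X X^+
print(np.allclose(P_perp @ X, np.zeros((n, r))))    # the complement annihilates range(X)
```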


A.7.2 Orthogonal Projector and Singular Value Decomposition

• Given a rank r matrix X ∈ R^{m×n} with r ≤ min(m, n), and recalling that, for a rank r matrix X ∈ R^{m×n},

  X = [U_1  U_2] [ Σ_1  0 ;  0  0 ] [ V_1^T ;  V_2^T ],

where U_1 ∈ R^{m×r}, U_2 ∈ R^{m×(m−r)}, Σ_1 ∈ R^{r×r}, V_1 ∈ R^{n×r}, and V_2 ∈ R^{n×(n−r)}, we have

  P_{range(X)} = U_1 U_1^T,   P^⊥_{range(X)} = U_2 U_2^T,
  P_{range(X^T)} = V_1 V_1^T,   P^⊥_{range(X^T)} = V_2 V_2^T,

as the sketch below illustrates.
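Continuing the previous sketch (an illustrative check, assuming the rank r is known), the same projectors can be obtained from the singular vectors:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, r = 6, 5, 2
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # rank-r matrix

U, s, Vt = np.linalg.svd(X)
U1, U2 = U[:, :r], U[:, r:]
V1, V2 = Vt[:r, :].T, Vt[r:, :].T

print(np.allclose(U1 @ U1.T, X @ np.linalg.pinv(X)))   # P_range(X)   = U1 U1^T = X X^+
print(np.allclose(V1 @ V1.T, np.linalg.pinv(X) @ X))   # P_range(X^T) = V1 V1^T = X^+ X
print(np.allclose(U1 @ U1.T + U2 @ U2.T, np.eye(m)))   # complementary projectors sum to I
```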

References

S. Boyd, L. Vandenberghe, Introduction to Applied Linear Algebra (Cambridge University Press, Cambridge, 2018)
G. Golub, C. Van Loan, Matrix Computations (Johns Hopkins University Press, Baltimore, 2013)
C. Meyer, Matrix Analysis and Applied Linear Algebra (SIAM, Philadelphia, 2000)
X. Zhang, Matrix Analysis and Applications (Cambridge University Press, Cambridge, 2017)

Appendix B

Random Variables and Vectors

The results gathered herein are mainly extracted from Papoulis (2000); Dudley (2004); Kay (2006); Leon-Garcia (2008). Please read these documents for proofs and details. This appendix focuses on random variables only. As shown, e.g., in Papoulis (2000), these notions are strongly linked to probability theory. Because the main concepts of probability theory are just skimmed over herein, it is recommended to study, e.g., Papoulis (2000, Chapters 1–2), if the reader is not familiar with the meaning or the main axioms of probability.

B.1 Probability Space and Random Variable

B.1.1 Probability Space

• A probability space is a triple (Ω, F, Pr), where:

  – Ω is a nonempty set called the sample space.
  – F is a σ-algebra of subsets of Ω, i.e., a family of subsets of Ω including Ω itself, and closed under complement and countable unions, i.e.:
    * Ω is in F.
    * If A is in F, then so is the complement of A.
    * If A_n is a sequence of elements of F, then the union of the A_n is in F.
  – Pr is a probability measure defined for all members of F, i.e., Pr is a function Pr : F → [0, 1] such that:
    * Pr(A) ≥ 0 for all A ∈ F.
    * Pr(Ω) = 1.
    * Pr(∪_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} Pr(A_i) for all A_i ∈ F such that A_i ∩ A_j = ∅ for i ≠ j.


B.1.2 Random Variable

• Given a probability space (Ω, F, Pr), a random variable (r.v.) x(ω) can be viewed as a function mapping each sample point ω of the sample space Ω (the domain) to real numbers (the range).
• Given a probability space (Ω, F, Pr), a real r.v. x is a real function, the domain of which is Ω, such that:

  – The set {ω ∈ Ω : x(ω) ≤ x} is an event, i.e., is in F, for any real number x.
  – Pr(x = ±∞) = 0.

• Given a probability space (Ω, F, Pr) and a real r.v. x, once the event ω has occurred, the resulting real number x(ω) is no longer random and is called a realization of the r.v. x.
• Random variables can be discrete or continuous.
• This distinction leads to different descriptions of the same notions.
• A r.v. is called a discrete r.v. if the range of the function x(ω) consists of isolated points on the real line (a finite or countably infinite number of values).
• A r.v. is called a continuous r.v. if its range is a continuum.
• In this reminder, we focus on real-valued continuous r.v. only.

B.2 Univariate Random Variable

B.2.1 Cumulative Distribution Function

• A r.v. x can be entirely characterized by the way its samples are distributed.


• Given a sequence (x_1, ..., x_M) of M independent and identically distributed realizations of the r.v. x, the empirical cumulative histogram converges toward the continuous cumulative distribution function (cdf) F_x(x) of x almost surely when M tends to infinity (see the sketch at the end of this subsection).
• Given a probability space (Ω, F, Pr), the cdf of a r.v. x is defined as follows:¹

  F_x(x) = Pr(x ≤ x),  −∞ ≤ x ≤ ∞,

i.e., the probability that the r.v. x takes a value between −∞ and x.
• Given a probability space (Ω, F, Pr) and a r.v. x, we have:

  P1: 0 ≤ F_x(x) ≤ 1 ∀ x ∈ R.
  P2: ∀ x < y, Pr(x < x ≤ y) = F_x(y) − F_x(x).
  P3: lim_{x→+∞} F_x(x) = 1 and lim_{x→−∞} F_x(x) = 0.
  P4: Pr(x > x) = 1 − F_x(x), x ∈ R.
  P5: F_x(x) is a monotone non-decreasing function of x, i.e., F_x(x) ≤ F_x(y) ∀ x < y.
  P6: F_x(x) is continuous from the right.

• A continuous r.v. x:

  – Has a continuous distribution function F_x(x).
  – Never takes an exact prescribed value, i.e., Pr(x = x) = 0 for all x ∈ R.
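The following sketch (an assumed illustration, not from the book) draws M i.i.d. standard normal samples and compares the empirical cdf with the true cdf; the maximum deviation shrinks as M grows:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
for M in (100, 10_000):
    x = np.sort(rng.standard_normal(M))          # M i.i.d. realizations of x ~ N(0, 1)
    ecdf = np.arange(1, M + 1) / M               # empirical cdf evaluated at the sorted samples
    gap = np.max(np.abs(ecdf - norm.cdf(x)))     # worst-case deviation from the true cdf
    print(f"M = {M:6d}:  max |F_M - F_x| = {gap:.4f}")
```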

B.2.2 Probability Density Function

• Given a probability space (Ω, F, Pr) and a r.v. x, if the cdf F_x(x) of x is sufficiently smooth, the probability density function (pdf) p_x(x) of x can be defined as follows:

  p_x(x) = dF_x(x)/dx.

• If the cdf F_x(x) of a r.v. x is a continuous function differentiable almost everywhere (i.e., with a countable number of points at which it is not differentiable), then x is a continuous r.v. and its pdf p_x(x) exists almost everywhere.
• Given a sequence (x_1, ..., x_M) of M independent and identically distributed realizations of the r.v. x, the empirical histogram converges toward the continuous probability density function p_x(x) of x almost surely when M tends to infinity.

¹ Theoretically, we should write F_x(x) = Pr({ω : x(ω) ≤ x}). Probabilities are indeed assigned to events, i.e., sets of outcomes. Thus, the probability that the r.v. x is less than x requires the implicit definition {ω ∈ Ω : x(ω) ≤ x}.


• Given a continuous r.v. x, its pdf p_x(x) satisfies

  p_x(x) ≥ 0 ∀ x ∈ R,
  ∫_{−∞}^{∞} p_x(x) dx = 1,
  ∫_{−∞}^{x} p_x(y) dy = Pr(x ≤ x) = F_x(x),
  ∫_{x}^{y} p_x(u) du = Pr(x ≤ x ≤ y) = F_x(y) − F_x(x).

B.2.3 Cumulative Distribution Function and Quantile

• The probability Pr(x ≤ x) = F_x(x) of a r.v. x can be represented as the area between the pdf p_x(x) and the x-axis on the interval ]−∞, x].
• The smallest abscissa x = x_α satisfying

  Pr(x ≤ x_α) = F_x(x_α) = α

is called the quantile of order α.
• The quantile of order α is thus the inverse of the function α = F_x(x_α).
• The median of a r.v. x is the quantile of order 0.5.


B.2.4 Uniform Random Variable

• A continuous uniform r.v. is used to model a scenario where a continuous r.v. can take values that are equally distributed (with equal probability) in an interval.
• If the pdf of a r.v. x is a rectangular pulse, i.e.,

  p_x(x) = 1/(b − a) for a ≤ x ≤ b,
  p_x(x) = 0 elsewhere,

then x is uniformly distributed in the interval [a, b].
• The cdf of this r.v. x satisfies

  F_x(x) = 0 for x < a,
  F_x(x) = (x − a)/(b − a) for a ≤ x ≤ b,
  F_x(x) = 1 for x > b.

B.2.5 Normal Random Variable

• If the pdf of a r.v. x is a Gaussian curve, i.e.,

  p_x(x) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)},  μ ∈ R and σ ∈ R*,

then x is normally distributed with parameters μ and σ.
• The cdf of this normal r.v. x satisfies

  F_x(x) = (1/√(2πσ²)) ∫_{−∞}^{x} e^{−(y−μ)²/(2σ²)} dy = g((x − μ)/σ),

with

  g(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−y²/2} dy,

as checked numerically in the sketch below.
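A short numerical check of this standardization (illustrative parameter values μ = 2 and σ = 3, not from the book): the cdf of N(μ, σ²) evaluated at x equals the standard normal cdf g evaluated at (x − μ)/σ.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 2.0, 3.0                      # assumed example parameters
x = np.linspace(-10, 10, 7)

F_x = norm.cdf(x, loc=mu, scale=sigma)    # cdf of N(mu, sigma^2)
g = norm.cdf((x - mu) / sigma)            # standard normal cdf g((x - mu)/sigma)

print(np.allclose(F_x, g))                # True: F_x(x) = g((x - mu)/sigma)
```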


B.2.6 Student's Random Variable

• If the pdf of a r.v. x satisfies

  p_x(x) = (Γ((k+1)/2) / (Γ(k/2) √(πk))) (1 + x²/k)^{−(k+1)/2},

where Γ(1) = 1, Γ(1/2) = √π, and, for n ∈ N with n ≥ 2,

  Γ(n/2) = (n/2 − 1)! for n even,
  Γ(n/2) = (n/2 − 1)(n/2 − 2) ··· (3/2)(1/2) √π for n odd,

then x is said to be distributed according to the Student's t-distribution with k degrees of freedom.
• The cdf of this r.v. x can be written in terms of the regularized incomplete beta function, i.e., for x ≥ 0,

  F_x(x) = 1 − (1/2) I_{k/(x²+k)}(k/2, 1/2),

with

  I_x(a, b) = B(x; a, b) / B(a, b),

where

  B(a, b) = ∫_{0}^{1} t^{a−1} (1 − t)^{b−1} dt,
  B(x; a, b) = ∫_{0}^{x} t^{a−1} (1 − t)^{b−1} dt.

This expression is compared with a library implementation in the sketch below.
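The sketch below (assumed example with k = 5 degrees of freedom) compares this beta-function expression, valid for x ≥ 0, with scipy.stats.t.cdf; scipy.special.betainc is the regularized incomplete beta function I_x(a, b).

```python
import numpy as np
from scipy.special import betainc
from scipy.stats import t

k = 5                                   # assumed degrees of freedom
x = np.linspace(0.0, 4.0, 9)            # the closed-form expression below is for x >= 0

F_beta = 1.0 - 0.5 * betainc(k / 2, 0.5, k / (x**2 + k))   # 1 - (1/2) I_{k/(x^2+k)}(k/2, 1/2)
F_ref = t.cdf(x, df=k)                  # reference Student's t cdf

print(np.allclose(F_beta, F_ref))       # True
```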


B.2.7 Chi-squared Random Variable

• If the pdf of a r.v. x satisfies

  p_x(x) = x^{k/2−1} e^{−x/2} / (2^{k/2} Γ(k/2)) if x > 0,
  p_x(x) = 0 otherwise,

where Γ(1) = 1, Γ(1/2) = √π, and, for n ∈ N with n ≥ 2,

  Γ(n/2) = (n/2 − 1)! for n even,
  Γ(n/2) = (n/2 − 1)(n/2 − 2) ··· (3/2)(1/2) √π for n odd,

then x is said to be distributed according to the chi-squared distribution with k degrees of freedom.
• The cdf of this r.v. x can be written in terms of the lower incomplete gamma function, i.e.,

  F_x(x) = γ(k/2, x/2) / Γ(k/2),

with

  γ(a, b) = ∫_{0}^{b} t^{a−1} e^{−t} dt.

A numerical check is sketched below.
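As a check (assumed example with k = 4 degrees of freedom), scipy.special.gammainc implements the regularized lower incomplete gamma function γ(a, b)/Γ(a), so the cdf above is gammainc(k/2, x/2); it matches scipy.stats.chi2.cdf.

```python
import numpy as np
from scipy.special import gammainc
from scipy.stats import chi2

k = 4                                    # assumed degrees of freedom
x = np.linspace(0.0, 15.0, 10)

F_gamma = gammainc(k / 2, x / 2)         # gamma(k/2, x/2) / Gamma(k/2)
F_ref = chi2.cdf(x, df=k)                # reference chi-squared cdf

print(np.allclose(F_gamma, F_ref))       # True
```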

B.2.8 Moments

• A r.v. is often characterized by a small number of parameters, which also have a practical interpretation.
• The moment of order k of a r.v. x is defined as

  E_x^k{x} = ∫_{−∞}^{+∞} x^k p_x(x) dx.

• The moment of order 1, E_x^1{x}, of a r.v. x is named its expected value or mean and denoted by E_x{x}, μ_x, or μ.


• The definition of moments can be extended to a function f of a r.v. x straightforwardly.
• The statistical mean or expectation of a function f(x) of a r.v. x is defined as

  E_x{f(x)} = E_x^1{f(x)} = ∫_{−∞}^{+∞} f(x) p_x(x) dx.

• For instance, with f(x) = ax + b, a ∈ R and b ∈ R,

  E_x{f(x)} = a E_x{x} + b.

• The central moment of order k of a r.v. x is defined as

  Ē_x^k{x} = ∫_{−∞}^{+∞} (x − E{x})^k p_x(x) dx.

• The variance or dispersion σ_x² = σ², i.e., the spread of the distribution w.r.t. the mean value, is the central moment of order 2 of a r.v. x, i.e.,

  σ_x² = Ē_x^2{x} = E_x{(x − μ_x)²} = E_x{x²} − μ_x²,

as the sketch below illustrates for a specific pdf.

B.3 Multivariate Random Variable

B.3.1 Basic Idea

• Our study of r.v. has been restricted to one-dimensional sample spaces so far.
• We have only recorded outcomes of an experiment as values assumed by a single random variable.
• Given a probability space (Ω, F, Pr), we may want to record the simultaneous outcomes of several r.v. defined on the same experiment, giving rise to a multidimensional sample space.

B.3.2 2D Joint Distribution Function

• Given a probability space (Ω, F, Pr), consider two r.v. x and y with respective (or marginal) cumulative distribution functions

  F_x(x) = Pr(x ≤ x),   F_y(y) = Pr(y ≤ y).


• Given a probability space (Ω, F, Pr), the joint cdf F_{x,y}(x, y) of x and y (i.e., the probability distribution for their simultaneous occurrence) is defined as follows:

  F_{x,y}(x, y) = Pr(x ≤ x, y ≤ y) = Pr({ω : x(ω) ≤ x and y(ω) ≤ y}).

• Given a probability space (Ω, F, Pr), the joint cdf F_{x,y}(x, y) of two r.v. x and y is the probability assigned to the set of all points ω that can be associated with regions of the two-dimensional Euclidean space.
• Given a probability space (Ω, F, Pr), the joint cdf F_{x,y}(x, y) of two r.v. x and y satisfies:

  P1: 1 ≥ F_{x,y}(x, y) ≥ 0 for x ∈ R, y ∈ R.
  P2: lim_{y→−∞} F_{x,y}(x, y) = 0 for x ∈ R.
  P3: lim_{x→−∞} F_{x,y}(x, y) = 0 for y ∈ R.
  P4: lim_{x→∞, y→∞} F_{x,y}(x, y) = 1.
  P5: If b > a and d > c, F_{x,y}(b, d) ≥ F_{x,y}(b, c) ≥ F_{x,y}(a, c).
  P6: Pr(x > x, y > y) = 1 − F_x(x) − F_y(y) + F_{x,y}(x, y).
  P7: lim_{y→∞} F_{x,y}(x, y) = F_x(x) for x ∈ R.
  P8: lim_{x→∞} F_{x,y}(x, y) = F_y(y) for y ∈ R.

B.3.3 2D Joint Probability Density Function

• Given a probability space (Ω, F, Pr) and two r.v. x and y characterized by a joint cdf F_{x,y}(x, y), the joint probability density function of x and y can be defined as follows:

  p_{x,y}(x, y) = ∂²F_{x,y}(x, y) / (∂x ∂y),

if the derivatives exist, or, by duality,

  F_{x,y}(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} p_{x,y}(α, β) dα dβ,


whereas

  ∂F_{x,y}(x, y)/∂x = ∫_{−∞}^{y} p_{x,y}(x, β) dβ,
  ∂F_{x,y}(x, y)/∂y = ∫_{−∞}^{x} p_{x,y}(α, y) dα.

• Given two r.v. x and y characterized by a joint pdf p_{x,y}(x, y), we have

  p_{x,y}(x, y) ≥ 0 ∀ x ∈ R and y ∈ R,
  ∫_{−∞}^{∞} ∫_{−∞}^{∞} p_{x,y}(x, y) dx dy = 1,

whereas

  Pr((x, y) ∈ D) = ∫∫_D p_{x,y}(x, y) dx dy,

for any region D in the (x, y) plane.

B.3.4 Marginal Distribution and Density Functions

• Given two r.v. x and y characterized by a joint cdf F_{x,y}(x, y) or a joint pdf p_{x,y}(x, y):

  – The marginal distribution functions satisfy

    F_x(x) = F_{x,y}(x, ∞),   F_y(y) = F_{x,y}(∞, y).

  – The marginal probability density functions satisfy

    p_x(x) = ∫_{−∞}^{∞} p_{x,y}(x, y) dy,   p_y(y) = ∫_{−∞}^{∞} p_{x,y}(x, y) dx.


B.3.5 Statistical Independence

• Given two r.v. x and y characterized by a joint cdf F_{x,y}(x, y) or a joint pdf p_{x,y}(x, y):

  – x and y are independent if and only if

    F_{x,y}(x, y) = F_x(x) F_y(y),

  or, equivalently, if and only if

    p_{x,y}(x, y) = p_x(x) p_y(y).

• If the r.v. x and y are independent and share the same marginal cdf (or pdf), i.e., if F_x(x) = F_y(y) (or p_x(x) = p_y(y)), then they are called independently and identically distributed (i.i.d.).
• For any functions g(•) and h(•), g(x) and h(y) are independent if x and y are independent.

B.3.6 Generalized Mean and Moments

• Given two r.v. x and y characterized by a joint pdf p_{x,y}(x, y) and:

  – A function f(x, y) of the r.v. x and y, then the joint mean of f(x, y) satisfies

    E_{x,y}{f(x, y)} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) p_{x,y}(x, y) dx dy.

  – A function g(x) of the random variable x, then the joint mean of g(x) satisfies

    E_{x,y}{g(x)} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x) p_{x,y}(x, y) dx dy = E_x{g(x)}.

  – For instance,

    E_{x,y}{x} = E_x{x} = μ_x,
    E_{x,y}{y} = E_y{y} = μ_y,
    E_{x,y}{(x − μ_x)²} = E_x{(x − μ_x)²} = σ_x².


B.3.7 Covariance

• Given two r.v. x and y characterized by a joint pdf p_{x,y}(x, y), the covariance of x and y is

  σ_{xy} = E_{x,y}{(x − μ_x)(y − μ_y)} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − μ_x)(y − μ_y) p_{x,y}(x, y) dx dy.

B.3.8 Correlation Coefficient

• Given two r.v. x and y characterized by a joint pdf p_{x,y}(x, y), the correlation coefficient of x and y is defined as

  ρ_{xy} = σ_{xy} / (σ_x σ_y)

(see the estimation sketch after this list).
• If ρ_{xy} = 1, then x and y are perfectly, positively, linearly correlated.
• If ρ_{xy} = −1, then x and y are perfectly, negatively, linearly correlated.
• If ρ_{xy} > 0, then x and y are positively, linearly correlated, but not perfectly so.
• If ρ_{xy} < 0, then x and y are negatively, linearly correlated, but not perfectly so.
• If ρ_{xy} = 0, then σ_{xy} = 0 and E_{x,y}{xy} = μ_x μ_y, and x and y are called uncorrelated.
• If ρ_{xy} = 0 (or σ_{xy} = 0), then x and y are completely not linearly correlated. That is, x and y may be perfectly correlated in some other manner, in a parabolic manner, perhaps, but not in a linear manner.
• When x and y are statistically independent, the covariance σ_{xy} is zero and thus the r.v. are uncorrelated.
• The converse, however, is not generally true.
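The sketch below (an assumed illustration) draws correlated Gaussian pairs with a known covariance and compares the sample covariance and correlation coefficient, computed with numpy.cov and numpy.corrcoef, with their theoretical values.

```python
import numpy as np

rng = np.random.default_rng(7)
M = 100_000
cov_true = np.array([[2.0, 0.8],        # Var(x), Cov(x, y)
                     [0.8, 1.0]])       # Cov(y, x), Var(y)
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=M)
x, y = samples[:, 0], samples[:, 1]

sigma_xy = np.cov(x, y)[0, 1]           # sample covariance
rho_xy = np.corrcoef(x, y)[0, 1]        # sample correlation coefficient

print(sigma_xy, 0.8)                    # close to the true covariance
print(rho_xy, 0.8 / np.sqrt(2.0 * 1.0)) # close to the true correlation coefficient
```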

B.3.9 Correlation

• Given two r.v. x and y characterized by a joint pdf p_{x,y}(x, y):

  – The correlation of x and y is

    π_{xy} = E_{x,y}{xy} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x y p_{x,y}(x, y) dx dy.


  – x and y are orthogonal if and only if

    π_{xy} = 0.

• It can be proved that

  π_{xy} = σ_{xy} + μ_x μ_y.

B.3.10 n_x-D Joint Cumulative Distribution and Density Function

• The former definitions written for 2 r.v. x and y can be extended to a sequence of n_x r.v. straightforwardly, i.e.,

  (x_1, ..., x_{n_x}).

• Given a probability space (Ω, F, Pr) and n_x r.v. (x_1, ..., x_{n_x}) with respective (or marginal) cumulative distribution functions

  F_{x_i}(x_i) = Pr(x_i ≤ x_i),  i ∈ {1, ..., n_x}:

  – The joint cdf (i.e., the probability distribution for their simultaneous occurrence) of (x_1, ..., x_{n_x}) is defined as follows:

    F_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}) = Pr(x_1 ≤ x_1, ..., x_{n_x} ≤ x_{n_x}).

  – The joint pdf of (x_1, ..., x_{n_x}) can be defined as follows:

    p_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}) = ∂^{n_x} F_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}) / (∂x_1 ··· ∂x_{n_x}).

B.3.11 n_x-D Marginal Distribution and Density Functions

• Given n_x r.v. (x_1, ..., x_{n_x}) characterized by a joint cdf F_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}) or a joint pdf p_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}):


  – The marginal distribution functions are obtained by substituting ∞ for certain variables in F_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}), e.g., with n_x = 4,

    F_{x_1,x_3}(x_1, x_3) = F_{x_1,x_2,x_3,x_4}(x_1, ∞, x_3, ∞).

  – The marginal probability density functions are obtained by integrating out certain variables of p_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}), e.g., with n_x = 4,

    p_{x_1,x_3}(x_1, x_3) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} p_{x_1,x_2,x_3,x_4}(x_1, x_2, x_3, x_4) dx_2 dx_4.

B.3.12 Independence

• Given n_x r.v. (x_1, ..., x_{n_x}) characterized by a joint cdf F_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}) or a joint pdf p_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}):

  – These r.v. are independent if and only if

    F_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}) = F_{x_1}(x_1) ··· F_{x_{n_x}}(x_{n_x}),

  or, equivalently,

    p_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}) = p_{x_1}(x_1) ··· p_{x_{n_x}}(x_{n_x}).

  – These r.v. are i.i.d. if they are independent and

    F_{x_1}(x_1) = ··· = F_{x_{n_x}}(x_{n_x}),

  or

    p_{x_1}(x_1) = ··· = p_{x_{n_x}}(x_{n_x}).

B.3.13 Uncorrelatedness and Orthogonality

• Given n_x r.v. (x_1, ..., x_{n_x}) characterized by a joint cdf F_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}) or a joint pdf p_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}):

  – These r.v. are uncorrelated if and only if, for {i, j} ∈ {1, ..., n_x}², i ≠ j,

    E_{x_1,...,x_{n_x}}{x_i x_j} = E_{x_i,x_j}{x_i x_j} = E_{x_i}{x_i} E_{x_j}{x_j}.


  – These r.v. are orthogonal if and only if, for {i, j} ∈ {1, ..., n_x}², i ≠ j,

    E_{x_1,...,x_{n_x}}{x_i x_j} = E_{x_i,x_j}{x_i x_j} = 0.

B.3.14 Random Vector

• Instead of handling only random variable sequences, it is often convenient to resort to vectors of r.v., i.e.,

  x = [x_1, ..., x_{n_x}]^T.

• The joint cdf and pdf can be compactly written as follows:

  F_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}) = F_x(x),
  p_{x_1,...,x_{n_x}}(x_1, ..., x_{n_x}) = p_x(x).

• Statistical mean, correlation, and covariance information can be gathered into matrices by considering random vectors instead of sequences of r.v.

B.3.15 Mean Vector, Covariance, and Correlation Matrices

• Given two r.v. x_1 and x_2 characterized by a joint pdf p_{x_1,x_2}(x_1, x_2), we can define the random vector x ∈ R^{n_x×1}, n_x = 2, as follows:

  x = [x_1, x_2]^T.

• The mean μ_x ∈ R^{n_x×1} of x satisfies

  μ_x = E_x{x} = [E_{x_1}{x_1}, E_{x_2}{x_2}]^T.

• The covariance matrix Σ_x ∈ R^{n_x×n_x} of x satisfies

  Σ_x = E_x{(x − μ_x)(x − μ_x)^T} = [ σ_{x_1}²  σ_{x_1x_2} ;  σ_{x_2x_1}  σ_{x_2}² ],


with σ_{x_1x_2} = σ_{x_2x_1}.
• The correlation matrix Π_x ∈ R^{n_x×n_x} of x satisfies

  Π_x = E_x{x x^T} = [ π_{x_1x_1}  π_{x_1x_2} ;  π_{x_2x_1}  π_{x_2x_2} ],

with π_{x_1x_2} = π_{x_2x_1}.
• The former definitions and analysis can be extended to any random vector x ∈ R^{n_x×1}, n_x > 2, straightforwardly.
• Given a random vector x ∈ R^{n_x×1}, we have

  μ_x = E_x{x} ∈ R^{n_x×1},
  Π_x = E_x{x x^T} ∈ R^{n_x×n_x},
  Σ_x = E_x{(x − μ_x)(x − μ_x)^T} = Π_x − μ_x μ_x^T ∈ R^{n_x×n_x},

and these quantities can be estimated from samples as in the sketch below.
• The correlation and covariance matrices are both symmetric and positive (semi)definite.
• Given a random vector x ∈ R^{n_x}, its components are:

  – Uncorrelated if and only if Σ_x is diagonal
  – Orthogonal if and only if Π_x is diagonal
  – Uncorrelated and orthogonal if and only if μ_x = 0_{n_x×1} and Σ_x = Π_x is diagonal.
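The sketch below (an assumed example) estimates the mean vector, the correlation matrix E{x x^T}, and the covariance matrix from many realizations of a two-dimensional random vector, and checks the relation Σ_x = E{x x^T} − μ_x μ_x^T against numpy.cov.

```python
import numpy as np

rng = np.random.default_rng(8)
M = 200_000
mu_true = np.array([1.0, -2.0])
cov_true = np.array([[2.0, 0.5],
                     [0.5, 1.0]])
X = rng.multivariate_normal(mu_true, cov_true, size=M)   # M realizations, one per row

mu_hat = X.mean(axis=0)                        # sample mean vector
Pi_hat = (X.T @ X) / M                         # sample correlation matrix E{x x^T}
Sigma_hat = Pi_hat - np.outer(mu_hat, mu_hat)  # covariance = correlation - mu mu^T

print(mu_hat)                                              # close to mu_true
print(np.allclose(Sigma_hat, np.cov(X.T, bias=True)))      # same as the sample covariance
```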

B.3.16 Pay Attention to the Definition!!!

• Given two r.v. x_1 and x_2 characterized by a joint pdf p_{x_1,x_2}(x_1, x_2), we have defined:

  – The covariance matrix

    Σ_{x_1,x_2} = [ σ_{x_1}²  σ_{x_1x_2} ;  σ_{x_2x_1}  σ_{x_2}² ]

  – The correlation matrix

    Π_{x_1,x_2} = [ π_{x_1x_1}  π_{x_1x_2} ;  π_{x_2x_1}  π_{x_2x_2} ]


  – The correlation coefficient

    ρ_{x_1x_2} = σ_{x_1x_2} / (σ_{x_1} σ_{x_2})

• In many books, the correlation matrix is defined by involving the correlation coefficients ρ_{xy} instead of the correlations π_{xy}, i.e., given two r.v. x_1 and x_2 characterized by a joint pdf p_{x_1,x_2}(x_1, x_2), it is defined as

  [ ρ_{x_1x_1}  ρ_{x_1x_2} ;  ρ_{x_2x_1}  ρ_{x_2x_2} ] = [ 1  ρ_{x_1x_2} ;  ρ_{x_2x_1}  1 ].

• It is essential to pay attention to the notations and definitions if we want to avoid misunderstanding.

B.3.17 White Random Vector

• Given a random vector x ∈ R^{n_x×1}, its components form a zero mean white sequence if and only if, for {i, j} ∈ {1, ..., n_x}², i ≠ j,

  E_x{x_i} = E_{x_i}{x_i} = 0,
  E_x{x_i²} = E_{x_i}{x_i²} = σ² < ∞,
  E_x{x_i x_j} = E_{x_i,x_j}{x_i x_j} = 0.

• Said differently,

  μ_x = 0_{n_x×1},   Σ_x = Π_x = σ² I_{n_x×n_x}.

B.4 Sum of Random Variables

B.4.1 Sample Mean and Variance

• In many practical cases, it is essential to determine the mean and/or the variance of a r.v. x from samples selected randomly and independently.


• Given the realizations (x_1, x_2, ..., x_M), we often determine:

  – The sample or empirical mean of x as follows:

    m_x = (1/M) Σ_{i=1}^{M} x_i

  – The sample or empirical variance of x as follows:

    s_x² = (1/M) Σ_{i=1}^{M} (x_i − m_x)².

• We may wonder if these empirical values are good estimates of the statistical mean and variance of x.

B.4.2 Sample vs. Expected Values

• In the sequel:

  – Each component x_i, i ∈ {1, ..., M}, of the sequence (x_1, x_2, ..., x_M) is assumed to be a realization of an associated r.v. x_i, i ∈ {1, ..., M}.
  – The r.v. (x_1, ..., x_M) composing the sums (which are r.v. now!!!)

    m_x = (1/M) Σ_{i=1}^{M} x_i,   s_x² = (1/M) Σ_{i=1}^{M} (x_i − m_x)²,

  are assumed to be independently and identically distributed (i.i.d.), i.e., are independent, and share the same pdf.

• Given M i.i.d. r.v. (x_1, ..., x_M) having the same finite mean value and finite variance as x, i.e., for i ∈ {1, ..., M},

  E_x{x} = μ = E_{x_i}{x_i},   E_x{(x − μ)²} = σ² = E_{x_i}{(x_i − μ)²},


then we have

  μ_{m_x} = E_{x_1,...,x_M}{m_x} = μ,   σ_{m_x}² = E_{x_1,...,x_M}{(m_x − μ)²} = σ²/M.

• The estimator m_x is unbiased and consistent.
• Given M i.i.d. r.v. (x_1, ..., x_M) having the same finite mean value and finite variance as x, i.e., for i ∈ {1, ..., M},

  E_x{x} = μ = E_{x_i}{x_i},   E_x{(x − μ)²} = σ² = E_{x_i}{(x_i − μ)²},

then we have

  E_{x_1,...,x_M}{s_x²} = ((M − 1)/M) σ².

• Thus E_{x_1,...,x_M}{s_x²} ≠ σ², i.e., s_x² is a biased estimator of σ² for finite values of M, whereas

  E_{x_1,...,x_M}{ (1/(M − 1)) Σ_{i=1}^{M} (x_i − m_x)² } = σ².

• The estimator (1/(M − 1)) Σ_{i=1}^{M} (x_i − m_x)² is unbiased, as the Monte Carlo sketch below illustrates.
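A quick Monte Carlo illustration (assumed setup: standard normal samples, many repetitions) of the bias factor (M − 1)/M of s_x² and of the unbiasedness of the 1/(M − 1) version, here computed with numpy.var and its ddof argument:

```python
import numpy as np

rng = np.random.default_rng(9)
M, n_rep = 10, 100_000                      # small sample size, many repetitions
X = rng.standard_normal((n_rep, M))         # each row is one sequence (x_1, ..., x_M), sigma^2 = 1

s2_biased = X.var(axis=1, ddof=0)           # (1/M) sum (x_i - m_x)^2
s2_unbiased = X.var(axis=1, ddof=1)         # (1/(M-1)) sum (x_i - m_x)^2

print(s2_biased.mean())                     # close to (M - 1)/M = 0.9
print(s2_unbiased.mean())                   # close to sigma^2 = 1.0
```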

B.4.3 Central Limit Theorem

• Given M i.i.d. r.v. (x_1, ..., x_M) having the same finite mean value and finite variance as x, i.e., for i ∈ {1, ..., M},

  E_x{x} = μ = E_{x_i}{x_i},   E_{x_i}{(x_i − μ)²} = E_x{(x − μ)²} = σ² < ∞,


then, for any ε ∈ R, we have

  lim_{M→∞} Pr( (√M/σ)(m_x − μ) ≤ ε ) = (1/√(2π)) ∫_{−∞}^{ε} e^{−u²/2} du,

i.e., the cdf of √M (m_x − μ) converges (pointwise) to the cdf of the normal distribution N(0, σ), as illustrated in the sketch below.
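The sketch below (an assumed example: averaging M uniform random variables) compares the empirical cdf of the standardized sample mean with the standard normal cdf, in the spirit of the statement above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(10)
M, n_rep = 50, 200_000
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)              # mean and std of a U(0, 1) random variable

X = rng.uniform(0.0, 1.0, size=(n_rep, M))
z = np.sqrt(M) * (X.mean(axis=1) - mu) / sigma    # standardized sample means

z_sorted = np.sort(z)
ecdf = np.arange(1, n_rep + 1) / n_rep
print(np.max(np.abs(ecdf - norm.cdf(z_sorted))))  # small: close to the N(0, 1) cdf
```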

B.4.4 Sample Mean with Gaussian Assumption

• Given M i.i.d. r.v. (x_1, ..., x_M) having the same finite mean value and finite variance as x, i.e., for i ∈ {1, ..., M},

  E_x{x} = μ = E_{x_i}{x_i},   E_x{(x − μ)²} = σ² = E_{x_i}{(x_i − μ)²},

and assuming that x ∼ N(μ, σ²), then

  (m_x − μ) / (σ/√M)

is Gaussian distributed with zero mean and unit variance.
• Given M i.i.d. r.v. (x_1, ..., x_M) having the same finite mean value and finite variance as x, i.e., for i ∈ {1, ..., M},

  E_x{x} = μ = E_{x_i}{x_i},   E_x{(x − μ)²} = σ² = E_{x_i}{(x_i − μ)²},

and assuming that x ∼ N(μ, σ²), then

  t_{M−1} = (m_x − μ) / (s_x/√M)

has a Student's t-distribution with M − 1 degrees of freedom.

B.4.5 Laws of Large Numbers

• Both laws of large numbers describe the result of performing the same experiment a large number of times.


• They both prove that the average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed.
• Roughly speaking, given M i.i.d. r.v. (x_1, ..., x_M) having the same mean value as x, i.e., for i ∈ {1, ..., M},

  E_x{x} = μ = E_{x_i}{x_i},

both laws of large numbers claim that m_x converges to μ as M tends toward ∞.

B.4.6 Weak Law of Large Numbers

• Given M i.i.d. r.v. (x_1, ..., x_M) having the same finite mean value as x and finite variances, i.e., for i ∈ {1, ..., M},

  E_x{x} = μ = E_{x_i}{x_i},   E_{x_i}{(x_i − μ)²} = σ_i² < ∞,

then, for any ε > 0, we have

  lim_{M→∞} Pr(|m_x − μ| < ε) = 1,

or

  lim_{M→∞} Pr(|m_x − μ| ≥ ε) = 0,

i.e., m_x converges to μ in probability.

B.4.7 Strong Law of Large Numbers

• Given M i.i.d. r.v. (x_1, ..., x_M) having the same mean value as x, i.e., for i ∈ {1, ..., M},

  E_x{x} = μ = E_{x_i}{x_i},

then we have

  Pr( lim_{M→∞} m_x = μ ) = 1,


i.e., m_x converges to μ almost surely if E_{x_i}{x_i²} < ∞, i ∈ {1, ..., M}.

References

R. Dudley, Real Analysis and Probability (Cambridge University Press, Cambridge, 2004)
S. Kay, Intuitive Probability and Random Processes Using MATLAB (Springer, Berlin, 2006)
H. Kobayashi, B. Mark, W. Turin, Probability, Random Processes and Statistical Analysis (Cambridge University Press, Cambridge, 2012)
A. Leon-Garcia, Probability, Statistics, and Random Processes for Electrical Engineering (Pearson, London, 2008)
A. Papoulis, Probability, Random Variables, and Stochastic Processes (McGraw-Hill Europe, 2000)

Appendix C

Data

As illustrated all along this textbook, one of the goals of this document is to show that efficient model learning solutions exist to determine accurate models of univariate time series. The efficiency of these methods is indeed exemplified by resorting to different real data sets I found (for free) on the Internet in 2020 when I started teaching this new lecture at Poitiers University. If you are interested in playing with the data sets used in the former chapters (and if you are too shy to contact me directly), you can download or generate them by using the following links:¹

• Data of the World Bank available via the url https://data.worldbank.org/indicator/SP.DYN.LE00.FE.IN?locations=FR for the female life expectancy in France, accessed 01 February 2023
• Data of the World Bank available via the url https://data.worldbank.org/indicator/SP.DYN.IMRT.IN?locations=FR for the infant mortality rate in France, accessed 01 February 2023
• Data of R-data statistics page available via the url https://r-data.pmagunia.com/dataset/r-dataset-package-datasets-usaccdeaths for the accidental deaths in the USA between 1973 and 1978, accessed 01 February 2023
• Data of R-data statistics page available via the url https://r-data.pmagunia.com/dataset/r-dataset-package-datasets-airpassengers for the monthly airline passenger numbers between 1949 and 1960, accessed 01 February 2023
• Data of Global Monitoring Laboratory available via the url https://gml.noaa.gov/ccgg/trends/data.html for the Mauna Loa CO2 records between 1959 and 1991, accessed 01 February 2023

1 These links were all valid in February 2023. I cannot guarantee that they are still functioning when you read these lines.


• Data of Infoclimat available via the url https://www.infoclimat.fr/climatologie/annee/2006/paris-montsouris/valeurs/07156.html for the monthly maximum temperatures in Paris between 2000 and 2012, accessed 01 February 2023
• Data of Appendix A of S. Bisgaard and M. Kulahci, Time Series Analysis and Forecasting by Example (Wiley, 2011), available via the url https://onlinelibrary.wiley.com/doi/pdf/10.1002/9781118056943.app1 for the ceramic furnace data, accessed 01 February 2023
• Data of the République Française available via the url https://www.data.gouv.fr/fr/datasets/r/6e5d5dff-3341-44bc-bbb3-5c2b6d7c6423 for the monthly electricity production in France between 1981 and 2013, accessed 01 February 2023